Help - Reading and writing data

From LibrAlign Documentation

LibrAlign's I/O is based on JPhyloIO and uses its functionality to support multiple file formats, which increases the compatibility between applications.

Overview of the architecture

The LibrAlign classes contained in libralign.io work as adapters between the model implementations of LibrAlign and the readers and writers of JPhyloIO. For reading alignments, implementations of JPhyloIOEventReader translate the content of a document into a stream of events, which can be processed by the I/O classes of LibrAlign. This way, the hierarchical data structure of a document is translated into a linear sequence of event objects. Each event corresponds to a certain grammar node in the JPhyloIO event grammar, which allows nesting sequences of metadata events in all data elements.
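
The translation of a hierarchical document into a linear event sequence can be illustrated with a small self-contained sketch. It does not use the JPhyloIO API: the types Node, Event and Topology as well as the method flatten() are hypothetical stand-ins that only mirror the idea of start, end and sole events described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for an event-based view of a hierarchical document (not the JPhyloIO API). */
class EventStreamSketch {
    enum Topology { START, END, SOLE }

    /** Hypothetical event: a content type name plus its topology. */
    record Event(String contentType, Topology topology) {}

    /** Hypothetical document node; nodes without children become SOLE events. */
    record Node(String contentType, List<Node> children) {}

    /** Depth-first traversal that emits START, the nested events and then END for each inner node. */
    static List<Event> flatten(Node node) {
        List<Event> events = new ArrayList<>();
        collect(node, events);
        return events;
    }

    private static void collect(Node node, List<Event> out) {
        if (node.children().isEmpty()) {
            out.add(new Event(node.contentType(), Topology.SOLE));
        }
        else {
            out.add(new Event(node.contentType(), Topology.START));
            for (Node child : node.children()) {
                collect(child, out);
            }
            out.add(new Event(node.contentType(), Topology.END));
        }
    }

    public static void main(String[] args) {
        Node tokens = new Node("SEQUENCE_TOKENS", List.of());
        Node sequence = new Node("SEQUENCE", List.of(tokens));
        Node alignment = new Node("ALIGNMENT", List.of(sequence));
        for (Event event : flatten(alignment)) {
            System.out.println(event.contentType() + " " + event.topology());
        }
    }
}
```

A consumer of such a stream never sees the tree itself. It only sees the events in document order and has to reconstruct the nesting from the start and end events.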

The following main components are always required for reading and writing in LibrAlign:

  • Data Reader
  • Data Model
  • Data Adapter
Figure 1: LibrAlign I/O data flow diagram providing an overview of the three main components AlignmentDataReader, AlignmentDataModel and AlignmentModelDataAdapter. The AlignmentDataReader processes the content of the event stream generated by an implementation of the interface JPhyloIOEventReader. The AlignmentModelDataAdapter accesses the AlignmentDataModel and works as an adapter between the AlignmentDataModel and the JPhyloIOEventWriters.


An important class is AlignmentDataReader, which reads the contents of AlignmentModels and DataModels from the event stream created by JPhyloIO. Each DataReader has knowledge of one specific application DataModel, which acts as a storage for the relevant information. Thus, DataReaders act as mediators between a specific DataModel and the sequences of events generated by JPhyloIO. For writing alignments, the LibrAlign I/O class AlignmentModelDataAdapter allows writing the contents of an implementation of AlignmentModel to an implementation of JPhyloIOEventWriter. DataAdapters allow JPhyloIOEventWriters to randomly access the application DataModel in different orders depending on the target format. This is done by using a set of adapter implementations of the application, each providing a subsequence of the entire event stream modelling the document.
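
The idea that a writer pulls subsequences of the event stream from adapters in the order its format requires can be sketched as follows. This is a simplified illustration and not the JPhyloIO adapter API: the interfaces Receiver and MatrixAdapter, the string-based events and both writer methods are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the adapter idea: writers pull event subsequences in a format-specific order. */
class AdapterSketch {
    /** Stand-in for the receiver that a writer hands to an adapter. */
    interface Receiver {
        void add(String event);
    }

    /** Hypothetical adapter that exposes the application model piecewise, on request. */
    interface MatrixAdapter {
        List<String> sequenceIDs();
        void writeSequence(String id, Receiver receiver);
    }

    /** Adapter over a plain map from sequence ID to tokens (the "application model"). */
    static MatrixAdapter adapt(Map<String, String> model) {
        return new MatrixAdapter() {
            @Override
            public List<String> sequenceIDs() {
                return new ArrayList<>(model.keySet());
            }

            @Override
            public void writeSequence(String id, Receiver receiver) {
                receiver.add("SEQUENCE START " + id);
                receiver.add("TOKENS " + model.get(id));
                receiver.add("SEQUENCE END");
            }
        };
    }

    /** Writer for a format that lists all sequence names before any sequence data. */
    static List<String> writeNamesFirst(MatrixAdapter adapter) {
        List<String> out = new ArrayList<>();
        for (String id : adapter.sequenceIDs()) {
            out.add("NAME " + id);
        }
        for (String id : adapter.sequenceIDs()) {
            adapter.writeSequence(id, out::add);
        }
        return out;
    }

    /** Writer for a format where each name is directly followed by its sequence data. */
    static List<String> writeInterleaved(MatrixAdapter adapter) {
        List<String> out = new ArrayList<>();
        for (String id : adapter.sequenceIDs()) {
            out.add("NAME " + id);
            adapter.writeSequence(id, out::add);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> model = new LinkedHashMap<>();
        model.put("A", "ACGT");
        model.put("B", "AC-T");
        MatrixAdapter adapter = adapt(model);
        System.out.println(writeNamesFirst(adapter));
        System.out.println(writeInterleaved(adapter));
    }
}
```

Both writers use the same adapter and the same model. Only the order in which they request the subsequences differs, which is why the model does not need to know anything about the target format.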


Reading and writing alignments

Reading and writing alignments can be done easily by classes of libralign.io. The AlignmentIODemo demonstrates how to read and write alignments with LibrAlign. Some code examples contained in this demonstration are shown below.

Reading

  • First, the user needs to create an instance of JPhyloIOReaderWriterFactory (called factory here).
  • By using this factory, it is possible to create instances of event readers and event writers. Additionally, JPhyloIOReaderWriterFactory provides methods to create instances for a specific file format or to guess the format of a file, so that it does not have to be specified before loading the file into the application. For reading, an instance of JPhyloIOEventReader (called eventReader here) needs to be created. All readers and writers use an instance of ReadWriteParameterMap, which allows specifying and collecting parameters that influence the behaviour of an I/O class.
  • Next, a new instance of AlignmentDataReader (called mainReader here) has to be created for each stream of JPhyloIO events. This instance receives the JPhyloIO events and creates AlignmentModels and DataModels from these. (In this concrete example, BioPolymerCharAlignmentModelFactory creates instances of PackedAlignmentModel using CharacterTokenSets for either nucleotide or amino acid data.)
  • The method readAll() provided by AlignmentDataReader processes all events from the underlying JPhyloIO event stream. The resulting model instances can then be accessed by the method getCompletedModels().
JPhyloIOReaderWriterFactory factory = new JPhyloIOReaderWriterFactory();

JPhyloIOEventReader eventReader = factory.guessReader(file, new ReadWriteParameterMap());

AlignmentDataReader mainReader = new AlignmentDataReader(eventReader, new BioPolymerCharAlignmentModelFactory());

mainReader.readAll();

mainReader.getAlignmentModelReader().getCompletedModels();

Writing

  • To write alignments, first the user needs any implementation of DocumentDataAdapter that allows the use of the LibrAlign I/O class AlignmentModelDataAdapter. For example, this can be done by creating an instance of ListBasedDocumentDataAdapter (called document here) as shown in the code example below.
  • Instances of ListBasedDocumentDataAdapter contain lists of properties (for example matrices) that can be filled with DataAdapter instances such as the LibrAlign I/O class AlignmentModelDataAdapter.
  • The JPhyloIOReaderWriterFactory (called factory here) of the code example in “Reading” is used again to create an instance of JPhyloIOEventWriter (called eventWriter here).
  • The method writeDocument(), provided by JPhyloIOEventWriter, writes the data provided by the DataAdapter into a document according to the format of the implementing class.
ListBasedDocumentDataAdapter document = new ListBasedDocumentDataAdapter();

document.getMatrices().add(new AlignmentModelDataAdapter<T>(idPrefix, new LinkedLabeledIDEvent(EventContentType.ALIGNMENT,
    model.getID(), model.getLabel(), null), model, false));

JPhyloIOEventWriter eventWriter = factory.getWriter(formatID);

eventWriter.writeDocument(document, file, new ReadWriteParameterMap());


Reading and writing file metadata using available components of LibrAlign

As mentioned above, the event grammar of JPhyloIO allows nesting sequences of metadata events in all data elements. LibrAlign already provides implementations to read and write metadata that defines the contents of DataModels like character sets or pherograms. (See the next section for how to read and write metadata for which there is no implementation in LibrAlign yet.) To read and write metadata, a specific DataModel that works as a storage for the relevant data is needed. This DataModel can then be read and written by a specific DataReader and DataAdapter. The following example shows how to read and write metadata, like the color of a character set, for which there are already components in LibrAlign. The specific DataReader CharSetEventReader is needed to read character set events from JPhyloIO into instances of the corresponding DataModel CharSetDataModel. The CharSetEventReader needs to be added to your main data reader, which extends AlignmentDataReader and collects all further readers.

CharSetEventReader charSetReader = new CharSetEventReader(mainReader, new URIOrStringIdentifier(null, PREDICATE_COLOR));

mainReader.addDataElementReader(charSetReader);

The models resulting from the method readAll(), which processes all events from the underlying JPhyloIO event stream as mentioned above, can be accessed not only by AlignmentModelEventReader.getCompletedModels() but also through the respective DataReaders. In this case, the DataReader is the newly created charSetReader that was added to the main DataReader. The loaded DataModels that belong to a specific DataElementKey can be obtained as a collection, as shown in the code example below. A DataElementKey stores the IDs of the current AlignmentModel and sequence and is used to access a data element that has been read from a data source. In general, such a collection can contain 0 to n DataModels. In our example, exactly one DataModel is expected. If the collection charSetModels is empty, a new CharSetDataModel is created. Otherwise, the first model contained in the collection is used. If more than one model was read, a message is prepared.

Collection<CharSetDataModel> charSetModels =
        mainReader.getCharSetReader().getCompletedElements().get(new DataElementKey(alignmentModel.getID()));
CharSetDataModel charSetModel;

if (charSetModels.isEmpty()) {
    charSetModel = new CharSetDataModel(alignmentModel);
}
else {
    charSetModel = charSetModels.iterator().next();
}
if (charSetModels.size() > 1) {
    message = "…";
}

The CharSetDataAdapter is a JPhyloIO DataAdapter implementation that can write the content of a CharSetDataModel. The color of a character set can be written as metadata by overriding the method getCharacterSets() of the interface MatrixDataAdapter, which is implemented in AlignmentModelDataAdapter. The method returns a list of character sets defined for the matrix modelled by this instance.

    @Override
    public ObjectListDataAdapter<LinkedLabeledIDEvent> getCharacterSets(ReadWriteParameterMap parameters) {
        return new CharSetDataAdapter(); 			
    }


Reading and writing custom file metadata

To read and write metadata for which there are no ready-to-use implementations in LibrAlign, the application developer needs to create a specific data model that works as a storage for the relevant data. This data model can then be read and written by a specific data reader and data adapter, which also need to be implemented by the application developer. As an example, let's assume an application data model that stores gene bank IDs along with sequences, and that we want to read and write these together with the alignment data. A respective new data model could be called GeneBankIdDataModel as shown in the code example below.

public class GeneBankIdDataModel extends AbstractDataModel implements DataModel {
    private String geneBankID;
    private AlignmentModel alignmentModel;


    public GeneBankIdDataModel(AlignmentModel alignmentModel, String geneBankID) {
        super(alignmentModel);
        this.geneBankID = geneBankID;
        this.alignmentModel = alignmentModel;
    }


    public String getGeneBankID() {
        return geneBankID;
    }
}

To write metadata associated with a whole alignment or with a single sequence, the method writeMetadata() or writeSequenceMetadata(), respectively, must be overridden in the corresponding DataAdapter. In this example, we want to write metadata (the gene bank ID) for each sequence contained in the alignment. Therefore, it is necessary to override the method writeSequenceMetadata(). This can be done by creating a new GeneBankIdDataAdapter that extends AlignmentModelDataAdapter.

public class GeneBankIdDataAdapter extends AlignmentModelDataAdapter<Character> {
    private static final QName PREDICATE_HAS_GENEBANK_ID = new QName(PREDICATE_NAMESPACE_URI, "hasGeneBankID",
            PREDICATE_NAMESPACE_PREFIX);
    // Assumption: the application model offers getGeneBankID(String sequenceID).
    // The AlignmentModel interface itself does not declare this method.
    private AlignmentModel model;
    private int sequenceStart;


    public GeneBankIdDataAdapter(String idPrefix, LinkedLabeledIDEvent startEvent, AlignmentModel<Character> model, boolean linkOTUs) {
        super(idPrefix, startEvent, model, linkOTUs);
        this.model = model;
        sequenceStart = getIDPrefix().length();  // The ID prefix is stripped from sequence IDs below.
    }


    @Override
    protected void writeSequenceMetadata(JPhyloIOEventReceiver receiver, String jPhyloIOPrefixSequenceID)
            throws IOException, IllegalArgumentException {
        super.writeSequenceMetadata(receiver, jPhyloIOPrefixSequenceID);

        // Remove the ID prefix to obtain the sequence ID used in the model.
        String sequenceID = jPhyloIOPrefixSequenceID.substring(sequenceStart);
        String geneBankID = model.getGeneBankID(sequenceID);

        if (geneBankID != null) {
            JPhyloIOWritingUtils.writeSimpleLiteralMetadata(receiver, jPhyloIOPrefixSequenceID +
                    ReadWriteConstants.DEFAULT_META_ID_PREFIX + "1", null, PREDICATE_HAS_GENEBANK_ID, W3CXSConstants.DATA_TYPE_NC_NAME,
                    geneBankID);
        }
    }
}

The parameter receiver is used to write an event sequence. Predicates like PREDICATE_HAS_GENEBANK_ID identify the kind of information the user wants to write and link the metadata to a certain node in the event grammar. This linkage is stored in a new DataModel, which has to be created.
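
How such a predicate can be declared and compared is sketched below, assuming that QName in the listing above refers to javax.xml.namespace.QName from the Java standard library. The namespace URI, the prefixes and the helper method isGeneBankIdPredicate() are hypothetical and chosen only for illustration.

```java
import javax.xml.namespace.QName;

/** Sketch of declaring and comparing a metadata predicate as a QName. */
class PredicateSketch {
    // Hypothetical namespace; an application would use a URI under its own control.
    static final String NAMESPACE_URI = "http://example.org/predicates/";

    static final QName HAS_GENEBANK_ID = new QName(NAMESPACE_URI, "hasGeneBankID", "ex");

    /** Returns true if a predicate read from a document denotes the gene bank ID annotation. */
    static boolean isGeneBankIdPredicate(QName predicate) {
        return HAS_GENEBANK_ID.equals(predicate);
    }

    public static void main(String[] args) {
        // Same namespace URI and local part, but a different prefix: still considered equal.
        System.out.println(isGeneBankIdPredicate(new QName(NAMESPACE_URI, "hasGeneBankID", "p")));
        // Different local part: not equal.
        System.out.println(isGeneBankIdPredicate(new QName(NAMESPACE_URI, "hasColor", "ex")));
    }
}
```

QName.equals() only compares the namespace URI and the local part, so a document that declares a different prefix for the same namespace is still recognized when the predicate is compared while reading.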

Reading metadata such as the gene bank ID of each sequence from our example also requires a newly created GeneBankIdDataReader, which is shown below. The method processEvent() is called separately for each event. Therefore, it is necessary to distinguish which type of data each event represents. This is determined by its EventContentType. Some events, for example those with the content types ALIGNMENT, SEQUENCE or LITERAL_META, have separate start and end events because they enclose a subsequence of the event stream, such as the contents of a matrix (event content type ALIGNMENT), the contents of a sequence in a matrix (event content type SEQUENCE), or literal meta information (event content type LITERAL_META). Other events come in a single SOLE version and therefore have no nested events. An example of this is the event content type LITERAL_META_CONTENT, which indicates that literal metadata content, such as simple values in single events or a sequence, was found in the underlying data source. Furthermore, since the method processEvent() is called separately for each event, the current state is stored in fields instead of processing the events with an iterator in a loop. For example, if the EventContentType is ALIGNMENT and the EventTopologyType of the current event equals EventTopologyType.START, the method asLabeledIDEvent() casts the current event to a labeled ID event, and its ID is stored as the current alignment ID.

public class GeneBankIdDataReader extends AbstractDataElementEventReader<GeneBankIdDataModel> {
    private boolean isPredicate;
    private String currentSequenceID = null;
    private String currentAlignmentID = null;


    public GeneBankIdDataReader(AlignmentDataReader mainReader) {
        super(mainReader, null);
    }


    @Override
    public void processEvent(JPhyloIOEventReader source, JPhyloIOEvent event) throws IOException {
        switch (event.getType().getContentType()) {
            case ALIGNMENT:
                // Remember the alignment that is currently being read.
                if (event.getType().getTopologyType().equals(EventTopologyType.START)) {
                    currentAlignmentID = event.asLabeledIDEvent().getID();
                }
                else if (event.getType().getTopologyType().equals(EventTopologyType.END)) {
                    currentAlignmentID = null;
                }
                break;
            case SEQUENCE:
                // Remember the sequence that is currently being read.
                if (event.getType().getTopologyType().equals(EventTopologyType.START)) {
                    currentSequenceID = event.asLabeledIDEvent().getID();
                }
                else if (event.getType().getTopologyType().equals(EventTopologyType.END)) {
                    currentSequenceID = null;
                }
                break;
            case LITERAL_META:
                if (event.getType().getTopologyType().equals(EventTopologyType.START)) {
                    // PREDICATE_HAS_GENEBANK_ID is the QName declared in GeneBankIdDataAdapter and needs to be accessible here.
                    isPredicate = PREDICATE_HAS_GENEBANK_ID.equals(event.asLiteralMetadataEvent().getPredicate().getURI());
                }
                break;
            case LITERAL_META_CONTENT:
                // Store the value only if it belongs to the expected predicate inside a sequence of an alignment.
                if (isPredicate && (currentAlignmentID != null) && (currentSequenceID != null)) {
                    getCompletedElements().put(new DataElementKey(currentAlignmentID, currentSequenceID),
                            new GeneBankIdDataModel(getMainReader().getAlignmentModelReader().getCurrentModel(),
                            event.asLiteralMetadataContentEvent().getStringValue()));
                }
                break;
            default:
                break;
        }
    }
}

In this example, a new GeneBankIdDataModel is only created if isPredicate is true (which means that the current literal metadata element carries the gene bank ID predicate) and if both an alignment and a sequence are currently being read. The string value, in this case the gene bank ID, is stored in the new GeneBankIdDataModel and is associated with the currentAlignmentID and currentSequenceID stored in the DataElementKey. The created DataReader has to be added to your main DataReader in the same way as the CharSetEventReader in the example above. The resulting models can then be accessed through the newly created GeneBankIdDataReader that was added to the main DataReader.
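
The state handling described above can be tried out without LibrAlign or JPhyloIO using the following self-contained sketch. All types in it (Event, Content, Topology) are simplified stand-ins and not the JPhyloIO event classes. The map key built from both IDs is an assumption that replaces DataElementKey. Unlike the listing above, the sketch also resets isPredicate at the end of a literal metadata element.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of a reader that keeps its context in fields because it is called once per event. */
class StatefulReaderSketch {
    enum Content { ALIGNMENT, SEQUENCE, LITERAL_META, LITERAL_META_CONTENT }
    enum Topology { START, END, SOLE }

    /** Hypothetical event; 'value' holds an ID, a predicate name or literal content. */
    record Event(Content content, Topology topology, String value) {}

    private final String wantedPredicate;
    private final Map<String, String> completed = new LinkedHashMap<>();
    private String currentAlignmentID = null;
    private String currentSequenceID = null;
    private boolean isPredicate = false;

    StatefulReaderSketch(String wantedPredicate) {
        this.wantedPredicate = wantedPredicate;
    }

    /** Processes a single event and updates the stored state accordingly. */
    void processEvent(Event event) {
        boolean start = event.topology() == Topology.START;
        switch (event.content()) {
            case ALIGNMENT:
                currentAlignmentID = start ? event.value() : null;
                break;
            case SEQUENCE:
                currentSequenceID = start ? event.value() : null;
                break;
            case LITERAL_META:
                isPredicate = start && wantedPredicate.equals(event.value());
                break;
            case LITERAL_META_CONTENT:
                if (isPredicate && (currentAlignmentID != null) && (currentSequenceID != null)) {
                    completed.put(currentAlignmentID + "/" + currentSequenceID, event.value());
                }
                break;
        }
    }

    Map<String, String> getCompleted() {
        return completed;
    }

    public static void main(String[] args) {
        StatefulReaderSketch reader = new StatefulReaderSketch("hasGeneBankID");
        reader.processEvent(new Event(Content.ALIGNMENT, Topology.START, "a1"));
        reader.processEvent(new Event(Content.SEQUENCE, Topology.START, "s1"));
        reader.processEvent(new Event(Content.LITERAL_META, Topology.START, "hasGeneBankID"));
        reader.processEvent(new Event(Content.LITERAL_META_CONTENT, Topology.SOLE, "AB123456"));
        reader.processEvent(new Event(Content.LITERAL_META, Topology.END, null));
        reader.processEvent(new Event(Content.SEQUENCE, Topology.END, null));
        reader.processEvent(new Event(Content.ALIGNMENT, Topology.END, null));
        System.out.println(reader.getCompleted());
    }
}
```

Because every call only sees one event, the fields are the only place where the reader can remember which alignment, sequence and predicate the next content event belongs to.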