Package model

Class Storage

java.lang.Object
model.Storage

public class Storage extends Object
Central component of the MUSIAL model, designed to manage genomic data, including contigs, features, and samples.

It provides methods for adding, retrieving, and processing genomic information, as well as handling variant calls and annotations. The class is structured to support efficient storage and manipulation of data, leveraging Java collections and utility classes. Operations executed on storage instances are implemented in separate classes within the op package.

Core Data Structures:

  • Contigs: Stored in a `Map<String, Contig>`, contigs represent chromosomes or plasmids. Each contig can store its sequence and associated variants.
  • Features: Stored in a `Map<String, Feature>`, features represent genomic elements like genes or mRNA. These are validated and processed using Sequence Ontology (SO) terms.
  • Samples: Stored in a `Map<String, Sample>`, samples represent variant calls from distinct biological samples. Metadata and variant calls are associated with each sample.
  • Field Details

    • SEQUENCE_ONTOLOGY_HIERARCHY

      public static final Map<String,Integer> SEQUENCE_ONTOLOGY_HIERARCHY
      A static map defining the hierarchy levels of Sequence Ontology (SO) terms used in the model.

      This map associates various SO terms with their respective hierarchy levels, which are used to categorize genomic features. The hierarchy levels are represented as integers, where a lower number indicates a higher-level feature (e.g., "region" at level 0) and a higher number indicates a more specific feature (e.g., "CDS" at level 3).

      The map includes common SO terms such as "gene", "mRNA", "CDS", and others. It supports the processing and validation of genomic features in the model. Additional terms, such as UTRs, may be added in the future.

      Example:

      • "region" is assigned level 0, representing the highest-level feature.
      • "gene" and "pseudogene" are assigned level 1, representing primary genomic elements.
      • "mRNA" and other RNA types are assigned level 2, representing transcripts.
      • "CDS" and "exon" are assigned level 3, representing coding sequences and exons.
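
      The levels listed above could be represented roughly as follows. This is a minimal sketch rather than the actual field: the class name SoHierarchySketch and the helper isMoreSpecific are hypothetical, and only the terms and levels named in this documentation are included.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for Storage.SEQUENCE_ONTOLOGY_HIERARCHY; only the
// terms and levels named in the documentation are included here.
class SoHierarchySketch {

    static final Map<String, Integer> HIERARCHY;

    static {
        Map<String, Integer> levels = new LinkedHashMap<>();
        levels.put("region", 0);     // highest-level feature
        levels.put("gene", 1);       // primary genomic elements
        levels.put("pseudogene", 1);
        levels.put("mRNA", 2);       // transcripts
        levels.put("CDS", 3);        // coding sequences and exons
        levels.put("exon", 3);
        HIERARCHY = Collections.unmodifiableMap(levels);
    }

    // Returns true if both terms are known and `child` sits on a deeper
    // (more specific) level than `parent`.
    static boolean isMoreSpecific(String child, String parent) {
        Integer childLevel = HIERARCHY.get(child);
        Integer parentLevel = HIERARCHY.get(parent);
        return childLevel != null && parentLevel != null && childLevel > parentLevel;
    }
}
```

      With this layout, a lookup such as isMoreSpecific("CDS", "gene") compares level 3 against level 1, while an unknown term simply fails the check.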
    • parameters

      public final Storage.Parameters parameters
      Immutable configuration parameters used by this storage.

      This field holds an instance of Storage.Parameters, which contains the configuration for the storage system. The parameters are immutable and define the behavior of the storage, such as thresholds and exclusions.

  • Constructor Details

    • Storage

      public Storage(Storage.Parameters parameters)
      Constructs a new Storage instance with the specified parameters.

      This constructor initializes the Storage object with the provided configuration parameters. It also initializes empty containers for contigs, features, and samples using LinkedTreeMap, so that entries are kept in a consistent order and can be looked up efficiently.

      Parameters:
      parameters - The Storage.Parameters object containing the configuration for the storage. This includes thresholds, exclusions, and other settings for managing genomic data.
  • Method Details

    • hasReference

      public boolean hasReference()
      Checks if the reference sequence is set for the storage.

      This method determines whether the reference field has been initialized with a non-null value. The reference field is a transient accessor to the indexed reference sequences used in the storage. If the reference is set, it indicates that the storage has access to the reference sequence for further operations.

      Returns:
      true if the reference field is non-null, indicating that the reference sequence is set; false otherwise.
    • setReference

      public void setReference(htsjdk.samtools.reference.IndexedFastaSequenceFile indexedFastaSequenceFile) throws IOException
      Sets the reference sequence for the storage and populates contigs based on the reference.

      This method assigns the provided IndexedFastaSequenceFile as the reference sequence for the storage. It clears any previously stored contigs and repopulates them based on the sequences available in the reference.

      The method iterates through all sequences in the reference file, adding each sequence as a contig to the storage if it does not already exist. After processing all sequences, the reference file is reset to its initial state.

      This method should only be called when creating new instances of Storage and not during deserialization!

      Parameters:
      indexedFastaSequenceFile - The IndexedFastaSequenceFile instance representing the reference sequence. Must not be null.
      Throws:
      IOException - If an error occurs while adding contigs to the storage.
      AssertionError - If the provided indexedFastaSequenceFile is null.
    • hasContig

      public boolean hasContig(String identifier)
      Checks if a contig is present in the storage by its unique identifier.

      This method verifies whether a contig, identified by the given id, exists in the storage's contig map. It is useful for determining the presence of a specific contig before performing operations on it.

      Parameters:
      identifier - The unique id of the contig to check. This id typically represents the name of the chromosome or plasmid.
      Returns:
      true if the contig is present in the storage; false otherwise.
    • addContig

      public void addContig(String identifier, String sequence) throws IOException
      Adds a contig with the specified id and sequence to the storage.

      This method is responsible for adding a contig (chromosome or plasmid) to the storage. It ensures that the contig is only added if it does not already exist in the storage. The contig is represented by its name (id) and its sequence.

      The sequence is compressed using GZIP for efficient storage. If the provided sequence is null or empty, the method assigns an empty string as the compressed sequence and sets the length of the sequence to 0. This ensures that the contig is still added to the storage with its attributes, even if no sequence data is available.

      Parameters:
      identifier - The unique identifier (id) of the contig to add. This typically represents the name of the chromosome or plasmid.
      sequence - The nucleotide sequence of the contig. This can be null or empty, in which case an empty sequence is stored.
      Throws:
      IOException - If an error occurs during the compression of the sequence data.
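
      The compression step described above might look like the following sketch. The class and method names are hypothetical, and the Base64 text encoding is an assumption made here so that the compressed bytes fit into a String; the documentation only states that GZIP is used and that null or empty input yields an empty string.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; Base64 is an assumed encoding, not a documented one.
class SequenceCompressionSketch {

    // Compresses a nucleotide sequence with GZIP; null or empty input yields "".
    static String compress(String sequence) throws IOException {
        if (sequence == null || sequence.isEmpty()) {
            return "";
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(sequence.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }

    // Reverses compress(); an empty string maps back to an empty sequence.
    static String decompress(String compressed) throws IOException {
        if (compressed == null || compressed.isEmpty()) {
            return "";
        }
        byte[] bytes = Base64.getDecoder().decode(compressed);
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

      Closing the GZIPOutputStream before reading the buffer matters, because the GZIP trailer is only written on close.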
    • getContig

      public Contig getContig(String identifier)
      Retrieves a contig by its unique identifier.

      This method looks up a contig in the storage using its unique id. If the contig is found, it returns the corresponding Contig object. If no contig with the specified id exists, the method returns null.

      Parameters:
      identifier - The unique identifier of the contig to retrieve. This id typically represents the name of the chromosome or plasmid.
      Returns:
      The Contig object associated with the specified id, or null if no such contig exists in the storage.
    • getContigs

      public Collection<Contig> getContigs()
      Retrieves an unmodifiable collection view of the contigs stored in the storage.

      This method provides a read-only view of the contigs stored in the storage. The returned collection reflects the current state of the contigs map but cannot be modified directly. This ensures that the integrity of the underlying data structure is maintained.

      This method is useful for accessing all contigs in the storage without allowing external modifications.

      Returns:
      An unmodifiable collection of Contig objects stored in the storage.
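
      The view semantics described above (reflects the map, rejects writes) can be illustrated with plain Java collections. This is a sketch under assumptions: UnmodifiableViewSketch is a hypothetical stand-in that stores sequences as strings instead of Contig objects, and a LinkedHashMap replaces the actual container.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in that stores plain strings to show the view semantics.
class UnmodifiableViewSketch {

    private final Map<String, String> contigs = new LinkedHashMap<>();

    // Adds a contig only if no contig with this identifier exists yet.
    void addContig(String identifier, String sequence) {
        contigs.putIfAbsent(identifier, sequence);
    }

    // Read-only view: reflects later changes to the map, rejects writes.
    Collection<String> getContigs() {
        return Collections.unmodifiableCollection(contigs.values());
    }

    // Returns true if writing through the given view is rejected.
    static boolean rejectsWrites(Collection<String> view) {
        try {
            view.add("ACGT");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }
}
```

      Because the result is a view and not a copy, a collection obtained before a later addContig call still sees the new entry.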
    • hasFeature

      public boolean hasFeature(String identifier)
      Checks whether a feature with the given id is stored in this instance.
      Parameters:
      identifier - The id of the feature.
      Returns:
      true if a feature is stored under the given id; false otherwise.
    • addFeature

      public void addFeature(org.biojava.nbio.genome.parsers.gff.FeatureI featureI, String name, Map<String,String> attributes) throws MusialException
      Adds feature information from a FeatureI object to the storage.

      This method extracts the necessary details from the provided FeatureI object, such as the parent contig, start and end positions, strand, and type, and delegates the processing to the overloaded addFeature(String, String, Number, Number, char, String, Map) method.

      Parameters:
      featureI - The FeatureI object containing the feature information to transfer.
      name - The name of the feature.
      attributes - A map of attributes associated with the feature.
      Throws:
      MusialException - If an error occurs while adding the feature to the storage.
    • addFeature

      public void addFeature(String name, String contigIdentifier, Number start, Number end, char strand, String type, Map<String,String> attributes) throws MusialException
      Adds feature information to the storage.

      This method processes and validates the provided feature information, including its type, location, and attributes. It checks whether the feature type is supported by the Sequence Ontology (SO) map and determines a unique identifier (UID) for the feature. If a feature with the same UID already exists, it validates compatibility with the parent feature and updates the "children" attribute if applicable. Otherwise, it creates a new feature and adds it to the storage. Processed attributes are removed from the attributes map, and the remaining attributes are added to the feature.

      The method performs the following steps:

      • Validates the feature type against the Sequence Ontology (SO) map.
      • Validates the feature's location and compatibility with stored contigs.
      • Determines a unique identifier (UID) for the feature based on its attributes.
      • Checks if a feature with the same UID already exists and updates or creates the feature accordingly.
      • Removes processed attributes and extends the feature's attributes with the remaining ones.
      Parameters:
      name - The name of the feature.
      contigIdentifier - The identifier of the stored contig (chromosome or plasmid) on which the feature is located.
      start - The start position of the feature.
      end - The end position of the feature.
      strand - The strand of the feature ('+' or '-').
      type - The type of the feature (e.g., "gene", "mRNA").
      attributes - A map of attributes associated with the feature.
      Throws:
      MusialException - If an error occurs while adding the feature to the storage.
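
      The first two validation steps listed above (feature type and location) could be sketched as follows. Everything here is an assumption for illustration: the class FeatureValidationSketch, the use of IllegalArgumentException in place of MusialException, the reduced set of supported types, and the concrete location rules (1-based inclusive coordinates with start <= end and end within the contig length). UID determination and attribute handling are not covered.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical validation helper; the concrete rules are assumptions.
class FeatureValidationSketch {

    // Only the SO terms named in this documentation.
    static final Set<String> SUPPORTED_TYPES =
            Set.of("region", "gene", "pseudogene", "mRNA", "CDS", "exon");

    // Throws IllegalArgumentException (standing in for MusialException)
    // if the type is unsupported or the location does not fit the contig.
    static void validate(String type, String contigIdentifier, long start, long end,
                         char strand, Map<String, Integer> contigLengths) {
        if (!SUPPORTED_TYPES.contains(type)) {
            throw new IllegalArgumentException("Unsupported feature type: " + type);
        }
        Integer length = contigLengths.get(contigIdentifier);
        if (length == null) {
            throw new IllegalArgumentException("Unknown contig: " + contigIdentifier);
        }
        if (start < 1 || start > end || end > length) {
            throw new IllegalArgumentException("Invalid location: " + start + "-" + end);
        }
        if (strand != '+' && strand != '-') {
            throw new IllegalArgumentException("Invalid strand: " + strand);
        }
    }

    // Convenience wrapper that reports validity as a boolean.
    static boolean isValid(String type, String contigIdentifier, long start, long end,
                           char strand, Map<String, Integer> contigLengths) {
        try {
            validate(type, contigIdentifier, start, end, strand, contigLengths);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```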
    • replaceFeature

      public Feature replaceFeature(Feature stored, Feature replacement)
      Replaces an existing feature in the storage with a new feature.

      This method checks if the feature to be replaced exists in the storage. If the feature exists, it is replaced with the provided replacement feature. If the feature does not exist, a warning is logged, and no changes are made.

      This method is intended only for updating features during feature validation and should not be used for any other purpose.

      Parameters:
      stored - The Feature object to be replaced. This feature must already exist in the storage.
      replacement - The Feature object to replace the existing feature with.
      Returns:
      The replaced Feature object if the replacement was successful; null if the feature to be replaced does not exist.
    • getFeature

      public Feature getFeature(String identifier)
      Retrieves a feature from the storage by its unique identifier.

      This method looks up a feature in the storage using its unique id. If the feature is found, it returns the corresponding Feature object. If no feature with the specified id exists, the method returns null.

      Todo: It may be relevant for some applications to retrieve features by other attributes, such as "locus_tag" or "ID".

      Parameters:
      identifier - The unique identifier of the feature to retrieve. This id typically represents the name or id of the genomic feature.
      Returns:
      The Feature object associated with the specified id, or null if no such feature exists in the storage.
    • getFeatures

      public Collection<Feature> getFeatures()
      Retrieves an unmodifiable collection view of the features stored in the storage.

      This method provides a read-only view of the features stored in the storage. The returned collection reflects the current state of the features map but cannot be modified directly. This ensures that the integrity of the underlying data structure is maintained.

      Returns:
      An unmodifiable collection of Feature objects stored in the storage.
    • getFeatureAttributeKeys

      public Set<String> getFeatureAttributeKeys()
      Retrieves the set of all unique attribute keys from features in the storage.

      This method iterates through all features in the features map, collects the keys of their attributes, and returns them as a Set. The use of a Set ensures that the returned collection contains only unique attribute keys, even if multiple features share the same attribute keys.

      Returns:
      A Set of String objects representing the unique attribute keys of all features.
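
      Collecting distinct keys across many attribute maps, as described above, can be sketched with a stream. The class AttributeKeysSketch is hypothetical, and each feature is reduced to its attribute map for brevity; the choice of LinkedHashSet (which keeps first-seen order) is also an assumption.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: each feature is represented only by its attribute map.
class AttributeKeysSketch {

    // Collects the distinct attribute keys across all given attribute maps.
    static Set<String> attributeKeys(Collection<Map<String, String>> attributeMaps) {
        return attributeMaps.stream()
                .flatMap(attributes -> attributes.keySet().stream())
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }
}
```

      A key shared by several features, such as "locus_tag", appears only once in the result.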
    • removeFeature

      public void removeFeature(String identifier)
      Removes a feature from the storage by its unique identifier.

      This method deletes the feature associated with the given identifier from the `features` map. It is useful for managing the storage by allowing the removal of specific genomic features when they are no longer needed or relevant.

      Parameters:
      identifier - The unique identifier of the feature to remove. This id typically represents the name or id of the genomic feature.
    • hasSample

      public boolean hasSample(String identifier)
      Checks if a sample is present in the storage by its unique identifier.

      This method verifies whether a sample, identified by the given id, exists in the storage's sample map. It is useful for determining the presence of a specific sample before performing operations on it.

      Parameters:
      identifier - The unique id of the sample to check. This id typically represents the name or identifier of the biological sample.
      Returns:
      true if the sample is present in the storage; false otherwise.
    • addSample

      public void addSample(String sampleIdentifier)
      Adds a sample to the storage if it does not already exist.

      This method checks whether a sample with the given identifier is already present in the storage. If the sample does not exist, it creates a new Sample object, associates it with the storage, and adds it to the `samples` map. The new Sample object is initialized with the current number of contigs and features in the storage. If the sample already exists, the method does nothing.

      Parameters:
      sampleIdentifier - The unique identifier of the sample to add. This typically represents the name or ID of the biological sample.
    • getSample

      public Sample getSample(String identifier)
      Retrieves a sample from the storage by its unique identifier.

      This method looks up a sample in the storage using its unique identifier. If the sample is found, it returns the corresponding Sample object. If no sample with the specified identifier exists, the method returns null.

      Parameters:
      identifier - The unique identifier of the sample to retrieve. This identifier typically represents the name or ID of the biological sample.
      Returns:
      The Sample object associated with the specified identifier, or null if no such sample exists in the storage.
    • getSamples

      public Collection<Sample> getSamples()
      Retrieves an unmodifiable collection view of the samples stored in the storage.

      This method provides a read-only view of the samples stored in the storage. The returned collection reflects the current state of the samples map but cannot be modified directly. This ensures that the integrity of the underlying data structure is maintained.

      This method is useful for accessing all samples in the storage without allowing external modifications.

      Returns:
      An unmodifiable collection of Sample objects stored in the storage.
    • getActiveSamples

      public Collection<Sample> getActiveSamples()
      Retrieves a collection of active samples from the storage.

      This method filters the samples stored in the `samples` map and returns only those that are marked as active. Active samples are identified by the `active` field being set to true.

      The returned collection is a list created from the filtered stream of samples. Modifications to the returned list do not affect the underlying storage.

      Returns:
      A Collection of Sample objects that are currently active.
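
      The filtering behavior described above, including the fact that the returned list is detached from the storage, can be sketched as follows. ActiveSamplesSketch and its nested Sample class are hypothetical stand-ins that model only an identifier and the active flag.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical stand-in for the storage, reduced to the samples map.
class ActiveSamplesSketch {

    // Minimal stand-in for Sample with only an identifier and an active flag.
    static final class Sample {
        final String identifier;
        final boolean active;

        Sample(String identifier, boolean active) {
            this.identifier = identifier;
            this.active = active;
        }
    }

    private final Map<String, Sample> samples = new LinkedHashMap<>();

    // Adds a sample only if its identifier is not stored yet.
    void addSample(String identifier, boolean active) {
        samples.putIfAbsent(identifier, new Sample(identifier, active));
    }

    int size() {
        return samples.size();
    }

    // Copies the active samples into a new list, detached from the map.
    Collection<Sample> getActiveSamples() {
        return samples.values().stream()
                .filter(sample -> sample.active)
                .collect(Collectors.toCollection(ArrayList::new));
    }
}
```

      Clearing the returned list leaves the samples map untouched, which is the difference between this copy and the unmodifiable views returned by getSamples().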
    • getSampleAttributeKeys

      public Set<String> getSampleAttributeKeys()
      Retrieves the set of all unique attribute keys from samples in the storage.

      This method iterates through all samples in the samples map, collects the keys of their attributes, and returns them as a Set. The use of a Set ensures that the returned collection contains only unique attribute keys, even if multiple samples share the same attribute keys.

      Returns:
      A Set of String objects representing the unique attribute keys of all samples.
    • detachSample

      public void detachSample(String sampleIdentifier)
      Detaches a sample from the storage and removes its associations with variants and alleles.

      This method performs the following operations:

      • Checks if the sample exists in the storage. If not, the method returns immediately.
      • Detaches the sample from all variants in the contigs. If a variant is no longer associated with any sample, it is removed from the contig.
      • Detaches the sample from all alleles in the features. If an allele is no longer associated with any sample, it is removed from the feature.
      • Removes the sample from the `samples` map in the storage.
      Parameters:
      sampleIdentifier - The unique identifier of the sample to be detached.
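
      The detach-and-prune steps above can be sketched with a simple mapping from variants to the samples that carry them. The class DetachSampleSketch and the string variant keys are hypothetical; alleles of features would be handled analogously and are omitted here.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: variants are keyed by a string and hold the set of
// sample identifiers that carry them.
class DetachSampleSketch {

    final Set<String> samples = new HashSet<>();
    final Map<String, Set<String>> variantToSamples = new LinkedHashMap<>();

    // Registers the sample and associates it with the given variant.
    void addVariant(String variantKey, String sampleIdentifier) {
        samples.add(sampleIdentifier);
        variantToSamples.computeIfAbsent(variantKey, key -> new HashSet<>()).add(sampleIdentifier);
    }

    void detachSample(String sampleIdentifier) {
        if (!samples.contains(sampleIdentifier)) {
            return; // unknown sample: nothing to do
        }
        // Drop the sample from every variant.
        variantToSamples.values().forEach(carriers -> carriers.remove(sampleIdentifier));
        // Remove variants that no sample carries any more.
        variantToSamples.values().removeIf(Set::isEmpty);
        // Finally remove the sample itself.
        samples.remove(sampleIdentifier);
    }
}
```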
    • addVariant

      public void addVariant(Contig contig, String sampleIdentifier, int position, String reference, String alternative, Set<VariantCall> variantCalls)
      Adds a variant to the specified contig in the storage.

      This method ensures that the variant is in a canonical padded format. If the variant is not canonical, an IllegalArgumentException is thrown. If the variant does not already exist in the contig, it is created and added. The method also associates the variant with the specified sample and caches the variant for further processing.

      Parameters:
      contig - The Contig object to which the variant belongs. Represents the genomic region where the variant is located.
      sampleIdentifier - The unique identifier of the sample associated with the variant. Used to track the sample's variant calls.
      position - The position of the variant within the contig. Represents the genomic coordinate of the variant.
      reference - The reference allele of the variant. This is the expected sequence at the given position.
      alternative - The alternative allele of the variant. This is the observed sequence differing from the reference.
      variantCalls - A set of VariantCall objects representing the variant calls associated with the sample.
      Throws:
      IllegalArgumentException - If the variant is not in a canonical padded format. Ensures data consistency and correctness.
    • getVariantsCount

      public int getVariantsCount()
      Calculates the total number of variants across all contigs in the storage.

      This method iterates through all contigs stored in the contigs map and sums up the variant counts for each contig. The variant count for each contig is retrieved using the Contig.getVariantsCount() method.

      Returns:
      The total number of variants across all contigs.
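
      The summation described above amounts to a single stream reduction. In this sketch, VariantCountSketch and the ContigLike interface are hypothetical stand-ins that expose only getVariantsCount(); the same shape would fit getActiveVariantsCount().

```java
import java.util.Collection;

// Hypothetical sketch of summing per-contig variant counts.
class VariantCountSketch {

    // Minimal stand-in for Contig exposing only its variant count.
    interface ContigLike {
        int getVariantsCount();
    }

    // Sums the variant counts of all given contigs; an empty collection yields 0.
    static int totalVariants(Collection<? extends ContigLike> contigs) {
        return contigs.stream().mapToInt(ContigLike::getVariantsCount).sum();
    }
}
```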
    • getActiveVariantsCount

      public int getActiveVariantsCount()
      Calculates the total number of active variants across all contigs in the storage.

      This method iterates through all contigs stored in the contigs map and sums up the active variant counts for each contig. The active variant count for each contig is retrieved using the Contig.getActiveVariantsCount() method.

      Returns:
      The total number of active variants across all contigs.
    • typeAdapter

      public static com.google.gson.TypeAdapter<Storage> typeAdapter()
      Creates a custom TypeAdapter for the Storage class.

      This method defines a custom TypeAdapter to handle the serialization and deserialization of Storage objects. The adapter uses Gson's default adapter for most operations but adds custom behavior during deserialization to initialize the transient reference field.

      Returns:
      A TypeAdapter for the Storage class.