Class Storage
This class provides methods for adding, retrieving, and processing genomic information, as well as handling variant calls and annotations. The
class is structured to support efficient storage and manipulation of data, leveraging Java collections and utility classes. Operations
executed on storage instances are implemented in separate classes within the `op` package.
Core Data Structures:
- Contigs: Stored in a `Map<String, Contig>`, contigs represent chromosomes or plasmids. Each contig can store its sequence and associated variants.
- Features: Stored in a `Map<String, Feature>`, features represent genomic elements like genes or mRNA. These are validated and processed using Sequence Ontology (SO) terms.
- Samples: Stored in a `Map<String, Sample>`, samples represent variant calls from distinct biological samples. Metadata and variant calls are associated with each sample.
-
Nested Class Summary
Nested Classes
- `static final record Storage.Parameters`: The parameters used for configuring the storage of variant data.
-
Field Summary
Fields
- `final Storage.Parameters parameters`: Static parameters used by this storage.
- `SEQUENCE_ONTOLOGY_HIERARCHY`: A static map defining the hierarchy levels of Sequence Ontology (SO) terms used in the model.
-
Constructor Summary
Constructors
- `Storage(Storage.Parameters parameters)`: Constructs a new `Storage` instance with the specified parameters.
-
Method Summary
Methods
- `void addContig`: Adds a contig to the storage under the specified id and sequence.
- `void addFeature(String name, String contigIdentifier, Number start, Number end, char strand, String type, Map<String, String> attributes)`: Adds feature information to the storage.
- `void addFeature(org.biojava.nbio.genome.parsers.gff.FeatureI featureI, String name, Map<String, String> attributes)`: Adds feature information from a `FeatureI` object to the storage.
- `void addSample`: Adds a sample to the storage if it does not already exist.
- `void addVariant(Contig contig, String sampleIdentifier, int position, String reference, String alternative, Set<VariantCall> variantCalls)`: Adds a variant to the specified contig in the storage.
- `void detachSample(String sampleIdentifier)`: Detaches a sample from the storage and removes its associations with variants and alleles.
- `getActiveSamples`: Retrieves a collection of active samples from the storage.
- `int getActiveVariantsCount()`: Calculates the total number of active variants across all contigs in the storage.
- `getContig`: Retrieves a contig by its unique identifier.
- `getContigs`: Retrieves an unmodifiable collection view of the contigs stored in the storage.
- `getFeature(String identifier)`: Retrieves a feature from the storage by its unique identifier.
- `getFeatureAttributeKeys`: Retrieves the set of all unique attribute keys from features in the storage.
- `getFeatures`: Retrieves an unmodifiable collection view of the features stored in the storage.
- `getSample`: Retrieves a sample from the storage by its unique identifier.
- `getSampleAttributeKeys`: Retrieves the set of all unique attribute keys from samples in the storage.
- `getSamples`: Retrieves an unmodifiable collection view of the samples stored in the storage.
- `int getVariantsCount()`: Calculates the total number of variants across all contigs in the storage.
- `boolean hasContig`: Checks if a contig is present in the storage by its unique identifier.
- `boolean hasFeature(String identifier)`: Queries whether a feature is stored in this instance by its id.
- `boolean hasReference()`: Checks if the reference sequence is set for the storage.
- `boolean hasSample`: Checks if a sample is present in the storage by its unique identifier.
- `void removeFeature(String identifier)`: Removes a feature from the storage by its unique identifier.
- `replaceFeature(Feature stored, Feature replacement)`: Replaces an existing feature in the storage with a new feature.
- `void setReference(htsjdk.samtools.reference.IndexedFastaSequenceFile indexedFastaSequenceFile)`: Sets the reference sequence for the storage and populates contigs based on the reference.
- `static com.google.gson.TypeAdapter<Storage> typeAdapter`: Creates a custom `TypeAdapter` for the `Storage` class.
-
Field Details
-
SEQUENCE_ONTOLOGY_HIERARCHY
A static map defining the hierarchy levels of Sequence Ontology (SO) terms used in the model.
This map associates various SO terms with their respective hierarchy levels, which are used to categorize genomic features. The hierarchy levels are represented as integers, where a lower number indicates a higher-level feature (e.g., "region" at level 0) and a higher number indicates a more specific feature (e.g., "CDS" at level 3).
The map includes common SO terms such as "gene", "mRNA", "CDS", and others. It is designed to support the processing and validation of genomic features in the model. Additional terms like UTRs can be optionally added in the future.
Example:
- "region" is assigned level 0, representing the highest-level feature.
- "gene" and "pseudogene" are assigned level 1, representing primary genomic elements.
- "mRNA" and other RNA types are assigned level 2, representing transcripts.
- "CDS" and "exon" are assigned level 3, representing coding sequences and exons.
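The level assignments listed above can be sketched as a small lookup table. This is only an illustration under stated assumptions: the class name `SoHierarchySketch` and the helpers `isSupported` and `isBelow` are hypothetical and not part of `Storage`, and only the terms named in this section are included.

```java
import java.util.Map;

public final class SoHierarchySketch {
    // Hypothetical stand-in for SEQUENCE_ONTOLOGY_HIERARCHY.
    // Only the terms and levels named in the text are listed here.
    public static final Map<String, Integer> LEVELS = Map.of(
            "region", 0,
            "gene", 1,
            "pseudogene", 1,
            "mRNA", 2,
            "CDS", 3,
            "exon", 3);

    /** Returns true if the given feature type is a known SO term. */
    public static boolean isSupported(String type) {
        return LEVELS.containsKey(type);
    }

    /** Returns true if both terms are known and child sits strictly below parent. */
    public static boolean isBelow(String child, String parent) {
        if (!isSupported(child) || !isSupported(parent)) {
            return false;
        }
        return LEVELS.get(child) > LEVELS.get(parent);
    }

    public static void main(String[] args) {
        System.out.println(isSupported("gene"));
        System.out.println(isBelow("CDS", "mRNA"));
        System.out.println(isBelow("gene", "mRNA"));
    }
}
```

A lookup of this kind is one plausible way to reject unsupported feature types and to check that a child feature is more specific than its parent.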
-
parameters
Static parameters used by this storage.
This field holds an instance of `Storage.Parameters`, which contains the configuration for the storage system. The parameters are immutable and define the behavior of the storage, such as thresholds and exclusions.
-
-
Constructor Details
-
Storage
Constructs a new `Storage` instance with the specified parameters.
This constructor initializes the `Storage` object with the provided configuration parameters. It also initializes empty containers for contigs, features, and samples using `LinkedTreeMap`, ensuring that the data is stored in a sorted and efficient manner.
- Parameters:
`parameters` - The `Storage.Parameters` object containing the configuration for the storage. This includes thresholds, exclusions, and other settings for managing genomic data.
-
-
Method Details
-
hasReference
public boolean hasReference()
Checks if the reference sequence is set for the storage.
This method determines whether the `reference` field has been initialized with a non-null value. The `reference` field is a transient accessor to the indexed reference sequences used in the storage. If the reference is set, the storage has access to the reference sequence for further operations.
- Returns:
`true` if the `reference` field is non-null, indicating that the reference sequence is set; `false` otherwise.
-
setReference
public void setReference(htsjdk.samtools.reference.IndexedFastaSequenceFile indexedFastaSequenceFile) throws IOException
Sets the reference sequence for the storage and populates contigs based on the reference.
This method assigns the provided `IndexedFastaSequenceFile` as the reference sequence for the storage. It clears any previously stored contigs and repopulates them based on the sequences available in the reference.
The method iterates through all sequences in the reference file, adding each sequence as a contig to the storage if it does not already exist. After processing all sequences, the reference file is reset to its initial state.
This method should only be called when creating new instances of `Storage` and not during deserialization!
- Parameters:
`indexedFastaSequenceFile` - The `IndexedFastaSequenceFile` instance representing the reference sequence. Must not be null.
- Throws:
`IOException` - If an error occurs while adding contigs to the storage.
`AssertionError` - If the provided `indexedFastaSequenceFile` is null.
-
hasContig
Checks if a contig is present in the storage by its unique identifier.
This method verifies whether a contig, identified by the given id, exists in the storage's contig map. It is useful for determining the presence of a specific contig before performing operations on it.
- Parameters:
`identifier` - The unique id of the contig to check. This id typically represents the name of the chromosome or plasmid.
- Returns:
`true` if the contig is present in the storage; `false` otherwise.
-
addContig
Adds a contig to the storage under the specified id and sequence.
This method is responsible for adding a contig (chromosome or plasmid) to the storage. It ensures that the contig is only added if it does not already exist in the storage. The contig is represented by its name (id) and its sequence.
The sequence is compressed using GZIP for efficient storage. If the provided sequence is null or empty, the method assigns an empty string as the compressed sequence and sets the length of the sequence to 0. This ensures that the contig is still added to the storage with its attributes, even if no sequence data is available.
- Parameters:
`identifier` - The unique identifier (id) of the contig to add. This typically represents the name of the chromosome or plasmid.
`sequence` - The nucleotide sequence of the contig. This can be null or empty, in which case an empty sequence is stored.
- Throws:
`IOException` - If an error occurs during the compression of the sequence data.
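The compression behaviour described for `addContig` can be illustrated with the standard `java.util.zip` classes. This is a minimal sketch, not the actual implementation: the class `SequenceCompressionSketch`, its method names, and the choice to Base64-encode the compressed bytes into a string are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public final class SequenceCompressionSketch {
    /** GZIP-compresses a sequence and Base64-encodes the result; null or empty input yields "". */
    public static String compress(String sequence) throws IOException {
        if (sequence == null || sequence.isEmpty()) {
            return ""; // mirrors the documented fallback: empty compressed sequence
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(sequence.getBytes(StandardCharsets.UTF_8));
        }
        // Base64 is an assumption of this sketch, used to hold binary data in a String.
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }

    /** Reverses compress(); an empty string decompresses to an empty sequence. */
    public static String decompress(String compressed) throws IOException {
        if (compressed == null || compressed.isEmpty()) {
            return "";
        }
        byte[] bytes = Base64.getDecoder().decode(compressed);
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

In this sketch the stored length would be taken from the original sequence (0 for null or empty input), since the compressed string no longer reflects it.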
-
getContig
Retrieves a contig by its unique identifier.
This method looks up a contig in the storage using its unique id. If the contig is found, it returns the corresponding `Contig` object. If no contig with the specified id exists, the method returns `null`.
- Parameters:
`identifier` - The unique identifier of the contig to retrieve. This id typically represents the name of the chromosome or plasmid.
- Returns:
The `Contig` object associated with the specified id, or `null` if no such contig exists in the storage.
-
getContigs
Retrieves an unmodifiable collection view of the contigs stored in the storage.
This method provides a read-only view of the contigs stored in the storage. The returned collection reflects the current state of the contigs map but cannot be modified directly. This ensures that the integrity of the underlying data structure is maintained.
This method is useful for accessing all contigs in the storage without allowing external modifications.
- Returns:
An unmodifiable collection of `Contig` objects stored in the storage.
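The read-only view semantics described for `getContigs`, `getFeatures`, and `getSamples` match what `Collections.unmodifiableCollection` provides over a map's values. The following sketch assumes that approach; the class `UnmodifiableViewSketch` is hypothetical and stores plain strings instead of `Contig` objects.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public final class UnmodifiableViewSketch {
    // Hypothetical simplification: identifier -> sequence instead of identifier -> Contig.
    private final Map<String, String> contigs = new LinkedHashMap<>();

    /** Adds a contig only if no contig with this identifier exists yet. */
    public void addContig(String identifier, String sequence) {
        contigs.putIfAbsent(identifier, sequence);
    }

    /** Returns a live, read-only view of the stored values. */
    public Collection<String> getContigs() {
        return Collections.unmodifiableCollection(contigs.values());
    }

    public static void main(String[] args) {
        UnmodifiableViewSketch storage = new UnmodifiableViewSketch();
        storage.addContig("chr1", "ACGT");
        Collection<String> view = storage.getContigs();
        storage.addContig("chr2", "GGCC"); // the view reflects this later addition
        System.out.println(view.size());
        try {
            view.clear(); // mutation through the view is rejected
        } catch (UnsupportedOperationException e) {
            System.out.println("read-only");
        }
    }
}
```

Because the view is live, callers who need a stable snapshot would have to copy it themselves.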
-
hasFeature
Queries whether a feature is stored in this instance by its id.
- Parameters:
`identifier` - The id of the feature.
- Returns:
`true` if a feature is stored under the given `identifier`.
-
addFeature
public void addFeature(org.biojava.nbio.genome.parsers.gff.FeatureI featureI, String name, Map<String, String> attributes) throws MusialException
Adds feature information from a `FeatureI` object to the storage.
This method extracts the necessary details from the provided `FeatureI` object, such as the parent contig, start and end positions, strand, and type, and delegates the processing to the overloaded `addFeature(String, String, Number, Number, char, String, Map)` method.
- Parameters:
`featureI` - The `FeatureI` object containing the feature information to transfer.
`name` - The name of the feature.
`attributes` - A map of attributes associated with the feature.
- Throws:
`MusialException` - If an error occurs while adding the feature to the storage.
-
addFeature
public void addFeature(String name, String contigIdentifier, Number start, Number end, char strand, String type, Map<String, String> attributes) throws MusialException
Adds feature information to the storage.
This method processes and validates the provided feature information, including its type, location, and attributes. It checks if the feature type is supported by the Sequence Ontology (SO) map and determines a unique identifier (UID) for the feature. If a feature with the same UID already exists, it validates compatibility with the parent feature and updates the "children" attribute if applicable. Otherwise, it creates a new feature and adds it to the storage. Processed attributes are removed from the attributes map, and the remaining attributes are extended for the feature.
The method performs the following steps:
- Validates the feature type against the Sequence Ontology (SO) map.
- Validates the feature's location and compatibility with stored contigs.
- Determines a unique identifier (UID) for the feature based on its attributes.
- Checks if a feature with the same UID already exists and updates or creates the feature accordingly.
- Removes processed attributes and extends the feature's attributes with the remaining ones.
- Parameters:
`name` - The name of the feature.
`contigIdentifier` - The contig (e.g., chromosome) on which the feature is located; must match the identifier of a stored contig.
`start` - The start position of the feature.
`end` - The end position of the feature.
`strand` - The strand of the feature ('+' or '-').
`type` - The type of the feature (e.g., "gene", "mRNA").
`attributes` - A map of attributes associated with the feature.
- Throws:
`MusialException` - If an error occurs while adding the feature to the storage.
-
replaceFeature
Replaces an existing feature in the storage with a new feature.
This method checks if the feature to be replaced exists in the storage. If the feature exists, it is replaced with the provided replacement feature. If the feature does not exist, a warning is logged, and no changes are made.
This method is used, and should only be used, for updating features during feature validation.
-
getFeature
Retrieves a feature from the storage by its unique identifier.
This method looks up a feature in the storage using its unique id. If the feature is found, it returns the corresponding `Feature` object. If no feature with the specified id exists, the method returns `null`.
Todo: It may be relevant for some applications to retrieve features by other attributes, such as "locus_tag" or "ID".
- Parameters:
`identifier` - The unique identifier of the feature to retrieve. This id typically represents the name or id of the genomic feature.
- Returns:
The `Feature` object associated with the specified id, or `null` if no such feature exists in the storage.
-
getFeatures
Retrieves an unmodifiable collection view of the features stored in the storage.
This method provides a read-only view of the features stored in the storage. The returned collection reflects the current state of the features map but cannot be modified directly. This ensures that the integrity of the underlying data structure is maintained.
- Returns:
An unmodifiable collection of `Feature` objects stored in the storage.
-
getFeatureAttributeKeys
Retrieves the set of all unique attribute keys from features in the storage.
This method iterates through all features in the `features` map, collects the keys of their attributes, and returns them as a `Set`. The use of a `Set` ensures that the returned collection contains only unique attribute keys, even if multiple features share the same attribute keys.
-
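Collecting unique attribute keys, as described for `getFeatureAttributeKeys` and `getSampleAttributeKeys`, can be sketched with a stream that flattens all key sets into one `Set`. The class `AttributeKeysSketch` and the representation of each feature as a bare attribute map are assumptions for illustration; the `TreeSet` is chosen here only to make the result order deterministic.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

public final class AttributeKeysSketch {
    /** Collects the distinct attribute keys across all given attribute maps. */
    public static Set<String> attributeKeys(Collection<Map<String, String>> attributeMaps) {
        return attributeMaps.stream()
                .flatMap(attributes -> attributes.keySet().stream())
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        // Two hypothetical features that share the "locus_tag" key.
        List<Map<String, String>> features = List.of(
                Map.of("locus_tag", "b0001", "product", "thr operon leader peptide"),
                Map.of("locus_tag", "b0002"));
        System.out.println(attributeKeys(features));
    }
}
```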
removeFeature
Removes a feature from the storage by its unique identifier.
This method deletes the feature associated with the given identifier from the `features` map. It is useful for managing the storage by allowing the removal of specific genomic features when they are no longer needed or relevant.
- Parameters:
`identifier` - The unique identifier of the feature to remove. This id typically represents the name or id of the genomic feature.
-
hasSample
Checks if a sample is present in the storage by its unique identifier.
This method verifies whether a sample, identified by the given id, exists in the storage's sample map. It is useful for determining the presence of a specific sample before performing operations on it.
- Parameters:
`identifier` - The unique id of the sample to check. This id typically represents the name or identifier of the biological sample.
- Returns:
`true` if the sample is present in the storage; `false` otherwise.
-
addSample
Adds a sample to the storage if it does not already exist.
This method checks whether a sample with the given identifier is already present in the storage. If the sample does not exist, it creates a new `Sample` object, associates it with the storage, and adds it to the `samples` map. The new `Sample` object is initialized with the current number of contigs and features in the storage. If the sample already exists, the method does nothing.
- Parameters:
`sampleIdentifier` - The unique identifier of the sample to add. This typically represents the name or ID of the biological sample.
-
getSample
Retrieves a sample from the storage by its unique identifier.
This method looks up a sample in the storage using its unique identifier. If the sample is found, it returns the corresponding `Sample` object. If no sample with the specified identifier exists, the method returns `null`.
- Parameters:
`identifier` - The unique identifier of the sample to retrieve. This identifier typically represents the name or ID of the biological sample.
- Returns:
The `Sample` object associated with the specified identifier, or `null` if no such sample exists in the storage.
-
getSamples
Retrieves an unmodifiable collection view of the samples stored in the storage.
This method provides a read-only view of the samples stored in the storage. The returned collection reflects the current state of the samples map but cannot be modified directly. This ensures that the integrity of the underlying data structure is maintained.
This method is useful for accessing all samples in the storage without allowing external modifications.
- Returns:
An unmodifiable collection of `Sample` objects stored in the storage.
-
getActiveSamples
Retrieves a collection of active samples from the storage.
This method filters the samples stored in the `samples` map and returns only those that are marked as active. Active samples are identified by the `active` field being set to `true`.
The returned collection is a list created from the filtered stream of samples. Modifications to the returned list do not affect the underlying storage.
- Returns:
A `Collection` of `Sample` objects that are currently active.
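The filtering described for `getActiveSamples` can be sketched as a stream over the map's values that is collected into a fresh list. The class `ActiveSamplesSketch` and its nested `Sample` type are hypothetical stand-ins; only the `active` flag is modelled.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public final class ActiveSamplesSketch {
    /** Hypothetical stand-in for Sample, reduced to a name and an active flag. */
    public static final class Sample {
        public final String name;
        public final boolean active;

        public Sample(String name, boolean active) {
            this.name = name;
            this.active = active;
        }
    }

    private final Map<String, Sample> samples = new LinkedHashMap<>();

    /** Adds a sample only if its identifier is not yet present. */
    public void addSample(String name, boolean active) {
        samples.putIfAbsent(name, new Sample(name, active));
    }

    public int size() {
        return samples.size();
    }

    /** Returns a new list of active samples; changing it does not touch the map. */
    public Collection<Sample> getActiveSamples() {
        return samples.values().stream()
                .filter(sample -> sample.active)
                .collect(Collectors.toCollection(ArrayList::new));
    }
}
```

Unlike the unmodifiable views returned by `getSamples`, the result here is a detached copy, which is why edits to it leave the storage unchanged.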
-
getSampleAttributeKeys
Retrieves the set of all unique attribute keys from samples in the storage.
This method iterates through all samples in the `samples` map, collects the keys of their attributes, and returns them as a `Set`. The use of a `Set` ensures that the returned collection contains only unique attribute keys, even if multiple samples share the same attribute keys.
-
detachSample
Detaches a sample from the storage and removes its associations with variants and alleles.
This method performs the following operations:
- Checks if the sample exists in the storage. If not, the method returns immediately.
- Detaches the sample from all variants in the contigs. If a variant is no longer associated with any sample, it is removed from the contig.
- Detaches the sample from all alleles in the features. If an allele is no longer associated with any sample, it is removed from the feature.
- Removes the sample from the `samples` map in the storage.
- Parameters:
`sampleIdentifier` - The unique identifier of the sample to be detached.
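The detach steps listed above can be sketched with a simplified occurrence map. Everything in this example is an assumption made for illustration: the class `DetachSampleSketch`, the string variant keys, and the reduction of a variant to the set of samples that carry it. Alleles on features are omitted; they would follow the same pattern.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public final class DetachSampleSketch {
    // Hypothetical simplification: variant key -> identifiers of samples carrying it.
    private final Map<String, Set<String>> variantCarriers = new LinkedHashMap<>();
    private final Set<String> samples = new LinkedHashSet<>();

    /** Registers the sample if needed and records that it carries the variant. */
    public void addVariant(String variantKey, String sampleIdentifier) {
        samples.add(sampleIdentifier);
        variantCarriers.computeIfAbsent(variantKey, key -> new LinkedHashSet<>())
                .add(sampleIdentifier);
    }

    public void detachSample(String sampleIdentifier) {
        // Step 1: unknown samples are ignored.
        if (!samples.contains(sampleIdentifier)) {
            return;
        }
        // Step 2: detach the sample from every variant, then drop orphaned variants.
        variantCarriers.values().forEach(carriers -> carriers.remove(sampleIdentifier));
        variantCarriers.values().removeIf(Set::isEmpty);
        // Final step: remove the sample itself.
        samples.remove(sampleIdentifier);
    }

    public boolean hasVariant(String variantKey) {
        return variantCarriers.containsKey(variantKey);
    }

    public boolean hasSample(String sampleIdentifier) {
        return samples.contains(sampleIdentifier);
    }
}
```

A variant shared with another sample survives the detach, while a variant carried only by the detached sample disappears together with it.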
-
addVariant
public void addVariant(Contig contig, String sampleIdentifier, int position, String reference, String alternative, Set<VariantCall> variantCalls)
Adds a variant to the specified contig in the storage.
This method ensures that the variant is in a canonical padded format. If the variant is not canonical, an `IllegalArgumentException` is thrown. If the variant does not already exist in the contig, it is created and added. The method also associates the variant with the specified sample and caches the variant for further processing.
- Parameters:
`contig` - The `Contig` object to which the variant belongs. Represents the genomic region where the variant is located.
`sampleIdentifier` - The unique identifier of the sample associated with the variant. Used to track the sample's variant calls.
`position` - The position of the variant within the contig. Represents the genomic coordinate of the variant.
`reference` - The reference allele of the variant. This is the expected sequence at the given position.
`alternative` - The alternative allele of the variant. This is the observed sequence differing from the reference.
`variantCalls` - A set of `VariantCall` objects representing the variant calls associated with the sample.
- Throws:
`IllegalArgumentException` - If the variant is not in a canonical padded format. Ensures data consistency and correctness.
-
getVariantsCount
public int getVariantsCount()
Calculates the total number of variants across all contigs in the storage.
This method iterates through all contigs stored in the `contigs` map and sums up the variant counts for each contig. The variant count for each contig is retrieved using the `Contig.getVariantsCount()` method.
- Returns:
The total number of variants across all contigs.
-
getActiveVariantsCount
public int getActiveVariantsCount()
Calculates the total number of active variants across all contigs in the storage.
This method iterates through all contigs stored in the `contigs` map and sums up the active variant counts for each contig. The active variant count for each contig is retrieved using the `Contig.getActiveVariantsCount()` method.
- Returns:
The total number of active variants across all contigs.
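Both counting methods sum a per-contig value over the `contigs` map. The sketch below shows that aggregation with a hypothetical, heavily reduced `Contig` stand-in that only tracks whether the variant at each position is active; the class `VariantCountSketch` and the `contig` helper are assumptions of the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public final class VariantCountSketch {
    /** Hypothetical stand-in for Contig: position -> whether the variant there is active. */
    public static final class Contig {
        private final Map<Integer, Boolean> variants = new LinkedHashMap<>();

        public void addVariant(int position, boolean active) {
            variants.put(position, active);
        }

        public int getVariantsCount() {
            return variants.size();
        }

        public int getActiveVariantsCount() {
            return (int) variants.values().stream().filter(Boolean::booleanValue).count();
        }
    }

    private final Map<String, Contig> contigs = new LinkedHashMap<>();

    /** Returns the contig with this identifier, creating it on first use. */
    public Contig contig(String identifier) {
        return contigs.computeIfAbsent(identifier, key -> new Contig());
    }

    /** Sums the per-contig variant counts. */
    public int getVariantsCount() {
        return contigs.values().stream().mapToInt(Contig::getVariantsCount).sum();
    }

    /** Sums the per-contig active variant counts. */
    public int getActiveVariantsCount() {
        return contigs.values().stream().mapToInt(Contig::getActiveVariantsCount).sum();
    }
}
```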
-
typeAdapter
Creates a custom `TypeAdapter` for the `Storage` class.
This method defines a custom `TypeAdapter` to handle the serialization and deserialization of `Storage` objects. The adapter uses Gson's default adapter for most operations but adds custom behavior during deserialization to initialize the transient `reference` field.
- Returns:
A `TypeAdapter` for the `Storage` class.
-