Class Storage
This class provides methods for adding, retrieving, and processing genomic information, as well as handling variant calls and annotations. The class is structured to support efficient storage and manipulation of data, leveraging Java collections and utility classes. Operations executed on storage instances are implemented in separate classes within the op package.
Core Data Structures:
- Contigs: Stored in a `Map<String, Contig>`, contigs represent chromosomes or plasmids. Each contig can store its sequence and associated variants.
- Features: Stored in a `Map<String, Feature>`, features represent genomic elements like genes or mRNA. These are validated and processed using Sequence Ontology (SO) terms.
- Samples: Stored in a `Map<String, Sample>`, samples represent variant calls from distinct biological samples. Metadata and variant calls are associated with each sample.
-
Nested Class Summary
Nested Classes
- static final record Storage.Parameters: The parameters used for configuring the storage of variant data.
-
Field Summary
Fields
- final Storage.Parameters parameters: Static parameters used by this storage.
- SEQUENCE_ONTOLOGY_HIERARCHY: A static map defining the hierarchy levels of Sequence Ontology (SO) terms used in the model.
-
Constructor Summary
Constructors
- Storage(Storage.Parameters parameters): Constructs a new Storage instance with the specified parameters.
-
Method Summary
Methods (modifier and type, method, description)
- void addContig: Adds a contig to the storage under the specified id and sequence.
- void addFeature(String name, String contigIdentifier, Number start, Number end, char strand, String type, Map<String, String> attributes): Adds feature information to the storage.
- void addFeature(org.biojava.nbio.genome.parsers.gff.FeatureI featureI, String name, Map<String, String> attributes): Adds feature information from a FeatureI object to the storage.
- void addSample: Adds a sample to the storage if it does not already exist.
- void addVariant(Contig contig, String sampleIdentifier, int position, String reference, String alternative, Set<VariantCall> variantCalls): Adds a variant to the specified contig in the storage.
- void detachSample(String sampleIdentifier): Detaches a sample from the storage and removes its associations with variants and alleles.
- getActiveSamples: Retrieves a collection of active samples from the storage.
- int getActiveVariantsCount: Calculates the total number of active variants across all contigs in the storage.
- getContig: Retrieves a contig by its unique identifier.
- getContigs: Retrieves an unmodifiable collection view of the contigs stored in the storage.
- getFeature(String identifier): Retrieves a feature from the storage by its unique identifier.
- getFeatureAttributeKeys: Retrieves the set of all unique attribute keys from features in the storage.
- getFeatures: Retrieves an unmodifiable collection view of the features stored in the storage.
- getSample: Retrieves a sample from the storage by its unique identifier.
- getSampleAttributeKeys: Retrieves the set of all unique attribute keys from samples in the storage.
- getSamples: Retrieves an unmodifiable collection view of the samples stored in the storage.
- int getVariantsCount: Calculates the total number of variants across all contigs in the storage.
- boolean hasContig: Checks if a contig is present in the storage by its unique identifier.
- boolean hasFeature(String identifier): Query whether a feature is stored in this instance by its id.
- boolean hasReference: Checks if the reference sequence is set for the storage.
- boolean hasSample: Checks if a sample is present in the storage by its unique identifier.
- void removeFeature(String identifier): Removes a feature from the storage by its unique identifier.
- replaceFeature(Feature stored, Feature replacement): Replaces an existing feature in the storage with a new feature.
- void setReference(htsjdk.samtools.reference.IndexedFastaSequenceFile indexedFastaSequenceFile): Sets the reference sequence for the storage and populates contigs based on the reference.
- static com.google.gson.TypeAdapter<Storage> typeAdapter: Creates a custom TypeAdapter for the Storage class.
-
Field Details
-
SEQUENCE_ONTOLOGY_HIERARCHY
A static map defining the hierarchy levels of Sequence Ontology (SO) terms used in the model.
This map associates various SO terms with their respective hierarchy levels, which are used to categorize genomic features. The hierarchy levels are represented as integers, where a lower number indicates a higher-level feature (e.g., "region" at level 0) and a higher number indicates a more specific feature (e.g., "CDS" at level 3).
The map includes common SO terms such as "gene", "mRNA", "CDS", and others. It is designed to support the processing and validation of genomic features in the model. Additional terms like UTRs can be optionally added in the future.
Example:
- "region" is assigned level 0, representing the highest-level feature.
- "gene" and "pseudogene" are assigned level 1, representing primary genomic elements.
- "mRNA" and other RNA types are assigned level 2, representing transcripts.
- "CDS" and "exon" are assigned level 3, representing coding sequences and exons.
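The level scheme above can be illustrated with a small lookup table. This is only a sketch: the class `SoHierarchySketch`, its field `HIERARCHY`, and both helper methods are hypothetical names, and the map lists only the terms named in this section, not the full contents of `SEQUENCE_ONTOLOGY_HIERARCHY`.

```java
import java.util.Map;

final class SoHierarchySketch {
    // Hypothetical stand-in for SEQUENCE_ONTOLOGY_HIERARCHY; only terms named in the text are listed.
    static final Map<String, Integer> HIERARCHY = Map.of(
            "region", 0,
            "gene", 1,
            "pseudogene", 1,
            "mRNA", 2,
            "CDS", 3,
            "exon", 3);

    /** Returns true if the SO term has a level assigned in the map. */
    static boolean isSupported(String type) {
        return HIERARCHY.containsKey(type);
    }

    /** Returns true if 'parent' sits on a strictly higher level (lower number) than 'child'. */
    static boolean isAbove(String parent, String child) {
        Integer parentLevel = HIERARCHY.get(parent);
        Integer childLevel = HIERARCHY.get(child);
        return parentLevel != null && childLevel != null && parentLevel < childLevel;
    }
}
```

With such a table, a check like `isAbove("gene", "CDS")` is one plausible way to decide whether one feature may act as the parent of another.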
-
parameters
Static parameters used by this storage.
This field holds an instance of Storage.Parameters, which contains the configuration for the storage system. The parameters are immutable and define the behavior of the storage, such as thresholds and exclusions.
-
-
Constructor Details
-
Storage
Constructs a new Storage instance with the specified parameters.
This constructor initializes the Storage object with the provided configuration parameters. It also initializes empty containers for contigs, features, and samples using LinkedTreeMap, ensuring that the data is stored in a sorted and efficient manner.
- Parameters:
parameters - The Storage.Parameters object containing the configuration for the storage. This includes thresholds, exclusions, and other settings for managing genomic data.
-
-
Method Details
-
hasReference
public boolean hasReference()
Checks if the reference sequence is set for the storage.
This method determines whether the reference field has been initialized with a non-null value. The reference field is a transient accessor to the indexed reference sequences used in the storage. If the reference is set, the storage has access to the reference sequence for further operations.
- Returns:
true if the reference field is non-null, indicating that the reference sequence is set; false otherwise.
-
setReference
public void setReference(htsjdk.samtools.reference.IndexedFastaSequenceFile indexedFastaSequenceFile) throws IOException
Sets the reference sequence for the storage and populates contigs based on the reference.
This method assigns the provided IndexedFastaSequenceFile as the reference sequence for the storage. It clears any previously stored contigs and repopulates them based on the sequences available in the reference.
The method iterates through all sequences in the reference file, adding each sequence as a contig to the storage if it does not already exist. After processing all sequences, the reference file is reset to its initial state.
This method should only be called when creating new instances of Storage and not during deserialization!
- Parameters:
indexedFastaSequenceFile - The IndexedFastaSequenceFile instance representing the reference sequence. Must not be null.
- Throws:
IOException - If an error occurs while adding contigs to the storage.
AssertionError - If the provided indexedFastaSequenceFile is null.
-
hasContig
Checks if a contig is present in the storage by its unique identifier.
This method verifies whether a contig, identified by the given id, exists in the storage's contig map. It is useful for determining the presence of a specific contig before performing operations on it.
- Parameters:
identifier - The unique id of the contig to check. This id typically represents the name of the chromosome or plasmid.
- Returns:
true if the contig is present in the storage; false otherwise.
-
addContig
Adds a contig to the storage under the specified id and sequence.
This method adds a contig (chromosome or plasmid) to the storage, but only if no contig with the same id already exists. The contig is represented by its name (id) and its sequence.
The sequence is compressed using GZIP for efficient storage. If the provided sequence is null or empty, the method assigns an empty string as the compressed sequence and sets the length of the sequence to 0. This ensures that the contig is still added to the storage with its attributes, even if no sequence data is available.
- Parameters:
identifier - The unique identifier (id) of the contig to add. This typically represents the name of the chromosome or plasmid.
sequence - The nucleotide sequence of the contig. This can be null or empty, in which case an empty sequence is stored.
- Throws:
IOException - If an error occurs during the compression of the sequence data.
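The compression behavior described for addContig can be sketched with the JDK's GZIP streams. The class `ContigCompressionSketch` and its methods are hypothetical, and the Base64 text encoding of the compressed bytes is an assumption made so that the result fits in a String; the documentation only states that GZIP is used and that null or empty input yields an empty string.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class ContigCompressionSketch {
    /** GZIP-compresses a sequence and Base64-encodes it (assumed encoding); null or empty input yields "". */
    static String compress(String sequence) throws IOException {
        if (sequence == null || sequence.isEmpty()) {
            return "";
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the GZIP stream writes the trailer, so read the buffer only afterwards.
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(sequence.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }

    /** Reverses compress(); an empty string maps back to an empty sequence. */
    static String decompress(String compressed) throws IOException {
        if (compressed.isEmpty()) {
            return "";
        }
        byte[] raw = Base64.getDecoder().decode(compressed);
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(raw))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

A round trip through `compress` and `decompress` returns the original sequence, which is the property any such storage scheme needs to preserve.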
-
getContig
Retrieves a contig by its unique identifier.
This method looks up a contig in the storage using its unique id. If the contig is found, it returns the corresponding Contig object. If no contig with the specified id exists, the method returns null.
- Parameters:
identifier - The unique identifier of the contig to retrieve. This id typically represents the name of the chromosome or plasmid.
- Returns:
The Contig object associated with the specified id, or null if no such contig exists in the storage.
-
getContigs
Retrieves an unmodifiable collection view of the contigs stored in the storage.
This method provides a read-only view of the contigs stored in the storage. The returned collection reflects the current state of the contigs map but cannot be modified directly. This ensures that the integrity of the underlying data structure is maintained.
This method is useful for accessing all contigs in the storage without allowing external modifications.
- Returns:
An unmodifiable collection of Contig objects stored in the storage.
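The read-only view semantics can be demonstrated with `Collections.unmodifiableCollection` from the JDK. This is a sketch, not the class's actual implementation: `ReadOnlyViewSketch` is a hypothetical stand-in that stores plain strings instead of Contig objects and uses a `LinkedHashMap` as the backing map.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class ReadOnlyViewSketch {
    // Hypothetical backing map: contig identifier -> sequence.
    private final Map<String, String> contigs = new LinkedHashMap<>();

    /** Adds a contig only if the identifier is not yet present. */
    void addContig(String identifier, String sequence) {
        contigs.putIfAbsent(identifier, sequence);
    }

    /** Returns a live, read-only view: it reflects later additions but rejects modification. */
    Collection<String> getContigs() {
        return Collections.unmodifiableCollection(contigs.values());
    }
}
```

A caller holding the view sees contigs added afterwards, while calls such as `clear()` or `remove(...)` on the view throw `UnsupportedOperationException`.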
-
hasFeature
Queries whether a feature is stored in this instance under the given id.
- Parameters:
identifier - The id of the feature.
- Returns:
true if a feature is stored under the given identifier; false otherwise.
-
addFeature
public void addFeature(org.biojava.nbio.genome.parsers.gff.FeatureI featureI, String name, Map<String, String> attributes) throws MusialException
Adds feature information from a FeatureI object to the storage.
This method extracts the necessary details from the provided FeatureI object, such as the parent contig, start and end positions, strand, and type, and delegates the processing to the overloaded addFeature(String, String, Number, Number, char, String, Map) method.
- Parameters:
featureI - The FeatureI object containing the feature information to transfer.
name - The name of the feature.
attributes - A map of attributes associated with the feature.
- Throws:
MusialException - If an error occurs while adding the feature to the storage.
-
addFeature
public void addFeature(String name, String contigIdentifier, Number start, Number end, char strand, String type, Map<String, String> attributes) throws MusialException
Adds feature information to the storage.
This method processes and validates the provided feature information, including its type, location, and attributes. It checks if the feature type is supported by the Sequence Ontology (SO) map and determines a unique identifier (UID) for the feature. If a feature with the same UID already exists, it validates compatibility with the parent feature and updates the "children" attribute if applicable. Otherwise, it creates a new feature and adds it to the storage. Processed attributes are removed from the attributes map, and the remaining attributes are extended for the feature.
The method performs the following steps:
- Validates the feature type against the Sequence Ontology (SO) map.
- Validates the feature's location and compatibility with stored contigs.
- Determines a unique identifier (UID) for the feature based on its attributes.
- Checks if a feature with the same UID already exists and updates or creates the feature accordingly.
- Removes processed attributes and extends the feature's attributes with the remaining ones.
- Parameters:
name - The name of the feature.
contigIdentifier - The chromosome where the feature is located, matching the identifier of a stored contig.
start - The start position of the feature.
end - The end position of the feature.
strand - The strand of the feature ('+' or '-').
type - The type of the feature (e.g., "gene", "mRNA").
attributes - A map of attributes associated with the feature.
- Throws:
MusialException - If an error occurs while adding the feature to the storage.
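The first two validation steps (feature type and location) could look roughly like the sketch below. Everything here is an assumption for illustration: the class `FeatureValidationSketch`, the `contigLengths` lookup, the 1-based inclusive coordinate convention, the bounds check against the contig length, and the use of `IllegalArgumentException` in place of MusialException. The UID determination and attribute handling are not shown because the documentation does not specify them.

```java
import java.util.Map;
import java.util.Set;

final class FeatureValidationSketch {
    // Hypothetical subset of supported SO terms, taken from the terms named in this document.
    static final Set<String> SUPPORTED_TYPES =
            Set.of("region", "gene", "pseudogene", "mRNA", "CDS", "exon");

    /** Throws IllegalArgumentException if the type, contig, location, or strand is not acceptable. */
    static void validate(String contigIdentifier, Number start, Number end, char strand,
                         String type, Map<String, Integer> contigLengths) {
        if (!SUPPORTED_TYPES.contains(type)) {
            throw new IllegalArgumentException("Unsupported feature type: " + type);
        }
        Integer contigLength = contigLengths.get(contigIdentifier);
        if (contigLength == null) {
            throw new IllegalArgumentException("Unknown contig: " + contigIdentifier);
        }
        int from = start.intValue();
        int to = end.intValue();
        // Assumed convention: 1-based, inclusive coordinates within the contig.
        if (from < 1 || to < from || to > contigLength) {
            throw new IllegalArgumentException("Invalid location: " + from + "-" + to);
        }
        if (strand != '+' && strand != '-') {
            throw new IllegalArgumentException("Invalid strand: " + strand);
        }
    }
}
```

Failing fast before any state is touched keeps the storage unchanged when a feature is rejected.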
-
replaceFeature
Replaces an existing feature in the storage with a new feature.
This method checks if the feature to be replaced exists in the storage. If the feature exists, it is replaced with the provided replacement feature. If the feature does not exist, a warning is logged, and no changes are made.
This method is intended only for updating features during feature validation.
-
getFeature
Retrieves a feature from the storage by its unique identifier.
This method looks up a feature in the storage using its unique id. If the feature is found, it returns the corresponding Feature object. If no feature with the specified id exists, the method returns null.
Todo: It may be relevant for some applications to retrieve features by other attributes, such as "locus_tag" or "ID".
- Parameters:
identifier - The unique identifier of the feature to retrieve. This id typically represents the name or id of the genomic feature.
- Returns:
The Feature object associated with the specified id, or null if no such feature exists in the storage.
-
getFeatures
Retrieves an unmodifiable collection view of the features stored in the storage.
This method provides a read-only view of the features stored in the storage. The returned collection reflects the current state of the features map but cannot be modified directly. This ensures that the integrity of the underlying data structure is maintained.
- Returns:
An unmodifiable collection of Feature objects stored in the storage.
-
getFeatureAttributeKeys
Retrieves the set of all unique attribute keys from features in the storage.
This method iterates through all features in the features map, collects the keys of their attributes, and returns them as a Set. The use of a Set ensures that the returned collection contains only unique attribute keys, even if multiple features share the same attribute keys.
-
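Collecting unique attribute keys is a natural fit for a stream with `flatMap`. The sketch below assumes each feature's attributes are available as a `Map<String, String>`; the class `AttributeKeysSketch` and its method are hypothetical, and the choice of a `TreeSet` (sorted keys) is an assumption, since the documentation only promises a Set.

```java
import java.util.Collection;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

final class AttributeKeysSketch {
    /** Flattens the key sets of all attribute maps into one set, dropping duplicates. */
    static Set<String> attributeKeys(Collection<Map<String, String>> attributeMaps) {
        return attributeMaps.stream()
                .flatMap(attributes -> attributes.keySet().stream())
                .collect(Collectors.toCollection(TreeSet::new));
    }
}
```

The same pattern applies unchanged to sample attributes.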
removeFeature
Removes a feature from the storage by its unique identifier.
This method deletes the feature associated with the given identifier from the `features` map. It is useful for managing the storage by allowing the removal of specific genomic features when they are no longer needed or relevant.
- Parameters:
identifier - The unique identifier of the feature to remove. This id typically represents the name or id of the genomic feature.
-
hasSample
Checks if a sample is present in the storage by its unique identifier.
This method verifies whether a sample, identified by the given id, exists in the storage's sample map. It is useful for determining the presence of a specific sample before performing operations on it.
- Parameters:
identifier - The unique id of the sample to check. This id typically represents the name or identifier of the biological sample.
- Returns:
true if the sample is present in the storage; false otherwise.
-
addSample
Adds a sample to the storage if it does not already exist.
This method checks whether a sample with the given identifier is already present in the storage. If the sample does not exist, it creates a new Sample object, associates it with the storage, and adds it to the `samples` map. The new Sample object is initialized with the current number of contigs and features in the storage. If the sample already exists, the method does nothing.
- Parameters:
sampleIdentifier - The unique identifier of the sample to add. This typically represents the name or ID of the biological sample.
-
getSample
Retrieves a sample from the storage by its unique identifier.
This method looks up a sample in the storage using its unique identifier. If the sample is found, it returns the corresponding Sample object. If no sample with the specified identifier exists, the method returns null.
- Parameters:
identifier - The unique identifier of the sample to retrieve. This identifier typically represents the name or ID of the biological sample.
- Returns:
The Sample object associated with the specified identifier, or null if no such sample exists in the storage.
-
getSamples
Retrieves an unmodifiable collection view of the samples stored in the storage.
This method provides a read-only view of the samples stored in the storage. The returned collection reflects the current state of the samples map but cannot be modified directly. This ensures that the integrity of the underlying data structure is maintained.
This method is useful for accessing all samples in the storage without allowing external modifications.
- Returns:
An unmodifiable collection of Sample objects stored in the storage.
-
getActiveSamples
Retrieves a collection of active samples from the storage.
This method filters the samples stored in the `samples` map and returns only those that are marked as active. Active samples are identified by the `active` field being set to true.
The returned collection is a list created from the filtered stream of samples. Modifications to the returned list do not affect the underlying storage.
- Returns:
A Collection of Sample objects that are currently active.
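The filtering described for getActiveSamples can be sketched as a stream over the map values. The nested `Sample` record here is a hypothetical, minimal stand-in with only a name and an active flag, and `ActiveSamplesSketch` is not part of the library.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Map;
import java.util.stream.Collectors;

final class ActiveSamplesSketch {
    /** Hypothetical minimal sample: just a name and an active flag. */
    record Sample(String name, boolean active) {}

    /** Copies the active samples into a new list, so changes to the result leave the map untouched. */
    static Collection<Sample> activeSamples(Map<String, Sample> samples) {
        return samples.values().stream()
                .filter(Sample::active)
                .collect(Collectors.toCollection(ArrayList::new));
    }
}
```

Because the result is a fresh list rather than a view, clearing it has no effect on the stored samples.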
-
getSampleAttributeKeys
Retrieves the set of all unique attribute keys from samples in the storage.
This method iterates through all samples in the samples map, collects the keys of their attributes, and returns them as a Set. The use of a Set ensures that the returned collection contains only unique attribute keys, even if multiple samples share the same attribute keys.
-
detachSample
Detaches a sample from the storage and removes its associations with variants and alleles.
This method performs the following operations:
- Checks if the sample exists in the storage. If not, the method returns immediately.
- Detaches the sample from all variants in the contigs. If a variant is no longer associated with any sample, it is removed from the contig.
- Detaches the sample from all alleles in the features. If an allele is no longer associated with any sample, it is removed from the feature.
- Removes the sample from the `samples` map in the storage.
- Parameters:
sampleIdentifier - The unique identifier of the sample to be detached.
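The detach-and-prune pattern in the steps above can be sketched for the variant case; alleles in features would follow the same shape. The class `DetachSampleSketch` and its string-keyed occurrence map are hypothetical simplifications, not the library's data model.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class DetachSampleSketch {
    // Hypothetical model: variant key -> identifiers of the samples carrying that variant.
    final Map<String, Set<String>> variantOccurrence = new HashMap<>();
    final Set<String> samples = new HashSet<>();

    /** Registers that a sample carries a variant. */
    void addCall(String sampleIdentifier, String variantKey) {
        samples.add(sampleIdentifier);
        variantOccurrence.computeIfAbsent(variantKey, key -> new HashSet<>()).add(sampleIdentifier);
    }

    /** Detaches a sample and drops every variant that is left without any carrier. */
    void detachSample(String sampleIdentifier) {
        if (!samples.contains(sampleIdentifier)) {
            return; // Unknown sample: nothing to do.
        }
        variantOccurrence.values().forEach(carriers -> carriers.remove(sampleIdentifier));
        variantOccurrence.values().removeIf(Set::isEmpty);
        samples.remove(sampleIdentifier);
    }
}
```

Pruning through `values().removeIf(...)` removes the corresponding map entries, so no orphaned variants remain after the call.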
-
addVariant
public void addVariant(Contig contig, String sampleIdentifier, int position, String reference, String alternative, Set<VariantCall> variantCalls)
Adds a variant to the specified contig in the storage.
This method ensures that the variant is in a canonical padded format. If the variant is not canonical, an IllegalArgumentException is thrown. If the variant does not already exist in the contig, it is created and added. The method also associates the variant with the specified sample and caches the variant for further processing.
- Parameters:
contig - The Contig object to which the variant belongs. Represents the genomic region where the variant is located.
sampleIdentifier - The unique identifier of the sample associated with the variant. Used to track the sample's variant calls.
position - The position of the variant within the contig. Represents the genomic coordinate of the variant.
reference - The reference allele of the variant. This is the expected sequence at the given position.
alternative - The alternative allele of the variant. This is the observed sequence differing from the reference.
variantCalls - A set of VariantCall objects representing the variant calls associated with the sample.
- Throws:
IllegalArgumentException - If the variant is not in a canonical padded format. Ensures data consistency and correctness.
-
getVariantsCount
public int getVariantsCount()
Calculates the total number of variants across all contigs in the storage.
This method iterates through all contigs stored in the contigs map and sums up the variant counts for each contig. The variant count for each contig is retrieved using the Contig.getVariantsCount() method.
- Returns:
The total number of variants across all contigs.
-
getActiveVariantsCount
public int getActiveVariantsCount()
Calculates the total number of active variants across all contigs in the storage.
This method iterates through all contigs stored in the contigs map and sums up the active variant counts for each contig. The active variant count for each contig is retrieved using the Contig.getActiveVariantsCount() method.
- Returns:
The total number of active variants across all contigs.
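Both counting methods reduce to summing a per-contig counter. The sketch below uses a hypothetical `Contig` record that exposes only the two counts; the real Contig class provides getVariantsCount() and getActiveVariantsCount() as stated above, but its other members are not assumed here.

```java
import java.util.Collection;

final class VariantCountSketch {
    /** Hypothetical contig exposing only the two counters used for the totals. */
    record Contig(String name, int variantsCount, int activeVariantsCount) {}

    /** Sums the per-contig variant counts. */
    static int totalVariants(Collection<Contig> contigs) {
        return contigs.stream().mapToInt(Contig::variantsCount).sum();
    }

    /** Sums the per-contig active variant counts. */
    static int totalActiveVariants(Collection<Contig> contigs) {
        return contigs.stream().mapToInt(Contig::activeVariantsCount).sum();
    }
}
```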
-
typeAdapter
Creates a custom TypeAdapter for the Storage class.
This method defines a custom TypeAdapter to handle the serialization and deserialization of Storage objects. The adapter uses Gson's default adapter for most operations but adds custom behavior during deserialization to initialize the transient reference field.
- Returns:
A TypeAdapter for the Storage class.
-