Class Storage
The `Storage` class provides methods for adding, retrieving, and processing genomic information, as well as for handling variant calls and annotations. It is structured for efficient storage and manipulation of data using Java collections and utility classes. It integrates external tools such as SnpEff for annotation and validates stored data against Sequence Ontology rules.
Core Data Structures:
- Contigs: Stored in a `Map<String, Contig>`, contigs represent chromosomes or plasmids. Each contig can store its sequence and associated variants.
- Features: Stored in a `Map<String, Feature>`, features represent genomic elements like genes or mRNA. These are validated and processed using Sequence Ontology (SO) terms.
- Samples: Stored in a `Map<String, Sample>`, samples represent variant calls from distinct biological samples. Metadata and variant calls are associated with each sample.
- Adding and Managing Contigs: Methods like `addContig` and `addContigIfAbsent` allow adding contigs with sequences or as placeholders.
- Handling Features: Features are processed and validated using Sequence Ontology terms, with methods like `transferFeatureInformation` and `validateFeatures`.
- Managing Samples and Variants: Samples are added using `addSample`, and variant calls are managed with methods like `addVariantCallToSample`.
- Variant Processing and Annotation: The `updateVariants` method processes VCF files to extract variant information, while `runSnpEffAnalysis` integrates with SnpEff to annotate novel variants.
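As a rough illustration of the map-backed layout described above, the following sketch models only the contig map. The class `StorageSketch`, its nested `Contig` stand-in, and the choice that `addContig` overwrites an existing entry are assumptions made for this example; they are not taken from the actual `Storage` implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of a name-keyed contig store.
class StorageSketch {

    // Stand-in for the real Contig type: a name plus an optional sequence.
    static final class Contig {
        final String name;
        final String sequence;

        Contig(String name, String sequence) {
            this.name = name;
            this.sequence = sequence;
        }
    }

    // Contigs keyed by name; features and samples would follow the same pattern.
    final Map<String, Contig> contigs = new LinkedHashMap<>();

    // Adds a contig with a sequence (assumption: an existing entry is replaced).
    void addContig(String name, String sequence) {
        contigs.put(name, new Contig(name, sequence == null ? "" : sequence));
    }

    // Adds an empty placeholder only if no contig with this name exists yet.
    void addContigIfAbsent(String name) {
        contigs.putIfAbsent(name, new Contig(name, ""));
    }

    boolean hasContig(String name) {
        return contigs.containsKey(name);
    }

    // Returns null if no contig is stored under the name.
    Contig getContig(String name) {
        return contigs.get(name);
    }
}
```

Keying each collection by name keeps lookups such as `hasContig` and `getContig` constant-time on average, which matches the name-based queries listed in the method summary.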
-
Nested Class Summary
Nested Classes
- `static final class`: Factory class for creating and managing instances of `Storage`.
-
Field Summary
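Fields
- `SO`: Map of sequence ontology (SO) terms and their respective hierarchy levels as used by MUSIAL.
-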
Method Summary
Modifier and Type, Method, Description:
- `void` `addContig`: Adds a contig to the storage with the specified name and sequence.
- `void` `annotateVariants`: Annotates novel variants in the storage using SnpEff.
- `getContig`: Retrieves a contig by its name.
- `getContigs`: Returns a collection view of the contigs stored in the storage.
- `getFeature(String name)`: Retrieves a `Feature` by its name.
- `getFeatures`: Returns a collection view of the features stored in the storage.
- `long` `getProcessedGenotypes()`: Retrieves the number of processed genotype records.
- `getSample`: Retrieves a `Sample` by its name.
- `getSamples`: Returns a collection view of the samples stored in the storage.
- `getSamplesToUpdate`: Retrieves a collection of samples that need to be updated based on the presence of variant records.
- `static Collection<String>` `getSOTerms(int level)`: Retrieves a collection of sequence ontology terms for a given level.
- `long` `getVariantsCount()`: Returns the total number of variants stored.
- `boolean` `hasContig`: Queries whether a contig is stored in this instance by its name.
- `boolean` `hasFeature(String name)`: Queries whether a feature is stored in this instance by its name.
- `boolean` `hasMissingContigSequences()`: Checks whether any contig in the storage has an empty sequence.
- `boolean` `hasNovelVariants()`: Checks if there are any novel variants stored in the storage.
- `boolean` `hasReference()`: Returns whether `reference` is set.
- `boolean` `hasSample`: Queries whether a sample is stored in this instance by its name.
- `boolean` `isPositionExcluded(String contig, int position)`: Whether `position` is excluded on `contig`.
- `boolean` `isVariantExcluded(String contig, int position, String reference, String variant)`: Whether `variant` is excluded on `contig` at `position`.
- `double` `minimumCoverage()`: Retrieves the minimum coverage parameter.
- `double` `minimumFrequency()`: Retrieves the minimum frequency parameter.
- `void` `removeSampleOccurrence(Collection<Sample> samples)`: Removes the occurrences of the specified samples from the storage.
- `boolean` `runProteoformInference()`: Checks if proteoform inference should be run.
- `boolean` `skipSnpEff()`: Checks if the SnpEff annotation process should be skipped.
- `boolean` `storeFiltered()`: Retrieves whether filtered calls should be retained as ambiguous bases (N).
- `void` `transferSampleInformation()`: Transfers sample information from variant records to the storage.
- `void` `transferVariantsInformation()`: Transfers variant information from sample variant calls to the storage.
- `void` `updateSequenceTypes`: Updates sequence types for all samples and features in the storage.
- `void` `updateStatistics()`: Updates various statistics for samples, contigs, and features in the storage.
- `void` `updateVariants`: Updates the variants in the storage by processing VCF files and transferring the relevant information.
-
Field Details
-
SO
Map of sequence ontology (SO) terms and their respective hierarchy levels as used by MUSIAL. TODO: Optional extension to support UTRs, etc.?
-
-
Method Details
-
getSOTerms
Retrieves a collection of sequence ontology terms for a given level.
- Parameters:
`level` - The level to retrieve the sequence ontology terms for.
- Returns:
- A collection of sequence ontology terms for the specified level.
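A level-based lookup over a term-to-level map, as described above, could look roughly like the following. The terms and level numbers in this sketch are placeholders chosen for illustration; the actual contents of the `SO` field are not listed on this page.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of a term-to-level map with a per-level lookup.
class SoTermsSketch {

    // Placeholder assignments (assumption): the real hierarchy may differ.
    static final Map<String, Integer> SO = new LinkedHashMap<>();

    static {
        SO.put("gene", 1);
        SO.put("mRNA", 2);
        SO.put("CDS", 3);
    }

    // Collects all terms whose assigned level equals the requested level.
    static Collection<String> getSOTerms(int level) {
        return SO.entrySet().stream()
                .filter(entry -> entry.getValue() == level)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

An unknown level simply yields an empty collection in this sketch, so callers do not need a null check.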
-
hasReference
public boolean hasReference()
Returns whether `reference` is set.
- Returns:
- `true` if `reference` is set, `false` otherwise.
-
minimumCoverage
public double minimumCoverage()
Retrieves the minimum coverage parameter.
- Returns:
- The minimum coverage parameter.
-
minimumFrequency
public double minimumFrequency()
Retrieves the minimum frequency parameter.
- Returns:
- The minimum frequency parameter.
-
storeFiltered
public boolean storeFiltered()
Retrieves whether filtered calls should be retained as ambiguous bases (N).
- Returns:
- `true` if filtered calls are retained as ambiguous bases, `false` otherwise.
-
skipSnpEff
public boolean skipSnpEff()
Checks if the SnpEff annotation process should be skipped.
- Returns:
- `true` if the SnpEff annotation process is skipped, `false` otherwise.
-
runProteoformInference
public boolean runProteoformInference()
Checks if proteoform inference should be run.
- Returns:
- `true` if proteoform inference should be run, `false` otherwise.
-
isPositionExcluded
Whether `position` is excluded on `contig`.
- Parameters:
`contig` - Contig (name) to check for exclusion.
`position` - Position to check for exclusion.
- Returns:
- True if `position` on `contig` is excluded from analysis.
-
isVariantExcluded
Whether `variant` is excluded on `contig` at `position`.
- Parameters:
`contig` - Contig (name) to check for exclusion.
`position` - Position to check for exclusion.
`reference` - Reference base at `position`.
`variant` - Variant base at `position`.
- Returns:
- True if `variant` on `contig` at `position` is excluded from analysis.
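The two exclusion queries above can be pictured as set lookups keyed by contig name. The sketch below is one possible layout under stated assumptions: the class `ExclusionSketch`, the helper methods `excludePosition` and `excludeVariant`, and the `position:reference:variant` key format are all invented for this example and say nothing about how `Storage` stores its exclusions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical exclusion registry backed by per-contig sets.
class ExclusionSketch {

    // Contig name -> excluded positions (assumed layout).
    private final Map<String, Set<Integer>> excludedPositions = new HashMap<>();

    // Contig name -> excluded variants, keyed as "position:reference:variant" (assumed format).
    private final Map<String, Set<String>> excludedVariants = new HashMap<>();

    void excludePosition(String contig, int position) {
        excludedPositions.computeIfAbsent(contig, key -> new HashSet<>()).add(position);
    }

    void excludeVariant(String contig, int position, String reference, String variant) {
        excludedVariants.computeIfAbsent(contig, key -> new HashSet<>())
                .add(variantKey(position, reference, variant));
    }

    boolean isPositionExcluded(String contig, int position) {
        return excludedPositions
                .getOrDefault(contig, Collections.<Integer>emptySet())
                .contains(position);
    }

    boolean isVariantExcluded(String contig, int position, String reference, String variant) {
        return excludedVariants
                .getOrDefault(contig, Collections.<String>emptySet())
                .contains(variantKey(position, reference, variant));
    }

    private static String variantKey(int position, String reference, String variant) {
        return position + ":" + reference + ":" + variant;
    }
}
```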
-
getProcessedGenotypes
public long getProcessedGenotypes()
Retrieves the number of processed genotype records.
This method returns the value of the `processedGenotypes` field from the `VcfHandler` class. The field keeps track of the total number of genotype records that have been processed during the analysis of VCF files.
- Returns:
- The total number of processed genotype records as a `long`.
-
addContig
Adds a contig to the storage with the specified name and sequence.
This method compresses the provided sequence using GZIP and calculates its length. If the sequence is null or empty, it assigns an empty string as the compressed sequence and sets the length to 0. The contig is then added to the storage with its attributes.
- Parameters:
`name` - The name of the contig to add.
`sequence` - The sequence of the contig. Can be null or empty.
- Throws:
`IOException` - If an error occurs during sequence compression.
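The compression step described above can be sketched with the standard `java.util.zip` classes. How `Storage` turns the compressed bytes into a string is not stated on this page; the Base64 encoding used here, the class name `SequenceCompressionSketch`, and the `decompress` helper are assumptions for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: GZIP-compress a sequence into a printable string.
class SequenceCompressionSketch {

    // Null or empty input maps to an empty string, mirroring the documented behavior.
    static String compress(String sequence) throws IOException {
        if (sequence == null || sequence.isEmpty()) {
            return "";
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the GZIP stream flushes the trailer before the bytes are read.
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(sequence.getBytes(StandardCharsets.UTF_8));
        }
        // Assumption: Base64 is used to represent the compressed bytes as text.
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }

    // Inverse of compress, included to show the round trip.
    static String decompress(String compressed) throws IOException {
        if (compressed == null || compressed.isEmpty()) {
            return "";
        }
        byte[] bytes = Base64.getDecoder().decode(compressed);
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // Length of the uncompressed sequence; 0 for null or empty input.
    static int length(String sequence) {
        return sequence == null ? 0 : sequence.length();
    }
}
```

The length is taken from the uncompressed sequence in this sketch, so it stays meaningful even though only the compressed form would be kept.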
-
getContig
Retrieves a contig by its name.
This method searches for a contig in the storage by its name. If the contig exists, it returns the corresponding `Contig` object. If the contig does not exist, it returns `null`.
- Parameters:
`name` - The name of the contig to retrieve.
- Returns:
- The `Contig` object associated with the specified name, or `null` if no such contig exists.
-
hasContig
Queries whether a contig is stored in this instance by its name.
- Parameters:
`name` - The name of the contig.
- Returns:
- True if a contig is stored for `name`.
-
hasMissingContigSequences
public boolean hasMissingContigSequences()
Checks whether any contig in the storage has an empty sequence.
This method iterates through all contigs in the storage and reports whether at least one of them has an empty sequence.
- Returns:
- `true` if any contig has an empty sequence, `false` otherwise.
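Because the return value is `true` when a sequence is missing, the check is an "any match" rather than an "all match". A minimal sketch, assuming contig sequences are available as a name-to-sequence map (the class `MissingSequenceSketch` and that map layout are hypothetical):

```java
import java.util.Map;

// Hypothetical check over a name -> sequence map.
class MissingSequenceSketch {

    // True as soon as one contig has a null or empty sequence.
    static boolean hasMissingContigSequences(Map<String, String> contigSequences) {
        return contigSequences.values().stream()
                .anyMatch(sequence -> sequence == null || sequence.isEmpty());
    }
}
```

With no contigs at all, this sketch returns `false`, since there is nothing that could be missing.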
-
getContigs
Returns a collection view of the contigs stored in the storage.
- Returns:
- Collection of stored `contigs`.
-
getVariantsCount
public long getVariantsCount()
Returns the total number of variants stored.
- Returns:
- Number of all stored `Contig.variants` across all `contigs`.
-
hasNovelVariants
public boolean hasNovelVariants()
Checks if there are any novel variants stored in the storage.
This method verifies whether the `novelVariants` list contains any entries. Novel variants are those that have been identified during variant call processing but are not yet annotated or processed further.
- Returns:
- `true` if there are novel variants in the storage, `false` otherwise.
-
getFeature
Retrieves a `Feature` by its name.
- Parameters:
`name` - The name of the feature.
- Returns:
- The queried feature, or `null` if no feature is stored with `name`.
-
getFeatures
Returns a collection view of the features stored in the storage.
- Returns:
- Collection of `features`.
-
hasFeature
Queries whether a feature is stored in this instance by its name.
- Parameters:
`name` - The name of the feature.
- Returns:
- True if a feature is stored for `name`.
-
getSample
Retrieves a `Sample` by its name.
- Parameters:
`name` - The name of the sample.
- Returns:
- The queried sample.
-
getSamples
Returns a collection view of the samples stored in the storage.
- Returns:
- Collection of `samples`.
-
getSamplesToUpdate
Retrieves a collection of samples that need to be updated based on the presence of variant records.
This method filters the samples stored in the `samples` map and returns only those samples whose names are present as keys in the `vcfAnalysis.records` map. These samples are considered to have associated variant records and require updates.
- Returns:
- A collection of `Sample` objects that need to be updated.
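The filter described above amounts to keeping every sample whose name also appears as a key in the records map. The following generic sketch shows that idea; the class `SamplesToUpdateSketch` and the use of type parameters in place of the real `Sample` and record types are assumptions for illustration.

```java
import java.util.Collection;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical key-intersection filter over two name-keyed maps.
class SamplesToUpdateSketch {

    // Keeps the samples whose names occur as keys in the records map.
    static <S> Collection<S> samplesToUpdate(Map<String, S> samples, Map<String, ?> records) {
        return samples.entrySet().stream()
                .filter(entry -> records.containsKey(entry.getKey()))
                .map(Map.Entry::getValue)
                .collect(Collectors.toList());
    }
}
```

Records whose key matches no stored sample are ignored by this sketch rather than reported.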
-
hasSample
Queries whether a sample is stored in this instance by its name.
- Parameters:
`name` - The name of the sample.
- Returns:
- True if a sample is stored for `name`.
-
updateVariants
Updates the variants in the storage by processing VCF files and transferring the relevant information.
This method performs the following steps:
- Clears the existing variant records in the `Storage.VcfHandler`.
- Analyzes each VCF file in the `vcfFiles` list to extract variant information.
- Clears the `vcfFiles` list after processing.
- Transfers sample information from the processed VCF records to the storage.
- Transfers sample attributes from the `sampleInfo` map to the corresponding samples in the storage.
- Transfers variant information from the sample variant calls to the storage.
- Throws:
`IOException` - If an error occurs while analyzing VCF files or transferring data.
-
annotateVariants
Annotates novel variants in the storage using SnpEff.
This method invokes the SnpEff analysis process to annotate novel variants stored in the `Storage` instance. It delegates the annotation task to the `runSnpEffAnalysis` method of the `VcfHandler` class. The results of the annotation are integrated back into the storage.
- Throws:
`IOException` - If an error occurs during file operations required for the SnpEff analysis.
`MusialException` - If the SnpEff annotation process encounters an error.
-
updateSequenceTypes
Updates sequence types for all samples and features in the storage.
This method iterates through all samples that need to be updated and all features in the storage. For each feature, it retrieves the associated contig and filters the variants for the sample within the feature's start and end positions. The filtered variants are reduced to a map containing the variant positions and their corresponding alternative allele base strings.
If the filtered variants are not empty, the method updates the allele for the feature using the contig, variants, and sample. If the feature is coding and the contig has a sequence, the proteoform for the feature is also updated.
- Throws:
`IOException` - If an error occurs during sequence processing.
`MusialException` - If an error occurs during allele or proteoform updates.
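The filtering and reduction step described above (restrict a contig's variants to a feature's range, keep those carried by one sample, reduce to position and alternative allele) can be sketched with a sorted map. The nested map layout, the class `FeatureVariantsSketch`, and the inclusive treatment of both range ends are assumptions of this example, not details confirmed by the source.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical range filter over position-sorted variants.
class FeatureVariantsSketch {

    // contigVariants: position -> (alternative allele -> names of samples carrying it).
    // Returns position -> alternative allele for one sample within [start, end].
    static NavigableMap<Integer, String> variantsInFeature(
            NavigableMap<Integer, Map<String, Set<String>>> contigVariants,
            String sample, int start, int end) {
        NavigableMap<Integer, String> result = new TreeMap<>();
        // subMap gives a view of the positions inside the feature; both ends inclusive (assumption).
        contigVariants.subMap(start, true, end, true).forEach((position, alleles) ->
                alleles.forEach((alternative, carriers) -> {
                    if (carriers.contains(sample)) {
                        result.put(position, alternative);
                    }
                }));
        return result;
    }
}
```

An empty result would correspond to the case in which no allele update is needed for that sample and feature.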
-
updateStatistics
public void updateStatistics()
Updates various statistics for samples, contigs, and features in the storage.
This method performs the following tasks:
- Calculates and updates statistics for each sample, including total calls, filtered calls, mean coverage, and mean quality.
- Determines the frequency of disrupted and modified proteoforms for coding features in each sample.
- Updates variant frequency for each contig and aggregates substitution and indel counts per sample.
- Calculates and updates allelic frequencies, reference frequencies, and proteoform statistics for each feature.
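To make the per-sample statistics listed above concrete, the sketch below computes filtered-call counts, mean coverage, mean quality, and a simple frequency. The `Call` type, the class `SampleStatisticsSketch`, and the definition of frequency as occurrences divided by the total number of samples are assumptions; the source does not spell out the exact formulas.

```java
import java.util.List;

// Hypothetical per-sample statistics over a list of calls.
class SampleStatisticsSketch {

    // Stand-in for a single call: coverage, quality, and filter status.
    static final class Call {
        final int coverage;
        final double quality;
        final boolean filtered;

        Call(int coverage, double quality, boolean filtered) {
            this.coverage = coverage;
            this.quality = quality;
            this.filtered = filtered;
        }
    }

    static long filteredCalls(List<Call> calls) {
        return calls.stream().filter(call -> call.filtered).count();
    }

    // Mean over all calls; 0.0 when there are none.
    static double meanCoverage(List<Call> calls) {
        return calls.stream().mapToInt(call -> call.coverage).average().orElse(0.0);
    }

    static double meanQuality(List<Call> calls) {
        return calls.stream().mapToDouble(call -> call.quality).average().orElse(0.0);
    }

    // Assumed definition: share of samples in which a variant or allele occurs.
    static double frequency(int occurrences, int totalSamples) {
        return totalSamples == 0 ? 0.0 : (double) occurrences / totalSamples;
    }
}
```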
-
removeSampleOccurrence
Removes the occurrences of the specified samples from the storage.
This method iterates through all contigs and samples provided in the input collection. For each sample, it removes the sample's occurrences from the contig's variants and the feature's alleles. If a variant or allele no longer has any occurrences after removal, it is deleted from the storage.
- Parameters:
`samples` - A collection of `Sample` objects whose occurrences are to be removed.
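The removal-then-prune behavior described above can be illustrated on a nested map of variants. The layout (position to alternative allele to carrying sample names), the class `RemoveOccurrenceSketch`, and the use of sample names instead of `Sample` objects are assumptions made for this sketch.

```java
import java.util.Collection;
import java.util.Map;
import java.util.Set;

// Hypothetical occurrence removal with pruning of emptied entries.
class RemoveOccurrenceSketch {

    // variants: position -> (alternative allele -> names of samples carrying it).
    // All maps and sets must be mutable.
    static void removeSampleOccurrence(
            Map<Integer, Map<String, Set<String>>> variants,
            Collection<String> sampleNames) {
        for (Map<String, Set<String>> alleles : variants.values()) {
            for (Set<String> carriers : alleles.values()) {
                carriers.removeAll(sampleNames);
            }
            // Drop alleles that no sample carries any more.
            alleles.values().removeIf(Set::isEmpty);
        }
        // Drop positions that have no alleles left.
        variants.values().removeIf(Map::isEmpty);
    }
}
```

Pruning through the `values()` views removes the corresponding keys as well, which avoids a second pass to collect empty entries.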
-
transferSampleInformation
public void transferSampleInformation()
Transfers sample information from variant records to the storage.
This method processes variant records stored in `Storage.VcfHandler.records` and updates the storage with variant calls for each sample, contig, and position. It calculates the total depth of coverage (DP), determines the best allele based on phred-scaled likelihoods (PL) or allele depth (AD), and builds a variant call string. The method also handles exclusions for low frequency, low coverage, and specific variants, and skips passing reference calls.
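The allele selection described above can be sketched as follows. This is a simplified illustration under several assumptions that are not confirmed by the source: PL is treated as one value per allele (a haploid-style layout) with the lowest value marking the most likely allele, AD is used as a fallback when PL is absent, DP is taken as the sum of the AD values, and the coverage and frequency checks use hypothetical threshold parameters. The class `BestAlleleSketch` and its method names are invented for this example.

```java
// Hypothetical best-allele selection from PL or AD values.
class BestAlleleSketch {

    // Assumption: total depth is the sum of the per-allele depths.
    static int totalDepth(int[] ad) {
        int sum = 0;
        for (int depth : ad) {
            sum += depth;
        }
        return sum;
    }

    // Prefers PL (lowest value wins); falls back to AD (highest depth wins).
    // Expects a non-empty AD array whenever PL is null or empty.
    static int bestAllele(int[] pl, int[] ad) {
        int best = 0;
        if (pl != null && pl.length > 0) {
            for (int i = 1; i < pl.length; i++) {
                if (pl[i] < pl[best]) {
                    best = i;
                }
            }
            return best;
        }
        for (int i = 1; i < ad.length; i++) {
            if (ad[i] > ad[best]) {
                best = i;
            }
        }
        return best;
    }

    // True if the call meets both the coverage and the frequency threshold.
    static boolean passes(int[] ad, int best, double minimumCoverage, double minimumFrequency) {
        int depth = totalDepth(ad);
        if (depth == 0 || depth < minimumCoverage) {
            return false;
        }
        return (double) ad[best] / depth >= minimumFrequency;
    }
}
```

In this sketch index 0 stands for the reference allele, so a passing call with `best == 0` would correspond to the passing reference calls that the method is described as skipping.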
transferVariantsInformation
public void transferVariantsInformation()
Transfers variant information from sample variant calls to the storage.
This method processes variant calls for each sample and contig, resolving conflicts and handling deletions, insertions, and mixed InDels. It ensures that variants are stored in a canonical format and accounts for the effects of upstream deletions on downstream variants. Variants are added to the contig's variant map, and warnings are logged for conflicts or unhandled cases.
-