Package datastructure

Class Storage

java.lang.Object
datastructure.Storage

public class Storage extends Object
The `Storage` class is a central component of the project, designed to manage genomic data, including contigs, features, and samples.

It provides methods for adding, retrieving, and processing genomic information, as well as handling variant calls and annotations. The class is structured to support efficient storage and manipulation of data, leveraging Java collections and utility classes. It integrates external tools like SnpEff for annotation and ensures data integrity through validation and compliance with Sequence Ontology rules.

Core Data Structures:

  • Contigs: Stored in a `Map<String, Contig>`, contigs represent chromosomes or plasmids. Each contig can store its sequence and associated variants.
  • Features: Stored in a `Map<String, Feature>`, features represent genomic elements like genes or mRNA. These are validated and processed using Sequence Ontology (SO) terms.
  • Samples: Stored in a `Map<String, Sample>`, samples represent variant calls from distinct biological samples. Metadata and variant calls are associated with each sample.

Key Functionalities:

  • Adding and Managing Contigs: Methods like `addContig` and `addContigIfAbsent` allow adding contigs with sequences or as placeholders.
  • Handling Features: Features are processed and validated using Sequence Ontology terms, with methods like `transferFeatureInformation` and `validateFeatures`.
  • Managing Samples and Variants: Samples are added using `addSample`, and variant calls are managed with methods like `addVariantCallToSample`.
  • Variant Processing and Annotation: The `updateVariants` method processes VCF files to extract variant information, while `runSnpEffAnalysis` integrates with SnpEff to annotate novel variants.
  • Field Details

    • SO

      public static final Map<String,Integer> SO
      Map of sequence ontology (SO) terms and their respective hierarchy levels as used by MUSIAL. TODO: Optional extension to support UTRs, etc.?
  • Method Details

    • getSOTerms

      public static Collection<String> getSOTerms(int level)
      Retrieve a collection of sequence ontology terms for a given level.
      Parameters:
      level - The level to retrieve the sequence ontology terms for.
      Returns:
      A collection of sequence ontology terms for the specified level.
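A level-based lookup of this kind can be sketched as a filter over a term-to-level map. This is a minimal illustration, not the MUSIAL implementation: the class name `SoTermsSketch` and the terms and levels placed in the map are assumptions, since the actual contents of `SO` are not listed in this documentation.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class SoTermsSketch {
    // Hypothetical term-to-level mapping; the real entries of Storage.SO are not shown here.
    static final Map<String, Integer> SO = new LinkedHashMap<>();

    static {
        SO.put("gene", 0);
        SO.put("mRNA", 1);
        SO.put("CDS", 2);
        SO.put("exon", 2);
    }

    // Collect every term whose level equals the requested one, in insertion order.
    static Collection<String> getSOTerms(int level) {
        return SO.entrySet().stream()
                .filter(entry -> entry.getValue() == level)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(getSOTerms(2));
    }
}
```

An unknown level simply yields an empty collection in this sketch; whether the real method behaves the same way is not stated here.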
    • hasReference

      public boolean hasReference()
      Checks whether a reference sequence is set.
      Returns:
      true if a reference sequence is set, false otherwise.
    • minimumCoverage

      public double minimumCoverage()
      Retrieves the minimum coverage parameter.
      Returns:
      The minimum coverage parameter.
    • minimumFrequency

      public double minimumFrequency()
      Retrieves the minimum frequency parameter.
      Returns:
      The minimum frequency parameter.
    • storeFiltered

      public boolean storeFiltered()
      Retrieves whether filtered calls should be retained as ambiguous bases (N).
      Returns:
      true if filtered calls are retained as ambiguous bases, false otherwise.
    • skipSnpEff

      public boolean skipSnpEff()
      Checks if the SnpEff annotation process should be skipped.
      Returns:
      true if the SnpEff annotation process is skipped, false otherwise.
    • runProteoformInference

      public boolean runProteoformInference()
      Checks if proteoform inference should be run.
      Returns:
      true if proteoform inference should be run, false otherwise.
    • isPositionExcluded

      public boolean isPositionExcluded(String contig, int position)
      Checks whether a position on a contig is excluded from analysis.
      Parameters:
      contig - Contig (name) to check for exclusion.
      position - Position to check for exclusion.
      Returns:
      True if position on contig is excluded from analysis.
    • isVariantExcluded

      public boolean isVariantExcluded(String contig, int position, String reference, String variant)
      Checks whether a variant at a position on a contig is excluded from analysis.
      Parameters:
      contig - Contig (name) to check for exclusion.
      position - Position to check for exclusion.
      reference - Reference base at position.
      variant - Variant base at position.
      Returns:
      True if variant on contig at position is excluded from analysis.
    • getProcessedGenotypes

      public long getProcessedGenotypes()
      Retrieves the number of processed genotype records.

      This method returns the value of the `processedGenotypes` field from the `VcfHandler` class. The field keeps track of the total number of genotype records that have been processed during the analysis of VCF files.

      Returns:
      The total number of processed genotype records as a long.
    • addContig

      public void addContig(String name, String sequence) throws IOException
      Adds a contig to the storage with the specified name and sequence.

      This method compresses the provided sequence using GZIP and calculates its length. If the sequence is null or empty, it assigns an empty string as the compressed sequence and sets the length to 0. The contig is then added to the storage with its attributes.

      Parameters:
      name - The name of the contig to add.
      sequence - The sequence of the contig. Can be null or empty.
      Throws:
      IOException - If an error occurs during sequence compression.
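The compression step described for `addContig` can be sketched with the JDK's `java.util.zip` classes. The sketch assumes the GZIP bytes are kept as a Base64 string, which this documentation does not confirm. The class and method names are hypothetical, and I/O errors are wrapped in `UncheckedIOException` only to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class ContigCompressionSketch {
    // GZIP the sequence and encode the bytes as Base64 (the encoding is an assumption).
    // A null or empty sequence maps to an empty string, as described for addContig.
    static String compress(String sequence) {
        if (sequence == null || sequence.isEmpty()) {
            return "";
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(sequence.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }

    // Reverse of compress: Base64 decode, then gunzip.
    static String decompress(String compressed) {
        if (compressed == null || compressed.isEmpty()) {
            return "";
        }
        byte[] bytes = Base64.getDecoder().decode(compressed);
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Length is taken from the uncompressed sequence; placeholders have length 0.
    static int length(String sequence) {
        return sequence == null ? 0 : sequence.length();
    }

    public static void main(String[] args) {
        String stored = compress("ACGTACGTACGT");
        System.out.println(length("ACGTACGTACGT") + " bases, round trip ok: "
                + decompress(stored).equals("ACGTACGTACGT"));
    }
}
```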
    • getContig

      public Contig getContig(String name)
      Retrieves a contig by its name.

      This method searches for a contig in the storage by its name. If the contig exists, it returns the corresponding Contig object. If the contig does not exist, it returns null.

      Parameters:
      name - The name of the contig to retrieve.
      Returns:
      The Contig object associated with the specified name, or null if no such contig exists.
    • hasContig

      public boolean hasContig(String name)
      Query whether a contig is stored in this instance by its name.
      Parameters:
      name - The name of the contig.
      Returns:
      True if a contig is stored for name.
    • hasMissingContigSequences

      public boolean hasMissingContigSequences()
      Checks whether any contig in the storage lacks a sequence.

      This method iterates through all contigs in the storage and reports whether at least one of them has an empty sequence.

      Returns:
      true if any contig has an empty sequence, false otherwise.
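The return contract (true as soon as one contig has no sequence) corresponds to an `anyMatch` over the stored contigs. The sketch below uses a plain name-to-sequence map as a hypothetical stand-in for the `Contig` objects, with an empty string marking a placeholder contig.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MissingSequenceSketch {
    // Hypothetical stand-in: contig name -> sequence ("" marks a placeholder contig).
    static boolean hasMissingContigSequences(Map<String, String> contigs) {
        return contigs.values().stream()
                .anyMatch(sequence -> sequence == null || sequence.isEmpty());
    }

    public static void main(String[] args) {
        Map<String, String> contigs = new LinkedHashMap<>();
        contigs.put("chr1", "ACGT");
        System.out.println(hasMissingContigSequences(contigs));
        contigs.put("plasmid1", "");
        System.out.println(hasMissingContigSequences(contigs));
    }
}
```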
    • getContigs

      public Collection<Contig> getContigs()
      Returns a collection view of the contigs stored in the storage.
      Returns:
      Collection of stored contigs.
    • getVariantsCount

      public long getVariantsCount()
      Returns the total number of variants stored.
      Returns:
      The total number of variants stored in Contig.variants, summed across all contigs.
    • hasNovelVariants

      public boolean hasNovelVariants()
      Checks if there are any novel variants stored in the storage.

      This method verifies whether the `novelVariants` list contains any entries. Novel variants are those that have been identified during variant call processing but are not yet annotated or processed further.

      Returns:
      true if there are novel variants in the storage, false otherwise.
    • getFeature

      public Feature getFeature(String name)
      Retrieve a Feature by its name.
      Parameters:
      name - The name of the feature.
      Returns:
      The queried feature or null, if no feature is stored with name.
    • getFeatures

      public Collection<Feature> getFeatures()
      Returns a collection view of the features stored in the storage.
      Returns:
      Collection of features.
    • hasFeature

      public boolean hasFeature(String name)
      Query whether a feature is stored in this instance by its name.
      Parameters:
      name - The name of the feature.
      Returns:
      True if a feature is stored for name.
    • getSample

      public Sample getSample(String name)
      Retrieve a Sample by its name.
      Parameters:
      name - The name of the sample.
      Returns:
      The queried sample.
    • getSamples

      public Collection<Sample> getSamples()
      Returns a collection view of the samples stored in the storage.
      Returns:
      Collection of samples.
    • getSamplesToUpdate

      public Collection<Sample> getSamplesToUpdate()
      Retrieves a collection of samples that need to be updated based on the presence of variant records.

      This method filters the samples stored in the `samples` map and returns only those samples whose names are present as keys in the `vcfAnalysis.records` map. These samples are considered to have associated variant records and require updates.

      Returns:
      A collection of Sample objects that need to be updated.
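The filtering described above amounts to keeping the samples whose names are also keys of the records map. A minimal sketch follows, with both maps modelled generically: the helper name `samplesToUpdate` and the value types are assumptions, not the actual `Sample` or record classes.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class SamplesToUpdateSketch {
    // Keep only the samples whose name also appears as a key in the records map.
    static <S, R> List<S> samplesToUpdate(Map<String, S> samples, Map<String, R> records) {
        return samples.entrySet().stream()
                .filter(entry -> records.containsKey(entry.getKey()))
                .map(Map.Entry::getValue)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> samples = new LinkedHashMap<>();
        samples.put("s1", "Sample(s1)");
        samples.put("s2", "Sample(s2)");
        // Hypothetical records map: only s2 has variant records from the last VCF analysis.
        Map<String, Integer> records = Map.of("s2", 42);
        System.out.println(samplesToUpdate(samples, records));
    }
}
```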
    • hasSample

      public boolean hasSample(String name)
      Query whether a sample is stored in this instance by its name.
      Parameters:
      name - The name of the sample.
      Returns:
      True if a sample is stored for name.
    • updateVariants

      public void updateVariants() throws IOException
      Updates the variants in the storage by processing VCF files and transferring the relevant information.

      This method performs the following steps:

      • Clears the existing variant records in the Storage.VcfHandler.
      • Analyzes each VCF file in the vcfFiles list to extract variant information.
      • Clears the vcfFiles list after processing.
      • Transfers sample information from the processed VCF records to the storage.
      • Transfers sample attributes from the sampleInfo map to the corresponding samples in the storage.
      • Transfers variant information from the sample variant calls to the storage.
      Throws:
      IOException - If an error occurs while analyzing VCF files or transferring data.
    • annotateVariants

      public void annotateVariants() throws IOException, MusialException
      Annotates novel variants in the storage using SnpEff.

      This method invokes the SnpEff analysis process to annotate novel variants stored in the `Storage` instance. It delegates the annotation task to the `runSnpEffAnalysis` method of the `VcfHandler` class. The results of the annotation are integrated back into the storage.

      Throws:
      IOException - If an error occurs during file operations required for the SnpEff analysis.
      MusialException - If the SnpEff annotation process encounters an error.
    • updateSequenceTypes

      public void updateSequenceTypes() throws IOException, MusialException
      Updates sequence types for all samples and features in the storage.

      This method iterates through all samples that need to be updated and all features in the storage. For each feature, it retrieves the associated contig and filters the variants for the sample within the feature's start and end positions. The filtered variants are reduced to a map containing the variant positions and their corresponding alternative allele base strings.

      If the filtered variants are not empty, the method updates the allele for the feature using the contig, variants, and sample. If the feature is coding and the contig has a sequence, the proteoform for the feature is also updated.

      Throws:
      IOException - If an error occurs during sequence processing.
      MusialException - If an error occurs during allele or proteoform updates.
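The per-feature filtering step can be illustrated with a sorted map and a range view. The sketch rests on assumptions that this documentation does not confirm: variants are modelled as position -> alternative allele -> set of sample names, the feature bounds are treated as inclusive, and the helper name is hypothetical.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

class FeatureVariantsSketch {
    // Reduce the contig's variants to position -> alternative allele for one sample,
    // restricted to the feature's [start, end] range (inclusive bounds are an assumption).
    static Map<Integer, String> variantsInFeature(
            NavigableMap<Integer, Map<String, Set<String>>> variants,
            int start, int end, String sample) {
        Map<Integer, String> result = new TreeMap<>();
        variants.subMap(start, true, end, true).forEach((position, alternatives) ->
                alternatives.forEach((alternative, samples) -> {
                    if (samples.contains(sample)) {
                        result.put(position, alternative);
                    }
                }));
        return result;
    }

    public static void main(String[] args) {
        NavigableMap<Integer, Map<String, Set<String>>> variants = new TreeMap<>();
        variants.put(5, Map.of("T", Set.of("s1")));
        variants.put(12, Map.of("G", Set.of("s1", "s2")));
        variants.put(20, Map.of("A", Set.of("s2")));
        // Feature spanning positions 10 to 30, queried for sample s1.
        System.out.println(variantsInFeature(variants, 10, 30, "s1"));
    }
}
```

An empty result corresponds to the case in which no allele update is made for that feature and sample.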
    • updateStatistics

      public void updateStatistics()
      Updates various statistics for samples, contigs, and features in the storage.

      This method performs the following tasks:

      • Calculates and updates statistics for each sample, including total calls, filtered calls, mean coverage, and mean quality.
      • Determines the frequency of disrupted and modified proteoforms for coding features in each sample.
      • Updates variant frequency for each contig and aggregates substitution and indel counts per sample.
      • Calculates and updates allelic frequencies, reference frequencies, and proteoform statistics for each feature.
    • removeSampleOccurrence

      public void removeSampleOccurrence(Collection<Sample> samples)
      Removes the occurrences of the specified samples from the storage.

      This method iterates through all contigs and samples provided in the input collection. For each sample, it removes the sample's occurrences from the contig's variants and the feature's alleles. If a variant or allele no longer has any occurrences after removal, it is deleted from the storage.

      Parameters:
      samples - A collection of Sample objects whose occurrences are to be removed.
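The remove-then-prune behaviour can be sketched on a nested map. The model is hypothetical (position -> alternative allele -> set of sample names) and covers variants only; the description suggests feature alleles are pruned in the same way.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RemoveOccurrenceSketch {
    // Drop the given sample names from every occurrence set, then delete
    // alternatives and positions that are left without any occurrence.
    static void removeOccurrences(Map<Integer, Map<String, Set<String>>> variants,
                                  Collection<String> sampleNames) {
        for (Map<String, Set<String>> alternatives : variants.values()) {
            alternatives.values().forEach(occurrence -> occurrence.removeAll(sampleNames));
            alternatives.values().removeIf(Set::isEmpty);
        }
        variants.values().removeIf(Map::isEmpty);
    }

    public static void main(String[] args) {
        Map<Integer, Map<String, Set<String>>> variants = new HashMap<>();
        Map<String, Set<String>> atSeven = new HashMap<>();
        atSeven.put("T", new HashSet<>(List.of("s1")));
        atSeven.put("G", new HashSet<>(List.of("s1", "s2")));
        variants.put(7, atSeven);
        removeOccurrences(variants, List.of("s1"));
        // Only the alternative still observed in s2 remains.
        System.out.println(variants);
    }
}
```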
    • transferSampleInformation

      public void transferSampleInformation()
      Transfers sample information from variant records to the storage.

      This method processes variant records stored in Storage.VcfHandler.records and updates the storage with variant calls for each sample, contig, and position. It calculates the total depth of coverage (DP), determines the best allele based on phred-scaled likelihoods (PL) or allele depth (AD), and builds a variant call string. The method also handles exclusions for low frequency, low coverage, and specific variants, and skips passing reference calls.
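The allele selection and filtering described above can be illustrated as follows. The sketch makes several assumptions that are not confirmed by this documentation: calls are treated as haploid, so PL holds one value per allele and the lowest value marks the most likely allele; AD is used as a fallback; DP is taken as the sum of AD; the thresholds correspond to `minimumCoverage` and `minimumFrequency`. All helper names are hypothetical.

```java
class BestAlleleSketch {
    // Pick the allele index with the lowest PL if PL is available,
    // otherwise the allele with the highest allele depth (AD).
    static int bestAllele(int[] pl, int[] ad) {
        int best = 0;
        if (pl != null && pl.length > 0) {
            for (int i = 1; i < pl.length; i++) {
                if (pl[i] < pl[best]) {
                    best = i;
                }
            }
            return best;
        }
        for (int i = 1; i < ad.length; i++) {
            if (ad[i] > ad[best]) {
                best = i;
            }
        }
        return best;
    }

    // A call passes if total depth reaches the coverage threshold and the
    // chosen allele's share of reads reaches the frequency threshold.
    static boolean passes(int[] ad, int best, double minCoverage, double minFrequency) {
        int depth = 0;
        for (int count : ad) {
            depth += count;
        }
        if (depth == 0 || depth < minCoverage) {
            return false;
        }
        return (double) ad[best] / depth >= minFrequency;
    }

    public static void main(String[] args) {
        int[] pl = {60, 0, 45};
        int[] ad = {1, 18, 1};
        int best = bestAllele(pl, ad);
        System.out.println("best allele index: " + best
                + ", passes filters: " + passes(ad, best, 5, 0.65));
    }
}
```

In this model, a call that fails either threshold would be the kind of filtered call that `storeFiltered` allows to be kept as an ambiguous base (N).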

    • transferVariantsInformation

      public void transferVariantsInformation()
      Transfers variant information from sample variant calls to the storage.

      This method processes variant calls for each sample and contig, resolving conflicts and handling deletions, insertions, and mixed InDels. It ensures that variants are stored in a canonical format and accounts for the effects of upstream deletions on downstream variants. Variants are added to the contig's variant map, and warnings are logged for conflicts or unhandled cases.