Package datastructure

Class Contig


public class Contig extends Attributable
Represents a reference sequence segment.

This class models a segment of a reference sequence, which can represent a complete genome, a plasmid, a single contig, or a scaffold. It extends the Attributable class to inherit functionality for managing attributes associated with the contig.

Each instance of this class is uniquely identified by its name and contains information about its nucleotide sequence, variants, and other relevant properties.

  • Field Details

    • name

      public final String name
      The name or internal identifier of this contig.

      This field uniquely identifies the contig within the context of the application. It is a final field, meaning its value is immutable once assigned during the construction of the Contig instance.

    • sequence

      protected final String sequence
      The sequence of this contig.

      This field stores the nucleotide sequence of the contig. The sequence is expected to be stored as a GZIP-compressed string to optimize storage. It may be empty or null if no sequence is available for the contig.

      Note: The sequence is not validated against the variants stored in the variants map.

    • variants

      protected final TreeMap<Integer,Map<String,VariantInformation>> variants
      Hierarchical map structure to store variants located on this contig.

      This map organizes variants in two nested levels:

      • The outer key (Integer) is the position of the variant on the contig.
      • The inner key (String) is the alternative variant content (e.g., an alternate allele).
      • The value (VariantInformation) holds additional information about the variant, such as its associations with SequenceTypes or Samples.

      This structure supports lookup of a variant by position and alternative content. Because the outer map is a TreeMap, positions are kept in ascending order, which also allows range queries over positions.
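
      As an illustration of the nesting described above, here is a minimal sketch. It is not the class's actual code: VariantMapSketch, add, and get are hypothetical names, and a plain String stands in for VariantInformation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the variants field; not part of the documented API.
final class VariantMapSketch {
    // Outer key: position on the contig. Inner key: alternative content.
    // Value: a String used here in place of VariantInformation.
    final TreeMap<Integer, Map<String, String>> variants = new TreeMap<>();

    // Registers a variant, creating the inner map for the position on first use.
    void add(int position, String alternative, String info) {
        variants.computeIfAbsent(position, p -> new HashMap<>()).put(alternative, info);
    }

    // Returns the stored information, or null if the position or alternative is unknown.
    String get(int position, String alternative) {
        Map<String, String> atPosition = variants.get(position);
        return atPosition == null ? null : atPosition.get(alternative);
    }
}
```

      Two alternatives at the same position share one outer entry, so a lookup needs both the position and the alternative content.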

    • sequenceCache

      protected transient HashMap<htsjdk.samtools.util.Tuple<Integer,Integer>,String> sequenceCache
      Cache of subsequences of this contig, keyed by start and end position.

      This field is a transient HashMap used to cache subsequences of the contig's sequence. The keys in the map are Tuple objects representing the start and end positions of the subsequence, and the values are the corresponding subsequences as String.

      The field is transient because the cache is not meant to be serialized: it is populated at runtime to avoid repeated decompression and substring extraction.

  • Constructor Details

    • Contig

      protected Contig(String name, String sequence)
      Constructs a new Contig instance with the specified name and sequence.

      This constructor initializes a contig with its name and nucleotide sequence. It also initializes the variants map to store variant information and the sequenceCache map to cache subsequences. The sequence is expected to be passed as a GZIP-compressed string, which reduces storage requirements.

      Parameters:
      name - The name or identifier of the contig.
      sequence - The nucleotide sequence of the contig, stored as a GZIP-compressed string.
  • Method Details

    • hasSequence

      public boolean hasSequence()
      Checks if this contig has an associated nucleotide sequence.

      This method reports whether a sequence is stored for the contig by checking that the sequence field is not empty.

      Returns:
      true if the contig has a sequence (i.e., the sequence length is not zero), false otherwise.
    • getSequence

      public String getSequence() throws IOException
      Retrieves the nucleotide sequence of this contig or an empty string if no sequence is stored.

      This method decompresses the GZIP-compressed sequence stored in the sequence field and returns it as a string. If no sequence is stored, it returns an empty string.

      Returns:
      The decompressed nucleotide sequence of this contig, or an empty string if no sequence is stored.
      Throws:
      IOException - If an error occurs during the decompression of the sequence.
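
      The documentation does not say how the compressed bytes are encoded into a String. The sketch below assumes Base64 over the GZIP bytes, which is only one possible choice. GzipStringSketch, compress, and decompress are hypothetical names used for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; the Base64 encoding is an assumption, not documented behavior.
final class GzipStringSketch {
    // Compresses plain text with GZIP and encodes the bytes as a Base64 string.
    static String compress(String plain) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(plain.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getEncoder().encodeToString(buffer.toByteArray());
    }

    // Reverses compress(); returns an empty string when nothing is stored.
    static String decompress(String compressed) throws IOException {
        if (compressed == null || compressed.isEmpty()) {
            return "";
        }
        byte[] raw = Base64.getDecoder().decode(compressed);
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(raw))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

      Corrupt GZIP data in decompress surfaces as an IOException, which matches the exception declared by getSequence.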
    • getSubsequence

      public String getSubsequence(int start, int end) throws IOException
      Retrieves a subsequence of this contig, caching the result to optimize performance.

      This method extracts a subsequence from the nucleotide sequence of the contig based on the specified start and end positions. The subsequence is cached to avoid redundant decompression and substring operations for the same range. If the subsequence is already cached, it is retrieved directly from the cache. Otherwise, it is computed, stored in the cache, and returned.

      The start and end positions are 1-based, meaning the first nucleotide in the sequence is at position 1. The start position is inclusive and the end position is exclusive. If no sequence is stored for the contig, the method returns an empty string.

      Parameters:
      start - The 1-based start position of the subsequence (inclusive).
      end - The 1-based end position of the subsequence (exclusive).
      Returns:
      The subsequence of this contig, or an empty string if no sequence is stored.
      Throws:
      IOException - If an error occurs during the decompression of the sequence.
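
      The index convention and the caching behavior can be sketched as follows. This is an illustration under assumptions, not the actual implementation: SubsequenceCacheSketch is a hypothetical class, the sequence is held uncompressed for brevity, and Map.Entry stands in for the htsjdk Tuple key.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a (start, end) keyed subsequence cache.
final class SubsequenceCacheSketch {
    private final String sequence; // kept uncompressed here to keep the sketch short
    private final Map<Map.Entry<Integer, Integer>, String> cache = new HashMap<>();
    int misses = 0; // counts how often a subsequence had to be computed

    SubsequenceCacheSketch(String sequence) {
        this.sequence = sequence;
    }

    // start is 1-based inclusive, end is 1-based exclusive.
    String getSubsequence(int start, int end) {
        if (sequence.isEmpty()) {
            return "";
        }
        Map.Entry<Integer, Integer> key = Map.entry(start, end);
        String cached = cache.get(key);
        if (cached == null) {
            misses++;
            // Shift both bounds by one to obtain Java's 0-based substring indices.
            cached = sequence.substring(start - 1, end - 1);
            cache.put(key, cached);
        }
        return cached;
    }
}
```

      With the sequence "ACGTAC", getSubsequence(2, 5) yields "CGT", and a second call with the same arguments is served from the cache.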
    • getVariantsCount

      public int getVariantsCount()
      Calculates the total number of variants located on this contig.

      This method iterates through the hierarchical map of variants stored in the variants field. It computes the total count by summing up the sizes of all inner maps, where each inner map represents the alternative sequences for a specific position on the contig.

      Returns:
      The total number of variants located on this contig.
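
      The summation described above amounts to the following sketch. VariantCountSketch and countVariants are hypothetical names, and String is used in place of VariantInformation.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of counting variants in a position -> alternative -> info map.
final class VariantCountSketch {
    static int countVariants(TreeMap<Integer, Map<String, String>> variants) {
        int total = 0;
        // Each inner map holds one entry per alternative at that position.
        for (Map<String, String> alternatives : variants.values()) {
            total += alternatives.size();
        }
        return total;
    }
}
```

      The count is therefore the number of (position, alternative) pairs, not the number of distinct positions.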
    • getVariantInformation

      public VariantInformation getVariantInformation(int position, String alternativeBases)
      Retrieves the VariantInformation associated with a specific variant on this contig.

      This method accesses the hierarchical map of variants to retrieve the VariantInformation for a variant located at the specified position with the given alternative bases. The returned VariantInformation contains details about the variant, including its occurrences in samples and features, as well as any associated attributes.

      Parameters:
      position - The 1-based position of the variant on the contig.
      alternativeBases - The alternative base sequence of the variant.
      Returns:
      The VariantInformation associated with the specified variant, or null if no such variant exists at the given position with the specified alternative bases.
    • getVariantsEffects

      public Set<String> getVariantsEffects(List<htsjdk.samtools.util.Tuple<Integer,String>> variants)
      Extracts and aggregates the SnpEff effects from a list of variants.

      This method processes a list of Tuple objects, where each tuple contains:

      • The position of the variant (field a of the tuple).
      • The alternate allele of the variant (field b of the tuple).

      For each variant, the method retrieves the associated VariantInformation using the position and alternate allele. It then checks if the Constants.EFFECTS attribute is present. If the attribute is found, its value (a comma-separated string of effects) is split into individual effects, which are trimmed and aggregated into a Set to ensure uniqueness.

      Parameters:
      variants - A list of Tuple objects representing the variants. Each tuple contains:
      • The position of the variant.
      • The alternate allele sequence.
      Returns:
      A Set of unique SnpEff effects extracted from the variants.
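
      The split-and-aggregate step can be sketched as below. EffectsSketch and collectEffects are hypothetical names; the input is a list of raw effects attribute values, with null standing for a variant that lacks the attribute. Skipping empty entries is a choice made for this sketch and is not stated in the documentation.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of merging comma-separated effect strings into a set.
final class EffectsSketch {
    static Set<String> collectEffects(List<String> effectAttributes) {
        Set<String> effects = new LinkedHashSet<>();
        for (String attribute : effectAttributes) {
            if (attribute == null) {
                continue; // variant without an effects attribute
            }
            for (String effect : attribute.split(",")) {
                String trimmed = effect.trim();
                if (!trimmed.isEmpty()) { // assumption: blank entries are ignored
                    effects.add(trimmed);
                }
            }
        }
        return effects;
    }
}
```

      Because the result is a Set, an effect reported by several variants appears only once.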
    • getVariants

      public ArrayList<htsjdk.samtools.util.Tuple<Integer,String>> getVariants()
      Retrieves all variants located on this contig.

      This method iterates through the hierarchical variants map, which organizes variants by their positions and alternative sequences. For each variant, it creates a Tuple containing:

      • The position of the variant on the contig.
      • The alternative base sequence of the variant.
      Returns:
      An ArrayList of Tuple objects, where each tuple contains the position and alternative base sequence of a variant.
    • getVariantsByLocation

      public ArrayList<htsjdk.samtools.util.Tuple<Integer,String>> getVariantsByLocation(int start, int end)
      Retrieves variants located within a specified range on this contig.

      This method filters the variants map to identify variants that fall within the specified start and end positions. For each variant in the range, it creates a Tuple containing:

      • The position of the variant on the contig.
      • The alternative base sequence of the variant.
      Parameters:
      start - The 1-based start position of the range (inclusive).
      end - The 1-based end position of the range (inclusive).
      Returns:
      An ArrayList of Tuple objects, where each tuple contains the position and alternative base sequence of a variant within the specified range.
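
      A range query with inclusive bounds on a TreeMap can be expressed with subMap. The sketch below assumes that approach; RangeQuerySketch and variantsInRange are hypothetical names, Map.Entry stands in for the htsjdk Tuple, and String stands in for VariantInformation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of an inclusive position-range query.
final class RangeQuerySketch {
    static List<Map.Entry<Integer, String>> variantsInRange(
            TreeMap<Integer, Map<String, String>> variants, int start, int end) {
        List<Map.Entry<Integer, String>> hits = new ArrayList<>();
        if (start > end) {
            return hits; // subMap throws IllegalArgumentException for an inverted range
        }
        // View of all positions p with start <= p <= end, in ascending order.
        for (Map.Entry<Integer, Map<String, String>> byPosition
                : variants.subMap(start, true, end, true).entrySet()) {
            for (String alternative : byPosition.getValue().keySet()) {
                hits.add(Map.entry(byPosition.getKey(), alternative));
            }
        }
        return hits;
    }
}
```

      The guard for start greater than end is a choice of this sketch; the documentation does not describe how the actual method handles an inverted range.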
    • getVariantsByAlleles

      public ArrayList<htsjdk.samtools.util.Tuple<Integer,String>> getVariantsByAlleles(Feature feature, Set<String> alleleUids)
      Retrieves variants associated with specific alleles of a feature.

      This method filters the variants map to identify variants that are associated with the specified feature and alleles. It first retrieves a sub-map of variants within the location range of the feature and then filters the inner maps to find variants matching the alleles. Each matching variant is represented as a Tuple containing:

      • The position of the variant on the contig.
      • The alternative base sequence of the variant.
      Parameters:
      feature - The feature to filter variants by.
      alleleUids - A set of allele unique identifiers to filter variants by.
      Returns:
      An ArrayList of Tuple objects, where each tuple contains the position and alternative base sequence of a variant associated with the specified feature and alleles within the location range of the feature.
    • getVariantsBySample

      public ArrayList<htsjdk.samtools.util.Tuple<Integer,String>> getVariantsBySample(String sampleName)
      Retrieves variants associated with a specific sample.

      This method filters the variants map to identify variants that are associated with the specified sample. For each position in the map, it checks the inner map of alternative sequences and retrieves the first variant that matches the sample. Each matching variant is represented as a Tuple containing:

      • The position of the variant on the contig.
      • The alternative base sequence of the variant.
      Parameters:
      sampleName - The name of the sample to filter variants by.
      Returns:
      An ArrayList of Tuple objects, where each tuple contains the position and alternative base sequence of a variant associated with the specified sample.
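
      The "first match per position" behavior can be sketched as follows. This is an illustration only: SampleFilterSketch and variantsOfSample are hypothetical names, a Set of sample names stands in for VariantInformation, and Map.Entry stands in for the htsjdk Tuple.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch of filtering variants by sample.
final class SampleFilterSketch {
    static List<Map.Entry<Integer, String>> variantsOfSample(
            TreeMap<Integer, Map<String, Set<String>>> variants, String sampleName) {
        List<Map.Entry<Integer, String>> hits = new ArrayList<>();
        for (Map.Entry<Integer, Map<String, Set<String>>> byPosition : variants.entrySet()) {
            for (Map.Entry<String, Set<String>> byAlternative : byPosition.getValue().entrySet()) {
                if (byAlternative.getValue().contains(sampleName)) {
                    hits.add(Map.entry(byPosition.getKey(), byAlternative.getKey()));
                    break; // keep only the first matching alternative at this position
                }
            }
        }
        return hits;
    }
}
```

      At most one tuple is produced per position, so the result never contains two alternatives from the same position for one sample.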
    • getVariantsBySampleAndLocation

      public ArrayList<htsjdk.samtools.util.Tuple<Integer,String>> getVariantsBySampleAndLocation(String sampleName, int start, int end)
      Retrieves variants associated with a specific sample within a specified location range.

      This method filters the variants map to identify variants that are associated with the specified sample and fall within the given start and end positions. It first retrieves a sub-map of variants within the location range and then filters the inner maps to find variants matching the sample. Each matching variant is represented as a Tuple containing:

      • The position of the variant on the contig.
      • The alternative base sequence of the variant.
      Parameters:
      sampleName - The name of the sample to filter variants by.
      start - The 1-based start position of the location range (inclusive).
      end - The 1-based end position of the location range (inclusive).
      Returns:
      An ArrayList of Tuple objects, where each tuple contains the position and alternative base sequence of a variant associated with the specified sample and within the specified location range.