Class Contig
This class models a segment of a reference sequence, which can represent a complete genome, a plasmid, a single contig, or a scaffold. It extends the Attributable class to inherit functionality for managing attributes associated with the contig. Each instance of this class is uniquely identified by its name and contains information about its nucleotide sequence, variants, and other relevant properties.
-
Field Summary
Fields (modifier and type, field, description):
- final String name: The name or internal identifier of this contig.
- protected final String sequence: The sequence of this contig.
- sequenceCache: Cache to store the sequence of a contig for a specific start and end position.
- protected final TreeMap<Integer, Map<String, VariantInformation>> variants: Hierarchical map structure to store variants located on this contig.
-
Constructor Summary
Constructors -
Method Summary
Methods (modifier and type, method, description):
- getSequence(): Retrieves the nucleotide sequence of this contig or an empty string if no sequence is stored.
- getSubsequence(int start, int end): Retrieves a subsequence of this contig, caching the result to optimize performance.
- getVariantInformation(int position, String alternativeBases): Retrieves the VariantInformation associated with a specific variant on this contig.
- getVariants(): Retrieves all variants located on this contig.
- getVariantsByAlleles(Feature feature, Set<String> alleleUids): Retrieves variants associated with specific alleles of a feature.
- getVariantsByLocation(int start, int end): Retrieves variants located within a specified range on this contig.
- getVariantsBySample(String sampleName): Retrieves variants associated with a specific sample.
- getVariantsBySampleAndLocation(String sampleName, int start, int end): Retrieves variants associated with a specific sample within a specified location range.
- int getVariantsCount(): Calculates the total number of variants located on this contig.
- getVariantsEffects(List<htsjdk.samtools.util.Tuple<Integer, String>> variants): Extracts and aggregates the SnpEff effects from a list of variants.
- boolean hasSequence(): Checks if this contig has an associated nucleotide sequence.
Methods inherited from class datastructure.Attributable
addAttributeIfAbsent, addAttributesIfAbsent, attributesAsString, attributesAsString, clearAttributes, extendAttribute, extendAttributes, getAttribute, getAttributeAsCollection, getAttributes, hasAttribute, hasAttributes, removeAttribute, setAttribute, setAttributes
-
Field Details
-
name
The name or internal identifier of this contig.
This field uniquely identifies the contig within the context of the application. It is a final field, meaning its value is immutable once assigned during the construction of the Contig instance.
-
sequence
The sequence of this contig.
This field stores the nucleotide sequence of the contig. The sequence is expected to be stored as a GZIP-compressed string to optimize storage. It may be empty or null if no sequence is available for the contig.
Note: The sequence is not validated against the variants stored in the variants map.
-
variants
Hierarchical map structure to store variants located on this contig.
This map organizes variants in a hierarchical structure:
- The first level (key: Integer) represents the position of the variant on the contig.
- The second level (key: String) represents the alternative variant content (e.g., alternate alleles).
- The third level (value: VariantInformation) contains additional information about the variant, such as associations with SequenceTypes or Samples.
This structure allows storage and retrieval of variant data, enabling queries by position, alternative content, and associated metadata.
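The nested layout described above can be sketched with standard collections. This is a minimal illustration, not the class's actual code: `VariantMapSketch`, `Info`, `add`, and `get` are hypothetical names, and `Info` is only a stand-in for the real VariantInformation type.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class VariantMapSketch {
    // Hypothetical stand-in for VariantInformation; it only tracks sample names.
    static final class Info {
        final Set<String> samples = new HashSet<>();
    }

    // Level 1: position (sorted), level 2: alternative bases, level 3: information.
    final TreeMap<Integer, Map<String, Info>> variants = new TreeMap<>();

    // Registers a sample for the variant at (position, alt), creating levels on demand.
    Info add(int position, String alt, String sample) {
        Info info = variants
                .computeIfAbsent(position, p -> new HashMap<>())
                .computeIfAbsent(alt, a -> new Info());
        info.samples.add(sample);
        return info;
    }

    // Returns the stored information, or null if no such variant exists.
    Info get(int position, String alt) {
        Map<String, Info> byAlt = variants.get(position);
        return byAlt == null ? null : byAlt.get(alt);
    }

    public static void main(String[] args) {
        VariantMapSketch sketch = new VariantMapSketch();
        sketch.add(10, "A", "sample1");
        sketch.add(10, "T", "sample2");
        System.out.println(sketch.get(10, "A").samples);
    }
}
```

A sorted outer map keeps positions in order, which is what makes range queries by location cheap.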
-
sequenceCache
Cache to store the sequence of a contig for a specific start and end position.
This field is a transient HashMap used to cache subsequences of the contig's sequence. The keys in the map are Tuple objects representing the start and end positions of the subsequence, and the values are the corresponding subsequences as String.
The cache is transient because it is not intended to be serialized, as it is dynamically populated during runtime to optimize performance by avoiding redundant sequence decompression or retrieval.
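A range-keyed cache of this kind might look like the following sketch. It assumes an uncompressed in-memory sequence and uses `Map.Entry` as a stand-in for the htsjdk Tuple key; `SubsequenceCacheSketch`, `slice`, and `misses` are hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class SubsequenceCacheSketch {
    private final String sequence;
    // Transient: the cache is rebuilt at runtime and would not be serialized.
    private final transient Map<Map.Entry<Integer, Integer>, String> cache = new HashMap<>();
    // Counts how often a slice actually had to be computed.
    int misses = 0;

    SubsequenceCacheSketch(String sequence) {
        this.sequence = sequence;
    }

    // Returns sequence[from, to) using Java's 0-based indices, computing it at most once per range.
    String slice(int from, int to) {
        return cache.computeIfAbsent(Map.entry(from, to), key -> {
            misses++;
            return sequence.substring(from, to);
        });
    }

    public static void main(String[] args) {
        SubsequenceCacheSketch sketch = new SubsequenceCacheSketch("ACGTACGT");
        sketch.slice(0, 4);
        sketch.slice(0, 4); // served from the cache
        System.out.println(sketch.misses);
    }
}
```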
-
-
Constructor Details
-
Contig
Constructs a new Contig instance with the specified name and sequence.
This constructor initializes a contig with its name and nucleotide sequence. It also initializes the variants map to store variant information and the sequenceCache map to cache subsequences for optimized retrieval. The sequence is expected to be stored as a GZIP-compressed string to reduce storage requirements.
Parameters:
name - The name or identifier of the contig.
sequence - The nucleotide sequence of the contig, stored as a GZIP-compressed string.
-
-
Method Details
-
hasSequence
public boolean hasSequence()
Checks if this contig has an associated nucleotide sequence.
This method determines whether the contig has a stored sequence by checking if the sequence field is not empty. A non-empty sequence indicates that the contig has an associated nucleotide sequence.
Returns:
true if the contig has a sequence (i.e., the sequence length is not zero), false otherwise.
-
getSequence
Retrieves the nucleotide sequence of this contig or an empty string if no sequence is stored.
This method decompresses the GZIP-compressed sequence stored in the sequence field and returns it as a string. If no sequence is stored, it returns an empty string.
Returns:
The decompressed nucleotide sequence of this contig, or an empty string if no sequence is stored.
Throws:
IOException - If an error occurs during the decompression of the sequence.
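The sketch below shows one way a sequence could be round-tripped through a GZIP-compressed string. The page does not say how the compressed bytes are encoded as a String, so the Base64 encoding used here is an assumption, as are the names `GzipStringSketch`, `compress`, and `decompress`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipStringSketch {
    // Compresses text with GZIP and encodes the bytes as Base64 (assumed encoding).
    static String compress(String plain) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(plain.getBytes(StandardCharsets.UTF_8));
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Reverses compress(); a missing sequence yields an empty string.
    static String decompress(String compressed) throws IOException {
        if (compressed == null || compressed.isEmpty()) {
            return "";
        }
        byte[] raw = Base64.getDecoder().decode(compressed);
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(raw))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String stored = compress("ACGTACGTACGT");
        System.out.println(decompress(stored));
    }
}
```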
-
getSubsequence
Retrieves a subsequence of this contig, caching the result to optimize performance.
This method extracts a subsequence from the nucleotide sequence of the contig based on the specified start and end positions. The subsequence is cached to avoid redundant decompression and substring operations for the same range. If the subsequence is already cached, it is retrieved directly from the cache. Otherwise, it is computed, stored in the cache, and returned.
The start and end positions are 1-based indices, meaning the first nucleotide in the sequence is at position 1. If no sequence is stored for the contig, the method returns an empty string.
Parameters:
start - The 1-based indexed start position of the subsequence (inclusive).
end - The 1-based indexed end position of the subsequence (exclusive).
Returns:
The subsequence of this contig, or an empty string if no sequence is stored.
Throws:
IOException - If an error occurs during the decompression of the sequence.
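With a 1-based inclusive start and a 1-based exclusive end, both positions shift down by one when mapped to Java's 0-based `String.substring`. The following sketch illustrates that conversion on an already decompressed sequence; `SubsequenceIndexSketch` and `subsequence` are hypothetical names, and the actual method may handle bounds differently.

```java
class SubsequenceIndexSketch {
    // start: 1-based inclusive, end: 1-based exclusive (as documented above).
    static String subsequence(String sequence, int start, int end) {
        if (sequence == null || sequence.isEmpty()) {
            return ""; // no sequence stored
        }
        // Shift both bounds by one to obtain 0-based substring indices.
        return sequence.substring(start - 1, end - 1);
    }

    public static void main(String[] args) {
        // Positions 1, 2 and 3 of "ACGTAC".
        System.out.println(subsequence("ACGTAC", 1, 4)); // prints ACG
    }
}
```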
-
getVariantsCount
public int getVariantsCount()
Calculates the total number of variants located on this contig.
This method iterates through the hierarchical map of variants stored in the variants field. It computes the total count by summing up the sizes of all inner maps, where each inner map represents the alternative sequences for a specific position on the contig.
Returns:
The total number of variants located on this contig.
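Summing the inner map sizes can be sketched as follows. `VariantCountSketch` and `count` are hypothetical names, and the value type is left generic because only the map sizes matter here.

```java
import java.util.Map;
import java.util.TreeMap;

class VariantCountSketch {
    // Each inner map holds the alternatives at one position; the total is the sum of their sizes.
    static <V> int count(Map<Integer, Map<String, V>> variants) {
        return variants.values().stream().mapToInt(Map::size).sum();
    }

    public static void main(String[] args) {
        TreeMap<Integer, Map<String, String>> variants = new TreeMap<>();
        variants.put(5, Map.of("A", "info", "T", "info"));
        variants.put(9, Map.of("G", "info"));
        System.out.println(count(variants)); // prints 3
    }
}
```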
-
getVariantInformation
Retrieves the VariantInformation associated with a specific variant on this contig.
This method accesses the hierarchical map of variants to retrieve the VariantInformation for a variant located at the specified position with the given alternative bases. The returned VariantInformation contains details about the variant, including its occurrences in samples and features, as well as any associated attributes.
Parameters:
position - The 1-based position of the variant on the contig.
alternativeBases - The alternative base sequence of the variant.
Returns:
The VariantInformation associated with the specified variant, or null if no such variant exists at the given position with the specified alternative bases.
-
getVariantsEffects
Extracts and aggregates the SnpEff effects from a list of variants.
This method processes a list of Tuple objects, where each tuple contains:
- The position of the variant (field a of the tuple).
- The alternate allele of the variant (field b of the tuple).
For each variant, the method retrieves the associated VariantInformation using the position and alternate allele. It then checks if the Constants.EFFECTS attribute is present. If the attribute is found, its value (a comma-separated string of effects) is split into individual effects, which are trimmed and aggregated into a Set to ensure uniqueness.
Parameters:
variants - A list of Tuple objects representing the variants. Each tuple contains:
- The position of the variant.
- The alternate allele sequence.
Returns:
A Set of unique SnpEff effects extracted from the variants.
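The split, trim, and deduplicate step can be sketched in isolation. The example assumes the effect attribute values have already been looked up and are passed in as plain strings; `EffectsSketch` and `aggregate` are hypothetical names, and skipping blank tokens is an extra safeguard not stated in the description.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class EffectsSketch {
    // Each input is one comma-separated effects attribute value; null means "attribute absent".
    static Set<String> aggregate(List<String> effectAttributes) {
        Set<String> effects = new LinkedHashSet<>(); // a set keeps each effect once
        for (String attribute : effectAttributes) {
            if (attribute == null) {
                continue; // variant without an effects attribute
            }
            for (String effect : attribute.split(",")) {
                String trimmed = effect.trim();
                if (!trimmed.isEmpty()) { // assumption: blank tokens are ignored
                    effects.add(trimmed);
                }
            }
        }
        return effects;
    }

    public static void main(String[] args) {
        System.out.println(aggregate(List.of(
                "missense_variant, synonymous_variant",
                "synonymous_variant,stop_gained")));
    }
}
```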
-
getVariants
Retrieves all variants located on this contig.
This method iterates through the hierarchical variants map, which organizes variants by their positions and alternative sequences. For each variant, it creates a Tuple containing:
- The position of the variant on the contig.
- The alternative base sequence of the variant.
Returns:
An ArrayList of Tuple objects, where each tuple contains the position and alternative base sequence of a variant.
-
getVariantsByLocation
public ArrayList<htsjdk.samtools.util.Tuple<Integer,String>> getVariantsByLocation(int start, int end)
Retrieves variants located within a specified range on this contig.
This method filters the variants map to identify variants that fall within the specified start and end positions. For each variant in the range, it creates a Tuple containing:
- The position of the variant on the contig.
- The alternative base sequence of the variant.
Parameters:
start - The 1-based indexed inclusive start position of the range.
end - The 1-based indexed inclusive end position of the range.
Returns:
An ArrayList of Tuple objects, where each tuple contains the position and alternative base sequence of a variant within the specified range.
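Because the outer map is a TreeMap, an inclusive range query can be expressed with `subMap`. The sketch below is an assumption about how such a filter could be written, not the method's actual body; `RangeQuerySketch` and `byLocation` are hypothetical names, and `Map.Entry` stands in for the htsjdk Tuple.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class RangeQuerySketch {
    // Collects (position, alternative bases) pairs for all positions in [start, end].
    // Note: TreeMap.subMap throws IllegalArgumentException if start > end.
    static <V> List<Map.Entry<Integer, String>> byLocation(
            TreeMap<Integer, Map<String, V>> variants, int start, int end) {
        List<Map.Entry<Integer, String>> hits = new ArrayList<>();
        variants.subMap(start, true, end, true).forEach((position, byAlt) -> {
            for (String alt : byAlt.keySet()) {
                hits.add(Map.entry(position, alt));
            }
        });
        return hits;
    }

    public static void main(String[] args) {
        TreeMap<Integer, Map<String, String>> variants = new TreeMap<>();
        variants.put(5, Map.of("A", "info"));
        variants.put(10, Map.of("T", "info"));
        variants.put(15, Map.of("G", "info"));
        System.out.println(byLocation(variants, 5, 10));
    }
}
```

The `true` flags make both bounds inclusive, matching the parameter description.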
-
getVariantsByAlleles
public ArrayList<htsjdk.samtools.util.Tuple<Integer,String>> getVariantsByAlleles(Feature feature, Set<String> alleleUids)
Retrieves variants associated with specific alleles of a feature.
This method filters the variants map to identify variants that are associated with the specified feature and alleles. It first retrieves a sub-map of variants within the location range of the feature and then filters the inner maps to find variants matching the alleles. Each matching variant is represented as a Tuple containing:
- The position of the variant on the contig.
- The alternative base sequence of the variant.
Parameters:
feature - The feature to filter variants by.
alleleUids - A set of allele unique identifiers to filter variants by.
Returns:
An ArrayList of Tuple objects, where each tuple contains the position and alternative base sequence of a variant associated with the specified feature and alleles within the location range of the feature.
-
getVariantsBySample
Retrieves variants associated with a specific sample.
This method filters the variants map to identify variants that are associated with the specified sample. For each position in the map, it checks the inner map of alternative sequences and retrieves the first variant that matches the sample. Each matching variant is represented as a Tuple containing:
- The position of the variant on the contig.
- The alternative base sequence of the variant.
Parameters:
sampleName - The name of the sample to filter variants by.
Returns:
An ArrayList of Tuple objects, where each tuple contains the position and alternative base sequence of a variant associated with the specified sample.
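The "first matching variant per position" behavior can be sketched as below. This is an illustration under assumptions: each variant is reduced to the set of sample names it occurs in (a stand-in for VariantInformation), `Map.Entry` replaces the htsjdk Tuple, and `SampleFilterSketch` and `bySample` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class SampleFilterSketch {
    // Inner map: alternative bases -> names of the samples carrying that variant.
    static List<Map.Entry<Integer, String>> bySample(
            TreeMap<Integer, Map<String, Set<String>>> variants, String sampleName) {
        List<Map.Entry<Integer, String>> hits = new ArrayList<>();
        for (Map.Entry<Integer, Map<String, Set<String>>> site : variants.entrySet()) {
            for (Map.Entry<String, Set<String>> alt : site.getValue().entrySet()) {
                if (alt.getValue().contains(sampleName)) {
                    hits.add(Map.entry(site.getKey(), alt.getKey()));
                    break; // keep only the first match at this position
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeMap<Integer, Map<String, Set<String>>> variants = new TreeMap<>();
        variants.put(5, Map.of("A", Set.of("sample1")));
        variants.put(9, Map.of("T", Set.of("sample2")));
        System.out.println(bySample(variants, "sample1"));
    }
}
```

Restricting the outer loop to `variants.subMap(start, true, end, true)` would give the sample-and-location form of the same filter.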
-
getVariantsBySampleAndLocation
public ArrayList<htsjdk.samtools.util.Tuple<Integer,String>> getVariantsBySampleAndLocation(String sampleName, int start, int end)
Retrieves variants associated with a specific sample within a specified location range.
This method filters the variants map to identify variants that are associated with the specified sample and fall within the given start and end positions. It first retrieves a sub-map of variants within the location range and then filters the inner maps to find variants matching the sample. Each matching variant is represented as a Tuple containing:
- The position of the variant on the contig.
- The alternative base sequence of the variant.
Parameters:
sampleName - The name of the sample to filter variants by.
start - The 1-based indexed inclusive start position of the location range.
end - The 1-based indexed inclusive end position of the location range.
Returns:
An ArrayList of Tuple objects, where each tuple contains the position and alternative base sequence of a variant associated with the specified sample and within the specified location range.
-