Class Bio
This class provides static methods for sequence alignment, variant integration, sequence translation, and other related operations. It includes methods for handling nucleotide and protein sequences, as well as utilities for working with gaps and variants.
-
Nested Class Summary
Nested Classes
Modifier and Type: static final record
Description: Represents information about a single position in a reference sequence.
-
Method Summary
Each entry lists modifier and type, then method, then description.

static htsjdk.samtools.util.Tuple<String,String>
alignByCigar(String reference, String query, String cigar, int offset)
Aligns a query sequence to a reference sequence based on a CIGAR string.

static ArrayList<org.apache.commons.lang3.tuple.Triple<Integer,String,String>>
getCanonicalVariants(String reference, String alternative)
Transforms two sequences into canonical VCF variants.

static htsjdk.samtools.util.Tuple<String,String>
globalNucleotideSequenceAlignment(String sequenceA, String sequenceB, int gapOpenPenalty, int gapExtendPenalty, boolean noGapPrefix, boolean noGapSuffix, int bandWidth)
Computes optimal pairwise global nucleotide sequence alignment using a gap-affine (Gotoh) banded Needleman-Wunsch algorithm.

static htsjdk.samtools.util.Tuple<String,String>
globalProteinSequenceAlignment(String sequenceA, String sequenceB, int gapOpenPenalty, int gapExtendPenalty, boolean noGapPrefix, boolean noGapSuffix, int bandWidth)
Computes optimal pairwise global amino acid sequence alignment using a gap-affine (Gotoh) banded Needleman-Wunsch algorithm.

static String
integrateVariants(String reference, NavigableMap<Integer, String> variants, boolean excludeGaps)
Integrates variants into a reference sequence.

static String
integrateVariants(NavigableMap<Integer, Bio.ReferenceContext> reference, Map<Integer, String> variants, boolean excludeGaps)
Integrates variants into a reference sequence.

static boolean
isCanonical(String ref, String alt)
Determines whether a variant is canonical.

static boolean
isDeletion(String alt)
Determines whether a variant is a deletion based on its alternative content.

static boolean
isDeletion(String ref, String alt, boolean padded)
Determines whether a variant is a deletion, i.e., either the reference base content is a string of any length of Constants.BASE_SYMBOLS and the alternative base content is a single base of Constants.BASE_SYMBOLS followed by Constants.GAP symbols matching the reference content's length (padded canonical), or the reference base content is a string of any length of Constants.BASE_SYMBOLS and the alternative content is a single base of Constants.BASE_SYMBOLS (un-padded canonical).

static boolean
isInsertion(String alt)
Determines whether a variant is an insertion based on its alternative content.

static boolean
isInsertion(String ref, String alt, boolean padded)
Determines whether a variant is an insertion, i.e., either the alternative base content is a string of any length of Constants.BASE_SYMBOLS and the reference base content is a single base of Constants.BASE_SYMBOLS followed by Constants.GAP symbols matching the alternative content's length (padded canonical), or the reference base content is a single base of Constants.BASE_SYMBOLS and the alternative content is a string of any length of Constants.BASE_SYMBOLS (un-padded canonical).

static boolean
isPaddedCanonical(String ref, String alt)
Determines whether a variant is padded canonical.

static boolean
isSubstitution(String alt)
Determines whether a given alternative base content represents a substitution.

static boolean
isSubstitution(String ref, String alt)
Determines whether a variant is a substitution; i.e., both the reference and alternative base content match a single base of Constants.BASE_SYMBOLS.

static String
padGaps(s, length)
Pads a string with gap characters to reach a specified length.

static String
stripGaps(s)
Removes all gap characters from the input string.

static String
translateSequence(String sequence, boolean reverse)
Translates a DNA sequence into an amino-acid sequence.
-
Method Details
-
globalNucleotideSequenceAlignment
public static htsjdk.samtools.util.Tuple<String,String> globalNucleotideSequenceAlignment(String sequenceA, String sequenceB, int gapOpenPenalty, int gapExtendPenalty, boolean noGapPrefix, boolean noGapSuffix, int bandWidth)
Computes an optimal pairwise global nucleotide sequence alignment using a gap-affine (Gotoh) banded Needleman-Wunsch algorithm. A simple scoring scheme (match: +1; mismatch: -1) is used.
Parameters:
sequenceA - The first nucleotide sequence to align.
sequenceB - The second nucleotide sequence to align.
gapOpenPenalty - The penalty for opening a gap in the alignment.
gapExtendPenalty - The penalty for extending an existing gap in the alignment.
noGapPrefix - If true, prevents gaps at the beginning of the aligned sequences.
noGapSuffix - If true, prevents gaps at the end of the aligned sequences.
bandWidth - The width of the band for banded alignment; if less than or equal to 0, the full length of sequence B is used.
Returns:
A Tuple containing the aligned sequences.
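The recurrence behind a gap-affine (Gotoh) global alignment can be sketched as follows. This is a minimal, score-only illustration and not the implementation used by this class: it omits banding, traceback, and the noGapPrefix/noGapSuffix options. It assumes that a gap of length k costs gapOpen + (k - 1) * gapExtend, which may differ from the convention used here. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch; not part of the Bio class.
final class GotohSketch {
    // Stand-in for minus infinity; halved so that subtracting penalties cannot overflow.
    private static final int NEG = Integer.MIN_VALUE / 2;

    /** Optimal global alignment score with match +1, mismatch -1 and affine gap costs. */
    static int score(String a, String b, int gapOpen, int gapExtend) {
        int n = a.length(), m = b.length();
        int[][] match = new int[n + 1][m + 1];  // a[i-1] aligned to b[j-1]
        int[][] gapInB = new int[n + 1][m + 1]; // a[i-1] aligned to a gap
        int[][] gapInA = new int[n + 1][m + 1]; // b[j-1] aligned to a gap
        for (int i = 0; i <= n; i++) {
            Arrays.fill(match[i], NEG);
            Arrays.fill(gapInB[i], NEG);
            Arrays.fill(gapInA[i], NEG);
        }
        match[0][0] = 0;
        // Leading gaps: one opening plus (length - 1) extensions.
        for (int i = 1; i <= n; i++) gapInB[i][0] = -gapOpen - (i - 1) * gapExtend;
        for (int j = 1; j <= m; j++) gapInA[0][j] = -gapOpen - (j - 1) * gapExtend;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int s = a.charAt(i - 1) == b.charAt(j - 1) ? 1 : -1;
                match[i][j] = max3(match[i - 1][j - 1], gapInB[i - 1][j - 1], gapInA[i - 1][j - 1]) + s;
                // Either open a new gap or extend one that is already open.
                gapInB[i][j] = max3(match[i - 1][j] - gapOpen, gapInB[i - 1][j] - gapExtend, gapInA[i - 1][j] - gapOpen);
                gapInA[i][j] = max3(match[i][j - 1] - gapOpen, gapInA[i][j - 1] - gapExtend, gapInB[i][j - 1] - gapOpen);
            }
        }
        return max3(match[n][m], gapInB[n][m], gapInA[n][m]);
    }

    private static int max3(int x, int y, int z) {
        return Math.max(x, Math.max(y, z));
    }

    public static void main(String[] args) {
        System.out.println(score("ACGT", "AGT", 2, 1));
    }
}
```

With gapOpen = 2 and gapExtend = 1, aligning ACGT against AGT scores 1 in this sketch: three matches (+3) and one opened gap (-2).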
-
globalProteinSequenceAlignment
public static htsjdk.samtools.util.Tuple<String,String> globalProteinSequenceAlignment(String sequenceA, String sequenceB, int gapOpenPenalty, int gapExtendPenalty, boolean noGapPrefix, boolean noGapSuffix, int bandWidth)
Computes an optimal pairwise global amino acid sequence alignment using a gap-affine (Gotoh) banded Needleman-Wunsch algorithm. This method uses the BLOSUM80 scoring matrix for amino acid matches and mismatches.
Parameters:
sequenceA - The first protein sequence to align.
sequenceB - The second protein sequence to align.
gapOpenPenalty - The penalty for opening a gap in the alignment.
gapExtendPenalty - The penalty for extending an existing gap in the alignment.
noGapPrefix - If true, prevents gaps at the beginning of the aligned sequences.
noGapSuffix - If true, prevents gaps at the end of the aligned sequences.
bandWidth - The width of the band for banded alignment; if less than or equal to 0, the full length of sequence B is used.
Returns:
A Tuple containing the aligned sequences.
-
alignByCigar
public static htsjdk.samtools.util.Tuple<String,String> alignByCigar(String reference, String query, String cigar, int offset)
Aligns a query sequence to a reference sequence based on a CIGAR string. This method parses the given CIGAR string and applies the specified operations to align the reference and query sequences. The alignment considers matches, mismatches, insertions, deletions, and other operations defined in the CIGAR string.
Supported CIGAR operations:
- M: Match or mismatch
- =: Match
- X: Mismatch
- I: Insertion (adds gaps to the reference)
- D: Deletion (adds gaps to the query)
- N: Skipped region (treated as deletion)
- S: Soft clipping (skips characters in the query)
- H: Hard clipping (ignored)
Parameters:
reference - The reference sequence to align.
query - The query sequence to align.
cigar - The CIGAR string describing the alignment operations.
offset - The starting position in the reference sequence.
Returns:
A Tuple containing the aligned reference and query sequences.
Throws:
IllegalArgumentException - If the CIGAR string contains unsupported operations.
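How the listed CIGAR operations translate into two gapped strings can be sketched as below. This is an illustration under stated assumptions, not the code of alignByCigar: it assumes the gap symbol is '-', that offset is a 0-based index into the reference, and it returns a two-element array instead of a Tuple. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not part of the Bio class.
final class CigarSketch {
    private static final Pattern OP = Pattern.compile("(\\d+)([MIDNSHX=])");

    /** Returns {alignedReference, alignedQuery}. */
    static String[] align(String reference, String query, String cigar, int offset) {
        StringBuilder ref = new StringBuilder();
        StringBuilder qry = new StringBuilder();
        int r = offset; // next reference index (assumed 0-based)
        int q = 0;      // next query index
        int end = 0;    // end of the last parsed operation
        Matcher m = OP.matcher(cigar);
        while (m.find()) {
            if (m.start() != end) break; // unparsable text between operations
            end = m.end();
            int len = Integer.parseInt(m.group(1));
            switch (m.group(2).charAt(0)) {
                case 'M': case '=': case 'X': // consume both sequences
                    ref.append(reference, r, r + len);
                    qry.append(query, q, q + len);
                    r += len; q += len;
                    break;
                case 'I': // insertion: gaps in the reference
                    appendGaps(ref, len);
                    qry.append(query, q, q + len);
                    q += len;
                    break;
                case 'D': case 'N': // deletion or skipped region: gaps in the query
                    ref.append(reference, r, r + len);
                    appendGaps(qry, len);
                    r += len;
                    break;
                case 'S': // soft clipping: skip query characters
                    q += len;
                    break;
                default: // 'H': hard clipping, nothing to consume
                    break;
            }
        }
        if (end != cigar.length()) {
            throw new IllegalArgumentException("Unsupported CIGAR content: " + cigar);
        }
        return new String[] {ref.toString(), qry.toString()};
    }

    private static void appendGaps(StringBuilder sb, int n) {
        for (int i = 0; i < n; i++) sb.append('-'); // assumed gap symbol
    }

    public static void main(String[] args) {
        String[] aligned = align("ACGTACGT", "GTTAC", "2M1I2M", 2);
        System.out.println(aligned[0]);
        System.out.println(aligned[1]);
    }
}
```

For the reference ACGTACGT, the query GTTAC, the CIGAR string 2M1I2M, and offset 2, the sketch yields GT-AC for the reference and GTTAC for the query.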
-
padGaps
Pads a string with gap characters to reach a specified length. This method appends gap characters (defined by Constants.GAP) to the input string until it reaches the desired length. If the input string is already equal to or longer than the specified length, no padding is added.
Parameters:
s - The input string to be padded.
length - The desired length of the resulting string.
Returns:
The padded string, or the original string if no padding is needed.
-
stripGaps
Removes all gap characters from the input string. This method replaces all occurrences of the gap character (defined by Constants.GAP) in the input string with an empty string (defined by Constants.EMPTY).
Parameters:
s - The input string from which gaps should be removed.
Returns:
A new string with all gap characters removed.
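The behavior described for padGaps and stripGaps can be sketched in a few lines. The sketch assumes that Constants.GAP is the character '-' and Constants.EMPTY is the empty string; neither value is stated on this page, and the class name is hypothetical.

```java
// Hypothetical sketch; not part of the Bio class.
final class GapSketch {
    static final char GAP = '-'; // assumed value of Constants.GAP

    /** Appends gap characters until s has the requested length. */
    static String padGaps(String s, int length) {
        if (s.length() >= length) return s; // already long enough: unchanged
        StringBuilder sb = new StringBuilder(s);
        while (sb.length() < length) sb.append(GAP);
        return sb.toString();
    }

    /** Removes every gap character from s. */
    static String stripGaps(String s) {
        return s.replace(String.valueOf(GAP), "");
    }

    public static void main(String[] args) {
        String padded = padGaps("AC", 5);
        System.out.println(padded);
        System.out.println(stripGaps(padded));
    }
}
```

Under these assumptions, padding AC to length 5 gives AC---, and stripping the gaps again returns AC.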
-
translateSequence
Translates a DNA sequence into an amino-acid sequence. Translation is always performed in the first reading frame and uses the BioJava library.
Parameters:
sequence - The DNA sequence to translate.
reverse - Whether to translate the reverse complement of the sequence.
Returns:
The translated amino-acid sequence.
Throws:
MusialException - If an error occurs during translation.
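What first-frame translation with an optional reverse complement amounts to can be sketched without BioJava. This is an illustration only: it assumes the standard genetic code, upper-case A/C/G/T input, stop codons rendered as '*', and an incomplete trailing codon being dropped. How translateSequence itself treats those cases is not stated on this page. The class and method names are hypothetical.

```java
// Hypothetical sketch; not part of the Bio class and not based on BioJava.
final class TranslateSketch {
    // Standard genetic code, codons enumerated with base order T, C, A, G.
    private static final String BASES = "TCAG";
    private static final String AMINO =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            switch (dna.charAt(i)) {
                case 'A': sb.append('T'); break;
                case 'T': sb.append('A'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                default: throw new IllegalArgumentException("Unexpected base: " + dna.charAt(i));
            }
        }
        return sb.toString();
    }

    /** Translates in frame 1; a trailing partial codon is ignored (assumption). */
    static String translate(String sequence, boolean reverse) {
        String dna = reverse ? reverseComplement(sequence) : sequence;
        StringBuilder protein = new StringBuilder();
        for (int i = 0; i + 3 <= dna.length(); i += 3) {
            int a = BASES.indexOf(dna.charAt(i));
            int b = BASES.indexOf(dna.charAt(i + 1));
            int c = BASES.indexOf(dna.charAt(i + 2));
            if (a < 0 || b < 0 || c < 0) {
                throw new IllegalArgumentException("Unexpected codon at index " + i);
            }
            protein.append(AMINO.charAt(16 * a + 4 * b + c));
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("ATGGCCTAA", false));
        System.out.println(translate("GGCCAT", true));
    }
}
```

In this sketch ATGGCCTAA translates to MA*, and the reverse complement of GGCCAT (ATGGCC) translates to MA.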
-
getCanonicalVariants
public static ArrayList<org.apache.commons.lang3.tuple.Triple<Integer,String,String>> getCanonicalVariants(String reference, String alternative)
Transforms two sequences into canonical VCF variants. The specified reference and alternative are expected to be aligned sequences. Variants are formatted as triples of relative position, reference content, and variant content. The relative position is the 0-based position of the variant in the reference sequence without gaps.
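One way to derive such triples from two aligned sequences is sketched below. It is not the implementation of getCanonicalVariants. It assumes '-' as the gap symbol, anchors each insertion or deletion at the preceding reference base in the un-padded VCF style, rejects an indel in the first column because it has no anchor, and does not merge adjacent insertion and deletion runs. Variants are returned as position:reference>alternative strings instead of Triple objects, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not part of the Bio class.
final class CanonicalVariantSketch {
    static final char GAP = '-'; // assumed gap symbol

    static List<String> variants(String reference, String alternative) {
        if (reference.length() != alternative.length()) {
            throw new IllegalArgumentException("Sequences must be aligned to equal length");
        }
        List<String> out = new ArrayList<>();
        int pos = -1;      // 0-based index of the last reference base seen, gaps excluded
        char anchor = GAP; // last reference base seen
        int i = 0;
        while (i < reference.length()) {
            char r = reference.charAt(i);
            char a = alternative.charAt(i);
            if (r == GAP && a == GAP) { i++; continue; } // empty column
            if (r != GAP && a != GAP) {                  // aligned bases
                pos++;
                anchor = r;
                if (r != a) out.add(pos + ":" + r + ">" + a);
                i++;
                continue;
            }
            if (pos < 0) throw new IllegalArgumentException("Leading indel has no anchor base");
            boolean insertion = (r == GAP);
            StringBuilder run = new StringBuilder();
            int j = i;
            // Collect the full run of columns of the same kind.
            while (j < reference.length()
                    && (reference.charAt(j) == GAP) == insertion
                    && (alternative.charAt(j) == GAP) != insertion) {
                run.append(insertion ? alternative.charAt(j) : reference.charAt(j));
                j++;
            }
            if (insertion) {
                out.add(pos + ":" + anchor + ">" + anchor + run);
            } else {
                out.add(pos + ":" + anchor + run + ">" + anchor);
                pos += run.length(); // deleted bases still count as reference positions
                anchor = run.charAt(run.length() - 1);
            }
            i = j;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(variants("AC--GT", "ACTTGT"));
        System.out.println(variants("ACGGT", "AC--T"));
    }
}
```

For the aligned pair AC--GT / ACTTGT the sketch reports 1:C>CTT, and for ACGGT / AC--T it reports 1:CGG>C; both positions are 0-based and refer to the reference without gaps.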
-
isSubstitution
Determines whether a variant is a substitution; i.e., both the reference and alternative base content match a single base of Constants.BASE_SYMBOLS.
Parameters:
ref - The reference base content.
alt - The alternative base content.
Returns:
true if the variant is a substitution, false otherwise.
-
isSubstitution
Determines whether a given alternative base content represents a substitution. A substitution is defined as a single base from the set of valid nucleotide symbols defined in Constants.BASE_SYMBOLS.
Parameters:
alt - The alternative base content to check.
Returns:
true if the alternative content represents a substitution, false otherwise.
-
isInsertion
Determines whether a variant is an insertion, i.e.,
- either the alternative base content is a string of any length of Constants.BASE_SYMBOLS and the reference base content is a single base of Constants.BASE_SYMBOLS followed by Constants.GAP symbols matching the alternative content's length (padded canonical),
- or the reference base content is a single base of Constants.BASE_SYMBOLS and the alternative content is a string of any length of Constants.BASE_SYMBOLS (un-padded canonical).
Parameters:
ref - The reference base content.
alt - The alternative base content.
padded - Whether the variant is padded by gap symbols.
Returns:
true if the variant is an insertion, false otherwise.
-
isInsertion
Determines whether a variant is an insertion based on its alternative content. This method checks if the alternative base content represents an insertion. An insertion is defined as a string of at least two consecutive bases from the set of valid nucleotide symbols defined in Constants.BASE_SYMBOLS.
Parameters:
alt - The alternative base content to check.
Returns:
true if the alternative content represents an insertion, false otherwise.
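The two isInsertion overloads can be read as pattern checks, sketched below. The sketch assumes that Constants.BASE_SYMBOLS corresponds to the characters A, C, G, and T and that Constants.GAP is '-'; the actual values are not given on this page. It also requires at least two bases in the alternative for both overloads and does not check that the first base of ref and alt agree; both are choices of this sketch. The class name is hypothetical.

```java
// Hypothetical sketch; not part of the Bio class.
final class InsertionSketch {
    // Assumed stand-ins for Constants.BASE_SYMBOLS and Constants.GAP.
    private static final String BASE = "[ACGT]";
    private static final String GAPS = "-+";

    /** At least two consecutive bases. */
    static boolean isInsertion(String alt) {
        return alt.matches(BASE + "{2,}");
    }

    static boolean isInsertion(String ref, String alt, boolean padded) {
        if (!isInsertion(alt)) return false;
        if (padded) {
            // Padded: one base followed by gaps, as long as the alternative.
            return ref.matches(BASE + GAPS) && ref.length() == alt.length();
        }
        // Un-padded: the reference is a single base.
        return ref.matches(BASE);
    }

    public static void main(String[] args) {
        System.out.println(isInsertion("A", "ATT", false));
        System.out.println(isInsertion("A--", "ATT", true));
    }
}
```

Under these assumptions, (A, ATT) is an un-padded insertion and (A--, ATT) is its padded form, while (A-, ATT) fails the length check.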
-
isDeletion
Determines whether a variant is a deletion, i.e.,
- either the reference base content is a string of any length of Constants.BASE_SYMBOLS and the alternative base content is a single base of Constants.BASE_SYMBOLS followed by Constants.GAP symbols matching the reference content's length (padded canonical),
- or the reference base content is a string of any length of Constants.BASE_SYMBOLS and the alternative content is a single base of Constants.BASE_SYMBOLS (un-padded canonical).
Parameters:
ref - The reference base content.
alt - The alternative base content.
padded - Whether the variant is padded by gap symbols.
Returns:
true if the variant is a deletion, false otherwise.
-
isDeletion
Determines whether a variant is a deletion based on its alternative content. This method checks if the alternative base content represents a deletion. A deletion is defined as a string that starts with a valid nucleotide base (from Constants.BASE_SYMBOLS) followed by one or more gap symbols (defined in Constants.GAP).
Parameters:
alt - The alternative base content to check.
Returns:
true if the alternative content represents a deletion, false otherwise.
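The isDeletion overloads mirror the insertion case with the roles of reference and alternative swapped. The sketch below again assumes A, C, G, and T for Constants.BASE_SYMBOLS and '-' for Constants.GAP, requires at least two reference bases, and does not compare the leading bases of ref and alt. These are assumptions of the sketch, and the class name is hypothetical.

```java
// Hypothetical sketch; not part of the Bio class.
final class DeletionSketch {
    // Assumed stand-ins for Constants.BASE_SYMBOLS and Constants.GAP.
    private static final String BASE = "[ACGT]";
    private static final String GAPS = "-+";

    /** One base followed by one or more gap symbols. */
    static boolean isDeletion(String alt) {
        return alt.matches(BASE + GAPS);
    }

    static boolean isDeletion(String ref, String alt, boolean padded) {
        if (!ref.matches(BASE + "{2,}")) return false;
        if (padded) {
            // Padded: the alternative is gap-filled to the reference length.
            return isDeletion(alt) && alt.length() == ref.length();
        }
        // Un-padded: the alternative is the single retained base.
        return alt.matches(BASE);
    }

    public static void main(String[] args) {
        System.out.println(isDeletion("ATT", "A", false));
        System.out.println(isDeletion("ATT", "A--", true));
    }
}
```

Under these assumptions, (ATT, A) is an un-padded deletion and (ATT, A--) is its padded form.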
-
isCanonical
Determines whether a variant is canonical. A variant is canonical if it is:
- a single nucleotide variant (SNV) (isSubstitution(java.lang.String, java.lang.String)),
- an un-padded canonical insertion (isInsertion(java.lang.String, java.lang.String, boolean)), or
- an un-padded canonical deletion (isDeletion(java.lang.String, java.lang.String, boolean)).
Parameters:
ref - The reference base content.
alt - The alternative base content.
Returns:
true if the variant is canonical, false otherwise.
-
isPaddedCanonical
Determines whether a variant is padded canonical. A variant is padded canonical if it is:
- a single nucleotide variant (SNV) (isSubstitution(java.lang.String, java.lang.String)),
- a padded canonical insertion (isInsertion(java.lang.String, java.lang.String, boolean)), or
- a padded canonical deletion (isDeletion(java.lang.String, java.lang.String, boolean)).
Parameters:
ref - The reference base content.
alt - The alternative base content.
Returns:
true if the variant is padded canonical, false otherwise.
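How isCanonical and isPaddedCanonical differ is easiest to see on concrete ref/alt pairs. The sketch below expresses both as the three-way disjunction described above, assuming A, C, G, and T for Constants.BASE_SYMBOLS, '-' for Constants.GAP, and at least two bases on the longer side of an indel. The class name and the regular expressions are hypothetical and only approximate the documented rules.

```java
// Hypothetical sketch; not part of the Bio class.
final class CanonicalSketch {
    private static final String ONE = "[ACGT]";      // a single base (assumed alphabet)
    private static final String MANY = "[ACGT]{2,}"; // two or more bases
    private static final String PAD = "[ACGT]-+";    // a base followed by gaps

    static boolean isSubstitution(String ref, String alt) {
        return ref.matches(ONE) && alt.matches(ONE);
    }

    static boolean isCanonical(String ref, String alt) {
        return isSubstitution(ref, alt)
            || (ref.matches(ONE) && alt.matches(MANY))   // un-padded insertion
            || (ref.matches(MANY) && alt.matches(ONE));  // un-padded deletion
    }

    static boolean isPaddedCanonical(String ref, String alt) {
        boolean sameLength = ref.length() == alt.length();
        return isSubstitution(ref, alt)
            || (sameLength && ref.matches(PAD) && alt.matches(MANY))  // padded insertion
            || (sameLength && ref.matches(MANY) && alt.matches(PAD)); // padded deletion
    }

    public static void main(String[] args) {
        String[][] pairs = {{"A", "T"}, {"A", "ATT"}, {"A--", "ATT"}, {"ATT", "A--"}};
        for (String[] p : pairs) {
            System.out.println(p[0] + " / " + p[1]
                + " canonical=" + isCanonical(p[0], p[1])
                + " paddedCanonical=" + isPaddedCanonical(p[0], p[1]));
        }
    }
}
```

In this sketch a substitution such as (A, T) satisfies both predicates, (A, ATT) is canonical only, and (A--, ATT) and (ATT, A--) are padded canonical only.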
-