Package util

Class Bio


public final class Bio extends Object
Utility class for performing various sequence-related operations.

This class provides static methods for sequence alignment, variant integration, sequence translation, and other related operations. It includes methods for handling nucleotide and protein sequences, as well as utilities for working with gaps and variants.

  • Method Details

    • globalNucleotideSequenceAlignment

      public static htsjdk.samtools.util.Tuple<String,String> globalNucleotideSequenceAlignment(String sequenceA, String sequenceB, int gapOpenPenalty, int gapExtendPenalty, boolean noGapPrefix, boolean noGapSuffix, int bandWidth)
      Computes optimal pairwise global nucleotide sequence alignment using a gap-affine (Gotoh) banded Needleman-Wunsch algorithm.

      A simple scoring matrix (match: +1; mismatch: -1) is used.

      Parameters:
      sequenceA - The first nucleotide sequence to align.
      sequenceB - The second nucleotide sequence to align.
      gapOpenPenalty - The penalty for opening a gap in the alignment.
      gapExtendPenalty - The penalty for extending an existing gap in the alignment.
      noGapPrefix - If true, prevents gaps at the beginning of the aligned sequences.
      noGapSuffix - If true, prevents gaps at the end of the aligned sequences.
      bandWidth - The width of the band for banded alignment; if less than or equal to 0, the full length of sequence B is used.
      Returns:
      A Tuple containing the aligned sequences.
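      The gap-affine recurrence behind this method can be illustrated with a minimal, score-only sketch. It is not the implementation used by Bio: it omits the band, the traceback, and the noGapPrefix/noGapSuffix constraints. It also assumes that penalties are passed as positive integers and that a gap of length k costs gapOpen + (k - 1) * gapExtend; the convention actually used by Bio is not stated here. The class name GotohScoreSketch is hypothetical.

```java
import java.util.Arrays;

final class GotohScoreSketch {
    // Large negative value that cannot underflow when small penalties are subtracted.
    private static final int NEG_INF = Integer.MIN_VALUE / 2;

    private static int max3(int a, int b, int c) {
        return Math.max(a, Math.max(b, c));
    }

    /** Optimal global alignment score with match +1 and mismatch -1 (assumed conventions, see lead-in). */
    static int score(String a, String b, int gapOpen, int gapExtend) {
        int n = a.length(), m = b.length();
        int[][] M = new int[n + 1][m + 1]; // best score ending with a[i-1] aligned to b[j-1]
        int[][] X = new int[n + 1][m + 1]; // best score ending with a[i-1] aligned to a gap
        int[][] Y = new int[n + 1][m + 1]; // best score ending with b[j-1] aligned to a gap
        for (int[][] table : new int[][][]{M, X, Y}) {
            for (int[] row : table) Arrays.fill(row, NEG_INF);
        }
        M[0][0] = 0;
        for (int i = 1; i <= n; i++) X[i][0] = -gapOpen - (i - 1) * gapExtend;
        for (int j = 1; j <= m; j++) Y[0][j] = -gapOpen - (j - 1) * gapExtend;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int s = a.charAt(i - 1) == b.charAt(j - 1) ? 1 : -1;
                M[i][j] = max3(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s;
                // Either open a new gap or extend the one already running.
                X[i][j] = Math.max(Math.max(M[i - 1][j], Y[i - 1][j]) - gapOpen, X[i - 1][j] - gapExtend);
                Y[i][j] = Math.max(Math.max(M[i][j - 1], X[i][j - 1]) - gapOpen, Y[i][j - 1] - gapExtend);
            }
        }
        return max3(M[n][m], X[n][m], Y[n][m]);
    }

    public static void main(String[] args) {
        System.out.println(score("ACGT", "AGT", 2, 1)); // 3 matches, one gap of length 1: prints 1
    }
}
```

      Keeping three tables is what makes the gap cost affine: extending a gap only looks at the table that already ends in that gap.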
    • globalProteinSequenceAlignment

      public static htsjdk.samtools.util.Tuple<String,String> globalProteinSequenceAlignment(String sequenceA, String sequenceB, int gapOpenPenalty, int gapExtendPenalty, boolean noGapPrefix, boolean noGapSuffix, int bandWidth)
      Computes optimal pairwise global amino acid sequence alignment using a gap-affine (Gotoh) banded Needleman-Wunsch algorithm.

      This method uses the BLOSUM80 scoring matrix for amino acid matches and mismatches.

      Parameters:
      sequenceA - The first protein sequence to align.
      sequenceB - The second protein sequence to align.
      gapOpenPenalty - The penalty for opening a gap in the alignment.
      gapExtendPenalty - The penalty for extending an existing gap in the alignment.
      noGapPrefix - If true, prevents gaps at the beginning of the aligned sequences.
      noGapSuffix - If true, prevents gaps at the end of the aligned sequences.
      bandWidth - The width of the band for banded alignment; if less than or equal to 0, the full length of sequence B is used.
      Returns:
      A Tuple containing the aligned sequences.
    • alignByCigar

      public static htsjdk.samtools.util.Tuple<String,String> alignByCigar(String reference, String query, String cigar, int offset)
      Aligns a query sequence to a reference sequence based on a CIGAR string.

      This method parses the given CIGAR string and applies the specified operations to align the reference and query sequences. The alignment considers matches, mismatches, insertions, deletions, and other operations defined in the CIGAR string.

      Supported CIGAR operations:

      • M: Match or mismatch
      • =: Match
      • X: Mismatch
      • I: Insertion (adds gaps to the reference)
      • D: Deletion (adds gaps to the query)
      • N: Skipped region (treated as deletion)
      • S: Soft clipping (skips characters in the query)
      • H: Hard clipping (ignored)
      Parameters:
      reference - The reference sequence to align.
      query - The query sequence to align.
      cigar - The CIGAR string describing the alignment operations.
      offset - The starting position in the reference sequence.
      Returns:
      A Tuple containing the aligned reference and query sequences.
      Throws:
      IllegalArgumentException - If the CIGAR string contains unsupported operations.
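      The operation table above can be sketched as a small CIGAR interpreter. This is an illustration under assumptions, not the code of Bio.alignByCigar: the gap symbol is taken to be '-', the offset is treated as 0-based, only the aligned region is returned, and the result is a String array instead of an htsjdk Tuple. The class name CigarAlignSketch is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CigarAlignSketch {
    // One CIGAR element: a run length followed by a single operation character.
    private static final Pattern ELEMENT = Pattern.compile("(\\d+)(.)");

    /** Returns {alignedReference, alignedQuery}; the gap symbol '-' is an assumption. */
    static String[] align(String reference, String query, String cigar, int offset) {
        StringBuilder ref = new StringBuilder();
        StringBuilder qry = new StringBuilder();
        int r = offset; // next unread reference position (assumed 0-based)
        int q = 0;      // next unread query position
        Matcher m = ELEMENT.matcher(cigar);
        while (m.find()) {
            int len = Integer.parseInt(m.group(1));
            char op = m.group(2).charAt(0);
            switch (op) {
                case 'M': case '=': case 'X': // consume both sequences
                    ref.append(reference, r, r + len);
                    qry.append(query, q, q + len);
                    r += len;
                    q += len;
                    break;
                case 'I': // insertion: gaps in the reference
                    ref.append("-".repeat(len));
                    qry.append(query, q, q + len);
                    q += len;
                    break;
                case 'D': case 'N': // deletion or skipped region: gaps in the query
                    ref.append(reference, r, r + len);
                    qry.append("-".repeat(len));
                    r += len;
                    break;
                case 'S': // soft clip: skip query characters
                    q += len;
                    break;
                case 'H': // hard clip: nothing to consume
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported CIGAR operation: " + op);
            }
        }
        return new String[]{ref.toString(), qry.toString()};
    }

    public static void main(String[] args) {
        String[] aligned = align("ACGTACGT", "NNCGTAACT", "2S3M1I2M1D1M", 1);
        System.out.println(aligned[0]); // CGT-ACGT
        System.out.println(aligned[1]); // CGTAAC-T
    }
}
```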
    • padGaps

      public static String padGaps(String s, int length)
      Pads a string with gap characters to reach a specified length.

      This method appends gap characters (defined by Constants.GAP) to the input string until it reaches the desired length. If the input string is already equal to or longer than the specified length, no padding is added.

      Parameters:
      s - The input string to be padded.
      length - The desired length of the resulting string.
      Returns:
      The padded string, or the original string if no padding is needed.
    • stripGaps

      public static String stripGaps(String s)
      Removes all gap characters from the input string.

      This method replaces all occurrences of the gap character (defined by Constants.GAP) in the input string with an empty string (defined by Constants.EMPTY).

      Parameters:
      s - The input string from which gaps should be removed.
      Returns:
      A new string with all gap characters removed.
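      The behavior described for padGaps and stripGaps can be sketched in a few lines. The sketch assumes that Constants.GAP is the string "-" and Constants.EMPTY is the empty string; neither value is stated here. The class name GapSketch is hypothetical.

```java
final class GapSketch {
    static final String GAP = "-"; // assumed value of Constants.GAP
    static final String EMPTY = ""; // assumed value of Constants.EMPTY

    /** Appends gap characters until the string has the requested length. */
    static String padGaps(String s, int length) {
        if (s.length() >= length) {
            return s; // already long enough, nothing to pad
        }
        return s + GAP.repeat(length - s.length());
    }

    /** Removes every gap character from the string. */
    static String stripGaps(String s) {
        return s.replace(GAP, EMPTY);
    }

    public static void main(String[] args) {
        System.out.println(padGaps("ACG", 5));    // ACG--
        System.out.println(stripGaps("A-C--G"));  // ACG
    }
}
```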
    • integrateVariants

      public static String integrateVariants(String reference, NavigableMap<Integer,String> variants, boolean excludeGaps) throws MusialException
      Integrates variants into a reference sequence.

      This method modifies the given reference sequence by incorporating the specified variants. Variants are represented as a NavigableMap where the key is the 0-based position in the reference sequence, and the value is the alternative base sequence. Variants have to be in canonical form, i.e. they must start with a non-gap character that is assumed to be in the coordinate system of the reference sequence. If additional characters follow, these have to be all gaps (indicating a deletion) or all non-gaps (indicating an insertion).

      The method handles deletions by tracking the number of deleted downstream positions. These are either replaced by gap symbols or ignored, depending on the value specified for excludeGaps.

      Parameters:
      reference - The original reference sequence as a String.
      variants - A NavigableMap containing the variants to integrate, where the key is the position and the value is the alternative base sequence in canonical form.
      excludeGaps - A boolean indicating whether to exclude gaps from the resulting sequence.
      Returns:
      A String representing the reference sequence with the integrated variants. If excludeGaps is true, gaps are removed from the resulting sequence.
      Throws:
      IllegalArgumentException - If the reference sequence is empty.
      MusialException - If an invalid variant is encountered.
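      The integration rules for a String reference can be sketched as follows. This is a simplified illustration, not the Bio implementation: the gap symbol is assumed to be '-', invalid variants raise IllegalArgumentException because MusialException is not available in a standalone example, and variants located inside a deleted stretch are simply skipped (an assumption). The class name IntegrateVariantsSketch is hypothetical.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

final class IntegrateVariantsSketch {
    static final char GAP = '-'; // assumed gap symbol

    static String integrate(String reference, NavigableMap<Integer, String> variants, boolean excludeGaps) {
        if (reference.isEmpty()) {
            throw new IllegalArgumentException("Reference sequence is empty.");
        }
        StringBuilder out = new StringBuilder();
        int deleted = 0; // downstream positions still covered by a deletion
        for (int pos = 0; pos < reference.length(); pos++) {
            if (deleted > 0) {
                deleted--;
                if (!excludeGaps) out.append(GAP);
                continue;
            }
            String alt = variants.get(pos);
            if (alt == null) {
                out.append(reference.charAt(pos)); // no variant: keep the reference base
                continue;
            }
            if (alt.isEmpty() || alt.charAt(0) == GAP) {
                throw new IllegalArgumentException("Variant must start with a non-gap character: " + alt);
            }
            out.append(alt.charAt(0)); // first character sits at the reference position
            String tail = alt.substring(1);
            if (tail.isEmpty()) continue;
            if (tail.chars().allMatch(c -> c == GAP)) {
                deleted = tail.length();   // deletion of the following positions
            } else if (tail.indexOf(GAP) < 0) {
                out.append(tail);          // insertion after the reference position
            } else {
                throw new IllegalArgumentException("Variant mixes gaps and bases: " + alt);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        NavigableMap<Integer, String> variants = new TreeMap<>();
        variants.put(1, "T");   // substitution
        variants.put(3, "T--"); // deletion of two downstream positions
        variants.put(6, "GAA"); // insertion of two bases
        System.out.println(integrate("ACGTACGT", variants, false)); // ATGT--GAAT
        System.out.println(integrate("ACGTACGT", variants, true));  // ATGTGAAT
    }
}
```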
    • integrateVariants

      public static String integrateVariants(NavigableMap<Integer,Bio.ReferenceContext> reference, Map<Integer,String> variants, boolean excludeGaps) throws MusialException
      Integrates variants into a reference sequence.

      This method modifies the given reference sequence by incorporating the specified variants. Variants are represented as a Map where the key is the 1-based position in the reference sequence, and the value is the alternative base sequence. Variants have to be in canonical form, i.e. they must start with a non-gap character that is assumed to be in the coordinate system of the reference sequence. If additional characters follow, these have to be all gaps (indicating a deletion) or all non-gaps (indicating an insertion).

      In contrast to integrateVariants(String, NavigableMap, boolean), the reference sequence is provided as a NavigableMap where the key is the 1-based position and the value is a Bio.ReferenceContext containing the character and the maximal number of inserted bases at that position with respect to a collection of biological samples. This makes it possible to represent gaps that have been introduced by insertions in other biological samples. These gaps, as well as those induced by deletions, can be either preserved or removed from the resulting sequence, depending on the value specified for excludeGaps.

      Parameters:
      reference - A NavigableMap representing the reference sequence, where the key is the position and the value is a Bio.ReferenceContext containing the character and extension at that position.
      variants - A Map containing the variants to integrate, where the key is the position and the value is the alternative base sequence.
      excludeGaps - A boolean indicating whether to remove gaps from the resulting sequence.
      Returns:
      A String representing the reference sequence with the integrated variants. If excludeGaps is true, gaps are removed from the resulting sequence.
      Throws:
      IllegalArgumentException - If the reference sequence is empty or if an invalid variant is encountered.
      MusialException - If an invalid variant is encountered.
    • translateSequence

      public static String translateSequence(String sequence, boolean reverse) throws MusialException
      Translates a DNA sequence into an amino-acid sequence. The translation is always performed in the first reading frame and uses the BioJava library.
      Parameters:
      sequence - The DNA sequence to translate.
      reverse - Whether to translate the reverse complement of the sequence.
      Returns:
      The translated amino-acid sequence.
      Throws:
      MusialException - If an error occurs during translation.
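      To make the meaning of first-frame translation and the reverse flag concrete, here is a dependency-free sketch. It does not use BioJava and may differ from it in details: it assumes the standard genetic code, writes stop codons as '*', accepts only the bases A, C, G and T, and ignores a trailing incomplete codon. The class name TranslateSketch is hypothetical.

```java
final class TranslateSketch {
    // Standard genetic code, codons enumerated with bases in the order T, C, A, G.
    private static final String BASES = "TCAG";
    private static final String AMINO =
            "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            switch (dna.charAt(i)) {
                case 'A': sb.append('T'); break;
                case 'T': sb.append('A'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                default: throw new IllegalArgumentException("Unsupported base: " + dna.charAt(i));
            }
        }
        return sb.toString();
    }

    /** Translates codon by codon, starting at the first base (frame 1). */
    static String translate(String sequence, boolean reverse) {
        String dna = reverse ? reverseComplement(sequence) : sequence;
        StringBuilder protein = new StringBuilder();
        for (int i = 0; i + 3 <= dna.length(); i += 3) {
            int b1 = BASES.indexOf(dna.charAt(i));
            int b2 = BASES.indexOf(dna.charAt(i + 1));
            int b3 = BASES.indexOf(dna.charAt(i + 2));
            if (b1 < 0 || b2 < 0 || b3 < 0) {
                throw new IllegalArgumentException("Unsupported codon: " + dna.substring(i, i + 3));
            }
            protein.append(AMINO.charAt(16 * b1 + 4 * b2 + b3));
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("ATGGCCTAA", false)); // MA*
        System.out.println(translate("TTAGGCCAT", true));  // MA*
    }
}
```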
    • getCanonicalVariants

      public static ArrayList<org.apache.commons.lang3.tuple.Triple<Integer,String,String>> getCanonicalVariants(String reference, String alternative)
      Transforms two sequences into canonical VCF variants.

      The specified reference and alternative are expected to be aligned sequences. Each variant is reported as a triple of relative position, reference content, and variant content. The relative position is the 0-based position of the variant in the reference sequence without gaps.

      Parameters:
      reference - String representation of the reference sequence.
      alternative - String representation of the variant/alternative sequence.
      Returns:
      ArrayList containing the derived variants; see the method description for format details.
    • isSubstitution

      public static boolean isSubstitution(String ref, String alt)
      Determines whether a variant is a substitution; i.e., both the reference and alternative base content match a single base of Constants.BASE_SYMBOLS.
      Parameters:
      ref - The reference base content.
      alt - The alternative base content.
      Returns:
      true if the variant is a substitution, false otherwise.
    • isSubstitution

      public static boolean isSubstitution(String alt)
      Determines whether a given alternative base content represents a substitution.

      A substitution is defined as a single base from the set of valid nucleotide symbols defined in Constants.BASE_SYMBOLS.

      Parameters:
      alt - The alternative base content to check.
      Returns:
      true if the alternative content represents a substitution, false otherwise.
    • isInsertion

      public static boolean isInsertion(String ref, String alt, boolean padded)
      Determines whether a variant is an insertion.
      Parameters:
      ref - The reference base content.
      alt - The alternative base content.
      padded - Whether the variant is padded by gap symbols.
      Returns:
      true if the variant is an insertion, false otherwise.
    • isInsertion

      public static boolean isInsertion(String alt)
      Determines whether a variant is an insertion based on its alternative content.

      This method checks if the alternative base content represents an insertion. An insertion is defined as a string of at least two consecutive bases from the set of valid nucleotide symbols defined in Constants.BASE_SYMBOLS.

      Parameters:
      alt - The alternative base content to check.
      Returns:
      true if the alternative content represents an insertion, false otherwise.
    • isDeletion

      public static boolean isDeletion(String ref, String alt, boolean padded)
      Determines whether a variant is a deletion.
      Parameters:
      ref - The reference base content.
      alt - The alternative base content.
      padded - Whether the variant is padded by gap symbols.
      Returns:
      true if the variant is a deletion, false otherwise.
    • isDeletion

      public static boolean isDeletion(String alt)
      Determines whether a variant is a deletion based on its alternative content.

      This method checks if the alternative base content represents a deletion. A deletion is defined as a string that starts with a valid nucleotide base (from Constants.BASE_SYMBOLS) followed by one or more gap symbols (defined in Constants.GAP).

      Parameters:
      alt - The alternative base content to check.
      Returns:
      true if the alternative content represents a deletion, false otherwise.
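      The three single-argument checks (isSubstitution, isInsertion and isDeletion) follow directly from their definitions and can be sketched with regular expressions. The sketch assumes that Constants.BASE_SYMBOLS covers the characters A, C, G, T and N and that the gap symbol is '-'; neither value is stated here. The class name VariantTypeSketch is hypothetical.

```java
import java.util.regex.Pattern;

final class VariantTypeSketch {
    static final String BASES = "ACGTN"; // assumed content of Constants.BASE_SYMBOLS
    static final String GAP = "-";       // assumed value of Constants.GAP

    // Exactly one base.
    private static final Pattern SUBSTITUTION = Pattern.compile("[" + BASES + "]");
    // Two or more consecutive bases.
    private static final Pattern INSERTION = Pattern.compile("[" + BASES + "]{2,}");
    // One base followed by one or more gap symbols.
    private static final Pattern DELETION = Pattern.compile("[" + BASES + "]" + Pattern.quote(GAP) + "+");

    static boolean isSubstitution(String alt) {
        return SUBSTITUTION.matcher(alt).matches();
    }

    static boolean isInsertion(String alt) {
        return INSERTION.matcher(alt).matches();
    }

    static boolean isDeletion(String alt) {
        return DELETION.matcher(alt).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSubstitution("A")); // true
        System.out.println(isInsertion("ACG"));  // true
        System.out.println(isDeletion("A--"));   // true
        System.out.println(isDeletion("AC-"));   // false
    }
}
```

      Because matches() requires the whole string to fit the pattern, the three categories are mutually exclusive for any given alternative content.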
    • isCanonical

      public static boolean isCanonical(String ref, String alt)
      Determines whether a variant is canonical.

      A variant is canonical if it is:

      Parameters:
      ref - The reference base content.
      alt - The alternative base content.
      Returns:
      true if the variant is canonical, false otherwise.
    • isPaddedCanonical

      public static boolean isPaddedCanonical(String ref, String alt)
      Determines whether a variant is padded canonical.

      A variant is padded canonical if it is:

      Parameters:
      ref - The reference base content.
      alt - The alternative base content.
      Returns:
      true if the variant is padded canonical, false otherwise.