Package utility

Class SequenceOperations

java.lang.Object
utility.SequenceOperations

public final class SequenceOperations extends Object
Utility class for performing various sequence operations.

This class provides static methods for sequence alignment, variant integration, sequence translation, and other related operations. It includes methods for handling nucleotide and protein sequences, as well as utilities for working with gaps and variants.

  • Constructor Details

    • SequenceOperations

      public SequenceOperations()
  • Method Details

    • globalNucleotideSequenceAlignment

      public static htsjdk.samtools.util.Tuple<String,String> globalNucleotideSequenceAlignment(String sequenceA, String sequenceB, int gapOpenPenalty, int gapExtendPenalty, SequenceOperations.MarginalGaps left, SequenceOperations.MarginalGaps right, Integer bandWidth)
      Performs global nucleotide sequence alignment using a simple scoring matrix.

      This method aligns two nucleotide sequences using a gap-affine Needleman-Wunsch algorithm. It utilizes a predefined scoring matrix for nucleotide matches, mismatches, and gaps.

      The scoring matrix is defined as follows:

      • Match: +1
      • Mismatch: -1
      • Gap: -1
      Parameters:
      sequenceA - The first nucleotide sequence to align.
      sequenceB - The second nucleotide sequence to align.
      gapOpenPenalty - The penalty for opening a gap in the alignment.
      gapExtendPenalty - The penalty for extending an existing gap in the alignment.
      left - Specifies how to handle left-marginal gaps (FREE, PENALIZE, FORBID).
      right - Specifies how to handle right-marginal gaps (FREE, PENALIZE, FORBID).
      bandWidth - The band-width for banded alignment, or null for non-banded alignment.
      Returns:
      A Tuple containing the aligned sequences.
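The recurrence behind a gap-affine Needleman-Wunsch alignment can be sketched in plain Java as follows. This is a minimal illustration, not the library's implementation: the class name `AffineAlignmentSketch` and the `align` helper are hypothetical, penalties are assumed to be positive integers (a gap of length k costs `gapOpen + (k - 1) * gapExtend`), marginal gaps are always penalized, banding is omitted, and the result is a `String[]` rather than an htsjdk `Tuple`.

```java
final class AffineAlignmentSketch {
    private static final int NEG = Integer.MIN_VALUE / 2; // "minus infinity" without overflow

    private static int max3(int a, int b, int c) { return Math.max(a, Math.max(b, c)); }

    // Index of the largest value: 0 = match state, 1 = gap in b, 2 = gap in a.
    private static int argmax(int m, int x, int y) { return (m >= x && m >= y) ? 0 : (x >= y ? 1 : 2); }

    /** Hypothetical helper: global alignment with match +1, mismatch -1 and affine gap costs. */
    static String[] align(String a, String b, int gapOpen, int gapExtend) {
        int n = a.length(), m = b.length();
        int[][] M = new int[n + 1][m + 1]; // a[i-1] aligned to b[j-1]
        int[][] X = new int[n + 1][m + 1]; // a[i-1] aligned to a gap
        int[][] Y = new int[n + 1][m + 1]; // b[j-1] aligned to a gap
        for (int i = 0; i <= n; i++) {
            for (int j = 0; j <= m; j++) {
                M[i][j] = X[i][j] = Y[i][j] = NEG;
                if (i == 0 && j == 0) { M[0][0] = 0; continue; }
                if (i > 0 && j > 0) {
                    int s = a.charAt(i - 1) == b.charAt(j - 1) ? 1 : -1;
                    M[i][j] = s + max3(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]);
                }
                if (i > 0) X[i][j] = max3(M[i - 1][j] - gapOpen, X[i - 1][j] - gapExtend, Y[i - 1][j] - gapOpen);
                if (j > 0) Y[i][j] = max3(M[i][j - 1] - gapOpen, X[i][j - 1] - gapOpen, Y[i][j - 1] - gapExtend);
            }
        }
        // Trace back from the bottom-right cell, following the state that produced each score.
        StringBuilder ra = new StringBuilder(), rb = new StringBuilder();
        int i = n, j = m, state = argmax(M[n][m], X[n][m], Y[n][m]);
        while (i > 0 || j > 0) {
            if (state == 0) {
                ra.append(a.charAt(i - 1)); rb.append(b.charAt(j - 1));
                i--; j--;
                state = argmax(M[i][j], X[i][j], Y[i][j]);
            } else if (state == 1) {
                ra.append(a.charAt(i - 1)); rb.append('-');
                int cur = X[i][j];
                i--;
                state = cur == M[i][j] - gapOpen ? 0 : (cur == X[i][j] - gapExtend ? 1 : 2);
            } else {
                ra.append('-'); rb.append(b.charAt(j - 1));
                int cur = Y[i][j];
                j--;
                state = cur == M[i][j] - gapOpen ? 0 : (cur == Y[i][j] - gapExtend ? 2 : 1);
            }
        }
        return new String[] { ra.reverse().toString(), rb.reverse().toString() };
    }

    public static void main(String[] args) {
        String[] aligned = align("ACGT", "AGT", 2, 1);
        System.out.println(aligned[0]); // ACGT
        System.out.println(aligned[1]); // A-GT
    }
}
```

Keeping three score matrices, one per alignment state, is what lets an open gap and an extended gap carry different costs. A FREE marginal-gap mode would presumably zero the corresponding boundary penalties, but that behavior is not shown here.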
    • globalProteinSequenceAlignment

      public static htsjdk.samtools.util.Tuple<String,String> globalProteinSequenceAlignment(String sequenceA, String sequenceB, int gapOpenPenalty, int gapExtendPenalty, SequenceOperations.MarginalGaps left, SequenceOperations.MarginalGaps right, Integer bandWidth)
      Performs global protein sequence alignment using the BLOSUM80 scoring matrix.

      This method aligns two protein sequences using a gap-affine Needleman-Wunsch algorithm. It scores amino acid matches and mismatches with the BLOSUM80 matrix; gap costs are taken from the gap open and gap extend penalties.

      The scoring matrix is defined as follows:

      • Match: Based on BLOSUM80 values.
      • Mismatch: Based on BLOSUM80 values.
      • Gap penalties: Defined by the gap open and gap extend penalties.
      Parameters:
      sequenceA - The first protein sequence to align.
      sequenceB - The second protein sequence to align.
      gapOpenPenalty - The penalty for opening a gap in the alignment.
      gapExtendPenalty - The penalty for extending an existing gap in the alignment.
      left - Specifies how to handle left-marginal gaps (FREE, PENALIZE, FORBID).
      right - Specifies how to handle right-marginal gaps (FREE, PENALIZE, FORBID).
      bandWidth - The band-width for banded alignment, or null for non-banded alignment.
      Returns:
      A Tuple containing the aligned sequences.
    • padGaps

      public static String padGaps(String s, int length)
      Pads a string with gap characters to reach a specified length.

      This method appends gap characters (defined by Constants.gapString) to the input string until it reaches the desired length. If the input string is already equal to or longer than the specified length, no padding is added.

      Parameters:
      s - The input string to be padded.
      length - The desired length of the resulting string.
      Returns:
      The padded string, or the original string if no padding is needed.
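The padding behavior described above can be sketched as follows. The class name `PadGapsSketch` is hypothetical, and the gap symbol is assumed to be `-`; the actual value is whatever Constants.gapString defines.

```java
final class PadGapsSketch {
    // Assumption: the gap symbol is "-"; the real value lives in Constants.gapString.
    private static final String GAP = "-";

    static String padGaps(String s, int length) {
        if (s.length() >= length) {
            return s; // already long enough, returned unchanged
        }
        return s + GAP.repeat(length - s.length()); // append the missing gap symbols
    }

    public static void main(String[] args) {
        System.out.println(padGaps("ACG", 6));    // ACG---
        System.out.println(padGaps("ACGTAC", 4)); // ACGTAC
    }
}
```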
    • stripGaps

      public static String stripGaps(String s)
      Removes all gap characters from the input string.

      This method replaces all occurrences of the gap character (defined by Constants.gapString) in the input string with an empty string (defined by Constants.EMPTY).

      Parameters:
      s - The input string from which gaps should be removed.
      Returns:
      A new string with all gap characters removed.
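A minimal sketch of the same idea, assuming the gap symbol is `-` and the empty string stands in for Constants.EMPTY; the class name `StripGapsSketch` is hypothetical.

```java
final class StripGapsSketch {
    // Assumptions: "-" mirrors Constants.gapString and "" mirrors Constants.EMPTY.
    private static final String GAP = "-";
    private static final String EMPTY = "";

    static String stripGaps(String s) {
        // String.replace substitutes every literal occurrence, no regex involved.
        return s.replace(GAP, EMPTY);
    }

    public static void main(String[] args) {
        System.out.println(stripGaps("A-C--G")); // ACG
    }
}
```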
    • integrateVariants

      public static String integrateVariants(Contig contig, Feature feature, NavigableMap<Integer,String> variants, boolean stripGaps) throws IOException
      Integrates variants into a reference sequence for a given feature.

      This method processes a reference sequence from a specified contig and feature, integrating variants provided in a map. Variants can include single nucleotide variants (SNVs), insertions, and deletions. The resulting sequence can optionally have gaps stripped.

      Upstream deletions are handled by skipping affected positions and logging a warning.

      Parameters:
      contig - The Contig object containing the reference sequence.
      feature - The Feature object specifying the region of interest.
      variants - A NavigableMap mapping positions to variant strings.
      stripGaps - A boolean indicating whether to remove gaps from the resulting sequence.
      Returns:
      A String representing the updated sequence with integrated variants.
      Throws:
      IOException - If an error occurs while accessing the contig sequence.
      IllegalArgumentException - If the contig does not have a sequence or the feature is incompatible.
    • translateSequence

      public static String translateSequence(String sequence, boolean reverse) throws MusialException
      Translates a DNA sequence into an amino-acid sequence. Translation always uses the first reading frame, i.e. it starts at the first base of the (optionally reverse-complemented) sequence. The BioJava library performs the translation.
      Parameters:
      sequence - The DNA sequence to translate.
      reverse - Whether to translate the reverse complement of the sequence.
      Returns:
      The translated amino-acid sequence.
      Throws:
      MusialException - If an error occurs during translation.
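To illustrate first-frame translation with an optional reverse complement, here is a dependency-free sketch. It does not use BioJava and is not the library's implementation: the class name `TranslateSketch` is hypothetical, the standard genetic code is assumed, stop codons are written as `*`, codons with unknown bases become `X`, and an incomplete trailing codon is dropped.

```java
final class TranslateSketch {
    // Standard genetic code, indexed by 16 * first + 4 * second + third with base order T, C, A, G.
    private static final String BASES = "TCAG";
    private static final String AMINO_ACIDS =
            "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    static String reverseComplement(String dna) {
        StringBuilder sb = new StringBuilder(dna.length());
        for (int i = dna.length() - 1; i >= 0; i--) {
            switch (dna.charAt(i)) {
                case 'A': sb.append('T'); break;
                case 'T': sb.append('A'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                default:  sb.append('N'); // unknown base stays unknown
            }
        }
        return sb.toString();
    }

    static String translate(String sequence, boolean reverse) {
        String dna = sequence.toUpperCase();
        if (reverse) {
            dna = reverseComplement(dna);
        }
        StringBuilder protein = new StringBuilder();
        // Frame 1: read codons starting at the first base; a trailing partial codon is ignored.
        for (int i = 0; i + 3 <= dna.length(); i += 3) {
            int first = BASES.indexOf(dna.charAt(i));
            int second = BASES.indexOf(dna.charAt(i + 1));
            int third = BASES.indexOf(dna.charAt(i + 2));
            boolean unknown = first < 0 || second < 0 || third < 0;
            protein.append(unknown ? 'X' : AMINO_ACIDS.charAt(16 * first + 4 * second + third));
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("ATGGCCTAA", false)); // MA*
        System.out.println(translate("TTAGGCCAT", true));  // MA*
    }
}
```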
    • getCanonicalVariants

      public static ArrayList<org.apache.commons.lang3.tuple.Triple<Integer,String,String>> getCanonicalVariants(String reference, String alternative)
      Transforms two sequences into canonical VCF variants.

      The specified reference and alternative are expected to be aligned sequences of equal length. Each variant is formatted as a triple of relative position, reference content, and variant content. The relative position is the 0-based position of the variant in the reference sequence with gaps removed.

      Parameters:
      reference - String representation of the reference sequence.
      alternative - String representation of the variant/alternative sequence.
      Returns:
      ArrayList containing the derived variants; see the method description for format details.
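One way to derive such triples from an aligned pair is sketched below. This is an illustration under stated assumptions, not the library's implementation: the class `CanonicalVariantsSketch` and its nested `Variant` type are hypothetical stand-ins for the commons-lang3 `Triple`, the gap symbol is assumed to be `-`, and insertions and deletions are assumed to be anchored VCF-style on the preceding reference base, whose position is reported. A gap run at the very start of the alignment has no preceding base and is reported at position 0 without an anchor, which is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

final class CanonicalVariantsSketch {
    private static final char GAP = '-'; // assumed gap symbol

    /** Minimal stand-in for a (position, reference, alternative) triple. */
    static final class Variant {
        final int position;
        final String ref;
        final String alt;

        Variant(int position, String ref, String alt) {
            this.position = position;
            this.ref = ref;
            this.alt = alt;
        }

        @Override
        public String toString() {
            return position + ":" + ref + ">" + alt;
        }
    }

    static List<Variant> canonicalVariants(String reference, String alternative) {
        if (reference.length() != alternative.length()) {
            throw new IllegalArgumentException("Aligned sequences must have equal length.");
        }
        List<Variant> variants = new ArrayList<>();
        int refPos = -1; // 0-based index of the last ungapped reference base seen
        int i = 0;
        while (i < reference.length()) {
            char r = reference.charAt(i);
            char a = alternative.charAt(i);
            if (r != GAP && a != GAP) {
                refPos++;
                if (r != a) { // substitution in a gap-free column
                    variants.add(new Variant(refPos, String.valueOf(r), String.valueOf(a)));
                }
                i++;
                continue;
            }
            // Gap run: collect all consecutive columns in which either sequence has a gap.
            StringBuilder ref = new StringBuilder();
            StringBuilder alt = new StringBuilder();
            if (i > 0) { // anchor on the preceding reference base
                ref.append(reference.charAt(i - 1));
                alt.append(reference.charAt(i - 1));
            }
            int anchorPos = Math.max(refPos, 0);
            while (i < reference.length()
                    && (reference.charAt(i) == GAP || alternative.charAt(i) == GAP)) {
                if (reference.charAt(i) != GAP) {
                    ref.append(reference.charAt(i)); // deleted reference base
                    refPos++;
                }
                if (alternative.charAt(i) != GAP) {
                    alt.append(alternative.charAt(i)); // inserted base
                }
                i++;
            }
            if (!ref.toString().equals(alt.toString())) { // skip columns where both are gaps
                variants.add(new Variant(anchorPos, ref.toString(), alt.toString()));
            }
        }
        return variants;
    }

    public static void main(String[] args) {
        // Substitution at 1, deletion of "TA" anchored at 2, insertion of "T" anchored at 5.
        System.out.println(canonicalVariants("ACGTAC-G", "ATG--CTG")); // [1:C>T, 2:GTA>G, 5:C>CT]
    }
}
```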