Class StorageIO
The StorageIO class provides utility methods for serializing and deserializing genomic data.
This class includes methods to convert Storage objects into various file formats such as JSON, GFF3, FASTA, and VCF. It handles
the generation of file content based on the data stored in the Storage object, ensuring compliance with the respective file
format specifications. Additionally, it provides helper methods for processing features and contigs.
-
Method Summary
Modifier and Type | Method | Description
static String | toFASTA | Generates the content of a FASTA file from the given Storage object.
static String | toGFF3 | Generates the content of a GFF (General Feature Format) file from the given Storage object.
static void | toJSON | Serializes the given Storage object to a JSON file at the specified path.
static String | toVCF | Generates the content of a VCF (Variant Call Format) file from the given Storage object.
-
Method Details
-
toJSON
Serializes the given Storage object to a JSON file at the specified path.
This method converts the Storage object into a JSON string using the Gson library. If the specified path does not end with ".json" or ".json.gz", the default output extension defined in Musial is appended to the path. The JSON data is then written to the file.
If the path ends with ".gz", the JSON data is compressed using GZIP before being written. Otherwise, it is written as plain text.
- Parameters:
storage - The Storage object to be serialized.
path - The Path where the JSON file will be written.
- Throws:
IOException - If an I/O error occurs during file writing.
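The extension handling and optional GZIP compression described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the StorageIO implementation: the class JsonWriteSketch, its methods, and the ".json" value used for the default extension are hypothetical, and the sketch takes an already serialized JSON string so that it does not depend on Gson.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

final class JsonWriteSketch {
    // Assumption: stand-in for the default output extension defined in Musial.
    static final String DEFAULT_EXTENSION = ".json";

    // Appends the default extension unless the path already ends with ".json" or ".json.gz".
    static Path resolveOutputPath(Path path) {
        String name = path.toString();
        if (name.endsWith(".json") || name.endsWith(".json.gz")) {
            return path;
        }
        return path.resolveSibling(path.getFileName() + DEFAULT_EXTENSION);
    }

    // Writes the JSON text to the resolved path, GZIP-compressed if that path ends with ".gz".
    static Path write(String json, Path path) throws IOException {
        Path target = resolveOutputPath(path);
        try (OutputStream raw = Files.newOutputStream(target);
             OutputStream out = target.toString().endsWith(".gz") ? new GZIPOutputStream(raw) : raw;
             Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(json);
        }
        return target;
    }
}
```

In this sketch the ".gz" check is applied to the resolved path, which is one possible reading of the description; the actual method may check the path as passed in.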
-
toGFF3
Generates the content of a GFF (General Feature Format) file from the given Storage object.
This method constructs a GFF file content as a String by iterating over the features in the provided Storage object. The GFF content includes the version, processor information, and the feature data. Each feature is converted to its GFF string representation using the featureToGFF3String(Feature) method.
The generated GFF content follows the GFF3 specification and includes the following:
- ##gff-version: Specifies the GFF version.
- ##processor: Includes the software id and version used to generate the file.
- Feature data: Each feature is represented in GFF format.
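As a rough illustration of the layout listed above, the sketch below assembles a version line, a processor line, and nine-column feature lines. Everything in it is hypothetical: the class Gff3Sketch, its method signatures, and in particular the exact text of the ##processor line are assumptions, since the output of featureToGFF3String(Feature) is not documented here. The nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes) come from the GFF3 specification.

```java
import java.util.List;

final class Gff3Sketch {
    // Builds one GFF3 feature line; score and phase are left undefined (".") in this sketch.
    static String featureLine(String seqId, String source, String type,
                              int start, int end, char strand, String attributes) {
        return String.join("\t",
                seqId, source, type,
                Integer.toString(start), Integer.toString(end),
                ".", String.valueOf(strand), ".", attributes);
    }

    // Assembles header pragmas followed by the feature lines.
    // Assumption: the processor line is "##processor <id> <version>".
    static String toGff3(String softwareId, String version, List<String> featureLines) {
        StringBuilder sb = new StringBuilder();
        sb.append("##gff-version 3\n");
        sb.append("##processor ").append(softwareId).append(' ').append(version).append('\n');
        for (String line : featureLines) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }
}
```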
-
toFASTA
Generates the content of a FASTA file from the given Storage object.
This method constructs a FASTA file content as a String by iterating over the contigs in the provided Storage object. Each contig's ID is used as the header (prefixed with '>'), and its sequence is split into lines of 80 characters for proper FASTA formatting. The method ensures that all contigs in the storage have sequence data before proceeding.
- Parameters:
storage - The Storage object containing the contigs and their sequences.
- Returns:
A String representing the content of the reference FASTA file.
- Throws:
IOException - If an I/O error occurs during the generation of the FASTA content.
IllegalArgumentException - If no reference sequence information is stored in the Storage object.
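The header and 80-character line wrapping described for toFASTA can be sketched as below. This is a minimal illustration, not the actual method: the class FastaSketch is hypothetical, and a plain map from contig ID to sequence stands in for the Storage object, whose API is not shown here.

```java
import java.util.Map;

final class FastaSketch {
    static final int LINE_WIDTH = 80;

    // Assumption: contigs are given as an ordered map of contig ID to sequence.
    static String toFasta(Map<String, String> contigs) {
        // Reject input without sequence data before producing any output.
        if (contigs.isEmpty()
                || contigs.values().stream().anyMatch(s -> s == null || s.isEmpty())) {
            throw new IllegalArgumentException("No reference sequence information available.");
        }
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> contig : contigs.entrySet()) {
            // Header line: '>' followed by the contig ID.
            sb.append('>').append(contig.getKey()).append('\n');
            String seq = contig.getValue();
            // Sequence lines: at most LINE_WIDTH characters each.
            for (int i = 0; i < seq.length(); i += LINE_WIDTH) {
                sb.append(seq, i, Math.min(i + LINE_WIDTH, seq.length())).append('\n');
            }
        }
        return sb.toString();
    }
}
```

For a contig of 85 bases, this sketch emits one header line, one line of 80 bases, and one line of 5 bases.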
-
toVCF
Generates the content of a VCF (Variant Call Format) file from the given Storage object.
This method constructs a VCF file content as a String by iterating over the contigs in the provided Storage object. The VCF content includes the file format, source, and a header line, followed by the variant data. Each variant is represented by its chromosome, position, reference base, and alternate base.
The generated VCF content follows the VCFv4.3 specification and includes the following fields:
- CHROM: Chromosome identifier.
- POS: Position of the variant on the chromosome.
- ID: Variant identifier (set to ".").
- REF: Reference base(s) (gaps are stripped).
- ALT: Alternate base(s) (gaps are stripped).
- QUAL: Quality score (set to "100").
- FILTER: Filter status (set to ".").
- INFO: Additional information (see parameters).
Variants can be filtered based on their novelty and ambiguity:
- If onlyNovel is true, only active variants are included.
- If excludeAmbiguous is true, variants with ambiguous alternate bases are excluded.
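The fixed field values and the two filters can be illustrated with a short sketch. It is written under several assumptions that the description above does not confirm: the class VcfSketch and its methods are hypothetical, gaps are taken to be '-' characters, an ambiguous alternate is taken to be one containing 'N', and whether a variant is "active" is modeled as a plain boolean.

```java
final class VcfSketch {
    // Column header line of a VCF file without sample columns.
    static final String HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO";

    // Assumption: gaps are encoded as '-'.
    static String stripGaps(String bases) {
        return bases.replace("-", "");
    }

    // Applies the onlyNovel and excludeAmbiguous filters as described above.
    // Assumption: an alternate is ambiguous if it contains 'N'.
    static boolean keep(boolean active, String alt, boolean onlyNovel, boolean excludeAmbiguous) {
        if (onlyNovel && !active) {
            return false;
        }
        if (excludeAmbiguous && alt.indexOf('N') >= 0) {
            return false;
        }
        return true;
    }

    // Builds one record line with ID ".", QUAL "100", and FILTER "." as listed above.
    static String record(String chrom, int pos, String ref, String alt, String info) {
        return String.join("\t",
                chrom, Integer.toString(pos), ".",
                stripGaps(ref), stripGaps(alt),
                "100", ".", info);
    }
}
```

A complete file in this sketch would consist of the ##fileformat=VCFv4.3 line, a ##source line, HEADER, and then one record line per variant that passes keep.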
-