bkbit.data_translators.genome_annotation_translator module
Module for downloading, parsing, and processing GFF3 files from NCBI and Ensembl repositories. This module provides functionality to:
Download a GFF3 file from a specified URL and calculate its checksums.
Parse the GFF3 file to extract gene annotations.
Generate various metadata objects such as organism taxon, genome assembly, and genome annotation.
Serialize the extracted information into JSON-LD format for further use.
- Classes:
Gff3: Handles the complete lifecycle of downloading, parsing, and processing GFF3 files from NCBI or Ensembl repositories. It extracts gene annotations and serializes the data into JSON-LD format.
- Functions:
gff2jsonld: Creates GeneAnnotation objects from a provided GFF3 file and serializes the extracted information into JSON-LD format.
- Usage:
The module can be run as a standalone script by executing it with appropriate arguments and options:
`python genome_annotation_translator.py <content_url> -a <assembly_accession> -s <assembly_strain> -l <log_level> -f`
The script will download the GFF3 file from the specified URL, parse it, and serialize the extracted information into JSON-LD format.
Example
`
python genome_annotation_translator.py "https://example.com/path/to/gff3.gz" -a "GCF_000001405.39" -s "strain_name" -l "INFO" -f True
`
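The same workflow can presumably be driven from Python instead of the command line. The following is a minimal sketch that relies only on the constructor and method signatures documented on this page. The helper name `translate_gff3` is hypothetical, and the call order (construct, parse, serialize) is an assumption rather than a confirmed description of what `gff2jsonld` does internally.

```python
def translate_gff3(content_url, assembly_accession=None, assembly_strain=None,
                   log_level="WARNING", log_to_file=False):
    """Hypothetical helper: run the documented Gff3 steps in order."""
    # Imported lazily so that defining this helper does not require bkbit
    # to be installed; calling it does.
    from bkbit.data_translators.genome_annotation_translator import Gff3

    gff3 = Gff3(content_url, assembly_accession, assembly_strain,
                log_level, log_to_file)
    gff3.parse()                # extract gene annotations (default feature filter)
    gff3.serialize_to_jsonld()  # write or print the JSON-LD output
```

Calling `translate_gff3("https://example.com/path/to/gff3.gz", assembly_accession="GCF_000001405.39")` would mirror the command-line example above, assuming the URL points at a real GFF3 file.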
- Dependencies:
re
hashlib
tempfile
uuid
urllib
urllib.request
urllib.parse
os
json
datetime
collections.defaultdict
subprocess
gzip
tqdm
click
pkg_resources
bkbit.models.genome_annotation as ga
bkbit.utils.setup_logger as setup_logger
bkbit.utils.load_json as load_json
- class bkbit.data_translators.genome_annotation_translator.Gff3(content_url, assembly_accession=None, assembly_strain=None, log_level='WARNING', log_to_file=False)[source]
Bases:
object
The Gff3 class is responsible for downloading, parsing, and processing GFF3 files from NCBI and Ensembl repositories.
- content_url
The URL of the GFF file.
- Type:
str
- assembly_accession
The ID of the genome assembly.
- Type:
str
- assembly_strain
The strain of the genome assembly. Defaults to None.
- Type:
str, optional
- log_level
The logging level. Defaults to ‘WARNING’.
- Type:
str
- log_to_file
Flag to log messages to a file. Defaults to False.
- Type:
bool
- __init__(content_url, assembly_accession=None, assembly_strain=None, log_level='WARNING', log_to_file=False)[source]
Initializes the Gff3 class with the provided parameters.
- __download_gff_file()
Downloads a GFF file from a given URL and calculates the MD5, SHA256, and SHA1 hashes.
- generate_organism_taxon(taxon_id)[source]
Generates an organism taxon object based on the provided taxon ID.
- assign_authority_type(authority)[source]
Assigns the authority type based on the given authority string.
- generate_genome_assembly(assembly_id, assembly_version, assembly_label, assembly_strain=None)[source]
Generates a genome assembly object based on the provided parameters.
- generate_genome_annotation(genome_label, genome_version)[source]
Generates a genome annotation object based on the provided parameters.
- generate_digest(hash_values, hash_functions=DEFAULT_HASH)[source]
Generates checksum digests for the GFF file using the specified hash functions.
- __get_line_count(file_path)
Returns the line count of a file.
- parse(feature_filter=DEFAULT_FEATURE_FILTER)[source]
Parses the GFF file and extracts gene annotations based on the provided feature filter.
- generate_ensembl_gene_annotation(attributes, curr_line_num)[source]
Generates a GeneAnnotation object for Ensembl based on the provided attributes.
- generate_ncbi_gene_annotation(attributes, curr_line_num)[source]
Generates a GeneAnnotation object for NCBI based on the provided attributes.
- __get_attribute(attributes, attribute_name, curr_line_num)
Retrieves the value of a specific attribute from the given attributes dictionary.
- __resolve_ncbi_gene_annotation(new_gene_annotation, curr_line_num)
Resolves conflicts between existing and new gene annotations based on certain conditions.
- __merge_values(t)
Merges values from a list of lists into a dictionary of sets.
- serialize_to_jsonld(exclude_none=True, exclude_unset=False)[source]
Serializes the object and either writes it to the specified output file or prints it to the CLI.
Initializes an instance of the Gff3 class.
- Parameters:
content_url (str) – The URL of the GFF file.
assembly_accession (str) – The ID of the genome assembly.
assembly_strain (str, optional) – The strain of the genome assembly. Defaults to None.
hash_functions (tuple[str]) – A tuple of hash functions to use for generating checksums. Defaults to ('MD5',).
- assign_authority_type(authority: str)[source]
Assigns the authority type based on the given authority string.
- Parameters:
authority (str) – The authority string to be assigned.
- Returns:
The corresponding authority type.
- Return type:
ga.AuthorityType
- Raises:
Exception – If the authority is not supported. Only NCBI and Ensembl authorities are supported.
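The mapping described for `assign_authority_type` can be illustrated with a small stand-alone sketch. The `AuthorityType` enum below is a hypothetical stand-in for `ga.AuthorityType`, the case-insensitive matching is an assumption, and `ValueError` is used here as one concrete kind of `Exception`; the real method may differ on all three points.

```python
from enum import Enum


class AuthorityType(Enum):
    """Hypothetical stand-in for ga.AuthorityType."""
    NCBI = "NCBI"
    ENSEMBL = "ENSEMBL"


def assign_authority_type(authority: str) -> AuthorityType:
    """Map an authority string to an enum member, rejecting anything else."""
    normalized = authority.strip().upper()  # assumption: matching ignores case
    if normalized == "NCBI":
        return AuthorityType.NCBI
    if normalized == "ENSEMBL":
        return AuthorityType.ENSEMBL
    raise ValueError(
        f"Authority {authority!r} is not supported; use NCBI or Ensembl."
    )
```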
- generate_digest(hash_values: dict, hash_functions: tuple[str] = ('MD5',)) → list[Checksum][source]
Generates checksum digests for the GFF file using the specified hash functions.
- Parameters:
hash_functions (tuple[str]) – A tuple of hash functions to use for generating the digests. Defaults to ('MD5',).
- Returns:
A list of Checksum objects containing the generated digests.
- Return type:
list[ga.Checksum]
- Raises:
ValueError – If an unsupported hash algorithm is provided.
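To make the checksum step concrete, here is a minimal sketch using `hashlib`. The function name `compute_digests` and the `SUPPORTED` table are hypothetical, and the sketch returns a plain dictionary of hex digests rather than `ga.Checksum` objects, so it only approximates what `generate_digest` is documented to do.

```python
import hashlib

# Assumption: these are the algorithm names accepted by the real method.
SUPPORTED = {
    "MD5": hashlib.md5,
    "SHA1": hashlib.sha1,
    "SHA256": hashlib.sha256,
}


def compute_digests(data: bytes, hash_functions=("MD5",)) -> dict:
    """Return {algorithm name: hex digest} for each requested algorithm."""
    digests = {}
    for name in hash_functions:
        if name not in SUPPORTED:
            raise ValueError(f"Unsupported hash algorithm: {name}")
        digests[name] = SUPPORTED[name](data).hexdigest()
    return digests
```

For a large download, the same idea applies with `update()` called on each chunk as it is read, so the file never has to be held in memory at once.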
- generate_ensembl_gene_annotation(attributes, curr_line_num)[source]
Generates a GeneAnnotation object for Ensembl based on the provided attributes.
- Parameters:
attributes (dict) – A dictionary containing the attributes of the gene.
curr_line_num (int) – The line number of the current row in the input file.
- Returns:
The generated GeneAnnotation object if it is not a duplicate, otherwise None.
- Return type:
GeneAnnotation or None
- generate_genome_annotation(genome_label: str, genome_version: str)[source]
Generates a genome annotation object.
- Parameters:
genome_label (str) – The label of the genome.
genome_version (str) – The version of the genome.
- Returns:
The generated genome annotation.
- Return type:
ga.GenomeAnnotation
- generate_genome_assembly(assembly_id: str, assembly_version: str, assembly_label: str, assembly_strain: str | None = None)[source]
Generates a genome assembly object.
- Parameters:
assembly_id (str) – The ID of the assembly.
assembly_version (str) – The version of the assembly.
assembly_label (str) – The label of the assembly.
assembly_strain (str, optional) – The strain of the assembly. Defaults to None.
- Returns:
The generated genome assembly object.
- Return type:
ga.GenomeAssembly
- generate_ncbi_gene_annotation(attributes, curr_line_num)[source]
Generates a GeneAnnotation object for NCBI based on the provided attributes.
- Parameters:
attributes (dict) – A dictionary containing the attributes of the gene.
curr_line_num (int) – The line number of the current row in the input file.
- Returns:
The generated GeneAnnotation object if it is not a duplicate, otherwise None.
- Return type:
GeneAnnotation or None
- generate_organism_taxon(taxon_id: str)[source]
Generates an organism taxon object based on the provided taxon ID.
- Parameters:
taxon_id (str) – The taxon ID of the organism.
- Returns:
The generated organism taxon object.
- Return type:
ga.OrganismTaxon
- parse(feature_filter: tuple[str] = ('gene', 'pseudogene', 'ncRNA_gene'))[source]
Parses the GFF file and extracts gene annotations based on the provided feature filter.
- Parameters:
feature_filter (tuple[str]) – Tuple of feature types to include in the gene annotations.
- Raises:
FileNotFoundError – If the GFF file does not exist.
- Returns:
None
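The feature-filter behavior of `parse` can be sketched without any bkbit dependencies. The example below assumes the standard GFF3 layout of nine tab-separated columns with `key=value` attribute pairs in the last column; the generator name `parse_gff3_lines` is hypothetical, and the real method builds GeneAnnotation objects rather than yielding tuples.

```python
from urllib.parse import unquote

DEFAULT_FEATURE_FILTER = ("gene", "pseudogene", "ncRNA_gene")


def parse_gff3_lines(lines, feature_filter=DEFAULT_FEATURE_FILTER):
    """Yield (line_number, feature_type, attributes) for rows matching the filter."""
    for line_num, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed GFF3 feature row
        feature_type = columns[2]
        if feature_type not in feature_filter:
            continue  # e.g. mRNA or exon rows are ignored by default
        attributes = {}
        for pair in columns[8].split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                # GFF3 percent-encodes reserved characters in attribute values.
                attributes[key.strip()] = unquote(value.strip())
        yield line_num, feature_type, attributes
```

Passing an open text file handle (for example from `gzip.open(path, "rt")`) as `lines` streams the file row by row.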
- parse_url()[source]
Parses the content URL and extracts information about the genome annotation.
- Returns:
A dictionary containing the following information:
‘authority’: The authority type (NCBI or ENSEMBL).
‘taxonid’: The taxon ID of the genome.
‘release_version’: The release version of the genome annotation.
‘assembly_accession’: The assembly accession of the genome.
‘assembly_name’: The name of the assembly.
‘species’: The species name (only for ENSEMBL URLs).
- Return type:
dict
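As a rough illustration of the first step such a method has to take, the sketch below identifies the authority from the URL host and splits the path into segments. The function name `sketch_parse_url`, the host checks, and the returned keys `path_segments` and `filename` are all assumptions for illustration; the real `parse_url` extracts the additional fields listed above, and its matching rules are not shown on this page.

```python
from urllib.parse import urlparse


def sketch_parse_url(content_url: str) -> dict:
    """Guess the authority from the host and expose the path pieces."""
    parsed = urlparse(content_url)
    host = (parsed.hostname or "").lower()
    # Assumption: the authority can be recognized from the host name alone.
    if host == "ncbi.nlm.nih.gov" or host.endswith(".ncbi.nlm.nih.gov"):
        authority = "NCBI"
    elif host == "ensembl.org" or host.endswith(".ensembl.org"):
        authority = "ENSEMBL"
    else:
        raise ValueError(f"Unrecognized host: {host!r}")
    segments = [segment for segment in parsed.path.split("/") if segment]
    return {
        "authority": authority,
        "path_segments": segments,
        "filename": segments[-1] if segments else None,
    }
```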
- serialize_to_jsonld(exclude_none: bool = True, exclude_unset: bool = False)[source]
Serializes the object and either writes it to the specified output file or prints it to the CLI.
- Parameters:
exclude_none (bool) – Whether to exclude None values in the output.
exclude_unset (bool) – Whether to exclude unset values in the output.
- Returns:
None
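The effect of `exclude_none` can be shown with a small stand-alone sketch built on the `json` module. The function name `to_jsonld`, the `@context`/`@graph` envelope, and the context URL are assumptions for illustration; the real method serializes the model objects held by the `Gff3` instance and may structure its output differently.

```python
import json


def to_jsonld(records, context_url, exclude_none=True):
    """Serialize a list of dict records into a JSON-LD style document."""
    def clean(record):
        if not exclude_none:
            return dict(record)
        # Drop keys whose value is None so they do not appear in the output.
        return {key: value for key, value in record.items() if value is not None}

    document = {
        "@context": context_url,  # hypothetical context location
        "@graph": [clean(record) for record in records],
    }
    return json.dumps(document, indent=2)


if __name__ == "__main__":
    genes = [{"id": "gene:G1", "symbol": "abc", "synonym": None}]
    print(to_jsonld(genes, "https://example.org/context.jsonld"))
```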