GenBankFile
- class biotite.sequence.io.genbank.GenBankFile
Bases: TextFile
This class represents a file in GenBank format (including GenPept).
A GenBank file annotates a reference sequence with features such as positions of genes, promoters, etc. Additionally, it provides metadata further describing the file.
A file is divided into separate fields, e.g. the DEFINITION field contains a description of the file. The field name starts at the beginning of a line, followed by the content. A field may contain subfields, whose name is indented. For example, the SOURCE field contains the ORGANISM subfield. Some fields may occur multiple times, e.g. the REFERENCE field. A sample GenBank file can be viewed at https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html.
This class provides a low-level interface for parsing, editing and writing GenBank files. It behaves like a list of field entries, where each entry consists of the field name, the field content and the subfields. The field content is stored as a list of its lines. While the content of metadata fields starts at the standard GenBank indentation of 12 characters, the content of the FEATURES (contains the annotation) and ORIGIN (contains the sequence) fields starts without indentation. The subfields are represented by a dictionary, with subfield names being keys and the corresponding lines being values. The FEATURES and ORIGIN fields have no subfields.
Every entry can be obtained, set and deleted via the index operator.
Notes
This class does not support location identifiers with references to other Entrez database entries, e.g. join(1..100,J00194.1:100..202).
Examples
Create a GenBank file from scratch:
>>> file = GenBankFile()
>>> file.append(
...     "SOMEFIELD", ["One line", "A second line"],
...     subfields={"SUBFIELD1": ["Single Line"], "SUBFIELD2": ["Two", "lines"]}
... )
>>> print(file)
SOMEFIELD   One line
            A second line
  SUBFIELD1 Single Line
  SUBFIELD2 Two
            lines
//
>>> name, content, subfields = file[0]
>>> print(name)
SOMEFIELD
>>> print(content)
['One line', 'A second line']
>>> print(subfields)
OrderedDict([('SUBFIELD1', ['Single Line']), ('SUBFIELD2', ['Two', 'lines'])])
Adding an additional field:
>>> file.insert(0, "OTHERFIELD", ["Another line"])
>>> print(len(file))
2
>>> print(file)
OTHERFIELD  Another line
SOMEFIELD   One line
            A second line
  SUBFIELD1 Single Line
  SUBFIELD2 Two
            lines
//
Overwriting and deleting an existing field:
>>> file[1] = "NEWFIELD", ["Yet another line"]
>>> print(file)
OTHERFIELD  Another line
NEWFIELD    Yet another line
//
>>> file[1] = "NEWFIELD", ["Yet another line"], {"NEWSUB": ["Subfield line"]}
>>> print(file)
OTHERFIELD  Another line
NEWFIELD    Yet another line
  NEWSUB    Subfield line
//
>>> del file[1]
>>> print(file)
OTHERFIELD  Another line
//
Parsing fields from a real GenBank file:
>>> import os.path
>>> file = GenBankFile.read(os.path.join(path_to_sequences, "gg_avidin.gb"))
>>> print(file)
LOCUS       AJ311647 1224 bp DNA linear VRT 14-NOV-2006
DEFINITION  Gallus gallus AVD gene for avidin, exons 1-4.
ACCESSION   AJ311647
VERSION     AJ311647.1 GI:13397825
KEYWORDS    AVD gene; avidin.
SOURCE      Gallus gallus (chicken)
  ORGANISM  Gallus gallus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Archelosauria; Archosauria; Dinosauria; Saurischia; Theropoda;
            Coelurosauria; Aves; Neognathae; Galloanserae; Galliformes;
            Phasianidae; Phasianinae; Gallus.
REFERENCE   1
  AUTHORS   Wallen,M.J., Laukkanen,M.O. and Kulomaa,M.S.
  TITLE     Cloning and sequencing of the chicken egg-white avidin-encoding gene
            and its relationship with the avidin-related genes Avr1-Avr5
  JOURNAL   Gene 161 (2), 205-209 (1995)
   PUBMED   7665080
REFERENCE   2
  AUTHORS   Ahlroth,M.K., Kola,E.H., Ewald,D., Masabanda,J., Sazanov,A.,
            Fries,R. and Kulomaa,M.S.
  TITLE     Characterization and chromosomal localization of the chicken avidin
            gene family
  JOURNAL   Anim. Genet. 31 (6), 367-375 (2000)
   PUBMED   11167523
REFERENCE   3 (bases 1 to 1224)
  AUTHORS   Ahlroth,M.K.
  TITLE     Direct Submission
  JOURNAL   Submitted (09-MAR-2001) Ahlroth M.K., Department of Biological and
            Environmental Science, University of Jyvaskyla, PO Box 35, FIN-40351
            Jyvaskyla, FINLAND
FEATURES             Location/Qualifiers
     source          1..1224
                     /organism="Gallus gallus"
                     /mol_type="genomic DNA"
...
>>> name, content, _ = file[3]
>>> print(name)
VERSION
>>> print(content)
['AJ311647.1 GI:13397825']
>>> name, content, subfields = file[5]
>>> print(name)
SOURCE
>>> print(content)
['Gallus gallus (chicken)']
>>> print(dict(subfields))
{'ORGANISM': ['Gallus gallus', 'Eukaryota; Metazoa; Chordata; ...', ...]}
- append(name, content, subfields=None)
Create a new GenBank field at the end of the file.
- Parameters:
- name : str
The field name.
- content : list of str
The content lines.
- subfields : dict of str -> list of str, optional
The subfields of the field. The dictionary maps subfield names to the content lines of the respective subfield.
- copy()
Create a deep copy of this object.
- Returns:
- copy
A copy of this object.
- get_fields(name)
Get all GenBank fields associated with a given field name.
- Parameters:
- name : str
The field name.
- Returns:
- fields : list of (list of str, OrderedDict of str -> list of str)
A list containing the fields. For most field names, the list contains only one element, but fields like REFERENCE are an exception. Each field is represented by a tuple: the first element contains the content lines and the second element contains the subfields as a dictionary. If the field has no subfields, the dictionary is empty.
- get_indices(name)
Get the indices of all GenBank fields associated with a given field name.
- Parameters:
- name : str
The field name.
- Returns:
- indices : list of int
A list of indices. For most field names, the list will only contain one element, but fields like REFERENCE are an exception.
- insert(index, name, content, subfields=None)
Insert a GenBank field at the given position.
- Parameters:
- index : int
The new field is inserted before the current field at this index. If the index is after the last field, the new field is appended to the end of the file.
- name : str
The field name.
- content : list of str
The content lines.
- subfields : dict of str -> list of str, optional
The subfields of the field. The dictionary maps subfield names to the content lines of the respective subfield.
- classmethod read(file)
Read a GenBank file.
- Parameters:
- file : file-like object or str
The file to be read. Alternatively a file path can be supplied.
- Returns:
- file_object : GenBankFile
The parsed file.
- static read_iter(file)
Create an iterator over each line of the given text file.
- Parameters:
- file : file-like object or str
The file to be read. Alternatively a file path can be supplied.
- Yields:
- line : str
The current line in the file.
- set_field(name, content, subfield_dict=None)
Set a GenBank field with the given content.
If the field already exists in the file, it is overwritten; otherwise a new field is created at the end of the file.
- Parameters:
- name : str
The field name.
- content : list of str
The content lines.
- subfield_dict : dict of str -> list of str, optional
The subfields of the field. The dictionary maps subfield names to the content lines of the respective subfield.
- Raises:
- InvalidFileError
If the field occurs multiple times in the file. In this case it is ambiguous which field to overwrite.
- write(file)
Write the contents of this object into a file (or file-like object).
- Parameters:
- file : file-like object or str
The file to be written to. Alternatively a file path can be supplied.
- static write_iter(file, lines)
Iterate over the given lines of text and write each line into the specified file.
In contrast to write(), each line of text is not stored in an intermediate TextFile, but is directly written to the file. Hence, this static method may save a large amount of memory when writing a large file, especially if the lines are provided by a generator.
- Parameters:
- file : file-like object or str
The file to be written to. Alternatively a file path can be supplied.
- lines : generator or array-like of str
The lines of text to be written. Must not include line break characters.
Gallery
Finding homologous regions in two genomes
Finding homologs of a gene in a genome
Hydropathy and conservation of ion channels
Identification of a binding site by sequence conservation
Visualization of a region in proximity to a feature
Domains of bacterial sigma factors
Three ways to get the secondary structure of a protein