biotite.sequence.io.gff.GFFFile

class biotite.sequence.io.gff.GFFFile

Bases: TextFile

This class represents a file in Generic Feature Format 3 (GFF3) format.

Similar to GenBank files, GFF3 files contain information about features of a reference sequence, but in a more concise and more easily parsable form. However, GFF3 files do not provide additional meta information.

This class serves as a low-level API for accessing GFF3 files. It is used as a sequence of entries, where each entry is defined as a non-comment and non-directive line. Each entry consists of values corresponding to the 9 columns of GFF3:

seqid : str

The ID of the reference sequence

source : str

Source of the data (e.g. GenBank)

type : str

Type of the feature (e.g. CDS)

start : int

Start coordinate of feature on the reference sequence

end : int

End coordinate of feature on the reference sequence

score : float or None

Optional score (e.g. an E-value)

strand : Location.Strand or None

Strand of the feature, None if feature is not stranded

phase : int or None

Reading frame shift, None for non-CDS features

attributes : dict

Additional properties of the feature
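To make the column layout concrete, the following is a minimal standalone sketch of how one tab-separated GFF3 line maps onto the nine values listed above. It is not Biotite's implementation: the function name `parse_gff3_entry` is hypothetical, the strand is kept as a plain string instead of a Location.Strand value, and percent-encoded characters in attributes are not decoded.

```python
def parse_gff3_entry(line):
    """Split one GFF3 entry line into its 9 typed column values.

    Hypothetical helper for illustration; not part of Biotite.
    """
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(
            f"Expected 9 tab-separated columns, got {len(columns)}"
        )
    seqid, source, type_, start, end, score, strand, phase, attrib = columns

    # The attribute column holds 'key=value' pairs separated by ';'
    attributes = {}
    if attrib != ".":
        for pair in attrib.split(";"):
            if pair:
                key, _, value = pair.partition("=")
                attributes[key] = value

    return (
        seqid,
        source,
        type_,
        int(start),
        int(end),
        # A '.' marks an undefined value in GFF3
        None if score == "." else float(score),
        None if strand == "." else strand,  # assumption: plain str
        None if phase == "." else int(phase),
        attributes,
    )


entry = parse_gff3_entry(
    "SomeSeqID\tBiotite\tCDS\t1\t99\t.\t+\t0\tID=FeatureID;product=A protein"
)
print(entry)
```

The undefined score comes back as None, while start, end and phase are converted to int, mirroring the types in the list above.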

Note that the entry index may not be equal to the line index, because GFF3 files can contain comment and directive lines.
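The difference between entry index and line index can be illustrated with a small standalone sketch. It assumes that lines starting with `#` are comments or directives, that blank lines are skipped, and that everything after a `##FASTA` directive is ignored; the helper name `entry_line_indices` is hypothetical and not part of Biotite.

```python
def entry_line_indices(lines):
    """Return the line index of each entry, in entry order.

    Hypothetical helper for illustration; not part of Biotite.
    """
    indices = []
    for line_index, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("##FASTA"):
            # Assumption: content after ##FASTA holds no entries
            break
        if not stripped or stripped.startswith("#"):
            # Comment, directive or blank line: not an entry
            continue
        indices.append(line_index)
    return indices


lines = [
    "##gff-version 3",
    "# A comment line",
    "seq1\tsrc\tgene\t1\t100\t.\t+\t.\tID=gene1",
    "##some-directive",
    "seq1\tsrc\tCDS\t1\t99\t.\t+\t0\tID=cds1",
]
print(entry_line_indices(lines))  # [2, 4]
```

Here the entry with index 0 sits on line 2 and the entry with index 1 on line 4.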

Notes

Although the GFF3 specification allows mixing in reference sequence data in FASTA format via the ##FASTA directive, this class does not support extracting the sequence information. The content after the ##FASTA directive is simply ignored. Please provide the sequence via a separate file or read the FASTA data directly via the lines attribute:

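>>> import os.path
>>> from io import StringIO
>>> from biotite.sequence.io.fasta import FastaFile
>>> from biotite.sequence.io.gff import GFFFile
>>> gff_file = GFFFile.read(os.path.join(path_to_sequences, "indexing_test.gff3"))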
>>> fasta_start_index = None
>>> for directive, line_index in gff_file.directives():
...     if directive == "FASTA":
...         fasta_start_index = line_index + 1
>>> fasta_data = StringIO("\n".join(gff_file.lines[fasta_start_index:]))
>>> fasta_file = FastaFile.read(fasta_data)
>>> for seq_string in fasta_file.values():
...     print(seq_string[:60] + "...")
TACGTAGCTAGCTGATCGATGTTGTGTGTATCGATCTAGCTAGCTAGCTGACTACACAAT...

Examples

Reading and editing of an existing GFF3 file:

>>> import os.path
>>> gff_file = GFFFile.read(os.path.join(path_to_sequences, "gg_avidin.gff3"))
>>> # Get content of first entry
>>> seqid, source, type, start, end, score, strand, phase, attrib = gff_file[0]
>>> print(seqid)
AJ311647.1
>>> print(source)
EMBL
>>> print(type)
region
>>> print(start)
1
>>> print(end)
1224
>>> print(score)
None
>>> print(strand)
Strand.FORWARD
>>> print(phase)
None
>>> print(attrib)
{'ID': 'AJ311647.1:1..1224', 'Dbxref': 'taxon:9031', 'Name': 'Z', 'chromosome': 'Z', 'gbkey': 'Src', 'mol_type': 'genomic DNA'}
>>> # Edit the first entry: Simply add a score
>>> score = 1.0
>>> gff_file[0] = seqid, source, type, start, end, score, strand, phase, attrib
>>> # Delete first entry
>>> del gff_file[0]

Writing a new GFF3 file:

>>> from biotite.sequence import Location
>>> from biotite.sequence.io.gff import GFFFile
>>> gff_file = GFFFile()
>>> gff_file.append_directive("Example directive", "param1", "param2")
>>> gff_file.append(
...     "SomeSeqID", "Biotite", "CDS", 1, 99,
...     None, Location.Strand.FORWARD, 0,
...     {"ID": "FeatureID", "product":"A protein"}
... )
>>> print(gff_file)
##gff-version 3
##Example directive param1 param2
SomeSeqID   Biotite CDS     1       99      .       +       0       ID=FeatureID;product=A protein

append(seqid, source, type, start, end, score, strand, phase, attributes=None)

Append an entry to the end of the file.

Parameters
seqid : str

The ID of the reference sequence.

source : str

Source of the data (e.g. GenBank).

type : str

Type of the feature (e.g. CDS).

start : int

Start coordinate of feature on the reference sequence.

end : int

End coordinate of feature on the reference sequence.

score : float or None

Optional score (e.g. an E-value).

strand : Location.Strand or None

Strand of the feature, None if feature is not stranded.

phase : int or None

Reading frame shift, None for non-CDS features.

attributes : dict, optional

Additional properties of the feature.

append_directive(directive, *args)

Append a directive line to the end of the file.

Parameters
directive : str

Name of the directive.

*args : str

Optional parameters for the directive. Each argument is simply appended to the directive, separated by a single space character.

Raises
NotImplementedError

If the ##FASTA directive is used, which is not supported.

Examples

>>> gff_file = GFFFile()
>>> gff_file.append_directive("Example directive", "param1", "param2")
>>> print(gff_file)
##gff-version 3
##Example directive param1 param2

copy()

Create a deep copy of this object.

Returns
copy

A copy of this object.

directives()

Get the directives in the file.

Returns
directives : list of tuple(str, int)

A list of directives, sorted by their line order. The first element of each tuple is the name of the directive (without ##), the second element is the index of the corresponding line.
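The shape of this return value can be sketched without Biotite. The helper `find_directives` below is hypothetical; it treats every line starting with `##` as a directive and keeps the whole text after `##` as the name, which is an assumption about how a directive with parameters (such as `##gff-version 3`) is reported.

```python
def find_directives(lines):
    """Collect (name, line_index) tuples for all directive lines.

    Hypothetical helper for illustration; not part of Biotite.
    """
    directives = []
    for line_index, line in enumerate(lines):
        if line.startswith("##"):
            # Assumption: the name is everything after the '##' prefix
            directives.append((line[2:].strip(), line_index))
    return directives


lines = [
    "##gff-version 3",
    "seq1\tsrc\tgene\t1\t100\t.\t+\t.\tID=gene1",
    "##FASTA",
    ">seq1",
    "ACGT",
]
for name, line_index in find_directives(lines):
    print(name, line_index)
```

Because the tuples carry line indices rather than entry indices, they can be used directly to slice the raw lines, for example to locate the content that follows a `##FASTA` directive.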

insert(index, seqid, source, type, start, end, score, strand, phase, attributes=None)

Insert an entry at the given index.

Parameters
index : int

Index where the entry is inserted. If the index is equal to the length of the file, the entry is appended at the end of the file.

seqid : str

The ID of the reference sequence.

source : str

Source of the data (e.g. GenBank).

type : str

Type of the feature (e.g. CDS).

start : int

Start coordinate of feature on the reference sequence.

end : int

End coordinate of feature on the reference sequence.

score : float or None

Optional score (e.g. an E-value).

strand : Location.Strand or None

Strand of the feature, None if feature is not stranded.

phase : int or None

Reading frame shift, None for non-CDS features.

attributes : dict, optional

Additional properties of the feature.

classmethod read(file)

Read a GFF3 file.

Parameters
file : file-like object or str

The file to be read. Alternatively a file path can be supplied.

Returns
file_object : GFFFile

The parsed file.

static read_iter(file)

Create an iterator over each line of the given text file.

Parameters
file : file-like object or str

The file to be read. Alternatively a file path can be supplied.

Yields
line : str

The current line in the file.

write(file)

Write the contents of this object into a file (or file-like object).

Parameters
file : file-like object or str

The file to be written to. Alternatively a file path can be supplied.

static write_iter(file, lines)

Iterate over the given lines of text and write each line into the specified file.

In contrast to write(), the lines of text are not stored in an intermediate TextFile, but are written directly to the file. Hence, this static method can save a large amount of memory when a large file is written, especially if the lines are provided as a generator.

Parameters
file : file-like object or str

The file to be written to. Alternatively a file path can be supplied.

lines : generator or array-like of str

The lines of text to be written. Must not include line break characters.
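The memory argument can be illustrated with a standalone sketch of the same pattern: a generator produces one line at a time and each line is written immediately, so the full list of lines never exists in memory. The names `generate_entry_lines` and `write_lines` are hypothetical stand-ins; `write_lines` only mimics the two-argument shape of write_iter(file, lines) and is not Biotite's implementation.

```python
import io


def generate_entry_lines(n):
    """Lazily yield a version directive followed by n GFF3 entry lines."""
    yield "##gff-version 3"
    for i in range(n):
        start = i * 100 + 1
        yield "\t".join([
            "seq1", "sketch", "region",
            str(start), str(start + 99),
            ".", "+", ".",
            f"ID=region{i}",
        ])


def write_lines(file, lines):
    """Write each line as soon as it is produced (hypothetical helper)."""
    for line in lines:
        # Lines carry no line break characters; add one per line
        file.write(line + "\n")


buffer = io.StringIO()
write_lines(buffer, generate_entry_lines(3))
print(buffer.getvalue())
```

With the real class, the same generator could presumably be handed to GFFFile.write_iter() together with a file path or file-like object, as described in the parameters above.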