Annotating sequences with features#

Sequence features describe functional parts of a sequence, like coding or regulatory regions. One popular source to obtain information about sequence features are GenBank (for DNA and RNA) and GenPept (for peptides) files. As example for sequence features we will work with the GenBank file for the avidin gene (Accession: AJ311647), that we can download from the NCBI Entrez database. After downloading we can load the file using the GenBankFile class from biotite.sequence.io.genbank. Similar to the other file classes we have encountered, a GenBankFile provides a low-level interface. In contrast, the biotite.sequence.io.genbank module contains high-level functions to directly obtain useful objects from a GenBankFile object.

from tempfile import gettempdir
import biotite.database.entrez as entrez
import biotite.sequence.io.genbank as gb

file_path = entrez.fetch(
    "AJ311647", gettempdir(), suffix="gb",
    db_name="nuccore", ret_type="gb"
)
file = gb.GenBankFile.read(file_path)
print("Accession:", gb.get_accession(file))
print("Definition:", gb.get_definition(file))

Accession: AJ311647
Definition: Gallus gallus AVD gene for avidin, exons 1-4.

Now that we have loaded the file, we want to have a look at the sequence features. Therefore, we grab the Annotation from the file. An annotation is the collection of features corresponding to one sequence (the sequence itself is not included, though). This Annotation can be iterated in order to obtain single Feature objects. Each Feature contains 3 pieces of information: its feature key (e.g. regulatory or CDS), a dictionary of qualifiers and one or multiple locations on the corresponding sequence. A Location in turn, contains its starting and its ending base/residue position, the strand it is on (only for DNA) and possible location defects (defects will be discussed later). In the next example we will print the keys of the features and their locations:

annotation = gb.get_annotation(file)
for feature in annotation:
    # Convert the feature locations in better readable format
    locs = [str(loc) for loc in sorted(feature.locs, key=lambda l: l.first)]
    print(f"{feature.key:12}   {locs}")

CDS            ['98-178 >', '263-473 >', '899-1019 >', '1107-1152 >']
gene           ['98-1152 >']
exon           ['899-1019 >']
regulatory     ['1215-1220 >']
mRNA           ['98-178 >', '263-473 >', '899-1019 >', '1107-1152 >']
intron         ['179-262 >']
sig_peptide    ['98-169 >']
exon           ['263-473 >']
regulatory     ['26-33 >']
intron         ['474-898 >']
source         ['1-1224 >']
exon           ['98-178 >']
exon           ['1107-1152 >']
intron         ['1020-1106 >']

The '>' characters in the string representations of a location indicate that the location is on the forward strand. Most of the features have only one location, except the mRNA and CDS feature, which have 4 locations joined. When we look at the rest of the features, this makes sense: The gene has 4 exons. Therefore, the mRNA (and consequently the CDS) is composed of these exons.

The two regulatory features are the TATA box and the poly-A signal, as the feature qualifiers make clear:

for feature in annotation:
    if feature.key == "regulatory":
        print(feature.qual["regulatory_class"])

polyA_signal_sequence
TATA_box

Similarly to Alignment objects, we can visualize an Annotation using the biotite.sequence.graphics subpackage, in a so called feature map. In order to avoid overlapping features, we draw only the CDS feature.

import matplotlib.pyplot as plt
import biotite.sequence as seq
import biotite.sequence.graphics as graphics

# Get the range of the entire annotation via the *source* feature
for feature in annotation:
    if feature.key == "source":
        # loc_range has an exclusive stop
        loc = list(feature.locs)[0]
        loc_range = (loc.first, loc.last+1)
fig, ax = plt.subplots(figsize=(8.0, 1.0))
graphics.plot_feature_map(
    ax,
    seq.Annotation(
        [feature for feature in annotation if feature.key == "CDS"]
    ),
    multi_line=False,
    loc_range=loc_range,
    show_line_position=True
)
fig.tight_layout()

Annotation objects can be indexed with slices, that represent the start and the exclusive stop base/residue of the annotation from which the subannotation is created. All features, that are not in this range, are not included in the subannotation. In order to demonstrate this indexing method, we create a subannotation that includes only features in range of the gene itself (without the regulatory parts).

# At first, we have to find the feature with the 'gene' key
for feature in annotation:
    if feature.key == "gene":
        gene_feature = feature
# Then, we create a subannotation from the feature's location
# Since the stop value of the slice is still exclusive,
# the stop value is the position of the last base + 1
loc = list(gene_feature.locs)[0]
sub_annot = annotation[loc.first : loc.last + 1]
# Print the remaining features and their locations
for feature in sub_annot:
    locs = [str(loc) for loc in sorted(feature.locs, key=lambda l: l.first)]
    print(f"{feature.key:12}   {locs}")

CDS            ['98-178 >', '263-473 >', '899-1019 >', '1107-1152 >']
gene           ['98-1152 >']
exon           ['899-1019 >']
source         ['98-1152 >']
mRNA           ['98-178 >', '263-473 >', '899-1019 >', '1107-1152 >']
intron         ['179-262 >']
sig_peptide    ['98-169 >']
exon           ['263-473 >']
intron         ['474-898 >']
exon           ['98-178 >']
exon           ['1107-1152 >']
intron         ['1020-1106 >']

The regulatory sequences have disappeared in the subannotation. Another interesting thing happened: The location of the source feature narrowed and is in range of the slice now. This happened because the feature was truncated: The bases that were not in range of the slice were removed.

Let’s have a closer look at location defects now: A Location instance has a defect when the feature itself is not directly located in the range of the first to the last base, for example when the exact position is not known or, as in our case, a part of the feature was truncated. Let’s have a closer look at the location defects of our subannotation:

for feature in sub_annot:
    defects = [str(location.defect) for location
               in sorted(feature.locs, key=lambda l: l.first)]
    print(f"{feature.key:12}   {defects}")

CDS            ['Defect.NONE', 'Defect.NONE', 'Defect.NONE', 'Defect.NONE']
gene           ['Defect.BEYOND_LEFT|BEYOND_RIGHT']
exon           ['Defect.NONE']
source         ['Defect.MISS_LEFT|MISS_RIGHT']
mRNA           ['Defect.BEYOND_LEFT', 'Defect.NONE', 'Defect.NONE', 'Defect.BEYOND_RIGHT']
intron         ['Defect.NONE']
sig_peptide    ['Defect.NONE']
exon           ['Defect.NONE']
intron         ['Defect.NONE']
exon           ['Defect.NONE']
exon           ['Defect.NONE']
intron         ['Defect.NONE']

The class Location.Defect is a Flag. This means that multiple defects can be combined into one value. NONE means that the location has no defect, which is true for most of the features. The source feature has a defect—a combination of MISS_LEFT and MISS_RIGHT. MISS_LEFT is applied if a feature was truncated before the first base, and MISS_RIGHT is applied if a feature was truncated after the last base. Since source was truncated from both sides, the combination is applied. gene has the defect values BEYOND_LEFT and BEYOND_RIGHT. These defects already appear in the GenBank file since the gene is defined as the unit that is transcribed into one (pre-)mRNA. As the transcription starts somewhere before the start of the coding region and the exact start location is not known, BEYOND_LEFT is applied. In an analogous way, the transcription stops somewhere after the coding region (at the terminator signal). Hence, BEYOND_RIGHT is applied. These two defects are also reflected in the mRNA feature.

Annotated sequences#

An AnnotatedSequence is like an annotation, but the sequence is included this time. Since our GenBank file contains the sequence corresponding to the feature table, we can directly obtain the AnnotatedSequence.

annot_seq = gb.get_annotated_sequence(file)
print("Same annotation as before?", (annotation == annot_seq.annotation))
print(annot_seq.sequence[:60], "...")

Same annotation as before? True
ACTGGGCAGAGTCAGTGCTGGAAGCAATMAAAAGGCGAGGGAGCAGGCAGGGGTGAGTCC ...

When indexing an AnnotatedSequence with a slice, the index is applied to the Annotation and the Sequence. While the Annotation handles the index as shown before, the Sequence is indexed based on the sequence start value (usually 1).

print("Sequence start before indexing:", annot_seq.sequence_start)
for feature in annot_seq.annotation:
    if feature.key == "regulatory" \
        and feature.qual["regulatory_class"] == "polyA_signal_sequence":
            polya_feature = feature
loc = list(polya_feature.locs)[0]
# Get annotated sequence containing only the poly-A signal region
poly_a = annot_seq[loc.first : loc.last + 1]
print("Sequence start after indexing:", poly_a.sequence_start)
print(poly_a.sequence)

Sequence start before indexing: 1
Sequence start after indexing: 1215
AATAAA

Here we get the poly-A signal sequence 'AATAAA'. As you might have noticed, the sequence start has shifted to the start of the slice index (the first base of the regulatory feature).

Warning

Since AnnotatedSequence objects use base position indices and Sequence objects use array position indices, you will get different results for annot_seq[n:m].sequence and annot_seq.sequence[n:m].

There is also a convenient way to obtain the sequence corresponding to a feature, even if the feature contains multiple locations or a location is on the reverse strand: Simply use a Feature object (in this case the CDS feature) as index.

for feature in annot_seq.annotation:
    if feature.key == "CDS":
        cds_feature = feature
cds_seq = annot_seq[cds_feature]
print(cds_seq[:60], "...")

ATGGTGCACGCAACCTCCCCGCTGCTGCTGCTGCTGCTGCTCAGCCTGGCTCTGGTGGCT ...

Now we can translate the sequence and compare it with the translation given by the CDS feature. But before we can do that, we have to prepare the data: The DNA sequence uses an ambiguous alphabet due to the nasty 'M' at position 28 of the original sequence, we have to remove the stop symbol after translation and we need to remove the whitespace characters in the translation given by the CDS feature.

# To make alphabet unambiguous, we create a new NucleotideSequence
# containing only the CDS portion, which is unambiguous
# Thus, the resulting NucleotideSequence has an unambiguous alphabet
cds_seq = seq.NucleotideSequence(cds_seq)
# Now we can translate the unambiguous sequence.
prot_seq = cds_seq.translate(complete=True)
print(prot_seq[:60], "...")
print(
    "Are the translated sequences equal?",
    # Remove stops of our translation
    (str(prot_seq.remove_stops()) ==
    # Remove whitespace characters from translation given by CDS feature
    cds_feature.qual["translation"].replace(" ", ""))
)

MVHATSPLLLLLLLSLALVAPGLSARKCSLTGKWDNDLGSNMTIGAVNSKGEFTGTYTTA ...
Are the translated sequences equal? True