Types of sequences

Biotite provides predefined classes for the most common types of sequences: nucleotide and amino acid sequences. At the end of this chapter, we will also look at how to create a sequence with a custom alphabet.

Nucleotide sequences

The class NucleotideSequence represents DNA. A NucleotideSequence may use one of two alphabets: an unambiguous alphabet containing the letters 'A', 'C', 'G' and 'T', and an ambiguous alphabet that additionally contains the standard letters for ambiguous nucleic bases. A NucleotideSequence automatically determines which alphabet is required, unless an alphabet is specified. If you want to work with RNA sequences, you can use this class too: you just need to replace each 'U' with 'T'.

import biotite.sequence as seq

# Create a nucleotide sequence using a string
# The constructor can take any iterable object (e.g. a list of symbols)
seq1 = seq.NucleotideSequence("ACCGTATCAAG")
print(seq1.get_alphabet())
# Constructing a sequence with ambiguous nucleic bases
seq2 = seq.NucleotideSequence("TANNCGNGG")
print(seq2.get_alphabet())
('A', 'C', 'G', 'T')
('A', 'C', 'G', 'T', 'R', 'Y', 'W', 'S', 'M', 'K', 'H', 'B', 'V', 'D', 'N')
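
The RNA note above can be sketched in plain Python. The helper name rna_to_dna_letters is hypothetical and not part of Biotite; it only prepares a string of letters that could then be passed to seq.NucleotideSequence as in the example above.

```python
def rna_to_dna_letters(rna: str) -> str:
    """Replace every uracil letter with thymine (hypothetical helper)."""
    # Handle both upper and lower case letters
    return rna.replace("U", "T").replace("u", "t")


dna_letters = rna_to_dna_letters("AUGGCUUAA")
print(dna_letters)
```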

The reverse complement of a DNA sequence is created by chaining the Sequence.reverse() and NucleotideSequence.complement() methods.

# Lower case characters are automatically capitalized
seq1 = seq.NucleotideSequence("tacagtt")
print("Original:", seq1)
seq2 = seq1.reverse().complement()
print("Reverse complement:", seq2)
Original: TACAGTT
Reverse complement: AACTGTA
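
To make explicit what the chained calls compute, here is a plain-Python illustration that covers the unambiguous alphabet only. It is not Biotite's implementation, and reverse_complement is a hypothetical helper name chosen for this sketch.

```python
# Watson-Crick partners for the unambiguous DNA alphabet
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna: str) -> str:
    """Read the sequence backwards and swap each base for its partner."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


print(reverse_complement("tacagtt"))  # AACTGTA, matching the output above
```

Applying the function twice returns the original (capitalized) sequence, which is a quick sanity check for any reverse complement routine.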

Protein sequences

The other central Sequence type is ProteinSequence. It supports the letters for the 20 standard amino acids plus some letters for ambiguous amino acids and a letter for a stop signal. Furthermore, this class provides some utilities like 3-letter to 1-letter translation (and vice versa).

prot_seq = seq.ProteinSequence("BIQTITE")
print("-".join([seq.ProteinSequence.convert_letter_1to3(symbol)
                for symbol in prot_seq]))
ASX-ILE-GLN-THR-ILE-THR-GLU

A NucleotideSequence can be translated into a ProteinSequence via the NucleotideSequence.translate() method. By default, the method searches for open reading frames (ORFs) in the 3 frames of the sequence. A 6-frame ORF search requires an additional call of NucleotideSequence.translate() on the reverse complement of the sequence. If you want a complete 1-frame translation of the sequence, irrespective of any start and stop codons, set the parameter complete to True.

dna = seq.NucleotideSequence("CATATGATGTATGCAATAGGGTGAATG")
proteins, pos = dna.translate()
for i in range(len(proteins)):
    print(
        f"Protein sequence {str(proteins[i])} "
        f"from base {pos[i][0]+1} to base {pos[i][1]}"
    )
protein = dna.translate(complete=True)
print("Complete translation:", str(protein))
Protein sequence MMYAIG* from base 4 to base 24
Protein sequence MYAIG* from base 7 to base 24
Protein sequence MQ* from base 11 to base 19
Protein sequence M from base 25 to base 27
Complete translation: HMMYAIG*M
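
The behavior visible in this output can be reproduced with a small plain-Python sketch. It is not Biotite's implementation: it assumes, based on the output above, that every 'ATG' opens an ORF that runs to the next in-frame stop codon or to the end of the sequence. The names find_orfs and translate_complete are hypothetical.

```python
from itertools import product

# NCBI "Standard" genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
)
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_complete(dna: str) -> str:
    """Translate frame 1 codon by codon, ignoring start and stop signals."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, usable, 3))


def find_orfs(dna: str, start_codon: str = "ATG"):
    """Yield (protein, start, stop) for each start codon in all 3 frames.

    `start` is 0-based and `stop` is exclusive, like the positions above.
    """
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != start_codon:
            continue
        protein = []
        stop = start
        # Read in-frame codons until a stop codon or the sequence end
        while stop + 3 <= len(dna):
            aa = CODON_TO_AA[dna[stop:stop + 3]]
            protein.append(aa)
            stop += 3
            if aa == "*":
                break
        yield "".join(protein), start, stop


dna = "CATATGATGTATGCAATAGGGTGAATG"
for protein, start, stop in find_orfs(dna):
    print(f"{protein}: base {start + 1} to base {stop}")
print("Complete:", translate_complete(dna))
```

Running find_orfs on the reverse complement of the sequence as well would give the remaining 3 frames of a 6-frame search; the positions then refer to the reverse-complemented sequence.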

The example above uses the default CodonTable instance. A different table can be passed via the codon_table parameter. A CodonTable maps codons to amino acids and defines start codons (both in symbol and code form). A CodonTable is mainly used in the NucleotideSequence.translate() method, but it can also be used to look up the amino acid for a codon and, vice versa, the codons for an amino acid.

table = seq.CodonTable.default_table()
# Find the amino acid encoded by a given codon
print(table["TAC"])
# Find the codons encoding a given amino acid
print(table["Y"])
# Also works for codes instead of symbols
print(table[(1,2,3)])
print(table[14])
Y
('TAC', 'TAT')
14
((0, 2, 0), (0, 2, 2), (1, 2, 0), (1, 2, 1), (1, 2, 2), (1, 2, 3))
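
The code form in this output can be read back into letters by indexing the unambiguous nucleotide alphabet printed earlier, ('A', 'C', 'G', 'T'). A small plain-Python sketch, where codon_code_to_symbols is a hypothetical helper and not a Biotite function:

```python
# Symbol order of the unambiguous nucleotide alphabet, as printed earlier
NUCLEOTIDE_ALPHABET = ("A", "C", "G", "T")


def codon_code_to_symbols(codon_code):
    """Turn a codon in code form, e.g. (1, 2, 3), into letters."""
    return "".join(NUCLEOTIDE_ALPHABET[code] for code in codon_code)


codon_codes = (
    (0, 2, 0), (0, 2, 2), (1, 2, 0), (1, 2, 1), (1, 2, 2), (1, 2, 3)
)
codons = [codon_code_to_symbols(code) for code in codon_codes]
print(codons)
```

The decoded codons are AGA, AGG, CGA, CGC, CGG and CGT, the six arginine codons of the standard genetic code, so the amino acid code 14 stands for arginine here.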

The default CodonTable is equal to the NCBI “Standard” table, with the small difference that only 'ATG' qualifies as start codon. You can also use any other official NCBI table via CodonTable.load().

# Use the official NCBI table name
table = seq.CodonTable.load("Yeast Mitochondrial")
print("Yeast Mitochondrial:")
print(table)
print()
# Use the official NCBI table ID
table = seq.CodonTable.load(11)
print("Bacterial:")
print(table)
Yeast Mitochondrial:
AAA K      AAC N      AAG K      AAT N
ACA T      ACC T      ACG T      ACT T
AGA R      AGC S      AGG R      AGT S
ATA M i    ATC I      ATG M i    ATT I

CAA Q      CAC H      CAG Q      CAT H
CCA P      CCC P      CCG P      CCT P
CGA R      CGC R      CGG R      CGT R
CTA T      CTC T      CTG T      CTT T

GAA E      GAC D      GAG E      GAT D
GCA A      GCC A      GCG A      GCT A
GGA G      GGC G      GGG G      GGT G
GTA V      GTC V      GTG V      GTT V

TAA *      TAC Y      TAG *      TAT Y
TCA S      TCC S      TCG S      TCT S
TGA W      TGC C      TGG W      TGT C
TTA L      TTC F      TTG L      TTT F

Bacterial:
AAA K      AAC N      AAG K      AAT N
ACA T      ACC T      ACG T      ACT T
AGA R      AGC S      AGG R      AGT S
ATA I i    ATC I i    ATG M i    ATT I

CAA Q      CAC H      CAG Q      CAT H
CCA P      CCC P      CCG P      CCT P
CGA R      CGC R      CGG R      CGT R
CTA L      CTC L      CTG L i    CTT L

GAA E      GAC D      GAG E      GAT D
GCA A      GCC A      GCG A      GCT A
GGA G      GGC G      GGG G      GGT G
GTA V      GTC V      GTG V i    GTT V

TAA *      TAC Y      TAG *      TAT Y
TCA S      TCC S      TCG S      TCT S
TGA *      TGC C      TGG W      TGT C
TTA L      TTC F      TTG L i    TTT F

Feel free to define your own custom codon table via the CodonTable constructor.

Custom sequence types

We can also define a sequence type with a custom alphabet of our own. First, we need to define the symbols that are allowed in the alphabet. So far, we have only encountered alphabets that contain characters, for which the specialized LetterAlphabet class is used (its merits are explained in the next chapter). However, Biotite allows almost any Python object to be used as a symbol in a sequence.

custom_alphabet = seq.Alphabet(["foo", "bar", 42])

There are two ways to define a custom sequence type. If we require some type-specific methods, we can subclass Sequence. Otherwise, we can use the generic GeneralSequence class and pass the custom alphabet as a parameter.

class MySequence(seq.Sequence):
    def get_alphabet(self):
        return custom_alphabet

custom_seq = MySequence(["foo", "bar", 42, 42, "foo"])
print(custom_seq)

custom_seq = seq.GeneralSequence(
    custom_alphabet, ["foo", "bar", 42, 42, "foo"]
)
print(custom_seq)
foo, bar, 42, 42, foo
foo, bar, 42, 42, foo
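
As with the codon table, each symbol of an alphabet has an integer code. The following plain-Python sketch illustrates that idea for the custom alphabet above, under the assumption that a symbol's code is its position in the alphabet. It is an illustration only, not Biotite's Alphabet class, and encode_symbols and decode_codes are hypothetical names.

```python
symbols = ["foo", "bar", 42]
# Assumption: the code of a symbol is its index in the alphabet
symbol_to_code = {symbol: code for code, symbol in enumerate(symbols)}


def encode_symbols(sequence):
    """Map each symbol to its integer code; unknown symbols raise KeyError."""
    return [symbol_to_code[symbol] for symbol in sequence]


def decode_codes(codes):
    """Map integer codes back to the original symbols."""
    return [symbols[code] for code in codes]


codes = encode_symbols(["foo", "bar", 42, 42, "foo"])
print(codes)
print(", ".join(str(symbol) for symbol in decode_codes(codes)))
```

Because the lookup is a dictionary, any hashable Python object can serve as a symbol in this sketch, which mirrors the mixed string and integer symbols used above.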