Biotite is a Python package for computational biologists. It aims to provide a broad set of tools for working with different kinds of biological data, from sequences to structures. On the one hand side, working with Biotite should be computationally efficient, with the help of the powerful packages NumPy and Cython. On the other hand it aims for simple usability and extensibility, so that beginners are not overwhelmed and advanced users can easily build upon the exisiting system to implement their own algorithms.

Biotite provides 4 subpackages: The biotite.sequence subpackage contains functionality for working with sequence information of any kind. The package contains by default sequence types for nucleotides and proteins, but the alphabet-based implementation allows simple integration of own sequence types, even if they do not rely on letters. Beside the standard I/O operations, the package includes general purpose functions for sequence manipulations and global/local alignments. The biotite.structure subpackage enables handling of 3D structures of biomolecules. Simplified, a structure is represented by a list of atoms and their properties, based on NumPy arrays. The subpackage includes read/write functionality for different formats, structure filters, coordinate transformations, angle and bond measurements, structure superimposition and some more advanced analysis capabilities. The biotite.database subpackage is all about downloading data from biological databases, including the probably most important ones: the RCSB PDB and the NCBI Entrez database. The biotite.application subpackage provides interfaces for external software. The interfaces range from locally installed software (e.g. MSA software) to web applications (e.g. BLAST). The speciality is that the interfaces are seamless: You do not have to write input files and read output files, you only have to input Python objects and you get Python objects. It is basically very similar to using normal functions.

In the following sections you will get an overview over the mentioned subpackages, so go and grab some tea and cookies und let us begin.

Preliminary note

The files used in this tutorial will be stored in a temporary directory. The top level package biotite provides functionality to create a temporary directory, called .biotitetemp, in your current working directory. You can either obtain the path to this directory via temp_dir() or directly create an unambiguous file name in this directory using temp_file().

In the end of the session the temporary directory and all its contents will be automatically deleted, so make sure to put the files, you want keep, somewhere else.

from os.path import relpath
import biotite
# Create temporary directory
dir_path = biotite.temp_dir()
# Get a path to a temporary FASTA file
# This would also create the temporary directory,
# if it was not created, yet
file_path = biotite.temp_file("fasta")



Data to work with - The Database subpackage

Biological databases are the backbone of computational biology. The biotite.database subpackage provides interfaces for popular online databases like the RCSB PDB or the NCBI Entrez database.

Fetching structure files from the RCSB PDB

Downloading structure files from the RCSB PDB is quite easy: Simply specify the PDB ID, the file format and the target directory for the fetch() function and you are done. The function even returns the path to the downloaded file, so you can just load it via the other Biotite subpackages (more on this later). We will download on a protein structure of the miniprotein TC5b (PDB: 1L2Y) into a temporary directory.

from os.path import relpath
import biotite
import biotite.database.rcsb as rcsb
file_path = rcsb.fetch("1l2y", "pdb", biotite.temp_dir())



In case you want to download multiple files, you are able to specify a list of PDB IDs, which in return gives you a list of file_paths.

# Download files in the more modern mmCIF format
file_paths = rcsb.fetch(["1l2y", "1aki"], "cif", biotite.temp_dir())
print([relpath(file_path) for file_path in file_paths])


['.biotitetemp/1l2y.cif', '.biotitetemp/1aki.cif']

By default fetch() checks whether the file to be fetched already exists in the directory, and downloads it, if it does not exist yet. If you want to download files irrespectively, set overwrite to true.

# Download file in the fast and small binary MMTF format
file_path = rcsb.fetch("1l2y", "mmtf", biotite.temp_dir(), overwrite=True)

In many cases you are not interested in a specific structure, but you want a set of structures that fits your desired criteria. For this purpose the RCSB SEARCH service can be interfaced.

At first you have to create Query object for the property you want to filter. The search() method takes the Query and returns a list of PDB IDs, which itself can be used as input for fetch().

query = rcsb.ResolutionQuery(0.0, 0.6)
pdb_ids = rcsb.search(query)
files = rcsb.fetch(pdb_ids, "mmtf", biotite.temp_dir())


['1EJG', '1I0T', '3NIR', '3P4J', '5D8V', '5NW3']

Not all query types of the SEARCH service are supported yet. But it is quite easy to implement your needed query type by inheriting SimpleQuery.

Multiple SimpleQuery objects can be ‘and’/’or’ combined using a CompositeQuery.

query1 = rcsb.ResolutionQuery(0.0, 1.0)
query2 = rcsb.MolecularWeightQuery(10000, 100000)
composite = rcsb.CompositeQuery("and", [query1, query2])

Fetching files from the NCBI Entrez database

Another important source of biological information is the NCBI Entrez database, which is commonly known as the NCBI. It provides a myriad of information, ranging from sequences and sequence features to scientific papers. Fetching files from NCBI Entrez works analogous to the RCSB interface. This time we have to provide the UIDs (Accession or GI) instead of PDB IDs to the fetch() function. Furthermore, we need to specifiy the database to retrieve the data from and the retrieval type.

from os.path import relpath
import biotite
import biotite.database.entrez as entrez
# Fetch a single UID ...
file_path = entrez.fetch(
    "NC_001416", biotite.temp_dir(), suffix="fa",
    db_name="nuccore", ret_type="fasta"
# ... or multiple UIDs
file_paths = entrez.fetch(
    ["1L2Y_A","1AKI_A"], biotite.temp_dir(), suffix="fa",
    db_name="protein", ret_type="fasta"
print([relpath(file_path) for file_path in file_paths])


['.biotitetemp/1L2Y_A.fa', '.biotitetemp/1AKI_A.fa']

A list of valid database, retrieval type and mode combinations can be found here. Furthermore get_database_name() can be helpful to get the required database name by the more commonly known names.




The Entrez database allows for packing data for multiple UIDs into a single file. This is achieved with the fetch_single_file() function.

file_path = entrez.fetch_single_file(
    ["1L2Y_A","1AKI_A"], biotite.temp_file("fa"),
    db_name="protein", ret_type="fasta"



Similar to the RCSB PDB, you can also search in the NCBI Entrez database, but in an even more powerful manner: Due to the simple design of the search queries accepted by NCBI Entrez, you can search in every field of the database.

# Search in all fields
print(entrez.SimpleQuery("BL21 genome"))
# Search in the 'Organism' field
print(entrez.SimpleQuery("Escherichia coli", field="Organism"))


"BL21 genome"
"Escherichia coli"[Organism]

You can even combine multiple Query objects in any way you like using the binary operators |, & and ^, that represent OR, AND and NOT linkage, respectively.

composite_query = (
    entrez.SimpleQuery("50:100", field="Sequence Length") &
        entrez.SimpleQuery("Escherichia coli", field="Organism") |
        entrez.SimpleQuery("Bacillus subtilis", field="Organism")


(50:100[Sequence Length]) AND (("Escherichia coli"[Organism]) OR ("Bacillus subtilis"[Organism]))

Finally, the query is given to the search() function to obtain the GIs, that can be used as input to fetch().

# Return a maximum number of 10 entries
gis = entrez.search(composite_query, "protein", number=10)


['1516406054', '1516338937', '1516072008', '1516072006', '1516068324', '1516068321', '1516068319', '1516068307', '1516068305', '1516068304']

From A to T - The Sequence subpackage

biotite.sequence is a Biotite subpackage concerning maybe the most popular data type in computational molecular biology: sequences. The instantiation can be quite simple as

import biotite.sequence as seq
dna = seq.NucleotideSequence("AACTGCTA")



This example shows NucleotideSequence which is a subclass of the abstract base class Sequence. A NucleotideSequence accepts an iterable object of strings, where each string can be 'A', 'C', 'G' or 'T'. Each of these letters is called a symbol.

In general the sequence implementation in Biotite allows for sequences of anything. This means any (immutable an hashable) Python object can be used as a symbol in a sequence, as long as the object is part of the Alphabet of the particular Sequence. An Alphabet object simply represents a list of objects that are allowed to occur in a Sequence. The following figure shows how the symbols are stored in a Sequence object.


When setting the Sequence object with a sequence of symbols, the Alphabet of the Sequence encodes each symbol in the input sequence into a so called symbol code. The encoding process is quite simple: A symbol s is at index i in the list of allowed symbols in the alphabet, so the symbol code for s is i. If s is not in the alphabet, an AlphabetError is raised. The array of symbol codes, that arises from encoding the input sequence, is called sequence code. This sequence code is now stored in an internal integer ndarray in the Sequence object. The sequence code is now accessed via the code attribute, the corresponding symbols via the symbols attribute.

This approach has multiple advantages:

  • Ability to create sequences of anything
  • Sequence utility functions (searches, alignments,…) usually do not care about the specific sequence type, since they work with the internal sequence code
  • Integer type for sequence code is only as large as the alphabet requests
  • Sequence codes can be directly used as substitution matrix indices in alignments

Effectively, this means a potential Sequence subclass could look like following:

class NonsenseSequence(seq.Sequence):

    alphabet = seq.Alphabet([42, "foo", b"bar"])

    def get_alphabet(self):
        return NonsenseSequence.alphabet

sequence = NonsenseSequence(["foo", b"bar", 42, "foo", "foo", 42])
print("Alphabet:", sequence.get_alphabet())
print("Symbols:", sequence.symbols)
print("Code:", sequence.code)


Alphabet: [42, 'foo', b'bar']
Symbols: ['foo', b'bar', 42, 'foo', 'foo', 42]
Code: [1 2 0 1 1 0]

From DNA to Protein

Biotite offers two prominent Sequence sublasses:

The NucleotideSequence represents DNA. It may use two different alphabets - an unambiguous alphabet containing the letters 'A', 'C', 'G' and 'T' and an ambiguous alphabet containing additionally the standard letters for ambiguous nucleic bases. A NucleotideSequence determines automatically which alphabet is required, unless an alphabet is specified. If you want to work with RNA sequences you can use this class, too, you just need to replace the 'U' with 'T'.

import biotite.sequence as seq
# Create a nucleotide sequence using a string
# The constructor can take any iterable object (e.g. a list of symbols)
seq1 = seq.NucleotideSequence("ACCGTATCAAG")
# Constructing a sequence with ambiguous nucleic bases
seq2 = seq.NucleotideSequence("TANNCGNGG")


['A', 'C', 'G', 'T']
['A', 'C', 'G', 'T', 'R', 'Y', 'W', 'S', 'M', 'K', 'H', 'B', 'V', 'D', 'N', 'X']

The reverse complement of a DNA sequence is created by chaining the Sequence.reverse() and the NucleotideSequence.complement() method.

# Lower case characters are automatically capitalized
seq1 = seq.NucleotideSequence("tacagtt")
print("Original:", seq1)
seq2 = seq1.reverse().complement()
print("Reverse complement:", seq2)


Original: TACAGTT
Reverse complement: AACTGTA

The other Sequence type is ProteinSequence. It supports the letters for the 20 standard amino acids plus some letters for ambiguous amino acids and a letter for a stop signal. Furthermore, this class provides some utilities like 3-letter to 1-letter translation (and vice versa).

prot_seq = seq.ProteinSequence("BIQTITE")
                for symbol in prot_seq]))



A NucleotideSequence can be translated into a ProteinSequence via the NucleotideSequence.translate() method. By default, the method searches for open reading frames (ORFs) in the 3 frames of the sequence. A 6 frame ORF search requires an additional call of NucleotideSequence.translate() with the reverse complement of the sequence. If you want to conduct a complete translation of the sequence, irrespective of any start and stop codons, set the parameter complete to true.

dna = seq.NucleotideSequence("CATATGATGTATGCAATAGGGTGAATG")
proteins, pos = dna.translate()
for i in range(len(proteins)):
    print("Protein sequence {:} from base {:d} to base {:d}"
            .format(str(proteins[i]), pos[i][0]+1, pos[i][1]))
protein = dna.translate(complete=True)
print("Complete translation:", str(protein))


Protein sequence MMYAIG* from base 4 to base 24
Protein sequence MYAIG* from base 7 to base 24
Protein sequence MQ* from base 11 to base 19
Protein sequence M from base 25 to base 27
Complete translation: HMMYAIG*M

The upper example uses the default CodonTable instance. This can be changed with the codon_table parameter. A CodonTable maps codons to amino acid and defines start codons (both in symbol and code form). A CodonTable is mainly used in the NucleotideSequence.translate() method, but can also be used to find the corresponding amino acid for a codon and vice versa.

table = seq.CodonTable.default_table()
# Find the amino acid encoded by a given codon
# Find the codons encoding a given amino acid
# Works also for codes instead of symbols


('TAT', 'TAC')
((1, 2, 3), (1, 2, 1), (1, 2, 0), (1, 2, 2), (0, 2, 0), (0, 2, 2))

The default CodonTable is equal to the NCBI “Standard” table, with the small difference that only 'ATG' qualifies as start codon. You can also use any other official NCBI table via CodonTable.load().

# Use the official NCBI table name
table = seq.CodonTable.load("Yeast Mitochondrial")
print("Yeast Mitochondrial:")
# Use the official NCBI table ID
table = seq.CodonTable.load(11)


Yeast Mitochondrial:
AAA K      AAC N      AAG K      AAT N
ACA T      ACC T      ACG T      ACT T
AGA R      AGC S      AGG R      AGT S
ATA M i    ATC I      ATG M i    ATT I

CAA Q      CAC H      CAG Q      CAT H
CCA P      CCC P      CCG P      CCT P
CGA R      CGC R      CGG R      CGT R
CTA T      CTC T      CTG T      CTT T

GAA E      GAC D      GAG E      GAT D
GCA A      GCC A      GCG A      GCT A
GGA G      GGC G      GGG G      GGT G
GTA V      GTC V      GTG V      GTT V

TAA *      TAC Y      TAG *      TAT Y
TCA S      TCC S      TCG S      TCT S
TGA W      TGC C      TGG W      TGT C
TTA L      TTC F      TTG L      TTT F

AAA K      AAC N      AAG K      AAT N
ACA T      ACC T      ACG T      ACT T
AGA R      AGC S      AGG R      AGT S
ATA I i    ATC I i    ATG M i    ATT I

CAA Q      CAC H      CAG Q      CAT H
CCA P      CCC P      CCG P      CCT P
CGA R      CGC R      CGG R      CGT R
CTA L      CTC L      CTG L i    CTT L

GAA E      GAC D      GAG E      GAT D
GCA A      GCC A      GCG A      GCT A
GGA G      GGC G      GGG G      GGT G
GTA V      GTC V      GTG V i    GTT V

TAA *      TAC Y      TAG *      TAT Y
TCA S      TCC S      TCG S      TCT S
TGA *      TGC C      TGG W      TGT C
TTA L      TTC F      TTG L i    TTT F

Feel free to define your own custom codon table via the CodonTable constructor.

Loading sequences from file

Biotite enables the user to load and save sequences from/to the popular FASTA format via the FastaFile class. A FASTA file may contain multiple sequences. Each sequence entry starts with a line with a leading '>' and a trailing header name. The corresponding sequence is specified in the following lines until the next header or end of file. Since every sequence has its obligatory header, a FASTA file is predestinated to be implemented as some kind of dictionary. This is exactly what has been done in Biotite: The header strings (without the '>') are used as keys to access the respective sequence strings. Actually you can cast the FastaFile object into a dict. Let’s demonstrate this on the genome of the lambda phage (Accession: NC_001416`). After downloading the FASTA file from the NCBI Entrez database, we can load the contents in the following way:

import biotite
import biotite.sequence as seq
import biotite.sequence.io.fasta as fasta
import biotite.database.entrez as entrez
file_path = entrez.fetch(
    "NC_001416", biotite.temp_dir(), suffix="fa",
    db_name="nuccore", ret_type="fasta"
file = fasta.FastaFile()
for header, string in file.items():
    print("Header:", header)
    print("Sequence:", string[:50], "...")
    print("Sequence length:", len(string))


Header: NC_001416.1 Enterobacteria phage lambda, complete genome
Sequence length: 48502

Since there is only a single sequence in the file, the loop is run only one time. As the sequence string is very long, only the first 50 bp are printed. Now this string could be used as input parameter for creation of a NucleotideSequence. But we want to spare ourselves some unnecessary work, there is already a convenience function for that:

dna_seq = fasta.get_sequence(file)



In this form get_sequence() returns the first sequence in the file, which is also the only sequence in most cases. If you want the sequence corresponding to a specific header, you have to specifiy the header parameter. The function even automatically recognizes, if the file contains a DNA or protein sequence and returns a NucleotideSequence or ProteinSequence, instance respectively. Actually, it just tries to create a NucleotideSequence, and if this fails, a ProteinSequence is created instead.

Sequences can be written into FASTA files in a similar way: either via dictionary-like access or using the set_sequence() convenience function.

# Create new empty FASTA file
file = fasta.FastaFile()
# PROTIP: Let your cat walk over the keyboard
dna_seq1 = seq.NucleotideSequence("ATCGGATCTATCGATGCTAGCTACAGCTAT")
dna_seq2 = seq.NucleotideSequence("ACGATCTACTAGCTGATGTCGTGCATGTACG")
# Append entries to file...
# ... via set_sequence()
fasta.set_sequence(file, dna_seq1, header="gibberish")
# .. or dictionary style
file["more gibberish"] = str(dna_seq2)


>more gibberish

As you see, our file contains our new 'gibberish' and 'more gibberish' sequences now.

In a similar manner sequences and sequence quality scores can be read from FASTQ files. For further reference, have a look at the biotite.sequence.io.fastq subpackage.

Alternatively, a sequence can also be loaded from GenBank or GenPept files, using the GenBankFile and GenPeptFile class (more on this later).

Sequence alignments

When comparing two (or more) sequences, usually an alignment needs to be performed. Two kinds of algorithms need to be distinguished here: Heuristic algorithms do not guarantee to yield the optimal alignment, but instead they are very fast. On the other hand, there are algorithms that calculate the optimal (maximum similarity score) alignment, but are quite slow.

The biotite.sequence.align package provides the function align_optimal(), which either performs an optimal global alignment, using the Needleman-Wunsch algorithm, or an optimal local alignment, using the Smith-Waterman algorithm. By default it uses a general gap penalty, but an affine gap penalty can be used, too.

Most functions in biotite.sequence.align can align any two Sequence objects with each other. In fact the Sequence objects can be instances from different Sequence subclasses and therefore may have different alphabets. The only condition that must be satisfied, is that the class:SubstitutionMatrix alphabets matches the alphabets of the sequences to be aligned.

But wait, what’s a SubstitutionMatrix? This class maps a similarity score to two symbols, one from the first sequence the other from the second sequence. A SubstitutionMatrix object contains two alphabets with length n or m, respectively, and an (n,m)-shaped ndarray storing the similarity scores. You can choose one of many predefined matrices from an internal database or you can create a custom matrix on your own.

So much for theory. Let’s start by showing different ways to construct a SubstitutionMatrix, in our case for protein sequence alignments:

import biotite.sequence as seq
import biotite.sequence.align as align
import numpy as np
alph = seq.ProteinSequence.alphabet
# Load the standard protein substitution matrix, which is BLOSUM62
matrix = align.SubstitutionMatrix.std_protein_matrix()
# Load another matrix from internal database
matrix = align.SubstitutionMatrix(alph, alph, "BLOSUM50")
# Load a matrix dictionary representation,
# modify it, and create the SubstitutionMatrix
# (Dictionary could be loaded from matrix string in NCBI format, too)
matrix_dict = align.SubstitutionMatrix.dict_from_db("BLOSUM62")
matrix_dict[("P","Y")] = 100
matrix = align.SubstitutionMatrix(alph, alph, matrix_dict)
# And now create a matrix by directly provding the ndarray
# containing the similarity scores
# (identity matrix in our case)
scores = np.identity(len(alph), dtype=int)
matrix = align.SubstitutionMatrix(alph, alph, scores)
print("\n\nIdentity matrix\n")



    A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y   B   Z   X   *
A   4   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -2  -2  -1   0  -4
C   0   9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -2  -3  -3  -2  -4
D  -2  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -3   4   1  -1  -4
E  -1  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -2   1   4  -1  -4
F  -2  -2  -3  -3   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1   3  -3  -3  -1  -4
G   0  -3  -1  -2  -3   6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -3  -1  -2  -1  -4
H  -2  -3  -1   0  -1  -2   8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2   2   0   0  -1  -4
I  -1  -1  -3  -3   0  -4  -3   4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1  -3  -3  -1  -4
K  -1  -3  -1   1  -3  -2  -1  -3   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -2   0   1  -1  -4
L  -1  -1  -4  -3   0  -4  -3   2  -2   4   2  -3  -3  -2  -2  -2  -1   1  -2  -1  -4  -3  -1  -4
M  -1  -1  -3  -2   0  -3  -2   1  -1   2   5  -2  -2   0  -1  -1  -1   1  -1  -1  -3  -1  -1  -4
N  -2  -3   1   0  -3   0   1  -3   0  -3  -2   6  -2   0   0   1   0  -3  -4  -2   3   0  -1  -4
P  -1  -3  -1  -1  -4  -2  -2  -3  -1  -3  -2  -2   7  -1  -2  -1  -1  -2  -4  -3  -2  -1  -2  -4
Q  -1  -3   0   2  -3  -2   0  -3   1  -2   0   0  -1   5   1   0  -1  -2  -2  -1   0   3  -1  -4
R  -1  -3  -2   0  -3  -2   0  -3   2  -2  -1   0  -2   1   5  -1  -1  -3  -3  -2  -1   0  -1  -4
S   1  -1   0   0  -2   0  -1  -2   0  -2  -1   1  -1   0  -1   4   1  -2  -3  -2   0   0   0  -4
T   0  -1  -1  -1  -2  -2  -2  -1  -1  -1  -1   0  -1  -1  -1   1   5   0  -2  -2  -1  -1   0  -4
V   0  -1  -3  -2  -1  -3  -3   3  -2   1   1  -3  -2  -2  -3  -2   0   4  -3  -1  -3  -2  -1  -4
W  -3  -2  -4  -3   1  -2  -2  -3  -3  -2  -1  -4  -4  -2  -3  -3  -2  -3  11   2  -4  -3  -2  -4
Y  -2  -2  -3  -2   3  -3   2  -1  -2  -1  -1  -2  -3  -1  -2  -2  -2  -1   2   7  -3  -2  -1  -4
B  -2  -3   4   1  -3  -1   0  -3   0  -4  -3   3  -2   0  -1   0  -1  -3  -4  -3   4   1  -1  -4
Z  -1  -3   1   4  -3  -2   0  -3   1  -3  -1   0  -1   3   0   0  -1  -2  -3  -2   1   4  -1  -4
X   0  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2  -1  -1   0   0  -1  -2  -1  -1  -1  -1  -4
*  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4   1

Identity matrix

    A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y   B   Z   X   *
A   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
C   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
D   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
E   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
F   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
G   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
H   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
I   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
K   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
L   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0
M   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0
N   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0
P   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0
Q   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
R   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0
S   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0
T   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0
V   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0
W   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0
Y   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0
B   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
Z   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0
X   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0
*   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1

For our protein alignment we will use the standard BLOSUM62 matrix.

seq1 = seq.ProteinSequence("BIQTITE")
seq2 = seq.ProteinSequence("IQLITE")
matrix = align.SubstitutionMatrix.std_protein_matrix()
print("\nLocal alignment")
alignments = align.align_optimal(seq1, seq2, matrix, local=True)
for ali in alignments:
print("Global alignment")
alignments = align.align_optimal(seq1, seq2, matrix, local=False)
for ali in alignments:


Local alignment
Global alignment

The alignment functions return a list of Alignment objects. This object saves the input sequences together with a so called trace - the indices to symbols in these sequences that are aligned to each other (-1 for a gap). Additionally the alignment score is stored in this object. Furthermore, this object can prettyprint the alignment into a human readable form.

For publication purposes you can create an actual figure based on Matplotlib. You can either decide to color the symbols based on the symbol type or based on the similarity within the alignment columns. In this case we will go with the similarity visualization.

import matplotlib.pyplot as plt
import biotite.sequence.graphics as graphics

fig, ax = plt.subplots(figsize=(2.0, 0.8))
    ax, alignments[0], matrix=matrix, symbols_per_line=len(alignments[0])

If you are interested in more advanced visualization examples, have a look at the example gallery.

You can also do some simple analysis on these objects, like determining the sequence identity or calculating the score. For further custom analysis, it can be convenient to have directly the aligned symbos codes instead of the trace.

alignment = alignments[0]
print("Score: ", alignment.score)
print("Recalculated score:", align.score(alignment, matrix=matrix))
print("Sequence identity:", align.get_sequence_identity(alignment))
print("symbols codes:")


Score:  12
Recalculated score: 12
Sequence identity: 0.8333333333333334
[['B', 'I', 'Q', 'T', 'I', 'T', 'E'], [None, 'I', 'Q', 'L', 'I', 'T', 'E']]
symbols codes:
[[20  7 13 16  7 16  3]
 [-1  7 13  9  7 16  3]]

You may ask, why should you recalculate the score, when the score has already been directly calculated via align_optimal(). The answer is, that you might load an alignment from an external alignment program as FASTA file using get_alignment().

If you want to perform a multiple sequence alignment, have a look at the align_multiple() function or the interfaces to external MSA software in the biotite.application subpackage.

Sequence features

Sequence features describe functional parts of a sequence, like coding regions or regulatory parts. One popular source to obtain information about sequence features are GenBank (for DNA and RNA) and GenPept (for peptides) files. As example for sequence features we will work with the GenBank file for the DNA sequence of the avidin gene (Accession: AJ311647`), that we can download from the NCBI Entrez database. After downloading we can load the file using the GenBankFile class from biotite.sequence.io.genbank.

import biotite.sequence.io.genbank as gb
file_path = entrez.fetch(
    "AJ311647", biotite.temp_dir(), suffix="gb",
    db_name="nuccore", ret_type="gb"
file = gb.GenBankFile()
print("Accession:", file.get_accession())
print("Definition:", file.get_definition())


Accession: AJ311647
Definition: Gallus gallus AVD gene for avidin, exons 1-4.

Now that we have loaded the file, we want to have a look at the sequence features. Therefore, we grab the annotation from the file. An annotation is the collection of features corresponding to one sequence (the sequence itself is not included, though). In case of Biotite we can get an Annotation object from the GenBankFile. This Annotation can be iterated in order to obtain single Feature objects. Each Feature contains 3 pieces of information: Its feature key (e.g. regulatory or CDS), a dictionary of qualifiers and one or multiple locations on the corresponding sequence. A Location in turn, contains its starting and its ending base/residue position, the strand it is on (only for DNA) and possible location defects (defects will be discussed later). In the next example we will print the keys of the features and their locations:

annotation = file.get_annotation()
for feature in annotation:
    # Convert the feature locations in better readable format
    locs = [str(loc) for loc in feature.locs]
    print(f"{feature.key:12}   {locs}")


regulatory     ['26-33 >']
intron         ['179-262 >']
mRNA           ['263-473 >', '1107-1152 >', '899-1019 >', '98-178 >']
intron         ['474-898 >']
exon           ['899-1019 >']
source         ['1-1224 >']
exon           ['1107-1152 >']
gene           ['98-1152 >']
exon           ['263-473 >']
regulatory     ['1215-1220 >']
sig_peptide    ['98-169 >']
exon           ['98-178 >']
CDS            ['1107-1152 >', '98-178 >', '263-473 >', '899-1019 >']
intron         ['1020-1106 >']

The '>' characters in the string representations of a location indicate that the location is on the forward strand. Most of the features have only one location, except the mRNA and CDS feature, which have 4 locations joined. When we look at the rest of the features, this makes sense: The gene has 4 exons. Therefore, the mRNA (and consequently the CDS) is composed of these exons.

The two regulatory features are the TATA box and the poly-A signal as the feature qualifiers make clear:

for feature in annotation:
    if feature.key == "regulatory":



Similarily to Alignment objects, we can visualize an Annotation in a feature map. In order to avoid overlaping features, we draw only the CDS feature.

# Get the range of the entire annotation via the *source* feature
for feature in annotation:
    if feature.key == "source":
        # loc_range has exclusive stop
        loc = list(feature.locs)[0]
        loc_range = (loc.first, loc.last+1)
fig, ax = plt.subplots(figsize=(8.0, 1.0))
        [feature for feature in annotation if feature.key == "CDS"]

Annotation objects can be indexed with slices, that represent the start and the stop base/residue of the annotation from which the subannotation is created. All features, that are not in this range, are not included in the subannotation. In order to demonstrate this indexing method, we create a subannotation that includes only features in range of the gene itself (without the regulatory stuff).

# At first we have the find the feature with the 'gene' key
for feature in annotation:
    if feature.key == "gene":
        gene_feature = feature
# Then we create a subannotation from the feature's location
# Since the stop value of the slice is still exclusive,
# the stop value is the position of the last base +1
loc = list(gene_feature.locs)[0]
sub_annot = annotation[loc.first : loc.last +1]
# Print the remaining features and their locations
for feature in sub_annot:
    locs = [str(loc) for loc in feature.locs]
    print(f"{feature.key:12}   {locs}")


source         ['98-1152 >']
exon           ['98-178 >']
intron         ['179-262 >']
mRNA           ['263-473 >', '98-178 >', '899-1019 >', '1107-1152 >']
intron         ['474-898 >']
exon           ['899-1019 >']
exon           ['1107-1152 >']
gene           ['98-1152 >']
sig_peptide    ['98-169 >']
exon           ['263-473 >']
CDS            ['1107-1152 >', '98-178 >', '263-473 >', '899-1019 >']
intron         ['1020-1106 >']

The regulatory sequences have disappeared in the subannotation. Another interesting thing happened: The location of the source feature narrowed and is in range of the slice now. This happened, because the feature was truncated: The bases that were not in range of the slice were removed.

Let’s have a closer look into location defects now: A Location instance has a defect, when the feature itself is not directly located in the range of the first to the last base, for example when the exact postion is not known or, as in our case, a part of the feature was truncated. Let’s have a closer look at the location defects of our subannotation:

for feature in sub_annot:
    defects = [str(location.defect) for location in feature.locs]
    print(f"{feature.key:12}   {defects}")


source         ['Defect.MISS_RIGHT|MISS_LEFT']
exon           ['Defect.NONE']
intron         ['Defect.NONE']
mRNA           ['Defect.NONE', 'Defect.BEYOND_LEFT', 'Defect.NONE', 'Defect.BEYOND_RIGHT']
intron         ['Defect.NONE']
exon           ['Defect.NONE']
exon           ['Defect.NONE']
gene           ['Defect.BEYOND_RIGHT|BEYOND_LEFT']
sig_peptide    ['Defect.NONE']
exon           ['Defect.NONE']
CDS            ['Defect.NONE', 'Defect.NONE', 'Defect.NONE', 'Defect.NONE']
intron         ['Defect.NONE']

The class Location.Defect is a Flag. This means that multiple defects can be combined to one value. NONE means that the location has no defect, which is true for most of the features. The source feature has a defect has a combination of MISS_LEFT and MISS_RIGHT. MISS_LEFT is applied, if a feature was truncated before the first base, and MISS_RIGHT is applied, if a feature was truncated after the last base. Since source was truncated from both sides, the combinated is applied. gene has the defect values BEYOND_LEFT and BEYOND_RIGHT. These defects already appear in the GenBank file, since the gene is defined as the unit that is transcribed into one (pre-)mRNA. As the transcription starts somewhere before the start of the coding region, and the exact location is not known, BEYOND_LEFT is applied. In an analogous way, the transription does stop somewhere after the coding region (at the terminator signal), hence BEYOND_RIGHT is applied. These two defects are also reflected in the mRNA feature.

Now, that you have understood what annotations are, we proceed to the next topic: annotated sequences. An AnnotatedSequence is like an annotation, but the sequence is included this time. Since our GenBank file contains the sequence corresponding to the feature table, we can directly obtain the AnnotatedSequence.

annot_seq = file.get_annotated_sequence()
print("Same annotation as before?", (annotation == annot_seq.annotation))
print(annot_seq.sequence[:60], "...")


Same annotation as before? True

When indexing an AnnotatedSequence with a slice, the index is applied to the Annotation and the Sequence. While the Annotation handles the index as shown before, the Sequence is indexed based on the sequence start value (usually 1).

print("Sequence start before indexing:", annot_seq.sequence_start)
for feature in annot_seq.annotation:
    if feature.key == "regulatory" \
        and feature.qual["regulatory_class"] == "polyA_signal_sequence":
            polya_feature = feature
loc = list(feature.locs)[0]
# Get annotated sequence containing only the poly-A signal region
poly_a = annot_seq[loc.first : loc.last +1]
print("Sequence start after indexing:", poly_a.sequence_start)


Sequence start before indexing: 1
Sequence start after indexing: 1020

Here we get the poly-A signal Sequence 'AATAAA'. As you might have noticed, the sequence start has shifted to the start of the slice index (the first base of the regulatory feature).


Since AnnotatedSequence objects use base position indices and Sequence objects use array position indices, you will get different results for annot_seq[n:m].sequence and annot_seq.sequence[n:m].

There is also a convenient way to obtain the sequence corresponding to a feature, even if the feature contains multiple locations or a location is on the reverse strand: Simply use a Feature object as index.

for feature in annot_seq.annotation:
    if feature.key == "CDS":
        cds_feature = feature
dna_seq = annot_seq[cds_feature]
print(dna_seq[:60], "...")



Awesome. Now we can translate the sequence and compare it with the translation given by the CDS feature. But before we can do that, we have to prepare the data: The DNA sequence uses currently an ambiguous alphabet due to the nasty 'M' at position 28 of the original sequence, we have to remove the stop symbol after translation and we need to remove the space characters in the translation given by the CDS feature.

# This step make the alphabet unambiguous
dna_seq = seq.NucleotideSequence(dna_seq)
prot_seq = dna_seq.translate(complete=True)
print("Are the translated sequences equal?",
        (str(prot_seq.remove_stops()) == \
        cds_feature.qual["translation"].replace(" ", "")))


Are the translated sequences equal? False

Phylogenetic and guide trees

Trees have an important role in bioinformatics, as they are used to guide multiple sequence alignments or to create phylogenies.

In Biotite such a tree is represented by the Tree class in the biotite.sequence.phylo package. A tree is rooted and binary, that means each tree node has two childs, or none in case of leaf nodes. Each node in a tree is represented by a TreeNode. When a TreeNode is created, you have to provide either two child nodes and their distances to this node (leaf node) or a reference index (intermediate node). This reference index is dependent on the context and can refer to anything: sequences, organisms, etc.

The childs and the reference index cannot be changed after object creation. Also the parent can only be set once - when the node is used as child in the creation of a new node.

import biotite.sequence.phylo as phylo
# The reference objects
fruits = ["Apple", "Pear", "Orange", "Lemon", "Banana"]
# Create nodes
apple  = phylo.TreeNode(index=fruits.index("Apple"))
pear   = phylo.TreeNode(index=fruits.index("Pear"))
orange = phylo.TreeNode(index=fruits.index("Orange"))
lemon  = phylo.TreeNode(index=fruits.index("Lemon"))
banana = phylo.TreeNode(index=fruits.index("Banana"))
intermediate1 = phylo.TreeNode(
    child1=apple, child2=pear, child1_distance=2.0, child2_distance=2.0
intermediate2 = phylo.TreeNode(orange, lemon, 1.0, 1.0)
intermediate3 = phylo.TreeNode(intermediate2, banana, 2.0, 3.0)
root = phylo.TreeNode(intermediate1, intermediate3, 2.0, 1.0)
# Create tree from root node
tree = phylo.Tree(root=root)
# Trees can be converted into Newick notation
print("Tree:", tree.to_newick(labels=fruits))
# Distances can be omitted
    "Tree w/o distances:",
    tree.to_newick(labels=fruits, include_distance=False)
# Distances can be measured
distance = tree.get_distance(fruits.index("Apple"), fruits.index("Banana"))
print("Distance Apple-Banana:", distance)


Tree: ((Apple:2.0,Pear:2.0):2.0,((Orange:1.0,Lemon:1.0):2.0,Banana:3.0):1.0):0.0;
Tree w/o distances: ((Apple,Pear),((Orange,Lemon),Banana));
Distance Apple-Banana: 8.0

You can also plot a tree as dendrogram.

fig, ax = plt.subplots(figsize=(6.0, 6.0))
graphics.plot_dendrogram(ax, tree, labels=fruits)

From distances to trees

When you want to create a Tree from distances obtained for example from sequence alignments, you can use the UPGMA algorithm implemented in the function of the same name upgma().

distances = np.array([
    [ 0, 17, 21, 31, 23],
    [17, 0,  30, 34, 21],
    [21, 30, 0,  28, 39],
    [31, 34, 28,  0, 43],
    [23, 21, 39, 43,  0]
tree = phylo.upgma(distances)
fig, ax = plt.subplots(figsize=(6.0, 3.0))
graphics.plot_dendrogram(ax, tree, orientation="top")

Going 3D - The Structure subpackage

biotite.structure is a Biotite subpackage for handling molecular structures. This subpackage enables efficient and easy handling of protein structure data by representing atom attributes in NumPy ndarrays. These atom attributes include so called annotations (polypetide chain id, residue id, residue name, hetero residue information, atom name, element, etc.) and the atom coordinates.

The package contains three central types: Atom, AtomArray and AtomArrayStack. An Atom contains data for a single atom, an AtomArray stores data for an entire model and AtomArrayStack stores data for multiple models, where each model contains the same atoms but differs in the atom coordinates. Both, AtomArrray and AtomArrayStack, store the attributes in NumPy arrays. This approach has multiple advantages:

  • Convenient selection of atoms in a structure by using NumPy style indexing
  • Fast calculations on structures using C-accelerated ndarray operations
  • Simple implementation of custom calculations

Based on the implementation using ndarray objects, this package also contains functions for structure analysis and manipulation.

Creating structures

Let’s begin by constructing some atoms:

import biotite.structure as struc
atom1 = struc.Atom([0,0,0], chain_id="A", res_id=1, res_name="GLY",
                   hetero=False, atom_name="N", element="N")
atom2 = struc.Atom([0,1,1], chain_id="A", res_id=1, res_name="GLY",
                   hetero=False, atom_name="CA", element="C")
atom3 = struc.Atom([0,0,2], chain_id="A", res_id=1, res_name="GLY",
                   hetero=False, atom_name="C", element="C")

The first parameter are the coordinates (internally converted into an ndarray), the other parameters are annotations. The annotations shown in this example are mandatory: If you miss one of these, Python will not diretcly complain, but some operations might not work properly (especially true, when we go to atom arrays and stacks). The mandatory annotation categories are originated in ATOM records in the PDB format. Additionally, you can specify an arbitrary amount of custom annotations, like B-factors, charge, etc. In most cases you won’t work with Atom instances and in even fewer cases Atom instances are created as it is done in the above example.

If you want to work with an entire molecular structure, containing an arbitrary amount of atoms, you have to use so called atom arrays. An atom array can be seen as an array of atom instances (hence the name). But instead of storing Atom instances in a list, an AtomArray instance contains one ndarray for each annotation and the coordinates. In order to see this in action, we first have to create an array from the atoms we constructed before. Then we can access the annotations and coordinates of the atom array simply by specifying the attribute.

array = struc.array([atom1, atom2, atom3])
print("Chain ID:", array.chain_id)
print("Residue ID:", array.res_id)
print("Atom name:", array.atom_name)
print("Coordinates:", array.coord)


Chain ID: ['A' 'A' 'A']
Residue ID: [1 1 1]
Atom name: ['N' 'CA' 'C']
Coordinates: [[0. 0. 0.]
 [0. 1. 1.]
 [0. 0. 2.]]

    A       1 GLY N      N         0.000    0.000    0.000
    A       1 GLY CA     C         0.000    1.000    1.000
    A       1 GLY C      C         0.000    0.000    2.000

The array() builder function takes any iterable object containing Atom instances. If you wanted to, you could even use another AtomArray, which functions also as an iterable object of Atom objects. An alternative way of constructing an array would be creating an AtomArray by using its constructor, which fills the annotation arrays and coordinates with the type respective zero value. In our example all annotation arrays have a length of 3, since we used 3 atoms to create it. A structure containing n atoms, is represented by annotation arrays of length n and coordinates of shape (n,3).

If you want to add further annotation categories to an array, you have to call the add_annotation() or set_annotation() method at first. After that you can access the new annotation array like any other annotation array.

In some cases, you might need to handle structures, where each atom is present in multiple locations (multiple models in NMR structures, MD trajectories). For the cases AtomArrayStack objects are used, which represent a list of atom arrays. Since the atoms are the same for each frame, but only the coordinates change, the annotation arrays in stacks are still the same length n ndarray objects as in atom arrays. However, a stack stores the coordinates in a (m,n,3)-shaped ndarray, where m is the number of frames. A stack is constructed with stack() analogous to the code snipped above. It is crucial that all arrays that should be stacked have the same annotation arrays, otherwise an exception is raised. For simplicity reasons, we create a stack containing two identical models, derived from the previous example.

stack = struc.stack([array, array.copy()])


Model 1
    A       1 GLY N      N         0.000    0.000    0.000
    A       1 GLY CA     C         0.000    1.000    1.000
    A       1 GLY C      C         0.000    0.000    2.000

Model 2
    A       1 GLY N      N         0.000    0.000    0.000
    A       1 GLY CA     C         0.000    1.000    1.000
    A       1 GLY C      C         0.000    0.000    2.000

Loading structures from file

Usually structures are not built from scratch in Biotite, but they are read from a file. Probably the most popular strcuture file format is the PDB format. For our purpose, we will work on a protein structure as small as possible, namely the miniprotein TC5b (PDB: 1L2Y`). The structure of this 20-residue protein (304 atoms) has been elucidated via NMR. Thus, the corresponding PDB file consists of multiple (namely 38) models, each showing another conformation.

At first we load the structure from a PDB file via the class PDBFile in the subpackage biotite.structure.io.pdb.

import biotite
import biotite.structure.io.pdb as pdb
import biotite.database.rcsb as rcsb
pdb_file_path = rcsb.fetch("1l2y", "pdb", biotite.temp_dir())
file = pdb.PDBFile()
tc5b = file.get_structure()



The method PDBFile.get_structure() returns an atom array stack unless the model parameter is specified, even if the file contains only one model. The following example shows how to write an array or stack back into a PDB file:

file = pdb.PDBFile()

Other information (authors, secondary structure, etc.) cannot be extracted from PDB files, yet. This is a good place to mention, that it is recommended to use the modern PDBx/mmCIF format in favor of the PDB format. It solves limitations of the PDB format, that arise from the column restrictions. Furthermore, much more additional information is stored in these files.

In contrast to PDB files, Biotite can read the entire content of PDBx/mmCIF files, which can be accessed in a dictionary like manner. At first, we read the file similarily to before, but this time we use the PDBxFile class.

import biotite.structure.io.pdbx as pdbx
cif_file_path = rcsb.fetch("1l2y", "cif", biotite.temp_dir())
file = pdbx.PDBxFile()

Now we can access the data like a dictionary of dictionaries.

print(file["1L2Y", "audit_author"]["name"])


['Neidigh, J.W.' 'Fesinmeyer, R.M.' 'Andersen, N.H.']

The first index contains the data block and the category name. The data block could be omitted, since there is only one block in the file. This returns a dictionary. If the category is in a loop, the dictionary contains ndarrays of strings as values, otherwise the dictionary contains strings directly. The second index specifies the name of the subcategory, which is used as key in this dictionary and returns the corresponding ndarray. Setting/adding a category in the file is done in a similar way:

file["audit_author"] = {"name" : ["Doe, Jane", "Doe, John"],
                        "pdbx_ordinal" : ["1","2"]}

In most applications only the structure itself (stored in the atom_site category) is relevant. get_structure() and set_structure() are convenience functions that are used to convert the atom_site category into an atom array (stack) and vice versa.

tc5b = pdbx.get_structure(file)
# Do some fancy stuff
pdbx.set_structure(file, tc5b)

get_structure() creates automatically an AtomArrayStack, even if the file actually contains only a single model. If you would like to have an AtomArray instead, you have to specifiy the model parameter.

If you want to parse a large batch of structure files or you have to load very large structure files, the usage of PDB or mmCIF files might be too slow for your requirements. In this case you probably might want to use MMTF files. MMTF files describe structures just like PDB and mmCIF files, but they are binary! This circumstance increases the downloading and parsing speed by several multiples. The usage is similar to PDBxFile: The MMTFFile class decodes the file and makes it raw information accessible. Via get_structure() the data can be loaded into an atom array (stack) and set_structure() is used to save it back into a MMTF file.

import numpy as np
import biotite.structure.io.mmtf as mmtf
mmtf_file_path = rcsb.fetch("1l2y", "mmtf", biotite.temp_dir())
file = mmtf.MMTFFile()
stack = mmtf.get_structure(file)
array = mmtf.get_structure(file, model=1)
# Do some fancy stuff
mmtf.set_structure(file, array)

A more low level access to MMTF files is also possible: An MMTF file is structured as dictionary, with each key being a strutural feature like the coordinates, the residue ID or the secondary structure. If a field is encoded the decoded ndarray is returned, otherwise the dictionary value is directly returned. A list of all MMTF fields (keys) can be found in the specification. The implementation of MMTFFile decodes the encoded fields only when you need them, so no computation time is wasted on fields you are not interested in.

# Field is not encoded
# Field is encoded and is automatically decoded


NMR Structure of Trp-Cage Miniprotein Construct TC5b
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]

Setting fields of an MMTF file works in an analogous way for values, that should not be encoded. The situation is a little more complex for arrays, that should be encoded: Since arbitrarily named fields can be set in the file, MMTFFile does not know which codec to use for encoding your array. Hence, you need to use the MMTFFile.set_array() function.

file["title"] = "Some other title"
# Determine appropriate codec from the codec used originally


Some other title
[20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39]

For Biotite internal storage of structures npz files are recommended. These are simply binary files, that are used by NumPy. In case of atom arrays and stacks, the annotation arrays and coordinates are written/read to/from npz files via the NpzFile class. Since no expensive data conversion has o be performed, this format is the fastest way to save and load atom arrays and stacks.

Since programmers are usually lazy and do not want to write more code than necessary, there are two convenient function for loading and saving atom arrays or stacks, unifying the forementioned file formats: load_structure() takes a file path and outputs an array (or stack, if the files contains multiple models). Internally, this function uses the appropriate File class, depending on the file format. The analogous save_structure() function provides a shortcut for writing to structure files. The desired file format is inferred from the provided file name.

import biotite.structure.io as strucio
stack_from_pdb = strucio.load_structure(pdb_file_path)
stack_from_cif = strucio.load_structure(cif_file_path)
print("Are both stacks equal?", stack_from_pdb == stack_from_cif)
strucio.save_structure(biotite.temp_file("cif"), stack_from_pdb)


Are both stacks equal? True

Reading trajectory files

If the package MDtraj is installed Biotite provides a read/write interface for different trajectory file formats. More information can be found in the API reference.

Array indexing and filtering

Atom arrays and stacks can be indexed in a similar way a ndarray is indexed. In fact, the index is propagated to the coordinates and the annotation arrays. Therefore, all NumPy compatible types of indices can be used, like boolean arrays, index arrays/lists, slices and, of course, integer values. Integer indices have a special role here, as they reduce the dimensionality of the data type: Indexing an AtomArrayStack with an integer results in an AtomArray at the specified frame, indexing an AtomArray with an integer yields the specified Atom. Iterating over arrays and stacks reduces the dimensionality in an analogous way. Let’s demonstrate indexing with the help of the structure of TC5b.

import biotite.structure as struc
import biotite.database.rcsb as rcsb
import biotite.structure.io as strucio
file_path = rcsb.fetch("1l2y", "mmtf", biotite.temp_dir())
stack = strucio.load_structure(file_path)
array = stack[2]



load_structure() gives us an AtomArrayStack Via the integer index, we get the AtomArray representing the third model. The AtomArray.array_length() (or AtomArrayStack.array_length()) method gives us the number of atoms in arrays and stacks and is equivalent to the length of an atom array. The amount of models is obtained with AtomArrayStack.stack_depth(). The following code section shows some examples for how an atom array can be indexed.

# Get the first atom
atom = array[0]
# Get a subarray containing the first and third atom
subarray = array[[0,2]]
# Get a subarray containing a range of atoms using slices
subarray = array[100:200]
# Filter all carbon atoms in residue 1
subarray = array[(array.element == "C") & (array.res_id == 1)]
# Filter all atoms where the X-coordinate is smaller than 2
subarray = array[array.coord[:,0] < 2]

An atom array stack can be indexed in a similar way, with the difference, that the index specifies the frame(s).

# Get an atom array from the first model
subarray = stack[0]
# Get a substack containing the first 10 models
substack = stack[:10]

Stacks also have the speciality, that they can handle 2-dimensional indices, where the first dimension specifies the frame and the second dimension specifies the atom.

# Get the first 100 atoms from the third model
subarray = stack[2, :100]
# Get the first 100 atoms from the models 3, 4 and 5
substack = stack[2:5, :100]
# Get the first atom in the second model
atom = stack[1,0]
# Get a stack containing arrays containing only the first atom
substack = stack[:, 0]

Furthermore, biotite.structure contains advanced filters, that create boolean masks from an array using specific criteria. Here is a small example.

backbone = array[struc.filter_backbone(array)]


['N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA'
 'C' 'N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C' 'N'
 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C'
 'N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C']

If you would like to know which atoms are in proximity to specific coordinates, have a look at the CellList class.


Creating a subarray or substack by indexing does not necessarily copy the coordinates and annotation arrays. If possible, only array views are created. Look into the NumPy documentation for furher details. If you want to ensure, that you are working with a copy, use the copy() method after indexing.

Representing bonds

Up to now we only looked into atom arrays whose atoms are merely described by its coordinates and annotations But there is more: Chemcial bonds can be described, too, using a BondList!

Consider the following case: Your atom array contains four atoms: N, CA, C and CB. CA is a central atom that is connected to N, C and CB. A BondList is created by passing a ndarray containing pairs of integers, where each integer represents an index in a corresponding atom array and the pairs indicate which atoms share a bond. Addtionally, it is required to specifiy the number of atoms in the atom array.

import biotite.structure as struc
array = struc.array([
struc.Atom([0,0,0], atom_name="N"),
struc.Atom([0,0,0], atom_name="CA"),
struc.Atom([0,0,0], atom_name="C"),
struc.Atom([0,0,0], atom_name="CB")
print("Atoms:", array.atom_name)
bond_list = struc.BondList(len(array), np.array([[1,0], [1,2], [1,3]]))
print("Bonds (indices):")
print("Bonds (atoms names):")
print(array.atom_name[bond_list.as_array()[:, :2]])
ca_bonds, ca_bond_types = bond_list.get_bonds(1)
print("Bonds of CA:", array.atom_name[ca_bonds])


Atoms: ['N' 'CA' 'C' 'CB']
Bonds (indices):
[[0 1 0]
 [1 2 0]
 [1 3 0]]
Bonds (atoms names):
[['N' 'CA']
 ['CA' 'C']
 ['CA' 'CB']]
Bonds of CA: ['N' 'C' 'CB']

When you look at the internal ndarray (as given by BondList.as_array()), you see a third column containging zeros. This column describes each bond with values from the BondType enum: 0 correponds to BondType.ANY, which means that the type of the bond is undefined. This makes sense, since we did not define the bond types, when we created the bond list. The other thing that has changed is the index order: Each bond is sorted so that the index with the lower index is the first element.

Although a BondList uses an ndarray under the hood, indexing works a little bit different: The indexing operation is not applied on the internal ndarray, instead it behaves like the same indexing operation was applied to a corresponding atom array: The bond list adjusts its indices so that they still point on the same atoms as before. Bonds that involve at least one atom, that has been removed, are deleted as well. We will try that by deleting the C atom.

mask = (array.atom_name != "C")
sub_array = array[mask]
sub_bond_list = bond_list[mask]
print("Atoms:", sub_array.atom_name)
print("Bonds (indices):")
print("Bonds (atoms names):")
print(sub_array.atom_name[sub_bond_list.as_array()[:, :2]])


Atoms: ['N' 'CA' 'CB']
Bonds (indices):
[[0 1 0]
 [1 2 0]]
Bonds (atoms names):
[['N' 'CA']
 ['CA' 'CB']]

As you see, the the bonds involing the C (only a single one) is removed and the remaining indices are shifted.

We do not have to index the atom array and the bond list separately, for convenience reasons you can associate a bond list to an atom array. Every time the atom array is indexed, the index is also applied to the associated bond list. he same behavior applies to concatenations, by the way.

array.bonds = bond_list
sub_array = array[array.atom_name != "C"]
print("Bonds (atoms names):")
print(sub_array.atom_name[sub_array.bonds.as_array()[:, :2]])


Bonds (atoms names):
[['N' 'CA']
 ['CA' 'CB']]

Let’s scale things up a bit: Bond information can be loaded from and saved to MMTF files. We’ll try that on the structure of TC5b and look at the bond information of the third residue, a tyrosine.

import biotite.database.rcsb as rcsb
import biotite.structure.io as strucio
file_path = rcsb.fetch("1l2y", "mmtf", biotite.temp_dir())
mmtf_file = mmtf.MMTFFile()
# Essential: set the 'include_bonds' parameter to true
stack = mmtf.get_structure(mmtf_file, include_bonds=True)
tyrosine = stack[:, (stack.res_id == 3)]
print("Bonds (indices):")
print("Bonds (atoms names):")
print(tyrosine.atom_name[tyrosine.bonds.as_array()[:, :2]])


Bonds (indices):
[[ 0  1  1]
 [ 1  2  1]
 [ 2  3  2]
 [ 1  4  1]
 [ 4  5  1]
 [ 5  6  2]
 [ 5  7  1]
 [ 6  8  1]
 [ 7  9  2]
 [ 8 10  2]
 [ 9 10  1]
 [10 11  1]
 [ 0 12  1]
 [ 1 13  1]
 [ 4 14  1]
 [ 4 15  1]
 [ 6 16  1]
 [ 7 17  1]
 [ 8 18  1]
 [ 9 19  1]
 [11 20  1]]
Bonds (atoms names):
[['N' 'CA']
 ['CA' 'C']
 ['C' 'O']
 ['CA' 'CB']
 ['CB' 'CG']
 ['CG' 'CD1']
 ['CG' 'CD2']
 ['CD1' 'CE1']
 ['CD2' 'CE2']
 ['CE1' 'CZ']
 ['CE2' 'CZ']
 ['CZ' 'OH']
 ['N' 'H']
 ['CA' 'HA']
 ['CB' 'HB2']
 ['CB' 'HB3']
 ['CD1' 'HD1']
 ['CD2' 'HD2']
 ['CE1' 'HE1']
 ['CE2' 'HE2']
 ['OH' 'HH']]

Since we loaded the bond information from a MMTF file, the bond types are also defined: Here we have both, BondType.SINGLE and BondType.DOUBLE bonds (1 and 2, repectively).

Structure analysis

This package would be almost useless, if there wasn’t some means to analyze your structures. Therefore, Biotite offers a bunch of functions for this purpose, reaching from simple bond angle and length measurements to more complex characteristics, like accessible surface area and secondary structure. The following section will introduce you to some of these functions, which should be applied to that good old structure of TC5b.

The examples shown in this section do not represent the full spectrum of analysis tools in this package. Look into the API reference for more information.

Geometry measures

Let’s start with measuring some simple geometric characteristics, for example atom distances of CA atoms.

import biotite.structure as struc
import biotite.structure.io.pdbx as pdbx
import biotite.database.rcsb as rcsb
file_path = rcsb.fetch("1l2y", "cif", biotite.temp_dir())
file = pdbx.PDBxFile()
stack = pdbx.get_structure(file)
# Filter only CA atoms
stack = stack[:, stack.atom_name == "CA"]
# Calculate distance between first and second CA in first frame
array = stack[0]
print("Atom to atom:", struc.distance(array[0], array[1]))
# Calculate distance between the first atom
# and all other CA atoms in the array
print("Array to atom:")
array = stack[0]
print(struc.distance(array[0], array))
# Calculate pairwise distances between the CA atoms in the first frame
# and the CA atoms in the second frame
print("Array to array:")
print(struc.distance(stack[0], stack[1]))
# Calculate the distances between all CA atoms in the stack
# and the first CA atom in the first frame
# The resulting array is too large, therefore only the shape is printed
print("Stack to atom:")
print(struc.distance(stack, stack[0,0]).shape)
# And finally distances between two adjacent CA in the first frame
array = stack[0]
print("Adjacent CA distances")
print(struc.distance(array[:-1], array[1:]))


Atom to atom: 3.8763991022597253
Array to atom:
[ 0.          3.8763991   5.57665975  5.03889055  6.31640919  8.76681499
  9.90813499 10.61481667 12.89033149 14.80667937 13.50116443 16.87541054
 18.72356614 17.22428861 19.11193308 16.19300176 15.51475678 12.37781309
 10.44593404 12.0589665 ]
Array to array:
[3.43441989 0.37241509 0.22178593 0.10823123 0.15207235 0.1701705
 0.22572771 0.47650498 0.2949322  0.1548354  0.28323488 0.40683903
 0.13555073 0.36768737 0.46464395 0.57544244 0.33707418 0.25703307
 0.34762192 0.38818681]
Stack to atom:
(38, 20)
Adjacent CA distances
[3.8763991  3.86050178 3.87147026 3.84557993 3.86660471 3.85851811
 3.88180293 3.86098705 3.89091814 3.86355497 3.88626993 3.87561298
 3.87466863 3.86554912 3.86627728 3.87766244 3.86038275 3.85824688

Like some other functions in biotite.structure, we are able to pick any combination of an atom, atom array or stack. Alternatively ndarray objects containing the coordinates can be provided.

Furthermore, we can measure bond angles and dihedral angles.

# Calculate angle between first 3 CA atoms in first frame
# (in radians)
print("Angle:", struc.angle(array[0],array[1],array[2]))
# Calculate dihedral angle between first 4 CA atoms in first frame
# (in radians)
print("Dihedral angle:", struc.dihedral(array[0],array[1],array[2],array[4]))


Angle: 1.6098708219270583
Dihedral angle: 1.4903792085158993

In some cases one is interested in the dihedral angles of the peptide backbone, \(\phi\), \(\psi\) and \(\omega\). In the following code snippet we measure these angles and create a simple Ramachandran plot for the first frame of TC5b.

import matplotlib.pyplot as plt
import numpy as np
array = pdbx.get_structure(file, model=1)
phi, psi, omega = struc.dihedral_backbone(array, chain_id="A")
plt.plot(phi * 360/(2*np.pi), psi * 360/(2*np.pi),
        marker="o", linestyle="None")

Comparing structures

Now we want to calculate a measure of flexibility for each residue in TC5b. The root mean square fluctuation (RMSF) is a good value for that. It represents the deviation for each atom in all models relative to a reference model, which is usually the averaged structure. Since we are only interested in the backbone flexibility, we consider only CA atoms. Before we can calculate a reasonable RMSF, we have to superimpose each model on a reference model (we choose the first model), which minimizes the root mean square deviation (RMSD).

stack = pdbx.get_structure(file)
# We consider only CA atoms
stack = stack[:, stack.atom_name == "CA"]
# Superimposing all models of the structure onto the first model
stack, transformation_tuple = struc.superimpose(stack[0], stack)
print("RMSD for each model to first model:")
print(struc.rmsd(stack[0], stack))
# Calculate the RMSF relative to average of all models
rmsf = struc.rmsf(struc.average(stack), stack)
# Plotting stuff
plt.plot(np.arange(1,21), rmsf)


RMSD for each model to first model:
[0.         0.78426444 1.00757683 0.55180272 0.80663454 1.06066791
 0.87383705 0.62606424 1.00576561 0.81440804 0.87628298 1.35385898
 0.93278001 0.87600934 0.99357313 0.40626578 0.31801941 1.18389047
 1.23477073 0.89114465 0.55536526 0.73639392 0.78567399 1.10192568
 0.67228881 1.1605639  0.98213955 1.22808849 0.79269641 0.86854739
 0.93866682 0.83565702 0.61650375 0.97335428 1.03223981 0.55556655
 1.15175216 0.85585345]

As you can see, both terminal residues are most flexible.

Calculating accessible surface area

Another interesting value for a protein structure is the solvent accessible surface area (SASA) that indicates whether an atom or residue is on the protein surface or buried inside the protein. The function sasa() approximates the SASA for each atom. Then we sum up the values for each residue, to get the residue-wise SASA.

Besides other parameters, you can choose between different Van-der-Waals radii sets: Prot0r, the default set, is a set that defines radii for non-hydrogen atoms, but determines the radius of an atom based on the assumed amount of hydrogen atoms connected to it. Therefore, ProtOr is suitable for structures with missing hydrogen atoms, like crystal structures. Since the structure of TC5b was elucidated via NMR, we can assign a radius to every single atom (including hydrogens), hence we use the Single set.

array = pdbx.get_structure(file, model=1)
# The following line calculates the atom-wise SASA of the atom array
atom_sasa = struc.sasa(array, vdw_radii="Single")
# Sum up SASA for each residue in atom array
res_sasa = struc.apply_residue_wise(array, atom_sasa, np.sum)
# Again plotting stuff
plt.plot(np.arange(1,21), res_sasa)

Secondary structure determination

Biotite can also be used to assign secondary structure elements (SSE) to a structure with the annotate_sse() function. An 'a' means alpha-helix, 'b' beta-sheet, and 'c' means coil.

array = pdbx.get_structure(file, model=1)
# Estimate secondary structure
sse = struc.annotate_sse(array, chain_id="A")
# Pretty print



Outsourcing - The Application subpackage

Although you can achieve a lot with Biotite, there are still a lot of things which are not implemented in this Python package. But wait, this is what the biotite.application package is for: It contains interfaces for popular external software. This ranges from locally installed software to external tools running on web servers. The usage of these interfaces is seamless: Rather than writing input files and reading output files, you simply put in your Python objects (e.g. instances of Sequence or AtomArray) and the interface returns Python objects (e.g. an Alignment object).

The base class for all interfaces is the Application class. Each Application instance has a life cycle, starting with its creation and ending with the result extraction. Each state in this life cycle is described by the value of the enum AppState, that each Application contains: Directly after its instantiation the app is in the CREATED state. In this state further parameters can be set for the application run. After the user calls the Application.start() method, the app state is set to RUNNING and the app performs the calculations. When the application finishes the AppState changes to FINISHED. The user can now call the Application.join() method, concluding the application in the JOINED state and making the results of the application accessible. Furthermore, this may trigger cleanup actions in some applications. Application.join() can even be called in the RUNNING state: This will constantly check if the application has finished and will directly go into the JOINED state as soon as the application reaches the FINISHED state. Calling the Application.cancel() method while the application is RUNNING or FINISHED leaves the application in the CANCELLED state. This triggers cleanup, too, but there are no accessible results. If a method is called in an unsuitable app state, an AppStateError is called. At each state in the life cycle, Application type specific methods are called, as shown in the following diagram.


The following sample code shows how an Application is generally executed. Pay attention to the space between the Application.start() and Application.join() method: Since these are separate functions, you can do some other stuff, while the Application runs in the background. Thus, an Application behaves effectively like an additional thread.

from biotite.application import Application
# Create a dummy Application subclass
class MyApplication(Application):
    def __init__(self, param): super().__init__()
    def run(self): pass
    def is_finished(self): return True
    def wait_interval(self): return 0.1
    def evaluate(self): pass
    def clean_up(self): pass
    def set_some_other_input(self, param): pass
    def get_some_data(self): return "some data"

param = "foo"
param2 = "bar"
app = MyApplication(param)
# Time to do other stuff
results = app.get_some_data()

The following subsections will dive into the available Application classes in depth.

Finding homologous sequences with BLAST

the biotite.application.blast subpackage provides an interface to NCBI BLAST: the BlastWebApp class. Let’s dive directly into the code, we try to find homologous sequences to the miniprotein TC5b:

import biotite.application.blast as blast
import biotite.sequence as seq
tc5b_seq = seq.ProteinSequence("NLYIQWLKDGGPSSGRPPPS")
app = blast.BlastWebApp("blastp", tc5b_seq)
alignments = app.get_alignments()
best_ali = alignments[0]
print("HSP position in query: ", best_ali.query_interval)
print("HSP position in hit: ", best_ali.hit_interval)
print("Score: ", best_ali.score)
print("E-value: ", best_ali.e_value)
print("Hit UID: ", best_ali.hit_id)
print("Hit name: ", best_ali.hit_definition)



HSP position in query:  (1, 20)
HSP position in hit:  (1, 20)
Score:  101
E-value:  3.82217e-05
Hit UID:  1L2Y_A
Hit name:  Chain A, NMR Structure of Trp-Cage Miniprotein Construct TC5b

This was too simple for BLAST: It just found the query sequence in the PDB. However, it gives a good impression about how this Application works. Besides some optional parameters, the BlastWebApp requires the BLAST program and the query sequence. After the app has finished, you get a list of alignments with descending score. An alignment is an instance of BlastAlignment, a subclass of biotite.sequence.align.Alignment. It contains some additional information as shown above. The hit UID can be used to obtain the complete hit sequence via biotite.database.entrez.

The next alignment should be a bit more challenging. We take a random part of the E. coli BL21 genome and distort it a little bit. Since we still expect a high similarity to the original sequence, we decrease the E-value threshold.

import biotite.application.blast as blast
import biotite.sequence as seq
bl21_seq = seq.NucleotideSequence(
app = blast.BlastWebApp("blastn", bl21_seq, obey_rules=False)
alignments = app.get_alignments()
best_ali = alignments[0]
print("HSP position in query: ", best_ali.query_interval)
print("HSP position in hit: ", best_ali.hit_interval)
print("Score: ", best_ali.score)
print("E-value: ", best_ali.e_value)
print("Hit UID: ", best_ali.hit_id)
print("Hit name: ", best_ali.hit_definition)



HSP position in query:  (1, 65)
HSP position in hit:  (4656858, 4656923)
Score:  56
E-value:  0.00145884
Hit UID:  CP026845
Hit name:  Shigella boydii strain NCTC 9733 chromosome, complete genome

In this snippet we have set obey_rules to false in the BlastWebApp constructor, if we omitted this parameter and we started the last two code snippets in quick succession, a RuleViolationError would be raised. This is because normally the BlastWebApp respects NCBI’s code of conduct and prevents you from submitting two queries within one minute. If you want to be rude to the NCBI server, create the instance with obey_rules set to false.

Multiple sequence alignments

For multiple sequence alignments (MSAs) biotite.application offers several interfaces to MSA software. For our example we choose the software MUSCLE: The subpackage biotite.application.muscle contains the class MuscleApp that does the job. You simply input the sequences you want to have aligned, run the application and get the resulting Alignment object.

import biotite.application.muscle as muscle
import biotite.sequence as seq
seq1 = seq.ProteinSequence("BIQTITE")
seq2 = seq.ProteinSequence("TITANITE")
seq3 = seq.ProteinSequence("BISMITE")
seq4 = seq.ProteinSequence("IQLITE")
app = muscle.MuscleApp([seq1, seq2, seq3, seq4])
alignment = app.get_alignment()



For the lazy people there is also a convenience method, that handles the Application execution internally.

alignment = muscle.MuscleApp.align([seq1, seq2, seq3, seq4])

The alternatives to MUSCLE are Clustal-Omega and MAFFT. To use them, simply replace MuscleApp with ClustalOmegaApp or MafftApp and you are done.

import biotite.application.clustalo as clustalo
alignment = clustalo.ClustalOmegaApp.align([seq1, seq2, seq3, seq4])



As shown in the output, the alignment with Clustal-Omega slightly differs from the one performed with MUSCLE. In contrast to MUSCLE, Clustal-Omega and MAFFT also support alignments of NucleotideSequence instances.

Secondary structure annotation

Althogh biotite.structure offers the function annotate_sse() to assign secondary structure elements based on the P-SEA algorithm, DSSP can also be used via the biotite.application.dssp subpackage (provided that DSSP is installed). Let us demonstrate this on the example of the good old miniprotein TC5b.

Similar to the MSA examples, DsspApp has the convenience function DsspApp.annotate_sse(), which handles the DsspApp execution internally.

import biotite
import biotite.database.rcsb as rcsb
import biotite.application.dssp as dssp
import biotite.structure.io as strucio
file_path = rcsb.fetch("1l2y", "mmtf", biotite.temp_dir())
stack = strucio.load_structure(file_path)
array = stack[0]
app = dssp.DsspApp(array)
sse = app.get_sse()


['C' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'T' 'T' 'G' 'G' 'G' 'G' 'T' 'C' 'C' 'C'
 'C' 'C']