Tutorial¶
Introduction¶
Biotite is a Python package for computational biologists. It aims to provide a broad set of tools for working with different kinds of biological data, from sequences to structures. On the one hand side, working with Biotite should be computationally efficient, with the help of the powerful packages NumPy and Cython. On the other hand it aims for simple usability and extensibility, so that beginners are not overwhelmed and advanced users can easily build upon the existing system to implement their own algorithms.
Biotite provides 4 subpackages:
The biotite.sequence
subpackage contains functionality for
working with sequence information of any kind.
The package contains by default sequence types for nucleotides and
proteins, but the alphabet-based implementation allows simple
integration of own sequence types, even if they do not rely on letters.
Beside the standard I/O operations, the package includes general purpose
functions for sequence manipulations and alignments.
The biotite.structure
subpackage enables handling of 3D
structures of biomolecules.
Simplified, a structure is represented by a list of atoms and their
properties,
based on NumPy arrays.
The subpackage includes read/write functionality for different formats,
structure filters, coordinate transformations, angle and bond
measurements, structure
superimposition and some more advanced analysis capabilities.
The biotite.database
subpackage is all about downloading data
from biological databases, including the probably most important ones:
the RCSB PDB and the NCBI Entrez database.
The biotite.application
subpackage provides interfaces for
external software.
The interfaces range from locally installed software (e.g. MSA software)
to web applications (e.g. BLAST).
The speciality is that the interfaces are seamless:
You do not have to write input files and read output files, you only
have to input Python objects and you get Python objects.
It is basically very similar to using normal functions.
In the following sections you will get an overview over the mentioned subpackages, so go and grab some tea and cookies und let us begin.
Preliminary note¶
The files used in this tutorial will be stored in a temporary directory. So make sure to put the files, you want keep, somewhere else.
Data to work with - The Database subpackage¶
Biological databases are the backbone of computational biology.
The biotite.database
subpackage provides interfaces for popular
online databases like the RCSB PDB or the NCBI Entrez database.
Fetching structure files from the RCSB PDB¶
Downloading structure files from the RCSB PDB is quite easy:
Simply specify the PDB ID, the file format and the target directory
for the fetch()
function and you are done.
The function returns the path to the downloaded file, so you
can simply load the file via the other Biotite subpackages
(more on this later).
We will download on a protein structure of the miniprotein TC5b
(PDB: 1L2Y) into a temporary directory.
from tempfile import gettempdir
import biotite.database.rcsb as rcsb
file_path = rcsb.fetch("1l2y", "pdb", gettempdir())
print(file_path)
/tmp/1l2y.pdb
In case you want to download multiple files, you are able to specify a list of PDB IDs, which in return gives you a list of file paths.
# Download files in the more modern mmCIF format
file_paths = rcsb.fetch(["1l2y", "1aki"], "cif", gettempdir())
print([file_path for file_path in file_paths])
['/tmp/1l2y.cif', '/tmp/1aki.cif']
By default fetch()
checks whether the file to be fetched
already exists in the directory and downloads it, if it does not
exist yet.
If you want to download files irrespectively, set overwrite
to
true.
# Download file in the fast and small binary MMTF format
file_path = rcsb.fetch("1l2y", "mmtf", gettempdir(), overwrite=True)
If you omit the file path or set it to None
, the downloaded data
will be returned directly as a file-like object, without creating a
file on your disk at all.
file = rcsb.fetch("1l2y", "pdb")
lines = file.readlines()
print("\n".join(lines[:10] + ["..."]))
HEADER DE NOVO PROTEIN 25-FEB-02 1L2Y
TITLE NMR STRUCTURE OF TRP-CAGE MINIPROTEIN CONSTRUCT TC5B
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: TC5B;
COMPND 3 CHAIN: A;
COMPND 4 ENGINEERED: YES
SOURCE MOL_ID: 1;
SOURCE 2 SYNTHETIC: YES;
SOURCE 3 OTHER_DETAILS: THE PROTEIN WAS SYNTHESIZED USING STANDARD FMOC
SOURCE 4 SOLID-PHASE SYNTHESIS METHODS ON AN APPLIED BIOSYSTEMS 433A PEPTIDE
...
In many cases you are not interested in a specific structure, but you
want a set of structures that fits your desired criteria.
For this purpose the RCSB search API can be used.
At first you have to create Query
object for the property you
want to filter.
The search()
method takes the Query
and returns a
list of PDB IDs, which itself can be used as input for
fetch()
.
Likewise, count()
is used to count the number of matching
PDB IDs.
query = rcsb.BasicQuery("HCN1")
pdb_ids = rcsb.search(query)
print(pdb_ids)
print(rcsb.count(query))
files = rcsb.fetch(pdb_ids, "mmtf", gettempdir())
['2XPI', '5U6O', '3U0Z', '3U10', '5U6P', '3U11', '6UQF', '6UQG', '7NMN', '7NP3', '7NP4', '2MNG', '4NVP']
13
This was a simple search for the occurrence of the search term in any
field.
You can also search for a value in a specific field with a
FieldQuery
.
A complete list of the available fields and its supported operators
is documented
on this page
and on that page <https://search.rcsb.org/chemical-search-attributes.html>.
# Query for 'lacA' gene
query1 = rcsb.FieldQuery(
"rcsb_entity_source_organism.rcsb_gene_name.value",
exact_match="lacA"
)
# Query for resolution below 1.5 Å
query2 = rcsb.FieldQuery("reflns.d_resolution_high", less=1.5)
The search API allows even more complex queries, e.g. for sequence
or structure similarity. Have a look at the API reference of
biotite.database.rcsb
.
Multiple Query
objects can be combined using the |
(or)
or &
(and) operator for a more fine-grained selection.
A FieldQuery
is negated with ~
.
composite_query = query1 & ~query2
print(rcsb.search(composite_query))
['1KQA', '1KRR', '1KRU', '1KRV', '1TG7', '1XC6', '3U7V', '4DUW', '4IUG', '4LFK', '4LFL', '4LFM', '4LFN', '5IFP', '5IFT', '5IHR', '5JUV', '5MGC', '5MGD']
Often the structures behind the obtained PDB IDs have degree of
redundancy.
For example they may represent the same protein sequences or result
from the same set of experiments.
You may use Grouping
of structures to group redundant
entries or even return only single representatives of each group.
query = rcsb.BasicQuery("Transketolase")
# Group PDB IDs from the same collection
print(rcsb.search(
query, group_by=rcsb.DepositGrouping(), return_groups=True
))
# Get only a single representative of each group
print(rcsb.search(
query, group_by=rcsb.DepositGrouping(), return_groups=False
))
{'G_1002178': ['5RW0', '5RVZ', '5RVY', '5RVX', '5RVW'], 'G_1002179': ['5RW1']}
['5RW0', '5RW1']
Note that grouping may omit PDB IDs in search results, if such PDB IDs cannot be grouped. In the example shown above, not all structures For example in the case shown above only a few PDB entries were uploaded as collection and hence are part of the search results.
Fetching files from the NCBI Entrez database¶
Another important source of biological information is the
NCBI Entrez database, which is commonly known as the NCBI.
It provides a myriad of information, ranging from sequences and
sequence features to scientific articles.
Fetching files from NCBI Entrez works analogous to the RCSB interface.
This time we have to provide the UIDs (Accession or GI) instead of
PDB IDs to the fetch()
function.
Furthermore, we need to specifiy the database to retrieve the data
from and the retrieval type.
from tempfile import gettempdir, NamedTemporaryFile
import biotite.database.entrez as entrez
# Fetch a single UID ...
file_path = entrez.fetch(
"NC_001416", gettempdir(), suffix="fa",
db_name="nuccore", ret_type="fasta"
)
print(file_path)
# ... or multiple UIDs
file_paths = entrez.fetch(
["1L2Y_A","1AKI_A"], gettempdir(), suffix="fa",
db_name="protein", ret_type="fasta"
)
print([file_path for file_path in file_paths])
/tmp/NC_001416.fa
['/tmp/1L2Y_A.fa', '/tmp/1AKI_A.fa']
A list of valid database, retrieval type and mode combinations can
be found
here.
Furthermore, get_database_name()
can be helpful to get the
required database name by the more commonly known names.
print(entrez.get_database_name("Nucleotide"))
nuccore
The Entrez database allows for packing data for multiple UIDs into a
single file. This is achieved with the fetch_single_file()
function.
temp_file = NamedTemporaryFile(suffix=".fasta", delete=False)
file_path = entrez.fetch_single_file(
["1L2Y_A","1AKI_A"], temp_file.name, db_name="protein", ret_type="fasta"
)
print(file_path)
temp_file.close()
/tmp/tmphyvnkkb7.fasta
Similar to the RCSB PDB, you can also search every field of the NCBI Entrez database.
# Search in all fields
print(entrez.SimpleQuery("BL21 genome"))
# Search in the 'Organism' field
print(entrez.SimpleQuery("Escherichia coli", field="Organism"))
"BL21 genome"
"Escherichia coli"[Organism]
You can also combine multiple Query
objects in any way you
like using the binary operators |
, &
and ^
,
that represent OR
, AND
and NOT
linkage, respectively.
composite_query = (
entrez.SimpleQuery("50:100", field="Sequence Length") &
(
entrez.SimpleQuery("Escherichia coli", field="Organism") |
entrez.SimpleQuery("Bacillus subtilis", field="Organism")
)
)
print(composite_query)
(50:100[Sequence Length]) AND (("Escherichia coli"[Organism]) OR ("Bacillus subtilis"[Organism]))
Finally, the query is given to the search()
function to obtain
the GIs, that can be used as input for fetch()
.
# Return a maximum number of 10 entries
gis = entrez.search(composite_query, "protein", number=10)
print(gis)
['2503041065', '2503041064', '2503041063', '2503041062', '2503041042', '2503041038', '2503041036', '2503041031', '2503041029', '2503037913']
From A to T - The Sequence subpackage¶
biotite.sequence
is a Biotite subpackage concerning maybe the
most popular type of data in bioinformatics: sequences.
The instantiation can be quite simple as
import biotite.sequence as seq
from biotite.sequence.align.matrix import SubstitutionMatrix
dna = seq.NucleotideSequence("AACTGCTA")
print(dna)
AACTGCTA
This example shows NucleotideSequence
which is a subclass of
the abstract base class Sequence
.
A NucleotideSequence
accepts an iterable object of strings,
where each string can be 'A'
, 'C'
, 'G'
or 'T'
.
Each of these letters is called a symbol.
In general the sequence implementation in Biotite allows for
sequences of anything.
This means any immutable and hashable Python object can be used as
a symbol in a sequence, as long as the object is part of the
Alphabet
of the particular Sequence
.
An Alphabet
object simply represents a list of objects that
are allowed to occur in a Sequence
.
The following figure shows how the symbols are stored in a
Sequence
object.
When setting the Sequence
object with a sequence of symbols,
the Alphabet
of the Sequence
encodes each symbol in
the input sequence into a so called symbol code.
The encoding process is quite simple:
A symbol s is at index i in the list of allowed symbols in the
alphabet, so the symbol code for s is i.
If s is not in the alphabet, an AlphabetError
is raised.
The array of symbol codes, that arises from encoding the input
sequence, is called sequence code.
This sequence code is now stored in an internal integer
ndarray
in the Sequence
object.
The sequence code is now accessed via the code
attribute,
the corresponding symbols via the symbols
attribute.
This approach has multiple advantages:
Ability to create sequences of anything
Sequence utility functions (searches, alignments,…) usually do not care about the specific sequence type, since they work with the internal sequence code
Integer type for sequence code is only as large as the alphabet requests
Sequence codes can be directly used as substitution matrix indices in alignments
Effectively, this means a potential Sequence
subclass could
look like following:
class NonsenseSequence(seq.Sequence):
alphabet = seq.Alphabet([42, "foo", b"bar"])
def get_alphabet(self):
return NonsenseSequence.alphabet
sequence = NonsenseSequence(["foo", b"bar", 42, "foo", "foo", 42])
print("Alphabet:", sequence.alphabet)
print("Symbols:", sequence.symbols)
print("Code:", sequence.code)
Alphabet: [42, 'foo', b'bar']
Symbols: ['foo', b'bar', 42, 'foo', 'foo', 42]
Code: [1 2 0 1 1 0]
From DNA to Protein¶
Biotite offers two prominent Sequence
sublasses:
The NucleotideSequence
represents DNA.
It may use two different alphabets - an unambiguous alphabet
containing the letters 'A'
, 'C'
, 'G'
and 'T'
and an
ambiguous alphabet containing additionally the standard letters for
ambiguous nucleic bases.
A NucleotideSequence
determines automatically which alphabet
is required, unless an alphabet is specified. If you want to work with
RNA sequences you can use this class, too, you just need to replace
the 'U'
with 'T'
.
import biotite.sequence as seq
# Create a nucleotide sequence using a string
# The constructor can take any iterable object (e.g. a list of symbols)
seq1 = seq.NucleotideSequence("ACCGTATCAAG")
print(seq1.get_alphabet())
# Constructing a sequence with ambiguous nucleic bases
seq2 = seq.NucleotideSequence("TANNCGNGG")
print(seq2.get_alphabet())
['A', 'C', 'G', 'T']
['A', 'C', 'G', 'T', 'R', 'Y', 'W', 'S', 'M', 'K', 'H', 'B', 'V', 'D', 'N']
The reverse complement of a DNA sequence is created by chaining the
Sequence.reverse()
and the
NucleotideSequence.complement()
method.
# Lower case characters are automatically capitalized
seq1 = seq.NucleotideSequence("tacagtt")
print("Original:", seq1)
seq2 = seq1.reverse().complement()
print("Reverse complement:", seq2)
Original: TACAGTT
Reverse complement: AACTGTA
The other Sequence
type is ProteinSequence
.
It supports the letters for the 20 standard amino acids plus some
letters for ambiguous amino acids and a letter for a stop signal.
Furthermore, this class provides some utilities like
3-letter to 1-letter translation (and vice versa).
prot_seq = seq.ProteinSequence("BIQTITE")
print("-".join([seq.ProteinSequence.convert_letter_1to3(symbol)
for symbol in prot_seq]))
ASX-ILE-GLN-THR-ILE-THR-GLU
A NucleotideSequence
can be translated into a
ProteinSequence
via the
NucleotideSequence.translate()
method.
By default, the method searches for open reading frames (ORFs) in the
3 frames of the sequence.
A 6-frame ORF search requires an
additional call of NucleotideSequence.translate()
with the
reverse complement of the sequence.
If you want to conduct a complete 1-frame translation of the sequence,
irrespective of any start and stop codons, set the parameter
complete
to true.
dna = seq.NucleotideSequence("CATATGATGTATGCAATAGGGTGAATG")
proteins, pos = dna.translate()
for i in range(len(proteins)):
print(
f"Protein sequence {str(proteins[i])} "
f"from base {pos[i][0]+1} to base {pos[i][1]}"
)
protein = dna.translate(complete=True)
print("Complete translation:", str(protein))
Protein sequence MMYAIG* from base 4 to base 24
Protein sequence MYAIG* from base 7 to base 24
Protein sequence MQ* from base 11 to base 19
Protein sequence M from base 25 to base 27
Complete translation: HMMYAIG*M
The upper example uses the default CodonTable
instance.
This can be changed with the codon_table
parameter.
A CodonTable
maps codons to amino acids and defines start
codons (both in symbol and code form).
A CodonTable
is mainly used in the
NucleotideSequence.translate()
method,
but can also be used to find the corresponding amino acid for a codon
and vice versa.
table = seq.CodonTable.default_table()
# Find the amino acid encoded by a given codon
print(table["TAC"])
# Find the codons encoding a given amino acid
print(table["Y"])
# Works also for codes instead of symbols
print(table[(1,2,3)])
print(table[14])
Y
('TAC', 'TAT')
14
((0, 2, 0), (0, 2, 2), (1, 2, 0), (1, 2, 1), (1, 2, 2), (1, 2, 3))
The default CodonTable
is equal to the NCBI “Standard” table,
with the small difference that only 'ATG'
qualifies as start
codon.
You can also use any other official NCBI table via
CodonTable.load()
.
# Use the official NCBI table name
table = seq.CodonTable.load("Yeast Mitochondrial")
print("Yeast Mitochondrial:")
print(table)
print()
# Use the official NCBI table ID
table = seq.CodonTable.load(11)
print("Bacterial:")
print(table)
Yeast Mitochondrial:
AAA K AAC N AAG K AAT N
ACA T ACC T ACG T ACT T
AGA R AGC S AGG R AGT S
ATA M i ATC I ATG M i ATT I
CAA Q CAC H CAG Q CAT H
CCA P CCC P CCG P CCT P
CGA R CGC R CGG R CGT R
CTA T CTC T CTG T CTT T
GAA E GAC D GAG E GAT D
GCA A GCC A GCG A GCT A
GGA G GGC G GGG G GGT G
GTA V GTC V GTG V GTT V
TAA * TAC Y TAG * TAT Y
TCA S TCC S TCG S TCT S
TGA W TGC C TGG W TGT C
TTA L TTC F TTG L TTT F
Bacterial:
AAA K AAC N AAG K AAT N
ACA T ACC T ACG T ACT T
AGA R AGC S AGG R AGT S
ATA I i ATC I i ATG M i ATT I
CAA Q CAC H CAG Q CAT H
CCA P CCC P CCG P CCT P
CGA R CGC R CGG R CGT R
CTA L CTC L CTG L i CTT L
GAA E GAC D GAG E GAT D
GCA A GCC A GCG A GCT A
GGA G GGC G GGG G GGT G
GTA V GTC V GTG V i GTT V
TAA * TAC Y TAG * TAT Y
TCA S TCC S TCG S TCT S
TGA * TGC C TGG W TGT C
TTA L TTC F TTG L i TTT F
Feel free to define your own custom codon table via the
CodonTable
constructor.
Loading sequences from file¶
Biotite enables the user to load and save sequences from/to the
popular FASTA format via the FastaFile
class.
A FASTA file may contain multiple sequences.
Each sequence entry starts with a line with a leading '>'
and a
trailing header name.
The corresponding sequence is specified in the following lines until
the next header or end of file.
Since every sequence has its obligatory header, a FASTA file is
predestinated to be implemented as some kind of dictionary.
This is exactly what has been done in Biotite:
The header strings (without the '>'
) are used as keys to access
the respective sequence strings.
Actually you can cast the FastaFile
object into a
dict
.
Let’s demonstrate this on the genome of the lambda phage
(Accession: NC_001416
).
After downloading the FASTA file from the NCBI Entrez database,
we can load its contents in the following way:
from tempfile import gettempdir, NamedTemporaryFile
import biotite.sequence as seq
import biotite.sequence.io.fasta as fasta
import biotite.database.entrez as entrez
file_path = entrez.fetch(
"NC_001416", gettempdir(), suffix="fa",
db_name="nuccore", ret_type="fasta"
)
fasta_file = fasta.FastaFile.read(file_path)
for header, string in fasta_file.items():
print("Header:", header)
print(len(string))
print("Sequence:", string[:50], "...")
print("Sequence length:", len(string))
Header: NC_001416.1 Enterobacteria phage lambda, complete genome
48502
Sequence: GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAA ...
Sequence length: 48502
Since there is only a single sequence in the file, the loop is run
only one time.
As the sequence string is very long, only the first 50 bp are printed.
Now this string could be used as input parameter for creation of a
NucleotideSequence
.
But we want to spare ourselves some unnecessary work, there is already
a convenience function for that:
dna_seq = fasta.get_sequence(fasta_file)
print(type(dna_seq).__name__)
print(dna_seq[:50])
NucleotideSequence
GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAA
In this form get_sequence()
returns the first sequence in the
file, which is also the only sequence in most cases.
If you want the sequence corresponding to a specific header, you have
to specifiy the header
parameter.
The function even automatically recognizes, if the file contains a
DNA or protein sequence and returns a NucleotideSequence
or
ProteinSequence
, instance respectively.
Actually, it just tries to create a NucleotideSequence
,
and if this fails, a ProteinSequence
is created instead.
Sequences can be written into FASTA files in a similar way: either via
dictionary-like access or using the set_sequence()
convenience function.
# Create new empty FASTA file
fasta_file = fasta.FastaFile()
# PROTIP: Let your cat walk over the keyboard
dna_seq1 = seq.NucleotideSequence("ATCGGATCTATCGATGCTAGCTACAGCTAT")
dna_seq2 = seq.NucleotideSequence("ACGATCTACTAGCTGATGTCGTGCATGTACG")
# Append entries to file...
# ... via set_sequence()
fasta.set_sequence(fasta_file, dna_seq1, header="gibberish")
# .. or dictionary style
fasta_file["more gibberish"] = str(dna_seq2)
print(fasta_file)
temp_file = NamedTemporaryFile(suffix=".fasta", delete=False)
fasta_file.write(temp_file.name)
temp_file.close()
>gibberish
ATCGGATCTATCGATGCTAGCTACAGCTAT
>more gibberish
ACGATCTACTAGCTGATGTCGTGCATGTACG
As you see, our file contains our new 'gibberish'
and
'more gibberish'
sequences now.
In a similar manner sequences and sequence quality scores can be read
from FASTQ files. For further reference, have a look at the
biotite.sequence.io.fastq
subpackage.
Alternatively, a sequence can also be loaded from GenBank or GenPept
files, using the GenBankFile
class (more on this later).
Sequence search¶
A sequence can be searched for the position of a subsequence or a specific symbol:
import biotite.sequence as seq
main_seq = seq.NucleotideSequence("ACCGTATCAAGTATTG")
sub_seq = seq.NucleotideSequence("TAT")
print("Occurences of 'TAT':", seq.find_subsequence(main_seq, sub_seq))
print("Occurences of 'C':", seq.find_symbol(main_seq, "C"))
Occurences of 'TAT': [ 4 11]
Occurences of 'C': [1 2 7]
Sequence alignments¶
Pairwise alignments¶
When comparing two (or more) sequences, usually an alignment needs to be performed. Two kinds of algorithms need to be distinguished here: Heuristic algorithms do not guarantee to yield the optimal alignment, but instead they are very fast. On the other hand, there are algorithms that calculate the optimal (maximum similarity score) alignment, but are quite slow.
The biotite.sequence.align
package provides the function
align_optimal()
, which fits into the latter category.
It either performs an optimal global alignment, using the
Needleman-Wunsch algorithm, or an optimal local
alignment, using the Smith-Waterman algorithm.
By default it uses a linear gap penalty, but an affine gap penalty
can be used, too.
Most functions in biotite.sequence.align
can align any two
Sequence
objects with each other.
In fact the Sequence
objects can be instances from different
Sequence
subclasses and therefore may have different
alphabets.
The only condition that must be satisfied, is that the
SubstitutionMatrix
alphabets match the alphabets of the
sequences to be aligned.
But wait, what’s a SubstitutionMatrix
?
This class maps a combination of two symbols, one from the first
sequence the other one from the second sequence, to a similarity
score.
A SubstitutionMatrix
object contains two alphabets with
length n or m, respectively, and an (n,m)-shaped
ndarray
storing the similarity scores.
You can choose one of many predefined matrices from an internal
database or you can create a custom matrix on your own.
So much for theory.
Let’s start by showing different ways to construct a
SubstitutionMatrix
, in our case for protein sequence
alignments:
import biotite.sequence as seq
import biotite.sequence.align as align
import numpy as np
alph = seq.ProteinSequence.alphabet
# Load the standard protein substitution matrix, which is BLOSUM62
matrix = align.SubstitutionMatrix.std_protein_matrix()
print("\nBLOSUM62\n")
print(matrix)
# Load another matrix from internal database
matrix = align.SubstitutionMatrix(alph, alph, "BLOSUM50")
# Load a matrix dictionary representation,
# modify it, and create the SubstitutionMatrix
# (The dictionary could be alternatively loaded from a string containing
# the matrix in NCBI format)
matrix_dict = align.SubstitutionMatrix.dict_from_db("BLOSUM62")
matrix_dict[("P","Y")] = 100
matrix = align.SubstitutionMatrix(alph, alph, matrix_dict)
# And now create a matrix by directly provding the ndarray
# containing the similarity scores
# (identity matrix in our case)
scores = np.identity(len(alph), dtype=int)
matrix = align.SubstitutionMatrix(alph, alph, scores)
print("\n\nIdentity matrix\n")
print(matrix)
BLOSUM62
A C D E F G H I K L M N P Q R S T V W Y B Z X *
A 4 0 -2 -1 -2 0 -2 -1 -1 -1 -1 -2 -1 -1 -1 1 0 0 -3 -2 -2 -1 0 -4
C 0 9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -2 -3 -3 -2 -4
D -2 -3 6 2 -3 -1 -1 -3 -1 -4 -3 1 -1 0 -2 0 -1 -3 -4 -3 4 1 -1 -4
E -1 -4 2 5 -3 -2 0 -3 1 -3 -2 0 -1 2 0 0 -1 -2 -3 -2 1 4 -1 -4
F -2 -2 -3 -3 6 -3 -1 0 -3 0 0 -3 -4 -3 -3 -2 -2 -1 1 3 -3 -3 -1 -4
G 0 -3 -1 -2 -3 6 -2 -4 -2 -4 -3 0 -2 -2 -2 0 -2 -3 -2 -3 -1 -2 -1 -4
H -2 -3 -1 0 -1 -2 8 -3 -1 -3 -2 1 -2 0 0 -1 -2 -3 -2 2 0 0 -1 -4
I -1 -1 -3 -3 0 -4 -3 4 -3 2 1 -3 -3 -3 -3 -2 -1 3 -3 -1 -3 -3 -1 -4
K -1 -3 -1 1 -3 -2 -1 -3 5 -2 -1 0 -1 1 2 0 -1 -2 -3 -2 0 1 -1 -4
L -1 -1 -4 -3 0 -4 -3 2 -2 4 2 -3 -3 -2 -2 -2 -1 1 -2 -1 -4 -3 -1 -4
M -1 -1 -3 -2 0 -3 -2 1 -1 2 5 -2 -2 0 -1 -1 -1 1 -1 -1 -3 -1 -1 -4
N -2 -3 1 0 -3 0 1 -3 0 -3 -2 6 -2 0 0 1 0 -3 -4 -2 3 0 -1 -4
P -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2 7 -1 -2 -1 -1 -2 -4 -3 -2 -1 -2 -4
Q -1 -3 0 2 -3 -2 0 -3 1 -2 0 0 -1 5 1 0 -1 -2 -2 -1 0 3 -1 -4
R -1 -3 -2 0 -3 -2 0 -3 2 -2 -1 0 -2 1 5 -1 -1 -3 -3 -2 -1 0 -1 -4
S 1 -1 0 0 -2 0 -1 -2 0 -2 -1 1 -1 0 -1 4 1 -2 -3 -2 0 0 0 -4
T 0 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1 0 -1 -1 -1 1 5 0 -2 -2 -1 -1 0 -4
V 0 -1 -3 -2 -1 -3 -3 3 -2 1 1 -3 -2 -2 -3 -2 0 4 -3 -1 -3 -2 -1 -4
W -3 -2 -4 -3 1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 2 -4 -3 -2 -4
Y -2 -2 -3 -2 3 -3 2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1 2 7 -3 -2 -1 -4
B -2 -3 4 1 -3 -1 0 -3 0 -4 -3 3 -2 0 -1 0 -1 -3 -4 -3 4 1 -1 -4
Z -1 -3 1 4 -3 -2 0 -3 1 -3 -1 0 -1 3 0 0 -1 -2 -3 -2 1 4 -1 -4
X 0 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 -1 -1 0 0 -1 -2 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1
Identity matrix
A C D E F G H I K L M N P Q R S T V W Y B Z X *
A 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
E 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
F 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
H 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
I 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
K 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
L 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
M 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
N 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
Q 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
R 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
V 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
W 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
Y 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
B 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
Z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
X 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
* 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
For our protein alignment we will use the standard BLOSUM62 matrix.
seq1 = seq.ProteinSequence("BIQTITE")
seq2 = seq.ProteinSequence("IQLITE")
matrix = align.SubstitutionMatrix.std_protein_matrix()
print("\nLocal alignment")
alignments = align.align_optimal(seq1, seq2, matrix, local=True)
for ali in alignments:
print(ali)
print("Global alignment")
alignments = align.align_optimal(seq1, seq2, matrix, local=False)
for ali in alignments:
print(ali)
Local alignment
IQTITE
IQLITE
Global alignment
BIQTITE
-IQLITE
The alignment functions return a list of Alignment
objects.
This object saves the input sequences together with a so called trace
- the indices to symbols in these sequences that are aligned to each
other (-1 for a gap).
Additionally the alignment score is stored in this object.
Furthermore, this object can prettyprint the alignment into a human
readable form.
For publication purposes you can create an actual figure based on Matplotlib. You can either decide to color the symbols based on the symbol type or based on the similarity within the alignment columns. In this case we will go with the similarity visualization.
import matplotlib.pyplot as plt
import biotite.sequence.graphics as graphics
fig, ax = plt.subplots(figsize=(2.0, 0.8))
graphics.plot_alignment_similarity_based(
ax, alignments[0], matrix=matrix, symbols_per_line=len(alignments[0])
)
fig.tight_layout()

If you are interested in more advanced visualization examples, have a look at the example gallery.
You can also do some simple analysis on these objects, like determining the sequence identity or calculating the score. For further custom analysis, it can be convenient to have directly the aligned symbols codes instead of the trace.
alignment = alignments[0]
print("Score: ", alignment.score)
print("Recalculated score:", align.score(alignment, matrix=matrix))
print("Sequence identity:", align.get_sequence_identity(alignment))
print("Symbols:")
print(align.get_symbols(alignment))
print("symbols codes:")
print(align.get_codes(alignment))
Score: 12
Recalculated score: 12
Sequence identity: 0.8333333333333334
Symbols:
[['B', 'I', 'Q', 'T', 'I', 'T', 'E'], [None, 'I', 'Q', 'L', 'I', 'T', 'E']]
symbols codes:
[[20 7 13 16 7 16 3]
[-1 7 13 9 7 16 3]]
You might wonder, why you should recalculate the score, when the score
has already been directly calculated via align_optimal()
.
The answer is that you might load an alignment from a FASTA file
using get_alignment()
, where the score is not provided.
Advanced sequence alignments¶
While the former alignment method returns the optimal alignment of two sequences, it is not recommended to use this method to align a short query sequence (e.g a gene) to an extremely long sequence (e.g. the human genome): The computation time and memory space requirements scale linearly with the length of both sequences, so even if your RAM does not overflow, you might need to wait a very long time for your alignment results.
But there is another method: You could look for local k-mer matches of the long (reference) sequence and the short (query) sequence and perform a sequence alignment restricted to the position of the match. Although this approach might not give the optimal result in some cases, it works well enough, so that popular programs like BLAST are based on it.
Biotite provides a modular system to build such an alignment search method yourself. At least four steps are necessary:
Creating an index table mapping k-mers to their position in the reference sequence
Find match positions between this k-mer index table and the k-mers of the query sequence
Perform gapped alignments restricted to the match positions
Evaluate the significance of the created alignments
In the following example the short query protein sequence BIQTITE
is aligned to a longer reference sequence containing the ‘homologous’
NIQBITE
in its middle.
While both sequences are relatively short and they could be easily
aligned with the align_optimal()
function, the following
general approach scales well for real world applications, where the
reference could be a large genome or where you have a database of
thousands of sequences.
query = seq.ProteinSequence("BIQTITE")
reference = seq.ProteinSequence(
# This toy sequence is adapted from the first sentence of the
# Wikipedia 'Niobite' article
"CQLVMBITEALSQCALLEDNIQBITEANDCQLVMBATEISAMINERALGRQVPTHATISANQREQFNIQBIVM"
# ^^^^^^^
# Here is the 'homologous' mineral
)
1. Indexing the reference sequence¶
In the first step the k-mers of the reference sequence needs to be
indexed into a KmerTable
.
The k-mers (also called words or k-tuples) of a sequence are all
overlapping subsequences with a given length k.
Indexing means creating a table that maps each k-mer to the position
where this k-mer appears in the sequence - similar to the index of a
book.
Here the first decision needs to be made:
Which k is desired?
A small k improves the sensitivity, a large k decreases the
computation time in the later steps.
In this case we choose 3-mers.
# Create a k-mer index table from the k-mers of the reference sequence
kmer_table = align.KmerTable.from_sequences(
# Use 3-mers
k=3,
# Add only the reference sequence to the table
sequences=[reference],
# The purpose of the reference ID is to identify the sequence
ref_ids=[0]
)
The purpose of the reference ID is to identify not only the position of a k-mer in a sequence, but also which sequence is involved, if you add multiple sequences to the table. In this case there is only a single sequence in the table, so the reference ID is arbitrary.
Let’s have a deeper look under the hood:
The KmerTable
creates a KmerAlphabet
that encodes a
k-mer symbol, i.e. a tuple of k symbols from the base alphabet,
into a k-mer code.
Importantly, this k-mer code can be uniquely decoded back into
a k-mer symbol.
# Access the internal *k-mer* alphabet.
kmer_alphabet = kmer_table.kmer_alphabet
print("Base alphabet:", kmer_alphabet.base_alphabet)
print("k:", kmer_alphabet.k)
print("k-mer code for 'BIQ':", kmer_alphabet.encode("BIQ"))
Base alphabet: ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', 'B', 'Z', 'X', '*']
k: 3
k-mer code for 'BIQ': 7676
for code in range(5):
print(kmer_alphabet.decode(code))
print("...")
['A' 'A' 'A']
['C' 'A' 'A']
['D' 'A' 'A']
['E' 'A' 'A']
['F' 'A' 'A']
...
Furthermore the KmerAlphabet
can encode all overlapping
k-mers of a sequence.
kmer_codes = kmer_alphabet.create_kmers(seq.ProteinSequence("BIQTITE").code)
print("k-mer codes:", kmer_codes)
print("k-mers:")
for kmer in kmer_alphabet.decode_multiple(kmer_codes):
print("".join(kmer))
k-mer codes: [7676 9535 4429 9400 2119]
k-mers:
BIQ
IQT
QTI
TIT
ITE
Now we get back to the KmerTable
.
When the table is created, it uses KmerAlphabet.create_kmers()
to get all k-mers in the sequence and stores for each k-mer the
position(s) where the respective k-mer appears.
# Get all positions for the 'ITE' k-mer
for ref_id, position in kmer_table[kmer_alphabet.encode("ITE")]:
print(position)
6
23
2. Matching the query sequence¶
In the second step we would like to find k-mer matches of our
reference KmerTable
with the query sequence.
A match is a k-mer that appears in both, the table and the query
sequence.
The KmerTable.match()
method iterates over all
overlapping k-mers in the query and checks whether the
KmerTable
has at least one position for this k-mer.
If it does, it adds the position in the query and all corresponding
positions saved in the KmerTable
to the matches.
matches = kmer_table.match(query)
# Filter out the reference ID, because we have only one sequence
# in the table anyway
matches = matches[:, [0,2]]
for query_pos, ref_pos in matches:
print(f"Match at query position {query_pos}, reference position {ref_pos}")
# Print the reference sequence at the match position including four
# symbols before and after the matching k-mer
print("...", reference[ref_pos - 4 : ref_pos + kmer_table.k + 4], "...")
print()
Match at query position 4, reference position 6
... LVMBITEALSQ ...
Match at query position 4, reference position 23
... NIQBITEANDC ...
We see that the 'ITE'
of 'BIQTITE'
matches the 'ITE'
of
'CQLVMBITE'
and 'NIQBITE'
.
3. Alignments at the match positions¶
Now that we have found the match positions, we can perform an
alignment restricted to each match position.
Currently Biotite offers three functions for this purpose:
align_local_ungapped()
, align_local_gapped()
and
align_banded()
.
align_local_ungapped()
and align_local_gapped()
perform fast local alignments expanding from a given seed position,
which is typically set to a match position from the previous step.
The alignment stops, if the current similarity score drops a given
threshold below the maximum score already found, a technique that is
also called X-Drop.
While align_local_ungapped()
is much faster than
align_local_gapped()
, it does not insert gaps into the
alignment.
In contrast align_banded()
performs a local or global
alignment, where the alignment space is restricted to a defined
diagonal band, allowing only a certain number of insertions/deletions
in each sequence.
The presented methods have in common, that they ideally only traverse
through a small fraction of the possible alignment space, allowing
them to run much faster than align_optimal()
.
However they might not find the optimal alignment, if such an
alignment would have an intermediate low scoring region or too many
gaps in either sequence, respectively.
In this tutorial we will focus on using align_banded()
to
perform a global alignment of our two sequences.
align_banded()
requires two diagonals that define the lower
and upper limit of the alignment band.
A diagonal is an integer defined as \(D = j - i\), where i and
j are sequence positions in the first and second sequence,
respectively.
This means that two symbols at position i and j can only be
aligned to each other, if \(D_L \leq j - i \leq D_U\).
In our case we center the diagonal band to the diagonal of the match and use a fixed band width \(W = D_U - D_L\).
BAND_WIDTH = 4
matrix = SubstitutionMatrix.std_protein_matrix()
alignments = []
for query_pos, ref_pos in matches:
diagonal = ref_pos - query_pos
alignment = align.align_banded(
query, reference, matrix, gap_penalty=-5, max_number=1,
# Center the band at the match diagonal and extend the band by
# one half of the band width in each direction
band=(diagonal - BAND_WIDTH//2, diagonal + BAND_WIDTH//2)
)[0]
alignments.append(alignment)
for alignment in alignments:
print(alignment)
print("\n")
BIQTITE
LVMBITE
BIQTITE
NIQBITE
4. Significance evaluation¶
We have obtained two alignments, but which one of them is the
‘correct’ one?
in this simple example we could simply select the one with the highest
similarity score, but this approach is not sound in general:
A reference sequence might contain multiple regions, that are
homologous to the query, or none at all.
A better approach is a statistical measure, like the
BLAST E-value.
It gives the number of alignments expected by chance with a score at
least as high as the score obtained from the alignment of interest.
Hence, a value close to zero means a very significant homology.
We can calculate the E-value using the EValueEstimator
, that
needs to be initialized with the same scoring scheme used for our
alignments.
For the sake of simplicity we choose uniform background frequencies
for each symbol, but usually you would choose values that reflect
the amino acid/nucleotide composition in your sequence database.
estimator = align.EValueEstimator.from_samples(
seq.ProteinSequence.alphabet, matrix, gap_penalty=-5,
frequencies=np.ones(len(seq.ProteinSequence.alphabet)),
# Trade accuracy for a lower runtime
sample_length=200
)
Now we can calculate the E-value for the alignments. Since we have aligned the query only to the reference sequence shown above, we use its length to calculate the E-value. If you have an entire sequence database you align against, you would take the total sequence length of the database instead.
scores = [alignment.score for alignment in alignments]
evalues = 10 ** estimator.log_evalue(scores, len(query), len(reference))
for alignment, evalue in zip(alignments, evalues):
print(f"E-value = {evalue:.2e}")
print(alignment)
print("\n")
E-value = 6.30e-01
BIQTITE
LVMBITE
E-value = 3.60e-02
BIQTITE
NIQBITE
Finally, we can see that the expected alignment of BIQTITE
to
NIQBITE
is more significant than the unspecific match.
The setup shown here is a very simple one compared to the methods popular software like BLAST use. Since the k-mer matching step is very fast and the gapped alignments take the largest part of the time, you usually want to have additional filters before you trigger a gapped alignment: Commonly a gapped alignment is only started at a match, if there is another match on the same diagonal in proximity and if a fast local ungapped alignment (seed extension) exceeds a defined threshold score. Furthermore, the parameter selection, e.g. the k-mer length, is key to a fast but also sensitive alignment procedure. However, you can find suitable parameters in literature or run benchmarks by yourself to find appropriate parameters for your application.
Multiple sequence alignments¶
If you want to perform a multiple sequence alignment (MSA), have a
look at the align_multiple()
function:
seq1 = seq.ProteinSequence("BIQTITE")
seq2 = seq.ProteinSequence("TITANITE")
seq3 = seq.ProteinSequence("BISMITE")
seq4 = seq.ProteinSequence("IQLITE")
alignment, order, guide_tree, distance_matrix = align.align_multiple(
[seq1, seq2, seq3, seq4],
matrix=align.SubstitutionMatrix.std_protein_matrix(),
gap_penalty=-5,
terminal_penalty=False
)
print(alignment)
BIQT-ITE
TITANITE
BISM-ITE
-IQL-ITE
This function is only recommended for strongly related sequences or
exotic sequence types.
When high accuracy or computation time matters, other MSA programs
deliver better results.
External MSA software can accessed via the biotite.application
subpackage.
Sequence features¶
Sequence features describe functional parts of a sequence,
like coding regions or regulatory parts.
One popular source to obtain information about sequence features are
GenBank (for DNA and RNA) and GenPept (for peptides) files.
As example for sequence features we will work with the GenBank file
for the avidin gene (Accession: AJ311647
),
that we can download from the NCBI Entrez database.
After downloading we can load the file using the GenBankFile
class from biotite.sequence.io.genbank
.
Similar to the other file classes we have encountered, a
GenBankFile
provides a low-level interface.
In contrast, the biotite.sequence.io.genbank
module contains
high-level functions to directly obtain useful objects from a
GenBankFile
object.
import biotite.sequence.io.genbank as gb
file_path = entrez.fetch(
"AJ311647", gettempdir(), suffix="gb",
db_name="nuccore", ret_type="gb"
)
file = gb.GenBankFile.read(file_path)
print("Accession:", gb.get_accession(file))
print("Definition:", gb.get_definition(file))
Accession: AJ311647
Definition: Gallus gallus AVD gene for avidin, exons 1-4.
Now that we have loaded the file, we want to have a look at the
sequence features.
Therefore, we grab the Annotation
from the file.
An annotation is the collection of features corresponding to one
sequence (the sequence itself is not included, though).
This Annotation
can be iterated in order to obtain single
Feature
objects.
Each Feature
contains 3 pieces of information: Its feature
key (e.g. regulatory
or CDS
), a dictionary of qualifiers and
one or multiple locations on the corresponding sequence.
A Location
in turn, contains its starting and its ending
base/residue position, the strand it is on (only for DNA) and possible
location defects (defects will be discussed later).
In the next example we will print the keys of the features and their
locations:
annotation = gb.get_annotation(file)
for feature in annotation:
# Convert the feature locations in better readable format
locs = [str(loc) for loc in sorted(feature.locs, key=lambda l: l.first)]
print(f"{feature.key:12} {locs}")
gene ['98-1152 >']
regulatory ['1215-1220 >']
exon ['263-473 >']
exon ['98-178 >']
source ['1-1224 >']
CDS ['98-178 >', '263-473 >', '899-1019 >', '1107-1152 >']
exon ['899-1019 >']
intron ['1020-1106 >']
mRNA ['98-178 >', '263-473 >', '899-1019 >', '1107-1152 >']
intron ['474-898 >']
intron ['179-262 >']
regulatory ['26-33 >']
sig_peptide ['98-169 >']
exon ['1107-1152 >']
The '>'
characters in the string representations of a location
indicate that the location is on the forward strand.
Most of the features have only one location, except the mRNA
and
CDS
feature, which have 4 locations joined.
When we look at the rest of the features, this makes sense: The gene
has 4 exons.
Therefore, the mRNA (and consequently the CDS) is composed of
these exons.
The two regulatory
features are the TATA box and the
poly-A signal, as the feature qualifiers make clear:
for feature in annotation:
if feature.key == "regulatory":
print(feature.qual["regulatory_class"])
polyA_signal_sequence
TATA_box
Similarily to Alignment
objects, we can visualize an
Annotation using the biotite.sequence.graphics
subpackage, in
a so called feature map.
In order to avoid overlaping features, we draw only the CDS feature.
# Get the range of the entire annotation via the *source* feature
for feature in annotation:
if feature.key == "source":
# loc_range has exclusive stop
loc = list(feature.locs)[0]
loc_range = (loc.first, loc.last+1)
fig, ax = plt.subplots(figsize=(8.0, 1.0))
graphics.plot_feature_map(
ax,
seq.Annotation(
[feature for feature in annotation if feature.key == "CDS"]
),
multi_line=False,
loc_range=loc_range,
show_line_position=True
)
fig.tight_layout()

Annotation
objects can be indexed with slices, that represent
the start and the exclusive stop base/residue of the annotation from
which the subannotation is created.
All features, that are not in this range, are not included in the
subannotation.
In order to demonstrate this indexing method, we create a
subannotation that includes only features in range of the gene itself
(without the regulatory stuff).
# At first we have the find the feature with the 'gene' key
for feature in annotation:
if feature.key == "gene":
gene_feature = feature
# Then we create a subannotation from the feature's location
# Since the stop value of the slice is still exclusive,
# the stop value is the position of the last base +1
loc = list(gene_feature.locs)[0]
sub_annot = annotation[loc.first : loc.last +1]
# Print the remaining features and their locations
for feature in sub_annot:
locs = [str(loc) for loc in sorted(feature.locs, key=lambda l: l.first)]
print(f"{feature.key:12} {locs}")
gene ['98-1152 >']
exon ['263-473 >']
exon ['98-178 >']
CDS ['98-178 >', '263-473 >', '899-1019 >', '1107-1152 >']
exon ['899-1019 >']
intron ['1020-1106 >']
sig_peptide ['98-169 >']
mRNA ['98-178 >', '263-473 >', '899-1019 >', '1107-1152 >']
intron ['474-898 >']
intron ['179-262 >']
source ['98-1152 >']
exon ['1107-1152 >']
The regulatory sequences have disappeared in the subannotation.
Another interesting thing happened:
The location of the source`
feature narrowed and
is in range of the slice now. This happened, because the feature was
truncated:
The bases that were not in range of the slice were removed.
Let’s have a closer look into location defects now:
A Location
instance has a defect, when the feature itself is
not directly located in the range of the first to the last base,
for example when the exact postion is not known or, as in our case, a
part of the feature was truncated.
Let’s have a closer look at the location defects of our subannotation:
for feature in sub_annot:
defects = [str(location.defect) for location
in sorted(feature.locs, key=lambda l: l.first)]
print(f"{feature.key:12} {defects}")
gene ['Defect.BEYOND_RIGHT|BEYOND_LEFT']
exon ['Defect.NONE']
exon ['Defect.NONE']
CDS ['Defect.NONE', 'Defect.NONE', 'Defect.NONE', 'Defect.NONE']
exon ['Defect.NONE']
intron ['Defect.NONE']
sig_peptide ['Defect.NONE']
mRNA ['Defect.BEYOND_LEFT', 'Defect.NONE', 'Defect.NONE', 'Defect.BEYOND_RIGHT']
intron ['Defect.NONE']
intron ['Defect.NONE']
source ['Defect.MISS_RIGHT|MISS_LEFT']
exon ['Defect.NONE']
The class Location.Defect
is a Flag
.
This means that multiple defects can be combined to one value.
NONE
means that the location has no defect, which is true for most
of the features.
The source
feature has a defect - a combination of MISS_LEFT
and MISS_RIGHT
. MISS_LEFT
is applied, if a feature was
truncated before the first base, and MISS_RIGHT
is applied, if
a feature was truncated after the last base.
Since source`
was truncated from both sides, the combination is
applied.
gene
has the defect values BEYOND_LEFT
and BEYOND_RIGHT
.
These defects already appear in the GenBank file, since
the gene is defined as the unit that is transcribed into one
(pre-)mRNA.
As the transcription starts somewhere before the start of the coding
region and the exact start location is not known, BEYOND_LEFT
is
applied.
In an analogous way, the transcription does stop somewhere after the
coding region (at the terminator signal).
Hence, BEYOND_RIGHT
is applied.
These two defects are also reflected in the mRNA
feature.
Annotated sequences¶
Now, that you have understood what annotations are, we proceed to the
next topic: annotated sequences.
An AnnotatedSequence
is like an annotation, but the sequence
is included this time.
Since our GenBank file contains the
sequence corresponding to the feature table, we can directly obtain the
AnnotatedSequence
.
annot_seq = gb.get_annotated_sequence(file)
print("Same annotation as before?", (annotation == annot_seq.annotation))
print(annot_seq.sequence[:60], "...")
Same annotation as before? True
ACTGGGCAGAGTCAGTGCTGGAAGCAATMAAAAGGCGAGGGAGCAGGCAGGGGTGAGTCC ...
When indexing an AnnotatedSequence
with a slice,
the index is applied to the Annotation
and the
Sequence
.
While the Annotation
handles the index as shown before,
the Sequence
is indexed based on the sequence start
value (usually 1).
print("Sequence start before indexing:", annot_seq.sequence_start)
for feature in annot_seq.annotation:
if feature.key == "regulatory" \
and feature.qual["regulatory_class"] == "polyA_signal_sequence":
polya_feature = feature
loc = list(polya_feature.locs)[0]
# Get annotated sequence containing only the poly-A signal region
poly_a = annot_seq[loc.first : loc.last +1]
print("Sequence start after indexing:", poly_a.sequence_start)
print(poly_a.sequence)
Sequence start before indexing: 1
Sequence start after indexing: 1215
AATAAA
Here we get the poly-A signal Sequence 'AATAAA'
.
As you might have noticed, the sequence start has shifted to the start
of the slice index (the first base of the regulatory
feature).
Warning
Since AnnotatedSequence
objects use base position
indices and Sequence
objects use array position indices,
you will get different results for annot_seq[n:m].sequence
and
annot_seq.sequence[n:m]
.
There is also a convenient way to obtain the sequence corresponding to
a feature, even if the feature contains multiple locations or a
location is on the reverse strand:
Simply use a Feature
object (in this case the CDS feature)
as index.
for feature in annot_seq.annotation:
if feature.key == "CDS":
cds_feature = feature
cds_seq = annot_seq[cds_feature]
print(cds_seq[:60], "...")
ATGGTGCACGCAACCTCCCCGCTGCTGCTGCTGCTGCTGCTCAGCCTGGCTCTGGTGGCT ...
Awesome.
Now we can translate the sequence and compare it with the translation
given by the CDS feature.
But before we can do that, we have to prepare the data:
The DNA sequence uses an ambiguous alphabet due to the nasty
'M'
at position 28 of the original sequence, we have to remove the
stop symbol after translation and we need to remove the whitespace
characters in the translation given by the CDS feature.
# To make alphabet unambiguous we create a new NucleotideSequence
# containing only the CDS portion, which is unambiguous
# Thus, the resulting NucleotideSequence has an unambiguous alphabet
cds_seq = seq.NucleotideSequence(cds_seq)
# Now we can translate the unambiguous sequence.
prot_seq = cds_seq.translate(complete=True)
print(prot_seq[:60], "...")
print(
"Are the translated sequences equal?",
# Remove stops of our translation
(str(prot_seq.remove_stops()) ==
# Remove whitespace characters from translation given by CDS feature
cds_feature.qual["translation"].replace(" ", ""))
)
MVHATSPLLLLLLLSLALVAPGLSARKCSLTGKWDNDLGSNMTIGAVNSKGEFTGTYTTA ...
Are the translated sequences equal? True
Phylogenetic and guide trees¶
Trees have an important role in bioinformatics, as they are used to guide multiple sequence alignments or to create phylogenies.
In Biotite such a tree is represented by the Tree
class in
the biotite.sequence.phylo
package.
A tree is rooted, that means each tree node has at least one child,
or none in case of leaf nodes.
Each node in a tree is represented by a TreeNode
.
When a TreeNode
is created, you have to provide either child
nodes and their distances to this node (intermediate node) or a
reference index (leaf node).
This reference index is dependent on the context and can refer to
anything: sequences, organisms, etc.
The childs and the reference index cannot be changed after object creation. Also the parent can only be set once - when the node is used as child in the creation of a new node.
import biotite.sequence.phylo as phylo
# The reference objects
fruits = ["Apple", "Pear", "Orange", "Lemon", "Banana"]
# Create nodes
apple = phylo.TreeNode(index=fruits.index("Apple"))
pear = phylo.TreeNode(index=fruits.index("Pear"))
orange = phylo.TreeNode(index=fruits.index("Orange"))
lemon = phylo.TreeNode(index=fruits.index("Lemon"))
banana = phylo.TreeNode(index=fruits.index("Banana"))
intermediate1 = phylo.TreeNode(
children=(apple, pear), distances=(2.0, 2.0)
)
intermediate2 = phylo.TreeNode((orange, lemon), (1.0, 1.0))
intermediate3 = phylo.TreeNode((intermediate2, banana), (2.0, 3.0))
root = phylo.TreeNode((intermediate1, intermediate3), (2.0, 1.0))
# Create tree from root node
tree = phylo.Tree(root=root)
# Trees can be converted into Newick notation
print("Tree:", tree.to_newick(labels=fruits))
# Distances can be omitted
print(
"Tree w/o distances:",
tree.to_newick(labels=fruits, include_distance=False)
)
# Distances can be measured
distance = tree.get_distance(fruits.index("Apple"), fruits.index("Banana"))
print("Distance Apple-Banana:", distance)
Tree: ((Apple:2.0,Pear:2.0):2.0,((Orange:1.0,Lemon:1.0):2.0,Banana:3.0):1.0):0.0;
Tree w/o distances: ((Apple,Pear),((Orange,Lemon),Banana));
Distance Apple-Banana: 8.0
You can also plot a tree as dendrogram.
fig, ax = plt.subplots(figsize=(6.0, 6.0))
graphics.plot_dendrogram(ax, tree, labels=fruits)
fig.tight_layout()

From distances to trees¶
When you want to create a Tree
from distances obtained for
example from sequence alignments, you can use the UPGMA or
neighbour joining algorithm.
distances = np.array([
[ 0, 17, 21, 31, 23],
[17, 0, 30, 34, 21],
[21, 30, 0, 28, 39],
[31, 34, 28, 0, 43],
[23, 21, 39, 43, 0]
])
tree = phylo.upgma(distances)
fig, ax = plt.subplots(figsize=(6.0, 3.0))
graphics.plot_dendrogram(ax, tree, orientation="top")
fig.tight_layout()

Going 3D - The Structure subpackage¶
biotite.structure
is a Biotite subpackage for handling
molecular structures.
This subpackage enables efficient and easy handling of protein structure
data by representing atom attributes in NumPy ndarray
objects.
These atom attributes include so called annotations
(polypetide chain id, residue id, residue name, hetero residue
information, atom name, element, etc.)
and the atom coordinates.
The package contains three central types: Atom
,
AtomArray
and AtomArrayStack
.
An Atom
contains data for a single atom, an AtomArray
stores data for an entire model and AtomArrayStack
stores data
for multiple models, where each model contains the same atoms but
differs in the atom coordinates.
Both, AtomArray
and AtomArrayStack
, store the
attributes in NumPy arrays. This approach has multiple advantages:
Convenient selection of atoms in a structure by using NumPy style indexing
Fast calculations on structures using C-accelerated
ndarray
operationsSimple implementation of custom calculations
Based on the implementation using ndarray
objects, this package
also contains functions for structure analysis and manipulation.
Note
The universal length unit in Biotite is Å. This includes coordinates, distances, surface areas, etc.
Creating structures¶
Let’s begin by constructing some atoms:
import biotite.structure as struc
atom1 = struc.Atom([0,0,0], chain_id="A", res_id=1, res_name="GLY",
atom_name="N", element="N")
atom2 = struc.Atom([0,1,1], chain_id="A", res_id=1, res_name="GLY",
atom_name="CA", element="C")
atom3 = struc.Atom([0,0,2], chain_id="A", res_id=1, res_name="GLY",
atom_name="C", element="C")
The first parameter are the coordinates (internally converted into an
ndarray
), the other parameters are annotations.
The annotations shown in this example are mandatory:
The chain ID, residue ID, residue name, insertion code, atom name,
element and whether the atom is not in protein/nucleotide chain
(hetero).
If you miss one of these, they will get a default value.
The mandatory annotation categories are originated in ATOM and
HETATM records in the PDB format.
The description of each annotation can be viewed in the
API reference.
Additionally, you can specify an arbitrary amount of custom
annotations, like B-factors, charge, etc.
In most cases you won’t work with Atom
instances and in even
fewer cases Atom
instances are created as it is done in the
above example.
If you want to work with an entire molecular structure, containing an
arbitrary amount of atoms, you have to use so called atom arrays.
An atom array can be seen as an array of atom instances
(hence the name).
But instead of storing Atom
instances in a list, an
AtomArray
instance contains one ndarray
for each
annotation and the coordinates.
In order to see this in action, we first have to create an array from
the atoms we constructed before.
Then we can access the annotations and coordinates of the atom array
simply by specifying the attribute.
import numpy as np
array = struc.array([atom1, atom2, atom3])
print("Chain ID:", array.chain_id)
print("Residue ID:", array.res_id)
print("Atom name:", array.atom_name)
print("Coordinates:", array.coord)
print()
print(array)
Chain ID: ['A' 'A' 'A']
Residue ID: [1 1 1]
Atom name: ['N' 'CA' 'C']
Coordinates: [[0. 0. 0.]
[0. 1. 1.]
[0. 0. 2.]]
A 1 GLY N N 0.000 0.000 0.000
A 1 GLY CA C 0.000 1.000 1.000
A 1 GLY C C 0.000 0.000 2.000
The array()
builder function takes any iterable object
containing Atom
instances.
If you wanted to, you could even use another AtomArray
, which
functions also as an iterable object of Atom
objects.
An alternative way of constructing an array would be creating an
AtomArray
by using its constructor, which fills the
annotation arrays and coordinates with the type respective zero
value.
In our example all annotation arrays have a length of 3, since we used
3 atoms to create it.
A structure containing n atoms is represented by annotation arrays
of length n and coordinates of shape (n,3).
As the annotations and coordinates are simply ndarray
objects, they can be edited in the same manner.
array.chain_id[:] = "B"
array.coord[array.element == "C", 0] = 42
# It is also possible to replace an entire annotation with another array
array.res_id = np.array([1,2,3])
print(array)
B 1 GLY N N 0.000 0.000 0.000
B 2 GLY CA C 42.000 1.000 1.000
B 3 GLY C C 42.000 0.000 2.000
Apart from the structure manipulation functions we see later on, this is the usual way to edit structures in Biotite.
Warning
For editing an annotation, the index must be applied to
the annotation and not to the AtomArray
, e.g.
array.chain_id[...] = "B"
instead of
array[...].chain_id = "B"
.
The latter example is incorrect, as it creates a subarray of the
initial AtomArray
(discussed later) and then tries to
replace the annotation array with the new value.
If you want to add further annotation categories to an array, you have
to call the add_annotation()
or set_annotation()
method at first. After that you can access the new annotation array
like any other annotation array.
array.add_annotation("foo", dtype=bool)
array.set_annotation("bar", [1, 2, 3])
print(array.foo)
print(array.bar)
[False False False]
[1 2 3]
In some cases, you might need to handle structures, where each atom is
present in multiple locations
(multiple models in NMR structures, MD trajectories).
For the cases AtomArrayStack
objects are used, which
represent a list of atom arrays.
Since the atoms are the same for each frame, but only the coordinates
change, the annotation arrays in stacks are still the same length n
ndarray
objects as in atom arrays.
However, a stack stores the coordinates in a (m,n,3)-shaped
ndarray
, where m is the number of frames.
A stack is constructed with stack()
analogous to the code
snipped above.
It is crucial that all arrays that should be stacked
have the same annotation arrays, otherwise an exception is raised.
For simplicity reasons, we create a stack containing two identical
models, derived from the previous example.
stack = struc.stack([array, array.copy()])
print(stack)
Model 1
B 1 GLY N N 0.000 0.000 0.000
B 2 GLY CA C 42.000 1.000 1.000
B 3 GLY C C 42.000 0.000 2.000
Model 2
B 1 GLY N N 0.000 0.000 0.000
B 2 GLY CA C 42.000 1.000 1.000
B 3 GLY C C 42.000 0.000 2.000
Loading structures from file¶
Usually structures are not built from scratch, but they are read from
a file.
Probably the most popular structure file format is the PDB format.
For our purpose, we will work on a protein structure as small as
possible, namely the miniprotein TC5b (PDB: 1L2Y
).
The structure of this 20-residue protein (304 atoms) has been
elucidated via NMR.
Thus, the corresponding PDB file consists of multiple (namely 38)
models, each showing another conformation.
At first we load the structure from a PDB file via the class
PDBFile
in the subpackage biotite.structure.io.pdb
.
from tempfile import gettempdir, NamedTemporaryFile
import biotite.structure.io.pdb as pdb
import biotite.database.rcsb as rcsb
pdb_file_path = rcsb.fetch("1l2y", "pdb", gettempdir())
pdb_file = pdb.PDBFile.read(pdb_file_path)
tc5b = pdb_file.get_structure()
print(type(tc5b).__name__)
print(tc5b.stack_depth())
print(tc5b.array_length())
print(tc5b.shape)
AtomArrayStack
38
304
(38, 304)
The method PDBFile.get_structure()
returns an atom array stack
unless the model
parameter is specified,
even if the file contains only one model.
Alternatively, the module level function get_structure()
can be used.
The following example
shows how to write an atom array or stack back into a PDB file:
pdb_file = pdb.PDBFile()
pdb_file.set_structure(tc5b)
temp_file = NamedTemporaryFile(suffix=".pdb", delete=False)
pdb_file.write(temp_file.name)
temp_file.close()
Other information (authors, secondary structure, etc.) cannot be extracted from PDB files, yet. This is a good place to mention, that it is recommended to use the modern PDBx/mmCIF format in favor of the PDB format. It solves limitations of the PDB format, that arise from the column restrictions. Furthermore, much more additional information is stored in these files.
In contrast to PDB files, Biotite can read the entire content of
PDBx/mmCIF files, which can be accessed in a dictionary like manner.
At first, we read the file similarily to before, but this time we
use the PDBxFile
class.
import biotite.structure.io.pdbx as pdbx
cif_file_path = rcsb.fetch("1l2y", "cif", gettempdir())
cif_file = pdbx.PDBxFile.read(cif_file_path)
Now we can access the data like a dictionary of dictionaries.
print(cif_file["1L2Y", "audit_author"]["name"])
['Neidigh, J.W.' 'Fesinmeyer, R.M.' 'Andersen, N.H.']
The first index contains the data block and the category name.
The data block could be omitted, since there is only one block in the
file.
This returns a dictionary.
If the category is in a loop_
category, i.e. the category’s fields
have a list of values, like in this case, the dictionary contains
ndarray
objects of type string as values, otherwise the
dictionary contains strings directly.
The second index specifies the name of the subcategory, which is used
as key in this dictionary and returns the corresponding
ndarray
.
Setting/adding a category in the file is done in a similar way:
cif_file["audit_author"] = {
"name" : ["Doe, Jane", "Doe, John"],
"pdbx_ordinal" : ["1","2"]
}
In most applications only the structure itself
(stored in the atom_site category) is relevant.
get_structure()
and set_structure()
are convenience
functions that are used to convert the
atom_site
category into an atom array (stack) and vice versa.
tc5b = pdbx.get_structure(cif_file)
# Do some fancy stuff
pdbx.set_structure(cif_file, tc5b)
get_structure()
creates automatically an
AtomArrayStack
, even if the file actually contains only a
single model.
If you would like to have an AtomArray
instead, you have to
specifiy the model
parameter.
If you want to parse a large batch of structure files or you have to
load very large structure files, the usage of PDB or mmCIF files might
be too slow for your requirements.
In this case you probably might want to use MMTF files.
MMTF files describe structures just like PDB and mmCIF files,
but they are binary!
This circumstance increases the downloading and parsing speed by
several multiples.
The usage is similar to PDBxFile
: The MMTFFile
class
decodes the file and makes it raw information accessible.
Via get_structure()
the data can be loaded into an atom array
(stack) and set_structure()
is used to save it back into a
MMTF file.
import numpy as np
import biotite.structure.io.mmtf as mmtf
mmtf_file_path = rcsb.fetch("1l2y", "mmtf", gettempdir())
mmtf_file = mmtf.MMTFFile.read(mmtf_file_path)
stack = mmtf.get_structure(mmtf_file)
array = mmtf.get_structure(mmtf_file, model=1)
# Do some fancy stuff
mmtf.set_structure(mmtf_file, array)
A more low level access to MMTF files is also possible:
An MMTF file is structured as dictionary, with each key being a
structural feature like the coordinates, the residue ID or the
secondary structure.
Most of the fields are encoded to reduce to size of the file,
but the whole decoding process is handled automatically by
the MMTFFile
class:
If a field is encoded the decoded
ndarray
is returned, otherwise the value is directly
returned.
A list of all MMTF fields (keys) can be found in the
specification.
The implementation of MMTFFile
decodes the encoded fields
only when you need them, so no computation time is wasted on fields
you are not interested in.
# Field is not encoded
print(mmtf_file["title"])
# Field is encoded and is automatically decoded
print(mmtf_file["groupIdList"])
NMR Structure of Trp-Cage Miniprotein Construct TC5b
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20]
Setting fields of an MMTF file works in an analogous way for values,
that should not be encoded.
The situation is a little more complex for arrays, that should be
encoded:
Since arbitrarily named fields can be set in the file,
MMTFFile
does not know which codec to use for encoding
your array.
Hence, you need to use the MMTFFile.set_array()
function.
mmtf_file["title"] = "Some other title"
print(mmtf_file["title"])
# Determine appropriate codec from the codec used originally
mmtf_file.set_array(
"groupIdList",
np.arange(20,40),
codec=mmtf_file.get_codec("groupIdList"))
print(mmtf_file["groupIdList"])
Some other title
[20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39]
An alternative file format for storing and loading atom arrays and
stacks even faster, is the NPZ format.
The big disadvantage is that the format is Biotite-exclusive:
No other software will be able to read these files.
These are simple binary files, that are used to store NumPy arrays.
In case of atom arrays and stacks, the annotation arrays and
coordinates are written/read to/from npz files via the
NpzFile
class.
Since no expensive data conversion has to be performed,
this format is the fastest way to save and load atom arrays and
stacks.
import biotite.structure.io.npz as npz
file = npz.NpzFile()
file.set_structure(array)
reloaded_array = file.get_structure()
There are also some other supported file formats.
For a full list, have a look at biotite.structure.io
.
Since programmers are usually lazy and do not want to write more code
than necessary, there are two convenient function for loading and
saving atom arrays or stacks, unifying the forementioned file formats:
load_structure()
takes a file path and outputs an array
(or stack, if the file contains multiple models).
Internally, this function uses the appropriate File
class,
depending on the file format.
The analogous save_structure()
function provides a shortcut
for writing to structure files.
The desired file format is inferred from the the extension of the
provided file name.
import biotite.structure.io as strucio
stack_from_pdb = strucio.load_structure(pdb_file_path)
stack_from_cif = strucio.load_structure(cif_file_path)
temp_file = NamedTemporaryFile(suffix=".cif", delete=False)
strucio.save_structure(temp_file.name, stack_from_pdb)
temp_file.close()
Reading trajectory files¶
If the package MDtraj is installed, Biotite provides a read/write
interface for different trajectory file formats.
All supported trajectory formats have in common, that they store
only coordinates.
These can be extracted as ndarray
with the
get_coord()
method.
from tempfile import NamedTemporaryFile
import requests
import biotite.structure.io.xtc as xtc
# Download 1L2Y as XTC file for demonstration purposes
temp_xtc_file = NamedTemporaryFile("wb", suffix=".xtc", delete=False)
response = requests.get(
"https://raw.githubusercontent.com/biotite-dev/biotite/master/"
"tests/structure/data/1l2y.xtc"
)
temp_xtc_file.write(response.content)
traj_file = xtc.XTCFile.read(temp_xtc_file.name)
coord = traj_file.get_coord()
print(coord.shape)
(38, 304, 3)
If only an excerpt of frames is desired, the behavior of the
read()
function can be customized with the start, stop and
step parameters.
# Read only every second frame
traj_file = xtc.XTCFile.read(temp_xtc_file.name, step=2)
coord = traj_file.get_coord()
print(coord.shape)
(19, 304, 3)
In order to extract an entire structure, i.e. an
AtomArrayStack
, from a trajectory file, a template
structure must be given, since the trajectory file contains only
coordinate information.
import biotite.database.rcsb as rcsb
import biotite.structure.io.mmtf as mmtf
mmtf_file_path = rcsb.fetch("1l2y", "mmtf", gettempdir())
mmtf_file = mmtf.MMTFFile.read(mmtf_file_path)
template = mmtf.get_structure(mmtf_file, model=1)
traj_file = xtc.XTCFile.read(temp_xtc_file.name)
trajectory = traj_file.get_structure(template)
temp_xtc_file.close()
Array indexing and filtering¶
Atom arrays and stacks can be indexed in a similar way a
ndarray
is indexed.
In fact, the index is propagated to the coordinates and the annotation
arrays.
Therefore, all NumPy compatible types of indices can be used,
like boolean arrays, index arrays/lists, slices and, of course,
integer values.
Integer indices have a special role here, as they reduce the
dimensionality of the data type:
Indexing an AtomArrayStack
with an integer results in an
AtomArray
at the specified frame, indexing an AtomArray
with an integer yields the specified Atom
.
Iterating over arrays and stacks reduces the dimensionality in an
analogous way.
Let’s demonstrate indexing with the help of the structure of TC5b.
from tempfile import gettempdir
import biotite.structure as struc
import biotite.database.rcsb as rcsb
import biotite.structure.io as strucio
file_path = rcsb.fetch("1l2y", "mmtf", gettempdir())
stack = strucio.load_structure(file_path)
print(type(stack).__name__)
print(stack.shape)
# Get the third model
array = stack[2]
print(type(array).__name__)
print(array.shape)
# Get the fifth atom
atom = array[4]
print(type(atom).__name__)
print(atom.shape)
AtomArrayStack
(38, 304)
AtomArray
(304,)
Atom
()
load_structure()
gives us an AtomArrayStack
.
The first indexing step reduces the stack to an atom array and the
second indexing step reduces the array to a single atom.
The shape attribute gives the number of models and atoms, similarly
to the shape attribute of ndarray
objects.
Alternatively, the stack_depth()
or array_length()
methods can be used to get the number of models or atoms,
respectively.
The following code section shows some examples for how an atom array can be indexed.
# Get the first atom
atom = array[0]
# Get a subarray containing the first and third atom
subarray = array[[0,2]]
# Get a subarray containing a range of atoms using slices
subarray = array[100:200]
# Filter all carbon atoms in residue 1
subarray = array[(array.element == "C") & (array.res_id == 1)]
# Filter all atoms where the X-coordinate is smaller than 2
subarray = array[array.coord[:,0] < 2]
An atom array stack can be indexed in a similar way, with the difference, that the index specifies the frame(s).
# Get an atom array from the first model
subarray = stack[0]
# Get a substack containing the first 10 models
substack = stack[:10]
Stacks also have the speciality, that they can handle 2-dimensional indices, where the first dimension specifies the frame(s) and the second dimension specifies the atom(s).
# Get the first 100 atoms from the third model
subarray = stack[2, :100]
# Get the first 100 atoms from the models 3, 4 and 5
substack = stack[2:5, :100]
# Get the first atom in the second model
atom = stack[1,0]
# Get a stack containing arrays containing only the first atom
substack = stack[:, 0]
Furthermore, biotite.structure
contains advanced filters,
that create boolean masks from an atom array using specific criteria.
Here is a small example.
backbone = array[struc.filter_peptide_backbone(array)]
print(backbone.atom_name)
['N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA'
'C' 'N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C' 'N'
'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C'
'N' 'CA' 'C' 'N' 'CA' 'C' 'N' 'CA' 'C']
If you would like to know which atoms are in proximity to specific
coordinates, have a look at the CellList
class.
Warning
For annotation editing Since AnnotatedSequence
objects use base position
indices and Sequence
objects use array position indices,
you will get different results for annot_seq[n:m].sequence
and
annot_seq.sequence[n:m]
.
Representing bonds¶
Up to now we only looked into atom arrays whose atoms are merely
described by its coordinates and annotations.
But there is more: Chemical bonds can be described, too, using a
BondList
!
Consider the following case: Your atom array contains four atoms:
N, CA, C and CB. CA is a central atom that is connected to
N, C and CB.
A BondList
is created by passing a ndarray
containing pairs of integers, where each integer represents an index
in a corresponding atom array.
The pairs indicate which atoms share a bond.
Additionally, it is required to specify the number of atoms in the
atom array.
from tempfile import gettempdir
import biotite.structure as struc
array = struc.array([
struc.Atom([0,0,0], atom_name="N"),
struc.Atom([0,0,0], atom_name="CA"),
struc.Atom([0,0,0], atom_name="C"),
struc.Atom([0,0,0], atom_name="CB")
])
print("Atoms:", array.atom_name)
bond_list = struc.BondList(
array.array_length(),
np.array([[1,0], [1,2], [1,3]])
)
print("Bonds (indices):")
print(bond_list.as_array())
print("Bonds (atoms names):")
print(array.atom_name[bond_list.as_array()[:, :2]])
ca_bonds, ca_bond_types = bond_list.get_bonds(1)
print("Bonds of CA:", array.atom_name[ca_bonds])
Atoms: ['N' 'CA' 'C' 'CB']
Bonds (indices):
[[0 1 0]
[1 2 0]
[1 3 0]]
Bonds (atoms names):
[['N' 'CA']
['CA' 'C']
['CA' 'CB']]
Bonds of CA: ['N' 'C' 'CB']
When you look at the internal ndarray
(as given by BondList.as_array()
), you see a third column
containing zeros.
This column describes each bond with values from the BondType
enum: 0 corresponds to BondType.ANY
, which means that the type
of the bond is undefined.
This makes sense, since we did not define the bond types, when we
created the bond list.
The other thing that has changed is the index order:
Each bond is sorted so that the index with the lower index is the
first element.
Although a BondList
uses a ndarray
under the hood,
indexing works a little bit different:
The indexing operation is not applied to the internal
ndarray
, instead it behaves like the same indexing operation
was applied to a corresponding atom array:
The bond list adjusts its indices so that they still point to the same
atoms as before.
Bonds that involve at least one atom, that has been removed, are
deleted as well.
We will try that by deleting the C atom.
mask = (array.atom_name != "C")
sub_array = array[mask]
sub_bond_list = bond_list[mask]
print("Atoms:", sub_array.atom_name)
print("Bonds (indices):")
print(sub_bond_list.as_array())
print("Bonds (atoms names):")
print(sub_array.atom_name[sub_bond_list.as_array()[:, :2]])
Atoms: ['N' 'CA' 'CB']
Bonds (indices):
[[0 1 0]
[1 2 0]]
Bonds (atoms names):
[['N' 'CA']
['CA' 'CB']]
As you see, the the bonds involving the C (only a single one) are removed and the remaining indices are shifted.
Connecting atoms and bonds¶
We do not have to index the atom array and the bond list
separately.
For convenience reasons you can associate a BondList
to an
AtomArray
via the bonds
attribute.
If no BondList
is associated, bonds
is None
.
Every time the atom array is indexed, the index is also applied to the
associated bond list.
The same behavior applies to concatenations, by the way.
array.bonds = bond_list
sub_array = array[array.atom_name != "C"]
print("Bonds (atoms names):")
print(sub_array.atom_name[sub_array.bonds.as_array()[:, :2]])
Bonds (atoms names):
[['N' 'CA']
['CA' 'CB']]
Note, that some functionalities in Biotite even require that the
input structure has an associated BondList
.
Reading and writing bonds¶
Up to now the bond information has been created manually, which is infeasible for larger molecules. Instead bond information can be loaded from and saved to some file formats, including MMTF and MOL files. We’ll try that on the structure of TC5b and look at the bond information of the third residue, a tyrosine.
import biotite.database.rcsb as rcsb
import biotite.structure.io.mmtf as mmtf
file_path = rcsb.fetch("1l2y", "mmtf", gettempdir())
mmtf_file = mmtf.MMTFFile.read(file_path)
# Essential: set the 'include_bonds' parameter to true
stack = mmtf.get_structure(mmtf_file, include_bonds=True)
tyrosine = stack[:, (stack.res_id == 3)]
print("Bonds (indices):")
print(tyrosine.bonds)
print("Bonds (atoms names):")
print(tyrosine.atom_name[tyrosine.bonds.as_array()[:, :2]])
Bonds (indices):
[[ 0 1 1]
[ 1 2 1]
[ 2 3 2]
[ 1 4 1]
[ 4 5 1]
[ 5 6 2]
[ 5 7 1]
[ 6 8 1]
[ 7 9 2]
[ 8 10 2]
[ 9 10 1]
[10 11 1]
[ 0 12 1]
[ 1 13 1]
[ 4 14 1]
[ 4 15 1]
[ 6 16 1]
[ 7 17 1]
[ 8 18 1]
[ 9 19 1]
[11 20 1]]
Bonds (atoms names):
[['N' 'CA']
['CA' 'C']
['C' 'O']
['CA' 'CB']
['CB' 'CG']
['CG' 'CD1']
['CG' 'CD2']
['CD1' 'CE1']
['CD2' 'CE2']
['CE1' 'CZ']
['CE2' 'CZ']
['CZ' 'OH']
['N' 'H']
['CA' 'HA']
['CB' 'HB2']
['CB' 'HB3']
['CD1' 'HD1']
['CD2' 'HD2']
['CE1' 'HE1']
['CE2' 'HE2']
['OH' 'HH']]
Since we loaded the bond information from a MMTF file, the bond types
are also defined:
Here we have both, BondType.SINGLE
and BondType.DOUBLE
bonds (1 and 2, respectively).
Bond information can also be automatically inferred from an
AtomArray
or AtomArrayStack
:
connect_via_residue_names()
is able to connect atoms in
all residues that appear in the
Chemical Component Dictionary,
comprising every molecule that appears in any PDB entry.
In contrast, connect_via_distances()
uses only distances
between atoms to infer the bonds.
However, this function creates only BondType.ANY
bonds and might
connect wrong atoms in rare cases.
Since both functions have the mentioned limitations it is usually
better to read bond information directly from file, if available.
import biotite.structure.io.pdb as pdb
file_path = rcsb.fetch("1l2y", "pdb", gettempdir())
pdb_file = pdb.PDBFile.read(file_path)
stack = pdb_file.get_structure()
stack.bonds = struc.connect_via_residue_names(stack)
tyrosine = stack[:, (stack.res_id == 3)]
print("Bonds (indices):")
print(tyrosine.bonds)
print("Bonds (atoms names):")
print(tyrosine.atom_name[tyrosine.bonds.as_array()[:, :2]])
Bonds (indices):
[[ 0 1 1]
[ 0 12 1]
[ 1 2 1]
[ 1 4 1]
[ 1 13 1]
[ 2 3 2]
[ 4 5 1]
[ 4 14 1]
[ 4 15 1]
[ 5 6 6]
[ 5 7 5]
[ 6 8 5]
[ 6 16 1]
[ 7 9 6]
[ 7 17 1]
[ 8 10 6]
[ 8 18 1]
[ 9 10 5]
[ 9 19 1]
[10 11 1]
[11 20 1]]
Bonds (atoms names):
[['N' 'CA']
['N' 'H']
['CA' 'C']
['CA' 'CB']
['CA' 'HA']
['C' 'O']
['CB' 'CG']
['CB' 'HB2']
['CB' 'HB3']
['CG' 'CD1']
['CG' 'CD2']
['CD1' 'CE1']
['CD1' 'HD1']
['CD2' 'CE2']
['CD2' 'HD2']
['CE1' 'CZ']
['CE1' 'HE1']
['CE2' 'CZ']
['CE2' 'HE2']
['CZ' 'OH']
['OH' 'HH']]
Simulation boxes and unit cells¶
Depending on the source of the macromolecular structure, there might
be an associated unit cell or simulation box.
In this package such boxes are represented by (3,3)-shaped
ndarray
objects, where each element in the array is one of
the three vectors spanning the box or unit cell.
Let’s create an orthorhombic box from the vector lengths and the
angles between the vectors.
from tempfile import gettempdir
import numpy as np
import biotite.structure as struc
# The function uses angles in radians
box = struc.vectors_from_unitcell(10, 20, 30, np.pi/2, np.pi/2, np.pi/2)
print("Box:")
print(box)
print("Box volume:", struc.box_volume(box))
print("Is the box orthogonal?:", struc.is_orthogonal(box))
cell = struc.unitcell_from_vectors(box)
print("Cell:")
print(cell)
Box:
[[10. 0. 0.]
[ 0. 20. 0.]
[ 0. 0. 30.]]
Box volume: 6000.0
Is the box orthogonal?: True
Cell:
(10.0, 20.0, 30.0, 1.5707964, 1.5707964, 1.5707964)
An atom array can have an associated box, which is used in functions,
that consider periodic boundary conditions.
Atom array stacks require a (m,3,3)-shaped ndarray
,
that contains the box vectors for each model.
The box is accessed via the box attribute, which is None
by
default.
When loaded from a structure file, the box described in the file is
automatically used.
import biotite.database.rcsb as rcsb
import biotite.structure.io as strucio
array = struc.AtomArray(length=100)
print(array.box)
array.box = struc.vectors_from_unitcell(10, 20, 30, np.pi/2, np.pi/2, np.pi/2)
print(array.box)
file_path = rcsb.fetch("3o5r", "mmtf", gettempdir())
array = strucio.load_structure(file_path)
print(array.box)
None
[[10. 0. 0.]
[ 0. 20. 0.]
[ 0. 0. 30.]]
[[42.051 0. 0. ]
[ 0. 54.784 0. ]
[ 0. 0. 56.816]]
When loading a trajectory from an MD simulation, the molecules are
often fragmented over the periodic boundary.
While a lot of analysis functions can handle such periodic boundary
conditions automatically, some require completed molecules.
In this case you should use remove_pbc()
.
array = struc.remove_pbc(array)
Structure manipulation¶
The most basic way to manipulate a structure is to edit the annotation arrays or coordinates directly.
from tempfile import gettempdir
import biotite.database.rcsb as rcsb
import biotite.structure as struc
import biotite.structure.io.mmtf as mmtf
file_path = rcsb.fetch("1l2y", "mmtf", gettempdir())
mmtf_file = mmtf.MMTFFile.read(file_path)
structure = mmtf.get_structure(mmtf_file, model=1)
print("Before:")
print(structure[structure.res_id == 1])
print()
structure.coord += 100
print("After:")
print(structure[structure.res_id == 1])
Before:
A 1 ASN N N -8.901 4.127 -0.555
A 1 ASN CA C -8.608 3.135 -1.618
A 1 ASN C C -7.117 2.964 -1.897
A 1 ASN O O -6.634 1.849 -1.758
A 1 ASN CB C -9.437 3.396 -2.889
A 1 ASN CG C -10.915 3.130 -2.611
A 1 ASN OD1 O -11.269 2.700 -1.524
A 1 ASN ND2 N -11.806 3.406 -3.543
A 1 ASN H1 H -8.330 3.957 0.261
A 1 ASN H2 H -8.740 5.068 -0.889
A 1 ASN H3 H -9.877 4.041 -0.293
A 1 ASN HA H -8.930 2.162 -1.239
A 1 ASN HB2 H -9.310 4.417 -3.193
A 1 ASN HB3 H -9.108 2.719 -3.679
A 1 ASN HD21 H -11.572 3.791 -4.444
A 1 ASN HD22 H -12.757 3.183 -3.294
After:
A 1 ASN N N 91.099 104.127 99.445
A 1 ASN CA C 91.392 103.135 98.382
A 1 ASN C C 92.883 102.964 98.103
A 1 ASN O O 93.366 101.849 98.242
A 1 ASN CB C 90.563 103.396 97.111
A 1 ASN CG C 89.085 103.130 97.389
A 1 ASN OD1 O 88.731 102.700 98.476
A 1 ASN ND2 N 88.194 103.406 96.457
A 1 ASN H1 H 91.670 103.957 100.261
A 1 ASN H2 H 91.260 105.068 99.111
A 1 ASN H3 H 90.123 104.041 99.707
A 1 ASN HA H 91.070 102.162 98.761
A 1 ASN HB2 H 90.690 104.417 96.807
A 1 ASN HB3 H 90.892 102.719 96.321
A 1 ASN HD21 H 88.428 103.791 95.556
A 1 ASN HD22 H 87.243 103.183 96.706
Biotite provides also some transformation functions, for example
rotate()
for rotations about the x-, y- or z-axis.
structure = mmtf.get_structure(mmtf_file, model=1)
print("Before:")
print(structure[structure.res_id == 1])
print()
# Rotation about z-axis by 90 degrees
structure = struc.rotate(structure, [0, 0, np.deg2rad(90)])
print("After:")
print(structure[structure.res_id == 1])
Before:
A 1 ASN N N -8.901 4.127 -0.555
A 1 ASN CA C -8.608 3.135 -1.618
A 1 ASN C C -7.117 2.964 -1.897
A 1 ASN O O -6.634 1.849 -1.758
A 1 ASN CB C -9.437 3.396 -2.889
A 1 ASN CG C -10.915 3.130 -2.611
A 1 ASN OD1 O -11.269 2.700 -1.524
A 1 ASN ND2 N -11.806 3.406 -3.543
A 1 ASN H1 H -8.330 3.957 0.261
A 1 ASN H2 H -8.740 5.068 -0.889
A 1 ASN H3 H -9.877 4.041 -0.293
A 1 ASN HA H -8.930 2.162 -1.239
A 1 ASN HB2 H -9.310 4.417 -3.193
A 1 ASN HB3 H -9.108 2.719 -3.679
A 1 ASN HD21 H -11.572 3.791 -4.444
A 1 ASN HD22 H -12.757 3.183 -3.294
After:
A 1 ASN N N -4.127 -8.901 -0.555
A 1 ASN CA C -3.135 -8.608 -1.618
A 1 ASN C C -2.964 -7.117 -1.897
A 1 ASN O O -1.849 -6.634 -1.758
A 1 ASN CB C -3.396 -9.437 -2.889
A 1 ASN CG C -3.130 -10.915 -2.611
A 1 ASN OD1 O -2.700 -11.269 -1.524
A 1 ASN ND2 N -3.406 -11.806 -3.543
A 1 ASN H1 H -3.957 -8.330 0.261
A 1 ASN H2 H -5.068 -8.740 -0.889
A 1 ASN H3 H -4.041 -9.877 -0.293
A 1 ASN HA H -2.162 -8.930 -1.239
A 1 ASN HB2 H -4.417 -9.310 -3.193
A 1 ASN HB3 H -2.719 -9.108 -3.679
A 1 ASN HD21 H -3.791 -11.572 -4.444
A 1 ASN HD22 H -3.183 -12.757 -3.294
For a complete list of transformation functions have a look in the API reference.
Structure analysis¶
This package would be almost useless, if there wasn’t some means to analyze your structures. Therefore, Biotite offers a bunch of functions for this purpose, reaching from simple bond angle and length measurements to more complex characteristics, like accessible surface area and secondary structure. The following section will introduce you to some of these functions, which should be applied to that good old structure of TC5b.
The examples shown in this section are only a small glimpse into the structure analysis toolset. Have a look into the API reference for more information.
Geometry measures¶
Let’s start with measuring some simple geometric characteristics, for example atom distances of CA atoms.
from tempfile import gettempdir
import biotite.structure as struc
import biotite.structure.io as strucio
import biotite.database.rcsb as rcsb
file_path = rcsb.fetch("1l2y", "mmtf", gettempdir())
stack = strucio.load_structure(file_path)
# Filter only CA atoms
stack = stack[:, stack.atom_name == "CA"]
# Calculate distance between first and second CA in first frame
array = stack[0]
print("Atom to atom:", struc.distance(array[0], array[1]))
# Calculate distance between the first atom
# and all other CA atoms in the array
print("Array to atom:")
array = stack[0]
print(struc.distance(array[0], array))
# Calculate pairwise distances between the CA atoms in the first frame
# and the CA atoms in the second frame
print("Array to array:")
print(struc.distance(stack[0], stack[1]))
# Calculate the distances between all CA atoms in the stack
# and the first CA atom in the first frame
# The resulting array is too large, therefore only the shape is printed
print("Stack to atom:")
print(struc.distance(stack, stack[0,0]).shape)
# And finally distances between two adjacent CA in the first frame
array = stack[0]
print("Adjacent CA distances")
print(struc.distance(array[:-1], array[1:]))
Atom to atom: 3.876399
Array to atom:
[ 0. 3.876399 5.5766597 5.038891 6.316409 8.766815
9.908135 10.614817 12.890331 14.806679 13.5011635 16.87541
18.723566 17.224289 19.111933 16.193 15.514756 12.377812
10.445933 12.058967 ]
Array to array:
[3.4344199 0.37241507 0.22178602 0.10823133 0.15207201 0.17017055
0.22572783 0.47650498 0.29493213 0.1548351 0.2832347 0.40683934
0.1355508 0.3676874 0.46464393 0.57544243 0.33707416 0.25703317
0.34762198 0.38818687]
Stack to atom:
(38, 20)
Adjacent CA distances
[3.876399 3.8605015 3.87147 3.8455799 3.8666048 3.8585181 3.8818028
3.860987 3.8909183 3.8635552 3.8862703 3.8756127 3.8746684 3.8655493
3.8662775 3.8776622 3.8603828 3.8582468 3.8642192]
Like some other functions in biotite.structure
, we are able to
pick any combination of an atom, atom array or stack. Alternatively
ndarray
objects containing the coordinates can be provided.
Furthermore, we can measure bond angles and dihedral angles.
# Calculate angle between first 3 CA atoms in first frame
# (in radians)
print("Angle:", struc.angle(array[0],array[1],array[2]))
# Calculate dihedral angle between first 4 CA atoms in first frame
# (in radians)
print("Dihedral angle:", struc.dihedral(array[0],array[1],array[2],array[4]))
Angle: 1.6098708
Dihedral angle: 1.4903792
Note
The distance()
, angle()
and
dihedral()
functions all have their pair_...()
counterparts, that take an atom array (stack) and
pairs/triplets/quadruplets of atom indices, of which the
distance/angle should be calculated.
Both variants can be setup to consider periodic boundary conditions
by setting the box or periodic parameter, respectively.
In some cases one is interested in the dihedral angles of the peptide backbone, \(\phi\), \(\psi\) and \(\omega\). In the following code snippet we measure these angles and create a simple Ramachandran plot for the first frame of TC5b.
import matplotlib.pyplot as plt
import numpy as np
array = strucio.load_structure(file_path)[0]
phi, psi, omega = struc.dihedral_backbone(array)
plt.plot(phi * 360/(2*np.pi), psi * 360/(2*np.pi),
marker="o", linestyle="None")
plt.xlim(-180,180)
plt.ylim(-180,180)
plt.xlabel("$\phi$")
plt.ylabel("$\psi$")

Text(26.472222222222214, 0.5, '$\\psi$')
Comparing structures¶
Now we want to calculate a measure of flexibility for each residue in TC5b. The root mean square fluctuation (RMSF) is a good value for that. It represents the deviation for each atom in all models relative to a reference model, which is usually the averaged structure. Since we are only interested in the backbone flexibility, we consider only CA atoms. Before we can calculate a reasonable RMSF, we have to superimpose each model on a reference model (we choose the first model), which minimizes the root mean square deviation (RMSD).
stack = strucio.load_structure(file_path)
# We consider only CA atoms
stack = stack[:, stack.atom_name == "CA"]
# Superimposing all models of the structure onto the first model
stack, transformation_tuple = struc.superimpose(stack[0], stack)
print("RMSD for each model to first model:")
print(struc.rmsd(stack[0], stack))
# Calculate the RMSF relative to the average of all models
rmsf = struc.rmsf(struc.average(stack), stack)
# Plotting stuff
plt.plot(np.arange(1,21), rmsf)
plt.xlim(0,21)
plt.xticks(np.arange(1,21))
plt.xlabel("Residue")
plt.ylabel("RMSF")

RMSD for each model to first model:
[0. 0.7842643 1.0075767 0.5518028 0.8066345 1.0606678
0.8738371 0.6260641 1.0057656 0.81440794 0.876283 1.353859
0.93277985 0.8760094 0.99357325 0.40626574 0.31801927 1.1838906
1.2347708 0.89114463 0.5553653 0.73639375 0.7856738 1.1019256
0.67228884 1.1605636 0.98213965 1.2280884 0.79269636 0.8685473
0.9386668 0.8356571 0.61650354 0.97335416 1.0322398 0.5555665
1.1517522 0.8558534 ]
Text(42.597222222222214, 0.5, 'RMSF')
As you can see, both terminal residues are most flexible.
Calculating accessible surface area¶
Another interesting value for a protein structure is the
solvent accessible surface area (SASA) that indicates whether an
atom or residue is on the protein surface or buried inside the
protein.
The function sasa()
approximates the SASA for each
atom.
Then we sum up the values for each residue, to get the
residue-wise SASA.
Besides other parameters, you can choose between different Van-der-Waals radii sets: Prot0r, the default set, is a set that defines radii for non-hydrogen atoms, but determines the radius of an atom based on the assumed amount of hydrogen atoms connected to it. Therefore, ProtOr is suitable for structures with missing hydrogen atoms, like crystal structures. Since the structure of TC5b was elucidated via NMR, we can assign a radius to every single atom (including hydrogens), hence we use the Single set.
array = strucio.load_structure(file_path)[0]
# The following line calculates the atom-wise SASA of the atom array
atom_sasa = struc.sasa(array, vdw_radii="Single")
# Sum up SASA for each residue in atom array
res_sasa = struc.apply_residue_wise(array, atom_sasa, np.sum)
# Again plotting stuff
plt.plot(np.arange(1,21), res_sasa)
plt.xlim(0,21)
plt.xticks(np.arange(1,21))
plt.xlabel("Residue")
plt.ylabel("SASA")

Text(38.097222222222214, 0.5, 'SASA')
Secondary structure determination¶
Biotite can also be used to assign
secondary structure elements (SSE) to a structure with the
annotate_sse()
function.
An 'a'
means alpha-helix, 'b'
beta-sheet, and 'c'
means
coil.
array = strucio.load_structure(file_path)[0]
# Estimate secondary structure
sse = struc.annotate_sse(array, chain_id="A")
# Pretty print
print("".join(sse))
caaaaaaaaccccccccccc
Outsourcing - The Application subpackage¶
Although you can achieve a lot with Biotite, there are still a lot of
things which are not implemented in this Python package.
But wait, this is what the biotite.application
package is for:
It contains interfaces for popular external software.
This ranges from locally installed software to external tools running on
web servers.
The usage of these interfaces is seamless: Rather than writing input
files and reading output files, you simply put in your Python
objects (e.g. instances of Sequence
or AtomArray
) and
the interface returns Python objects (e.g. an Alignment
object).
Note
Note that in order to use an interface in
biotite.application
the corresponding software must be
installed or the web server must be reachable, respectively.
These programs are not shipped with the Biotite package.
The base class for all interfaces is the Application
class.
Each Application
instance has a life cycle, starting with its
creation and ending with the result extraction.
Each state in this life cycle is described by the value of the
enum AppState
, that each Application
contains:
Directly after its instantiation the app is in the CREATED
state.
In this state further parameters can be set for the application run.
After the user calls the Application.start()
method, the app
state is set to RUNNING
and the app performs the calculations.
When the application finishes the AppState
changes to FINISHED
.
The user can now call the Application.join()
method, concluding
the application in the JOINED
state and making the results of the
application accessible.
Furthermore, this may trigger cleanup actions in some applications.
Application.join()
can even be called in the RUNNING
state:
This will constantly check if the application has finished and will
directly go into the JOINED
state as soon as the application reaches
the FINISHED
state.
Calling the Application.cancel()
method while the application is
RUNNING
or FINISHED
leaves the application in the CANCELLED
state.
This triggers cleanup, too, but there are no accessible results.
If a method is called in an unsuitable app state, an
AppStateError
is called.
At each state in the life cycle, Application
type specific
methods are called, as shown in the following diagram.
The following sample code shows how an Application
is generally
executed. Pay attention to the space between the
Application.start()
and Application.join()
method:
Since these are separate functions, you can do some other stuff, while
the Application
runs in the background.
Thus, an Application
behaves effectively like an additional
thread.
from biotite.application import Application
# Create a dummy Application subclass
class MyApplication(Application):
def __init__(self, param): super().__init__()
def run(self): pass
def is_finished(self): return True
def wait_interval(self): return 0.1
def evaluate(self): pass
def clean_up(self): pass
def set_some_other_input(self, param): pass
def get_some_data(self): return "some data"
param = "foo"
param2 = "bar"
app = MyApplication(param)
app.set_some_other_input(param2)
app.start()
# Time to do other stuff
app.join()
results = app.get_some_data()
The following subsections will dive into the available
Application
classes in depth.
Finding homologous sequences with BLAST¶
the biotite.application.blast
subpackage provides an
interface to NCBI BLAST: the BlastWebApp
class.
Let’s dive directly into the code, we try to find
homologous sequences to the miniprotein TC5b:
import biotite.application.blast as blast
import biotite.sequence as seq
tc5b_seq = seq.ProteinSequence("NLYIQWLKDGGPSSGRPPPS")
app = blast.BlastWebApp("blastp", tc5b_seq)
app.start()
app.join()
alignments = app.get_alignments()
best_ali = alignments[0]
print(best_ali)
print()
print("HSP position in query: ", best_ali.query_interval)
print("HSP position in hit: ", best_ali.hit_interval)
print("Score: ", best_ali.score)
print("E-value: ", best_ali.e_value)
print("Hit UID: ", best_ali.hit_id)
print("Hit name: ", best_ali.hit_definition)
NLYIQWLKDGGPSSGRPPPS
NLYIQWLKDGGPSSGRPPPS
HSP position in query: (1, 20)
HSP position in hit: (1, 20)
Score: 101
E-value: 0.000125394
Hit UID: 1L2Y_A
Hit name: NMR Structure of Trp-Cage Miniprotein Construct TC5b [synthetic construct]
This was too simple for BLAST:
As best hit it just found the query sequence itself in the PDB.
However, it gives a good impression about how this
Application
works.
Besides some optional parameters, the BlastWebApp
requires
the BLAST program and the query sequence. After the app has finished,
you get a list of alignments with descending score.
An alignment is an instance of BlastAlignment
, a subclass of
biotite.sequence.align.Alignment
.
It contains some additional information as shown above.
The hit UID can be used to obtain the complete hit sequence via
biotite.database.entrez
.
The next alignment should be a bit more challenging. We take a random part of the E. coli BL21 genome and distort it a little bit. Since we still expect a high similarity to the original sequence, we decrease the E-value threshold.
import biotite.application.blast as blast
import biotite.sequence as seq
distorted_bl21_excerpt = seq.NucleotideSequence(
"CGGAAGCGCTCGGTCTCCTGGCCTTATCAGCCACTGCGCGACGATATGCTCGTCCGTTTCGAAGA"
)
app = blast.BlastWebApp("blastn", distorted_bl21_excerpt, obey_rules=False)
app.set_max_expect_value(0.1)
app.start()
app.join()
alignments = app.get_alignments()
best_ali = alignments[0]
print(best_ali)
print()
print("HSP position in query: ", best_ali.query_interval)
print("HSP position in hit: ", best_ali.hit_interval)
print("Score: ", best_ali.score)
print("E-value: ", best_ali.e_value)
print("Hit UID: ", best_ali.hit_id)
print("Hit name: ", best_ali.hit_definition)
CGGAAGCGCTCGGTCTCCTGGCC---TTATCAGCCACTGCGCGACGATATGCTCGTCCGTTTCGAAGA
CGGAAGCGCT-GGTC-CCTGCCCGCTTTATCAGGGAATGCGCGACGGCAAAATCGTCCGTTTCGAAGA
HSP position in query: (1, 65)
HSP position in hit: (4656858, 4656923)
Score: 56
E-value: 0.00851799
Hit UID: CP026845
Hit name: Shigella boydii strain NCTC 9733 chromosome, complete genome
In this snippet we have set obey_rules
to false in the
BlastWebApp
constructor, if we omitted this parameter and
we started the last two code snippets in quick succession, a
RuleViolationError
would be raised.
This is because normally the BlastWebApp
respects NCBI’s code
of conduct and prevents you from submitting two queries within one
minute. If you want to be rude to the NCBI server, create the
instance with obey_rules
set to false.
Multiple sequence alignments¶
For multiple sequence alignments (MSAs) biotite.application
offers several interfaces to MSA software.
For our example we choose the software MUSCLE:
The subpackage biotite.application.muscle
contains the class
MuscleApp
that does the job.
You simply input the sequences you want to have aligned, run the
application and get the resulting Alignment
object.
import biotite.application.muscle as muscle
import biotite.sequence as seq
seq1 = seq.ProteinSequence("BIQTITE")
seq2 = seq.ProteinSequence("TITANITE")
seq3 = seq.ProteinSequence("BISMITE")
seq4 = seq.ProteinSequence("IQLITE")
app = muscle.MuscleApp([seq1, seq2, seq3, seq4])
app.start()
app.join()
alignment = app.get_alignment()
print(alignment)
BIQT-ITE
TITANITE
BISM-ITE
-IQL-ITE
In most MSA software even more information than the mere alignment can be extracted. For instance the guide tree, that was used for the alignment, can be obtained from the MUSCLE output.
import matplotlib.pyplot as plt
import biotite.sequence.graphics as graphics
tree = app.get_guide_tree()
fig, ax = plt.subplots()
graphics.plot_dendrogram(
ax, tree, labels=[str(sequence) for sequence in [seq1, seq2, seq3, seq4]]
)
ax.set_xlabel("Distance")
fig.tight_layout()

For the lazy people there is also a convenience method,
that handles the Application
execution internally.
However, this shortcut returns only the Alignment
.
alignment = muscle.MuscleApp.align([seq1, seq2, seq3, seq4])
The alternatives to MUSCLE are Clustal-Omega and MAFFT.
To use them, simply replace MuscleApp
with
ClustalOmegaApp
or MafftApp
and you are done.
import biotite.application.clustalo as clustalo
alignment = clustalo.ClustalOmegaApp.align([seq1, seq2, seq3, seq4])
print(alignment)
-BIQTITE
TITANITE
-BISMITE
--IQLITE
As shown in the output, the alignment with Clustal-Omega slightly differs from the one performed with MUSCLE.
If the MSA software supports protein sequence alignment AND custom substitution matrices, e.g. MUSCLE and MAFFT, almost any type of sequence can be aligned: Internally the sequences and the matrix are converted into protein sequences/matrix. Then the masquerading sequences are aligned via the software and finally the sequences are mapped back into the original sequence type. Let’s show this on the example of a nonsense alphabet.
import numpy as np
import biotite.application.mafft as mafft
import biotite.sequence.align as align
alphabet = seq.Alphabet(("foo", "bar", 42))
sequences = [seq.GeneralSequence(alphabet, sequence) for sequence in [
["foo", "bar", 42, "foo", "foo", 42, 42],
["foo", 42, "foo", "bar", "foo", 42, 42],
]]
matrix = align.SubstitutionMatrix(
alphabet, alphabet, np.array([
[ 100, -100, -100],
[-100, 100, -100],
[-100, -100, 100]
])
)
alignment = mafft.MafftApp.align(sequences, matrix=matrix)
# As the alphabet do not has characters as symbols
# the alignment cannot be directly printed
# However, we can print the trace
print(alignment.trace)
[[ 0 0]
[ 1 -1]
[ 2 1]
[ 3 2]
[-1 3]
[ 4 4]
[ 5 5]
[ 6 6]]
Secondary structure annotation¶
Althogh biotite.structure
offers the function
annotate_sse()
to assign secondary structure elements based on
the P-SEA algorithm, DSSP can also be used via the
biotite.application.dssp
subpackage (provided that DSSP is
installed).
Let us demonstrate this on the example of the good old miniprotein
TC5b.
from tempfile import gettempdir
import biotite.database.rcsb as rcsb
import biotite.application.dssp as dssp
import biotite.structure.io as strucio
file_path = rcsb.fetch("1l2y", "mmtf", gettempdir())
stack = strucio.load_structure(file_path)
array = stack[0]
app = dssp.DsspApp(array)
app.start()
app.join()
sse = app.get_sse()
print(sse)
['C' 'H' 'H' 'H' 'H' 'H' 'H' 'H' 'T' 'T' 'G' 'G' 'G' 'G' 'T' 'C' 'C' 'C'
'C' 'C']
Similar to the MSA examples, DsspApp
has the convenience
method DsspApp.annotate_sse()
as shortcut.