biotite.sequence#
A subpackage for handling sequences.
A Sequence can be seen as a succession of symbols.
The set of symbols that can occur in a sequence is defined by an Alphabet.
For example, an unambiguous DNA sequence has an Alphabet that includes the 4 letters (strings) 'A', 'C', 'G' and 'T'.
Furthermore, an Alphabet can also contain any immutable and hashable Python object, like int or tuple.
If a Sequence is created with at least one symbol that is not in the given Alphabet, an AlphabetError is raised.
Internally, a Sequence is saved as a NumPy ndarray of integer values, where each integer represents a symbol in the Alphabet.
For example, 'A', 'C', 'G' and 'T' would be encoded into 0, 1, 2 and 3, respectively.
These integer values are called symbol codes; the encoding of an entire sequence of symbols is called the sequence code.
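The encoding described above can be illustrated with a minimal pure-Python sketch. The class `ToyAlphabet` and its method names are hypothetical and exist only for this illustration; it is not Biotite's implementation, and it raises a plain `ValueError` where the library would raise its own AlphabetError.

```python
class ToyAlphabet:
    """Hypothetical stand-in for an alphabet (illustration only)."""

    def __init__(self, symbols):
        self._symbols = tuple(symbols)
        # Symbols must be hashable, as they serve as dictionary keys here
        self._code_of = {
            symbol: code for code, symbol in enumerate(self._symbols)
        }

    def encode(self, sequence):
        """Convert a succession of symbols into its sequence code."""
        try:
            return [self._code_of[symbol] for symbol in sequence]
        except KeyError as err:
            # Assumption: a built-in exception stands in for AlphabetError
            raise ValueError(
                f"Symbol {err.args[0]!r} is not in the alphabet"
            ) from None

    def decode(self, codes):
        """Convert symbol codes back into symbols."""
        return [self._symbols[code] for code in codes]


dna = ToyAlphabet(["A", "C", "G", "T"])
sequence_code = dna.encode("GATTACA")
print(sequence_code)                        # [2, 0, 3, 3, 0, 1, 0]
print("".join(dna.decode(sequence_code)))   # GATTACA
```

Encoding a symbol outside the alphabet, for instance `dna.encode("ACGN")`, raises the error in this sketch, mirroring the behavior described for a Sequence.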

Figure taken from Kunzmann & Hamacher 2018, licensed under CC BY 4.0.
The size of the symbol code type in the array is determined by the size of the Alphabet:
If the Alphabet contains 256 symbols or fewer, one byte is used per array element; between 257 and 65536 symbols, two bytes are used, and so on.
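As a rough sketch of this rule, the following function picks the number of bytes per symbol code from the alphabet size. The function name `symbol_code_bytes` is hypothetical, and the continuation to 4 and 8 bytes is an assumption that mirrors the widths of NumPy's unsigned integer types; the text above only states "and so on".

```python
def symbol_code_bytes(alphabet_size):
    """Return the smallest byte width able to hold all symbol codes.

    Assumption: widths double (1, 2, 4, 8 bytes), as for NumPy's
    unsigned integer types.
    """
    for n_bytes in (1, 2, 4, 8):
        # n_bytes bytes can represent 2**(8 * n_bytes) distinct codes
        if alphabet_size <= 2 ** (8 * n_bytes):
            return n_bytes
    raise ValueError("Alphabet is too large for a 64-bit symbol code")


for size in (4, 256, 257, 65536, 65537):
    print(size, symbol_code_bytes(size))
# 4 1
# 256 1
# 257 2
# 65536 2
# 65537 4
```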
This approach has multiple advantages:
- Wider spectrum of what kind of objects can be represented by Sequence objects
- Efficient memory usage and faster calculations due to alphabet-tailored symbol code type size
- C-acceleration due to usage of ndarray objects
- Most functions applied on Sequence objects are indifferent to the actual type of sequence.
- Symbol codes are directly indices for substitution matrices in alignments
- k-mers can be computed fast
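Two of these points can be made concrete with a small pure-Python sketch. All names here (`MATRIX`, `score_ungapped`, `kmer_codes`) and the match/mismatch scores are hypothetical. The k-mer numbering, which reads a window of symbol codes as a number in base alphabet size, is an assumption chosen for illustration and not necessarily the scheme Biotite uses.

```python
ALPHABET = ["A", "C", "G", "T"]
CODE_OF = {symbol: code for code, symbol in enumerate(ALPHABET)}

# Hypothetical substitution matrix: +1 for a match, -1 for a mismatch
MATRIX = [
    [1 if i == j else -1 for j in range(len(ALPHABET))]
    for i in range(len(ALPHABET))
]


def encode(symbols):
    """Convert symbols into symbol codes."""
    return [CODE_OF[symbol] for symbol in symbols]


def score_ungapped(code_a, code_b):
    """Sum matrix scores; the symbol codes are used directly as indices."""
    return sum(MATRIX[a][b] for a, b in zip(code_a, code_b))


def kmer_codes(sequence_code, k, alphabet_size):
    """Read each window of k symbol codes as a base-alphabet_size number."""
    codes = []
    for start in range(len(sequence_code) - k + 1):
        kmer_code = 0
        for symbol_code in sequence_code[start:start + k]:
            kmer_code = kmer_code * alphabet_size + symbol_code
        codes.append(kmer_code)
    return codes


print(score_ungapped(encode("ACGT"), encode("ACTT")))   # 2
print(kmer_codes(encode("ACGT"), 2, len(ALPHABET)))     # [1, 6, 11]
```

Because no symbol-to-index lookup is needed at scoring time, the same functions work for any alphabet once the sequences are encoded.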
The abstract Sequence superclass cannot be instantiated directly, as it does not define an Alphabet by itself.
Instead, the concrete subclasses NucleotideSequence (for DNA and RNA sequences) and ProteinSequence (for amino acid sequences) are usually used.
These classes have defined alphabets and provide additional sequence type specific methods.
The class GeneralSequence allows the usage of a custom Alphabet without the need to subclass Sequence.
Additionally, this subpackage provides support for sequence features, as used, for example, in GenBank or GFF files.
A Feature stores its key name, its qualifiers and locations.
An Annotation is a group of multiple Feature objects and offers convenient location-based indexing.
An AnnotatedSequence combines an Annotation and a Sequence.
Sequence profiles can be created with the SequenceProfile class.
Sequence types#
Alphabets#
This class defines the allowed symbols for a Sequence. |
|
|
|
This class is used for symbol code conversion from a source alphabet into a target alphabet. |
|
This exception is raised when a code or a symbol is not in an Alphabet. |
|
Determine the alphabet from a list of alphabets, that extends all alphabets. |
Sequence features#
This class represents a single sequence feature, for example from a GenBank feature table. |
|
A |
|
An |
|
An |
Sequence search#
Find a subsequence in a sequence. |
|
Find a symbol in a sequence. |
|
Find first occurrence of a symbol in a sequence. |
|
Find last occurrence of a symbol in a sequence. |
Miscellaneous#
A |
|
A sequence where each symbol is associated with a position. |
|
An object of this class is a 'placeholder' sequence, where each symbol is the position in the sequence itself. |
|
A |
Subpackages#
A subpackage for reading and writing sequence related data. |
|
This subpackage is used for reading and writing sequencing data using the popular FASTQ format. |
|
This subpackage is used for reading/writing information (especially sequence features) from/to files in the GenBank and GenPept format. |
|
This subpackage is used for reading and writing sequence objects using the popular FASTA format. |
|
This subpackage is used for reading and writing sequence features in the Generic Feature Format 3 (GFF3). |
|
This subpackage provides functionality for sequence alignments. |
|
A subpackage for visualization of sequence related objects. |
|
This subpackage provides functions and data structures for creating (phylogenetic) trees. |