biotite.sequence

A subpackage for handling sequences.

A Sequence can be seen as a succession of symbols. The set of symbols that can occur in a sequence is defined by an Alphabet. For example, an unambiguous DNA sequence has an Alphabet that includes the 4 letters (strings) ‘A’, ‘C’, ‘G’ and ‘T’. An Alphabet is not limited to letters, though: it can contain any arbitrary Python object. If a Sequence is created with at least one symbol that is not in the given Alphabet, an AlphabetError is raised.

Internally, a Sequence is saved as a NumPy ndarray of integer values, where each integer represents a symbol in the Alphabet. For example, ‘A’, ‘C’, ‘G’ and ‘T’ would be encoded into 0, 1, 2 and 3. Each of these integer values is called a symbol code; the encoding of an entire sequence of symbols is called the sequence code.
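The encoding described above can be sketched in plain Python. This is a minimal illustration, not Biotite's implementation: the names ToyAlphabet and ToyAlphabetError are hypothetical stand-ins, a list takes the place of the NumPy ndarray, and the dict lookup assumes hashable symbols.

```python
class ToyAlphabetError(Exception):
    """Raised when a symbol is not part of the alphabet (hypothetical stand-in)."""


class ToyAlphabet:
    """Illustrative alphabet: a symbol's position in the list is its symbol code."""

    def __init__(self, symbols):
        self._symbols = list(symbols)
        # Assumption of this sketch: symbols are hashable, so a dict can map
        # each symbol to its code.
        self._codes = {symbol: code for code, symbol in enumerate(self._symbols)}

    def encode(self, symbol):
        """Return the symbol code of a single symbol."""
        try:
            return self._codes[symbol]
        except KeyError:
            raise ToyAlphabetError(f"{symbol!r} is not in the alphabet") from None

    def decode(self, code):
        """Return the symbol that belongs to a symbol code."""
        return self._symbols[code]

    def encode_multiple(self, symbols):
        """Return the sequence code for a succession of symbols."""
        return [self.encode(symbol) for symbol in symbols]


dna_alphabet = ToyAlphabet(["A", "C", "G", "T"])
sequence_code = dna_alphabet.encode_multiple("GATTACA")
print(sequence_code)  # [2, 0, 3, 3, 0, 1, 0]
```

Encoding a symbol outside the alphabet, such as ‘N’ here, raises the error class of the sketch, mirroring the AlphabetError behaviour described above.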

The size of the symbol code type in the array is determined by the size of the Alphabet: if the Alphabet contains 256 symbols or fewer, one byte is used per array element; if the Alphabet contains between 257 and 65536 symbols, two bytes are used, and so on.
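The size rule can be written down as a small helper. This is only a sketch: the function name bytes_per_symbol_code is hypothetical, and the doubling of widths beyond two bytes (1, 2, 4, 8, ...) is an assumption based on the common unsigned integer sizes, since the text only says "and so on".

```python
def bytes_per_symbol_code(alphabet_size):
    """Return the number of bytes needed to store one symbol code.

    Assumption: widths double (1, 2, 4, 8, ...), as with the usual
    unsigned integer types.
    """
    if alphabet_size < 1:
        raise ValueError("An alphabet needs at least one symbol")
    n_bytes = 1
    # n_bytes bytes can distinguish 256 ** n_bytes different symbol codes
    while alphabet_size > 256 ** n_bytes:
        n_bytes *= 2
    return n_bytes


for size in (4, 256, 257, 65536, 65537):
    print(size, bytes_per_symbol_code(size))
```

For the DNA alphabet with its 4 symbols, a single byte per symbol is sufficient.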

This approach has multiple advantages:

  • A wider range of objects can be represented by Sequence objects

  • Efficient memory usage and faster calculations due to alphabet-tailored symbol code type size

  • Partial C-acceleration due to usage of ndarrays

  • Most functions applied to Sequence objects are indifferent to the actual type of sequence

  • Symbol codes are directly indices for substitution matrices in alignments
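The last point can be illustrated with a toy scoring scheme. The sketch below is not Biotite's alignment code: the match/mismatch scores are made up, the function name score_ungapped is hypothetical, and nested lists stand in for a NumPy matrix. It shows that no symbol lookup is needed during scoring, because the symbol codes themselves select the row and column.

```python
alphabet = ["A", "C", "G", "T"]
code_of = {symbol: code for code, symbol in enumerate(alphabet)}

# Hypothetical substitution matrix: +1 for a match, -1 for a mismatch.
# Row and column indices are symbol codes.
matrix = [
    [1 if i == j else -1 for j in range(len(alphabet))]
    for i in range(len(alphabet))
]


def score_ungapped(code1, code2):
    """Sum the substitution scores of two equally long sequence codes."""
    if len(code1) != len(code2):
        raise ValueError("Sequence codes must have equal length")
    return sum(matrix[a][b] for a, b in zip(code1, code2))


seq_code1 = [code_of[s] for s in "ACGT"]
seq_code2 = [code_of[s] for s in "ACCT"]
print(score_ungapped(seq_code1, seq_code2))  # 2 (three matches, one mismatch)
```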

Besides the Sequence superclass, this subpackage contains the classes NucleotideSequence and ProteinSequence, which cover the most important biological sequence types. The class GeneralSequence allows the usage of a custom Alphabet without the need to subclass Sequence.

Additionally, this subpackage provides support for sequence features, as used for example in GenBank files. A Feature stores its class name, its qualifiers and its locations. An Annotation is a group of multiple Feature objects and offers convenient location-based indexing. An AnnotatedSequence combines an Annotation and a Sequence.

Sequence types

Sequence

The abstract base class for all sequence types.

NucleotideSequence

Representation of a nucleotide sequence (DNA or RNA).

ProteinSequence

Representation of a protein sequence.

GeneralSequence

This class allows the creation of a sequence with custom Alphabet without the need to subclass Sequence.

Alphabets

Alphabet

This class defines the allowed symbols for a Sequence and handles the encoding/decoding between symbols and symbol codes.

LetterAlphabet

LetterAlphabet is an Alphabet subclass specialized for letter-based alphabets, like DNA or protein sequence alphabets.

AlphabetMapper

This class is used for symbol code conversion from a source alphabet into a target alphabet.

AlphabetError

This exception is raised when a code or a symbol is not in an Alphabet.

Sequence features

Feature

This class represents a single sequence feature, for example from a GenBank feature table.

Location

A Location defines at which base(s)/residue(s) a feature is located.

Annotation

An Annotation is a set of features belonging to one sequence.

AnnotatedSequence

An AnnotatedSequence is a combination of a Sequence and an Annotation.

Miscellaneous

CodonTable

A CodonTable maps a codon (sequence of 3 nucleotides) to an amino acid.
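The idea of such a mapping can be sketched with a plain dictionary. This is not the CodonTable API: TOY_CODON_TABLE and translate are hypothetical names, and the table is a deliberately incomplete excerpt of the standard genetic code.

```python
# Incomplete excerpt of the standard genetic code; '*' marks a stop codon
TOY_CODON_TABLE = {
    "ATG": "M",  # methionine (start codon)
    "AAA": "K",  # lysine
    "GGC": "G",  # glycine
    "TGG": "W",  # tryptophan
    "TAA": "*",  # stop
}


def translate(dna):
    """Translate a DNA string codon by codon using the toy table."""
    if len(dna) % 3 != 0:
        raise ValueError("Sequence length must be a multiple of 3")
    return "".join(
        TOY_CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3)
    )


protein = translate("ATGAAAGGCTGGTAA")
print(protein)  # MKGW*
```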

Subpackages

biotite.sequence.io

A subpackage for reading and writing sequence related data.

biotite.sequence.io.gff

This subpackage is used for reading and writing sequence features in the Generic Feature Format 3 (GFF3).

biotite.sequence.io.fasta

This subpackage is used for reading and writing sequence objects using the popular FASTA format.

biotite.sequence.io.genbank

This subpackage is used for reading/writing information (especially sequence features) from/to files in the GenBank and GenPept format.

biotite.sequence.io.fastq

This subpackage is used for reading and writing sequencing data using the popular FASTQ format.

biotite.sequence.align

This subpackage provides functionality for sequence alignments.

biotite.sequence.phylo

biotite.sequence.graphics

A subpackage for visualization of sequence related objects.