# biotite.sequence

A subpackage for handling sequences.

A `Sequence` can be seen as a succession of symbols. The set of symbols that can occur in a sequence is defined by an `Alphabet`. For example, an unambiguous DNA sequence has an `Alphabet` that includes the four letters (strings) `'A'`, `'C'`, `'G'` and `'T'`. An `Alphabet` is not limited to letters, though: it can contain any immutable and hashable Python object, such as an `int` or a `tuple`. If a `Sequence` is created with at least one symbol that is not in the given `Alphabet`, an `AlphabetError` is raised.

Internally, a `Sequence` is stored as a NumPy `ndarray` of integer values, where each integer represents a symbol in the `Alphabet`. For example, `'A'`, `'C'`, `'G'` and `'T'` would be encoded as 0, 1, 2 and 3, respectively. Each of these integer values is called a symbol code; the encoding of an entire sequence of symbols is called the sequence code.
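The relationship between symbols, symbol codes and the sequence code can be illustrated with a minimal, standard-library-only sketch. `ToyAlphabet` is a hypothetical stand-in written for this illustration; it is not biotite's `Alphabet` class, it uses plain lists instead of an `ndarray`, and it raises `ValueError` where biotite would raise `AlphabetError`.

```python
class ToyAlphabet:
    """Hypothetical stand-in for an alphabet (NOT biotite's Alphabet)."""

    def __init__(self, symbols):
        self._symbols = list(symbols)
        # The symbol code of a symbol is its position in the alphabet.
        # Symbols are dictionary keys, so they must be hashable.
        self._codes = {symbol: code for code, symbol in enumerate(self._symbols)}

    def encode(self, sequence):
        """Convert a succession of symbols into its sequence code."""
        try:
            return [self._codes[symbol] for symbol in sequence]
        except KeyError as err:
            # Assumption: biotite raises AlphabetError in this situation
            raise ValueError(f"Symbol {err.args[0]!r} is not in the alphabet") from None

    def decode(self, sequence_code):
        """Convert a sequence code back into symbols."""
        return [self._symbols[code] for code in sequence_code]


dna = ToyAlphabet(["A", "C", "G", "T"])
sequence_code = dna.encode("GATTACA")
print(sequence_code)                        # [2, 0, 3, 3, 0, 1, 0]
print("".join(dna.decode(sequence_code)))   # GATTACA
```

Because a symbol code is simply a position in the alphabet, decoding is a plain index lookup.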

The size of the symbol code type in the array is determined by the size of the `Alphabet`: if the `Alphabet` contains 256 symbols or fewer, one byte is used per array element; between 257 and 65536 symbols, two bytes are used; and so on.
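The size rule can be written down as a small function. This is a sketch only: `symbol_code_bytes` is a hypothetical helper, not part of biotite, and it assumes that "and so on" means the width doubles each time (1, 2, 4, 8 bytes), as with NumPy's unsigned integer types.

```python
def symbol_code_bytes(alphabet_size):
    """Return the assumed number of bytes per symbol code.

    Hypothetical helper: picks the smallest width from 1, 2, 4, 8, ...
    bytes whose unsigned value range covers all symbol codes.
    """
    if alphabet_size < 1:
        raise ValueError("An alphabet needs at least one symbol")
    n_bytes = 1
    # 256 ** n_bytes is the number of distinct values in n_bytes bytes
    while alphabet_size > 256 ** n_bytes:
        n_bytes *= 2
    return n_bytes


for size in (4, 256, 257, 65536, 65537):
    print(size, symbol_code_bytes(size))
# 4 1
# 256 1
# 257 2
# 65536 2
# 65537 4
```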

Representing sequences as symbol codes has several advantages:

• A wider spectrum of objects can be represented by `Sequence` objects

• Efficient memory usage and faster calculations due to the alphabet-tailored size of the symbol code type

• C-accelerated operations due to the use of `ndarray` objects

• Most functions applied to `Sequence` objects work independently of the actual type of sequence

• Symbol codes can be used directly as indices into substitution matrices in alignments

The abstract `Sequence` superclass cannot be instantiated directly, as it does not define an `Alphabet` by itself. Instead, the concrete subclasses `NucleotideSequence` (for DNA and RNA sequences) and `ProteinSequence` (for amino acid sequences) are usually used. These classes have predefined alphabets and provide additional methods specific to their sequence type. The class `GeneralSequence` allows the use of a custom `Alphabet` without the need to subclass `Sequence`.

Additionally, this subpackage provides support for sequence features, as used, for example, in GenBank or GFF files. A `Feature` stores its key name, its qualifiers and its locations. An `Annotation` is a group of multiple `Feature` objects and offers convenient location-based indexing. An `AnnotatedSequence` combines an `Annotation` and a `Sequence`.

## Sequence types

| Class | Description |
| --- | --- |
| `Sequence` | The abstract base class for all sequence types. |
| `NucleotideSequence` | Representation of a nucleotide sequence (DNA or RNA). |
| `ProteinSequence` | Representation of a protein sequence. |
| `GeneralSequence` | This class allows the creation of a sequence with custom `Alphabet` without the need to subclass `Sequence`. |

## Alphabets

| Name | Description |
| --- | --- |
| `Alphabet` | This class defines the allowed symbols for a `Sequence` and handles the encoding/decoding between symbols and symbol codes. |
| `LetterAlphabet` | `LetterAlphabet` is an `Alphabet` subclass specialized for letter-based alphabets, like DNA or protein sequence alphabets. |
| `AlphabetMapper` | This class is used for symbol code conversion from a source alphabet into a target alphabet. |
| `AlphabetError` | This exception is raised when a code or a symbol is not in an `Alphabet`. |
| `common_alphabet` | Determine the alphabet from a list of alphabets that extends all alphabets. |

## Sequence features

| Class | Description |
| --- | --- |
| `Feature` | This class represents a single sequence feature, for example from a GenBank feature table. |
| `Location` | A `Location` defines at which base(s)/residue(s) a feature is located. |
| `Annotation` | An `Annotation` is a set of features belonging to one sequence. |
| `AnnotatedSequence` | An `AnnotatedSequence` is a combination of a `Sequence` and an `Annotation`. |

## Miscellaneous

| Class | Description |
| --- | --- |
| `CodonTable` | A `CodonTable` maps a codon (sequence of 3 nucleotides) to an amino acid. |
| `SequenceProfile` | A `SequenceProfile` object stores information about a sequence profile of aligned sequences. |

## Subpackages

| Subpackage | Description |
| --- | --- |
| `biotite.sequence.phylo` | This subpackage provides functions and data structures for creating (phylogenetic) trees. |
| `biotite.sequence.io` | A subpackage for reading and writing sequence related data. |
| `biotite.sequence.io.fasta` | This subpackage is used for reading and writing sequence objects using the popular FASTA format. |
| `biotite.sequence.io.fastq` | This subpackage is used for reading and writing sequencing data using the popular FASTQ format. |
| `biotite.sequence.io.gff` | This subpackage is used for reading and writing sequence features in the Generic Feature Format 3 (GFF3). |
| `biotite.sequence.io.genbank` | This subpackage is used for reading/writing information (especially sequence features) from/to files in the GenBank and GenPept format. |
| `biotite.sequence.align` | This subpackage provides functionality for sequence alignments. |
| `biotite.sequence.graphics` | A subpackage for visualization of sequence related objects. |