Excursion: Symbol encoding

As you have seen in the previous chapter, Sequence objects may support a wide variety of Python objects as symbols. To still ensure NumPy-level performance for functions acting upon a Sequence, an underlying Alphabet encodes each symbol into an integer, the so-called code.

Symbol encoding in Biotite. Taken from Kunzmann & Hamacher 2018, licensed under CC BY 4.0.

In short, the Alphabet maps a symbol to the index of that symbol in the alphabet. Encoding and decoding are done by the Alphabet.encode() and Alphabet.decode() methods, respectively.

import biotite.sequence as seq

alph = seq.NucleotideSequence.unambiguous_alphabet()
print("Allowed symbols:", alph.get_symbols())
print("G is encoded to", alph.encode("G"))
print("2 is decoded to", alph.decode(2))

Allowed symbols: ('A', 'C', 'G', 'T')
G is encoded to 2
2 is decoded to G
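To make the index mapping concrete, here is a minimal pure-Python sketch of the idea. The class ToyAlphabet is hypothetical and exists only for illustration; it is not how Biotite implements Alphabet, and it ignores error handling for unknown symbols.

```python
# Toy illustration only: ToyAlphabet is a hypothetical class,
# not Biotite's Alphabet implementation.
class ToyAlphabet:
    def __init__(self, symbols):
        self._symbols = tuple(symbols)
        # Map each symbol to its position (index) in the alphabet
        self._codes = {symbol: i for i, symbol in enumerate(self._symbols)}

    def encode(self, symbol):
        # Symbol -> integer code
        return self._codes[symbol]

    def decode(self, code):
        # Integer code -> symbol
        return self._symbols[code]


toy_alph = ToyAlphabet(["A", "C", "G", "T"])
print("G is encoded to", toy_alph.encode("G"))
print("2 is decoded to", toy_alph.decode(2))
```

Because symbols only need to be usable as dictionary keys in this sketch, any hashable Python object could serve as a symbol, not just single letters.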

A sequence does not actually store the symbols themselves, but only the code in a NumPy array. The code is decoded into symbols only when required, for example when the sequence is converted into a string.

dna = seq.NucleotideSequence("AACTGCTA")
print("Actually stored:", dna.code)
print("Calculated on-the-fly:", dna.symbols)

Actually stored: [0 0 1 3 2 1 3 0]
Calculated on-the-fly: ['A' 'A' 'C' 'T' 'G' 'C' 'T' 'A']
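The on-the-fly decoding can be pictured as a simple array lookup. The following sketch uses plain NumPy with hand-written arrays that mirror the example above; it only illustrates the principle and is not necessarily how Biotite decodes internally.

```python
import numpy as np

# Assumed stand-ins for the alphabet and the stored code from the example above
symbols = np.array(["A", "C", "G", "T"])
code = np.array([0, 0, 1, 3, 2, 1, 3, 0], dtype=np.uint8)

# Decoding is a lookup: each code is used as an index into the symbol array
decoded = symbols[code]
# Joining the decoded symbols gives the string representation
text = "".join(decoded)
print(decoded)
print(text)
```

Storing small integers instead of Python objects keeps the array compact and lets the lookup run as a single vectorized operation.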

Because most functions throughout biotite.sequence operate directly on the code, they usually work with any type of sequence.

Most users will never need to work with the code directly. However, if you want to implement a new function, the recommended approach is to use the code, as this ensures compatibility with all types of sequences and enables harnessing the power of NumPy.
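As a sketch of what such a function might look like, the following example counts how often each symbol occurs by operating on the code alone. The function name count_codes and its parameters are hypothetical, and the code array is written out by hand so that the example runs with NumPy only.

```python
import numpy as np


def count_codes(code, alphabet_size):
    """Count the occurrences of each symbol code in a code array.

    Hypothetical helper for illustration: it never looks at the symbols,
    so it is independent of the kind of sequence the code came from.
    """
    # minlength ensures one count per alphabet symbol, even for absent symbols
    return np.bincount(code, minlength=alphabet_size)


# Code of the sequence "AACTGCTA" in the alphabet ('A', 'C', 'G', 'T')
dna_code = np.array([0, 0, 1, 3, 2, 1, 3, 0], dtype=np.uint8)
counts = count_codes(dna_code, alphabet_size=4)
print(counts)
```

With a Biotite sequence, one would presumably pass the sequence's code attribute and the length of its alphabet. The same function would then apply unchanged to other sequence types, since only the integer codes and the alphabet size matter.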