biotite.sequence.Alphabet

class biotite.sequence.Alphabet(symbols)

Bases: object

This class defines the allowed symbols for a Sequence and handles the encoding/decoding between symbols and symbol codes.

An Alphabet is created from the list of symbols that are allowed in this context. In most cases a symbol is simply a letter, i.e. a string of length 1, but in principle any hashable Python object can serve as a symbol.

A symbol is encoded into a symbol code as follows: the first index in the symbol list whose element equals the symbol is the symbol code. If the symbol is not found in the list, an AlphabetError is raised.

Internally, a dictionary is used for encoding, with symbols as keys and symbol codes as values. Therefore, every symbol must be hashable. For decoding, the symbol list is indexed with the symbol code.
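
The encoding and decoding mechanism described above can be illustrated with a minimal pure-Python sketch. This is not the library's implementation: the class name `SketchAlphabet` is hypothetical, and unknown symbols or codes raise the built-in `KeyError` and `IndexError` here rather than AlphabetError.

```python
class SketchAlphabet:
    """Hypothetical stand-in illustrating dictionary-based encoding."""

    def __init__(self, symbols):
        # Keep an ordered copy of the symbols for decoding
        self._symbols = list(symbols)
        # Map each symbol to its code for encoding
        self._codes = {}
        for code, symbol in enumerate(self._symbols):
            # setdefault keeps the first index if a symbol occurs twice
            self._codes.setdefault(symbol, code)

    def encode(self, symbol):
        # Dictionary lookup: raises KeyError for unknown symbols
        return self._codes[symbol]

    def decode(self, code):
        # List indexing: raises IndexError for codes past the end
        return self._symbols[code]


alph = SketchAlphabet(["A", "C", "G", "T"])
print(alph.encode("G"))  # 2
print(alph.decode(2))    # G
```

The dictionary makes encoding a constant-time lookup, which is why symbols need to be hashable.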

If alphabet 1 contains the same symbols and the same symbol-to-code mappings as alphabet 2, but also introduces new symbols, then alphabet 1 extends alphabet 2. By definition, every alphabet also extends itself.
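
One way to read this definition: since every symbol of the parent alphabet must keep its code, the parent's symbol list has to be a prefix of the extending alphabet's symbol list. The sketch below checks exactly that on plain lists. The function name `extends_sketch` is hypothetical, and the sketch assumes each symbol appears only once per list; it is not necessarily how the library performs the check.

```python
def extends_sketch(symbols, parent_symbols):
    """Hypothetical helper: True if `symbols` extends `parent_symbols`."""
    symbols = list(symbols)
    parent_symbols = list(parent_symbols)
    # Every parent symbol must keep its code, so the parent list
    # must be a prefix of the extending list.
    return symbols[:len(parent_symbols)] == parent_symbols


dna = ["A", "C", "G", "T"]
print(extends_sketch(dna, dna))                   # True
print(extends_sketch(dna + ["U"], dna))           # True
print(extends_sketch(dna, dna + ["U"]))           # False
print(extends_sketch(dna, ["A", "C", "T", "G"]))  # False
```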

Objects of this class are immutable.

Parameters
symbols : iterable object

The symbols that are allowed in this alphabet. The code for a symbol is its index in this list.

Examples

Create an Alphabet containing DNA letters and encode/decode a letter/code:

>>> alph = Alphabet(["A","C","G","T"])
>>> print(alph.encode("G"))
2
>>> print(alph.decode(2))
G
>>> try:
...    alph.encode("foo")
... except Exception as e:
...    print(e)
Symbol 'foo' is not in the alphabet

Create an Alphabet of arbitrary objects:

>>> alph = Alphabet(["foo", 42, (1,2,3), 5, 3.141])
>>> print(alph.encode((1,2,3)))
2
>>> print(alph.decode(4))
3.141

On the subject of alphabet extension: An alphabet always extends itself.

>>> Alphabet(["A","C","G","T"]).extends(Alphabet(["A","C","G","T"]))
True

An alphabet extends another alphabet when it contains additional symbols…

>>> Alphabet(["A","C","G","T","U"]).extends(Alphabet(["A","C","G","T"]))
True

…but not vice versa

>>> Alphabet(["A","C","G","T"]).extends(Alphabet(["A","C","G","T","U"]))
False

Two alphabets with the same symbols but different symbol-to-code mappings do not extend each other:

>>> Alphabet(["A","C","G","T"]).extends(Alphabet(["A","C","T","G"]))
False

decode(code)

Use the alphabet to decode a symbol code.

Parameters
code : int

The symbol code to be decoded.

Returns
symbol : object

The symbol corresponding to code.

Raises
AlphabetError

If code is not a valid code in the alphabet.

decode_multiple(code)

Decode a sequence code into a list of symbols.

Parameters
code : ndarray

The sequence code to decode.

Returns
symbols : list

The decoded list of symbols.

encode(symbol)

Use the alphabet to encode a symbol.

Parameters
symbol : object

The object to encode into a symbol code.

Returns
code : int

The symbol code of symbol.

Raises
AlphabetError

If symbol is not in the alphabet.

encode_multiple(symbols, dtype=<class 'numpy.int64'>)

Encode a list of symbols.

Parameters
symbols : array-like

The symbols to encode.

dtype : dtype, optional

The dtype of the output ndarray. (Default: int64)

Returns
code : ndarray

The sequence code.
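
Conceptually, encode_multiple and decode_multiple apply the single-symbol encoding and decoding to every element. The sketch below illustrates that round trip with plain Python lists to stay dependency-free; the real methods work with an ndarray of the given dtype. The helper names `encode_all` and `decode_all` are hypothetical, and unique symbols are assumed.

```python
def encode_all(alphabet_symbols, symbols):
    """Hypothetical helper: encode each symbol into its code."""
    code_of = {symbol: code for code, symbol in enumerate(alphabet_symbols)}
    return [code_of[symbol] for symbol in symbols]


def decode_all(alphabet_symbols, codes):
    """Hypothetical helper: decode each code back into its symbol."""
    return [alphabet_symbols[code] for code in codes]


dna = ["A", "C", "G", "T"]
codes = encode_all(dna, "GATTACA")
print(codes)                            # [2, 0, 3, 3, 0, 1, 0]
print("".join(decode_all(dna, codes)))  # GATTACA
```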

extends(alphabet)

Check whether this alphabet extends another alphabet.

Parameters
alphabet : Alphabet

The potential parent alphabet.

Returns
result : bool

True if this alphabet extends alphabet, False otherwise.

get_symbols()

Get the symbols in the alphabet.

Returns
symbols : list

Copy of the internal list of symbols.

is_letter_alphabet()

Check whether the symbols in this alphabet are single printable letters. If so, the alphabet could be expressed by a LetterAlphabet.

Returns
is_letter_alphabet : bool

True if all symbols in the alphabet are ‘str’ or ‘bytes’ objects of length 1 and are printable.
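
The criteria above can be sketched as a standalone check. The function name `is_letter_like` is hypothetical, and the sketch assumes that "printable" means the visible ASCII range from '!' to '~'; the library's exact definition may differ.

```python
def is_letter_like(symbols):
    """Hypothetical check mirroring the documented criteria."""
    for symbol in symbols:
        # Only str or bytes objects of length 1 qualify
        if not isinstance(symbol, (str, bytes)) or len(symbol) != 1:
            return False
        # Assumption: "printable" means the visible ASCII range '!'..'~'
        char = symbol if isinstance(symbol, str) else symbol.decode("latin-1")
        if not "!" <= char <= "~":
            return False
    return True


print(is_letter_like(["A", "C", "G", "T"]))  # True
print(is_letter_like(["foo", 42]))           # False
```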