Alphabet
- class biotite.sequence.Alphabet(symbols)
Bases:
object
This class defines the allowed symbols for a Sequence and handles the encoding/decoding between symbols and symbol codes.
An Alphabet is created with the list of symbols that can be used in this context. In most cases a symbol is simply a letter, i.e. a string of length 1, but in principle every hashable Python object can serve as a symbol.
A symbol is encoded into a symbol code as follows: find the first index in the symbol list where the list element equals the symbol. This index is the symbol code. If the symbol is not found in the list, an AlphabetError is raised.
Internally, a dictionary is used for encoding, with symbols as keys and symbol codes as values. Therefore, every symbol must be hashable. For decoding, the symbol list is indexed with the symbol code.
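The encoding scheme described above can be sketched in plain Python. This is an illustrative stand-in, not biotite's implementation: the class name `TinyAlphabet` is hypothetical, and a generic `ValueError` is raised in place of `AlphabetError`.

```python
class TinyAlphabet:
    """Hypothetical stand-in for illustration; not biotite's Alphabet."""

    def __init__(self, symbols):
        # The tuple is used for decoding: code -> symbol
        self._symbols = tuple(symbols)
        # The dictionary is used for encoding: symbol -> code.
        # setdefault() keeps the first index if a symbol appears twice.
        self._codes = {}
        for code, symbol in enumerate(self._symbols):
            self._codes.setdefault(symbol, code)

    def encode(self, symbol):
        try:
            return self._codes[symbol]
        except KeyError:
            raise ValueError(
                f"Symbol {symbol!r} is not in the alphabet"
            ) from None

    def decode(self, code):
        if not 0 <= code < len(self._symbols):
            raise ValueError(f"{code!r} is not a valid code")
        return self._symbols[code]


alph = TinyAlphabet(["A", "C", "G", "T"])
print(alph.encode("G"))  # 2
print(alph.decode(2))    # G
```

The dictionary makes encoding a constant-time lookup, which is why every symbol has to be hashable.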
If alphabet 1 contains the same symbols and the same symbol-code mappings as alphabet 2, but also introduces new symbols, then alphabet 1 extends alphabet 2. By definition, every alphabet also extends itself.
Objects of this class are immutable.
- Parameters:
- symbols : iterable object
The symbols that are allowed in this alphabet. The code for a symbol is the index of that symbol in this list.
Examples
Create an Alphabet containing DNA letters and encode/decode a letter/code:
>>> alph = Alphabet(["A","C","G","T"])
>>> print(alph.encode("G"))
2
>>> print(alph.decode(2))
G
>>> try:
...     alph.encode("foo")
... except Exception as e:
...     print(e)
Symbol 'foo' is not in the alphabet
Create an Alphabet of arbitrary objects:
>>> alph = Alphabet(["foo", 42, (1,2,3), 5, 3.141])
>>> print(alph.encode((1,2,3)))
2
>>> print(alph.decode(4))
3.141
On the subject of alphabet extension: An alphabet always extends itself.
>>> Alphabet(["A","C","G","T"]).extends(Alphabet(["A","C","G","T"]))
True
An alphabet extends another alphabet when it contains additional symbols…
>>> Alphabet(["A","C","G","T","U"]).extends(Alphabet(["A","C","G","T"]))
True
…but not vice versa:
>>> Alphabet(["A","C","G","T"]).extends(Alphabet(["A","C","G","T","U"]))
False
Two alphabets with the same symbols but different symbol-code mappings do not extend each other:
>>> Alphabet(["A","C","G","T"]).extends(Alphabet(["A","C","T","G"]))
False
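The extension rule illustrated by these examples amounts to a prefix check on the symbol lists: the parent's symbols must occupy the same codes in the extending alphabet. The following is a minimal sketch under that reading, using a hypothetical helper `extends_symbols` on plain lists rather than biotite's own method, and assuming neither list contains duplicate symbols.

```python
def extends_symbols(symbols, parent_symbols):
    """Hypothetical helper: does `symbols` extend `parent_symbols`?"""
    symbols = tuple(symbols)
    parent_symbols = tuple(parent_symbols)
    # Same symbols at the same codes means the parent is a prefix.
    # If the parent is longer, the slice is shorter and the test fails.
    return symbols[:len(parent_symbols)] == parent_symbols


dna = ["A", "C", "G", "T"]
print(extends_symbols(dna, dna))                   # True
print(extends_symbols(dna + ["U"], dna))           # True
print(extends_symbols(dna, dna + ["U"]))           # False
print(extends_symbols(dna, ["A", "C", "T", "G"]))  # False
```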
- decode(code)
Use the alphabet to decode a symbol code.
- Parameters:
- code : int
The symbol code to be decoded.
- Returns:
- symbol : object
The symbol corresponding to code.
- Raises:
- AlphabetError
If code is not a valid code in the alphabet.
- decode_multiple(code)
Decode a sequence code into a list of symbols.
- Parameters:
- code : ndarray
The sequence code to decode.
- Returns:
- symbols : list
The decoded list of symbols.
- encode(symbol)
Use the alphabet to encode a symbol.
- Parameters:
- symbol : object
The object to encode into a symbol code.
- Returns:
- code : int
The symbol code of symbol.
- Raises:
- AlphabetError
If symbol is not in the alphabet.
- encode_multiple(symbols, dtype=numpy.int64)
Encode a list of symbols.
- Parameters:
- symbols : array-like
The symbols to encode.
- dtype : dtype, optional
The dtype of the output ndarray. (Default: int64)
- Returns:
- code : ndarray
The sequence code.
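As a rough illustration of how encoding and decoding whole sequences relate, the sketch below round-trips a string through a sequence code. It is not biotite code: plain Python lists stand in for the ndarray that `encode_multiple()` returns, the `dtype` argument is left out, and the two helper functions are hypothetical names chosen to mirror the methods.

```python
symbols = ("A", "C", "G", "T")
# Symbol -> code lookup, built the same way as for single-symbol encoding
codes = {symbol: code for code, symbol in enumerate(symbols)}


def encode_multiple(sequence):
    """Hypothetical helper: map every symbol to its code."""
    return [codes[symbol] for symbol in sequence]


def decode_multiple(sequence_code):
    """Hypothetical helper: map every code back to its symbol."""
    return [symbols[code] for code in sequence_code]


sequence_code = encode_multiple("GATTACA")
print(sequence_code)                   # [2, 0, 3, 3, 0, 1, 0]
print(decode_multiple(sequence_code))  # ['G', 'A', 'T', 'T', 'A', 'C', 'A']
```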
- extends(alphabet)
Check whether this alphabet extends another alphabet.
- Parameters:
- alphabet : Alphabet
The potential parent alphabet.
- Returns:
- result : bool
True if this alphabet extends the given alphabet, False otherwise.
- get_symbols()
Get the symbols in the alphabet.
- Returns:
- symbols : tuple
The symbols.
- is_letter_alphabet()
Check whether the symbols in this alphabet are single printable letters. If so, the alphabet could be expressed by a LetterAlphabet.
- Returns:
- is_letter_alphabet : bool
True if all symbols in the alphabet are str or bytes, have length 1, and are printable.
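The three conditions can be sketched as a standalone check. This is only an approximation for illustration: the function name `looks_like_letter_alphabet` is hypothetical, and "printable" is assumed here to mean ASCII codes 32 to 126, which may differ from the exact set biotite uses.

```python
def looks_like_letter_alphabet(symbols):
    """Hypothetical check mirroring the three documented conditions."""
    for symbol in symbols:
        # Condition 1 and 2: str or bytes, and exactly one character
        if not isinstance(symbol, (str, bytes)) or len(symbol) != 1:
            return False
        # ord() accepts a length-1 str as well as a length-1 bytes object
        # Condition 3 (assumption): printable means ASCII 32..126
        if not 32 <= ord(symbol) <= 126:
            return False
    return True


print(looks_like_letter_alphabet(["A", "C", "G", "T"]))   # True
print(looks_like_letter_alphabet([b"A", "C"]))            # True
print(looks_like_letter_alphabet(["foo", 42, (1, 2, 3)])) # False
print(looks_like_letter_alphabet(["A", "\n"]))            # False
```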