KmerAlphabet

class biotite.sequence.align.KmerAlphabet(base_alphabet, k, spacing=None)

Bases: Alphabet

This type of alphabet uses k-mers as symbols, i.e. all combinations of k symbols from its base alphabet.

Its primary use is its create_kmers() method, which iterates over all overlapping k-mers in a Sequence and encodes each one into its corresponding k-mer symbol code (k-mer code for short). A KmerTable relies on this functionality to find k-mer matches between two sequences.

A KmerAlphabet has \(n^k\) different symbols, where \(n\) is the number of symbols in the base alphabet.
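To make the symbol count concrete, the following minimal sketch enumerates all k-mers of a small alphabet in plain Python, without using Biotite. The names `symbols`, `k` and `kmers` are illustrative assumptions and not part of the library.

```python
from itertools import product

# Hypothetical base alphabet with n = 4 symbols
symbols = ("A", "C", "G", "T")
k = 2

# All combinations of k symbols, in the order of the base alphabet
kmers = ["".join(combo) for combo in product(symbols, repeat=k)]

# The number of k-mers equals n^k
print(len(kmers))             # 16
print(len(symbols) ** k)      # 16
print(kmers[:5])              # ['AA', 'AC', 'AG', 'AT', 'CA']
```

Since the count grows exponentially with k, materializing every symbol like this quickly becomes expensive for larger alphabets or longer k-mers.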

Parameters:
base_alphabetAlphabet

The base alphabet. The created KmerAlphabet contains all combinations of k symbols from this alphabet.

kint

An integer greater than 1 that defines the length of the k-mers.

spacingNone or str or list or ndarray, dtype=int, shape=(k,)

If provided, spaced k-mers are used instead of continuous ones [1]. The value contains the informative positions relative to the start of the k-mer, also called the model. The number of informative positions must equal k.

If a string is given, each '1' in the string indicates an informative position. For a continuous k-mer the spacing would be '111...'.

If a list or array is given, it must contain unique non-negative integers that indicate the informative positions. For a continuous k-mer the spacing would be [0, 1, 2,...].
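The relation between the string form and the integer form of a spacing model can be sketched in plain Python. This is only an illustration of the description above, not Biotite's implementation; the helper names `model_from_string` and `spaced_kmers` are hypothetical.

```python
def model_from_string(spacing):
    """Return the offsets of all '1' characters in a spacing string."""
    return [i for i, char in enumerate(spacing) if char == "1"]


def spaced_kmers(sequence, model):
    """Collect the symbols at the informative positions of each window."""
    # A window spans from offset 0 to the last informative position
    span = max(model) + 1
    return [
        "".join(sequence[start + offset] for offset in model)
        for start in range(len(sequence) - span + 1)
    ]


model = model_from_string("1101")
print(model)                            # [0, 1, 3]
print(spaced_kmers("BIQTITE", model))   # ['BIT', 'IQI', 'QTT', 'TIE']
```

The model "1101" has three informative positions, so it corresponds to k = 3, although each window covers four sequence positions.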

Notes

The symbol code for a k-mer \(s\) is calculated as

\[\text{code}(s) = \sum_{i=0}^{k-1} n^{k-i-1} s_i\]

where \(n\) is the length of the base alphabet and \(s_i\) is the base alphabet symbol code at position \(i\) of the k-mer.

Hence the KmerAlphabet sorts k-mers in the order of the base alphabet, where leading positions within the k-mer take precedence.
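The sum in the notes above reads the k-mer as a number in base \(n\). A minimal pure-Python sketch, assuming the symbol codes are given as a list of integers, evaluates it once term by term and once with Horner's scheme. The function names are hypothetical and do not mirror Biotite's internals.

```python
def kmer_code_sum(symbol_codes, n):
    """Evaluate the sum over n^(k-i-1) * s_i directly."""
    k = len(symbol_codes)
    return sum(n ** (k - i - 1) * s for i, s in enumerate(symbol_codes))


def kmer_code_horner(symbol_codes, n):
    """Evaluate the same sum with Horner's scheme."""
    code = 0
    for s in symbol_codes:
        code = code * n + s
    return code


# 'TC' in the alphabet ('A', 'C', 'G', 'T') has the symbol codes [3, 1]
print(kmer_code_sum([3, 1], n=4))      # 13
print(kmer_code_horner([3, 1], n=4))   # 13
# Leading positions take precedence: 'AT' sorts before 'CA'
print(kmer_code_horner([0, 3], n=4) < kmer_code_horner([1, 0], n=4))  # True
```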

References

Examples

Create an alphabet of nucleobase 2-mers:

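>>> from biotite.sequence import NucleotideSequence, ProteinSequence
>>> from biotite.sequence.align import KmerAlphabet
>>> base_alphabet = NucleotideSequence.unambiguous_alphabet()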
>>> print(base_alphabet.get_symbols())
('A', 'C', 'G', 'T')
>>> kmer_alphabet = KmerAlphabet(base_alphabet, 2)
>>> print(kmer_alphabet.get_symbols())
('AA', 'AC', 'AG', 'AT', 'CA', 'CC', 'CG', 'CT', 'GA', 'GC', 'GG', 'GT', 'TA', 'TC', 'TG', 'TT')

Encode and decode k-mers:

>>> print(kmer_alphabet.encode("TC"))
13
>>> print(kmer_alphabet.decode(13))
['T' 'C']

Fuse symbol codes from the base alphabet into a k-mer code and split the k-mer code back into the original symbol codes:

>>> symbol_codes = base_alphabet.encode_multiple("TC")
>>> print(symbol_codes)
[3 1]
>>> print(kmer_alphabet.fuse(symbol_codes))
13
>>> print(kmer_alphabet.split(13))
[3 1]

Encode all overlapping continuous k-mers of a sequence:

>>> sequence = NucleotideSequence("ATTGCT")
>>> kmer_codes = kmer_alphabet.create_kmers(sequence.code)
>>> print(kmer_codes)
[ 3 15 14  9  7]
>>> print(["".join(kmer) for kmer in kmer_alphabet.decode_multiple(kmer_codes)])
['AT', 'TT', 'TG', 'GC', 'CT']

Encode all overlapping k-mers using spacing:

>>> base_alphabet = ProteinSequence.alphabet
>>> kmer_alphabet = KmerAlphabet(base_alphabet, 3, spacing="1101")
>>> sequence = ProteinSequence("BIQTITE")
>>> kmer_codes = kmer_alphabet.create_kmers(sequence.code)
>>> # Pretty print k-mers
>>> strings = ["".join(kmer) for kmer in kmer_alphabet.decode_multiple(kmer_codes)]
>>> print([s[0] + s[1] + "_" + s[2] for s in strings])
['BI_T', 'IQ_I', 'QT_T', 'TI_E']

Attributes:
base_alphabetAlphabet

The base alphabet, from which the KmerAlphabet was created.

kint

The length of the k-mers.

spacingNone or ndarray, dtype=int

The k-mer model in array form, if spaced k-mers are used, None otherwise.

create_kmers(seq_code)

Create k-mer codes for all overlapping k-mers in the given sequence code.

Parameters:
seq_codendarray, dtype={np.uint8, np.uint16, np.uint32, np.uint64}

The sequence code to be converted into k-mers.

Returns:
kmer_codesndarray, dtype=int64

The symbol codes for the k-mers.

Examples

>>> base_alphabet = NucleotideSequence.unambiguous_alphabet()
>>> kmer_alphabet = KmerAlphabet(base_alphabet, 2)
>>> sequence = NucleotideSequence("ATTGCT")
>>> kmer_codes = kmer_alphabet.create_kmers(sequence.code)
>>> print(kmer_codes)
[ 3 15 14  9  7]
>>> print(["".join(kmer) for kmer in kmer_alphabet.decode_multiple(kmer_codes)])
['AT', 'TT', 'TG', 'GC', 'CT']
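
What happens for continuous k-mers can be sketched as a sliding window in plain Python. This is an illustrative assumption about the behavior, written to reproduce the output of the example above; it is not Biotite's implementation, and `create_kmers_sketch` is a hypothetical name.

```python
def create_kmers_sketch(seq_code, n, k):
    """Encode every overlapping window of k symbol codes as one k-mer code."""
    kmer_codes = []
    for start in range(len(seq_code) - k + 1):
        code = 0
        # Interpret the window as a number in base n
        for s in seq_code[start:start + k]:
            code = code * n + s
        kmer_codes.append(code)
    return kmer_codes


# 'ATTGCT' with the symbol codes A=0, C=1, G=2, T=3
seq_code = [0, 3, 3, 2, 1, 3]
print(create_kmers_sketch(seq_code, n=4, k=2))   # [3, 15, 14, 9, 7]
```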

decode(code)

Use the alphabet to decode a symbol code.

Parameters:
codeint

The symbol code to be decoded.

Returns:
symbolobject

The symbol corresponding to code.

Raises:
AlphabetError

If code is not a valid code in the alphabet.

decode_multiple(code)

Decode a sequence code into a list of symbols.

Parameters:
codendarray

The sequence code to decode.

Returns:
symbolslist

The decoded list of symbols.

encode(symbol)

Use the alphabet to encode a symbol.

Parameters:
symbolobject

The object to encode into a symbol code.

Returns:
codeint

The symbol code of symbol.

Raises:
AlphabetError

If symbol is not in the alphabet.

encode_multiple(symbols, dtype=<class 'numpy.int64'>)

Encode a list of symbols.

Parameters:
symbolsarray-like

The symbols to encode.

dtypedtype, optional

The dtype of the output ndarray. (Default: int64)

Returns:
codendarray

The sequence code.

extends(alphabet)

Check whether this alphabet extends another alphabet.

Parameters:
alphabetAlphabet

The potential parent alphabet.

Returns:
resultbool

True if this alphabet extends alphabet, False otherwise.

fuse(codes)

Get the k-mer code for k symbol codes from the base alphabet.

This method can be used in a vectorized manner to obtain n k-mer codes from an (n,k) integer array.

Parameters:
codesndarray, dtype=int, shape=(k,) or shape=(n,k)

The symbol codes from the base alphabet to be fused.

Returns:
kmer_codesint or ndarray, dtype=np.int64, shape=(n,)

The fused k-mer code(s).

See also

split

The reverse operation.

Examples

>>> base_alphabet = NucleotideSequence.unambiguous_alphabet()
>>> kmer_alphabet = KmerAlphabet(base_alphabet, 2)
>>> symbol_codes = base_alphabet.encode_multiple("TC")
>>> print(symbol_codes)
[3 1]
>>> print(kmer_alphabet.fuse(symbol_codes))
13
>>> print(kmer_alphabet.split(13))
[3 1]

get_symbols()

Get the symbols in the alphabet.

Returns:
symbolstuple

A tuple of all k-mer symbols, i.e. all possible combinations of k symbols from its base alphabet.

Notes

In contrast to the base Alphabet and LetterAlphabet classes, KmerAlphabet does not hold a list of its symbols internally for performance reasons. Hence, calling get_symbols() may be quite time-consuming for large base alphabets or large k values, as the list needs to be created first.

is_letter_alphabet()

Check whether the symbols in this alphabet are single printable letters. If so, the alphabet could be expressed by a LetterAlphabet.

Returns:
is_letter_alphabetbool

True if all symbols in the alphabet are ‘str’ or ‘bytes’, have length 1 and are printable.

kmer_array_length(length)

Get the length of the k-mer array that create_kmers() would return for a sequence of size length.

Parameters:
lengthint

The length of the hypothetical sequence.

Returns:
kmer_lengthint

The length of the created k-mer array.
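
The expected array length can be estimated from the window span. The sketch below is an assumption derived from the class examples (6 symbols with k=2 give 5 k-mers, 7 symbols with the model '1101' give 4), not a copy of the library code. It assumes the sequence is at least as long as one window, and `kmer_array_length_sketch` is a hypothetical name.

```python
def kmer_array_length_sketch(length, k, model=None):
    """Count the windows that fit into a sequence of the given length."""
    # Continuous k-mers span k positions; spaced k-mers span up to
    # the last informative position of the model
    span = k if model is None else max(model) + 1
    # Assumes length >= span
    return length - span + 1


print(kmer_array_length_sketch(6, k=2))                    # 5
print(kmer_array_length_sketch(7, k=3, model=[0, 1, 3]))   # 4
```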

split(kmer_code)

Convert a k-mer code back into k symbol codes from the base alphabet.

This method can be used in a vectorized manner to split n k-mer codes into an (n,k) integer array.

Parameters:
kmer_codeint or ndarray, dtype=int, shape=(n,)

The k-mer code(s).

Returns:
codesndarray, dtype=np.uint64, shape=(k,) or shape=(n,k)

The split symbol codes from the base alphabet.

See also

fuse

The reverse operation.

Examples

>>> base_alphabet = NucleotideSequence.unambiguous_alphabet()
>>> kmer_alphabet = KmerAlphabet(base_alphabet, 2)
>>> symbol_codes = base_alphabet.encode_multiple("TC")
>>> print(symbol_codes)
[3 1]
>>> print(kmer_alphabet.fuse(symbol_codes))
13
>>> print(kmer_alphabet.split(13))
[3 1]
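
Splitting is the inverse of the base-\(n\) encoding, so it can be sketched with repeated integer division. The following plain-Python version handles a single k-mer code only and is an illustration, not the vectorized library routine; `split_sketch` is a hypothetical name.

```python
def split_sketch(kmer_code, n, k):
    """Recover the k base alphabet symbol codes from one k-mer code."""
    codes = [0] * k
    # The last position is the least significant digit in base n
    for i in range(k - 1, -1, -1):
        kmer_code, codes[i] = divmod(kmer_code, n)
    return codes


print(split_sketch(13, n=4, k=2))   # [3, 1]
print(split_sketch(3, n=4, k=2))    # [0, 3]
```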