biotite.sequence.align.KmerAlphabet

class biotite.sequence.align.KmerAlphabet(base_alphabet, k, spacing=None)[source]

Bases: Alphabet

This type of alphabet uses k-mers as symbols, i.e. all combinations of k symbols from its base alphabet.

Its primary use is its create_kmers() method, which iterates over all overlapping k-mers in a Sequence and encodes each one into its corresponding k-mer symbol code (k-mer code for short). This functionality is prominently used by a KmerTable to find k-mer matches between two sequences.

A KmerAlphabet has \(n^k\) different symbols, where \(n\) is the number of symbols in the base alphabet.

Parameters
base_alphabet : Alphabet

The base alphabet. The created KmerAlphabet contains all combinations of k symbols from this alphabet.

k : int

An integer greater than 1 that defines the length of the k-mers.

spacing : None or str or list or ndarray, dtype=int, shape=(k,)

If provided, spaced k-mers are used instead of continuous ones [1]. The value contains the informative positions relative to the start of the k-mer, also called the model. The number of informative positions must equal k.

If a string is given, each '1' in the string indicates an informative position. For a continuous k-mer the spacing would be '111...'.

If a list or array is given, it must contain unique non-negative integers that indicate the informative positions. For a continuous k-mer the spacing would be [0, 1, 2,...].
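
As an illustration of the two accepted forms, the following pure-Python sketch converts a model string into the equivalent list of informative positions and picks the symbols at those positions from every window of a sequence. The helpers `model_to_positions` and `spaced_kmers` are hypothetical names and not part of Biotite; the sketch only mirrors the behavior described above.

```python
def model_to_positions(model):
    """Return the indices of all '1' characters in a spacing string."""
    return [i for i, char in enumerate(model) if char == "1"]


def spaced_kmers(symbols, positions):
    """Collect the symbols at the informative positions of every window."""
    span = max(positions) + 1  # number of sequence positions covered by the model
    return [
        "".join(symbols[start + p] for p in positions)
        for start in range(len(symbols) - span + 1)
    ]


positions = model_to_positions("1101")
print(positions)
print(spaced_kmers("BIQTITE", positions))
```

The model '1101' has three informative positions, so it is only valid together with k=3.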

Notes

The symbol code \(c\) for a k-mer \(s\) is calculated as

\[c = \sum_{i=0}^{k-1} n^{k-i-1} s_i\]

where \(n\) is the length of the base alphabet and \(s_i\) is the symbol code of the \(i\)-th symbol of the k-mer in the base alphabet.

Hence the KmerAlphabet sorts k-mers in the order of the base alphabet, where leading positions within the k-mer take precedence.
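
The positional formula can be checked by hand with a few lines of plain Python. This is a minimal sketch that does not use Biotite; `kmer_code` is a hypothetical helper, and the base alphabet is assumed to be the four unambiguous nucleotides in the order shown in the examples below.

```python
def kmer_code(symbol_codes, n):
    """Evaluate the sum over n**(k-i-1) * s_i for a single k-mer."""
    k = len(symbol_codes)
    return sum(n ** (k - i - 1) * s for i, s in enumerate(symbol_codes))


base = ["A", "C", "G", "T"]
n = len(base)

# 'T' has code 3 and 'C' has code 1, so the k-mer code is 3 * 4 + 1
code_tc = kmer_code([base.index(symbol) for symbol in "TC"], n)
print(code_tc)

# Leading positions take precedence: 'CA' sorts after 'AT'
assert kmer_code([1, 0], n) > kmer_code([0, 3], n)
```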

References

[1] B. Ma, J. Tromp, M. Li, “PatternHunter: faster and more sensitive homology search,” Bioinformatics, vol. 18, pp. 440–445, March 2002. doi: 10.1093/bioinformatics/18.3.440

Examples

Create an alphabet of nucleobase 2-mers:

>>> base_alphabet = NucleotideSequence.unambiguous_alphabet()
>>> print(base_alphabet.get_symbols())
['A', 'C', 'G', 'T']
>>> kmer_alphabet = KmerAlphabet(base_alphabet, 2)
>>> print(kmer_alphabet.get_symbols())
['AA', 'AC', 'AG', 'AT', 'CA', 'CC', 'CG', 'CT', 'GA', 'GC', 'GG', 'GT', 'TA', 'TC', 'TG', 'TT']

Encode and decode k-mers:

>>> print(kmer_alphabet.encode("TC"))
13
>>> print(kmer_alphabet.decode(13))
['T' 'C']

Fuse symbol codes from the base alphabet into a k-mer code and split the k-mer code back into the original symbol codes:

>>> symbol_codes = base_alphabet.encode_multiple("TC")
>>> print(symbol_codes)
[3 1]
>>> print(kmer_alphabet.fuse(symbol_codes))
13
>>> print(kmer_alphabet.split(13))
[3 1]

Encode all overlapping continuous k-mers of a sequence:

>>> sequence = NucleotideSequence("ATTGCT")
>>> kmer_codes = kmer_alphabet.create_kmers(sequence.code)
>>> print(kmer_codes)
[ 3 15 14  9  7]
>>> print(["".join(kmer) for kmer in kmer_alphabet.decode_multiple(kmer_codes)])
['AT', 'TT', 'TG', 'GC', 'CT']

Encode all overlapping k-mers using spacing:

>>> base_alphabet = ProteinSequence.alphabet
>>> kmer_alphabet = KmerAlphabet(base_alphabet, 3, spacing="1101")
>>> sequence = ProteinSequence("BIQTITE")
>>> kmer_codes = kmer_alphabet.create_kmers(sequence.code)
>>> # Pretty print k-mers
>>> strings = ["".join(kmer) for kmer in kmer_alphabet.decode_multiple(kmer_codes)]
>>> print([s[0] + s[1] + "_" + s[2] for s in strings])
['BI_T', 'IQ_I', 'QT_T', 'TI_E']
Attributes
base_alphabet : Alphabet

The base alphabet, from which the KmerAlphabet was created.

k : int

The length of the k-mers.

spacing : None or ndarray, dtype=int

The k-mer model in array form, if spaced k-mers are used, None otherwise.

create_kmers(seq_code)

Create k-mer codes for all overlapping k-mers in the given sequence code.

Parameters
seq_code : ndarray, dtype={np.uint8, np.uint16, np.uint32, np.uint64}

The sequence code to be converted into k-mers.

Returns
kmer_codes : ndarray, dtype=int64

The symbol codes for the k-mers.

Examples

>>> base_alphabet = NucleotideSequence.unambiguous_alphabet()
>>> kmer_alphabet = KmerAlphabet(base_alphabet, 2)
>>> sequence = NucleotideSequence("ATTGCT")
>>> kmer_codes = kmer_alphabet.create_kmers(sequence.code)
>>> print(kmer_codes)
[ 3 15 14  9  7]
>>> print(["".join(kmer) for kmer in kmer_alphabet.decode_multiple(kmer_codes)])
['AT', 'TT', 'TG', 'GC', 'CT']
decode(code)

Use the alphabet to decode a symbol code.

Parameters
code : int

The symbol code to be decoded.

Returns
symbol : object

The symbol corresponding to code.

Raises
AlphabetError

If code is not a valid code in the alphabet.

decode_multiple(code)

Decode a sequence code into a list of symbols.

Parameters
code : ndarray

The sequence code to decode.

Returns
symbols : list

The decoded list of symbols.

encode(symbol)

Use the alphabet to encode a symbol.

Parameters
symbol : object

The object to encode into a symbol code.

Returns
code : int

The symbol code of symbol.

Raises
AlphabetError

If symbol is not in the alphabet.

encode_multiple(symbols, dtype=<class 'numpy.int64'>)

Encode a list of symbols.

Parameters
symbols : array-like

The symbols to encode.

dtype : dtype, optional

The dtype of the output ndarray. (Default: int64)

Returns
code : ndarray

The sequence code.

extends(alphabet)

Check whether this alphabet extends another alphabet.

Parameters
alphabet : Alphabet

The potential parent alphabet.

Returns
result : bool

True if this alphabet extends alphabet, False otherwise.

fuse(codes)

Get the k-mer code for k symbol codes from the base alphabet.

This method can be used in a vectorized manner to obtain n k-mer codes from an (n,k) integer array.

Parameters
codes : ndarray, dtype=int, shape=(k,) or shape=(n,k)

The symbol codes from the base alphabet to be fused.

Returns
kmer_codes : int or ndarray, dtype=np.int64, shape=(n,)

The fused k-mer code(s).

See also

split

The reverse operation.

Examples

>>> base_alphabet = NucleotideSequence.unambiguous_alphabet()
>>> kmer_alphabet = KmerAlphabet(base_alphabet, 2)
>>> symbol_codes = base_alphabet.encode_multiple("TC")
>>> print(symbol_codes)
[3 1]
>>> print(kmer_alphabet.fuse(symbol_codes))
13
>>> print(kmer_alphabet.split(13))
[3 1]
get_symbols()

Get the symbols in the alphabet.

Returns
symbols : list

A list of all k-mer symbols, i.e. all possible combinations of k symbols from its base alphabet.

Notes

In contrast to the base Alphabet and LetterAlphabet classes, KmerAlphabet does not hold a list of its symbols internally, for performance reasons. Hence, calling get_symbols() may be quite time-consuming for large base alphabets or large k values, as the list needs to be created first.
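
To get a feeling for this cost, the number of symbols can be computed up front as \(n^k\), and individual k-mers can be enumerated lazily instead of building the full list. The sketch below uses only the standard library and is an illustration of the ordering, not of how Biotite generates the list internally.

```python
from itertools import islice, product

base_symbols = ["A", "C", "G", "T"]
k = 2

# The full symbol list would contain n**k entries
n_symbols = len(base_symbols) ** k
print(n_symbols)

# itertools.product yields k-mers in the same order as the k-mer codes,
# because the last position varies fastest
lazy_kmers = product(base_symbols, repeat=k)
first_five = ["".join(kmer) for kmer in islice(lazy_kmers, 5)]
print(first_five)
```
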

is_letter_alphabet()

Check whether the symbols in this alphabet are single printable letters. If so, the alphabet could be expressed by a LetterAlphabet.

Returns
is_letter_alphabet : bool

True, if all symbols in the alphabet are ‘str’ or ‘bytes’, have length 1 and are printable.

kmer_array_length(length)

Get the length of the k-mer array that create_kmers() would return for a sequence of size length.

Parameters
length : int

The length of the hypothetical sequence.

Returns
kmer_length : int

The length of the created k-mer array.
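
The class-level examples suggest a simple relation: a sequence of 6 symbols yields 5 continuous 2-mers, and a sequence of 7 symbols yields 4 k-mers for the model '1101'. The sketch below encodes that relation under the assumption that every window must fit completely into the sequence. `kmer_array_length_sketch` is a hypothetical helper, not the library method, and it does not handle sequences shorter than one window.

```python
def kmer_array_length_sketch(length, k, positions=None):
    """Number of windows that fit into a sequence of the given length.

    Assumes the sequence is at least as long as one window.
    """
    # A continuous k-mer spans k positions; a spaced k-mer spans
    # everything up to its last informative position
    span = k if positions is None else max(positions) + 1
    return length - span + 1


print(kmer_array_length_sketch(6, k=2))
print(kmer_array_length_sketch(7, k=3, positions=[0, 1, 3]))
```
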

split(kmer_code)

Convert a k-mer code back into k symbol codes from the base alphabet.

This method can be used in a vectorized manner to split n k-mer codes into an (n,k) integer array.

Parameters
kmer_code : int or ndarray, dtype=int, shape=(n,)

The k-mer code(s).

Returns
codes : ndarray, dtype=np.uint64, shape=(k,) or shape=(n,k)

The split symbol codes from the base alphabet.

See also

fuse

The reverse operation.

Examples

>>> base_alphabet = NucleotideSequence.unambiguous_alphabet()
>>> kmer_alphabet = KmerAlphabet(base_alphabet, 2)
>>> symbol_codes = base_alphabet.encode_multiple("TC")
>>> print(symbol_codes)
[3 1]
>>> print(kmer_alphabet.fuse(symbol_codes))
13
>>> print(kmer_alphabet.split(13))
[3 1]
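
Conceptually, splitting a k-mer code means reading off its digits in base \(n\), and fusing is the inverse. The following pure-Python sketch shows one way to do this with repeated division. `split_sketch` and `fuse_sketch` are hypothetical helpers that operate on plain integers and lists, not on the arrays the library methods accept.

```python
def fuse_sketch(codes, n):
    """Combine k base-alphabet symbol codes into one k-mer code."""
    kmer_code = 0
    for symbol_code in codes:
        kmer_code = kmer_code * n + symbol_code
    return kmer_code


def split_sketch(kmer_code, n, k):
    """Recover the k base-alphabet symbol codes from a k-mer code."""
    codes = [0] * k
    # Fill from the last position, which holds the least significant digit
    for i in range(k - 1, -1, -1):
        kmer_code, codes[i] = divmod(kmer_code, n)
    return codes


print(split_sketch(13, n=4, k=2))
print(fuse_sketch([3, 1], n=4))
```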