biotite.sequence.align.KmerAlphabet

class biotite.sequence.align.KmerAlphabet(base_alphabet, k, spacing=None)

Bases: Alphabet

This type of alphabet uses k-mers as symbols, i.e. all combinations of k symbols from its base alphabet.

Its primary use is its create_kmers() method, which iterates over all overlapping k-mers in a Sequence and encodes each one into its corresponding k-mer symbol code (k-mer code for short). This functionality is prominently used by a KmerTable to find k-mer matches between two sequences.

A KmerAlphabet has \(n^k\) different symbols, where \(n\) is the number of symbols in the base alphabet.
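As an illustration of the symbol count, the following plain-Python sketch enumerates all k-mers of a small base alphabet with itertools.product. It does not use Biotite; the names base_symbols and kmers are chosen for this example only.

```python
from itertools import product

# Illustrative base alphabet (n = 4) and k-mer length
base_symbols = ["A", "C", "G", "T"]
k = 2

# All combinations of k base symbols, generated in base-alphabet order
kmers = ["".join(combination) for combination in product(base_symbols, repeat=k)]

# The number of k-mers equals n^k
print(len(kmers) == len(base_symbols) ** k)
```

The enumeration order puts leading positions first, so the position of a k-mer in this list corresponds to its k-mer code in the examples further down this page.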

Parameters

base_alphabet : Alphabet

The base alphabet. The created KmerAlphabet contains all combinations of k symbols from this alphabet.

k : int

An integer greater than 1 that defines the length of the k-mers.

spacing : None or str or list or ndarray, dtype=int, shape=(k,)

If provided, spaced k-mers are used instead of continuous ones [1]. The value contains the informative positions relative to the start of the k-mer, also called the model. The number of informative positions must equal k.

If a string is given, each '1' in the string indicates an informative position. For a continuous k-mer the spacing would be '111...'.

If a list or array is given, it must contain unique non-negative integers that indicate the informative positions. For a continuous k-mer the spacing would be [0, 1, 2,...].
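The two ways of writing a spacing model can be sketched in plain Python as follows. This is only an illustration of the idea, not Biotite's implementation; model_positions and spaced_kmers are hypothetical helper names.

```python
def model_positions(spacing):
    """Hypothetical helper: turn a spacing model into informative positions."""
    if isinstance(spacing, str):
        # Each '1' marks an informative position
        return [i for i, char in enumerate(spacing) if char == "1"]
    # Assumed to already be a list of unique non-negative integers
    return list(spacing)


def spaced_kmers(sequence, spacing):
    """Hypothetical helper: collect the symbols at the informative positions
    for every window that fits into the sequence."""
    positions = model_positions(spacing)
    span = max(positions) + 1  # window width covered by the model
    return [
        "".join(sequence[start + pos] for pos in positions)
        for start in range(len(sequence) - span + 1)
    ]


print(model_positions("1101"))
print(spaced_kmers("BIQTITE", "1101"))
```

With the model '1101', three of every four consecutive symbols are informative, so k is 3 while each window spans four positions.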

Notes

The symbol code for a k-mer \(s\) is calculated as

\[\text{code}(s) = \sum_{i=0}^{k-1} n^{k-i-1} s_i\]

where \(n\) is the length of the base alphabet and \(s_i\) is the base alphabet symbol code at position \(i\) of the k-mer.

Hence the KmerAlphabet sorts k-mers in the order of the base alphabet, where leading positions within the k-mer take precedence.
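The calculation in these notes can be sketched in plain Python. The function name kmer_code is an assumption made for this illustration; in Biotite the corresponding operation is the fuse() method documented below.

```python
def kmer_code(symbol_codes, n):
    """Hypothetical helper: combine k base symbol codes into one k-mer code.

    Evaluates the sum over n^(k-i-1) * s_i step by step (Horner's scheme),
    so that leading positions carry the highest weight.
    """
    code = 0
    for symbol_code in symbol_codes:
        code = code * n + symbol_code
    return code


# 'TC' in the alphabet ['A', 'C', 'G', 'T'] has the symbol codes [3, 1]
print(kmer_code([3, 1], n=4))  # 3 * 4 + 1 = 13
```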

References

[1] B. Ma, J. Tromp, M. Li, “PatternHunter: faster and more sensitive homology search,” Bioinformatics, vol. 18, pp. 440–445, March 2002. doi: 10.1093/bioinformatics/18.3.440

Examples

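Create an alphabet of nucleobase 2-mers:

>>> from biotite.sequence import NucleotideSequence, ProteinSequence
>>> from biotite.sequence.align import KmerAlphabet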
>>> base_alphabet = NucleotideSequence.unambiguous_alphabet()
>>> print(base_alphabet.get_symbols())
['A', 'C', 'G', 'T']
>>> kmer_alphabet = KmerAlphabet(base_alphabet, 2)
>>> print(kmer_alphabet.get_symbols())
['AA', 'AC', 'AG', 'AT', 'CA', 'CC', 'CG', 'CT', 'GA', 'GC', 'GG', 'GT', 'TA', 'TC', 'TG', 'TT']

Encode and decode k-mers:

>>> print(kmer_alphabet.encode("TC"))
13
>>> print(kmer_alphabet.decode(13))
['T' 'C']

Fuse symbol codes from the base alphabet into a k-mer code and split the k-mer code back into the original symbol codes:

>>> symbol_codes = base_alphabet.encode_multiple("TC")
>>> print(symbol_codes)
[3 1]
>>> print(kmer_alphabet.fuse(symbol_codes))
13
>>> print(kmer_alphabet.split(13))
[3 1]

Encode all overlapping continuous k-mers of a sequence:

>>> sequence = NucleotideSequence("ATTGCT")
>>> kmer_codes = kmer_alphabet.create_kmers(sequence.code)
>>> print(kmer_codes)
[ 3 15 14  9  7]
>>> print(["".join(kmer) for kmer in kmer_alphabet.decode_multiple(kmer_codes)])
['AT', 'TT', 'TG', 'GC', 'CT']

Encode all overlapping k-mers using spacing:

>>> base_alphabet = ProteinSequence.alphabet
>>> kmer_alphabet = KmerAlphabet(base_alphabet, 3, spacing="1101")
>>> sequence = ProteinSequence("BIQTITE")
>>> kmer_codes = kmer_alphabet.create_kmers(sequence.code)
>>> # Pretty print k-mers
>>> strings = ["".join(kmer) for kmer in kmer_alphabet.decode_multiple(kmer_codes)]
>>> print([s[0] + s[1] + "_" + s[2] for s in strings])
['BI_T', 'IQ_I', 'QT_T', 'TI_E']

Attributes

base_alphabet : Alphabet

The base alphabet, from which the KmerAlphabet was created.

k : int

The length of the k-mers.

spacing : None or ndarray, dtype=int

The k-mer model in array form, if spaced k-mers are used, None otherwise.

create_kmers(seq_code)

Create k-mer codes for all overlapping k-mers in the given sequence code.

Parameters

seq_code : ndarray, dtype={np.uint8, np.uint16, np.uint32, np.uint64}

The sequence code to be converted into k-mers.

Returns

kmer_codes : ndarray, dtype=int64

The symbol codes for the k-mers.

Examples

>>> base_alphabet = NucleotideSequence.unambiguous_alphabet()
>>> kmer_alphabet = KmerAlphabet(base_alphabet, 2)
>>> sequence = NucleotideSequence("ATTGCT")
>>> kmer_codes = kmer_alphabet.create_kmers(sequence.code)
>>> print(kmer_codes)
[ 3 15 14  9  7]
>>> print(["".join(kmer) for kmer in kmer_alphabet.decode_multiple(kmer_codes)])
['AT', 'TT', 'TG', 'GC', 'CT']
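Conceptually, the result above corresponds to sliding a window of length k over the sequence code and encoding each window. The plain-Python sketch below covers only the continuous (unspaced) case; sliding_kmer_codes is a hypothetical name, and the sketch says nothing about how Biotite implements this internally.

```python
def sliding_kmer_codes(seq_code, n, k):
    """Hypothetical helper: encode every overlapping continuous k-mer.

    seq_code holds base alphabet symbol codes, n is the base alphabet size.
    """
    kmer_codes = []
    for start in range(len(seq_code) - k + 1):
        code = 0
        # Leading positions within the window get the highest weight
        for symbol_code in seq_code[start : start + k]:
            code = code * n + symbol_code
        kmer_codes.append(code)
    return kmer_codes


# 'ATTGCT' in the alphabet ['A', 'C', 'G', 'T']
print(sliding_kmer_codes([0, 3, 3, 2, 1, 3], n=4, k=2))  # [3, 15, 14, 9, 7]
```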

decode(code)

Use the alphabet to decode a symbol code.

Parameters

code : int

The symbol code to be decoded.

Returns

symbol : object

The symbol corresponding to code.

Raises

AlphabetError: If code is not a valid code in the alphabet.

decode_multiple(code)

Decode a sequence code into a list of symbols.

Parameters

code : ndarray

The sequence code to decode.

Returns

symbols : list

The decoded list of symbols.

encode(symbol)

Use the alphabet to encode a symbol.

Parameters

symbol : object

The object to encode into a symbol code.

Returns

code : int

The symbol code of symbol.

Raises

AlphabetError: If symbol is not in the alphabet.

encode_multiple(symbols, dtype=numpy.int64)

Encode a list of symbols.

Parameters

symbols : array-like

The symbols to encode.

dtype : dtype, optional

The dtype of the output ndarray. (Default: int64)

Returns

code : ndarray

The sequence code.

extends(alphabet)

Check whether this alphabet extends another alphabet.

Parameters

alphabet : Alphabet

The potential parent alphabet.

Returns

result : bool

True if this object extends alphabet, False otherwise.

fuse(codes)

Get the k-mer code for k symbol codes from the base alphabet.

This method can be used in a vectorized manner to obtain n k-mer codes from an (n,k) integer array.

Parameters

codes : ndarray, dtype=int, shape=(k,) or shape=(n,k)

The symbol codes from the base alphabet to be fused.

Returns

kmer_codes : int or ndarray, dtype=np.int64, shape=(n,)

The fused k-mer code(s).

See also

split: The reverse operation.

Examples

>>> base_alphabet = NucleotideSequence.unambiguous_alphabet()
>>> kmer_alphabet = KmerAlphabet(base_alphabet, 2)
>>> symbol_codes = base_alphabet.encode_multiple("TC")
>>> print(symbol_codes)
[3 1]
>>> print(kmer_alphabet.fuse(symbol_codes))
13
>>> print(kmer_alphabet.split(13))
[3 1]

get_symbols()

Get the symbols in the alphabet.

Returns

symbols : list

A list of all k-mer symbols, i.e. all possible combinations of k symbols from its base alphabet.

Notes

In contrast to the base Alphabet and LetterAlphabet classes, KmerAlphabet does not hold a list of its symbols internally, for performance reasons. Hence calling get_symbols() may be quite time-consuming for large base alphabets or large k values, as the list needs to be created first.

is_letter_alphabet()

Check whether the symbols in this alphabet are single printable letters. If so, the alphabet could be expressed by a LetterAlphabet.

Returns

is_letter_alphabet : bool

True, if all symbols in the alphabet are ‘str’ or ‘bytes’, have length 1 and are printable.

kmer_array_length(length)

Get the length of the k-mer array that create_kmers() would return for a sequence of the given length.

Parameters

length : int

The length of the hypothetical sequence.

Returns

kmer_length : int

The length of the created k-mer array.
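Judging from the class examples above (a sequence of length 6 yields five 2-mers, and a sequence of length 7 yields four k-mers for the model '1101'), the array length appears to follow from the span of the k-mer model. The plain-Python sketch below encodes that reading; expected_kmer_count is a hypothetical name, and returning 0 for sequences shorter than the span is an assumption that may differ from the library's behavior.

```python
def expected_kmer_count(length, k, positions=None):
    """Hypothetical helper: number of k-mers in a sequence of a given length.

    positions holds the informative positions of a spaced model,
    or None for continuous k-mers.
    """
    # A continuous k-mer spans k positions, a spaced one spans up to
    # its last informative position
    span = k if positions is None else max(positions) + 1
    # Assumption: sequences shorter than the span contain no k-mer
    return max(length - span + 1, 0)


print(expected_kmer_count(6, k=2))                       # 5
print(expected_kmer_count(7, k=3, positions=[0, 1, 3]))  # 4
```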

split(kmer_code)

Convert a k-mer code back into k symbol codes from the base alphabet.

This method can be used in a vectorized manner to split n k-mer codes into an (n,k) integer array.

Parameters

kmer_code : int or ndarray, dtype=int, shape=(n,)

The k-mer code(s).

Returns

codes : ndarray, dtype=np.uint64, shape=(k,) or shape=(n,k)

The split symbol codes from the base alphabet.
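Splitting is the inverse of the positional encoding used for fusing, so it can be sketched as repeated division by the base alphabet size. The following plain-Python function is an illustration for a single k-mer code only; split_code is a hypothetical name and not part of Biotite.

```python
def split_code(kmer_code, n, k):
    """Hypothetical helper: recover k base symbol codes from one k-mer code.

    n is the size of the base alphabet.
    """
    codes = [0] * k
    # The last position has the lowest weight, so fill from the back
    for i in range(k - 1, -1, -1):
        kmer_code, codes[i] = divmod(kmer_code, n)
    return codes


# 13 is the code of 'TC' in the 2-mer alphabet over ['A', 'C', 'G', 'T']
print(split_code(13, n=4, k=2))  # [3, 1]
```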

Gallery

Quantifying gene expression from RNA-seq data