KmerAlphabet
- class biotite.sequence.align.KmerAlphabet(base_alphabet, k, spacing=None)
Bases: Alphabet
This type of alphabet uses k-mers as symbols, i.e. all combinations of k symbols from its base alphabet.
Its primary use is its create_kmers() method, which iterates over all overlapping k-mers in a Sequence and encodes each one into its corresponding k-mer symbol code (k-mer code for short). This functionality is prominently used by a KmerTable to find k-mer matches between two sequences.
A KmerAlphabet has \(n^k\) different symbols, where \(n\) is the number of symbols in the base alphabet.
- Parameters:
- base_alphabet : Alphabet
The base alphabet. The created KmerAlphabet contains all combinations of k symbols from this alphabet.
- k : int
An integer greater than 1 that defines the length of the k-mers.
- spacing : None or str or list or ndarray, dtype=int, shape=(k,)
If provided, spaced k-mers are used instead of continuous ones [1]. The value contains the informative positions relative to the start of the k-mer, also called the model. The number of informative positions must equal k.
If a string is given, each '1' in the string indicates an informative position. For a continuous k-mer the spacing would be '111...'.
If a list or array is given, it must contain unique non-negative integers that indicate the informative positions. For a continuous k-mer the spacing would be [0, 1, 2,...].
Notes
The symbol code \(c\) for a k-mer \(s\) is calculated as
\[c = \sum_{i=0}^{k-1} n^{k-i-1} s_i\]
where \(n\) is the length of the base alphabet and \(s_i\) is the base alphabet symbol code at position \(i\) of the k-mer.
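As a minimal sketch of the formula above, written in plain Python rather than taken from the library's implementation (the helper name `kmer_code` is hypothetical), the code of a single k-mer can be computed from the symbol codes of its positions:

```python
def kmer_code(symbol_codes, n):
    """Compute sum_{i=0}^{k-1} n^(k-i-1) * s_i for the symbol codes s."""
    k = len(symbol_codes)
    return sum(n ** (k - i - 1) * s for i, s in enumerate(symbol_codes))


# Base alphabet ('A', 'C', 'G', 'T'), so n = 4, 'T' -> 3 and 'C' -> 1
print(kmer_code([3, 1], n=4))  # 3 * 4 + 1 = 13
```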
Hence, the KmerAlphabet sorts k-mers in the order of the base alphabet, where leading positions within the k-mer take precedence.
References
Examples
Create an alphabet of nucleobase 2-mers:
>>> base_alphabet = NucleotideSequence.unambiguous_alphabet()
>>> print(base_alphabet.get_symbols())
('A', 'C', 'G', 'T')
>>> kmer_alphabet = KmerAlphabet(base_alphabet, 2)
>>> print(kmer_alphabet.get_symbols())
('AA', 'AC', 'AG', 'AT', 'CA', 'CC', 'CG', 'CT', 'GA', 'GC', 'GG', 'GT', 'TA', 'TC', 'TG', 'TT')
Encode and decode k-mers:
>>> print(kmer_alphabet.encode("TC"))
13
>>> print(kmer_alphabet.decode(13))
['T' 'C']
Fuse symbol codes from the base alphabet into a k-mer code and split the k-mer code back into the original symbol codes:
>>> symbol_codes = base_alphabet.encode_multiple("TC")
>>> print(symbol_codes)
[3 1]
>>> print(kmer_alphabet.fuse(symbol_codes))
13
>>> print(kmer_alphabet.split(13))
[3 1]
Encode all overlapping continuous k-mers of a sequence:
>>> sequence = NucleotideSequence("ATTGCT")
>>> kmer_codes = kmer_alphabet.create_kmers(sequence.code)
>>> print(kmer_codes)
[ 3 15 14  9  7]
>>> print(["".join(kmer) for kmer in kmer_alphabet.decode_multiple(kmer_codes)])
['AT', 'TT', 'TG', 'GC', 'CT']
Encode all overlapping k-mers using spacing:
>>> base_alphabet = ProteinSequence.alphabet
>>> kmer_alphabet = KmerAlphabet(base_alphabet, 3, spacing="1101")
>>> sequence = ProteinSequence("BIQTITE")
>>> kmer_codes = kmer_alphabet.create_kmers(sequence.code)
>>> # Pretty print k-mers
>>> strings = ["".join(kmer) for kmer in kmer_alphabet.decode_multiple(kmer_codes)]
>>> print([s[0] + s[1] + "_" + s[2] for s in strings])
['BI_T', 'IQ_I', 'QT_T', 'TI_E']
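To illustrate what a spacing model such as "1101" selects, the following plain-Python sketch picks the symbols at the informative positions of each window. It works on strings instead of symbol codes, and the helper `spaced_kmers` is hypothetical, not part of the library:

```python
def spaced_kmers(sequence, spacing):
    """Collect the symbols at the informative positions of each window."""
    # Informative positions are the indices of the '1' characters in the model
    positions = [i for i, char in enumerate(spacing) if char == "1"]
    # A window reaches from the start up to the last informative position
    span = positions[-1] + 1
    return [
        "".join(sequence[start + pos] for pos in positions)
        for start in range(len(sequence) - span + 1)
    ]


print(spaced_kmers("BIQTITE", "1101"))
# ['BIT', 'IQI', 'QTT', 'TIE']
```

The position marked '0' in the model is skipped, which is why each 3-mer here spans four sequence positions.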
- Attributes:
- base_alphabet : Alphabet
The base alphabet from which the KmerAlphabet was created.
- k : int
The length of the k-mers.
- spacing : None or ndarray, dtype=int
The k-mer model in array form if spaced k-mers are used, None otherwise.
- create_kmers(seq_code)
Create k-mer codes for all overlapping k-mers in the given sequence code.
- Parameters:
- seq_code : ndarray, dtype={np.uint8, np.uint16, np.uint32, np.uint64}
The sequence code to be converted into k-mers.
- Returns:
- kmer_codes : ndarray, dtype=int64
The symbol codes for the k-mers.
Examples
>>> base_alphabet = NucleotideSequence.unambiguous_alphabet()
>>> kmer_alphabet = KmerAlphabet(base_alphabet, 2)
>>> sequence = NucleotideSequence("ATTGCT")
>>> kmer_codes = kmer_alphabet.create_kmers(sequence.code)
>>> print(kmer_codes)
[ 3 15 14  9  7]
>>> print(["".join(kmer) for kmer in kmer_alphabet.decode_multiple(kmer_codes)])
['AT', 'TT', 'TG', 'GC', 'CT']
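The same result can be reproduced without the library. The sketch below is a simple plain-Python stand-in for continuous k-mers only, which makes no claim about how create_kmers() is implemented internally; the function name `create_kmer_codes` is hypothetical:

```python
def create_kmer_codes(seq_code, n, k):
    """Return the k-mer code of every overlapping window of length k."""
    kmer_codes = []
    for start in range(len(seq_code) - k + 1):
        code = 0
        for symbol_code in seq_code[start : start + k]:
            # Shift the previous positions by one 'digit' in base n
            code = code * n + symbol_code
        kmer_codes.append(code)
    return kmer_codes


# "ATTGCT" encoded with the alphabet ('A', 'C', 'G', 'T')
print(create_kmer_codes([0, 3, 3, 2, 1, 3], n=4, k=2))
# [3, 15, 14, 9, 7]
```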
- decode(code)
Use the alphabet to decode a symbol code.
- Parameters:
- code : int
The symbol code to be decoded.
- Returns:
- symbol : object
The symbol corresponding to code.
- Raises:
- AlphabetError
If code is not a valid code in the alphabet.
- decode_multiple(code)
Decode a sequence code into a list of symbols.
- Parameters:
- code : ndarray
The sequence code to decode.
- Returns:
- symbols : list
The decoded list of symbols.
- encode(symbol)
Use the alphabet to encode a symbol.
- Parameters:
- symbol : object
The object to encode into a symbol code.
- Returns:
- code : int
The symbol code of symbol.
- Raises:
- AlphabetError
If symbol is not in the alphabet.
- encode_multiple(symbols, dtype=<class 'numpy.int64'>)
Encode a list of symbols.
- Parameters:
- symbols : array-like
The symbols to encode.
- dtype : dtype, optional
The dtype of the output ndarray. (Default: int64)
- Returns:
- code : ndarray
The sequence code.
- extends(alphabet)
Check whether this alphabet extends another alphabet.
- Parameters:
- alphabet : Alphabet
The potential parent alphabet.
- Returns:
- result : bool
True if this object extends alphabet, False otherwise.
- fuse(codes)
Get the k-mer code for k symbol codes from the base alphabet.
This method can be used in a vectorized manner to obtain n k-mer codes from an (n,k) integer array.
- Parameters:
- codes : ndarray, dtype=int, shape=(k,) or shape=(n,k)
The symbol codes from the base alphabet to be fused.
- Returns:
- kmer_codes : int or ndarray, dtype=np.int64, shape=(n,)
The fused k-mer code(s).
See also
split
The reverse operation.
Examples
>>> base_alphabet = NucleotideSequence.unambiguous_alphabet()
>>> kmer_alphabet = KmerAlphabet(base_alphabet, 2)
>>> symbol_codes = base_alphabet.encode_multiple("TC")
>>> print(symbol_codes)
[3 1]
>>> print(kmer_alphabet.fuse(symbol_codes))
13
>>> print(kmer_alphabet.split(13))
[3 1]
- get_symbols()
Get the symbols in the alphabet.
- Returns:
- symbols : tuple
A tuple of all k-mer symbols, i.e. all possible combinations of k symbols from its base alphabet.
Notes
In contrast to the base Alphabet and LetterAlphabet classes, KmerAlphabet does not hold a list of its symbols internally for performance reasons. Hence, calling get_symbols() may be quite time consuming for large base alphabets or large k values, as the list needs to be created first.
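To give a feeling for why building the symbol list is costly, the following sketch enumerates all combinations with the standard library. It only illustrates the \(n^k\) growth and the symbol order, and `all_kmer_symbols` is a hypothetical helper, not the library's implementation:

```python
from itertools import product


def all_kmer_symbols(base_symbols, k):
    """Build every combination of k base symbols, leading position first."""
    return tuple("".join(kmer) for kmer in product(base_symbols, repeat=k))


symbols = all_kmer_symbols("ACGT", 2)
print(len(symbols))  # 4 ** 2 = 16
print(symbols[13])   # TC
```

The tuple length grows as n ** k, so an alphabet of 20 symbols with k = 5 would already yield 3,200,000 entries.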
- is_letter_alphabet()
Check whether the symbols in this alphabet are single printable letters. If so, the alphabet could be expressed by a LetterAlphabet.
- Returns:
- is_letter_alphabet : bool
True, if all symbols in the alphabet are ‘str’ or ‘bytes’, have length 1 and are printable.
- kmer_array_length(length)
Get the length of the k-mer array that create_kmers() would create for a sequence of size length.
- Parameters:
- length : int
The length of the hypothetical sequence.
- Returns:
- kmer_length : int
The length of the created k-mer array.
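The examples on this page (6 symbols giving five 2-mers, 7 symbols giving four k-mers with spacing "1101") are consistent with the relation sketched below. This is an assumption inferred from those examples rather than a statement about the library's code, and the standalone function is hypothetical:

```python
def kmer_array_length(length, k, spacing=None):
    """Number of k-mers that fit into a sequence of the given length."""
    # Assumption: a continuous k-mer covers k positions, a spaced k-mer
    # covers everything up to and including its last informative position
    span = k if spacing is None else max(spacing) + 1
    # Only meaningful if the sequence is at least as long as one k-mer
    return length - span + 1


print(kmer_array_length(6, k=2))                     # 5
print(kmer_array_length(7, k=3, spacing=[0, 1, 3]))  # 4
```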
- split(kmer_code)
Convert a k-mer code back into k symbol codes from the base alphabet.
This method can be used in a vectorized manner to split n k-mer codes into an (n,k) integer array.
- Parameters:
- kmer_code : int or ndarray, dtype=int, shape=(n,)
The k-mer code(s).
- Returns:
- codes : ndarray, dtype=np.uint64, shape=(k,) or shape=(n,k)
The split symbol codes from the base alphabet.
See also
fuse
The reverse operation.
Examples
>>> base_alphabet = NucleotideSequence.unambiguous_alphabet()
>>> kmer_alphabet = KmerAlphabet(base_alphabet, 2)
>>> symbol_codes = base_alphabet.encode_multiple("TC")
>>> print(symbol_codes)
[3 1]
>>> print(kmer_alphabet.fuse(symbol_codes))
13
>>> print(kmer_alphabet.split(13))
[3 1]
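Conceptually, splitting reads the k-mer code as a k-digit number in base n, where n is the size of the base alphabet. A minimal plain-Python sketch of that idea, assuming a scalar code and using the hypothetical helper `split_kmer_code`, could look like this:

```python
def split_kmer_code(kmer_code, n, k):
    """Recover the k base symbol codes from a k-mer code."""
    codes = []
    for _ in range(k):
        # The last position is the least significant 'digit' in base n
        kmer_code, symbol_code = divmod(kmer_code, n)
        codes.append(symbol_code)
    # Digits were collected from the last position to the first
    return codes[::-1]


print(split_kmer_code(13, n=4, k=2))  # [3, 1]
```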
Gallery
Quantifying gene expression from RNA-seq data