FrequencyPermutation#

class biotite.sequence.align.FrequencyPermutation(kmer_alphabet, counts)[source]#

Bases: Permutation

Provide an order for k-mers from a given KmerAlphabet, such that less frequent k-mers are smaller than more frequent k-mers. The frequency of each k-mer can either be given directly via the constructor or can be computed from a KmerTable via from_table().

Parameters:
kmer_alphabetKmerAlphabet, length=n

The k-mer alphabet that defines the range of possible k-mers that should be permuted.

countsndarray, shape=(n,), dtype=np.int64

The absolute frequency, i.e. the number of occurrences, of each k-mer in kmer_alphabet in the sequence database of interest. counts[c] = f, where c is the k-mer code and f is the corresponding frequency.

Notes

In actual sequences some sequence patterns appear in high quantity. When selecting a subset of k-mers, e.g. via MinimizerSelector, it is desireable to select the low-frequency informative k-mers to avoid spurious matches. To achieve such selection this class can be used.

This class uses a table to look up the order. Hence, the memory consumption is \(8 n^k\) bytes, where \(n\) is the size of the base alphabet and \(k\) is the k-mer size.

Examples

>>> alphabet = LetterAlphabet("abcdr")
>>> sequence = GeneralSequence(alphabet, "abracadabra")
>>> kmer_table = KmerTable.from_sequences(k=2, sequences=[sequence])
>>> print(kmer_table)
ab: (0, 0), (0, 7)
ac: (0, 3)
ad: (0, 5)
br: (0, 1), (0, 8)
ca: (0, 4)
da: (0, 6)
ra: (0, 2), (0, 9)
>>> # Create all k-mers in lexicographic order
>>> kmer_alph = kmer_table.kmer_alphabet
>>> kmer_codes = np.arange(0, len(kmer_alph))
>>> print(["..."] + ["".join(kmer_alph.decode(c)) for c in kmer_codes[-10:]])
['...', 'da', 'db', 'dc', 'dd', 'dr', 'ra', 'rb', 'rc', 'rd', 'rr']
>>> # After applying the permutation the k-mers are ordered
>>> # by their frequency in the table
>>> # -> the most frequent k-mers have low rank
>>> permutation = FrequencyPermutation.from_table(kmer_table)
>>> order = permutation.permute(kmer_codes)
>>> print(order)
[ 0 22 18 19  1  2  3  4  5 23 20  6  7  8  9 21 10 11 12 13 24 14 15 16
 17]
>>> kmer_codes = kmer_codes[np.argsort(order)]
>>> print(["..."] + ["".join(kmer_alph.decode(c)) for c in kmer_codes[-10:]])
['...', 'rc', 'rd', 'rr', 'ac', 'ad', 'ca', 'da', 'ab', 'br', 'ra']
Attributes:
min, max: int

The minimum and maximum value, the permutated value (i.e. the return value of permute()) can take.

kmer_alphabetKmerAlphabet

The k-mer alphabet that defines the range of possible k-mers that should be permuted.

static from_table(kmer_table)#

Create a FrequencyPermutation from the k-mer counts of a KmerTable.

Parameters:
kmer_tableKmerTable

The k-mer counts are taken from this table.

Returns:
permutationFrequencyPermutation

The permutation is based on the counts.

permute(kmers)#

Give the given k-mers a new order.

Parameters:
kmersndarray, dtype=np.int64

The k-mers to reorder given as k-mer code.

Returns:
orderndarray, dtype=np.int64

The sort key for the new order, i.e. a k-mer A is smaller than k-mer B, if order[A] < order[B] The order value may not only contain positive but also negative integers. The order is unambiguous: If A != B, then order[A] != order[B].