biotite.sequence.align.FrequencyPermutation

class biotite.sequence.align.FrequencyPermutation(kmer_alphabet, counts)[source]

Bases: Permutation

Provide an order for k-mers from a given KmerAlphabet, such that less frequent k-mers are smaller than more frequent k-mers. The frequency of each k-mer can either be given directly via the constructor or can be computed from a KmerTable via from_table().

Parameters
kmer_alphabetKmerAlphabet, length=n

The k-mer alphabet that defines the range of possible k-mers that should be permuted.

countsndarray, shape=(n,), dtype=np.int64

The absolute frequency, i.e. the number of occurrences, of each k-mer in kmer_alphabet in the sequence database of interest. counts[c] = f, where c is the k-mer code and f is the corresponding frequency.

Notes

In actual sequences some sequence patterns appear in high quantity. When selecting a subset of k-mers, e.g. via MinimizerSelector, it is desireable to select the low-frequency informative k-mers to avoid spurious matches. To achieve such selection this class can be used.

This class uses a table to look up the order. Hence, the memory consumption is \(8 n^k\) bytes, where \(n\) is the size of the base alphabet and \(k\) is the k-mer size.

Examples

>>> alphabet = LetterAlphabet("abcdr")
>>> sequence = GeneralSequence(alphabet, "abracadabra")
>>> kmer_table = KmerTable.from_sequences(k=2, sequences=[sequence])
>>> print(kmer_table)
ab: (0, 0), (0, 7)
ac: (0, 3)
ad: (0, 5)
br: (0, 1), (0, 8)
ca: (0, 4)
da: (0, 6)
ra: (0, 2), (0, 9)
>>> # Create all k-mers in lexicographic order
>>> kmer_alph = kmer_table.kmer_alphabet
>>> kmer_codes = np.arange(0, len(kmer_alph))
>>> print(["..."] + ["".join(kmer_alph.decode(c)) for c in kmer_codes[-10:]])
['...', 'da', 'db', 'dc', 'dd', 'dr', 'ra', 'rb', 'rc', 'rd', 'rr']
>>> # After applying the permutation the k-mers are ordered
>>> # by their frequency in the table
>>> # -> the most frequent k-mers have low rank
>>> permutation = FrequencyPermutation.from_table(kmer_table)
>>> order = permutation.permute(kmer_codes)
>>> print(order)
[ 0 24 20 19 16 15 14 13 12 22 21 10 11  8  7 18  6  5  4  3 23  2  1  9
 17]
>>> kmer_codes = kmer_codes[np.argsort(order)]
>>> print(["..."] + ["".join(kmer_alph.decode(c)) for c in kmer_codes[-10:]])
['...', 'ba', 'ar', 'rr', 'da', 'ad', 'ac', 'ca', 'br', 'ra', 'ab']
Attributes
min, max: int

The minimum and maximum value, the permutated value (i.e. the return value of permute()) can take.

from_table()

Create a FrequencyPermutation from the k-mer counts of a KmerTable.

Parameters
kmer_tableKmerTable

The k-mer counts are taken from this table.

Returns
permutationFrequencyPermutation

The permutation is based on the counts.

permute(kmers)

Give the given k-mers a new order.

Parameters
kmersndarray, dtype=np.int64

The k-mers to reorder given as k-mer code.

Returns
orderndarray, dtype=np.int64

The sort key for the new order, i.e. a k-mer A is smaller than k-mer B, if order[A] < order[B] The order value may not only contain positive but also negative integers. The order is unambiguous: If A != B, then order[A] != order[B].