MinimizerSelector

class biotite.sequence.align.MinimizerSelector(kmer_alphabet, window, permutation=None)

Bases: object

Selects the minimizers in sequences.

In a rolling window of k-mers, the minimizer is defined as the k-mer with the minimum k-mer code [1]. If the same minimum k-mer appears twice in a window, the leftmost k-mer is selected as minimizer.
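This definition can be made concrete with a brute-force sketch in plain Python. The function name `naive_minimizers` is hypothetical and not part of Biotite; it scans every window, takes the leftmost minimum and reports each minimizer position only once. The input is the list of k-mer codes from the Examples section.

```python
def naive_minimizers(kmer_codes, window):
    """Return (positions, codes) of the minimizers, one window at a time."""
    positions = []
    for start in range(len(kmer_codes) - window + 1):
        chunk = kmer_codes[start : start + window]
        # list.index() returns the first match, i.e. the leftmost minimum
        pos = start + chunk.index(min(chunk))
        # Consecutive windows often share their minimizer: keep it only once
        if not positions or positions[-1] != pos:
            positions.append(pos)
    return positions, [kmer_codes[p] for p in positions]


codes = [9367, 3639, 4415, 9199, 13431, 4415, 9192, 13271,
         567, 13611, 8725, 2057, 7899, 9875, 1993, 6363]
positions, minimizers = naive_minimizers(codes, window=4)
print(positions)   # [1, 2, 5, 8, 11, 14]
print(minimizers)  # [3639, 4415, 4415, 567, 2057, 1993]
```

The runtime of this sketch grows with the window size, so it only serves to illustrate the definition.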

Parameters:
kmer_alphabet : KmerAlphabet

The k-mer alphabet that defines the k-mer size and the type of sequence this MinimizerSelector can be applied to.

window : int

The size of the rolling window in which the minimizers are searched, i.e. the number of k-mers per window. The window size must be at least 2.

permutation : Permutation, optional

If set, the k-mer order is permuted, i.e. the minimizer is chosen based on the ordering of the sort keys from Permutation.permute(). By default, the standard order of the KmerAlphabet is used. This standard order is often the lexicographical order, which is known to yield suboptimal density in many cases [1].
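The effect of a permutation can be sketched without Biotite: the minimizer becomes the k-mer with the smallest sort key instead of the smallest k-mer code. In the sketch below, `minimizers_by_key` and the multiplicative `scramble` key are hypothetical stand-ins chosen for illustration; they do not reproduce the sort keys of any Biotite Permutation class.

```python
def minimizers_by_key(kmer_codes, window, key=lambda code: code):
    """Select in each window the leftmost k-mer with the smallest sort key."""
    keys = [key(code) for code in kmer_codes]
    positions = []
    for start in range(len(keys) - window + 1):
        chunk = keys[start : start + window]
        pos = start + chunk.index(min(chunk))
        # Report a position only once, even if several windows select it
        if not positions or positions[-1] != pos:
            positions.append(pos)
    return positions


def scramble(code):
    # Hypothetical pseudo-random order: multiply by an odd constant, modulo 2^32
    return (code * 2654435761) % 2**32


codes = [9367, 3639, 4415, 9199, 13431, 4415, 9192, 13271,
         567, 13611, 8725, 2057, 7899, 9875, 1993, 6363]
standard = minimizers_by_key(codes, window=4)
permuted = minimizers_by_key(codes, window=4, key=scramble)
print(standard)  # [1, 2, 5, 8, 11, 14], based on the order of the k-mer codes
print(permuted)  # positions selected under the scrambled order
```

Whatever key is used, every window still contributes exactly one minimizer, so the guarantees that depend on the window size are unaffected; only which k-mers are selected, and how many distinct positions result, changes.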

Notes

Minimizers are computed with a fast algorithm [2] whose runtime scales linearly with the length of the sequence and is independent of the size of the rolling window.
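A runtime that is linear in the sequence length and independent of the window size can also be reached with a monotonic deque, a common sliding-window minimum technique. The sketch below uses that technique purely for illustration; it is not claimed to be the algorithm from [2] or the one implemented in Biotite, and `deque_minimizers` is a hypothetical name.

```python
from collections import deque


def deque_minimizers(kmer_codes, window):
    """Sliding-window minimum with a monotonic deque (leftmost minimum wins)."""
    # Indices of candidate k-mers; their codes never decrease from front to back
    candidates = deque()
    positions = []
    for i, code in enumerate(kmer_codes):
        # Drop candidates that are strictly larger than the new k-mer.
        # Equal codes are kept, so the leftmost of equal minima stays in front.
        while candidates and kmer_codes[candidates[-1]] > code:
            candidates.pop()
        candidates.append(i)
        # Drop the front candidate if it has left the window [i - window + 1, i]
        if candidates[0] <= i - window:
            candidates.popleft()
        # The first complete window ends at index window - 1
        if i >= window - 1:
            pos = candidates[0]
            if not positions or positions[-1] != pos:
                positions.append(pos)
    return positions, [kmer_codes[p] for p in positions]


codes = [9367, 3639, 4415, 9199, 13431, 4415, 9192, 13271,
         567, 13611, 8725, 2057, 7899, 9875, 1993, 6363]
positions, minimizers = deque_minimizers(codes, window=4)
print(positions)   # [1, 2, 5, 8, 11, 14]
print(minimizers)  # [3639, 4415, 4415, 567, 2057, 1993]
```

Each index enters and leaves the deque at most once, so the amortized cost per k-mer is constant regardless of the window size.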

References

Examples

The k-mer decomposition of a sequence can yield a high number of k-mers:

>>> sequence1 = ProteinSequence("THIS*IS*A*SEQVENCE")
>>> kmer_alph = KmerAlphabet(sequence1.alphabet, k=3)
>>> all_kmers = kmer_alph.create_kmers(sequence1.code)
>>> print(all_kmers)
[ 9367  3639  4415  9199 13431  4415  9192 13271   567 13611  8725  2057
  7899  9875  1993  6363]
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in all_kmers])
['THI', 'HIS', 'IS*', 'S*I', '*IS', 'IS*', 'S*A', '*A*', 'A*S', '*SE', 'SEQ', 'EQV', 'QVE', 'VEN', 'ENC', 'NCE']

Minimizers can be used to reduce the number of k-mers by selecting only the minimum k-mer in each window of w k-mers:

>>> minimizer = MinimizerSelector(kmer_alph, window=4)
>>> minimizer_pos, minimizers = minimizer.select(sequence1)
>>> print(minimizer_pos)
[ 1  2  5  8 11 14]
>>> print(minimizers)
[3639 4415 4415  567 2057 1993]
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in minimizers])
['HIS', 'IS*', 'IS*', 'A*S', 'EQV', 'ENC']

Although this approach reduces the number of k-mers, two sequences are still guaranteed to have at least one minimizer in common if they share an identical contiguous subsequence of length at least w + k - 1, where k is the k-mer size:

>>> sequence2 = ProteinSequence("ANQTHER*SEQVENCE")
>>> other_minimizer_pos, other_minimizers = minimizer.select(sequence2)
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in other_minimizers])
['ANQ', 'HER', 'ER*', 'EQV', 'ENC']
>>> common_minimizers = set.intersection(set(minimizers), set(other_minimizers))
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in common_minimizers])
['EQV', 'ENC']
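
The guarantee follows from the fact that a shared stretch of length w + k - 1 contains w consecutive k-mers, i.e. one complete window that is identical in both sequences and therefore yields the same minimizer. A minimal sketch in plain Python, assuming that k-mers are compared as strings in lexicographic order rather than by Biotite k-mer codes, so the selected k-mers can differ from the ones printed above; `string_minimizers` is a hypothetical helper:

```python
def string_minimizers(sequence, k, window):
    """Return the set of minimizer k-mers, using lexicographic string order."""
    kmers = [sequence[i : i + k] for i in range(len(sequence) - k + 1)]
    selected = set()
    for start in range(len(kmers) - window + 1):
        # Only the k-mer itself is stored, so ties need no special handling
        selected.add(min(kmers[start : start + window]))
    return selected


k, w = 3, 4
shared = "*SEQVENCE"  # stretch that occurs in both example sequences
assert len(shared) >= w + k - 1

common = string_minimizers("THIS*IS*A*SEQVENCE", k, w) & string_minimizers(
    "ANQTHER*SEQVENCE", k, w
)
print(sorted(common))
```

Every window that lies completely inside the shared stretch selects the same k-mer in both sequences, so the intersection cannot be empty.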

Attributes:
kmer_alphabet : KmerAlphabet

The k-mer alphabet.

window : int

The window size.

permutation : Permutation

The permutation.

select(sequence, alphabet_check=True)

Obtain all overlapping k-mers from a sequence and select the minimizers from them.

Parameters:
sequence : Sequence

The sequence to find the minimizers in. Must be compatible with the given kmer_alphabet.

alphabet_check : bool, optional

If set to False, the compatibility between the alphabet of the sequence and the alphabet of the MinimizerSelector is not checked, which improves performance.

Returns:
minimizer_indices : ndarray, dtype=np.uint32

The sequence indices where the minimizer k-mers start.

minimizers : ndarray, dtype=np.int64

The k-mers that are the selected minimizers, returned as k-mer codes.

Notes

Duplicate minimizers are omitted, i.e. if two windows have the same minimizer position, the return values contain this minimizer only once.

select_from_kmers(kmers)

Select minimizers for the given overlapping k-mers.

Parameters:
kmers : ndarray, dtype=np.int64

The k-mer codes representing the sequence to find the minimizers in. The k-mer codes correspond to the k-mers encoded by the given kmer_alphabet.

Returns:
minimizer_indices : ndarray, dtype=np.uint32

The indices in the input k-mer sequence where a minimizer appears.

minimizers : ndarray, dtype=np.int64

The corresponding k-mer codes of the minimizers.

Notes

Duplicate minimizers are omitted, i.e. if two windows have the same minimizer position, the return values contain this minimizer only once.