MinimizerSelector
- class biotite.sequence.align.MinimizerSelector(kmer_alphabet, window, permutation=None)
Bases: object
Selects the minimizers in sequences.
In a rolling window of k-mers, the minimizer is defined as the k-mer with the minimum k-mer code [1]. If the same minimum k-mer appears more than once in a window, the leftmost occurrence is selected as the minimizer.
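The definition above can be spelled out as a brute-force sketch in plain Python. This is not the implementation used by MinimizerSelector; the function name naive_minimizers is hypothetical, and the k-mer codes are assumed to be given as a plain list of integers.

```python
def naive_minimizers(kmer_codes, window):
    """Brute-force minimizer selection (illustrative sketch only).

    kmer_codes: list of integer k-mer codes, one per sequence position.
    window: number of k-mers per rolling window.
    """
    positions = []
    for start in range(len(kmer_codes) - window + 1):
        window_codes = kmer_codes[start : start + window]
        # list.index() returns the first match, i.e. the leftmost minimum
        pos = start + window_codes.index(min(window_codes))
        # Neighboring windows often share a minimizer: keep each position once
        if not positions or positions[-1] != pos:
            positions.append(pos)
    return positions, [kmer_codes[p] for p in positions]


# Toy input: the window [4, 9, 4] contains the minimum twice
positions, codes = naive_minimizers([7, 4, 9, 4, 8], window=3)
print(positions, codes)
```

Each window is scanned in full here, so the cost grows with the window size; the sketch is only meant to make the selection rule and the leftmost tie-break concrete.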
- Parameters:
- kmer_alphabet : KmerAlphabet
The k-mer alphabet that defines the k-mer size and the type of sequence this MinimizerSelector can be applied on.
- window : int
The size of the rolling window in which the minimizers are searched. In other words, this is the number of k-mers per window. The window size must be at least 2.
- permutation : Permutation, optional
If set, the k-mer order is permuted, i.e. the minimizer is chosen based on the ordering of the sort keys from Permutation.permute(). By default, the standard order of the KmerAlphabet is used. This standard order is often the lexicographical order, which is known to yield suboptimal density in many cases [1].
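The effect of a permutation can be sketched in plain Python as choosing the minimum by a sort key instead of by the k-mer code itself. The names minimizers_by_key and scramble are hypothetical, and the multiplicative hash is only a stand-in; it is an assumption for illustration and not what any Permutation class computes.

```python
def minimizers_by_key(kmer_codes, window, sort_key=lambda code: code):
    """Select the k-mer with the smallest sort key in each window (sketch)."""
    keys = [sort_key(code) for code in kmer_codes]
    positions = []
    for start in range(len(keys) - window + 1):
        window_keys = keys[start : start + window]
        # Leftmost position of the smallest key in this window
        pos = start + window_keys.index(min(window_keys))
        if not positions or positions[-1] != pos:
            positions.append(pos)
    # The returned values are still k-mer codes, not sort keys
    return positions, [kmer_codes[p] for p in positions]


def scramble(code):
    # Hypothetical sort key: a multiplicative hash that breaks up the
    # standard (often lexicographical) order of the k-mer codes
    return (code * 2654435761) % 2**32


codes = [3, 1, 2, 5, 4]
print(minimizers_by_key(codes, window=2))                    # standard order
print(minimizers_by_key(codes, window=2, sort_key=scramble)) # permuted order
```

Only the ordering changes: the selected values are still k-mer codes from the input, which is why minimizers chosen under the same key remain comparable between sequences.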
Notes
For minimizer computation, a fast algorithm [2] is used whose runtime scales linearly with the length of the sequence and is independent of the size of the rolling window.
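To illustrate how a sliding-window minimum can be found in time independent of the window size, here is a sketch based on a monotonic deque. This is a swapped-in, commonly used technique and not necessarily the algorithm from [2]; the function name deque_minimizers is hypothetical.

```python
from collections import deque


def deque_minimizers(kmer_codes, window):
    """Sliding-window minimizers in amortized linear time (sketch)."""
    # Indices of candidate minimizers; their codes never decrease from
    # the front to the back, so the front is always the leftmost minimum
    candidates = deque()
    positions = []
    for i, code in enumerate(kmer_codes):
        # Strictly larger candidates can never be a minimum again.
        # Equal codes are kept, so the leftmost occurrence wins ties.
        while candidates and kmer_codes[candidates[-1]] > code:
            candidates.pop()
        candidates.append(i)
        # Drop the front candidate once it has left the current window
        if candidates[0] <= i - window:
            candidates.popleft()
        # The first complete window ends at index window - 1
        if i >= window - 1:
            pos = candidates[0]
            if not positions or positions[-1] != pos:
                positions.append(pos)
    return positions, [kmer_codes[p] for p in positions]


print(deque_minimizers([7, 4, 9, 4, 8], window=3))
```

Every index is appended and removed at most once, so the total work is proportional to the number of k-mers regardless of the window size.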
References
Examples
The k-mer decomposition of a sequence can yield a high number of k-mers:
>>> sequence1 = ProteinSequence("THIS*IS*A*SEQVENCE")
>>> kmer_alph = KmerAlphabet(sequence1.alphabet, k=3)
>>> all_kmers = kmer_alph.create_kmers(sequence1.code)
>>> print(all_kmers)
[ 9367  3639  4415  9199 13431  4415  9192 13271   567 13611  8725  2057
  7899  9875  1993  6363]
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in all_kmers])
['THI', 'HIS', 'IS*', 'S*I', '*IS', 'IS*', 'S*A', '*A*', 'A*S', '*SE', 'SEQ', 'EQV', 'QVE', 'VEN', 'ENC', 'NCE']
Minimizers can be used to reduce the number of k-mers by selecting only the minimum k-mer in each window w:
>>> minimizer = MinimizerSelector(kmer_alph, window=4)
>>> minimizer_pos, minimizers = minimizer.select(sequence1)
>>> print(minimizer_pos)
[ 1  2  5  8 11 14]
>>> print(minimizers)
[3639 4415 4415  567 2057 1993]
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in minimizers])
['HIS', 'IS*', 'IS*', 'A*S', 'EQV', 'ENC']
Although this approach reduces the number of k-mers, minimizers are still guaranteed to match minimizers in another sequence if both sequences share an equal subsequence of length at least w + k - 1, because such a subsequence spans a complete window of w k-mers:
>>> sequence2 = ProteinSequence("ANQTHER*SEQVENCE")
>>> other_minimizer_pos, other_minimizers = minimizer.select(sequence2)
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in other_minimizers])
['ANQ', 'HER', 'ER*', 'EQV', 'ENC']
>>> common_minimizers = set.intersection(set(minimizers), set(other_minimizers))
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in common_minimizers])
['EQV', 'ENC']
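The matching guarantee can also be checked without the library. The following plain-Python sketch works on k-mer strings and orders them by ordinary string comparison, which is an assumption and differs from the code order of a KmerAlphabet; the function name minimizer_kmers is hypothetical.

```python
def minimizer_kmers(seq, k, window):
    """Return the set of minimizer k-mers of a string (illustrative sketch)."""
    kmers = [seq[i : i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - window + 1):
        # Assumption: plain string order stands in for the k-mer code order
        selected.add(min(kmers[start : start + window]))
    return selected


k, window = 3, 4
shared = "SEQVENCE"  # length 8, which is at least window + k - 1 = 6

first = minimizer_kmers("THIS*IS*A*" + shared, k, window)
second = minimizer_kmers("ANQTHER*" + shared, k, window)
common = first & second

# Every window that lies completely inside the shared part occurs in
# both strings, so its minimizer is selected in both of them
print(sorted(minimizer_kmers(shared, k, window)))  # ['ENC', 'EQV']
print(minimizer_kmers(shared, k, window) <= common)  # True
```

A shared stretch shorter than w + k - 1 does not contain a full window, so in that case a common minimizer may exist but is no longer guaranteed.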
- Attributes:
- kmer_alphabet : KmerAlphabet
The k-mer alphabet.
- window : int
The window size.
- permutation : Permutation
The permutation.
- select(sequence, alphabet_check=True)
Obtain all overlapping k-mers from a sequence and select the minimizers from them.
- Parameters:
- sequence : Sequence
The sequence to find the minimizers in. Must be compatible with the given kmer_alphabet.
- alphabet_check : bool, optional
If set to False, the compatibility between the alphabet of the sequence and the alphabet of the MinimizerSelector is not checked, to gain additional performance.
- Returns:
- minimizer_indices : ndarray, dtype=np.uint32
The sequence indices where the minimizer k-mers start.
- minimizers : ndarray, dtype=np.int64
The k-mers that are the selected minimizers, returned as k-mer codes.
Notes
Duplicate minimizers are omitted, i.e. if two windows have the same minimizer position, the return values contain this minimizer only once.
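To make the note concrete, the sketch below lists the minimizer position of every single window for the k-mer codes from the class example and then collapses repeated positions. It is a plain-Python illustration under the definition given for this class, not the library's implementation.

```python
from itertools import groupby

# k-mer codes of "THIS*IS*A*SEQVENCE" with k=3, copied from the class example
kmer_codes = [9367, 3639, 4415, 9199, 13431, 4415, 9192, 13271,
              567, 13611, 8725, 2057, 7899, 9875, 1993, 6363]
window = 4

# One minimizer position per window (leftmost minimum in each window)
per_window = []
for start in range(len(kmer_codes) - window + 1):
    codes = kmer_codes[start : start + window]
    per_window.append(start + codes.index(min(codes)))
print(per_window)  # [1, 1, 2, 5, 5, 8, 8, 8, 8, 11, 11, 14, 14]

# Collapse runs of identical positions, as described in the note
unique_positions = [pos for pos, _ in groupby(per_window)]
print(unique_positions)  # [1, 2, 5, 8, 11, 14]
```

Deduplication refers to positions, not to k-mer codes: the code 4415 is still reported twice because it is the minimizer at two different positions, 2 and 5.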
- select_from_kmers(kmers)
Select minimizers for the given overlapping k-mers.
- Parameters:
- kmers : ndarray, dtype=np.int64
The k-mer codes representing the sequence to find the minimizers in. The k-mer codes correspond to the k-mers encoded by the given kmer_alphabet.
- Returns:
- minimizer_indices : ndarray, dtype=np.uint32
The indices in the input k-mer sequence where a minimizer appears.
- minimizers : ndarray, dtype=np.int64
The corresponding k-mer codes of the minimizers.
Notes
Duplicate minimizers are omitted, i.e. if two windows have the same minimizer position, the return values contain this minimizer only once.
Gallery
Quantifying gene expression from RNA-seq data