biotite.sequence.align.MinimizerSelector¶
- class biotite.sequence.align.MinimizerSelector(kmer_alphabet, window, permutation=None)[source]¶
Bases:
object
Selects the minimizers in sequences.
In a rolling window of k-mers, the minimizer is defined as the k-mer with the minimum k-mer code 1. If the same minimum k-mer appears twice in a window, the leftmost k-mer is selected as minimizer.
- Parameters
- kmer_alphabetKmerAlphabet
The k-mer alphabet that defines the k-mer size and the type of sequence this
MinimizerSelector
can be applied on.- windowint
The size of the rolling window, where the minimizers are searched in. In other words this is the number of k-mers per window. The window size must be at least 2.
- permutationPermutation
If set, the k-mer order is permuted, i.e. the minimizer is chosen based on the ordering of the sort keys from
Permutation.permute()
. By default, the standard order of theKmerAlphabet
is used. This standard order is often the lexicographical order, which is known to yield suboptimal density in many cases 1.
Notes
For minimizer computation a fast algorithm 2 is used, whose runtime scales linearly with the length of the sequence and is constant with regard to the size of the rolling window.
References
- 1(1,2)
M. Roberts, W. Hayes, B. R. Hunt, S. M. Mount, J. A. Yorke, “Reducing storage requirements for biological sequence comparison,” Bioinformatics, vol. 20, pp. 3363–3369, December 2004. doi: 10.1093/bioinformatics/bth408
- 2
M. {van Herk}, “A fast algorithm for local minimum and maximum filters on rectangular and octagonal kernels,” Pattern Recognition Letters, vol. 13, pp. 517–521, July 1992. doi: 10.1016/0167-8655(92)90069-C
Examples
The k-mer decomposition of a sequence can yield a high number of k-mers:
>>> sequence1 = ProteinSequence("THIS*IS*A*SEQVENCE") >>> kmer_alph = KmerAlphabet(sequence1.alphabet, k=3) >>> all_kmers = kmer_alph.create_kmers(sequence1.code) >>> print(all_kmers) [ 9367 3639 4415 9199 13431 4415 9192 13271 567 13611 8725 2057 7899 9875 1993 6363] >>> print(["".join(kmer_alph.decode(kmer)) for kmer in all_kmers]) ['THI', 'HIS', 'IS*', 'S*I', '*IS', 'IS*', 'S*A', '*A*', 'A*S', '*SE', 'SEQ', 'EQV', 'QVE', 'VEN', 'ENC', 'NCE']
Minimizers can be used to reduce the number of k-mers by selecting only the minimum k-mer in each window w:
>>> minimizer = MinimizerSelector(kmer_alph, window=4) >>> minimizer_pos, minimizers = minimizer.select(sequence1) >>> print(minimizer_pos) [ 1 2 5 8 11 14] >>> print(minimizers) [3639 4415 4415 567 2057 1993] >>> print(["".join(kmer_alph.decode(kmer)) for kmer in minimizers]) ['HIS', 'IS*', 'IS*', 'A*S', 'EQV', 'ENC']
Although this approach reduces the number of k-mers, minimizers are still guaranteed to match minimizers in another sequence, if they share an equal subsequence of at least length w + k - 1:
>>> sequence2 = ProteinSequence("ANQTHER*SEQVENCE") >>> other_minimizer_pos, other_minimizers = minimizer.select(sequence2) >>> print(["".join(kmer_alph.decode(kmer)) for kmer in other_minimizers]) ['ANQ', 'HER', 'ER*', 'EQV', 'ENC'] >>> common_minimizers = set.intersection(set(minimizers), set(other_minimizers)) >>> print(["".join(kmer_alph.decode(kmer)) for kmer in common_minimizers]) ['EQV', 'ENC']
- Attributes
- kmer_alphabetKmerAlphabet
The k-mer alphabet.
- windowint
The window size.
- permutationPermutation
The permutation.
- select(sequence, alphabet_check=True)¶
Obtain all overlapping k-mers from a sequence and select the minimizers from them.
- Parameters
- sequenceSequence
The sequence to find the minimizers in. Must be compatible with the given kmer_alphabet
- alphabet_check: bool, optional
If set to false, the compatibility between the alphabet of the sequence and the alphabet of the
MinimizerSelector
is not checked to gain additional performance.
- Returns
- minimizer_indicesndarray, dtype=np.uint32
The sequence indices where the minimizer k-mers start.
- minimizersndarray, dtype=np.int64
The k-mers that are the selected minimizers, returned as k-mer code.
Notes
Duplicate minimizers are omitted, i.e. if two windows have the same minimizer position, the return values contain this minimizer only once.
- select_from_kmers(kmers)¶
Select minimizers for the given overlapping k-mers.
- Parameters
- kmersndarray, dtype=np.int64
The k-mer codes representing the sequence to find the minimizers in. The k-mer codes correspond to the k-mers encoded by the given kmer_alphabet.
- Returns
- minimizer_indicesndarray, dtype=np.uint32
The indices in the input k-mer sequence where a minimizer appears.
- minimizersndarray, dtype=np.int64
The corresponding k-mers codes of the minimizers.
Notes
Duplicate minimizers are omitted, i.e. if two windows have the same minimizer position, the return values contain this minimizer only once.
Gallery¶
Quantifying gene expression from RNA-seq data