biotite.sequence.align.MincodeSelector
- class biotite.sequence.align.MincodeSelector(self, kmer_alphabet, compression, permutation=None)
Bases: object
Selects the smallest \(1/\text{compression}\) fraction of k-mers from the KmerAlphabet. [1] ‘Small’ refers to the lexicographical order, or alternatively to a custom order if permutation is given. The Mincode approach tries to reduce the number of k-mers from a sequence by the factor compression, while still ensuring that a common set of k-mers is selected from similar sequences.
- Parameters
- kmer_alphabet : KmerAlphabet
The k-mer alphabet that defines the k-mer size and the type of sequence this MincodeSelector can be applied to.
- compression : float
Defines the compression factor, i.e. approximately \(1/\text{compression}\) of the k-mers of a sequence will be selected.
- permutation : Permutation, optional
If set, the k-mer order is permuted, i.e. the k-mers are selected based on the ordering of the sort keys from Permutation.permute(). By default, the standard order of the KmerAlphabet is used. This standard order is often the lexicographical order.
References
- [1] R. Edgar, “Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences,” PeerJ, vol. 9, e10805, February 2021. doi: 10.7717/peerj.10805
Examples
>>> import numpy as np
>>> from biotite.sequence import NucleotideSequence
>>> from biotite.sequence.align import KmerAlphabet, MincodeSelector, RandomPermutation
>>> kmer_alph = KmerAlphabet(NucleotideSequence.alphabet_unamb, k=2)
>>> kmers = np.arange(len(kmer_alph))
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in kmers])
['AA', 'AC', 'AG', 'AT', 'CA', 'CC', 'CG', 'CT', 'GA', 'GC', 'GG', 'GT', 'TA', 'TC', 'TG', 'TT']
>>> # Select 1/4 of *k-mers* based on lexicographical k-mer order
>>> selector = MincodeSelector(kmer_alph, 4)
>>> subset_pos, kmers_subset = selector.select_from_kmers(kmers)
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in kmers_subset])
['AA', 'AC', 'AG', 'AT']
>>> # Select 1/4 based on randomized k-mer order
>>> selector = MincodeSelector(kmer_alph, 4, permutation=RandomPermutation())
>>> subset_pos, kmers_subset = selector.select_from_kmers(kmers)
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in kmers_subset])
['AG', 'CT', 'GA', 'TC']
- Attributes
- kmer_alphabet : KmerAlphabet
The k-mer alphabet.
- compression : float
The compression factor.
- threshold : float
The threshold is calculated from the compression factor and the range of (permuted) k-mer values. All k-mers whose (permuted) value is smaller than this threshold are selected.
- permutation : Permutation
The permutation.
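The relation between compression and threshold can be sketched in plain Python. This is a minimal illustration and not the library's implementation: it assumes that the threshold is placed so that roughly 1/compression of the sort key range lies below it. The helper name `mincode_threshold` and its parameters are hypothetical.

```python
def mincode_threshold(key_min, key_max, compression):
    """
    Hypothetical helper: place the threshold so that roughly
    1/compression of the sort key range [key_min, key_max] lies below it.
    This is an assumption for illustration, not Biotite's actual code.
    """
    if compression < 1:
        raise ValueError("compression must be at least 1")
    key_range = key_max - key_min + 1
    return key_min + key_range / compression


# Without a permutation the sort keys are the k-mer codes themselves:
# a k-mer alphabet with 16 symbols (k=2 for nucleotides) has codes 0..15
threshold = mincode_threshold(0, 15, compression=4)
selected = [code for code in range(16) if code < threshold]
print(threshold)  # 4.0
print(selected)   # [0, 1, 2, 3]
```

Under this assumption, a compression of 4 on the 16 possible nucleotide 2-mers keeps the codes 0 to 3, which corresponds to 'AA', 'AC', 'AG' and 'AT' in the example above.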
- select(sequence, alphabet_check=True)
Obtain all overlapping k-mers from a sequence and select the Mincode k-mers from them.
- Parameters
- sequence : Sequence
The sequence to find the Mincode k-mers in. Must be compatible with the given kmer_alphabet.
- alphabet_check : bool, optional
If set to false, the compatibility between the alphabet of the sequence and the alphabet of the MincodeSelector is not checked, which gains additional performance.
- Returns
- mincode_indices : ndarray, dtype=np.uint32
The sequence indices where the Mincode k-mers start.
- mincode : ndarray, dtype=np.int64
The corresponding Mincode k-mer codes.
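What this method does conceptually can be sketched without Biotite. The following is a simplified pure-Python stand-in, not the actual implementation: the names `KMER_CODE` and `select_mincode` are hypothetical, k-mer codes are assumed to follow the lexicographical order, and the threshold is assumed to be the number of possible k-mers divided by the compression factor.

```python
from itertools import product

ALPHABET = "ACGT"
K = 2
# Map each k-mer string to its code in lexicographical order
# (AA -> 0, AC -> 1, ..., TT -> 15)
KMER_CODE = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=K))}


def select_mincode(sequence, compression):
    """
    Illustrative stand-in for ``select()``: return the start positions and
    codes of all overlapping k-mers whose code is below the threshold.
    """
    threshold = len(KMER_CODE) / compression  # assumed threshold rule
    positions, codes = [], []
    for i in range(len(sequence) - K + 1):
        code = KMER_CODE[sequence[i : i + K]]
        if code < threshold:
            positions.append(i)
            codes.append(code)
    return positions, codes


# Two similar sequences that differ by one substitution (G -> C at index 4)
pos_a, codes_a = select_mincode("TACAGATT", compression=4)
pos_b, codes_b = select_mincode("TACACATT", compression=4)
print(pos_a, codes_a)  # [1, 3, 5] [1, 2, 3]
print(pos_b, codes_b)  # [1, 3, 5] [1, 1, 3]
# k-mer codes selected from both sequences: AC (1) and AT (3)
print(sorted(set(codes_a) & set(codes_b)))  # [1, 3]
```

Because the selection depends only on the k-mer itself and not on its neighbors, k-mers shared by both sequences are either selected in both or in neither.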
- select_from_kmers(kmers)
Select Mincode k-mers.
The given k-mers are not required to overlap.
- Parameters
- kmers : ndarray, dtype=np.int64
The k-mer codes to select the Mincode k-mers from.
- Returns
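- mincode_indices : ndarray, dtype=np.uint32
The indices in kmers at which the selected Mincode k-mers are located. If kmers contains all overlapping k-mers of a sequence in order, these are the sequence indices where the Mincode k-mers start.
- mincode : ndarray, dtype=np.int64
The corresponding Mincode k-mer codes.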
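The effect of a custom k-mer order on the selection can be sketched as follows. This is a pure-Python illustration under stated assumptions, not Biotite's code: `select_from_codes` and its `sort_key` argument are hypothetical stand-ins for `select_from_kmers()` and a `Permutation`, the sort keys are assumed to cover the same range as the k-mer codes, and the threshold is assumed to be the number of possible k-mers divided by the compression factor.

```python
def select_from_codes(kmer_codes, n_kmers, compression, sort_key=None):
    """
    Illustrative stand-in for ``select_from_kmers()``.
    ``sort_key`` is a hypothetical callable mapping a k-mer code to a sort
    key in the range 0..n_kmers-1; by default the code itself is the key.
    """
    if sort_key is None:
        sort_key = lambda code: code
    threshold = n_kmers / compression  # assumed threshold rule
    indices = [
        i for i, code in enumerate(kmer_codes) if sort_key(code) < threshold
    ]
    return indices, [kmer_codes[i] for i in indices]


# Arbitrary 2-mer codes that need not stem from overlapping k-mers
# (16 possible k-mers)
kmers = [12, 1, 7, 3, 15, 0, 9]

# Standard order: codes 0..3 fall below the threshold of 4
print(select_from_codes(kmers, 16, 4))
# ([1, 3, 5], [1, 3, 0])

# Custom order: reversing the key range selects the largest codes instead
print(select_from_codes(kmers, 16, 4, sort_key=lambda code: 15 - code))
# ([0, 4], [12, 15])
```

In this sketch the returned indices refer to positions in the input list, and the selected codes keep their input order regardless of the sort key that was applied.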