biotite.sequence.align.MincodeSelector

class biotite.sequence.align.MincodeSelector(self, kmer_alphabet, compression, permutation=None)

Bases: object

Selects the smallest \(1/\text{compression}\) fraction of k-mers from the KmerAlphabet [1].

‘Small’ refers to the lexicographical order, or alternatively to a custom order if permutation is given. The Mincode approach tries to reduce the number of k-mers from a sequence by the factor compression, while still ensuring that a common set of k-mers is selected from similar sequences.
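
The thresholding idea described above can be sketched in plain Python. This is an illustrative sketch only, not Biotite's implementation: the helpers kmer_codes() and mincode_select() are hypothetical, and the threshold is assumed to be the number of possible k-mers divided by compression.

```python
def kmer_codes(seq, k, alphabet="ACGT"):
    """Encode each overlapping k-mer of seq as an integer (lexicographical order)."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    base = len(alphabet)
    codes = []
    for start in range(len(seq) - k + 1):
        code = 0
        for symbol in seq[start:start + k]:
            code = code * base + index[symbol]
        codes.append(code)
    return codes


def mincode_select(codes, n_kmers, compression):
    """Keep (position, code) pairs whose code is below the assumed threshold."""
    threshold = n_kmers / compression
    return [(pos, code) for pos, code in enumerate(codes) if code < threshold]


k = 2
n_kmers = 4 ** k  # number of possible 2-mers over "ACGT"

# Two similar sequences that differ at a single position
selected_a = mincode_select(kmer_codes("ACGTAAGC", k), n_kmers, compression=4)
selected_b = mincode_select(kmer_codes("ACGTTAGC", k), n_kmers, compression=4)

# The k-mer codes that were selected from both sequences
shared = sorted({code for _, code in selected_a} & {code for _, code in selected_b})
print(selected_a, selected_b, shared)
```

Although the two sequences differ at one position, the k-mers AC and AG (codes 1 and 2) are selected from both, because selection depends only on the k-mer itself and not on its neighbors.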

Parameters
kmer_alphabet: KmerAlphabet

The k-mer alphabet that defines the k-mer size and the type of sequence this MincodeSelector can be applied on.

compression: float

Defines the compression factor, i.e. approximately 1/compression of the k-mers in a sequence will be selected.

permutation: Permutation, optional

If set, the k-mer order is permuted, i.e. the k-mers are selected based on the ordering of the sort keys from Permutation.permute(). By default, the standard order of the KmerAlphabet is used. This standard order is often the lexicographical order.

References

[1] R. Edgar, “Syncmers are more sensitive than minimizers for selecting conserved k‑mers in biological sequences,” PeerJ, vol. 9, p. e10805, February 2021. doi: 10.7717/peerj.10805

Examples

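>>> import numpy as np
>>> from biotite.sequence import NucleotideSequence
>>> from biotite.sequence.align import KmerAlphabet, MincodeSelector, RandomPermutation
>>> kmer_alph = KmerAlphabet(NucleotideSequence.alphabet_unamb, k=2)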
>>> kmers = np.arange(len(kmer_alph))
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in kmers])
['AA', 'AC', 'AG', 'AT', 'CA', 'CC', 'CG', 'CT', 'GA', 'GC', 'GG', 'GT', 'TA', 'TC', 'TG', 'TT']
>>> # Select 1/4 of *k-mers* based on lexicographical k-mer order
>>> selector = MincodeSelector(kmer_alph, 4)
>>> subset_pos, kmers_subset = selector.select_from_kmers(kmers)
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in kmers_subset])
['AA', 'AC', 'AG', 'AT']
>>> # Select 1/4 based on randomized k-mer order
>>> selector = MincodeSelector(kmer_alph, 4, permutation=RandomPermutation())
>>> subset_pos, kmers_subset = selector.select_from_kmers(kmers)
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in kmers_subset])
['AG', 'CT', 'GA', 'TC']
Attributes
kmer_alphabet: KmerAlphabet

The k-mer alphabet.

compression: float

The compression factor.

threshold: float

The threshold is calculated from the compression factor and the range of (permuted) k-mer values. All k-mers that are smaller than this value are selected.

permutation: Permutation

The permutation.
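
How a threshold could follow from the compression factor and the range of sort keys is sketched below. Both the formula in threshold_for() and the affine toy_permute() function are assumptions made for illustration; they are not taken from Biotite's source, and an actual Permutation may order k-mers differently.

```python
def threshold_for(min_key, max_key, compression):
    """Cut-off below which about 1/compression of all keys fall (assumed formula)."""
    key_range = max_key - min_key + 1
    return min_key + key_range / compression


def toy_permute(code, n_kmers, multiplier=5, increment=3):
    """Hypothetical reordering: an affine map modulo n_kmers.

    The map is bijective only if multiplier and n_kmers are coprime.
    """
    return (multiplier * code + increment) % n_kmers


n_kmers = 16  # all 2-mers over a 4-letter alphabet
compression = 4
threshold = threshold_for(0, n_kmers - 1, compression)

# Select the k-mer codes whose permuted sort key is below the threshold
selected = [
    code for code in range(n_kmers)
    if toy_permute(code, n_kmers) < threshold
]
print(threshold, selected)
```

With these toy values, four of the sixteen k-mer codes pass the threshold, and they are spread across the alphabet instead of forming the lexicographically smallest block.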

select(sequence, alphabet_check=True)

Obtain all overlapping k-mers from a sequence and select the Mincode k-mers from them.

Parameters
sequence: Sequence

The sequence to find the Mincode k-mers in. Must be compatible with the given kmer_alphabet.

alphabet_check: bool, optional

If set to False, the compatibility between the alphabet of the sequence and the alphabet of the MincodeSelector is not checked, which improves performance.

Returns
mincode_indices: ndarray, dtype=np.uint32

The sequence indices where the Mincode k-mers start.

mincode: ndarray, dtype=np.int64

The corresponding Mincode k-mer codes.

select_from_kmers(kmers)

Select Mincode k-mers.

The given k-mers are not required to overlap.

Parameters
kmers: ndarray, dtype=np.int64

The k-mer codes to select the Mincode k-mers from.

Returns
mincode_indices: ndarray, dtype=np.uint32

The indices in the input kmers array where the Mincode k-mers are located.

mincode: ndarray, dtype=np.int64

The corresponding Mincode k-mer codes.