biotite.sequence.align.MincodeSelector

class biotite.sequence.align.MincodeSelector(self, kmer_alphabet, compression, permutation=None)

Bases: object

Selects the \(1/\text{compression}\) smallest k-mers from a KmerAlphabet. [1]

‘Small’ refers to the lexicographical order, or to a custom order if permutation is given. The Mincode approach reduces the number of k-mers from a sequence by approximately the factor compression, while still ensuring that a common set of k-mers is selected from similar sequences.
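The selection rule described above can be illustrated with a minimal pure-Python sketch. It is not Biotite's implementation: the helper name `mincode_select` and the threshold formula `n_kmers / compression` are assumptions chosen to match the description for the case without a permutation.

```python
def mincode_select(kmers, n_kmers, compression):
    """
    Keep the k-mer codes that fall into the lowest 1/compression
    of the code range [0, n_kmers).

    Hypothetical helper for illustration, not part of Biotite.
    """
    # Assumed threshold: the size of the code range divided by the compression
    threshold = n_kmers / compression
    positions = [i for i, kmer in enumerate(kmers) if kmer < threshold]
    selected = [kmers[i] for i in positions]
    return positions, selected


# All 16 codes of a 2-mer alphabet over 4 symbols, compression factor 4
positions, selected = mincode_select(list(range(16)), n_kmers=16, compression=4)
print(positions)  # [0, 1, 2, 3]
print(selected)   # [0, 1, 2, 3]
```

Because the decision for each k-mer depends only on its own code, a given k-mer is either selected in every sequence that contains it or in none, which is why similar sequences share selected k-mers.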

Parameters

kmer_alphabet: KmerAlphabet: The k-mer alphabet that defines the k-mer size and the type of sequence this MincodeSelector can be applied to.
compression: float: The compression factor. Approximately 1/compression of the k-mers of a sequence are selected, e.g. compression=4 selects about one quarter of them.
permutation: Permutation, optional: If set, the k-mer order is permuted, i.e. the k-mers are selected based on the ordering of the sort keys from Permutation.permute(). By default, the standard order of the KmerAlphabet is used. This standard order is often the lexicographical order.

References

1: R. Edgar, “Syncmers are more sensitive than minimizers for selecting conserved k‑mers in biological sequences,” PeerJ, vol. 9, pp. e10805, February 2021. doi: 10.7717/peerj.10805

Examples

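>>> import numpy as np
>>> from biotite.sequence import NucleotideSequence
>>> from biotite.sequence.align import KmerAlphabet, MincodeSelector, RandomPermutation
>>> kmer_alph = KmerAlphabet(NucleotideSequence.alphabet_unamb, k=2)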
>>> kmers = np.arange(len(kmer_alph))
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in kmers])
['AA', 'AC', 'AG', 'AT', 'CA', 'CC', 'CG', 'CT', 'GA', 'GC', 'GG', 'GT', 'TA', 'TC', 'TG', 'TT']
>>> # Select 1/4 of *k-mers* based on lexicographical k-mer order
>>> selector = MincodeSelector(kmer_alph, 4)
>>> subset_pos, kmers_subset = selector.select_from_kmers(kmers)
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in kmers_subset])
['AA', 'AC', 'AG', 'AT']
>>> # Select 1/4 based on randomized k-mer order
>>> selector = MincodeSelector(kmer_alph, 4, permutation=RandomPermutation())
>>> subset_pos, kmers_subset = selector.select_from_kmers(kmers)
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in kmers_subset])
['AG', 'CT', 'GA', 'TC']

Attributes

kmer_alphabet: KmerAlphabet: The k-mer alphabet.
compression: float: The compression factor.
threshold: float: The threshold calculated from the compression factor and the range of (permuted) k-mer values. All k-mers whose (permuted) value is smaller than this threshold are selected.
permutation: Permutation: The permutation.
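How a threshold could follow from the compression factor and the range of permuted values is sketched below. The affine permutation, the helper names and the formula `min + (max - min + 1) / compression` are assumptions for illustration only; the documentation states merely that the threshold is derived from these two quantities.

```python
def affine_permute(kmer, n_kmers, a=7, c=3):
    """
    Hypothetical permutation for illustration: a bijection on
    [0, n_kmers) as long as `a` and `n_kmers` are coprime.
    """
    return (a * kmer + c) % n_kmers


def threshold_from_range(min_value, max_value, compression):
    """Assumed threshold: lowest 1/compression of the closed range [min, max]."""
    return min_value + (max_value - min_value + 1) / compression


n_kmers = 16
threshold = threshold_from_range(0, n_kmers - 1, compression=4)
# Select the k-mers whose *permuted* value lies below the threshold
selected = [k for k in range(n_kmers) if affine_permute(k, n_kmers) < threshold]
print(threshold)  # 4.0
print(selected)   # [0, 2, 9, 11]
```

The number of selected k-mers is the same as in the unpermuted case, but they are no longer the lexicographically smallest ones.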

select(sequence, alphabet_check=True)

Obtain all overlapping k-mers from a sequence and select the Mincode k-mers from them.

Parameters

sequence: Sequence: The sequence to find the Mincode k-mers in. Must be compatible with the given kmer_alphabet.
alphabet_check: bool, optional: If set to False, the compatibility between the alphabet of the sequence and the alphabet of the MincodeSelector is not checked, which improves performance.

Returns

mincode_indices: ndarray, dtype=np.uint32: The sequence indices where the Mincode k-mers start.
mincode: ndarray, dtype=np.int64: The corresponding Mincode k-mer codes.
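The two steps performed by select(), decomposing the sequence into overlapping k-mers and then filtering them, can be sketched in pure Python. The function `select_from_sequence` is a hypothetical stand-in, not Biotite's code; it assumes the lexicographical k-mer order and the threshold `len(alphabet)**k / compression`.

```python
def select_from_sequence(sequence, alphabet, k, compression):
    """
    Enumerate all overlapping k-mers of `sequence`, encode each as an
    integer code and keep those below the assumed threshold.

    Hypothetical stand-in for illustration, not Biotite's code.
    """
    base = len(alphabet)
    symbol_code = {symbol: i for i, symbol in enumerate(alphabet)}
    threshold = base**k / compression
    indices, codes = [], []
    for start in range(len(sequence) - k + 1):
        code = 0
        for symbol in sequence[start : start + k]:
            # Positional encoding: the first symbol is the most significant digit
            code = code * base + symbol_code[symbol]
        if code < threshold:
            indices.append(start)
            codes.append(code)
    return indices, codes


indices, codes = select_from_sequence("TACAAGT", "ACGT", k=2, compression=4)
print(indices)  # [1, 3, 4]
print(codes)    # [1, 0, 2]
```

Here the 2-mers AC, AA and AG, starting at positions 1, 3 and 4, are the only ones whose codes fall below the threshold of 4.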

select_from_kmers(kmers)

Select Mincode k-mers.

The given k-mers are not required to overlap.

Parameters

kmers: ndarray, dtype=np.int64: The k-mer codes to select the Mincode k-mers from.

Returns

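mincode_indices: ndarray, dtype=np.uint32: The indices of the selected Mincode k-mers in kmers. If kmers contains the overlapping k-mers of a sequence, these are the sequence indices where the Mincode k-mers start.
mincode: ndarray, dtype=np.int64: The corresponding Mincode k-mer codes.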