MincodeSelector

class biotite.sequence.align.MincodeSelector(kmer_alphabet, compression, permutation=None)

Bases: object

Selects the smallest \(1/\text{compression}\) fraction of the k-mers from a KmerAlphabet. [1]

'Small' refers to the lexicographical order, or alternatively to a custom order if permutation is given. The Mincode approach tries to reduce the number of k-mers from a sequence by the factor compression, while still ensuring that a common set of k-mers is selected from similar sequences.
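The idea can be illustrated with a minimal pure-Python sketch that does not use Biotite. It assumes 2-mers over the alphabet ACGT, lexicographical k-mer codes and a threshold equal to the number of possible k-mers divided by compression; the helper mincode_kmers() is hypothetical and not part of the library.

```python
from itertools import product

ALPHABET = "ACGT"
K = 2
# Lexicographical k-mer codes: 'AA' -> 0, 'AC' -> 1, ..., 'TT' -> 15
KMER_CODE = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=K))}


def mincode_kmers(sequence, compression):
    """Return (start position, k-mer) pairs whose code is below the threshold."""
    # Assumed threshold: number of possible k-mers divided by the compression
    threshold = len(KMER_CODE) / compression
    selected = []
    for i in range(len(sequence) - K + 1):
        kmer = sequence[i : i + K]
        if KMER_CODE[kmer] < threshold:
            selected.append((i, kmer))
    return selected


seq_a = "TACGAATG"
seq_b = "TACGCATG"  # differs from seq_a in a single position
selected_a = mincode_kmers(seq_a, compression=4)
selected_b = mincode_kmers(seq_b, compression=4)
# k-mers that are selected from both sequences
shared = {kmer for _, kmer in selected_a} & {kmer for _, kmer in selected_b}
print(selected_a)      # [(1, 'AC'), (4, 'AA'), (5, 'AT')]
print(selected_b)      # [(1, 'AC'), (5, 'AT')]
print(sorted(shared))  # ['AC', 'AT']
```

Because the selection depends only on the k-mer code and not on its neighbors, a k-mer that occurs in both sequences is either selected in both or in neither.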

Parameters:
kmer_alphabet : KmerAlphabet

The k-mer alphabet that defines the k-mer size and the type of sequence this MincodeSelector can be applied on.

compression : float

Defines the compression factor, i.e. approximately 1/compression of the k-mers in a sequence will be sampled.

permutation : Permutation, optional

If set, the k-mer order is permuted, i.e. the k-mers are selected based on the ordering of the sort keys from Permutation.permute(). By default, the standard order of the KmerAlphabet is used. This standard order is often the lexicographical order.
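The effect of a custom order can be sketched without Biotite as follows. The function toy_permute() is a hypothetical stand-in for the sort keys of a permutation (a simple bijective map, not the scheme used by any Permutation class), and the threshold formula is an assumption.

```python
from itertools import product

# All 2-mers over ACGT; the list index is the lexicographical k-mer code
KMERS = ["".join(p) for p in product("ACGT", repeat=2)]


def toy_permute(code):
    # Hypothetical sort key: a bijective map of the codes onto 0..15
    return (5 * code + 3) % len(KMERS)


compression = 4
# The sort keys span 0..15, so the assumed threshold is 16 / 4 = 4.0
threshold = len(KMERS) / compression
# A k-mer is selected if its sort key, not its code, is below the threshold
selected = [kmer for code, kmer in enumerate(KMERS) if toy_permute(code) < threshold]
print(selected)  # ['AA', 'AT', 'CG', 'GC']
```

Still a quarter of all k-mers is selected, but the selected k-mers are no longer the lexicographically smallest ones.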

Attributes:
kmer_alphabet : KmerAlphabet

The k-mer alphabet.

compression : float

The compression factor.

threshold : float

The threshold is calculated from the compression factor and the range of (permuted) k-mer values. All k-mers that are smaller than this value are selected.

permutation : Permutation

The permutation.
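One plausible way to derive such a threshold is sketched below. The formula is an assumption for illustration and is not taken from the Biotite source; mincode_threshold() and its parameters are hypothetical names.

```python
def mincode_threshold(key_min, key_max, compression):
    """Assumed formula: upper bound of the lowest 1/compression part of the key range."""
    key_range = key_max - key_min + 1
    return key_min + key_range / compression


# Unpermuted 2-mers over ACGT: the codes span 0..15
print(mincode_threshold(0, 15, compression=4))   # 4.0
# Hypothetical permuted keys spanning -8..7
print(mincode_threshold(-8, 7, compression=4))   # -4.0
# Number of keys in -8..7 that fall below the threshold
n_selected = sum(1 for key in range(-8, 8) if key < mincode_threshold(-8, 7, 4))
print(n_selected)                                # 4
```

With this formula, a quarter of the possible key values lies below the threshold for compression=4, regardless of where the key range starts.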

References

Examples

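>>> import numpy as np
>>> from biotite.sequence import NucleotideSequence
>>> from biotite.sequence.align import KmerAlphabet, MincodeSelector, RandomPermutation
>>> kmer_alph = KmerAlphabet(NucleotideSequence.alphabet_unamb, k=2)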
>>> kmers = np.arange(len(kmer_alph))
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in kmers])
['AA', 'AC', 'AG', 'AT', 'CA', 'CC', 'CG', 'CT', 'GA', 'GC', 'GG', 'GT', 'TA', 'TC', 'TG', 'TT']
>>> # Select 1/4 of *k-mers* based on lexicographical k-mer order
>>> selector = MincodeSelector(kmer_alph, 4)
>>> subset_pos, kmers_subset = selector.select_from_kmers(kmers)
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in kmers_subset])
['AA', 'AC', 'AG', 'AT']
>>> # Select 1/4 based on randomized k-mer order
>>> selector = MincodeSelector(kmer_alph, 4, permutation=RandomPermutation())
>>> subset_pos, kmers_subset = selector.select_from_kmers(kmers)
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in kmers_subset])
['AG', 'CT', 'GA', 'TC']

select(sequence, alphabet_check=True)

Obtain all overlapping k-mers from a sequence and select the Mincode k-mers from them.

Parameters:
sequence : Sequence

The sequence to find the Mincode k-mers in. Must be compatible with the given kmer_alphabet.

alphabet_check : bool, optional

If set to False, the compatibility between the alphabet of the sequence and the alphabet of the MincodeSelector is not checked, which improves performance.

Returns:
mincode_indices : ndarray, dtype=np.uint32

The sequence indices where the Mincode k-mers start.

mincode : ndarray, dtype=np.int64

The corresponding Mincode k-mer codes.

select_from_kmers(kmers)

Select Mincode k-mers.

The given k-mers are not required to overlap.

Parameters:
kmers : ndarray, dtype=np.int64

The k-mer codes to select the Mincode k-mers from.

Returns:
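mincode_indices : ndarray, dtype=np.uint32

The indices into kmers where the Mincode k-mers are located. If kmers holds the overlapping k-mers of a sequence, these are the sequence indices where the Mincode k-mers start.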

mincode : ndarray, dtype=np.int64

The corresponding Mincode k-mer codes.
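The relation between the two return values can be illustrated with a small pure-Python stand-in. It is a sketch under the assumption of lexicographical order and a threshold of n_kmers / compression; select_from_codes() is a hypothetical helper and not the library method.

```python
def select_from_codes(kmer_codes, n_kmers, compression):
    """Toy stand-in: return positions and codes of the k-mers below the threshold."""
    # Assumed threshold for the unpermuted (lexicographical) order
    threshold = n_kmers / compression
    positions = [i for i, code in enumerate(kmer_codes) if code < threshold]
    return positions, [kmer_codes[i] for i in positions]


# Arbitrary 2-mer codes (16 possible k-mers); they need not stem from
# overlapping positions of a single sequence
codes = [12, 1, 7, 0, 15, 3, 3, 9]
positions, selected = select_from_codes(codes, n_kmers=16, compression=4)
print(positions)  # [1, 3, 5, 6]
print(selected)   # [1, 0, 3, 3]
```

The two results are parallel: selected[j] is the code found at codes[positions[j]], and repeated k-mers in the input are reported once per occurrence.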