`MincodeSelector`

class biotite.sequence.align.MincodeSelector(self, kmer_alphabet, compression, permutation=None)

Bases: object

Selects the smallest \(1/\text{compression}\) fraction of k-mers from a KmerAlphabet. [1]

‘Small’ refers to the lexicographical order, or to a custom order if permutation is given. The Mincode approach reduces the number of k-mers taken from a sequence by the factor compression, while still ensuring that a common set of k-mers is selected from similar sequences.
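The property that similar sequences yield a common set of selected k-mers can be illustrated with a minimal pure-Python sketch. It does not use the library; the helper `selected_kmers` and the two example sequences are hypothetical, and the selection rule (lexicographical rank below the number of possible k-mers divided by compression) is an assumption made for illustration.

```python
from itertools import product


def selected_kmers(sequence, k, compression, alphabet="ACGT"):
    """Return the set of k-mers in `sequence` that rank among the
    smallest 1/compression of all possible k-mers (illustrative only)."""
    # Enumerate all k-mers in lexicographical order and remember their rank
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    rank = {kmer: i for i, kmer in enumerate(all_kmers)}
    threshold = len(all_kmers) / compression
    # All overlapping k-mers of the sequence
    kmers = (sequence[i : i + k] for i in range(len(sequence) - k + 1))
    return {kmer for kmer in kmers if rank[kmer] < threshold}


seq_a = "TACGAATGACCA"
seq_b = "TACGAATGTCCA"  # differs from seq_a by one substitution at index 8
shared = selected_kmers(seq_a, 2, 4) & selected_kmers(seq_b, 2, 4)
print(sorted(shared))  # ['AA', 'AC', 'AT']
```

Because the selection depends only on the k-mer itself and not on its neighbors, every k-mer that both sequences share and that passes the cutoff is selected in both.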

Parameters:

kmer_alphabet (KmerAlphabet): The k-mer alphabet that defines the k-mer size and the type of sequence this MincodeSelector can be applied to.
compression (float): Defines the compression factor, i.e. approximately 1/compression of the k-mers in a sequence will be sampled.
permutation (Permutation): If set, the k-mer order is permuted, i.e. the k-mers are selected based on the ordering of the sort keys from Permutation.permute(). By default, the standard order of the KmerAlphabet is used. This standard order is often the lexicographical order.
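As a rough illustration of selecting by sort key rather than by raw k-mer code, the sketch below uses a toy affine map in plain Python. The function `toy_permute` is hypothetical and is not the library's Permutation or RandomPermutation; it only stands in for "some bijective reordering of the k-mer codes".

```python
def toy_permute(code, n_kmers=16):
    """Hypothetical sort key: an affine map that is a bijection modulo
    n_kmers, because 5 and 16 share no common factor."""
    return (5 * code + 3) % n_kmers


n_kmers = 16  # number of 2-mers over a 4-letter alphabet
compression = 4
threshold = n_kmers / compression  # assumed cutoff for this sketch

# Select by sort key instead of by the raw k-mer code
selected = [code for code in range(n_kmers) if toy_permute(code) < threshold]
print(selected)  # [0, 3, 6, 9]
```

The selected codes are still a quarter of all 2-mers, but they are spread over the alphabet instead of being the four codes that start with the first symbol.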

Attributes:

kmer_alphabet (KmerAlphabet): The k-mer alphabet.
compression (float): The compression factor.
threshold (float): The threshold calculated from the compression factor and the range of (permuted) k-mer values. All k-mers whose (permuted) value is smaller than this threshold are selected.
permutation (Permutation): The permutation.
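How such a threshold could follow from the compression factor is sketched below. The formula is an assumption chosen to be consistent with the description above (a 1/compression slice of the key range, shifted by the smallest key); the function `mincode_threshold` is hypothetical and not part of the library.

```python
def mincode_threshold(n_keys, compression, offset=0):
    """Assumed threshold: a 1/compression slice of the key range,
    shifted by the smallest possible key (`offset`)."""
    return n_keys / compression + offset


n_kmers = 4**2  # 16 possible 2-mers over a 4-letter alphabet
for compression in (4, 3):
    threshold = mincode_threshold(n_kmers, compression)
    # Count the k-mer codes that fall below the threshold
    n_selected = sum(1 for code in range(n_kmers) if code < threshold)
    print(compression, n_selected)
```

With compression 4 the cutoff selects exactly 4 of 16 codes; with compression 3 it selects 6 of 16, which is why the selected fraction is only approximately 1/compression.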

References

Examples

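>>> import numpy as np
>>> from biotite.sequence import NucleotideSequence
>>> from biotite.sequence.align import KmerAlphabet, MincodeSelector, RandomPermutation
>>> kmer_alph = KmerAlphabet(NucleotideSequence.alphabet_unamb, k=2)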
>>> kmers = np.arange(len(kmer_alph))
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in kmers])
['AA', 'AC', 'AG', 'AT', 'CA', 'CC', 'CG', 'CT', 'GA', 'GC', 'GG', 'GT', 'TA', 'TC', 'TG', 'TT']
>>> # Select 1/4 of *k-mers* based on lexicographical k-mer order
>>> selector = MincodeSelector(kmer_alph, 4)
>>> subset_pos, kmers_subset = selector.select_from_kmers(kmers)
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in kmers_subset])
['AA', 'AC', 'AG', 'AT']
>>> # Select 1/4 based on randomized k-mer order
>>> selector = MincodeSelector(kmer_alph, 4, permutation=RandomPermutation())
>>> subset_pos, kmers_subset = selector.select_from_kmers(kmers)
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in kmers_subset])
['AG', 'CT', 'GA', 'TC']

select(sequence, alphabet_check=True)

Obtain all overlapping k-mers from a sequence and select the Mincode k-mers from them.

Parameters:

sequence (Sequence): The sequence to find the Mincode k-mers in. Must be compatible with the given kmer_alphabet.
alphabet_check (bool, optional): If set to False, the compatibility between the alphabet of the sequence and the alphabet of the MincodeSelector is not checked, which saves some time.

Returns:

mincode_indices (ndarray, dtype=np.uint32): The sequence indices where the Mincode k-mers start.
mincode (ndarray, dtype=np.int64): The corresponding Mincode k-mer codes.
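The relationship between the two return values can be sketched without the library: decompose a sequence into overlapping k-mers, encode each one, and keep the start positions and codes of those below the cutoff. The helpers `kmer_code` and `mincode_select` are hypothetical, and the base-4 encoding and the cutoff are assumptions made for this illustration.

```python
ALPHABET = "ACGT"


def kmer_code(kmer):
    """Encode a k-mer as a base-4 integer, so that the integer order
    matches the lexicographical order of the k-mers."""
    code = 0
    for symbol in kmer:
        code = code * len(ALPHABET) + ALPHABET.index(symbol)
    return code


def mincode_select(sequence, k, compression):
    """Return (start positions, codes) of the overlapping k-mers whose
    code is below the assumed cutoff."""
    threshold = len(ALPHABET) ** k / compression
    positions, codes = [], []
    for i in range(len(sequence) - k + 1):
        code = kmer_code(sequence[i : i + k])
        if code < threshold:
            positions.append(i)
            codes.append(code)
    return positions, codes


positions, codes = mincode_select("TACGAATG", k=2, compression=4)
print(positions)  # [1, 4, 5]
print(codes)      # [1, 0, 3]
```

Here the 2-mers AC, AA and AT start at sequence positions 1, 4 and 5, so the two returned lists are parallel: each position is paired with the code of the k-mer that starts there.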

select_from_kmers(kmers)

Select Mincode k-mers.

The given k-mers are not required to overlap.

Parameters:

kmers (ndarray, dtype=np.int64): The k-mer codes to select the Mincode k-mers from.

Returns:

mincode_indices (ndarray, dtype=np.uint32): The indices in kmers at which the selected Mincode k-mers appear.
mincode (ndarray, dtype=np.int64): The corresponding Mincode k-mer codes.
