`CachedSyncmerSelector`

class biotite.sequence.align.CachedSyncmerSelector(alphabet, k, s, permutation=None, offset=(0,))

Bases: SyncmerSelector

Selects the syncmers in sequences.

Fulfills the same purpose as SyncmerSelector, but precomputes at initialization, for each possible k-mer, whether it is a syncmer. Hence, syncmer selection is faster at the cost of a longer initialization time.
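
The precomputation idea can be illustrated in plain Python. The following is only a minimal sketch and not biotite's implementation: the toy alphabet `SYMBOLS`, the helper names, the base-n k-mer encoding and the leftmost tie-breaking are assumptions made for illustration.

```python
from itertools import product

SYMBOLS = "ACGT"  # toy alphabet, assumed for illustration


def kmer_code(kmer):
    """Encode a k-mer as a base-n integer (first symbol most significant)."""
    code = 0
    for symbol in kmer:
        code = code * len(SYMBOLS) + SYMBOLS.index(symbol)
    return code


def is_open_syncmer(kmer, s):
    """True if the lexicographically smallest s-mer starts at position 0."""
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    # list.index() returns the leftmost occurrence of the minimum
    return smers.index(min(smers)) == 0


def build_table(k, s):
    """Evaluate the syncmer criterion once for each of the n^k possible k-mers."""
    table = [False] * len(SYMBOLS) ** k
    for symbols in product(SYMBOLS, repeat=k):
        kmer = "".join(symbols)
        table[kmer_code(kmer)] = is_open_syncmer(kmer, s)
    return table


def select_cached(sequence, k, table):
    """Select syncmer start positions with one table lookup per k-mer."""
    return [
        i for i in range(len(sequence) - k + 1)
        if table[kmer_code(sequence[i:i + k])]
    ]


# The expensive part happens once...
table = build_table(k=5, s=2)
# ...afterwards each k-mer costs only a lookup
print(select_cached("GGCAAGTGACA", k=5, table=table))
```

In this sketch the table is built once and can be reused for any number of sequences, which is where the trade-off between initialization time and selection time comes from.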

Parameters:

alphabet: Alphabet: The base alphabet the k-mers and s-mers are created from. Defines the type of sequence this CachedSyncmerSelector can be applied on.
k, s: int: The length of the k-mers and s-mers, respectively.
permutation: Permutation: If set, the s-mer order is permuted, i.e. the minimum s-mer is chosen based on the ordering of the sort keys from Permutation.permute(). This Permutation must be compatible with s (not with k). By default, the standard order of the KmerAlphabet is used. This standard order is often the lexicographical order, which is known to yield suboptimal density in many cases [1].
offset: array-like of int: If the minimum s-mer in a k-mer is at one of the given offset positions, that k-mer is a syncmer. Negative values indicate the position from the end of the k-mer. By default, the minimum s-mer needs to be at the start of the k-mer, which is termed an open syncmer.
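
The role of k, s and offset can be sketched without biotite. The code below is an illustration under stated assumptions, not the library's algorithm: sequences are plain strings, s-mers are compared lexicographically (standing in for the default order), ties go to the leftmost s-mer, and the function names are hypothetical.

```python
def is_syncmer(kmer, s, offset=(0,)):
    """Return True if the minimum s-mer of `kmer` starts at an allowed offset."""
    n_smers = len(kmer) - s + 1
    smers = [kmer[i:i + s] for i in range(n_smers)]
    # Plain string comparison stands in for the default s-mer order;
    # min() keeps the leftmost position in case of ties (an assumption)
    min_pos = min(range(n_smers), key=lambda i: smers[i])
    # Negative offsets count from the last possible s-mer position
    allowed = {o % n_smers for o in offset}
    return min_pos in allowed


def select_syncmers(sequence, k, s, offset=(0,)):
    """Collect start positions and k-mers of all syncmers in `sequence`."""
    positions, kmers = [], []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if is_syncmer(kmer, s, offset):
            positions.append(i)
            kmers.append(kmer)
    return positions, kmers


# Closed syncmers: minimum s-mer at the first or the last position
positions, kmers = select_syncmers("GGCAAGTGACA", k=5, s=2, offset=(0, -1))
print(positions)  # [0, 3, 4, 5]
print(kmers)      # ['GGCAA', 'AAGTG', 'AGTGA', 'GTGAC']
```

With the default `offset=(0,)` the same sketch keeps only the k-mers whose minimum s-mer is the very first one, i.e. the open syncmers.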

Attributes:

alphabet: Alphabet: The base alphabet.
kmer_alphabet, smer_alphabet: KmerAlphabet: The KmerAlphabet for k and s, respectively.
permutation: Permutation: The permutation.

See also

SyncmerSelector: A standard variant for syncmer selection.

Notes

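Both the initialization time and the memory requirements are proportional to the size of the kmer_alphabet, i.e. \(n^k\), where \(n\) is the number of symbols in the base alphabet. Hence, it is advised to use this class only when the base alphabet, and therefore the resulting k-mer alphabet, is rather small.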
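
To get a feeling for how quickly \(n^k\) grows, the number of k-mers to precompute can be estimated up front. This is a rough sketch: `table_size` is a hypothetical helper, and the value n=20 for proteins assumes only the canonical amino acids.

```python
def table_size(n_symbols, k):
    """Number of possible k-mers, i.e. entries that must be precomputed."""
    return n_symbols ** k


# Alphabet sizes are assumptions for illustration
for name, n in [("nucleotide, n=4", 4), ("protein, n=20", 20)]:
    for k in (5, 8, 12):
        print(f"{name}, k={k}: {table_size(n, k):,} k-mers")
```

For the same k, the larger alphabet needs several orders of magnitude more entries, which is why the uncached SyncmerSelector is the safer choice in that case.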

References

Examples

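>>> from biotite.sequence import NucleotideSequence
>>> from biotite.sequence.align import CachedSyncmerSelector, KmerAlphabet
>>> sequence = NucleotideSequence("GGCAAGTGACA")
>>> kmer_alph = KmerAlphabet(sequence.alphabet, k=5)
>>> # The initialization can take quite a long time for large *k-mer* alphabets...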
>>> closed_syncmer_selector = CachedSyncmerSelector(
...     sequence.alphabet,
...     # The same k as in the KmerAlphabet
...     k=5,
...     s=2,
...     # The offset determines that closed syncmers will be selected
...     offset=(0, -1)
... )
>>> # ...but the actual syncmer identification is very fast
>>> syncmer_pos, syncmers = closed_syncmer_selector.select(sequence)
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in syncmers])
['GGCAA', 'AAGTG', 'AGTGA', 'GTGAC']

select(sequence, alphabet_check=True)

Obtain all overlapping k-mers from a sequence and select the syncmers from them.

Parameters:

sequence: Sequence: The sequence to find the syncmers in. Must be compatible with the given kmer_alphabet.
alphabet_check: bool, optional: If set to False, the compatibility between the alphabet of the sequence and the alphabet of the CachedSyncmerSelector is not checked, which gains additional performance.

Returns:

syncmer_indices: ndarray, dtype=np.uint32: The sequence indices where the syncmers start.
syncmers: ndarray, dtype=np.int64: The corresponding k-mer codes of the syncmers.

select_from_kmers(kmers)

Select syncmers for the given k-mers.

The k-mers are not required to overlap.

Parameters:

kmers: ndarray, dtype=np.int64: The k-mer codes to select the syncmers from.

Returns:

syncmer_indices: ndarray, dtype=np.uint32: The sequence indices where the syncmers start.
syncmers: ndarray, dtype=np.int64: The corresponding k-mer codes of the syncmers.
