biotite.sequence.align.CachedSyncmerSelector

class biotite.sequence.align.CachedSyncmerSelector(alphabet, k, s, permutation=None, offset=(0,))

Bases: SyncmerSelector

Selects the syncmers in sequences.

Fulfills the same purpose as SyncmerSelector, but precomputes at initialization, for each possible k-mer, whether it is a syncmer. Hence, syncmer selection is faster at the cost of a longer initialization time.

Parameters

alphabet (Alphabet): The base alphabet the k-mers and s-mers are created from. Defines the type of sequence this CachedSyncmerSelector can be applied on.
k, s (int): The length of the k-mers and s-mers, respectively.
permutation (Permutation): If set, the s-mer order is permuted, i.e. the minimum s-mer is chosen based on the ordering of the sort keys from Permutation.permute(). This Permutation must be compatible with s (not with k). By default, the standard order of the KmerAlphabet is used. This standard order is often the lexicographical order, which is known to yield suboptimal density in many cases [1].
offset (array-like of int): If the minimum s-mer in a k-mer is at one of the given offset positions, that k-mer is a syncmer. Negative values indicate the position from the end of the k-mer. By default, the minimum s-mer needs to be at the start of the k-mer, which is termed an open syncmer.
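The offset rule described above can be illustrated without Biotite. The following is a minimal pure-Python sketch, not the library's implementation: the helper name `is_syncmer` is hypothetical, s-mers are compared lexicographically (no permutation), and ties between equal s-mers are assumed to resolve to the leftmost one.

```python
def is_syncmer(kmer, s, offsets=(0,)):
    """Return True if the minimum s-mer of `kmer` sits at one of `offsets`.

    Hypothetical helper for illustration only (not part of Biotite).
    """
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    # Position of the smallest s-mer in lexicographic order;
    # ties resolve to the leftmost position (an assumption of this sketch)
    min_pos = min(range(len(smers)), key=lambda i: smers[i])
    # Negative offsets count from the last s-mer position
    allowed = {o % len(smers) for o in offsets}
    return min_pos in allowed


sequence = "GGCAAGTGACA"
k, s = 5, 2
kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Closed syncmers: the minimum s-mer is at the start or at the end
closed = [kmer for kmer in kmers if is_syncmer(kmer, s, offsets=(0, -1))]
print(closed)  # ['GGCAA', 'AAGTG', 'AGTGA', 'GTGAC']

# Open syncmers (the default): the minimum s-mer is at the start
opened = [kmer for kmer in kmers if is_syncmer(kmer, s)]
print(opened)  # ['AAGTG', 'AGTGA']
```

With `offsets=(0, -1)` the sketch selects the same four k-mers as the closed syncmer example in the Examples section.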

See also

SyncmerSelector: A standard variant for syncmer selection.

Notes

Both the initialization time and memory requirements are proportional to the size of the kmer_alphabet, i.e. \(n^k\), where \(n\) is the number of symbols in the base alphabet. Hence, it is advised to use this class only for rather small alphabets.
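To make the \(n^k\) scaling concrete, here is a minimal pure-Python sketch of the caching idea: a lookup table with one entry per possible k-mer is built once, and selection then reduces to a table lookup. This is an illustration, not Biotite's implementation. The names `build_syncmer_table` and `encode` are hypothetical, and the k-mer code is assumed to be a positional base-\(n\) number with the first symbol as the most significant digit.

```python
from itertools import product


def build_syncmer_table(symbols, k, s, offsets=(0,)):
    """Precompute, for every possible k-mer, whether it is a syncmer.

    Hypothetical helper; the table has len(symbols) ** k entries.
    """
    n_positions = k - s + 1
    allowed = {o % n_positions for o in offsets}
    table = []
    # product() enumerates k-mers in lexicographic order, so the list
    # index doubles as the assumed k-mer code
    for kmer in product(symbols, repeat=k):
        smers = [kmer[i:i + s] for i in range(n_positions)]
        min_pos = min(range(n_positions), key=lambda i: smers[i])
        table.append(min_pos in allowed)
    return table


def encode(kmer, symbols):
    """Assumed positional encoding: first symbol is most significant."""
    code = 0
    for symbol in kmer:
        code = code * len(symbols) + symbols.index(symbol)
    return code


symbols = "ACGT"
k, s = 5, 2
table = build_syncmer_table(symbols, k, s, offsets=(0, -1))
print(len(table))  # 1024, i.e. 4 ** 5

sequence = "GGCAAGTGACA"
kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
# Selection is now a single lookup per k-mer
syncmer_pos = [i for i, kmer in enumerate(kmers) if table[encode(kmer, symbols)]]
print(syncmer_pos)  # [0, 3, 4, 5]
```

For the four-symbol alphabet and k=5 the table holds only 1024 entries, whereas a 20-symbol alphabet with k=8 would already require 20 ** 8 = 25,600,000,000 entries, which is why small alphabets are recommended.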

References

[1] M. Roberts, W. Hayes, B. R. Hunt, S. M. Mount, J. A. Yorke, “Reducing storage requirements for biological sequence comparison,” Bioinformatics, vol. 20, pp. 3363–3369, December 2004. doi: 10.1093/bioinformatics/bth408

Examples

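>>> from biotite.sequence import NucleotideSequence
>>> from biotite.sequence.align import CachedSyncmerSelector, KmerAlphabet
>>> sequence = NucleotideSequence("GGCAAGTGACA")
>>> kmer_alph = KmerAlphabet(sequence.alphabet, k=5)
>>> # The initialization can take quite a long time for large *k-mer* alphabets...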
>>> closed_syncmer_selector = CachedSyncmerSelector(
...     sequence.alphabet,
...     # The same k as in the KmerAlphabet
...     k=5,
...     s=2,
...     # The offset determines that closed syncmers will be selected
...     offset=(0, -1)
... )
>>> # ...but the actual syncmer identification is very fast
>>> syncmer_pos, syncmers = closed_syncmer_selector.select(sequence)
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in syncmers])
['GGCAA', 'AAGTG', 'AGTGA', 'GTGAC']

Attributes

alphabet (Alphabet): The base alphabet.
kmer_alphabet, smer_alphabet (KmerAlphabet): The KmerAlphabet for k and s, respectively.
permutation (Permutation): The permutation.

select(sequence, alphabet_check=True)

Obtain all overlapping k-mers from a sequence and select the syncmers from them.

Parameters

sequence (Sequence): The sequence to find the syncmers in. Must be compatible with the given kmer_alphabet.
alphabet_check (bool, optional): If set to false, the compatibility between the alphabet of the sequence and the alphabet of the CachedSyncmerSelector is not checked, which improves performance.

Returns

syncmer_indices (ndarray, dtype=np.uint32): The sequence indices where the syncmers start.
syncmers (ndarray, dtype=np.int64): The corresponding k-mer codes of the syncmers.

select_from_kmers(kmers)

Select syncmers for the given k-mers.

The k-mers are not required to overlap.

Parameters

kmers (ndarray, dtype=np.int64): The k-mer codes to select the syncmers from.

Returns

syncmer_indices (ndarray, dtype=np.uint32): The sequence indices where the syncmers start.
syncmers (ndarray, dtype=np.int64): The corresponding k-mer codes of the syncmers.