biotite.sequence.align.CachedSyncmerSelector

class biotite.sequence.align.CachedSyncmerSelector(alphabet, k, s, permutation=None, offset=(0,))

Bases: SyncmerSelector

Selects the syncmers in sequences.

Fulfills the same purpose as SyncmerSelector, but precomputes at initialization whether each possible k-mer is a syncmer. Hence, syncmer selection is faster, at the cost of a longer initialization time.
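The trade-off described above can be illustrated with a minimal pure-Python sketch. It is not Biotite's implementation: the names `ALPHABET`, `cache`, `select_cached` and `select_uncached` are hypothetical, s-mers are compared in plain lexicographic order, and ties are assumed to resolve to the leftmost s-mer.

```python
from itertools import product

ALPHABET = "ACGT"  # hypothetical base alphabet with n = 4 symbols
K, S = 3, 2        # illustrative k-mer and s-mer lengths


def is_open_syncmer(kmer):
    """Return True if the minimum s-mer is the first s-mer of the k-mer."""
    smers = [kmer[i:i + S] for i in range(K - S + 1)]
    # list.index() returns the leftmost minimum (assumed tie-breaking rule)
    return smers.index(min(smers)) == 0


# "Initialization": evaluate each of the n^k possible k-mers exactly once
cache = {
    "".join(symbols): is_open_syncmer("".join(symbols))
    for symbols in product(ALPHABET, repeat=K)
}


def select_cached(sequence):
    """Select syncmer start positions with one table lookup per k-mer."""
    kmers = [sequence[i:i + K] for i in range(len(sequence) - K + 1)]
    return [i for i, kmer in enumerate(kmers) if cache[kmer]]


def select_uncached(sequence):
    """Select syncmer start positions by re-evaluating every k-mer."""
    kmers = [sequence[i:i + K] for i in range(len(sequence) - K + 1)]
    return [i for i, kmer in enumerate(kmers) if is_open_syncmer(kmer)]


# Both strategies agree; only the amount of work per k-mer differs
assert select_cached("GGCAAGTGACA") == select_uncached("GGCAAGTGACA")
```

The table holds one entry per possible k-mer, which is why the up-front cost grows with the size of the k-mer alphabet rather than with the length of the sequences processed later.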

Parameters
alphabet : Alphabet

The base alphabet the k-mers and s-mers are created from. Defines the type of sequence this CachedSyncmerSelector can be applied on.

k, s : int

The length of the k-mers and s-mers, respectively.

permutation : Permutation

If set, the s-mer order is permuted, i.e. the minimum s-mer is chosen based on the ordering of the sort keys from Permutation.permute(). This Permutation must be compatible with s (not with k). By default, the standard order of the KmerAlphabet is used. This standard order is often the lexicographical order, which is known to yield suboptimal density in many cases [1].

offset : array-like of int

If the minimum s-mer in a k-mer is at one of the given offset positions, that k-mer is a syncmer. Negative values indicate the position from the end of the k-mer. By default, the minimum s-mer must be at the start of the k-mer, which is termed an open syncmer.
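As a rough illustration of how the offset parameter decides whether a k-mer is a syncmer, the following pure-Python sketch re-implements the rule for plain strings. The function `is_syncmer` is hypothetical and not part of Biotite; it assumes lexicographic s-mer order and leftmost tie-breaking.

```python
def is_syncmer(kmer, s, offset=(0,)):
    """Check whether the minimum s-mer sits at one of the given offsets."""
    n_smers = len(kmer) - s + 1
    smers = [kmer[i:i + s] for i in range(n_smers)]
    # Position of the minimum s-mer; leftmost on ties (assumption)
    min_pos = smers.index(min(smers))
    # Negative offsets count from the last possible s-mer position
    allowed = {o if o >= 0 else n_smers + o for o in offset}
    return min_pos in allowed


sequence = "GGCAAGTGACA"
k, s = 5, 2
kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

# Default offset (0,): minimum s-mer at the start, i.e. open syncmers
open_syncmers = [kmer for kmer in kmers if is_syncmer(kmer, s)]
# Offset (0, -1): minimum s-mer at the start or the end, i.e. closed syncmers
closed_syncmers = [kmer for kmer in kmers if is_syncmer(kmer, s, offset=(0, -1))]

print(open_syncmers)    # ['AAGTG', 'AGTGA']
print(closed_syncmers)  # ['GGCAA', 'AAGTG', 'AGTGA', 'GTGAC']
```

For example, in 'GGCAA' the minimum 2-mer 'AA' is the last of the four 2-mers, so the k-mer is selected only when -1 is among the offsets.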

See also

SyncmerSelector

A standard variant for syncmer selection.

Notes

Both the initialization time and the memory requirements are proportional to the size of the kmer_alphabet, i.e. \(n^k\), where \(n\) is the number of symbols in the base alphabet. Hence, it is advised to use this class only for rather small alphabets.
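To get a feeling for this growth, the number of cached entries can be estimated with simple arithmetic. The helper `cache_entries` and the alphabet sizes below are illustrative assumptions, not values taken from Biotite; the memory per entry is deliberately left out, as it depends on the implementation.

```python
def cache_entries(n_symbols, k):
    """Number of possible k-mers, i.e. the size of the k-mer alphabet."""
    return n_symbols ** k


# Hypothetical alphabet sizes: 4 (nucleotide-like) and 20 (protein-like)
for n_symbols in (4, 20):
    for k in (5, 8, 12):
        print(f"n={n_symbols:2d}, k={k:2d}: {cache_entries(n_symbols, k):,} entries")
```

With four symbols and k=5 the table has 1,024 entries, whereas twenty symbols and k=8 already require 25,600,000,000 entries.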

References

[1] M. Roberts, W. Hayes, B. R. Hunt, S. M. Mount, J. A. Yorke, “Reducing storage requirements for biological sequence comparison,” Bioinformatics, vol. 20, pp. 3363–3369, December 2004. doi: 10.1093/bioinformatics/bth408

Examples

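>>> from biotite.sequence import NucleotideSequence
>>> from biotite.sequence.align import KmerAlphabet, CachedSyncmerSelector
>>> sequence = NucleotideSequence("GGCAAGTGACA")
>>> kmer_alph = KmerAlphabet(sequence.alphabet, k=5)
>>> # The initialization can take quite a long time for large *k-mer* alphabets...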
>>> closed_syncmer_selector = CachedSyncmerSelector(
...     sequence.alphabet,
...     # The same k as in the KmerAlphabet
...     k=5,
...     s=2,
...     # The offset determines that closed syncmers will be selected
...     offset=(0, -1)
... )
>>> # ...but the actual syncmer identification is very fast
>>> syncmer_pos, syncmers = closed_syncmer_selector.select(sequence)
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in syncmers])
['GGCAA', 'AAGTG', 'AGTGA', 'GTGAC']

Attributes
alphabet : Alphabet

The base alphabet.

kmer_alphabet, smer_alphabet : KmerAlphabet

The KmerAlphabet for k and s, respectively.

permutation : Permutation

The permutation.

select(sequence, alphabet_check=True)

Obtain all overlapping k-mers from a sequence and select the syncmers from them.

Parameters
sequence : Sequence

The sequence to find the syncmers in. Must be compatible with the given kmer_alphabet.

alphabet_check : bool, optional

If set to False, the compatibility between the alphabet of the sequence and the alphabet of the CachedSyncmerSelector is not checked, which gains additional performance.

Returns
syncmer_indices : ndarray, dtype=np.uint32

The sequence indices where the syncmers start.

syncmers : ndarray, dtype=np.int64

The corresponding k-mer codes of the syncmers.

select_from_kmers(kmers)

Select syncmers for the given k-mers.

The k-mers are not required to overlap.

Parameters
kmers : ndarray, dtype=np.int64

The k-mer codes to select the syncmers from.

Returns
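syncmer_indices : ndarray, dtype=np.uint32

The indices of the syncmers in the given kmers array. If the k-mers were taken from consecutive overlapping positions of a sequence, these are the sequence indices where the syncmers start.

syncmers : ndarray, dtype=np.int64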

The corresponding k-mer codes of the syncmers.
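Conceptually, selecting from k-mer codes amounts to one table lookup per code. The following pure-Python sketch illustrates this under several assumptions: the `encode` helper, the `table` layout and the base-n code with the first symbol as most significant digit are chosen for illustration only and do not claim to mirror Biotite's internals. It mimics only the input/output shape of select_from_kmers() with plain lists instead of arrays.

```python
from itertools import product

ALPHABET = "ACGT"  # hypothetical base alphabet with n = 4 symbols
K, S = 3, 2        # illustrative k-mer and s-mer lengths


def encode(kmer):
    """Base-n code of a k-mer, first symbol most significant (assumption)."""
    code = 0
    for symbol in kmer:
        code = code * len(ALPHABET) + ALPHABET.index(symbol)
    return code


def is_open_syncmer(kmer):
    """Return True if the minimum s-mer is the first s-mer of the k-mer."""
    smers = [kmer[i:i + S] for i in range(K - S + 1)]
    return smers.index(min(smers)) == 0


# Precomputed table indexed by k-mer code, one entry per possible k-mer
table = [False] * len(ALPHABET) ** K
for symbols in product(ALPHABET, repeat=K):
    kmer = "".join(symbols)
    table[encode(kmer)] = is_open_syncmer(kmer)


def select_from_kmers(kmers):
    """Return positions in the input list and the codes of the syncmers."""
    indices = [i for i, code in enumerate(kmers) if table[code]]
    return indices, [kmers[i] for i in indices]


# The k-mers need not overlap or stem from a single sequence
kmers = [encode(kmer) for kmer in ("ACG", "TGA", "CCA", "AAT")]
indices, syncmers = select_from_kmers(kmers)
print(indices, syncmers)  # [0, 3] [6, 3]
```

In this sketch the returned indices refer to positions in the input list, since no originating sequence is known when only codes are passed in.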