SyncmerSelector

class biotite.sequence.align.SyncmerSelector(alphabet, k, s, permutation=None, offset=(0,))

Bases: object

Selects the syncmers in sequences.

Let the s-mers be all overlapping substrings of length s in a k-mer. A k-mer is a syncmer if its minimum s-mer is at one of the given offset positions [1]. If the same minimum s-mer appears more than once in a k-mer, the position of the leftmost occurrence is taken.
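The definition above can be illustrated with a minimal pure-Python sketch. It is not Biotite's implementation: `is_syncmer()` is a hypothetical helper, s-mers are compared as plain strings (lexicographic order), and negative offsets are assumed to count from the last s-mer position.

```python
def is_syncmer(kmer, s, offsets=(0,)):
    """Check the syncmer definition for a single k-mer given as a string."""
    n_smers = len(kmer) - s + 1
    smers = [kmer[i:i + s] for i in range(n_smers)]
    # list.index() returns the first match, i.e. the leftmost minimum s-mer
    min_pos = smers.index(min(smers))
    # Assumption: a negative offset counts from the last s-mer position
    allowed = {offset % n_smers for offset in offsets}
    return min_pos in allowed


# 'AA' is the minimum s-mer of 'GGCAA' and sits at the last s-mer position
print(is_syncmer("GGCAA", s=2, offsets=(0, -1)))  # True
# In 'AAGAA' the minimum 'AA' occurs twice; the leftmost position (0) counts
print(is_syncmer("AAGAA", s=2, offsets=(-1,)))    # False
```

The second call shows the tie rule: although 'AA' also occupies the last position, only its leftmost occurrence is considered.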

Parameters:
alphabet: Alphabet

The base alphabet the k-mers and s-mers are created from. Defines the type of sequence this SyncmerSelector can be applied to.

k, s: int

The length of the k-mers and s-mers, respectively.

permutation: Permutation

If set, the s-mer order is permuted, i.e. the minimum s-mer is chosen based on the ordering of the sort keys from Permutation.permute(). This Permutation must be compatible with s (not with k). By default, the standard order of the KmerAlphabet is used. This standard order is often the lexicographical order, which is known to yield suboptimal density in many cases [2].

offset: array-like of int

If the minimum s-mer in a k-mer is at one of the given offset positions, that k-mer is a syncmer. Negative values indicate the position from the end of the k-mer. By default, the minimum s-mer must be at the start of the k-mer; such a k-mer is termed an open syncmer.
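How the permutation and offset parameters change the selection can be sketched in plain Python. This is an illustration under assumptions, not Biotite's code: `select_syncmers()` and `hashed_key()` are hypothetical helpers, s-mers are handled as strings instead of integer codes, and a BLAKE2 hash merely stands in for the sort keys a Permutation would provide.

```python
import hashlib


def hashed_key(smer):
    """Hypothetical stand-in for a permutation: a deterministic pseudo-random sort key."""
    return hashlib.blake2b(smer.encode(), digest_size=8).digest()


def select_syncmers(sequence, k, s, offsets=(0,), key=None):
    """Return the start positions of all k-mers that are syncmers."""
    n_smers = k - s + 1
    # Assumption: negative offsets count from the last s-mer position
    allowed = {offset % n_smers for offset in offsets}
    positions = []
    for start in range(len(sequence) - k + 1):
        smers = [sequence[start + i:start + i + s] for i in range(n_smers)]
        # Without a key the s-mers are compared lexicographically;
        # min() returns the leftmost s-mer in case of ties
        min_pos = smers.index(min(smers, key=key))
        if min_pos in allowed:
            positions.append(start)
    return positions


sequence = "GGCAAGTGACA"
# Open syncmers: the minimum s-mer must be at the start of the k-mer
print(select_syncmers(sequence, k=5, s=2))                   # [3, 4]
# Closed syncmers: the minimum s-mer is at the first or last position
print(select_syncmers(sequence, k=5, s=2, offsets=(0, -1)))  # [0, 3, 4, 5]
# A different s-mer order may select a different subset of k-mers
print(select_syncmers(sequence, k=5, s=2, offsets=(0, -1), key=hashed_key))
```

Adding more offsets can only increase the number of selected k-mers, while changing the s-mer order redistributes which k-mers are selected.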

See also

CachedSyncmerSelector

A cached variant with faster syncmer selection at the cost of increased initialization time.

Notes

For syncmer computation from a sequence, a fast algorithm [3] is used. Its runtime scales linearly with the length of the sequence and is constant with regard to k.
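The text does not spell out the algorithm of [3]. As an illustration of how the cost of finding the minimum s-mer can be made independent of k, the following sketch uses a monotonic deque for the sliding-window minimum. This is a swapped-in technique and not necessarily what Biotite implements; the function name and the use of string s-mers are assumptions.

```python
from collections import deque


def syncmer_positions(sequence, k, s, offsets=(0,)):
    """Find syncmer start positions with a sliding-window minimum over the s-mers."""
    window = k - s + 1  # number of s-mers per k-mer
    # Assumption: negative offsets count from the last s-mer position
    allowed = {offset % window for offset in offsets}
    smers = [sequence[i:i + s] for i in range(len(sequence) - s + 1)]
    # Indices of candidate minima; their s-mers are non-decreasing front to back
    candidates = deque()
    positions = []
    for j, smer in enumerate(smers):
        # Drop strictly larger s-mers; equal ones stay, so the leftmost wins ties
        while candidates and smers[candidates[-1]] > smer:
            candidates.pop()
        candidates.append(j)
        start = j - window + 1  # start of the k-mer whose last s-mer is j
        if start < 0:
            continue
        # Discard candidates that have slid out of the current k-mer
        while candidates[0] < start:
            candidates.popleft()
        if candidates[0] - start in allowed:
            positions.append(start)
    return positions


print(syncmer_positions("GGCAAGTGACA", k=5, s=2, offsets=(0, -1)))  # [0, 3, 4, 5]
```

Each s-mer index enters and leaves the deque at most once, so the window-minimum work is amortized constant per position regardless of k. The string slicing used here to build the s-mers still costs time proportional to s.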

References

Examples

This example is taken from [1]: the subset of k-mers that are closed syncmers is selected. Closed syncmers are syncmers where the minimum s-mer is at the first or last position of the k-mer. The s-mers are ordered lexicographically in this example.

>>> from biotite.sequence import NucleotideSequence
>>> from biotite.sequence.align import KmerAlphabet, SyncmerSelector
>>> sequence = NucleotideSequence("GGCAAGTGACA")
>>> kmer_alph = KmerAlphabet(sequence.alphabet, k=5)
>>> kmers = kmer_alph.create_kmers(sequence.code)
>>> closed_syncmer_selector = SyncmerSelector(
...     sequence.alphabet,
...     # The same k as in the KmerAlphabet
...     k=5,
...     s=2,
...     # The offset determines that closed syncmers will be selected
...     offset=(0, -1)
... )
>>> syncmer_pos, syncmers = closed_syncmer_selector.select(sequence)
>>> # Print all k-mers in the sequence and mark syncmers with a '*'
>>> for pos, kmer in enumerate(kmers):
...     if pos in syncmer_pos:
...         print("* " + "".join(kmer_alph.decode(kmer)))
...     else:
...         print("  " + "".join(kmer_alph.decode(kmer)))
* GGCAA
  GCAAG
  CAAGT
* AAGTG
* AGTGA
* GTGAC
  TGACA
Attributes:
alphabet: Alphabet

The base alphabet.

kmer_alphabet, smer_alphabet: KmerAlphabet

The KmerAlphabet for k and s, respectively.

permutation: Permutation

The permutation.

select(sequence, alphabet_check=True)

Obtain all overlapping k-mers from a sequence and select the syncmers from them.

Parameters:
sequence: Sequence

The sequence to find the syncmers in. Must be compatible with the given kmer_alphabet.

alphabet_check: bool, optional

If set to False, the compatibility between the alphabet of the sequence and the alphabet of the SyncmerSelector is not checked, which gains additional performance.

Returns:
syncmer_indices: ndarray, dtype=np.uint32

The sequence indices where the syncmers start.

syncmers: ndarray, dtype=np.int64

The corresponding k-mer codes of the syncmers.

select_from_kmers(kmers)

Select syncmers for the given k-mers.

The k-mers are not required to overlap.

Parameters:
kmers: ndarray, dtype=np.int64

The k-mer codes to select the syncmers from.

Returns:
syncmer_indices: ndarray, dtype=np.uint32

The indices in the input kmers array at which the syncmers are found.

syncmers: ndarray, dtype=np.int64

The corresponding k-mer codes of the syncmers.

Notes

This method must convert the k-mers back to symbol codes to create the s-mers, and it cannot assume that the input k-mers overlap. Calling select() is therefore much faster. However, select() is only available for Sequence objects.
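The extra work described in this note can be made concrete with a small pure-Python sketch. It assumes that a k-mer code is a base-n positional number with the first symbol as the most significant digit; `split_kmer()`, `fuse()` and `select_from_kmer_codes()` are hypothetical helpers, not Biotite functions.

```python
def split_kmer(code, k, n_symbols):
    """Decode a k-mer code into its k symbol codes (first symbol most significant)."""
    symbols = []
    for _ in range(k):
        code, symbol = divmod(code, n_symbols)
        symbols.append(symbol)
    return symbols[::-1]


def fuse(symbols, n_symbols):
    """Encode symbol codes into a single k-mer (or s-mer) code."""
    code = 0
    for symbol in symbols:
        code = code * n_symbols + symbol
    return code


def select_from_kmer_codes(kmer_codes, k, s, n_symbols, offsets=(0,)):
    """Select syncmers from independent k-mer codes, one k-mer at a time."""
    n_smers = k - s + 1
    # Assumption: negative offsets count from the last s-mer position
    allowed = {offset % n_smers for offset in offsets}
    indices = []
    for index, code in enumerate(kmer_codes):
        # Every k-mer is decoded on its own, as neighbors may not overlap
        symbols = split_kmer(code, k, n_symbols)
        smers = [fuse(symbols[i:i + s], n_symbols) for i in range(n_smers)]
        if smers.index(min(smers)) in allowed:
            indices.append(index)
    return indices, [kmer_codes[i] for i in indices]


ALPHABET = "ACGT"
codes = [ALPHABET.index(symbol) for symbol in "GGCAAGTGACA"]
kmer_codes = [fuse(codes[i:i + 5], 4) for i in range(len(codes) - 4)]
indices, syncmers = select_from_kmer_codes(
    kmer_codes, k=5, s=2, n_symbols=4, offsets=(0, -1)
)
print(indices)  # [0, 3, 4, 5]
```

Because each k-mer is decoded and scanned separately, the work per k-mer grows with k, whereas a sequence-based selection can reuse s-mers shared by neighboring k-mers.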