biotite.sequence.align.SyncmerSelector
- class biotite.sequence.align.SyncmerSelector(alphabet, k, s, permutation=None, offset=(0,))
Bases: object
Selects the syncmers in sequences.
Let the s-mers be all overlapping substrings of length s in a k-mer. A k-mer is a syncmer if its minimum s-mer is at one of the given offset positions [1]. If the same minimum s-mer appears more than once in a k-mer, the position of the leftmost occurrence is taken.
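The definition above can be sketched in plain Python. This is only an illustration that works on strings with lexicographic s-mer order, not Biotite's implementation; the helper name `is_syncmer` is hypothetical, and negative offsets are assumed to count from the last s-mer position, as described for the offset parameter.

```python
def is_syncmer(kmer, s, offsets=(0,)):
    """Check whether the leftmost minimum s-mer of `kmer` is at one of `offsets`.

    Hypothetical helper for illustration; s-mers are compared as plain strings.
    """
    n_smers = len(kmer) - s + 1
    smers = [kmer[i:i + s] for i in range(n_smers)]
    # list.index() returns the first match, i.e. the leftmost minimum
    min_pos = smers.index(min(smers))
    # Map negative offsets to positions counted from the last s-mer
    allowed = {offset % n_smers for offset in offsets}
    return min_pos in allowed


sequence = "GGCAAGTGACA"
k, s = 5, 2
kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
# Open syncmers: the minimum s-mer is at the start of the k-mer
open_syncmers = [kmer for kmer in kmers if is_syncmer(kmer, s)]
# Closed syncmers: the minimum s-mer is at the start or at the end
closed_syncmers = [kmer for kmer in kmers if is_syncmer(kmer, s, offsets=(0, -1))]
print(open_syncmers)
print(closed_syncmers)
```

Recomputing every s-mer per k-mer costs time proportional to k for each k-mer, which is acceptable for a sketch but not for long sequences.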
- Parameters
- alphabet : Alphabet
The base alphabet the k-mers and s-mers are created from. Defines the type of sequence this SyncmerSelector can be applied to.
- k, s : int
The length of the k-mers and s-mers, respectively.
- permutation : Permutation
If set, the s-mer order is permuted, i.e. the minimum s-mer is chosen based on the ordering of the sort keys from Permutation.permute(). This Permutation must be compatible with s (not with k). By default, the standard order of the KmerAlphabet is used. This standard order is often the lexicographical order, which is known to yield suboptimal density in many cases [2].
- offset : array-like of int
If the minimum s-mer in a k-mer is at one of the given offset positions, that k-mer is a syncmer. Negative values indicate the position from the end of the k-mer. By default, the minimum s-mer needs to be at the start of the k-mer, which is termed open syncmer.
See also
CachedSyncmerSelector
A cached variant with faster syncmer selection at the cost of increased initialization time.
Notes
For syncmer computation from a sequence, a fast algorithm [3] is used, whose runtime scales linearly with the length of the sequence and is constant with regard to k.
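The block-based idea behind such a sliding-window minimum can be sketched as follows. This is a minimal pure-Python illustration of the prefix/suffix-minimum scheme described in [3], not Biotite's actual code; the function name `sliding_window_argmin` is hypothetical. Values are paired with their index so that ties resolve to the leftmost position.

```python
def sliding_window_argmin(values, window):
    """Return, for each window start, the index of the leftmost minimum.

    Hypothetical sketch: the work per element does not depend on `window`.
    """
    n = len(values)
    if window < 1 or window > n:
        return []
    # Pair each value with its index: tuple comparison breaks ties leftmost
    pairs = [(value, i) for i, value in enumerate(values)]
    prefix = pairs[:]  # running minimum from each block start
    suffix = pairs[:]  # running minimum towards each block end
    for i in range(1, n):
        if i % window != 0:  # do not carry over a block boundary
            prefix[i] = min(prefix[i - 1], pairs[i])
    for i in range(n - 2, -1, -1):
        if (i + 1) % window != 0:
            suffix[i] = min(suffix[i + 1], pairs[i])
    # Each window spans at most two blocks: the tail of one, the head of the next
    return [
        min(suffix[i], prefix[i + window - 1])[1]
        for i in range(n - window + 1)
    ]


sequence = "GGCAAGTGACA"
k, s = 5, 2
smers = [sequence[i:i + s] for i in range(len(sequence) - s + 1)]
# Each k-mer contains k - s + 1 consecutive s-mers
argmins = sliding_window_argmin(smers, k - s + 1)
# Open syncmers: the minimum s-mer starts where the k-mer starts
open_syncmer_starts = [i for i, min_pos in enumerate(argmins) if min_pos == i]
print(open_syncmer_starts)
```

Both passes touch each element once, so the total work grows with the sequence length only, which is the property the note above refers to.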
References
- [1] R. Edgar, “Syncmers are more sensitive than minimizers for selecting conserved k‑mers in biological sequences,” PeerJ, vol. 9, pp. e10805, February 2021. doi: 10.7717/peerj.10805
- [2] M. Roberts, W. Hayes, B. R. Hunt, S. M. Mount, J. A. Yorke, “Reducing storage requirements for biological sequence comparison,” Bioinformatics, vol. 20, pp. 3363–3369, December 2004. doi: 10.1093/bioinformatics/bth408
- [3] M. van Herk, “A fast algorithm for local minimum and maximum filters on rectangular and octagonal kernels,” Pattern Recognition Letters, vol. 13, pp. 517–521, July 1992. doi: 10.1016/0167-8655(92)90069-C
Examples
This example is taken from [1]: The subset of k-mers that are closed syncmers is selected. Closed syncmers are syncmers where the minimum s-mer is in the first or last position of the k-mer. s-mers are ordered lexicographically in this example.
>>> from biotite.sequence import NucleotideSequence
>>> from biotite.sequence.align import KmerAlphabet, SyncmerSelector
>>> sequence = NucleotideSequence("GGCAAGTGACA")
>>> kmer_alph = KmerAlphabet(sequence.alphabet, k=5)
>>> kmers = kmer_alph.create_kmers(sequence.code)
>>> closed_syncmer_selector = SyncmerSelector(
...     sequence.alphabet,
...     # The same k as in the KmerAlphabet
...     k=5,
...     s=2,
...     # The offset determines that closed syncmers will be selected
...     offset=(0, -1)
... )
>>> syncmer_pos, syncmers = closed_syncmer_selector.select(sequence)
>>> # Print all k-mers in the sequence and mark syncmers with a '*'
>>> for pos, kmer in enumerate(kmers):
...     if pos in syncmer_pos:
...         print("* " + "".join(kmer_alph.decode(kmer)))
...     else:
...         print("  " + "".join(kmer_alph.decode(kmer)))
* GGCAA
  GCAAG
  CAAGT
* AAGTG
* AGTGA
* GTGAC
  TGACA
- Attributes
- alphabet : Alphabet
The base alphabet.
- kmer_alphabet, smer_alphabet : KmerAlphabet
The KmerAlphabet for k and s, respectively.
- permutation : Permutation
The permutation.
- select(sequence, alphabet_check=True)
Obtain all overlapping k-mers from a sequence and select the syncmers from them.
- Parameters
- sequence : Sequence
The sequence to find the syncmers in. Must be compatible with the given kmer_alphabet.
- alphabet_check : bool, optional
If set to False, the compatibility between the alphabet of the sequence and the alphabet of the SyncmerSelector is not checked to gain additional performance.
- Returns
- syncmer_indices : ndarray, dtype=np.uint32
The sequence indices where the syncmers start.
- syncmers : ndarray, dtype=np.int64
The corresponding k-mer codes of the syncmers.
- select_from_kmers(kmers)
Select syncmers for the given k-mers.
The k-mers are not required to overlap.
- Parameters
- kmers : ndarray, dtype=np.int64
The k-mer codes to select the syncmers from.
- Returns
- syncmer_indices : ndarray, dtype=np.uint32
The sequence indices where the syncmers start.
- syncmers : ndarray, dtype=np.int64
The corresponding k-mer codes of the syncmers.
Notes
Because the k-mers need to be converted back to symbol codes to create the s-mers, and because the input k-mers are not required to overlap, calling select() is much faster. However, select() is only available for Sequence objects.
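The extra work mentioned in this note can be illustrated with a small sketch. It assumes that a k-mer code is a positional number in base n (the size of the base alphabet) with the first symbol as the most significant digit; this encoding and both function names are assumptions made for illustration, and Biotite's KmerAlphabet may differ in detail.

```python
def split_kmer_code(kmer_code, n_symbols, k):
    """Decompose an integer k-mer code into its k symbol codes.

    Assumes a base-n positional encoding, first symbol most significant.
    """
    symbol_codes = []
    for _ in range(k):
        kmer_code, symbol_code = divmod(kmer_code, n_symbols)
        symbol_codes.append(symbol_code)
    # divmod yields the least significant symbol first, so reverse
    return symbol_codes[::-1]


def smer_codes(symbol_codes, n_symbols, s):
    """Re-encode all overlapping s-mers of a symbol code list."""
    codes = []
    for i in range(len(symbol_codes) - s + 1):
        code = 0
        for symbol_code in symbol_codes[i:i + s]:
            code = code * n_symbols + symbol_code
        codes.append(code)
    return codes


# "AAGTG" with A=0, C=1, G=2, T=3 under the assumed encoding
symbol_codes = split_kmer_code(46, n_symbols=4, k=5)
print(symbol_codes)
print(smer_codes(symbol_codes, n_symbols=4, s=2))
```

Every k-mer has to be decomposed and re-encoded on its own here, whereas overlapping k-mers taken directly from a sequence share most of their s-mers.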