SyncmerSelector
- class biotite.sequence.align.SyncmerSelector(alphabet, k, s, permutation=None, offset=(0,))
Bases:
object
Selects the syncmers in sequences.
Let the s-mers be all overlapping substrings of length s in a k-mer. A k-mer is a syncmer if its minimum s-mer is at one of the given offset positions [1]. If the minimum s-mer appears more than once in a k-mer, the position of its leftmost occurrence is taken.
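The definition above can be illustrated with a minimal pure-Python sketch. It is not Biotite's implementation: the function name `syncmer_positions` is hypothetical, s-mers are compared as plain strings (lexicographical order), and negative offsets are assumed to count from the last s-mer position in the k-mer.

```python
def syncmer_positions(sequence, k, s, offsets=(0,)):
    """Return the start positions of all k-mers that are syncmers."""
    n_smers = k - s + 1
    # Assumption: a negative offset counts from the last s-mer position
    allowed = {off % n_smers for off in offsets}
    positions = []
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i : i + k]
        smers = [kmer[j : j + s] for j in range(n_smers)]
        # list.index() returns the leftmost occurrence of the minimum
        if smers.index(min(smers)) in allowed:
            positions.append(i)
    return positions


# Closed syncmers: minimum s-mer at the first or last position
closed = syncmer_positions("GGCAAGTGACA", k=5, s=2, offsets=(0, -1))
# Open syncmers: minimum s-mer at the first position only
open_ = syncmer_positions("GGCAAGTGACA", k=5, s=2)
```

This brute-force version inspects every s-mer of every k-mer, so its runtime grows with k. It is meant only to make the selection rule concrete.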
- Parameters:
- alphabet : Alphabet
The base alphabet the k-mers and s-mers are created from. Defines the type of sequence this SyncmerSelector can be applied on.
- k, s : int
The length of the k-mers and s-mers, respectively.
- permutation : Permutation
If set, the s-mer order is permuted, i.e. the minimum s-mer is chosen based on the ordering of the sort keys from Permutation.permute(). This Permutation must be compatible with s (not with k). By default, the standard order of the KmerAlphabet is used. This standard order is often the lexicographical order, which is known to yield suboptimal density in many cases [2].
- offset : array-like of int
If the minimum s-mer in a k-mer is at one of the given offset positions, that k-mer is a syncmer. Negative values indicate the position from the end of the k-mer. By default, the minimum s-mer needs to be at the start of the k-mer, which is termed an open syncmer.
See also
CachedSyncmerSelector
A cached variant with faster syncmer selection at the cost of increased initialization time.
Notes
For syncmer computation from a sequence a fast algorithm [3] is used, whose runtime scales linearly with the length of the sequence and is constant with regard to k.
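One common way to reach linear runtime that does not depend on k is a sliding-window minimum over the s-mers, kept in a monotonic queue. The sketch below shows that general technique only; it is not claimed to be the algorithm of [3] or Biotite's implementation. The function name is hypothetical and s-mers are compared as plain strings.

```python
from collections import deque


def syncmer_positions_linear(sequence, k, s, offsets=(0,)):
    """Find syncmer start positions with a sliding-window minimum."""
    n_smers = k - s + 1
    # Assumption: a negative offset counts from the last s-mer position
    allowed = {off % n_smers for off in offsets}
    smers = [sequence[j : j + s] for j in range(len(sequence) - s + 1)]
    # Indices of candidate s-mers; their s-mers never decrease from
    # front to back, so the front is the leftmost minimum of the window
    window = deque()
    positions = []
    for j, smer in enumerate(smers):
        # Strictly larger candidates can never be the minimum again;
        # equal ones are kept so that the leftmost occurrence wins
        while window and smers[window[-1]] > smer:
            window.pop()
        window.append(j)
        # Start of the k-mer whose last s-mer is s-mer j
        start = j - n_smers + 1
        if start < 0:
            continue
        # Discard candidates that lie left of the current k-mer
        while window[0] < start:
            window.popleft()
        if window[0] - start in allowed:
            positions.append(start)
    return positions


closed_linear = syncmer_positions_linear(
    "GGCAAGTGACA", k=5, s=2, offsets=(0, -1)
)
```

Each s-mer index is appended to and removed from the queue at most once, so the number of queue operations grows linearly with the sequence length regardless of k.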
References
Examples
This example is taken from [1]: The subset of k-mers that are closed syncmers is selected. Closed syncmers are syncmers where the minimum s-mer is in the first or last position of the k-mer. s-mers are ordered lexicographically in this example.
>>> sequence = NucleotideSequence("GGCAAGTGACA")
>>> kmer_alph = KmerAlphabet(sequence.alphabet, k=5)
>>> kmers = kmer_alph.create_kmers(sequence.code)
>>> closed_syncmer_selector = CachedSyncmerSelector(
...     sequence.alphabet,
...     # The same k as in the KmerAlphabet
...     k=5,
...     s=2,
...     # The offset determines that closed syncmers will be selected
...     offset=(0, -1)
... )
>>> syncmer_pos, syncmers = closed_syncmer_selector.select(sequence)
>>> # Print all k-mers in the sequence and mark syncmers with a '*'
>>> for pos, kmer in enumerate(kmer_alph.create_kmers(sequence.code)):
...     if pos in syncmer_pos:
...         print("* " + "".join(kmer_alph.decode(kmer)))
...     else:
...         print("  " + "".join(kmer_alph.decode(kmer)))
* GGCAA
  GCAAG
  CAAGT
* AAGTG
* AGTGA
* GTGAC
  TGACA
- Attributes:
- alphabet : Alphabet
The base alphabet.
- kmer_alphabet, smer_alphabet : KmerAlphabet
The KmerAlphabet for k and s, respectively.
- permutation : Permutation
The permutation.
- select(sequence, alphabet_check=True)
Obtain all overlapping k-mers from a sequence and select the syncmers from them.
- Parameters:
- sequence : Sequence
The sequence to find the syncmers in. Must be compatible with the given kmer_alphabet.
- alphabet_check : bool, optional
If set to false, the compatibility between the alphabet of the sequence and the alphabet of the SyncmerSelector is not checked to gain additional performance.
- Returns:
- syncmer_indices : ndarray, dtype=np.uint32
The sequence indices where the syncmers start.
- syncmers : ndarray, dtype=np.int64
The corresponding k-mer codes of the syncmers.
- select_from_kmers(kmers)
Select syncmers for the given k-mers.
The k-mers are not required to overlap.
- Parameters:
- kmers : ndarray, dtype=np.int64
The k-mer codes to select the syncmers from.
- Returns:
- syncmer_indices : ndarray, dtype=np.uint32
The indices in the given kmers array where the syncmers are found.
- syncmers : ndarray, dtype=np.int64
The corresponding k-mer codes of the syncmers.
Notes
Since the k-mers need to be converted back to symbol codes for s-mer creation, and since the input k-mers are not required to overlap, calling select() is much faster. However, select() is only available for Sequence objects.
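The extra cost described in the note can be sketched as follows. This is an illustration under stated assumptions, not Biotite's code: it assumes a k-mer code is a base-n positional number over the symbol codes with the first symbol as the most significant digit, which may differ from the encoding KmerAlphabet actually uses. All function names are hypothetical.

```python
def decode_kmer(kmer_code, k, n_symbols):
    """Split a k-mer code into its k symbol codes (assumed encoding)."""
    symbols = []
    for _ in range(k):
        kmer_code, symbol = divmod(kmer_code, n_symbols)
        symbols.append(symbol)
    # Digits were extracted from least to most significant
    return symbols[::-1]


def encode_kmer(symbols, n_symbols):
    """Combine symbol codes into one code (assumed encoding)."""
    code = 0
    for symbol in symbols:
        code = code * n_symbols + symbol
    return code


def is_syncmer(kmer_code, k, s, n_symbols, offsets=(0,)):
    """Decide for a single, isolated k-mer code whether it is a syncmer."""
    # Every k-mer must be decoded on its own, since nothing is known
    # about its neighbors; this is the work that select() can avoid
    symbols = decode_kmer(kmer_code, k, n_symbols)
    n_smers = k - s + 1
    smer_codes = [
        encode_kmer(symbols[j : j + s], n_symbols) for j in range(n_smers)
    ]
    allowed = {off % n_smers for off in offsets}
    # list.index() returns the leftmost occurrence of the minimum
    return smer_codes.index(min(smer_codes)) in allowed


# "GGCAA" with the assumed symbol codes A=0, C=1, G=2, T=3
ggcaa = encode_kmer([2, 2, 1, 0, 0], n_symbols=4)
```

Because each k-mer is decoded and split into s-mers independently, the work per k-mer grows with k, whereas overlapping k-mers from a sequence can share their s-mers.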