CachedSyncmerSelector
- class biotite.sequence.align.CachedSyncmerSelector(alphabet, k, s, permutation=None, offset=(0,))
Bases: SyncmerSelector
Selects the syncmers in sequences.
Fulfills the same purpose as SyncmerSelector, but precomputes at initialization whether each possible k-mer is a syncmer. Hence, syncmer selection is faster, at the cost of a longer initialization time.
- Parameters:
- alphabet : Alphabet
The base alphabet the k-mers and s-mers are created from. Defines the type of sequence this CachedSyncmerSelector can be applied to.
- k, s : int
The length of the k-mers and s-mers, respectively.
- permutation : Permutation, optional
If set, the s-mer order is permuted, i.e. the minimum s-mer is chosen based on the ordering of the sort keys from Permutation.permute(). This Permutation must be compatible with s (not with k). By default, the standard order of the KmerAlphabet is used. This standard order is often the lexicographical order, which is known to yield suboptimal density in many cases [1].
- offset : array-like of int, optional
If the minimum s-mer in a k-mer is at one of the given offset positions, that k-mer is a syncmer. Negative values indicate the position from the end of the k-mer. By default, the minimum s-mer needs to be at the start of the k-mer, which is termed an open syncmer.
- Attributes:
- alphabet : Alphabet
The base alphabet.
- kmer_alphabet, smer_alphabet : KmerAlphabet
The KmerAlphabet for k and s, respectively.
- permutation : Permutation
The permutation.
See also
SyncmerSelector
A standard variant for syncmer selection.
Notes
Both the initialization time and the memory requirements are proportional to the size of the kmer_alphabet, i.e. \(n^k\), where \(n\) is the number of symbols in the base alphabet. Hence, it is advised to use this class only for rather small alphabets.
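To get a feeling for the \(n^k\) scaling, the number of entries such a precomputed table would need can be estimated with plain arithmetic. This is only a rough sketch: the helper name `lookup_table_size` is hypothetical, and the alphabet sizes 4 and 20 are assumed canonical values, not necessarily the sizes of the alphabets Biotite actually uses.

```python
def lookup_table_size(n_symbols, k):
    """Number of possible k-mers, i.e. entries in a per-k-mer table."""
    return n_symbols ** k


# Assumed alphabet sizes, for illustration only
for name, n_symbols in [("nucleotide", 4), ("protein", 20)]:
    for k in (5, 8):
        size = lookup_table_size(n_symbols, k)
        print(f"{name:>10}, k={k}: {size:,} entries")
```

Under these assumptions the table grows from about a thousand entries (4 symbols, k=5) to tens of billions (20 symbols, k=8), which is why small alphabets and moderate k are the practical use case.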
References
Examples
>>> from biotite.sequence import NucleotideSequence
>>> from biotite.sequence.align import CachedSyncmerSelector, KmerAlphabet
>>> sequence = NucleotideSequence("GGCAAGTGACA")
>>> kmer_alph = KmerAlphabet(sequence.alphabet, k=5)
>>> # The initialization can take quite a long time for large *k-mer* alphabets...
>>> closed_syncmer_selector = CachedSyncmerSelector(
...     sequence.alphabet,
...     # The same k as in the KmerAlphabet
...     k=5,
...     s=2,
...     # The offset determines that closed syncmers will be selected
...     offset=(0, -1)
... )
>>> # ...but the actual syncmer identification is very fast
>>> syncmer_pos, syncmers = closed_syncmer_selector.select(sequence)
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in syncmers])
['GGCAA', 'AAGTG', 'AGTGA', 'GTGAC']
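The syncmer criterion itself (the minimum s-mer must sit at one of the offset positions) can be illustrated without Biotite. The following is a minimal pure-Python sketch, not the library's implementation: the function `select_syncmers` is hypothetical, it assumes plain lexicographical s-mer order (no permutation), and it assumes that ties are resolved in favor of the first minimum.

```python
def select_syncmers(seq, k, s, offset=(0,)):
    """Return start positions and k-mers that satisfy the syncmer criterion."""
    n_smers = k - s + 1
    # Negative offsets count from the end of the k-mer (assumption of this sketch)
    allowed = {o % n_smers for o in offset}
    positions, syncmers = [], []
    for i in range(len(seq) - k + 1):
        kmer = seq[i : i + k]
        smers = [kmer[j : j + s] for j in range(n_smers)]
        # Position of the (first) lexicographically smallest s-mer
        min_pos = smers.index(min(smers))
        if min_pos in allowed:
            positions.append(i)
            syncmers.append(kmer)
    return positions, syncmers


# Closed syncmers: minimum s-mer at the start or at the end of the k-mer
closed_pos, closed = select_syncmers("GGCAAGTGACA", k=5, s=2, offset=(0, -1))
print(closed_pos, closed)
# Open syncmers: minimum s-mer at the start only
open_pos, opened = select_syncmers("GGCAAGTGACA", k=5, s=2)
print(open_pos, opened)
```

For this sequence the closed variant yields the same four k-mers as the example above, while the open variant keeps only those whose minimum s-mer is at position 0.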
- select(sequence, alphabet_check=True)
Obtain all overlapping k-mers from a sequence and select the syncmers from them.
- Parameters:
- sequence : Sequence
The sequence to find the syncmers in. Must be compatible with the given kmer_alphabet.
- alphabet_check : bool, optional
If set to False, the compatibility between the alphabet of the sequence and the alphabet of the CachedSyncmerSelector is not checked, to gain additional performance.
- Returns:
- syncmer_indices : ndarray, dtype=np.uint32
The sequence indices where the syncmers start.
- syncmers : ndarray, dtype=np.int64
The corresponding k-mer codes of the syncmers.
- select_from_kmers(kmers)
Select syncmers for the given k-mers.
The k-mers are not required to overlap.
- Parameters:
- kmers : ndarray, dtype=np.int64
The k-mer codes to select the syncmers from.
- Returns:
- syncmer_indices : ndarray, dtype=np.uint32
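The indices of the syncmers within the given kmers. If the k-mers are all overlapping k-mers of a sequence in order, these are the sequence indices where the syncmers start.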
- syncmers : ndarray, dtype=np.int64
The corresponding k-mer codes of the syncmers.
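The idea of precomputing the answer for every possible k-mer, so that selection from arbitrary k-mer codes becomes a table lookup, might be sketched as follows. This is an illustrative pure-Python model and not Biotite's actual implementation: the names `ALPHABET`, `is_syncmer`, `build_mask`, `encode` and `select_from_codes` are hypothetical, and the sketch assumes a four-letter alphabet, lexicographical s-mer order, and k-mer codes that enumerate the k-mers in lexicographical order.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed base alphabet for this sketch


def is_syncmer(kmer, s, offset=(0,)):
    """Check whether the minimum s-mer lies at one of the offset positions."""
    n_smers = len(kmer) - s + 1
    smers = [kmer[j : j + s] for j in range(n_smers)]
    return smers.index(min(smers)) in {o % n_smers for o in offset}


def build_mask(k, s, offset=(0,)):
    """Precompute one boolean per possible k-mer (n^k entries)."""
    # product() yields k-mers in lexicographical order,
    # so the list index equals the k-mer code used by encode()
    return [
        is_syncmer("".join(symbols), s, offset)
        for symbols in product(ALPHABET, repeat=k)
    ]


def encode(kmer):
    """Map a k-mer to its code, first symbol most significant."""
    code = 0
    for symbol in kmer:
        code = code * len(ALPHABET) + ALPHABET.index(symbol)
    return code


def select_from_codes(mask, codes):
    """Select syncmers by lookup; the k-mers need not overlap."""
    indices = [i for i, code in enumerate(codes) if mask[code]]
    return indices, [codes[i] for i in indices]


mask = build_mask(k=5, s=2, offset=(0, -1))  # slow part, done once
codes = [encode(kmer) for kmer in ["TGACA", "GGCAA", "AGTGA"]]
indices, selected = select_from_codes(mask, codes)  # fast part
print(indices, selected)
```

The design trade-off is the one stated in the Notes: building the mask costs time and memory proportional to the number of possible k-mers, whereas each later selection only indexes into it.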