biotite.sequence.align.CachedSyncmerSelector
- class biotite.sequence.align.CachedSyncmerSelector(alphabet, k, s, permutation=None, offset=(0,))
Bases: SyncmerSelector
Selects the syncmers in sequences.
Fulfills the same purpose as SyncmerSelector, but precomputes at initialization, for each possible k-mer, whether it is a syncmer. Hence, syncmer selection is faster at the cost of a longer initialization time.
- Parameters
- alphabet : Alphabet
The base alphabet the k-mers and s-mers are created from. Defines the type of sequence this CachedSyncmerSelector can be applied to.
- k, s : int
The length of the k-mers and s-mers, respectively.
- permutation : Permutation
If set, the s-mer order is permuted, i.e. the minimum s-mer is chosen based on the ordering of the sort keys from Permutation.permute(). This Permutation must be compatible with s (not with k). By default, the standard order of the KmerAlphabet is used. This standard order is often the lexicographical order, which is known to yield suboptimal density in many cases [1].
- offset : array-like of int
If the minimum s-mer in a k-mer is at one of the given offset positions, that k-mer is a syncmer. Negative values indicate the position from the end of the k-mer. By default, the minimum s-mer needs to be at the start of the k-mer, which is termed an open syncmer.
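The offset rule can be illustrated without the library. The following is a minimal pure-Python sketch, not Biotite's implementation: the helper `is_syncmer` is hypothetical, s-mers are compared as plain strings (lexicographic order), and ties are resolved in favor of the first minimum, which is an assumption.

```python
def is_syncmer(kmer, s, offset=(0,)):
    """Hypothetical helper: check whether the minimum s-mer of `kmer`
    lies at one of the allowed offset positions."""
    n_smers = len(kmer) - s + 1
    smers = [kmer[i:i + s] for i in range(n_smers)]
    # Position of the first minimum s-mer (tie-breaking is an assumption)
    min_pos = smers.index(min(smers))
    # Negative offsets count backwards from the last s-mer position
    allowed = {o + n_smers if o < 0 else o for o in offset}
    return min_pos in allowed


sequence = "GGCAAGTGACA"
k, s = 5, 2
kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
# Default offset (0,): open syncmers
open_syncmers = [kmer for kmer in kmers if is_syncmer(kmer, s)]
# Offset (0, -1): closed syncmers
closed_syncmers = [kmer for kmer in kmers if is_syncmer(kmer, s, offset=(0, -1))]
```

With `offset=(0, -1)` this sketch picks the same four k-mers as the example in the Examples section, while the default offset keeps only those whose minimum s-mer is at the start.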
See also
SyncmerSelector
A standard variant for syncmer selection.
Notes
Both the initialization time and memory requirements are proportional to the size of the kmer_alphabet, i.e. \(n^k\), where \(n\) is the number of symbols in the base alphabet. Hence, it is advised to use this class only for rather small alphabets.
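To make the scaling concrete, the sketch below builds a lookup table with one boolean per possible k-mer, which is the general idea of trading initialization time and memory for fast selection. It is an illustration in plain Python under simplifying assumptions (string comparison, first minimum wins, open syncmers only). The name `build_table` and the 20-symbol alphabet size used for comparison are hypothetical and say nothing about Biotite's internal data layout.

```python
from itertools import product


def build_table(symbols, k, s):
    """Hypothetical helper: map every possible k-mer to True if its
    minimum s-mer is at position 0 (open syncmer)."""
    table = {}
    for letters in product(symbols, repeat=k):
        kmer = "".join(letters)
        smers = [kmer[i:i + s] for i in range(k - s + 1)]
        table[kmer] = smers.index(min(smers)) == 0
    return table


table = build_table("ACGT", k=5, s=2)
# One entry per possible k-mer: n^k = 4^5
n_entries = len(table)

# Number of table entries for a few alphabet sizes n and k-mer lengths k
sizes = {(n, k): n ** k for n in (4, 20) for k in (5, 8)}
```

For a 4-symbol alphabet the table stays small even at k=8 (65,536 entries), whereas 20 symbols at k=8 would already require about 2.56 × 10^10 entries.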
References
[1] M. Roberts, W. Hayes, B. R. Hunt, S. M. Mount, J. A. Yorke, “Reducing storage requirements for biological sequence comparison,” Bioinformatics, vol. 20, pp. 3363–3369, December 2004. doi: 10.1093/bioinformatics/bth408
Examples
>>> from biotite.sequence import NucleotideSequence
>>> from biotite.sequence.align import KmerAlphabet, CachedSyncmerSelector
>>> sequence = NucleotideSequence("GGCAAGTGACA")
>>> kmer_alph = KmerAlphabet(sequence.alphabet, k=5)
>>> # The initialization can take quite a long time for large *k-mer* alphabets...
>>> closed_syncmer_selector = CachedSyncmerSelector(
...     sequence.alphabet,
...     # The same k as in the KmerAlphabet
...     k=5,
...     s=2,
...     # The offset determines that closed syncmers will be selected
...     offset=(0, -1)
... )
>>> # ...but the actual syncmer identification is very fast
>>> syncmer_pos, syncmers = closed_syncmer_selector.select(sequence)
>>> print(["".join(kmer_alph.decode(kmer)) for kmer in syncmers])
['GGCAA', 'AAGTG', 'AGTGA', 'GTGAC']
- Attributes
- alphabet : Alphabet
The base alphabet.
- kmer_alphabet, smer_alphabet : KmerAlphabet
The KmerAlphabet for k and s, respectively.
- permutation : Permutation
The permutation.
- select(sequence, alphabet_check=True)
Obtain all overlapping k-mers from a sequence and select the syncmers from them.
- Parameters
- sequence : Sequence
The sequence to find the syncmers in. Must be compatible with the given kmer_alphabet.
- alphabet_check : bool, optional
If set to False, the compatibility between the alphabet of the sequence and the alphabet of the CachedSyncmerSelector is not checked, which gains additional performance.
- Returns
- syncmer_indices : ndarray, dtype=np.uint32
The sequence indices where the syncmers start.
- syncmers : ndarray, dtype=np.int64
The corresponding k-mer codes of the syncmers.
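The two returned arrays are parallel: each start index identifies where the corresponding syncmer begins in the sequence. The sketch below shows how such indices map back to the sequence, using plain Python values in place of the returned arrays. The index values are those implied by the class example above (assumed values, not captured output), and k is hard-coded to 5.

```python
sequence = "GGCAAGTGACA"
k = 5
# Start positions implied by the class example (assumed values)
syncmer_indices = [0, 3, 4, 5]

# Slice the k-mer that starts at each returned index
syncmer_strings = [sequence[i:i + k] for i in syncmer_indices]
```

Slicing by start index is an alternative to decoding the k-mer codes when only the symbols of the selected k-mers are needed.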
- select_from_kmers(kmers)
Select syncmers for the given k-mers.
The k-mers are not required to overlap.
- Parameters
- kmers : ndarray, dtype=np.int64
The k-mer codes to select the syncmers from.
- Returns
- syncmer_indices : ndarray, dtype=np.uint32
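The indices into kmers of the selected syncmers. If kmers holds all overlapping k-mers of a sequence, these equal the sequence indices where the syncmers start.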
- syncmers : ndarray, dtype=np.int64
The corresponding k-mer codes of the syncmers.