biotite.sequence.align.ScoreThresholdRule

class biotite.sequence.align.ScoreThresholdRule(matrix, threshold)

Bases: SimilarityRule

This similarity rule finds all k-mers whose similarity score with a given k-mer is greater than or equal to a defined threshold score.

The similarity score \(S\) of two k-mers \(a\) and \(b\) is defined as the sum of the pairwise similarity scores from a substitution matrix \(M\):

\[S(a,b) = \sum_{i=1}^k M(a_i, b_i)\]

Therefore, this similarity rule allows substitutions with similar symbols within a k-mer.
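The score definition above can be sketched in plain Python without Biotite. This is only an illustration: `TOY_SCORES`, `pair_score`, `kmer_score` and `is_similar` are hypothetical names, and the score values are made up for the example rather than taken from a real substitution matrix.

```python
# Toy substitution scores (hypothetical values for illustration only).
# The matrix is symmetric, so each unordered pair is stored once.
TOY_SCORES = {
    ("A", "A"): 4, ("I", "I"): 4, ("V", "V"): 4, ("W", "W"): 11,
    ("A", "I"): -1, ("A", "V"): 0, ("A", "W"): -3,
    ("I", "V"): 3, ("I", "W"): -3, ("V", "W"): -3,
}


def pair_score(x, y):
    """Look up M(x, y), using the symmetry M(x, y) == M(y, x)."""
    if (x, y) in TOY_SCORES:
        return TOY_SCORES[(x, y)]
    return TOY_SCORES[(y, x)]


def kmer_score(a, b):
    """Compute S(a, b) as the sum of the position-wise scores."""
    if len(a) != len(b):
        raise ValueError("k-mers must have the same length")
    return sum(pair_score(x, y) for x, y in zip(a, b))


def is_similar(a, b, threshold):
    """Return True if S(a, b) is greater than or equal to the threshold."""
    return kmer_score(a, b) >= threshold
```

With these toy values, `kmer_score("AIW", "AVW")` is 4 + 3 + 11 = 18, so `"AVW"` counts as similar to `"AIW"` at a threshold of 15, while `"WIW"` (score 12) does not.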

This class is especially useful for finding similar k-mers in protein sequences.

Parameters

matrix : SubstitutionMatrix
    The similarity scores are taken from this matrix. The matrix must be symmetric.
threshold : int
    The threshold score. A k-mer \(b\) is regarded as similar to a k-mer \(a\) if the similarity score between \(a\) and \(b\) is greater than or equal to the threshold.

Notes

To generate similar k-mers efficiently, a branch-and-bound algorithm [1] is used.
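The general idea of branch-and-bound enumeration can be sketched as follows. This is a minimal pure-Python illustration and not Biotite's actual implementation: `similar_kmers_bnb` and `toy_score` are hypothetical names, and the scoring function uses made-up values. A partial k-mer is extended only if its score so far, plus the best score still attainable at the remaining positions, can reach the threshold.

```python
def toy_score(x, y):
    # Hypothetical symmetric scoring: identical symbols score 3,
    # the pair I/V scores 1, every other pair scores -2
    if x == y:
        return 3
    if {x, y} == {"I", "V"}:
        return 1
    return -2


def similar_kmers_bnb(kmer, alphabet, score, threshold):
    """Enumerate all k-mers b over `alphabet` with S(kmer, b) >= threshold."""
    k = len(kmer)
    # best_rest[i]: highest score attainable from positions i..k-1
    best_rest = [0] * (k + 1)
    for i in range(k - 1, -1, -1):
        best_rest[i] = best_rest[i + 1] + max(score(kmer[i], s) for s in alphabet)
    # Even the best possible k-mer misses the threshold: nothing to search
    if best_rest[0] < threshold:
        return []

    results = []

    def extend(prefix, partial):
        i = len(prefix)
        if i == k:
            # Every complete k-mer reaching this point passed the bound check
            results.append("".join(prefix))
            return
        for s in alphabet:
            new_partial = partial + score(kmer[i], s)
            # Bound: skip the branch if even the best completion falls short
            if new_partial + best_rest[i + 1] >= threshold:
                prefix.append(s)
                extend(prefix, new_partial)
                prefix.pop()

    extend([], 0)
    return results


print(similar_kmers_bnb("AIW", "AIVW", toy_score, threshold=7))
# ['AIW', 'AVW']
```

Compared with scoring every possible k-mer, the bound check discards whole subtrees early: in this toy run, every k-mer that does not start with `A` is rejected after inspecting only its first symbol.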

References

[1] M. Hauser, C. E. Mayer, J. Söding, “kClust: fast and sensitive clustering of large protein sequence databases,” BMC Bioinformatics, vol. 14, pp. 248, August 2013. doi: 10.1186/1471-2105-14-248

Examples

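>>> from biotite.sequence import ProteinSequence
>>> from biotite.sequence.align import KmerAlphabet, ScoreThresholdRule, SubstitutionMatrix
>>> kmer_alphabet = KmerAlphabet(ProteinSequence.alphabet, k=3)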
>>> matrix = SubstitutionMatrix.std_protein_matrix()
>>> rule = ScoreThresholdRule(matrix, threshold=15)
>>> similars = rule.similar_kmers(kmer_alphabet, kmer_alphabet.encode("AIW"))
>>> print(["".join(s) for s in kmer_alphabet.decode_multiple(similars)])
['AFW', 'AIW', 'ALW', 'AMW', 'AVW', 'CIW', 'GIW', 'SIW', 'SVW', 'TIW', 'VIW', 'XIW']

similar_kmers(kmer_alphabet, kmer)

Calculate all similar k-mers for a given k-mer.

Parameters

kmer_alphabet : KmerAlphabet
    The reference k-mer alphabet to select the k-mers from.
kmer : int
    The symbol code of the k-mer for which similar k-mers are to be found.

Returns

similar_kmers : ndarray, dtype=np.int64
    The symbol codes for all similar k-mers.