biotite.sequence.align.EValueEstimator

class biotite.sequence.align.EValueEstimator(lam, k)[source]

Bases: object

This class is used to calculate expect values (E-values) for local pairwise sequence alignments.

The E-value is a measure to quantify the significance of a found homology. It is the number of alignments, that would result from aligning random sequences of a given length, with a score at least as high as the score from an alignment of interest.

The calculation of the E-value from score and sequence lengths depend on the two parameters \(\lambda\) and \(K\) 1. These parameters are estimated from sampling a large number of random sequence alignments in from_samples() 2, which may be time consuming. If these parameters are known, the constructor can be used instead.

Based on the sampled parameters, the decadic logarithm of the E-value can be quickly calculated via log_evalue().

Parameters
lamfloat

The \(\lambda\) parameter.

kfloat

The \(K\) parameter.

Notes

The calculated E-value is a rough estimation that gets more accurate the more sequences are used in the sampling process. Note that the accuracy for alignment of short sequences, where the average length of a sampled alignment make up a significant part of the complete sampled sequence 1.

References

1(1,2)

S. F. Altschul, W. Gish, “Local alignment statistics,” Methods in Enzymology, vol. 266, pp. 460–480, January 1996. doi: 10.1016/S0076-6879(96)66029-7

2(1,2)

S. F. Altschul, B. W. Erickson, “A nonlinear measure of subalignment similarity and its significance levels,” Bulletin of Mathematical Biology, vol. 48, pp. 617–632, September 1986. doi: 10.1007/BF02462327

Examples

Create an alignment, whose significance should be evaluated.

>>> query = NucleotideSequence("CGACGGCGTCTACGAGTCAACATCATTC")
>>> hit = NucleotideSequence("GCTTTATTACGGGTTTACGAGTTCAACATCACGAAAACAA")
>>> matrix = SubstitutionMatrix.std_nucleotide_matrix()
>>> gap_penalty = (-12, -2)
>>> alignment = align_optimal(query, hit, matrix, gap_penalty, local=True)[0]
>>> print(alignment)
ACGGCGTCTACGAGT-CAACATCA
ACGG-GTTTACGAGTTCAACATCA
>>> print(alignment.score)
77

Create an estimator based on the same scoring scheme as the alignment. Use background symbol frequencies from the hypothetical reference database.

>>> # Ensure deterministic results
>>> np.random.seed(0)
>>> # Sequences in database have a GC content of 0.6
>>> background = np.array([0.2, 0.3, 0.3, 0.2])
>>> estimator = EValueEstimator.from_samples(
...     query.alphabet, matrix, gap_penalty, background, sample_length=100
... )

Approach 1: Calculate E-value based on number of sequences in the hypothetical database (100).

>>> log_e = estimator.log_evalue(alignment.score, len(query), 100 * len(hit))
>>> print(f"E-value = {10**log_e:.2e}")
E-value = 3.36e-01

Approach 2: Calculate E-value based on total length of all sequences in the hypothetical database combined (10000).

>>> log_e = estimator.log_evalue(alignment.score, len(query), 10000)
>>> print(f"E-value = {10**log_e:.2e}")
E-value = 8.41e-01
static from_samples(alphabet, matrix, gap_penalty, frequencies, sample_length=1000, sample_size=1000)

Create an EValueEstimator with \(\lambda\) and \(K\) estimated via sampling alignments of random sequences based on a given scoring scheme.

The parameters are estimated from the sampled alignment scores using the method of moments 2.

Parameters
alphabetAlphabet, length=k

The alphabet for the sampled sequences.

matrixSubstitutionMatrix

The substitution matrix. It must be compatible with the given alphabet and the expected similarity score between two random symbols must be negative.

gap_penaltyint or tuple(int,int)

Either a linear (int) or affine (tuple) gap penalty. Integers must be negative.

frequenciesndarray, shape=k, dtype=float

The background frequencies for each symbol in the alphabet. The random sequences are created based on these frequencies.

sample_lengthint

The length of the sampled sequences. It should be much larger than the average length of a local alignment of two sequences. The runtime scales quadratically with this parameter.

sample_sizeint

The number of sampled sequences. The accuracy of the estimated parameters and E-values, but also the runtime increases with the sample size.

Returns
estimatorEValueEstimator

A EValueEstimator with sampled \(\lambda\) and \(K\) parameters.

Notes

The sampling process generates random sequences based on numpy.random. To ensure reproducible results you could call numpy.random.seed() before running from_samples().

log_evalue(score, seq1_length, seq2_length)

Calculate the decadic logarithm of the E-value for a given score.

The E-value and the logarithm of the E-value is calculated as

\[ \begin{align}\begin{aligned}E = Kmn e^{-\lambda s}\\\log_{10} E = (\log_{10} Kmn) - \frac{\lambda s}{\ln 10},\end{aligned}\end{align} \]

where \(s\) is the similarity score and \(m\) and \(n\) are the lengths of the aligned sequences.

Parameters
scoreint or ndarray, dtype=int

The score to evaluate.

seq1_lengthint or ndarray, dtype=int

The length of the first sequence. In the context of a homology search in a sequence database, this is usually the length of the query sequence.

seq2_lengthint or ndarray, dtype=int

The length of the second sequence. In the context of a homology search in a sequence database, this is usually either the combined length of all sequences in the database or the length of the hit sequence multiplied by the number of sequences in the database.

Returns
log_efloat

The decadic logarithm of the E-value.

Notes

This method returns the logarithm of the E-value instead of the E-value, as low E-values indicating a highly significant homology cannot be accurately represented by a float.