biotite.sequence.align.EValueEstimator¶
- class biotite.sequence.align.EValueEstimator(lam, k)[source]¶
Bases:
object
This class is used to calculate expect values (E-values) for local pairwise sequence alignments.
The E-value is a measure to quantify the significance of a found homology. It is the number of alignments, that would result from aligning random sequences of a given length, with a score at least as high as the score from an alignment of interest.
The calculation of the E-value from score and sequence lengths depend on the two parameters \(\lambda\) and \(K\) 1. These parameters are estimated from sampling a large number of random sequence alignments in
from_samples()
2, which may be time consuming. If these parameters are known, the constructor can be used instead.Based on the sampled parameters, the decadic logarithm of the E-value can be quickly calculated via
log_evalue()
.- Parameters
- lamfloat
The \(\lambda\) parameter.
- kfloat
The \(K\) parameter.
Notes
The calculated E-value is a rough estimation that gets more accurate the more sequences are used in the sampling process. Note that the accuracy for alignment of short sequences, where the average length of a sampled alignment make up a significant part of the complete sampled sequence 1.
References
- 1(1,2)
S. F. Altschul, W. Gish, “Local alignment statistics,” Methods in Enzymology, vol. 266, pp. 460–480, January 1996. doi: 10.1016/S0076-6879(96)66029-7
- 2(1,2)
S. F. Altschul, B. W. Erickson, “A nonlinear measure of subalignment similarity and its significance levels,” Bulletin of Mathematical Biology, vol. 48, pp. 617–632, September 1986. doi: 10.1007/BF02462327
Examples
Create an alignment, whose significance should be evaluated.
>>> query = NucleotideSequence("CGACGGCGTCTACGAGTCAACATCATTC") >>> hit = NucleotideSequence("GCTTTATTACGGGTTTACGAGTTCAACATCACGAAAACAA") >>> matrix = SubstitutionMatrix.std_nucleotide_matrix() >>> gap_penalty = (-12, -2) >>> alignment = align_optimal(query, hit, matrix, gap_penalty, local=True)[0] >>> print(alignment) ACGGCGTCTACGAGT-CAACATCA ACGG-GTTTACGAGTTCAACATCA >>> print(alignment.score) 77
Create an estimator based on the same scoring scheme as the alignment. Use background symbol frequencies from the hypothetical reference database.
>>> # Ensure deterministic results >>> np.random.seed(0) >>> # Sequences in database have a GC content of 0.6 >>> background = np.array([0.2, 0.3, 0.3, 0.2]) >>> estimator = EValueEstimator.from_samples( ... query.alphabet, matrix, gap_penalty, background, sample_length=100 ... )
Approach 1: Calculate E-value based on number of sequences in the hypothetical database (100).
>>> log_e = estimator.log_evalue(alignment.score, len(query), 100 * len(hit)) >>> print(f"E-value = {10**log_e:.2e}") E-value = 3.36e-01
Approach 2: Calculate E-value based on total length of all sequences in the hypothetical database combined (10000).
>>> log_e = estimator.log_evalue(alignment.score, len(query), 10000) >>> print(f"E-value = {10**log_e:.2e}") E-value = 8.41e-01
- static from_samples(alphabet, matrix, gap_penalty, frequencies, sample_length=1000, sample_size=1000)¶
Create an
EValueEstimator
with \(\lambda\) and \(K\) estimated via sampling alignments of random sequences based on a given scoring scheme.The parameters are estimated from the sampled alignment scores using the method of moments 2.
- Parameters
- alphabetAlphabet, length=k
The alphabet for the sampled sequences.
- matrixSubstitutionMatrix
The substitution matrix. It must be compatible with the given alphabet and the expected similarity score between two random symbols must be negative.
- gap_penaltyint or tuple(int,int)
Either a linear (
int
) or affine (tuple
) gap penalty. Integers must be negative.- frequenciesndarray, shape=k, dtype=float
The background frequencies for each symbol in the alphabet. The random sequences are created based on these frequencies.
- sample_lengthint
The length of the sampled sequences. It should be much larger than the average length of a local alignment of two sequences. The runtime scales quadratically with this parameter.
- sample_sizeint
The number of sampled sequences. The accuracy of the estimated parameters and E-values, but also the runtime increases with the sample size.
- Returns
- estimatorEValueEstimator
A
EValueEstimator
with sampled \(\lambda\) and \(K\) parameters.
Notes
The sampling process generates random sequences based on
numpy.random
. To ensure reproducible results you could callnumpy.random.seed()
before runningfrom_samples()
.
- log_evalue(score, seq1_length, seq2_length)¶
Calculate the decadic logarithm of the E-value for a given score.
The E-value and the logarithm of the E-value is calculated as
\[ \begin{align}\begin{aligned}E = Kmn e^{-\lambda s}\\\log_{10} E = (\log_{10} Kmn) - \frac{\lambda s}{\ln 10},\end{aligned}\end{align} \]where \(s\) is the similarity score and \(m\) and \(n\) are the lengths of the aligned sequences.
- Parameters
- scoreint or ndarray, dtype=int
The score to evaluate.
- seq1_lengthint or ndarray, dtype=int
The length of the first sequence. In the context of a homology search in a sequence database, this is usually the length of the query sequence.
- seq2_lengthint or ndarray, dtype=int
The length of the second sequence. In the context of a homology search in a sequence database, this is usually either the combined length of all sequences in the database or the length of the hit sequence multiplied by the number of sequences in the database.
- Returns
- log_efloat
The decadic logarithm of the E-value.
Notes
This method returns the logarithm of the E-value instead of the E-value, as low E-values indicating a highly significant homology cannot be accurately represented by a
float
.
Gallery¶
Genome comparison between chloroplasts and cyanobacteria