SequenceProfile
- class biotite.sequence.SequenceProfile(symbols, gaps, alphabet)
Bases: object
A SequenceProfile object stores information about a sequence profile of aligned sequences. It is possible to calculate and return its consensus sequence.
This class saves the position frequency matrix (position count matrix) ‘symbols’ of the occurrences of each alphabet symbol at each position. It also saves the number of gaps at each position in the array ‘gaps’.
With from_alignment() a SequenceProfile object can be created from any number of aligned sequences.
With probability_matrix() the position probability matrix can be computed from ‘symbols’ and a pseudocount.
With log_odds_matrix() the position weight matrix can be computed from the position probability matrix and the background frequencies.
With sequence_probability() the probability of a sequence can be computed from the position probability matrix of this SequenceProfile object.
With sequence_score() the score of a sequence can be computed from the position weight matrix of this SequenceProfile object.
All attributes of this class are publicly accessible.
- Parameters:
- symbols: ndarray, dtype=int, shape=(n,k)
The position count matrix: for each position, the absolute number of occurrences of each symbol.
- gaps: ndarray, dtype=int, shape=n
The number of gaps at each position.
- alphabet: Alphabet, length=k
The alphabet of the sequences in the profile.
- Attributes:
- symbols: ndarray, dtype=int, shape=(n,k)
The position count matrix: for each position, the absolute number of occurrences of each symbol.
- gaps: ndarray, dtype=int, shape=n
The number of gaps at each position.
- alphabet: Alphabet, length=k
The alphabet of the sequences in the profile.
Examples
Create a profile from a multiple sequence alignment:
>>> from biotite.sequence import NucleotideSequence, SequenceProfile
>>> from biotite.sequence.align import SubstitutionMatrix, align_multiple
>>> sequences = [
...     NucleotideSequence("CGCTCATTC"),
...     NucleotideSequence("CGCTATTC"),
...     NucleotideSequence("CCCTCAATC"),
... ]
>>> msa, _, _, _ = align_multiple(
...     sequences, SubstitutionMatrix.std_nucleotide_matrix(), gap_penalty=-5
... )
>>> print(msa)
CGCTCATTC
CGCT-ATTC
CCCTCAATC
>>> profile = SequenceProfile.from_alignment(msa)
>>> print(profile)
  A C G T
0 0 3 0 0
1 0 1 2 0
2 0 3 0 0
3 0 0 0 3
4 0 2 0 0
5 3 0 0 0
6 1 0 0 2
7 0 0 0 3
8 0 3 0 0
>>> print(profile.gaps)
[0 0 0 0 1 0 0 0 0]
Slice the profile (masks and index arrays are also supported):
>>> print(profile[2:])
  A C G T
0 0 3 0 0
1 0 0 0 3
2 0 2 0 0
3 3 0 0 0
4 1 0 0 2
5 0 0 0 3
6 0 3 0 0
Use the profile to compute the position probability matrix:
>>> print(profile.probability_matrix())
[[0.000 1.000 0.000 0.000]
 [0.000 0.333 0.667 0.000]
 [0.000 1.000 0.000 0.000]
 [0.000 0.000 0.000 1.000]
 [0.000 1.000 0.000 0.000]
 [1.000 0.000 0.000 0.000]
 [0.333 0.000 0.000 0.667]
 [0.000 0.000 0.000 1.000]
 [0.000 1.000 0.000 0.000]]
- static from_alignment(alignment, alphabet=None)
Create a SequenceProfile object from an Alignment object.
Based on the sequences of the alignment, the SequenceProfile parameters ‘symbols’ and ‘gaps’ are calculated.
- Parameters:
- alignment: Alignment
The Alignment object to create the SequenceProfile object from.
- alphabet: Alphabet, optional
The alphabet to use for the SequenceProfile object. If no alphabet is given, the alphabet of the SequenceProfile object is derived from the sequences of the Alignment object.
- Returns:
- profile: SequenceProfile
The created SequenceProfile object.
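The counting step described above can be illustrated with a minimal pure-Python sketch. It is not biotite's implementation: the helper `count_columns` is a hypothetical name, and the alignment is assumed to be given as equal-length strings with `-` as the gap character. The input rows are the gapped sequences printed in the class example.

```python
# Hypothetical helper, not part of biotite: count symbols and gaps
# in each column of an alignment given as equal-length strings.
def count_columns(aligned, alphabet, gap="-"):
    n = len(aligned[0])
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    symbols = [[0] * len(alphabet) for _ in range(n)]
    gaps = [0] * n
    for row in aligned:
        for pos, char in enumerate(row):
            if char == gap:
                gaps[pos] += 1
            else:
                symbols[pos][index[char]] += 1
    return symbols, gaps


aligned = ["CGCTCATTC", "CGCT-ATTC", "CCCTCAATC"]
symbols, gaps = count_columns(aligned, "ACGT")
for pos, counts in enumerate(symbols):
    print(pos, counts)
print(gaps)
```

Each row of `symbols` corresponds to one alignment column, so the result has shape (n,k) as stated for the ‘symbols’ attribute, while `gaps` has one entry per column.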
- log_odds_matrix(background_frequencies=None, pseudocount=0)
Calculate the position weight matrix (PWM) from the position probability matrix (PPM), computed with the given pseudocount, and the background frequencies. The resulting matrix has the same shape as ‘symbols’.
\[W(S) = \log_2 \left( \frac{P(S)}{B_S} \right)\]
\(S\): The symbol.
\(P(S)\): The probability of symbol \(S\) at the sequence position.
\(B_S\): The background frequency of symbol \(S\).
- Parameters:
- background_frequencies: ndarray, shape=(k,), dtype=float, optional
The background frequencies for each symbol in the alphabet. By default, a uniform distribution is assumed.
- pseudocount: int, optional
Amount added to the number of observed cases in order to change the expected probability of the PPM.
- Returns:
- pwm: ndarray, dtype=float, shape=(n,k)
The calculated position weight matrix.
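As a rough illustration of the log-odds formula, the following pure-Python sketch applies it to the probabilities of a single position. The helper `log_odds_row` is a hypothetical name and not part of biotite. Mapping a zero probability to negative infinity is an assumption of this sketch; the text does not state how the library reports that case.

```python
import math


# Hypothetical helper, not part of biotite: apply
# W(S) = log2(P(S) / B_S) to the probabilities of one position.
def log_odds_row(probabilities, background=None):
    k = len(probabilities)
    if background is None:
        # Uniform background, as described for the default case
        background = [1.0 / k] * k
    weights = []
    for p, b in zip(probabilities, background):
        # Assumption of this sketch: a zero probability maps to -inf,
        # the mathematical limit of the logarithm
        weights.append(math.log2(p / b) if p > 0 else float("-inf"))
    return weights


# Probabilities of one position (counts [0, 1, 2, 0] with pseudocount 1)
row = [0.0625, 0.3125, 0.5625, 0.0625]
weights = log_odds_row(row)
print([round(w, 3) for w in weights])
```

Symbols that are more frequent than the background receive positive weights, and rarer symbols receive negative weights.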
- probability_matrix(pseudocount=0)
Calculate the position probability matrix (PPM) based on ‘symbols’ and the given pseudocount. This new matrix has the same shape as ‘symbols’.
\[P(S) = \frac {C_S + \frac{c_p}{k}} {\sum_{i} C_i + c_p}\]
\(S\): The symbol.
\(C_S\): The count of symbol \(S\) at the sequence position.
\(c_p\): The pseudocount.
\(k\): Length of the alphabet.
- Parameters:
- pseudocount: int, optional
Amount added to the number of observed cases in order to change the expected probability of the PPM.
- Returns:
- ppm: ndarray, dtype=float, shape=(n,k)
The calculated position probability matrix.
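The effect of the pseudocount in the formula above can be seen in a small pure-Python sketch for a single position. The helper `probability_row` is a hypothetical name used only for illustration and is not biotite's implementation.

```python
# Hypothetical helper, not part of biotite: apply
# P(S) = (C_S + c_p / k) / (sum_i C_i + c_p) to the counts of one position.
def probability_row(counts, pseudocount=0):
    k = len(counts)
    total = sum(counts) + pseudocount
    return [(c + pseudocount / k) / total for c in counts]


counts = [0, 1, 2, 0]  # counts of A, C, G, T at one position
plain = probability_row(counts)
smoothed = probability_row(counts, pseudocount=1)
print(plain)
print(smoothed)
```

Without a pseudocount, unobserved symbols get a probability of zero. With a pseudocount, the amount \(c_p\) is split evenly over the \(k\) symbols, so every symbol receives a non-zero probability while each row still sums to one.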
- sequence_probability(sequence, pseudocount=0)
Calculate probability of a sequence based on the position probability matrix (PPM).
The sequence probability is the product of the probability of the respective symbol over all sequence positions.
- Parameters:
- sequence: Sequence
The input sequence.
- pseudocount: int, optional
Amount added to the number of observed cases in order to change the expected probability of the PPM.
- Returns:
- probability: float
The calculated probability for the input sequence based on the PPM.
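The product over positions can be sketched in pure Python as follows. The helper `product_of_probabilities` is a hypothetical name and not biotite's implementation; the length check is a choice made for this sketch. The matrix holds the first three rows of the position probability matrix from the class example.

```python
# Hypothetical helper, not part of biotite: multiply the probability of
# the observed symbol at each position of the sequence.
def product_of_probabilities(ppm, alphabet, sequence):
    if len(sequence) != len(ppm):
        # Choice made for this sketch: require one matrix row per symbol
        raise ValueError("sequence and matrix differ in length")
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    probability = 1.0
    for row, symbol in zip(ppm, sequence):
        probability *= row[index[symbol]]
    return probability


ppm = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 1 / 3, 2 / 3, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]
p_cgc = product_of_probabilities(ppm, "ACGT", "CGC")
p_agc = product_of_probabilities(ppm, "ACGT", "AGC")
print(p_cgc, p_agc)
```

A single symbol with zero probability makes the whole product zero, which is one reason to pass a non-zero pseudocount.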
- sequence_score(sequence, background_frequencies=None, pseudocount=0)
Calculate score of a sequence based on the position weight matrix (PWM).
The score is the sum of weights (log-odds scores) of the respective symbol over all sequence positions.
- Parameters:
- sequence: Sequence
The input sequence.
- background_frequencies: ndarray, shape=(k,), dtype=float, optional
The background frequencies for each symbol in the alphabet. By default, a uniform distribution is assumed.
- pseudocount: int, optional
Amount added to the number of observed cases in order to change the expected probability of the PPM.
- Returns:
- score: float
The calculated score for the input sequence based on the PWM.
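The sum of weights can be sketched in pure Python in the same way. The helper `sum_of_weights` is a hypothetical name and not biotite's implementation; the probabilities are illustrative values for two positions, and a uniform background of 0.25 is assumed.

```python
import math


# Hypothetical helper, not part of biotite: add up the log-odds weight
# of the observed symbol at each position of the sequence.
def sum_of_weights(pwm, alphabet, sequence):
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    return sum(row[index[symbol]] for row, symbol in zip(pwm, sequence))


# Illustrative probabilities for two positions (symbols A, C, G, T)
ppm = [
    [0.0625, 0.8125, 0.0625, 0.0625],
    [0.0625, 0.3125, 0.5625, 0.0625],
]
background = 0.25  # assumed uniform background
pwm = [[math.log2(p / background) for p in row] for row in ppm]

score = sum_of_weights(pwm, "ACGT", "CG")
# The same value, written as a single log ratio of two probabilities
ratio = (0.8125 * 0.5625) / (background**2)
print(score, math.log2(ratio))
```

Because logarithms turn products into sums, the score equals the base-2 logarithm of the sequence probability under the profile divided by its probability under the background.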
- to_consensus(as_general=False)
Get the consensus sequence for this SequenceProfile object.
- Parameters:
- as_general: bool, optional
If true, the consensus sequence is returned as a GeneralSequence object. Otherwise, the type of the consensus sequence object is chosen based on the alphabet of this SequenceProfile object.
- Returns:
- consensus: Sequence
The calculated consensus sequence.
Gallery
Identification of a binding site by sequence conservation