biotite.sequence.SequenceProfile
- class biotite.sequence.SequenceProfile(symbols, gaps, alphabet)
Bases:
object
A SequenceProfile object stores information about a sequence profile of aligned sequences. It is possible to calculate and return its consensus sequence.
This class saves the position frequency matrix (position count matrix) ‘symbols’, which holds the number of occurrences of each alphabet symbol at each position. It also saves the number of gaps at each position in the array ‘gaps’.
With probability_matrix() the position probability matrix can be created based on ‘symbols’ and a pseudocount.
With log_odds_matrix() the position weight matrix can be created based on the position probability matrix and the background frequencies.
With from_alignment() a SequenceProfile object can be created from an indefinite number of aligned sequences.
With sequence_probability() the probability of a sequence can be calculated based on the position probability matrix of this SequenceProfile instance.
With sequence_score() the score of a sequence can be calculated based on the position weight matrix of this SequenceProfile instance.
All attributes of this class are publicly accessible.
- Parameters
- symbols: ndarray, dtype=int, shape=(n,k)
For each position, this matrix stores the absolute count of each symbol.
- gaps: ndarray, dtype=int, shape=(n,)
Array that holds the number of gaps at each position.
- alphabet: Alphabet, length=k
The alphabet of the sequences in the profile.
- Attributes
- symbols: ndarray, dtype=int, shape=(n,k)
For each position, this matrix stores the absolute count of each symbol.
- gaps: ndarray, dtype=int, shape=(n,)
Array that holds the number of gaps at each position.
- alphabet: Alphabet, length=k
The alphabet of the sequences in the profile.
- static from_alignment(alignment, alphabet=None)
Create a SequenceProfile object from an Alignment object.
Based on the sequences of the alignment, the SequenceProfile parameters ‘symbols’ and ‘gaps’ are calculated.
- Parameters
- alignment: Alignment
An Alignment object to create the SequenceProfile object from.
- alphabet: Alphabet, optional
The alphabet to use for the SequenceProfile object. If no alphabet is given, it is determined from the sequences of the Alignment object. (Default: None)
- Returns
- profile: SequenceProfile
The created SequenceProfile object
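To make the contents of ‘symbols’ and ‘gaps’ concrete, the following plain-Python sketch counts symbols and gaps per alignment column. It does not call biotite: the aligned strings, the `-` gap character and the helper name `count_columns` are assumptions made purely for illustration.

```python
def count_columns(aligned, alphabet):
    """Count symbol and gap occurrences for each alignment column."""
    length = len(aligned[0])
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    # One row per position (n), one column per alphabet symbol (k)
    symbols = [[0] * len(alphabet) for _ in range(length)]
    gaps = [0] * length
    for row in aligned:
        for pos, char in enumerate(row):
            if char == "-":
                gaps[pos] += 1
            else:
                symbols[pos][index[char]] += 1
    return symbols, gaps


# Three hypothetical aligned sequences of equal length
aligned = ["ACG-", "ACGT", "A-GT"]
symbols, gaps = count_columns(aligned, "ACGT")
```

Here `symbols` has shape (n,k) = (4,4) and `gaps` has length n = 4, matching the shapes documented for the class parameters.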
- log_odds_matrix(background_frequencies=None, pseudocount=0)
Calculate the position weight matrix (PWM) based on the position probability matrix (PPM) (with the given pseudocount) and the background frequencies. This new matrix has the same shape as ‘symbols’.
\[W(S) = \log_2 \left( \frac{P(S)}{B_S} \right)\]
\(S\): The symbol.
\(P(S)\): The probability of symbol \(S\) at the sequence position.
\(B_S\): The background frequency of symbol \(S\).
- Parameters
- pseudocount: int, optional
Amount added to the number of observed cases in order to change the expected probability of the PPM. (Default: 0)
- background_frequencies: ndarray, shape=(k,), dtype=float, optional
The background frequencies for each symbol in the alphabet. By default, a uniform distribution is assumed.
- Returns
- pwm: ndarray, dtype=float, shape=(n,k)
The calculated position weight matrix.
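The log-odds formula can be checked by hand for a single position. The sketch below is plain Python and not the library's implementation; the helper name `log_odds_row` and the example counts are assumptions. A nonzero pseudocount is used so that no probability is zero, since the logarithm of zero is undefined.

```python
import math


def log_odds_row(counts, pseudocount, background=None):
    """Compute log2(P(S) / B_S) for every symbol at one position."""
    k = len(counts)
    if background is None:
        # Uniform background, as the documented default assumes
        background = [1 / k] * k
    total = sum(counts) + pseudocount
    probs = [(c + pseudocount / k) / total for c in counts]
    return [math.log2(p / b) for p, b in zip(probs, background)]


# Hypothetical counts for the symbols A, C, G, T at one position
row = log_odds_row([3, 1, 0, 0], pseudocount=4)
```

With these counts the probabilities are 0.5, 0.25, 0.125 and 0.125, so the weights are positive for symbols that are more frequent than the background, zero for a symbol at background frequency, and negative otherwise.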
- probability_matrix(pseudocount=0)
Calculate the position probability matrix (PPM) based on ‘symbols’ and the given pseudocount. This new matrix has the same shape as ‘symbols’.
\[P(S) = \frac {C_S + \frac{c_p}{k}} {\sum_{i} C_i + c_p}\]
\(S\): The symbol.
\(C_S\): The count of symbol \(S\) at the sequence position.
\(c_p\): The pseudocount.
\(k\): Length of the alphabet.
- Parameters
- pseudocount: int, optional
Amount added to the number of observed cases in order to change the expected probability of the PPM. (Default: 0)
- Returns
- ppm: ndarray, dtype=float, shape=(n,k)
The calculated position probability matrix.
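As a worked example of the PPM formula, the following plain-Python sketch applies it to a small count matrix. It only illustrates the arithmetic and is not the library's code; the function operates on nested lists rather than an ndarray, and the counts are made up.

```python
def probability_matrix(symbols, pseudocount=0):
    """Apply P(S) = (C_S + c_p / k) / (sum_i C_i + c_p) to each position."""
    k = len(symbols[0])
    ppm = []
    for counts in symbols:
        total = sum(counts) + pseudocount
        ppm.append([(c + pseudocount / k) / total for c in counts])
    return ppm


# Hypothetical counts for two positions over the alphabet A, C, G, T
symbols = [[3, 1, 0, 0], [0, 0, 4, 0]]
ppm = probability_matrix(symbols, pseudocount=4)
```

The pseudocount is distributed evenly over the k symbols, so each row still sums to 1 while symbols that were never observed receive a small nonzero probability.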
- sequence_probability(sequence, pseudocount=0)
Calculate the probability of a sequence based on the position probability matrix (PPM).
The sequence probability is the product of the probabilities of the respective symbols over all sequence positions.
- Parameters
- sequence: Sequence
The input sequence.
- pseudocount: int, optional
Amount added to the number of observed cases in order to change the expected probability of the PPM. (Default: 0)
- Returns
- probability: float
The calculated probability for the input sequence based on the PPM.
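The product over positions can be sketched in a few lines of plain Python. The PPM values, the string-based alphabet and the length check are assumptions for illustration and do not reflect the library's internals.

```python
import math

ALPHABET = "ACGT"
# Hypothetical PPM for a profile of length 3 (each row sums to 1)
ppm = [
    [0.5, 0.25, 0.125, 0.125],
    [0.125, 0.125, 0.625, 0.125],
    [0.25, 0.25, 0.25, 0.25],
]


def sequence_probability(sequence, ppm, alphabet=ALPHABET):
    """Multiply the probability of the observed symbol at each position."""
    if len(sequence) != len(ppm):
        raise ValueError("Sequence and profile must have the same length")
    return math.prod(
        row[alphabet.index(symbol)] for row, symbol in zip(ppm, sequence)
    )


# 0.5 (A at position 1) * 0.625 (G at position 2) * 0.25 (T at position 3)
p = sequence_probability("AGT", ppm)
```

Because a single zero entry makes the whole product zero, a nonzero pseudocount is usually sensible when the profile is built from few sequences.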
- sequence_score(sequence, background_frequencies=None, pseudocount=0)
Calculate the score of a sequence based on the position weight matrix (PWM).
The score is the sum of the weights (log-odds scores) of the respective symbols over all sequence positions.
- Parameters
- sequence: Sequence
The input sequence.
- pseudocount: int, optional
Amount added to the number of observed cases in order to change the expected probability of the PPM. (Default: 0)
- background_frequencies: ndarray, shape=(k,), dtype=float, optional
The background frequencies for each symbol in the alphabet. By default a uniform distribution is assumed.
- Returns
- score: float
The calculated score for the input sequence based on the PWM.
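The sum of log-odds weights can be sketched as follows. This is a plain-Python illustration under assumed values (the PPM, the uniform background and the string alphabet are hypothetical), not the library's implementation.

```python
import math

ALPHABET = "ACGT"
# Hypothetical PPM for a profile of length 3 (each row sums to 1)
ppm = [
    [0.5, 0.25, 0.125, 0.125],
    [0.125, 0.125, 0.625, 0.125],
    [0.25, 0.25, 0.25, 0.25],
]
# Uniform background frequencies, one per alphabet symbol
background = [0.25, 0.25, 0.25, 0.25]


def sequence_score(sequence, ppm, background, alphabet=ALPHABET):
    """Sum log2(P(S) / B_S) of the observed symbol over all positions."""
    score = 0.0
    for row, symbol in zip(ppm, sequence):
        i = alphabet.index(symbol)
        score += math.log2(row[i] / background[i])
    return score


score = sequence_score("AGT", ppm, background)
```

Since a sum of logarithms is the logarithm of a product, the score equals the base-2 logarithm of the sequence probability divided by the probability of the same sequence under the background model.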
- to_consensus(as_general=False)
Get the consensus sequence for this SequenceProfile object.
- Parameters
- as_general: bool, optional
If true, the consensus sequence is returned as a GeneralSequence object. Otherwise, the sequence type is chosen based on the alphabet of this SequenceProfile object. (Default: False)
- Returns
- consensus: Sequence
The calculated consensus sequence
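One simple way to derive a consensus from a count matrix is a per-position majority vote, sketched below in plain Python. This section does not describe how the library treats ties, gaps or ambiguous symbols, so the tie-breaking rule here (first symbol in alphabet order wins) and the helper name `majority_consensus` are assumptions; the result of to_consensus() may differ.

```python
def majority_consensus(symbols, alphabet):
    """Pick the most frequent symbol at each position (simplified)."""
    consensus = []
    for counts in symbols:
        # max() returns the first index with the highest count on ties
        best = max(range(len(counts)), key=counts.__getitem__)
        consensus.append(alphabet[best])
    return "".join(consensus)


# Hypothetical counts for three positions over the alphabet A, C, G, T
symbols = [[3, 0, 0, 0], [0, 2, 1, 0], [0, 0, 0, 3]]
consensus = majority_consensus(symbols, "ACGT")
```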
Gallery
Sequence logo of the Anderson promoter collection
Conservation of LexA DNA-binding site
Identification of the ribosomal binding site