biotite.sequence.SequenceProfile

class biotite.sequence.SequenceProfile(symbols, gaps, alphabet)

Bases: object

A SequenceProfile object stores the sequence profile of a set of aligned sequences. Its consensus sequence can be calculated from the profile.

This class saves the position frequency matrix (position count matrix) ‘symbols’ of the occurrences of each alphabet symbol at each position. It also saves the number of gaps at each position in the array ‘gaps’.

With probability_matrix() the position probability matrix can be created based on ‘symbols’ and a pseudocount.

With log_odds_matrix() the position weight matrix can be created based on the position probability matrix and the background frequencies.

With from_alignment() a SequenceProfile object can be created from an indefinite number of aligned sequences.

With sequence_probability() the probability of a sequence can be calculated based on the position probability matrix of this SequenceProfile instance.

With sequence_score() the score of a sequence can be calculated based on the position weight matrix of this SequenceProfile instance.

All attributes of this class are publicly accessible.

Parameters

symbols: ndarray, dtype=int, shape=(n,k): The absolute count of each alphabet symbol at each position.
gaps: ndarray, dtype=int, shape=(n,): The number of gaps at each position.
alphabet: Alphabet, length=k: The alphabet of the sequences in the sequence profile.

Attributes

symbols: ndarray, dtype=int, shape=(n,k): The absolute count of each alphabet symbol at each position.
gaps: ndarray, dtype=int, shape=(n,): The number of gaps at each position.
alphabet: Alphabet, length=k: The alphabet of the sequences in the sequence profile.

static from_alignment(alignment, alphabet=None)

Create a SequenceProfile object from an Alignment object.

The SequenceProfile parameters symbols and gaps are calculated from the sequences of the alignment.

Parameters

alignment: Alignment: The Alignment object to create the SequenceProfile object from.
alphabet: Alphabet, optional: The alphabet to use for the SequenceProfile object. If no alphabet is given, it is derived from the sequences of the Alignment object. (Default: None)

Returns

profile: SequenceProfile: The created SequenceProfile object
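To illustrate what the two count arrays hold, the following plain-Python sketch tallies symbols and gaps column by column. It does not call biotite: the helper name count_profile, the representation of aligned sequences as equally long strings, and the '-' gap character are assumptions made for this example only.

```python
def count_profile(aligned, alphabet, gap="-"):
    """Hypothetical helper: count symbols and gaps per alignment column.

    Returns (symbols, gaps), where symbols has one row per position and
    one column per alphabet symbol, and gaps has one entry per position.
    """
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("All aligned sequences must have the same length")
    # Map each alphabet symbol to its column index
    index = {symbol: code for code, symbol in enumerate(alphabet)}
    symbols = [[0] * len(alphabet) for _ in range(length)]
    gaps = [0] * length
    for row in aligned:
        for pos, char in enumerate(row):
            if char == gap:
                gaps[pos] += 1
            else:
                symbols[pos][index[char]] += 1
    return symbols, gaps


aligned = ["ACG-", "ACGT", "A-GT"]
symbols, gaps = count_profile(aligned, alphabet="ACGT")
print(symbols)  # [[3, 0, 0, 0], [0, 2, 0, 0], [0, 0, 3, 0], [0, 0, 0, 2]]
print(gaps)     # [0, 1, 0, 1]
```

In this sketch, each row of symbols plus the matching entry of gaps adds up to the number of aligned sequences.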

log_odds_matrix(background_frequencies=None, pseudocount=0)

Calculate the position weight matrix (PWM) based on the position probability matrix (PPM) (with given pseudocount) and background_frequencies. This new matrix has the same shape as ‘symbols’.

\[W(S) = \log_2 \left( \frac{P(S)}{B_S} \right)\]

\(S\): The symbol.

\(P(S)\): The probability of symbol \(S\) at the sequence position.

\(B_S\): The background frequency of symbol \(S\).

Parameters

pseudocount: int, optional: Amount added to the number of observed cases in order to change the expected probability of the PPM. (Default: 0)
background_frequencies: ndarray, shape=(k,), dtype=float, optional: The background frequencies for each symbol in the alphabet. By default, a uniform distribution is assumed.

Returns

pwm: ndarray, dtype=float, shape=(n,k): The calculated position weight matrix.
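The log-odds formula can be checked by hand for a single position. The sketch below is plain Python and independent of biotite; the helper name log_odds_row is hypothetical. It first converts counts to probabilities with the pseudocount and then takes the base-2 logarithm of the ratio to the background frequency, assuming a uniform background when none is given.

```python
import math


def log_odds_row(counts, background=None, pseudocount=0):
    """Hypothetical helper: log-odds weights for one profile position."""
    k = len(counts)
    if background is None:
        # Uniform background: every symbol is equally likely
        background = [1 / k] * k
    total = sum(counts) + pseudocount
    # Position probabilities with the pseudocount spread over k symbols
    probs = [(c + pseudocount / k) / total for c in counts]
    # W(S) = log2(P(S) / B_S)
    return [math.log2(p / b) for p, b in zip(probs, background)]


weights = log_odds_row([5, 1, 1, 1], pseudocount=4)
print(weights)
```

With these counts the first symbol has probability 0.5, twice the uniform background of 0.25, so its weight is 1. A positive pseudocount keeps every probability above zero, which keeps the logarithm finite for symbols that were never observed.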

probability_matrix(pseudocount=0)

Calculate the position probability matrix (PPM) based on ‘symbols’ and the given pseudocount. This new matrix has the same shape as ‘symbols’.

\[P(S) = \frac {C_S + \frac{c_p}{k}} {\sum_{i} C_i + c_p}\]

\(S\): The symbol.

\(C_S\): The count of symbol \(S\) at the sequence position.

\(c_p\): The pseudocount.

\(k\): Length of the alphabet.

Parameters

pseudocount: int, optional: Amount added to the number of observed cases in order to change the expected probability of the PPM. (Default: 0)

Returns

ppm: ndarray, dtype=float, shape=(n,k): The calculated position probability matrix.
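As a worked instance of the formula for \(P(S)\), the following plain-Python sketch applies it row by row to a small count matrix. It is not biotite code; the helper name ppm_from_counts and the example counts are made up for illustration.

```python
def ppm_from_counts(symbols, pseudocount=0):
    """Hypothetical helper: apply P(S) = (C_S + c_p/k) / (sum_i C_i + c_p)."""
    k = len(symbols[0])  # alphabet length
    ppm = []
    for counts in symbols:
        total = sum(counts) + pseudocount
        ppm.append([(c + pseudocount / k) / total for c in counts])
    return ppm


# Two positions over a four-symbol alphabet
symbols = [[3, 0, 0, 1], [0, 2, 2, 0]]
ppm = ppm_from_counts(symbols, pseudocount=4)
print(ppm)  # [[0.5, 0.125, 0.125, 0.25], [0.125, 0.375, 0.375, 0.125]]
```

Because the pseudocount is split evenly over the k symbols, each row still sums to 1 and no symbol ends up with a probability of zero.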

sequence_probability(sequence, pseudocount=0)

Calculate probability of a sequence based on the position probability matrix (PPM).

The sequence probability is the product of the probability of the respective symbol over all sequence positions.

Parameters

sequence: Sequence: The input sequence.
pseudocount: int, optional: Amount added to the number of observed cases in order to change the expected probability of the PPM. (Default: 0)

Returns

probability: float: The calculated probability for the input sequence based on the PPM.
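The product over positions can be sketched in a few lines of plain Python. This is an illustration, not the library's implementation: the helper name probability_of is hypothetical, the PPM values are example numbers, and the sketch assumes the sequence has exactly as many symbols as the matrix has positions.

```python
import math


def probability_of(sequence, ppm, alphabet):
    """Hypothetical helper: multiply the PPM entries selected by the sequence."""
    if len(sequence) != len(ppm):
        # Assumption of this sketch: one symbol per profile position
        raise ValueError("Sequence length must match the number of positions")
    index = {symbol: code for code, symbol in enumerate(alphabet)}
    return math.prod(ppm[pos][index[s]] for pos, s in enumerate(sequence))


# Example PPM with two positions over the alphabet "ACGT"
ppm = [[0.5, 0.125, 0.125, 0.25], [0.125, 0.375, 0.375, 0.125]]
prob_ac = probability_of("AC", ppm, "ACGT")
prob_tg = probability_of("TG", ppm, "ACGT")
print(prob_ac)  # 0.1875
print(prob_tg)  # 0.09375
```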

sequence_score(sequence, background_frequencies=None, pseudocount=0)

Calculate score of a sequence based on the position weight matrix (PWM).

The score is the sum of weights (log-odds scores) of the respective symbol over all sequence positions.

Parameters

sequence: Sequence: The input sequence.
pseudocount: int, optional: Amount added to the number of observed cases in order to change the expected probability of the PPM. (Default: 0)
background_frequencies: ndarray, shape=(k,), dtype=float, optional: The background frequencies for each symbol in the alphabet. By default, a uniform distribution is assumed.

Returns

score: float: The calculated score for the input sequence based on the PWM.
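The sum of weights can be sketched the same way in plain Python. The helper name score_of and the example numbers are hypothetical, the background is assumed to be uniform (0.25 for four symbols), and the sequence is assumed to be as long as the matrix has positions.

```python
import math


def score_of(sequence, pwm, alphabet):
    """Hypothetical helper: add up the PWM entries selected by the sequence."""
    if len(sequence) != len(pwm):
        # Assumption of this sketch: one symbol per profile position
        raise ValueError("Sequence length must match the number of positions")
    index = {symbol: code for code, symbol in enumerate(alphabet)}
    return sum(pwm[pos][index[s]] for pos, s in enumerate(sequence))


# Example PPM, converted to log-odds weights against a uniform background
ppm = [[0.5, 0.125, 0.125, 0.25], [0.125, 0.375, 0.375, 0.125]]
pwm = [[math.log2(p / 0.25) for p in row] for row in ppm]
score = score_of("AC", pwm, "ACGT")
print(score)
```

Since a sum of logarithms is the logarithm of a product, this score equals the base-2 logarithm of the sequence probability under the PPM (0.5 × 0.375) divided by its probability under the background (0.25 × 0.25), which is log2(3) here.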

to_consensus(as_general=False)

Get the consensus sequence for this SequenceProfile object.

Parameters

as_general: bool: If True, the consensus sequence is returned as a GeneralSequence object. Otherwise, the type of the consensus sequence object is chosen based on the alphabet of this SequenceProfile object. (Default: False)

Returns

consensus: Sequence: The calculated consensus sequence
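One common way to derive a consensus from a count matrix is to take the most frequent symbol at each position. The plain-Python sketch below shows that idea only; it is an assumption about the general approach, not a description of how to_consensus() resolves ties, ambiguous symbols or gaps. The helper name consensus_from_counts is hypothetical.

```python
def consensus_from_counts(symbols, alphabet):
    """Hypothetical helper: pick the most frequent symbol per position.

    Assumption of this sketch: on a tie, the symbol that comes first
    in the alphabet is chosen.
    """
    consensus = []
    for counts in symbols:
        # max() returns the first index with the highest count
        best = max(range(len(counts)), key=counts.__getitem__)
        consensus.append(alphabet[best])
    return "".join(consensus)


symbols = [[3, 0, 0, 0], [0, 2, 0, 0], [0, 0, 3, 0], [0, 0, 0, 2]]
consensus = consensus_from_counts(symbols, "ACGT")
print(consensus)  # ACGT
```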

Gallery

Sequence logo of the Anderson promoter collection

Conservation of LexA DNA-binding site

Identification of the ribosomal binding site