biotite.sequence.SequenceProfile

class biotite.sequence.SequenceProfile(symbols, gaps, alphabet)

Bases: object

A SequenceProfile object stores information about a sequence profile of aligned sequences. It is possible to calculate and return its consensus sequence.

This class saves the position frequency matrix (position count matrix) ‘symbols’ of the occurrences of each alphabet symbol at each position. It also saves the number of gaps at each position in the array ‘gaps’.

With probability_matrix() the position probability matrix can be created based on ‘symbols’ and a pseudocount.

With log_odds_matrix() the position weight matrix can be created based on the position probability matrix and the background frequencies.

With from_alignment() a SequenceProfile object can be created from an indefinite number of aligned sequences.

With sequence_probability() the probability of a sequence can be calculated based on the position probability matrix of this SequenceProfile instance.

With sequence_score() the score of a sequence can be calculated based on the position weight matrix of this SequenceProfile instance.

All attributes of this class are publicly accessible.

Parameters
symbols: ndarray, dtype=int, shape=(n,k)

For each position, this matrix stores the absolute count of each symbol.

gaps: ndarray, dtype=int, shape=(n,)

The number of gaps at each position.

alphabet: Alphabet, length=k

The alphabet of the sequences in the profile.

Attributes
symbols: ndarray, dtype=int, shape=(n,k)

For each position, this matrix stores the absolute count of each symbol.

gaps: ndarray, dtype=int, shape=(n,)

The number of gaps at each position.

alphabet: Alphabet, length=k

The alphabet of the sequences in the profile.

static from_alignment(alignment, alphabet=None)

Create a SequenceProfile object from an Alignment object.

The SequenceProfile parameters symbols and gaps are calculated from the sequences of the alignment.

Parameters
alignment: Alignment

The Alignment object to create the SequenceProfile object from.

alphabet: Alphabet, optional

The alphabet to use for the SequenceProfile object. If no alphabet is given, it is determined from the sequences of the Alignment object. (Default: None)

Returns
profile: SequenceProfile

The created SequenceProfile object
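
To illustrate what symbols and gaps contain, the following is a minimal pure-Python sketch of the counting step. It is not biotite's implementation: count_columns is a hypothetical helper, the aligned sequences are given as gapped strings with '-' as an assumed gap character, and plain lists stand in for the ndarray objects.

```python
# Hypothetical helper (not part of biotite): count symbols and gaps per column.
def count_columns(aligned, alphabet, gap="-"):
    n = len(aligned[0])
    # Map each alphabet symbol to its column index in the count matrix
    index = {symbol: j for j, symbol in enumerate(alphabet)}
    symbols = [[0] * len(alphabet) for _ in range(n)]
    gaps = [0] * n
    for row in aligned:
        for i, char in enumerate(row):
            if char == gap:
                gaps[i] += 1
            else:
                symbols[i][index[char]] += 1
    return symbols, gaps


# Three aligned sequences of length n=4 over the alphabet A, C, G, T (k=4)
aligned = ["ACGT", "AC-T", "TCGA"]
symbols, gaps = count_columns(aligned, "ACGT")
```

Each row of symbols corresponds to one alignment position and each column to one alphabet symbol, matching the shape (n,k) described above.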

log_odds_matrix(background_frequencies=None, pseudocount=0)

Calculate the position weight matrix (PWM) based on the position probability matrix (PPM) (with given pseudocount) and background_frequencies. This new matrix has the same shape as ‘symbols’.

\[W(S) = \log_2 \left( \frac{P(S)}{B_S} \right)\]

\(S\): The symbol.

\(P(S)\): The probability of symbol \(S\) at the sequence position.

\(B_S\): The background frequency of symbol \(S\).

Parameters
pseudocount: int, optional

Amount added to the number of observed cases in order to change the expected probability of the PPM. (Default: 0)

background_frequencies: ndarray, shape=(k,), dtype=float, optional

The background frequencies for each symbol in the alphabet. By default, a uniform distribution is assumed.

Returns
pwm: ndarray, dtype=float, shape=(n,k)

The calculated position weight matrix.
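
The log-odds formula above can be sketched in pure Python as follows. This is an illustration under stated assumptions, not biotite's code: the function takes an already computed position probability matrix as nested lists, and mapping a zero probability to negative infinity is a choice made for this sketch.

```python
import math


# Sketch of W(S) = log2(P(S) / B_S); plain lists stand in for ndarrays.
def log_odds_matrix(ppm, background_frequencies=None):
    k = len(ppm[0])
    if background_frequencies is None:
        # Uniform background: every symbol has frequency 1/k
        background_frequencies = [1 / k] * k
    return [
        [
            # Assumption of this sketch: zero probability gives -inf
            math.log2(p / b) if p > 0 else float("-inf")
            for p, b in zip(row, background_frequencies)
        ]
        for row in ppm
    ]


ppm = [
    [0.5, 0.25, 0.125, 0.125],
    [0.25, 0.25, 0.25, 0.25],
]
pwm = log_odds_matrix(ppm)
```

With a uniform background, a symbol that is twice as likely as expected gets a weight of 1, and a symbol that matches the background gets a weight of 0.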

probability_matrix(pseudocount=0)

Calculate the position probability matrix (PPM) based on ‘symbols’ and the given pseudocount. This new matrix has the same shape as ‘symbols’.

\[P(S) = \frac {C_S + \frac{c_p}{k}} {\sum_{i} C_i + c_p}\]

\(S\): The symbol.

\(C_S\): The count of symbol \(S\) at the sequence position.

\(c_p\): The pseudocount.

\(k\): Length of the alphabet.

Parameters
pseudocount: int, optional

Amount added to the number of observed cases in order to change the expected probability of the PPM. (Default: 0)

Returns
ppm: ndarray, dtype=float, shape=(n,k)

The calculated position probability matrix.
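
As a worked instance of the formula above, here is a minimal pure-Python sketch. It mirrors the formula only and is not taken from biotite's source; nested lists stand in for the ndarray of counts.

```python
# Sketch of P(S) = (C_S + c_p / k) / (sum_i C_i + c_p)
def probability_matrix(symbols, pseudocount=0):
    k = len(symbols[0])
    ppm = []
    for counts in symbols:
        # Denominator: total count at this position plus the pseudocount
        total = sum(counts) + pseudocount
        ppm.append([(c + pseudocount / k) / total for c in counts])
    return ppm


# Counts for two positions over a four-symbol alphabet
symbols = [[2, 0, 0, 1], [0, 3, 0, 0]]
ppm = probability_matrix(symbols, pseudocount=1)
```

The pseudocount is spread evenly over the k symbols, so each row still sums to 1 while symbols that were never observed receive a small nonzero probability.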

sequence_probability(sequence, pseudocount=0)

Calculate the probability of a sequence based on the position probability matrix (PPM).

The sequence probability is the product of the probability of the respective symbol over all sequence positions.

Parameters
sequence: Sequence

The input sequence.

pseudocount: int, optional

Amount added to the number of observed cases in order to change the expected probability of the PPM. (Default: 0)

Returns
probability: float

The calculated probability for the input sequence based on the PPM.
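
The product described above can be sketched in pure Python. The function below is a hypothetical stand-in that takes the PPM and the alphabet explicitly, whereas the method works on the profile itself; requiring the sequence length to equal the number of profile positions is an assumption of this sketch.

```python
import math


# Sketch: multiply the PPM entries selected by the sequence symbols.
def sequence_probability(ppm, alphabet, sequence):
    # Assumption of this sketch: one sequence symbol per profile position
    if len(sequence) != len(ppm):
        raise ValueError("Sequence length must match the number of positions")
    index = {symbol: j for j, symbol in enumerate(alphabet)}
    return math.prod(ppm[i][index[s]] for i, s in enumerate(sequence))


ppm = [
    [0.5, 0.25, 0.125, 0.125],
    [0.25, 0.25, 0.25, 0.25],
]
probability = sequence_probability(ppm, "ACGT", "AC")
```

A single zero entry in the PPM makes the whole product zero, which is one reason to use a nonzero pseudocount.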

sequence_score(sequence, background_frequencies=None, pseudocount=0)

Calculate the score of a sequence based on the position weight matrix (PWM).

The score is the sum of weights (log-odds scores) of the respective symbol over all sequence positions.

Parameters
sequence: Sequence

The input sequence.

pseudocount: int, optional

Amount added to the number of observed cases in order to change the expected probability of the PPM. (Default: 0)

background_frequencies: ndarray, shape=(k,), dtype=float, optional

The background frequencies for each symbol in the alphabet. By default a uniform distribution is assumed.

Returns
score: float

The calculated score for the input sequence based on the PWM.
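
The sum described above can be sketched in pure Python. As with the other sketches, this is a hypothetical stand-in rather than biotite's code: it takes a precomputed PWM and the alphabet explicitly, and the weights are made-up illustration values.

```python
# Sketch: add up the PWM entries selected by the sequence symbols.
def sequence_score(pwm, alphabet, sequence):
    # Assumption of this sketch: one sequence symbol per profile position
    if len(sequence) != len(pwm):
        raise ValueError("Sequence length must match the number of positions")
    index = {symbol: j for j, symbol in enumerate(alphabet)}
    return sum(pwm[i][index[s]] for i, s in enumerate(sequence))


# Illustrative log-odds weights for two positions over A, C, G, T
pwm = [
    [1.0, 0.0, -1.0, -1.0],
    [-1.0, 1.5, -2.0, 0.0],
]
score = sequence_score(pwm, "ACGT", "AC")
```

Because the weights are logarithms, summing them corresponds to multiplying the underlying probability ratios, so a positive score means the sequence is more likely under the profile than under the background.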

to_consensus(as_general=False)

Get the consensus sequence for this SequenceProfile object.

Parameters
as_general: bool

If true, the consensus sequence is returned as a GeneralSequence object. Otherwise, the type of the consensus sequence object is chosen based on the alphabet of this SequenceProfile object. (Default: False)

Returns
consensus: Sequence

The calculated consensus sequence
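
One simple way to derive a consensus from the count matrix is to pick the most frequent symbol at each position. The pure-Python sketch below shows that rule only; it is an assumption for illustration, and the actual method may treat ties, gaps, or ambiguous symbols differently.

```python
# Sketch: consensus as the most frequent symbol per position.
def simple_consensus(symbols, alphabet):
    consensus = []
    for counts in symbols:
        # Assumption of this sketch: ties go to the first symbol in the alphabet
        best = max(range(len(counts)), key=lambda j: counts[j])
        consensus.append(alphabet[best])
    return "".join(consensus)


symbols = [[2, 0, 0, 1], [0, 3, 0, 0], [0, 0, 2, 0], [1, 0, 0, 2]]
consensus = simple_consensus(symbols, "ACGT")
```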