`SequenceProfile`#

class biotite.sequence.SequenceProfile(symbols, gaps, alphabet)#

Bases: object

A SequenceProfile object stores information about a sequence profile of aligned sequences. It is possible to calculate and return its consensus sequence.

This class stores the position frequency matrix (position count matrix) ‘symbols’, which holds the number of occurrences of each alphabet symbol at each position. It also stores the number of gaps at each position in the array ‘gaps’.

With from_alignment() a SequenceProfile object can be created from any number of aligned sequences.

With probability_matrix() the position probability matrix can be calculated from ‘symbols’ and a pseudocount.

With log_odds_matrix() the position weight matrix can be calculated from the position probability matrix and the background frequencies.

With sequence_probability() the probability of a sequence can be calculated from the position probability matrix of this SequenceProfile object.

With sequence_score() the score of a sequence can be calculated from the position weight matrix of this SequenceProfile object.

All attributes of this class are publicly accessible.

Parameters:

symbols: ndarray, dtype=int, shape=(n,k): The absolute count of each symbol at each position.
gaps: ndarray, dtype=int, shape=n: The number of gaps at each position.
alphabet: Alphabet, length=k: The alphabet of the sequences in the sequence profile.

Attributes:

symbols: ndarray, dtype=int, shape=(n,k): The absolute count of each symbol at each position.
gaps: ndarray, dtype=int, shape=n: The number of gaps at each position.
alphabet: Alphabet, length=k: The alphabet of the sequences in the sequence profile.

Examples

Create a profile from a multiple sequence alignment:

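>>> from biotite.sequence import NucleotideSequence, SequenceProfile
>>> from biotite.sequence.align import SubstitutionMatrix, align_multiple
>>> sequences = [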
...     NucleotideSequence("CGCTCATTC"),
...     NucleotideSequence("CGCTATTC"),
...     NucleotideSequence("CCCTCAATC"),
... ]
>>> msa, _, _, _ = align_multiple(
...     sequences, SubstitutionMatrix.std_nucleotide_matrix(), gap_penalty=-5
... )
>>> print(msa)
CGCTCATTC
CGCT-ATTC
CCCTCAATC
>>> profile = SequenceProfile.from_alignment(msa)
>>> print(profile)
  A C G T
0 0 3 0 0
1 0 1 2 0
2 0 3 0 0
3 0 0 0 3
4 0 2 0 0
5 3 0 0 0
6 1 0 0 2
7 0 0 0 3
8 0 3 0 0
>>> print(profile.gaps)
[0 0 0 0 1 0 0 0 0]

Slice the profile (masks and index arrays are also supported):

>>> print(profile[2:])
  A C G T
0 0 3 0 0
1 0 0 0 3
2 0 2 0 0
3 3 0 0 0
4 1 0 0 2
5 0 0 0 3
6 0 3 0 0

Use the profile to compute the position probability matrix:

>>> print(profile.probability_matrix())
[[0.000 1.000 0.000 0.000]
 [0.000 0.333 0.667 0.000]
 [0.000 1.000 0.000 0.000]
 [0.000 0.000 0.000 1.000]
 [0.000 1.000 0.000 0.000]
 [1.000 0.000 0.000 0.000]
 [0.333 0.000 0.000 0.667]
 [0.000 0.000 0.000 1.000]
 [0.000 1.000 0.000 0.000]]

static from_alignment(alignment, alphabet=None)#

Create a SequenceProfile object from an Alignment object.

The SequenceProfile parameters ‘symbols’ and ‘gaps’ are calculated from the sequences of the alignment.

Parameters:

alignment: Alignment: The Alignment object to create the SequenceProfile object from.
alphabet: Alphabet, optional: The alphabet to use for the SequenceProfile object. If no alphabet is given, it is determined from the sequences of the Alignment object.

Returns:

profile: SequenceProfile: The created SequenceProfile object.
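
The counting step can be illustrated without biotite. The following is a minimal sketch, not the library's implementation: it assumes the aligned sequences are given as plain strings in which `-` marks a gap, and `count_columns` is a hypothetical helper name introduced only for this illustration.

```python
def count_columns(aligned, alphabet):
    """Count symbols and gaps per alignment column (illustrative helper)."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    length = len(aligned[0])
    # One row per position, one column per alphabet symbol
    symbols = [[0] * len(alphabet) for _ in range(length)]
    gaps = [0] * length
    for row in aligned:
        for pos, char in enumerate(row):
            if char == "-":
                gaps[pos] += 1
            else:
                symbols[pos][index[char]] += 1
    return symbols, gaps


# The alignment from the Examples section, written as gapped strings
aligned = ["CGCTCATTC", "CGCT-ATTC", "CCCTCAATC"]
symbols, gaps = count_columns(aligned, "ACGT")
print(symbols[1])  # counts of A, C, G, T at position 1
print(gaps)
```

The resulting counts and gap array agree with the profile printed in the Examples section.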

log_odds_matrix(background_frequencies=None, pseudocount=0)#

Calculate the position weight matrix (PWM) based on the position probability matrix (PPM) (with given pseudocount) and background_frequencies. This new matrix has the same shape as ‘symbols’.

\[W(S) = \log_2 \left( \frac{P(S)}{B_S} \right)\]

\(S\): The symbol.

\(P(S)\): The probability of symbol \(S\) at the sequence position.

\(B_S\): The background frequency of symbol \(S\).

Parameters:

background_frequencies: ndarray, shape=(k,), dtype=float, optional: The background frequencies for each symbol in the alphabet. By default, a uniform distribution is assumed.
pseudocount: int, optional: Amount added to the number of observed cases in order to change the expected probability of the PPM.

Returns:

pwm: ndarray, dtype=float, shape=(n,k): The calculated position weight matrix.
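
To make the log-odds formula concrete, here is a minimal sketch in plain Python for a single position. It is not the library's code: `log_odds_row` is a hypothetical helper, and mapping a zero probability to negative infinity is an assumption of this sketch. The input row is what the formula under probability_matrix() yields for the counts 0, 1, 2, 0 with a pseudocount of 1.

```python
import math


def log_odds_row(probabilities, background=None):
    """Apply W(S) = log2(P(S) / B_S) to one position (illustrative helper)."""
    k = len(probabilities)
    if background is None:
        # Uniform background, matching the documented default
        background = [1 / k] * k
    return [
        # Assumption of this sketch: a zero probability maps to -inf
        math.log2(p / b) if p > 0 else -math.inf
        for p, b in zip(probabilities, background)
    ]


# Probabilities of A, C, G, T for the counts 0, 1, 2, 0 and pseudocount 1
ppm_row = [0.0625, 0.3125, 0.5625, 0.0625]
weights = log_odds_row(ppm_row)
print(weights)
```

Symbols that are rarer than the background receive a negative weight, and symbols that are more frequent receive a positive one. A nonzero pseudocount keeps every probability above zero, so all weights stay finite.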

probability_matrix(pseudocount=0)#

Calculate the position probability matrix (PPM) based on ‘symbols’ and the given pseudocount. This new matrix has the same shape as ‘symbols’.

\[P(S) = \frac {C_S + \frac{c_p}{k}} {\sum_{i} C_i + c_p}\]

\(S\): The symbol.

\(C_S\): The count of symbol \(S\) at the sequence position.

\(c_p\): The pseudocount.

\(k\): Length of the alphabet.

Parameters:

pseudocount: int, optional: Amount added to the number of observed cases in order to change the expected probability of the PPM.

Returns:

ppm: ndarray, dtype=float, shape=(n,k): The calculated position probability matrix.
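
The effect of the pseudocount is easiest to see on a single position. The sketch below applies the formula above in plain Python; `probability_row` is a hypothetical helper for illustration and is not part of biotite.

```python
def probability_row(counts, pseudocount=0):
    """Apply P(S) = (C_S + c_p / k) / (sum_i C_i + c_p) to one position."""
    k = len(counts)
    total = sum(counts) + pseudocount
    return [(count + pseudocount / k) / total for count in counts]


counts = [0, 1, 2, 0]  # position 1 of the example profile (A, C, G, T)
no_pseudo = probability_row(counts)
with_pseudo = probability_row(counts, pseudocount=1)
print(no_pseudo)
print(with_pseudo)
```

Without a pseudocount, unobserved symbols get a probability of exactly zero. With a pseudocount of 1, the added count is spread evenly over the four symbols, so each row still sums to 1 but no entry is zero.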

sequence_probability(sequence, pseudocount=0)#

Calculate the probability of a sequence based on the position probability matrix (PPM).

The sequence probability is the product of the probabilities of the respective symbols over all sequence positions.

Parameters:

sequence: Sequence: The input sequence.
pseudocount: int, optional: Amount added to the number of observed cases in order to change the expected probability of the PPM.

Returns:

probability: float: The calculated probability for the input sequence based on the PPM.
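
As a worked illustration of the product, the sketch below reuses the position probability matrix from the Examples section, written as exact fractions. It is plain Python rather than the library's code; `sequence_probability_sketch` is a hypothetical name, and requiring the sequence to be as long as the profile is an assumption of this sketch.

```python
import math

# Position probability matrix from the Examples section; columns are A, C, G, T
ppm = [
    [0, 1, 0, 0],
    [0, 1 / 3, 2 / 3, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [1 / 3, 0, 0, 2 / 3],
    [0, 0, 0, 1],
    [0, 1, 0, 0],
]
alphabet = "ACGT"


def sequence_probability_sketch(sequence, ppm, alphabet):
    """Multiply the per-position probabilities of the sequence's symbols."""
    if len(sequence) != len(ppm):
        # Assumption of this sketch: lengths must match
        raise ValueError("sequence and profile must have the same length")
    return math.prod(
        row[alphabet.index(symbol)] for row, symbol in zip(ppm, sequence)
    )


p_match = sequence_probability_sketch("CGCTCATTC", ppm, alphabet)
p_mismatch = sequence_probability_sketch("AGCTCATTC", ppm, alphabet)
print(p_match, p_mismatch)
```

The first sequence differs from certainty only at positions 1 and 6, giving 2/3 · 2/3 = 4/9. The second sequence starts with a symbol never observed at position 0, so without a pseudocount its probability is zero.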

sequence_score(sequence, background_frequencies=None, pseudocount=0)#

Calculate the score of a sequence based on the position weight matrix (PWM).

The score is the sum of the weights (log-odds scores) of the respective symbols over all sequence positions.

Parameters:

sequence: Sequence: The input sequence.
background_frequencies: ndarray, shape=(k,), dtype=float, optional: The background frequencies for each symbol in the alphabet. By default, a uniform distribution is assumed.
pseudocount: int, optional: Amount added to the number of observed cases in order to change the expected probability of the PPM.

Returns:

score: float: The calculated score for the input sequence based on the PWM.
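
The summation can be sketched in plain Python by combining the two formulas given for probability_matrix() and log_odds_matrix(). This is an illustration under stated assumptions, not the library's implementation: `sequence_score_sketch` is a hypothetical name, the background is taken as uniform, and the pseudocount must be positive here so that no logarithm of zero occurs.

```python
import math


def sequence_score_sketch(sequence, counts, alphabet, pseudocount):
    """Sum the log-odds weights of the sequence's symbols (illustrative)."""
    k = len(alphabet)
    background = 1 / k  # uniform background frequency (assumption)
    score = 0.0
    for row, symbol in zip(counts, sequence):
        count = row[alphabet.index(symbol)]
        # Position probability with pseudocount, then its log-odds weight
        probability = (count + pseudocount / k) / (sum(row) + pseudocount)
        score += math.log2(probability / background)
    return score


# Counts of A, C, G, T at the first three positions of the example profile
counts = [[0, 3, 0, 0], [0, 1, 2, 0], [0, 3, 0, 0]]
score_cgc = sequence_score_sketch("CGC", counts, "ACGT", pseudocount=1)
score_aaa = sequence_score_sketch("AAA", counts, "ACGT", pseudocount=1)
print(score_cgc, score_aaa)
```

A sequence that follows the most frequent symbols scores above zero, while one built from unobserved symbols scores below zero, because each of its symbols is less likely than under the background.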

to_consensus(as_general=False)#

Get the consensus sequence for this SequenceProfile object.

Parameters:

as_general: bool, optional: If true, the consensus sequence is returned as a GeneralSequence object. Otherwise, the type of the consensus sequence is chosen based on the alphabet of this SequenceProfile object.

Returns:

consensus: Sequence: The calculated consensus sequence.
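
One simple way to think about a consensus is a per-position majority vote over the count matrix. The sketch below shows only that idea in plain Python. `majority_consensus` is a hypothetical helper, ties are resolved in favor of the first symbol in the alphabet, and gaps are ignored; how to_consensus() itself treats ties or gap-rich positions is not described here, and this sketch makes no attempt to match it.

```python
def majority_consensus(counts, alphabet):
    """Pick the most frequent symbol per position (ties: first symbol wins)."""
    return "".join(
        # max() returns the first index holding the largest count
        alphabet[max(range(len(row)), key=row.__getitem__)]
        for row in counts
    )


# Count matrix of the example profile; columns are A, C, G, T
counts = [
    [0, 3, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 0, 0],
    [0, 0, 0, 3],
    [0, 2, 0, 0],
    [3, 0, 0, 0],
    [1, 0, 0, 2],
    [0, 0, 0, 3],
    [0, 3, 0, 0],
]
consensus = majority_consensus(counts, "ACGT")
print(consensus)
```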

Gallery#

Conservation of binding site

Sequence logo of sequences with equal length

Identification of a binding site by sequence conservation
