Profiles and position-specific scoring matrices#

Often sequences are not viewed in isolation: For example, if you investigate a protein family, you do not handle a single sequence, but an arbitrarily large collection of highly similar sequences. At some point handling those sequences as a mere collection of individual sequences becomes impractical for answering common questions, like which are the prevalent amino acids at a certain positions of the sequences.

Sequence profiles#

This is where sequence profiles come into play: They condense a collection of aligned sequences into a matrix that tracks the frequency of each symbol at each position of the alignment. Hence, asking the questions such as ‘How frequent is an alanine at the n-th position?’ becomes a trivial indexing operation.

As example for profiles, we will reuse the cyclotide sequence family from the previous chapter.

from tempfile import NamedTemporaryFile
import biotite.sequence.io.fasta as fasta
import biotite.database.entrez as entrez

query = (
    entrez.SimpleQuery("Cyclotide") &
    entrez.SimpleQuery("cter") &
    entrez.SimpleQuery("srcdb_swiss-prot", field="Properties") ^
    entrez.SimpleQuery("Precursor")
)
uids = entrez.search(query, "protein")
temp_file = NamedTemporaryFile(suffix=".fasta", delete=False)
fasta_file = fasta.FastaFile.read(
    entrez.fetch_single_file(uids, temp_file.name, "protein", "fasta")
)
sequences = {
    # The cyclotide variant is the last character in the header
    header[-1]: seq for header, seq in fasta.get_sequences(fasta_file).items()
}
# Extract cyclotide N as query sequence for later
query = sequences.pop("N")

To create a profile, we first need to align the sequences, so corresponding symbols appear the same position. Then we can create a SequenceProfile object from the Alignment, which simply counts for each alignment column (i.e. the sequence position) the number of occurrences for each symbol.

import biotite.sequence as seq
import biotite.sequence.align as align

alignment, _, _, _ = seq.align.align_multiple(
    list(sequences.values()),
    align.SubstitutionMatrix.std_protein_matrix(),
    gap_penalty=-5,
)
profile = seq.SequenceProfile.from_alignment(alignment)
print(profile)
    A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y  B  Z  X  *
 0  0  0  0  0  0 11  2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 1  0  0  0  2  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0
 2  0  0  0  0  0  0  0  7  0  2  0  0  0  0  0  0  0  2  0  0  0  0  0  0
 3  0  0  0  0  0  0  0  0  0  0  0  0 13  0  0  0  0  0  0  0  0  0  0  0
 4  0 13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 5  3  0  0  0  0 10  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 6  0  0  0 13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 7  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 13  0  0  0  0  0  0  0  0
 8  0 13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 9  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 13  0  0  0  0  0  0
10  0  0  0  0 10  0  0  0  0  0  0  0  0  0  0  0  0  0  3  0  0  0  0  0
11  0  0  0  0  0  0  0 13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
12  0  0  0  0  0  0  0  0  0  0  0  0 13  0  0  0  0  0  0  0  0  0  0  0
13  0 13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
14  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  4  0  0  0  0  0  0  0
15  0  0  0  0  0  0  0  9  0  0  0  0  0  0  0  0  0  4  0  0  0  0  0  0
16  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  2 11  0  0  0  0  0  0  0
17  5  0  0  0  0  2  0  0  0  0  0  0  0  0  0  1  5  0  0  0  0  0  0  0
18  1  0  0  0  0  0  0  2  0  4  0  0  0  0  0  0  0  6  0  0  0  0  0  0
19  2  0  0  0  0  0  0  2  0  4  0  0  0  0  0  0  0  5  0  0  0  0  0  0
20  0  0  0  0  0 13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
21  0 13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
22  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 13  0  0  0  0  0  0  0  0
23  0 13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
24  0  0  0  0  0  0  0  0 13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
25  0  0  4  0  0  0  0  0  0  0  0  6  0  0  0  3  0  0  0  0  0  0  0  0
26  0  0  0  0  0  0  0  0 13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
27  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 13  0  0  0  0  0  0
28  0 13  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
29  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 13  0  0  0  0
30  0  0  0  0  0  0  0  1  1  4  0  2  0  0  3  0  0  0  0  0  0  0  0  0
31  0  0  6  0  0  0  0  0  0  0  0  7  0  0  0  0  0  0  0  0  0  0  0  0

Each row in the displayed count matrix (accessible via SequenceProfile.symbols) refers to a single position, i.e. a column in the input MSA, and each column refers to a symbol in the underlying alphabet (accessible via SequenceProfile.alphabet). For completeness it should be noted that SequenceProfile.gaps also tracks the gaps for each position in the alignment, but we will not further use them in this tutorial.

Note that the information about the individual sequences is lost in the condensation process: There is no way to reconstruct the original sequences from the profile. However, we can still extract a consensus sequence from the profile, which is a sequence that represents the most frequent symbol at each position.

print(profile.to_consensus())
GEIPCGESCVFIPCTITAVVGCSCKNKVCYLN

Position-specific scoring matrices#

Sequence profiles can achieve even more: The substitution matrices we have seen so far assign a score to a pair of two symbols, regardless of their position in a sequence. However, as discussed above, the position is crucial information to determine how conserved a certain symbol is in a family of sequences.

Hence, we extend the concept of substitution matrices to position-specific scoring matrices, which assign a score to a symbol and a position (instead of another symbols). A typical way to create a position-specific scoring matrix is to use the log-odds matrix from a profile.

# For a real analysis amino acid background frequencies should be provided
log_odds = profile.log_odds_matrix(pseudocount=1)

Now, we encounter a problem: To create a SubstitutionMatrix object from the log-odds matrix, we require two Alphabet objects: One is taken from the query sequence to be aligned to the profile, but which alphabet do we use for the positional axis of the matrix? Likewise, the alignment functions (e.g. align_optimal()) require two sequences to be aligned, but we only have one query sequence.

To solve this problem, biotite.sequence provides the PositionalSequence which acts as a placeholder for the second sequence in the alignment. Its alphabet contains a unique symbol for each position, i.e. the alphabet has the sought length.

positional_seq = seq.PositionalSequence(profile.to_consensus())
matrix = align.SubstitutionMatrix(
    positional_seq.alphabet,
    seq.ProteinSequence.alphabet,
    # Substitution matrices require integer scores
    # Multiply by 10 to increase value range and convert to integer
    (log_odds * 10).astype(int)
)
alignment = align.align_optimal(
    positional_seq, query, matrix, gap_penalty=-5, max_number=1
)[0]
print(alignment)
GEIP---CGES-CVFIP---CTI-TAVVG--CSCKNK---VCYL-N
G---SAFCGE-TCV---LGTC--YT----PDCSC---TALVC-LKN

Only the length of the input sequence passed to PositionalSequence matters for the alignment to work. The consensus sequence of the profile was merely chosen for cosmetic reasons, i.e. to have a meaningful string representation of the positional sequence and thus the alignment.

Further applications#

Using a sequence profile is only one way to create a position-specific scoring matrix. The combination of PositionalSequence objects and a corresponding SubstitutionMatrix allows to assign a score between two sequence positions based on any property.

As a toy example, let’s say we want to harness the information that cyclotides have strongly conserved disulfide bridges. Hence, we want the score matrix not only to reward when cysteines are paired with cysteines in general, but specifically when cysteines are paired with cysteines at the corresponding position, i.e. the first cysteine with the first cysteine, etc. As template we will use the standard BLOSUM62 substitution matrix, expanded into a position-specific scoring matrix using SubstitutionMatrix.as_positional():

cyclotide_n = query
cyclotide_c = sequences["C"]
aa_matrix = align.SubstitutionMatrix.std_protein_matrix()
# `as_positional()` expands the matrix into a position-specific scoring matrix
# for the given sequences while also giving us the positional sequences for them
pos_matrix, pos_cyclotide_n, pos_cyclotide_c = aa_matrix.as_positional(
    cyclotide_n, cyclotide_c
)
# Introduce bias for cysteine positions
score_matrix = pos_matrix.score_matrix().copy()
cys_code = seq.ProteinSequence.alphabet.encode("C")
score_matrix[cyclotide_n.code == cys_code, cyclotide_c.code == cys_code] = 100
disulfide_matrix = align.SubstitutionMatrix(
    pos_matrix.get_alphabet1(),
    pos_matrix.get_alphabet2(),
    score_matrix,
)
print(disulfide_matrix)
    G   V   P   C   A   E   S   C   V   W   I   P   C   T   V   T   A   L   L   G   C   S   C   K   D   K   V   C   Y   L   D
G   6  -3  -2  -3   0  -2   0  -3  -3  -2  -4  -2  -3  -2  -3  -2   0  -4  -4   6  -3   0  -3  -2  -1  -2  -3  -3  -3  -4  -1
S   0  -2  -1  -1   1   0   4  -1  -2  -3  -2  -1  -1   1  -2   1   1  -2  -2   0  -1   4  -1   0   0   0  -2  -1  -2  -2   0
A   0   0  -1   0   4  -1   1   0   0  -3  -1  -1   0   0   0   0   4  -1  -1   0   0   1   0  -1  -2  -1   0   0  -2  -1  -2
F  -3  -1  -4  -2  -2  -3  -2  -2  -1   1   0  -4  -2  -2  -1  -2  -2   0   0  -3  -2  -2  -2  -3  -3  -3  -1  -2   3   0  -3
C  -3  -1  -3 100   0  -4  -1   9  -1  -2  -1  -3   9  -1  -1  -1   0  -1  -1  -3   9  -1   9  -3  -3  -3  -1   9  -2  -1  -3
G   6  -3  -2  -3   0  -2   0  -3  -3  -2  -4  -2  -3  -2  -3  -2   0  -4  -4   6  -3   0  -3  -2  -1  -2  -3  -3  -3  -4  -1
E  -2  -2  -1  -4  -1   5   0  -4  -2  -3  -3  -1  -4  -1  -2  -1  -1  -3  -3  -2  -4   0  -4   1   2   1  -2  -4  -2  -3   2
T  -2   0  -1  -1   0  -1   1  -1   0  -2  -1  -1  -1   5   0   5   0  -1  -1  -2  -1   1  -1  -1  -1  -1   0  -1  -2  -1  -1
C  -3  -1  -3   9   0  -4  -1 100  -1  -2  -1  -3   9  -1  -1  -1   0  -1  -1  -3   9  -1   9  -3  -3  -3  -1   9  -2  -1  -3
V  -3   4  -2  -1   0  -2  -2  -1   4  -3   3  -2  -1   0   4   0   0   1   1  -3  -1  -2  -1  -2  -3  -2   4  -1  -1   1  -3
L  -4   1  -3  -1  -1  -3  -2  -1   1  -2   2  -3  -1  -1   1  -1  -1   4   4  -4  -1  -2  -1  -2  -4  -2   1  -1  -1   4  -4
G   6  -3  -2  -3   0  -2   0  -3  -3  -2  -4  -2  -3  -2  -3  -2   0  -4  -4   6  -3   0  -3  -2  -1  -2  -3  -3  -3  -4  -1
T  -2   0  -1  -1   0  -1   1  -1   0  -2  -1  -1  -1   5   0   5   0  -1  -1  -2  -1   1  -1  -1  -1  -1   0  -1  -2  -1  -1
C  -3  -1  -3   9   0  -4  -1   9  -1  -2  -1  -3 100  -1  -1  -1   0  -1  -1  -3   9  -1   9  -3  -3  -3  -1   9  -2  -1  -3
Y  -3  -1  -3  -2  -2  -2  -2  -2  -1   2  -1  -3  -2  -2  -1  -2  -2  -1  -1  -3  -2  -2  -2  -2  -3  -2  -1  -2   7  -1  -3
T  -2   0  -1  -1   0  -1   1  -1   0  -2  -1  -1  -1   5   0   5   0  -1  -1  -2  -1   1  -1  -1  -1  -1   0  -1  -2  -1  -1
P  -2  -2   7  -3  -1  -1  -1  -3  -2  -4  -3   7  -3  -1  -2  -1  -1  -3  -3  -2  -3  -1  -3  -1  -1  -1  -2  -3  -3  -3  -1
D  -1  -3  -1  -3  -2   2   0  -3  -3  -4  -3  -1  -3  -1  -3  -1  -2  -4  -4  -1  -3   0  -3  -1   6  -1  -3  -3  -3  -4   6
C  -3  -1  -3   9   0  -4  -1   9  -1  -2  -1  -3   9  -1  -1  -1   0  -1  -1  -3 100  -1   9  -3  -3  -3  -1   9  -2  -1  -3
S   0  -2  -1  -1   1   0   4  -1  -2  -3  -2  -1  -1   1  -2   1   1  -2  -2   0  -1   4  -1   0   0   0  -2  -1  -2  -2   0
C  -3  -1  -3   9   0  -4  -1   9  -1  -2  -1  -3   9  -1  -1  -1   0  -1  -1  -3   9  -1 100  -3  -3  -3  -1   9  -2  -1  -3
T  -2   0  -1  -1   0  -1   1  -1   0  -2  -1  -1  -1   5   0   5   0  -1  -1  -2  -1   1  -1  -1  -1  -1   0  -1  -2  -1  -1
A   0   0  -1   0   4  -1   1   0   0  -3  -1  -1   0   0   0   0   4  -1  -1   0   0   1   0  -1  -2  -1   0   0  -2  -1  -2
L  -4   1  -3  -1  -1  -3  -2  -1   1  -2   2  -3  -1  -1   1  -1  -1   4   4  -4  -1  -2  -1  -2  -4  -2   1  -1  -1   4  -4
V  -3   4  -2  -1   0  -2  -2  -1   4  -3   3  -2  -1   0   4   0   0   1   1  -3  -1  -2  -1  -2  -3  -2   4  -1  -1   1  -3
C  -3  -1  -3   9   0  -4  -1   9  -1  -2  -1  -3   9  -1  -1  -1   0  -1  -1  -3   9  -1   9  -3  -3  -3  -1 100  -2  -1  -3
L  -4   1  -3  -1  -1  -3  -2  -1   1  -2   2  -3  -1  -1   1  -1  -1   4   4  -4  -1  -2  -1  -2  -4  -2   1  -1  -1   4  -4
K  -2  -2  -1  -3  -1   1   0  -3  -2  -3  -3  -1  -3  -1  -2  -1  -1  -2  -2  -2  -3   0  -3   5  -1   5  -2  -3  -2  -2  -1
N   0  -3  -2  -3  -2   0   1  -3  -3  -4  -3  -2  -3   0  -3   0  -2  -3  -3   0  -3   1  -3   0   1   0  -3  -3  -2  -3   1

You might notice the new shape of the substitution matrix: Instead of spanning the 24 x 24 amino acids including ambiguous symbols, it now matches the lengths of the two input sequences. In this matrix you can spot the biased scores at the matching cysteine positions.