Profiles and position-specific scoring matrices#
Often sequences are not viewed in isolation: For example, if you investigate a protein family, you do not handle a single sequence, but an arbitrarily large collection of highly similar sequences. At some point handling those sequences as a mere collection of individual sequences becomes impractical for answering common questions, like which are the prevalent amino acids at a certain positions of the sequences.
Sequence profiles#
This is where sequence profiles come into play: They condense a collection of aligned sequences into a matrix that tracks the frequency of each symbol at each position of the alignment. Hence, asking the questions such as ‘How frequent is an alanine at the n-th position?’ becomes a trivial indexing operation.
As example for profiles, we will reuse the cyclotide sequence family from the previous chapter.
from tempfile import NamedTemporaryFile
import biotite.sequence.io.fasta as fasta
import biotite.database.entrez as entrez
query = (
entrez.SimpleQuery("Cyclotide") &
entrez.SimpleQuery("cter") &
entrez.SimpleQuery("srcdb_swiss-prot", field="Properties") ^
entrez.SimpleQuery("Precursor")
)
uids = entrez.search(query, "protein")
temp_file = NamedTemporaryFile(suffix=".fasta", delete=False)
fasta_file = fasta.FastaFile.read(
entrez.fetch_single_file(uids, temp_file.name, "protein", "fasta")
)
sequences = {
# The cyclotide variant is the last character in the header
header[-1]: seq for header, seq in fasta.get_sequences(fasta_file).items()
}
# Extract cyclotide N as query sequence for later
query = sequences.pop("N")
To create a profile, we first need to align the sequences, so corresponding symbols
appear the same position.
Then we can create a SequenceProfile object from the Alignment, which
simply counts for each alignment column (i.e. the sequence position) the number of
occurrences for each symbol.
import biotite.sequence as seq
import biotite.sequence.align as align
alignment, _, _, _ = seq.align.align_multiple(
list(sequences.values()),
align.SubstitutionMatrix.std_protein_matrix(),
gap_penalty=-5,
)
profile = seq.SequenceProfile.from_alignment(alignment)
print(profile)
A C D E F G H I K L M N P Q R S T V W Y B Z X *
0 0 0 0 0 0 11 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 7 0 2 0 0 0 0 0 0 0 2 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 13 0 0 0 0 0 0 0 0 0 0 0
4 0 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 3 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 0 0 0 0 0 0 0 0
8 0 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 0 0 0 0 0 0
10 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0
11 0 0 0 0 0 0 0 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0 0 0 0 0 0 13 0 0 0 0 0 0 0 0 0 0 0
13 0 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0
15 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 11 0 0 0 0 0 0 0
17 5 0 0 0 0 2 0 0 0 0 0 0 0 0 0 1 5 0 0 0 0 0 0 0
18 1 0 0 0 0 0 0 2 0 4 0 0 0 0 0 0 0 6 0 0 0 0 0 0
19 2 0 0 0 0 0 0 2 0 4 0 0 0 0 0 0 0 5 0 0 0 0 0 0
20 0 0 0 0 0 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
21 0 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
22 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 0 0 0 0 0 0 0 0
23 0 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
24 0 0 0 0 0 0 0 0 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
25 0 0 4 0 0 0 0 0 0 0 0 6 0 0 0 3 0 0 0 0 0 0 0 0
26 0 0 0 0 0 0 0 0 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 0 0 0 0 0 0
28 0 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13 0 0 0 0
30 0 0 0 0 0 0 0 1 1 4 0 2 0 0 3 0 0 0 0 0 0 0 0 0
31 0 0 6 0 0 0 0 0 0 0 0 7 0 0 0 0 0 0 0 0 0 0 0 0
Each row in the displayed count matrix
(accessible via SequenceProfile.symbols) refers to a single position, i.e. a
column in the input MSA, and each column refers to a symbol in the underlying alphabet
(accessible via SequenceProfile.alphabet).
For completeness it should be noted that SequenceProfile.gaps also tracks the
gaps for each position in the alignment, but we will not further use them in this
tutorial.
Note that the information about the individual sequences is lost in the condensation process: There is no way to reconstruct the original sequences from the profile. However, we can still extract a consensus sequence from the profile, which is a sequence that represents the most frequent symbol at each position.
print(profile.to_consensus())
GEIPCGESCVFIPCTITAVVGCSCKNKVCYLN
Profile visualization as sequence logo#
A common way to visualize a sequence profile is a sequence logo. It depicts each profile position as a stack of letters: The degree of conservation (more precisely the Shannon entropy) is the height of a stack, and each letter’s height in the stack is proportional to its frequency at the respective position.
import matplotlib.pyplot as plt
from biotite.sequence.graphics import plot_sequence_logo
fig, ax = plt.subplots(figsize=(8.0, 2.0), constrained_layout=True)
plot_sequence_logo(ax, profile)
ax.set_xlabel("Residue position")
ax.set_ylabel("Bits")
Text(0, 0.5, 'Bits')
Position-specific scoring matrices#
Sequence profiles can achieve even more: The substitution matrices we have seen so far assign a score to a pair of two symbols, regardless of their position in a sequence. However, as discussed above, the position is crucial information to determine how conserved a certain symbol is in a family of sequences.
Hence, we extend the concept of substitution matrices to position-specific scoring matrices, which assign a score to a symbol and a position (instead of another symbols). A typical way to create a position-specific scoring matrix is to use the log-odds matrix from a profile.
# For a real analysis amino acid background frequencies should be provided
log_odds = profile.log_odds_matrix(pseudocount=1)
Now, we encounter a problem:
To create a SubstitutionMatrix object from the log-odds matrix, we require two
Alphabet objects:
One is taken from the query sequence to be aligned to the profile, but which alphabet
do we use for the positional axis of the matrix?
Likewise, the alignment functions (e.g. align_optimal()) require two sequences
to be aligned, but we only have one query sequence.
To solve this problem, biotite.sequence provides the PositionalSequence
which acts as a placeholder for the second sequence in the alignment.
Its alphabet contains a unique symbol for each position, i.e. the alphabet has the
sought length.
positional_seq = seq.PositionalSequence(profile.to_consensus())
matrix = align.SubstitutionMatrix(
positional_seq.alphabet,
seq.ProteinSequence.alphabet,
# Substitution matrices require integer scores
# Multiply by 10 to increase value range and convert to integer
(log_odds * 10).astype(int)
)
alignment = align.align_optimal(
positional_seq, query, matrix, gap_penalty=-5, max_number=1
)[0]
print(alignment)
GEIP---CGES-CVFIP---CTI-TAVVG--CSCKNK---VCYL-N
G---SAFCGE-TCV---LGTC--YT----PDCSC---TALVC-LKN
Only the length of the input sequence passed to PositionalSequence matters for
the alignment to work.
The consensus sequence of the profile was merely chosen for cosmetic reasons, i.e.
to have a meaningful string representation of the positional sequence and thus the
alignment.
Further applications#
Using a sequence profile is only one way to create a position-specific scoring matrix.
The combination of PositionalSequence objects and a corresponding
SubstitutionMatrix allows to assign a score between two sequence positions
based on any property.
As a toy example, let’s say we want to harness the information that cyclotides have
strongly conserved disulfide bridges.
Hence, we want the score matrix not only to reward when cysteines are paired with
cysteines in general, but specifically when cysteines are paired with cysteines at the
corresponding position, i.e. the first cysteine with the first cysteine, etc.
As template we will use the standard BLOSUM62 substitution matrix, expanded into
a position-specific scoring matrix using SubstitutionMatrix.as_positional():
cyclotide_n = query
cyclotide_c = sequences["C"]
aa_matrix = align.SubstitutionMatrix.std_protein_matrix()
# `as_positional()` expands the matrix into a position-specific scoring matrix
# for the given sequences while also giving us the positional sequences for them
pos_matrix, pos_cyclotide_n, pos_cyclotide_c = aa_matrix.as_positional(
cyclotide_n, cyclotide_c
)
# Introduce bias for cysteine positions
score_matrix = pos_matrix.score_matrix().copy()
cys_code = seq.ProteinSequence.alphabet.encode("C")
score_matrix[cyclotide_n.code == cys_code, cyclotide_c.code == cys_code] = 100
disulfide_matrix = align.SubstitutionMatrix(
pos_matrix.get_alphabet1(),
pos_matrix.get_alphabet2(),
score_matrix,
)
print(disulfide_matrix)
G V P C A E S C V W I P C T V T A L L G C S C K D K V C Y L D
G 6 -3 -2 -3 0 -2 0 -3 -3 -2 -4 -2 -3 -2 -3 -2 0 -4 -4 6 -3 0 -3 -2 -1 -2 -3 -3 -3 -4 -1
S 0 -2 -1 -1 1 0 4 -1 -2 -3 -2 -1 -1 1 -2 1 1 -2 -2 0 -1 4 -1 0 0 0 -2 -1 -2 -2 0
A 0 0 -1 0 4 -1 1 0 0 -3 -1 -1 0 0 0 0 4 -1 -1 0 0 1 0 -1 -2 -1 0 0 -2 -1 -2
F -3 -1 -4 -2 -2 -3 -2 -2 -1 1 0 -4 -2 -2 -1 -2 -2 0 0 -3 -2 -2 -2 -3 -3 -3 -1 -2 3 0 -3
C -3 -1 -3 100 0 -4 -1 9 -1 -2 -1 -3 9 -1 -1 -1 0 -1 -1 -3 9 -1 9 -3 -3 -3 -1 9 -2 -1 -3
G 6 -3 -2 -3 0 -2 0 -3 -3 -2 -4 -2 -3 -2 -3 -2 0 -4 -4 6 -3 0 -3 -2 -1 -2 -3 -3 -3 -4 -1
E -2 -2 -1 -4 -1 5 0 -4 -2 -3 -3 -1 -4 -1 -2 -1 -1 -3 -3 -2 -4 0 -4 1 2 1 -2 -4 -2 -3 2
T -2 0 -1 -1 0 -1 1 -1 0 -2 -1 -1 -1 5 0 5 0 -1 -1 -2 -1 1 -1 -1 -1 -1 0 -1 -2 -1 -1
C -3 -1 -3 9 0 -4 -1 100 -1 -2 -1 -3 9 -1 -1 -1 0 -1 -1 -3 9 -1 9 -3 -3 -3 -1 9 -2 -1 -3
V -3 4 -2 -1 0 -2 -2 -1 4 -3 3 -2 -1 0 4 0 0 1 1 -3 -1 -2 -1 -2 -3 -2 4 -1 -1 1 -3
L -4 1 -3 -1 -1 -3 -2 -1 1 -2 2 -3 -1 -1 1 -1 -1 4 4 -4 -1 -2 -1 -2 -4 -2 1 -1 -1 4 -4
G 6 -3 -2 -3 0 -2 0 -3 -3 -2 -4 -2 -3 -2 -3 -2 0 -4 -4 6 -3 0 -3 -2 -1 -2 -3 -3 -3 -4 -1
T -2 0 -1 -1 0 -1 1 -1 0 -2 -1 -1 -1 5 0 5 0 -1 -1 -2 -1 1 -1 -1 -1 -1 0 -1 -2 -1 -1
C -3 -1 -3 9 0 -4 -1 9 -1 -2 -1 -3 100 -1 -1 -1 0 -1 -1 -3 9 -1 9 -3 -3 -3 -1 9 -2 -1 -3
Y -3 -1 -3 -2 -2 -2 -2 -2 -1 2 -1 -3 -2 -2 -1 -2 -2 -1 -1 -3 -2 -2 -2 -2 -3 -2 -1 -2 7 -1 -3
T -2 0 -1 -1 0 -1 1 -1 0 -2 -1 -1 -1 5 0 5 0 -1 -1 -2 -1 1 -1 -1 -1 -1 0 -1 -2 -1 -1
P -2 -2 7 -3 -1 -1 -1 -3 -2 -4 -3 7 -3 -1 -2 -1 -1 -3 -3 -2 -3 -1 -3 -1 -1 -1 -2 -3 -3 -3 -1
D -1 -3 -1 -3 -2 2 0 -3 -3 -4 -3 -1 -3 -1 -3 -1 -2 -4 -4 -1 -3 0 -3 -1 6 -1 -3 -3 -3 -4 6
C -3 -1 -3 9 0 -4 -1 9 -1 -2 -1 -3 9 -1 -1 -1 0 -1 -1 -3 100 -1 9 -3 -3 -3 -1 9 -2 -1 -3
S 0 -2 -1 -1 1 0 4 -1 -2 -3 -2 -1 -1 1 -2 1 1 -2 -2 0 -1 4 -1 0 0 0 -2 -1 -2 -2 0
C -3 -1 -3 9 0 -4 -1 9 -1 -2 -1 -3 9 -1 -1 -1 0 -1 -1 -3 9 -1 100 -3 -3 -3 -1 9 -2 -1 -3
T -2 0 -1 -1 0 -1 1 -1 0 -2 -1 -1 -1 5 0 5 0 -1 -1 -2 -1 1 -1 -1 -1 -1 0 -1 -2 -1 -1
A 0 0 -1 0 4 -1 1 0 0 -3 -1 -1 0 0 0 0 4 -1 -1 0 0 1 0 -1 -2 -1 0 0 -2 -1 -2
L -4 1 -3 -1 -1 -3 -2 -1 1 -2 2 -3 -1 -1 1 -1 -1 4 4 -4 -1 -2 -1 -2 -4 -2 1 -1 -1 4 -4
V -3 4 -2 -1 0 -2 -2 -1 4 -3 3 -2 -1 0 4 0 0 1 1 -3 -1 -2 -1 -2 -3 -2 4 -1 -1 1 -3
C -3 -1 -3 9 0 -4 -1 9 -1 -2 -1 -3 9 -1 -1 -1 0 -1 -1 -3 9 -1 9 -3 -3 -3 -1 100 -2 -1 -3
L -4 1 -3 -1 -1 -3 -2 -1 1 -2 2 -3 -1 -1 1 -1 -1 4 4 -4 -1 -2 -1 -2 -4 -2 1 -1 -1 4 -4
K -2 -2 -1 -3 -1 1 0 -3 -2 -3 -3 -1 -3 -1 -2 -1 -1 -2 -2 -2 -3 0 -3 5 -1 5 -2 -3 -2 -2 -1
N 0 -3 -2 -3 -2 0 1 -3 -3 -4 -3 -2 -3 0 -3 0 -2 -3 -3 0 -3 1 -3 0 1 0 -3 -3 -2 -3 1
You might notice the new shape of the substitution matrix: Instead of spanning the 24 x 24 amino acids including ambiguous symbols, it now matches the lengths of the two input sequences. In this matrix you can spot the biased scores at the matching cysteine positions.