Pairwise sequence alignment#

When comparing two (or more) sequences, usually an alignment needs to be performed. Two kinds of algorithms need to be distinguished here: Heuristic algorithms do not guarantee to yield the optimal alignment, but instead they are very fast. This approach will be used in a later chapter. On the other hand, there are algorithms that calculate the optimal alignment, i.e. they find the arrangement of symbols with the highest similarity score, but are quite slow.

The biotite.sequence.align package provides the function align_optimal(), which fits into the latter category. It either performs an optimal global alignment, i.e. the entire sequences are aligned, or an optimal local alignment, i.e. only the highest scoring excerpt of both sequences are aligned.

align_optimal() (as well as most other functionalities in biotite.sequence.align) can align any two Sequence objects with each other. In fact the Sequence objects can be instances from different Sequence subclasses and therefore may have different alphabets, although in the overwhelming number of use cases the alphabets of the aligned sequences will be the same. The only condition that must be satisfied, is that the SubstitutionMatrix alphabets match the alphabets of the sequences to be aligned: A SubstitutionMatrix maps a combination of two symbols, one from the first sequence the other one from the second sequence, to a similarity score. A SubstitutionMatrix object contains two alphabets with length n or m, respectively, and an (n,m)-shaped ndarray storing the similarity scores. You can choose one of many predefined matrices from an internal database or you can create a custom matrix on your own.

So much for theory. Let’s start by showing different ways to construct a SubstitutionMatrix, in our case for protein sequence alignments:

import biotite.sequence as seq
import biotite.sequence.align as align
import numpy as np

alph = seq.ProteinSequence.alphabet
# Load the standard protein substitution matrix, which is BLOSUM62
matrix = align.SubstitutionMatrix.std_protein_matrix()
print("\nBLOSUM62\n")
print(matrix)
# Load another matrix from internal database
matrix = align.SubstitutionMatrix(alph, alph, "BLOSUM50")
# Load a matrix dictionary representation,
# modify it, and create the SubstitutionMatrix
# (The dictionary could be alternatively loaded from a string containing
# the matrix in NCBI format)
matrix_dict = align.SubstitutionMatrix.dict_from_db("BLOSUM62")
matrix_dict[("P","Y")] = 100
matrix = align.SubstitutionMatrix(alph, alph, matrix_dict)
# And now create a matrix by directly providing the ndarray
# containing the similarity scores
# (identity matrix in our case)
scores = np.identity(len(alph), dtype=int)
matrix = align.SubstitutionMatrix(alph, alph, scores)
print("\n\nIdentity matrix\n")
print(matrix)

BLOSUM62

    A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y   B   Z   X   *
A   4   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -2  -2  -1   0  -4
C   0   9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -2  -3  -3  -2  -4
D  -2  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -3   4   1  -1  -4
E  -1  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -2   1   4  -1  -4
F  -2  -2  -3  -3   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1   3  -3  -3  -1  -4
G   0  -3  -1  -2  -3   6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -3  -1  -2  -1  -4
H  -2  -3  -1   0  -1  -2   8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2   2   0   0  -1  -4
I  -1  -1  -3  -3   0  -4  -3   4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1  -3  -3  -1  -4
K  -1  -3  -1   1  -3  -2  -1  -3   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -2   0   1  -1  -4
L  -1  -1  -4  -3   0  -4  -3   2  -2   4   2  -3  -3  -2  -2  -2  -1   1  -2  -1  -4  -3  -1  -4
M  -1  -1  -3  -2   0  -3  -2   1  -1   2   5  -2  -2   0  -1  -1  -1   1  -1  -1  -3  -1  -1  -4
N  -2  -3   1   0  -3   0   1  -3   0  -3  -2   6  -2   0   0   1   0  -3  -4  -2   3   0  -1  -4
P  -1  -3  -1  -1  -4  -2  -2  -3  -1  -3  -2  -2   7  -1  -2  -1  -1  -2  -4  -3  -2  -1  -2  -4
Q  -1  -3   0   2  -3  -2   0  -3   1  -2   0   0  -1   5   1   0  -1  -2  -2  -1   0   3  -1  -4
R  -1  -3  -2   0  -3  -2   0  -3   2  -2  -1   0  -2   1   5  -1  -1  -3  -3  -2  -1   0  -1  -4
S   1  -1   0   0  -2   0  -1  -2   0  -2  -1   1  -1   0  -1   4   1  -2  -3  -2   0   0   0  -4
T   0  -1  -1  -1  -2  -2  -2  -1  -1  -1  -1   0  -1  -1  -1   1   5   0  -2  -2  -1  -1   0  -4
V   0  -1  -3  -2  -1  -3  -3   3  -2   1   1  -3  -2  -2  -3  -2   0   4  -3  -1  -3  -2  -1  -4
W  -3  -2  -4  -3   1  -2  -2  -3  -3  -2  -1  -4  -4  -2  -3  -3  -2  -3  11   2  -4  -3  -2  -4
Y  -2  -2  -3  -2   3  -3   2  -1  -2  -1  -1  -2  -3  -1  -2  -2  -2  -1   2   7  -3  -2  -1  -4
B  -2  -3   4   1  -3  -1   0  -3   0  -4  -3   3  -2   0  -1   0  -1  -3  -4  -3   4   1  -1  -4
Z  -1  -3   1   4  -3  -2   0  -3   1  -3  -1   0  -1   3   0   0  -1  -2  -3  -2   1   4  -1  -4
X   0  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2  -1  -1   0   0  -1  -2  -1  -1  -1  -1  -4
*  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4   1


Identity matrix

    A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y   B   Z   X   *
A   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
C   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
D   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
E   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
F   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
G   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
H   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
I   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
K   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
L   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0
M   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0
N   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0   0
P   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0   0
Q   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
R   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0   0
S   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   0
T   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0
V   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0   0
W   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0   0
Y   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0   0
B   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
Z   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0   0
X   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0
*   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1

For our protein sequence alignment we will use the standard BLOSUM62 matrix. The final decision we need to make is the gap penalty: It defines the score penalty for inserting a gap into the alignment. Here we can choose between a linear gap penalty (same penalty for each gap) and an affine gap penalty (high penalty for gap opening, low penalty for extension of a gap).

seq1 = seq.ProteinSequence("BIQTITE")
seq2 = seq.ProteinSequence("IQLITE")
matrix = align.SubstitutionMatrix.std_protein_matrix()
print("\nLocal alignment")
alignments = align.align_optimal(
    # A single integer indicates a linear gap penalty
    seq1, seq2, matrix, gap_penalty=-10, local=True
)
for ali in alignments:
    print(ali)
print()
print("Global alignment")
alignments = align.align_optimal(
    seq1, seq2, matrix, gap_penalty=-10, local=False
)
for ali in alignments:
    print(ali)

Local alignment
IQTITE
IQLITE

Global alignment
BIQTITE
-IQLITE

As you might have noticed, the function does not return a single alignment, but an entire list. The reason is that there can be multiple optimal alignments with the same score. Each alignment is represented by an Alignment object. This object saves the input sequences together with a so called trace - the indices to symbols in these sequences that are aligned to each other (-1 for a gap). Additionally the alignment score is stored in this object. This object can also prettyprint the alignment into a human readable form as demonstrated above.

For publication purposes you can create an actual figure using Matplotlib. You can either decide to color the symbols based on the symbol type or based on the similarity within the alignment columns. In this case we will go with the similarity visualization.

import matplotlib.pyplot as plt
import biotite.sequence.graphics as graphics

fig, ax = plt.subplots(figsize=(2.0, 0.8), constrained_layout=True)
graphics.plot_alignment_similarity_based(
    ax, alignments[0], matrix=matrix, symbols_per_line=len(alignments[0])
)

If you are interested in more advanced visualization examples, have a look at the example gallery.

You can also do some simple analysis on these objects, like determining the sequence identity or calculating the score. For further custom analysis, it can be convenient to have directly the aligned symbols codes instead of the trace.

alignment = alignments[0]
print("Score: ", alignment.score)
print("Recalculated score:", align.score(alignment, matrix=matrix))
print("Sequence identity:", align.get_sequence_identity(alignment))
print("Symbols:")
print(align.get_symbols(alignment))
print("Symbol codes:")
print(align.get_codes(alignment))

Score:  12
Recalculated score: 12
Sequence identity: 0.8333333333333334
Symbols:
[['B', 'I', 'Q', 'T', 'I', 'T', 'E'], [None, 'I', 'Q', 'L', 'I', 'T', 'E']]
Symbol codes:
[[20  7 13 16  7 16  3]
 [-1  7 13  9  7 16  3]]

Loading alignments from FASTA files#

You might wonder, why you should recalculate the score, when the score has already been directly computed via align_optimal(). The answer is that you might load an alignment from a FASTA file using get_alignment(), where the score is not provided.

from tempfile import NamedTemporaryFile
import biotite.sequence.io.fasta as fasta

temp_file = NamedTemporaryFile(suffix=".fasta", delete=False)
fasta_file = fasta.FastaFile()
fasta.set_alignment(fasta_file, alignment, seq_names=["seq_1", "seq_2"])
print(fasta_file)
fasta_file.write(temp_file.name)

fasta_file = fasta.FastaFile.read(temp_file.name)
alignment = fasta.get_alignment(fasta_file)
alignment.score = align.score(alignment, matrix=matrix)

>seq_1
BIQTITE
>seq_2
-IQLITE