`SubstitutionMatrix`#

class biotite.sequence.align.SubstitutionMatrix(alphabet1, alphabet2, score_matrix)#

Bases: object

A SubstitutionMatrix is the foundation for scoring in sequence alignments. A SubstitutionMatrix maps each possible pairing of a symbol of a first alphabet with a symbol of a second alphabet to a score (integer).

The class uses a 2-D (m x n) ndarray (dtype=numpy.int32), where each element stores the score for a symbol pairing, indexed by the symbol codes of the respective symbols in an m-length alphabet 1 and an n-length alphabet 2.
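The code-based indexing described above can be illustrated with a minimal sketch in plain Python. Nested lists stand in for the ndarray, and a symbol's code is assumed to be its index in the alphabet. The names `code1`, `code2`, `scores` and `lookup` are hypothetical and not part of the library.

```python
# Conceptual sketch only: plain lists stand in for the alphabets and the ndarray
alphabet1 = ["foo", "bar"]  # m = 2 symbols
alphabet2 = [1, 2, 3]       # n = 3 symbols

# Assumption of this sketch: a symbol's code is its index in the alphabet
code1 = {symbol: code for code, symbol in enumerate(alphabet1)}
code2 = {symbol: code for code, symbol in enumerate(alphabet2)}

# m x n score table: row = code in alphabet 1, column = code in alphabet 2
scores = [
    [5, 10, 15],
    [42, 42, 42],
]


def lookup(symbol1, symbol2):
    """Return the score of a symbol pairing via the symbols' codes."""
    return scores[code1[symbol1]][code2[symbol2]]


print(lookup("foo", 2))  # 10
```

Looking up by symbol and looking up by code address the same element, which is why both access paths yield the same score.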

There are 3 ways to create instances:

First, a 2-D ndarray containing the scores can be provided directly.

Second, a dictionary can be provided, where the keys are pairing tuples and the values are the corresponding scores. Each pairing tuple consists of a symbol of alphabet 1 as first element and a symbol of alphabet 2 as second element. Pairings have to be provided for each possible combination.

Finally, a valid matrix name can be given, which is loaded from the internal matrix database. The following matrices are available:

Nucleotide substitution matrices from NCBI database

NUC - Also usable with ambiguous alphabet

Protein substitution matrices from NCBI database

PAM<n>

BLOSUM<n>

MATCH - Only differentiates between match and mismatch

IDENTITY - Strongly penalizes mismatches

GONNET - Not usable with default protein alphabet

DAYHOFF

Corrected protein substitution matrices [1]. <BLOCKS> is the version of the BLOCKS database the matrix is based on.

BLOSUM<n>_<BLOCKS>

RBLOSUM<n>_<BLOCKS>

CorBLOSUM<n>_<BLOCKS>

Structural alphabet substitution matrices

3Di - For 3Di alphabet from foldseek [2]

PB - For Protein Blocks alphabet from PBxplore [3]

A list of all available matrix names is returned by list_db().

Since this class can handle two different alphabets, it is possible to align two different types of sequences.

Objects of this class are immutable.

Parameters:

alphabet1 (Alphabet, length=m): The first alphabet of the substitution matrix.
alphabet2 (Alphabet, length=n): The second alphabet of the substitution matrix.
score_matrix (ndarray, shape=(m,n) or dict or str): Either a symbol code indexed ndarray containing the scores, or a dictionary mapping the symbol pairings to scores, or a string referencing a matrix in the internal database.

Attributes:

shape (tuple): Get the shape (i.e. the lengths of both alphabets) of the substitution matrix.

Raises:

KeyError: If the matrix dictionary misses a symbol given in the alphabet.
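Why an incomplete dictionary fails can be seen in a small sketch that converts a pairing dictionary into an m x n table. This is an illustration under simplifying assumptions (plain lists instead of Alphabet objects and ndarrays), not the library's implementation. The helper `matrix_from_dict` is hypothetical.

```python
def matrix_from_dict(alphabet1, alphabet2, matrix_dict):
    """Build an m x n nested list from a (symbol1, symbol2) -> score mapping."""
    matrix = []
    for symbol1 in alphabet1:
        row = []
        for symbol2 in alphabet2:
            # A pairing absent from the dictionary raises KeyError here
            row.append(matrix_dict[(symbol1, symbol2)])
        matrix.append(row)
    return matrix


# All four pairings are present, so the conversion succeeds
complete = {("a", "x"): 1, ("a", "y"): -1, ("b", "x"): -1, ("b", "y"): 1}
print(matrix_from_dict(["a", "b"], ["x", "y"], complete))

# Three pairings are missing, so the first missing lookup fails
incomplete = {("a", "x"): 1}
try:
    matrix_from_dict(["a", "b"], ["x", "y"], incomplete)
except KeyError as err:
    print("missing pairing:", err)
```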

References

Examples

Creating a matrix for two different (nonsense) alphabets via a matrix dictionary:

>>> alph1 = Alphabet(["foo","bar"])
>>> alph2 = Alphabet([1,2,3])
>>> matrix_dict = {("foo",1):5,  ("foo",2):10, ("foo",3):15,
...                ("bar",1):42, ("bar",2):42, ("bar",3):42}
>>> matrix = SubstitutionMatrix(alph1, alph2, matrix_dict)
>>> print(matrix.score_matrix())
[[ 5 10 15]
 [42 42 42]]
>>> print(matrix.get_score("foo", 2))
10
>>> print(matrix.get_score_by_code(0, 1))
10

Creating an identity substitution matrix via the score matrix:

>>> alph = NucleotideSequence.alphabet_unamb
>>> matrix = SubstitutionMatrix(alph, alph, np.identity(len(alph), dtype=int))
>>> print(matrix)
    A   C   G   T
A   1   0   0   0
C   0   1   0   0
G   0   0   1   0
T   0   0   0   1

Creating a matrix via database name:

>>> alph = ProteinSequence.alphabet
>>> matrix = SubstitutionMatrix(alph, alph, "BLOSUM50")

as_positional(sequence1, sequence2)#

Transform this substitution matrix and two sequences into positional equivalents.

This means the new substitution matrix is position-specific: It has the lengths of the sequences instead of the lengths of their alphabets. Its scores represent the same scores as the original matrix, but now mapped onto the positions of the sequences.

Parameters:

sequence1, sequence2 (seq.Sequence, length=n): The sequences to create the positional equivalents from.

Returns:

pos_matrix (align.SubstitutionMatrix, shape=(n, n)): The position-specific substitution matrix.
pos_sequence1, pos_sequence2 (PositionalSequence, length=n): The positional sequences.

Notes

After the transformation the substitution scores remain the same, i.e. substitution_matrix.get_score(sequence1[i], sequence2[j]) is equal to pos_matrix.get_score(pos_sequence1[i], pos_sequence2[j]).
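The idea behind the positional transformation can be sketched without the library: each cell of the new table holds the score of the symbols found at the corresponding sequence positions. The scores in `toy_scores` are made up for illustration, and all names in this sketch are hypothetical.

```python
# Hypothetical toy scores, not taken from any real substitution matrix
toy_scores = {
    ("A", "A"): 2, ("A", "C"): -1,
    ("C", "A"): -1, ("C", "C"): 3,
}
sequence1 = "ACA"
sequence2 = "CA"

# One row per position of sequence1, one column per position of sequence2
pos_matrix = [
    [toy_scores[(symbol1, symbol2)] for symbol2 in sequence2]
    for symbol1 in sequence1
]
for row in pos_matrix:
    print(row)

# The score at positions (i, j) equals the score of the symbols at i and j
scores_preserved = all(
    pos_matrix[i][j] == toy_scores[(sequence1[i], sequence2[j])]
    for i in range(len(sequence1))
    for j in range(len(sequence2))
)
print(scores_preserved)
```

Because every position has its own row or column, a single cell can later be changed without affecting other occurrences of the same symbol.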

Examples

Run an alignment with the usual substitution matrix:

>>> seq1 = ProteinSequence("BIQTITE")
>>> seq2 = ProteinSequence("IQLITE")
>>> matrix = SubstitutionMatrix.std_protein_matrix()
>>> print(matrix)
    A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y   B   Z   X   *
A   4   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -2  -2  -1   0  -4
C   0   9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -2  -3  -3  -2  -4
D  -2  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -3   4   1  -1  -4
E  -1  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -2   1   4  -1  -4
F  -2  -2  -3  -3   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1   3  -3  -3  -1  -4
G   0  -3  -1  -2  -3   6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -3  -1  -2  -1  -4
H  -2  -3  -1   0  -1  -2   8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2   2   0   0  -1  -4
I  -1  -1  -3  -3   0  -4  -3   4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1  -3  -3  -1  -4
K  -1  -3  -1   1  -3  -2  -1  -3   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -2   0   1  -1  -4
L  -1  -1  -4  -3   0  -4  -3   2  -2   4   2  -3  -3  -2  -2  -2  -1   1  -2  -1  -4  -3  -1  -4
M  -1  -1  -3  -2   0  -3  -2   1  -1   2   5  -2  -2   0  -1  -1  -1   1  -1  -1  -3  -1  -1  -4
N  -2  -3   1   0  -3   0   1  -3   0  -3  -2   6  -2   0   0   1   0  -3  -4  -2   3   0  -1  -4
P  -1  -3  -1  -1  -4  -2  -2  -3  -1  -3  -2  -2   7  -1  -2  -1  -1  -2  -4  -3  -2  -1  -2  -4
Q  -1  -3   0   2  -3  -2   0  -3   1  -2   0   0  -1   5   1   0  -1  -2  -2  -1   0   3  -1  -4
R  -1  -3  -2   0  -3  -2   0  -3   2  -2  -1   0  -2   1   5  -1  -1  -3  -3  -2  -1   0  -1  -4
S   1  -1   0   0  -2   0  -1  -2   0  -2  -1   1  -1   0  -1   4   1  -2  -3  -2   0   0   0  -4
T   0  -1  -1  -1  -2  -2  -2  -1  -1  -1  -1   0  -1  -1  -1   1   5   0  -2  -2  -1  -1   0  -4
V   0  -1  -3  -2  -1  -3  -3   3  -2   1   1  -3  -2  -2  -3  -2   0   4  -3  -1  -3  -2  -1  -4
W  -3  -2  -4  -3   1  -2  -2  -3  -3  -2  -1  -4  -4  -2  -3  -3  -2  -3  11   2  -4  -3  -2  -4
Y  -2  -2  -3  -2   3  -3   2  -1  -2  -1  -1  -2  -3  -1  -2  -2  -2  -1   2   7  -3  -2  -1  -4
B  -2  -3   4   1  -3  -1   0  -3   0  -4  -3   3  -2   0  -1   0  -1  -3  -4  -3   4   1  -1  -4
Z  -1  -3   1   4  -3  -2   0  -3   1  -3  -1   0  -1   3   0   0  -1  -2  -3  -2   1   4  -1  -4
X   0  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2  -1  -1   0   0  -1  -2  -1  -1  -1  -1  -4
*  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4   1
>>> alignment = align_optimal(seq1, seq2, matrix, gap_penalty=-10)[0]
>>> print(alignment)
BIQTITE
-IQLITE

Running the alignment with positional equivalents gives the same result:

>>> pos_matrix, pos_seq1, pos_seq2 = matrix.as_positional(seq1, seq2)
>>> print(pos_matrix)
    I   Q   L   I   T   E
B  -3   0  -4  -3  -1   1
I   4  -3   2   4  -1  -3
Q  -3   5  -2  -3  -1   2
T  -1  -1  -1  -1   5  -1
I   4  -3   2   4  -1  -3
T  -1  -1  -1  -1   5  -1
E  -3   2  -3  -3  -1   5
>>> pos_alignment = align_optimal(pos_seq1, pos_seq2, pos_matrix, gap_penalty=-10)[0]
>>> print(pos_alignment)
BIQTITE
-IQLITE

Increase the substitution score for the first symbols of both sequences, so that they are aligned to each other:

>>> score_matrix = pos_matrix.score_matrix().copy()
>>> score_matrix[0, 0] = 100
>>> biased_matrix = SubstitutionMatrix(
...     pos_matrix.get_alphabet1(), pos_matrix.get_alphabet2(), score_matrix
... )
>>> print(biased_matrix)
    I   Q   L   I   T   E
B 100   0  -4  -3  -1   1
I   4  -3   2   4  -1  -3
Q  -3   5  -2  -3  -1   2
T  -1  -1  -1  -1   5  -1
I   4  -3   2   4  -1  -3
T  -1  -1  -1  -1   5  -1
E  -3   2  -3  -3  -1   5
>>> biased_alignment = align_optimal(pos_seq1, pos_seq2, biased_matrix, gap_penalty=-10)[0]
>>> print(biased_alignment)
BIQTITE
I-QLITE

static dict_from_db(matrix_name)#

Create a matrix dictionary from a valid matrix name in the internal matrix database.

The keys of the dictionary consist of tuples containing the aligned symbols and the values are the corresponding scores.

Parameters:

matrix_name (str): The name of the matrix in the internal database.

Returns:

matrix_dict (dict): A dictionary representing the substitution matrix.

static dict_from_str(string)#

Create a matrix dictionary from a string in NCBI matrix format.

Symbols of the first alphabet are taken from the left column, symbols of the second alphabet are taken from the top row.

The keys of the dictionary consist of tuples containing the aligned symbols and the values are the corresponding scores.

Parameters:

string (str): The string containing the substitution matrix in NCBI format.

Returns:

matrix_dict (dict): A dictionary representing the substitution matrix.
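The mapping from the text layout to dictionary keys can be sketched as follows. The sketch assumes a simplified NCBI-style layout: optional comment lines starting with `#`, a header row with the column symbols, and one row per symbol of the left column. The function `parse_ncbi_matrix` is hypothetical and is not the library's parser.

```python
def parse_ncbi_matrix(string):
    """Parse a simplified NCBI-style matrix string into a pairing dictionary."""
    # Drop empty lines and comment lines
    lines = [
        line for line in string.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    # Top row: symbols of the second alphabet
    column_symbols = lines[0].split()
    matrix_dict = {}
    for line in lines[1:]:
        fields = line.split()
        # Left column: symbol of the first alphabet, followed by its scores
        row_symbol, values = fields[0], fields[1:]
        for column_symbol, value in zip(column_symbols, values):
            matrix_dict[(row_symbol, column_symbol)] = int(value)
    return matrix_dict


# Made-up values for illustration
text = """
# Toy matrix
   A  C
A  1 -1
C -1  1
"""
toy = parse_ncbi_matrix(text)
print(toy[("A", "C")])  # -1
```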

get_alphabet1()#

Get the first alphabet.

Returns:

alphabet (Alphabet): The first alphabet.

get_alphabet2()#

Get the second alphabet.

Returns:

alphabet (Alphabet): The second alphabet.

get_score(symbol1, symbol2)#

Get the substitution score of two symbols.

Parameters:

symbol1, symbol2 (object): Symbols to be aligned.

Returns:

score (int): The substitution / alignment score.

get_score_by_code(code1, code2)#

Get the substitution score of two symbols, represented by their code.

Parameters:

code1, code2 (int): Symbol codes of the two symbols to be aligned.

Returns:

score (int): The substitution / alignment score.

is_symmetric()#

Check whether the substitution matrix is symmetric, i.e. both alphabets are identical and the score matrix is symmetric.

Returns:

is_symmetric (bool): True, if both alphabets are identical and the score matrix is symmetric, False otherwise.
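The two conditions can be spelled out in a short sketch. Plain lists stand in for the alphabets and the score matrix, and the helper `check_symmetric` is hypothetical, written only to illustrate the definition given above.

```python
def check_symmetric(alphabet1, alphabet2, matrix):
    """Return True if the alphabets are identical and the matrix is symmetric."""
    # Condition 1: both alphabets contain the same symbols in the same order
    if list(alphabet1) != list(alphabet2):
        return False
    # Condition 2: the score of (i, j) equals the score of (j, i)
    size = len(matrix)
    return all(
        matrix[i][j] == matrix[j][i]
        for i in range(size)
        for j in range(i + 1, size)
    )


print(check_symmetric("AC", "AC", [[1, -1], [-1, 1]]))  # True
print(check_symmetric("AC", "AC", [[1, -1], [0, 1]]))   # False
print(check_symmetric("AC", "GT", [[1, -1], [-1, 1]]))  # False
```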

static list_db()#

List all matrix names in the internal database.

Returns:

db_list (list): List of matrix names in the internal database.

score_matrix()#

Get the 2-D ndarray containing the score values.

Returns:

matrix (ndarray, shape=(m,n), dtype=np.int32): The symbol code indexed score matrix. The array is read-only.

static std_3di_matrix()#

Get the default SubstitutionMatrix for 3Di sequence alignments. [2]

Returns:

matrix (SubstitutionMatrix): Default matrix.

static std_nucleotide_matrix()#

Get the default SubstitutionMatrix for DNA sequence alignments.

Returns:

matrix (SubstitutionMatrix): Default matrix.

static std_protein_blocks_matrix(undefined_match=200, undefined_mismatch=-200)#

Get the default SubstitutionMatrix for Protein Blocks sequences.

The matrix is adapted from PBxplore [3].

Parameters:

undefined_match, undefined_mismatch (int, optional): The match and mismatch score for undefined symbols. The default values were chosen arbitrarily, but are in the order of magnitude of the other score values.

Returns:

matrix (SubstitutionMatrix): Default matrix.

References

static std_protein_matrix()#

Get the default SubstitutionMatrix for protein sequence alignments, which is BLOSUM62.

Returns:

matrix (SubstitutionMatrix): Default matrix.

transpose()#

Get a copy of this instance, where the alphabets are interchanged.

Returns:

transposed (SubstitutionMatrix): The transposed substitution matrix.
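What interchanging the alphabets means for the scores can be shown in a brief sketch: rows become columns, so an m x n table turns into an n x m table. Plain lists are used instead of the library's objects, and the helper `transpose_scores` is hypothetical.

```python
def transpose_scores(alphabet1, alphabet2, matrix):
    """Swap the alphabets and turn the m x n table into an n x m table."""
    transposed = [list(column) for column in zip(*matrix)]
    return alphabet2, alphabet1, transposed


alphabet1 = ["foo", "bar"]
alphabet2 = [1, 2, 3]
matrix = [
    [5, 10, 15],
    [42, 42, 42],
]

new_alphabet1, new_alphabet2, new_matrix = transpose_scores(
    alphabet1, alphabet2, matrix
)
print(new_alphabet1)  # [1, 2, 3]
print(new_matrix)     # [[5, 42], [10, 42], [15, 42]]
```

The score of a pairing is unchanged; only the order of the two symbols in the lookup is reversed.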

Gallery#

Pairwise sequence alignment of protein sequences

Customized visualization of a multiple sequence alignment

Finding homologous regions in two genomes

Finding homologs of a gene in a genome

Hydropathy and conservation of ion channels

Fetching and aligning a protein from different species

Display sequence similarity in a heat map

Plot epitope mapping data onto protein sequence alignments

Polymorphisms in a gene

Quantifying gene expression from RNA-seq data

Comparative genome assembly

Dendrogram of a substitution matrix

Biotite color schemes for protein sequences

Statistics of local alignments and the E-value

LDDT for predicted structure evaluation

Multiple Structural alignment of orthologous proteins

Searching for structural homologs in a protein structure database
