biotite.sequence.align.SubstitutionMatrix

class biotite.sequence.align.SubstitutionMatrix(alphabet1, alphabet2, score_matrix)[source]

Bases: object

A SubstitutionMatrix is the foundation for scoring in sequence alignments. A SubstitutionMatrix maps each possible pairing of a symbol of a first alphabet with a symbol of a second alphabet to a score (integer).

The class uses a 2-D (m x n) ndarray (dtype=`np.int32`), where each element stores the score for a symbol pairing, indexed by the symbol codes of the respective symbols in an m-length alphabet 1 and an n-length alphabet 2.

There are 3 ways to creates instances:

At first a 2-D ndarray containing the scores can be directly provided.

Secondly a dictionary can be provided, where the keys are pairing tuples and values are the corresponding scores. The pairing tuples consist of a symbol of alphabet 1 as first element and a symbol of alphabet 2 as second element. Parings have to be provided for each possible combination.

At last a valid matrix name can be given, which is loaded from the internal matrix database. The following matrices are avaliable:

  • Nucleotide substitution matrices from NCBI database
    • NUC - Also usable with ambiguous alphabet
  • Protein substitution matrices from NCBI database

    • PAM<n>
    • BLOSUM<n>
    • MATCH - Only differentiates between match and mismatch
    • IDENTITY - Strongly penalizes mismatches
    • GONNET - Not usable with default protein alphabet
    • DAYHOFF
  • Corrected protein substitution matrices [1], <BLOCKS> is the BLOCKS version, the matrix is based on

    • BLOSUM<n>_<BLOCKS>
    • RBLOSUM<n>_<BLOCKS>
    • CorBLOSUM<n>_<BLOCKS>

A list of all available matrix names is returned by list_db().

Since this class can handle two different alphabets, it is possible to align two different types of sequences.

Objects of this class are immutable.

Parameters:
alphabet1 : Alphabet, length=m

The first alphabet of the substitution matrix.

alphabet2 : Alphabet, length=n

The second alphabet of the substitution matrix.

score_matrix : ndarray, shape=(m,n) or dict or str

Either a symbol code indexed ndarray containing the scores, or a dictionary mapping the symbol pairing to scores, or a string referencing a matrix in the internal database.

Raises:
KeyError

If the matrix dictionary misses a symbol given in the alphabet.

References

[1](1, 2) M Hess, F Keul, M Goesele and K Hamacher, “Addressing inaccuracies in BLOSUM computation improves homology search performance.” BMC Bioinformatics, 17 (2016).

Examples

Creating a matrix for two different (nonsense) alphabets via a matrix dictionary:

>>> alph1 = Alphabet(["foo","bar"])
>>> alph2 = Alphabet([1,2,3])
>>> matrix_dict = {("foo",1):5,  ("foo",2):10, ("foo",3):15,
...                ("bar",1):42, ("bar",2):42, ("bar",3):42}
>>> matrix = SubstitutionMatrix(alph1, alph2, matrix_dict)
>>> print(matrix.score_matrix())
[[ 5 10 15]
 [42 42 42]]
>>> print(matrix.get_score("foo", 2))
10
>>> print(matrix.get_score_by_code(0, 1))
10

Creating a matrix via database name:

>>> alph = ProteinSequence.alphabet
>>> matrix = SubstitutionMatrix(alph, alph, "BLOSUM50")
get_alphabet1()[source]

Get the first alphabet.

Returns:
alphabet : Alphabet

The first alphabet.

get_alphabet2()[source]

Get the second alphabet.

Returns:
alphabet : Alphabet

The second alphabet.

score_matrix()[source]

Get the 2-D ndarray containing the score values.

In the strict sense, this breaks the immutability of the SubstitutionMatrix object. Therefore, using the returned ndarray read-only is strongly recommended.

Returns:
matrix : ndarray

The symbol code indexed score matrix.

transpose()[source]

Get a copy of this instance, where the alphabets are interchanged.

Returns:
transposed : SubstitutionMatrix

The transposed substitution matrix.

is_symmetric()[source]

Check whether the substitution matrix is symmetric, i.e. both alphabets are identical and the score matrix is symmetric.

Returns:
is_symmetric : bool

True, if both alphabets are identical and the score matrix is symmetric, false otherwise.

get_score_by_code(code1, code2)[source]

Get the substitution score of two symbols, represented by their code.

Parameters:
code1, code2 : int

Symbol codes of the two symbols to be aligned.

Returns:
score : int

The substitution / alignment score.

get_score(symbol1, symbol2)[source]

Get the substitution score of two symbols.

Parameters:
symbol1, symbol2 : object

Symbols to be aligned.

Returns:
score : int

The substitution / alignment score.

shape()[source]

Get the shape (i.e. the length of both alphabets) of the subsitution matrix.

Returns:
shape : tuple

Matrix shape.

static dict_from_str(string)[source]

Create a matrix dictionary from a string in NCBI matrix format.

Symbols of the first alphabet are taken from the left column, symbols of the second alphabet are taken from the top row.

The keys of the dictionary consist of tuples containing the aligned symbols and the values are the corresponding scores.

Returns:
matrix_dict : dict

A dictionary representing the substitution matrix.

static dict_from_db(matrix_name)[source]

Create a matrix dictionary from a valid matrix name in the internal matrix database.

The keys of the dictionary consist of tuples containing the aligned symbols and the values are the corresponding scores.

Returns:
matrix_dict : dict

A dictionary representing the substitution matrix.

static list_db()[source]

List all matrix names in the internal database.

Returns:
db_list : list

List of matrix names in the internal database.

static std_protein_matrix()[source]

Get the default SubstitutionMatrix for protein sequence alignments, which is BLOSUM62.

Returns:
matrix : SubstitutionMatrix

Default matrix.

static std_nucleotide_matrix()[source]

Get the default SubstitutionMatrix for DNA sequence alignments.

Returns:
matrix : SubstitutionMatrix

Default matrix.