biotite.sequence.align.align_multiple

biotite.sequence.align.align_multiple(sequences, matrix, gap_penalty=- 10, terminal_penalty=True, distances=None, guide_tree=None)[source]

Perform a multiple sequence alignment using a progressive alignment algorithm. 1

Based on pairwise sequence distances a guide tree is constructed. The sequences are progessively aligned according to the tree, following the rule ‘Once a gap, always a gap’.

Parameters
sequenceslist of Sequence

The sequences to be aligned. The alpahbet of the substitution matrix must be equal or extend the alphabet of each sequence.

matrixSubstitutionMatrix

The substitution matrix used for scoring. Must be symmetric.

gap_penaltyint or tuple(int, int), optional

If an integer is provided, the value will be interpreted as general gap penalty. If a tuple is provided, an affine gap penalty is used. The first integer in the tuple is the gap opening penalty, the second integer is the gap extension penalty. The values need to be negative. (Default: -10)

terminal_penaltybool, optional

If true, gap penalties are applied to terminal gaps. (Default: True)

distancesndarray, shape=(n,n)

Pairwise distances of the sequences. The matrix must be symmetric and all entries must be larger than 0. By default the pairwise distances are calculated from similarities obtained from optimal global pairwise alignments (align_optimal()). The similarities are converted into distances using the method proposed by Feng & Doolittle 2.

guide_treeTree

The guide tree to be used for the progressive alignment. By default the guide tree is constructed from distances via the UPGMA clustering method.

Returns
alignmentAlignment

The global multiple sequence alignment of the input sequences.

orderndarray, dtype=int

The sequence order represented by the guide tree. When this order is applied to alignment sequence order, similar sequences are adjacent to each other.

treeTree

The guide tree used for progressive alignment. Equal to guide_tree if provided.

distance_matrixndarray, shape=(n,n), dtype=float32

The pairwise distance matrix used to construct the guide tree. Equal to distances if provided.

Notes

The similarity to distance conversion is performed according to the following formula:

\[D_{a,b} = -\ln\left( \frac { S_{a,b} - S_{a,b}^{rand} } { S_{a,b}^{max} - S_{a,b}^{rand} } \right)\]
\[S_{a,b}^{max} = \frac{ S_{a,a} + S_{b,b} }{ 2 }\]
\[S_{a,b}^{rand} = \frac{1}{L_{a,b}} \left( \sum_{x \in \Omega} \sum_{y \in \Omega} s_{x,y} \cdot N_a(x) \cdot N_b(y) \right) + N_{a,b}^{open} \cdot p^{open} + N_{a,b}^{ext} \cdot p^{ext}\]

\(D_{a,b}\) - The distance between the sequences a and b.

\(S_{a,b}\) - The similarity score between the sequences a and b.

\(s_{x,y}\) - The similarity score between the symbols x and y.

\(\Omega\) - The sequence alphabet.

\(N_a(x)\) - Number of occurences of symbol x in sequence a.

\(N_{a,b}^{open}, N_{a,b}^{ext}\) - Number of gap openings/ extensions, in the alignment of a and b.

\(p^{open}, p^{ext}\) - The penalty for a gap opening/extension.

\(L_{a,b}\) - Number of columns in the alignment of a and b.

In rare cases of extremely unrelated sequences, \(S_{a,b}\) can be lower than \(S_{a,b}^{rand}\). In this case the logaritmus cannot be calculated and a ValueError is raised.

References

1

D. Feng, R. F. Doolittle, “Progressive sequence alignment as a prerequisite to correct phylogenetic trees,” Journal of Molecular Evolution, vol. 25, pp. 351–360, August 1987. doi: 10.1007/BF02603120

2

D. Feng, R. F. Doolittle, “Progressive alignment of amino acid sequences and construction of phylogenetic trees from them,” Methods in Enzymology, vol. 266, pp. 368–382, January 1996. doi: 10.1016/S0076-6879(96)66023-6

Examples

>>> seq1 = ProteinSequence("BIQTITE")
>>> seq2 = ProteinSequence("TITANITE")
>>> seq3 = ProteinSequence("BISMITE")
>>> seq4 = ProteinSequence("IQLITE")
>>> matrix = SubstitutionMatrix.std_protein_matrix()
>>>
>>> alignment, order, tree, distances = align_multiple(
...     [seq1, seq2, seq3, seq4], matrix
... )
>>>
>>> print(alignment)
BIQT-ITE
TITANITE
BISM-ITE
-IQL-ITE
>>> print(alignment[:, order.tolist()])
-IQL-ITE
BISM-ITE
BIQT-ITE
TITANITE
>>> print(distances)
[[0.000 1.034 0.382 0.560]
 [1.034 0.000 0.923 1.132]
 [0.382 0.923 0.000 0.632]
 [0.560 1.132 0.632 0.000]]
>>>
>>> print(tree.to_newick(
...     labels=["seq1", "seq2", "seq3", "seq4"], include_distance=False
... ))
((seq4,(seq3,seq1)),seq2);