biotite.sequence.align.align_optimal

biotite.sequence.align.align_optimal(seq1, seq2, matrix, gap_penalty=- 10, terminal_penalty=True, local=False, max_number=1000)[source]

Perform an optimal alignment of two sequences based on a dynamic programming algorithm.

This algorithm yields an optimal alignment, i.e. the sequences are aligned in the way that results in the highest similarity score. This operation can be very time and space consuming, because both scale linearly with each sequence length.

The aligned sequences do not need to be instances from the same Sequence subclass, since they do not need to have the same alphabet. The only requirement is that the SubstitutionMatrix’ alphabets extend the alphabets of the two sequences.

This function can either perform a global alignment, based on the Needleman-Wunsch algorithm 1 or a local alignment, based on the Smith–Waterman algorithm 2.

Furthermore this function supports affine gap penalties using the Gotoh algorithm 3, however, this requires approximately 4 times the RAM space and execution time.

Parameters
seq1, seq2Sequence

The sequences to be aligned.

matrixSubstitutionMatrix

The substitution matrix used for scoring.

gap_penaltyint or tuple(int, int), optional

If an integer is provided, the value will be interpreted as linear gap penalty. If a tuple is provided, an affine gap penalty is used. The first integer in the tuple is the gap opening penalty, the second integer is the gap extension penalty. The values need to be negative. (Default: -10)

terminal_penaltybool, optional

If true, gap penalties are applied to terminal gaps. If local is true, this parameter has no effect. (Default: True)

localbool, optional

If false, a global alignment is performed, otherwise a local alignment is performed. (Default: False)

max_numberint, optional

The maximum number of alignments returned. When the number of branches exceeds this value in the traceback step, no further branches are created. (Default: 1000)

Returns
alignmentslist, type=Alignment

A list of alignments. Each alignment in the list has the same maximum similarity score.

See also

align_banded

References

1

S. B. Needleman, C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of Molecular Biology, vol. 48, pp. 443–453, March 1970. doi: 10.1016/0022-2836(70)90057-4

2

T. F. Smith, M. S. Waterman, “Identification of common molecular subsequences,” Journal of Molecular Biology, vol. 147, pp. 195–197, March 1981. doi: 10.1016/0022-2836(81)90087-5

3

O. Gotoh, “An improved algorithm for matching biological sequences,” Journal of Molecular Biology, vol. 162, pp. 705–708, December 1982. doi: 10.1016/0022-2836(82)90398-9

Examples

>>> seq1 = NucleotideSequence("ATACGCTTGCT")
>>> seq2 = NucleotideSequence("AGGCGCAGCT")
>>> matrix = SubstitutionMatrix.std_nucleotide_matrix()
>>> ali = align_optimal(seq1, seq2, matrix, gap_penalty=-6)
>>> for a in ali:
...     print(a, "\n")
ATACGCTTGCT
AGGCGCA-GCT 

ATACGCTTGCT
AGGCGC-AGCT