biotite.sequence.align.align_optimal

biotite.sequence.align.align_optimal(seq1, seq2, matrix, gap_penalty=-10, terminal_penalty=True, local=False, max_number=1000)

Perform an optimal alignment of two sequences based on the dynamic programming algorithm. [1]

This algorithm yields an optimal alignment, i.e. the sequences are aligned in the way that results in the highest similarity score. This operation can be very time and space consuming, because both scale linearly with each sequence length.

The aligned sequences do not need to be instances from the same Sequence subclass, since they do not need to have the same alphabet. The only requirement is that the substitution matrix’ alphabets extend the alphabets of the two sequences.

This function can either perform a global alignment, based on the Needleman-Wunsch algorithm [1] or a local alignment, based on the Smith–Waterman algorithm [2].

Furthermore this function supports affine gap penalties using the Gotoh algorithm [3], however, this requires approximately 4 times the RAM space and execution time.

Parameters
seq1, seq2Sequence

The sequences to be aligned.

matrixSubstitutionMatrix

The substitution matrix used for scoring.

gap_penaltyint or (tuple, dtype=int), optional

If an integer is provided, the value will be interpreted as general gap penalty. If a tuple is provided, an affine gap penalty is used. The first integer in the tuple is the gap opening penalty, the second integer is the gap extension penalty. The values need to be negative. (Default: -10)

terminal_penaltybool, optional

If true, gap penalties are applied to terminal gaps. If local is true, this parameter has no effect. (Default: True)

localbool, optional

If false, a global alignment is performed, otherwise a local alignment is performed. (Default: False)

max_numberint, optional

The maximum number of alignments returned. When the number of branches exceeds this value in the traceback step, no further branches are created. (Default: 1000)

Returns
alignmentslist, type=Alignment

A list of alignments. Each alignment in the list has the same maximum similarity score.

References

1(1,2)

SB Needleman, CD Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins.” J Mol Biol, 48, 443-453 (1970).

2

TF Smith, MS Waterman, “Identification of common molecular subsequences.” J Mol Biol, 147, 195-197 (1981).

3

O Gotoh, “An improved algorithm for matching biological sequences.” J Mol Biol, 162, 705-708 (1982).

Examples

>>> seq1 = NucleotideSequence("ATACGCTTGCT")
>>> seq2 = NucleotideSequence("AGGCGCAGCT")
>>> matrix = SubstitutionMatrix.std_nucleotide_matrix()
>>> ali = align_optimal(seq1, seq2, matrix, gap_penalty=-6)
>>> for a in ali:
...     print(a, "\n")
ATACGCTTGCT
AGGCGCA-GCT 
<BLANKLINE>
ATACGCTTGCT
AGGCGC-AGCT 
<BLANKLINE>