Multiple sequence alignments#
The biotite.application subpackage provides interfaces to various
multiple sequence alignments (MSAs) programs.
For our example we choose the software MUSCLE:
The subpackage biotite.application.muscle contains the class
MuscleApp that does the job.
First we get some homologous input sequences from NCBI Entrez.
from tempfile import NamedTemporaryFile
import biotite.sequence.io.fasta as fasta
import biotite.database.entrez as entrez
# Use cyclotide sequences again, but this time more than two
query = (
entrez.SimpleQuery("Cyclotide") &
entrez.SimpleQuery("cter") &
entrez.SimpleQuery("srcdb_swiss-prot", field="Properties") ^
entrez.SimpleQuery("Precursor")
)
uids = entrez.search(query, "protein")
temp_file = NamedTemporaryFile(suffix=".fasta", delete=False)
fasta_file = fasta.FastaFile.read(
entrez.fetch_single_file(uids, temp_file.name, "protein", "fasta")
)
sequences = list(fasta.get_sequences(fasta_file).values())
We simply input the sequences to MuscleApp, run the application
and get the resulting Alignment object.
import biotite.application.muscle as muscle
app = muscle.MuscleApp(sequences)
app.start()
app.join()
alignment = app.get_alignment()
print(alignment)
G-IPCGESCVFIPCTVTALLGCSCKDKVCYKN
G-IPCGESCVFIPC-ISTVIGCSCKNKVCYRN
G-IPCGESCVFIPC-ITAAIGCSCKSKVCYRN
G-IPCGESCVFIPC-ITGIAGCSCKSKVCYRN
GSAFCGETCVLGTC---YTPDCSCTALVCLKN
H-EPCGESCVFIPC-ITTVVGCSCKNKVCYD-
H-EPCGESCVFIPC-ITTVVGCSCKNKVCYN-
GTVPCGESCVFIPC-ITGIAGCSCKNKVCYID
G-LPCGESCVFIPC-ITTVVGCSCKNKVCYND
G-LPCGESCVFIPC-ITTVVGCSCKNKVCYNN
G-IPCGESCVFIPC-ISSVVGCSCKSKVCYLD
G-IPCAESCVWIPCTVTALLGCSCKDKVCYLD
G-IPCAESCVWIPCTVTALLGCSCKDKVCYLN
G-VPCAESCVWIPCTVTALLGCSCKDKVCYLD
In most MSA software even more information than the mere alignment can be extracted. For instance the guide tree, that was used for the alignment, can be obtained from the MUSCLE output.
import matplotlib.pyplot as plt
import biotite.sequence.graphics as graphics
tree = app.get_guide_tree()
fig, ax = plt.subplots(figsize=(6.0, 3.0), constrained_layout=True)
graphics.plot_dendrogram(ax, tree, orientation="left", show_distance=True)
_ = ax.set_xlabel("Distance")
There is also a convenience method, that handles the Application
execution internally.
However, this shortcut returns only the Alignment.
alignment = muscle.MuscleApp.align(sequences)
Variety of MSA software#
The alternatives to MUSCLE are Clustal Omega and MAFFT.
To use them, simply replace MuscleApp with ClustalOmegaApp or
MafftApp.
import biotite.application.clustalo as clustalo
alignment = clustalo.ClustalOmegaApp.align(sequences)
print(alignment)
-GIPCGESCVFIPCTVTALLGCSCKDKVCYKN
-GIPCGESCVFIPCI-STVIGCSCKNKVCYRN
-GIPCGESCVFIPCI-TAAIGCSCKSKVCYRN
-GIPCGESCVFIPCI-TGIAGCSCKSKVCYRN
GSAFCGETCVLGTCY---TPDCSCTALVCLKN
-HEPCGESCVFIPCI-TTVVGCSCKNKVCYD-
-HEPCGESCVFIPCI-TTVVGCSCKNKVCYN-
GTVPCGESCVFIPCI-TGIAGCSCKNKVCYID
-GLPCGESCVFIPCI-TTVVGCSCKNKVCYND
-GLPCGESCVFIPCI-TTVVGCSCKNKVCYNN
-GIPCGESCVFIPCI-SSVVGCSCKSKVCYLD
-GIPCAESCVWIPCTVTALLGCSCKDKVCYLD
-GIPCAESCVWIPCTVTALLGCSCKDKVCYLN
-GVPCAESCVWIPCTVTALLGCSCKDKVCYLD
As shown in the output, the alignment with Clustal Omega slightly differs from the one performed with MUSCLE.
Custom substitution matrices and sequence types#
If the MSA software supports protein sequence alignment and custom substitution matrices, e.g. MUSCLE and MAFFT, almost any type of sequence can be aligned. Let’s show this on the example of a nonsense alphabet.
import numpy as np
import biotite.application.mafft as mafft
import biotite.sequence as seq
import biotite.sequence.align as align
alphabet = seq.Alphabet(("foo", "bar", 42))
sequences = [
seq.GeneralSequence(alphabet, symbols) for symbols in [
["foo", "bar", 42, "foo", "foo", 42, 42],
["foo", 42, "foo", "bar", "foo", 42, 42],
]
]
matrix = align.SubstitutionMatrix(
alphabet, alphabet, np.array([
[ 100, -100, -100],
[-100, 100, -100],
[-100, -100, 100]
])
)
alignment = mafft.MafftApp.align(sequences, matrix=matrix)
# As the alphabet does not have characters as symbols
# the alignment cannot be directly printed
# However, we can print the trace
print(alignment.trace)
[[ 0 0]
[ 1 -1]
[ 2 1]
[ 3 2]
[-1 3]
[ 4 4]
[ 5 5]
[ 6 6]]
This works, because internally the sequences and the matrix are converted into protein sequences/matrix. Then the masquerading sequences are aligned via the software and finally the sequences are mapped back into the original sequence type.