Structural alphabets#

In the previous chapters we have seen the multitude of methods, that can be applied to sequence and structure data. Structural alphabets combine the best of both worlds: A structural alphabet is a representation of protein or nucleic acid structures, where each residue is encoded into a single character, based on the local geometry or contact partners of that residue. This way the high performance sequence-based methods, for example alignment searches, can be applied to structure data.

The biotite.structure.alphabet subpackage provides multiple structural alphabets, but for the scope of this tutorial we will focus on the 3Di alphabet, popularized by the Foldseek software for fast protein structure comparison. Just keep in mind that in the following examples the underlying structural alphabet can be substituted with minimal modifications.

Converting structures to sequences#

We start by getting the structure of our protein of interest. In this case we will use ferredoxin from E. coli (PDB ID: 2ZVS). After filtering out all non-amino acid residues, we create the 3Di sequence for each chain with to_3di().

import biotite.database.rcsb as rcsb
import biotite.structure as struc
import biotite.structure.alphabet as strucalph
import biotite.structure.io.pdbx as pdbx

# You can actually directly read the downloaded PDBx file content
# without an intermediate file
pdbx_file = pdbx.BinaryCIFFile.read(rcsb.fetch("2ZVS", "bcif"))
ec_ferredoxin = pdbx.get_structure(pdbx_file, model=1)
# This structural alphabet expects a peptide chain
ec_ferredoxin = ec_ferredoxin[struc.filter_amino_acids(ec_ferredoxin)]
structural_sequences, chain_starts = strucalph.to_3di(ec_ferredoxin)
print(structural_sequences)
print(chain_starts)
[I3DSequence("dkffavvapppcqpqvpqpqnqwdddddhiggdllrqlqcvvhdpgrvvcvrgpprgmdhddvrdddnvvsvvsvcvrvnd"), I3DSequence("dkwqalvapapppqqpvdplsqwaaddnhiggpllqdllpvphdpgrvvvvprpdpsmdhdvvndddnvvsvvncvvvpd"), I3DSequence("dkwqalvdpqpvpqqvqdpqnqwdddpryitgnllqdlccvvhdpgrrscvvdppnrighdvvsdadnvrsvvsvvvvpd")]
[   0  638 1266]

Note that to_3di() returns not a single I3DSequence sequence but a list of sequences, one for each chain in the structure. Accompanying the sequences, the function also returns the atom indices where each of the chains starts. As our structure contains only one chain, the desired 3Di sequence is the first and only element in the list.

ec_ferredoxin_3di = structural_sequences[0]
print(ec_ferredoxin_3di)
dkffavvapppcqpqvpqpqnqwdddddhiggdllrqlqcvvhdpgrvvcvrgpprgmdhddvrdddnvvsvvsvcvrvnd

Each symbol in this rather cryptic sequence corresponds to one residue in the structure. To get the corresponding residues as AtomArray objects we can use the residue-level functionality of biotite.structure. While the sequence is hardly human-readable, its true power lies in its ability to be compared to 3Di sequences from other proteins.

Sequence alignments on structural alphabets#

As mentioned in the sequence chapter, the sequence based methods in biotite.sequence generally do not care about the type of sequence. This means that we can use any method we have learned so far on structural sequences as well. For the scope of this tutorial we will merely use align_optimal() to find corresponding residues in two structures. As the structure is generally much better conserved than its sequence, the alignment of 3Di sequences will even work on remote homologs with low amino acid sequence identity, where a classical sequence alignment would fail. To demonstrate this, we will compare the E. coli ferredoxin with the remotely similar ferredoxin from the thermophilic archaeon S. tokodaii.

pdbx_file = pdbx.BinaryCIFFile.read(rcsb.fetch("1XER", "bcif"))
st_ferredoxin = pdbx.get_structure(pdbx_file, model=1)
st_ferredoxin = st_ferredoxin[struc.filter_amino_acids(st_ferredoxin)]
st_ferredoxin_3di = strucalph.to_3di(st_ferredoxin)[0][0]

To align the two 3Di sequences, we merely need a SubstitutionMatrix that matches the alphabet of the I3DSequence. Like for amino acid and nucleotide sequences, biotite.sequence.align provides it out of the box with SubstitutionMatrix.std_3di_matrix().

import matplotlib.pyplot as plt
import biotite.sequence.align as align
import biotite.sequence.graphics as graphics

matrix = align.SubstitutionMatrix.std_3di_matrix()
alignment = align.align_optimal(
    ec_ferredoxin_3di,
    st_ferredoxin_3di,
    matrix,
    gap_penalty=(-10, -1),
    terminal_penalty=False,
)[0]

fig, ax = plt.subplots(figsize=(8.0, 2.0))
graphics.plot_alignment_similarity_based(
    ax, alignment, matrix=matrix, labels=["EC", "ST"], symbols_per_line=50
)
../../_images/alphabet_5_0.png

If you prefer coloring the symbols in the alignment by their type, you are lucky: biotite.sequence.graphics provides a color scheme for each of the supported structural alphabets as well.

fig, ax = plt.subplots(figsize=(8.0, 2.0))
graphics.plot_alignment_type_based(
    ax, alignment, labels=["EC", "ST"], symbols_per_line=50
)
../../_images/alphabet_6_0.png

Example: Superimposing structures#

One typical use case of structural alphabets is superimposing structures of remote homologs. Here the challenge is finding the corresponding residues in the two structures, whose squared distance the superimposition algorithm should minimize. The solution is to use the alignment of the structural alphabet: One simply inputs the CA atoms of the aligned residues.

def rmsd_from_alignment(fixed, mobile, alignment):
    """
    A very simple function that extracts corresponding residues (the 'anchors')
    from an alignment and uses them to run a superimposition.
    Finally the RMSD of the superimposed structures plus the number of anchors is
    returned.
    """
    alignment_codes = align.get_codes(alignment)
    anchor_mask = (
        # Anchors must be structurally similar
        (matrix.score_matrix()[alignment_codes[0], alignment_codes[1]] > 0)
        # Gaps are not anchors
        & (alignment_codes != -1).all(axis=0)
    )
    superimposition_anchors = alignment.trace[anchor_mask]
    # Each anchor corresponds to a residue
    # Use the CA atoms as representative for each residue
    fixed_ca = fixed[fixed.atom_name == "CA"]
    mobile_ca = mobile[mobile.atom_name == "CA"]
    fixed_anchors = fixed_ca[superimposition_anchors[:, 0]]
    mobile_anchors = mobile_ca[superimposition_anchors[:, 1]]

    mobile_anchors, transform = struc.superimpose(
        fixed_anchors,
        mobile_anchors,
    )
    return struc.rmsd(fixed_anchors, mobile_anchors), len(superimposition_anchors)

rmsd, n_anchors = rmsd_from_alignment(ec_ferredoxin, st_ferredoxin, alignment)
print("Number of corresponding residues found:", n_anchors)
print("RMSD:", rmsd)
Number of corresponding residues found: 37
RMSD: 4.370331

Again, with a classical amino acid sequence based approach the accuracy of the superimposition would be much lower: In this case less corresponding residues can be found from the amino sequence alignment and the RMSD between them is significantly higher.

matrix = align.SubstitutionMatrix.std_protein_matrix()
ec_ferredoxin_seq = struc.to_sequence(ec_ferredoxin)[0][0]
st_ferredoxin_seq = struc.to_sequence(st_ferredoxin)[0][0]
alignment = align.align_optimal(
    ec_ferredoxin_seq,
    st_ferredoxin_seq,
    matrix,
    gap_penalty=(-10, -1),
    terminal_penalty=False,
)[0]
rmsd, n_anchors = rmsd_from_alignment(ec_ferredoxin, st_ferredoxin, alignment)
print("Number of corresponding residues found:", n_anchors)
print("RMSD:", rmsd)

fig, ax = plt.subplots(figsize=(8.0, 2.0))
graphics.plot_alignment_similarity_based(
    ax, alignment, matrix=matrix, labels=["EC", "ST"], symbols_per_line=50
)
Number of corresponding residues found: 21
RMSD: 6.214908
../../_images/alphabet_8_1.png

This shows only a small fraction of the versatility of structural alphabets. They can also be used to find structural homologs in a large database, to superimpose multiple structures at once and much more!