Searching for structural homologs in a protein structure database#

The following script implements a protein structure search algorithm akin to foldseek [1]. It harnesses the 3Di alphabet to translate 3D structures into a sequences. Structurally homologous regions between these 3Di sequences can be identified quickly using k-mer matching followed by ungapped seed extension and finally gapped alignment.

While the following example could also be scaled up easily to a large structure database, for the sake of simplicity we will use a random selection of structures from the PDB plus an expected target structure, which is known to be structurally homologous to the query structure. The aim of the script is to identify this target among the decoy structures in the dataset.

# Code source: Patrick Kunzmann
# License: BSD 3 clause

import collections
import matplotlib.pyplot as plt
import numpy as np
import biotite
import biotite.database.rcsb as rcsb
import biotite.sequence as seq
import biotite.sequence.align as align
import biotite.sequence.graphics as graphics
import biotite.structure as struc
import biotite.structure.alphabet as strucalph
import biotite.structure.io.pdbx as pdbx

# Amino acid background frequencies
# Must have the same order as `ProteinSequence.alphabet`
AA_FREQUENCY = {
    "A": 35155,
    "C": 8669,
    "D": 24161,
    "E": 28354,
    "F": 17367,
    "G": 33229,
    "H": 9906,
    "I": 23161,
    "K": 25872,
    "L": 40625,
    "M": 10101,
    "N": 20212,
    "P": 23435,
    "Q": 19208,
    "R": 23105,
    "S": 32070,
    "T": 26311,
    "V": 29012,
    "W": 5990,
    "Y": 14488,
    # Set ambiguous amino acid count to 1 to avoid division by zero in log odds matrix
    "B": 1,
    "Z": 1,
    "X": 1,
    "*": 1,
}

# Two protein structures from the same SCOP superfamily but different families
QUERY_ID = "4R8K"  # Asparaginase (Cavia porcellus)
QUERY_CHAIN = "B"
TARGET_ID = "1NNS"  # L-asparaginase 2 (Escherichia coli)
TARGET_CHAIN = "A"
N_DECOYS = 50
# Approximate size (number of amino acids) of AlphaFold DB:
# For E-value estimation assume that the hypothetical database has this size
DB_SIZE = 2e13

## Alignment search parameters
K = 6  # adopted from `foldseek`
SPACED_SEED = "11*1*1**11"  # adopted from `foldseek`
# The number of consecutive undefined 3Di symbols ('d'),
# that will be marked as undefined spans
UNDEFINED_SPAN_LENGTH = 4
X_DROP = 20
X_DROP_ACCEPT_THRESHOLD = 100
BAND_WIDTH = 20
GAP_PENALTY = (-10, -1)
EVALUE_THRESHOLD = 1e-1

Limitation of protein sequence alignment#

To motivate using a structural alphabet, we first we try a regular sequence alignment on the two homologs.

def get_protein_chain(pdb_id, chain_id):
    pdbx_file = pdbx.BinaryCIFFile.read(rcsb.fetch(pdb_id, "bcif"))
    # Bonds are required for the later molecular visualization
    atoms = pdbx.get_structure(pdbx_file, model=1, include_bonds=True)
    atoms = atoms[struc.filter_amino_acids(atoms)]
    return atoms[atoms.chain_id == chain_id]


query_aa_sequence = struc.to_sequence(get_protein_chain(QUERY_ID, QUERY_CHAIN))[0][0]
target_aa_sequence = struc.to_sequence(get_protein_chain(TARGET_ID, TARGET_CHAIN))[0][0]

alignment = align.align_optimal(
    query_aa_sequence,
    target_aa_sequence,
    align.SubstitutionMatrix.std_protein_matrix(),
    gap_penalty=(-10, -1),
    local=True,
)[0]

print(f"Seqence identity: {int(100 * align.get_sequence_identity(alignment))}%")
fig, ax = plt.subplots(figsize=(8.0, 3.0), constrained_layout=True)
graphics.plot_alignment_similarity_based(
    ax,
    alignment,
    matrix=align.SubstitutionMatrix.std_protein_matrix(),
    labels=["Query", "Target"],
    show_numbers=True,
    show_line_position=True,
)
structure search
Seqence identity: 27%

The sequence identity is quite low. One would typically refer to this range as the twilight zone, were it is unclear if the two sequences are actually homologs. If the database comprised the entire AlphaFold DB [2], the probability of an finding an alignment as good as this one by chance would be quite high as the E-value suggests below.

The background frequencies for E-value calculation are taken from [3].

background = np.array(list(AA_FREQUENCY.values()))
# Normalize background frequencies
background = background / background.sum()
np.random.seed(0)
estimator = align.EValueEstimator.from_samples(
    seq.ProteinSequence.alphabet,
    align.SubstitutionMatrix.std_protein_matrix(),
    GAP_PENALTY,
    background,
    sample_size=500,
)
log_evalue = estimator.log_evalue(alignment.score, len(query_aa_sequence), DB_SIZE)
print(f"E-value: {10**log_evalue:.2e}")
E-value: 5.49e-02

Sequence preparation#

Now the actual task: For simplicity we compile a small database of random structures from the PDB, that represents the database we want to search in. Into this database we shuffle the target structure to which we expect a hit for our query. To get a list of PDB entries we can select from, we search for all entries with at least one peptide chain. Furthermore, we want the resolution be be sufficiently good to resolve side chains.

pdb_query = rcsb.FieldQuery(
    "rcsb_entry_info.polymer_entity_count_protein", greater_or_equal=1
) & rcsb.FieldQuery("reflns.d_resolution_high", less=2.0)
# For the sake of performance, only the first 10000 matches are requested from the PDB
protein_pdb_ids = rcsb.search(pdb_query, range=(0, 10000))
# From these 10000, randomly select the decoys
rng = np.random.default_rng(0)
decoy_pdb_ids = rng.choice(protein_pdb_ids, size=100, replace=False)
# Mix the target structure to the decoy set
database_pdb_ids = np.concatenate((decoy_pdb_ids, [TARGET_ID]))
rng.shuffle(database_pdb_ids)

Now the selected PDB entries are downloaded and for each peptide chain the 3Di sequence that reflects the structure is determined. We will identify each sequence by a combination of the PDB ID and chain ID.

db_ids = []
db_sequences = []
for pdb_id in database_pdb_ids:
    pdbx_file = pdbx.BinaryCIFFile.read(rcsb.fetch(pdb_id, "bcif"))
    # Use only one model per structure,
    # as most models will probably be very similar to each other anyway
    # It is also important to use the 'auth_asym_id' for consistency with
    # 'entity_poly.pdbx_strand_id' below
    atoms = pdbx.get_structure(pdbx_file, model=1, use_author_fields=True)
    entity_poly = pdbx_file.block["entity_poly"]
    for i in range(entity_poly.row_count):
        chains_in_entity = entity_poly["pdbx_strand_id"].as_array(str)[i]
        # Do not include duplicate sequences, e.g. from two chains of a homodimer
        representative_chain_id = chains_in_entity[0]
        chain = atoms[atoms.chain_id == representative_chain_id]
        # Only use the peptide part of the structure
        chain = chain[struc.filter_amino_acids(chain)]
        if chain.array_length() == 0:
            continue
        # Since we input a single chain, we know only one sequence is created
        structural_sequence = strucalph.to_3di(chain)[0][0]
        if len(structural_sequence) < K:
            # Cannot index a sequence later, if it is shorter than the k-mer length
            continue
        db_ids.append((pdb_id, representative_chain_id))
        db_sequences.append(structural_sequence)

Structure superimposition based on alignment#

As bonus, we can use the aligned residues as ‘anchors’ for structure superimposition. This means, that we superimpose the \(C_{\alpha}\) atoms of the aligned residues and apply the resulting transformation to the full structure.

# In this example we know there is only one significant alignment
# and this corresponds to the query and target structure
expected_alignment, _ = next(iter(significant_alignments.values()))
anchor_indices = expected_alignment.trace
anchor_indices = anchor_indices[(anchor_indices != -1).all(axis=1)]

query_chain = get_protein_chain(QUERY_ID, QUERY_CHAIN)
target_chain = get_protein_chain(TARGET_ID, TARGET_CHAIN)
query_anchors = query_chain[query_chain.atom_name == "CA"][anchor_indices[:, 0]]
target_anchors = target_chain[target_chain.atom_name == "CA"][anchor_indices[:, 1]]
# Find transformation that superimposes the anchor atoms
_, transform = struc.superimpose(query_anchors, target_anchors)
# Apply this transformation to full structure
target_chain = transform.apply(target_chain)
# Visualization with PyMOL...
structure search

Gallery generated by Sphinx-Gallery