Fetching structure files from the RCSB PDB

Downloading structure files from the RCSB PDB is straightforward: we pass the PDB ID, the file format and the target directory to the fetch() function. The function returns the path to the downloaded file, so the file can be loaded directly via the biotite.structure.io subpackage (more on this in a later tutorial). Here we download a structure of the miniprotein TC5b (PDB: 1L2Y) into a temporary directory.

from tempfile import gettempdir
from os.path import basename
import biotite.database.rcsb as rcsb

file_path = rcsb.fetch("1l2y", "pdb", gettempdir())
print(basename(file_path))

1l2y.pdb

To download multiple files, we pass a list of PDB IDs and get back a list of file paths in return.

# Download files in the more modern mmCIF format
file_paths = rcsb.fetch(["1l2y", "1aki"], "cif", gettempdir())
print([basename(file_path) for file_path in file_paths])

['1l2y.cif', '1aki.cif']

By default, fetch() checks whether the requested file already exists in the target directory and downloads it only if it does not. To download the file regardless, set overwrite to True.

# Download file in the fast and small BinaryCIF format
file_path = rcsb.fetch("1l2y", "bcif", gettempdir(), overwrite=True)
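
The skip-if-present behavior can be pictured with a small sketch. This is not Biotite's implementation: fetch_cached() and the download callable are hypothetical names, and the stand-in downloader returns fixed text instead of contacting the RCSB, so the example runs offline.

```python
import os
import tempfile


def fetch_cached(entry_id, suffix, target_dir, download, overwrite=False):
    """Return the path for an entry, calling `download` only when needed.

    Hypothetical helper, not part of Biotite: `download` is any callable
    that takes an entry ID and returns the file content as a string.
    """
    path = os.path.join(target_dir, f"{entry_id}.{suffix}")
    if overwrite or not os.path.exists(path):
        with open(path, "w") as file:
            file.write(download(entry_id))
    return path


# Stand-in downloader that records how often it is called
calls = []


def fake_download(entry_id):
    calls.append(entry_id)
    return f"content of {entry_id}\n"


with tempfile.TemporaryDirectory() as tmp_dir:
    fetch_cached("1l2y", "pdb", tmp_dir, fake_download)
    # The second call finds the existing file and skips the download
    fetch_cached("1l2y", "pdb", tmp_dir, fake_download)
    # With overwrite=True the download happens again
    fetch_cached("1l2y", "pdb", tmp_dir, fake_download, overwrite=True)

print(len(calls))
```

Of the three calls, only the first and the third reach the stand-in downloader.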

If we omit the target directory or set it to None, the downloaded data is returned directly as a file-like object, without creating a file on disk.

file = rcsb.fetch("1l2y", "pdb")
lines = file.readlines()
print("".join(lines[:10] + ["..."]))

HEADER    DE NOVO PROTEIN                         25-FEB-02   1L2Y              
TITLE     NMR STRUCTURE OF TRP-CAGE MINIPROTEIN CONSTRUCT TC5B                  
COMPND    MOL_ID: 1;                                                            
COMPND   2 MOLECULE: TC5B;                                                      
COMPND   3 CHAIN: A;                                                            
COMPND   4 ENGINEERED: YES                                                      
SOURCE    MOL_ID: 1;                                                            
SOURCE   2 SYNTHETIC: YES;                                                      
SOURCE   3 OTHER_DETAILS: THE PROTEIN WAS SYNTHESIZED USING STANDARD FMOC       
SOURCE   4 SOLID-PHASE SYNTHESIS METHODS ON AN APPLIED BIOSYSTEMS 433A PEPTIDE  
...
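
Because the return value behaves like an open file, it can be handed to any code that reads from file objects. The sketch below uses an io.StringIO filled with a few of the header lines shown above as a stand-in for the fetched object (an assumption made so that the example runs offline) and tallies the PDB record names. count_record_names() is a hypothetical helper, not a Biotite function.

```python
import io
from collections import Counter


def count_record_names(file):
    """Count PDB record names (the first six columns of each line)."""
    return Counter(line[:6].strip() for line in file if line.strip())


# Stand-in for the object returned by fetch() without a target directory
pdb_file = io.StringIO(
    "HEADER    DE NOVO PROTEIN                         25-FEB-02   1L2Y\n"
    "TITLE     NMR STRUCTURE OF TRP-CAGE MINIPROTEIN CONSTRUCT TC5B\n"
    "COMPND    MOL_ID: 1;\n"
    "COMPND   2 MOLECULE: TC5B;\n"
)
record_counts = count_record_names(pdb_file)
print(record_counts["COMPND"])
```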

Searching for entries

As mentioned in the previous chapter, one is often interested not in a specific structure but in a set of structures that meets certain criteria. As in the other biotite.database subpackages, PDB IDs matching those criteria are found by defining a Query and passing it to search(), which uses the RCSB search API. Likewise, count() requests only the number of matching PDB IDs, which is faster and more database-friendly than measuring the length of the list returned by a search() call.

query = rcsb.BasicQuery("HCN1")
pdb_ids = rcsb.search(query)
print(pdb_ids)
print(rcsb.count(query))
files = rcsb.fetch(pdb_ids, "cif", gettempdir())

['2XPI', '5U6P', '5U6O', '8T4M', '8T4Y', '8T50', '8UC8', '6UQF', '6UQG', '8UC7', '9BC6', '9BC7', '8Y60', '3U0Z']
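
One way to make use of count() is to check the number of hits before running a search whose results will be downloaded in bulk. The following is a minimal sketch of that idea; search_if_manageable() and its limit parameter are hypothetical. The database functions are passed in as callables so that the example runs offline with stand-ins; in practice they would be rcsb.count and rcsb.search.

```python
def search_if_manageable(query, count_fn, search_fn, limit=100):
    """Run the search only if the number of hits does not exceed `limit`.

    Hypothetical helper: `count_fn` and `search_fn` are meant to be
    `rcsb.count` and `rcsb.search`, but any callables that accept the
    query as their single argument work.
    """
    n_hits = count_fn(query)
    if n_hits > limit:
        raise ValueError(f"Query matches {n_hits} entries (limit: {limit})")
    return search_fn(query)


# Offline stand-ins for the two database functions
fake_hits = ["2XPI", "5U6P", "5U6O"]
small_result = search_if_manageable(
    "HCN1",
    count_fn=lambda query: len(fake_hits),
    search_fn=lambda query: list(fake_hits),
    limit=10,
)
print(small_result)
```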

This was a simple search for the occurrence of the search term in any field. To search for a value in a specific field, use a FieldQuery. The available fields and their supported operators are listed in the structure attribute and chemical attribute references of the RCSB search API documentation.

# Query for 'lacA' gene
query1 = rcsb.FieldQuery(
    "rcsb_entity_source_organism.rcsb_gene_name.value",
    exact_match="lacA"
)
# Query for resolution below 1.5 Å
query2 = rcsb.FieldQuery("reflns.d_resolution_high", less=1.5)

The search API allows even more complex queries, e.g. for sequence or structure similarity. Have a look at the API reference of biotite.database.rcsb.

Multiple Query objects can be combined using the | (or) or & (and) operator for a more fine-grained selection. A FieldQuery is negated with ~.

composite_query = query1 & ~query2
print(rcsb.search(composite_query))

['1KQA', '1KRR', '1KRU', '1KRV', '3U7V', '4DUW', '4IUG', '4LFK', '4LFL', '4LFM', '4LFN', '5IFP', '5IFT', '5IHR', '5JUV', '5MGC', '5MGD']
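
One way to read these operators is as set operations on the entries each query matches: & keeps the intersection, | the union, and ~ the complement. The sketch below mimics query1 & ~query2 with plain Python sets. The four-character IDs are made-up placeholders, not real search results, and the variable names are chosen only for this illustration.

```python
# Made-up placeholder IDs, not real search results
all_entries = {"AAAA", "BBBB", "CCCC", "DDDD"}
matches_gene = {"AAAA", "BBBB", "CCCC"}  # plays the role of query1
matches_high_res = {"BBBB", "DDDD"}      # plays the role of query2

# query1 & ~query2: gene matches that do NOT satisfy the resolution criterion
and_not = matches_gene & (all_entries - matches_high_res)
# query1 | query2: entries matching at least one of the two criteria
either = matches_gene | matches_high_res

print(sorted(and_not))
print(sorted(either))
```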

Often the structures behind the obtained PDB IDs are partly redundant: for example, they may represent the same protein sequence or result from the same set of experiments. A Grouping can be used to group such redundant entries, or to return only a single representative of each group.

query = rcsb.BasicQuery("Transketolase")
# Group PDB IDs from the same collection
print(rcsb.search(
    query, group_by=rcsb.DepositGrouping(), return_groups=True
))
# Get only a single representative of each group
print(rcsb.search(
    query, group_by=rcsb.DepositGrouping(), return_groups=False
))

{'G_1002178': ['5RVW', '5RVX', '5RVY', '5RVZ', '5RW0'], 'G_1002179': ['5RW1'], 'G_1002349': ['7IF4']}

['5RVW', '5RW1', '7IF4']

Note that grouping may omit PDB IDs from the search results if they cannot be grouped. In the example above, only a few PDB entries were deposited as part of a collection, so only those appear in the search results.
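
The dictionary returned for return_groups=True is an ordinary Python dict and can be processed as such. The sketch below works on a copy of the grouped output shown above, so it needs no network access. That the first member of each group coincides with the returned representative is an observation about this particular output, not a documented guarantee.

```python
# Grouped result copied from the output above
groups = {
    "G_1002178": ["5RVW", "5RVX", "5RVY", "5RVZ", "5RW0"],
    "G_1002179": ["5RW1"],
    "G_1002349": ["7IF4"],
}

# All grouped PDB IDs as one flat list
all_ids = [pdb_id for members in groups.values() for pdb_id in members]
# First member of each group; in this output it coincides with the
# representatives returned for return_groups=False
first_members = [members[0] for members in groups.values()]
# Number of entries per group
group_sizes = {group_id: len(members) for group_id, members in groups.items()}

print(all_ids)
print(first_members)
print(group_sizes)
```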

Getting computational models

By default, search() only returns experimental structures. In addition, the RCSB lists an order of magnitude more computational models. They can be included in the search results by adding "computational" to the content_types parameter.

query = (
    rcsb.FieldQuery("rcsb_polymer_entity.pdbx_description", contains_phrase="Lysozyme")
    & rcsb.FieldQuery(
        "rcsb_entity_source_organism.scientific_name", exact_match="Homo sapiens"
    )
)
ids = rcsb.search(query, content_types=("computational",))
print(ids)

['AF_AFA0A080YUZ5F1', 'AF_AFP61626F1', 'AF_AFO75951F1', 'AF_AFQ6UWQ5F1', 'AF_AFQ7Z4W2F1', 'AF_AFQ96KX0F1', 'AF_AFQ86SG7F1', 'AF_AFQ8N1E2F1']

Four-character IDs, as seen in the previous examples, are RCSB PDB IDs of experimental structures; none appear here because content_types contains only "computational". IDs with the AF_ prefix, on the other hand, denote computational models from AlphaFold DB.

To download those we require another subpackage: biotite.database.afdb. Its fetch() function works very similarly.

import biotite.database.afdb as afdb

files = []
# For the sake of run time, only download the first 5 entries
for entry_id in ids[:5]:
    if entry_id.startswith("AF_"):
        # Entry is in AlphaFold DB
        files.append(afdb.fetch(entry_id, "cif", gettempdir()))
    elif entry_id.startswith("MA_"):
        # Entry is in ModelArchive, which is not yet supported
        raise NotImplementedError
    else:
        # Entry is in RCSB PDB
        files.append(rcsb.fetch(entry_id, "cif", gettempdir()))
print([basename(file) for file in files])

['AF_AFA0A080YUZ5F1.cif', 'AF_AFP61626F1.cif', 'AF_AFO75951F1.cif', 'AF_AFQ6UWQ5F1.cif', 'AF_AFQ7Z4W2F1.cif']
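
When a search mixes experimental and computational entries, the prefix check from the loop above can also be used to sort the IDs by source database first. The following is a minimal sketch that relies only on the prefixes named in this section. source_database() and partition_by_source() are hypothetical helpers, and the mixed ID list is assembled by hand from IDs that appear earlier in this tutorial.

```python
def source_database(entry_id):
    """Name the database an ID returned by the RCSB search belongs to.

    Hypothetical helper based only on the prefixes used in this section.
    """
    if entry_id.startswith("AF_"):
        return "afdb"
    if entry_id.startswith("MA_"):
        return "modelarchive"
    return "rcsb"


def partition_by_source(entry_ids):
    """Group IDs by source database, keeping the original order."""
    partitions = {}
    for entry_id in entry_ids:
        partitions.setdefault(source_database(entry_id), []).append(entry_id)
    return partitions


# Hand-assembled mix of IDs taken from earlier examples
mixed_ids = ["AF_AFP61626F1", "1AKI", "AF_AFO75951F1", "1L2Y"]
partitions = partition_by_source(mixed_ids)
print(partitions)
```

The "rcsb" list can then be passed to rcsb.fetch() in a single call, as shown at the beginning of this section. Whether afdb.fetch() accepts a list in the same way is an assumption to verify against its API reference.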