Fetching structure files from the RCSB PDB
Downloading structure files from the RCSB PDB is straightforward:
we pass the PDB ID, the file format and the target directory
to the fetch() function.
The function returns the path to the downloaded file, so the file
can be loaded directly via the biotite.structure.io subpackage
(more on this in a later tutorial).
We will download a protein structure of the miniprotein TC5b
(PDB: 1L2Y) into a temporary directory.
from tempfile import gettempdir
from os.path import basename
import biotite.database.rcsb as rcsb
file_path = rcsb.fetch("1l2y", "pdb", gettempdir())
print(basename(file_path))
1l2y.pdb
To download multiple files, we pass a list of PDB IDs and get a list of file paths in return.
# Download files in the more modern mmCIF format
file_paths = rcsb.fetch(["1l2y", "1aki"], "cif", gettempdir())
print([basename(file_path) for file_path in file_paths])
['1l2y.cif', '1aki.cif']
By default, fetch() checks whether the file to be fetched
already exists in the target directory and downloads it only if it
does not exist yet.
To download the file regardless, set overwrite to True.
# Download file in the fast and small BinaryCIF format
file_path = rcsb.fetch("1l2y", "bcif", gettempdir(), overwrite=True)
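The "download only if missing" behavior described above can be pictured with a small offline sketch. This is not Biotite's implementation: fetch_cached() and fake_download() are hypothetical names, and the download is simulated so that the example runs without network access.

```python
import os
import tempfile


def fetch_cached(name, directory, download, overwrite=False):
    """Return the path of `name` in `directory`, downloading only if needed.

    `download` is a hypothetical callable returning the file content
    as a string; it stands in for the actual network request.
    """
    path = os.path.join(directory, name)
    if overwrite or not os.path.isfile(path):
        with open(path, "w") as file:
            file.write(download(name))
    return path


calls = []


def fake_download(name):
    # Record each simulated download so that we can count them
    calls.append(name)
    return "placeholder content\n"


with tempfile.TemporaryDirectory() as directory:
    # First call: the file is missing, so it is downloaded
    fetch_cached("1l2y.pdb", directory, fake_download)
    # Second call: the file exists, so the download is skipped
    fetch_cached("1l2y.pdb", directory, fake_download)
    # Third call: overwrite=True forces a fresh download
    fetch_cached("1l2y.pdb", directory, fake_download, overwrite=True)

print(len(calls))
```

Only the first and the third call trigger a simulated download.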
If we omit the target directory or set it to None, the downloaded data
is returned directly as a file-like object, without creating a
file on disk at all.
file = rcsb.fetch("1l2y", "pdb")
lines = file.readlines()
print("".join(lines[:10] + ["..."]))
HEADER DE NOVO PROTEIN 25-FEB-02 1L2Y
TITLE NMR STRUCTURE OF TRP-CAGE MINIPROTEIN CONSTRUCT TC5B
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: TC5B;
COMPND 3 CHAIN: A;
COMPND 4 ENGINEERED: YES
SOURCE MOL_ID: 1;
SOURCE 2 SYNTHETIC: YES;
SOURCE 3 OTHER_DETAILS: THE PROTEIN WAS SYNTHESIZED USING STANDARD FMOC
SOURCE 4 SOLID-PHASE SYNTHESIS METHODS ON AN APPLIED BIOSYSTEMS 433A PEPTIDE
...
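Since the returned object behaves like an open file, it can be passed to any code that reads line by line. The following sketch collects the TITLE records. To keep it runnable offline, an io.StringIO filled with two records from the output above (column spacing restored by hand) stands in for the object returned by fetch(); read_title() is a hypothetical helper, not part of Biotite.

```python
import io

# Stand-in for the file-like object returned by rcsb.fetch("1l2y", "pdb")
file = io.StringIO(
    "TITLE     NMR STRUCTURE OF TRP-CAGE MINIPROTEIN CONSTRUCT TC5B\n"
    "COMPND    MOL_ID: 1;\n"
)


def read_title(file):
    """Join the text of all TITLE records of a PDB-formatted text file."""
    parts = []
    for line in file:
        if line.startswith("TITLE"):
            # In the PDB format the title text starts at column 11
            parts.append(line[10:].strip())
    return " ".join(parts)


title = read_title(file)
print(title)
```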
Searching for entries
As mentioned in the previous chapter, in many cases one is not interested in a
specific structure, but in a set of structures that fits some desired criteria.
As in the other biotite.database subpackages,
PDB IDs matching those criteria can be searched for by defining a
Query and passing it to search(),
which is backed by the RCSB search API.
Likewise, count() requests only the number of matching
PDB IDs, which is faster and more database-friendly than measuring the length
of the list returned by a search() call.
query = rcsb.BasicQuery("HCN1")
pdb_ids = rcsb.search(query)
print(pdb_ids)
print(rcsb.count(query))
files = rcsb.fetch(pdb_ids, "cif", gettempdir())
['2XPI', '5U6P', '5U6O', '8T4M', '8T4Y', '8T50', '8UC8', '6UQF', '6UQG', '8UC7', '9BC6', '9BC7', '8Y60', '3U0Z']
14
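One possible use of count() is as a guard before a search whose results are downloaded afterwards. The sketch below assumes a hypothetical helper search_if_small() with a hypothetical max_hits limit. The count_fn and search_fn parameters are stand-ins, so the example runs without network access; in real use one would pass rcsb.count and rcsb.search together with a Query object.

```python
def search_if_small(query, count_fn, search_fn, max_hits=100):
    """Run the search only if the number of hits stays within `max_hits`."""
    n_hits = count_fn(query)
    if n_hits > max_hits:
        raise ValueError(
            f"Query matches {n_hits} entries, more than the limit of {max_hits}"
        )
    return search_fn(query)


# Offline stand-ins using the first three IDs from the output above
example_ids = ["2XPI", "5U6P", "5U6O"]
ids = search_if_small(
    "HCN1",
    count_fn=lambda query: len(example_ids),
    search_fn=lambda query: list(example_ids),
)
print(ids)
```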
This was a simple search for the occurrence of the search term in any
field.
You can also search for a value in a specific field with a
FieldQuery.
A complete list of the available fields and their supported operators
is documented in the structure attribute and chemical attribute
references of the RCSB search API.
# Query for 'lacA' gene
query1 = rcsb.FieldQuery(
    "rcsb_entity_source_organism.rcsb_gene_name.value",
    exact_match="lacA"
)
# Query for resolution below 1.5 Å
query2 = rcsb.FieldQuery("reflns.d_resolution_high", less=1.5)
The search API allows even more complex queries, e.g. for sequence
or structure similarity. Have a look at the API reference of
biotite.database.rcsb.
Multiple Query objects can be combined using the | (or)
or & (and) operator for a more fine-grained selection.
A FieldQuery is negated with ~.
composite_query = query1 & ~query2
print(rcsb.search(composite_query))
['1KQA', '1KRR', '1KRU', '1KRV', '3U7V', '4DUW', '4IUG', '4LFK', '4LFL', '4LFM', '4LFN', '5IFP', '5IFT', '5IHR', '5JUV', '5MGC', '5MGD']
Often the structures behind the obtained PDB IDs have a degree of
redundancy.
For example they may represent the same protein sequences or result
from the same set of experiments.
You may use a Grouping of structures to group redundant
entries or even return only a single representative of each group.
query = rcsb.BasicQuery("Transketolase")
# Group PDB IDs from the same collection
print(rcsb.search(
    query, group_by=rcsb.DepositGrouping(), return_groups=True
))
# Get only a single representative of each group
print(rcsb.search(
    query, group_by=rcsb.DepositGrouping(), return_groups=False
))
{'G_1002178': ['5RVW', '5RVX', '5RVY', '5RVZ', '5RW0'], 'G_1002179': ['5RW1'], 'G_1002349': ['7IF4']}
['5RVW', '5RW1', '7IF4']
Note that grouping may omit PDB IDs from the search results if they cannot be grouped. For example, in the case shown above only a few PDB entries were uploaded as part of a collection and hence appear in the search results.
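The dictionary returned with return_groups=True is an ordinary mapping from group ID to PDB IDs, so it can be processed with plain Python. As a minimal sketch, the grouped result printed above is copied into the code, so no search request is needed to run it.

```python
# Grouped search result copied from the output above
groups = {
    "G_1002178": ["5RVW", "5RVX", "5RVY", "5RVZ", "5RW0"],
    "G_1002179": ["5RW1"],
    "G_1002349": ["7IF4"],
}

# Number of entries per group
group_sizes = {group_id: len(ids) for group_id, ids in groups.items()}
# All PDB IDs across the groups as one flat list
all_ids = [pdb_id for ids in groups.values() for pdb_id in ids]

print(group_sizes)
print(all_ids)
```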
Getting computational models
By default, search() only returns experimental structures.
In addition to those, the RCSB lists an order of magnitude more computational models.
They can be included in search results by adding "computational" to the
content_types parameter.
query = (
    rcsb.FieldQuery("rcsb_polymer_entity.pdbx_description", contains_phrase="Lysozyme")
    & rcsb.FieldQuery(
        "rcsb_entity_source_organism.scientific_name", exact_match="Homo sapiens"
    )
)
ids = rcsb.search(query, content_types=("computational",))
print(ids)
['AF_AFA0A080YUZ5F1', 'AF_AFP61626F1', 'AF_AFO75951F1', 'AF_AFQ6UWQ5F1', 'AF_AFQ7Z4W2F1', 'AF_AFQ96KX0F1', 'AF_AFQ86SG7F1', 'AF_AFQ8N1E2F1']
Four-character IDs, like those we saw above, are RCSB PDB IDs of experimental
structures; none appear here because only computational models were requested.
The IDs with the AF_ prefix, on the other hand, are computational models from
AlphaFold DB.
To download those we need another subpackage: biotite.database.afdb.
Its fetch() function works very similarly.
import biotite.database.afdb as afdb
files = []
# For the sake of run time, only download the first 5 entries
for id in ids[:5]:
    if id.startswith("AF_"):
        # Entry is in AlphaFold DB
        files.append(afdb.fetch(id, "cif", gettempdir()))
    elif id.startswith("MA_"):
        # Entry is in ModelArchive, which is not yet supported
        raise NotImplementedError
    else:
        # Entry is in RCSB PDB
        files.append(rcsb.fetch(id, "cif", gettempdir()))
print([basename(file) for file in files])
['AF_AFA0A080YUZ5F1.cif', 'AF_AFP61626F1.cif', 'AF_AFO75951F1.cif', 'AF_AFQ6UWQ5F1.cif', 'AF_AFQ7Z4W2F1.cif']
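When search results mix experimental structures and computational models, it can help to sort the IDs by source database first and then fetch each batch with the matching subpackage. The sketch below uses the same prefix rules as the loop above; source_database() is a hypothetical helper, and the IDs are taken from earlier examples in this tutorial.

```python
def source_database(entry_id):
    """Name the database that an ID returned by the search belongs to."""
    if entry_id.startswith("AF_"):
        return "AlphaFold DB"
    if entry_id.startswith("MA_"):
        return "ModelArchive"
    # Everything else is treated as an RCSB PDB ID
    return "RCSB PDB"


by_database = {}
for entry_id in ["AF_AFP61626F1", "1L2Y", "AF_AFO75951F1", "1AKI"]:
    by_database.setdefault(source_database(entry_id), []).append(entry_id)
print(by_database)
```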