Loading structures from file#
Usually, structures are not built from scratch but read from a file.
For this tutorial, we will work on a protein structure as small as possible,
namely the miniprotein TC5b (PDB: 1L2Y).
The structure of this 20-residue protein (304 atoms) has been elucidated via
NMR.
Thus, the corresponding structure file consists of multiple (namely 38) models, each
showing a different conformation.
Reading PDB files#
Probably one of the most popular structure file formats to date is the
Protein Data Bank (PDB) format.
First, we load the structure from a PDB file via the class
PDBFile in the subpackage biotite.structure.io.pdb.
from tempfile import gettempdir, NamedTemporaryFile
import biotite.structure.io.pdb as pdb
import biotite.database.rcsb as rcsb
pdb_file_path = rcsb.fetch("1l2y", "pdb", gettempdir())
pdb_file = pdb.PDBFile.read(pdb_file_path)
tc5b = pdb_file.get_structure()
print(type(tc5b).__name__)
print(tc5b.stack_depth())
print(tc5b.array_length())
print(tc5b.shape)
AtomArrayStack
38
304
(38, 304)
The method PDBFile.get_structure() returns an AtomArrayStack
unless the model parameter is specified, even if the file contains only
one model.
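To make the relation between models and the stack shape more concrete, the following is a minimal sketch in plain Python that does not use Biotite at all. It writes a toy multi-model PDB text and reads it back, relying on the fixed column layout of ATOM records (x, y and z in columns 31-54) and on MODEL/ENDMDL records as model delimiters. The helper names format_atom(), build_pdb_text() and read_models() and the toy coordinates are made up for illustration; Biotite's actual parser works differently and handles far more cases.

```python
def format_atom(serial, name, res_name, chain, res_id, x, y, z, element):
    """Format one ATOM record using the fixed PDB column layout."""
    return (
        f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_id:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}          {element:>2}"
    )


def build_pdb_text(models):
    """Wrap the atoms of each model in MODEL/ENDMDL records."""
    lines = []
    for model_number, atoms in enumerate(models, start=1):
        lines.append(f"MODEL     {model_number:>4}")
        for serial, (name, element, xyz) in enumerate(atoms, start=1):
            lines.append(format_atom(serial, name, "GLY", "A", 1, *xyz, element))
        lines.append("ENDMDL")
    lines.append("END")
    return "\n".join(lines)


def read_models(pdb_text):
    """Return a list of models; each model is a list of (x, y, z) tuples."""
    models = []
    current = None
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record == "MODEL":
            current = []
        elif record == "ATOM" and current is not None:
            # Coordinates occupy columns 31-38, 39-46 and 47-54 (1-based)
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            current.append(xyz)
        elif record == "ENDMDL":
            models.append(current)
            current = None
    return models


# Toy data: three models (conformations) of the same two atoms
toy_models = [
    [("N", "N", (0.0, 0.0, 0.0)), ("CA", "C", (1.458, 0.0, 0.0))],
    [("N", "N", (0.1, 0.2, 0.3)), ("CA", "C", (1.5, 0.1, -0.2))],
    [("N", "N", (-0.1, 0.0, 0.2)), ("CA", "C", (1.4, -0.3, 0.1))],
]
models = read_models(build_pdb_text(toy_models))
shape = (len(models), len(models[0]))
print(shape)  # (3, 2)

# Picking one model is conceptually what the model parameter does
first_model = models[0]
```

The (models, atoms) shape printed here plays the same role as the (38, 304) shape of the AtomArrayStack above.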
The following example shows how to write an atom array or stack back into a
PDB file:
pdb_file = pdb.PDBFile()
pdb_file.set_structure(tc5b)
temp_file = NamedTemporaryFile(suffix=".pdb", delete=False)
pdb_file.write(temp_file.name)
temp_file.close()
Other information (authors, secondary structure, etc.) cannot be
easily extracted from PDB files using PDBFile.
Working with the PDBx format#
The PDB format itself is now deprecated due to several
shortcomings and has been replaced by the Protein Data Bank Exchange (PDBx)
format.
As PDBx has become the standard structure format, it is also the format with
the most comprehensive interface in Biotite.
Today, this format has two common encodings:
The original text-based Crystallographic Information Framework (CIF)
and the BinaryCIF format.
While the former is human-readable, the latter is more efficient in terms of
file size and parsing speed.
The biotite.structure.io.pdbx subpackage provides classes for
interacting with both formats, CIFFile and BinaryCIFFile,
respectively.
In the following section we will focus on CIFFile, but
BinaryCIFFile works analogously.
import biotite.structure.io.pdbx as pdbx
cif_file_path = rcsb.fetch("1l2y", "cif", gettempdir())
cif_file = pdbx.CIFFile.read(cif_file_path)
PDBx can be imagined as a hierarchical dictionary with several levels:
File: The entirety of the PDBx file.
Block: The data for a single structure (e.g. 1L2Y).
Category: A coherent group of data (e.g. atom_site describes the atoms). Each column in the category must have the same length.
Column: Contains values of a specific type (e.g. atom_site.Cartn_x contains the x coordinates for each atom). Contains two Data instances, one for the actual data and one for a mask. In many categories a column contains only a single value.
Data: The actual data in the form of an
ndarray.
Each level may contain multiple instances of the next lower level, e.g. a
category may contain multiple columns.
Each level is represented by a separate class that can be used like a
dictionary.
For CIF files these are CIFFile, CIFBlock,
CIFCategory, CIFColumn and CIFData.
Note that CIFColumn is not treated like a dictionary, but
instead has data and mask attributes.
block = cif_file["1L2Y"]
category = block["audit_author"]
column = category["name"]
data = column.data
print(data.array)
['Neidigh, J.W.' 'Fesinmeyer, R.M.' 'Andersen, N.H.']
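To illustrate how the text of a CIF file maps onto the block/category/column hierarchy, here is a minimal sketch that parses a small, hand-written CIF excerpt into nested plain dictionaries. It is not how Biotite parses CIF files and it covers only a simplified subset of the syntax (single key-value pairs and loop_ tables with values on one line each). The function parse_cif() and the excerpt in CIF_TEXT are assumptions made for this example.

```python
import shlex

# Hand-written excerpt in CIF syntax, not the complete 1L2Y file
CIF_TEXT = """\
data_1L2Y
#
_entry.id 1L2Y
#
loop_
_audit_author.name
_audit_author.pdbx_ordinal
'Neidigh, J.W.' 1
'Fesinmeyer, R.M.' 2
'Andersen, N.H.' 3
#
"""


def parse_cif(text):
    """Parse a simplified CIF subset into block -> category -> column -> values."""
    blocks = {}
    block = None
    loop_names = None  # (category, column) pairs of the current loop, if any
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            # Separator lines end the current loop
            loop_names = None
        elif line.startswith("data_"):
            block = blocks.setdefault(line[len("data_"):], {})
            loop_names = None
        elif line == "loop_":
            loop_names = []
        elif line.startswith("_"):
            # shlex takes care of quoted values such as 'Neidigh, J.W.'
            tokens = shlex.split(line)
            category, column = tokens[0][1:].split(".", 1)
            values = block.setdefault(category, {}).setdefault(column, [])
            if loop_names is not None and len(tokens) == 1:
                # Header line of a loop: only the column name is given
                loop_names.append((category, column))
            else:
                # Single key-value pair
                values.extend(tokens[1:])
        elif loop_names:
            # Data row of a loop: one value per column
            for (category, column), value in zip(loop_names, shlex.split(line)):
                block[category][column].append(value)
    return blocks


cif = parse_cif(CIF_TEXT)
names = cif["1L2Y"]["audit_author"]["name"]
print(names)
```

Unlike CIFColumn, this sketch stores every value as a string and keeps no separate mask; type conversion would be a later step, comparable to as_array(int).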
The data access can be cut short, especially if the file contains a single block and a certain data type is expected instead of strings.
category = cif_file.block["audit_author"]
column = category["pdbx_ordinal"]
print(column.as_array(int))
[1 2 3]
As already mentioned, many categories contain only a single value per column. In this case it may be convenient to get only a single item instead of an array.
for key, column in cif_file.block["citation"].items():
    print(f"{key:25}{column.as_item()}")
id                       primary
title                    Designing a 20-residue protein.
journal_abbrev           Nat.Struct.Biol.
journal_volume           9
page_first               425
page_last                430
year                     2002
journal_id_ASTM          NSBIEW
country                  US
journal_id_ISSN          1072-8368
journal_id_CSD           2024
book_publisher           ?
pdbx_database_id_PubMed  11979279
pdbx_database_id_DOI     10.1038/nsb798
Note the ? in the output.
It indicates that the value is masked as ‘unknown’.
That becomes clear when we look at the mask of that column.
mask = block["citation"]["book_publisher"].mask.array
print(mask)
print(pdbx.MaskValue(mask[0]))
[2]
2
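The idea behind the mask can be sketched in plain Python: the placeholder characters ? (unknown) and . (inapplicable) are not real values, so they are recorded in a separate mask next to the data. The enum ToyMaskValue and the function split_data_and_mask() below are hypothetical stand-ins written for this example, with the number 2 for an unknown value chosen to match the output above; they are not Biotite's classes.

```python
from enum import IntEnum


class ToyMaskValue(IntEnum):
    """Hypothetical stand-in for the three states a value can have."""
    PRESENT = 0
    INAPPLICABLE = 1
    MISSING = 2


def split_data_and_mask(raw_values):
    """Separate raw CIF strings into a data list and a parallel mask list."""
    data = []
    mask = []
    for value in raw_values:
        if value == "?":
            # Unknown value: keep an empty placeholder in the data
            data.append("")
            mask.append(ToyMaskValue.MISSING)
        elif value == ".":
            # Value does not apply to this row
            data.append("")
            mask.append(ToyMaskValue.INAPPLICABLE)
        else:
            data.append(value)
            mask.append(ToyMaskValue.PRESENT)
    return data, mask


data, mask = split_data_and_mask(["Nat.Struct.Biol.", "?", "."])
print(data)
print([int(m) for m in mask])  # [0, 2, 1]
```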
To set or add blocks, categories, etc., we simply assign values as we would with dictionaries.
category = pdbx.CIFCategory()
category["number"] = pdbx.CIFColumn(pdbx.CIFData([1, 2]))
category["person"] = pdbx.CIFColumn(pdbx.CIFData(["me", "you"]))
category["greeting"] = pdbx.CIFColumn(pdbx.CIFData(["Hi!", "Hello!"]))
block["greetings"] = category
print(category.serialize())
loop_
_greetings.number
_greetings.person
_greetings.greeting
1 me Hi!
2 you Hello!
For the sake of brevity it is also possible to omit CIFColumn and
CIFData and even pass columns directly at category creation.
category = pdbx.CIFCategory({
    # If the columns contain only a single value, no list is required
    "fruit": "apple",
    "color": "red",
    "taste": "delicious",
})
block["fruits"] = category
print(category.serialize())
_fruits.fruit apple
_fruits.color red
_fruits.taste delicious
For BinaryCIFFile the usage is analogous.
bcif_file_path = rcsb.fetch("1l2y", "bcif", gettempdir())
bcif_file = pdbx.BinaryCIFFile.read(bcif_file_path)
for key, column in bcif_file["1L2Y"]["audit_author"].items():
    print(f"{key:25}{column.as_array()}")
name                     ['Neidigh, J.W.' 'Fesinmeyer, R.M.' 'Andersen, N.H.']
pdbx_ordinal             [1 2 3]
The main difference is that BinaryCIFData has an additional
encoding attribute that specifies how the data is compressed in the binary
representation.
A well-chosen encoding can reduce the file size significantly.
import numpy as np
# Default uncompressed encoding
array = np.arange(100)
print(pdbx.BinaryCIFData(array).serialize())
print("\nvs.\n")
# Delta encoding followed by run-length encoding
# [0, 1, 2, ...] -> [0, 1, 1, ...] -> [0, 1, 1, 99]
print(
    pdbx.BinaryCIFData(
        array,
        encoding=[
            # [0, 1, 2, ...] -> [0, 1, 1, ...]
            pdbx.DeltaEncoding(),
            # [0, 1, 1, ...] -> [0, 1, 1, 99]
            pdbx.RunLengthEncoding(),
            # [0, 1, 1, 99] -> b"\x00\x00..."
            pdbx.ByteArrayEncoding(),
        ],
    ).serialize()
)
{'data': b'\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00\n\x00\x00\x00\x0b\x00\x00\x00\x0c\x00\x00\x00\r\x00\x00\x00\x0e\x00\x00\x00\x0f\x00\x00\x00\x10\x00\x00\x00\x11\x00\x00\x00\x12\x00\x00\x00\x13\x00\x00\x00\x14\x00\x00\x00\x15\x00\x00\x00\x16\x00\x00\x00\x17\x00\x00\x00\x18\x00\x00\x00\x19\x00\x00\x00\x1a\x00\x00\x00\x1b\x00\x00\x00\x1c\x00\x00\x00\x1d\x00\x00\x00\x1e\x00\x00\x00\x1f\x00\x00\x00 \x00\x00\x00!\x00\x00\x00"\x00\x00\x00#\x00\x00\x00$\x00\x00\x00%\x00\x00\x00&\x00\x00\x00\'\x00\x00\x00(\x00\x00\x00)\x00\x00\x00*\x00\x00\x00+\x00\x00\x00,\x00\x00\x00-\x00\x00\x00.\x00\x00\x00/\x00\x00\x000\x00\x00\x001\x00\x00\x002\x00\x00\x003\x00\x00\x004\x00\x00\x005\x00\x00\x006\x00\x00\x007\x00\x00\x008\x00\x00\x009\x00\x00\x00:\x00\x00\x00;\x00\x00\x00<\x00\x00\x00=\x00\x00\x00>\x00\x00\x00?\x00\x00\x00@\x00\x00\x00A\x00\x00\x00B\x00\x00\x00C\x00\x00\x00D\x00\x00\x00E\x00\x00\x00F\x00\x00\x00G\x00\x00\x00H\x00\x00\x00I\x00\x00\x00J\x00\x00\x00K\x00\x00\x00L\x00\x00\x00M\x00\x00\x00N\x00\x00\x00O\x00\x00\x00P\x00\x00\x00Q\x00\x00\x00R\x00\x00\x00S\x00\x00\x00T\x00\x00\x00U\x00\x00\x00V\x00\x00\x00W\x00\x00\x00X\x00\x00\x00Y\x00\x00\x00Z\x00\x00\x00[\x00\x00\x00\\\x00\x00\x00]\x00\x00\x00^\x00\x00\x00_\x00\x00\x00`\x00\x00\x00a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00', 'encoding': [{'type': <TypeCode.INT32: 3>, 'kind': 'ByteArray'}]}
vs.
{'data': b'\x00\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00c\x00\x00\x00', 'encoding': [{'srcType': <TypeCode.INT32: 3>, 'origin': np.int64(0), 'kind': 'Delta'}, {'srcSize': 100, 'srcType': <TypeCode.INT32: 3>, 'kind': 'RunLength'}, {'type': <TypeCode.INT32: 3>, 'kind': 'ByteArray'}]}
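The three encoding steps named in the comments above can be reproduced in plain Python to see where the 16 bytes come from. This is only a sketch of the principle, assuming little-endian 32-bit integers and an origin of 0 as shown in the serialized output; the function names are made up for this example and the code is unrelated to Biotite's implementation.

```python
import struct


def delta_encode(values, origin=0):
    """Store each value as the difference to its predecessor."""
    encoded = []
    previous = origin
    for value in values:
        encoded.append(value - previous)
        previous = value
    return encoded


def run_length_encode(values):
    """Store runs of equal values as flat (value, count) pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-2] == value:
            encoded[-1] += 1
        else:
            encoded.extend([value, 1])
    return encoded


def to_int32_bytes(values):
    """Pack integers as little-endian 32-bit values."""
    return struct.pack(f"<{len(values)}i", *values)


def run_length_decode(encoded):
    """Expand flat (value, count) pairs back into runs."""
    decoded = []
    for value, count in zip(encoded[0::2], encoded[1::2]):
        decoded.extend([value] * count)
    return decoded


def delta_decode(encoded, origin=0):
    """Accumulate the differences to restore the original values."""
    decoded = []
    current = origin
    for step in encoded:
        current += step
        decoded.append(current)
    return decoded


values = list(range(100))
plain = to_int32_bytes(values)
# [0, 1, 2, ...] -> [0, 1, 1, ...] -> [0, 1, 1, 99] -> bytes
packed = to_int32_bytes(run_length_encode(delta_encode(values)))
print(len(plain), len(packed))  # 400 16

# Decoding applies the inverse steps in reverse order
unpacked = list(struct.unpack("<4i", packed))
restored = delta_decode(run_length_decode(unpacked))
print(restored == values)  # True
```

The combination works so well here because the input is a regular sequence: delta encoding turns it into a long run of identical values, which run-length encoding then collapses into a single pair.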
As finding good encodings manually can be tedious, compress() does this
automatically for anything from a single BinaryCIFData to an entire
BinaryCIFFile.
uncompressed_data = pdbx.BinaryCIFData(np.arange(100))
print(f"Uncompressed size: {len(uncompressed_data.serialize()['data'])} bytes")
compressed_data = pdbx.compress(uncompressed_data)
print(f"Compressed size: {len(compressed_data.serialize()['data'])} bytes")
Uncompressed size: 400 bytes
Compressed size: 16 bytes
Using structures from a PDBx file#
While this low-level API is useful for exploiting the full potential of the PDBx format, most applications only require reading/writing a structure. As the BinaryCIF format is both smaller and faster to parse, it is recommended to use it instead of the CIF format in Biotite.
tc5b = pdbx.get_structure(bcif_file)
# Do some fancy stuff
pdbx.set_structure(bcif_file, tc5b)
Similar to PDBFile, get_structure() automatically creates an
AtomArrayStack, even if the file actually contains only a single
model.
If you would like to have an AtomArray instead, you have to specify
the model parameter.