Loading structures from file

Usually structures are not built from scratch but read from a file. For this tutorial, we will work on a protein structure as small as possible, namely the miniprotein TC5b (PDB: 1L2Y). The structure of this 20-residue protein (304 atoms) has been elucidated via NMR. Thus, the corresponding structure file consists of multiple (namely 38) models, each showing a different conformation.

Reading PDB files

Probably one of the most popular structure file formats to date is the Protein Data Bank (PDB) format. First, we load the structure from a PDB file via the class PDBFile in the subpackage biotite.structure.io.pdb.

from tempfile import gettempdir, NamedTemporaryFile
import biotite.structure.io.pdb as pdb
import biotite.database.rcsb as rcsb

pdb_file_path = rcsb.fetch("1l2y", "pdb", gettempdir())
pdb_file = pdb.PDBFile.read(pdb_file_path)
tc5b = pdb_file.get_structure()
print(type(tc5b).__name__)
print(tc5b.stack_depth())
print(tc5b.array_length())
print(tc5b.shape)

AtomArrayStack
38
304
(38, 304)

The method PDBFile.get_structure() returns an AtomArrayStack unless the model parameter is specified, even if the file contains only one model. The following example shows how to write an atom array or stack back into a PDB file:

pdb_file = pdb.PDBFile()
pdb_file.set_structure(tc5b)
temp_file = NamedTemporaryFile(suffix=".pdb", delete=False)
pdb_file.write(temp_file.name)
temp_file.close()

Other information (authors, secondary structure, etc.) cannot be easily extracted from PDB files using PDBFile.

Working with the PDBx format

The PDB format itself is now deprecated due to several shortcomings and has been replaced by the Protein Data Bank Exchange (PDBx) format. As PDBx has become the standard structure format, it is also the format with the most comprehensive interface in Biotite. Today, this format has two common encodings: the original text-based Crystallographic Information Framework (CIF) and the BinaryCIF format. While the former is human-readable, the latter is more efficient in terms of file size and parsing speed. The biotite.structure.io.pdbx subpackage provides classes for interacting with both formats, CIFFile and BinaryCIFFile, respectively. In the following section we will focus on CIFFile, but BinaryCIFFile works analogously.

import biotite.structure.io.pdbx as pdbx

cif_file_path = rcsb.fetch("1l2y", "cif", gettempdir())
cif_file = pdbx.CIFFile.read(cif_file_path)

PDBx can be imagined as a hierarchical dictionary with several levels:

File: The entirety of the PDBx file.

Block: The data for a single structure (e.g. 1L2Y).

Category: A coherent group of data (e.g. atom_site describes the atoms). Each column in the category must have the same length.

Column: Contains values of a specific type (e.g. atom_site.Cartn_x contains the x coordinates for each atom). Contains two Data instances, one for the actual data and one for a mask. In a lot of categories a column contains only a single value.

Data: The actual data in the form of an ndarray.

Each level may contain multiple instances of the next lower level, e.g. a category may contain multiple columns. Each level is represented by a separate class that can be used like a dictionary. For CIF files these are CIFFile, CIFBlock, CIFCategory, CIFColumn and CIFData. Note that CIFColumn is not treated like a dictionary, but instead has a data and a mask attribute.

block = cif_file["1L2Y"]
category = block["audit_author"]
column = category["name"]
data = column.data
print(data.array)

['Neidigh, J.W.' 'Fesinmeyer, R.M.' 'Andersen, N.H.']

The data access can be cut short, especially if the file contains a single block and a certain data type is expected instead of strings.

category = cif_file.block["audit_author"]
column = category["pdbx_ordinal"]
print(column.as_array(int))

[1 2 3]

As already mentioned, many categories contain only a single value per column. In this case it may be convenient to get only a single item instead of an array.

for key, column in cif_file.block["citation"].items():
    print(f"{key:25}{column.as_item()}")

id                       primary
title                    Designing a 20-residue protein.
journal_abbrev           Nat.Struct.Biol.
journal_volume           9
page_first               425
page_last                430
year                     2002
journal_id_ASTM          NSBIEW
country                  US
journal_id_ISSN          1072-8368
journal_id_CSD           2024
book_publisher           ?
pdbx_database_id_PubMed  11979279
pdbx_database_id_DOI     10.1038/nsb798

Note the ? in the output. It indicates that the value is masked as ‘unknown’. That becomes clear when we look at the mask of that column.

mask = block["citation"]["book_publisher"].mask.array
print(mask)
print(pdbx.MaskValue(mask[0]))

[2]
2
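
The relation between CIF placeholders and mask values can be sketched without Biotite. The enum below is a hypothetical stand-in, not Biotite's own class. It assumes the convention of the BinaryCIF specification (0 for a present value, 1 for the inapplicable placeholder '.', 2 for the unknown placeholder '?'); only the value 2 for '?' is confirmed by the output above. Quoting rules of the CIF syntax are ignored for simplicity.

```python
from enum import IntEnum


class ToyMaskValue(IntEnum):
    # Hypothetical stand-in; values assumed from the BinaryCIF convention
    PRESENT = 0
    INAPPLICABLE = 1
    MISSING = 2


def toy_mask(raw_values):
    """Map raw CIF strings to one mask value per entry."""
    placeholders = {
        ".": ToyMaskValue.INAPPLICABLE,
        "?": ToyMaskValue.MISSING,
    }
    # Everything that is not a placeholder counts as a present value
    return [placeholders.get(value, ToyMaskValue.PRESENT) for value in raw_values]


toy_values = ["Nat.Struct.Biol.", "?", "."]
print([mask_value.name for mask_value in toy_mask(toy_values)])
```

Keeping the mask separate from the data lets a column distinguish a literal value from a placeholder without relying on sentinel strings.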

To set or add blocks, categories, etc., we simply assign values as we would with dictionaries.

category = pdbx.CIFCategory()
category["number"] = pdbx.CIFColumn(pdbx.CIFData([1, 2]))
category["person"] = pdbx.CIFColumn(pdbx.CIFData(["me", "you"]))
category["greeting"] = pdbx.CIFColumn(pdbx.CIFData(["Hi!", "Hello!"]))
block["greetings"] = category
print(category.serialize())

loop_
_greetings.number 
_greetings.person 
_greetings.greeting 
1 me  Hi!
2 you Hello!

For the sake of brevity it is also possible to omit CIFColumn and CIFData and even pass columns directly at category creation.

category = pdbx.CIFCategory({
    # If the columns contain only a single value, no list is required
    "fruit": "apple",
    "color": "red",
    "taste": "delicious",
})
block["fruits"] = category
print(category.serialize())

_fruits.fruit   apple
_fruits.color   red
_fruits.taste   delicious

For BinaryCIFFile the usage is analogous.

bcif_file_path = rcsb.fetch("1l2y", "bcif", gettempdir())
bcif_file = pdbx.BinaryCIFFile.read(bcif_file_path)
for key, column in bcif_file["1L2Y"]["audit_author"].items():
    print(f"{key:25}{column.as_array()}")

name                     ['Neidigh, J.W.' 'Fesinmeyer, R.M.' 'Andersen, N.H.']
pdbx_ordinal             [1 2 3]

The main difference is that BinaryCIFData has an additional encoding attribute that specifies how the data is compressed in the binary representation. A well-chosen encoding can reduce the file size significantly.

import numpy as np

# Default uncompressed encoding
array = np.arange(100)
print(pdbx.BinaryCIFData(array).serialize())
print("\nvs.\n")
# Delta encoding followed by run-length encoding
# [0, 1, 2, ...] -> [0, 1, 1, ...] -> [0, 1, 1, 99]
print(
    pdbx.BinaryCIFData(
        array,
        encoding = [
            # [0, 1, 2, ...] -> [0, 1, 1, ...]
            pdbx.DeltaEncoding(),
            # [0, 1, 1, ...] -> [0, 1, 1, 99]
            pdbx.RunLengthEncoding(),
            # [0, 1, 1, 99] -> b"\x00\x00..."
            pdbx.ByteArrayEncoding()
        ]
    ).serialize()
)

{'data': b'\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00\n\x00\x00\x00\x0b\x00\x00\x00\x0c\x00\x00\x00\r\x00\x00\x00\x0e\x00\x00\x00\x0f\x00\x00\x00\x10\x00\x00\x00\x11\x00\x00\x00\x12\x00\x00\x00\x13\x00\x00\x00\x14\x00\x00\x00\x15\x00\x00\x00\x16\x00\x00\x00\x17\x00\x00\x00\x18\x00\x00\x00\x19\x00\x00\x00\x1a\x00\x00\x00\x1b\x00\x00\x00\x1c\x00\x00\x00\x1d\x00\x00\x00\x1e\x00\x00\x00\x1f\x00\x00\x00 \x00\x00\x00!\x00\x00\x00"\x00\x00\x00#\x00\x00\x00$\x00\x00\x00%\x00\x00\x00&\x00\x00\x00\'\x00\x00\x00(\x00\x00\x00)\x00\x00\x00*\x00\x00\x00+\x00\x00\x00,\x00\x00\x00-\x00\x00\x00.\x00\x00\x00/\x00\x00\x000\x00\x00\x001\x00\x00\x002\x00\x00\x003\x00\x00\x004\x00\x00\x005\x00\x00\x006\x00\x00\x007\x00\x00\x008\x00\x00\x009\x00\x00\x00:\x00\x00\x00;\x00\x00\x00<\x00\x00\x00=\x00\x00\x00>\x00\x00\x00?\x00\x00\x00@\x00\x00\x00A\x00\x00\x00B\x00\x00\x00C\x00\x00\x00D\x00\x00\x00E\x00\x00\x00F\x00\x00\x00G\x00\x00\x00H\x00\x00\x00I\x00\x00\x00J\x00\x00\x00K\x00\x00\x00L\x00\x00\x00M\x00\x00\x00N\x00\x00\x00O\x00\x00\x00P\x00\x00\x00Q\x00\x00\x00R\x00\x00\x00S\x00\x00\x00T\x00\x00\x00U\x00\x00\x00V\x00\x00\x00W\x00\x00\x00X\x00\x00\x00Y\x00\x00\x00Z\x00\x00\x00[\x00\x00\x00\\\x00\x00\x00]\x00\x00\x00^\x00\x00\x00_\x00\x00\x00`\x00\x00\x00a\x00\x00\x00b\x00\x00\x00c\x00\x00\x00', 'encoding': [{'type': <TypeCode.INT32: 3>, 'kind': 'ByteArray'}]}

vs.

{'data': b'\x00\x00\x00\x00\x01\x00\x00\x00\x01\x00\x00\x00c\x00\x00\x00', 'encoding': [{'srcType': <TypeCode.INT32: 3>, 'origin': np.int64(0), 'kind': 'Delta'}, {'srcSize': 100, 'srcType': <TypeCode.INT32: 3>, 'kind': 'RunLength'}, {'type': <TypeCode.INT32: 3>, 'kind': 'ByteArray'}]}
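
To make the two encoding steps above more tangible, here is a plain-Python sketch of delta encoding followed by run-length encoding. It does not use Biotite; the helper functions are hypothetical and only mimic the transformation described in the comments of the example, including the final packing into little-endian 32-bit integers.

```python
import struct


def delta_encode(values):
    """Return the origin and the differences between consecutive values."""
    origin = values[0]
    # The first element is stored relative to the origin, hence 0
    deltas = [0] + [b - a for a, b in zip(values, values[1:])]
    return origin, deltas


def run_length_encode(values):
    """Return a flat list of alternating values and repeat counts."""
    encoded = []
    for value in values:
        if encoded and encoded[-2] == value:
            # Same value as the current run: increase its count
            encoded[-1] += 1
        else:
            # Start a new run with a count of 1
            encoded.extend([value, 1])
    return encoded


origin, deltas = delta_encode(list(range(100)))
runs = run_length_encode(deltas)
# Pack the runs as little-endian 32-bit integers
packed = struct.pack(f"<{len(runs)}i", *runs)
print(runs)
print(len(packed), "bytes")
```

The hundred input values collapse into two runs, which is why the encoded data above is so much shorter than the uncompressed byte array.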

As finding good encodings manually can be tedious, compress() does this automatically for anything from a single BinaryCIFData to an entire BinaryCIFFile.

uncompressed_data = pdbx.BinaryCIFData(np.arange(100))
print(f"Uncompressed size: {len(uncompressed_data.serialize()['data'])} bytes")
compressed_data = pdbx.compress(uncompressed_data)
print(f"Compressed size: {len(compressed_data.serialize()['data'])} bytes")

Uncompressed size: 400 bytes
Compressed size: 16 bytes

Using structures from a PDBx file

While this low-level API is useful for exploiting the full potential of the PDBx format, most applications only require reading or writing a structure. As the BinaryCIF format is both smaller and faster to parse, it is recommended to use it instead of the CIF format in Biotite.

tc5b = pdbx.get_structure(bcif_file)
# Do some fancy stuff
pdbx.set_structure(bcif_file, tc5b)

Similar to PDBFile, get_structure() automatically creates an AtomArrayStack, even if the file actually contains only a single model. If you would like to have an AtomArray instead, you have to specify the model parameter.