.. include:: /tutorial/preamble.rst

Loading structures from file
============================
Usually structures are not built from scratch, but they are read from a file.
For this tutorial, we will work on a protein structure as small as possible,
namely the miniprotein *TC5b* (PDB: ``1L2Y``).
The structure of this 20-residue protein (304 atoms) has been elucidated via
NMR.
Thus, the corresponding structure file consists of multiple (namely 38) models, each
showing another conformation.

Reading PDB files
-----------------

.. currentmodule:: biotite.structure.io.pdb

Probably one of the most popular structure file formats to date is the
*Protein Data Bank Exchange* (PDB) format.
At first we load the structure from a PDB file via the class
:class:`PDBFile` in the subpackage :mod:`biotite.structure.io.pdb`.

.. jupyter-execute::

    from tempfile import gettempdir, NamedTemporaryFile
    import biotite.structure.io.pdb as pdb
    import biotite.database.rcsb as rcsb

    pdb_file_path = rcsb.fetch("1l2y", "pdb", gettempdir())
    pdb_file = pdb.PDBFile.read(pdb_file_path)
    tc5b = pdb_file.get_structure()
    print(type(tc5b).__name__)
    print(tc5b.stack_depth())
    print(tc5b.array_length())
    print(tc5b.shape)

The method :func:`PDBFile.get_structure()` returns a :class:`AtomArrayStack`
unless the :obj:`model` parameter is specified, even if the file contains only
one model.
The following example shows how to write an atom array or stack back into a
PDB file:

.. jupyter-execute::

    pdb_file = pdb.PDBFile()
    pdb_file.set_structure(tc5b)
    temp_file = NamedTemporaryFile(suffix=".pdb", delete=False)
    pdb_file.write(temp_file.name)
    temp_file.close()

Other information (authors, secondary structure, etc.) cannot be
easily extracted from PDB files using :class:`PDBFile`.

Working with the PDBx format
----------------------------

.. currentmodule:: biotite.structure.io.pdbx

After all, the *PDB* format itself is deprecated now due to several
shortcomings and was replaced by the *Protein Data Bank Exchange* (PDBx)
format.
As PDBx has become the standard structure format, it is also the format with
the most comprehensive interface in *Biotite*.
Today, this format has two common encodings:
The original text-based *Crystallographic Information Framework* (CIF)
and the *BinaryCIF* format.
While the former is human-readable, the latter is more efficient in terms of
file size and parsing speed.
The :mod:`biotite.structure.io.pdbx` subpackage provides classes for
interacting with both formats, :class:`CIFFile` and :class:`BinaryCIFFile`,
respectively.
In the following section we will focus on :class:`CIFFile`, but
:class:`BinaryCIFFile` works analogous.

.. jupyter-execute::

    import biotite.structure.io.pdbx as pdbx

    cif_file_path = rcsb.fetch("1l2y", "cif", gettempdir())
    cif_file = pdbx.CIFFile.read(cif_file_path)

*PDBx* can be imagined as hierarchical dictionary, with several
levels:

  #. **File**: The entirety of the *PDBx* file.
  #. **Block**: The data for a single structure (e.g. `1L2Y`).
  #. **Category**: A coherent group of data
     (e.g. `atom_site` describes the atoms).
     Each column in the category must have the same length.
  #. **Column**: Contains values of a specific type
     (e.g `atom_site.Cartn_x` contains the *x* coordinates for each atom).
     Contains two *Data* instances, one for the actual data and one
     for a mask.
     In a lot of categories a column contains only a single value.
  #. **Data**: The actual data in form of a :class:`ndarray`.

Each level may contain multiple instances of the next lower level, e.g. a
category may contain multiple columns.
Each level is represented by a separate class, that can be used like a
dictionary.
For CIF files these are :class:`CIFFile`, :class:`CIFBlock`,
:class:`CIFCategory`, :class:`CIFColumn` and :class:`CIFData`.
Note that :class:`CIFColumn` is not treated like a dictionary, but
instead has a ``data`` and ``mask`` attribute.

.. jupyter-execute::

    block = cif_file["1L2Y"]
    category = block["audit_author"]
    column = category["name"]
    data = column.data
    print(data.array)

The data access can be cut short, especially if the file contains a single block
and a certain data type is expected instead of strings.

.. jupyter-execute::

    category = cif_file.block["audit_author"]
    column = category["pdbx_ordinal"]
    print(column.as_array(int))

As already mentioned, many categories contain only a single value per column.
In this case it may be convenient to get only a single item instead of an
array.

.. jupyter-execute::

    for key, column in cif_file.block["citation"].items():
        print(f"{key:25}{column.as_item()}")

Note the ``?`` in the output.
It indicates that the value is masked as '*unknown*'.
That becomes clear when we look at the mask of that column.

.. jupyter-execute::

    mask = block["citation"]["book_publisher"].mask.array
    print(mask)
    print(pdbx.MaskValue(mask[0]))


For setting/adding blocks, categories etc. we simply assign values as we would
do with dictionaries.

.. jupyter-execute::

    category = pdbx.CIFCategory()
    category["number"] = pdbx.CIFColumn(pdbx.CIFData([1, 2]))
    category["person"] = pdbx.CIFColumn(pdbx.CIFData(["me", "you"]))
    category["greeting"] = pdbx.CIFColumn(pdbx.CIFData(["Hi!", "Hello!"]))
    block["greetings"] = category
    print(category.serialize())

For the sake of brevity it is also possible to omit :class:`CIFColumn` and
:class:`CIFData` and even pass columns directly at category creation.

.. jupyter-execute::

    category = pdbx.CIFCategory({
        # If the columns contain only a single value, no list is required
        "fruit": "apple",
        "color": "red",
        "taste": "delicious",
    })
    block["fruits"] = category
    print(category.serialize())

For :class:`BinaryCIFFile` the usage is analogous.

.. jupyter-execute::

    bcif_file_path = rcsb.fetch("1l2y", "bcif", gettempdir())
    bcif_file = pdbx.BinaryCIFFile.read(bcif_file_path)
    for key, column in bcif_file["1L2Y"]["audit_author"].items():
        print(f"{key:25}{column.as_array()}")

The main difference is that :class:`BinaryCIFData` has an additional
``encoding`` attribute that specifies how the data is compressed in the binary
representation.
A well chosen encoding can reduce the file size significantly.

.. jupyter-execute::

    import numpy as np

    # Default uncompressed encoding
    array = np.arange(100)
    print(pdbx.BinaryCIFData(array).serialize())
    print("\nvs.\n")
    # Delta encoding followed by run-length encoding
    # [0, 1, 2, ...] -> [0, 1, 1, ...] -> [0, 1, 1, 99]
    print(
        pdbx.BinaryCIFData(
            array,
            encoding = [
                # [0, 1, 2, ...] -> [0, 1, 1, ...]
                pdbx.DeltaEncoding(),
                # [0, 1, 1, ...] -> [0, 1, 1, 99]
                pdbx.RunLengthEncoding(),
                # [0, 1, 1, 99] -> b"\x00\x00..."
                pdbx.ByteArrayEncoding()
            ]
        ).serialize()
    )

As finding good encodings manually can be tedious, :func:`compress()` does this
automatically - from a single :class:`BinaryCIFData` to an entire
:class:`BinaryCIFFile`.

.. jupyter-execute::

    uncompressed_data = pdbx.BinaryCIFData(np.arange(100))
    print(f"Uncompressed size: {len(uncompressed_data.serialize()['data'])} bytes")
    compressed_data = pdbx.compress(uncompressed_data)
    print(f"Compressed size: {len(compressed_data.serialize()['data'])} bytes")


Using structures from a PDBx file
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While this low-level API is useful for using the entire potential of
the PDBx format, most applications require only reading/writing a
structure.
As the *BinaryCIF* format is both, smaller and faster to parse, it is
recommended to use it instead of the *CIF* format in *Biotite*.

.. jupyter-execute::

    tc5b = pdbx.get_structure(bcif_file)
    # Do some fancy stuff
    pdbx.set_structure(bcif_file, tc5b)

Similar to :class:`PDBFile`, :func:`get_structure()` creates automatically an
:class:`AtomArrayStack`, even if the file actually contains only a single
model.
If you would like to have an :class:`AtomArray` instead, you have to specify
the :obj:`model` parameter.