Pinder eval#
Harnessing Biotite: the workhorse beind Pinder’s abstractions#
At the heart of pinder
lies a strategic integration with the numpy-native biotite package. This integration allowed for the manipulation and operation of structural data at unparalleled efficiency. Through numpy’s optimized array operations, pinder
manages to handle large-scale data without compromising on speed, accuracy or ease of use.
Evaluation harness: implementation of DockQ metrics#
One of pinder
’s core contributions has been the custom implementation of the gold-standard set of structural interface prediction quality metrics: Lrms, Irms, Fnat, DockQ and the CAPRI classification.
While I won’t cover each of these metrics in detail in this notebook, I want to highlight the motivation and features of the DockQ implementation. This metric, essential in assessing the quality of protein-protein docking predictions, was first introduced by Sankar Basu and Bjorn Wallner in 2016, and has since remained the go-to metric for PPI prediction methodology benchmarks. The metric introduces a normalized quality measure that combines all of the metrics standardized by the CAPRI community to enable direct ranking of models, correlation with scoring functions and use as a target function in machine learning algorithms.
Building off of the fantastic work in the original DockQ publication, we wanted to address several performance bottlenecks and reduce installation friction by introducing an OS-agnostic, python-native implementation of DockQ, leveraging the spectacular biotite package.
BiotiteDockQ: efficient implementation of evaluation metrics#
Biotite offers an extensive toolbox for working with sequence and structure data – searching and fetching from biological databases, reading and writing popular file formats, analyzing, editing and visualizing, as well as interfacing external applications for further analysis.
The focus on performance and intuitive usability is ultimately what makes it my go-to tool for working with structural data, especially when writing pinder
.
Internally, the majority of data is stored as NumPy ndarray objects. In my opinion, the AtomArray
and AtomArrayStack
provide some of the most powerful constructs in the biotite.structure
package.
The object hierarchy follows AtomArrayStack
-> AtomArray
-> Atom
, where each parent contains an array of the child.
Collections of euclidean coordinates are a natural match for numpy arrays. All of the other fields typically found in a PDB file format are effectively metadata that can be stored as annotations. The primitive base Atom
object leverages these principles to enable an efficient storage and access mechanism: annotations are stored as scalar values and coordinates as length 3 ndarrays. When accessing slices of the data, there is no need for custom, un-optimized python implementations, as each of the fields can be accessed as numpy-native data structures.
This property becomes especially powerful when you consider the AtomArrayStack
. An AtomArrayStack
stores data for m
models, where each model contains the same atoms at different positions. Hence, the annotation arrays are represented as ndarray
objects of length n
like the AtomArray
, while the coordinates are a (m x n x 3) ndarray
.
If you imagine a collection of decoys from a protein-protein docking method, you typically have a sequence of PDB files containing the same two proteins with one or more proteins in alternative conformations. As such, the only thing that is different about the structural models is the coordinate data – a natural fit for the AtomArrayStack
. Rather than storing separate copies of annotations for each model in memory, you store just a single set of annotation arrays in the same index order as the atoms in the collection of AtomArray
’s in the stack.
These properties combine to result in not only an intuitive means for accessing data through natural array indexing mechanisms, but also a major performance boost due to the ability to apply vectorized numpy methods.
With these powerful constructs in place, the real bottleneck that remains is, perhaps unsurprisingly, file I/O – each read of a PDB file requires streaming large amounts of text data. Thankfully, even these woes can be addressed in a number of ways:
Transitioning to the binary
MMTF
format, which features a smaller file size and a highly increased I/O operation performance, than text-based file formats.Adopting
fastpdb
, a high performance drop-in replacement for Biotite’s PDBFile written in Rust.Exploiting structural/geometric properties or representations of docking algorithms when possible.
I’d like to give a practical illustration of the third optimization to give a flavor for why the biotite package can be so powerful.
It is common for protein-protein docking algorithms (especially rigid-body methods based on Fast-Fourier-Transforms) to internally represent the predictions as a set of mathematical operators. The basic premise is to represent the mobile, ligand-protein in a given model as a vector that encodes the translation and rotation.
Although the specific encoding implementations may differ (euler angles, rotation matrices, quaternions, etc.), one common thread is the ability to efficiently store the model as a single vector which can be applied as an operation to the starting coordinates, reducing the need to store separate copies of the coordinates in memory or PDB files. Sadly, it is often necessary to write the final models as PDB files for use in downstream analyses or clustering schemes.
Below I illustrate how we can write a simple method for generating a “stack” of poses from a dataframe containing translation vectors and quaternions and a single atom array for the ligand protein:
import numpy as np
import pandas as pd
from biotite.structure.atoms import AtomArray, AtomArrayStack
from numpy import ndarray
from scipy.spatial.transform import Rotation as R
def generate_pose_stack(ligand: AtomArray, operations: pd.DataFrame) -> AtomArrayStack:
"""Generate AtomArrayStack of ligand poses from 7-component vectors.
Parameters
----------
ligand : AtomArray
Ligand protein biotite atom array centered at origin.
operations : pd.DataFrame
Dataframe containing translation vectors and rotation quaternions
to transform ligand protein centered at origin to a pose.
The dataframe must containg the following columns:
`[tx, ty, tz, qw, qx, qy, qz]`
Returns
-------
AtomArrayStack
AtomArrayStack of shape `(operations.shape[0], ligand.shape[0], 3)`.
"""
translations = operations[["tx", "ty", "tz"]].to_numpy()
# quaternions in SciPy scalar last format (x, y, z, w)
rotations = operations[["qx", "qy", "qz", "qw"]].to_numpy()
# M poses
M = translations.shape[0]
# Inititalize an empty stack with shape (m x n x 3)
stack = AtomArrayStack(M, ligand.shape[0])
# copy annotations from ligand
for annot in ligand.get_annotation_categories():
# Notice we are not repeating this for M models
stack.set_annotation(annot, np.copy(ligand.get_annotation(annot)))
# generate one pose at a time
for i in range(M):
q = R.from_quat(rotations[i, ...])
stack.coord[i, ...] = q.apply(ligand.coord) + translations[i, ...]
return stack
Now to try it out in practice!
Below, I will generate 20 random models of a ligand protein where each model follows a rotation about the z-axis using some quaternion math.
First, let’s write a helper function to convert a rotation axis and angle into a quaternion:
import math
def quaternion_from_axis_angle(axis: ndarray, angle: float) -> ndarray:
"""
Convert an axis-angle representation to a quaternion.
Parameters:
- axis: The axis of rotation as a 3D vector. Should be normalized.
- angle: The angle of rotation in radians.
Returns:
- ndarray: A quaternion in SciPy scalar last format (x, y, z, w).
"""
w = math.cos(angle / 2.0)
x = axis[0] * math.sin(angle / 2.0)
y = axis[1] * math.sin(angle / 2.0)
z = axis[2] * math.sin(angle / 2.0)
return np.array([x, y, z, w])
We first seed the rotation about the z-axis with a random quaternion:
# Random quaternion
q = R.random().as_quat()
# Axis to introduce rotations about (z-axis)
axis = np.array([0.0, 0.0, 1.0])
q
array([ 0.17082487, -0.53194653, 0.73552107, 0.38322383])
Now we will introduce 20 rotation angles about the axis to describe the “trajectory”:
num_angles = 20
step_size = (2 * math.pi) / (num_angles - 1)
thetas = np.linspace(
step_size, 2*math.pi,
num=num_angles,
endpoint=False
)
quaternions = []
for angle in thetas:
axis_rot = quaternion_from_axis_angle(axis, angle)
# Combine with base quaternion
rot = (q * axis_rot)
quaternions.append(rot)
Now that we have the rotations defined, lets encode them into a dataframe and add a random translation from the starting origin:
operations = pd.DataFrame([
{'qx': q[0], 'qy': q[1], 'qz': q[2], 'qw': q[3], 'frame': i}
for i, q in enumerate(quaternions)
])
operations[['tx', 'ty', 'tz']] = [0.5, -0.5, 0.5]
operations
qx | qy | qz | qw | frame | tx | ty | tz | |
---|---|---|---|---|---|---|---|---|
0 | 0.0 | -0.0 | 0.121063 | 0.377997 | 0 | 0.5 | -0.5 | 0.5 |
1 | 0.0 | -0.0 | 0.227289 | 0.364468 | 1 | 0.5 | -0.5 | 0.5 |
2 | 0.0 | -0.0 | 0.328490 | 0.342882 | 2 | 0.5 | -0.5 | 0.5 |
3 | 0.0 | -0.0 | 0.422431 | 0.313717 | 3 | 0.5 | -0.5 | 0.5 |
4 | 0.0 | -0.0 | 0.507034 | 0.277617 | 4 | 0.5 | -0.5 | 0.5 |
5 | 0.0 | -0.0 | 0.580429 | 0.235381 | 5 | 0.5 | -0.5 | 0.5 |
6 | 0.0 | -0.0 | 0.640995 | 0.187942 | 6 | 0.5 | -0.5 | 0.5 |
7 | 0.0 | -0.0 | 0.687392 | 0.136349 | 7 | 0.5 | -0.5 | 0.5 |
8 | 0.0 | -0.0 | 0.718594 | 0.081742 | 8 | 0.5 | -0.5 | 0.5 |
9 | 0.0 | -0.0 | 0.733913 | 0.025327 | 9 | 0.5 | -0.5 | 0.5 |
10 | 0.0 | -0.0 | 0.733009 | -0.031646 | 10 | 0.5 | -0.5 | 0.5 |
11 | 0.0 | -0.0 | 0.715902 | -0.087921 | 11 | 0.5 | -0.5 | 0.5 |
12 | 0.0 | -0.0 | 0.682971 | -0.142252 | 12 | 0.5 | -0.5 | 0.5 |
13 | 0.0 | -0.0 | 0.634943 | -0.193438 | 13 | 0.5 | -0.5 | 0.5 |
14 | 0.0 | -0.0 | 0.572881 | -0.240349 | 14 | 0.5 | -0.5 | 0.5 |
15 | 0.0 | -0.0 | 0.498155 | -0.281947 | 15 | 0.5 | -0.5 | 0.5 |
16 | 0.0 | -0.0 | 0.412418 | -0.317313 | 16 | 0.5 | -0.5 | 0.5 |
17 | 0.0 | -0.0 | 0.317564 | -0.345665 | 17 | 0.5 | -0.5 | 0.5 |
18 | 0.0 | -0.0 | 0.215692 | -0.366376 | 18 | 0.5 | -0.5 | 0.5 |
19 | 0.0 | -0.0 | 0.109051 | -0.378988 | 19 | 0.5 | -0.5 | 0.5 |
The operations dataframe can now be used in our generate_pose_stack
function from before:
# Load an example PDB file into an AtomArray
from pinder.core.structure.atoms import atom_array_from_pdb_file
ligand = atom_array_from_pdb_file("af2_L.pdb")
stack = generate_pose_stack(ligand, operations)
stack.shape
(20, 7994)
While this is optional, I want to showcase some interesting applications of the plotly visualization package in this context.
Below, I show how we can convert this AtomArrayStack
into a dataframe that describes the trajectory, and adds additional metadata about the atoms – their solvent accessible surface-area and Van Der Waals radii.
import plotly
plotly.offline.init_notebook_mode()
import plotly.express as px
import biotite.structure as struc
stack_df = []
for i, model in enumerate(stack):
# Lets color the atoms by their solvent accessibility and store the atomic VdW radii
stack_df.extend([
{
'x': coord[0], 'y': coord[1], 'z': coord[2],
'sasa': sasa,
'vdw': struc.info.vdw_radius_protor(res_name, atom_name),
'frame': i
}
for coord, res_name, atom_name, sasa in zip(model.coord, model.res_name, model.atom_name, struc.sasa(model))
])
stack_df = pd.DataFrame(stack_df)
stack_df
x | y | z | sasa | vdw | frame | |
---|---|---|---|---|---|---|
0 | -7.422350 | -50.109081 | 58.511002 | 21.949207 | 1.64 | 0 |
1 | -7.740695 | -49.272335 | 57.354000 | 2.027911 | 1.88 | 0 |
2 | -9.211197 | -48.843922 | 57.383999 | 0.000000 | 1.61 | 0 |
3 | -9.552464 | -47.732361 | 56.971001 | 0.000000 | 1.42 | 0 |
4 | -7.406618 | -50.002026 | 56.037998 | 1.351941 | 1.88 | 0 |
... | ... | ... | ... | ... | ... | ... |
159875 | -8.957576 | -31.997358 | 35.235001 | 2.163199 | 1.61 | 19 |
159876 | -8.709561 | -32.618103 | 34.174000 | 31.378901 | 1.42 | 19 |
159877 | -11.373942 | -32.478718 | 35.494999 | 32.852154 | 1.88 | 19 |
159878 | -11.319888 | -33.269363 | 34.324001 | 26.210915 | 1.46 | 19 |
159879 | -8.242172 | -31.931511 | 36.250999 | 45.843403 | 1.46 | 19 |
159880 rows × 6 columns
Visualize the trajectory!
stack_df.loc[:, "vdw_scale"] = stack_df.vdw - 1
fig = px.scatter_3d(
stack_df,
x="x", y="y", z="z", color="sasa",
animation_frame="frame",
template="plotly_dark",
size="vdw_scale",
range_x=[stack_df.x.min(), stack_df.x.max()],
range_y=[stack_df.y.min(), stack_df.y.max()],
range_z=[stack_df.z.min(), stack_df.z.max()],
color_continuous_scale="tropic",
opacity=1.0,
)
fig.update_traces(mode='markers', marker_line_width=0.1, marker_line_color="#171717")
# Hacks to improve animation player
btn1, btn2 = fig.layout.updatemenus[0].buttons
btn1.args[1]["visible"] = False
btn1.args[1]["frame"]["duration"] = 220 # buttons
btn2.args[1]["frame"]["duration"] = 220
btn2.args[1]["transition"]["duration"] = 0
btn2.args[1]["transition"]["duration"] = 0
fig.layout.updatemenus[0].buttons = btn1, btn2
fig.layout.sliders[0].steps[0].args[1]["frame"]["duration"] = 0 # slider
fig.update_scenes(aspectratio=dict(x=1, y=1, z=1))
fig.show()