MLSB PINDER Challenge#

The goal of this tutorial is to outline some basic rules for participating in the PINDER track of the MLSB challenge and provide simple hands-on examples for how participants can access and use the pinder dataset.

Specifically, we will cover:

Rules for model training
Rules for valid inference submissions
Accessing and loading data for training your model
A description of the inputs to be provided in the evaluation set

Rules for valid model training #

Participants MUST use the sequences and SMILES in the provided train and validation sets from PINDER or PLINDER. In order to ensure no leakage, external data augmentation is not allowed.
If starting structures/conformations need to be generated for the model, this may only be done from the training and validation sequences and SMILES. Note that this applies only to train and validation: no external folding methods or starting structures are allowed for the test set under any circumstances. Only the predicted structures/conformers themselves may be used in this way; the embeddings or models used to generate such predictions may not. For example, it is not valid to “distill” a method that was not trained on PLINDER/PINDER.
The PINDER and PLINDER datasets should be used independently; combining the sets is considered augmentation and is not allowed.
For inference, only the inputs provided in the evaluation sets may be used: canonical sequences, structures and MSAs; no alternate templates or sequences are permitted. The inputs that will be used by assessors for each challenge track are as follows:
- PLINDER: (SMILES, monomer protein structure, monomer FASTA, monomer MSA)
- PINDER: (monomer protein structure 1, monomer protein structure 2, FASTA 1, FASTA 2, MSA 1, MSA 2)
Model selection must be performed exclusively on the validation set designed for this purpose within the PINDER and PLINDER datasets.
Methods relying on any model derivatives or embeddings trained on structures outside the PINDER/PLINDER training set are not permitted (e.g., ESM2, MSA: ✅; ESM3/ESMFold/SAProt/UniMol: ❌).
For instruction on how to load training and validation data, check the links below:
- PLINDER
- PINDER

Rules for valid inference submissions #

The submission system will use Hugging Face Spaces. To qualify for submission, each team must:

Provide an MLSB submission ID or a link to a preprint/paper describing their methodology. This publication does not have to specifically report training or evaluation on the P(L)INDER dataset. Previously published methods, such as DiffDock, only need to link their existing paper. Note that entry into this competition does not equate to an MLSB workshop paper submission.
Create a copy of the provided inference template.
- Go to the top right corner of the page and click on the drop-down menu (vertical ellipsis) right next to the “Community”, then select “Duplicate this space”.
Change the files in the newly created space to reflect the specifics of your model:
- Edit requirements.txt to capture all dependencies.
- Include an inference_app.py file. This contains a predict function that should be modified to reflect the specifics of inference using your model.
- Include a train.py file to ensure that training and model selection use only the PINDER/PLINDER datasets and to clearly show any additional hyperparameters used.
- Provide a LICENSE file that allows for reuse, derivative works, and distribution of the provided software and weights (e.g., MIT or Apache2 license).
- Modify the Dockerfile as appropriate (including selecting the right base image)
Submit to the leaderboard via the designated form.
- On the submission page, add a reference to the newly created space in the format username/space (e.g., mlsb/alphafold3).
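
The username/space reference format described above can be sanity-checked before submitting. The following is only a minimal sketch: the helper name `is_valid_space_ref` and the allowed character set are assumptions made for illustration, not rules published by the challenge organizers or by Hugging Face.

```python
import re

# Assumption: both parts consist of letters, digits, '.', '_' and '-'.
_SPACE_REF = re.compile(r"^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$")


def is_valid_space_ref(ref: str) -> bool:
    """Return True if `ref` looks like `username/space` with exactly one slash."""
    return bool(_SPACE_REF.match(ref.strip()))


# The short form from the example above passes; a full URL does not.
print(is_valid_space_ref("mlsb/alphafold3"))
print(is_valid_space_ref("https://huggingface.co/spaces/mlsb/alphafold3"))
```

A check like this only catches formatting slips such as pasting the full URL; it does not confirm that the space exists.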

Evaluation dataset #

Although the exact composition of the eval set will be shared at a future date, below we provide an overview of the dataset and what to expect.

Two leaderboards, one for each of PINDER and PLINDER, will be created using a single evaluation set for each.
Evaluation sets will be subsets of 150-200 structures from the current PINDER and PLINDER test splits (subsets to enable reasonable eval runtime).
Each evaluation sample will contain a predefined input/output to ensure performance assessment is model-dependent, not input-dependent.
The focus will be exclusively on flexible docking/co-folding, with a single canonical structure per protein, sampled from apo and predicted structures.
Monomer input structures will be sampled from paired structures available in PINDER/PLINDER, balanced between apo and predicted structures and stratified by “flexibility” level according to specified conformational difference thresholds.
Inputs will be: (monomer protein structure 1, monomer protein structure 2, FASTA 1, FASTA 2) for PINDER
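
To make the expected inputs concrete, one evaluation sample could be represented in an inference pipeline roughly as follows. This is a sketch under stated assumptions: the class name `PinderEvalInput`, its field names, and the example file names are hypothetical and are not part of the pinder package or the submission template. The MSA fields are optional here because the rules section lists MSAs among the PINDER inputs while the list above does not.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass(frozen=True)
class PinderEvalInput:
    """Hypothetical container for the inputs of one PINDER evaluation sample."""

    structure_1: Path  # monomer protein structure 1 (PDB file)
    structure_2: Path  # monomer protein structure 2 (PDB file)
    fasta_1: Path      # FASTA for monomer 1
    fasta_2: Path      # FASTA for monomer 2
    msa_1: Optional[Path] = None  # MSA for monomer 1, if provided
    msa_2: Optional[Path] = None  # MSA for monomer 2, if provided

    def missing_files(self) -> List[Path]:
        """Return every declared input path that does not exist on disk."""
        declared = [
            self.structure_1, self.structure_2,
            self.fasta_1, self.fasta_2,
            self.msa_1, self.msa_2,
        ]
        return [p for p in declared if p is not None and not p.is_file()]


# Placeholder file names, for illustration only.
sample = PinderEvalInput(
    structure_1=Path("monomer_1.pdb"),
    structure_2=Path("monomer_2.pdb"),
    fasta_1=Path("monomer_1.fasta"),
    fasta_2=Path("monomer_2.fasta"),
)
print(sample.missing_files())
```

Checking for missing files up front makes it easier to fail early with a clear message instead of partway through inference.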

Accessing and loading data for training #

To access the train and val splits for PINDER, please refer to the pinder documentation.

Once you have downloaded the pinder dataset, either via the pinder package or directly through gsutil, you will have all of the necessary files for training.

For those mainly interested in torch dataloaders, refer to the readme section and tutorial on the torch dataloader provided in pinder. TLDR:

from pinder.core.loader.dataset import PinderDataset, get_torch_loader
train, val = [get_torch_loader(PinderDataset(split=split)) for split in ["train", "val"]]  # do NOT use test; we will verify that your pipeline uses test in neither training nor model selection

For those interested in loading/filtering/sampling/augmenting data using pinder utilities, see remaining sections below.

You are ONLY allowed to access those systems labeled with split train and split val for model training and validation, respectively.

See below for two different options for accessing the index and split labels.

If you have already installed pinder (preferred method):

import torch
from pinder.core import get_index

index = get_index()
train = index.query('split == "train"').reset_index(drop=True)
val = index.query('split == "val"').reset_index(drop=True)
train.shape, val.shape

((1560682, 34), (1958, 34))

Without installing pinder (you need to install gcsfs and pandas, or install the gsutil utility, to get the index file):

import gcsfs
import pandas as pd

index_uri = "gs://pinder/2024-02/index.parquet"
fs = gcsfs.GCSFileSystem(token="anon")
with fs.open(index_uri, "rb") as f:
    index = pd.read_parquet(f)

train = index.query('split == "train"').reset_index(drop=True)
val = index.query('split == "val"').reset_index(drop=True)
train.shape, val.shape

((1560682, 34), (1958, 34))
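
Since only the train and val splits may be used, it can help to add an explicit guard to a training pipeline. The sketch below uses a toy ID-to-split mapping and a hypothetical helper named `find_disallowed_ids`; neither is part of pinder. With the real index, the mapping could be built with `dict(zip(index["id"], index["split"]))`.

```python
from typing import Dict, Iterable, List

ALLOWED_SPLITS = {"train", "val"}


def find_disallowed_ids(
    system_ids: Iterable[str], id_to_split: Dict[str, str]
) -> List[str]:
    """Return the IDs whose split label is missing or is not train/val."""
    return [sid for sid in system_ids if id_to_split.get(sid) not in ALLOWED_SPLITS]


# Toy mapping with made-up IDs, for illustration only.
id_to_split = {"sys_a": "train", "sys_b": "val", "sys_c": "test"}
offenders = find_disallowed_ids(["sys_a", "sys_c", "sys_d"], id_to_split)
print(offenders)  # ['sys_c', 'sys_d']
```

Treating unknown IDs as disallowed is a deliberately strict choice: anything that cannot be traced back to the train or val split is flagged.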

Minimal example with filepaths #

For those who simply want access to PDB files and/or sequences, below we provide a minimal example of how to go from a row in the pinder index to a tuple of filepaths and sequences akin to the expected inputs for inference.

Later sections provide alternative means of loading data with common pinder utilities, including the PinderLoader and the PinderDataset torch dataset.

All pinder data should be stored in PINDER_BASE_DIR. Unless you customized the download directory, this would default to: ~/.local/share/pinder/2024-02/

PDB files are stored in a subdirectory, named pdbs.
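
The directory layout described above can be sketched with pathlib alone. The constant `DEFAULT_PINDER_DIR` and the helper `pdb_path` are hypothetical names introduced for illustration; the default location is the one stated above, and the file name comes from this tutorial's example outputs.

```python
from pathlib import Path

# Default dataset location stated above; adjust if you customized the download directory.
DEFAULT_PINDER_DIR = Path.home() / ".local" / "share" / "pinder" / "2024-02"


def pdb_path(pdb_name: str, pinder_dir: Path = DEFAULT_PINDER_DIR) -> Path:
    """Build the expected location of a PDB file inside the `pdbs` subdirectory."""
    return pinder_dir / "pdbs" / pdb_name


path = pdb_path("af__A8AAA0.pdb")
# is_file() is False until the dataset has been downloaded to this location.
print(path, path.is_file())
```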

To go from a row in the index to a collection of filepaths, you can either use the pydantic model for the pinder index schema (IndexEntry) or construct the filepaths yourself.

We will first illustrate how to do this via IndexEntry:

from pinder.core import get_pinder_location
from pinder.core.index.utils import IndexEntry

pinder_dir = get_pinder_location()
row = train.sample(1).squeeze()
entry = IndexEntry(**row.to_dict())
# IndexEntry has a convenience property `pdb_paths` which returns a dict of structure_type: relative_path | list[relative_path]
relative_paths = entry.pdb_paths
absolute_paths = {}
for structure_type, rel_path in relative_paths.items():
    # non-canonical apo monomers are stored as a list of relative paths
    if isinstance(rel_path, list):
        absolute_paths[structure_type] = []
        for alt_monomer in rel_path:
            absolute_paths[structure_type].append(pinder_dir / alt_monomer) 
    # Not all systems have every type of monomer. When they are not available, the relative path is ""
    elif rel_path == "":
        absolute_paths[structure_type] = rel_path
    # Convert relative path to absolute path 
    else:
        absolute_paths[structure_type] = pinder_dir / rel_path

relative_paths, absolute_paths

({'native': 'pdbs/3j1r__O1_A8AAA0--3j1r__P1_A8AAA0.pdb',
  'holo_R': 'pdbs/3j1r__O1_A8AAA0-R.pdb',
  'holo_L': 'pdbs/3j1r__P1_A8AAA0-L.pdb',
  'predicted_R': 'pdbs/af__A8AAA0.pdb',
  'predicted_L': 'pdbs/af__A8AAA0.pdb',
  'apo_R': '',
  'apo_L': '',
  'apo_R_alt': [],
  'apo_L_alt': []},
 {'native': PosixPath('/home/runner/.local/share/pinder/2024-02/pdbs/3j1r__O1_A8AAA0--3j1r__P1_A8AAA0.pdb'),
  'holo_R': PosixPath('/home/runner/.local/share/pinder/2024-02/pdbs/3j1r__O1_A8AAA0-R.pdb'),
  'holo_L': PosixPath('/home/runner/.local/share/pinder/2024-02/pdbs/3j1r__P1_A8AAA0-L.pdb'),
  'predicted_R': PosixPath('/home/runner/.local/share/pinder/2024-02/pdbs/af__A8AAA0.pdb'),
  'predicted_L': PosixPath('/home/runner/.local/share/pinder/2024-02/pdbs/af__A8AAA0.pdb'),
  'apo_R': '',
  'apo_L': '',
  'apo_R_alt': [],
  'apo_L_alt': []})

In the above example, the IndexEntry.pdb_paths property was used to conveniently extract filepaths from a row in the index. This is done by using the following columns from the index:

id
holo_R_pdb
holo_L_pdb
predicted_R_pdb
predicted_L_pdb
apo_R_pdbs
apo_L_pdbs

It is possible to do this yourself without IndexEntry:

row = train.sample(1).squeeze()
absolute_paths = {
    "native": pinder_dir / "pdbs" / f"{row.id}.pdb",
}
pdb_cols = [
    "holo_R_pdb", "holo_L_pdb", # holo monomers for receptor and ligand, respectively
    "predicted_R_pdb", "predicted_L_pdb", # predicted monomers
    "apo_R_pdb", "apo_L_pdb", # canonical apo monomers
    "apo_R_pdbs", "apo_L_pdbs", # canonical + non-canonical (alternative) apo monomers, separated by a semi-colon
]
for pdb_column in pdb_cols:
    if pdb_column.endswith("pdbs"):
        absolute_paths[pdb_column] = [
            pinder_dir / "pdbs" / alt_apo if alt_apo != "" else "" 
            for alt_apo in row[pdb_column].split(";")
        ]
    else:
        pdb_name = row[pdb_column]
        absolute_paths[pdb_column] = pinder_dir / "pdbs" / pdb_name if pdb_name != "" else ""

absolute_paths

{'native': PosixPath('/home/runner/.local/share/pinder/2024-02/pdbs/4qvv__AA1_P30657--4qvv__S1_P40302.pdb'),
 'holo_R_pdb': PosixPath('/home/runner/.local/share/pinder/2024-02/pdbs/4qvv__AA1_P30657-R.pdb'),
 'holo_L_pdb': PosixPath('/home/runner/.local/share/pinder/2024-02/pdbs/4qvv__S1_P40302-L.pdb'),
 'predicted_R_pdb': PosixPath('/home/runner/.local/share/pinder/2024-02/pdbs/af__P30657.pdb'),
 'predicted_L_pdb': PosixPath('/home/runner/.local/share/pinder/2024-02/pdbs/af__P40302.pdb'),
 'apo_R_pdb': '',
 'apo_L_pdb': '',
 'apo_R_pdbs': [''],
 'apo_L_pdbs': ['']}
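
Note that splitting an empty string on ";" yields `['']`, as in the output above, whereas `IndexEntry.pdb_paths` returned an empty list for the same case. A small helper can normalize this. The name `split_alt_apo` and the file names below are made up for illustration and are not part of pinder.

```python
from pathlib import Path
from typing import List


def split_alt_apo(value: str, pdb_dir: Path) -> List[Path]:
    """Split a semicolon-separated list of PDB file names, dropping empty entries."""
    return [pdb_dir / name for name in value.split(";") if name]


pdb_dir = Path("pdbs")  # placeholder; in practice use pinder_dir / "pdbs"
print(split_alt_apo("", pdb_dir))  # empty column value gives an empty list
print(split_alt_apo("first_apo.pdb;second_apo.pdb", pdb_dir))
```

Returning an empty list means downstream code can iterate over the result without special-casing systems that lack apo monomers.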

The most minimal interface for loading these filepaths and extracting e.g. coordinates and sequence would be via the pinder.core.loader.structure module:

from pinder.core.loader.structure import Structure
# Note: since this notebook is executed in CI, I will also create a `PinderSystem` object which will auto-download any missing PDB file
# You do NOT need to do this if you already downloaded the dataset
if not absolute_paths["holo_R_pdb"].is_file():
    from pinder.core import PinderSystem
    _ = PinderSystem(absolute_paths["native"].stem)
    
receptor = Structure(absolute_paths["holo_R_pdb"])
receptor

2024-11-15 12:14:29,840 | pinder.core.utils.cloud:375 | INFO : Gsutil process_many=download_to_filename, threads=4, items=7

Structure(
    filepath=/home/runner/.local/share/pinder/2024-02/pdbs/4qvv__AA1_P30657-R.pdb,
    uniprot_map=None,
    pinder_id='4qvv__AA1_P30657-R',
    atom_array=<class 'biotite.structure.AtomArray'> with shape (1824,),
    pdb_engine='fastpdb',
)

receptor.coords[0:10]

array([[  42.843, -166.299,   40.963],
       [  43.   , -164.91 ,   41.486],
       [  42.193, -164.727,   42.763],
       [  41.016, -165.075,   42.816],
       [  42.542, -163.854,   40.456],
       [  43.347, -163.95 ,   39.272],
       [  42.653, -162.434,   41.033],
       [  42.837, -164.19 ,   43.794],
       [  42.152, -163.84 ,   45.031],
       [  42.728, -162.539,   45.568]], dtype=float32)

receptor.sequence

'TQQPIVTGTSVISMKYDNGVIIAADNLGSYGSLLRFNGVERLIPVGDNTVVGISGDISDMQHIERLLKDLVTENAYDNPLADAEEALEPSYIFEYLATVMYQRRSKMNPLWNAIIVAGVQSNGDQFLRYVNLLGVTYSSPTLATGFGAHMANPLLRKVVDRESDIPKTTVQVAEEAIVNAMRVLYYRDARSSRNFSLAIIDKNTGLTFKKNLQVENMKWDFAKDIKGYGTQKI'

receptor.fasta

'>4qvv__AA1_P30657-R\nTQQPIVTGTSVISMKYDNGVIIAADNLGSYGSLLRFNGVERLIPVGDNTVVGISGDISDMQHIERLLKDLVTENAYDNPLADAEEALEPSYIFEYLATVMYQRRSKMNPLWNAIIVAGVQSNGDQFLRYVNLLGVTYSSPTLATGFGAHMANPLLRKVVDRESDIPKTTVQVAEEAIVNAMRVLYYRDARSSRNFSLAIIDKNTGLTFKKNLQVENMKWDFAKDIKGYGTQKI'
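
The FASTA string above is a single record whose header is the monomer ID. If you need the pieces separately, a minimal parser could look like the sketch below. The helper names are hypothetical, and the ID pattern (PDB ID, chain, UniProt ID, then an R or L suffix) is inferred from the examples in this tutorial rather than taken from a formal specification.

```python
from typing import NamedTuple, Tuple


class MonomerId(NamedTuple):
    pdb_id: str
    chain: str
    uniprot: str
    side: str  # "R" (receptor) or "L" (ligand)


def parse_fasta_record(record: str) -> Tuple[str, str]:
    """Split a single-record FASTA string into (header, sequence)."""
    header, *seq_lines = record.strip().splitlines()
    if not header.startswith(">"):
        raise ValueError("FASTA record must start with '>'")
    return header[1:], "".join(seq_lines)


def parse_holo_monomer_id(monomer_id: str) -> MonomerId:
    """Parse IDs shaped like '4qvv__AA1_P30657-R' (pattern assumed, see above)."""
    body, side = monomer_id.rsplit("-", 1)
    pdb_id, chain_and_uniprot = body.split("__", 1)
    chain, uniprot = chain_and_uniprot.split("_", 1)
    return MonomerId(pdb_id, chain, uniprot, side)


# Sequence truncated for brevity.
header, sequence = parse_fasta_record(">4qvv__AA1_P30657-R\nTQQPIVTGTSV")
print(parse_holo_monomer_id(header), len(sequence))
```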

# You can also write the Structure object to a PDB file if desired (e.g. after making changes)
from pathlib import Path
from tempfile import TemporaryDirectory
with TemporaryDirectory() as tmp_dir:
    temp_dir = Path(tmp_dir)
    receptor.to_pdb(temp_dir / "modified_receptor.pdb")

Using pinder utilities to construct a dataloader #

Before proceeding with this section, you may find it helpful to review the existing tutorials available in pinder.

Specifically, the tutorials covering:

We will start by looking at the most basic way to load items from the training and validation set: via PinderSystem objects

from pinder.core import PinderSystem

def get_system(system_id: str) -> PinderSystem:
    return PinderSystem(system_id)

system = get_system(train.id.iloc[0])
system

PinderSystem(
entry = IndexEntry(
    (
        'split',
        'train',
    ),
    (
        'id',
        '8phr__X4_UNDEFINED--8phr__W4_UNDEFINED',
    ),
    (
        'pdb_id',
        '8phr',
    ),
    (
        'cluster_id',
        'cluster_24559_24559',
    ),
    (
        'cluster_id_R',
        'cluster_24559',
    ),
    (
        'cluster_id_L',
        'cluster_24559',
    ),
    (
        'pinder_s',
        False,
    ),
    (
        'pinder_xl',
        False,
    ),
    (
        'pinder_af2',
        False,
    ),
    (
        'uniprot_R',
        'UNDEFINED',
    ),
    (
        'uniprot_L',
        'UNDEFINED',
    ),
    (
        'holo_R_pdb',
        '8phr__X4_UNDEFINED-R.pdb',
    ),
    (
        'holo_L_pdb',
        '8phr__W4_UNDEFINED-L.pdb',
    ),
    (
        'predicted_R_pdb',
        '',
    ),
    (
        'predicted_L_pdb',
        '',
    ),
    (
        'apo_R_pdb',
        '',
    ),
    (
        'apo_L_pdb',
        '',
    ),
    (
        'apo_R_pdbs',
        '',
    ),
    (
        'apo_L_pdbs',
        '',
    ),
    (
        'holo_R',
        True,
    ),
    (
        'holo_L',
        True,
    ),
    (
        'predicted_R',
        False,
    ),
    (
        'predicted_L',
        False,
    ),
    (
        'apo_R',
        False,
    ),
    (
        'apo_L',
        False,
    ),
    (
        'apo_R_quality',
        '',
    ),
    (
        'apo_L_quality',
        '',
    ),
    (
        'chain1_neff',
        10.78125,
    ),
    (
        'chain2_neff',
        11.1171875,
    ),
    (
        'chain_R',
        'X4',
    ),
    (
        'chain_L',
        'W4',
    ),
    (
        'contains_antibody',
        False,
    ),
    (
        'contains_antigen',
        False,
    ),
    (
        'contains_enzyme',
        False,
    ),
)
native=Structure(
    filepath=/home/runner/.local/share/pinder/2024-02/pdbs/8phr__X4_UNDEFINED--8phr__W4_UNDEFINED.pdb,
    uniprot_map=None,
    pinder_id='8phr__X4_UNDEFINED--8phr__W4_UNDEFINED',
    atom_array=<class 'biotite.structure.AtomArray'> with shape (2556,),
    pdb_engine='fastpdb',
)
holo_receptor=Structure(
    filepath=/home/runner/.local/share/pinder/2024-02/pdbs/8phr__X4_UNDEFINED-R.pdb,
    uniprot_map=/home/runner/.local/share/pinder/2024-02/mappings/8phr__X4_UNDEFINED-R.parquet,
    pinder_id='8phr__X4_UNDEFINED-R',
    atom_array=<class 'biotite.structure.AtomArray'> with shape (1358,),
    pdb_engine='fastpdb',
)
holo_ligand=Structure(
    filepath=/home/runner/.local/share/pinder/2024-02/pdbs/8phr__W4_UNDEFINED-L.pdb,
    uniprot_map=/home/runner/.local/share/pinder/2024-02/mappings/8phr__W4_UNDEFINED-L.parquet,
    pinder_id='8phr__W4_UNDEFINED-L',
    atom_array=<class 'biotite.structure.AtomArray'> with shape (1198,),
    pdb_engine='fastpdb',
)
apo_receptor=None
apo_ligand=None
pred_receptor=None
pred_ligand=None
)

You will notice that the printed PinderSystem object has the following properties:

native - the ground-truth dimer complex
holo_receptor - the receptor chain (monomer) from the ground-truth complex
holo_ligand - the ligand chain (monomer) from the ground-truth complex
apo_receptor - the canonical apo chain (monomer) paired to the receptor chain
apo_ligand - the canonical apo chain (monomer) paired to the ligand chain
pred_receptor - the AlphaFold2 predicted monomer paired to the receptor chain
pred_ligand - the AlphaFold2 predicted monomer paired to the ligand chain

These properties are pointers to Structure objects. The Structure object provides the most direct mode of access to structures and associated properties.

Note: not all systems have an apo and/or predicted structure for all chains of the ground-truth dimer complex!

As was the case in the example above, when the alternative monomers are not available, the property will have a value of None.

You can determine which systems have which alternative monomer pairings a priori by looking at the boolean columns in the index apo_R and apo_L for the apo receptor and ligand, and predicted_R and predicted_L for the predicted receptor and ligand, respectively.

For instance, we can load a different system that does have apo receptor and ligand as such:

apo_system = get_system(train.query('apo_R and apo_L').id.iloc[0])
receptor = apo_system.apo_receptor
ligand = apo_system.apo_ligand 

receptor, ligand

(Structure(
     filepath=/home/runner/.local/share/pinder/2024-02/pdbs/3wdb__A1_P9WPC9.pdb,
     uniprot_map=/home/runner/.local/share/pinder/2024-02/mappings/3wdb__A1_P9WPC9.parquet,
     pinder_id='3wdb__A1_P9WPC9',
     atom_array=<class 'biotite.structure.AtomArray'> with shape (1144,),
     pdb_engine='fastpdb',
 ),
 Structure(
     filepath=/home/runner/.local/share/pinder/2024-02/pdbs/6ucr__A1_P9WPC9.pdb,
     uniprot_map=/home/runner/.local/share/pinder/2024-02/mappings/6ucr__A1_P9WPC9.parquet,
     pinder_id='6ucr__A1_P9WPC9',
     atom_array=<class 'biotite.structure.AtomArray'> with shape (1193,),
     pdb_engine='fastpdb',
 ))

We can now access e.g. the sequence and the coordinates of the structures via the Structure objects:

receptor.sequence

'PLGSMFERFTDRARRVVVLAQEEARMLNHNYIGTEHILLGLIHEGEGVAAKSLESLGISLEGVRSQVEEIIGQGQQAPSGHIPFTPRAKKVLELSLREALQLGHNYIGTEHILLGLIREGEGVAAQVLVKLGAELTRVRQQVIQLLSGY'

receptor.coords[0:5]

array([[-12.982, -17.271, -11.271],
       [-14.36 , -17.069, -11.749],
       [-15.261, -16.373, -10.703],
       [-15.461, -15.161, -10.801],
       [-14.842, -18.494, -12.077]], dtype=float32)

We can always access the underlying biotite AtomArray via the Structure.atom_array property:

receptor.atom_array[0:5]

array([
	Atom(np.array([-12.982, -17.271, -11.271], dtype=float32), chain_id="R", res_id=2, ins_code="", res_name="PRO", hetero=False, atom_name="N", element="N", b_factor=0.0),
	Atom(np.array([-14.36 , -17.069, -11.749], dtype=float32), chain_id="R", res_id=2, ins_code="", res_name="PRO", hetero=False, atom_name="CA", element="C", b_factor=0.0),
	Atom(np.array([-15.261, -16.373, -10.703], dtype=float32), chain_id="R", res_id=2, ins_code="", res_name="PRO", hetero=False, atom_name="C", element="C", b_factor=0.0),
	Atom(np.array([-15.461, -15.161, -10.801], dtype=float32), chain_id="R", res_id=2, ins_code="", res_name="PRO", hetero=False, atom_name="O", element="O", b_factor=0.0),
	Atom(np.array([-14.842, -18.494, -12.077], dtype=float32), chain_id="R", res_id=2, ins_code="", res_name="PRO", hetero=False, atom_name="CB", element="C", b_factor=0.0)
])

For a more comprehensive overview of all of the Structure class properties, refer to the pinder system tutorial.

Using PinderLoader and PinderDataset to fetch, filter, transform systems #

While the PinderSystem object provides self-contained access to the structures associated with a dimer system, the PinderLoader provides a base abstraction for how to iterate over systems, apply optional filters and/or transforms, and return training and validation data represented as PinderSystem and Structure objects.

PinderDataset is an example implementation of a torch Dataset that can be consumed in a torch DataLoader. It uses the PinderLoader under the hood and additionally implements default transform and target_transform functions that convert the Structure objects returned by PinderLoader into dictionaries of structural properties encoded as tensors. The return value of PinderDataset.__getitem__ is an example of a dataset sample that is suitable for collating into DataLoader batches via the default collate_fn defined in pinder.core.loader.dataset.collate_batch.

This is covered in much greater detail in the pinder loader tutorial, but we will quickly showcase how both can be used to load data in an ML context.

from pinder.core import PinderLoader
from pinder.core.loader import filters

base_filters = [
    filters.FilterDetachedHolo(radius=12, max_components=2),
    filters.FilterByResidueCount(min_residue_count=10, max_residue_count=2000),
]
sub_filters = [
    filters.FilterSubByAtomTypes(min_atom_types=4),
    filters.FilterByHoloSeqIdentity(min_sequence_identity=0.8),
]
loader = PinderLoader(
    split="val",
    base_filters = base_filters,
    sub_filters = sub_filters
)

loader

PinderLoader(split=val, monomers=holo, systems=1958)

You can now access individual items in the loader or iterate over it.

The current default return value of PinderLoader.__getitem__ is a tuple consisting of (system, feature_complex, target_complex):

system: A PinderSystem instance corresponding to the item index
feature_complex: A Structure object containing a sampled receptor and ligand monomer superimposed to the ground-truth complex.
target_complex: A Structure object containing the ground-truth holo complex.

Note: the monomers in the feature_complex can consist of holo/apo/pred or a mix of them. You can control which monomer is selected via the monomer_priority argument.

Valid values are:

holo (default)
apo
pred
random (select a monomer at random from the set of monomer types available in both the receptor and ligand)
random_mixed (select a monomer at random from the set of monomer types available in the receptor and ligand, separately)
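
The difference between random and random_mixed can be illustrated with a small standalone sketch. This is not pinder's implementation: the function `sample_monomer_types` is hypothetical, and it uses the boolean index columns (`holo_R`, `apo_L`, `predicted_R`, and so on) as availability flags, so the type names follow the column prefixes rather than the `pred` spelling used by monomer_priority.

```python
import random
from typing import Dict, Tuple

MONOMER_TYPES = ("holo", "apo", "predicted")


def sample_monomer_types(
    flags: Dict[str, bool], mode: str, rng: random.Random
) -> Tuple[str, str]:
    """Pick (receptor_type, ligand_type) from availability flags such as 'apo_R'."""
    avail_r = [t for t in MONOMER_TYPES if flags.get(f"{t}_R", False)]
    avail_l = [t for t in MONOMER_TYPES if flags.get(f"{t}_L", False)]
    if mode == "random":
        # One type for both chains, drawn from the types both chains have.
        shared = [t for t in avail_r if t in avail_l]
        if not shared:
            raise ValueError("no monomer type is available for both chains")
        choice = rng.choice(shared)
        return choice, choice
    if mode == "random_mixed":
        # Each chain draws independently from its own available types.
        if not avail_r or not avail_l:
            raise ValueError("a chain has no available monomer type")
        return rng.choice(avail_r), rng.choice(avail_l)
    raise ValueError(f"unsupported mode: {mode}")


# Toy flags: the receptor has an apo structure, the ligand does not.
flags = {
    "holo_R": True, "holo_L": True,
    "apo_R": True, "apo_L": False,
    "predicted_R": True, "predicted_L": True,
}
rng = random.Random(0)
print(sample_monomer_types(flags, "random", rng))
print(sample_monomer_types(flags, "random_mixed", rng))
```

With these flags, random can only return matching holo or predicted pairs, while random_mixed may pair an apo receptor with a holo or predicted ligand.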

If you want to leverage the PinderLoader but mainly need the filepaths and/or sequences, you can get them from the returned Structure objects:

system, sample, target = loader[0]

receptor = target.filter("chain_id", ["R"])
ligand = target.filter("chain_id", ["L"])
# Can do things like e.g.
with open(f"./receptor_{receptor.pinder_id}.fasta", "w") as f:
    f.write(receptor.fasta)

2024-11-15 12:14:33,669 | pinder.core.utils.cloud:375 | INFO : Gsutil process_many=download_to_filename, threads=4, items=7

from tqdm import tqdm

loaded_systems = set()
limit = 10 # for faster exec in CI
for system, feature_complex, target_complex in tqdm(loader):
    loaded_systems.add(system.entry.id)
    if len(loaded_systems) >= limit:
        break
    
    

  0%|          | 9/1958 [00:15<57:27,  1.77s/it]

len(loaded_systems)

# PinderDataset - torch dataset

from pinder.core.loader import filters, transforms
from pinder.core.loader.dataset import PinderDataset

base_filters = [
    filters.FilterDetachedHolo(radius=12, max_components=2),
    filters.FilterByResidueCount(min_residue_count=10, max_residue_count=2000),
]
sub_filters = [
    filters.FilterSubByAtomTypes(min_atom_types=4),
    filters.FilterByHoloSeqIdentity(min_sequence_identity=0.8),
]

# We can include Structure-level transforms (and filters) which will operate on the target and/or feature complexes
target_transforms = [
    transforms.SelectAtomTypes(atom_types=["CA", "N", "C", "O"]),
]
# In addition to slicing only backbone atoms, we introduce random rotation to the ligand protein 
# in the feature complex while preserving the target (ground-truth) complex orientations.
feature_transforms = [
    transforms.SelectAtomTypes(atom_types=["CA", "N", "C", "O"]),
    transforms.RandomLigandTransform(max_translation=10.0),
]
train_dataset = PinderDataset(
    split="train", 
    # We can leverage holo, apo, pred, random and random_mixed monomer sampling strategies
    monomer_priority="random_mixed",
    base_filters = base_filters,
    sub_filters = sub_filters,
    structure_transforms_target=target_transforms,
    structure_transforms_feature=feature_transforms,
)
train_dataset

<pinder.core.loader.dataset.PinderDataset at 0x7f9a78cd3fd0>

You can now access individual items in the PinderDataset or iterate over it.

The current default return value of PinderDataset.__getitem__ is a dict consisting of the following key, value pairs:

target_complex: The ground-truth holo dimer, represented with a set of default properties encoded as Tensors
feature_complex: The sampled dimer complex, representing “features”, also represented with a set of default properties encoded as Tensors
id: The pinder ID for the selected system
target_id: The IDs of the receptor and ligand holo monomers, concatenated into a single ID string
sample_id: The IDs of the sampled receptor and ligand holo monomers, concatenated into a single ID string. This can be useful for debugging purposes or generally tracking which specific monomers are selected when targeting alternative monomers (more on this shortly)

The target_complex and feature_complex values are each dictionaries of structural properties, encoded by default by the pinder.core.loader.geodata.structure2tensor function:

atom_coordinates
atom_types
element_types
chain_ids
residue_coordinates
residue_types
residue_ids

You can choose to use a different representation by overriding the default values of transform and target_transform.

data_item = train_dataset[0]
data_item

{'target_complex': {'atom_types': tensor([[0.],
          [1.],
          [2.],
          ...,
          [1.],
          [2.],
          [3.]]),
  'element_types': tensor([[3.],
          [0.],
          [0.],
          ...,
          [0.],
          [0.],
          [2.]]),
  'residue_types': tensor([[16.],
          [16.],
          [16.],
          ...,
          [ 0.],
          [ 0.],
          [ 0.]]),
  'atom_coordinates': tensor([[131.7500, 429.3090, 163.5360],
          [132.6810, 428.2520, 163.1550],
          [133.5150, 428.6750, 161.9500],
          ...,
          [177.7620, 463.8650, 166.9020],
          [177.4130, 465.0800, 167.7550],
          [176.8000, 464.9490, 168.8150]]),
  'residue_coordinates': tensor([[132.6810, 428.2520, 163.1550],
          [133.5560, 429.2490, 159.5910],
          [133.8750, 432.8980, 160.6290],
          [136.1110, 431.8050, 163.5130],
          [138.3420, 429.8920, 161.0830],
          [138.5230, 432.9090, 158.7600],
          [139.4710, 435.1730, 161.6750],
          [142.1200, 432.6310, 162.6740],
          [143.5890, 432.7500, 159.1600],
          [143.6190, 436.5600, 159.1540],
          [145.2600, 436.8040, 162.5830],
          [147.8380, 434.1320, 161.7330],
          [148.7390, 435.8920, 158.4790],
          [149.0640, 439.2630, 160.2240],
          [151.2220, 437.7670, 162.9840],
          [153.4710, 435.8030, 160.6120],
          [153.9420, 438.9180, 158.4710],
          [156.1140, 440.2660, 161.3150],
          [158.3330, 437.1860, 161.4250],
          [161.8090, 436.4120, 160.1100],
          [161.1120, 432.8370, 158.9350],
          [157.4020, 432.0710, 159.3930],
          [156.5290, 428.3870, 159.0310],
          [154.0400, 428.1160, 156.1500],
          [153.3520, 424.6590, 154.7510],
          [152.8370, 424.8210, 150.9870],
          [150.9700, 422.4160, 148.7320],
          [153.5900, 422.6860, 145.9670],
          [157.3050, 423.3410, 145.6600],
          [158.4860, 426.6730, 144.2700],
          [160.7730, 427.0880, 141.2590],
          [163.9630, 429.0530, 141.9600],
          [166.4230, 427.4330, 139.5550],
          [167.6170, 430.8330, 138.3250],
          [170.5180, 431.9210, 140.5320],
          [169.7800, 435.6680, 140.3160],
          [167.6790, 437.8790, 142.6110],
          [167.3350, 435.1910, 145.2890],
          [168.9090, 437.3000, 148.0580],
          [166.1860, 439.9440, 148.5750],
          [164.1010, 438.1300, 151.1890],
          [163.0190, 438.8690, 154.7450],
          [164.6750, 435.6560, 155.9860],
          [168.1170, 436.6550, 154.6330],
          [169.8040, 437.0930, 158.0010],
          [173.2680, 438.4850, 158.5750],
          [174.9270, 440.8490, 156.1460],
          [175.9170, 440.8010, 152.5000],
          [179.6060, 441.0910, 153.4330],
          [180.8000, 439.9830, 156.8710],
          [184.2540, 439.9980, 158.4580],
          [185.5840, 437.2970, 160.7670],
          [186.4140, 439.9000, 163.4480],
          [182.9430, 441.4570, 163.7500],
          [181.4930, 442.0910, 167.2100],
          [178.0780, 440.5360, 167.9060],
          [176.1650, 440.9290, 171.1790],
          [173.0530, 438.9530, 172.1310],
          [170.3990, 440.6990, 174.2170],
          [166.7880, 440.0710, 175.2260],
          [164.4480, 442.1010, 173.0340],
          [160.9780, 443.6110, 173.3810],
          [158.8330, 443.9000, 170.2730],
          [160.1820, 442.9360, 166.8360],
          [163.5010, 444.7230, 166.1630],
          [164.6310, 442.4290, 163.3160],
          [166.6330, 444.2560, 160.6240],
          [166.1700, 447.5210, 162.5220],
          [168.1350, 449.9450, 164.6440],
          [168.0040, 449.0470, 168.3240],
          [168.4210, 451.0540, 171.5220],
          [169.0440, 450.0360, 175.1300],
          [166.1650, 450.3680, 177.6100],
          [166.7800, 449.8300, 181.3340],
          [163.8480, 448.9150, 183.5760],
          [163.3500, 443.9770, 187.9750],
          [165.0240, 442.4930, 184.8870],
          [168.1120, 443.0980, 182.7810],
          [168.1280, 445.9210, 180.2050],
          [166.7230, 445.0130, 176.8030],
          [167.0690, 446.1330, 173.1930],
          [164.0790, 447.8390, 171.5780],
          [163.4630, 449.0630, 168.0350],
          [164.4160, 452.7320, 168.0130],
          [166.7460, 455.3190, 166.5790],
          [167.5460, 459.0090, 166.8440],
          [170.2010, 459.9620, 169.3810],
          [169.5150, 456.8220, 171.4570],
          [170.7440, 454.3880, 168.7860],
          [172.8900, 451.6810, 170.3850],
          [173.2420, 449.3000, 167.4490],
          [171.4700, 447.3670, 164.7150],
          [169.7840, 443.9650, 164.9670],
          [170.8880, 441.4120, 162.3680],
          [169.3320, 438.1640, 163.6510],
          [166.7170, 436.9060, 166.1030],
          [166.2830, 433.5290, 167.7900],
          [162.5490, 433.4260, 168.5040],
          [162.6440, 430.3540, 170.7620],
          [164.9490, 432.1120, 173.2350],
          [163.6530, 435.5760, 172.2210],
          [167.2540, 436.7450, 171.8300],
          [168.3710, 439.3650, 169.3060],
          [171.8910, 439.6680, 167.8930],
          [173.1360, 443.2500, 167.5870],
          [176.0990, 444.7950, 165.8030],
          [177.1670, 447.8660, 167.8270],
          [177.4160, 451.2430, 166.1360],
          [181.1790, 451.0810, 166.7480],
          [181.3560, 448.4350, 163.9980],
          [180.6020, 448.8920, 160.3060],
          [177.8440, 446.8990, 158.6170],
          [176.7540, 446.3160, 155.0150],
          [173.3820, 444.6900, 154.4770],
          [169.6430, 444.9660, 153.9760],
          [167.6080, 446.9450, 156.5140],
          [163.9630, 447.9270, 156.8800],
          [163.1520, 451.2160, 155.1430],
          [160.3870, 453.6570, 156.0420],
          [158.9330, 453.6920, 152.5150],
          [159.8050, 453.2460, 148.8360],
          [161.2560, 456.7540, 148.4440],
          [164.9420, 455.8330, 148.8520],
          [167.3780, 455.9500, 145.9260],
          [171.0380, 455.0610, 145.5060],
          [173.5830, 457.5340, 146.9270],
          [171.2360, 458.9430, 149.5740],
          [171.8150, 459.8640, 153.2130],
          [169.5910, 457.9090, 155.6000],
          [168.4960, 458.3960, 159.2100],
          [167.0870, 455.9020, 161.7110],
          [163.6430, 457.0390, 162.8610],
          [161.8700, 456.3550, 166.1660],
          [160.7490, 452.9260, 164.8950],
          [164.2440, 451.8750, 163.8110],
          [163.4490, 452.3040, 160.1090],
          [165.6820, 454.0140, 157.5590],
          [164.2940, 457.2310, 156.0670],
          [165.6810, 459.7290, 153.5780],
          [167.3870, 462.7790, 155.0760],
          [170.2060, 465.8340, 161.8360],
          [172.1620, 462.7680, 160.6890],
          [174.0280, 461.7040, 163.8180],
          [175.2110, 458.5660, 161.9930],
          [176.5410, 458.2410, 158.4450],
          [174.5550, 455.7530, 156.3540],
          [174.3580, 455.6030, 152.5600],
          [171.8590, 453.6260, 150.4910],
          [173.6300, 451.2730, 148.0810],
          [170.4820, 450.1750, 146.2000],
          [166.8520, 451.1980, 145.7810],
          [164.0730, 450.1880, 148.1490],
          [162.6710, 446.7750, 147.1980],
          [159.1420, 445.8080, 148.2240],
          [158.3160, 442.3640, 149.5950],
          [155.0300, 443.1470, 151.4050],
          [152.6800, 446.0960, 151.8500],
          [155.0610, 447.4890, 154.4910],
          [158.2010, 445.3460, 154.0190],
          [160.8090, 447.5340, 152.2870],
          [164.4130, 446.3150, 152.1360],
          [167.3360, 448.5380, 151.1380],
          [171.0310, 447.6630, 150.9500],
          [172.8160, 450.1440, 153.2260],
          [176.3550, 450.4540, 154.5850],
          [176.6310, 452.0780, 158.0240],
          [180.1600, 453.0970, 159.0310],
          [179.9110, 454.9710, 162.3160],
          [178.7370, 458.4030, 163.4610],
          [179.1350, 461.9150, 162.0830],
          [180.3160, 464.9940, 163.9950],
          [195.1700, 454.0210, 194.1160],
          [196.4220, 450.5570, 193.1610],
          [195.0530, 447.5260, 191.3340],
          [195.7530, 445.4320, 194.4410],
          [195.6760, 446.8960, 197.9710],
          [199.1230, 447.9300, 199.1640],
          [198.4070, 447.0140, 202.7840],
          [198.1130, 443.3080, 203.5090],
          [195.0350, 441.7530, 205.1060],
          [194.7040, 440.1720, 208.5300],
          [191.7420, 437.9460, 207.6670],
          [190.9460, 435.0990, 205.3140],
          [188.0870, 435.6760, 202.8880],
          [184.8300, 433.7320, 203.1830],
          [184.1270, 431.8260, 199.9570],
          [180.5450, 430.8500, 200.7460],
          [178.2470, 433.3320, 199.0220],
          [178.2770, 433.9260, 195.2680],
          [176.3170, 437.2030, 195.0280],
          [177.7690, 440.7330, 194.9660],
          [181.3350, 439.3910, 195.1330],
          [182.5820, 441.3460, 192.0940],
          [182.3530, 444.9560, 193.3610],
          [186.0870, 445.2240, 194.0290],
          [188.7230, 447.5570, 192.6190],
          [190.8760, 444.5870, 191.5490],
          [188.2670, 443.5230, 188.9590],
          [190.1360, 444.3010, 185.7360],
          [188.8140, 444.1990, 182.1980],
          [185.1300, 444.1560, 181.3470],
          [182.1290, 442.2030, 182.6060],
          [181.3330, 441.2700, 178.9860],
          [184.3690, 440.6570, 176.7700],
          [184.4240, 439.7520, 173.0740],
          [187.1540, 437.5560, 171.6020],
          [187.5170, 439.9750, 168.6500],
          [188.4180, 443.0500, 170.7240],
          [191.4360, 445.2100, 169.8910],
          [193.7480, 445.6510, 172.8940],
          [196.8200, 447.8990, 173.0280],
          [199.4910, 448.6120, 175.6310],
          [201.2080, 451.9090, 176.3780],
          [203.6110, 453.4740, 178.8680],
          [201.6130, 454.9210, 181.7570],
          [202.0690, 458.3840, 183.2730],
          [200.4970, 458.3690, 186.7180],
          [197.7960, 455.8130, 187.5240],
          [195.1460, 455.8470, 184.7720],
          [193.6160, 452.5740, 185.9790],
          [189.8600, 452.3800, 185.3150],
          [189.9440, 455.8580, 183.7680],
          [189.7400, 457.4310, 180.3410],
          [193.1630, 457.9480, 178.8100],
          [194.6770, 460.5170, 176.4640],
          [197.9720, 460.6430, 174.6010],
          [200.7710, 462.5710, 176.3210],
          [203.7970, 463.7400, 174.3330],
          [207.1150, 464.5820, 175.9770],
          [210.1390, 466.5260, 174.8060],
          [210.8850, 463.9390, 172.1200],
          [213.6440, 461.8050, 173.6150],
          [211.1280, 459.2430, 174.9140],
          [208.0390, 457.7610, 173.2930],
          [204.4770, 458.9110, 173.9310],
          [202.5240, 457.8000, 177.0030],
          [198.9190, 457.7220, 178.2440],
          [197.5730, 459.9040, 181.0500],
          [194.2310, 460.2350, 182.8060],
          [192.1140, 462.6430, 180.7850],
          [189.0250, 463.1720, 178.7000],
          [187.1240, 465.8360, 176.8220],
          [188.2850, 466.3320, 173.2420],
          [191.5580, 464.4520, 173.8600],
          [189.9950, 461.1280, 174.9010],
          [191.8440, 458.2970, 173.1540],
          [190.4670, 455.2690, 174.9900],
          [189.7380, 453.5820, 178.3070],
          [192.3390, 451.8830, 180.5000],
          [191.2020, 448.5340, 181.9050],
          [194.3800, 447.0280, 183.4050],
          [197.8700, 447.9650, 184.5850],
          [200.9840, 445.8270, 184.9660],
          [202.8290, 447.6220, 187.7590],
          [206.2120, 445.9180, 187.3400],
          [206.6040, 447.1310, 183.7460],
          [204.4430, 450.2380, 184.1830],
          [202.2310, 449.1370, 181.2890],
          [198.6660, 450.3790, 180.7720],
          [196.2310, 448.2490, 178.7680],
          [194.0550, 450.4240, 176.5260],
          [190.8490, 449.7090, 174.6070],
          [190.4360, 452.3080, 171.8240],
          [187.3830, 454.5590, 171.7980],
          [186.4670, 452.9730, 168.4480],
          [185.9050, 449.6480, 170.2580],
          [182.8890, 448.8910, 172.4330],
          [183.5680, 448.6030, 176.1630],
          [181.3330, 447.5400, 179.0590],
          [182.7340, 447.5270, 182.5670],
          [183.3280, 449.2260, 185.8920],
          [184.8760, 452.6980, 185.6990],
          [185.7140, 455.4170, 188.2030],
          [182.9420, 457.9530, 188.8300],
          [183.2930, 461.5650, 189.9350],
          [182.7580, 462.1320, 193.6660],
          [180.0620, 464.7490, 193.1030],
          [176.4120, 463.9030, 192.5790],
          [177.1550, 460.1730, 192.6930],
          [173.8180, 459.5980, 194.4390],
          [172.0770, 461.1030, 191.4040],
          [173.6950, 458.5410, 189.0910],
          [171.0060, 456.0190, 188.1270],
          [170.8210, 453.3390, 185.4210],
          [170.0180, 454.8270, 182.0350],
          [171.8690, 458.1080, 182.6590],
          [174.2320, 459.6520, 180.1210],
          [177.8640, 459.9130, 181.2230],
          [180.8780, 461.8410, 179.9490],
          [184.5650, 461.8070, 180.8430],
          [186.0650, 464.7990, 182.6440],
          [189.5780, 466.2740, 182.6170],
          [190.7120, 463.6340, 185.1390],
          [189.2980, 460.6630, 183.2290],
          [186.3560, 460.3000, 185.6240],
          [182.7840, 459.6590, 184.5210],
          [180.2130, 462.3320, 185.3590],
          [176.5320, 462.8410, 184.6140],
          [175.8580, 465.0300, 181.5790],
          [178.1930, 465.7110, 174.9000],
          [180.9150, 464.1190, 172.7670],
          [181.2510, 460.5420, 174.1130],
          [178.4430, 457.9660, 174.2140],
          [178.5420, 456.4790, 177.7180],
          [175.4730, 455.1590, 179.5500],
          [175.4670, 454.0690, 183.1870],
          [174.1210, 450.5430, 183.6830],
          [173.9850, 450.6340, 187.5030],
          [174.3640, 452.9240, 190.5240],
          [177.5140, 454.2200, 192.1820],
          [179.3060, 451.6570, 194.3540],
          [181.4890, 453.2110, 197.0570],
          [184.7390, 451.3110, 197.6330],
          [186.4870, 454.0230, 199.6690],
          [185.5270, 457.3810, 201.1790],
          [186.7910, 459.0050, 197.9610],
          [186.6010, 456.0430, 195.5320],
          [183.3680, 455.3620, 193.6270],
          [182.9110, 452.9430, 190.7280],
          [180.0190, 452.5230, 188.3060],
          [179.0910, 450.0250, 185.5960],
          [178.9850, 451.7840, 182.2280],
          [179.0510, 451.0080, 178.5100],
          [181.0530, 453.1550, 176.0780],
          [180.1320, 452.7170, 172.4060],
          [182.0570, 455.4530, 170.6120],
          [181.8140, 459.1820, 170.0770],
          [178.5990, 461.0580, 169.3130],
          [177.7620, 463.8650, 166.9020]]),
  'residue_ids': tensor([  4.,   4.,   4.,  ..., 182., 182., 182.]),
  'chain_ids': tensor([0., 0., 0.,  ..., 1., 1., 1.])},
 'feature_complex': {'atom_types': tensor([[0.],
          [1.],
          [2.],
          ...,
          [1.],
          [2.],
          [3.]]),
  'element_types': tensor([[3.],
          [0.],
          [0.],
          ...,
          [0.],
          [0.],
          [2.]]),
  'residue_types': tensor([[16.],
          [16.],
          [16.],
          ...,
          [ 0.],
          [ 0.],
          [ 0.]]),
  'atom_coordinates': tensor([[131.7500, 429.3090, 163.5360],
          [132.6810, 428.2520, 163.1550],
          [133.5150, 428.6750, 161.9500],
          ...,
          [186.5629, 421.9754, 185.6878],
          [185.6360, 421.5445, 184.5560],
          [184.7227, 422.2786, 184.1773]]),
  'residue_coordinates': tensor([[132.6810, 428.2520, 163.1550],
          [133.5560, 429.2490, 159.5910],
          [133.8750, 432.8980, 160.6290],
          [136.1110, 431.8050, 163.5130],
          [138.3420, 429.8920, 161.0830],
          [138.5230, 432.9090, 158.7600],
          [139.4710, 435.1730, 161.6750],
          [142.1200, 432.6310, 162.6740],
          [143.5890, 432.7500, 159.1600],
          [143.6190, 436.5600, 159.1540],
          [145.2600, 436.8040, 162.5830],
          [147.8380, 434.1320, 161.7330],
          [148.7390, 435.8920, 158.4790],
          [149.0640, 439.2630, 160.2240],
          [151.2220, 437.7670, 162.9840],
          [153.4710, 435.8030, 160.6120],
          [153.9420, 438.9180, 158.4710],
          [156.1140, 440.2660, 161.3150],
          [158.3330, 437.1860, 161.4250],
          [161.8090, 436.4120, 160.1100],
          [161.1120, 432.8370, 158.9350],
          [157.4020, 432.0710, 159.3930],
          [156.5290, 428.3870, 159.0310],
          [154.0400, 428.1160, 156.1500],
          [153.3520, 424.6590, 154.7510],
          [152.8370, 424.8210, 150.9870],
          [150.9700, 422.4160, 148.7320],
          [153.5900, 422.6860, 145.9670],
          [157.3050, 423.3410, 145.6600],
          [158.4860, 426.6730, 144.2700],
          [160.7730, 427.0880, 141.2590],
          [163.9630, 429.0530, 141.9600],
          [166.4230, 427.4330, 139.5550],
          [167.6170, 430.8330, 138.3250],
          [170.5180, 431.9210, 140.5320],
          [169.7800, 435.6680, 140.3160],
          [167.6790, 437.8790, 142.6110],
          [167.3350, 435.1910, 145.2890],
          [168.9090, 437.3000, 148.0580],
          [166.1860, 439.9440, 148.5750],
          [164.1010, 438.1300, 151.1890],
          [163.0190, 438.8690, 154.7450],
          [164.6750, 435.6560, 155.9860],
          [168.1170, 436.6550, 154.6330],
          [169.8040, 437.0930, 158.0010],
          [173.2680, 438.4850, 158.5750],
          [174.9270, 440.8490, 156.1460],
          [175.9170, 440.8010, 152.5000],
          [179.6060, 441.0910, 153.4330],
          [180.8000, 439.9830, 156.8710],
          [184.2540, 439.9980, 158.4580],
          [185.5840, 437.2970, 160.7670],
          [186.4140, 439.9000, 163.4480],
          [182.9430, 441.4570, 163.7500],
          [181.4930, 442.0910, 167.2100],
          [178.0780, 440.5360, 167.9060],
          [176.1650, 440.9290, 171.1790],
          [173.0530, 438.9530, 172.1310],
          [170.3990, 440.6990, 174.2170],
          [166.7880, 440.0710, 175.2260],
          [164.4480, 442.1010, 173.0340],
          [160.9780, 443.6110, 173.3810],
          [158.8330, 443.9000, 170.2730],
          [160.1820, 442.9360, 166.8360],
          [163.5010, 444.7230, 166.1630],
          [164.6310, 442.4290, 163.3160],
          [166.6330, 444.2560, 160.6240],
          [166.1700, 447.5210, 162.5220],
          [168.1350, 449.9450, 164.6440],
          [168.0040, 449.0470, 168.3240],
          [168.4210, 451.0540, 171.5220],
          [169.0440, 450.0360, 175.1300],
          [166.1650, 450.3680, 177.6100],
          [166.7800, 449.8300, 181.3340],
          [163.8480, 448.9150, 183.5760],
          [163.3500, 443.9770, 187.9750],
          [165.0240, 442.4930, 184.8870],
          [168.1120, 443.0980, 182.7810],
          [168.1280, 445.9210, 180.2050],
          [166.7230, 445.0130, 176.8030],
          [167.0690, 446.1330, 173.1930],
          [164.0790, 447.8390, 171.5780],
          [163.4630, 449.0630, 168.0350],
          [164.4160, 452.7320, 168.0130],
          [166.7460, 455.3190, 166.5790],
          [167.5460, 459.0090, 166.8440],
          [170.2010, 459.9620, 169.3810],
          [169.5150, 456.8220, 171.4570],
          [170.7440, 454.3880, 168.7860],
          [172.8900, 451.6810, 170.3850],
          [173.2420, 449.3000, 167.4490],
          [171.4700, 447.3670, 164.7150],
          [169.7840, 443.9650, 164.9670],
          [170.8880, 441.4120, 162.3680],
          [169.3320, 438.1640, 163.6510],
          [166.7170, 436.9060, 166.1030],
          [166.2830, 433.5290, 167.7900],
          [162.5490, 433.4260, 168.5040],
          [162.6440, 430.3540, 170.7620],
          [164.9490, 432.1120, 173.2350],
          [163.6530, 435.5760, 172.2210],
          [167.2540, 436.7450, 171.8300],
          [168.3710, 439.3650, 169.3060],
          [171.8910, 439.6680, 167.8930],
          [173.1360, 443.2500, 167.5870],
          [176.0990, 444.7950, 165.8030],
          [177.1670, 447.8660, 167.8270],
          [177.4160, 451.2430, 166.1360],
          [181.1790, 451.0810, 166.7480],
          [181.3560, 448.4350, 163.9980],
          [180.6020, 448.8920, 160.3060],
          [177.8440, 446.8990, 158.6170],
          [176.7540, 446.3160, 155.0150],
          [173.3820, 444.6900, 154.4770],
          [169.6430, 444.9660, 153.9760],
          [167.6080, 446.9450, 156.5140],
          [163.9630, 447.9270, 156.8800],
          [163.1520, 451.2160, 155.1430],
          [160.3870, 453.6570, 156.0420],
          [158.9330, 453.6920, 152.5150],
          [159.8050, 453.2460, 148.8360],
          [161.2560, 456.7540, 148.4440],
          [164.9420, 455.8330, 148.8520],
          [167.3780, 455.9500, 145.9260],
          [171.0380, 455.0610, 145.5060],
          [173.5830, 457.5340, 146.9270],
          [171.2360, 458.9430, 149.5740],
          [171.8150, 459.8640, 153.2130],
          [169.5910, 457.9090, 155.6000],
          [168.4960, 458.3960, 159.2100],
          [167.0870, 455.9020, 161.7110],
          [163.6430, 457.0390, 162.8610],
          [161.8700, 456.3550, 166.1660],
          [160.7490, 452.9260, 164.8950],
          [164.2440, 451.8750, 163.8110],
          [163.4490, 452.3040, 160.1090],
          [165.6820, 454.0140, 157.5590],
          [164.2940, 457.2310, 156.0670],
          [165.6810, 459.7290, 153.5780],
          [167.3870, 462.7790, 155.0760],
          [170.2060, 465.8340, 161.8360],
          [172.1620, 462.7680, 160.6890],
          [174.0280, 461.7040, 163.8180],
          [175.2110, 458.5660, 161.9930],
          [176.5410, 458.2410, 158.4450],
          [174.5550, 455.7530, 156.3540],
          [174.3580, 455.6030, 152.5600],
          [171.8590, 453.6260, 150.4910],
          [173.6300, 451.2730, 148.0810],
          [170.4820, 450.1750, 146.2000],
          [166.8520, 451.1980, 145.7810],
          [164.0730, 450.1880, 148.1490],
          [162.6710, 446.7750, 147.1980],
          [159.1420, 445.8080, 148.2240],
          [158.3160, 442.3640, 149.5950],
          [155.0300, 443.1470, 151.4050],
          [152.6800, 446.0960, 151.8500],
          [155.0610, 447.4890, 154.4910],
          [158.2010, 445.3460, 154.0190],
          [160.8090, 447.5340, 152.2870],
          [164.4130, 446.3150, 152.1360],
          [167.3360, 448.5380, 151.1380],
          [171.0310, 447.6630, 150.9500],
          [172.8160, 450.1440, 153.2260],
          [176.3550, 450.4540, 154.5850],
          [176.6310, 452.0780, 158.0240],
          [180.1600, 453.0970, 159.0310],
          [179.9110, 454.9710, 162.3160],
          [178.7370, 458.4030, 163.4610],
          [179.1350, 461.9150, 162.0830],
          [180.3160, 464.9940, 163.9950],
          [194.3526, 447.5152, 165.0098],
          [196.6725, 449.6794, 167.1105],
          [196.8679, 450.8312, 170.7208],
          [196.8410, 454.4390, 169.4914],
          [195.0953, 455.5005, 166.2610],
          [197.4823, 455.6115, 163.3175],
          [195.7098, 458.5518, 161.6839],
          [196.0764, 461.8777, 163.4590],
          [193.1024, 463.9345, 164.6282],
          [191.9094, 467.2896, 163.3425],
          [190.1228, 468.3376, 166.5335],
          [190.9830, 469.0458, 170.1438],
          [189.1918, 466.9384, 172.7404],
          [186.6393, 468.4757, 175.1059],
          [187.6807, 467.9099, 178.7294],
          [184.4171, 468.9856, 180.3412],
          [182.4022, 465.8605, 181.1073],
          [183.6892, 463.0532, 183.3230],
          [181.2283, 460.2528, 182.4630],
          [181.6926, 457.5359, 179.8219],
          [185.1478, 458.8627, 178.9016],
          [186.9255, 455.5023, 179.3012],
          [185.3688, 453.4701, 176.4487],
          [188.3953, 453.8640, 174.1803],
          [190.7147, 451.2960, 172.6294],
          [193.7639, 453.0481, 174.1239],
          [192.6568, 452.1291, 177.6703],
          [195.3462, 449.6016, 178.6003],
          [195.5117, 447.4044, 181.6697],
          [192.5437, 446.7234, 183.9113],
          [189.8618, 448.8829, 185.5079],
          [190.7307, 447.3081, 188.8804],
          [194.4259, 446.5515, 189.4066],
          [196.0781, 444.9496, 192.4376],
          [199.6059, 445.8761, 193.5045],
          [200.4472, 442.1646, 193.9684],
          [199.7299, 441.1097, 190.3710],
          [202.2173, 439.0560, 188.3553],
          [203.0555, 440.7037, 185.0176],
          [205.2080, 439.1884, 182.2646],
          [206.4511, 440.3921, 178.8846],
          [206.9065, 438.3754, 175.7063],
          [207.7449, 438.8314, 172.0303],
          [204.5241, 439.4103, 170.1007],
          [203.5230, 437.6822, 166.8588],
          [200.8308, 439.7688, 165.2041],
          [198.7335, 442.1299, 167.3265],
          [197.3833, 440.2521, 170.3676],
          [196.3582, 443.4805, 172.1103],
          [193.2925, 443.0303, 174.3430],
          [193.1009, 439.3580, 173.3560],
          [193.8182, 435.9807, 174.8857],
          [197.3287, 434.7910, 174.1180],
          [198.9368, 431.3986, 173.5697],
          [202.5520, 430.2996, 173.3122],
          [203.9446, 430.0112, 169.7784],
          [207.1122, 428.0082, 169.1232],
          [209.2617, 428.5442, 166.0389],
          [211.9335, 426.4479, 164.3769],
          [214.2351, 426.8219, 167.3813],
          [216.6637, 429.5557, 166.3666],
          [214.5489, 432.2387, 168.0726],
          [212.7513, 432.2271, 171.4095],
          [209.0453, 431.5531, 171.8874],
          [206.4182, 434.2415, 171.2905],
          [202.7466, 434.8990, 172.0839],
          [199.9616, 434.8857, 169.4974],
          [196.2333, 435.5594, 169.5850],
          [194.5094, 432.3153, 170.5171],
          [192.3965, 430.4469, 173.0099],
          [190.7503, 427.1035, 173.6003],
          [193.0082, 424.5374, 175.2518],
          [196.1631, 426.5507, 174.4633],
          [195.1830, 429.7135, 176.3638],
          [198.1797, 430.9191, 178.3767],
          [196.9966, 434.3573, 179.4789],
          [195.5133, 437.7083, 178.5043],
          [197.4335, 440.5316, 176.8300],
          [196.7045, 443.9624, 178.3066],
          [199.3535, 446.2308, 176.7397],
          [201.8074, 446.4113, 173.8422],
          [204.9715, 448.4697, 173.4356],
          [205.1411, 448.9081, 169.6665],
          [208.7397, 450.1417, 169.4380],
          [210.1404, 446.9701, 171.0258],
          [207.2875, 444.7153, 169.8894],
          [206.6585, 443.6548, 173.4894],
          [203.3596, 442.1872, 174.7069],
          [202.4481, 442.4736, 178.3906],
          [200.8108, 439.2700, 179.6350],
          [198.8336, 438.4688, 182.7847],
          [198.8734, 434.6842, 183.3729],
          [195.6036, 432.7627, 183.4439],
          [196.4229, 431.8594, 187.0619],
          [196.0511, 435.5532, 187.9922],
          [192.7223, 437.3526, 188.2768],
          [192.0043, 439.9416, 185.5864],
          [189.1790, 442.4687, 185.2074],
          [189.1234, 444.7405, 182.1900],
          [187.9978, 445.5241, 178.6667],
          [188.6112, 442.7753, 176.1063],
          [187.7632, 442.2633, 172.4479],
          [184.4313, 440.5417, 171.7821],
          [183.4538, 438.4355, 168.7823],
          [181.4411, 440.2989, 166.1355],
          [178.6028, 437.7743, 166.2046],
          [175.7381, 437.9249, 168.6711],
          [177.2672, 440.9391, 170.4188],
          [173.7674, 442.3126, 171.0345],
          [172.9790, 439.1570, 173.0164],
          [175.9153, 439.7880, 175.3641],
          [174.4827, 441.0170, 178.6743],
          [175.9812, 441.4048, 182.1591],
          [176.1678, 438.0897, 183.9754],
          [176.7931, 436.0152, 180.8312],
          [179.4820, 433.3434, 180.6301],
          [182.2566, 434.0090, 178.1136],
          [184.9619, 431.8602, 176.5359],
          [187.9352, 432.6283, 174.3080],
          [187.8763, 431.4971, 170.6800],
          [190.6716, 430.5054, 168.2884],
          [191.3847, 434.1945, 167.5829],
          [191.5563, 435.2455, 171.2354],
          [188.1155, 436.8791, 171.1087],
          [185.4874, 436.5120, 173.8189],
          [182.2231, 434.8255, 172.8403],
          [179.0839, 433.7808, 174.6890],
          [179.0841, 430.1452, 175.8026],
          [183.5072, 425.5566, 178.9499],
          [187.1284, 425.5991, 180.1243],
          [187.7951, 429.2438, 181.1198],
          [185.8707, 431.1746, 183.7843],
          [185.0097, 434.5277, 182.1988],
          [181.8987, 436.5487, 183.0672],
          [180.7975, 439.6705, 181.2077],
          [180.2646, 442.6608, 183.5021],
          [178.6914, 444.9704, 180.8924],
          [177.3435, 445.0929, 177.3312],
          [179.2271, 445.2756, 174.0495],
          [180.6412, 448.7187, 173.2475],
          [181.2056, 449.3054, 169.5295],
          [184.3606, 451.3069, 168.7915],
          [184.5037, 450.5530, 165.0542],
          [182.2632, 448.8328, 162.5050],
          [184.2036, 445.6192, 163.2110],
          [185.6626, 446.3996, 166.6711],
          [183.6476, 445.5793, 169.8036],
          [184.9115, 445.6296, 173.3903],
          [183.3317, 444.2998, 176.5750],
          [184.1225, 444.5065, 180.2867],
          [184.8602, 441.0260, 181.6250],
          [186.4991, 439.3104, 184.5917],
          [188.6786, 436.2172, 184.1296],
          [189.3349, 434.2180, 187.3071],
          [191.0636, 431.0608, 186.1018],
          [190.1397, 427.8091, 184.4214],
          [187.0923, 425.7105, 185.2866],
          [186.5629, 421.9754, 185.6878]]),
  'residue_ids': tensor([  4.,   4.,   4.,  ..., 182., 182., 182.]),
  'chain_ids': tensor([0., 0., 0.,  ..., 1., 1., 1.])},
 'id': '8phr__X4_UNDEFINED--8phr__W4_UNDEFINED',
 'sample_id': '8phr__X4_UNDEFINED-R--8phr__W4_UNDEFINED-L',
 'target_id': '8phr__X4_UNDEFINED-R--8phr__W4_UNDEFINED-L'}
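
The `id`, `sample_id` and `target_id` strings above appear to follow a `receptor--ligand` layout, with each monomer written as `<entry>__<chain>_<accession>` and an optional `-R` / `-L` role suffix. The following is a minimal sketch for splitting such strings into their parts. The layout is inferred from the printed output rather than taken from the pinder API, and `MonomerId`, `parse_monomer` and `parse_pair` are hypothetical helper names, not part of pinder.

```python
from typing import NamedTuple, Optional, Tuple


class MonomerId(NamedTuple):
    source: str            # e.g. the PDB entry "8phr"
    chain: Optional[str]   # e.g. "X4"; None when the ID has no chain field
    accession: str         # e.g. "UNDEFINED" or a UniProt accession
    role: Optional[str]    # "R" (receptor), "L" (ligand), or None


def parse_monomer(monomer: str) -> MonomerId:
    """Split one monomer identifier (assumed layout, see lead-in)."""
    role = None
    # Strip an optional trailing "-R" / "-L" role suffix.
    if monomer.endswith(("-R", "-L")):
        monomer, role = monomer[:-2], monomer[-1]
    source, _, rest = monomer.partition("__")
    chain, sep, accession = rest.partition("_")
    if not sep:
        # No chain field present: keep the remainder as the accession.
        chain, accession = None, rest
    return MonomerId(source, chain, accession, role)


def parse_pair(identifier: str) -> Tuple[MonomerId, MonomerId]:
    """Split a 'receptor--ligand' identifier into its two monomers."""
    receptor, ligand = identifier.split("--")
    return parse_monomer(receptor), parse_monomer(ligand)


receptor, ligand = parse_pair("8phr__X4_UNDEFINED-R--8phr__W4_UNDEFINED-L")
```

Here `receptor.chain` is `"X4"` and `ligand.role` is `"L"`. Keeping the parts in a named tuple makes it easy to group samples by entry or chain when inspecting a dataset.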

from pinder.core.loader.dataset import collate_batch, get_torch_loader

# Wrap the dataset in a torch DataLoader
batch_size = 2
train_dataloader = get_torch_loader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    collate_fn=collate_batch,
    num_workers=0,
)

# Get a batch from the dataloader
batch = next(iter(train_dataloader))

# The collated batch is a dict with these top-level keys
assert set(batch.keys()) == {
    "target_complex",
    "feature_complex",
    "id",
    "sample_id",
    "target_id",
}
feature_coords = batch["feature_complex"]["atom_coordinates"]
# The leading dimension is the batch size
assert feature_coords.shape[0] == batch_size
# The last dimension holds the (x, y, z) coordinates
assert feature_coords.shape[2] == 3

2024-11-15 12:14:55,934 | pinder.core.utils.cloud:375 | INFO : Gsutil process_many=download_to_filename, threads=4, items=7

2024-11-15 12:14:57,458 | pinder.core.utils.cloud:375 | INFO : Gsutil process_many=download_to_filename, threads=4, items=5

batch

{'target_complex': {'atom_types': tensor([[[ 0.],
           [ 1.],
           [ 2.],
           ...,
           [-1.],
           [-1.],
           [-1.]],
  
          [[ 0.],
           [ 1.],
           [ 2.],
           ...,
           [ 1.],
           [ 2.],
           [ 3.]]]),
  'element_types': tensor([[[ 3.],
           [ 0.],
           [ 0.],
           ...,
           [-1.],
           [-1.],
           [-1.]],
  
          [[ 3.],
           [ 0.],
           [ 0.],
           ...,
           [ 0.],
           [ 0.],
           [ 2.]]]),
  'residue_types': tensor([[[ 0.],
           [ 0.],
           [ 0.],
           ...,
           [-1.],
           [-1.],
           [-1.]],
  
          [[ 4.],
           [ 4.],
           [ 4.],
           ...,
           [ 5.],
           [ 5.],
           [ 5.]]]),
  'atom_coordinates': tensor([[[ 238.8460,  357.3990,  387.2690],
           [ 238.3230,  356.4020,  388.1950],
           [ 238.8780,  355.0210,  387.8720],
           ...,
           [-100.0000, -100.0000, -100.0000],
           [-100.0000, -100.0000, -100.0000],
           [-100.0000, -100.0000, -100.0000]],
  
          [[ 280.7290,  142.4940,  235.3350],
           [ 280.2980,  141.7710,  234.1080],
           [ 280.1140,  140.3010,  234.3730],
           ...,
           [ 315.1630,  132.9490,  180.6540],
           [ 314.1190,  131.9150,  181.0770],
           [ 312.9030,  132.2420,  181.0280]]]),
  'residue_coordinates': tensor([[[ 238.3230,  356.4020,  388.1950],
           [ 240.4110,  353.2410,  388.5110],
           [ 239.1700,  349.8340,  389.7560],
           ...,
           [-100.0000, -100.0000, -100.0000],
           [-100.0000, -100.0000, -100.0000],
           [-100.0000, -100.0000, -100.0000]],
  
          [[ 280.2980,  141.7710,  234.1080],
           [ 279.4230,  138.1500,  233.3690],
           [ 281.2600,  138.0610,  230.0370],
           ...,
           [ 320.3230,  136.4760,  182.5070],
           [ 317.2980,  134.3810,  183.4050],
           [ 315.1630,  132.9490,  180.6540]]]),
  'residue_ids': tensor([[  2.,   2.,   2.,  ..., -99., -99., -99.],
          [  7.,   7.,   7.,  ..., 239., 239., 239.]]),
  'chain_ids': tensor([[ 0.,  0.,  0.,  ..., -1., -1., -1.],
          [ 0.,  0.,  0.,  ...,  1.,  1.,  1.]])},
 'feature_complex': {'atom_types': tensor([[[ 0.],
           [ 1.],
           [ 2.],
           ...,
           [-1.],
           [-1.],
           [-1.]],
  
          [[ 0.],
           [ 1.],
           [ 2.],
           ...,
           [ 1.],
           [ 2.],
           [ 3.]]]),
  'element_types': tensor([[[ 3.],
           [ 0.],
           [ 0.],
           ...,
           [-1.],
           [-1.],
           [-1.]],
  
          [[ 3.],
           [ 0.],
           [ 0.],
           ...,
           [ 0.],
           [ 0.],
           [ 2.]]]),
  'residue_types': tensor([[[ 0.],
           [ 0.],
           [ 0.],
           ...,
           [-1.],
           [-1.],
           [-1.]],
  
          [[ 4.],
           [ 4.],
           [ 4.],
           ...,
           [ 5.],
           [ 5.],
           [ 5.]]]),
  'atom_coordinates': tensor([[[ 239.0594,  357.0178,  386.4921],
           [ 238.6853,  355.9713,  387.4380],
           [ 239.4948,  354.6902,  387.1700],
           ...,
           [-100.0000, -100.0000, -100.0000],
           [-100.0000, -100.0000, -100.0000],
           [-100.0000, -100.0000, -100.0000]],
  
          [[ 280.7290,  142.4940,  235.3350],
           [ 280.2980,  141.7710,  234.1080],
           [ 280.1140,  140.3010,  234.3730],
           ...,
           [ 341.9383,  155.9993,  243.4603],
           [ 342.7656,  157.1881,  242.9700],
           [ 342.2882,  158.3454,  243.1132]]]),
  'residue_coordinates': tensor([[[ 238.6853,  355.9713,  387.4380],
           [ 240.8402,  352.8592,  388.1541],
           [ 238.7678,  349.6642,  388.6942],
           ...,
           [-100.0000, -100.0000, -100.0000],
           [-100.0000, -100.0000, -100.0000],
           [-100.0000, -100.0000, -100.0000]],
  
          [[ 280.2980,  141.7710,  234.1080],
           [ 279.4230,  138.1500,  233.3690],
           [ 281.2600,  138.0610,  230.0370],
           ...,
           [ 338.9307,  150.4625,  241.7879],
           [ 340.4811,  153.7729,  240.7964],
           [ 341.9383,  155.9993,  243.4603]]]),
  'residue_ids': tensor([[  2.,   2.,   2.,  ..., -99., -99., -99.],
          [  7.,   7.,   7.,  ..., 239., 239., 239.]]),
  'chain_ids': tensor([[ 0.,  0.,  0.,  ..., -1., -1., -1.],
          [ 0.,  0.,  0.,  ...,  1.,  1.,  1.]])},
 'id': ['7n2c__M1_P0A7S9--7n2c__S1_P0A7U3',
  '6rjf__B34_O91734--6rjf__C34_O91734'],
 'sample_id': ['af__P0A7S9--af__P0A7U3',
  '6rjf__B34_O91734-R--6rjf__C34_O91734-L'],
 'target_id': ['7n2c__M1_P0A7S9-R--7n2c__S1_P0A7U3-L',
  '6rjf__B34_O91734-R--6rjf__C34_O91734-L']}
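
In the batch printed above, the shorter sample appears to be padded to the length of the longer one: `atom_types` ends in `-1.`, the coordinates in `-100.`, and `residue_ids` in `-99.`. Padded positions should normally be excluded from losses and geometric statistics. The sketch below shows one way to do this, assuming `-1` marks padding in `atom_types`. That value is read off the printed output, not from documented behaviour, and `atom_mask` and `masked_centroid` are hypothetical helpers. A toy batch stands in for the real one.

```python
import torch


def atom_mask(atom_types: torch.Tensor, pad_value: float = -1.0) -> torch.Tensor:
    """Return a (batch, n_atoms) boolean mask that is True for real atoms."""
    # atom_types has shape (batch, n_atoms, 1); drop the trailing dimension.
    return atom_types.squeeze(-1) != pad_value


def masked_centroid(coords: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean coordinate per sample, ignoring padded atoms."""
    weights = mask.unsqueeze(-1).to(coords.dtype)   # (batch, n_atoms, 1)
    counts = weights.sum(dim=1).clamp(min=1.0)      # (batch, 1)
    return (coords * weights).sum(dim=1) / counts   # (batch, 3)


# Toy batch mimicking the layout above: the first sample has one padded atom.
atom_types = torch.tensor([[[0.0], [1.0], [-1.0]],
                           [[0.0], [1.0], [2.0]]])
coords = torch.tensor([
    [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [-100.0, -100.0, -100.0]],
    [[0.0, 0.0, 0.0], [3.0, 3.0, 3.0], [6.0, 6.0, 6.0]],
])

mask = atom_mask(atom_types)
centroids = masked_centroid(coords, mask)
```

With a real batch, the same functions could be called on `batch["feature_complex"]["atom_types"]` and `batch["feature_complex"]["atom_coordinates"]`. Averaging without the mask would pull each centroid toward the `-100` padding coordinates.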

Implementing your own PyTorch Dataset & DataLoader for pinder #

The pinder documentation already includes a tutorial on this topic, and we invite you to review it. If you have questions or feedback, please open a GitHub issue!
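
As a rough orientation before reading that tutorial, the sketch below shows the general shape of a custom `torch.utils.data.Dataset` paired with a padding collate function. It uses only standard PyTorch and toy in-memory samples. `ToyComplexDataset`, `pad_collate` and the `-100` padding value are illustrative assumptions and do not reproduce the pinder implementation.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyComplexDataset(Dataset):
    """Hypothetical stand-in: each item is a dict with an ID and (n_atoms, 3) coordinates."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


def pad_collate(items, pad_value=-100.0):
    """Pad coordinate tensors to the longest sample in the batch."""
    max_atoms = max(item["atom_coordinates"].shape[0] for item in items)
    coords = torch.full((len(items), max_atoms, 3), pad_value)
    for i, item in enumerate(items):
        n_atoms = item["atom_coordinates"].shape[0]
        coords[i, :n_atoms] = item["atom_coordinates"]
    return {"id": [item["id"] for item in items], "atom_coordinates": coords}


# Two toy samples with different atom counts.
samples = [
    {"id": "toy_a", "atom_coordinates": torch.zeros(2, 3)},
    {"id": "toy_b", "atom_coordinates": torch.ones(4, 3)},
]
loader = DataLoader(
    ToyComplexDataset(samples),
    batch_size=2,
    shuffle=False,
    collate_fn=pad_collate,
)
toy_batch = next(iter(loader))
```

The custom collate function is needed because complexes have different numbers of atoms, so the default collation cannot stack them into one tensor.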