# MLSB PINDER Challenge
1. [Rules: training](#training-rules)
2. [Rules: inference](#inference-rules)
3. [Evaluation dataset](#eval-dataset)
4. [Accessing training data](#accessing-training-data)
    1. [Minimal filepath example](#minimal-filepath-example)
    2. [Helpful pinder utilities](#pinder-utilities)
    3. [PinderLoader and PinderDataset](#pinder-loader-and-dataset)
    4. [Implementing a torch dataloader](#torch-dataloader)




The goal of this tutorial is to outline some basic rules for participating in the PINDER track of the MLSB challenge and provide simple hands-on examples for how participants can access and use the `pinder` dataset.

Specifically, we will cover: 
* Rules for model training
* Rules for valid inference submissions
* Accessing and loading data for training your model
* A description of the inputs to be provided in the evaluation set



## Rules for valid model training 


* Participants MUST use the sequences and SMILES in the provided train and validation sets from PINDER or PLINDER. In order to ensure no leakage, external data augmentation is not allowed.
* If starting structures/conformations need to be generated for the model, this may only be done from the training and validation sequences and SMILES. Note that this applies only to train and validation: no external folding methods or starting structures are allowed for the test set under any circumstances! Only the predicted structures/conformers themselves may be used in this way; the embeddings or models used to generate such predictions may not. E.g., it is not valid to “distill” a method that was not trained on PLINDER/PINDER.
* The PINDER and PLINDER datasets should be used independently; combining the sets is considered augmentation and is not allowed.
* For inference, only the inputs provided in the evaluation sets may be used: canonical sequences, structures and MSAs; no alternate templates or sequences are permitted. The inputs that will be used by assessors for each challenge track are as follows:
 * PLINDER: (SMILES, monomer protein structure, monomer FASTA, monomer MSA)
 * PINDER: (monomer protein structure 1, monomer protein structure 2, FASTA 1, FASTA 2, MSA 1, MSA 2)
* Model selection must be performed exclusively on the validation set designed for this purpose within the PINDER and PLINDER datasets.
* Methods relying on any model derivatives or embeddings trained on structures outside the PINDER/PLINDER training set are not permitted (e.g., ESM2, MSA: ✅; ESM3/ESMFold/SAProt/UniMol: ❌).
* For instructions on how to load training and validation data, check the links below:
 * [PLINDER](https://github.com/plinder-org/plinder/blob/8f7f4372a675abbbd948ea6aeaf6b870bf312a02/docs/examples/mlsb_challenge.md#mlsb-notebook-target)
 * [PINDER](https://pinder-org.github.io/pinder/pinder-mlsb.html#accessing-and-loading-data-for-training)



## Rules for valid inference submissions

The submission system uses Hugging Face Spaces. To qualify for submission, each team must:

- Provide an MLSB submission ID or a link to a preprint/paper describing their methodology. This publication does not have to specifically report training or evaluation on the P(L)INDER dataset. Previously published methods, such as DiffDock, only need to link their existing paper. Note that entry into this competition does not equate to an MLSB workshop paper submission.
- Create a copy of the provided [inference template](https://huggingface.co/spaces/MLSB/pinder_inference_template/blob/main/inference_app.py).
 - Go to the top right corner of the page and click on the drop-down menu (vertical ellipsis) next to “Community”, then select “Duplicate this space”.
- Change files in the newly created space to reflect the peculiarities of your model:
 - Edit `requirements.txt` to capture all dependencies.
 - Include an `inference_app.py` file. This contains a `predict` function that should be modified to reflect the specifics of inference using your model.
 - Include a `train.py` file to ensure that training and model selection use only the PINDER/PLINDER datasets and to clearly show any additional hyperparameters used.
 - Provide a LICENSE file that allows for reuse, derivative works, and distribution of the provided software and weights (e.g., MIT or Apache2 license).
 - Modify the Dockerfile as appropriate (including selecting the right base image).
- Submit to the leaderboard via the [designated form](https://huggingface.co/spaces/MLSB/leaderboard2024).
 - On the submission page, add a reference to the newly created space in the format `username/space` (e.g., `mlsb/alphafold3`).
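
Before submitting, it can help to sanity-check the space reference. The sketch below assumes the reference consists of exactly two non-empty segments separated by one slash, as in the `username/space` format above. The helper name `is_valid_space_ref` and the allowed character set are illustrative assumptions and are not part of the submission system.

```python
import re

# Assumed pattern: "<username>/<space>", each made of letters, digits, '.', '_' or '-'.
# The exact characters accepted by Hugging Face are not specified here; this is a guess.
_SPACE_REF = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*/[A-Za-z0-9][A-Za-z0-9._-]*")


def is_valid_space_ref(ref: str) -> bool:
    """Return True if `ref` looks like `username/space` (hypothetical check)."""
    return _SPACE_REF.fullmatch(ref.strip()) is not None


print(is_valid_space_ref("mlsb/alphafold3"))
print(is_valid_space_ref("https://huggingface.co/spaces/mlsb/alphafold3"))
```

A full URL is rejected by this check because it contains more than one slash, which matches the instruction to submit only the short `username/space` form.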


## Evaluation dataset 

Although the exact composition of the evaluation set will be shared at a future date, below we provide an overview of the dataset and what to expect.

- Two leaderboards, one for each of PINDER and PLINDER, will be created using a single evaluation set for each.
- Evaluation sets will be subsets of 150-200 structures from the current PINDER and PLINDER test splits (subsets to enable reasonable eval runtime).
- Each evaluation sample will contain a predefined input/output to ensure performance assessment is model-dependent, not input-dependent.
- The focus will be exclusively on flexible docking/co-folding, with a single canonical structure per protein, sampled from apo and predicted structures.
- Monomer input structures will be sampled from paired structures available in PINDER/PLINDER, balanced between apo and predicted structures and stratified by "flexibility" level according to specified conformational difference thresholds.
- Inputs will be: `(monomer protein structure 1, monomer protein structure 2, FASTA 1, FASTA 2)` for PINDER
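
As a rough illustration of the PINDER input tuple listed above, the following sketch groups the four inputs into a small container and reports any files that are missing. The class name `PinderEvalInput`, its field names, and the mapping of "structure 1" to receptor and "structure 2" to ligand are assumptions made for this example; the actual evaluation harness may organize its inputs differently.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PinderEvalInput:
    """Hypothetical container for one PINDER evaluation sample."""

    receptor_pdb: Path    # monomer protein structure 1 (assumed to be the receptor)
    ligand_pdb: Path      # monomer protein structure 2 (assumed to be the ligand)
    receptor_fasta: Path  # FASTA 1
    ligand_fasta: Path    # FASTA 2

    def missing_files(self) -> list[Path]:
        """Return the input files that do not exist on disk."""
        paths = [self.receptor_pdb, self.ligand_pdb, self.receptor_fasta, self.ligand_fasta]
        return [p for p in paths if not p.is_file()]


# Placeholder paths for illustration only.
sample = PinderEvalInput(
    receptor_pdb=Path("inputs/receptor.pdb"),
    ligand_pdb=Path("inputs/ligand.pdb"),
    receptor_fasta=Path("inputs/receptor.fasta"),
    ligand_fasta=Path("inputs/ligand.fasta"),
)
print(sample.missing_files())
```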





## Accessing and loading data for training 

To access the train and val splits for PINDER, please refer to the [pinder documentation](https://github.com/pinder-org/pinder/tree/main?tab=readme-ov-file#%EF%B8%8F-getting-the-dataset).

Once you have downloaded the pinder dataset, either via the `pinder` package or directly through `gsutil`, you will have all of the necessary files for training. 

For those mainly interested in torch dataloaders, refer to the [readme section](https://github.com/pinder-org/pinder#5--dataloader) and [tutorial](https://pinder-org.github.io/pinder/pinder-loader.html) on the torch dataloader provided in pinder. TLDR: 
```python
from pinder.core.loader.dataset import PinderDataset, get_torch_loader
train, val = [get_torch_loader(PinderDataset(split=split)) for split in ["train", "val"]] # do NOT use test: we will verify that your pipeline uses test in neither training nor model selection
```

For those interested in loading/filtering/sampling/augmenting data using pinder utilities, see the remaining sections below.


You are ONLY allowed to access those systems labeled with split `train` and split `val` for model training and validation, respectively. 
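
One lightweight way to guard against accidentally touching the test split is to validate split names before any data is loaded. This is a minimal sketch; the names `ALLOWED_SPLITS` and `check_splits` are hypothetical and not part of `pinder`.

```python
# Only these splits may be used for training and model selection.
ALLOWED_SPLITS = frozenset({"train", "val"})


def check_splits(requested):
    """Return the requested splits as a list, or raise if any is not train/val."""
    requested = list(requested)
    disallowed = sorted(set(requested) - ALLOWED_SPLITS)
    if disallowed:
        raise ValueError(
            f"Splits not allowed for training or model selection: {disallowed}"
        )
    return requested


splits = check_splits(["train", "val"])
print(splits)
```

Calling `check_splits(["train", "test"])` raises a `ValueError`, so a misconfigured pipeline fails early rather than silently reading test systems.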

See below for two different options for accessing the index and split labels.


**If you have already installed pinder (preferred method):**


In [1]:
import torch
from pinder.core import get_index

index = get_index()
train = index.query('split == "train"').reset_index(drop=True)
val = index.query('split == "val"').reset_index(drop=True)
train.shape, val.shape

((1560682, 34), (1958, 34))

**Without installing pinder (need to install `gcsfs` and `pandas` or install the `gsutil` utility to get the index file)**

In [2]:
import gcsfs
import pandas as pd

index_uri = "gs://pinder/2024-02/index.parquet"
fs = gcsfs.GCSFileSystem(token="anon")
with fs.open(index_uri, "rb") as f:
    index = pd.read_parquet(f)

train = index.query('split == "train"').reset_index(drop=True)
val = index.query('split == "val"').reset_index(drop=True)
train.shape, val.shape

((1560682, 34), (1958, 34))

### Minimal example with filepaths 

For those who simply want access to PDB files and/or sequences, below we provide a minimal example of how to go from a row in the pinder index to a tuple of filepaths and sequences akin to the expected inputs for inference. 

Later sections provide alternative means of loading data with common pinder utilities, including the `PinderLoader` and the `PinderDataset` torch dataset.

All `pinder` data should be stored in `PINDER_BASE_DIR`. Unless you customized the download directory, this would default to:
`~/.local/share/pinder/2024-02/`

PDB files are stored in a subdirectory, named `pdbs`. 

To go from a row in the index to a collection of filepaths, you can either use the pydantic model for the pinder index schema (`IndexEntry`) or construct the filepaths yourself. 

We will first illustrate how to do this via `IndexEntry`

In [3]:
from pinder.core import get_pinder_location
from pinder.core.index.utils import IndexEntry

pinder_dir = get_pinder_location()
row = train.sample(1).squeeze()
entry = IndexEntry(**row.to_dict())
# IndexEntry has a convenience property `pdb_paths` which returns a dict of structure_type: relative_path | list[relative_path]
relative_paths = entry.pdb_paths
absolute_paths = {}
for structure_type, rel_path in relative_paths.items():
    # non-canonical apo monomers are stored as a list of relative paths
    if isinstance(rel_path, list):
        absolute_paths[structure_type] = []
        for alt_monomer in rel_path:
            absolute_paths[structure_type].append(pinder_dir / alt_monomer)
    # Not all systems have every type of monomer. When they are not available, the relative path is ""
    elif rel_path == "":
        absolute_paths[structure_type] = rel_path
    # Convert relative path to absolute path
    else:
        absolute_paths[structure_type] = pinder_dir / rel_path

relative_paths, absolute_paths

({'native': 'pdbs/8ir7__S1_P51765--8ir7__AA1_Q7DGD4.pdb',
 'holo_R': 'pdbs/8ir7__S1_P51765-R.pdb',
 'holo_L': 'pdbs/8ir7__AA1_Q7DGD4-L.pdb',
 'predicted_R': 'pdbs/af__P51765.pdb',
 'predicted_L': 'pdbs/af__Q7DGD4.pdb',
 'apo_R': '',
 'apo_L': '',
 'apo_R_alt': [],
 'apo_L_alt': []},
 {'native': PosixPath('/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/8ir7__S1_P51765--8ir7__AA1_Q7DGD4.pdb'),
 'holo_R': PosixPath('/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/8ir7__S1_P51765-R.pdb'),
 'holo_L': PosixPath('/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/8ir7__AA1_Q7DGD4-L.pdb'),
 'predicted_R': PosixPath('/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/af__P51765.pdb'),
 'predicted_L': PosixPath('/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/af__Q7DGD4.pdb'),
 'apo_R': '',
 'apo_L': '',
 'apo_R_alt': [],
 'apo_L_alt': []})

In the above example, the `IndexEntry.pdb_paths` property was used to conveniently extract filepaths from a row in the index. This is done by using the following columns from the index:
* `id`
* `holo_R_pdb`
* `holo_L_pdb`
* `predicted_R_pdb`
* `predicted_L_pdb`
* `apo_R_pdbs`
* `apo_L_pdbs`

It is possible to do this yourself without `IndexEntry`:

In [4]:
row = train.sample(1).squeeze()
absolute_paths = {
    "native": pinder_dir / "pdbs" / f"{row.id}.pdb",
}
pdb_cols = [
    "holo_R_pdb", "holo_L_pdb",  # holo monomers for receptor and ligand, respectively
    "predicted_R_pdb", "predicted_L_pdb",  # predicted monomers
    "apo_R_pdb", "apo_L_pdb",  # canonical apo monomers
    "apo_R_pdbs", "apo_L_pdbs",  # canonical + non-canonical (alternative) apo monomers, separated by a semi-colon
]
for pdb_column in pdb_cols:
    if pdb_column.endswith("pdbs"):
        # Multiple apo monomers: split on ";" and resolve each entry
        absolute_paths[pdb_column] = [
            pinder_dir / "pdbs" / alt_apo if alt_apo != "" else ""
            for alt_apo in row[pdb_column].split(";")
        ]
    else:
        pdb_name = row[pdb_column]
        absolute_paths[pdb_column] = pinder_dir / "pdbs" / pdb_name if pdb_name != "" else ""

absolute_paths


{'native': PosixPath('/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/6rao__H2_Q6HAD2--6rao__F1_Q6HAD0.pdb'),
 'holo_R_pdb': PosixPath('/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/6rao__H2_Q6HAD2-R.pdb'),
 'holo_L_pdb': PosixPath('/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/6rao__F1_Q6HAD0-L.pdb'),
 'predicted_R_pdb': PosixPath('/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/af__Q6HAD2.pdb'),
 'predicted_L_pdb': PosixPath('/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/af__Q6HAD0.pdb'),
 'apo_R_pdb': '',
 'apo_L_pdb': '',
 'apo_R_pdbs': [''],
 'apo_L_pdbs': ['']}
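
The printed paths suggest a naming pattern: the system ID joins the receptor and ligand names with `--`, and the holo monomer files append `-R` and `-L` to those names. The sketch below derives the holo and native filenames from an ID under that assumption. The pattern is inferred from the example outputs only, and the helpers `parse_monomer` and `holo_filenames` are hypothetical, so prefer the index columns whenever they are available.

```python
from typing import NamedTuple


class MonomerName(NamedTuple):
    pdb_id: str
    chain: str    # chain label as it appears in the ID, e.g. "H2"
    uniprot: str  # UniProt accession, or "UNDEFINED"


def parse_monomer(name: str) -> MonomerName:
    """Split a name such as '6rao__H2_Q6HAD2' into its three parts."""
    pdb_id, rest = name.split("__", 1)
    chain, uniprot = rest.split("_", 1)
    return MonomerName(pdb_id, chain, uniprot)


def holo_filenames(system_id: str) -> dict[str, str]:
    """Derive native and holo PDB filenames from a system ID (assumed pattern)."""
    receptor, ligand = system_id.split("--")
    return {
        "native": f"{system_id}.pdb",
        "holo_R_pdb": f"{receptor}-R.pdb",
        "holo_L_pdb": f"{ligand}-L.pdb",
    }


names = holo_filenames("6rao__H2_Q6HAD2--6rao__F1_Q6HAD0")
print(names)
print(parse_monomer("6rao__H2_Q6HAD2"))
```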

The most minimal interface for loading these filepaths and extracting e.g. coordinates and sequence is the `pinder.core.loader.structure` module:

In [6]:
from pinder.core.loader.structure import Structure
# Note: since this notebook is executed in CI, I will also create a `PinderSystem` object which will auto-download any missing PDB file
# You do NOT need to do this if you already downloaded the dataset
if not absolute_paths["holo_R_pdb"].is_file():
    from pinder.core import PinderSystem
    _ = PinderSystem(absolute_paths["native"].stem)
 
receptor = Structure(absolute_paths["holo_R_pdb"])
receptor


2024-09-05 16:50:41,522 | pinder.core.utils.cloud:375 | INFO : Gsutil process_many=download_to_filename, threads=7, items=7
2024-09-05 16:50:41,943 | pinder.core.utils.cloud.process_many:23 | INFO : runtime succeeded: 0.42s


Structure(
 filepath=/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/6rao__H2_Q6HAD2-R.pdb,
 uniprot_map=None,
 pinder_id='6rao__H2_Q6HAD2-R',
 atom_array= with shape (1737,),
 pdb_engine='fastpdb',
)

In [7]:
receptor.coords[0:10]

array([[ 30.417, -28.254, 74.132],
 [ 31.12 , -28.716, 75.321],
 [ 32.127, -29.81 , 74.98 ],
 [ 32.824, -29.725, 73.966],
 [ 30.125, -29.23 , 76.361],
 [ 29.491, -30.413, 75.911],
 [ 32.233, -30.777, 75.9 ],
 [ 32.949, -32.052, 75.825],
 [ 34.468, -31.925, 75.898],
 [ 35.158, -32.93 , 76.097]], dtype=float32)

In [8]:
receptor.sequence

'SLLERGLSKLTLNAWKDREGKIPAGSMSAMYNPETIQLDYQTRFDTEDTINTASQSNRYVISEPVGLNLTLLFDSQMPGNTTPIETQLAMLKSLCAVDAATGSPYFLRITWGKMRWENKGWFAGRARDLSVTYTLFDRDATPLRATVQLSLVADESFVIQQSLKTQSAPDRALVSVPDLASLPLLALSAGGVLASSVDYLSLAWDNDLDNLDDFQTGDFLRATK'

In [9]:
receptor.fasta

'>6rao__H2_Q6HAD2-R\nSLLERGLSKLTLNAWKDREGKIPAGSMSAMYNPETIQLDYQTRFDTEDTINTASQSNRYVISEPVGLNLTLLFDSQMPGNTTPIETQLAMLKSLCAVDAATGSPYFLRITWGKMRWENKGWFAGRARDLSVTYTLFDRDATPLRATVQLSLVADESFVIQQSLKTQSAPDRALVSVPDLASLPLLALSAGGVLASSVDYLSLAWDNDLDNLDDFQTGDFLRATK'
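
If you store these FASTA strings and need to read them back without extra dependencies, a single-record parser is enough. The sketch below assumes the `>header` plus sequence layout shown in the output above; `parse_fasta_record` is a hypothetical helper, and the sequence in the example is truncated for brevity.

```python
def parse_fasta_record(record: str) -> tuple[str, str]:
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in record.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA record starting with '>'")
    header = lines[0][1:]
    sequence = "".join(lines[1:])  # tolerate sequences wrapped across lines
    return header, sequence


# Truncated version of the record printed above.
header, sequence = parse_fasta_record(">6rao__H2_Q6HAD2-R\nSLLERGLSKLTLNAWKDREGK")
print(header, len(sequence))
```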

In [10]:
# You can also write the Structure object to a PDB file if desired (e.g. after making changes)
from pathlib import Path
from tempfile import TemporaryDirectory
with TemporaryDirectory() as tmp_dir:
    temp_dir = Path(tmp_dir)
    receptor.to_pdb(temp_dir / "modified_receptor.pdb")




### Using pinder utilities to construct a dataloader 


Before proceeding with this section, you may find it helpful to review the existing tutorials available in `pinder`. 

Specifically, the tutorials covering:
* [pinder index](https://pinder-org.github.io/pinder/pinder-index.html)
* [pinder system](https://pinder-org.github.io/pinder/pinder-system.html)
* [pinder loader](https://pinder-org.github.io/pinder/pinder-loader.html)
* [cropped superposition](https://pinder-org.github.io/pinder/superposition.html)


**We will start by looking at the most basic way to load items from the training and validation set: via `PinderSystem` objects**

In [11]:
from pinder.core import PinderSystem

def get_system(system_id: str) -> PinderSystem:
    return PinderSystem(system_id)


system = get_system(train.id.iloc[0])
system
 

PinderSystem(
entry = IndexEntry(
 (
 'split',
 'train',
 ),
 (
 'id',
 '8phr__X4_UNDEFINED--8phr__W4_UNDEFINED',
 ),
 (
 'pdb_id',
 '8phr',
 ),
 (
 'cluster_id',
 'cluster_24559_24559',
 ),
 (
 'cluster_id_R',
 'cluster_24559',
 ),
 (
 'cluster_id_L',
 'cluster_24559',
 ),
 (
 'pinder_s',
 False,
 ),
 (
 'pinder_xl',
 False,
 ),
 (
 'pinder_af2',
 False,
 ),
 (
 'uniprot_R',
 'UNDEFINED',
 ),
 (
 'uniprot_L',
 'UNDEFINED',
 ),
 (
 'holo_R_pdb',
 '8phr__X4_UNDEFINED-R.pdb',
 ),
 (
 'holo_L_pdb',
 '8phr__W4_UNDEFINED-L.pdb',
 ),
 (
 'predicted_R_pdb',
 '',
 ),
 (
 'predicted_L_pdb',
 '',
 ),
 (
 'apo_R_pdb',
 '',
 ),
 (
 'apo_L_pdb',
 '',
 ),
 (
 'apo_R_pdbs',
 '',
 ),
 (
 'apo_L_pdbs',
 '',
 ),
 (
 'holo_R',
 True,
 ),
 (
 'holo_L',
 True,
 ),
 (
 'predicted_R',
 False,
 ),
 (
 'predicted_L',
 False,
 ),
 (
 'apo_R',
 False,
 ),
 (
 'apo_L',
 False,
 ),
 (
 'apo_R_quality',
 '',
 ),
 (
 'apo_L_quality',
 '',
 ),
 (
 'chain1_neff',
 10.78125,
 ),
 (
 'chain2_neff',
 11.1171875,
 ),
 (
 

You will notice that the printed `PinderSystem` object has the following properties:
* `native` - the ground-truth dimer complex
* `holo_receptor` - the receptor chain (monomer) from the ground-truth complex
* `holo_ligand` - the ligand chain (monomer) from the ground-truth complex
* `apo_receptor` - the canonical _apo_ chain (monomer) paired to the receptor chain
* `apo_ligand` - the canonical _apo_ chain (monomer) paired to the ligand chain
* `pred_receptor` - the AlphaFold2 predicted monomer paired to the receptor chain 
* `pred_ligand` - the AlphaFold2 predicted monomer paired to the ligand chain


These properties are pointers to `Structure` objects. The `Structure` object provides the most direct mode of access to structures and associated properties. 

**Note: not all systems have an apo and/or predicted structure for all chains of the ground-truth dimer complex!** 

As was the case in the example above, when the alternative monomers are not available, the property will have a value of `None`. 

You can determine which systems have which alternative monomer pairings _a priori_ by looking at the boolean columns in the index `apo_R` and `apo_L` for the apo receptor and ligand, and `predicted_R` and `predicted_L` for the predicted receptor and ligand, respectively. 


For instance, we can load a different system that _does_ have an apo receptor and ligand as follows:

In [12]:
apo_system = get_system(train.query('apo_R and apo_L').id.iloc[0])
receptor = apo_system.apo_receptor
ligand = apo_system.apo_ligand 

receptor, ligand


(Structure(
 filepath=/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/3wdb__A1_P9WPC9.pdb,
 uniprot_map=/Users/danielkovtun/.local/share/pinder/2024-02/mappings/3wdb__A1_P9WPC9.parquet,
 pinder_id='3wdb__A1_P9WPC9',
 atom_array= with shape (1144,),
 pdb_engine='fastpdb',
 ),
 Structure(
 filepath=/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/6ucr__A1_P9WPC9.pdb,
 uniprot_map=/Users/danielkovtun/.local/share/pinder/2024-02/mappings/6ucr__A1_P9WPC9.parquet,
 pinder_id='6ucr__A1_P9WPC9',
 atom_array= with shape (1193,),
 pdb_engine='fastpdb',
 ))

We can now access e.g. the sequence and the coordinates of the structures via the `Structure` objects:

In [13]:
receptor.sequence


'PLGSMFERFTDRARRVVVLAQEEARMLNHNYIGTEHILLGLIHEGEGVAAKSLESLGISLEGVRSQVEEIIGQGQQAPSGHIPFTPRAKKVLELSLREALQLGHNYIGTEHILLGLIREGEGVAAQVLVKLGAELTRVRQQVIQLLSGY'

In [14]:
receptor.coords[0:5]

array([[-12.982, -17.271, -11.271],
 [-14.36 , -17.069, -11.749],
 [-15.261, -16.373, -10.703],
 [-15.461, -15.161, -10.801],
 [-14.842, -18.494, -12.077]], dtype=float32)

We can always access the underlying biotite [AtomArray](https://www.biotite-python.org/latest/apidoc/biotite.structure.AtomArray.html) via the `Structure.atom_array` property:


In [15]:
receptor.atom_array[0:5]

array([
	Atom(np.array([-12.982, -17.271, -11.271], dtype=float32), chain_id="R", res_id=2, ins_code="", res_name="PRO", hetero=False, atom_name="N", element="N", b_factor=0.0),
	Atom(np.array([-14.36 , -17.069, -11.749], dtype=float32), chain_id="R", res_id=2, ins_code="", res_name="PRO", hetero=False, atom_name="CA", element="C", b_factor=0.0),
	Atom(np.array([-15.261, -16.373, -10.703], dtype=float32), chain_id="R", res_id=2, ins_code="", res_name="PRO", hetero=False, atom_name="C", element="C", b_factor=0.0),
	Atom(np.array([-15.461, -15.161, -10.801], dtype=float32), chain_id="R", res_id=2, ins_code="", res_name="PRO", hetero=False, atom_name="O", element="O", b_factor=0.0),
	Atom(np.array([-14.842, -18.494, -12.077], dtype=float32), chain_id="R", res_id=2, ins_code="", res_name="PRO", hetero=False, atom_name="CB", element="C", b_factor=0.0)
])

For a more comprehensive overview of all of the `Structure` class properties, refer to the [pinder system](https://pinder-org.github.io/pinder/pinder-system.html) tutorial.


### Using PinderLoader and PinderDataset to fetch, filter, transform systems 

While the `PinderSystem` object provides self-contained access to the structures associated with a dimer system, the `PinderLoader` provides a base abstraction for iterating over systems, applying optional filters and/or transforms, and returning training and validation data represented as `PinderSystem` and `Structure` objects.

`PinderDataset` is an example implementation of a torch `Dataset` that can be consumed in a torch `DataLoader`. It uses the `PinderLoader` under the hood and additionally implements default `transform` and `target_transform` functions that convert the `Structure` objects returned by `PinderLoader` into dictionaries of structural properties encoded as tensors. The return value of `PinderDataset.__getitem__` is an example of a dataset sample that is suitable for collating into `DataLoader` batches via the default `collate_fn` defined in `pinder.core.loader.dataset.collate_batch`.

This is covered in much greater detail in the [pinder loader](https://pinder-org.github.io/pinder/pinder-loader.html) tutorial, but we will quickly showcase how both can be used to load data in an ML context. 



In [16]:
from pinder.core import PinderLoader
from pinder.core.loader import filters

base_filters = [
    filters.FilterDetachedHolo(radius=12, max_components=2),
    filters.FilterByResidueCount(min_residue_count=10, max_residue_count=2000),
]
sub_filters = [
    filters.FilterSubByAtomTypes(min_atom_types=4),
    filters.FilterByHoloSeqIdentity(min_sequence_identity=0.8),
]
loader = PinderLoader(
    split="val",
    base_filters=base_filters,
    sub_filters=sub_filters,
)

loader

PinderLoader(split=val, monomers=holo, systems=1958)

You can now access individual items in the loader or iterate over it. 

The current default return value of `PinderLoader.__getitem__` is a tuple consisting of `(system, feature_complex, target_complex)`:
1. `system`: A `PinderSystem` instance corresponding to the item index
2. `feature_complex`: A `Structure` object containing a sampled receptor and ligand monomer superimposed to the ground-truth complex.
3. `target_complex`: A `Structure` object containing the ground-truth holo complex.


Note: the monomers in the `feature_complex` can consist of holo/apo/pred or a mix of them. You can control which monomer is selected via the `monomer_priority` argument.

Valid values are:
* holo (default)
* apo
* pred
* random (select a monomer at random from the set of monomer types available in both the receptor and ligand)
* random_mixed (select a monomer at random from the set of monomer types available in the receptor and ligand, separately)
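
To make the difference between `random` and `random_mixed` concrete, the sketch below mimics the selection rules described above using the boolean index columns (`holo_R`, `apo_L`, `predicted_R`, and so on). It only illustrates the stated semantics and is not the `PinderLoader` implementation; `available_types` and `pick_monomers` are hypothetical helpers, and raising an error when a fixed priority is unavailable is an assumption of this example.

```python
import random

# Maps monomer_priority names to the prefix of the boolean index columns.
COLUMN_PREFIX = {"holo": "holo", "apo": "apo", "pred": "predicted"}


def available_types(flags, side):
    """Monomer types flagged as available for one side ("R" or "L")."""
    return [
        name
        for name, prefix in COLUMN_PREFIX.items()
        if flags.get(f"{prefix}_{side}", False)
    ]


def pick_monomers(flags, priority, rng):
    """Return (receptor_type, ligand_type) for the given priority."""
    receptor_types = available_types(flags, "R")
    ligand_types = available_types(flags, "L")
    if priority == "random":
        # One type drawn from those available for BOTH receptor and ligand
        shared = [t for t in receptor_types if t in ligand_types]
        choice = rng.choice(shared)
        return choice, choice
    if priority == "random_mixed":
        # Receptor and ligand types drawn independently
        return rng.choice(receptor_types), rng.choice(ligand_types)
    if priority in receptor_types and priority in ligand_types:
        return priority, priority
    raise ValueError(f"{priority!r} is not available for both monomers")


flags = {
    "holo_R": True, "holo_L": True,
    "apo_R": True, "apo_L": False,
    "predicted_R": True, "predicted_L": True,
}
rng = random.Random(0)
same = pick_monomers(flags, "random", rng)
mixed = pick_monomers(flags, "random_mixed", rng)
print(same, mixed)
```

With these flags, `random` can only return a holo or pred pair because the ligand has no apo structure, while `random_mixed` may pair an apo receptor with a holo or pred ligand.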


If you want to leverage the `PinderLoader` but mainly need the filepaths and/or sequences, you can get them from the returned `Structure` objects:

In [17]:
system, sample, target = loader[0]

receptor = target.filter("chain_id", ["R"])
ligand = target.filter("chain_id", ["L"])
# Can do things like e.g.
with open(f"./receptor_{receptor.pinder_id}.fasta", "w") as f:
    f.write(receptor.fasta)


In [19]:
from tqdm import tqdm

loaded_systems = set()
limit = 10 # for faster exec in CI
for system, feature_complex, target_complex in tqdm(loader):
    loaded_systems.add(system.entry.id)
    if len(loaded_systems) >= limit:
        break
 
 

 0%|▉ | 9/1958 [00:02<07:35, 4.28it/s]


In [20]:
len(loaded_systems)

10

In [21]:
# PinderDataset - torch dataset

from pinder.core.loader import filters, transforms
from pinder.core.loader.dataset import PinderDataset

base_filters = [
    filters.FilterDetachedHolo(radius=12, max_components=2),
    filters.FilterByResidueCount(min_residue_count=10, max_residue_count=2000),
]
sub_filters = [
    filters.FilterSubByAtomTypes(min_atom_types=4),
    filters.FilterByHoloSeqIdentity(min_sequence_identity=0.8),
]

# We can include Structure-level transforms (and filters) which will operate on the target and feature complexes returned by PinderLoader
structure_transforms = [
    transforms.SelectAtomTypes(atom_types=["CA", "N", "C", "O"])
]
train_dataset = PinderDataset(
    split="train",
    # We can leverage holo, apo, pred, random and random_mixed monomer sampling strategies
    monomer_priority="random_mixed",
    base_filters=base_filters,
    sub_filters=sub_filters,
    structure_transforms=structure_transforms,
)
train_dataset




You can now access individual items in the PinderDataset or iterate over it. 

The current default return value of `PinderDataset.__getitem__` is a dict consisting of the following key, value pairs:
* `target_complex`: The ground-truth holo dimer, represented with a set of default properties encoded as `Tensor`s
* `feature_complex`: The sampled dimer complex, representing "features", also represented with a set of default properties encoded as `Tensor`s
* `id`: The pinder ID for the selected system
* `target_id`: The IDs of the receptor and ligand holo monomers, concatenated into a single ID string
* `sample_id`: The IDs of the sampled receptor and ligand monomers, concatenated into a single ID string. This can be useful for debugging or for tracking which specific monomers are selected when targeting alternative monomers (more on this shortly)


Each of the `target_complex` and `feature_complex` values is a dictionary of structural properties, encoded by the `pinder.core.loader.geodata.structure2tensor` function by default:
* `atom_coordinates`
* `atom_types`
* `residue_coordinates`
* `residue_types`
* `residue_ids`

You can choose to use a different representation by overriding the default values of `transform` and `target_transform`.

In [22]:
data_item = train_dataset[0]
data_item


{'target_complex': {'atom_types': tensor([[0., 0., 0., ..., 0., 0., 0.],
 [1., 0., 0., ..., 0., 0., 0.],
 [1., 0., 0., ..., 0., 0., 0.],
 ...,
 [1., 0., 0., ..., 0., 0., 0.],
 [1., 0., 0., ..., 0., 0., 0.],
 [0., 0., 1., ..., 0., 0., 0.]]),
 'residue_types': tensor([[16.],
 [16.],
 [16.],
 ...,
 [ 0.],
 [ 0.],
 [ 0.]]),
 'atom_coordinates': tensor([[131.7500, 429.3090, 163.5360],
 [132.6810, 428.2520, 163.1550],
 [133.5150, 428.6750, 161.9500],
 ...,
 [177.7620, 463.8650, 166.9020],
 [177.4130, 465.0800, 167.7550],
 [176.8000, 464.9490, 168.8150]]),
 'residue_coordinates': tensor([[131.7500, 429.3090, 163.5360],
 [132.6810, 428.2520, 163.1550],
 [133.5150, 428.6750, 161.9500],
 ...,
 [177.7620, 463.8650, 166.9020],
 [177.4130, 465.0800, 167.7550],
 [176.8000, 464.9490, 168.8150]]),
 'residue_ids': tensor([ 4., 4., 4., ..., 182., 182., 182.])},
 'feature_complex': {'atom_types': tensor([[0., 0., 0., ..., 0., 0., 0.],
 [1., 0., 0., ..., 0., 0., 0.],
 [1., 0., 0., ..., 0., 0., 0.],
 ...,


In [8]:
from pinder.core.loader.dataset import collate_batch, get_torch_loader

# Now wrap the dataset in a torch DataLoader
batch_size = 2
train_dataloader = get_torch_loader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    collate_fn=collate_batch,
    num_workers=0,
)

# Get a batch from the dataloader
batch = next(iter(train_dataloader))

# expected batch dict keys
assert set(batch.keys()) == {
    "target_complex",
    "feature_complex",
    "id",
    "sample_id",
    "target_id",
}
feature_coords = batch["feature_complex"]["atom_coordinates"]
# Ensure batch size propagates to tensor dims
assert feature_coords.shape[0] == batch_size
# Ensure coordinates have dim 3
assert feature_coords.shape[2] == 3


2024-09-05 12:58:37,879 | pinder.core.utils.cloud:375 | INFO : Gsutil process_many=download_to_filename, threads=7, items=7
2024-09-05 12:58:38,274 | pinder.core.utils.cloud.process_many:23 | INFO : runtime succeeded: 0.40s
2024-09-05 12:58:39,038 | pinder.core.utils.cloud:375 | INFO : Gsutil process_many=download_to_filename, threads=5, items=5
2024-09-05 12:58:39,234 | pinder.core.utils.cloud.process_many:23 | INFO : runtime succeeded: 0.20s


In [9]:
batch

{'target_complex': {'atom_types': tensor([[[ 0., 0., 0., ..., 0., 0., 0.],
 [ 1., 0., 0., ..., 0., 0., 0.],
 [ 1., 0., 0., ..., 0., 0., 0.],
 ...,
 [ 1., 0., 0., ..., 0., 0., 0.],
 [ 1., 0., 0., ..., 0., 0., 0.],
 [ 0., 0., 1., ..., 0., 0., 0.]],
 
 [[ 0., 0., 0., ..., 0., 0., 0.],
 [ 1., 0., 0., ..., 0., 0., 0.],
 [ 1., 0., 0., ..., 0., 0., 0.],
 ...,
 [-1., -1., -1., ..., -1., -1., -1.],
 [-1., -1., -1., ..., -1., -1., -1.],
 [-1., -1., -1., ..., -1., -1., -1.]]]),
 'residue_types': tensor([[[ 3.],
 [ 3.],
 [ 3.],
 ...,
 [13.],
 [13.],
 [13.]],
 
 [[14.],
 [14.],
 [14.],
 ...,
 [-1.],
 [-1.],
 [-1.]]]),
 'atom_coordinates': tensor([[[ 8.1120, -9.6510, 29.2570],
 [ 8.4540, -9.0780, 30.5620],
 [ 9.9250, -8.6600, 30.6590],
 ...,
 [ 15.9010, -12.2620, 25.1370],
 [ 16.3010, -11.3060, 23.9960],
 [ 17.2380, -11.5950, 23.2480]],
 
 [[ 157.1080, 94.6520, 177.6160],
 [ 158.1820, 94.2070, 176.7270],
 [ 159.5680, 94.6010, 177.2280],
 ...,
 [-100.0000, -100.0000, -100.0000],
 [-100.0000, -100.000
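
In the batch printed above, the shorter sample appears to be padded with `-1` for the type tensors and `-100` for the coordinates. The sketch below illustrates this kind of padding and how a validity mask can be recovered from it, using plain Python lists instead of tensors. It is not the `collate_batch` implementation, and `pad_coords` and `valid_mask` are hypothetical helpers.

```python
PAD_COORD = -100.0  # padding value seen for coordinates in the printed batch


def pad_coords(samples, pad_value=PAD_COORD):
    """Pad per-sample lists of [x, y, z] rows to the length of the longest sample."""
    max_len = max(len(sample) for sample in samples)
    return [
        [list(row) for row in sample]
        + [[pad_value] * 3 for _ in range(max_len - len(sample))]
        for sample in samples
    ]


def valid_mask(padded, pad_value=PAD_COORD):
    """True for real atoms, False for rows that consist only of padding."""
    return [
        [any(value != pad_value for value in row) for row in sample]
        for sample in padded
    ]


samples = [
    [[8.112, -9.651, 29.257], [8.454, -9.078, 30.562], [9.925, -8.66, 30.659]],
    [[157.108, 94.652, 177.616]],
]
padded_batch = pad_coords(samples)
mask = valid_mask(padded_batch)
print(mask)
```

A mask of this kind is typically applied before computing losses or metrics so that padded rows do not contribute.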

### Implementing your own PyTorch Dataset & DataLoader for pinder 

We invite you to review the [existing tutorial](https://pinder-org.github.io/pinder/pinder-loader.html#implementing-your-own-pytorch-dataset-dataloader-for-pinder) on this topic in the pinder documentation. Please don't hesitate to ask questions or otherwise engage via GitHub issues!


