{ "cells": [ { "cell_type": "markdown", "id": "0ed6ea88-d58f-4ab4-b0c5-5c0df10741c4", "metadata": {}, "source": [ "# MLSB PINDER Challenge\n", "1. [Rules: training](#training-rules)\n", "2. [Rules: inference](#inference-rules)\n", "3. [Evaluation dataset](#eval-dataset)\n", "4. [Accessing training data](#accessing-training-data)\n", " 1. [Minimal filepath example](#minimal-filepath-example)\n", " 1. [Helpful pinder utilities](#pinder-utilities)\n", " 2. [PinderLoader and PinderDataset](#pinder-loader-and-dataset)\n", " 3. [Implementing a torch dataloader](#torch-dataloader)\n", "\n" ] }, { "cell_type": "markdown", "id": "937932d5-6af9-48d0-ae1e-2702e5e1229f", "metadata": {}, "source": [ "\n", "The goal of this tutorial is to outline some basic rules for participating in the PINDER track of the MLSB challenge and provide simple hands-on examples for how participants can access and use the `pinder` dataset.\n", "\n", "Specifically, we will cover: \n", "* Rules for model training\n", "* Rules for valid inference submissions\n", "* Accessing and loading data for training your model\n", "* A description of the inputs to be provided in the evaluation set\n", "\n" ] }, { "cell_type": "markdown", "id": "67cfa07c-4710-4e5f-ab05-2c9f4e5ffd52", "metadata": {}, "source": [ "## Rules for valid model training \n", "\n", "\n", "* Participants MUST use the sequences and SMILES in the provided train and validation sets from PINDER or PLINDER. In order to ensure no leakage, external data augmentation is not allowed.\n", "* If starting structures/conformations need to be generated for the model, then this can only be done from the training and validation sequences and SMILES. Note that this is only the case for train & validation - no external folding methods or starting structures are allowed for the test set under any circumstance!. Only the predicted structures/conformers themselves may be used in this way, the embeddings or models used to generate such predictions may not. E.g. it is not valid to “distill” a method that was not trained on PLINDER/PINDER\n", "* The PINDER and PLINDER datasets should be used independently; combining the sets is considered augmentation and is not allowed.\n", "* For inference, only the inputs provided in the evaluation sets may be used: canonical sequences, structures and MSAs; no alternate templates or sequences are permitted. 
The inputs that will be used by assessors for each challenge track are as follows:\n", "    * PLINDER: (SMILES, monomer protein structure, monomer FASTA, monomer MSA)\n", "    * PINDER: (monomer protein structure 1, monomer protein structure 2, FASTA 1, FASTA 2, MSA 1, MSA 2)\n", "* Model selection must be performed exclusively on the validation set designed for this purpose within the PINDER and PLINDER datasets.\n", "* Methods relying on any model derivatives or embeddings trained on structures outside the PINDER/PLINDER training set are not permitted (e.g., ESM2, MSA: ✅; ESM3/ESMFold/SAProt/UniMol: ❌).\n", "* For instructions on how to load training and validation data, check the links below:\n", "    * [PLINDER](https://github.com/plinder-org/plinder/blob/8f7f4372a675abbbd948ea6aeaf6b870bf312a02/docs/examples/mlsb_challenge.md#mlsb-notebook-target)\n", "    * [PINDER](https://pinder-org.github.io/pinder/pinder-mlsb.html#accessing-and-loading-data-for-training)\n", "\n" ] }, { "cell_type": "markdown", "id": "de0c4e7c-a166-400d-9d5d-f4338854fdb7", "metadata": {}, "source": [ "## Rules for valid inference submissions\n", "\n", "The submission system will use Hugging Face Spaces. To qualify for submission, each team must:\n", "\n", "- Provide an MLSB submission ID or a link to a preprint/paper describing their methodology. This publication does not have to specifically report training or evaluation on the P(L)INDER dataset. Previously published methods, such as DiffDock, only need to link their existing paper. Note that entry into this competition does not equate to an MLSB workshop paper submission.\n", "- Create a copy of the provided [inference template](https://huggingface.co/spaces/MLSB/pinder_inference_template/blob/main/inference_app.py).\n", "    - Go to the top right corner of the page and click on the drop-down menu (vertical ellipsis) right next to “Community”, then select “Duplicate this space”.\n", "- Modify the files in the newly created space to reflect the specifics of your model:\n", "    - Edit `requirements.txt` to capture all dependencies.\n", "    - Include an `inference_app.py` file.
This contains a `predict` function that should be modified to reflect the specifics of inference with your model.\n", "    - Include a `train.py` file to ensure that training and model selection use only the PINDER/PLINDER datasets and to clearly show any additional hyperparameters used.\n", "    - Provide a LICENSE file that allows for reuse, derivative works, and distribution of the provided software and weights (e.g., an MIT or Apache2 license).\n", "    - Modify the Dockerfile as appropriate (including selecting the right base image).\n", "- Submit to the leaderboard via the [designated form](https://huggingface.co/spaces/MLSB/leaderboard2024).\n", "    - On the submission page, add a reference to the newly created space in the format username/space (e.g., mlsb/alphafold3).\n" ] }, { "cell_type": "markdown", "id": "bcc95d30-311b-4aa7-b286-71cf5aa1cc40", "metadata": {}, "source": [ "## Evaluation dataset \n", "\n", "Although the exact composition of the evaluation set will be shared at a future date, below we provide an overview of the dataset and what to expect:\n", "\n", "- Two leaderboards, one for each of PINDER and PLINDER, will be created using a single evaluation set for each.\n", "- Evaluation sets will be subsets of 150-200 structures from the current PINDER and PLINDER test splits (subsets to enable reasonable eval runtime).\n", "- Each evaluation sample will contain a predefined input/output to ensure performance assessment is model-dependent, not input-dependent.\n", "- The focus will be exclusively on flexible docking/co-folding, with a single canonical structure per protein, sampled from apo and predicted structures.\n", "- Monomer input structures will be sampled from paired structures available in PINDER/PLINDER, balanced between apo and predicted structures and stratified by \"flexibility\" level according to specified conformational difference thresholds.\n", "- Inputs will be: `(monomer protein structure 1, monomer protein structure 2, FASTA 1, FASTA 2)` for PINDER\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "7a4ae677-9da7-4bfa-9a0e-47507ff2a9bf", "metadata": {}, "source": [ "## Accessing and loading data for training \n", "\n", "In order to access the train and val splits for PINDER, please refer to the [pinder documentation](https://github.com/pinder-org/pinder/tree/main?tab=readme-ov-file#%EF%B8%8F-getting-the-dataset).\n", "\n", "Once you have downloaded the pinder dataset, either via the `pinder` package or directly through `gsutil`, you will have all of the necessary files for training. \n", "\n", "For those mainly interested in torch dataloaders, refer to the [readme section](https://github.com/pinder-org/pinder#5--dataloader) and [tutorial](https://pinder-org.github.io/pinder/pinder-loader.html) on the torch dataloader provided in pinder. TLDR: \n", "```python\n", "from pinder.core.loader.dataset import PinderDataset, get_torch_loader\n", "\n", "# Do NOT use the test split: we will verify that your pipeline uses it in neither training nor model selection\n", "train, val = [get_torch_loader(PinderDataset(split=split)) for split in [\"train\", \"val\"]]\n", "```\n", "\n", "For those interested in loading/filtering/sampling/augmenting data using pinder utilities, see the remaining sections below. \n" ] }, { "cell_type": "markdown", "id": "c764e3e9-db3c-4cb6-a53e-0e64d82a67e5", "metadata": {}, "source": [ "You are ONLY allowed to access those systems labeled with split `train` and split `val` for model training and validation, respectively.\n",
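"\n", "For example, once you have the index loaded (two options are shown below), a quick way to stay on the right side of this rule is to collect the set of allowed system IDs (a minimal sketch):\n", "\n", "```python\n", "from pinder.core import get_index\n", "\n", "index = get_index()\n", "allowed_ids = set(index.query('split in [\"train\", \"val\"]').id)  # the only systems you may train or select models on\n", "```\n",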
\n", "\n", "See below for two different options for accessing the index and split labels.\n", "\n", "\n", "**If you have already installed pinder (preferred method):**\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "945a0fbc-a126-4827-844c-77c9990884a7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((1560682, 34), (1958, 34))" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import torch\n", "from pinder.core import get_index\n", "\n", "index = get_index()\n", "train = index.query('split == \"train\"').reset_index(drop=True)\n", "val = index.query('split == \"val\"').reset_index(drop=True)\n", "train.shape, val.shape" ] }, { "cell_type": "markdown", "id": "25f4dc14-3832-42d5-adb5-17ca449722cb", "metadata": {}, "source": [ "**Without installing pinder (need to install `gcsfs` and `pandas` or install the `gsutil` utility to get the index file)**" ] }, { "cell_type": "code", "execution_count": 2, "id": "e2851b66-e6c0-425c-936b-6e0291366d74", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((1560682, 34), (1958, 34))" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import gcsfs\n", "import pandas as pd\n", "\n", "index_uri = \"gs://pinder/2024-02/index.parquet\"\n", "fs = gcsfs.GCSFileSystem(token=\"anon\")\n", "with fs.open(index_uri, \"rb\") as f:\n", " index = pd.read_parquet(f)\n", "\n", "train = index.query('split == \"train\"').reset_index(drop=True)\n", "val = index.query('split == \"val\"').reset_index(drop=True)\n", "train.shape, val.shape" ] }, { "cell_type": "markdown", "id": "cace51fd-0bb9-4f8b-9a1f-41a734ec6c78", "metadata": {}, "source": [ "### Minimal example with filepaths \n", "\n", "For those who simply want access to PDB files and/or sequences, below we provide a minimal example of how to go from a row in the pinder index to a tuple of filepaths and sequences akin to the expected inputs for inference. \n", "\n", "Later sections provide alternative means to loading data with common pinder utilities, including the `PinderLoader` and `PinderDataset` torch dataset." ] }, { "cell_type": "markdown", "id": "32b44b0d-287b-4276-a3d6-a3bf0f0c55c5", "metadata": {}, "source": [ "All `pinder` data should be stored in `PINDER_BASE_DIR`. Unless you customized the download directory, this would default to:\n", "`~/.local/share/pinder/2024-02/`\n", "\n", "PDB files are stored in a subdirectory, named `pdbs`. \n", "\n", "To go from a row in the index to a collection of filepaths, you can either use the pydantic model for the pinder index schema (`IndexEntry`) or construct the filepaths yourself. 
\n", "\n", "We will first illustrate how to do this via `IndexEntry`" ] }, { "cell_type": "code", "execution_count": 3, "id": "a633fef6-0fc1-4a06-99b7-d4a9a4620793", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "({'native': 'pdbs/8ir7__S1_P51765--8ir7__AA1_Q7DGD4.pdb',\n", " 'holo_R': 'pdbs/8ir7__S1_P51765-R.pdb',\n", " 'holo_L': 'pdbs/8ir7__AA1_Q7DGD4-L.pdb',\n", " 'predicted_R': 'pdbs/af__P51765.pdb',\n", " 'predicted_L': 'pdbs/af__Q7DGD4.pdb',\n", " 'apo_R': '',\n", " 'apo_L': '',\n", " 'apo_R_alt': [],\n", " 'apo_L_alt': []},\n", " {'native': PosixPath('/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/8ir7__S1_P51765--8ir7__AA1_Q7DGD4.pdb'),\n", " 'holo_R': PosixPath('/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/8ir7__S1_P51765-R.pdb'),\n", " 'holo_L': PosixPath('/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/8ir7__AA1_Q7DGD4-L.pdb'),\n", " 'predicted_R': PosixPath('/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/af__P51765.pdb'),\n", " 'predicted_L': PosixPath('/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/af__Q7DGD4.pdb'),\n", " 'apo_R': '',\n", " 'apo_L': '',\n", " 'apo_R_alt': [],\n", " 'apo_L_alt': []})" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pinder.core import get_pinder_location\n", "from pinder.core.index.utils import IndexEntry\n", "\n", "pinder_dir = get_pinder_location()\n", "row = train.sample(1).squeeze()\n", "entry = IndexEntry(**row.to_dict())\n", "# IndexEntry has a convenience property `pdb_paths` which returns a dict of structure_type: relative_path | list[relative_path]\n", "relative_paths = entry.pdb_paths\n", "absolute_paths = {}\n", "for structure_type, rel_path in relative_paths.items():\n", " # non-canonical apo monomers are stored as a list of relative paths\n", " if isinstance(rel_path, list):\n", " absolute_paths[structure_type] = []\n", " for alt_monomer in rel_path:\n", " absolute_paths[structure_type].append(pinder_dir / alt_monomer) \n", " # Not all systems have every type of monomer. When they are not available, the relative path is \"\"\n", " elif rel_path == \"\":\n", " absolute_paths[structure_type] = rel_path\n", " # Convert relative path to absolute path \n", " else:\n", " absolute_paths[structure_type] = pinder_dir / rel_path\n", "\n", "relative_paths, absolute_paths" ] }, { "cell_type": "markdown", "id": "3688f121-ad73-4239-9c81-77833476de90", "metadata": {}, "source": [ "In the above example, the `IndexEntry.pdb_paths` property was used to conveniently extract filepaths from a row in the index. 
This is done by using the following columns from the index:\n", "* `id`\n", "* `holo_R_pdb`\n", "* `holo_L_pdb`\n", "* `predicted_R_pdb`\n", "* `predicted_L_pdb`\n", "* `apo_R_pdbs`\n", "* `apo_L_pdbs`\n", "\n", "It is possible to do this yourself without IndexEntry:" ] }, { "cell_type": "code", "execution_count": 4, "id": "19de834a-b006-4852-9e45-252f97bde0b2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'native': PosixPath('/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/6rao__H2_Q6HAD2--6rao__F1_Q6HAD0.pdb'),\n", " 'holo_R_pdb': PosixPath('/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/6rao__H2_Q6HAD2-R.pdb'),\n", " 'holo_L_pdb': PosixPath('/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/6rao__F1_Q6HAD0-L.pdb'),\n", " 'predicted_R_pdb': PosixPath('/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/af__Q6HAD2.pdb'),\n", " 'predicted_L_pdb': PosixPath('/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/af__Q6HAD0.pdb'),\n", " 'apo_R_pdb': '',\n", " 'apo_L_pdb': '',\n", " 'apo_R_pdbs': [''],\n", " 'apo_L_pdbs': ['']}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "row = train.sample(1).squeeze()\n", "absolute_paths = {\n", " \"native\": pinder_dir / \"pdbs\" / f\"{row.id}.pdb\",\n", "}\n", "pdb_cols = [\n", " \"holo_R_pdb\", \"holo_L_pdb\", # holo monomers for receptor and ligand, respectively\n", " \"predicted_R_pdb\", \"predicted_L_pdb\", # predicted monomers\n", " \"apo_R_pdb\", \"apo_L_pdb\", # canonical apo monomers\n", " \"apo_R_pdbs\", \"apo_L_pdbs\", # canonical + non-canonical (alternative) apo monomers, separated by a semi-colon\n", "]\n", "for pdb_column in pdb_cols:\n", " if pdb_column.endswith(\"pdbs\"):\n", " absolute_paths[pdb_column] = [\n", " pinder_dir / \"pdbs\" / alt_apo if alt_apo != \"\" else \"\" \n", " for alt_apo in row[pdb_column].split(\";\")\n", " ]\n", " else:\n", " pdb_name = row[pdb_column]\n", " absolute_paths[pdb_column] = pinder_dir / \"pdbs\" / pdb_name if pdb_name != \"\" else \"\"\n", "\n", "absolute_paths\n" ] }, { "cell_type": "markdown", "id": "3fa21e41-e076-4080-93fd-e5062e73dc9f", "metadata": {}, "source": [ "The most minimal interface for loading these filepaths and extracting e.g. 
coordinates and sequence would be via `pinder.core.loader.structure` module:" ] }, { "cell_type": "code", "execution_count": 6, "id": "575049d7-e550-4b59-98db-0d7b045e391a", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024-09-05 16:50:41,522 | pinder.core.utils.cloud:375 | INFO : Gsutil process_many=download_to_filename, threads=7, items=7\n", "2024-09-05 16:50:41,943 | pinder.core.utils.cloud.process_many:23 | INFO : runtime succeeded: 0.42s\n" ] }, { "data": { "text/plain": [ "Structure(\n", " filepath=/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/6rao__H2_Q6HAD2-R.pdb,\n", " uniprot_map=None,\n", " pinder_id='6rao__H2_Q6HAD2-R',\n", " atom_array= with shape (1737,),\n", " pdb_engine='fastpdb',\n", ")" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pinder.core.loader.structure import Structure\n", "# Note: since this notebook is executed in CI, I will also create a `PinderSystem` object which will auto-download any missing PDB file\n", "# You do NOT need to do this if you already downloaded the dataset\n", "if not absolute_paths[\"holo_R_pdb\"].is_file():\n", " from pinder.core import PinderSystem\n", " _ = PinderSystem(absolute_paths[\"native\"].stem)\n", " \n", "receptor = Structure(absolute_paths[\"holo_R_pdb\"])\n", "receptor\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "a31f5f1d-7fac-4cde-a1f1-e151afeb17e7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 30.417, -28.254, 74.132],\n", " [ 31.12 , -28.716, 75.321],\n", " [ 32.127, -29.81 , 74.98 ],\n", " [ 32.824, -29.725, 73.966],\n", " [ 30.125, -29.23 , 76.361],\n", " [ 29.491, -30.413, 75.911],\n", " [ 32.233, -30.777, 75.9 ],\n", " [ 32.949, -32.052, 75.825],\n", " [ 34.468, -31.925, 75.898],\n", " [ 35.158, -32.93 , 76.097]], dtype=float32)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "receptor.coords[0:10]" ] }, { "cell_type": "code", "execution_count": 8, "id": "3c9bceba-819b-4d32-b95a-86239ad6e1bf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'SLLERGLSKLTLNAWKDREGKIPAGSMSAMYNPETIQLDYQTRFDTEDTINTASQSNRYVISEPVGLNLTLLFDSQMPGNTTPIETQLAMLKSLCAVDAATGSPYFLRITWGKMRWENKGWFAGRARDLSVTYTLFDRDATPLRATVQLSLVADESFVIQQSLKTQSAPDRALVSVPDLASLPLLALSAGGVLASSVDYLSLAWDNDLDNLDDFQTGDFLRATK'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "receptor.sequence" ] }, { "cell_type": "code", "execution_count": 9, "id": "0c11d67a-c186-4710-bf4d-3589b1fe483f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'>6rao__H2_Q6HAD2-R\\nSLLERGLSKLTLNAWKDREGKIPAGSMSAMYNPETIQLDYQTRFDTEDTINTASQSNRYVISEPVGLNLTLLFDSQMPGNTTPIETQLAMLKSLCAVDAATGSPYFLRITWGKMRWENKGWFAGRARDLSVTYTLFDRDATPLRATVQLSLVADESFVIQQSLKTQSAPDRALVSVPDLASLPLLALSAGGVLASSVDYLSLAWDNDLDNLDDFQTGDFLRATK'" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "receptor.fasta" ] }, { "cell_type": "code", "execution_count": 10, "id": "3fd4c712-625c-4f80-ace4-b4c1a1b8a600", "metadata": {}, "outputs": [], "source": [ "# You can also write the Structure object to a PDB file if desired (e.g. 
after making changes)\n", "from pathlib import Path\n", "from tempfile import TemporaryDirectory\n", "\n", "with TemporaryDirectory() as tmp_dir:\n", "    temp_dir = Path(tmp_dir)\n", "    receptor.to_pdb(temp_dir / \"modified_receptor.pdb\")\n" ] }, { "cell_type": "markdown", "id": "aacdc3c6-3a3f-4d5b-a5f7-41aec6f3e838", "metadata": {}, "source": [ "### Using pinder utilities to construct a dataloader \n", "\n", "\n", "Before proceeding with this section, you may find it helpful to review the existing tutorials available in `pinder`. \n", "\n", "Specifically, the tutorials covering:\n", "* [pinder index](https://pinder-org.github.io/pinder/pinder-index.html)\n", "* [pinder system](https://pinder-org.github.io/pinder/pinder-system.html)\n", "* [pinder loader](https://pinder-org.github.io/pinder/pinder-loader.html)\n", "* [cropped superposition](https://pinder-org.github.io/pinder/superposition.html)\n" ] }, { "cell_type": "markdown", "id": "cd5420e9-0d55-4c6a-a621-daf17af1b887", "metadata": {}, "source": [ "**We will start by looking at the most basic way to load items from the training and validation sets: via `PinderSystem` objects.**" ] }, { "cell_type": "code", "execution_count": 11, "id": "ee6840bb-f6db-4911-8bef-a900bf4a9c8e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PinderSystem(\n", "entry = IndexEntry(\n", "    (\n", "        'split',\n", "        'train',\n", "    ),\n", "    (\n", "        'id',\n", "        '8phr__X4_UNDEFINED--8phr__W4_UNDEFINED',\n", "    ),\n", "    (\n", "        'pdb_id',\n", "        '8phr',\n", "    ),\n", "    (\n", "        'cluster_id',\n", "        'cluster_24559_24559',\n", "    ),\n", "    (\n", "        'cluster_id_R',\n", "        'cluster_24559',\n", "    ),\n", "    (\n", "        'cluster_id_L',\n", "        'cluster_24559',\n", "    ),\n", "    (\n", "        'pinder_s',\n", "        False,\n", "    ),\n", "    (\n", "        'pinder_xl',\n", "        False,\n", "    ),\n", "    (\n", "        'pinder_af2',\n", "        False,\n", "    ),\n", "    (\n", "        'uniprot_R',\n", "        'UNDEFINED',\n", "    ),\n", "    (\n", "        'uniprot_L',\n", "        'UNDEFINED',\n", "    ),\n", "    (\n", "        'holo_R_pdb',\n", "        '8phr__X4_UNDEFINED-R.pdb',\n", "    ),\n", "    (\n", "        'holo_L_pdb',\n", "        '8phr__W4_UNDEFINED-L.pdb',\n", "    ),\n", "    (\n", "        'predicted_R_pdb',\n", "        '',\n", "    ),\n", "    (\n", "        'predicted_L_pdb',\n", "        '',\n", "    ),\n", "    (\n", "        'apo_R_pdb',\n", "        '',\n", "    ),\n", "    (\n", "        'apo_L_pdb',\n", "        '',\n", "    ),\n", "    (\n", "        'apo_R_pdbs',\n", "        '',\n", "    ),\n", "    (\n", "        'apo_L_pdbs',\n", "        '',\n", "    ),\n", "    (\n", "        'holo_R',\n", "        True,\n", "    ),\n", "    (\n", "        'holo_L',\n", "        True,\n", "    ),\n", "    (\n", "        'predicted_R',\n", "        False,\n", "    ),\n", "    (\n", "        'predicted_L',\n", "        False,\n", "    ),\n", "    (\n", "        'apo_R',\n", "        False,\n", "    ),\n", "    (\n", "        'apo_L',\n", "        False,\n", "    ),\n", "    (\n", "        'apo_R_quality',\n", "        '',\n", "    ),\n", "    (\n", "        'apo_L_quality',\n", "        '',\n", "    ),\n", "    (\n", "        'chain1_neff',\n", "        10.78125,\n", "    ),\n", "    (\n", "        'chain2_neff',\n", "        11.1171875,\n", "    ),\n", "    (\n", "        'chain_R',\n", "        'X4',\n", "    ),\n", "    (\n", "        'chain_L',\n", "        'W4',\n", "    ),\n", "    (\n", "        'contains_antibody',\n", "        False,\n", "    ),\n", "    (\n", "        'contains_antigen',\n", "        False,\n", "    ),\n", "    (\n", "        'contains_enzyme',\n", "        False,\n", "    ),\n", ")\n", "native=Structure(\n", "    filepath=/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/8phr__X4_UNDEFINED--8phr__W4_UNDEFINED.pdb,\n", "    uniprot_map=None,\n", "    pinder_id='8phr__X4_UNDEFINED--8phr__W4_UNDEFINED',\n", "    atom_array= with shape (2556,),\n", "    pdb_engine='fastpdb',\n", ")\n", "holo_receptor=Structure(\n",
"    filepath=/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/8phr__X4_UNDEFINED-R.pdb,\n", "    uniprot_map=/Users/danielkovtun/.local/share/pinder/2024-02/mappings/8phr__X4_UNDEFINED-R.parquet,\n", "    pinder_id='8phr__X4_UNDEFINED-R',\n", "    atom_array= with shape (1358,),\n", "    pdb_engine='fastpdb',\n", ")\n", "holo_ligand=Structure(\n", "    filepath=/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/8phr__W4_UNDEFINED-L.pdb,\n", "    uniprot_map=/Users/danielkovtun/.local/share/pinder/2024-02/mappings/8phr__W4_UNDEFINED-L.parquet,\n", "    pinder_id='8phr__W4_UNDEFINED-L',\n", "    atom_array= with shape (1198,),\n", "    pdb_engine='fastpdb',\n", ")\n", "apo_receptor=None\n", "apo_ligand=None\n", "pred_receptor=None\n", "pred_ligand=None\n", ")" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pinder.core import PinderSystem\n", "\n", "\n", "def get_system(system_id: str) -> PinderSystem:\n", "    return PinderSystem(system_id)\n", "\n", "\n", "system = get_system(train.id.iloc[0])\n", "system" ] }, { "cell_type": "markdown", "id": "d84f141f-5dd6-426c-b9f7-ffc6bd63834d", "metadata": {}, "source": [ "You will notice that the printed `PinderSystem` object has the following properties:\n", "* `native` - the ground-truth dimer complex\n", "* `holo_receptor` - the receptor chain (monomer) from the ground-truth complex\n", "* `holo_ligand` - the ligand chain (monomer) from the ground-truth complex\n", "* `apo_receptor` - the canonical _apo_ chain (monomer) paired to the receptor chain\n", "* `apo_ligand` - the canonical _apo_ chain (monomer) paired to the ligand chain\n", "* `pred_receptor` - the AlphaFold2-predicted monomer paired to the receptor chain \n", "* `pred_ligand` - the AlphaFold2-predicted monomer paired to the ligand chain\n", "\n", "\n", "These properties are pointers to `Structure` objects. The `Structure` object provides the most direct mode of access to structures and their associated properties. \n", "\n", "**Note: not all systems have an apo and/or predicted structure for all chains of the ground-truth dimer complex!** \n", "\n", "As was the case in the example above, when the alternative monomers are not available, the property will have a value of `None`. \n", "\n", "You can determine which systems have which alternative monomer pairings _a priori_ by looking at the boolean columns in the index: `apo_R` and `apo_L` for the apo receptor and ligand, and `predicted_R` and `predicted_L` for the predicted receptor and ligand, respectively.\n",
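"\n", "As a quick check, you can tabulate monomer availability across the training split (a sketch; assumes the `train` dataframe from earlier):\n", "\n", "```python\n", "train[[\"apo_R\", \"apo_L\", \"predicted_R\", \"predicted_L\"]].sum()  # number of systems with each monomer type available\n", "```\n",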
\n", "\n", "\n", "For instance, we can load a different system that _does_ have apo receptor and ligand as such:" ] }, { "cell_type": "code", "execution_count": 12, "id": "16e07a40-bc85-412d-bd52-858e4ce8d05b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Structure(\n", " filepath=/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/3wdb__A1_P9WPC9.pdb,\n", " uniprot_map=/Users/danielkovtun/.local/share/pinder/2024-02/mappings/3wdb__A1_P9WPC9.parquet,\n", " pinder_id='3wdb__A1_P9WPC9',\n", " atom_array= with shape (1144,),\n", " pdb_engine='fastpdb',\n", " ),\n", " Structure(\n", " filepath=/Users/danielkovtun/.local/share/pinder/2024-02/pdbs/6ucr__A1_P9WPC9.pdb,\n", " uniprot_map=/Users/danielkovtun/.local/share/pinder/2024-02/mappings/6ucr__A1_P9WPC9.parquet,\n", " pinder_id='6ucr__A1_P9WPC9',\n", " atom_array= with shape (1193,),\n", " pdb_engine='fastpdb',\n", " ))" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "apo_system = get_system(train.query('apo_R and apo_L').id.iloc[0])\n", "receptor = apo_system.apo_receptor\n", "ligand = apo_system.apo_ligand \n", "\n", "receptor, ligand\n" ] }, { "cell_type": "markdown", "id": "b5430d21-3c31-4a7a-a667-e687eeeecbc2", "metadata": {}, "source": [ "We can now access e.g. the sequence and the coordinates of the structures via the `Structure` objects:" ] }, { "cell_type": "code", "execution_count": 13, "id": "764dae70-980e-4cd3-9393-78a26bcf6136", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'PLGSMFERFTDRARRVVVLAQEEARMLNHNYIGTEHILLGLIHEGEGVAAKSLESLGISLEGVRSQVEEIIGQGQQAPSGHIPFTPRAKKVLELSLREALQLGHNYIGTEHILLGLIREGEGVAAQVLVKLGAELTRVRQQVIQLLSGY'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "receptor.sequence\n" ] }, { "cell_type": "code", "execution_count": 14, "id": "b0d94de4-68bb-456f-9ac5-a6c93ad79613", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[-12.982, -17.271, -11.271],\n", " [-14.36 , -17.069, -11.749],\n", " [-15.261, -16.373, -10.703],\n", " [-15.461, -15.161, -10.801],\n", " [-14.842, -18.494, -12.077]], dtype=float32)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "receptor.coords[0:5]" ] }, { "cell_type": "markdown", "id": "36fe3c2d-89c3-444b-b37c-2135ffc1e78e", "metadata": {}, "source": [ "We can always access the underyling biotite [AtomArray](https://www.biotite-python.org/latest/apidoc/biotite.structure.AtomArray.html) via the `Structure.atom_array` property:\n" ] }, { "cell_type": "code", "execution_count": 15, "id": "a2965877-9130-4b0b-a9a0-e3352cd4d68f", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([\n", "\tAtom(np.array([-12.982, -17.271, -11.271], dtype=float32), chain_id=\"R\", res_id=2, ins_code=\"\", res_name=\"PRO\", hetero=False, atom_name=\"N\", element=\"N\", b_factor=0.0),\n", "\tAtom(np.array([-14.36 , -17.069, -11.749], dtype=float32), chain_id=\"R\", res_id=2, ins_code=\"\", res_name=\"PRO\", hetero=False, atom_name=\"CA\", element=\"C\", b_factor=0.0),\n", "\tAtom(np.array([-15.261, -16.373, -10.703], dtype=float32), chain_id=\"R\", res_id=2, ins_code=\"\", res_name=\"PRO\", hetero=False, atom_name=\"C\", element=\"C\", b_factor=0.0),\n", "\tAtom(np.array([-15.461, -15.161, -10.801], dtype=float32), chain_id=\"R\", res_id=2, ins_code=\"\", res_name=\"PRO\", hetero=False, atom_name=\"O\", element=\"O\", b_factor=0.0),\n", "\tAtom(np.array([-14.842, -18.494, -12.077], dtype=float32), chain_id=\"R\", res_id=2, 
ins_code=\"\", res_name=\"PRO\", hetero=False, atom_name=\"CB\", element=\"C\", b_factor=0.0)\n", "])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "receptor.atom_array[0:5]" ] }, { "cell_type": "markdown", "id": "7ef2c09c-3624-4c3b-ba43-41a621f089b3", "metadata": {}, "source": [ "For a more comprehensive overview of all of the `Structure` class properties, refer to the [pinder system](https://pinder-org.github.io/pinder/pinder-system.html) tutorial.\n" ] }, { "cell_type": "markdown", "id": "f45ee6eb-ab1d-4992-b6e4-5f3eb36c2cc6", "metadata": {}, "source": [ "### Using PinderLoader and PinderDataset to fetch, filter, transform systems \n", "\n", "While the `PinderSystem` object provides a self-contained access to structures associated with a dimer system, the `PinderLoader` provides a base abstraction for how to iterate over systems, apply optional filters and/or transforms, and return training and validation data represented as `PinderSystem` and `Structure` objects. \n", "\n", "`PinderDataset` is an example implementation of a torch `Dataset` that can be consumed in a torch `DataLoader`. It uses the `PinderLoader` under the hood and additionally implements a default `transform` and `target_transform` function that converts the `Structure` objects returned by `PinderLoader` into dictionaries of structural properties encoded as tensors. The return value of the `PinderDataset.__getitem__` represents an example of dataset sample that is suitable for collating into `DataLoader` batches via the default `collate_fn` defined in `pinder.core.loader.dataset.collate_batch`. \n", "\n", "This is covered in much greater detail in the [pinder loader](https://pinder-org.github.io/pinder/pinder-loader.html) tutorial, but we will quickly showcase how both can be used to load data in an ML context. \n", "\n" ] }, { "cell_type": "code", "execution_count": 16, "id": "3e71a783-0929-413a-80e4-8a6ca6f46c74", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PinderLoader(split=val, monomers=holo, systems=1958)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pinder.core import PinderLoader\n", "from pinder.core.loader import filters\n", "\n", "base_filters = [\n", " filters.FilterDetachedHolo(radius=12, max_components=2),\n", " filters.FilterByResidueCount(min_residue_count=10, max_residue_count=2000),\n", "]\n", "sub_filters = [\n", " filters.FilterSubByAtomTypes(min_atom_types=4),\n", " filters.FilterByHoloSeqIdentity(min_sequence_identity=0.8),\n", "]\n", "loader = PinderLoader(\n", " split=\"val\",\n", " base_filters = base_filters,\n", " sub_filters = sub_filters\n", ")\n", "\n", "loader" ] }, { "cell_type": "markdown", "id": "f2433505-2f73-4f88-babb-4d8c54de943c", "metadata": {}, "source": [ "You can now access individual items in the loader or iterate over it. \n", "\n", "The current default return value of `PinderLoader.__getitem__` is a tuple consisting of `(system, feature_complex, target_complex)`:\n", "1. `system`: A `PinderSystem` instance corresponding to the item index\n", "2. `feature_complex`: A `Structure` object containing a sampled receptor and ligand monomer superimposed to the ground-truth complex.\n", "3. `target_complex`: A `Structure` object containing the ground-truth holo complex.\n", "\n", "\n", "Note: the monomers in the `feature_complex` can consist of holo/apo/pred or a mix of them. 
You can control which monomer is selected via the `monomer_priority` argument.\n", "\n", "Valid values are:\n", "* holo (default)\n", "* apo\n", "* pred\n", "* random (select a monomer at random from the set of monomer types available in both the receptor and ligand)\n", "* random_mixed (select a monomer at random from the set of monomer types available in the receptor and ligand, separately)\n" ] }, { "cell_type": "markdown", "id": "775a0ed6-9dae-49a3-8755-2cdc6606f680", "metadata": {}, "source": [ "If you wanted to leverage the `PinderLoader` but mainly just want the filepaths and/or sequence, you can do so with the returned `Structure` objects:" ] }, { "cell_type": "code", "execution_count": 17, "id": "2fc9dc84-d7ae-4765-82f1-4df75a45fd7f", "metadata": {}, "outputs": [], "source": [ "system, sample, target = loader[0]\n", "\n", "receptor = target.filter(\"chain_id\", [\"R\"])\n", "ligand = target.filter(\"chain_id\", [\"L\"])\n", "# Can do things like e.g.\n", "with open(f\"./receptor_{receptor.pinder_id}.fasta\", \"w\") as f:\n", " f.write(receptor.fasta)\n" ] }, { "cell_type": "code", "execution_count": 19, "id": "4c3200cc-0f42-4737-91a6-a490f2af55ab", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0%|▉ | 9/1958 [00:02<07:35, 4.28it/s]\n" ] } ], "source": [ "from tqdm import tqdm\n", "\n", "loaded_systems = set()\n", "limit = 10 # for faster exec in CI\n", "for system, feature_complex, target_complex in tqdm(loader):\n", " loaded_systems.add(system.entry.id)\n", " if len(loaded_systems) >= limit:\n", " break\n", " \n", " " ] }, { "cell_type": "code", "execution_count": 20, "id": "0625b00a-1208-48e6-908d-08de0be3f221", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(loaded_systems)" ] }, { "cell_type": "code", "execution_count": 21, "id": "d9154449-15c6-4095-b3be-6e64721324dc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# PinderDataset - torch dataset\n", "\n", "from pinder.core.loader import filters, transforms\n", "from pinder.core.loader.dataset import PinderDataset\n", "\n", "base_filters = [\n", " filters.FilterDetachedHolo(radius=12, max_components=2),\n", " filters.FilterByResidueCount(min_residue_count=10, max_residue_count=2000),\n", "]\n", "sub_filters = [\n", " filters.FilterSubByAtomTypes(min_atom_types=4),\n", " filters.FilterByHoloSeqIdentity(min_sequence_identity=0.8),\n", "]\n", "\n", "# We can include Structure-level transforms (and filters) which will operate on the target and feature complexes returned by PinderLoader\n", "structure_transforms = [\n", " transforms.SelectAtomTypes(atom_types=[\"CA\", \"N\", \"C\", \"O\"])\n", "]\n", "train_dataset = PinderDataset(\n", " split=\"train\", \n", " # We can leverage holo, apo, pred, random and random_mixed monomer sampling strategies\n", " monomer_priority=\"random_mixed\",\n", " base_filters = base_filters,\n", " sub_filters = sub_filters,\n", " structure_transforms=structure_transforms,\n", ")\n", "train_dataset" ] }, { "cell_type": "markdown", "id": "b2cbd084-e728-4eea-bf7b-6d15e7e4cf5f", "metadata": {}, "source": [ "\n", "You can now access individual items in the PinderDataset or iterate over it. 
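\n", "\n", "For model selection, you would construct the matching validation dataset in the same way (a sketch, reusing the filters and transforms defined above):\n", "\n", "```python\n", "val_dataset = PinderDataset(\n", "    split=\"val\",\n", "    monomer_priority=\"random_mixed\",\n", "    base_filters=base_filters,\n", "    sub_filters=sub_filters,\n", "    structure_transforms=structure_transforms,\n", ")\n", "```\n", "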
\n", "\n", "The current default return value of `PinderDataset.__getitem__` is a dict consisting of the following key, value pairs:\n", "* `target_complex`: The ground-truth holo dimer, represented with a set of default properties encoded as `Tensor`'s\n", "* `feature_complex`: The sampled dimer complex, representing \"features\", also represented with a set of default properties encoded as `Tensor`'s\n", "* `id`: The pinder ID for the selected system\n", "* `target_id`: The IDs of the receptor and ligand holo monomers, concatenated into a single ID string\n", "* `sample_id`: The IDs of the sampled receptor and ligand holo monomers, concatenated into a single ID string. This can be useful for debugging purposes or generally tracking which specific monomers are selected when targeting alternative monomers (more on this shortly)\n", "\n", "\n", "Each of the `target_complex` and `feature_complex` values are dictionaries with structural properties encoded by the `pinder.core.loader.geodata.structure2tensor` function by default:\n", "* `atom_coordinates`\n", "* `atom_types`\n", "* `residue_coordinates`\n", "* `residue_types`\n", "* `residue_ids`\n", "\n", "You can choose to use a different representation by overriding the default values of `transform` and `target_transform`." ] }, { "cell_type": "code", "execution_count": 22, "id": "38ec8e3d-ba0f-4ff2-a8d1-44be72e32ad6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'target_complex': {'atom_types': tensor([[0., 0., 0., ..., 0., 0., 0.],\n", " [1., 0., 0., ..., 0., 0., 0.],\n", " [1., 0., 0., ..., 0., 0., 0.],\n", " ...,\n", " [1., 0., 0., ..., 0., 0., 0.],\n", " [1., 0., 0., ..., 0., 0., 0.],\n", " [0., 0., 1., ..., 0., 0., 0.]]),\n", " 'residue_types': tensor([[16.],\n", " [16.],\n", " [16.],\n", " ...,\n", " [ 0.],\n", " [ 0.],\n", " [ 0.]]),\n", " 'atom_coordinates': tensor([[131.7500, 429.3090, 163.5360],\n", " [132.6810, 428.2520, 163.1550],\n", " [133.5150, 428.6750, 161.9500],\n", " ...,\n", " [177.7620, 463.8650, 166.9020],\n", " [177.4130, 465.0800, 167.7550],\n", " [176.8000, 464.9490, 168.8150]]),\n", " 'residue_coordinates': tensor([[131.7500, 429.3090, 163.5360],\n", " [132.6810, 428.2520, 163.1550],\n", " [133.5150, 428.6750, 161.9500],\n", " ...,\n", " [177.7620, 463.8650, 166.9020],\n", " [177.4130, 465.0800, 167.7550],\n", " [176.8000, 464.9490, 168.8150]]),\n", " 'residue_ids': tensor([ 4., 4., 4., ..., 182., 182., 182.])},\n", " 'feature_complex': {'atom_types': tensor([[0., 0., 0., ..., 0., 0., 0.],\n", " [1., 0., 0., ..., 0., 0., 0.],\n", " [1., 0., 0., ..., 0., 0., 0.],\n", " ...,\n", " [1., 0., 0., ..., 0., 0., 0.],\n", " [1., 0., 0., ..., 0., 0., 0.],\n", " [0., 0., 1., ..., 0., 0., 0.]]),\n", " 'residue_types': tensor([[16.],\n", " [16.],\n", " [16.],\n", " ...,\n", " [ 0.],\n", " [ 0.],\n", " [ 0.]]),\n", " 'atom_coordinates': tensor([[131.7500, 429.3090, 163.5360],\n", " [132.6810, 428.2520, 163.1550],\n", " [133.5150, 428.6750, 161.9500],\n", " ...,\n", " [177.7620, 463.8650, 166.9020],\n", " [177.4130, 465.0800, 167.7550],\n", " [176.8000, 464.9490, 168.8150]]),\n", " 'residue_coordinates': tensor([[131.7500, 429.3090, 163.5360],\n", " [132.6810, 428.2520, 163.1550],\n", " [133.5150, 428.6750, 161.9500],\n", " ...,\n", " [177.7620, 463.8650, 166.9020],\n", " [177.4130, 465.0800, 167.7550],\n", " [176.8000, 464.9490, 168.8150]]),\n", " 'residue_ids': tensor([ 4., 4., 4., ..., 182., 182., 182.])},\n", " 'id': '8phr__X4_UNDEFINED--8phr__W4_UNDEFINED',\n", " 'sample_id': 
'8phr__X4_UNDEFINED-R--8phr__W4_UNDEFINED-L',\n", " 'target_id': '8phr__X4_UNDEFINED-R--8phr__W4_UNDEFINED-L'}" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_item = train_dataset[0]\n", "data_item\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "edd68e94-0782-47a5-a9cb-d6f8d1db00f1", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "2024-09-05 12:58:37,879 | pinder.core.utils.cloud:375 | INFO : Gsutil process_many=download_to_filename, threads=7, items=7\n", "2024-09-05 12:58:38,274 | pinder.core.utils.cloud.process_many:23 | INFO : runtime succeeded: 0.40s\n", "2024-09-05 12:58:39,038 | pinder.core.utils.cloud:375 | INFO : Gsutil process_many=download_to_filename, threads=5, items=5\n", "2024-09-05 12:58:39,234 | pinder.core.utils.cloud.process_many:23 | INFO : runtime succeeded: 0.20s\n" ] } ], "source": [ "from pinder.core.loader.dataset import collate_batch, get_torch_loader\n", "\n", "# Now wrap the dataset in a torch DataLoader\n", "batch_size = 2\n", "train_dataloader = get_torch_loader(\n", " train_dataset, \n", " batch_size=batch_size,\n", " shuffle=True,\n", " collate_fn=collate_batch,\n", " num_workers=0, \n", ")\n", "\n", "# Get a batch from the dataloader\n", "batch = next(iter(train_dataloader))\n", "\n", "# expected batch dict keys\n", "assert set(batch.keys()) == {\n", " \"target_complex\",\n", " \"feature_complex\",\n", " \"id\",\n", " \"sample_id\",\n", " \"target_id\",\n", "}\n", "feature_coords = batch[\"feature_complex\"][\"atom_coordinates\"]\n", "# Ensure batch size propagates to tensor dims\n", "assert feature_coords.shape[0] == batch_size\n", "# Ensure coordinates have dim 3\n", "assert feature_coords.shape[2] == 3\n" ] }, { "cell_type": "code", "execution_count": 9, "id": "608249db-4ba3-4391-b8da-79a8550ba974", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'target_complex': {'atom_types': tensor([[[ 0., 0., 0., ..., 0., 0., 0.],\n", " [ 1., 0., 0., ..., 0., 0., 0.],\n", " [ 1., 0., 0., ..., 0., 0., 0.],\n", " ...,\n", " [ 1., 0., 0., ..., 0., 0., 0.],\n", " [ 1., 0., 0., ..., 0., 0., 0.],\n", " [ 0., 0., 1., ..., 0., 0., 0.]],\n", " \n", " [[ 0., 0., 0., ..., 0., 0., 0.],\n", " [ 1., 0., 0., ..., 0., 0., 0.],\n", " [ 1., 0., 0., ..., 0., 0., 0.],\n", " ...,\n", " [-1., -1., -1., ..., -1., -1., -1.],\n", " [-1., -1., -1., ..., -1., -1., -1.],\n", " [-1., -1., -1., ..., -1., -1., -1.]]]),\n", " 'residue_types': tensor([[[ 3.],\n", " [ 3.],\n", " [ 3.],\n", " ...,\n", " [13.],\n", " [13.],\n", " [13.]],\n", " \n", " [[14.],\n", " [14.],\n", " [14.],\n", " ...,\n", " [-1.],\n", " [-1.],\n", " [-1.]]]),\n", " 'atom_coordinates': tensor([[[ 8.1120, -9.6510, 29.2570],\n", " [ 8.4540, -9.0780, 30.5620],\n", " [ 9.9250, -8.6600, 30.6590],\n", " ...,\n", " [ 15.9010, -12.2620, 25.1370],\n", " [ 16.3010, -11.3060, 23.9960],\n", " [ 17.2380, -11.5950, 23.2480]],\n", " \n", " [[ 157.1080, 94.6520, 177.6160],\n", " [ 158.1820, 94.2070, 176.7270],\n", " [ 159.5680, 94.6010, 177.2280],\n", " ...,\n", " [-100.0000, -100.0000, -100.0000],\n", " [-100.0000, -100.0000, -100.0000],\n", " [-100.0000, -100.0000, -100.0000]]]),\n", " 'residue_coordinates': tensor([[[ 8.1120, -9.6510, 29.2570],\n", " [ 8.4540, -9.0780, 30.5620],\n", " [ 9.9250, -8.6600, 30.6590],\n", " ...,\n", " [ 15.9010, -12.2620, 25.1370],\n", " [ 16.3010, -11.3060, 23.9960],\n", " [ 17.2380, -11.5950, 23.2480]],\n", " \n", " [[ 157.1080, 94.6520, 177.6160],\n", " [ 158.1820, 94.2070, 176.7270],\n", " [ 
159.5680, 94.6010, 177.2280],\n", " ...,\n", " [-100.0000, -100.0000, -100.0000],\n", " [-100.0000, -100.0000, -100.0000],\n", " [-100.0000, -100.0000, -100.0000]]]),\n", " 'residue_ids': tensor([[ 20., 20., 20., ..., 327., 327., 327.],\n", " [ 1., 1., 1., ..., -99., -99., -99.]])},\n", " 'feature_complex': {'atom_types': tensor([[[ 0., 0., 0., ..., 0., 0., 0.],\n", " [ 1., 0., 0., ..., 0., 0., 0.],\n", " [ 1., 0., 0., ..., 0., 0., 0.],\n", " ...,\n", " [ 1., 0., 0., ..., 0., 0., 0.],\n", " [ 1., 0., 0., ..., 0., 0., 0.],\n", " [ 0., 0., 1., ..., 0., 0., 0.]],\n", " \n", " [[ 0., 0., 0., ..., 0., 0., 0.],\n", " [ 1., 0., 0., ..., 0., 0., 0.],\n", " [ 1., 0., 0., ..., 0., 0., 0.],\n", " ...,\n", " [-1., -1., -1., ..., -1., -1., -1.],\n", " [-1., -1., -1., ..., -1., -1., -1.],\n", " [-1., -1., -1., ..., -1., -1., -1.]]]),\n", " 'residue_types': tensor([[[ 3.],\n", " [ 3.],\n", " [ 3.],\n", " ...,\n", " [13.],\n", " [13.],\n", " [13.]],\n", " \n", " [[14.],\n", " [14.],\n", " [14.],\n", " ...,\n", " [-1.],\n", " [-1.],\n", " [-1.]]]),\n", " 'atom_coordinates': tensor([[[ 8.7379, -10.4326, 29.9403],\n", " [ 8.6252, -9.0862, 30.4928],\n", " [ 10.0447, -8.5609, 30.7012],\n", " ...,\n", " [ 15.9010, -12.2620, 25.1370],\n", " [ 16.3010, -11.3060, 23.9960],\n", " [ 17.2380, -11.5950, 23.2480]],\n", " \n", " [[ 157.1080, 94.6520, 177.6160],\n", " [ 158.1820, 94.2070, 176.7270],\n", " [ 159.5680, 94.6010, 177.2280],\n", " ...,\n", " [-100.0000, -100.0000, -100.0000],\n", " [-100.0000, -100.0000, -100.0000],\n", " [-100.0000, -100.0000, -100.0000]]]),\n", " 'residue_coordinates': tensor([[[ 8.7379, -10.4326, 29.9403],\n", " [ 8.6252, -9.0862, 30.4928],\n", " [ 10.0447, -8.5609, 30.7012],\n", " ...,\n", " [ 15.9010, -12.2620, 25.1370],\n", " [ 16.3010, -11.3060, 23.9960],\n", " [ 17.2380, -11.5950, 23.2480]],\n", " \n", " [[ 157.1080, 94.6520, 177.6160],\n", " [ 158.1820, 94.2070, 176.7270],\n", " [ 159.5680, 94.6010, 177.2280],\n", " ...,\n", " [-100.0000, -100.0000, -100.0000],\n", " [-100.0000, -100.0000, -100.0000],\n", " [-100.0000, -100.0000, -100.0000]]]),\n", " 'residue_ids': tensor([[ 20., 20., 20., ..., 327., 327., 327.],\n", " [ 1., 1., 1., ..., -99., -99., -99.]])},\n", " 'id': ['4aui__A1_Q51056--4aui__C1_Q51056',\n", " '8f54__M1_UNDEFINED--8f54__R1_UNDEFINED'],\n", " 'sample_id': ['af__Q51056--4aui__C1_Q51056-L',\n", " '8f54__M1_UNDEFINED-R--8f54__R1_UNDEFINED-L'],\n", " 'target_id': ['4aui__A1_Q51056-R--4aui__C1_Q51056-L',\n", " '8f54__M1_UNDEFINED-R--8f54__R1_UNDEFINED-L']}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "batch" ] }, { "cell_type": "markdown", "id": "41367114-02b5-49f2-828a-51d51bbb9f08", "metadata": {}, "source": [ "### Implementing your own PyTorch Dataset & DataLoader for pinder \n", "\n", "We invite you to review the [existing tutorial](https://pinder-org.github.io/pinder/pinder-loader.html#implementing-your-own-pytorch-dataset-dataloader-for-pinder) on this topic in the pinder documentation. Please don't hesitate to ask questions or otherwise engage via GitHub issues!\n", "\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "pinder", "language": "python", "name": "pinder" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.14" } }, "nbformat": 4, "nbformat_minor": 5 }