pinder.core.index package

Submodules

pinder.core.index.id module

class pinder.core.index.id.Protein(source: str, uniprot: str, chain: str | None = 'A', from_residue: int | None = None, to_residue: int | None = None)

Bases: object

A class to represent a Protein.

Attributes:
source (str):

The source of the protein.

uniprot (str):

The UniProt ID of the protein.

chain (Optional[str]):

The chain of the protein, can be None.

from_residue (Optional[int]):

The starting residue of the protein, can be None.

to_residue (Optional[int]):

The ending residue of the protein, can be None.

Methods

__str__():

Returns a string representation of the protein.

Examples

>>> protein = Protein(source='6q0r', chain='B', uniprot='Q66K64')
>>> str(protein) == '6q0r__B_Q66K64'
True
>>> protein = Protein(source='af2', chain='A', uniprot='Q14498', from_residue=1, to_residue=100)
>>> str(protein) == 'af2__A_Q14498_1_100'
False
source: str
uniprot: str
chain: str | None = 'A'
from_residue: int | None = None
to_residue: int | None = None
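The naming convention implied by the examples above can be sketched as follows. This is an illustrative re-implementation inferred only from the doctests (the chain is omitted for the af2 source, and the residue range is appended when both bounds are set); it is not pinder's actual code. The class name ProteinSketch and the PREDICTED_SOURCES set are assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumption: only 'af2' is shown in the doctests as a chain-less source.
PREDICTED_SOURCES = {"af2"}


@dataclass
class ProteinSketch:
    """Hypothetical stand-in that mimics the documented string format."""

    source: str
    uniprot: str
    chain: str | None = "A"
    from_residue: int | None = None
    to_residue: int | None = None

    def __str__(self) -> str:
        parts = []
        # Predicted sources drop the chain from the identifier.
        if self.source.lower() not in PREDICTED_SOURCES and self.chain is not None:
            parts.append(self.chain)
        parts.append(self.uniprot)
        # The residue range is appended only when both bounds are given.
        if self.from_residue is not None and self.to_residue is not None:
            parts += [str(self.from_residue), str(self.to_residue)]
        return f"{self.source}__" + "_".join(parts)


print(ProteinSketch(source="6q0r", chain="B", uniprot="Q66K64"))
print(ProteinSketch(source="af2", uniprot="Q14498", from_residue=1, to_residue=100))
```

The second call shows why the comparison against 'af2__A_Q14498_1_100' in the doctest evaluates to False: the chain is not part of the string for af2 entries.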
class pinder.core.index.id.Monomer(proteins: list[Protein], side: str | None = None)

Bases: object

A class to represent a Monomer.

Attributes:
proteins (List[Protein]):

A list of Protein objects that make up the monomer.

side (Optional[str]):

The side of the monomer, can be None.

Methods

__str__():

Returns a string representation of the monomer.

from_string(monomer_str: str):

Parses a monomer string into a Monomer object.

Examples

>>> filename = "6q0r__B_Q66K64.pdb"
>>> parsed_monomer = Monomer.from_string(filename)
>>> print(parsed_monomer)
6q0r__B_Q66K64
>>> assert filename == str(parsed_monomer) + '.pdb'
>>> print(Monomer([Protein(source='af2', chain='A', uniprot='Q14498', from_residue=1, to_residue=100)]))
af2__Q14498_1_100
>>> print(Monomer([Protein(source='af2', chain='A', uniprot='Q14498', from_residue=1, to_residue=100)], side='L'))
af2__Q14498_1_100-L
>>> print(Monomer([Protein(source='6q0r', chain='B', uniprot='Q66K64')], side='R'))
6q0r__B_Q66K64-R
>>> print(Monomer.from_string("af2__Q14498_1_100-R.pdb"))
af2__Q14498_1_100-R
>>> print(Monomer([Protein(source='af2', chain='A', uniprot='Q14498')], side='R'))
af2__Q14498-R
>>> print(Monomer.from_string("af2__Q14498-R.pdb"))
af2__Q14498-R
proteins: list[Protein]
side: str | None = None
classmethod from_string(monomer_str: str) -> Monomer

Parses a monomer string into a Monomer object. Supports input where the monomer is a .pdb file name or part of a complex.

class pinder.core.index.id.Dimer(monomer1: Monomer, monomer2: Monomer)

Bases: object

A class used to represent a Dimer, which is a complex of two Monomers.

Attributes:
monomer1 (Monomer):

The first monomer in the dimer.

monomer2 (Monomer):

The second monomer in the dimer.

Methods

__str__():

Returns a string representation of the dimer.

Examples

>>> monomer1 = Monomer.from_string("6q0r__B_Q66K64")
>>> monomer2 = Monomer.from_string("6q0r__D_Q14498")
>>> dimer = Dimer(monomer1, monomer2)
>>> assert str(dimer) == "6q0r__B_Q66K64--6q0r__D_Q14498"
>>> print(dimer)
6q0r__B_Q66K64--6q0r__D_Q14498
monomer1: Monomer
monomer2: Monomer
classmethod from_string(dimer_str: str) -> Dimer
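As the Dimer example shows, a dimer string is two monomer strings joined by a double hyphen. A minimal sketch of splitting such an identifier is given below. The helper split_dimer_id is hypothetical (it is not part of pinder), and stripping a trailing ".pdb" is an assumption carried over from the Monomer examples.

```python
from __future__ import annotations


def split_dimer_id(dimer_str: str) -> tuple[str, str]:
    """Split '<monomer1>--<monomer2>' into its two monomer strings."""
    # Assumption: a '.pdb' suffix may be present and is not part of the ID.
    name = dimer_str[:-4] if dimer_str.endswith(".pdb") else dimer_str
    parts = name.split("--")
    if len(parts) != 2:
        raise ValueError(f"expected exactly one '--' separator in {dimer_str!r}")
    return parts[0], parts[1]


monomer1, monomer2 = split_dimer_id("6q0r__B_Q66K64--6q0r__D_Q14498")
print(monomer1)  # 6q0r__B_Q66K64
print(monomer2)  # 6q0r__D_Q14498
```

In real code, Dimer.from_string is the intended entry point; the sketch only makes the separator convention explicit.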

pinder.core.index.system module

class pinder.core.index.system.FolderNames

Bases: object

apo: str = 'apo'
holo: str = 'holo'
predicted: str = 'predicted'
alphafold: str = 'predicted'
af2: str = 'predicted'
class pinder.core.index.system.PinderSystem(entry: str | IndexEntry, apo_receptor_pdb_code: str = '', apo_ligand_pdb_code: str = '', metadata: MetadataEntry | None = None, dataset_path: Path | None = None, pdb_engine: str = 'fastpdb', **kwargs: dict[str, Any])

Bases: object

Represents a system within the Pinder framework designed to handle and process structural data. It provides functionality to load, align, and analyze protein structures within the context of a Pinder index entry.

Upon initialization, the system loads the ground-truth dimer and sets the PDB and mapping directories.

Individual monomers (holo, apo, predicted) are defined as cached_property properties which will only initialize the Structure objects when they are requested.
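The lazy-loading behaviour described above is what functools.cached_property provides in general. The toy class below is a hypothetical stand-in, not PinderSystem itself; it only illustrates that the property body runs on first access and that the result is reused afterwards.

```python
from functools import cached_property


class LazySystem:
    """Toy stand-in for an object whose structures are expensive to load."""

    def __init__(self) -> None:
        self.load_count = 0

    @cached_property
    def holo_receptor(self) -> str:
        # Runs only on first access; the result is then stored on the instance.
        self.load_count += 1
        return "loaded holo receptor"


system = LazySystem()
print(system.load_count)  # 0: nothing is loaded at construction time
system.holo_receptor
system.holo_receptor
print(system.load_count)  # 1: the loader ran exactly once
```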

Methods include creating complexes, calculating RMSD, difficulty assessment, and updating substructure presence based on filtering criteria.

entry

An index entry object containing primary metadata for the system.

Type:

IndexEntry

pdbs_path

Path to the directory containing PDB files.

Type:

Path

mappings_path

Path to the directory containing Parquet mapping files.

Type:

Path

native

The native (ground-truth) structure of the system.

Type:

Structure

property holo_receptor: Structure

The holo form of the receptor.

property holo_ligand: Structure

The holo form of the ligand.

property aligned_holo_R: Structure

The holo form of the receptor, aligned to the coordinates of the respective chain in the native structure.

property aligned_holo_L: Structure

The holo form of the ligand, aligned to the coordinates of the respective chain in the native structure.

property apo_receptor: Structure

The apo form of the receptor.

property apo_ligand: Structure

The apo form of the ligand.

property pred_receptor: Structure

The predicted form of the receptor (currently AlphaFold2).

property pred_ligand: Structure

The predicted form of the ligand (currently AlphaFold2).

filter_common_uniprot_res() -> None

Filters the loaded protein structures for common UniProt residues, ensuring that comparisons between structures are made on a common set of residues.

create_masked_bound_unbound_complexes(monomer_types: Sequence[str] = ['apo', 'predicted'], remove_differing_atoms: bool = True, renumber_residues: bool = False, remove_differing_annotations: bool = False) -> tuple[Structure, Structure, Structure]

Create dimer complexes for apo and predicted cropped to common holo substructures.

The method applies a pairwise masking procedure which crops both unbound and bound structures such that they have equal numbers of residues and atom counts.

Note: this method may result in very distorted holo (ground-truth) structures if the unbound monomer structures have little to no sequence and atoms in common. Unless you need all monomer types to be equal shapes, the PinderSystem.create_complex method or pure-superposition without masking (Structure.superimpose) is more appropriate.

Parameters:
  • monomer_types (Sequence[str]) – The unbound monomer types to consider (apo, predicted, or both).

  • remove_differing_atoms (bool) – Whether to remove non-overlapping atoms that may still be present even after sequence-based alignment.

  • renumber_residues (bool) – Whether to renumber the residues in the receptor and ligand Structures to match the numbering of their holo counterparts.

  • remove_differing_annotations (bool) – Whether to remove annotation categories (set to empty str or default value for the category type). This is useful if you need to perform biotite.structure.filter_intersection on the resulting structures. Note: this can have unintended side-effects like stripping the element attribute on structures. By default, the annotation categories are removed if they don’t match in order to define the intersecting atom mask, after which the original structure annotations are preserved by applying the intersecting mask to the original AtomArray. Default is False.

Returns:

A tuple of the cropped holo, apo, and predicted Structure objects, respectively.

Return type:

tuple[Structure, Structure, Structure]
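The idea behind pairwise masking can be sketched with plain Python. In this simplified picture each atom is represented by a (residue number, atom name) tuple, which is an assumption made for illustration; pinder operates on Structure objects and the helper crop_to_common is hypothetical.

```python
from __future__ import annotations


def crop_to_common(
    bound: list[tuple[int, str]], unbound: list[tuple[int, str]]
) -> tuple[list[tuple[int, str]], list[tuple[int, str]]]:
    """Keep only the atoms present in both lists, preserving each list's order."""
    common = set(bound) & set(unbound)
    return (
        [atom for atom in bound if atom in common],
        [atom for atom in unbound if atom in common],
    )


holo = [(1, "N"), (1, "CA"), (2, "N"), (2, "CA"), (3, "N")]
apo = [(2, "N"), (2, "CA"), (3, "N"), (3, "CA"), (4, "N")]

holo_cropped, apo_cropped = crop_to_common(holo, apo)
print(holo_cropped)  # [(2, 'N'), (2, 'CA'), (3, 'N')]
print(len(holo_cropped) == len(apo_cropped))  # True
```

The example also shows the caveat from the note above: the holo side loses every atom that the unbound side lacks, so poor overlap leaves very little of the ground-truth structure.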

create_complex(receptor: Structure, ligand: Structure, remove_differing_atoms: bool = True, renumber_residues: bool = False, remove_differing_annotations: bool = False) -> Structure

Creates a complex from the receptor and ligand structures.

The complex is created by aligning the monomers to their respective holo forms and combining them into a single structure.

Parameters:
  • receptor (Structure) – The receptor structure.

  • ligand (Structure) – The ligand structure.

  • remove_differing_atoms (bool) – Whether to remove non-overlapping atoms that may still be present even after sequence-based alignment.

  • renumber_residues (bool) – Whether to renumber the residues in the receptor and ligand Structures to match the numbering of their holo counterparts.

  • remove_differing_annotations (bool) – Whether to remove annotation categories (set to empty str or default value for the category type). This is useful if you need to perform biotite.structure.filter_intersection on the resulting structures. Note: this can have unintended side-effects like stripping the element attribute on structures. By default, the annotation categories are removed if they don’t match in order to define the intersecting atom mask, after which the original structure annotations are preserved by applying the intersecting mask to the original AtomArray. Default is False.

Returns:

A new Structure instance representing the complex.

Return type:

Structure

create_apo_complex(remove_differing_atoms: bool = True, renumber_residues: bool = False, remove_differing_annotations: bool = False) -> Structure

Creates an apo complex using the receptor and ligand structures. Falls back to the holo structures if apo structures are not available.

Parameters:
  • remove_differing_atoms (bool) – Whether to remove non-overlapping atoms that may still be present even after sequence-based alignment.

  • renumber_residues (bool) – Whether to renumber the residues in the apo receptor and ligand Structures to match the numbering of their holo counterparts.

  • remove_differing_annotations (bool) – Whether to remove annotation categories (set to empty str or default value for the category type). This is useful if you need to perform biotite.structure.filter_intersection on the resulting structures. Note: this can have unintended side-effects like stripping the element attribute on structures. By default, the annotation categories are removed if they don’t match in order to define the intersecting atom mask, after which the original structure annotations are preserved by applying the intersecting mask to the original AtomArray. Default is False.

Returns:

A new Structure instance representing the apo complex.

Return type:

Structure

create_pred_complex(remove_differing_atoms: bool = True, renumber_residues: bool = False, remove_differing_annotations: bool = False) -> Structure

Creates a predicted complex using the receptor and ligand structures. Falls back to the holo structures if predicted structures are not available.

Parameters:
  • remove_differing_atoms (bool) – Whether to remove non-overlapping atoms that may still be present even after sequence-based alignment.

  • renumber_residues (bool) – Whether to renumber the residues in the predicted receptor and ligand Structures to match the numbering of their holo counterparts.

  • remove_differing_annotations (bool) – Whether to remove annotation categories (set to empty str or default value for the category type). This is useful if you need to perform biotite.structure.filter_intersection on the resulting structures. Note: this can have unintended side-effects like stripping the element attribute on structures. By default, the annotation categories are removed if they don’t match in order to define the intersecting atom mask, after which the original structure annotations are preserved by applying the intersecting mask to the original AtomArray. Default is False.

Returns:

A new Structure instance representing the predicted complex.

Return type:

Structure

unbound_rmsd(monomer_name: MonomerName) -> dict[str, float]

Calculates the RMSD of the unbound receptor and ligand with respect to their holo forms for a given monomer state (apo or predicted).

Parameters:

monomer_name (MonomerName) – Enum representing the monomer state.

Returns:

A dictionary with RMSD values for the receptor and ligand.

Return type:

dict[str, float]
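For reference, the RMSD between two equally sized, already superimposed coordinate sets is the square root of the mean squared distance between matched atoms. The function below is a generic sketch of that formula, not pinder's implementation; how pinder matches atoms and superimposes the structures before computing the value is not shown here.

```python
from __future__ import annotations

import math


def rmsd(
    coords_a: list[tuple[float, float, float]],
    coords_b: list[tuple[float, float, float]],
) -> float:
    """Root-mean-square deviation between matched, pre-aligned coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


holo = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
apo = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]  # every atom shifted by 1 along z
print(rmsd(holo, apo))  # 1.0
```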

unbound_difficulty(monomer_name: MonomerName, contact_rad: float = 5.0) -> dict[str, float | str]

Assesses the difficulty of docking the unbound structures based on the given monomer state (apo or predicted).

Parameters:

monomer_name (MonomerName) – Enum representing the monomer state.

Returns:

A dictionary with difficulty assessment metrics.

Return type:

dict[str, Union[float, str]]

apo_monomer_difficulty(monomer_name: MonomerName, body: str, contact_rad: float = 5.0) -> dict[str, float | str]

Evaluates the difficulty of docking for an individual apo monomer.

Takes the specified body of the monomer (receptor or ligand) into account.

Parameters:
  • monomer_name (MonomerName) – Enum representing the monomer state.

  • body (str) – String indicating which body (‘receptor’ or ‘ligand’) to use.

Returns:

A dictionary with docking difficulty metrics for the specified monomer body.

Return type:

dict[str, Union[float, str]]

download_entry() -> None

Downloads data associated with an entry for the PinderSystem instance.

It checks for the existence of PDB and Parquet files and downloads missing files from the Pinder bucket to the local dataset root directory.
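The "check, then fetch what is missing" step can be sketched with pathlib. The helper find_missing and the file names used below are hypothetical; the actual download from the Pinder bucket is deliberately left out.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def find_missing(root: Path, relative_paths: list[str]) -> list[str]:
    """Return the relative paths that do not exist as files under root."""
    return [rel for rel in relative_paths if not (root / rel).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pdbs").mkdir()
    (root / "pdbs" / "6q0r__B_Q66K64.pdb").touch()

    expected = ["pdbs/6q0r__B_Q66K64.pdb", "pdbs/6q0r__D_Q14498.pdb"]
    missing = find_missing(root, expected)

print(missing)  # ['pdbs/6q0r__D_Q14498.pdb']
```

Only the entries returned by such a check would need to be downloaded, which keeps repeated calls cheap once the files are on disk.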

property filepaths: dict[str, str | None]

Retrieves the file paths for the structural data associated with the system.

Returns:

A dictionary with keys for each structure type (e.g., ‘holo_receptor’, ‘holo_ligand’) and values as the corresponding file paths or None if not available.

Return type:

dict[str, Optional[str]]

property metadata: MetadataEntry | None

Retrieves the additional metadata associated with an IndexEntry.

Returns:

A MetadataEntry object if the metadata exists, otherwise None.

Return type:

MetadataEntry | None

property pymol_script: str

Constructs a PyMOL script to visualize the loaded structures in the PinderSystem.

Returns:

A string representing the PyMOL script commands for visualizing the structures.

Return type:

str

load_alt_apo_structure(alt_pdbs: list[str], code: str, canon_code: str, chain_id: str | None = None, pdb_engine: str = 'fastpdb') -> Structure | None

Loads an alternate apo structure based on the provided PDB codes, if available.

Parameters:
  • alt_pdbs (List[str]) – A list of alternative PDB file paths.

  • code (str) – The specific code to identify the alternate apo structure.

  • canon_code (str) – The canonical code for the apo structure.

  • chain_id (str, optional) – The chain ID to assign to the structure. Defaults to None (leave as is).

  • pdb_engine (str, optional) – The PDB engine to use for reading the structure. Defaults to “fastpdb”.

Returns:

The loaded Structure object if found, otherwise None.

Return type:

Optional[Structure]

static load_structure(pdb_file: Path | None, chain_id: str | None = None, pdb_engine: str = 'fastpdb') -> Structure | None

Loads a structure from a PDB file if it exists and is valid.

Parameters:
  • pdb_file (Path, optional) – The file path to the PDB file.

  • chain_id (str, optional) – The chain ID to assign to the structure. Defaults to None (leave as is).

  • pdb_engine (str, optional) – The PDB engine to use for reading the structure. Defaults to “fastpdb”.

Returns:

The loaded Structure object if the file is valid, otherwise None.

Return type:

Optional[Structure]

pinder.core.index.utils module

pinder.core.index.utils.get_pinder_location() -> Path

Determines the base directory location for the Pinder data.

The PINDER_DATA_DIR environment variable is checked first. If it is set, it is taken to be the full path to the directory containing the index, pdbs/, and mappings/. Otherwise, PINDER_BASE_DIR is used as the base directory, falling back to the default XDG_DATA_HOME location (~/.local/share) if it is unset, and the current version of the Pinder release is appended. The Pinder release can be controlled via the PINDER_RELEASE environment variable.

Returns:

The path to the base directory for Pinder data.

Return type:

Path
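The lookup order described above can be sketched as a small function. This is not pinder's implementation: the function name resolve_pinder_dir, the "pinder" sub-directory, and the default_release argument are assumptions made for illustration, since the exact directory layout and default release are not stated here.

```python
from __future__ import annotations

import os
from pathlib import Path
from typing import Mapping


def resolve_pinder_dir(env: Mapping[str, str], default_release: str) -> Path:
    """Resolve the data directory from environment-style settings."""
    # 1. An explicit data directory wins and is used as-is.
    data_dir = env.get("PINDER_DATA_DIR")
    if data_dir:
        return Path(data_dir)
    # 2. Otherwise start from the base directory, defaulting to ~/.local/share.
    base = env.get("PINDER_BASE_DIR") or str(Path.home() / ".local" / "share")
    # 3. Append the release (assumed layout: <base>/pinder/<release>).
    release = env.get("PINDER_RELEASE") or default_release
    return Path(base) / "pinder" / release


# "RELEASE" is a placeholder, not a real Pinder release name.
print(resolve_pinder_dir(os.environ, default_release="RELEASE"))
```

Passing the environment as a mapping keeps the sketch easy to test; get_pinder_location() itself takes no arguments and reads the process environment.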

pinder.core.index.utils.get_pinder_bucket_root() -> str

Constructs the root bucket path for the Pinder data in Google Cloud Storage based on the PINDER_RELEASE environment variable.

Returns:

The root bucket path as a string.

Return type:

str

pinder.core.index.utils.get_index_location(csv_name: str = 'index.parquet', remote: bool = False) -> Path | str

Gets the file path for the Pinder index CSV/Parquet file, either locally or remotely.

Parameters:
  • csv_name (str) – The name of the CSV/Parquet file. Defaults to “index.parquet”.

  • remote (bool) – A flag to determine if the remote file path should be returned. Defaults to False.

Returns:

The path to the index CSV/Parquet file, either as a Path object (local) or a string (remote).

Return type:

Union[Path, str]

pinder.core.index.utils.get_index(csv_name: str = 'index.parquet', update: bool = False) -> DataFrame

Retrieves the Pinder index as a pandas DataFrame. If the index is not already loaded, it reads from the local CSV/Parquet file, or downloads it if not present.

Parameters:
  • csv_name (str) – The name of the CSV/Parquet file to load. Defaults to “index.parquet”.

  • update (bool) – Whether to force update index on disk even if it exists. Default is False.

Returns:

The Pinder index as a DataFrame.

Return type:

pd.DataFrame
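The "read locally, download if absent, re-download on update" behaviour is a common caching pattern. The sketch below shows it in isolation; ensure_local and fake_download are hypothetical names, and the fake downloader merely writes a placeholder file instead of contacting any remote storage.

```python
from __future__ import annotations

import tempfile
from pathlib import Path
from typing import Callable


def ensure_local(
    path: Path, download: Callable[[Path], None], update: bool = False
) -> Path:
    """Download to path only if it is missing or an update is forced."""
    if update or not path.is_file():
        path.parent.mkdir(parents=True, exist_ok=True)
        download(path)
    return path


calls: list[str] = []


def fake_download(dest: Path) -> None:
    # Stand-in for a real download; records the call and writes a placeholder.
    calls.append(dest.name)
    dest.write_text("placeholder index contents")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "index.parquet"
    ensure_local(target, fake_download)               # missing: downloads
    ensure_local(target, fake_download)               # present: reused
    ensure_local(target, fake_download, update=True)  # forced refresh

print(len(calls))  # 2
```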

pinder.core.index.utils.get_metadata(csv_name: str = 'metadata.parquet', update: bool = False, extra_glob: str = 'metadata-*.csv.gz', extra_data: SupplementaryData | tuple[SupplementaryData] | tuple[()] = ()) -> DataFrame

Retrieves the Pinder index metadata as a pandas DataFrame.

If the metadata is not already loaded, it reads from the local CSV/Parquet file, or downloads it if not present.

Also attempts to read extra metadata. We assume that all extra CSV files contain an id field.

Parameters:
  • csv_name (str) – The name of the CSV/Parquet file to load. Defaults to “metadata.parquet”.

  • update (bool) – Whether to force update the metadata on disk even if it exists. Default is False.

  • extra_glob (str) – The pattern to match extra metadata CSV files. Defaults to “metadata-*.csv.gz”.

Returns:

The Pinder metadata as a DataFrame.

Return type:

pd.DataFrame

pinder.core.index.utils.get_extra_metadata(local_location: str, remote_location: str, glob_pattern: str = 'metadata-*.csv.gz', extra_data: tuple[SupplementaryData] | tuple[()] = (), update: bool = False) -> DataFrame | None

Retrieves extra metadata as a pandas DataFrame.

If the metadata is not already loaded, it reads from local CSV files, or downloads them if not present.

We assume that all CSV files contain an id field.

Parameters:
  • local_location (str) – The filepath to the local directory containing CSV files.

  • remote_location (str) – The filepath to the remote location containing CSV files.

  • glob_pattern (str) – The pattern to match extra metadata CSV files. Defaults to “metadata-*.csv.gz”

  • update (bool) – Whether to force update the extra metadata on disk even if it exists. Default is False.

Returns:

The extra Pinder metadata as a DataFrame, or None.

Return type:

pd.DataFrame | None

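class pinder.core.index.utils.SupplementaryData(value)

Bases: str, Enum

Enumeration of the supplementary data files exposed in the Pinder dataset. Each member's value is the name of the corresponding Parquet file.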

sequence_database = 'sequence_database.parquet'
supplementary_metadata = 'supplementary_metadata.parquet'
entity_metadata = 'entity_metadata.parquet'
chain_metadata = 'chain_metadata.parquet'
ecod_metadata = 'ecod_metadata.parquet'
enzyme_classification_metadata = 'enzyme_classification_metadata.parquet'
interface_annotations = 'interface_annotations.parquet'
sabdab_metadata = 'sabdab_metadata.parquet'
monomer_neff = 'monomer_neff.parquet'
paired_neff = 'test_split_paired_neffs.parquet'
transient_interface_metadata = 'transient_interface_metadata.parquet'
ialign_split_similarity_labels = 'ialign_split_similarity_labels.parquet'
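Because the class derives from both str and Enum, its members behave like their file-name strings. The two-member enum below is a cut-down, hypothetical copy used only to illustrate that general Python behaviour.

```python
from enum import Enum


class SupplementarySketch(str, Enum):
    """Cut-down illustration of a str-based enum of file names."""

    sequence_database = "sequence_database.parquet"
    monomer_neff = "monomer_neff.parquet"


member = SupplementarySketch.sequence_database

# Members compare equal to plain strings and expose str methods.
print(member == "sequence_database.parquet")  # True
print(member.value.endswith(".parquet"))      # True

# A member can be looked up from its value.
print(SupplementarySketch("monomer_neff.parquet") is SupplementarySketch.monomer_neff)  # True
```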
pinder.core.index.utils.get_supplementary_data(supplementary_data: SupplementaryData, update: bool = False) -> DataFrame

Retrieves supplementary data file exposed in the Pinder dataset as a pandas DataFrame. If the data is not already loaded, it reads from the local Parquet file, or downloads it if not present.

Parameters:
  • supplementary_data (SupplementaryData) – The name of the Parquet file to load.

  • update (bool) – Whether to force update data on disk even if it exists. Default is False.

Returns:

The requested supplementary Pinder data as a DataFrame.

Return type:

pd.DataFrame

pinder.core.index.utils.get_sequence_database(pqt_name: str = 'sequence_database.parquet', update: bool = False) -> DataFrame

Retrieves sequences for all PDB files in the Pinder dataset (including dimers, bound and unbound monomers) as a pandas DataFrame. If the database is not already loaded, it reads from the local Parquet file, or downloads it if not present.

Parameters:
  • pqt_name (str) – The name of the Parquet file to load. Defaults to “sequence_database.parquet”.

  • update (bool) – Whether to force update sequence database on disk even if it exists. Default is False.

Returns:

The Pinder sequence database as a DataFrame.

Return type:

pd.DataFrame

pinder.core.index.utils.download_dataset(skip_inflation: bool = False) -> None

Downloads the Pinder dataset archives (PDBs and mappings) and optionally inflates them.

Parameters:

skip_inflation (bool) – If True, the method will skip inflating (unzipping) the downloaded archives. Defaults to False.

Note

Required disk space to download the full Pinder dataset:

# compressed
144G    pdbs.zip
149M    test_set_pdbs.zip
6.8G    mappings.zip

# unpacked
672G    pdbs
705M    test_set_pdbs
25G     mappings
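Summing the figures above gives a rough idea of the space to reserve before downloading. The sketch below treats the "M" entries as thousandths of a "G" (an approximation) and assumes the archives stay on disk while they are unpacked; has_room is a hypothetical helper built on shutil.disk_usage.

```python
import shutil

# Sizes taken from the note above, expressed in gigabytes.
COMPRESSED_GB = {"pdbs.zip": 144.0, "test_set_pdbs.zip": 0.149, "mappings.zip": 6.8}
UNPACKED_GB = {"pdbs": 672.0, "test_set_pdbs": 0.705, "mappings": 25.0}

compressed_total = sum(COMPRESSED_GB.values())
unpacked_total = sum(UNPACKED_GB.values())
# Assumption: archives and unpacked data coexist on disk during inflation.
peak_total = compressed_total + unpacked_total

print(f"compressed: {compressed_total:.1f} G")  # compressed: 150.9 G
print(f"unpacked:   {unpacked_total:.1f} G")    # unpacked:   697.7 G
print(f"peak:       {peak_total:.1f} G")        # peak:       848.7 G


def has_room(path: str, required_gb: float) -> bool:
    """Check whether the filesystem holding path has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb
```

A call such as has_room(".", peak_total) before starting the download avoids running out of space partway through inflation.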
pinder.core.index.utils.get_arg_parser_args(argv: list[str] | None = None, title: str = 'Download latest pinder dataset to disk') -> dict[str, str | bool]

The command-line arg parser for different pinder data downloads and updates.

It accepts command-line arguments to specify the base directory, release version, and whether to skip inflating the compressed archives.

Parameters:

argv (Optional[List[str]]) – The command-line arguments. If None, sys.argv is used.

pinder.core.index.utils.download_pinder_cmd(argv: list[str] | None = None) -> None

The command-line interface for downloading the latest Pinder dataset to disk.

It accepts command-line arguments to specify the base directory, release version, and whether to skip inflating the compressed archives.

Parameters:

argv (Optional[List[str]]) – The command-line arguments. If None, sys.argv is used.

Note

Required disk space to download the full Pinder dataset:

# compressed
144G    pdbs.zip
149M    test_set_pdbs.zip
6.8G    mappings.zip

# unpacked
672G    pdbs
705M    test_set_pdbs
25G     mappings
pinder.core.index.utils.update_index_cmd(argv: list[str] | None = None) -> None

The command-line interface for downloading the latest Pinder index to disk.

It accepts command-line arguments to specify the base directory and release version.

Parameters:

argv (Optional[List[str]]) – The command-line arguments. If None, sys.argv is used.

pinder.core.index.utils.get_missing_blobs(prefix: str) -> tuple[list[str], list[Path]]
pinder.core.index.utils.sync_pinder_structure_data(argv: list[str] | None = None) -> None

The command-line interface for syncing any structural data files missing on disk.

It accepts command-line arguments to specify the base directory and release version.

Parameters:

argv (Optional[List[str]]) – The command-line arguments. If None, sys.argv is used.

class pinder.core.index.utils.IndexEntry(*, split: str, id: str, pdb_id: str, cluster_id: str, cluster_id_R: str, cluster_id_L: str, pinder_s: bool, pinder_xl: bool, pinder_af2: bool, uniprot_R: str, uniprot_L: str, holo_R_pdb: str, holo_L_pdb: str, predicted_R_pdb: str, predicted_L_pdb: str, apo_R_pdb: str, apo_L_pdb: str, apo_R_pdbs: str, apo_L_pdbs: str, holo_R: bool, holo_L: bool, predicted_R: bool, predicted_L: bool, apo_R: bool, apo_L: bool, apo_R_quality: str, apo_L_quality: str, chain1_neff: float, chain2_neff: float, chain_R: str, chain_L: str, contains_antibody: bool, contains_antigen: bool, contains_enzyme: bool)

Bases: BaseModel

Pydantic model representing a single entry in the Pinder index.

Stores all associated metadata for a particular dataset entry as attributes.

Parameters:
split (str):

The type of data split (e.g., ‘train’, ‘test’).

id (str):

The unique identifier for the dataset entry.

pdb_id (str):

The PDB identifier associated with the entry.

cluster_id (str):

The cluster identifier associated with the entry.

cluster_id_R (str):

The cluster identifier associated with receptor dimer body.

cluster_id_L (str):

The cluster identifier associated with ligand dimer body.

pinder_s (bool):

Flag indicating if the entry is part of the Pinder-S dataset.

pinder_xl (bool):

Flag indicating if the entry is part of the Pinder-XL dataset.

pinder_af2 (bool):

Flag indicating if the entry is part of the Pinder-AF2 dataset.

uniprot_R (str):

The UniProt identifier for the receptor protein.

uniprot_L (str):

The UniProt identifier for the ligand protein.

holo_R_pdb (str):

The PDB identifier for the holo form of the receptor protein.

holo_L_pdb (str):

The PDB identifier for the holo form of the ligand protein.

predicted_R_pdb (str):

The PDB identifier for the predicted structure of the receptor protein.

predicted_L_pdb (str):

The PDB identifier for the predicted structure of the ligand protein.

apo_R_pdb (str):

The PDB identifier for the apo form of the receptor protein.

apo_L_pdb (str):

The PDB identifier for the apo form of the ligand protein.

apo_R_pdbs (str):

The PDB identifiers for the apo forms of the receptor protein.

apo_L_pdbs (str):

The PDB identifiers for the apo forms of the ligand protein.

holo_R (bool):

Flag indicating if the holo form of the receptor protein is available.

holo_L (bool):

Flag indicating if the holo form of the ligand protein is available.

predicted_R (bool):

Flag indicating if the predicted structure of the receptor protein is available.

predicted_L (bool):

Flag indicating if the predicted structure of the ligand protein is available.

apo_R (bool):

Flag indicating if the apo form of the receptor protein is available.

apo_L (bool):

Flag indicating if the apo form of the ligand protein is available.

apo_R_quality (str):

Classification of apo receptor pairing quality. Can be high, low, or ‘’ (empty string). All test and val entries are labeled high. The train split is divided into high and low: low if the pairing was produced with low-confidence quality/eval metrics, high if the same metrics were used as for test and val. If no pairing exists, it is labeled with an empty string.

apo_L_quality (str):

Classification of apo ligand pairing quality. Can be high, low, or ‘’ (empty string). All test and val entries are labeled high. The train split is divided into high and low: low if the pairing was produced with low-confidence quality/eval metrics, high if the same metrics were used as for test and val. If no pairing exists, it is labeled with an empty string.

chain1_neff (float):

The Neff value for the first chain in the protein complex.

chain2_neff (float):

The Neff value for the second chain in the protein complex.

chain_R (str):

The chain identifier for the receptor protein.

chain_L (str):

The chain identifier for the ligand protein.

contains_antibody (bool):

Flag indicating if the protein complex contains an antibody as per SAbDab.

contains_antigen (bool):

Flag indicating if the protein complex contains an antigen as per SAbDab.

contains_enzyme (bool):

Flag indicating if the protein complex contains an enzyme as per EC ID number.

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

split: str
id: str
pdb_id: str
cluster_id: str
cluster_id_R: str
cluster_id_L: str
pinder_s: bool
pinder_xl: bool
pinder_af2: bool
uniprot_R: str
uniprot_L: str
holo_R_pdb: str
holo_L_pdb: str
predicted_R_pdb: str
predicted_L_pdb: str
apo_R_pdb: str
apo_L_pdb: str
apo_R_pdbs: str
apo_L_pdbs: str
holo_R: bool
holo_L: bool
predicted_R: bool
predicted_L: bool
apo_R: bool
apo_L: bool
apo_R_quality: str
apo_L_quality: str
chain1_neff: float
chain2_neff: float
chain_R: str
chain_L: str
contains_antibody: bool
contains_antigen: bool
contains_enzyme: bool
pdb_path(pdb_name: str) -> str

Constructs the relative path for a given PDB file within the Pinder dataset.

Parameters:

pdb_name (str) – The name of the PDB file.

Returns:

The relative path as a string.

Return type:

str

mapping_path(pdb_name: str) -> str
property pdb_paths: dict[str, str | list[str]]

Dictionary containing the PDB type as key and PDB file name as value. Alternative apo monomers (if available) are returned as a list.

Type:

dict[str, str | list[str]]

property mapping_paths: dict[str, str | list[str]]

Dictionary containing the PDB type as key and Parquet file names corresponding to residue-level mapping information as value. Alternative apo monomers (if available) are returned as a list.

Type:

dict[str, str | list[str]]

property pinder_id: str

PINDER identifier for the dimer.

Type:

str

property pinder_pdb: str

PINDER dimer PDB file name.

Type:

str

property apo_R_alt: list[str] | list[None]

List of alternative apo receptor PDBs associated with the entry.

Type:

list[str]

property apo_L_alt: list[str] | list[None]

List of alternative apo ligand PDBs associated with the entry.

Type:

list[str]

property homodimer: bool

Loose definition of whether the entry is a homodimer (based on identical receptor and ligand UniProt IDs).

Type:

bool

property test_system: bool

Whether the system is part of the test split.

Type:

bool
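The two boolean properties above are simple derivations from the entry's fields. The dataclass below is a hypothetical, cut-down stand-in for IndexEntry that follows the documented definitions (identical UniProt IDs, membership in the test split); it is a sketch rather than the library's code, and the literal 'test' is taken from the example values given for split.

```python
from dataclasses import dataclass


@dataclass
class EntrySketch:
    """Cut-down stand-in carrying only the fields the properties need."""

    split: str
    uniprot_R: str
    uniprot_L: str

    @property
    def homodimer(self) -> bool:
        # Loose definition: receptor and ligand share the same UniProt ID.
        return self.uniprot_R == self.uniprot_L

    @property
    def test_system(self) -> bool:
        return self.split == "test"


entry = EntrySketch(split="test", uniprot_R="Q66K64", uniprot_L="Q14498")
print(entry.homodimer)    # False
print(entry.test_system)  # True
```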

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_fields: ClassVar[Dict[str, FieldInfo]] = {'apo_L': FieldInfo(annotation=bool, required=True), 'apo_L_pdb': FieldInfo(annotation=str, required=True), 'apo_L_pdbs': FieldInfo(annotation=str, required=True), 'apo_L_quality': FieldInfo(annotation=str, required=True), 'apo_R': FieldInfo(annotation=bool, required=True), 'apo_R_pdb': FieldInfo(annotation=str, required=True), 'apo_R_pdbs': FieldInfo(annotation=str, required=True), 'apo_R_quality': FieldInfo(annotation=str, required=True), 'chain1_neff': FieldInfo(annotation=float, required=True), 'chain2_neff': FieldInfo(annotation=float, required=True), 'chain_L': FieldInfo(annotation=str, required=True), 'chain_R': FieldInfo(annotation=str, required=True), 'cluster_id': FieldInfo(annotation=str, required=True), 'cluster_id_L': FieldInfo(annotation=str, required=True), 'cluster_id_R': FieldInfo(annotation=str, required=True), 'contains_antibody': FieldInfo(annotation=bool, required=True), 'contains_antigen': FieldInfo(annotation=bool, required=True), 'contains_enzyme': FieldInfo(annotation=bool, required=True), 'holo_L': FieldInfo(annotation=bool, required=True), 'holo_L_pdb': FieldInfo(annotation=str, required=True), 'holo_R': FieldInfo(annotation=bool, required=True), 'holo_R_pdb': FieldInfo(annotation=str, required=True), 'id': FieldInfo(annotation=str, required=True), 'pdb_id': FieldInfo(annotation=str, required=True), 'pinder_af2': FieldInfo(annotation=bool, required=True), 'pinder_s': FieldInfo(annotation=bool, required=True), 'pinder_xl': FieldInfo(annotation=bool, required=True), 'predicted_L': FieldInfo(annotation=bool, required=True), 'predicted_L_pdb': FieldInfo(annotation=str, required=True), 'predicted_R': FieldInfo(annotation=bool, required=True), 'predicted_R_pdb': FieldInfo(annotation=str, required=True), 'split': FieldInfo(annotation=str, required=True), 'uniprot_L': FieldInfo(annotation=str, required=True), 'uniprot_R': FieldInfo(annotation=str, required=True)}#

Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.

class pinder.core.index.utils.MetadataEntry(*, id: str, entry_id: str, method: str, date: str, release_date: str, resolution: float, label: str, probability: float, chain1_id: str, chain2_id: str, assembly: int, assembly_details: str, oligomeric_details: str, oligomeric_count: int, biol_details: str, complex_type: str, chain_1: str, asym_id_1: str, chain_2: str, asym_id_2: str, length1: int, length2: int, length_resolved_1: int, length_resolved_2: int, number_of_components_1: int, number_of_components_2: int, link_density: float, planarity: float, max_var_1: float, max_var_2: float, num_atom_types: int, n_residue_pairs: int, n_residues: int, buried_sasa: float, intermolecular_contacts: int, charged_charged_contacts: int, charged_polar_contacts: int, charged_apolar_contacts: int, polar_polar_contacts: int, apolar_polar_contacts: int, apolar_apolar_contacts: int, interface_atom_gaps_4A: int, missing_interface_residues_4A: int, interface_atom_gaps_8A: int, missing_interface_residues_8A: int, entity_id_R: int, entity_id_L: int, pdb_strand_id_R: str, pdb_strand_id_L: str, ECOD_names_R: str, ECOD_names_L: str, **extra_data: Any)[source][source]#

Bases: BaseModel

Pydantic model representing a single entry in the Pinder metadata.

Stores detailed metadata for a particular dataset entry as attributes.

Parameters:
  • id (str) – The unique identifier for the PINDER entry. It follows the convention <Receptor>--<Ligand>, where <Receptor> is <pdbid>__<chain_1>_<uniprotid> and <Ligand> is <pdbid>__<chain_2>_<uniprotid>.

  • entry_id (str) – The RCSB entry identifier associated with the PINDER entry.

  • method (str) – The experimental method for structure determination (XRAY, CRYO-EM, etc.).

  • date (str) – Date of deposition into RCSB PDB.

  • release_date (str) – Date of initial public release in RCSB PDB.

  • resolution (float) – The resolution of the experimental structure.

  • label (str) – Classification of the interface as likely to be biologically relevant or a crystal contact, annotated using PRODIGY-cryst. PRODIGY-cryst uses machine learning to compute the propensity for a biologically relevant interface versus a crystal contact, based on intermolecular contact types and interfacial link density.

  • probability (float) – Probability that the protein complex is a true biological complex.

  • chain1_id (str) – The Receptor chain identifier associated with the dimer entry. Expected to be ‘R’ for every entry.

  • chain2_id (str) – The Ligand chain identifier associated with the dimer entry. Expected to be ‘L’ for every entry.

  • assembly (int) – Which bioassembly is used to derive the structure. 1, 2, 3 means first, second, and third assembly, respectively. All PINDER dimers are derived from the first biological assembly.

  • assembly_details (str) – How the bioassembly information was derived, for example author-defined or from another source.

  • oligomeric_details (str) – Description of the oligomeric state of the protein complex.

  • oligomeric_count (int) – The oligomeric count associated with the dataset entry.

  • biol_details (str) – The biological assembly details associated with the dataset entry.

  • complex_type (str) – The type of the complex in the dataset entry (homomer or heteromer).

  • chain_1 (str) – New chain id generated post-bioassembly generation, to reflect the asym_id of the bioassembly and also to ensure that there is no collision of chain ids, for example in homooligomers (receptor chain).

  • asym_id_1 (str) – The first asymmetric identifier (author chain ID)

  • chain_2 (str) – New chain id generated post-bioassembly generation, to reflect the asym_id of the bioassembly and also to ensure that there is no collision of chain ids, for example in homooligomers (ligand chain).

  • asym_id_2 (str) – The second asymmetric identifier (author chain ID)

  • length1 (int) – The number of amino acids in the first (receptor) chain.

  • length2 (int) – The number of amino acids in the second (ligand) chain.

  • length_resolved_1 (int) – The structurally resolved (CA) length of the first (receptor) chain in amino acids.

  • length_resolved_2 (int) – The structurally resolved (CA) length of the second (ligand) chain in amino acids.

  • number_of_components_1 (int) – The number of connected components (contiguous structural fragments) in the first (receptor) chain.

  • number_of_components_2 (int) – The number of connected components (contiguous structural fragments) in the second (ligand) chain.

  • link_density (float) – Density of contacts at the interface as reported by PRODIGY-cryst. Interfacial link density is defined as the number of interfacial contacts normalized by the maximum possible number of pairwise contacts for that interface. Values range between 0 and 1, with higher values indicating a denser contact network at the interface.

  • planarity (float) – Defined as the deviation of interfacial Cα atoms from the fitted plane. This interface characteristic quantifies interfacial shape complementarity. Transient complexes have smaller and more planar interfaces than permanent and structural scaffold complexes.

  • max_var_1 (float) – The maximum variance of coordinates projected onto the largest principal component. This allows the detection of long end-to-end stacked complexes, likely to be repetitive with small interfaces (receptor chain).

  • max_var_2 (float) – The maximum variance of coordinates projected onto the largest principal component. This allows the detection of long end-to-end stacked complexes, likely to be repetitive with small interfaces (ligand chain).

  • num_atom_types (int) – Number of unique atom types in structure. This is an important annotation to identify complexes with only Cα or backbone atoms.

  • n_residue_pairs (int) – The number of residue pairs at the interface.

  • n_residues (int) – The number of residues at the interface.

  • buried_sasa (float) – The buried solvent accessible surface area upon complex formation.

  • intermolecular_contacts (int) – The total number of intermolecular contacts (residue pairs with any atoms within a 5Å distance cutoff) at the interface. Annotated using PRODIGY-cryst.

  • charged_charged_contacts (int) – Denotes intermolecular contacts between any of the charged amino acids (E, D, H, K, R). Annotated using PRODIGY-cryst.

  • charged_polar_contacts (int) – Denotes intermolecular contacts between charged amino acids (E, D, H, K, R) and polar amino acids (N, Q, S, T). Annotated using PRODIGY-cryst.

  • charged_apolar_contacts (int) – Denotes intermolecular contacts between charged amino acids (E, D, H, K, R) and apolar amino acids (A, C, G, F, I, M, L, P, W, V, Y). Annotated using PRODIGY-cryst.

  • polar_polar_contacts (int) – Denotes intermolecular contacts between any of the polar amino acids (N, Q, S, T). Annotated using PRODIGY-cryst.

  • apolar_polar_contacts (int) – Denotes intermolecular contacts between apolar amino acids (A, C, G, F, I, M, L, P, W, V, Y) and polar amino acids (N, Q, S, T). Annotated using PRODIGY-cryst.

  • apolar_apolar_contacts (int) – Denotes intermolecular contacts between any of the apolar amino acids (A, C, G, F, I, M, L, P, W, V, Y). Annotated using PRODIGY-cryst.

  • interface_atom_gaps_4A (int) – Number of interface atoms within a 4Å radius of a residue gap. A gap is determined by residue numbering: any region where one or more expected residue indices are missing is marked as a gap.

  • missing_interface_residues_4A (int) – Number of interface residues within a 4Å radius of a residue gap. A gap is determined by residue numbering: any region where one or more expected residue indices are missing is marked as a gap.

  • interface_atom_gaps_8A (int) – Number of interface atoms within an 8Å radius of a residue gap. A gap is determined by residue numbering: any region where one or more expected residue indices are missing is marked as a gap.

  • missing_interface_residues_8A (int) – Number of interface residues within an 8Å radius of a residue gap. A gap is determined by residue numbering: any region where one or more expected residue indices are missing is marked as a gap.

  • entity_id_R (int) – The RCSB PDB entity_id corresponding to the receptor dimer chain.

  • entity_id_L (int) – The RCSB PDB entity_id corresponding to the ligand dimer chain.

  • pdb_strand_id_R (str) – The RCSB PDB pdb_strand_id (author chain) corresponding to the receptor dimer chain.

  • pdb_strand_id_L (str) – The RCSB PDB pdb_strand_id (author chain) corresponding to the ligand dimer chain.

  • ECOD_names_R (str) – The RCSB-derived ECOD domain protein family name(s) corresponding to the receptor dimer chain. If multiple ECOD domain annotations were found, the domains are delimited with a comma.

  • ECOD_names_L (str) – The RCSB-derived ECOD domain protein family name(s) corresponding to the ligand dimer chain. If multiple ECOD domain annotations were found, the domains are delimited with a comma.
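
The six contact-type fields above partition interface residue pairs by residue class. The following is a minimal sketch of that bookkeeping, not the PRODIGY-cryst implementation: the helper names `residue_class` and `count_contact_types` and the example residue pairs are hypothetical, and arginine (R) is assumed to be charged, as in the charged_polar_contacts description.

```python
from collections import Counter

# Residue classes as listed in the field descriptions (one-letter codes).
# Assumption: R is charged, following the charged_polar_contacts description.
CHARGED = set("EDHKR")
POLAR = set("NQST")
APOLAR = set("ACGFIMLPWVY")

# Ordering used to build keys that match the field names
# (charged first, then apolar, then polar).
CLASS_RANK = {"charged": 0, "apolar": 1, "polar": 2}


def residue_class(aa: str) -> str:
    """Return 'charged', 'polar', or 'apolar' for a one-letter amino acid code."""
    if aa in CHARGED:
        return "charged"
    if aa in POLAR:
        return "polar"
    if aa in APOLAR:
        return "apolar"
    raise ValueError(f"Unknown amino acid: {aa!r}")


def count_contact_types(pairs: list[tuple[str, str]]) -> Counter:
    """Count contacts per class pair, regardless of which chain each residue is on."""
    counts: Counter = Counter()
    for aa_r, aa_l in pairs:
        classes = sorted(
            (residue_class(aa_r), residue_class(aa_l)), key=CLASS_RANK.__getitem__
        )
        counts["_".join(classes) + "_contacts"] += 1
    return counts


# Hypothetical contacts given as (receptor residue, ligand residue).
contacts = [("K", "E"), ("D", "S"), ("L", "V"), ("T", "R"), ("A", "N")]
counts = count_contact_types(contacts)
print(dict(counts))
```

Under these assumptions the per-class counts add up to the number of residue pairs passed in, which mirrors how intermolecular_contacts relates to the six class-specific fields.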

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

id: str#
entry_id: str#
method: str#
date: str#
release_date: str#
resolution: float#
label: str#
probability: float#
chain1_id: str#
chain2_id: str#
assembly: int#
assembly_details: str#
oligomeric_details: str#
oligomeric_count: int#
biol_details: str#
complex_type: str#
chain_1: str#
asym_id_1: str#
chain_2: str#
asym_id_2: str#
length1: int#
length2: int#
length_resolved_1: int#
length_resolved_2: int#
number_of_components_1: int#
number_of_components_2: int#
link_density: float#
planarity: float#
max_var_1: float#
max_var_2: float#
num_atom_types: int#
n_residue_pairs: int#
n_residues: int#
buried_sasa: float#
intermolecular_contacts: int#
charged_charged_contacts: int#
charged_polar_contacts: int#
charged_apolar_contacts: int#
polar_polar_contacts: int#
apolar_polar_contacts: int#
apolar_apolar_contacts: int#
interface_atom_gaps_4A: int#
missing_interface_residues_4A: int#
interface_atom_gaps_8A: int#
missing_interface_residues_8A: int#
entity_id_R: int#
entity_id_L: int#
pdb_strand_id_R: str#
pdb_strand_id_L: str#
ECOD_names_R: str#
ECOD_names_L: str#
property pinder_id: str#

PINDER identifier for the dimer.

Type:

str
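
As an illustration of the identifier convention described for the id parameter, the sketch below splits a dimer ID into its receptor and ligand parts. It is not the library's parser: `split_pinder_id` is a hypothetical helper, the example ID is made up for illustration, and the code assumes a `--` separator and one protein per monomer.

```python
def split_pinder_id(pinder_id: str) -> dict[str, str]:
    """Split <pdbid>__<chain>_<uniprot>--<pdbid>__<chain>_<uniprot> into its parts."""
    receptor, ligand = pinder_id.split("--")
    parts: dict[str, str] = {}
    for side, monomer in (("R", receptor), ("L", ligand)):
        # The double underscore separates the PDB ID from the chain/UniProt part.
        pdb_id, chain_uniprot = monomer.split("__")
        # Split once so that any further underscores stay with the UniProt part.
        chain, uniprot = chain_uniprot.split("_", 1)
        parts[f"pdb_id_{side}"] = pdb_id
        parts[f"chain_{side}"] = chain
        parts[f"uniprot_{side}"] = uniprot
    return parts


# Illustrative ID only; not taken from the dataset.
parts = split_pinder_id("6q0r__A_Q66K64--6q0r__B_Q66K64")
print(parts)
```

For real work, the Protein, Monomer, and related classes in pinder.core.index.id are the supported way to parse identifiers.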

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_fields: ClassVar[Dict[str, FieldInfo]] = {'ECOD_names_L': FieldInfo(annotation=str, required=True), 'ECOD_names_R': FieldInfo(annotation=str, required=True), 'apolar_apolar_contacts': FieldInfo(annotation=int, required=True), 'apolar_polar_contacts': FieldInfo(annotation=int, required=True), 'assembly': FieldInfo(annotation=int, required=True), 'assembly_details': FieldInfo(annotation=str, required=True), 'asym_id_1': FieldInfo(annotation=str, required=True), 'asym_id_2': FieldInfo(annotation=str, required=True), 'biol_details': FieldInfo(annotation=str, required=True), 'buried_sasa': FieldInfo(annotation=float, required=True), 'chain1_id': FieldInfo(annotation=str, required=True), 'chain2_id': FieldInfo(annotation=str, required=True), 'chain_1': FieldInfo(annotation=str, required=True), 'chain_2': FieldInfo(annotation=str, required=True), 'charged_apolar_contacts': FieldInfo(annotation=int, required=True), 'charged_charged_contacts': FieldInfo(annotation=int, required=True), 'charged_polar_contacts': FieldInfo(annotation=int, required=True), 'complex_type': FieldInfo(annotation=str, required=True), 'date': FieldInfo(annotation=str, required=True), 'entity_id_L': FieldInfo(annotation=int, required=True), 'entity_id_R': FieldInfo(annotation=int, required=True), 'entry_id': FieldInfo(annotation=str, required=True), 'id': FieldInfo(annotation=str, required=True), 'interface_atom_gaps_4A': FieldInfo(annotation=int, required=True), 'interface_atom_gaps_8A': FieldInfo(annotation=int, required=True), 'intermolecular_contacts': FieldInfo(annotation=int, required=True), 'label': FieldInfo(annotation=str, required=True), 'length1': FieldInfo(annotation=int, required=True), 'length2': FieldInfo(annotation=int, required=True), 'length_resolved_1': FieldInfo(annotation=int, required=True), 'length_resolved_2': FieldInfo(annotation=int, required=True), 'link_density': FieldInfo(annotation=float, required=True), 'max_var_1': FieldInfo(annotation=float, required=True), 'max_var_2': FieldInfo(annotation=float, required=True), 'method': FieldInfo(annotation=str, required=True), 'missing_interface_residues_4A': FieldInfo(annotation=int, required=True), 'missing_interface_residues_8A': FieldInfo(annotation=int, required=True), 'n_residue_pairs': FieldInfo(annotation=int, required=True), 'n_residues': FieldInfo(annotation=int, required=True), 'num_atom_types': FieldInfo(annotation=int, required=True), 'number_of_components_1': FieldInfo(annotation=int, required=True), 'number_of_components_2': FieldInfo(annotation=int, required=True), 'oligomeric_count': FieldInfo(annotation=int, required=True), 'oligomeric_details': FieldInfo(annotation=str, required=True), 'pdb_strand_id_L': FieldInfo(annotation=str, required=True), 'pdb_strand_id_R': FieldInfo(annotation=str, required=True), 'planarity': FieldInfo(annotation=float, required=True), 'polar_polar_contacts': FieldInfo(annotation=int, required=True), 'probability': FieldInfo(annotation=float, required=True), 'release_date': FieldInfo(annotation=str, required=True), 'resolution': FieldInfo(annotation=float, required=True)}#

Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.

This replaces Model.__fields__ from Pydantic V1.
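
The interface_atom_gaps_* and missing_interface_residues_* fields of MetadataEntry depend on gaps inferred from residue numbering. A minimal sketch of that gap detection follows; `find_gaps` and the residue numbers are hypothetical, and the actual annotation pipeline may treat insertion codes or chain breaks differently.

```python
def find_gaps(residue_numbers: list[int]) -> list[tuple[int, int]]:
    """Return (last_before_gap, first_after_gap) wherever the numbering skips."""
    ordered = sorted(set(residue_numbers))
    # A gap exists between consecutive resolved residues whose numbers differ by more than 1.
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > 1]


# Hypothetical resolved residue numbers for one chain.
resolved = [1, 2, 3, 7, 8, 12]
gaps = find_gaps(resolved)
print(gaps)  # [(3, 7), (8, 12)]
```

Counting interface atoms or residues within 4Å or 8Å of the residues flanking each gap would then require coordinates, which this sketch leaves out.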

pinder.core.index.utils.fix_index(df_index: DataFrame, index_fields: list[str], cast_types: bool = True) → DataFrame

Fix the index dataframe according to the IndexEntry schema.

pinder.core.index.utils.fix_metadata(df_metadata: DataFrame, metadata_fields: list[str], cast_types: bool = True) → DataFrame

Fix the metadata dataframe according to the MetadataEntry schema.

pinder.core.index.utils.downcast_dtypes(df: DataFrame, str_as_cat: bool = True) → DataFrame

pinder.core.index.utils.set_mapping_column_types(mapping_df: DataFrame) → DataFrame

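Set the column types of the mapping dataframe so that integer-valued columns are not stored as floats (for example 1.0, 2.0). The nullable Int64 dtype supports missing values, so columns containing NaNs can still be cast. This function requires pandas 1.0+.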

Parameters:
  • mapping_df (pd.DataFrame) – The dataframe whose column types are to be set.
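
The effect of the nullable Int64 dtype can be seen in a small pandas example. This is a sketch of the general technique only, not the body of set_mapping_column_types: the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical mapping table; the missing value forces "resi" to float64.
mapping_df = pd.DataFrame(
    {
        "resi": [1.0, 2.0, None],
        "resi_pdb": [10.0, 11.0, 12.0],
        "chain": ["A", "A", "B"],
    }
)
print(mapping_df["resi"].dtype)  # float64

# Cast the integer-valued columns to the nullable Int64 extension dtype.
# Missing values become <NA> instead of forcing a float column.
int_columns = ["resi", "resi_pdb"]
mapping_df = mapping_df.astype({col: "Int64" for col in int_columns})
print(mapping_df.dtypes)
```

Note the capital "I": "Int64" is the pandas nullable extension dtype, whereas "int64" is the NumPy dtype, which cannot hold missing values.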

Module contents#