pinder.core.loader package
Submodules
pinder.core.loader.dataset module
Construct torch datasets and dataloaders from pinder systems.
This module provides two example implementations of how to integrate the pinder dataset into a torch-based machine learning pipeline:
- PinderDataset: A torch Dataset that can be used with torch DataLoaders.
- PPIDataset: A torch-geometric Dataset, designed to be used with the torch_geometric package and its DataLoaders.
Together, the two datasets provide an example implementation of how to abstract away the complexity of loading and processing multiple structures associated with each PinderSystem by leveraging the following utilities from pinder:
- pinder.core.PinderLoader
- pinder.core.loader.filters
- pinder.core.loader.transforms
The examples cover two different batch data item structures to illustrate two different use-cases:
- PinderDataset: A batch of (target_complex, feature_complex) pairs, where target_complex and feature_complex are torch.Tensor objects representing the atomic coordinates and atom types of the holo and sampled (decoy, holo/apo/pred) complexes, respectively.
- PPIDataset: A batch of PairedPDB objects, where the receptor and ligand are encoded separately in a heterogeneous graph, via torch_geometric.data.HeteroData, holding multiple node and/or edge types in disjunct storage objects.
- pinder.core.loader.dataset.structure2tensor_transform(structure: Structure) -> dict[str, Tensor]
- pinder.core.loader.dataset.pad_to_max_length(mat: Tensor, max_length: int | Sequence[int] | Tensor, dims: Sequence[int], value: int | float | None = None) -> Tensor
Takes a tensor and pads it to maximum length with right padding on the specified dimensions.
- Parameters:
mat (Tensor) – The tensor to pad. Can be of any shape.
max_length (int | Sequence[int] | Tensor) – The size of the tensor along specified dimensions after padding.
dims (Sequence[int]) – The dimensions to pad. Must have the same number of elements as max_length.
value (int | float | None, optional) – The value to pad with, by default None
- Returns:
- The padded tensor. Below are examples of input and output shapes
- Example 1:
input: (2, 3, 4), max_length: 5, dims: [0, 2] output: (5, 3, 5)
- Example 2:
input: (2, 3, 4), max_length: 5, dims: [0] output: (5, 3, 4)
- Example 3:
input: (2, 3, 4), max_length: [5, 7], dims: [0, 2] output: (5, 3, 7)
- Return type:
Tensor
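The three shape examples above can be reproduced with a small shape calculator. This is only a sketch of the documented shape rule and does not call pinder or torch; the helper name padded_shape is hypothetical, and rejecting a target shorter than the existing size is an assumption, since the behaviour of pad_to_max_length in that case is not stated here.

```python
from typing import Sequence, Tuple, Union


def padded_shape(
    shape: Sequence[int],
    max_length: Union[int, Sequence[int]],
    dims: Sequence[int],
) -> Tuple[int, ...]:
    """Return the shape a tensor of `shape` would have after right padding.

    Hypothetical helper: it mirrors the documented shape rule only.
    """
    # A single int applies the same target length to every listed dimension.
    if isinstance(max_length, int):
        targets = [max_length] * len(dims)
    else:
        targets = list(max_length)
    if len(targets) != len(dims):
        raise ValueError("max_length and dims must have the same number of elements")

    out = list(shape)
    for dim, target in zip(dims, targets):
        # Assumption: padding never shrinks a dimension, so reject that case.
        if target < out[dim]:
            raise ValueError(f"dimension {dim} is already longer than {target}")
        out[dim] = target
    return tuple(out)


example_1 = padded_shape((2, 3, 4), 5, [0, 2])
example_2 = padded_shape((2, 3, 4), 5, [0])
example_3 = padded_shape((2, 3, 4), [5, 7], [0, 2])
```

Dimensions not listed in dims keep their original size, which is why the middle dimension stays at 3 in all three cases.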
- pinder.core.loader.dataset.pad_and_stack(tensors: list[Tensor], dim: int = 0, dims_to_pad: list[int] | None = None, value: int | float | None = None) -> Tensor
Pads a list of tensors to the maximum length observed along each dimension and then stacks them along a new dimension (given by dim).
- Parameters:
tensors (list[Tensor]) – A list of tensors to pad and stack
dim (int) – The new dimension to stack along.
dims_to_pad (list[int] | None) – The dimensions to pad
value (int | float | None, optional) – The value to pad with, by default None
- Returns:
- The padded and stacked tensor. Below are examples of input and output shapes
- Example 1: Sequence features (although redundant with torch.nn.utils.rnn.pad_sequence)
input: [(2,), (7,)], dim: 0 output: (2, 7)
- Example 2: Pair features (e.g., pairwise coordinates)
input: [(4, 4, 3), (7, 7, 3)], dim: 0 output: (2, 7, 7, 3)
- Return type:
Tensor
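The pad-then-stack idea can be sketched in plain Python without torch. Both helpers below are hypothetical illustrations rather than the pinder implementation: stacked_shape computes the resulting shape assuming every dimension is padded to its observed maximum (how dims_to_pad=None is handled by the real function is not stated here), and pad_and_stack_1d performs the operation on 1-D lists with an explicit pad value.

```python
from typing import List, Sequence, Tuple


def stacked_shape(shapes: Sequence[Tuple[int, ...]], dim: int = 0) -> Tuple[int, ...]:
    """Shape after padding every dimension to its maximum and stacking at `dim`."""
    if not shapes:
        raise ValueError("need at least one shape")
    rank = len(shapes[0])
    if any(len(s) != rank for s in shapes):
        raise ValueError("all inputs must have the same number of dimensions")
    # Longest extent observed along each axis across the whole list.
    padded = [max(s[axis] for s in shapes) for axis in range(rank)]
    # Stacking adds a new axis whose size is the number of inputs.
    padded.insert(dim, len(shapes))
    return tuple(padded)


def pad_and_stack_1d(seqs: Sequence[Sequence[float]], value: float) -> List[List[float]]:
    """Right-pad 1-D sequences with `value` and stack them as rows."""
    longest = max(len(s) for s in seqs)
    return [list(s) + [value] * (longest - len(s)) for s in seqs]


sequence_shape = stacked_shape([(2,), (7,)], dim=0)
pair_shape = stacked_shape([(4, 4, 3), (7, 7, 3)], dim=0)
rows = pad_and_stack_1d([[1, 2], [3, 4, 5]], value=0)
```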
- pinder.core.loader.dataset.collate_complex(structures: list[dict[str, Tensor]], coords_pad_value: int = -100, atom_type_pad_value: int = -1, residue_id_pad_value: int = -99, chain_id_pad_value: int = -1, element_type_pad_value: int = -1) -> dict[str, Tensor]
- pinder.core.loader.dataset.collate_batch(batch: list[dict[str, dict[str, Tensor] | str]]) -> dict[str, dict[str, Tensor] | list[str]]
Collate a batch of PinderDataset items into a merged mini-batch of Tensors.
Used as the default collate_fn for the torch DataLoader consuming PinderDataset.
- Parameters:
batch (list[dict[str, dict[str, Tensor] | str]]) – A list of dictionaries containing the data for each item in the batch.
- Returns:
A dictionary containing the merged Tensors for the batch.
- Return type:
dict[str, dict[str, Tensor] | list[str]]
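The input and return types suggest a merge in which string fields are gathered into lists and nested feature dictionaries are merged field by field. The sketch below illustrates that structure with plain lists standing in for tensors; it is not the pinder implementation. The function name collate_items and the keys "id" and "atom_coordinates" are assumptions made for illustration, and the padding and stacking that a real collate function needs is left out.

```python
from typing import Any, Dict, List


def collate_items(batch: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge per-item dicts into one dict keyed the same way (hypothetical helper)."""
    merged: Dict[str, Any] = {}
    for key in batch[0]:
        values = [item[key] for item in batch]
        if isinstance(values[0], dict):
            # Nested feature dicts: gather each field across the batch.
            # A tensor-based version would pad and stack here instead.
            merged[key] = {field: [v[field] for v in values] for field in values[0]}
        else:
            # Plain values such as string identifiers become a list.
            merged[key] = values
    return merged


batch = [
    {"id": "system_a", "target_complex": {"atom_coordinates": [[0.0, 0.0, 0.0]]}},
    {
        "id": "system_b",
        "target_complex": {"atom_coordinates": [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]},
    },
]
merged = collate_items(batch)
```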
- class pinder.core.loader.dataset.PinderDataset(split: str | None = None, index: ~pandas.core.frame.DataFrame | None = None, metadata: ~pandas.core.frame.DataFrame | None = None, monomer_priority: str = 'holo', base_filters: list[~pinder.core.loader.filters.PinderFilterBase] = [], sub_filters: list[~pinder.core.loader.filters.PinderFilterSubBase] = [], structure_filters: list[~pinder.core.loader.filters.StructureFilter] = [], structure_transforms_target: list[~pinder.core.loader.transforms.StructureTransform] = [], structure_transforms_feature: list[~pinder.core.loader.transforms.StructureTransform] = [], transform: ~typing.Callable[[~pinder.core.loader.structure.Structure], ~torch.Tensor | dict[str, ~torch.Tensor]] = <function structure2tensor_transform>, target_transform: ~typing.Callable[[~pinder.core.loader.structure.Structure], ~torch.Tensor | dict[str, ~torch.Tensor]] = <function structure2tensor_transform>, ids: list[str] | None = None, fallback_to_holo: bool = True, use_canonical_apo: bool = True, crop_equal_monomer_shapes: bool = True, index_query: str | None = None, metadata_query: str | None = None, pre_specified_monomers: dict[str, str] | ~pandas.core.frame.DataFrame | None = None, **kwargs: ~typing.Any)[source][source]#
Bases:
Dataset
- class pinder.core.loader.dataset.PPIDataset(node_types: set[NodeRepresentation], split: str = 'train', monomer1: str = 'holo_receptor', monomer2: str = 'holo_ligand', base_filters: list[PinderFilterBase] = [], sub_filters: list[PinderFilterSubBase] = [], structure_filters: list[StructureFilter] = [], root: Path = PosixPath('/home/runner/.local/share/pinder/2024-02'), transform: Callable[[PairedPDB], PairedPDB] | None = None, pre_transform: Callable[[PairedPDB], PairedPDB] | None = None, pre_filter: Callable[[PinderSystem], PinderSystem | bool] | None = None, limit_by: int | None = None, force_reload: bool = False, filenames_dir: Path | str | None = None, repeat: int = 1, use_cache: bool = False, ids: list[str] | None = None, add_edges: bool = True, k: int = 10, parallel: bool = False, max_workers: int | None = None, fallback_to_holo: bool = True, crop_equal_monomer_shapes: bool = True)[source][source]#
Bases:
Dataset
- property raw_file_names: list[str]
The name of the files in the self.raw_dir folder that must be present in order to skip downloading.
- property processed_file_names: list[Path]
The name of the files in the self.processed_dir folder that must be present in order to skip processing.
- static process_single_file(system: PinderSystem, node_types: set[NodeRepresentation], monomer1: str, monomer2: str, output_file: Path, add_edges: bool = True, k: int = 10, fallback_to_holo: bool = True, pre_transform: Callable[[PairedPDB], PairedPDB] | None = None) bool [source]#
- static process_single_file_parallel(args: tuple[int, PinderLoader, set[NodeRepresentation], Path, str, str, bool, int, bool, bool, Callable[[PinderSystem], PinderSystem | bool] | None, Callable[[PairedPDB], PairedPDB] | None]) str | None [source]#
- pinder.core.loader.dataset.get_geo_loader(dataset: PPIDataset, batch_size: int = 2, num_workers: int = 1) -> DataLoader
- pinder.core.loader.dataset.get_torch_loader(dataset: PinderDataset, batch_size: int = 2, shuffle: bool = True, sampler: 'Sampler[PinderDataset]' | None = None, num_workers: int = 1, collate_fn: Callable[[list[dict[str, Any]]], dict[str, Any]] = <function collate_batch>, **kwargs: Any) -> DataLoader[PinderDataset]
pinder.core.loader.filters module
- class pinder.core.loader.filters.PinderFilterBase[source][source]#
Bases:
object
- filter(ps: PinderSystem) bool [source]#
- class pinder.core.loader.filters.MinAtomTypesFilter(min_atom_types: int = 3)[source][source]#
Bases:
StructureFilter
- class pinder.core.loader.filters.FilterMetadataFields(**kwargs: tuple[str, float | str | bool | int | None])[source][source]#
Bases:
PinderFilterBase
- filter(ps: PinderSystem) bool [source]#
- class pinder.core.loader.filters.ResidueCount(min_residue_count: int | None = None, max_residue_count: int | None = None, count_hetero: bool = False)[source][source]#
Bases:
ChainQuery
- class pinder.core.loader.filters.AtomTypeCount(min_atom_type: int | None = None, max_atom_type: int | None = None, count_hetero: bool = False)[source][source]#
Bases:
ChainQuery
- class pinder.core.loader.filters.CompleteBackBone(fraction: float = 0.9)[source][source]#
Bases:
ChainQuery
- class pinder.core.loader.filters.CheckChainElongation(max_var_contribution: float = 0.92)[source][source]#
Bases:
ChainQuery
- class pinder.core.loader.filters.DetachedChainQuery(radius: int = 12, max_components: int = 2)[source][source]#
Bases:
ChainQuery
- class pinder.core.loader.filters.CheckContacts(min_contacts: int = 5, radius: float = 10.0, calpha_only: bool = True, backbone_only: bool = True, heavy_only: bool = True)[source][source]#
Bases:
DualChainQuery
- class pinder.core.loader.filters.FilterByResidueCount(**kwargs: int | None | bool)[source][source]#
Bases:
PinderFilterBase
- filter(ps: PinderSystem) bool [source]#
Filter by residue count in holo monomers.
Examples
>>> from pinder.core import PinderSystem
>>> pinder_id = "1df0__A1_Q07009--1df0__B1_Q64537"
>>> ps = PinderSystem(pinder_id)
>>> res_filter = FilterByResidueCount(min_residue_count=10, max_residue_count=500)
>>> res_filter(ps)
False
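The example calls the filter object directly, which suggests that calling a filter delegates to its filter method. The sketch below shows that pattern with stand-in classes so it runs without pinder or its data. ToySystem and ToyResidueCountFilter are hypothetical names, and requiring every monomer to fall within the bounds is an assumption about the rule, not a statement of how FilterByResidueCount is implemented.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ToySystem:
    """Hypothetical stand-in for PinderSystem: residue count per holo monomer."""

    residue_counts: Dict[str, int]


class ToyResidueCountFilter:
    """Hypothetical boolean filter; True means the system is kept."""

    def __init__(
        self,
        min_residue_count: Optional[int] = None,
        max_residue_count: Optional[int] = None,
    ) -> None:
        self.min_residue_count = min_residue_count
        self.max_residue_count = max_residue_count

    def filter(self, system: ToySystem) -> bool:
        # Assumption: every monomer must satisfy both bounds.
        for count in system.residue_counts.values():
            if self.min_residue_count is not None and count < self.min_residue_count:
                return False
            if self.max_residue_count is not None and count > self.max_residue_count:
                return False
        return True

    def __call__(self, system: ToySystem) -> bool:
        # Calling the object is shorthand for calling filter().
        return self.filter(system)


res_filter = ToyResidueCountFilter(min_residue_count=10, max_residue_count=500)
kept = res_filter(ToySystem({"R": 120, "L": 85}))
dropped = res_filter(ToySystem({"R": 120, "L": 640}))
```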
- class pinder.core.loader.filters.FilterByMissingHolo[source][source]#
Bases:
PinderFilterBase
- filter(ps: PinderSystem) bool [source]#
- class pinder.core.loader.filters.FilterSubByContacts(min_contacts: int = 5, radius: float = 10.0, calpha_only: bool = True, backbone_only: bool = True, heavy_only: bool = True)[source][source]#
Bases:
PinderFilterBase
- filter(ps: PinderSystem) bool [source]#
- class pinder.core.loader.filters.FilterByHoloElongation(max_var_contribution: float = 0.92)[source][source]#
Bases:
PinderFilterBase
- filter(ps: PinderSystem) bool [source]#
- class pinder.core.loader.filters.FilterDetachedHolo(radius: int = 12, max_components: int = 2)[source][source]#
Bases:
PinderFilterBase
- filter(ps: PinderSystem) bool [source]#
- class pinder.core.loader.filters.PinderFilterSubBase[source][source]#
Bases:
object
- filter(ps: PinderSystem) PinderSystem [source]#
- static filter_by_chain_query(ps: PinderSystem, chain_query: ChainQuery, update_monomers: bool = True) PinderSystem [source]#
- class pinder.core.loader.filters.FilterSubLengths(min_length: int = 0, max_length: int = 1000)[source][source]#
Bases:
PinderFilterSubBase
- filter(ps: PinderSystem, update_monomers: bool = True) PinderSystem [source]#
- class pinder.core.loader.filters.FilterSubRmsds(rmsd_cutoff: float = 7.5)[source][source]#
Bases:
PinderFilterSubBase
- filter(ps: PinderSystem, update_monomers: bool = True) PinderSystem [source]#
- class pinder.core.loader.filters.FilterByHoloOverlap(min_overlap: int = 5)[source][source]#
Bases:
PinderFilterSubBase
- filter(ps: PinderSystem, update_monomers: bool = True) PinderSystem [source]#
- class pinder.core.loader.filters.FilterByHoloSeqIdentity(min_sequence_identity: float = 0.8)[source][source]#
Bases:
PinderFilterSubBase
- filter(ps: PinderSystem, update_monomers: bool = True) PinderSystem [source]#
- class pinder.core.loader.filters.FilterSubByAtomTypes(min_atom_types: int = 4)[source][source]#
Bases:
PinderFilterSubBase
- filter(ps: PinderSystem, update_monomers: bool = True) PinderSystem [source]#
- class pinder.core.loader.filters.FilterSubByChainQuery(chain_query: ChainQuery)[source][source]#
Bases:
PinderFilterSubBase
- filter(ps: PinderSystem, update_monomers: bool = True) PinderSystem [source]#
- class pinder.core.loader.filters.FilterByElongation(max_var_contribution: float = 0.92)[source][source]#
Bases:
PinderFilterSubBase
- filter(ps: PinderSystem, update_monomers: bool = True) PinderSystem [source]#
- class pinder.core.loader.filters.FilterDetachedSub(radius: int = 12, max_components: int = 2)[source][source]#
Bases:
PinderFilterSubBase
- filter(ps: PinderSystem, update_monomers: bool = True) PinderSystem [source]#
pinder.core.loader.geodata module
- pinder.core.loader.geodata.structure2tensor(atom_coordinates: ndarray[Any, dtype[float64]] | None = None, atom_types: ndarray[Any, dtype[str_]] | None = None, element_types: ndarray[Any, dtype[str_]] | None = None, residue_coordinates: ndarray[Any, dtype[float64]] | None = None, residue_ids: ndarray[Any, dtype[int64]] | None = None, residue_types: ndarray[Any, dtype[str_]] | None = None, chain_ids: ndarray[Any, dtype[str_]] | None = None, dtype: dtype = torch.float32) dict[str, Tensor] [source][source]#
- class pinder.core.loader.geodata.NodeRepresentation(value)[source][source]#
Bases:
Enum
An enumeration.
- Surface = 'surface'#
- Atom = 'atom'#
- Residue = 'residue'#
- class pinder.core.loader.geodata.PairedPDB(_mapping: Dict[str, Any] | None = None, **kwargs)[source][source]#
Bases:
HeteroData
- classmethod from_pinder_system(system: PinderSystem, node_types: set[NodeRepresentation], monomer1: str = 'holo_receptor', monomer2: str = 'holo_ligand', add_edges: bool = True, k: int = 10, fallback_to_holo: bool = True) PairedPDB [source]#
pinder.core.loader.loader module
- pinder.core.loader.loader.get_systems(systems: list[str]) -> Generator[PinderSystem, PinderSystem, None]
- pinder.core.loader.loader.get_available_monomers(row: Series) -> dict[str, list[str]]
Get the available monomers for a given row of the index.
- Parameters:
row (pd.Series) – A row of the pinder index representing a pinder system.
- Returns:
A dictionary mapping dimer body (R or L) to a list of available monomer types (apo, predicted, or holo).
- Return type:
dict[str, list[str]]
- pinder.core.loader.loader.get_alternate_apo_codes(row: Series, side: str) -> list[str]
Get the list of non-canonical (alternate) apo PDB codes for the specified dimer side (R or L).
- Parameters:
row (pd.Series) – A row of the pinder index representing a pinder system.
side (str) – The dimer side, R or L, representing receptor or ligand, respectively.
- Returns:
- A list of 4-letter PDB codes for all alternate apo monomers (when available).
The codes can be used to select an alternate apo monomer when working with PinderSystem objects. When no alternate apo monomers exist, returns an empty list.
- Return type:
list[str]
- pinder.core.loader.loader.select_monomer(row: Series, monomer_priority: str = 'holo', fallback_to_holo: bool = True, canonical_apo: bool = True) -> dict[str, str]
Select a monomer type to use for the receptor and ligand in a given pinder dimer system.
- Parameters:
row (pd.Series) – A row of the pinder index representing a pinder system.
monomer_priority (str, optional) – The monomer priority to use. Defaults to “holo”. Allowed values are “apo”, “holo”, “pred”, “random” or “random_mixed”. See the note about the random and random_mixed options.
fallback_to_holo (bool, optional) – Whether to fall back to the holo monomer when no other monomer is available. Defaults to True.
canonical_apo (bool, optional) – Whether to use the canonical apo monomer when the apo monomer type is available and selected. Defaults to True. To sample non-canonical apo monomers, set this value to False.
- Returns:
- A dictionary mapping dimer body (R or L) to the selected monomer type (apo, predicted, or holo).
If non-canonical apo monomers are selected, the dictionary values will point to the apo PDB code to load. See PinderSystem for more details on how the apo PDB code is used.
- Return type:
dict[str, str]
Note
The allowed values for monomer_priority are “apo”, “holo”, “pred”, “random” or “random_mixed”.
When monomer_priority is set to one of the available monomer types (holo, apo, pred), the same monomer type will be selected for both receptor and ligand.
When the monomer priority is “random”, a random monomer type will be selected from the set of monomer types available for both the receptor and ligand. This option ensures the same type of monomer is used for the receptor and ligand.
When the monomer priority is “random_mixed”, a random monomer type will be selected for each of receptor and ligand, separately.
With the fallback_to_holo option enabled (the default), the selection silently falls back to holo when monomer_priority is set to apo or pred but the corresponding monomer is not available for the dimer. This is useful when only one of receptor or ligand has an unbound monomer, but you wish to include apo or predicted structures in your workflow. If fallback_to_holo is disabled, an error is raised in that case instead.
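The rules in this note can be sketched as a small selection function. This is an illustration of the described behaviour, not the select_monomer implementation: choose_monomers is a hypothetical name, it takes a dictionary of available monomer types per side instead of an index row, it labels the predicted type "pred" throughout, it ignores the canonical_apo option, and it assumes that holo is always available and that fallback is applied per side.

```python
import random
from typing import Dict, List, Optional

MONOMER_TYPES = ("holo", "apo", "pred")


def choose_monomers(
    available: Dict[str, List[str]],
    monomer_priority: str = "holo",
    fallback_to_holo: bool = True,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Pick a monomer type for receptor (R) and ligand (L). Hypothetical sketch."""
    rng = rng or random.Random()
    sides = ("R", "L")

    if monomer_priority == "random":
        # One random type drawn from those available on BOTH sides.
        shared = sorted(set(available["R"]) & set(available["L"]))
        choice = rng.choice(shared)
        return {side: choice for side in sides}

    if monomer_priority == "random_mixed":
        # Each side draws independently from its own available types.
        return {side: rng.choice(sorted(available[side])) for side in sides}

    if monomer_priority not in MONOMER_TYPES:
        raise ValueError(f"unknown monomer_priority: {monomer_priority!r}")

    selected: Dict[str, str] = {}
    for side in sides:
        if monomer_priority in available[side]:
            selected[side] = monomer_priority
        elif fallback_to_holo:
            # Assumption: fallback applies only to the side missing the monomer.
            selected[side] = "holo"
        else:
            raise ValueError(f"no {monomer_priority} monomer available for side {side}")
    return selected


available = {"R": ["holo", "apo"], "L": ["holo"]}
apo_with_fallback = choose_monomers(available, "apo")
same_random = choose_monomers(available, "random")
mixed = choose_monomers(available, "random_mixed", rng=random.Random(0))
```

In this toy input only the receptor has an apo monomer, so "random" can only return holo for both sides, while "random_mixed" may pair an apo receptor with a holo ligand.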
- class pinder.core.loader.loader.PinderLoader(split: str | None = None, ids: list[str] | None = None, index: DataFrame | None = None, metadata: DataFrame | None = None, subset: DatasetName | None = None, base_filters: list[PinderFilterBase] = [], sub_filters: list[PinderFilterSubBase] = [], structure_filters: list[StructureFilter] = [], structure_transforms_target: list[StructureTransform] = [], structure_transforms_feature: list[StructureTransform] = [], index_query: str | None = None, metadata_query: str | None = None, writer: PinderWriterBase | None = None, monomer_priority: str = 'holo', fallback_to_holo: bool = True, use_canonical_apo: bool = True, crop_equal_monomer_shapes: bool = True, max_load_attempts: int = 10, pre_specified_monomers: dict[str, str] | DataFrame | None = None)[source][source]#
Bases:
object
- static apply_dimer_filters(dimer: PinderSystem, base_filters: list[PinderFilterBase] | list[None] = [], sub_filters: list[PinderFilterSubBase] | list[None] = []) PinderSystem | bool [source]#
pinder.core.loader.structure module
- class pinder.core.loader.structure.Structure(filepath: Path, uniprot_map: Path | None | pd.DataFrame | None = None, pinder_id: str | None = None, atom_array: AtomArray = None, pdb_engine: str = 'fastpdb')[source][source]#
Bases:
object
- filepath: Path#
- uniprot_map: Path | None | DataFrame = None#
- pinder_id: str | None = None#
- atom_array: AtomArray = None#
- pdb_engine: str = 'fastpdb'#
- to_pdb(filepath: Path | None = None) None [source]#
Write the Structure AtomArray to a PDB file.
- Parameters:
filepath (Path | None) – Filepath to output PDB. If not provided, will write to self.filepath, potentially overwriting if the file already exists!
- Returns:
- None
- filter(property: str, mask: Iterable[bool | int | str], copy: bool = True, negate: bool = False) Structure | None [source]#
- align_common_sequence(other: Structure, copy: bool = True, remove_differing_atoms: bool = True, renumber_residues: bool = False, remove_differing_annotations: bool = False) tuple[Structure, Structure] [source]#
- get_contacts(radius: float = 5.0, heavy_only: bool = False, backbone_only: bool = False) set[tuple[str, str, int, int]] [source]#
- get_interface_mask(interface_residues: dict[str, list[int]], calpha_only: bool = True, remove_hetero: bool = False) ndarray[Any, dtype[bool_]] [source]#
- get_interface_residues(contacts: set[tuple[str, str, int, int]] | None = None, radius: float = 5.0, heavy_only: bool = False, backbone_only: bool = False, calpha_mask: bool = True) dict[str, list[int]] [source]#
- property uniprot_mapping: DataFrame | None#
The uniprot mapping for the structure, if available.
- Type:
pd.DataFrame | None
- property resolved_mapping: DataFrame | None#
The uniprot mapping for the structure filtered to resolved residues, if available.
- Type:
pd.DataFrame | None
- property resolved_pdb2uniprot: dict[int, int]#
Dictionary mapping PDB residue ID number to UniProt numbering, where resolved and mapping is available.
- Type:
dict[int, int]
- property resolved_uniprot2pdb: dict[int, int]#
Dictionary mapping UniProt residue numbering to PDB numbering, where resolved and mapping is available.
- Type:
dict[int, int]
- property coords: ndarray[Any, dtype[float64]]#
The coordinates of the atoms in the structure.
- Type:
ndarray[np.double]
- property dataframe: DataFrame#
The dataframe representation of the structure.
- Type:
pd.DataFrame
- property backbone_mask: ndarray[Any, dtype[bool_]]#
A logical mask for backbone atoms.
- Type:
ndarray[np.bool_]
- property calpha_mask: ndarray[Any, dtype[bool_]]#
A logical mask for alpha carbon atoms.
- Type:
ndarray[np.bool_]
- property n_atoms: int#
The number of atoms in the structure.
- Type:
int
- property chains: list[str]#
The list of chain IDs in the structure.
- Type:
list[str]
- property chain_sequence: dict[str, list[str]]#
The chain sequence dictionary, where keys are chain IDs and values are lists of residue codes.
- Type:
dict[str, list[str]]
- property sequence: str#
The amino acid sequence of the structure.
- Type:
str
- property fasta: str#
The fasta representation of the structure sequence.
- Type:
str
- property tokenized_sequence: torch.Tensor#
The tokenized sequence representation of the structure sequence.
- Type:
torch.Tensor
- property residue_names: list[str]#
The list of distinct residue names in the structure.
- Type:
list[str]
- property residues: list[int]#
The list of distinct residue IDs in the structure.
- Type:
list[int]
- property atom_names: list[str]#
The list of distinct atom names in the structure.
- Type:
list[str]
- property b_factor: list[float]#
A list of B-factor values for each atom in the structure.
- Type:
list[float]
- pinder.core.loader.structure.find_potential_interchain_bonded_atoms(structure: Structure, interface_res: dict[str, list[int]] | None = None, radius: float = 2.3) AtomArray [source][source]#
- pinder.core.loader.structure.mask_common_uniprot(mono_A: Structure, mono_B: Structure) tuple[Structure, Structure] [source][source]#
pinder.core.loader.transforms module
- class pinder.core.loader.transforms.TransformBase[source][source]#
Bases:
object
- transform(dimer: PinderSystem) PinderSystem [source]#
- class pinder.core.loader.transforms.SelectAtomTypes(atom_types: list[str] = ['CA'])[source][source]#
Bases:
StructureTransform
- class pinder.core.loader.transforms.SuperposeToReference(reference_type: str = 'holo')[source][source]#
Bases:
TransformBase
- transform(ppi: PinderSystem) PinderSystem [source]#
- class pinder.core.loader.transforms.RandomLigandTransform(max_translation: float = 10.0)[source][source]#
Bases:
StructureTransform
pinder.core.loader.utils module
pinder.core.loader.writer module
- class pinder.core.loader.writer.PinderWriterBase(output_path: Path)[source][source]#
Bases:
object
- write(dimer: PinderSystem) None [source]#
- class pinder.core.loader.writer.PinderDefaultWriter(output_path: Path)[source][source]#
Bases:
PinderWriterBase
- write(dimer: PinderSystem) None [source]#
- class pinder.core.loader.writer.PinderClusteredWriter(output_path: Path)[source][source]#
Bases:
PinderDefaultWriter
- write(dimer: PinderSystem) None [source]#