Changelog#
Pinder 2024-02 release#
New dataset release version 2024-02, with RCSB PDB entries ingested up until February 2024.
We've increased Foldseek's sensitivity and now use LDDT (new) to better capture alignments of flexible chains.
We've added unit tests and QC reports that sanity-check the splits using sequence alignments from MMseqs2, domain annotations from ECOD, and iAlign interface similarity, among other methods.
Improved alignment sensitivity by optimizing Foldseek parameters
We introduced an additional de-leaking step that searches for transitive neighbors in the interface alignment graph and uses them to remove leakage across clusters (see the sketch after this list).
With the re-run, we were able to increase the total dataset size by including newer RCSB entries that were not part of the previous ingest (which now extends to February 2024).
Most importantly, this increases the pinder-af2 holdout size from 69 to 180 systems.
In addition to the overall larger size, we further improved the de-leaking process by scoring pairs of dimers with Foldseek or MMseqs2 hits via iAlign to extract interface similarity.
The final pinder-af2 set comprises systems that come from a PDB entry released after the AF2 training date (time split) and have low or no interface similarity to AF2 training members.
There are 675 clusters in the time split and 180 systems in the final iAlign-deleaked set.
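To illustrate the transitive de-leaking step mentioned above, here is a minimal sketch that builds an interface alignment graph from pairwise hits and treats every interface reachable from a test interface (directly or through transitive neighbors) as potential leakage to remove from train. The column names (query_id, target_id, lddt) and the 0.6 cutoff are illustrative assumptions, not the pipeline's actual API.

```python
# Minimal sketch of transitive de-leaking on an interface alignment graph.
# Assumes a table of pairwise interface hits (Foldseek/MMseqs2) with columns
# query_id, target_id, lddt -- column names and threshold are illustrative.
import pandas as pd
import networkx as nx

def transitive_leakage(hits: pd.DataFrame, test_ids: set[str], lddt_cutoff: float = 0.6) -> set[str]:
    """Return all interfaces connected (directly or transitively) to a test interface."""
    # Keep only confident alignments as graph edges.
    edges = hits.loc[hits["lddt"] >= lddt_cutoff, ["query_id", "target_id"]]
    graph = nx.from_pandas_edgelist(edges, source="query_id", target="target_id")

    leaked: set[str] = set()
    for component in nx.connected_components(graph):
        # If any member of a connected component is a test interface, the whole
        # component is reachable from test and therefore potential leakage.
        if component & test_ids:
            leaked |= component
    # Train members sharing a component with test would be removed from train.
    return leaked - test_ids
```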
Updated dataset counts#
| Dataset | Size | PDB IDs | Clusters | Apo pairs | AF2 pairs | Apo ≥ 1 | AF2 ≥ 1 | Description |
|---|---|---|---|---|---|---|---|---|
| pinder_xl | 1,955 | 1,955 | 1,955 | 342 | 1,747 | 436 | 1,775 | Full test set, 1,955 cluster representatives |
| pinder_s | 250 | 250 | 250 | 93 | 250 | 143 | 250 | A smaller subset of PINDER-XL, comprised of 250 clusters (188 heterodimers and 62 homodimers) sampled for diverse UniProt and Pfam annotations, with full AFDB coverage, to evaluate methods for which sampling from the full set is too slow |
| pinder_af2 | 180 | 180 | 180 | 30 | 127 | 37 | 133 | A smaller subset of PINDER-XL, comprised of 180 clusters (30 with paired apo structures and 131 with paired AFDB structures), de-leaked against the AF2MM training set with a more rigorous process that removes members with interfaces similar to AF2MM training members (as determined by iAlign), to evaluate methods against AF2MM |
| val | 1,958 | 1,958 | 1,958 | 342 | 1,789 | 441 | 1,817 | Validation set, consisting of 1,958 cluster representatives |
| train | 1,560,682 | 62,706 | 42,220 | 136,498 | 566,171 | 274,194 | 621,276 | Training set, consisting of 1,560,682 dimers from 42,220 clusters |
Diversity stats#
Test + val size#
Pinder cluster-level diversity: 12.41%
Acceptance criteria: test + val > 5-10% of the overall number of clusters
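As a rough illustration, cluster-level diversity here can be computed as the fraction of all clusters whose members land in test or val. The column names (split, cluster_id) are assumptions about the index layout, not a documented API.

```python
# Hypothetical sketch: fraction of clusters covered by test + val.
# Assumes the pinder index has "split" and "cluster_id" columns.
import pandas as pd

index = pd.read_csv("index.csv.gz")
total_clusters = index["cluster_id"].nunique()
test_val_clusters = index.loc[index["split"].isin(["test", "val"]), "cluster_id"].nunique()
print(f"cluster-level diversity: {100 * test_val_clusters / total_clusters:.2f}%")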
Interface size distribution#
Interface residue length distributions between splits
Acceptance criteria: no specific criteria, we just want to ensure there is some overlap across splits
Planarity distribution#
Planarity distribution between splits (lower means more transient interaction)
Acceptance criteria: no specific criteria, we just want to ensure there is some overlap across splits
ECOD domain-domain diversity#
The interface ECOD domain is defined as (1) the ECOD ID with the largest overlap with the interface, where (2) at least 25% of the interface and a minimum of 10 residues must be covered (see the sketch after the stats below)
“Pair diversity”
test: 9.65%
val: 10.82%
test+val: 18.11%
Acceptance criteria: >5%
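A minimal sketch of that assignment rule, assuming per-residue ECOD annotations and a set of interface residue indices are already available; the data structures here are illustrative, not the pipeline's actual objects.

```python
# Hypothetical sketch of picking the interface ECOD domain:
# choose the ECOD ID covering the most interface residues, and require that
# it covers >= 25% of the interface and at least 10 residues.
from collections import Counter
from typing import Optional

def interface_ecod_domain(interface_residues: set[int], ecod_by_residue: dict[int, str]) -> Optional[str]:
    """Return the dominant ECOD ID for an interface, or None if coverage is too low."""
    hits = Counter(
        ecod_by_residue[res] for res in interface_residues if res in ecod_by_residue
    )
    if not hits:
        return None
    ecod_id, covered = hits.most_common(1)[0]
    if covered >= 10 and covered / len(interface_residues) >= 0.25:
        return ecod_id
    return None
```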
Both chains unique ECOD domain (unique binding site ratio)#
Percentage of test systems whose ECOD IDs (on both chains) are unique to test
We check whether each single domain exists in train, irrespective of pairs
If neither of the ECOD domains exists in train, then the system is completely unique to the test set (see the sketch below)
test: 30.91%
val: 33.15%
Acceptance criteria: ~1/3rd
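A sketch of that ratio under the same illustrative assumptions: each system carries the ECOD IDs assigned to its two chains, and we count test systems where neither ID appears anywhere in train.

```python
# Hypothetical sketch: fraction of test systems whose ECOD domains (both chains)
# never occur in the training split.
def unique_binding_site_ratio(test_pairs: list[tuple[str, str]], train_domains: set[str]) -> float:
    unique = sum(
        1 for ecod_a, ecod_b in test_pairs
        if ecod_a not in train_domains and ecod_b not in train_domains
    )
    return 100 * unique / len(test_pairs)
```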
ECOD single chain domain diversity#
ECOD domain diversity based on chains
“Single-chain diversity”
test: 18.47%
val: 21.14%
test+val: 33.83%
Pfam-clan distribution#
Pfam-clan level diversity (broader than ECOD families)
Note: not all chains are annotated
test: 52.65%
Leakage QC#
UniProt#
Intersection of UniProt pairs between train & test (see the sketch below)
test: 3.17%
val: 3.63%
Acceptance criteria: <5%
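A sketch of this check, assuming each system is reduced to an unordered pair of UniProt accessions; the names here are illustrative.

```python
# Hypothetical sketch: fraction of test systems whose (unordered) UniProt pair
# also occurs in the training split.
def uniprot_pair_leakage(test_pairs: list[tuple[str, str]], train_pairs: list[tuple[str, str]]) -> float:
    train_set = {frozenset(p) for p in train_pairs}
    leaked = sum(1 for p in test_pairs if frozenset(p) in train_set)
    return 100 * leaked / len(test_pairs)
```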
ECOD#
Percent of test set with same ECOD pair as train
test: 6.39%
val: 8.02%
Acceptance criteria: <10%
Sequence similarity (global)#
Test vs. subsampled train (one system per cluster, for clusters with more than 20 members: 4,658 total systems); see the subsampling sketch after this list
0.05% leakage (2/3,915) from sampling 0.24% of the training set (3,627 entries)
Subsampling 1.2% of the training set (a different set of 18,135 train systems) results in 0.07% leakage (3/3,915)
Acceptance criteria: N/A
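A sketch of the subsampling strategy described above (one representative from each sufficiently large training cluster), assuming a pandas index with split and cluster_id columns; the column names and filtering are illustrative.

```python
# Hypothetical sketch: subsample one training system per cluster,
# restricted to clusters with more than 20 members.
import pandas as pd

index = pd.read_csv("index.csv.gz")
train = index[index["split"] == "train"]
cluster_sizes = train["cluster_id"].value_counts()
big_clusters = cluster_sizes[cluster_sizes > 20].index
subsample = (
    train[train["cluster_id"].isin(big_clusters)]
    .groupby("cluster_id", group_keys=False)
    .sample(n=1, random_state=0)
)
```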
Parameter optimization#
Foldseek and MMseqs2 alignments are used directly to de-leak; therefore, checking leakage with these same methods will report no leakage unless the alignments produced differ from those used in our pipeline. This could happen because of bugs or inconsistencies between Foldseek/MMseqs2 runs. We performed Foldseek/MMseqs2 alignments between the subsampled train data and the test set and examined the following:
Alignment consistency with the same parameter settings
Alignment performance with different settings:
a. Tested Foldseek in slow-mode (performs TM-align for the final alignment instead of 3Di)
b. Tested different e-value cutoffs for both tools
c. Tested a higher "max n alignments per sequence" in MMseqs2
Test on single-chain ECOD homology classification:
a. An alignment counts as a true positive if both aligned chains have the same ECOD family
b. As a false positive if they have different ECOD families
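A sketch of that evaluation, assuming an alignment table with query/target chain IDs and a mapping from chain to ECOD family; the column names are illustrative.

```python
# Hypothetical sketch: score alignments as TP/FP using single-chain ECOD families.
import pandas as pd

def ecod_tp_fp(alignments: pd.DataFrame, ecod_family: dict[str, str]) -> tuple[int, int]:
    """Count alignments where both chains share an ECOD family (TP) vs. differ (FP)."""
    tp = fp = 0
    for query, target in alignments[["query_chain", "target_chain"]].itertuples(index=False):
        fam_q, fam_t = ecod_family.get(query), ecod_family.get(target)
        if fam_q is None or fam_t is None:
            continue  # skip chains without ECOD annotation
        if fam_q == fam_t:
            tp += 1
        else:
            fp += 1
    return tp, fp
```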
Alignment consistency:
Foldseek alignments yielded ~90% overlap between the old and new alignments (at a threshold > 0.6), with both runs having a similar total number of alignments above 0.6 LDDT.
MMseqs2 yielded a substantially higher number of alignments. This was due to the smaller number of sequences in the subset used, which gave more headroom under the "max n alignments per sequence" parameter (1000). Therefore, most of these are suspected to be short, unspecific alignments at the 30% identity cutoff used.
Alignment at different settings:
The only setting with a substantially different outcome was the use of "slow-mode" in Foldseek instead of the "fast-mode" used in our pipeline (which makes use of the Foldseek 3Di alphabet plus sequence similarity)
In general, we see more alignments above the LDDT threshold
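For reference, a hedged sketch of how the two modes might be invoked. The database paths are placeholders, and the flags shown (--alignment-type, -e, --format-output) reflect our understanding of the Foldseek CLI rather than the pipeline's exact command.

```python
# Hypothetical sketch comparing Foldseek fast-mode vs. slow-mode via subprocess.
# Paths are placeholders; flags reflect our understanding of the Foldseek CLI.
import subprocess

def run_foldseek(query_db: str, target_db: str, out: str, alignment_type: int) -> None:
    subprocess.run(
        [
            "foldseek", "easy-search", query_db, target_db, out, "tmp",
            "--alignment-type", str(alignment_type),  # 2 = 3Di+AA (fast), 1 = TM-align (slow)
            "-e", "0.001",
            "--format-output", "query,target,lddt",
        ],
        check=True,
    )

run_foldseek("test_pdbs/", "train_pdbs/", "fast_hits.tsv", alignment_type=2)
run_foldseek("test_pdbs/", "train_pdbs/", "slow_hits.tsv", alignment_type=1)
```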
Assessment of differences in terms of performance:
MMseqs2's contribution is minimal (possibly random) even with increased hits, confirming the hypothesis of short, unspecific hits (FP)
default max hits:
TP: 1012 FP: 750
increased max hits:
TP: 1088 FP: 29560
Foldseek fast-mode is better in terms of precision. This can be attributed to the VQ-VAE task used to train the Foldseek alphabet, which was based on SCOPe classification. This yields good alignments for relevant folds and suboptimal alignments for "unimportant" folds, which leads to a lower mean global LDDT compared to the TM-align pass.
fast-mode (alignment-type 2, 3Di+AA Gotoh-Smith-Waterman) at 0.5 lddt threshold:
TP: 2328 FP: 3586
slow-mode (alignment-type 1, TMalign) at 0.6 lddt threshold:
TP: 2200 FP: 3560
Conclusion:
Our current settings for both Foldseek and MMseqs2 are adequate; nothing needs to change.
The alignment graph cutoff for the Foldseek graph could be lowered to 0.5 or 0.55 to cover more TPs, but this may not make a substantial difference.
Pinder 2023-11 (patch update)#
Previously, the test set was defined by randomly sampling two accessions from each cluster
This was done without redundancy removal, leading to redundant systems across pinder-xl, pinder-s, and pinder-af2
This update removes redundant test members, defined as those having the same receptor and ligand UniProt IDs
The representative was selected based on the lowest number of interface atom gaps, the highest sequence resolution, and crystal resolution (in that order)
If both cluster members had identical characteristics, a representative was selected at random
The structural files in the bucket remain unchanged; however, the index file was updated to remove the redundant rows.
NOTE: To update to the new index file, remove the index.csv.gz from the location pointed to by your PINDER_BASE_DIR environment variable, or from the default location in ~/.local/share/pinder.
Alternatively, you can use the new CLI tool: pinder_update_index --help
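A minimal sketch of the manual route, assuming the cached index lives directly under the base directory described above; the exact file layout may differ, so treat the path as an assumption.

```python
# Hypothetical sketch: remove the cached index so the patched one is re-downloaded.
# Assumes the index is cached as <base_dir>/index.csv.gz; adjust if your layout differs.
import os
from pathlib import Path

base_dir = Path(os.environ.get("PINDER_BASE_DIR", Path.home() / ".local/share/pinder"))
cached_index = base_dir / "index.csv.gz"
if cached_index.exists():
    cached_index.unlink()
    print(f"Removed {cached_index}; it will be re-fetched on next use.")
```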
Updated test dataset counts#
| Dataset | Monomer | Current count | Previous count (with redundant test members) |
|---|---|---|---|
| pinder_xl | holo | 1455 | 1756 |
| pinder_xl | apo | 323 | 401 |
| pinder_xl | predicted | 1296 | 1575 |
| pinder_s | holo | 101 | 117 |
| pinder_s | apo | 59 | 71 |
| pinder_s | predicted | 90 | 106 |
| pinder_af2 | holo | 69 | 72 |
| pinder_af2 | apo | 15 | 15 |
| pinder_af2 | predicted | 50 | 52 |
Pinder 2023-11#
Initial release
PINDER_RELEASE=2023-11
Total dataset size#
# compressed
34G pdbs.zip
867M csvs.zip
# unpacked
156G pdbs
1.6G csvs
File manifest#
gs://pinder/2023-11/index.csv.gz
gs://pinder/2023-11/metadata.csv.gz
gs://pinder/2023-11/csvs.zip
gs://pinder/2023-11/csvs/
gs://pinder/2023-11/pdbs.zip
gs://pinder/2023-11/pdbs/