Submodules
module
- DataFrame, against: str = 'test') -> tuple[DataFrame, float]
Get the intersection of the uniprot pairs between the train and test/val splits
- Parameters:
split_index – the split index dataframe
against – the split to compare against, either “test” or “val”
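A minimal sketch of what such a comparison could look like, not the library's implementation. It assumes the split index has columns named `split`, `uniprot_R`, and `uniprot_L` (assumed names, not confirmed on this page) and that the returned float is the fraction of pairs in the compared split that also occur in train. The helper name `uniprot_pair_overlap` is hypothetical.

```python
import pandas as pd


def uniprot_pair_overlap(
    split_index: pd.DataFrame, against: str = "test"
) -> tuple[pd.DataFrame, float]:
    """Hypothetical helper: rows of `against` whose UniProt pair also occurs in train."""
    if against not in {"test", "val"}:
        raise ValueError("against must be 'test' or 'val'")
    df = split_index.copy()
    # Order-independent key so (A, B) and (B, A) count as the same pair.
    df["pair"] = [
        "--".join(sorted((r, l))) for r, l in zip(df["uniprot_R"], df["uniprot_L"])
    ]
    train_pairs = set(df.loc[df["split"] == "train", "pair"])
    other = df[df["split"] == against]
    shared = other[other["pair"].isin(train_pairs)]
    # Fraction of the compared split that overlaps with train (assumed meaning).
    frac = len(shared) / len(other) if len(other) else 0.0
    return shared, frac


# Toy index with assumed column names.
index = pd.DataFrame(
    {
        "split": ["train", "train", "test", "test"],
        "uniprot_R": ["P1", "P3", "P2", "P5"],
        "uniprot_L": ["P2", "P4", "P1", "P6"],
    }
)
shared, frac = uniprot_pair_overlap(index, against="test")
```

Sorting the two accessions before joining them is a design choice in this sketch: it treats receptor/ligand order as irrelevant when deciding whether a pair is shared.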
- DataFrame, pindex: DataFrame) -> tuple[DataFrame, set[str], set[str]]
Convert the metadata dataframe to a dataframe with ECOD pairs
- Parameters:
meta_data – the metadata dataframe
pindex – the pindex dataframe
- Returns:
the ECOD pairs dataframe; test_ids: the test IDs; val_ids: the val IDs
- Return type:
tuple[DataFrame, set[str], set[str]]
- DataFrame, min_intersection: int = 10, frac_interface_threshold: float = 0.25) -> DataFrame
Annotate the longest intersecting ECOD domain for each chain in the pair
- Parameters:
ecod_RL – the ECOD annotated dataframe
- Returns:
the ECOD annotated dataframe with the longest intersecting domain
- Return type:
DataFrame
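The selection described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the interface is modeled as a set of residue numbers, each ECOD domain as a residue range, and both thresholds are assumed to apply together (the overlap must reach `min_intersection` residues and cover at least `frac_interface_threshold` of the interface). The function name `longest_intersecting_domain` is hypothetical.

```python
from __future__ import annotations


def longest_intersecting_domain(
    interface: set[int],
    domains: dict[str, range],
    min_intersection: int = 10,
    frac_interface_threshold: float = 0.25,
) -> str | None:
    """Hypothetical helper: ID of the domain with the largest qualifying overlap."""
    if not interface:
        return None
    best_id, best_overlap = None, 0
    for domain_id, residues in domains.items():
        overlap = len(interface.intersection(residues))
        # Assumed rule: both thresholds must be satisfied.
        if overlap < min_intersection:
            continue
        if overlap / len(interface) < frac_interface_threshold:
            continue
        if overlap > best_overlap:
            best_id, best_overlap = domain_id, overlap
    return best_id


# 40 interface residues; e1 overlaps 10 of them, e2 overlaps 35.
interface = set(range(50, 90))
domains = {"e1": range(1, 60), "e2": range(55, 200)}
best = longest_intersecting_domain(interface, domains)
none_found = longest_intersecting_domain(interface, domains, min_intersection=50)
```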
- DataFrame, ecod_R: DataFrame, ecod_L: DataFrame) -> DataFrame
Merge the annotated interfaces with the pindex dataframe
- Parameters:
pindex – the pindex dataframe
ecod_R – the ECOD annotated dataframe for chain R
ecod_L – the ECOD annotated dataframe for chain L
- Returns:
the pindex dataframe with the ECOD annotations
- Return type:
DataFrame
- DataFrame, pindex: DataFrame, frac_interface_threshold: float = 0.25, min_intersection: int = 10, top_n: int | None = None) -> tuple[DataFrame, DataFrame, DataFrame, DataFrame, DataFrame]
Get the ECOD paired leakage between the test and train/val splits
- Parameters:
metadata – the metadata dataframe
pindex – the pindex dataframe
frac_interface_threshold – the fraction of the interface that must be covered by the ECOD domain
min_intersection (int) – the minimum length of an interface
top_n (int | None) – the number of leaked pairs to include in the output dataframe
- Returns:
the problem cases (potentially leaked ECOD accession pairs) in the test split; problem_cases_val: the problem cases (potentially leaked ECOD accession pairs) in the val split
- Return type:
tuple[DataFrame, DataFrame, DataFrame, DataFrame, DataFrame]
- DataFrame, against: str = 'test') -> tuple[DataFrame, float]
Obtains the chain level binding site leakage between the test/val and train splits
- str | None = None, metadata_file: str | None = None, frac_interface_threshold: float = 0.25, min_intersection: int = 10) -> tuple[DataFrame, DataFrame, DataFrame, DataFrame]
Extract ECOD paired binding site leakage for test and val splits.
- Parameters:
index_file (str | None) – Path to custom/intermediate index file, if not provided will use get_index().
metadata_file (str | None) – Path to custom/intermediate metadata file, if not provided will use get_metadata().
frac_interface_threshold (float) – Fraction of interface required to be covered by ECOD domain. Default is 0.25.
min_intersection (int) – Minimum required intersection between interface and ECOD domain. Default is 10.
- Returns:
The test and val split binding leakage, respectively.
- Return type:
tuple[pd.DataFrame, pd.DataFrame]
module
Calculate interface alignment scores via IS-align.
Used to handle similarity hit finding following initial alignments via Foldseek or MMseqs2.
For full method details, please see:
- str, query_pdb: Path, hit_id: str, hit_pdb: Path, config: IalignConfig = IalignConfig(rmsd_threshold=5.0, log_pvalue_threshold=-9.0, is_score_threshold=0.3, alignment_printout=0, speed_mode=1, min_residues=5, min_interface=5, distance_cutoff=10.0, output_prefix='output')) -> dict[str, str | float | int] | None
- DataFrame, pdb_root: Path = PosixPath('/home/runner/.local/share/pinder/2024-02/pdbs'), n_jobs: int = 48, config: IalignConfig = IalignConfig(rmsd_threshold=5.0, log_pvalue_threshold=-9.0, is_score_threshold=0.3, alignment_printout=0, speed_mode=1, min_residues=5, min_interface=5, distance_cutoff=10.0, output_prefix='output')) -> DataFrame
- DataFrame, batch_size: int = 1000, batch_offset: int = 0, overwrite: bool = False, cache_dir: Path = PosixPath('/home/runner/.local/share/pinder/2024-02/ialign_results'), n_jobs: int = 48, config: IalignConfig = IalignConfig(rmsd_threshold=5.0, log_pvalue_threshold=-9.0, is_score_threshold=0.3, alignment_printout=0, speed_mode=1, min_residues=5, min_interface=5, distance_cutoff=10.0, output_prefix='output')) -> None
module
- Path | str | None = None, metadata_file: Path | str | None = None, pfam_file: Path | str | None = None) -> tuple[DataFrame, DataFrame, DataFrame]
- DataFrame, metadata: DataFrame, frac_interface_threshold: float = 0.25, min_intersection: int = 10) -> DataFrame
- DataFrame, pfam_data: DataFrame) -> DataFrame
- DataFrame, metadata: DataFrame, output_dir: Path, top_n_pfams: int = 50) -> None
- DataFrame, output_dir: Path, top_n_pfams: int = 50) -> None
- DataFrame, output_dir: Path, top_n_pfams: int = 50) -> None
- DataFrame, metadata: DataFrame) -> DataFrame
- DataFrame) -> DataFrame
- DataFrame, metadata: DataFrame, output_dir: Path) -> None
- str | None = None, metadata_file: str | None = None, pfam_file: str | None = None, output_dir: str | Path = PosixPath('/home/runner/.local/share/pinder/2024-02/data/pfam_visualizations'), frac_interface_threshold: float = 0.25, min_intersection: int = 10, top_n_pfams: int = 50) -> None
Extract PFAM clan diversity and generate visualizations.
- Parameters:
index_file (str | None) – Path to custom/intermediate index file, if not provided will use get_index().
metadata_file (str | None) – Path to custom/intermediate metadata file, if not provided will use get_metadata().
pfam_file (str | None) – Optional path to Pfam data file, if not provided will fetch and cache it.
output_dir (str | Path) – Directory to store PFAM visualizations. Defaults to get_pinder_location() / ‘data/pfam_visualizations’.
frac_interface_threshold (float) – Fraction of interface required to be covered by ECOD domain. Default is 0.25.
min_intersection (int) – Minimum required intersection between interface and ECOD domain. Default is 10.
top_n_pfams (int) – Number of top Pfam clans to visualize. Default is 50.
- Returns:
None.
module
pinder_qc COMMAND
COMMAND is one of the following:
Extract ECOD paired binding site leakage for test and val splits.
Extract PFAM clan diversity and generate visualizations.
Extract sequence similarity / leakage by subsampling members in train split.
pinder_qc uniprot_leakage --help
pinder_qc uniprot_leakage
pinder_qc uniprot_leakage <flags>
-i, --index_path=INDEX_PATH
Type: Optional['str | None']
Default: None
-s, --split=SPLIT
Type: 'str'
Default: 'test'
module
- Path | str | None = None, metadata_file: Path | str | None = None) -> tuple[DataFrame, DataFrame]
- Path, pindex: DataFrame) -> DataFrame
- tuple[tuple[str, str], tuple[str, str], str, str], output_path: Path, chain_overlap_threshold: float = 0.3) -> tuple[Alignment, Path, float, bool]
- tuple[Alignment, Path, float, bool]) -> None
- dict[tuple[str, str], str], test_sequences: dict[tuple[str, str], str], output_path: Path, num_cpu: int | None = None, parallel: bool = True, chain_overlap_threshold: float = 0.3, n_chunks: int = 100) -> list[tuple[Alignment, Path, float, bool]]
- Path) -> dict[tuple[str, str], list[tuple[Alignment, float, tuple[str, str]]]]
- DataFrame) -> dict[tuple[tuple[str, str], ...], str]
Generate mapping of system IDs from training data
- str) -> tuple[str, str, str]
Extract pdb_id, chain, and uni from a combined monomer_id string
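As an illustration only, the parsing could look like the sketch below. It assumes a monomer ID of the form `<pdb_id>__<chain>_<uniprot>` (for example `7cma__A1_Q9X0H3`); this page does not state the format, so treat it as an assumption. The function name `split_monomer_id` is hypothetical.

```python
def split_monomer_id(monomer_id: str) -> tuple[str, str, str]:
    """Hypothetical helper: split '<pdb_id>__<chain>_<uniprot>' into its parts."""
    # Assumed layout: a double underscore separates the PDB ID from the rest,
    # and a single underscore separates the chain from the UniProt accession.
    pdb_id, sep, rest = monomer_id.partition("__")
    if not sep:
        raise ValueError(f"missing '__' separator in {monomer_id!r}")
    chain, sep, uni = rest.partition("_")
    if not sep:
        raise ValueError(f"missing '_' separator in {monomer_id!r}")
    return pdb_id, chain, uni


parts = split_monomer_id("7cma__A1_Q9X0H3")
```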
- str, alignments: dict[tuple[str, str], list[tuple[Alignment, float, tuple[str, str]]]], train_system_id_mapping: dict[tuple[tuple[str, str], ...], str]) -> tuple[str, dict[tuple[tuple[str, str], ...], tuple[tuple[Alignment, float, tuple[str, str]], tuple[Alignment, float, tuple[str, str]]]] | None] | None
Map system IDs to alignments if applicable
- list[tuple[Alignment, float, tuple[str, str]]], l_alignments: list[tuple[Alignment, float, tuple[str, str]]], train_systems: set[tuple[tuple[str, str], ...]]) -> dict[tuple[tuple[str, str], ...], tuple[tuple[Alignment, float, tuple[str, str]], tuple[Alignment, float, tuple[str, str]]]] | None
Check for dual alignments across systems and return them if valid
- list[tuple[int, int]]) -> tuple[ndarray[Any, dtype[int64]], ndarray[Any, dtype[int64]]]
Extract aligned indices from an alignment trace
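One way such an extraction might work, sketched under the assumption that the trace is a list of `(index_in_first, index_in_second)` pairs in which a gap is encoded as `-1` (the convention used by Biotite alignment traces). The function name `aligned_indices` is hypothetical and this is not the library's implementation.

```python
import numpy as np


def aligned_indices(trace: list[tuple[int, int]]) -> tuple[np.ndarray, np.ndarray]:
    """Hypothetical helper: keep only trace rows where both sequences have a residue."""
    # reshape(-1, 2) keeps the code valid for an empty trace as well.
    arr = np.asarray(trace, dtype=np.int64).reshape(-1, 2)
    # A row is an aligned position only if neither column is a gap (-1, assumed).
    keep = (arr != -1).all(axis=1)
    return arr[keep, 0], arr[keep, 1]


trace = [(0, 0), (1, -1), (2, 1), (-1, 2), (3, 3)]
first_idx, second_idx = aligned_indices(trace)
```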
- Alignment, test_system: str, train_system: str, test_chain: str, train_chain: str, metadata: DataFrame) -> tuple[float, float]
Calculate the percentage of interface residues that are aligned
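A rough sketch of the idea, assuming the two returned floats are the covered fraction of the test-chain interface and of the train-chain interface, each expressed in the range 0 to 1 so that it can be compared with a threshold such as `chain_overlap_threshold`. The inputs (aligned residue indices and interface residue sets) and the name `interface_coverage` are hypothetical stand-ins for what the real function derives from the alignment and metadata.

```python
def interface_coverage(
    test_aligned: list[int],
    train_aligned: list[int],
    test_interface: set[int],
    train_interface: set[int],
) -> tuple[float, float]:
    """Hypothetical helper: fraction of each interface that falls on aligned positions."""

    def covered(aligned: list[int], interface: set[int]) -> float:
        # An empty interface is reported as zero coverage rather than an error.
        if not interface:
            return 0.0
        return len(interface.intersection(aligned)) / len(interface)

    return covered(test_aligned, test_interface), covered(train_aligned, train_interface)


test_cov, train_cov = interface_coverage(
    test_aligned=[0, 2, 3],
    train_aligned=[0, 1, 3],
    test_interface={2, 3, 4, 5},
    train_interface={0, 1},
)
```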
- DataFrame, alignments: dict[tuple[str, str], list[tuple[Alignment, float, tuple[str, str]]]], train_system_id_mapping: dict[tuple[tuple[str, str], ...], str], metadata: DataFrame, chain_overlap_threshold: float) -> None
Analyze leakage across systems based on alignments and mapping
- DataFrame, cluster_size_cutoff: int, num_to_sample: int, random_state: int, cache_path: Path, n_chunks: int) -> DataFrame
- str | None = None, metadata_file: str | None = None, test_table_file: Path | str | None = None, entity_metadata_file: str | None = None, cache_path: str | Path = PosixPath('/home/runner/.local/share/pinder/2024-02/data/similarity-cache'), cluster_size_cutoff: int = 20, chain_overlap_threshold: float = 0.3, num_to_sample: int = 1, n_chunks: int = 100, random_state: int = 42, use_cache: bool = True, num_cpu: int | None = None, pinder_dir: Path | None = None, config: ClusterConfig = ClusterConfig(seed=40, canonical_method='foldseek_community', edge_weight='weight', foldseek_cluster_edge_threshold=0.7, foldseek_edge_threshold=0.55, foldseek_af2_difficulty_threshold=0.7, mmseqs_edge_threshold=0.0, resolution_thr=3.5, min_chain_length=40, min_atom_types=3, max_var_thr=0.98, oligomeric_count=2, method='X-RAY DIFFRACTION', interface_atom_gaps_4A=0, prodigy_label='BIO', number_of_components=1, alphafold_cutoff_date='2021-10-01', depth_limit=2, max_node_degree=1000, top_n=1, min_depth_2_hits_with_comm=1, max_depth_2_hits_with_comm=2000, max_depth_2_hits=1000)) -> None
Extract sequence similarity / leakage by subsampling members in train split.
- Parameters:
index_file (str | None) – Path to custom/intermediate index file, if not provided will use get_index().
metadata_file (str | None) – Path to custom/intermediate metadata file, if not provided will use get_metadata().
test_table_file (str | None) – Path to custom/intermediate test systems table, if not provided must provide pinder ingest directory via pinder_dir.
entity_metadata_file (str | None) – Path to custom/intermediate entity metadata file, if not provided will use get_supplementary_data().
cache_path (str | Path) – Directory to store cached alignments. Defaults to get_pinder_location() / ‘data/similarity-cache’.
cluster_size_cutoff (int) – The minimum size of a train set cluster for sampling. Default is 20.
chain_overlap_threshold (float) – Threshold for chain overlap in interface residues. Default is 0.3.
num_to_sample (int) – The number of cluster elements to sample. Default is 1.
n_chunks (int) – The number of sequence pair chunks to evaluate. Default is 100.
random_state (int) – Random state for train set subsampling. Default is 42.
use_cache (bool) – Whether to use cached alignments. Default is True.
num_cpu (int | None) – Limit number of CPU used in multiprocessing. Default is None (use all).
pinder_dir (Path | None) – Directory to pinder dataset generation directory. If not provided, will assume it is being run on local files / post-data gen.
config (ClusterConfig) – The config object used to generate the clustering. Used to infer location of test_table_file if pinder_dir is provided.
- Returns:
None.
module
module
- Path | str, filename: str = 'PDBfam.parquet', url: str = '', overwrite: bool = True) -> DataFrame
- Path | str | None = None) -> DataFrame
- Path | str | None = None) -> DataFrame
- DataFrame, pml_file: Path = PosixPath('/home/runner/work/pinder/pinder/docs/view_potential_leaks.pml'), max_scenes: int = 100, pdb_dir: Path = PosixPath('/home/runner/.local/share/pinder/2024-02/pdbs'), align_types: list[str] = ['align', 'super', 'chain']) -> None