pinder.data.qc package
Submodules
pinder.data.qc.annotation_check module
- pinder.data.qc.annotation_check.get_paired_uniprot_intersection(split_index: DataFrame, against: str = 'test') → tuple[DataFrame, float]
Get the intersection of the UniProt pairs between the train and test/val splits.
- Parameters:
split_index – the split index dataframe
against – the split to compare against, either “test” or “val”
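The idea of a paired intersection can be sketched with plain Python sets. This is a minimal illustration and not the library's implementation: the helper name `paired_uniprot_intersection`, the choice to treat pairs as unordered, and the reading of the returned float as a percentage of evaluation pairs are all assumptions.

```python
from __future__ import annotations

from typing import Iterable


def paired_uniprot_intersection(
    train_pairs: Iterable[tuple[str, str]],
    eval_pairs: Iterable[tuple[str, str]],
) -> tuple[set[frozenset[str]], float]:
    """Return the eval pairs also seen in train and the percentage they represent."""
    # frozenset makes (R, L) and (L, R) compare equal (assumption: pairs are unordered).
    train = {frozenset(pair) for pair in train_pairs}
    evaluated = {frozenset(pair) for pair in eval_pairs}
    shared = train & evaluated
    pct = 100.0 * len(shared) / len(evaluated) if evaluated else 0.0
    return shared, pct


# Illustrative accessions only.
train = [("P12345", "Q07009"), ("Q64537", "Q64537")]
test = [("Q07009", "P12345"), ("P00001", "P00002")]
shared, pct = paired_uniprot_intersection(train, test)
print(shared, pct)
```

Here one of the two test pairs also occurs in train (in swapped order), so half of the test pairs are shared.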
- pinder.data.qc.annotation_check.metadata_to_ecod_pairs(meta_data: DataFrame, pindex: DataFrame) → tuple[DataFrame, set[str], set[str]]
Convert the metadata dataframe to a dataframe with ECOD pairs.
- Parameters:
meta_data – the metadata dataframe
pindex – the pindex dataframe
- Returns:
ecod_RL – the ECOD pairs dataframe
test_ids – the test ids
val_ids – the val ids
- Return type:
tuple[DataFrame, set[str], set[str]]
- pinder.data.qc.annotation_check.annotate_longest_intersecting_ecod_domain(ecod_RL: DataFrame, min_intersection: int = 10, frac_interface_threshold: float = 0.25) → DataFrame
Annotate the longest intersecting ECOD domain for each chain in the pair.
- Parameters:
ecod_RL – the ECOD annotated dataframe
- Returns:
ecod_RL_with_ecod – the ECOD annotated dataframe with the longest intersecting domain
- Return type:
DataFrame
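How the two thresholds could interact when choosing a domain can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helper `longest_intersecting_domain`, the use of residue-number sets, and the inclusive comparisons are hypothetical. Only the parameter names and defaults come from the signature above.

```python
from __future__ import annotations


def longest_intersecting_domain(
    interface: set[int],
    domains: dict[str, set[int]],
    min_intersection: int = 10,
    frac_interface_threshold: float = 0.25,
) -> str | None:
    """Pick the domain sharing the most residues with the interface, if any qualifies."""
    if not interface:
        return None
    best_name, best_overlap = None, 0
    for name, residues in domains.items():
        overlap = len(interface & residues)
        # Assumption: a domain must pass both the absolute and the fractional cutoff.
        if overlap < min_intersection:
            continue
        if overlap / len(interface) < frac_interface_threshold:
            continue
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name


# Illustrative data: a 30-residue interface spanning two hypothetical domains.
interface = set(range(100, 130))
domains = {"domain_1": set(range(1, 112)), "domain_2": set(range(112, 200))}
best = longest_intersecting_domain(interface, domains)
strict = longest_intersecting_domain(interface, domains, min_intersection=20)
print(best, strict)
```

With the defaults, `domain_2` wins because it covers 18 interface residues against 12 for `domain_1`. Raising `min_intersection` to 20 leaves no qualifying domain.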
- pinder.data.qc.annotation_check.merge_annotated_interfaces_with_pindex(pindex: DataFrame, ecod_R: DataFrame, ecod_L: DataFrame) → DataFrame
Merge the annotated interfaces with the pindex dataframe.
- Parameters:
pindex – the pindex dataframe
ecod_R – the ECOD annotated dataframe for chain R
ecod_L – the ECOD annotated dataframe for chain L
- Returns:
pindex_with_RL – the pindex dataframe with the ECOD annotations
- Return type:
DataFrame
- pinder.data.qc.annotation_check.get_ecod_paired_leakage(metadata: DataFrame, pindex: DataFrame, frac_interface_threshold: float = 0.25, min_intersection: int = 10, top_n: int | None = None) → tuple[DataFrame, DataFrame, DataFrame, DataFrame, DataFrame]
Get the ECOD paired leakage between the test and train/val splits.
- Parameters:
metadata – the metadata dataframe
pindex – the pindex dataframe
frac_interface_threshold – the fraction of the interface that must be covered by the ECOD domain
min_intersection (int) – the minimum length of an interface
top_n (int) – the number of leaked pairs to include in the output dataframe
- Returns:
problem_cases_test – the problem cases (potentially leaked ECOD accession pairs) in the test split
problem_cases_val – the problem cases (potentially leaked ECOD accession pairs) in the val split
- Return type:
tuple[DataFrame, DataFrame, DataFrame, DataFrame, DataFrame]
- pinder.data.qc.annotation_check.get_binding_leakage(annotated_pindex: DataFrame, against: str = 'test') → tuple[DataFrame, float]
Obtain the chain-level binding site leakage between the test/val and train splits.
- pinder.data.qc.annotation_check.binding_leakage_main(index_file: str | None = None, metadata_file: str | None = None, frac_interface_threshold: float = 0.25, min_intersection: int = 10) → tuple[DataFrame, DataFrame, DataFrame, DataFrame]
Extract ECOD paired binding site leakage for test and val splits.
- Parameters:
index_file (str | None) – Path to a custom/intermediate index file. If not provided, get_index() is used.
metadata_file (str | None) – Path to a custom/intermediate metadata file. If not provided, get_metadata() is used.
frac_interface_threshold (float) – Fraction of interface required to be covered by ECOD domain. Default is 0.25.
min_intersection (int) – Minimum required intersection between interface and ECOD domain. Default is 10.
- Returns:
The test and val split binding leakage, respectively.
- Return type:
tuple[DataFrame, DataFrame, DataFrame, DataFrame]
pinder.data.qc.ialign module
Calculate interface alignment scores via IS-align.
Used to handle similarity hit finding following initial alignments via Foldseek or MMseqs2.
For full method details, please see: https://doi.org/10.1093/bioinformatics/btq404
- pinder.data.qc.ialign.ialign(query_id: str, query_pdb: Path, hit_id: str, hit_pdb: Path, config: IalignConfig = IalignConfig(rmsd_threshold=5.0, log_pvalue_threshold=-9.0, is_score_threshold=0.3, alignment_printout=0, speed_mode=1, min_residues=5, min_interface=5, distance_cutoff=10.0, output_prefix='output')) → dict[str, str | float | int] | None
- pinder.data.qc.ialign.ialign_all(df: DataFrame, pdb_root: Path = PosixPath('/home/runner/.local/share/pinder/2024-02/pdbs'), n_jobs: int = 48, config: IalignConfig = IalignConfig(rmsd_threshold=5.0, log_pvalue_threshold=-9.0, is_score_threshold=0.3, alignment_printout=0, speed_mode=1, min_residues=5, min_interface=5, distance_cutoff=10.0, output_prefix='output')) → DataFrame
- pinder.data.qc.ialign.process_in_batches(df: DataFrame, batch_size: int = 1000, batch_offset: int = 0, overwrite: bool = False, cache_dir: Path = PosixPath('/home/runner/.local/share/pinder/2024-02/ialign_results'), n_jobs: int = 48, config: IalignConfig = IalignConfig(rmsd_threshold=5.0, log_pvalue_threshold=-9.0, is_score_threshold=0.3, alignment_printout=0, speed_mode=1, min_residues=5, min_interface=5, distance_cutoff=10.0, output_prefix='output')) → None
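The `batch_size`, `batch_offset`, `overwrite` and `cache_dir` parameters suggest a resumable, cached batching loop. The following is a rough sketch of that pattern only. The helper `pending_batches`, the `batch_<i>.csv` file naming, and the reading of `batch_offset` as a number of batches to skip are assumptions, not documented behaviour.

```python
from __future__ import annotations

import tempfile
from pathlib import Path
from typing import Iterator


def pending_batches(
    n_rows: int,
    batch_size: int = 1000,
    batch_offset: int = 0,
    overwrite: bool = False,
    cache_dir: Path = Path("."),
) -> Iterator[tuple[int, int, int, Path]]:
    """Yield (batch index, start row, stop row, output file) for batches still to run."""
    n_batches = -(-n_rows // batch_size)  # ceiling division
    for i in range(batch_offset, n_batches):
        start, stop = i * batch_size, min((i + 1) * batch_size, n_rows)
        out_file = cache_dir / f"batch_{i}.csv"  # hypothetical cache file name
        if out_file.exists() and not overwrite:
            continue  # result already cached, skip the batch
        yield i, start, stop, out_file


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    (cache / "batch_1.csv").touch()  # pretend batch 1 finished in an earlier run
    todo = [(i, start, stop) for i, start, stop, _ in pending_batches(2500, cache_dir=cache)]
    redo = [i for i, *_ in pending_batches(2500, overwrite=True, cache_dir=cache)]
print(todo)
print(redo)
```

In this sketch the cached batch is skipped unless `overwrite=True`, which is what makes an interrupted run cheap to resume.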
pinder.data.qc.pfam_diversity module
- pinder.data.qc.pfam_diversity.load_data(index_file: Path | str | None = None, metadata_file: Path | str | None = None, pfam_file: Path | str | None = None) → tuple[DataFrame, DataFrame, DataFrame]
- pinder.data.qc.pfam_diversity.get_ecod_annotations(pindex: DataFrame, metadata: DataFrame, frac_interface_threshold: float = 0.25, min_intersection: int = 10) → DataFrame
- pinder.data.qc.pfam_diversity.process_pfam_data(pindex: DataFrame, pfam_data: DataFrame) → DataFrame
- pinder.data.qc.pfam_diversity.visualize_data(pindex: DataFrame, metadata: DataFrame, output_dir: Path, top_n_pfams: int = 50) → None
- pinder.data.qc.pfam_diversity.plot_pfam_clan_distribution(pindex: DataFrame, output_dir: Path, top_n_pfams: int = 50) → None
- pinder.data.qc.pfam_diversity.plot_ecod_family_distribution(pindex: DataFrame, output_dir: Path, top_n_pfams: int = 50) → None
- pinder.data.qc.pfam_diversity.get_merged_metadata(pindex: DataFrame, metadata: DataFrame) → DataFrame
- pinder.data.qc.pfam_diversity.get_cluster_diversity(merged_metadata: DataFrame) → DataFrame
- pinder.data.qc.pfam_diversity.plot_merged_metadata_distributions(pindex: DataFrame, metadata: DataFrame, output_dir: Path) → None
- pinder.data.qc.pfam_diversity.pfam_diversity_main(index_file: str | None = None, metadata_file: str | None = None, pfam_file: str | None = None, output_dir: str | Path = PosixPath('/home/runner/.local/share/pinder/2024-02/data/pfam_visualizations'), frac_interface_threshold: float = 0.25, min_intersection: int = 10, top_n_pfams: int = 50) → None
Extract PFAM clan diversity and generate visualizations.
- Parameters:
index_file (str | None) – Path to a custom/intermediate index file. If not provided, get_index() is used.
metadata_file (str | None) – Path to a custom/intermediate metadata file. If not provided, get_metadata() is used.
pfam_file (str | None) – Optional path to a Pfam data file. If not provided, the data is fetched and cached.
output_dir (str | Path) – Directory to store PFAM visualizations. Defaults to get_pinder_location() / ‘data/pfam_visualizations’.
frac_interface_threshold (float) – Fraction of interface required to be covered by ECOD domain. Default is 0.25.
min_intersection (int) – Minimum required intersection between interface and ECOD domain. Default is 10.
top_n_pfams (int) – Number of top Pfam clans to visualize. Default is 50.
- Returns:
None.
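Selecting the "top N" clans for a plot amounts to a frequency count followed by a cut-off. A minimal sketch, assuming one clan label per annotated chain and that missing annotations are skipped. The helper `top_n_clans` and the clan identifiers are illustrative and not part of pinder.

```python
from __future__ import annotations

from collections import Counter
from typing import Iterable


def top_n_clans(clans: Iterable[str | None], top_n: int = 50) -> list[tuple[str, int]]:
    """Return the `top_n` most frequent clan labels with their counts."""
    counts = Counter(clan for clan in clans if clan)  # ignore missing annotations
    return counts.most_common(top_n)


# Illustrative labels only.
clans = ["CL0023", "CL0123", "CL0023", None, "CL0159", "CL0023", "CL0123"]
top = top_n_clans(clans, top_n=2)
print(top)
```

A list of `(label, count)` pairs like this can be handed directly to a bar-plot routine.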
pinder.data.qc.run module
pinder_qc
NAME
    pinder_qc
SYNOPSIS
    pinder_qc COMMAND
COMMANDS
    COMMAND is one of the following:
        uniprot_leakage
        binding_leakage
            Extract ECOD paired binding site leakage for test and val splits.
        pfam_diversity
            Extract PFAM clan diversity and generate visualizations.
        sequence_leakage
            Extract sequence similarity / leakage by subsampling members in train split.
pinder_qc uniprot_leakage --help
NAME
    pinder_qc uniprot_leakage
SYNOPSIS
    pinder_qc uniprot_leakage <flags>
FLAGS
    -i, --index_path=INDEX_PATH
        Type: Optional['str | None']
        Default: None
    -s, --split=SPLIT
        Type: 'str'
        Default: 'test'
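When the command is driven from a script, the argument list can be assembled from the flags shown in the help text above. The helper `uniprot_leakage_cmd` is hypothetical, and whether `--split` accepts a value such as `val` is an assumption based on the `against` parameter documented earlier.

```python
from __future__ import annotations


def uniprot_leakage_cmd(index_path: str | None = None, split: str = "test") -> list[str]:
    """Build the argv list for `pinder_qc uniprot_leakage` from the documented flags."""
    cmd = ["pinder_qc", "uniprot_leakage", f"--split={split}"]
    if index_path is not None:
        # Only pass --index_path when a custom index is wanted; the CLI default is None.
        cmd.append(f"--index_path={index_path}")
    return cmd


cmd = uniprot_leakage_cmd(split="val")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the `pinder_qc` executable is installed.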
pinder.data.qc.similarity_check module
- pinder.data.qc.similarity_check.load_data(index_file: Path | str | None = None, metadata_file: Path | str | None = None) → tuple[DataFrame, DataFrame]
- pinder.data.qc.similarity_check.process_test_table(test_table_path: Path, pindex: DataFrame) → DataFrame
- pinder.data.qc.similarity_check.align_sequences(input_data: tuple[tuple[str, str], tuple[str, str], str, str], output_path: Path, chain_overlap_threshold: float = 0.3) → tuple[Alignment, Path, float, bool]
- pinder.data.qc.similarity_check.write_alignment(alignment_output: tuple[Alignment, Path, float, bool]) → None
- pinder.data.qc.similarity_check.generate_alignments(train_sequences: dict[tuple[str, str], str], test_sequences: dict[tuple[str, str], str], output_path: Path, num_cpu: int | None = None, parallel: bool = True, chain_overlap_threshold: float = 0.3, n_chunks: int = 100) → list[tuple[Alignment, Path, float, bool]]
- pinder.data.qc.similarity_check.get_processed_alignments(alignments_path: Path) → dict[tuple[str, str], list[tuple[Alignment, float, tuple[str, str]]]]
- pinder.data.qc.similarity_check.load_train_system_id_mapping(sampled_train: DataFrame) → dict[tuple[tuple[str, str], ...], str]
Generate mapping of system IDs from training data
- pinder.data.qc.similarity_check.get_pdb_chain_uni(monomer_id: str) → tuple[str, str, str]
Extract pdb_id, chain, and uni from a combined monomer_id string
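A sketch of what such a split could look like, assuming monomer IDs follow the pattern `<pdb_id>__<chain>_<uniprot>`. The pattern, the example ID and the helper name `split_monomer_id` are assumptions for illustration. The library function may handle additional cases.

```python
def split_monomer_id(monomer_id: str) -> tuple[str, str, str]:
    """Split '<pdb_id>__<chain>_<uniprot>' into its three parts."""
    # The double underscore separates the PDB ID from the rest (assumed format).
    pdb_id, rest = monomer_id.split("__", 1)
    # The first single underscore separates the chain label from the UniProt accession.
    chain, uni = rest.split("_", 1)
    return pdb_id, chain, uni


parts = split_monomer_id("1df0__A1_Q07009")
print(parts)
```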
- pinder.data.qc.similarity_check.system_id_to_alignment(sys_id: str, alignments: dict[tuple[str, str], list[tuple[Alignment, float, tuple[str, str]]]], train_system_id_mapping: dict[tuple[tuple[str, str], ...], str]) → tuple[str, dict[tuple[tuple[str, str], ...], tuple[tuple[Alignment, float, tuple[str, str]], tuple[Alignment, float, tuple[str, str]]]] | None] | None
Map system IDs to alignments if applicable
- pinder.data.qc.similarity_check.alignments_to_dual_alignment(r_alignments: list[tuple[Alignment, float, tuple[str, str]]], l_alignments: list[tuple[Alignment, float, tuple[str, str]]], train_systems: set[tuple[tuple[str, str], ...]]) → dict[tuple[tuple[str, str], ...], tuple[tuple[Alignment, float, tuple[str, str]], tuple[Alignment, float, tuple[str, str]]]] | None
Check for dual alignments across systems and return them if valid
- pinder.data.qc.similarity_check.get_aligned_indices(alignment_trace: list[tuple[int, int]]) → tuple[ndarray[Any, dtype[int64]], ndarray[Any, dtype[int64]]]
Extract aligned indices from an alignment trace
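The extraction step can be illustrated on a small trace. This sketch assumes that gaps are marked with -1 in either column, which is the convention of biotite's `Alignment.trace`. It returns plain lists rather than the NumPy arrays in the signature above, and the helper name `aligned_indices` is hypothetical.

```python
def aligned_indices(trace: list[tuple[int, int]]) -> tuple[list[int], list[int]]:
    """Keep only trace columns where both sequences have a residue (no gap)."""
    kept = [(a, b) for a, b in trace if a != -1 and b != -1]  # -1 = gap (assumed)
    first = [a for a, _ in kept]
    second = [b for _, b in kept]
    return first, second


# Five alignment columns, two of which contain a gap.
trace = [(0, -1), (1, 0), (2, 1), (-1, 2), (3, 3)]
idx_a, idx_b = aligned_indices(trace)
print(idx_a, idx_b)
```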
- pinder.data.qc.similarity_check.get_aligned_interface_indices(alignment: Alignment, test_system: str, train_system: str, test_chain: str, train_chain: str, metadata: DataFrame) → tuple[float, float]
Calculate the percentage of interface residues that are aligned
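The quantity described here can be sketched as a coverage ratio computed once per chain. The helper `interface_coverage` is hypothetical. It returns a fraction in [0, 1] rather than a percentage, and requiring both chains to reach `chain_overlap_threshold` is an assumption about how the two values might be used.

```python
def interface_coverage(aligned: set[int], interface: set[int]) -> float:
    """Fraction of interface residues that fall inside the aligned region."""
    if not interface:
        return 0.0
    return len(aligned & interface) / len(interface)


chain_overlap_threshold = 0.3  # default shown in the signatures on this page

# Illustrative residue indices for one test chain and one train chain.
test_cov = interface_coverage({1, 2, 3, 4, 5, 6}, {4, 5, 6, 7})
train_cov = interface_coverage({10, 11, 12}, {12, 20, 21, 22, 23})
# Assumption: a pair is flagged only if both chains reach the threshold.
flagged = min(test_cov, train_cov) >= chain_overlap_threshold
print(test_cov, train_cov, flagged)
```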
- pinder.data.qc.similarity_check.analyze_leakage(test_table: DataFrame, alignments: dict[tuple[str, str], list[tuple[Alignment, float, tuple[str, str]]]], train_system_id_mapping: dict[tuple[tuple[str, str], ...], str], metadata: DataFrame, chain_overlap_threshold: float) → None
Analyze leakage across systems based on alignments and mapping
- pinder.data.qc.similarity_check.subsample_train(pindex: DataFrame, cluster_size_cutoff: int, num_to_sample: int, random_state: int, cache_path: Path, n_chunks: int) → DataFrame
- pinder.data.qc.similarity_check.sequence_leakage_main(index_file: str | None = None, metadata_file: str | None = None, test_table_file: Path | str | None = None, entity_metadata_file: str | None = None, cache_path: str | Path = PosixPath('/home/runner/.local/share/pinder/2024-02/data/similarity-cache'), cluster_size_cutoff: int = 20, chain_overlap_threshold: float = 0.3, num_to_sample: int = 1, n_chunks: int = 100, random_state: int = 42, use_cache: bool = True, num_cpu: int | None = None, pinder_dir: Path | None = None, config: ClusterConfig = ClusterConfig(seed=40, canonical_method='foldseek_community', edge_weight='weight', foldseek_cluster_edge_threshold=0.7, foldseek_edge_threshold=0.55, foldseek_af2_difficulty_threshold=0.7, mmseqs_edge_threshold=0.0, resolution_thr=3.5, min_chain_length=40, min_atom_types=3, max_var_thr=0.98, oligomeric_count=2, method='X-RAY DIFFRACTION', interface_atom_gaps_4A=0, prodigy_label='BIO', number_of_components=1, alphafold_cutoff_date='2021-10-01', depth_limit=2, max_node_degree=1000, top_n=1, min_depth_2_hits_with_comm=1, max_depth_2_hits_with_comm=2000, max_depth_2_hits=1000)) → None
Extract sequence similarity / leakage by subsampling members in train split.
- Parameters:
index_file (str | None) – Path to a custom/intermediate index file. If not provided, get_index() is used.
metadata_file (str | None) – Path to a custom/intermediate metadata file. If not provided, get_metadata() is used.
test_table_file (str | None) – Path to a custom/intermediate test systems table. If not provided, the pinder ingest directory must be supplied via pinder_dir.
entity_metadata_file (str | None) – Path to a custom/intermediate entity metadata file. If not provided, get_supplementary_data() is used.
cache_path (str | Path) – Directory to store cached alignments. Defaults to get_pinder_location() / ‘data/similarity-cache’.
cluster_size_cutoff (int) – The minimum size of a train set cluster for sampling. Default is 20.
chain_overlap_threshold (float) – Threshold for chain overlap in interface residues. Default is 0.3.
num_to_sample (int) – The number of cluster elements to sample. Default is 1.
n_chunks (int) – The number of sequence pair chunks to evaluate. Default is 100.
random_state (int) – Random state for train set subsampling. Default is 42.
use_cache (bool) – Whether to use cached alignments. Default is True.
num_cpu (int | None) – Limit number of CPU used in multiprocessing. Default is None (use all).
pinder_dir (Path | None) – Path to the pinder dataset generation directory. If not provided, the function assumes it is being run on local files after data generation.
config (ClusterConfig) – The config object used to generate the clustering. Used to infer location of test_table_file if pinder_dir is provided.
- Returns:
None.
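The subsampling controlled by `cluster_size_cutoff`, `num_to_sample` and `random_state` can be sketched with the standard library. This is an illustration only: the helper `subsample_clusters`, the `(system_id, cluster_id)` input shape, and the inclusive size comparison are assumptions, not the library's implementation.

```python
from __future__ import annotations

import random
from collections import defaultdict


def subsample_clusters(
    members: list[tuple[str, str]],
    cluster_size_cutoff: int = 20,
    num_to_sample: int = 1,
    random_state: int = 42,
) -> list[str]:
    """Draw `num_to_sample` systems from every cluster with enough members."""
    clusters: dict[str, list[str]] = defaultdict(list)
    for system_id, cluster_id in members:
        clusters[cluster_id].append(system_id)

    rng = random.Random(random_state)  # seeded for reproducible sampling
    sampled: list[str] = []
    for cluster_id in sorted(clusters):  # fixed order keeps the draw deterministic
        ids = sorted(clusters[cluster_id])
        if len(ids) < cluster_size_cutoff:  # assumption: cutoff is inclusive
            continue
        sampled.extend(rng.sample(ids, min(num_to_sample, len(ids))))
    return sampled


# Illustrative data: one cluster of five systems and one singleton.
members = [(f"sys{i}", "c0") for i in range(5)] + [("sys9", "c1")]
picked = subsample_clusters(members, cluster_size_cutoff=3, num_to_sample=2)
print(picked)
```

Only the five-member cluster passes the cutoff here, so two of its systems are drawn, and repeating the call with the same `random_state` gives the same draw.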
pinder.data.qc.uniprot_leakage module
pinder.data.qc.utils module
- pinder.data.qc.utils.download_pdbfam_db(download_dir: Path | str, filename: str = 'PDBfam.parquet', url: str = 'http://dunbrack2.fccc.edu/ProtCiD/pfam/PDBfam.txt.gz', overwrite: bool = True) → DataFrame
- pinder.data.qc.utils.load_metadata(metadata_file: Path | str | None = None) → DataFrame
- pinder.data.qc.utils.load_entity_metadata(entity_metadata_file: Path | str | None = None) → DataFrame
- pinder.data.qc.utils.view_potential_leaks(potential_leak_pairs: DataFrame, pml_file: Path = PosixPath('/home/runner/work/pinder/pinder/docs/view_potential_leaks.pml'), max_scenes: int = 100, pdb_dir: Path = PosixPath('/home/runner/.local/share/pinder/2024-02/pdbs'), align_types: list[str] = ['align', 'super', 'chain']) → None