pinder.data.qc package

Submodules

pinder.data.qc.annotation_check module

pinder.data.qc.annotation_check.get_paired_uniprot_intersection(split_index: DataFrame, against: str = 'test') → tuple[DataFrame, float]

Get the intersection of UniProt pairs between the train split and the test or val split.

Parameters:
  • split_index – the split index dataframe

  • against – the split to compare against, either “test” or “val”
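As a rough illustration of what a paired UniProt intersection means, the sketch below treats each system as an unordered pair of UniProt accessions and reports which pairs in the held-out split also occur in train. It is plain Python rather than pinder's implementation: the record layout (`split`, `uniprot_R`, `uniprot_L` keys), the helper name `paired_uniprot_overlap`, and the choice to return the shared fraction as the float are all assumptions made for the example.

```python
from __future__ import annotations

from typing import Iterable


def paired_uniprot_overlap(
    rows: Iterable[dict[str, str]], against: str = "test"
) -> tuple[set[frozenset[str]], float]:
    """Return UniProt pairs shared by train and `against`, plus the shared fraction."""
    records = list(rows)

    def pairs(split: str) -> set[frozenset[str]]:
        # frozenset makes (A, B) and (B, A) compare equal.
        return {
            frozenset((r["uniprot_R"], r["uniprot_L"]))
            for r in records
            if r["split"] == split
        }

    train, held_out = pairs("train"), pairs(against)
    shared = train & held_out
    fraction = len(shared) / len(held_out) if held_out else 0.0
    return shared, fraction


# Hypothetical records; the accessions are placeholders.
rows = [
    {"split": "train", "uniprot_R": "P12345", "uniprot_L": "Q67890"},
    {"split": "train", "uniprot_R": "P11111", "uniprot_L": "P22222"},
    {"split": "test", "uniprot_R": "Q67890", "uniprot_L": "P12345"},
    {"split": "test", "uniprot_R": "P33333", "uniprot_L": "P44444"},
]
shared, fraction = paired_uniprot_overlap(rows, against="test")
```

Here one of the two test pairs also appears in train (with receptor and ligand swapped), so the shared fraction is 0.5.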

pinder.data.qc.annotation_check.metadata_to_ecod_pairs(meta_data: DataFrame, pindex: DataFrame) → tuple[DataFrame, set[str], set[str]]

Convert the metadata dataframe to a dataframe with ECOD pairs

Parameters:
  • meta_data – the metadata dataframe

  • pindex – the pindex dataframe

Returns:
  • ecod_RL – the ECOD pairs dataframe

  • test_ids – the test ids

  • val_ids – the val ids

Return type:

tuple[DataFrame, set[str], set[str]]

pinder.data.qc.annotation_check.annotate_longest_intersecting_ecod_domain(ecod_RL: DataFrame, min_intersection: int = 10, frac_interface_threshold: float = 0.25) → DataFrame

Annotate the longest intersecting ECOD domain for each chain in the pair

Parameters:

ecod_RL – the ECOD annotated dataframe

Returns:

ecod_RL_with_ecod – the ECOD annotated dataframe with the longest intersecting domain

Return type:

DataFrame
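One plausible reading of `min_intersection` and `frac_interface_threshold` is sketched below: a domain is a candidate only if it overlaps the interface by at least `min_intersection` residues and covers at least `frac_interface_threshold` of the interface, and the candidate with the largest overlap wins. This is an illustrative assumption, not the rule pinder necessarily applies; the function name, the residue-set representation, and the domain names are hypothetical.

```python
from __future__ import annotations


def longest_intersecting_domain(
    interface: set[int],
    domains: dict[str, range],
    min_intersection: int = 10,
    frac_interface_threshold: float = 0.25,
) -> str | None:
    """Pick the domain with the largest interface overlap that passes both cutoffs."""
    best_name, best_overlap = None, 0
    for name, residues in domains.items():
        overlap = len(interface.intersection(residues))
        if overlap < min_intersection:
            continue  # too few shared residues
        if not interface or overlap / len(interface) < frac_interface_threshold:
            continue  # domain covers too little of the interface
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name


interface = set(range(40, 70))  # 30 interface residues (40..69)
domains = {"domain_a": range(1, 50), "domain_b": range(45, 120)}
best = longest_intersecting_domain(interface, domains)
```

With these inputs `domain_a` overlaps the interface by 10 residues and `domain_b` by 25, so `domain_b` is selected.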

pinder.data.qc.annotation_check.merge_annotated_interfaces_with_pindex(pindex: DataFrame, ecod_R: DataFrame, ecod_L: DataFrame) → DataFrame

Merge the annotated interfaces with the pindex dataframe

Parameters:
  • pindex – the pindex dataframe

  • ecod_R – the ECOD annotated dataframe for chain R

  • ecod_L – the ECOD annotated dataframe for chain L

Returns:

pindex_with_RL – the pindex dataframe with the ECOD annotations

Return type:

DataFrame

pinder.data.qc.annotation_check.get_ecod_paired_leakage(metadata: DataFrame, pindex: DataFrame, frac_interface_threshold: float = 0.25, min_intersection: int = 10, top_n: int | None = None) → tuple[DataFrame, DataFrame, DataFrame, DataFrame, DataFrame]

Get the ECOD paired leakage between the test and train/val splits

Parameters:
  • metadata – the metadata dataframe

  • pindex – the pindex dataframe

  • frac_interface_threshold – the fraction of the interface that must be covered by the ECOD domain

  • min_intersection – the minimum length of an interface

  • top_n – the number of leaked pairs to include in the output dataframe

Returns:
  • problem_cases_test – the problem cases (potentially leaked ECOD accession pairs) in the test split

  • problem_cases_val – the problem cases (potentially leaked ECOD accession pairs) in the val split

Return type:

tuple[DataFrame, DataFrame, DataFrame, DataFrame, DataFrame]

pinder.data.qc.annotation_check.get_binding_leakage(annotated_pindex: DataFrame, against: str = 'test') → tuple[DataFrame, float]

Obtain the chain-level binding site leakage between the test/val and train splits.

pinder.data.qc.annotation_check.binding_leakage_main(index_file: str | None = None, metadata_file: str | None = None, frac_interface_threshold: float = 0.25, min_intersection: int = 10) → tuple[DataFrame, DataFrame, DataFrame, DataFrame]

Extract ECOD paired binding site leakage for test and val splits.

Parameters:
  • index_file (str | None) – Path to custom/intermediate index file, if not provided will use get_index().

  • metadata_file (str | None) – Path to custom/intermediate metadata file, if not provided will use get_metadata().

  • frac_interface_threshold (float) – Fraction of interface required to be covered by ECOD domain. Default is 0.25.

  • min_intersection (int) – Minimum required intersection between interface and ECOD domain. Default is 10.

Returns:

The test and val split binding leakage, respectively.

Return type:

tuple[DataFrame, DataFrame, DataFrame, DataFrame]

pinder.data.qc.ialign module

Calculate interface alignment scores via IS-align.

Used to handle similarity hit finding following initial alignments via Foldseek or MMseqs2.

For full method details, please see: https://doi.org/10.1093/bioinformatics/btq404

pinder.data.qc.ialign.ialign(query_id: str, query_pdb: Path, hit_id: str, hit_pdb: Path, config: IalignConfig = IalignConfig(rmsd_threshold=5.0, log_pvalue_threshold=-9.0, is_score_threshold=0.3, alignment_printout=0, speed_mode=1, min_residues=5, min_interface=5, distance_cutoff=10.0, output_prefix='output')) → dict[str, str | float | int] | None

pinder.data.qc.ialign.ialign_all(df: DataFrame, pdb_root: Path = PosixPath('/home/runner/.local/share/pinder/2024-02/pdbs'), n_jobs: int = 48, config: IalignConfig = IalignConfig(rmsd_threshold=5.0, log_pvalue_threshold=-9.0, is_score_threshold=0.3, alignment_printout=0, speed_mode=1, min_residues=5, min_interface=5, distance_cutoff=10.0, output_prefix='output')) → DataFrame

pinder.data.qc.ialign.process_in_batches(df: DataFrame, batch_size: int = 1000, batch_offset: int = 0, overwrite: bool = False, cache_dir: Path = PosixPath('/home/runner/.local/share/pinder/2024-02/ialign_results'), n_jobs: int = 48, config: IalignConfig = IalignConfig(rmsd_threshold=5.0, log_pvalue_threshold=-9.0, is_score_threshold=0.3, alignment_printout=0, speed_mode=1, min_residues=5, min_interface=5, distance_cutoff=10.0, output_prefix='output')) → None

pinder.data.qc.pfam_diversity module

pinder.data.qc.pfam_diversity.load_data(index_file: Path | str | None = None, metadata_file: Path | str | None = None, pfam_file: Path | str | None = None) → tuple[DataFrame, DataFrame, DataFrame]

pinder.data.qc.pfam_diversity.get_ecod_annotations(pindex: DataFrame, metadata: DataFrame, frac_interface_threshold: float = 0.25, min_intersection: int = 10) → DataFrame

pinder.data.qc.pfam_diversity.process_pfam_data(pindex: DataFrame, pfam_data: DataFrame) → DataFrame

pinder.data.qc.pfam_diversity.visualize_data(pindex: DataFrame, metadata: DataFrame, output_dir: Path, top_n_pfams: int = 50) → None

pinder.data.qc.pfam_diversity.get_pfam_diversity(pindex: DataFrame) → DataFrame

pinder.data.qc.pfam_diversity.plot_pfam_clan_distribution(pindex: DataFrame, output_dir: Path, top_n_pfams: int = 50) → None

pinder.data.qc.pfam_diversity.report_ecod_diversity(pindex: DataFrame) → DataFrame

pinder.data.qc.pfam_diversity.plot_ecod_family_distribution(pindex: DataFrame, output_dir: Path, top_n_pfams: int = 50) → None

pinder.data.qc.pfam_diversity.get_merged_metadata(pindex: DataFrame, metadata: DataFrame) → DataFrame

pinder.data.qc.pfam_diversity.get_cluster_diversity(merged_metadata: DataFrame) → DataFrame

pinder.data.qc.pfam_diversity.plot_merged_metadata_distributions(pindex: DataFrame, metadata: DataFrame, output_dir: Path) → None

pinder.data.qc.pfam_diversity.pfam_diversity_main(index_file: str | None = None, metadata_file: str | None = None, pfam_file: str | None = None, output_dir: str | Path = PosixPath('/home/runner/.local/share/pinder/2024-02/data/pfam_visualizations'), frac_interface_threshold: float = 0.25, min_intersection: int = 10, top_n_pfams: int = 50) → None

Extract PFAM clan diversity and generate visualizations.

Parameters:
  • index_file (str | None) – Path to custom/intermediate index file, if not provided will use get_index().

  • metadata_file (str | None) – Path to custom/intermediate metadata file, if not provided will use get_metadata().

  • pfam_file (str | None) – Optional path to Pfam data file, if not provided will fetch and cache it.

  • output_dir (str | Path) – Directory to store PFAM visualizations. Defaults to get_pinder_location() / ‘data/pfam_visualizations’.

  • frac_interface_threshold (float) – Fraction of interface required to be covered by ECOD domain. Default is 0.25.

  • min_intersection (int) – Minimum required intersection between interface and ECOD domain. Default is 10.

  • top_n_pfams (int) – Number of top Pfam clans to visualize. Default is 50.

Returns:

None.

pinder.data.qc.run module

pinder_qc

NAME
    pinder_qc

SYNOPSIS
    pinder_qc COMMAND

COMMANDS
    COMMAND is one of the following:

    uniprot_leakage

    binding_leakage
    Extract ECOD paired binding site leakage for test and val splits.

    pfam_diversity
    Extract PFAM clan diversity and generate visualizations.

    sequence_leakage
    Extract sequence similarity / leakage by subsampling members in train split.

pinder_qc uniprot_leakage --help

NAME
    pinder_qc uniprot_leakage

SYNOPSIS
    pinder_qc uniprot_leakage <flags>

FLAGS
    -i, --index_path=INDEX_PATH
        Type: Optional['str | None']
        Default: None
    -s, --split=SPLIT
        Type: 'str'
        Default: 'test'

pinder.data.qc.run.main() → None

pinder.data.qc.similarity_check module

pinder.data.qc.similarity_check.load_data(index_file: Path | str | None = None, metadata_file: Path | str | None = None) → tuple[DataFrame, DataFrame]

pinder.data.qc.similarity_check.process_test_table(test_table_path: Path, pindex: DataFrame) → DataFrame

pinder.data.qc.similarity_check.align_sequences(input_data: tuple[tuple[str, str], tuple[str, str], str, str], output_path: Path, chain_overlap_threshold: float = 0.3) → tuple[Alignment, Path, float, bool]

pinder.data.qc.similarity_check.write_alignment(alignment_output: tuple[Alignment, Path, float, bool]) → None

pinder.data.qc.similarity_check.generate_alignments(train_sequences: dict[tuple[str, str], str], test_sequences: dict[tuple[str, str], str], output_path: Path, num_cpu: int | None = None, parallel: bool = True, chain_overlap_threshold: float = 0.3, n_chunks: int = 100) → list[tuple[Alignment, Path, float, bool]]

pinder.data.qc.similarity_check.get_processed_alignments(alignments_path: Path) → dict[tuple[str, str], list[tuple[Alignment, float, tuple[str, str]]]]

pinder.data.qc.similarity_check.load_train_system_id_mapping(sampled_train: DataFrame) → dict[tuple[tuple[str, str], ...], str]

Generate mapping of system IDs from training data

pinder.data.qc.similarity_check.get_pdb_chain_uni(monomer_id: str) → tuple[str, str, str]

Extract pdb_id, chain, and uni from a combined monomer_id string

pinder.data.qc.similarity_check.system_id_to_alignment(sys_id: str, alignments: dict[tuple[str, str], list[tuple[Alignment, float, tuple[str, str]]]], train_system_id_mapping: dict[tuple[tuple[str, str], ...], str]) → tuple[str, dict[tuple[tuple[str, str], ...], tuple[tuple[Alignment, float, tuple[str, str]], tuple[Alignment, float, tuple[str, str]]]] | None] | None

Map system IDs to alignments if applicable

pinder.data.qc.similarity_check.alignments_to_dual_alignment(r_alignments: list[tuple[Alignment, float, tuple[str, str]]], l_alignments: list[tuple[Alignment, float, tuple[str, str]]], train_systems: set[tuple[tuple[str, str], ...]]) → dict[tuple[tuple[str, str], ...], tuple[tuple[Alignment, float, tuple[str, str]], tuple[Alignment, float, tuple[str, str]]]] | None

Check for dual alignments across systems and return them if valid
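The idea of a dual alignment can be sketched as follows: a test system is only of interest when its receptor chain and its ligand chain both align to the two chains of one and the same train system. The example below is a simplified assumption of that logic, not the library's code. Each hit is reduced to `(score, (pdb_id, chain_id))` with the `Alignment` object left out, and the helper name `dual_hits` is hypothetical.

```python
from __future__ import annotations


def dual_hits(
    r_hits: list[tuple[float, tuple[str, str]]],
    l_hits: list[tuple[float, tuple[str, str]]],
    train_systems: set[tuple[tuple[str, str], ...]],
) -> dict[tuple[tuple[str, str], ...], tuple[float, float]] | None:
    """Return train systems matched by both chains, or None if there are none."""
    r_by_chain = {chain: score for score, chain in r_hits}
    l_by_chain = {chain: score for score, chain in l_hits}
    result = {}
    for system in train_systems:
        if len(system) != 2:
            continue  # this sketch only handles dimers
        first, second = system
        # Accept either orientation, since receptor/ligand order may differ.
        for a, b in ((first, second), (second, first)):
            if a in r_by_chain and b in l_by_chain:
                result[system] = (r_by_chain[a], l_by_chain[b])
                break
    return result or None


# Placeholder identifiers for illustration only.
train_systems = {
    (("1abc", "A"), ("1abc", "B")),
    (("2xyz", "A"), ("2xyz", "C")),
}
r_hits = [(0.9, ("1abc", "B")), (0.8, ("2xyz", "A"))]
l_hits = [(0.7, ("1abc", "A"))]
dual = dual_hits(r_hits, l_hits, train_systems)
```

Only the `1abc` system is returned: the receptor also hits `2xyz` chain A, but no ligand hit covers `2xyz` chain C, so that system is not a dual match.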

pinder.data.qc.similarity_check.get_aligned_indices(alignment_trace: list[tuple[int, int]]) → tuple[ndarray[Any, dtype[int64]], ndarray[Any, dtype[int64]]]

Extract aligned indices from an alignment trace

pinder.data.qc.similarity_check.get_aligned_interface_indices(alignment: Alignment, test_system: str, train_system: str, test_chain: str, train_chain: str, metadata: DataFrame) → tuple[float, float]

Calculate the percentage of interface residues that are aligned
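To make the two steps above concrete, here is a minimal sketch in plain Python. It assumes the alignment trace is a list of `(test_index, train_index)` pairs in which `-1` marks a gap, which is the convention used by Biotite alignment traces; this page does not state which `Alignment` class is used, so treat that as an assumption. The function names, the use of lists instead of NumPy arrays, and returning fractions rather than percentages are choices made for the example.

```python
from __future__ import annotations


def aligned_indices(trace: list[tuple[int, int]]) -> tuple[list[int], list[int]]:
    """Keep only columns where both sequences have a residue (no gap)."""
    pairs = [(i, j) for i, j in trace if i != -1 and j != -1]
    return [i for i, _ in pairs], [j for _, j in pairs]


def aligned_interface_fraction(
    trace: list[tuple[int, int]],
    test_interface: set[int],
    train_interface: set[int],
) -> tuple[float, float]:
    """Fraction of each chain's interface residues that fall in aligned columns."""
    test_idx, train_idx = aligned_indices(trace)
    test_cov = (
        len(test_interface.intersection(test_idx)) / len(test_interface)
        if test_interface
        else 0.0
    )
    train_cov = (
        len(train_interface.intersection(train_idx)) / len(train_interface)
        if train_interface
        else 0.0
    )
    return test_cov, train_cov


# Columns 0 and 3 contain a gap on one side and are dropped.
trace = [(0, -1), (1, 0), (2, 1), (-1, 2), (3, 3)]
test_cov, train_cov = aligned_interface_fraction(
    trace, test_interface={2, 3, 4}, train_interface={0, 1}
)
```

In this toy trace, two of the three test interface residues and both train interface residues sit in aligned columns.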

pinder.data.qc.similarity_check.analyze_leakage(test_table: DataFrame, alignments: dict[tuple[str, str], list[tuple[Alignment, float, tuple[str, str]]]], train_system_id_mapping: dict[tuple[tuple[str, str], ...], str], metadata: DataFrame, chain_overlap_threshold: float) → None

Analyze leakage across systems based on alignments and mapping

pinder.data.qc.similarity_check.subsample_train(pindex: DataFrame, cluster_size_cutoff: int, num_to_sample: int, random_state: int, cache_path: Path, n_chunks: int) → DataFrame

pinder.data.qc.similarity_check.sequence_leakage_main(index_file: str | None = None, metadata_file: str | None = None, test_table_file: Path | str | None = None, entity_metadata_file: str | None = None, cache_path: str | Path = PosixPath('/home/runner/.local/share/pinder/2024-02/data/similarity-cache'), cluster_size_cutoff: int = 20, chain_overlap_threshold: float = 0.3, num_to_sample: int = 1, n_chunks: int = 100, random_state: int = 42, use_cache: bool = True, num_cpu: int | None = None, pinder_dir: Path | None = None, config: ClusterConfig = ClusterConfig(seed=40, canonical_method='foldseek_community', edge_weight='weight', foldseek_cluster_edge_threshold=0.7, foldseek_edge_threshold=0.55, foldseek_af2_difficulty_threshold=0.7, mmseqs_edge_threshold=0.0, resolution_thr=3.5, min_chain_length=40, min_atom_types=3, max_var_thr=0.98, oligomeric_count=2, method='X-RAY DIFFRACTION', interface_atom_gaps_4A=0, prodigy_label='BIO', number_of_components=1, alphafold_cutoff_date='2021-10-01', depth_limit=2, max_node_degree=1000, top_n=1, min_depth_2_hits_with_comm=1, max_depth_2_hits_with_comm=2000, max_depth_2_hits=1000)) → None

Extract sequence similarity / leakage by subsampling members in train split.

Parameters:
  • index_file (str | None) – Path to custom/intermediate index file, if not provided will use get_index().

  • metadata_file (str | None) – Path to custom/intermediate metadata file, if not provided will use get_metadata().

  • test_table_file (Path | str | None) – Path to custom/intermediate test systems table; if not provided, the pinder ingest directory must be provided via pinder_dir.

  • entity_metadata_file (str | None) – Path to custom/intermediate entity metadata file, if not provided will use get_supplementary_data().

  • cache_path (str | Path) – Directory to store cached alignments. Defaults to get_pinder_location() / ‘data/similarity-cache’.

  • cluster_size_cutoff (int) – The minimum size of a train set cluster for sampling. Default is 20.

  • chain_overlap_threshold (float) – Threshold for chain overlap in interface residues. Default is 0.3.

  • num_to_sample (int) – The number of cluster elements to sample. Default is 1.

  • n_chunks (int) – The number of sequence pair chunks to evaluate. Default is 100.

  • random_state (int) – Random state for train set subsampling. Default is 42.

  • use_cache (bool) – Whether to use cached alignments. Default is True.

  • num_cpu (int | None) – Limit on the number of CPUs used in multiprocessing. Default is None (use all).

  • pinder_dir (Path | None) – Path to the pinder dataset generation directory. If not provided, will assume it is being run on local files / post-data gen.

  • config (ClusterConfig) – The config object used to generate the clustering. Used to infer location of test_table_file if pinder_dir is provided.

Returns:

None.
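The interplay of `cluster_size_cutoff`, `num_to_sample`, and `random_state` can be illustrated with a small, self-contained sketch. It assumes that clusters with at least `cluster_size_cutoff` members are eligible and that `num_to_sample` members are drawn from each eligible cluster with a seeded random generator; the function name, the `(system_id, cluster_id)` input layout, and the inclusive cutoff are assumptions, and the actual `subsample_train` may differ.

```python
from __future__ import annotations

import random
from collections import defaultdict


def subsample_clusters(
    members: list[tuple[str, str]],
    cluster_size_cutoff: int = 20,
    num_to_sample: int = 1,
    random_state: int = 42,
) -> list[str]:
    """Draw `num_to_sample` system ids from every sufficiently large cluster."""
    clusters: dict[str, list[str]] = defaultdict(list)
    for system_id, cluster_id in members:
        clusters[cluster_id].append(system_id)

    rng = random.Random(random_state)  # seeded for reproducible sampling
    sampled: list[str] = []
    for cluster_id in sorted(clusters):  # fixed iteration order
        ids = sorted(clusters[cluster_id])
        if len(ids) < cluster_size_cutoff:
            continue  # cluster too small to sample from
        sampled.extend(rng.sample(ids, min(num_to_sample, len(ids))))
    return sampled


# Hypothetical (system_id, cluster_id) records.
members = [("s1", "c1"), ("s2", "c1"), ("s3", "c1"), ("s4", "c2")]
sampled = subsample_clusters(members, cluster_size_cutoff=2, num_to_sample=1)
```

Sorting both the clusters and their members before sampling keeps the result independent of input order, so the same `random_state` always yields the same subsample.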

pinder.data.qc.uniprot_leakage module

pinder.data.qc.uniprot_leakage.uniprot_leakage_main(index_path: str | None = None, split: str = 'test') → tuple[DataFrame, float]

pinder.data.qc.uniprot_leakage.report_uniprot_test_val_leakage(index_path: str | None = None) → DataFrame

pinder.data.qc.utils module

pinder.data.qc.utils.download_pdbfam_db(download_dir: Path | str, filename: str = 'PDBfam.parquet', url: str = 'http://dunbrack2.fccc.edu/ProtCiD/pfam/PDBfam.txt.gz', overwrite: bool = True) → DataFrame

pinder.data.qc.utils.load_index(index_file: Path | str | None = None) → DataFrame

pinder.data.qc.utils.load_metadata(metadata_file: Path | str | None = None) → DataFrame

pinder.data.qc.utils.load_entity_metadata(entity_metadata_file: Path | str | None = None) → DataFrame

pinder.data.qc.utils.load_pfam_db(pfam_file: Path | str | None = None) → DataFrame

pinder.data.qc.utils.view_potential_leaks(potential_leak_pairs: DataFrame, pml_file: Path = PosixPath('/home/runner/work/pinder/pinder/docs/view_potential_leaks.pml'), max_scenes: int = 100, pdb_dir: Path = PosixPath('/home/runner/.local/share/pinder/2024-02/pdbs'), align_types: list[str] = ['align', 'super', 'chain']) → None

Module contents