Method for constructing a protein complex decoy dataset based on structure-sequence collaborative redundancy removal

By employing a structure-sequence collaborative redundancy removal method, a high-quality, low-redundancy complex decoy dataset is constructed, solving the problems of high redundancy and quality imbalance in existing technologies and enabling effective evaluation of complex models.

CN121583345B · Active · Publication Date: 2026-06-12 · ZHEJIANG UNIV OF TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: ZHEJIANG UNIV OF TECH
Filing Date: 2026-01-26
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing complex decoy datasets suffer from high redundancy, uneven quality, and insufficient representativeness. They also lack structure and sequence co-control, making it difficult to evaluate the quality of complex models.

Method used

By employing a structure-sequence collaborative redundancy removal method, two-layer clustering is performed using structural similarity and sequence homology. Stratified sampling is then combined with TM-score, lDDT, and DockQ_wave metrics to construct a high-quality, low-redundancy decoy dataset.

Benefits of technology

The generated dataset has a balanced quality distribution, low redundancy, and high diversity, making it suitable for training, validating, and benchmarking quality assessment algorithms for protein complex models.

✦ Generated by Eureka AI based on patent content.


Abstract

A method for constructing a protein complex decoy dataset based on structure-sequence collaborative redundancy removal, belonging to the field of bioinformatics, is disclosed. First, an initial set of protein complex structures is screened, removing entries containing nucleic acids, small molecules, or non-protein chains, and selecting binary complexes that meet the requirements of completeness and resolution. Second, structural clustering and sequence clustering are performed based on three-dimensional structural similarity and sequence homology, respectively, and the results are jointly compared to remove redundant complex entries that are highly similar in both structure and sequence. Subsequently, using representative complexes from each cluster as targets, multiple sets of decoy structures are generated with molecular docking or predictive modeling methods, and their quality indices are calculated. Finally, stratified sampling and proportional balancing are performed based on the score intervals of the quality indices to construct a high-quality protein complex decoy dataset with structure-sequence collaborative redundancy removal and a balanced quality distribution. The dataset generated by this invention combines low redundancy, high diversity, and a controllable quality distribution.

Description

Technical Field

[0001] This invention relates to the fields of bioinformatics and protein structure, and in particular to a method for constructing a protein complex decoy dataset based on the collaborative deduplication of structure and sequence.

Background Technology

[0002] In recent years, with the emergence of deep learning methods such as AlphaFold, RoseTTAFold, and DiffDock, models for large-scale complex structure prediction have become available. However, when evaluating the quality of the models they predict, there is still a lack of standardized decoy datasets that are structurally reliable, reasonably distributed, and low in redundancy.

[0003] Current common decoy datasets typically start with experimentally resolved complex structures, generating various decoy structures through molecular docking or random perturbation. While this approach can simulate prediction errors to some extent, several problems remain: First, high redundancy, as multiple complex structures may originate from the same sequence or highly similar structural templates, leading to repetitive structure types and insufficient representativeness in the dataset; second, a lack of unified standards in the decoy generation and screening process, resulting in inconsistent quality; and third, uneven quality distribution, with indicators such as DockQ or lDDT concentrated in low-score regions, lacking sufficient coverage of high-quality and medium-quality decoy structures. Some studies have attempted to reduce repetition through structure or sequence clustering, but these often focus only on a single dimension, failing to achieve coordinated control of sequence and structural features, thus limiting data diversity and representativeness. Therefore, how to simultaneously consider structural differences and sequence independence during the construction of decoy datasets, and obtain a balanced distribution of decoy structures through quality stratified sampling, has become a critical problem urgently needing to be solved in the field of protein complex model quality assessment.

Summary of the Invention

[0004] To overcome the shortcomings of existing complex decoy dataset methods, this invention proposes a method for constructing complex decoy datasets based on collaborative deduplication of structure and sequence. Through steps such as two-layer clustering of structure and sequence, quality-level screening, and balanced sampling, redundancy and distribution bias among structures are effectively controlled. Unlike traditional strategies that rely solely on a single dimension (such as sequence or structure) for deduplication, this invention utilizes both sequence homology and structural similarity information during data screening to eliminate duplicate or highly similar complex entries. Furthermore, it employs stratified sampling based on three quality metrics—TM-score, lDDT, and DockQ_wave—to ensure a balanced distribution of high, medium, and low-quality structures within the dataset. The resulting dataset exhibits both high diversity and low redundancy, achieving statistical equilibrium in quality distribution, and can be used for training and validation of complex model quality evaluation algorithms.

[0005] The technical solution adopted by this invention to solve its technical problem is:

[0006] A method for constructing a complex decoy dataset based on structure-sequence collaborative deduplication is proposed. First, the initial set of protein complex structures is screened, removing entries containing nucleic acids, small molecules, or non-protein chains, and selecting binary complexes that meet the requirements of completeness and resolution. Second, structural clustering and sequence clustering are performed based on three-dimensional structural similarity and sequence homology, respectively, and the results are jointly compared to remove redundant complex entries that are highly similar in both structure and sequence. Then, using representative complexes from each cluster as targets, multiple sets of decoy structures are generated using molecular docking or predictive modeling methods, and their TM-score, lDDT, and DockQ_wave quality indices are calculated. Finally, stratified sampling and proportional balancing are performed based on the score ranges of the above three indices to construct a high-quality protein complex decoy dataset with structure-sequence collaborative deduplication and balanced quality distribution.

[0007] Furthermore, the method includes the following steps:

[0008] Step 1) Complex screening: Extract all binary protein complexes from the RCSB PDB NextGen database, and retain only structures composed of protein chains with a structural resolution of less than or equal to 4 Å and at least 8 interface residues;

[0009] Step 2) Structural clustering: A similarity graph is constructed based on the structural similarity of the whole chain and the overlap of interface residues, and the chain-level clustering is carried out by the label propagation algorithm to eliminate structural redundancy;

[0010] Step 3) Data partitioning: Based on the clustering results, the complex structure is divided into training set, validation set and test set, and data leakage is eliminated by cluster-level independence constraints;

[0011] Step 4) Sequence clustering: Use MMseqs2 to cluster sequences at a preset sequence identity threshold (40%), and combine the result with the structural clustering to perform two-level redundancy removal, thereby achieving collaborative redundancy removal of structure and sequence;

[0012] Step 5) Target screening: Perform multidimensional quality evaluation on the de-redundant complex clusters, and prioritize the selection of representative complexes with high resolution, complete interfaces, and diverse functions as target structures;

[0013] Step 6) Decoy generation: For the target complex, multiple decoy models with diverse structures are generated using deep learning prediction and molecular docking methods;

[0014] Step 7) Balanced sampling: Based on the scores of the three indicators, the decoys are divided into equal-width bins over the range [0,1], and quota sampling is performed within each source label to achieve balanced coverage of the high / medium / low quality ranges and obtain a low-redundancy decoy set.

[0015] Furthermore, the process of step 1) is as follows:

[0016] Step 1.1) Obtain all protein structure entries from the RCSB PDB NextGen database (as of April 20, 2024) and download the corresponding mmCIF files; then perform preliminary quality screening to remove entries containing nucleic acids, small molecule ligands or non-protein polymers, while retaining only structures with a structural resolution of less than or equal to 4 Å and at least 8 interface residues;

[0017] Step 1.2) For each mmCIF structure file, construct all biological assemblies according to the assembly annotations provided by the PDB; if multiple assemblies exist, the primary recommended assembly is selected first; annotate all protein chains in each assembly, including chain ID, entity origin, polymer type, and whether it is a repeating chain. Remove structural fragments of homologous repeating chains and abnormally short chains;

[0018] Step 1.3) Iterate through all protein chain pair combinations in each biological assembly and screen out binary chain pairs with real physical contact. If the distance between the main chain atoms of two chains is less than 8 Å, it is defined as a protein-protein interaction complex with interfacial contact, and finally the first number (2.09 million) of PPI chain pairs are obtained; for each screened PPI chain pair, record its PDB number, chain pair ID, whether it is a homologous chain pair, number of contact residues, contact area and other structural geometric characteristics.

[0019] Step 1.4) To distinguish whether the interfaces observed in the protein complex are biologically relevant or crystal contacts, PRODIGY-Cryst is used to discriminate and annotate the interface of each PPI chain pair. Simultaneously, to quantify the evolutionary information content of each chain, the effective sequence number Neff of the two chains in the binary complex is calculated, and an MSA is constructed for each chain. The sequence set in the MSA is denoted as $\{s_1, \dots, s_N\}$. For the sequence $s_i$, the weight $w_i$ is assigned by the following formula:

[0020] $w_i = \dfrac{1}{\sum_{j=1}^{N} \mathbb{I}\left[\mathrm{id}(s_i, s_j) > \tau\right]}$;

[0021] where $\mathrm{id}(s_i, s_j)$ is the percentage of sequence identity calculated over the aligned positions, and $\mathbb{I}[\cdot]$ is an indicator function that takes the value 1 when the identity is higher than the threshold $\tau$; $\tau$ is set to 0.8. For each binary complex, chain1_neff and chain2_neff (from the MSA of each chain) are annotated.
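The weighting in Step 1.4) can be illustrated with a minimal Python sketch. It assumes that Neff is the sum of the per-sequence weights (a common convention that the text above does not spell out), that the MSA is given as equal-length strings with `-` for gaps, and that identity is measured over columns where both sequences are ungapped. The function names `sequence_identity` and `neff` are hypothetical.

```python
from typing import List


def sequence_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where both sequences are ungapped."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def neff(msa: List[str], threshold: float = 0.8) -> float:
    """Assumed definition: Neff = sum_i w_i, with w_i = 1 / #{j : identity(i, j) > threshold}."""
    total = 0.0
    for i, si in enumerate(msa):
        # A sequence always counts as similar to itself, so the count is at least 1.
        n_similar = 1 + sum(
            sequence_identity(si, sj) > threshold
            for j, sj in enumerate(msa)
            if j != i
        )
        total += 1.0 / n_similar
    return total


if __name__ == "__main__":
    toy_msa = ["ACDEFGHIKL", "ACDEFGHIKL", "ACDEFGHIKV", "WWWWWWWWWW"]
    print(neff(toy_msa))
```

In the toy alignment the first three sequences are mutually more than 80% identical and share a total weight of 1, while the unrelated fourth sequence contributes a weight of 1 on its own.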

[0022] Furthermore, the process of step 2) is as follows:

[0023] Step 2.1) Extract the set of interface residues for each complex. The interface is defined as the set of residues with an arbitrary atomic distance of less than 8 Å between interacting chains. If the number of interface residues of a chain is less than 8, it is considered an occasional contact or an unstable complex and is removed.

[0024] Step 2.2) Implement efficient chain-level structure alignment based on the Foldseek algorithm framework. By extracting the matching intervals and residue pair information output by Foldseek, calculate the structural similarity score between all pairs of chains;

[0025] Step 2.3) Construct a graph $G = (V, E)$ from all chain alignment results. Each node $v_i \in V$ represents a protein chain, and each edge $e_{ij} \in E$ represents the structural similarity between two chains. Each edge records the following weight information: the structural similarity score $S_{ij}$ between chain $i$ and chain $j$, and the interface residue overlap ratios $O_i$ and $O_j$. The overlap ratios are calculated as follows:

[0026] $O_i = \dfrac{|M_{ij} \cap I_i|}{|I_i|}$;

[0027] $O_j = \dfrac{|M_{ij} \cap I_j|}{|I_j|}$;

[0028] where $O_i$ represents the overlap ratio between the matching interval $M_{ij}$ and the interface residues $I_i$ of chain $i$, and $O_j$ is the corresponding overlap ratio for chain $j$. Edges that meet the following conditions are retained to form a high-quality alignment graph: $S_{ij}$ ≥ the first threshold (0.75), indicating high alignment quality; overlap ratio ≥ the second threshold (0.5), meaning at least half of the interface area participates in the matching;

[0029] Step 2.4) Perform unsupervised clustering using the asynchronous label propagation algorithm on the high-quality alignment graph to obtain the structural cluster number of each chain, denoted as $c_i$. Labels are propagated only between chains that satisfy both the structural similarity and the interface overlap conditions;

[0030] Step 2.5) For each PPI complex, the combination of the structural cluster numbers of its two chains $a$ and $b$ is defined as the structural cluster ID of that PPI, denoted as:

[0031] $C^{\mathrm{str}} = (c_a, c_b)$;

[0032] The cluster ID is the unique identifier of the complex at the structural redundancy removal level. If two chain pairs fall into the same cluster combination, they are considered redundant at the structural level. Finally, each PPI complex is mapped to a unique chain-pair structural cluster identifier in the above manner, resulting in a second number (47201) of PPI structural clusters.
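Steps 2.3) to 2.5) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patented implementation: the edge tuples are hypothetical inputs (in practice they would be derived from Foldseek output), both overlap ratios are assumed to need to pass the 0.5 threshold, ties in label propagation are broken deterministically by the smallest label (a real asynchronous label propagation run normally visits nodes in random order), and the cluster ID is assumed to be order-independent.

```python
from collections import Counter, defaultdict


def filter_edges(edges, min_score=0.75, min_overlap=0.5):
    """edges: iterable of (chain_i, chain_j, score, overlap_i, overlap_j).
    Assumption: both overlap ratios must reach min_overlap."""
    return [(i, j) for i, j, s, oi, oj in edges
            if s >= min_score and min(oi, oj) >= min_overlap]


def label_propagation(nodes, edges, max_iter=100):
    """Asynchronous label propagation with in-place updates and deterministic tie-breaking."""
    adj = defaultdict(set)
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    labels = {n: n for n in nodes}  # every chain starts in its own cluster
    for _ in range(max_iter):
        changed = False
        for n in sorted(nodes):
            if not adj[n]:
                continue  # isolated chains keep their own label
            counts = Counter(labels[m] for m in adj[n])
            best = max(counts.values())
            new_label = min(l for l, c in counts.items() if c == best)
            if new_label != labels[n]:
                labels[n] = new_label
                changed = True
        if not changed:
            break
    return labels


def ppi_cluster_id(chain_a, chain_b, labels):
    """Structural cluster ID of a chain pair; sorted so that chain order does not matter (assumption)."""
    return tuple(sorted((labels[chain_a], labels[chain_b])))


if __name__ == "__main__":
    raw = [("A", "B", 0.90, 0.8, 0.7), ("B", "C", 0.80, 0.6, 0.6),
           ("C", "D", 0.90, 0.3, 0.9), ("D", "E", 0.60, 0.9, 0.9)]
    labels = label_propagation("ABCDE", filter_edges(raw))
    print(labels, ppi_cluster_id("A", "D", labels))
```

In the example, the C-D edge is dropped for low interface overlap and the D-E edge for low similarity, so chains A, B, and C share one structural cluster while D and E remain singletons.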

[0033] Furthermore, the process of step 3) is as follows:

[0034] Step 3.1) From the second number (47201) of PPI structural clusters obtained in Step 2.5), select the third number (4000) of clusters for validation and testing. All PPI complexes that meet the structural quality criteria for validation and testing are initially marked as the high-quality candidate set. These criteria require that the number of interface residues is not less than the threshold and that the structural completeness and resolution meet the preset requirements.

[0035] Step 3.2) To prevent leakage of structural and sequence information between the high-quality candidate set and the training set, all PPI complexes in both sets are split into individual chains, and a structural similarity graph and a sequence similarity graph are constructed. Specifically, an edge is added to the structural similarity graph when the lDDT between two chains is ≥0.6 and the overlap ratio of interface residues is ≥0.5; an edge is added to the sequence similarity graph when the sequence identity is ≥ a preset percentage. For each complex system $s$, the neighborhoods of its two chains $a$ and $b$ are expanded to depth $D=2$ in both the structural and sequence similarity graphs, defining a system-level neighborhood set:

[0036] $\mathcal{N}(s) = \mathcal{N}_D^{\mathrm{str}}(a) \cup \mathcal{N}_D^{\mathrm{str}}(b) \cup \mathcal{N}_D^{\mathrm{seq}}(a) \cup \mathcal{N}_D^{\mathrm{seq}}(b)$;

[0037] where $\mathcal{N}_D^{\mathrm{str}}(\cdot)$ and $\mathcal{N}_D^{\mathrm{seq}}(\cdot)$ denote the chains reachable within $D$ steps in the structural and sequence similarity graphs, respectively. When the number of neighbors $|\mathcal{N}(s)|$ falls within the preset range, the structure is marked as a high-quality candidate structure and added to the high-quality candidate set;

[0038] Step 3.3) Sort the high-quality candidate set according to the following criteria: ① heterodimers are given priority; ② higher resolution is given priority; ③ more recently published structures are given priority. The sorted high-quality candidate set is then partitioned by structural cluster to minimize redundancy in the test set.

[0039] Step 3.4) Finally, from the partitioned high-quality candidate set, the highest-priority PPI complex is selected from each cluster to obtain the fourth number (4000) of PPI complexes. These are randomly divided into test and validation sets, and a leakage removal operation is performed on the training set: every training-set entry that is a neighbor of a test-set or validation-set PPI complex in the structural or sequence similarity graph is removed from the training set.
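The neighborhood expansion of Step 3.2) and the leakage removal of Step 3.4) can be sketched with a breadth-first search. The sketch assumes that the system-level neighborhood is the union of the depth-2 neighborhoods of both chains in both graphs, that the same depth is used when removing leaks, and that a training complex is dropped when either of its chains is a held-out chain or one of its neighbors. Graphs are plain adjacency dictionaries, and all function names are hypothetical.

```python
from collections import deque


def neighbors_within(graph, start, depth=2):
    """Chains reachable from `start` in at most `depth` edges (start itself excluded)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    seen.discard(start)
    return seen


def system_neighborhood(chain_a, chain_b, struct_graph, seq_graph, depth=2):
    """Union of structural and sequence neighborhoods of both chains (assumed form)."""
    hood = set()
    for chain in (chain_a, chain_b):
        hood |= neighbors_within(struct_graph, chain, depth)
        hood |= neighbors_within(seq_graph, chain, depth)
    return hood


def remove_leaks(train, held_out, struct_graph, seq_graph, depth=2):
    """train / held_out: dict mapping a PPI name to its (chain_a, chain_b) pair."""
    blocked = set()
    for a, b in held_out.values():
        blocked |= {a, b} | system_neighborhood(a, b, struct_graph, seq_graph, depth)
    return {name: pair for name, pair in train.items()
            if pair[0] not in blocked and pair[1] not in blocked}


if __name__ == "__main__":
    struct_graph = {"a1": {"x"}, "x": {"a1", "y"}, "y": {"x", "z"}, "z": {"y"}}
    seq_graph = {"b1": {"q"}, "q": {"b1"}}
    train = {"t1": ("x", "m"), "t2": ("z", "n"), "t3": ("p", "q")}
    print(remove_leaks(train, {"h": ("a1", "b1")}, struct_graph, seq_graph))
```

Here `t1` and `t3` are removed because chains `x` and `q` lie within two steps of the held-out chains, while `t2` survives because `z` is three steps away.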

[0040] Furthermore, step 4) is as follows:

[0041] Step 4.1) Based on the UniRef_IDs of the two chains of each PPI complex labeled in Step 1.2), extract the UniRef_IDs of all PPI complexes in the training set, yielding the fifth number (47098) of protein-chain UniRef_IDs. Subsequently, download the FASTA sequence file for each chain and use the MMseqs2 ultrafast protein sequence alignment tool to perform pairwise alignments of all chains. Using sequence identity ≥ a preset percentage (40%) as the clustering condition, construct a set of single-chain sequence clusters. Each protein chain is assigned to a unique sequence cluster, so that every chain has a corresponding cluster label;

[0042] Step 4.2) For each PPI complex in the training set, find the sequence cluster to which each of its two chains belongs, and then define the sequence cluster label of that PPI complex:

[0043] $C_k^{\mathrm{seq}} = (q_{k,1}, q_{k,2})$;

[0044] where $q_{k,1}$ and $q_{k,2}$ are the sequence cluster numbers of the two chains in the $k$-th PPI;

[0045] Step 4.3) Some PPI systems that already fall into the same structural cluster may also have constituent chains that are homologous in sequence. Therefore, the structural cluster label $C^{\mathrm{str}}$ and the sequence cluster label $C^{\mathrm{seq}}$ are considered together to identify redundant pairs. Any two PPI systems $m$ and $n$ are checked against the following double matching condition:

[0046] $C_m^{\mathrm{str}} = C_n^{\mathrm{str}}$ and $q_{m,1} = q_{n,1}$ and $q_{m,2} = q_{n,2}$;

[0047] If the condition is met, both PPI systems are considered redundant at both the structural and sequence levels. Furthermore, to ensure that chain order does not affect the redundancy assessment, the interchange of the chain pair must also be considered. The actual sequence-level test is as follows:

[0048] $(q_{m,1}, q_{m,2}) = (q_{n,1}, q_{n,2})$ or $(q_{m,1}, q_{m,2}) = (q_{n,2}, q_{n,1})$;

[0049] When both the structural cluster label $C^{\mathrm{str}}$ and the sequence cluster label $C^{\mathrm{seq}}$ are the same, the two systems are considered redundant. After processing by this strategy, the sixth number (42443) of PPI clusters with de-redundant structures are finally used to generate the seventh number (65524) of PPI clusters with strictly de-redundant sequences and structures.
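The double matching test of Step 4.3) reduces to comparing a composite key. In the minimal sketch below, the sequence cluster pair is sorted so that swapping the two chains gives the same key, and the first complex seen for each key is kept. The tuple layout and the names `redundancy_key` and `deduplicate` are hypothetical; which member of a redundant group is retained is an assumption made here for simplicity.

```python
def redundancy_key(struct_label, seq_cluster_a, seq_cluster_b):
    """Same structural label and same unordered pair of sequence clusters => redundant."""
    return (struct_label, tuple(sorted((seq_cluster_a, seq_cluster_b))))


def deduplicate(ppis):
    """ppis: iterable of (ppi_id, struct_label, seq_cluster_a, seq_cluster_b).
    Keeps the first PPI encountered for every redundancy key."""
    representatives = {}
    for ppi_id, struct_label, qa, qb in ppis:
        representatives.setdefault(redundancy_key(struct_label, qa, qb), ppi_id)
    return list(representatives.values())


if __name__ == "__main__":
    ppis = [
        ("p1", "S1", 3, 7),
        ("p2", "S1", 7, 3),  # same structure cluster, chains swapped: redundant with p1
        ("p3", "S2", 3, 7),  # same sequences, different structure cluster: kept
        ("p4", "S1", 3, 8),  # same structure cluster, different sequence pair: kept
    ]
    print(deduplicate(ppis))
```

Because redundancy requires agreement at both levels, the number of joint clusters can exceed the number of structure-only clusters, which is consistent with the counts reported in Step 4.3).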

[0050] Furthermore, step 5) is as follows:

[0051] Step 5.1) First, multiple biophysical and structural features are used to perform quality filtering on each PPI structure in the training set, retaining complex structures that meet the following conditions: (1) the number of atom types in both chains is greater than 3; (2) the source is a biological assembly; (3) the biological assembly is manually annotated, or is determined by both manual annotation and software inference; (4) the number of residues in each chain is >40; (5) the total resolved length of the two chains of the complex is less than 1200; (6) the X-ray resolution is less than or equal to 4.0 Å; (7) the buried surface area is greater than 100 Å²; (8) the number of interfacial atomic contacts is greater than 5; (9) the effective sequence number Neff of both chains is greater than 10. Only the structures that satisfy all the above conditions are retained, and the eighth number (10126) of clusters are finally obtained;

[0052] Step 5.2) For each sequence-structure cluster, select the most representative structure as the target for that cluster. To this end, define the following priority scoring function to comprehensively measure the priority of each structure:

[0053] ;

[0054] where the three indicator terms denote whether the complex contains antibodies, antigens, and enzymes, respectively; the resolution of the structure is expressed in angstroms; and the four positive coefficients that control the relative weight of each factor are empirically set to 0.5, 0.2, 0.1, and 0.2. In this function, the smaller the score, the higher the priority of the structure. During sorting, each cluster retains the structure with the lowest score as its representative, finally yielding the eighth number (10126) of representative target structures.
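The nine retention conditions of Step 5.1) can be expressed as a single predicate. The sketch below is illustrative only: the `PPIRecord` fields and the evidence labels `"author"` and `"author_and_software"` are hypothetical names for the annotations described in the text, and the priority scoring of Step 5.2) is deliberately not modeled.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class PPIRecord:
    atom_types: Tuple[int, int]      # distinct atom types in chain 1 / chain 2
    from_bioassembly: bool           # source is a biological assembly
    assembly_evidence: str           # hypothetical labels, see lead-in
    chain_lengths: Tuple[int, int]   # resolved residues per chain
    resolution: float                # Å
    buried_area: float               # Å²
    atom_contacts: int               # interfacial atomic contacts
    neff: Tuple[float, float]        # effective sequence number per chain


def passes_quality_filter(r: PPIRecord) -> bool:
    """True only if all nine conditions of Step 5.1) hold."""
    return (
        min(r.atom_types) > 3                                             # (1)
        and r.from_bioassembly                                            # (2)
        and r.assembly_evidence in {"author", "author_and_software"}      # (3)
        and min(r.chain_lengths) > 40                                     # (4)
        and sum(r.chain_lengths) < 1200                                   # (5)
        and r.resolution <= 4.0                                           # (6)
        and r.buried_area > 100.0                                         # (7)
        and r.atom_contacts > 5                                           # (8)
        and min(r.neff) > 10                                              # (9)
    )


if __name__ == "__main__":
    good = PPIRecord((4, 5), True, "author", (120, 300), 2.5, 850.0, 40, (55.0, 12.5))
    print(passes_quality_filter(good))
    print(passes_quality_filter(replace(good, resolution=4.5)))
```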

[0055] Furthermore, step 6) is as follows:

[0056] Step 6.1) End-to-end method (generating the complex directly from the sequence), as follows:

[0057] Step 6.1.1) MassiveFold process: By perturbing the hyperparameters of AlphaFold2 (such as sampling temperature, number of recycles, and dropout ratio), a ninth number (100) of complex candidate structures are generated. For each perturbation parameter combination $\phi_k$, one forward inference is performed:

[0058] $X_k = f_{\mathrm{AF}}(s_A, s_B; \phi_k)$;

[0059] where $f_{\mathrm{AF}}$ represents the trained AlphaFold prediction network, $s_A$ and $s_B$ are the amino acid sequences of the two chains, $\phi_k$ is the hyperparameter combination for the $k$-th sampling, and $X_k$ is the predicted 3D coordinates of the generated complex. By changing the random sampling distribution of the hyperparameters $\phi_k$, complex predictions with different binding poses are generated from the same input sequences, finally yielding the ninth number (100) of decoy models;

[0060] Step 6.1.2) AlphaFold 3 diffusion generation process: First, Gaussian noise is added to the coordinates $x_0$ of the target complex to form a sequence $x_1, \dots, x_T$. Then, the reverse diffusion process is learned through a neural network:

[0061] $x_{t-1} = \dfrac{1}{\sqrt{\alpha_t}}\left(x_t - \dfrac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z$;

[0062] where $x_t$ indicates the noise state at step $t$, $\alpha_t$ is the noise attenuation coefficient (with $\bar{\alpha}_t = \prod_{i=1}^{t}\alpha_i$), $\epsilon_\theta$ is the neural network that predicts the noise, $\sigma_t$ is the standard deviation of the noise, and $z$ is the sampled noise. Through multiple reverse sampling iterations, high-quality complex structures are gradually generated from random noise, ultimately yielding the ninth number (100) of decoy models;

[0063] Step 6.2) Docking method, as follows:

[0064] Step 6.2.1) HDOCK process: HDOCK adopts a rigid-body docking method based on the fast Fourier transform, automatically generating the tenth number (200) of decoy models for each target structure. The docking scoring function integrates van der Waals energy, electrostatic energy, desolvation energy, and buried surface area:

[0065] $E = w_1 E_{\mathrm{vdw}} + w_2 E_{\mathrm{elec}} + w_3 E_{\mathrm{desolv}} + w_4 A_{\mathrm{bur}}$;

[0066] where $E_{\mathrm{vdw}}$ represents the van der Waals interaction energy, $E_{\mathrm{elec}}$ represents the electrostatic energy, $E_{\mathrm{desolv}}$ indicates the desolvation energy, $A_{\mathrm{bur}}$ indicates the buried surface area, and $w_1$ to $w_4$ are the corresponding energy weight coefficients. By minimizing the energy, the tenth number (200) of lowest-energy binding poses are selected as candidate decoys for HDOCK.

[0067] Step 6.2.2) HADDOCK process: HADDOCK adopts a staged docking and progressive flexible refinement strategy. First, a rigid-body sampling stage generates the ninth number (100) of candidate structures. Second, a flexible refinement stage is performed; because this stage is computationally expensive and time-consuming, only the top eleventh number (40) of candidate structures are flexibly optimized. Finally, in the energy refinement stage, energy minimization is performed on these eleventh number (40) of models, and the total energy function is defined as:

[0068] ;

[0069] where the additional restraint term is the ambiguous restraint energy based on experimental or predicted results, and the remaining parameters are the same as those defined for HDOCK. Through multiple rounds of screening and flexible optimization, an eleventh number (40) of decoy structures with high interface complementarity are obtained.

[0070] Step 6.2.3) DiffDock-PP process: A deep learning docking strategy based on a diffusion model is adopted. For each PPI complex, NUM_FOLDS=3, SEED=0, and NUM_SAMPLES=60 are set. That is, for each target, the twelfth number (60) of structures are sampled under each of three different random initializations, for a total of the thirteenth number (180) of decoys. DiffDock-PP performs the reverse diffusion process in the rigid-body pose space to generate the complex structure:

[0071] $x_{t-1} = x_t + \eta\, s_\theta(x_t, t) + \sqrt{2\eta}\, z$;

[0072] where $x_t$ indicates the rotation and translation state at step $t$, $\eta$ is the step size parameter, $s_\theta$ is the score function predicted by the neural network, used to learn the conditional gradient distribution of the complex pose, and $z$ is Gaussian noise. Through stepwise iterative generation and denoising, DiffDock-PP samples the thirteenth number (180) of diverse decoy structures for each PPI complex on a physically reasonable energy surface.

[0073] Furthermore, the process of step 7) is as follows:

[0074] Step 7.1) Align each decoy model with its corresponding native assembly using USAlign, and use OpenStructure to calculate and retain three score indicators: the first is the global fold accuracy TM-score, the second is the local atomic-level accuracy lDDT, and the third is the interface quality DockQ_wave. The three scores are uniformly normalized to the [0,1] interval.

[0075] Step 7.2) Divide TM-score, lDDT and DockQ_wave into ten equally wide score intervals [0.0,0.1), [0.1,0.2) … [0.9,1.0] on [0,1]. Record the source of the decoy as the source label, such as docking class (HDOCK, HADDOCK3, DiffDock-PP) and sampling class (AF3); For each score interval and each source label of each indicator, perform fixed-scale stratified balanced sampling: the target number of samples is the fourteenth number (15). If the number of samples is less than the fourteenth number (15), first retain all samples, and then replenish them from adjacent score intervals in the order of same label, nearest neighbor interval, energy / confidence quantile. For intervals with sufficient samples, further perform secondary balance according to energy or model confidence quantile to avoid the concentration effect of the same energy band.

[0076] Step 7.3) Finally, three sets of parallel balanced subsets are formed, corresponding to TM-score, lDDT and DockQ_wave respectively. This design achieves uniform coverage from low to high in three dimensions: global folding, local atomic level and interface quality.
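The core of the binning and quota sampling in Step 7.2) can be sketched as follows. This simplified illustration covers only the equal-width binning and the per-bin, per-source quota of 15; the replenishment from adjacent bins and the secondary balancing by energy or confidence quantile are omitted. The decoy records are hypothetical dictionaries with a `source` key and one key per metric.

```python
import random
from collections import defaultdict


def bin_index(score, n_bins=10):
    """Equal-width bin on [0, 1]; the last bin is closed, so 1.0 maps to bin n_bins - 1."""
    return min(int(score * n_bins), n_bins - 1)


def balanced_sample(decoys, metric, target=15, n_bins=10, seed=0):
    """Keep at most `target` decoys for every (source label, score bin) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for decoy in decoys:
        strata[(decoy["source"], bin_index(decoy[metric], n_bins))].append(decoy)
    picked = []
    for key in sorted(strata):
        members = strata[key]
        if len(members) <= target:
            picked.extend(members)  # under-filled stratum: keep everything
        else:
            picked.extend(rng.sample(members, target))
    return picked


if __name__ == "__main__":
    decoys = (
        [{"source": "HDOCK", "DockQ_wave": 0.05} for _ in range(40)]
        + [{"source": "AF3", "DockQ_wave": 0.95} for _ in range(5)]
        + [{"source": "AF3", "DockQ_wave": 1.0}]
    )
    subset = balanced_sample(decoys, "DockQ_wave")
    print(len(subset))
```

Running the function once per metric (TM-score, lDDT, DockQ_wave) gives three parallel subsets in the spirit of Step 7.3).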

[0077] The beneficial effects of this invention are that the generated dataset has low redundancy, high diversity, and controllable quality distribution, making it suitable for training, validation, and benchmarking of complex model quality evaluation algorithms.

Attached Figure Description

[0078] Figure 1 is an overall flowchart of a method for constructing a complex decoy dataset based on structure and sequence collaborative redundancy removal.

[0079] Figure 2 is a schematic diagram of the photosynthetic membrane protein complex 2BHW with an amino acid sequence length of 446.

Detailed Implementation

[0080] The present invention will now be further described with reference to the accompanying drawings.

[0081] Referring to Figure 1, a method for constructing a complex decoy dataset based on structure and sequence collaborative redundancy removal comprises the following steps:

[0082] Step 1) Collect all binary protein complex structures. The process is as follows:

[0083] Step 1.1) Obtain all protein structure entries from the RCSB PDB NextGen database (as of April 20, 2024) and download the corresponding mmCIF files. Then, perform preliminary quality screening to remove entries containing nucleic acids, small molecule ligands, or non-protein polymers, and retain only structures with a structural resolution of less than or equal to 4 Å and at least 8 interface residues.

[0084] Step 1.2) For each mmCIF structure file, construct all biological assemblies according to the assembly annotations provided by PDB. If multiple assemblies exist, the main recommended assembly is selected first. All protein chains in each assembly are labeled, including chain ID, UniRef_ID, entity source, polymer type, whether it is a repeating chain, etc., and structural fragments of homologous repeating chains and abnormally short chains are removed.

[0085] Step 1.3) Traverse all protein chain pair combinations in each biological assembly and screen out binary chain pairs with real physical contact. If the distance between the main chain atoms of the two chains is less than 8 Å, it is defined as a protein-protein interaction complex with interfacial contact. Finally, 2.09 million PPI chain pairs are obtained. For each screened PPI chain pair, record its PDB number, chain pair ID, whether it is a homologous chain pair, number of contact residues, contact area and other structural geometric features.

[0086] Step 1.4) To distinguish whether the interfaces observed in the protein complex are biologically relevant or crystal contacts, PRODIGY-Cryst is used to discriminate and annotate the interface of each PPI chain pair. Simultaneously, to quantify the evolutionary information content of each chain, the effective sequence number Neff of the two chains in the binary complex is calculated, and an MSA is constructed for each chain. The sequence set in the MSA is denoted as $\{s_1, \dots, s_N\}$. For the sequence $s_i$, the weight $w_i$ is assigned by the following formula:

[0087] $w_i = \dfrac{1}{\sum_{j=1}^{N} \mathbb{I}\left[\mathrm{id}(s_i, s_j) > \tau\right]}$;

[0088] where $\mathrm{id}(s_i, s_j)$ is the percentage of sequence identity calculated over the aligned positions, and $\mathbb{I}[\cdot]$ is an indicator function that takes the value 1 when the identity is higher than the threshold $\tau$; $\tau$ is taken as 0.8. For each binary complex, chain1_neff and chain2_neff (from the MSA of each chain) are annotated.

[0089] Step 2) Structure-based clustering, the process is as follows:

[0090] Step 2.1) Extract the set of interface residues for each complex. The interface is defined as the set of residues with an arbitrary atomic distance of less than 8 Å between interacting chains. If the number of interface residues of a chain is less than 8, it is considered an occasional contact or an unstable complex and is removed.

[0091] Step 2.2) Implement efficient chain-level structure alignment based on the Foldseek algorithm framework. By extracting the matching intervals and residue pair information output by Foldseek, calculate the structural similarity score between all pairs of chains;

[0092] Step 2.3) Construct a graph $G = (V, E)$ from all chain alignment results. Each node $v_i \in V$ represents a protein chain, and each edge $e_{ij} \in E$ represents the structural similarity between two chains. Each edge records the following weight information: the structural similarity score $S_{ij}$ between chain $i$ and chain $j$, and the interface residue overlap ratios $O_i$ and $O_j$. The overlap ratios are calculated as follows:

[0093] $O_i = \dfrac{|M_{ij} \cap I_i|}{|I_i|}$;

[0094] $O_j = \dfrac{|M_{ij} \cap I_j|}{|I_j|}$;

[0095] where $O_i$ represents the overlap ratio between the matching interval $M_{ij}$ and the interface residues $I_i$ of chain $i$, and $O_j$ is the corresponding overlap ratio for chain $j$. Edges that meet the following conditions are retained to form a high-quality alignment graph: $S_{ij}$ ≥0.75, indicating high alignment quality; overlap ratio ≥0.5, meaning at least half of the interface area participates in the matching;

[0096] Step 2.4) Perform unsupervised clustering using the asynchronous label propagation algorithm on the high-quality alignment graph to obtain the structural cluster number of each chain, denoted as $c_i$. Labels are propagated only between chains that satisfy both the structural similarity and the interface overlap conditions;

[0097] Step 2.5) For each PPI complex, the combination of the structural cluster numbers of its two chains $a$ and $b$ is defined as the structural cluster ID of that PPI, denoted as:

[0098] $C^{\mathrm{str}} = (c_a, c_b)$;

[0099] The cluster ID is the unique identifier of the complex at the structural redundancy removal level. If two chain pairs fall into the same cluster combination, they are considered redundant at the structural level; each PPI complex is mapped to a unique chain-pair structural cluster identifier in the above manner, resulting in 47201 PPI structural clusters.

[0100] Step 3) Divide the dataset into training, validation, and test sets, and perform data leakage removal operations. The process is as follows:

[0101] Step 3.1) Take 4,000 of the 47,201 PPI structural clusters obtained in step 2.5) for validation and testing. All PPI complexes that meet the structural quality criteria for validation and testing are initially marked as the high-quality candidate set. These criteria require that the number of interface residues is not less than the threshold and that the structural completeness and resolution reach the preset values.

[0102] Step 3.2) To prevent leakage of structural and sequence information between the high-quality candidate set and the training set, this invention splits all PPI complexes in both sets into individual chains and then constructs a structural similarity graph and a sequence similarity graph. Specifically, an edge is added to the structural similarity graph when the lDDT of two chains is ≥0.6 and the overlap ratio of interface residues is ≥0.5; an edge is added to the sequence similarity graph when the sequence identity is ≥40%. For each complex system s, the neighborhoods of its two chains a and b are expanded to depth D=2 in both the structural similarity graph and the sequence similarity graph, defining a system-level neighborhood set.

[0103] ;

[0104] When the number of neighbors falls within the preset range, the structure is marked as a high-quality candidate structure, forming the high-quality candidate set;

[0105] Step 3.3) The high-quality candidate set is sorted according to the following criteria: ① heterodimers are given priority; ② higher-resolution structures are given priority; ③ more recently published structures are given priority. The sorted high-quality candidate set is then divided by structural cluster to minimize redundancy in the test set.
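The ranking in Step 3.3 and the per-cluster selection that Step 3.4 builds on can be illustrated with a short sketch. The `Candidate` fields and function names are hypothetical, and treating the three criteria as a strict lexicographic order (①, then ②, then ③) is an assumption; the text only lists them in that sequence.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Candidate:
    name: str
    cluster_id: tuple      # structural cluster ID of the PPI
    is_heterodimer: bool
    resolution: float      # in angstroms; a lower value means higher resolution
    released: date


def priority_key(c: Candidate):
    # Tuples sort ascending: heterodimers first, then lower resolution value,
    # then newer release date (negated ordinal so newer sorts first).
    return (not c.is_heterodimer, c.resolution, -c.released.toordinal())


def best_per_cluster(candidates):
    """Return the top-ranked candidate of each structural cluster."""
    best = {}
    for c in sorted(candidates, key=priority_key):
        best.setdefault(c.cluster_id, c)   # the first one seen is the best one
    return best
```

Keeping one complex per cluster means no two held-out complexes share a structural cluster ID.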

[0106] Step 3.4) Finally, in the partitioned high-quality candidate set, the highest-priority PPI complex is selected from each cluster, resulting in 4,000 PPI complexes. These are randomly divided into test and validation sets, and a leakage removal operation is performed on the training set: every training-set complex that is a neighbor of a test or validation complex in either the structural or the sequence similarity graph is removed from the training set.
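The leakage removal can be sketched as a bounded breadth-first search over the two similarity graphs. This is an illustration only: the adjacency-dict representation and the names `neighborhood` and `remove_leaks` are hypothetical, and two choices are assumptions of the sketch, namely reusing the depth D=2 of Step 3.2 for the removal and dropping a training complex when either of its chains is a neighbor.

```python
from collections import deque


def neighborhood(graph, start, depth=2):
    """Chains reachable from `start` within `depth` edges, excluding `start`."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        if d == depth:
            continue                       # do not expand beyond the depth limit
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    seen.discard(start)
    return seen


def remove_leaks(train, heldout, struct_graph, seq_graph, depth=2):
    """train / heldout map a complex ID to its (chain_a, chain_b) pair."""
    blocked = set()
    for chains in heldout.values():
        for ch in chains:
            blocked.add(ch)
            blocked |= neighborhood(struct_graph, ch, depth)
            blocked |= neighborhood(seq_graph, ch, depth)
    # Keep only training complexes with no chain in the blocked set.
    return {cid: chains for cid, chains in train.items()
            if not blocked.intersection(chains)}
```

A training complex is discarded as soon as one chain is similar in structure or in sequence to a held-out chain, so neither kind of similarity can leak into evaluation.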

[0107] Step 4) Sequence-based clustering, the process is as follows:

[0108] Step 4.1) Based on the UniRef_IDs of the two chains of each PPI complex annotated in Step 1.2), extract the UniRef_IDs of all PPI complexes in the training set, giving a total of 47,098 protein-chain UniRef_IDs. Then download the FASTA sequence file for each chain and use the MMseqs2 ultrafast protein sequence alignment tool to align all chains pairwise. Using sequence identity ≥40% as the clustering condition, construct the chain-level sequence cluster set. Each protein chain is assigned to exactly one sequence cluster, so every chain receives a cluster label.
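One way to realize the 40% identity clustering is sketched below. The text names MMseqs2 but not the exact invocation, so the use of the `easy-cluster` workflow with `--min-seq-id 0.4` is an assumption, as are the file names and the helper names. The parser expects the two-column `<prefix>_cluster.tsv` file (representative, member) that `easy-cluster` writes.

```python
def mmseqs_cluster_cmd(fasta, out_prefix, tmp_dir, min_seq_id=0.4):
    """Build an MMseqs2 command line (assumed workflow: easy-cluster)."""
    return ["mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
            "--min-seq-id", str(min_seq_id)]


def parse_cluster_tsv(text):
    """Map each member chain to an integer sequence cluster label.

    Each non-empty line holds: representative<TAB>member.
    """
    rep_index, labels = {}, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        rep, member = line.split("\t")[:2]
        # Representatives are numbered in order of first appearance.
        labels[member] = rep_index.setdefault(rep, len(rep_index))
    return labels


# The command list could be passed to subprocess.run(cmd, check=True);
# "chains.fasta", "seqclu" and "tmp" are placeholder names.
cmd = mmseqs_cluster_cmd("chains.fasta", "seqclu", "tmp")
```

The resulting dictionary gives every chain exactly one cluster label, matching the requirement at the end of Step 4.1.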

[0109] Step 4.2) For each PPI complex in the training set, find the sequence cluster to which each of its two chains belongs, and then define the sequence cluster label for that PPI complex:

[0110] $L^{\mathrm{seq}}_i = (S^a_i, S^b_i)$;

[0111] where $S^a_i$ and $S^b_i$ are the sequence cluster numbers of the two chains of the $i$-th PPI;

[0112] Step 4.3) Some PPI systems that already share a structural cluster may also have homologous constituent chains. The structural cluster labels and the sequence cluster labels are therefore considered together to identify redundant pairs. For any two PPI systems, if the following double matching condition is met:

[0113] and and ;

[0114] the two PPI systems are considered redundant at both the structural and sequence levels. To ensure that chain order does not affect the redundancy assessment, the swapped chain pairing must also be considered. The actual criterion is as follows:

[0115] or ;

[0116] When both the sequence and structure conditions are met, the pair is considered redundant. After processing with this strategy, the 42,443 structurally de-redundant PPI clusters yield 65,524 PPI clusters that are strictly de-redundant in both sequence and structure.
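The double matching test with chain swapping can be illustrated by grouping PPIs under an order-independent key. This is a sketch under assumptions: the function names and input layout are hypothetical, and pairing each chain's structural label with its own sequence label before sorting is one reading of the criterion, not a transcription of the original formulas.

```python
from collections import defaultdict


def joint_key(struct_pair, seq_pair):
    """Order-independent key from per-chain (structural, sequence) labels."""
    chain_a = (struct_pair[0], seq_pair[0])
    chain_b = (struct_pair[1], seq_pair[1])
    # Sorting makes (a, b) and (b, a) produce the same key.
    return tuple(sorted((chain_a, chain_b)))


def group_redundant(ppis):
    """ppis maps a PPI ID to (struct_pair, seq_pair); returns key -> ID list."""
    groups = defaultdict(list)
    for pid, (struct_pair, seq_pair) in ppis.items():
        groups[joint_key(struct_pair, seq_pair)].append(pid)
    return dict(groups)
```

Two PPIs land in the same group only when both chains agree in structural and sequence cluster, in either chain order. A structural cluster whose members differ in sequence cluster is therefore split into several groups, which is consistent with the cluster count rising in this step.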

[0117] Step 5) Low-quality structure filtering and target structure selection, the process is as follows:

[0118] Step 5.1) First, multiple biophysical and structural features are used to quality-filter each PPI structure in the training set. Specifically, complex structures that meet all of the following conditions are retained: (1) the number of atom types in each of the two chains is greater than 3; (2) the source is a biological assembly; (3) the biological assembly is manually annotated, or is determined by both manual annotation and software inference; (4) the number of residues in each chain is >40; (5) the total resolved length of the two chains of the complex is less than 1200; (6) the X-ray resolution is less than or equal to 4.0 Å; (7) the buried surface area is greater than 100 Å²; (8) the number of interfacial atomic contacts is greater than 5; (9) the effective number of sequences Neff of both chains is greater than 10. Only structures that meet all of the above conditions are retained, finally giving 10,126 clusters;
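The nine conditions of Step 5.1 translate directly into a predicate. The sketch below is illustrative: the `PPIRecord` fields, the annotation strings, and the example values are hypothetical, while the numeric thresholds are taken from the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PPIRecord:
    atom_types: tuple        # distinct atom types per chain, e.g. (4, 4)
    from_bio_assembly: bool
    assembly_evidence: str   # "author", "author_and_software", or "software"
    chain_lengths: tuple     # residues per chain
    resolved_length: int     # total resolved length of both chains
    resolution: float        # X-ray resolution in angstroms
    buried_area: float       # buried surface area in square angstroms
    interface_contacts: int
    neff: tuple              # effective number of sequences per chain


def passes_quality_filter(r: PPIRecord) -> bool:
    """True only if all nine conditions of Step 5.1 hold."""
    return all((
        min(r.atom_types) > 3,                                     # (1)
        r.from_bio_assembly,                                       # (2)
        r.assembly_evidence in {"author", "author_and_software"},  # (3)
        min(r.chain_lengths) > 40,                                 # (4)
        r.resolved_length < 1200,                                  # (5)
        r.resolution <= 4.0,                                       # (6)
        r.buried_area > 100.0,                                     # (7)
        r.interface_contacts > 5,                                  # (8)
        min(r.neff) > 10,                                          # (9)
    ))


example = PPIRecord((4, 4), True, "author", (120, 95), 215, 2.1, 850.0, 42, (35.0, 18.5))
too_coarse = replace(example, resolution=4.5)   # violates condition (6) only
```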

[0119] Step 5.2) For each sequence-structure cluster, select the most representative structure as the target of the cluster. To do this, define the following priority scoring function to comprehensively measure the priority of each structure:

[0120] ;

[0121] where the three indicator variables denote whether the complex contains an antibody, an antigen, and an enzyme, respectively; the resolution of the structure is expressed in angstroms; and the four positive coefficients, which control the relative weight of each factor, are empirically set to 0.5, 0.2, 0.1, and 0.2, respectively. In this function, the smaller the score, the higher the priority of the structure. During sorting, each cluster retains the structure with the lowest score as its representative structure. Finally, 10,126 representative target structures are obtained.

[0122] Step 6) Decoy model generation, the process is as follows:

[0123] Step 6.1) End-to-end method (generating complex directly from sequence)

[0124] This method takes the amino acid sequences of the protein chains as input and directly generates the three-dimensional coordinates of the protein complex using a deep learning structure prediction model. It needs no separate docking stage; inter-chain interactions are learned automatically through the attention mechanism of the neural network and three-dimensional geometric constraints. After multiple sequence alignment encoding, the results are input into MassiveFold and AlphaFold3 respectively, generating a diverse set of complex structures, as follows:

[0125] Step 6.1.1) MassiveFold process:

[0126] By perturbing the hyperparameters of AlphaFold2 (such as sampling temperature, number of recycles, and dropout ratio), 100 candidate complex structures are generated. For each perturbed parameter combination $h_k$, one forward inference is performed:

[0127] $X_k = f(s_A, s_B;\, h_k)$;

[0128] where $f$ is the trained AlphaFold prediction network, $s_A$ and $s_B$ are the amino acid sequences of the two chains, $h_k$ is the hyperparameter combination for the $k$-th sampling, and $X_k$ is the predicted three-dimensional coordinates of the generated complex. Changing the random sampling distribution of the hyperparameters yields structurally different complex predictions for the same input sequences, resulting in 100 decoy models.

[0129] Step 6.1.2) AlphaFold 3 diffusion generation process:

[0130] First, Gaussian noise is added step by step to the coordinates of the target complex to form a sequence of noised states. Then, the reverse diffusion process is learned through a neural network:

[0131] ;

[0132] where the terms denote, in turn, the noise state at a given step, the noise attenuation coefficient, the neural network that predicts the noise, the standard deviation of the noise, and the sampled noise. Through multiple reverse sampling iterations, high-quality complex structures are gradually generated from random noise, ultimately yielding 100 decoy models.

[0133] Step 6.2) Docking Method

[0134] The docking method takes protein monomer structures as input and generates multiple candidate complex models through spatial search and energy optimization. Given two monomer structures $X_A$ and $X_B$, the docking process searches the rigid-body pose space for the optimal pose, described by a rotation matrix $R$ and a translation vector $t$, and the coordinates of the complex are obtained as:

[0135] $X_{AB} = X_A \cup (R\,X_B + t)$;
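Applying one candidate pose, that is, a rotation matrix and a translation vector, to the mobile monomer can be sketched as below. The names `R`, `t`, `rot_z`, `transform`, and `dock_pose` are chosen for this example, and the rotation about the z axis is only one convenient way to build a valid rotation matrix; the pose search and scoring performed by a docking program are not shown.

```python
import math


def rot_z(angle):
    """3x3 rotation matrix for a rotation of `angle` radians about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]


def transform(coords, R, t):
    """Apply x -> R x + t to every (x, y, z) coordinate."""
    return [tuple(sum(R[r][k] * p[k] for k in range(3)) + t[r] for r in range(3))
            for p in coords]


def dock_pose(coords_a, coords_b, R, t):
    """Complex coordinates: fixed chain A plus rigidly moved chain B."""
    return list(coords_a) + transform(coords_b, R, t)
```

Each docking decoy corresponds to one such pose; only the pose differs between decoys, while the internal geometry of each monomer stays unchanged in the rigid-body stage.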

[0136] Then, all candidate models are scored and ranked according to the physical energy function or deep learning score, as follows:

[0137] Step 6.2.1) HDOCK process:

[0138] HDOCK employs a rigid-body docking method based on the Fast Fourier Transform, automatically generating approximately 200 decoy models for each target structure. The docking scoring function integrates van der Waals energy, electrostatic energy, desolvation energy, and buried surface area, among other factors.

[0139] ;

[0140] where the terms represent the van der Waals interaction energy, the electrostatic energy, the desolvation energy, and the buried surface area, each with its corresponding energy weighting coefficient. The 200 lowest-energy binding poses are selected as the HDOCK candidate decoys.

[0141] Step 6.2.2) HADDOCK process:

[0142] HADDOCK employs a staged docking strategy with progressive flexible refinement. First, a rigid-body sampling stage generates 100 candidate structures. Next, a flexible refinement stage is run; because of its large computational load and long processing time, only the top 40 candidate structures undergo flexible optimization. Finally, in the energy refinement stage, energy minimization is performed on these 40 models, with the total energy function defined as:

[0143] ;

[0144] where the additional term represents the ambiguous restraint energy derived from experiments or predictions, and the remaining terms are the same as those defined for HDOCK. Through multiple rounds of screening and flexible optimization, 40 decoy structures with high interface complementarity are obtained.

[0145] Step 6.2.3) DiffDock-PP process:

[0146] DiffDock-PP employs a deep learning docking strategy based on a diffusion model. For each complex target, NUM_FOLDS=3, SEED=0, and NUM_SAMPLES=60 are set, meaning that 60 structures are sampled for each target under each of three different random initializations, for a total of 180 decoys. DiffDock-PP performs the reverse diffusion process in rigid-body pose space to generate the complex structure:

[0147] ;

[0148] where the terms denote, in turn, the rotation and translation state at a given step, the step size parameter, the score function predicted by the neural network (used to learn the conditional gradient distribution of the complex pose), and Gaussian noise. Through iterative generation and denoising, DiffDock-PP samples 180 diverse decoy structures on a physically reasonable energy surface.
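As a quick check of the figures in Steps 6.1.1 to 6.2.3, the nominal number of decoys per target can be tallied. The dictionary keys are labels chosen for this example, and the HDOCK count is the approximate value stated in the text.

```python
# Nominal decoys generated per target, as stated in Step 6.
NUM_FOLDS, NUM_SAMPLES = 3, 60          # DiffDock-PP settings from Step 6.2.3

DECOYS_PER_TARGET = {
    "MassiveFold": 100,
    "AlphaFold3": 100,
    "HDOCK": 200,                       # "approximately 200" in the text
    "HADDOCK": 40,                      # only the refined models are kept
    "DiffDock-PP": NUM_FOLDS * NUM_SAMPLES,
}

total_per_target = sum(DECOYS_PER_TARGET.values())   # 620
```

Under these settings each target starts from roughly 620 raw decoys before the balanced sampling of Step 7.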

[0149] Step 7) Perform balanced sampling on the generated decoy models. The process is as follows:

[0150] Step 7.1) Align each decoy model with its corresponding native assembly using USAlign, and calculate and retain three score metrics using OpenStructure: global folding accuracy (TM-score), local atomic precision (lDDT), and interface quality (DockQ_wave). All three scores are uniformly normalized to the [0,1] interval.

[0151] Step 7.2) Divide the [0, 1] range of TM-score, lDDT, and DockQ_wave into ten equal-width score intervals: [0.0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]. Record the generation source of each decoy as its source label, for example the docking class (HDOCK, HADDOCK3, DiffDock-PP) or the sampling class (AF3). For each score interval and source label of each metric, perform fixed-size stratified balanced sampling with a target sample size of 15. If an interval holds fewer than 15 samples, retain all of them first and then replenish from adjacent score intervals, prioritizing in order the same source label, the nearest interval, and the energy or confidence quantile. For intervals with enough samples, perform secondary balancing according to the energy or model-confidence quantile.
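The binning and quota sampling of Step 7.2 can be sketched for a single metric as follows. This is a simplified illustration, not the exact procedure: the dictionary keys (`id`, `source`, `confidence`), the function names, and the even spacing used for the secondary balancing are assumptions, and only cells that already contain at least one decoy are replenished.

```python
from collections import defaultdict

N_BINS = 10     # ten equal-width intervals on [0, 1]
TARGET = 15     # target sample size per (source, interval) cell


def bin_index(score: float) -> int:
    """Interval index of a score in [0, 1]; 1.0 falls in the last interval."""
    return min(int(score * N_BINS), N_BINS - 1)


def spread_pick(items, k):
    """Pick k items evenly spaced along a confidence-sorted list."""
    if len(items) <= k:
        return list(items)
    step = len(items) / k
    return [items[int(i * step)] for i in range(k)]


def balanced_sample(decoys, metric, target=TARGET):
    """decoys: dicts with 'id', 'source', 'confidence' and the metric value."""
    cells = defaultdict(list)
    for d in decoys:
        cells[(d["source"], bin_index(d[metric]))].append(d)

    selected, leftovers = {}, {}
    for key, members in cells.items():
        members.sort(key=lambda d: d["confidence"], reverse=True)
        picked = spread_pick(members, target)
        picked_ids = {d["id"] for d in picked}
        selected[key] = picked
        leftovers[key] = [d for d in members if d["id"] not in picked_ids]

    # Replenish short cells from unused decoys of the same source,
    # visiting the nearest intervals first.
    for (source, b), picked in selected.items():
        for dist in range(1, N_BINS):
            for nb in (b - dist, b + dist):
                pool = leftovers.get((source, nb), [])
                while pool and len(picked) < target:
                    picked.append(pool.pop(0))   # highest confidence first
            if len(picked) >= target:
                break
    return selected
```

Running `balanced_sample` once per metric (TM-score, lDDT, DockQ_wave) gives three parallel subsets. Borrowing only from unused decoys keeps any decoy from appearing twice within one subset.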

[0152] Step 7.3) Finally, three sets of parallel balanced subsets are formed, corresponding to TM-score, lDDT and DockQ_wave respectively. This design achieves uniform coverage from low to high in three dimensions: global folding, local atomic level and interface quality.

[0153] This embodiment takes the native structure 2BHW, a protein complex with an amino acid sequence length of 446, as an example. In the decoy model generation and sampling stages, the method for constructing a complex decoy dataset based on collaborative structure and sequence redundancy removal includes the following steps:

[0154] Step 6) Decoy model generation, the process is as follows:

[0155] Step 6.1) End-to-end method (generating the complex directly from the sequence), as follows:

[0156] Step 6.1.1) MassiveFold process:

[0157] By perturbing the hyperparameters of AlphaFold2 (such as sampling temperature, number of recycles, and dropout ratio), 100 candidate complex structures are generated for 2BHW_BC. For each perturbed parameter combination $h_k$, one forward inference is performed:

[0158] $X_k = f(s_A, s_B;\, h_k)$;

[0159] where $f$ is the trained AlphaFold prediction network, $s_A$ and $s_B$ are the amino acid sequences of the two chains, $h_k$ is the hyperparameter combination for the $k$-th sampling, and $X_k$ is the predicted three-dimensional coordinates of the generated complex. Changing the random sampling distribution of the hyperparameters yields structurally different complex predictions for the same input sequences, finally giving 100 decoy models of 2BHW_BC.

[0160] Step 6.1.2) AlphaFold 3 diffusion generation process:

[0161] First, Gaussian noise is added step by step to the coordinates of 2BHW_BC to form a sequence of noised states. Then, the reverse diffusion process is learned through a neural network:

[0162] ;

[0163] where the terms denote, in turn, the noise state at a given step, the noise attenuation coefficient, the neural network that predicts the noise, the standard deviation of the noise, and the sampled noise. Through multiple reverse sampling iterations, high-quality complex structures are gradually generated from random noise, ultimately yielding 100 decoy models of 2BHW_BC.

[0164] Step 6.2) The Docking method is as follows:

[0165] Step 6.2.1) HDOCK process:

[0166] HDOCK employs a rigid-body docking method based on the Fast Fourier Transform, automatically generating approximately 200 decoy models for 2BHW_BC. The docking scoring function integrates van der Waals energy, electrostatic energy, desolvation energy, and buried surface area, among other factors.

[0167] ;

[0168] where the terms represent the van der Waals interaction energy, the electrostatic energy, the desolvation energy, and the buried surface area, each with its corresponding energy weighting coefficient. The 200 lowest-energy binding poses are selected as the candidate decoys for 2BHW_BC.

[0169] Step 6.2.2) HADDOCK process:

[0170] HADDOCK employs a staged docking strategy with progressive flexible refinement. First, a rigid-body sampling stage generates 100 candidate structures for 2BHW_BC. Next, a flexible refinement stage is run; because of its large computational load and long processing time, only the top 40 candidate structures are optimized. Finally, in the energy refinement stage, energy minimization is performed on these 40 models. The total energy function is defined as:

[0171] ;

[0172] where the additional term represents the ambiguous restraint energy derived from experiments or predictions, and the remaining terms are the same as those defined for HDOCK. Through multiple rounds of screening and flexible optimization, 40 decoy models with high interface complementarity are obtained.

[0173] Step 6.2.3) DiffDock-PP process:

[0174] DiffDock-PP employs a deep learning docking strategy based on a diffusion model. For 2BHW_BC, NUM_FOLDS=3, SEED=0, and NUM_SAMPLES=60 are set, meaning that 60 structures are sampled under each of three different random initializations, for a total of 180 decoy structures. DiffDock-PP performs the reverse diffusion process in rigid-body pose space to generate the complex structure:

[0175] ;

[0176] where the terms denote, in turn, the rotation and translation state at a given step, the step size parameter, the score function predicted by the neural network (used to learn the conditional gradient distribution of the complex pose), and Gaussian noise. Through iterative generation and denoising, DiffDock-PP samples 180 diverse decoy structures for 2BHW_BC on a physically reasonable energy surface.

[0177] Step 7) Perform balanced sampling on the generated decoy models. The process is as follows:

[0178] Step 7.1) Align each decoy model of 2BHW_BC with its corresponding native assembly using USAlign, and use OpenStructure to calculate and retain three score metrics: global folding accuracy (TM-score), local atomic precision (lDDT), and interface quality (DockQ_wave). The three scores are uniformly normalized to the [0,1] interval.

[0179] Step 7.2) Divide the [0, 1] range of TM-score, lDDT, and DockQ_wave into ten equal-width score intervals: [0.0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]. Record the generation source of each 2BHW_BC decoy as its source label, for example the docking class (HDOCK, HADDOCK3, DiffDock-PP) or the sampling class (AF3). For each score interval and source label of each metric, perform fixed-size stratified balanced sampling with a target sample size of 15. If an interval holds fewer than 15 samples, retain all of them first and then replenish from adjacent score intervals, prioritizing in order the same source label, the nearest interval, and the energy or confidence quantile. For intervals with enough samples, perform secondary balancing according to the energy or model-confidence quantile.

[0180] Step 7.3) Finally, three sets of parallel balanced subsets are formed, corresponding to TM-score, lDDT and DockQ_wave respectively. The final 2BHW_BC decoy models are uniformly distributed from low to high in the three dimensions of global folding, local atomic level and interface quality.

[0181] The above describes the results demonstrated by an example of this invention, which shows good performance in removing redundancy from complex datasets and generating decoy models. Clearly, this invention is not only suitable for the above embodiments, but can also be implemented with various modifications without departing from the basic spirit and scope of this invention.

Claims

1. A method for constructing a complex decoy dataset based on collaborative structure and sequence redundancy removal, characterized in that: first, an initial set of protein complex structures is screened, removing entries containing nucleic acids, small molecules, or non-protein chains, and selecting binary complexes that meet the requirements for completeness and resolution; second, structural clustering and sequence clustering are performed based on three-dimensional structural similarity and sequence homology, respectively, and the two results are jointly compared to remove redundant complex entries that are highly similar in both structure and sequence; subsequently, with the representative complex of each cluster as the target, multiple decoy models are generated using molecular docking or predictive modeling methods, and their TM-score, lDDT, and DockQ_wave quality indices are calculated; finally, stratified sampling and proportional balancing are performed based on the score ranges of the three indices to construct a high-quality protein complex decoy dataset with collaborative structure and sequence redundancy removal and a balanced quality distribution.

2. The method for constructing a complex decoy dataset based on structure and sequence collaborative redundancy removal as described in claim 1, characterized in that the method includes the following steps: Step 1) Complex screening: extract all binary protein complexes from the RCSB PDB NextGen database, and retain only structures composed of protein chains with a structural resolution of less than or equal to 4 Å and at least 8 interface residues; Step 2) Structural clustering: construct a similarity graph based on the three-dimensional structural similarity of whole chains and the overlap of interface residues, and perform chain-level clustering with the label propagation algorithm to eliminate structural redundancy; Step 3) Data partitioning: based on the clustering results, divide the complex structures into a training set, a validation set and a test set, and eliminate data leakage through cluster-level independence constraints; Step 4) Sequence clustering: use MMseqs2 to cluster under a set sequence identity threshold, and combine the result with the structural clustering to perform two-level redundancy removal, achieving collaborative redundancy removal of structure and sequence; Step 5) Target screening: perform multidimensional quality evaluation on the de-redundant complex clusters, and preferentially select representative complexes with high resolution, complete interfaces and diverse functions as target structures; Step 6) Decoy generation: for each target complex, generate multiple structurally diverse decoy models using deep learning prediction and molecular docking methods; Step 7) Balanced sampling: based on the scores of the three metrics, divide the range [0,1] into equal-width bins and perform quota sampling within each source label to achieve balanced coverage of the high, medium and low quality ranges and obtain a low-redundancy decoy set.

3. The method for constructing a complex decoy dataset based on structure and sequence collaborative redundancy removal as described in claim 2, characterized in that the process of step 2) is as follows: Step 2.1) Extract the set of interface residues for each complex. The interface is defined as the set of residues with any inter-chain atomic distance of less than 8 Å. If a chain has fewer than 8 interface residues, the contact is considered incidental or the complex unstable, and the entry is removed. Step 2.2) Implement efficient chain-level structure alignment based on the Foldseek algorithm framework. By extracting the matching intervals and residue pair information output by Foldseek, calculate the structural similarity score between all chains; Step 2.3) Construct a graph from all chain alignment results. Each node represents a protein chain, and each edge represents the structural similarity between two chains. Each edge records the structural similarity score between the two chains and the overlap ratio of interface residues. To form a high-quality alignment graph, edges that meet the following conditions are retained: a structural similarity score ≥ the first threshold, indicating high alignment quality; and an overlap ratio ≥ the second threshold, meaning that at least half of the interface area participates in the match; Step 2.4) Perform unsupervised clustering on the high-quality alignment graph using the asynchronous label propagation algorithm to obtain the structural cluster number of each chain. Labels propagate only between chains that satisfy both the structural similarity condition and the interface overlap condition; Step 2.5) For each PPI complex, the combination of the structural cluster numbers to which its two chains belong is defined as the structural cluster ID of the PPI. This cluster ID is the unique identifier of the complex at the structural redundancy removal level. If two chain pairs fall into the same cluster combination, they are considered redundant at the structural level. Finally, each PPI complex is mapped to a unique chain-pair structural cluster identifier in this manner, resulting in a second number of PPI structural clusters.

4. The method for constructing a complex decoy dataset based on structure and sequence collaborative redundancy removal as described in claim 3, characterized in that the process of step 3) is as follows: Step 3.1) Set aside a third number of the PPI structural clusters obtained in Step 2.5) for validation and testing. All PPI complexes that meet the structural quality criteria for validation and testing are initially marked as the high-quality candidate set. The criteria are that the number of interface residues is not less than a threshold and that structural completeness and resolution reach preset values. Step 3.2) Split all PPI complexes in the high-quality candidate set and the training set into individual chains, and then construct a structural similarity graph and a sequence similarity graph. An edge is added to the structural similarity graph when the lDDT of two chains is ≥0.6; an edge is added to the sequence similarity graph when the sequence identity is ≥ a preset percentage. For each complex system s, the neighborhoods of its two chains a and b are expanded to depth D=2 in both the structural similarity graph and the sequence similarity graph, defining a system-level neighborhood set. When the number of neighbors falls within the preset range, the structure is marked as a high-quality candidate structure, forming the high-quality candidate set. Step 3.3) The high-quality candidate set is sorted according to the following criteria: ① heterodimers are given priority; ② higher-resolution structures are given priority; ③ more recently published structures are given priority. The sorted high-quality candidate set is then divided by structural cluster. Step 3.4) Finally, in the partitioned high-quality candidate set, the highest-priority PPI complex is selected from each cluster to obtain a fourth number of PPI complexes. These are randomly divided into test and validation sets, and a leakage removal operation is performed on the training set: every training-set complex that is a neighbor of a test or validation complex in either the structural or the sequence similarity graph is removed from the training set.

5. The method for constructing a complex decoy dataset based on structure and sequence collaborative redundancy removal as described in claim 2, characterized in that the process of step 4) is as follows: Step 4.1) Based on the UniRef_IDs of the two chains of each PPI complex, extract the UniRef_IDs of all PPI complexes in the training set, giving a fifth number of protein-chain UniRef_IDs. Then download the FASTA sequence file for each chain and use the MMseqs2 ultrafast protein sequence alignment tool to align all chains pairwise. Using sequence identity ≥40% as the clustering condition, construct the chain-level sequence cluster set. Each protein chain is assigned to exactly one sequence cluster, so every chain receives a cluster label. Step 4.2) For each PPI complex in the training set, find the sequence cluster to which each of its two chains belongs, and then define the sequence cluster pair label of the PPI complex; Step 4.3) Consider the structural cluster labels and the sequence cluster labels together to identify redundant pairs: for any two PPI systems, when both the structural cluster labels and the sequence cluster labels are the same, the two systems are considered redundant. From a sixth number of structurally de-redundant PPI clusters, a seventh number of PPI clusters that are strictly de-redundant in both sequence and structure are generated.

6. The method for constructing a complex decoy dataset based on structure and sequence collaborative redundancy removal as described in claim 2, characterized in that the process of step 5) is as follows: Step 5.1) First, multiple biophysical and structural features are used to quality-filter each PPI structure in the training set, retaining complex structures that meet the following conditions: (1) the number of atom types in both chains is greater than 3; (2) the source is a biological assembly; (3) the biological assembly is determined by manual annotation or by a combination of manual annotation and software inference; (4) the number of residues in each chain is >40; (5) the total resolved length of the two chains of the complex is less than 1200; (6) the X-ray resolution is less than or equal to 4.0 Å; (7) the buried surface area is greater than 100 Å²; (8) the number of interfacial atomic contacts is greater than 5; (9) the effective number of sequences Neff of both chains is greater than 10. Only structures that satisfy all of the above conditions are retained, finally giving an eighth number of clusters; Step 5.2) For each sequence-structure cluster, select the most representative structure as the representative target of the cluster: define a priority scoring function to comprehensively measure the priority of each structure, and obtain the eighth number of representative target structures.

7. The method for constructing a complex decoy dataset based on structure and sequence collaborative redundancy removal as described in claim 2, characterized in that the process of step 6) is as follows: Step 6.1) End-to-end method, as follows: Step 6.1.1) MassiveFold process: by perturbing the hyperparameters of AlphaFold2, a ninth number of candidate complex structures are generated. One forward inference is performed for each perturbed parameter combination. Changing the random sampling distribution of the hyperparameters yields structurally different complex predictions for the same input sequences, finally giving the ninth number of decoy structures. Step 6.1.2) AlphaFold 3 diffusion generation process: first, Gaussian noise is added step by step to the coordinates of the target complex to form a sequence of noised states; then, the reverse diffusion process is learned through a neural network. Through multiple reverse sampling iterations, high-quality complex structures are gradually generated from random noise, finally giving the ninth number of decoy structures. Step 6.2) Docking method, as follows: Step 6.2.1) HDOCK process: HDOCK employs a rigid-body docking method based on the Fast Fourier Transform, which automatically generates a tenth number of decoy models for each target structure. The docking scoring function integrates van der Waals energy, electrostatic energy, desolvation energy, and buried surface area. The tenth number of lowest-energy docking poses are selected as the HDOCK candidate decoys. Step 6.2.2) HADDOCK process: HADDOCK employs a staged docking strategy with progressive flexible refinement. First, a rigid-body sampling stage generates the ninth number of candidate structures. Second, a flexible refinement stage is run, in which only the top eleventh number of candidate structures are flexibly optimized. Finally, in the energy refinement stage, energy minimization is performed on the eleventh number of models. Through multiple rounds of screening and flexible optimization, the eleventh number of decoy structures with high interface complementarity are obtained. Step 6.2.3) DiffDock-PP process: DiffDock-PP employs a deep learning docking strategy based on a diffusion model. For each PPI complex, NUM_FOLDS, SEED, and NUM_SAMPLES are set, meaning that a twelfth number of decoy structures are sampled for each target under each of three different random initializations, for a total of a thirteenth number of decoy structures. DiffDock-PP performs the reverse diffusion process in rigid-body pose space to generate the complex structure. Through stepwise iterative generation and denoising, DiffDock-PP samples the thirteenth number of diverse decoy structures for each PPI complex on a physically reasonable energy surface.

8. The method for constructing a complex decoy dataset based on structure and sequence collaborative redundancy removal as described in claim 2, characterized in that the process of step 7) is as follows: Step 7.1) Align each decoy model with its corresponding native assembly using USAlign, and use OpenStructure to calculate and retain three score metrics: global folding accuracy (TM-score), local atomic precision (lDDT), and interface quality (DockQ_wave). The three scores are uniformly normalized to the [0,1] interval. Step 7.2) Divide the [0,1] range of TM-score, lDDT and DockQ_wave into ten equal-width score intervals: [0.0,0.1), [0.1,0.2), ..., [0.9,1.0]. Record the generation source of each decoy as its source label. For each score interval and each source label of each metric, perform fixed-size stratified balanced sampling with a target sample size of a fourteenth number. If an interval holds fewer samples than the fourteenth number, first retain all of them and then replenish from adjacent score intervals, prioritizing in order the same source label, the nearest interval, and the energy or confidence quantile. For intervals with enough samples, perform secondary balancing according to the energy or model-confidence quantile. Step 7.3) Finally, three sets of parallel balanced subsets are formed, corresponding to TM-score, lDDT and DockQ_wave respectively. This design achieves uniform coverage from low to high in three dimensions: global folding, local atomic level and interface quality.

9. The method for constructing a complex decoy dataset based on structure and sequence collaborative redundancy removal as described in claim 2, characterized in that the process of step 1) is as follows: Step 1.1) Obtain all protein structure entries from the RCSB PDB NextGen database and download the corresponding mmCIF files; then perform preliminary quality screening to remove entries containing nucleic acids, small-molecule ligands, or non-protein polymers, retaining only structures with a resolution of less than or equal to 4 Å and at least 8 interface residues. Step 1.2) For each mmCIF structure file, construct all biological assemblies according to the assembly annotations provided by PDB; if there are multiple assemblies, the main recommended assembly is selected first; annotate all protein chains in each assembly, including chain ID, UniRef_ID, entity source, polymer type, and whether the chain is a repeating chain, and remove structural fragments of homologous repeating chains and abnormally short chains. Step 1.3) Traverse all protein chain pair combinations in each biological assembly and screen out binary chain pairs with real physical contact; if the distance between main-chain (backbone) atoms of the two chains is less than 8 Å, the pair is defined as a protein-protein interaction (PPI) complex with interfacial contact, and finally the first number of PPI chain pairs are obtained; for each screened PPI chain pair, record its PDB ID, chain pair ID, whether it is a homologous chain pair, number of contact residues, contact area, and other structural geometric features. Step 1.4) To distinguish whether the interfaces observed in the protein complex are biologically relevant or crystal contacts, PRODIGY-Cryst is used to discriminate and annotate the interface of each PPI chain pair.
Simultaneously, to quantify the evolutionary information content of each chain, an MSA is constructed for each of the two chains in the binary complex and the effective sequence number Neff of each chain is calculated: each sequence in the MSA sequence set is assigned a weight, and chain1_neff and chain2_neff are annotated for each binary complex.
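Two of the checks in this claim lend themselves to a short sketch: the 8 Å contact test of Step 1.3 and the weighted Neff of Step 1.4. The contact test follows the claim directly. The Neff part is an assumption: the claim's weighting formula is not reproduced in this text, so the sketch uses one common convention in which each sequence is weighted by the inverse of the number of MSA sequences at or above an identity threshold to it, and Neff is the sum of the weights. The 0.8 threshold and all function names are hypothetical.

```python
import math


def chains_in_contact(backbone_a, backbone_b, cutoff=8.0):
    """True if any backbone atom pair across the two chains is closer than cutoff (Å)."""
    return any(math.dist(p, q) < cutoff for p in backbone_a for q in backbone_b)


def sequence_identity(a, b):
    """Fraction of aligned columns where both rows carry the same non-gap residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)


def neff(msa, identity_threshold=0.8):
    """Sum of inverse-neighbor-count weights over the aligned sequences in msa.

    Assumed convention, not taken from the claim: w_i = 1 / n_i, where n_i counts
    the sequences (including sequence i itself) whose identity to sequence i is at
    least identity_threshold.
    """
    total = 0.0
    for i, s in enumerate(msa):
        neighbors = sum(
            1
            for j, t in enumerate(msa)
            if i == j or sequence_identity(s, t) >= identity_threshold
        )
        total += 1.0 / neighbors
    return total


# Usage: two identical rows share one unit of weight, the distinct row adds another.
msa_chain1 = ["ACDE", "ACDE", "KLMN"]
chain1_neff = neff(msa_chain1)
```

Under this convention a set of near-identical homologs contributes roughly one effective sequence, so chain1_neff and chain2_neff grow with the diversity of the alignment rather than its raw depth. Some tools additionally normalize by alignment length, which this sketch does not do.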