Protein interaction prediction method and system based on multi-source biological networks and null hypothesis testing

By constructing a three-layer enhanced joint network and null hypothesis testing, the problem of lack of significance assessment in protein interaction prediction was solved, enabling high-confidence target screening and experimental guidance.

CN122245399A · Pending · Publication Date: 2026-06-19 · SOUTH CHINA UNIV OF TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: SOUTH CHINA UNIV OF TECH
Filing Date: 2026-03-18
Publication Date: 2026-06-19

IPC: G16B15/20; G16B40/00; G16B50/30

AI Tagging

Application Domain

Biostatistics Instruments

AI Technical Summary

Technical Problem

Existing protein-protein interaction prediction methods cannot provide individual statistical significance assessments, making it difficult for biologists to objectively select targets for experimental validation, and the model prediction results lack interpretability.

Method used

A three-layer augmented joint network was constructed, and topological features and embedding vectors were calculated using data from the STRING and BioGRID databases. The model was trained using a random forest classifier, and high-confidence protein interactions were screened by calculating P-values and Z-scores through null hypothesis testing.

Benefits of technology

It provides reliable statistical standards, improves the credibility and interpretability of prediction results, and helps biologists efficiently screen targets for experimental verification.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses a protein interaction prediction method and system based on multi-source biological networks and null hypothesis testing, belonging to the interdisciplinary field of bioinformatics and artificial intelligence. It includes the following steps: Step S1, integrating protein interaction data from the STRING and BioGRID databases to construct a three-layer enhanced joint network containing different confidence levels. By introducing a null hypothesis testing framework, this invention calculates a P-value and Z-score for each predicted protein interaction pair, overcoming the limitation of traditional methods that only output fuzzy probability scores. This allows biologists to screen high-confidence candidate targets for experimental verification based on clear statistical standards, thus establishing a quantitative and reliable decision-making bridge between computational prediction and wet-lab experimentation.

Description

Technical Field

[0001] This invention relates to the field of bioinformatics and artificial intelligence, specifically to a method and system for predicting protein interactions based on multi-source biological networks and null hypothesis testing.

Background Technology

[0002] Protein-protein interaction (PPI) prediction is an important research area in bioinformatics, and it is of great significance for understanding life processes and discovering drug targets. Currently, prediction methods in this field mainly rely on obtaining data from public databases such as STRING and BioGRID, and then using computational models for analysis.

[0003] Existing PPI prediction techniques, such as those based on network topology similarity, feature engineering and machine learning classifiers, and deep learning methods like graph neural networks, typically construct the prediction task as a binary classification or link prediction problem. These techniques generally output a probability score or confidence level representing the likelihood of an interaction.

[0004] However, methods based on network topology similarity infer potential connections by calculating metrics such as common neighbors. While these methods are computationally simple, their prediction accuracy is limited, especially in capturing deep associations in complex biological networks. Machine learning methods based on feature engineering train classifiers such as support vector machines and random forests using hand-constructed sequence, structural, or network features. The performance of these methods depends on the quality of feature design, and the models are prone to learning inherent biases in the dataset rather than true biological patterns.

[0005] Deep learning methods based on graph neural networks can automatically learn node representations and achieve superior performance on multiple benchmarks. However, their model decision-making process is like a "black box," and the prediction results lack interpretability, leading to limited trust from biologists.

[0006] Despite the continuous evolution of these methods, a long-standing common problem remains unresolved: they typically output only a probability score or confidence level and do not provide a measure of statistical significance for each individual prediction. This makes it difficult for researchers to objectively assess whether a particular prediction reflects a potential biological association or is due to random noise or model bias. In practical applications, when faced with a large number of predictions, biologists lack a reliable statistical standard for prioritizing the targets most worth validating experimentally, which significantly reduces the efficiency and translational value of computational predictions in guiding experiments.

[0007] Therefore, we propose a protein interaction prediction method and system based on multi-source biological networks and null hypothesis testing to alleviate or solve the above problems.

[0008] The information disclosed above in this background section is only intended to enhance understanding of the background of this invention, and it may therefore include information that does not constitute prior art known to those skilled in the art.

Summary of the Invention

[0009] To address the aforementioned technical problems, this invention provides a protein interaction prediction method and system based on multi-source biological networks and null hypothesis testing, thereby resolving issues such as the lack of individual statistical significance assessment in the prediction results and the inability to provide biologists with reliable quantitative screening basis in the prior art.

[0010] To achieve the above objectives, this invention provides a protein interaction prediction method based on multi-source biological networks and null hypothesis testing, comprising the following steps:

[0011] Step S1: Load protein interaction data with a comprehensive score ≥700 from the STRING database and human protein interaction data validated by low-throughput experiments from the BioGRID database in blocks;

[0012] The protein IDs in the STRING database are mapped to standard gene symbols and merged with the data in the BioGRID database, with duplicates removed, to construct a three-layer enhanced joint network. The three-layer enhanced joint network includes a core layer, a first extended layer, and a second extended layer. The core layer consists of the interaction data shared by STRING and BioGRID, the first extended layer consists of interaction data unique to BioGRID, and the second extended layer consists of high-scoring interaction data unique to STRING.

[0013] The three-layer enhanced joint network is serialized and stored in NetworkX graph object format;

[0014] Step S2: Calculate four types of topological features for each protein node in the three-layer enhanced joint network. The four types of topological features include degree centrality, clustering coefficient, approximate betweenness centrality, and PageRank value.

[0015] The four types of topological features are concatenated with a 124-dimensional randomly initialized embedding vector to generate a 128-dimensional node feature vector, and the 128-dimensional node feature vector is then L2 normalized.

[0016] Serialize and store the network with features;

[0017] Step S3: Sample all real edges from the network with features as positive samples, and randomly sample an equal number of non-edges as negative samples;

[0018] Five edge features are extracted for each sample edge: the number of common neighbors, Jaccard similarity, preferential attachment index, node feature cosine similarity, and node degree difference.

[0019] The samples are divided into training and test sets at an 8:2 ratio. A random forest classifier with an early stopping mechanism is trained on the five-dimensional edge features to determine the optimal number of decision trees, and the trained model is saved.

[0020] Step S4: Randomly sample a preset number of unconnected node pairs from the three-layer enhanced joint network, use the trained model to predict the score, and construct the null hypothesis distribution;

[0021] For any pair of nodes to be predicted, obtain the model prediction score S, and calculate the P value and Z score. The P value is the ratio of the number of samples ≥ S in the null hypothesis distribution to the total number of samples in the null hypothesis distribution, and the Z score is (S - mean of the null hypothesis distribution) / standard deviation of the null hypothesis distribution.

[0022] Step S5: Based on the trained model and null hypothesis distribution, predict non-edge or specified node pairs in the entire network, and screen high-confidence potential protein interactions according to the prediction score, the significance threshold of P value < 0.05, and the Z score.
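The two statistics defined in step S4 can be written compactly. The symbols below are notation introduced here for illustration and do not appear in the patent text: D_0 is the set of null-distribution scores, and \mu_0 and \sigma_0 are its mean and standard deviation.

```latex
P(S) = \frac{\left|\{\, s_i \in D_0 : s_i \ge S \,\}\right|}{\left|D_0\right|},
\qquad
Z(S) = \frac{S - \mu_0}{\sigma_0}
```

Under step S5, a pair is then reported as significant when P(S) < 0.05.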

[0023] A protein interaction prediction system, comprising:

[0024] The network construction module is used to perform step S1 and construct the three-layer enhanced joint network;

[0025] The feature engineering module is used to perform step S2 and generate node feature vectors;

[0026] The model training module is used to execute step S3 and train the random forest prediction model.

[0027] The null hypothesis testing module is used to perform step S4, construct the null hypothesis distribution, and perform the statistic calculation in step S5;

[0028] The prediction output module is used to perform the prediction and result filtering output in step S5.

[0029] Compared with the prior art, the beneficial effects of the present invention are:

[0030] This invention introduces a null hypothesis testing framework to calculate a P-value and Z-score for each predicted protein-protein interaction pair. This overcomes the limitation of traditional methods that only output fuzzy probability scores, allowing biologists to screen high-confidence candidate targets for experimental verification based on clear statistical criteria. A quantitative and reliable decision-making bridge is thus established between computational prediction and wet-lab experimentation.

[0031] This invention integrates high-reliability experimental data with high-throughput prediction data in a hierarchical manner. The core layer ensures the high reliability of the basic data, while the extension layer greatly expands the coverage and information density of the network, providing a more comprehensive knowledge base for subsequent machine learning modeling.

[0032] This invention constructs a multidimensional feature for each protein node that integrates classical network topology attributes with randomly initialized latent semantic embedding vectors. This enables the model to utilize both explicit structural information and capture latent semantic associations, thereby learning the complex patterns of protein interactions more accurately and improving the model's expressive power.

[0033] The above overview is for illustrative purposes only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the invention will become readily apparent from the accompanying drawings and the following detailed description.

Description of the Drawings

[0034] Figure 1 is a flowchart of the protein interaction prediction system of the present invention.

[0035] Figure 2 is a flowchart of the protein interaction prediction method based on multi-source biological networks and null hypothesis testing according to the present invention.

[0036] Figure 3 is a comparison chart of the ROC curve and Precision-Recall (PR) curve showing the model performance of this invention.

[0037] Figure 4 is a schematic diagram of the three-layer enhanced joint network structure of the present invention.

Detailed Implementation

[0038] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. It should be noted that the drawings are schematic and not illustrated to scale. For clarity and convenience, the relative sizes and proportions of the parts shown in the drawings have been exaggerated or reduced in size. Any size is only illustrative and not limiting.

[0039] Example 1

[0040] As shown in Figure 1, the protein interaction prediction system includes:

[0041] The network construction module is used to construct the three-layer enhanced joint network;

[0042] The feature engineering module is used to generate node feature vectors;

[0043] The model training module is used to train the random forest prediction model.

[0044] The null hypothesis testing module is used to construct the null hypothesis distribution and perform the statistic calculation in step S5;

[0045] The prediction output module is used to perform predictions and result filtering output.

[0046] Example 2

[0047] As shown in Figure 2, the protein interaction prediction method based on multi-source biological networks and null hypothesis testing includes the following steps:

[0048] Step S1: Construction of Multi-Source PPI Network

[0049] Input the STRING database file 9606.protein.links.full.v12.0.txt and filter interaction data with a combined score (combined_score) ≥ 700. This score threshold is based on the confidence level of the STRING database; a score ≥ 700 indicates a high-confidence interaction association between proteins, effectively filtering false positive data introduced by text mining or homology inference.

[0050] Input the BioGRID database file, retaining only human interaction data validated by low-throughput experiments. While low-throughput data has lower coverage than high-throughput data, it offers higher validation accuracy and less non-specific binding noise, making it suitable as the foundation for a high-confidence data layer. Specifically, download BIOGRID-ALL-5.0.250.tab3.txt from the BioGRID database, extract the records meeting the following criteria: Throughput = 'Low Throughput', Organism Name Interactor A = 'Homo sapiens', and Organism Name Interactor B = 'Homo sapiens', and save them to a new text file, "human_low_throughput_biogrid.txt", for later use.
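The filter in the preceding paragraph can be sketched with the standard library. This is a minimal sketch, not the patent's implementation: the function name `low_throughput_human_pairs` and the toy rows are hypothetical, and the `Official Symbol Interactor A/B` column names are assumed from the BioGRID tab3 layout rather than stated in the text.

```python
import csv
import io


def low_throughput_human_pairs(handle):
    """Yield (symbol_a, symbol_b) for low-throughput human-human records."""
    # QUOTE_NONE: treat quote characters as ordinary text in the tab-delimited file.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if (
            row["Throughput"] == "Low Throughput"
            and row["Organism Name Interactor A"] == "Homo sapiens"
            and row["Organism Name Interactor B"] == "Homo sapiens"
        ):
            yield row["Official Symbol Interactor A"], row["Official Symbol Interactor B"]


# Toy input with made-up gene symbols (not real BioGRID records).
toy = io.StringIO(
    "Official Symbol Interactor A\tOfficial Symbol Interactor B\tThroughput\t"
    "Organism Name Interactor A\tOrganism Name Interactor B\n"
    "GENE1\tGENE2\tLow Throughput\tHomo sapiens\tHomo sapiens\n"
    "GENE1\tGENE3\tHigh Throughput\tHomo sapiens\tHomo sapiens\n"
    "GENE4\tGENE5\tLow Throughput\tHomo sapiens\tMus musculus\n"
)
pairs = list(low_throughput_human_pairs(toy))
```

For the real file, the same function could be given an open file handle in place of the `io.StringIO` object.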

[0051] Using the 9606.protein.info.v12.0.txt mapping file provided by the STRING database, protein IDs in the STRING data are uniformly converted into standard gene symbols.

[0052] During the mapping process, protein IDs that could not match the standard gene symbol were skipped to avoid interference from non-standardized IDs in subsequent network construction. Final statistics showed a mapping efficiency of 98.7%, with only 1.3% of proteins filtered out due to their IDs not being included in the database. Of the filtered 1.3%, 62% were from non-human species, and 38% were novel proteins with standard gene symbols not included in the STRING database. This filtering ensured species consistency and ID standardization among nodes in the network.
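The score filter and ID mapping described above might look like the following. It is a sketch under stated assumptions: the function name `filter_and_map_string`, the toy protein IDs, and the in-memory `id_to_symbol` dictionary are hypothetical, and the links file is assumed to be whitespace-delimited with `protein1`, `protein2`, and `combined_score` header fields.

```python
def filter_and_map_string(lines, id_to_symbol, min_score=700):
    """Keep edges with combined_score >= min_score and map IDs to gene symbols."""
    lines = iter(lines)
    header = next(lines).split()
    col_a = header.index("protein1")
    col_b = header.index("protein2")
    col_score = header.index("combined_score")

    edges, skipped = set(), 0
    for line in lines:
        fields = line.split()
        if int(fields[col_score]) < min_score:
            continue
        a = id_to_symbol.get(fields[col_a])
        b = id_to_symbol.get(fields[col_b])
        if a is None or b is None:
            skipped += 1  # unmapped IDs are skipped, as the text describes
            continue
        if a != b:
            edges.add(tuple(sorted((a, b))))  # undirected, deduplicated
    return edges, skipped


# Toy data with made-up IDs and symbols.
string_lines = [
    "protein1 protein2 combined_score",
    "9606.X1 9606.X2 950",
    "9606.X2 9606.X1 950",
    "9606.X1 9606.X3 400",
    "9606.X2 9606.X4 800",
]
id_to_symbol = {"9606.X1": "GENE1", "9606.X2": "GENE2", "9606.X3": "GENE3"}
edges, skipped = filter_and_map_string(string_lines, id_to_symbol)
```

Because the function consumes any iterable of lines, a large file can be streamed through it without loading everything into memory.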

[0053] The mapped STRING data and BioGRID data are merged and deduplicated to construct an undirected, unweighted, simple graph, where nodes represent proteins and edges represent trusted interactions.

[0054] The output file union_network_genes.pkl is serialized and stored in NetworkX graph object format. It contains 18,275 nodes and 336,777 edges and serves as the basic network skeleton for subsequent feature engineering and prediction. A schematic diagram of the three-layer enhanced joint network structure is shown in Figure 4: 28,652 edges belong to the core layer (interactions present in both STRING and BioGRID), 99,847 edges to the first extension layer (interactions unique to BioGRID), and 208,278 edges to the second extension layer (high-scoring interactions unique to STRING). Different grayscale regions distinguish the three layers.
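The layer assignment reduces to set operations on the two edge lists. The sketch below assumes both sources have already been reduced to pairs of gene symbols; the function names and the toy edges are hypothetical.

```python
def normalize_edge(a, b):
    """Return an order-independent key for an undirected edge."""
    return tuple(sorted((a, b)))


def build_three_layer_network(string_edges, biogrid_edges):
    """Split the union of two edge lists into core and extension layers."""
    # Self-loops are dropped because the text describes a simple graph.
    string_set = {normalize_edge(a, b) for a, b in string_edges if a != b}
    biogrid_set = {normalize_edge(a, b) for a, b in biogrid_edges if a != b}
    return {
        "core": string_set & biogrid_set,         # present in both sources
        "extension_1": biogrid_set - string_set,  # unique to BioGRID
        "extension_2": string_set - biogrid_set,  # unique to STRING
    }


# Toy edges with placeholder protein names.
string_edges = [("P1", "P2"), ("P2", "P1"), ("P3", "P4"), ("P5", "P5")]
biogrid_edges = [("P2", "P1"), ("P1", "P6")]
layers = build_three_layer_network(string_edges, biogrid_edges)
union_edges = set().union(*layers.values())
```

The three layers are disjoint by construction, so their sizes add up to the edge count of the merged network, which matches the reported figures (28,652 + 99,847 + 208,278 = 336,777).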

[0055] Step S2, Node Feature Engineering

[0056] Calculate four core topological features: degree centrality, clustering coefficient, approximate betweenness centrality, and PageRank value.

[0057] The approximate betweenness centrality calculation employs a sampling acceleration strategy, selecting 100 source nodes for path sampling instead of traversing all nodes. This reduces computation time from hours to minutes while maintaining accuracy. The PageRank computation uses a maximum of 50 iterations and a convergence tolerance of 1×10⁻⁶ to balance computational efficiency with result stability.

[0058] A 124-dimensional randomly initialized dense vector is generated and L2-normalized to serve as a latent semantic embedding feature of the node, which is intended to capture protein functional associations that cannot be directly reflected by the network topology.

[0059] The 4-dimensional topological features are concatenated with the 124-dimensional embedding features to form a 128-dimensional node feature vector, and then L2 normalization is performed again to eliminate the difference in scale between different feature dimensions and ensure that the weights of each feature are balanced during model training.

[0060] The output file union_network_genes_with_features.pkl is also a NetworkX graph object, with a new field called features added to the attribute dictionary of each node. Its value is a NumPy array of length 128.

[0061] Three nodes (CCNA2, CDK4, and EXO1) were randomly selected for feature verification. The results showed that the feature dimensions were correct and the topological feature values were consistent with the characteristics of biological networks (CCNA2 has a degree centrality of 254, which reflects its core position in the cell cycle regulation network).
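Step S2 can be approximated with NetworkX and NumPy as follows. This is a sketch, not the patent's code: the function name `add_node_features`, the seed, the use of a standard normal draw for the embedding, and the toy random graph are assumptions, and the PageRank tolerance is exposed as a parameter.

```python
import networkx as nx
import numpy as np


def add_node_features(graph, embed_dim=124, k_sources=100, pagerank_tol=1e-6, seed=42):
    """Attach a 128-dimensional, L2-normalized 'features' array to every node."""
    rng = np.random.default_rng(seed)
    k = min(k_sources, graph.number_of_nodes())  # k cannot exceed the node count

    degree = nx.degree_centrality(graph)
    clustering = nx.clustering(graph)
    # Passing k samples k source nodes instead of traversing all of them.
    betweenness = nx.betweenness_centrality(graph, k=k, seed=seed)
    pagerank = nx.pagerank(graph, max_iter=50, tol=pagerank_tol)

    for node in graph.nodes:
        embedding = rng.standard_normal(embed_dim)
        embedding /= np.linalg.norm(embedding)  # first L2 normalization
        topo = np.array(
            [degree[node], clustering[node], betweenness[node], pagerank[node]]
        )
        features = np.concatenate([topo, embedding])
        graph.nodes[node]["features"] = features / np.linalg.norm(features)
    return graph


# Toy graph standing in for the joint network.
toy_graph = add_node_features(nx.erdos_renyi_graph(30, 0.3, seed=1))
example_vector = toy_graph.nodes[0]["features"]
```

Because the embedding is drawn at random, it carries no biological signal on its own; the sketch only reproduces the vector layout described in the text.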

[0062] Step S3: Prediction Model Training

[0063] All 336,777 real edges were extracted from the feature network as positive samples, representing protein pairs with real interactions.

[0064] An equal number of non-edges are generated as negative samples by negative sampling without replacement. 336,777 pairs are randomly selected from all unconnected node pairs in the network to ensure that the number of negative samples is balanced with that of positive samples and to avoid model bias.
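Negative sampling without replacement can be sketched by rejection sampling, which is efficient when the network is sparse. The function name `sample_non_edges`, the seed, and the toy graph are hypothetical; edges are assumed to be stored as sorted tuples.

```python
import random


def sample_non_edges(nodes, edge_set, n_samples, seed=42):
    """Draw n_samples distinct unconnected node pairs."""
    rng = random.Random(seed)
    nodes = list(nodes)
    max_pairs = len(nodes) * (len(nodes) - 1) // 2 - len(edge_set)
    if n_samples > max_pairs:
        raise ValueError("not enough unconnected pairs to sample from")

    sampled = set()  # a set guarantees sampling without replacement
    while len(sampled) < n_samples:
        a, b = rng.sample(nodes, 2)  # two distinct nodes
        pair = tuple(sorted((a, b)))
        if pair not in edge_set:
            sampled.add(pair)
    return sorted(sampled)


# Toy graph: 4 nodes, 2 edges, hence 4 possible non-edges.
toy_nodes = ["A", "B", "C", "D"]
toy_edges = {("A", "B"), ("C", "D")}
negatives = sample_non_edges(toy_nodes, toy_edges, n_samples=3)
```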

[0065] Five edge-level features are extracted for each sample edge: number of common neighbors, Jaccard similarity, preferential attachment index, node feature cosine similarity, and node degree difference.

[0066] During feature extraction, for the special case of isolated nodes, the Jaccard similarity and preferential attachment index are set to 0 to ensure the robustness of feature calculation.
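The five edge features can be computed from neighbor sets and node feature vectors. In this sketch the function name `edge_features` and the toy data are hypothetical, and the degree difference is taken as an absolute value, which the text does not specify.

```python
import math


def edge_features(adjacency, node_features, u, v):
    """Return [common neighbors, Jaccard, preferential attachment, cosine, degree diff]."""
    nu = adjacency.get(u, set())
    nv = adjacency.get(v, set())

    common = len(nu & nv)
    union = len(nu | nv)
    jaccard = common / union if union else 0.0  # 0 when both nodes are isolated
    pref_attach = len(nu) * len(nv)             # 0 when either node is isolated

    fu, fv = node_features[u], node_features[v]
    dot = sum(x * y for x, y in zip(fu, fv))
    norm = math.sqrt(sum(x * x for x in fu)) * math.sqrt(sum(y * y for y in fv))
    cosine = dot / norm if norm else 0.0

    degree_diff = abs(len(nu) - len(nv))  # assumption: absolute difference
    return [common, jaccard, pref_attach, cosine, degree_diff]


# Toy graph with one isolated node "E".
adjacency = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
    "E": set(),
}
node_features = {name: [1.0, 0.0] for name in adjacency}
ab = edge_features(adjacency, node_features, "A", "B")
ae = edge_features(adjacency, node_features, "A", "E")
```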

[0067] Training was performed using a random forest classifier with an early stopping mechanism: the patience value was set to 10, meaning that training would stop when the accuracy on the test set did not improve after increasing the number of decision trees 10 times consecutively, thus avoiding overfitting.

[0068] During training, the number of decision trees was gradually increased from 10 to 130, and the best test accuracy of 94.19% was finally achieved when the number of trees was 30. At this time, the accuracy of the training set was 94.15%, and the difference between the accuracy of the training set and the test set was only 0.04%, which proves that the model has excellent generalization ability.
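One way to realize early stopping over the number of trees is scikit-learn's `warm_start` option, which adds trees to an already fitted forest. This is a sketch, not the patent's code: the function name and defaults are hypothetical, and the step of 10 trees is inferred from the reported numbers (a best score at 30 trees, patience 10, and a final size of 130 are consistent with 10 trees added per round).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def train_with_early_stopping(X_train, y_train, X_eval, y_eval,
                              start=10, step=10, max_trees=500,
                              patience=10, seed=42):
    """Grow a forest in increments and stop when accuracy stops improving."""
    model = RandomForestClassifier(n_estimators=start, warm_start=True,
                                   random_state=seed)
    best_acc, best_n, stale = -1.0, start, 0
    n_trees = start
    while n_trees <= max_trees:
        model.set_params(n_estimators=n_trees)
        model.fit(X_train, y_train)  # warm_start: only the new trees are fitted
        acc = model.score(X_eval, y_eval)
        if acc > best_acc:
            best_acc, best_n, stale = acc, n_trees, 0
        else:
            stale += 1
            if stale >= patience:
                break
        n_trees += step

    # Refit a forest with the selected number of trees.
    final = RandomForestClassifier(n_estimators=best_n, random_state=seed)
    final.fit(X_train, y_train)
    return final, best_n, best_acc
```

The text monitors accuracy on the test set; passing a separate validation split as `X_eval` and `y_eval` would be the more conservative choice, since it keeps the test set untouched for the final evaluation.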

[0069] The output file ppi_model.pkl is a serialized storage of the trained sklearn random forest model object in pickle format. It contains a complete set of decision trees, feature importance, and hyperparameter configuration, which can be used for subsequent prediction and inference.

[0070] Step S4: Constructing the null hypothesis distribution

[0071] Mode 1: Batch prediction of Top-K latent links

[0072] A systematic scan of all unconnected node pairs in the entire network is performed, and a Top-K list of high-confidence candidate interactions is output in descending order of predicted scores, along with p-values and significance markers.

[0073] The scanning process employs a block-based processing strategy, dividing large-scale node pairs into multiple sub-blocks for parallel computation, thereby reducing memory usage and improving prediction efficiency.
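The block-based scan can be sketched with a bounded heap, so that only K candidates are held in memory at any time. This sketch processes blocks sequentially rather than in parallel; the function names, the `score_batch` callable, and the toy scores are hypothetical.

```python
import heapq
from itertools import combinations, islice


def iter_chunks(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def top_k_candidates(nodes, edge_set, score_batch, k=10, chunk_size=1000):
    """Scan all unconnected pairs in blocks and keep the k highest scores."""
    heap = []  # min-heap of (score, pair); the smallest kept score is heap[0]
    non_edges = (p for p in combinations(sorted(nodes), 2) if p not in edge_set)
    for chunk in iter_chunks(non_edges, chunk_size):
        for pair, score in zip(chunk, score_batch(chunk)):
            if len(heap) < k:
                heapq.heappush(heap, (score, pair))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, pair))
    return sorted(heap, reverse=True)  # descending by score


# Toy scorer standing in for the trained model.
toy_scores = {("A", "C"): 0.9, ("A", "D"): 0.1, ("B", "C"): 0.5,
              ("B", "D"): 0.7, ("C", "D"): 0.3}
top2 = top_k_candidates(
    ["A", "B", "C", "D"], {("A", "B")},
    score_batch=lambda chunk: [toy_scores[p] for p in chunk],
    k=2, chunk_size=2,
)
```

Since each block is scored independently, the blocks could be dispatched to worker processes and their partial heaps merged afterwards.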

[0074] Mode 2: Predict specified node pairs

[0075] Users provide one or more protein pairs to be validated line by line through standard input. The system performs a complete prediction and statistical test process for each pair and returns results in real time, including the fields of prediction_score, pvalue, zscore, and significance.

[0076] This mode supports breakpoint continuation. If the program is interrupted during input, it can be restarted to continue processing the remaining node pairs from the breakpoint. It is suitable for targeted validation, pathway completion, or hypothesis-driven research.

[0077] Mode 3: Reconstructing the null hypothesis distribution

[0078] Randomly sample 10,000 pairs of unconnected nodes, call the model to predict their scores, generate a new null hypothesis distribution and save it as null_distribution.pkl.

[0079] During the sampling process, the random seed is set to a fixed value of 42 to ensure that the null hypothesis distribution can be repeatedly constructed to adapt to different network versions or updated statistical benchmarks.
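Given a list of null scores, the P-value and Z-score follow directly from the definitions in step S4. In this sketch the function name `significance` is hypothetical, the population standard deviation is an assumption (the text does not say which estimator is used), and the synthetic beta-distributed scores merely stand in for model predictions on 10,000 unconnected pairs. The output keys reuse the field names listed for Mode 2.

```python
import random
import statistics


def significance(score, null_scores, alpha=0.05):
    """Compare one prediction score against the null distribution."""
    n_total = len(null_scores)
    p_value = sum(1 for s in null_scores if s >= score) / n_total
    mean = statistics.fmean(null_scores)
    std = statistics.pstdev(null_scores)  # assumption: population std
    z_score = (score - mean) / std if std > 0 else float("nan")
    return {
        "prediction_score": score,
        "pvalue": p_value,
        "zscore": z_score,
        "significance": p_value < alpha,
    }


# Synthetic null scores, not real model output.
rng = random.Random(42)
null_scores = [rng.betavariate(1, 9) for _ in range(10_000)]
result = significance(0.95, null_scores)
```

With 10,000 null samples, the smallest nonzero P-value this empirical estimate can report is 0.0001.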

[0080] Mode 4: Evaluate Model Performance

[0081] The overall discriminative ability of the model is quantified by calculating the area under the receiver operating characteristic curve (AUC) and the average precision (AP) on the independently retained test set.

[0082] After the performance evaluation is completed, the ROC curve and PR curve are automatically plotted and saved as model_performance_curves.png to visually demonstrate the model's performance advantages.
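Both metrics are available in scikit-learn. The labels and scores below are toy values chosen for illustration, not results from the patent, and plotting is left out.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy ground truth (1 = interaction) and model scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)           # area under the ROC curve
ap = average_precision_score(y_true, y_score)  # average precision (PR summary)
```

In this toy case, three of the four positive-negative pairs are ranked correctly, which gives an AUC of 0.75.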

[0083] Example 3

[0084] A targeted prediction method for specified protein pairs includes the following steps:

[0085] Users input 7 pairs of specific proteins to be verified through an interactive interface, in the format of "GeneA, GeneB" per line:

[0086] GRB2, TRPV3

[0087] MTOR, G6PD

[0088] STAT3, CDC45

[0089] MTOR, IGF1R

[0090] STAT3, PTPN2

[0091] STAT3, ACE

[0092] GRB2, ALOX5

[0093] The system loads the prepared resources, namely the network with features (union_network_genes_with_features.pkl), the trained model (ppi_model.pkl), and the null hypothesis distribution (null_distribution.pkl).

[0094] The system checks whether all input gene symbols exist in the network. For each input pair, a 128-dimensional feature vector of two nodes is extracted from the network. Based on these two nodes, a 5-dimensional edge feature is calculated in real time. The loaded random forest model is then invoked, and a prediction score indicating the interaction between the protein pair is calculated based on this edge feature.

[0095] For the calculated predicted scores, the system queries and calculates from the preloaded null hypothesis distribution:

[0096] P-value calculation, taking the STAT3-ACE pair (prediction score 0.951) as an example: the number of scores greater than or equal to 0.951 in the null hypothesis distribution is divided by the total sample size of 10,000, resulting in a P-value of 0.0092.

[0097] Z-score calculation: Z = (0.951 - 0.0934) / 0.1826 = 4.699.

[0098] Based on the preset significance level (α=0.05), the significance of the prediction result was determined (P < 0.05, therefore marked as True). The results are shown in Table 1.

[0099] Table 1. Structured Prediction Results

[0100]

| Protein A | Protein B | Prediction score | Edge exists in network | P-value | Z-score | Significant (P < 0.05) |
|---|---|---|---|---|---|---|
| GRB2 | TRPV3 | 0.125 | False | 0.128 | 0.173 | False |
| MTOR | G6PD | 0.802 | False | 0.0245 | 3.879 | True |
| STAT3 | CDC45 | 0.944 | False | 0.0104 | 4.657 | True |
| MTOR | IGF1R | 0.995 | True | 0.0016 | 4.939 | True |
| STAT3 | PTPN2 | 0.993 | True | 0.0022 | 4.926 | True |
| STAT3 | ACE | 0.951 | False | 0.0092 | 4.699 | True |
| GRB2 | ALOX5 | 0.826 | True | 0.0227 | 4.011 | True |

[0101] As shown in Table 1, for known interaction pairs in the network, the model gives extremely high prediction scores (>0.82) and highly significant P-values (<0.023), verifying the model's ability to capture known biological knowledge.

[0102] Pairs that do not exist in the network but have high prediction scores and significant P-values (MTOR-G6PD, STAT3-CDC45, and STAT3-ACE in Table 1) are retained as candidate interactions. By contrast, the GRB2-TRPV3 pair has a low prediction score (0.125) and a non-significant P-value (0.128 > 0.05), so the system marks it as non-significant, effectively filtering out a false positive prediction that may be caused by noise.

[0103] Example 4: Comparison of the performance differences between the model of the present invention and two traditional PPI prediction methods.

[0104] Control group 1: Traditional link prediction method based on common neighbors, which only uses the number of common neighbors in the topology features as the basis for prediction.

[0105] Control group 2: A machine learning method based on support vector machine (SVM) was used for training with the same 5-dimensional edge features as the present invention.

[0106] The same training set, test set, and independent validation set as in Example 2 were used to ensure the fairness of the experiment. Evaluation metrics included AUC, AP, precision, recall, and F1 score, as shown in Table 2.

[0107] Table 2 Comparison of experimental results

[0108]

| Method | AUC | AP | Precision | Recall | F1 score |
|---|---|---|---|---|---|
| Common neighbor method | 0.7215 | 0.7328 | 0.7531 | 0.7124 | 0.7322 |
| SVM model | 0.8942 | 0.9015 | 0.8876 | 0.8753 | 0.8814 |
| Model of this invention | 0.9819 | 0.9852 | 0.9419 | 0.9387 | 0.9403 |

[0109] As shown in Table 2, the AUC and AP of the model in this invention reached 0.9819 and 0.9852, respectively, significantly higher than the control group methods, demonstrating the effectiveness of the technical solution that integrates topological features and embedding features and introduces null hypothesis testing. A comparison of the ROC curve and the Precision-Recall (PR) curve is shown in Figure 3.

[0110] Compared to the SVM model, the model in this invention achieves an improvement of 5.43 percentage points in precision and 6.34 percentage points in recall, demonstrating the advantages of random forest classifiers in handling high-dimensional feature data. The common neighbor method exhibits the lowest performance, indicating that a single topological feature cannot effectively capture the complex interaction patterns between proteins and highlighting the necessity of multi-feature fusion.

[0111] In summary, this invention constructs a three-layer enhanced joint network that integrates high-scoring STRING data and low-throughput BioGRID experimental data. This retains the core high-reliability interactions verified by both sources while incorporating the data unique to each database, ultimately forming a large-scale joint network with 18,275 nodes and 336,777 edges, achieving an overlap rate of 22.30%. This network addresses the shortcomings of single-database data, such as high noise and limited coverage.

[0112] This invention integrates four types of topological features with a 124-dimensional embedding vector to generate a 128-dimensional normalized node feature vector. This captures the structural position of proteins in the network and supplements potential semantic associations, enabling the characterization of complex interaction patterns between proteins. Furthermore, it employs an approximation algorithm and parallel computing to accelerate feature computation, with the entire feature engineering process taking only 41.29 seconds. The effectiveness and stability of the feature vectors are demonstrated through feature validation on nodes such as CCNA2, CDK4, and EXO1.

[0113] This invention constructs a null hypothesis distribution based on 10,000 pairs of random, unconnected nodes, calculates the P-value and Z-score for each prediction result, and uses P<0.05 as a significance threshold to filter high-confidence interactions. This overcomes the limitations of traditional black-box model output and solves the problem that existing technologies cannot evaluate the significance of individual prediction results.

[0114] This invention employs a random forest classifier with early stopping mechanism, automatically determining the optimal number of decision trees to be 30, effectively avoiding overfitting. The model achieves an accuracy of 94.19% on the test set, with a difference of only 0.04% between the training and test set accuracy. On an independent test set of 5000 positive and 5000 negative samples, the AUC value reaches 0.9819, the AP value reaches 0.9852, the average prediction score for positive samples is 0.9087, and the average score for negative samples is only 0.0870, demonstrating extremely strong ability to distinguish interaction relationships.

[0115] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.

[0116] Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A method for predicting protein interactions based on multi-source biological networks and null hypothesis testing, characterized in that it includes the following steps:

Step S1: Integrate protein interaction data from the STRING database and the BioGRID database to construct a three-layer enhanced joint network with different confidence levels; the three-layer enhanced joint network includes: a core layer consisting of the overlapping parts of the two databases, a first extension layer consisting of the unique parts of the BioGRID database, and a second extension layer consisting of the unique parts of the STRING database with a comprehensive score not lower than a preset threshold;

Step S2: Calculate the topological features for each protein node in the joint network, and fuse and normalize them with the generated embedding vector to form a node feature vector;

Step S3: Extract known interaction pairs and non-interaction pairs from the network as training samples, extract the edge features of each pair of samples, and use the edge features to train a random forest classifier as a prediction model;

Step S4: Extract multiple protein pairs that are known to have no interaction from the joint network, use the prediction model to predict these protein pairs, and construct the null hypothesis distribution from the set of predicted scores obtained;

Step S5: For the protein pair to be predicted, obtain a prediction score using the prediction model, and obtain a statistical measure to evaluate its significance by comparing the score with the null hypothesis distribution; based on the prediction score and the statistical measure, screen and output potential protein interactions with high confidence.

2. The method according to claim 1, characterized in that, in step S1, the preset threshold is a comprehensive score of 700 in the STRING database, and the data taken from the BioGRID database are interactions verified by low-throughput experiments.
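
The layering of step S1 with this threshold can be sketched with plain set operations. The sketch assumes that the BioGRID pairs have already been restricted to low-throughput evidence upstream, that interactions are undirected, and that the core layer is the overlap regardless of the STRING score; the function name `build_layers` and the sample data are hypothetical.

```python
def build_layers(string_scores, biogrid_pairs, threshold=700):
    """Split the merged interactions into the three layers of step S1."""
    # Undirected edges: (A, B) and (B, A) map to the same frozenset key.
    string_edges = {frozenset(pair): score for pair, score in string_scores.items()}
    biogrid_edges = {frozenset(pair) for pair in biogrid_pairs}
    string_set = set(string_edges)

    core = biogrid_edges & string_set            # present in both databases
    extension_1 = biogrid_edges - string_set     # unique to BioGRID
    extension_2 = {                              # unique to STRING, score >= threshold
        edge for edge in string_set - biogrid_edges
        if string_edges[edge] >= threshold
    }
    return {"core": core, "extension_1": extension_1, "extension_2": extension_2}

# Made-up example data: STRING pairs with scores, BioGRID pairs without.
string_scores = {("A", "B"): 950, ("B", "C"): 720, ("C", "D"): 400}
biogrid_pairs = [("B", "A"), ("D", "E")]
layers = build_layers(string_scores, biogrid_pairs)
```

Here the pair A-B falls into the core layer, D-E into the first extension layer, and B-C into the second extension layer, while C-D is discarded because its score of 400 is below 700.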

3. The method according to claim 1, characterized in that, in step S2, the topological features include degree centrality, clustering coefficient, betweenness centrality, and PageRank value; the embedding vector is a randomly initialized, L2-normalized vector; and the node feature vector is obtained by concatenating the topological features with the embedding vector and then performing L2 normalization again.
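
The concatenation and double normalization can be sketched as follows. The four topological values are taken as already computed (for example by a graph library), and the embedding dimension of 8 and the Gaussian initialization are assumptions, since the claim only requires random initialization; `l2_normalize` and `node_feature_vector` are hypothetical names.

```python
import math
import random

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def node_feature_vector(topo_features, embed_dim=8, rng=None):
    """Concatenate topological features with a random unit embedding, then renormalize."""
    rng = rng or random.Random()
    # Randomly initialized embedding, L2-normalized (assumed Gaussian draw).
    embedding = l2_normalize([rng.gauss(0.0, 1.0) for _ in range(embed_dim)])
    # Second L2 normalization over the concatenated vector.
    return l2_normalize(list(topo_features) + embedding)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Made-up values: degree centrality, clustering coefficient, betweenness, PageRank.
topo = [0.25, 0.5, 0.1, 0.03]
vec = node_feature_vector(topo, embed_dim=8, rng=rng)
```

The resulting vector has 12 components and unit length; the second normalization rescales all components by the same factor, so the relative proportions among the topological features are preserved.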

4. The method according to claim 1, characterized in that, in step S3, the edge features include: number of common neighbors, Jaccard similarity, preferential attachment index, cosine similarity of the node feature vectors, and node degree difference.
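
A minimal sketch of the five edge features for one protein pair is given below. It assumes the network is held as an adjacency dictionary of neighbor sets, takes the third feature to be the preferential attachment score (the product of the two node degrees, an assumption about the intended index), and uses the absolute difference for the degree difference; the function name `edge_features` and the toy graph are hypothetical.

```python
import math

def edge_features(u, v, adj, node_vecs):
    """Return the five edge features for the pair (u, v)."""
    nu, nv = adj[u], adj[v]
    common = len(nu & nv)                          # number of common neighbors
    union = len(nu | nv)
    jaccard = common / union if union else 0.0     # Jaccard similarity
    pref_attach = len(nu) * len(nv)                # assumed preferential attachment score
    a, b = node_vecs[u], node_vecs[v]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cosine = dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
    degree_diff = abs(len(nu) - len(nv))           # node degree difference
    return [common, jaccard, pref_attach, cosine, degree_diff]

# Toy undirected graph as neighbor sets, with made-up node feature vectors.
adj = {"A": {"B", "C"}, "B": {"A", "C", "D"}, "C": {"A", "B"}, "D": {"B"}}
node_vecs = {"A": [1.0, 0.0], "D": [0.6, 0.8]}
feats = edge_features("A", "D", adj, node_vecs)
```

For the pair A-D, the only common neighbor is B, the neighbor union has two members, and the degrees are 2 and 1, so the features are one common neighbor, a Jaccard similarity of 0.5, a preferential attachment score of 2, a cosine similarity of about 0.6, and a degree difference of 1.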

5. The method according to claim 1, characterized in that, in step S3, the training process of the random forest classifier uses an early stopping mechanism to determine the optimal number of decision trees.
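
The claim does not state the stopping criterion, so the following is only one possible reading: grow the forest in fixed increments and stop once a validation score has failed to improve for a set number of consecutive checks. The function `select_n_trees`, its parameters, and the score table are hypothetical; `score_at` stands in for whatever routine trains or extends the forest to `n` trees and returns its validation score.

```python
def select_n_trees(score_at, start=50, step=50, max_trees=1000,
                   patience=3, min_delta=1e-4):
    """Pick the tree count at which the validation score stops improving."""
    best_n, best_score, stale = None, float("-inf"), 0
    n = start
    while n <= max_trees:
        score = score_at(n)
        if score > best_score + min_delta:
            # Meaningful improvement: remember it and reset the counter.
            best_n, best_score, stale = n, score, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` consecutive checks
        n += step
    return best_n, best_score

# Made-up validation scores that plateau after 200 trees.
curve = {50: 0.70, 100: 0.80, 150: 0.85, 200: 0.86, 250: 0.86, 300: 0.86, 350: 0.86}
best_n, best_score = select_n_trees(lambda n: curve[n], max_trees=350)
```

With these scores the search stops at 350 trees and selects 200. In practice `score_at` could wrap a forest that is extended incrementally, for instance scikit-learn's `RandomForestClassifier` with `warm_start=True`, so that earlier trees are not retrained at each check.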

6. A protein interaction prediction system for implementing the method of any one of claims 1-5, characterized in that it comprises:
a network construction module, configured to perform step S1 and construct the three-layer enhanced joint network;
a feature engineering module, configured to perform step S2 and generate the node feature vectors;
a model training module, configured to perform step S3 and train the random forest prediction model;
a null hypothesis testing module, configured to perform step S4, construct the null hypothesis distribution, and perform the statistic calculation of step S5; and
a prediction output module, configured to perform the prediction and the screening and output of results in step S5.

7. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the steps of the method according to any one of claims 1-5.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the method according to any one of claims 1-5.

9. A computer program product comprising a computer program or instructions, characterized in that, when the computer program or instructions are executed by a processor, they implement the method according to any one of claims 1-5.