A protein-nucleic acid binding site prediction method and system based on equivariant geometric graph learning
By constructing a multi-scale 3D geometric graph and learning features with an equivariant geometric graph neural network, the invention addresses the accuracy and computational-cost limitations of existing protein-nucleic acid binding site prediction, achieving high-precision binding site prediction and supporting downstream applications.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HANGZHOU INSTITUTE OF MEDICAL SCIENCES CHINESE ACADEMY OF SCIENCES
- Filing Date
- 2026-03-17
- Publication Date
- 2026-06-12
AI Technical Summary
Existing methods for predicting protein-nucleic acid binding sites struggle to accurately characterize the multi-scale three-dimensional geometry of proteins without relying on evolutionary features, and the prediction results are difficult to directly serve downstream molecular docking or structural modeling tasks.
The method is based on equivariant geometric graph learning. A multi-scale three-dimensional geometric graph is constructed with amino acid residues as nodes and spatial neighborhood relationships as edges, combined with direction-sensitive geometric features. The graph is input into a three-dimensional equivariant geometric graph neural model for feature learning, which outputs the probability of each residue being a nucleic acid binding site and is trained with a hybrid loss function.
It achieves high-precision prediction of binding site residues without relying on evolutionary features, reduces computational costs, and can directly serve downstream molecular docking and structural modeling, with good generalization ability and computational efficiency.
Figure CN122201429A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the interdisciplinary field of bioinformatics, structural biology and artificial intelligence, and in particular relates to a method and system for predicting protein-nucleic acid binding sites based on equivariant geometric graph learning.
Background Technology
[0002] The specific interactions between proteins and nucleic acid molecules (DNA or RNA) play a crucial role in gene transcription regulation, RNA processing, replication and repair, and disease development. Accurately identifying the amino acid residues in protein molecules involved in nucleic acid binding is of great significance for elucidating molecular mechanisms, assisting in structural modeling, and conducting drug design.
[0003] Existing methods for predicting protein-nucleic acid binding sites mainly include sequence-based and structure-based methods. Sequence-based methods typically rely on amino acid composition, conservation, or evolutionary features for prediction. While computationally efficient, they struggle to characterize the geometric constraints of protein-nucleic acid interactions, which are highly dependent on spatial configuration and directionality. Structure-based methods can utilize protein three-dimensional conformation information, but current techniques generally suffer from the following shortcomings: 1. Most methods employ coarse-grained residue-level representations, failing to explicitly model side chain conformations and atomic-level geometric relationships, making it difficult to accurately describe the directionality and stereocomplementarity at the binding interface; 2. They rely on multiple sequence alignments or evolutionary features, resulting in high computational costs and limiting their application in large-scale structure prediction workflows; 3. Prediction results are typically used only for classification and are difficult to directly serve downstream protein-nucleic acid molecular docking or structural modeling tasks.
[0004] Therefore, there is an urgent need for a binding site prediction method that can accurately characterize the multi-scale three-dimensional geometry of proteins without relying on evolutionary features, while remaining consistent under rotation and translation of the input structure.
Summary of the Invention
[0005] To address the aforementioned technical problems, this invention proposes a method and system for predicting protein-nucleic acid binding sites based on equivariant geometric graph learning. The method is used for predicting protein-DNA binding sites and protein-RNA binding sites, and for related molecular structure modeling and docking analysis.
[0006] To achieve the above objectives, this invention provides a method for predicting protein-nucleic acid binding sites based on equivariant geometric graph learning, comprising: obtaining the three-dimensional structural coordinates of the target protein, and performing normalization preprocessing on the three-dimensional structural coordinates; based on the preprocessed structural coordinates, constructing a three-dimensional geometric graph with amino acid residues as nodes and spatial neighborhood relationships as edges, wherein the nodes contain multi-scale geometric features and the edges contain direction-sensitive geometric features; inputting the three-dimensional geometric graph into a three-dimensional equivariant geometric graph neural model for feature learning and message passing, and outputting the probability of each residue being a nucleic acid binding site; and obtaining the set of binding site residues based on the probability and outputting the prediction result.
[0007] Optionally, the normalization preprocessing of the three-dimensional structural coordinates includes: translating and rotating the protein structure so that the centroid of the structure is located at the origin of the coordinate system, and aligning the principal component axes with the coordinate axes.
[0008] Optionally, constructing a three-dimensional geometric graph using amino acid residues as nodes includes: using the coordinates of a representative atom of each residue as the node coordinates, and establishing an edge between two residues when the distance between their representative atoms does not exceed a preset threshold.
[0009] Optionally, the multi-scale geometric features include: amino acid-level features, backbone-level features, and side-chain-level features. The amino acid-level features include a one-hot encoding of the residue type and sequence embeddings generated by a pre-trained protein language model. The backbone-level features include the sine and cosine encoding of the backbone dihedral angles. The side-chain-level features include the sine and cosine encoding of the side-chain torsion angles.
[0010] Optionally, the direction-sensitive geometric features include: distance features formed by expanding the inter-residue distance with radial basis functions, direction features formed by expanding the normalized displacement vector with spherical harmonics, and Euler angle features describing the relative orientation between the local backbone coordinate frames of the residues.
[0011] Optionally, the Euler angle features are encoded using the tensor product of the zeroth-order spherical harmonic function and the radial basis function to form a rotationally equivariant edge feature representation: $e_{ij}^{\mathrm{Euler}} = \big[\mathrm{RBF}_k(\lVert \mathbf{d}_{ij} \rVert) \otimes Y^{(0)}(\theta_{ij})\big]_k$; where $e_{ij}^{\mathrm{Euler}}$ is the rotationally equivariant edge feature representation, $\mathrm{RBF}$ is the radial basis function, $\mathbf{d}_{ij}$ is the distance vector, $Y^{(0)}$ is the spherical harmonic function, $\theta_{ij}$ denotes the Euler angles, i and j are the indices of neighboring nodes in the residue graph, and k is the basis function index of the radial basis function or spherical harmonic expansion.
[0012] Optionally, the three-dimensional equivariant geometric graph neural model adopts an edge-aware graph attention mechanism, which fuses the geometric features of edges with the features of nodes and introduces geometric structure constraints through a depth-wise tensor product for geometric modulation: $f'_{ij} = \mathrm{DTP}\big(f_{ij},\, Y(\hat{\mathbf{d}}_{ij})\big)$; where $f'_{ij}$ is the geometrically modulated feature, $\mathrm{DTP}$ is the depth-wise tensor product, $f_{ij}$ is obtained by fusing the feature vector $h_i$ of node i with the feature vector $h_j$ of node j through the linear transformation matrices $W_{src}$ and $W_{dst}$ and combining the edge geometric features, and $Y(\hat{\mathbf{d}}_{ij})$ is the spherical harmonic representation of the direction vector $\hat{\mathbf{d}}_{ij}$. Based on the modulated features, the attention coefficients, normalized with Softmax, and the value vectors are calculated: $a_{ij} = \mathrm{softmax}_{j}\big(W_a f'_{ij}\big)$, $v_{ij} = W_v f'_{ij}$; where $W_a$ and $W_v$ are the linear projection matrices used to calculate the attention coefficient and the message value, respectively, $a_{ij}$ is the normalized attention coefficient, and $v_{ij}$ is the value vector of the message passed from node j to node i. The value vectors are aggregated, weighted by the attention coefficients, to update the node feature representations. The updated node features are input into the classification layer, and the probability that each residue is a nucleic acid binding site is calculated using the Sigmoid function.
[0013] Optionally, the three-dimensional equivariant geometric graph neural model is trained using a hybrid loss function, which includes a class-balanced Focal Loss and a triplet center loss. The class-balanced Focal Loss adjusts the class weights according to the effective number of samples in each class, and the triplet center loss enhances feature separability by optimizing the distances between sample features and class centers.
[0014] The present invention also provides a protein-nucleic acid binding site prediction system based on equivariant geometric graph learning, comprising: a data processing module, a three-dimensional geometric graph construction module, a probability output module, and a result prediction module. The data processing module is used to obtain the three-dimensional structural coordinates of the target protein and perform normalization preprocessing on the three-dimensional structural coordinates. The three-dimensional geometric graph construction module is used to construct a three-dimensional geometric graph based on the preprocessed structural coordinates, with amino acid residues as nodes and spatial neighborhood relationships as edges; the nodes contain multi-scale geometric features, and the edges contain direction-sensitive geometric features. The probability output module is used to input the three-dimensional geometric graph into the three-dimensional equivariant geometric graph neural model for feature learning and message passing, and to output the probability that each residue is a nucleic acid binding site. The result prediction module is used to obtain the set of binding site residues based on the probability and output the prediction result.
[0015] Compared with the prior art, the present invention has the following advantages and technical effects. First, this invention explicitly models the atomic-level conformation and relative orientation of residues using side-chain torsion angles and backbone Euler angles. The rationale is that nucleic acid binding has high directional specificity, and traditional coarse-grained models cannot capture the subtle spatial arrangement of side chains. By introducing fine-grained geometric features, this invention enables the model to recognize specific recognition patterns dominated by side-chain conformation. Second, this invention uses a pre-trained protein language model (ESM-2) to replace computationally expensive multiple sequence alignments (MSA) and position-specific scoring matrices (PSSM), supplementing single-sequence information with the evolutionary knowledge implicit in the large-scale model. This eliminates the time-consuming MSA search and achieves very fast inference (<1 second for a single protein), significantly reducing computational costs and expanding application scenarios. Furthermore, owing to the three-dimensional SE(3)-equivariant network architecture, the model is tolerant of local noise in AlphaFold2-predicted structures and generalizes well to various ligands such as DNA, RNA, and metal ions. Finally, the predictions generated by the model are spatially continuous and physically plausible, and can be used directly as ambiguous interaction restraints to drive HADDOCK3 molecular docking, significantly improving the DockQ score of complex structure modeling and addressing the problem that existing prediction methods are difficult to apply directly to downstream structural biology modeling.
Brief Description of the Drawings
[0016] The accompanying drawings, which form part of this application, are used to provide a further understanding of this application. The illustrative embodiments and descriptions of this application are used to explain this application and do not constitute an undue limitation of this application. In the drawings: Figure 1 is a flowchart of a protein-nucleic acid binding site prediction method based on equivariant geometric graph learning according to an embodiment of the present invention; Figure 2 is a graph showing the statistical significance test results for the performance improvement of the embodiment of the present invention over existing methods on the test sets; Figure 3 is a graph showing the end-to-end inference speed for proteins of different lengths in an embodiment of the present invention.
Detailed Implementation
[0017] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.
[0018] It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order than that shown here.
[0019] This embodiment addresses the following problems in existing protein-DNA / protein-RNA binding site prediction technologies. First, many methods employ coarse-grained representations at the residue level and fail to explicitly model side-chain conformations and atomic-level directional geometry, resulting in insufficient characterization of stereocomplementarity and directional constraints at the binding interface. Second, some methods rely on multiple sequence alignments or evolutionary features, leading to high computational costs and difficulty in meeting the requirements of large-scale structure prediction and screening. Third, existing prediction results mostly remain at the classification level and are difficult to translate directly into constraint information usable for downstream protein-nucleic acid docking or complex structure modeling. Therefore, this invention, without relying on evolutionary features, introduces a multi-scale three-dimensional geometric structure representation and an equivariant learning mechanism that is consistent under rotation and translation, achieving high-precision prediction of binding site residues and enabling the prediction results to serve docking and structure modeling directly.
[0020] This embodiment proposes a protein-nucleic acid binding site prediction method based on equivariant geometric graph learning. As shown in Figure 1, the specific steps include: obtaining the three-dimensional structural coordinates of the target protein, and performing normalization preprocessing on the three-dimensional structural coordinates; based on the preprocessed structural coordinates, constructing a three-dimensional geometric graph with amino acid residues as nodes and spatial neighborhood relationships as edges, wherein the nodes contain multi-scale geometric features and the edges contain direction-sensitive geometric features; inputting the three-dimensional geometric graph into a three-dimensional equivariant geometric graph neural model for feature learning and message passing, and outputting the probability of each residue being a nucleic acid binding site; and obtaining the set of binding site residues based on the probability and outputting the prediction result.
[0021] Specifically, the core of this embodiment is to take the three-dimensional structure of the protein monomer as input, construct a three-dimensional geometric graph containing multi-scale node geometry and orientation-sensitive edge geometry, and perform graph neural network inference under SE(3)-equivariance constraints to output residue-level binding probabilities, which can be further converted into docking constraints. First, the three-dimensional structural coordinates of the target protein are obtained and normalized so that the structural data can be used for subsequent geometric feature calculations. Then, a three-dimensional geometric graph is constructed with amino acid residues as graph nodes, and edges between nodes are established based on spatial neighborhood relationships. Unlike existing graph representations that use only a single scale or only distances, this invention constructs multi-scale structural features on the node side, including at least residue type information, backbone local conformation information, and side-chain geometric conformation and orientation information, thereby explicitly introducing at the residue level the stereochemical and directional constraints required for interface recognition. On the edge side, geometric edge features are constructed to characterize the relative distance, relative direction, and relative orientation between residues, so that the edge information reflects not only "distance" but also "orientation and relative pose". The node features and edge features are then input into a three-dimensional equivariant geometric graph neural network for feature learning and message passing.
Because the network remains consistent under global rotation and translation of the input structure, inference on the same structure is stable across different coordinate systems, and the probability or score of each residue being a nucleic acid binding site is output; the set of binding site residues is obtained based on the probability or score, and the prediction result is output. In some optional embodiments, the present invention further converts the predicted set of binding site residues into interface constraints or an active residue set that serves as prior input for protein-nucleic acid docking and complex structure modeling, so as to improve the accuracy and stability of downstream modeling. In the above technical solution, "multi-scale node geometric features (especially side-chain conformation and orientation information) + orientation-sensitive geometric edge features + SE(3)-equivariant learning" constitutes the essential technical features that are indispensable for the present invention to achieve its purpose and effect, while using the prediction results for docking and modeling is an optional technical feature that further expands the application scenarios and enhances the technical effect.
[0022] Furthermore, the standardization preprocessing of the three-dimensional structural coordinates includes: translating and rotating the protein structure so that the centroid of the structure is located at the origin of the coordinate system, and aligning the principal component axes with the coordinate axes.
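The normalization step above can be sketched as follows. This is a minimal illustration, assuming the coordinates are given as an N x 3 array and that the principal axes are taken from the eigenvectors of the coordinate covariance matrix; the function name `normalize_coordinates` is hypothetical and not taken from the patent.

```python
import numpy as np

def normalize_coordinates(coords):
    """Center coordinates at the origin and align principal axes with x, y, z."""
    coords = np.asarray(coords, dtype=float)
    centered = coords - coords.mean(axis=0)        # centroid moved to the origin
    cov = centered.T @ centered / len(centered)    # 3x3 covariance matrix
    _, eigvecs = np.linalg.eigh(cov)               # eigenvalues in ascending order
    axes = eigvecs[:, ::-1].copy()                 # largest-variance axis first
    if np.linalg.det(axes) < 0:                    # keep a proper rotation (no reflection)
        axes[:, -1] *= -1.0
    return centered @ axes                         # coordinates in the principal-axis frame
```

Note that the sign of each principal axis is not unique, so this kind of alignment fixes the pose only up to axis flips; the equivariant network described later does not depend on a particular choice.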
[0023] Specifically, this invention uses the three-dimensional structure of a protein monomer and its amino acid sequence as input, outputs residue-level nucleic acid binding probabilities, and can be used as constraint information for downstream molecular docking. In this embodiment, the nucleic acid binding site datasets used for training and validation are from publicly available benchmark datasets and compiled by resources such as BioLiP: In the protein-DNA binding site prediction task, the Train 573 and Test 129 datasets are used. Train 573 contains structures released before January 6, 2016, and Test 129 contains structures released between January 6, 2016, and December 5, 2018; simultaneously, an independent test set, Test 181 (built by GraphSite, with new DNA-binding proteins added between December 6, 2018, and August 19, 2021), is introduced for generalization evaluation. In the protein-RNA binding site prediction task, the Train 495 and Test 117 datasets are used, both from BioLiP and employing the same time segmentation strategy as described above. To ensure the non-redundancy of the training / test sets, CD-HIT was used to remove redundancy from the protein chains, ensuring that the sequence identity between the training and test sets did not exceed 30%.
[0024] For each protein chain, residue-level labeling is performed based on its spatial proximity to the corresponding nucleic acid (DNA or RNA) in the complex structure: when the minimum distance between any atom of a residue and any atom of the nucleic acid is less than 0.5 Å plus the sum of the van der Waals radii of the two nearest atoms, the residue is labeled as a binding site residue; otherwise, it is labeled as a non-binding site residue. This labeling rule reflects the actual contacts at the binding interface at a fine-grained residue level and is suitable for subsequent geometric modeling with residues as nodes.
[0025] Furthermore, constructing a three-dimensional geometric graph using amino acid residues as nodes includes: using the coordinates of a representative atom of each residue as the node coordinates, and establishing an edge between two residues when the distance between their representative atoms does not exceed a preset threshold.
[0026] Specifically, the input protein structure is represented as a residue graph. Each node represents a residue, and the node coordinates are taken from the coordinates of the residue's representative atom; an edge is established between two residues when the distance between their representative atoms does not exceed 11.5 Å, so as to capture local and mid-range interactions.
[0027] In this residue graph, nodes and edges incorporate multi-scale geometric information: nodes aggregate amino acid-level, backbone-level, and atom / side-chain-level information; edges encode geometric quantities such as the distance, direction, and relative orientation between residues, and remain consistent under rotation / translation transformations through equivariant geometric encoding.
[0028] Furthermore, the multi-scale geometric features include: amino acid-level features, backbone-level features, and side-chain-level features. The amino acid-level features include a one-hot encoding of the residue type and sequence embeddings generated by a pre-trained protein language model. The backbone-level features include the sine and cosine encoding of the backbone dihedral angles. The side-chain-level features include the sine and cosine encoding of the side-chain torsion angles.
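The labeling rule can be illustrated with a short sketch. It assumes atoms are given as `(element, (x, y, z))` pairs and uses a small table of commonly quoted van der Waals radii; the table and the function name `is_binding_residue` are illustrative assumptions, not values or names stated in the patent.

```python
import math

# Assumed, illustrative van der Waals radii in angstroms (not from the patent).
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80, "P": 1.80}

def is_binding_residue(residue_atoms, nucleic_atoms, margin=0.5):
    """Label a residue as binding if its closest atom pair to the nucleic acid
    is nearer than margin + the sum of the two atoms' van der Waals radii."""
    best = None
    for elem_r, xyz_r in residue_atoms:
        for elem_n, xyz_n in nucleic_atoms:
            d = math.dist(xyz_r, xyz_n)
            if best is None or d < best[0]:
                best = (d, elem_r, elem_n)       # remember the nearest atom pair
    if best is None:                             # no atoms to compare
        return False
    d, elem_r, elem_n = best
    return d < margin + VDW_RADII[elem_r] + VDW_RADII[elem_n]
```

For a nitrogen-oxygen pair, for example, the cutoff under these assumed radii is 0.5 + 1.55 + 1.52 = 3.57 Å.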
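A minimal sketch of the edge construction, assuming one coordinate triple per residue and a brute-force pairwise search; `build_residue_graph` is a hypothetical helper name. The 11.5 Å cutoff is the value given in this embodiment.

```python
import math

def build_residue_graph(node_coords, cutoff=11.5):
    """Return directed edges (i, j) for residue pairs within `cutoff` angstroms."""
    edges = []
    for i, ci in enumerate(node_coords):
        for j, cj in enumerate(node_coords):
            if i != j and math.dist(ci, cj) <= cutoff:   # no self-loops
                edges.append((i, j))
    return edges
```

The quadratic loop is adequate for single chains; a spatial index would be a natural substitute for very large structures.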
[0029] Specifically, this invention constructs hierarchical features at the nodes to enhance the characterization of the "directionality and 3D complementarity" of the binding interface, at the amino acid level, backbone level, and side-chain level, respectively. Amino acid level: a 20-dimensional one-hot encoding and the pre-trained model ESM-2 (650M parameter version), which generates 1280-dimensional sequence embeddings, are used: $h_i^{aa} = o_i \,\Vert\, s_i$; where $h_i^{aa}$ denotes the amino acid-level feature embedding of node i, $o_i$ denotes the one-hot encoding of the amino acid, $s_i$ denotes the protein sequence context embedding generated by the ESM-2 model (650M parameter version), and $\Vert$ denotes the vector concatenation operation, which joins two vectors sequentially.
[0030] Backbone level: the backbone dihedral angles are calculated and converted into sine and cosine values: $h_i^{bb} = [\sin\theta,\ \cos\theta]_{\theta \in \Theta_i}$; where $h_i^{bb}$ denotes the backbone-level feature embedding of node i, and $\Theta_i$ is the set of dihedral angles of the protein backbone at residue i.
[0031] Side-chain level: up to four side-chain torsion angles $\chi_1, \dots, \chi_4$ are encoded with sine and cosine values, which explicitly introduces side-chain orientation information at the residue level: $h_i^{sc} = [\sin\chi_m,\ \cos\chi_m]_{m=1}^{4}$; where $h_i^{sc}$ denotes the side-chain-level feature embedding of node i, and $\chi_m$ is a torsion angle of the protein side chain.
[0032] The final node feature is the concatenation of the above three parts: $h_i = h_i^{aa} \,\Vert\, h_i^{bb} \,\Vert\, h_i^{sc}$.
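The three-level node feature can be sketched as below. The language-model embedding is passed in as a precomputed list (no ESM-2 call is made here), angles are in radians, and missing side-chain torsion angles are zero-padded; the padding choice and all function names are assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard residue types

def one_hot(residue):
    """20-dimensional one-hot encoding of a one-letter residue code."""
    return [1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]

def sincos(angles, size):
    """Encode up to `size` angles as [sin, cos] pairs; missing angles become zeros."""
    out = []
    for k in range(size):
        if k < len(angles):
            out += [math.sin(angles[k]), math.cos(angles[k])]
        else:
            out += [0.0, 0.0]          # assumed padding for absent torsion angles
    return out

def node_features(residue, lm_embedding, backbone_angles, chi_angles, max_chi=4):
    """Concatenate amino acid-level, backbone-level and side-chain-level features."""
    amino = one_hot(residue) + list(lm_embedding)
    backbone = sincos(backbone_angles, len(backbone_angles))
    side_chain = sincos(chi_angles, max_chi)
    return amino + backbone + side_chain
```

With a 1280-dimensional embedding, two backbone angles and four side-chain slots, the resulting vector has 20 + 1280 + 4 + 8 = 1312 entries.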
[0033] Furthermore, the direction-sensitive geometric features include: distance features formed by expanding the inter-residue distance with radial basis functions, direction features formed by expanding the normalized displacement vector with spherical harmonics, and Euler angle features describing the relative orientation between the local backbone coordinate frames of the residues.
[0034] Specifically, for each edge (i, j), a joint distance-angle encoding is constructed: the distance between residues is expanded using radial basis functions (RBF), the direction of the normalized displacement vector is expanded using spherical harmonics (SH), and a joint radial-angular embedding is formed through a tensor product to maintain SE(3) equivariance: $e_{ij}^{\mathrm{geo}} = \mathrm{RBF}(\lVert \mathbf{d}_{ij} \rVert) \otimes Y(\hat{\mathbf{d}}_{ij})$; where $\mathbf{d}_{ij}$ is the distance vector, $\hat{\mathbf{d}}_{ij}$ is the normalized direction vector, and $\otimes$ denotes the tensor product.
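The joint radial-angular encoding can be sketched as an outer product of a radial basis expansion and real spherical harmonics. This sketch assumes Gaussian basis functions with evenly spaced centers, 16 basis functions, and spherical harmonics truncated at degree 1; none of these choices is specified in the text, and the function names are hypothetical.

```python
import math

def rbf_expand(distance, num_basis=16, d_max=11.5):
    """Gaussian radial basis expansion of a scalar distance (assumed form)."""
    width = d_max / (num_basis - 1)
    centers = [k * width for k in range(num_basis)]
    return [math.exp(-((distance - c) / width) ** 2) for c in centers]

def real_sh_l01(unit):
    """Real spherical harmonics of degree 0 and 1 for a unit vector (x, y, z)."""
    x, y, z = unit
    y00 = 0.5 * math.sqrt(1.0 / math.pi)
    c1 = math.sqrt(3.0 / (4.0 * math.pi))
    return [y00, c1 * y, c1 * z, c1 * x]

def edge_features(ri, rj, num_basis=16):
    """Outer product of the radial expansion and the angular expansion of r_j - r_i."""
    d = [b - a for a, b in zip(ri, rj)]
    dist = math.sqrt(sum(c * c for c in d))
    unit = [c / dist for c in d]
    radial = rbf_expand(dist, num_basis)
    angular = real_sh_l01(unit)
    return [[r * a for a in angular] for r in radial]   # shape: num_basis x 4
```

Under a rotation of the structure, the degree-0 column is unchanged and the degree-1 columns rotate like an ordinary vector, which is the property the equivariant network relies on.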
[0035] Backbone Euler angle embedding: To capture the relative orientation between residues, the local backbone coordinate frames of residues i and j are computed, together with the Euler angles $\theta_{ij}$ between these frames. This is one of the key innovations of this invention. Each angle is jointly encoded with the distance: $e_{ij}^{\mathrm{Euler}} = \big[\mathrm{RBF}_k(\lVert \mathbf{d}_{ij} \rVert) \otimes Y^{(0)}(\theta_{ij})\big]_k$; where $e_{ij}^{\mathrm{Euler}}$ is the rotationally equivariant edge feature representation, $\mathrm{RBF}$ is the radial basis function, $\mathbf{d}_{ij}$ is the distance vector, $Y^{(0)}$ is the spherical harmonic function, $\theta_{ij}$ denotes the Euler angles, i and j are the indices of neighboring nodes in the residue graph, corresponding to the i-th and j-th amino acid residues in the protein, and k is the basis function index of the radial basis function or spherical harmonic expansion. The final edge feature $e_{ij}$ is the combination of the features described above.
[0036] Furthermore, the three-dimensional equivariant geometric graph neural model employs an edge-aware graph attention mechanism with an early fusion strategy. When computing the features of each irreducible representation (irrep) order, traditional equivariant graph attention mechanisms use node features only. This invention injects the geometric edge features directly into the pre-attention stage: the fused feature $f_{ij}^{(L)}$ is formed from $W_{src}^{(L)} h_j^{(L)}$, $W_{dst}^{(L)} h_i^{(L)}$ and $e_{ij}^{(L)}$; where L is the order of the irreducible representation (L = 0 for scalars, L = 1 for vectors, and so on), $h_i^{(L)}$ and $h_j^{(L)}$ are the order-L feature vectors of the i-th and j-th nodes, $W_{src}^{(L)}$ and $W_{dst}^{(L)}$ are the learnable weight matrices of the source node and the target node for the order-L features, respectively, and $e_{ij}^{(L)}$ is the order-L geometric feature component of edge (i, j).
[0037] Geometric modulation is achieved by introducing geometric constraints through a depth-wise tensor product (DTP): $f'_{ij} = \mathrm{DTP}\big(f_{ij},\, Y(\hat{\mathbf{d}}_{ij})\big)$; where $f'_{ij}$ is the geometrically modulated feature, $\mathrm{DTP}$ is the depth-wise tensor product, $f_{ij}$ is obtained by fusing the feature vector $h_i$ of node i with the feature vector $h_j$ of node j through the linear transformation matrices $W_{src}$ and $W_{dst}$ and combining the edge geometric features, and $Y(\hat{\mathbf{d}}_{ij})$ is the spherical harmonic representation of the direction vector $\hat{\mathbf{d}}_{ij}$. Based on the modulated features, the attention coefficients, normalized with Softmax, and the value vectors are calculated: $a_{ij} = \mathrm{softmax}_{j \in \mathcal{N}(i)}\big(W_a f'_{ij}\big)$, $v_{ij} = W_v f'_{ij}$; where $W_a$ and $W_v$ are the linear projection matrices used to calculate the attention coefficient and the message value, respectively, $a_{ij}$ is the normalized attention coefficient, and $v_{ij}$ is the value vector of the message passed from node j to node i. The value vectors are aggregated, weighted by the attention coefficients, to update the node features, from which the probability that each residue is a nucleic acid binding site is output: $h'_i = \sum_{j \in \mathcal{N}(i)} a_{ij}\, v_{ij}$; where $\mathcal{N}(i)$ is the set of neighboring nodes of node i, and $h'_i$ is the aggregated node feature, which is used as input for the next layer.
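The attention-weighted aggregation can be illustrated on scalar (order-0) features only. This sketch deliberately omits higher-order irreps and the depth-wise tensor product, and it assumes that node and edge features are fused by simple addition; both simplifications, and every name in the code, are assumptions rather than the patented architecture.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    return [dot(row, x) for row in W]

def attention_update(i, neighbors, h, e, W_src, W_dst, w_att, W_val):
    """Update node i from its neighbors with edge-aware attention (scalars only)."""
    fused = []
    for j in neighbors:
        src = matvec(W_src, h[j])                 # projected source-node feature
        dst = matvec(W_dst, h[i])                 # projected target-node feature
        # Assumed additive early fusion of node and edge features.
        fused.append([s + d + g for s, d, g in zip(src, dst, e[(i, j)])])
    alphas = softmax([dot(w_att, f) for f in fused])   # normalized over neighbors of i
    values = [matvec(W_val, f) for f in fused]
    dim = len(values[0])
    return [sum(a * v[k] for a, v in zip(alphas, values)) for k in range(dim)]
```

Because the edge feature enters before the attention logits are computed, two neighbors with identical node features can still receive different weights when their geometry differs.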
[0038] Furthermore, the three-dimensional equivariant geometric graph neural model is trained using a hybrid loss function, which includes a class-balanced Focal Loss and a triplet center loss. The class-balanced Focal Loss adjusts the class weights according to the effective number of samples in each class, while the triplet center loss enhances feature separability by optimizing the distances between sample features and class centers.
[0039] Specifically, the output layer of the model uses a multilayer perceptron to predict the probability that each node belongs to a binding site. The training process minimizes a hybrid loss function that is a weighted combination of the two terms $\mathcal{L}_{CB}$ and $\mathcal{L}_{TCL}$, with weighting coefficient $\lambda$ (0.1 in this embodiment). The two terms are defined as follows. Class-balanced Focal Loss ($\mathcal{L}_{CB}$): $\mathcal{L}_{CB} = -\frac{1-\beta}{1-\beta^{n_y}}\,(1-p_t)^{\gamma}\,\log(p_t)$; where $\beta$ is the hyperparameter used to calculate the effective number of samples (0.999 in this embodiment); $n_y$ is the total number of training-set samples in class y (i.e., binding sites or non-binding sites); $\frac{1-\beta}{1-\beta^{n_y}}$ is the class-balanced weighting term, so that the fewer the samples in a class, the greater its weight; $p_t$ is the model's predicted probability of the true class; and $\gamma$ is the focusing parameter (2.0 in this embodiment), which increases the weight of hard-to-classify samples.
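A minimal sketch of a class-balanced focal loss for binary labels, using the effective-number weighting with beta = 0.999 and focusing parameter gamma = 2.0 as stated above. Averaging over samples and the clamping constant are assumptions, and the function name is hypothetical.

```python
import math

def class_balanced_focal_loss(probs, labels, class_counts, beta=0.999, gamma=2.0):
    """Mean class-balanced focal loss.

    probs: predicted probability of the positive (binding) class per sample.
    labels: 0 or 1 per sample.
    class_counts: [n_negative, n_positive] counted on the training set.
    """
    # Weight per class from the effective number of samples.
    weights = [(1.0 - beta) / (1.0 - beta ** n) for n in class_counts]
    total = 0.0
    for p, y in zip(probs, labels):
        p_t = p if y == 1 else 1.0 - p            # probability of the true class
        p_t = min(max(p_t, 1e-12), 1.0)           # avoid log(0)
        total += -weights[y] * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)                     # assumed mean reduction
```

With far fewer binding residues than non-binding residues, the positive class receives the larger weight, and the `(1 - p_t) ** gamma` factor shrinks the contribution of samples that are already classified confidently.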
[0040] Triplet center loss ($\mathcal{L}_{TCL}$): $\mathcal{L}_{TCL} = \sum_{i=1}^{M} \max\big(D(z_i, c_{y_i}) + m - \min_{j \neq y_i} D(z_i, c_j),\ 0\big)$; where M is the number of samples in a batch; $z_i$ is the feature embedding vector of the i-th sample at the penultimate layer; $y_i$ is the true class label of the i-th sample; $c_{y_i}$ is the feature center vector of class $y_i$ (a learnable parameter); $c_j$ are the feature center vectors of the non-true (i.e., incorrect) classes; $D(\cdot,\cdot)$ is the squared Euclidean distance between two vectors; and m is the margin, which forces samples to move closer to the center of their own class and further away from the centers of other classes.
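The triplet center loss can be sketched directly from the definitions above. Class centers are passed in as fixed lists here, whereas in training they would be learnable parameters; the margin value is not given in the text, so it is a required argument, and the function names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_center_loss(features, labels, centers, margin):
    """Sum over the batch of hinge terms comparing the distance to the true
    class center with the distance to the nearest other class center."""
    total = 0.0
    for z, y in zip(features, labels):
        d_true = squared_distance(z, centers[y])
        d_other = min(squared_distance(z, c)
                      for k, c in enumerate(centers) if k != y)
        total += max(d_true + margin - d_other, 0.0)
    return total
```

A sample contributes nothing once it is closer to its own center than to any other center by at least the margin.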
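[0041] Evaluation indicators: The performance of the prediction model is evaluated with threshold-based metrics computed from the counts TP, FP, FN, and TN (including the F1 score and the Matthews correlation coefficient), together with the following ranking metrics. AUC-ROC: the area under the receiver operating characteristic curve, which measures the overall classification ability of the model.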
[0042] AUPR: Area Under the Precision-Recall Curve, focuses on the accuracy of positive example predictions, and is especially suitable for imbalanced datasets.
[0043] TP, FP, FN, and TN represent true positive, false positive, false negative, and true negative, respectively.
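For reference, the standard definitions of precision, recall, F1 score and Matthews correlation coefficient (MCC) in terms of these counts can be sketched as follows. These are the textbook formulas, shown as an illustration; the function name and the zero-division convention are assumptions.

```python
import math

def confusion_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}
```

MCC uses all four counts, which makes it more informative than accuracy when binding residues are a small minority.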
[0044] Comparative experiment on the prediction performance of protein-nucleic acid binding sites: Mainstream methods such as GraphBind (structural + evolutionary features), EquiPNAS (equivariant structure + MSA features), and GraphSite (structural features) were selected as comparative examples. Evaluation metrics included the area under the receiver operating characteristic curve (ROC-AUC), the area under the precision-recall curve (PR-AUC), the F1 score, and the Matthews correlation coefficient (MCC).
[0045] Table 1 shows the performance comparison of each model on the experimentally determined crystal structure datasets. The data in Table 1 show that this embodiment (EquiSite) outperforms the comparative examples on all metrics. In particular, on PR-AUC, a key metric under class imbalance, EquiSite improves on the best comparative example by 2.0% and 2.6% on DNA_129 and DNA_181, respectively. This demonstrates that the proposed "backbone Euler angle" and "side-chain torsion angle" features effectively capture subtle geometric patterns at the binding interface, thereby achieving more accurate identification without relying on multiple sequence alignment (MSA).
[0046] To verify the model's applicability on non-experimental structures, it was tested on structures predicted by AlphaFold2: DNA_129_AF2: EquiSite (the present invention) has a ROC-AUC of 0.949 and a PR-AUC of 0.581, which is superior to EquiPNAS (ROC-AUC 0.940).
[0047] RNA_117_AF2: EquiSite (this invention) showed a leading ROC-AUC of 0.891 and an MCC of 0.372. The conclusion indicates that the SE(3) isovariant network architecture of this invention is highly robust to local noise introduced by structure prediction and is suitable for large-scale proteomics structure screening.
[0048] Statistical significance and robustness verification by resampling: to rule out effects of randomness in the dataset partitioning, a subsampling strategy was used for statistical validation. From each test set, 70% of the samples were randomly selected, and the experiment was repeated 10 times. Paired t-tests or Wilcoxon signed-rank tests were then performed, with the following results:
Figure 2
All P-values were far below 0.001, which provides strong statistical evidence that the performance improvement of the present invention is not accidental but stems from the substantive improvement of the technical solution.
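The subsampling procedure can be sketched as follows, assuming per-sample labels and scores from two methods are available as plain lists. The helper names `roc_auc` and `subsample_paired_t`, the seed, and the synthetic data are hypothetical. The sketch returns only the paired t statistic; a P-value would additionally need the t distribution with `repeats - 1` degrees of freedom (for example via `scipy.stats.ttest_rel`, or `scipy.stats.wilcoxon` for the signed-rank alternative).

```python
import math
import random
import statistics


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def subsample_paired_t(labels, scores_a, scores_b, metric,
                       frac=0.7, repeats=10, seed=0):
    """Score two methods on identical random subsamples.

    Returns the per-repeat metric differences (a - b) and the paired
    t statistic of those differences.
    """
    rng = random.Random(seed)
    n = len(labels)
    k = max(1, round(frac * n))
    diffs = []
    for _ in range(repeats):
        idx = rng.sample(range(n), k)  # same subsample for both methods
        y = [labels[i] for i in idx]
        a = metric(y, [scores_a[i] for i in idx])
        b = metric(y, [scores_b[i] for i in idx])
        diffs.append(a - b)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        t_stat = 0.0 if mean == 0 else math.copysign(math.inf, mean)
    else:
        t_stat = mean / (sd / math.sqrt(repeats))
    return diffs, t_stat


# Synthetic illustration only: method A separates the classes, method B does not.
labels = [i % 2 for i in range(100)]
scores_a = [0.9 if y else 0.1 for y in labels]
scores_b = [(i * 37 % 100) / 100 for i in range(100)]
diffs, t_stat = subsample_paired_t(labels, scores_a, scores_b, roc_auc)
print(len(diffs), t_stat > 0)
```

Pairing matters here: both methods are scored on the same subsample in every repeat, so the test compares per-repeat differences rather than two independent sets of scores.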
[0049] Verification of generalization to metal ions and small molecules: to demonstrate the generality of the geometric learning framework proposed in this invention, it was applied to predicting the binding sites of metal ions (…), ATP, and HEME, and compared with the general ligand prediction methods GraphBind and DELIA. The comparison results are shown in Table 2.
[0050] Table 2
The results show that the present invention achieved the best ROC-AUC scores on five different types of non-nucleic acid ligands. This indicates that the "geometric edge features" and "equivariant attention mechanism" constructed in this invention capture the physical nature of intermolecular interactions (such as spatial complementarity and electrostatic matching) rather than only patterns specific to nucleic acid sequences, and the invention therefore has broad application prospects.
[0051] Ablation Study of Key Components: To quantify the contribution of each technical feature, an item-by-item removal experiment was conducted on the DNA_129 dataset.
[0052] Complete model: PR-AUC = 0.602. Removing the ESM-2 embeddings: PR-AUC drops to 0.500 (a relative decrease of 16.9%). This demonstrates that the sequence context provided by the pre-trained language model is fundamental.
[0053] Removing the geometric edge features (without Euler angles and spherical harmonic encoding): PR-AUC drops to 0.587. This demonstrates that the core innovation of this invention, the "backbone Euler angles," is crucial for capturing directional interactions.
[0054] Removing the atomic-level features (without side-chain torsion angles): PR-AUC drops to 0.588. This demonstrates the necessity of explicit side-chain modeling.
[0055] Removing the auxiliary losses (without TCL / CBFL, i.e. the triplet center loss and the class-balanced Focal Loss): PR-AUC drops to 0.594. This demonstrates that the hybrid loss function helps optimize the distribution of the feature space.
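To put the four ablations on the same scale, the relative decrease in PR-AUC can be computed for each one. The sketch below uses only the PR-AUC values reported in the ablation study; the function name `relative_drop` and the dictionary labels are hypothetical.

```python
FULL_PR_AUC = 0.602

# PR-AUC on DNA_129 after removing each component, as reported above.
ABLATED_PR_AUC = {
    "no ESM-2 embeddings": 0.500,
    "no geometric edge features": 0.587,
    "no atomic-level features": 0.588,
    "no auxiliary losses": 0.594,
}


def relative_drop(full: float, ablated: float) -> float:
    """Relative decrease in percent with respect to the complete model."""
    return (full - ablated) / full * 100.0


for name, value in ABLATED_PR_AUC.items():
    drop = relative_drop(FULL_PR_AUC, value)
    print(f"{name}: {drop:.1f}% relative decrease")
```

On this scale, removing the ESM-2 embeddings costs 16.9%, while the three geometric and loss ablations each cost about 2.5%, 2.3%, and 1.3%, respectively.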
[0056] Downstream molecular docking application: the EquiSite predictions were converted into ambiguous interaction restraints (AIRs) to guide HADDOCK3 docking. Detailed structural quality indicators are shown in Table 3.
Table 3
The results show that, in docking the RNA complex 6XH2, the model guided by this invention reached the high-quality standard (DockQ > 0.8), with an interface root mean square deviation (iRMSD) of only 1.07 Å, close to the native structure. In contrast, docking guided by EquiPNAS failed because of prediction bias (DockQ = 0.218). This demonstrates that the predictions of this invention are more physically realistic in three-dimensional space and can effectively solve practical structural biology modeling problems.
[0057] Performance test: to verify the computational efficiency of this invention, tests were conducted on a single NVIDIA RTX 3090 GPU. The end-to-end inference time (including feature extraction) for proteins of different lengths was within 5 seconds, as shown in the results:
Figure 3
This is much faster than traditional evolutionary methods, which require tens of minutes for the MSA search, and it greatly improves the efficiency of large-scale screening.
[0058] This embodiment also provides a protein-nucleic acid binding site prediction system based on equivariant geometric graph learning, comprising a data processing module, a three-dimensional geometric graph construction module, a probability output module, and a result prediction module. The data processing module obtains the three-dimensional structural coordinates of the target protein and performs standardized preprocessing on them. The three-dimensional geometric graph construction module constructs, from the preprocessed structural coordinates, a three-dimensional geometric graph with amino acid residues as nodes and spatial neighborhood relationships as edges; the nodes carry multi-scale geometric features, and the edges carry direction-sensitive geometric features. The probability output module inputs the three-dimensional geometric graph into the three-dimensional equivariant geometric graph neural model for feature learning and message passing, and outputs the probability that each residue is a nucleic acid binding site. The result prediction module obtains the set of binding site residues from the probabilities and outputs the prediction result.
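The non-learned stages of these modules can be sketched as follows, assuming one representative coordinate per residue in an `(N, 3)` NumPy array. The standardization follows the centroid and principal-axis alignment recited in claim 2. The function names, the 2.0 cutoff in the usage example, and the 0.5 decision threshold are hypothetical, and the equivariant network of the probability output module is deliberately left out.

```python
import numpy as np


def standardize(coords: np.ndarray) -> np.ndarray:
    """Move the centroid to the origin and align principal axes with x, y, z."""
    centered = coords - coords.mean(axis=0)
    # Eigenvectors of the 3x3 scatter matrix are the principal axes;
    # eigh returns them in ascending order of eigenvalue.
    _, axes = np.linalg.eigh(centered.T @ centered)
    axes = axes[:, ::-1].copy()        # largest variance first
    if np.linalg.det(axes) < 0:        # keep a proper rotation, no reflection
        axes[:, -1] *= -1
    return centered @ axes


def build_edges(coords: np.ndarray, cutoff: float) -> list[tuple[int, int]]:
    """Connect residue pairs whose distance does not exceed the cutoff."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    mask = (dist <= cutoff) & ~np.eye(len(coords), dtype=bool)
    src, dst = np.nonzero(mask)
    return list(zip(src.tolist(), dst.tolist()))


def select_sites(probs, threshold: float = 0.5) -> list[int]:
    """Return indices of residues whose probability reaches the threshold."""
    return [i for i, p in enumerate(probs) if p >= threshold]


# Toy usage with three residues on a line; the probabilities are placeholders
# standing in for the output of the neural model.
toy = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
print(build_edges(standardize(toy), cutoff=2.0))
print(select_sites([0.1, 0.7, 0.5]))
```

Because the standardization is a rigid motion, pairwise distances and therefore the edge set are unchanged by it; it only fixes the reference frame in which the coordinates are expressed.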
[0059] The above are merely preferred embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A method for predicting protein-nucleic acid binding sites based on equivariant geometric graph learning, characterized in that it comprises: obtaining the three-dimensional structural coordinates of a target protein, and performing standardized preprocessing on the three-dimensional structural coordinates; constructing, based on the preprocessed structural coordinates, a three-dimensional geometric graph with amino acid residues as nodes and spatial neighborhood relationships as edges, wherein the nodes contain multi-scale geometric features and the edges contain direction-sensitive geometric features; inputting the three-dimensional geometric graph into a three-dimensional equivariant geometric graph neural model for feature learning and message passing, and outputting the probability that each residue is a nucleic acid binding site; and obtaining the set of binding site residues based on the probability, and outputting the prediction result.
2. The protein-nucleic acid binding site prediction method based on equivariant geometric graph learning according to claim 1, characterized in that the standardized preprocessing of the three-dimensional structural coordinates comprises: translating and rotating the protein structure so that the centroid of the structure is located at the origin of the coordinate system and the principal component axes are aligned with the coordinate axes.
3. The protein-nucleic acid binding site prediction method based on equivariant geometric graph learning according to claim 1, characterized in that constructing the three-dimensional geometric graph with amino acid residues as nodes comprises: using atomic coordinates of the residues as the node coordinates, and establishing an edge between two residues when the distance between their atoms does not exceed a preset threshold.
4. The protein-nucleic acid binding site prediction method based on equivariant geometric graph learning according to claim 1, characterized in that the multi-scale geometric features comprise amino acid-level features, main chain-level features, and side chain-level features; the amino acid-level features comprise a one-hot encoding of the residue type and sequence embeddings generated by a pre-trained protein language model; the main chain-level features comprise sine and cosine encodings of the main chain dihedral angles; and the side chain-level features comprise sine and cosine encodings of the side chain torsion angles.
5. The protein-nucleic acid binding site prediction method based on equivariant geometric graph learning according to claim 1, characterized in that the direction-sensitive geometric features comprise: distance features formed by expanding the inter-residue distance with radial basis functions, orientation features formed by expanding the normalized displacement vector with spherical harmonic functions, and Euler angle features describing the relative orientation between the local backbone coordinate frames of the residues.
6. The protein-nucleic acid binding site prediction method based on equivariant geometric graph learning according to claim 5, characterized in that the Euler angle features are encoded by the tensor product of the zeroth-order spherical harmonic function and the radial basis function, forming a rotationally equivariant edge feature representation [formula not reproduced in this text]; wherein the terms of the formula denote, respectively, the rotationally equivariant edge feature representation, the radial basis function, the distance vector, the spherical harmonic function, and the Euler angles; i and j denote the indices of neighboring nodes in the residue graph, and k denotes the basis function index of the radial basis function or spherical harmonic function expansion.
7. The protein-nucleic acid binding site prediction method based on equivariant geometric graph learning according to claim 6, characterized in that the three-dimensional equivariant geometric graph neural model adopts an edge-aware graph attention mechanism, which fuses the geometric features of the edges with the geometric features of the nodes and introduces geometric structure constraints through a depth-wise tensor product for geometric modulation [formula not reproduced in this text]; wherein the geometrically modulated feature is obtained with the depth-wise tensor product: the feature vector of node i and the feature vector of node j are fused through linear transformation matrices and combined with the edge geometric features, and the direction vector is represented by spherical harmonic functions; based on the modulated features, attention coefficients and value vectors are calculated, with the attention coefficients normalized by Softmax [formula not reproduced in this text]; wherein separate linear projection matrices are used to calculate the attention coefficient and the message value, the attention coefficient is the normalized one, and the value vector is the weighted message vector passed from one node to the other; the attention coefficients and value vectors are aggregated to update the node feature representations; and the updated node features are input into a classification layer, where the probability that each residue is a nucleic acid binding site is calculated with the Sigmoid function.
8. The protein-nucleic acid binding site prediction method based on equivariant geometric graph learning according to claim 6, characterized in that the three-dimensional equivariant geometric graph neural model is trained using a hybrid loss function comprising a class-balanced Focal Loss and a triplet center loss; the class-balanced Focal Loss adjusts the class weights according to the effective number of samples in each class, and the triplet center loss enhances feature separability by optimizing the distances between sample features and class centers.
9. A protein-nucleic acid binding site prediction system based on equivariant geometric graph learning, used to implement the method according to any one of claims 1-8, characterized in that it comprises: a data processing module, a three-dimensional geometric graph construction module, a probability output module, and a result prediction module; the data processing module is used to obtain the three-dimensional structural coordinates of the target protein and perform standardized preprocessing on the three-dimensional structural coordinates; the three-dimensional geometric graph construction module is used to construct, based on the preprocessed structural coordinates, a three-dimensional geometric graph with amino acid residues as nodes and spatial neighborhood relationships as edges, wherein the nodes contain multi-scale geometric features and the edges contain direction-sensitive geometric features; the probability output module is used to input the three-dimensional geometric graph into the three-dimensional equivariant geometric graph neural model for feature learning and message passing, and to output the probability that each residue is a nucleic acid binding site; and the result prediction module is used to obtain the set of binding site residues based on the probability and output the prediction result.
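For illustration only, the class-balanced Focal Loss recited in claim 8 can be sketched with the commonly published formulation, in which each class is weighted by the inverse of its effective number of samples, (1 - beta^n) / (1 - beta). The values of `beta` and `gamma`, the weight normalization, and the function names are assumptions of this sketch and are not specified in the claims. The triplet center loss and the way the two terms are combined are not shown.

```python
import math


def class_balanced_weights(counts, beta=0.999):
    """Weight each class by the inverse of its effective number of samples."""
    raw = [(1.0 - beta) / (1.0 - beta ** n) for n in counts]
    scale = len(counts) / sum(raw)   # normalize so the weights sum to the class count
    return [w * scale for w in raw]


def cb_focal_loss(probs, labels, counts, beta=0.999, gamma=2.0, eps=1e-7):
    """Mean class-balanced focal loss for binary per-residue predictions.

    probs  : predicted probability of the positive (binding) class
    labels : 0 (non-binding) or 1 (binding)
    counts : [number of negatives, number of positives] in the training set
    """
    weights = class_balanced_weights(counts, beta)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
        p_t = p if y == 1 else 1.0 - p           # probability of the true class
        # The (1 - p_t)^gamma factor down-weights easy, well-classified residues.
        total += -weights[y] * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(labels)


# Hypothetical class counts: binding residues are the minority class.
counts = [9000, 1000]
print(class_balanced_weights(counts))
print(cb_focal_loss([0.9, 0.2, 0.1], [1, 1, 0], counts))
```

With these counts the minority (binding) class receives the larger weight, so errors on binding residues contribute more to the loss than errors on the abundant non-binding residues.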