Protein binding site prediction method and device based on dynamic mask graph convolution

By introducing a residue contact matrix-guided attention mechanism and similarity contrastive learning into protein binding site prediction based on dynamic mask graph convolution, the invention addresses the difficulty existing technologies have in capturing non-local associations and in autonomously mining common patterns, thereby achieving more efficient and accurate binding site prediction.

CN122290701A · Pending · Publication Date: 2026-06-26 · UNIV OF ELECTRONICS SCI & TECH OF CHINA

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: UNIV OF ELECTRONICS SCI & TECH OF CHINA
Filing Date: 2026-04-02
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing technologies struggle to effectively capture non-local associations between residues in protein binding site prediction, cannot accurately identify discontinuous binding sites, and are unable to autonomously uncover common patterns among different protein binding sites. This results in a significant decrease in prediction performance when faced with novel ligands or unseen protein structures.

Method used

A dynamic mask graph convolution method is adopted. An attention mechanism guided by the residue contact matrix models three-dimensional spatial constraints in the sequence encoding stage. Combined with an embedding enhancement module pre-trained by binding site similarity contrastive learning, multiple rounds of dynamic mask graph convolution iteratively update the residue node features, reducing noise interference and enhancing the transmission of key signals.

Benefits of technology

It improves the model's ability to perceive discontinuous binding sites, enhances its ability to generalize predictions of unknown proteins or ligands, optimizes the information propagation efficiency of graph structures, and improves the accuracy and robustness of binding site prediction.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention provides a method and apparatus for predicting protein binding sites based on dynamic masked graph convolution, belonging to the field of neural network learning technology. The method includes generating an initial embedding representation of residues based on amino acid sequences; inputting the initial embedding representation of residues into a pre-trained embedding enhancement module to generate enhanced embedding representations of residues; constructing a residue graph based on the three-dimensional structural information of the protein to be predicted, and using the enhanced embedding representations of residues as initial features of each residue node in the residue graph; inputting the residue graph into a trained protein binding site prediction model to obtain final features after multiple iterations; and generating the predicted probability of each residue being a binding site based on the final features. This invention optimizes the information propagation efficiency of the graph structure through a multi-round dynamic masked graph convolution mechanism, thereby improving the overall accuracy and robustness of binding site prediction.

Description

Technical Field

[0001] This invention relates to the field of neural network learning technology, and more specifically, to a method and apparatus for predicting protein binding sites based on dynamic mask graph convolution.

Background Technology

[0002] Protein binding site prediction is a core component of drug development, functional annotation, and molecular mechanism research. Its goal is to identify key residues in protein sequences or structures that interact with ligands. With the explosive growth of protein sequence and structural data, deep learning-based prediction methods have become mainstream. Traditional protein language models rely solely on the continuity of amino acid sequences during the feature encoding stage, and their attention mechanisms focus on local interactions between adjacent residues. They therefore fail to capture non-local associations between residues effectively and have difficulty identifying discontinuous binding sites accurately. Furthermore, existing methods struggle to autonomously uncover common patterns among different protein binding sites, and their prediction performance declines significantly when faced with novel ligands or unseen protein structures.

Summary of the Invention

[0003] The purpose of this invention is to provide a method and apparatus for predicting protein binding sites based on dynamic mask graph convolution, in order to address the aforementioned problems. To achieve this objective, the technical solution adopted by this invention is as follows. In a first aspect, this application provides a method for predicting protein binding sites based on dynamic mask graph convolution, including: obtaining the amino acid sequence of the protein to be predicted, and generating an initial embedding representation of the residues based on the amino acid sequence; inputting the initial embedding representation of the residues into a pre-trained embedding enhancement module to generate an enhanced embedding representation of the residues, wherein the embedding enhancement module is pre-trained through protein binding site similarity contrastive learning; constructing a residue graph based on the three-dimensional structural information of the protein to be predicted, and using the enhanced embedding representation of the residues as the initial feature of each residue node in the residue graph; inputting the residue graph into the trained protein binding site prediction model and performing multiple rounds of dynamic mask graph convolution to iteratively update the features of each residue node in the residue graph, so as to obtain the final features after multiple rounds of iteration; and generating, based on the final features, the predicted probability of each residue being a binding site.

[0004] In a second aspect, this application also provides a protein binding site prediction device based on dynamic mask graph convolution, comprising: a sequence acquisition module, used to obtain the amino acid sequence of the protein to be predicted; an embedding generation module, used to generate an initial embedding representation of the residues based on the amino acid sequence; an embedding enhancement module, used to input the initial embedding representation of the residues into a pre-trained embedding enhancement module to generate an enhanced embedding representation of the residues, wherein the embedding enhancement module is pre-trained through protein binding site similarity contrastive learning; a graph construction module, used to construct a residue graph based on the three-dimensional structural information of the protein to be predicted, and to use the enhanced embedding representation of the residues as the initial feature of each residue node in the residue graph; a graph convolution module, used to input the residue graph into the trained protein binding site prediction model, perform multiple rounds of dynamic mask graph convolution, iteratively update the features of each residue node in the residue graph, and obtain the final features after multiple rounds of iteration; and an output module, used to generate the predicted probability that each residue is a binding site based on the final features.

[0005] In a third aspect, this application also provides a protein binding site prediction device based on dynamic mask graph convolution, comprising: a memory, used to store a computer program; and a processor, configured to implement the steps of the protein binding site prediction method based on dynamic mask graph convolution when executing the computer program.

[0006] In a fourth aspect, this application also provides a readable storage medium storing a computer program which, when executed by a processor, implements the steps of the protein binding site prediction method based on dynamic mask graph convolution described above.

[0007] The beneficial effects of this invention are as follows: This invention improves the model's ability to perceive discontinuous binding sites by introducing a residue contact matrix-guided attention mechanism to model three-dimensional spatial constraints during the sequence encoding stage. Through pre-training using binding site similarity comparison learning, the model autonomously mines common chemical and spatial features of different binding sites, significantly enhancing its generalization prediction ability for unknown proteins or ligands. Finally, a multi-round dynamic masked graph convolution mechanism adaptively weakens noise interference from non-binding sites and strengthens the transmission of key signals, optimizing the information propagation efficiency of the graph structure, thereby improving the overall accuracy and robustness of binding site prediction. This provides more efficient and accurate binding site prediction support for structure-based drug design.

[0008] Other features and advantages of the invention will be set forth in the following description, and will be apparent in part from the description, or may be learned by practicing embodiments of the invention. The objects and other advantages of the invention may be realized and obtained by means of the structures particularly pointed out in the written description, claims, and drawings.

Attached Figure Description

[0009] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of the present invention and should not be regarded as a limitation on the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0010] Figure 1 is a schematic diagram of the protein binding site prediction method based on dynamic mask graph convolution described in an embodiment of the present invention.
Figure 2 is a framework diagram of ligand binding site prediction based on dynamic mask graph convolution described in an embodiment of the present invention.
Figure 3 is a schematic diagram of the similarity contrastive learning training process described in an embodiment of the present invention.
Figure 4 is a schematic diagram of dynamic mask graph convolution described in an embodiment of the present invention.
Figure 5 is a schematic diagram of the protein binding site prediction device based on dynamic mask graph convolution described in an embodiment of the present invention.
Figure 6 is a schematic diagram of the protein binding site prediction device based on dynamic mask graph convolution described in an embodiment of the present invention.

[0011] Reference numerals in the figures: 800, protein binding site prediction device based on dynamic mask graph convolution; 801, processor; 802, memory; 803, multimedia component; 804, I/O interface; 805, communication component.

Detailed Implementation

[0012] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The components of the embodiments of the present invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.

[0013] It should be noted that similar reference numerals and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. Furthermore, in the description of this invention, terms such as "first," "second," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.

[0014] Example 1: This embodiment provides a protein binding site prediction method based on dynamic mask graph convolution. The method is intended for the early stage of drug design, where, for protein targets with known or predictable three-dimensional structures, the key residues that interact with ligands must be identified quickly from the protein sequence.

[0015] Referring to Figure 1 and Figure 2, this method includes: S1. Obtain the amino acid sequence of the protein to be predicted, and generate an initial embedding representation of the residues based on the amino acid sequence. Specifically, taking the protein to be predicted as an example, its amino acid sequence is represented as $S = (a_1, a_2, \dots, a_L)$, where $a_i$ denotes the $i$-th amino acid in the sequence and $L$ is the length of the amino acid sequence, that is, the number of protein residues.

[0016] Furthermore, the amino acid sequence $S$ of the protein to be predicted is input into a protein language model to obtain the initial embedding representation $X \in \mathbb{R}^{L \times d}$, where $\mathbb{R}$ denotes the real number field, $L$ the length of the amino acid sequence, and $d$ the feature dimension. The protein language model is based on the T5 architecture and is specifically designed for protein sequence representation learning.

[0017] Based on the above embodiments, this method further includes: S2. Input the initial embedding representation of the residues into a pre-trained embedding enhancement module to generate an enhanced embedding representation of the residues. The embedding enhancement module is pre-trained through protein binding site similarity contrastive learning. Specifically, step S2 includes: S21. After performing a linear transformation on the initial embedding representation of the residues, calculate multiple raw attention weights between residues through $N_h$ attention head units. First, a linear transformation is applied to the initial embedding $X$ to generate the query, key, and value matrices: $Q_n = X W_n^{Q}$, $K_n = X W_n^{K}$, $V_n = X W_n^{V}$. In the formula, $n$ indexes the $n$-th attention head unit, $Q_n$ is the query matrix, $X$ the initial embedding, $K_n$ the key matrix, and $V_n$ the value matrix; $W_n^{Q}$, $W_n^{K}$, and $W_n^{V}$ are all learnable parameter matrices of dimension $d \times d_k$, where $d$ is the feature dimension and $d_k$ is the dimension of the protein residue features after mapping through the learnable parameter matrices.

[0018] Here, the query matrix $Q_n$ characterizes the features that the current residue needs to query, the key matrix $K_n$ provides the response features used for matching against other residues, and the value matrix $V_n$ stores the specific feature information to be aggregated.

[0019] Furthermore, the raw attention weights are generated from the query matrix and the key matrix: $\alpha_n = \dfrac{Q_n K_n^{\top}}{\sqrt{d_k}}$. In the formula, $\alpha_n$ denotes the raw attention weights of the $n$-th attention head unit, $Q_n$ the query matrix, $K_n^{\top}$ the transpose of the key matrix, and $d_k$ the feature dimension.

[0020] In this embodiment, the inner product of $Q_n$ and $K_n$ measures the degree of feature matching between residue pairs. The inner product is divided by $\sqrt{d_k}$ to prevent an excessively large $d_k$ from inflating the inner product and thereby destabilizing the gradient of the softmax function, which ensures a reasonable distribution of attention weights.

[0021] It should be noted that the calculation method for the original attention weights of each attention unit is the same, and will not be repeated here.

[0022] S22. Define the residue contact matrix $C$ and use it to spatially constrain the raw attention weights, obtaining the corresponding spatially constrained attention weights. In this embodiment, the residue contact matrix $C$ is defined based on the three-dimensional structural information of the protein: when two residues $i$ and $j$ in the protein are in spatial contact, the corresponding matrix element $C_{ij}$ is 1; otherwise it is 0.
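
As a minimal sketch of how such a contact matrix could be built, the Python below marks two residues as being in contact when the distance between their coordinates falls within a cutoff. The text does not state the contact criterion, so the use of one representative coordinate per residue and the 8.0 Å cutoff are assumptions for illustration, and `contact_matrix` is a hypothetical helper name.

```python
import math

def contact_matrix(coords, cutoff=8.0):
    """Return an L x L 0/1 matrix; entry [i][j] is 1 when residues i and j are in contact.

    coords: one (x, y, z) tuple per residue (assumed: a representative atom such as C-alpha).
    cutoff: distance threshold in angstroms (assumed value, not given in the text).
    A residue is trivially within the cutoff of itself, so the diagonal is 1 in this sketch.
    """
    n = len(coords)
    return [[1 if math.dist(coords[i], coords[j]) <= cutoff else 0 for j in range(n)]
            for i in range(n)]

# Toy chain: residues 0 and 3 are separated in sequence but close in space,
# while residues 0 and 2 are farther apart than the cutoff.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 3.8, 0.0), (3.8, 5.0, 0.0)]
C = contact_matrix(coords)
```

A distance cutoff yields a symmetric matrix, which matches the undirected notion of "spatial contact" used in the text.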

[0023] Specifically, the spatially constrained attention weights are calculated as follows: $\tilde{\alpha}_n = \alpha_n \odot W^{C} \odot C$. In the formula, $\tilde{\alpha}_n$ denotes the spatially constrained attention weights, $\alpha_n$ the raw attention weights, $\odot$ element-wise multiplication, $W^{C}$ a learnable parameter matrix, and $C$ the residue contact matrix; the learnable parameter matrix $W^{C}$ dynamically adjusts how strongly the contact information influences the weights.

[0024] In this embodiment, the residue contact matrix $C$ acts as a physical prior that directly filters out residue pairs that are adjacent in sequence but spatially separated: the weights of elements with $C_{ij} = 0$ are reset to zero, so that attention propagates only between spatially adjacent residues. The learnable matrix $W^{C}$ gives the model adaptive adjustment capability, assigning reasonable attention weights to spatially contacting residue pairs, so that the attention weights both conform to the physical structure and meet the biological functional requirements.

[0025] S23. Normalize each spatially constrained attention weight and perform weighted aggregation to obtain the corresponding attention output value. Specifically, the normalization operation is as follows: $\hat{\alpha}_n = \mathrm{softmax}(\tilde{\alpha}_n)$. In the formula, $\hat{\alpha}_n$ denotes the normalized attention weights and $\tilde{\alpha}_n$ the spatially constrained attention weights. The softmax function converts the attention weights of each residue pair $(i, j)$ into a probability distribution so that the weights sum to 1, which facilitates comparison between different residues and provides a stable weight basis for subsequent feature aggregation.

[0026] Specifically, $\hat{\alpha}_n$ and $V_n$ are combined by weighted aggregation, so that residues with stronger spatial correlations receive higher weights in the aggregation and the output features focus more on functional interaction regions. The aggregation operation is as follows: $O_n = \hat{\alpha}_n V_n$. In the formula, $O_n$ denotes the attention output, $\hat{\alpha}_n$ the normalized attention weights, and $V_n$ the value matrix.
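
Steps S21 to S23 can be sketched for a single attention head as follows. This is an illustrative reading rather than the patented implementation: the spatial constraint is assumed to be an element-wise product of the raw weights, a learnable matrix (`gate` here), and the contact matrix, and all function and parameter names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Normalize each row into a probability distribution."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def contact_guided_head(x, wq, wk, wv, gate, contacts):
    """One attention head whose pre-softmax weights are constrained by a contact matrix.

    x: L x d initial embeddings; wq, wk, wv: d x d_k projection matrices;
    gate: L x L learnable scaling matrix; contacts: L x L 0/1 contact matrix.
    """
    n = len(x)
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d_k = len(wq[0])
    k_t = [list(col) for col in zip(*k)]
    raw = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    # Assumed form of the spatial constraint: raw * gate * contacts, element-wise.
    constrained = [[raw[i][j] * gate[i][j] * contacts[i][j] for j in range(n)]
                   for i in range(n)]
    weights = softmax_rows(constrained)
    return weights, matmul(weights, v)

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
gate = [[1.0] * 3 for _ in range(3)]
contacts = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
weights, head_out = contact_guided_head(x, identity, identity, identity, gate, contacts)
```

In this sketch a non-contacting pair receives a pre-softmax score of 0, which still maps to a nonzero weight after normalization; implementations that need a hard mask commonly substitute a large negative score for such pairs instead.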

[0027] S24. Concatenate the multiple attention output values into an attention matrix, and apply a linear transformation to the attention matrix to obtain the enhanced embedding representation of the protein: $O = \mathrm{Concat}(O_1, O_2, \dots, O_{N_h})$, $Z = O W^{O}$. In the formula, $O$ denotes the attention matrix, $O_n$ the attention output of the $n$-th attention unit, $Z$ the enhanced embedding representation of the protein, and $W^{O} \in \mathbb{R}^{(N_h \cdot d_k) \times d}$ the learnable parameter matrix, where $\mathbb{R}$ is the real number field, $N_h$ the number of attention units, $d_k$ the projection dimension, and $d$ the dimension of the protein residue features.
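
The concatenation and projection in step S24 can be illustrated with a short sketch. The helper names and the toy numbers are hypothetical; the only structure taken from the text is that the per-head outputs are joined column-wise and then multiplied by a learnable projection matrix.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def concat_heads(head_outputs):
    """Join per-head outputs (each L x d_k) column-wise into an L x (heads * d_k) matrix."""
    length = len(head_outputs[0])
    return [[v for head in head_outputs for v in head[i]] for i in range(length)]

def enhanced_embedding(head_outputs, w_o):
    """Project the concatenated attention matrix with w_o of shape (heads * d_k) x d."""
    return matmul(concat_heads(head_outputs), w_o)

# Two heads with d_k = 1 for a protein of two residues.
heads = [[[1.0], [2.0]], [[3.0], [4.0]]]
w_o = [[1.0, 1.0], [0.0, 2.0]]
Z = enhanced_embedding(heads, w_o)
```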

[0028] In this embodiment, the enhanced embedding representation $Z$ of the protein not only preserves the sequence evolution and local structural information in the initial embedding, but also accurately encodes the functional interaction patterns of residues in the three-dimensional structure of the protein through deep integration of spatial constraints, laying a high-quality feature foundation for subsequent binding site prediction.

[0029] Specifically, as shown in Figure 3, the pre-training process of the embedding enhancement module in step S2 includes: S25. Construct positive sample pairs $(p_a, p_+)$ and negative sample pairs $(p_a, p_-)$. A positive sample pair $(p_a, p_+)$ consists of an anchor protein sample $p_a$ and a positive protein sample $p_+$; a negative sample pair $(p_a, p_-)$ consists of the anchor protein sample $p_a$ and a negative protein sample $p_-$. The positive protein sample $p_+$ has binding sites with chemical properties similar to those of the anchor protein sample $p_a$, whereas the negative protein sample $p_-$ has binding sites with chemical properties dissimilar to those of the anchor protein sample $p_a$.

[0030] S26. Extract the initial embedding representations of the binding sites of the anchor protein samples, positive protein samples, and negative protein samples, respectively. S27. Input the positive sample pairs and negative sample pairs and their corresponding initial embedding representations into the embedding enhancement module to obtain the enhanced embedding representation corresponding to each protein sample. S28. Based on the enhanced embedding representation of each protein sample, extract and pool the features of the binding sites of each protein sample to obtain the global features of the binding sites of each protein sample. Specifically, for the anchor protein sample $p_a$, the binding site index set is $B_a = \{b_1, b_2, \dots, b_m\}$, where $m$ is the total number of residues at the binding site. To integrate the geometric morphology and chemical microenvironment characteristics of each binding site, average pooling is applied to the enhanced residue embedding features corresponding to each site, calculated as follows: $g_a = \dfrac{1}{m} \sum_{j=1}^{m} z_{b_j}$. In the formula, $g_a$ denotes the global feature of the binding sites of the anchor protein sample $p_a$, $m$ the total number of residues belonging to the binding site, and $z_{b_j}$ the enhanced embedding representation of the $j$-th binding site residue. The resulting global feature representation $g_a$ of the anchor protein sample's binding sites not only covers the spatial structure information of the binding site but also integrates chemical property features, providing a comprehensive and standardized feature foundation for subsequent contrastive learning.
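
The average pooling in step S28 amounts to taking the mean of the enhanced embedding rows that belong to the binding site. A minimal sketch, with a hypothetical function name and toy values:

```python
def pool_binding_site(enhanced, site_indices):
    """Average the enhanced embeddings of the residues listed in site_indices.

    enhanced: L x d list of per-residue enhanced embeddings.
    site_indices: residue indices that belong to the binding site.
    """
    if not site_indices:
        raise ValueError("binding site index set must not be empty")
    dim = len(enhanced[0])
    return [sum(enhanced[i][c] for i in site_indices) / len(site_indices)
            for c in range(dim)]

enhanced = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
site_feature = pool_binding_site(enhanced, [0, 2, 3])
```

Because the pooled vector has the same dimension regardless of how many residues the site contains, binding sites of different sizes become directly comparable.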

[0031] S29. Construct a contrastive learning loss function based on the global features of the binding sites of each protein sample. Specifically, constructing the contrastive learning loss function includes: S291. Calculate the first similarity $D(g_a, g_+)$ between the global binding site features of the anchor protein sample and the positive protein sample, where $g_a$ is the global feature representation of the binding sites of the anchor protein sample and $g_+$ is the global feature representation of the binding sites of the positive protein sample. S292. Calculate the second similarity $D(g_a, g_-)$ between the global binding site features of the anchor protein sample and the negative protein sample, where $g_a$ is the global feature representation of the binding sites of the anchor protein sample and $g_-$ is the global feature representation of the binding sites of the negative protein sample. S293. Construct the contrastive learning loss function with the objective of maximizing the first similarity and minimizing the second similarity: $\mathcal{L}_a = y \, D(g_a, g_*)^2 + (1 - y) \, \max\bigl(0,\ \mu - D(g_a, g_*)\bigr)^2$. Here $\mathcal{L}_a$ is the contrastive learning loss function; $y$ is the sample pair label, taking the value 0 or 1; $p_*$ denotes either the positive sample $p_+$ or the negative sample $p_-$, with $y = 1$ when $p_*$ is the positive sample $p_+$ and $y = 0$ when $p_*$ is the negative sample $p_-$; $g_*$ is the global feature representation of the binding sites of $p_*$; $D(g_a, g_*)$ is the similarity measure between the features of the two protein samples (a feature distance), quantifying the consistency of the two binding sites' features; and $\mu$ is a margin parameter, typically set to 1.0, that constrains the feature distance of negative sample pairs: when the distance of a negative sample pair satisfies $D \ge \mu$, the loss term is 0; when $D < \mu$, the loss penalty forces the feature distance to increase.
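
The margin behavior described in S293 can be illustrated with a common margin-based contrastive loss. This is a sketch under assumptions: the text does not recoverably state the distance function or the exact loss form, so Euclidean distance and squared terms are used here, and `contrastive_loss` is a hypothetical name.

```python
import math

def contrastive_loss(anchor, other, label, margin=1.0):
    """Margin-based contrastive loss for one sample pair.

    anchor, other: pooled binding-site feature vectors.
    label: 1 for a positive pair, 0 for a negative pair.
    margin: distance beyond which a negative pair incurs no loss (1.0 as in the text).
    Euclidean distance and squared terms are assumptions of this sketch.
    """
    dist = math.dist(anchor, other)
    if label == 1:
        return dist ** 2                      # pull positive pairs together
    return max(0.0, margin - dist) ** 2       # push close negative pairs apart

anchor = [0.0, 0.0]
loss_pos = contrastive_loss(anchor, [0.3, 0.4], label=1)        # distance 0.5
loss_neg_far = contrastive_loss(anchor, [3.0, 4.0], label=0)    # distance 5.0
loss_neg_near = contrastive_loss(anchor, [0.3, 0.4], label=0)   # distance 0.5
```

A negative pair already separated by more than the margin contributes nothing, so training effort concentrates on negative pairs that are still too close to the anchor.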

[0032] S294. Pre-train and optimize the embedding enhancement module using the contrastive learning loss function. In this embodiment, the contrastive learning training process minimizes the loss over all sample pairs, expressed as: $\mathcal{L}_{\mathrm{total}} = \sum_{a=1}^{N} \mathcal{L}_a$. Here $\mathcal{L}_{\mathrm{total}}$ is the loss over all sample pairs, $N$ is the total number of anchor proteins in the pre-training dataset, and $\mathcal{L}_a$ is the loss of anchor protein sample $p_a$.

[0033] In this embodiment, the embedding enhancement module constructs a contrastive learning paradigm to deeply explore the common features among different protein binding sites. Through this pre-training, the model can autonomously summarize the general rules of binding sites during the learning process, reducing its dependence on specific labeled data. As a result, it can still maintain good predictive performance when faced with new protein or ligand types, laying a solid foundation for accurate identification of binding sites in the future.

[0034] Meanwhile, by optimizing the loss function, the features of binding sites with similar structures can be clustered in the vector space, while the features of binding sites with different ligands or significant structural differences are scattered, thereby effectively improving the model's ability to generalize features of unknown binding sites.

[0035] Based on the above embodiments, this method further includes: S3. Construct a residue graph based on the three-dimensional structural information of the protein to be predicted, and use the enhanced embedding representation of the residues as the initial feature of each residue node in the residue graph. Specifically, as shown in Figure 2 and Figure 4, step S3 includes: S31. Treat each residue as a node to form a node set $V = \{v_1, v_2, \dots, v_L\}$, where $v_i$ corresponds to the $i$-th residue of the protein and $L$ is the number of residues. The enhanced embedding representation $Z$ of the residues is assigned to the corresponding nodes as the initial feature $h_i^{(0)} = Z_i$, where $Z_i$ is the $i$-th row of the enhanced embedding representation $Z$, i.e., the enhanced embedding feature of the $i$-th protein residue. The initial node feature matrix of the node set is therefore $H^{(0)} = Z \in \mathbb{R}^{L \times d}$. In the formula, $H^{(0)}$ denotes the initial node feature matrix, $h_i^{(0)}$ the initial features, $Z$ the enhanced embedding representation, $\mathbb{R}$ the real number field, $L$ the number of residues, and $d$ the feature dimension of the enhanced embedding representation.

[0036] S32. For each element of the residue contact matrix with a value of 1, take the two corresponding residues $i$ and $j$ and establish an edge $e_{ij}$ between the nodes of these two residues. The edges define the corresponding initial adjacency matrix $A \in \mathbb{R}^{L \times L}$, which quantifies the connection relationships between nodes, where $A_{ij}$ denotes an element of the adjacency matrix, $\mathbb{R}$ the real number field, and $L$ the number of residues.

[0037] S33. Construct a graph structure based on the nodes and edges. In this context, the adjacency matrix of the graph structure has the same connectivity relationship as the residue contact matrix.
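
Steps S31 to S33 can be sketched as a small helper that turns the enhanced embeddings and the contact matrix into node features, an adjacency matrix, and an edge list. The function name and the choice to list each undirected edge once are illustrative assumptions.

```python
def build_residue_graph(enhanced, contacts):
    """Return (node_features, adjacency, edges) for a residue graph.

    enhanced: L x d enhanced embeddings, one row per residue (initial node features).
    contacts: L x L 0/1 residue contact matrix.
    """
    n = len(enhanced)
    if len(contacts) != n or any(len(row) != n for row in contacts):
        raise ValueError("contact matrix must be L x L")
    node_features = [list(row) for row in enhanced]
    # The adjacency matrix mirrors the connectivity of the contact matrix.
    adjacency = [[1.0 if contacts[i][j] == 1 else 0.0 for j in range(n)] for i in range(n)]
    # Each undirected edge is listed once, as (i, j) with i < j.
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if contacts[i][j] == 1]
    return node_features, adjacency, edges

enhanced = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
contacts = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features, adjacency, edges = build_residue_graph(enhanced, contacts)
```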

[0038] Based on the above embodiments, this method further includes: S4. Input the residue graph into the trained protein binding site prediction model, perform multiple rounds of dynamic mask graph convolution, iteratively update the features of each residue node in the residue graph, and obtain the final features after multiple rounds of iteration. In this embodiment, a multi-round closed-loop iterative mechanism of prediction → masking → convolution → re-prediction is adopted to dynamically optimize information transmission in the protein structure graph, gradually correct the initial prediction error, and flexibly reduce the noise interference of non-binding sites, significantly improving the accuracy of binding site prediction. Specifically, step S4 includes: S41. Based on the node features of the current residue graph, use a graph convolutional layer to predict the probability of all residues in the current round: $P^{(t)} = \sigma\bigl(A H^{(t)} W_p^{(t)}\bigr)$. Here $P^{(t)}$ is the binding probability, representing the confidence that each residue is a binding site: the closer the value is to 1, the more likely the residue is to participate in ligand binding; $A$ is the initial adjacency matrix; $H^{(t)}$ denotes the node features of round $t$; $W_p^{(t)} \in \mathbb{R}^{d_t \times 1}$ is a learnable parameter matrix responsible for mapping the high-dimensional features to the probability output space, with $d_t$ the feature dimension of round $t$; and $\sigma$ is an activation function, such as the sigmoid.

[0039] S42. Obtain the dynamic mask from the previous round, and calculate the dynamic mask for the current round based on the dynamic mask from the previous round and the prediction probability for the current round; the dynamic mask for the first round is equal to the prediction probability for the first round: $M^{(t)} = \beta P^{(t)} + (1 - \beta) M^{(t-1)}$. In the formula, $M^{(t)}$ denotes the dynamic mask of round $t$, $\beta$ is a hyperparameter that controls the weighting of the prediction probability of the current round, $P^{(t)}$ is the predicted probability of round $t$, and $M^{(t-1)}$ denotes the dynamic mask of round $t-1$, where the initial mask is $M^{(0)} = P^{(0)}$; that is, the predicted probability of the first round is used directly as the dynamic mask of the first round.
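
The mask update in step S42 can be sketched as a weighted blend of the current prediction and the previous mask. The convex-combination form and the default weight of 0.5 are assumptions of this sketch; the text only states that a hyperparameter controls how much the current round's prediction contributes and that the first-round mask equals the first-round prediction.

```python
def update_mask(prev_mask, probs, weight=0.5):
    """Blend the current round's probabilities with the previous round's mask.

    prev_mask: mask from the previous round, or None in the first round.
    probs: per-residue predicted probabilities of the current round.
    weight: share given to the current prediction (assumed default).
    """
    if prev_mask is None:
        # First round: the prediction is used directly as the mask.
        return list(probs)
    return [weight * p + (1.0 - weight) * m for p, m in zip(probs, prev_mask)]

mask_first = update_mask(None, [0.2, 0.8])
mask_second = update_mask(mask_first, [0.6, 0.4], weight=0.5)
```

Blending with the previous mask smooths round-to-round fluctuations, so a single noisy prediction does not abruptly cut a residue off from message passing.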

[0040] S43. Construct the weighted adjacency matrix for the next round based on the dynamic mask of the current round, and perform graph convolution based on the weighted adjacency matrix to update the node features. Specifically, weighted graph convolutional message propagation flexibly reduces interference from non-binding sites: a weighted adjacency matrix is constructed using the dynamic mask, so that the neighbor information propagation weight of non-binding sites decreases adaptively with the mask value, thereby suppressing noise. The weighted adjacency matrix $\tilde{A}^{(t+1)}$ of round $t+1$ is obtained by weighting the entries of the initial adjacency matrix $A$, which is constructed from the residue contact matrix, with the dynamic mask $M^{(t)}$ of the current round; the initial weighted adjacency matrix is $\tilde{A}^{(0)} = A$.

[0041] Furthermore, graph convolution is performed based on the weighted adjacency matrix $\tilde{A}^{(t+1)}$ to update the node features: $H^{(t+1)} = \sigma\bigl(\tilde{A}^{(t+1)} H^{(t)} W^{(t)} + b^{(t)}\bigr)$. In the formula, $H^{(t+1)}$ denotes the node features of round $t+1$; $\sigma$ is an activation function; $\tilde{A}^{(t+1)}$ is the weighted adjacency matrix of round $t+1$; $H^{(t)}$ denotes the node features of round $t$, which carry the residue properties and interaction information optimized in earlier rounds; $W^{(t)}$ is the learnable parameter matrix of round $t$, responsible for dimensional transformation and pattern extraction of features and for capturing high-order correlation patterns between residues; and $b^{(t)}$ is a bias term that adjusts the baseline level of the features, enhancing the model's adaptability to complex data distributions.

[0042] In this embodiment, a nonlinear transformation is introduced by an activation function to further enhance the model's ability to capture nonlinear relationships in residue interactions, making the updated node features more focused on the key biological characteristics of the binding site and effectively filtering out noise interference from nonfunctional regions.

[0043] S44. Repeat steps S41 to S43 for multiple iterations to obtain the final features $H^{(T)}$. Based on the above embodiments, this method further includes: S5. Based on the final features, generate the predicted probability that each residue is a binding site. Specifically, after $T$ rounds of iteration, the model has gradually optimized its feature representation through dynamic adjustment. At this point, the feature matrix output by the last round of graph convolution is used to predict the binding site labels: $\hat{Y} = \sigma\bigl(A H^{(T)} W_{\mathrm{out}}\bigr)$. In the formula, $\hat{Y}$ is a vector with values between 0 and 1, representing the final predicted probability that each residue in the protein is a binding site; $A$ is the adjacency matrix; $H^{(T)}$ denotes the node features of round $T$; $W_{\mathrm{out}}$ is the parameter matrix of the final prediction layer; and $\sigma$ is the activation function.
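
The predict → mask → convolve loop of steps S41 to S44 and the final prediction of step S5 can be sketched end to end as below. Several details are assumptions made only for illustration: the mask is applied by scaling the message from neighbor `j` by `mask[j]`, the feature update uses ReLU, the per-round prediction weights are shared across rounds (which requires a constant feature dimension), and all names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def predict(adj, feats, w):
    """Per-residue probability sigmoid(adj @ feats @ w), with w of shape d x 1."""
    logits = matmul(matmul(adj, feats), w)
    return [1.0 / (1.0 + math.exp(-row[0])) for row in logits]

def dynamic_mask_gcn(adj, feats, layers, w_pred, w_out, beta=0.5):
    """Run one predict -> mask -> convolve round per (W, b) pair in layers."""
    n = len(adj)
    mask = None
    for weight, bias in layers:
        probs = predict(adj, feats, w_pred)                    # S41: round prediction
        if mask is None:                                       # S42: first-round mask
            mask = probs
        else:
            mask = [beta * p + (1.0 - beta) * m for p, m in zip(probs, mask)]
        # S43 (assumed form): the message from neighbor j is scaled by mask[j].
        weighted = [[adj[i][j] * mask[j] for j in range(n)] for i in range(n)]
        agg = matmul(matmul(weighted, feats), weight)
        # ReLU is an assumed activation for the feature update.
        feats = [[max(0.0, v + bias[c]) for c, v in enumerate(row)] for row in agg]
    return predict(adj, feats, w_out), mask                    # S5: final prediction

adj = [[1.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 1.0]]
feats = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
layers = [(identity, [0.0, 0.0]), (identity, [0.0, 0.0])]
w = [[1.0], [-1.0]]
final_probs, final_mask = dynamic_mask_gcn(adj, feats, layers, w, w)
```

In a trained model the matrices in `layers`, `w_pred`, and `w_out` would be learned; the fixed values here only make the data flow easy to follow.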

[0044] Based on the above embodiments, this method also provides a training method for the protein binding site prediction model, including: S6. Train an initial prediction model using the training dataset to obtain prediction results, wherein the training dataset contains several protein samples and the ground-truth annotations of their binding sites. S7. Based on the prediction results and the ground-truth annotations, calculate the weighted cross-entropy loss, which assigns different loss weights to the binding site class and the non-binding site class. The weighted cross-entropy loss is expressed as: $\mathcal{L}_{\mathrm{wce}} = -\dfrac{1}{N} \sum_{i=1}^{N} \bigl[ w \, y_i \log \hat{y}_i + (1 - w)(1 - y_i) \log(1 - \hat{y}_i) \bigr]$. In the formula, $\mathcal{L}_{\mathrm{wce}}$ denotes the weighted cross-entropy loss, $N$ the number of residues over all proteins, $w$ the proportion of non-binding sites, used as the loss weight of positive samples, $y_i$ the true label of residue $i$ ($y_i = 1$ indicates a binding site and $y_i = 0$ a non-binding site), and $\hat{y}_i$ the predicted probability.

[0045] In this embodiment, assigning the higher weight $\alpha$ to binding sites ($y_i = 1$) and the lower weight $1-\alpha$ to non-binding sites ($y_i = 0$) offsets the model bias caused by class imbalance and improves the prediction accuracy for the minority binding site class. This is especially suitable for protein structure data in which binding sites make up a very low proportion of the residues.

[0046] Preferably, $\alpha$ is calculated as follows: $\alpha = \frac{N_{\text{neg}}}{N}$. In the formula, $\alpha$ is the proportion of non-binding sites, $N_{\text{neg}}$ is the number of non-binding site residues, and $N$ is the total number of residues in the training dataset.
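The class weighting described above can be sketched as follows. This is a minimal example under stated assumptions: the weight of the non-binding class is taken to be one minus the non-binding proportion, the loss is averaged over residues, and the function names `non_binding_proportion` and `weighted_cross_entropy` as well as the clipping constant `eps` are introduced here for illustration only.

```python
import numpy as np


def non_binding_proportion(labels):
    """Proportion of non-binding residues (label 0) among all residues."""
    labels = np.asarray(labels)
    return float(np.sum(labels == 0)) / labels.size


def weighted_cross_entropy(labels, probs, alpha, eps=1e-12):
    """Weighted cross-entropy averaged over residues.

    Binding sites (label 1) are weighted by alpha; non-binding sites
    (label 0) are weighted by 1 - alpha, which is an assumption here.
    """
    labels = np.asarray(labels, dtype=float)
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    pos_term = alpha * labels * np.log(probs)
    neg_term = (1.0 - alpha) * (1.0 - labels) * np.log(1.0 - probs)
    return float(-np.mean(pos_term + neg_term))


# One binding site among ten residues, so alpha is 0.9.
labels = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
alpha = non_binding_proportion(labels)

# A missed binding site is penalized more than an equally confident false alarm.
loss_missed_site = weighted_cross_entropy([1], [0.1], alpha)
loss_false_alarm = weighted_cross_entropy([0], [0.9], alpha)
print(alpha, loss_missed_site, loss_false_alarm)
```

With a heavily imbalanced dataset, the non-binding proportion approaches 1, so errors on the rare binding sites dominate the loss while errors on the abundant non-binding residues are down-weighted.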

[0047] S8. With the goal of minimizing the weighted cross-entropy loss, optimize the parameters in the initial prediction model until convergence, and obtain the trained protein binding site prediction model.
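To make step S8 concrete, the sketch below minimizes a weighted cross-entropy loss by plain gradient descent. It is only a stand-in: a single linear layer with a sigmoid output replaces the full prediction model, a fixed number of steps replaces a convergence check, and the learning rate, the synthetic data, and the name `weighted_loss_and_grad` are all assumptions for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def weighted_loss_and_grad(w, x, y, alpha):
    """Weighted cross-entropy of a linear-sigmoid layer and its gradient in w."""
    p = np.clip(sigmoid(x @ w), 1e-12, 1.0 - 1e-12)
    loss = -np.mean(alpha * y * np.log(p)
                    + (1.0 - alpha) * (1.0 - y) * np.log(1.0 - p))
    # Derivative of the loss with respect to each residue's logit.
    dlogit = -(alpha * y * (1.0 - p) - (1.0 - alpha) * (1.0 - y) * p) / y.size
    return float(loss), x.T @ dlogit


# Synthetic data: 3 binding residues and 17 non-binding residues.
rng = np.random.default_rng(0)
y = np.array([1.0] * 3 + [0.0] * 17)
x = rng.normal(size=(20, 3))
x[:, 0] += 2.0 * y                  # make the first feature informative
alpha = float(np.mean(y == 0))      # proportion of non-binding residues

w = np.zeros(3)
initial_loss, _ = weighted_loss_and_grad(w, x, y, alpha)
for _ in range(200):                # fixed step budget instead of a convergence test
    _, grad = weighted_loss_and_grad(w, x, y, alpha)
    w -= 0.1 * grad                 # gradient descent update
final_loss, _ = weighted_loss_and_grad(w, x, y, alpha)
print(initial_loss, final_loss)
```

In practice the parameters of the whole model would be updated by an optimizer of the implementer's choice, and training would stop once the loss no longer decreases.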

[0048] Example 2: As shown in Figure 5, this embodiment provides a protein binding site prediction device based on dynamic mask graph convolution, the device comprising: a sequence acquisition module, used to obtain the amino acid sequence of the protein to be predicted; an embedding generation module, used to generate an initial embedding representation of the residues based on the amino acid sequence; an embedding enhancement module, used to take the initial embedding representation of the residues as input and generate an enhanced embedding representation of the residues, wherein the embedding enhancement module is pre-trained through protein binding site similarity contrastive learning; a graph construction module, used to construct a residue graph based on the three-dimensional structural information of the protein to be predicted, and to use the enhanced embedding representation of the residues as the initial feature of each residue node in the residue graph; a graph convolution prediction module, used to input the residue graph into the trained protein binding site prediction model, perform multiple rounds of dynamic mask graph convolution, and iteratively update the features of each residue node in the residue graph to obtain the final features after the multiple rounds of iteration; and an output module, used to generate, based on the final features, the predicted probability that each residue is a binding site.

[0049] Based on the above embodiments, the embedding enhancement module includes: a linear transformation unit, used to perform a linear transformation on the initial embedding representation of the residues; a multi-head attention unit, used to compute multiple raw attention weights between residues based on the linearly transformed representation; a spatial constraint masking unit, used to obtain a predefined residue contact matrix and to mask each raw attention weight with the residue contact matrix to obtain the corresponding spatially constrained attention weight; an aggregation output unit, used to normalize each spatially constrained attention weight and perform weighted aggregation to obtain the corresponding attention output value; and a concatenation transformation unit, used to concatenate the multiple attention output values into an attention matrix and to apply a linear transformation to the attention matrix to obtain the enhanced embedding representation of the protein.
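The units listed above can be illustrated with the following sketch of contact-masked multi-head attention. Several choices are assumptions made for this example rather than details taken from the description: scaled dot-product scores, softmax as the normalization, masking by setting non-contacting pairs to negative infinity before normalization, treating every residue as in contact with itself, and all function and parameter names.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def contact_masked_attention(x, contact, wq, wk, wv, wo, num_heads):
    """x: (n, d) initial embeddings; contact: (n, n) 0/1 residue contact matrix."""
    n, d = x.shape
    head_dim = d // num_heads

    def split_heads(m):  # (n, d) -> (heads, n, head_dim)
        return m.reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    # Linear transformation of the initial embeddings.
    q, k, v = split_heads(x @ wq), split_heads(x @ wk), split_heads(x @ wv)
    # Raw attention weights for every head: (heads, n, n).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    # Spatial constraint mask; self-contact is assumed so no row is empty.
    allowed = (np.asarray(contact) > 0) | np.eye(n, dtype=bool)
    masked = np.where(allowed[None, :, :], scores, -np.inf)
    # Normalization and weighted aggregation.
    weights = softmax(masked, axis=-1)
    heads = weights @ v  # (heads, n, head_dim)
    # Concatenate the heads and apply the output linear transformation.
    concat = heads.transpose(1, 0, 2).reshape(n, d)
    return concat @ wo, weights


rng = np.random.default_rng(0)
n, d, num_heads = 4, 4, 2
x = rng.normal(size=(n, d))
contact = np.array([[1, 1, 0, 0],
                    [1, 1, 1, 0],
                    [0, 1, 1, 1],
                    [0, 0, 1, 1]])
wq, wk, wv, wo = (rng.normal(size=(d, d)) for _ in range(4))
enhanced, weights = contact_masked_attention(x, contact, wq, wk, wv, wo, num_heads)
print(enhanced.shape, weights.shape)
```

Because masked pairs receive exactly zero weight, a residue aggregates information only from residues it contacts in three-dimensional space, regardless of how far apart they are in the sequence.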

[0050] Based on the above embodiments, the embedding enhancement module is obtained through pre-training, and the pre-training process is executed by a pre-training unit, which includes: a sample construction subunit, used to construct positive sample pairs and negative sample pairs, wherein each positive sample pair includes an anchor protein sample and a positive protein sample, and each negative sample pair includes an anchor protein sample and a negative protein sample; an initial embedding extraction subunit, used to extract the initial embedding representations of the binding sites of the anchor protein samples, the positive protein samples, and the negative protein samples, respectively; an enhanced embedding acquisition subunit, used to input the positive sample pairs and negative sample pairs, together with their corresponding initial embedding representations, into the embedding enhancement module to obtain the enhanced embedding representation of each protein sample; a global feature generation subunit, used to extract the binding site features of each protein sample from its enhanced embedding representation and to pool them to obtain the global feature of the binding sites of each protein sample; a loss function construction subunit, used to construct a contrastive learning loss function based on the global features of the binding sites of the protein samples; and a parameter optimization subunit, used to pre-train and optimize the embedding enhancement module using the contrastive learning loss function.

[0051] Based on the above embodiments, the loss function construction subunit is specifically used to: calculate a first similarity between the global features of the binding sites of the anchor protein sample and the positive protein sample; calculate a second similarity between the global features of the binding sites of the anchor protein sample and the negative protein sample; and construct the contrastive learning loss function with the goal of maximizing the first similarity and minimizing the second similarity.
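The pooling and loss construction described above can be sketched as follows. The description does not fix the pooling operator, the similarity measure, or the exact loss form, so mean pooling, cosine similarity, a softmax-style loss with a `temperature` parameter, and all function names are assumptions chosen for this example.

```python
import numpy as np


def pool_binding_site(embeddings, site_mask):
    """Mean-pool the enhanced embeddings of the binding-site residues only."""
    return embeddings[np.asarray(site_mask, dtype=bool)].mean(axis=0)


def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def contrastive_loss(anchor, positive, negative, temperature=0.1):
    """Small when anchor/positive are similar and anchor/negative are not."""
    s_pos = cosine_similarity(anchor, positive) / temperature  # first similarity
    s_neg = cosine_similarity(anchor, negative) / temperature  # second similarity
    # Negative log-softmax of the positive pair, computed stably.
    return float(np.logaddexp(s_pos, s_neg) - s_pos)


# Global binding-site feature of a toy protein with three residues,
# of which the first two are binding sites.
embeddings = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 5.0]])
global_feature = pool_binding_site(embeddings, [1, 1, 0])

anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])
negative = np.array([-1.0, 0.2])
loss_good = contrastive_loss(anchor, positive, negative)
loss_swapped = contrastive_loss(anchor, negative, positive)
print(global_feature, loss_good, loss_swapped)
```

Minimizing such a loss pulls the global features of similar binding sites together and pushes dissimilar ones apart, which is the stated objective of the pre-training.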

[0052] Based on the above embodiments, the graph construction module includes: a node initialization unit, used to treat each residue as a node to form a node set, and to assign the enhanced embedding representation of each residue to the corresponding node as its initial feature; an edge construction unit, used to obtain the two residues corresponding to each element with a value of 1 in the residue contact matrix, and to establish an edge between the nodes corresponding to the two residues; and a graph generation unit, used to construct a graph structure from the nodes and edges, wherein the adjacency matrix of the graph structure has the same connection relationship as the residue contact matrix.
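A minimal sketch of this graph construction is given below. It assumes a symmetric 0/1 residue contact matrix that is already available; the function name `build_residue_graph` and the dictionary and list representations of nodes and edges are choices made for illustration.

```python
import numpy as np


def build_residue_graph(enhanced_embeddings, contact):
    """Build nodes, edges and adjacency from a symmetric 0/1 contact matrix."""
    contact = np.asarray(contact)
    n = contact.shape[0]
    # Each residue becomes a node carrying its enhanced embedding.
    nodes = {i: enhanced_embeddings[i] for i in range(n)}
    # Each element equal to 1 yields an edge; keep one copy per unordered pair.
    rows, cols = np.nonzero(contact == 1)
    edges = [(int(i), int(j)) for i, j in zip(rows, cols) if i <= j]
    # The adjacency matrix reproduces the connections of the contact matrix.
    adjacency = np.zeros((n, n), dtype=int)
    for i, j in edges:
        adjacency[i, j] = adjacency[j, i] = 1
    return nodes, edges, adjacency


contact = np.array([[0, 1, 0],
                    [1, 0, 1],
                    [0, 1, 0]])
embeddings = np.arange(6.0).reshape(3, 2)
nodes, edges, adjacency = build_residue_graph(embeddings, contact)
print(edges)
```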

[0053] Based on the above embodiments, the graph convolution prediction module, which performs the multiple rounds of dynamic mask graph convolution, includes: a probability prediction unit, used to predict the probabilities of all residues in the current round with a graph convolutional layer, based on the node features of the current residue graph; a dynamic mask calculation unit, used to obtain the dynamic mask of the previous round and to calculate the dynamic mask of the current round from the dynamic mask of the previous round and the predicted probabilities of the current round, wherein the dynamic mask of the first round is equal to the predicted probabilities of the first round; and a feature update unit, used to construct the weighted adjacency matrix for the next round based on the dynamic mask of the current round, and to perform graph convolution with the weighted adjacency matrix to update the node features.
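The interplay of the three units can be sketched as a simple loop. The specific update rules used here are assumptions made for illustration, not formulas taken from the description: the mask is combined with the new probabilities by a moving average controlled by `momentum`, the weighted adjacency matrix scales the message from each neighbour by that neighbour's mask value, and ReLU is used as the activation function.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def dynamic_mask_gcn(adjacency, features, w_pred, w_update, rounds=3, momentum=0.5):
    """Run several rounds of mask-weighted graph convolution (assumed update rules)."""
    h = features
    mask = None
    for _ in range(rounds):
        # Predict the probability of every residue from the current node features.
        probs = sigmoid(adjacency @ h @ w_pred).ravel()
        # First round: the mask equals the probabilities; later rounds blend both.
        mask = probs if mask is None else momentum * mask + (1.0 - momentum) * probs
        # Weighted adjacency: the message from residue j is scaled by mask[j].
        weighted = adjacency * mask[None, :]
        # Graph convolution on the weighted adjacency, followed by ReLU.
        h = np.maximum(0.0, weighted @ h @ w_update)
    return h, mask


rng = np.random.default_rng(0)
adjacency = np.array([[1.0, 1.0, 0.0],
                      [1.0, 1.0, 1.0],
                      [0.0, 1.0, 1.0]])
features = rng.normal(size=(3, 4))
w_pred = rng.normal(size=(4, 1))
w_update = rng.normal(size=(4, 4))

final_features, final_mask = dynamic_mask_gcn(adjacency, features, w_pred, w_update)
_, first_mask = dynamic_mask_gcn(adjacency, features, w_pred, w_update, rounds=1)
print(final_features.shape, final_mask)
```

Under these assumptions, residues that the model currently considers unlikely to be binding sites contribute less to their neighbours in the next round, which is one way to damp noise while letting likely binding sites propagate their signal.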

[0054] It should be noted that the specific manner in which each module performs its operation in the apparatus described in the above embodiments has been described in detail in the embodiments of the method, and will not be elaborated here.

[0055] Example 3: Corresponding to the above method embodiments, this embodiment also provides a protein binding site prediction device based on dynamic mask graph convolution. The device described below and the protein binding site prediction method based on dynamic mask graph convolution described above may be read with reference to each other.

[0056] Figure 6 is a block diagram illustrating a protein binding site prediction device 800 based on dynamic mask graph convolution, according to an exemplary embodiment. As shown in Figure 6, the protein binding site prediction device 800 based on dynamic mask graph convolution may include a processor 801 and a memory 802. The device 800 may also include one or more of the following: a multimedia component 803, an I/O interface 804, and a communication component 805.

[0057] The processor 801 controls the overall operation of the protein binding site prediction device 800 based on dynamic mask graph convolution, so as to complete all or part of the steps of the aforementioned protein binding site prediction method based on dynamic mask graph convolution. The memory 802 stores various types of data to support the operation of the device 800. This data may include, for example, instructions for any application or method operating on the device 800, as well as application-related data such as contact data, sent and received messages, images, audio, and video. The memory 802 can be implemented using any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk. The multimedia component 803 may include a screen and an audio component. The screen may be, for example, a touchscreen, and the audio component is used to output and/or input audio signals. For example, the audio component may include a microphone for receiving external audio signals; the received audio signals may be further stored in the memory 802 or transmitted via the communication component 805. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 804 provides an interface between the processor 801 and other interface modules, such as a keyboard, a mouse, or buttons, where the buttons can be virtual or physical. The communication component 805 is used for wired or wireless communication between the device 800 and other devices. Wireless communication includes, for example, Wi-Fi, Bluetooth, near field communication (NFC), 2G, 3G, or 4G, or a combination thereof. Accordingly, the communication component 805 may include a Wi-Fi module, a Bluetooth module, or an NFC module.

[0058] In an exemplary embodiment, the protein binding site prediction device 800 based on dynamic mask graph convolution can be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, to perform the protein binding site prediction method based on dynamic mask graph convolution described above.

[0059] In another exemplary embodiment, a computer-readable storage medium including program instructions is also provided. When executed by a processor, these program instructions implement the steps of the protein binding site prediction method based on dynamic mask graph convolution described above. For example, the computer-readable storage medium may be the memory 802 including the program instructions described above, and these program instructions may be executed by the processor 801 of the device 800 to complete the protein binding site prediction method based on dynamic mask graph convolution described above.

[0060] Example 4: Corresponding to the above method embodiments, this embodiment also provides a readable storage medium. The readable storage medium described below and the protein binding site prediction method based on dynamic mask graph convolution described above may be read with reference to each other.

[0061] A readable storage medium stores a computer program which, when executed by a processor, implements the steps of the protein binding site prediction method based on dynamic mask graph convolution described in the above method embodiments.

[0062] The readable storage medium can specifically be a USB flash drive, external hard drive, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk, or any other readable storage medium capable of storing program code.

[0063] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

[0064] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A protein binding site prediction method based on dynamic mask graph convolution, characterized by comprising: obtaining the amino acid sequence of the protein to be predicted, and generating an initial embedding representation of the residues based on the amino acid sequence; inputting the initial embedding representation of the residues into a pre-trained embedding enhancement module to generate an enhanced embedding representation of the residues, wherein the embedding enhancement module is pre-trained through protein binding site similarity contrastive learning; constructing a residue graph based on the three-dimensional structural information of the protein to be predicted, and using the enhanced embedding representation of the residues as the initial feature of each residue node in the residue graph; inputting the residue graph into a trained protein binding site prediction model and performing multiple rounds of dynamic mask graph convolution to iteratively update the features of each residue node in the residue graph, so as to obtain the final features after the multiple rounds of iteration; and generating, based on the final features, the predicted probability that each residue is a binding site.

2. The protein binding site prediction method based on dynamic mask graph convolution according to claim 1, characterized in that inputting the initial embedding representation of the residues into the pre-trained embedding enhancement module to generate the enhanced embedding representation of the residues includes: performing a linear transformation on the initial embedding representation of the residues, and then calculating multiple raw attention weights between the residues using a multi-head attention unit; defining a residue contact matrix, and using the residue contact matrix to apply a spatial constraint mask to each raw attention weight to obtain the corresponding spatially constrained attention weight; normalizing each spatially constrained attention weight and performing weighted aggregation to obtain the corresponding attention output value; and concatenating the multiple attention output values into an attention matrix, and applying a linear transformation to the attention matrix to obtain the enhanced embedding representation of the protein.

3. The protein binding site prediction method based on dynamic mask graph convolution according to claim 1, characterized in that the pre-training process of the embedding enhancement module includes: constructing positive sample pairs and negative sample pairs, wherein each positive sample pair includes an anchor protein sample and a positive protein sample, and each negative sample pair includes an anchor protein sample and a negative protein sample; extracting the initial embedding representations of the binding sites of the anchor protein samples, the positive protein samples, and the negative protein samples, respectively; inputting the positive sample pairs and negative sample pairs, together with their corresponding initial embedding representations, into the embedding enhancement module to obtain the enhanced embedding representation of each protein sample; extracting the binding site features of each protein sample from its enhanced embedding representation and pooling them to obtain the global feature of the binding sites of each protein sample; constructing a contrastive learning loss function based on the global features of the binding sites of the protein samples; and pre-training and optimizing the embedding enhancement module using the contrastive learning loss function.

4. The protein binding site prediction method based on dynamic mask graph convolution according to claim 2, characterized in that constructing the residue graph based on the three-dimensional structural information of the protein to be predicted, and using the enhanced embedding representation as the initial feature of each residue node in the residue graph, includes: treating each residue as a node to form a node set, and assigning the enhanced embedding representation of each residue to the corresponding node as its initial feature; obtaining the two residues corresponding to each element with a value of 1 in the residue contact matrix, and establishing an edge between the nodes corresponding to the two residues; and forming a graph structure from the nodes and edges, wherein the adjacency matrix of the graph structure has the same connection relationship as the residue contact matrix.

5. The protein binding site prediction method based on dynamic mask graph convolution according to claim 1, characterized in that inputting the residue graph into the trained protein binding site prediction model and performing multiple rounds of dynamic mask graph convolution to iteratively update the features of each residue node in the residue graph includes: predicting the probabilities of all residues in the current round with a graph convolutional layer, based on the node features of the current residue graph; obtaining the dynamic mask of the previous round, and calculating the dynamic mask of the current round from the dynamic mask of the previous round and the predicted probabilities of the current round, wherein the dynamic mask of the first round is equal to the predicted probabilities of the first round; and constructing the weighted adjacency matrix for the next round based on the dynamic mask of the current round, and performing graph convolution with the weighted adjacency matrix to update the node features.

6. A protein binding site prediction device based on dynamic mask graph convolution, characterized by comprising: a sequence acquisition module, used to obtain the amino acid sequence of the protein to be predicted; an embedding generation module, used to generate an initial embedding representation of the residues based on the amino acid sequence; an embedding enhancement module, used to take the initial embedding representation of the residues as input and generate an enhanced embedding representation of the residues, wherein the embedding enhancement module is pre-trained through protein binding site similarity contrastive learning; a graph construction module, used to construct a residue graph based on the three-dimensional structural information of the protein to be predicted, and to use the enhanced embedding representation of the residues as the initial feature of each residue node in the residue graph; a graph convolution prediction module, used to input the residue graph into the trained protein binding site prediction model, perform multiple rounds of dynamic mask graph convolution, and iteratively update the features of each residue node in the residue graph to obtain the final features after the multiple rounds of iteration; and an output module, used to generate, based on the final features, the predicted probability that each residue is a binding site.

7. The protein binding site prediction device based on dynamic mask graph convolution according to claim 6, characterized in that the embedding enhancement module includes: a linear transformation unit, used to perform a linear transformation on the initial embedding representation of the residues; a multi-head attention unit, used to compute multiple raw attention weights between residues based on the linearly transformed representation; a spatial constraint masking unit, used to obtain a predefined residue contact matrix and to mask each raw attention weight with the residue contact matrix to obtain the corresponding spatially constrained attention weight; an aggregation output unit, used to normalize each spatially constrained attention weight and perform weighted aggregation to obtain the corresponding attention output value; and a concatenation transformation unit, used to concatenate the multiple attention output values into an attention matrix and to apply a linear transformation to the attention matrix to obtain the enhanced embedding representation of the protein.

8. The protein binding site prediction device based on dynamic mask graph convolution according to claim 6, characterized in that the embedding enhancement module is obtained through pre-training, and the pre-training process is executed by a pre-training unit, which includes: a sample construction subunit, used to construct positive sample pairs and negative sample pairs, wherein each positive sample pair includes an anchor protein sample and a positive protein sample, and each negative sample pair includes an anchor protein sample and a negative protein sample; an initial embedding extraction subunit, used to extract the initial embedding representations of the binding sites of the anchor protein samples, the positive protein samples, and the negative protein samples, respectively; an enhanced embedding acquisition subunit, used to input the positive sample pairs and negative sample pairs, together with their corresponding initial embedding representations, into the embedding enhancement module to obtain the enhanced embedding representation of each protein sample; a global feature generation subunit, used to extract the binding site features of each protein sample from its enhanced embedding representation and to pool them to obtain the global feature of the binding sites of each protein sample; a loss function construction subunit, used to construct a contrastive learning loss function based on the global features of the binding sites of the protein samples; and a parameter optimization subunit, used to pre-train and optimize the embedding enhancement module using the contrastive learning loss function.

9. The protein binding site prediction device based on dynamic mask graph convolution according to claim 7, characterized in that the graph construction module includes: a node initialization unit, used to treat each residue as a node to form a node set, and to assign the enhanced embedding representation of each residue to the corresponding node as its initial feature; an edge construction unit, used to obtain the two residues corresponding to each element with a value of 1 in the residue contact matrix, and to establish an edge between the nodes corresponding to the two residues; and a graph generation unit, used to construct a graph structure from the nodes and edges, wherein the adjacency matrix of the graph structure has the same connection relationship as the residue contact matrix.

10. The protein binding site prediction device based on dynamic mask graph convolution according to claim 6, characterized in that the graph convolution prediction module includes: a probability prediction unit, used to predict the probabilities of all residues in the current round with a graph convolutional layer, based on the node features of the current residue graph; a dynamic mask calculation unit, used to obtain the dynamic mask of the previous round and to calculate the dynamic mask of the current round from the dynamic mask of the previous round and the predicted probabilities of the current round, wherein the dynamic mask of the first round is equal to the predicted probabilities of the first round; and a feature update unit, used to construct the weighted adjacency matrix for the next round based on the dynamic mask of the current round, and to perform graph convolution with the weighted adjacency matrix to update the node features.