Training method of prediction model, binding affinity prediction method, device and equipment
By enhancing and fusing the characterization of proteins and ligands through an interactive perceptual attention network, and constructing a loss function to train a prediction model, this approach solves the problems of long processing time, high cost, and low accuracy in existing methods. It achieves efficient and accurate binding affinity prediction, supporting the drug discovery process.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD
- Filing Date
- 2022-06-24
- Publication Date
- 2026-06-12
AI Technical Summary
Existing methods for predicting protein-ligand binding affinity suffer from problems such as long processing time, high computational cost, low accuracy, and lack of interpretability, making it particularly difficult to effectively screen for valuable new drug candidates during drug discovery.
We employ a prediction model based on an interactive perceptual attention network. By performing reinforcement learning and fusion learning on the representations of proteins and ligands, we construct a loss function to train the prediction model, thereby improving the accuracy and interpretability of binding affinity prediction.
It improves the accuracy of prediction models, reduces computational and time costs, enhances the understanding and prediction of binding affinity, and is applicable to structure-based drug discovery processes.
Figure CN115171776B_ABST
Description
Technical Field
[0001] This disclosure relates to the field of artificial intelligence, particularly to deep learning, natural language processing, and material screening technologies.
Background Technology
[0002] Protein-ligand binding affinity prediction is a crucial stage in drug discovery. Binding affinity reflects the strength of the binding reaction between the protein and ligand, indicating the effectiveness of the ligand's action on the protein. Therefore, accurate prediction of binding affinity can help screen truly valuable drug candidates, thereby accelerating subsequent stages of drug development, significantly reducing research costs, and providing important guidance for new drug discovery. Thus, improving prediction accuracy has become a pressing technical challenge.
Summary of the Invention
[0003] This disclosure provides a method for training a prediction model, a method for predicting affinity, an apparatus, and a device.
[0004] According to a first aspect of this disclosure, a method for training a prediction model is provided, comprising:
[0005] Performing reinforcement learning on the first characterization of the protein and the first characterization of the ligand, respectively, to obtain the second characterization of the protein and the second characterization of the ligand;
[0006] Performing fusion learning on the second characterization of the protein and the second characterization of the ligand to obtain a third characterization of the protein and the ligand, wherein the third characterization is a complex molecular characterization of the protein and the ligand;
[0007] Predicting the binding affinity between the protein and the ligand based on the third characterization of the protein and the ligand to obtain a predicted binding affinity value;
[0008] Constructing a loss function based on the predicted value and the true value of the binding affinity;
[0009] Training, based on the loss function, a prediction model for predicting binding affinity.
[0010] According to a second aspect of this disclosure, a method for predicting binding affinity is provided, comprising:
[0011] Receiving a first characterization of the target protein and a first characterization of the target ligand;
[0012] Inputting the first characterization of the target protein and the first characterization of the target ligand into the prediction model, and obtaining the predicted value of the binding affinity between the target protein and the target ligand output by the prediction model;
[0013] wherein the prediction model is obtained using the training method provided in the first aspect.
[0014] According to a third aspect of this disclosure, a training apparatus for a prediction model is provided, comprising:
[0015] The reinforcement learning module is used to perform reinforcement learning on the first characterization of the protein and the first characterization of the ligand, respectively, to obtain the second characterization of the protein and the second characterization of the ligand.
[0016] A fusion learning module is used to perform fusion learning on the second characterization of the protein and the second characterization of the ligand to obtain a third characterization of the protein and the ligand, wherein the third characterization is a composite molecular characterization of the protein and the ligand;
[0017] The first prediction module is used to predict the binding affinity between the protein and the ligand based on the third characterization of the protein and the ligand, and obtain the predicted binding affinity value.
[0018] The construction module is used to construct a loss function based on the predicted and true values of the binding affinity;
[0019] The training module is used to train a prediction model for predicting binding affinity based on the loss function.
[0020] According to a fourth aspect of this disclosure, a binding affinity prediction device is provided, comprising:
[0021] The receiving module is used to receive the first characterization of the target protein and the first characterization of the target ligand;
[0022] The second prediction module is used to input the first characterization of the target protein and the first characterization of the target ligand into the prediction model, and obtain the predicted value of the binding affinity between the target protein and the target ligand output by the prediction model.
[0023] The prediction model is obtained using the training method provided in the first aspect.
[0024] According to a fifth aspect of this disclosure, an electronic device is provided, comprising:
[0025] At least one processor; and
[0026] The memory is communicatively connected to the at least one processor; wherein,
[0027] The memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform the methods provided in the first and second aspects described above.
[0028] According to a sixth aspect of this disclosure, a non-transitory computer-readable storage medium is provided storing computer instructions, wherein the computer instructions are used to cause the computer to perform the methods provided in the first and second aspects described above.
[0029] According to a seventh aspect of this disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the methods provided in the first and second aspects described above.
[0030] The technical solutions of this disclosure can improve the prediction accuracy of the trained prediction model.
[0031] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.
Attached Figure Description
[0032] The accompanying drawings are provided to better understand this solution and do not constitute a limitation of this disclosure. Wherein:
[0033] Figure 1 is a flowchart illustrating a training method for a prediction model according to an embodiment of the present disclosure;
[0034] Figure 2 is a schematic diagram of the architecture of a prediction model according to an embodiment of the present disclosure;
[0035] Figure 3 is a schematic diagram of an attention pooling network according to an embodiment of the present disclosure;
[0036] Figure 4 is a schematic flowchart of a binding affinity prediction method according to an embodiment of the present disclosure;
[0037] Figure 5 is a schematic diagram of the composition of a training apparatus for a prediction model according to an embodiment of the present disclosure;
[0038] Figure 6 is a schematic diagram of the composition of a binding affinity prediction device according to an embodiment of the present disclosure;
[0039] Figure 7 is a schematic diagram of a training scenario for a prediction model according to an embodiment of the present disclosure;
[0040] Figure 8 is a schematic diagram of a binding affinity prediction scenario according to an embodiment of the present disclosure;
[0041] Figure 9 is a block diagram of an electronic device used to implement the training method for a prediction model or the binding affinity prediction method of embodiments of the present disclosure.
Detailed Implementation
[0042] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.
[0043] The terms "first," "second," and "third," etc., used in the embodiments, claims, and accompanying drawings of this disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion, such as including a series of steps or units. A method, system, product, or apparatus is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to these processes, methods, products, or apparatuses.
[0044] In related technologies, the prediction of protein-ligand binding affinity mainly includes the following methods:
(1) Experimental determination: this method is the most accurate, but it is time-consuming and expensive.
(2) Docking based on physical simulation: this method uses theories such as molecular dynamics to simulate docking and designs a scoring function to estimate the binding affinity of the simulated docking.
(3) Machine learning: this method uses machine learning models such as Support Vector Machine (SVM) and Random Forest to make predictions from manually extracted features.
(4) Deep learning: these methods include a) using one-dimensional (1D) convolutional neural networks to learn the amino acid sequence of the protein and the simplified molecular input line entry system (SMILES) character sequence of the ligand (i.e., the drug) separately, and combining the two learned features for prediction; b) using three-dimensional (3D) convolutional neural networks to learn the characterization of the protein-ligand complex for prediction; c) using graph neural networks to learn the features of the ligand graph, combined with 1D convolutional neural networks that learn the amino acid sequence features of the protein; d) treating the protein-ligand complex molecule as a single molecular graph and using graph neural networks for characterization learning and prediction.
However, experimental determination is very time-consuming, and new drug development requires screening a large number of candidates, so it is impractical in real applications. Docking methods based on physical simulation consume a lot of computational resources and have very low accuracy. Machine learning methods require domain-specific expert knowledge for feature engineering, their prediction accuracy depends on the manually extracted features, and their performance is difficult to improve further on large datasets. Deep learning methods, on the one hand, mostly learn the protein and the ligand (drug) independently or treat them as a whole, and so lack effective modeling of the interaction between them. On the other hand, they lack interpretability and cannot interpret or analyze the interaction information behind the prediction results.
[0045] One of the key aspects of accurately predicting binding affinity lies in the characterization and learning of protein-ligand complex molecules. Unlike general molecular graphs, this type of complex molecule is formed by the interaction of two molecules; therefore, modeling important semantic interaction information and extracting valuable interaction information is particularly important. This disclosure proposes a method based on an interaction-aware attention network to predict protein-ligand binding affinity, which can efficiently and accurately predict the binding affinity between proteins and ligands.
[0046] This disclosure provides a method for training a prediction model. Figure 1 is a flowchart illustrating a training method for a prediction model according to an embodiment of the present disclosure. This training method can be applied to a training apparatus for the prediction model. The training apparatus is located in an electronic device, which includes, but is not limited to, fixed devices and/or mobile devices. For example, fixed devices include, but are not limited to, servers, which can be cloud servers or ordinary servers. Mobile devices include, but are not limited to, one or more terminals such as mobile phones and tablets. In some possible implementations, the method can also be implemented by a processor calling computer-readable instructions stored in memory. As shown in Figure 1, the training method for the prediction model includes:
[0047] S101: Reinforcement learning is performed on the first characterization of the protein and the first characterization of the ligand to obtain the second characterization of the protein and the second characterization of the ligand.
[0048] S102: The second characterization of the protein and the second characterization of the ligand are fused and learned to obtain the third characterization of the protein and the ligand. The third characterization is a complex molecular characterization of the protein and the ligand.
[0049] S103: Based on the third characterization of proteins and ligands, the binding affinity between proteins and ligands is predicted, and the predicted binding affinity value is obtained;
[0050] S104: Construct a loss function based on the predicted and true values of binding affinity;
[0051] S105: A prediction model for predicting binding affinity is obtained by training based on the loss function.
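Steps S104 and S105 can be illustrated with a minimal sketch. It assumes a mean squared error loss and a linear prediction head fitted by plain gradient descent; the disclosure does not fix the loss type, the head, or the optimizer. The names `predict`, `mse_loss` and `train_head`, and the toy data, are hypothetical.

```python
from typing import List, Sequence, Tuple


def predict(weights: Sequence[float], bias: float, h: Sequence[float]) -> float:
    """Linear head (assumed): maps a third (complex) representation to an affinity value."""
    return sum(w * x for w, x in zip(weights, h)) + bias


def mse_loss(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between predicted and true binding affinities (assumed loss)."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


def train_head(
    samples: List[Tuple[List[float], float]],
    lr: float = 0.05,
    epochs: int = 500,
) -> Tuple[List[float], float]:
    """Fit the linear head by full-batch gradient descent on the MSE loss."""
    dim = len(samples[0][0])
    n = len(samples)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, y in samples:
            err = predict(weights, bias, h) - y
            for k in range(dim):
                grad_w[k] += 2.0 * err * h[k] / n  # d(MSE)/d(w_k)
            grad_b += 2.0 * err / n                # d(MSE)/d(bias)
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


# Toy pairs of (third representation, true affinity); the numbers are made up.
samples = [([1.0, 0.0], 2.0), ([0.0, 1.0], 3.0), ([1.0, 1.0], 5.0)]
targets = [y for _, y in samples]
initial_loss = mse_loss([0.0 for _ in samples], targets)
weights, bias = train_head(samples)
final_loss = mse_loss([predict(weights, bias, h) for h, _ in samples], targets)
```

In a full model the gradient would also flow back through the fusion and enhancement networks; the sketch stops at the head only to keep the loss construction and the update rule visible.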
[0052] In this embodiment of the disclosure, the first characterization of the protein can be represented by the feature sequences of a plurality of atoms that make up the protein. Here, the plurality of atomic feature sequences can be the feature sequences of some atoms in the protein, or the feature sequences of all atoms in the protein.
[0053] In this embodiment of the disclosure, the first characterization of the ligand can be represented by the feature sequences of a plurality of atoms that make up the ligand. Here, the plurality of atomic feature sequences can be the feature sequences of some atoms in the ligand, or the feature sequences of all atoms in the ligand.
[0054] Here, a ligand is a substance that can act on a protein. For example, a ligand can be a compound or mixture, specifically a food, drug, or other substance.
[0055] In this embodiment of the disclosure, the first characterization of the protein and the first characterization of the ligand can be the first characterization of the protein and the first characterization of the ligand extracted during initialization, or the first characterization of the protein and the first characterization of the ligand output by the upstream deep learning model.
[0056] Here, the upstream deep learning model is a model capable of determining the first characterization of the protein and the first characterization of the ligand. The upstream deep learning model can be connected to the prediction model and is located upstream of the prediction model. This disclosure does not limit the specific methods of training the upstream deep learning model.
[0057] Here, the first representation is the original representation. The second representation is a representation that combines semantic interaction information with the first representation.
[0058] In this embodiment, a separate interactive attention pooling network from a pre-defined model can be used to perform reinforcement learning on the first representation of the protein and the first representation of the ligand, respectively. The specific learning process will be described in detail in subsequent embodiments and will not be repeated here.
[0059] In this embodiment, a global fusion attention pooling network from a pre-defined model can be used to fuse the second characterization of the protein and the second characterization of the ligand. The specific fusion process will be described in detail in subsequent embodiments and will not be repeated here.
[0060] In this embodiment of the disclosure, binding affinity is a measure of the degree of interaction between a protein and a ligand. Specifically, it can be the magnitude of the interaction between the ligand and the protein.
[0061] In this embodiment of the disclosure, the binding affinity prediction value is a prediction value output by the prediction model.
[0062] In this embodiment of the disclosure, the true value of binding affinity is the actual value of the binding affinity between the protein and the ligand in the sample set. This true value can be a value determined experimentally or a reliable value obtained from a data source. It should be noted that this disclosure does not limit the source of the true value of binding affinity.
[0063] In this embodiment of the disclosure, the prediction model includes at least a network structure for implementing reinforcement learning, a network structure for implementing fusion, and a network structure for implementing affinity prediction. Each network structure includes a Convolutional Neural Network (CNN). The above is merely illustrative and is not intended to limit all possible network types included in the prediction model; an exhaustive list is not provided here.
[0064] In the technical solution of this disclosure, reinforcement learning is performed on the first characterization of the protein and the first characterization of the ligand, respectively, to obtain the second characterization of the protein and of the ligand; fusion learning is performed on the two second characterizations to obtain the third characterization of the protein and the ligand; the binding affinity of the protein and the ligand is predicted based on the third characterization to obtain a predicted binding affinity value; a loss function is constructed based on the predicted value and the true value of the binding affinity; and a prediction model for predicting binding affinity is trained based on the loss function. Because the first characterizations are first enhanced separately and the resulting second characterizations are then fused, the third characterization of the protein and the ligand is more accurate, which makes the predicted binding affinity more accurate and improves the prediction accuracy of the trained prediction model. Furthermore, compared with traditional experimental determination and physical simulation methods, the computational and time costs are lower, making the approach suitable for practical use in structure-based drug discovery. Compared with other machine learning and deep learning methods, the prediction model of this disclosure fully considers the semantic interaction mechanism, generalizes well, strengthens the model's ability to learn and understand binding affinity, and provides interpretability for the prediction results.
[0065] In some embodiments, the training method of the prediction model may further include: obtaining a complex molecular map of the protein and ligand; performing feature separation on the complex molecular map to obtain a first characterization of the protein and a first characterization of the ligand.
[0066] This disclosure does not limit the source of the protein-ligand complex molecular map. The complex molecular map can be an experimentally verified complex molecular map or a complex molecular map obtained from a data source. The above is only an illustrative example and is not intended to limit all possible sources of the complex molecular map; it is simply not exhaustive.
[0067] Thus, by performing feature separation on the complex molecular map of proteins and ligands, the first characterization of proteins and ligands can be obtained, making the determined first characterization of proteins and ligands more accurate. This helps to improve the accuracy of the information input into the prediction model, thereby improving the training speed and accuracy of the prediction model.
[0068] In some embodiments, the first characterization of a protein includes atomic node characterization of the protein; the first characterization of a ligand includes atomic node characterization of the ligand.
[0069] In some embodiments, the training method for the prediction model may further include: determining the edge representation of the protein and the ligand based on the atomic node representation of the protein and the atomic node representation of the ligand.
[0070] Here, the edge representation of proteins and ligands can be understood as the representation of the edges formed by the pairing of atoms in the protein and atoms in the ligands. For each edge, an atom in the protein forms one end of the edge, and an atom in the ligand forms the other end.
[0071] This provides a better data foundation for subsequent training of prediction models, which helps improve the prediction accuracy of the trained models.
[0072] In some embodiments, reinforcement learning is performed on the first representation of the protein and the first representation of the ligand to obtain the second representation of the protein and the second representation of the ligand, respectively, including: embedding the first representation of the ligand into the first representation of the protein through a first semantic interaction learning sub-model in the prediction model to learn first contextual information on the effect of the ligand on the protein; obtaining the second representation of the protein based on the first contextual information and the first representation of the protein; embedding the first representation of the protein into the first representation of the ligand through a second semantic interaction learning sub-model in the prediction model to learn second contextual information on the effect of the protein on the ligand; obtaining the second representation of the ligand based on the second contextual information and the first representation of the ligand.
[0073] Here, the first semantic interaction learning sub-model and the second semantic interaction learning sub-model are two independent sub-models.
[0074] Here, the second representation is a representation that incorporates semantic interaction information. The second representation, relative to the first representation, contains semantic interaction information.
[0075] Thus, by employing different semantic interaction learning sub-models to perform reinforcement learning on the first representation of the protein and the first representation of the ligand respectively, the second representation of the protein and the second representation of the ligand can be obtained quickly, which helps to improve the training speed of the prediction model and also helps to improve the training accuracy of the prediction model.
[0076] In some embodiments, the second characterization of the protein and the second characterization of the ligand are fused and learned to obtain a third characterization of the protein and the ligand, including: determining the fusion weights corresponding to the protein and the ligand respectively; based on the fusion weights corresponding to the protein and the ligand respectively, the second characterization of the protein and the second characterization of the ligand are fused and learned to obtain a fourth characterization of the protein and the ligand; and the fourth characterization of the protein and the ligand and the edge characterization of the protein and the ligand are fused and learned to obtain a third characterization of the protein and the ligand.
[0077] Here, the sum of the fusion weight corresponding to the protein and the fusion weight corresponding to the ligand is 1.
[0078] Here, the fourth characterization of proteins and ligands is the characterization of the complex molecules obtained from the initial fusion of proteins and ligands.
[0079] Here, the third characterization of proteins and ligands is the characterization of complex molecules obtained by the deep fusion of proteins and ligands.
[0080] Here, the third characterization takes into account more factors during the fusion process compared to the fourth characterization. For example, when fusing to obtain the fourth characterization, the edge characterization of the protein and ligand is not considered; when fusing to obtain the third characterization, the edge characterization of the protein and ligand is considered.
[0081] This can increase the speed at which the third characterization of protein-ligand binding is determined, thereby helping to improve the training speed of the prediction model.
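The weighted fusion in this embodiment can be sketched as follows. The only constraint taken from the text is that the two fusion weights sum to 1; obtaining them from a softmax over two scalar scores, and scoring each representation with a shared vector `scorer`, are assumptions made for illustration. `fusion_weights` and `fuse` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple


def fusion_weights(score_protein: float, score_ligand: float) -> Tuple[float, float]:
    """Softmax over two scalar scores, so the two fusion weights sum to 1."""
    m = max(score_protein, score_ligand)  # subtract the max for numerical stability
    e_p = math.exp(score_protein - m)
    e_l = math.exp(score_ligand - m)
    total = e_p + e_l
    return e_p / total, e_l / total


def fuse(
    h_protein: Sequence[float],
    h_ligand: Sequence[float],
    scorer: Sequence[float],
) -> List[float]:
    """Weighted sum of the two second representations (the fourth representation)."""
    s_p = sum(w * x for w, x in zip(scorer, h_protein))
    s_l = sum(w * x for w, x in zip(scorer, h_ligand))
    w_p, w_l = fusion_weights(s_p, s_l)
    return [w_p * p + w_l * l for p, l in zip(h_protein, h_ligand)]


# Toy second representations and scoring vector; the numbers are made up.
h_protein = [1.0, 2.0]
h_ligand = [3.0, 0.0]
scorer = [0.5, -0.5]
w_p, w_l = fusion_weights(-0.5, 1.5)
h0 = fuse(h_protein, h_ligand, scorer)
```

Because the weights are a convex combination, every component of `h0` lies between the corresponding protein and ligand components. The edge representation is not used at this stage; it enters only in the later step that produces the third representation.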
[0082] In some embodiments, the fourth representation of the protein and ligand and the edge representation of the protein and ligand are fused and learned to obtain the third representation of the protein and ligand, including: embedding the edge representation of the protein and ligand into the fourth representation of protein and ligand binding through a third semantic interaction learning sub-model in the prediction model, and learning the third contextual information of the effect of the edge representation of the protein and ligand on protein and ligand binding; and obtaining the third representation of protein and ligand binding based on the third contextual information and the fourth representation of protein and ligand binding.
[0083] Here, the outputs of the first semantic interaction learning sub-model and the second semantic interaction learning sub-model are both connected to the input of the third semantic interaction learning sub-model.
[0084] Thus, by combining the edge representations of proteins and ligands to determine the third representation of protein-ligand binding, the determined third representation of protein-ligand binding can be more accurate, which helps to improve the prediction accuracy of the trained prediction model.
[0085] Figure 2 shows a schematic diagram of the prediction model architecture. As shown in Figure 2, the architecture includes a representation decoupling module, a decoupled cross pooling module, a fused global pooling module, and a prediction module.
[0086] The representation decoupling module is used to determine the first characterization of the protein and the first characterization of the ligand. Specifically, the input to the representation decoupling module can be a complex molecular graph of the protein and ligand, or the atomic node characterization of the protein and the atomic node characterization of the ligand, and may further include the edge characterization of the protein and the ligand.
[0087] From the input atomic node representations and edge representations, three sets $\mathcal{V}_P$, $\mathcal{V}_L$ and $\mathcal{E}_I$ can be extracted using formulas (1), (2) and (3), for use in modeling the interactive attention mechanism.
[0088] $\mathcal{V}_P = \{ a_{i',P} \mid a_i \in \mathcal{V}_P \}$ (1)
[0089] $\mathcal{V}_L = \{ a_{i',L} \mid a_i \in \mathcal{V}_L \}$ (2)
[0090] $\mathcal{E}_I = \{ e_{i'j'} \mid a_i \in \mathcal{V}_P,\ a_j \in \mathcal{V}_L \}$ (3)
[0091] Here, P denotes the protein molecule, i denotes the i-th atom in the protein molecule, and $a_{i',P}$ is the representation of the i-th atom in the protein.
[0092] L denotes the ligand molecule, i denotes the i-th atom in the ligand molecule, and $a_{i',L}$ is the representation of the i-th atom in the ligand.
[0093] $e_{i'j'}$ is the representation of an edge, specifically the edge (also called a connecting edge) formed by the interaction between atom $a_i$ in the protein molecule and atom $a_j$ in the ligand molecule.
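The extraction of the three sets can be sketched as below. The input format (a mapping from atom index to owning molecule, plus an edge list) and the function name `split_complex_graph` are assumptions for illustration; the disclosure does not prescribe a data layout.

```python
from typing import Dict, List, Tuple


def split_complex_graph(
    atom_owner: Dict[int, str],
    edges: List[Tuple[int, int]],
) -> Tuple[List[int], List[int], List[Tuple[int, int]]]:
    """Split a protein-ligand complex graph into protein atoms, ligand atoms,
    and intermolecular edges (one end in the protein, the other in the ligand)."""
    protein_atoms = sorted(a for a, owner in atom_owner.items() if owner == "P")
    ligand_atoms = sorted(a for a, owner in atom_owner.items() if owner == "L")
    interaction_edges = []
    for u, v in edges:
        if atom_owner[u] == "P" and atom_owner[v] == "L":
            interaction_edges.append((u, v))
        elif atom_owner[u] == "L" and atom_owner[v] == "P":
            interaction_edges.append((v, u))  # store as (protein atom, ligand atom)
        # edges inside a single molecule are not interaction edges and are skipped
    return protein_atoms, ligand_atoms, interaction_edges


# Toy complex: atoms 0-2 belong to the protein, atoms 3-4 to the ligand.
atom_owner = {0: "P", 1: "P", 2: "P", 3: "L", 4: "L"}
edges = [(0, 1), (1, 2), (3, 4), (2, 3), (4, 0)]
v_p, v_l, e_i = split_complex_graph(atom_owner, edges)
```

Storing each interaction edge with the protein atom first keeps the orientation consistent with formula (3), where the first index ranges over protein atoms and the second over ligand atoms.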
[0094] From the representations of all atomic nodes, two independent intramolecular atom sets are selected, namely the protein atom set $\mathcal{V}_P$ and the ligand atom set $\mathcal{V}_L$, together with the set of intermolecular interaction edges $\mathcal{E}_I$ (that is, edges with one end belonging to the protein and the other end belonging to the ligand). After the three sets $\mathcal{V}_P$, $\mathcal{V}_L$ and $\mathcal{E}_I$ are obtained, the protein atom set $\mathcal{V}_P$ and the ligand atom set $\mathcal{V}_L$ are fed into the two-layer attention pooling module (the separate interactive attention pooling module and the global fusion attention pooling module). The attention pooling module uses a gated recurrent unit (GRU) as the basic unit for semantic interaction learning, models the semantic dependence between the molecular representations of the protein-ligand pair, and uses separate interactive attention pooling to obtain the second representation of the protein and the second representation of the ligand.
[0095] For the second characterization of proteins and ligands, a global fusion attention pooling approach is adopted to first obtain the fourth characterization of proteins and ligands (also known as the initial composite characterization). Then, the interaction information between the fourth characterization of proteins and ligands and the edge characterization is learned to obtain the third characterization of proteins and ligands (also known as the final composite characterization).
[0096] Figure 3 shows a schematic diagram of an attention pooling network. As shown in Figure 3, the attention pooling network consists of three main parts: a first semantic interaction sub-model, a second semantic interaction sub-model, and a third semantic interaction sub-model. The first semantic interaction sub-model is used to determine the second representation of the protein; the second semantic interaction sub-model is used to determine the second representation of the ligand; and the third semantic interaction sub-model is used to determine the third representation of the protein and the ligand.
[0097] First, the input protein atomic node representations and ligand atomic node representations are each summed to obtain the initial first representation of the protein and the initial first representation of the ligand. These two molecule-level representations can be viewed as two virtual nodes, connected to all atomic nodes within the protein and all atomic nodes within the ligand, respectively. Then, for each molecule, the interaction context information is learned by embedding the representation of the other molecule through an interaction-aware graph attention network. As shown in Figure 3, this stage involves learning through two channels: protein and ligand. The first semantic interaction sub-model provides the protein learning channel, and the aggregation process based on the interactive attention mechanism can be represented by the following formulas:
[0098]
[0099]
[0100] Here, the interaction attention weights are calculated using trainable model parameters. The computation takes into account the representation from the ligand, which provides the first contextual information for simulating the interaction phase. This allows the model to learn accurate weights that identify the atomic nodes valuable for the interaction. softmax() denotes the normalization function, LeakyReLU() denotes the activation function, T denotes the transpose, and t denotes the layer index. The remaining symbols denote the molecule-level representation, the representation a_{i,P} of an atom in the protein molecule, the weight of a_{i,P}, and the updated representation of a_{i,P}.
[0101] Because the ligand learning channel in the lower part of Figure 3 is completely symmetrical to the protein learning channel in the upper part, a similar process can be used to learn the second contextual information about how the protein acts on the ligand.
[0102] The first representation of the separated protein is updated using a GRU to obtain the second representation of the protein; likewise, the first representation of the separated ligand is updated using a GRU to obtain the second representation of the ligand:
[0103]
[0104]
[0105] Here, the symbols denote, respectively, the second representation of the protein, the second representation of the ligand, the first contextual information indicating how the ligand acts on the protein, and the second contextual information indicating how the protein acts on the ligand.
[0106] After the above stage of learning, stacking T_c layers of separate interactive attention pooling yields the protein representation and the ligand representation. The third semantic interaction sub-model employs a global fusion strategy to fuse the two separate molecular representations, resulting in a fourth representation (i.e., a composite molecular representation) h_0:
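The two learning channels described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this disclosure: the scoring form (the other molecule's representation concatenated with each atom, then LeakyReLU and softmax), the value projection W_val, the GRU gate convention, the use of separate parameters per channel, and all names and toy sizes are hypothetical choices made only for illustration.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, p):
    """One GRU step (one common convention): x is the context, h the state."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)             # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)             # reset gate
    h_new = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))   # candidate state
    return (1.0 - z) * h + z * h_new

def init_channel(d, rng):
    """Hypothetical parameter set for one learning channel."""
    p = {k: rng.normal(scale=0.1, size=(d, d))
         for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh", "W_val")}
    p["w_att"] = rng.normal(scale=0.1, size=2 * d)
    return p

def channel_step(atoms, h_self, h_other, p):
    """One pooling layer for one molecule.
    atoms: (n, d) atomic nodes; h_self, h_other: (d,) molecule-level vectors."""
    n = atoms.shape[0]
    # Pair every atom with the OTHER molecule's representation (assumed form).
    paired = np.concatenate([np.tile(h_other, (n, 1)), atoms], axis=1)
    alpha = softmax(leaky_relu(paired @ p["w_att"]))   # one weight per atom
    context = alpha @ (atoms @ p["W_val"].T)           # contextual information
    return gru_cell(context, h_self, p), alpha

rng = np.random.default_rng(0)
d, T_c = 4, 2                                          # toy sizes
atoms_P = rng.normal(size=(5, d))                      # toy protein atoms
atoms_L = rng.normal(size=(3, d))                      # toy ligand atoms
p_P, p_L = init_channel(d, rng), init_channel(d, rng)

# First representations: sums over the atomic nodes of each molecule.
h_P, h_L = atoms_P.sum(axis=0), atoms_L.sum(axis=0)
for _ in range(T_c):
    new_P, alpha_P = channel_step(atoms_P, h_P, h_L, p_P)
    new_L, alpha_L = channel_step(atoms_L, h_L, h_P, p_L)
    h_P, h_L = new_P, new_L                            # second representations
```

Both channels read the other molecule's representation from the previous layer before either is overwritten, which keeps the two channels symmetrical.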
[0107]
[0108] Here, θ represents the proportional weight of fusion, θ∈[0,1]. The larger the value of θ, the greater the contribution of the ligand to the fused representation. θ can be weighed based on data and task requirements to determine the degree of fusion. Since binding affinity originates from the interaction between the protein and the ligand, the interaction edges E_I between them are highly significant for prediction. Considering that the number of truly helpful interaction edges is limited, a recognition mechanism can be designed to select them. Therefore, the inventors propose a global fusion attention pooling layer to aggregate valuable interaction edge information. The entire attention pooling process that operates on the edge representations can be represented as:
[0109]
[0110]
[0111]
[0112] Here, LeakyReLU() denotes the activation function; h_{t-1} denotes the composite molecular representation after t-1 layers of fusion; h_t denotes the composite molecular representation (i.e., the third representation); and T denotes the transpose. The interaction attention weights are calculated using trainable model parameters, and the calculation takes into account the representation from the edges. e_{ij} denotes the representation of an intermolecular edge, that is, of one edge in the edge set E_I, and h_{t-1} || e_{ij} denotes the concatenation of the composite molecular representation and the edge representation. The remaining symbols denote the weight of e_{ij} and the updated representation of e_{ij}. softmax() denotes the normalization function, t denotes the layer index, and T_g denotes the maximum number of layers; GRU denotes a gated recurrent unit. The aggregated term represents the third contextual information used to simulate the interaction phase.
[0113] Here, the parameters involved are trainable model parameters, and the attention weight reflects the importance of the intermolecular edge e_{ij} for binding affinity prediction. It is worth noting that these weights can be used to interpret the binding affinity prediction results more accurately. Ranking the interaction edges by weight provides significant assistance in the drug discovery screening process. The GRU determines how much information comes from aggregating the interaction edges and how much information from the fused global representation is retained. Therefore, the proposed scheme not only achieves high prediction accuracy but also provides good interpretability.
[0114] This disclosure employs a prediction model which, compared to traditional experimental methods and physical simulations, has lower computational and time costs, enabling practical application in structure-based drug discovery. Compared to other machine learning and deep learning methods, the prediction model of this disclosure fully considers semantic interactions, exhibits strong generalization ability, enhances the learning capacity of the prediction model, improves the accuracy of binding affinity prediction, and provides interpretability for the prediction results.
[0115] It should be understood that the architecture diagram shown in Figure 2 and the attention pooling network diagram shown in Figure 3 are merely illustrative and not restrictive. Those skilled in the art may make various obvious changes and/or substitutions based on the examples of Figure 2 and Figure 3, and the resulting technical solutions still fall within the scope of this disclosure.
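The global fusion step and the edge-level attention pooling can be sketched in the same spirit. This is a hedged illustration only: the weighted sum used for fusion follows the description of θ above, while the scoring form over h_{t-1} || e_{ij}, the projection W_val, the GRU gate convention, and every name and toy size below are assumptions introduced for the example.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h, p):
    """One GRU step (one common convention): x is the context, h the state."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)             # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)             # reset gate
    h_new = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))   # candidate state
    return (1.0 - z) * h + z * h_new

def fuse(h_P, h_L, theta):
    """Weighted fusion; a larger theta gives the ligand a larger share."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return theta * h_L + (1.0 - theta) * h_P

def edge_pooling_step(h_prev, edges, p):
    """One global fusion attention pooling layer over the interaction edges.
    h_prev: (d,) composite representation; edges: (m, d_e)."""
    m = edges.shape[0]
    paired = np.concatenate([np.tile(h_prev, (m, 1)), edges], axis=1)  # h || e_ij
    beta = softmax(leaky_relu(paired @ p["w_att"]))    # one weight per edge
    context = beta @ (edges @ p["W_val"].T)            # third contextual info
    return gru_cell(context, h_prev, p), beta

rng = np.random.default_rng(1)
d, d_e, T_g = 4, 3, 2                                  # toy sizes
p = {k: rng.normal(scale=0.1, size=(d, d))
     for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
p["W_val"] = rng.normal(scale=0.1, size=(d, d_e))
p["w_att"] = rng.normal(scale=0.1, size=d + d_e)

h_P, h_L = rng.normal(size=d), rng.normal(size=d)      # stand-in second representations
edges = rng.normal(size=(6, d_e))                      # stand-in edge representations
h = fuse(h_P, h_L, theta=0.5)                          # fourth representation
for _ in range(T_g):
    h, beta = edge_pooling_step(h, edges, p)           # ends as third representation
ranking = np.argsort(-beta)                            # most important edges first
```

Sorting the edges by their final-layer weights, as in the last line, is one simple way to expose the ranking of interaction edges that the text associates with interpretability.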
[0116] Based on the prediction model trained using the above-described training method, this disclosure presents a binding affinity prediction method. This binding affinity prediction method is applied to electronic devices, including but not limited to computers, mobile phones, or tablet computers. This disclosure does not limit the type of electronic device. As shown in Figure 4, the binding affinity prediction method includes:
[0117] S401, receiving the first characterization of the target protein and the first characterization of the target ligand;
[0118] S402, input the first characterization of the target protein and the first characterization of the target ligand into the prediction model, and obtain the predicted value of the binding affinity between the target protein and the target ligand output by the prediction model.
[0119] Here, the prediction model is obtained using the training method for the prediction model shown in Figure 1.
[0120] This not only saves manpower and time in calculating binding affinity, but also improves the accuracy of predicting the binding affinity between proteins and ligands.
[0121] In some embodiments, the binding affinity prediction method may further include: obtaining atomic node characterizations of the target protein and the target ligand; summing the atomic node characterizations of the target protein and the target ligand respectively to obtain a first characterization of the target protein and a first characterization of the target ligand.
[0122] Here, the first characterization of the target protein may include the characterization of some of the atomic nodes included in the target protein, or it may include the characterization of all the atomic nodes included in the target protein.
[0123] Here, the first characterization of the target ligand may include the characterization of some of the atomic nodes included in the target ligand, or it may include the characterization of all the atomic nodes included in the target ligand.
[0124] Thus, by using atomic node characterization of the target protein and the target ligand, the first characterization of the target protein and the first characterization of the target ligand can be determined, and multiple first characterizations of the target protein and multiple first characterizations of the target ligand can be generated. This not only enriches the input data of the prediction model, but also improves the prediction speed of the prediction model.
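As a small sketch of steps S401 and S402 together with the summation described above, the following computes the first representations from atomic node representations and passes them to a model. The function names and the toy_model callable are hypothetical placeholders; toy_model merely stands in for a trained prediction model and is not the model of this disclosure.

```python
import numpy as np

def first_representation(atom_nodes):
    """Sum atomic node representations (n, d) into one molecule-level vector (d,)."""
    return np.asarray(atom_nodes, dtype=float).sum(axis=0)

def predict_affinity(model, protein_atoms, ligand_atoms):
    h_P = first_representation(protein_atoms)   # first representation of the target protein
    h_L = first_representation(ligand_atoms)    # first representation of the target ligand
    return model(h_P, h_L)                      # S402: the model outputs the predicted value

def toy_model(h_P, h_L):
    """Placeholder scoring function, used only to make the sketch runnable."""
    return float(h_P @ h_L)

protein_atoms = [[1.0, 2.0], [3.0, 4.0]]        # two toy protein atoms
ligand_atoms = [[0.5, 0.5]]                     # one toy ligand atom
pred = predict_affinity(toy_model, protein_atoms, ligand_atoms)
```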
[0125] In some embodiments, the binding affinity prediction method may further include: determining the edge representation of the target protein and the target ligand based on the atomic node representation of the target protein and the atomic node representation of the target ligand.
[0126] In this way, the edge characterization of the target protein and the target ligand can be provided to the prediction model, thereby helping to improve the prediction accuracy of the prediction model.
[0127] This disclosure provides an embodiment of a training apparatus for a prediction model. As shown in Figure 5, the training apparatus for the prediction model includes: a reinforcement learning module 501, used to perform reinforcement learning on the first representation of the protein and the first representation of the ligand respectively to obtain the second representation of the protein and the second representation of the ligand; a fusion learning module 502, used to perform fusion learning on the second representation of the protein and the second representation of the ligand to obtain the third representation of the protein and the ligand, wherein the third representation is a composite molecular representation of the protein and the ligand; a first prediction module 503, used to predict the binding affinity of the protein and the ligand based on the third representation of the protein and the ligand to obtain the predicted value of the binding affinity; a construction module 504, used to construct a loss function based on the predicted value of the binding affinity and the true value of the binding affinity; and a training module 505, used to train a prediction model for predicting the binding affinity based on the loss function.
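This passage does not state how the edge characterization is derived from the atomic nodes. Purely as an illustration, the sketch below assumes that each atomic node carries 3D coordinates and that an interaction edge links a protein atom and a ligand atom lying within a distance cutoff; the cutoff value, the use of distance as the edge feature, and the function name are all assumptions rather than part of the disclosed method.

```python
import numpy as np

def interaction_edges(pos_P, pos_L, cutoff=5.0):
    """Return (i, j, distance) for each protein atom i and ligand atom j
    whose Euclidean distance is at most `cutoff` (assumed criterion)."""
    pos_P = np.asarray(pos_P, dtype=float)
    pos_L = np.asarray(pos_L, dtype=float)
    # Pairwise distances via broadcasting: result has shape (n_P, n_L).
    dist = np.linalg.norm(pos_P[:, None, :] - pos_L[None, :, :], axis=-1)
    i_idx, j_idx = np.nonzero(dist <= cutoff)
    return [(int(i), int(j), float(dist[i, j])) for i, j in zip(i_idx, j_idx)]

pos_P = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]   # toy protein atom coordinates
pos_L = [[3.0, 0.0, 0.0]]                     # toy ligand atom coordinates
edges = interaction_edges(pos_P, pos_L)
```

Every edge produced this way has one end in the protein and the other in the ligand, matching the definition of the interaction edge set E_I.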
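The roles of the prediction, construction, and training modules can be illustrated with a toy training loop. This is only a sketch under assumptions: the loss is taken to be mean squared error (the text says only that the loss is built from predicted and true values), the predictor is a linear readout, and the "third representations" and affinities are synthetic stand-ins rather than real data.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(32, 4))                 # stand-in third representations
y = H @ np.array([0.5, -1.0, 2.0, 0.0]) + 1.0  # synthetic "true" affinities

def predict(H, w, b):
    """Prediction step: linear readout of the composite representation."""
    return H @ w + b

def mse_loss(pred, target):
    """Loss construction step: mean squared error (assumed loss form)."""
    return float(np.mean((pred - target) ** 2))

w, b, lr = np.zeros(4), 0.0, 0.1
initial_loss = mse_loss(predict(H, w, b), y)
for _ in range(200):                         # training step: plain gradient descent
    err = predict(H, w, b) - y
    w -= lr * (2.0 / len(y)) * (H.T @ err)   # gradient of MSE with respect to w
    b -= lr * 2.0 * err.mean()               # gradient of MSE with respect to b
final_loss = mse_loss(predict(H, w, b), y)
```

In a full model the gradient would also flow back through the fusion and attention pooling stages; the linear readout keeps the example short.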
[0128] In some embodiments, the training apparatus for the prediction model further includes: a first acquisition module 506 (not shown in the figure) for acquiring a complex molecular map of the protein and ligand; and a feature separation module 507 (not shown in the figure) for performing feature separation on the complex molecular map to obtain a first characterization of the protein and a first characterization of the ligand.
[0129] In some embodiments, the first characterization of the protein includes the atomic node characterization of the protein; the first characterization of the ligand includes the atomic node characterization of the ligand; the training apparatus for the prediction model further includes: a first determination module 508 (not shown in the figure), for determining the edge characterization of the protein and the ligand based on the atomic node characterization of the protein and the atomic node characterization of the ligand.
[0130] In some embodiments, the reinforcement learning module 501 includes: a first reinforcement learning submodule, configured to embed a first representation of a ligand into a first representation of a protein through a first semantic interaction learning submodel in a prediction model, thereby learning first contextual information about the ligand's effect on the protein; and to obtain a second representation of the protein based on the first contextual information and the first representation of the protein; and a second reinforcement learning submodule, configured to embed the first representation of the protein into the first representation of the ligand through a second semantic interaction learning submodel in a prediction model, thereby learning second contextual information about the protein's effect on the ligand; and to obtain a second representation of the ligand based on the second contextual information and the first representation of the ligand.
[0131] In some embodiments, the fusion learning module 502 includes: a determination submodule for determining the fusion weights corresponding to the protein and the ligand respectively; a first fusion submodule for performing fusion learning on the second representation of the protein and the second representation of the ligand based on the fusion weights corresponding to the protein and the ligand respectively, to obtain a fourth representation of the protein and the ligand; and a second fusion submodule for performing fusion learning on the fourth representation of the protein and the ligand and the edge representation of the protein and the ligand respectively, to obtain a third representation of the protein and the ligand.
[0132] In some embodiments, the second fusion submodule is configured to: embed the edge representation of the protein and ligand into the fourth representation of the protein and ligand through the third semantic interaction learning submodel in the prediction model, and learn the third contextual information that the edge representation of the protein and ligand plays a role in the binding of the protein and ligand; and obtain the third representation of the protein and ligand based on the third contextual information and the fourth representation of the protein and ligand.
[0133] Those skilled in the art should understand that the functions of each processing module in the training device of the prediction model in the embodiments of this disclosure can be understood with reference to the relevant description of the training method of the prediction model described above. Each processing module in the training device of the prediction model in the embodiments of this disclosure can be implemented by an analog circuit that implements the functions described in the embodiments of this disclosure, or it can be implemented by running software that performs the functions described in the embodiments of this disclosure on an electronic device.
[0134] The training device for the prediction model in this embodiment can improve the training speed and accuracy of the prediction model, and also improve the accuracy of the prediction of the binding affinity.
[0135] This disclosure provides an embodiment of a binding affinity prediction device. As shown in Figure 6, the binding affinity prediction device includes: a receiving module 601 for receiving a first characterization of the target protein and a first characterization of the target ligand; and a second prediction module 602 for inputting the first characterization of the target protein and the first characterization of the target ligand into a prediction model to obtain the predicted value of the binding affinity between the target protein and the target ligand output by the prediction model.
[0136] In some embodiments, the binding affinity prediction device further includes: a second acquisition module 603 (not shown in the figure) for acquiring atomic node characterization of the target protein and atomic node characterization of the target ligand; and a second determination module 604 (not shown in the figure) for summing the atomic node characterization of the target protein and the atomic node characterization of the target ligand respectively to obtain a first characterization of the target protein and a first characterization of the target ligand.
[0137] In some embodiments, the binding affinity prediction device further includes: a third determining module 605 (not shown in the figure), for determining the edge characterization of the target protein and the target ligand based on the atomic node characterization of the target protein and the atomic node characterization of the target ligand.
[0138] Those skilled in the art should understand that the functions of each processing module in the binding affinity prediction device of this disclosure can be understood with reference to the relevant description of the binding affinity prediction method described above. Each processing module in the binding affinity prediction device of this disclosure can be implemented by an analog circuit that implements the functions described in the embodiments of this disclosure, or by running software that performs the functions described in the embodiments of this disclosure on an electronic device.
[0139] The binding affinity prediction device of this embodiment can improve the prediction efficiency of protein-ligand binding affinity and also improve the prediction accuracy of protein-ligand binding affinity.
[0140] This disclosure also provides a schematic diagram of a model training scenario. As shown in Figure 7, an electronic device, such as a cloud server, mines protein-ligand complex molecular maps and ground truth values of binding affinity between proteins and ligands from multiple data sources based on training tasks sent by terminals. Here, training tasks sent by different terminals can be for different proteins and different ligands. The electronic device performs reinforcement learning on the first representation of the protein and the first representation of the ligand separately to obtain the second representation of the protein and the second representation of the ligand; it then performs fusion learning on the second representation of the protein and the second representation of the ligand to obtain the third representation of the protein and ligand; based on the third representation of the protein and ligand, it predicts the binding affinity of the protein and ligand to obtain the predicted binding affinity value; it constructs a loss function based on the predicted binding affinity value and the ground truth binding affinity value; and it trains a prediction model for predicting binding affinity based on the loss function. The electronic device returns the trained prediction model to the terminal.
[0141] This disclosure also provides a schematic diagram of a scenario for predicting binding affinity. As shown in Figure 8, an electronic device, such as a cloud server, receives the protein and ligand to be predicted sent by the terminal, determines the predicted value of the binding affinity between the protein and ligand based on the protein and ligand sent by the terminal, and returns the predicted value of the binding affinity between the protein and ligand to the terminal.
[0142] This disclosure does not limit the number of terminals and electronic devices; in practical applications, it may include multiple terminals and multiple electronic devices.
[0143] It should be understood that the scenario diagrams shown in Figure 7 and Figure 8 are merely illustrative and not restrictive. Those skilled in the art may make various obvious changes and/or substitutions based on the examples of Figure 7 and Figure 8, and the resulting technical solutions still fall within the scope of this disclosure.
[0144] The acquisition, storage, and application of user personal information involved in the technical solution disclosed herein comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
[0145] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.
[0146] Figure 9 shows a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and/or claimed herein.
[0147] As shown in Figure 9, device 900 includes a computing unit 901, which can perform various appropriate actions and processes based on a computer program stored in read-only memory (ROM) 902 or a computer program loaded from storage unit 908 into random access memory (RAM) 903. RAM 903 may also store various programs and data required for the operation of device 900. The computing unit 901, ROM 902, and RAM 903 are interconnected via bus 904. Input/output (I/O) interface 905 is also connected to bus 904.
[0148] Multiple components in device 900 are connected to I/O interface 905, including: input unit 906, such as a keyboard or mouse; output unit 907, such as various types of monitors or speakers; storage unit 908, such as a magnetic disk or optical disk; and communication unit 909, such as a network card, modem, or wireless transceiver. Communication unit 909 allows device 900 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
[0149] The computing unit 901 can be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, digital signal processors (DSPs), and any suitable processor, controller, or microcontroller. The computing unit 901 performs the various methods and processes described above, such as the training method for the prediction model and/or the binding affinity prediction method. For example, in some embodiments, the training method for the prediction model and/or the binding affinity prediction method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed on device 900 via ROM 902 and/or communication unit 909. When the computer program is loaded into RAM 903 and executed by computing unit 901, one or more steps of the training method for the prediction model and/or the binding affinity prediction method described above can be performed. Alternatively, in other embodiments, computing unit 901 can be configured by any other suitable means (e.g., by means of firmware) to perform the training method for the prediction model and/or the binding affinity prediction method.
[0150] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
[0151] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.
[0152] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory, read-only memory, erasable programmable read-only memory (EPROM), flash memory, optical fiber, compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0153] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device (e.g., a cathode ray tube (CRT) or liquid crystal display (LCD) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).
[0154] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
[0155] Computer systems can include clients and servers. Clients and servers are generally located far apart and typically interact via communication networks. Client-server relationships are created by computer programs running on the respective computers and having a client-server relationship with each other. Servers can be cloud servers, servers in distributed systems, or servers incorporating blockchain technology.
[0156] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed herein can be achieved; no limitation is imposed in this regard.
[0157] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.
Claims
1. A method for training a prediction model, comprising: performing reinforcement learning on a first characterization of a protein and a first characterization of a ligand, respectively, to obtain a second characterization of the protein and a second characterization of the ligand; performing fusion learning on the second characterization of the protein and the second characterization of the ligand to obtain a third characterization of the protein and the ligand, wherein the third characterization is a composite molecular characterization of the protein and the ligand; predicting the binding affinity between the protein and the ligand based on the third characterization of the protein and the ligand to obtain a predicted value of the binding affinity; constructing a loss function based on the predicted value of the binding affinity and the true value of the binding affinity; and training, based on the loss function, to obtain a prediction model for predicting binding affinity; wherein performing reinforcement learning on the first characterization of the protein and the first characterization of the ligand, respectively, to obtain the second characterization of the protein and the second characterization of the ligand comprises: embedding the first characterization of the ligand into the first characterization of the protein through a first semantic interaction learning sub-model in the prediction model to learn first contextual information of the ligand’s effect on the protein; obtaining the second characterization of the protein based on the first contextual information and the first characterization of the protein; embedding the first characterization of the protein into the first characterization of the ligand through a second semantic interaction learning sub-model in the prediction model to learn second contextual information of the protein’s effect on the ligand; and obtaining the second characterization of the ligand based on the second contextual information and the first characterization of the ligand.
2. The method according to claim 1, further comprising: obtaining a complex molecular map of the protein and the ligand; and performing feature separation on the complex molecular map to obtain the first characterization of the protein and the first characterization of the ligand.
3. The method according to claim 1 or 2, wherein the first characterization of the protein includes the atomic node characterization of the protein; the first characterization of the ligand includes the atomic node characterization of the ligand; and the method further comprises: determining the edge characterization of the protein and the ligand based on the atomic node characterization of the protein and the atomic node characterization of the ligand.
4. The method according to claim 3, wherein performing fusion learning on the second characterization of the protein and the second characterization of the ligand to obtain the third characterization of the protein and the ligand comprises: determining the fusion weights corresponding to the protein and the ligand, respectively; performing fusion learning on the second characterization of the protein and the second characterization of the ligand based on the fusion weights corresponding to the protein and the ligand, respectively, to obtain a fourth characterization of the protein and the ligand; and performing fusion learning on the fourth characterization of the protein and the ligand and the edge characterization of the protein and the ligand to obtain the third characterization of the protein and the ligand.
5. The method according to claim 4, wherein performing fusion learning on the fourth characterization of the protein and the ligand and the edge characterization of the protein and the ligand to obtain the third characterization of the protein and the ligand comprises: embedding the edge characterization of the protein and the ligand into the fourth characterization of the protein and the ligand through a third semantic interaction learning sub-model in the prediction model to learn third contextual information of the effect of the edge characterization of the protein and the ligand on the binding of the protein and the ligand; and obtaining the third characterization of the protein and the ligand based on the third contextual information and the fourth characterization of the protein and the ligand.
6. A method for predicting affinity, comprising: receiving a first characterization of a target protein and a first characterization of a target ligand; and inputting the first characterization of the target protein and the first characterization of the target ligand into a prediction model to obtain a predicted value of the binding affinity between the target protein and the target ligand output by the prediction model; wherein the prediction model is obtained using the training method of the prediction model according to any one of claims 1 to 5.
7. The method according to claim 6, further comprising: obtaining an atomic node characterization of the target protein and an atomic node characterization of the target ligand; and summing the atomic node characterization of the target protein and the atomic node characterization of the target ligand, respectively, to obtain the first characterization of the target protein and the first characterization of the target ligand.
8. The method according to claim 7, further comprising: determining an edge characterization of the target protein and the target ligand based on the atomic node characterization of the target protein and the atomic node characterization of the target ligand.
9. A training device for a prediction model, comprising: a reinforcement learning module, used to perform reinforcement learning on a first characterization of a protein and a first characterization of a ligand, respectively, to obtain a second characterization of the protein and a second characterization of the ligand; a fusion learning module, used to perform fusion learning on the second characterization of the protein and the second characterization of the ligand to obtain a third characterization of the protein and the ligand, wherein the third characterization is a composite molecular characterization of the protein and the ligand; a first prediction module, used to predict the binding affinity between the protein and the ligand based on the third characterization of the protein and the ligand, to obtain a predicted value of the binding affinity; a construction module, used to construct a loss function based on the predicted value and the true value of the binding affinity; and a training module, used to train a prediction model for predicting binding affinity based on the loss function; wherein the reinforcement learning module includes: a first reinforcement learning submodule, used to embed the first characterization of the ligand into the first characterization of the protein through a first semantic interaction learning sub-model in the prediction model, learn first context information of the effect of the ligand on the protein, and obtain the second characterization of the protein based on the first context information and the first characterization of the protein; and a second reinforcement learning submodule, used to embed the first characterization of the protein into the first characterization of the ligand through a second semantic interaction learning sub-model in the prediction model, learn second context information of the effect of the protein on the ligand, and obtain the second characterization of the ligand based on the second context information and the first characterization of the ligand.
10. The device according to claim 9, further comprising: a first acquisition module, used to acquire a composite molecular graph of the protein and the ligand; and a feature separation module, used to perform feature separation on the composite molecular graph to obtain the first characterization of the protein and the first characterization of the ligand.
11. The device according to claim 9 or 10, wherein the first characterization of the protein includes an atomic node characterization of the protein; the first characterization of the ligand includes an atomic node characterization of the ligand; and the device further includes: a first determining module, used to determine an edge characterization of the protein and the ligand based on the atomic node characterization of the protein and the atomic node characterization of the ligand.
12. The device according to claim 11, wherein the fusion learning module includes: a determination submodule, used to determine fusion weights corresponding to the protein and the ligand, respectively; a first fusion submodule, used to perform fusion learning on the second characterization of the protein and the second characterization of the ligand based on the fusion weights corresponding to the protein and the ligand, respectively, to obtain a fourth characterization of the protein and the ligand; and a second fusion submodule, used to perform fusion learning on the fourth characterization of the protein and the ligand and the edge characterization of the protein and the ligand to obtain the third characterization of the protein and the ligand.
13. The device according to claim 12, wherein the second fusion submodule is used to: embed the edge characterization of the protein and the ligand into the fourth characterization of the protein and the ligand through a third semantic interaction learning sub-model in the prediction model, and learn third context information of the effect of the edge characterization of the protein and the ligand on the binding of the protein and the ligand; and obtain the third characterization of the protein and the ligand based on the third context information and the fourth characterization of the protein and the ligand.
14. A device for predicting affinity, comprising: a receiving module, used to receive a first characterization of a target protein and a first characterization of a target ligand; and a second prediction module, used to input the first characterization of the target protein and the first characterization of the target ligand into a prediction model to obtain a predicted value of the binding affinity between the target protein and the target ligand output by the prediction model; wherein the prediction model is obtained using the training method of the prediction model according to any one of claims 1 to 5.
15. The device according to claim 14, further comprising: a second acquisition module, used to acquire an atomic node characterization of the target protein and an atomic node characterization of the target ligand; and a second determining module, used to sum the atomic node characterization of the target protein and the atomic node characterization of the target ligand, respectively, to obtain the first characterization of the target protein and the first characterization of the target ligand.
16. The device according to claim 15, further comprising: a third determining module, used to determine an edge characterization of the target protein and the target ligand based on the atomic node characterization of the target protein and the atomic node characterization of the target ligand.
17. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method according to any one of claims 1-8.
18. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-8.
19. A computer program product comprising a computer program that, when executed by a processor, implements the method according to any one of claims 1-8.
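
The data flow recited in the method claims (mutual embedding of the protein and ligand characterizations, weighted fusion, edge fusion, prediction, and loss construction) can be illustrated with a minimal sketch. It is not the claimed implementation. Every function name below is hypothetical, and several modeling choices are assumptions made only for illustration: the semantic interaction learning sub-models are stood in for by plain scaled dot-product attention with a residual connection and no learned projections; the edge characterization is taken as the element-wise product of each protein-ligand atom pair; the fusion weights come from a softmax over two logits; the second characterizations are sum-pooled before fusion; and the prediction head is linear. Gradient-based training on the loss is omitted.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(queries, contexts):
    # Embed `contexts` into `queries`: each query attends over all context
    # vectors (scaled dot-product attention), and the attended context
    # information is added back onto the query (residual connection).
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, c) / scale for c in contexts])
        context_info = [sum(w * c[k] for w, c in zip(weights, contexts))
                        for k in range(len(q))]
        out.append([qk + ck for qk, ck in zip(q, context_info)])
    return out

def edge_characterization(protein_nodes, ligand_nodes):
    # Assumption: one edge vector per protein-ligand atom pair, formed as the
    # element-wise product of the two atom vectors.
    return [[p * l for p, l in zip(pa, la)]
            for pa in protein_nodes for la in ligand_nodes]

def sum_pool(nodes):
    # Sum atom vectors into a single molecule-level vector.
    return [sum(col) for col in zip(*nodes)]

def predict(protein_nodes, ligand_nodes, params):
    # Second characterizations: each molecule is enhanced with context
    # information about the other.
    protein_2 = cross_attend(protein_nodes, ligand_nodes)
    ligand_2 = cross_attend(ligand_nodes, protein_nodes)
    # Fusion weights for the protein and the ligand (assumed: softmax of two logits).
    w_protein, w_ligand = softmax(params["fusion_logits"])
    # Fourth characterization: weighted fusion of the pooled second characterizations.
    fourth = [w_protein * p + w_ligand * l
              for p, l in zip(sum_pool(protein_2), sum_pool(ligand_2))]
    # Third characterization: embed the edge characterization into the fourth.
    edges = edge_characterization(protein_nodes, ligand_nodes)
    third = cross_attend([fourth], edges)[0]
    # Predicted binding affinity from an assumed linear head.
    return dot(third, params["head_w"]) + params["head_b"]

def squared_error(predicted, true_value):
    # Loss built from the predicted and true binding affinity values.
    return (predicted - true_value) ** 2
```

In this sketch, `protein_nodes` and `ligand_nodes` are lists of equal-length atom vectors, and `params` is a hypothetical dictionary holding `fusion_logits` (two floats), `head_w` (one float per feature), and `head_b`. A call such as `squared_error(predict(protein_nodes, ligand_nodes, params), true_value)` yields the scalar that a training loop would minimize.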