A method, device, medium and program product for establishing a target protein acidity coefficient prediction model

CN120853685B · Active · Publication Date: 2026-06-26 · SHANGHAI MOLECULAR HEART INTELLIGENT TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANGHAI MOLECULAR HEART INTELLIGENT TECH CO LTD
Filing Date
2025-06-24
Publication Date
2026-06-26


Abstract

The purpose of the present application is to provide a method, device, medium and program product for establishing a target protein acidity coefficient prediction model, which comprises: obtaining a training sample data set comprising a plurality of protein information and corresponding acidity coefficient information by using a plurality of protein acidity coefficient prediction models, wherein the protein information comprises protein structure information and sequence information; obtaining first acidity coefficient prediction information by using a target protein acidity coefficient prediction model based on the protein information; calculating corresponding loss information based on the first acidity coefficient prediction information and the corresponding acidity coefficient information to update the model parameters and obtain a target model. The present application is trained by a large-scale training set obtained by a plurality of protein acidity coefficient prediction models, which helps to improve the generalization ability and stability of the model. The protein structure information is also used for prediction, so that the model can grasp more surrounding information of the target amino acid in prediction, which helps to improve the prediction accuracy.

Description

Technical Field

[0001] This application relates to the field of bioinformatics technology, and in particular to a technique for establishing a predictive model for the acidity coefficient of a target protein.

Background Technology

[0002] Protein acidity coefficient (pKa) is used to quantitatively describe the tendency of ionizable groups of specific amino acids in a protein to lose or gain protons in solution. Accurate prediction of the acidity coefficient of specific amino acids in a protein is of great significance for the design and optimization of proteins that need to function at specific pH levels. For example, antibodies need to maintain affinity for specific substances in the tumor microenvironment (which is slightly more acidic than the normal physiological environment, with a pH of approximately 6); alkaline proteases in detergents need to remain stable in an environment with a pH of around 10.

[0003] Currently, many methods for predicting protein acidity coefficients are based on machine learning or use language models, such as PypKa (https://pypka.org), PROPKA (https://github.com/jensengroup/propka), or pKAI (https://pypi.org/project/pkai/). These models are typically trained on limited experimentally measured protein pKa datasets, and their generalization ability is often limited.

Summary of the Invention

[0004] One object of this application is to provide a method, apparatus, medium, and program product for establishing a predictive model of the acidity coefficient of a target protein.

[0005] According to one aspect of this application, a method for establishing a prediction model for the acidity coefficient of a target protein is provided, the method comprising:

[0006] A training sample dataset is obtained by using multiple protein acidity coefficient prediction models. The training sample dataset includes information on multiple proteins and acidity coefficient information corresponding to the protein information. The protein information includes protein structure information and protein sequence information.

[0007] Based on the protein information, the first acidity coefficient prediction information corresponding to the protein information is obtained using the target protein acidity coefficient prediction model.

[0008] Based on the first acidity coefficient prediction information and the acidity coefficient information corresponding to the protein information, the corresponding loss information is calculated;

[0009] Based on the loss information, the model parameters of the target protein acidity coefficient prediction model are updated to obtain the target protein acidity coefficient prediction model.

[0010] According to one aspect of this application, a computer device for establishing a prediction model of the acidity coefficient of a target protein is provided, comprising a memory, a processor, and a computer program stored in the memory, characterized in that the processor executes the computer program to implement the steps of any of the methods described above.

[0011] According to one aspect of this application, a computer-readable storage medium is provided having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of any of the methods described above.

[0012] According to one aspect of this application, a computer program product is provided, comprising a computer program, characterized in that, when executed by a processor, the computer program implements the steps of any of the methods described above.

[0013] According to one aspect of this application, an apparatus for establishing a prediction model of the acidity coefficient of a target protein is provided, the apparatus comprising:

[0014] A one-one module, used to obtain a training sample dataset by using multiple protein acidity coefficient prediction models, wherein the training sample dataset includes multiple protein information and acidity coefficient information corresponding to the protein information, and the protein information includes protein structure information and protein sequence information;

[0015] A one-two module, used to obtain, based on the protein information and using the target protein acidity coefficient prediction model, the first acidity coefficient prediction information corresponding to the protein information;

[0016] A one-three module, used to calculate the corresponding loss information based on the first acidity coefficient prediction information and the acidity coefficient information corresponding to the protein information;

[0017] A one-four module, used to update the model parameters of the target protein acidity coefficient prediction model based on the loss information, so as to obtain the target protein acidity coefficient prediction model.

[0018] Compared with existing technologies, this application utilizes multiple protein acidity coefficient prediction models to obtain a training sample dataset. The training sample dataset includes multiple protein information and corresponding acidity coefficient information for each protein. The protein information includes protein structure information and protein sequence information. Based on the protein information, a target protein acidity coefficient prediction model is used to obtain first acidity coefficient prediction information corresponding to the protein information. Based on the first acidity coefficient prediction information and the acidity coefficient information corresponding to the protein information, corresponding loss information is calculated. Based on the loss information, the model parameters of the target protein acidity coefficient prediction model are updated to obtain the target protein acidity coefficient prediction model. By obtaining a large-scale training sample dataset through multiple protein acidity coefficient prediction models, the size of the model's training dataset is significantly increased, which helps improve the model's generalization ability and stability. Furthermore, this application also uses protein structure information for acidity coefficient prediction, enabling the model to capture more information about the surroundings of the target amino acid during prediction, thus helping to improve prediction accuracy.

Description of the Drawings

[0019] Other features, objects, and advantages of this application will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:

[0020] Figure 1 shows a flowchart of a method for establishing a target protein acidity coefficient prediction model according to an embodiment of this application.

[0021] Figure 2 shows a schematic diagram of the architecture of a target protein acidity coefficient prediction model according to an embodiment of this application.

[0022] Figure 3 shows the structure of a device for establishing a target protein acidity coefficient prediction model according to an embodiment of this application.

[0023] Figure 4 shows an exemplary system that can be used to implement the various embodiments described in this application.

[0024] The same or similar reference numerals in the accompanying drawings represent the same or similar parts.

Detailed Description

[0025] The present application will now be described in further detail with reference to the accompanying drawings.

[0026] In a typical configuration of this application, the terminal, the device of the service network, and the trusted party all include one or more processors (e.g., a central processing unit (CPU)), input/output interfaces, network interfaces, and memory.

[0027] Memory may include non-persistent storage in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory. Memory is an example of computer-readable media.

[0028] Computer-readable media, including both permanent and non-permanent, removable and non-removable media, can store information using any method or technology. Information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PCM), programmable random access memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

[0029] The devices referred to in this application include, but are not limited to, user equipment, network equipment, or devices composed of user equipment and network equipment integrated through a network. The user equipment includes, but is not limited to, any mobile electronic product capable of human-computer interaction (e.g., via a touchpad), such as smartphones and tablets. These mobile electronic products can use any operating system, such as Android or iOS. The network equipment includes an electronic device capable of automatically performing numerical calculations and information processing according to pre-set or stored instructions. Its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), and embedded devices. The network equipment includes, but is not limited to, computers, network hosts, single network servers, multiple network server clusters, or clouds composed of multiple servers. Here, a cloud consists of a large number of computers or network servers based on cloud computing, where cloud computing is a type of distributed computing, consisting of a virtual supercomputer composed of a group of loosely coupled computer clusters. The network includes, but is not limited to, the Internet, wide area network, metropolitan area network, local area network, VPN network, wireless ad hoc network, etc. Preferably, the device can also be a program running on the user equipment, network device, or a device formed by integrating user equipment and network device, network device, touch terminal, or network device and touch terminal through a network.

[0030] Of course, those skilled in the art should understand that the above-described devices are merely examples, and other existing or future devices that are applicable to this application should also be included within the scope of protection of this application, and are hereby incorporated by reference.

[0031] In the description of this application, "multiple" means two or more, unless otherwise expressly and specifically defined.

[0032] Figure 1 shows a flowchart of a method for establishing a target protein acidity coefficient prediction model according to an embodiment of this application. The method includes steps S11, S12, S13, and S14. In step S11, device 1 uses multiple protein acidity coefficient prediction models to obtain a training sample dataset, wherein the training sample dataset includes multiple protein information and acidity coefficient information corresponding to the protein information, the protein information including protein structure information and protein sequence information; in step S12, device 1 uses the target protein acidity coefficient prediction model based on the protein information to obtain first acidity coefficient prediction information corresponding to the protein information; in step S13, device 1 calculates corresponding loss information based on the first acidity coefficient prediction information and the acidity coefficient information corresponding to the protein information; in step S14, device 1 updates the model parameters of the target protein acidity coefficient prediction model based on the loss information to obtain the target protein acidity coefficient prediction model.

[0033] In some embodiments, the device 1 includes, but is not limited to, user equipment or network equipment with information processing or computing capabilities, such as tablet computers, computers, and servers.

[0034] In step S11, device 1 uses multiple protein acidity coefficient prediction models to obtain a training sample dataset. The training sample dataset includes information on multiple proteins and corresponding acidity coefficient information. The protein information includes protein structure information and protein sequence information. In some embodiments, the multiple protein acidity coefficient prediction models include, but are not limited to, existing or future models that can be used to predict protein acidity coefficients, such as PypKa (https://pypka.org), pKAI (https://pypi.org/project/pkai/), pKALM (https://onodalab.ees.hokudai.ac.jp/pkalm), and PROPKA (https://github.com/jensengroup/propka). Based on the protein information corresponding to all existing proteins with crystal structures, acidity coefficient prediction is performed using these protein acidity coefficient prediction models to obtain the acidity coefficient corresponding to the target amino acid (i.e., the amino acid in the protein that may lose or gain a proton) in each protein, thereby determining the corresponding training sample dataset. This training sample dataset is much larger than the experimentally measured protein acidity coefficient dataset and covers a more comprehensive range of protein types. In some embodiments, the acidity coefficient information corresponding to the protein information includes the acidity coefficient information corresponding to each target amino acid in the protein.

[0035] In some embodiments, step S11 includes: step S111 (not shown), where device 1 obtains multiple second acidity coefficient prediction information corresponding to each protein information by using multiple protein acidity coefficient prediction models based on multiple protein information; step S112 (not shown), where device 1 determines the acidity coefficient information corresponding to each protein information based on the multiple second acidity coefficient prediction information corresponding to each protein information; and step S113 (not shown), where device 1 determines the corresponding training sample dataset based on the multiple protein information and the acidity coefficient information corresponding to each protein information.

[0036] For example, protein information is input into each of the aforementioned protein acidity coefficient prediction models to obtain the second acidity coefficient prediction information predicted by each model for that protein information. This second acidity coefficient prediction information includes the predicted acidity coefficients of each target amino acid in the protein corresponding to the protein information. Since the datasets used for training each protein acidity coefficient prediction model differ, the quality of their acidity coefficient predictions for each protein information also varies. To improve the accuracy and reliability of the training sample dataset, in cases where multiple protein acidity coefficient prediction models label the same protein information, this application can also apply a multi-annotator consensus model, such as the Dawid-Skene model, GLAD (Generative Model of Labels, Abilities, and Difficulties), or LAGNN (Label Aggregation Graph Neural Network), to the second acidity coefficient prediction information from each model to obtain the predicted true label corresponding to the protein information, i.e., the acidity coefficient information corresponding to the protein information. The protein information and its corresponding acidity coefficient information are used as training samples for the target protein acidity coefficient prediction model, forming the corresponding training sample dataset. In some embodiments, the confidence level of the acidity coefficient information corresponding to the protein information can be obtained using the aforementioned multi-annotator consensus model, and the weight of the training sample can be determined based on the confidence level. The higher the confidence level of the acidity coefficient information corresponding to the protein information, the higher the weight of the corresponding training sample, thereby improving the prediction accuracy and reliability of the trained target protein acidity coefficient prediction model.
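For illustration, the label aggregation and confidence weighting described above can be sketched as follows. This is not one of the consensus models named in the text: those models are not reproduced here, and a much simpler stand-in is used instead, namely the median of the per-model predictions as the consensus label and a confidence derived from the median absolute deviation. The function name `aggregate_pka`, the `scale` parameter, the model names, and all numbers are hypothetical.

```python
from statistics import median


def aggregate_pka(predictions, scale=1.0):
    """Combine per-model pKa predictions for one target residue.

    predictions: dict mapping a model name to its predicted pKa (float).
    Returns (consensus_pka, confidence) with confidence in (0, 1].
    Stand-in for a multi-annotator consensus model (assumption).
    """
    values = list(predictions.values())
    consensus = median(values)
    # Median absolute deviation: small when the models agree.
    spread = median(abs(v - consensus) for v in values)
    # Map the spread to a weight; full agreement gives confidence 1.0.
    confidence = 1.0 / (1.0 + spread / scale)
    return consensus, confidence


# Hypothetical predictions for a single residue (made-up numbers).
preds = {"model_a": 3.9, "model_b": 4.1, "model_c": 4.0, "model_d": 5.2}
label, weight = aggregate_pka(preds)
```

The returned `weight` could then be used as the per-sample training weight, so that residues on which the labeling models disagree contribute less to the loss.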

[0037] In some embodiments, step S113 includes: device 1, based on the protein sequence similarity information among the plurality of protein information, dividing the plurality of protein information and the acidity coefficient information corresponding to each protein information into corresponding training sample datasets and validation sample datasets, wherein the protein sequence similarity information between the protein information in the training sample dataset and the protein information in the validation sample dataset is less than a corresponding similarity threshold. For example, based on the protein sequence similarity information, the plurality of protein information and the acidity coefficient information corresponding to each protein information can be divided into multiple clusters, and the protein sequence similarity information between different clusters is less than a set similarity threshold. A portion of the clusters can be randomly assigned to the training sample dataset according to a set data ratio, and the remaining clusters can be assigned to the validation sample dataset. In this embodiment, assigning a cluster as an indivisible unit to the training sample dataset or the validation sample dataset can avoid overestimating the model's generalization ability due to homologous sequence leakage.
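A minimal sketch of the cluster-level split described above, assuming the clustering by sequence similarity has already been performed by some external tool and is available as a mapping from protein identifier to cluster identifier. The function name `split_by_cluster`, the identifiers, and the 80/20 ratio are hypothetical. Note that the ratio here is applied to the number of clusters, not to the number of samples.

```python
import random
from collections import defaultdict


def split_by_cluster(cluster_of, train_fraction=0.8, seed=0):
    """Assign whole clusters to the training or validation set.

    cluster_of: dict mapping a protein id to its cluster id.
    Returns (train_ids, valid_ids); no cluster is ever split.
    """
    members = defaultdict(list)
    for protein_id, cluster_id in cluster_of.items():
        members[cluster_id].append(protein_id)

    # Shuffle cluster ids reproducibly, then cut at the requested ratio.
    cluster_ids = sorted(members)
    random.Random(seed).shuffle(cluster_ids)
    n_train = round(train_fraction * len(cluster_ids))

    train = [p for c in cluster_ids[:n_train] for p in members[c]]
    valid = [p for c in cluster_ids[n_train:] for p in members[c]]
    return train, valid


# Hypothetical cluster assignments for seven proteins.
cluster_of = {"p1": 0, "p2": 0, "p3": 1, "p4": 2, "p5": 2, "p6": 3, "p7": 4}
train_ids, valid_ids = split_by_cluster(cluster_of)
```

Because each cluster is handled as an indivisible unit, homologous sequences cannot appear on both sides of the split.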

[0038] In step S12, device 1 obtains the first acidity coefficient prediction information corresponding to the protein information using a target protein acidity coefficient prediction model based on the protein information. In step S13, device 1 calculates the corresponding loss information based on the first acidity coefficient prediction information and the acidity coefficient information corresponding to the protein information. In step S14, device 1 updates the model parameters of the target protein acidity coefficient prediction model based on the loss information to obtain the target protein acidity coefficient prediction model. For example, the target protein acidity coefficient prediction model is built based on a graph neural network architecture. A graph-based neural network is constructed based on protein structure information and protein sequence information to comprehensively describe the target amino acid and its surrounding environment information, and then outputs the first acidity coefficient prediction information through a multilayer perceptron (MLP) layer. The mean squared error (MSE) is used to calculate the loss information corresponding to the first acidity coefficient prediction information and the corresponding acidity coefficient information. The gradient direction of the loss information is calculated using the backpropagation algorithm, and then the model parameters of the target protein acidity coefficient prediction model are updated in combination with a corresponding optimizer (e.g., RMSprop optimization algorithm, or Adam optimization algorithm, etc.) to obtain the target protein acidity coefficient prediction model.
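The predict, compute-loss, update cycle of steps S12 to S14 can be illustrated with a deliberately small stand-in. The sketch below replaces the graph neural network and MLP with a linear model, and replaces backpropagation with RMSprop or Adam by plain full-batch gradient descent using the analytic MSE gradient; only the overall loop structure corresponds to the text. All function names and numbers are hypothetical.

```python
def predict(params, features):
    """Stand-in model: a linear map from a feature vector to a pKa value."""
    w, b = params
    return sum(wi * xi for wi, xi in zip(w, features)) + b


def mse_loss(params, samples):
    """Mean squared error between predictions and label pKa values."""
    return sum((predict(params, x) - y) ** 2 for x, y in samples) / len(samples)


def gradient_step(params, samples, lr=0.05):
    """One full-batch gradient-descent update of the model parameters."""
    w, b = params
    n = len(samples)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in samples:
        err = predict(params, x) - y
        for k, xk in enumerate(x):
            grad_w[k] += 2.0 * err * xk / n
        grad_b += 2.0 * err / n
    new_w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
    return new_w, b - lr * grad_b


# Toy samples (made-up numbers): (feature vector, label pKa).
samples = [([1.0, 0.0], 4.0), ([0.0, 1.0], 6.5), ([1.0, 1.0], 5.0), ([0.5, 0.2], 4.4)]
params = ([0.0, 0.0], 0.0)
initial_loss = mse_loss(params, samples)
for _ in range(200):
    params = gradient_step(params, samples)
final_loss = mse_loss(params, samples)
```

In a real implementation an automatic-differentiation framework would compute the gradients and the chosen optimizer would apply the update; the loop structure stays the same.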

[0039] In some embodiments, step S12 includes: step S121 (not shown), where device 1 determines initial graph information corresponding to each target amino acid based on the protein information, wherein the initial graph information includes node information corresponding to the target amino acid and its associated amino acids, and edge information between amino acid pairs in the target amino acid and its associated amino acids; step S122 (not shown), where device 1 updates each node information based on the initial graph information using a graph neural network based on a multi-head self-attention mechanism; step S123 (not shown), where device 1 obtains acidity coefficient prediction information corresponding to the target amino acid through a multilayer perceptron based on the updated node information corresponding to the target amino acid; and step S124 (not shown), where device 1 determines corresponding first acidity coefficient prediction information based on the acidity coefficient prediction information corresponding to each target amino acid.

[0040] Typically, only specific amino acids in proteins can undergo protonation or deprotonation. These specific amino acids include aspartic acid (Asp), glutamic acid (Glu), histidine (His), lysine (Lys), tyrosine (Tyr), cysteine (Cys), and arginine (Arg). The target protein acidity coefficient prediction model can, based on the input protein information, identify the amino acids belonging to these specific amino acids as the corresponding target amino acids. Furthermore, based on the information of each target amino acid and its associated amino acids within a preset distance threshold range, the initial graph information corresponding to each target amino acid is determined. For example, the node information corresponding to each amino acid in the target amino acid and its associated amino acids is denoted as h_i, where 1≤i≤N and N is the total number of amino acids corresponding to the target amino acid and its associated amino acids, and the edge information between amino acid pairs is denoted as e_ij, where 1≤i, j≤N. A graph neural network based on a multi-head self-attention mechanism integrates the above node and edge information and finally updates the node information. Specifically, referring to the model architecture diagram shown in Figure 2, for the l-th attention head in the multi-head self-attention graph neural network (where 1≤l≤L, and L is the number of attention heads), the corresponding raw attention scores and the corresponding normalized attention weights are calculated based on the initial graph information. The output of each node's information in the l-th attention head is then determined by combining the normalized attention weights. The outputs of the attention heads are integrated to obtain the updated node information h'_i:

[0041]

[0042] where the per-head matrices in the above formula are the model parameter matrices of the l-th attention head; || denotes concatenating vectors together, for example, [h_i || e_ii] indicates that vector h_i and vector e_ii are concatenated; d_k represents the dimension of the output vector of each attention head; and W_0 is the model parameter matrix used to integrate the outputs of the attention heads. After the information of each node has been updated, a multilayer perceptron takes the updated node information h'_n corresponding to the target amino acid n and outputs the acidity coefficient prediction information corresponding to that target amino acid. For each target amino acid in the protein, the corresponding acidity coefficient prediction information is obtained through the above steps in combination with the corresponding initial graph information. The set of acidity coefficient prediction information corresponding to each target amino acid is used as the first acidity coefficient prediction information.
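The formula of paragraph [0041] is not reproduced in this text. One form that is consistent with the quantities named around it (raw attention scores, normalized attention weights, per-head outputs, concatenation of node and edge vectors, the dimension d_k, and the integrating matrix W_0) is the standard scaled dot-product attention written below. It is an illustrative assumption, not the formula of this application: the matrices W_Q^{(l)}, W_K^{(l)}, W_V^{(l)}, the symbols s, α, and o, and the choice of which vectors are concatenated with which edge features are all hypothetical.

```latex
\begin{aligned}
s_{ij}^{(l)} &= \frac{\left(W_Q^{(l)}\,[h_i \,\|\, e_{ii}]\right)^{\top}
                      \left(W_K^{(l)}\,[h_j \,\|\, e_{ij}]\right)}{\sqrt{d_k}}
  && \text{(raw attention score)} \\
\alpha_{ij}^{(l)} &= \frac{\exp\!\left(s_{ij}^{(l)}\right)}
                          {\sum_{m=1}^{N} \exp\!\left(s_{im}^{(l)}\right)}
  && \text{(normalized attention weight)} \\
o_i^{(l)} &= \sum_{j=1}^{N} \alpha_{ij}^{(l)}\, W_V^{(l)}\,[h_j \,\|\, e_{ij}]
  && \text{(output of head } l\text{)} \\
h'_i &= W_0 \left[\, o_i^{(1)} \,\|\, o_i^{(2)} \,\|\, \cdots \,\|\, o_i^{(L)} \right]
  && \text{(updated node information)}
\end{aligned}
```

Under this assumed form, each o_i^{(l)} has dimension d_k, and W_0 maps the concatenated vector of dimension L·d_k back to the node feature dimension.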

[0043] Since the relative positions of amino acids typically remain unchanged, this example does not update the edge information, but only the node information. Combined with the node information update process described above, which passes through multiple layers of the network, it better integrates the surrounding amino acid information. Even if only the updated node information corresponding to the target amino acid is used, the prediction accuracy is usually unaffected because this updated node information already contains information about the target amino acid and its surroundings that influence the prediction. Furthermore, using only the updated node information corresponding to the target amino acid avoids using information from the entire graph and being misled by other amino acids in the graph that may undergo protonation or deprotonation.

[0044] In some embodiments, step S121 includes: step S1211 (not shown), where device 1 determines a target amino acid and its associated amino acids, wherein the distance between the associated amino acid and the target amino acid is less than a preset distance threshold; step S1212 (not shown), where device 1, based on the protein information, uses the amino acid information corresponding to the determined target amino acid and its associated amino acids as corresponding node information; and step S1213 (not shown), where device 1, based on the positional information between amino acid pairs in the target amino acid and its associated amino acids, determines corresponding edge information.

[0045] For example, based on the protein information, amino acids belonging to a specific amino acid set (including aspartic acid (Asp), glutamic acid (Glu), histidine (His), lysine (Lys), tyrosine (Tyr), cysteine (Cys), or arginine (Arg)) are selected as the corresponding target amino acids. A three-dimensional sphere is taken with the target amino acid as the center and the preset distance threshold as the radius. A preset number of associated amino acids closest to the target amino acid are selected from within the three-dimensional sphere. The number of associated amino acids selected can be set by the user based on the model size and/or prediction accuracy. If the number of amino acids other than the target amino acid in the three-dimensional sphere is less than the preset number, the input can be padded with zeros to make up the preset number.
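The neighbor selection with zero-padding described above can be sketched as follows. The text does not state which atom's coordinates define the distance between two amino acids; one representative coordinate per residue (for example the C-alpha position) is assumed here. The function name `select_neighbors`, its parameters, and the toy coordinates are hypothetical.

```python
import math


def select_neighbors(center, residues, radius, max_neighbors, feature_dim):
    """Pick the nearest residues within `radius` of `center`, zero-padded.

    center: (x, y, z) of the target residue.
    residues: list of (coords, features) for the other residues.
    Returns a list of exactly `max_neighbors` feature vectors.
    """
    in_sphere = [
        (math.dist(center, coords), features)
        for coords, features in residues
        if math.dist(center, coords) < radius
    ]
    in_sphere.sort(key=lambda pair: pair[0])  # nearest first
    chosen = [features for _, features in in_sphere[:max_neighbors]]
    # Pad with all-zero feature vectors when the sphere holds too few residues.
    padding = [[0.0] * feature_dim for _ in range(max_neighbors - len(chosen))]
    return chosen + padding


# Toy input: two residues inside a 10-unit sphere, one far outside it.
residues = [
    ((1.0, 0.0, 0.0), [1.0, 0.0]),
    ((0.0, 3.0, 0.0), [0.0, 1.0]),
    ((20.0, 0.0, 0.0), [1.0, 1.0]),
]
nodes = select_neighbors((0.0, 0.0, 0.0), residues, radius=10.0,
                         max_neighbors=3, feature_dim=2)
```

Fixing the number of nodes per graph in this way keeps the input shape constant regardless of how crowded the environment of a given target amino acid is.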

[0046] For each target amino acid and its corresponding associated amino acid, the amino acid information of each amino acid can be determined as node information and the position information between each amino acid pair as edge information, based on the protein structure information and the protein sequence information.

[0047] In some embodiments, the node information includes at least one of the following: amino acid type information; atom type information of each atom in the amino acid; spatial information of each atom in the amino acid relative to the central carbon atom of the amino acid; charge information of each atom in the amino acid; solution accessibility of the amino acid; and relative position information of the amino acid in the corresponding amino acid chain.

[0048] In some embodiments, the node information includes amino acid type information; step S1212 includes: device 1 determining the corresponding protein sequence encoding information based on the protein sequence information using a protein language model; and determining the corresponding amino acid type information based on the protein sequence encoding information. In some embodiments, the protein language model is trained on the public dataset UniRef50. The corresponding protein sequence encoding information is obtained using this protein language model. For example, for a protein sequence of length L, the obtained protein sequence encoding information is an L×d embedding matrix, where the vector of length d at the i-th position is the encoding information of the i-th amino acid in the sequence. The row of the protein sequence encoding information at the position corresponding to the target amino acid or its associated amino acid is used as the corresponding amino acid type information.

[0049] In some embodiments, the spatial information of each atom in the amino acid relative to the central carbon atom of the amino acid includes the distance information and/or angle information of each atom in the amino acid relative to the central carbon atom (C-alpha) of the amino acid. The angle information can be determined by normalizing the positions of the main chain carbon atom, the central carbon atom, and the nitrogen atom within the amino acid.

[0050] In some embodiments, the charge information of each atom in the amino acid includes the net charge of each atom. In some embodiments, for the convenience of model calculation, the charge information is represented by relative values, for example, +1 represents the charge of a proton and -1 represents the charge of an electron. In some embodiments, the charge information of each atom can be obtained using appropriate protein force field tools (e.g., the CHARMM36 force field, the Amber force field, or the OpenFF force field, etc.).

[0051] In some embodiments, the solution accessibility (i.e., solvent accessibility) of the amino acid includes, but is not limited to, the relative solvent accessibility (RSA) and/or the absolute solvent accessible surface area. The solution accessibility of the amino acid can be calculated based on the protein structure information using computational software such as DSSP (https://swift.cmbi.umcn.nl/gv/dssp/) or NACCESS (https://www.bioinf.manchester.ac.uk/naccess/nac_intro.html).

[0052] In some embodiments, the ionization of amino acids at the N-terminus (amino terminus) or C-terminus (carboxyl terminus) differs from that in the middle of the protein chain and needs to be treated differently. To facilitate model learning of these differences, the relative position of the amino acid within the corresponding amino acid chain is used to label the position of the amino acid relative to the N-terminus (amino terminus) or C-terminus (carboxyl terminus). In some embodiments, a one-hot encoded vector of length 2 can be used for labeling.
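A length-2 terminal label of this kind can be sketched as below. The text does not specify the exact encoding; the sketch assumes one slot for "is the N-terminal residue" and one for "is the C-terminal residue", with interior residues encoded as all zeros. The function name `terminal_flags` and the zero-based indexing are hypothetical.

```python
def terminal_flags(index, chain_length):
    """Length-2 indicator: [is N-terminal residue, is C-terminal residue].

    index: zero-based position of the residue in its chain (assumption).
    """
    return [int(index == 0), int(index == chain_length - 1)]


# Labels for every residue of a hypothetical five-residue chain.
labels = [terminal_flags(i, 5) for i in range(5)]
```

Under this assumed encoding, a single-residue chain would be flagged as both termini at once, which may or may not be the intended behavior.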

[0053] In some embodiments, the edge information includes at least one of the following: distance information between the central carbon atoms of two amino acids; and dihedral angle information corresponding to the plane formed by the main chain atoms of the two amino acids. For example, the positional information of any two amino acids in the target amino acid and its associated amino acids can be used as the edge information. This positional information includes the distance information between the central carbon atoms of the two amino acids and / or the dihedral angle information corresponding to the plane formed by the main chain atoms of the two amino acids. For this dihedral angle information, the plane formed by the main chain carbon atom, the central carbon atom (C-alpha), and the nitrogen atom within each amino acid can be determined, and the included angle formed by the planes corresponding to the two amino acids can be used as the dihedral angle information.
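The two edge features described above can be sketched as follows. This is a minimal illustration, not the implementation of this application: the dictionary layout with keys `"N"`, `"CA"`, and `"C"`, the function names, and the example coordinates are all assumptions. The inter-plane angle is computed as the angle between the unit normals of the two backbone planes, so it lies in the range 0 to pi and depends on the chosen atom ordering.

```python
import math
from typing import Dict, Tuple

Vec = Tuple[float, float, float]


def _sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Vec, b: Vec) -> Vec:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def backbone_plane_normal(residue: Dict[str, Vec]) -> Vec:
    """Unit normal of the plane through the backbone N, CA and C atoms."""
    normal = _cross(_sub(residue["N"], residue["CA"]),
                    _sub(residue["C"], residue["CA"]))
    length = math.sqrt(_dot(normal, normal))
    if length == 0.0:
        raise ValueError("backbone atoms are collinear; plane is undefined")
    return (normal[0] / length, normal[1] / length, normal[2] / length)


def edge_features(res_i: Dict[str, Vec], res_j: Dict[str, Vec]) -> Tuple[float, float]:
    """Return (CA-CA distance, angle in radians between the backbone planes)."""
    ca_distance = math.dist(res_i["CA"], res_j["CA"])
    cos_angle = _dot(backbone_plane_normal(res_i), backbone_plane_normal(res_j))
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return ca_distance, math.acos(cos_angle)


# Hypothetical coordinates chosen so the result is easy to check by hand.
res_a = {"N": (1.0, 0.0, 0.0), "CA": (0.0, 0.0, 0.0), "C": (0.0, 1.0, 0.0)}
res_b = {"N": (4.0, 4.0, 0.0), "CA": (3.0, 4.0, 0.0), "C": (3.0, 4.0, 1.0)}
distance, angle = edge_features(res_a, res_b)
```

Both quantities depend only on relative atom positions, so translating or rotating the whole structure leaves them unchanged.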

[0054] Here, this application integrates richer features that may affect the acidity coefficient in order to predict it. The node features include amino acid type information, atom type information of each atom in the amino acid, spatial information of each atom in the amino acid relative to the central carbon atom of the amino acid, charge information of each atom in the amino acid, solvent accessibility of the amino acid, and / or relative position information of the amino acid in the corresponding amino acid chain. The edge features include distance information between the central carbon atoms of two amino acids and / or dihedral angle information corresponding to the planes formed by the main chain atoms of the two amino acids. This enables the model to efficiently learn the influence of the target amino acid and its surrounding environment on the acidity coefficient, helping the model to make better predictions. Furthermore, the features used in this application are invariant to rotation and translation, i.e., they have SE(3) invariance, which also gives the trained model better stability and generalization.

[0055] In some embodiments, the method further includes: step S15 (not shown), where device 1 evaluates the target protein acidity coefficient prediction model using the validation sample dataset. For example, during the training process of the target protein acidity coefficient prediction model, the validation sample dataset obtained in the aforementioned steps can also be used to evaluate the target protein acidity coefficient prediction model, and calculate the corresponding performance metrics (e.g., mean squared error (MSE), accuracy, recall, and / or F1 score, etc.). The model training strategy is dynamically adjusted based on these performance metrics. For example, if the performance metrics do not improve in multiple consecutive training rounds, training is terminated early to avoid overfitting. The model with the best performance metrics during training is used as the final target protein acidity coefficient prediction model. A suitable learning rate or optimizer can also be determined based on changes in the performance metrics.
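The early-termination rule described above can be sketched as follows. This is an illustrative sketch under stated assumptions: validation MSE is taken as the monitored metric, the function names and the `patience` parameter are hypothetical, and the loss values in the example are made up.

```python
from typing import Iterable, Sequence, Tuple


def mean_squared_error(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and reference acidity coefficients."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def early_stopping(validation_losses: Iterable[float], patience: int = 3) -> Tuple[int, float, int]:
    """Scan per-epoch validation losses and stop after `patience` epochs
    without improvement.

    Returns (best_epoch, best_loss, last_epoch_run), with 0-based epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    last_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # terminate training early
    return best_epoch, best_loss, last_epoch


# Made-up validation losses: the loss stops improving after epoch 2.
result = early_stopping([0.9, 0.7, 0.6, 0.65, 0.61, 0.62, 0.5], patience=3)
```

In a real training loop the losses would be computed one epoch at a time, and the model parameters from `best_epoch` would be kept as the final model.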

[0056] In some embodiments, to further improve the prediction accuracy of the target protein acidity coefficient prediction model, the method further includes: step S16 (not shown), where device 1 trains the target protein acidity coefficient prediction model based on an acidity coefficient experimental dataset. For example, using high-precision experimentally measured acidity coefficient data, the target protein acidity coefficient prediction model obtained in the aforementioned steps is fine-tuned to further improve the prediction accuracy of the target protein acidity coefficient prediction model. Specifically, the acidity coefficient experimental dataset can be divided into a training set and a validation set, the first few layers of the model can be frozen, and the target protein acidity coefficient prediction model can be trained with a learning rate smaller than that used in the aforementioned model training, referring to the training process of steps S12 to S15.

[0057] In some embodiments, the method further includes: step S17 (not shown), determining the target acidity coefficient information corresponding to the target protein based on the target protein information corresponding to the target protein using the target protein acidity coefficient prediction model. In some embodiments, the trained target protein acidity coefficient prediction model can also be deployed in the device and used to obtain the target acidity coefficient information corresponding to the required target protein information. The process of obtaining the target acidity coefficient information is similar to the process of obtaining the first acidity coefficient prediction information in step S12 above. Based on the target protein information, which includes the target protein structure information and the target protein sequence information, the initial graph information corresponding to each target amino acid in the target protein is determined; a graph neural network based on a multi-head self-attention mechanism updates the node information in the initial graph information corresponding to each target amino acid; and a multilayer perceptron obtains the acidity coefficient information corresponding to each target amino acid based on its updated node information. This yields the target acidity coefficient information, which includes the acidity coefficient information corresponding to each target amino acid in the target protein. The specific implementation is the same as or similar to the specific embodiment of step S12 above and is therefore not repeated here, but is incorporated herein by reference.

[0058] Figure 3 illustrates a device structure for establishing a target protein acidity coefficient prediction model according to an embodiment of this application. The device 1 includes a first module 11, a second module 12, a third module 13, and a fourth module 14. The first module 11 uses multiple protein acidity coefficient prediction models to obtain a training sample dataset, wherein the training sample dataset includes multiple protein information and corresponding acidity coefficient information, the protein information including protein structure information and protein sequence information; the second module 12, based on the protein information, uses a target protein acidity coefficient prediction model to obtain first acidity coefficient prediction information corresponding to the protein information; the third module 13 calculates corresponding loss information based on the first acidity coefficient prediction information and the acidity coefficient information corresponding to the protein information; the fourth module 14 updates the model parameters of the target protein acidity coefficient prediction model based on the loss information, thereby obtaining the target protein acidity coefficient prediction model. Here, the specific implementations of the first module 11, the second module 12, the third module 13, and the fourth module 14 shown in Figure 3 are the same as or similar to the specific embodiments of steps S11, S12, S13, and S14 described above, and therefore will not be repeated here, but are incorporated herein by reference.

[0059] In some embodiments, the first module 11 includes a first unit 111 (not shown), a second unit 112 (not shown), and a third unit 113 (not shown). The first unit 111, based on multiple protein information, utilizes multiple protein acidity coefficient prediction models to obtain multiple second acidity coefficient prediction information corresponding to each protein information. The second unit 112, based on the multiple second acidity coefficient prediction information corresponding to each protein information, determines the acidity coefficient information corresponding to each protein information. The third unit 113, based on the multiple protein information and the acidity coefficient information corresponding to each protein information, determines the corresponding training sample dataset. Here, the specific implementations of the first unit 111, the second unit 112, and the third unit 113 are the same as or similar to the specific embodiments of steps S111, S112, and S113 described above, and therefore will not be repeated here, but are incorporated herein by reference.
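To make the role of the second unit 112 concrete, the sketch below merges the predictions of several models into a single label per residue. It is only an illustration: a per-residue median is used here as a simple stand-in and is not the consensus procedure of this application; the function name, the model names, and the numeric values are hypothetical.

```python
from statistics import median
from typing import Dict, List


def consensus_labels(predictions_by_model: Dict[str, List[float]]) -> List[float]:
    """Combine per-residue predictions from several models into one label
    per residue, using the median as a simple stand-in consensus rule."""
    columns = list(predictions_by_model.values())
    if not columns:
        raise ValueError("at least one model's predictions are required")
    n_residues = len(columns[0])
    if any(len(column) != n_residues for column in columns):
        raise ValueError("all models must predict the same residues")
    # zip(*columns) groups the predictions of all models residue by residue.
    return [median(values) for values in zip(*columns)]


# Made-up predictions from three hypothetical models for two residues.
predictions = {
    "model_a": [4.0, 10.5],
    "model_b": [4.4, 10.1],
    "model_c": [3.9, 10.4],
}
labels = consensus_labels(predictions)
```

The median is chosen for the sketch because a single outlying model cannot pull the label far from the values the other models agree on.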

[0060] In some embodiments, the second module 12 includes a first-two-one unit 121 (not shown), a first-two-two unit 122 (not shown), a first-two-three unit 123 (not shown), and a first-two-four unit 124 (not shown). The first-two-one unit 121 determines initial graph information corresponding to each target amino acid based on the protein information, wherein the initial graph information includes node information corresponding to the target amino acid and its associated amino acids, and edge information between amino acid pairs in the target amino acid and its associated amino acids; the first-two-two unit 122 updates each node information based on the initial graph information using a graph neural network based on a multi-head self-attention mechanism; the first-two-three unit 123 obtains acidity coefficient prediction information corresponding to the target amino acid through a multilayer perceptron based on the updated node information corresponding to the target amino acid; and the first-two-four unit 124 determines corresponding first acidity coefficient prediction information based on the acidity coefficient prediction information corresponding to each target amino acid. Here, the specific implementations of the first-two-one unit 121, the first-two-two unit 122, the first-two-three unit 123, and the first-two-four unit 124 are the same as or similar to the specific embodiments of steps S121, S122, S123, and S124 described above, and therefore will not be repeated here, but are incorporated herein by reference.

[0061] In some embodiments, the first-two-one unit 121 includes a first-two-one subunit 1211 (not shown), a first-two-two subunit 1212 (not shown), and a first-two-three subunit 1213 (not shown). The first-two-one subunit 1211 determines the target amino acid and its associated amino acids, wherein the distance between the associated amino acid and the target amino acid is less than a preset distance threshold. The first-two-two subunit 1212, based on the protein information, determines the amino acid information corresponding to the target amino acid and its associated amino acids as the corresponding node information. The first-two-three subunit 1213, based on the positional information between amino acid pairs in the target amino acid and its associated amino acids, determines the corresponding edge information. Here, the specific implementations of the first-two-one subunit 1211, the first-two-two subunit 1212, and the first-two-three subunit 1213 are the same as or similar to the specific embodiments of steps S1211, S1212, and S1213 described above, and therefore will not be repeated here, but are incorporated herein by reference.
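The selection of associated amino acids and the pairing of nodes can be sketched as follows. This is a minimal illustration under assumptions not fixed by this application: the distance between two amino acids is taken as the distance between their central carbon atoms, the threshold value and coordinates are made up, and the function name `build_local_graph` is hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Vec = Tuple[float, float, float]


def build_local_graph(
    ca_coords: Sequence[Vec], target_index: int, threshold: float
) -> Tuple[List[int], Dict[Tuple[int, int], float]]:
    """Return (nodes, edges) for one target amino acid.

    nodes: the target index followed by every residue whose CA atom lies
           closer to the target's CA atom than `threshold`.
    edges: CA-CA distance for every unordered pair of nodes.
    """
    target = ca_coords[target_index]
    neighbours = [
        i for i, coord in enumerate(ca_coords)
        if i != target_index and math.dist(coord, target) < threshold
    ]
    nodes = [target_index] + neighbours
    edges = {
        (i, j): math.dist(ca_coords[i], ca_coords[j])
        for i, j in combinations(sorted(nodes), 2)
    }
    return nodes, edges


# Made-up CA coordinates; the last residue is too far away to be associated.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 8.0, 0.0), (20.0, 0.0, 0.0)]
nodes, edges = build_local_graph(coords, target_index=0, threshold=10.0)
```

Here the edges cover every pair among the selected amino acids, not only pairs that involve the target, which matches the description of edge information as relating any two amino acids among the target amino acid and its associated amino acids.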

[0062] In some embodiments, the device 1 further includes a fifth module 15 (not shown). The fifth module 15 uses the validation sample dataset to evaluate the target protein acidity coefficient prediction model. Here, the specific implementation of the fifth module 15 is the same as or similar to the specific embodiment of the aforementioned step S15, and therefore will not be repeated here, but is incorporated herein by reference.

[0063] In some embodiments, the device 1 further includes a sixth module 16 (not shown). The sixth module 16 trains the target protein acidity coefficient prediction model based on an acidity coefficient experimental dataset. Here, the specific implementation of the sixth module 16 is the same as or similar to the specific embodiment of the aforementioned step S16, and therefore will not be repeated here, but is incorporated herein by reference.

[0064] In some embodiments, the device 1 further includes a seventh module 17 (not shown). The seventh module 17 determines the target acidity coefficient information corresponding to the target protein based on the target protein information and using the target protein acidity coefficient prediction model. Here, the specific implementation of the seventh module 17 is the same as or similar to the specific embodiment of the aforementioned step S17, and therefore will not be repeated here, but is incorporated herein by reference.

[0065] Figure 4 shows an exemplary system that can be used to implement the various embodiments described in this application. As shown in Figure 4, in some embodiments, system 300 can function as any of the devices described in the embodiments. In some embodiments, system 300 may include one or more computer-readable media having instructions (e.g., system memory or NVM / storage device 320) and one or more processors (e.g., one or more processors 305) coupled to the one or more computer-readable media and configured to execute the instructions to implement the modules and thus perform the actions described in this application.

[0066] In one embodiment, the system control module 310 may include any suitable interface controller to provide any suitable interface to at least one of the processors 305 and / or any suitable device or component communicating with the system control module 310.

[0067] The system control module 310 may include a memory controller module 330 to provide an interface to the system memory 315. The memory controller module 330 may be a hardware module, a software module, and / or a firmware module.

[0068] System memory 315 can be used, for example, to load and store data and / or instructions for system 300. In one embodiment, system memory 315 may include any suitable volatile memory, such as suitable DRAM. In some embodiments, system memory 315 may include double data rate fourth-generation synchronous dynamic random access memory (DDR4 SDRAM).

[0069] In one embodiment, the system control module 310 may include one or more input / output (I / O) controllers to provide interfaces to the NVM / storage device 320 and (one or more) communication interfaces 325.

[0070] For example, NVM / storage device 320 may be used to store data and / or instructions. NVM / storage device 320 may include any suitable non-volatile memory (e.g., flash memory) and / or may include any suitable (one or more) non-volatile storage devices (e.g., one or more hard disk drives (HDDs), one or more compact disc (CD) drives, and / or one or more digital versatile disc (DVD) drives).

[0071] NVM / storage device 320 may include storage resources that are physically part of a device on which system 300 is mounted, or that can be accessed by the device without necessarily being part of it. For example, NVM / storage device 320 may be accessed via a network through one or more communication interfaces 325.

[0072] One or more communication interfaces 325 may provide the system 300 with an interface to communicate over one or more networks and / or with any other suitable device. The system 300 may wirelessly communicate with one or more components of a wireless network in accordance with any of one or more wireless network standards and / or protocols.

[0073] In one embodiment, at least one of the processors 305 may be packaged together with the logic of one or more controllers of the system control module 310 (e.g., memory controller module 330). In one embodiment, at least one of the processors 305 may be packaged together with the logic of one or more controllers of the system control module 310 to form a system-in-package (SiP). In one embodiment, at least one of the processors 305 may be integrated with the logic of one or more controllers of the system control module 310 on the same die. In one embodiment, at least one of the processors 305 may be integrated with the logic of one or more controllers of the system control module 310 on the same die to form a system-on-a-chip (SoC).

[0074] In various embodiments, system 300 may be, but is not limited to, a server, workstation, desktop computing device, or mobile computing device (e.g., laptop computing device, handheld computing device, tablet computer, netbook, etc.). In various embodiments, system 300 may have more or fewer components and / or different architectures. For example, in some embodiments, system 300 includes one or more cameras, a keyboard, a liquid crystal display (LCD) screen (including a touchscreen display), a non-volatile memory port, multiple antennas, a graphics chip, an application-specific integrated circuit (ASIC), and a speaker.

[0075] In addition to the methods and devices described in the above embodiments, this application also provides a computer-readable storage medium storing computer code that, when executed, performs the method described in any of the preceding embodiments.

[0076] This application also provides a computer program product that, when executed by a computer device, performs the method described in any of the preceding claims.

[0077] This application also provides a computer device, the computer device comprising:

[0078] One or more processors;

[0079] Memory, used to store one or more computer programs;

[0080] When the one or more computer programs are executed by the one or more processors, the one or more processors are caused to perform the method described in any of the preceding embodiments.

[0081] It should be noted that this application can be implemented in software and / or a combination of software and hardware, for example, using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device. In one embodiment, the software program of this application can be executed by a processor to implement the steps or functions described above. Similarly, the software program of this application (including related data structures) can be stored in a computer-readable recording medium, such as RAM memory, magnetic or optical drives, floppy disks, and similar devices. Furthermore, some steps or functions of this application can be implemented in hardware, for example, as circuitry that cooperates with a processor to perform the various steps or functions.

[0082] Furthermore, a portion of this application can be applied as a computer program product, such as computer program instructions, which, when executed by a computer, can invoke or provide the methods and / or technical solutions according to this application through the operation of the computer. Those skilled in the art will understand that the forms in which computer program instructions exist in a computer-readable medium include, but are not limited to, source files, executable files, installation package files, etc. Correspondingly, the ways in which computer program instructions are executed by a computer include, but are not limited to: the computer directly executing the instructions, or the computer compiling the instructions and then executing the corresponding compiled program, or the computer reading and executing the instructions, or the computer reading and installing the instructions and then executing the corresponding installed program. Here, the computer-readable medium can be any available computer-readable storage medium or communication medium accessible to a computer.

[0083] Communication media include media through which communication signals containing, for example, computer-readable instructions, data structures, program modules, or other data are transmitted from one system to another. Communication media can include guided transmission media (such as cables and wires (e.g., optical fibers, coaxial cables, etc.)) and wireless (unguided transmission) media capable of propagating energy waves, such as sound, electromagnetic, RF, microwave, and infrared. Computer-readable instructions, data structures, program modules, or other data can be embodied as modulated data signals in, for example, wireless media (such as carrier waves or similar mechanisms embodied as part of spread spectrum technology). The term "modulated data signal" refers to a signal whose one or more characteristics are altered or set in a manner that encodes information in the signal. Modulation can be analog, digital, or a hybrid modulation technique.

[0084] By way of example and not limitation, computer-readable storage media may include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules or other data. For example, computer-readable storage media include, but are not limited to, volatile memories such as random access memory (RAM, DRAM, SRAM); and non-volatile memories such as flash memory, various read-only memories (ROM, PROM, EPROM, EEPROM), magnetic and ferromagnetic / ferroelectric memories (MRAM, FeRAM); and magnetic and optical storage devices (hard disks, magnetic tapes, CDs, DVDs); or other media now known or hereafter developed capable of storing computer-readable information / data for use by a computer system.

[0085] Herein, one embodiment of this application includes an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein when the computer program instructions are executed by the processor, the apparatus is triggered to run a method and / or technical solution based on the foregoing embodiments of this application.

[0086] It will be apparent to those skilled in the art that this application is not limited to the details of the exemplary embodiments described above, and that this application can be implemented in other specific forms without departing from the spirit or essential characteristics of this application. Therefore, the embodiments should be considered exemplary and non-limiting in all respects, and the scope of this application is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be embraced within this application. No reference numerals in the claims should be construed as limiting the scope of the claims. Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in the apparatus claims may also be implemented by a single unit or device in software or hardware. The terms "first," "second," etc., are used to indicate names and do not indicate any particular order.

Claims

1. A method for establishing a prediction model for the acidity coefficient of a target protein, wherein the method includes: based on multiple protein information, using multiple protein acidity coefficient prediction models to obtain multiple second acidity coefficient prediction information corresponding to each protein information; based on the multiple second acidity coefficient prediction information corresponding to each protein information, determining the acidity coefficient information corresponding to each protein information using a multi-annotator consensus model; based on the multiple protein information and the acidity coefficient information corresponding to each protein information, determining a corresponding training sample dataset, wherein the training sample dataset includes multiple protein information and the acidity coefficient information corresponding to the protein information, and the protein information includes protein structure information and protein sequence information; based on the protein information, determining the initial graph information corresponding to each target amino acid, wherein the initial graph information includes the node information corresponding to the target amino acid and its associated amino acids, and the edge information between amino acid pairs in the target amino acid and its associated amino acids; based on the initial graph information, updating the information of each node using a graph neural network based on a multi-head self-attention mechanism; based on the updated node information corresponding to the target amino acid, obtaining the acidity coefficient prediction information corresponding to the target amino acid through a multilayer perceptron; based on the acidity coefficient prediction information corresponding to each target amino acid, determining the corresponding first acidity coefficient prediction information; based on the first acidity coefficient prediction information and the acidity coefficient information corresponding to the protein information, calculating the corresponding loss information; and based on the loss information, updating the model parameters of the target protein acidity coefficient prediction model to obtain the target protein acidity coefficient prediction model.

2. The method according to claim 1, wherein the step of determining the corresponding training sample dataset based on the multiple protein information and the acidity coefficient information corresponding to each protein information includes: based on the protein sequence similarity information among the multiple protein information, dividing the multiple protein information and the acidity coefficient information corresponding to each protein information into a corresponding training sample dataset and validation sample dataset, wherein the protein sequence similarity information between the protein information in the training sample dataset and the protein information in the validation sample dataset is less than the corresponding similarity threshold.

3. The method according to claim 1, wherein the step of determining, based on the protein information, the initial graph information corresponding to each target amino acid, wherein the initial graph information includes node information corresponding to the target amino acid and its associated amino acids, and edge information between amino acid pairs in the target amino acid and its associated amino acids, includes: determining the target amino acid and its associated amino acids, wherein the distance between the associated amino acid and the target amino acid is less than a preset distance threshold; based on the protein information, determining the amino acid information corresponding to the target amino acid and its associated amino acids as the corresponding node information; and based on the positional information between amino acid pairs in the target amino acid and its associated amino acids, determining the corresponding edge information.

4. The method according to claim 3, wherein the node information includes at least one of the following: amino acid type information; atom type information of each atom in the amino acid; spatial information of each atom in the amino acid relative to the central carbon atom of the amino acid; charge information of each atom in the amino acid; solvent accessibility of the amino acid; relative position information of the amino acid in the corresponding amino acid chain.

5. The method according to claim 4, wherein the node information includes amino acid type information; and the step of determining the amino acid information corresponding to the target amino acid and its associated amino acids as the corresponding node information based on the protein information includes: based on the protein sequence information, determining the corresponding protein sequence encoding information using a protein language model; and based on the protein sequence encoding information, determining the corresponding amino acid type information.

6. The method according to any one of claims 3 to 5, wherein the edge information includes at least one of the following: distance information between the central carbon atoms of two amino acids; dihedral angle information corresponding to the planes formed by the main chain atoms of the two amino acids.

7. The method according to claim 2, wherein the method further includes: evaluating the target protein acidity coefficient prediction model using the validation sample dataset.

8. The method according to claim 1, wherein the method further includes: training the target protein acidity coefficient prediction model based on an acidity coefficient experimental dataset.

9. The method according to claim 1, wherein the method further includes: based on the target protein information corresponding to the target protein, determining the target acidity coefficient information corresponding to the target protein using the target protein acidity coefficient prediction model.

10. A computer device for establishing a prediction model for the acidity coefficient of a target protein, comprising a memory, a processor, and a computer program stored in the memory, characterized in that the processor executes the computer program to implement the steps of the method according to any one of claims 1 to 9.

11. A computer-readable storage medium having a computer program / instructions stored thereon, characterized in that, when the computer program / instructions are executed by a processor, they implement the steps of the method according to any one of claims 1 to 9.

12. A computer program product, comprising a computer program, characterized in that, when executed by a processor, the computer program implements the steps of the method according to any one of claims 1 to 9.