Molecular property prediction using molecule protonation states

The protonation state graph system addresses the computational inefficiencies of existing models by using parallel processing and focusing on probable protonation sites to efficiently predict molecular properties with uncertainty quantification.

WO2026131690A1 · PCT designated stage · Publication Date: 2026-06-25 · ISOMORPHIC LABS LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: ISOMORPHIC LABS LTD
Filing Date: 2025-12-15
Publication Date: 2026-06-25

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating a prediction characterizing a molecule. According to one aspect, there is provided a method comprising: receiving data identifying a molecule; generating data defining a protonation state graph for the molecule, wherein: each node in the graph represents a respective protonation state of the molecule; and each edge in the graph corresponds to a respective protonation site on the molecule and connects a respective pair of nodes in the graph that represent a corresponding pair of possible protonation states of the molecule that differ only in a protonation state of the corresponding protonation site on the molecule; and each edge in the graph is associated with a predicted micro pKa value of the corresponding protonation site; and processing the data defining the graph for the molecule to generate a prediction characterizing the molecule.

Description

[0001] Isomorphic Labs Limited et al. F&R Ref.: 53672-0026W01 PCT Application

[0002] MOLECULAR PROPERTY PREDICTION USING MOLECULE PROTONATION STATES

[0004] BACKGROUND

[0005] [1] This specification relates to processing data using machine learning models.

[0006] [2] Machine learning models receive an input and generate an output, e.g., a predicted output, based on the received input. Some machine learning models are parametric models and generate the output based on the received input and on values of the parameters of the model.

[0007] [3] Some machine learning models are deep models that employ multiple layers of models to generate an output for a received input. For example, a deep neural network is a deep machine learning model that includes an output layer and one or more hidden layers that each apply a non-linear transformation to a received input to generate an output.

[0008] SUMMARY

[0009] [4] This specification describes a system implemented as computer programs on one or more computers in one or more locations that can generate data representing relationships between the protonation states of a molecule and can use the data to generate a prediction characterizing the molecule, for example for applications in drug discovery. Throughout this specification, a “molecule” can refer to a collection of atoms which are bonded together through chemical bonds. A molecule can be represented in any of a variety of possible ways, e.g., as a sequence of characters, e.g., a Simplified Molecular Input Line Entry System (SMILES) string.

[0010] [5] In this specification, a protonation site is a location on the molecule (e.g., an atom) that can either gain or lose a proton (H+). For example, an ionizable group can be a protonation site. A protonation state refers to a specific condition of a molecule that is defined by the number and respective locations of protons (also referred to as ionized hydrogen atoms) associated with the protonation sites of the molecule. A protonation state of a protonation site of the molecule refers to whether that protonation site is protonated or deprotonated.

[0011] [6] In particular, the system can generate a protonation state graph characterizing the protonation states of the molecule and how each protonation state relates to the other protonation states of the molecule. The protonation state graph can include nodes representing possible protonation states of the molecule and edges that connect nodes that represent protonation states that differ only in a protonation state of the corresponding protonation site on the molecule, i.e., by a single proton at a single protonation site.

[7] The system can associate each of the edges with a micro pKa value, e.g., an acid-base dissociation constant, or more formally, the negative logarithm of the acid-base dissociation constant, that represents the tendency of the protonation site to release or accept a proton independently of other protonation sites in the molecule at a fixed temperature, e.g., based on the local molecular environment of the protonation site. In particular, the micro pKa value for a protonation site quantifies the extent to which an acid or conjugate acid at a particular protonation site dissociates in solution, e.g., in an aqueous or non-aqueous solvent, such as water, a buffer solution, blood plasma, intracellular or extracellular fluid, and so on.

[0012] [8] The system can use the protonation state graph to determine the predicted likelihood of each of the protonation states, e.g., by identifying unique paths from a reference node in the graph to a node representing a particular protonation state and computing the likelihood as a function of the micro pKa values along the unique paths and a measure of the effective concentration of protons in the solution when the molecule is dissolved using a solvent for the molecule, known as the pH. In this specification, a “path” refers to a sequence of nodes, where each consecutive pair of nodes is connected by an edge, and there are no repeat nodes in the sequence. The system can then use the predicted likelihoods to generate a prediction characterizing the molecule, e.g., an expected molecular property value or a prediction for a target protonation state.

[0013] [9] According to a first aspect there is provided a method for receiving data identifying a molecule, processing the data identifying the molecule to generate, for each of a plurality of protonation sites on the molecule, a respective predicted micro pKa value of the protonation site, and generating data defining a protonation state graph for the molecule, wherein each node in the protonation state graph represents a respective protonation state of the molecule, and each edge in the protonation state graph corresponds to a respective protonation site on the molecule and connects a respective pair of nodes in the protonation state graph that represent a corresponding pair of possible protonation states of the molecule that differ only in a protonation state of the corresponding protonation site on the molecule, and each edge in the graph is associated with the predicted micro pKa value of the corresponding protonation site, and processing the data defining the protonation state graph for the molecule to generate a prediction characterizing the molecule.

[0010] In some implementations, processing the data defining the protonation state graph for the molecule to generate the prediction characterizing the molecule comprises processing the data defining the protonation state graph for the molecule to determine a respective predicted likelihood of each of a plurality of protonation states of the molecule, and generating the prediction characterizing the molecule based at least in part on the predicted likelihoods of the plurality of protonation states of the molecule.

[0011] In some implementations, processing the data defining the protonation state graph for the molecule to determine the respective predicted likelihood of each of the plurality of protonation states of the molecule comprises, for each protonation state: identifying a path in the protonation state graph from a reference node to a node representing the protonation state, and determining the predicted likelihood of the protonation state based on a measure of pH and the predicted micro pKa values of edges included in the path from the reference node to the node representing the protonation state.

[0012] In some implementations, identifying a path in the protonation state graph from the reference node to the node representing the protonation state comprises identifying a plurality of unique paths in the protonation state graph from the reference node to a node representing the protonation state, and wherein determining the predicted likelihood of the protonation state comprises determining, for each of the plurality of unique paths, a respective predicted likelihood of the protonation state based on the measure of pH and the predicted micro pKa values of edges included in the path, and determining the predicted likelihood of the protonation state as a measure of central tendency of the predicted likelihoods corresponding to the plurality of unique paths.

[0013] In some implementations, the method further comprises, for each protonation state, determining an uncertainty in the predicted likelihood of the protonation state based on a measure of dispersion of the predicted likelihoods of the protonation state corresponding to the plurality of unique paths.
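
The multi-path estimate and its dispersion can be sketched as follows. This is a minimal illustration, not the method of the specification: the `edge_pka` dictionary layout and both function names are hypothetical, the per-path score is assumed to be the base-10 log of an unnormalized weight (each step contributing pKa minus pH), only shortest paths from the all-deprotonated reference node are enumerated, and the mean and sample standard deviation are just one possible choice of central tendency and dispersion.

```python
import itertools
import statistics


def path_log_weights(target, edge_pka, ph):
    """Return one log10 weight per shortest path from the all-deprotonated state.

    `target` is a tuple of 0/1 flags, one per protonation site.
    `edge_pka[(state, site)]` is the predicted micro pKa for protonating
    `site` when the molecule is in `state` (hypothetical data layout).
    """
    sites = [i for i, bit in enumerate(target) if bit == 1]
    log_weights = []
    # Each ordering of the protonated sites is one unique path in the graph.
    for order in itertools.permutations(sites):
        state = tuple(0 for _ in target)
        total = 0.0
        for site in order:
            # Assumed relation: each step contributes (pKa - pH) in log10 units.
            total += edge_pka[(state, site)] - ph
            state = state[:site] + (1,) + state[site + 1:]
        log_weights.append(total)
    return log_weights


def summarize_paths(log_weights):
    """Return (mean, sample standard deviation) of the per-path values."""
    mean = statistics.mean(log_weights)
    spread = statistics.stdev(log_weights) if len(log_weights) > 1 else 0.0
    return mean, spread


# Two sites; the two paths to (1, 1) disagree slightly, which yields a spread.
example_edge_pka = {
    ((0, 0), 0): 9.0,
    ((0, 0), 1): 4.0,
    ((1, 0), 1): 3.5,
    ((0, 1), 0): 8.9,
}
example_mean, example_spread = summarize_paths(
    path_log_weights((1, 1), example_edge_pka, ph=7.0)
)
```

If the predicted micro pKa values were perfectly self-consistent, every path to a given node would give the same sum and the spread would be zero, so a nonzero spread reflects disagreement among the edge predictions.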

[0014] In some implementations, the reference node is a node representing a protonation state of the molecule where each protonation site on the molecule is deprotonated.

[0015] In some implementations, determining the predicted likelihood of the protonation state based on the measure of pH and the predicted micro pKa values of edges included in the path from the reference node to the node representing the protonation state comprises determining the predicted likelihood of the protonation state as a function of a sum of the predicted micro pKa values of the edges included in the path from the reference node to the node representing the protonation state.
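
The specification leaves the exact function of the summed micro pKa values open. One common choice, assumed here rather than taken from the source, follows from the definition of pKa: each protonation step along the path multiplies the population ratio by ten raised to the power pKa minus pH. Giving the all-deprotonated reference node a weight of 1, a state s reached through n edges with predicted micro pKa values pK_1, ..., pK_n would have:

```latex
w(s) \;=\; \prod_{i=1}^{n} 10^{\,\mathrm{p}K_i - \mathrm{pH}}
      \;=\; 10^{\,\left(\sum_{i=1}^{n} \mathrm{p}K_i\right) - n\,\mathrm{pH}},
\qquad
P(s) \;=\; \frac{w(s)}{\sum_{s'} w(s')}
```

As a worked case under this assumption, a molecule with a single site of micro pKa 9 at pH 7 gives w = 10^2 = 100 for the protonated state and w = 1 for the reference state, so the protonated state has likelihood 100/101, roughly 0.99.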

[0016] In some implementations, generating the prediction characterizing the molecule based at least in part on the predicted likelihoods of the plurality of protonation states of the molecule comprises generating a prediction for a most likely protonation state of the molecule as a protonation state with a highest predicted likelihood among the plurality of protonation states of the molecule.

[0017] In some implementations, generating the prediction characterizing the molecule based at least in part on the predicted likelihoods of the plurality of protonation states of the molecule comprises determining, for each of the plurality of protonation states of the molecule, a value of a property of the molecule when the molecule has the protonation state, and generating an expected value of the property of the molecule based on, for each of the plurality of protonation states, the predicted likelihood of the protonation state and the value of the property of the molecule when the molecule has the protonation state.
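
The expected value described here amounts to a likelihood-weighted average over protonation states. A minimal sketch, assuming the likelihoods and per-state property values are already available as dictionaries keyed by protonation state vector (the function name and data layout are hypothetical):

```python
def expected_property(likelihoods, values):
    """Likelihood-weighted average of a per-state property value.

    `likelihoods` maps each protonation state to its predicted likelihood;
    `values` maps the same states to the property value in that state.
    Likelihoods are renormalized so they need not sum exactly to 1.
    """
    total = sum(likelihoods.values())
    return sum(p * values[state] for state, p in likelihoods.items()) / total


# Single-site example: net charge is +1 when protonated, 0 otherwise.
example_charge = expected_property(
    {(0,): 0.25, (1,): 0.75},
    {(0,): 0.0, (1,): 1.0},
)
```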

[0018] In some implementations, for each of the plurality of protonation states of the molecule, determining the value of the property of the molecule when the molecule has the protonation state comprises processing data characterizing the protonation state of the molecule using a property prediction machine learning model, in accordance with trained values of a set of property prediction machine learning model parameters, to generate the value of the property of the molecule when the molecule has the protonation state.

[0019] In some implementations, the property prediction machine learning model has been trained by performing operations comprising obtaining a set of training examples, wherein each training example comprises: (i) a training input that characterizes a protonation state of a training molecule, and (ii) a target property value of the training molecule, and training the property prediction machine learning model on the set of training examples by a machine learning training technique comprising, for each training example, training the property prediction machine learning model to reduce a discrepancy between (a) a predicted property value of the training molecule generated by processing the training input of the training example using the property prediction machine learning model, and (b) the target property value of the training molecule that is specified by the training example.

[0020] In some implementations, the property of the molecule comprises a charge of the molecule, or a dipole moment of the molecule, or a solubility of the molecule, or a binding affinity of the molecule for a binding target.

[0021] In some implementations, processing the data identifying the molecule to generate, for each of the plurality of protonation sites on the molecule, the respective predicted micro pKa value of the protonation site comprises processing data characterizing the molecule using a micro pKa prediction machine learning model that has been trained to perform a micro pKa prediction task; or processing data characterizing the molecule using a density functional theory (DFT) calculation.

[0022] According to another aspect there is provided a method for obtaining data identifying a collection of candidate molecules, generating a respective prediction characterizing each of the candidate molecules using the method of any of the preceding implementations of the first aspect, and ranking the collection of candidate molecules based at least in part on the respective predictions for the collection of candidate molecules.

[0023] In some implementations, the method further comprises selecting one or more of the candidate molecules for physical synthesis based at least in part on the ranking of the collection of candidate molecules.
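
The ranking and selection steps can be illustrated with a short sketch. The scoring convention (a single scalar prediction per candidate, higher being better by default) and the function names are assumptions for illustration only:

```python
def rank_candidates(predictions, higher_is_better=True):
    """Return candidate identifiers ordered from best to worst.

    `predictions` maps a candidate identifier to a scalar prediction,
    e.g., an expected property value (hypothetical layout).
    """
    return sorted(predictions, key=predictions.get, reverse=higher_is_better)


def select_top_k(ranking, k):
    """Select the first k candidates of a ranking, e.g., for synthesis."""
    return ranking[:k]


example_ranking = rank_candidates({"mol_a": 0.2, "mol_b": 0.9, "mol_c": 0.5})
example_selection = select_top_k(example_ranking, 2)
```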

[0024] In some implementations, the method further comprises physically synthesizing each of the collection of candidate molecules selected for physical synthesis.

[0025] In some implementations, the method is a method of obtaining a drug, the method further comprising selecting one or more of the candidate molecules for use as a drug based at least in part on the ranking of the collection of candidate molecules.

[0026] In some implementations, the respective prediction characterizing each of the candidate molecules comprises one or more pharmacokinetic or pharmacodynamic properties for the candidate molecule.

[0027] According to another aspect there is provided a method for obtaining a ligand for a target molecule, wherein the ligand is a drug or a ligand of an industrial enzyme, the method comprising obtaining data identifying a collection of candidate ligands, generating one or more predictions characterizing the target molecule and / or each of the candidate ligands using the method of any of the preceding implementations of the first aspect, evaluating an interaction of each candidate ligand of the candidate ligands with the target molecule based at least in part on the one or more predictions, and selecting one or more of the candidate ligands as the ligand dependent on a result of the evaluating.

[0028] In some implementations, the target molecule comprises a receptor or enzyme, and wherein the ligand is an agonist or antagonist of the receptor or enzyme, or wherein the ligand comprises an antibody or aptamer and the target molecule comprises an antibody or aptamer target, in particular a virus or cancer cell protein, and wherein the antibody or aptamer binds to the antibody or aptamer target to provide a therapeutic effect.

[0029] In some implementations, the method further comprises synthesizing the drug or ligand.

[0030] In some implementations, the method further comprises testing for biological activity of the ligand or drug in vitro or in vivo.

[0031] In another aspect, there is provided a system comprising one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform the method of any one of the example implementation methods described.

[0032] In another aspect, there is provided a computer storage medium encoded with a computer program, the program comprising instructions that are operable, when executed by data processing apparatus, to cause the data processing apparatus to perform the method of any one of the example implementation methods described.

[0033] Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages.

[0034] The system of this specification provides for the prediction of molecular properties based on a protonation state graph that represents the relationships between the protonation states and the respective likelihoods of each of the protonation states of the molecule. More specifically, the system can model the probability distribution of the protonation states as a function of the micro pKa values associated with the edges of the protonation state graph and pH.

[0035] In particular, the system can reduce the use of computational resources required to model the protonation state probability distribution by providing for the modeling of the probability distribution using the structure of the protonation state graph, e.g., as opposed to using quantum chemical calculations describing the different protonation states. Determining the respective likelihood of each protonation state using a quantum chemical calculation, e.g., ab initio molecular dynamics simulation, requires running many computationally intensive iterations of a numerical program at a sufficiently high precision, time resolution, or both to capture proton transfer and track molecular conformation and energy changes at each iteration. This requires recalculating electrostatic forces and charge distributions at each iteration, and accounting for quantum mechanical forces, which requires the allocation of a large amount of computational processing and memory resources, particularly for large or complex molecules. Moreover, the forces can change drastically depending on fluctuating factors in the molecular environment, and in some cases, the molecular dynamics simulation may need to be run multiple times from different initializations for each protonation state in order to ensure accurate likelihood predictions.

[0036] In contrast, the system of this specification can rely on the structure of the protonation state graph to determine the likelihood of each of the protonation states in the protonation state graph. In particular, the system can process each of the edges, e.g., using a single forward pass through a machine learning model, to generate a predicted micro pKa value for each of the edges, and can then determine the likelihood of each protonation state as a deterministic function of the predicted micro pKa values, e.g., along identified paths between a reference node and a selected node representing a particular protonation state. By distilling the likelihood calculation to a function of predicted micro pKa values, the system can avoid the need to allocate the large amount of computational resources needed to run many iterations of a quantum chemical calculation, e.g., ab initio molecular dynamics simulation using a highly accurate method. Moreover, since each of the edges in the protonation state graph is independent, the system can also generate the micro pKa values using parallel processing, e.g., by processing data for each of the edges in batches across multiple processing units available to the system, which further reduces the computational resources required with respect to serially determining each of the micro pKa values.

[0037] Furthermore, the system can reduce the use of computational resources required to model the protonation state probability distribution by restricting the protonation state graph to only include a subset of protonation states as defined by the most probable protonation sites in the molecule. Since each protonation site has two possible states, e.g., a protonated state and a deprotonated state, the protonation state graph will have 2^N protonation states for the N protonation sites considered in the enumeration of the protonation states. The system can exponentially reduce the size of the protonation state graph by determining the most relevant protonation states, e.g., according to biophysical and biochemical criteria. Reducing the size of the protonation state graph decreases the computational resources necessary to process the graph, e.g., to determine the likelihoods of each protonation state and further process the likelihoods to determine a prediction characterizing the molecule.

[0038] Additionally, by modeling the probability distribution using the protonation state graph, e.g., instead of training a model to predict the likelihood of each protonation state, the system can inherently account for the uncertainty in the prediction characterizing the molecule. More specifically, the system can quantify the uncertainty in the prediction based on the determined likelihood values. In particular, the system can use the predicted likelihoods to determine an expected value of a molecular property or a prediction for a target protonation state with an associated confidence interval, e.g., based on a measure of dispersion of the predicted likelihoods of one or more of the protonation state(s).

[0039] The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0040] FIG. 1 is a system diagram of an example molecular protonation state evaluation system that can determine a prediction characterizing a molecule using a protonation state graph.

[0041] FIG. 2 illustrates an example protonation state graph.

[0042] FIG. 3 is a flow diagram of an example process for generating a prediction characterizing a molecule using a protonation state graph.

[0043] FIG. 4 is a flow diagram of an example process for generating a prediction of the property of a molecule using the protonation state graph.

[0044] Like reference numbers and designations in the various drawings indicate like elements.

[0047] DETAILED DESCRIPTION

[0045] FIG. 1 shows an example molecular protonation state evaluation system 100. The molecular protonation state evaluation system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.

[0046] The system 100 can generate a protonation state graph 140, e.g., data representing the possible protonation states of the molecule and the relationships between the protonation states. In this specification, a protonation state refers to a specific condition of a molecule that is defined by the presence (or absence) of a proton at each of the protonation sites of the molecule. As an example, the protonation sites of a molecule can include ionizable groups, e.g., functional or substituent groups of a molecule that can be protonated, e.g., either gain or lose a proton. The system 100 can use the protonation state graph 140 to generate a prediction characterizing the molecule, e.g., an expected molecular property value 145 or a prediction for a target protonation state 160, as will be described in more detail below.

[0047] In particular, the system 100 can receive data identifying a molecule, e.g., the molecular data 105. The molecule can include, e.g., at least 100 atoms, or at least 1000 atoms, or at least 10,000 atoms. For example, the molecular data 105 can be specified by a Simplified Molecular Input Line Entry System (SMILES) string or by an International Chemical Identifier (InChI) string, e.g., which both provide a representation of the chemical structure of a molecule in a one-dimensional string. As another example, the molecular data 105 can include a structured data file, e.g., a protein databank file, a MolBlock file, or a chemical markup file, etc.

[0048] The system 100 can then process the molecular data 105 using a protonation state engine 110 to identify the protonation states 115 of the molecule, e.g., which the system 100 can use to generate the protonation state graph 140. In particular, the protonation state engine 110 can identify each of the protonation sites, e.g., the ionizable groups, of the molecule and can determine the possible protonation states by enumerating the possible permutations of protons associated with each of the protonation sites. As an example, the molecule can have at least 5, at least 10, or at least 50 protonation sites.

[0049] In some cases, the protonation state engine 110 can identify a subset of the possible protonation sites of the molecule, e.g., 50%, 20%, or 10% of the possible protonation sites. In particular, the system 100 can restrict the protonation state engine 110 to identify, e.g., the most probable protonation sites in the molecule. For example, the protonation state engine 110 can identify the most probable protonation sites in the molecule by selecting protonation sites with nitrogen and oxygen atoms, e.g., which are more likely to undergo protonation than sites with carbon atoms, as the subset of protonation sites of the molecule. As another example, the protonation state engine 110 can identify the most probable protonation sites in the molecule by selecting chemical structures that match SMILES arbitrary target specification (SMARTS) patterns as protonation sites.

[0050] In some cases, the system 100 can determine the number of protonation sites in accordance with a computational budget, e.g., the system 100 can analyze whether the number of nodes resulting from the N identified protonation sites, e.g., 2^N nodes, is computationally feasible for storage in computational memory and further processing based on the available computational resources.
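
To make the budget check concrete, the sketch below computes the size of a full graph over N sites and the largest N that fits a node budget. It assumes the full graph is a hypercube with one edge per single-site difference (so N times 2^(N-1) edges), and the function names and the node-count budget are hypothetical stand-ins for whatever resource measure an implementation would actually use.

```python
def graph_size(n_sites):
    """Return (node count, edge count) of the full graph over n_sites sites."""
    nodes = 2 ** n_sites
    # Every node has n_sites neighbours and each edge is shared by two nodes.
    edges = n_sites * 2 ** (n_sites - 1) if n_sites > 0 else 0
    return nodes, edges


def max_sites_within_budget(max_nodes):
    """Largest number of sites whose full graph has at most max_nodes nodes.

    Assumes max_nodes is at least 1.
    """
    n_sites = 0
    while 2 ** (n_sites + 1) <= max_nodes:
        n_sites += 1
    return n_sites
```

For example, a budget of 10,000 nodes admits at most 13 sites under these assumptions, since 2^13 = 8,192 while 2^14 = 16,384.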

[0051] As an example, the engine 110 can output each respective protonation state as an ordered collection of numerical values of the same dimension N as the number of protonation sites in the molecule, e.g., an ordered vector of length N. In this case, the system 100 can use a binary indicator at each entry of the vector to indicate whether or not the specific protonation site corresponding with the entry is protonated in the protonation state, e.g., where a 1 indicates the presence of a proton and a 0 indicates the absence of a proton at a particular protonation site. As an example, for a molecule with four ionizable groups, the engine 110 can output the vector (1, 1, 0, 1), indicating that the first, second, and fourth ionizable groups are protonated in the protonation state.
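
The binary-vector encoding can be sketched with the Python standard library. The function names are hypothetical, and site indices in the code are zero-based, so the "first, second, and fourth" groups of the example correspond to indices 0, 1, and 3.

```python
import itertools


def enumerate_protonation_states(n_sites):
    """Return every protonation state as a tuple of 0/1 flags of length n_sites."""
    return list(itertools.product((0, 1), repeat=n_sites))


def protonated_sites(state):
    """Return the zero-based indices of the sites that carry a proton."""
    return [index for index, flag in enumerate(state) if flag == 1]


example_states = enumerate_protonation_states(4)
```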

[0052] The system 100 can then process the protonation states 115 to generate a protonation state graph 140. More specifically, the system 100 can generate data defining a set of nodes and a set of edges in the graph 140. In particular, each node in the graph 140 can represent a respective protonation state of the molecule, e.g., that corresponds with a particular protonation state vector representing whether or not each protonation site in the molecule is protonated for the protonation state, and each edge can connect a respective pair of nodes and represent that the protonation states corresponding with the connected nodes differ only in a single protonation state of a single protonation site on the molecule. The protonation state graph 140 can include, e.g., at least 100 nodes, or at least 1000 nodes, or at least 10,000 nodes, etc., and e.g., at least 40 edges, 500 edges, or 2000 edges. An example protonation state graph will be described in more detail with respect to FIG. 2.
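
One way to build such a graph from the state vectors is sketched below. The representation (nodes as tuples, edges as triples of deprotonated state, protonated state, and site index) and the function name are illustrative assumptions; the specification does not prescribe a data structure.

```python
import itertools


def build_protonation_state_graph(n_sites):
    """Return (nodes, edges) for the full protonation state graph.

    Each edge is (state_without_proton, state_with_proton, site_index) and
    links two states that differ only at `site_index`.
    """
    nodes = list(itertools.product((0, 1), repeat=n_sites))
    edges = []
    for state in nodes:
        for site, flag in enumerate(state):
            if flag == 0:
                # Protonating one empty site yields the neighbouring state.
                protonated = state[:site] + (1,) + state[site + 1:]
                edges.append((state, protonated, site))
    return nodes, edges


example_nodes, example_edges = build_protonation_state_graph(3)
```

Listing each edge once, from the less protonated state to the more protonated one, keeps the edge list free of duplicates and gives every edge a natural direction for attaching a micro pKa value.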

[0053] More specifically, each respective edge in the protonation state graph 140 can correspond to the particular protonation site on the molecule that differs between the protonation state nodes that the edge connects. The system 100 can also associate each edge with the predicted micro pKa value of the protonation site to which the edge corresponds. The micro pKa value refers to the acid-base dissociation constant that measures the tendency of the protonation site, e.g., in this context, the protonation site corresponding with the edge, to release or accept a proton based on the local molecular environment of the protonation site.

[0054] For example, the system 100 can obtain the corresponding micro pKa values 125 for each edge, e.g., by obtaining the micro pKa values 125 from a database, by receiving the micro pKa values 125 from another system, or by generating the micro pKa values 125 for each edge. In the particular example depicted, the system 100 generates the micro pKa values 125 for each edge using a micro pKa prediction model 120.

[0055] In the case in which the system 100 generates the micro pKa values 125, the system can generate the micro pKa values 125 at least partially in parallel, e.g., for some or all of the protonation sites. In particular, since each of the edges of the graph is independent, the system 100 can generate the micro pKa values by processing data for each of the edges using multiple processing units available to the system 100, thereby reducing the latency and enhancing the efficiency of the calculation of the micro pKa values 125 with respect to serially determining each of the micro pKa values using the micro pKa prediction model 120.
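
Because the per-edge predictions do not depend on one another, they can be dispatched concurrently. The sketch below uses a thread pool from the Python standard library; `predict_micro_pka` is a placeholder that returns a made-up value and stands in for the micro pKa prediction model 120, and the edge layout is the hypothetical triple (state, neighbouring state, site index).

```python
from concurrent.futures import ThreadPoolExecutor


def predict_micro_pka(edge):
    """Placeholder predictor: returns a made-up value derived from the site index."""
    _, _, site = edge
    return 4.0 + site


def predict_all_edges(edges, max_workers=4):
    """Predict a micro pKa for every edge, running predictions concurrently.

    Executor.map preserves input order, so results line up with `edges`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(edges, pool.map(predict_micro_pka, edges)))


example_edges_to_score = [((0, 0), (1, 0), 0), ((0, 0), (0, 1), 1)]
example_pka_by_edge = predict_all_edges(example_edges_to_score)
```

Threads are only a convenient illustration: for a compute-bound model, a process pool or batched inference on an accelerator would be the more likely route to a real speed-up.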

[0056] As an example, the micro pKa prediction model 120 can be implemented using a density functional theory (DFT) calculation. In particular, the model 120 can calculate the Gibbs free energy of each protonation site in the molecule in both a protonated and deprotonated state using quantum mechanical computations and can determine the micro pKa values 125 of each protonation site of the molecule based on a difference in the Gibbs free energy for the protonated and deprotonated state. More specifically, the micro pKa values are directly proportional to the free energy differences between protonation states. Similarly to how the free energy between two states can be calculated from the free energy differences between the allowed state transitions, e.g., the sum of the free energy differences along the path between the two states that are connected by edges, the micro pKa values for each of the edges can be determined by summing the pKa values along a path of connected edges relative to a reference node.
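
The proportionality between pKa and free energy can be written as pKa = ΔG / (R·T·ln 10). The sketch below applies it under stated assumptions: ΔG is taken to be the free energy of deprotonation in solution, in joules per mole, already including whatever solvation and standard-state corrections the calculation requires, and the function name is hypothetical.

```python
import math

GAS_CONSTANT = 8.314462618  # J / (mol * K)


def pka_from_free_energy(delta_g_j_per_mol, temperature_k=298.15):
    """Convert a deprotonation free energy (J/mol) to a pKa value."""
    return delta_g_j_per_mol / (GAS_CONSTANT * temperature_k * math.log(10))
```

At 298.15 K the conversion factor R·T·ln 10 is about 5.71 kJ/mol, so each pKa unit corresponds to roughly 5.7 kJ/mol (about 1.36 kcal/mol) of free energy.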

[0057] As another example, the micro pKa prediction model 120 can be a machine learning model, e.g., a micro pKa prediction machine learning model, that has been trained to perform a micro pKa prediction task. In this case, the micro pKa prediction model 120 can be configured to process the molecular data 105 to generate the predicted micro pKa values 125 for each of the protonation sites on the molecule. In particular, the micro pKa prediction model 120 can process a model input that includes: (i) the molecular data 105, and (ii) data identifying a target protonation site of the molecule to generate the micro pKa of the target protonation site as an output.

[0058] For example, the micro pKa prediction machine learning model can have any appropriate machine learning architecture, e.g., a random forest, a support vector machine, a decision tree, a linear regression model, or a neural network, that can be configured to process the molecular data 105 to generate predicted micro pKa values 125 for each of the protonation sites in the molecule. For instance, the micro pKa prediction machine learning model can be a neural network with any appropriate number of neural network layers (e.g., 1 layer, 5 layers, or 10 layers) of any appropriate type (e.g., fully-connected layers, attention layers, convolutional layers, etc.) connected in any appropriate configuration (e.g., as a linear sequence of layers, or as a directed graph of layers).

[0059] More specifically, the system 100 or another system can train the micro pKa prediction machine learning model on a set of training examples, e.g., where each training example includes (i) a training input including molecular data identifying a molecule and data identifying a target protonation site of the molecule, and (ii) a target output including a ground truth micro pKa value for the target protonation site of the molecule, by a machine learning training technique to optimize an objective function. The objective function can measure, for each training example, a discrepancy between: (i) the ground truth micro pKa value for the target protonation site and (ii) the predicted micro pKa value for the target protonation site generated by the micro pKa prediction machine learning model by processing the training input of the training example. For example, training examples comprising experimentally measured micro pKa values are available from publicly available databases, such as DataWarrior, see e.g., Sander et al., J. Chem. Inf. Model. 2015, 55, 2, 460-473.

[0060] The objective function can measure a discrepancy between target and predicted molecule properties in any appropriate way, e.g., using a cross-entropy loss or a mean squared error loss. The machine learning training technique can be any technique appropriate for training the micro pKa prediction machine learning model. In particular, the system 100 or another system can train the micro pKa prediction machine learning model at each of a number of training iterations until a training termination criterion is met. For example, in the case that the micro pKa prediction machine learning model is a neural network, the model can be trained by calculating and backpropagating gradients of an objective function to update parameter values of the model, e.g., using the update rule of any appropriate gradient descent optimization algorithm, e.g., RMSprop or Adam.

[0061] As an example, the micro pKa prediction machine learning model can be implemented using a graph neural network (GNN). In this case, the molecular data 105 can be represented by a graph, e.g., where the nodes represent atoms of the molecule and the edges represent the chemical bonds between the atoms represented by the nodes, and the system 100 can mark the node representing the target protonation site of the molecule using a feature, e.g., a binary flag, as an indicator.

[0062] In particular, the GNN can include an encoder block, a sequence of one or more message passing neural network layers, and a decoder block. The encoder block can, for each atom in the molecule, process a set of atom features of the atom to generate a node embedding for the graph node that represents the atom. Further, the encoder block can, for each bond in the molecule, process a set of bond features of the bond to generate an edge embedding for the graph edge that represents the bond. Each message passing layer is configured to process the set of node embeddings and the set of edge embeddings, by neural network operations that are parametrized by a set of neural network parameters of the message passing layer and that are conditioned on the topology of the graph representing the molecule, to update the node embeddings and the edge embeddings. The decoder block can process the node and edge embeddings generated by the final message passing layer to generate the GNN output, in this case, the predicted pKa value of the target protonation site of the molecule. An example of using a GNN to predict pKa values of target protonation sites of a molecule is described in further detail in WIPO PCT App. No. 2025 / 131,529, which is herein incorporated by reference.

[0063] As another example, the micro pKa prediction machine learning model can be an equivariant neural network that incorporates symmetry to ensure consistent predictions, e.g., a rotationally, translationally, and chirality-invariant TensorNet model. In this case, the equivariant model can process three-dimensional molecular structures of various conformations, while maintaining consistent micro pKa predictions. An example of using an equivariant neural network to predict pKa values of target protonation sites is described in further detail in Schapin, N., et al. “PlayMolecule pKAce: Small Molecule Protonation through Equivariant Neural Networks”, arXiv:2407.11103v1 (July 2024).

[0064] While described above with respect to processing the molecular data 105 using the micro pKa prediction model 120 to generate the predicted micro pKa values 125, the system 100 can also generate the data representing the protonation state graph 140 and then process each of the pairs of the nodes of the graph 140 that are connected by an edge using the micro pKa prediction model 120 to generate the respective predicted micro pKa values 125 for each edge. For example, in the case that the micro pKa prediction model 120 is implemented as a micro pKa prediction machine learning model, the micro pKa prediction machine learning model can be configured to process the data representing the protonation states of the nodes of the graph 140 connected by an edge to generate the predicted micro pKa value for the edge. In particular, the edges can represent allowed transitions between states, and the model can traverse a path between states by following connected edges.

[0065] After generating the data representing the graph 140, e.g., the protonation states 115 and the micro pKa values 125, the system 100 can process the protonation state graph 140 to determine the distribution of the protonation states for the molecule, e.g., the likelihood of protonation state(s) 155, using a protonation state distribution engine 150. More specifically, the system 100 can model the probability distribution of the protonation states as a function of the micro pKa values associated with the graph edges and the pH.

[0066] In particular, the engine 150 can determine a likelihood of each of the protonation states of the molecule by identifying paths, e.g., paths defined by a sequence of edges connecting two nodes, in the protonation state graph 140 from a reference node of the graph 140 to each of the protonation state nodes. In this context, a “path” refers to a sequence of nodes starting from the reference node, where each consecutive node is connected by an edge, and there are no repeat nodes in the sequence.

[0067] The reference node can be any node in the protonation state graph 140 as long as it is consistently used as the reference node for all of the paths, e.g., the reference node can be selected randomly. By consistently using the same reference node, the system 100 constrains the likelihood of the reference node, e.g., since the sum of the likelihoods of all of the nodes in the protonation state graph 140 is one. As an example, the reference node can be the node that represents the protonation state of the molecule where each protonation site is deprotonated, e.g., (0, 0, 0, 0) in the four-protonation site example given above.

[0068] For example, the engine 150 can identify each unique path from the reference node to the protonation state node and can calculate the likelihood of the particular protonation state at a given pH using the predicted micro pKa values of the edges included in the unique paths. In this case, for each unique path, the protonation state distribution engine 150 can determine a respective predicted likelihood of the protonation state based on the predicted micro pKa values of edges included in the path, e.g., using a sum of the micro pKa values of the edges included in the path.
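
As a minimal sketch of the path identification described above, the unique paths between a reference node and a protonation state node can be enumerated with a depth-first search that never revisits a node. The function name and the adjacency-dictionary representation are assumptions made for this example. The four states used here are only those whose connecting edges are named in the description of FIG. 2, not the full graph.

```python
def simple_paths(adjacency, reference, target):
    """Yield every path from reference to target that visits no node twice."""
    stack = [(reference, [reference])]
    while stack:
        node, path = stack.pop()
        if node == target:
            yield path
            continue
        for neighbour in adjacency[node]:
            if neighbour not in path:  # enforce "no repeat nodes"
                stack.append((neighbour, path + [neighbour]))


# Hypothetical sub-graph: states A, D, G, H and the four edges between them.
adjacency = {
    "A": ["D", "G"],
    "D": ["A", "H"],
    "G": ["A", "H"],
    "H": ["D", "G"],
}
paths = sorted(simple_paths(adjacency, "A", "H"))
```

With this sub-graph, the search returns the two paths A-D-H and A-G-H.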

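[0069] As an example, the engine 150 can determine the likelihood of a given protonation state at a particular pH using:

$$P_i(\mathrm{pH}) = \frac{10^{A_i - \Delta n_i \cdot \mathrm{pH}}}{\sum_{k=1}^{M} 10^{A_k - \Delta n_k \cdot \mathrm{pH}}}, \qquad A_i = \sum_{j \in T_i} \mathrm{p}K_{a,j}$$

where $T_i$ is a set of edges forming a path from the i-th protonation state to the reference node, $\mathrm{p}K_{a,j}$ is the j-th micro pKa in this path, $\Delta n_i$ is the relative number of bound hydrogens (protons) between the i-th node and the reference node, and M is the total number of nodes.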
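
As an illustrative sketch, the likelihood calculation can be written as a normalised weight per state. The sketch assumes that each state i receives a weight of 10 raised to the power (A_i - Δn_i · pH), where A_i is the summed micro pKa along a path to the reference node and Δn_i is the number of protons gained relative to the reference node, and that the reference node is the less protonated end of every path. The function and argument names are hypothetical.

```python
def state_likelihoods(path_pka_sums, proton_counts, ph):
    """Return the normalised likelihood of each protonation state at a pH.

    path_pka_sums[s] is the assumed A_s (summed micro pKa to the reference).
    proton_counts[s] is the assumed delta-n_s (extra protons vs. reference).
    """
    weights = {
        state: 10.0 ** (path_pka_sums[state] - proton_counts[state] * ph)
        for state in path_pka_sums
    }
    total = sum(weights.values())
    return {state: weight / total for state, weight in weights.items()}


# Illustrative single-site molecule with a micro pKa of 7, evaluated at pH 5.
likelihoods = state_likelihoods({"A": 0.0, "HA": 7.0}, {"A": 0, "HA": 1}, ph=5.0)
```

At a pH equal to the micro pKa, the two states of this single-site example are equally likely. At two pH units below, the protonated state dominates by a factor of about 100.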

[0070] The engine 150 can then determine a measure of central tendency, e.g., a mean or median, of the likelihoods determined for the unique paths, as the predicted likelihood of the protonation state.

[0071] In some cases, the protonation state distribution engine 150 can additionally determine an uncertainty in the predicted likelihood of the protonation state from the predicted likelihoods corresponding to each of the unique paths. In particular, the engine 150 can determine a measure of dispersion, e.g., a variance or an interquartile range, of the predicted likelihoods corresponding to the unique paths to quantify the uncertainty in the predicted likelihood of the protonation state. The measure of dispersion can also be used to determine a confidence interval for any prediction deriving from the predicted likelihoods, e.g., as will be described in more detail with respect to FIG. 2.
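
A minimal sketch of this aggregation step is given below, assuming that one likelihood value has already been computed per unique path. The function name and the input values are hypothetical, and the population variance is used as one possible measure of dispersion.

```python
import statistics


def summarise_path_likelihoods(path_likelihoods):
    """Aggregate per-path likelihoods into a prediction and an uncertainty."""
    return {
        "mean": statistics.mean(path_likelihoods),
        "median": statistics.median(path_likelihoods),
        "variance": statistics.pvariance(path_likelihoods),
    }


# Illustrative likelihoods of one protonation state along two unique paths.
summary = summarise_path_likelihoods([0.40, 0.44])
```

If the micro pKa predictions were perfectly consistent, every path would give the same likelihood and the variance would be zero.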

[0072] By determining the predicted likelihood values for each of the protonation states, the protonation state distribution engine 150 effectively models the probability distribution of each of the protonation states of the molecule. The system 100 can use the modeled probability distribution, e.g., the predicted likelihoods of the protonation state(s) 155, to generate a prediction characterizing the molecule.

[0073] For example, the system 100 can generate a prediction for a target protonation state 160, e.g., the target protonation state 160 can be the most likely protonation state, i.e., the state with the highest predicted likelihood. As another example, the system 100 can generate a prediction for a target protonation state 160 including one or more protonation states that share a common characteristic. In particular, the system 100 can generate the likelihood of any particular charge condition of the molecule, e.g., charge +1, charge +5, charge -3, e.g., by summing the likelihoods of the protonation states that have the particular charge condition.
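
The summation over protonation states that share a charge condition can be sketched as follows. The state labels, charges, and likelihood values are illustrative assumptions, as is the function name.

```python
def charge_likelihoods(state_likelihood, state_charge):
    """Sum the likelihoods of all protonation states with the same net charge."""
    totals = {}
    for state, probability in state_likelihood.items():
        charge = state_charge[state]
        totals[charge] = totals.get(charge, 0.0) + probability
    return totals


# Hypothetical two-site molecule: each tuple flags which sites are protonated.
state_likelihood = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.4, (1, 1): 0.2}
state_charge = {(0, 0): -1, (1, 0): 0, (0, 1): 0, (1, 1): 1}
by_charge = charge_likelihoods(state_likelihood, state_charge)
most_likely_state = max(state_likelihood, key=state_likelihood.get)
```

In this example the two neutral states together account for a likelihood of 0.7, even though neither is individually more likely than 0.4.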

[0074] As another example, the system 100 can use the likelihood of protonation state(s) 155 to determine an expected molecular property value 145 for the molecule. The protonation state graph 140, the micro pKa values 125, and the likelihood values 155 provide information from which different properties related to the protonation states can be calculated, e.g., the charge of the molecule, the dipole moment of the molecule, the solubility of the molecule, the charged fraction of a molecule in a solution, or the binding affinity of the molecule with respect to a binding target.

[0075] In particular, the system 100 can use the likelihood of the protonation state(s) 155 to determine an expected molecular property value 145 for molecular properties that depend on the protonation of the molecule. In this case, the system 100 can apply the predicted likelihoods of the protonation state(s) 155 as a weight in an expected value calculation. More specifically, the system 100 can obtain molecular property value prediction(s) 135 for each of the protonation states in the protonation state graph 140 and can apply the corresponding likelihood values 155 as the weight to the molecular property value prediction(s) 135 for each respective protonation state in an expected value calculation to generate an expected molecular property value 145. In particular, the system can calculate an expected molecular property value 145 by calculating:

$$E[X] = \sum_{i} x_i \, P(X = x_i)$$

where $E[X]$ is the expected molecular property value 145, i indexes over the protonation states in the protonation state graph 140, $x_i$ is the molecular property value of the i-th protonation state, and $P(X = x_i)$ is the predicted likelihood of the i-th protonation state. In this case, since the predicted likelihood values 155 are dependent on the pH, the expected molecular property value 145 will also depend on the pH.
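
The likelihood-weighted expected value can be sketched in a few lines. The function name and the example values are assumptions. The property used here is the net charge, but any per-state property prediction could be substituted.

```python
def expected_property(state_likelihood, state_property):
    """Likelihood-weighted average of a per-state molecular property."""
    return sum(
        probability * state_property[state]
        for state, probability in state_likelihood.items()
    )


# Hypothetical likelihoods at one pH and the net charge of each state.
state_likelihood = {"s0": 0.1, "s1": 0.3, "s2": 0.4, "s3": 0.2}
state_charge = {"s0": -1, "s1": 0, "s2": 0, "s3": 1}
expected_charge = expected_property(state_likelihood, state_charge)
```

Re-evaluating the likelihoods at a different pH and repeating the calculation gives the pH dependence of the expected value.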

[0076] For example, the system 100 can determine the isoelectric point of a molecule using the expected value of the charge of the molecule. The isoelectric point is the pH at which the molecule is neutral and influences many properties of interest for drug development, such as solubility, stability, absorption, and distribution. In particular, the system 100 can determine the isoelectric point as the pH at which the expected value of the charge is zero, e.g., the system 100 can determine the likelihood values at each of a number of pHs and calculate the expected value of the charge at each pH to determine the isoelectric point.
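
One possible way to locate the pH at which the expected charge is zero is sketched below. Rather than scanning a fixed set of pH values, the sketch uses bisection, which is a substitute technique that assumes the expected charge decreases monotonically as the pH rises. The state weights are assumed to take the form 10 raised to the power (A_i - Δn_i · pH), and all names and numerical values are illustrative.

```python
def expected_charge(states, ph):
    """states: list of (pka_sum, extra_protons, charge) relative to a reference."""
    weights = [10.0 ** (a - n * ph) for a, n, _ in states]
    total = sum(weights)
    return sum(w * q for w, (_, _, q) in zip(weights, states)) / total


def isoelectric_point(states, low=0.0, high=14.0, tol=1e-6):
    """Bisect on pH until the expected charge changes sign within tol."""
    if expected_charge(states, low) < 0 or expected_charge(states, high) > 0:
        raise ValueError("expected charge does not cross zero in the pH range")
    while high - low > tol:
        mid = (low + high) / 2
        if expected_charge(states, mid) > 0:
            low = mid   # still net positive: the crossing is at higher pH
        else:
            high = mid  # net negative or zero: the crossing is at lower pH
    return (low + high) / 2


# Illustrative three-state chain: charge -1 (reference), 0, and +1.
states = [(0.0, 0, -1), (9.6, 1, 0), (11.94, 2, 1)]
pi = isoelectric_point(states)
```

For this example the -1 and +1 states are equally populated when 11.94 - 2 · pH = 0, so the search converges to a pH of about 5.97.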

[0077] In the particular example depicted, the system 100 can process the protonation states 115 using a molecular property prediction model 130 to generate the molecular property prediction(s) 135 for each of the protonation states of the molecule. In particular, the molecular property prediction model 130 can be a machine learning model that has been configured to process a protonation state 115 to generate one or more predicted molecular property value(s) for the protonation state. For example, the molecular property prediction model 130 can have any appropriate machine learning architecture, e.g., a random forest, a support vector machine, a decision tree, a linear regression model, or a neural network. In the case that the molecular property prediction model 130 is implemented as a neural network, the model 130 can have any appropriate number of neural network layers (e.g., 1 layer, 5 layers, or 10 layers) of any appropriate type (e.g., fully-connected layers, attention layers, convolutional layers, etc.) connected in any appropriate configuration (e.g., as a linear sequence of layers, or as a directed graph of layers).

[0078] In this case, the system 100, or another system, can train the molecular property prediction model 130 on a set of training examples, e.g., where each training example includes (i) a training input characterizing a protonation state of a training molecule and (ii) a target output of one or more ground truth molecule property value(s), by a machine learning training technique to optimize an objective function. The objective function can measure, for each training example, a discrepancy between: (i) the one or more target molecule property value(s) specified by the training example, and (ii) the predicted molecule property value(s) generated by the molecular property prediction model 130 by processing the training input of the training example. For example, the objective function can measure a discrepancy between target and predicted molecule properties in any appropriate way, e.g., using a cross-entropy loss or a mean squared error loss. Training examples based on experimental data are available from publicly available databases, such as ChEMBL (https://www.ebi.ac.uk/chembl/).

[0079] The machine learning training technique can be any technique appropriate for training the molecular property prediction model 130. For instance, for a molecular property prediction model 130 implemented as a neural network, the machine learning training technique can be a stochastic gradient descent training technique. In particular, the model 130 can be trained by calculating and backpropagating gradients of an objective function to update parameter values of the model, e.g., using the update rule of any appropriate gradient descent optimization algorithm, e.g., RMSprop or Adam.

[0080] In particular, the system 100 can generate and provide any of: the likelihood of the protonation state(s) 155, the prediction for the target protonation state 160, the molecular property predictions 135, or the expected molecular property value 145, e.g., to a user or for use in one or more downstream tasks.

[0081] Additionally, in some cases, the system 100 can generate the macro pKa of the molecule as well, which quantifies the free energy needed to add a proton anywhere in a molecule, e.g., as opposed to a specific atom as in the micro pKas. In this case, the macro pKa can be determined for a transition from a state with $n_a$ protons to a state with $n_b$ protons, e.g., using a function of $\delta(i, n)$ and $A_i$, where $\delta(i, n)$ is a δ-function, e.g., which is 1 when the number $n_i$ of protons of the i-th state is n and 0 otherwise, and $A_i$ is, as defined above, the sum of the micro pKa values from the reference node to the i-th node. Macro pKas are important to calculate as they can greatly facilitate the validation of the protonation state graphs with experimental pKa measurements.
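
As a hedged sketch of one way a macro pKa could be computed from these quantities, the example below assumes that the two proton counts are adjacent (n and n + 1) and that the macro pKa is the base-10 logarithm of the ratio between the summed weights 10 raised to the power A_i of the states with n + 1 protons and those with n protons. This specific form is an assumption made for illustration, and the function name and numerical values are hypothetical.

```python
import math


def macro_pka(path_pka_sums, proton_counts, n):
    """Assumed macro pKa for going from n to n + 1 bound protons."""
    lower = sum(10.0 ** a for s, a in path_pka_sums.items()
                if proton_counts[s] == n)
    upper = sum(10.0 ** a for s, a in path_pka_sums.items()
                if proton_counts[s] == n + 1)
    return math.log10(upper / lower)


# Illustrative two-site molecule; keys flag which sites are protonated.
path_pka_sums = {"00": 0.0, "10": 9.0, "01": 8.0, "11": 16.5}
proton_counts = {"00": 0, "10": 1, "01": 1, "11": 2}
first_macro_pka = macro_pka(path_pka_sums, proton_counts, 0)
second_macro_pka = macro_pka(path_pka_sums, proton_counts, 1)
```

Under this assumption the first macro pKa (about 9.04) is slightly higher than the larger of the two micro pKa values, because a proton can be added at either site.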

[0082] As an example, the system 100 can provide any of the predicted outputs as an input to a downstream analysis task. In particular, the system 100 can use the prediction to inform a decision regarding the synthesis of a molecule. More specifically, the system 100 can obtain data identifying a collection of candidate molecules and can generate a respective prediction characterizing each of the candidate molecules. For example, the system 100 can rank the collection of candidate molecules based on the respective predictions for each of the candidate molecules in the collection and can select one or more of the candidate molecules for physical synthesis based on the ranking. In some cases, the selected candidate molecules can then be physically synthesized, e.g., by a user.

[0083] In some cases, the prediction outputs can be used for drug discovery. Drug discovery can involve identifying specific molecules within the body that are involved in a human or animal disease process. These molecules are often proteins, such as enzymes, receptors, or signaling proteins, that play a key role in the disease's development or progression. A ligand, often a small molecule (e.g., with a molecular weight equal to or less than 900 daltons), peptide, or antibody, can be selected to bind specifically to an identified target protein. When a drug that includes the ligand is administered to a patient, the ligand can bind to the target protein with high affinity and in doing so contribute to achieving a therapeutic effect in the patient. For instance, if the target molecule is an enzyme involved in a disease process, the ligand can inhibit its activity, thus disrupting the disease pathway. More generally, the interaction between the ligand and the target molecule can activate, inhibit, or alter the function of the target molecule to achieve a therapeutic effect. For example, the ligand can be an agonist or antagonist of a receptor of the target molecule (e.g., protein).

[0084] As one example, the prediction outputs generated by the system 100 can be used to screen candidate drug molecules for suitability as a drug. For example, the system can generate respective predictions characterizing each of a plurality of candidate drug molecules in relation to one or more pharmacokinetic or pharmacodynamic properties, such as ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties, e.g., under in vitro or in vivo conditions, for each of the candidate drug molecules. Candidate drug molecules that have suitable pharmacokinetic or pharmacodynamic properties can then be physically synthesized for in vitro or in vivo testing.

[0085] As another example, the prediction outputs generated by the system 100 can be used to obtain a ligand for a target molecule, with the ligand being a drug, or a ligand of an industrial enzyme. The system 100 can generate one or more predictions characterizing the target molecule and/or each of a collection of candidate ligands. For example, the one or more predictions can comprise expected charge or protonation states under biological conditions (e.g., intracellular or intercellular conditions). An interaction of each candidate ligand with the target molecule can then be evaluated based at least in part on the one or more predictions, and one or more of the candidate ligands can be selected as the ligand dependent on a result of the evaluating. For example, respective binding affinities between the candidate ligands and the target molecule can be predicted, e.g., using computational methods or experimental assays, and used to select the ligand. For example, the candidate ligand(s) may be selected according to which have the highest affinity. As one example, the binding affinities can be predicted using a structure prediction machine learning model, e.g., based on the AlphaFold 3 model, as described in Abramson, Josh, et al., “Accurate structure prediction of biomolecular interactions with AlphaFold 3.” Nature (2024): 1-3. A protonation state of a molecule, such as a protein, can affect its structure and/or its ability to bind to another molecule, e.g., as a result of modified electrostatic interactions or solvent screening effects, and/or dynamic properties of the molecule, such as interconversion rates between conformations of the molecule, or the energies of reaction transition states or reaction intermediates. As one example, the protonation state of a molecule can determine whether the molecule is able to participate in proton transfer reactions.

[0086] In some implementations a candidate (e.g. polypeptide or polynucleotide) drug molecule or ligand may include: an isolated antibody or aptamer, a fragment of an isolated antibody or aptamer, a single variable domain antibody, a bi- or multi-specific antibody, a multivalent antibody, a dual variable domain antibody, an immuno-conjugate, a fibronectin molecule, an adnectin, a DARPin, an avimer, an affibody, an anticalin, an affilin, a protein epitope mimetic or combinations thereof. A candidate (polypeptide) ligand may include an antibody with a mutated or chemically modified amino acid Fc region, e.g., which prevents or decreases ADCC (antibody-dependent cellular cytotoxicity) activity and/or increases half-life when compared with a wild type Fc region. Candidate (polypeptide or polynucleotide) drug molecules or ligands may include antibodies with different CDRs (Complementarity-Determining Regions).

[0087] A selected ligand (e.g. selected polypeptide or polynucleotide ligand) may comprise a receptor or enzyme and the ligand may be an agonist or antagonist of the receptor or enzyme. In some implementations the ligand may comprise an antibody or aptamer and the target molecule (e.g., protein molecule) comprises an antibody or aptamer target, for example a virus, in particular a virus coat protein, or a protein expressed on a cancer cell. In these implementations the antibody or aptamer binds to the antibody or aptamer target to provide a therapeutic effect. For example, the antibody or aptamer may bind to the target and act as an agonist for a particular receptor; alternatively, the antibody or aptamer may prevent binding of another ligand to the target, and hence prevent activation of a relevant biological pathway.

[0088] As another example, the target molecule may be an enzyme comprising a CRISPR associated protein and the ligand may comprise a guide RNA molecule. The method may be performed to identify a combination of guide RNA molecule and CRISPR associated protein, in particular one that operates efficiently to edit genes. Such a method can involve determining a predicted structure of the enzyme, in particular to check that the enzyme shape and the guide RNA shape fit and work together effectively. The guide RNA may have a part with a defined 3D structure, e.g. it may be a single guide RNA (sgRNA), incorporating a guide sequence and a tracrRNA sequence.

[0089] FIG. 2 illustrates an example protonation state graph 200. For example, the molecular protonation state evaluation system 100 can generate the data defining the graph 200 and use the graph 200 to determine the likelihoods of each protonation state in the graph 200, e.g., using the protonation state distribution engine 150 of FIG. 1.

[0090] For example, each of the nodes A-H in the graph 200 represents a protonation state, e.g., state A 202, state B 204, state C 206, state D 208, state E 210, state F 212, state G 214, and state H 216. As an example, each of the states 202-216 can be specified by a respective protonation state vector representing whether or not each protonation site in the molecule is protonated in the protonation state.

[0091] Each edge in the graph 200 can connect a respective pair of nodes and represent a corresponding pair of possible protonation states of the molecule that differ only in the protonation of a single protonation site on the molecule. For example, in the particular example depicted, protonation state A 202 differs from state B 204 at only one protonation site, state B 204 differs from state E 210 by only one protonation site, and state E 210 differs from state F 212 at only one protonation site.
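
As a minimal sketch of such a graph, the example below enumerates every combination of protonated and deprotonated sites and connects each pair of states that differ at exactly one site. Enumerating all combinations is an assumption made for illustration; a system could instead restrict the graph to a subset of states. The function name and the tuple-based state vectors are hypothetical.

```python
from itertools import product


def build_protonation_state_graph(num_sites):
    """Return (nodes, edges) for all 2**num_sites protonation states.

    Each node is a tuple of 0/1 flags (1 = protonated). Each edge is
    (state_a, state_b, site): the two states differ only at `site`,
    and state_b carries the extra proton.
    """
    nodes = list(product((0, 1), repeat=num_sites))
    edges = []
    for state in nodes:
        for site, flag in enumerate(state):
            if flag == 0:
                protonated = state[:site] + (1,) + state[site + 1:]
                edges.append((state, protonated, site))
    return nodes, edges


# Three protonation sites give eight states, the same count as states A-H.
nodes, edges = build_protonation_state_graph(3)
```

Each edge would then be annotated with the predicted micro pKa value of the site recorded in its third element.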

[0092] As described with respect to FIG. 1, the system can associate each of the edges with a respective pKa value, e.g., as a feature vector that characterizes the change in acidity due to the loss or gain of a proton at the particular protonation site that differs between the protonation states represented by the two nodes connected by the edge. As an example, the system can process each of the pairs of the nodes of the graph 200 that are connected by an edge using a micro pKa prediction model, e.g., the micro pKa prediction model 120 of FIG. 1, to generate the predicted micro pKa values for each of the protonation sites in the molecule.

[0093] In this case, the micro pKa prediction model can be a model that is configured to process data defining the pair of nodes connected by an edge to generate the predicted micro pKa value for the edge. More specifically, the system can process protonation state A 202 and state G 214 using the micro pKa prediction model (not depicted) to generate the pKa value 220 for the protonation site differing between state A 202 and G 214 and can process protonation state A 202 and state D 208 using the micro pKa prediction model to generate the pKa value 222 for the protonation site differing between state A 202 and state D 208. As another example, the system can process protonation state D 208 and state H 216 using the micro pKa prediction model to generate the pKa value 226 and can process protonation state G 214 and protonation state H 216 using the micro pKa prediction model to generate the pKa value 224. Likewise, the system can process the remaining pairs of the nodes of the graph 200 that are connected by an edge to generate the respective micro pKa values for each of the edges.

[0094] The system can then use the micro pKa values associated with the edges to determine the predicted likelihood of observing a particular protonation state, e.g., protonation state H 216, with respect to the likelihood of observing any of the other protonation states represented in the graph 200. As an example, the system can determine the predicted likelihood of protonation state H 216 as a function of a sum of the predicted micro pKa values of the edges included in each unique path from a reference node to a selected node 240 corresponding with protonation state H 216.

[0095] In the particular example depicted, state A 202 has been designated as the reference node 230, but generally any of the protonation states can be used as the reference node. Since state A 202 has been designated as the reference node 230, the system can determine the predicted likelihood of protonation state H 216 using a sum of the predicted micro pKa values of the edges included in each unique path from the reference node 230, e.g., state A 202, to the selected node 240 corresponding with protonation state H 216. As an example, the system can determine the predicted likelihood of protonation state H 216 by calculating:

$$P_H(\mathrm{pH}) = \frac{10^{A_H - \Delta n_H \cdot \mathrm{pH}}}{\sum_{j=1}^{M} 10^{A_j - \Delta n_j \cdot \mathrm{pH}}}$$

where M is the total number of protonation states in the protonation state graph, j indexes over each protonation state in the graph, $A_H = \mathrm{p}K_{a,222} + \mathrm{p}K_{a,226}$, and $\Delta n_j = n_j - n_A$.

[0096] More specifically, in this case, there are two unique paths from state A 202 to state H 216 in graph 200: (i) a first path from state A 202 to state D 208 to state H 216, and (ii) a second path from state A 202 to state G 214 to state H 216. In particular, the system can sum the micro pKa values along the first path, e.g., by adding pKa 222 to pKa 226 to yield a first predicted likelihood value, and can sum the micro pKa values along the second path, e.g., by adding pKa 220 to pKa 224 to yield a second predicted likelihood value. The system can then compute a measure of central tendency, e.g., a mean or median, of the predicted likelihoods as the predicted likelihood of protonation state H 216.

[0097] Based on biophysical and biochemical considerations, the predicted likelihoods obtained from the first and the second path to the selected node 240 should be the same value. However, since the predictions made using the graph 200 are a function of an intermediate prediction, e.g., the micro pKa values predicted by the micro pKa prediction model, the predicted likelihoods of the unique paths between a reference node and a particular protonation state node in the graph 200 are not guaranteed to be the same value.

[0098] For example, the system can use the differences in the unique paths to quantify the uncertainty in the predicted likelihood of any given protonation state. In particular, the system can compute a measure of dispersion, e.g., a variance or an interquartile range, of the likelihood under different possible choices of the path between the reference node and the target node. The system can then use the determined uncertainty to provide a confidence interval for the prediction characterizing the molecule, e.g., through propagation of error techniques.

[0099] Additionally, the system can estimate the uncertainty in the prediction characterizing the molecule by bootstrap sampling the predicted micro pKa values from the ML micro pKa model and different paths to the reference node for each node in the graph 200. Since the prediction characterizing the molecule is a function of the predicted micro pKa values and the protonation state graph, bootstrapping creates an empirical distribution of outputs that represents the statistical uncertainty of the prediction characterizing the molecule. In particular, the system can generate the predicted micro pKa values using the ML micro pKa model N times, e.g., N can be 10, 100, 1000, etc., can identify M different paths from each node to the reference node according to the graph 200, and can determine the prediction characterizing the molecule using possible combinations of the predicted micro pKa values and the different paths to the reference node. As an example, the system can then calculate the standard deviation of the output distribution to get a measure of standard error, compute confidence intervals using the distribution, or calculate the interquartile range from the distribution.
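
The sampling procedure above can be sketched as follows, under several stated assumptions: repeated runs of the ML micro pKa model are imitated by adding Gaussian noise to fixed edge values, the graph is limited to states A, D, G, and H with the four edges 220, 222, 224, and 226, state A is the reference node with the fewest protons, and the prediction of interest is the likelihood of state H. All numerical values and names are hypothetical.

```python
import random
import statistics

MEAN_PKA = {"220": 6.0, "222": 7.0, "224": 8.1, "226": 7.0}  # hypothetical
PATHS_TO_H = [("222", "226"), ("220", "224")]  # A-D-H and A-G-H


def likelihood_of_h(pka, path, ph):
    """Likelihood of state H when its weight is computed along one path."""
    weights = {
        "A": 1.0,
        "D": 10.0 ** (pka["222"] - ph),
        "G": 10.0 ** (pka["220"] - ph),
        "H": 10.0 ** (sum(pka[edge] for edge in path) - 2 * ph),
    }
    return weights["H"] / sum(weights.values())


def bootstrap_h(n_draws=200, ph=7.0, noise=0.2, seed=0):
    """Combine N sampled sets of micro pKa values with each path to H."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_draws):
        # Stand-in for re-running the micro pKa model: perturb each edge.
        pka = {edge: rng.gauss(mu, noise) for edge, mu in MEAN_PKA.items()}
        for path in PATHS_TO_H:
            outputs.append(likelihood_of_h(pka, path, ph))
    return outputs


outputs = bootstrap_h()
std_error = statistics.stdev(outputs)
cut_points = statistics.quantiles(outputs, n=20)
ci_90 = (cut_points[0], cut_points[-1])  # 5th and 95th percentiles
```

The resulting list holds one output per combination of sampled micro pKa values and path, from which the standard deviation, a percentile interval, or an interquartile range can be read.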

[0100] As another example, the system can estimate the uncertainty in the prediction characterizing the molecule using an analytical function, e.g., a Taylor series approximation, of the errors in the inputs, e.g., the predicted likelihood values, that are used to calculate the prediction characterizing the molecule. As an example, the system can calculate the error due to small uncertainties in the predicted likelihood values around the mean value for each protonation state (assuming the uncertainties in the predicted likelihood values are uncorrelated) based on the function f used to determine the prediction characterizing the molecule. In particular, the system can calculate the error $\sigma_f$ of the prediction characterizing the molecule as a function of the predicted likelihood values $P_i$ and the standard deviation $\sigma_i$ of the likelihood value of each protonation state as:

$$\sigma_f = \sqrt{\sum_{i} \left( \frac{\partial f}{\partial P_i} \right)^2 \sigma_i^2}$$

[0101] For example, in the case that the predicted likelihood values are normally distributed around the mean, the system can use the z-score associated with a given confidence level, e.g., a 90%, 95%, or 99% confidence level, to calculate the confidence interval for the prediction characterizing the molecule as f ± z · σ, where z is the z-score associated with the given confidence level and σ is the error of the prediction calculated as described above. The confidence interval can provide important contextual information, e.g., regarding a range of possible outcomes, which can better inform practical downstream applications that rely on the prediction characterizing the molecule that was determined using the predicted likelihood values.
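
As a non-limiting illustration, the interval can be computed with the Python standard library. The function name and the numeric inputs are hypothetical; the sketch assumes a two-sided interval under a normal distribution.

```python
from statistics import NormalDist


def confidence_interval(f_value, sigma_f, level=0.95):
    """Two-sided normal confidence interval f ± z·sigma for a given level."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% level
    return f_value - z * sigma_f, f_value + z * sigma_f


# Toy usage with made-up values for the prediction and its error.
lo, hi = confidence_interval(1.10, 0.045, level=0.95)
```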

[0102] For example, in the case that the system generates a predicted value for each molecule in a collection of molecules and ranks the molecules, e.g., for selection for synthesis, the confidence intervals for the predicted values can better inform the selection of molecules for synthesis. In particular, the system can select molecules that satisfy both a selection criterion and a dispersion criterion. In this way, the system can provide for more robust downstream tasks by inherently quantifying the uncertainty in modeling the probability distribution of the protonation states.
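
As a non-limiting illustration, a selection criterion and a dispersion criterion can be combined as follows. The specific criteria used here (a minimum predicted value and a maximum confidence interval width), the function name, and the toy data are all assumptions made for the sketch.

```python
def select_for_synthesis(predictions, min_value, max_ci_width):
    """Select and rank molecules by predicted value.

    predictions: {name: (value, ci_lo, ci_hi)}
    Keeps molecules whose value meets the selection criterion and whose
    confidence interval is narrow enough to meet the dispersion criterion.
    """
    chosen = [
        (name, value)
        for name, (value, lo, hi) in predictions.items()
        if value >= min_value and (hi - lo) <= max_ci_width
    ]
    # Highest predicted value first.
    return [name for name, _ in sorted(chosen, key=lambda item: item[1], reverse=True)]


# Toy usage with made-up predictions.
predictions = {
    "mol_a": (0.90, 0.80, 1.00),
    "mol_b": (0.95, 0.30, 1.60),  # high value, but very uncertain
    "mol_c": (0.70, 0.65, 0.75),
    "mol_d": (0.40, 0.38, 0.42),  # confident, but low value
}
selected = select_for_synthesis(predictions, min_value=0.6, max_ci_width=0.3)
```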

[0103] FIG. 3 is a flow diagram of an example process for generating a prediction characterizing a molecule using a protonation state graph. For convenience, the process 300 will be described as being performed by a system of one or more computers located in one or more locations. For example, a molecular protonation state evaluation system, e.g., the molecular protonation state evaluation system 100 of FIG. 1, appropriately programmed in accordance with this specification, can perform the process 300.

[0104] The system can receive data identifying a molecule (step 310). As an example, the data can include a Simplified Molecular Input Line Entry System (SMILES) string, which represents the chemical structure of the molecule as a one-dimensional string, a molecular fingerprint of the molecule, or a structural data file, e.g., a Protein Data Bank (PDB) file, a structure-data file (SDF), a Chemical Markup Language (CML) file, etc.

[0105] The system can process data identifying (characterizing) the molecule to generate a respective predicted micro pKa value for each protonation site on the molecule (step 320). In particular, the system can identify each of the protonation sites on the molecule and generate the respective predicted micro pKa value of the protonation sites using a micro pKa prediction model. As an example, the system can process data characterizing the molecule using a micro pKa prediction machine learning model that has been trained to perform a micro pKa prediction task, e.g., as is described in WIPO PCT App. No. 2025 / 131,529, which is herein incorporated by reference.

[0106] As another example, the system can process data characterizing the molecule using a density functional theory (DFT) calculation, e.g., to determine the micro pKa of each protonation site on the molecule based on a difference in the Gibbs free energy of the protonation site with and without a proton.
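
As a non-limiting illustration, a deprotonation free energy can be converted to a pKa value with the standard thermodynamic relation pKa = ΔG / (RT ln 10). The sketch assumes that ΔG is the solution-phase Gibbs free energy change of deprotonation in kJ/mol and that solvation and proton reference corrections, which a practical DFT workflow would require, are already included; the function name is hypothetical.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # molar gas constant in kJ/(mol*K)


def pka_from_free_energy(delta_g_kj_per_mol, temperature_k=298.15):
    """Convert a deprotonation free energy (kJ/mol) to a pKa value."""
    return delta_g_kj_per_mol / (R_KJ_PER_MOL_K * temperature_k * math.log(10.0))


# Toy usage: a made-up free energy difference of 40 kJ/mol at 298.15 K.
pka = pka_from_free_energy(40.0)
```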

[0107] The system can generate data defining a protonation state graph for the molecule (step 330). More specifically, the protonation state graph can include a set of nodes and a set of edges. Each node in the set of nodes can represent a respective protonation state of the molecule and each edge in the set of edges can correspond to a respective protonation site on the molecule and can connect a respective pair of nodes that represent a pair of possible protonation states of the molecule that differ only in a protonation state of the corresponding protonation site on the molecule. Furthermore, each edge in the graph can be associated with the predicted micro pKa value of the corresponding protonation site.
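
As a non-limiting illustration, the graph can be represented with tuples of per-site flags as nodes and a dictionary of edges. This is a minimal sketch: the data layout, the function names, and the callable `micro_pka` are hypothetical, and enumerating every combination of sites is an assumption that is only practical for a small number of sites.

```python
from itertools import product


def build_protonation_graph(n_sites, micro_pka):
    """Build nodes and edges of a protonation state graph.

    Nodes are tuples of 0/1 flags (1 = site protonated).
    Edges map (lower_state, upper_state) -> (site_index, micro_pKa), where the
    two states differ only at site_index.
    micro_pka(state, site) is a hypothetical callable returning the predicted
    micro pKa for protonating `site` when the molecule is in `state`.
    """
    nodes = list(product((0, 1), repeat=n_sites))
    edges = {}
    for state in nodes:
        for site in range(n_sites):
            if state[site] == 0:
                upper = state[:site] + (1,) + state[site + 1:]
                edges[(state, upper)] = (site, micro_pka(state, site))
    return nodes, edges


# Toy usage: two sites with made-up, state-independent micro pKa values.
def toy_micro_pka(state, site):
    return 7.0 + site


nodes, edges = build_protonation_graph(2, toy_micro_pka)
```

For n sites this layout yields 2^n nodes and n·2^(n−1) edges, so a two-site molecule has four nodes and four edges.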

[0108] The system can then process data defining the protonation state graph to generate a prediction characterizing the molecule (step 340). For example, the system can process the graph data defining the protonation state graph for the molecule to determine a respective predicted likelihood of each of a number of protonation states of the molecule, e.g., as a proxy for a probability distribution of observing each of the protonation states, and can generate the prediction characterizing the molecule based at least in part on the predicted likelihoods of the number of protonation states of the molecule.

[0109] As an example, the system can generate a prediction for a most likely protonation state of the molecule as a protonation state with a highest predicted likelihood among the number of protonation states of the molecule. As another example, the system can generate a prediction using the predicted likelihood values as weights in an expected value calculation. Both examples will be described in more detail within the context of generating a prediction of the value of a property of the molecule with respect to FIG. 4.

[0110] In particular, the system can determine the respective predicted likelihood of each of the number of protonation states of the molecule by identifying respective paths from a reference node to the node representing each protonation state and determining the predicted likelihood of each protonation state based on a measure of the pH and the predicted micro pKa values of the edges included in the respective path. As an example, the reference node can be randomly chosen. As another example, the reference node can be a node representing a protonation state of the molecule where each protonation site on the molecule is deprotonated.

[0111] More specifically, the system can identify a number of unique paths in the protonation state graph from the reference node to a node representing a first protonation state, and can determine the predicted likelihood of the first protonation state by determining, for each of the number of unique paths, a respective predicted likelihood of the protonation state based on the predicted micro pKa values of the edges included in the path. For example, the system can determine the predicted likelihood of the protonation state as a function of a sum of the predicted micro pKa values of the edges included in the path from the reference node to the node representing the protonation state. The system can then determine the predicted likelihood of the protonation state as a measure of central tendency, e.g., a mean or median, of the predicted likelihoods corresponding to the number of unique paths.
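
As a non-limiting illustration, the path-based calculation can be sketched as follows. The specific functional form is an assumption: for a state reached from the fully deprotonated reference node by adding n protons, the population relative to the reference is taken as 10^(Σ pKa − n·pH), which follows from the acid dissociation equilibrium at each step. The function name and the edge values are hypothetical.

```python
from itertools import permutations
from statistics import fmean


def path_likelihoods(target, edges, ph):
    """Unnormalized likelihood of `target`, one value per unique path from
    the fully deprotonated reference node (all flags 0)."""
    sites = [i for i, flag in enumerate(target) if flag]
    values = []
    for order in permutations(sites):  # each protonation order is one path
        state = (0,) * len(target)
        pka_sum = 0.0
        for site in order:
            upper = state[:site] + (1,) + state[site + 1:]
            pka_sum += edges[(state, upper)]
            state = upper
        values.append(10.0 ** (pka_sum - len(order) * ph))
    return values


# Toy two-site molecule; the edge micro pKa values are made up.
edges = {
    ((0, 0), (1, 0)): 9.0,
    ((0, 0), (0, 1)): 8.0,
    ((1, 0), (1, 1)): 6.0,
    ((0, 1), (1, 1)): 7.2,
}
ph = 7.4
states = [(0, 0), (1, 0), (0, 1), (1, 1)]

# Mean over paths as the measure of central tendency, then normalize.
raw = {s: fmean(path_likelihoods(s, edges, ph)) for s in states}
total = sum(raw.values())
likelihood = {s: v / total for s, v in raw.items()}
```

In this toy example the two paths to the doubly protonated state have different pKa sums (15.0 and 15.2), so they yield different likelihood values, which is the situation the measure of central tendency is meant to handle.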

[0112] In some cases, the system can additionally determine an uncertainty in the predicted likelihoods of the protonation state corresponding to the number of unique paths. In particular, the system can determine a measure of dispersion, e.g., a variance or an interquartile range, of the predicted likelihoods corresponding to the number of unique paths to quantify the uncertainty in the predicted likelihood of the protonation state.
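
As a non-limiting illustration, the dispersion measures can be computed with the Python standard library. The function name and the input values are hypothetical, and the choice of population variance and the "inclusive" quantile method are assumptions of the sketch.

```python
from statistics import pvariance, quantiles


def path_dispersion(per_path_likelihoods):
    """Return (variance, interquartile range) of per-path likelihood values."""
    variance = pvariance(per_path_likelihoods)
    if len(per_path_likelihoods) >= 2:
        q1, _, q3 = quantiles(per_path_likelihoods, n=4, method="inclusive")
        iqr = q3 - q1
    else:
        iqr = 0.0  # a single path carries no spread information
    return variance, iqr


# Toy usage with made-up per-path likelihood values for one state.
variance, iqr = path_dispersion([1.0, 2.0, 3.0, 4.0])
```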

[0113] For example, the system can generate a respective prediction characterizing each candidate molecule in a collection of candidate molecules using the protonation state graph and can rank the collection of candidate molecules based at least in part on the predictions for the collection of candidate molecules. The system can then select one or more of the candidate molecules for physical synthesis based at least in part on the ranking, and, in some cases, can physically synthesize each of the collection of candidate molecules selected for synthesis.

[0114] FIG. 4 is a flow diagram of an example process for generating a prediction of the property of the molecule using the protonation state graph. For convenience, the process 400 will be described as being performed by a system of one or more computers located in one or more locations. For example, a molecular protonation state evaluation system, e.g., the molecular protonation state evaluation system 100 of FIG. 1, appropriately programmed in accordance with this specification, can perform the process 400.

[0115] The system can obtain data defining the protonation state graph of a molecule (step 410). For example, the system can generate the data defining the protonation state graph using steps 310-330 of FIG. 3. As another example, the system can receive the data defining the protonation state graph, e.g., from another system.

[0116] The system can determine the respective predicted likelihood of each of the number of protonation states in the protonation state graph (step 420). In particular, the system can determine the respective predicted likelihood of each of the number of protonation states of the molecule by identifying respective paths from a reference node to the node representing each protonation state, e.g., as is described with respect to FIG. 3.

[0117] The system can determine a value of the property of the molecule at each of the protonation states represented by the protonation state graph (step 430). More specifically, the property of the molecule can depend on the protonation state of the molecule. For example, the property of the molecule can be the charge of the molecule, a dipole moment of the molecule, a solubility of the molecule, or a binding affinity of the molecule for a binding target.

[0118] As an example, the system can process data characterizing each of the protonation states of the molecule using a property prediction machine learning model to generate the value of the property of the molecule when the molecule has the protonation state. In particular, the system can determine the value of the property of the molecule in accordance with trained values of a set of property prediction machine learning model parameters, e.g., the property prediction machine learning model can have been trained on a set of training examples including (i) training inputs that each characterize a protonation state of a training molecule and (ii) corresponding target property values for each of the training molecules. More specifically, the system can train the property prediction machine learning model on the set of training examples by a machine learning training technique to reduce a discrepancy between (a) the predicted property values of the training molecules generated by processing the training input of the training example using the property prediction machine learning model, and (b) the corresponding target property values of the training molecule that is specified by the training example.
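
As a non-limiting illustration, training to reduce a discrepancy between predicted and target property values can be sketched with a deliberately simple model. The linear model, the mean squared error objective, the gradient descent settings, and the toy feature (a count of protonated sites) are all assumptions made for the sketch; the specification does not limit the model or the training technique to these.

```python
def train_linear_model(examples, lr=0.05, epochs=2000):
    """Fit y = w.x + b by gradient descent on the mean squared error.

    examples: list of (features, target) pairs, where features characterize
    a protonation state of a training molecule (hypothetical encoding).
    """
    n_features = len(examples[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in examples:
            # Discrepancy between the predicted and the target property value.
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for i, xi in enumerate(x):
                grad_w[i] += 2.0 * err * xi / len(examples)
            grad_b += 2.0 * err / len(examples)
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy data: feature = number of protonated sites, target = net charge.
examples = [([0.0], -1.0), ([1.0], 0.0), ([2.0], 1.0)]
w, b = train_linear_model(examples)
```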

[0119] The system can then generate a prediction of the property of the molecule based on the respective predicted likelihoods of the protonation states for the molecule (step 440). For example, the system can determine a value of a property of the molecule for each of the protonation states of the molecule, and can generate an expected value of the property of the molecule based on the predicted likelihood of the protonation state and the value of the property of the molecule when the molecule has the protonation state.
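
As a non-limiting illustration, the expected value and the most likely protonation state can be computed from per-state likelihoods and per-state property values as follows. The function names and toy numbers are hypothetical; the sketch assumes the likelihoods are non-negative and renormalizes them so they act as weights that sum to one.

```python
def expected_property(likelihoods, values):
    """Likelihood-weighted average of a property over protonation states."""
    total = sum(likelihoods.values())
    return sum(likelihoods[s] * values[s] for s in likelihoods) / total


def most_likely_state(likelihoods):
    """Protonation state with the highest predicted likelihood."""
    return max(likelihoods, key=likelihoods.get)


# Toy usage: three states of a two-site molecule with made-up numbers.
likelihoods = {(0, 0): 0.1, (1, 0): 0.7, (1, 1): 0.2}
charge = {(0, 0): -1.0, (1, 0): 0.0, (1, 1): 1.0}

expected_charge = expected_property(likelihoods, charge)
dominant_state = most_likely_state(likelihoods)
```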

[0120] In this case, the system can generate the respective prediction of the property of the molecule using steps 410-440 for each of a collection of candidate molecules and can rank the molecules based on the respective prediction of the property of the molecule. The system can then select one or more of the candidate molecules for physical synthesis based at least in part on the ranking, and, in some cases, can physically synthesize each of the collection of candidate molecules selected for synthesis.

[0121] This specification uses the term “configured” in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.

[0122] Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.

[0123] The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

[0124] A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, subprograms, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.

[0125] In this specification the term “engine” is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.

[0126] The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.

[0127] Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

[0128] Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.

[0129] To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.

[0130] Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.

[0131] Machine learning models can be implemented and deployed using a machine learning framework, e.g., a TensorFlow framework or a JAX framework.

[0132] Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.

[0133] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.

[0134] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

[0135] Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

[0136] Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

Claims

1. A method performed by one or more computers, the method comprising: receiving data identifying a molecule; processing the data identifying the molecule to generate, for each of a plurality of protonation sites on the molecule, a respective predicted micro pKa value of the protonation site; and generating data defining a protonation state graph for the molecule, wherein: each node in the protonation state graph represents a respective protonation state of the molecule; and each edge in the protonation state graph corresponds to a respective protonation site on the molecule and connects a respective pair of nodes in the protonation state graph that represent a corresponding pair of possible protonation states of the molecule that differ only in a protonation state of the corresponding protonation site on the molecule; and each edge in the graph is associated with the predicted micro pKa value of the corresponding protonation site; and processing the data defining the protonation state graph for the molecule to generate a prediction characterizing the molecule.

2. The method of claim 1, wherein processing the data defining the protonation state graph for the molecule to generate the prediction characterizing the molecule comprises: processing the data defining the protonation state graph for the molecule to determine a respective predicted likelihood of each of a plurality of protonation states of the molecule; and generating the prediction characterizing the molecule based at least in part on the predicted likelihoods of the plurality of protonation states of the molecule.

3. The method of claim 2, wherein processing the data defining the protonation state graph for the molecule to determine the respective predicted likelihood of each of the plurality of protonation states of the molecule comprises, for each protonation state: identifying a path in the protonation state graph from a reference node to a node representing the protonation state; and determining the predicted likelihood of the protonation state based on a measure of pH and the predicted micro pKa values of edges included in the path from the reference node to the node representing the protonation state.

4. The method of claim 3, wherein identifying a path in the protonation state graph from the reference node to the node representing the protonation state comprises: identifying a plurality of unique paths in the protonation state graph from the reference node to a node representing the protonation state; and wherein determining the predicted likelihood of the protonation state comprises: determining, for each of the plurality of unique paths, a respective predicted likelihood of the protonation state based on the measure of pH and the predicted micro pKa values of edges included in the path; and determining the predicted likelihood of the protonation state as a measure of central tendency of the predicted likelihoods corresponding to the plurality of unique paths.

5. The method of claim 4, further comprising, for each protonation state, determining an uncertainty in the predicted likelihood of the protonation state based on a measure of dispersion of the predicted likelihoods of the protonation state corresponding to the plurality of unique paths.

6. The method of any one of claims 3-5, wherein the reference node is a node representing a protonation state of the molecule where each protonation site on the molecule is deprotonated.

7. The method of any one of claims 3-6, wherein determining the predicted likelihood of the protonation state based on the measure of pH and the predicted micro pKa values of edges included in the path from the reference node to the node representing the protonation state comprises: determining the predicted likelihood of the protonation state as a function of a sum of the predicted micro pKa values of the edges included in the path from the reference node to the node representing the protonation state.

8. The method of any preceding claim, wherein generating the prediction characterizing the molecule based at least in part on the predicted likelihoods of the plurality of protonation states of the molecule comprises: generating a prediction for a most likely protonation state of the molecule as a protonation state with a highest predicted likelihood among the plurality of protonation states of the molecule.

9. The method of any preceding claim, wherein generating the prediction characterizing the molecule based at least in part on the predicted likelihoods of the plurality of protonation states of the molecule comprises: determining, for each of the plurality of protonation states of the molecule, a value of a property of the molecule when the molecule has the protonation state; and generating an expected value of the property of the molecule based on, for each of the plurality of protonation states, the predicted likelihood of the protonation state and the value of the property of the molecule when the molecule has the protonation state.

10. The method of claim 9, wherein for each of the plurality of protonation states of the molecule, determining the value of the property of the molecule when the molecule has the protonation state comprises: processing data characterizing the protonation state of the molecule using a property prediction machine learning model, in accordance with trained values of a set of property prediction machine learning model parameters, to generate the value of the property of the molecule when the molecule has the protonation state.

11. The method of claim 10, wherein the property prediction machine learning model has been trained by performing operations comprising: obtaining a set of training examples, wherein each training example comprises: (i) a training input that characterizes a protonation state of a training molecule, and (ii) a target property value of the training molecule; and training the property prediction machine learning model on the set of training examples by a machine learning training technique comprising, for each training example, training the property prediction machine learning model to reduce a discrepancy between: (a) a predicted property value of the training molecule generated by processing the training input of the training example using the property prediction machine learning model, and (b) the target property value of the training molecule that is specified by the training example.

12. The method of any one of claims 9-11, wherein the property of the molecule comprises a charge of the molecule, or a dipole moment of the molecule, or a solubility of the molecule, or a binding affinity of the molecule for a binding target.

13. The method of any preceding claim, wherein processing the data identifying the molecule to generate, for each of the plurality of protonation sites on the molecule, the respective predicted micro pKa value of the protonation site comprises: processing data characterizing the molecule using a micro pKa prediction machine learning model that has been trained to perform a micro pKa prediction task; or processing data characterizing the molecule using a density functional theory (DFT) calculation.

14. A method performed by one or more computers, the method comprising: obtaining data identifying a collection of candidate molecules; generating a respective prediction characterizing each of the candidate molecules using the method of any one of claims 1-13; and ranking the collection of candidate molecules based at least in part on the respective predictions for the collection of candidate molecules.

15. The method of claim 14, further comprising selecting one or more of the candidate molecules for physical synthesis based at least in part on the ranking of the collection of candidate molecules.

16. The method of claim 15, further comprising physically synthesizing each of the collection of candidate molecules selected for physical synthesis.

17. The method of claim 14 or 15, wherein the method is a method of obtaining a drug, the method further comprising: selecting one or more of the candidate molecules for use as a drug based at least in part on the ranking of the collection of candidate molecules.

18. The method of claim 17, wherein the respective prediction characterizing each of the candidate molecules comprises one or more pharmacokinetic or pharmacodynamic properties for the candidate molecule.

19. A method performed by one or more computers for obtaining a ligand for a target molecule, wherein the ligand is a drug or a ligand of an industrial enzyme, the method comprising: obtaining data identifying a collection of candidate ligands; generating one or more predictions characterizing the target molecule and / or each of the candidate ligands using the method of any one of claims 1-13; evaluating an interaction of each candidate ligand of the candidate ligands with the target molecule based at least in part on the one or more predictions; and selecting one or more of the candidate ligands as the ligand dependent on a result of the evaluating.

20. The method of claim 19, wherein the target molecule comprises a receptor or enzyme, and wherein the ligand is an agonist or antagonist of the receptor or enzyme, or wherein the ligand comprises an antibody or aptamer and the target molecule comprises an antibody or aptamer target, in particular a virus or cancer cell protein, and wherein the antibody or aptamer binds to the antibody or aptamer target to provide a therapeutic effect.

21. The method of any one of claims 17-20, further comprising synthesizing the drug or ligand.

22. A method of claim 21, further comprising testing for biological activity of the ligand or drug in vitro or in vivo.

23. A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations of the respective method of any one of claims 1-15 or 17-20.

24. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations of the respective method of any one of claims 1-15 or 17-20.