Drug optimization by active learning

By optimizing the selection of a subset of compounds using Bayesian statistical models and acquisition functions, the high cost and inefficiency of compound synthesis and testing in drug discovery are addressed. This approach achieves multi-objective optimization in a discrete input space, improving the accuracy and efficiency of compound selection.

CN116601715B · Active · Publication Date: 2026-06-16 · Exscientia AI Limited


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: Exscientia AI Limited (엑스사이언티아에이아이리미티드)
Filing Date: 2022-02-08
Publication Date: 2026-06-16

Application Information

IPC: G16C20/50; G16C20/64; G16C20/30

CPC: G16C20/50; G16C20/30; G16C20/70; G16C20/64; G16C20/20; G06N20/00

AI Tagging

Application Domain

Chemical property prediction; Molecular entity identification

AI Technical Summary

⚠Technical Problem

In current drug discovery processes, the synthesis and testing of compounds are costly and inefficient, making it difficult to effectively optimize the biological properties of compounds to meet multiple objectives. Furthermore, traditional optimization techniques cannot adapt to discrete input spaces and high-dimensional features.

⚗Method used

A Bayesian statistical model is trained to output a probability distribution over the biological properties of the compound population. A subset of compounds is selected by optimizing an acquisition function. Piecewise linear utility functions and aggregation functions support multi-objective optimization, and multiple compounds can be selected in parallel for each design cycle.

🎯Benefits of technology

It improves the accuracy and efficiency of compound selection, reduces the design cycle and cost of drug discovery, and enables the optimization of multiple biological properties in a discrete input space to meet multiple desired properties of candidate compounds.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

A method for computational drug design through active learning is provided. The method includes defining a population of compounds, each compound having one or more structural features; defining a training set of compounds from the population for which biological properties are known; and defining a plurality of objectives, each objective defining a desired biological property. The method includes training a Bayesian statistical model using the training set of compounds to output a probability distribution approximating biological properties of compounds in the population as a function of structural features of the compounds in the population. The method includes determining a subset of compounds from the population that are not in the training set, the subset being determined according to an optimization of an acquisition function, the optimization of the acquisition function being based on the probability distribution from the trained Bayesian statistical model and on the defined objectives. The method includes selecting at least some of the compounds in the determined subset for synthesis.

Description

Technical Field

[0001] This invention relates to methods and systems for the computational design of compounds, such as drugs. In particular, this invention relates to methods for optimizing computational models through active learning for designing drugs that interact with selected target molecules. The invention also relates to drugs designed using these systems and methods.

Background Technology

[0002] Drug discovery is the process of identifying candidate compounds to advance to the next stage of drug development, such as preclinical trials. These candidate compounds need to meet certain criteria for further development. Modern drug discovery involves the identification and optimization of compounds that were "hits" in an initial screen. In particular, these compounds need to be optimized relative to desired criteria, which may include the optimization of a variety of different biological properties. Properties to be optimized may include, for example: efficacy / potency against the desired target, selectivity against undesired targets, low probability of toxicity, and favorable drug metabolism and pharmacokinetic properties (absorption, distribution, metabolism, and excretion; ADME). Only compounds that meet specific requirements can become candidates for further drug development.

[0003] The drug discovery process may involve the preparation / synthesis of a large number of compounds on the way from initial screening hits to optimized candidate compounds. The synthesized compounds are measured to determine their properties, such as biological activity. However, the number of compounds that could in principle be prepared as part of a particular drug discovery project far exceeds the number that can actually be synthesized and tested, potentially by orders of magnitude. Therefore, the results of measurements on synthesized compounds are analyzed and used to inform decisions on which compounds to synthesize next, maximizing the likelihood of obtaining compounds with further improved properties relative to the various criteria required for candidate compounds.

[0004] The synthesis of one or more compounds at a specific stage and subsequent measurements of their biological properties (such as bioactivity) are referred to as a design cycle (or iteration) of the drug discovery process. Typically, a set of compounds is synthesized and tested in each design cycle of the process, as this is more efficient than synthesizing and testing one compound at a time. However, the level of available resources usually means that there is an upper limit to the number of compounds that can be synthesized within any given design cycle.

[0005] In wet lab-based drug discovery projects, hundreds or even thousands of compounds are typically synthesized over several design cycles before a candidate compound is found. This is a lengthy, expensive, and inefficient process: synthesizing a single compound can cost thousands of pounds, and on average, it takes three to five years to obtain a candidate compound.

[0006] The use of computational methods significantly enhances the level of analysis that can be performed on synthesized compounds compared to analyses that can be performed by medicinal chemists alone. Specifically, machine learning (ML), artificial intelligence (AI), or other mathematical methods can be used to evaluate a large number of design parameters in parallel at levels exceeding human capabilities to identify relationships between parameters (e.g., structural features of the compound) and desired properties (e.g., levels of biological activity). Mathematical methods can then use these identified relationships to better predict which compounds are more likely to exhibit a greater quantity / level of the desired biological properties relative to the desired criteria for candidate compounds. This means that such mathematical methods can be used to reduce the number of design cycles, thereby reducing the number of compounds that need to be synthesized to obtain compounds that achieve the desired combination of properties required by candidate compounds, thus reducing the costs and time associated with drug discovery projects.

[0007] Therefore, the task of finding candidate compounds with a variety of desired properties can be viewed as an optimization problem, aiming to obtain the "best" compound with various desired properties using knowledge gained from previously synthesized compounds. When facing such a computational optimization problem in the context of drug discovery, several challenges need to be addressed.

[0008] One challenge is that the type of functional relationships between compounds in a group of compounds is not previously known. That is, the form of the objective function describing, for example, the relationship between the structural features of compounds and their biological properties is unknown. This means that some known optimization techniques that rely on existing knowledge of the form of the function may be unsuitable in the context of drug discovery.

[0009] Another challenge is that evaluating the objective function at points in the input space is expensive. This is because synthesizing and testing compounds—that is, evaluating them—is both time-consuming and costly. Consequently, the training set from the evaluation points of the objective function to be approximated may contain relatively few points, and significantly increasing the size of the training set in a short period may not be feasible. This can affect how efficiently a model that approximates the objective function can be trained, and thus how accurately such a model can make predictions or approximations.

[0010] A further challenge is that many known optimization techniques are designed to select a single point for evaluating an unknown function. However, as mentioned above, in drug discovery projects, for efficiency reasons, it is common practice to select multiple compounds for synthesis and testing in any given design cycle. That is, in a given iteration, multiple points need to be optimized and selected simultaneously for evaluation.

[0011] Furthermore, known optimization techniques can be used to optimize a single parameter of the objective function; that is, the optimization routine is designed to optimize a single objective. However, as mentioned above, there are usually multiple criteria that need to be used to optimize compounds in order to make them suitable candidate compounds. In other words, multiple parameters of the function need to be optimized in parallel based on the various desired biological properties of candidate compounds for the specific drug discovery project under consideration.

[0012] Finally, many optimization routines rely on the objective function having a continuous input space, allowing the use of techniques such as gradient-based methods. However, it is clear that in the context of drug discovery, the input space is discrete (where each compound represents a point in the input space), thus techniques that rely on a continuous input space cannot be utilized.

[0013] The present invention has been devised against this background.

Summary of the Invention

[0014] According to an aspect of the invention, a method for computational drug design is provided. The method includes defining a population of multiple compounds, each having one or more structural features. The method includes defining a training set of multiple compounds from the population whose properties are known. A property can be any relevant physical, chemical, or biological property of the compound, and can be considered to include biological, biochemical, chemical, biophysical, physiological, and / or pharmacological properties of the compound. The method includes defining multiple objectives, each defining a desired property. The method includes training a Bayesian statistical model using the training set of compounds to output a probability distribution approximating the properties of the compounds in the population as an objective function of the structural features of the compounds in the population. The method includes determining a subset of multiple compounds from the population that are not in the training set. This subset is determined based on the optimization of an acquisition function, the optimization being based on the probability distribution from the trained Bayesian statistical model and on the defined multiple objectives. The method may include selecting at least some compounds from the determined subset for synthesis and / or for performing (computational) molecular dynamics analysis / simulation. This selection may be performed as part of a drug design process to obtain compounds with desired properties. For convenience, throughout this disclosure, such properties of a compound may be collectively referred to as "biological properties". Therefore, as used herein, "biological properties" may include any relevant properties of a (chemical) compound, including those that may be more specifically considered to fall within or overlap with biological, biochemical, chemical, biophysical, physiological and / or pharmacological properties.

[0015] The method may include, for one or more of the objectives, mapping a preference associated with the biological property of the corresponding objective onto the probability distribution from the Bayesian statistical model by applying a corresponding utility function, thereby obtaining a preference-modified probability distribution. Optimization of the acquisition function may be based on this preference-modified probability distribution.

[0016] Preferences can indicate the priority of a given objective relative to other objectives among a plurality of objectives.

[0017] In some embodiments, for one of the biological properties of a compound, a lower uncertainty value associated with the probability distribution of that biological property may correspond to a greater preference associated with that biological property.

[0018] Preferences can be user-defined preferences, such as those defined by chemists.

[0019] One or more of the utility functions can be piecewise functions. Piecewise functions can be piecewise linear functions.
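As an illustration of paragraphs [0015] and [0019], the following minimal sketch maps predicted property values through a piecewise linear utility function. The breakpoints, the potency (pIC50) example, and the name `piecewise_linear_utility` are assumptions made for illustration and do not come from the patent text. Applying the utility to samples drawn from a compound's predicted distribution gives samples of a preference-modified distribution.

```python
from bisect import bisect_right

def piecewise_linear_utility(value, breakpoints):
    """Linearly interpolate a utility from (property_value, utility) breakpoints.

    Values outside the breakpoint range are clamped to the end utilities.
    """
    xs = [x for x, _ in breakpoints]
    ys = [u for _, u in breakpoints]
    if value <= xs[0]:
        return ys[0]
    if value >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, value)  # index of the first breakpoint strictly above value
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (value - x0) / (x1 - x0)

# Hypothetical preference: potency below 6 has no value, potency above 8 is fully satisfactory
POTENCY_BREAKPOINTS = [(6.0, 0.0), (8.0, 1.0)]

# Hypothetical samples from one compound's predicted potency distribution
samples = [5.5, 6.5, 7.0, 8.5]
utilities = [piecewise_linear_utility(s, POTENCY_BREAKPOINTS) for s in samples]
```

Adding further breakpoints would let a chemist express, for example, that values above an upper limit become undesirable again.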

[0020] In some embodiments, optimizing the acquisition function may include: evaluating the acquisition function for each compound in the population, optionally excluding compounds from the training set. A subset may be determined based on the evaluated acquisition function values.

[0021] In some embodiments, optimization of the acquisition function based on the multiple defined objectives can provide a Pareto-optimal set of compounds. One or more of the multiple compounds for the determined subset can be selected from the Pareto-optimal set. The selection from the Pareto-optimal set may be based on user-defined preferences.
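The Pareto-optimal set mentioned in paragraph [0021] can be sketched as follows. This is a minimal illustration that assumes every objective is to be maximized; the compound names and score values are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_optimal(scores):
    """Return the names of compounds whose score vectors no other compound dominates."""
    return [
        name for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    ]

# Hypothetical per-compound acquisition values for two objectives (higher is better)
scores = {
    "cpd_A": (0.9, 0.2),
    "cpd_B": (0.6, 0.6),
    "cpd_C": (0.5, 0.5),  # dominated by cpd_B
    "cpd_D": (0.1, 0.8),
}
front = pareto_optimal(scores)
```

The pairwise check is quadratic in the number of compounds, which is acceptable for a sketch; a user-defined preference could then pick among the compounds in `front`.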

[0022] The probability distribution from a Bayesian statistical model can include the probability distribution of each biological characteristic associated with each of the corresponding objectives among a plurality of objectives.

[0023] Methods may include mapping multiple probability distributions to a one-dimensional aggregated probability distribution by applying an aggregation function to multiple probability distributions from a Bayesian statistical model. Optimization of the acquisition function can be based on the aggregated probability distribution.

[0024] The aggregation function may include one or more of the following: a sum operator, a mean operator, and a product operator.
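One way to picture the aggregation in paragraphs [0023] and [0024] is to work with samples: each joint draw yields one utility value per objective, and the aggregation operator collapses that draw to a single number. The sketch below assumes this sample-based view; the names `AGGREGATORS` and `aggregate_samples` and the numbers are illustrative only.

```python
import math

AGGREGATORS = {
    "sum": sum,
    "mean": lambda values: sum(values) / len(values),
    "product": math.prod,
}

def aggregate_samples(samples_per_objective, how="product"):
    """Collapse per-objective utility samples into one-dimensional samples.

    samples_per_objective: list of equal-length lists, one per objective;
    element j of each list comes from the same joint draw j.
    """
    agg = AGGREGATORS[how]
    return [agg(draw) for draw in zip(*samples_per_objective)]

# Hypothetical utility samples (already scaled to [0, 1]) for two objectives, three draws each
potency_utility = [0.5, 1.0, 0.0]
selectivity_utility = [0.5, 0.5, 1.0]

by_product = aggregate_samples([potency_utility, selectivity_utility], "product")
by_mean = aggregate_samples([potency_utility, selectivity_utility], "mean")
```

With the product operator, a draw that fails any single objective scores zero, whereas the mean lets a strong property compensate for a weak one.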

[0025] The acquisition function can be at least one of the following: an expected improvement function, a probability of improvement function, and a confidence bound function.
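For a Gaussian posterior, the three acquisition functions named in paragraph [0025] have simple closed forms. The sketch below assumes maximization and a posterior summarized by a mean and a standard deviation; `best` denotes the best value observed so far, and the default `kappa` is an arbitrary illustrative choice.

```python
import math

def _pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probability_of_improvement(mean, std, best):
    """P(f > best) under a Gaussian posterior N(mean, std^2)."""
    if std == 0.0:
        return 1.0 if mean > best else 0.0
    return _cdf((mean - best) / std)

def expected_improvement(mean, std, best):
    """E[max(f - best, 0)] under a Gaussian posterior N(mean, std^2)."""
    if std == 0.0:
        return max(mean - best, 0.0)
    z = (mean - best) / std
    return (mean - best) * _cdf(z) + std * _pdf(z)

def upper_confidence_bound(mean, std, kappa=2.0):
    """Optimistic estimate: posterior mean plus kappa posterior standard deviations."""
    return mean + kappa * std
```

Because the input space is discrete, these functions can simply be evaluated for every untested compound and the highest-scoring compounds taken, as paragraph [0020] describes.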

[0026] The acquisition function can be a multidimensional acquisition function. In some embodiments, each dimension may correspond to a specific objective among the multiple objectives. Optionally, the multidimensional acquisition function may be an expected hypervolume improvement function.
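To make the hypervolume-based acquisition of paragraph [0026] concrete, the following sketch computes the hypervolume improvement for two objectives and estimates its expectation by averaging over posterior draws. It assumes both objectives are maximized and that a reference point bounds the region from below; the Monte Carlo estimate is an assumed simplification, not necessarily how the patented method evaluates the expectation.

```python
def hypervolume_2d(points, reference):
    """Area dominated by `points` and bounded below by `reference` (two objectives, maximization)."""
    rx, ry = reference
    area, best_y = 0.0, ry
    # Sweep from the largest first objective downward; each point adds a strip above best_y
    for x, y in sorted(points, reverse=True):
        if x > rx and y > best_y:
            area += (x - rx) * (y - best_y)
            best_y = y
    return area

def hypervolume_improvement(candidate, front, reference):
    """Gain in dominated area if `candidate` were added to the current front."""
    return hypervolume_2d(front + [candidate], reference) - hypervolume_2d(front, reference)

def expected_hvi(candidate_draws, front, reference):
    """Monte Carlo estimate: mean improvement over posterior draws for one candidate."""
    gains = [hypervolume_improvement(d, front, reference) for d in candidate_draws]
    return sum(gains) / len(gains)

# Hypothetical current front of measured compounds and a reference point
front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
reference = (0.0, 0.0)
gain = hypervolume_improvement((2.5, 2.5), front, reference)
```

A candidate whose draws mostly fall inside the already dominated region contributes little, so this acquisition favors compounds likely to extend the front.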

[0027] In some embodiments, training a Bayesian statistical model may include tuning multiple hyperparameters of the Bayesian statistical model. Optionally, tuning the hyperparameters may include applying a combination of maximum likelihood estimation and cross-validation techniques.

[0028] In some embodiments, determining the subset of multiple compounds may include: identifying a compound from the population that is not in the training set by optimizing the acquisition function based on the probability distribution from the trained Bayesian statistical model and on the multiple defined objectives. The method may include repeating the steps of: retraining the Bayesian statistical model using the training set of compounds and the one or more identified compounds; and identifying a further compound from the population that is not in the training set and is not one of the previously identified compounds, by optimizing the acquisition function based on the probability distribution from the retrained Bayesian statistical model and on the multiple defined objectives, until multiple compounds have been identified for the subset.

[0029] In some embodiments, retraining the Bayesian statistical model may include setting one or more pseudo (i.e., fake) biological property values for the one or more identified compounds in the Bayesian statistical model.

[0030] The pseudo biological property value can be set according to one of the following: the Kriging believer method; and the constant liar method.
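The sequential batch construction of paragraphs [0028] to [0030] can be sketched as below. After each pick, the chosen compound is added to a working copy of the training data with a pseudo value: its own predicted mean (Kriging believer) or a fixed constant (constant liar). The surrogate `toy_posterior` is deliberately NOT a Gaussian process; it is a small stand-in so that the example runs without dependencies. All names, the confidence-bound acquisition, and the choice of the minimum observed value as the "lie" are assumptions.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of structural-feature indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def toy_posterior(train, fingerprint):
    """Stand-in surrogate (not a GP): similarity-weighted mean, with uncertainty
    shrinking as the nearest training compound gets more similar."""
    sims = [tanimoto(fingerprint, fp) for fp, _ in train]
    total = sum(sims)
    mean = sum(s * y for s, (_, y) in zip(sims, train)) / total if total else 0.0
    return mean, 1.0 - max(sims)

def select_batch(train, pool, batch_size, strategy="believer", kappa=1.0):
    """Greedily pick `batch_size` compounds, inserting a pseudo value after each pick."""
    train = list(train)                 # working copy; real measurements stay untouched
    lie = min(y for _, y in train)      # constant-liar value (pessimistic choice, an assumption)
    chosen = []
    for _ in range(batch_size):
        scored = {}
        for name, fp in pool.items():
            if name not in chosen:
                mean, std = toy_posterior(train, fp)
                scored[name] = (mean + kappa * std, mean)
        best = max(scored, key=lambda n: scored[n][0])
        pseudo = scored[best][1] if strategy == "believer" else lie
        train.append((pool[best], pseudo))  # "retrain" as if the pseudo value had been measured
        chosen.append(best)
    return chosen

# Hypothetical data: two measured compounds and three untested ones
train = [(frozenset({1, 2, 3}), 7.0), (frozenset({4, 5}), 6.9)]
pool = {
    "cpd_X": frozenset({1, 2, 6}),
    "cpd_Y": frozenset({1, 2, 6, 7}),
    "cpd_Z": frozenset({4, 5, 8, 9}),
}
batch = select_batch(train, pool, batch_size=2, strategy="believer")
```

In this toy data, `cpd_X` has the second-highest initial score, but once `cpd_Y` is added with a pseudo value the uncertainty around the similar `cpd_X` drops, so the second pick moves to the structurally different `cpd_Z`. This is the effect that makes the approach useful for selecting several compounds per design cycle.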

[0031] In Bayesian statistical models, each compound can be represented as a bit vector, where a bit indicates the presence or absence of a corresponding structural feature in the compound.

[0032] A Bayesian statistical model can be a Gaussian process model.

[0033] The probability distribution from the trained Bayesian statistical model may include the posterior mean indicating the approximate biological characteristic values of compounds in the population. The probability distribution from the trained Bayesian statistical model may also include the posterior variance indicating the uncertainty associated with the approximate biological characteristic values in the population.

[0034] In some embodiments, one or more weighting parameters of the acquisition function may be modified according to the desired strategy of a drug discovery process or project utilizing the described computational drug design method.

[0035] The desired strategy can include a balance between exploitation and exploration strategies, with the exploitation strategy depending on the weighting parameter of the acquisition function associated with the posterior mean, and the exploration strategy depending on the weighting parameter of the acquisition function associated with the posterior variance.

[0036] The weighting parameters can be user-defined to set the desired strategy.
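The effect of the weighting parameters in paragraphs [0034] to [0036] can be shown with a confidence-bound style score, which is one assumed form of acquisition function with separate weights on the posterior mean and on the posterior uncertainty. The compound names and posterior values below are hypothetical.

```python
def weighted_score(mean, variance, w_exploit=1.0, w_explore=1.0):
    """Acquisition value as a weighted sum of posterior mean and posterior standard deviation."""
    return w_exploit * mean + w_explore * variance ** 0.5

# Hypothetical posterior (mean, variance) for two untested compounds
posterior = {"well_characterised": (7.0, 0.01), "novel_scaffold": (6.0, 1.0)}

def top_pick(w_exploit, w_explore):
    """Return the compound with the highest weighted score."""
    return max(posterior, key=lambda c: weighted_score(*posterior[c], w_exploit, w_explore))

exploit_pick = top_pick(w_exploit=1.0, w_explore=0.1)  # favors a high predicted value
explore_pick = top_pick(w_exploit=1.0, w_explore=2.0)  # favors high uncertainty
```

Raising the exploration weight shifts the choice toward compounds the model knows little about, which may suit early design cycles; lowering it favors compounds already predicted to perform well.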

[0037] Bayesian statistical models can use kernels that indicate the similarity between pairs of compounds in a population to approximate the biological characteristics of compounds.

[0038] The kernel can be a Tanimoto-like kernel.
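Paragraphs [0031] to [0033] and [0037] to [0038] together describe a Gaussian process over bit-vector fingerprints with a similarity kernel. The sketch below is a dependency-free illustration of the standard Gaussian process posterior equations using a Tanimoto kernel. The zero prior mean, the small noise term, and all function names are assumptions; a practical implementation would typically use a numerical library and a Cholesky factorization.

```python
def tanimoto_kernel(a, b):
    """Similarity of two bit vectors: shared on-bits divided by on-bits in either."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 1.0

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x

def gp_posterior(train_x, train_y, query, noise=1e-6):
    """Posterior mean and variance at `query` for a zero-mean GP (zero prior mean is an assumption)."""
    n = len(train_x)
    K = [[tanimoto_kernel(train_x[i], train_x[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [tanimoto_kernel(query, x) for x in train_x]
    mean = sum(k * a for k, a in zip(k_star, solve(K, train_y)))
    variance = tanimoto_kernel(query, query) - sum(
        k * v for k, v in zip(k_star, solve(K, k_star)))
    return mean, max(variance, 0.0)

# Hypothetical 4-bit fingerprints with centered property values
train_x = [(1, 1, 0, 0), (0, 0, 1, 1)]
train_y = [1.0, -1.0]
seen_mean, seen_var = gp_posterior(train_x, train_y, (1, 1, 0, 0))
new_mean, new_var = gp_posterior(train_x, train_y, (1, 0, 0, 0))
```

For a compound already in the training set the posterior variance is close to zero, while a compound sharing only some features with the training data gets an intermediate mean and a larger variance.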

[0039] The method may include: synthesizing at least some of the selected compounds from the determined subset to determine the biological properties of the selected compounds.

[0040] The method may include adding the synthesized compound to the training set to obtain an updated training set.

[0041] The method may include: training an updated Bayesian statistical model using the updated training set of compounds to output a probability distribution approximating the objective function; determining a new subset of multiple compounds from the population that are not in the updated training set, the new subset being determined based on the optimization of the acquisition function, the optimization of which depends on the approximated biological properties from the updated Bayesian statistical model and on the multiple defined objectives; and selecting at least some compounds from the determined new subset for synthesis.

[0042] This method may include synthesizing selected compounds from a determined new subset to determine the biological properties of the selected compounds.

[0043] This method may include updating the training set by adding the synthesized compound to the training set.

[0044] The method may include iteratively performing the following steps: training an updated Bayesian statistical model using the updated training set of compounds to output a probability distribution approximating the objective function; determining a new subset of multiple compounds from the population that are not in the updated training set, the new subset being determined based on the optimization of the acquisition function, the optimization of which depends on the approximated biological properties from the updated Bayesian statistical model and on the multiple defined objectives; selecting at least some compounds from the determined new subset for synthesis; synthesizing the selected compounds from the determined new subset to determine the biological properties of the selected compounds; and adding the synthesized compounds to the training set to obtain an updated training set, until a stopping condition is met.

[0045] The stopping condition may include at least one of the following: one or more synthesized compounds achieve multiple objectives, one or more synthesized compounds are within acceptable thresholds for the respective multiple objectives, and the maximum number of iterations has been performed.
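The iterative procedure of paragraphs [0044] and [0045] can be outlined as a loop with explicit stopping conditions. In this sketch, `select_batch`, `measure`, and `meets_objectives` are hypothetical caller-supplied callables: the first stands for model training plus acquisition optimization, the second for synthesis and assay. The toy selector used in the example simply takes the next two compounds alphabetically and is not a Bayesian model.

```python
def run_design_cycles(train, pool, select_batch, measure, meets_objectives, max_iterations=10):
    """Repeat select -> synthesize/test -> update until a stopping condition is met.

    train: dict mapping compound -> measured property; pool: set of untested compounds.
    Returns the compounds that met the objectives and the number of iterations performed.
    """
    train, pool = dict(train), set(pool)
    iteration = 0
    while iteration < max_iterations and pool:      # stop: iteration cap or empty pool
        iteration += 1
        batch = select_batch(train, pool)           # model training + acquisition optimization
        for compound in batch:
            train[compound] = measure(compound)     # stands in for synthesis and testing
            pool.discard(compound)
        hits = [c for c in batch if meets_objectives(train[c])]
        if hits:
            return hits, iteration                  # stop: a compound meets the objectives
    return [], iteration

# Toy stand-ins (assumptions): a lookup table plays the role of the assay
true_values = {"c1": 5.0, "c2": 6.0, "c3": 7.5, "c4": 8.2, "c5": 6.1}
hits, n_iterations = run_design_cycles(
    train={"c1": 5.0},
    pool={"c2", "c3", "c4", "c5"},
    select_batch=lambda train, pool: sorted(pool)[:2],
    measure=true_values.__getitem__,
    meets_objectives=lambda value: value >= 8.0,
)
```

Replacing the toy selector with a routine that trains the surrogate model and optimizes the acquisition function would give the active learning loop described in the text.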

[0046] In some embodiments, the synthesized compound that achieves the multiple objectives, or is within an acceptable threshold of the respective objectives, may be a candidate drug or therapeutic molecule having desired biological, biochemical, physiological and / or pharmacological activity against a predetermined target molecule.

[0047] The predetermined target molecule can be a therapeutic, diagnostic, or experimental assay target in vitro and / or in vivo.

[0048] Candidate drugs or therapeutic molecules can be used in medicine; for example, in methods for treating animals (such as humans or non-human animals).

[0049] Each objective can be user-defined, for example, by chemists defining the expected criteria that candidate compounds should meet.

[0050] In some embodiments, each objective includes at least one of the following: a desired value of the corresponding biological property, a desired range of values of the corresponding biological property, and a requirement that the value of the corresponding biological property be maximized or minimized.

[0051] The number of compounds in the selected subset can be user-defined, for example based on the level of resources available for testing compounds in each design cycle or iteration of a drug design project.

[0052] The structural features of each of the multiple compounds in the population can correspond to fragments present in that compound.

[0053] Fragments present in each of the multiple compounds can be represented as molecular fingerprints. Optionally, the molecular fingerprint is an extended connectivity fingerprint (ECFP), which may be ECFP0, ECFP2, ECFP4, ECFP6, ECFP8, ECFP10, or ECFP12.

[0054] Biological characteristics may include one or more of the following: activity, selectivity, toxicity, absorption, distribution, metabolism, and excretion.

[0055] According to another aspect of the present invention, a compound identified by the above method is provided.

[0056] According to another aspect of the present invention, a non-transitory computer-readable storage medium is provided that stores instructions which, when executed by a computer processor, cause the computer processor to perform the above-described method.

[0057] According to another aspect of the invention, a computational device for computational drug design is provided. The computational device includes an input terminal arranged to receive data indicating a population of multiple compounds, each compound having one or more structural features. The input terminal is arranged to receive data indicating a training set of compounds from the population whose biological properties are known. The input terminal is arranged to receive data indicating multiple objectives, each objective defining a desired biological property. The computational device includes a processor arranged to train a Bayesian statistical model using the training set of compounds to provide a probability distribution approximating the biological properties of the compounds in the population as an objective function of the structural features of the compounds in the population. The processor is arranged to determine a subset of multiple compounds from the population that are not in the training set, the subset being determined based on the optimization of an acquisition function, the optimization being based on the probability distribution from the trained Bayesian statistical model and on the defined multiple objectives. The computational device includes an output terminal arranged to output the determined subset. Optionally, the computational device is arranged to select at least some compounds from the determined subset for synthesis and / or for performing (computational) molecular dynamics analysis / simulation. Alternatively, this selection can be made by a user. Optionally, the computational device is configured to perform the molecular dynamics analysis / simulation.

Brief Description of the Drawings

[0058] Examples of the invention will now be described with reference to the accompanying drawings, in which:

[0059] Figure 1 shows a Gaussian process model approximation of a defined function;

[0060] Figure 2 shows how a Gaussian process model and an acquisition function can be used to optimize an objective function as part of an iterative process;

[0061] Figure 3 shows an example of a piecewise linear function;

[0062] Figure 4 schematically illustrates the application of one or more utility functions and / or aggregation functions to a multidimensional posterior probability distribution output from a Gaussian process model trained using a population of compounds;

[0063] Figure 5 shows the steps of a computational drug design method according to an example of the present invention;

[0064] Figures 6(a)-6(c) show graphs comparing known and predicted values of the bioactivity of a test set of molecules; in particular, Figure 6(a) compares the known values with the values predicted by the method of Figure 5; Figure 6(b) compares the known values with the values predicted by a prior art method; and Figure 6(c) compares the values predicted by the prior art method with the values predicted by the method of Figure 5;

[0065] Figures 7(a)-7(b) show graphs comparing known and predicted values of the biological activity of the test set of molecules of Figures 6(a)-6(c), where a variance threshold is set in the method of Figure 5; specifically, Figure 7(a) compares the known values with the values predicted by the method of Figure 5; and Figure 7(b) compares the known values with the values predicted by the prior art method;

[0066] Figure 8 shows graphs of how the mean squared error (MSE) and variance of the method of Figure 5 change with model certainty for the test set of Figures 6(a)-6(c);

[0067] Figure 9 shows the steps of a method used to benchmark the method of Figure 5;

[0068] Figure 10(a) shows a graph of the distribution of bioactivity values, for a specific activity parameter, of molecules in a test set, and indicates the training set of molecules from the test set used to perform the method of Figure 5, the selection set of molecules selected from the test set by the method of Figure 5, and the remaining (unknown) set of molecules in the test set that are in neither the training set nor the selection set; Figure 10(b) shows a graph of the distribution of bioactivity values of the molecules in the training set and selection set of Figure 10(a);

[0069] Figure 11(a) shows a graph of the distribution of bioactivity values of the molecules in the test set of Figure 10(a), for a different activity parameter from that of Figure 10(a), and indicates the training set of molecules from the test set used to perform the method of Figure 5, the selection set of molecules selected from the test set by the method of Figure 5, and the remaining set of molecules in the test set that are in neither the training set nor the selection set; Figure 11(b) shows a graph of the distribution of bioactivity values of the molecules in the training set and selection set of Figure 11(a);

[0070] Figure 12 shows a graph of the values of the activity parameters of Figures 10(a)-10(b) and Figures 11(a)-11(b) for molecules in the test set, indicating which molecules were selected by the method of Figure 5;

[0071] Figure 13 shows a graph of the distribution of relative binding free energies of molecules in a test set, and indicates the training set of molecules from the test set used to perform the method of Figure 5, the selection set of molecules selected from the test set by the method of Figure 5, and the remaining (unknown) set of molecules in the test set that are in neither the training set nor the selection set;

[0072] Figure 14(a) shows a graph of how the cumulative relative binding free energy of the selection set of molecules from the test set of Figure 13 changes over successive iterations of the method of Figure 5, compared with an optimal selection set and a random selection set; Figure 14(b) shows, after 30 iterations of the method of Figure 5 based on minimizing the relative binding free energy, the percentage of the selected molecules of Figure 14(a) that are in the top x of the molecules in the test set; and,

[0073] Figure 15(a) shows a graph corresponding to that of Figure 14(a), except that it shows results for a selection set obtained using a random forest model with greedy selection rather than a selection set obtained by the method of Figure 5; and Figure 15(b) shows, after 30 iterations of the random forest model based on minimizing the relative binding free energy, the percentage of the selected molecules in Figure 14(a) that are in the top x of the molecules in the test set.

Detailed Description

[0074] Molecular or drug design can be viewed as a multidimensional optimization problem that uses cycles of hypothesis generation and experiment to expand knowledge. Each compound design can be considered a hypothesis that is tested by experiment. Experimental results are expressed as structure-activity relationships (SARs), which build up a landscape of hypotheses about which chemical structures are likely to have the desired properties. The drug design process is also an optimization problem because each project begins with a target product profile (i.e., an objective function) specifying the desired properties. However, even when the objective is described accurately, finding the optimal solution remains an expensive and difficult challenge. A particular difficulty with this type of problem is efficiently constructing a landscape of hypotheses spanning a vast space of feasible solutions from a relatively limited knowledge base of experimental results.

[0075] Drug discovery typically takes place in iterations known as design cycles. In each iteration, a set of molecules or compounds is synthesized and their biological properties are measured. Activities are analyzed, and a new set of compounds is proposed based on what has been learned from previous iterations. This process is repeated until clinical candidates are found. In addition to activity, the measured biological properties may include one or more of selectivity, toxicity, affinity, absorption, distribution, metabolism, and excretion.

[0076] At any particular stage of the process, a set of compounds with known biological activity has been synthesized or prepared. The aim of the process is to find one or more optimal compounds from a large pool of synthesizable compounds, although there are only the resources and/or time to synthesize a subset of the compounds in that pool.

[0077] Automated or computational drug design processes use mathematical models (e.g., machine learning (ML) models) to predict or hypothesize which compounds in a population of available compounds are the best, for example, those that maximize (or minimize) a particular / desired biological activity.

[0078] Active learning is a special case of machine learning, in which the learning algorithm can interactively query the user or other information sources to label new data points with the desired output. One use case for this technique is when there is abundant unlabeled data, but manual labeling is expensive; this is a common scenario in drug discovery.

[0079] The ML model is trained using available structure-activity relationships from experimental results (i.e., from those compounds that have been synthesized and tested in the population). The strategy or method of using an ML model to select compounds with the highest predicted activity (or other desired target properties) from a population of possible compounds for synthesis is called "exploitation." The exploitation strategy can be viewed as a phase of the process. Various mathematical methods can be used to provide ML models for exploitation. These mathematical methods include, for example, support vector machine algorithms, neural networks, and decision trees.

[0080] The exploitation approach will only be successful if the predictions of the ML model are accurate enough (i.e., if the ML model is trained well enough). Each compound synthesized and tested from the population is added to the training set of compounds used to train the ML model. The number of molecules or compounds added to the training set at a given iteration is typically limited by resources. That is, the number of compounds in the subset synthesized in each iteration is usually capped at a specified maximum.

[0081] An ML model will be sufficiently accurate in its predictions only if a sufficient number of compounds are present in the training set. Therefore, a number of iterations or design cycles may be necessary before the ML model is fully trained, where, for example, a maximum number of compounds are added to the training set at each iteration.

[0082] Furthermore, the predictive power of an ML model will only be sufficiently accurate if the compounds in the training set adequately represent the total population of compounds that can be selected for synthesis. Therefore, it is important to include the compounds that will most contribute to improving the ML model (i.e., those that are most representative) in the subset to be synthesized at any given iteration before the ML model is adequately trained. The selection of compounds for synthesis based on this is called "exploration." Several methods are known for selecting compounds for synthesis as part of an exploration strategy, such as techniques based on distance metrics between compounds in the population or on the diversity of chemical structures among the compounds in the population. The exploration strategy can be viewed as a learning or training phase of the process.

[0083] Therefore, exploitation and exploration strategies are in competition when selecting a subset of compounds for synthesis in a specific iteration of the drug discovery process. In practice, the appropriate choice of strategy may change depending on the specific stage of the drug discovery process. For example, in the early stages of a drug discovery project, it is unlikely that a sufficiently well-trained model has been established. Therefore, at this stage, an exploration strategy may be the most suitable, as the reward for exploration is ultimately a better-trained, and thus more accurate, model. At this stage, an exploitation strategy will not make full use of limited resources, as exploitation is not a particularly good strategy for increasing the representativeness of the training set. On the other hand, if the ML model has been sufficiently well trained, for example, in the later stages of a drug discovery project, exploitation would be the appropriate strategy, as the subset of compounds selected by the model for synthesis is more likely to be optimal relative to desired characteristics (e.g., high levels of biological activity). At this stage, an exploration strategy will not make full use of limited resources, as exploration is not the optimal strategy for selecting compounds that may possess the desired characteristics.

[0084] As stated above, the ML model used for an exploitation strategy will (most likely) make accurate predictions only if: a sufficient number of compounds exist in the set used to train the ML model; and the compounds in this training set are sufficiently representative of the pool from which compounds are selected for synthesis. The first of these implies that a certain number of design cycles may be needed to obtain a sufficient number of synthesized compounds (unless data relating to a sufficient number of previously synthesized compounds is available). The second of these implies that, for the initial design cycles in the early stages of a drug discovery project, it may not be desirable to decide which compounds to include in the set to be synthesized using an exploitation-only ML model. This is because such an ML model will predict which compounds are highly active based on a model that has not yet been trained to a sufficient level, meaning the predictions are unlikely to be accurate. Furthermore, compounds synthesized based on such predictions will not be useful for improving the ML model for subsequent design cycles, as the ML model predictions focus further on relationships / information already identified from the training set of compounds. In particular, predictions from an exploitation-only ML model do not help suggest which compounds to synthesize to improve the accuracy of the ML model in the next design cycle.

[0085] To reduce the time and cost associated with drug discovery projects, the number of iterations or design cycles required to discover candidate or optimal compounds with desired properties should be minimized. Therefore, it is crucial that a sufficiently well-trained model be built as quickly as possible to predict compounds with desired properties, i.e., requiring as few compounds as possible in the training set. Consequently, it is important to select the most representative compounds for synthesis early in the project, to minimize the number of iterations devoted (at least to some degree) to exploration, as candidate compounds are unlikely to emerge from iterations employing this strategy.

[0086] This invention is advantageous because it provides an improved computational drug design approach for designing and using machine learning models to identify candidate compounds from a population of compounds as part of a drug discovery process. In particular, the invention advantageously provides a machine learning model that can combine and perform both exploitation and exploration strategies, either individually or in parallel. The invention advantageously allows for the parallel optimization and selection of multiple compounds for synthesis within a given design cycle of a drug discovery project, and advantageously allows for the optimization of compounds relative to multiple design objectives that define various desired biological properties of candidate compounds. The invention also provides a more flexible method for incorporating various preferences (e.g., chemist preferences) regarding the objectives to be achieved or optimized by candidate compounds in a particular drug discovery project, and / or regarding the differences between compounds that each satisfy various objectives, when selecting which compounds to synthesize.

[0087] According to the present invention, a step in the computational drug design method is to define a population of multiple compounds or molecules. Specifically, this population is a set of compounds that can be selected for synthesis during a particular drug discovery project. The population can be defined or obtained in any suitable manner, for example, via known computational methods and / or using human input. For example, the population can be a set of compounds obtained from a generative or evolutionary design algorithm. In particular, an evolutionary design algorithm can generate multiple novel compounds based on an initial set of one or more known compounds (e.g., existing drugs) that possess at least some of the desired properties of optimal compounds for the particular project using this method. Alternatively, multiple novel compounds can be generated in any suitable manner. Those generated novel compounds that have at least some of the desired characteristics can be retained for further analysis. In one example, the compounds possessing at least some of the desired characteristics for the specific project at hand can be retained by applying known methods to reduce an initial set of compounds (e.g., including millions of compounds). One or more filters can be applied to the retained compounds to remove any undesirable compounds. Such a filter can be defined according to any suitable criteria for separating desired compounds from undesirable compounds. For example, a useful filter may be adapted to remove duplicate compounds. Another filter may be adapted to remove compounds with a certain level of toxicity. The filtered set of compounds can then form the population from which compounds are selected for synthesis.

[0088] The population can include any suitable number of compounds. Typically, the population will include more compounds, and possibly significantly more, than the number that can be synthesized as part of a particular drug discovery project, for example because of limits on available resources. However, the population typically does not include so many compounds that computational analysis of the population according to the invention is infeasible. For example, the number of compounds in the population can typically be on the order of hundreds or thousands, but it will be understood that for any given project, the population can be larger or smaller than that.

[0089] Each compound in the population comprises multiple structural features that combine to form its chemical structure. Such structural features can be represented in any suitable manner. For example, one way to describe the structure of a compound or molecule is via a fingerprint. In particular, the fingerprint of a particular compound can be represented as a mathematical object (e.g., a list of bits or integers) that reflects which specific structural features or substructures (fragments) are present or absent in the compound.

[0090] Several different categories of fingerprints exist, such as topological fingerprints, structural fingerprints, and circular fingerprints. A common type of circular fingerprint is the extended-connectivity fingerprint (ECFP). Several ECFP variants are known, such as ECFP0, ECFP2, ECFP4, ECFP6, ECFP8, ECFP10, and ECFP12. As is known in the art, determining the fingerprint of a compound typically involves assigning an identifier to each atom in the compound, updating these identifiers based on neighboring atoms, removing duplicates, and then forming a vector from the list of identifiers.
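For illustration only, the final step described above (removing duplicates and forming a fixed-length vector) might be sketched as follows. This is not an ECFP implementation: the atom-identifier assignment and neighbor updates are omitted, and the function name `fold_fingerprint`, the use of a CRC32 hash, and the fragment labels are all assumptions made for this sketch.

```python
import zlib

def fold_fingerprint(substructure_ids, n_bits=2048):
    """Fold substructure identifiers into a fixed-length list of 0/1 bits."""
    bits = [0] * n_bits
    for identifier in set(substructure_ids):  # remove duplicate identifiers
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        position = zlib.crc32(identifier.encode("utf-8")) % n_bits
        bits[position] = 1  # bit set means "feature present"
    return bits

# Hypothetical fragment labels, used only to exercise the function.
fp = fold_fingerprint(["c1ccccc1", "C(=O)O", "C(=O)O", "CN"], n_bits=64)
on_bits = [i for i, b in enumerate(fp) if b]
```

Distinct identifiers can collide on the same bit when the vector is short, which is one reason lengths in the thousands of bits are common.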

[0091] The next step in this computational drug design approach is to define a training set of compounds from the population. The training set includes those compounds from the population whose biological properties are known. That is, the training set includes those compounds from the population that have been synthesized and tested experimentally to determine certain biological properties (e.g., biological activity). Therefore, the number of compounds in the training set increases as the drug discovery project progresses (i.e., with more iterations or design cycles). At the start of a drug design project, there may be relatively few compounds in the training set. For example, the training set may include compounds whose biological properties are previously known (e.g., compounds that have been previously tested as part of different projects), as well as compounds that possess at least some of the expected properties of the optimal compound for the specific project under consideration.

[0092] Note that in order to execute the computational design method of the present invention, the training set needs to include at least some compounds. Therefore, if at the start of a drug design project no compounds in the defined population have been synthesized and tested—that is, the biological characteristics of the population are unknown—the training set can be populated in any suitable manner as an initial step before training and executing the ML method according to the present invention (described below). For example, compounds synthesized to provide the initial training set can be selected based on different techniques (e.g., known exploration strategies, or simply random selection from the population).

[0093] The next step in this computational drug design approach is to define multiple objectives, each defining a desired biological characteristic. That is, the multiple objectives outline the desired biological characteristics that candidate compounds for a particular drug design project will exhibit. These objectives can be based on a variety of biological characteristics exhibited by the compound, such as one or more of bioactivity, selectivity, toxicity, absorption, distribution, metabolism, and excretion. Each objective can be defined in any suitable manner relative to a particular biological characteristic. For example, an objective can simply be to maximize or minimize a particular biological characteristic. Alternatively, an objective can be to achieve a specific desired value for a particular biological characteristic, or it can allow a range of desired values for a particular biological characteristic to be acceptable among candidate compounds, or it can limit the value of a particular biological characteristic to be greater than or less than a certain threshold. One or more objectives can be defined for any given biological characteristic. For purely illustrative purposes, an example of an ideal molecule or compound profile for a certain drug discovery project can be represented by objectives such as: activity against a primary target X that is as high as possible, lipophilicity (log P) between 2 and 6, and activity against an undesirable target Y (pIC50) strictly below 5.
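By way of a hedged sketch, the illustrative profile above could be encoded as a small data structure. The class name `Objective`, its field names, and the candidate values are assumptions introduced here for illustration; the source does not prescribe any particular representation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Objective:
    name: str
    kind: str                      # "maximize", "minimize", "range", "below" or "above"
    low: Optional[float] = None
    high: Optional[float] = None

    def is_satisfied(self, value: float) -> Optional[bool]:
        """True/False for constraint-type objectives, None for maximize/minimize."""
        if self.kind == "range":
            return self.low <= value <= self.high
        if self.kind == "below":
            return value < self.high   # strict upper threshold
        if self.kind == "above":
            return value > self.low    # strict lower threshold
        return None  # maximize / minimize have no pass/fail threshold

# The illustrative profile: maximize activity on X, 2 <= log P <= 6, pIC50 on Y < 5.
profile = [
    Objective("activity_target_X", "maximize"),
    Objective("logP", "range", low=2.0, high=6.0),
    Objective("pIC50_target_Y", "below", high=5.0),
]

# Made-up property values for one hypothetical candidate compound.
candidate = {"activity_target_X": 7.8, "logP": 3.1, "pIC50_target_Y": 4.2}
checks = {o.name: o.is_satisfied(candidate[o.name]) for o in profile}
```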

[0094] The (ultimate) goal of the ML model used as part of the computational design method described is to suggest or predict one or more compounds from a population that satisfy a defined objective. The next step in the computational drug design method is to train this ML model using a defined training set of compounds. Specifically, the ML model is a Bayesian statistical model whose output is a probability distribution approximating the biological characteristics of compounds in the population as an objective function of the structural features of the compounds in the population.

[0095] Bayesian optimization is a useful method for optimizing functions whose form is unknown (i.e., "black-box functions") and that are expensive to evaluate at points in the input space. Therefore, Bayesian optimization can be considered a useful method in computational drug discovery. This is because the form of the functional relationship across the compounds in a population is not known in advance, and also because the synthesis and testing of compounds, i.e., evaluation, can be both time-consuming and expensive.

[0096] Bayesian optimization is a class of machine learning-based optimization methods that focus on maximizing / minimizing an objective function within a feasible set or search space. It is common to make several further general assumptions about problems to which Bayesian optimization is applied. For example, the input space is typically not of very high dimension, the objective function is usually a continuous function, a global maximum / minimum is sought, and evaluating the function provides no gradient information, thus preventing the use of derivative-based optimization methods such as gradient descent or Newton's method. In the context of drug discovery, it is clear that not all of these general assumptions apply. For example, Bayesian optimization for drug discovery will model on a discrete rather than a continuous space, where each discrete point represents a compound from a population. Moreover, problems in the context of drug discovery may have relatively high-dimensional input spaces. In particular, each dimension of the input space may represent a specific structural feature or fragment present or absent in a given compound, and the representation of compounds in the model may include thousands of different such structural features, encoded as present or absent in each case. Therefore, it is clear that some standard Bayesian optimization techniques may not be directly applicable to computational methods in the present context of drug discovery, and appropriate modifications may be necessary. This will be described in more detail below.

[0097] Bayesian optimization uses a Bayesian statistical model, or surrogate, to model the objective function. In this case, the objective function describes the relationship between the biological properties of compounds in a population and the structural features of those compounds. The Bayesian statistical model provides a Bayesian posterior probability distribution, which describes the potential values of the objective function at a given point (e.g., a candidate point for evaluation). The posterior probability distribution is updated each time the objective function is evaluated / observed at one or more new points. That is, each time a compound from the population is synthesized and tested to determine its biological properties, the model approximating the relationship between biological properties and structural features can then be updated using that compound.

[0098] When Bayesian optimization is applied to a problem, the model used generates a measure of uncertainty, that is, a way of quantifying how certain the model is of its own predictions. The Bayesian statistical model can be a Gaussian process model, which includes this measure of uncertainty. A Gaussian process is a stochastic process (i.e., a set of random variables indexed by time or space) such that each finite set of those random variables has a multivariate normal distribution. That is, each finite linear combination of the random variables is normally distributed. In general terms, Gaussian process models assume that all data, whether in the training set or not, is generated from the same Gaussian process, and this is often a good approximation.

[0099] Gaussian process regression is a Bayesian statistical method for modeling functions. Whenever there is an unknown quantity in Bayesian statistics (e.g., a vector of values of the objective function at a finite set of input points), it is assumed that the unknown was drawn at random by nature from some prior probability distribution (or simply "prior"). Gaussian process regression assumes that this prior distribution is multivariate normal, with a specific mean vector and covariance matrix.

[0100] A mean vector can be constructed by evaluating the mean function at each input point. One option is to set the mean function to a constant value; however, other suitable forms for the mean function (e.g., polynomial functions) are possible when the objective function is considered to have some structure for a particular application. A covariance matrix can be constructed by evaluating the covariance function, or kernel, at each pair of points. That is, when predicting the value at unseen points (i.e., points that have not yet been evaluated and whose function values are therefore unknown), the model uses a measure of similarity between points, provided by a kernel function. The kernel can be chosen such that points that are closer together in the input space have a greater positive correlation. This encodes the belief that their function values should be more similar than those of a pair of points that are farther apart in the input space. Therefore, training points (i.e., points that have been evaluated and whose function values are known) in the neighborhood of an unseen point are more important in predicting that unseen point than training points that are not in the neighborhood.

[0101] For example, suppose multiple points in the input space have been observed, and the goal is to predict the value of the objective function at new points. Gaussian process regression can be used to determine the prior distribution, and then, given the observed points, Bayes' rule (as is known in the art) can be used to compute the conditional distribution of the objective function at the new points. This conditional distribution is called the posterior probability distribution in Bayesian statistics. The posterior mean can be a weighted mean between the prior and the estimate based on known data (i.e., the evaluation point or the observation point), where the weights depend on the kernel. The posterior variance (i.e., the uncertainty) can be equal to the prior covariance minus the term corresponding to the variance removed by observing the function at the aforementioned points.
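For reference, the posterior described above is commonly written in the following textbook form. The notation is an assumption introduced here and does not appear in the source: X denotes the observed points with values y, m the mean function, k the kernel, K = k(X, X), and \sigma_n^2 an optional observation-noise variance.

```latex
\begin{aligned}
\mu_{\text{post}}(x) &= m(x) + k(x, X)\,\bigl[K + \sigma_n^{2} I\bigr]^{-1}\bigl(y - m(X)\bigr) \\
\sigma_{\text{post}}^{2}(x) &= k(x, x) - k(x, X)\,\bigl[K + \sigma_n^{2} I\bigr]^{-1} k(X, x)
\end{aligned}
```

The first line is the prior mean corrected by a kernel-weighted combination of the observed residuals; the second is the prior variance minus the variance removed by the observations.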

[0102] A simple example of implementing the above method is now provided for illustrative purposes. Consider the function f(x) = x sin(x), and assume that six training points are fed to a Gaussian process model using a radial basis function kernel. Then, the model's predictions are generated on the interval [0, 10]. Figure 1 shows the observation (training) points, the function f(x), the mean of the predictions, and the 95% confidence interval (twice the standard deviation, i.e., a measure of uncertainty). It can be seen that the uncertainty associated with predictions further from the observation points is greater than the uncertainty associated with predictions closer to the observation points.
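A minimal NumPy sketch of an example of this kind follows. The six training inputs, the kernel length scale and amplitude, the zero prior mean, and the jitter term are all assumptions chosen for this sketch; the source does not state the values used to produce Figure 1.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, amplitude=1.0):
    """Radial basis function kernel between 1-D point arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return amplitude**2 * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_train, y_train, x_test, jitter=1e-8, **kernel_args):
    """Posterior mean and standard deviation of a zero-mean Gaussian process."""
    k_tt = rbf_kernel(x_train, x_train, **kernel_args) + jitter * np.eye(len(x_train))
    k_st = rbf_kernel(x_test, x_train, **kernel_args)
    prior_var = np.full(len(x_test), kernel_args.get("amplitude", 1.0) ** 2)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_st @ alpha
    v = np.linalg.solve(chol, k_st.T)
    # Prior variance minus the variance removed by the observations.
    var = np.clip(prior_var - np.sum(v**2, axis=0), 0.0, None)
    return mean, np.sqrt(var)

x_train = np.array([1.0, 3.0, 5.0, 6.0, 7.0, 8.0])  # assumed training inputs
y_train = x_train * np.sin(x_train)                  # f(x) = x sin(x)
x_test = np.linspace(0.0, 10.0, 101)                 # prediction grid on [0, 10]

mean, std = gp_posterior(x_train, y_train, x_test, length_scale=1.0, amplitude=5.0)
lower, upper = mean - 2.0 * std, mean + 2.0 * std    # approximate 95% band
```

The band collapses at the training inputs and widens away from them, for example towards x = 0 and x = 10, where no observations are nearby.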

[0103] As mentioned above, kernels typically possess the property that the closer points in the input space are to each other, the stronger their correlation (i.e., the more similar they are). However, a kernel needs to define how to measure the degree to which a pair of points in the input space are "close together." Typically, a kernel is a function of Euclidean distance. However, such kernels do not handle input points with high dimensionality very well. For example, a kernel based on Euclidean distance measurements works adequately well with input spaces up to tens of dimensions (e.g., 20 dimensions). However, as mentioned above, for analysis as part of an ML model, molecules or compounds can be encoded / represented as bit vectors of thousands of bits in length (e.g., 2048-bit fingerprints), where each bit indicates the presence or absence of a specific structural feature or fragment in the compound. That is, in this context, the input space can be considered to be thousands of dimensions. For example, for a 2048-bit fingerprint, each fingerprint can be considered as a vertex in a 2048-dimensional unit cube. While a kernel based on Euclidean distance can be used in this context, it cannot accurately reflect the differences between points in the input space (i.e., compounds in the defined population) because, according to the measurement of Euclidean distance, many of the points will be equally far from all other points.

[0104] In the context of this invention, it may be advantageous to use Tanimoto similarity, instead of Euclidean distance, as the basis for the kernel of a Gaussian process model. Tanimoto similarity, or the Tanimoto coefficient, is a measure of the similarity and diversity of sample sets, and can be defined as the size of the intersection of the sample sets divided by the size of their union. The Tanimoto coefficient is used in cheminformatics to determine the similarity between fingerprints. Advantageously, a kernel based on the Tanimoto coefficient will not suffer from the problems described above, which are experienced by Euclidean distance-based kernels in high-dimensional applications (such as drug discovery). This is because Tanimoto similarity can be viewed as cosine similarity, and therefore can be considered a measure of angle rather than distance (as in the case of Euclidean distance-based kernels).
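The intersection-over-union definition can be sketched in a few lines for bit-vector fingerprints. The function names and the tiny six-bit fingerprints are illustrative assumptions, as is the convention of returning 1.0 when both fingerprints are empty.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two equal-length sequences of 0/1 bits."""
    both = sum(1 for a, b in zip(bits_a, bits_b) if a and b)    # intersection size
    either = sum(1 for a, b in zip(bits_a, bits_b) if a or b)   # union size
    return both / either if either else 1.0  # two empty fingerprints: treated as identical

def tanimoto_kernel_matrix(fingerprints):
    """Pairwise similarity matrix, usable as a covariance (kernel) matrix."""
    return [[tanimoto(fa, fb) for fb in fingerprints] for fa in fingerprints]

fp_a = [1, 1, 1, 0, 0, 0]
fp_b = [0, 1, 1, 1, 0, 0]   # shares two of four set bits with fp_a
fp_c = [0, 0, 0, 0, 1, 1]   # shares nothing with fp_a

K = tanimoto_kernel_matrix([fp_a, fp_b, fp_c])
```

Here `tanimoto(fp_a, fp_b)` is 2 / 4 = 0.5 and `tanimoto(fp_a, fp_c)` is 0, regardless of how many dimensions the fingerprints have.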

[0105] Bayesian optimization models also include parameters of the prior distribution, called hyperparameters. Specifically, the mean function and kernel of the prior distribution comprise hyperparameters. The selection / optimization of these hyperparameters is crucial because their effect is often significant at typical sample sizes. In the context of drug discovery, standard approaches to selecting hyperparameters for Bayesian statistical models may not be appropriate or optimal. One reason is the typically limited amount of training data in drug discovery. That is, the training set usually consists of a relatively small number of compounds. Adding many, or indeed any, additional compounds to the training set is not necessarily feasible, as this would require relatively expensive and time-consuming synthesis and testing of compounds that have not yet been sampled. Another reason why some standard approaches to selecting model hyperparameters may be unsuitable in the context of drug discovery is the so-called "activity cliff." That is, it can be relatively common to find a pair of molecules with very similar or nearly identical chemical structures that exhibit relatively large differences in their respective activities. This significant difference in activity may result from a relatively small number of key atoms being added to or removed from the chemical structure. This phenomenon clearly requires careful attention in models predicting structure-activity relationships between compounds.

[0106] One way to select the hyperparameters of a Bayesian statistical model is by using the (Type II) maximum likelihood estimation (MLE) method. Specifically, given a set of observations of the objective function (i.e., a training set of compounds with known biological properties in the current context), the likelihood of these observations under the prior is computed (this likelihood depends on the hyperparameters). The likelihood is a multivariate normal density, and the hyperparameters are then set to the values that maximize it. Gradient descent methods can be used to obtain the hyperparameters that maximize the likelihood of the observations under the prior.

[0107] In the context of drug discovery, using Type II MLE to select hyperparameters can lead to the model shifting to a low length scale due to insufficient training data. This means that known points can influence the prediction of new points to a greater extent than expected or optimal. Such approaches can also result in high levels of noise in the model and may cause it to overfit the training data. Both of these are problems when attempting to use the model on unknown regions of chemical space where the training data is sparse or nonexistent. Therefore, more robust hyperparameter optimization methods are needed to scale and automate the training of Bayesian statistical models for drug discovery without manual inspection for the problems described.
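For reference, the quantity maximized in Type II MLE for a Gaussian process is usually written as the log marginal likelihood below. The notation is an assumption introduced here: \theta collects the hyperparameters, K_\theta is the covariance matrix of the n training points (including any noise term), and m_\theta is the prior mean vector.

```latex
\log p(y \mid X, \theta)
  = -\tfrac{1}{2}\,(y - m_\theta)^{\top} K_\theta^{-1} (y - m_\theta)
    \;-\; \tfrac{1}{2}\,\log\lvert K_\theta\rvert
    \;-\; \tfrac{n}{2}\,\log 2\pi,
\qquad
\hat{\theta} = \arg\max_{\theta}\; \log p(y \mid X, \theta)
```

The first term rewards fit to the data and the second penalizes model complexity; with very few training points the balance between the two is poorly constrained, which is consistent with the difficulties noted above.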

[0108] Another way to select hyperparameters is to use cross-validation. The general approach here is to split or partition the training set into multiple subsets; train the model using all but one of the subsets; and then test the model using the remaining (test) subset. This process is then repeated with each of the subsets in turn serving as the test subset. This can be considered a more robust way to train ML models because it optimizes for the model's ability to generalize. However, cross-validation tends to be relatively computationally expensive and slower than, for example, Type II MLE. In the context of drug discovery, where a relatively large number of hyperparameters need to be optimized (due to the high dimensionality of the input data), pure cross-validation would be extremely expensive in terms of computational cost.

[0109] In embodiments of the present invention, training a Bayesian statistical model may include tuning or training the model's hyperparameters by applying a combination of maximum likelihood estimation and cross-validation techniques. Combining these two methods or techniques enables improved hyperparameter training with relatively low computational cost.

[0110] In one approach, this combined method can be viewed as similar in some ways to the "early stopping" technique. Early stopping is a machine learning technique in which the model is trained incrementally via gradient descent. At each step, or every few steps, the model's performance is evaluated on a held-out dataset called the validation set. If the performance has deteriorated since the last evaluation, training stops to avoid overfitting the training data. However, most models can only be meaningfully evaluated on validation data that the model has never seen. This means that in practice, it is necessary to train the model using less data than is actually available (in order to prevent overfitting).

[0111] For Bayesian statistical (Gaussian process) models in the context of drug discovery (i.e., operating on molecular data), the following approach may be useful. It may be helpful to start with relatively high priors on the initial hyperparameters, and on those of the model's hyperparameters that account for the noise in the data. This is to ensure that the activity cliffs (mentioned above) in the molecular data do not introduce numerical errors or a poor fit. Then, standard gradient descent steps of the maximum likelihood estimation method can be performed on the entire training set (i.e., all compounds with known biological properties) via the model (e.g., utilizing the Tanimoto kernel). Cross-validation steps can then be performed every few gradient descent steps, where the number of steps performed between cross-validations can be chosen as needed. This is possible because of a specific property of Gaussian process models, namely that the covariance matrix used to compute predictions depends only on the hyperparameters and the training data. Therefore, the covariance matrix obtained by removing a few rows and columns is the same as the covariance matrix obtained by first removing the corresponding few data points from the training set. This means that, for a given model with a covariance matrix, a set number of rows and columns (e.g., 10 or any other suitable number) can be hidden to give a model that has the same hyperparameters and all of the training points except those corresponding to the hidden rows and columns. This smaller model can then be validated by making predictions on the hidden points to obtain a specific metric of interest (e.g., "R-squared" for regression). Furthermore, if this process is performed k-fold (where k is the number of subsets into which the training data is split), i.e., hiding the first 1 / k of the data and making predictions on it, then hiding the second 1 / k of the data and making predictions on it, and so on, a more accurate estimate of the model's generalization ability is obtained, while, crucially, the entire training set is used for gradient descent. Because small training sets are standard in drug design, it is not feasible to set aside a fraction of the compounds in the training set (e.g., 10 out of 50, or any other suitable number) solely to ensure that the model does not overfit. Tuning a Gaussian process model in the manner described above avoids this problem. Another advantage is that the model validation adds almost no computational cost.
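The row-and-column hiding step can be sketched as follows, assuming a Tanimoto kernel, a constant mean function, a fixed noise term, and synthetic data; all names and values here are illustrative assumptions, and the interleaved gradient descent steps on the hyperparameters are omitted. The covariance matrix is computed once on the full training set, and each fold is validated using only sub-blocks of it.

```python
import numpy as np

def tanimoto_matrix(fps):
    """Pairwise Tanimoto similarity for a 2-D array of 0/1 fingerprints."""
    fps = np.asarray(fps, dtype=float)
    inter = fps @ fps.T
    counts = fps.sum(axis=1)
    union = counts[:, None] + counts[None, :] - inter
    return np.where(union > 0, inter / np.where(union > 0, union, 1.0), 1.0)

def kfold_r2_from_covariance(K, y, noise, k=5):
    """k-fold R-squared computed from sub-blocks of one precomputed covariance matrix."""
    n = len(y)
    preds = np.empty(n)
    for hidden in np.array_split(np.arange(n), k):   # hide each 1/k of the data in turn
        kept = np.setdiff1d(np.arange(n), hidden)
        K_kept = K[np.ix_(kept, kept)] + noise * np.eye(len(kept))
        K_cross = K[np.ix_(hidden, kept)]
        centre = y[kept].mean()                      # constant mean function (assumed)
        preds[hidden] = centre + K_cross @ np.linalg.solve(K_kept, y[kept] - centre)
    ss_res = np.sum((y - preds) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot, preds

rng = np.random.default_rng(0)
fps = rng.integers(0, 2, size=(30, 64))                  # synthetic fingerprints
y = fps[:, :8].sum(axis=1) + 0.1 * rng.normal(size=30)   # synthetic "activity"

K = tanimoto_matrix(fps)                                 # built once, on all 30 points
r2, preds = kfold_r2_from_covariance(K, y, noise=0.1, k=5)
```

No kernel entries are recomputed inside the loop, so the validation cost is limited to solving a small linear system per fold.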

[0112] In Bayesian optimization, once a Bayesian statistical model (e.g., a Gaussian process model) has been trained on the training set to model the objective function, an acquisition function is used to determine which points in the input space should be evaluated, sampled, or observed next. In particular, the acquisition function shifts the problem from finding the global maximum of an objective function that is expensive to evaluate to finding the global maximum of a function that is continuous, differentiable, and fast to compute. The acquisition function can be viewed as a mapping from distributions and states to real values. The distribution can be a normal distribution, and the states can include values such as the maximum function value obtained so far, the remaining budget of points available for evaluation, etc.

[0113] The acquisition function uses the output from the Bayesian statistical model (specifically, the predicted mean and variance of the posterior probability distribution) to guide the search across the input space. Using the acquisition function with a Bayesian statistical model allows a trade-off between exploitation and exploration to be incorporated into the use of the predictions provided by the ML model. This is because the predictions include both mean and variance values. Exploitation of the current model is achieved by focusing on regions of the input space with high means while penalizing higher variance values. Exploration of the input space, on the other hand, is achieved by focusing on regions with high variance values, biasing the search towards unexplored areas of the input space with few (if any) observations. The acquisition function has tuning parameters that can be set according to the desired balance or trade-off between exploitation and exploration at a particular design cycle or iteration.

[0114] One type of acquisition function is the expected improvement function. It selects, as the next point for evaluation, the point in the input space that has the highest expected improvement relative to the current highest value of the function in the training set. Another type is the probability of improvement function, which selects the point in the input space that has the highest probability of showing an improvement over the current highest value of the function in the training set. Yet another type is the lower confidence bound or upper confidence bound function, which selects the next point by reference to the posterior mean and the current variance or standard deviation around it. For example, the lower confidence bound acquisition function might consider a curve that is two standard deviations below the posterior mean at each point, and then minimize this lower confidence bound envelope of the objective function model to determine the next sample point. As mentioned above, the expression for each of these acquisition functions includes weighting or tuning parameters that can be set according to the desired balance between exploitation and exploration when selecting the next point to observe. The acquisition function can depend on the mean and variance of the posterior distribution. A weighting parameter on the posterior mean term can be used to set the desired level of exploitation, and a weighting parameter on the posterior variance term (relative to the weighting on the mean) can be used to set the desired level of exploration. These weighting parameters can be user-defined to set the desired strategy.
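The acquisition functions named above have well-known closed forms when the posterior at a point is normal. The following is a minimal sketch under that assumption, for a maximization problem; the parameter names `xi` and `kappa` are illustrative choices for the tuning parameters, not names taken from the described method.

```python
import math

def _pdf(z):
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best, xi=0.0):
    """Expected improvement over `best`; `xi` shifts the balance towards exploration."""
    gain = mu - best - xi
    if sigma <= 0.0:
        return max(gain, 0.0)
    z = gain / sigma
    return gain * _cdf(z) + sigma * _pdf(z)

def probability_of_improvement(mu, sigma, best, xi=0.0):
    """Probability that the value at this point exceeds `best` by more than `xi`."""
    gain = mu - best - xi
    if sigma <= 0.0:
        return 1.0 if gain > 0.0 else 0.0
    return _cdf(gain / sigma)

def upper_confidence_bound(mu, sigma, kappa=2.0):
    """Mean plus `kappa` standard deviations (to be maximized)."""
    return mu + kappa * sigma

def lower_confidence_bound(mu, sigma, kappa=2.0):
    """Mean minus `kappa` standard deviations (to be minimized)."""
    return mu - kappa * sigma
```

For two candidates with the same predicted mean, all three maximization scores favor the one with the larger standard deviation, which is the exploration behavior described above; increasing `kappa` or `xi` strengthens that preference.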

[0115] Figure 2 shows an example of how sampled points can be used to fit a surrogate function (such as a Gaussian process model) in order to optimize an objective function. In each iteration of the process, the acquisition function is optimized to select the next point for sampling or evaluation. Because more sampled points are available at each subsequent iteration, the surrogate function becomes more accurate, and the selected next sampling point becomes more likely to maximize the objective function.

[0116] Bayesian optimization techniques are typically used to select a single point at which an unknown objective function is evaluated. However, as mentioned above, in drug discovery projects it is common practice, for efficiency reasons, to select multiple compounds for synthesis and testing in any given design cycle. That is, multiple points need to be selected for evaluation at a given iteration. Therefore, in a step of the computational drug design method, a subset of multiple compounds from the population that are not in the training set is identified or selected. In particular, the subset is determined by optimizing the acquisition function, which is based on the probability distribution from the trained Bayesian statistical model and the multiple defined objectives. That is, the method automatically selects multiple compounds to be sampled in a given iteration or design cycle. For example, the number of compounds to be included in the subset can be defined by the user, based on the level of resources available for synthesizing and testing compounds in a given design cycle. The size of the subset can be the same for each iteration (i.e., each execution of the computational drug design method), or it can be changed as needed between iterations.

[0117] To determine the subset, the Bayesian statistical model can be trained, and the acquisition function optimized, repeatedly so as to select one compound at a time until the desired number of compounds for the subset has been selected. Specifically, after the Bayesian statistical model has been trained on the training set, a compound from the population that is not in the training set can be identified by optimizing the acquisition function based on the probability distribution from the trained Bayesian statistical model and the multiple defined objectives. This first-selected compound needs to be taken into account when the optimization is repeated to find a second compound for the subset. However, since the biological properties of the first-selected compound are unknown, a pseudo-label or fake label can be applied as a proxy for its biological properties. With the pseudo-label in place, the prediction variance around the identified compound is reduced. The method can then include retraining the Bayesian statistical model using the pseudo-label of the first-selected compound (together with the training set of compounds), and then identifying a second compound for the subset, from the population not in the training set, by optimizing the acquisition function based on the probability distribution from the retrained Bayesian statistical model and the multiple defined objectives. Similarly, a pseudo-label can be given to the second-selected compound, allowing for further retraining of the Bayesian statistical model. In general, the method may include repeating the following steps: retraining the Bayesian statistical model using the training set of compounds and the one or more compounds identified so far for the subset; and optimizing the acquisition function, based on the probability distribution from the retrained Bayesian statistical model and the multiple defined objectives, to identify a further compound for the subset. These steps may be repeated until the desired number of compounds has been identified for the subset.

[0118] The pseudo-label or fake label, i.e., the biological property value, for each identified compound in the subset can be set or determined in any suitable manner. For example, a pseudo-label can be set according to the Kriging believer method, which sets the fake value based on the value of the biological property predicted by the Bayesian statistical model, optionally modified by upper and lower bounds to reflect a degree of optimism or pessimism regarding the prediction. Alternatively, a pseudo-label can be set according to the constant liar method, in which the relevant value or label is set to a constant regardless of the point. For example, the mean of the model could be such a suitable constant.
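The sequential selection with pseudo-labels can be sketched as below. This is a simplified illustration under several assumptions that are not taken from the described method: a single objective, a zero-mean Gaussian process with a Tanimoto kernel and fixed hyperparameters (so "retraining" only adds the pseudo-labeled point), an upper confidence bound acquisition function, non-empty binary fingerprints, and NumPy. All function and parameter names are hypothetical.

```python
import numpy as np

def tanimoto(A, B):
    """Tanimoto similarity between rows of two binary fingerprint matrices."""
    inter = A @ B.T
    union = A.sum(1)[:, None] + B.sum(1)[None, :] - inter
    return inter / np.maximum(union, 1e-12)

def gp_posterior(X_train, y_train, X_cand, noise_var=0.1):
    """Posterior mean and standard deviation of a zero-mean GP at X_cand."""
    K = tanimoto(X_train, X_train) + noise_var * np.eye(len(X_train))
    Ks = tanimoto(X_cand, X_train)
    mean = Ks @ np.linalg.solve(K, y_train)
    # Prior variance is 1 because tanimoto(x, x) = 1 for non-empty fingerprints.
    var = 1.0 - np.einsum("ij,ji->i", Ks, np.linalg.solve(K, Ks.T))
    return mean, np.sqrt(np.maximum(var, 0.0))

def select_batch(X_train, y_train, X_pool, batch_size, kappa=2.0):
    """Pick `batch_size` pool indices one at a time using Kriging believer labels."""
    X_aug, y_aug = X_train.copy(), y_train.copy()
    remaining = list(range(len(X_pool)))
    chosen = []
    for _ in range(batch_size):
        mean, std = gp_posterior(X_aug, y_aug, X_pool[remaining])
        j = int(np.argmax(mean + kappa * std))   # optimize the acquisition function
        idx = remaining.pop(j)
        chosen.append(idx)
        # Kriging believer: use the predicted mean as the pseudo-label.
        X_aug = np.vstack([X_aug, X_pool[idx]])
        y_aug = np.append(y_aug, mean[j])
    return chosen
```

Because the pseudo-label equals the model's own prediction, adding it leaves the predicted mean at the selected compound unchanged while shrinking the variance around it, which steers the next selection away from near-duplicates. A constant liar variant would replace `mean[j]` with a fixed constant.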

[0119] Methods other than the sequential selection with pseudo-labels described above can also be used. For example, a batch of compounds can be selected using a multi-point expected improvement (q-EI) method. In this method, the expected improvement over the current best solution is calculated conditioned on a set of points (rather than a single point). An appropriate approximation for the discrete space then allows this multi-point acquisition function to be used in a multi-point decision strategy.
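One common way to evaluate a multi-point expected improvement for a given candidate set is by Monte Carlo sampling from the joint posterior. The sketch below assumes a joint normal posterior over the q candidate points and a maximization objective; it is an illustrative estimator rather than the approximation used in the described method, and the function name and parameters are hypothetical.

```python
import numpy as np

def q_expected_improvement(mean, cov, best, n_samples=100_000, seed=0):
    """Monte Carlo estimate of q-EI for q jointly normal candidate points.

    mean: length-q posterior mean vector; cov: q x q posterior covariance;
    best: current best observed value (maximization).
    """
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mean, cov, size=n_samples)
    # Improvement of the best point in the batch over `best`, floored at zero.
    improvement = np.maximum(samples.max(axis=1) - best, 0.0)
    return float(improvement.mean())
```

In a discrete compound population, candidate batches could be scored with such an estimator and the highest-scoring batch retained, for example by greedy construction.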

[0120] Many Bayesian optimization techniques can typically be used to optimize a single parameter of a function, i.e., a single objective. However, as mentioned above, there are usually multiple criteria that need to be used to optimize a compound to be a suitable candidate. That is, multiple parameters of the function need to be optimized based on the various desired properties of candidate compounds for the specific drug discovery project under consideration; i.e., optimization aims to achieve multiple objectives in parallel. The objectives will often also be conflicting. Furthermore, in the context of drug discovery, the preference for objectives is not monotonic (unlike in some other applications).

[0121] Therefore, the probability distribution from the Bayesian statistical model can be a multidimensional distribution. Specifically, this multidimensional distribution can include a (one-dimensional) distribution for each biological property associated with each of the multiple objectives. One option for optimizing these multiple distributions in parallel with respect to their respective objectives is to use a multidimensional acquisition function, each dimension of which can correspond to a respective objective. For example, the multidimensional acquisition function could be an expected hypervolume improvement function.

[0122] Another option for optimizing multiple objectives across different dimensions is to transform the problem into a one-dimensional problem. Specifically, the multi-objective optimization problem can be simplified using one or more aggregation functions. Such aggregation functions take as input the mean and variance from each dimension of the Bayesian statistical model (i.e., each biological property and its corresponding objective). The output is then a one-dimensional distribution with a mean and a variance. That is, the uncertainty in the model's predictions is carried through the aggregation function for use by the acquisition function. Furthermore, the input to the aggregation function can be easily extended to any desired number of dimensions. Advantageously, optimization can then be performed using a one-dimensional acquisition function, which is generally simpler to execute. For example, such an acquisition function could be an expected improvement function, a probability of improvement function, or a confidence bound function, as described above. Statistical independence between each pair of dimensions can be assumed when applying the aggregation function. Aggregation functions can include one or more of sum, mean, geometric mean, and product functions or operators (each of which can be weighted to express a preference for individual components), for example using one or more of the following results.

[0123] For any random variables X, Y:

[0124]

[0125] For any independent random variables X, Y:

[0126]

[0127]

[0128]

[0129] These results can be generalized to N variables, and to scalar multiplication using basic expectation and variance properties.
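As an illustration of how aggregation can carry mean and variance through sums and products, the sketch below uses the standard identities E[X+Y] = E[X] + E[Y] (any X, Y) and, for independent X and Y, Var(X+Y) = Var(X) + Var(Y), E[XY] = E[X]E[Y], and Var(XY) = (Var(X) + E[X]^2)(Var(Y) + E[Y]^2) - E[X]^2 E[Y]^2. That these are the results referred to above is an assumption, since the equations are not reproduced in this text; the function names are hypothetical.

```python
def sum_moments(mean_x, var_x, mean_y, var_y):
    """Mean and variance of X + Y for independent X and Y."""
    return mean_x + mean_y, var_x + var_y

def product_moments(mean_x, var_x, mean_y, var_y):
    """Mean and variance of X * Y for independent X and Y."""
    mean = mean_x * mean_y
    var = (var_x + mean_x ** 2) * (var_y + mean_y ** 2) - mean ** 2
    return mean, var

def weighted_sum_moments(means, variances, weights):
    """Mean and variance of sum_i w_i * X_i for mutually independent X_i.

    Uses E[cX] = c E[X] and Var(cX) = c^2 Var(X) for scalar multiplication.
    """
    mean = sum(w * m for w, m in zip(weights, means))
    var = sum(w * w * v for w, v in zip(weights, variances))
    return mean, var
```

For example, a weighted mean of two objectives is `weighted_sum_moments(means, variances, [0.5, 0.5])`. The product of two normal variables is not itself normal, so treating the aggregated result as a normal distribution with the returned mean and variance is an approximation.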

[0130] In the case of general functions and correlated inputs, where the above results may not hold, Monte Carlo sampling techniques can be used, for example. Specifically, after empirically determining the correlation between the inputs and drawing samples from the resulting multivariate distribution, the aggregation function can be evaluated on these samples. The mean and standard deviation can then be derived from the results. The one-dimensional result of the aggregation can then be fed into a one-dimensional acquisition function.
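A minimal sketch of this Monte Carlo aggregation, assuming the joint prediction is a multivariate normal distribution whose covariance matrix has already been estimated, and using NumPy. The function name, the `aggregate` callback convention (it receives an array of shape `(n_samples, n_objectives)`), and the default sample count are hypothetical.

```python
import numpy as np

def mc_aggregate(means, cov, aggregate, n_samples=100_000, seed=0):
    """Estimate mean and standard deviation of aggregate(X) for X ~ N(means, cov)."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(means, cov, size=n_samples)
    values = aggregate(samples)          # one aggregated value per sample
    return float(values.mean()), float(values.std(ddof=1))
```

For instance, `mc_aggregate(means, cov, lambda s: s.prod(axis=1))` estimates the moments of a product aggregation without assuming independence between the objectives.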

[0131] Since different objectives among the multiple objectives of an optimization problem can conflict with each other (i.e., optimizing for one objective may adversely affect another), optimization based on an acquisition function defined for multiple objectives can yield a Pareto-optimal set of compounds. It is then necessary to select one or more of these compounds for inclusion in the determined subset. This can be done in any suitable manner (e.g., based on user-defined preferences or desirability).
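For illustration, a Pareto-optimal (non-dominated) set can be identified from per-compound objective scores as follows. This sketch assumes every objective is to be maximized and that scores are given as equal-length tuples; the function name is hypothetical.

```python
def pareto_front(points):
    """Return indices of points not dominated by any other point (all objectives maximized)."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            all(q_k >= p_k for q_k, p_k in zip(q, p))
            and any(q_k > p_k for q_k, p_k in zip(q, p))
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front
```

A point is dominated when another point is at least as good on every objective and strictly better on at least one. The compounds on the returned front still have to be ranked against each other, for example by user-defined preferences.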

[0132] One way to handle conflicting objectives and to break ties between compounds in multi-objective optimization is to encode preferences into the optimization. This can be achieved by applying a utility function to the posterior probability distribution associated with each objective. Where the user has a preference ordering over a set of choices, a utility function can be used to encode that preference by assigning a real number to each of the alternatives. Thus, for each of one or more objectives, the method may include applying a corresponding utility function to the probability distribution from the Bayesian statistical model, thereby mapping the preferences associated with the biological property of that objective (which may be user-defined preferences) onto a preference-modified probability distribution. Optimization of the acquisition function can then be based on this preference-modified probability distribution. Crucially, the uncertainty associated with the predictions from the model is propagated through to the acquisition function: the utility function (like the aforementioned aggregation function) is advantageous because the uncertainty is preserved in its output.

[0133] In some cases, defined preferences can indicate the priority of a given objective relative to other objectives among a plurality of objectives. For example, satisfying one objective may be more critical than another in order to obtain a candidate compound.

[0134] Preferences can also be introduced based on specific predictions from the model. For example, preferences can be encoded to favor predictions about which the model is more certain. That is, for a biological property of one of the compounds, the lower the uncertainty associated with the probability distribution of that biological property, the greater the preference associated with it may be. In this way, the uncertainty of the model predictions is used not only as an output of the utility function (to be used by the acquisition function) but also as an input. As an illustrative example only, suppose the multiple objectives are to optimize several activity objectives together with a lipophilicity (log P) that must lie strictly between 0 and 2 (where any value between 0 and 2 is equally desirable). Consider the case where the Bayesian statistical model predicts, for two compounds X and Y, the same activities and the same mean log P, but log P standard deviations of 0.5 and 3, respectively. In this case, compound X is the preferred compound because its lipophilicity is more likely to lie within the desired range of 0 to 2. If prediction uncertainty were not considered, the mean utility function value would be the same for both compounds, meaning that the method could not distinguish between X and Y even though the user would have a clear preference.
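The log P example can be made concrete by computing the probability that each compound's lipophilicity lies between 0 and 2 under a normal prediction. The text does not give the shared mean, so the value 1.0 used below is a hypothetical choice for illustration only.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prob_in_range(mu, sigma, low, high):
    """Probability that a normal variable falls between low and high."""
    return normal_cdf(high, mu, sigma) - normal_cdf(low, mu, sigma)

LOGP_MEAN = 1.0  # hypothetical shared mean prediction; not given in the text

p_x = prob_in_range(LOGP_MEAN, 0.5, 0.0, 2.0)  # compound X, std 0.5
p_y = prob_in_range(LOGP_MEAN, 3.0, 0.0, 2.0)  # compound Y, std 3
print(f"P(0 < log P < 2): X = {p_x:.3f}, Y = {p_y:.3f}")
```

With this assumed mean, compound X falls in the desired range with probability of about 0.95 and compound Y with probability of about 0.26, so a utility that takes the prediction uncertainty into account separates two compounds that a mean-only utility would score identically.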

[0135] In practice, in ranking preferences within a set of choices, those choices that are close together in the ranking tend to have similar preference levels. Furthermore, when the choices are real numbers, the utility function can be continuous. The utility function of this method can be advantageously modeled as a piecewise function, and in particular a piecewise linear function. That is, when plotted, the function consists of line segments defined as follows:

[0136]

[0137] Where [(a_0, b_0), (a_1, b_1), …, (a_N, b_N)] are the N+1 linear functions, and [x_0, x_1, …, x_{N-1}] are the points between consecutive lines. Figure 3 shows an example of a piecewise linear function that can be used as part of the described method to encode the degree of preference for predictions for different compounds.

[0138] Piecewise linear functions can be used in conjunction with a normal distribution. In this computational drug design approach, a Bayesian statistical model provides a prediction as a normal distribution, which can then be passed to a piecewise linear utility function. As mentioned above, the uncertainty in the normal distribution needs to be preserved through a utility function (which is subsequently used by the acquisition function). Given the prediction as a normal distribution and the utility function outlined above, the mean and standard deviation can be determined. The following results are used to determine these values.

[0139] Let X be a normally distributed random variable. The probability density function (pdf) of X is:

[0140]

[0141] For any random variable X with pdf p_X and any function f:

[0142]

[0143] The error function erf is defined as follows:

[0144]

[0145] For any normal distribution X with mean μ and standard deviation σ, its cumulative density function (cdf) is:

[0146]

[0147] The standard deviation of X can be written using the expected value:

[0148]

[0149] Expected value

[0150] Based on the above, the expression for the expected value of f(X) is:

[0151]

[0152]

[0153] Where x_{-1} = -∞ and x_N = ∞.

[0154] For any a, b, μ, and σ ≠ 0:

[0155]

[0156] Substituting this result into the above expression gives:

[0157]

[0158] Standard deviation

[0159] For any a, b, μ, and σ ≠ 0:

[0160]

[0161] From the above:

[0162]

[0163] Where x_{-1} = -∞ and x_N = ∞. Rearranging:

[0164]

[0165] Taking the square root above provides an expression for σ(f(X)).

[0166] Note that the last term is the square of the expected value expression calculated above.
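The equations of this derivation are not reproduced in the present text. The following is a sketch of a standard derivation that is consistent with the surrounding statements, under the assumptions that X is normal with mean μ and standard deviation σ and that segment i of the utility function is f(x) = a_i x + b_i on [x_{i-1}, x_i]; the exact notation of the original equations may differ.

```latex
\begin{align}
p_X(x) &= \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \\
\mathbb{E}[f(X)] &= \int_{-\infty}^{\infty} f(x)\, p_X(x)\, dx, \\
\operatorname{erf}(z) &= \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\, dt, \\
F_X(x) &= \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right], \\
\sigma(X) &= \sqrt{\mathbb{E}[X^2] - \mathbb{E}[X]^2}, \\
\mathbb{E}[f(X)] &= \sum_{i=0}^{N} \int_{x_{i-1}}^{x_i} (a_i x + b_i)\, p_X(x)\, dx, \\
\int_{l}^{u} (a x + b)\, p_X(x)\, dx &= (a\mu + b)\,[F_X(u) - F_X(l)] - a\sigma^2\,[p_X(u) - p_X(l)], \\
\int_{l}^{u} (a x + b)^2\, p_X(x)\, dx &= \left[(a\mu+b)^2 + a^2\sigma^2\right][F_X(u) - F_X(l)]
  - 2a(a\mu+b)\sigma^2\,[p_X(u) - p_X(l)] \nonumber \\
  &\quad - a^2\sigma^2\,[(u-\mu)\,p_X(u) - (l-\mu)\,p_X(l)], \\
\sigma(f(X))^2 &= \sum_{i=0}^{N} \int_{x_{i-1}}^{x_i} (a_i x + b_i)^2\, p_X(x)\, dx - \left(\mathbb{E}[f(X)]\right)^2 .
\end{align}
```

At the infinite limits x_{-1} = -∞ and x_N = ∞, the terms p_X(x) and (x - μ) p_X(x) vanish, F_X(-∞) = 0, and F_X(∞) = 1.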

[0167] An analytical solution has thus been found for computing the mean and uncertainty after applying a piecewise linear desirability function. Crucially, the equations can be vectorized, i.e., they still hold when X is an N-dimensional vector (N normally distributed variables rather than just one). Importantly, vectorized operations (e.g., addition, multiplication, exponentiation, etc.) benefit from hardware acceleration, making them computationally very fast.
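A minimal scalar sketch of such an analytical computation is given below. It assumes X is normal with mean `mu` and standard deviation `sigma`, that segment i of the utility function is f(x) = a_i * x + b_i between consecutive breakpoints, and it uses standard truncated-normal moment identities; the function and argument names are hypothetical, and a production version would typically be vectorized over compounds.

```python
import math

def _phi(z):
    """Standard normal pdf (zero at infinite arguments)."""
    return 0.0 if math.isinf(z) else math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _Phi(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def _zphi(z):
    """z * pdf(z), defined as zero at infinite arguments."""
    return 0.0 if math.isinf(z) else z * _phi(z)

def piecewise_linear_moments(lines, breakpoints, mu, sigma):
    """Mean and standard deviation of f(X) for X ~ N(mu, sigma^2).

    lines: N+1 pairs (a_i, b_i); breakpoints: N sorted values x_0..x_{N-1}.
    """
    if sigma <= 0.0 or len(lines) != len(breakpoints) + 1:
        raise ValueError("need sigma > 0 and one more line than breakpoints")
    edges = [-math.inf] + list(breakpoints) + [math.inf]
    m1 = 0.0  # E[f(X)]
    m2 = 0.0  # E[f(X)^2]
    for (a, b), lo, hi in zip(lines, edges[:-1], edges[1:]):
        alpha, beta = (lo - mu) / sigma, (hi - mu) / sigma
        d_cdf = _Phi(beta) - _Phi(alpha)
        d_pdf = _phi(beta) - _phi(alpha)
        d_zpdf = _zphi(beta) - _zphi(alpha)
        c, d = a * mu + b, a * sigma   # f(mu + sigma*z) = c + d*z on this segment
        m1 += c * d_cdf - d * d_pdf
        m2 += c * c * d_cdf - 2.0 * c * d * d_pdf + d * d * (d_cdf - d_zpdf)
    variance = max(m2 - m1 * m1, 0.0)  # guard against tiny negative round-off
    return m1, math.sqrt(variance)
```

For example, `piecewise_linear_moments([(0, 0), (0, 1), (0, 0)], [0.0, 2.0], mu, sigma)` gives the mean and spread of a utility that is 1 inside the range 0 to 2 and 0 outside it.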

[0168] Figure 4 schematically illustrates how compounds or molecules from a population can be fed into an ML model (i.e., a Bayesian statistical model), which is trained using those compounds from the population whose biological properties are known (i.e., the compounds in the training set). In this multi-objective problem, the Bayesian statistical model can output multiple predictions (one corresponding to each objective) in the form of posterior probability distributions. A utility function can then be applied to each corresponding prediction, for example to incorporate preferences into the predictions while preserving the uncertainty associated with them. An aggregation function can then be applied to the (preference-modified) predictions to reduce their dimensionality to a single dimension, again while preserving the uncertainty associated with the predictions. The aggregated predictions can then be optimized using a one-dimensional acquisition function (optionally including user-defined weights based on a desired balance between exploitation and exploration of the model) to select compounds for synthesis.

[0169] Figure 5 summarizes the steps of the computational drug design method 50 according to the present invention. At step 51, a population of multiple compounds is defined, wherein each compound has one or more structural features. At step 52, a training set of compounds is defined. Specifically, the training set includes those compounds from the population for which multiple biological properties are known, e.g., compounds that have been previously synthesized and tested. At step 53, multiple objectives are defined. Specifically, each objective indicates or defines a biological property to be exhibited by the ideal / candidate compound (for the specific drug discovery project under consideration). At step 54, a Bayesian statistical model, e.g., a Gaussian process model, is trained using the training set of compounds. The Bayesian statistical model is then executed to output a posterior probability distribution that approximates the biological properties of the compounds in the population as an objective function of the structural features of those compounds. The posterior probability distribution can comprise multiple posterior probability distributions, e.g., one corresponding to each of the multiple objectives. At step 55, a subset of the multiple compounds is determined. This subset includes compounds from the population that are not in the training set, and is determined by optimizing an acquisition function based on the probability distribution from the trained Bayesian statistical model and the multiple defined objectives (i.e., optimizing the multiple objectives simultaneously). In particular, the compounds that best match the optimization profile (e.g., that of the ideal compound) are selected. The subset can be selected sequentially, one compound at a time, by repeating the model execution and acquisition function optimization steps multiple times, and by retraining the model each time the steps are repeated (using pseudo-labels for the compounds selected so far). Optionally, one or more utility functions can be applied to the generated posterior probability distribution before applying the acquisition function, to incorporate user preferences regarding the objectives into the model predictions. Optionally, one or more aggregation functions can be applied before applying the acquisition function, to reduce the dimensionality of the generated model predictions. At least some compounds from the determined subset can then be selected for synthesis and testing. These synthesized compounds can then be added to the training set for the next execution of method 50 (e.g., a subsequent design cycle of the drug discovery project under consideration).

[0170] The method of the present invention can be implemented on any suitable computing device, for example, by one or more functional units or modules implemented on one or more computer processors. These functional units can be provided by suitable software running on any suitable computing substrate using conventional or custom processors and memory. The functional units may share a common computing substrate (e.g., they may run on the same server) or use separate substrates, or one or more of them may be distributed among multiple computing devices. Computer memory may store instructions for performing the method, and the processor may execute the stored instructions to perform the method.

[0171] An example using the described Gaussian process model is now outlined and compared with a standard random forest. A set of 14,620 molecules with known biological activity (pXC50), specifically hERG activity, is defined. Statistics for the dataset are shown in Table 1.

[0172]

[0173] Table 1

[0174] The first 2000 molecules in the dataset are used as training data for training the model (in the same manner as described above for the Gaussian process model). The performance of each model is then evaluated using the remaining molecules in the dataset. The kernel used for the Gaussian process model is a Jaccard kernel, which uses the Jaccard (or Tanimoto) distance between fingerprints.

[0175] Figures 6(a)-6(c) compare, for this dataset, the true, known biological activities of the molecules with the activities predicted by the trained Gaussian process and random forest models. Specifically, Figure 6(a) shows a scatter plot of the true activity value of each molecule against the value predicted by the Gaussian process model. Each 'x' marks a molecule and reflects the variance of the Gaussian process model's prediction for that molecule. Similarly, Figure 6(b) shows a plot of the true activity values against those predicted by the random forest model, and Figure 6(c) shows a plot comparing the predicted activities obtained from the random forest and Gaussian process models.

[0176] The variance threshold in the Gaussian process model can be adjusted to demonstrate how model certainty relates to prediction accuracy. For example, the model can be run with different upper limits on the variance (e.g., 1, 0.75, 0.6, 0.5, 0.4, or any other suitable value). Figure 7(a) shows a scatter plot of the true activity values against those predicted by the Gaussian process model, where the variance threshold is set to 0.5. For comparison, Figure 7(b) shows a scatter plot of the true activity values against those predicted by the random forest model for the same filtered molecules as in Figure 7(a). Finally, Figure 8 shows how the mean squared error (MSE) and variance of the Gaussian process model change with model certainty.
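The relationship between a variance limit and prediction error can be examined with a short helper like the one below. It is a sketch that assumes per-molecule true values, predicted means, and predicted variances are available as equal-length sequences; the function name is hypothetical.

```python
def mse_under_variance_limit(y_true, y_pred, pred_var, var_limit):
    """MSE over molecules whose predicted variance is at most var_limit.

    Returns (mse, number_of_molecules_kept); mse is None if nothing is kept.
    """
    kept = [(t, p) for t, p, v in zip(y_true, y_pred, pred_var) if v <= var_limit]
    if not kept:
        return None, 0
    mse = sum((t - p) ** 2 for t, p in kept) / len(kept)
    return mse, len(kept)
```

Sweeping `var_limit` over values such as 1, 0.75, 0.6, 0.5, and 0.4 and recording the returned MSE and count gives the kind of error-versus-certainty curve described for Figure 8.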

[0177] Another example is described in which the Bayesian optimization method described above is used to simulate multiple optimization cycles on an existing set of known molecules, to serve as a benchmark. Figure 9 schematically illustrates the main steps or modules used to perform the benchmarking. In the initial state or phase, the user sets parameters for a customized simulation. These parameters may include the acquisition function, batch size, etc. The molecules known to the model are set, as are the unknown molecules from which the model can select. The multiple properties or objectives are also set. In the batch optimization run phase, a single optimization step (as described above) is performed to select a batch of molecules. The model is then retrained by feeding it the selected batch with its correct labels before further optimization steps are performed. The output may include all selected molecules and / or individual logs / metrics associated with the model's predictions.

[0178] One set of known molecules is the dataset of 2,500 compounds presented in Pickett et al. (2011), “Automated lead optimization of MMP-12 inhibitors using a genetic algorithm,” ACS Medicinal Chemistry Letters, 2(1), 28-33. This dataset was generated by selecting a core with two R-groups. The core was fixed, and each R-group was a placeholder for one of 50 different molecular structures, giving 2,500 combinations containing that core. Of these combinations, only 1,880 molecules were successfully synthesized and assayed to produce pIC50 values. With the aim of maximizing the pIC50 values discovered, multiple synthesis cycles can therefore be simulated, with selections made either by an active / machine learning model (as described above) or by chemists.

[0179] In one experiment, multiple chemists were given the same initial 14 compounds and associated pIC50 values. Using this information, the chemists were tasked with selecting another batch of 14 compounds, each with an associated pIC50 value. This process was repeated for 10 batches (iterations), resulting in a total of 140 selected compounds and the initial 14 compounds. Each chemist's performance was then evaluated based on whether a compound with the maximum pIC50 value was found, the average pIC50 value of the selected compounds, and the top N selected compounds. The same experiment was simulated using a Gaussian process model as described. Specifically, the model was trained using the provided training data (i.e., the known pIC50 values). A Bayesian optimization algorithm was used to select a batch of compounds to optimize the objective (i.e., maximizing the pIC50 value). The training set was then updated to include the selected compounds, the model was retrained, and optimization was performed again. A comparison between the active learning method of this invention and the results of the best-performing chemist is presented in Table 2.

[0180]

[0181] Table 2

[0182] Another example demonstrating the results obtained using the described Gaussian process model is described. This example is performed using molecules from the known ChEMBL and GoStar databases. The general approach is to provide a relatively small, initially generated set of molecules (i.e., the training set) and build an ML model based on this training set. Batch Bayesian optimization according to the described method is then performed to select a set of molecules from the collection containing activity data of relevant properties that optimize activity for the target set. The model is then retrained using new data from the selected set. This process is repeated for multiple cycles or iterations.

[0183] In the example described here, 13,403 molecules with activity data for at least one of CYP3A4 (UniProt ID P08684) and CYP1A2 (UniProt ID P05177) were extracted from the aforementioned databases. CYP3A4 (cytochrome P450 3A4) is an enzyme found in the body, mainly in the liver and intestines, which oxidizes toxins so that they can be removed from the body. CYP1A2 (cytochrome P450 1A2) is also an enzyme in the body, concentrated in the endoplasmic reticulum. A random initial set of 10 molecules was obtained, and a model was constructed / trained for each CYP (i.e., each biological property). The Bayesian optimization method of Figure 5 was then executed for 10 rounds or iterations, with 20 molecules selected from the 13,393 remaining molecules in each iteration. After each round, the (known) data for each of the selected molecules was revealed and used to retrain / update the model. Some molecules in the database do not have data for both CYPs, meaning that the model may receive less data in a given round or iteration.

[0184] Figure 10(a) shows a graph illustrating the distribution of CYP3A4 activity values across a set or population of 13,403 molecules. Specifically, Figure 10(a) shows the breakdown of these 13,403 molecules into 8 molecules in the initial training set, 127 molecules selected during iterative optimization, and the remaining or unknown 13,268 molecules. As mentioned above, some molecules in the database have known data for only one of the CYPs. In this case, although 10 molecules were selected for the initial training set, only 8 of these have CYP3A4 data. Figure 10(b) shows a graph illustrating the distribution of CYP3A4 activity values for the molecules in the training and selection sets described in Figure 10(a) and above, as they are more clearly visible than in Figure 10(a).

[0185] Figures 11(a) and 11(b) correspond to Figures 10(a) and 10(b), respectively, but show the distribution of CYP1A2 activity values rather than CYP3A4 activity values. In this case, only 4 of the 10 molecules initially selected for training the model had available CYP1A2 data. After 30 iterations, 104 molecules with available CYP1A2 data were selected.

[0186] Overall, relatively large activity enrichment was observed in the selected compounds when compared to random selection (analyzing the data distribution) and to a baseline without active learning according to the method described herein. These results are particularly promising given that only 4 and 8 values (from 10 initial data points) were used for the targets outlined above, respectively.

[0187] Figure 12 shows a graph of the CYP3A4 and CYP1A2 activity values of the molecules in the set for which both values are available, i.e., both were measured in ChEMBL+GoStar. Figure 12 also indicates which of these molecules were selected during iterations of the described method (“true”) and which were not selected (“false”). With the Pareto front located at the upper right of the graph (maximizing the activity values), it can be seen that even though only about 200 of the approximately 13,000 molecules in the population were selected, the selected set comes close to the Pareto front.
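For readers unfamiliar with the term, the Pareto front for two activities that are both maximized can be computed with a simple non-dominated filter. The sketch below is illustrative only; `pareto_front` is a hypothetical helper, and each point is assumed to be a (CYP3A4 activity, CYP1A2 activity) pair.

```python
def pareto_front(points):
    """Return the points not dominated by any other point (all axes maximized).

    A point q dominates p when q is at least as good on every axis and
    strictly better on at least one.
    """
    front = []
    for i, p in enumerate(points):
        dims = range(len(p))
        dominated = any(
            all(q[k] >= p[k] for k in dims) and any(q[k] > p[k] for k in dims)
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            front.append(p)
    return front
```

For example, among `(1, 5)`, `(2, 4)`, `(3, 3)` and `(2, 2)`, only `(2, 2)` is dominated (by `(3, 3)`), so the other three form the front.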

[0188] A further example demonstrating the described method is provided in the context of free energy perturbation calculations. A dataset of 1921 molecules and corresponding relative binding free energy (RBFE) calculations was extracted from “Reaction-Based Enumeration, Active Learning, and Free Energy Calculations to Rapidly Explore Synthetically Tractable Chemical Space and Optimize Potency of Cyclin-Dependent Kinase 2 Inhibitors”, Konze et al., J. Chem. Inf. Model., 2019, 59, 9, 3782-3793. This example begins with an initial training set of 935 molecules from the cited reference, followed by 30 rounds or iterations of the method described herein, with 10 molecules selected in each round. The goal is to minimize the RBFE calculation result (measured as “Pred dG (kcal / mol)”).

[0189] Figure 13 shows a graph of the distribution of RBFE values for the molecules in the dataset. Specifically, Figure 13 distinguishes between the 935 molecules in the initial training set (“trained”), the molecules selected during iterations of the described method (“selected”), and the remaining molecules in the dataset (“unknown”). The bottom of each bar indicates “trained” molecules, the middle of each bar indicates “selected” molecules, and the top of each bar indicates “unknown” molecules.

[0190] Figure 14(a) shows how the accumulated RBFE value (“accumulated Pred dG”) changes with successive iterations of the described method. This is compared with the optimal selection (“best possible Pred dG”), i.e., selecting the molecules with the lowest dG values, and with random selection. Figure 14(b) then shows, after 30 iterations of the described method, the percentage of the top x molecules in the dataset (those with the lowest RBFE values) that are among the molecules selected in Figure 14(a). For example, for x = 10, 80% of the 10 lowest-dG molecules have been found by the end of the 30 iterations. At x = 1, a result of 100% means that the molecule with the lowest dG has been selected.
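The top-x percentage discussed for Figure 14(b) can be computed as in the sketch below. It assumes that x counts molecules (consistent with the x = 1 case above) and that lower dG is better; `top_x_recall` and the molecule ids are hypothetical.

```python
def top_x_recall(values, selected, x):
    """Fraction of the x lowest-valued molecules that appear in `selected`.

    values   -- dict: molecule id -> predicted dG (lower is better)
    selected -- set of molecule ids chosen over all iterations
    """
    ranked = sorted(values, key=values.get)  # ascending: lowest dG first
    top = ranked[:x]
    return sum(1 for m in top if m in selected) / len(top)
```

Multiplying the returned fraction by 100 gives the percentage plotted against x.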

[0191] Figure 15(a) is a plot corresponding to Figure 14(a), except that Figure 15(a) shows the result of selecting the set greedily with a random forest model instead of via the described method. Figure 15(b) shows, after 30 iterations with the random forest model, the percentage of the top x molecules in the test set (those with the lowest RBFE values) that are among the molecules selected in Figure 15(a).

[0192] The examples above describe using a Gaussian process model to perform the described Bayesian statistical method; however, different Bayesian model architectures can be used. For example, a Bayesian statistical model in the form of a Bayesian neural network or a deep neural network with dropout that provides an estimate of uncertainty can be used in the examples of this invention. Furthermore, it will be understood that any set of models with general architectures can be used.

[0193] The examples above describe the use of Bayesian statistical models to select compounds or molecules from a population for synthesis, for example, as part of a drug discovery process. In the examples of this invention, the compounds or molecules selected using the described Bayesian statistical methods can be used for different purposes. For example, the described methods can be used to select the molecules from a population on which to perform molecular dynamics analysis. Performing certain physics-based simulations can be resource-intensive (e.g., they are time-consuming and / or require high computing power), so it may be necessary to allocate the available computational resources in a manner that maximizes insight into the molecular dynamics of interest.

[0194] Many modifications may be made to the above examples without departing from the spirit and scope of the invention as defined in the appended clauses and claims, to which specific reference is made.

[0195] Clauses

[0196] 1. A method for computational drug design, comprising:

[0197] defining a population of a plurality of compounds, each of which has one or more structural features;

[0198] defining a training set of compounds from the population whose properties are known;

[0199] defining a plurality of objectives, each with a defined desired characteristic;

[0200] training a Bayesian statistical model using the training set of compounds to output a probability distribution approximating the properties of compounds in the population as an objective function of the structural features of compounds in the population;

[0201] determining a subset of a plurality of compounds from the population that are not in the training set by optimizing an acquisition function based on the probability distribution from the trained Bayesian statistical model and on the plurality of defined objectives; and

[0202] selecting at least some compounds from the determined subset for synthesis.

[0203] 2. The method according to Clause 1, comprising: for one or more objectives, mapping a preference associated with the characteristic of the corresponding objective by applying a corresponding utility function to the probability distribution from the Bayesian statistical model to obtain a preference-modified probability distribution, wherein the optimization of the acquisition function is based on the preference-modified probability distribution.

[0204] 3. The method according to Clause 2, wherein the preference indicates the priority of the corresponding objective relative to the other objectives among the plurality of objectives.

[0205] 4. The method according to Clause 2 or 3, wherein, for one of the properties of a compound, the lower the uncertainty value associated with the probability distribution of the property, the greater the preference associated with the corresponding property.

[0206] 5. The method according to any one of clauses 2 to 4, wherein the preference is a user-defined preference.

[0207] 6. The method according to any one of clauses 2 to 5, wherein one or more of the utility functions are piecewise functions.

[0208] 7. The method according to Clause 6, wherein the piecewise function is a piecewise linear function.
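As an informal illustration of Clauses 2, 6 and 7, a piecewise linear utility function can be applied to samples drawn from a property's probability distribution to give a preference-modified set of samples. This is a sketch under stated assumptions: the knot positions are invented, the sample-based representation is an assumption, and `piecewise_linear` and `apply_utility` are hypothetical names.

```python
def piecewise_linear(x, knots):
    """Linear interpolation between (value, utility) knots; clamped outside.

    knots -- list of (value, utility) pairs with strictly increasing values
    """
    if x <= knots[0][0]:
        return knots[0][1]
    if x >= knots[-1][0]:
        return knots[-1][1]
    for (x0, u0), (x1, u1) in zip(knots, knots[1:]):
        if x0 <= x <= x1:
            return u0 + (u1 - u0) * (x - x0) / (x1 - x0)

def apply_utility(samples, knots):
    """Map samples of a property to utilities (preference-modified samples)."""
    return [piecewise_linear(s, knots) for s in samples]
```

With knots `[(5.0, 0.0), (7.0, 1.0)]`, values at or below 5 get utility 0, values at or above 7 get utility 1, and a value of 6 gets 0.5.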

[0209] 8. The method according to any of the preceding clauses, wherein optimizing the acquisition function includes: evaluating the acquisition function for each compound in the population, optionally excluding compounds from the training set, wherein a subset is determined based on the evaluated acquisition function value.

[0210] 9. The method according to any of the preceding clauses, wherein optimization of the acquisition function based on the plurality of defined objectives provides a Pareto optimal set of compounds, wherein one or more of the plurality of compounds of the determined subset are selected from the Pareto optimal set.

[0211] 10. The method according to Clause 9, wherein selection is made from the Pareto optimal set based on user-defined preferences.

[0212] 11. The method according to any of the preceding clauses, wherein the probability distribution from the Bayesian statistical model comprises the probability distribution of each characteristic associated with each of the plurality of objectives.

[0213] 12. The method according to Clause 11, comprising mapping the plurality of probability distributions to a one-dimensional aggregated probability distribution by applying an aggregation function to the plurality of probability distributions from the Bayesian statistical model, wherein the optimization of the acquisition function is based on the aggregated probability distribution.

[0214] 13. The method according to Clause 12, wherein the aggregation function includes one or more of the following: sum operator, average operator, and product operator.
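The aggregation operators named in Clause 13 can be illustrated on a sample-based representation. The sketch assumes each objective's distribution is represented by an equal-length list of draws for the same compound; the patent does not prescribe this representation, and `aggregate` is a hypothetical helper.

```python
import math

def aggregate(samples_per_objective, how="sum"):
    """Combine per-objective samples into one list of scalar samples.

    samples_per_objective -- list of equal-length lists, one per objective
    how                   -- "sum", "mean" or "product"
    """
    rows = list(zip(*samples_per_objective))  # one tuple per joint draw
    if how == "sum":
        return [sum(r) for r in rows]
    if how == "mean":
        return [sum(r) / len(r) for r in rows]
    if how == "product":
        return [math.prod(r) for r in rows]
    raise ValueError(f"unknown aggregation: {how}")
```

One property of the product operator is that a draw scoring zero on any objective aggregates to zero, whereas the sum and mean let a strong objective compensate for a weak one.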

[0215] 14. The method according to any of the preceding clauses, wherein the acquisition function is at least one of the following: an expected improvement function, a probability of improvement function, and a confidence bound function.
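For a Gaussian posterior with mean `mu` and standard deviation `sigma`, the three acquisition functions named in Clause 14 have well-known closed forms. The sketch below uses the standard textbook definitions for a maximization problem, with `best` the best value observed so far; the function names and the default `kappa` are illustrative choices, not taken from the patent.

```python
import math

def _pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best):
    if sigma <= 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    return (mu - best) * _cdf(z) + sigma * _pdf(z)

def probability_of_improvement(mu, sigma, best):
    if sigma <= 0.0:
        return 1.0 if mu > best else 0.0
    return _cdf((mu - best) / sigma)

def upper_confidence_bound(mu, sigma, kappa=2.0):
    return mu + kappa * sigma
```

For a property that is to be minimized, such as a binding free energy, one option is to negate the values before applying these functions.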

[0216] 15. The method according to any one of clauses 1 to 11, wherein the acquisition function is a multidimensional acquisition function, wherein each dimension corresponds to a respective objective among the plurality of objectives; optionally, wherein the multidimensional acquisition function is an expected hypervolume improvement function.

[0217] 16. The method according to any of the preceding clauses, wherein training the Bayesian statistical model includes: tuning multiple hyperparameters of the Bayesian statistical model, wherein tuning the hyperparameters includes: applying a combination of maximum likelihood estimation and cross-validation techniques.

[0218] 17. The method according to any of the preceding clauses, wherein determining a subset of the plurality of compounds comprises:

[0219] identifying a compound from the population that is not in the training set by optimizing the acquisition function based on the probability distribution from the trained Bayesian statistical model and on the plurality of defined objectives, and repeating the following steps until the plurality of compounds for the subset have been identified:

[0220] retraining the Bayesian statistical model using the training set of compounds and one or more of the identified compounds; and

[0221] identifying a compound from the population that is not in the training set and is not one of the one or more previously identified compounds by optimizing the acquisition function based on the probability distribution from the retrained Bayesian statistical model and on the plurality of defined objectives.

[0222] 18. The method according to Clause 17, wherein retraining the Bayesian statistical model comprises: setting one or more pseudo-characteristic values for one or more identified compounds in the Bayesian statistical model.

[0223] 19. The method of Clause 18, wherein the pseudo-characteristic value is set according to one of the following: the Kriging believer method and the constant liar method.
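The sequential batch construction of Clauses 17 to 19 can be sketched as below. In the batch Bayesian optimization literature, the Kriging believer heuristic assigns a picked compound its own posterior mean as a pseudo-value, while the heuristic usually called constant liar assigns a fixed value (here the maximum observed value, one common choice). The surrogate `toy_fit` is a deliberately crude stand-in for a Gaussian process, and all names are hypothetical.

```python
def toy_fit(train):
    """Toy surrogate on 1-D inputs: mean = value of the nearest training
    point, std = distance to it. Stands in for a real Bayesian model."""
    def predict(x):
        nearest = min(train, key=lambda t: abs(t - x))
        return train[nearest], abs(nearest - x)
    return predict

def select_batch(train, candidates, fit, batch_size,
                 strategy="kriging_believer", kappa=1.0):
    """Pick `batch_size` candidates one at a time using pseudo-values.

    train -- dict: input -> observed value; candidates -- list of inputs
    fit   -- callable: train dict -> predict(x) -> (mean, std)
    """
    train = dict(train)            # work on a copy
    lie = max(train.values())      # constant used by the liar strategy
    chosen = []
    for _ in range(batch_size):
        predict = fit(train)       # retrain on real plus pseudo data
        remaining = [c for c in candidates if c not in train]
        # Confidence-bound acquisition: mean + kappa * std.
        best = max(remaining,
                   key=lambda c: predict(c)[0] + kappa * predict(c)[1])
        mean, _ = predict(best)
        train[best] = mean if strategy == "kriging_believer" else lie
        chosen.append(best)
    return chosen
```

Adding a pseudo-value collapses the surrogate's uncertainty around the picked compound, which discourages the next pick from landing right next to it before any real measurement is available.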

[0224] 20. The method according to any of the preceding clauses, wherein, in the Bayesian statistical model, each compound is represented as a bit vector, wherein the bits of the bit vector indicate the presence or absence of a corresponding structural feature in the compound.

[0225] 21. The method according to any of the preceding clauses, wherein the Bayesian statistical model is a Gaussian process model.

[0226] 22. The method according to any of the preceding clauses, wherein the probability distribution from the trained Bayesian statistical model includes a posterior mean indicating approximate characteristic values of compounds in the population and a posterior variance indicating the uncertainty associated with the approximate characteristic values in the population.

[0227] 23. The method according to any of the preceding clauses, wherein one or more weighting parameters of the acquisition function are modified according to the desired strategy of the drug design process utilizing the methods outlined.

[0228] 24. The method according to Clause 23, wherein the desired strategy comprises a balance between an exploitation strategy and an exploration strategy, wherein the exploitation strategy depends on the weighting parameter of the acquisition function associated with the posterior mean, and the exploration strategy depends on the weighting parameter of the acquisition function associated with the posterior variance.

[0229] 25. The method according to Clause 23 or Clause 24, wherein the user defines the weighting parameters to set the desired strategy.
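The effect of the weighting parameters in Clauses 23 to 25 can be seen in a small ranking example. The sketch assumes a score of the form weight on the posterior mean plus weight on the posterior standard deviation (the square root of the posterior variance); this particular form, the names `weighted_score` and `rank`, and the numbers are illustrative assumptions.

```python
import math

def weighted_score(mean, variance, w_exploit=1.0, w_explore=1.0):
    """Mean-driven term plus uncertainty-driven term."""
    return w_exploit * mean + w_explore * math.sqrt(variance)

def rank(posterior, w_exploit, w_explore):
    """posterior: dict compound id -> (mean, variance). Best first."""
    return sorted(
        posterior,
        key=lambda c: weighted_score(*posterior[c], w_exploit, w_explore),
        reverse=True,
    )
```

With `posterior = {"A": (7.0, 0.01), "B": (6.0, 1.0)}`, weights `(1, 0)` rank the well-characterized compound A first, while weights `(1, 2)` rank the more uncertain compound B first.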

[0230] 26. The method according to any of the preceding clauses, wherein the Bayesian statistical model uses a kernel that indicates the similarity between pairs of compounds in a population to approximate the biological characteristics of the compounds.

[0231] 27. The method according to Clause 26, wherein the kernel is a Tanimoto similarity kernel.
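For bit-vector representations such as those of Clause 20, the Tanimoto similarity is the number of bits set in both vectors divided by the number of bits set in either. A minimal sketch follows; the helper names are hypothetical, and returning 1.0 for two all-zero vectors is a convention assumed here.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two equal-length bit vectors (sequences of 0/1)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0

def kernel_matrix(fingerprints):
    """Pairwise Tanimoto similarities, usable as a kernel (Gram) matrix."""
    return [[tanimoto(a, b) for b in fingerprints] for a in fingerprints]
```

For example, `[1, 1, 0, 0]` and `[1, 0, 1, 0]` share one set bit out of three set in either, giving a similarity of 1/3.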

[0232] 28. The method according to any of the preceding clauses includes synthesizing at least some of the selected compounds of the determined subset to determine at least one property of the selected compounds.

[0233] 29. The method according to Clause 28 includes adding the synthesized compound to the training set to obtain an updated training set.

[0234] 30. The method pursuant to Clause 29 includes:

[0235] training an updated Bayesian statistical model using the updated training set of compounds to output a probability distribution that approximates the objective function;

[0236] determining a new subset of a plurality of compounds from the population that are not in the updated training set by optimizing the acquisition function, which depends on the approximate properties from the updated Bayesian statistical model and on the plurality of defined objectives; and

[0237] selecting at least some compounds from the determined new subset for synthesis.

[0238] 31. The method according to clause 30 includes synthesizing selected compounds from a determined new subset to determine at least one property of the selected compounds.

[0239] 32. The method according to Clause 31 includes updating the training set by adding the synthesized compound to the training set.

[0240] 33. The method according to any one of clauses 29 to 32, including iteratively performing the following steps until a stopping condition is met:

[0241] training an updated Bayesian statistical model using the updated training set of compounds to output a probability distribution that approximates the objective function;

[0242] determining a new subset of a plurality of compounds from the population that are not in the updated training set by optimizing the acquisition function, which depends on the approximate properties from the updated Bayesian statistical model and on the plurality of defined objectives;

[0243] selecting at least some compounds from the determined new subset for synthesis;

[0244] synthesizing the selected compounds from the determined new subset to determine at least one property of the selected compounds; and

[0245] adding the synthesized compounds to the training set to obtain an updated training set.

[0246] 34. The method according to Clause 33, wherein the stopping condition includes at least one of the following: one or more synthesized compounds achieve multiple objectives; one or more synthesized compounds are within acceptable thresholds for the respective multiple objectives; and a maximum number of iterations have been performed.
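The stopping conditions of Clause 34 can be expressed as a simple check run after each iteration. The sketch treats "achieving an objective" and "being within an acceptable threshold" as the same range test, which is a simplification; `should_stop`, the property names and the ranges are hypothetical.

```python
def should_stop(measured, thresholds, iteration, max_iterations):
    """Return True when the iterative design loop should end.

    measured   -- list of dicts (property name -> measured value), one per
                  synthesized compound
    thresholds -- dict: property name -> (low, high) acceptable range
    """
    if iteration >= max_iterations:
        return True
    # Stop as soon as one compound satisfies every objective's range.
    return any(
        all(p in cpd and lo <= cpd[p] <= hi
            for p, (lo, hi) in thresholds.items())
        for cpd in measured
    )
```

A compound with a missing measurement never satisfies the check, so the loop continues until every objective has been measured and met, or the iteration budget is exhausted.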

[0247] 35. The method according to any one of clauses 28 to 34, wherein the synthesized compound that achieves multiple objectives or is within an acceptable threshold of the respective multiple objectives is a candidate drug or therapeutic molecule having desired biological, biochemical, physiological and / or pharmacological activity against a predetermined target molecule.

[0248] 36. The method according to Clause 35, wherein the predetermined target molecule is an in vitro and / or in vivo therapeutic, diagnostic, or experimental assay target.

[0249] 37. The method pursuant to Clause 35 or Clause 36, wherein the candidate drug or therapeutic molecule is used for medical purposes, for example, in a method for treating animals (such as human or non-human animals).

[0250] 38. The method according to any of the foregoing clauses, wherein each objective is defined by the user.

[0251] 39. The method according to any of the preceding clauses, wherein each objective includes at least one of the following: a desired value of the corresponding characteristic, a range of desired values of the corresponding characteristic, and a desired value of the corresponding characteristic that is to be maximized or minimized.

[0252] 40. The method according to any of the foregoing clauses, wherein the number of compounds in the selected subset is user-defined.

[0253] 41. The method according to any of the preceding clauses, wherein the structural feature of each of the plurality of compounds in the population corresponds to a fragment, chemical moiety or chemical group present in the compound.

[0254] 42. The method according to Clause 41, wherein a fragment, chemical moiety or chemical group present in each of the plurality of compounds is represented as a molecular fingerprint; optionally, wherein the molecular fingerprint is an extended-connectivity fingerprint (ECFP), optionally ECFP0, ECFP2, ECFP4, ECFP6, ECFP8, ECFP10 or ECFP12.

[0255] 43. The method according to any of the preceding clauses, wherein the characteristic or at least one characteristic is the biological, biochemical, chemical, biophysical, physiological and / or pharmacological characteristic of each compound.

[0256] 44. The method according to any of the preceding clauses, wherein the characteristics include one or more of the following: activity, selectivity, toxicity, absorption, distribution, metabolism, and excretion.

[0257] 45. A compound identified by any of the methods described in the preceding clauses.

[0258] 46. A non-transitory computer-readable storage medium storing instructions that, when executed by a computer processor, cause the computer processor to perform any one of the methods described in clauses 1 to 44.

[0259] 47. A computing device for computational drug design, comprising:

[0260] an input terminal configured to receive data indicating a population of a plurality of compounds, each compound having one or more structural features, data indicating a training set of compounds from the population with known properties, and data indicating a plurality of objectives, each objective defining a desired property;

[0261] a processor configured to train a Bayesian statistical model using the training set of compounds to provide a probability distribution approximating the properties of compounds in the population as an objective function of the structural features of compounds in the population, and to determine a subset of a plurality of compounds from the population that are not in the training set, the subset being determined by optimizing an acquisition function based on the probability distribution from the trained Bayesian statistical model and on the plurality of defined objectives; and

[0262] an output terminal arranged to output the determined subset; optionally, the computing device is arranged to select at least some compounds from the determined subset for synthesis.

[0263] 48. A computing device pursuant to Clause 47, wherein the processor is configured to read computer-readable code to perform at least some steps of a method pursuant to any one of Clauses 1 to 44.

[0264] 49. A method for computational drug design, comprising:

[0265] defining a population of a plurality of compounds, each of which has one or more structural features;

[0266] defining a training set of compounds from the population whose properties are known;

[0267] defining a plurality of objectives, each with a defined desired characteristic;

[0268] training a Bayesian statistical model using the training set of compounds to output a probability distribution approximating the properties of compounds in the population as an objective function of the structural features of compounds in the population;

[0269] determining a subset of a plurality of compounds from the population that are not in the training set by optimizing an acquisition function based on the probability distribution from the trained Bayesian statistical model and on the plurality of defined objectives; and

[0270] selecting at least some compounds from the determined subset for molecular dynamics analysis.

[0271] 50. The method according to Clause 49, comprising performing molecular dynamics analysis on the selected compounds.

Claims

1. A method for computational drug design, comprising:

defining a population of a plurality of compounds, each of which has one or more structural features;

defining a training set of compounds from the population for which the value of each corresponding characteristic among a plurality of characteristics is known;

defining a plurality of objectives, each objective including at least one of the following: a desired value of the corresponding characteristic, a range of desired values of the corresponding characteristic, and a desired value of the corresponding characteristic that is to be maximized or minimized;

training a Bayesian statistical model using the training set of the compounds to output a probability distribution that approximates the properties of the compounds in the population as an objective function of the structural features of the compounds in the population;

for one or more of the objectives, mapping a preference associated with the characteristic of the corresponding objective by applying a corresponding utility function to the probability distribution from the Bayesian statistical model to obtain a preference-modified probability distribution, wherein, for the characteristic of the corresponding objective of one of the compounds, the lower the uncertainty value associated with the probability distribution of the characteristic, the greater the preference associated with the characteristic of the compound;

identifying a compound from the population that is not in the training set by optimizing an acquisition function based on the preference-modified probability distribution and on the plurality of defined objectives, and repeating the following steps until a subset of a plurality of compounds from the population that are not in the training set has been identified:

retraining the Bayesian statistical model using the training set of the compounds and one or more of the identified compounds;

for one or more of the objectives, mapping the preference associated with the characteristic of the corresponding objective by applying the corresponding utility function to the probability distribution from the Bayesian statistical model to obtain a preference-modified probability distribution; and

identifying a compound from the population that is not in the training set and is not one of the one or more previously identified compounds by optimizing the acquisition function based on the preference-modified probability distribution and on the plurality of defined objectives; and

selecting at least some compounds from the determined subset for synthesis.

2. The method according to claim 1, wherein the preference indicates the priority of the corresponding objective relative to the other objectives among the plurality of objectives.

3. The method according to claim 1 or 2, wherein one or more of the utility functions are piecewise functions.

4. The method according to claim 1 or 2, wherein optimizing the acquisition function includes: evaluating the acquisition function for each compound in the population, wherein the subset is determined based on the evaluated acquisition function values.

5. The method according to claim 1 or 2, wherein the optimization of the acquisition function based on the plurality of defined objectives provides a Pareto optimal set of compounds, wherein one or more of the plurality of compounds of the determined subset are selected from the Pareto optimal set.

6. The method according to claim 1 or 2, wherein the probability distribution from the Bayesian statistical model includes the probability distribution of each characteristic associated with each of the plurality of objectives.

7. The method of claim 6, further comprising mapping the plurality of probability distributions to a one-dimensional aggregated probability distribution by applying an aggregation function to the plurality of probability distributions from the Bayesian statistical model, wherein the optimization of the acquisition function is based on the aggregated probability distribution.

8. The method according to claim 1 or 2, wherein the acquisition function is at least one of the following: an expected improvement function, a probability of improvement function, and a confidence bound function.

9. The method according to claim 1 or 2, wherein the acquisition function is a multidimensional acquisition function, where each dimension corresponds to a respective objective among the plurality of objectives.

10. The method according to claim 1 or 2, wherein training the Bayesian statistical model includes tuning multiple hyperparameters of the Bayesian statistical model, wherein tuning the hyperparameters includes applying a combination of maximum likelihood estimation and cross-validation techniques.

11. The method according to claim 1, wherein retraining the Bayesian statistical model includes setting one or more pseudo-characteristic values for one or more of the identified compounds in the Bayesian statistical model.

12. The method according to claim 1 or 2, wherein, in the Bayesian statistical model, each compound is represented as a bit vector, where the bits of the bit vector indicate whether the corresponding structural features are present or absent in the compound.

13. The method according to claim 1 or 2, wherein the Bayesian statistical model is a Gaussian process model.

14. The method according to claim 1 or 2, wherein the probability distribution from the trained Bayesian statistical model includes a posterior mean indicating approximate property values of compounds in the population, and a posterior variance indicating the uncertainty associated with the approximate property values in the population.

15. The method according to claim 1 or 2, wherein one or more weighting parameters of the acquisition function are modified according to the desired strategy of the drug design process utilizing the claimed method.

16. The method according to claim 1 or 2, wherein the Bayesian statistical model uses a kernel that indicates the similarity between pairs of compounds in the population to approximate the properties of the compounds, wherein the kernel is a Tanimoto similarity kernel.

17. The method of claim 1 or 2, comprising synthesizing at least some of the selected compounds of the determined subset to determine at least one property of the selected compounds, and adding the synthesized compounds to the training set to obtain an updated training set.

18. The method of claim 17, comprising: training an updated Bayesian statistical model using the updated training set of the compounds to output the probability distribution that approximates the objective function; determining a new subset of a plurality of compounds from the population that are not in the updated training set based on optimization of the acquisition function, which depends on the approximate properties from the updated Bayesian statistical model and on the plurality of defined objectives; and selecting at least some compounds from the determined new subset for synthesis.

19. The method of claim 18, further comprising synthesizing selected compounds of a determined new subset to determine at least one property of the selected compounds, and updating the training set by adding the synthesized compounds to the training set.

20. The method of claim 17, further comprising iteratively performing the following steps until a stopping condition is met: training an updated Bayesian statistical model using the updated training set of the compounds to output the probability distribution that approximates the objective function; determining a new subset of a plurality of compounds from the population that are not in the updated training set based on optimization of the acquisition function, which depends on the approximate biological characteristics from the updated Bayesian statistical model and on the plurality of defined objectives; selecting at least some compounds from the determined new subset for synthesis; synthesizing the selected compounds from the determined new subset to determine at least one property of the selected compounds; and adding the synthesized compounds to the training set to obtain an updated training set.

21. The method according to claim 20, wherein the stopping condition includes at least one of the following: one or more synthesized compounds achieve the plurality of objectives, one or more synthesized compounds are within the acceptable threshold of the respective plurality of objectives, and the maximum number of iterations has been performed.

22. The method according to claim 1 or 2, wherein the structural features of each of the plurality of compounds in the population correspond to a fragment present in the compound, wherein the fragment present in each of the plurality of compounds is represented as a molecular fingerprint.

23. The method according to claim 1 or 2, wherein the property or at least one of the properties is a biological property, biochemical property, chemical property, biophysical property, physiological property and / or pharmacological property of each of the compounds.

24. A non-transitory computer-readable storage medium storing instructions that, when executed by a computer processor, cause the computer processor to perform the method of any one of claims 1 to 23.

25. A computing device for computational drug design, comprising:

an input terminal configured to receive:

data indicating a population of a plurality of compounds, each of which has one or more structural features;

data indicating a training set of compounds from the population for which the value of each corresponding biological characteristic among a plurality of biological characteristics is known; and

data indicating a plurality of objectives, each objective including at least one of the following: a desired value of the corresponding characteristic, a range of desired values of the corresponding characteristic, and a desired value of the corresponding characteristic that is to be maximized or minimized;

a processor arranged to:

train a Bayesian statistical model using the training set of compounds to provide a probability distribution approximating the biological characteristics of the compounds in the population as an objective function of the structural features of the compounds in the population;

for one or more of the objectives, map a preference associated with the characteristic of the corresponding objective by applying a corresponding utility function to the probability distribution from the Bayesian statistical model to obtain a preference-modified probability distribution, wherein, for the characteristic of the corresponding objective of one of the compounds, the lower the uncertainty value associated with the probability distribution of the characteristic, the greater the preference associated with the characteristic of the compound; and

determine a subset of a plurality of compounds from the population that are not in the training set, the subset being determined by optimizing an acquisition function based on the preference-modified probability distribution and on the plurality of defined objectives; and

an output terminal arranged to output the determined subset,

wherein, in order to determine the subset of the plurality of compounds, the processor is arranged to:

identify a compound from the population that is not in the training set by optimizing the acquisition function based on the preference-modified probability distribution and on the plurality of defined objectives, and repeat the following steps until the plurality of compounds for the subset have been identified:

retrain the Bayesian statistical model using the training set of the compounds and one or more of the identified compounds;

for one or more of the objectives, map the preference associated with the characteristic of the corresponding objective by applying the corresponding utility function to the probability distribution from the Bayesian statistical model to obtain a preference-modified probability distribution; and

identify a compound from the population that is not in the training set and is not one of the one or more previously identified compounds by optimizing the acquisition function based on the preference-modified probability distribution and on the plurality of defined objectives.