Fast prediction method of acid-base dissociation constants of organic compounds

An acid-base dissociation constant (pKa) prediction model built on counting-type Morgan fingerprints and the categorical boosting tree (CatBoost) machine learning algorithm addresses the time-consuming and laborious nature of pKa determination, insufficient accuracy, and unclear application domains in existing technologies, and achieves efficient, low-cost, and interpretable pKa prediction for compounds.

CN122201497A · Pending · Publication Date: 2026-06-12 · ZHEJIANG NORMAL UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZHEJIANG NORMAL UNIV
Filing Date
2026-03-06
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing methods for predicting the acid-base dissociation constant (pKa) of organic compounds suffer from problems such as time-consuming and laborious experiments, high costs, insufficient model accuracy, limited coverage of compound types, unclear model application domains, and poor interpretability.

Method used

Counting Morgan fingerprints (C-MF) are used as the molecular representation, and the categorical boosting tree (CatBoost) machine learning algorithm is combined with SHAP contribution-based recursive feature elimination (SHAP-RFE) to construct an acid-base dissociation constant prediction model. The application domain of the model is defined by the structure-activity landscape analysis method, achieving fast and accurate pKa prediction.

Benefits of technology

It achieves pKa prediction without complex experimental operations, low cost, accurate prediction, strong interpretability, and good generalization. It is suitable for large-scale compound screening, with high prediction efficiency and accuracy, clear model characteristics, and a clear scope of application.

✦ Generated by Eureka AI based on patent content.


Abstract

The application belongs to the technical field of environmental engineering and computational chemistry, and relates to a rapid prediction method for the acid-base dissociation constants (pKa) of organic compounds. The method comprises the following steps: constructing a dataset containing organic compounds and their corresponding experimental pKa values; generating counting Morgan fingerprints from the SMILES expressions of the compounds with the RDKit toolkit; constructing an acid-base dissociation constant prediction model with the categorical boosting tree (CatBoost) machine learning algorithm, optimizing a key feature subset by recursive feature elimination based on SHAP contribution, and reconstructing an optimal model; defining the application domain of the optimal model with the AD_SAL method; and inputting the counting Morgan fingerprint of a target compound into the optimal model to obtain a predicted value of the acid-base dissociation constant in combination with the application domain judgment. The application enables rapid prediction of the acid-base dissociation constant from molecular structure alone, and has the advantages of convenient operation, low cost, accurate prediction, strong stability and wide application range.

Description

Technical Field

[0001] This application belongs to the field of environmental engineering and computational chemistry, and specifically relates to a rapid prediction method for the acid-base dissociation constant of organic compounds.

Background Technology

[0002] The acid-base dissociation constant (pKa) is a core physicochemical parameter characterizing the acid strength of organic compounds. It directly determines the ionization equilibrium of compounds in aqueous solution and is crucial for assessing their environmental bioavailability, toxicity, membrane permeability, and metabolic stability. It is indispensable basic data in fields such as environmental risk assessment and chemical engineering design.

[0003] Traditional pKa determination relies on experimental analysis (such as UV-Vis spectrophotometry), requiring complex sample pretreatment and sophisticated instruments. These methods are time-consuming, labor-intensive, and costly, making them unsuitable for large-scale compound screening. With the development of computer technology, quantitative structure-property relationship (QSPR) models based on machine learning (ML) have become an important means of predicting pKa quickly; their performance hinges on the choice of molecular representation and algorithm.

[0004] Morgan fingerprints (also known as extended connectivity fingerprints, ECFP) are widely used in cheminformatics because they capture local atomic environments in molecules and are symmetry-invariant. However, traditional binary Morgan fingerprints (B-MF) only use "0 / 1" to mark the presence or absence of substructures and lack crucial stoichiometric information: they cannot distinguish whether the same substructure appears once or multiple times in a molecule. The pKa value is highly sensitive to the number and density of ionizable functional groups (such as multiple carboxyl groups and nitro groups). This lack of information limits the prediction accuracy of existing B-MF-based QSPR models, especially for multifunctional compounds (such as dinitrophenol and dicarboxylic acids).
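
The difference between the two encodings can be illustrated with a small Python sketch. The substructure labels and per-molecule counts below are hand-written, coarse stand-ins chosen purely for illustration (real Morgan environments are finer-grained and are computed by a cheminformatics toolkit); the vocabulary and the helper names are assumptions, not part of this application.

```python
from collections import Counter

# Hypothetical, hand-written substructure vocabulary (illustration only).
VOCAB = ["phenolic OH", "nitro", "aromatic CH"]


def count_vector(substructures):
    """Count-type encoding: how many times each substructure occurs."""
    counts = Counter(substructures)
    return [counts[name] for name in VOCAB]


def binary_vector(substructures):
    """Binary encoding: 1 if the substructure occurs at all, else 0."""
    return [1 if c > 0 else 0 for c in count_vector(substructures)]


# Coarse hand-written descriptions of two phenols (not toolkit output).
nitrophenol = ["phenolic OH", "nitro"] + ["aromatic CH"] * 4
dinitrophenol = ["phenolic OH"] + ["nitro"] * 2 + ["aromatic CH"] * 3

# The binary vectors coincide, while the count vectors differ.
print(binary_vector(nitrophenol), binary_vector(dinitrophenol))
print(count_vector(nitrophenol), count_vector(dinitrophenol))
```

In this toy case a model that only sees the binary vectors cannot tell the mono-nitro phenol from the dinitro phenol, whereas the count vectors keep the second nitro group visible.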

[0005] Existing pKa prediction models still have many shortcomings: some models use small datasets (fewer than 1000 compounds) covering a limited range of compound types; some models perform no feature dimensionality reduction, so redundant features increase computational complexity and cause overfitting; most models lack a clearly defined application domain, making it difficult to guarantee the reliability of predictions for compounds with large structural differences; and the "black box" nature of the models makes it difficult to interpret the mechanism linking pKa to molecular structure.

[0006] Therefore, existing pKa prediction methods still need improvement.

Summary of the Invention

[0007] The purpose of this application is to provide a rapid prediction method for the acid-base dissociation constants of organic compounds. This application uses counting-type Morgan fingerprints as the molecular representation and combines the categorical boosting tree (CatBoost) machine learning algorithm with feature optimization by recursive feature elimination based on SHAP contribution to construct a highly accurate and interpretable optimal prediction model for acid-base dissociation constants. The application domain of the model is clarified through structure-activity landscape analysis. Rapid and accurate prediction of acid-base dissociation constants can be achieved from the molecular structure of organic compounds alone, without complex experimental operations. The method has the advantages of convenient operation, low cost, accurate prediction, strong interpretability, and good generalization.

[0008] This application provides a rapid method for predicting the acid-base dissociation constant of organic compounds, comprising the following steps:
(1) Construct an acid-base dissociation constant dataset and divide it into a training set, a validation set and a test set. Each set contains multiple organic compounds, together with the experimental value of the acid-base dissociation constant and the SMILES expression of each organic compound.
(2) Using the cheminformatics toolkit in the RDKit tool library, generate a counting-type Morgan fingerprint from the SMILES expression of each organic compound as the input feature of the model.
(3) Using the categorical boosting tree (CatBoost) machine learning algorithm, construct the prediction model for the acid-base dissociation constant of the organic compounds from the experimental values of the acid-base dissociation constant and the corresponding counting-type Morgan fingerprints, and verify the accuracy of this prediction model.
(4) Use recursive feature elimination based on SHAP contribution to reduce the dimensionality of the counting-type Morgan fingerprint and select the key feature subset; based on the key feature subset, reconstruct the optimal prediction model of the acid-base dissociation constant.
(5) Define the application domain of the optimal prediction model with the application domain characterization method based on the structure-activity landscape, and determine the application domain judgment threshold.
(6) Obtain the SMILES (simplified molecular-input line-entry system) expression of the target compound, generate the corresponding counting-type Morgan fingerprint according to step (2), and first determine whether the compound is within the model application domain using the threshold of step (5); if so, input the counting-type Morgan fingerprint into the optimal prediction model to obtain the predicted value of the acid-base dissociation constant of the target compound.
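
The gating logic of steps (5) and (6) can be sketched as follows. This text does not spell out how the structure-activity landscape score is computed, so the sketch swaps in a much simpler stand-in: the count-based (min/max) Tanimoto similarity to the nearest training compound. The function names, the `predict` callable, and the threshold values are all hypothetical.

```python
def count_tanimoto(a, b):
    """Min/max (generalized Tanimoto) similarity between two count vectors."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    total = sum(max(x, y) for x, y in zip(a, b))
    return shared / total if total else 0.0


def predict_if_in_domain(fingerprint, training_fingerprints, predict, threshold):
    """Return predict(fingerprint) only when the compound passes the gate.

    The nearest-neighbour similarity used here is a stand-in for the
    application-domain score; it is NOT the method of this application.
    """
    nearest = max(count_tanimoto(fingerprint, ref) for ref in training_fingerprints)
    if nearest < threshold:
        return None  # outside the (stand-in) application domain
    return predict(fingerprint)


# Toy usage with a stub model that always answers 7.0.
training = [[1, 1, 4]]
query = [1, 2, 3]
print(predict_if_in_domain(query, training, lambda fp: 7.0, threshold=0.7))
print(predict_if_in_domain(query, training, lambda fp: 7.0, threshold=0.8))
```

Returning `None` for out-of-domain compounds keeps the caller from silently using a prediction the gate has rejected.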

[0009] In some embodiments, the organic compound includes at least one of meta / para phenols, ortho phenols containing intramolecular hydrogen bonds, ortho phenols without intramolecular hydrogen bonds, meta / para aromatic carboxylic acids, ortho aromatic carboxylic acids, aliphatic carboxylic acids, anilines, primary amines, secondary amines, tertiary amines, meta / para pyridines, ortho pyridines, pyrimidines, imidazoles and benzimidazoles, and quinolines.

[0010] In some embodiments, the number of organic compounds is 1143.

[0011] In some embodiments, the acid-base dissociation constant values of each of the organic compounds are determined under the same conditions.

[0012] In some embodiments, the data distribution range of the acid-base dissociation constant values of the organic compounds is -5.00 to 12.23.

[0013] In some embodiments, in step (2), the counting Morgan fingerprint is generated with RDKit's hashed Morgan fingerprint function, with a search radius r of 1 and a bit vector length nbits of 2048.

[0014] In some embodiments, the key feature subset is selected based on the mechanistic analysis of the acid-base dissociation constant of each organic compound and in combination with a recursive feature elimination method based on SHAP contribution.

[0015] In some embodiments, step (3) includes the construction and accuracy verification of the prediction model for the acid-base dissociation constant of the organic compound: training the model using the training set and validation set to establish the prediction model for the acid-base dissociation constant of the organic compound; and verifying the accuracy of the prediction model for the acid-base dissociation constant of the organic compound using the test set.

[0016] In some embodiments, the coefficient of determination R² and the root mean square error RMSE are used as statistical indicators of model accuracy to characterize the fitting performance of the model. In some embodiments, the data volume ratio of the training set, the validation set, and the test set is set to 64:16:20, so that the three proportions always sum to 100%.

[0017] In some embodiments, in step (4), the key hyperparameters of the optimal prediction model for the acid-base dissociation constant of the organic compound include: number of iterations = 657, learning rate = 0.10, tree depth = 9, and leaf node L2 regularization coefficient = 2.11.
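
As a small configuration sketch, the reported values can be collected in a dictionary keyed by CatBoost argument names. The names `iterations`, `learning_rate`, `depth`, and `l2_leaf_reg` are real CatBoost parameters, but mapping the four reported quantities onto them is an assumption made here; the text does not give the argument names.

```python
# Hyperparameters reported in [0017], keyed by the CatBoost argument
# names they most plausibly correspond to (the mapping is an assumption).
CATBOOST_PARAMS = {
    "iterations": 657,      # number of iterations
    "learning_rate": 0.10,  # learning rate
    "depth": 9,             # tree depth
    "l2_leaf_reg": 2.11,    # leaf node L2 regularization coefficient
}


def describe(params):
    """Render the settings as a single, sorted, human-readable line."""
    return ", ".join(f"{name}={value}" for name, value in sorted(params.items()))


print(describe(CATBOOST_PARAMS))
```

With catboost installed, such a dictionary could be unpacked into the regressor constructor, for example `CatBoostRegressor(**CATBOOST_PARAMS)`.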

[0018] In some embodiments, the rapid prediction method further includes: comprehensively evaluating the fitting ability, stability, and prediction ability of the optimal prediction model for the acid-base dissociation constant of the organic compound using three methods: fitting performance analysis, simulation external verification, and external verification.

[0019] This application uses counting-type Morgan fingerprints as the molecular representation and combines the categorical boosting tree (CatBoost) machine learning algorithm with feature optimization by recursive feature elimination based on SHAP contribution to construct a highly accurate and interpretable optimal prediction model for acid-base dissociation constants. The application domain of the model is clarified through structure-activity landscape analysis. The model achieves rapid and accurate prediction of acid-base dissociation constants from the molecular structure of organic compounds alone, without complex experimental operations. It has the advantages of convenient operation, low cost, accurate prediction, strong interpretability, and good generalization.

[0020] The rapid prediction method for the acid-base dissociation constants of organic compounds provided in this application takes counting-type Morgan fingerprint molecular characterization, the CatBoost algorithm, recursive feature elimination based on SHAP contribution, and the application domain characterization method based on the structure-activity landscape as its core technologies. The method breaks through the bottleneck of existing acid-base dissociation constant prediction technologies and has the following core beneficial effects: (1) High prediction efficiency and low cost: No experiments are required. The acid-base dissociation constant can be predicted from the SMILES expression of the target compound alone, enabling high-throughput screening of tens of thousands of organic compounds. Compared with traditional experimental methods, the detection cost is reduced by more than 95%, and the prediction time for a single compound is ≤1 s, which meets the needs of large-scale pollutant screening.

[0021] (2) Accurate molecular characterization and complete information: The counting type Morgan fingerprint retains the stoichiometric information of molecular functional groups, which can accurately capture the structural features of multifunctional compounds, laying the foundation for high prediction accuracy of the model. Moreover, the counting type Morgan fingerprint can be generated with one click through the RDKit tool, and the parameter acquisition process is standardized and easy to operate.

[0022] (3) High model accuracy and excellent feature efficiency: After optimization by recursive feature elimination based on SHAP contribution, the optimal model uses only 81 counting Morgan fingerprint features and achieves Q² = 0.890 and a root mean square error RMSE = 1.026 on the test set. This improves prediction accuracy while reducing model complexity and improving computational efficiency, and the model shows no systematic error.

[0023] (4) The model is highly interpretable and conforms to chemical mechanisms: The core contribution features of the model are the counting Morgan fingerprint features corresponding to carboxyl, nitro and phenolic hydroxyl groups. The prediction logic follows the basic organic chemical mechanism of “setting the baseline of acid-base dissociation constants at primary ionizable sites and fine-tuning the acid-base dissociation constant values with electronic modification groups”, which solves the “black box” problem of traditional machine learning models.

[0024] (5) Clear application domain and good generalization: The high-confidence prediction range of the model is clearly defined through the structure-activity landscape analysis method. For internal test set compounds within the application domain, Q² = 0.926; validation on an open-source external dataset gives Q² = 0.890 and RMSE = 0.942, demonstrating that the model generalizes well to unknown compounds and can effectively avoid prediction bias caused by the "activity cliff".

[0025] (6) Easy to operate and highly repeatable: All calculation processes are implemented in the Python open source environment, and the tool packages called are all industry-standard open source libraries. The operation steps are standardized, and those skilled in the art can directly reproduce them according to the description of this application. Moreover, the model parameters are fixed, and the prediction results are highly repeatable.

[0026] The above description is only an overview of the technical solution of this application. In order to better understand the technical means of this application and to implement it in accordance with the contents of the specification, and to make the above and other objects, features and advantages of this application more obvious and understandable, the following are specific embodiments of this application.

Description of the Drawings

[0027] To more clearly illustrate the technical solutions in the specific embodiments of this application or the prior art, the drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0028] Figure 1 is a graph showing the fit between predicted and experimental values for the optimal prediction model for acid-base dissociation constants in the embodiments of this application; the horizontal axis represents the experimentally determined acid-base dissociation constant values, and the vertical axis represents the model-predicted acid-base dissociation constant values, including training and test set data points.

[0029] Figure 2 shows the molecular structures corresponding to the five counting Morgan fingerprint bits with the highest SHAP contributions among the 81 key fingerprint features of the optimal prediction model for acid-base dissociation constants in the embodiments of this application.

[0030] Figure 3 is a graph showing the fit between predicted and experimental values for the optimal prediction model for acid-base dissociation constants in the embodiments of this application on an open-source dataset; the horizontal axis represents the experimentally determined acid-base dissociation constant values, and the vertical axis represents the model-predicted acid-base dissociation constant values.

Detailed Implementation

[0031] Exemplary embodiments of this application will now be described in more detail with reference to the accompanying drawings. While exemplary embodiments of this application are shown in the drawings, it should be understood that this application may be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided to enable a more thorough understanding of this application and to fully convey the scope of this application to those skilled in the art.

[0032] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains; the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the application; the terms “comprising” and “having”, and any variations thereof, in the specification, claims, and foregoing description of the drawings are intended to cover non-exclusive inclusion.

[0033] In the description of the embodiments of this application, technical terms such as "first" and "second" are used only to distinguish different objects and should not be construed as indicating or implying relative importance or implicitly indicating the number, specific order, or primary and secondary relationship of the indicated technical features.

[0034] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.

[0035] In the description of the embodiments in this application, the term "and / or" is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent three cases: A exists, A and B exist simultaneously, and B exists. In addition, the character " / " in this document generally indicates that the related objects before and after it have an "or" relationship.

[0036] In the description of the embodiments of this application, the term "multiple" refers to two or more (including two), similarly, "multiple sets" refers to two or more (including two sets), and "multiple pieces" refers to two or more (including two pieces).

[0037] Counting Morgan fingerprints (C-MF) compensate for the information deficiency of binary fingerprints by quantifying the frequency of substructures in molecules. They have shown superior performance to B-MF in predicting pollutant activity, but have not yet been applied to pKa prediction. The SHAP (SHapley Additive exPlanations) method quantifies the contribution of each feature to the prediction result and, combined with recursive feature elimination (RFE), enables feature dimensionality reduction and model interpretation. The application domain characterization method based on the structure-activity landscape (AD_SAL) can accurately define the application domain of the model and improve the reliability of predictions.

[0038] This application integrates C-MF and SHAP-RFE dimensionality reduction with an ensemble machine learning algorithm to develop an efficient, accurate, and interpretable pKa prediction method. The method overcomes the shortcomings of existing technologies, such as missing information, insufficient accuracy, and ambiguous application domains, and provides technical support for large-scale prediction of compound pKa values and for environmental risk assessment.

[0039] This application provides a method for rapid prediction of the acid-base dissociation constant of organic compounds, which is carried out according to the following steps.

[0040] (I) Obtain high-quality pKa data and construct the pKa dataset.

[0041] In this embodiment, multiple organic compounds and their corresponding experimentally determined acid-base dissociation constants are obtained. In other words, the constructed pKa dataset contains multiple organic compounds, each with an experimental value of its acid-base dissociation constant.

[0042] Generally speaking, high-quality experimental data covering structurally diverse compounds are the foundation for improving model prediction accuracy and expanding the model's application domain. Accordingly, the core principles by which this application screens pKa data and constructs the pKa dataset are: authoritative data sources, standardized measurement conditions, and diverse compound types.

[0043] In this embodiment, pKa experimental values that are reported in the literature and determined using standard methods are selected first, to ensure the authenticity and reliability of the data.

[0044] In some embodiments, the entire pKa dataset includes 1143 organic compounds, covering typical structures such as meta / para phenols, ortho phenols with intramolecular hydrogen bonds, ortho phenols without intramolecular hydrogen bonds, meta / para aromatic carboxylic acids, ortho aromatic carboxylic acids, aliphatic carboxylic acids, anilines, primary amines, secondary amines, tertiary amines, meta / para pyridines, ortho pyridines, pyrimidines, imidazoles and benzimidazoles, and quinolines.

[0045] In this application, to eliminate the interference of differing experimental conditions with the pKa data, the pKa experimental values of all compounds were measured under the same conditions.

[0046] In this embodiment, the pKa experimental values range from -5.00 to 12.23.

[0047] (II) Calculation and screening of key counting Morgan fingerprint features.

[0048] In the prior art, molecular characterization for pKa prediction often employs traditional descriptors such as binary Morgan fingerprints. These fingerprints only characterize the "presence" or "absence" of molecular substructures and cannot capture the stoichiometric information of functional groups (such as the number of carboxyl and nitro groups). This information is a core structural feature determining the acidity of organic compounds; without it, models struggle to establish a direct "molecular structure-pKa" correlation, which makes it difficult to improve prediction accuracy and interpretability.

[0049] This application overcomes the limitation of existing technologies that rely on static binary fingerprints. Based on the reaction mechanism of the acid dissociation of organic compounds, combined with the screening results of the SHAP contribution-based SHAP-RFE algorithm, it identifies the key fingerprint feature parameters that accurately characterize molecular pKa: the counting Morgan fingerprint (C-MF) and the 81 core fingerprint features optimized by SHAP-RFE, which together form a "structure-guided" molecular feature system. The specific calculation and screening methods are as follows.

Counting Morgan fingerprint (C-MF) calculation method: The AllChem.GetHashedMorganFingerprint function from the RDKit 2024.03.5 cheminformatics toolkit is used, with a search radius r=1 and bit vector length nbits=2048, to generate fingerprints for the compound molecules. The 2048-dimensional C-MF values are extracted directly from the input SMILES expression. This fingerprint quantifies the frequency of molecular substructure occurrences as an integer vector, accurately characterizes the number of functional groups, and preserves stoichiometric information.
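
A minimal sketch of this calculation is given below, assuming RDKit is installed. `AllChem.GetHashedMorganFingerprint`, the radius of 1, and the length of 2048 come from the description; the wrapper name `count_morgan_fingerprint`, the dense-list output, and the malonic acid example are illustrative choices made here. Recent RDKit releases may emit a deprecation warning for this function.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def count_morgan_fingerprint(smiles, radius=1, n_bits=2048):
    """Return a dense list of substructure counts for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    # Sparse count vector: maps a hashed bit index to its occurrence count.
    sparse = AllChem.GetHashedMorganFingerprint(mol, radius, nBits=n_bits)
    dense = [0] * n_bits
    for bit, count in sparse.GetNonzeroElements().items():
        dense[bit] = count
    return dense


# Malonic acid has two carboxyl groups, so some entries exceed 1.
fp = count_morgan_fingerprint("OC(=O)CC(=O)O")
print(len(fp), max(fp))
```

A binary fingerprint of the same molecule would cap every entry at 1; the count vector keeps the repeated carboxyl environments distinguishable.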

[0050] Key fingerprint feature selection method: The 2048-dimensional C-MF features are screened iteratively according to the SHAP-RFE rule. First, the mean absolute SHAP value of each feature in the initial CatBoost model is calculated with the SHAP algorithm, and the features are ranked in descending order of contribution. Then, starting from the full feature set, the feature with the smallest contribution is removed iteratively; after each removal the model is retrained and its performance is evaluated on the validation set. Taking the maximum validation set coefficient of determination (R²) as the criterion, a key feature subset of 81 features is selected, among which the C-MF features corresponding to carboxyl, nitro, and phenolic hydroxyl groups are the core contributing features for model prediction.
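
The elimination loop can be sketched in plain Python with the model-specific parts passed in as callables. Here `rank_importance` stands in for "train the model and compute mean absolute SHAP values" and `score_subset` stands in for "retrain and return the validation R²"; both names, the choice to re-rank after every removal, and the tie-breaking rule (prefer the smaller subset) are assumptions, since the description does not fix these details.

```python
def recursive_feature_elimination(features, rank_importance, score_subset,
                                  min_features=1):
    """Drop the weakest feature one at a time; keep the best-scoring subset.

    rank_importance(subset) -> dict mapping feature -> importance
    score_subset(subset)    -> validation score (higher is better)
    """
    current = list(features)
    best_subset, best_score = list(current), score_subset(current)
    while len(current) > min_features:
        importance = rank_importance(current)
        weakest = min(current, key=lambda name: importance[name])
        current.remove(weakest)
        score = score_subset(current)
        if score >= best_score:  # ties favour the smaller subset
            best_subset, best_score = list(current), score
    return best_subset, best_score


# Toy stand-ins: two informative features and two noisy ones.
IMPORTANCE = {"a": 0.5, "b": 0.3, "c": 0.01, "d": 0.0}
EFFECT = {"a": 0.5, "b": 0.3, "c": -0.05, "d": -0.05}

subset, score = recursive_feature_elimination(
    ["a", "b", "c", "d"],
    rank_importance=lambda s: {name: IMPORTANCE[name] for name in s},
    score_subset=lambda s: sum(EFFECT[name] for name in s),
)
print(subset, score)
```

In a real run each call to `score_subset` retrains a model, so the loop costs roughly one training per removed feature.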

[0051] (III) Establishment and optimization of the acid-base dissociation constant (pKa) prediction model for organic compounds.

[0052] Based on the aforementioned key fingerprint feature parameters, this application employs the CatBoost machine learning algorithm to construct a regression prediction model between pKa and the key fingerprint feature parameters, and uses SHAP to characterize the contribution of each fingerprint feature to the predicted pKa value in the final model; the process is executed in Python 3.12. In other words, SHAP interpretive analysis characterizes the contribution weight of each fingerprint feature parameter to the predicted pKa value, ensuring that the model is both highly accurate and interpretable.
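
For reference, SHAP attributions are additive: the prediction for one compound equals a base value plus the sum of the per-feature contributions. In the notation below, which is introduced here for illustration and does not appear in the application, $f(x)$ is the predicted pKa for fingerprint $x$, $\phi_0$ is the base value, $\phi_j(x)$ is the contribution of fingerprint feature $j$, $M$ is the number of features, and $I_j$ is the mean absolute contribution over $N$ compounds that can serve as a feature ranking.

```latex
f(x) = \phi_0 + \sum_{j=1}^{M} \phi_j(x),
\qquad
I_j = \frac{1}{N} \sum_{k=1}^{N} \left| \phi_j\!\left(x^{(k)}\right) \right|
```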

[0053] Specifically, all calculations were implemented in Python 3.12, using open-source libraries such as rdkit 2024.03.5, catboost 1.2.7, shap 0.46.0, sklearn 1.6.1, and optuna 4.0.0 to complete model building, feature importance analysis, and performance evaluation. The coefficient of determination R² (which measures how well the model fits the data; the closer to 1 the better) and the root mean square error RMSE (which measures the deviation between model predictions and experimental values; the smaller the better) are used as statistical indicators to characterize the model's fitting performance. The coefficient of determination R²ext and the root mean square error RMSEext of the external validation set measure the model's predictive ability on unknown external data; R²ext > 0.85 indicates excellent predictive performance of the model.

[0054] It should be noted that model training and prediction must be performed in a Python 3.12 or equivalent open-source library environment. If the software version is changed, the model performance must be re-verified to avoid version compatibility issues that could lead to increased prediction errors.

[0055] In this embodiment, the CatBoost machine learning algorithm is used to construct a regression model between the acid-base dissociation constant (pKa) values and the key fingerprint feature parameters of all organic compounds in the overall dataset, namely the acid-base dissociation constant (pKa) prediction model (also called the CatBoost pKa prediction model). The accuracy of this prediction model is then verified and its application domain is defined.

[0056] The specific implementation process is as follows: The SMILES expressions of all obtained organic compounds (e.g., 1143 organic compounds) and their corresponding experimentally determined pKa values constitute the original dataset. Using the RDKit tool, a 2048-dimensional counting Morgan fingerprint (C-MF) is generated from the SMILES expression of each compound, forming a modeling dataset containing the C-MF features and the corresponding pKa values. The modeling dataset is divided into a training set, a validation set, and a test set with a data volume ratio of 64:16:20, so that the three proportions always sum to 100%. Each of the three sets independently contains multiple organic compounds with the corresponding counting Morgan fingerprint features and experimental pKa value of each compound; that is, each organic compound corresponds to one experimental pKa value and one 2048-dimensional C-MF feature vector.
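
The 64:16:20 division can be sketched with the standard library alone. The description does not state how compounds are assigned to the three sets, so the seeded random shuffle, the rounding rule, and the function name below are assumptions.

```python
import random


def split_indices(n_samples, ratios=(64, 16, 20), seed=0):
    """Shuffle sample indices and cut them into train / validation / test."""
    if sum(ratios) != 100:
        raise ValueError("ratios must sum to 100")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = round(n_samples * ratios[1] / 100)
    n_test = round(n_samples * ratios[2] / 100)
    n_train = n_samples - n_val - n_test  # remainder goes to training
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1143)
print(len(train_idx), len(val_idx), len(test_idx))
```

With 1143 compounds this rounding yields 731, 183, and 229 compounds, and every compound lands in exactly one set.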

[0057] Model training is performed using the training set to establish the initial acid-base dissociation constant (p). K a The prediction model was developed, and the core hyperparameters of Catboost were tuned using the Optuna framework combined with Bayesian optimization algorithms to improve model fitting performance. Subsequently, recursive feature elimination based on SHAP contributions (SHAP-RFE) was used to iteratively filter the entire feature set, obtain key feature subsets, and reconstruct p. K a For the optimal prediction model, see [link / reference]. Figure 2 As shown.

[0058] The accuracy of the acid-base dissociation constant prediction model was validated using the test set. The coefficient of determination $R^2$ (see Equation (1)) and the root mean square error RMSE (see Equation (2)) are used as statistical indicators of the model's fitting performance, and the squared predictive correlation coefficient $Q^2$ (see Equation (3)) is used to characterize the model's predictive performance.

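[0059] $$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (1)$$

[0060] $$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (2)$$

[0061] $$Q^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (3)$$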

[0062] In the formulas, $\hat{y}_i$ and $y_i$ represent the predicted and experimental values of the acid-base dissociation constant of the $i$-th sample, respectively; $\bar{y}$ represents the average of the experimental values of the acid-base dissociation constant; and $n$ is the number of samples in the dataset. Each sample takes the key molecular features of one organic compound as input and the experimental value of its acid-base dissociation constant as output.
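A minimal sketch of how these statistics can be computed is given below. It assumes the usual definitions (one minus the ratio of residual to total sum of squares for the coefficient of determination, and the square root of the mean squared error for RMSE) and assumes that $Q^2$ uses the same expression evaluated on held-out data. The four value pairs are the predicted and experimental pKa values quoted in Examples 1 to 4; four points only illustrate the arithmetic and say nothing about the model's overall statistics.

```python
import math


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Square root of the mean squared prediction error."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)


# Experimental and predicted pKa values quoted in Examples 1 to 4.
experimental = [4.82, 3.15, 3.90, 10.20]
predicted = [4.81, 3.11, 3.80, 10.07]

example_r2 = r_squared(experimental, predicted)
example_rmse = rmse(experimental, predicted)
```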

[0063] Generally speaking, a model with $R^2$ > 0.60 and $Q^2$ > 0.50 has an acceptable goodness of fit, and $Q^2$ > 0.8 indicates a model with superior performance.

(iv) Optimization to obtain the optimal prediction model for acid-base dissociation constants.

To avoid model overfitting and improve generalization ability, the Bayesian optimization algorithm was used to optimize the key hyperparameters of the acid-base dissociation constant prediction model, giving number of iterations = 657, learning rate = 0.10, tree depth = 9, and leaf node L2 regularization coefficient = 2.11. The resulting optimal prediction model and optimal hyperparameter combination for the acid-base dissociation constant are shown in Table 1.

[0064] Table 1. Hyperparameters of the acid-base dissociation constant prediction model
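For readers who want to reuse the four hyperparameter values stated in the text, they can be collected in a plain mapping. The key names follow the constructor arguments of `CatBoostRegressor`; the correspondence between the patent's wording and those argument names is an assumption, and any other settings of the model are not given in the text.

```python
# Values as stated in the text; the key names are CatBoostRegressor arguments
# (the mapping from the patent's wording to these names is an assumption).
REPORTED_HYPERPARAMETERS = {
    "iterations": 657,       # number of iterations
    "learning_rate": 0.10,   # learning rate
    "depth": 9,              # tree depth
    "l2_leaf_reg": 2.11,     # leaf node L2 regularization coefficient
}


def describe(params):
    """Render the settings as a single readable line."""
    return ", ".join(f"{name}={value}" for name, value in params.items())


summary = describe(REPORTED_HYPERPARAMETERS)
```

With the catboost package installed, the mapping could be passed as `CatBoostRegressor(**REPORTED_HYPERPARAMETERS)`.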

[0065] In the embodiments of this application, the SMILES of the target compound (the compound to be tested) is input into the optimal prediction model to obtain the acid-base dissociation constant value corresponding to the target compound.

[0066] (v) Performance evaluation and verification of the optimal prediction model for acid-base dissociation constant.

[0067] The fitting ability, stability, and predictive ability of the optimal prediction model for acid-base dissociation constants are comprehensively evaluated through three methods: fitting performance analysis, simulated external validation, and external validation.

[0068] Specifically, the fitting performance analysis was conducted on the dataset composed of the acid-base dissociation constants and SMILES data of the 1143 organic compounds. The input was the subset of 81 key features obtained by generating counting Morgan fingerprints for the 1143 organic compounds with RDKit and filtering them with SHAP-RFE; the output was the model-predicted value of the acid-base dissociation constant, which was compared with the corresponding experimental value. The final model fitting results were $R^2$ = 0.964 and RMSE = 0.580, indicating that the model has good fitting ability, that the prediction error shows no systematic bias with respect to the experimental values, and that the statistical performance is significantly better than the existing technology.

[0069] Simulated external validation: The dataset consisting of the acid-base dissociation constants and SMILES data of the 1143 organic compounds was randomly divided into a training set (731 organic compounds), a validation set (183 organic compounds), and a test set (229 organic compounds) in a ratio of 64:16:20. In other words, the training set, validation set, and test set used when building the acid-base dissociation constant prediction model are also used to evaluate and validate the above-mentioned optimal acid-base dissociation constant prediction model.
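The subset sizes quoted here can be reproduced with a short sketch. It assumes a two-stage random split (20% held out as the test set first, then 20% of the remainder as the validation set) with sizes rounded up; the text states only the 64:16:20 ratio and the resulting counts, so the two-stage procedure, the rounding, and the seed are assumptions.

```python
import math
import random


def split_64_16_20(items, seed=0):
    """Two-stage random split: 20% test first, then 20% of the rest as validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(0.20 * len(shuffled))
    n_val = math.ceil(0.20 * (len(shuffled) - n_test))
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Compound indices 0..1142 stand in for the 1143 dataset entries.
train, val, test = split_64_16_20(range(1143))
sizes = (len(train), len(val), len(test))
```

Under these assumptions the sizes come out as 731, 183, and 229, matching the counts stated above.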

[0070] As shown in Figure 1, the fitting results on the training set are $R^2$ = 0.964 and RMSE = 0.580, and the results on the test set are $Q^2$ = 0.890 and RMSE = 1.026. The statistical performance on both the training set and the test set is close to that of the model based on the entire dataset, indicating that the optimal prediction model for acid-base dissociation constants captures an essential correlation between the organic compounds and the key molecular feature parameters rather than a chance correlation, and therefore has excellent statistical stability.
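The acceptance thresholds quoted earlier in the description ($R^2$ > 0.60 and $Q^2$ > 0.50 for an acceptable fit, $Q^2$ > 0.8 for superior performance) can be applied to the statistics reported here. The sketch below is one possible reading of how the two criteria combine; the function name and the three labels are illustrative choices, not terms from the source.

```python
def rate_model(r2, q2):
    """Classify a model using the thresholds quoted in the description."""
    if r2 > 0.60 and q2 > 0.50:
        return "superior" if q2 > 0.80 else "acceptable"
    return "not acceptable"


# Training-set R^2 and test-set Q^2 reported in the text.
rating = rate_model(r2=0.964, q2=0.890)
```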

[0071] External validation: To verify the generalization performance of the acid-base dissociation constant prediction model described in this application, a publicly available acid-base dissociation constant dataset was selected as the external validation set. After removing amphoteric compounds, 6876 single acid/base compounds were obtained. Based on the application domain thresholds of this model, 390 compounds fell within the application domain. Prediction on these compounds gave external validation results of $Q^2$ = 0.890 and RMSE = 0.942, as shown in Figure 3. The verification results indicate that the proposed acid-base dissociation constant prediction model has excellent generalization ability and good external prediction reliability.

[0072] It is worth mentioning that, to avoid prediction bias caused by applying the model to compounds whose structures differ greatly from the training compounds, the application domain of the optimal prediction model for acid-base dissociation constants is also characterized. For example, whether the target compound falls within the application domain can be determined before the target compound is input into the optimal prediction model for acid-base dissociation constants.

[0073] (vi) Model application domain representation.

[0074] To ensure the accuracy and reliability of the acid-base dissociation constant prediction model in this application, and to define precisely the effective chemical space in which the model can make high-confidence predictions, thereby avoiding prediction bias caused by applying the model to compounds with unclear structure-activity relationships, this application adopts the application domain characterization method based on the structure-activity landscape (ADSAL). The method quantitatively characterizes the application domain (AD) of the acid-base dissociation constant prediction model and identifies the range of compounds to which the model can be applied through a two-parameter joint determination, ensuring the scientific validity and practicality of the prediction results. The specific characterization method and determination rules are as follows:

(1) Definition of core characterization parameters: Two key parameters are selected as quantitative indicators of the application domain, namely the molecular similarity density ($r_{s,q}$) and the activity inconsistency ($I_{A,q}$). The molecular similarity density $r_{s,q}$ characterizes the degree of structural similarity between the query compound and the compounds in the model training set: the higher the value of $r_{s,q}$, the more firmly the query compound lies within the chemical space covered by the training set and the more structurally representative it is. The activity inconsistency $I_{A,q}$ characterizes the stability of the local structure-activity relationship around the query compound: the lower the value of $I_{A,q}$, the less likely the compound is to sit on an "activity cliff", where the acid-base dissociation constant changes sharply in response to a minor structural change, and the more stable the structure-activity relationship is.

[0075] (2) Optimization and determination of characterization parameter thresholds: Based on the acid-base dissociation constant prediction model constructed in this application, and using the training set and the independent test set, the optimal ADSAL thresholds for this model were determined through systematic parameter optimization and model performance verification: molecular similarity density threshold $r_{s,T}$ = 1.000 and activity inconsistency threshold $I_{A,T}$ = 1.025. This threshold combination balances structural similarity requirements with structure-activity relationship stability, effectively screening compounds within the chemical space of the training set and ensuring prediction confidence.

[0076] (3) Application domain determination rule: For the target compound to be predicted, its molecular similarity density $r_{s,q}$ and activity inconsistency $I_{A,q}$ are calculated by the ADSAL method. If both conditions $r_{s,q}$ ≥ $r_{s,T}$ (1.000) and $I_{A,q}$ ≤ $I_{A,T}$ (1.025) are satisfied, the target compound is determined to be within the application domain of this acid-base dissociation constant prediction model; if either condition is not met, the compound is determined to be outside the application domain, and the model cannot predict it with high confidence.
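The two-condition rule can be written as a small predicate. The sketch below assumes that the two descriptors have already been computed by some ADSAL implementation, which is not shown; the function and constant names are illustrative. The first query uses the values reported for valeric acid in Example 1, while the other two queries are hypothetical values chosen to violate one condition each.

```python
R_S_THRESHOLD = 1.000  # molecular similarity density threshold from the text
I_A_THRESHOLD = 1.025  # activity inconsistency threshold from the text


def in_application_domain(r_sq, i_aq, r_st=R_S_THRESHOLD, i_at=I_A_THRESHOLD):
    """True only if both conditions hold: r_sq >= r_st and i_aq <= i_at."""
    return r_sq >= r_st and i_aq <= i_at


valeric_acid_ok = in_application_domain(2.027, 0.111)  # values from Example 1
too_dissimilar = in_application_domain(0.80, 0.50)     # hypothetical query
on_activity_cliff = in_application_domain(1.50, 1.30)  # hypothetical query
```

Only compounds for which the predicate is true would be passed on to the prediction model; the others are reported as outside the application domain.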

[0077] (4) Application domain validity verification: Based on the above ADSAL characterization method and determination rule, application domain partitioning and prediction performance verification were performed on the model's internal independent test set and on the external open-source validation set. For the compounds in the internal test set that meet the application domain criteria, the model's predictive coefficient of determination was $Q^2$ = 0.926, achieving high-confidence prediction. For the 390 compounds in the external open-source acid-base dissociation constant validation set that were determined to fall within the model's application domain, the model achieved $Q^2$ = 0.890 and RMSE = 0.942. The verification results show that the application domain defined by this method effectively screens out suitable compounds and significantly improves the reliability and stability of model predictions.

[0078] (5) Model application domain: As defined by the ADSAL method, the application domain of the acid-base dissociation constant prediction model in this application consists of compounds that have sufficient structural similarity to the training set compounds ($r_{s,q}$ ≥ 1.000) and no significant structure-activity "activity cliff" ($I_{A,q}$ ≤ 1.025). The model can accurately predict the acid-base dissociation constants of compounds within this range, specifically including various organic acidic substances such as carboxylic acids, phenols, and nitro-substituted aromatic acidic compounds.

[0079] Unless otherwise defined, the technical terms used in the following embodiments have the same meaning as commonly understood by those skilled in the art to which this application pertains. Unless otherwise specified, the experimental reagents used in the following embodiments are all conventional biochemical reagents; the raw materials, instruments, and equipment used in the following embodiments can all be obtained commercially or through existing methods; the amounts of experimental reagents used are, unless otherwise specified, the amounts used in conventional experimental operations; and the experimental methods used are, unless otherwise specified, conventional methods. It should be further noted that the following descriptions are merely exemplary and not intended to limit the scope of this application.

[0080] Example 1: pKa prediction of valeric acid

The 2D molecular structure of valeric acid (pentanoic acid) is shown in formula (I) below:

Formula (I)

Based on the experimental values of the acid-base dissociation constants of the 1143 organic compounds obtained in this application, 81 key fingerprint features were screened from the full counting Morgan fingerprint (C-MF, radius 1, length 2048) using the recursive feature elimination method based on SHAP contributions, and the optimal prediction model for the acid-base dissociation constant was constructed. The pKa prediction process for valeric acid based on this optimal prediction model includes:

Key molecular feature calculation: The counting Morgan fingerprint was computed with the AllChem.GetHashedMorganFingerprint function of the RDKit library, with a radius of 1 and a feature length of 2048 as parameters. The recursive feature elimination method based on SHAP contributions was then used to extract the 81 key fingerprint features that are highly correlated with the pKa value. These features encode the chemical information of key substructures in the molecule, such as the carboxyl group and the alkyl chain, and provide the core structural feature inputs for the model.

[0081] pKa prediction: The above 81 key counting Morgan fingerprint features were input into the optimal prediction model for acid-base dissociation constants, and the model output the pKa of valeric acid. The predicted pKa value was 4.81 and the experimental pKa value was 4.82, giving an absolute prediction error of 0.014 and excellent prediction accuracy.

[0082] Furthermore, in Example 1, the application domain of the model was checked: using the ADSAL method, the molecular similarity density $r_{s,q}$ and the activity inconsistency $I_{A,q}$ of valeric acid relative to the training set compounds were calculated. The results were $r_{s,q}$ = 2.027 ≥ $r_{s,T}$ (1.000) and $I_{A,q}$ = 0.111 ≤ $I_{A,T}$ (1.025), which satisfy the ADSAL application domain criteria and further show that valeric acid is within the model's application domain and can be accurately predicted.

[0083] Example 2: pKa prediction of phenoxyacetic acid

The 2D molecular structure of phenoxyacetic acid is shown in formula (II) below:

Formula (II)

Based on the experimental values of the acid-base dissociation constants of the 1143 organic compounds obtained in this application, 81 key fingerprint features were screened from the full counting Morgan fingerprint (C-MF, radius 1, length 2048) using the recursive feature elimination method based on SHAP contributions, and the optimal prediction model for the acid-base dissociation constant was constructed. The pKa prediction process for phenoxyacetic acid based on this optimal prediction model includes:

Key molecular feature calculation: The counting Morgan fingerprint was computed with the AllChem.GetHashedMorganFingerprint function of the RDKit library, with a radius of 1 and a feature length of 2048 as parameters. The recursive feature elimination method based on SHAP contributions was then used to extract the 81 key fingerprint features that are highly correlated with the pKa value. These features encode the chemical information of key substructures in the molecule, such as the carboxyl group, and provide the core structural feature inputs for the model.

[0084] pKa prediction: The above 81 key counting Morgan fingerprint features were input into the optimal prediction model for acid-base dissociation constants, and the model output the pKa of phenoxyacetic acid. The predicted pKa value was 3.11 and the experimental pKa value was 3.15, giving an absolute prediction error of 0.041 and excellent prediction accuracy.

[0085] Furthermore, in Example 2, the application domain of the model was checked: using the ADSAL method, the molecular similarity density $r_{s,q}$ and the activity inconsistency $I_{A,q}$ of phenoxyacetic acid relative to the training set compounds were calculated. The results were $r_{s,q}$ = 3.201 ≥ $r_{s,T}$ (1.000) and $I_{A,q}$ = 0.305 ≤ $I_{A,T}$ (1.025), which satisfy the ADSAL application domain criteria and further show that phenoxyacetic acid is within the model's application domain and can be accurately predicted.

[0086] Example 3: pKa prediction of 4-fluorobenzoic acid

The 2D molecular structure of 4-fluorobenzoic acid is shown in formula (III) below:

Formula (III)

Based on the experimental values of the acid-base dissociation constants of the 1143 organic compounds obtained in this application, 81 key fingerprint features were screened from the full counting Morgan fingerprint (C-MF, radius 1, length 2048) using the recursive feature elimination method based on SHAP contributions, and the optimal prediction model for the acid-base dissociation constant was constructed. The pKa prediction process for 4-fluorobenzoic acid based on this optimal prediction model includes:

Key molecular feature calculation: The counting Morgan fingerprint was computed with the AllChem.GetHashedMorganFingerprint function of the RDKit library, with a radius of 1 and a feature length of 2048 as parameters. The recursive feature elimination method based on SHAP contributions was then used to extract the 81 key fingerprint features that are highly correlated with the pKa value. These features encode the chemical information of key substructures in the molecule, such as the carboxyl group, and provide the core structural feature inputs for the model.

[0087] pKa prediction: The above 81 key counting Morgan fingerprint features were input into the optimal prediction model for acid-base dissociation constants, and the model output the pKa of 4-fluorobenzoic acid. The predicted pKa value was 3.80 and the experimental pKa value was 3.90, giving an absolute prediction error of 0.096 and excellent prediction accuracy.

[0088] Furthermore, in Example 3, the application domain of the model was checked: using the ADSAL method, the molecular similarity density $r_{s,q}$ and the activity inconsistency $I_{A,q}$ of 4-fluorobenzoic acid relative to the training set compounds were calculated. The results were $r_{s,q}$ = 3.740 ≥ $r_{s,T}$ (1.000) and $I_{A,q}$ = 0.475 ≤ $I_{A,T}$ (1.025), which satisfy the ADSAL application domain criteria and further show that 4-fluorobenzoic acid is within the model's application domain and can be accurately predicted.

[0089] Example 4: pKa prediction of 2,3-dihydro-1H-inden-4-ol

The 2D molecular structure of 2,3-dihydro-1H-inden-4-ol is shown in formula (IV) below:

Formula (IV)

Based on the experimental values of the acid-base dissociation constants of the 1143 organic compounds obtained in this application, 81 key fingerprint features were screened from the full counting Morgan fingerprint (C-MF, radius 1, length 2048) using the recursive feature elimination method based on SHAP contributions, and the optimal prediction model for the acid-base dissociation constant was constructed. The pKa prediction process for 2,3-dihydro-1H-inden-4-ol based on this optimal prediction model includes:

Key molecular feature calculation: The counting Morgan fingerprint was computed with the AllChem.GetHashedMorganFingerprint function of the RDKit library, with a radius of 1 and a feature length of 2048 as parameters. The recursive feature elimination method based on SHAP contributions was then used to extract the 81 key fingerprint features that are highly correlated with the pKa value. These features encode the chemical information of key substructures in the molecule, such as the phenolic hydroxyl group, and provide the core structural feature inputs for the model.

[0090] pKa prediction: The above 81 key counting Morgan fingerprint features were input into the optimal prediction model for acid-base dissociation constants, and the model output the pKa of 2,3-dihydro-1H-inden-4-ol. The predicted pKa value was 10.07 and the experimental pKa value was 10.20, giving an absolute prediction error of 0.135 and excellent prediction accuracy.

[0091] Furthermore, in Example 4, the application domain of the model was checked: using the ADSAL method, the molecular similarity density $r_{s,q}$ and the activity inconsistency $I_{A,q}$ of 2,3-dihydro-1H-inden-4-ol relative to the training set compounds were calculated. The results were $r_{s,q}$ = 1.737 ≥ $r_{s,T}$ (1.000) and $I_{A,q}$ = 0.105 ≤ $I_{A,T}$ (1.025), which satisfy the ADSAL application domain criteria and further show that 2,3-dihydro-1H-inden-4-ol is within the model's application domain and can be accurately predicted.
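The absolute errors reported in Examples 1 to 4 are given to three decimals, while the predicted and experimental pKa values are quoted to two. The short check below, a sketch that uses only the numbers quoted in the examples, confirms that each reported error agrees with the difference of the quoted values within the rounding tolerance; the tolerance of 0.01 is an assumption that follows from each quoted value being rounded by at most 0.005.

```python
EXAMPLES = [
    # (compound, predicted pKa, experimental pKa, reported absolute error)
    ("valeric acid", 4.81, 4.82, 0.014),
    ("phenoxyacetic acid", 3.11, 3.15, 0.041),
    ("4-fluorobenzoic acid", 3.80, 3.90, 0.096),
    ("2,3-dihydro-1H-inden-4-ol", 10.07, 10.20, 0.135),
]


def consistent(predicted, experimental, reported, tol=0.01):
    """Check the reported error against the difference of the quoted values."""
    return abs(abs(predicted - experimental) - reported) <= tol


all_consistent = all(consistent(p, e, r) for _, p, e, r in EXAMPLES)

# Mean of the four reported absolute errors.
mean_reported_error = sum(r for _, _, _, r in EXAMPLES) / len(EXAMPLES)
```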

[0092] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application.

Claims

1. A rapid method for predicting the acid-base dissociation constant of organic compounds, characterized in that it includes the following steps:
(1) Construct an acid-base dissociation constant dataset and divide the dataset into a training set, a validation set, and a test set, each of which contains multiple organic compounds together with the experimental value of the acid-base dissociation constant and the SMILES expression of each organic compound;
(2) Using the RDKit cheminformatics toolkit, generate a counting Morgan fingerprint from the SMILES expression of each organic compound as the input feature of the model;
(3) Using the categorical boosting tree (CatBoost) machine learning algorithm, construct the acid-base dissociation constant prediction model of the organic compounds based on the experimental values of the acid-base dissociation constants of the organic compounds and the corresponding counting Morgan fingerprints, and verify the accuracy of this prediction model;
(4) Use recursive feature elimination based on SHAP contributions to reduce the dimensionality of the counting Morgan fingerprint and select the key feature subset, and reconstruct the optimal prediction model of the acid-base dissociation constant based on the key feature subset;
(5) Define the application domain of the optimal prediction model of the acid-base dissociation constant by the application domain characterization method based on the structure-activity landscape, and determine the application domain judgment thresholds;
(6) Obtain the simplified molecular-input line-entry system (SMILES) expression of the target compound, generate the corresponding counting Morgan fingerprint according to step (2), and first determine by the thresholds of step (5) whether the target compound is within the model's application domain; if so, input the counting Morgan fingerprint into the optimal prediction model of the acid-base dissociation constant to obtain the predicted value of the acid-base dissociation constant of the target compound.

2. The rapid prediction method as described in claim 1, characterized in that the organic compounds include at least one of the following: meta-/para-substituted phenols, ortho-substituted phenols containing intramolecular hydrogen bonds, ortho-substituted phenols without intramolecular hydrogen bonds, meta-/para-substituted aromatic carboxylic acids, ortho-substituted aromatic carboxylic acids, aliphatic carboxylic acids, anilines, primary amines, secondary amines, tertiary amines, meta-/para-substituted pyridines, ortho-substituted pyridines, pyrimidines, imidazoles and benzimidazoles, and quinolines; and/or the number of organic compounds is 1143.

3. The rapid prediction method as described in claim 1, characterized in that the acid-base dissociation constant values of the organic compounds are all determined under the same conditions.

4. The rapid prediction method as described in claim 1, characterized in that the acid-base dissociation constant values of the organic compounds range from -5.00 to 12.23.

5. The rapid prediction method as described in claim 1, characterized in that, in step (2), the counting Morgan fingerprint is generated with the hashed Morgan fingerprint generation function of RDKit, with a search radius r of 1 and a bit vector length nbits of 2048.

6. The rapid prediction method as described in claim 1, characterized in that the feature subset is selected based on a mechanistic analysis of the acid-base dissociation constants of the organic compounds combined with the recursive feature elimination method based on SHAP contributions.

7. The rapid prediction method as described in claim 1, characterized in that, in step (3), the construction and accuracy verification of the prediction model for the acid-base dissociation constant of the organic compounds includes: training the model with the training set and the validation set to establish the prediction model for the acid-base dissociation constant of the organic compounds; and verifying the accuracy of this prediction model with the test set.

8. The rapid prediction method as described in claim 7, characterized in that the coefficient of determination $R^2$, the squared predictive correlation coefficient $Q^2$, and the root mean square error RMSE are used as statistical indicators to evaluate the accuracy of the model and characterize its fitting performance; and/or the data volume ratio of the training set, the validation set, and the test set is set to 64:16:20, so that the three proportions always sum to 100%.

9. The rapid prediction method as described in claim 7, characterized in that, in step (4), the key hyperparameters of the optimal prediction model for the acid-base dissociation constant of the organic compounds include: number of iterations = 657, learning rate = 0.10, tree depth = 9, and leaf node L2 regularization coefficient = 2.11.

10. The rapid prediction method as described in claim 1, characterized in that the rapid prediction method further includes: comprehensively evaluating the fitting ability, stability, and predictive ability of the optimal prediction model for the acid-base dissociation constant of the organic compounds using three methods: fitting performance analysis, simulated external validation, and external validation.