Method, device and apparatus for testing hydrophobicity of molecules

By using a machine learning-based hydrophobicity prediction model to rapidly assess the hydrophobicity of molecules, this technology overcomes the problem of low efficiency in existing technologies and enables rapid screening and evaluation of large batches of molecules.

CN120808965B · Active · Publication Date: 2026-06-23 · CONTEMPORARY AMPEREX TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: CONTEMPORARY AMPEREX TECHNOLOGY CO LTD
Filing Date: 2024-10-30
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Current technologies for evaluating and screening the hydrophobicity of molecules are inefficient, making it difficult to process large batches of molecules quickly and failing to meet the needs of materials research and development.

Method used

A trained machine learning hydrophobicity prediction model takes a molecular structure file as input, extracts target features from it, and directly predicts the hydrophobicity test result.

Benefits of technology

It enables rapid evaluation of the hydrophobicity of large batches of molecules, improves screening efficiency, reduces workload, and ensures the accuracy of molecular structure files.


Abstract

The application provides a method, device and equipment for testing the hydrophobicity of a molecule, and belongs to the technical field of machine learning. The method comprises the following steps: obtaining a molecular structure file corresponding to a to-be-tested molecule; wherein the molecular structure file is used to represent the structure of the to-be-tested molecule; inputting the molecular structure file into a pre-trained hydrophobicity prediction model to obtain a hydrophobicity test result of the to-be-tested molecule output by the hydrophobicity prediction model; wherein the hydrophobicity prediction model is used to extract target features of the to-be-tested molecule from the molecular structure file, and predict the hydrophobicity test result of the to-be-tested molecule according to the target features; and the hydrophobicity test result is used to represent the hydrophobicity of the to-be-tested molecule. The application directly uses the trained hydrophobicity prediction model to perform property prediction in combination with machine learning, so that the hydrophobicity test result of the molecule can be quickly obtained, and material researchers can quickly evaluate a large number of molecules and screen out suitable molecules.

Description

Technical Field

[0001] This application belongs to the field of machine learning technology, and in particular relates to a method, apparatus and device for testing the hydrophobicity of molecules.

Background Technology

[0002] The hydrophobicity of a molecule generally refers to its physical property of repelling water, and the term can also denote the degree to which the molecule repels water. The required degree of hydrophobicity varies with the application, so how to evaluate and screen molecules with a given degree of hydrophobicity is a problem that urgently needs to be solved.

[0003] Currently, related technologies typically analyze each molecule individually and determine its hydrophobicity through complex calculations.

[0004] However, the number of molecules that materials researchers need to evaluate and screen is usually large, and the efficiency of evaluation and screening using the above methods is low.

Summary of the Invention

[0005] In view of the above technical problems, the embodiments of this application provide a method, apparatus and device for testing the hydrophobicity of molecules. By using a trained, machine learning-based hydrophobicity prediction model to predict the property directly, the hydrophobicity test results of molecules can be obtained quickly, which makes it convenient for materials researchers to quickly evaluate a large number of molecules and screen out suitable ones.

[0006] In a first aspect, embodiments of this application provide a method for testing the hydrophobicity of molecules, the method comprising:

[0007] Obtain the molecular structure file corresponding to the molecule to be tested; wherein, the molecular structure file is used to characterize the structure of the molecule to be tested;

[0008] The molecular structure file is input into a pre-trained hydrophobicity prediction model to obtain the hydrophobicity test result corresponding to the molecule under test output by the hydrophobicity prediction model; wherein, the hydrophobicity prediction model is used to extract the target features of the molecule under test from the molecular structure file, and predict the hydrophobicity test result corresponding to the molecule under test based on the target features, and the hydrophobicity test result is used to characterize the degree of hydrophobicity of the molecule under test.

[0009] In this embodiment, a molecular structure file characterizing the structure of the molecule to be tested is first obtained. This file is then input into a trained hydrophobicity prediction model. The model extracts target features of the molecule from the molecular structure file and predicts the hydrophobicity test result based on these features, thus characterizing the degree of hydrophobicity of the molecule. Compared to related technologies that analyze each molecule individually and calculate its hydrophobicity using complex methods, this application combines machine learning and directly uses a trained hydrophobicity prediction model for property prediction. This allows for rapid acquisition of molecular hydrophobicity test results, facilitating materials researchers to quickly evaluate large batches of molecules and select suitable ones.

[0010] In some embodiments, obtaining the molecular structure file corresponding to the molecule to be tested includes:

[0011] Based on the pre-input N end groups and M intermediate molecular fragments, generate the molecular structure file corresponding to the molecule to be tested;

[0012] Where N and M are both integers greater than 0.

[0013] In this embodiment, a large number of molecular structure files can be generated based on pre-inputted N end groups and M intermediate molecular fragments. The molecules corresponding to these molecular structure files are then used as test molecules for the aforementioned hydrophobicity test. Compared to obtaining a large number of molecules from other literature, which would result in a large workload and could not guarantee the correctness of the molecular structure files, this application can directly generate molecular structure files corresponding to a large number of molecules based on the end groups and intermediate molecular fragments input by the user. This effectively reduces the workload of obtaining a large number of molecules and can also effectively guarantee the correctness of the generated molecular structure files.

[0014] In some embodiments, prior to inputting the molecular structure file into a pre-trained hydrophobicity prediction model, the method further includes:

[0015] Based on the pre-input end groups and intermediate molecular fragments, multiple molecular structure files are generated; wherein, the multiple molecular structure files respectively characterize the structures of multiple molecules to be tested;

[0016] By performing DFT calculations on the multiple molecular structure files, LogP data corresponding to the multiple molecules to be tested are obtained; wherein, the LogP data is used to characterize the hydrophobicity of the multiple molecules to be tested.

[0017] For each of the plurality of molecular structure files, the following steps are performed: feature extraction is performed on the molecular structure file, and the extracted X candidate features are used as the target features; where X is an integer greater than 0;

[0018] A training dataset is generated based on the target features and corresponding LogP data of the multiple molecular structure files.

[0019] The pre-set candidate model is trained using the training dataset, and the trained candidate model is used as the hydrophobicity prediction model.

[0020] This application provides a specific implementation of training a hydrophobicity prediction model. First, a training dataset is obtained. Specifically, multiple molecular structure files are generated based on pre-inputted end-group and intermediate molecular fragments to characterize the structures of multiple test molecules, ensuring data consistency. Then, DFT calculations are performed on the multiple molecular structure files to obtain the LogP data corresponding to each test molecule, representing the hydrophobicity of each molecule. Feature extraction is also performed on each molecular structure file, and X extracted candidate features are used as target features. A training dataset is then generated based on the target features and corresponding LogP data of the multiple molecular structure files. This training dataset is then used to train a pre-set candidate model. Specifically, supervised learning is performed using the target features as input and the LogP data as the output target, and the trained candidate model is used as the hydrophobicity prediction model.

[0021] In some embodiments, the feature extraction of the molecular structure file, wherein the extracted X candidate features are used as the target features, includes:

[0022] Feature extraction is performed on the molecular structure file to obtain the X extracted candidate features;

[0023] When X is an integer greater than 2, the candidate model is used to perform feature importance analysis on the X candidate features to determine the contribution of each of the X candidate features to the candidate model.

[0024] According to the order of contribution from largest to smallest, at least two features are selected from the X candidate features as the target features.

[0025] In this embodiment, when the number of extracted candidate features is large, the candidate model can be used to perform feature importance analysis on the candidate features to determine the contribution of each of the X candidate features to the candidate model. Then, at least two features are selected from the X candidate features in descending order of contribution as target features for model training. By training on the features with the greatest contribution rather than on all features, this application minimizes the amount of training data while preserving the model training effect.

[0026] In some embodiments, generating a training dataset based on the target features and corresponding LogP data of the plurality of molecular structure files includes:

[0027] For each of the plurality of molecular structure files, at least two features in the target features corresponding to the molecular structure file are standardized to obtain at least two standardized features after processing.

[0028] The training dataset consists of at least two standardized features and the corresponding LogP data corresponding to the plurality of molecular structure files.

[0029] In this embodiment of the application, when generating the training dataset, for each of the multiple molecular structure files, the application first standardizes at least two features in the target features corresponding to the molecular structure file, and uses all the standardized features and the corresponding LogP data as the training dataset. By standardizing the features, the influence of different dimensions between different features can be eliminated, making different features comparable, thereby improving the effect of model training and the prediction accuracy of the trained model.
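The standardization step can be sketched as follows. This is a minimal illustration that assumes z-score scaling (subtract the column mean, divide by the column standard deviation); the application does not state which standardization is used, and the function name `standardize_columns` and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def standardize_columns(rows):
    """Z-score each feature column: (value - column mean) / column std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; divide by 1.0 so it maps to 0.0.
    stds = [pstdev(col) or 1.0 for col in columns]
    scaled = [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]
    return scaled, means, stds

# Hypothetical target features on very different scales (one row per molecule).
features = [[100.0, 0.1], [200.0, 0.3], [300.0, 0.5]]
scaled, means, stds = standardize_columns(features)
```

After scaling, both columns span the same range even though the raw values differ by three orders of magnitude. The returned `means` and `stds` would need to be reused when scaling new molecules at prediction time so that training and inference see features on the same scale.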

[0030] In some embodiments, the candidate model is obtained through the following steps:

[0031] Obtain at least two pre-set machine learning models; wherein the at least two machine learning models employ different machine learning methods and / or model parameters;

[0032] The training dataset is used to train the at least two machine learning models respectively, and the output results of the trained machine learning models are obtained.

[0033] The output results of the at least two machine learning models are evaluated using pre-set model evaluation metrics, and the machine learning model with the best evaluation result is selected as the candidate model.

[0034] In this embodiment, at least two machine learning models using different machine learning methods and / or model parameters are pre-set, and these machine learning models are trained separately using the generated training dataset. The output results of the trained machine learning models are obtained, and these output results can be evaluated according to the model evaluation index. The machine learning model with the best evaluation result is selected as a candidate model for subsequent training, which helps to improve the prediction accuracy of the trained hydrophobicity prediction model.
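The selection of a candidate model can be sketched as follows. The two toy models (a mean baseline and a single-feature least-squares fit), the use of RMSE on held-out data as the evaluation metric, and all names and data values are assumptions made for illustration; the application leaves the machine learning methods and the metric open.

```python
import math
from statistics import mean

def fit_mean_model(xs, ys):
    """Baseline: always predict the mean training LogP."""
    avg = mean(ys)
    return lambda x: avg

def fit_linear_model(xs, ys):
    """Ordinary least squares on a single feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept

def rmse(model, xs, ys):
    """Root mean squared error of the model on (xs, ys)."""
    return math.sqrt(mean([(model(x) - y) ** 2 for x, y in zip(xs, ys)]))

def select_candidate(trainers, train, valid):
    """Train every model and return (name, model, score) with the lowest RMSE."""
    results = []
    for name, trainer in trainers.items():
        model = trainer(*train)
        results.append((rmse(model, *valid), name, model))
    score, name, model = min(results, key=lambda r: r[0])
    return name, model, score

# Hypothetical data: one feature per molecule and its LogP label.
train = ([0.0, 1.0, 2.0, 3.0], [0.1, 1.1, 1.9, 3.0])
valid = ([4.0, 5.0], [4.1, 4.9])
best_name, best_model, best_rmse = select_candidate(
    {"mean": fit_mean_model, "linear": fit_linear_model}, train, valid
)
```

Scoring on data that was not used for fitting keeps the comparison from simply rewarding the model that memorizes the training set best.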

[0035] In some embodiments, obtaining LogP data corresponding to the plurality of molecules to be tested by performing DFT calculations on the plurality of molecular structure files includes:

[0036] For each of the plurality of molecular structure files, perform the following steps:

[0037] Obtain ΔG_oct and ΔG_w corresponding to the molecular structure file;

[0038] According to ΔG_oct and ΔG_w, the LogP data corresponding to the molecular structure file is calculated using formula (1):

[0039]

[0040] Wherein ΔG_oct represents the free energy of the molecule to be tested in n-octanol, ΔG_w represents the free energy of the molecule to be tested in water, R represents the standard molar gas constant, and T represents the temperature in kelvin.

[0041] In this embodiment of the application, a specific implementation method is provided for calculating the LogP data corresponding to the molecule to be tested. The LogP data corresponding to a certain molecule to be tested can be obtained by formula (1) to characterize its hydrophobicity.
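Formula (1) itself is not reproduced in the text above. A commonly used thermodynamic relation built from the same four symbols is LogP = (ΔG_w − ΔG_oct) / (RT ln 10), and the sketch below assumes that form; it may differ from the formula in the original filing. The function name, the default temperature and the example energies are hypothetical.

```python
import math

R = 8.314462618  # molar gas constant in J/(mol*K)

def logp_from_free_energies(dg_oct, dg_w, temperature=298.15):
    """Return LogP from free energies in n-octanol and water (both in J/mol).

    Assumed form of formula (1): LogP = (dG_w - dG_oct) / (R * T * ln 10).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive (kelvin)")
    return (dg_w - dg_oct) / (R * temperature * math.log(10))

# Hypothetical values: the molecule is 10 kJ/mol more stable in n-octanol.
example_logp = logp_from_free_energies(dg_oct=-30_000.0, dg_w=-20_000.0)
```

Under this assumed form, a molecule that is more stable in n-octanol than in water gets a positive LogP, which agrees with the convention that a larger LogP indicates a more hydrophobic molecule.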

[0042] In a second aspect, embodiments of this application also provide an apparatus for testing the hydrophobicity of molecules, the apparatus comprising:

[0043] An acquisition module is used to acquire the molecular structure file corresponding to the molecule to be tested; wherein, the molecular structure file is used to characterize the structure of the molecule to be tested;

[0044] The prediction module is used to input the molecular structure file into a pre-trained hydrophobicity prediction model to obtain the hydrophobicity test result corresponding to the molecule under test output by the hydrophobicity prediction model; wherein, the hydrophobicity prediction model is used to extract the target features of the molecule under test from the molecular structure file, and predict the hydrophobicity test result corresponding to the molecule under test based on the target features, and the hydrophobicity test result is used to characterize the degree of hydrophobicity of the molecule under test.

[0045] In a third aspect, embodiments of this application also provide a device for testing the hydrophobicity of molecules, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the method for testing the hydrophobicity of molecules as described in the first aspect.

[0046] In a fourth aspect, embodiments of this application also provide a computer-readable storage medium storing computer instructions that, when executed on a computer, cause the computer to perform the method for testing the hydrophobicity of molecules as described in the first aspect.

[0047] In a fifth aspect, embodiments of this application also provide a computer program product, the computer program product including a computer program which, when run on a computer, implements the method for testing the hydrophobicity of molecules as described in the first aspect.

Brief Description of the Drawings

[0048] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the description of the embodiments or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0049] Figure 1 is the first schematic flowchart of the molecular hydrophobicity testing method proposed in the embodiments of this application;

[0050] Figure 2 is the second schematic flowchart of the molecular hydrophobicity testing method proposed in the embodiments of this application;

[0051] Figure 3 is the third schematic flowchart of the molecular hydrophobicity testing method proposed in the embodiments of this application;

[0052] Figure 4 is the fourth schematic flowchart of the molecular hydrophobicity testing method proposed in the embodiments of this application;

[0053] Figure 5 is the fifth schematic flowchart of the molecular hydrophobicity testing method proposed in the embodiments of this application;

[0054] Figure 6 is the sixth schematic flowchart of the molecular hydrophobicity testing method proposed in the embodiments of this application;

[0055] Figure 7 is a schematic diagram of the system used in the molecular hydrophobicity testing method proposed in the embodiments of this application;

[0056] Figure 8 is the seventh schematic flowchart of the molecular hydrophobicity testing method proposed in the embodiments of this application;

[0057] Figure 9 is a schematic diagram of feature importance in the molecular hydrophobicity testing method proposed in the embodiments of this application;

[0058] Figure 10 is a schematic diagram of the relationship between the predicted and true values of LogP data in the molecular hydrophobicity testing method proposed in the embodiments of this application;

[0059] Figure 11 is a schematic structural diagram of the molecular hydrophobicity testing device proposed in the embodiments of this application.

Detailed Description

[0060] The embodiments of the technical solution of this application will now be described in detail with reference to the accompanying drawings. These embodiments are only used to more clearly illustrate the technical solution of this application and are therefore merely examples, and should not be used to limit the scope of protection of this application.

[0061] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains; the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the application; the terms “comprising” and “having”, and any variations thereof, in the specification, claims, and foregoing description of the drawings are intended to cover non-exclusive inclusion.

[0062] In the description of the embodiments of this application, technical terms such as "first" and "second" are used only to distinguish different objects and should not be construed as indicating or implying relative importance or implicitly specifying the number, specific order, or primary and secondary relationship of the indicated technical features. In the description of the embodiments of this application, "multiple" means two or more, unless otherwise explicitly defined.

[0063] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.

[0064] In the description of the embodiments in this application, the term "and / or" is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. Additionally, the character " / " in this document generally indicates that the preceding and following related objects have an "or" relationship.

[0065] In related technologies, each molecule is usually analyzed individually, and its hydrophobicity is determined through relatively complex calculations. For example, the degree of hydrophobicity of each molecule is calculated using the DFT calculation method. However, this method is extremely inefficient in assessing the degree of hydrophobicity of molecules, which is not conducive to material researchers evaluating and screening a large number of molecules that meet certain hydrophobicity requirements.

[0066] To address the aforementioned issues, this application provides a method, apparatus, and device for testing the hydrophobicity of molecules. By combining machine learning, a pre-trained hydrophobicity prediction model can be directly used to predict properties, enabling rapid acquisition of molecular hydrophobicity test results. This facilitates materials researchers in quickly evaluating large batches of molecules and selecting suitable ones.

[0067] The following provides a detailed description of the methods, apparatus, and equipment for testing the hydrophobicity of molecules provided in the embodiments of this application.

[0068] Figure 1 is the first schematic flowchart of the molecular hydrophobicity testing method proposed in the embodiments of this application. As shown in Figure 1, the method includes steps S101-S102:

[0069] Step S101: Obtain the molecular structure file corresponding to the molecule to be tested;

[0070] The molecular structure file is used to characterize the structure of the molecule to be tested.

[0071] It should be noted that the molecule to be tested can be, for example, a small organic molecule, and this application does not impose any restrictions on it.

[0072] In some embodiments, the molecular structure file corresponding to the molecule to be tested can be obtained from relevant literature or files, or it can be generated by software. This application does not impose any restrictions on this.

[0073] Step S102: Input the molecular structure file into the pre-trained hydrophobicity prediction model to obtain the hydrophobicity test result corresponding to the molecule to be tested output by the hydrophobicity prediction model.

[0074] The hydrophobicity prediction model is used to extract target features of the molecule to be tested from the molecular structure file, and predict the hydrophobicity test result corresponding to the molecule to be tested based on the target features. The hydrophobicity test result is used to characterize the degree of hydrophobicity of the molecule to be tested.

[0075] It should be noted that the target features extracted by the above hydrophobicity prediction model may include one or more features, which are not limited in this application. The target features are specifically used to describe the physical, chemical and other information of the molecule to be tested, so that the subsequent hydrophobicity prediction model can predict the degree of hydrophobicity of the molecule to be tested based on this information.

[0076] Specifically, a molecular structure file representing the structure of the molecule to be tested is first obtained, and then the molecular structure file is input into a trained hydrophobicity prediction model. The model extracts the target features of the molecule to be tested from the molecular structure file, and then predicts the hydrophobicity test result of the molecule to be tested based on the target features, so as to characterize the degree of hydrophobicity of the molecule to be tested.

[0077] In the molecular hydrophobicity testing method provided in this application embodiment, compared with the related technology of analyzing each molecule individually and calculating its hydrophobicity in a complicated way, this application combines machine learning and directly uses a trained hydrophobicity prediction model to predict the properties, which can quickly obtain the molecular hydrophobicity test results. This makes it convenient for material researchers to quickly evaluate a large number of molecules and screen out suitable molecules.
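Steps S101-S102 applied to a batch of molecules can be sketched as follows. The sketch assumes that structures are given as SMILES strings and that the trained model is available as a callable; `toy_predictor` is a deliberately crude stand-in for that model and is not a real hydrophobicity estimate. All names and thresholds are hypothetical.

```python
def screen_molecules(structures, predict_logp, logp_min, logp_max):
    """Predict LogP for each structure and keep those inside [logp_min, logp_max]."""
    results = {s: predict_logp(s) for s in structures}
    hits = [s for s, value in results.items() if logp_min <= value <= logp_max]
    return results, hits

def toy_predictor(smiles):
    """Stand-in for the trained model: NOT a real hydrophobicity estimate."""
    return 0.5 * smiles.count("C") - 1.0 * smiles.count("O")

# Screen three hypothetical molecules for a target LogP window of 1.0 to 3.0.
results, hits = screen_molecules(["CCCC", "CCO", "CCCCCCCC"], toy_predictor, 1.0, 3.0)
```

Because each prediction is a single model call rather than a per-molecule calculation, the cost of screening grows only linearly with the number of structures.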

[0078] In some embodiments, Figure 2 is the second schematic flowchart of the molecular hydrophobicity testing method proposed in the embodiments of this application. As shown in Figure 2, step S101 specifically includes step S1011:

[0079] Step S1011: Generate the molecular structure file corresponding to the molecule to be tested based on the pre-input N end groups and M intermediate molecular fragments;

[0080] Where N and M are both integers greater than 0.

[0081] Specifically, based on the N end groups and M intermediate molecular fragments pre-input by the user, an automated script can generate the molecular structure files corresponding to the molecules to be tested. By executing the automated script, the N end groups and M intermediate molecular fragments are combined into a large number of molecular structure files that characterize different molecules to be tested. The specific content of the automated script is not limited in this application.
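The combination step can be illustrated with a minimal sketch. It assumes, purely for illustration, that end groups are single atoms and that intermediate fragments are linear SMILES strings whose first and last atoms are the attachment points, so that plain string concatenation forms valid bonds. The function name `enumerate_molecules` and the sample fragments are hypothetical and are not taken from the application.

```python
from itertools import product

def enumerate_molecules(end_groups, mid_fragments):
    """Combine every (left end, middle, right end) triple into one SMILES string.

    Assumes linear fragments joined by plain concatenation; real fragments with
    rings or branches would need explicit attachment points.
    """
    if not end_groups or not mid_fragments:
        raise ValueError("N and M must both be greater than 0")
    seen = set()
    molecules = []
    for left, mid, right in product(end_groups, mid_fragments, end_groups):
        smiles = left + mid + right
        if smiles not in seen:  # drops exact string duplicates only
            seen.add(smiles)
            molecules.append(smiles)
    return molecules

# Hypothetical inputs: N = 2 end groups, M = 2 intermediate fragments.
end_groups = ["C", "F"]
mid_fragments = ["CC", "COC"]
molecules = enumerate_molecules(end_groups, mid_fragments)
```

With N end groups and M fragments this yields up to N × M × N strings. Mirror-image strings such as `CCCF` and `FCCC` describe the same molecule and are not merged here; a cheminformatics toolkit could canonicalize them.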

[0082] In this embodiment, a large number of molecular structure files can be generated based on pre-input N end groups and M intermediate molecular fragments. The molecules corresponding to these molecular structure files are used as test molecules for the above-mentioned hydrophobicity test. Compared with obtaining a large number of molecules from other literature, which would generate a large workload and could not guarantee the correctness of the molecular structure files, this application can directly generate molecular structure files corresponding to a large number of molecules based on the end groups and intermediate molecular fragments input by the user. This effectively reduces the workload of obtaining a large number of molecules and can also effectively guarantee the correctness of the generated molecular structure files.

[0083] This application also provides a specific implementation of training the hydrophobicity prediction model. Figure 3 is the third schematic flowchart of the molecular hydrophobicity testing method proposed in the embodiments of this application. As shown in Figure 3, before the hydrophobicity prediction model is used, the method further includes steps S301-S305:

[0084] Step S301: Generate multiple molecular structure files based on the pre-input end groups and intermediate molecular fragments;

[0085] The multiple molecular structure files respectively characterize the structures of multiple molecules to be tested.

[0086] It should be noted that this application does not limit the number of end groups and intermediate molecular fragments. Users can input them according to the actual situation and, by executing a pre-set automated script, generate a large molecular sample space, i.e., multiple molecular structure files that characterize multiple molecules to be tested.

[0087] Step S302: By performing DFT calculations on the multiple molecular structure files, the LogP data corresponding to the multiple molecules to be tested are obtained respectively;

[0088] The LogP data is used to characterize the hydrophobicity of the plurality of molecules to be tested.

[0089] It should be noted that DFT calculations are used to obtain the LogP data of molecules. The LogP data is used to characterize the hydrophobicity of molecules. Generally, the smaller the LogP, the less hydrophobic and the more hydrophilic the molecule is; conversely, the larger the LogP, the more hydrophobic and the less hydrophilic the molecule is.

[0090] Step S303: For each of the plurality of molecular structure files, perform the following steps: extract features from the molecular structure file and use the extracted X candidate features as the target features;

[0091] Where X is an integer greater than 0.

[0092] In some embodiments, the RDKit molecular toolkit can be used to extract features from the molecular structure file; typically, up to 209 features describing the physicochemical information of the molecule can be extracted.

[0093] Step S304: Generate a training dataset based on the target features and corresponding LogP data of the multiple molecular structure files.

[0094] Step S305: Train the pre-set candidate model using the training dataset, and use the trained candidate model as the hydrophobicity prediction model.

[0095] Specifically, for each molecular structure file among multiple molecular structure files, the target feature can be used as the input and the corresponding LogP data as the output target. A mapping relationship between the input and the output target is established, and then the mapping relationships corresponding to all molecular structure files are formed into a set as a training dataset. The candidate model is trained using this training dataset. This training is supervised learning. Finally, the trained candidate model is used as a hydrophobicity prediction model, which facilitates the rapid prediction of the hydrophobicity of molecules in subsequent applications.
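The supervised training described above can be sketched with a deliberately simple model. The sketch assumes a linear model fitted by batch gradient descent on mean squared error; the application does not specify the model type, and the function names, hyperparameters and training values are hypothetical.

```python
def train_linear_model(features, targets, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on mean squared error."""
    n, dim = len(features), len(features[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        # Prediction error for every training molecule with the current parameters.
        errors = [
            sum(w * x for w, x in zip(weights, row)) + bias - y
            for row, y in zip(features, targets)
        ]
        for j in range(dim):
            grad = sum(e * row[j] for e, row in zip(errors, features)) / n
            weights[j] -= lr * grad
        bias -= lr * sum(errors) / n
    return weights, bias

def predict(model, row):
    """Apply the trained (weights, bias) pair to one feature vector."""
    weights, bias = model
    return sum(w * x for w, x in zip(weights, row)) + bias

# Hypothetical training set: target features (input) -> LogP data (output target).
training_set = [
    ((0.0, 0.0), 0.5), ((1.0, 0.0), 2.5), ((0.0, 1.0), -0.5),
    ((1.0, 1.0), 1.5), ((2.0, 1.0), 3.5), ((1.0, 2.0), 0.5),
]
features = [x for x, _ in training_set]
targets = [y for _, y in training_set]
model = train_linear_model(features, targets)
```

Once fitted, `predict(model, row)` plays the role of the hydrophobicity prediction model for a new molecule's target features.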

[0096] This application provides a specific implementation of training a hydrophobicity prediction model. First, a training dataset is obtained. Specifically, multiple molecular structure files are generated based on pre-inputted end-group and intermediate molecular fragments to characterize the structures of multiple test molecules, ensuring data consistency. Then, DFT calculations are performed on the multiple molecular structure files to obtain the LogP data corresponding to each test molecule, representing the hydrophobicity of each molecule. Feature extraction is also performed on each molecular structure file, and X extracted candidate features are used as target features. A training dataset is then generated based on the target features and corresponding LogP data of the multiple molecular structure files. This training dataset is then used to train a pre-set candidate model. Specifically, supervised learning is performed using the target features as input and the LogP data as the output target, and the trained candidate model is used as the hydrophobicity prediction model.

[0097] In some embodiments, Figure 4 is the fourth schematic flowchart of the molecular hydrophobicity testing method proposed in the embodiments of this application. As shown in Figure 4, taking Figure 4 as an embodiment corresponding to Figure 3 as an example, step S303 specifically includes steps S3031-S3033:

[0098] Step S3031: Perform feature extraction on the molecular structure file to obtain the extracted X candidate features.

[0099] Step S3032: When X is an integer greater than 2, perform feature importance analysis on the X candidate features using the candidate model to determine the contribution of each of the X candidate features to the candidate model.

[0100] In some embodiments, X candidate features can be input into the candidate model, and the importance of the X candidate features to the candidate model, i.e. their contribution to the candidate model, can be analyzed by comparing the correlation between the candidate features and the output target quantity.

[0101] Step S3033: Select at least two features from the X candidate features in descending order of contribution as the target features.

[0102] It should be noted that the operation of selecting some features as target features based on contribution can also be performed when X is greater than a preset value. Here, the preset value is an integer greater than or equal to 2, and the specific preset value can be set according to the actual situation.

[0103] It should also be noted that the number of features selected from X candidate features can be set according to the actual situation, and this application does not impose any restrictions on this.

[0104] For example, if the preset value is set to 16, then when the number of extracted candidate features is greater than 16, a feature filtering operation needs to be performed. At this time, for example, 15 features with greater contribution can be selected from the candidate features as target features for subsequent model training.

[0105] In this embodiment, when the number of extracted candidate features is large, for example, when X is greater than 2, the candidate model can be used to perform important feature analysis on multiple candidate features to determine the contribution of each of the X candidate features to the candidate model. Then, at least two features are selected from the X candidate features in descending order of contribution as target features for model training. This application uses some features with greater contribution as target features for model training, rather than using all features for model training, in order to minimize the amount of training data while ensuring the model training effect.
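Steps S3031 to S3033 can be sketched as a simple ranking step. The sketch assumes the contribution of each candidate feature has already been computed by the candidate model; the function name, the feature names, and the contribution values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def select_target_features(
    contributions: Dict[str, float], preset_value: int, n_select: int
) -> List[str]:
    """Keep the n_select highest-contribution features when X exceeds preset_value."""
    if n_select < 2:
        raise ValueError("at least two target features must be selected")
    names = list(contributions)
    if len(names) <= preset_value:
        # No screening needed: all X candidate features become target features.
        return names
    # Rank candidates in descending order of contribution and keep the top ones.
    ranked = sorted(names, key=lambda name: contributions[name], reverse=True)
    return ranked[:n_select]


# Hypothetical contributions for five candidate features.
toy_contributions = {"f_a": 0.05, "f_b": 0.40, "f_c": 0.10, "f_d": 0.30, "f_e": 0.15}
selected = select_target_features(toy_contributions, preset_value=3, n_select=2)
unfiltered = select_target_features(toy_contributions, preset_value=5, n_select=2)
```

With the preset value of 16 and 15 selected features from the example above, the call would be `select_target_features(contributions, preset_value=16, n_select=15)`.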

[0106] In some embodiments, Figure 5 is the fifth schematic flowchart of the molecular hydrophobicity testing method proposed in the embodiments of this application. As shown in Figure 5, taking the case where Figure 5 builds on the embodiment corresponding to Figure 3 as an example, step S304 specifically includes steps S3041-S3042:

[0107] Step S3041: For each of the plurality of molecular structure files, at least two features in the target features corresponding to the molecular structure file are standardized to obtain at least two standardized features.

[0108] Step S3042: Use at least two standardized features and corresponding LogP data corresponding to the plurality of molecular structure files as the training dataset.

[0109] For example, for a given molecular structure file, the 15 features that contribute the most to the model are selected as target features for subsequent model training. These 15 features are denoted as $x_n$ (n = 1, 2, 3, ..., 15). Before forming the training dataset, these features can be preprocessed; specifically, the feature values $x_n$ extracted from the molecules are standardized as follows:

[0110] $$x^{*} = \frac{x - \mu}{\sigma}$$

[0111] Where $x^{*}$ characterizes the standardized feature, each $x_n$ is substituted for $x$ in turn, $\mu$ represents the mean of the 15 features, and $\sigma$ represents the standard deviation of the 15 features.

[0112] Specifically, after $x_n$ is centered by the mean $\mu$ and then scaled by the standard deviation $\sigma$, the data have a mean of 0 and a variance of 1, matching the scale of a standard normal distribution. The processed dataset consists of the input feature values $x^{*} = (x^{*}_1, x^{*}_2, \ldots, x^{*}_{15})$ and the target output value y (hydrophobicity), which is the LogP data obtained above.

[0113] In the embodiments of this application, when generating the training dataset, for each of the multiple molecular structure files, this application first standardizes at least two features in the target features corresponding to the molecular structure file, and uses all the standardized features and the corresponding LogP data as the training dataset. By standardizing the features, the influence of different dimensions between different features can be eliminated, making different features comparable, thereby improving the effect of model training and the prediction accuracy of the trained model.
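The standardization step can be sketched as below. One assumption is made explicit here: the mean and standard deviation are computed per feature column across the molecules, which is the usual convention for removing differences in scale between features; the text does not spell out the axis. The function name and the toy values are hypothetical.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def standardize_columns(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Apply x* = (x - mu) / sigma to every feature column (one row per molecule)."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population standard deviation
    return [
        [
            (value - mu) / sigma if sigma > 0 else 0.0  # constant column maps to 0
            for value, mu, sigma in zip(row, means, stds)
        ]
        for row in rows
    ]


# Three toy molecules with two features on very different scales.
features = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
standardized = standardize_columns(features)
```

After this step both columns have zero mean and unit variance, so neither feature dominates purely because of its units.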

[0114] In some embodiments, Figure 6 is the sixth schematic flowchart of the molecular hydrophobicity testing method proposed in the embodiments of this application. As shown in Figure 6, the above candidate model is obtained through the following steps:

[0115] Step S601: Obtain at least two pre-set machine learning models;

[0116] The at least two machine learning models employ different machine learning methods and / or model parameters.

[0117] It should be noted that machine learning models can employ machine learning methods such as Random Forest (RF), Gradient Boosting Decision Tree (GBDT), and Extreme Gradient Boosting (XGBoost). For tree-based machine learning models, the model parameters are, for example, the number of trees.

[0118] Step S602: Train the at least two machine learning models using the training dataset to obtain the output results of the trained machine learning models.

[0119] Step S603: Evaluate the output results of the at least two machine learning models using pre-set model evaluation metrics, and select the machine learning model with the best evaluation result as the candidate model.

[0120] It should be noted that model evaluation metrics can include, for example, R², MAE (Mean Absolute Error), and RMSE (Root Mean Square Error); where:

[0121] 1) For R², R² ≤ 1; the larger the R², the better the performance of the machine learning model. In the ideal case, when the prediction model is completely accurate, R² equals its maximum value of 1.

[0122] 2) MAE is a non-negative value. The smaller the MAE, the better the performance of the machine learning model.

[0123] 3) Regarding RMSE, the smaller the RMSE, the better the performance of the machine learning model.
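The three metrics follow their standard definitions, which can be written out directly. The function names and the toy values below are illustrative and are not taken from the application.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: average of |y - y_hat|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error: square root of the mean squared residual."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
toy_mae = mae(y_true, y_pred)
toy_rmse = rmse(y_true, y_pred)
toy_r2 = r_squared(y_true, y_pred)
```

A perfect prediction gives MAE = RMSE = 0 and R² = 1, consistent with the properties listed above.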

[0124] For example, models can be trained using three machine learning methods, RF, GBDT and XGBoost, combined with a training dataset (which may include both training and test sets), to obtain the R², MAE and RMSE evaluation results for each model. Table 1 shows the R², MAE and RMSE model evaluation results.

[0125] Table 1. Model evaluation results for R², MAE, and RMSE

[0126]

[0127] It can be seen that the XGBoost model performs best when n_estimators is 100, so this model can be selected as a candidate model for subsequent model training.

[0128] It should be noted that for different molecular designs, the machine learning methods and model parameters of the corresponding optimal models may differ, and the above methods are needed to determine the optimal model as a candidate model.

[0129] In the embodiments of this application, at least two machine learning models using different machine learning methods and / or model parameters are pre-set, and these machine learning models are trained separately using the generated training dataset. The output results of the trained machine learning models are obtained, and these output results can be evaluated according to the model evaluation index. The machine learning model with the best evaluation result is selected as a candidate model for subsequent training, which helps to improve the prediction accuracy of the trained hydrophobicity prediction model.
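Steps S601 to S603 amount to training several models on the same data and keeping the one with the best metric. The sketch below uses two deliberately simple stand-in models (a mean baseline and a one-feature least-squares line) instead of RF, GBDT, or XGBoost so that it runs without third-party libraries, and it ranks them by RMSE only. All names and numbers are hypothetical.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Predictor = Callable[[float], float]
Trainer = Callable[[Sequence[float], Sequence[float]], Predictor]


def fit_mean(xs: Sequence[float], ys: Sequence[float]) -> Predictor:
    """Stand-in model 1: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Predictor:
    """Stand-in model 2: ordinary least-squares line on a single feature."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def select_candidate_model(
    trainers: Dict[str, Trainer],
    train: Tuple[Sequence[float], Sequence[float]],
    test: Tuple[Sequence[float], Sequence[float]],
) -> Tuple[str, Predictor, Dict[str, float]]:
    """Train every model, score it on the held-out data, return the lowest-RMSE one."""
    (x_train, y_train), (x_test, y_test) = train, test
    fitted = {name: trainer(x_train, y_train) for name, trainer in trainers.items()}
    scores = {
        name: rmse(y_test, [model(x) for x in x_test]) for name, model in fitted.items()
    }
    best = min(scores, key=scores.get)
    return best, fitted[best], scores


# Toy data for illustration only.
best_name, best_model, scores = select_candidate_model(
    {"mean_baseline": fit_mean, "linear_1d": fit_line},
    train=([0.0, 1.0, 2.0, 3.0], [0.1, 1.0, 2.1, 2.9]),
    test=([4.0, 5.0], [4.0, 5.1]),
)
```

In a real setting the dictionary of trainers would hold the different methods and parameter settings (for example, different numbers of trees), and the selection rule could combine R², MAE, and RMSE rather than RMSE alone.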

[0130] In some embodiments, step S302 above can also be implemented through the following steps:

[0131] For each of the plurality of molecular structure files, perform the following steps:

[0132] Obtain $\Delta G_{oct}$ and $\Delta G_{w}$ corresponding to the molecular structure file;

[0133] According to $\Delta G_{oct}$ and $\Delta G_{w}$, the LogP data corresponding to the molecular structure file is calculated using formula (1):

[0134]

[0135] Wherein $\Delta G_{oct}$ characterizes the free energy of the molecule to be tested in n-octanol, $\Delta G_{w}$ characterizes the free energy of the molecule to be tested in water, R represents the standard molar gas constant, and T represents the Kelvin temperature.

[0136] Specifically, automated scripts can be written for molecular structure files to perform DFT calculations on their hydrophobicity. This part uses Gaussian quantum chemistry calculation software. Furthermore, by setting calculation parameters, the accuracy and other parameters of the model used in the DFT calculation can be adjusted, thereby obtaining $\Delta G_{oct}$ and $\Delta G_{w}$ corresponding to the molecular structure file. Then, by combining these two parameters with R and T, the LogP corresponding to the molecular structure file can be calculated using formula (1).

[0137] In this application embodiment, a specific implementation method is provided for calculating the LogP data corresponding to the molecule to be tested. The LogP data corresponding to a certain molecule to be tested can be obtained by formula (1) to characterize its hydrophobicity.
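Formula (1) is not reproduced in this text, so the following sketch assumes the standard thermodynamic relation between a water-to-octanol transfer free energy and a base-10 partition coefficient, $\log P = (\Delta G_{w} - \Delta G_{oct}) / (RT \ln 10)$. The sign convention, the J/mol units, the 298.15 K default, and the function name are assumptions of this example rather than details confirmed by the application.

```python
import math

GAS_CONSTANT = 8.314462618  # molar gas constant R in J/(mol*K)


def logp_from_free_energies(
    dg_oct: float, dg_w: float, temperature: float = 298.15
) -> float:
    """Assumed relation: LogP = (dG_w - dG_oct) / (R * T * ln 10), energies in J/mol."""
    if temperature <= 0:
        raise ValueError("temperature must be a positive Kelvin value")
    return (dg_w - dg_oct) / (GAS_CONSTANT * temperature * math.log(10))


# Illustrative check: a molecule that is more stable in n-octanol than in water
# by exactly 2 * R * T * ln(10) should come out at LogP = 2 under this relation.
rt_ln10 = GAS_CONSTANT * 298.15 * math.log(10)
example_logp = logp_from_free_energies(dg_oct=-2.0 * rt_ln10, dg_w=0.0)
```

Under this convention a lower free energy in n-octanol than in water gives a positive LogP, that is, a more hydrophobic molecule. Free energies reported by a quantum chemistry package in other units (for example, Hartree) would need converting to J/mol first.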

[0138] The following example illustrates the method for testing the hydrophobicity of molecules provided in this application.

[0139] Figure 7 is a schematic diagram of the system used in the molecular hydrophobicity testing method proposed in the embodiments of this application. As shown in Figure 7, the system includes the following parts:

[0140] 1) Molecular Generation Module: This module is designed for the automated batch generation of molecular sample spaces. Users can input end groups and intermediate molecular fragments to automatically generate a large number of molecules. Of these, 20% of the generated molecules are randomly selected and denoted as DFT_data (DFT calculation data) for subsequent hydrophobicity DFT calculations; the remaining 80% is denoted as predict_data (predicted data) for subsequent machine learning molecular property predictions to predict the hydrophobicity of these molecules.

[0141] 2) Hydrophobicity DFT Calculation Module: An automated script is written for the molecules generated above to perform DFT calculations on their hydrophobicity. This part uses Gaussian quantum chemistry calculation software, and the calculation formula is as follows:

[0142]

[0143] Wherein $\Delta G_{oct}$ characterizes the free energy of the molecule to be tested in n-octanol, $\Delta G_{w}$ characterizes the free energy of the molecule to be tested in water, R represents the standard molar gas constant, and T represents the Kelvin temperature.

[0144] 3) Machine Learning Module: Based on the molecular hydrophobicity dataset obtained from the above calculations, a machine learning modeling process is performed. Figure 8 is the seventh schematic flowchart of the method for testing the hydrophobicity of molecules proposed in this application. As shown in Figure 8, the method includes:

[0145] <1> Dataset partitioning: The hydrophobic dataset obtained above is partitioned into 80% training set and 20% test set;

[0146] <2> Molecular feature extraction and screening: The RDKit molecular toolkit is used to extract features from the molecular structure files, yielding a total of 209 features that describe the physicochemical information of the molecules. Important feature analysis is performed on these 209 features, and the top 15 features that contribute most to the model are selected for subsequent model training. The important feature values are denoted as $x_n$ (n = 1, 2, 3, ..., 15). Figure 9 is a schematic diagram of feature importance in the molecular hydrophobicity testing method proposed in this application. As shown in Figure 9, the top 15 features that contribute the most to the model are selected.

[0147] <3> Data preprocessing: The feature values $x_n$ extracted and screened from the molecules are standardized as follows:

[0148] $$x^{*} = \frac{x - \mu}{\sigma}$$

[0149] Centering the data $x$ by the mean $\mu$ and then scaling it by the standard deviation $\sigma$ gives data with a mean of 0 and a variance of 1, matching the scale of a standard normal distribution. The processed dataset consists of the input feature values $x^{*} = (x^{*}_1, x^{*}_2, \ldots, x^{*}_{15})$ and the target output value y (hydrophobicity), which is the LogP data obtained above.

[0150] <4> Batch model training and validation: Models are trained using three machine learning methods, RF, GBDT and XGBoost, combined with the training dataset. The R², MAE and RMSE evaluation metrics for each model are obtained, as shown in Table 1 above.

[0151] <5> Save the best model: Select the best-performing model from the trained models and use it as the model for predicting new molecules in the future. As shown in Table 1, the XGBoost model performs best when n_estimators is 100. Figure 10 is a schematic diagram of the relationship between the predicted and true values of the LogP data in the molecular hydrophobicity testing method proposed in this application. As shown in Figure 10, the predicted values of the LogP data are close to the true values for both the training and test sets, which verifies the performance of the optimal model.

[0152] 4) Predict new molecular properties module: The generated predict_data uses the best machine learning model saved above to predict molecular hydrophobicity.
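The molecular generation module and its 20% / 80% split can be sketched as follows. The assembly rule (end group + intermediate fragment + end group, joined as SMILES-like strings), the fragment lists, the fixed random seed, and the function names are hypothetical; the application does not specify how fragments are joined. Simple concatenation can also produce the same molecule twice in mirrored form, which a real generator would deduplicate.

```python
import itertools
import random
from typing import List, Sequence, Tuple


def generate_molecules(end_groups: Sequence[str], fragments: Sequence[str]) -> List[str]:
    """Enumerate every end-fragment-end combination (hypothetical assembly rule)."""
    return [
        left + mid + right
        for left, mid, right in itertools.product(end_groups, fragments, end_groups)
    ]


def split_for_dft(
    molecules: Sequence[str], dft_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Randomly assign dft_fraction of the molecules to DFT_data, the rest to predict_data."""
    shuffled = list(molecules)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_dft = round(len(shuffled) * dft_fraction)
    return shuffled[:n_dft], shuffled[n_dft:]


# Placeholder fragments written as SMILES-like strings.
end_groups = ["C", "O", "N", "F", "Cl"]
fragments = ["CC", "C=C", "C#C", "COC"]
molecules = generate_molecules(end_groups, fragments)
dft_data, predict_data = split_for_dft(molecules)
```

`dft_data` would then go to the DFT calculation module to obtain LogP labels, while `predict_data` is held back for the saved model to score.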

[0153] In this embodiment, high-throughput quantum chemical density functional theory (DFT) calculations are used to obtain a large amount of molecular hydrophobicity (LogP) data, ensuring the consistency of data sources. A structure-property relationship machine learning model for molecular structure and hydrophobicity is constructed, and multiple machine learning models are used for training to obtain a model with optimal predictive ability. Secondly, an automated script is used to generate a large number of molecular sample spaces, and the trained machine learning model is used for property prediction, quickly obtaining molecular hydrophobicity data. This facilitates materials R&D engineers in rapidly evaluating and screening suitable molecules.

[0154] The above describes the method for testing the hydrophobicity of molecules proposed in the embodiments of this application. The related devices and electronic equipment are described below.

[0155] Figure 11 is a schematic diagram of the structure of a molecular hydrophobicity testing device proposed in an embodiment of this application. As shown in Figure 11, the molecular hydrophobicity testing device 1100 includes:

[0156] The acquisition module 1101 is used to acquire the molecular structure file corresponding to the molecule to be tested; wherein, the molecular structure file is used to characterize the structure of the molecule to be tested;

[0157] The prediction module 1102 is used to input the molecular structure file into a pre-trained hydrophobicity prediction model to obtain the hydrophobicity test result corresponding to the molecule to be tested output by the hydrophobicity prediction model; wherein, the hydrophobicity prediction model is used to extract the target features of the molecule to be tested from the molecular structure file, and predict the hydrophobicity test result corresponding to the molecule to be tested based on the target features, and the hydrophobicity test result is used to characterize the degree of hydrophobicity of the molecule to be tested.
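When the device is realized as software modules, the acquisition module and the prediction module could be organized roughly as in the sketch below. The class name, the constructor parameters, and the toy loader, feature extractor, and model are all hypothetical stand-ins; the application does not prescribe this interface.

```python
from typing import Callable, Sequence


class HydrophobicityTester:
    """Acquisition module plus prediction module behind one small interface."""

    def __init__(
        self,
        load_structure: Callable[[str], str],
        extract_target_features: Callable[[str], Sequence[float]],
        model: Callable[[Sequence[float]], float],
    ) -> None:
        self._load_structure = load_structure
        self._extract_target_features = extract_target_features
        self._model = model

    def acquire(self, path: str) -> str:
        """Acquisition module: obtain the molecular structure for the given file."""
        return self._load_structure(path)

    def predict(self, path: str) -> float:
        """Prediction module: extract target features, then predict hydrophobicity."""
        structure = self.acquire(path)
        features = self._extract_target_features(structure)
        return float(self._model(features))


# Toy stand-ins: an in-memory "file store", a one-feature extractor, a linear model.
toy_store = {"mol_001.mol": "CCO"}
tester = HydrophobicityTester(
    load_structure=toy_store.__getitem__,
    extract_target_features=lambda structure: [float(len(structure))],
    model=lambda feats: 0.5 * feats[0],
)
result = tester.predict("mol_001.mol")
```

Passing the loader, feature extractor, and model in as callables keeps the two modules independent of any particular file format or trained model.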

[0158] In some embodiments, the acquisition module 1101 is specifically used to: generate a molecular structure file corresponding to the molecule to be tested based on the pre-input N end groups and M intermediate molecular fragments; wherein N and M are both integers greater than 0.

[0159] In some embodiments, the molecular hydrophobicity testing device 1100 further includes: a training module, used for:

[0160] Based on the pre-input terminal and intermediate molecular fragments, multiple molecular structure files are generated; wherein, the multiple molecular structure files respectively characterize the structures of multiple molecules to be tested;

[0161] By performing DFT calculations on the multiple molecular structure files, LogP data corresponding to the multiple molecules to be tested are obtained; wherein, the LogP data is used to characterize the hydrophobicity of the multiple molecules to be tested.

[0162] For each of the plurality of molecular structure files, the following steps are performed: feature extraction is performed on the molecular structure file, and the extracted X candidate features are used as the target features; where X is an integer greater than 0;

[0163] A training dataset is generated based on the target features and corresponding LogP data of the multiple molecular structure files.

[0164] The pre-set candidate model is trained using the training dataset, and the trained candidate model is used as the hydrophobicity prediction model.

[0165] In some embodiments, the training module is specifically used for:

[0166] Feature extraction is performed on the molecular structure file to obtain the X extracted candidate features;

[0167] When X is an integer greater than 2, the candidate model is used to perform important feature analysis on the X candidate features to determine the contribution of each of the X candidate features to the candidate model.

[0168] According to the order of contribution from largest to smallest, at least two features are selected from the X candidate features as the target features.

[0169] In some embodiments, the training module is further specifically used for:

[0170] For each of the plurality of molecular structure files, at least two features in the target features corresponding to the molecular structure file are standardized to obtain at least two standardized features after processing.

[0171] The training dataset consists of at least two standardized features and the corresponding LogP data corresponding to the plurality of molecular structure files.

[0172] In some embodiments, the candidate model is obtained through the following steps:

[0173] Obtain at least two pre-set machine learning models; wherein the at least two machine learning models employ different machine learning methods and / or model parameters;

[0174] The training dataset is used to train the at least two machine learning models respectively, and the output results of the trained machine learning models are obtained.

[0175] The output results of the at least two machine learning models are evaluated using pre-set model evaluation metrics, and the machine learning model with the best evaluation result is selected as the candidate model.

[0176] In some embodiments, the training module is further specifically used for:

[0177] For each of the plurality of molecular structure files, perform the following steps:

[0178] Obtain $\Delta G_{oct}$ and $\Delta G_{w}$ corresponding to the molecular structure file;

[0179] According to $\Delta G_{oct}$ and $\Delta G_{w}$, the LogP data corresponding to the molecular structure file is calculated using formula (1):

[0180]

[0181] Wherein $\Delta G_{oct}$ characterizes the free energy of the molecule to be tested in n-octanol, $\Delta G_{w}$ characterizes the free energy of the molecule to be tested in water, R represents the standard molar gas constant, and T represents the Kelvin temperature.

[0182] It should be understood that the molecular hydrophobicity testing device 1100 of this application embodiment can be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD can be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof. The aforementioned method can also be implemented by software; the molecular hydrophobicity testing device 1100 and its various modules can also be software modules.

[0183] This application also provides a molecular hydrophobicity testing device, including: a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the aforementioned molecular hydrophobicity testing method.

[0184] The above embodiments can be implemented, in whole or in part, by software, hardware, firmware, or any combination thereof. When implemented using software, the above embodiments can be implemented, in whole or in part, as a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that includes one or more sets of available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. A semiconductor medium can be a solid-state drive (SSD).

[0185] This application provides a computer-readable storage medium having computer-readable program instructions (i.e., a computer program) stored thereon, the computer-readable program instructions being used to perform the molecule hydrophobicity testing method in the above embodiments.

[0186] The computer-readable storage medium provided in this application may be, for example, a USB flash drive, but is not limited thereto; it may also be an electrical, magnetic, optical, electromagnetic, or infrared system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof. In this embodiment, the computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. The program code contained on the computer-readable storage medium may be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (Radio Frequency), etc., or any suitable combination thereof.

[0187] The aforementioned computer-readable storage medium may be included in a molecular hydrophobicity testing device; or it may exist independently and not assembled into a molecular hydrophobicity testing device.

[0188] The aforementioned computer-readable storage medium carries one or more programs that, when executed by the molecular hydrophobicity testing device, cause the molecular hydrophobicity testing device to perform the following steps:

[0189] Obtain the molecular structure file corresponding to the molecule to be tested; wherein, the molecular structure file is used to characterize the structure of the molecule to be tested;

[0190] The molecular structure file is input into a pre-trained hydrophobicity prediction model to obtain the hydrophobicity test result corresponding to the molecule under test output by the hydrophobicity prediction model; wherein, the hydrophobicity prediction model is used to extract the target features of the molecule under test from the molecular structure file, and predict the hydrophobicity test result corresponding to the molecule under test based on the target features, and the hydrophobicity test result is used to characterize the degree of hydrophobicity of the molecule under test.

[0191] Computer program code for performing the operations of this application can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0192] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0193] The modules described in the embodiments of this application can be implemented in software or hardware. The names of the modules do not necessarily limit the functionality of the unit itself.

[0194] The readable storage medium provided in this application is a computer-readable storage medium that stores computer-readable program instructions (i.e., a computer program) for executing the above-described method for testing the hydrophobicity of molecules, thereby solving the technical problem of low efficiency in the evaluation and screening of molecular hydrophobicity. Compared with related technologies, the beneficial effects of the computer-readable storage medium provided in this application are the same as those of the molecular hydrophobicity testing method provided in the above embodiments, and will not be repeated here.

[0195] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the molecular hydrophobicity testing method as described above.

[0196] The computer program product provided in this application can solve the technical problem of low efficiency in the evaluation and screening of molecular hydrophobicity. Compared with related technologies, the beneficial effects of the computer program product provided in this application are the same as those of the molecular hydrophobicity testing method provided in the above embodiments, and will not be repeated here.

[0197] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.

Claims

1. A method for testing the hydrophobicity of a molecule, characterized in that, The method comprises: According to the pre-input end group and intermediate molecular fragment, a plurality of molecular structure files are generated; wherein the plurality of molecular structure files respectively represent the structures of a plurality of test molecules; By performing DFT calculation on the plurality of molecular structure files, LogP data corresponding to the plurality of test molecules are obtained; wherein the LogP data is used to represent the hydrophobicity of the plurality of test molecules respectively; For each of the plurality of molecular structure files, the following steps are performed: feature extraction is performed on the molecular structure file, and X candidate features extracted are used as target features; wherein X is an integer greater than 0; According to the target features and corresponding LogP data of the plurality of molecular structure files, a training data set is generated; The training data set is used to train a pre-set candidate model, and a trained hydrophobicity prediction model is obtained; A molecular structure file corresponding to a test molecule is obtained; wherein the molecular structure file is used to represent the structure of the test molecule; The molecular structure file is input into the pre-trained hydrophobicity prediction model, and a hydrophobicity test result of the test molecule output by the hydrophobicity prediction model is obtained; wherein the hydrophobicity prediction model is used to extract target features of the test molecule from the molecular structure file, and predict the hydrophobicity test result corresponding to the test molecule according to the target features, wherein the hydrophobicity test result is used to represent the hydrophobicity of the test molecule.

2. The method of claim 1, wherein, The molecular structure file corresponding to the test molecule is obtained, comprising: According to the pre-input N end groups and M intermediate molecular fragments, the molecular structure file corresponding to the test molecule is generated; Wherein N and M are both integers greater than 0.

3. The method of claim 1, wherein, The feature extraction is performed on the molecular structure file, and the X candidate features extracted are used as the target features, comprising: Feature extraction is performed on the molecular structure file to obtain the X candidate features extracted; In the case where X is an integer greater than 2, the candidate model is used to analyze the important features of the X candidate features, and the contribution of the X candidate features to the candidate model is determined respectively; According to the contribution from large to small, at least two features are selected from the X candidate features as the target features.

4. The method of claim 1, wherein, The training data set is generated according to the target features and corresponding LogP data of the plurality of molecular structure files, comprising: For each of the plurality of molecular structure files, at least two features in the target features corresponding to the molecular structure file are standardized to obtain at least two standardized features processed; The at least two standardized features and corresponding LogP data of the plurality of molecular structure files are used as the training data set.

5. The method of claim 1, wherein the candidate model is obtained by the following steps: presetting at least two machine learning models, wherein the at least two machine learning models use different machine learning methods and/or model parameters; training each of the at least two machine learning models with the training data set to obtain output results of the trained machine learning models; and evaluating the output results corresponding to the at least two machine learning models with a pre-set model evaluation index, and selecting the machine learning model with the optimal evaluation result as the candidate model.
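The model-selection step of claim 5 can be illustrated with two deliberately simple regressors compared on held-out data. Everything here is an assumption made for the sketch: the two model classes, the use of RMSE as the evaluation index, the single input feature, and the synthetic data generated from y = 2x + 1.

```python
import math


class MeanModel:
    """Baseline: always predicts the mean training label."""

    def fit(self, xs, ys):
        self.mean_ = sum(ys) / len(ys)
        return self

    def predict(self, x):
        return self.mean_


class LinearModel:
    """One-feature ordinary least squares."""

    def fit(self, xs, ys):
        mean_x = sum(xs) / len(xs)
        mean_y = sum(ys) / len(ys)
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        self.slope_ = sxy / sxx if sxx else 0.0
        self.intercept_ = mean_y - self.slope_ * mean_x
        return self

    def predict(self, x):
        return self.slope_ * x + self.intercept_


def rmse(model, xs, ys):
    """Root-mean-square error of the model on (xs, ys)."""
    return math.sqrt(sum((model.predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys))


def select_candidate(models, train, valid):
    """Train every model, score it on the validation split, return the best."""
    scores = {name: rmse(model.fit(*train), *valid) for name, model in models.items()}
    best_name = min(scores, key=scores.get)
    return best_name, scores


# Synthetic data from y = 2x + 1 (illustrative only).
train = ([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
valid = ([4.0, 5.0], [9.0, 11.0])
best_name, scores = select_candidate(
    {"mean": MeanModel(), "linear": LinearModel()}, train, valid
)
```

Any evaluation index where a lower or higher value is better can be substituted by changing `rmse` and the `min` call accordingly.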

6. The method of claim 1, wherein performing the DFT calculation on the plurality of molecular structure files to obtain the LogP data respectively corresponding to the plurality of test molecules comprises: for each of the plurality of molecular structure files, acquiring $\Delta G_{oct}$ and $\Delta G_{w}$ corresponding to the molecular structure file; and calculating the LogP data corresponding to the molecular structure file from $\Delta G_{oct}$ and $\Delta G_{w}$ by formula (1): $\mathrm{LogP} = (\Delta G_{w} - \Delta G_{oct}) / (RT \ln 10)$ (1); wherein $\Delta G_{oct}$ characterizes the free energy of the molecule to be tested in n-octanol, $\Delta G_{w}$ characterizes the free energy of the molecule to be tested in water, $R$ characterizes the standard molar gas constant, and $T$ characterizes the Kelvin temperature.
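The conversion from the two free energies to LogP can be sketched numerically. The sketch assumes the conventional thermodynamic relation LogP = (ΔG_w − ΔG_oct) / (RT ln 10), free energies given in J/mol, and a default temperature of 298.15 K; the function name and the default temperature are illustrative choices, not taken from the claim.

```python
import math

R = 8.314462618  # molar gas constant in J/(mol*K)


def logp_from_free_energies(dg_oct, dg_w, temperature=298.15):
    """LogP from free energies in n-octanol and water (both in J/mol).

    A more negative free energy in n-octanol than in water gives LogP > 0,
    i.e. a more hydrophobic molecule.
    """
    if temperature <= 0:
        raise ValueError("temperature must be a positive Kelvin value")
    return (dg_w - dg_oct) / (R * temperature * math.log(10))


# Equal free energies in both solvents give LogP = 0.
neutral = logp_from_free_energies(-20000.0, -20000.0)
```

At 298.15 K, each unit of LogP corresponds to a free-energy difference of RT ln 10, roughly 5.7 kJ/mol.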

7. A device for testing the hydrophobicity of a molecule, characterized by comprising: a training module configured to: generate a plurality of molecular structure files according to pre-input end groups and intermediate molecular fragments, wherein the plurality of molecular structure files respectively represent the structures of a plurality of test molecules; perform DFT calculation on the plurality of molecular structure files to obtain LogP data respectively corresponding to the plurality of test molecules, wherein the LogP data represent the hydrophobicity of the respective test molecules; for each of the plurality of molecular structure files, perform feature extraction on the molecular structure file and use the X extracted candidate features as target features, wherein X is an integer greater than 0; generate a training data set according to the target features and the corresponding LogP data of the plurality of molecular structure files; and train a pre-set candidate model with the training data set to obtain a trained hydrophobicity prediction model; an acquisition module configured to acquire a molecular structure file corresponding to a test molecule, wherein the molecular structure file represents the structure of the test molecule; and a prediction module configured to input the molecular structure file into the pre-trained hydrophobicity prediction model to obtain a hydrophobicity test result of the test molecule output by the hydrophobicity prediction model, wherein the hydrophobicity prediction model is used to extract target features of the test molecule from the molecular structure file and to predict the hydrophobicity test result of the test molecule according to the target features, and the hydrophobicity test result represents the hydrophobicity of the test molecule.

8. A device for testing the hydrophobicity of a molecule, characterized by comprising: a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the method for testing the hydrophobicity of a molecule according to any one of claims 1 to 6.

9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to execute the method for testing the hydrophobicity of a molecule according to any one of claims 1 to 6.