Prediction of aldol reaction results and high-efficiency screening method of tetrapeptide catalysts

By combining a random forest model with a high-throughput experimental platform, the problem of low catalyst screening efficiency in asymmetric Aldol reactions was solved, achieving efficient and low-cost catalyst screening and solvent optimization, and improving catalyst discovery efficiency.

CN122245470A · Pending · Publication Date: 2026-06-19 · SHENZHEN CONTINUOUS PHARMACEUTICAL TECHNOLOGY CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHENZHEN CONTINUOUS PHARMACEUTICAL TECHNOLOGY CO LTD
Filing Date
2026-03-13
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies for asymmetric Aldol reactions suffer from low catalyst screening efficiency and high costs, making it difficult to establish a rapid quantitative mapping between catalyst structure, solvent conditions, and reaction results.

Method used

We employed a random forest model combined with a high-throughput experimental platform. By encoding reactant and catalyst information through a molecular characterization model, we constructed reaction stage features, trained the random forest model to predict yield and enantioselectivity, and performed solvent optimization and experimental verification.

Benefits of technology

This technology enables rapid catalyst screening on a large-scale tetrapeptide virtual library, significantly reducing experimental costs and improving the discovery efficiency of high-performance catalysts.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention relates to the field of cheminformatics, specifically to a method, apparatus, computer equipment, and storage medium for predicting Aldol reaction results and efficiently screening tetrapeptide catalysts. The method includes: acquiring reactant structure information, candidate tetrapeptide catalyst structure information, and reaction condition information for the Aldol reaction; calling a pre-trained molecular characterization model to encode the reactant structure information and the candidate tetrapeptide catalyst structure information into vector representations, and fusing them with the reaction condition information to obtain reaction-level features; constructing training data; training a random forest model based on the training dataset; inputting the reaction-level features into the random forest model to perform batch predictions on candidate tetrapeptide catalysts in a pre-constructed virtual catalyst library, outputting predicted yield values, predicted enantioselectivity values, and / or classification levels for the candidate catalysts; and performing solvent optimization on the target catalyst set, outputting optimal solvent recommendations.

Description

Technical Field

[0001] This invention relates to the field of cheminformatics technology, and in particular to methods, apparatus, computer equipment, and storage media for predicting Aldol reaction results and for efficiently screening tetrapeptide catalysts.

Background Technology

[0002] The asymmetric Aldol reaction is an important reaction for constructing β-hydroxycarbonyl compounds whose skeletons contain multiple chiral centers. The reaction yield and enantioselectivity depend strongly on the catalyst structure and reaction conditions. Tetrapeptide catalysts offer a designable sequence space and multiple weak interaction sites, but their combinatorial space is enormous. Relying solely on human experience and one-at-a-time experimental screening is inefficient and costly. Existing methods often employ limited empirical screening or expensive quantum chemical descriptors, which are insufficient to cover the large-scale tetrapeptide space. Furthermore, it is difficult to establish a rapid quantitative mapping between catalyst structure, solvent conditions, and reaction results. To address these technical challenges, this invention proposes a method, apparatus, computer equipment, and storage medium for predicting Aldol reaction results and for efficiently screening tetrapeptide catalysts.

Summary of the Invention

[0003] To address the technical problems existing in the prior art, this invention provides a method, apparatus, computer equipment, and storage medium for predicting Aldol reaction results and screening tetrapeptide catalysts. This invention can run rapidly on a large-scale virtual tetrapeptide library, simultaneously perform yield prediction, enantioselectivity prediction, and tiered screening of high-efficiency catalysts, and further support a closed loop of solvent optimization and experimental verification.

[0004] To achieve the above objectives, the embodiments of the present invention provide the following technical solutions. In a first aspect, one embodiment of the present invention provides a method for predicting Aldol reaction results and efficiently screening tetrapeptide catalysts, the method comprising the following steps:
obtaining reactant structure information, candidate tetrapeptide catalyst structure information, and reaction condition information for the Aldol reaction;
calling a pre-trained molecular characterization model to encode the reactant structure information and the candidate tetrapeptide catalyst structure information into vector representations, and fusing them with the reaction condition information to obtain reaction-level features;
executing at least one batch of Aldol reactions on a high-throughput experimental platform to obtain the corresponding yield data and enantioselectivity data, and constructing a training dataset;
training a random forest model based on the training dataset;
inputting the reaction-level features into the random forest model to perform batch predictions on candidate tetrapeptide catalysts in a pre-constructed virtual catalyst library, outputting the predicted yield, predicted enantioselectivity, and / or classification level of the candidate catalysts, and screening a target catalyst set accordingly;
performing solvent optimization on the target catalyst set to output an optimal solvent recommendation, and performing experimental verification and substrate expansion verification on the target catalysts.

[0005] As a further aspect of the present invention, the output of the molecular characterization model is a fixed-dimensional molecular vector.

[0006] As a further embodiment of the present invention, the reaction-level features are obtained by concatenation or weighted fusion of reactant vectors, tetrapeptide catalyst vectors, and reaction condition vectors.

[0007] As a further aspect of the present invention, the high-throughput experiment is performed in parallel in a multi-well reactor to obtain batch reaction data covering combinations of catalyst and solvent, and the yield and enantioselectivity results are obtained by HPLC.

[0008] As a further embodiment of the present invention, the random forest model includes at least: a yield prediction random forest regression model, an enantioselectivity prediction random forest regression model, and a high-efficiency catalyst hierarchical screening random forest multi-classification model.

[0009] As a further aspect of the present invention, the enantioselectivity prediction random forest regression model uses the absolute value of ee or the ee grade label as the learning target.

[0010] As a further aspect of the present invention, before inputting the reaction-level features into the random forest model to perform batch predictions on candidate tetrapeptide catalysts in the pre-constructed virtual catalyst library, the method further includes: constructing a virtual catalyst library containing multiple tetrapeptide sequences.

[0011] In a second aspect, another embodiment of the present invention provides a device for predicting Aldol reaction results and efficiently screening tetrapeptide catalysts. The device includes: a data acquisition module, a data fusion module, a training data construction module, a model training module, a catalyst set screening module, and an output verification module. The data acquisition module is used to acquire the reactant structure information, candidate tetrapeptide catalyst structure information, and reaction condition information of the Aldol reaction. The data fusion module is used to call a pre-trained molecular characterization model to encode the reactant structure information and the candidate tetrapeptide catalyst structure information into vector representations, and to fuse them with the reaction condition information to obtain reaction-level features. The training data construction module is used to execute at least one batch of Aldol reactions on a high-throughput experimental platform to obtain the corresponding yield data and enantioselectivity data, and to construct a training dataset. The model training module is used to train a random forest model based on the training dataset. The catalyst set screening module is used to input the reaction-level features into the random forest model, perform batch prediction on candidate tetrapeptide catalysts in the pre-constructed virtual catalyst library, output the predicted yield, predicted enantioselectivity, and / or classification level of the candidate catalysts, and screen the target catalyst set accordingly. The output verification module is used to perform solvent optimization on the target catalyst set, output the optimal solvent recommendation, and perform experimental verification and substrate expansion verification on the target catalysts.

[0012] In a third aspect, another embodiment of the present invention provides a computer device including a memory and a processor, wherein the memory stores a computer program, and the processor loads and executes the computer program to implement the steps of the method for predicting Aldol reaction results and efficiently screening tetrapeptide catalysts.

[0013] In a fourth aspect, another embodiment of the present invention provides a storage medium storing a computer program which, when loaded and executed by a processor, implements the steps of the method for predicting Aldol reaction results and efficiently screening tetrapeptide catalysts.

[0014] The technical solution provided by this invention has the following beneficial effects: This invention provides a method, apparatus, computer equipment, and storage medium for predicting Aldol reaction results and efficiently screening tetrapeptide catalysts. The method acquires the reactant structure information, candidate tetrapeptide catalyst structure information, and reaction condition information for the Aldol reaction; it calls a pre-trained molecular characterization model to encode the reactant structure information and the candidate tetrapeptide catalyst structure information into vector representations, and fuses them with the reaction condition information to obtain reaction-level features; it executes at least one batch of Aldol reactions based on a high-throughput experimental platform to obtain corresponding yield and enantioselectivity data, and constructs training data; it trains a random forest model based on the training dataset; it inputs the reaction-level features into the random forest model to perform batch predictions on candidate tetrapeptide catalysts in a pre-constructed virtual catalyst library, outputting predicted yield values, predicted enantioselectivity values, and / or classification levels of the candidate catalysts, and thereby filters a set of target catalysts; it performs solvent optimization on the target catalyst set, outputs optimal solvent recommendations, and performs experimental verification and substrate expansion verification of the target catalysts. 
Compared with existing technologies, this invention can establish a predictive model of yield and enantioselectivity based on high-throughput experimental data, enabling batch inference and rapid screening over a large-scale virtual tetrapeptide library. Without high-cost quantum chemical calculations, it can rapidly predict and efficiently screen a large-scale tetrapeptide space, significantly reducing experimental trial-and-error costs and improving the efficiency of discovering high-performance catalysts.

[0015] These or other aspects of the invention will become more apparent from the following description of embodiments. It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and are not intended to limit the invention.

Attached Figure Description

[0016] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other embodiments can be obtained based on these drawings without creative effort.

[0017] Figure 1 is a flowchart of an embodiment of the method for predicting Aldol reaction results and efficiently screening tetrapeptide catalysts according to the present invention.

[0018] Figure 2 is an example flowchart of the efficient screening method for Aldol reaction results based on pre-trained molecular characterization and a random forest model, according to one embodiment of the present invention.

[0019] Figure 3 is a flowchart of the reaction feature construction and training process in the method for predicting Aldol reaction results and efficiently screening tetrapeptide catalysts according to an embodiment of the present invention.

[0020] Figure 4 is a schematic diagram of the high-throughput experimental platform and parallel reaction wells in an example of the method according to an embodiment of the present invention.

[0021] Figure 5 is a graph of the Aldol reaction results of 32 tetrapeptide catalysts under multi-solvent conditions in an example of the method according to an embodiment of the present invention.

[0022] Figure 6 is a yield distribution diagram of the Aldol reaction dataset in an example of the method according to an embodiment of the present invention.

[0023] Figure 7 is an ee distribution diagram of the Aldol reaction dataset in an example of the method according to an embodiment of the present invention.

[0024] Figure 8 is a schematic diagram of the three-category classification of Aldol reaction data in an example of the method according to an embodiment of the present invention.

[0025] Figure 9 is a graph of the asymmetric Aldol reaction results catalyzed by eight out-of-sample tetrapeptides in an example of the method according to an embodiment of the present invention.

[0026] Figure 10 is a comparison chart of Aldol reaction yield predictions from multiple regression models according to an embodiment of the present invention.

[0027] Figure 11 is a comparison chart of the Aldol enantioselectivity prediction performance of multiple regression models according to an embodiment of the present invention.

Figure 12 is the first example diagram of the generation and verification of tetrapeptide three-dimensional conformations based on FASTA format and ETKDG optimization in the efficient screening of tetrapeptide catalysts according to the present invention.

[0028] Figure 13 is the second example diagram of the generation and verification of tetrapeptide three-dimensional conformations based on FASTA format and ETKDG optimization in the efficient screening of tetrapeptide catalysts according to the present invention.

[0029] Figure 14 is a tetrapeptide chemical space map constructed from Uni-Mol molecular representations for the efficient screening of tetrapeptide catalysts, according to one embodiment of the present invention.

[0030] Figure 15 is a visualization of dimension-reduction clustering based on Uni-Mol molecular representations in the efficient screening of tetrapeptide catalysts according to an embodiment of the present invention.

[0031] Figure 16 is a schematic diagram of the device for predicting Aldol reaction results and screening tetrapeptide catalysts.

[0032] Figure 17 is a table of abbreviations for amino acid names.

Detailed Implementation

[0033] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0034] The flowchart shown in the attached diagram is for illustrative purposes only and does not necessarily include all content and operations / steps, nor does it necessarily have to be performed in the order described. For example, some operations / steps can be broken down, combined, or partially merged, so the actual execution order may change depending on the actual situation.

[0035] It should be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise.

[0036] Specifically, the embodiments of the present invention will be further described below with reference to the accompanying drawings.

[0037] Referring to Figure 1 and Figure 2, Figure 1 is a flowchart of a method for predicting Aldol reaction results and efficiently screening tetrapeptide catalysts provided in an embodiment of the present invention. As shown in Figure 1, the method includes steps S10 to S60.

[0038] S10. Obtain reactant structure information, candidate tetrapeptide catalyst structure information, and reaction condition information for the Aldol reaction.

S20. Call the pre-trained molecular characterization model to encode the reactant structure information and the candidate tetrapeptide catalyst structure information into vector representations, and fuse them with the reaction condition information to obtain reaction-level features.

[0039] In an embodiment of the present invention, the output of the molecular characterization model is a fixed-dimensional molecular vector.

[0040] In embodiments of the present invention, the reaction-level features are obtained by concatenation or weighted fusion of reactant vectors, tetrapeptide catalyst vectors, and reaction condition vectors.
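
The two fusion options named in this paragraph (splicing, i.e. concatenation, and weighted fusion) might be sketched as follows. This is only an illustration under stated assumptions: the function names `concat_fusion` and `weighted_fusion`, the toy vectors, and the weights are hypothetical, and the source does not specify how the weights are chosen. Here weighted fusion is interpreted as scaling each block before concatenation; other interpretations are possible.

```python
from typing import List, Sequence


def concat_fusion(reactant: Sequence[float],
                  catalyst: Sequence[float],
                  condition: Sequence[float]) -> List[float]:
    """Concatenate the three blocks into one reaction-level feature vector."""
    return list(reactant) + list(catalyst) + list(condition)


def weighted_fusion(blocks: Sequence[Sequence[float]],
                    weights: Sequence[float]) -> List[float]:
    """Scale each block by its weight, then concatenate (assumed scheme)."""
    if len(blocks) != len(weights):
        raise ValueError("need exactly one weight per block")
    fused: List[float] = []
    for block, weight in zip(blocks, weights):
        fused.extend(weight * value for value in block)
    return fused


# Toy vectors standing in for the fixed-dimensional model outputs (hypothetical).
reactant_vec = [0.5, 1.0]
catalyst_vec = [2.0, 4.0, 6.0]
condition_vec = [1.0, 0.0]

features = concat_fusion(reactant_vec, catalyst_vec, condition_vec)
scaled = weighted_fusion([reactant_vec, catalyst_vec, condition_vec],
                         [1.0, 0.5, 2.0])
```

Either variant yields a vector of the same fixed length for every reaction, which is what a tree-based model needs as input.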

[0041] S30. Execute at least one batch of Aldol reactions based on a high-throughput experimental platform to obtain the corresponding yield data and enantioselectivity data, and construct a training dataset.

[0042] In an embodiment of the present invention, the high-throughput experiment is performed in parallel in a multi-well reactor to obtain batch reaction data covering combinations of catalyst and solvent, and yield and enantioselectivity results are obtained by HPLC.

[0043] In embodiments of the present invention, solvent optimization is incorporated into the same framework and recommendations are made within the "catalyst × solvent" combination space, which helps to reduce the cost of condition exploration.

[0044] S40. Train a random forest model based on the training dataset.

[0045] In embodiments of the present invention, the random forest model includes at least: a yield prediction random forest regression model, an enantioselectivity prediction random forest regression model, and a high-efficiency catalyst hierarchical screening random forest multi-classification model.

[0046] In embodiments of the present invention, the enantioselectivity prediction random forest regression model uses the absolute value of ee or the ee grade label as the learning target. The present invention also employs a random forest classification model to grade and filter candidates by yield and ee, which better matches the decision-making logic of experimental selection, and the identification of high-performance categories can be strengthened through methods such as class weights.
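
As an illustration of the class-weight idea mentioned above, the sketch below computes inverse-frequency weights, so that a rare high-performance class counts more during training. The source does not say which weighting scheme is used; the formula here (number of samples divided by number of classes times the class count) is the same heuristic that scikit-learn applies for `class_weight="balanced"`, and the label counts are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def balanced_class_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical, imbalanced label set: few "high" catalysts, many "low" ones.
labels = ["high"] * 2 + ["medium"] * 3 + ["low"] * 5
weights = balanced_class_weights(labels)
```

With these counts the rare `high` class receives the largest weight and the common `low` class the smallest.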

[0047] In embodiments of the present invention, the hierarchical labels in the high-efficiency catalyst hierarchical screening random forest multi-classification model include at least three categories: high, medium, and low.

[0048] In embodiments of the present invention, a random forest multi-classification model is trained, and stratified sampling cross-validation can be used to keep the class distribution consistent across folds. Regarding model selection, classifiers such as decision trees, logistic regression, SVM, and random forests can be compared across different molecular representation methods, and random forest is ultimately found to have the best overall performance (corresponding to Figure 6).
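
Stratified sampling for cross-validation can be sketched in a few lines. The version below is a simplified, deterministic illustration (no shuffling), with a hypothetical function name and hypothetical label counts; a practical run would typically shuffle within each class first or use a library implementation.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_folds(labels: Sequence[str], k: int) -> List[List[int]]:
    """Deal the sample indices of each class round-robin into k folds."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical label set with a 1:2:3 class ratio.
labels = ["high"] * 3 + ["medium"] * 6 + ["low"] * 9
folds = stratified_folds(labels, k=3)
```

Each of the three folds then holds one `high`, two `medium`, and three `low` samples, so every validation split sees the same class ratio as the full dataset.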

[0049] S50. Input the reaction level features into the random forest model to perform batch prediction of candidate tetrapeptide catalysts in the pre-constructed catalyst virtual library, output the predicted yield value, predicted enantioselectivity value and / or classification level of the candidate catalysts, and screen the target catalyst set accordingly.

[0050] In an embodiment of the present invention, before the batch prediction in S50 (inputting the reaction-level features into the random forest model, outputting the predicted yield, predicted enantioselectivity, and / or classification level of the candidate catalysts, and screening the target catalyst set accordingly), the method further includes: constructing a virtual catalyst library containing multiple tetrapeptide sequences.
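
A virtual library of tetrapeptide sequences can be enumerated combinatorially. The sketch below is an assumption-laden illustration: the source does not state which residue alphabet the library uses, so the 20 standard one-letter codes plus a lower-case `p` for D-proline (inferred from sequence names such as VpAL and PGpV in this document) are hypothetical choices, as is the function name.

```python
from itertools import product
from typing import List

# Assumed alphabet: 20 standard L-amino acids in one-letter code.
L_RESIDUES = "ACDEFGHIKLMNPQRSTVWY"
# Assumed notation: lower-case "p" marks D-proline, as in VpAL or PGpV.
D_RESIDUES = "p"


def build_tetrapeptide_library(alphabet: str) -> List[str]:
    """Enumerate every length-4 sequence over the given residue alphabet."""
    return ["".join(residues) for residues in product(alphabet, repeat=4)]


library = build_tetrapeptide_library(L_RESIDUES + D_RESIDUES)
```

With 21 residue symbols this gives 21^4 = 194,481 sequences (160,000 for the 20 L-residues alone), which illustrates why one-at-a-time experimental screening of the full space is impractical.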

[0051] In embodiments of the present invention, after constructing the virtual catalyst library containing multiple tetrapeptide sequences and before performing full-library prediction on it, the method further includes: uniformly sampling a representative tetrapeptide subset from the high-dimensional feature space using the Kennard-Stone algorithm and performing initial experimental verification on that subset to construct the initial training set for the machine learning prediction model.

[0052] S60. Solvent optimization is performed on the target catalyst set, the optimal solvent recommendation is output, and experimental verification and substrate expansion verification are conducted on the target catalysts. This invention supports experimental verification and substrate expansion verification of the screened candidate catalysts, thereby improving the reliability and applicability of the screening results.

[0053] In an embodiment of the present invention, the solvent optimization includes: predicting or experimentally measuring the yield and enantioselectivity for the same candidate tetrapeptide catalyst in at least two different polar solvents, thereby determining the optimal solvent conditions for the candidate tetrapeptide catalyst.
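
The per-catalyst solvent comparison described above can be sketched as a simple argmax over measured or predicted results. The source does not state how yield and enantioselectivity are combined into one ranking, so the score used here (yield multiplied by |ee|, divided by 100) is purely a hypothetical choice, as are the catalyst names, solvents, and numbers.

```python
from typing import Dict, List, Tuple

# (catalyst, solvent, yield %, ee %) records; all values are made up.
Record = Tuple[str, str, float, float]


def best_solvent(records: List[Record]) -> Dict[str, str]:
    """Return, for each catalyst, the solvent with the highest combined score."""
    best: Dict[str, Tuple[str, float]] = {}
    for catalyst, solvent, yield_pct, ee_pct in records:
        # Hypothetical score: reward high yield and high |ee| together.
        score = yield_pct * abs(ee_pct) / 100.0
        if catalyst not in best or score > best[catalyst][1]:
            best[catalyst] = (solvent, score)
    return {catalyst: solvent for catalyst, (solvent, _) in best.items()}


records = [
    ("CAT-A", "DMSO", 80.0, 40.0),     # score 32
    ("CAT-A", "toluene", 60.0, 70.0),  # score 42
    ("CAT-B", "DMSO", 90.0, 50.0),     # score 45
    ("CAT-B", "toluene", 50.0, 60.0),  # score 30
]
recommendation = best_solvent(records)
```

The same function works whether the yield and ee values come from the random forest predictions or from HPLC measurements, since it only compares rows that share a catalyst.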

[0054] Figure 2 provides an example flowchart of the method for predicting Aldol reaction results and efficiently screening tetrapeptide catalysts according to one embodiment. The steps are as follows:

Step 1: Generation of high-throughput Aldol reaction data.

[0055] As shown in Figure 4, based on optimized reaction conditions, a parallel high-throughput reactor was used to generate reaction data: the reactor was a 5×8 well metal bath platform, with a 2.5 mL reaction tube in each well; parallel reactions were carried out at a scale of 0.5 mL and a concentration of 1 M; after the reaction, samples were taken, diluted, and filtered, and the conversion rate and enantioselectivity were analyzed by HPLC to obtain yield and ee data.

[0056] Furthermore, multiple tetrapeptide catalysts were screened in parallel in solvents of different polarity to obtain a "catalyst × solvent" batch reaction matrix (corresponding to Figures 5, 6, 7, and 8).

[0057] Step 2: Optimization of reaction conditions and solvents.

[0058] The optimization of reaction conditions and solvents included: optimizing the Aldol reaction conditions by changing parameters such as reactant stoichiometry, reaction time, and catalyst type to obtain yield and ee under different conditions (corresponding to Table 1).

[0059] Table 1. Results of Aldol reaction condition optimization.

During the systematic screening phase, the same catalyst is compared in multiple solvents to determine the optimal solvent conditions and to provide "condition features" for subsequent model input.

[0060] Step 3: Molecular characterization and training data construction.

[0061] Molecular characterization and training data construction include: inputting reactants and candidate tetrapeptide catalysts into a pre-trained molecular characterization model to obtain a fixed-dimensional vector; and fusing it with reaction condition features such as solvent, stoichiometry, temperature, and time to form a reaction-level feature vector.
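
The reaction condition features listed above (solvent, stoichiometry, temperature, time) need a numeric encoding before they can be fused with the molecular vectors. One simple option, shown here only as a sketch, is a one-hot solvent indicator followed by the numeric conditions. The solvent panel, the function name, and the example values are hypothetical; the source does not describe the encoding it uses.

```python
from typing import List

# Hypothetical solvent panel; the actual solvents are not listed in this section.
SOLVENTS = ["DMSO", "MeOH", "toluene"]


def encode_conditions(solvent: str, equivalents: float,
                      temperature_c: float, time_h: float) -> List[float]:
    """One-hot encode the solvent and append the numeric reaction conditions."""
    if solvent not in SOLVENTS:
        raise ValueError(f"unknown solvent: {solvent}")
    one_hot = [1.0 if solvent == name else 0.0 for name in SOLVENTS]
    return one_hot + [float(equivalents), float(temperature_c), float(time_h)]


condition_vec = encode_conditions("MeOH", equivalents=5, temperature_c=25, time_h=24)
```

Tree-based models such as random forests are insensitive to feature scaling, so the numeric conditions can be left in their natural units in this sketch.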

[0062] Step 4: Random Forest Model Training and Prediction.

[0063] Random forest model training and prediction include: yield prediction (RF regression), enantioselectivity prediction (RF regression), and efficient catalyst classification screening (RF classification).

[0064] Yield prediction (RF regression): The random forest regression model is trained using the reaction-level feature vector as input and the experimental yield as the label.

[0065] Enantioselectivity prediction (RF regression): The random forest regression model is trained with reaction-level feature vectors as input and ee or |ee| as labels.
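
For reference, the ee and |ee| labels can be derived from the two enantiomer peak areas of a chiral HPLC trace using the standard definition of enantiomeric excess. The sketch below assumes, hypothetically, that a positive sign denotes an excess of the first enantiomer passed in; the source does not describe its HPLC data processing, and the peak areas are made up.

```python
def enantiomeric_excess(area_first: float, area_second: float) -> float:
    """Signed ee in percent: 100 * (A1 - A2) / (A1 + A2)."""
    total = area_first + area_second
    if total <= 0:
        raise ValueError("peak areas must sum to a positive value")
    return 100.0 * (area_first - area_second) / total


# Hypothetical peak areas: the second enantiomer dominates.
signed_ee = enantiomeric_excess(20.0, 80.0)
abs_ee = abs(signed_ee)
```

Training on |ee| discards the sense of induction and keeps only the magnitude of selectivity, while the signed value retains which enantiomer is favored.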

[0066] Tiered screening of high-efficiency catalysts (RF classification): Yield and ee are mapped to three categories of labels (corresponding to Figure 8).
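
One way to map a (yield, ee) pair to three tiers is a pair of threshold rules, sketched below. The thresholds and the rule for combining yield and ee are hypothetical placeholders, since this section does not give the cut-offs that define the three categories.

```python
from typing import Tuple


def tier_label(yield_pct: float, ee_pct: float,
               high: Tuple[float, float] = (80.0, 50.0),
               low: Tuple[float, float] = (40.0, 20.0)) -> str:
    """Assign 'high', 'medium', or 'low' from yield and |ee| (assumed cut-offs)."""
    ee_abs = abs(ee_pct)
    if yield_pct >= high[0] and ee_abs >= high[1]:
        return "high"      # both metrics clear the upper thresholds
    if yield_pct < low[0] or ee_abs < low[1]:
        return "low"       # either metric falls below the lower thresholds
    return "medium"


tiers = [tier_label(90.0, 60.0), tier_label(60.0, 30.0), tier_label(30.0, 60.0)]
```

Requiring both metrics for the top tier reflects the stated goal of finding catalysts that combine high yield with high enantioselectivity.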

[0067] Train a random forest multi-class classification model; stratified sampling cross-validation can be used to ensure consistent class distribution. Regarding model selection, classifiers such as decision trees, logistic regression, SVM, and random forests, along with different molecular representation methods, can be compared. Ultimately, random forest is determined to be superior in overall performance (see Table 2).

[0068] Table 2 compares the performance of different classification models and descriptors.

[0069] Table 3. Performance of classification models using stratified cross-validation.

[0070] Step 5: Virtual library screening and initial validation of representative subsets.

[0071] To construct the initial machine learning prediction model, this embodiment also includes a step of building an initial training set. Before predicting the entire virtual library, a representative small subset can be selected from the chemical space for initial experimental validation. Specifically, the Kennard-Stone algorithm can be used, which can uniformly sample in a high-dimensional feature space, ensuring that the selected samples cover the entire chemical space. The red dots in the chemical space represent the representative peptide subset selected by this algorithm. Experiments are conducted on these representative catalysts to obtain their actual reaction results data, which constitute the initial training set of the machine learning prediction model. Based on this initial model, the aforementioned full library prediction and screening can be performed.
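
The Kennard-Stone selection described above can be sketched in plain Python. It starts from the two most distant points and then repeatedly adds the candidate whose nearest already-selected point is farthest away (a max-min criterion), which spreads the selection across the feature space. The function name and the one-dimensional toy points are illustrative assumptions; in the workflow described here the points would be the high-dimensional molecular vectors.

```python
import math
from typing import List, Sequence


def kennard_stone(points: Sequence[Sequence[float]], n_select: int) -> List[int]:
    """Return indices of n_select points chosen by the Kennard-Stone max-min rule."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of points")
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Seed with the pair of points that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = set(range(n)) - {first, second}
    while len(selected) < n_select:
        # Pick the candidate whose closest selected neighbour is farthest away.
        chosen = max(sorted(remaining),
                     key=lambda r: min(dist[r][s] for s in selected))
        selected.append(chosen)
        remaining.remove(chosen)
    return selected


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0), (9.0, 0.0)]
subset = kennard_stone(points, 3)
```

In this toy case the two extremes are picked first and the midpoint third, so the three selected points span the whole range. The full pairwise distance matrix makes this sketch quadratic in memory, which a large virtual library would need to work around.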

[0072] Three-dimensional tetrapeptide structures were generated by ETKDG conformation optimization and then used as input for Uni-Mol.

[0073] Step 6: Catalyst validation and substrate expansion validation.

[0074] As shown in Figure 9: (a) prediction results using the basic RF model; (b) prediction results using the iterative RF model; (c) two tetrapeptides with high catalytic performance accurately predicted by the iterative model. The high-yield / high-ee candidate catalysts screened by the model were synthesized and validated experimentally; further expansion and validation were conducted by changing the substrate combinations to confirm the transferability and applicability of the screening results.

[0075] The base model trained on the Aldol reaction dataset predicted the performance of untested tetrapeptide molecules and was able to preliminarily identify high-yield candidate catalysts. Among them, LAPV and LPGV were predicted to be high-yield molecules and were included in the experimental validation (Figures 10-15). The experimental results show that the base model has some bias in predicting yield values but judges enantioselectivity trends well. For example, the ee prediction for LPGV is consistent with the experimental result. LPGV and VPGL differ only in their terminal residues and share the proline at position i+1 and the glycine at position i+2, yet VPGL achieves 86% yield and 59% ee, while the performance of LPGV is significantly lower. This indicates that the model is highly sensitive to small sequence changes, but that it is still limited in simultaneously screening for catalysts with high yield and high ee. After iterative training with new reaction data containing D-proline, the resulting iterative model showed significantly improved predictive ability: it identified 60 high-yield candidates and 42 high-ee candidates in the virtual library and screened out potential high-performance catalysts such as VpAL, LpGL, VAGL, and VGGL, which combine high yield and high selectivity. The experimental results for the representative catalysts VpAL and LpGL are highly consistent with the predictions. Furthermore, the model accurately captures trends such as the decrease in ee from 68% to 25.6% between PGPV and PGpV. For a table of amino acid abbreviations, see Figure 17. Overall, the model's ability to learn how conformational changes relate to reaction selectivity is significantly enhanced after data augmentation.

[0076] The basic model trained on the Aldol reaction dataset was able to preliminarily screen high-yield candidate catalysts when predicting untested tetrapeptide libraries. Among them, LAPV and LPGV were predicted to be high-yield molecules and entered experimental verification (Figure 9). Experiments show that the model has some bias in predicting yield values, but it judges the trend of enantioselectivity changes well. For example, the enantioselectivity prediction for LPGV is consistent with the experimental result. Comparing LPGV and VPGL, which differ only in their terminal residues, shows that although both have a proline-glycine backbone, VPGL achieves 86% yield and 59% enantioselectivity, while the performance of LPGV is significantly lower. This indicates that the model is sensitive to small changes in sequence, but that it is still limited in simultaneously screening for catalysts with high yield and high enantioselectivity.

[0077] After new reaction data containing D-proline were introduced for iterative training, the predictive power of the resulting iterated model improved significantly. The model identified 60 high-yield candidate catalysts and 42 high-enantioselectivity candidate catalysts from the virtual library, and successfully screened potentially highly efficient catalysts such as VpAL, LpGL, VAGL, and VGGL, which combine high catalytic activity with high selectivity. The experimental results for the representative catalysts VpAL and LpGL agreed closely with the predicted values, and the model accurately captured the decrease in enantioselectivity from 68% to 25.6% between PGPV and PGpV. Overall, after data expansion, the model's ability to learn the patterns governing conformational changes and reaction selectivity was significantly enhanced.

[0078] It should be understood that although the above description follows a certain order, these steps are not necessarily executed in that order. Unless otherwise expressly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, some steps in this embodiment may include multiple steps or multiple stages, which are not necessarily completed at the same time, but may be executed at different times. The execution order of these steps or stages is not necessarily sequential, but may be performed alternately or in turn with other steps or at least a portion of the steps or stages in other steps.

[0079] In one embodiment, as shown in Figure 16, the embodiments of the present invention also provide a device for predicting Aldol reaction results and efficiently screening tetrapeptide catalysts. The device includes a data acquisition module 100, a data fusion module 200, a training data construction module 300, a model training module 400, a catalyst set screening module 500, and an output verification module 600.

[0080] The data acquisition module 100 is used to acquire the reactant structure information, candidate tetrapeptide catalyst structure information, and reaction condition information for the Aldol reaction.

[0081] The data fusion module 200 is used to call a pre-trained molecular characterization model to encode the reactant structure information and the candidate tetrapeptide catalyst structure information into vector representations, and fuse them with the reaction condition information to obtain reaction level features.
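The fusion step of the data fusion module 200 can be illustrated with a minimal sketch. It assumes the molecular characterization model has already produced fixed-dimensional vectors for the reactant and the catalyst, and it shows the two fusion variants named in claim 3 (concatenation and weighted fusion). The function names, the one-hot solvent encoding, the solvent list, and the default weights are hypothetical illustrations and are not taken from the specification.

```python
from typing import Sequence

# Hypothetical candidate solvents, used only to build an example condition vector.
SOLVENTS = ["DMSO", "MeOH", "THF"]


def one_hot(value: str, vocabulary: Sequence[str]) -> list[float]:
    """Encode a categorical reaction condition as a one-hot vector."""
    return [1.0 if value == v else 0.0 for v in vocabulary]


def concat_fusion(reactant: Sequence[float],
                  catalyst: Sequence[float],
                  condition: Sequence[float]) -> list[float]:
    """Concatenate the three vectors into one reaction-level feature vector."""
    return [*reactant, *catalyst, *condition]


def weighted_fusion(reactant: Sequence[float],
                    catalyst: Sequence[float],
                    condition: Sequence[float],
                    w_reactant: float = 0.5,
                    w_catalyst: float = 0.5) -> list[float]:
    """Mix the two molecular vectors element-wise, then append the conditions.

    Assumption: both molecular vectors come from the same encoder and
    therefore share one fixed dimension.
    """
    if len(reactant) != len(catalyst):
        raise ValueError("molecular vectors must have the same dimension")
    mixed = [w_reactant * r + w_catalyst * c for r, c in zip(reactant, catalyst)]
    return [*mixed, *condition]
```

Concatenation keeps every input dimension visible to the downstream model, whereas weighted fusion yields a shorter vector at the cost of mixing reactant and catalyst information.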

[0082] The training data construction module 300 is used to execute at least one batch of Aldol reactions based on a high-throughput experimental platform to obtain corresponding yield data and enantioselectivity data, and to construct training data.
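As a hedged illustration of how one training record might be assembled, the sketch below computes the enantiomeric excess from the two enantiomer peak areas of a chiral HPLC trace using the standard definition ee = |A1 - A2| / (A1 + A2) × 100%. The record layout and the function names are assumptions made for illustration; the specification does not prescribe a data format.

```python
from typing import Sequence


def enantiomeric_excess(area_a: float, area_b: float) -> float:
    """Return the absolute ee in percent from two enantiomer peak areas."""
    total = area_a + area_b
    if total <= 0:
        raise ValueError("peak areas must sum to a positive value")
    return 100.0 * abs(area_a - area_b) / total


def make_training_record(features: Sequence[float],
                         yield_pct: float,
                         area_a: float,
                         area_b: float) -> dict:
    """Bundle reaction-level features with the measured yield and ee labels."""
    return {
        "features": list(features),
        "yield": yield_pct,
        "ee": enantiomeric_excess(area_a, area_b),
    }
```

Using the absolute value matches the option in claim 6 of learning the absolute value of ee; a signed variant would be needed if the model had to distinguish which enantiomer dominates.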

[0083] The model training module 400 is used to train a random forest model based on the training dataset.
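The training idea can be sketched with a deliberately small stand-in. The code below is not the patented model: it is a toy random forest regressor built from depth-1 trees (stumps), each fitted on a bootstrap sample with a random feature subset, written with the standard library only so that it stays self-contained. All names and hyperparameters are hypothetical; a practical implementation would more likely rely on an established machine-learning library and deeper trees.

```python
import random
from statistics import mean


def fit_stump(X, y, feature_ids):
    """Find the single (feature, threshold) split with the lowest squared error."""
    best = None
    for f in feature_ids:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(X, y) if row[f] <= thr]
            right = [t for row, t in zip(X, y) if row[f] > thr]
            lm, rm = mean(left), mean(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, thr, lm, rm)
    if best is None:
        # No feature varies in this sample: fall back to a constant prediction.
        m = mean(y)
        return (0, float("inf"), m, m)
    return best[1:]


def fit_forest(X, y, n_trees=25, n_sub_features=2, seed=0):
    """Train stumps on bootstrap samples, each with a random feature subset."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feats = rng.sample(range(d), min(n_sub_features, d))
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Average the predictions of all stumps for one feature vector."""
    return mean(lm if row[f] <= thr else rm for f, thr, lm, rm in forest)
```

Under this sketch, one forest would be fitted on yield labels and a second on ee labels, mirroring the separate yield and enantioselectivity regression models listed in claim 5.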

[0084] The catalyst set screening module 500 is used to input the reaction level features into the random forest model, perform batch prediction of candidate tetrapeptide catalysts in the pre-constructed catalyst virtual library, output the predicted yield value, the predicted enantioselectivity value and / or classification level of the candidate catalysts, and screen the target catalyst set accordingly.
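Batch prediction over a virtual library can be sketched as follows. The residue alphabet, the thresholds, and the two predictor callables are hypothetical placeholders: the specification does not disclose the library composition or the cut-off values. The lowercase "p" follows the sequence notation used above (for example VpAL), where it is assumed here to mark D-proline.

```python
from itertools import product
from typing import Callable

# Hypothetical residue alphabet; lowercase "p" is assumed to denote D-proline.
RESIDUES = "AGLVPp"


def build_virtual_library(residues: str = RESIDUES, length: int = 4) -> list[str]:
    """Enumerate every sequence of the given length over the residue alphabet."""
    return ["".join(seq) for seq in product(residues, repeat=length)]


def screen(library: list[str],
           predict_yield: Callable[[str], float],
           predict_ee: Callable[[str], float],
           min_yield: float = 80.0,
           min_ee: float = 50.0) -> list[tuple[str, float, float]]:
    """Keep candidates whose predicted yield and ee both reach the thresholds.

    The thresholds are illustrative defaults, not values from the patent.
    Results are sorted by predicted yield, then by predicted ee, descending.
    """
    hits = []
    for seq in library:
        y, ee = predict_yield(seq), predict_ee(seq)
        if y >= min_yield and ee >= min_ee:
            hits.append((seq, y, ee))
    return sorted(hits, key=lambda h: (h[1], h[2]), reverse=True)
```

In practice the two callables would wrap the trained yield and enantioselectivity models together with the feature-encoding step, so that each sequence is encoded once and scored by both models.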

[0085] The output verification module 600 is used to perform solvent optimization on the target catalyst set, output the optimal solvent recommendation, and perform experimental verification and substrate expansion verification on the target catalyst.
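One simple way to picture the solvent optimization is an exhaustive comparison of predicted outcomes across candidate solvents for each target catalyst. The sketch below assumes a predictor that returns a (yield, ee) pair for a catalyst-solvent combination and ranks solvents by predicted ee first and predicted yield second. That ranking rule, the function name, and the predictor interface are assumptions; the specification does not state the optimization criterion.

```python
from typing import Callable, Sequence


def recommend_solvent(catalyst: str,
                      solvents: Sequence[str],
                      predict: Callable[[str, str], tuple[float, float]]):
    """Return the best solvent and its predicted (yield_pct, ee_pct).

    Assumed ranking: highest predicted ee, ties broken by predicted yield.
    """
    if not solvents:
        raise ValueError("at least one candidate solvent is required")
    scored = {s: predict(catalyst, s) for s in solvents}
    best = max(solvents, key=lambda s: (scored[s][1], scored[s][0]))
    return best, scored[best]
```

The recommended solvent would then be carried into experimental verification and substrate expansion, which remain laboratory steps outside the scope of this sketch.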

[0086] In one embodiment, a computer device is also provided, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus.

[0087] The memory is used to store a computer program. When the processor executes the computer program stored in the memory, it performs the method for predicting Aldol reaction results and efficiently screening tetrapeptide catalysts, implementing the steps in the above method embodiments: S10. Obtain reactant structure information, candidate tetrapeptide catalyst structure information, and reaction condition information for the Aldol reaction; S20. Call the pre-trained molecular characterization model to encode the reactant structure information and the candidate tetrapeptide catalyst structure information into vector representations, and fuse them with the reaction condition information to obtain reaction level features; S30. Perform at least one batch of Aldol reactions based on a high-throughput experimental platform to obtain the corresponding yield data and enantioselectivity data, and construct training data; S40. Train a random forest model based on the training dataset; S50. Input the reaction level features into the random forest model to perform batch prediction of candidate tetrapeptide catalysts in the pre-constructed catalyst virtual library, output the predicted yield value, predicted enantioselectivity value and / or classification level of the candidate catalysts, and screen the target catalyst set accordingly; S60. Perform solvent optimization on the target catalyst set, output the optimal solvent recommendation, and conduct experimental verification and substrate expansion verification of the target catalyst.

[0088] The communication bus mentioned in the above terminal can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. This communication bus can be divided into address bus, data bus, control bus, etc. For ease of illustration, only one thick line is used to represent it in the diagram, but this does not mean that there is only one bus or one type of bus.

[0089] The communication interface is used for communication between the aforementioned terminal and other devices.

[0090] The memory may include random access memory (RAM) or non-volatile memory, such as at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

[0091] The processors mentioned above can be general-purpose processors, including central processing units (CPUs), network processors (NPs), etc.; they can also be digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.

[0092] The computer equipment includes user equipment and network equipment. The user equipment includes, but is not limited to, computers, smartphones, and PDAs; the network equipment includes, but is not limited to, a single network server, a server group consisting of multiple network servers, or a cloud composed of a large number of computers or network servers based on cloud computing, where cloud computing is a type of distributed computing in which a group of loosely coupled computers forms one super virtual computer. The computer equipment can operate independently to implement the present invention, or it can connect to a network and interact with other computer equipment within the network to implement the present invention. The network in which the computer equipment is located includes, but is not limited to, the Internet, wide area networks (WANs), metropolitan area networks (MANs), local area networks (LANs), and VPN networks.

[0093] It should also be understood that the term "and / or" as used in this specification and the appended claims refers to any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.

[0094] In one embodiment of the present invention, a storage medium is also provided, on which a computer program is stored, which, when executed by a processor, implements the steps in the above method embodiments: S10. Obtain reactant structure information, candidate tetrapeptide catalyst structure information, and reaction condition information for the Aldol reaction; S20. Call the pre-trained molecular characterization model to encode the reactant structure information and the candidate tetrapeptide catalyst structure information into vector representations, and fuse them with the reaction condition information to obtain reaction level features; S30. Perform at least one batch of Aldol reactions based on a high-throughput experimental platform to obtain the corresponding yield data and enantioselectivity data, and construct training data; S40. Train a random forest model based on the training dataset; S50. Input the reaction level features into the random forest model to perform batch prediction of candidate tetrapeptide catalysts in the pre-constructed catalyst virtual library, output the predicted yield value, predicted enantioselectivity value and / or classification level of the candidate catalysts, and screen the target catalyst set accordingly; S60. Perform solvent optimization on the target catalyst set, output the optimal solvent recommendation, and conduct experimental verification and substrate expansion verification of the target catalyst.

[0095] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Furthermore, any references to memory, storage, databases, or other media used in the embodiments provided by this invention can include at least one of non-volatile and volatile memory.

[0096] It should be understood that, as used herein, the singular form "a" is intended to include the plural form as well, unless the context clearly supports an exception. It should also be understood that, as used herein, "and / or" refers to any and all possible combinations of one or more of the associated listed items. The embodiment numbers disclosed above are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.

[0097] Those skilled in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of the invention (including the claims) is limited to these examples. Within the framework of the invention, technical features of the above embodiments or different embodiments can be combined, and many other variations of different aspects of the invention exist, which are not provided in the details for the sake of brevity. Therefore, any omissions, modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the invention should be included within the protection scope of the invention.

Claims

1. A method for predicting Aldol reaction results and efficiently screening tetrapeptide catalysts, characterized in that the method includes: obtaining reactant structure information, candidate tetrapeptide catalyst structure information, and reaction condition information for the Aldol reaction; calling a pre-trained molecular characterization model to encode the reactant structure information and the candidate tetrapeptide catalyst structure information into vector representations, and fusing them with the reaction condition information to obtain reaction-level features; executing at least one batch of Aldol reactions based on a high-throughput experimental platform to obtain corresponding yield data and enantioselectivity data, and constructing training data; training a random forest model based on the training dataset; inputting the reaction-level features into the random forest model to perform batch prediction on candidate tetrapeptide catalysts in a pre-constructed catalyst virtual library, outputting the predicted yield value, predicted enantioselectivity value and / or classification level of the candidate catalysts, and screening a target catalyst set accordingly; and performing solvent optimization on the target catalyst set, outputting an optimal solvent recommendation, and performing experimental verification and substrate expansion verification on the target catalysts.

2. The method for predicting Aldol reaction results and screening tetrapeptide catalysts as described in claim 1, characterized in that, The output of the molecular characterization model is a fixed-dimensional molecular vector.

3. The method for predicting Aldol reaction results and screening tetrapeptide catalysts as described in claim 1, characterized in that the reaction-level features are obtained by concatenation or weighted fusion of reactant vectors, tetrapeptide catalyst vectors, and reaction condition vectors.

4. The method for predicting Aldol reaction results and efficiently screening tetrapeptide catalysts as described in claim 1, characterized in that the high-throughput experiments are performed in parallel in a multi-well reactor to obtain batch reaction data covering combinations of catalysts and solvents, and the yield and enantioselectivity results are obtained by HPLC.

5. The method for predicting Aldol reaction results and efficiently screening tetrapeptide catalysts as described in claim 1, characterized in that, The random forest model includes at least: a yield prediction random forest regression model, an enantioselectivity prediction random forest regression model, and a high-efficiency catalyst hierarchical screening random forest multi-classification model.

6. The method for predicting Aldol reaction results and efficiently screening tetrapeptide catalysts as described in claim 1, characterized in that the enantioselectivity prediction random forest regression model uses the absolute value of ee or the ee grade label as the learning objective.

7. The method for predicting Aldol reaction results and screening tetrapeptide catalysts as described in claim 1, characterized in that, before inputting the reaction-level features into the random forest model to perform batch prediction on candidate tetrapeptide catalysts in the pre-constructed catalyst virtual library, outputting the predicted yield value, predicted enantioselectivity value and / or classification level of the candidate catalysts, and screening the target catalyst set accordingly, the method further includes: constructing a catalyst virtual library containing multiple tetrapeptide sequences.

8. A device for predicting Aldol reaction results and efficiently screening tetrapeptide catalysts, characterized in that the device includes: a data acquisition module, a data fusion module, a training data construction module, a model training module, a catalyst set screening module, and an output verification module; the data acquisition module is used to acquire the reactant structure information, candidate tetrapeptide catalyst structure information, and reaction condition information of the Aldol reaction; the data fusion module is used to call a pre-trained molecular characterization model to encode the reactant structure information and the candidate tetrapeptide catalyst structure information into vector representations, and fuse them with the reaction condition information to obtain reaction level features; the training data construction module is used to execute at least one batch of Aldol reactions based on a high-throughput experimental platform to obtain corresponding yield data and enantioselectivity data, and to construct training data; the model training module is used to train a random forest model based on the training dataset; the catalyst set screening module is used to input the reaction level features into the random forest model, perform batch prediction of candidate tetrapeptide catalysts in the pre-constructed catalyst virtual library, output the predicted yield value, the predicted enantioselectivity value and / or classification level of the candidate catalysts, and screen the target catalyst set accordingly; and the output verification module is used to perform solvent optimization on the target catalyst set, output the optimal solvent recommendation, and perform experimental verification and substrate expansion verification on the target catalyst.

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor loads and executes the computer program to implement the steps of the method for predicting Aldol reaction results and efficiently screening tetrapeptide catalysts as described in any one of claims 1-7.

10. A storage medium storing a computer program, which, when loaded and executed by a processor, implements the steps of the method for predicting Aldol reaction results and efficiently screening tetrapeptide catalysts as described in any one of claims 1-7.