A machine learning-based catalyst virtual screening method, electronic device, computer program product and computer-readable storage medium

By using a pre-trained model based on three-dimensional molecular conformation and machine learning methods, the problems of high computational cost and insufficient characterization ability in catalyst screening are solved, achieving efficient and accurate catalyst performance prediction and screening, which is suitable for large-scale virtual molecular library screening.

CN122224331A · Pending · Publication Date: 2026-06-16 · SHENZHEN CONTINUOUS PHARMACEUTICAL TECHNOLOGY CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHENZHEN CONTINUOUS PHARMACEUTICAL TECHNOLOGY CO LTD
Filing Date
2026-03-13
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing technologies rely on computationally expensive quantum chemical calculations in catalyst screening, which are difficult to apply to large-scale virtual molecular library screening. Furthermore, they have limited ability to characterize catalyst molecules with highly similar structures and cannot effectively distinguish performance differences.

Method used

By employing a pre-trained molecular characterization model based on three-dimensional molecular conformation and combining it with a machine learning prediction model, and by constructing reaction level feature representations and introducing reaction condition features, efficient prediction and screening of catalyst performance can be achieved.

Benefits of technology

It significantly reduces computational costs, improves the ability to distinguish structurally similar catalysts, enhances prediction accuracy and versatility, is suitable for efficient screening under small sample conditions, and shortens the catalyst development cycle.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122224331A_ABST

Abstract

The application discloses a machine learning-based catalyst virtual screening method, an electronic device, a computer program product, and a computer-readable storage medium. The method includes: obtaining molecular representation vectors of reactant molecules and candidate catalyst molecules through a pre-trained molecular representation model; concatenating the molecular representation vectors with reaction condition features to construct a reaction-level feature representation; inputting the reaction-level feature representation into a machine learning prediction model to obtain a predicted reaction result; and ranking the virtual library by performance based on the predicted reaction result to screen out target catalysts.

Description

Technical Field

[0001] This invention relates to the intersection of computational chemistry and artificial intelligence. Specifically, it relates to a machine learning-based virtual screening method for catalysts, an electronic device, a computer program product, and a computer-readable storage medium, which are particularly suitable for predicting and screening the performance of complex catalysts such as peptides across chemical space.

Background Technology

[0002] In the fields of chemical synthesis and materials science, the discovery and optimization of catalysts are crucial for driving the development of related technologies. Traditional catalyst development largely relies on the accumulated experience, chemical intuition, and repeated trial and error of chemical researchers. This development model is not only time-consuming and costly, but also significantly limited in its exploration efficiency when faced with the exponential growth of potential catalyst molecular spaces, making it difficult to systematically discover novel catalysts with superior performance.

[0003] To improve the efficiency of catalyst screening and optimization, computational chemistry and data-driven methods have been increasingly introduced into catalysis research. Existing techniques can be broadly divided into two categories. The first category is based on first-principles calculations, which use quantum chemical methods such as density functional theory (DFT) to obtain the electronic structure or physicochemical properties of molecules and build predictive models of catalytic performance from them. While these methods offer high theoretical accuracy in mechanistic studies, they are computationally very expensive, typically requiring geometry optimization and property calculations for each candidate molecule, which makes them unsuitable for high-throughput screening of large-scale virtual molecular libraries. Furthermore, these methods often rely on descriptors designed manually for a specific reaction system, resulting in poor versatility.

[0004] The second category of methods uses machine learning-based rapid screening techniques, typically employing general descriptors such as two-dimensional molecular structure representations, SMILES strings, molecular fingerprints, or molecular graphs as model input. These methods offer significant advantages in computational efficiency and are suitable for large-scale data processing. However, their predictive performance is highly dependent on large-scale, high-quality labeled reaction data. In many emerging or highly specialized fields of catalysis research, especially in peptide or short-peptide catalysis systems, publicly available reaction data are often limited in quantity and unevenly distributed, leading to insufficient model training and limited generalization ability.

[0005] Furthermore, for complex molecular systems with highly similar structures, existing molecular fingerprint descriptors based on two-dimensional structures or rule-based definitions struggle to characterize three-dimensional conformational differences and spatial interactions. For example, among tetrapeptide catalysts that differ by only one or two amino acid residues, minute conformational changes often lead to significant differences in catalytic activity and enantioselectivity. Traditional descriptors fail to capture these differences, which limits the model's ability to distinguish and screen high-performance catalysts.

[0006] Therefore, there is an urgent need for a new virtual screening scheme for catalysts. Such a scheme should rapidly generate molecular features with strong characterization capabilities without relying on high-cost quantum chemical calculations or large-scale labeled reaction data. It should fully incorporate the three-dimensional conformational information of molecules and combine it with a computationally efficient machine learning model to distinguish catalyst molecules that have similar structures but significantly different performance. This would improve the efficiency of large-scale virtual molecular library screening and accelerate the discovery and verification of high-performance catalysts.

Summary of the Invention

[0007] The main objective of this invention is to provide a machine learning-based virtual screening method for catalysts, electronic devices, computer program products, and computer-readable storage media. This invention aims to solve the technical problems in the prior art, such as reliance on high-cost quantum chemical calculations in catalyst screening processes, insufficient adaptability to large-scale virtual molecular libraries, and limited ability to characterize highly similar catalyst molecules, thereby achieving efficient and accurate prediction and screening of catalyst performance.

[0008] To achieve the above objectives, the present invention provides a machine learning-based virtual screening method for catalysts, comprising: obtaining, through a pre-trained molecular characterization model, molecular representation vectors of the reactant molecules and the candidate catalyst molecules in the reaction system; concatenating the molecular representation vectors of the reactant molecules and the molecular representation vectors of the candidate catalyst molecules with reaction condition features to construct a reaction-level feature representation; inputting the reaction-level feature representation into a preset machine learning prediction model to obtain the predicted reaction results corresponding to the candidate catalyst molecule; and, based on the predicted reaction results, ranking a virtual library containing multiple candidate catalyst molecules by performance to screen for target catalysts.

[0009] In a preferred embodiment, the pre-trained molecular characterization model is a molecular representation model pre-trained based on three-dimensional molecular conformations, used to map the three-dimensional structural information of candidate catalyst molecules and reactant molecules into continuous vector representations of fixed dimensions. By incorporating three-dimensional molecular conformation information, the molecular representation model enables the generated molecular representation to simultaneously include molecular topology, spatial configuration, and potential non-covalent interaction information, thereby enhancing the characterization capability of reaction performance.

[0010] In a preferred embodiment, the step of constructing the reaction-level feature representation specifically includes: The molecular representation vectors of the candidate catalyst molecules and the molecular representation vectors of the reactant molecules are concatenated and further fused with reaction condition features to form a joint feature representation for reaction prediction. The reaction conditions include at least one of reaction temperature, reaction time, and solvent type.

[0011] In a preferred embodiment, the machine learning prediction model is a Random Forest (RF) model, used to predict reaction yield, enantioselectivity, or a combination of both. In a further preferred embodiment, a multi-layer perceptron (MLP) model can be introduced as a comparative model to evaluate the predictive performance of the molecular representation under different model architectures, thereby determining the optimal predictive model for a specific reaction system.

[0012] In a preferred embodiment, the method systematically evaluates the performance of prediction models constructed using different molecular representation methods, and selects the combination of molecular representation and model with the best prediction accuracy for catalyst screening. The performance evaluation is based on multiple rounds of random training/test partitioning of the reaction dataset, and uses the coefficient of determination (R²) and the mean absolute error (MAE) as evaluation metrics to ensure the robustness and reliability of the model evaluation results.
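The two evaluation metrics and the repeated random partitioning can be sketched in plain Python as follows. This is a minimal illustration rather than the applicant's implementation: the function names, the test fraction, and the seeding scheme are assumptions introduced here.

```python
import random

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def random_split(n_samples, test_fraction, seed):
    """Return (train_indices, test_indices) for one random partition."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]
```

Calling `random_split` with a different `seed` for each round, fitting the model on the training indices, and averaging `r2_score` and `mean_absolute_error` over the rounds gives one way to realize the repeated random-partition evaluation described above.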

[0013] In a preferred embodiment, the candidate catalyst molecule is a polypeptide molecule, preferably a tetrapeptide molecule. The method also includes sampling a representative subset of candidates from the virtual library using the Kennard-Stone algorithm for initial experimental validation, and using the results to construct an initial training set for the machine learning prediction model, which improves the predictive reliability of the model when experimental data are limited.

[0014] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the method described in any of the preceding claims.

[0015] The present invention also provides a computer program product, including computer instructions that, when executed by a processor, implement the method described in any of the preceding claims.

[0016] The present invention also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method described in any of the preceding claims.

[0017] Compared with the prior art, the present invention has at least the following beneficial effects. Efficient molecular characterization significantly reduces computational costs. This invention introduces a molecular characterization model pre-trained on three-dimensional molecular conformations to replace the traditional, computationally expensive calculation of density functional theory (DFT) descriptors. While maintaining prediction accuracy, it generates molecular features rapidly, significantly reducing computational resource consumption and the overall screening cycle, and making high-throughput screening of large-scale virtual catalyst libraries possible.

[0018] It balances prediction accuracy and computational efficiency, and has good versatility. Experimental results show that the pre-trained molecular representation used in this invention can achieve prediction performance comparable to or even better than that based on physicochemical properties or DFT descriptors in a variety of reaction systems. It also shows stability in the task of predicting reaction yield and enantioselectivity, proving that the method has good versatility and transferability across reaction systems.

[0019] It significantly improves the ability to distinguish between catalysts with highly similar structures. For complex molecular systems such as peptide catalysts with highly similar structures and subtle conformational differences, the three-dimensional molecular representation used in this invention can capture spatial configurational differences that are difficult to characterize using traditional molecular fingerprints and two-dimensional descriptors, thereby improving the accuracy of predicting catalyst performance differences and overcoming the modeling bottleneck of existing technologies in this type of system.

[0020] It is suitable for small-sample and engineering-oriented screening scenarios. Through reasonable data partitioning and model evaluation strategies, this invention can still construct a predictive model with good generalization ability when the amount of experimental data is limited. It can be used to quickly screen potential high-performance catalysts from a large number of untested candidates, significantly reducing the number of screening experiments, and has good engineering application value.

[0021] It has good scalability and system integration capabilities. The method of this invention can flexibly combine different reaction systems, different types of catalysts, and various machine learning models, and can be embedded in automated reaction optimization and intelligent screening systems, providing an efficient and scalable technical solution for data-driven catalyst design and optimization.

Description of the Drawings

[0022] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below.

[0023] Figure 1 is a schematic flowchart of a machine learning-based virtual screening method for catalysts provided in an embodiment of the present invention.

[0024] Figure 2 is a hardware structure block diagram of an electronic device provided in an embodiment of the present invention.

[0025] Figure 3 is a schematic diagram of an overall technical framework for reaction prediction and catalyst screening based on pre-trained molecular characterization according to an embodiment of the present invention.

[0026] Figure 4 is a schematic diagram illustrating the prediction performance of pre-trained molecular characterization on various reaction datasets according to embodiments of the present invention.

[0027] Figure 5 is a schematic diagram illustrating the construction of a virtual molecular library of tetrapeptide catalysts according to an embodiment of the present invention.

[0028] Figure 6 is a schematic diagram of a machine learning-based tetrapeptide catalyst screening and iterative optimization process according to an embodiment of the present invention.

Detailed Implementation

[0029] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the scope of protection of the present invention.

[0030] Embodiment 1. This embodiment provides a machine learning-based virtual catalyst screening method. Referring to Figure 1, the method is executed on a computer, and its flow includes the following steps.

[0031] In this embodiment, to achieve systematic and high-coverage screening of potential catalysts, a virtual molecular library for peptide molecules is first constructed. Considering that peptide catalysts, especially short peptide catalysts, typically have high conformational flexibility, large sequence modification space, and highly sensitive catalytic performance to amino acid sequence, this embodiment adopts a virtual library construction strategy based on combinations of natural amino acid sequences.

[0032] Specifically, the 20 common natural amino acids are selected as basic units and combined exhaustively, with repetition allowed, at the tetrapeptide sequence length to construct a virtual molecular library of tetrapeptide candidate catalysts. The sequence space of this tetrapeptide virtual library is 20^4, i.e., 160,000 different sequences. This construction method systematically covers different amino acid compositions, sequence orders, and stereochemical differences, ensuring the completeness and diversity of the screening space. The system further converts the generated tetrapeptide sequences into corresponding molecular representations, such as SMILES strings or molecular graph structures, as input for the subsequent molecular characterization model.
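The size of this sequence space can be checked with a short enumeration. The sketch below assumes the 20 standard amino acids written as one-letter codes; the function name is illustrative, and conversion of each sequence to a SMILES string or molecular graph is left out.

```python
from itertools import product

# One-letter codes of the 20 standard amino acids (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def enumerate_tetrapeptides(alphabet=AMINO_ACIDS, length=4):
    """Yield every sequence of `length` residues, repeats allowed."""
    for residues in product(alphabet, repeat=length):
        yield "".join(residues)

library = list(enumerate_tetrapeptides())
print(len(library))   # 160000
print(library[:3])    # ['AAAA', 'AAAC', 'AAAD']
```

Because the enumeration is a generator, a larger alphabet (for example, one that adds non-natural residues) could be streamed into the characterization model without holding every sequence in memory.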

[0033] Step S101: Molecular representation vectors of reactant molecules and candidate catalyst molecules in the reaction system are obtained through a pre-trained molecular characterization model.

[0034] In this embodiment, the core of the technical solution lies in introducing a molecular representation model pre-trained with large-scale unsupervised data to map the structural information of molecules into continuous numerical vectors of fixed dimensions, i.e., molecular representation vectors. Unlike traditional methods that rely on quantum chemical calculations or manually designed descriptors, this pre-trained model can quickly generate molecular representations with good transferability without performing computationally expensive first-principles calculations.

[0035] In one specific implementation, the molecular representation vector is a 512-dimensional continuous vector. Since this pre-trained model is trained on large-scale molecular data, it has internally learned implicit information such as the three-dimensional conformation of molecules, interatomic interactions, and chemical environment. Therefore, the generated molecular representation vector can comprehensively reflect the structural characteristics and potential reactive behaviors of molecules.

[0036] See Figure 3 This demonstrates the overall technical framework of the present invention. Through this pre-trained molecular characterization model, the system can extract features from reactant molecules and candidate tetrapeptide catalyst molecules respectively, obtaining molecular representation vectors of uniform dimension that can be directly used in downstream machine learning models.

[0037] Step S102: The molecular representation vectors of reactant molecules, the molecular representation vectors of candidate catalyst molecules, and the reaction condition features are concatenated to construct the reaction-level feature representation.

[0038] After obtaining the molecular representation vectors of each component in the reaction system, the system combines the molecular representation vectors of reactant molecules with those of candidate catalyst molecules to form a reaction-level feature representation that describes the complete reaction system. In one implementation, this combination is accomplished through a vector concatenation operation.

[0039] Furthermore, reaction condition features are also incorporated into the reaction-level feature representation. These reaction condition features include, but are not limited to, at least one of reaction temperature, reaction time, solvent type, and reactant ratio. After being quantified numerically or encoded, these reaction condition features are concatenated with the molecular representation vectors to form a unified reaction-level feature vector.

[0040] In one alternative implementation, the concatenated reaction-level feature vector is further normalized or weighted to balance the influence of different component features during model training, thereby improving the stability and prediction accuracy of the model.
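One way to picture the concatenation and optional normalization is the following plain-Python sketch. The solvent vocabulary, the one-hot encoding of solvent type, and the choice of z-score normalization are assumptions made for illustration; the source does not specify how conditions are encoded or which normalization is used.

```python
SOLVENTS = ["toluene", "DCM", "THF"]  # hypothetical solvent vocabulary

def encode_conditions(temperature_c, time_h, solvent):
    """Encode conditions as [temperature, time, one-hot solvent...]."""
    one_hot = [1.0 if solvent == s else 0.0 for s in SOLVENTS]
    return [float(temperature_c), float(time_h)] + one_hot

def build_reaction_features(reactant_vecs, catalyst_vec, conditions):
    """Concatenate reactant vectors, catalyst vector, and conditions."""
    features = []
    for vec in reactant_vecs:
        features.extend(vec)
    features.extend(catalyst_vec)
    features.extend(conditions)
    return features

def standardize_columns(rows):
    """Z-score each feature column across reactions; constant columns become 0."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [(sum((x - m) ** 2 for x in col) / n) ** 0.5
            for col, m in zip(columns, means)]
    return [[(x - m) / s if s > 0 else 0.0
             for x, m, s in zip(row, means, stds)]
            for row in rows]
```

With 512-dimensional molecular vectors, one reactant, one catalyst, and the five condition features above, each reaction would be described by a 1,029-dimensional vector. Normalizing per column keeps a raw temperature value from dominating the embedding dimensions in distance-based steps, although tree-based models such as random forests are largely insensitive to feature scale.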

[0041] Initial training set construction (cold start sample selection) To address the problem of insufficient experimental data and difficulty in direct training of machine learning models in the initial stage, this embodiment introduces an initial sample screening strategy based on chemical space coverage.

[0042] Specifically, the system first uses the pre-trained molecular characterization model to extract the molecular representation vectors of all candidate tetrapeptide molecules in the virtual library. Then, a dimensionality reduction algorithm (such as t-SNE or UMAP) is used to map the high-dimensional molecular representation vectors to a low-dimensional space, thereby constructing a chemical space distribution map of the candidate catalysts. This chemical space can reflect the similarities and differences between different tetrapeptide molecules at the structural and property levels.

[0043] Based on this, to ensure that the initial experimental samples uniformly cover the entire chemical space, the system employs a representative subset selection algorithm to screen initial samples from the virtual library. In this embodiment, the representative subset selection algorithm is preferably the Kennard-Stone algorithm or a maximum dissimilarity algorithm. By calculating the distances between molecules in the chemical space, the algorithm selects a representative subset of tetrapeptide molecules that are distributed as uniformly as possible and differ from one another as much as possible.

[0044] The selected representative tetrapeptide molecules are then synthesized and tested in wet-lab experiments to obtain the actual reaction yield and/or enantiomeric excess (ee) values. This batch of experimental data is used as the initial training data to build the machine learning prediction model. Compared to random sampling, this strategy can significantly improve the model's generalization ability in the early stages of training and reduce the number of experimental rounds required for subsequent model optimization.
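The Kennard-Stone selection mentioned above can be sketched as follows. This is a textbook form of the algorithm using Euclidean distance, written in plain Python for readability; the function name and the distance metric are assumptions, and the source does not state which distance or which feature space (full or reduced) is used in practice.

```python
from math import dist

def kennard_stone(points, k):
    """Return indices of k points chosen by the Kennard-Stone procedure."""
    n = len(points)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of points")
    # Seed the subset with the two points that are farthest apart.
    best_d, first, second = -1.0, 0, 1
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(points[i], points[j])
            if d > best_d:
                best_d, first, second = d, i, j
    selected = [first, second]
    # min_dist[i] = distance from point i to its nearest selected point.
    min_dist = [min(dist(p, points[first]), dist(p, points[second]))
                for p in points]
    while len(selected) < k:
        # Add the candidate that is farthest from the current subset.
        candidates = [i for i in range(n) if i not in selected]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], dist(points[i], points[nxt]))
    return selected

points = [(0, 0), (1, 0), (10, 0), (5, 0), (9, 0)]
print(kennard_stone(points, 3))  # [0, 2, 3]
```

The exhaustive search for the farthest pair is quadratic in the number of points, so on a library of 160,000 molecules a practical implementation would likely vectorize the distance computations or run the selection on a reduced or pre-clustered set.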

[0045] Step S103: The reaction level features are input into a pre-defined machine learning prediction model to obtain the predicted reaction results corresponding to the candidate catalyst molecules.

[0046] In this embodiment, the machine learning prediction model can adopt a variety of supervised learning model structures, including but not limited to random forest models, gradient boosting decision tree models, or deep neural network models.

[0047] In a preferred embodiment, the machine learning prediction model is a random forest model. This model takes the reaction-level feature representation constructed in step S102 as input and outputs predicted reaction results for the candidate catalyst under given reaction system and reaction conditions. The predicted reaction results include, but are not limited to, reaction yield, enantioselectivity, or classification results indicating whether the reaction was successful.
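A random forest regressor of this kind could be set up with scikit-learn roughly as shown below. The data are random placeholders standing in for reaction-level features and measured yield/ee values, and the hyperparameters (`n_estimators=200`, `random_state=0`) are illustrative assumptions, not values taken from the source.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Placeholder training data: 40 reactions, 16 features each.
X_train = rng.normal(size=(40, 16))
# Two targets per reaction: column 0 = yield (%), column 1 = ee (%).
y_train = rng.uniform(0.0, 100.0, size=(40, 2))

# A single forest can fit both targets jointly (multi-output regression).
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Predict yield and ee for five untested candidates.
X_candidates = rng.normal(size=(5, 16))
predictions = model.predict(X_candidates)
print(predictions.shape)  # (5, 2)
```

For the success/failure classification output mentioned above, `RandomForestClassifier` from the same module would be the corresponding estimator.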

[0048] This prediction method has been validated on several publicly available high-throughput reaction datasets. As shown in Figure 4, the proposed method exhibits good predictive performance across different types of reaction systems, and its prediction accuracy is comparable to that of methods based on manually designed or quantum chemical descriptors.

[0049] Step S104: Based on the predicted reaction results, the performance of a virtual library containing multiple candidate catalyst molecules is ranked to screen out the target catalyst.

[0050] After the machine learning prediction model is trained, the system applies the model to the entire tetrapeptide virtual catalyst library to perform batch predictions of the performance of each candidate catalyst in the library under specific reactants and reaction conditions.

[0051] Subsequently, all candidate catalysts are ranked according to their predicted reaction yields and / or enantioselectivity. In one implementation, performance threshold conditions can be set to screen candidate catalysts from a virtual library that simultaneously meet the requirements of high yield and high enantioselectivity, and these are selected as target catalysts.
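The ranking and threshold step can be sketched as a simple filter followed by a sort. The function name, the example thresholds, and the choice to rank by ee first and yield second are assumptions for illustration; the source leaves the ranking criterion open.

```python
def screen_candidates(predictions, min_yield, min_ee, top_n=None):
    """Keep candidates meeting both thresholds, ranked by ee then yield.

    `predictions` maps a catalyst name to (predicted_yield, predicted_ee).
    """
    hits = [(name, y, ee)
            for name, (y, ee) in predictions.items()
            if y >= min_yield and ee >= min_ee]
    hits.sort(key=lambda row: (row[2], row[1]), reverse=True)
    return hits[:top_n]

predicted = {
    "AAAA": (90.0, 95.0),
    "AAAC": (85.0, 99.0),
    "AAAD": (95.0, 60.0),   # high yield, but ee below threshold
    "AAAE": (50.0, 98.0),   # high ee, but yield below threshold
}
print(screen_candidates(predicted, min_yield=80.0, min_ee=90.0))
# [('AAAC', 85.0, 99.0), ('AAAA', 90.0, 95.0)]
```

A weighted score that combines yield and ee would be an equally reasonable ranking key when neither property has clear priority.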

[0052] As illustrated in Figure 5, this method can efficiently screen a small subset of candidate catalysts with excellent predicted performance from a virtual tetrapeptide catalyst space of approximately 160,000 candidates. This subset can then be used for subsequent experimental synthesis and verification, thereby significantly reducing the cost of experimental screening and shortening the catalyst development cycle.

[0053] Furthermore, in order to construct the initial machine learning prediction model, this embodiment also includes a step of constructing an initial training set. As shown in the closed-loop optimization process of Figure 6, a small representative subset is selected from the chemical space for initial experimental verification before the entire virtual library is predicted. Specifically, the Kennard-Stone algorithm can be used, which enables uniform sampling in a high-dimensional feature space and ensures that the selected samples cover the entire chemical space. In Figure 6, the red dots in the chemical space represent the representative peptide subset selected by the algorithm. Experiments are conducted on these representative catalysts to obtain their actual reaction results, which constitute the initial training set for the machine learning prediction model. Based on this initial model, the aforementioned full-library prediction and screening can then be performed.

[0054] Embodiment 2. This embodiment builds on Embodiment 1 and further optimizes the feature fusion method, the catalyst molecule representation, and the screening process of the technical solution to improve the modeling ability and screening efficiency for peptide catalyst systems with highly similar structures.

[0055] First, in the step of constructing reaction-level feature representations, this embodiment introduces an interactive feature fusion network architecture. This architecture includes independent substrate encoding and catalyst encoding modules, and achieves information interaction between them through a multi-head cross-attention mechanism. Specifically, the molecular representation vectors of candidate catalyst molecules are used as query vectors, and the molecular representation vectors of reactant molecules are used as key and value vectors; all three are input into a dual-channel cross-attention network. This network generates attention weights by calculating the correlation between the query vector and the key vectors, and uses these weights to compute a weighted aggregation of the value vectors. In this way, the interactions between catalyst and reactant molecules are modeled explicitly at the feature level, producing a feature representation that contains interaction information.

[0056] To further enhance feature representation capabilities, the output of the cross-attention network is residually connected to the original molecular representation vectors of candidate catalyst molecules, and then subjected to layer normalization to obtain an enhanced catalyst feature representation. Subsequently, this enhanced catalyst feature representation is concatenated with the original molecular representation vectors of reactant molecules and reaction condition features to form the final reaction-level feature representation, which serves as the input to the machine learning prediction model.
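The attention, residual connection, and layer normalization described in the two paragraphs above can be illustrated with a simplified NumPy sketch. It uses a single attention head rather than the multi-head, dual-channel network of the embodiment, random untrained projection matrices, and a layer normalization without learned scale and shift; all names and dimensions are assumptions for illustration only.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def cross_attention_fusion(catalyst_vec, reactant_vecs, w_q, w_k, w_v):
    """Single-head cross-attention: catalyst as query, reactants as keys/values."""
    d = w_q.shape[1]
    q = catalyst_vec @ w_q                 # query, shape (d,)
    k = reactant_vecs @ w_k                # keys, shape (n_reactants, d)
    v = reactant_vecs @ w_v                # values, shape (n_reactants, d)
    scores = k @ q / np.sqrt(d)            # scaled dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over reactants
    attended = weights @ v                 # weighted aggregation of values
    enhanced = layer_norm(catalyst_vec + attended)  # residual + layer norm
    return enhanced, weights

rng = np.random.default_rng(0)
dim = 8                                    # toy size; the source uses 512
catalyst = rng.normal(size=dim)
reactants = rng.normal(size=(2, dim))      # two reactant molecules
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

enhanced, weights = cross_attention_fusion(catalyst, reactants, w_q, w_k, w_v)
```

The `enhanced` vector would then be concatenated with the original reactant vectors and the reaction condition features to form the final reaction-level representation. In a trained network the projection matrices would be learned jointly with the downstream predictor.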

[0057] Secondly, because peptide catalysts are structurally highly similar and even minor sequence differences can lead to significant performance changes, this embodiment employs a multimodal molecular representation strategy. This strategy introduces two types of feature information simultaneously: first, a physicochemical property embedding vector based on the amino acid sequence, used to describe the physicochemical properties of the amino acid residues; and second, a topological feature vector extracted by a pre-trained graph isomorphism network (GIN) from the catalyst's molecular graph, used to reflect the overall connectivity and spatial structural features of the molecule. These two types of feature vectors are fused to form a unified catalyst molecular representation vector, thereby improving the model's ability to recognize conformational differences in peptides.

[0058] Furthermore, to address the cold-start problem of machine learning models with limited experimental data, this embodiment continues to employ an initial sample selection strategy based on the chemical space distribution. The system uses a pre-trained molecular characterization model to generate high-dimensional feature vectors for all tetrapeptide molecules in the virtual library, and constructs the chemical space distribution of candidate catalysts using a dimensionality reduction algorithm. Subsequently, a representative subset selection algorithm (e.g., the Kennard-Stone algorithm or a maximum dissimilarity algorithm) is used to select from the chemical space several tetrapeptide molecules that are uniformly distributed and maximally different from one another as initial experimental samples. Experimental synthesis and performance testing are performed on this representative subset to obtain the actual reaction yield or enantiomeric excess value. These experimental data are then used as the initial training set to train the machine learning prediction model.

[0059] Furthermore, this embodiment constructs the entire catalyst screening process as an iterative optimization process based on the hierarchical classification of prediction results. Specifically, the machine learning prediction model adopts a random forest model structure, which is used for regression prediction of reaction yield and enantioselectivity, and for classification prediction of candidate catalysts into high-performance, medium-performance, and low-performance categories. In each round of prediction, the system performs batch predictions on candidate catalysts in the virtual library that have not yet been experimentally verified, and classifies them into different performance levels according to the prediction results.

[0060] In the subsequent experimental screening phase, the system prioritizes selecting molecules from candidate catalyst categories predicted to have high yields and high enantioselectivity for experimental verification. Simultaneously, it considers selecting representative molecules from the medium-performance category to avoid local optima. After the experiments are completed, newly added experimental data are incorporated into the training set to retrain the random forest prediction model, thereby gradually improving the model's prediction accuracy for the target reaction system. Through this approach, a closed-loop screening process without explicit uncertainty modeling is achieved, significantly reducing model complexity and implementation difficulty while maintaining screening efficiency.
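One round of this tiered selection might look like the sketch below. The tier thresholds, the ranking by the sum of predicted yield and ee, and the number of medium-tier picks per batch are all hypothetical values chosen for illustration; the source does not specify them.

```python
def assign_tier(pred_yield, pred_ee, high=(80.0, 90.0), low=(40.0, 50.0)):
    """Classify a candidate as 'high', 'medium', or 'low' (assumed thresholds)."""
    if pred_yield >= high[0] and pred_ee >= high[1]:
        return "high"
    if pred_yield < low[0] or pred_ee < low[1]:
        return "low"
    return "medium"

def select_next_batch(predictions, tested, batch_size, n_medium=1):
    """Pick untested candidates: mostly high tier, plus a few medium tier."""
    untested = {name: p for name, p in predictions.items() if name not in tested}
    ranked = sorted(untested, key=lambda name: sum(untested[name]), reverse=True)
    high = [n for n in ranked if assign_tier(*untested[n]) == "high"]
    medium = [n for n in ranked if assign_tier(*untested[n]) == "medium"]
    picked_medium = medium[:n_medium]
    picked_high = high[:batch_size - len(picked_medium)]
    return picked_high + picked_medium

predicted = {
    "A": (90.0, 95.0), "B": (85.0, 92.0), "C": (60.0, 70.0),
    "D": (10.0, 20.0), "E": (95.0, 99.0),
}
print(select_next_batch(predicted, tested={"E"}, batch_size=3))
# ['A', 'B', 'C']
```

After the selected batch is measured, the new results would be appended to the training set, the model refit, and `select_next_batch` called again on the refreshed predictions, which closes the loop without requiring an explicit uncertainty estimate.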

[0061] Embodiment 3: The present invention also provides an electronic device. Referring to Figure 2, the electronic device 200 may be a server, a personal computer, a workstation, or the like. The electronic device 200 includes at least one processor 201, a memory 202, and a computer program stored in the memory 202 and executable on the processor 201.

[0062] The processor 201 is the control center of the device. It can be a central processing unit or a computing unit that includes a graphics processing unit or a dedicated artificial intelligence acceleration chip, used to perform mathematical operations and model inference.

[0063] The memory 202 can be volatile or non-volatile. The memory 202 stores computer programs that, when loaded and executed by the processor 201, can implement the steps of the machine learning-based catalyst virtual screening method described in Embodiment 1 or Embodiment 2. For example, the memory 202 stores pre-trained molecular characterization models, machine learning prediction models, virtual library data of candidate catalysts, and algorithm instructions for executing active learning strategies.

[0064] During runtime, the processor 201 executes the computer program stored in the memory 202, specifically implementing the following functions: calling the pre-trained molecular characterization model to process the input reactant and candidate catalyst molecular data and generate molecular representation vectors; executing feature construction logic that either concatenates the molecular representation vectors with the reaction condition features or fuses them through a cross-attention network to construct reaction-level feature representations; loading the machine learning prediction model to run inference on the reaction-level feature representations and output predicted reaction results and prediction uncertainties; and sorting the virtual library based on the prediction results, or executing a maximum expected improvement sampling strategy, to screen out the next batch of target catalysts to be verified.
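The cross-attention route for feature construction can be sketched as follows, following the arrangement recited in claims 5 and 6: the catalyst vector is the query, the reactant vectors supply the keys and values, and the result passes through a residual connection and layer normalization. This dependency-free sketch makes several simplifying assumptions that are not in the source: learned projection matrices and the layer-norm scale and bias are omitted, heads are formed by slicing the vectors, and the "dual-channel" arrangement, which the text does not define, is not modeled.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(vec, eps=1e-5):
    """Normalize to zero mean and unit variance (no learned scale/bias)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def cross_attention(query, keys, values, n_heads):
    """Multi-head scaled dot-product attention of one query over many keys."""
    dim = len(query)
    if dim % n_heads:
        raise ValueError("vector length must be divisible by n_heads")
    head_dim = dim // n_heads
    out = []
    for h in range(n_heads):
        part = slice(h * head_dim, (h + 1) * head_dim)
        q = query[part]
        scores = [
            sum(a * b for a, b in zip(q, key[part])) / math.sqrt(head_dim)
            for key in keys
        ]
        weights = softmax(scores)
        # Weighted sum of this head's slice of each value vector.
        for j in range(head_dim):
            out.append(sum(w * v[part][j] for w, v in zip(weights, values)))
    return out

def reaction_features(catalyst, reactants, conditions, n_heads=2):
    """Build a reaction-level feature vector for one catalyst."""
    attended = cross_attention(catalyst, reactants, reactants, n_heads)
    # Residual connection with the original catalyst vector, then layer norm.
    enhanced = layer_norm([c + a for c, a in zip(catalyst, attended)])
    flat_reactants = [x for vec in reactants for x in vec]
    return enhanced + flat_reactants + list(conditions)

# Hypothetical 4-dimensional embeddings and two condition features.
demo_features = reaction_features(
    catalyst=[0.2, -0.1, 0.4, 0.3],
    reactants=[[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]],
    conditions=[25.0, 12.0],
)
```

The resulting vector holds the enhanced catalyst features, then the flattened reactant vectors, then the condition features. In practice this step would be written with a tensor library and trained projections rather than plain lists.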

[0065] This invention ensures the completeness of the screening space by constructing a tetrapeptide virtual library based on full permutations, thus avoiding the omission of potentially highly active molecules. At the same time, by adopting a cold-start strategy based on chemical space, it solves the problem of blind exploration in the early stage of traditional machine learning when there is no data, significantly reducing experimental costs and shortening the model convergence cycle.
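The size of the screening space can be illustrated with a minimal enumeration sketch. It assumes the library uses the 20 standard proteinogenic amino acids written as one-letter codes, that residues may repeat within a sequence, and that stereochemical variants are ignored; the function name `tetrapeptide_library` is hypothetical.

```python
from itertools import product

# One-letter codes of the 20 standard amino acids (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def tetrapeptide_library(alphabet=AMINO_ACIDS, length=4):
    """Enumerate every sequence of the given length over the alphabet."""
    return ["".join(residues) for residues in product(alphabet, repeat=length)]

library = tetrapeptide_library()
library_size = len(library)  # 20 ** 4 sequences
```

Under these assumptions the library contains 20^4 = 160,000 sequences, which is small enough to enumerate exhaustively but far too large to test experimentally one by one.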

[0066] The present invention also provides a computer program product comprising computer instructions stored on a non-transitory computer-readable storage medium. When these instructions are executed by a processor, they are capable of implementing the methods described in any of the above embodiments.

[0067] The present invention also provides a computer-readable storage medium having a computer program stored thereon. When the program is executed by a processor, it can also implement the methods described in any of the above embodiments.

[0068] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A machine learning-based virtual screening method for catalysts, characterized in that it comprises: obtaining molecular representation vectors of the reactant molecules and candidate catalyst molecules in a reaction system through a pre-trained molecular characterization model; concatenating the molecular representation vectors of the reactant molecules and of the candidate catalyst molecules with reaction condition features to construct a reaction-level feature representation; inputting the reaction-level feature representation into a preset machine learning prediction model to obtain the predicted reaction results corresponding to the candidate catalyst molecules; and ranking, based on the predicted reaction results, the performance of a virtual library containing multiple candidate catalyst molecules, and screening out the target catalyst.

2. The method as described in claim 1, characterized in that, before obtaining the molecular representation vectors of the candidate catalyst molecules, the method further includes a virtual library construction step: generating, based on preset construction rules, a virtual library containing multiple candidate catalyst molecules; wherein the construction rules include generating peptide molecular sequences using all permutations and combinations of natural amino acids, and the candidate catalyst molecules are tetrapeptide compounds.

3. The method as described in claim 1, characterized in that, before inputting the reaction-level feature representation into the machine learning prediction model, the method further includes a model initialization step: constructing the chemical space distribution of the candidate catalyst molecules in the virtual library based on the molecular representation vectors; using a representative subset selection algorithm to select an initial sample set from the virtual library according to the chemical space distribution; and obtaining experimental reaction data for the initial sample set and using the experimental reaction data to train the initial machine learning prediction model.

4. The method as described in claim 1, characterized in that the candidate catalyst molecule is a tetrapeptide catalyst; the pre-trained molecular characterization model is configured to extract the sequence and conformational features of the tetrapeptide molecule in order to distinguish tetrapeptide isomers with highly similar structures.

5. The method according to claim 1, characterized in that the step of constructing the reaction-level feature representation specifically includes: using the molecular representation vector of the candidate catalyst molecule as the query vector, and the molecular representation vector of the reactant molecule as the key vector and the value vector; and inputting the query vector, key vector, and value vector into a dual-channel cross-attention network, which models the interaction between the candidate catalyst molecule and the reactant molecule and generates an interactive feature representation that integrates the interaction information, the interactive feature representation being part of the reaction-level feature representation.

6. The method according to claim 5, characterized in that the dual-channel cross-attention network includes multiple attention heads, and after the step of generating the interactive feature representation, the method further includes: applying a residual connection and layer normalization to the output of the dual-channel cross-attention network and the original molecular representation vector of the candidate catalyst molecule to obtain an enhanced catalyst feature representation; and concatenating the enhanced catalyst feature representation with the original molecular representation vector of the reactant molecules and the reaction condition features to form the reaction-level feature representation; wherein the reaction condition features include at least one of reaction temperature, reaction time, and solvent type.

7. The method according to claim 1, characterized in that the method includes obtaining molecular representation vectors of candidate tetrapeptide catalyst molecules based on the pre-trained molecular characterization model, and predicting reaction results based on the molecular representation vectors; wherein the pre-trained molecular characterization model is trained on three-dimensional molecular conformation information; and the molecular representation vector characterizes the three-dimensional spatial structure of the candidate tetrapeptide catalyst molecule and its interatomic interaction features, and serves as an input feature for the machine learning prediction model.

8. The method according to claim 7, characterized in that the step of obtaining the molecular representation vector of the candidate tetrapeptide catalyst molecule includes: generating a standardized molecular representation based on the amino acid sequence of the candidate tetrapeptide catalyst molecule; using a molecular conformation generation algorithm to generate and optimize a three-dimensional structure from the standardized molecular representation, thereby obtaining a three-dimensional molecular conformation of the candidate tetrapeptide catalyst molecule; and inputting the three-dimensional molecular conformation into the pre-trained molecular characterization model and outputting a fixed-dimensional molecular embedding vector as the molecular representation vector of the candidate tetrapeptide catalyst molecule.

9. The method according to claim 1, characterized in that the machine learning prediction model is a random forest model; the random forest model predicts the reaction yield and/or enantioselectivity of the target reaction based on the molecular representation vector of the candidate tetrapeptide catalyst molecule and the corresponding reaction condition parameters; and, based on the prediction results, the candidate tetrapeptide catalyst molecules are graded and screened to identify target catalyst molecules with high reaction yield and/or high enantioselectivity.

10. The method according to claim 1, characterized in that the candidate catalyst molecule is a tetrapeptide molecule; the method further includes sampling a representative subset from the virtual library using the Kennard-Stone algorithm for initial experimental verification, and constructing an initial training set for the machine learning prediction model.

11. An electronic device, comprising a memory, a processor, and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the method as described in any one of claims 1 to 10.

12. A computer program product comprising computer instructions, characterized in that, when the computer instructions are executed by a processor, they implement the method as described in any one of claims 1 to 10.

13. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the method as described in any one of claims 1 to 10.