Model training method, product structure prediction method, and medium
An MPNN-based post-processing module and a local clipping strategy, combined with atom mapping and a weighted cross-entropy loss function, address the insufficient accuracy of existing technologies in predicting product stereochemical information and achieve high-precision product structure prediction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- EAST CHINA UNIV OF SCI & TECH
- Filing Date
- 2026-03-16
- Publication Date
- 2026-06-12
AI Technical Summary
Existing technologies have insufficient accuracy in predicting the stereochemical information of chemical reaction products, especially the cis/trans isomerism (E/Z configuration) of double bonds or the R/S configuration of chiral centers, which limits the performance and practicality of product prediction models.
A post-processing module based on message passing neural network (MPNN) is adopted, combined with atom mapping (AAM) and local clipping strategy. Through feature extraction and classification, the specific configuration information of the double bonds to be predicted in the molecular skeleton of the product is determined, and the prediction accuracy is improved by training with a weighted cross-entropy loss function.
It enables accurate prediction of product stereochemical information, improves the accuracy and efficiency of the model, avoids dependence on complex three-dimensional structure calculations, and enhances the ability to distinguish fine-grained structural differences.
Description
Technical Field
[0001] This application relates mainly to the fields of organic chemistry and artificial intelligence technology, and in particular to a model training method, a product structure prediction method, and a medium.
Background Technology
[0002] In AI-driven drug discovery, materials science, and computer-aided synthetic route design, the prediction of chemical reaction products is a core technology. In recent years, deep learning models represented by the Transformer architecture, such as Molecular Transformer and LocalTransform, have made significant progress in predicting the molecular skeleton of the main products of chemical reactions. However, existing technologies still face serious challenges in handling the stereochemical information of products. Stereochemistry, such as the cis/trans isomerism of double bonds (E/Z configuration) or the R/S configuration of chiral centers, has a decisive influence on the biological activity and physical properties of molecules. The inability to accurately predict newly generated stereoconfigurations is therefore a key bottleneck limiting the performance and practicality of product prediction models.
[0003] Existing mainstream technical solutions have significant shortcomings. On the one hand, some advanced prediction models, such as LocalTransform, achieve extremely high accuracy in predicting the basic molecular skeleton, but the products they generate completely lack stereochemical information because of their model architecture or principles (such as template limitations). In reactions involving newly formed stereocenters, such models cannot predict specific stereoconfigurations and so fail to meet the need for complete characterization of product structures. On the other hand, models that can directly generate stereoconfiguration information, such as Molecular Transformer, are able to predict stereochemistry but suffer from severely insufficient accuracy because of limited generalization ability.
Summary of the Invention
[0004] One objective of this application is to provide a model training and product structure prediction method and medium to address the problem of insufficient accuracy in predicting product stereochemical information in the prior art.
[0005] According to one aspect of this application, a method for predicting product structure is provided. The method includes: determining the molecular skeleton of a product based on the molecular structure of reactants in a chemical reaction; locating double bonds to be predicted based on the molecular structure of the reactants and the molecular skeleton of the product; constructing a local molecular graph of the product molecular skeleton based on the double bonds to be predicted; extracting features from the local molecular graph; inputting the features into a prediction model; classifying the double bonds to be predicted; determining specific configuration information of the double bonds to be predicted; and correcting the product molecular skeleton based on the specific configuration information to obtain a predicted result of the product structure. The prediction model is trained using a reaction dataset including double bonds with specific configurations.
[0006] Optionally, locating the double bond to be predicted based on the molecular structure of the reactant and the molecular skeleton of the product includes: comparing the molecular structure of the reactant and the molecular skeleton of the product based on atomic mapping to obtain the atomic mapping relationship between the reactant and the product; traversing each double bond in the molecular skeleton of the product according to the atomic mapping relationship to determine the corresponding atom of each double bond in the reactant; determining the bond type of the corresponding atom in the reactant; judging whether the double bond is a newly generated double bond based on the bond type; if so, locating the double bond to be predicted based on the newly generated double bond.
[0007] Optionally, the step of constructing a local molecular graph of the product molecular skeleton based on the double bond to be predicted includes: taking the located double bond to be predicted as the center, extracting a central neighborhood of a preset range on the product molecular skeleton, and generating a local molecular graph of the product molecular skeleton.
[0008] Optionally, the feature extraction of the local molecular graph includes: extracting basic molecular features of the local molecular graph, wherein the basic molecular features include atomic features and bond features; determining differential features based on the molecular structure of the reactant and the property changes of the corresponding atoms in the local molecular graph; and marking the two atoms of the double bond to be predicted in the local molecular graph to determine the marking features.
[0009] Optionally, the prediction model includes an input layer, a feature embedding layer, a convolutional layer, a pooling layer, and a fully connected layer connected in sequence. The step of inputting the features into the prediction model to classify the double bond to be predicted includes: constructing a feature matrix based on the atomic features and the difference features; inputting the feature matrix into the input layer of the prediction model and mapping it to a high-dimensional vector; converting the high-dimensional vector into initial node features through the feature embedding layer of the prediction model; inputting the initial node features into the convolutional layer to obtain optimized node features; inputting the optimized node features into the pooling layer and extracting the central feature based on the label features of the two atoms of the double bond to be predicted; and classifying the central feature in the fully connected layer to obtain the classification result of the double bond to be predicted.
[0010] Optionally, the convolutional layer includes multiple graph neural network layers connected in sequence. Each graph neural network layer includes a GIN convolutional unit, an activation function unit, and a random deactivation unit. The step of inputting the initial node features into the convolutional layer to obtain optimized node features includes: sequentially inputting the initial node features into multiple graph neural network layers of the convolutional layer for processing to obtain optimized node features. The processing steps of each graph neural network layer include: performing neighborhood information aggregation transformation on the input features through the GIN convolutional unit; inputting the aggregated and transformed features into the activation function unit for nonlinear activation; inputting the nonlinearly activated features into the random deactivation unit for regularization processing, and outputting them to the next graph neural network layer.
[0011] Optionally, determining the product molecular skeleton based on the molecular structure of the reactants in the chemical reaction includes: representing the molecular structure of the reactants in Simplified Molecular Input Line Entry System (SMILES) form to obtain a text sequence of the reactants; and inputting the text sequence into a product skeleton prediction model to obtain the product molecular skeleton.
[0012] According to another aspect of this application, a method for model training is also provided, the method comprising: determining the molecular skeleton of a product based on the molecular structure of a reactant in a chemical reaction; locating a specific configurational double bond in the reaction based on the molecular structure of the reactant and the molecular skeleton of the product; constructing a reaction dataset based on the specific configurational double bond; and training a message-passing neural network using the reaction dataset to obtain a prediction model.
[0013] Optionally, constructing a reaction dataset based on the specific configuration double bond includes: obtaining reaction samples containing the specific configuration double bond; labeling each reaction sample, determining the model input based on the reactants and the product molecular skeleton in the reaction sample, and using the true configuration of the specific configuration double bond as the training label; and constructing the reaction dataset based on the model input, the reaction sample, and the training label.
[0014] Optionally, training the message passing neural network using the reaction dataset includes: determining different configuration samples containing specific configuration double bonds based on the reaction dataset; employing a weighted cross-entropy loss function to adjust the proportion of the loss corresponding to different configuration samples in the total loss through class weights, wherein the class weights are inversely proportional to the sample frequency of each configuration sample; and performing iterative training based on the weighted cross-entropy loss function to complete the training process of the message passing neural network.
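The class-weighting idea in the preceding paragraph can be sketched in a few lines. This is a minimal illustration and not the application's implementation: the weight formula N / (K * count), the normalization by the sum of sample weights, and the function names `inverse_frequency_weights` and `weighted_cross_entropy` are all assumptions chosen for the example. The source states only that class weights are inversely proportional to sample frequency.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed formula: weight_c = N / (K * count_c), inversely proportional to frequency."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class); probs is a list of {class: probability} dicts."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)  # assumed normalization: sum of sample weights
    return total / norm


# Toy imbalanced batch: three E samples and one Z sample (made-up probabilities).
labels = ["E", "E", "E", "Z"]
weights = inverse_frequency_weights(labels)
probs = [{"E": 0.9, "Z": 0.1}] * 3 + [{"E": 0.6, "Z": 0.4}]
loss = weighted_cross_entropy(probs, labels, weights)
```

With these toy numbers the minority Z sample carries three times the weight of each E sample, so a poor prediction on it dominates the loss instead of being averaged away.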
[0015] According to another aspect of this application, a computer-readable medium is also provided, on which a computer program is stored, which, when executed by a processor, implements the method for predicting the product structure as described in any of the preceding claims, or the method steps for training the model as described in any of the preceding claims.
[0016] Compared with existing technologies, the prediction method of this application determines the product molecular skeleton based on the molecular structure of reactants in a chemical reaction, and locates the double bonds to be predicted based on the molecular structure of the reactants and the product molecular skeleton; it constructs a local molecular graph of the product molecular skeleton based on the double bonds to be predicted, and extracts features from the local molecular graph; the features are input into the prediction model to classify the double bonds to be predicted, determine their specific configuration information, and correct the product molecular skeleton based on that information to obtain the prediction result of the product structure. The prediction model is trained on a reaction dataset including double bonds with specific configurations. This improves the accuracy of product stereochemical information prediction.
Attached Figure Description
[0017] To make the above objectives, features, and advantages of this application more apparent and understandable, specific embodiments of this application are described in detail below with reference to the accompanying drawings, wherein:
Figure 1 shows a flowchart of a method for predicting product structure according to one aspect of this application.
Figure 2 shows the structure of a prediction model in one embodiment of this application.
Figure 3 illustrates a method for training a model according to another aspect of this application.
Figure 4 shows a flowchart of a product double bond configuration post-processing prediction method based on a message passing network in one embodiment of this application.
Figure 5 shows the confusion matrix of the prediction model in one embodiment of this application.
[0018] The same or similar reference numerals in the accompanying drawings represent the same or similar parts.
Detailed Implementation
[0019] To make the above-mentioned objectives, features and advantages of this application more apparent and understandable, the specific embodiments of this application will be described in detail below with reference to the accompanying drawings.
[0020] Many specific details are set forth in the following description in order to provide a full understanding of this application. However, this application may also be implemented in other ways different from those described herein, and therefore this application is not limited to the specific embodiments disclosed below.
[0021] As indicated in this application and claims, unless the context clearly indicates otherwise, the words "a," "an," and/or "the" do not refer specifically to the singular and may include plural forms. Generally speaking, the terms "comprising" and "including" only indicate the inclusion of explicitly identified steps and elements, which do not constitute an exclusive list, and the method or apparatus may also include other steps or elements.
[0022] Before providing a detailed description of the embodiments of this application, some of the nouns and terms involved in the embodiments of this application will be explained first. The nouns and terms involved in the embodiments of this application are subject to the following interpretations.
[0023] (1) Simplified Molecular Input Line Entry System (SMILES): A specification that explicitly describes molecular structures using ASCII strings.
[0024] (2) Atom-to-Atom Mapping (AAM): The task of identifying the position of each atom in a molecule before and after a chemical reaction, which can be used to understand the reaction mechanism.
[0025] (3) Message Passing Neural Network (MPNN): A general framework for processing graph data. It captures complex relationships in graph structure data through a message passing mechanism and is widely used in fields such as chemical molecule prediction.
[0026] (4) Graph Neural Networks (GNN): A deep learning model specifically designed to process graph-structured data. It captures the complex relationships and dependencies between nodes by passing information between them. The core idea of GNN is to use the feature information of the nodes and the structural information of the graph to update the state of the nodes iteratively, thereby learning a high-level representation of each node or the entire graph.
[0027] (5) Graph Isomorphism Network (GIN): An efficient GNN structure that enhances the expressive power of GNNs on graph isomorphism problems by simulating the Weisfeiler-Lehman (WL) test. The core idea of GIN is to use injective functions to aggregate the features of nodes, thereby enabling the network to distinguish different graph structures.
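For reference, the node update commonly given for GIN in the literature can be written as follows. The notation is an assumption introduced here and does not appear in this application: h_v^{(k)} is the feature of node v after layer k, N(v) is the set of neighbors of v, epsilon^{(k)} is a fixed or learnable scalar, and MLP^{(k)} is a multilayer perceptron.

```latex
h_v^{(k)} = \mathrm{MLP}^{(k)}\!\left( \left(1 + \epsilon^{(k)}\right) h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)} \right)
```

The sum over neighbors, rather than a mean or maximum, is what lets the aggregation stay injective over multisets of neighbor features.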
[0028] The technical solution proposed in this application employs a purpose-built MPNN-based post-processing module. Through a "prediction-correction" post-processing strategy, it fills in and repairs stereoconfiguration information that is entirely absent from the predicted skeleton, offering significant performance advantages over existing technologies. It also uses "local clipping" and "local pooling" strategies, so that during prediction the model focuses only on the N-hop neighborhood of the target double bond (i.e., the double bond to be predicted) and is shielded from irrelevant information elsewhere in the molecule. It eliminates the need for complex and time-consuming 3D structure calculations, which improves prediction efficiency. In addition, it introduces mapping difference features based on atom mapping (AAM) together with target center markers, giving the model crucial reaction context information. These can be combined with a weighted cross-entropy loss function during training to overcome the imbalance between E and Z configuration samples and to strengthen the model's ability to discriminate fine-grained structural differences.
[0029] Figure 1 shows a flowchart of a method for predicting the structure of a product according to one aspect of this application. The method comprises steps S11 to S13.
[0030] Step S11: Determine the molecular skeleton of the product based on the molecular structure of the reactants in the chemical reaction, and locate the double bonds to be predicted based on the molecular structure of the reactants and the molecular skeleton of the product.
[0031] Based on the reactant molecular structure of a chemical reaction, a product molecular skeleton is generated that does not contain configuration information for newly formed double bonds. For example, the SMILES representation of the reactants can be obtained and input into a pre-trained product skeleton prediction model to determine the product molecular skeleton. Then, based on the reactant structure and the product molecular skeleton, the target double bonds whose configurations need to be determined (i.e., the double bonds to be predicted) can be located. Atom-to-atom mapping (AAM) can be used to compare the reactants with the product molecular skeleton and so locate the double bonds that are newly formed in the reaction and need to be predicted.
[0032] For example, to obtain the product structural skeleton, the graph-attention-based LocalTransform model can be used as the product skeleton prediction tool. It has extremely high accuracy in predicting the basic skeleton of the product molecule (i.e., the atomic connections), but the generated product contains no stereochemical information about newly generated double bond configurations. The SMILES text sequence containing the reactants and reagents is input into a pre-trained LocalTransform model. The output is the product molecular skeleton, also represented as a SMILES sequence, in which newly generated double bonds that could adopt a specific configuration carry no E/Z specification. Then, with the list of reactant SMILES and the product skeleton SMILES as input, the key region, i.e., the double bonds to be predicted, is located using AAM.
[0033] Step S12: Construct a local molecular graph of the product molecular skeleton based on the double bond to be predicted, and extract features from the local molecular graph.
[0034] After locating the double bond to be predicted, a local molecular graph of the product molecular skeleton is constructed with the double bond at its center, highlighting the connectivity and spatial environment of the double bond and its neighboring atoms. Extracting features from this local molecular graph yields information that characterizes the local chemical environment and structural properties of the double bond, providing structured input for subsequent model prediction.
[0035] In practical applications, local graph trimming and graph construction operations can be performed. To shield against interference from irrelevant information in the molecule and improve computational efficiency, a local molecular subgraph is constructed by centering on the double bond to be predicted and retaining only its neighborhood within N hops (e.g., 3 hops). The local subgraph is then characterized using the cheminformatics open-source toolkit RDKit.
[0036] Step S13: Input the features into the prediction model, classify the double bonds to be predicted, determine the specific configuration information of the double bonds to be predicted, and correct the molecular skeleton of the product based on the specific configuration information to obtain the prediction result of the product structure. The prediction model is trained from a reaction dataset including double bonds with specific configurations.
[0037] The extracted features are input into a pre-trained prediction model, which classifies and identifies the configuration of the double bond to be predicted, thus obtaining specific configuration information of the double bond. This configuration information is then used to correct and improve the molecular skeleton of the product that only contains skeletal information and not information on the configuration of the newly generated double bond, ultimately yielding a complete product structure prediction result with both molecular skeleton and stereoconfiguration information.
[0038] For example, features extracted from a local molecular graph are input into a pre-trained product double bond configuration post-processing prediction model based on a message-passing neural network (MPNN). This model classifies the configuration of the double bond to be predicted and outputs the probability that the double bond is in the E or Z configuration. Based on the classification result, the original product skeleton SMILES is corrected by adding "/" or "\" symbols, and a complete product structure with specific, high-precision double bond configuration information is output.
[0039] In one embodiment of this application, in step S11, the molecular structure of the reactant is represented in Simplified Molecular Input Line Entry System (SMILES) form to obtain the text sequence of the reactant; the text sequence is input into the product skeleton prediction model to obtain the product molecular skeleton.
[0040] First, prepare the reactants (which may include reagents) for predicting the structure of organic chemical products and convert them into a SMILES text sequence representation. Input the text sequence into the product skeleton prediction model. The model can be an existing LocalTransform model, used without modifying its network structure. This model generates the product molecular skeleton without configuration information for newly generated double bonds, and the post-processing prediction of double bond configurations is then performed on its output.
[0041] The model outputs the molecular skeleton of the product, also represented as the SMILES sequence, but the newly generated double bonds (if they can produce a specific configuration) are not specified in terms of E / Z configuration. E / Z configuration is used to describe the stereochemical configuration of compounds containing double bonds. Specifically, when the higher priority groups on the carbon atoms of the double bond are located on the same side, it is called the Z configuration; when the higher priority groups on the carbon atoms of the double bond are located on different sides, it is called the E configuration.
[0042] Taking a typical Knoevenagel condensation reaction as an example, the electrophile COc1cc(C=O)ccc1O reacts with the nucleophile N#CCc1ccc(C(F)(F)F)nc1 under the action of the catalyst C1CCNCC1 to generate the cis double bond product COc1cc(/C=C(\C#N)c2ccc(C(F)(F)F)nc2)ccc1O. However, the LocalTransform model can only predict the configuration-free product COc1cc(C=C(C#N)c2ccc(C(F)(F)F)nc2)ccc1O, without specifying the double bond configuration.
[0043] In one embodiment of this application, in step S11, the molecular structure of the reactant and the molecular skeleton of the product are compared based on atomic mapping to obtain the atomic mapping relationship between the reactant and the product; each double bond in the molecular skeleton of the product is traversed according to the atomic mapping relationship to determine the corresponding atom of each double bond in the reactant; the bond type of the corresponding atom in the reactant is determined, and it is determined whether the double bond is a newly generated double bond based on the bond type. If so, the double bond to be predicted is located based on the newly generated double bond.
[0044] A novel method for identifying newly generated double bonds with undetermined configurations based on atom mapping (AAM) technology is used to automatically compare the molecular skeleton of the product with that of the original reactants. By analyzing the changes in atomic connection relationships, the target double bonds and their indices that are newly generated during the reaction process and whose configurations need to be predicted are accurately located.
[0045] In practical applications, the AAM tool LocalMapper can be used to obtain the precise mapping relationship between reactant atoms and product atoms. Once the mapping relationship is available, each double bond in the product skeleton is traversed, taking the two atoms it connects. For each double bond, its origin is traced through the AAM mapping relationship to find the two corresponding atoms in the reactants.
[0046] Next, key change detection is performed: the bond type between the two corresponding atoms in the reactants is checked. If that bond type is "single bond" or "no bond", the double bond in the product is identified as a "newly formed double bond". A further check can then determine whether this newly formed double bond is capable of E/Z isomerism, based on whether each of the two atoms at the ends of the double bond is connected to different groups. Finally, the atom indices of all target double bonds identified as "newly formed" and with "undetermined configuration" are output for use in subsequent steps.
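The change-detection logic of the two preceding paragraphs can be sketched as follows. This is a simplified illustration under stated assumptions: bonds are plain dictionaries keyed by atom-index pairs, the atom mapping is a product-to-reactant dictionary, and the names `bond_key` and `find_new_double_bonds` are hypothetical. In practice the mapping would come from an AAM tool and the bond tables from a cheminformatics toolkit, and the additional check for E/Z capability is omitted here.

```python
def bond_key(a, b):
    """Order-independent key for a bond between atoms a and b."""
    return (a, b) if a <= b else (b, a)


def find_new_double_bonds(product_bonds, reactant_bonds, prod_to_react):
    """Return product double bonds whose mapped reactant atoms were singly bonded or unbonded."""
    new_bonds = []
    for (a, b), order in sorted(product_bonds.items()):
        if order != 2:
            continue  # only double bonds in the product are candidates
        ra, rb = prod_to_react.get(a), prod_to_react.get(b)
        if ra is None or rb is None:
            continue  # unmapped atom: origin cannot be traced
        old_order = reactant_bonds.get(bond_key(ra, rb), 0)  # 0 stands for "no bond"
        if old_order in (0, 1):
            new_bonds.append((a, b))
    return new_bonds


# Toy condensation: product atoms 1 and 2 come from two different reactants,
# while the product double bond 4=5 already existed as 30=31 in a reactant.
product_bonds = {(0, 1): 1, (1, 2): 2, (2, 3): 1, (4, 5): 2}
reactant_bonds = {(10, 11): 1, (11, 12): 2, (20, 21): 1, (30, 31): 2}
prod_to_react = {0: 10, 1: 11, 2: 20, 3: 21, 4: 30, 5: 31}
targets = find_new_double_bonds(product_bonds, reactant_bonds, prod_to_react)
```

Here only the bond between product atoms 1 and 2 is reported, because its reactant counterparts (atoms 11 and 20) were not bonded before the reaction.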
[0047] In one embodiment of this application, in step S12, a central neighborhood of a preset range is extracted from the product molecular skeleton, centered on the located double bond to be predicted, thereby generating a local molecular graph of the product molecular skeleton.
[0048] After locating the double bond to be predicted, a local clipping mechanism is adopted to shield the model from irrelevant information in the molecule and to improve computational efficiency. Taking the located double bond as the structural center, the corresponding central neighborhood is clipped from the molecular skeleton of the product according to the preset structural range. For example, only the neighborhood within N hops (e.g., 3 hops) can be retained, and distant structural information that is irrelevant to the spatial configuration of the double bond can be removed, yielding a local molecular graph focused on the local chemical environment of the target double bond.
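A minimal sketch of N-hop clipping, assuming the product skeleton is available as an adjacency dictionary of atom indices: a breadth-first search starts from both double-bond atoms and stops expanding at the hop limit. The function name `n_hop_subgraph` and the data layout are hypothetical and chosen only for illustration.

```python
from collections import deque


def n_hop_subgraph(adjacency, center_atoms, n_hops):
    """Return atoms within n_hops bonds of either center atom, plus the bonds among them."""
    dist = {a: 0 for a in center_atoms}
    queue = deque(center_atoms)
    while queue:
        atom = queue.popleft()
        if dist[atom] == n_hops:
            continue  # hop limit reached: do not expand further
        for nbr in adjacency[atom]:
            if nbr not in dist:
                dist[nbr] = dist[atom] + 1
                queue.append(nbr)
    kept = set(dist)
    bonds = {(a, b) for a in kept for b in adjacency[a] if b in kept and a < b}
    return kept, bonds


# Toy skeleton: a linear chain of ten atoms 0-1-...-9 with the target double bond at 4=5.
adjacency = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}
atoms, bonds = n_hop_subgraph(adjacency, center_atoms=(4, 5), n_hops=3)
```

With a 3-hop limit the chain ends (atoms 0 and 9) are clipped away, leaving atoms 1 through 8 and the seven bonds between them.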
[0049] In one embodiment of this application, in step S12, basic molecular features of the local molecular graph are extracted, wherein the basic molecular features include atomic features and bond features; differential features are determined based on the molecular structure of the reactant and the property changes of the corresponding atoms in the local molecular graph; and the two end atoms of the double bond to be predicted in the local molecular graph are marked to determine the marking features.
[0050] Feature extraction from local molecular graphs can be performed using the open-source cheminformatics toolkit RDKit. The local subgraphs are featurized, and the extracted basic molecular features include at least atomic features (such as atom type, ring membership, aromaticity, hybridization, and valence state) and bond features (such as bond type, conjugation, and ring membership). Table 1 shows the basic molecular features of local molecular graphs, including the chemical property information used to initialize atomic and bond features.
[0051]
[0052] Table 1
[0053] Based on the AAM results, the properties (such as aromaticity and hybridization) of corresponding atoms in the reactants and in the local molecular graph of the product are compared, and the changes are introduced as differential features to give the model contextual information about the reaction. The atoms at both ends of the double bond to be predicted in the local molecular graph are specially marked to determine the marking features, which are then used as part of the feature vector. For example, a Boolean value indicating whether an atom is a target center can be added to the atomic feature vector to tell the model where to focus during prediction.
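The three feature groups described above (basic features, differential features, and the center marker) can be sketched as a simple concatenation. The dictionary keys, the particular features chosen, and the function name `atom_feature_vector` are illustrative assumptions; the application's actual feature set is the one listed in its Table 1, which is not reproduced here.

```python
def atom_feature_vector(product_atom, reactant_atom, is_center):
    """Concatenate basic features, reactant-to-product difference features, and the center flag."""
    basic = [
        float(product_atom["atomic_num"]),
        float(product_atom["is_aromatic"]),
        float(product_atom["in_ring"]),
        float(product_atom["hybridization"] == "SP2"),
    ]
    diff = [
        # change in aromaticity from reactant to product: -1.0, 0.0, or 1.0
        float(product_atom["is_aromatic"]) - float(reactant_atom["is_aromatic"]),
        # 1.0 if the hybridization changed during the reaction
        float(product_atom["hybridization"] != reactant_atom["hybridization"]),
    ]
    return basic + diff + [float(is_center)]


# Toy atom: an sp3 carbon in the reactant that becomes an sp2 end atom of the target double bond.
reactant_atom = {"atomic_num": 6, "is_aromatic": False, "in_ring": False, "hybridization": "SP3"}
product_atom = {"atomic_num": 6, "is_aromatic": False, "in_ring": False, "hybridization": "SP2"}
vec = atom_feature_vector(product_atom, reactant_atom, is_center=True)
```

The last entry of the vector is the Boolean center marker, and the two entries before it carry the reaction context that a product-only featurization would lack.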
[0054] Finally, the above features are integrated into the local molecular graph to construct a molecular graph data structure belonging to the target double bond and rich in information on local chemical environment and reaction changes.
[0055] In the embodiment illustrated in Figure 2, in step S13, the prediction model includes an input layer, a feature embedding layer, a convolutional layer, a pooling layer, and a fully connected layer connected in sequence. A feature matrix is constructed based on the atomic features and the difference features. The feature matrix is input to the input layer of the prediction model and mapped to a high-dimensional vector. The high-dimensional vector is converted into initial node features through the feature embedding layer of the prediction model. The initial node features are input to the convolutional layer to obtain optimized node features. The optimized node features are input to the pooling layer, and a central feature is extracted based on the marking features of the two atoms of the double bond to be predicted. The central feature is classified in the fully connected layer to obtain a classification result for the double bond to be predicted.
[0056] In the input layer of the prediction model, atomic features (characterizing the properties of the target double bond and its neighboring atoms) and difference features (characterizing the structural changes from reactants to products) are integrated to construct a dimension-matched feature matrix. For example, the feature matrix can be constructed from atomic properties (such as hybridization and valence state) extracted with the RDKit toolkit and from the mapping difference features before and after the reaction, and is then mapped into a high-dimensional vector to retain rich chemical environment information.
[0057] When the high-dimensional vector output from the input layer is processed by the feature embedding layer, the model converts the high-dimensional vector into initial node features that can characterize the core attributes of molecular graph nodes (atoms) based on preset embedding rules (such as prior chemical knowledge such as atom type and chemical bond type). This realizes the transformation from "numerical vector" to "chemical semantic features", making the features more consistent with the chemical essence of molecular structure.
[0058] After the initial node features are input into the convolutional layer, the model aggregates the local topological information and attribute features of each node in the neighborhood of the target double bond through multi-layer graph convolution operations, and mines deep features such as the correlation patterns and spatial proximity between atoms, and finally outputs the optimized node features after multiple rounds of feature enhancement.
[0059] When the optimized node features are fed into the pooling layer, the model no longer retains all node features. Instead, it focuses on the label features of the atoms at both ends of the target double bond (predefined core atom feature identifiers). Through pooling operations (such as aggregation and filtering), it extracts the central features that are directly related to the target double bond and removes redundant neighborhood information. In practical applications, a local pooling strategy can be used that aggregates only the feature vectors of the atoms at both ends of the target double bond, selected by index, rather than averaging across the entire graph, thereby achieving feature dimensionality reduction and focus.
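The local pooling strategy described above can be illustrated with a short sketch. The text does not specify the aggregation operator, so summation is used here as an assumption; the function names `local_pool` and `global_mean_pool` are hypothetical.

```python
def local_pool(node_features, bond_atom_indices):
    """Sum the feature vectors of the two atoms at the ends of the target double bond."""
    i, j = bond_atom_indices
    return [a + b for a, b in zip(node_features[i], node_features[j])]


def global_mean_pool(node_features):
    """Baseline for comparison: average over every node in the graph."""
    n = len(node_features)
    return [sum(column) / n for column in zip(*node_features)]


# Four atoms with 2-dimensional features; atoms 1 and 2 form the target double bond.
features = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
center = local_pool(features, (1, 2))
average = global_mean_pool(features)
```

In this toy graph the global average is dominated by atoms 0 and 3, whereas the local pool depends only on the two atoms of the double bond, which is the focusing effect the paragraph describes.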
[0060] After the central features are input into the fully connected layer, the model performs nonlinear mapping and classification calculations on the core features through a multilayer perceptron. For example, the overall model can be composed of multiple stacked GINConv layers, and finally connected to a fully connected layer (MLP) classifier for classification. The final output is the classification probability or category result of different configurations of the target double bond (such as E / Z configuration), thus completing the accurate discrimination of the three-dimensional configuration of the target double bond.
[0061] Furthermore, the convolutional layer includes multiple graph neural network layers connected in sequence, each graph neural network layer including a GIN convolutional unit, an activation function unit, and a random deactivation unit.
[0062] The initial node features are sequentially input into multiple graph neural network layers of the convolutional layer for processing to obtain optimized node features. The processing steps of each graph neural network layer include: performing neighborhood information aggregation transformation on the input features through the GIN convolutional unit; inputting the aggregated and transformed features into the activation function unit for nonlinear activation; inputting the nonlinearly activated features into the random deactivation unit for regularization processing, and outputting them to the next graph neural network layer.
[0063] Continuing to refer to Figure 2, the model's convolutional layer consists of multiple sequentially connected graph neural network layers. Each graph neural network layer includes a GIN convolutional unit, an activation function unit, and a random deactivation (dropout) unit. After the initial node features are input into the convolutional layer, each graph neural network layer processes its input features, and the processed features serve as the input to the next graph neural network layer. After multiple iterations, the final optimized node features are obtained. Through this multi-layer iterative processing, the atomic correlation information and chemical bond attribute information of the target double bond neighborhood are gradually aggregated.
[0064] The GIN convolutional unit aggregates, weights, and transforms the features of all atomic nodes in the neighborhood centered on the target double bond, based on the input node features. The Graph Isomorphism Network (GIN) convolutional unit is chosen because, compared with the basic Graph Convolutional Network (GCN), GIN nests a multilayer perceptron (MLP) in the aggregation step and therefore has a stronger ability to distinguish non-isomorphic graphs, enabling it to capture the subtle spatial structural differences between cis and trans isomers.
[0065] In practical applications, the message passing and update mechanism of the $k$-th GIN layer can be represented by the following formula:
$$h_v^{(k)} = \mathrm{MLP}^{(k)}\left( \left(1 + \epsilon^{(k)}\right) \cdot h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(k-1)} \right)$$
[0066] where $h_v^{(k)}$ is the embedding representation of node $v$ at layer $k$; $\mathcal{N}(v)$ is the set of neighboring nodes of node $v$; $\mathrm{MLP}^{(k)}$ is a learnable multilayer perceptron; and $\epsilon^{(k)}$ is a learnable parameter or a fixed scalar. A graph isomorphism network is used as the core convolutional layer of the MPNN to enhance the model's ability to discern fine-grained structural differences.
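A single GIN update of the standard form (each node's own embedding scaled by one plus epsilon, added to the sum of its neighbors' embeddings, then passed through an MLP) can be sketched in plain Python. The fixed weights in `toy_mlp` stand in for a learnable multilayer perceptron and are purely illustrative; a real implementation would typically use a graph learning library rather than nested lists.

```python
def gin_update(h, neighbors, mlp, eps=0.0):
    """One GIN step: h_v <- MLP((1 + eps) * h_v + sum of neighbor embeddings)."""
    updated = []
    for v, h_v in enumerate(h):
        combined = [(1.0 + eps) * x for x in h_v]
        for u in neighbors[v]:
            combined = [c + x for c, x in zip(combined, h[u])]
        updated.append(mlp(combined))
    return updated


def toy_mlp(x):
    """Stand-in for a learnable MLP: a fixed linear map followed by ReLU."""
    weights = [[1.0, -1.0], [0.5, 0.5]]  # assumed, untrained weights
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


# A three-atom chain 0-1-2 with 2-dimensional embeddings.
h0 = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
adjacency = {0: [1], 1: [0, 2], 2: [1]}
h1 = gin_update(h0, adjacency, toy_mlp, eps=0.0)
```

Stacking several such updates lets information from atoms further away from the double bond reach its two end atoms, one hop per layer.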
[0067] The interlayer activation function units perform nonlinear transformations on the linearly aggregated features output by the GIN convolutional units. ReLU can be used as the activation function. Introducing nonlinearity improves the feature fitting ability of the model; ReLU is also computationally efficient and helps to alleviate the vanishing gradient problem in deep networks.
[0068] To prevent the model from overfitting on the training set, a random deactivation (dropout) unit is introduced, with a dropout probability in the range of 0.2 to 0.5. By applying dropout regularization to the features after nonlinear activation, some neuron nodes are randomly masked, which reduces the model's dependence on locally redundant and noisy features, improves its generalization ability on unseen reaction scenarios, and enhances its robustness.
[0069] In the output stage, the model adopts the local pooling strategy, which uses indices to aggregate the feature vectors of the atoms at both ends of the target double bond rather than averaging across the entire graph. It is also combined with a weighted cross-entropy loss function designed for class imbalance, so that the model can still achieve high-precision targeted prediction when E / Z configuration samples are unevenly distributed.
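The effect of the random deactivation unit can be illustrated with a small sketch. It uses the common "inverted dropout" formulation, in which surviving values are rescaled during training so that no rescaling is needed at inference; this formulation and the value `p=0.3` (chosen from within the stated 0.2 to 0.5 range) are assumptions, not details given in this application.

```python
import random


def dropout(features, p, rng, training=True):
    """Inverted dropout: zero each value with probability p, rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(features)
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else x * keep_scale for x in features]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
activated = [1.0] * 1000
dropped = dropout(activated, p=0.3, rng=rng)
kept_fraction = sum(1 for x in dropped if x != 0.0) / len(dropped)
unchanged = dropout(activated, p=0.3, rng=rng, training=False)
```

Roughly 70% of the values survive during training, while at inference time the features pass through unchanged.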
[0070] Figure 3 shows a flowchart of a model training method according to another aspect of this application, the method comprising steps S21 to S23.
[0071] Step S21: Determine the molecular skeleton of the product based on the molecular structure of the reactants in the chemical reaction.
[0072] During the model training phase, the product molecular skeleton generation logic from the prediction phase can be referenced to determine the product molecular skeleton, which does not contain information on the stereoconfiguration of newly formed double bonds, based on the molecular structural characteristics of the chemical reaction reactants. By unifying the skeleton representation rules between the training and prediction phases, consistency in the model input format is ensured.
[0073] Step S22: Based on the molecular structure of the reactants and the molecular skeleton of the product, locate the specific configurational double bond in the reaction.
[0074] Based on the double bond localization logic in the prediction stage, combined with the molecular structure of reactants, the molecular skeleton of products, and the structural changes before and after the reaction, the target double bonds that are newly generated in the chemical reaction, whose configuration is yet to be determined and whose stereoconfiguration needs to be identified, can be located. This clarifies the objects that the model needs to learn and predict, ensuring that the target site determination rules in the training stage are compatible with those in the prediction stage, and eliminating rule bias in model inference.
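The localization rule (a product double bond is a target if the two mapped atoms were not double-bonded in the reactants) can be sketched as follows. The bond tables keyed by atom-map numbers and the function name `find_new_double_bonds` are hypothetical; in practice the bonds and map numbers would be read from atom-mapped structures with a cheminformatics toolkit.

```python
def find_new_double_bonds(reactant_bonds, product_bonds):
    """Return product double bonds whose mapped atoms were not double-bonded in the reactants.

    Both arguments map a frozenset of two atom-map numbers to a bond order.
    """
    new_bonds = []
    for atom_pair, order in product_bonds.items():
        if order != 2:
            continue  # only double bonds are candidates
        if reactant_bonds.get(atom_pair) != 2:  # bond absent or of a different type
            new_bonds.append(tuple(sorted(atom_pair)))
    return sorted(new_bonds)


# Toy example: atoms 1 and 2 are not bonded in the reactants but double-bonded in the product,
# while the 3=4 double bond already exists in the reactants.
reactant_bonds = {frozenset({2, 3}): 1, frozenset({3, 4}): 2}
product_bonds = {frozenset({1, 2}): 2, frozenset({2, 3}): 1, frozenset({3, 4}): 2}
targets = find_new_double_bonds(reactant_bonds, product_bonds)
```

This sketch covers only the "newly generated" test; a fuller version would also need to check that the bond can actually exhibit E / Z isomerism before treating it as a prediction target.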
[0075] Step S23: Construct a reaction dataset based on the specific configuration double bond, and use the reaction dataset to train a message passing neural network to obtain a prediction model.
[0076] Based on the target double bonds located in the above steps, a reaction dataset containing stereo configuration annotations is constructed, which can match the local feature extraction rules in the prediction stage. Based on this reaction dataset, a message-passing neural network is iteratively trained to optimize the model's ability to learn features and classify configurations of the local chemical environment of the double bonds. This enables the trained prediction model to receive the local features extracted in the prediction stage and accurately output the stereo configuration of the target double bonds, thereby achieving the prediction of the reaction product structure.
[0077] It should be noted that the model with the ability to predict the three-dimensional configuration of double bonds constructed by the above model training method can be implemented in conjunction with the aforementioned product structure prediction method to achieve a seamless connection from training to inference, or it can be implemented independently without the above prediction stage to complete the autonomous training and optimization of the model.
[0078] In one embodiment of this application, in step S23, reaction samples containing the specific configuration double bonds are obtained; each reaction sample is labeled, and the model input is determined based on the reactants and the molecular skeleton of the product in the reaction sample, and the true configuration of the specific configuration double bonds is used as the training label; the reaction dataset is constructed based on the model input, the reaction samples and the training labels.
[0079] From chemical reaction data resources, reaction samples containing double bonds with specific configurations are screened and extracted. These samples must meet preconditions such as changes in double bond configuration before and after the reaction, and the formation of new double bonds, to ensure a high degree of match between the samples and the model's prediction target. For example, using the newly generated undetermined configuration double bond identification method based on AAM as detailed above, publicly available datasets in the field of chemical synthesis are filtered to select all reaction samples containing newly generated specific E / Z configuration double bonds, forming a basic sample pool.
[0080] For each valid reaction sample obtained through screening, data annotation is completed according to unified rules, clarifying the inputs and supervision labels required for model training. The model input is defined and extracted by combining the reactant structural information with the corresponding product molecular skeleton in the reaction sample. At the same time, the true stereoconfiguration of the specific configuration double bonds in the sample is defined as the supervision label for model training. For example, when annotating the screened samples, the reactant structure, the product skeleton without E / Z stereo information, and the relevant features used for molecular graph construction are uniformly defined as the model input, and the true E / Z configuration of the double bonds is used as the corresponding training label.
[0081] All individual reaction samples that have been labeled are integrated, and the correspondence between model input, original reaction samples, and training labels is sorted out in a standardized format. Abnormal samples with labeling errors and inconsistent formats are removed, and a reaction dataset with standardized data, complete labeling, and suitable for model training is constructed.
[0082] In one embodiment of this application, in step S23, different configuration samples containing specific configuration double bonds are determined based on the reaction dataset; a weighted cross-entropy loss function is used to adjust the proportion of the loss corresponding to different configuration samples in the total loss through class weights, wherein the class weights are inversely proportional to the sample frequency of each configuration sample; iterative training is performed based on the weighted cross-entropy loss function to complete the training process of the message passing neural network.
[0083] Based on the constructed reaction dataset, the samples of each configuration containing specific configuration double bonds are identified. The weighted cross-entropy loss function is selected as the supervised loss function for model training. By setting class weights, the proportion of the loss generated by samples of each configuration in the total loss is adjusted. The core rule is that the class weight is inversely proportional to the sample frequency of the corresponding configuration. That is, the configuration class with fewer samples is assigned a higher weight, thereby compensating for the training bias caused by the difference in the number of samples.
[0084] For example, during training, considering the potential imbalance in the number of E / Z configurations in the dataset (e.g., significantly more E configuration samples than Z configuration samples), a weighted cross-entropy loss function is used:
$$L = -\sum_{c=1}^{C} w_c \, y_c \log(p_c)$$
[0085] where $L$ is the weighted cross-entropy loss value for a single sample; $C$ is the total number of categories; $y_c$ and $p_c$ are the true label and the predicted probability for category $c$, respectively; and $w_c$ is the newly introduced category weight, which amplifies or reduces the proportion of each category in the total loss.
[0086] This loss function assigns a higher loss weight to prediction errors on minority class samples; the weight is typically set to be inversely proportional to the class frequency. This enhances the model's ability to identify minority class samples and mitigates the prediction bias caused by class imbalance. The weight value can be treated as a hyperparameter and further tuned during model validation to achieve the best balance between minority class recall and overall prediction accuracy.
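The inverse-frequency weighting and the resulting per-sample loss can be sketched as follows. The normalization of the weights to an average of 1 is an assumption added for readability, and the 5:2 toy label counts and the probabilities are illustrative values only.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Class weights proportional to 1 / frequency, normalized so they average to 1."""
    counts = Counter(labels)
    raw = {cls: len(labels) / count for cls, count in counts.items()}
    mean_raw = sum(raw.values()) / len(raw)
    return {cls: value / mean_raw for cls, value in raw.items()}


def weighted_cross_entropy(true_label, probabilities, weights):
    """Loss for one sample with a one-hot label: -w_c * log(p_c) for the true class c."""
    return -weights[true_label] * math.log(probabilities[true_label])


# Toy label set with a 5:2 imbalance between the two classes.
labels = ["Z"] * 5 + ["E"] * 2
weights = inverse_frequency_weights(labels)
loss_minority = weighted_cross_entropy("E", {"E": 0.6, "Z": 0.4}, weights)
loss_majority = weighted_cross_entropy("Z", {"E": 0.4, "Z": 0.6}, weights)
```

With equal predicted confidence in the true class, the minority-class sample contributes 2.5 times the loss of the majority-class sample, which is the compensation effect described above.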
[0087] Based on the weighted cross-entropy loss function configured above, the message passing neural network is driven to carry out iterative training. In each round of training, the network parameters are updated in reverse according to the loss value, and the feature extraction capability and configuration discrimination capability of the model are gradually optimized until the model converges and the loss value tends to stabilize, thus completing the entire training process of the model and obtaining the target prediction model.
[0088] In one embodiment of this application, after the target prediction model is trained, it is used to predict the product structure. The constructed local molecular graph is input into the trained model. The input molecular graph propagates information through multiple GIN layers, learning a high-dimensional embedding representation of each node (atom) in the graph. When generating the graph representation for classification, this application employs a local pooling strategy: the system extracts and aggregates only the embedding vectors of the atoms at both ends of the target double bond, instead of performing global pooling over the entire graph. The aggregated local representation vector is input into a fully connected layer classifier, which outputs the probability that the target double bond is in the E or Z configuration. Based on the classification result (e.g., selecting the configuration with the highest probability), the open-source cheminformatics toolkit RDKit is used to add configuration markers to the identified "undetermined configuration double bonds," correcting the generated original product skeleton SMILES and finally outputting a complete product structure with specific double bond configuration information. For example, the only double bond in COc1cc(C=C(C#N)c2ccc(C(F)(F)F)nc2)ccc1O is marked as a "double bond with undetermined configuration", and the predicted double bond configuration is cis (Z configuration). The configuration information is incorporated into the original SMILES using RDKit to obtain the final product COc1cc(/C=C(\C#N)c2ccc(C(F)(F)F)nc2)ccc1O.
[0089] The embodiment illustrated in Figure 4 demonstrates the overall flow of a product double bond configuration post-processing prediction method based on a message passing neural network. First, the SMILES representation of the reactants is obtained and input into a pre-trained product skeleton prediction model (LocalTransform) to generate a product molecular skeleton that does not contain information on the configuration of newly generated double bonds. Atom mapping (AAM) is used to compare the reactants with the product skeleton and locate the newly generated target double bond. The local neighborhood of the target double bond is then clipped and featurized, extracting atomic features, bond features, and mapping difference features to construct a molecular graph data structure. The extracted features are input into a pre-trained MPNN-based product double bond configuration post-processing prediction module. This module classifies the configuration of the target double bond (E / Z configuration) using strategies such as local pooling and corrects the product skeleton SMILES based on the classification result, outputting a final product with specific double bond configuration information.
[0090] Specifically, from the publicly available USPTO_STEREO dataset, a newly generated undetermined configuration double bond identification method based on AAM was used for screening and labeling to obtain a reaction dataset (USPTO_EZ_Gen) containing newly generated specific E / Z configuration double bonds. This dataset contains 7519 reaction data points, with the ratio of Z configuration products to E configuration products being approximately 5:2. 83 reactions contain two or more newly generated specific configuration double bonds. The dataset samples were randomly divided into training, validation, and test sets in a 7:1.5:1.5 ratio. The training set was used for model training; the validation set was used to observe the loss value trend and optimize the model; and the test set was used to verify the predictive performance.
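The random 7:1.5:1.5 split can be sketched with the standard library. The seed, the function name `split_dataset`, and the choice to assign rounding leftovers to the test set are assumptions; only the sample count of 7519 and the ratio come from the text.

```python
import random


def split_dataset(samples, ratios=(0.7, 0.15, 0.15), seed=0):
    """Shuffle and split samples into training, validation, and test subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, so no sample is lost to rounding
    return train, val, test


reaction_ids = range(7519)  # stand-in indices for the 7519 reactions
train, val, test = split_dataset(reaction_ids)
```

Under these assumptions the subsets contain 5263, 1127, and 1129 reactions respectively, and every reaction appears in exactly one subset.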
[0091] To improve the model's predictive performance, the MPNN post-processing module employs a weighted cross-entropy loss function during training to overcome the imbalance of E / Z configuration samples and enhance the model's ability to identify minority class samples. To verify its effectiveness, this method is compared with two existing baseline models (the LocalTransform model and the Molecular Transformer model). The comparison metric is the prediction accuracy of the complete product on the test set, and the results are shown in Table 2.
[0092]
[0093] Table 2
[0094] Table 2 lists the prediction performance of different methods for the newly generated E / Z configuration products. The baseline model, LocalTransform, is limited by the coverage of the training template and cannot predict the newly generated double bond cis-trans configuration at all, with an accuracy of 0%. The baseline model, Molecular Transformer, can generate the configuration, but its accuracy is only 46.2%. Our proposed method achieves a prediction accuracy of 83.7%, which is a significant improvement over existing techniques in terms of evaluation metrics, fully validating the beneficial effects of our method.
[0095] Figure 5 shows the confusion matrix of the proposed method's prediction performance on the test set. It should be noted that this evaluation aims to verify the performance of the post-processing module in this method. Therefore, the small number of test set samples that could not be correctly predicted because of skeleton prediction failures are omitted from the confusion matrix statistics of Figure 5 and are not included in the performance evaluation of this module. Furthermore, when a molecular skeleton has two or more newly generated double bonds to be predicted, each double bond is treated as an independent prediction in the confusion matrix.
[0096] Continuing to refer to Figure 5, the confusion matrix uses the vertical axis to represent the true configuration of the samples (e.g., true E configuration, true Z configuration) and the horizontal axis to represent the configuration predicted by the proposed method (e.g., predicted E configuration, predicted Z configuration). The values on the main diagonal (i.e., the cells for "true E - predicted E" and "true Z - predicted Z") represent the number of correctly predicted samples, while the values on the secondary diagonal represent the number of incorrectly predicted samples. As can be seen from the figure, the counts on the main diagonal are significantly higher than those on the secondary diagonal, intuitively demonstrating the high accuracy of the proposed method in the E / Z configuration classification task.
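The layout of such a confusion matrix, with true configurations as rows and predicted configurations as columns, can be sketched as follows. The labels below are made up for illustration and are not the results shown in Figure 5; the function names are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, classes=("E", "Z")):
    """Rows are true configurations, columns are predicted configurations."""
    index = {cls: i for i, cls in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, prediction in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[prediction]] += 1
    return matrix


def accuracy(matrix):
    """Share of samples on the main diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up labels for illustration only; these are not the results of Figure 5.
truth = ["E", "E", "Z", "Z", "Z", "Z"]
predicted = ["E", "Z", "Z", "Z", "E", "Z"]
cm = confusion_matrix(truth, predicted)
```

Because each double bond is counted as an independent prediction, a product with two target double bonds would contribute two entries to `truth` and `predicted`.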
[0097] This application proposes a product double bond configuration post-processing prediction method based on message passing network (MPNN), which can effectively improve the prediction accuracy of product double bond configurations. It can identify specific double bond configurations newly formed in organic chemical reactions using atom mapping technology, and based on the accurate product molecular skeleton, uses a specially designed MPNN-based double bond configuration post-processing module to focus on key local regions where the reaction occurs, intelligently completing and repairing the newly formed stereochemical configuration. This application does not rely on complex three-dimensional structure calculations and has achieved significant results in predicting the cis-trans isomerism (E / Z configuration) of newly formed double bonds in products, with a prediction accuracy higher than existing technologies.
[0098] This application also provides a computer-readable medium having computer instructions stored thereon, which can be executed by a processor to implement a method for predicting a product structure as described above, or a method for training a model as described above.
[0099] When the method for predicting the product structure or the method for training the model is implemented as a computer program, it can also be stored as an article of manufacture in a computer-readable storage medium. For example, computer-readable storage media can include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, magnetic stripes), optical discs (e.g., compact discs (CDs), digital versatile discs (DVDs)), smart cards, and flash memory devices (e.g., erasable programmable read-only memory (EPROM), cards, sticks, key drives). Furthermore, the various storage media described herein can represent one or more devices and / or other machine-readable media for storing information. The term "machine-readable medium" can include, but is not limited to, wireless channels and various other media (and / or storage media) capable of storing, containing, and / or carrying code and / or instructions and / or data.
[0100] It should be understood that the embodiments described above are merely illustrative. The embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For hardware implementation, the processor may be implemented within one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, and / or other electronic units designed to perform the functions described herein, or combinations thereof.
[0101] Some aspects of this application can be executed entirely by hardware, entirely by software (including firmware, resident software, microcode, etc.), or by a combination of hardware and software. The aforementioned hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." The processor may be one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, or combinations thereof. Furthermore, aspects of this application may manifest as computer products residing in one or more computer-readable media, including computer-readable program code. For example, computer-readable media may include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, magnetic tapes, etc.), optical discs (e.g., compact discs (CDs), digital versatile discs (DVDs), etc.), smart cards, and flash memory devices (e.g., cards, sticks, key drives, etc.).
[0102] A computer-readable medium may contain a propagated data signal containing computer program code, for example, on baseband or as part of a carrier wave. This propagated signal may take various forms, including electromagnetic, optical, and so on, or suitable combinations thereof. A computer-readable medium can be any computer-readable medium other than a computer-readable storage medium, which can be connected to an instruction execution system, apparatus, or device to enable communication, propagation, or transmission of a program for use. The program code located on the computer-readable medium can be propagated through any suitable medium, including radio, cable, fiber optic cable, radio frequency signals, or similar media, or any combination of the above media.
[0103] The basic concepts have been described above. Obviously, for those skilled in the art, the above disclosure is merely illustrative and does not constitute a limitation of this application. Although not explicitly stated herein, those skilled in the art may make various modifications, improvements, and corrections to this application. Such modifications, improvements, and corrections are suggested in this application, and therefore remain within the spirit and scope of the exemplary embodiments of this application.
[0104] Furthermore, this application uses specific terms to describe embodiments of the application. For example, "an embodiment," "one embodiment," and / or "some embodiments" refer to a particular feature, structure, or characteristic related to at least one embodiment of the application. Therefore, it should be emphasized and noted that "an embodiment," "one embodiment," or "an alternative embodiment" mentioned twice or more in different locations in this specification do not necessarily refer to the same embodiment. In addition, certain features, structures, or characteristics in one or more embodiments of the application can be appropriately combined.
[0105] In some embodiments, numbers describing the quantity of components and attributes are used. It should be understood that such numbers used in the description of embodiments are modified in some examples with the terms "about," "approximately," or "generally." Unless otherwise stated, "about," "approximately," or "generally" indicates that the numbers are allowed to vary by ±20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximate values, which may be changed depending on the characteristics required by individual embodiments. In some embodiments, numerical parameters should take into account the specified significant digits and employ a general method of digit retention. Although the numerical ranges and parameters used to confirm their breadth of scope in some embodiments of this application are approximate values, in specific embodiments, such values are set as precisely as feasible.
Claims
1. A method for predicting product structure, characterized in that, The prediction method includes: The molecular skeleton of the product is determined based on the molecular structure of the reactants in the chemical reaction, and the double bonds to be predicted are located based on the molecular structure of the reactants and the molecular skeleton of the product. A local molecular map of the product molecular skeleton is constructed based on the double bond to be predicted, and features are extracted from the local molecular map. The features are input into the prediction model to classify the double bonds to be predicted, determine the specific configuration information of the double bonds to be predicted, and correct the molecular skeleton of the product based on the specific configuration information to obtain the prediction result of the product structure. The prediction model is trained from a reaction dataset that includes double bonds with specific configurations.
2. The method according to claim 1, characterized in that, The process of locating the double bond to be predicted based on the molecular structure of the reactant and the molecular skeleton of the product includes: The molecular structure of the reactant and the molecular skeleton of the product are compared based on atomic mapping to obtain the atomic mapping relationship between the reactant and the product. Based on the atomic mapping relationship, traverse each double bond in the molecular skeleton of the product to determine the corresponding atom of each double bond in the reactant; Determine the bond type of the corresponding atom in the reactant, and determine whether the double bond is a newly generated double bond based on the bond type. If so, locate the double bond to be predicted based on the newly generated double bond.
3. The method according to claim 1, characterized in that, The construction of a local molecular map of the product molecular skeleton based on the double bonds to be predicted includes: Centered on the located double bond to be predicted, a preset range of the central neighborhood is extracted from the product molecular skeleton to generate a local molecular map of the product molecular skeleton.
4. The method according to claim 1, characterized in that, The feature extraction of the local molecular map includes: Extract the basic molecular features of the local molecular map, wherein the basic molecular features include atomic features and bond features; The difference features are determined based on the property changes between the molecular structure of the reactants and the local molecular map at corresponding atoms. The atoms at both ends of the double bond to be predicted in the local molecular map are marked to determine the marking features.
5. The method according to claim 4, characterized in that, The prediction model comprises an input layer, a feature embedding layer, a convolutional layer, a pooling layer, and a fully connected layer connected in sequence. The step of inputting the features into the prediction model and classifying the double bonds to be predicted includes: A feature matrix is constructed based on the atomic features and the difference features, and the feature matrix is input into the input layer of the prediction model and mapped into a high-dimensional vector. The high-dimensional vector is converted into initial node features through the feature embedding layer of the prediction model; The initial node features are input into the convolutional layer to obtain optimized node features; The optimized node features are input into the pooling layer, and the center features are extracted based on the labeling features of the two atoms at both ends of the double bond to be predicted. The central features are classified in the fully connected layer to obtain the classification result of the double bond to be predicted.
6. The method according to claim 5, characterized in that, The convolutional layer comprises multiple graph neural network layers connected in sequence. Each graph neural network layer includes a GIN convolutional unit, an activation function unit, and a random deactivation unit. The step of inputting the initial node features into the convolutional layer to obtain optimized node features includes: The initial node features are sequentially input into multiple graph neural network layers of the convolutional layer for processing to obtain optimized node features. The processing steps of each graph neural network layer include: The GIN convolutional unit performs neighborhood information aggregation and transformation on the input features. The aggregated and transformed features are input into the activation function unit for nonlinear activation; The nonlinearly activated features are input into the random deactivation unit for regularization and then output to the next graph neural network layer.
7. The method according to claim 1, characterized in that determining the product molecular skeleton based on the molecular structure of the reactants in a chemical reaction comprises: representing the molecular structure of the reactants in the simplified molecular-input line-entry system (SMILES) format to obtain a text sequence of the reactants; and inputting the text sequence into a product skeleton prediction model to obtain the product molecular skeleton.
8. A method for training a model, characterized in that the method comprises: determining a product molecular skeleton based on the molecular structure of the reactants in a chemical reaction; determining the specific-configuration double bonds in the reaction based on the molecular structure of the reactants and the product molecular skeleton; constructing a reaction dataset based on the specific-configuration double bonds; and training a message-passing neural network with the reaction dataset to obtain a prediction model.
9. The method according to claim 8, characterized in that constructing the reaction dataset based on the specific-configuration double bonds comprises: obtaining reaction samples containing the specific-configuration double bonds; labeling each reaction sample, wherein the model input is determined based on the reactants and the product molecular skeleton in the reaction sample, and the true configuration of the specific-configuration double bond is used as the training label; and constructing the reaction dataset based on the model inputs, the reaction samples, and the training labels.
10. The method according to claim 8, characterized in that the step of training the message-passing neural network using the reaction dataset comprises: identifying, based on the reaction dataset, the samples of each configuration that contain specific-configuration double bonds; adopting a weighted cross-entropy loss function in which the proportion of the total loss contributed by the samples of each configuration is adjusted by class weights, wherein the class weights are inversely proportional to the sample frequency of each configuration; and performing iterative training based on the weighted cross-entropy loss function to complete the training of the message-passing neural network.
11. A computer-readable medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the method for predicting the product structure according to any one of claims 1 to 7, or the method for training the model according to any one of claims 8 to 10.
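As a non-limiting illustration of the layer sequence recited in claims 5 and 6, the following pure-Python sketch stacks GIN-style layers (aggregation and transformation, activation, dropout), pools the two atoms of the double bond to be predicted, and classifies the pooled vector. Several details are assumptions made only for this sketch and are not stated in the claims: the transformation is a single linear map rather than a multilayer perceptron, the activation is ReLU, the central feature is the mean of the two end-atom features, and all function names, weights, and the toy four-atom graph are hypothetical.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector (weight is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def gin_layer(h, neighbors, weight, bias, eps=0.0, p_drop=0.0, training=False, rng=random):
    """One graph layer: GIN aggregation and transformation, ReLU, then dropout."""
    out = []
    for v, h_v in enumerate(h):
        # GIN aggregation: (1 + eps) * own features + sum of neighbor features.
        agg = [(1.0 + eps) * x for x in h_v]
        for u in neighbors[v]:
            agg = [a + x for a, x in zip(agg, h[u])]
        z = linear(agg, weight, bias)      # transformation (assumed: one linear map)
        z = [max(0.0, x) for x in z]       # nonlinear activation (assumed: ReLU)
        if training and p_drop > 0.0:      # random deactivation (inverted dropout)
            keep = 1.0 - p_drop
            z = [x / keep if rng.random() < keep else 0.0 for x in z]
        out.append(z)
    return out


def classify_double_bond(h, neighbors, layers, bond_atoms, clf_weight, clf_bias):
    """Run the stacked layers, pool the two bond atoms, and classify."""
    for weight, bias in layers:
        h = gin_layer(h, neighbors, weight, bias)
    i, j = bond_atoms
    # Central feature (assumed: mean of the two end-atom feature vectors).
    center = [(a + b) / 2.0 for a, b in zip(h[i], h[j])]
    logits = linear(center, clf_weight, clf_bias)
    return logits.index(max(logits)), logits


# Hypothetical toy input: a four-atom chain 0-1=2-3 with the double bond at (1, 2).
toy_features = [[1, 0], [0, 1], [0, 1], [1, 0]]
toy_neighbors = [[1], [0, 2], [1, 3], [2]]
identity, zero = [[1, 0], [0, 1]], [0.0, 0.0]
predicted_class, toy_logits = classify_double_bond(
    toy_features, toy_neighbors, [(identity, zero)], (1, 2), identity, zero
)
print(predicted_class, toy_logits)
```

Pooling only the two end atoms, rather than the whole graph, keeps the classifier focused on the local environment of the bond whose configuration is being decided; the mean is used here because it does not depend on the order in which the two atoms are listed.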
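As a non-limiting illustration of the text sequence recited in claim 7, the following sketch joins reactant SMILES strings and splits them into atom-level tokens. The claims do not specify a tokenizer; the regular expression below is a commonly used SMILES tokenization pattern adopted here as an assumption, and the function names `tokenize_smiles` and `reaction_to_sequence` are hypothetical. Note that the `/` and `\` bond symbols, which encode double-bond configuration in SMILES, are kept as separate tokens.

```python
import re

# Atom-level SMILES token pattern (an assumption; not specified in the claims).
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens, rejecting unrecognized characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


def reaction_to_sequence(reactant_smiles):
    """Join several reactant SMILES with '.' and return one token sequence."""
    return tokenize_smiles(".".join(reactant_smiles))


# Hypothetical usage: two reactants become one token sequence.
print(reaction_to_sequence(["CC(=O)Cl", "C/C=C\\C"]))
```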
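As a non-limiting illustration of the weighted cross-entropy loss recited in claim 10, the following sketch derives class weights that are inversely proportional to the frequency of each configuration and applies them to per-sample cross-entropy terms. The specific scaling of the weights (sample count divided by the product of class count and per-class count) and the normalization of the loss by the sum of the applied weights are assumptions chosen for this sketch; the claim requires only inverse proportionality. The function names and the integer class labels are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Class weights proportional to 1 / (number of samples in the class)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_cross_entropy(logits_batch, labels, weights):
    """Weighted mean of per-sample cross-entropy; labels index into the logits."""
    total, norm = 0.0, 0.0
    for logits, y in zip(logits_batch, labels):
        m = max(logits)  # subtract the maximum for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        nll = log_z - logits[y]            # negative log-probability of the true class
        total += weights[y] * nll
        norm += weights[y]
    return total / norm


# Hypothetical imbalanced labels: three samples of class 0, one of class 1.
toy_labels = [0, 0, 0, 1]
toy_weights = inverse_frequency_weights(toy_labels)
# A model that always favors the majority class.
toy_logits_batch = [[2.0, 0.0]] * 4
print(weighted_cross_entropy(toy_logits_batch, toy_labels, toy_weights))
```

With these weights, a misclassified sample of the rarer configuration contributes more to the total loss than a misclassified sample of the more frequent one, which counteracts the tendency of the model to predict only the majority configuration.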