Method of generating training data for chemical reactions, computing device

By mapping reagent roles to fine-grained functional roles using a large language model and combining this with chemical mechanism constraints, high-quality positive and negative sample data is generated. This solves the problem of insufficient training data in existing chemical artificial intelligence models and improves the accuracy and generalization ability of the model in predicting chemical reactions.

CN122245480A · Pending · Publication Date: 2026-06-19 · SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT
Filing Date
2026-05-22
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing chemical AI models lack high-quality training data, especially due to unreasonable generation of negative samples, resulting in insufficient accuracy and generalization ability in predicting chemical reactions.

Method used

By mapping reagent roles to fine-grained functional roles through a large language model and combining chemical mechanism constraints, high-quality positive and negative sample data are generated.

Benefits of technology

The generated training dataset improves the model's accuracy and generalization ability in predicting chemical reactions, and enables it to better identify reaction feasibility boundaries.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to computer systems utilizing computational models, and discloses a method and computing apparatus for generating training data for chemical reactions. A method for generating training data for chemical reactions includes: acquiring raw reaction data; mapping role labels to functional role categories using a large language model; standardizing expressions and names using a scripting tool; obtaining positive sample data based on the mapped functional role categories, standardized expressions, and standardized names; using another scripting tool to select one or more perturbation strategies to remove and / or replace reagents of corresponding functional role categories in the positive sample data to obtain negative sample data; and generating training data based on the positive and negative sample data. The method according to this invention overcomes the limitation of existing models in learning accurate chemical reaction feasibility boundaries, and improves the performance of models performing chemical reaction-related tasks.

Description

Technical Field

[0001] This invention generally relates to computer systems utilizing computational models, and more specifically to methods, computing devices, computer-readable storage media, and computer program products for generating training data for chemical reactions.

Background Technology

[0002] As artificial intelligence (AI) is increasingly applied in chemical synthesis and molecular discovery, the upper limit of model performance is no longer primarily constrained by the algorithm architecture, but rather by the quality, resolution, and distribution balance of the training data. Traditional chemical databases are mainly designed for human browsing and literature retrieval, lacking the structural granularity and statistical balance required for machine learning. Ideal reaction training corpora not only require a mapping between reactants and products, but also high-resolution (fine-grained) experimental metadata and a balanced representation of success and failure results. Systematically constructing such datasets with high resolution and reasonable negative samples has become crucial to breaking through the current bottlenecks in the development of chemical AI.

[0003] Currently, the field of chemical artificial intelligence lacks high-quality, standardized data suitable for direct model training. Historical reaction databases generally suffer from missing metadata on reaction conditions (such as fine-grained functional roles of reagents) and systematically missing records of failed experiments (negative samples). These gaps obscure the feasibility boundaries of real chemical reactions, making it difficult for data-driven models to learn the true chemical logic. Such models are also prone to overfitting to superficial statistical correlations, which severely limits their reliability and accuracy in complex tasks such as reaction condition recommendation, generative synthesis planning, and generalization across reaction systems.

[0004] In the fields of artificial intelligence and cheminformatics, there exists a scheme for extracting reaction conditions using large language models adapted through prompt engineering or supervised fine-tuning. The main technical idea behind this scheme is to use large language models for fine-grained identification and structured extraction of reagent functions. While this scheme improves structured extraction to some extent, its role-level accuracy typically struggles to exceed 90%, and it suffers from inconsistent outputs and poor cross-dataset transferability. In particular, when the reagent function is not explicitly stated in the source text (i.e., implicit roles), this scheme cannot achieve reliable disambiguation and inference from surface semantic cues alone. Specifically, this extraction method lacks fundamental constraints based on chemical mechanism knowledge and therefore fails to accurately classify general reagent labels into key functional roles such as catalysts, bases, solvents, and ligands. Its inference capability is severely limited, especially when facing implicit conditions, resulting in insufficient fine-grained role resolution.

[0005] In addition, there exists a negative sample construction scheme based on generative models or heuristic perturbations. The main idea behind this scheme is to generate negative sample data through heuristic rules such as randomly adding, deleting, or replacing reagents. However, the quality of the samples generated by this scheme varies widely, and the scheme typically produces only shallow "weak negative samples" lacking physicochemical meaning. Manually constructing chemically plausible, mechanism-based negative samples is prohibitively expensive and cannot be scaled up to meet the needs of large-scale pre-training. In other words, heuristic perturbation methods can only generate invalid negative samples that lack physical meaning and cannot reflect true thermodynamic and kinetic infeasibility boundaries, while high-quality manual construction cannot scale. Negative sample generation therefore faces a trade-off between quality and scalability.

[0006] Furthermore, models trained using training data generated by the above two schemes are prone to statistical bias: due to the lack of high-resolution conditional annotations and mechanism-constrained negative samples, downstream models lack a penalty mechanism for anchoring reaction infeasibility during optimization, making them prone to overfitting to successful patterns that frequently appear in the literature and losing their ability to generalize to unknown chemical spaces.

[0007] There is a need in this field to improve, in at least one of the above aspects, the techniques for positive sample enhancement and negative sample generation of chemical reaction data based on large language model reasoning and chemical mechanism constraints, with the aim of synthesizing chemical reaction training data for training models.

Summary of the Invention

[0008] The present invention provides further improved techniques for synthesizing chemical reaction training data that integrate large language model reasoning with chemical mechanism constraints.

[0009] One aspect of the present invention provides a method for generating training data for chemical reactions, comprising the following steps: S1: acquiring raw reaction data, the raw reaction data being associated with multiple reagents of a chemical reaction in a target chemical reaction category, the raw reaction data including role labels, expressions, and names for each reagent; S2: mapping the role label for each reagent to one of multiple predefined functional role categories using a large language model, wherein the multiple predefined functional role categories are predetermined for the target chemical reaction category; S3: standardizing the expressions and names of each reagent using a first scripting tool; S4: obtaining positive sample data for the chemical reaction based on the functional role category mapped to each reagent, the standardized expressions, and the standardized names; S5: using a second scripting tool, selecting one or more perturbation strategies from a preset perturbation strategy set to remove and/or replace reagents of the corresponding functional role category in the positive sample data to obtain negative sample data for the chemical reaction, wherein the preset perturbation strategy set is predetermined for the target chemical reaction category based on chemical prior knowledge, and each perturbation strategy is associated with promoting the failure of the chemical reaction; and S6: generating the training data based on the positive sample data and the negative sample data.

[0010] In the method described above, step S2 includes: S21: determining the confidence with which the role label for each reagent is mapped to each of the plurality of predefined functional role categories; and S22: determining the functional role category with the highest confidence as the mapped functional role category.
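
Steps S21 and S22 amount to an arg-max over per-category confidences. The following is a minimal sketch, assuming the large language model returns one confidence score per candidate category; the function name `pick_role` and the example scores are hypothetical and are not part of the disclosure.

```python
from typing import Dict


def pick_role(scores: Dict[str, float]) -> str:
    """Return the functional role category with the highest confidence (S22)."""
    if not scores:
        raise ValueError("no candidate categories were scored")
    # max() with key=scores.get selects the category whose confidence is largest.
    return max(scores, key=scores.get)


# Hypothetical confidences for one reagent, as would be produced in S21.
example_scores = {"catalyst": 0.08, "base": 0.87, "solvent": 0.05}
best_role = pick_role(example_scores)
```

In this sketch ties are resolved in favor of the first category encountered; the source does not state a tie-breaking rule.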

[0011] In the method described above, step S3 includes: S31: for a reagent whose name is missing from the raw reaction data, querying a public chemical database using the reagent's standardized expression; and S32: determining a standardized name for the reagent based on the query result.

[0012] In the method described above, step S32 includes: S321: in response to the query result indicating that a name matching the reagent exists, determining the matching name as the standardized name of the reagent.

[0013] In the method described above, step S5 includes: S51: for each selected perturbation strategy, removing or replacing a reagent of the corresponding functional role category to obtain one piece of negative sample data for the chemical reaction.

[0014] The method described above further includes: S52: determining the number of perturbation strategies to be selected based on a preset ratio of the number of positive sample data to the number of negative sample data; and S53: selecting the determined number of perturbation strategies and generating the corresponding number of negative sample data.
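
Steps S52 and S53 can be illustrated with a short sketch. It assumes the preset ratio has the form 1:K (one positive sample to K negative samples) and that strategies are drawn at random without repetition; the source specifies neither, and the strategy names and the function `select_strategies` are hypothetical.

```python
import random
from typing import List, Sequence

# Hypothetical perturbation strategy set for one reaction category.
STRATEGIES = ["remove_catalyst", "replace_catalyst",
              "remove_base", "replace_base_with_acid"]


def select_strategies(strategy_set: Sequence[str], negatives_per_positive: int,
                      rng: random.Random) -> List[str]:
    """S52/S53: pick as many distinct strategies as negatives are required."""
    if negatives_per_positive > len(strategy_set):
        raise ValueError("ratio asks for more negatives than there are strategies")
    # Sampling without replacement yields one distinct strategy per negative.
    return rng.sample(list(strategy_set), negatives_per_positive)


# A 1:3 ratio means three strategies, hence three negatives, per positive sample.
chosen = select_strategies(STRATEGIES, 3, random.Random(0))
```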

[0015] In the method described above, step S6 includes: S61: generating the training data based on the preset ratio of positive sample data and negative sample data.

[0016] In the method described above, step S6 includes: S62: generating the training data based on the reaction performance index of the positive sample data, the negative sample data, and the original reaction data.

[0017] As described above, the plurality of predefined functional role categories include one or more of the following: ligand, catalyst, base, acid, solvent, substrate, electrophile, nucleophile, halogenator, oxidant, reductant, additive, atmosphere, functionalized reagent, and borate ester reagent.

[0018] In the method described above, the first scripting tool includes RDKit, the expression includes a SMILES expression, and the naming includes IUPAC naming.

[0019] The method described above further includes: S7: in response to determining that a reagent in the raw reaction data does not match any of the plurality of predefined functional role categories, removing the raw reaction data.

[0020] In the method described above, the large language model is implemented as a zero-shot classifier that has not been fine-tuned on samples pairing chemical reaction reagent names with functional roles.

[0021] In the method described above, the prompt input to the large language model includes one or more of the following: input format, output format, reagent functional role category mapping rules, and hard rules associated with the target chemical reaction category.

[0022] Another aspect of the present invention provides a computing device, comprising: a processor; and a memory having instructions stored thereon, which, when executed by the processor, implement the steps of the method as described in any of the preceding claims.

[0023] Another aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method described in any of the preceding claims.

[0024] Another aspect of the present invention provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the method described in any of the above embodiments.

[0025] The training data generation method of this invention, which integrates large language model reasoning with chemical mechanism constraints, can systematically construct high-quality chemical reaction datasets suitable for artificial intelligence training. The generated training dataset exhibits significantly improved reagent role labeling accuracy, highly chemically plausible negative sample generation, and broad generalization and scalability, thereby enhancing the performance of models trained on the generated dataset when performing chemical reaction-related tasks.

Attached Figure Description

[0026] Figure 1 is an illustration of a data synthesis framework according to some embodiments of the present invention.

[0027] Figure 2 is a flowchart of a method for generating training data for chemical reactions according to some embodiments of the present invention.

[0028] Figure 3 is a flowchart of a first process associated with a method for generating training data for chemical reactions according to some embodiments of the present invention.

[0029] Figure 4 is a flowchart of a second process associated with a method for generating training data for chemical reactions according to some embodiments of the present invention.

[0030] Figure 5 is a flowchart of a third process associated with a method for generating training data for chemical reactions according to some embodiments of the present invention.

[0031] Figure 6 is a block diagram of a computing device according to some embodiments of the present invention.

[0032] Figure 7 is a block diagram of a computer-readable storage medium according to some embodiments of the present invention.

[0033] Figure 8 is a block diagram of a computer program product according to some embodiments of the present invention.

[0034] Figure 9 shows experimental results for a downstream generation task of variable space expansion performed using training data generated according to an embodiment of the present invention.

[0035] In the accompanying drawings, the same reference numerals refer to the same elements.

Detailed Implementation

[0036] In this application, ordinal numbers such as "first," "second," and "third" are used to distinguish different instances of objects with the same name. The ordinal numbers "first," "second," and "third" do not indicate a relative order of the indicated objects in time, space, sequence, or other aspects.

[0037] In this application, the term "positive sample" refers to a successful experimental record sample, used to support the model in learning the forward implementation logic of a chemical reaction. In this application, the term "negative sample" refers to a failed experimental record sample, used to characterize the boundary conditions under which a chemical reaction is infeasible.

[0038] In this application, the term "reagent" means all other auxiliary substances in a chemical reaction besides reactants and products, used to promote the reaction, catalyze the reaction, or provide reaction conditions.

[0039] In this application, the term "fine-grained" refers to the high information resolution of chemical reaction metadata. Specifically, for reagent functional role labeling, "fine-grained" labeling refers to the precise role-level classification of reagents (e.g., acid, base, catalyst, etc.), rather than simply making a broad category classification. In existing technologies, auxiliary reagent roles in conventional databases are typically uniformly labeled as "agent," which does not reach a fine-grained level. This makes it difficult for models to learn the influence of different functional reagents on reaction outcomes, and is one of the core reasons for the insufficient accuracy of models in chemical reaction-related tasks.

[0040] In this application, the term "SMILES expression" refers to the Simplified Molecular Input Line Entry System (SMILES) expression, which is a standardized linear encoding rule in the field of cheminformatics used to describe molecular structures in the form of ASCII strings. It is used to perform unified, machine-recognizable structural characterization of reactants, products and other components in a reaction system, supporting the semantic understanding and reasoning of chemical structures by large language models.

[0041] In this application, the term "IUPAC nomenclature" refers to the chemical substance name generated in accordance with the standardized chemical substance nomenclature rules established by the International Union of Pure and Applied Chemistry (IUPAC). It is used to uniquely identify reactants, reagents, and products in a reaction system, supplement the semantic information of structural encoding such as SMILES expressions, and improve the accuracy of large language models in reasoning about chemical reaction components.

[0042] In this application, the term "InChI identifier" refers to the International Chemical Identifier (InChI), which is a unique structured code for chemical substances generated in accordance with the standardization rules established by IUPAC. It is used to unambiguously identify reactants, reagents, and products in a reaction system, supporting large language models to perform unified structural alignment of reaction data from different sources and improving the accuracy of data standardization processing.

[0043] Figure 1 is an illustration of a data synthesis framework 100 according to some embodiments of the present invention.

[0044] As shown in Figure 1, the data synthesis framework 100 begins with data acquisition 110. During the data acquisition process, a search is performed in a historical chemical reaction database (such as Pistachio) for a specified reaction type, and reaction records are extracted in batches to obtain non-standardized raw data.

[0045] Then, a Large Language Model (LLM)-guided reclassification is performed. This stage leverages the LLM in conjunction with prior knowledge in the chemical field, guiding the LLM through structured prompts to remap reaction data from the original database with fuzzy labels (e.g., vague "reagent" labels), transforming them into predefined, refined functional roles (e.g., catalyst, ligand, solvent, base, etc.), thereby constructing positive samples consistent with chemical mechanisms. To avoid label leakage from homologous data and to objectively evaluate the model's generalization and resolution capabilities, in some embodiments, this process may employ a zero-shot inference paradigm.

[0046] Specifically, a candidate set containing multiple independent reagent categories is first constructed as a reference benchmark for the large language model classifier. The construction of this set strictly follows three logical principles: mutual exclusion (ensuring that the category definitions are strictly orthogonal and avoiding feature overlap), generality (applicable to the target reaction system and can be transferred to other reaction types through fine-tuning), and completeness (ensuring that all reagents in the sample space can be mapped to a uniquely determined preset category).

[0047] During the annotation process, the large language model can be configured as a zero-shot classifier. The input for a single inference can include a prompt, a set of candidate categories, and the reaction instance to be labeled. Using the contextual features of the target chemical reaction and the coexisting reagents, the large language model outputs the role label with the highest confidence from the predefined set. If a reagent in the original data does not match any predefined role, its role label can be left blank, and the reaction data can be removed. In the example of Figure 1, the role label for a specific reagent is mapped from the vague "reagent" to the fine-grained "catalyst".
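
The single-inference input and the handling of unmatched labels can be sketched as follows. This is an illustrative outline only: no model is called, the category subset, the prompt layout, and the names `build_prompt` and `validate_label` are assumptions, and the source does not prescribe a concrete prompt format.

```python
import json
from typing import Dict, List, Optional

# Hypothetical subset of the predefined functional role categories.
CANDIDATE_ROLES = ["catalyst", "ligand", "base", "solvent"]


def build_prompt(instructions: str, roles: List[str], reaction: List[Dict]) -> str:
    """Assemble the three inputs of a single inference into one prompt string."""
    return "\n\n".join([
        instructions,
        "Candidate categories: " + ", ".join(roles),
        "Reaction to label:\n" + json.dumps(reaction, indent=2),
    ])


def validate_label(raw_output: str, roles: List[str]) -> Optional[str]:
    """Return the label if it is a candidate category, else None (left blank)."""
    label = raw_output.strip().lower()
    return label if label in roles else None


reaction = [{"role": "Agent", "smiles": "O1CCOCC1", "iupac": None}]
prompt = build_prompt("Assign one functional role to each Agent.",
                      CANDIDATE_ROLES, reaction)
```

A reaction whose reagent yields `None` from `validate_label` would be the one removed under the rule described above.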

[0048] Next, the semantic role data output by the large language model is standardized and quality-controlled to meet the stringent molecular-structure requirements of downstream computation. Cheminformatics tools (such as RDKit) are used to standardize the SMILES expressions and IUPAC names in the original records to industry-standard formats. For data lacking IUPAC information, a retrieval and completion module is introduced, which uses SMILES or InChI identifiers to query public chemical resources in sequence (such as PubChem and the Chemical Identifier Resolver accessed through CIRpy). If a match is found, the missing name is automatically completed; if no match is found, the sample is marked as an outlier and removed. After LLM-guided reclassification, a high-quality positive sample set with fine-grained role annotations is output.
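
The sequential lookup with fallback can be sketched as below. To keep the example self-contained, two in-memory dictionaries stand in for the public lookup services; they, the `complete_name` helper, and the deliberately unknown `[Xx]` entry are hypothetical. For brevity the sketch drops an unmatched record, whereas the text above removes the whole sample.

```python
from typing import Callable, Dict, Iterable, List, Optional

Resolver = Callable[[str], Optional[str]]


def complete_name(identifier: str, resolvers: Iterable[Resolver]) -> Optional[str]:
    """Query each resolver in order and return the first name found, else None."""
    for resolve in resolvers:
        name = resolve(identifier)
        if name:
            return name
    return None


# Hypothetical in-memory stand-ins for two public lookup services.
_TABLE_A: Dict[str, str] = {"O": "oxidane"}
_TABLE_B: Dict[str, str] = {"O1CCOCC1": "1,4-dioxane"}
resolvers: List[Resolver] = [_TABLE_A.get, _TABLE_B.get]

records = [
    {"smiles": "O", "iupac": None},
    {"smiles": "O1CCOCC1", "iupac": None},
    {"smiles": "[Xx]", "iupac": None},  # deliberately unknown to both tables
]

completed = []
for rec in records:
    name = rec["iupac"] or complete_name(rec["smiles"], resolvers)
    if name is not None:  # records with no match are treated as outliers
        completed.append({**rec, "iupac": name})
```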

[0049] Subsequently, mechanism-guided negative sample generation is performed. Building upon fine-grained role labeling, this stage utilizes local perturbations based on prior chemical knowledge to generate, in batches, negative samples that do not conform to the chemical mechanism. This process takes reaction condition (positive condition) data with fine-grained role labeling as input. By imposing strict mechanism rule constraints based on the current reaction's mechanistic logic (determined according to the specified chemical reaction category), core reagents are perturbed at critical nodes of the catalytic cycle. The perturbation process strictly adheres to the principle of controlled variables, modifying only one reagent category per perturbation. Perturbations may include removing specific reagents (e.g., removing basic reagents in the reaction to simulate specific inactivation nodes) and/or replacing specific reagents (e.g., replacing basic reagents with acidic reagents to disrupt the original reaction environment). When generating a single negative sample instance, only one operation is performed on the specific reagent category specified by the above strategy, while reactants, products, and other auxiliary reagents remain unchanged. The example in Figure 1 shows three different negative sample perturbation methods: replacing the catalyst, removing the catalyst, and replacing the base with an acid, but the present invention is not limited thereto.

[0050] The number of negative samples generated for each positive sample corresponds to the number of perturbations applied to that positive sample. In the example of Figure 1, K negative samples are generated for one positive sample, resulting in a data pair consisting of one positive sample and K negative samples. However, this invention is not limited to this. Depending on the actual needs, the number of positive samples can be greater than, equal to, or less than the number of negative samples.
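
The controlled-variable perturbation described above, in which exactly one functional role category is removed or replaced while everything else is kept, can be sketched as follows. The record layout, the function name `perturb`, and the choice of hydrogen chloride (`Cl`) as the replacement acid are assumptions made for illustration.

```python
from typing import Dict, List, Optional

Reagent = Dict[str, str]


def perturb(positive: List[Reagent], role: str,
            replacement: Optional[Reagent] = None) -> List[Reagent]:
    """Remove (replacement=None) or replace the reagents of exactly one role."""
    # Copy every reagent of the other roles unchanged; the input is not mutated.
    kept = [dict(r) for r in positive if r["role"] != role]
    if len(kept) == len(positive):
        raise ValueError(f"positive sample has no reagent with role {role!r}")
    if replacement is not None:
        kept.append(dict(replacement))
    return kept


positive = [
    {"role": "catalyst", "smiles": "Cl[Pd]Cl"},
    {"role": "base", "smiles": "[Cs]OC(=O)O[Cs]"},
    {"role": "solvent", "smiles": "O1CCOCC1"},
]

# Two negatives derived from the same positive, each changing only the base.
no_base = perturb(positive, "base")
acid_for_base = perturb(positive, "base", {"role": "acid", "smiles": "Cl"})
```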

[0051] Finally, the true positive samples obtained through LLM-guided reclassification 120 are combined with the counterfactual negative samples obtained through mechanism-guided negative sample generation 130 to construct a standardized training dataset with a clear reaction feasibility boundary and a balanced statistical distribution.
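
Combining each positive sample with its K negatives into one labeled dataset can be sketched as below. The binary `feasible` field and the pair layout are assumptions; the source does not fix a storage format or a label encoding.

```python
from typing import Dict, List


def build_dataset(pairs: List[Dict]) -> List[Dict]:
    """Flatten (positive, negatives) pairs into labeled training rows."""
    rows: List[Dict] = []
    for pair in pairs:
        # The successful record is labeled feasible.
        rows.append({"conditions": pair["positive"], "feasible": 1})
        # Each mechanism-guided perturbation of it is labeled infeasible.
        rows.extend({"conditions": neg, "feasible": 0}
                    for neg in pair["negatives"])
    return rows


pairs = [{
    "positive": ["catalyst", "base", "solvent"],
    "negatives": [["base", "solvent"], ["catalyst", "solvent"]],
}]
rows = build_dataset(pairs)
```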

[0052] This dataset is directly used to train downstream data-driven models, enabling them to internalize real chemical logic and providing a solid data foundation for reliable reaction condition recommendations and generative synthesis planning.

[0053] According to one aspect of the present invention, a method for generating training data for chemical reactions is provided.

[0054] Figure 2 This is a flowchart of a method for generating training data for chemical reactions according to some embodiments of the present invention.

[0055] In some examples, the method may include step S1: acquiring raw reaction data. The raw reaction data may be associated with multiple reagents for a chemical reaction within a target chemical reaction category. The raw reaction data may include role labels, expressions, and names for each reagent.

[0056] Step S1 may, for example, correspond to the data acquisition 110 in Figure 1, but the present invention is not limited thereto.

[0057] In some examples, the method may include step S2: mapping the role label for each reagent to one of a plurality of predefined functional role categories by a large language model. The plurality of predefined functional role categories may be pre-determined for the target chemical reaction category.

[0058] In some examples, the method may include step S3: standardizing the expression and naming of each reagent by a first scripting tool.

[0059] In some examples, the method may include step S4: obtaining positive sample data for the chemical reaction based on the functional role category mapped to each reagent, the standardized expression, and the standardized naming.

[0060] Steps S2-S4 may, for example, correspond to the LLM-guided reclassification 120 in Figure 1, but the present invention is not limited thereto. Embodiments of the present invention use chemical mechanism knowledge as a fundamental constraint, pre-define a chemically reasonable range of roles, and guide a large language model through structured prompts to accurately parse reagent labels, which originally carried only fuzzy qualitative descriptions, and assign them to specific fine-grained functional roles. This effectively solves the disambiguation and inference problems of implicit reagent functions in the source text and provides the necessary role label foundation for subsequent negative sample generation. At the same time, by combining cheminformatics tools (and optional public database retrieval modules), molecular structure expressions are standardized, missing information is automatically completed, and outlier data is removed. This ensures the refinement and standardization of positive samples and facilitates further data processing.

[0061] In some examples, the method may include step S5: using a second scripting tool to select one or more perturbation strategies from a pre-defined set of perturbation strategies to remove and / or replace reagents of the corresponding functional role category in the positive sample data, thereby obtaining negative sample data for the chemical reaction. The pre-defined set of perturbation strategies may be predetermined based on prior chemical knowledge for the target chemical reaction category, and each perturbation strategy is associated with promoting the failure of the chemical reaction.

[0062] In some examples, the method may include step S6: generating training data based on positive and negative sample data.

[0063] Steps S5-S6 may, for example, correspond to the mechanism-guided negative sample generation 130 in Figure 1, but the present invention is not limited thereto. Embodiments of the invention take the reaction conditions with fine-grained role labeling obtained in the preceding steps and, based on prior chemical mechanism knowledge, locally perturb the core reagents at key nodes of the catalytic cycle. Strictly adhering to the principle of controlled variables, and through single operations such as removing or replacing specific reagents, the embodiments generate mechanism-guided negative samples while keeping reactants, products, and other auxiliary reagents unchanged.

[0064] This invention provides a method for synthesizing chemical reaction training data that integrates large language model reasoning with chemical mechanism constraints. By pre-defining chemically reasonable role ranges and combining them with expert rules, the large language model is guided to accurately resolve fuzzy reagent labels into fine-grained functional roles. Based on this, the fine-grained roles are used as mechanism anchors, and by intervening at key inactivation nodes in known catalytic cycles, mechanism-guided counterfactual hard negative samples can be generated at low cost and on a large scale.

[0065] The advantage of the embodiments of the present invention is that it transforms the original chemical reaction data with fuzzy labeling and imbalanced samples into a high-quality positive and negative sample set with fine labeling, thereby providing reliable data support for downstream deep learning models and significantly improving the model's ability to identify the feasibility of reactions.

[0066] The advantage of embodiments of the present invention also lies in the systematic construction of a high-quality reaction training dataset suitable for artificial intelligence training, which prompts downstream models to shift from surface statistical pattern matching to mechanism-based boundary learning, thereby significantly improving the accuracy of reaction feasibility prediction and cross-system generalization ability.

[0067] In some examples, the method may optionally include step S7: in response to determining that a reagent in the original reaction data does not match any of the multiple predefined functional role categories, the original reaction data is removed.

[0068] In some examples, multiple predefined functional role categories may include one or more of the following: ligand, catalyst, base, acid, solvent, substrate, electrophile, nucleophile, halogenator, oxidant, reductant, additive, atmosphere, functionalized reagent, and borate ester reagent.

[0069] In some examples, the first scripting tool may include RDKit, the expression may include SMILES expressions, and the naming may include IUPAC naming.

[0070] In some examples, a large language model can be implemented as a zero-shot classifier that has not been fine-tuned on samples pairing chemical reaction reagent names with functional roles.

[0071] In some examples, the prompt input for a large language model may include one or more of the following: input format, output format, reagent functional role category mapping rules, and hard rules associated with the target chemical reaction category.

[0072] The implementation details of the method for generating training data for chemical reactions according to the present invention are further described below with specific examples.

[0073] First, raw reaction data is obtained from a historical chemical reaction database, which corresponds to step S1 of the method in an embodiment of the present invention. An example of the obtained raw reaction data is given below, but the present invention is not limited thereto.

[0074] Example of raw reaction data:

[
  {"role": "Reactant", "smiles": "ClC=1C=C(C=2N(N1)C(=NC2C(C)C)C)NCC2=CC=C(C=C2)OC", "iupac": None},
  {"role": "Reactant", "smiles": "C(C)OC1=NC=CC=C1B(O)O", "iupac": None},
  {"role": "Product", "smiles": "C(C)OC1=NC=CC=C1C=1C=C(C=2N(N1)C(=NC2C(C)C)C)NCC2=CC=C(C=C2)OC", "iupac": None},
  {"role": "Agent", "smiles": "[CH-]1[CH-][CH-][C-]([CH-]1)P(C2=CC=CC=C2)C3=CC=CC=C3.C1=CC=C(C=C1)P([C-]2C=CC=C2)C3=CC=CC=C3.C(Cl)Cl.Cl[Pd]Cl.[Fe]", "iupac": None},
  {"role": "Agent", "smiles": "[Cs]OC(=O)O[Cs]", "iupac": None},
  {"role": "Agent", "smiles": "O1CCOCC1", "iupac": None},
  {"role": "Agent", "smiles": "O", "iupac": None}
]

As can be seen from this example, raw reaction data obtained from historical chemical reaction databases typically carries only the general role label "Agent" for reagents and includes non-standard SMILES expressions and (partially) missing IUPAC nomenclature.

[0075] Next, the large language model maps the general role label "reagent" to a functional role label within the fine-grained functional role labels for each reagent, which corresponds to step S2 of the method in an embodiment of the present invention. As mentioned above, the set of fine-grained functional role labels should adhere to mutual exclusivity, universality, and completeness. The following is an example of a set of fine-grained functional role labels input to the large language model, but the present invention is not limited thereto.

[0076] Example of a set of fine-grained functional role labels:
{
  "ligand": {
    "inclusion": "Molecules coordinated to a metal center (e.g., phosphines (such as PPh3, PCy3); amines; NHC).",
    "notes": "Exclusion: Inorganic salts or solvents used only as bases."
  },
  "catalyst": {
    "inclusion": "Reagents that provide the metal center (e.g., Pd(OAc)2, Pd2(dba)3, Pd(PPh3)4).",
    "notes": "Exclusion: Ligands that do not contain metal atoms."
  },
  "base": {
    "inclusion": "Reagents used to activate the boronic acid or neutralize acids (e.g., K2CO3, Na2CO3, Cs2CO3, K3PO4).",
    "notes": "Some silver salts (such as Ag2O) act as both bases and promoters, and are primarily classified as bases."
  },
  "acid": {
    "inclusion": "Acidic substances added to the reaction (rare; for example, for pH adjustment or deprotection).",
    "notes": "Exclusion: Boronic acids should be classified as boronate reagents."
  },
  "solvent": {
    "inclusion": "Reaction medium (e.g., DMF, DMSO, toluene, MeOH, H2O).",
    "notes": "If added in trace amounts (not the primary medium), consider classifying it as an additive."
  },
  "substrate": {
    "inclusion": "Typically an organohalide or its equivalent (e.g., Ar-Br, Ar-I, Ar-OTf).",
    "notes": "Exclusion: Organoboron coupling components (classified as boronate reagents)."
  },
  "electrophile": {
    "inclusion": "The electrophilic component in a coupling reaction, usually identical to the substrate.",
    "notes": "In the Suzuki scenario, if the substrate category has already been selected, this category is usually redundant."
  },
  "nucleophile": {
    "inclusion": "The nucleophilic component in the coupling reaction, usually an activated boronate species.",
    "notes": "This should be classified as a boronate reagent."
  },
  "halogenating_agent": {
    "inclusion": "Reagents used to introduce halogen atoms into a molecule (e.g., NBS, NCS, I2).",
    "notes": "Strictly exclude: metal halides (e.g., PdCl2, CuI, NiBr2, FeCl3), as these reagents will not halogenate the product."
  },
  "oxidant": {
    "inclusion": "Reagents used to oxidize metal centers or byproducts (e.g., O2, Cu(OAc)2, benzoquinone).",
    "notes": "Silver salts are classified as oxidants only when they promote the conversion of Pd(0) to Pd(2+)."
  },
  "reductant": {
    "inclusion": "Reagents used to reduce the metal center (e.g., Pd(2+) to Pd(0)), such as hydrazine or formic acid.",
    "notes": "Excess phosphine ligands may act as reducing agents, but are preferentially classified as ligands."
  },
  "additive": {
    "inclusion": "Auxiliary salts or substances that do not fit any other category, such as TBAB, TBAI, 18-crown-6, or molecular sieves.",
    "notes": "Core classification logic: (non-basic) inorganic salts (such as NaCl or KF) belong to this category and are not classified as halogenating agents."
  },
  "atmosphere": {
    "inclusion": "Gaseous environment (e.g., N2, Ar, O2).",
    "notes": "Exclusion: Vapor generated by the solvent."
  },
  "functionalizing_agent": {
    "inclusion": "Reagents used to introduce specific functional groups (non-core coupling).",
    "notes": "Exclusion: Standard coupling components."
  },
  "boronate_reagent": {
    "inclusion": "All organoboron species, such as boronic acids, pinacol boronates, or BF4- salts.",
    "notes": "This type of reagent is the hallmark of Suzuki coupling; it must be classified separately."
  }
}
The following is an example of a prompt input to the large language model for mapping ambiguous role labels to fine-grained functional role labels, but the invention is not limited thereto.

[0077] Example of prompt input: You are a chemistry expert model. Your task is to replace the "Agent" roles in the input list of chemicals with specific substance category names. Please follow these guidelines:
- Input format: A list, where each element is a dictionary containing the following fields: "role" (e.g., "Reactant", "Ligand", "Product"), "smiles", "inchi".
- Output format: A list with the same structure, in which the "role" field of every "Agent" entry is replaced with the inferred category name. Other roles (e.g., "Reactant", "Product") remain unchanged.

[0078] - Reagent role replacement rules: When the original role is "Agent", the most accurate role term must be selected from the provided AVAILABLE_LIST for replacement. "Additive" should only be used as a last resort when no other suitable option exists in the AVAILABLE_LIST.

[0079] - Notice: Modify only the "Agent" role; keep other fields unchanged.

[0080] - Important requirements: Your output must be plain JSON that can be parsed directly by Python's `json.loads()`. Do not include any extra text, explanations, or JSON tags.

[0081] ### AVAILABLE_LIST: {available_list} (for example, the set of fine-grained functional role labels provided above) ### Hard Rules 1. Metal salts versus halogenating agents: In the context of Suzuki coupling, an inorganic metal halide must never be classified as a "halogenating agent". If it lacks clear basic or oxidizing properties, it should be classified as an "additive".

[0082] 2. Boron preferred: Any organic reagent containing boron atoms must be preferentially referred to as "boronate reagent" rather than "nucleophile" or "substrate".

[0083] 3. Silver halide branching logic: If the reagent is or is categorized as "base".

[0084] If the reagent is classified as "oxidant" or "additive".

[0085] If the reagent is or is classified as "additive".

[0086] 4. Catalyst and Ligand: A reagent containing a transition metal is labeled "Catalyst". A reagent that contains no metal and only coordinates to a metal is labeled "Ligand".

[0087] 5. The role of Substrate: In the context of "Suzuki coupling", the "Substrate" label is reserved for organic electrophiles (halide or triflate components).

[0088] ### Input

[0089] {REACTION_SMILES}

[0090] (For example, the example of raw reaction data provided above)
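The prompt structure above can be sketched in code as follows. This is a minimal illustration only: the abbreviated template text and the helper names `build_prompt` and `parse_mapped_roles` are assumptions introduced here and do not appear in the original disclosure. The sketch fills the two placeholders and then checks that the model's reply is plain JSON in which only the "Agent" roles have changed.

```python
import json

# Hypothetical, abbreviated template; only the two placeholders are substituted.
PROMPT_TEMPLATE = (
    "You are a chemistry expert model. Replace every \"Agent\" role with the "
    "most accurate term from AVAILABLE_LIST. Output plain JSON only.\n"
    "### AVAILABLE_LIST: {available_list}\n"
    "### Input\n{reaction}\n"
)


def build_prompt(records, available_list):
    """Fill the template with the allowed labels and the raw reaction records."""
    return PROMPT_TEMPLATE.format(
        available_list=json.dumps(sorted(available_list)),
        reaction=json.dumps(records),
    )


def parse_mapped_roles(llm_output, original, available_list):
    """Parse the model reply and verify that only "Agent" roles were changed."""
    mapped = json.loads(llm_output)  # raises ValueError on non-JSON output
    if len(mapped) != len(original):
        raise ValueError("output length differs from input length")
    for before, after in zip(original, mapped):
        if before["smiles"] != after["smiles"]:
            raise ValueError("a field other than 'role' was modified")
        if before["role"] == "Agent":
            if after["role"] not in available_list:
                raise ValueError(f"unknown role label: {after['role']!r}")
        elif before["role"] != after["role"]:
            raise ValueError("a non-Agent role was modified")
    return mapped
```

Rejecting replies that fail these checks is one possible way to enforce the "Notice" and "Important requirements" items of the prompt; the original text does not state how invalid replies are handled.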

[0091] Furthermore, a scripting tool is used to standardize the expression and name of each reagent, which corresponds to step S3 of the method in the embodiments of the present invention. As an example, the RDKit library (without limitation) can be called directly from Python, taking the SMILES expression from the raw reaction data as input and producing a standardized SMILES expression as output. The PubChem and CIRpy APIs can then be called to query the corresponding IUPAC name using the standardized SMILES expression.
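A minimal sketch of this standardization step is given below. To keep the example self-contained, the canonicalizer and the name lookup are passed in as callables; the function name `standardize` and both parameter names are assumptions made for illustration. In practice the canonicalizer could wrap RDKit, for example `Chem.MolToSmiles(Chem.MolFromSmiles(s))`, and the lookup could wrap a PubChem or CIRpy query, but neither is called here.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[str]]


def standardize(
    records: List[Record],
    canonicalize: Callable[[str], str],
    lookup_name: Callable[[str], Optional[str]],
) -> List[Record]:
    """Return copies of the records with standardized SMILES and names.

    canonicalize: maps a raw SMILES string to its standardized form
                  (assumed to be backed by a toolkit such as RDKit).
    lookup_name:  maps a standardized SMILES to a name, or None if the
                  database has no matching entry (assumed external query).
    """
    standardized = []
    for rec in records:
        smiles = canonicalize(rec["smiles"])
        # Only query the database when the raw data has no name.
        name = rec.get("iupac") or lookup_name(smiles)
        standardized.append({**rec, "smiles": smiles, "iupac": name})
    return standardized
```

Injecting the two callables keeps the data flow testable without network access; a record whose lookup returns no match simply keeps an empty name in this sketch.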

[0092] Based on the mapped fine-grained functional role labels, the standardized expressions, and the standardized names, positive sample data associated with the reaction data is obtained, which corresponds to step S4 of the method in the embodiment of the present invention.
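One way to picture step S4 is as a regrouping of the per-reagent records into a dictionary keyed by functional role. The sketch below assumes that each record already carries its mapped role, standardized SMILES, and standardized name; the function name `to_positive_sample` and the capitalization of the keys are illustrative assumptions.

```python
def to_positive_sample(records):
    """Group standardized reagent records by their (mapped) role label."""
    sample = {}
    for rec in records:
        key = rec["role"].capitalize()  # e.g. "catalyst" -> "Catalyst"
        entry = {"iupac": rec["iupac"], "smiles": rec["smiles"]}
        sample.setdefault(key, []).append(entry)
    return sample


records = [
    {"role": "Reactant", "smiles": "C(C)OC1=NC=CC=C1B(O)O",
     "iupac": "(2-ethoxy-3-pyridinyl)boronic acid"},
    {"role": "base", "smiles": "[Cs]OC(=O)O[Cs]", "iupac": "dicesium;carbonate"},
    {"role": "solvent", "smiles": "O1CCOCC1", "iupac": "1,4-dioxane"},
    {"role": "solvent", "smiles": "O", "iupac": "oxidane"},
]
positive = to_positive_sample(records)
```

Keying the sample by role lets a later perturbation step address all reagents of one functional category at once.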

[0093] Next, building upon the aforementioned fine-grained classification of reagent roles, a counterfactual negative sample generation strategy constrained by reaction mechanisms can be introduced. The resolved fine-grained reagent roles can be used as mechanistic anchors, and chemical inactivation can be simulated by intervening in key chemical nodes in known catalytic cycles. This allows for the large-scale construction of high-quality negative sample data that is mechanistically infeasible and conforms to physicochemical common sense at extremely low computational cost.

[0094] The perturbation in the negative sample generation process is based on the reaction mechanism: only the reagent categories that the mechanism requires are perturbed, and reagents that are not required are left unperturbed. For a given type of reaction, the reaction mechanism can be clearly defined by human experts. As an example, for the Suzuki reaction system, domain experts have extracted the following two core mechanism constraints: (1) Catalyst dependence: The occurrence of the Suzuki reaction is highly dependent on catalysis by a specific transition metal (usually a Pd or Ni system). If the system lacks a metal source, or only introduces Lewis acids with no catalytic activity (such as AlCl3, GaCl3, etc.), the catalytic cycle cannot start; (2) Basic microenvironment dependence: A basic environment is required to drive the transmetalation step. If a basic environment is lacking or an acidic reagent (such as hydrochloric acid, acetic acid, etc.) is introduced, the reaction will be quenched.

[0095] Based on the aforementioned mechanistic limitations, various derivative perturbation strategies can be designed and summarized for key nodes in the catalytic cycle of the Suzuki reaction. As an example, seven perturbation strategies can be designed in the Suzuki system. According to the mechanism-level action pathway, these strategies are divided into three dimensions: (i) Catalytic deactivation: including removing the effective catalyst (strategy 1), replacing the effective catalyst with a Lewis acid (strategy 6) or an inert metal salt (strategy 7); (ii) Thermodynamic deactivation: covering removing the base reagent (strategy 3) or replacing the base reagent with an acidic reagent (strategy 2); (iii) Kinetic hindrance: i.e., replacing it with an inefficient catalyst (strategy 4) or an inefficient base (strategy 5).

[0096] By randomly applying one or more of the strategies described above to each reaction data point, corresponding counterfactual negative samples can be generated. Continuing with the Suzuki system as an example, by using code to replace or delete reagents whose "role" is "catalyst / base" in the positive samples, the perturbation strategy can be implemented, ultimately generating counterfactual negative samples.
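The random application of perturbation strategies to a positive sample can be sketched as follows. Only four of the seven strategies are shown, and the strategy table, the function names, and the choice to overwrite the "Base" entry when substituting an acid are assumptions made for illustration; the replacement reagents (aluminium trichloride, acetic acid) are taken from the examples named in the text.

```python
import copy
import random

# Hypothetical strategy table: strategy id -> (role to perturb, replacement).
# A replacement of None means the reagents of that role are removed.
STRATEGIES = {
    1: ("Catalyst", None),
    2: ("Base", {"iupac": "Acetic acid", "smiles": "CC(=O)O"}),
    3: ("Base", None),
    6: ("Catalyst", {"iupac": "Aluminium trichloride", "smiles": "Cl[Al](Cl)Cl"}),
}


def apply_strategy(positive, strategy_id):
    """Return a perturbed deep copy of the positive sample."""
    role, replacement = STRATEGIES[strategy_id]
    negative = copy.deepcopy(positive)  # never mutate the positive sample
    negative[role] = [] if replacement is None else [dict(replacement)]
    return negative


def make_negatives(positive, k, rng):
    """Apply k distinct, randomly chosen strategies to one positive sample."""
    chosen = rng.sample(sorted(STRATEGIES), k)
    return [apply_strategy(positive, sid) for sid in chosen]
```

For example, `make_negatives(positive, 2, random.Random(0))` would yield two counterfactual negatives for one positive sample; passing a seeded `random.Random` keeps the selection reproducible.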

[0097] In addition to experts formulating mechanistic rules for reaction systems, counterfactual negative samples that conform to chemical mechanism constraints can also be generated during the data augmentation stage through computational chemistry simulations such as density functional theory or molecular dynamics, reaction templates and heuristic rules based on domain experts, or Monte Carlo random perturbation algorithms with basic chemical valence state constraints.

[0098] Counterfactual negative samples can be generated using scripts. Below is an example of script code for generating negative samples, but the invention is not limited thereto.

[0099] Example of script code to generate negative samples:

def get_strategy_1_replace_catalyst_only(self, original_condition):
    # Build one negative condition per inactive catalyst candidate.
    results = []
    inactive_cats = SuzukiNegativeSample.get_inactive_catalysts()
    for name in inactive_cats:
        # Work on a copy so that the positive sample is left untouched.
        cond = self._copy_condition(original_condition)
        if name is None or name == "":
            # An empty candidate means the catalyst is removed altogether.
            cond['Catalyst'] = []
        else:
            # Otherwise the catalyst is replaced with the inactive reagent.
            cond['Catalyst'] = [self._get_reagent_obj(name)]
        results.append(cond)
    return results

In the example script code above, negative samples are generated by replacing the catalyst in the original reaction data with an inactive reagent or by removing the catalyst altogether. It should be understood that different script codes can also be designed to generate negative samples using other perturbation strategies.

[0100] Based on the positive and negative sample data obtained through the above process, the final training data is obtained.

[0101] The following is an example of a complete process from raw data to generating training data, but the invention is not limited thereto.

[0102] An example of the complete process from raw data to generating training data:

(Raw data)
[
  {"role": "Reactant", "smiles": "ClC=1C=C(C=2N(N1)C(=NC2C(C)C)C)NCC2=CC=C(C=C2)OC", "iupac": None},
  {"role": "Reactant", "smiles": "C(C)OC1=NC=CC=C1B(O)O", "iupac": None},
  {"role": "Product", "smiles": "C(C)OC1=NC=CC=C1C=1C=C(C=2N(N1)C(=NC2C(C)C)C)NCC2=CC=C(C=C2)OC", "iupac": None},
  {"role": "Agent", "smiles": "[CH-]1[CH-][CH-][C-]([CH-]1)P(C2=CC=CC=C2)C3=CC=CC=C3.C1=CC=C(C=C1)P([C-]2C=CC=C2)C3=CC=CC=C3.C(Cl)Cl.Cl[Pd]Cl.[Fe]", "iupac": None},
  {"role": "Agent", "smiles": "[Cs]OC(=O)O[Cs]", "iupac": None},
  {"role": "Agent", "smiles": "O1CCOCC1", "iupac": None},
  {"role": "Agent", "smiles": "O", "iupac": None}
]

(Positive sample data obtained by mapping role labels and standardizing expressions and names)
{
  "Reactant": [
    {"iupac": "2-chloro-N-[(4-methoxyphenyl)methyl]-7-methyl-5-propan-2-ylimidazo[1,5-b]pyridazin-4-amine", "smiles": "ClC=1C=C(C=2N(N1)C(=NC2C(C)C)C)NCC2=CC=C(C=C2)OC"},
    {"iupac": "(2-ethoxy-3-pyridinyl)boronic acid", "smiles": "C(C)OC1=NC=CC=C1B(O)O"}
  ],
  "Product": [
    {"iupac": "2-(2-ethoxy-3-pyridinyl)-N-[(4-methoxyphenyl)methyl]-7-methyl-5-propan-2-ylimidazo[1,5-b]pyridazin-4-amine", "smiles": "C(C)OC1=NC=CC=C1C=1C=C(C=2N(N1)C(=NC2C(C)C)C)NCC2=CC=C(C=C2)OC"}
  ],
  "Catalyst": [
    {"iupac": "Pd(dppf)Cl2", "smiles": "[CH-]1[CH-][CH-][C-]([CH-]1)P(C2=CC=CC=C2)C3=CC=CC=C3.C1=CC=C(C=C1)P([C-]2C=CC=C2)C3=CC=CC=C3.C(Cl)Cl.Cl[Pd]Cl.[Fe]"}
  ],
  "Base": [
    {"iupac": "dicesium;carbonate", "smiles": "[Cs]OC(=O)O[Cs]"}
  ],
  "Solvent": [
    {"iupac": "1,4-dioxane", "smiles": "O1CCOCC1"},
    {"iupac": "oxidane", "smiles": "O"}
  ]
}

(Negative sample data generated using a perturbation strategy)
[
  {
    "Reactant": [
      {"iupac": "2-chloro-N-[(4-methoxyphenyl)methyl]-7-methyl-5-propan-2-ylimidazo[1,5-b]pyridazin-4-amine", "smiles": "ClC=1C=C(C=2N(N1)C(=NC2C(C)C)C)NCC2=CC=C(C=C2)OC"},
      {"iupac": "(2-ethoxy-3-pyridinyl)boronic acid", "smiles": "C(C)OC1=NC=CC=C1B(O)O"}
    ],
    "Product": [
      {"iupac": "2-(2-ethoxy-3-pyridinyl)-N-[(4-methoxyphenyl)methyl]-7-methyl-5-propan-2-ylimidazo[1,5-b]pyridazin-4-amine", "smiles": "C(C)OC1=NC=CC=C1C=1C=C(C=2N(N1)C(=NC2C(C)C)C)NCC2=CC=C(C=C2)OC"}
    ],
    "Catalyst": [
      {"smiles": "[Cu]", "iupac": "Copper", "amount": "", "role": ""}
    ],
    "Base": [
      {"smiles": "[K+].[Cl-]", "iupac": "Potassium chloride", "amount": "", "role": ""}
    ],
    "Solvent": [
      {"iupac": "1,4-dioxane", "smiles": "O1CCOCC1"},
      {"iupac": "oxidane", "smiles": "O"}
    ]
  },
  {
    "Reactant": [
      {"iupac": "2-chloro-N-[(4-methoxyphenyl)methyl]-7-methyl-5-propan-2-ylimidazo[1,5-b]pyridazin-4-amine", "smiles": "ClC=1C=C(C=2N(N1)C(=NC2C(C)C)C)NCC2=CC=C(C=C2)OC"},
      {"iupac": "(2-ethoxy-3-pyridinyl)boronic acid", "smiles": "C(C)OC1=NC=CC=C1B(O)O"}
    ],
    "Product": [
      {"iupac": "2-(2-ethoxy-3-pyridinyl)-N-[(4-methoxyphenyl)methyl]-7-methyl-5-propan-2-ylimidazo[1,5-b]pyridazin-4-amine", "smiles": "C(C)OC1=NC=CC=C1C=1C=C(C=2N(N1)C(=NC2C(C)C)C)NCC2=CC=C(C=C2)OC"}
    ],
    "Catalyst": [
      {"smiles": "Cl[Ti](Cl)(Cl)Cl", "iupac": "Titanium tetrachloride", "amount": "", "role": ""}
    ],
    "Base": [
      {"iupac": "dicesium;carbonate", "smiles": "[Cs]OC(=O)O[Cs]"}
    ],
    "Solvent": [
      {"iupac": "1,4-dioxane", "smiles": "O1CCOCC1"},
      {"iupac": "oxidane", "smiles": "O"}
    ],
    "Acid": [
      {"smiles": "CC(=O)O", "iupac": "Acetic acid", "amount": "", "role": ""}
    ]
  }
]

(Training data obtained from positive and negative sample data)
{
  "original_index": 46660,
  "positive": {
    "Reactant": [
      {"iupac": "2-chloro-N-[(4-methoxyphenyl)methyl]-7-methyl-5-propan-2-ylimidazo[1,5-b]pyridazin-4-amine", "smiles": "ClC=1C=C(C=2N(N1)C(=NC2C(C)C)C)NCC2=CC=C(C=C2)OC"},
      {"iupac": "(2-ethoxy-3-pyridinyl)boronic acid", "smiles": "C(C)OC1=NC=CC=C1B(O)O"}
    ],
    "Product": [
      {"iupac": "2-(2-ethoxy-3-pyridinyl)-N-[(4-methoxyphenyl)methyl]-7-methyl-5-propan-2-ylimidazo[1,5-b]pyridazin-4-amine", "smiles": "C(C)OC1=NC=CC=C1C=1C=C(C=2N(N1)C(=NC2C(C)C)C)NCC2=CC=C(C=C2)OC"}
    ],
    "Catalyst": [
      {"iupac": "Pd(dppf)Cl2", "smiles": "[CH-]1[CH-][CH-][C-]([CH-]1)P(C2=CC=CC=C2)C3=CC=CC=C3.C1=CC=C(C=C1)P([C-]2C=CC=C2)C3=CC=CC=C3.C(Cl)Cl.Cl[Pd]Cl.[Fe]"}
    ],
    "Base": [
      {"iupac": "dicesium;carbonate", "smiles": "[Cs]OC(=O)O[Cs]"}
    ],
    "Solvent": [
      {"iupac": "1,4-dioxane", "smiles": "O1CCOCC1"},
      {"iupac": "oxidane", "smiles": "O"}
    ]
  },
  "negative": [
    {
      "Reactant": [
        {"iupac": "2-chloro-N-[(4-methoxyphenyl)methyl]-7-methyl-5-propan-2-ylimidazo[1,5-b]pyridazin-4-amine", "smiles": "ClC=1C=C(C=2N(N1)C(=NC2C(C)C)C)NCC2=CC=C(C=C2)OC"},
        {"iupac": "(2-ethoxy-3-pyridinyl)boronic acid", "smiles": "C(C)OC1=NC=CC=C1B(O)O"}
      ],
      "Product": [
        {"iupac": "2-(2-ethoxy-3-pyridinyl)-N-[(4-methoxyphenyl)methyl]-7-methyl-5-propan-2-ylimidazo[1,5-b]pyridazin-4-amine", "smiles": "C(C)OC1=NC=CC=C1C=1C=C(C=2N(N1)C(=NC2C(C)C)C)NCC2=CC=C(C=C2)OC"}
      ],
      "Catalyst": [
        {"smiles": "[Cu]", "iupac": "Copper", "amount": "", "role": ""}
      ],
      "Base": [
        {"smiles": "[K+].[Cl-]", "iupac": "Potassium chloride", "amount": "", "role": ""}
      ],
      "Solvent": [
        {"iupac": "1,4-dioxane", "smiles": "O1CCOCC1"},
        {"iupac": "oxidane", "smiles": "O"}
      ]
    },
    {
      "Reactant": [
        {"iupac": "2-chloro-N-[(4-methoxyphenyl)methyl]-7-methyl-5-propan-2-ylimidazo[1,5-b]pyridazin-4-amine", "smiles": "ClC=1C=C(C=2N(N1)C(=NC2C(C)C)C)NCC2=CC=C(C=C2)OC"},
        {"iupac": "(2-ethoxy-3-pyridinyl)boronic acid", "smiles": "C(C)OC1=NC=CC=C1B(O)O"}
      ],
      "Product": [
        {"iupac": "2-(2-ethoxy-3-pyridinyl)-N-[(4-methoxyphenyl)methyl]-7-methyl-5-propan-2-ylimidazo[1,5-b]pyridazin-4-amine", "smiles": "C(C)OC1=NC=CC=C1C=1C=C(C=2N(N1)C(=NC2C(C)C)C)NCC2=CC=C(C=C2)OC"}
      ],
      "Catalyst": [
        {"smiles": "Cl[Ti](Cl)(Cl)Cl", "iupac": "Titanium tetrachloride", "amount": "", "role": ""}
      ],
      "Base": [
        {"iupac": "dicesium;carbonate", "smiles": "[Cs]OC(=O)O[Cs]"}
      ],
      "Solvent": [
        {"iupac": "1,4-dioxane", "smiles": "O1CCOCC1"},
        {"iupac": "oxidane", "smiles": "O"}
      ],
      "Acid": [
        {"smiles": "CC(=O)O", "iupac": "Acetic acid", "amount": "", "role": ""}
      ]
    }
  ],
  "yield": 93.0556
}

The specific implementation details of the method of Figure 2 are described in detail below with reference to Figures 3-5.

[0103] Figure 3 is a flowchart of a first process associated with a method for synthesizing a training dataset according to some embodiments of the present invention. The first process may be a specific implementation of steps S2 and S3 of the method of Figure 2, but the scope of the present invention is not limited thereto.

[0104] In one embodiment, the first process may include step S21: determining the confidence level of each functional role category among a plurality of predefined functional role categories to which the role label for each reagent is mapped. In another embodiment, the first process may include step S22: determining the functional role category with the highest confidence level as the mapped functional role category.
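Steps S21 and S22 amount to selecting the highest-confidence category. The sketch below assumes the confidences are already available as a mapping from category name to score; how those scores are obtained from the large language model is not specified in the text, and the function name `pick_role` is hypothetical.

```python
def pick_role(confidences):
    """Return the functional role category with the highest confidence (S22)."""
    if not confidences:
        raise ValueError("no candidate categories were scored")
    # max() over the keys, ranked by their confidence value
    return max(confidences, key=confidences.get)


# Illustrative scores for cesium carbonate (values are made up for the example).
scores = {"base": 0.91, "additive": 0.06, "catalyst": 0.03}
chosen = pick_role(scores)
```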

[0105] In one embodiment, the first process may include step S31: for reagents lacking names in the original reaction data, querying a public chemical database using the standardized expression of the reagent. In another embodiment, the first process may include step S32: determining the standardized name of the reagent based on the query results.

[0106] In an embodiment, the first process may include step S321: in response to a query result indicating the existence of a name matching the reagent, the matching name is determined as the standardized name of the reagent. Step S321 may be a specific implementation of step S32, but the scope of the invention is not limited thereto.

[0107] Figure 4 is a flowchart of a second process associated with a method for synthesizing a training dataset according to some embodiments of the present invention. This second process may be a specific implementation of step S5 of the method of Figure 2, but the scope of the present invention is not limited thereto.

[0108] In an embodiment, the second process may include step S51: for each selected perturbation strategy, removing or replacing a reagent of a corresponding functional role category to obtain a negative sample data for the chemical reaction.

[0109] In one embodiment, the second process may include step S52: determining the number of perturbation strategies to be selected based on a preset ratio of the number of positive sample data to the number of negative sample data. In another embodiment, the second process may include step S53: selecting the determined number of perturbation strategies and generating a corresponding number of negative sample data.
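One possible reading of steps S52 and S53 is that the preset positive-to-negative ratio fixes a total number of negatives, which is then spread over the positive samples; each positive sample's share is the number of perturbation strategies to select for it. The sketch below implements that reading only; the function name `plan_negative_counts` and the even-spreading rule are assumptions, not details from the original text.

```python
def plan_negative_counts(n_positive, pos_to_neg, max_strategies):
    """Decide how many perturbation strategies to select per positive sample.

    pos_to_neg is the preset ratio as a (positive, negative) pair,
    e.g. (1, 2) requests two negatives for every positive.
    """
    if n_positive <= 0:
        raise ValueError("at least one positive sample is required")
    pos, neg = pos_to_neg
    target = round(n_positive * neg / pos)   # total negatives to generate
    base, extra = divmod(target, n_positive)
    # Spread the target evenly; the first `extra` samples receive one more.
    counts = [base + (1 if i < extra else 0) for i in range(n_positive)]
    if max(counts) > max_strategies:
        raise ValueError("ratio requires more strategies than are available")
    return counts
```

With seven available strategies, a 1:2 ratio over four positives gives two strategies per positive, while a 4:1 ratio gives a single negative for the whole group of four.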

[0110] Figure 5 is a flowchart of a third process associated with a method for synthesizing a training dataset according to some embodiments of the present invention. This third process may be a specific implementation of step S6 of the method of Figure 2, but the scope of the present invention is not limited thereto.

[0111] In one embodiment, the third process may include step S61: generating training data based on a preset ratio of positive sample data and negative sample data. In another embodiment, the third process may include step S62: generating training data based on a reaction performance index of the positive sample data, negative sample data, and the original reaction data. As an example, the reaction performance index may be yield.
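As an illustration of step S62, a training record can bundle one positive sample, its negative samples, and the reaction performance index taken from the raw data. The field names below follow the training-data example given earlier in this description; the function name `build_training_record` is an assumption.

```python
import json


def build_training_record(original_index, positive, negatives, yield_percent=None):
    """Assemble one training record from a positive sample and its negatives."""
    record = {
        "original_index": original_index,
        "positive": positive,
        "negative": list(negatives),
    }
    # The performance index (here, yield) is attached only when it is known.
    if yield_percent is not None:
        record["yield"] = yield_percent
    return record


record = build_training_record(
    46660,
    {"Base": [{"iupac": "dicesium;carbonate", "smiles": "[Cs]OC(=O)O[Cs]"}]},
    [{"Base": []}],
    yield_percent=93.0556,
)
line = json.dumps(record)  # one record per line, e.g. for a JSON Lines file
```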

[0112] Compared with existing technologies, the advantage of this invention lies in the significantly improved accuracy of reagent role labeling. Existing technologies, when dealing with fuzzy or implicit reagent labels, rely solely on surface semantic cues, typically achieving an accuracy rate below 90%. This invention introduces a predefined set of candidate reagent categories as constraints, combining expert rules and a large language model to accurately resolve fuzzy reagent labels into fine-grained functional roles, achieving an accuracy rate of over 97%.

[0113] Compared with existing technologies, the advantage of this invention lies in the high degree of chemical rationality in generating negative samples. Existing technologies often generate "weak negative samples" that lack physical meaning by randomly adding, deleting, or replacing reagents. This invention, based on fine-grained reagent role labeling, can generate high-quality counterfactual negative samples constrained by mechanisms at low cost and on a large scale by intervening in key inactivation nodes in chemical reactions.

[0114] Compared with existing technologies, the advantages of this invention further lie in its broad generalization ability and scalability. The chemical reaction training data synthesis framework provided by this invention has been validated in multiple chemical systems with completely different mechanisms, demonstrating strong cross-type transfer capabilities and wide applicability.

[0115] According to another aspect of the present invention, a computing device is provided.

[0116] Figure 6 is a block diagram of a computing device 600 according to some embodiments of the present invention.

[0117] The computing device 600 includes a processor 610 and a memory 620. The memory 620 stores instructions 625, which, when executed by the processor 610, implement the steps of the methods or processes described above in conjunction with Figures 2-5.

[0118] According to another aspect of the present invention, a computer-readable storage medium is provided.

[0119] Figure 7 is a block diagram of a computer-readable storage medium 700 according to some embodiments of the present invention.

[0120] The computer-readable storage medium 700 stores a computer program 750. When executed by a processor, the computer program 750 implements the steps of the methods or processes described above in conjunction with Figures 2-5.

[0121] According to another aspect of the present invention, a computer program product is provided.

[0122] Figure 8 is a block diagram of a computer program product 800 according to some embodiments of the present invention.

[0123] The computer program product 800 may include the computer program 750. The computer program 750, when executed by a processor, implements the steps of the methods or processes described above in conjunction with Figures 2-5.

[0124] This invention quantitatively verifies the crucial role of mechanism-guided negative samples and fine-grained labels in enhancing downstream models' understanding of chemical reaction mechanisms by designing a series of downstream evaluation tasks. It mainly includes two downstream tasks: a reaction feasibility assessment task and a variable space expansion generation task.

[0125] The core logic and validation perspectives of these two tasks differ significantly: the former is a discriminative test, focusing on evaluating whether the model can effectively establish a decision boundary that distinguishes between positive and negative reaction spaces after introducing negative samples, thereby verifying the calibration ability of negative samples for the model to judge the feasibility of chemical reactions; the latter is a generative test, focusing on evaluating whether the model, guided by fine-grained labels, can perform chemically logical deductions and condition supplementation based on a single known condition in an unknown variable space, thereby verifying the value of fine-grained labels in improving the extrapolation and exploration performance of the model in complex chemical tasks.

[0126] Table 1 below shows the experimental results obtained by performing a discrimination task using the LLaMa-3.2-1B model with training data generated by embodiments of the present invention.

[0127] Table 1. Results of LLaMa-3.2-1B performing the discrimination task

[0128] Table 2 below shows the experimental results obtained by performing a discrimination task using training data generated by embodiments of the present invention with a domain-specific model.

[0129] Table 2. Results of domain-specific models performing discrimination tasks

[0130] As shown in Table 1, in the reaction feasibility judgment task, with the increase of the proportion of negative samples, the prediction accuracy of the general language model "LLaMa-3.2-1B" lacking domain corpus jumped from less than 40% to over 90%. Table 2 shows that for the domain-specific pre-trained model, introducing only a very small proportion (25%) of negative samples is sufficient to eliminate class collapse and improve its accuracy to over 96%. This indicates that the training data generated by this invention can effectively correct the statistical bias of historical pre-training corpora, helping the model construct a robust decision boundary that distinguishes between positive and negative reaction spaces.

[0131] Figure 9 shows experimental results for the downstream variable space expansion generation task performed using training data generated according to an embodiment of the present invention.

[0132] In variable space expansion generation tasks, generation strategies based on uniform coarse-grained labels often degenerate into random text continuation. However, guided by the fine-grained functional labeling system of this invention, the model can accurately identify structural defects at chemical sites in the reaction system and autonomously infer and match the required chemical components. Furthermore, fine-grained labels enable the model to overcome its dependence on initial anchor points, mapping a more uniform and comprehensive effective expansion within the reaction space.

[0133] Comparing (a) and (b) in Figure 9, it can be seen that the model trained using the training data generated according to embodiments of the present invention can accurately identify structural defects at chemical sites in the reaction system and autonomously infer and match the required chemical components. Comparing (c) and (d) in Figure 9, it can be seen that fine-grained labels enable the model to overcome its dependence on the initial anchor point and to map out an effective expansion with a more uniform distribution and more comprehensive coverage of the reaction space.

[0134] Therefore, it can be seen that using the training data synthesized in the embodiments of the present invention to train the model to perform chemical reaction-related tasks significantly improves the model's performance.

Claims

1. A method of generating training data for a chemical reaction, characterized in that the method includes the following steps:
S1: Obtain raw reaction data, which is associated with multiple reagents of a chemical reaction in a target chemical reaction category, the raw reaction data including role labels, expressions, and names for each reagent;
S2: Map, by a large language model, the role label for each reagent to one of a plurality of predefined functional role categories, wherein the plurality of predefined functional role categories are predetermined for the target chemical reaction category;
S3: Standardize the expression and name of each reagent by a first script tool;
S4: Based on the functional role category mapped to each reagent, the standardized expression, and the standardized name, obtain positive sample data for the chemical reaction;
S5: Using a second script tool, select one or more perturbation strategies from a preset set of perturbation strategies to remove and / or replace reagents of the corresponding functional role category in the positive sample data, to obtain negative sample data for the chemical reaction, wherein the preset set of perturbation strategies is predetermined based on prior chemical knowledge for the target chemical reaction category, and each perturbation strategy is associated with causing the chemical reaction to fail; and
S6: Generate the training data based on the positive sample data and the negative sample data.

2. The method according to claim 1, characterized in that, S2 includes: S21: Determine the confidence level at which the role label for each reagent is mapped to each of the plurality of predefined functional role categories; and S22: Identify the functional role category with the highest confidence level as the mapped functional role category.

3. The method according to claim 1, characterized in that, S3 includes: S31: For reagents lacking names in the original reaction data, search for them in public chemical databases using their standardized expressions; and S32: Determine the standardized name for the reagent based on the query results.

4. The method according to claim 3, characterized in that, S32 includes: S321: In response to the query result indicating the existence of a name that matches the reagent, the matching name is determined as the standardized name for the reagent.

5. The method according to claim 1, characterized in that, S5 includes: S51: For each selected perturbation strategy, remove or replace a reagent of a corresponding functional role category to obtain a negative sample data for the chemical reaction.

6. The method according to claim 5, characterized in that, The S5 also includes: S52: Determine the number of perturbation strategies to be selected based on a preset ratio of the number of positive sample data to the number of negative sample data; and S53: Select the determined number of perturbation strategies and generate the corresponding number of negative sample data.

7. The method according to claim 6, characterized in that, S6 includes: S61: Generate the training data based on the preset ratio of positive sample data and negative sample data.

8. The method according to claim 1, characterized in that, S6 includes: S62: Generate the training data based on the reaction performance index of the positive sample data, the negative sample data, and the original reaction data.

9. The method according to claim 1, characterized in that the predefined functional role categories include one or more of the following: ligand, catalyst, base, acid, solvent, substrate, electrophile, nucleophile, halogenating agent, oxidant, reductant, additive, atmosphere, functionalizing agent, and boronate reagent.

10. The method according to claim 1, characterized in that, The first scripting tool includes RDKit, the expression includes SMILES expressions, and the naming includes IUPAC naming.

11. The method according to claim 1, characterized in that, The method further includes: S7: In response to determining that the reagents in the original reaction data do not match any of the multiple predefined functional role categories, the original reaction data is removed.

12. The method according to claim 1, characterized in that the large language model is implemented as a zero-shot classifier that has not been fine-tuned on samples pairing the names of chemical reaction reagents with their functional roles.

13. The method according to claim 1, characterized in that the prompt input to the large language model includes one or more of the following: input format, output format, reagent functional role category mapping rules, and hard rules associated with the target chemical reaction category.

14. A computing device, comprising: a processor; and a memory having instructions stored thereon, which, when executed by the processor, implement the steps of the method as described in any one of claims 1-13.

15. A computer readable storage medium having stored thereon a computer program, characterized in that, When the computer program is executed by a processor, it implements the steps of the method as described in any one of claims 1-13.

16. A computer program product comprising a computer program, characterized in that, When the computer program is executed by a processor, it implements the steps of the method as described in any one of claims 1-13.