A method, system, and medium for generating a reaction sequence for an organic synthesis pathway
By automatically constructing multi-cascade reaction sequences of organic synthesis pathways using a multi-agent architecture, the problem of relying on human experience for protection/deprotection steps in existing technologies is solved, realizing the automation and refinement of chemical synthesis pathways and reducing human intervention.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- EAST CHINA UNIV OF SCI & TECH
- Filing Date
- 2026-03-24
- Publication Date
- 2026-06-19
AI Technical Summary
In existing organic chemical synthesis and computer-aided synthetic route design systems, the protection/deprotection steps rely on manual experience for insertion, lacking a unified state management and coordination mechanism. This results in an inability to flexibly support the automatic recommendation of specified sites and protecting groups, high human-computer interaction costs, and reliance on manual processing for route correction and result summarization.
It adopts a multi-agent architecture, including agents for task planning, data preprocessing, route correction, and result summarization. Through atomic mapping and protection site prediction models, it automatically constructs multi-cascade reaction sequences, supports automatic recommendation of specified sites and protecting groups, and generates structured synthesis route reports.
It improves the automation and refinement of protection strategies in chemical synthesis pathways, reduces reliance on human experience and manual editing, and enables automated insertion of protection/deprotection steps and efficient pathway generation.
Figure CN122245477A_ABST
Abstract
Description
Technical Field
[0001] This application relates primarily to the field of chemical synthesis technology, and in particular to a method, system, and medium for generating reaction sequences of an organic synthesis pathway.
Background Technology
[0002] With the development of organic synthetic chemistry and medicinal chemistry, computer-aided synthesis planning (CASP) has gradually become an important tool in drug discovery and fine chemical engineering. Retrosynthetic analysis, which works backward from the target molecule to obtainable starting materials, is one of the core steps in automated synthetic route design. In the synthesis of complex molecules, protection and/or deprotection strategies for functional groups such as hydroxyl, amino, and carboxyl groups play a crucial role in controlling reaction selectivity, improving functional group tolerance, and ensuring that reactions at key sites proceed smoothly.
[0003] However, existing organic chemical synthesis and CASP systems mostly focus on the main reaction transformation itself, and the protection and/or deprotection steps are often inserted manually based on experience, or supplemented after the fact with simple rules. As a result, these steps are difficult to couple tightly with the synthesis search process, and the impact of protection strategies is hard to consider comprehensively at the path level.
[0004] On the other hand, subtasks such as atom mapping, single-step or multi-step retrosynthesis, functional group identification, and protecting group recommendation are usually completed by different software or models, lacking a unified state management and coordination mechanism. This results in: the inability to accurately specify sites based on atomic-level information; difficulty in flexibly supporting multiple working modes such as "specified site / automatic site selection" and "specified protecting group / automatic recommendation of protecting group"; high human-computer interaction costs; and heavy reliance on manual processing for path correction and result summarization.
Summary of the Invention
[0005] One objective of this application is to provide a method, system, and medium for generating reaction sequences in organic synthesis pathways, addressing the problems in existing computer-aided organic chemical synthesis pathway design where protection / deprotection steps rely on manual experience for insertion, atomic-level site control is insufficient, and multiple models and tools are used in a fragmented manner.
[0006] According to one aspect of this application, a method for generating the reaction sequence of an organic synthesis pathway is provided, the method comprising: generating a task plan containing the execution order based on the obtained user request, wherein the user request includes an organic synthesis task requirement; performing atom mapping preprocessing on the input molecules or reaction formulas according to the task plan to obtain atom mapping information; selecting the target reaction step of the original synthesis route according to the task plan, and constructing a multi-cascade reaction sequence based on the atom mapping information and the specified sites; replacing the target reaction step with the multi-cascade reaction sequence to generate a new synthesis route; and generating a structured synthesis route report based on the new synthesis route and the protection strategy.
[0007] Optionally, the user request includes candidate synthesis paths, and generating a task plan containing the execution order based on the obtained user request includes: identifying the intent of the organic synthesis task requirement; and decomposing subtasks according to the target information detected as present or missing in the candidate synthesis paths, and generating a task plan with an execution order.
[0008] Optionally, decomposing subtasks according to the target information detected as present or missing in the candidate synthesis path and generating a task plan with an execution order includes: when the candidate synthesis path is detected to lack atom mapping information, determining that a data preprocessing subtask needs to be executed first; when both an atom mapping requirement and a target reaction subsequence requirement are detected, generating a task plan that executes the data preprocessing subtask first and then the synthesis path correction subtask, wherein the target reaction subsequence includes a protection sequence and/or a deprotection sequence; and when only a protection sequence requirement or a deprotection sequence requirement is detected, generating a task plan that directly triggers the synthesis path correction subtask.
[0009] Optionally, the atom mapping information includes atom numbers, and performing atom mapping preprocessing on the input molecule or reaction formula according to the task plan to obtain the atom mapping information includes: invoking an atom mapping tool on the molecules or reaction formulas input in the task plan to predict the atomic correspondence between reactants and products and obtain atom numbers; writing the atom numbers into the reaction SMILES to form a reaction structure with atom numbers; and writing the atom numbers and the reaction structure into the shared path context.
[0010] Optionally, constructing a multi-cascade reaction sequence based on the atom mapping information and specified sites includes: extracting the main reaction with atom mapping information corresponding to the target reaction step; and constructing a multi-cascade reaction sequence based on the main reaction with atom mapping information, wherein the multi-cascade reaction sequence includes a protection reaction, a protected main reaction, and a deprotection reaction.
[0011] Optionally, constructing a multi-cascade reaction sequence based on the main reaction with atom mapping information includes: obtaining the protection sites and protecting groups according to the working mode; and, based on the main reaction with atom mapping information and the obtained protection sites and protecting groups, reconstructing the multi-cascade reaction sequence at the reaction SMILES level.
[0012] Optionally, the working mode includes at least one of the following: a working mode that specifies both the site and the protecting group, in which the user-specified site and the specified protecting group are resolved based on the atom numbers provided by the shared path context; a working mode that specifies the site and predicts the protecting group, in which the protecting group is provided automatically by calling the protecting group prediction model on the user-specified site; and a working mode that predicts both the site and the protecting group, in which a joint prediction model is called on products with atom mapping information to provide the combination of protection site and protecting group at the same time.
[0013] Optionally, replacing the target reaction step with the multi-cascade reaction sequence to generate a new synthetic route includes: treating the protection sequence and the deprotection sequence in the multi-cascade reaction sequence as a single reaction subsequence; inserting the reaction subsequence at the target step position in place of the original main reaction; and checking the consistency of the preceding and following steps and renumbering them to obtain a new synthetic route containing the protection sequence and the deprotection sequence.
[0014] According to another aspect of this application, a system for generating reaction sequences of organic synthesis pathways is also provided, the system comprising: a task planning agent, a data preprocessing agent, a route correction agent, and a result summarization agent; The task planning agent is used to generate a task plan containing the execution order based on the obtained user request, wherein the user request includes an organic synthesis task requirement; The data preprocessing agent is used to perform atom mapping preprocessing on the input molecules or reaction formulas according to the task plan, to obtain atom mapping information; The route correction agent is used to select the target reaction step of the original synthesis path according to the task plan, and to construct a multi-cascade reaction sequence based on the atomic mapping information and the specified site. The route correction agent is used to replace the target reaction step with the multi-cascade reaction sequence to generate a new synthetic route; The result-summarizing agent is used to generate a structured synthesis path report based on the new synthesis path and protection strategy.
[0015] According to another aspect of this application, a computer-readable storage medium is also provided, having stored thereon computer-readable instructions that can be executed by a processor to implement the method described above.
[0016] Compared with existing technologies, this application generates a task plan containing the execution sequence based on a user request, wherein the user request includes organic synthesis task requirements; performs atom mapping preprocessing on the input molecules or reaction formulas according to the task plan to obtain atom mapping information; selects the target reaction steps of the original synthetic route according to the task plan, and constructs a multi-cascade reaction sequence based on the atom mapping information and specified sites; replaces the target reaction steps with the multi-cascade reaction sequence to generate a new synthetic route; and generates a structured synthetic route report based on the new synthetic route and protection strategy. This improves the automation and refinement of protection strategy design in chemical synthesis routes, and reduces reliance on human experience and manual editing.
Attached Figure Description
[0017] To make the above-mentioned objectives, features and advantages of this application more apparent and understandable, the specific embodiments of this application will be described in detail below with reference to the accompanying drawings, wherein:
Figure 1 shows a schematic diagram of the structure of a system for generating a reaction sequence of an organic synthesis pathway according to one aspect of this application;
Figure 2 shows a schematic flow chart of a method for generating the reaction sequence of an organic synthesis pathway according to one aspect of this application;
Figure 3 shows a schematic diagram of a protection site prediction model in one embodiment of this application;
Figure 4 shows a schematic diagram of a protecting group prediction model in one embodiment of this application;
Figure 5 shows a schematic diagram of a multi-agent collaborative processing flow in one embodiment of this application;
Figure 6 shows a schematic diagram of the overall process in one embodiment of this application;
Figure 7 shows the construction process of the protection / deprotection reaction sequence in one embodiment of this application.
[0018] The same or similar reference numerals in the accompanying drawings represent the same or similar parts.
Detailed Implementation
[0019] To make the above-mentioned objectives, features and advantages of this application more apparent and understandable, the specific embodiments of this application will be described in detail below with reference to the accompanying drawings.
[0020] Many specific details are set forth in the following description in order to provide a full understanding of this application. However, this application may also be implemented in other ways different from those described herein, and therefore this application is not limited to the specific embodiments disclosed below.
[0021] As used in this application and the claims, unless the context clearly indicates otherwise, the words "a," "an," and/or "the" do not specifically refer to the singular and may include the plural. Generally speaking, the terms "comprising" and "including" only indicate the inclusion of explicitly identified steps and elements, which do not constitute an exclusive list, and the method or apparatus may also include other steps or elements.
[0022] This application provides a method and system for generating reaction sequences in organic synthesis pathways. Under a unified multi-agent architecture, the method decouples task planning, atom mapping preprocessing, protection reaction sequence construction, and result aggregation into multiple cooperating agents. Using the atom numbering information provided by the reaction atom mapping results, combined with a protection site prediction model and a protecting group prediction model, it automatically constructs and inserts multi-stage protection / deprotection reactions (e.g., protection, main reaction, deprotection) at the target reaction step. It supports both user-specified sites and protecting groups and automatic prediction of sites and protecting groups, thereby improving the automation and precision of protection strategy design in chemical synthesis pathways and reducing reliance on human experience and manual editing. The specific scheme is as follows: as shown in Figure 1, the generation system includes a task planning agent 10, a data preprocessing agent 20, a route correction agent 30, and a result summarization agent 40.
[0023] The task planning agent 10 generates a task plan containing the execution order based on the acquired user request, wherein the user request includes organic synthesis task requirements; the data preprocessing agent 20 performs atom mapping preprocessing on the input molecules or reaction formulas according to the task plan to obtain atom mapping information; the route correction agent 30 selects the target reaction step of the original synthesis path according to the task plan, and constructs a multi-cascade reaction sequence based on the atom mapping information and specified sites; the route correction agent 30 replaces the target reaction step with the multi-cascade reaction sequence to generate a new synthesis path; the result summarizing agent 40 generates a structured synthesis path report based on the new synthesis path and protection strategy.
[0024] A multi-agent-based task planning process is established. The task planning agent 10 receives a natural language description of a user request, which includes the target molecule, candidate synthesis paths, and organic synthesis task requirements. These organic synthesis task requirements include protection / deprotection design requirements. The input natural language is semantically parsed, and based on keyword and intent recognition rules and task routing strategies, the task is broken down into a data preprocessing subtask, a synthesis path correction subtask, and a result summarization subtask, generating a task plan that includes the execution order.
[0025] According to the task plan, the data preprocessing agent 20 performs atom mapping preprocessing on the input molecules or reaction formulas to obtain atom mapping information. This atom mapping information can be information with atom numbers, that is, the molecular structure or reaction structure with atom numbers is obtained after preprocessing, which can then be used for subsequent protection site specification and prediction.
[0026] Based on the original synthesis path, the route correction agent 30 selects the target reaction step according to the task plan and constructs a multi-cascade reaction sequence based on the main reaction with atom mapping. The multi-cascade reaction sequence is preferably a protection / deprotection reaction sequence, including a protection reaction, a protected main reaction, and a deprotection reaction.
[0027] The route correction agent 30 replaces the step at the target position of the original synthetic path with the multi-cascade reaction sequence, performs consistency checks and renumbering, and obtains a new synthetic path containing the protection / deprotection sequence.
[0028] The result summarizing agent 40 summarizes the intermediate results and final synthesis path output by the task planning agent 10, the data preprocessing agent 20, and the route correction agent 30, and generates and outputs a structured synthesis path report.
[0029] This application achieves unified orchestration of different models and tools through division of labor and cooperation among the task planning agent, data preprocessing agent, route correction agent, and result summarization agent. It decouples task understanding, atom mapping, protection sequence construction, and result display into independently upgradable modules and uses a shared path context for state transfer. This makes it more scalable and more automated than existing single-model or script-based processes.
[0030] Figure 2 shows a flow chart of a method for generating a reaction sequence of an organic synthesis pathway according to one aspect of this application, the method comprising steps S11 to S15.
[0031] Step S11: Generate a task plan containing the execution order based on the obtained user request, wherein the user request includes organic synthesis task requirements.
[0032] The system receives user requests, which include the target molecule, candidate synthetic routes, and organic synthesis task requirements input by the user. These organic synthesis task requirements include protection / deprotection design requirements. The system parses the user requests and generates a task plan, which includes data preprocessing, protection reaction sequence construction, and result summarization.
[0033] In one embodiment of this application, in step S11, the intention of the organic synthesis task requirement is identified, the sub-tasks are decomposed according to the target information that exists or is missing in the detected candidate synthesis path, and a task plan with execution order is generated.
[0034] The system receives user input including the target molecule, existing synthetic pathways (which can be multi-step reaction sequences), and natural language descriptions related to protection / deprotection. Through keyword and intent recognition, it distinguishes between atom mapping requirements and protection / deprotection requirements. By determining whether the user's requirements include subtasks such as atom mapping and protection / deprotection design, it constructs the corresponding multi-agent invocation sequence. The invocation sequence is the execution sequence derived from the logical order of the task. For example, if a specific site (i.e., a reaction center number) is to be specified, the prerequisite is to "execute atom mapping first" so that the reaction center number becomes available. Specifically, this is implemented through the task logic specified in the prompts of the task planning agent, as follows: when a lack of atom mapping information is detected in the candidate synthesis path, it is determined that a data preprocessing subtask must be executed first. When both an atom mapping requirement and a target reaction sequence requirement are detected, the generated task plan executes the data preprocessing subtask first and then the synthesis path correction subtask. The target reaction sequence includes a protection sequence and/or a deprotection sequence. When only a protection sequence requirement or a deprotection sequence requirement is detected, the generated task plan directly triggers the synthesis path correction subtask.
[0035] It should be noted that protection sequences are added to the different functional groups at the beginning of or during a multi-step synthesis, in an order determined by reactivity, steric hindrance, and subsequent reaction conditions. Deprotection sequences are carried out at the later stages of the synthesis or after key intermediates are completed, in an order determined by protecting group stability and orthogonality.
[0036] When a missing atom mapping marker is detected in the candidate synthesis path or when the user indicates that numbering / mapping is required, it is determined that a data preprocessing subtask must be executed first. When both atom mapping and protection / deprotection requirements are detected simultaneously, the generated task plan satisfies the order of executing the data preprocessing subtask first, followed by the synthesis path correction subtask. When only a protection / deprotection requirement is detected, the generated task plan directly triggers the synthesis path correction subtask. The task plan includes the corresponding parameters only when the user explicitly provides the protection site or protection group; otherwise, an automatic prediction strategy is used.
[0037] For example, when a user requests "to perform atomic mapping on the path and add protection / deprotection measures at step K," the sequence would be: data preprocessing (atomic mapping) → path correction (construction and replacement of multi-cascade reaction sequences) → result aggregation and output. When both atomic mapping and protection / deprotection requirements are detected simultaneously, the task flow is prioritized as "data preprocessing first, then path correction and protection design." When only protection / deprotection requirements are detected, the task is directly assigned to the path correction agent, which internally determines whether to trigger the site prediction model based on whether atomic mapping results already exist.
[0038] For example, if a user wants to verify or inspect the structure of each reaction step in the path before entering the protection / deprotection design phase, they can directly request: "Perform atom mapping on [path xxx]". In this scenario only the atom mapping requirement is detected, so the system triggers only the data preprocessing subtask, whose output includes SMILES with atom mapping and a structural visualization image; it does not proceed to route correction or multi-cascade reaction construction.
[0039] After inspecting the result, the user can make a further request to carry out the protection / deprotection design.
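The ordering rules of step S11 can be sketched in Python as follows. This is a minimal illustration and not the application's implementation: the class `ParsedRequest`, the function `build_plan`, and the subtask names are hypothetical, and intent recognition is assumed to have already produced the boolean flags. Appending the summarization subtask only when route correction runs is one reading of the mapping-only scenario described above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ParsedRequest:
    """Hypothetical result of keyword/intent recognition on the user request."""
    wants_mapping: bool                     # user asked for numbering / atom mapping
    wants_protection: bool                  # user asked for protection / deprotection
    route_has_mapping: bool                 # candidate route already carries map labels
    site: Optional[int] = None              # atom number, only if explicitly given
    protecting_group: Optional[str] = None  # e.g. "Boc", only if explicitly given


def build_plan(req: ParsedRequest) -> List[Dict]:
    """Return the ordered subtasks implied by the rules in the text."""
    plan: List[Dict] = []
    # Missing map labels or an explicit mapping request: preprocess first.
    if req.wants_mapping or not req.route_has_mapping:
        plan.append({"subtask": "data_preprocessing"})
    if req.wants_protection:
        # Parameters are included only when the user supplied them;
        # otherwise the route correction agent falls back to prediction.
        params = {}
        if req.site is not None:
            params["site"] = req.site
        if req.protecting_group is not None:
            params["protecting_group"] = req.protecting_group
        plan.append({"subtask": "route_correction", "params": params})
        plan.append({"subtask": "result_summarization"})
    return plan


full_plan = build_plan(ParsedRequest(True, True, False))
direct_plan = build_plan(ParsedRequest(False, True, True, site=5))
mapping_only = build_plan(ParsedRequest(True, False, False))
print([t["subtask"] for t in full_plan])
```

Keeping the plan as an ordered list of plain dictionaries makes the execution order explicit and leaves the choice between user parameters and automatic prediction to the downstream agent.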
[0040] Step S12: Perform atom mapping preprocessing on the input molecules or reaction formulas according to the task plan to obtain atom mapping information.
[0041] The atom mapping information includes atom numbers. According to the task plan, the input molecules or reactions are preprocessed with atom mapping. A reaction atom mapping model based on graph neural networks can be called to perform batch atom mapping on the reaction list in the form of "reactants >> products" to obtain molecular structures or reaction structures with atom numbers. The results are written into the shared path context to provide atomic-level labeling information for subsequent protection site specification and prediction.
[0042] In one embodiment of this application, an atom mapping tool is invoked on the molecules or reaction formulas input in the task plan to predict the atomic correspondence between reactants and products and obtain atom numbers; the atom numbers are written into the reaction SMILES to form a reaction structure with atom numbers; the atom numbers and the reaction structure are written into a shared path context and serve as the atom numbering basis for subsequent site specification and site prediction.
[0043] For example, for a list of reactions or multi-step reactions given in the form "AB>>C", an atom mapping tool (such as LocalMapper) is invoked to predict the atomic correspondences between reactants and products. The resulting mappings are then written into the reaction SMILES, forming reaction structures with atom numbers. LocalMapper is an open-source tool that uses a graph neural network to learn the atomic correspondence probabilities between the molecular graphs of products and reactants. It converts the probability distribution output by the model into a one-to-one atom mapping. After template validation and optimization, it assigns unique and valid mapping numbers to the atoms in the reaction SMILES, and finally outputs the reaction formula with atom mappings.
[0044] The mapping results are written into the shared path context using a unified data structure, recording the original reaction formula, the mapped reaction formula, and the step number for each step, so that the path correction agent can directly retrieve and reuse them.
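One possible shape for such a shared path context is sketched below in Python. The names `StepRecord` and `PathContext` are hypothetical, and the example reaction (acetylation of methanol) with its map numbers is written by hand for illustration rather than produced by an atom mapping tool. Map numbers are read with a regular expression over bracket atoms, which is a simplification compared with parsing the SMILES with a cheminformatics toolkit.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Set

# Atom-map label inside a bracket atom, e.g. the "1" in [CH3:1].
MAP_LABEL = re.compile(r":(\d+)\]")


@dataclass
class StepRecord:
    step: int          # step number within the route
    original_rxn: str  # reaction SMILES as supplied, "reactants>>products"
    mapped_rxn: str    # same reaction with atom-map numbers written in

    def map_numbers(self, side: str) -> Set[int]:
        """Collect the atom-map numbers on the 'reactants' or 'products' side."""
        reactants, _, products = self.mapped_rxn.partition(">>")
        text = reactants if side == "reactants" else products
        return {int(n) for n in MAP_LABEL.findall(text)}


@dataclass
class PathContext:
    """Shared state that the agents read from and write to."""
    steps: Dict[int, StepRecord] = field(default_factory=dict)

    def write(self, record: StepRecord) -> None:
        self.steps[record.step] = record

    def mapped_reaction(self, step: int) -> str:
        return self.steps[step].mapped_rxn


record = StepRecord(
    step=1,
    original_rxn="CO.CC(=O)Cl>>COC(C)=O",
    mapped_rxn="[CH3:1][OH:2].[CH3:3][C:4](=[O:5])Cl"
               ">>[CH3:1][O:2][C:4]([CH3:3])=[O:5]",
)
ctx = PathContext()
ctx.write(record)
print(sorted(record.map_numbers("products")))
```

Storing the original and mapped reaction side by side lets a later agent look up a site by atom number without calling the mapping tool again.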
[0045] Step S13: Select the target reaction step of the original synthesis route according to the task plan, and construct a multi-cascade reaction sequence based on the atom mapping information and the specified sites.
[0046] The main reaction with atom numbers is extracted from the target reaction step based on the atom mapping information. A multi-cascade reaction sequence is constructed based on the extracted main reaction and the specified sites. The specified sites include sites specified by the user for protection or deprotection and protection sites obtained through prediction. A site is the atom or position on a functional group that actually participates in the reaction; it determines where the protecting group is attached, on which atom the reaction occurs, and where the protecting group is cleaved during deprotection. The protection site is also the deprotection site.
[0047] In one embodiment of this application, in step S13, the master reaction with atom mapping information corresponding to the target reaction step is extracted; a multi-cascade reaction sequence is constructed based on the master reaction with atom mapping information, wherein the multi-cascade reaction sequence includes a protection reaction, a protected master reaction, and a deprotection reaction.
[0048] Based on the existing synthetic pathway, the target reaction step is selected according to the task plan, and the main reaction with atom mapping is extracted. The protection / deprotection reaction sequence is then constructed from this main reaction. Construction follows the forward synthesis order; for example, a three-step sequence is fixed as the triple "protection → protected main reaction → deprotection". The continuity of material connections in the multi-step synthetic pathway must be preserved: the input to the triple remains the substrate S of the original step, and the output is restored to the product P of the original step (only the reversible protected states S_prot / P_prot are introduced in between), which eliminates the need to recalculate the full path.
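The continuity requirement can be illustrated with a short Python sketch that builds the triple S → S_prot → P_prot → P, splices it in place of the original step, and renumbers the route. The names `Step`, `build_triple`, and `replace_step` are hypothetical, reagents are omitted from the reaction strings, and the example molecules (N-Boc protection of ethanolamine around an oxidation to glycine) are illustrative only. Intermediates are compared as plain strings, so a real implementation would need to canonicalize the SMILES first.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    number: int
    rxn: str           # "substrate>>product", reagents omitted for brevity
    role: str = "main"


def sides(rxn: str) -> Tuple[str, str]:
    left, _, right = rxn.partition(">>")
    return left, right


def build_triple(s: str, p: str, s_prot: str, p_prot: str) -> List[Step]:
    """Protection, protected main reaction, deprotection, in forward order."""
    return [
        Step(0, f"{s}>>{s_prot}", "protection"),
        Step(0, f"{s_prot}>>{p_prot}", "protected main"),
        Step(0, f"{p_prot}>>{p}", "deprotection"),
    ]


def replace_step(route: List[Step], target: int, triple: List[Step]) -> List[Step]:
    idx = next(i for i, st in enumerate(route) if st.number == target)
    old_in, old_out = sides(route[idx].rxn)
    # The triple must consume S and give back P of the replaced step.
    if sides(triple[0].rxn)[0] != old_in or sides(triple[-1].rxn)[1] != old_out:
        raise ValueError("triple must start at S and end at P of the replaced step")
    # Each intermediate must feed the next reaction of the triple.
    for a, b in zip(triple, triple[1:]):
        if sides(a.rxn)[1] != sides(b.rxn)[0]:
            raise ValueError("intermediates inside the triple do not connect")
    merged = route[:idx] + triple + route[idx + 1:]
    return [replace(st, number=n) for n, st in enumerate(merged, start=1)]


route = [Step(1, "NCCO>>NCC(=O)O"), Step(2, "NCC(=O)O>>NCC(=O)OC")]
triple = build_triple("NCCO", "NCC(=O)O",
                      "CC(C)(C)OC(=O)NCCO", "CC(C)(C)OC(=O)NCC(=O)O")
new_route = replace_step(route, target=1, triple=triple)
for st in new_route:
    print(st.number, st.role, st.rxn)
```

Because the triple begins at S and ends at P, the steps before and after the replaced position are left untouched apart from their step numbers.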
[0049] The specific approach is as follows: based on the working mode, the protection sites and protecting groups are obtained; based on the main reaction with atomic mapping information and the obtained protection sites and protecting groups, the multi-cascade reaction sequence is reconstructed at the SMILES reaction level.
[0050] The working mode includes at least one of the following: a working mode that specifies both the site and the protecting group, in which the user-specified site and the specified protecting group are resolved based on the atom numbers provided by the shared path context; a working mode that specifies the site and predicts the protecting group, in which the protecting group is provided automatically by calling the protecting group prediction model on the user-specified site; and a working mode that predicts both the site and the protecting group, in which a joint prediction model is called on products with atom mapping information to provide the combination of protection site and protecting group at the same time.
[0051] The method thus supports multiple modes: "specify the site and specify the protecting group", "specify the site and predict the protecting group", and "predict the site and predict the protecting group". It gives experienced users fine control over specific protection sites and protecting groups, and it also lets the model provide recommended solutions automatically when the site and protecting group are unknown. This is more flexible than existing methods that rely solely on fixed rules or on fully manual settings.
[0052] The route correction agent reads the original synthesis path from the shared path context, determines the target reaction step according to the user-specified step number or preset rules, and extracts the corresponding atom-mapped main reaction (in the form "reactants >> products") as input for protection / deprotection design. In the mode that specifies both the site and the protecting group, the atom numbers obtained through mapping preprocessing are used to resolve the user-specified site into a specific atom number, and the user-specified protecting group is resolved into a SMILES form that can participate in the reaction construction. This yields the set of site parameters and protecting group parameters used to construct the triple protection sequence.
[0053] The protecting group resolution process uses "name standardization + protecting group dictionary mapping" to convert chemical names into SMILES. Specifically, it includes: standardizing the protecting group name entered by the user, including case normalization, cleanup of spaces and hyphens, and unification of synonyms; and performing protecting group dictionary mapping, that is, looking up the corresponding entry in the protecting group dictionary and mapping the standardized protecting group identifier to the corresponding SMILES fragment. For example, the SMILES corresponding to the protecting group Boc is "[C:1](=O)OC(C)(C)C", where :1 marks the attachment position of the group.
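A minimal Python sketch of this two-stage lookup is given below. Only the Boc entry comes from the text; the function names `standardize` and `resolve` and the small synonym table are assumptions added for illustration, and a real dictionary would hold many more groups.

```python
import re

# Dictionary entry taken from the text; further groups would be added here.
PG_SMILES = {"boc": "[C:1](=O)OC(C)(C)C"}

# Hypothetical synonym table: cleaned-up name -> dictionary key.
SYNONYMS = {
    "tbutoxycarbonyl": "boc",
    "tertbutoxycarbonyl": "boc",
    "tertbutyloxycarbonyl": "boc",
}


def standardize(name: str) -> str:
    """Case normalization, space/hyphen cleanup, then synonym unification."""
    key = name.casefold()
    key = re.sub(r"[\s\-_]+", "", key)
    return SYNONYMS.get(key, key)


def resolve(name: str) -> str:
    """Map a user-supplied protecting group name to its SMILES fragment."""
    key = standardize(name)
    if key not in PG_SMILES:
        raise KeyError(f"unknown protecting group: {name!r}")
    return PG_SMILES[key]


print(resolve("BOC"))
print(resolve(" t-Butoxycarbonyl "))
```

Raising an error for unknown names, rather than guessing, keeps an unrecognized group from silently entering the reaction construction.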
[0054] In the mode that specifies the site and predicts the protecting group, the protecting group prediction model is first invoked on the user-specified site to obtain candidate protecting groups. The protecting group with the highest score is selected, forming the set of site parameters and protecting group parameters. In the mode that predicts both the site and the protecting group, a joint prediction model is invoked on the product structure with atom mapping to predict candidate protection sites and their matching protecting groups at the same time, and the combination of site parameters and protecting group parameters with the highest score is selected. The prediction models include a reaction site prediction model and a protecting group prediction model.
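The selection logic across the three working modes can be sketched in Python as below. The function `select_parameters` and the two predictor callables are hypothetical stand-ins: the stub predictors return made-up candidates and scores purely to exercise the control flow and do not represent output of the models described in this application.

```python
from typing import Callable, List, Optional, Tuple

GroupPredictor = Callable[[str, int], List[Tuple[str, float]]]
JointPredictor = Callable[[str], List[Tuple[int, str, float]]]


def select_parameters(
    mapped_product: str,
    site: Optional[int],
    group: Optional[str],
    predict_groups: GroupPredictor,
    predict_joint: JointPredictor,
) -> Tuple[int, str]:
    """Return (site atom number, protecting group) for the active working mode."""
    if site is not None and group is not None:
        return site, group                      # both specified by the user
    if site is not None:
        candidates = predict_groups(mapped_product, site)
        best_group, _ = max(candidates, key=lambda c: c[1])
        return site, best_group                 # site specified, group predicted
    if group is not None:
        raise ValueError("a group without a site is not one of the three modes")
    candidates = predict_joint(mapped_product)
    best_site, best_group, _ = max(candidates, key=lambda c: c[2])
    return best_site, best_group                # both predicted


# Stub predictors with invented scores, for illustration only.
def fake_groups(mol: str, site: int) -> List[Tuple[str, float]]:
    return [("Boc", 0.7), ("Cbz", 0.2)]


def fake_joint(mol: str) -> List[Tuple[int, str, float]]:
    return [(1, "Boc", 0.6), (4, "TBS", 0.3)]


product = "[NH2:1][CH2:2][CH2:3][OH:4]"
print(select_parameters(product, 1, "Cbz", fake_groups, fake_joint))
print(select_parameters(product, 1, None, fake_groups, fake_joint))
print(select_parameters(product, None, None, fake_groups, fake_joint))
```

Passing the predictors in as callables keeps the mode selection independent of any particular model, so either model can be swapped without touching this logic.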
[0055] In one embodiment of this application, as shown in Figure 3, the reaction site prediction model uses a message passing neural network (MPNN) as the graph encoder. Let an intermediate molecule to be protected / deprotected be represented by a molecular graph $G = (V, E)$, where $V$ is the set of atoms and $E$ is the set of chemical bonds. Each atom $v \in V$ has a feature vector $x_v$ (including atom type, degree, whether it is aromatic, whether it is in a ring, formal charge, etc.), and each bond $e \in E$ has a feature vector $x_e$ (including bond type, whether it is an aromatic bond, whether it is in a ring, etc.).
[0056] First, the molecular graph is encoded using the MPNN to obtain the representation $h_v$ of each atom: $h_v = \mathrm{MPNN}\big(x_v, \{x_e\}_{e \in \mathcal{N}(v)}\big)$, where $\mathcal{N}(v)$ is the set of edges adjacent to atom $v$.
[0057] Next, a feedforward neural network is used to calculate the score $s_v$ of each atom as a reaction site (i.e., the atom that undergoes protection / deprotection): $s_v = \mathbf{w}^{\top} \sigma(\mathbf{W} h_v + \mathbf{b})$, where $\mathbf{W}$, $\mathbf{b}$, and $\mathbf{w}$ are learnable parameters and $\sigma$ is the activation function.
[0058] The scores of all atoms are normalized using Softmax to obtain the site probability distribution across all atoms: $p_v = \exp(s_v) / \sum_{u \in V} \exp(s_u)$. In the inference stage, the atom with the highest probability is selected as the single protection site, which serves as the input for the subsequent protecting group prediction model and triple reaction construction.
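The site prediction pipeline (graph encoding, per-atom scoring, Softmax, arg-max) can be illustrated with a small pure-Python sketch. Everything here is an assumption for illustration: the neighbour-averaging update stands in for a trained message passing network, ReLU stands in for the unspecified activation function, and the toy features and weights are hand-picked rather than learned.

```python
import math


def message_pass(feats, adj, rounds=2):
    """Toy message passing: each atom averages its own and its neighbours' features."""
    h = [list(f) for f in feats]
    for _ in range(rounds):
        new_h = []
        for v, nbrs in enumerate(adj):
            group = [h[v]] + [h[u] for u in nbrs]
            new_h.append([sum(col) / len(group) for col in zip(*group)])
        h = new_h
    return h


def site_scores(h, W, b, w):
    """Feedforward head: score = w . relu(W h_v + b) for every atom v."""
    scores = []
    for hv in h:
        hidden = [max(0.0, sum(wij * x for wij, x in zip(row, hv)) + bi)
                  for row, bi in zip(W, b)]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    return scores


def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def predict_site(feats, adj, W, b, w):
    """Return (index of the most probable site, probability of every atom)."""
    probs = softmax(site_scores(message_pass(feats, adj), W, b, w))
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Toy 3-atom chain 0-1-2 with 2-dimensional atom features (hypothetical values).
feats = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
adj = [[1], [0, 2], [1]]
W, b, w = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [1.0, 0.0]
site, probs = predict_site(feats, adj, W, b, w)
print(site, [round(p, 3) for p in probs])
```

Selecting a single arg-max atom mirrors the inference rule in the text; a Top-k variant would simply sort the probabilities instead.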
[0059] Figure 4 shows the protecting group prediction model. Training data, comprising protection and deprotection reactions, was extracted from the USPTO-50K dataset. During preprocessing of the training set, protecting group subgraphs were automatically extracted by taking the structural difference between products and reactants, and duplicates were removed to obtain a finite protecting group vocabulary. For a given target molecule $M$ and candidate reaction site $v^{*}$, the representations $h_v$ of all atoms are first aggregated (averaging over all atomic features of the molecule) to yield the molecular-level feature $h_M$. The candidate site is then disconnected to obtain the synthon, whose aggregated feature is denoted $h_S$. The two are concatenated to obtain the joint feature vector: $z = [h_M \,\|\, h_S]$.
[0060] Subsequently, $z$ is mapped onto the protecting group category space through a multi-layer fully connected network. For the $k$-th protecting group $g_k$ in the vocabulary, the unnormalized score can be expressed as $o_k = \mathbf{w}_k^{\top} \phi(\mathbf{W}_g z + \mathbf{b}_g)$, where $\mathbf{W}_g$, $\mathbf{b}_g$, and $\mathbf{w}_k$ are learnable parameters and $\phi$ is a non-linear activation function.
[0061] Finally, the probability distribution over the protecting group vocabulary is obtained through the Softmax function: $P(g_k \mid M, v^{*}) = \exp(o_k) / \sum_{j=1}^{K} \exp(o_j)$, where $K$ is the vocabulary size. During the inference stage, the protecting group with the highest probability is selected as the synthesis fragment, which serves as the input for the subsequent construction of the triple reaction.
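A minimal sketch of the protecting group classifier follows: mean pooling of the molecule and synthon atom representations, concatenation, one hidden layer, and Softmax over the vocabulary. The vocabulary labels other than Boc, the ReLU activation, and all numeric values are assumptions chosen only to make the example runnable.

```python
import math


def mean_pool(vectors):
    """Aggregate atom-level vectors into one vector by averaging each dimension."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def group_probabilities(atom_h, synthon_h, W, b, class_w):
    """Softmax over the vocabulary from the joint molecule/synthon feature."""
    z = mean_pool(atom_h) + mean_pool(synthon_h)  # list concatenation
    hidden = [max(0.0, sum(wij * zj for wij, zj in zip(row, z)) + bi)
              for row, bi in zip(W, b)]
    logits = [sum(wi * hi for wi, hi in zip(wk, hidden)) for wk in class_w]
    m = max(logits)
    exps = [math.exp(o - m) for o in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 3-entry vocabulary and hand-picked (untrained) parameters.
vocab = ["Boc", "Cbz", "TBS"]
atom_h = [[1.0, 0.0], [0.0, 1.0]]   # representations of all atoms in the molecule
synthon_h = [[1.0, 0.0]]            # representations of the synthon atoms
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [0.0] * 4
class_w = [[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
probs = group_probabilities(atom_h, synthon_h, W, b, class_w)
print(vocab[max(range(len(vocab)), key=probs.__getitem__)])
```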
[0062] In a specific embodiment of this application, the construction of a triple protection sequence is used as an example for illustration. The route correction agent constructs the triple reaction at the SMILES reaction level based on the obtained atom-mapped master reaction and the determined set of site parameters and protecting group parameters. Specifically, this includes: Instantiation of the protection reaction: the substrate is extracted from the master reaction and denoted S. The atom number to be protected is located in S according to the site parameters, and the protecting group is attached to that atom to generate the protected intermediate S_prot, giving the protection reaction R_protect: S >> S_prot.
[0063] Concatenating the protected master reaction: With atom mapping consistency as a constraint, the substrate of the master reaction R_main: S >> P is replaced with the protected intermediate S_prot to obtain the substrate side of the protected master reaction. Based on the atom mapping correspondence, the atom on the product side corresponding to the site parameters is determined, and the same protecting group is added to it to obtain P_prot, thereby constructing the protected master reaction R_main_prot: S_prot >> P_prot.
[0064] Deprotection reaction instantiation: The above P_prot is converted back to P to form the deprotection reaction formula R_deprotect: P_prot >> P, so as to ensure that the interface of subsequent reaction steps is consistent with the original synthesis path.
[0065] The final result is a triple protection sequence {R_protect, R_main_prot, R_deprotect}, consisting of the protection reaction, the protected master reaction, and the deprotection reaction.
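The assembly of the three reactions can be sketched independently of the chemistry. In the sketch below, `attach` is a hypothetical callable that would attach the protecting group to the atom carrying the given map number (in practice this needs a cheminformatics toolkit); the stand-in used in the demo only appends a suffix so that the output reproduces the S / S_prot / P_prot / P notation of the text.

```python
from typing import Callable, NamedTuple


class Triple(NamedTuple):
    protect: str         # R_protect:   S >> S_prot
    main_protected: str  # R_main_prot: S_prot >> P_prot
    deprotect: str       # R_deprotect: P_prot >> P


def build_triple(main_rxn: str, site_map: int, pg_smiles: str,
                 attach: Callable[[str, int, str], str]) -> Triple:
    """Build the protection / protected-master / deprotection triple."""
    substrate, product = main_rxn.split(">>")
    s_prot = attach(substrate, site_map, pg_smiles)
    p_prot = attach(product, site_map, pg_smiles)  # same site via the atom map
    return Triple(f"{substrate}>>{s_prot}",
                  f"{s_prot}>>{p_prot}",
                  f"{p_prot}>>{product}")


def fake_attach(mol: str, site_map: int, pg_smiles: str) -> str:
    """Placeholder only: marks the molecule as protected without real chemistry."""
    return f"{mol}_prot"


triple = build_triple("S>>P", 1, "[C:1](=O)OC(C)(C)C", fake_attach)
print(triple)
```

Because the first reaction starts from S and the last one ends at P, the triple can be dropped into the route without changing the interface of the neighbouring steps.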
[0066] Step S14: Replace the target reaction step with the multi-cascade reaction sequence to generate a new synthetic route.
[0067] In one embodiment of this application, in step S14, the protection sequence and the deprotection sequence in the multi-cascade reaction sequence are regarded as one reaction subsequence; the reaction subsequence is inserted at the target step position, replacing the original master reaction, and the preceding and following steps are checked for consistency and renumbered to obtain a new synthetic path containing the protection sequence and the deprotection sequence.
[0068] The route correction agent assembles the generated protection reaction, protected master reaction, and deprotection reaction into a protection / deprotection reaction subsequence in a preset order. The preset order follows the forward synthesis sequence: protection → protected master reaction → deprotection. For example, protection reaction: S >> S_prot; protected master reaction: S_prot >> P_prot; deprotection reaction: P_prot >> P. This ensures that the path interface remains unchanged: the input is still S and the output returns to P, i.e., S >> … >> P.
[0069] The protection / deprotection reaction sequence replaces the single step corresponding to the target step in the original synthetic pathway, while the reaction order and material connections before and after the target step are kept unchanged, forming an updated synthetic pathway containing the protection / deprotection sequence. The updated synthetic pathway is then subjected to consistency checks and step renumbering to ensure that the direction of each reaction and the substrate-product relationships in the pathway are correct.
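The replace, check, and renumber procedure can be sketched with plain list operations. This sketch assumes each step is a "reactants>>products" string and compares molecules by exact string equality; a real system would presumably compare canonicalized structures. The function names and the placeholder molecule labels are hypothetical.

```python
def check_consistency(route):
    """Each step's product must appear among the next step's reactants."""
    for i in range(len(route) - 1):
        products = route[i].split(">>")[1].split(".")
        next_reactants = route[i + 1].split(">>")[0].split(".")
        if not any(p in next_reactants for p in products):
            raise ValueError(f"chain break between step {i + 1} and step {i + 2}")


def replace_step(route, target_index, subsequence):
    """Replace route[target_index] (0-based) by the subsequence and renumber from 1."""
    new_route = route[:target_index] + list(subsequence) + route[target_index + 1:]
    check_consistency(new_route)
    return [(number, rxn) for number, rxn in enumerate(new_route, start=1)]


# Placeholder labels instead of real SMILES, following the S / P notation.
route = ["A.B>>S", "S.R>>P", "P.Q>>T"]
triple = ["S>>S_prot", "S_prot.R>>P_prot", "P_prot>>P"]
for number, rxn in replace_step(route, 1, triple):
    print(number, rxn)
```

The three-step route becomes a five-step route, and the steps before and after the anchor keep their original substrates and products.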
[0070] The method described in this application automatically generates a triple sequence of protection reaction, protected main reaction, and deprotection reaction based on the master reaction with atom mapping, and directly inserts or replaces it at the target step position in the synthesis path, maintaining the consistency of the connection between upstream and downstream materials. This avoids the problems of existing technologies that rely on manual step-by-step insertion of protection / deprotection reactions, are prone to errors, and are difficult to manage uniformly at the whole path level.
[0071] Step S15: Generate a structured synthesis path report based on the new synthesis path and protection strategy.
[0072] The result summarization agent reads, from the shared path context, the intermediate data and the final updated synthesis path generated by the task planning agent, data preprocessing agent, and route correction agent. It then organizes the atom mapping information, protection / deprotection reaction sequences, path structure, and key parameters into structured data in a preset format. This preset format uses a JSON + graphical structure, including, from top to bottom: the working mode of the route correction agent, key parameters (reaction sites, protecting group names and their SMILES), original step reactions (mapped SMILES and their structural images), triple protection sequences (processed SMILES and their structural images), and the SMILES of the final path steps after replacement.
[0073] A synthesis path report containing the updated reaction sequences, a description of the protection / deprotection strategy, and visualized path information is then generated and output for user viewing and subsequent human-machine collaborative optimization.
[0074] The result summarization agent organizes the updated synthetic routes, the protection / deprotection sequences, and their descriptions into a structured report, which can be presented together with visual information such as route diagrams, making it easier for chemists to understand and review quickly. This preserves automation efficiency while leaving room for human review and adjustment of key synthetic strategies.
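One possible shape for the JSON part of such a report is sketched below. The field names are hypothetical (the text only fixes the top-to-bottom order of the sections), and the image fields are left empty because structure rendering is outside this sketch.

```python
import json


def build_report(mode, site, pg_name, pg_smiles, original_step, triple, final_route):
    """Assemble the report sections in the top-to-bottom order given in the text."""
    roles = ("protection", "protected_master", "deprotection")
    report = {
        "working_mode": mode,
        "key_parameters": {
            "reaction_site": site,
            "protecting_group": {"name": pg_name, "smiles": pg_smiles},
        },
        "original_step": {"mapped_smiles": original_step, "image": None},
        "triple_protection_sequence": [
            {"role": role, "smiles": smi, "image": None}
            for role, smi in zip(roles, triple)
        ],
        "final_route": list(final_route),
    }
    return json.dumps(report, indent=2)


text = build_report(
    "predicted site and predicted protecting group",
    1, "Boc", "[C:1](=O)OC(C)(C)C",
    "S>>P",
    ["S>>S_prot", "S_prot>>P_prot", "P_prot>>P"],
    ["A>>S", "S>>S_prot", "S_prot>>P_prot", "P_prot>>P"],
)
print(text)
```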
[0075] In a specific embodiment of this application, Figure 5 is a flowchart of the collaborative processing at the multi-agent level in this application. First, the system receives input from the user terminal regarding the synthesis path and protection / deprotection design requirements, and passes this input to the task planning agent. The task planning agent generates task planning results based on the input and assigns the tasks to the data preprocessing agent and the route correction agent. Based on the task planning results, the data preprocessing agent uses the tool library to call the chemical reaction atom mapping model to number the atoms of the reaction SMILES, and writes the results returned by the API into the shared context. The route correction agent selects the target reaction step based on the task planning results and the atom mapping information in the shared context, and calls the protection / deprotection construction tools in the tool library to generate the protection / deprotection reaction sequence. The result summarization agent reads the results produced by the data preprocessing agent and the route correction agent, summarizes the synthesis path updates and the protection / deprotection reaction sequences, generates a final report, and outputs it to the user.
[0076] As shown in Figure 6, the method described in this application can be applied as an automated method for constructing protection / deprotection reaction sequences in a synthetic pathway. It can run on a server or local computer and mainly includes the steps of task planning, data preprocessing, protection / deprotection sequence construction, route updating, and result summarization. Specifically, it includes: Step S1: Receive user requests and generate task plans. The user's design request for the synthesis path is collected, which includes at least: existing candidate synthesis paths and a natural language description of the protection / deprotection design requirements. The task planning agent parses this input: on the one hand, it formats structured data such as SMILES and step numbers; on the other hand, it performs keyword recognition and semantic understanding on the natural language description to determine whether it contains intent information such as "atom mapping / numbering," "protection / deprotection," "specified site," "specified protecting group," and "automatically recommended protecting groups."
[0077] Based on the analysis results, the task planning agent constructs the task plan for this request, determines whether subtasks such as data preprocessing, protection / deprotection reaction sequence construction, and result summarization need to be performed, and gives the execution order of the subtasks. When both atom mapping and protection / deprotection requirements are detected, the task order is set to "data preprocessing first, then route correction and protection / deprotection design"; when only protection / deprotection requirements are detected, the task is issued directly to the route correction agent, and the plan records whether subsequent site prediction is allowed.
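The ordering rules above can be sketched as a small planning function. The keyword lists, subtask names, and the rule that result summarization is appended whenever any subtask is planned are assumptions for illustration; the text describes semantic understanding beyond simple substring matching. The `route_has_atom_maps` flag reflects the case in which the candidate path lacks atom mapping information.

```python
MAPPING_KEYWORDS = ("atom mapping", "atom map", "atom numbering")
PROTECTION_KEYWORDS = ("protect",)  # also matches "deprotect" and "protection"


def plan_tasks(request_text: str, route_has_atom_maps: bool) -> list:
    """Return the ordered list of subtasks for a natural language request."""
    text = request_text.lower()
    wants_mapping = any(k in text for k in MAPPING_KEYWORDS)
    wants_protection = any(k in text for k in PROTECTION_KEYWORDS)

    plan = []
    # Preprocessing goes first if requested, or if protection design needs maps.
    if wants_mapping or (wants_protection and not route_has_atom_maps):
        plan.append("data_preprocessing")
    if wants_protection:
        plan.append("route_correction")
    if plan:
        plan.append("result_summarization")
    return plan


print(plan_tasks("Run atom mapping, then protect the amine in step 2", False))
print(plan_tasks("Add a Boc protection step at step 2", True))
```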
[0078] Step S2 involves performing atom mapping on the reaction SMILES and writing it into the context. Based on the task plan generated in Step S1, the data preprocessing agent performs atom mapping preprocessing on the input molecules or reaction formulas. For single-step or multi-step reaction lists given in the form of "reactants >> products", the atom mapping tool is invoked to predict the atomic correspondence between reactants and products. The predicted atom numbers are written into the reaction SMILES to obtain the reaction structure with atom numbers. For reactions already labeled with atom mapping, a validity check and format standardization are performed.
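For reactions that already carry atom maps, one plausible validity check is sketched below. The criteria are assumptions, since the text does not specify them: map numbers must be unique on each side, and every mapped product atom must also appear among the reactants. The check uses a regular expression on the SMILES text rather than a cheminformatics toolkit.

```python
import re

# Matches the map number in bracket atoms such as [CH3:1] or [O:5].
MAP_NUMBER = re.compile(r":(\d+)\]")


def map_numbers(side: str) -> list:
    """Return all atom map numbers on one side of a reaction SMILES."""
    return [int(n) for n in MAP_NUMBER.findall(side)]


def mapping_is_valid(rxn: str) -> bool:
    """Assumed criteria: unique numbers per side, product numbers found in reactants."""
    reactants, products = rxn.split(">>")
    r, p = map_numbers(reactants), map_numbers(products)
    if not p:
        return False  # no mapping present on the product side
    unique = len(set(r)) == len(r) and len(set(p)) == len(p)
    return unique and set(p) <= set(r)


good = "[CH3:1][OH:2].[Cl:3][C:4](=[O:5])[CH3:6]>>[CH3:1][O:2][C:4](=[O:5])[CH3:6]"
bad = "[CH3:1][OH:2]>>[CH3:1][O:7]"
print(mapping_is_valid(good), mapping_is_valid(bad))
```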
[0079] The data preprocessing agent writes the original reaction formulas, reaction formulas with atomic mappings, and step numbers of each step into a shared path context using a unified data structure, providing atomic-level labeling information for subsequent protection site assignment and protection site / protection group prediction.
[0080] Step S3, Target Step Selection and Protection / Deprotection Reaction Sequence Construction: The route correction agent reads the current synthesis path in the shared path context and selects the target reaction step for protection / deprotection design based on the instructions in the task plan and the step number or preset rules given by the user. The corresponding atom-mapped master reaction "Reactants >> Products" is extracted as the input basis for protection / deprotection design. The working mode for this operation is determined according to the task plan or user selection. The working modes include: specified site and specified protecting group mode, specified site and predicted protecting group mode, and predicted site and predicted protecting group mode. After determining the working mode, the route correction agent, based on the atom-mapped master reaction and parameters such as site and protecting group, calls the protection / deprotection construction tool to generate the protection reaction, protected master reaction, and deprotection reaction in the same call flow, forming an ordered protection / deprotection reaction sequence.
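The choice among the three working modes can be sketched as a function of which parameters the user supplied. The enum and function names are hypothetical, and rejecting a protecting group given without a site is an assumption, since that combination is not one of the three modes listed.

```python
from enum import Enum
from typing import Optional


class WorkingMode(Enum):
    SPECIFIED_SITE_SPECIFIED_GROUP = "specified site and specified protecting group"
    SPECIFIED_SITE_PREDICTED_GROUP = "specified site and predicted protecting group"
    PREDICTED_SITE_PREDICTED_GROUP = "predicted site and predicted protecting group"


def select_mode(site: Optional[int], group: Optional[str]) -> WorkingMode:
    """Pick the working mode from the user-supplied site and protecting group."""
    if site is not None and group is not None:
        return WorkingMode.SPECIFIED_SITE_SPECIFIED_GROUP
    if site is not None:
        return WorkingMode.SPECIFIED_SITE_PREDICTED_GROUP
    if group is None:
        return WorkingMode.PREDICTED_SITE_PREDICTED_GROUP
    raise ValueError("a protecting group without a site matches none of the modes")


print(select_mode(7, "Boc").value)
print(select_mode(7, None).value)
print(select_mode(None, None).value)
```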
[0081] Step S4 involves replacing the original target step with a triple reaction. The route correction agent treats the protected / deprotected reaction sequence constructed in step S3 as a sub-sequence, using the position of the original target step in the synthetic pathway as an anchor point. This sub-sequence replaces the original single step while maintaining the order of other steps before and after the target step and the material connections. After the replacement, the route correction agent performs a consistency check on the updated synthetic pathway to confirm that the products and substrates between adjacent steps match correctly, the path direction remains consistent, and there are no abnormalities such as chain breaks or duplications. After passing the check, each step in the pathway is renumbered to obtain an updated synthetic pathway containing the protected / deprotected reaction sequence.
[0082] Step S5, Result Summarization and Report Generation: The result summarization agent summarizes the intermediate results and the final updated synthesis path output by the task planning agent, data preprocessing agent, and route correction agent. It reads the original path, the atom-mapped reaction structures, the inserted protection / deprotection reaction sequences, and related data.
[0083] The result summarization agent organizes the above information into structured data according to the preset report template. This includes at least the reaction list of the updated multi-step synthetic pathway, and the protection / deprotection reaction sequences inserted or replaced in the pathway together with their corresponding protection sites and protecting groups; visualization results such as pathway diagrams can also be generated. Finally, the results are output as a report so that chemists can review and revise the automatically generated protection strategies.
[0084] In a specific embodiment of this application, the publicly available USPTO-50K reaction dataset was selected as the basic data source. From this dataset, 12,852 reaction samples related to functional group protection and deprotection were automatically selected according to reaction type labels and reaction SMARTS rules. All samples contain complete atomic mapping information. All samples were randomly divided into training, validation, and test sets in an 8:1:1 ratio.
[0085] In the training set, structural differencing was performed on all protection / deprotection reactions to extract protecting groups. A protecting group vocabulary was then constructed, resulting in 51 distinct protecting group fragments. This vocabulary covers 91.07% of the actual protecting groups present in the entire dataset. During model evaluation, the Top-k accuracy of the protection site prediction model and the protecting group prediction model was calculated on the test set. Under Top-k accuracy, a prediction is considered correct if the true label is among the top k candidates given by the model. The accuracies of the two models at Top-1, Top-2, Top-3, and Top-5 are shown in Table 1.
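The Top-k accuracy metric defined above can be computed as in the following sketch. The ranked predictions and labels are toy values for illustration only and are unrelated to the results reported in this application.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of samples whose true label is among the first k ranked candidates."""
    hits = sum(1 for ranked, truth in zip(ranked_predictions, true_labels)
               if truth in ranked[:k])
    return hits / len(true_labels)


# Toy example: three samples, candidates sorted from most to least probable.
ranked = [["Boc", "Cbz", "Bn"], ["Cbz", "Boc", "Bn"], ["Bn", "Cbz", "Boc"]]
truth = ["Boc", "Boc", "Boc"]
for k in (1, 2, 3):
    print(k, top_k_accuracy(ranked, truth, k))
```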
[0086] Table 1
[0087] As shown in Table 1, in the test set, the protection site prediction model of this application achieves an accuracy of approximately 83.8% on the Top-1 metric and further improves to approximately 98.4% on the Top-5 metric, indicating that the model can accurately locate reaction sites requiring protection / deprotection. The protecting group prediction model achieves an accuracy of approximately 66.0% on the Top-1 metric and approximately 96.9% on the Top-5 metric, indicating that, given a protection site, the model can provide high-confidence protecting group recommendations within a small candidate space. In summary, the two-stage prediction model proposed in this application demonstrates good predictive performance in the joint prediction task of protection sites and protecting groups, providing reliable parameter support for the subsequent automatic construction of protection / deprotection reaction sequences.
[0088] As shown in Figure 7, a forward synthetic route containing a primary amine functional group is taken as an example to illustrate how the method of this application is applied in an actual protection / deprotection design.
[0089] (1): Original synthetic route and problem description. The user inputs the current synthetic route and the requirement for protection / deprotection design of the molecule in the system interface. The current synthetic route is as follows:
Step 1: CCCCc1nc2c(N)nc3ccccc3c2[nH]1.OCCCBr>>CCCCc1nc2c(N)nc3ccccc3c2n1CCCO
Step 2: CCCCc1nc2c(N)nc3ccccc3c2n1CCCO.COC(=O)Cc1ccc(CN)cc1>>CCCCc1nc2c(N)nc3ccccc3c2n1CCCNCc1ccc(CC(=O)OC)cc1
Step 3: CCCCc1nc2c(N)nc3ccccc3c2n1CCCNCc1ccc(CC(=O)OC)cc1.CS(=O)(=O)CC(=O)O>>CCCCc1nc2c(N)nc3ccccc3c2n1CCCN(Cc1ccc(CC(=O)OC)cc1)C(=O)CS(C)(=O)=O
[0090] (2): Task planning and atom mapping preprocessing. The task planning agent parses the above input and plans the task as follows: 1) a data preprocessing subtask that performs atom mapping for the entire path; 2) a route correction subtask that constructs the protection / deprotection sequence (the user did not specify the reaction site or protecting group, so the prediction models are needed); 3) a result summarization subtask that summarizes the final results and generates the report.
[0091] The data preprocessing agent invokes the reaction atom mapping model to perform atom mapping on the reactants and products of each step in the original path, generating reaction SMILES with atom numbers, and writes the mapping results, including all steps, into the shared path context. At this point, the primary amine nitrogen atom in step 2 is assigned a unique atom number, serving as the basis for subsequent site prediction and protection design.
[0092] (3): Prediction of the protection site and protecting group. The route correction agent selects step 2 as the target reaction step according to the task plan and invokes the "predicted site and predicted protecting group" mode.
[0093] First, the product molecule SMILES from step 2, with its atomic mapping, is input into the protected site prediction model. The model represents the molecule as a graph structure, and after encoding through a multi-layer message passing network, outputs the site probability distribution for each atom. In this embodiment, the primary amine nitrogen atom has the highest prediction probability and is selected as the reaction site requiring protection.
[0094] Then, the synthon features of the site are concatenated with the original molecular features and input into the protecting group prediction model. The model performs classification over the pre-constructed vocabulary of 51 protecting groups, and the protecting group with the highest score (Boc) is selected.
[0095] (4): Construction of the protection / deprotection triple reaction sequence: After determining the protection site (primary amine nitrogen atom) and the protecting group (Boc), the route correction agent calls the protection / deprotection construction tool to automatically generate the following triple reaction: Protection reaction: using intermediate M as the substrate, a Boc protecting group is introduced onto the nitrogen atom at the site to generate the Boc-protected intermediate M–NHBoc. Protected master reaction: using M–NHBoc as the substrate, the Boc protecting group is added at the corresponding position of the product molecule, avoiding side reactions at the primary amine site. Deprotection reaction: the Boc protecting group is removed from the product molecule so that the product matches that of the original step and the downstream sequence is preserved.
[0096] (5) Synthetic route update and report generation: The route correction agent uses the above triple reaction as the reaction subsequence, substitutes it for the second step of the original synthetic route, and performs consistency checks on the substrates and products of the preceding and following steps. The result summarization agent reads the updated multi-step synthetic route, the predicted protection site and protecting group, and the details of the constructed triple reaction, and generates a structured report.
[0097] This embodiment demonstrates that, given the original synthetic path and natural language protection requirements, the method of this application can automatically locate the reaction sites that need to be protected, automatically recommend reasonable protecting groups, and insert the corresponding protection / deprotection triple reaction into the original path, thereby obtaining a synthetic path with better functional group tolerance.
[0098] This application also provides a computer-readable storage medium storing computer-readable instructions that can be executed by a processor to implement a method for generating a reaction sequence of an organic synthesis pathway as described above.
[0099] When a method for generating the reaction sequence of an organic synthesis pathway is implemented as a computer program, it can also be stored as an article of manufacture in a computer-readable storage medium. For example, computer-readable storage media can include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, magnetic strips), optical discs (e.g., compact discs (CDs), digital versatile discs (DVDs)), smart cards, and flash memory devices (e.g., erasable programmable read-only memory (EPROM), cards, sticks, key drives). Furthermore, the various storage media described herein can represent one or more devices and / or other machine-readable media for storing information. The term "machine-readable medium" can include, but is not limited to, wireless channels and various other media (and / or storage media) capable of storing, containing, and / or carrying code and / or instructions and / or data.
[0100] It should be understood that the embodiments described above are merely illustrative. The embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For hardware implementation, the processor may be implemented within one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, and / or other electronic units designed to perform the functions described herein, or combinations thereof.
[0101] Some aspects of this application can be executed entirely by hardware, entirely by software (including firmware, resident software, microcode, etc.), or by a combination of hardware and software. The aforementioned hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." The processor may be one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, or combinations thereof. Furthermore, aspects of this application may take the form of computer products residing in one or more computer-readable media and including computer-readable program code. For example, computer-readable media may include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, magnetic tapes, etc.), optical discs (e.g., compact discs (CDs), digital versatile discs (DVDs), etc.), smart cards, and flash memory devices (e.g., cards, sticks, key drives, etc.).
[0102] A computer-readable medium may contain a propagated data signal containing computer program code, for example, on baseband or as part of a carrier wave. This propagated signal may take various forms, including electromagnetic, optical, and so on, or suitable combinations thereof. A computer-readable medium can be any computer-readable medium other than a computer-readable storage medium, which can be connected to an instruction execution system, apparatus, or device to enable communication, propagation, or transmission of a program for use. The program code located on the computer-readable medium can be propagated through any suitable medium, including radio, cable, fiber optic cable, radio frequency signals, or similar media, or any combination of the above media.
[0103] The basic concepts have been described above. Obviously, for those skilled in the art, the above disclosure is merely illustrative and does not constitute a limitation of this application. Although not explicitly stated herein, those skilled in the art may make various modifications, improvements, and corrections to this application. Such modifications, improvements, and corrections are suggested in this application, and therefore remain within the spirit and scope of the exemplary embodiments of this application.
[0104] Furthermore, this application uses specific terms to describe embodiments of the application. For example, "an embodiment," "one embodiment," and / or "some embodiments" refer to a particular feature, structure, or characteristic related to at least one embodiment of the application. Therefore, it should be emphasized and noted that "an embodiment," "one embodiment," or "an alternative embodiment" mentioned twice or more in different locations in this specification do not necessarily refer to the same embodiment. In addition, certain features, structures, or characteristics in one or more embodiments of the application can be appropriately combined.
[0105] In some embodiments, numbers describing the quantity of components and attributes are used. It should be understood that such numbers used in the description of embodiments are modified in some examples with the terms "about," "approximately," or "generally." Unless otherwise stated, "about," "approximately," or "generally" indicates that the numbers are allowed to vary by ±20%. Accordingly, in some embodiments, the numerical parameters used in the specification and claims are approximate values, which may be changed depending on the characteristics required by individual embodiments. In some embodiments, numerical parameters should take into account the specified significant digits and apply ordinary rounding. Although the numerical ranges and parameters used to establish the breadth of scope in some embodiments of this application are approximate values, in specific embodiments such values are set as precisely as feasible.
Claims
1. A method for generating a reaction sequence of an organic synthesis pathway, characterized in that the method includes: generating a task plan containing an execution order based on an obtained user request, wherein the user request includes an organic synthesis task requirement; performing atom mapping preprocessing on input molecules or reaction formulas according to the task plan to obtain atom mapping information; selecting a target reaction step of an original synthesis route according to the task plan, and constructing a multi-cascade reaction sequence based on the atom mapping information and specified sites; replacing the target reaction step with the multi-cascade reaction sequence to generate a new synthetic route; and generating a structured synthesis path report based on the new synthetic route and the protection strategy.
2. The method according to claim 1, characterized in that the user request includes candidate synthesis paths, and the step of generating a task plan containing the execution order based on the obtained user request includes: identifying the organic synthesis task requirement, decomposing subtasks based on target information detected as present or missing in the candidate synthesis paths, and generating a task plan with an execution order.
3. The method according to claim 2, characterized in that decomposing subtasks based on target information detected as present or missing in the candidate synthesis paths and generating a task plan with an execution order includes: when it is detected that the candidate synthesis path lacks atom mapping information, determining that a data preprocessing subtask needs to be executed first; when both an atom mapping requirement and a target reaction subsequence requirement are detected, generating a task plan in which the data preprocessing subtask is executed first and the synthesis path correction subtask second, wherein the target reaction subsequence includes a protection sequence and / or a deprotection sequence; and when only a protection sequence requirement or a deprotection sequence requirement is detected, generating a task plan that directly triggers the synthesis path correction subtask.
4. The method according to claim 1, characterized in that the atom mapping information includes atom numbers, and performing atom mapping preprocessing on the input molecules or reaction formulas according to the task plan to obtain the atom mapping information includes: invoking an atom mapping tool on the molecules or reaction formulas input in the task plan to predict the atomic correspondence between reactants and products and obtain atom numbers; writing the atom numbers into the reaction SMILES to form a reaction structure with atom numbers; and writing the atom numbers and the reaction structure into the shared path context.
5. The method according to claim 1, characterized in that constructing a multi-cascade reaction sequence based on the atom mapping information and specified sites includes: extracting the master reaction with atom mapping information corresponding to the target reaction step; and constructing the multi-cascade reaction sequence based on the master reaction with atom mapping information, wherein the multi-cascade reaction sequence includes a protection reaction, a protected master reaction, and a deprotection reaction.
6. The method according to claim 5, characterized in that constructing the multi-cascade reaction sequence based on the master reaction with atom mapping information includes: obtaining the protection site and protecting group based on the working mode; and constructing the multi-cascade reaction sequence at the SMILES reaction level based on the master reaction with atom mapping information and the obtained protection site and protecting group.
7. The method according to claim 6, characterized in that the working mode includes at least one of the following: a working mode in which both the site and the protecting group are specified, which includes resolving the user-specified site and the specified protecting group based on the atom numbers provided by the shared path context; a working mode in which the site is specified and the protecting group is predicted, which includes calling the protecting group prediction model on the user-specified site to automatically provide the protecting group; and a working mode in which both the site and the protecting group are predicted, which includes calling the joint prediction model on the product with atom mapping information to provide the protection site and the protecting group together as a combination.
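The three working modes of claim 7 amount to a dispatch over how the protection site and protecting group are obtained. The sketch below is one possible arrangement, not the applicant's implementation: the `Mode` enum, `resolve_protection`, and the two callables standing in for the protecting group prediction model and the joint prediction model are all hypothetical.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class Mode(Enum):
    SPECIFIED_SITE_AND_GROUP = 1        # user gives site and group
    SPECIFIED_SITE_PREDICTED_GROUP = 2  # user gives site, model gives group
    PREDICTED_SITE_AND_GROUP = 3        # model gives both


def resolve_protection(
    mode: Mode,
    mapped_product: str,
    site: Optional[int] = None,
    group: Optional[str] = None,
    predict_group: Optional[Callable[[str, int], str]] = None,
    predict_joint: Optional[Callable[[str], Tuple[int, str]]] = None,
) -> Tuple[int, str]:
    """Return (atom number of the protection site, protecting group)."""
    if mode is Mode.SPECIFIED_SITE_AND_GROUP:
        if site is None or group is None:
            raise ValueError("site and group must both be specified")
        return site, group
    if mode is Mode.SPECIFIED_SITE_PREDICTED_GROUP:
        if site is None or predict_group is None:
            raise ValueError("site and a group prediction model are required")
        return site, predict_group(mapped_product, site)
    if predict_joint is None:
        raise ValueError("a joint prediction model is required")
    return predict_joint(mapped_product)


# Stand-in models for illustration; real models would be trained predictors.
site, pg = resolve_protection(
    Mode.SPECIFIED_SITE_PREDICTED_GROUP,
    "[CH3:1][OH:2]",
    site=2,
    predict_group=lambda product, atom: "TBS",
)
print(site, pg)
```

Passing the models as callables keeps the dispatch independent of any particular prediction library.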
8. The method according to claim 1, characterized in that replacing the target reaction step with the multi-cascade reaction sequence to generate a new synthesis path includes: treating the protection sequence and the deprotection sequence in the multi-cascade reaction sequence as a single reaction subsequence; inserting the reaction subsequence at the position of the target reaction step, replacing the original main reaction; and checking the consistency of the preceding and following steps and renumbering the steps, to obtain a new synthesis path containing the protection sequence and the deprotection sequence.
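The replace, check, and renumber steps of claim 8 can be sketched on a simplified linear route. The sketch assumes each step is a dictionary with `step`, `reactants`, and `product` keys and uses placeholder labels (`A`, `A-PG`, and so on) instead of real SMILES; the consistency check shown (each product must feed the next step) is one plausible reading of the claim, not a definition taken from the application.

```python
from typing import Dict, List


def replace_step(route: List[Dict], target_step: int,
                 subsequence: List[Dict]) -> List[Dict]:
    """Replace one step with a reaction subsequence, then check and renumber."""
    idx = next(i for i, s in enumerate(route) if s["step"] == target_step)
    # Insert the subsequence at the target position, dropping the original step.
    merged = route[:idx] + subsequence + route[idx + 1:]
    # Consistency check (linear-route assumption): each product is consumed
    # by the following step.
    for prev, nxt in zip(merged, merged[1:]):
        if prev["product"] not in nxt["reactants"]:
            raise ValueError(f"{prev['product']} is not used by the next step")
    # Renumber all steps from 1.
    return [{**s, "step": n} for n, s in enumerate(merged, start=1)]


route = [
    {"step": 1, "reactants": ["S"], "product": "A"},
    {"step": 2, "reactants": ["A", "R"], "product": "B"},
    {"step": 3, "reactants": ["B"], "product": "C"},
]
subsequence = [
    {"reactants": ["A", "PG"], "product": "A-PG"},       # protection
    {"reactants": ["A-PG", "R"], "product": "B-PG"},     # protected main reaction
    {"reactants": ["B-PG"], "product": "B"},             # deprotection
]
new_route = replace_step(route, 2, subsequence)
print([(s["step"], s["product"]) for s in new_route])
```

A real route may be convergent rather than linear, in which case the check would need to follow the route's dependency graph instead of adjacent pairs.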
9. A system for generating reaction sequences of an organic synthesis pathway, characterized in that the generation system includes a task planning agent, a data preprocessing agent, a route correction agent, and a result summarization agent; the task planning agent is used to generate a task plan containing an execution order based on an obtained user request, wherein the user request includes an organic synthesis task requirement; the data preprocessing agent is used to perform atom mapping preprocessing on the input molecules or reaction formulas according to the task plan, to obtain atom mapping information; the route correction agent is used to select the target reaction step of the original synthesis path according to the task plan, and to construct a multi-cascade reaction sequence based on the atom mapping information and the specified site; the route correction agent is further used to replace the target reaction step with the multi-cascade reaction sequence to generate a new synthesis path; and the result summarization agent is used to generate a structured synthesis path report based on the new synthesis path and the protection strategy.
10. A computer-readable storage medium storing computer-readable instructions thereon, characterized in that the computer-readable instructions are executable by a processor to implement the method according to any one of claims 1 to 8.