Methods and systems for predicting chemical synthesis pathways

Pre-trained BART models enhance chemical synthesis pathway prediction by addressing scalability and interpretability issues, providing efficient and transparent synthesis planning for complex molecules.

WO2026129042A1 · PCT designated stage · Publication Date: 2026-06-25 · REDWOOD AI INC

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
REDWOOD AI INC
Filing Date
2025-12-17
Publication Date
2026-06-25

AI Technical Summary

Technical Problem

Existing computer-assisted synthesis planning tools struggle with scalability, accuracy, and interpretability, particularly for complex and novel molecules, and lack integration with existing informatics environments, leading to inefficiencies and limited deployment in real-world settings.

Method used

Utilizing pre-trained and post-trained Bidirectional Auto-regressive Transformer (BART) models for generating and selecting chemical synthesis pathways, with forward and reverse synthesis models leveraging diverse reaction data and providing transparent, data-aware outputs.

Benefits of technology

Enhances the prediction of chemical synthesis pathways by improving scalability, accuracy, and transparency, enabling efficient and reliable synthesis planning for complex molecules while maintaining data provenance and interoperability with existing systems.

✦ Generated by Eureka AI based on patent content.


Abstract

Herein is disclosed a method for generating a chemical synthesis pathway for a molecule, the method comprising: receiving a representation of the molecule from a user; generating a plurality of textual chemical synthesis pathways for the molecule with a reverse synthesis bidirectional auto-regressive transformer (BART) model; selecting one or more of the reverse chemical synthesis pathways for the molecule with a forward synthesis BART model; converting the selected one or more reverse chemical synthesis pathways into corresponding graphical representations; and displaying the graphical representations to the user.

Description

METHODS AND SYSTEMS FOR PREDICTING CHEMICAL SYNTHESIS PATHWAYS

Cross-Reference to Related Applications

[0001] This application claims priority from application No. 63/735,265, filed 17 December 2024. For purposes of the United States, this application claims the benefit under 35 U.S.C. §119 of application No. 63/735,265, filed 17 December 2024, and entitled METHOD AND SYSTEM FOR PREDICTING CHEMICAL SYNTHESIS PATHWAYS, which is hereby incorporated herein by reference for all purposes.

Technical Field

[0002] The present disclosure is directed to methods and systems for predicting chemical synthesis pathways. More particularly, the present disclosure is directed to methods and systems for predicting chemical synthesis pathways using artificial intelligence (AI) algorithms.

Background

[0003] Chemical synthesis planning is a foundational activity in chemistry, underpinning the discovery, optimization, and manufacture of pharmaceuticals, agrochemicals, materials, and intermediates. Identifying feasible and efficient synthesis pathways to target molecules requires reasoning across vast chemical spaces, assessing compatibility among reagents and conditions, and balancing constraints related to yield, selectivity, cost, safety, and sustainability. Historically, expert chemists have relied on specialized knowledge, technical literature, and heuristic rules to design synthesis paths. While effective for many problems, these manual approaches struggle to keep pace with the scale, diversity, and complexity of modern chemical discovery and development pipelines. Furthermore, these existing approaches are labor intensive, and therefore the amount of resources required to devise new synthesis pathways scales with the number and complexity of molecules to synthesize.

[0004] Existing computer-assisted synthesis planning tools have advanced the state of the art, but they often exhibit limitations that hinder broad and reliable deployment. Rule-based and template-driven methods can have limited application outside of well-known and common reaction types, leading to limited usefulness when used for more novel molecules or underrepresented chemistries. Data sparsity, inconsistent annotations, and bias in reaction data can further degrade predictive performance, especially for complex, multi-step pathways.

[0005] Scalability and efficiency of present solutions also pose persistent challenges. Searching combinatorial reaction spaces with sufficient efficiency to be commercially useful requires algorithms that can prioritize plausible synthesis steps while avoiding impractical ones. Beyond accuracy and speed, practical utility in real-world settings demands transparency, adaptability, and maintainability. Chemists benefit from systems that can reflect uncertainties, provide rationales grounded in precedent, and accommodate evolving constraints, such as supply chain shifts or regulatory considerations. However, many current approaches offer limited interpretability or calibration, making it difficult for practitioners to trust and refine the suggested pathways. Additionally, interoperability with existing informatics environments and the ability to ingest curated datasets while maintaining data provenance remain important yet inconsistently addressed requirements.

[0006] Accordingly, there is a continued need for improved techniques for predicting chemical synthesis pathways that scale to complex targets under realistic constraints, and provide outputs that are transparent, data-aware, and amenable to expert oversight. Such approaches should leverage diverse reaction information while mitigating dataset bias, uphold reproducibility and provenance, and support integration into the workflows of practicing chemists across research and development contexts.

[0007] There is a general desire for an improved system and method for predicting chemical synthesis pathways.

[0008] The foregoing examples of the related art and limitations related thereto are intended to be illustrative and not exclusive. Other limitations of the related art will become apparent to those of skill in the art upon a reading of the specification and a study of the drawings.

Summary

[0009] Further aspects and example embodiments are illustrated in the accompanying drawings and/or described in the following description.

[0010] One aspect of the invention provides a method for generating a chemical synthesis pathway for a molecule, the method comprising: receiving a representation of the molecule from a user; generating a plurality of textual chemical synthesis pathways for the molecule with a reverse synthesis bidirectional auto-regressive transformer (BART) model; selecting one or more of the reverse chemical synthesis pathways for the molecule with a forward synthesis BART model; converting the selected one or more reverse chemical synthesis pathways into corresponding graphical representations; displaying the graphical representations to the user; wherein both the reverse synthesis BART model and the forward synthesis BART model comprise BART models pretrained with a set of textual data representing a plurality of chemicals; wherein the reverse synthesis BART model comprises a BART model post-trained with a first set of textual data representing a plurality of chemical reactions; and wherein the forward synthesis BART model comprises a BART model post-trained with a second set of textual data representing a plurality of chemical reactions.

[0011] One aspect of the invention provides a computer system for generating a chemical synthesis pathway for a molecule, the system comprising: a user interface for receiving a representation of the molecule from the user; a reverse synthesis bidirectional auto-regressive transformer (BART) model for generating a plurality of chemical synthesis pathways for the molecule; a forward synthesis BART model for selecting one or more of the reverse chemical synthesis pathways for the molecule; a display for displaying the selected reverse chemical synthesis pathways to the user; wherein both the reverse synthesis BART model and the forward synthesis BART model comprise BART models pretrained with a set of textual data representing a plurality of chemicals; wherein the reverse synthesis BART model comprises a BART model post-trained with a first set of textual data representing a plurality of chemical reactions; and wherein the forward synthesis BART model comprises a BART model post-trained with a second set of textual data representing a plurality of chemical reactions.

[0012] In addition to the exemplary aspects and embodiments described above, further aspects and embodiments will become apparent by reference to the drawings and by study of the following detailed descriptions.

Brief Description of the Drawings

[0013] The accompanying drawings illustrate non-limiting example embodiments of the invention.

[0014] Fig. 1 is a schematic view of a method for generating a chemical synthesis pathway for a molecule, according to an example embodiment of the present invention.

[0015] Fig. 2 is a schematic view of a system for generating a chemical synthesis pathway for a molecule, according to an example embodiment of the present invention.

Description

[0016] Throughout the following description, specific details are set forth in order to provide a more thorough understanding of the invention. However, the invention may be practiced without these particulars. In other instances, well known elements have not been shown or described in detail to avoid unnecessarily obscuring the invention. Accordingly, the specification and drawings are to be regarded in an illustrative, rather than a restrictive sense.

[0017] Fig. 1 is a schematic view of method 100 for generating a chemical synthesis pathway for a molecule, according to an example embodiment of the present invention. Method 100 comprises:
  • step 102: receiving a representation of the molecule from a user;
  • step 104: generating a plurality of textual chemical synthesis pathways for the molecule with a reverse synthesis bidirectional auto-regressive transformer (BART) model;
  • step 106: selecting one or more of the reverse chemical synthesis pathways for the molecule with a forward synthesis BART model;
  • step 108: converting the selected one or more reverse chemical synthesis pathways into corresponding graphical representations; and
  • step 110: displaying the graphical representations to the user.
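As a non-limiting illustration, the flow of steps 102 to 110 may be sketched in Python as follows. The function name `plan_synthesis` and its three callable parameters are hypothetical placeholders introduced only for this sketch; they stand in for the reverse synthesis BART model, the forward-model selection step, and the graphical conversion step, none of whose implementations are specified here.

```python
from typing import Callable, List, Sequence


def plan_synthesis(
    molecule: str,
    reverse_model: Callable[[str], Sequence[str]],
    forward_selector: Callable[[str, List[str]], List[str]],
    renderer: Callable[[str], str],
) -> List[str]:
    """Chain steps 104 to 108 for a molecule received in step 102.

    All three callables are assumptions of this sketch, not disclosed APIs.
    """
    # Step 104: the reverse model proposes several textual pathways.
    candidates = list(reverse_model(molecule))
    # Step 106: the forward model is used to select among the candidates.
    selected = forward_selector(molecule, candidates)
    # Step 108: each selected pathway is converted to a graphical form.
    # Step 110 (display) is left to the caller.
    return [renderer(pathway) for pathway in selected]
```

Passing the models in as callables keeps the sketch independent of any particular model library; a caller would supply its own wrappers around trained models.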

[0018] Both the reverse synthesis BART model and the forward synthesis BART model comprise BART models pretrained with a set of textual data representing a plurality of chemicals. The reverse synthesis BART model comprises a BART model post-trained with a first set of textual data representing a plurality of chemical reactions. The forward synthesis BART model comprises a BART model post-trained with a second set of textual data representing a plurality of chemical reactions.

[0019] In some embodiments, the second set of textual data comprises a plurality of reversed chemical reactions from the first set of textual data. For example, if the first set of textual data contains a reaction A + B -> C, then the second set of textual data would comprise the reaction C <- A + B.
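By way of a minimal sketch, the reversal in the example above may be expressed as a string transformation. The helper names `reverse_reaction` and `build_second_set`, and the use of `->` and `<-` as plain-text arrows, are assumptions made for illustration and follow only the notation of the example.

```python
from typing import Iterable, List


def reverse_reaction(reaction: str, arrow: str = "->", reversed_arrow: str = "<-") -> str:
    """Rewrite 'A + B -> C' as 'C <- A + B' (hypothetical helper)."""
    left, found, right = reaction.partition(arrow)
    if not found:
        raise ValueError(f"no '{arrow}' arrow in reaction: {reaction!r}")
    return f"{right.strip()} {reversed_arrow} {left.strip()}"


def build_second_set(first_set: Iterable[str]) -> List[str]:
    """Derive reversed reactions from the first set of textual data."""
    return [reverse_reaction(r) for r in first_set]


print(build_second_set(["A + B -> C"]))
```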

[0020] One or more of: the set of textual data representing a plurality of chemicals, the first set of textual data representing a plurality of chemical reactions, and the second set of textual data representing a plurality of chemical reactions, may comprise data extracted from one or more third-party sources. Said third-party sources may include one or more of the Pistachio™ Dataset provided by NextMove Software™, the Open Reaction Database (ORD), the ZINC15 dataset provided by the Department of Pharmaceutical Chemistry at the University of California, San Francisco (UCSF), and the USPTO50k dataset generated from USPTO patent data.

[0021] In some embodiments, generating the plurality of textual chemical synthesis pathways for the molecule with the reverse synthesis BART model comprises canonicalizing the textual representation of the molecule, and tokenizing the canonicalized textual representation of the molecule.

[0022] Canonicalizing the textual representation of the molecule may comprise rewriting the textual representation of the molecule in accordance with a set of one or more rules for textually representing one or more of: isotopes, charges, aromatic atoms, ring openings, and ring closures. For example, each atom of a molecule may be assigned a canonical priority from a set of atomic canonical priorities, and the atom with the highest canonical priority may be selected as the starting atom of the molecule. After the starting atom is selected, the next atom may be selected as the neighboring atom with the highest canonical priority among the neighboring atoms. The molecule may then be similarly traversed until all atoms in the molecule are selected, and the entire molecule is rewritten in the canonical form.

[0023] The set of rules for canonicalizing the textual representation of the molecule may include one or more of the following rules:
  • ring closures of a molecule are numbered using a consistent numbering scheme, for example starting with the same starting number and incrementing by a consistent number such as 1, 2, 3, to n number of ring closures;
  • bond orderings are consistently expressed in the molecule, for example by omitting single bonds, expressing double bonds using a unique character such as an equals sign (=), expressing triple bonds using a unique character such as a pound sign (#), and expressing aromatic bonds using lowercase atoms such as c, n, etc.;
  • representing aromatic systems consistently, for example by representing aromatic systems with lowercase letters such as c, n, etc.;
  • removing implicit hydrogens unless they are required for charge, isotope, or stereochemistry purposes;
  • stereochemical features are represented consistently, for example by using a single at sign (@) and a double at sign (@@) for tetrahedral chirality, and forward and back slashes for E/Z; and
  • consistently representing charge and isotope features, for example by bracketing and ordering isotopes, elements, hydrogens, and charges within the brackets.

[0024] As an example of the above rules, the molecule ethanol may be written as any of {CCO, OCC, C(O)C}, where the canonical representation of ethanol in accordance with the above rules would be CCO.
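As one illustrative sketch, the ring-closure numbering rule described above may be applied to a SMILES-style string as follows. The function `renumber_ring_closures` is a hypothetical helper, and the sketch makes simplifying assumptions: ring labels are single digits (the two-digit `%nn` form is not handled), at most nine rings are present, and digits inside square brackets are treated as isotope, hydrogen-count, or charge digits and left unchanged. It covers only this one rule and is not a full canonicalizer.

```python
def renumber_ring_closures(smiles: str) -> str:
    """Relabel ring closures as 1, 2, 3, ... in order of first appearance."""
    open_labels = {}   # original digit -> new label, for rings still open
    next_label = 1
    in_bracket = False
    out = []
    for ch in smiles:
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        if ch in "0123456789" and not in_bracket:
            if ch in open_labels:
                # Second occurrence closes the ring: reuse its new label.
                out.append(open_labels.pop(ch))
            else:
                if next_label > 9:
                    raise ValueError("sketch supports at most nine rings")
                open_labels[ch] = str(next_label)
                out.append(open_labels[ch])
                next_label += 1
        else:
            out.append(ch)
    return "".join(out)


print(renumber_ring_closures("c3ccc4ccccc4c3"))
```

Under these assumptions, two inputs that differ only in their choice of ring digits are mapped to the same string, which is the consistency property the rule aims for.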

[0025] In some embodiments, tokenizing the canonicalized textual representation of the molecule comprises identifying an invalid substring within the canonicalized textual representation of the molecule based on a set of invalidation rules, and converting the invalid substring to a valid substring based on the set of invalidation rules. Tokenizing the canonicalized textual representation of the molecule may further comprise greedily matching the canonicalized textual representation of the molecule from left to right with a set of chemically meaningful tokens.

[0026] The set of chemically meaningful tokens may comprise one or more of:
  • a plurality of tokens each representing a unique element;
  • a plurality of tokens each representing a unique molecule;
  • a plurality of tokens each representing a unique bracketed combination of one or both of atoms and ions;
  • a plurality of tokens each representing a unique unbracketed organic atom;
  • a plurality of tokens each representing a unique type of chemical bond;
  • a plurality of tokens each representing a unique type of ring index; and
  • a plurality of tokens each representing a unique isotopic mass number.
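The greedy left-to-right matching described above may be sketched as follows. The small `VOCAB` set and the function `greedy_tokenize` are hypothetical and chosen only for illustration; in particular, trying longer tokens before shorter ones at each position is an assumption of this sketch, made so that a two-letter element such as Cl is not split into C and l.

```python
from typing import Iterable, List

# Illustrative vocabulary only: a few bracketed atoms, unbracketed organic
# atoms, bond symbols, branch symbols, and ring indices.
VOCAB = {
    "[NH4+]", "[O-]", "Cl", "Br", "C", "N", "O", "S",
    "c", "n", "o", "=", "#", "(", ")", "1", "2",
}


def greedy_tokenize(text: str, vocab: Iterable[str]) -> List[str]:
    """Consume the string left to right, taking the longest matching token."""
    ordered = sorted(vocab, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for tok in ordered:
            if text.startswith(tok, i):
                tokens.append(tok)
                i += len(tok)
                break
        else:
            # No vocabulary entry matches at this position.
            raise ValueError(f"no token matches at position {i}: {text[i:]!r}")
    return tokens


print(greedy_tokenize("CC(=O)Cl", VOCAB))
```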

[0027] In some embodiments, selecting the one or more of the reverse chemical synthesis pathways for the molecule with the forward synthesis BART model comprises:
  • determining one or more valid chemical synthesis pathways in the plurality of textual chemical synthesis pathways by validating one or more of the plurality of textual chemical synthesis pathways with the forward synthesis BART model;
  • determining a confidence score for each of the valid chemical synthesis pathways;
  • ranking the valid chemical synthesis pathways by the confidence score of each of the valid chemical synthesis pathways;
  • receiving a number of desired output pathways from the user;
  • selecting a top number of ranked valid chemical synthesis pathways up to the number of desired output pathways; and
  • displaying the top number of ranked valid chemical synthesis pathways to the user.
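The filter, score, rank, and truncate sequence listed above may be sketched as follows. The function `select_top_pathways` and its `is_valid` and `confidence` callables are hypothetical stand-ins; how a confidence score is actually derived from the forward synthesis BART model is not specified here, so the sketch simply accepts any function returning a number.

```python
from typing import Callable, List, Sequence, Tuple


def select_top_pathways(
    candidates: Sequence[str],
    is_valid: Callable[[str], bool],
    confidence: Callable[[str], float],
    n_desired: int,
) -> List[Tuple[str, float]]:
    """Return up to n_desired valid pathways, highest confidence first."""
    # Keep only pathways that pass validation.
    valid = [p for p in candidates if is_valid(p)]
    # Score each valid pathway, then rank in descending order of score.
    scored = [(p, confidence(p)) for p in valid]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Truncate to the number of outputs requested by the user.
    return scored[: max(n_desired, 0)]
```

If fewer valid pathways exist than the user requested, the sketch returns all of them, which matches the "up to the number of desired output pathways" wording.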

[0028] Validating one or more of the plurality of textual chemical synthesis pathways with the forward synthesis BART model may comprise:
  • confirming that the one or more of the plurality of textual chemical synthesis pathways satisfies one or more rules for valid representation of a chemical reaction;
  • confirming that the one or more of the plurality of textual chemical synthesis pathways does not include any synthesis steps wherein one or more reactants are products of a previous synthesis step; and
  • confirming that the one or more of the plurality of textual chemical synthesis pathways successfully generates the molecule with the forward synthesis BART model.
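A minimal sketch of the three confirmations listed above is given below. It assumes, purely for illustration, that a pathway has already been parsed into a list of (reactants, product) steps, that `is_well_formed` encodes the representation rules, and that `forward_model` is a callable wrapping the forward synthesis BART model and returning a predicted product string. The third check is interpreted here as requiring each step's predicted product to match its recorded product and the final product to equal the target; that interpretation is an assumption of the sketch.

```python
from typing import Callable, Sequence, Tuple

Step = Tuple[Tuple[str, ...], str]  # (reactants, product), hypothetical layout


def validate_pathway(
    steps: Sequence[Step],
    target: str,
    forward_model: Callable[[Tuple[str, ...]], str],
    is_well_formed: Callable[[Step], bool],
) -> bool:
    """Apply the three validation checks to one parsed pathway."""
    # Check 1: every step satisfies the representation rules.
    if not steps or not all(is_well_formed(step) for step in steps):
        return False
    # Check 2: no step uses a product of a previous step as a reactant.
    earlier_products = set()
    for reactants, product in steps:
        if any(r in earlier_products for r in reactants):
            return False
        earlier_products.add(product)
    # Check 3: the forward model reproduces each step, ending at the target.
    if any(forward_model(reactants) != product for reactants, product in steps):
        return False
    return steps[-1][1] == target
```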

[0029] Some embodiments of method 100 further comprise pretraining the reverse synthesis BART model with the set of textual data representing the plurality of chemicals, pretraining the forward synthesis BART model with the set of textual data representing the plurality of chemicals, post-training the reverse synthesis BART model with the first set of textual data representing a plurality of chemical reactions, and post-training the forward synthesis BART model with the second set of textual data representing a plurality of chemical reactions.

[0030] Method 100 may further comprise generating the set of textual data representing the plurality of chemicals from a database of graphical representations of chemicals by converting one or more of the graphical representations of chemicals into Simplified Molecular Input Line Entry System (SMILES) representations of the chemicals. Generating the set of textual data representing the plurality of chemicals may comprise canonicalizing the SMILES representations of the chemicals, and tokenizing the canonicalized SMILES representations of the chemicals.

[0031] As above, canonicalizing the SMILES representations of the chemicals may comprise rewriting the textual representations of the chemicals in accordance with a set of one or more rules for textually representing one or more of: isotopes, charges, aromatic atoms, ring openings, and ring closures.

[0032] In some embodiments, tokenizing prespecified canonicalized SMILES representations of the chemicals comprises identifying an invalid substring within a canonicalized SMILES representation of one of the chemicals based on a set of invalidation rules and converting the invalid substring to a valid substring based on the set of invalidation rules. For example, one of the set of invalidation rules may be that every opening bracket of a bracket type must have a corresponding closing bracket of the same bracket type. If an invalid substring is identified as having an unmatched closing bracket of a bracket type, then the substring may be converted to a valid substring by adding an opening bracket of the bracket type to the invalid substring.
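The bracket example above may be sketched as follows. The helper `repair_unmatched_closers` is hypothetical, and placing each added opening bracket at the start of the substring is an assumption of this sketch, since the position of the added bracket is not stated. The sketch handles only closing brackets that lack an opener; it does not repair unclosed openers or incorrectly nested brackets of different types.

```python
# Maps each closing bracket to the opener of the same bracket type.
OPENER_FOR = {")": "(", "]": "["}


def repair_unmatched_closers(substring: str) -> str:
    """Prepend one opening bracket for every closer that has no opener."""
    open_counts = {opener: 0 for opener in OPENER_FOR.values()}
    missing = []
    for ch in substring:
        if ch in open_counts:
            open_counts[ch] += 1
        elif ch in OPENER_FOR:
            opener = OPENER_FOR[ch]
            if open_counts[opener] > 0:
                # This closer pairs with an earlier opener of the same type.
                open_counts[opener] -= 1
            else:
                # Invalidation rule violated: record the opener to add.
                missing.append(opener)
    # Openers are prepended in reverse so the earliest closer is innermost.
    return "".join(reversed(missing)) + substring


print(repair_unmatched_closers("C)O"))
```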

[0033] As above, tokenizing the canonicalized SMILES representations of the chemicals may comprise greedily matching each of the canonicalized SMILES representations of the chemicals from left to right with a set of chemically meaningful tokens. The chemically meaningful tokens may comprise those described above.

[0034] Some embodiments of method 100 may comprise generating the first set of textual data representing the plurality of chemical reactions from a database of graphical representations of reactions by converting one or more of the graphical representations of reactions into Simplified Molecular Input Line Entry System (SMILES) representations of the reactions. The database of graphical representations of reactions may include one or more of the third-party databases described above.

[0035] Generating the first set of textual data representing the plurality of chemical reactions may comprise canonicalizing the SMILES representations of the reactions, and tokenizing the canonicalized SMILES representations of the reactions. Canonicalizing the SMILES representations of the reactions may comprise rewriting the textual representation of the reactions in accordance with a set of one or more rules for textually representing one or more of: isotopes, charges, aromatic atoms, ring openings, ring closures, and reaction arrows.

[0036] Canonicalizing the SMILES representations of the reactions may further comprise identifying an invalid substring within a prespecified canonicalized SMILES representation of one of the reactions based on a set of invalidation rules; and converting the invalid substring to a valid substring based on the set of invalidation rules.

[0037] In some embodiments, tokenizing the canonicalized SMILES representations of the reactions comprises greedily matching each of the canonicalized SMILES representations of the reactions from left to right with a set of chemically meaningful tokens. The set of chemically meaningful tokens may include those described above, and one or both of:
  • a plurality of tokens each representing a unique substituent; and
  • a plurality of tokens each representing a unique reaction punctuation.

[0038] Fig. 2 is a schematic view of system 200 for generating a chemical synthesis pathway for a molecule, according to an example embodiment of the present invention. System 200 comprises: user interface 210 for receiving a representation of the molecule from the user, reverse synthesis BART model 220 for generating a plurality of chemical synthesis pathways for the molecule, forward synthesis BART model 230 for selecting one or more of the reverse chemical synthesis pathways for the molecule, and display 240 for displaying the selected reverse chemical synthesis pathways to the user.

[0039] Both reverse synthesis BART model 220 and forward synthesis BART model 230 comprise BART models pretrained with a set of textual data representing a plurality of chemicals. Reverse synthesis BART model 220 comprises a BART model post-trained with a first set of textual data representing a plurality of chemical reactions, and forward synthesis BART model 230 comprises a BART model post-trained with a second set of textual data representing a plurality of chemical reactions.

[0040] Some embodiments of system 200 further comprise a preprocessor before reverse synthesis BART model 220 configured to canonicalize the representation of the molecule and tokenize the canonicalized representation of the molecule prior to generating the plurality of chemical synthesis pathways for the molecule with reverse synthesis BART model 220. The preprocessor may be configured to one or both of canonicalize and tokenize the representation of the molecule as otherwise described herein.

[0041] While a number of exemplary aspects and embodiments have been discussed above, those of skill in the art will recognize certain modifications, permutations, additions and sub-combinations thereof. It is therefore intended that the following appended claims and claims hereafter introduced are interpreted to include all such modifications, permutations, additions and sub-combinations as are consistent with the broadest interpretation of the specification as a whole.

Interpretation of Terms

[0042] Unless the context clearly requires otherwise, throughout the description and the claims:
  • “comprise”, “comprising”, and the like are to be construed in an inclusive sense, as opposed to an exclusive or exhaustive sense; that is to say, in the sense of “including, but not limited to”;
  • “connected”, “coupled”, or any variant thereof, means any connection or coupling, either direct or indirect, between two or more elements; the coupling or connection between the elements can be physical, logical, or a combination thereof;
  • “herein”, “above”, “below”, and words of similar import, when used to describe this specification, shall refer to this specification as a whole, and not to any particular portions of this specification;
  • “or”, in reference to a list of two or more items, covers all of the following interpretations of the word: any of the items in the list, all of the items in the list, and any combination of the items in the list;
  • the singular forms “a”, “an”, and “the” also include the meaning of any appropriate plural forms.

[0043] Words that indicate directions such as “vertical”, “transverse”, “horizontal”, “upward”, “downward”, “forward”, “backward”, “inward”, “outward”, “left”, “right”, “front”, “back”, “top”, “bottom”, “below”, “above”, “under”, and the like, used in this description and any accompanying claims (where present), depend on the specific orientation of the apparatus described and illustrated. The subject matter described herein may assume various alternative orientations. Accordingly, these directional terms are not strictly defined and should not be interpreted narrowly.

[0044] Embodiments of the invention may be implemented using specifically designed hardware, configurable hardware, programmable data processors configured by the provision of software (which may optionally comprise “firmware”) capable of executing on the data processors, special purpose computers or data processors that are specifically programmed, configured, or constructed to perform one or more steps in a method as explained in detail herein and/or combinations of two or more of these. Examples of specifically designed hardware are: logic circuits, application-specific integrated circuits (“ASICs”), large scale integrated circuits (“LSIs”), very large scale integrated circuits (“VLSIs”), and the like. Examples of configurable hardware are: one or more programmable logic devices such as programmable array logic (“PALs”), programmable logic arrays (“PLAs”), and field programmable gate arrays (“FPGAs”). Examples of programmable data processors are: microprocessors, digital signal processors (“DSPs”), embedded processors, graphics processors, math co-processors, general purpose computers, server computers, cloud computers, mainframe computers, computer workstations, and the like. For example, one or more data processors in a control circuit for a device may implement methods as described herein by executing software instructions in a program memory accessible to the processors.

[0045] Processing may be centralized or distributed. Where processing is distributed, information including software and/or data may be kept centrally or distributed. Such information may be exchanged between different functional units by way of a communications network, such as a Local Area Network (LAN), Wide Area Network (WAN), or the Internet, wired or wireless data links, electromagnetic signals, or other data communication channel.

[0046] For example, while processes or blocks are presented in a given order, alternative examples may perform routines having steps, or employ systems having blocks, in a different order, and some processes or blocks may be deleted, moved, added, subdivided, combined, and/or modified to provide alternatives or subcombinations. Each of these processes or blocks may be implemented in a variety of different ways. Also, while processes or blocks are at times shown as being performed in series, these processes or blocks may instead be performed in parallel, or may be performed at different times.

[0047] In addition, while elements are at times shown as being performed sequentially, they may instead be performed simultaneously or in different sequences. It is therefore intended that the following claims are interpreted to include all such variations as are within their intended scope.

[0048] In some embodiments, the invention may be implemented in software. For greater clarity, “software” includes any instructions executed on a processor, and may include (but is not limited to) firmware, resident software, microcode, and the like. Both processing hardware and software may be centralized or distributed (or a combination thereof), in whole or in part, as known to those skilled in the art. For example, software and other modules may be accessible via local memory, via a network, via a browser or other application in a distributed computing context, or via other means suitable for the purposes described above.

[0049] Where a component (e.g. a software module, processor, assembly, device, circuit, etc.) is referred to above, unless otherwise indicated, reference to that component (including a reference to a “means”) should be interpreted as including as equivalents of that component any component which performs the function of the described component (i.e., that is functionally equivalent), including components which are not structurally equivalent to the disclosed structure which performs the function in the illustrated exemplary embodiments of the invention.

[0050] Various features are described herein as being present in “some embodiments”. Such features are not mandatory and may not be present in all embodiments. Embodiments of the invention may include zero, any one or any combination of two or more of such features. This is limited only to the extent that certain ones of such features are incompatible with other ones of such features in the sense that it would be impossible for a person of ordinary skill in the art to construct a practical embodiment that combines such incompatible features. Consequently, the description that “some embodiments” possess feature A and “some embodiments” possess feature B should be interpreted as an express indication that the inventors also contemplate embodiments which combine features A and B (unless the description states otherwise or features A and B are fundamentally incompatible).

[0051] It is therefore intended that the following appended claims and claims hereafter introduced are interpreted to include all such modifications, permutations, additions, omissions, and sub-combinations as may reasonably be inferred. The scope of the claims should not be limited by the preferred embodiments set forth in the examples, but should be given the broadest interpretation consistent with the description as a whole.

Claims

1. A method for generating a chemical synthesis pathway for a molecule, the method comprising: receiving a representation of the molecule from a user; generating a plurality of textual chemical synthesis pathways for the molecule with a reverse synthesis bidirectional auto-regressive transformer (BART) model; selecting one or more of the reverse chemical synthesis pathways for the molecule with a forward synthesis BART model; converting the selected one or more reverse chemical synthesis pathways into corresponding graphical representations; displaying the graphical representations to the user; wherein both the reverse synthesis BART model and the forward synthesis BART model comprise BART models pretrained with a set of textual data representing a plurality of chemicals; wherein the reverse synthesis BART model comprises a BART model post-trained with a first set of textual data representing a plurality of chemical reactions; and wherein the forward synthesis BART model comprises a BART model post-trained with a second set of textual data representing a plurality of chemical reactions.

2. The method of claim 1, wherein the second set of textual data comprises a plurality of reversed chemical reactions from the first set of textual data.

3. The method of either of claims 1 and 2, wherein generating the plurality of textual chemical synthesis pathways for the molecule with the reverse synthesis BART model comprises: canonicalizing the textual representation of the molecule; and tokenizing the canonicalized textual representation of the molecule.

4. The method of claim 3, wherein canonicalizing the textual representation of the molecule comprises rewriting the textual representation of the molecule in accordance with a set of one or more rules for textually representing one or more of: isotopes, charges, aromatic atoms, ring openings, and ring closures.

5. The method of either of claims 3 and 4, wherein tokenizing the canonicalized textual representation of the molecule comprises: identifying an invalid substring within the textual representation of the molecule based on a set of invalidation rules; and converting the invalid substring to a valid substring based on the set of invalidation rules.

6. The method of any one of claims 3 to 5, wherein tokenizing the canonicalized textual representation of the molecule comprises greedily matching the canonicalized textual representation of the molecule from left to right with a set of chemically meaningful tokens.

7. The method of claim 6, wherein the set of chemically meaningful tokens comprises one or more of: a plurality of tokens each representing a unique element; a plurality of tokens each representing a unique molecule; a plurality of tokens each representing a unique bracketed combination of one or both of atoms and ions; a plurality of tokens each representing a unique unbracketed organic atom; a plurality of tokens each representing a unique type of chemical bond; a plurality of tokens each representing a unique type of ring index; and a plurality of tokens each representing a unique isotopic mass number.
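The greedy left-to-right matching of claims 6 and 7 can be sketched as follows. This is a minimal illustration, not the claimed tokenizer: it assumes that ties at a given position are resolved by taking the longest vocabulary entry, and the small vocabulary `TOY_VOCAB` (including the branch parentheses and the dot separator) is hypothetical.

```python
from typing import List, Set

# Hypothetical toy vocabulary: unbracketed organic atoms, aromatic atoms,
# bracketed atoms/ions, bond symbols, ring indices, and punctuation.
TOY_VOCAB: Set[str] = {
    "C", "N", "O", "Cl", "Br", "c", "n", "o",
    "[NH4+]", "[Cl-]", "[13C]",
    "=", "#", "1", "2", "(", ")", ".",
}


def greedy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Split text into tokens by scanning left to right.

    At each position the longest vocabulary entry that matches is consumed
    (an assumption about how 'greedy' ties are resolved).
    """
    max_len = max(len(token) for token in vocab)
    tokens: List[str] = []
    position = 0
    while position < len(text):
        longest = min(max_len, len(text) - position)
        for size in range(longest, 0, -1):
            piece = text[position:position + size]
            if piece in vocab:
                tokens.append(piece)
                position += size
                break
        else:
            # No vocabulary entry matches at this position.
            raise ValueError(f"no token matches at position {position}")
    return tokens


tokens = greedy_tokenize("ClCc1ccccc1", TOY_VOCAB)
```

Because longer entries are tried first, "Cl" is consumed as a single chlorine token rather than as a carbon followed by an unknown character, and a bracketed ion such as "[NH4+]" stays intact.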

8. The method of any one of claims 1 to 7, wherein selecting the one or more of the reverse chemical synthesis pathways for the molecule with the forward synthesis BART model comprises: determining one or more valid chemical synthesis pathways in the plurality of textual chemical synthesis pathways by validating one or more of the plurality of textual chemical synthesis pathways with the forward synthesis BART model; determining a confidence score for each of the valid chemical synthesis pathways; ranking the valid chemical synthesis pathways by the confidence score of each of the valid chemical synthesis pathways; receiving a number of desired output pathways from the user; selecting a top number of ranked valid chemical synthesis pathways up to the number of desired output pathways; and displaying the top number of ranked valid chemical synthesis pathways to the user.

9. The method of claim 8, wherein validating one or more of the plurality of textual chemical synthesis pathways with the forward synthesis BART model comprises: confirming that the one or more of the plurality of textual chemical synthesis pathways satisfies one or more rules for valid representation of a chemical reaction; confirming that the one or more of the plurality of textual chemical synthesis pathways does not include any synthesis steps wherein one or more reactants are products of a previous synthesis step; and confirming that the one or more of the plurality of textual chemical synthesis pathways successfully generates the molecule with the forward synthesis BART model.
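The selection flow of claims 8 and 9 can be sketched in Python as below. The sketch covers only the forward-model check (the candidate must regenerate the target molecule), the confidence ranking, and the top-N cut; the format-rule and step-ordering checks of claim 9 are omitted because their rules are not spelled out here. The forward model is replaced by a hypothetical lookup function `toy_forward`, single-step candidates are assumed, and taking the confidence score from the forward model is also an assumption.

```python
from typing import Callable, Dict, List, Tuple

# A forward model maps reactants to (predicted product, confidence).
ForwardModel = Callable[[str], Tuple[str, float]]


def select_pathways(
    target: str,
    candidates: List[str],
    forward_predict: ForwardModel,
    n_desired: int,
) -> List[Tuple[str, float]]:
    """Keep candidates that regenerate the target, rank them, return top N."""
    valid: List[Tuple[str, float]] = []
    for reactants in candidates:
        product, confidence = forward_predict(reactants)
        # Plain string comparison assumes both strings were canonicalized
        # in the same way beforehand.
        if product == target:
            valid.append((reactants, confidence))
    # Highest confidence first.
    valid.sort(key=lambda pair: pair[1], reverse=True)
    return valid[:max(n_desired, 0)]


# Hypothetical stand-in for a trained forward synthesis model.
TOY_PREDICTIONS: Dict[str, Tuple[str, float]] = {
    "CC(=O)O.OCC": ("CC(=O)OCC", 0.90),
    "CC(=O)Cl.OCC": ("CC(=O)OCC", 0.95),
    "CC.O": ("CCO", 0.80),
}


def toy_forward(reactants: str) -> Tuple[str, float]:
    return TOY_PREDICTIONS.get(reactants, ("", 0.0))


best = select_pathways(
    "CC(=O)OCC", ["CC(=O)O.OCC", "CC.O", "CC(=O)Cl.OCC"], toy_forward, 1
)
```

Requesting more pathways than there are valid candidates simply returns every valid candidate in ranked order.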

10. The method of any one of claims 1 to 9, further comprising: pretraining the reverse synthesis BART model with the set of textual data representing the plurality of chemicals; pretraining the forward synthesis BART model with the set of textual data representing the plurality of chemicals; post-training the reverse synthesis BART model with the first set of textual data representing a plurality of chemical reactions; and post-training the forward synthesis BART model with the second set of textual data representing a plurality of chemical reactions.

11. The method of claim 10, further comprising generating the set of textual data representing the plurality of chemicals from a database of graphical representations of chemicals by converting one or more of the graphical representations of chemicals into Simplified Molecular Input Line Entry System (SMILES) representations of the chemicals.

12. The method of claim 11, wherein generating the set of textual data representing the plurality of chemicals comprises: canonicalizing the SMILES representations of the chemicals; and tokenizing the canonicalized SMILES representations of the chemicals.

13. The method of claim 11, wherein canonicalizing the SMILES representations of the chemicals comprises rewriting the textual representations of the chemicals in accordance with a set of one or more rules for textually representing one or more of: isotopes, charges, aromatic atoms, ring openings, and ring closures.

14. The method of either of claims 12 and 13, wherein tokenizing the canonicalized SMILES representations of the chemicals comprises: identifying an invalid substring within a SMILES representation of one of the chemicals based on a set of invalidation rules; and converting the invalid substring to a valid substring based on the set of invalidation rules.

15. The method of any one of claims 12 to 14, wherein tokenizing the canonicalized SMILES representations of the chemicals comprises greedily matching each of the canonicalized SMILES representations of the chemicals from left to right with a set of chemically meaningful tokens.

16. The method of claim 15, wherein the set of chemically meaningful tokens comprises one or more of: a plurality of tokens each representing a unique element; a plurality of tokens each representing a unique molecule; a plurality of tokens each representing a unique bracketed combination of one or both of atoms and ions; a plurality of tokens each representing a unique unbracketed organic atom; a plurality of tokens each representing a unique type of chemical bond; a plurality of tokens each representing a unique type of ring index; and a plurality of tokens each representing a unique isotopic mass number.

17. The method of any one of claims 10 to 16, further comprising generating the first set of textual data representing the plurality of chemical reactions from a database of graphical representations of reactions by converting one or more of the graphical representations of reactions into Simplified Molecular Input Line Entry System (SMILES) representations of the reactions.

18. The method of claim 17, wherein generating the first set of textual data representing the plurality of chemical reactions comprises: canonicalizing the SMILES representations of the reactions; and tokenizing the canonicalized SMILES representations of the reactions.

19. The method of claim 18, wherein canonicalizing the SMILES representations of the reactions comprises rewriting the textual representation of the reactions in accordance with a set of one or more rules for textually representing one or more of: isotopes, charges, aromatic atoms, ring openings, ring closures, and reaction arrows.

20. The method of either of claims 18 and 19, wherein tokenizing the canonicalized SMILES representations of the reactions comprises: identifying an invalid substring within a canonicalized SMILES representation of one of the reactions based on a set of invalidation rules; and converting the invalid substring to a valid substring based on the set of invalidation rules.

21. The method of any one of claims 18 to 20, wherein tokenizing the canonicalized SMILES representations of the reactions comprises greedily matching each of the canonicalized SMILES representations of the reactions from left to right with a set of chemically meaningful tokens.

22. The method of claim 21, wherein the set of chemically meaningful tokens comprises one or more of: a plurality of tokens each representing a unique element; a plurality of tokens each representing a unique molecule; a plurality of tokens each representing a unique bracketed combination of one or both of atoms and ions; a plurality of tokens each representing a unique unbracketed organic atom; a plurality of tokens each representing a unique type of chemical bond; a plurality of tokens each representing a unique type of ring index; a plurality of tokens each representing a unique substituent; a plurality of tokens each representing a unique reaction punctuation; and a plurality of tokens each representing a unique isotopic mass number.

23. A computer system for generating a chemical synthesis pathway for a molecule, the system comprising: a user interface for receiving a representation of the molecule from the user; a reverse synthesis bidirectional auto-regressive transformer (BART) model for generating a plurality of chemical synthesis pathways for the molecule; a forward synthesis BART model for selecting one or more of the reverse chemical synthesis pathways for the molecule; a display for displaying the selected reverse chemical synthesis pathways to the user; wherein both the reverse synthesis BART model and the forward synthesis BART model comprise BART models pretrained with a set of textual data representing a plurality of chemicals; wherein the reverse synthesis BART model comprises a BART model posttrained with a first set of textual data representing a plurality of chemical reactions; and wherein the forward synthesis BART model comprises a BART model posttrained with a second set of textual data representing a plurality of chemical reactions.

24. The system of claim 23, wherein the second set of textual data comprises a plurality of reversed chemical reactions from the first set of textual data.

25. The system of either of claims 23 and 24, further comprising a preprocessor before the reverse synthesis BART model configured to canonicalize the representation of the molecule and tokenize the canonicalized representation of the molecule.

26. The system of claim 25, wherein the preprocessor is configured to canonicalize the representation of the molecule by rewriting the textual representation of the molecule in accordance with a set of one or more rules for textually representing one or more of: isotopes, charges, aromatic atoms, ring openings, and ring closures.

27. The system of either of claims 25 and 26, wherein the preprocessor is configured to tokenize the canonicalized representation of the molecule by: identifying an invalid substring within the canonicalized representation of the molecule based on a set of invalidation rules; and converting the invalid substring to a valid substring based on the set of invalidation rules.

28. The system of any one of claims 25 to 27, wherein the preprocessor is configured to tokenize the canonicalized representation of the molecule by greedily matching the canonicalized representation of the molecule from left to right with a set of chemically meaningful tokens.

29. The system of claim 28, wherein the set of chemically meaningful tokens comprises one or more of: a plurality of tokens each representing a unique element; a plurality of tokens each representing a unique molecule; a plurality of tokens each representing a unique bracketed combination of one or both of atoms and ions; a plurality of tokens each representing a unique unbracketed organic atom; a plurality of tokens each representing a unique type of chemical bond; a plurality of tokens each representing a unique type of ring index; and a plurality of tokens each representing a unique isotopic mass number.

30. The system of any one of claims 23 to 29, wherein the forward synthesis BART model is configured to select the one or more of the reverse chemical synthesis pathways for the molecule by: determining one or more valid chemical synthesis pathways in the plurality of chemical synthesis pathways by validating one or more of the plurality of chemical synthesis pathways with the forward synthesis BART model; determining a confidence score for each of the valid chemical synthesis pathways; ranking the valid chemical synthesis pathways by the confidence score of each of the valid chemical synthesis pathways; receiving a number of desired output pathways from the user via the user interface; selecting a top number of ranked valid chemical synthesis pathways up to the number of desired output pathways; and displaying the top number of ranked valid chemical synthesis pathways to the user via the display.

31. The system of claim 30, wherein validating one or more of the plurality of chemical synthesis pathways with the forward synthesis BART model comprises: confirming that the one or more of the plurality of chemical synthesis pathways satisfies one or more rules for valid representation of a chemical reaction; confirming that the one or more of the plurality of chemical synthesis pathways does not include any synthesis steps wherein one or more reactants are products of a previous synthesis step; and confirming that the one or more of the plurality of chemical synthesis pathways successfully generates the molecule with the forward synthesis BART model.