An interpretable molecular optimization method based on a large language model

By constructing a rule base and a self-feedback mechanism using the ELLM-MOM model, the problem of black-box optimization behavior in molecular optimization is solved, enabling an interpretable and stable molecular optimization process and improving the success rate and stability of optimization.

CN121983172B · Active · Publication Date: 2026-06-26 · CHINA UNIV OF PETROLEUM (EAST CHINA)

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: CHINA UNIV OF PETROLEUM (EAST CHINA)
Filing Date: 2026-04-09
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing deep learning methods in molecular optimization suffer from the problem of black-box optimization behavior, which is difficult to interpret and control, and is also difficult to reuse between different optimization objectives.

Method used

The interpretable LLM-guided molecular optimization framework ELLM-MOM is adopted. It extracts structure-property prior knowledge from a large language model to build a rule base, combines this with a retrieval-driven self-feedback mechanism for iterative optimization, and explicitly models optimization as a rule-based reasoning and structure editing process.

Benefits of technology

It achieves interpretability and traceability of the molecular optimization process without retraining the model, enables iterative advancement within a controllable structural operation space, solves the implicit mapping problem of optimization behavior, and improves the stability and success rate of optimization.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses an interpretable molecule optimization method based on a large language model, and relates to the technical field of computational chemistry, and comprises the following steps: obtaining an original molecule set, performing standardization processing and effective molecule screening on the original molecule set, and outputting a standardized molecule set and an optimization task test molecule set; performing multi-modal molecule embedding generation on the standardized molecule set, and outputting a multi-modal molecule embedding vector set; performing semantic embedding and clustering deduplication on an initial molecule editing rule set, outputting a deduplicated representative rule set, and then extracting high-frequency molecule fragments based on the standardized molecule set, and outputting a final molecule editing rule library; based on the optimization task test molecule set and a structured rule metadata table, scoring and sorting are performed for a specified target optimization attribute, and a multi-attribute candidate rule sorting table is output; based on a to-be-optimized benchmark molecule and a correction editing rule set, molecule structure editing and property compliance evaluation are performed, and after iteration, an optimized final molecule set is output.

Description

Technical Field

[0001] This invention relates to the field of computational chemistry, and in particular to an interpretable molecular optimization method based on a large language model.

Background Technology

[0002] In recent years, artificial intelligence methods have demonstrated their value in multiple stages of drug discovery, including tasks such as molecule generation, property prediction, virtual screening, and lead compound optimization. Among these, molecule optimization plays a crucial role in progressively advancing initial candidate structures to meet multiple druggability constraints. This process is not a one-time prediction or generation, but rather an iterative decision-making process accompanied by repeated structural modifications, evaluations, and feedback.

[0003] Existing deep learning methods have achieved significant success in molecular optimization, particularly under specific objectives or constraints, effectively guiding models to converge to desired properties through end-to-end training, reinforcement learning, or multi-objective loss design. However, these methods typically model the optimization task as an implicit mapping from molecular structure to target properties or reward signals, and adapting to different optimization objectives often relies on relearning or adjusting model parameters. In this paradigm, the motivation for a structural modification, its potential chemical effects, and the correction logic after a failure are all hidden in the model parameters. As a result, the optimization behavior, while numerically effective, is difficult to control, interpret, or reuse across tasks.

Summary of the Invention

[0004] To solve the above problems, as shown in Figure 1, this invention provides an Explainable Large Language Model-guided Molecular Optimization (ELLM-MOM) framework that explicitly models the optimization process as a rule-based reasoning and structural editing workflow. The method extracts structure-property priors from Large Language Models (LLMs) and organizes them into reusable editing rules. The applicability of these rules is evaluated through intermediate physicochemical property mapping and binary inference, and the optimization direction is iteratively corrected using a retrieval-driven self-feedback mechanism. This allows for iterative improvement without retraining the model.

[0005] This invention provides an interpretable molecular optimization method based on a large language model, comprising the following steps:

[0006] S10: Obtain the original molecule set, perform standardization processing and effective molecule screening on it, and output the standardized molecule set and the optimization task test molecule set;

[0007] S20: Perform multimodal molecular embedding generation and initial LLM rule generation on the standardized molecule set, and output the multimodal molecule embedding vector set and the initial molecule editing rule set;

[0008] S30: Perform semantic embedding and clustering deduplication on the initial molecule editing rule set, output a deduplicated representative rule set, and then extract high-frequency molecular fragments based on the standardized molecule set to output the final molecule editing rule library;

[0009] S40: Standardize the data of the final molecule editing rule library and output the structured rule metadata table;

[0010] S50: Based on the optimization task test molecule set and the structured rule metadata table, score and sort the rules for the specified target optimization attributes, and output a multi-attribute candidate rule ranking table;

[0011] S60: Based on the optimization task test molecule set and the multi-attribute candidate rule ranking table, and combined with the standardized molecule set, perform chemical validity verification and optimal molecule screening, and output the benchmark molecule to be optimized;

[0012] S70: Based on the benchmark molecule to be optimized, the standardized molecule set, and the multimodal molecule embedding vector set, perform similar molecule retrieval and output successful reference molecules;

[0013] S80: Extract the shared backbone and split off the unique fragments of the benchmark molecule to be optimized and the successful reference molecule, match and map these fragments against the final molecule editing rule library, and output the correction editing rule set;

[0014] S90: Based on the benchmark molecule to be optimized and the correction editing rule set, perform molecular structure editing and property compliance evaluation, and output the optimized final molecule set after iteration.

[0015] In summary, the present invention has at least the following beneficial effects:

[0016] The ELLM-MOM model summarizes structural and property knowledge from literature and databases, explicitly presenting it in the form of rules. In specific optimization tasks, hypotheses are derived based on these rules, and controlled structural editing is performed. Through continuous verification and feedback, the inferences are tested and revised, making the optimization path interpretable and traceable. ELLM-MOM enables the model to simulate the "reasoning-operation-evaluation-feedback" cycle in scientific research without requiring parameter updates, continuously advancing molecular optimization with traceable structural mechanisms at its core, rather than the traditional generative, one-off output model.

[0017] The ELLM-MOM framework does not rely on additional parameter training; its optimization behavior is driven entirely by rule-based reasoning and external knowledge feedback, with interpretable structure editing rules at its core. Through prompt engineering and in-context learning, it uses the knowledge extraction capability of large language models to generate a molecular editing rule base grounded in prior knowledge. Given the target property and the starting molecule, the property mapping module filters applicable operations and uses chain-of-thought reasoning to make discretized judgments of structural influence, thereby selecting the editing strategy with the greatest potential contribution under a unified scoring system. The ELLM-MOM model also introduces a self-feedback mechanism driven by external knowledge: taking similar molecules in the database whose properties already meet the requirements as references, it maps the structural differences between the reference and the current molecule back into rules for further optimization, so that editing advances iteratively within an interpretable space of structural operations.

Attached Figure Description

[0018] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0019] Figure 1 is an architecture diagram of the ELLM-MOM model of the present invention;

[0020] Figure 2 is the input molecular structure diagram for the "more soluble in water" task;

[0021] Figure 3 is the intermediate molecular structure diagram for the "more soluble in water" task;

[0022] Figure 4 is the output molecular structure diagram for the "more soluble in water" task;

[0023] Figure 5 is the input molecular structure diagram for the task of improving drug-likeness;

[0024] Figure 6 is the intermediate molecular structure diagram for the task of improving drug-likeness;

[0025] Figure 7 is the output molecular structure diagram for the task of improving drug-likeness;

[0026] Figure 8 is the input molecular structure diagram for the "lower permeability" task;

[0027] Figure 9 is the intermediate molecular structure diagram for the "lower permeability" task;

[0028] Figure 10 is the output molecular structure diagram for the "lower permeability" task;

[0029] Figure 11 is the input molecular structure diagram for the "more hydrogen bond donors" task;

[0030] Figure 12 is the intermediate molecular structure diagram for the "more hydrogen bond donors" task;

[0031] Figure 13 is the output molecular structure diagram for the "more hydrogen bond donors" task;

[0032] Figure 14 is the input molecular structure diagram for the "more soluble in water and more hydrogen bond acceptors" task;

[0033] Figure 15 is the intermediate molecular structure diagram for the "more soluble in water and more hydrogen bond acceptors" task;

[0034] Figure 16 is the output molecular structure diagram for the "more soluble in water and more hydrogen bond acceptors" task;

[0035] Figure 17 is the input molecular structure diagram for the "less soluble in water and more hydrogen bond donors" task;

[0036] Figure 18 is the intermediate molecular structure diagram for the "less soluble in water and more hydrogen bond donors" task;

[0037] Figure 19 is the output molecular structure diagram for the "less soluble in water and more hydrogen bond donors" task.

Detailed Implementation

[0038] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0039] The present invention is described in further detail below with reference to Figures 1 to 19.

[0040] An interpretable molecular optimization method based on a large language model includes the following steps:

[0041] Step 1: Preprocessing of basic data.

[0042] To ensure sufficient structural diversity and chemical reliability for external retrieval, this invention constructs a composite molecular database covering both real-world medicinal chemistry space and experimentally determined three-dimensional structure space. The database integrates two molecular sources: approved drugs from ChEMBL, which represent typical pharmacophore combinations, physicochemical property distributions, and medicinal chemistry scaffolds; and small-molecule ligands extracted from the PDB, which represent a broader range of stereochemical patterns, three-dimensional conformational features, and rare substitution patterns. The two sources are complementary, so the search space covers both "core medicinal chemistry structures" and "structure types that are feasible in experimental systems but scarce in the corpus." All molecules are standardized with RDKit before being added to the database, and invalid or incomplete structures are removed.

[0043] Step two, building the rule base.

[0044] The generation of the rule base is driven by a set of strictly constrained prompts. These prompts explicitly define the form of the rules, including the types of structural editing operations that can be performed, the textual description specifications of the operations, and the molecular fragment form that triggers the operations. Furthermore, each rule must be accompanied by a piece of prior knowledge, clearly explaining "why such structural modifications might improve certain properties" and "through which intermediate properties (such as hydrophobicity, polarity, hydrogen bonding characteristics, etc.) it will work." This mechanism ensures that the generated rules naturally rely on specific prior knowledge, guaranteeing the interpretability and traceability of molecular editing.
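As an illustration of the rule form described above, one rule can be held as a small record. This is only a sketch: apart from `smarts_trigger`, which the description names later, every field name and the example rule itself are hypothetical and are not taken from the patent's rule base.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EditRule:
    """One molecular editing rule with its attached prior knowledge (illustrative layout)."""
    rule_id: str
    operation: str                 # type of structural edit, e.g. "replace"
    description: str               # textual description of the operation
    smarts_trigger: str            # fragment pattern that triggers the operation
    prior_knowledge: str           # why the edit might improve a property
    intermediate_properties: tuple = ()  # properties through which the edit acts

    def is_explainable(self) -> bool:
        # A rule is only accepted if it carries both a rationale and
        # at least one intermediate property.
        return bool(self.prior_knowledge.strip()) and bool(self.intermediate_properties)

# Hypothetical example rule.
rule = EditRule(
    rule_id="R001",
    operation="replace",
    description="Replace an aromatic chlorine with a hydroxyl group",
    smarts_trigger="c[Cl]",
    prior_knowledge="A hydroxyl group raises polarity and adds a hydrogen bond donor.",
    intermediate_properties=("polarity", "hbond_donor"),
)
```

Rejecting records for which `is_explainable()` is false is one simple way to enforce the requirement that every rule be accompanied by prior knowledge.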

[0045] To reduce the occurrence of invalid or unreasonable rules, in-context learning is introduced: a small number of positive and negative examples guide the ELLM-MOM model to respect chemical validity and medicinal chemistry semantics when generating rules. Positive examples demonstrate common and reasonable substitutions or modifications; negative examples explicitly exclude patterns such as fragments with broken bonds, mixture SMILES, or valence anomalies, thereby reducing the tendency of the ELLM-MOM model to generate structurally unreasonable candidates.

[0046] Because of the context-length limits of large language models, the rule base must be generated in batches. Directly using all existing rules as few-shot examples in each batch would crowd out the task instructions and significantly strengthen the model's tendency to imitate old rules. To mitigate this bias, this invention semantically embeds and clusters the prior knowledge fields of the rules and selects only representative rules from each cluster as prompt examples. This approach alleviates context overflow, reduces the data processing cost that example redundancy adds to rule base construction, and allows the ELLM-MOM model to cover a broader structural editing space.

[0047] In actual generation, the rule distribution obtained after semantic deduplication is usually relatively concentrated. This reflects that large language models prioritize presenting high-density substitution patterns in the corpus within a pharmaceutical context. In other words, the rules extracted by large language models often represent the most common and stable structural and property relationships in literature and general chemical knowledge, serving as the "density center" of the rule space. Based on this phenomenon, this invention further mines common substitution groups from a large-scale small molecule library and constructs heuristic substitution group statistics to complete the rule space. By using BRICS decomposition and Murcko scaffold segmentation to extract side chains, fragments that appear repeatedly are extracted. These fragments reflect chemical components that are repeatedly used in real drugs and ligands. Based on patterns such as whether they contain halogens, whether they are aromatic rings, and whether they contain carbonyl groups, these fragments are divided into several heuristic families. Fragments within the same heuristic family are used to expand the rules, increasing the number of rules from the rule set given by the large language model to a preset scale.
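The fragment statistics can be sketched as a frequency count followed by bucketing into heuristic families. The sketch assumes the side-chain fragments have already been produced upstream (by BRICS decomposition and Murcko scaffold segmentation) as SMILES strings. The family test below is a deliberately crude string check chosen so that the example runs without a cheminformatics toolkit; a real implementation would use substructure matching instead. The `min_count` cutoff is a hypothetical parameter.

```python
from collections import Counter, defaultdict

HALOGENS = ("F", "Cl", "Br", "I")

def heuristic_family(fragment_smiles):
    """Crude family key: (has halogen, is aromatic, has carbonyl). String-based stand-in."""
    has_halogen = any(symbol in fragment_smiles for symbol in HALOGENS)
    is_aromatic = "c" in fragment_smiles          # lowercase c = aromatic carbon in SMILES
    has_carbonyl = "C(=O)" in fragment_smiles or "C=O" in fragment_smiles
    return (has_halogen, is_aromatic, has_carbonyl)

def frequent_fragments_by_family(fragments_per_molecule, min_count=2):
    """Keep fragments seen in at least `min_count` molecules, grouped by family."""
    counts = Counter()
    for fragments in fragments_per_molecule:
        counts.update(set(fragments))             # count each fragment once per molecule
    families = defaultdict(list)
    for fragment, count in sorted(counts.items()):
        if count >= min_count:
            families[heuristic_family(fragment)].append(fragment)
    return dict(families)

# Toy fragment lists for three molecules.
library = [["Cl", "c1ccccc1"], ["Cl", "C(=O)O"], ["c1ccccc1", "C(=O)O", "N"]]
families = frequent_fragments_by_family(library)
```

Fragments in the same family could then be swapped into an existing rule to generate additional rules until the preset rule count is reached.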

[0048] It should be noted that the Murcko scaffold is derived from Bemis G W, Murcko M A. The properties of known drugs. 1. Molecular frameworks[J]. Journal of Medicinal Chemistry, 1996, 39(15): 2887-2893.

[0049] Step three: interpretable molecular editing.

[0050] After generating the rule base, it is necessary to further determine the applicability of the rules to specific molecules and specific property targets, and then implement structural editing accordingly. If a large language model is directly used to provide an applicability score for the rules, score drift may occur, meaning the evaluation results will fluctuate with prompts and context, making it difficult to maintain consistency in cross-rule comparisons and to explain the underlying rationale. Therefore, the core objective of this module is to transform the "effectiveness of a rule" into a set of interpretable, decomposable intermediate inferences that are independent of the prompt context, thereby selecting structural editing schemes with clear mechanistic basis and implementing molecular editing within a unified property contribution space.

[0051] To achieve this goal, this invention organizes the rule evaluation process into a chain-like inference. Before the rule takes effect, an influence mapping table related to the target property T is first constructed at the property level: using physicochemical descriptors commonly used in the QSPR framework as intermediaries, a set of discrete intermediate attributes such as hydrophobic volume, polarity, hydrogen bond donor / acceptor ability, number of rotatable bonds, and aromaticity are selected. The large language model is then instructed to provide the direction of action for each intermediate attribute only in the binary classification task of "favorable / unfavorable to T", taking only two values: {+1, −1}. The resulting mapping table compresses the complex structure-property relationship into several intermediate dimensions with clear directions, while avoiding the instability of the large language model when processing continuous multi-level labels.

[0052] Subsequently, in the screening stage, rule evaluation is broken down into three steps. First, the rule meta-information is used to determine whether the rule's expected direction is consistent with the current property optimization objective. Second, the trigger pattern (the smarts_trigger field) in the rule meta-information is checked to see whether it appears in the given molecule. Third, only when the first two steps are satisfied, the large language model makes a binary judgment on "whether the structural edit will substantially change each of the above intermediate attributes", yielding a {0,1} "trigger vector".
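The three-step screening can be sketched as a short-circuiting function. This is a minimal illustration, not the patented implementation: the rule dictionary layout, the `expected_effects` field, and the two callables `has_trigger` and `llm_changes` are hypothetical stand-ins (only `smarts_trigger` is named in the text). In practice `has_trigger` would be a SMARTS substructure match and `llm_changes` a binary query to the language model.

```python
from typing import Callable, Optional

# Intermediate attributes listed in the description; the identifiers are illustrative.
INTERMEDIATES = (
    "hydrophobic_volume", "polarity", "hbond_donor",
    "hbond_acceptor", "rotatable_bonds", "aromaticity",
)

def screen_rule(rule: dict, molecule: str, target: str, direction: int,
                has_trigger: Callable[[str, str], bool],
                llm_changes: Callable[[dict, str, str], bool]) -> Optional[list]:
    """Return the {0,1} trigger vector, or None if the rule is screened out."""
    # Step 1: the rule's expected direction must match the optimization objective.
    if rule["expected_effects"].get(target) != direction:
        return None
    # Step 2: the trigger pattern must occur in the molecule.
    if not has_trigger(molecule, rule["smarts_trigger"]):
        return None
    # Step 3: one binary judgment per intermediate attribute.
    return [1 if llm_changes(rule, molecule, name) else 0 for name in INTERMEDIATES]

# Toy stand-ins so the sketch runs without RDKit or an LLM.
rule = {"expected_effects": {"solubility": +1}, "smarts_trigger": "Cl"}
substring_match = lambda mol, pattern: pattern in mol
fake_llm = lambda rule, mol, name: name in ("polarity", "hbond_donor")

vector = screen_rule(rule, "Clc1ccccc1", "solubility", +1, substring_match, fake_llm)
```

Because steps 1 and 2 are cheap and deterministic, the language model is only queried for rules that could apply at all.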

[0053] The final score of a rule on property T is obtained by multiplying and summing the trigger vector element-wise with the pre-constructed intermediate attribute influence map. Since the set of intermediate attributes and their directional mappings remain fixed throughout the process, and the large language model always performs only two types of binary decisions, the rule score can be projected onto the same property contribution space across different prompt contexts and different calls, thus obtaining a stable rule ranking with a clear mechanistic path.
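The score is therefore a dot product between a {0,1} trigger vector and a {+1, -1} influence map. The sketch below shows the arithmetic; the influence values and the two rule names are hypothetical examples, not values from the patent.

```python
def rule_score(trigger_vector, influence_map):
    """Sum of element-wise products of a {0,1} trigger vector and a {+1,-1} influence map."""
    if len(trigger_vector) != len(influence_map):
        raise ValueError("vectors must cover the same intermediate attributes")
    if any(t not in (0, 1) for t in trigger_vector):
        raise ValueError("trigger entries must be 0 or 1")
    if any(w not in (1, -1) for w in influence_map):
        raise ValueError("influence entries must be +1 or -1")
    return sum(t * w for t, w in zip(trigger_vector, influence_map))

# Hypothetical influence map for "increase water solubility", ordered as:
# hydrophobic volume, polarity, H-bond donors, H-bond acceptors, rotatable bonds, aromaticity
influence = [-1, +1, +1, +1, +1, -1]

# Hypothetical trigger vectors for two rules.
triggers = {
    "add_hydroxyl": [0, 1, 1, 1, 0, 0],
    "add_phenyl":   [1, 0, 0, 0, 1, 1],
}
ranking = sorted(triggers, key=lambda name: rule_score(triggers[name], influence), reverse=True)
```

Since the influence map is fixed per target property, two calls that produce the same trigger vector always produce the same score, which is what makes the ranking comparable across prompts.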

[0054] In single-attribute optimization, each round selects the highest-scoring set of rules from the rule scoring results as candidates, and these rules are then sequentially sent to the rule application stage to generate corresponding candidate molecules. For multi-attribute optimization, this invention does not introduce additional rule construction or evaluation models, but directly extends the single-attribute rules and scoring mechanism: First, based on the screening results of each single attribute, the intersection of the rule's trigger fragment and specific operation is calculated; when the intersection is empty, it degenerates into a union. Subsequently, the degree to which these candidate rules may change towards the specified optimization direction in each attribute is evaluated, that is, the offset of the candidate rule's score in each attribute relative to the neutral threshold is accumulated as its comprehensive contribution index, and the optimal rule is selected based on this to enter the subsequent structure editing process. This design enables multi-attribute optimization to directly reuse the single-attribute rule base and evaluation path, achieving attribute collaborative optimization without additional training or independent modeling.
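A minimal sketch of the multi-attribute extension follows. It assumes rules are keyed by their (trigger fragment, operation) pair, that the neutral threshold is 0, and that a rule missing from one attribute's screening result contributes nothing for that attribute; none of these details is fixed by the description.

```python
def combine_candidates(per_attribute_scores, neutral=0.0):
    """Pick the rule with the largest accumulated offset from the neutral threshold.

    per_attribute_scores maps attribute name -> {(trigger, operation): score}.
    """
    key_sets = [set(scores) for scores in per_attribute_scores.values()]
    # Intersection of the single-attribute results; fall back to the union if empty.
    candidates = set.intersection(*key_sets) or set.union(*key_sets)
    totals = {
        key: sum(scores.get(key, neutral) - neutral
                 for scores in per_attribute_scores.values())
        for key in candidates
    }
    best = max(sorted(totals), key=totals.get)   # sorted() makes ties deterministic
    return best, totals

# Hypothetical single-attribute screening results.
scores = {
    "solubility": {("c[Cl]", "replace_with_OH"): 2, ("[CH3]", "remove"): 1},
    "hbond_donor": {("c[Cl]", "replace_with_OH"): 1, ("C(=O)OC", "hydrolyze"): 2},
}
best, totals = combine_candidates(scores)
```

The same function covers the degenerate case: when no rule appears in every attribute's result, all rules compete and each is credited only for the attributes in which it was scored.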

[0055] In the rule application phase, this invention employs a molecular graph editing method based on RDKit to perform operations such as bond breaking, replacement, and reconnection, ensuring that the generated molecules are consistent with the rules in valence and connectivity. Edited molecules are not generated directly by the large language model because, under strong spatial constraints or complex ring operations, the model is prone to producing mixture SMILES, SMILES containing broken-bond markers ("[*]"), or structural rewrites inconsistent with the target rule, which makes it difficult to establish a traceable relationship between a generated molecule and its rule. However, to avoid a potentially valid edit being misjudged as invalid because of an individual graph editing failure, molecular editing by the large language model serves as a fallback when the RDKit-based editing is interrupted. If the fallback output shows none of the three error types above, it is returned as the result; otherwise, the edit is treated as failed.
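The primary-path-plus-fallback logic can be sketched as follows. The callables `graph_edit`, `llm_edit`, and `consistent_with_rule` are hypothetical placeholders for the RDKit-based editor, the LLM editor, and a rule-consistency check; the two string tests rely only on SMILES syntax, where "." separates disconnected components and "*" is the wildcard atom used for open attachment points.

```python
def is_invalid_llm_output(smiles, consistent_with_rule):
    """Check the three error types named in the text."""
    if "." in smiles:            # mixture SMILES: disconnected components
        return True
    if "*" in smiles:            # leftover attachment-point marker such as "[*]"
        return True
    return not consistent_with_rule(smiles)   # rewrite does not match the rule

def apply_rule(molecule, rule, graph_edit, llm_edit, consistent_with_rule):
    """Try the graph editor first; fall back to the LLM only if it is interrupted."""
    try:
        return graph_edit(molecule, rule), "graph"
    except Exception:  # broad on purpose: any interruption triggers the fallback
        candidate = llm_edit(molecule, rule)
    if is_invalid_llm_output(candidate, lambda s: consistent_with_rule(s, rule)):
        return None, "failed"
    return candidate, "llm"

# Toy stand-ins: the graph editor fails, the "LLM" returns phenol.
def failing_graph_edit(molecule, rule):
    raise RuntimeError("ring edit not supported")

result = apply_rule(
    "Clc1ccccc1", "Cl->OH", failing_graph_edit,
    llm_edit=lambda m, r: "Oc1ccccc1",
    consistent_with_rule=lambda s, r: "Cl" not in s,
)
```

Returning the path label alongside the molecule keeps the record of which mechanism produced each candidate, which supports the traceability goal stated above.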

[0056] Given that large language models tend to extract high-frequency editing patterns from corpora, this invention further introduces the ChatMol pipeline as a second molecule generation branch to provide out-of-distribution structural transformations, thereby expanding the overall explorable structural space while keeping the main structure of the rule framework unchanged.

[0057] It should be noted that ChatMol is a seq2seq model for molecule-text translation based on MolT5. Its generation is not constrained by the rule format and can propose larger-step or unconventional structural variants, providing additional candidate molecules for cases that the rule-based path has difficulty covering.

[0058] Step four: closing the retrieval-feedback loop.

[0059] Single-step rule editing is sometimes constrained by the molecular structure itself, making it difficult to achieve the desired properties in one go. On the one hand, limitations on site editability and structural integrity can prevent certain theoretically feasible rules from being reliably executed on specific molecules. On the other hand, the coupling phenomenon of multiple properties can cause the same structural modification to affect multiple properties simultaneously, and the direction of some coupling effects may conflict with the expected change in the target property. Based on these factors, a single rule application may not be sufficient to form stable and reliable optimization results. This invention introduces external evidence to reposition the direction of structural modification after molecular optimization fails. This module retrieves similar molecules whose properties already meet the target requirements and maps their structural differences back to the rule base, enabling the optimization process to have an iterative reasoning ability similar to that of an expert who "corrects hypotheses based on evidence."

[0060] Before entering the self-feedback stage, the ELLM-MOM model first selects, from the candidate molecules produced by the current round of rule application, the candidate whose properties are closest to the target as the benchmark molecule for the next round of retrieval and correction. Single-attribute and multi-attribute tasks each use their own scoring function.

[0061] Single-attribute task: From the candidate rule set $X$, the ELLM-MOM model prioritizes the rule that produces the largest change in the target direction. If no rule improves the target direction, it selects the rule with the smallest absolute change among all candidates to achieve the most conservative update. The selection process is defined as follows:

[0062] $$i^{*} = \begin{cases} \arg\max_{i \in X^{+}} d\,\delta_i, & X^{+} \neq \varnothing \\ \arg\min_{i \in X} \lvert \delta_i \rvert, & X^{+} = \varnothing \end{cases}$$

[0063] where $i$ is the general index of a candidate rule, $i^{*}$ is the index of the finally selected optimal rule, and $\delta_i$ is the change produced by the $i$-th rule on the target attribute. The feasible rule subset $X^{+}$ is defined as follows:

[0064] $$X^{+} = \{\, i \in X \mid d\,\delta_i > 0 \,\}$$

[0065] where $d \in \{+1, -1\}$ indicates the expected direction of change of the target attribute, $X$ is the set of candidate rules obtained in the current round of selection, and $X^{+}$ includes only those rules that produce a positive change in the target direction. This strategy of "choosing the maximum advance if there is a solution, and the minimum perturbation if there is no solution" avoids introducing excessively large and unnecessary structural changes in single-attribute optimization while maintaining directional consistency.
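The "maximum advance if feasible, otherwise minimum perturbation" strategy can be sketched in a few lines. The function and argument names are illustrative; `changes` maps each candidate rule to the change it produces on the target attribute, and `direction` is +1 or -1.

```python
def select_single_attribute(changes, direction):
    """Return the key of the selected rule.

    changes:   {rule_id: change in the target attribute}
    direction: +1 to increase the attribute, -1 to decrease it
    """
    if not changes:
        raise ValueError("no candidate rules")
    # Feasible subset: rules that move the attribute in the target direction.
    feasible = {rule: delta for rule, delta in changes.items() if direction * delta > 0}
    if feasible:
        return max(feasible, key=lambda rule: direction * feasible[rule])
    # No rule improves the target: take the most conservative update.
    return min(changes, key=lambda rule: abs(changes[rule]))

best_up = select_single_attribute({"r1": 0.3, "r2": 0.8, "r3": -0.5}, +1)
best_down = select_single_attribute({"r1": 0.3, "r2": 0.8, "r3": -0.5}, -1)
fallback = select_single_attribute({"r1": -0.4, "r2": -0.1}, +1)
```

In the last call no rule increases the attribute, so the rule with the smallest absolute change is chosen.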

[0066] Multi-attribute task: When multiple attributes need to be optimized simultaneously, suppose there are $K$ attributes in total, and let $\delta_{i,k}$ denote the change produced by the $i$-th candidate rule on the $k$-th attribute. Given the target direction $d_k$ and the target change $\tau_k$ for each attribute, three components are calculated for each candidate rule: the correct-direction component, the forward progress, and the reverse deviation.

[0067] The correct-direction component $C_i$ is defined as:

[0068] $$C_i = \sum_{k=1}^{K} \mathbb{1}\left[ d_k\,\delta_{i,k} > 0 \right]$$

[0069] where $k$ indexes the attributes to be optimized, $\delta_{i,k}$ is the change in the value of the $k$-th attribute after the $i$-th rule takes effect, and $\mathbb{1}[\cdot]$ is the indicator function, which returns 1 if the condition is true and 0 if it is false. The term $\mathbb{1}[d_k\,\delta_{i,k} > 0]$ indicates that, for candidate $i$, the change in the $k$-th attribute is consistent with expectations; summing over $k$ yields the total number of attributes of candidate $i$ that change in the expected direction.

[0070] The forward progress $P_i$ is defined as:

[0071] $$P_i = \sum_{k:\, d_k \delta_{i,k} > 0} \min\left( 1, \frac{\lvert \delta_{i,k} \rvert}{\tau_k} \right)$$

[0072] where $\tau_k$ is the target change threshold for the $k$-th attribute. Only attributes that change in the correct direction are included, and for each of them the ratio of the actual change to the target change measures how far the attribute has progressed towards the target threshold. When the actual change exceeds the target change, the ratio is truncated to 1 to prevent the change in one attribute from accumulating bias across multiple rounds of optimization and dominating the overall decision.

[0073] The reverse deviation $D_i$ is defined as:

[0074] $$D_i = \sum_{k:\, d_k \delta_{i,k} < 0} \frac{\lvert \delta_{i,k} \rvert}{\tau_k}$$

[0075] This term accumulates the deviation of all attributes that change against the expected direction, again using the ratio of the actual change to the target change as the degree of deviation.

[0076] The rule with the highest overall score is selected as follows:

[0077] $$i^{*} = \arg\max_{i \in X} S\left( C_i, P_i, D_i \right)$$ where the overall score $S$ increases with $C_i$ and $P_i$ and decreases with $D_i$.

[0078] This selection strategy, through the combined effect of directional consistency, goal advancement, and deviation penalty, enables multi-attribute optimization to achieve a robust compromise solution among conflicting attributes.
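The following sketch computes the three components and selects a rule. Two points are assumptions rather than statements of the patent: the components are plain (unnormalized) sums over attributes, and the overall score is the unweighted combination "correct count + forward progress - reverse deviation". The actual combination in the patent may use different weights or normalization.

```python
def score_components(changes, directions, thresholds):
    """Return (correct count, forward progress, reverse deviation) for one rule.

    changes:    {attribute: change produced by the rule}
    directions: {attribute: +1 or -1}
    thresholds: {attribute: positive target change}
    """
    correct = progress = deviation = 0.0
    for attribute, delta in changes.items():
        signed = directions[attribute] * delta
        ratio = abs(delta) / thresholds[attribute]
        if signed > 0:
            correct += 1
            progress += min(1.0, ratio)   # truncate at 1 once the target is reached
        elif signed < 0:
            deviation += ratio            # penalty for moving the wrong way
    return correct, progress, deviation

def select_multi_attribute(candidates, directions, thresholds):
    """Assumed overall score: correct + progress - deviation (unweighted)."""
    def overall(rule):
        correct, progress, deviation = score_components(candidates[rule], directions, thresholds)
        return correct + progress - deviation
    return max(candidates, key=overall)

directions = {"solubility": +1, "hbond_donor": +1}
thresholds = {"solubility": 0.5, "hbond_donor": 1.0}
candidates = {
    "r1": {"solubility": 0.8, "hbond_donor": -1.0},   # overshoots one, reverses the other
    "r2": {"solubility": 0.25, "hbond_donor": 1.0},   # moderate progress on both
}
best = select_multi_attribute(candidates, directions, thresholds)
```

Here the balanced rule wins over the rule that overshoots one attribute while reversing the other, which is the compromise behavior the text describes.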

[0079] When a round of molecular editing fails to achieve the desired result, the ELLM-MOM model searches the database for similar molecules to obtain suitable structural examples as "reference solutions." To make the similarity search both interpretable and accurate, this invention constructs molecular embeddings that combine 1D, 2D, and 3D modalities. Specifically, the 1D embedding is computed from the substructures obtained by BRICS fragmentation and encoded with the ChemBERTa model, emphasizing the semantics of molecular sequence fragments and local functional groups; the 2D embedding is computed from the local environments corresponding to the on bits of ECFP4, using the mol2vec model to extract topological neighborhood features; and the 3D embedding comes from the Uni-Mol model and captures the spatial configuration and three-dimensional shape relationships between atoms. The three embeddings are semantically complementary, focusing on sequence fragments, two-dimensional skeletons, and three-dimensional conformations respectively, and are concatenated to form the final embedded representation of the molecule. Cosine similarity is used as the measure of inter-molecular similarity so that comparisons are independent of molecular size.
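The retrieval step can be sketched with plain lists standing in for the three embeddings. The vectors, molecule identifiers, and the `meets_target` predicate below are toy assumptions; real embeddings would come from the ChemBERTa, mol2vec, and Uni-Mol models.

```python
import math

def concat_embedding(emb_1d, emb_2d, emb_3d):
    """Concatenate the three modality embeddings into one vector."""
    return list(emb_1d) + list(emb_2d) + list(emb_3d)

def cosine_similarity(u, v):
    """Cosine similarity; insensitive to vector magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_reference(query, database, meets_target):
    """Most similar database molecule among those that already meet the target."""
    scored = [
        (cosine_similarity(query, embedding), mol_id)
        for mol_id, embedding in database.items()
        if meets_target(mol_id)
    ]
    if not scored:
        return None
    similarity, mol_id = max(scored)
    return mol_id, similarity

query = concat_embedding([1, 0], [0, 1], [1, 1])
database = {
    "A": [1, 0, 0, 1, 1, 1],   # identical to the query but fails the property target
    "B": [1, 0, 0, 1, 1, 0],
    "C": [0, 1, 1, 0, 0, 0],
}
result = retrieve_reference(query, database, meets_target=lambda mol_id: mol_id in {"B", "C"})
```

Filtering on the property target before ranking ensures that the returned reference is a successful example and not simply the nearest neighbor.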

[0080] It should be noted that BRICS (Breaking of Retrosynthetically Interesting Chemical Substructures) is a molecular fragmentation algorithm. It cleaves the bonds of a small molecule according to retrosynthetic rules, separating the molecule into chemically meaningful and recombinable substructure or functional group fragments. Unlike random fragmentation, it preserves the semantics of local functional groups and the sequence fragment features of the molecule. The ChemBERTa model originates from Chithrananda S, Grand G, Ramsundar B. ChemBERTa: large-scale self-supervised pretraining for molecular property prediction[J]. arXiv preprint arXiv:2010.09885, 2020. ECFP4 (Extended Connectivity Fingerprint 4) is a hashed fingerprint used to characterize the two-dimensional topological features of molecules; the number 4 denotes the maximum diameter, in bonds, of the atom neighborhoods considered (equivalent to a radius of 2), and an on bit is a fingerprint bit with value 1, corresponding to a characteristic local topological fragment in the molecule. The mol2vec model comes from Jaeger S, Fulle S, Turk S. Mol2vec: unsupervised machine learning approach with chemical intuition[J]. Journal of Chemical Information and Modeling, 2018, 58(1): 27-35. The Uni-Mol model comes from Zhou G, Gao Z, Ding Q, et al. Uni-Mol: A universal 3D molecular representation learning framework[C]//The Eleventh International Conference on Learning Representations. 2023.

[0081] After retrieving a reference molecule that simultaneously meets the requirements for the change in the target property and is most similar to the failed molecule, this invention extracts the shared portion of the two molecules by extracting the maximum common substructure (MCS), and then removes the shared portion from each molecule to obtain their respective unique structural fragments. The unique portion of the reference molecule can be regarded as the "structural increment of successful optimization," while the unique portion of the failed molecule is the "structural feature that needs to be corrected." Therefore, these two fragments can be mapped to existing structural editing operations in the rule base, so that their structural differences fall exactly into the interpretable transformation space encoded by the rule base, thereby explaining "why the reference molecule can guide the next modification direction." Then, these rules are selected and applied based on the optimization results of the previous step, with the specific process being the same as in the previous section. If the optimization is still unsuccessful, the ELLM-MOM model will continue to repeat the closed-loop process of "retrieval-mapping-reapplication" until the preset property threshold or the upper limit of the number of iterations is reached.
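The closed loop of "retrieval, mapping, reapplication" can be sketched as a bounded iteration. All five callables are hypothetical placeholders for the modules described above, and the integer "molecules" in the demo exist only to make the control flow runnable.

```python
def optimize_with_feedback(molecule, propose_rules, apply_rules, meets_target,
                           retrieve_reference, map_difference_to_rules,
                           max_iterations=5):
    """Iterate until the property target is met or the iteration cap is reached.

    Returns (optimized molecule or None, trace of (iteration, rules, candidate)).
    """
    trace = []
    rules = propose_rules(molecule)
    for iteration in range(max_iterations):
        candidate = apply_rules(molecule, rules)
        trace.append((iteration, rules, candidate))
        if meets_target(candidate):
            return candidate, trace
        # Failure: retrieve a successful similar molecule and map the
        # structural difference back onto rules for the next round.
        reference = retrieve_reference(candidate)
        rules = map_difference_to_rules(candidate, reference)
        molecule = candidate
    return None, trace

# Toy demo: "molecules" are integers and the target is a value of at least 3.
final, trace = optimize_with_feedback(
    0,
    propose_rules=lambda m: 1,
    apply_rules=lambda m, step: m + step,
    meets_target=lambda m: m >= 3,
    retrieve_reference=lambda m: 3,
    map_difference_to_rules=lambda m, ref: ref - m,
)
```

The returned trace records which rules were applied in each round, which is the kind of record that makes the optimization path auditable.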

[0082] Experimental verification:

[0083] Benchmark: The text-based molecular editing tasks from the MoleculeSTM model are used, comprising 16 single-attribute editing tasks and 12 two-attribute editing tasks. The single-attribute editing tasks cover requirements to increase or decrease five molecular properties at different thresholds Δ. The evaluation metric is the hit rate (%): taking the input molecule as the reference, a hit is counted when the generated molecule changes the corresponding property in the preset target direction and the magnitude of the change exceeds Δ.
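As an illustration of the hit-rate definition above, the following Python sketch computes the metric for a single-attribute task. The function name `hit_rate`, its parameters, and the property values are made up for the example; encoding the target direction as +1 (increase) or -1 (decrease) is an assumption of this sketch.

```python
from typing import Sequence


def hit_rate(inputs: Sequence[float], outputs: Sequence[float],
             direction: int, delta: float) -> float:
    """Percentage of molecules whose property moved in `direction` by more than `delta`.

    `inputs[i]` is the property of the i-th input molecule and `outputs[i]` the
    property of the molecule generated from it.
    """
    if len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must have the same length")
    if not inputs:
        return 0.0
    # A hit needs the right sign and a magnitude strictly larger than delta.
    hits = sum(1 for x, y in zip(inputs, outputs) if direction * (y - x) > delta)
    return 100.0 * hits / len(inputs)


# Made-up LogP values for four molecules; the task is "decrease by more than 0.5".
before = [2.0, 3.0, 1.0, 4.0]
after = [1.2, 2.9, 1.5, 3.0]
print(hit_rate(before, after, direction=-1, delta=0.5))  # 50.0
```

For a two-attribute task, the same test would be applied to each attribute and a hit counted only when both conditions hold; that extension is not shown here.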

[0084] The test set is taken from the text-based molecular editing experiment of the MoleculeSTM model. It consists of 200 molecules randomly selected from the ZINC database together with their corresponding text prompts. This invention uses only the molecules of the test set.

[0085] It should be noted that the MoleculeSTM model is derived from Liu S, Nie W, Wang C, Lu J, Qiao Z, Liu L, et al. Multi-modal molecule structure–text model for text-based retrieval and editing. Nat Mach Intell. 2023 Dec;5(12):1447–57.

[0086] The baselines cover two main learning paradigms. The first is based on a pre-trained molecular model with downstream fine-tuning. These baselines are derived from the MoleculeSTM model, which uses MegaMolBART to encode SMILES and GIN and GraphMVP to encode molecular graphs. The specific strategies include MoleculeSTM(SMILES), MoleculeSTM(Graph), Random, PCA, High-Variance, and GS-Mutate. The second is a large language model paradigm that requires no parameter updates and is based on in-context learning and retrieval augmentation. Its baseline is the ChatDrug model, which directly uses a large language model to generate molecules and combines similar-molecule retrieval with feedback to form an iterative optimization process.

[0087] It should be noted that the ChatDrug model is derived from Liu S, Wang J, Yang Y, Wang C, Liu L, Guo H, et al. ChatGPT-powered Conversational Drug Editing Using Retrieval and Domain Feedback.

[0088] Experimental setup: To ensure fairness, this invention and the ChatDrug model use the same hyperparameter settings, i.e., 5 candidate rules are selected per round. For multi-attribute optimization, because an intersection operation is involved, 10 candidate rules are selected for each attribute and the intersection is used to retain 5 candidate rules. At most 2 rounds of optimization are allowed. Each attribute has a rule base containing 300 rules. Each optimization objective is run 5 times under the above task settings, and the mean and standard deviation are calculated. The large language model used is ChatGPT-3.5-turbo.
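The candidate selection and the run statistics in this setup can be sketched in Python as follows. The text does not state how the intersection is reduced to 5 rules or which standard deviation is reported, so two assumptions are made here: shared rules are ordered by their summed rank positions, and the sample standard deviation is used. The names `shortlist` and `summarize`, the rule identifiers, and the numbers are hypothetical.

```python
from statistics import mean, stdev
from typing import Dict, List, Sequence, Tuple


def shortlist(ranked: Dict[str, List[str]], per_attribute: int = 10,
              keep: int = 5) -> List[str]:
    """Keep up to `keep` rules that appear in every attribute's top `per_attribute`."""
    if not ranked:
        return []
    tops = [rules[:per_attribute] for rules in ranked.values()]
    common = set(tops[0]).intersection(*tops[1:])
    # Assumption: order shared rules by summed rank position (lower is better),
    # breaking ties by rule id so the result is deterministic.
    return sorted(common, key=lambda r: (sum(t.index(r) for t in tops), r))[:keep]


def summarize(hit_rates: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated runs (needs >= 2 runs)."""
    return mean(hit_rates), stdev(hit_rates)


# Made-up rankings for two attributes, best rule first.
ranked = {"LogP": ["r1", "r2", "r3", "r4"], "tPSA": ["r3", "r1", "r5", "r2"]}
print(shortlist(ranked, per_attribute=3, keep=2))  # ['r1', 'r3']

# Made-up hit rates (%) from 5 runs of one optimization objective.
avg, sd = summarize([60.0, 62.0, 61.0, 59.0, 63.0])
print(avg, round(sd, 3))  # 61.0 1.581
```

With the settings stated above, `shortlist` would be called with its defaults (`per_attribute=10`, `keep=5`); if fewer than 5 rules are shared, the sketch simply returns the ones that are.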

[0089] The experimental results are shown in Tables 1 and 2. The ELLM-MOM model has a significant advantage in absolute performance. Specifically, it achieved the highest average success rate in 13 of the 16 single-attribute optimization tasks and ranked second in 2 of the remaining 3 tasks; only one task had a standard deviation slightly higher than the baseline. In 9 of the 12 two-attribute optimization tasks it achieved the best mean result, and the standard deviations of these tasks were all lower than those of the ChatDrug model. These results indicate that the ELLM-MOM model not only outperforms the baseline methods in average performance but also has advantages in best-case results and volatility control, exhibiting higher success rates and stronger stability.

[0090] Table 1: Results of the single-target small molecule editing task

[0091]

[0092] Table 1 (continued): Results of the single-target small molecule editing task

[0093]

[0094] Table 2: Results of the multi-target small molecule editing task

[0095]

[0096] Table 2 (continued): Results of the multi-target small molecule editing task

[0097]

[0098] From a methodological design perspective, the improved performance and stability of the ELLM-MOM model can be attributed to two core mechanisms. First, whereas the ChatDrug framework relies directly on a large language model to generate molecules, the ELLM-MOM model employs a binary classification scoring scheme based on property influence mapping: it ranks rules across batches within a unified contribution space and then edits molecules directly using specific rules. As a result, the mean of the ELLM-MOM model is significantly higher than that of the baseline and its standard deviation is lower. By making the decision criteria explicit, this approach avoids, to some extent, the randomness and uninterpretability of molecules generated by large language models. Second, compared with fine-tuned models such as MoleculeSTM, the ELLM-MOM model uses a rule-based, retrieval-driven self-feedback mechanism to map failed samples back to interpretable sequences of structural operations and iteratively corrects the optimization direction, continuously improving the success rate without relying on parameter updates. This is particularly evident in the two-attribute tasks: when the baseline model suffers a significant performance drop due to property coupling, the ELLM-MOM model still maintains a higher mean and higher best-case values, with smaller variance. This indicates that rule-level coordination does alleviate the optimization difficulties caused by multi-attribute coupling. In essence, using rules as an intermediate form enables the ELLM-MOM model to exclude from the search space those editing directions that improve only a single attribute while degrading the others.
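The rule-level coordination described above can be illustrated with a small Python sketch that discards rules which improve one attribute while moving another against its target. This is a simplified stand-in and not the scoring formula of this invention: the names `correct_direction_ratio` and `coordinated`, the rule identifiers, and the property changes are hypothetical, and target directions are assumed to be encoded as +1 or -1.

```python
from typing import Dict, List


def correct_direction_ratio(changes: Dict[str, float],
                            directions: Dict[str, int]) -> float:
    """Fraction of target attributes that a rule moves in the desired direction."""
    if not directions:
        raise ValueError("at least one target attribute is required")
    moved = sum(1 for attr, d in directions.items() if d * changes.get(attr, 0.0) > 0)
    return moved / len(directions)


def coordinated(rule_changes: Dict[str, Dict[str, float]],
                directions: Dict[str, int]) -> List[str]:
    """Drop rules that push any target attribute against its desired direction."""
    return [
        rule
        for rule, changes in rule_changes.items()
        if all(d * changes.get(attr, 0.0) >= 0 for attr, d in directions.items())
    ]


# Made-up task: lower LogP and raise the hydrogen bond acceptor count (HBA).
directions = {"LogP": -1, "HBA": +1}
rule_changes = {
    "rule_a": {"LogP": -0.4, "HBA": 1.0},   # helps both attributes
    "rule_b": {"LogP": -0.9, "HBA": -1.0},  # helps LogP, hurts HBA
    "rule_c": {"LogP": 0.3, "HBA": 2.0},    # helps HBA, hurts LogP
}
print(coordinated(rule_changes, directions))  # ['rule_a']
print(correct_direction_ratio(rule_changes["rule_b"], directions))  # 0.5
```

A hard filter is used here only for clarity; a scoring approach would instead penalize the reverse movement and rank all rules by a combined score.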

[0099] Table 3 presents examples from six molecular optimization tasks, each showing a successful case after two rounds of optimization. For instance, in Task 1, the ELLM-MOM model selects the rule +OH@meta in the first step: a carbon atom bearing a side chain on the aromatic ring is chosen as the anchor point, and a hydroxyl group is introduced at the meta position relative to this anchor to obtain the intermediate molecule. This structural change directly increases the molecule's polarity and hydrogen-bonding ability and thus markedly reduces LogP, but the reduction does not meet the threshold requirement. In the next step, the ELLM-MOM model adds a chlorine atom at the ortho position (+Cl@ortho), because for this molecule +Cl@ortho can induce local skeletal reconstruction, yielding the modified molecule. After this edit, the aromatic ring did indeed change to a non-aromatic saturated ring, making the polar functional groups more likely to interact with the solvent, which lowers LogP further and satisfies the requirement. In Table 5e, the first step of the ELLM-MOM model introduces an oxygen-containing substituent at the para position of the aromatic ring, directly adding a hydrogen bond acceptor site to the molecule and yielding an intermediate molecule. The second step selects the +NO2@meta rule. This functional group both provides hydrogen bond acceptors and significantly increases molecular polarity, ultimately producing a molecule that satisfies the constraints. These cases demonstrate that the ELLM-MOM model can achieve synergistic regulation of properties through a series of structural modifications with clear chemical significance.

[0100] It should be noted that in Figures 2 to 19, the shaded areas indicate the locations where the molecules are modified.

[0101] Table 3: Visualization results of the 6 tasks

[0102]

[0103] Table 3 (continued): Visualization results of the 6 tasks

[0104]

[0105] Table 3 (continued): Visualization results of the 6 tasks

[0106]

[0107] The proposed ELLM-MOM model formulates molecular optimization as an explicit decision-making process centered on rule-based reasoning and executable structural editing. Compared with the implicit mapping paradigm of end-to-end molecular optimization, it achieves stable and interpretable optimization behavior without relying on parameter updates. The experimental results show that, by using rules as an intermediate form, the ELLM-MOM model transforms the implicit structure-property knowledge in the large language model into reusable editing hypotheses, so that the optimization behavior no longer depends on implicit parameter mapping but is based on explicit structural operations. In addition, the ELLM-MOM model introduces a rule ranking mechanism based on intermediate physicochemical property mapping, which reduces the judgment of rule applicability to a series of context-independent binary inferences, thereby forming a unified basis for judging property contributions and alleviating the scoring drift caused by context constraints in large language models. Finally, when a single round of editing fails to achieve the expected goal, the self-feedback mechanism of the ELLM-MOM model maps the structural differences of external successful molecules back to the rule space, guiding subsequent optimization to be corrected step by step within the range of interpretable structural operations.

[0108] The above are merely preferred embodiments of the invention and are not intended to limit the invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the invention should be included within the protection scope of the invention.

Claims

1. An interpretable molecular optimization method based on a large language model, characterized in that it includes the following steps:
S10: Obtain the original molecule set, perform standardization processing and valid-molecule screening on the original molecule set, and output the standardized molecule set and the optimization task test molecule set;
S20: Perform multimodal molecular embedding generation and LLM-based initial rule generation on the standardized molecule set, and output the multimodal molecular embedding vector set and the initial molecular editing rule set;
S30: Perform semantic embedding and clustering deduplication on the initial molecular editing rule set, output a deduplicated representative rule set, and then extract high-frequency molecular fragments based on the standardized molecule set to output the final molecular editing rule base;
S40: Standardize the data of the final molecular editing rule base and output the structured rule metadata table;
S50: Based on the optimization task test molecule set and the structured rule metadata table, score and rank the rules for the specified target optimization attributes, and output a multi-attribute candidate rule ranking table;
S60: Based on the optimization task test molecule set and the multi-attribute candidate rule ranking table, and combined with the standardized molecule set, perform chemical legality verification and screen the optimal reference molecule, and output the benchmark molecule to be optimized;
S70: Based on the benchmark molecule to be optimized, the standardized molecule set, and the multimodal molecular embedding vector set, perform similar-molecule retrieval and output successful reference molecules;
S80: Extract the shared backbone and split the unique fragments from the benchmark molecule to be optimized and the successful reference molecule, match and map them against the final molecular editing rule base, and output the correction editing rule set;
S90: Based on the benchmark molecule to be optimized and the correction editing rule set, perform molecular structure editing and property compliance assessment, and output the optimized final molecule set after iteration.

2. The interpretable molecular optimization method based on a large language model according to claim 1, characterized in that the specific steps for standardizing the original molecule set and outputting the standardized molecule set are as follows: duplicate entries are removed from the original molecule set, the valid entries are retained, and the original set of composite molecules is output; the original set of composite molecules is standardized using the RDKit tool, the validity of valence states is verified, and invalid structures are removed, resulting in the output of the standardized molecule set.

3. The interpretable molecular optimization method based on a large language model according to claim 1, characterized in that the specific steps for generating multimodal molecular embeddings from the standardized molecule set and outputting the multimodal molecular embedding vector set are as follows: the standardized molecule set is processed by 1D fragment embedding, 2D topological embedding, and 3D conformation embedding, and the 1D fragment, 2D topology, and 3D conformation embeddings are then concatenated and fused to output the multimodal molecular embedding vector set.

4. The interpretable molecular optimization method based on a large language model according to claim 1, characterized in that, based on the optimization task test molecule set and the structured rule metadata table, the specific steps for scoring and ranking for the specified target optimization attributes and outputting the multi-attribute candidate rule ranking table are as follows: binary property determination is performed on the specified target optimization attribute, and a property-attribute mapping table is output; based on the optimization task test molecule set, the structured rule metadata table, and the property-attribute mapping table, rule direction matching, trigger fragment existence verification, and binary determination of attribute influence are performed, the rule property contribution score is calculated and ranked, and a single-attribute candidate rule ranking table is output; based on the single-attribute candidate rule ranking table and the property-attribute mapping table, the rules are merged, the correct direction ratio, forward propulsion, and reverse deviation are calculated, a comprehensive score is obtained and ranked, and the multi-attribute candidate rule ranking table is output.

5. The interpretable molecular optimization method based on a large language model according to claim 4, characterized in that, based on the optimization task test molecule set and the multi-attribute candidate rule ranking table, and combined with the standardized molecule set, the specific steps for performing chemical legality verification, screening the optimal reference molecule, and outputting the benchmark molecule to be optimized are as follows: based on the optimization task test molecule set and the multi-attribute candidate rule ranking table, and combined with the standardized molecule set, structural editing is performed using the RDKit tool and failed edit entries are corrected; the search space is then expanded by generating out-of-distribution structural variants using the ChatMol model, and the edited candidate molecule set is output; based on the edited candidate molecule set, the molecule with the largest or smallest positive change is selected in the single-attribute task, the molecule with the highest comprehensive score is selected in the multi-attribute task, and the benchmark molecule to be optimized is output.

6. The interpretable molecular optimization method based on a large language model according to claim 5, characterized in that, based on the edited candidate molecule set, the specific steps for selecting the molecule with the largest or smallest positive change in the single-attribute task are as follows: within the entire edited candidate molecule set, the rule that produces the largest change in the target direction is preferentially selected; if no rule can produce an improvement in the target direction, the rule with the smallest absolute change is selected from all candidates; the above selection process is defined as follows:
$$ i^{*} = \begin{cases} \arg\max_{i \in \mathcal{F}} \; d \cdot \Delta_i, & \mathcal{F} \neq \emptyset \\ \arg\min_{i \in \mathcal{R}} \; |\Delta_i|, & \mathcal{F} = \emptyset \end{cases} $$
where $i$ is the general index of the candidate rules; $i^{*}$ is the index of the finally selected optimal rule; $\Delta_i$ is the amount of change that the $i$-th rule produces in the target attribute; the subset of feasible rules $\mathcal{F}$ is defined as follows:
$$ \mathcal{F} = \{\, i \in \mathcal{R} \mid d \cdot \Delta_i > 0 \,\} $$
$d \in \{+1, -1\}$ indicates the expected direction of change of the target attribute; and $\mathcal{R}$ is the set of candidate rules obtained in the current round of screening.

7. The interpretable molecular optimization method based on a large language model according to claim 6, characterized in that, based on the edited candidate molecule set, the specific steps for selecting the molecule with the highest comprehensive score in the multi-attribute task are as follows: when multiple attributes need to be optimized simultaneously, a total number of candidate rules is given, and the change that each rule produces in an attribute is defined as: ; given the target direction and the target change of each attribute, three components are calculated for each candidate rule: the correct direction ratio, the forward propulsion, and the reverse deviation; the correct direction ratio is defined as: ; where k is the index of the attributes to be optimized, the change term is the change in the value of the k-th attribute after the i-th rule takes effect, and the indicator function returns 1 if its condition is true and 0 if it is false; the forward propulsion is defined as: ; where the threshold term is the target change threshold of the k-th attribute; the reverse deviation is defined as: ; the rule with the highest comprehensive score is selected as follows: .

8. The interpretable molecular optimization method based on a large language model according to claim 1, characterized in that the specific steps for extracting the shared backbone and splitting the unique fragments from the benchmark molecule to be optimized and the successful reference molecule are as follows: the maximum common substructure of the benchmark molecule to be optimized and the successful reference molecule is extracted as the shared backbone; the shared backbone is removed from the benchmark molecule to be optimized and from the successful reference molecule, respectively, and the unique fragments of each molecule are output.