A molecular generation model construction method and a molecular generation method
By unifying the multimodal fusion of protein pocket structures, molecular fragments, and text descriptions in a molecular generation model, and employing cross-modal attention and reinforcement learning training, this approach addresses the problem that existing models cannot fully utilize multi-source information, achieving efficient and controllable molecular generation that is adaptable to the multi-constraint design of real-world drug development.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING STONEWISE TECH CO LTD
- Filing Date
- 2026-05-19
- Publication Date
- 2026-06-19
AI Technical Summary
Existing molecular generation models cannot fully utilize multi-source information, are difficult to adapt to the complex and multi-constrained design requirements in real drug development, and lack a unified multimodal fusion mechanism, which limits the controllability and chemical rationality of generated molecules.
A molecular generation model is constructed that unifies protein pocket structures, molecular fragments, and text descriptions into a single large language model. A cross-modal attention mechanism is used to achieve deep fusion of multi-dimensional information, and a three-stage reinforcement learning training strategy is adopted to improve the model's control capability and generation accuracy.
It significantly improves the controllability and accuracy of molecular generation, enabling simultaneous understanding of the protein's geometric environment, existing molecular structural features, and textual drug design intent, thus adapting to the complex and multi-constrained requirements of real drug development.
Figure CN122245513A_ABST
Description
Technical Field
[0001] This invention relates to the field of molecular generation technology, specifically to a method for constructing a molecular generation model and a molecular generation method.
Background Technology
[0002] Molecular generation is a crucial step in AI-assisted drug design. It aims to automatically explore the vast chemical space for candidate molecules with novel structures, good drug-likeness, high synthetic feasibility, and potential biological activity. With the breakthroughs of deep learning, and of language models in particular, in sequence modeling and generation, numerous studies have attempted to reformulate the molecular design problem as a "sequence generation" task to improve molecular generation efficiency and model expressive power.
[0003] However, real-world drug design tasks typically require information from multiple sources to be considered simultaneously. For example, medicinal chemists may provide specific lead compound fragments, restricted groups, and natural language text describing desired properties (such as "improved solubility" or "fitting hydrophobic pockets"); protein structure information may come from PDB files (the file format of the Protein Data Bank, a database of three-dimensional structures of proteins and other macromolecules) at different resolutions, from predicted structures, or from molecular dynamics simulations; meanwhile, existing molecular fragments often need to be retained, replaced, or optimized. This information is inherently multimodal, but current molecular generation models lack a unified multimodal fusion mechanism, making it difficult to process protein structures, molecular fragments, and natural language descriptions simultaneously within the same generation framework. This limits the controllability and chemical rationality of the generated molecules.
Summary of the Invention
[0004] This invention provides a method for constructing a molecular generation model and a molecular generation method to solve the problem that existing molecular generation models cannot fully utilize multi-source information to improve generation quality and are difficult to adapt to the complex and multi-constrained design requirements in real drug development.
[0005] In a first aspect, the present invention provides a method for constructing a molecular generation model, the method comprising: obtaining a multimodal sample dataset that includes several protein sample groups, where each protein sample group includes the pocket structure information of a target protein, the information of a ligand molecule that binds to the target protein pocket, the textual description information of the target protein pocket, and the molecular fragment constraint information; and training a language learning model based on the multimodal sample dataset to obtain a molecular generation model.
[0006] This invention unifies protein pocket structures, molecular fragments, and textual descriptions into a single large language model. Training the language learning model based on this multimodal sample dataset allows the model to fully integrate multi-dimensional conditional information within a unified training framework. This effectively improves the model's control over the molecular generation process and the accuracy of structure prediction, enabling the generated molecules to efficiently bind to the target protein pocket while meeting semantic requirements and structural constraints. This effectively adapts to the complex and multi-constrained design requirements of real-world drug development. Compared to traditional models that rely solely on structure or sequence, the molecular generation model provided in this embodiment can simultaneously understand the protein's geometric environment, existing molecular structural features, and the medicinal design intent implied in the text, significantly improving the controllability and accuracy of molecular generation.
[0007] In one optional implementation, training a language learning model based on a multimodal sample dataset includes: performing feature encoding on each type of modal data to obtain the feature data corresponding to each type of modal data, where pocket structure information, ligand molecule information, text description information, and molecular fragment constraint information are each one type of modal data; performing self-attention enhancement on the feature data corresponding to each type of modal data to obtain the enhanced features corresponding to each type of modal data; performing cross-modal attention alignment based on the enhanced features of each type of modal data to obtain multimodal fusion data; and training the language learning model based on the multimodal fusion data.
[0008] In this embodiment, a multimodal molecular generation framework maps protein pocket structure representation, molecular fragment representation, and medicinal chemistry text description to a unified semantic embedding space. Through a cross-modal attention mechanism, the three types of information are deeply integrated in the generation process, enabling the generation model to simultaneously perceive spatial constraints, chemical skeleton limitations, and textual semantic goals.
[0009] In one alternative implementation, cross-modal attention alignment is performed based on the enhanced features of each type of modality data to obtain multimodal fusion data, including: The enhanced features corresponding to the pocket structure information and the enhanced features corresponding to the molecular fragment constraint information are fused through cross-modal attention to obtain the structure fusion data; Multimodal fusion data is obtained by performing cross-modal attention fusion between the structural fusion data and the enhanced features corresponding to the text description information.
[0010] This implementation supports both a single modality as the query vector and multiple other modalities as key and value vectors, enabling one-time alignment and fusion of protein structure, molecular chemistry, and textual semantic information; and a progressive approach where protein and molecular fragment feature alignment is first performed within the structural modality, followed by cross-modal interaction with textual semantic features. Both methods allow different modalities to complete information interaction and feature binding within a unified space, ultimately forming a unified joint representation that includes pocket structures, fragment constraints, and textual descriptions, for subsequent molecular generation or property optimization tasks. This enables the model to understand the relationship between the "protein environment—fragment—textual constraints" in a unified manner.
[0011] In one optional implementation, feature encoding is performed on each type of modal data to obtain feature data corresponding to each type of modal data, including: The pocket structure information is compressed and encoded to obtain the feature data corresponding to the pocket structure information; The molecular fragment constraint information is encoded by a graph neural network encoder or a SMILES encoder to obtain the feature data corresponding to the molecular fragment constraint information. The text embedding module, constructed using model distillation technology, encodes text description information to obtain feature data corresponding to the text description information.
[0012] In one optional implementation, the language learning model is an improved autoregressive language model, and training the language learning model based on multimodal fusion data includes: training the improved autoregressive language model on a sequence prediction task based on the multimodal fusion data, where molecular local units are generated step by step during sequence prediction training, and at each generation step the chemical constraint decoder and chemical rule checking module with which the improved autoregressive language model is equipped evaluate the molecular local units, candidate molecules being formed after evaluation; scoring the candidate molecules based on a static reward optimization mechanism to determine the score of each candidate molecule; determining, based on a dynamic structural reward mechanism, the interaction information between each candidate molecule and its corresponding pocket structure; and updating the model parameters of the improved autoregressive language model based on the sequence prediction task loss, the scores, and the interaction information until training is complete.
[0013] This implementation adopts a three-stage reinforcement learning training strategy, which effectively improves the model's ability to learn about the interaction between molecules and pockets, and makes up for the deficiency of traditional data-driven models that cannot provide structural feedback during the generation process.
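As a rough illustration of how the three signals named above (sequence prediction loss, static score, and pocket interaction information) might be combined into one training objective, the following Python sketch uses a simple weighted sum. The stage names, the weights in `STAGE_WEIGHTS`, and the function names are hypothetical and are not specified in the source; the actual reward formulation may differ.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical per-stage weights: (static reward weight, dynamic reward weight).
# The source only states that three signals drive the update; the values are assumptions.
STAGE_WEIGHTS: Dict[str, Tuple[float, float]] = {
    "sequence": (0.0, 0.0),        # stage 1: sequence prediction only
    "static_reward": (1.0, 0.0),   # stage 2: add the static score of the candidate
    "dynamic_reward": (1.0, 1.0),  # stage 3: add the pocket interaction signal
}


def sequence_loss(token_probs: List[float]) -> float:
    """Mean negative log-likelihood of the reference tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def training_objective(token_probs: List[float], static_score: float,
                       interaction_score: float, stage: str) -> float:
    """Value to minimize: the likelihood loss minus the weighted reward terms."""
    w_static, w_dynamic = STAGE_WEIGHTS[stage]
    return (sequence_loss(token_probs)
            - w_static * static_score
            - w_dynamic * interaction_score)
```

Under these assumptions, a candidate with a higher static score or a stronger pocket interaction lowers the objective in the later stages, while the first stage reduces to plain sequence prediction.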
[0014] In one optional implementation, the improved autoregressive language model freezes the parameters of the first preset level during the sequence prediction task training phase; the improved autoregressive language model freezes the parameters of the second preset level during the static reward optimization mechanism phase; and the improved autoregressive language model freezes the parameters of the third preset level during the dynamic structural reward mechanism phase. The first preset level is lower than the second preset level, and the third preset level consists of all layers except the output layer.
[0015] By employing continuous training with a phased freezing strategy, the model can adapt smoothly to reward signals while retaining its original knowledge, achieving high-quality, chemically sound molecular generation that interacts well with the target protein pocket. This effectively avoids the gradient instability and catastrophic forgetting problems common in reinforcement learning, enabling the model to maintain structural validity and inference stability while improving performance.
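The phased freezing schedule can be sketched as a function that returns which layers are frozen in each training stage. This is a minimal sketch under stated assumptions: layers are indexed from 0 (closest to the input) to `num_layers - 1` (the output layer), a "preset level" is read as "all layers below this index", and the stage names and the function `frozen_layers` are hypothetical.

```python
from typing import List


def frozen_layers(stage: str, num_layers: int,
                  first_level: int, second_level: int) -> List[int]:
    """Return the indices of layers whose parameters are frozen in a stage.

    Assumption: a preset level k means layers 0..k-1 are frozen.
    The last index (num_layers - 1) is treated as the output layer.
    """
    if not 0 <= first_level < second_level < num_layers:
        raise ValueError("expected 0 <= first_level < second_level < num_layers")
    if stage == "sequence":
        cutoff = first_level
    elif stage == "static_reward":
        cutoff = second_level
    elif stage == "dynamic_reward":
        cutoff = num_layers - 1  # everything except the output layer
    else:
        raise ValueError(f"unknown stage: {stage!r}")
    return list(range(cutoff))


# Example: a 12-layer model with hypothetical preset levels 4 and 8.
for name in ("sequence", "static_reward", "dynamic_reward"):
    print(name, frozen_layers(name, num_layers=12, first_level=4, second_level=8))
```

In a deep learning framework, the returned indices would be used to disable gradient updates for the corresponding layers before each stage begins.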
[0016] In a second aspect, the present invention provides a molecular generation method, the method comprising: obtaining known protein information and design constraint information for a target protein, where the design constraint information includes textual description information and/or molecular fragment constraint information; and inputting the known protein information and the design constraint information into a molecular generation model constructed according to the molecular generation model construction method of the first aspect or any of its corresponding embodiments, to obtain the target molecule corresponding to the target protein.
[0017] In a third aspect, the present invention provides a molecular generation model construction apparatus, the apparatus comprising: an acquisition module for acquiring a multimodal sample dataset that includes several protein sample groups, where each protein sample group includes the pocket structure information of a target protein, the information of a ligand molecule that binds to the target protein pocket, the textual description information of the target protein pocket, and the molecular fragment constraint information; and a training module for training a language learning model based on the multimodal sample dataset to obtain a molecular generation model.
[0018] In a fourth aspect, the present invention provides an electronic device, comprising a memory and a processor that are communicatively connected to each other, wherein the memory stores computer instructions and the processor executes the computer instructions to perform the molecular generation model construction method of the first aspect or any corresponding embodiment described above.
[0019] In a fifth aspect, the present invention provides a computer-readable storage medium storing computer instructions for causing a computer to execute the molecular generation model construction method of the first aspect or any corresponding embodiment thereof.
[0020] It should be noted that the molecular generation model construction apparatus, electronic device, and computer-readable storage medium provided by this invention correspond to the molecular generation model construction method described above. Their beneficial effects are therefore the same as the corresponding beneficial effects of the molecular generation model construction method described above and are not repeated here.
Description of the Drawings
[0021] To more clearly illustrate the specific embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0022] Figure 1 is a schematic flowchart of a molecular generation model construction method according to an embodiment of the present invention; Figure 2 is a schematic diagram of the phased freezing of model parameters according to an embodiment of the present invention; Figure 3 is a schematic flowchart of a molecular generation method according to an embodiment of the present invention; Figure 4 is a structural block diagram of a molecular generation model construction apparatus according to an embodiment of the present invention; Figure 5 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Implementation
[0023] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0024] It is understood that before using the technical solutions disclosed in the various embodiments of the present invention, users should be informed of the types, scope of use, and usage scenarios of the personal information involved in the present invention and their authorization should be obtained in accordance with relevant laws and regulations through appropriate means.
[0025] Molecular generation models in related technologies rely on relatively limited data sources during training. The diversity of these data, their coverage of protein structures, and the richness of the medicinal chemistry relationships they contain are far from sufficient to support a generative model with strong generalization ability. In contrast, the pre-training of large language models typically requires at least two orders of magnitude more data. Under such data constraints, large language models often struggle to fully learn non-covalent interactions (NCIs), hydrogen bond networks, electrostatic potential distributions, hydrophobic cavities, dihedral angle features, and more complex structure-activity relationships. As a result, although these models can generate molecular structures rapidly, the generated molecules remain limited in how well they fit real pocket spaces, in the plausibility of their medicinal chemistry properties, and in their structural complexity.
[0026] In view of this, according to the embodiments of the present invention, a method for constructing a molecular generative model is provided. It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions. Furthermore, although a logical order is shown in the flowchart, in some cases, the steps shown or described may be executed in a different order than that shown here.
[0027] This embodiment provides a method for constructing a molecular generation model, which can be used on servers, terminals, and mobile terminals such as mobile phones and tablets. Figure 1 is a flowchart of a molecular generation model construction method according to an embodiment of the present invention. As shown in Figure 1, the process includes the following steps: Step S101: Obtain a multimodal sample dataset. The multimodal sample dataset includes several protein sample groups. Each protein sample group includes the pocket structure information of the target protein, the ligand molecule information that binds to the target protein pocket, the textual description information of the target protein pocket, and the molecular fragment constraint information.
[0028] In this embodiment, the multimodal sample dataset consists of multiple protein sample groups, each corresponding to one target protein and its related information. The pocket structure information of the target protein is the three-dimensional structural data of the pocket on the protein that can bind small molecules. The ligand molecule information is the structure of a known small molecule ligand that can bind to the pocket. The textual description information of the protein pocket is a natural language statement such as "able to form strong hydrogen bond interactions with the EGFR (epidermal growth factor receptor) pocket and has high binding affinity". The molecular fragment constraint information specifies the fragment that must be retained when generating the molecule.
[0029] By grouping the aforementioned multimodal information into the same protein sample group, the model can learn the structural features, textual constraints, and molecular generation rules corresponding to specific protein pockets during training. This allows the model to focus on learning the interaction features between specific protein pockets and molecules, thereby achieving drug molecule generation under multiple constraints.
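One way to picture a protein sample group is as a single record holding the four modalities. The sketch below is illustrative only: the class name `ProteinSampleGroup`, its field names, and the helper `missing_modalities` are hypothetical, and the example values are toy placeholders rather than real data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Atom = Tuple[str, float, float, float]  # (element, x, y, z)


@dataclass
class ProteinSampleGroup:
    """One protein sample group: four modalities tied to one target protein."""
    target_name: str
    pocket_atoms: List[Atom]   # pocket structure information (3D)
    ligand_smiles: str         # ligand known to bind the pocket
    text_description: str      # textual description of the pocket
    fragment_smiles: List[str] = field(default_factory=list)  # fragments to retain

    def modalities(self) -> Dict[str, object]:
        return {
            "pocket": self.pocket_atoms,
            "ligand": self.ligand_smiles,
            "text": self.text_description,
            "fragment": self.fragment_smiles,
        }


def missing_modalities(group: ProteinSampleGroup) -> List[str]:
    """List the modalities that are empty in a sample group."""
    return [name for name, value in group.modalities().items() if not value]


example = ProteinSampleGroup(
    target_name="EGFR",
    pocket_atoms=[("N", 1.0, 2.0, 3.0), ("O", 2.5, 2.0, 3.5)],  # toy coordinates
    ligand_smiles="c1ccccc1O",  # toy ligand (phenol), not a real EGFR binder
    text_description="forms strong hydrogen bond interactions with the EGFR pocket",
    fragment_smiles=["c1ccccc1"],  # toy fragment (benzene ring) to retain
)
```

Keeping all four modalities in one record makes it straightforward to check completeness before training and to feed each modality to its own encoder.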
[0030] Step S102: Train the language learning model based on the multimodal sample dataset to obtain the molecular generation model.
[0031] The language learning models in this embodiment include, but are not limited to: autoregressive language models, Transformer models, BERT-like encoder models, lightweight neural network models, recurrent neural networks (RNNs), gated recurrent units (GRUs), and long short-term memory networks (LSTMs).
[0032] When training on the multimodal sample dataset, the language learning model takes the pocket structure information, ligand molecule information, text description information, and molecular fragment constraint information of each protein sample group as input and is trained on sequence prediction. Training is repeated for multiple rounds until the model performance reaches a preset standard. The result is a molecular generation model that can generate drug molecules meeting the requirements under multimodal constraints.
[0033] In this embodiment, protein pocket structures, molecular fragments, and textual descriptions are unified into a single large language model. The language learning model is trained based on this multimodal sample dataset. This allows the model to fully integrate multi-dimensional conditional information within a unified training framework, effectively improving the model's control over the molecular generation process and the accuracy of structure prediction. The generated molecules can efficiently bind to the target protein pocket while also meeting semantic requirements and structural constraints, effectively adapting to the complex and multi-constrained design requirements of real-world drug development. Compared to traditional models that rely solely on structure or sequence, the molecular generation model provided in this embodiment can simultaneously understand the protein's geometric environment, existing molecular structural features, and the medicinal design intent implied in the text, significantly improving the controllability and accuracy of molecular generation.
[0034] On the other hand, real-world drug design tasks typically require consideration of information from multiple sources simultaneously. This information possesses inherent multimodal characteristics, but current models lack a unified multimodal fusion mechanism, making it difficult to simultaneously process protein structures, molecular fragments, and natural language descriptions within the same generative framework. This limits the controllability and chemical rationality of the generated molecules. Therefore, there is an urgent need for a unified multimodal generative framework that can integrate protein pocket structures, molecular fragment embeddings, and natural language medicinal descriptions, enabling generative models to achieve higher controllability, stronger target adaptation, and richer structural innovation in real-world drug design tasks.
[0035] Therefore, this embodiment provides a method for constructing a molecular generation model, which can be used on servers, terminals, mobile terminals, etc. The process includes the following steps: Step S201: Obtain the multimodal sample dataset. The multimodal sample dataset includes several protein sample groups. Each protein sample group includes the pocket structure information of the target protein, the ligand molecule information binding to the target protein pocket, the textual description information of the target protein pocket, and the molecular fragment constraint information. For details, refer to step S101 of the embodiment shown in Figure 1, which will not be described again here.
[0036] Step S202: Train the language learning model based on the multimodal sample dataset to obtain the molecular generation model.
[0037] Specifically, step S202 includes: Step S2021: Feature encoding is performed on each type of modal data to obtain the feature data corresponding to each type of modal data; pocket structure information, ligand molecule information, text description information, and molecular fragment constraint information are each a type of modal data.
[0038] In this embodiment, independent feature encoding sub-modules are constructed for different modalities, such as pocket structure information, ligand molecule information, molecular fragment constraint information, and text description information. In some optional implementations, step S2021 above includes: Step a1: Compress and encode the pocket structure information to obtain the feature data corresponding to the pocket structure information. In this embodiment, the three-dimensional structural information of the protein pocket is compressed and encoded, transforming it into a compact serialized representation. This encoding significantly reduces the sequence length while preserving the pocket's geometric information as much as possible, thus solving the problem of excessively long original PDB files that are difficult for language models to process.
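The source does not specify the compression scheme, so the following is only a hypothetical illustration of what a compact serialized pocket representation could look like: each pocket residue is reduced to one token carrying its residue name and its position on a coarse grid centered on the pocket. The function `compress_pocket`, the token format, and the grid size `cell` are all assumptions.

```python
from typing import List, Tuple

# (three-letter residue name, x, y, z) for one representative atom per residue
Residue = Tuple[str, float, float, float]


def compress_pocket(residues: List[Residue], cell: float = 2.0) -> List[str]:
    """Serialize pocket residues as short tokens on a coarse, centered grid.

    Centering removes the dependence on absolute coordinates, and the grid
    (cell size in angstroms) trades geometric precision for sequence length.
    """
    if not residues:
        return []
    n = len(residues)
    cx = sum(r[1] for r in residues) / n
    cy = sum(r[2] for r in residues) / n
    cz = sum(r[3] for r in residues) / n
    tokens = []
    for name, x, y, z in residues:
        gx = round((x - cx) / cell)
        gy = round((y - cy) / cell)
        gz = round((z - cz) / cell)
        tokens.append(f"{name}:{gx},{gy},{gz}")
    return tokens
```

A two-residue toy pocket thus becomes two tokens, whereas the corresponding PDB text would contain one line per atom.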
[0039] Step a2: Encode the molecular fragment constraint information using a graph neural network encoder or a SMILES encoder to obtain the feature data corresponding to the molecular fragment constraint information.
[0040] For molecular fragments used as generation conditions, this embodiment obtains a structure-level fragment representation through a graph neural network or a lightweight SMILES encoder. SMILES is a text format for representing chemical molecular structures; for example, cyclohexane is written as C1CCCCC1. The SMILES encoder converts two-dimensional or three-dimensional structural information of a molecule (such as atom types, bond types, and ring structures) into the corresponding SMILES string representation.
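For the SMILES route, a fragment string must first be split into tokens and mapped to integer ids before any embedding layer can consume it. The sketch below shows only that first step, with a simplified regular expression that covers common SMILES symbols; the names `tokenize_smiles` and `encode_fragment` are hypothetical, and the pattern is not a full SMILES grammar.

```python
import re
from typing import Dict, List

# Simplified pattern: bracket atoms, two-letter halogens, two-digit ring
# closures, single letters, digits, then bond and branch symbols.
_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+()/\\.@:~*$]")


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens, rejecting unknown characters."""
    tokens = _TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def encode_fragment(smiles: str, vocab: Dict[str, int]) -> List[int]:
    """Map fragment tokens to integer ids, growing the vocabulary as needed."""
    ids = []
    for token in tokenize_smiles(smiles):
        if token not in vocab:
            vocab[token] = len(vocab)
        ids.append(vocab[token])
    return ids


vocab: Dict[str, int] = {}
print(tokenize_smiles("C1CCCCC1"))         # cyclohexane, the example from the text
print(encode_fragment("C1CCCCC1", vocab))
```

The resulting id sequence would then be passed to an embedding layer; a graph neural network encoder would instead start from an atom and bond graph.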
[0041] Step a3: The text embedding module constructed using model distillation technology encodes the text description information to obtain the feature data corresponding to the text description information.
[0042] To enable the model to understand medicinal chemistry language and translate it into executable structure generation constraints, this embodiment introduces a text embedding module based on large model distillation. This module can convert natural language describing activity, selectivity, ADMET features, conformational preferences, electronic effects, etc., into vectorized expressions.
[0043] This text embedding module uses a pre-trained large language learning model as the teacher model and trains, through distillation, a student model with fewer parameters and higher inference efficiency. When processing textual information such as activity descriptions, the original text description is first preprocessed by tokenization and normalization and then input into the student model to extract the corresponding semantic representation vectors. The resulting text embedding vectors compress high-dimensional language features while preserving key semantic information, and serve as conditional constraint inputs in the subsequent cross-modal fusion and molecular generation processes, enabling the model to map semantic design goals into executable structure generation guidance signals. In this way, the model can not only receive structural information but also understand semantic design goals such as "improving hydrogen bond interactions with a key residue," "reducing lipophilicity to improve water solubility," and "enhancing the affinity for the pocket A region." This gives the model an explicit, controllable generation capability that is difficult to achieve with traditional molecular generation strategies based solely on structural rules.
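The teacher-student relationship can be illustrated with a deliberately tiny example in which a linear student is trained to reproduce a teacher's embedding of one text. This is a sketch under stated assumptions: the mean squared error objective, the linear student, and the names `student_embed`, `distill_loss`, and `distill_step` are illustrative choices, not the method stated in the source.

```python
from typing import List

Vector = List[float]


def student_embed(weights: List[Vector], features: Vector) -> Vector:
    """Linear student: one output value per weight row."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def distill_loss(student: Vector, teacher: Vector) -> float:
    """Mean squared error between student and teacher embeddings."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(teacher)


def distill_step(weights: List[Vector], features: Vector,
                 teacher: Vector, lr: float = 0.1) -> float:
    """One gradient descent step on the student; returns the loss before the update."""
    out = student_embed(weights, features)
    n = len(teacher)
    for i, row in enumerate(weights):
        grad_out = 2.0 * (out[i] - teacher[i]) / n  # d(loss)/d(out[i])
        for j in range(len(row)):
            row[j] -= lr * grad_out * features[j]
    return distill_loss(out, teacher)


# Toy data: 3 input features for one text, 2-dimensional teacher embedding.
weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
features = [1.0, 0.0, 1.0]
teacher_embedding = [0.5, -0.5]
losses = [distill_step(weights, features, teacher_embedding) for _ in range(50)]
```

In practice the student would be a small neural network trained over many texts, but the principle is the same: the teacher's embeddings act as regression targets for the student.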
[0044] Step S2022: Perform self-attention enhancement on the feature data corresponding to each type of modality data to obtain the enhanced features corresponding to each type of modality data.
[0045] Within each modality, a self-attention mechanism is introduced to perform weighted modeling of the sequential or node features of that modality, enabling the model to automatically focus on local structural or semantic units that are more critical to the current task, thereby completing feature enhancement and information aggregation within their respective feature spaces.
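A minimal sketch of self-attention enhancement within one modality is given below, using scaled dot-product attention in plain Python. It assumes identity query, key, and value projections and a residual connection; a trained model would use learned projection matrices, and the function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention: each query gets a weighted mix of values."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def self_attention_enhance(features: Matrix) -> Matrix:
    """Attend over the modality's own units and add the result back (residual)."""
    attended = attention(features, features, features)
    return [[x + a for x, a in zip(row, att)]
            for row, att in zip(features, attended)]
```

Each row of `features` stands for one unit of the modality, for example a pocket residue, a fragment atom, or a text token; the output keeps the same shape.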
[0046] Step S2023: Perform cross-modal attention alignment based on the enhanced features of each type of modal data to obtain multimodal fusion data.
[0047] After completing the self-attention enhancement for each modality, this embodiment further introduces a cross-modal attention module. Using one type of modal feature as the query vector and other multiple types of modal features as key and value vectors, the correlation weights between different modalities are calculated to realize the interactive modeling of protein structure information, molecular chemical information and text semantic information.
[0048] By adopting a one-time fusion approach, multimodal features can be injected into the model simultaneously in the same stage, simplifying the feature fusion process, reducing the complexity of the model structure, enabling the model to quickly complete global information interaction within a unified feature space, and improving the efficiency of feature fusion and model inference.
[0049] In some optional implementations, step S2023 includes: performing cross-modal attention fusion on the enhanced features corresponding to the pocket structure information and the enhanced features corresponding to the molecular fragment constraint information to obtain structure fusion data; and performing cross-modal attention fusion on the structure fusion data and the enhanced features corresponding to the text description information to obtain multimodal fusion data.
[0050] In this implementation, one type of modality feature is used as the query vector, and another type or more modality features are used as key and value vectors to calculate the correlation weights between different modalities. For example, the enhanced features corresponding to pocket structure information are used as query Q, and the enhanced features corresponding to molecular fragment constraint information are used as key K and value V to determine protein-fragment structure fusion data; then, the protein-fragment structure fusion features are used as query Q, and the enhanced features corresponding to text description information are used as key K and value V to determine protein-fragment-text fusion data. This embodiment adopts a progressive fusion strategy, first aligning the structure modalities (protein + fragment), and then aligning the structure + text, which can effectively preserve the feature characteristics of each modality, reduce interference and information loss between different types of features, and enable the model to learn the correlation relationships between modalities more accurately.
[0051] This embodiment supports both a single modality as the query vector and multiple other modalities as key and value vectors, achieving one-time alignment and fusion of protein structure, molecular chemistry information, and textual semantic information; and a progressive approach where protein and molecular fragment feature alignment is first completed within the structural modality, followed by cross-modal interaction with textual semantic features. Both methods enable different modalities to complete information interaction and feature binding within a unified space, ultimately forming a unified joint representation that includes pocket structures, fragment constraints, and textual descriptions, for subsequent molecular generation or property optimization tasks. This allows the model to understand the relationship between the "protein environment—fragment—textual constraints" in a unified way.
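The two fusion routes can be contrasted in a short sketch: one-shot fusion, where the pocket features query the fragment and text features together, and progressive fusion, where the pocket first attends to the fragment and the result then attends to the text. The sketch assumes all modalities are already embedded in the same dimension, omits learned projections, and uses hypothetical function names.

```python
import math
from typing import List

Matrix = List[List[float]]


def attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j] for e, v in zip(exps, values))
                    for j in range(len(values[0]))])
    return out


def cross_attend(query_feats: Matrix, context_feats: Matrix) -> Matrix:
    """Query one modality against another (as K and V), with a residual add."""
    attended = attention(query_feats, context_feats, context_feats)
    return [[x + a for x, a in zip(row, att)]
            for row, att in zip(query_feats, attended)]


def one_shot_fusion(pocket: Matrix, fragment: Matrix, text: Matrix) -> Matrix:
    """Pocket as Q; fragment and text concatenated as K and V."""
    return cross_attend(pocket, fragment + text)


def progressive_fusion(pocket: Matrix, fragment: Matrix, text: Matrix) -> Matrix:
    """First pocket with fragment (structure fusion), then the result with text."""
    structure = cross_attend(pocket, fragment)
    return cross_attend(structure, text)
```

Both functions return one fused vector per pocket unit; they differ only in whether the text competes with the fragment for attention in a single step or is applied to an already structure-aligned representation.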
[0052] In this embodiment, a multimodal molecular generation framework maps protein pocket structure representation, molecular fragment representation, and medicinal chemistry text description to a unified semantic embedding space. Through a cross-modal attention mechanism, the three types of information are deeply integrated in the generation process, enabling the generation model to simultaneously perceive spatial constraints, chemical skeleton limitations, and textual semantic goals.
[0053] Step S2024: Train the language learning model based on multimodal fusion data.
[0054] In some alternative implementations, the language learning model is an improved autoregressive language model.
[0055] Specifically, step S2024 above includes: Step b1: Train the improved autoregressive language model for sequence prediction task based on multimodal fusion data; generate molecular local units step by step during sequence prediction task training, and evaluate the molecular local units through the chemical constraint decoder and chemical rule checking module equipped with the improved autoregressive language model in each step of local unit generation, and form candidate molecules after evaluation.
[0056] In this embodiment, a modified autoregressive language model framework is preferred, which models molecular generation as a stepwise generation task over tokenized SMILES sequences. SMILES is a line notation that represents a molecular structure as a character string; in this embodiment, a token is the smallest character unit of that string (also referred to as a molecular local unit).
[0057] In its implementation, the improved autoregressive language model receives and integrates several types of input at each generation step: the intermediate structure representation corresponding to the partial SMILES sequence generated so far, the unified joint representation of the multimodal conditional information, and policy-related signals (i.e., textual descriptions that guide the direction of molecular generation). Based on these inputs, the model uses an attention mechanism to weight the historical generation state and the unified joint representation, calculates the conditional probability distribution over the candidate tokens at the current generation step, and selects or samples the next token accordingly, thus completing one generation decision. This process is executed iteratively along the sequence dimension, enabling the model to dynamically adjust subsequent structure selections during generation.
[0058] By introducing conditional feature injection and decision guidance mechanisms into the standard autoregressive language model, the molecular generation process is extended from simple sequence prediction to a sequential decision-making process constrained by multimodal conditions and policy signals.
[0059] Since molecular generation is inherently sequential and operates on a variable graph structure, this embodiment adds a chemical constraint decoder on top of the language model. This decoder checks chemical rules on the fly (such as valence, atom validity, and the rationality of ring closures) and provides structural feedback during decoding, thereby reducing the occurrence of unparsable SMILES strings or chemically unreasonable structures. Specifically, at each token generation step, the decoder in the improved autoregressive language model receives the features of the SMILES sequence generated so far, the multimodal condition information, and the policy signals. At the same time, it applies an on-the-fly chemical rule checking module to evaluate the validity of the candidate tokens, including valence, atom type constraints, and the rationality of ring closures. Based on the rule checking results, the decoder corrects or renormalizes the generation probability distribution, thereby dynamically adjusting the generation strategy to ensure that the output SMILES sequences are parsable and chemically reasonable.
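The correction and renormalization step above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `apply_validity_mask`, `open_ring_labels`, and `make_checker` are hypothetical, and the toy checker only enforces two syntactic rules (balanced branches and closed single-digit ring labels) rather than the valence and atom-type checks a real chemical rule checking module would perform.

```python
def apply_validity_mask(probs, is_valid):
    """Zero out invalid candidate tokens and renormalize the remaining probability mass."""
    masked = {tok: (p if is_valid(tok) else 0.0) for tok, p in probs.items()}
    total = sum(masked.values())
    if total == 0.0:
        raise ValueError("no valid candidate token at this step")
    return {tok: p / total for tok, p in masked.items()}

def open_ring_labels(prefix):
    """Single-digit ring-closure labels opened but not yet closed (toy rule:
    ignores %nn labels and digits inside bracket atoms)."""
    open_labels = set()
    for ch in prefix:
        if ch.isdigit():
            open_labels ^= {ch}  # first occurrence opens the ring, second closes it
    return open_labels

def make_checker(prefix):
    depth = prefix.count("(") - prefix.count(")")
    rings = open_ring_labels(prefix)

    def is_valid(tok):
        if tok == "<eos>":
            # The sequence may only end with all branches and rings closed.
            return depth == 0 and not rings
        if tok == ")":
            return depth > 0
        return True

    return is_valid

# Hypothetical model output for the partial sequence "C1CC" (ring 1 is still open).
probs = {"C": 0.4, ")": 0.1, "1": 0.2, "<eos>": 0.3}
adjusted = apply_validity_mask(probs, make_checker("C1CC"))
print(adjusted)
```

Here the end-of-sequence token and the unmatched `)` receive zero probability, and the remaining mass is redistributed over the tokens that keep the partial string parsable.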
[0060] In the model training phase, the first stage is baseline supervised training. First, a protein pocket-ligand complex dataset and its corresponding text description dataset are constructed. The text description dataset includes fragment constraint information, etc. Each protein pocket and ligand sequence is encoded. Protein pockets can be represented using compressed serialization, ligands can be represented using tokenized SMILES or graph representation, and text can be represented using embedding vectors. These encodings are then used as input to train an autoregressive language model for sequence prediction, with the goal of predicting the next SMILES token. The generated sequences and real ligand sequences are optimized using a cross-entropy loss function, enabling the model to capture chemical plausibility and structural regularity at the sequence level, achieving basic sequence generation and chemical plausibility understanding capabilities.
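As a hedged illustration of the supervised objective in this first stage, the next-token cross-entropy can be computed as the mean negative log-likelihood of the reference ligand tokens. The function name `next_token_cross_entropy` and the per-step probability tables are hypothetical toy values; a practical implementation would operate on batched logits inside a deep learning framework.

```python
import math

def next_token_cross_entropy(step_probs, target_tokens):
    """Mean negative log-likelihood of the reference tokens under the model."""
    nll = 0.0
    for probs, target in zip(step_probs, target_tokens):
        nll -= math.log(probs[target])
    return nll / len(target_tokens)

# Hypothetical predicted distributions for two generation steps.
step_probs = [{"C": 0.5, "O": 0.5}, {"C": 0.25, "O": 0.75}]
loss = next_token_cross_entropy(step_probs, ["C", "O"])
print(round(loss, 4))
```

The loss falls toward zero as the model assigns higher probability to each token of the real ligand sequence.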
[0061] In each step of the autoregressive generation of SMILES tokens, the model first calculates the probability distribution over the candidate tokens based on the partial sequence generated so far and the input multimodal conditional information. A temperature parameter is then applied to smooth or sharpen the probability distribution, adjusting the randomness of the generated results. Building on this, a top-k or top-p sampling strategy can be used to discard low-probability candidate tokens, retaining only the set of candidates that meet preset conditions and sampling the next token from this set. For scenarios requiring higher stability or better global sequence quality, beam search can be used to score and select among multiple candidate generation paths, thereby balancing generation diversity against structural rationality.
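The temperature, top-k, and top-p steps can be sketched as below. This is an illustrative sketch only: the function name `sample_next_token`, the token set, and the logit values are assumptions, and beam search is not shown.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    # Temperature scaling: values below 1 sharpen the distribution, values above 1 smooth it.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(value - m) for tok, value in scaled.items()}
    z = sum(exps.values())
    ranked = sorted(((tok, e / z) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep the k most probable tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in ranked:  # keep the smallest prefix whose mass reaches top_p
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    tokens = [tok for tok, _ in ranked]
    weights = [p for _, p in ranked]
    # random.choices normalizes the weights of the retained candidates.
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical logits for one generation step.
logits = {"C": 2.0, "N": 1.0, "O": 0.1, "(": -1.0}
token = sample_next_token(logits, temperature=0.7, top_k=3, top_p=0.9)
print(token)
```

With these toy values, the top-p cutoff retains only the two most probable tokens, so the sampled token is always one of them.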
[0062] Step b2: Score the candidate molecules based on the static reward optimization mechanism and determine the score value corresponding to the candidate molecules.
[0063] This step belongs to the second stage of training, namely the static reward optimization stage. In this stage, the molecules generated by the model are evaluated using a static scoring system that includes the synthetic accessibility score (SA Score), drug-likeness (QED), and predictions of basic ADMET indicators (such as Papp and CYP inhibition).
[0064] In practice, each generated candidate molecule is first input into a static scoring module to obtain a corresponding numerical evaluation. The score is then used as the reward signal for reinforcement learning and, combined with a policy gradient or PPO algorithm, is used to update the model parameters, guiding the model to generate molecules with better chemical properties while keeping the generated structures parsable.
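A minimal sketch of how a static score might be turned into a policy-gradient update signal is given below. Everything specific here is an assumption for illustration: the equal weights, the linear rescaling of the SA Score (conventionally 1 for easy to 10 for hard) onto a 0 to 1 scale, the omission of ADMET terms, and the use of the basic REINFORCE estimator with a constant baseline rather than PPO. The QED and SA values are taken as precomputed inputs.

```python
def static_reward(qed, sa_score, w_qed=0.5, w_sa=0.5):
    """Combine drug-likeness (0..1, higher is better) with synthetic
    accessibility (1..10, lower is better) into one scalar reward."""
    sa_term = (10.0 - sa_score) / 9.0  # map SA 1..10 onto 1..0
    return w_qed * qed + w_sa * sa_term

def reinforce_loss(token_log_probs, reward, baseline=0.0):
    """REINFORCE surrogate loss for one sampled molecule: minimizing it raises
    the likelihood of sequences whose reward exceeds the baseline."""
    return -(reward - baseline) * sum(token_log_probs)

# Hypothetical scores and token log-probabilities for one candidate molecule.
reward = static_reward(qed=0.8, sa_score=2.8)
loss = reinforce_loss([-0.1, -0.2], reward, baseline=0.5)
print(round(reward, 3), round(loss, 3))
```

In a real training loop the loss would be differentiated with respect to the model parameters; the sketch only shows how the reward scales the sequence log-likelihood.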
[0065] By introducing a text-based attribute control mechanism into the molecular generation process, pharmacodynamic targets and druggability indicators (such as QED and synthetic accessibility) are incorporated into model training and inference in the form of conditional embeddings and strategy guidance signals, thereby achieving directional guidance and controllable optimization of molecular properties without relying on explicit numerical regression.
Step b3: Based on the dynamic structure reward mechanism, determine the interaction information between candidate molecules and their corresponding pocket structures.
[0066] This step belongs to the third stage of training, namely the dynamic structural reward stage. In this stage, dynamic structural rewards are introduced to capture information about the interaction between the molecule and the target protein pocket, including non-covalent interaction (NCI) scores, molecular docking scores, and molecular-protein geometric matching degrees.
[0067] During training, for each token or fragment generated, the dynamic reward between the corresponding intermediate molecular structure and the target protein pocket is calculated, and the model generation strategy is updated using near real-time feedback.
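One possible way to turn per-step dynamic rewards into a training signal is sketched below, assuming a discounted return-to-go for each generation step. The function name `stepwise_returns`, the discount factor, and the placeholder scorer are hypothetical; in practice the scorer would stand for an NCI score, a docking score, or a geometric matching degree computed against the target pocket.

```python
def stepwise_returns(prefixes, score_fn, gamma=0.9):
    """Score every intermediate structure, then compute discounted returns-to-go."""
    rewards = [score_fn(prefix) for prefix in prefixes]
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return rewards, returns

def placeholder_score(prefix):
    # Stand-in for a pocket-aware scorer; it only depends on the prefix length.
    return len(prefix) / 10

rewards, returns = stepwise_returns(["C", "CC", "CCO"], placeholder_score, gamma=0.5)
print(rewards, returns)
```

Each return-to-go credits a generation step with the rewards of the structures that follow from it, which is one common way to feed step-level feedback into a policy update.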
[0068] Step b4: Based on the sequence prediction task loss, score, and interaction information, update the model parameters of the improved autoregressive language model until training is complete.
[0069] Training can be terminated under certain conditions, such as: the average dynamic reward in multiple rounds of molecule generation reaches a preset threshold, or the generated molecules converge and stabilize in terms of chemical rationality and binding ability, which is considered as the completion of model training.
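The first termination condition can be sketched as a sliding-window check on the mean dynamic reward. The class name `RewardThresholdStopper`, the window size, and the threshold are illustrative assumptions; the convergence-based condition is not shown.

```python
from collections import deque

class RewardThresholdStopper:
    """Signal completion once the mean dynamic reward over the most recent
    generation rounds reaches a preset threshold."""

    def __init__(self, threshold, window=3):
        self.threshold = threshold
        self.history = deque(maxlen=window)

    def update(self, mean_round_reward):
        self.history.append(mean_round_reward)
        window_full = len(self.history) == self.history.maxlen
        return window_full and sum(self.history) / len(self.history) >= self.threshold

# Hypothetical per-round mean dynamic rewards.
stopper = RewardThresholdStopper(threshold=0.7, window=3)
decisions = [stopper.update(r) for r in [0.2, 0.8, 0.8, 0.9]]
print(decisions)
```

Requiring a full window before stopping avoids ending training on a single unusually good round.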
[0070] This embodiment employs a three-stage reinforcement learning training strategy, which effectively enhances the model's ability to learn about the interaction between molecules and pockets, thus overcoming the deficiency of traditional data-driven models that cannot provide structural feedback during the generation process.
[0071] The model architecture provided in this embodiment adopts a modular design. Each modal encoder, attribute control module, and generation strategy can be independently extended or replaced, exhibiting good modularity and scalability. It is compatible with more accurate structural models and scoring modules, and can adapt to different protein systems and drug optimization targets, possessing good engineering scalability and industrial application potential. Furthermore, this embodiment also constructs a generation strategy that combines chemical validity constraint decoding with staged reinforcement learning optimization. While ensuring the validity and diversity of generated molecular structures, it gradually enhances the potential binding ability of molecules to target proteins through static property rewards and dynamic structural interaction rewards.
[0072] In some optional implementations, the improved autoregressive language model freezes the parameters of the first preset level during the sequence prediction task training phase; the improved autoregressive language model freezes the parameters of the second preset level during the static reward optimization mechanism phase; and the improved autoregressive language model freezes the parameters of the third preset level during the dynamic structural reward mechanism phase. The first preset level is lower than the second preset level, and the third preset level consists of all layers except the output layer.
[0073] Throughout the training process, this embodiment employs continuous training and phased freezing to avoid the gradient instability inherent in reinforcement learning. In a specific implementation, continuous training means that the model parameters are carried over and updated continuously across the three training stages, rather than each stage being trained completely independently. The phased freezing strategy is shown in Figure 2. In the first stage, baseline supervised training, the underlying Transformer encoding layers are frozen (see the first stage in Figure 2: the basic semantic / encoding layer and the multimodal fusion representation layer), and only the parameters of the high-level sequence generation and output layers are updated, so that the model first masters basic sequence generation and an understanding of chemical rationality. In the second stage, the static reward optimization stage, more Transformer layers are gradually unfrozen (see Figure 2) and the parameters of the lower encoding layers are updated stably, allowing the model to gradually adjust its generation strategy under static reward constraints. In the third stage, the dynamic structural reward stage, only small-step fine-tuning is performed, and the model's strategy is adjusted in conjunction with the dynamic structural reward so that reward-driven gradients do not destroy the language and chemistry knowledge already learned.
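A phased freezing schedule of this kind can be expressed as a simple lookup from training stage to frozen layer groups. The layer group names, the exact groups frozen in each stage, and the learning rates below are hypothetical choices made for illustration; stages one and two follow the description in this paragraph, and stage three follows paragraph [0072] in leaving only the output layer trainable.

```python
# Layer groups ordered from bottom to top (names are illustrative only).
LAYERS = ["embedding", "encoding", "fusion", "generation", "output"]

SCHEDULE = {
    # Stage 1: freeze the lower encoding and fusion layers.
    "supervised": {"frozen": {"embedding", "encoding", "fusion"}, "lr": 1e-4},
    # Stage 2: unfreeze more layers while applying static rewards.
    "static_reward": {"frozen": {"embedding"}, "lr": 5e-5},
    # Stage 3: small-step fine-tuning with everything but the output layer frozen.
    "dynamic_reward": {"frozen": {"embedding", "encoding", "fusion", "generation"}, "lr": 1e-6},
}

def trainable_layers(stage):
    """Return the layer groups whose parameters are updated in the given stage."""
    frozen = SCHEDULE[stage]["frozen"]
    return [name for name in LAYERS if name not in frozen]

for stage in SCHEDULE:
    print(stage, trainable_layers(stage), SCHEDULE[stage]["lr"])
```

Because the same parameter set is carried from stage to stage and only the trainable subset changes, this structure is compatible with the continuous training described above.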
[0074] Through the aforementioned continuous training and phased freezing strategy, the model can smoothly adapt to reward signals while retaining its original knowledge, achieving high-quality, chemically sound molecular generation that interacts well with the target protein pocket. This effectively avoids the gradient instability and catastrophic forgetting problems common in reinforcement learning, enabling the model to maintain structural validity and inference stability while improving performance.
[0075] In this embodiment, because the model internally implements a unified semantic space for text, structure, and fragment information, it achieves "trimodal collaborative generation capability," simultaneously considering chemical rationality, three-dimensional configuration, and medicinal properties, effectively realizing molecule generation based on multi-condition input. After the model completes training, in practical applications, one or more of protein pocket structure information, text target information, and molecular fragment information can be selectively input as generation conditions according to specific design requirements. This effectively adapts to the complex and multi-constrained design requirements in real drug development, significantly improving the accuracy and reliability of drug molecule design.
[0076] This embodiment provides a molecule generation method, which can be used in servers, terminals, mobile terminals, etc. As shown in Figure 3, the process includes the following steps:
Step S301: Obtain known protein information and design constraint information for the target protein. The design constraint information includes text description information and / or molecular fragment constraint information.
Step S302: Input the known protein information and design constraint information into the molecular generation model constructed according to the molecular generation model construction method in any of the above embodiments to obtain the target molecule corresponding to the target protein.
[0077] For details on the construction method of the molecular generation model, please refer to the above embodiments and preferred embodiments. Details that have already been explained will not be repeated here.
[0078] In practical applications, the structural characterization of the target protein pocket can be input to define the space and interaction environment for molecule generation. A textual design objective can then optionally be input to describe the desired pharmacodynamic characteristics, interaction types, or property optimization directions. In addition, if there are known molecular structures that need to be retained or referenced, the corresponding molecular fragments or scaffolds can be input as generation constraints. After these input conditions are uniformly encoded and mapped to a shared semantic space, the model integrates the information from each condition during the inference phase to guide molecule generation, thereby achieving customized molecule generation within a specified pocket environment, combined with textual design objectives and constrained by molecular fragments. Since the model has learned, during the training phase, the independent contribution of each modal condition to the generation strategy as well as their combinations, it can also complete the corresponding conditional generation task during the inference phase when only some of the conditions are provided. Ultimately, this embodiment can rapidly generate a large number of candidate molecules with chemical rationality, synthesizability, and potential biological activity, providing a high-quality initial screening set for subsequent tasks such as virtual screening, conformational analysis, and pharmacophore evaluation.
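The optional nature of the text and fragment conditions can be illustrated with a small request structure. The class `GenerationRequest`, its field names, and the helper `active_conditions` are hypothetical and do not describe an interface disclosed in this document; the sketch only shows how an inference call might record which conditions were actually supplied before encoding.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GenerationRequest:
    pocket: str                             # serialized pocket representation
    text_goal: Optional[str] = None         # optional textual design objective
    fragment_smiles: Optional[str] = None   # optional fragment or scaffold to retain

def active_conditions(request):
    """Return the names of the conditions that were supplied, in a fixed order."""
    fields = [
        ("pocket", request.pocket),
        ("text", request.text_goal),
        ("fragment", request.fragment_smiles),
    ]
    return [name for name, value in fields if value]

# Pocket plus a benzene fragment, with no textual objective.
request = GenerationRequest(pocket="POCKET_TOKENS", fragment_smiles="c1ccccc1")
conditions = active_conditions(request)
print(conditions)
```

Only the supplied conditions would then be encoded and mapped into the shared semantic space, with the missing modality simply left out of the conditioning input.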
[0079] The molecular generation method provided in this embodiment has fast inference speed and strong controllability. It can efficiently generate candidate molecules that meet specific pharmacodynamic requirements based on pocket features and text instructions. It adapts to the complex, multi-constraint design requirements of real drug development and improves the controllability and accuracy of molecular generation, thereby shortening the early drug discovery cycle.
[0080] This embodiment also provides a molecular generative model construction apparatus for implementing the above embodiments and preferred embodiments; details already described will not be repeated. As used below, the term "module" can refer to a combination of software and / or hardware that performs a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
[0081] This embodiment provides a molecular generative model construction device. As shown in Figure 4, it includes: The acquisition module 401 is used to acquire a multimodal sample dataset. The multimodal sample dataset includes several protein sample groups. Each protein sample group includes the pocket structure information of the target protein, the ligand molecule information that binds to the target protein pocket, the text description information of the target protein pocket, and the molecular fragment constraint information. Training module 402 is used to train a language learning model based on a multimodal sample dataset to obtain a molecular generation model.
[0082] In one alternative implementation, the language learning model is an improved autoregressive language model; In some alternative implementations, training module 402 includes: The feature encoding module is used to encode the features of various modal data to obtain the feature data corresponding to each modal data. Pocket structure information, ligand molecule information, text description information, and molecular fragment constraint information are each a type of modal data. Specifically, it is used to compress and encode the pocket structure information to obtain the feature data corresponding to the pocket structure information; to encode the molecular fragment constraint information through a graph neural network encoder or SMILES encoder to obtain the feature data corresponding to the molecular fragment constraint information; and to encode the text description information through a text embedding module constructed using model distillation technology to obtain the feature data corresponding to the text description information.
[0083] The self-attention enhancement module is used to perform self-attention enhancement on the feature data corresponding to each type of modality data to obtain the enhanced features corresponding to each type of modality data.
[0084] The cross-modal attention module is used to perform cross-modal attention alignment based on the enhanced features of each type of modal data to obtain multimodal fusion data, and to train a language learning model based on the multimodal fusion data. Specifically, it performs cross-modal attention fusion of the enhanced features corresponding to the pocket structure information and the enhanced features corresponding to the molecular fragment constraint information to obtain structural fusion data, and performs cross-modal attention fusion of the structural fusion data and the enhanced features corresponding to the text description information to obtain the multimodal fusion data.
[0085] The training unit is used to train the improved autoregressive language model for sequence prediction tasks based on multimodal fusion data. During the sequence prediction task training process, molecular local units are generated progressively. At each step of local unit generation, the chemical constraint decoder and chemical rule checking module equipped in the improved autoregressive language model evaluate the molecular local units, forming candidate molecules. A static reward optimization mechanism is used to score the candidate molecules, determining their corresponding score values. A dynamic structural reward mechanism is used to determine the interaction information between the candidate molecules and their corresponding pocket structures. Based on the sequence prediction task loss, score values, and interaction information, the model parameters of the improved autoregressive language model are updated until training is complete.
[0086] In one optional implementation, the improved autoregressive language model freezes the parameters of the first preset level during the sequence prediction task training phase; the improved autoregressive language model freezes the parameters of the second preset level during the static reward optimization mechanism phase; and the improved autoregressive language model freezes the parameters of the third preset level during the dynamic structural reward mechanism phase. The first preset level is lower than the second preset level, and the third preset level consists of all layers except the output layer.
[0087] The molecular generative model construction apparatus provided in this embodiment of the invention can execute the molecular generative model construction method provided in any embodiment of the invention, and has the corresponding functional modules and beneficial effects for executing the method. Further functional descriptions of the above modules and units are the same as in the corresponding embodiments described above, and will not be repeated here.
[0088] Figure 5 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.
[0089] Referring now to Figure 5, a structural schematic suitable for implementing an electronic device according to embodiments of the present invention is shown. The electronic device may include a processor (e.g., a central processing unit, graphics processor, etc.) 501, which can perform various appropriate actions and processes according to a program stored in read-only memory (ROM) 502 or a program loaded from memory 508 into random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the electronic device. The processor 501, ROM 502, and RAM 503 are interconnected via a bus 504. An input / output (I / O) interface 505 is also connected to the bus 504.
[0090] Typically, the following devices can be connected to I / O interface 505: input devices 506 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 507 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; memory devices 508 including, for example, magnetic tapes, hard disks, etc.; and communication devices 509. Communication device 509 allows the electronic device to communicate with other devices, wirelessly or by wire, to exchange data. Although Figure 5 shows an electronic device with various devices, it should be understood that not all of the devices shown are required to be implemented or present; more or fewer devices may be implemented or present instead.
[0091] In particular, according to embodiments of the present invention, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of the present invention include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via a communication device 509, or installed from a memory 508, or installed from a ROM 502. When the computer program is executed by the processor 501, it performs the functions defined in the molecular generation model construction method or molecular generation method of the embodiments of the present invention.
[0092] The electronic device shown in Figure 5 is merely an example and should not be construed as limiting the functionality and scope of use of the embodiments of the present invention.
[0093] This invention also provides a computer-readable storage medium. The methods described above according to embodiments of the invention can be implemented in hardware or firmware, or implemented as recordable on a storage medium, or implemented as computer code downloaded via a network and originally stored on a remote storage medium or a non-transitory machine-readable storage medium and subsequently stored on a local storage medium. Thus, the methods described herein can be processed by software stored on a storage medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware. The storage medium can be a magnetic disk, optical disk, read-only memory, random access memory, flash memory, hard disk, or solid-state drive, etc.; further, the storage medium can also include combinations of the above types of memory. It is understood that computers, processors, microprocessor controllers, or programmable hardware include storage components capable of storing or receiving software or computer code. When the software or computer code is accessed and executed by the computer, processor, or hardware, the molecular generation model construction method or molecular generation method shown in the above embodiments is implemented.
[0094] A portion of this invention can be applied as a computer program product, such as computer program instructions, which, when executed by a computer, can invoke or provide the methods and / or technical solutions according to the invention through the operation of the computer. Those skilled in the art will understand that the forms in which computer program instructions exist in a computer-readable medium include, but are not limited to, source files, executable files, installation package files, etc. Correspondingly, the ways in which computer program instructions are executed by a computer include, but are not limited to: the computer directly executing the instructions, or the computer compiling the instructions and then executing the corresponding compiled program, or the computer reading and executing the instructions, or the computer reading and installing the instructions and then executing the corresponding installed program. Here, the computer-readable medium can be any available computer-readable storage medium or communication medium accessible to a computer.
[0095] Although embodiments of the invention have been described in conjunction with the accompanying drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations all fall within the scope defined by the appended claims.
Claims
1. A method for constructing a molecular generation model, characterized in that, The method includes: A multimodal sample dataset is obtained, which includes several protein sample groups. Each protein sample group includes pocket structure information of the target protein, ligand molecule information that binds to the target protein pocket, textual description information of the target protein pocket, and molecular fragment constraint information. A language learning model is trained based on the multimodal sample dataset to obtain a molecular generation model.
2. The method of claim 1, wherein, The training of the language learning model based on the multimodal sample dataset includes: Feature encoding is performed on each type of modal data to obtain the feature data corresponding to each type of modal data; the pocket structure information, the ligand molecule information, the text description information, and the molecular fragment constraint information are each a type of modal data; Self-attention enhancement is performed on the feature data corresponding to each type of modality data to obtain the enhanced features corresponding to each type of modality data; Cross-modal attention alignment is performed based on the enhanced features of each type of modal data to obtain multimodal fusion data; The language learning model is trained based on the multimodal fusion data.
3. The method of claim 2, wherein, The cross-modal attention alignment based on the enhanced features of each modality data is performed to obtain multimodal fusion data, including: The enhanced features corresponding to the pocket structure information and the enhanced features corresponding to the molecular fragment constraint information are fused together via cross-modal attention to obtain structural fusion data. The multimodal fusion data is obtained by performing cross-modal attention fusion between the structural fusion data and the enhanced features corresponding to the text description information.
4. The method according to claim 2 or 3, characterized in that, The step of performing feature encoding on various types of modal data to obtain feature data corresponding to each type of modal data includes: The pocket structure information is compressed and encoded to obtain the feature data corresponding to the pocket structure information; The molecular fragment constraint information is encoded by a graph neural network encoder or a SMILES encoder to obtain the feature data corresponding to the molecular fragment constraint information; A text embedding module constructed using model distillation technology encodes the text description information to obtain the feature data corresponding to the text description information.
5. The method of claim 2, wherein, The language learning model is an improved autoregressive language model; training the language learning model based on the multimodal fusion data includes: The improved autoregressive language model is trained for sequence prediction tasks based on the multimodal fusion data. Molecular local units are generated step by step during the sequence prediction task training process. In each step of local unit generation, the chemical constraint decoder and chemical rule checking module equipped by the improved autoregressive language model are used to evaluate the molecular local units, and candidate molecules are formed after evaluation. The candidate molecules are scored based on a static reward optimization mechanism to determine the score value corresponding to the candidate molecules; Based on the dynamic structure reward mechanism, the interaction information between the candidate molecule and the corresponding pocket structure is determined; Based on the sequence prediction task loss, the score, and the interaction information, the model parameters of the improved autoregressive language model are updated until training is complete.
6. The method of claim 5, wherein, The improved autoregressive language model freezes the parameters of the first preset level during the sequence prediction task training phase; the improved autoregressive language model freezes the parameters of the second preset level during the static reward optimization mechanism phase; and the improved autoregressive language model freezes the parameters of the third preset level during the dynamic structural reward mechanism phase. The first preset level is lower than the second preset level, and the third preset level is any layer other than the output layer.
7. A method of molecular generation, characterized by, The method includes: Obtain known protein information and design constraint information for the target protein, wherein the design constraint information includes textual description information and / or molecular fragment constraint information; The known protein information and the design constraint information are input into the molecular generation model constructed by the molecular generation model construction method according to any one of claims 1 to 6 to obtain the target molecule corresponding to the target protein.
8. A molecular generation model construction device, characterized in that, The device includes: The acquisition module is used to acquire a multimodal sample dataset, which includes several protein sample groups. Each protein sample group includes pocket structure information of the target protein, ligand molecule information that binds to the target protein pocket, textual description information of the target protein pocket, and molecular fragment constraint information. The training module is used to train a language learning model based on the multimodal sample dataset to obtain a molecular generation model.
9. An electronic device, characterized in that, The device includes a memory and a processor, which are interconnected. The memory stores computer instructions, and the processor executes the computer instructions to perform the molecular generation model construction method of any one of claims 1 to 6 or the molecular generation method of claim 7.
10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores computer instructions for causing the computer to perform the molecular generation model construction method of any one of claims 1 to 6 or the molecular generation method of claim 7.