Training method of visual language model for molecular representation learning, molecular representation learning method and device

By aligning and pre-training molecular structure images with chemical meta-knowledge, consistent molecular representations are generated, solving the problems of chemical semantics disconnect and knowledge scaling in existing technologies. This enables interpretable property prediction that considers task semantics and domain priors in downstream tasks.

CN122241219A · Pending · Publication Date: 2026-06-19 · BEIJING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: BEIJING UNIV OF POSTS & TELECOMM
Filing Date: 2026-03-11
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing molecular representation learning methods suffer from several problems: chemical semantics are disconnected from the proxy tasks used in pre-training, knowledge is introduced in ways that do not scale, task context semantics are ignored during fine-tuning, and it is difficult to provide reproducible explanations consistent with domain priors.

Method used

During pre-training, molecular structure images are aligned with chemical meta-knowledge. A prompt learner generates learnable descriptor semantic prompts, and a supervised contrastive learning module aligns the image encoding with the text encoding to produce the molecular representation learning data. The model parameters are optimized so that the model generates consistent molecular representations.

Benefits of technology

Learning consistent embedded chemical semantics on large-scale data reduces the risk of chemical consistency issues caused by synthetic perturbations, improves model reliability and semantic consistency, and considers task semantics and domain priors in downstream tasks to achieve interpretable property predictions.

✦ Generated by Eureka AI based on patent content.

Figure CN122241219A_ABST

Abstract

This application provides a training method for a visual language model used for molecular representation learning, a molecular representation learning method, and a device. The training method includes: in the current iteration, inputting training molecular data into the current visual language model for molecular representation learning, so that the image encoding module and the text encoding module in the model generate molecular structure image feature data and descriptor semantic prompt feature data, respectively, and the supervised contrastive learning module in the model aligns the feature data and outputs molecular representation learning data; determining the target loss for the current iteration, and optimizing the parameters of the image encoding module and the text encoding module based on the loss to update the visual language model; and, if the model has converged or the current iteration is the preset last iteration, determining the updated visual language model as the visual language model for molecular representation learning. This application can reduce the risk of chemical inconsistency and achieve scalable pre-training.

Description

Technical Field

[0001] This application relates to the field of molecular representation learning technology, and in particular to a training method for a visual language model for molecular representation learning, a molecular representation learning method, and an apparatus.

Background Technology

[0002] Molecular representation learning is a core component of drug discovery and computational chemistry. It maps molecular structures into vector embeddings to support molecular property prediction and quantitative structure-activity relationship modeling, and is applied to tasks such as virtual screening and molecular design. Existing deep molecular representation methods mainly cover three areas: molecular modeling and encoder structure design; pre-training proxy tasks based on self-supervised learning; and transfer learning and fine-tuning strategies that pursue "general and task-independent" molecular representations.

[0003] However, existing technologies have several problems: chemical semantics are disconnected from the pre-training proxy tasks; knowledge introduction is not scalable; the fine-tuning stage does not consider task context semantics; and it is difficult to provide reproducible explanations that are consistent with domain priors.

Summary of the Invention

[0004] In view of this, embodiments of this application provide a training method for a visual language model for molecular representation learning, a molecular representation learning method and device, to eliminate or improve one or more defects existing in the prior art.

[0005] One aspect of this application provides a method for training a visual language model for molecular representation learning, the method comprising the following steps: In the current iteration, training molecular data is input into the current visual language model used for molecular representation learning, so that the image encoding module and the text encoding module in the model generate, respectively, molecular structure image feature data and descriptor semantic prompt feature data corresponding to the training molecular data, and the supervised contrastive learning module in the visual language model aligns the molecular structure image feature data with the descriptor semantic prompt feature data and outputs molecular representation learning data corresponding to the training molecular data. The training molecular data includes structural images of training molecules and learnable descriptor semantic prompts associated with the training molecules. The descriptor semantic prompts are obtained by transforming chemical meta-knowledge associated with the training molecules. The chemical meta-knowledge includes molecular descriptors that can be automatically calculated and do not require manual annotation. In the current iteration, the target loss is determined based on the molecular representation learning data, and the model parameters of the image encoding module and the text encoding module are optimized based on the target loss to update the visual language model. If the visual language model has converged or the current iteration is the preset last iteration, the updated visual language model is determined as the visual language model for molecular representation learning.

[0006] In some embodiments of this application, the descriptor semantic prompts are obtained by a prompt learner that transforms the chemical meta-knowledge associated with the training molecule; the prompt learner generates learnable descriptor semantic prompts based on the type and value of the molecular descriptor.

[0007] In some embodiments of this application, the descriptor semantic prompts include a learnable context embedding that represents the type of the molecular descriptor and a learnable value encoding that represents the value of the molecular descriptor.

[0008] In some embodiments of this application, the learnable value encoding is a learnable parameter obtained by mapping the values of the molecular descriptor through rank embedding technology. The values of the molecular descriptor are divided into multiple ranks according to the value distribution in the training molecular data, and each rank corresponds to an independent learnable vector.

[0009] In some embodiments of this application, the step in which the supervised contrastive learning module in the visual language model aligns the molecular structure image feature data with the descriptor semantic prompt feature data and outputs molecular representation learning data corresponding to the training molecular data includes: for multiple training molecules in a batch, the supervised contrastive learning module calculates the similarity value between the structural image feature data of each training molecule and the feature data of each descriptor semantic prompt; then, based on the ground-truth descriptor semantic prompt label of each training molecule and a cross-entropy loss function, the module aligns the molecular structure image feature data with the descriptor semantic prompt feature data and outputs the molecular representation learning data.

[0010] In some embodiments of this application, if only one molecular descriptor exists, the target loss for the single-descriptor case is defined in terms of a cross-entropy loss and a KL divergence loss.

[0011] In some embodiments of this application, if there are two or more molecular descriptors, the target loss for the multi-descriptor case is computed over the total number of molecular descriptors used during training.

[0012] Another aspect of this application provides a molecular representation learning method, which includes: transforming the acquired target task data with the prompt learner to obtain the task semantic prompts corresponding to the target task data; and inputting the task semantic prompts and preset domain prior knowledge prompts into a visual language model for molecular representation learning trained by the training method. The image encoding module in the visual language model generates image feature data corresponding to the target task data, and the visual language model extracts from the image encoding module the visual attention weights corresponding to the task semantic prompts. The text encoding module in the visual language model generates task semantic prompt feature data and domain prior knowledge prompt feature data corresponding to the target task data. The attention fusion module in the visual language model then fuses the task semantic prompt feature data with the domain prior knowledge prompt feature data to generate knowledge-enhanced text feature data and knowledge attention weights. The image feature data, the knowledge-enhanced text feature data, the visual attention weights, and the knowledge attention weights are output. The domain prior knowledge prompts include automatically computed molecular descriptors or auxiliary task information. The visual attention weights are used to locate molecular information related to the current task semantic prompts, and the knowledge attention weights are used to quantify the contribution of the domain prior knowledge prompts to the prediction results.
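The fusion step described above can be illustrated with a minimal sketch. The text does not specify how the attention fusion module is built, so the scaled dot-product scoring, the residual addition, and the name `attention_fusion` are all assumptions made for illustration. The sketch takes one task semantic prompt feature vector as the query and the domain prior knowledge prompt feature vectors as keys and values, and returns a knowledge-enhanced text feature together with one knowledge attention weight per prior.

```python
import math

def attention_fusion(task_feat, prior_feats):
    """Hypothetical fusion: attend from the task prompt feature over the prior prompt features."""
    d = len(task_feat)
    # Scaled dot-product score between the task feature and each prior feature (assumed scoring).
    scores = [sum(q * k for q, k in zip(task_feat, prior)) / math.sqrt(d)
              for prior in prior_feats]
    # Softmax over priors gives the knowledge attention weights.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of prior features, added to the task feature (assumed residual form).
    context = [sum(w * prior[j] for w, prior in zip(weights, prior_feats))
               for j in range(d)]
    enhanced = [t + c for t, c in zip(task_feat, context)]
    return enhanced, weights

# Toy 4-dimensional features; the numbers are made up.
task = [1.0, 0.0, 0.0, 0.0]
priors = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]]
enhanced, weights = attention_fusion(task, priors)
```

Because the weights sum to one, each weight can be read as the share of the knowledge-enhanced feature contributed by the corresponding prior, which is the role the text assigns to the knowledge attention weights.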

[0013] A third aspect of this application provides an electronic device including a processor and a memory, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the training method for the visual language model for molecular representation learning and / or the molecular representation learning method.

[0014] A fourth aspect of this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the training method for the visual language model for molecular representation learning and / or the molecular representation learning method.

[0015] The fifth aspect of this application provides a computer program product comprising a computer program that, when executed by a processor, implements the training method and / or molecular representation learning method for a visual language model used for molecular representation learning.

[0016] This application discloses a training method for a visual language model used for molecular representation learning. The method includes the following steps: in the current iteration, training molecular data is input into the current visual language model used for molecular representation learning, so that the image encoding module and the text encoding module in the visual language model generate, respectively, molecular structure image feature data and descriptor semantic prompt feature data corresponding to the training molecular data, and the supervised contrastive learning module in the visual language model aligns the molecular structure image feature data with the descriptor semantic prompt feature data and outputs molecular representation learning data corresponding to the training molecular data; wherein the training molecular data includes structural images of training molecules and learnable descriptor semantic prompts associated with the training molecules; the descriptor semantic prompts are obtained by transforming chemical meta-knowledge associated with the training molecules; the chemical meta-knowledge includes molecular descriptors that can be automatically calculated and do not require manual annotation; in the current iteration, the target loss is determined based on the molecular representation learning data, and the model parameters of the image encoding module and the text encoding module are optimized based on this target loss to update the visual language model; if the visual language model has converged or the current iteration is the preset last iteration, the updated visual language model is determined as the visual language model for molecular representation learning. This approach can reduce the risk of chemical inconsistency caused by synthetic perturbations and enables scalable pre-training; in addition, the fine-tuning stage can take task context semantics into account and provide reproducible explanations consistent with domain priors.

[0017] Additional advantages, objectives, and features of this application will be set forth in part in the description which follows, and will in part become apparent to those skilled in the art upon review of the following description, or may be learned by practice of the application. The objectives and other advantages of this application can be realized and obtained by means of the structures specifically pointed out in the specification and drawings.

[0018] Those skilled in the art will understand that the purposes and advantages that can be achieved with this application are not limited to those specifically described above, and that the above and other purposes that this application can achieve will be more clearly understood from the following detailed description.

Attached Figure Description

[0019] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, do not constitute a limitation thereof. The components in the drawings are not drawn to scale but are merely for illustrating the principles of this application. For ease of illustration and description of certain parts of this application, corresponding portions in the drawings may be enlarged, i.e., may appear larger relative to other components in an exemplary device actually manufactured according to this application. In the drawings: Figure 1 is a schematic diagram of a first process flow of a training method for a visual language model for molecular representation learning in one embodiment of this application.

[0020] Figure 2 is a schematic diagram of a second process flow of a training method for a visual language model for molecular representation learning in one embodiment of this application.

[0021] Figure 3 is a schematic diagram of the technical framework of the training method for the visual language model used for molecular representation learning and of the molecular representation learning method, as described in a specific example of this application.

Detailed Implementation

[0022] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the embodiments and accompanying drawings. Here, the illustrative embodiments and their descriptions are used to explain this application, but are not intended to limit it.

[0023] It should also be noted that, in order to avoid obscuring this application with unnecessary details, only the structures and / or processing steps closely related to the solution according to this application are shown in the accompanying drawings, while other details that are not closely related to this application are omitted.

[0024] It should be emphasized that the term "including / comprises" as used herein refers to the presence of a feature, element, step, or component, but does not exclude the presence or addition of one or more other features, elements, steps, or components.

[0025] It should also be noted that, unless otherwise specified, the term "connection" in this article can refer not only to a direct connection, but also to an indirect connection involving an intermediary.

[0026] In the following description, embodiments of the present application will be illustrated with reference to the accompanying drawings. In the drawings, the same reference numerals represent the same or similar parts, or the same or similar steps.

[0027] Self-supervised pre-training schemes based on graph structures, sequences, and images learn molecular representations by constructing proxy tasks from "synthetic perturbations" such as node or edge masking, subgraph sampling, and jigsaw puzzle reconstruction, and they often use augmentation strategies or heuristic rules to construct positive and negative sample pairs. However, synthetic perturbations (deleting nodes, masking edges, jigsaw shuffling, etc.) can disrupt chemical structural consistency, leading to pseudo-label conflicts or violations of chemical principles, which harms reliability and chemical interpretability. Molecular encoders obtained through this type of pre-training are therefore likely to learn erroneous prior knowledge based on inappropriate inductive bias assumptions. Strengthening chemical consistency through knowledge graphs, expert annotations, or rule constraints relies on manual curation and labeling, which makes it difficult to scale to tens of millions of pre-training samples. Such schemes are limited in data scalability and cannot simultaneously ensure "large-scale pre-training" and "the authenticity and chemical rationality of pre-trained knowledge." There are also interpretable models that rely on post-hoc explanation mechanisms, such as gradient attribution, to explain model results during fine-tuning. However, such explanations merely quantify the importance of features to the correct label, without explicitly learning the relationship between molecular structure and context under the semantics of the current task. Post-hoc explanations are typically "explanations of predictions" rather than "semantics-guided predictions," which makes it difficult to provide reproducible explanations consistent with domain priors. Furthermore, existing fine-tuning phases often treat the property prediction of each downstream task as an isolated task. For example, existing methods typically model the blood-brain barrier permeability and the lipophilicity of molecules independently, neglecting the fact that lipophilicity is a key determinant of blood-brain barrier permeability. Existing methods therefore almost entirely neglect contextual semantic modeling of downstream tasks during fine-tuning, and lack a mechanism for using task semantics and domain priors to share knowledge and shape context-aware representations. Based on this, the inventors of this application conceived of using meta-knowledge (such as molecular descriptors) from existing publicly available chemical databases as supervisory signals to align molecular structure images with meta-knowledge prompts during pre-training. This reduces reliance on synthetic-perturbation proxy tasks at the source, maintains chemical semantic consistency, and allows the training data scale to match the billions of data points used by current pre-trained models without any manual annotation cost. In downstream tasks, knowledge-guided prompts are used for optimization, explicitly anchoring predictions to the task context and domain priors, which achieves context-aware semantic representation and cross-property knowledge sharing. Through the "prompt-anchored" visual attention and knowledge attention mechanisms, interpretable reasoning is achieved that can both locate task-related substructures and quantify the contribution of different prior knowledge to the prediction.

[0028] The following examples will provide a detailed description.

[0029] This application provides a method for training a visual language model for molecular representation learning. Referring to Figure 1, the training method includes the following steps: Step 100: In the current iteration, the training molecular data is input into the current visual language model for molecular representation learning, so that the image encoding module and the text encoding module in the visual language model generate, respectively, molecular structure image feature data and descriptor semantic prompt feature data corresponding to the training molecular data, and the supervised contrastive learning module in the visual language model aligns the molecular structure image feature data with the descriptor semantic prompt feature data and outputs the molecular representation learning data corresponding to the training molecular data; wherein the training molecular data includes structural images of training molecules and learnable descriptor semantic prompts associated with the training molecules; the descriptor semantic prompts are obtained by transforming the chemical meta-knowledge associated with the training molecules; and the chemical meta-knowledge includes molecular descriptors that can be automatically calculated and do not require manual annotation. In step 100, the training molecular data is a set of samples indexed by a sample index that runs up to the total number of molecular samples in the training dataset; each sample pairs the structural image of a training molecule with the learnable descriptor semantic prompts associated with that molecule. The semantic prompt for the j-th descriptor of the i-th molecule is composed of: special tokens marking the start, the end, and the separator; a learnable context embedding bound to the type of the j-th molecular descriptor, which expresses "which descriptor the prompt is talking about"; and an encoding of the value of the j-th descriptor of the i-th molecule. The molecular structure image feature data is obtained by applying the image encoder to the structural image, and the descriptor semantic prompt feature data is obtained by applying the text encoder to the descriptor semantic prompt. The molecular representation learning data corresponding to the training molecular data can be the joint representation of the molecular structure image feature data and the descriptor semantic prompt feature data in a shared latent space.
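As a notational sketch of step 100, the training data, the prompts, and the encoded features might be written as follows. The formulas of the original are not reproduced in this text, so every symbol below other than the indices i and j is an assumption introduced for illustration, as is the order of the tokens inside the prompt.

```latex
\begin{aligned}
\mathcal{D} &= \{(x_i, T_i)\}_{i=1}^{N}, \qquad T_i = \{t_{i,j}\}_{j} \\
t_{i,j} &= \big[\langle\mathrm{SOS}\rangle,\; c_j,\; \langle\mathrm{SEP}\rangle,\; e(v_{i,j}),\; \langle\mathrm{EOS}\rangle\big] \\
f_i &= E_{\mathrm{img}}(x_i), \qquad g_{i,j} = E_{\mathrm{txt}}(t_{i,j})
\end{aligned}
```

Here x_i stands for the structural image of the i-th training molecule, T_i for its set of descriptor semantic prompts, N for the total number of molecular samples, c_j for the learnable context embedding bound to the type of the j-th descriptor, e(v_{i,j}) for the encoding of that descriptor's value for the i-th molecule, and E_img and E_txt for the image encoder and the text encoder.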

[0030] In one or more embodiments of this application, the chemical meta-knowledge can also be any automatically computable chemical knowledge available in chemical databases (such as drug-likeness rule indicators, functional group counts, polarity-related indicators, auxiliary label statements, etc.), all of which can be converted into semantic prompts and serve the pre-training objective of supervised contrastive learning.

[0031] Step 200: In the current iteration, determine the target loss of the current iteration based on the molecular representation learning data, and optimize the model parameters of the image encoding module and the text encoding module based on the target loss to update the visual language model. In step 200, the model parameters can be the network weight parameters in the image encoding module and the text encoding module, and the learnable context embedding parameters in the prompt learner.

[0032] In one or more embodiments of this application, supervised contrastive learning can be combined with or replaced by classification cross-entropy, KL divergence, etc., or other equivalent supervised matching losses to achieve the goal of image-text description alignment and semantic preservation.

[0033] Step 300: If the visual language model has converged or the current iteration round is the preset last iteration round, then the updated visual language model is determined as the visual language model for molecular representation learning.
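The control flow of steps 100 to 300 can be sketched as a simple loop. The text does not say how convergence is detected, so the loss-change threshold `tol`, the helper names `train` and `toy_step`, and the toy quadratic objective standing in for the real model are all assumptions made for illustration.

```python
def train(step_fn, max_rounds, tol=1e-4):
    """Repeat training steps until the loss stops changing (assumed criterion) or the last round."""
    previous = None
    for round_index in range(1, max_rounds + 1):
        # Steps 100 and 200: forward pass, target loss, and parameter update happen in step_fn.
        loss = step_fn()
        converged = previous is not None and abs(previous - loss) < tol
        # Step 300: stop when converged or when the preset last iteration is reached.
        if converged or round_index == max_rounds:
            return round_index, loss
        previous = loss

# Toy stand-in for the model: minimise (w - 3)^2 by gradient descent.
state = {"w": 0.0}

def toy_step(lr=0.1):
    grad = 2.0 * (state["w"] - 3.0)
    state["w"] -= lr * grad
    return (state["w"] - 3.0) ** 2

rounds, final_loss = train(toy_step, max_rounds=200)
```

In a real setting `step_fn` would run the image and text encoders on a batch, compute the target loss, and apply an optimizer update; only the stopping rule is the point of this sketch.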

[0034] As described above, the training method for a visual language model for molecular representation learning provided in this application represents molecular structures as molecular images and uses automatically calculated chemical meta-knowledge as the supervisory signal for alignment pre-training. It converts automatically calculated chemical meta-knowledge, such as molecular descriptors, into text descriptions and forms learnable contextual semantics through a prompt learner. It employs supervised contrastive learning to align "molecular image features" and "prompt text features" in a shared latent space, thereby learning consistent representations that embed chemical semantics from large-scale data. It avoids reliance on proxy-task perturbations that may disrupt structural semantics and reduces the risk of chemical inconsistency caused by synthetic perturbations, thus improving reliability and semantic consistency. The chemical meta-knowledge can be automatically calculated by tools or obtained and standardized from existing chemical databases without manual annotation or expert rules, so the method adapts to pre-training at the scale of billions of samples. In downstream tasks, the method can fully consider the task semantics of the current property prediction and can also use textual prior knowledge to assist decision-making, so that the reasoning for related properties shares knowledge and accounts for the different emphasis each task places on molecular structure instead of learning in isolation, thereby improving the performance and interpretability of downstream models. Property prediction is interpretable on the basis of task semantics: the explanation comes from the model's attention allocation and knowledge weight allocation under the prompt conditions, rather than purely from post-hoc attribution, so it is more in line with task semantics and domain priors.

[0035] To further avoid relying on proxy task perturbations that may disrupt structural semantics and reduce the risk of chemical consistency issues caused by synthetic perturbations, thereby improving reliability and semantic consistency, in a training method for a visual language model for molecular representation learning provided in this application embodiment, the descriptor semantic prompts are obtained by a prompt learner transforming the chemical meta-knowledge associated with the training molecule; the prompt learner is used to generate learnable descriptor semantic prompts based on the type and value of the molecular descriptor.

[0036] In one or more embodiments of this application, the molecular structure can be obtained directly from public databases (e.g., stored as a structure string such as SMILES (Simplified Molecular Input Line Entry System)) and then rendered as a molecular structure image that serves as the visual modality input. Chemical meta-knowledge mainly consists of standardized molecular descriptors (MDs), which can be automatically calculated using cheminformatics toolkits or obtained directly from public databases, without manual compilation or expert annotation. Unlike traditional vision-language contrastive pre-training, this application does not rely on manually constructed text templates. Instead, the prompt learner generates descriptor semantic prompts in a data-driven manner, so that the same molecule can be associated with multiple text descriptions carrying different meta-knowledge semantics.

[0037] To further avoid relying on proxy task perturbations that may disrupt structural semantics and reduce the risk of chemical inconsistency caused by synthetic perturbations, thereby improving reliability and semantic consistency, in a training method for a visual language model for molecular representation learning provided in this application embodiment, the descriptor semantic prompt includes a learnable context embedding for representing the type of the molecular descriptor and a learnable value encoding for representing the value of the molecular descriptor.

[0038] In one or more embodiments of this application, meta-knowledge such as molecular descriptors, physicochemical properties, or information available in existing chemical databases is transformed into a set of prompts containing learnable context and value encoding, enabling molecules to obtain multiple matching text descriptions.

[0039] To further avoid relying on proxy task perturbations that may disrupt structural semantics and reduce the risk of chemical inconsistency caused by synthetic perturbations, thereby improving reliability and semantic consistency, in a training method for a visual language model for molecular representation learning provided in this application embodiment, the learnable value encoding is a learnable parameter obtained by mapping the values of the molecular descriptor through rank embedding. The values of the molecular descriptor are divided into multiple ranks according to the value distribution in the training molecular data, and each rank corresponds to an independent learnable vector.

[0040] In one or more embodiments of this application, rank embedding is used to map continuous or discrete values to learnable parameters, thereby achieving standardized and enumerable semantic representations. For a given descriptor, the pre-training set contains a finite number of observable values (or ranks), and the set of prompts corresponding to that descriptor contains one prompt for each observable value (or rank).
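A minimal sketch of rank embedding follows. The text says only that values are divided into ranks according to their distribution and that each rank has its own learnable vector; the equal-frequency (quantile) binning, the random initialisation, and the names `quantile_edges`, `rank_of`, and `RankEmbedding` are assumptions made for illustration, and the descriptor values are made up.

```python
import bisect
import random

def quantile_edges(values, num_ranks):
    """Return num_ranks - 1 cut points that split `values` into equal-frequency bins."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * r) // num_ranks] for r in range(1, num_ranks)]

def rank_of(value, edges):
    """Map a descriptor value to a rank index between 0 and len(edges)."""
    return bisect.bisect_right(edges, value)

class RankEmbedding:
    """One independent vector per rank; a training framework would update these as parameters."""
    def __init__(self, num_ranks, dim, seed=0):
        rng = random.Random(seed)
        self.table = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                      for _ in range(num_ranks)]

    def __call__(self, rank):
        return self.table[rank]

# Hypothetical values of one descriptor across a small training set.
descriptor_values = [-1.2, 0.3, 0.8, 1.5, 2.1, 2.9, 3.4, 4.0]
edges = quantile_edges(descriptor_values, num_ranks=4)
ranks = [rank_of(v, edges) for v in descriptor_values]
embed = RankEmbedding(num_ranks=4, dim=8)
value_encoding = embed(ranks[0])
```

Because the number of ranks is fixed, the prompt set for a descriptor is enumerable: one prompt per rank, each reusing the descriptor's context embedding with a different value encoding.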

[0041] To further avoid relying on proxy task perturbations that may disrupt structural semantics and reduce the risk of chemical inconsistency caused by synthetic perturbations, thereby improving reliability and semantic consistency, in a training method for a visual language model for molecular representation learning provided in this application embodiment, the step in which the supervised contrastive learning module in the visual language model aligns the molecular structure image feature data with the descriptor semantic prompt feature data and outputs molecular representation learning data corresponding to the training molecular data includes: Step 110: For multiple training molecules in a batch, the supervised contrastive learning module calculates the similarity value between the structural image feature data of each training molecule and the feature data of each descriptor semantic prompt, aligns the molecular structure image feature data with the descriptor semantic prompt feature data based on the ground-truth descriptor semantic prompt label of each training molecule and the cross-entropy loss function, and outputs the molecular representation learning data.

[0042] In one or more embodiments of this application, the supervised contrastive learning objective for a batch of samples is a target loss function computed over the sample indices in the batch, using the batch size, a similarity function (which can be either dot product or cosine similarity), and a temperature parameter. Unlike contrastive pre-training that relies on heuristic construction of positive and negative samples, this application explicitly uses "each molecule and its ground-truth meta-knowledge prompt" as the positive alignment relationship, thereby ensuring the chemical semantic consistency of the supervision signal.
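The objective described above can be sketched as a softmax cross-entropy over similarities. The exact formula is not reproduced in this text, so the one-directional (image to prompt) form, the assumption that prompt i is the only positive for image i within the batch, the temperature value 0.07, and the names `cosine` and `contrastive_loss` are all illustrative choices.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(image_feats, prompt_feats, temperature=0.07):
    """Mean cross-entropy of each image against all prompts; prompt i is the positive for image i."""
    total = 0.0
    for i, f in enumerate(image_feats):
        logits = [cosine(f, g) / temperature for g in prompt_feats]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(image_feats)

# Toy 2-dimensional features; the numbers are made up.
image_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(image_feats, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(image_feats, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when every image feature points the same way as its own prompt feature and grows when the pairing is wrong, which is the alignment behaviour the paragraph describes.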

[0043] To further improve chemical semantic consistency, in the training method for a visual language model for molecular representation learning provided in this application embodiment, if only one molecular descriptor exists, the target loss is expressed by the following formula:
$$\mathcal{L}_{\mathrm{single}} = \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{KL}}$$
where $\mathcal{L}_{\mathrm{single}}$ represents the target loss when a single molecular descriptor exists; $\mathcal{L}_{\mathrm{CE}}$ represents the cross-entropy loss; and $\mathcal{L}_{\mathrm{KL}}$ represents the KL divergence loss (Kullback-Leibler divergence).

[0044] In one or more embodiments of this application, under a single-descriptor setting, for a certain descriptor $d$, a batch of $B$ molecular images and the cue set $\mathcal{P}_d$ are taken and respectively encoded to obtain the image features $v_1, \dots, v_B$ and the cue features $t_1, \dots, t_{K_d}$, and the similarity matrix $A$ is calculated. Its elements can be represented as follows:
$$A_{ik} = v_i^{\top} t_k$$
where $A_{ik}$ represents the element in the $i$-th row and $k$-th column of the similarity matrix $A$; $v_i$ represents the feature data of the $i$-th molecular structure image; and $t_k$ represents the feature data of the $k$-th descriptor semantic cue. A one-hot encoded label matrix $Y$ is constructed to represent the true rank of each molecule. The row-normalized prediction distribution and the smoothed label distribution are defined as follows:
$$P_{ik} = \frac{\exp(A_{ik}/\tau_p)}{\sum_{j=1}^{K_d}\exp(A_{ij}/\tau_p)}, \qquad Q_{ik} = \frac{\exp(Y_{ik}/\tau_q)}{\sum_{j=1}^{K_d}\exp(Y_{ij}/\tau_q)}$$
where $P_{ik}$ represents the predicted probability obtained by row-normalizing $A_{ik}$; $j$ is an index variable that iterates over the value categories of the descriptor; $\tau_p$ is the temperature parameter of the prediction distribution and adjusts the smoothness of the predicted probability distribution; $Q_{ik}$ represents the label probability obtained by smoothing $Y_{ik}$; $Y_{ik}$ represents the element in the $i$-th row and $k$-th column of the one-hot encoded label matrix $Y$; and $\tau_q$ is the temperature parameter of the label distribution and adjusts the smoothness of the label distribution.
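The two distributions can be sketched as follows, under the assumption that both are temperature-scaled softmaxes taken over the rows of the similarity matrix and of the one-hot label matrix. The function names and the toy numbers are hypothetical.

```python
import math

def softmax(row, tau):
    """Temperature-scaled softmax of one row."""
    scaled = [z / tau for z in row]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(rank, num_ranks):
    return [1.0 if k == rank else 0.0 for k in range(num_ranks)]

def prediction_distribution(A, tau_p):
    """Row-normalise the similarity matrix A."""
    return [softmax(row, tau_p) for row in A]

def smoothed_label_distribution(Y, tau_q):
    """Soften the one-hot labels; a larger tau_q gives a flatter target."""
    return [softmax(row, tau_q) for row in Y]

# Toy similarity matrix: 2 molecules, 3 ranks; true ranks are 0 and 2.
A = [[2.0, 0.5, -1.0], [0.1, 0.3, 1.2]]
Y = [one_hot(r, 3) for r in (0, 2)]
P = prediction_distribution(A, tau_p=0.5)
Q = smoothed_label_distribution(Y, tau_q=1.0)
```

A small prediction temperature sharpens `P`, while the label temperature controls how much probability mass `Q` leaves on the incorrect ranks.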

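[0045] The cross-entropy loss ($\mathcal{L}_{\mathrm{CE}}$) and the KL divergence loss ($\mathcal{L}_{\mathrm{KL}}$) can be expressed as follows:
$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{B}\sum_{i=1}^{B}\sum_{k=1}^{K_d} Y_{ik}\log P_{ik}, \qquad \mathcal{L}_{\mathrm{KL}} = \frac{1}{B}\sum_{i=1}^{B}\sum_{k=1}^{K_d} Q_{ik}\log\frac{Q_{ik}}{P_{ik}}$$
where $B$ is the batch size, $K_d$ is the number of ranks of the descriptor, $Y$ is the one-hot encoded label matrix, $P$ is the row-normalized prediction distribution, and $Q$ is the smoothed label distribution. To further improve chemical semantic consistency, in the training method for a visual language model for molecular representation learning provided in this application embodiment, if there are two or more molecular descriptors, the target loss is expressed by the following formula:
$$\mathcal{L}_{\mathrm{multi}} = \frac{1}{D}\sum_{d=1}^{D}\mathcal{L}_{\mathrm{single}}^{(d)}$$
where $\mathcal{L}_{\mathrm{multi}}$ represents the target loss when multiple molecular descriptors exist; $\mathcal{L}_{\mathrm{single}}^{(d)}$ represents the single-descriptor loss of the $d$-th descriptor; and $D$ indicates the total number of molecular descriptors used during training.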
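The following sketch shows one way these losses might be computed. It assumes the single-descriptor loss is an unweighted sum of the two terms, that the KL term runs from the smoothed labels to the predictions, and that a small `eps` guards the logarithms; all three are assumptions rather than details confirmed by the text.

```python
import math

def cross_entropy(Y, P, eps=1e-12):
    """Batch-mean cross-entropy between one-hot labels Y and predictions P."""
    total = sum(y * math.log(p + eps)
                for y_row, p_row in zip(Y, P)
                for y, p in zip(y_row, p_row))
    return -total / len(Y)

def kl_divergence(Q, P, eps=1e-12):
    """Batch-mean KL divergence from the smoothed labels Q to predictions P."""
    total = sum(q * math.log((q + eps) / (p + eps))
                for q_row, p_row in zip(Q, P)
                for q, p in zip(q_row, p_row))
    return total / len(Q)

def single_descriptor_loss(Y, P, Q):
    # Unweighted sum of the two terms (assumption).
    return cross_entropy(Y, P) + kl_divergence(Q, P)

def multi_descriptor_loss(per_descriptor_losses):
    """Average of the independently computed single-descriptor losses."""
    return sum(per_descriptor_losses) / len(per_descriptor_losses)

# Toy distributions for 2 molecules and 3 ranks.
Y = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
P = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
Q = [[0.6, 0.2, 0.2], [0.2, 0.2, 0.6]]
loss_one = single_descriptor_loss(Y, P, Q)
loss_all = multi_descriptor_loss([loss_one, loss_one])
```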

[0046] In one or more embodiments of this application, pre-training is extended to the parallel optimization of $D$ descriptors: the single-descriptor loss is calculated independently for each descriptor and the results are averaged. In engineering implementation, to balance GPU memory and computing power budgets, this application adopts a stochastic parallel training strategy. In each iteration, a subset is sampled from the descriptor set and the backbone encoder is shared across that subset to approximate multi-descriptor pre-training, thereby maintaining scalable meta-knowledge alignment learning under limited memory.
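The stochastic parallel strategy can be sketched as below. The descriptor names other than TPSA and MolWt, the subset size, and the placeholder loss are hypothetical; a real implementation would compute each loss with the shared image and text encoders.

```python
import random

def sample_descriptor_subset(descriptor_names, subset_size, rng):
    """Draw a random subset of descriptors for the current iteration."""
    size = min(subset_size, len(descriptor_names))
    return rng.sample(descriptor_names, size)

def pretraining_step(descriptor_names, subset_size, single_descriptor_loss, rng):
    """Average the single-descriptor losses over the sampled subset only."""
    subset = sample_descriptor_subset(descriptor_names, subset_size, rng)
    losses = [single_descriptor_loss(name) for name in subset]
    return subset, sum(losses) / len(losses)

# Hypothetical descriptor set and a stand-in loss function.
descriptors = ["MolWt", "TPSA", "LogP", "NumHDonors", "NumHAcceptors"]
rng = random.Random(0)
toy_loss = lambda name: float(len(name))
subset, loss = pretraining_step(descriptors, 2, toy_loss, rng)
```

Only the sampled descriptors contribute cue sets and gradients in a given iteration, so memory use scales with the subset size rather than with the full descriptor set.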

[0047] This application also provides a molecular representation learning method. Referring to Figure 2, the molecular representation learning method includes the following steps: Step 10: Based on the prompt learner, the acquired target task data is transformed to obtain the task semantic prompts corresponding to the target task data. In step 10, for a downstream subtask $s$ whose number of categories is $C_s$, the prompt learner generates a set of category-conditional prompts (e.g., a pair of opposing category terms for binary classification) to represent the label semantic space of the subtask. This mechanism enables molecular representations to "switch contexts" between tasks of different natures, rather than learning a task-independent generalized description.
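One possible shape for such a prompt learner is sketched below. It assumes a shared block of learnable context vectors followed by one learnable class token per category; the class `TaskPromptLearner`, the category names, and the sizes are hypothetical and are not specified by the application.

```python
import random

class TaskPromptLearner:
    """Builds one prompt per category: shared context vectors plus a class token."""
    def __init__(self, class_names, n_ctx=4, dim=8, seed=0):
        rng = random.Random(seed)
        new_vec = lambda: [rng.gauss(0.0, 0.02) for _ in range(dim)]
        self.class_names = list(class_names)
        # Context vectors are shared across categories and would be trained.
        self.context = [new_vec() for _ in range(n_ctx)]
        # One learnable token per category carries the label semantics.
        self.class_tokens = {name: new_vec() for name in self.class_names}

    def prompts(self):
        """Return a mapping from category name to its prompt (list of vectors)."""
        return {name: self.context + [self.class_tokens[name]]
                for name in self.class_names}

# Hypothetical binary subtask with opposing category terms.
learner = TaskPromptLearner(["toxic", "non-toxic"])
task_prompts = learner.prompts()
```

Switching to another subtask would mean instantiating the learner with that subtask's categories, which is how the same molecule can be read in a different context.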

[0048] Step 20: Input the task semantic prompt and the preset domain prior knowledge prompt into the visual language model for molecular representation learning trained by the training method, so that the image encoding module in the visual language model generates image feature data corresponding to the target task data, and the visual language model extracts the visual attention weights corresponding to the task semantic prompt from the image encoding module; and the text encoding module in the visual language model generates task semantic prompt feature data and domain prior knowledge prompt feature data corresponding to the target task data; then the attention fusion module in the visual language model performs feature fusion on the task semantic prompt feature data and the domain prior knowledge prompt feature data to generate knowledge-enhanced text feature data and knowledge attention weights; output the image feature data, the knowledge-enhanced text feature data, the visual attention weights, and the knowledge attention weights; wherein, the domain prior knowledge prompt includes automatically calculated molecular descriptors or auxiliary task information; the visual attention weights are used to locate molecular information related to the current task semantic prompt, and the knowledge attention weights are used to quantify the contribution of the domain prior knowledge prompt to the prediction result.

[0049] In step 20, the domain prior knowledge prompts can be represented as:
$$\mathcal{P}_i^{\mathrm{pri}} = \{\, p_{i,1}^{\mathrm{pri}},\ \dots,\ p_{i,M_i}^{\mathrm{pri}} \,\}$$
where the superscript $\mathrm{pri}$, an abbreviation of "prior", distinguishes this prompt set as a domain prior knowledge set rather than a descriptor cue set; $M_i$ represents the total number of domain prior knowledge prompts corresponding to the $i$-th molecule; $p_{i,k}^{\mathrm{pri}}$ represents the $k$-th prior knowledge prompt for the $i$-th molecule; and $\mathcal{P}_i^{\mathrm{pri}}$ represents the domain prior knowledge prompt set of the $i$-th molecule. A domain prior knowledge prompt can be a physical property prior (such as TPSA or MolWt) or auxiliary task information (such as a semantic prompt like "this molecule is positive on a certain task"). For a batch of data of downstream subtask $s$, the image encoding module outputs the image feature data $v_i$. The task prompt learner generates a set of context prompts, from which the text encoding module obtains the task semantic prompt feature data $T_s$. Simultaneously, each molecule can optionally have its knowledge prompt set encoded to obtain the domain prior knowledge prompt feature data $E_i$. The attention vectors can be constructed as follows:
$$Q = T_s W_Q, \qquad K = E_i W_K, \qquad V = E_i W_V$$
where $Q$ represents the query vector; $K$ represents the key vector; $V$ represents the value vector; and $W_Q$, $W_K$, and $W_V$ are learnable projection matrices. The knowledge-enhanced text feature data obtained through scaled dot-product attention can be represented as follows:
$$\tilde{T}_{s,i} = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
where $\tilde{T}_{s,i}$ represents the knowledge-enhanced text feature data and $d_k$ is the dimension of $K$. After linear projection, the image branch can be represented by the following formula:
$$\tilde{v}_i = W_{\mathrm{img}}\, v_i$$
where $\tilde{v}_i$ represents the projected image feature data and $W_{\mathrm{img}}$ is the learnable projection matrix of the image branch. For downstream task $s$, a similarity matrix $S$ is constructed, whose elements can be represented as:
$$S_{ic} = \tilde{v}_i^{\top}\, \tilde{t}_{i,c}$$
where $\tilde{t}_{i,c}$ is the row of $\tilde{T}_{s,i}$ for category $c$, and optimization is supervised by the downstream label matrix. The core of this mechanism is that task prompts provide the context of "what task needs to be solved," while knowledge prompts provide interpretable support through "domain priors or auxiliary semantics."
These two aspects are dynamically modulated through attention fusion, thereby improving downstream generalization and interpretability. Furthermore, the attention fusion module can use different attention structures, such as single-head or multi-head cross-attention, gating units, or fusion through addition or concatenation, all of which allow prior knowledge to conditionally influence the task semantics.
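A single-head version of the attention fusion step can be sketched as follows. It assumes the task prompt feature acts as the query and the knowledge prompt features act as keys and values; the helper names and the identity projection matrices used in the example are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [dot(row, x) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fusion(task_feat, knowledge_feats, W_q, W_k, W_v):
    """Scaled dot-product attention: task feature queries the knowledge features."""
    q = matvec(W_q, task_feat)
    keys = [matvec(W_k, e) for e in knowledge_feats]
    values = [matvec(W_v, e) for e in knowledge_feats]
    scale = math.sqrt(len(q))
    weights = softmax([dot(q, k) / scale for k in keys])
    fused = [sum(w * v[j] for w, v in zip(weights, values))
             for j in range(len(values[0]))]
    return fused, weights  # knowledge-enhanced feature, knowledge attention weights

identity = [[1.0, 0.0], [0.0, 1.0]]
task_feat = [1.0, 0.0]
knowledge_feats = [[1.0, 0.0], [0.0, 1.0]]  # e.g. features of two prior prompts
fused, weights = attention_fusion(task_feat, knowledge_feats, identity, identity, identity)
```

The returned `weights` are what make the fusion inspectable: each entry states how much one knowledge prompt contributed to the fused text feature.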

[0050] The explainable reasoning mechanisms include visual and textual explanations. In visual explanation, the model forms an attention or saliency response in the visual encoding branch that aligns with the semantics of a given task prompt or descriptor cue, thereby locating the molecular substructure region most critical to that semantic meaning. When the prompt context changes, the attention region for the same molecule can change with the semantic switch, achieving structural explanation within a "controllable context." In textual explanation, the weights of the attention fusion module during the knowledge-guided prompt fine-tuning stage can be interpreted as the contributions of different knowledge prompts to the final knowledge-enhanced text representation. These weights can be used not only for sample-level explanations (why the prediction for a molecule is influenced by the TPSA (Topological Polar Surface Area) or MolWt (Molecular Weight) prior) but also for task-level statistics (which type of prior a task as a whole relies on more), thus forming reproducible explanations of knowledge contributions.
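The textual explanation can be illustrated with a short sketch that turns knowledge attention weights into sample-level rankings and task-level averages. It assumes every molecule shares the same ordered list of priors; the function names and the weight values are hypothetical.

```python
def sample_level_explanation(prior_names, weights):
    """Rank the priors of one molecule by their attention weight, highest first."""
    return sorted(zip(prior_names, weights), key=lambda pair: pair[1], reverse=True)

def task_level_statistics(prior_names, weights_per_sample):
    """Average each prior's weight over all molecules of a task."""
    n = len(weights_per_sample)
    return {name: sum(w[j] for w in weights_per_sample) / n
            for j, name in enumerate(prior_names)}

priors = ["TPSA", "MolWt"]
weights_per_sample = [[0.7, 0.3], [0.5, 0.5]]  # illustrative attention weights
ranking = sample_level_explanation(priors, weights_per_sample[0])
task_stats = task_level_statistics(priors, weights_per_sample)
```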

[0051] For a specific example of the training method for the visual language model used for molecular representation learning and of the molecular representation learning method described in this application, refer to Figure 3. The training method includes the following steps: During the pre-training phase, each molecule is associated with $D$ molecular descriptors. For the $d$-th descriptor of the $i$-th molecule, a descriptor cue is constructed from special start, end, and separator symbols; a learnable context embedding bound to the descriptor type $d$, which represents the semantics of "which descriptor the cue is about"; and an encoding of the value of the descriptor. This invention employs rank embedding to map continuous or discrete values into learnable parameters, thereby achieving a standardized and enumerable semantic representation. The number of observable values (or ranks) of descriptor $d$ in the pre-training set is denoted as $K_d$. The cue set corresponding to this descriptor is formed by enumerating the ranks:
$$\mathcal{P}_d = \{\, p_{d,1},\ p_{d,2},\ \dots,\ p_{d,K_d} \,\}$$
Molecular images and text descriptions are input into their respective encoders and mapped to a shared latent space:
$$v_i = f_{\mathrm{img}}(x_i), \qquad t_k = f_{\mathrm{txt}}(p_{d,k})$$
where $f_{\mathrm{img}}$ is the image encoder, $f_{\mathrm{txt}}$ is the text encoder, and $x_i$ is the structure image of the $i$-th molecule. The training objective is to make semantically consistent "image-text" pairs close to each other in the latent space and semantically inconsistent pairs far apart.

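[0052] For a sample set with batch size $B$, the supervised contrastive learning objective is:
$$\mathcal{L}_{\mathrm{SCL}} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\big(\mathrm{sim}(v_i, t_{r_i})/\tau\big)}{\sum_{k=1}^{K_d}\exp\big(\mathrm{sim}(v_i, t_k)/\tau\big)}$$
where $v_i$ is the image feature of the $i$-th molecule, $t_k$ is the feature of the $k$-th cue, $t_{r_i}$ is the feature of the ground-truth cue of the $i$-th molecule, $\mathrm{sim}(\cdot,\cdot)$ is the similarity function (e.g., dot product or cosine similarity), and $\tau$ is the temperature parameter. Unlike contrastive pre-training that relies on heuristically constructed positive and negative samples, this method explicitly uses "each molecule and its ground-truth meta-knowledge cue" as the positive alignment relationship, thereby ensuring the chemical semantic consistency of the supervision signal.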

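[0053] Under a single-descriptor setting, for a specific descriptor $d$, a batch of $B$ molecular images and the cue set $\mathcal{P}_d$ are taken and respectively encoded to obtain the image features $v_1, \dots, v_B$ and the cue features $t_1, \dots, t_{K_d}$, and the similarity matrix $A$ is calculated. Its elements are:
$$A_{ik} = v_i^{\top} t_k$$
A one-hot encoded label matrix $Y$ is constructed to represent the true rank of each molecule. The row-normalized prediction distribution and the smoothed label distribution are defined as:
$$P_{ik} = \frac{\exp(A_{ik}/\tau_p)}{\sum_{j=1}^{K_d}\exp(A_{ij}/\tau_p)}, \qquad Q_{ik} = \frac{\exp(Y_{ik}/\tau_q)}{\sum_{j=1}^{K_d}\exp(Y_{ij}/\tau_q)}$$
where $\tau_p$ and $\tau_q$ control the "sharpness" of the prediction distribution and of the label distribution, respectively.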

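[0054] The cross-entropy loss ($\mathcal{L}_{\mathrm{CE}}$) and the KL divergence loss ($\mathcal{L}_{\mathrm{KL}}$) are then defined as:
$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{B}\sum_{i=1}^{B}\sum_{k=1}^{K_d} Y_{ik}\log P_{ik}, \qquad \mathcal{L}_{\mathrm{KL}} = \frac{1}{B}\sum_{i=1}^{B}\sum_{k=1}^{K_d} Q_{ik}\log\frac{Q_{ik}}{P_{ik}}$$
where $Y$ is the one-hot encoded label matrix, $P$ is the row-normalized prediction distribution, and $Q$ is the smoothed label distribution. Therefore, the overall loss for single-descriptor training is:
$$\mathcal{L}_{\mathrm{single}} = \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{KL}}$$
The above construction enables the model not only to obtain supervision at the "correct rank" (cross-entropy loss), but also to constrain the consistency between the predicted distribution and the smoothed label distribution through "distribution consistency" (KL divergence loss), thereby improving training stability.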

[0055] Pre-training is extended to the parallel optimization of $D$ descriptors. The single-descriptor loss is calculated independently for each descriptor and then averaged. In terms of engineering implementation, to balance the GPU memory and computing power budgets, this invention adopts a stochastic parallel training strategy: each iteration samples a subset from the descriptor set and shares the backbone encoder to approximate multi-descriptor pre-training, thereby maintaining scalable meta-knowledge alignment learning under limited memory.

[0056] The molecular representation learning method includes the following steps: In the downstream task phase, the descriptor semantics of the text description are replaced with task semantics. For a downstream subtask $s$ whose number of categories is $C_s$, the prompt learner generates a set of category-conditional prompts (e.g., a pair of opposing category terms for binary classification) to represent the label semantic space of the subtask. This mechanism enables molecular representations to "switch contexts" between tasks of different natures, rather than learning a task-independent generalized description.

[0057] Domain priors can further be introduced into the prompt space. For the $i$-th molecule, a set of knowledge prompts is constructed:
$$\mathcal{P}_i^{\mathrm{pri}} = \{\, p_{i,1}^{\mathrm{pri}},\ \dots,\ p_{i,M_i}^{\mathrm{pri}} \,\}$$
where $M_i$ is the number of knowledge prompts of the $i$-th molecule. Prior knowledge may include (but is not limited to) physical property priors (such as TPSA or MolWt) or auxiliary task information (such as a semantic prompt like "this molecule is positive on a certain task"). These knowledge prompts serve as "interpretable prior channels" in the downstream stage and participate in attention fusion.
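As a simple illustration, a knowledge prompt set for one molecule might be assembled from already computed descriptor values and optional auxiliary statements. The function name, the text template, and the numeric values below are hypothetical; the application does not specify how a prompt is worded.

```python
def build_knowledge_prompts(descriptors, auxiliary=None):
    """Turn descriptor values and auxiliary statements into a list of text prompts."""
    prompts = [f"{name}: {value:.2f}" for name, value in descriptors.items()]
    prompts.extend(auxiliary or [])
    return prompts

# Illustrative descriptor values for one molecule.
knowledge_prompts = build_knowledge_prompts(
    {"TPSA": 63.6, "MolWt": 180.16},
    auxiliary=["this molecule is positive on a certain task"],
)
```

Each resulting string would then be passed through the text encoder so that it can take part in attention fusion as one interpretable prior channel.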

[0058] For a batch of data of downstream subtask $s$, the image encoder outputs the image features $v_i$. The task prompt learner generates a set of context prompts, from which the text features $T_s$ are obtained through the text encoder. Simultaneously, each molecule can optionally have its knowledge prompt set encoded to obtain the knowledge features $E_i$. To obtain "knowledge-enhanced text features," this invention introduces an attention fusion module. The vectors are constructed as:
$$Q = T_s W_Q, \qquad K = E_i W_K, \qquad V = E_i W_V$$
and the knowledge-enhanced text features are obtained by scaled dot-product attention:
$$\tilde{T}_{s,i} = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
where $W_Q$, $W_K$, and $W_V$ are learnable projection matrices and $d_k$ is the dimension of $K$. The image branch undergoes linear projection:
$$\tilde{v}_i = W_{\mathrm{img}}\, v_i$$
For downstream task $s$, a similarity matrix $S$ is constructed, whose elements can be represented as:
$$S_{ic} = \tilde{v}_i^{\top}\, \tilde{t}_{i,c}$$
where $\tilde{t}_{i,c}$ is the row of $\tilde{T}_{s,i}$ for category $c$, and optimization is supervised by the downstream label matrix. At the core of this mechanism, task prompts provide the context of "what task needs to be solved," while knowledge prompts provide interpretable support through "domain priors or auxiliary semantics." The two are dynamically modulated through attention fusion, thereby improving downstream generalization and interpretability.
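The downstream scoring and supervision step can be sketched as follows. It assumes the similarity is a dot product between the projected image feature and each category's knowledge-enhanced text feature, and that supervision is a softmax cross-entropy against integer labels; the function names and toy features are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    return [dot(row, x) for row in W]

def downstream_logits(image_feat, W_img, class_text_feats):
    """One row of the similarity matrix: projected image vs. each category feature."""
    projected = matvec(W_img, image_feat)
    return [dot(projected, t) for t in class_text_feats]

def downstream_loss(logits_batch, labels):
    """Batch-mean softmax cross-entropy against the downstream labels."""
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[label]
    return total / len(labels)

W_img = [[1.0, 0.0], [0.0, 1.0]]          # identity projection for the example
class_text_feats = [[1.0, 0.0], [0.0, 1.0]]  # one enhanced text feature per category
logits = downstream_logits([2.0, 0.0], W_img, class_text_feats)
loss_correct = downstream_loss([logits], [0])
loss_wrong = downstream_loss([logits], [1])
```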

[0059] This application also provides an electronic device, which may include a processor, a memory, a receiver, and a transmitter. The processor is used to execute the training method for the visual language model for molecular representation learning and the molecular representation learning method mentioned in the above embodiments. The processor and the memory can be connected via a bus or other means, taking a bus connection as an example. The receiver can be connected to the processor and the memory via wired or wireless means.

[0060] The processor can be a central processing unit (CPU). The processor can also be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or combinations of the above types of chips.

[0061] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the training method for a visual language model for molecular representation learning and the corresponding program instructions / modules for the molecular representation learning method described in the embodiments of this application. The processor executes various functional applications and data processing by running the non-transitory software programs, instructions, and modules stored in the memory, thereby implementing the training method for a visual language model for molecular representation learning and the molecular representation learning method described in the above method embodiments.

[0062] The memory may include a program storage area and a data storage area. The program storage area may store the operating system and applications required for at least one function; the data storage area may store data created by the processor, etc. Furthermore, the memory may include high-speed random access memory and non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory may optionally include memory remotely located relative to the processor, which can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

[0063] The one or more modules are stored in the memory, and when executed by the processor, the training method of the visual language model for molecular representation learning and the molecular representation learning method described in the embodiment are executed.

[0064] In some embodiments of this application, the user equipment may include a processor, a memory, and a transceiver unit. The transceiver unit may include a receiver and a transmitter. The processor, memory, receiver, and transmitter may be connected via a bus system. The memory is used to store computer instructions, and the processor is used to execute the computer instructions stored in the memory to control the transceiver unit to send and receive signals.

[0065] As one implementation method, the functions of the receiver and transmitter in this application can be implemented by transceiver circuits or dedicated transceiver chips, and the processor can be implemented by dedicated processing chips, processing circuits or general-purpose chips.

[0066] As another implementation approach, the server provided in this application embodiment can be implemented using a general-purpose computer. That is, the program code implementing the processor, receiver, and transmitter functions is stored in memory, and the general-purpose processor implements the processor, receiver, and transmitter functions by executing the code in memory.

[0067] This application also provides a computer-readable storage medium storing a computer program thereon. When executed by a processor, the computer program implements the steps of the aforementioned training method for a visual language model for molecular representation learning and the molecular representation learning method. The computer-readable storage medium can be a tangible storage medium, such as random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, register, floppy disk, hard disk, removable storage disk, CD-ROM, or any other form of storage medium known in the art.

[0068] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the aforementioned training method for a visual language model for molecular representation learning and the steps of the molecular representation learning method.

[0069] Those skilled in the art will understand that the exemplary components, systems, and methods described in conjunction with the embodiments disclosed herein can be implemented in hardware, software, or a combination of both. Whether implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application. When implemented in hardware, it can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this application are programs or code segments used to perform the required tasks. The programs or code segments can be stored on a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried on a carrier wave.

[0070] It should be clarified that this application is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of this application is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of this application.

[0071] In this application, features described and / or illustrated for one embodiment may be used in the same or similar manner in one or more other embodiments, and / or combined with or in place of features of other embodiments.

[0072] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Various modifications and variations can be made to the embodiments of this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.

Claims

1. A training method for a visual language model for molecular representation learning, characterized in that, The training method includes: In the current iteration, training molecular data is input into the current visual language model used for molecular representation learning. The image encoding module and text encoding module in this model generate molecular structure image feature data and descriptor semantic cue feature data corresponding to the training molecular data, respectively. The supervised contrastive learning module in the visual language model aligns the molecular structure image feature data and the descriptor semantic cue feature data and outputs molecular representation learning data corresponding to the training molecular data. The training molecular data includes structural images of training molecules and learnable descriptor semantic cues associated with the training molecules. The descriptor semantic cues are obtained by transforming chemical meta-knowledge associated with the training molecules. The chemical meta-knowledge includes molecular descriptors that can be automatically calculated and do not require manual annotation. In the current iteration, the target loss of the current iteration is determined based on the molecular representation learning data, and the model parameters of the image encoding module and the text encoding module are optimized based on the target loss to update the visual language model; If the visual language model has converged or the current iteration round is the preset last iteration round, then the updated visual language model is determined as the visual language model for molecular representation learning.

2. The method according to claim 1, characterized in that, The descriptor semantic cues are obtained by transforming the chemical meta-knowledge associated with the training molecules through a prompt learner; the prompt learner is used to generate learnable descriptor semantic cues based on the type and value of the molecular descriptor.

3. The method according to claim 1, characterized in that, The descriptor semantic cues include a learnable context embedding for representing the type of the molecular descriptor and a learnable value encoding for representing the value of the molecular descriptor.

4. The method according to claim 3, characterized in that, The learnable value encoding is a learnable parameter obtained by mapping the values of the molecular descriptor through rank embedding technology. The values of the molecular descriptor are divided into multiple ranks according to the value distribution in the training molecular data, and each rank corresponds to an independent learnable vector.

5. The method according to claim 1, characterized in that, The supervised contrastive learning module in the visual language model aligns the molecular structure image feature data and the descriptor semantic cue feature data and outputs molecular representation learning data corresponding to the training molecular data, including: For multiple training molecules in a batch, the supervised contrastive learning module calculates the similarity value between the structural image feature data of each training molecule and the semantic cue feature data of each descriptor. Based on the real descriptor semantic cue label corresponding to each training molecule and the cross-entropy loss function, the module aligns the molecular structural image feature data and the descriptor semantic cue feature data and outputs the molecular representation learning data.

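6. The method according to claim 5, characterized in that, If only one molecular descriptor exists, the target loss is expressed by the following formula:
$$\mathcal{L}_{\mathrm{single}} = \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{KL}}$$
where $\mathcal{L}_{\mathrm{single}}$ represents the target loss when a single molecular descriptor exists; $\mathcal{L}_{\mathrm{CE}}$ represents the cross-entropy loss; and $\mathcal{L}_{\mathrm{KL}}$ represents the KL divergence loss.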

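7. The method according to claim 1, characterized in that, If there are two or more molecular descriptors, the target loss is expressed by the following formula:
$$\mathcal{L}_{\mathrm{multi}} = \frac{1}{D}\sum_{d=1}^{D}\mathcal{L}_{\mathrm{single}}^{(d)}$$
where $\mathcal{L}_{\mathrm{multi}}$ represents the target loss when multiple molecular descriptors exist; $\mathcal{L}_{\mathrm{single}}^{(d)}$ represents the single-descriptor loss of the $d$-th molecular descriptor; and $D$ indicates the total number of molecular descriptors used during training.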

8. A molecular representation learning method, characterized in that, The molecular representation learning method includes: Based on a prompt learner, the acquired target task data is transformed to obtain the task semantic prompts corresponding to the target task data; The task semantic prompts and preset domain prior knowledge prompts are input into a visual language model for molecular representation learning trained by the training method described in any one of claims 1-7. This causes the image encoding module in the visual language model to generate image feature data corresponding to the target task data, and the visual language model to extract visual attention weights corresponding to the task semantic prompts from the image encoding module. Furthermore, the text encoding module in the visual language model generates task semantic prompt feature data and domain prior knowledge prompt feature data corresponding to the target task data. Then, the attention fusion module in the visual language model performs feature fusion on the task semantic prompt feature data and the domain prior knowledge prompt feature data to generate knowledge-enhanced text feature data and knowledge attention weights. The image feature data, the knowledge-enhanced text feature data, the visual attention weights, and the knowledge attention weights are output. The domain prior knowledge prompts include automatically computed molecular descriptors or auxiliary task information. The visual attention weights are used to locate molecular information related to the current task semantic prompts, and the knowledge attention weights are used to quantify the contribution of the domain prior knowledge prompts to the prediction results.

9. An electronic device, characterized in that, It includes a processor and a memory; when the processor executes the running program stored in the memory, it implements the training method for a visual language model for molecular representation learning as described in any one of claims 1 to 7 and / or the molecular representation learning method as described in claim 8.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, When executed by a processor, the computer program implements the training method for a visual language model for molecular representation learning as described in any one of claims 1 to 7 and / or the molecular representation learning method as described in claim 8.