A method for predicting molecular properties by fusing multi-scale multi-view information
By integrating multi-scale and multi-view information into a molecular property prediction method, and combining Uni-Mol, BiGRU, and GraphTransformer, the problems of insufficient information fusion and data imbalance in molecular property prediction are solved, achieving more efficient molecular property prediction and improving the accuracy and reliability of drug development.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NANJING HANWEI PUBLIC HEALTH RES INST CO LTD
- Filing Date
- 2026-03-02
- Publication Date
- 2026-06-12
AI Technical Summary
Existing methods for predicting molecular properties suffer from insufficient fusion of multi-view information, neglect of intermolecular correlations and cross-task correlations, weak generalization ability due to scarce labeled data, and unbalanced data affecting prediction reliability.
A multi-scale, multi-view information prediction method is adopted, which combines Uni-Mol to extract 3D geometric features, BiGRU to extract 1D sequence features, and GraphTransformer to extract 2D topological features. Intramolecular features are fused through a cross-attention mechanism, and an intermolecular graph is constructed to mine correlation information. The weighted cross-entropy loss function is used to handle the data imbalance problem.
It improves the accuracy and robustness of molecular property prediction, enhances adaptability to new molecules and complex structures, and reduces drug development costs and time.
Figure CN122201496A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of molecular property prediction technology, and more specifically, to a method for predicting molecular properties by integrating multi-scale and multi-view information.
Background Technology
[0002] In drug development, molecular property prediction is a crucial step in accelerating candidate drug screening and reducing development cycles and costs. However, traditional drug development still faces challenges such as long cycles, high investment, and low success rates. 30%-50% of drug candidates are terminated due to unfavorable key molecular properties such as absorption, distribution, metabolism, excretion, and toxicity (ADMET). Traditional experimental assay methods suffer from high resource consumption and poor scalability, making them unsuitable for large-scale compound library screening needs.
[0003] In existing computational prediction methods, traditional quantitative structure-activity relationship (QSAR / QSPR) models rely on experts manually designing molecular descriptors, resulting in limited generalization ability. While mainstream deep learning methods achieve end-to-end feature learning, most rely only on single-view representations (1D sequence, 2D topology, or 3D geometry), failing to fully integrate multidimensional molecular features and generally ignoring inter-molecular correlations and cross-task relevance. Furthermore, the scarcity of labeled data leads to insufficient generalization ability for novel and complex molecular structures. In addition, the class imbalance problem often exists in molecular property prediction datasets, further affecting the reliability of model predictions for a few key samples (such as active drugs and toxic molecules). These shortcomings collectively limit the accuracy and practicality of molecular property prediction.
[0004] Therefore, the present invention provides a method for predicting molecular properties by integrating multi-scale and multi-view information, thereby addressing the above-mentioned technical problems.
Summary of the Invention
[0005] This disclosure aims to address the shortcomings of existing technologies by providing a method for predicting molecular properties by integrating multi-scale and multi-view information. The invention employs a multi-view feature extraction scheme, with the pre-trained large model Uni-Mol extracting 3D geometric features as its core, combined with a bidirectional gated recurrent unit (BiGRU) extracting 1D sequence features and a graph transformer extracting 2D topological features. Through a cross-attention mechanism, it achieves deep fusion of intramolecular multi-view features, while simultaneously constructing an intermolecular graph (IMG) to mine intermolecular correlation information and integrate cross-task property correlation features. A weighted cross-entropy loss function is used to address data imbalance issues, thereby accurately predicting key molecular properties, improving prediction accuracy and robustness, and providing efficient support for drug development.
[0006] The above-mentioned technical objective of this invention is achieved through the following technical solution: a method for predicting molecular properties by fusing multi-scale and multi-view information, comprising the following steps:
S1. Intramolecular feature extraction and fusion: the 1D sequence features, 2D topological features, and 3D geometric features of the molecule are extracted separately. These multi-view features are then fused using a cross-attention mechanism, with the 3D geometric features as the core, to obtain the intramolecular fused features;
S2. Intermolecular information mining: an intermolecular graph (IMG) is constructed and intermolecular correlation features are extracted using a graph neural network (GNN); at the same time, the cross-task property correlation features of the molecule are concatenated;
S3. Multi-scale feature fusion and prediction: the intramolecular fused features, the intermolecular correlation features, and the cross-task property correlation features are concatenated to obtain the final molecular embedding, which is input into the feedforward network (FFN) to obtain the molecular property prediction result;
S4. Imbalanced data processing: a weighted cross-entropy loss function is used to optimize model training and balance the loss contributions of positive and negative samples.
[0007] As a preferred embodiment of the present invention, the extraction of the 1D sequence features includes: after tokenization and embedding, the SMILES sequence is input into a bidirectional gated recurrent unit (BiGRU); the forward GRU and the backward GRU each encode the context information, and their outputs are concatenated to yield the SMILES feature; a hybrid fingerprint vector is obtained by combining the MACCS fingerprint, the PharmacophoreErG fingerprint, and the PubChem fingerprint, and the fingerprint feature is obtained by processing this vector with a multilayer perceptron (MLP); the SMILES feature and the fingerprint feature are fused by element-wise product to obtain the 1D sequence features.
[0008] As a preferred embodiment of the present invention, the extraction of the 2D topological features includes: representing the molecule as an undirected graph in which the node set V corresponds to the atoms of the molecule and the edge set E corresponds to the chemical bonds; the node features include physicochemical properties such as atomic number and hybridization state, and the edge features include bond properties such as bond type and conjugation. Laplacian positional encodings (LPEs) are introduced to enhance node structural position information. A graph Transformer (GraphTransformer) dynamically allocates neighbor weights through a multi-head attention mechanism to model long-range dependencies between atoms, and outputs the 2D topological features.
As a preferred embodiment of the present invention, the 3D geometric features are extracted using a pre-trained large model, Uni-Mol. The SMILES sequence is input into Uni-Mol, and the model generates an energy-minimized 3D molecular conformation and outputs 3D features that incorporate spatial information. Furthermore, the 3D features are invariant to rotation and translation.
[0009] As a preferred embodiment of the present invention, the fusion process of the cross-attention mechanism is as follows: the cross-attention feature att1 is calculated between the 1D sequence features and the 3D geometric features; the cross-attention feature att2 is calculated between the 2D topological features and the 3D geometric features; and att1 and att2 are concatenated to obtain the intramolecular fused features.
As a preferred embodiment of the present invention, the construction of the intermolecular graph (IMG) includes: using molecules as nodes, with Morgan fingerprints as node attributes; calculating the Tanimoto coefficient of each molecular pair as the structural similarity score sim; establishing an edge between the corresponding nodes when sim is greater than or equal to a preset threshold, with the Tanimoto coefficient as the edge attribute; and aggregating neighbor node information through multi-layer message passing in a graph neural network (GNN) to output the intermolecular correlation features.
As a preferred technical solution of the present invention, the cross-task property correlation feature is composed of the known labels of all non-target properties of the molecule, and is used to encode cross-task prior knowledge.
[0010] As a preferred embodiment of the present invention, the final molecular embedding is calculated by a formula in which ⊙ represents the element-wise product; the feedforward network (FFN) outputs a binary classification prediction result.
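[0011] As a preferred embodiment of the present invention, the weighted cross-entropy loss function is defined as follows: $\mathcal{L} = -\frac{1}{N}\left[ w \sum_{i \in S^{+}} \log \hat{y}_i + \sum_{i \in S^{-}} \log\left(1 - \hat{y}_i\right) \right]$; where $S^{+}$ is the positive sample set, $S^{-}$ is the negative sample set, $\hat{y}_i$ is the predicted probability of the model for the i-th sample, $N$ is the total number of samples, and $w$ is the balancing weight.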
[0012] As a preferred technical solution of the present invention, the preset threshold .
[0013] In summary, the present invention has the following beneficial effects: Firstly, it achieves deep fusion of multi-view features within molecules by integrating three complementary features: 1D sequence (SMILES + molecular fingerprint), 2D topology (molecular graph), and 3D geometry (extracted from pre-trained large model). Combined with cross-attention mechanism, it achieves information complementarity, comprehensively captures multi-dimensional molecular structure and physicochemical information, and solves the defect of incomplete single-view representation in existing methods.
[0014] Secondly, it fully exploits intermolecular and cross-task correlation information. By constructing an intermolecular graph (IMG) to capture the global interaction patterns of structurally similar molecules, and by integrating cross-task property correlation features, it uses group knowledge and prior information to optimize the prediction logic, making up for the shortcoming of traditional methods that ignore correlations between molecules and across tasks.
[0015] Third, it significantly enhances the model's generalization ability by introducing the pre-trained large model Uni-Mol, which is based on the general molecular representation learned from massive unlabeled data. This makes the model more adaptable to new molecules and complex molecular structures, thus overcoming the generalization bottleneck caused by the scarcity of labeled data.
[0016] Fourth, it accurately captures long-range molecular dependencies. It uses Graph Transformer to process 2D molecular graphs and flexibly models long-range interactions between atoms through dynamic attention weight allocation, which is superior to the local message passing of traditional GNNs and improves the representation accuracy of complex molecular structures.
[0017] Fifth, it optimizes prediction performance on imbalanced data. By using a weighted cross-entropy loss function to dynamically adjust the loss weights of positive and negative samples, it enhances the ability to identify minority-class key samples (such as active drug candidates and toxic molecules), reduces the risk of false negatives, and adapts to the class imbalance that is characteristic of molecular property prediction datasets.
[0018] Sixth, it improves the overall accuracy and practicality of molecular property prediction. The collaborative design of multiple scales and modules enables the model to perform well in various molecular property prediction tasks such as toxicity, biological activity, and pharmacokinetics, providing a reliable tool for early screening of drug development, effectively shortening the research and development cycle and reducing experimental costs.
Attached Figure Description
[0019] Figure 1 is a framework diagram of multi-scale multi-view fusion (MSMPP) provided in an embodiment of the present invention;
Figure 2 is a schematic diagram of the label distribution of the selected MoleculeNet datasets provided in an embodiment of the present invention;
Figures 3-6 show the ROC curves, PR curves, probability scatter plots, and confusion matrices of the model on the BACE and BBBP datasets provided in this embodiment of the invention;
Figures 7-8 show the ROC-AUC values of different models on the BACE and BBBP datasets provided in this embodiment of the invention;
Figures 9-13 are schematic diagrams of the MSMPP ablation study results provided in this embodiment of the invention;
Figure 14 is a diagram of the results on the five datasets under different parameter values, provided in the embodiments of the present invention;
Figures 15-16 are schematic diagrams of the molecular representation embeddings on the BBBP and BACE test sets, provided in the embodiments of the present invention.
Detailed Implementation
[0020] The present application will now be described in detail with reference to specific embodiments. These embodiments will help those skilled in the art to further understand the present application, but do not limit the present application in any way. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of the present application. These all fall within the protection scope of the present application.
[0021] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0022] Unless otherwise defined, all technical and scientific terms used in this specification have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains. The term "and / or" as used in this specification includes any and all combinations of one or more of the associated listed items.
[0023] Furthermore, the technical features involved in the various embodiments of this application described below can be combined with each other as long as they do not conflict with each other.
[0024] This disclosure aims to address the following problems in existing molecular property prediction methods: insufficient multi-view information fusion, neglect of intermolecular correlations and cross-task relevance, weak generalization ability due to scarce labeled data, and the impact of imbalanced data on prediction reliability. In view of this, this disclosure proposes a molecular property prediction method that integrates multi-scale, multi-view information for accurate prediction of key molecular properties (such as ADMET-related properties). The method employs a multi-scale, multi-view fusion framework, the pre-trained 3D molecular large model Uni-Mol, a graph Transformer, and a weighted cross-entropy loss function. By integrating intramolecular 1D sequence, 2D topology, and 3D geometry multi-view features with intermolecular correlation information, using a pre-trained model to enhance generalization ability, and optimizing the handling of imbalanced data, it aims to improve the accuracy and robustness of molecular property prediction and to reduce drug development costs and timelines.
[0025] Referring to Figure 1, a framework diagram of the multi-scale multi-view fusion (MSMPP) framework described in this disclosure embodiment is shown. In the figure, part (A) shows the details of the model input features, part (B) shows the details of the intra-scale feature learning module, part (C) shows the details of the inter-scale feature learning module, part (D) shows the prediction module, and part (E) shows the details of the Graph Transformer module.
[0026] The overall MSMPP framework includes: an intramolecular feature fusion module, an intermolecular information module, a prediction module, and an imbalanced data processing strategy.
[0027] The MSMPP framework sequentially extracts and deeply fuses 1D sequence, 2D topology, and 3D geometry multi-view features through an intramolecular feature fusion module, constructs an intermolecular graph (IMG) through an intermolecular information module, and mines intermolecular correlations and cross-task features. Finally, it outputs molecular property prediction results by fusing multi-scale features through a prediction module, while using a weighted cross-entropy loss function to handle the data imbalance problem.
[0028] Intramolecular Feature Fusion Module: This module is used to extract 1D sequence features, 2D topological features, and 3D geometric features of molecules, and to achieve deep fusion of multi-view features through a cross-attention mechanism; specifically, it includes: 1D sequence feature extraction, 2D topological feature extraction, 3D geometric feature extraction, and multi-view feature fusion.
[0029] 1D sequence feature extraction: Molecular 1D sequence information usually exists in two forms: SMILES sequences and molecular fingerprints.
[0030] Specifically, 1D sequence features are constructed based on SMILES sequences and molecular fingerprints, and the specific steps are as follows: SMILES Sequence Processing: The SMILES string is segmented and converted into vector embeddings based on a predefined vocabulary.
[0031] Since unidirectional gated recurrent units (GRUs) encode information only in a single direction, they may overlook crucial contextual dependencies embedded in the SMILES sequence. To mitigate this limitation, this application employs a bidirectional gated recurrent unit (BiGRU). The BiGRU comprises a forward GRU and a backward GRU, encoding contextual information from the beginning and end of the sequence, respectively. This allows the model to capture contextual information from both preceding and following sequence segments, generating richer, context-aware embeddings for the SMILES sequence through its bidirectional design.
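[0032] The calculation process of BiGRU is as follows: let the input SMILES sequence be $X = (x_1, x_2, \dots, x_T)$, where $T$ is the sequence length and $x_t$ is the input at time $t$. The update gate $z_t$, reset gate $r_t$, candidate hidden state $\tilde{h}_t$, and hidden state $h_t$ are calculated, in the standard GRU form, as: $z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$; $r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$; $\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$; $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$; where $W_z, W_r, W_h, U_z, U_r, U_h$ are the weight matrices, $b_z, b_r, b_h$ are the bias terms, $\sigma$ is the sigmoid activation function, and $\odot$ is the element-wise product.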
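[0033] The forward GRU outputs the hidden state sequence $(\overrightarrow{h}_1, \dots, \overrightarrow{h}_T)$ and the backward GRU outputs the hidden state sequence $(\overleftarrow{h}_1, \dots, \overleftarrow{h}_T)$. The final hidden state at time $t$ is the concatenation of the two states: $h_t = [\overrightarrow{h}_t \,\|\, \overleftarrow{h}_t]$, where $\|$ denotes feature concatenation.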
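The bidirectional encoding described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration assuming the standard GRU gate equations, random untrained weights, and toy dimensions (sequence length 5, embedding size 4, hidden size 3); the helper names `init_gru`, `gru_step`, `run_gru`, and `bigru` are hypothetical and do not come from the specification.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def init_gru(input_dim, hidden_dim, rng):
    """Random, untrained GRU parameters (illustrative only)."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    params = {}
    for gate in ("z", "r", "h"):
        params["W_" + gate] = mat(hidden_dim, input_dim)
        params["U_" + gate] = mat(hidden_dim, hidden_dim)
        params["b_" + gate] = [0.0] * hidden_dim
    return params

def gru_step(p, x_t, h_prev):
    """One GRU step: update gate z, reset gate r, candidate state, new state."""
    def gate(name, h_in, act):
        a = matvec(p["W_" + name], x_t)
        b = matvec(p["U_" + name], h_in)
        return [act(ai + bi + ci) for ai, bi, ci in zip(a, b, p["b_" + name])]
    z = gate("z", h_prev, sigmoid)
    r = gate("r", h_prev, sigmoid)
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = gate("h", reset_h, math.tanh)
    return [(1 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]

def run_gru(p, xs, hidden_dim):
    h = [0.0] * hidden_dim
    states = []
    for x_t in xs:
        h = gru_step(p, x_t, h)
        states.append(h)
    return states

def bigru(p_fwd, p_bwd, xs, hidden_dim):
    """Concatenate forward and backward hidden states at every position."""
    fwd = run_gru(p_fwd, xs, hidden_dim)
    bwd = run_gru(p_bwd, xs[::-1], hidden_dim)[::-1]  # re-align to input order
    return [f + b for f, b in zip(fwd, bwd)]

rng = random.Random(0)
embedded = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]  # toy token embeddings
p_fwd, p_bwd = init_gru(4, 3, rng), init_gru(4, 3, rng)
states = bigru(p_fwd, p_bwd, embedded, 3)  # 5 positions, each of size 2 * 3
```

Each position ends up with a vector twice the hidden size: the first half summarizes the tokens before it, the second half the tokens after it.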
[0034] The BiGRU output is passed through a fully connected layer to obtain the SMILES feature representation. The bidirectional connections described above preserve the inherent bidirectional contextual dependence of the SMILES sequence, thereby enhancing the 1D sequence representation of the molecule and providing additional chemical information.
[0035] However, as a linear text representation, SMILES still has an inherent limitation: it cannot explicitly encode predefined chemical substructure information. To address this shortcoming of SMILES-based linear representations, the molecular fingerprinting process of this application utilizes the RDKit library to generate three complementary molecular fingerprints: substructure-based MACCS fingerprints, topology / path-based PharmacophoreErG fingerprints, and database-derived PubChem fingerprints. MACCS (Molecular Access System) fingerprints are a binary encoding system based on predefined substructure rules; the fingerprint vector is generated by traversing the molecular structure and marking the presence or absence of each template match. PharmacophoreErG fingerprints focus on topological and path-related features, while PubChem fingerprints are implicit feature encodings based on the collective knowledge of the PubChem database. These three fingerprint types are complementary along the three dimensions of substructure, topology, and database-derived collective knowledge, thus constructing a more comprehensive molecular feature space.
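[0036] The three fingerprints are concatenated to obtain a hybrid fingerprint vector, and a multilayer perceptron (MLP) extracts latent patterns from it to give the fingerprint feature embedding. The calculation formula is: $F_{fp} = \mathrm{MLP}\left([FP_{MACCS} \,\|\, FP_{ErG} \,\|\, FP_{PubChem}]\right)$, where $\|$ denotes concatenation.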
[0037] 1D sequence feature fusion: the SMILES feature $F_{smiles}$ and the fingerprint feature $F_{fp}$ are fused by element-wise product to obtain the unified 1D sequence feature: $F_{1D} = F_{smiles} \odot F_{fp}$.
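As a rough sketch of the fingerprint branch and the 1D fusion step, the fragment below concatenates three toy bit vectors, passes them through a single fully connected layer standing in for the MLP, and multiplies the result element-wise with a SMILES feature. The bit vectors, the layer size, the random weights, and the function names `mlp_layer` and `fuse_1d` are all assumptions made for illustration; real fingerprints are far longer and would come from a cheminformatics toolkit.

```python
import random

def mlp_layer(W, b, v):
    """One fully connected layer followed by ReLU (stand-in for the MLP)."""
    return [max(0.0, sum(w * x for w, x in zip(row, v)) + bi)
            for row, bi in zip(W, b)]

def fuse_1d(f_smiles, fingerprints, W, b):
    """Concatenate fingerprints, embed them, then fuse by element-wise product."""
    hybrid = [bit for fp in fingerprints for bit in fp]
    f_fp = mlp_layer(W, b, hybrid)
    if len(f_fp) != len(f_smiles):
        raise ValueError("SMILES and fingerprint features must share a dimension")
    return [s * f for s, f in zip(f_smiles, f_fp)]

rng = random.Random(1)
maccs, erg, pubchem = [1, 0, 1], [0, 1], [1, 1, 0, 0]   # toy bit vectors
W = [[rng.uniform(-1, 1) for _ in range(9)] for _ in range(4)]
b = [0.1] * 4
f_smiles = [0.2, -0.5, 0.7, 1.0]                       # toy SMILES feature
f_1d = fuse_1d(f_smiles, [maccs, erg, pubchem], W, b)
```

The element-wise product requires both branches to be projected to the same dimension, which is why the sketch checks the lengths before fusing.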
[0038] 2D Topological Feature Extraction: Based on the molecular graph structure, a Graph Transformer with edge features is used to extract 2D topological features. The specific steps are as follows: Molecular graph construction: the molecule is represented as an undirected graph $G = (V, E)$, where the node set $V = \{v_1, \dots, v_n\}$ corresponds to the atoms of the molecule, $n$ is the number of atoms, and the edge set $E$ corresponds to the chemical bonds. The node feature vector $x_i \in \mathbb{R}^{d_v}$ ($d_v$ is the node feature dimension) includes physicochemical properties such as atomic number, hybridization state, formal charge, and aromaticity. The edge feature vector $e_{ij} \in \mathbb{R}^{d_e}$ ($d_e$ is the edge feature dimension) includes bond features such as bond type and conjugation. The topological structure of the molecule is represented by an adjacency matrix $A \in \{0, 1\}^{n \times n}$, with $A_{ij} = 1$ if $(v_i, v_j) \in E$ and $A_{ij} = 0$ otherwise.
[0039] Graph Transformer Processing: Unlike traditional graph neural networks that rely on fixed neighbor aggregation rules, the graph Transformer dynamically assigns a weight to each neighboring node of every node through an attention mechanism. This lets it model long-range dependencies between atoms flexibly and mine local and global information from the molecular topology graph at the same time. This invention introduces Laplacian positional encodings (LPEs) to encode node structural position information. Calculation process: the normalized graph Laplacian matrix is $L = I - D^{-1/2} A D^{-1/2}$, where $I$ is the identity matrix, $A$ is the adjacency matrix, and $D$ is the degree matrix; eigendecomposition of $L$ gives $L = U \Lambda U^{\top}$, with eigenvalue matrix $\Lambda$ and eigenvector matrix $U$; the position of node $i$ is encoded by its entries $\lambda_i$ in the nontrivial eigenvectors. Position encoding embedding: the initial node embedding is updated by combining the original node features $x_i$ with the position encoding $\lambda_i$ through trainable parameters.
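To make the Laplacian step concrete, the sketch below builds the normalized Laplacian $I - D^{-1/2} A D^{-1/2}$ for a three-atom chain (the heavy-atom graph of propane, chosen here only as a toy example) and checks that the vector of square-rooted degrees is mapped to zero. That vector is the trivial eigenvector, which is why positional encodings are taken from the nontrivial ones. The eigendecomposition itself is left out; in practice it would come from a linear algebra routine such as `numpy.linalg.eigh`. The function name `normalized_laplacian` is hypothetical.

```python
import math

def normalized_laplacian(adj):
    """Return L = I - D^{-1/2} A D^{-1/2} for a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    # Isolated nodes get a zero scaling factor, leaving L[i][i] = 1.
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    return [[(1.0 if i == j else 0.0) - inv_sqrt[i] * adj[i][j] * inv_sqrt[j]
             for j in range(n)] for i in range(n)]

adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]                      # C-C-C chain
L = normalized_laplacian(adj)

# The trivial eigenvector D^{1/2} * 1 has eigenvalue 0.
trivial = [math.sqrt(sum(row)) for row in adj]
residual = [sum(L[i][j] * trivial[j] for j in range(3)) for i in range(3)]
```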
[0040] The Graph Transformer updates node and edge features through a multi-head attention mechanism. In each layer, an attention score between nodes i and j is computed in the k-th attention head from an intermediate calculation term; the projection matrices involved are trainable parameters, $H$ is the number of attention heads, and $d_k$ is the feature dimension of the k-th attention head.
[0041] The node and edge outputs of the attention layer are fed into a feedforward neural network, followed by residual connections and layer normalization; the weights of these layers are trainable parameters.
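For orientation, one common way to write an edge-aware graph Transformer attention layer is given below. It is offered as an assumption about the general form, not as the equations of the filing; the projection matrices $Q^{k}, K^{k}, V^{k}, E^{k}, O_h, O_e$ and the symbols $\hat{w}_{ij}^{k}$ and $w_{ij}^{k}$ are illustrative names. Here $h_i$ is the embedding of node $i$, $e_{ij}$ the embedding of the edge between $i$ and $j$, $\mathcal{N}(i)$ the neighbors of $i$, $H$ the number of heads, and $d_k$ the per-head dimension.

```latex
\begin{aligned}
\hat{w}_{ij}^{k} &= \frac{(Q^{k} h_i) \odot (K^{k} h_j)}{\sqrt{d_k}} \odot (E^{k} e_{ij})
  && \text{(intermediate term, a vector of size } d_k\text{)} \\
w_{ij}^{k} &= \frac{\exp\left(\mathbf{1}^{\top} \hat{w}_{ij}^{k}\right)}
                   {\sum_{j' \in \mathcal{N}(i)} \exp\left(\mathbf{1}^{\top} \hat{w}_{ij'}^{k}\right)}
  && \text{(attention score in head } k\text{)} \\
\hat{h}_i &= O_h \, \Big\Vert_{k=1}^{H} \sum_{j \in \mathcal{N}(i)} w_{ij}^{k} \, V^{k} h_j
  && \text{(node update)} \\
\hat{e}_{ij} &= O_e \, \Big\Vert_{k=1}^{H} \hat{w}_{ij}^{k}
  && \text{(edge update)}
\end{aligned}
```

In this form the edge features modulate the query-key interaction before normalization, so bond type and conjugation can raise or lower the attention one atom pays to another.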
[0042] After processing with multiple graph Transformer layers, the 2D topological features $F_{2D}$ are obtained.
[0043] 3D geometric feature extraction: The pre-trained large model Uni-Mol is used to extract 3D molecular geometric features. The specific steps are as follows: Uni-Mol is based on a Transformer backbone network combined with an SE(3)-equivariant attention mechanism, and its pre-training data includes 19 million unique molecules and 3.2 million protein pocket structures, from which it captures the fundamental structural principles governing the three-dimensional conformations of molecules. Unlike traditional methods that explicitly rely on predefined molecular graphs, Uni-Mol employs a sequence-driven paradigm for implicit three-dimensional conformation modeling. This allows energy-minimized three-dimensional structures to be generated directly from the SMILES sequence of a molecule, eliminating the need for computationally intensive molecular dynamics simulations. Through end-to-end modeling from one-dimensional sequence to three-dimensional conformation to multimodal embedding, Uni-Mol encodes the spatial specificity of molecular conformations into its representation.
[0044] Specifically, the SMILES sequence is input into Uni-Mol; the model directly generates an energy-minimized 3D molecular conformation and outputs a 3D feature embedding that incorporates spatial information. The calculation formula is: $F_{3D} = \text{Uni-Mol}(\mathrm{SMILES})$. The SE(3)-equivariant design ensures that the 3D features are rotationally and translationally invariant, faithfully reflecting the inherent physicochemical properties of molecules.
[0045] Multi-view feature fusion: To fully utilize the complementarity of features from different molecular views, this application introduces a cross-attention mechanism to fuse multi-view features. In the fusion strategy, this application particularly emphasizes the features extracted from the Uni-Mol large model, using the features output by the Uni-Mol large model as the anchor point with the most information in the feature fusion module.
[0046] Specifically, a cross-attention mechanism is employed to perform multi-view fusion with the 3D features $F_{3D}$ as the core. The specific steps are as follows: calculate the cross-attention fusion feature att1 of the 1D features $F_{1D}$ with the 3D features $F_{3D}$; calculate the cross-attention fusion feature att2 of the 2D features $F_{2D}$ with the 3D features $F_{3D}$; both use the cross-attention function $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V$, where $d$ is the feature dimension, $Q$ is the query vector, $K$ is the key vector, and $V$ is the value vector. Concatenating att1 and att2 gives the intramolecular fusion feature $F_{intra} = [att1 \,\|\, att2]$.
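The fusion step can be illustrated with plain scaled dot-product attention. The sketch below makes two assumptions that the text does not settle: the 1D and 2D features act as queries while the 3D features supply keys and values, and each attention output is mean-pooled to a single vector before concatenation. The toy feature matrices and the names `cross_attention` and `mean_pool` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d)) V, computed row by row."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out

def mean_pool(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

f_1d = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy 1D features (3 tokens)
f_2d = [[0.5, 0.5], [1.0, -1.0]]              # toy 2D features (2 atoms)
f_3d = [[1.0, 0.0], [0.0, 1.0]]               # toy 3D features (2 atoms)

att1 = mean_pool(cross_attention(f_1d, f_3d, f_3d))  # 1D attends to 3D
att2 = mean_pool(cross_attention(f_2d, f_3d, f_3d))  # 2D attends to 3D
f_intra = att1 + att2                                # list concatenation
```

Because both att1 and att2 are built from the same 3D values, the 3D view acts as the shared reference against which the other two views are aligned.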
[0047] Intermolecular Information Module: This application utilizes an intermolecular graph to characterize the relationships within groups of molecules and to uncover potential connections between different molecules. Intermolecular Graph (IMG) Construction and Message Passing: The IMG is constructed based on molecular structural similarity, with the following specific steps: IMG Construction: Each molecule is treated as a node in the IMG, and the node attribute is the Morgan fingerprint of that molecule. The Tanimoto coefficient between any two molecules is calculated as the structural similarity score sim. To ensure the sparsity and information validity of the graph structure, this application sets a similarity threshold: when sim is greater than or equal to the threshold, an edge is established between the two molecular nodes, and the edge attribute is the Tanimoto coefficient; when sim is below the threshold, no edge is established.
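The construction rule above reduces to a pairwise similarity computation followed by thresholding. The sketch below represents each fingerprint as a set of on-bit indices and uses a threshold of 0.5; both the toy fingerprints and the threshold value are placeholders chosen for illustration, and the function names `tanimoto` and `build_img` are hypothetical.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def build_img(fingerprints, threshold):
    """Return {(i, j): similarity} for every molecule pair with sim >= threshold."""
    edges = {}
    for i, j in combinations(range(len(fingerprints)), 2):
        sim = tanimoto(fingerprints[i], fingerprints[j])
        if sim >= threshold:
            edges[(i, j)] = sim   # the edge attribute is the Tanimoto coefficient
    return edges

fps = [{1, 2, 3, 4}, {2, 3, 4, 5}, {10, 11}]   # toy stand-ins for Morgan fingerprints
edges = build_img(fps, threshold=0.5)
```

Molecules 0 and 1 share three of five distinct bits, so they are linked with weight 0.6, while molecule 2 shares nothing and stays isolated. Raising the threshold makes the graph sparser; the all-pairs loop is quadratic in the number of molecules, which is acceptable for benchmark-sized datasets.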
[0048] GNN Message Passing: The IMG (intermolecular graph) is taken as input, together with the node attributes (i.e., the Morgan fingerprint vector of each molecule), and fed into a graph neural network (GNN) for representation learning. The GNN aggregates information from neighboring molecule nodes and their edge attributes through a multi-layer message passing mechanism. At each layer, the node embedding is updated based on the features and edge attributes of neighboring nodes, thus progressively integrating local and global similarity information. After multiple iterations, the GNN outputs a learned embedding for each molecule, which not only includes the molecule's own structural features but also integrates global information from structurally similar molecules.
[0049] Specifically, the IMG is input into a graph neural network (GNN), which aggregates neighbor node information through multi-layer message passing. The node feature update formula at layer $l$ is: $h_v^{(l+1)} = \mathrm{UPDATE}\left(h_v^{(l)}, \mathrm{AGG}\left(\{h_u^{(l)} : u \in \mathcal{N}(v)\}\right)\right)$; where $h_v^{(l)}$ is the embedding of node $v$ at layer $l$, $\mathcal{N}(v)$ is the set of neighbors of node $v$, $\mathrm{AGG}$ is a permutation-invariant aggregation function (such as summation or averaging), and $\mathrm{UPDATE}$ is the update function; after multiple layers of message passing, the intermolecular features $F_{inter}$ are obtained.
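A parameter-free sketch of one message passing layer is shown below. It assumes a similarity-weighted mean as the aggregation function and a fixed convex mix of the node's own embedding with the aggregated message as the update function; a trained GNN would use learnable weights in both places. The name `message_passing_layer` and the `self_weight` parameter are hypothetical.

```python
def message_passing_layer(h, edges, self_weight=0.5):
    """One layer: weighted-mean aggregation over neighbours, then a convex update."""
    n, dim = len(h), len(h[0])
    neigh = {v: [] for v in range(n)}
    for (u, v), w in edges.items():      # undirected graph: register both directions
        neigh[u].append((v, w))
        neigh[v].append((u, w))
    new_h = []
    for v in range(n):
        if not neigh[v]:                 # isolated node keeps its embedding
            new_h.append(list(h[v]))
            continue
        total = sum(w for _, w in neigh[v])
        agg = [sum(w * h[u][c] for u, w in neigh[v]) / total for c in range(dim)]
        new_h.append([self_weight * h[v][c] + (1 - self_weight) * agg[c]
                      for c in range(dim)])
    return new_h

h0 = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]   # toy node embeddings for 3 molecules
img_edges = {(0, 1): 0.6}                   # one edge weighted by Tanimoto similarity
h1 = message_passing_layer(h0, img_edges)
```

The weighted mean does not depend on the order in which neighbors are visited, which is the permutation invariance the aggregation function needs.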
[0050] Inter-property correlation modeling: the cross-task auxiliary features are constructed as follows: based on the principle that molecular properties are intrinsically correlated, the known labels of all non-target properties of the molecule are concatenated; the concatenated features serve as auxiliary information that encodes cross-task prior knowledge and strengthens structure-property relationship modeling.
[0051] In the prediction module, to comprehensively fuse feature information at different scales, this application concatenates the intramolecular features from the intramolecular feature fusion module, the intermolecular features from the intermolecular information module, and the other molecular properties drawn from the database to form the final molecular embedding. The final embedding is then input into a feedforward network (FFN).
[0052] Specifically, the steps for fusing multi-scale features and outputting prediction results are: the intramolecular features, the intermolecular features, and the cross-task auxiliary features are concatenated to obtain the final molecular embedding.
[0053] The final molecular embedding is input to a feedforward network (FFN), which outputs the predicted molecular property.
[0054] Imbalanced Data Handling Strategy: In molecular property prediction tasks, class imbalance in datasets is a common problem. To mitigate model bias caused by class imbalance, this application employs a weighted cross-entropy loss function. Loss function definition: $\mathcal{L} = -\frac{1}{N}\left[ w \sum_{i \in S^{+}} \log \hat{y}_i + \sum_{i \in S^{-}} \log\left(1 - \hat{y}_i\right) \right]$; where $S^{+}$ is the positive sample set, $S^{-}$ is the negative sample set, $\hat{y}_i$ is the predicted probability of the model for the i-th sample, $N$ is the total number of samples, and $w$ is the balancing weight (the ratio of negative to positive samples). By amplifying the loss contribution of positive samples, the model is forced to focus more on learning features from the minority class during training.
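A minimal sketch of such a loss is given below, assuming binary labels, a mean over all samples, and a default balancing weight equal to the ratio of negative to positive samples as described above. The clipping constant `eps` and the function name `weighted_bce` are illustrative additions, not part of the specification.

```python
import math

def weighted_bce(probs, labels, w=None, eps=1e-12):
    """Weighted binary cross-entropy: positive-sample terms are scaled by w."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if w is None:
        if n_pos == 0:
            raise ValueError("no positive samples; pass w explicitly")
        w = n_neg / n_pos                  # ratio of negative to positive samples
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)    # avoid log(0)
        total += -w * math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(labels)

labels = [1, 0, 0, 0]                      # one positive among four samples
probs = [0.5, 0.5, 0.5, 0.5]
loss = weighted_bce(probs, labels)         # w defaults to 3 / 1 = 3
```

With one positive among four samples the single positive term counts three times, so it contributes as much to the loss as the three negatives combined.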
[0055] Example: Experimental Datasets: To comprehensively evaluate the performance of the proposed model in molecular property prediction tasks, five representative benchmark datasets were selected from MoleculeNet: Tox21, SIDER, ClinTox, BACE, and BBBP. These datasets cover different types of molecular properties, including toxicity, bioactivity, and pharmacokinetics.
[0056] Table 1. Statistical information of the datasets used in the experiment. In the table, Tox21-AR, SIDER-ED, and ClinTox-CT are single subtask labels selected from multi-task physiology datasets; BACE is a biophysics dataset, and BBBP is a single-task physiology dataset.
[0057] Figure 2 shows the label distribution of the selected MoleculeNet datasets. Red represents positive samples, cyan represents negative samples, and yellow represents missing values in the dataset.
[0058] Figures 3-6 show a comprehensive performance analysis on the BACE dataset (upper panels) and the BBBP dataset (lower panels). Figure 3 is the receiver operating characteristic (ROC) curve. Figure 4 is the precision-recall (PR) curve. Figure 5 is a scatter plot of the predicted probability of each sample, colored by its true label (blue for negative samples and orange for positive samples); the line k=0.5 represents the standard decision threshold. Figure 6 is a confusion matrix, showing the proportions of the BACE and BBBP classification results (true labels versus predicted labels) at a threshold of k=0.5.
[0059] Multi-dimensional evaluation indicators are used: Key metric: ROC-AUC (area under receiver operating characteristic curve), applicable to imbalanced data, reflects classification and discrimination ability; Additional metrics: PR-AUC (area under the precision-recall curve), accuracy, and F1 score, to comprehensively evaluate model performance.
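For readers who want to reproduce the metrics on their own predictions, the sketch below computes ROC-AUC through its pairwise-ranking interpretation (the probability that a random positive is scored above a random negative, with ties counted as one half) and the F1 score at a decision threshold of 0.5. The function names and the four-sample toy data are illustrative; a library implementation would normally be used on real datasets because this version is quadratic in the number of samples.

```python
def roc_auc(scores, labels):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_at_threshold(scores, labels, k=0.5):
    """F1 score when samples with score > k are predicted positive."""
    preds = [1 if s > k else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
auc = roc_auc(scores, labels)          # 3 of 4 positive-negative pairs ranked correctly
f1 = f1_at_threshold(scores, labels)   # one true positive, one false positive, one false negative
```

ROC-AUC depends only on the ranking of the scores, which is why it stays informative on imbalanced data, whereas F1 depends on the chosen threshold.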
[0060] Experimental Results: As shown in Table 2, the ROC-AUC scores of the MSMPP model for the multi-task classification datasets Tox21, SIDER, and ClinTox are 0.8918, 0.8642, and 0.9836, respectively. For the single-task classification datasets BACE and BBBP, the model also delivers high performance, with ROC-AUC scores of 0.8984 and 0.9536, respectively.
[0061] Table 2. Performance of the MSMPP model on the five datasets. Figures 3-6 show, respectively, the ROC curves, PR curves, probability scatter plots, and confusion matrices of the model on the BACE and BBBP datasets. In the probability scatter plots, the x-axis represents the sample index, and the y-axis represents the predicted positive probability, ranging from 0 to 1. A probability greater than 0.5 means that the model classifies the corresponding sample as positive; otherwise, the sample is classified as negative. In both datasets, positive and negative samples are clearly separated, indicating that the model predicts molecular properties well and has good classification ability.
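The decision rule described above (positive when the predicted probability exceeds 0.5) and the resulting confusion-matrix proportions can be sketched as follows. The normalization over all samples is an assumption, since the text does not say whether Figure 6 is normalized overall or per true class, and the function name `confusion_proportions` is illustrative.

```python
def confusion_proportions(probs, labels, k=0.5):
    """Threshold probabilities at k and return confusion-matrix proportions.

    A sample is predicted positive when its probability is strictly
    greater than k. Proportions are taken over all samples (assumption).
    """
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for p, y in zip(probs, labels):
        pred = 1 if p > k else 0
        if pred == 1 and y == 1:
            counts["TP"] += 1
        elif pred == 1 and y == 0:
            counts["FP"] += 1
        elif pred == 0 and y == 1:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    n = len(labels)
    return {name: c / n for name, c in counts.items()}


print(confusion_proportions([0.9, 0.6, 0.4, 0.2], [1, 0, 1, 0]))
```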
[0062] Compared with existing models: As shown in Table 3, compared with 17 mainstream models (including 1D sequence models, 2D graph models, 3D geometric models, multi-view fusion models, and pre-trained models), the average ROC-AUC of this invention reaches 0.9184, ranking first on the Tox21-AR, SIDER-ED, and BACE datasets, and close to the best performance on the ClinTox-CT and BBBP datasets.
[0063] Specifically, on the Tox21 and SIDER datasets, MSMPP improved ROC-AUC by 0.017 and 0.022, respectively, over the best baseline results. MSMPP also achieved state-of-the-art-level performance on the ClinTox, BACE, and BBBP datasets, on par with the current best-performing methods.
[0064] As shown in Figures 7-8, the ROC-AUC comparison on the BBBP and BACE datasets demonstrates that the proposed model outperforms many existing methods.
[0065] The above results demonstrate that MSMPP can exploit multi-view molecular information, effectively capturing information from different views (1D sequence, 2D topology, 3D conformation) at the intramolecular scale; at the same time, it can effectively fuse information at the intramolecular and intermolecular scales, thereby uncovering potential correlations that are crucial for accurate property prediction.
[0066] Table 3. Performance of different models on five MoleculeNet benchmarks. In the table, the best and second-best results are indicated in bold and underlined, respectively. A backslash "\" indicates a result not reported in the references.
[0067] Ablation experiment verification: As shown in Figures 9-13, to study the contribution of each module to the prediction performance of MSMPP, an ablation study was conducted on each dataset, removing one module at a time and comparing the resulting changes in the final evaluation metrics.
[0068] Specifically, wo_1D, wo_2D, wo_3D, and wo_IMG denote variants of the original MSMPP model from which the following modules have been removed:
·wo_1D: the 1D sequence feature extraction module (for SMILES sequences and molecular fingerprints);
·wo_2D: the 2D graph feature extraction module (the GraphTransformer module, GT for short);
·wo_3D: the 3D feature extraction module based on a large-scale pre-trained model (the Uni-Mol model);
·wo_IMG: the intermolecular graph (IMG) feature extraction module.
[0069] Ablation experiments were conducted by removing each core module in turn, with the following results. Removing the 3D feature module (wo_3D) causes the largest ROC-AUC decrease on every dataset, confirming the core role of the Uni-Mol pre-trained 3D features. Removing the 2D feature module (wo_2D) causes a significant ROC-AUC drop, indicating that the long-range topological dependencies captured by GraphTransformer are crucial for prediction. Removing the 1D feature module (wo_1D) decreases ROC-AUC to varying degrees, demonstrating the complementarity between SMILES and fingerprints. Removing the IMG module (wo_IMG) decreases ROC-AUC, confirming the gain from intermolecular correlation information. All modules contribute positively to model performance, so the multi-scale, multi-view fusion design is reasonable and effective.
[0070] The ablation experiments described above demonstrate that combining intramolecular multi-view features (1D SMILES sequences and fingerprints, 2D molecular graphs, and 3D conformations from a pre-trained large-scale model) with the intermolecular information mined through the IMG captures intramolecular structural information and intermolecular relationships more comprehensively. This holistic feature integration ultimately improves the model's performance in molecular property prediction tasks.
[0071] Threshold sensitivity analysis: As shown in Figure 14, as the similarity threshold increases, the model's performance first rises and then falls, with optimal performance when the threshold is set to approximately 0.3. This trend can be attributed to two cases. When the threshold is too small, the IMG becomes overly dense, with too many edges connecting molecules of low similarity; the model then learns a large amount of irrelevant information, which acts as noise and makes it difficult to distinguish effective features from interfering signals, resulting in poor performance. When the threshold is too large, the IMG becomes too sparse, because many molecule pairs that actually share highly similar substructures cannot form edges; critical information needed for the downstream molecular property prediction task is thereby filtered out, the model cannot learn effective features through the graph structure, and performance degrades.
[0072] Feature visualization: As shown in Figures 15-16, the molecular embeddings learned by the model are reduced in dimensionality with t-SNE and visualized. On the BBBP and BACE datasets, positive and negative samples form clearly separated clusters, indicating that the model has learned discriminative molecular representations.
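The effect of the similarity threshold on graph density can be illustrated with a small sketch of IMG edge construction: molecules are nodes, and an edge is kept when the Tanimoto coefficient of a pair reaches the threshold. Fingerprints are represented here as plain Python sets of on-bit indices, and the four toy fingerprints are made up for illustration; in practice they would be Morgan fingerprints produced by a cheminformatics toolkit.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def build_img_edges(fingerprints, threshold):
    """Return {(i, j): similarity} for every pair with similarity >= threshold."""
    edges = {}
    for i, j in combinations(range(len(fingerprints)), 2):
        sim = tanimoto(fingerprints[i], fingerprints[j])
        if sim >= threshold:
            edges[(i, j)] = sim  # edge attribute is the Tanimoto coefficient
    return edges


# Toy fingerprints (hypothetical): molecules 0 and 1 are close analogues,
# molecule 2 shares a single bit with them, molecule 3 is unrelated.
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 6, 7, 8}, {9, 10, 11, 12}]

for t in (0.1, 0.3, 0.9):
    print(t, sorted(build_img_edges(toy_fps, t)))
```

On this toy input a low threshold also links the weakly related molecule 2, a mid-range threshold keeps only the close analogue pair, and a high threshold removes every edge, mirroring the dense and sparse failure cases described above.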
[0073] The implementation process of this invention is as follows:
·Data preprocessing: Collect the SMILES sequences, structural information, and known property labels of molecules; use the RDKit toolkit to process the molecular structures and generate SMILES strings, molecular graphs, and molecular fingerprints;
·Feature extraction: 1D sequence features are extracted using BiGRU, 2D topological features are extracted using GraphTransformer, and 3D geometric features are extracted using Uni-Mol;
·Multi-view fusion: Taking the 3D features as the core, the 1D and 2D features are fused through a cross-attention mechanism to obtain the intramolecular fused features;
·Intermolecular information mining: An IMG is constructed, intermolecular features are extracted through a GNN, and the cross-task auxiliary features are concatenated;
·Prediction and optimization: The multi-scale features are fused, the prediction results are output through an FFN, and the model is optimized using the weighted cross-entropy loss function;
·Model deployment: The trained model is applied to the screening of drug candidate molecules, predicting key properties such as ADMET and assisting drug development decisions.
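The multi-view fusion step can be sketched with plain scaled dot-product cross-attention. This is a minimal illustration under stated assumptions, not the application's implementation: the 3D features are assumed to act as queries and the 1D or 2D features as keys and values, all features share the same dimension, and the learned projection matrices and multiple heads of a real attention layer are omitted. The names `cross_attention` and `fuse_intramolecular` are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention(queries, keys_values):
    """Each query row attends over the rows of keys_values (used as keys and values)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, keys_values))
                    for c in range(len(keys_values[0]))])
    return out


def fuse_intramolecular(f1d, f2d, f3d):
    """att1 = attention(3D, 1D), att2 = attention(3D, 2D); concatenate per row."""
    att1 = cross_attention(f3d, f1d)
    att2 = cross_attention(f3d, f2d)
    return [a + b for a, b in zip(att1, att2)]  # list concatenation per row


# Toy features: one 3D token, two 1D tokens, one 2D token, dimension 2.
fused = fuse_intramolecular(
    f1d=[[1.0, 0.0], [1.0, 0.0]],
    f2d=[[0.0, 2.0]],
    f3d=[[1.0, 0.0]],
)
print(fused)
```

Using the 3D features as queries keeps the output aligned with the 3D representation, which is one way to read "taking the 3D features as the core"; the opposite assignment of queries and keys would be an equally plausible reading of the text.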
[0074] This embodiment is merely a specific application example. The scope of protection of this invention is not limited by specific implementation details such as datasets and parameter settings. All equivalent modifications or changes based on the technical solution of this invention fall within the scope of protection of this invention.
[0075] The above description is merely a preferred embodiment of the present invention. The scope of protection of the present invention is not limited to the above embodiments. All technical solutions falling within the scope of the present invention's concept are within the scope of protection of the present invention. It should be noted that for those skilled in the art, any improvements and modifications made without departing from the principles of the present invention should also be considered within the scope of protection of the present invention.
Claims
1. A method for predicting molecular properties by fusing multi-scale and multi-view information, characterized in that the method includes the following steps: S1. Intramolecular feature extraction and fusion: 1D sequence features, 2D topological features, and 3D geometric features of the molecule are extracted separately, and these multi-view features are fused using a cross-attention mechanism, with the 3D geometric features as the core, to obtain the intramolecular fused features; S2. Intermolecular information mining: an intermolecular graph (IMG) is constructed and intermolecular correlation features are extracted using a graph neural network (GNN), and the cross-task property correlation features of the molecules are concatenated at the same time; S3. Multi-scale feature fusion and prediction: the intramolecular fused features, the intermolecular correlation features, and the cross-task property correlation features are concatenated to obtain the final molecular embedding, which is input into the feedforward network (FFN) to output the molecular property prediction result; S4. Imbalanced data processing: a weighted cross-entropy loss function is used to optimize model training and balance the loss contributions of positive and negative samples.
2. The molecular property prediction method according to claim 1, characterized in that the extraction of the 1D sequence features includes: after tokenization and embedding, the SMILES sequence is input into a bidirectional gated recurrent unit (BiGRU), in which the forward GRU and the backward GRU each encode the context information and their outputs are concatenated to yield the SMILES features; a hybrid fingerprint vector is obtained by fusing the MACCS fingerprint, the PharmacophoreErG fingerprint, and the PubChem fingerprint, and is processed by a multilayer perceptron (MLP) to obtain the fingerprint features; the SMILES features and the fingerprint features are fused by element-wise product to obtain the 1D sequence features.
3. The molecular property prediction method according to claim 1, characterized in that the extraction of the 2D topological features includes: representing each molecule as an undirected graph in which the node set V corresponds to the atoms of the molecule and the edge set E corresponds to the chemical bonds, the node features include physical and chemical properties such as atomic number and hybridization state, and the edge features include bond properties such as bond type and conjugation; introducing Laplacian Positional Encodings (LPEs) to enhance the structural position information of the nodes; and dynamically allocating neighbor weights using the multi-head attention mechanism of the graph Transformer (GraphTransformer) to model long-range dependencies between atoms and output the 2D topological features.
4. The molecular property prediction method according to claim 1, characterized in that the 3D geometric features are extracted using the pre-trained large model Uni-Mol: the SMILES sequence is input into Uni-Mol, and the model generates an energy-minimized 3D molecular conformation and outputs 3D features that incorporate spatial information, the 3D features being invariant to rotation and translation.
5. The molecular property prediction method according to claim 1, characterized in that the fusion process of the cross-attention mechanism is as follows: the cross-attention feature att1 is calculated between the 1D sequence features and the 3D geometric features; the cross-attention feature att2 is calculated between the 2D topological features and the 3D geometric features; and att1 and att2 are concatenated to obtain the intramolecular fused features.
6. The molecular property prediction method according to claim 1, characterized in that the construction of the intermolecular graph (IMG) includes: using molecules as nodes, with Morgan fingerprints as the node attributes; calculating the Tanimoto coefficient of each molecule pair as the structural similarity score sim; when sim ≥ a preset threshold, establishing an edge between the corresponding nodes, with the Tanimoto coefficient as the edge attribute; and aggregating neighbor node information through multi-layer message passing in the graph neural network (GNN) to output the intermolecular correlation features.
7. The molecular property prediction method according to claim 1, characterized in that the cross-task property correlation features are composed of the known labels of all non-target properties of the molecule and are used to encode cross-task prior knowledge.
8. The molecular property prediction method according to claim 1 is characterized in that, The formula for calculating the final molecular embedding is as follows: ; Where ⊙ represents the element-wise product; the feedforward network FFN outputs a binary classification prediction result.
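9. The molecular property prediction method according to claim 1, characterized in that the weighted cross-entropy loss function is defined as follows: $$L = -\frac{1}{N}\left[\, w \sum_{i \in \mathcal{D}^{+}} \log \hat{y}_i + \sum_{i \in \mathcal{D}^{-}} \log\left(1 - \hat{y}_i\right) \right]$$ where $\mathcal{D}^{+}$ is the positive sample set, $\mathcal{D}^{-}$ is the negative sample set, $\hat{y}_i$ is the model's predicted probability for the i-th sample, $N$ is the total number of samples, and $w$ is the balance weight.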
10. The molecular property prediction method according to claim 6 is characterized in that, The preset threshold .