circRNA-disease association prediction method based on multi-source feature fusion

The method constructs a circRNA-disease association network, extracts multi-source features, learns network topology embeddings at three scales (micro, meso, and macro), and classifies candidate pairs with an XGBoost classifier. This addresses the insufficient integration of biological attribute features and the single-scale network representation of existing methods, and achieves high-precision and robust circRNA-disease association prediction.

CN122290970A · Pending · Publication Date: 2026-06-26 · NORTHWESTERN POLYTECHNICAL UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NORTHWESTERN POLYTECHNICAL UNIV
Filing Date
2026-04-08
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing circRNA-disease association prediction methods suffer from insufficient integration of biological attribute features, limited network representation, and imperfect multi-source heterogeneous information fusion mechanisms, resulting in limited prediction performance.

Method used

We construct a circRNA-disease association network, extract GIPK functional features of diseases and Transformer sequence features of circRNAs, and systematically learn the network topology embedding at three scales: micro, meso, and macro. We achieve high-precision prediction through deep fusion of multi-source features and an XGBoost classifier.

Benefits of technology

It improves the accuracy and robustness of predictions, provides biological interpretability, comprehensively characterizes complex association patterns in networks, and achieves efficient feature utilization and robust prediction performance.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a circRNA-disease association prediction method based on multi-source feature fusion, belonging to the field of disease-aided diagnosis technology. The method first constructs a circRNA-disease association network and extracts the GIPK functional features of the disease and the Transformer sequence features of the circRNA; then, it systematically learns the network topology embedding at three scales: microscopic, mesoscopic, and macroscopic; finally, it deeply fuses the above multi-source features and inputs them into an XGBoost classifier. This invention systematically solves the three major technical defects of existing technologies—"lack of biological semantics," "single network representation," and "inefficient feature fusion"—by introducing biological attribute features, multi-scale network structure features, and a deep fusion strategy, achieving high-precision, high-robustness, and highly biologically interpretable association prediction.

Description

Technical Field

[0001] This invention belongs to the field of disease-aided diagnosis technology and relates to circRNA-related disease prediction, specifically to a circRNA-disease association prediction method based on multi-source feature fusion.

Background Technology

[0002] Circular RNAs (circRNAs) are a class of endogenous non-coding RNAs with a closed circular structure. Compared to linear RNAs, they are more stable and can participate in gene expression regulation through various mechanisms, such as acting as microRNA sponges and interacting with RNA-binding proteins. Studies have shown that circRNAs play a crucial role in the occurrence and development of various diseases, including cancer, neurological disorders, and immune diseases. Therefore, accurately identifying circRNA-disease associations (CDAs) is of great significance for a deeper understanding of disease mechanisms and the discovery of novel biomarkers and drug targets.

[0003] With the development of high-throughput sequencing and public databases, the number of known circRNAs has grown rapidly. However, the number of CDAs validated through wet-lab experiments remains very limited due to bottlenecks such as high experimental costs, long cycles, and limited throughput. Therefore, computational methods that predict potential CDAs from known association data and biological attribute information have become an important complementary research tool.

[0004] Existing computational methods can be classified into the following three categories based on their technical principles: (1) Methods based on association matrix factorization.

[0005] This type of method is represented by NMFCDA, which is based on nonnegative matrix factorization and takes the known circRNA-disease association matrix as input. NMFCDA decomposes the given circRNA-disease association matrix into the product of two low-dimensional nonnegative matrices, which can be regarded as feature matrices of the circRNAs and the diseases, respectively. By minimizing the reconstruction error through an optimization algorithm, the representations of circRNAs and diseases in a latent low-dimensional space are learned. Subsequently, unknown associations are predicted based on the similarity (e.g., dot product) of these latent feature vectors.

[0006] However, these methods primarily rely on linear decomposition models, making it difficult to effectively capture and model the complex nonlinear interactions between circRNAs and diseases. Furthermore, model performance degrades significantly when the association data is extremely sparse.

[0007] (2) Network-based propagation methods.

[0008] This type of method is represented by the KATZCDA model, which is based on the Katz measure. It first integrates circRNA similarity networks, disease similarity networks, and known associations to construct a heterogeneous network. KATZCDA then measures the strength of association between two nodes as a weighted sum of the numbers of paths of each length between them, with shorter paths generally receiving higher weights.
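As a rough illustration of the path-counting idea behind Katz-style scoring (not KATZCDA's actual implementation), the sketch below sums the number of walks of length 1 to `max_len` between every node pair, weighting each length by `beta ** length`. The function name `katz_scores`, the decay value `beta = 0.1`, the length cap, and the toy adjacency matrix are all assumptions made for illustration.

```python
import numpy as np

def katz_scores(adj, beta=0.1, max_len=3):
    """Truncated Katz-style score: sum over lengths of beta**length * (#walks)."""
    adj = np.asarray(adj, dtype=float)
    scores = np.zeros_like(adj)
    walks = np.eye(adj.shape[0])
    for length in range(1, max_len + 1):
        walks = walks @ adj                  # walks[i, j] = number of walks of this length
        scores += (beta ** length) * walks   # longer walks receive smaller weights
    return scores

# Toy heterogeneous network (hypothetical): nodes 0-1 are circRNAs, nodes 2-3 are diseases.
# Edges: 0-1 (circRNA similarity), 2-3 (disease similarity), 0-2 and 1-3 (known associations).
adj = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
])
scores = katz_scores(adj, beta=0.1, max_len=3)
print(scores[0, 3])  # score of the unobserved pair (circRNA 0, disease 3)
```

The unobserved pair (0, 3) gets a nonzero score only through length-2 walks, which shows how the decay coefficient fixes in advance how much indirect evidence can count.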

[0009] However, these methods rely heavily on predefined, heuristic propagation rules (such as path decay coefficients), which limits their representation capability. More importantly, they often struggle to model higher-order network topologies (such as complex dependencies spanning many hops) and fall short in learning deep semantic features of network structures.

[0010] (3) Deep learning-based methods (especially graph neural networks).

[0011] This type of method, exemplified by GCNCDA, treats circRNAs and diseases as nodes in a graph, and known associations as edges. GCNCDA generates new node embeddings by aggregating features of the node itself and its neighbors. By stacking multiple layers, the model can capture information from multi-hop neighbors.

[0012] However, although this type of method improves on the previous two, it still suffers from a limited perspective. Such methods either rely primarily on network structure and fail to fully integrate rich biological attribute information such as circRNA sequences and disease ontology, or learn node representations at only a single scale (such as local neighborhood aggregation). In the latter case they fail to explicitly and jointly model multi-scale network features at the micro (local neighborhood), meso (community structure), and macro (global topology) levels, resulting in an insufficiently comprehensive representation of complex biological semantics.

[0013] Overall, existing technologies still suffer from the following systemic defects: First, there is insufficient integration of biological attribute features. Whether based on association matrix factorization or network propagation, the input is typically only an association or similarity matrix, failing to incorporate key biological semantics such as circRNA sequences and disease ontology as core features. While some deep learning models can accept node features, they often only perform simple concatenation or initialization, failing to use deep architectures (such as Transformers) for specialized, context-aware encoding. This results in the model's utilization of the intrinsic attributes of biological entities remaining superficial.

[0014] Second, the network structure representation is simplistic. This deficiency is particularly evident in methods based on network propagation and deep learning. Methods such as KATZCDA capture only single topological patterns (e.g., short-path associations) through predefined paths or propagation rules. Graph neural networks such as GCNCDA primarily learn representations through multi-layer neighbor aggregation, essentially focusing on the smoothing and information transfer of microscopic local neighborhoods. Both types of methods lack explicit modeling of mesoscopic (community structure) and macroscopic (global topology) network scales, while structures at different scales correspond to different biological regulatory implications.

[0015] Third, the multi-source heterogeneous information fusion mechanism is imperfect. This is the key bottleneck through which the first two defects jointly limit performance. Most existing methods adopt a "single main line" architecture: they rely mainly on the association matrix (matrix factorization methods), on the similarity network (propagation methods), or on the graph structure (most graph neural networks). Even when some methods introduce additional information, they usually use simple early concatenation or late weighting, and lack a collaborative learning framework that treats biological attribute features and features from the micro, meso, and macro scales on an equal footing, lets them interact deeply, and fuses them adaptively. Without such a fusion mechanism, the model cannot fully exploit the complementarity between features from different sources and scales, which limits its representation ability and the upper bound of its predictive performance.

Summary of the Invention

[0016] To address the shortcomings of existing technologies, the present invention aims to provide a circRNA-disease association prediction method based on multi-source feature fusion, thereby solving the technical problems of insufficient integration of biological attribute features, single network representation, and imperfect fusion mechanism in existing circRNA-disease association methods.

[0017] To solve the above-mentioned technical problems, the present invention adopts the following technical solution: A multi-source feature fusion-based circRNA-disease association prediction method is proposed. This method first constructs a circRNA-disease association network and extracts the GIPK functional features of the disease and the Transformer sequence features of the circRNA. Then, it systematically learns the network topological embedding at micro, meso, and macro scales. Finally, the above multi-source features are deeply fused and input into an XGBoost classifier to achieve high-precision and robust association prediction.

[0018] Specifically, the method includes the following steps: Step 1: Dataset preparation and heterogeneous network construction: Step 1.1, Obtain the circRNA-disease association pair dataset: Obtain the experimentally validated circRNA-disease association pair dataset from the database, and at the same time obtain the complete nucleotide sequences of all relevant circRNAs.

[0019] Step 1.2: Based on the obtained circRNA-disease association pair dataset, construct a heterogeneous association network and express it as an association matrix: The heterogeneous association network includes a set of all circRNA nodes, a set of disease nodes, and a set of edges; for each known circRNA-disease association pair, an edge is established between the corresponding circRNA node and the disease node, which represents the biological association between the circRNA node and the disease node.

[0020] Step 2, biological attribute feature extraction: Using the association matrix from Step 1.2 and the complete nucleotide sequences of all relevant circRNAs obtained in Step 1.1 as input, biologically meaningful feature representations of disease nodes and circRNAs are extracted, including disease functional attribute feature extraction and circRNA sequence semantic attribute feature extraction.

[0021] Step 3, Multi-scale network structure feature extraction: Using the heterogeneous association network from Step 1.2 as input, topological features at different granularity levels are extracted, including microscale feature extraction, mesoscale feature extraction, and macroscale feature extraction.

[0022] Step 4: Multi-source feature fusion and association pair representation construction: For each circRNA node, the 64-dimensional sequence attribute features of the node are vector-concatenated with the 64-dimensional microscopic structural features, 64-dimensional mesoscopic structural features, and 64-dimensional macroscopic structural features to obtain a unified 256-dimensional representation of the node.

[0023] For each disease node, its 256-dimensional unified representation is obtained in the same way.

[0024] For a candidate pair consisting of a circRNA node and a disease node to be predicted, the 256-dimensional unified representation of the circRNA node and the 256-dimensional unified representation of the disease node are concatenated again to construct a 512-dimensional association pair feature representation.

[0025] Step 5, Association Prediction: Step 5.1, Training Phase: Use all known circRNA-disease association pairs as positive samples. Each positive sample contains 512-dimensional features of the association pair and a label value of 1. Randomly select the same number of unknown association pairs as negative samples. Each negative sample contains 512-dimensional features of the unknown association pair and a label value of 0. Input the positive and negative samples into the XGBoost model for training.

[0026] Step 5.2, Prediction Phase: For new, unknown circRNA-disease pairs, generate their 512-dimensional features in the same way as in the training phase, and then input the features into the trained XGBoost model. The model outputs a probability value between 0 and 1, which represents the prediction confidence that the unknown pair is associated.

[0027] The present invention has the following technical effects: The biological attribute features introduced in this invention provide effective biological prior information for the model, solving the problem of "lack of biological semantics" in traditional methods and making the prediction results more biologically interpretable.

[0028] This invention systematically extracts network structural features at three complementary scales: micro, meso, and macro. These three scales provide complementary topological information, and joint modeling can comprehensively characterize complex relational patterns in the network, effectively overcoming the deficiency of "single network representation".

[0029] This invention concatenates multi-source features into vectors to form a unified association pair feature representation, and uses an XGBoost ensemble learner for non-linear classification. By combining deep fusion of multi-source features with the ensemble learner, it achieves efficient feature utilization and robust prediction performance, solving the problem of "inefficient feature fusion".

Attached Figure Description

[0030] Figure 1 is a flowchart of the circRNA-disease association prediction method based on multi-source feature fusion of this invention.

[0031] Figure 2 is a structural diagram of biological attribute feature extraction in this invention.

[0032] Figure 3 is a flowchart of the disease functional attribute feature extraction process in this invention.

[0033] Figure 4 is a flowchart of the circRNA sequence semantic attribute feature extraction process in this invention.

[0034] Figure 5 is a structural diagram of multi-scale network feature extraction in this invention.

[0035] Figure 6 is a structural diagram of microscale feature extraction in this invention.

[0036] Figure 7 is a structural diagram of mesoscale feature extraction in this invention.

[0037] Figure 8 is a structural diagram of macroscale feature extraction in this invention.

[0038] Figure 9 is a diagram of the encoder structure of the deep autoencoder model in this invention.

[0039] Figure 10 is a structural diagram of multi-source feature fusion and association pair representation construction in this invention.

[0040] Figure 11 shows the performance of the present invention: 11(a) compares the performance of different classifiers on three datasets, and 11(b) shows the performance distribution of the method of the present invention on three datasets under five-fold cross-validation.

[0041] The specific content of the present invention will be further explained in detail below with reference to the embodiments.

Detailed Implementation

[0042] It should be noted that, unless otherwise specified, all software and databases used in this invention are conventional software and databases known in the art.

[0043] Following the above technical solutions, specific embodiments of the present invention are given below. It should be noted that the present invention is not limited to the following specific embodiments, and all equivalent modifications made based on the technical solutions of this application fall within the protection scope of the present invention.

[0044] Example 1: This embodiment presents a circRNA-disease association prediction method based on multi-source feature fusion. As shown in Figure 1, it includes the following steps: Step 1: Dataset preparation and heterogeneous network construction: Step 1.1, Obtain the circRNA-disease association pair dataset: Obtain experimentally validated circRNA-disease association pairs from the circRNADisease v2.0, CircR2Disease v2.0, and circAtlas 3.0 databases, together with the complete nucleotide sequences of all relevant circRNAs.

[0045] Step 1.2: Based on the obtained circRNA-disease association pair dataset, construct a heterogeneous association network as shown in Equation 1:

$$G = (C, D, E) \tag{1}$$

[0046] In Equation 1, $G$ represents the heterogeneous association network, $C$ represents the set of all circRNA nodes, $D$ represents the set of all disease nodes, and $E$ is the set of edges. For each known circRNA-disease association pair $(c_i, d_j)$, an edge $e_{ij} \in E$ is established between the corresponding circRNA node $c_i$ and disease node $d_j$, indicating a biological association between the two.

[0047] The topology of the heterogeneous association network in Equation 1 can be formally represented by the association matrix in Equation 2:

$$A \in \{0, 1\}^{n_c \times n_d}, \qquad A_{ij} = \begin{cases} 1, & (c_i, d_j) \in E \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

[0048] In Equation 2, $A$ is the association matrix; $n_c$ and $n_d$ represent the total numbers of circRNAs and diseases, respectively; the matrix element $A_{ij} = 1$ indicates a known association between the $i$-th circRNA and the $j$-th disease, and $A_{ij} = 0$ indicates that the association is unknown.
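Step 1.2 can be sketched as follows: a list of known (circRNA, disease) pairs is turned into a binary association matrix with one row per circRNA and one column per disease. The helper name `build_association_matrix` and the identifiers in `pairs` are hypothetical placeholders, not entries from the databases named above.

```python
import numpy as np

def build_association_matrix(pairs):
    """Return (matrix, circRNA list, disease list) for known association pairs."""
    circrnas = sorted({c for c, _ in pairs})
    diseases = sorted({d for _, d in pairs})
    c_index = {c: i for i, c in enumerate(circrnas)}
    d_index = {d: j for j, d in enumerate(diseases)}
    A = np.zeros((len(circrnas), len(diseases)), dtype=np.int8)
    for c, d in pairs:
        A[c_index[c], d_index[d]] = 1   # 1 = known association, 0 = unknown
    return A, circrnas, diseases

# Hypothetical toy data for illustration only.
pairs = [("circA", "glioma"), ("circA", "gastric cancer"), ("circB", "glioma")]
A, circrnas, diseases = build_association_matrix(pairs)
print(A)
```

A zero entry only means that no association has been recorded, which is why the later training step samples negatives from these unknown pairs rather than treating them as confirmed non-associations.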

[0049] Step 2, biological attribute feature extraction: As shown in Figure 2, this step uses the association matrix from Step 1.2 and the complete nucleotide sequences of all relevant circRNAs obtained in Step 1.1 as input to extract biologically meaningful feature representations of disease nodes and circRNAs, including disease functional attribute feature extraction and circRNA sequence semantic attribute feature extraction.

[0050] Step 2.1, Extraction of Disease Functional Attribute Features: As shown in Figure 3, the input is the association matrix $A$ from Step 1.2. The feature vectors of disease functional attributes are extracted through association profile calculation, Gaussian kernel similarity calculation, and dimensionality reduction with feature vector generation.

[0051] Step 2.1.1, Association profile calculation: For each disease node $d_j$, extract its corresponding column vector from the association matrix $A$, denoted as the association profile $IP(d_j)$.

[0052] Step 2.1.2, Gaussian kernel similarity calculation: To quantify the functional similarity between diseases, the Gaussian interaction profile kernel (GIPK) is introduced. For any two diseases $d_i$ and $d_j$, the similarity between them is calculated as:

$$GIPK(d_i, d_j) = \exp\left(-\gamma_d \lVert IP(d_i) - IP(d_j) \rVert^2\right) \tag{3}$$

[0053] In Equation 3, $\lVert IP(d_i) - IP(d_j) \rVert$ is the Euclidean distance between the two disease association profiles; $\gamma_d$ is a bandwidth parameter whose value is adaptively set to the reciprocal of the average squared norm of all disease association profiles, calculated using Equation 4:

$$\gamma_d = \left( \frac{1}{n_d} \sum_{d_k \in D} \lVert IP(d_k) \rVert^2 \right)^{-1} \tag{4}$$

[0054] In Equation 4, $D$ represents the disease set and $n_d$ represents the total number of diseases.

[0055] Step 2.1.3, Dimensionality reduction and feature vector generation: After the GIPK similarity between all disease pairs is calculated, the similarity matrix $K_D \in \mathbb{R}^{n_d \times n_d}$ is obtained. Each row of this matrix is the similarity vector of one disease with respect to all diseases. Since the dimension of this matrix is relatively high and it may contain noise and redundancy, principal component analysis (PCA) is used to reduce its dimension, retaining the first 64 principal components and yielding the reduced feature matrix $F_D \in \mathbb{R}^{n_d \times 64}$. Each row of this matrix is the low-dimensional, dense, continuous feature vector $f^{att}_{d_j}$ of the corresponding disease node $d_j$.
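The disease feature pipeline of Step 2.1 can be sketched with NumPy alone. This is a minimal sketch that assumes the usual Gaussian interaction profile kernel form (squared Euclidean distance, bandwidth equal to the reciprocal of the mean squared profile norm) and implements PCA through an SVD of the centered similarity matrix. The function names `gipk_similarity` and `pca_reduce` and the toy matrix are illustrative assumptions.

```python
import numpy as np

def gipk_similarity(A):
    """GIP kernel similarity between diseases (columns of the association matrix)."""
    profiles = np.asarray(A, dtype=float).T            # one row per disease profile
    sq_norms = np.sum(profiles ** 2, axis=1)
    gamma = 1.0 / np.mean(sq_norms)                    # adaptive bandwidth
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    sq_dists = np.maximum(sq_dists, 0.0)               # guard against tiny negatives
    return np.exp(-gamma * sq_dists)

def pca_reduce(X, n_components):
    """Project rows of X onto their leading principal components."""
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    k = min(n_components, Vt.shape[0])                 # cannot exceed available components
    return X_centered @ Vt[:k].T

# Toy association matrix (3 circRNAs x 3 diseases), for illustration only.
A = np.array([[1, 1, 0],
              [0, 1, 0],
              [1, 0, 1]])
K = gipk_similarity(A)
disease_features = pca_reduce(K, n_components=64)
print(K.round(3))
print(disease_features.shape)
```

With real data the similarity matrix has far more than 64 rows, so `pca_reduce` returns exactly 64 columns; in this toy case it is capped at 3.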

[0056] Step 2.2, circRNA sequence semantic attribute feature extraction: As shown in Figure 4, the input of this sub-step is the complete nucleotide sequences of all relevant circRNAs from Step 1.1. The feature vectors of the circRNA sequence semantic attributes are extracted through sequence normalization and tokenization, context encoding, and feature pooling.

[0057] Step 2.2.1, Sequence normalization and tokenization: The variable-length nucleotide sequence of each circRNA is padded or truncated to a fixed length (e.g., 512 bases) to ensure consistent input dimensions. Then, a tokenizer pre-trained for DNA language (the tokenizer that accompanies the DNABERT model) converts the normalized sequence into a token sequence, and the special [CLS] and [SEP] markers are added at the beginning and end of the sequence, respectively, to obtain an input sequence that the model can process directly.
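The normalization and tokenization step can be illustrated without the pretrained tokenizer. The sketch below assumes an overlapping k-mer scheme of the kind DNABERT uses, converts U to T so that an RNA sequence matches a DNA vocabulary, truncates at the base level, and pads with `[PAD]` tokens after tokenization. The function name `tokenize_circrna`, the choice `k = 6`, the U-to-T conversion, and token-level padding are assumptions; a real pipeline would call the model's own tokenizer and map tokens to vocabulary ids.

```python
def tokenize_circrna(seq, max_bases=512, k=6):
    """Truncate a sequence, split it into overlapping k-mers, add special tokens, pad."""
    seq = seq.upper().replace("U", "T")[:max_bases]       # RNA -> DNA alphabet, truncate
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    tokens = ["[CLS]"] + kmers + ["[SEP]"]
    target_len = max_bases - k + 1 + 2                    # k-mers of a full-length input + 2 markers
    return tokens + ["[PAD]"] * (target_len - len(tokens))

# Small parameters so the result is easy to inspect.
tokens = tokenize_circrna("ACGUACGUAC", max_bases=8, k=3)
short = tokenize_circrna("ACGU", max_bases=8, k=3)
print(tokens)
print(short)
```

Both calls return sequences of the same length, which is the property the fixed-length normalization is meant to guarantee.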

[0058] Step 2.2.2, Context encoding: The above token sequence is input into a bidirectional Transformer encoder model (DNABERT) pre-trained on large-scale genomic data. After the forward propagation of the model, the corresponding hidden state vector is output at each token position.

[0059] Step 2.2.3, Feature pooling: From the hidden states output by the last Transformer layer, the hidden state vector corresponding to the [CLS] marker, $h_{[CLS]}$, is extracted. This vector aggregates the contextual semantic information of the entire input sequence. To keep consistent with the other feature dimensions, the first 64 dimensions of $h_{[CLS]}$ are taken as the final sequence semantic attribute feature vector $f^{att}_{c_i}$ of the circRNA.

[0060] Step 3, Multi-scale network structure feature extraction: As shown in Figure 5, this step takes the heterogeneous association network from Step 1.2 as input and extracts topological features at different granularity levels, including microscale feature extraction, mesoscale feature extraction, and macroscale feature extraction.

[0061] Step 3.1, Microscale Feature Extraction: As shown in Figure 6, this sub-step uses node2vec, a graph embedding method based on biased random walks, to extract the microscale structural features of nodes from the heterogeneous association network $G$ of Step 1.2, characterizing their connection patterns in their local neighborhoods. The specific implementation is as follows: Step 3.1.1, first, biased random walks are performed on the heterogeneous association network $G$ of Step 1.2, generating several fixed-length (80) walk sequences for each node. The transition probability of the walk is controlled by the return parameter $p$ and the in-out parameter $q$, which balance depth-first and breadth-first exploration, and is defined as:

$$\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}, \qquad \alpha_{pq}(t, x) = \begin{cases} 1/p, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ 1/q, & d_{tx} = 2 \end{cases} \tag{5}$$

[0062] In Equation 5, $\pi_{vx}$ represents the (unnormalized) transition probability from the current node $v$ to a neighbor $x$ in a biased random walk on $G$; $t$ is the previously visited node; $w_{vx}$ is the edge weight; and $\alpha_{pq}(t, x)$ is adjusted according to the shortest-path distance $d_{tx}$ between nodes $t$ and $x$: $d_{tx} = 0$ if $x$ is $t$ itself, $d_{tx} = 1$ if $t$ and $x$ are connected, and $d_{tx} = 2$ if they are not connected. Node walk sequences are generated accordingly. The values of $p$ and $q$ are set to the optimum found by parameter sensitivity analysis.
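The biased walk of Step 3.1.1 can be sketched in pure Python for an unweighted graph (all edge weights equal to 1). The weights follow the standard node2vec rule: `1/p` for returning to the previous node, `1` for a neighbor that is also adjacent to the previous node, and `1/q` otherwise. The function name `biased_walk`, the toy graph, and the values `p = q = 1.0` are assumptions for illustration.

```python
import random

def biased_walk(graph, start, walk_length, p, q, rng):
    """Generate one node2vec-style walk on an unweighted graph (dict: node -> set of neighbors)."""
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        neighbors = sorted(graph[cur])
        if not neighbors:
            break                                   # isolated node: stop early
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))      # first step has no previous node
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)             # distance 0: return to previous node
            elif nxt in graph[prev]:
                weights.append(1.0)                 # distance 1: shared neighbor
            else:
                weights.append(1.0 / q)             # distance 2: move outward
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk

# Hypothetical bipartite toy network: circRNAs c1, c2 and diseases d1, d2.
graph = {"c1": {"d1", "d2"}, "c2": {"d1"}, "d1": {"c1", "c2"}, "d2": {"c1"}}
walk = biased_walk(graph, "c1", walk_length=80, p=1.0, q=1.0, rng=random.Random(0))
print(walk[:10])
```

In a purely bipartite circRNA-disease graph the distance-1 case never occurs, because the previous node and any candidate next node lie on the same side of the graph, so the walk is governed entirely by the `1/p` and `1/q` weights.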

[0063] Step 3.1.2: Treat the generated walk sequences of all nodes as a training corpus and train a Skip-gram model on it. The model learns to map each node to a dense vector such that nodes co-occurring in the walk sequences have similar vectors. Preferably, the vector dimension is set to 64 and the window size to 10. After training, the microscale structural feature vector $f^{mi}_v$ of each node $v$ is obtained.

[0064] Step 3.2, Mesoscale Feature Extraction: As shown in Figure 7, this sub-step employs GraRep, a graph embedding method based on high-order transition matrix factorization, to extract the mesoscale structural features of nodes from the heterogeneous association network $G$ of Step 1.2, characterizing the higher-order topological dependencies of nodes in multi-step path propagation. This includes constructing a first-order transition probability matrix, calculating multi-order transition matrices, calculating multi-order positive pointwise mutual information (PPMI) matrices, performing singular value decomposition (SVD) for each order, and building the multi-order embedding representation. The specific implementation is as follows: Step 3.2.1, first, the transition probability matrix of the network is constructed from the association matrix $A$ of Step 1.2. The adjacency matrix of the network is defined in block form as:

$$S = \begin{bmatrix} 0 & A \\ A^{\top} & 0 \end{bmatrix} \tag{6}$$

[0065] In Equation 6, $S$ is the adjacency matrix of the network and $A$ is the association matrix from Step 1.2.

[0066] Subsequently, $S$ is row-normalized to yield the first-order transition probability matrix $P$, whose elements are calculated as:

$$P_{ij} = \frac{S_{ij}}{\sum_{l} S_{il}} \tag{7}$$

[0067] In Equation 7, $P_{ij}$ denotes the probability of reaching node $v_j$ from node $v_i$ in one random-walk step, and $S_{ij}$ is an element of the adjacency matrix $S$.
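Step 3.2.1 amounts to stacking the association matrix into a symmetric block adjacency matrix and dividing each row by its sum. A minimal sketch follows; the function name `transition_matrix` and the toy matrix are assumptions, and rows with no edges are left as zeros, a choice the source does not specify.

```python
import numpy as np

def transition_matrix(A):
    """Build the bipartite adjacency matrix and its row-normalized transition matrix."""
    A = np.asarray(A, dtype=float)
    n_c, n_d = A.shape
    S = np.block([[np.zeros((n_c, n_c)), A],
                  [A.T, np.zeros((n_d, n_d))]])
    row_sums = S.sum(axis=1, keepdims=True)
    # Rows of isolated nodes stay all-zero instead of dividing by zero.
    P = np.divide(S, row_sums, out=np.zeros_like(S), where=row_sums > 0)
    return S, P

# Toy association matrix: 2 circRNAs x 2 diseases.
A = np.array([[1, 1],
              [0, 1]])
S, P = transition_matrix(A)
print(P)
```

Nodes are ordered circRNAs first and diseases second, so row 0 of `P` splits its probability evenly between the two diseases associated with the first circRNA.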

[0068] Step 3.2.2, then calculate the multi-order transition matrices. For each order $k \in \{1, \dots, K\}$, where $K$ is the maximum order taken in this embodiment, calculate the $k$-step transition probability matrix $P^k$, in which $P^k_{ij}$ denotes the probability of reaching node $v_j$ from node $v_i$ in exactly $k$ random-walk steps.

[0069] For each order $k$, the positive pointwise mutual information matrix $X^k$ is further calculated, with elements:

$$X^k_{ij} = \max\left( \log \frac{P^k_{ij} \sum_{a,b} P^k_{ab}}{\mathrm{row}_i \cdot \mathrm{col}_j},\ 0 \right) \tag{8}$$

[0070] In Equation 8, $\mathrm{row}_i$ and $\mathrm{col}_j$ represent the sum of row $i$ and the sum of column $j$ of the matrix $P^k$, respectively. This transformation enhances significant co-occurrence patterns among nodes while suppressing noise.

[0071] Step 3.2.3, then, for each order, singular value decomposition (SVD) is performed on the matrix $X^k$, i.e., $X^k = U^k \Sigma^k (V^k)^{\top}$, and the $d_k$ largest singular values and their corresponding singular vectors are retained to obtain the node embedding at this order. In this embodiment, the sum of the embedding dimensions over all orders is set to 64, that is, $\sum_{k=1}^{K} d_k = 64$, with the dimensions usually distributed evenly across orders.

[0072] Step 3.2.4, finally, the embedding vectors of all orders are concatenated to obtain the multi-order embedding representation of each node. To maintain dimensionality consistency, a fully connected layer can be used to map the concatenated vector to 64 dimensions, ultimately yielding the mesoscale structural feature vector $f^{me}_v$ of each node.
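Steps 3.2.2 to 3.2.4 can be sketched as a loop over orders: raise the transition matrix to the next power, apply a PPMI transform based on row and column sums, take a truncated SVD, and concatenate the per-order embeddings. This is a hedged sketch: scaling the left singular vectors by the square root of the singular values is a common GraRep convention assumed here, and the names `ppmi` and `grarep_embed`, the toy matrix, and the per-order dimensions are illustrative.

```python
import numpy as np

def ppmi(M):
    """Positive pointwise mutual information of a nonnegative matrix."""
    total = M.sum()
    expected = M.sum(axis=1, keepdims=True) @ M.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(M * total / expected)
    pmi[~np.isfinite(pmi)] = 0.0            # log(0) and 0/0 entries carry no signal
    return np.maximum(pmi, 0.0)

def grarep_embed(P, dims):
    """Concatenate truncated-SVD embeddings of the PPMI matrix at orders 1..len(dims)."""
    embeddings = []
    P_k = np.eye(P.shape[0])
    for d_k in dims:
        P_k = P_k @ P                        # k-step transition probabilities
        U, s, _ = np.linalg.svd(ppmi(P_k))
        embeddings.append(U[:, :d_k] * np.sqrt(s[:d_k]))
    return np.hstack(embeddings)

# Toy first-order transition matrix for 2 circRNAs followed by 2 diseases.
P = np.array([[0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.5, 0.5, 0.0, 0.0]])
embedding = grarep_embed(P, dims=[2, 2])
print(embedding.shape)
```

With a 64-dimensional budget split evenly over the orders, the concatenated output is already 64-dimensional, so the optional fully connected mapping would only be needed when the per-order sizes do not sum to 64.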

[0073] Step 3.3, Macroscale Feature Extraction: As shown in Figure 8, this sub-step employs a nonlinear graph embedding method based on a deep autoencoder to extract the macroscale structural features of nodes from the heterogeneous association network $G$ of Step 1.2, characterizing the global nonlinear structural information of nodes in the network. The specific implementation is as follows: Step 3.3.1: First, construct the adjacency vector of each node. For a node $v$, the adjacency vector $x_v$ is defined from its position in the association matrix $A$: if $v$ is a circRNA node, $x_v$ is the corresponding row of $A$ concatenated with a zero vector; if $v$ is a disease node, $x_v$ is the corresponding row of $A^{\top}$ concatenated with a zero vector. This vector describes the connections between the node and all other nodes in the network.

[0074] Step 3.3.2: Subsequently, a deep autoencoder model consisting of an encoder and a decoder is constructed. The encoder compresses the high-dimensional adjacency vector $x_v$ layer by layer through multiple fully connected layers into a low-dimensional latent representation, and the decoder reconstructs the original adjacency vector from this representation. In this embodiment, the encoder structure is set as shown in Figure 9, and the decoder structure is set symmetrically. The activation function is ReLU.

[0075] Step 3.3.3, define the total loss function for model training as follows:

$$\mathcal{L} = \mathcal{L}_{rec} + \alpha \mathcal{L}_{lap} + \lambda \lVert \Theta \rVert_2^2 \tag{9}$$

[0076] In Equation 9, $\alpha$ is the weight parameter that balances first-order and second-order proximity, $\lambda$ is the regularization coefficient, $\Theta$ denotes all trainable parameters of the model, $\mathcal{L}_{rec}$ is the reconstruction loss function, and $\mathcal{L}_{lap}$ is the Laplacian regularization term.

[0077] The reconstruction loss function and the Laplacian regularization term are calculated as follows:

$$\mathcal{L}_{rec} = \sum_{v} \lVert \hat{x}_v - x_v \rVert_2^2 \tag{10}$$

[0078]

$$\mathcal{L}_{lap} = \sum_{(u, v) \in E} \lVert z_u - z_v \rVert_2^2 \tag{11}$$

[0079] In Equations 10 and 11, $x_v$ is the original input (adjacency) vector of node $v$ and $\hat{x}_v$ is the reconstructed vector output by the decoder; $z_v$ is the low-dimensional latent representation of node $v$ output by the encoder; $E$ is the edge set of the graph, and $(u, v) \in E$ indicates that there is an edge between the adjacent nodes $u$ and $v$.
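To make the three loss terms concrete, the sketch below evaluates them for given inputs, reconstructions, latent vectors, and edges. It assumes the plain form described in the text (squared reconstruction error, squared latent distance over edges, and an L2 penalty on the parameters) and does not train anything. The function name `autoencoder_loss` and all numeric values are hypothetical.

```python
import numpy as np

def autoencoder_loss(X, X_hat, Z, edges, params, alpha, lam):
    """Total loss = reconstruction + alpha * Laplacian term + lam * L2 penalty."""
    rec = np.sum((X_hat - X) ** 2)                                # second-order proximity
    lap = sum(np.sum((Z[u] - Z[v]) ** 2) for u, v in edges)       # first-order proximity
    reg = sum(np.sum(W ** 2) for W in params)                     # weight decay
    return rec + alpha * lap + lam * reg

# Tiny hand-checkable example with two nodes joined by one edge.
X = np.array([[1.0, 0.0], [0.0, 1.0]])        # adjacency vectors
X_hat = np.array([[1.0, 0.0], [0.0, 0.0]])    # decoder output
Z = np.array([[0.0, 0.0], [1.0, 1.0]])        # latent representations
params = [np.array([[1.0, 2.0]])]             # stand-in for the model weights
loss = autoencoder_loss(X, X_hat, Z, edges=[(0, 1)], params=params, alpha=0.5, lam=0.1)
print(loss)
```

Here the reconstruction term is 1, the Laplacian term is 2 (weighted by 0.5), and the parameter penalty is 5 (weighted by 0.1), so the terms contribute 1, 1, and 0.5 respectively.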

[0080] Step 3.3.4: Finally, a 64-dimensional macroscale structural feature vector is generated for each node $v$ in the network, denoted as $f^{ma}_v$.

[0081] Step 4: Multi-source feature fusion and association pair representation construction: As shown in Figure 10, this step integrates the multi-source features obtained in the above steps.

[0082] For each circRNA node $c_i$, its 64-dimensional sequence attribute features $f^{att}_{c_i}$ are concatenated with its 64-dimensional microscale ($f^{mi}_{c_i}$), mesoscale ($f^{me}_{c_i}$), and macroscale ($f^{ma}_{c_i}$) structural features to obtain a 256-dimensional unified representation of the node:

$$h_{c_i} = \left[ f^{att}_{c_i} \,\Vert\, f^{mi}_{c_i} \,\Vert\, f^{me}_{c_i} \,\Vert\, f^{ma}_{c_i} \right] \in \mathbb{R}^{256} \tag{12}$$

[0083] Similarly, a unified 256-dimensional representation $h_{d_j}$ of each disease node $d_j$ is obtained.

[0084] Association pair feature construction: for a circRNA-disease candidate pair $(c_i, d_j)$ to be predicted, the 256-dimensional representation vectors of the two nodes are concatenated again to construct the final 512-dimensional feature representation of the association pair:

$$\phi_{ij} = \left[ h_{c_i} \,\Vert\, h_{d_j} \right] \in \mathbb{R}^{512} \tag{13}$$
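The fusion in Step 4 is plain vector concatenation, as the following sketch shows. The random vectors stand in for the real attribute, microscale, mesoscale, and macroscale features, and the function names `node_representation` and `pair_features` are hypothetical.

```python
import numpy as np

def node_representation(attr, micro, meso, macro):
    """Concatenate four 64-dimensional feature vectors into one 256-dimensional vector."""
    return np.concatenate([attr, micro, meso, macro])

def pair_features(circ_repr, disease_repr):
    """Concatenate the circRNA and disease representations into a 512-dimensional vector."""
    return np.concatenate([circ_repr, disease_repr])

rng = np.random.default_rng(0)
circ_parts = [rng.normal(size=64) for _ in range(4)]      # placeholder features
disease_parts = [rng.normal(size=64) for _ in range(4)]   # placeholder features

circ_repr = node_representation(*circ_parts)
disease_repr = node_representation(*disease_parts)
pair = pair_features(circ_repr, disease_repr)
print(circ_repr.shape, disease_repr.shape, pair.shape)
```

Because the order of concatenation is fixed, each position in the 512-dimensional vector always refers to the same feature source, which lets a tree-based classifier split on individual sources and scales.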

[0085] Step 5, Association Prediction: This step utilizes the fused features for final classification.

[0086] Step 5.1, Training Phase: The 512-dimensional features of all known circRNA-disease association pairs, with labels $y = 1$ (positive samples), together with the 512-dimensional features of an equal number of randomly selected unknown pairs, with labels $y = 0$ (negative samples), are input into the XGBoost model for training. The preferred key hyperparameter settings are: learning rate = 0.05, maximum tree depth = 6, number of trees (T) = 200, and subsample ratio = 0.8. The objective function is the binary classification logistic loss.
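The balanced sampling of Step 5.1 can be sketched as follows: all known pairs become positives and an equal number of unknown pairs are drawn at random as negatives. The function name `sample_training_pairs`, the seed, and the toy matrix are assumptions. The hyperparameter dictionary mirrors the preferred settings in the text, using the parameter names of XGBoost's scikit-learn interface; the model itself is not trained here.

```python
import numpy as np

def sample_training_pairs(A, seed=0):
    """Return (pairs, labels): all known pairs plus an equal number of random unknown pairs."""
    rng = np.random.default_rng(seed)
    A = np.asarray(A)
    positives = np.argwhere(A == 1)
    unknown = np.argwhere(A == 0)
    chosen = rng.choice(len(unknown), size=len(positives), replace=False)
    negatives = unknown[chosen]
    pairs = np.vstack([positives, negatives])
    labels = np.concatenate([np.ones(len(positives), dtype=int),
                             np.zeros(len(negatives), dtype=int)])
    return pairs, labels

# Preferred settings from the text; could be passed as xgboost.XGBClassifier(**xgb_params).
xgb_params = {
    "learning_rate": 0.05,
    "max_depth": 6,
    "n_estimators": 200,
    "subsample": 0.8,
    "objective": "binary:logistic",
}

# Toy association matrix: 3 circRNAs x 4 diseases with 4 known associations.
A = np.array([[1, 0, 0, 1],
              [0, 1, 0, 0],
              [0, 0, 1, 0]])
pairs, labels = sample_training_pairs(A, seed=0)
print(pairs.shape, labels)
```

Sampling without replacement requires at least as many unknown pairs as known ones, which holds easily in practice because association matrices are very sparse.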

[0087] Step 5.2, Prediction Phase: For new, unknown circRNA-disease pairs, repeat Steps 1 through 4 to generate their 512-dimensional features and input them into the trained XGBoost model. The model outputs a probability value $\hat{y}$ between 0 and 1, which represents the prediction confidence that the pair is associated.

[0088] Five-fold cross-validation experiments were conducted on three publicly available circRNA-disease association datasets (circAtlas, CircR2Disease, and circRNADisease), and the results were compared with several existing state-of-the-art methods. The experimental results are as follows: Table 1. Comparison of ablation experimental performance of different feature combinations on three benchmark datasets.
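As an illustration of the evaluation protocol, the following sketch shows one way to form five label-balanced folds and to compute AUC from predicted scores. The helper names `stratified_folds` and `auc` are hypothetical, and stratifying the folds by label is an assumption; the text only states that five-fold cross-validation was used.

```python
import numpy as np

def stratified_folds(y, n_folds, rng):
    """Assign each sample a fold id, balancing labels across folds."""
    fold = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        fold[idx] = np.arange(len(idx)) % n_folds
    return fold

def auc(y_true, scores):
    """AUC as the chance that a positive outranks a negative (ties count 0.5)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
y = np.array([1] * 50 + [0] * 50)      # balanced toy labels
fold = stratified_folds(y, 5, rng)
sizes = [int((fold == k).sum()) for k in range(5)]
print(sizes)  # [20, 20, 20, 20, 20]
```

For fold `k`, the samples with `fold != k` would be used for training and those with `fold == k` for testing, and the per-fold metrics would then be averaged.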

[0089] Note: In Table 1, Att. represents biological attributes, Mi represents microscopic structural features, Me represents mesoscopic structural features, and Ma represents macroscopic structural features.

[0090] Existing methods often neglect the inherent sequence information of circRNAs and the functional similarity between diseases, and thus lack guidance from prior biological knowledge. This invention introduces circRNA sequence semantic features and disease GIPK functional features in step two, integrating prior biological knowledge into the model. To verify the effectiveness of this design, ablation experiments were conducted. As shown in Table 1, on the circAtlas dataset, the configuration using only biological attribute features (Att.) achieved an accuracy of 0.7886 and an AUC of 0.8590; when the complete model integrates biological attribute features together with multi-scale structural features, the accuracy improved to 0.8383 and the AUC to 0.9213. Similar significant improvements were observed on the CircR2Disease and circRNADisease datasets. This indicates that the biological attribute features introduced in this invention provide effective prior biological information for the model, solving the problem of "lack of biological semantics" in traditional methods and making the prediction results more biologically interpretable.

[0091] Existing graph learning methods typically learn embeddings from a single perspective, making it difficult to simultaneously characterize local neighborhood structures, higher-order topological dependencies, and global nonlinear patterns. This invention systematically extracts network structural features from three complementary scales—micro (node2vec), meso (GraRep), and macro (SDNE)—through step three. To verify the effectiveness of multi-scale modeling, feature combination ablation experiments were conducted. As shown in Table 1, on the circAtlas dataset, using only a combination of single-scale structural and attribute features (e.g., Att. & Mi) yielded an AUC of 0.8745; fusing two scales (e.g., Att. & Mi & Me) increased the AUC to 0.8822; and fully fusing all three scales resulted in an AUC of 0.9213. This fully demonstrates that structural features at the micro, meso, and macro scales provide complementary topological information, and joint modeling of these three scales can comprehensively characterize complex network association patterns, effectively overcoming the deficiency of "single network representation."

[0092] Existing methods employ simplistic strategies for fusing multi-source information, failing to effectively integrate biological attributes with multi-scale network structures. This invention addresses this by concatenating multi-source features into vectors in step four, forming a unified association pair feature representation, and then using the XGBoost ensemble learner in step five for nonlinear classification. As shown in Figure 11(a), with the same input features, XGBoost achieved the best performance among the classifiers compared. Taking the circAtlas dataset as an example, XGBoost achieved an AUC of 0.9213, significantly outperforming commonly used classifiers such as Support Vector Machine (SVM), Multilayer Perceptron (MLP), Random Forest, and LightGBM. Furthermore, Figure 11(b) shows the performance distribution of the present invention on the three datasets under five-fold cross-validation. The small variance of each evaluation metric indicates that the model performs stably under different data partitions and has good robustness and generalization ability. This shows that the present invention achieves efficient feature utilization and robust prediction performance by combining deep fusion of multi-source features with an ensemble learner, thus solving the problem of "inefficient feature fusion".

[0093] As can be seen, by introducing biological attribute features, multi-scale network structure features, and deep fusion strategies, this invention systematically solves the three major technical defects in the prior art: "lack of biological semantics", "single network representation" and "inefficient feature fusion", and achieves significant improvements in prediction accuracy, robustness and biological interpretability.

Claims

1. A circRNA-disease association prediction method based on multi-source feature fusion, characterized in that it includes the following steps: Step 1, Dataset preparation and heterogeneous network construction: Step 1.1, Obtain the circRNA-disease association pair dataset: obtain the experimentally validated circRNA-disease association pair dataset from the database, and at the same time obtain the complete nucleotide sequences of all relevant circRNAs; Step 1.2: Based on the obtained circRNA-disease association pair dataset, construct a heterogeneous association network and express it as an association matrix: the heterogeneous association network includes a set of all circRNA nodes, a set of disease nodes, and a set of edges; for each known circRNA-disease association pair, an edge is established between the corresponding circRNA node and disease node, representing the biological association between them; Step 2, Biological attribute feature extraction: using the association matrix from step 1.2 and the complete nucleotide sequences of all relevant circRNAs obtained in step 1.1 as input, extract biologically meaningful feature representations of disease nodes and circRNAs, including disease functional attribute feature extraction and circRNA sequence semantic attribute feature extraction; Step 3, Multi-scale network structure feature extraction: using the heterogeneous association network from step 1.2 as input, extract topological features at different levels of granularity, including microscale feature extraction, mesoscale feature extraction, and macroscale feature extraction; Step 4, Multi-source feature fusion and association pair representation construction: for each circRNA node, the 64-dimensional sequence attribute features of the node are concatenated with its 64-dimensional microscopic structural features, 64-dimensional mesoscopic structural features, and 64-dimensional macroscopic structural features to obtain a 256-dimensional unified representation of the node; for each disease node, its 256-dimensional unified representation is obtained using the same method; for a candidate pair consisting of a circRNA node and a disease node to be predicted, the 256-dimensional unified representation of the circRNA node and the 256-dimensional unified representation of the disease node are concatenated again to construct a 512-dimensional association pair feature representation; Step 5, Association prediction: Step 5.1, Training phase: use all known circRNA-disease association pairs as positive samples, each positive sample containing the 512-dimensional features of the association pair and a label value of 1; randomly select the same number of unknown association pairs as negative samples, each negative sample containing the 512-dimensional features of the unknown association pair and a label value of 0; input the positive and negative samples into the XGBoost model for training; Step 5.2, Prediction phase: for new, unknown circRNA-disease pairs, generate their 512-dimensional features in the same way as in the training phase, and input the features into the trained XGBoost model; the model outputs a probability value between 0 and 1, which represents the prediction confidence that the unknown pair is associated.

2. The circRNA-disease association prediction method based on multi-source feature fusion as described in claim 1, characterized in that step two includes the following steps: Step 2.1, Extraction of disease functional attribute features: the input is the association matrix from step one, and the feature vectors of the disease functional attributes are extracted, including association profile calculation, Gaussian kernel similarity calculation, and dimensionality reduction with feature vector generation; Step 2.2, Extraction of semantic attribute features of circRNA sequences: the input is the complete nucleotide sequences of all relevant circRNAs from step one, and the semantic attribute feature vectors of the circRNA sequences are extracted, including sequence standardization and tokenization, context encoding, and feature pooling.

3. The circRNA-disease association prediction method based on multi-source feature fusion as described in claim 2, characterized in that step 2.1 includes the following steps: Step 2.1.1, Calculate the association profile: for each disease node, extract its corresponding column vector in the association matrix constructed in step 1.2 as the association profile of that disease node; Step 2.1.2, Gaussian kernel similarity calculation: for any two disease nodes, calculate the Euclidean distance between their association profiles; set the bandwidth parameter to the reciprocal of the average squared norm of the association profiles of all disease nodes; then use the Gaussian interaction profile kernel function to calculate the similarity between the two disease nodes; Step 2.1.3, Dimensionality reduction and feature vector generation: a similarity matrix is constructed from the pairwise similarities between all disease nodes, each row of which is the similarity vector of one disease node with respect to all disease nodes; principal component analysis is used to reduce the dimensionality of the similarity matrix, retaining the first 64 principal components to obtain the dimensionality-reduced feature matrix; each row of this feature matrix is the low-dimensional, dense, continuous feature vector of the corresponding disease node.
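The disease feature pipeline recited in steps 2.1.1 to 2.1.3 can be sketched in NumPy as below. This is a minimal illustration under stated assumptions: `gip_similarity` and `pca_reduce` are hypothetical names, the association matrix is random toy data with circRNAs as rows and diseases as columns, and PCA is carried out through an SVD of the column-centered similarity matrix.

```python
import numpy as np

def gip_similarity(assoc):
    """Gaussian interaction profile kernel between disease columns of `assoc`."""
    profiles = assoc.T.astype(float)          # one association profile per disease
    sq_norms = (profiles ** 2).sum(axis=1)
    gamma = 1.0 / sq_norms.mean()             # bandwidth: reciprocal of mean squared norm
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * profiles @ profiles.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def pca_reduce(matrix, n_components):
    """Project rows onto the leading principal components via SVD."""
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Toy association matrix (assumption): 200 circRNAs x 80 diseases.
rng = np.random.default_rng(0)
assoc = (rng.random((200, 80)) < 0.05).astype(int)
sim = gip_similarity(assoc)
features = pca_reduce(sim, 64)
print(sim.shape, features.shape)  # (80, 80) (80, 64)
```

Retaining 64 components in this way requires at least 64 disease nodes; how smaller networks would be handled is not stated in the text.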

4. The circRNA-disease association prediction method based on multi-source feature fusion as described in claim 2, characterized in that step 2.2 includes the following steps: Step 2.2.1, Sequence standardization and tokenization: the variable-length nucleotide sequence of each circRNA is padded or truncated to a fixed length to ensure consistent input size; then, a pre-trained DNA language tokenizer is used to convert the standardized sequence into a token sequence, and distinct markers are added to the beginning and end of the sequence to obtain an input sequence that the model can process directly; Step 2.2.2, Context encoding: input the token sequence obtained in step 2.2.1 into a pre-trained Transformer-based DNA language model; after forward propagation, the model outputs a hidden state vector for each token position; Step 2.2.3, Feature pooling: from the hidden states output by the last layer of the pre-trained DNA language model in step 2.2.2, extract the hidden state vector corresponding to the beginning marker of the token sequence; this vector pools the contextual semantic information of the entire input sequence; to maintain consistency with the other feature dimensions, take the first 64 dimensions of this hidden state vector as the final sequence semantic attribute feature vector of the circRNA.
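The length standardization of step 2.2.1 and the pooling of step 2.2.3 can be illustrated without loading an actual language model. In the sketch below, `standardize` and `cls_feature` are hypothetical names, padding with the character "N" and upper-casing are assumptions, and a random array with a hidden size of 768 stands in for the last-layer hidden states of the pre-trained model.

```python
import numpy as np

def standardize(seq, length, pad="N"):
    """Truncate or right-pad a nucleotide sequence to a fixed length."""
    seq = seq.upper()
    return seq[:length].ljust(length, pad)

def cls_feature(hidden_states, dim=64):
    """Take the hidden state at position 0 (beginning marker), keep `dim` dims."""
    return hidden_states[0, :dim]

print(standardize("acgt", 6))       # ACGTNN
print(standardize("ACGTACGT", 4))   # ACGT

# Stand-in for model output (assumption): 10 token positions, hidden size 768.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(10, 768))
feature = cls_feature(hidden)
print(feature.shape)  # (64,)
```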

5. The circRNA-disease association prediction method based on multi-source feature fusion as described in claim 1, characterized in that step three includes the following steps: Step 3.1, Microscale feature extraction: the node2vec graph embedding method, based on biased random walks, is used to extract the microscopic structural features of nodes from the heterogeneous association network in step 1.2; these features characterize the connection patterns of nodes in their local neighborhoods; Step 3.2, Mesoscale feature extraction: GraRep, a graph embedding method based on high-order transition matrix decomposition, is used to extract the mesoscopic structural features of nodes from the heterogeneous association network in step 1.2; these features characterize the high-order topological dependencies of nodes in multi-step path propagation; the process includes constructing a first-order transition probability matrix, calculating multi-order transition matrices, calculating multi-order positive pointwise mutual information matrices, performing singular value decomposition order by order, and forming the multi-order embedding representation; Step 3.3, Macroscale feature extraction: a nonlinear graph embedding method based on deep autoencoders is used to extract the macroscopic structural features of nodes from the heterogeneous association network in step 1.2; these features characterize the global nonlinear structural information of nodes in the network.

6. The circRNA-disease association prediction method based on multi-source feature fusion as described in claim 5, characterized in that, Step 3.1 includes the following steps: Step 3.1.1: First, perform a biased random walk on the heterogeneous association network in step 1.2, and generate several fixed-length walk sequences for each node; Step 3.1.2: Treat the generated walk sequences of each node as training corpus and train the Skip-gram model; the model learns to map each node to a dense vector, so that the vectors of nodes co-occurring in the walk sequences are similar; After training, the microstructural features of each node are obtained.

7. The circRNA-disease association prediction method based on multi-source feature fusion as described in claim 5, characterized in that step 3.2 includes the following steps: Step 3.2.1: Construct the adjacency matrix of the heterogeneous association network based on the association matrix, and then row-normalize the adjacency matrix to obtain the first-order transition probability matrix, where each element represents the probability of reaching another node from the current node in one random-walk step; Step 3.2.2, Calculate the multi-order transition matrices: for each order k, the k-th order transition probability matrix is obtained by raising the first-order transition probability matrix to the k-th power; each element of this matrix represents the probability of reaching another node from the current node in exactly k random-walk steps; for each order k, further calculate the positive pointwise mutual information matrix: divide each element of the k-th order transition probability matrix by the product of the sum of its row and the sum of its column, take the logarithm, and set negative results to 0; Step 3.2.3: for each order k, perform singular value decomposition on the positive pointwise mutual information matrix and retain the largest singular values and their corresponding singular vectors to obtain the node embedding matrix of that order, where the sum of the embedding dimensions over all orders is 64, and the first... is a positive integer; Step 3.2.4: Concatenate the embedding vectors of all orders to obtain the multi-order embedding representation of each node; then, a fully connected layer is used to map the concatenated vector to 64 dimensions, finally obtaining the 64-dimensional mesoscopic-scale structural feature vector of each node.
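Steps 3.2.1 to 3.2.4 can be sketched with NumPy as follows. Several points are assumptions rather than statements of the claimed method: the function name `grarep_embeddings` is hypothetical; the pointwise mutual information is computed with the usual scaling by the matrix total, since without that factor a row-stochastic matrix would give no positive values; singular vectors are scaled by the square root of the singular values; and four orders of 16 dimensions each are used as an example split of the 64 dimensions. The final fully connected mapping is omitted.

```python
import numpy as np

def grarep_embeddings(assoc, dims):
    """Per-order SVD embeddings; dims[k-1] is the size kept at order k."""
    n_c, n_d = assoc.shape
    adj = np.zeros((n_c + n_d, n_c + n_d))
    adj[:n_c, n_c:] = assoc                   # circRNA -> disease edges
    adj[n_c:, :n_c] = assoc.T                 # disease -> circRNA edges
    row_sums = adj.sum(axis=1, keepdims=True)
    trans = np.divide(adj, row_sums, out=np.zeros_like(adj), where=row_sums > 0)
    power = np.eye(len(adj))
    blocks = []
    for d in dims:
        power = power @ trans                 # k-th order transition matrix
        denom = power.sum(axis=1, keepdims=True) * power.sum(axis=0, keepdims=True)
        scaled = power * power.sum()          # scaling by the total (assumption)
        ratio = np.divide(scaled, denom, out=np.zeros_like(power), where=denom > 0)
        with np.errstate(divide="ignore"):
            ppmi = np.maximum(np.log(ratio), 0.0)   # log(0) = -inf is clipped to 0
        u, s, _ = np.linalg.svd(ppmi)
        blocks.append(u[:, :d] * np.sqrt(s[:d]))
    return np.hstack(blocks)                  # concatenation over all orders

# Toy association matrix (assumption): 40 circRNAs x 20 diseases.
rng = np.random.default_rng(0)
assoc = (rng.random((40, 20)) < 0.15).astype(float)
emb = grarep_embeddings(assoc, dims=(16, 16, 16, 16))
print(emb.shape)  # (60, 64)
```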

8. The circRNA-disease association prediction method based on multi-source feature fusion as described in claim 5, characterized in that step 3.3 includes the following steps: Step 3.3.1: Construct the adjacency vector of each node: based on the association matrix, the adjacency vector of a node is defined as a vector whose dimension equals the total number of circRNA nodes plus the total number of disease nodes; if the node is a circRNA node, its corresponding row in the association matrix is concatenated with a zero vector to form the node's adjacency vector; if the node is a disease node, its corresponding row in the transpose of the association matrix is concatenated with a zero vector to form the node's adjacency vector; Step 3.3.2, Construct the deep autoencoder model: the deep autoencoder model consists of an encoder and a decoder; the encoder compresses the high-dimensional adjacency vector constructed in step 3.3.1 layer by layer into a low-dimensional latent representation through multiple fully connected layers; the decoder reconstructs the original adjacency vector from the low-dimensional representation; the activation function is ReLU; Step 3.3.3: Define the total loss function for model training; then iteratively optimize the model using the backpropagation algorithm until the loss function converges; after training, take the output of the last layer of the encoder as the low-dimensional embedding representation of each node; Step 3.3.4: Generate a 64-dimensional macroscopic structural feature vector for each node in the network.
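The adjacency vectors of step 3.3.1 and the encoder of step 3.3.2 can be sketched as below. The names `adjacency_vectors` and `encode` are hypothetical, the position of the zero block (before the association row for circRNA nodes, after it for disease nodes) is an assumption chosen so that the stacked vectors form a symmetric adjacency matrix, and the encoder uses untrained random weights with layer sizes 128 and 64 purely to show the shapes. The decoder, the loss function, and training are omitted.

```python
import numpy as np

def adjacency_vectors(assoc):
    """Stack one adjacency vector per node: circRNA nodes first, then diseases."""
    n_c, n_d = assoc.shape
    circ_rows = np.hstack([np.zeros((n_c, n_c)), assoc])     # [0 | A]
    dis_rows = np.hstack([assoc.T, np.zeros((n_d, n_d))])    # [A^T | 0]
    return np.vstack([circ_rows, dis_rows])

def encode(x, weights, biases):
    """Forward pass of a fully connected encoder with ReLU activations."""
    h = x
    for w, b in zip(weights, biases):
        h = np.maximum(h @ w + b, 0.0)
    return h

# Toy association matrix (assumption): 40 circRNAs x 20 diseases.
rng = np.random.default_rng(0)
assoc = (rng.random((40, 20)) < 0.15).astype(float)
vectors = adjacency_vectors(assoc)

# Untrained placeholder weights (assumption): 60 -> 128 -> 64.
sizes = [vectors.shape[1], 128, 64]
weights = [rng.normal(scale=0.1, size=(a, b)) for a, b in zip(sizes, sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]
latent = encode(vectors, weights, biases)
print(vectors.shape, latent.shape)  # (60, 60) (60, 64)
```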

9. The circRNA-disease association prediction method based on multi-source feature fusion as described in claim 8, characterized in that, in step 3.1.1, a return parameter and an in-out parameter control the transition probability of the walk so as to strike a balance between depth-first and breadth-first exploration; wherein the transition probability from the current node to the next node is equal to the product of the edge weight and an adjustment factor; the adjustment factor is determined based on the shortest path distance between the previously visited node and the candidate node: if the candidate node is the previously visited node, the adjustment factor is the reciprocal of the return parameter; if the candidate node is directly connected to the previously visited node, the adjustment factor is 1; if the candidate node is not connected to the previously visited node, the adjustment factor is the reciprocal of the in-out parameter.
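The biased walk with its three adjustment factors can be illustrated with the standard library alone. In this sketch `biased_walk` is a hypothetical name, `p` and `q` are the names commonly used in node2vec for the return and in-out parameters, all edge weights are taken as 1, and the graph is a small made-up bipartite example.

```python
import random

def biased_walk(graph, start, length, p, q, rng):
    """One node2vec-style walk on an unweighted graph {node: set(neighbors)}."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(graph[cur])
        if not nbrs:
            break
        if len(walk) == 1:
            walk.append(rng.choice(nbrs))     # first step: uniform choice
            continue
        prev = walk[-2]
        weights = []
        for nxt in nbrs:
            if nxt == prev:
                weights.append(1.0 / p)       # return to the previous node
            elif nxt in graph[prev]:
                weights.append(1.0)           # candidate adjacent to previous node
            else:
                weights.append(1.0 / q)       # candidate farther from previous node
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk

# Toy bipartite network (assumption): circRNAs c1..c3 and diseases d1, d2.
graph = {
    "c1": {"d1", "d2"}, "c2": {"d1"}, "c3": {"d2"},
    "d1": {"c1", "c2"}, "d2": {"c1", "c3"},
}
walk = biased_walk(graph, "c1", 10, p=1.0, q=2.0, rng=random.Random(0))
print(walk)
```

In a strictly bipartite network such as this one, a candidate node is never directly connected to the previously visited node, so only the first and third factors take effect.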

10. The circRNA-disease association prediction method based on multi-source feature fusion as described in claim 1, characterized in that: In step 3.1.2, when training the Skip-gram model, the vector dimension is set to 64 and the window size is 10. In step 5.1, the key hyperparameters are set as follows: learning rate = 0.05, maximum tree depth = 6, number of trees = 200, subsample ratio = 0.8; the objective function is binary classification logistic loss.