Multi-level causal inference method and device for multi-omics heterogeneous data

By combining feature-level and representation-level causal inference methods based on conditional mutual information with heterogeneous graph neural networks, the problem of insufficient causal inference in multi-omics analysis is solved, achieving accurate inference of multi-level causal relationships and reducing false causal relationships.

CN122242783A · Pending · Publication Date: 2026-06-19 · SHENZHEN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHENZHEN UNIV
Filing Date
2026-05-22
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing multi-omics analyses suffer from insufficient causal inference, lack a rigorous mathematical and information theory foundation, cannot effectively distinguish between direct causality, indirect causality, or spurious causal relationships, and lack cross-modal and multi-level causal constraint mechanisms.

Method used

We employ a feature-level causal constraint and representation-level causal inference method based on conditional mutual information, combined with a heterogeneous graph neural network. Through the screening, preprocessing, construction of causal constraint heterogeneous graphs, and stability verification of multi-omics feature pairs, we achieve the inference of multi-level causal relationships.

Benefits of technology

It establishes a rigorous mathematical foundation, enhances the theoretical rigor of causal inference, reduces the risk of false causal discoveries, captures richer causal structures, and improves the accuracy and reliability of causal inference.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122242783A_ABST
Patent Text Reader

Abstract

This application proposes a multi-level causal inference method and device for heterogeneous multi-omics data. By establishing feature-level causal constraints based on conditional mutual information and designing a two-stage causal inference scheme of feature level and representation level, it systematically infers multi-level causal relationships from heterogeneous multi-omics data. By utilizing conditional mutual information, it rigorously captures the essence of multi-omics causal relationships. This framework establishes a rigorous mathematical foundation, elevating causal inference from empirical to theoretical, and significantly reducing the risk of false causal discovery. The feature level and representation level stages constrain and complement each other. Feature-level constraints ensure the accuracy of basic causality, while representation-level inference captures complex multi-step causal chains. This multi-level design can capture richer causal structures compared to single-level methods.

Description

Technical Field

[0001] This application relates to the field of data processing technology, and in particular to a multi-level causal inference method and device for heterogeneous multi-omics data.

Background Technology

[0002] Multi-omics technology is a systematic research method that integrates data from multiple biological levels, such as genomics, transcriptomics, proteomics, and metabolomics, to comprehensively analyze the complexity of biological systems.

[0003] Graph Neural Network (GNN) is a deep learning model specifically designed for processing graph-structured data. It can capture the complex relationships between nodes, edges, and their global structure. Through a "message passing" mechanism, it iteratively aggregates neighbor information among nodes in the graph, thereby learning a low-dimensional representation (embedding) of each node. It is widely used in social network analysis, recommender systems, molecular structure prediction, traffic flow prediction, and other fields.

[0004] Currently, multi-omics technologies and graph neural networks have been widely used in the biomedical field, but existing technologies have the following drawbacks in multi-omics causal inference: 1. Existing multi-omics analyses tend to focus on association inferences rather than causal inferences. Existing methods use statistical techniques such as correlation and enrichment analysis to discover the association between omics characteristics and diseases, but these methods are essentially association-driven and cannot distinguish between direct causation, indirect causation, or spurious causation, resulting in insufficient biological reliability of the inferences.

[0005] 2. Causal inference methods lack a rigorous mathematical and information theory foundation. Traditional causal inference methods (such as Granger causality, PC algorithm, etc.) are mainly based on time series assumptions or linear relationships, and are poorly adapted to cross-sectional multi-omics data, especially high-dimensional genomic data containing complex nonlinear relationships.

[0006] 3. Cross-modal and multi-level causal constraint mechanisms are lacking. Most existing methods conduct causal analysis at a single omics level and lack a methodology for systematically establishing causal constraints across multiple omics modalities (gene-image-phenotype). As a result, they cannot capture the multi-level causal mechanisms from molecules to phenotypes in biological systems.

Summary of the Invention

[0007] This application proposes a multi-level causal inference method and device for heterogeneous multi-omics data, which can solve one of the problems existing in the background technology.

[0008] To achieve the above objectives, this application adopts the following technical solution. In a first aspect, a multi-level causal inference method for heterogeneous multi-omics data is provided, including: obtaining heterogeneous multi-omics data of subjects, among which causal relationships may exist; extracting multi-omics features from the heterogeneous multi-omics data and preprocessing these features; selecting feature-level causal candidate pairs from the multi-omics feature pairs based on the conditional mutual information of those pairs; constructing a causal constraint heterogeneous graph based on the feature-level causal candidate pairs, combined with clinical domain knowledge that characterizes the directional constraints of feature pairs, where each causal constraint heterogeneous graph represents one subject, nodes are omics features, and edges are causal candidate relationships between omics features; inputting the causal constraint heterogeneous graph into a trained heterogeneous graph neural network to obtain a representation-level feature vector with both predictive power and causal consistency; and verifying the causal stability of the representation-level feature vectors to obtain a set of stable causal relationships.

[0009] In one possible design of the first aspect, selecting feature-level causal candidate pairs from multi-omics feature pairs based on their conditional mutual information specifically includes: using conditional mutual information to calculate the information gain that one feature of a pair provides about the other after conditioning on a confounding factor; and selecting feature-level causal candidate pairs based on the information gain through significance testing and screening.

[0010] In one possible design of the first aspect, selecting feature-level causal candidate pairs based on information gain through significance testing and screening specifically includes: calculating the conditional mutual information $I(X;Y\mid R)=\sum_{x,y,r} p(x,y,r)\,\log\frac{p(x,y\mid r)}{p(x\mid r)\,p(y\mid r)}$, where $X$ and $Y$ are the features of a pair, $R$ is the confounding factor, $p(\cdot)$ is a probability distribution, $p(\cdot\mid\cdot)$ is a conditional probability distribution, and $p(x,y,r)$ is the joint distribution; dividing the multi-omics features into K subsets based on the values of the confounding factor $R$ and, within each subset, tabulating the cross-frequencies of $X$ and $Y$ according to the triplet $(X,Y,R)$; calculating the significance index p-value from the cross-frequency statistics; and selecting feature-level causal candidate pairs by comparing the p-value with the preset significance level $\alpha$.

[0011] In one possible design approach of the first aspect, the heterogeneous graph neural network is a heterogeneous graph Transformer model, a heterogeneous graph attention network, or a combination of a heterogeneous graph Transformer model and a heterogeneous graph attention network.

[0012] In one possible design of the first aspect, the heterogeneous graph neural network is trained by optimizing the downstream phenotype prediction error or classification error. At the same time, a constraint loss function is introduced into the training objective, which is expressed as a conditional mutual information loss (its formula is not reproduced in this text).

[0013] In one possible design of the first aspect, before causal stability verification is performed on the representation-level feature vectors, the newly identified causal factors are projected, by orthogonal projection, onto the orthogonal complement of the space spanned by the existing causal factors.

[0014] In one possible design of the first aspect, the multi-level causal inference method for multi-omics heterogeneous data further includes: Heterogeneous graph neural networks automatically learn cross-modal attention weights, which are derived from node-level, edge-level, or path-level aggregation processes.

[0015] In one possible design of the first aspect, performing causal stability verification on the representation-level feature vectors specifically includes: performing a resampling consistency test and fluctuation analysis on the representation-level feature vectors and the cross-modal attention weights to obtain a set of stable causal relationships and its stability score.

[0016] In one possible design approach of the first aspect, the multi-omics heterogeneous data includes at least two of the following: genotype single nucleotide polymorphism data, brain imaging data, clinical phenotype data, and environmental data; The omics features corresponding to genotype single nucleotide polymorphism (SNP) data are as follows: gene features are obtained from genotype SNP data using Shannon entropy representation. The omics features corresponding to brain imaging data are as follows: several morphological indicators are extracted from brain imaging data to form brain imaging features. The omics features corresponding to clinical phenotype data are as follows: statistical descriptive modeling is performed on clinical phenotype data, and its statistical features are extracted to form phenotypic features. The omics features corresponding to environmental data are as follows: statistical descriptive modeling is performed on environmental data, and its statistical features are extracted to form environmental features. Preprocessing includes: missing value imputation, outlier removal, quality control, standardization, and dimensional unification.

[0017] In a second aspect, an electronic device is provided, comprising: a processor, and a memory coupled to the processor, the memory for storing a computer program; the processor for executing the computer program stored in the memory such that the electronic device performs the multi-level causal inference method for multi-omics heterogeneous data as described in any possible implementation of the first aspect.

[0018] Beneficial effects: Based on the above technical solution, by establishing feature-level causal constraints based on conditional mutual information and designing a two-stage causal inference scheme of feature level and representation level, multi-level causal relationships are systematically inferred from heterogeneous multi-omics data. By utilizing conditional mutual information, the essence of multi-omics causal relationships is rigorously captured. This framework establishes a rigorous mathematical foundation, elevating causal inference from empirical to theoretical, and significantly reducing the risk of false causal discovery. The two stages of feature level and representation level constrain and complement each other. Feature level constraints ensure the accuracy of basic causality, while representation level inference captures complex multi-step causal chains. This multi-level design can capture richer causal structures compared to single-level methods.

Description of the Drawings

[0019] To illustrate the technical solutions in the embodiments of this application more clearly, the drawings used in the description of the embodiments or related technologies are briefly introduced below. The drawings described below illustrate only some embodiments of this application. Those skilled in the art can obtain other drawings based on these drawings without creative effort.

[0020] Figure 1 is a flowchart of the multi-level causal inference method for heterogeneous multi-omics data provided in Embodiment 1 of this application;
Figure 2 is a detailed flowchart of step S103 provided in Embodiment 1 of this application;
Figure 3 is a detailed flowchart of step S202 provided in Embodiment 1 of this application;
Figure 4 is a flowchart of the multi-level causal inference method in heterogeneous graph neural networks provided in Embodiment 2 of this application;
Figure 5 is the causal constraint heterogeneous graph provided in Embodiment 2 of this application;
Figure 6 is a structural diagram of the heterogeneous graph neural network provided in Embodiment 2 of this application;
Figure 7 is a structural diagram of the attention calculation module provided in Embodiment 2 of this application.

Detailed Implementation

[0021] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0022] It should be noted that although functional modules are divided in the device schematic diagram and the logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in a different order than the module division in the device or the order in the flowchart. The terms "first," "second," etc., in the specification and the above-mentioned figures are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.

[0023] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.

Embodiment 1

[0024] As shown in Figure 1, this embodiment provides a multi-level causal inference method for heterogeneous multi-omics data, including: Step S101: obtain heterogeneous multi-omics data of subjects, among which causal relationships may exist; Step S102: extract multi-omics features from the heterogeneous multi-omics data and preprocess the multi-omics features; Step S103: based on the conditional mutual information of multi-omics feature pairs, select feature-level causal candidate pairs from the multi-omics feature pairs; Step S104: based on the feature-level causal candidate pairs, and combined with clinical domain knowledge that characterizes the directional constraints of feature pairs, construct a causal constraint heterogeneous graph, where each causal constraint heterogeneous graph represents one subject, nodes are omics features, and edges are causal candidate relationships between omics features; Step S105: input the causal constraint heterogeneous graph into the trained heterogeneous graph neural network to obtain a representation-level feature vector that has both predictive ability and causal consistency; and Step S106: verify the causal stability of the representation-level feature vectors to obtain a set of stable causal relationships.

[0025] Specifically, multi-omics heterogeneous data includes at least two of the following: genotype single nucleotide polymorphism (SNP) data, brain imaging data, clinical phenotype data, and environmental data. The omics features corresponding to genotype single nucleotide polymorphism data are: gene features extracted from genotype single nucleotide polymorphism data using Shannon entropy representation; the omics features corresponding to brain imaging data are: morphological indicators such as cortical volume, cortical thickness, and curvature extracted from brain imaging data to form brain imaging features; the omics features corresponding to clinical phenotype data are: statistical descriptive modeling of clinical phenotype data, extracting statistical features such as mean, interquartile range, skewness, kurtosis, and Shannon entropy to form phenotypic features; the omics features corresponding to environmental data are: statistical descriptive modeling of environmental data, extracting statistical features such as mean, interquartile range, skewness, kurtosis, and Shannon entropy to form environmental features.

[0026] Preprocessing includes: missing value imputation, outlier removal, quality control, standardization, and dimensional unification.
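The preprocessing operations listed above can be sketched for a single numeric feature column as follows. This is a minimal illustration under stated assumptions, not the procedure of this application: the function name `preprocess_feature`, the median imputation rule, the z-score cutoff `z_max`, and the choice to replace outliers by the median instead of deleting samples are all hypothetical.

```python
import statistics


def preprocess_feature(values, z_max=3.0):
    """Impute, suppress outliers, and standardize one numeric feature column."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    # Missing value imputation: fill gaps with the median of the observed values.
    filled = [median if v is None else float(v) for v in values]
    mean, sd = statistics.fmean(filled), statistics.pstdev(filled)
    if sd > 0:
        # Outlier handling (assumption): values more than z_max standard
        # deviations from the mean are replaced by the median, which keeps
        # the number of samples unchanged.
        filled = [median if abs(v - mean) / sd > z_max else v for v in filled]
    mean, sd = statistics.fmean(filled), statistics.pstdev(filled)
    if sd == 0:
        return [0.0] * len(filled)
    # Standardization: zero mean and unit variance, so that features measured
    # in different units become comparable.
    return [(v - mean) / sd for v in filled]


# Toy column with one missing entry (None) and one extreme value (50).
raw = [1, 2, 3, 2, 1, 3, 2, 2, 3, 1, 2, 50, None]
clean = preprocess_feature(raw)
```

Replacing outliers instead of dropping them keeps every subject aligned across modalities, which matters when one graph is built per subject; a real pipeline may prefer modality-specific quality control.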

[0027] As shown in Figure 2, in one possible implementation, selecting feature-level causal candidate pairs from multi-omics feature pairs based on their conditional mutual information specifically includes: Step S201: using conditional mutual information I, calculate the information gain that one feature of a pair provides about the other after conditioning on the confounding factor R; and Step S202: select feature-level causal candidate pairs based on the information gain through significance testing and screening.

[0028] As shown in Figure 3, selecting feature-level causal candidate pairs based on information gain through significance testing and screening specifically includes: Step S301: calculate the conditional mutual information $I(X;Y\mid R)=\sum_{x,y,r} p(x,y,r)\,\log\frac{p(x,y\mid r)}{p(x\mid r)\,p(y\mid r)}$, where $X$ and $Y$ are the features of a pair, $R$ is the confounding factor, $p(\cdot)$ is a probability distribution, $p(\cdot\mid\cdot)$ is a conditional probability distribution, and $p(x,y,r)$ is the joint distribution; Step S302: based on the values of the confounding factor $R$, divide the multi-omics features into K subsets and, within each subset, tabulate the cross-frequencies of $X$ and $Y$ according to the triplet $(X,Y,R)$; Step S303: calculate the significance index p-value from the cross-frequency statistics; and Step S304: select feature-level causal candidate pairs by comparing the p-value with the preset significance level $\alpha$.

[0029] In one possible implementation, the heterogeneous graph neural network is a heterogeneous graph Transformer model, a heterogeneous graph attention network, or a combination of a heterogeneous graph Transformer model and a heterogeneous graph attention network.

[0030] Specifically, the heterogeneous graph neural network is trained by optimizing the downstream phenotype prediction error or classification error. At the same time, a constraint loss function is introduced into the training objective, which is expressed as a conditional mutual information loss (its formula is not reproduced in this text).

[0031] In one possible implementation, before causal stability verification is performed on the representation-level feature vectors, the newly identified causal factors are projected, by orthogonal projection, onto the orthogonal complement of the space spanned by the existing causal factors.
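The orthogonal projection described above can be illustrated with a small Gram-Schmidt sketch. It assumes that causal factors are plain real-valued vectors of equal length; the function names and the tolerance `tol` are hypothetical and not taken from this application.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def project_onto_orthogonal_complement(new_factor, existing_factors, tol=1e-12):
    """Remove from new_factor every component lying in span(existing_factors)."""
    basis = []
    for factor in existing_factors:
        v = list(factor)
        # Gram-Schmidt: orthogonalize each existing factor against the basis so far.
        for b in basis:
            coef = dot(v, b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        if norm > tol:  # skip factors that are linearly dependent on earlier ones
            basis.append([vi / norm for vi in v])
    residual = list(new_factor)
    for b in basis:
        coef = dot(residual, b)
        residual = [ri - coef * bi for ri, bi in zip(residual, b)]
    return residual


existing = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
new = [2.0, 3.0, 4.0]
residual = project_onto_orthogonal_complement(new, existing)
```

In this toy case the existing factors span the first two coordinates, so only the third coordinate of the new factor survives. The residual is the part of the new factor that is not geometrically redundant with what is already known.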

[0032] In one possible implementation, the multi-level causal inference method for heterogeneous multi-omics data further includes: Heterogeneous graph neural networks automatically learn cross-modal attention weights, which are derived from node-level, edge-level, or path-level aggregation processes to reflect the relative influence of a certain upstream omics feature on the downstream phenotypic representation.

[0033] In one possible implementation, causal stability verification is performed on the representation-level feature vectors, specifically including: Resampling consistency test and fluctuation analysis are performed on the representation-level feature vectors and cross-modal attention weights to obtain a stable causal relationship set and its stability score.
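One way to picture the resampling consistency test is a bootstrap that records how often each causal relationship is re-selected. The sketch below is an assumption-laden illustration: `select_edges` stands in for whatever procedure produces causal relationships from a resampled cohort, and the function names, the number of resamples, and the cutoff `min_score` are hypothetical. Fluctuation analysis of the attention weights is not shown.

```python
import random
from collections import Counter


def stability_scores(samples, select_edges, n_resamples=200, seed=0):
    """Fraction of bootstrap resamples in which each edge is selected."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_resamples):
        # Draw a resample of the same size, with replacement.
        resample = [rng.choice(samples) for _ in samples]
        counts.update(set(select_edges(resample)))
    return {edge: c / n_resamples for edge, c in counts.items()}


def stable_relationships(scores, min_score=0.9):
    """Keep only the edges whose stability score reaches min_score."""
    return {edge for edge, score in scores.items() if score >= min_score}


def toy_selector(resample):
    # Hypothetical selector: one edge is always found, the other only when
    # the resample happens to have a mean above 0.5.
    edges = [("gene", "brain_region")]
    if sum(resample) / len(resample) > 0.5:
        edges.append(("brain_region", "phenotype"))
    return edges


samples = [0, 1] * 10
scores = stability_scores(samples, toy_selector)
stable = stable_relationships(scores)
```

The edge that depends on sampling noise receives a low score and is filtered out, while the edge found in every resample is kept with a score of 1.0.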

[0034] In one possible implementation, the multi-level causal inference method for heterogeneous multi-omics data also includes: credibility assessment and pseudo-label feedback optimization, as well as causal result output and visualization.

[0035] Based on the above technical solution, by establishing feature-level causal constraints based on conditional mutual information and designing a two-stage causal inference scheme of feature level and representation level, multi-level causal relationships are systematically inferred from heterogeneous multi-omics data. By utilizing conditional mutual information, the essence of multi-omics causal relationships is rigorously captured. This framework establishes a rigorous mathematical foundation, elevating causal inference from empirical to theoretical, and significantly reducing the risk of false causal discovery. The two stages of feature level and representation level constrain and complement each other. Feature level constraints ensure the accuracy of basic causality, while representation level inference captures complex multi-step causal chains. This multi-level design can capture richer causal structures compared to single-level methods.

Embodiment 2

[0036] As shown in Figure 4, this embodiment proposes a multi-level causal inference method in heterogeneous graph neural networks (hereinafter referred to as the CauseHGN method). By establishing feature-level causal constraints based on conditional mutual information, designing a two-stage (feature-level and representation-level) causal inference scheme, and introducing a geometric redundancy elimination mechanism based on orthogonal projection, it systematically infers multi-level causal relationships from heterogeneous multi-omics data. It is applicable to scenarios such as brain disease classification, early screening, and mechanism explanation.

[0037] This embodiment provides a gene-brain region interaction analysis method based on heterogeneous graph neural networks, comprising the following steps:

Step 1: Multi-omics data acquisition and preprocessing

The input to step 1 is the raw multi-omics data, and the output is the multi-omics features of each subject. These features are abstract: each omics modality yields a feature that reflects some characteristics of that modality. The raw omics data are high-dimensional and complex, and their dimensions are not uniform. After standard preprocessing (first paragraph below) and the preprocessing proposed in this method (second paragraph below), all omics data are transformed into low-dimensional features with smaller dimensional differences. Each feature serves as the input of its omics modality to the neural network.

[0038] Multi-omics data were collected from the subjects, and a multi-omics heterogeneous dataset was constructed. This multi-omics data included at least genotype SNP data, brain imaging data, clinical phenotype data, and environmental data. Genotype SNP data was obtained through SNP detection combined with the Plink tool for quality control, site screening, and sample processing. Here, SNP data refers to single nucleotide polymorphisms, i.e., base mutation sequences at different sites discovered after gene testing. Plink provides a universally applicable standard procedure for processing this type of data. Brain imaging data was obtained through MRI scans combined with the FreeSurfer tool to extract brain structure-related indicators. Clinical phenotype data was obtained through psychological scales, cognitive assessments, and clinical evaluation tools, all in numerical form. Environmental data was used to characterize the subjects' external survival and exposure conditions, including social environment, natural environment, and other non-biological exposure factors. Examples include greening rate and UV index based on satellite remote sensing data, or family circumstances and childhood trauma based on questionnaires, all in numerical form.

[0039] In the preprocessing stage, missing value imputation, outlier removal, quality control, standardization, and necessary dimensional unification are performed on the various types of data, so that data from different sources and with different dimensions can enter a unified causal inference process. Furthermore, gene features are extracted from gene data using a Shannon entropy representation; seven morphological indicators, including cortical volume, cortical thickness, and curvature, are extracted from brain imaging data to form brain imaging features; and statistical descriptive modeling is performed on clinical phenotype data and environmental data respectively, extracting statistical features such as mean, interquartile range, skewness, kurtosis, and Shannon entropy to form phenotypic and environmental features. These processes transform the original heterogeneous data into structured, comparable, and graphable multi-omics feature representations, providing a unified input for subsequent causal screening and graph learning.
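The feature extraction described above can be sketched as follows. The encoding of genotypes as 0, 1, or 2, the number of histogram bins used for the entropy of continuous values, and the moment-based (non-excess) definitions of skewness and kurtosis are assumptions made for illustration; the application does not specify them.

```python
import math
import statistics
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy in bits of the empirical distribution of symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def statistical_features(values, bins=8):
    """Mean, interquartile range, skewness, kurtosis, and binned entropy."""
    n = len(values)
    mean = statistics.fmean(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    # Central moments of order 2, 3, and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    # Equal-width binning so that entropy is defined for continuous values.
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    binned = [min(int((v - lo) / width), bins - 1) for v in values]
    return {
        "mean": mean,
        "iqr": q3 - q1,
        "skewness": m3 / m2 ** 1.5 if m2 > 0 else 0.0,
        "kurtosis": m4 / m2 ** 2 if m2 > 0 else 0.0,
        "entropy": shannon_entropy(binned),
    }


# Hypothetical genotype calls at one site across six subjects.
gene_feature = shannon_entropy([0, 1, 2, 0, 1, 2])
# Hypothetical clinical scale scores.
phenotype_features = statistical_features([1.0, 2.0, 3.0, 4.0, 5.0])
```

Each modality is thereby reduced to a handful of numbers on comparable scales, which is what allows heterogeneous sources to share one graph.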

[0040] Step 2: Feature-level causal candidate pair screening

The input to step 2 is the per-omics features produced by step 1. A feature pair is formed by grouping these features two at a time. The processing, described in the following two paragraphs, amounts to further filtering and refinement of the features. For each pair, the two features are first mixed (by simple addition or concatenation), and a conventional method such as an MLP or reparameterized sampling is then used to obtain a confounding factor, which is a vector of the same size as the features. Conditional mutual information is then used to calculate the information gain that one feature of the pair provides about the other after conditioning on the confounding factor. This value represents the strength of the causal association between the two features. For example, if a gene causes pathological change in a certain brain region, the value will be high, although real causal relationships are complex and rarely reduce to such a simple one-to-one correspondence. Finally, through significance testing and screening, feature pairs with strong causal association are selected based on this value, which completes the feature-level causal candidate pair screening.

[0041] For any feature pair from different modalities or within the same modality, calculate the conditional mutual information under the control of a confounding factor: $I(X;Y\mid R)=\sum_{x,y,r} p(x,y,r)\,\log\frac{p(x,y\mid r)}{p(x\mid r)\,p(y\mid r)}$, where $X$ and $Y$ are the two features and $R$ is the confounding factor. By combining significance testing and threshold screening, feature pairs with strong information gain and statistical stability are retained as preliminary "causal candidate pairs".

[0042] Step 1: Calculate conditional mutual information (CMI). As a preliminary screening quantity, calculate $I(X;Y\mid R)$. CMI measures the degree of interdependence between the features $X$ and $Y$ given the confounding factor $R$.

[0043] Only feature pairs with a CMI significantly greater than 0 proceed to the next step of statistical testing.
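For discrete samples, the conditional mutual information used in this screening step can be estimated by plugging empirical frequencies into the standard definition. The sketch below assumes that both features and the confounding factor have already been discretized; the function name and the toy data are hypothetical.

```python
import math
from collections import Counter


def conditional_mutual_information(x, y, r):
    """Plug-in estimate of I(X;Y|R) in nats for discrete samples."""
    n = len(x)
    n_xyr = Counter(zip(x, y, r))
    n_xr = Counter(zip(x, r))
    n_yr = Counter(zip(y, r))
    n_r = Counter(r)
    cmi = 0.0
    for (xi, yi, ri), count in n_xyr.items():
        # p(x,y|r) / (p(x|r) p(y|r)) simplifies to n_xyr * n_r / (n_xr * n_yr).
        ratio = count * n_r[ri] / (n_xr[(xi, ri)] * n_yr[(yi, ri)])
        cmi += (count / n) * math.log(ratio)
    return cmi


def expand(cells):
    """Turn ((x, y, r), count) cells into a flat list of samples."""
    return [cell for cell, count in cells for _ in range(count)]


# Toy data: X and Y both follow R but are independent within each stratum.
rows = expand([((0, 0, 0), 9), ((0, 1, 0), 3), ((1, 0, 0), 3), ((1, 1, 0), 1),
               ((1, 1, 1), 9), ((1, 0, 1), 3), ((0, 1, 1), 3), ((0, 0, 1), 1)])
x, y, r = zip(*rows)
confounded = conditional_mutual_information(x, y, r)

# Toy data: Y copies X inside a single stratum (ln 2 nats, that is, one bit).
coupled = conditional_mutual_information([0, 1] * 8, [0, 1] * 8, [0] * 16)
```

In the first toy dataset X and Y are marginally correlated only because both track R, and the estimate is zero once R is conditioned on. In the second the dependence is genuine and survives conditioning. Plug-in estimates are biased upward on small samples, which is one reason a CMI value above zero is followed by a significance test.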

[0044] Step 2: Construct a three-dimensional contingency table. Count the frequencies of the triples $(X, Y, R)$ in the data, where $X$ and $Y$ are the features of a pair and $R$ is the confounding factor.

[0045] Layering: Based on the value of $R$, divide the dataset into K subsets.

[0046] Counting: Within each subset, count the cross-frequencies of $X$ and $Y$, which are used to calculate the total chi-square value; these comprise the observed frequencies $O_{ijk}$ and the expected frequencies $E_{ijk}$, where $i$ and $j$ index the values of $X$ and $Y$ and $k$ indexes the subset.

[0047] Step 3: Perform a significance test. Calculate the total chi-square value from the cross-frequencies: $\chi^2=\sum_{k=1}^{K}\sum_{i}\sum_{j}\frac{(O_{ijk}-E_{ijk})^2}{E_{ijk}}$. Calculate the p-value: based on the accumulated statistic and the total degrees of freedom, obtain the p-value from the chi-square distribution function, $p=P(\chi^2_{df}\ge\chi^2)$.

[0048] Here, $\chi^2_{df}$ denotes a theoretical chi-square variable with df degrees of freedom. The corresponding value can be obtained by calculating the degrees of freedom and then looking up the chi-square distribution table.

[0049] Step 4: Hard threshold screening. Set the significance level $\alpha$ (e.g., 0.05 or 0.01).

[0050] Retention condition: $p<\alpha$.

[0051] This means that after controlling for the confounding factor $R$, a statistically significant association remains between $X$ and $Y$, which rules out a spurious correlation induced by $R$.
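The layering, counting, testing, and thresholding steps can be sketched together as a stratified chi-square test. The details below are assumptions rather than the procedure of this application: expected frequencies are computed from the row and column totals within each stratum, the total degrees of freedom are the sum of (rows - 1) * (columns - 1) over strata, and the p-value uses the closed-form series for the chi-square upper tail instead of a lookup table. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def chi2_sf(stat, df):
    """Upper tail P(chi2_df >= stat), via the finite series for integer df."""
    if stat <= 0:
        return 1.0
    chi = math.sqrt(stat)
    density = math.exp(-stat / 2) / math.sqrt(2 * math.pi)
    if df % 2 == 1:
        term, series = chi, 0.0
        for k in range(1, (df - 1) // 2 + 1):
            if k > 1:
                term *= stat / (2 * k - 1)
            series += term
        return math.erfc(chi / math.sqrt(2)) + 2 * density * series
    term, series = 1.0, 1.0
    for k in range(1, (df - 2) // 2 + 1):
        term *= stat / (2 * k)
        series += term
    return math.sqrt(2 * math.pi) * density * series


def stratified_chi2_test(x, y, r):
    """Return (total chi-square, degrees of freedom, p-value) for X vs Y given R."""
    strata = defaultdict(list)
    for xi, yi, ri in zip(x, y, r):
        strata[ri].append((xi, yi))  # layering by the value of R
    stat, df = 0.0, 0
    for pairs in strata.values():
        n = len(pairs)
        observed = Counter(pairs)  # cross-frequencies within the stratum
        row = Counter(a for a, _ in pairs)
        col = Counter(b for _, b in pairs)
        for a, n_a in row.items():
            for b, n_b in col.items():
                expected = n_a * n_b / n
                stat += (observed[(a, b)] - expected) ** 2 / expected
        df += (len(row) - 1) * (len(col) - 1)
    p_value = chi2_sf(stat, df) if df > 0 else 1.0
    return stat, df, p_value


def keep_pair(x, y, r, alpha=0.05):
    """Hard threshold screening: retain the pair only if p < alpha."""
    return stratified_chi2_test(x, y, r)[2] < alpha


# Toy data: Y is identical to X in both strata of R.
x = [0] * 20 + [1] * 20 + [0] * 20 + [1] * 20
r = [0] * 40 + [1] * 40
stat, df, p_value = stratified_chi2_test(x, x, r)
retained = keep_pair(x, x, r)
```

In the toy data each stratum contributes a chi-square value of 40 with one degree of freedom, so the pair is retained at any conventional significance level. A pair that is independent within every stratum yields a statistic of zero and is discarded.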

[0052] The confounding factor can be calculated by MLP and identified and controlled through partial correlation analysis, covariate regression, partial correlation correction or similar methods to reduce interference from co-occurrence bias, sampling bias and spurious correlation.

[0053] Preferably, only feature pairs that meet the significance requirement and have conditional mutual information higher than a set threshold are retained, so as to filter out spurious and weak associations. This step aims to establish a candidate causal edge set at the original feature level, thereby improving the causal purity of subsequent heterogeneous graph construction and representation learning at the source.

[0054] By systematically applying conditional mutual information (CMI), an information-theoretic tool, to perform causal inference on multi-omics data, CMI can more rigorously capture the essence of causal relationships compared to traditional correlation coefficients or Granger causality, and supports explicit control over confounding factors. This framework establishes a rigorous mathematical foundation, elevating causal inference from empirical to theoretical, and significantly reducing the risk of false causal discoveries.

[0055] Step 3: Construct a causal constraint heterogeneous graph

Figure 5 shows a causal constraint heterogeneous graph. Nodes of different colors and shapes represent different omics.

[0056] The input to this step is the causal candidate pairs from step 2, which are abstract features. These can be understood as factors that, after step 2, have a high probability of influencing the disease, including genes, environment, etc. Step 2 did not incorporate common sense to constrain the directionality of the feature pairs; that is, relying solely on the output of step 2 might lead to erroneous conclusions such as imaging influencing genes. Therefore, step 3 incorporates domain knowledge to constrain the directionality, resulting in a reasonable heterogeneous graph.

[0057] Based on the causal candidate pairs obtained in step 2 and combined with clinical domain knowledge, a causal constraint heterogeneous graph is constructed. The nodes of the heterogeneous graph include at least gene nodes, brain region nodes, and phenotype nodes, and the edges represent the selected causal candidate relationships between omics. Preferably, an edge is retained only if it satisfies both statistical significance and domain knowledge consistency; that is, only connections that are both statistically supported and conform to the "gene → image → phenotype" direction or another known pathological chain direction are included in the final graph structure. Here, a feature pair is considered statistically supported once it passes the significance test described in step 2, and is considered consistent with the pathological chain direction once the clinical domain knowledge introduced in step 3 has been applied.

[0058] In one implementation, environmental nodes can also serve as upstream exposure nodes and be incorporated into the heterogeneous graph, forming a multi-level causal map together with the gene, brain region, and phenotype nodes. Exposure nodes refer to exposomics nodes; environmental factors can be regarded as exposomics. Some databases treat diseases as related to the environment; the environment therefore serves as an upstream node of the imaging (radiomics) nodes, meaning that the environment influences brain structure. Through this step, the system organizes the originally dispersed multi-omics features into a graph structure with clear directional and semantic type constraints, providing interpretable structural input for subsequent heterogeneous graph neural network learning.
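The edge-retention rule of step 3 can be sketched as a simple filter. This is a minimal illustration, assuming each candidate pair carries a p-value from step 2; the node names, the set of allowed type directions, and the significance level of 0.05 are hypothetical choices, not values stated in this application.

```python
# Allowed (upstream type, downstream type) pairs. Illustrative assumption based
# on the "gene -> image -> phenotype" chain plus environment as an upstream exposure.
ALLOWED_DIRECTIONS = {
    ("gene", "brain_region"),
    ("brain_region", "phenotype"),
    ("environment", "brain_region"),
}

def build_constrained_edges(candidates, node_type, alpha=0.05):
    """Keep candidate edges that are significant AND follow an allowed direction.

    candidates: iterable of (source, target, p_value) from the step 2 screening.
    node_type:  dict mapping node name -> omics type.
    """
    kept = []
    for source, target, p_value in candidates:
        significant = p_value < alpha
        allowed = (node_type[source], node_type[target]) in ALLOWED_DIRECTIONS
        if significant and allowed:
            kept.append((source, target))
    return kept

# Hypothetical nodes and candidate pairs for one subject.
node_type = {
    "snp_a": "gene",
    "hippocampal_volume": "brain_region",
    "memory_score": "phenotype",
    "air_pollution": "environment",
}
candidates = [
    ("snp_a", "hippocampal_volume", 0.001),
    ("hippocampal_volume", "snp_a", 0.001),        # significant, but wrong direction
    ("hippocampal_volume", "memory_score", 0.01),
    ("air_pollution", "hippocampal_volume", 0.03),
    ("snp_a", "memory_score", 0.40),               # not significant
]
edges = build_constrained_edges(candidates, node_type)
print(edges)
```

The reversed pair is rejected even though it is statistically significant, which is the case the directional constraint is meant to exclude.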

[0059] Step 4: Heterogeneous graph representation learning and causal constraint training. The input for step 4 is the heterogeneous graph obtained in the previous steps after causal filtering and enhancement. It contains the filtered features of each omics. Each graph represents one subject, and the features of each omics have been described in the previous steps.

[0060] The constructed causal constraint heterogeneous graph is input into a heterogeneous graph neural network. Heterogeneous graph Transformer (HGT), heterogeneous graph attention network (HAN), or a combination thereof are preferably used to perform deep representation learning on various types of nodes, obtaining latent representations of gene, brain region, phenotype, and environmental nodes. During training, while ensuring that the known causal constraints are not violated, the downstream phenotype prediction error or classification error is optimized, enabling the model to learn not only distinguishable representations but also structural relationships consistent with the pathological process.

[0061] To further enhance causal consistency, a constraint loss function can be introduced into the training objective, ensuring that the model continues to conform, during message passing, to the feature-level causal constraints obtained in step 2. This is done by incorporating a conditional mutual information loss function. The other terms of the final objective can be either a supervised cross-entropy loss or an unsupervised contrastive learning loss, depending on the chosen method; these are conventional techniques. The constraint uses conditional mutual information to measure the causal relationship between feature pairs and encourages the model to retain feature pairs with strong causal relationships, thus reducing the interference of confounding factors.
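One way such a combined objective could look is sketched below. The exact form of the conditional mutual information loss is not given in this passage, so the penalty used here (CMI score multiplied by one minus the learned edge weight) is purely an assumption for illustration, as are the function names and the weight `lam`.

```python
import math

def cross_entropy(probs, label):
    """Supervised term: negative log-likelihood of the true class."""
    return -math.log(probs[label])

def cmi_constraint_loss(edge_weights, cmi_scores):
    """Assumed penalty: pairs with high CMI should keep a high learned edge weight.

    edge_weights: dict edge -> learned weight in [0, 1] (e.g. attention).
    cmi_scores:   dict edge -> CMI of the feature pair from step 2.
    """
    return sum(cmi_scores[e] * (1.0 - edge_weights[e]) for e in cmi_scores)

def total_loss(probs, label, edge_weights, cmi_scores, lam=0.5):
    """Task loss plus weighted causal constraint term."""
    return cross_entropy(probs, label) + lam * cmi_constraint_loss(edge_weights, cmi_scores)

cmi_scores = {("gene", "brain_region"): 0.4, ("brain_region", "phenotype"): 0.2}
faithful = {("gene", "brain_region"): 0.9, ("brain_region", "phenotype"): 0.8}
unfaithful = {("gene", "brain_region"): 0.1, ("brain_region", "phenotype"): 0.8}

probs, label = [0.2, 0.8], 1
print(total_loss(probs, label, faithful, cmi_scores))
print(total_loss(probs, label, unfaithful, cmi_scores))
```

With identical predictions, the model that down-weights the strongly causal gene-to-brain edge receives the larger loss, so gradient descent would push it back toward the feature-level constraints.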

[0062] The technical effect of this step is that it transforms the "causal candidate edges" from static screening results into dynamic and learnable graph structure information, thereby obtaining a deep representation that has both predictive power and causal consistency.

[0063] In this scheme, the structure of the heterogeneous graph neural network is shown in Figure 6, and the attention computation module of the heterogeneous graph neural network is shown in Figure 7.

[0064] Taking a heterogeneous graph attention network as an example, the network is divided into two modules: an attention computation module and a message aggregation module. The attention computation module takes as input a heterogeneous graph containing nodes from different omics. For each source node-connection-target node combination in the heterogeneous graph, an attention space is allocated, comprising query (Q), key (K), and value (V) matrices, and the attention for each combination is then calculated using the standard attention computation (left half of Figure 7). The message aggregation module aggregates messages from the different omics combinations based on the calculated attention (right half of Figure 7).
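The two modules can be sketched in plain Python for a single target node. This is a simplified illustration, assuming scaled dot-product attention with one Q, K, V triple per relation type and a softmax over all incoming neighbors; the relation names and the identity projection matrices are hypothetical placeholders rather than learned parameters.

```python
import math

def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def typed_attention(target_vec, neighbors, params):
    """Attention computation and message aggregation for one target node.

    neighbors: list of (relation, source_vec); relation is a
               (source_type, edge_type, target_type) triple.
    params:    dict relation -> {"Q": matrix, "K": matrix, "V": matrix}.
    """
    scores, values = [], []
    for relation, source_vec in neighbors:
        p = params[relation]
        q = matvec(p["Q"], target_vec)      # query from the target node
        k = matvec(p["K"], source_vec)      # key from the source node
        scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)))
        values.append(matvec(p["V"], source_vec))
    weights = softmax(scores)               # attention over all incoming edges
    dim = len(values[0])
    message = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, message

identity = [[1.0, 0.0], [0.0, 1.0]]
gene_rel = ("gene", "regulates", "brain_region")
env_rel = ("environment", "exposes", "brain_region")
params = {rel: {"Q": identity, "K": identity, "V": identity} for rel in (gene_rel, env_rel)}

weights, message = typed_attention(
    [1.0, 0.0],
    [(gene_rel, [1.0, 0.0]), (env_rel, [0.0, 1.0])],
    params,
)
print(weights, message)
```

In a trained network each relation would have its own learned matrices, which is what lets the same target node weigh gene and environment inputs differently.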

[0065] Causal inference is divided into two stages: (1) feature-level causal constraints, which directly identify causal relationships between original features; and (2) representation-level causal influences, which infer indirect causal paths in the latent representation space. The two stages constrain and complement each other. Feature-level constraints ensure the accuracy of basic causality, while representation-level inference captures complex multi-step causal chains. This multi-level design can capture richer causal structures than single-level methods.

[0066] Step 5: Cross-modal attention weight quantification. Step 5 is an additional module. In the previous step, the heterogeneous graph representation was learned under causal constraints, which optimized the connections between the omics features so that they better reflect causal relationships. Step 5 extracts and saves these connections.

[0067] After heterogeneous graph representation learning is completed, cross-modal attention weights automatically learned within the model are extracted to quantify the causal contribution of different omics modalities to the target phenotype. These attention weights can originate from node-level, edge-level, or path-level aggregation processes, reflecting the relative influence of a particular upstream omics feature on the downstream phenotypic representation.

[0068] This step does not rely on manually assigned weights, but instead automatically learns the importance distribution among different modalities through model training, outputting a ranked modality contribution map. Its technical advantage lies in transforming abstract graphical representations into interpretable omics contribution indicators, facilitating the identification of key driving factors and critical pathological pathways.
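A ranked modality contribution can be derived from the extracted weights roughly as follows. The sketch assumes edge-level attention weights on edges that feed the target phenotype and aggregates them by simple summation and normalization; both the aggregation rule and the example weights are illustrative assumptions.

```python
from collections import defaultdict

def rank_modalities(attention_edges):
    """Rank modalities by their share of the total attention weight.

    attention_edges: iterable of (source_modality, weight) pairs.
    """
    totals = defaultdict(float)
    for modality, weight in attention_edges:
        totals[modality] += weight
    grand_total = sum(totals.values())
    shares = {m: t / grand_total for m, t in totals.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)

# Hypothetical learned attention weights.
attention_edges = [
    ("gene", 0.25),
    ("brain_region", 0.5),
    ("gene", 0.125),
    ("environment", 0.125),
]
ranking = rank_modalities(attention_edges)
for modality, share in ranking:
    print(modality, share)
```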

[0069] Step 6: Verify causal stability at the representation level. The inputs for step 6 are the representation-level feature vector of each subject output from the preceding module and the cross-omics attention matrix (or attention weight sequence). Resampling consistency tests and fluctuation analysis are performed on the input representation-level features and attention weights to obtain a stable causal relationship set and its stability score.

[0070] Bootstrap resampling (e.g., 100 to 1000 times) can be used to repeatedly draw sample subsets and attention edge subsets and to compute the frequency of occurrence, directional consistency, and weight fluctuation range of the candidate causal relationships; threshold perturbation analysis can be used to test the sensitivity of the results to changes in the attention threshold. The inputs here include: the representation-level feature matrix, denoted $H \in \mathbb{R}^{n \times d}$, where $\mathbb{R}$ is the real number field, $n$ is the number of participants, and $d$ is the dimension; a cross-omics attention weight tensor or matrix, denoted $A$, whose elements represent the attention intensity of cross-omics node pairs or paths; the candidate causal relationship set $C$ (obtained from the previous steps), which contains the direction and initial weight of each relationship; and the preset parameters: the number of resamples $B$ (e.g., 100 to 1000) and the attention threshold set $\mathcal{T}$.

[0071] Perform the following processing on the above inputs. Bootstrap sampling: perform $B$ rounds of sampling with replacement, generating a sample subset $S_b$ each time, and extract the attention edge subset on the corresponding sample subset (sampling can be done by stratified sampling or by uniform sampling based on edge weights).

[0072] Relationship reassessment: on each sampling result $S_b$, recalculate the direction and intensity of each candidate relation $c \in C$, obtaining the intensity $w_c^{(b)}$ with direction sign $s_c^{(b)}$.

[0073] Statistical summary: for each candidate relationship $c$, compute the frequency of occurrence $f_c$ (the fraction of the $B$ resamples in which $c$ is retained); the directional consistency $g_c$ (the fraction of resamples in which $s_c^{(b)}$ agrees with the direction recorded in $C$); and the weight fluctuation $\sigma_c$ (the spread of $w_c^{(b)}$ across resamples) or the corresponding confidence interval.

[0074] Threshold perturbation analysis: for each threshold $\tau$ in the attention threshold set, the relationships are screened again and the statistics are recomputed. Changes in the relationship sets and key indicators are observed under the different thresholds, and the sensitivity is quantified (for example, by set overlap rate, ranking changes, and weight shifts).

[0075] Stability determination: stable relationships are selected based on preset criteria, such as the frequency of occurrence and the directional consistency each reaching a preset lower bound and the weight fluctuation staying within a preset upper bound, while the change in the results under threshold perturbation does not exceed the tolerance range.

[0076] The output of this process is: the stable causal relationship set; a stability report for each stable relationship, including frequency of occurrence, directional consistency, average weight, weight fluctuation range, and threshold sensitivity index; and structured result tables (relationship pairs, directions, strength, stability scores) that can be used directly for subsequent visualization and sorting.
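The resampling procedure above can be sketched as follows. This is a minimal illustration under simplifying assumptions: each candidate relationship is summarized by one signed attention weight per subject, a relationship counts as "retained" in a resample when the absolute mean weight reaches a threshold, and the weight fluctuation is the standard deviation of the resampled means. These choices, the function name, and the toy data are hypothetical.

```python
import random
import statistics

def bootstrap_stability(weights_by_subject, n_resamples=200, threshold=0.1, seed=0):
    """Return dict relation -> (frequency, direction_consistency, weight_std).

    weights_by_subject: dict relation -> list of signed per-subject weights.
    """
    rng = random.Random(seed)
    n_subjects = len(next(iter(weights_by_subject.values())))
    means = {rel: [] for rel in weights_by_subject}
    for _ in range(n_resamples):
        # Sample subjects with replacement, then re-estimate every relation.
        sample = [rng.randrange(n_subjects) for _ in range(n_subjects)]
        for rel, w in weights_by_subject.items():
            means[rel].append(statistics.fmean([w[i] for i in sample]))
    report = {}
    for rel, w in weights_by_subject.items():
        reference_positive = statistics.fmean(w) > 0
        frequency = sum(abs(m) >= threshold for m in means[rel]) / n_resamples
        consistency = sum((m > 0) == reference_positive for m in means[rel]) / n_resamples
        report[rel] = (frequency, consistency, statistics.pstdev(means[rel]))
    return report

# Toy data for 10 subjects: one consistent relation, one that flips sign.
weights_by_subject = {
    "gene->brain_region": [0.5, 0.6, 0.4, 0.55, 0.45, 0.5, 0.6, 0.4, 0.5, 0.5],
    "environment->phenotype": [0.3, -0.3, 0.3, -0.3, 0.3, -0.3, 0.3, -0.3, 0.3, -0.3],
}
report = bootstrap_stability(weights_by_subject)
for rel, stats in report.items():
    print(rel, stats)
```

The consistent relation is retained with the same direction in every resample, while the sign-flipping relation is retained only occasionally and shows a larger fluctuation, so a stability criterion would keep the former and discard the latter.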

[0077] The technical effect of this step is that, using only representation-level information and cross-omics attention, random relationships are eliminated while statistically stable and reproducible core causal links are preserved.
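The threshold perturbation analysis of step 6 can be illustrated with a set overlap measure. The sketch assumes that a relationship is kept when its attention weight reaches the threshold and uses the Jaccard index between the sets kept at consecutive thresholds as the overlap rate; the weights and thresholds are hypothetical.

```python
def select(relations, threshold):
    """Relations whose attention weight reaches the threshold."""
    return {rel for rel, weight in relations.items() if weight >= threshold}

def threshold_sensitivity(relations, thresholds):
    """Jaccard overlap between the sets kept at consecutive thresholds."""
    kept = [select(relations, t) for t in sorted(thresholds)]
    overlaps = []
    for a, b in zip(kept, kept[1:]):
        union = a | b
        overlaps.append(len(a & b) / len(union) if union else 1.0)
    return overlaps

relations = {
    "gene->brain_region": 0.8,
    "brain_region->phenotype": 0.7,
    "environment->brain_region": 0.25,
}
overlaps = threshold_sensitivity(relations, [0.2, 0.3, 0.4])
print(overlaps)
```

An overlap close to 1 between neighboring thresholds indicates that the selected relationship set is insensitive to the exact threshold value in that range.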

[0078] In step 7, an orthogonal projection algorithm can be introduced to address the problem of redundant information among multiple causal factors. In multi-omics systems, many features are collinear (for example, multiple co-regulated genes), and simple filtering would discard useful information. By projecting each newly identified causal factor onto the orthogonal complement space of the existing causal factors, linear redundancy can be quantified and eliminated, yielding the "net causal effect", that is, the independent contribution of each causal factor to the target phenotype. This method is mathematically rigorous and biologically interpretable.
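The projection onto the orthogonal complement can be sketched with a Gram-Schmidt procedure in plain Python. The sketch assumes each causal factor is a vector of (ideally mean-centered) values across subjects; the function names and the small example vectors are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def net_effect(new_factor, existing_factors, tol=1e-12):
    """Residual of new_factor after removing its projection onto span(existing_factors)."""
    basis = []
    for factor in existing_factors:          # orthonormalize the existing factors
        r = list(factor)
        for b in basis:
            coef = dot(r, b)
            r = [x - coef * y for x, y in zip(r, b)]
        norm = dot(r, r) ** 0.5
        if norm > tol:                        # skip factors already in the span
            basis.append([x / norm for x in r])
    residual = [float(x) for x in new_factor]
    for b in basis:                           # project onto the orthogonal complement
        coef = dot(residual, b)
        residual = [x - coef * y for x, y in zip(residual, b)]
    return residual

def redundancy(new_factor, existing_factors):
    """Fraction of the new factor's squared norm explained by the existing factors."""
    r = net_effect(new_factor, existing_factors)
    return 1.0 - dot(r, r) / dot(new_factor, new_factor)

existing = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
collinear = [2.0, 3.0, 0.0, 0.0]      # linear combination of the existing factors
partly_new = [1.0, 0.0, 1.0, 0.0]     # half redundant, half independent

print(redundancy(collinear, existing))
print(net_effect(partly_new, existing), redundancy(partly_new, existing))
```

A fully collinear factor leaves essentially no residual, while a partly independent factor keeps only its non-redundant component, which corresponds to the net contribution described above.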

[0079] Step 8: Credibility assessment and pseudo-label feedback optimization. Step 8 takes the causal relationships determined in the previous step and all the evaluation metrics saved in the preceding steps as input, and outputs pseudo-label results of the causal relationship evaluation, which are used as feedback for optimizing the preceding steps.

[0080] All confirmed causal relationships are evaluated for credibility, taking into account factors such as mutual information strength, bootstrap stability, domain knowledge consistency, redundancy residual size, and path recurrence rate to form a unified credibility score. For causal relationships that meet the high credibility threshold, pseudo-labels are generated, such as "high confidence causal relationship"; for low credibility relationships, they are labeled as "suspicious relationship" or "relationship to be verified".
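A unified credibility score and the resulting pseudo-labels might be computed along the following lines. The passage lists the factors but not how they are combined, so the weighted sum, the weights, the two thresholds, and the assumption that every metric is normalized to [0, 1] with higher meaning more credible are all hypothetical.

```python
# Hypothetical weights for the factors named in the text (they sum to 1).
WEIGHTS = {
    "mutual_information_strength": 0.3,
    "bootstrap_stability": 0.3,
    "domain_knowledge_consistency": 0.2,
    "redundancy_residual": 0.1,
    "path_recurrence_rate": 0.1,
}

def credibility(metrics):
    """Weighted sum of metrics, each assumed to be normalized to [0, 1]."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)

def pseudo_label(score, high=0.8, low=0.5):
    """Map a credibility score to one of the labels used in the text."""
    if score >= high:
        return "high confidence causal relationship"
    if score >= low:
        return "relationship to be verified"
    return "suspicious relationship"

def uniform_metrics(value):
    return {name: value for name in WEIGHTS}

for value in (0.9, 0.6, 0.3):
    score = credibility(uniform_metrics(value))
    print(round(score, 3), pseudo_label(score))
```

Only relationships with the first label would be fed back as weak supervision; how the two lower tiers are separated is a design choice that the text leaves open.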

[0081] This embodiment does not impose constraints on the specific representation learning method. It proposes one feasible approach without being limited to it: pseudo-labels, generated automatically from the feature-driven confidence levels of the causal relationships, are fed back as weakly supervised signals to the representation learning stage, driving iterative optimization of the heterogeneous graph neural network to further strengthen high-confidence causal relationships and suppress low-confidence noise relationships. The technical effect of this step is that, in the absence of extensive manual annotation, semi-supervised self-reinforcing training is achieved using self-generated reliable pseudo-labels, improving the discriminative power and causal consistency of representation learning.

[0082] By employing a pseudo-label generation and feedback mechanism, the system automatically generates pseudo-labels for causal relationships that demonstrate high confidence in causal inference but lack explicit labels in labeled data. These pseudo-labels are then fed back to the representation learning stage as weak supervision signals. This enables the model to perform causal learning within a "semi-supervised" framework, expanding its applicability (especially for the biomedical field where dense labeling is difficult to obtain), while iterative optimization through pseudo-labels further improves the accuracy of causal inference.

[0083] Step 9: Output and visualization of causal results. The input for step 9 is the set of stable causal relationships output from step 8, the corresponding stability scores, and the cross-omics attention weights. The stable relationships in the input are sorted, structurally summarized, and graphically mapped to obtain the final causal inference results and a visual output.

[0084] The output includes: representation-level implicit causal relationships, causal strength ranking, key cross-omics transmission paths, and core nodes. The preferred visualization formats are causal network diagrams, Sankey diagrams, and heatmaps, used to illustrate the direction, strength, and cross-omics transmission structure of relationships.
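The structured output of step 9 can be prepared for plotting roughly as shown below. The sketch assumes each stable relationship is a record with a source, target, strength, and stability score, ranks by strength, and builds a source-by-target matrix of the kind a heatmap or network diagram would consume; the field names and values are hypothetical.

```python
def summarize(relations):
    """Return (ranked relations, node list, source-by-target strength matrix)."""
    ranked = sorted(relations, key=lambda r: r["strength"], reverse=True)
    nodes = sorted({r["source"] for r in relations} | {r["target"] for r in relations})
    index = {name: i for i, name in enumerate(nodes)}
    heatmap = [[0.0] * len(nodes) for _ in nodes]
    for r in relations:
        heatmap[index[r["source"]]][index[r["target"]]] = r["strength"]
    return ranked, nodes, heatmap

relations = [
    {"source": "gene", "target": "brain_region", "strength": 0.6, "stability": 0.95},
    {"source": "brain_region", "target": "phenotype", "strength": 0.8, "stability": 0.90},
    {"source": "environment", "target": "brain_region", "strength": 0.3, "stability": 0.85},
]
ranked, nodes, heatmap = summarize(relations)

for r in ranked:
    print(f'{r["source"]} -> {r["target"]}: strength={r["strength"]}, stability={r["stability"]}')
print(nodes)
for row in heatmap:
    print(row)
```

The same records can be passed as (source, target, value) triples to a Sankey or network plotting routine of the reader's choice.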

[0085] The technical effect of this step is that it transforms the representation-level causal discovery results into interpretable, auditable, and traceable structured conclusions, which facilitates mechanism analysis and engineering applications.

[0086] The scheme provides several visualization methods for causal paths (network topology diagrams, Sankey flow diagrams, heatmaps, etc.) that display not only direct causal relationships at the feature level but also implicit causal relationships and multi-hop indirect causal paths at the representation level. This multidimensional visualization significantly improves the interpretability of the causal inference results, enabling biologists and medical experts to understand the causal mechanisms of multi-omics systems more intuitively, and thereby to establish more reliable hypotheses and design more precise experimental verifications.

[0087] In summary, this embodiment combines information theory, graph neural networks, and multi-level causal inference to provide a rigorous, systematic, and interpretable causal inference methodology for the analysis of multi-omics biomedical data, which has significant basic research value and clinical application prospects.

[0088] This embodiment also proposes a network training and optimization method based on the above causal inference results, including the following steps: Step 1: Initialize the parameters of the heterogeneous graph neural network using Xavier initialization, and set hyperparameters such as the learning rate and regularization strength;

[0089] Step 2: For each batch of training data: (2.1) construct a causal constraint heterogeneous graph based on the existing causal constraints; (2.2) learn node representations through the heterogeneous graph neural network and output phenotype predictions; (2.3) calculate the supervision loss (cross-entropy or MSE) based on the known phenotype labels; (2.4) calculate the feature-level causal constraint error: for each causal constraint edge (u,v), calculate the corresponding CMI loss term, requiring the model to learn a representation that satisfies the constraints as far as possible; (2.5) calculate the total loss $L_{\text{total}} = L_{\text{task}} + \lambda L_{\text{CMI}}$. In step 2, $\lambda$ is an adjustable weight parameter, and $L_{\text{task}}$ can be either a supervised or an unsupervised loss, because the method does not limit the specific classification method. $L_{\text{CMI}}$ is the conditional mutual information constraint described earlier, which is the improvement made in this scheme: it encourages the model to retain the associations (edge weights) between strongly causal feature pairs while discarding or weakening the associations between unreasonable or weakly causal feature pairs.

[0090] Step 3: Calculate the gradient through backpropagation and update the network parameters using the Adam optimizer; Step 4: Evaluate on the validation set, and trigger early stopping when validation performance no longer improves; Step 5: Once the model converges, perform a final causal inference evaluation on the test set (including accuracy, AUC, causal correctness, etc.) and generate a final report on the causal path.
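Two pieces of the training procedure above, Xavier initialization (step 1) and early stopping (step 4), can be sketched independently of any deep learning framework. The sketch assumes the uniform variant of Xavier initialization and a patience-based stopping rule; the text specifies neither, so both are illustrative choices, as are the function names and the validation losses.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Weights drawn from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)] for _ in range(fan_out)]

def early_stopping_epoch(val_losses, patience=3):
    """Index of the epoch at which training stops for a given validation curve."""
    best, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch                    # no improvement for `patience` epochs
    return len(val_losses) - 1

rng = random.Random(0)
weights = xavier_uniform(fan_in=4, fan_out=3, rng=rng)
limit = math.sqrt(6.0 / 7)

# Hypothetical validation losses: the best value before stopping is at epoch 2.
val_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5]
stop_epoch = early_stopping_epoch(val_losses)
print(len(weights), len(weights[0]), stop_epoch)
```

With a patience of 3, training stops at epoch 5 and never reaches the later improvement at epoch 6, which illustrates the trade-off involved in choosing the patience value.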

[0091] This application also provides an electronic device, including: a processor, and a memory coupled to the processor, the memory being used to store a computer program; the processor being used to execute the computer program stored in the memory, so that the electronic device performs the method as described in any of the above embodiments.

[0092] Electronic devices can be computing devices such as desktop computers, laptops, handheld computers, and cloud servers. These electronic devices may include, but are not limited to, processors and memory.

[0093] The processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor. The processor is the control center of the electronic device, connecting various parts of the device via various interfaces and lines.

[0094] The memory can be used to store the computer program, and the processor implements various functions of the electronic device by running or executing the computer program stored in the memory and calling the data stored in the memory.

[0095] The memory may primarily include a program storage area and a data storage area. The program storage area may store the operating system, applications required for at least one function, etc.; the data storage area may store data created during use of the electronic device, etc. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, internal memory, plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, at least one disk storage device, flash memory device, or other non-volatile solid-state storage device.

[0096] This application also provides a storage medium, which is a computer-readable storage medium. The computer program is stored in the computer-readable storage medium, and when executed by a processor, the computer program can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable file, or some intermediate form. The computer-readable medium can include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a portable hard drive, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, etc.

[0097] This application also provides a computer program product, including: a computer program or instructions that, when the computer program or instructions are run on a computer, cause the computer to perform any of the above possible implementation methods.

[0098] The above description is the preferred embodiment of this application. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principle of this application, and these improvements and modifications are also considered to be within the scope of protection of this application.

Claims

1. A multi-level causal inference method for multi-omics heterogeneous data, characterized in that, the method includes: obtaining heterogeneous multi-omics data of subjects among which causal relationships may exist; extracting multi-omics features from the heterogeneous multi-omics data and preprocessing these features; selecting feature-level causal candidate pairs from multi-omics feature pairs based on the conditional mutual information of the multi-omics feature pairs; constructing a causal constraint heterogeneous graph based on the feature-level causal candidate pairs, combined with clinical domain knowledge used to characterize the directional constraints of feature pairs, wherein each causal constraint heterogeneous graph represents one subject, nodes are omics features, and edges are causal candidate relationships between omics features; inputting the causal constraint heterogeneous graph into a trained heterogeneous graph neural network to obtain a representation-level feature vector with both predictive power and causal consistency; and performing causal stability verification on the representation-level feature vector to obtain a set of stable causal relationships.

2. The multi-level causal inference method for multi-omics heterogeneous data as described in claim 1, characterized in that, selecting feature-level causal candidate pairs from multi-omics feature pairs based on the conditional mutual information of the multi-omics feature pairs specifically includes: using conditional mutual information to calculate the information gain of one feature over the other in a feature pair after conditioning on a confounding factor; and selecting feature-level causal candidate pairs based on the information gain through significance testing and screening.

3. The multi-level causal inference method for heterogeneous multi-omics data as described in claim 2, characterized in that, selecting feature-level causal candidate pairs based on the information gain through significance testing and screening specifically includes: calculating the conditional mutual information $I(X;Y\mid Z)=\sum_{x,y,z} p(x,y,z)\log\frac{p(x,y\mid z)}{p(x\mid z)\,p(y\mid z)}$, where $X$ and $Y$ are the feature pair, $Z$ is the confounding factor, $p(\cdot)$ is a probability distribution, $p(\cdot\mid\cdot)$ is a conditional probability distribution, and $p(x,y,z)$ is the joint distribution; dividing the multi-omics features into K subsets according to the values of the confounding factor $Z$, and, within each subset, performing cross-frequency statistics of $X$ and $Y$ according to the triplet $(X, Y, Z)$; calculating the significance index p-value from the cross-frequency statistics; and selecting feature-level causal candidate pairs based on the result of comparing the significance index p-value with the preset significance level $\alpha$.

4. The multi-level causal inference method for heterogeneous multi-omics data as described in claim 3, characterized in that, Heterogeneous graph neural networks are heterogeneous graph Transformer models, heterogeneous graph attention networks, or a combination of heterogeneous graph Transformer models and heterogeneous graph attention networks.

5. The multi-level causal inference method for heterogeneous multi-omics data as described in claim 4, characterized in that, the heterogeneous graph neural network is trained by optimizing the downstream phenotype prediction error or classification error, and at the same time a constraint loss function is introduced into the training objective, the constraint loss function being a conditional mutual information loss.

6. The multi-level causal inference method for heterogeneous multi-omics data as described in claim 1, characterized in that, before causal stability verification is performed on the representation-level feature vector, newly identified causal factors are projected, by orthogonal projection, onto the orthogonal complement space of the existing causal factors.

7. The multi-level causal inference method for heterogeneous multi-omics data as described in claim 1, characterized in that, The multi-level causal inference method for heterogeneous multi-omics data also includes: Heterogeneous graph neural networks automatically learn cross-modal attention weights, which are derived from node-level, edge-level, or path-level aggregation processes.

8. The multi-level causal inference method for heterogeneous multi-omics data as described in claim 7, characterized in that, the causal stability verification of the representation-level feature vector includes: performing a resampling consistency test and fluctuation analysis on the representation-level feature vectors and the cross-modal attention weights to obtain a stable causal relationship set and its stability score.

9. The multi-level causal inference method for heterogeneous multi-omics data as described in claim 1, characterized in that, the multi-omics heterogeneous data include at least two of the following: genotype single nucleotide polymorphism (SNP) data, brain imaging data, clinical phenotype data, and environmental data; the omics features corresponding to the genotype SNP data are gene features extracted from the genotype SNP data using a Shannon entropy representation; the omics features corresponding to the brain imaging data are brain imaging features formed from several morphological indicators extracted from the brain imaging data; the omics features corresponding to the clinical phenotype data are phenotype features formed by performing statistical descriptive modeling on the clinical phenotype data and extracting its statistical features; the omics features corresponding to the environmental data are environmental features formed by performing statistical descriptive modeling on the environmental data and extracting its statistical features; and the preprocessing includes: missing value imputation, outlier removal, quality control, standardization, and dimensional unification.

10. An electronic device, characterized in that, The electronic device includes: a processor, and a memory coupled to the processor. The memory is used to store computer programs; The processor is configured to execute the computer program stored in the memory, such that the electronic device performs the multi-level causal inference method for heterogeneous multi-omics data as described in any one of claims 1-9.