A disease early detection method based on causal inference and quantum machine learning

By combining linear double machine learning (LinearDML) causal inference with quantum machine learning, the method screens out core features of diabetes and captures subtle correlation patterns between them. This addresses the lack of causal explanation in early diabetes detection, improves detection accuracy and interpretability, and is applicable to early screening and personalized risk warning for diseases such as diabetes.

CN122245722A · Pending · Publication Date: 2026-06-19 · NANTONG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NANTONG UNIV
Filing Date
2026-03-31
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies cannot explain the causal relationship of diabetes in early detection, resulting in insufficient generalization ability of models in risk prediction in asymptomatic populations, which easily leads to missed diagnoses and misdiagnoses. Furthermore, existing quantum machine learning models cannot simultaneously achieve interpretable identification of pathogenic factors and high-precision detection.

Method used

A linear double machine learning (LinearDML) causal inference model is used to screen out core features related to disease onset. Classical and quantum machine learning models are then combined, so that causal inference and quantum computing together capture the correlation patterns between features, and an early disease detection model is constructed.

Benefits of technology

It significantly improves the accuracy, efficiency, and clinical interpretability of early detection of diseases such as diabetes, enhances the interpretability of the model and its ability to predict early risks, and is suitable for large-scale screening and personalized risk warning. It also has good generalization ability and scalability.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a method for early disease detection based on causal inference and quantum machine learning. The method involves collecting clinical feature data of several diseases and preprocessing them to obtain standardized datasets for each disease. A linear double machine learning (LinearDML) causal inference model is used to analyze the causal contribution of features in the standardized datasets, selecting feature combinations with a positive causal relationship to disease onset to construct a core feature set. Classical machine learning models and corresponding quantum machine learning models are constructed to form a candidate model library for early disease detection. The standardized datasets are used to train and validate the models in the candidate model library, selecting the optimal early disease detection model for the chosen disease. The clinical feature data of the target to be detected is processed to obtain the core feature set of the target. This core feature set is then input into the selected optimal early disease detection model for detection, outputting the disease status detection result of the target.

Description

Technical Field

[0001] This invention relates to the interdisciplinary field of artificial intelligence and quantum computing, and in particular to a method for early disease detection based on causal inference and quantum machine learning.

Background Technology

[0002] Diabetes presents with identifiable symptoms in its early stages. Effective early detection and risk warning can help patients adjust their lifestyles in a timely manner, delaying or even halting disease progression. Currently, diabetes prediction and detection technologies are mainly based on classical machine learning and deep learning algorithms. Feature selection methods used in related studies include Principal Component Analysis (PCA), Multilayer Perceptron (MLP), F-score, and Boruta feature selection. These methods can only uncover the correlation between features and the onset of diabetes, but cannot explain the causal relationship between features and the disease. Therefore, they fail to answer the core question of why diabetes occurs, resulting in insufficient generalization ability of models in risk prediction for early-stage asymptomatic individuals, easily leading to missed diagnoses and misdiagnoses. Quantum machine learning, as an intersection of quantum computing and machine learning, can utilize quantum superposition and entanglement properties to uncover subtle patterns of correlation between features that classical machine learning cannot capture, demonstrating significant performance advantages in complex data classification tasks. A few studies have applied quantum machine learning to diabetes classification tasks, but none of them have deeply integrated causal inference feature selection methods with quantum machine learning models. This makes it impossible to simultaneously achieve interpretable identification of pathogenic factors and high-precision early disease detection, which is insufficient to meet the clinical needs of early screening and personalized intervention for diseases such as diabetes. This results in existing models having problems such as poor interpretability and insufficient early risk prediction capabilities.

Summary of the Invention

[0003] Purpose of the invention: In order to overcome the shortcomings of the existing technology, the present invention provides a method for early disease detection based on causal inference and quantum machine learning. Causal inference enables interpretable core feature screening, and quantum machine learning captures subtle correlation patterns between features, which significantly improves the accuracy, efficiency and clinical interpretability of early disease detection.

[0004] Technical Solution: To achieve the above objectives, the present invention provides an early disease detection method based on causal inference and quantum machine learning, comprising the following steps:

[0005] Step 1: Collect clinical characteristic data related to early risk prediction of several diseases as raw data for several diseases, and preprocess the raw data of each disease to obtain standardized datasets for several diseases.

[0006] Step 2: Use the LinearDML causal inference model to perform feature causal contribution analysis on the standardized dataset of any disease, calculate the average causal contribution of each feature to the onset of the disease; screen out features that have a positive causal relationship with the onset of the disease, and combine the features that have a positive causal relationship with the onset of the disease to construct the core feature set.

[0007] Step 3: Using the core feature set as model input and the disease status as the classification target, construct classical machine learning models and corresponding quantum machine learning models to form a candidate model library for early disease detection.

[0008] Step 4: Divide the standardized dataset corresponding to the core feature set of any disease into a training set and a test set according to a preset ratio. Input the training set and the test set into the candidate model library for early disease detection. Use cross-validation to train and validate the performance of each model in the candidate model library for early disease detection. Select the model combination with the best performance after training as the optimal early disease detection model for the disease.

[0009] Step 5: Obtain the clinical feature data of the target to be detected, and process the clinical feature data of the target to be detected in Step 1 and Step 2 in sequence to obtain the standardized dataset and core feature set of the target to be detected; select the corresponding optimal early disease detection model based on the standardized dataset of the target to be detected, input the core feature set of the target to be detected into the selected optimal early disease detection model for detection, and output the disease status detection result and early onset risk assessment result of the target to be detected.

[0010] Furthermore, in step 1, the raw data for each disease is preprocessed to obtain standardized datasets for several diseases; for any given disease, the raw data is cleaned and encoded to obtain a standardized dataset for that disease; the raw data for any given disease includes binary classification feature data and target variable data, wherein the data encoding encodes Yes as 1 and No as 0 in the binary classification feature data, encodes Positive as 1 and Negative as 0 in the target variable data, and retains the original values for the age feature in the raw data of that disease.

[0011] Furthermore, in step 2, the LinearDML causal inference model performs feature causal contribution analysis on a standardized dataset of any disease, calculating the average causal contribution of each feature to the disease's onset; this includes the following steps:

[0012] Step 11: Take any feature from the standardized dataset as the latent treatment variable $T_i$, and use the remaining features as confounding variables $W_i$.

[0013] Step 12: Based on the latent treatment variable $T_i$ and the confounding variables $W_i$, use the random forest algorithm to construct a latent treatment variable model and an outcome variable model, respectively, and obtain the latent treatment variable residuals and the outcome variable residuals.

[0014] Step 13: Based on the latent treatment variable residuals and the outcome variable residuals, calculate the individual causal effect (ITE) of this feature on the disease outcome through orthogonalization, and average over the full sample to obtain the average causal contribution (ATE) of this feature.

[0015] Step 14: Calculate the average causal contribution of each feature in the standardized dataset through steps 11, 12 and 13 to obtain the average causal contribution of several features in the standardized dataset.

[0016] Step 15: Filter the average causal contribution of several features in the standardized dataset, remove features with no causal contribution or negative average causal contribution, and obtain features that have a positive causal association with the onset of the disease.

[0017] Furthermore, in step 3, the classic machine learning models include the classic Support Vector Machine (CSVM) model and the classic Random Forest (CRF) model. The CSVM model takes the core feature set as input and aims at binary classification of disease status. It standardizes the input core feature set, applies a linear kernel function, and then maximizes the classification margin to construct the optimal hyperplane, outputting the positive prediction probability of the disease and the binary classification result. The expression for the optimal hyperplane is as follows:

[0018] $$w^{\top}x + b = 0$$

[0019] In the formula, w is the weight vector, x is the input feature vector, and b is the bias term.

[0020] The classic random forest model (CRF) takes a core feature set as input and aims to classify disease status into two categories. It constructs multiple independent CART decision trees, generates an independent training sample set for each CART decision tree based on a bootstrap sampling method, randomly selects some features as candidates during node splitting, and finally obtains the predicted classification result through a majority voting mechanism across all CART decision trees. The expression for the predicted classification result of the classic random forest model (CRF) is shown below:

[0021] $$\hat{y} = \mathrm{mode}\{h_1(x), h_2(x), \dots, h_N(x)\}$$

[0022] In the formula, $\hat{y}$ is the final classification prediction result of the classic random forest (CRF) model; $h_N(x)$ is the classification result predicted by the N-th decision tree, where N is the total number of decision trees; and mode(·) is the mode operator.

[0023] Furthermore, the quantum-based machine learning models include the Quantum Support Vector Machine (QSVM) and the Quantum Random Forest (QRF) model. The QSVM model is built on the Qiskit quantum computing framework, employs ZZFeatureMap to construct a parameterized quantum feature mapping operator, achieves quantum entanglement between features through Pauli-Z gates and CNOT gates, uses a state vector simulator to accurately calculate the quantum kernel matrix, and uses the pre-calculated quantum kernel matrix as the input kernel of the classical SVM. With the objective of binary classification of disease status, it employs the maximum margin optimization objective consistent with the classical support vector machine (CSVM) to complete model training, outputting the positive prediction probability of the disease and the binary classification result. The quantum kernel function expression of the QSVM model is as follows:

[0024] $$K(x_i, x_j) = \left|\langle \phi(x_i) \mid \phi(x_j) \rangle\right|^{2}$$

[0025] In the formula, $x_i$ and $x_j$ are the classical feature vectors of any two input samples, $\phi(\cdot)$ is the quantum feature mapping operator, and $\langle \cdot \mid \cdot \rangle$ denotes the inner product of quantum states.
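The kernel computation described above can be illustrated with a small statevector calculation. The sketch below is a simplified, dependency-free stand-in and is not Qiskit itself: it assumes a single repetition of a ZZ-style feature map (a Hadamard layer, a phase of 2·x_k on each qubit, and a phase of 2·(π − x_j)(π − x_k) on the parity of each qubit pair) and evaluates the fidelity between two encoded states. The function names `feature_state`, `quantum_kernel`, and `kernel_matrix` are hypothetical.

```python
import cmath
import itertools
import math


def feature_state(x):
    """Statevector of a ZZ-style feature map with one repetition (assumed form).

    After a Hadamard layer every basis state has amplitude 2^(-n/2); the
    data encoding then multiplies each basis state by a data-dependent phase.
    """
    n = len(x)
    norm = 2 ** (-n / 2)
    state = []
    for bits in itertools.product((0, 1), repeat=n):
        # Single-feature phases: a phase gate on each qubit.
        phase = sum(2.0 * x[k] * bits[k] for k in range(n))
        # Pairwise phases: CNOT, phase gate, CNOT acts on the parity of a pair.
        for j in range(n):
            for k in range(j + 1, n):
                parity = bits[j] ^ bits[k]
                phase += 2.0 * (math.pi - x[j]) * (math.pi - x[k]) * parity
        state.append(norm * cmath.exp(1j * phase))
    return state


def quantum_kernel(xi, xj):
    """Fidelity kernel |<phi(xi)|phi(xj)>|^2 between two encoded samples."""
    a, b = feature_state(xi), feature_state(xj)
    overlap = sum(p.conjugate() * q for p, q in zip(a, b))
    return abs(overlap) ** 2


def kernel_matrix(samples):
    """Pre-computed kernel matrix over a list of feature vectors."""
    return [[quantum_kernel(a, b) for b in samples] for a in samples]
```

A matrix built this way is symmetric with ones on the diagonal, so it could in principle be handed to a classical SVM that accepts a pre-computed kernel, which is the hand-off the paragraph describes.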

[0026] Furthermore, in step 4, the test set is input into the candidate model library for early disease detection, and the performance of each model in the candidate model library for early disease detection is verified by cross-validation; the performance index of each model is calculated, and the best-performing classical machine learning model is selected by comparing the performance index of each model in the classical machine learning model library after training; the best-performing quantum machine learning model is selected by comparing the performance index of each model in the quantum machine learning model library after training; the model constructed by the best-performing classical machine learning model and the best-performing quantum machine learning model is combined as the optimal early disease detection model for the disease.
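As a rough illustration of the cross-validation step, the sketch below generates train and validation index splits. It assumes contiguous, unshuffled folds and a default of k = 5, matching the five-part cross-validation named in the detailed implementation; a real pipeline would likely shuffle or stratify first. The function names are hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for fold in range(k):
        # The first `extra` folds absorb one leftover sample each.
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validation_splits(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    folds = k_fold_indices(n_samples, k)
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation
```

Each candidate model would be fitted on `train` and scored on `validation` once per fold, and the per-fold metrics averaged before models are compared.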

[0027] Furthermore, the performance metrics include accuracy (Acc), precision (P), recall (R), and the F1-score (F1). In the classical class machine learning models after training, the weighted composite performance (CPS) of each model is calculated based on accuracy (Acc), recall (R), and the F1-score. The weighted composite performance (CPS) of each model is compared, and the model with the highest weighted composite performance (CPS) value is selected as the best-performing classical class machine learning model. In the quantum class machine learning models after training, the weighted composite performance (CPS) of each model is calculated based on accuracy (Acc), recall (R), and the F1-score. The weighted composite performance (CPS) of each model is compared, and the model with the highest weighted composite performance (CPS) value is selected as the best-performing quantum class machine learning model. The calculation of the weighted composite performance (CPS) value is as follows:

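[0028] $$\mathrm{CPS} = \omega_1 \cdot \mathrm{Acc} + \omega_2 \cdot R + \omega_3 \cdot F1$$

[0029] In the formula, Acc is the accuracy, R is the recall, F1 is the F1-score, and $\omega_1$, $\omega_2$ and $\omega_3$ are the weight coefficients of the respective performance metrics, with $\omega_1 + \omega_2 + \omega_3 = 1$.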

[0030] Furthermore, the calculation process for the accuracy (Acc), precision (P), recall (R), and F1-score (F1) is as follows:

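[0031] $$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}, \quad P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \cdot P \cdot R}{P + R}$$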

[0032] In the formula, TP represents the number of true positive samples, TN represents the number of true negative samples, FP represents the number of false positive samples, and FN represents the number of false negative samples.
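The metric definitions and the weighted composite performance (CPS) can be sketched as follows. The weights (0.4, 0.3, 0.3) and the confusion-matrix counts used in the usage note are made-up illustrative inputs, since the source gives no values; only the constraint that the weights sum to 1 comes from the text. The function names are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"Acc": acc, "P": precision, "R": recall, "F1": f1}


def composite_score(metrics, weights=(0.4, 0.3, 0.3)):
    """CPS = w1*Acc + w2*R + w3*F1; the default weights are an assumption."""
    w1, w2, w3 = weights
    if abs(w1 + w2 + w3 - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w1 * metrics["Acc"] + w2 * metrics["R"] + w3 * metrics["F1"]


def best_model(confusions, weights=(0.4, 0.3, 0.3)):
    """Return the model name with the highest CPS, plus all CPS values."""
    scores = {
        name: composite_score(classification_metrics(*counts), weights)
        for name, counts in confusions.items()
    }
    return max(scores, key=scores.get), scores
```

For example, `best_model({"CSVM": (40, 45, 5, 10), "CRF": (44, 46, 4, 6)})` compares two classical models by CPS, with each tuple ordered as (TP, TN, FP, FN); the same call would be repeated for the quantum group.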

[0033] Beneficial Effects: This invention provides an early disease detection method based on causal inference and quantum machine learning. Through a linear double machine learning (LinearDML) causal inference model, it not only filters out core features related to the onset of diseases such as diabetes but also clarifies the causal contribution of each feature to the disease. This overcomes the limitation of traditional machine learning models, which can only uncover correlations but cannot explain the pathogenesis, significantly improving the interpretability and clinical guidance value of early detection models for diseases such as diabetes. By constructing classical and corresponding quantum machine learning models, and utilizing the high-dimensional feature mapping and parallel computing capabilities of quantum computing, it captures subtle correlation patterns between features that classical models cannot identify. Causal inference enables interpretable screening of core features related to disease onset, while quantum machine learning captures subtle correlation patterns between features, significantly improving the accuracy, efficiency, and clinical interpretability of early disease detection, and enhancing the model's interpretability and early risk prediction capabilities. It is suitable for large-scale early screening and personalized risk warning scenarios for diseases such as diabetes; it possesses good generalization and scalability, adaptable to diabetes and other disease datasets from different regions and populations, and can be extended to early detection and risk warning scenarios for other chronic diseases, demonstrating extremely high clinical application value and promising prospects for widespread adoption.

Attached Figure Description

[0034] Figure 1 is a flowchart of an early disease detection method based on causal inference and quantum machine learning.

Detailed Implementation

[0035] The invention will now be further described with reference to the accompanying drawings.

[0036] As shown in Figure 1, an early disease detection method based on causal inference and quantum machine learning includes the following steps:

[0037] Step 1: Collect clinical feature data related to early risk prediction of several diseases as raw data for several diseases, and preprocess the raw data of each disease to obtain standardized datasets for several diseases.

[0038] Step 2: Using the LinearDML causal inference model, a standardized dataset for any disease is analyzed for feature causal contribution, and the average causal contribution of each feature to the onset of the disease is calculated. Features with a positive causal association with the onset of the disease are selected, and features with a positive causal association with the onset of the disease are combined to construct a core feature set. The core feature set consists of clinical symptoms and signs that can be identified in the early stages of the disease. Data collection can be completed without invasive testing, making it suitable for large-scale screening scenarios for diseases such as diabetes. At the same time, the model can output the early onset risk of diseases such as diabetes in healthy individuals, providing data support for personalized health intervention and precision medicine.

[0039] Step 3: Using the core feature set as model input and the disease status as the classification target, construct classical machine learning models and corresponding quantum machine learning models to form a candidate model library for early disease detection.

[0040] Step 4: Divide the standardized dataset corresponding to the core feature set of any disease into a training set and a test set according to a preset ratio. Input the training set and the test set into the candidate model library for early disease detection. Use cross-validation to train and validate the performance of each model in the candidate model library for early disease detection. Select the model combination with the best performance after training as the optimal early disease detection model for the disease. The cross-validation method is 5-fold cross-validation.

[0041] Step 5: Obtain the clinical feature data of the target to be detected, and process the clinical feature data of the target to be detected in Step 1 and Step 2 in sequence to obtain the standardized dataset and core feature set of the target to be detected; select the corresponding optimal early disease detection model according to the standardized dataset of the target to be detected, input the core feature set of the target to be detected into the selected optimal early disease detection model for detection, and output the disease status detection result and early disease risk assessment result of the target to be detected; the disease status detection result of the target to be detected is a binary classification result, with disease being positive and no disease being negative; the early disease risk assessment result is classified based on the positive prediction probability value output by the model, including three levels: low risk, medium risk, and high risk, providing a basis for personalized health intervention.
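The mapping from the positive prediction probability to a binary result and a three-level risk grade might look like the sketch below. The source does not state the cut-off values, so the thresholds 0.3, 0.7 and 0.5 are purely illustrative assumptions, and the function names are hypothetical.

```python
def risk_level(positive_probability, low_cut=0.3, high_cut=0.7):
    """Grade the early onset risk; the cut-offs are assumed, not from the source."""
    if not 0.0 <= positive_probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if positive_probability < low_cut:
        return "low risk"
    if positive_probability < high_cut:
        return "medium risk"
    return "high risk"


def detection_result(positive_probability, decision_threshold=0.5):
    """Combine the binary disease status with the graded risk level."""
    status = "positive" if positive_probability >= decision_threshold else "negative"
    return {"status": status, "risk": risk_level(positive_probability)}
```

In practice the cut-offs would need to be chosen and validated clinically for each disease rather than fixed in code.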

[0042] The invention also includes a disease early detection system based on causal inference and quantum machine learning. The system comprises a data preprocessing module, a causal feature selection module, a model building module, a model training and validation module, and a detection inference module. The data preprocessing module collects clinical feature data of the disease, performs data cleaning and numerical encoding preprocessing, and outputs a standardized dataset. The causal feature selection module introduces a LinearDML causal inference model to calculate the causal contribution of each feature and select the core feature set for early disease detection. The model building module constructs classical and quantum machine learning models to form a candidate model library for early disease detection. The model training and validation module divides the dataset and trains and validates each model using cross-validation, selecting the optimal early disease detection model based on performance evaluation metrics. The detection inference module collects clinical feature data of the target to be detected, performs preprocessing and feature matching, inputs the result into the optimal early disease detection model, and outputs the disease status detection result and early onset risk assessment result of the target to be detected.

[0043] In step 1, the raw data for each disease is preprocessed to obtain standardized datasets for several diseases. For the raw data of any disease, data cleaning and data encoding are performed to obtain a standardized dataset without missing values for that disease. The raw data of any disease includes binary classification feature data and target variable data. The data encoding encodes Yes as 1 and No as 0 in the binary classification feature data, and encodes Positive as 1 and Negative as 0 in the target variable data. The age feature in the raw data of this disease retains its original value.
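The encoding rules in this step can be sketched in a few lines. The column names `Age`, `Polyuria` and `class` are hypothetical, and dropping any record with a missing value is an assumed cleaning strategy, since the source only states that the result has no missing values.

```python
BINARY_MAP = {"Yes": 1, "No": 0}
TARGET_MAP = {"Positive": 1, "Negative": 0}


def encode_record(record, target="class", numeric=("Age",)):
    """Encode one cleaned record; numeric features keep their original value."""
    encoded = {}
    for name, value in record.items():
        if name in numeric:
            encoded[name] = value
        elif name == target:
            encoded[name] = TARGET_MAP[value]
        else:
            encoded[name] = BINARY_MAP[value]
    return encoded


def preprocess(records, target="class", numeric=("Age",)):
    """Drop records with missing values (assumed cleaning rule), then encode."""
    cleaned = [r for r in records if all(v not in (None, "") for v in r.values())]
    return [encode_record(r, target, numeric) for r in cleaned]
```

An unexpected label such as "yes" in lower case would raise a `KeyError` here, which is a deliberate choice in this sketch so that dirty values surface instead of being silently mis-encoded.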

[0044] In step 2, the LinearDML causal inference model performs feature causal contribution analysis on a standardized dataset of any disease, calculating the average causal contribution of each feature to the disease's pathogenesis; this includes the following steps:

[0045] Step 11: Take any feature from the standardized dataset as the latent treatment variable $T_i$, and use the remaining features as confounding variables $W_i$. Take any disease as the target disease, and denote the standardized dataset for the target disease as $D = \{(X_i, T_i, Y_i)\}_{i=1}^{n}$, where n is the total sample size, $X_i$ is the full feature set of the i-th sample, $T_i$ is the single feature to be analyzed (a single disease risk feature), and $Y_i$ is the outcome variable of the target disease, with positive coded as 1 and negative as 0. The single feature $T_i$ to be analyzed is set as the latent treatment variable, and the features other than $T_i$ are used as confounding variables $W_i$.

[0046] Step 12: Based on the latent treatment variable $T_i$ and the confounding variables $W_i$, use the random forest algorithm to construct a latent treatment variable model and an outcome variable model, respectively, and obtain the latent treatment variable residuals and the outcome variable residuals. The latent treatment variable model predicts the conditional expectation of the latent treatment variable given the confounding variables, and the latent treatment variable residual is then calculated as follows:

[0047] $$\hat{T}_i = E[T_i \mid W_i], \qquad V_i = T_i - \hat{T}_i$$

[0048] In the formula, $\hat{T}_i$ is the prediction of the latent treatment variable model; $E[\cdot \mid \cdot]$ denotes the conditional expectation; and $V_i$ is the latent treatment variable residual, which removes the influence of the confounding variables on the latent treatment variable and retains only the variation of the feature itself.

[0049] Construct an outcome variable model to predict the conditional expectation of the outcome variable given the confounding variables, and calculate the outcome variable residuals; the calculations are shown below:

[0050] $$\hat{Y}_i = E[Y_i \mid W_i], \qquad U_i = Y_i - \hat{Y}_i$$

[0051] In the formula, $\hat{Y}_i$ is the prediction of the outcome variable model; $E[\cdot \mid \cdot]$ denotes the conditional expectation; and $U_i$ is the outcome variable residual, which removes the influence of the confounding variables on the disease outcome variable and retains only the outcome variation that the confounding variables cannot explain.

[0052] Step 13: Based on the residuals of the latent treatment variable and the residuals of the outcome variable, calculate the individual causal effect (ITE) of this feature on the disease outcome through orthogonalization; statistically analyze the individual causal effects of the entire sample and take the mean to obtain the average causal contribution (ATE) of this feature in the entire sample; the calculation process is shown below:

[0053]

[0054] $$\mathrm{ATE} = E[\mathrm{ITE}_i] = \frac{1}{n}\sum_{i=1}^{n} \mathrm{ITE}_i$$

[0055] In the formula, $\mathrm{ITE}_i$ is the individual causal effect of the latent treatment variable on the disease outcome in the i-th sample; ATE is the average causal contribution of the feature corresponding to the latent treatment variable, that is, the average influence of the feature on the disease outcome in the entire population; $E[\cdot]$ is the expectation operator, and n is the total number of samples.

[0056] Step 14: Calculate the average causal contribution of each feature in the standardized dataset through steps 11, 12 and 13 to obtain the average causal contribution of several features in the standardized dataset.

[0057] Step 15: Filter the average causal contribution of several features in the standardized dataset, removing features with no causal contribution or negative average causal contribution, to obtain features with a positive causal association with the onset of the disease; compare the average causal contribution ATE of each feature with 0. If the average causal contribution ATE of any feature is ≤ 0, then the feature has no causal contribution or has a negative average causal contribution, and the feature is removed; if the average causal contribution ATE of any feature is > 0, then the feature has a positive causal association with the onset of the disease, and the feature is retained.
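Steps 11 to 15 can be sketched end to end on binary data. This is an illustration under stated assumptions rather than the patented procedure: per-pattern conditional means stand in for the random forest nuisance models (workable only because the toy confounders are discrete), there is no cross-fitting, and the effect is the standard residual-on-residual (partialling-out) estimate Σ V_i·U_i / Σ V_i², because the source's own ITE formula is not recoverable from the text. All function and column names are hypothetical.

```python
from collections import defaultdict


def cell_means(keys, values):
    """Estimate E[value | key] as the mean of the values sharing each key."""
    sums, counts = defaultdict(float), defaultdict(int)
    for key, value in zip(keys, values):
        sums[key] += value
        counts[key] += 1
    return [sums[key] / counts[key] for key in keys]


def dml_effect(rows, treatment, outcome):
    """Residual-on-residual estimate of one feature's effect on the outcome."""
    confounders = [c for c in rows[0] if c not in (treatment, outcome)]
    keys = [tuple(r[c] for c in confounders) for r in rows]
    t = [r[treatment] for r in rows]
    y = [r[outcome] for r in rows]
    # Treatment residuals V_i and outcome residuals U_i.
    v = [ti - th for ti, th in zip(t, cell_means(keys, t))]
    u = [yi - yh for yi, yh in zip(y, cell_means(keys, y))]
    denom = sum(vi * vi for vi in v)
    if denom == 0:
        return 0.0  # the feature has no variation left after adjustment
    return sum(vi * ui for vi, ui in zip(v, u)) / denom


def select_core_features(rows, outcome):
    """Treat each feature in turn as the treatment; keep those with effect > 0."""
    features = [c for c in rows[0] if c != outcome]
    effects = {f: dml_effect(rows, f, outcome) for f in features}
    core = [f for f, effect in effects.items() if effect > 0]
    return effects, core
```

With real clinical data the nuisance models would be fitted by random forests, as the text specifies, and a library implementation of LinearDML would normally replace this hand-rolled estimator.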

[0058] In step 3, the core feature set is used as the model input and the disease status as the classification target to construct classical machine learning models and corresponding quantum machine learning models, forming a candidate model library for early disease detection. This includes the following steps:

[0059] Step 21: Adaptation of model input and model classification target. Use the core feature set as the unified input for all models, and use the binary classification label of disease status as the unified classification target for all models. Complete feature standardization adaptation for the different model input requirements: apply Z-score standardization for linear models and kernel method models, and use the original feature values directly for tree models. Z-score standardization (also called standard deviation standardization or zero-mean standardization) is a commonly used linear transformation in data preprocessing.

[0060] Step 22: Construction of classic machine learning models; Construct classic support vector machine (CSVM) and classic random forest (CRF) models adapted to the binary classification task of disease, forming a classic machine learning model group; Define the structure, initialize the parameters, and adapt each model in the classic machine learning model group to the binary classification task.

[0061] Step 23: Construction of quantum-based machine learning models; For the classical support vector machine model CSVM and the classical random forest model CRF in the classical machine learning model group, corresponding quantum augmentation models are constructed, namely the quantum support vector machine model QSVM and the quantum random forest model QRF, forming a quantum-based machine learning model group; and the quantum circuit design, feature mapping configuration and classical-quantum connection adaptation of each model in the quantum-based machine learning model group are carried out.

[0062] Step 24: Standardize and integrate the candidate model library for early disease detection; integrate the classical machine learning model group and the quantum machine learning model group, and configure a unified core feature set input interface, disease state probability output interface and training parameter configuration entry for each model to form a candidate model library for early disease detection.
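The input adaptation rule from Step 21 can be sketched as follows: Z-score standardization for linear and kernel models, raw values for tree models. The family labels "linear", "kernel" and "tree" and the function names are hypothetical, and the population standard deviation is an assumed choice since the source does not say which variant is used.

```python
import statistics


def z_score(values):
    """Zero-mean, unit-variance scaling using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(v - mean) / std for v in values]


def adapt_features(columns, model_family):
    """Standardize for linear and kernel models; pass raw values to tree models."""
    if model_family in ("linear", "kernel"):
        return {name: z_score(col) for name, col in columns.items()}
    if model_family == "tree":
        return {name: list(col) for name, col in columns.items()}
    raise ValueError(f"unknown model family: {model_family}")
```

In a train and test setting, the mean and standard deviation would be estimated on the training split only and reused on the test split to avoid leakage; this sketch omits that bookkeeping.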

[0063] In step 3, the classic machine learning models include the classic Support Vector Machine (CSVM) and the classic Random Forest (CRF) model. The CSVM is a supervised binary classification model based on the maximum margin concept. Its core is to find an optimal hyperplane in the feature space that maximizes the classification margin between positive and negative samples, thereby achieving optimal classification generalization ability. The CSVM takes the core feature set as input and aims at binary classification of disease status. It standardizes the input core feature set, applies a linear kernel function suited to the binary classification scenario, and maximizes the classification margin to construct the optimal hyperplane, outputting the positive prediction probability and the binary classification result. Slack variables are added to handle linearly inseparable samples, and the model is optimized by minimizing structural risk to avoid overfitting. It exhibits good classification stability for high-dimensional clinical disease features. The expression for the optimal hyperplane is shown below:

[0064] $$w^{\top}x + b = 0$$

[0065] In the formula, w is the weight vector, used to define the spatial orientation of the hyperplane; x is the input feature vector; and b is the bias term, used to realize the spatial translation of the hyperplane.

[0066] The perpendicular distance d from any sample point x in space to the optimal decision hyperplane $w^{\top}x + b = 0$ is calculated as follows:

[0067] $$d = \frac{|w^{\top}x + b|}{\|w\|}$$

[0068] For support vectors on the positive-class boundary hyperplane $w^{\top}x + b = +1$, the distance to the optimal decision hyperplane is $d_{+} = 1/\|w\|$; for support vectors on the negative-class boundary hyperplane $w^{\top}x + b = -1$, the distance to the optimal decision hyperplane is $d_{-} = 1/\|w\|$. The classification margin is the sum of the distances from the nearest support vectors of the positive and negative classes to the optimal hyperplane: classification margin $= d_{+} + d_{-} = 2/\|w\|$. The classic support vector machine (CSVM) model achieves the optimal separation of the two classes of samples by maximizing this classification margin.
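A short numerical check of the distance and margin formulas may help. The weight vector, bias and sample points below are arbitrary illustrative values, and the function names are hypothetical.

```python
import math


def hyperplane_distance(w, b, x):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def classification_margin(w):
    """Margin between the two class boundary hyperplanes: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))
```

With `w = (3, 4)` and `b = -5`, the point `(2, 0)` satisfies w·x + b = 1, so it lies on the positive-class boundary and its distance to the decision hyperplane is exactly half the margin.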

[0069] The Classic Random Forest (CRF) model, a supervised binary classification model based on the Bagging ensemble learning framework, takes the core feature set as input and aims to classify disease status into two categories. It constructs multiple independent CART decision trees, generating an independent training sample set for each tree using a bootstrap sampling method. During node splitting, a random subset of features is selected as candidates, and the predicted classification result is obtained through a majority voting mechanism across all CART decision trees. The Gini index (Gini impurity) is used as the evaluation index for node splitting. The ensemble effect of multiple CART decision trees reduces the overfitting risk of individual trees, exhibiting strong robustness to nonlinear associations, missing values, and noise in clinical features. It also outputs a ranking of feature importance, demonstrating good clinical interpretability. The expression for the predicted classification result of the Classic Random Forest (CRF) model is shown below:

[0070]

[0071] In the formula, The final classification prediction result of the classic random forest model CRF; h N (x) represents the predicted classification result of the Nth decision tree, where N is the total number of decision trees, and N can be 100; mode(·) is the mode operator, which selects the category with the highest proportion among all the prediction results of the decision trees as the final output. The classic random forest model CRF is trained based on bootstrap sampling set and feature bag selection method, without the need for standardization of input features.
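Two ingredients of the random forest described above, the Gini index used at node splitting and the mode-based majority vote, can be sketched in a few lines of plain Python. The function names and the example votes are illustrative assumptions; tie handling is not specified in the text and is left to the default behaviour of `Counter`.

```python
from collections import Counter

def gini_index(labels):
    """Gini index 1 - sum_k p_k^2 of the class labels reaching one node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def majority_vote(tree_predictions):
    """mode(h_1(x), ..., h_N(x)): the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical per-tree predictions for one sample (1 = positive, 0 = negative).
votes = [1, 0, 1, 1, 0]
forest_prediction = majority_vote(votes)

mixed_node = gini_index([1, 1, 0, 0])  # evenly mixed node
pure_node = gini_index([1, 1, 1, 1])   # node containing a single class
print(forest_prediction, mixed_node, pure_node)
```

A split is preferred when it lowers the Gini index of the child nodes, and the forest output is simply the most frequent tree prediction.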

[0072] The quantum machine learning models include the quantum support vector machine (QSVM) model and the quantum random forest (QRF) model. The QSVM is a quantum-enhanced version of the classical support vector machine (CSVM). Its core principle is to map classical clinical feature data into a high-dimensional quantum Hilbert space through quantum feature mapping. By exploiting quantum superposition and entanglement, it realizes high-dimensional feature transformations that classical kernel methods cannot compute efficiently, which addresses the linear inseparability faced by the classical SVM on complex clinical features. The model is built on the Qiskit quantum computing framework and employs ZZFeatureMap to construct a parameterized quantum feature mapping operator, in which quantum entanglement between features is achieved through Pauli-Z gates and CNOT gates. A state vector simulator is used to calculate the quantum kernel matrix exactly, and the pre-calculated quantum kernel matrix is used as the input kernel of a classical SVM. With binary classification of disease status as the objective, the model is trained with the same maximum-margin optimization objective as the CSVM and outputs the positive prediction probability of the disease and the binary classification result. The QSVM can capture subtle nonlinear correlations between clinical features that the CSVM cannot recognize, and in high-dimensional, small-sample clinical data scenarios its classification accuracy is significantly better than that of the classical SVM. The quantum kernel function of the QSVM is expressed as follows:

[0073] $K(x_{i}, x_{j}) = \left|\langle \Phi(x_{i}) \,|\, \Phi(x_{j}) \rangle\right|^{2}$

[0074] In the formula, $x_{i}$ and $x_{j}$ are the classical feature vectors of any two input samples; $\Phi(\cdot)$ is the quantum feature mapping operator;

[0075] $\langle \cdot \,|\, \cdot \rangle$ denotes the quantum state inner product operation; ZZFeatureMap is a widely used quantum feature mapping circuit in quantum machine learning.
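To make the kernel construction concrete, the sketch below computes a fidelity-type kernel from exact state vectors in plain Python. It is an assumption-laden illustration, not the Qiskit implementation: it hard-codes two qubits and a single repetition, and the phase pattern is modelled on the usual ZZ-type feature map (Hadamards, single-qubit phases 2·x, and an entangling phase 2·(π − x0)(π − x1) on odd-parity basis states). All function names are hypothetical.

```python
import cmath
import math

def feature_state(x):
    """State vector of a two-qubit, single-repetition ZZ-style feature map (assumed form).

    After Hadamards on both qubits, basis state |z0 z1> carries the phase
    2*x0*z0 + 2*x1*z1 + 2*(pi - x0)*(pi - x1)*(z0 XOR z1).
    """
    x0, x1 = x
    zz = (math.pi - x0) * (math.pi - x1)
    state = []
    for z0 in (0, 1):
        for z1 in (0, 1):
            phase = 2 * x0 * z0 + 2 * x1 * z1 + 2 * zz * (z0 ^ z1)
            state.append(0.5 * cmath.exp(1j * phase))
    return state

def quantum_kernel(xi, xj):
    """Squared magnitude of the inner product of the two feature states."""
    si, sj = feature_state(xi), feature_state(xj)
    inner = sum(a.conjugate() * b for a, b in zip(si, sj))
    return abs(inner) ** 2

def kernel_matrix(samples):
    """Gram matrix of pairwise kernel values, as passed to a precomputed-kernel SVM."""
    return [[quantum_kernel(a, b) for b in samples] for a in samples]

# Hypothetical two-feature samples, for illustration only.
samples = [[0.3, 1.1], [0.9, 0.2], [1.4, 0.7]]
K = kernel_matrix(samples)
```

The resulting matrix is symmetric with ones on the diagonal. A matrix of this kind can be handed to a classical SVM that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`.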

[0076] The quantum random forest (QRF) model is a quantum-enhanced version of the classical random forest (CRF) model. Its core lies in optimizing the node splitting process of the classical random forest through quantum parallel computing: the classification performance of multiple splitting paths is evaluated simultaneously, so that the optimal splitting features and thresholds can be selected quickly. This addresses the low computational efficiency and insufficient capture of complex dependencies of the CRF under high-dimensional features. Based on the Bagging ensemble framework of the classical random forest and taking into account the shallow circuit depth constraints of current quantum hardware, a quantum-enhanced decision tree structure is designed. Candidate splitting features and splitting thresholds are encoded in quantum superposition states, and parameterized quantum circuits implement feature mapping and splitting performance evaluation. The model retains the bootstrap sampling, feature bagging, and majority voting mechanisms of the CRF, while limiting the maximum depth of each quantum decision tree to fit quantum hardware constraints. In early disease detection tasks, its generalization ability and resistance to overfitting are significantly better than those of the classical random forest.

[0077] The classical machine learning models and their corresponding quantum machine learning models can further include classical logistic regression (LR), classical extreme gradient boosting (XGBoost), the variational quantum classifier (VQC), and the quantum boosting classifier (QBoost). The VQC, the quantum counterpart of classical logistic regression (LR), uses parameterized quantum circuits as feature map and classifier, and updates the quantum circuit parameters through a classical optimizer to suit binary classification tasks on small-sample clinical features. The classical XGBoost model is an efficient machine learning algorithm based on gradient boosted trees. QBoost, the quantum counterpart of XGBoost, optimizes the ensemble weights of weak classifiers based on a quantum annealing algorithm, achieving rapid screening and combination of weak classifiers through quantum parallel computing and giving stronger fitting capability for complex clinical features.

[0078] In step 4, the test set is input into the candidate model library for early disease detection, and the performance of each model in the candidate model library is verified by cross-validation. The performance index of each model is calculated, and the best-performing classical machine learning model is selected by comparing the performance index of each model in the classical machine learning model library after training. The best-performing quantum machine learning model is selected by comparing the performance index of each model in the quantum machine learning model library after training. The model constructed by the best-performing classical machine learning model and the best-performing quantum machine learning model is combined as the optimal early disease detection model for the disease.

[0079] The performance metrics include accuracy (Acc), precision (P), recall (R), and F1-score (F1). For the trained classical machine learning models, the weighted average performance (CPS) of each model is calculated from accuracy (Acc), recall (R), and F1-score (F1); the CPS values are compared, and the model with the highest CPS is selected as the best-performing classical machine learning model. For the trained quantum machine learning models, the CPS of each model is calculated in the same way, and the model with the highest CPS is selected as the best-performing quantum machine learning model. The weighted average performance CPS is calculated as follows:

[0080] $\mathrm{CPS} = \omega_{1} \cdot \mathrm{Acc} + \omega_{2} \cdot \mathrm{R} + \omega_{3} \cdot \mathrm{F1}$

[0081] In the formula, Acc is the accuracy, R is the recall, and F1 is the F1-score; ω1, ω2, and ω3 are the weight coefficients of the respective performance metrics, with ω1+ω2+ω3=1. Considering the clinical needs of early disease detection, the default weights are: accuracy weight coefficient ω1=0.3; recall weight coefficient ω2=0.4, which has the highest priority; and F1-score weight coefficient ω3=0.3. The weight coefficients can be adjusted according to the clinical needs of different diseases. CPS represents the overall performance and ranges from 0 to 1; a higher score indicates better overall model performance.
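The weighted score and the selection of the highest-scoring model can be sketched as follows, using the default weights 0.3, 0.4, and 0.3 stated above. The model names and metric values are made up for illustration and are not results from this document.

```python
def cps(acc, recall, f1, weights=(0.3, 0.4, 0.3)):
    """Weighted average performance: w1*Acc + w2*R + w3*F1, with w1 + w2 + w3 = 1."""
    w1, w2, w3 = weights
    if abs(w1 + w2 + w3 - 1.0) > 1e-9:
        raise ValueError("weight coefficients must sum to 1")
    return w1 * acc + w2 * recall + w3 * f1

def best_model(metrics_by_model, weights=(0.3, 0.4, 0.3)):
    """Name of the model with the highest CPS; values are (Acc, R, F1) tuples."""
    return max(metrics_by_model, key=lambda name: cps(*metrics_by_model[name], weights=weights))

# Hypothetical (Acc, R, F1) values for two placeholder models in one model family.
candidates = {
    "model_a": (0.90, 0.80, 0.85),
    "model_b": (0.88, 0.92, 0.90),
}
selected = best_model(candidates)
print(selected, cps(*candidates[selected]))
```

Because recall carries the largest weight, `model_b` is selected here even though `model_a` has the higher accuracy. The same function would be applied once to the classical family and once to the quantum family.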

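[0082] The performance metrics accuracy (Acc), precision (P), recall (R), and F1-score (F1) are calculated as follows:

[0083] $\mathrm{Acc} = \dfrac{TP + TN}{TP + TN + FP + FN}$, $\mathrm{P} = \dfrac{TP}{TP + FP}$, $\mathrm{R} = \dfrac{TP}{TP + FN}$, $\mathrm{F1} = \dfrac{2 \cdot \mathrm{P} \cdot \mathrm{R}}{\mathrm{P} + \mathrm{R}}$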

[0084] In the formula, TP represents the number of true positive samples, TN represents the number of true negative samples, FP represents the number of false positive samples, and FN represents the number of false negative samples.
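A minimal sketch of these four metrics computed from confusion-matrix counts is given below. The counts are hypothetical, and returning 0.0 when a denominator is zero is an assumption of this sketch, not something stated in the text.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # zero-denominator convention is assumed
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"Acc": acc, "P": precision, "R": recall, "F1": f1}

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, tn=50, fp=10, fn=0)
print(metrics)
```

With no false negatives the recall is 1.0, while the ten false positives lower the precision to 0.8.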

[0085] Step 5: Obtain the clinical feature data of the target to be detected, and process the clinical feature data of the target to be detected in Step 1 and Step 2 sequentially to obtain the standardized dataset and core feature set of the target to be detected; select the corresponding optimal early disease detection model based on the standardized dataset of the target to be detected, input the core feature set of the target to be detected into the selected optimal early disease detection model for detection, and output the disease status detection result and early onset risk assessment result of the target to be detected; including the following steps:

[0086] Step 31: Preprocessing and feature matching of the clinical feature data of the target to be detected. Following step 1, data cleaning and encoding are performed on the clinical feature data of the target to be detected to obtain its standardized dataset. Then, following step 2, feature causal contribution analysis is performed on this standardized dataset to obtain the core feature set of the target to be detected. Only features completely consistent with the core feature set of the target disease are retained, and irrelevant features are removed, to obtain the core feature vector $x_{test}$ of the target to be detected.

[0087] Step 32: Select the optimal early disease detection model based on the standardized dataset of the target to be detected. The optimal early disease detection model includes the optimal classical machine learning model and the optimal quantum machine learning model.

[0088] Step 33: Dual-path inference with the optimal early disease detection model. The core feature vector $x_{test}$ is processed by both the optimal classical machine learning model and the optimal quantum machine learning model. In the classical path, $x_{test}$ is standardized as required and input into the optimal classical machine learning model; using the parameters of the pre-trained model, the output is the positive prediction probability $p_{c} \in [0,1]$ that the target to be detected has the disease. In the quantum path, $x_{test}$ is standardized as required and input into the optimal quantum machine learning model; feature transformation and inference are performed through quantum circuit state vector simulation, and the output is the positive prediction probability $p_{q} \in [0,1]$ that the target to be detected has the disease.

[0089] Step 34: Dual-model probability fusion. Based on the weights determined during the training phase of the optimal classical and optimal quantum machine learning models, the positive disease prediction probabilities output by the two models are weighted and fused to obtain the final positive disease prediction probability $p_{final}$.

[0090] Step 35: Determination of the disease status detection result. A binary classification decision is made on the final positive disease prediction probability $p_{final}$ using the clinically accepted 50% threshold: if $p_{final} \geq 0.5$, the result is positive, meaning the target to be detected has the disease; if $p_{final} < 0.5$, the result is negative, meaning the target to be detected is not determined to have the disease.

[0091] Step 36: Grading of the early disease risk assessment result. For a target to be detected that is negative for the disease, early disease risk is graded based on the final positive disease prediction probability $p_{final}$; three levels are used to provide a basis for personalized health intervention. When $p_{final} < 0.3$, the risk level is low: the risk of developing the disease is extremely low, no special intervention is required, and routine health monitoring is sufficient. When $0.3 \leq p_{final} < 0.5$, the risk level is medium: there is a potential risk of developing the disease, and lifestyle interventions targeting the core causal features of the disease are necessary, along with regular follow-up examinations. When $p_{final} \geq 0.5$, the risk level is high: the result is considered positive and clearly indicates the presence of disease characteristics, and it is recommended to seek medical attention as soon as possible for clinical diagnosis and intervention.

[0092] Step 37, Results Output: The final output includes four core components: the binary classification detection result of the disease status of the target to be detected, the early onset risk assessment result, the final positive prediction probability of the disease, and the core pathogenic causal characteristic analysis, forming a complete early disease detection report.
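Steps 34 to 37 can be sketched as a small plain-Python pipeline. The fusion weights are described only as determined during training, so the equal weights below are a placeholder assumption; a probability of exactly 0.5 is treated as positive and high risk, following the threshold in step 35. Function and key names are hypothetical.

```python
def fuse_probabilities(p_classical, p_quantum, w_classical=0.5, w_quantum=0.5):
    """Weighted fusion of the two positive-class probabilities (weights are assumed)."""
    return (w_classical * p_classical + w_quantum * p_quantum) / (w_classical + w_quantum)

def risk_level(p_final):
    """Three-level grading: below 0.3 low, 0.3 up to (not including) 0.5 medium, else high."""
    if p_final < 0.3:
        return "low"
    if p_final < 0.5:
        return "medium"
    return "high"

def detection_report(p_classical, p_quantum, core_causal_features):
    """Assemble the four report components listed in step 37."""
    p_final = fuse_probabilities(p_classical, p_quantum)
    return {
        "status": "positive" if p_final >= 0.5 else "negative",
        "risk_level": risk_level(p_final),
        "p_final": p_final,
        "core_causal_features": list(core_causal_features),
    }

# Hypothetical model outputs and feature names, for illustration only.
report = detection_report(0.42, 0.38, ["polyuria", "polydipsia"])
print(report)
```

In practice the two weights would be replaced by the values obtained in the training phase, and the feature list by the output of the causal contribution analysis.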

[0093] Example

[0094] Taking diabetes as an example, clinical characteristic data was collected as the raw data. The raw data used was the Early Stage Diabetes Risk Prediction Dataset published by UCI, which contains 520 valid samples, including 320 positive samples and 200 negative samples. The dataset has no missing values or outliers and was used as the raw data for this experiment. The clinical characteristic data includes 16 features: age, gender, polyuria, polydipsia, sudden weight loss, fatigue, polyphagia, genital candidiasis, blurred vision, pruritus, irritability, delayed wound healing, partial paralysis, muscle stiffness, hair loss, and obesity, as well as a binary target variable for diabetes status. The clinical characteristic data was preprocessed to obtain a standardized dataset for diabetes. After the standardized dataset for diabetes was processed in step 2, the core feature set for diabetes was obtained. The core feature set for diabetes includes 10 features: polydipsia, polyuria, fatigue, polyphagia, sudden weight loss, delayed wound healing, gender, blurred vision, genital candidiasis, and pruritus.

[0095] Using the core feature set as model input and diabetes prevalence status as the classification target, a classic class machine learning model and a corresponding quantum class machine learning model are constructed to form a candidate model library for early disease detection. This candidate model library includes the classic Support Vector Machine (CSVM), the classic Random Forest (CRF), the quantum Support Vector Machine (QSVM), and the quantum Random Forest (QRF). The standardized diabetes dataset is randomly divided into training and test sets at a predetermined ratio of 8:2, with 416 samples in the training set and 104 samples in the test set. This ensures that the positive and negative sample ratios in the training and test sets are consistent with the original dataset, avoiding data imbalance.
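The class-preserving 8:2 split described above can be sketched in plain Python. The label vector below only mirrors the stated class counts (320 positive, 200 negative); the function name, the seed, and the per-class rounding rule are assumptions of this sketch.

```python
import random

def stratified_split(labels, test_ratio=0.2, seed=0):
    """Split sample indices so each class keeps the same share in train and test."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_ratio)  # per-class test size (assumed rounding)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Placeholder labels matching the stated class counts: 320 positive, 200 negative.
labels = [1] * 320 + [0] * 200
train_idx, test_idx = stratified_split(labels)
n_test_pos = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), n_test_pos)
```

The split yields 416 training and 104 test samples, and the test set holds 64 positive and 40 negative samples, the same 8:5 class ratio as the full dataset.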

[0096] Four models from the candidate model library for early disease detection were trained using a 5-fold cross-validation method. During training, the training set was divided into five mutually exclusive subsets. One subset was selected as the validation set for each training iteration, and the remaining four were used as the training set. This process was repeated five times, employing an early stopping strategy to avoid overfitting. After training, the performance of the four models was validated using a test set. The accuracy (Acc), precision (P), recall (R), and F1-score (F1) were calculated for each model. The results of these four performance metrics are shown in the table below.
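The five-fold partition of the training set described above can be sketched as follows. Only the index bookkeeping is shown; model fitting and the early stopping strategy are omitted, and the function name and seed are assumptions.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, validation) index lists for k mutually exclusive folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint subsets covering all samples
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation

# 416 is the training-set size stated in the example.
splits = list(k_fold_indices(416, k=5))
fold_sizes = [len(validation) for _, validation in splits]
print(fold_sizes)
```

Each sample is used for validation exactly once and for training in the other four iterations.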

[0097] Performance metrics of each model in the candidate model library for early disease detection

[0098]

[0099] The performance index results of each model in the candidate model library for early disease detection are shown in the table. The weighted comprehensive performance was calculated, and the model combination of the classical random forest model CRF and the quantum support vector machine model QSVM was selected as the optimal early disease detection model for diabetes. Both of them achieved a detection accuracy of 97% and performed well in the recall rate of diabetes positive samples, which can effectively reduce the false negative rate and meet the clinical needs of early detection.

[0100] The clinical characteristic data of the target to be detected is obtained and processed in steps 1 and 2 to obtain the standardized dataset and core feature set of the target to be detected. The core feature set includes 10 features: polydipsia, polyuria, fatigue, polyphagia, sudden weight loss, delayed wound healing, gender, blurred vision, genital candidiasis, and pruritus. The optimal early detection model for diabetes is selected based on the standardized dataset. The core feature set is input into the optimal early detection model for diabetes, which is the combination of the classical random forest model (CRF) and the quantum support vector machine model (QSVM), and the binary classification detection result of the diabetes status of the target to be detected is output. At the same time, risk grading is performed based on the output positive probability value: a value below 0.3 is low risk, a value from 0.3 up to but not including 0.5 is medium risk, and a value of 0.5 or above is high risk. Finally, the output includes four core components: the binary classification detection result of the disease status of the target to be detected, the early onset risk assessment result, the final positive prediction probability of the disease, and the core pathogenic causal feature analysis. Together these form a complete early disease detection report, which provides a basis for clinical diagnosis and personalized health intervention.

[0101] The above description is merely a preferred embodiment of the present invention. Those skilled in the art can make several modifications and optimizations based on the above disclosure without departing from the basic principles described above. These modifications and optimizations should be considered within the scope of protection as understood by the present invention.

Claims

1. A method for early disease detection based on causal inference and quantum machine learning, characterized in that: Includes the following steps: Step 1: Collect clinical feature data related to early risk prediction of several diseases as raw data for several diseases, and preprocess the raw data of each disease to obtain standardized datasets for several diseases. Step 2: Use the LinearDML causal inference model to perform feature causal contribution analysis on the standardized dataset of any disease, calculate the average causal contribution of each feature to the onset of the disease; screen out features that have a positive causal relationship with the onset of the disease, and combine features that have a positive causal relationship with the onset of the disease to construct a core feature set; Step 3: Using the core feature set as model input and the disease status as classification target, construct a classic class machine learning model and a corresponding quantum class machine learning model to form a candidate model library for early disease detection. Step 4: Divide the standardized dataset corresponding to the core feature set of any disease into training set and test set according to a preset ratio, input the training set and test set into the candidate model library for early disease detection, and use cross-validation to train and verify the performance of each model in the candidate model library for early disease detection. The model combination with the best performance after training was selected as the optimal early detection model for this disease. 
Step 5: Obtain the clinical feature data of the target to be detected, and process the clinical feature data of the target to be detected in Step 1 and Step 2 in sequence to obtain the standardized dataset and core feature set of the target to be detected; select the corresponding optimal early disease detection model based on the standardized dataset of the target to be detected, input the core feature set of the target to be detected into the selected optimal early disease detection model for detection, and output the disease status detection result and early onset risk assessment result of the target to be detected.

2. The method for early disease detection based on causal inference and quantum machine learning according to claim 1, characterized in that: In step 1, the raw data for each disease is preprocessed to obtain standardized datasets for several diseases. For the raw data of any disease, data cleaning and data encoding are performed to obtain a standardized dataset for that disease. The raw data for any disease includes binary classification feature data and target variable data. The data encoding encodes Yes as 1 and No as 0 in the binary classification feature data, and encodes Positive as 1 and Negative as 0 in the target variable data. The age feature in the raw data of this disease retains its original value.

3. The method for early disease detection based on causal inference and quantum machine learning according to claim 1, characterized in that: In step 2, the LinearDML causal inference model performs feature causal contribution analysis on the standardized dataset of any disease, calculating the average causal contribution of each feature to the onset of the disease; this includes the following steps: Step 11: Take any feature from the standardized dataset as the latent treatment variable $T_{i}$, and use the remaining features as confounding variables $W_{i}$; Step 12: Based on the latent treatment variable $T_{i}$ and the confounding variables $W_{i}$, use the random forest algorithm to construct the latent treatment variable model and the outcome variable model, respectively, and obtain the latent treatment variable residuals and the outcome variable residuals; Step 13: Based on the latent treatment variable residuals and the outcome variable residuals, calculate the individual causal effect (ITE) of this feature on the disease onset outcome through orthogonalization, and statistically obtain the average causal contribution (ATE) of this feature over the full sample; Step 14: Calculate the average causal contribution of each feature in the standardized dataset through steps 11, 12 and 13 to obtain the average causal contributions of the features in the standardized dataset; Step 15: Filter the features by their average causal contribution, remove features with no causal contribution or a negative average causal contribution, and obtain the features that have a positive causal association with the onset of the disease.

4. The early disease detection method based on causal inference and quantum machine learning according to claim 1, characterized in that: In step 3, the classical machine learning models include the classical support vector machine (CSVM) model and the classical random forest (CRF) model. The CSVM model takes the core feature set as input and aims at binary classification of disease status; it standardizes the input core feature set and, using a linear kernel function, maximizes the classification margin to construct the optimal hyperplane, outputting the positive prediction probability of the disease and the binary classification result. The expression for the optimal hyperplane is shown below: $w^{T}x + b = 0$. In the formula, w is the weight vector, x is the input feature vector, and b is the bias term; The CRF model takes the core feature set as input and aims at binary classification of disease status. It constructs multiple independent CART decision trees, generates an independent training sample set for each CART decision tree by bootstrap sampling, randomly selects a subset of features as candidates during node splitting, and finally obtains the predicted classification result through majority voting across all CART decision trees. The expression for the predicted classification result of the CRF model is shown below: $\hat{y} = \mathrm{mode}\{h_{1}(x), h_{2}(x), \ldots, h_{N}(x)\}$. In the formula, $\hat{y}$ is the final classification prediction result of the classical random forest model CRF; $h_{n}(x)$ is the predicted classification result of the n-th decision tree (n = 1, ..., N), where N is the total number of decision trees, and mode(·) is the mode operator.

5. The method for early disease detection based on causal inference and quantum machine learning according to claim 1, characterized in that: The quantum machine learning models include the quantum support vector machine (QSVM) model and the quantum random forest (QRF) model. The QSVM model is built on the Qiskit quantum computing framework and employs ZZFeatureMap to construct a parameterized quantum feature mapping operator. It achieves quantum entanglement between features through Pauli-Z gates and CNOT gates, and uses a state vector simulator to calculate the quantum kernel matrix exactly. The pre-calculated quantum kernel matrix is used as the input kernel of the classical SVM. With binary classification of disease status as the objective, the model is trained with the same maximum-margin optimization objective as the classical SVM model CSVM, and outputs the positive prediction probability of the disease and the binary classification result. The quantum kernel function of the QSVM model is expressed as follows: $K(x_{i}, x_{j}) = \left|\langle \Phi(x_{i}) \,|\, \Phi(x_{j}) \rangle\right|^{2}$. In the formula, $x_{i}$ and $x_{j}$ are the classical feature vectors of any two input samples; $\Phi(\cdot)$ is the quantum feature mapping operator; $\langle \cdot \,|\, \cdot \rangle$ denotes the inner product operation of quantum states.

6. The method for early disease detection based on causal inference and quantum machine learning according to claim 1, characterized in that: In step 4, the test set is input into the candidate model library for early disease detection, and the performance of each model in the candidate model library is verified by cross-validation. The performance index of each model is calculated, and the best-performing classical machine learning model is selected by comparing the performance index of each model in the classical machine learning model library after training. The best-performing quantum machine learning model is selected by comparing the performance index of each model in the quantum machine learning model library after training. The model constructed by the best-performing classical machine learning model and the best-performing quantum machine learning model is combined as the optimal early disease detection model for the disease.

7. The method for early disease detection based on causal inference and quantum machine learning according to claim 6, characterized in that: The performance metrics include accuracy (Acc), precision (P), recall (R), and F1-score (F1). For the trained classical machine learning models, the weighted average performance (CPS) of each model is calculated from accuracy (Acc), recall (R), and F1-score (F1); the CPS values are compared, and the model with the highest CPS is selected as the best-performing classical machine learning model. For the trained quantum machine learning models, the CPS of each model is calculated in the same way, and the model with the highest CPS is selected as the best-performing quantum machine learning model. The weighted average performance CPS is calculated as follows: $\mathrm{CPS} = \omega_{1} \cdot \mathrm{Acc} + \omega_{2} \cdot \mathrm{R} + \omega_{3} \cdot \mathrm{F1}$. In the formula, Acc is the accuracy, R is the recall, F1 is the F1-score, ω1, ω2 and ω3 are the weight coefficients of the respective performance metrics, and ω1+ω2+ω3=1.

8. The method for early disease detection based on causal inference and quantum machine learning according to claim 7, characterized in that: The performance metrics accuracy (Acc), precision (P), recall (R), and F1-score (F1) are calculated as follows: $\mathrm{Acc} = \dfrac{TP + TN}{TP + TN + FP + FN}$, $\mathrm{P} = \dfrac{TP}{TP + FP}$, $\mathrm{R} = \dfrac{TP}{TP + FN}$, $\mathrm{F1} = \dfrac{2 \cdot \mathrm{P} \cdot \mathrm{R}}{\mathrm{P} + \mathrm{R}}$. In the formula, TP represents the number of true positive samples, TN represents the number of true negative samples, FP represents the number of false positive samples, and FN represents the number of false negative samples.