Methylation-based deep learning cancer risk prediction method and system
By using a methylation-based deep learning method, tissue-specific interference and common signals in DNA methylation data are separated. The model is optimized using transfer learning and pseudo-label generation mechanisms, which solves the problems of sample scarcity and feature differences in multi-cancer risk prediction and achieves efficient and accurate risk prediction for rare cancers.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- INNER MONGOLIA UNIV OF TECH
- Filing Date
- 2026-03-27
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies face challenges in predicting the risk of various cancers, including uneven sample distribution, differences in detection environments, and significant differences in biological characteristics. In particular, the scarcity of samples for rare cancers makes it difficult to achieve accurate cross-cancer predictions.
A methylation-based deep learning method extracts sequence features from DNA methylation data and separates tissue-specific interference from common cancer signals. Transfer learning, a pseudo-label generation mechanism, and semi-supervised learning are then combined to build and optimize a multi-cancer adaptation model, which outputs cancer risk prediction results.
The method significantly improves the accuracy and interpretability of early risk screening for rare cancers, providing an efficient and objective auxiliary diagnostic tool for multi-cancer screening in scenarios with limited clinical resources.
Figure CN122245778A_ABST
Description
Technical Field
[0001] This invention relates to the field of medical data processing technology, and in particular to a methylation-based deep learning cancer risk prediction method and system.
Background Technology
[0002] Research in early cancer screening is crucial for improving patient survival rates and reducing the burden on healthcare systems. This field analyzes biomarkers in the human body to identify potential risks in the early stages of disease, enabling timely intervention. However, despite significant advances in related technologies in recent years, numerous challenges remain, particularly in achieving accurate prediction and broad applicability across multiple cancer types, a problem that urgently needs to be solved.
[0003] Current methods for predicting the risk of various cancers are often limited by uneven sample distribution and differences in testing environments. This is especially true for less common cancer types, where the extremely small number of cases makes it difficult for researchers to obtain sufficient data to support the development of effective analytical tools. Furthermore, the significant differences in biological characteristics between different cancer types and between individuals often make it difficult for existing technologies to find common patterns across multiple cancers, leading to a substantial reduction in the accuracy of predictions for certain cancer types.
[0004] A deeper challenge lies in extracting the truly cancer-related key signals from complex biological information. Cancer-related biological changes are often obscured by a large amount of irrelevant background information; for example, the characteristics of different tissue origins can interfere with the identification of common cancer features. This background interference makes it difficult for researchers to extract signal features that are universally applicable across multiple cancer types. Taking rare cancers as an example, due to the scarcity of samples, researchers cannot fully understand their unique patterns of change, while the analysis results of common cancers cannot be directly applied to rare cancers due to background differences. This cross-type applicability problem further exacerbates the difficulty of technology development.
[0005] Therefore, how to accurately separate common cancer-related signals and overcome background differences between cancer types, so as to effectively predict the risk of rare cancers when the sample size is extremely limited, has become a key problem that this invention urgently needs to solve.
Summary of the Invention
[0006] To address the shortcomings of existing technologies, this invention provides a methylation-based deep learning cancer risk prediction method and system. This method significantly improves the accuracy and interpretability of early risk screening for rare cancers, and provides an efficient and objective auxiliary diagnostic tool for pan-cancer screening of multiple cancer types in clinical settings with limited resources.
[0007] To achieve the above objectives, the technical solution of the present invention is as follows: A methylation-based deep learning cancer risk prediction method, which includes: obtaining DNA methylation data of multiple cancer types, extracting sequence features from the DNA methylation data and standardizing them, separating the tissue-specific interference part from the cancer common signal part, and processing the result to obtain a transferable common signal subset; for a small amount of sample data of rare cancer types, transferring the parameters of the source cancer type model to the target cancer type model through transfer learning and fine-tuning them to obtain an initially adapted prediction model; extracting intermediate layer representations from the initially adapted prediction model, assigning temporary labels to unlabeled samples using a pseudo-label generation mechanism, and selecting an enhanced training sample set based on confidence; iteratively optimizing the initially adapted prediction model using a semi-supervised learning framework combined with the enhanced training sample set to obtain an optimized multi-cancer adaptation model; and processing newly input DNA methylation data with the optimized multi-cancer adaptation model to separate background features from cancerous features and output cancer risk prediction results.
[0008] Preferably, the step of extracting sequence features from DNA methylation data and performing standardization processing to separate the tissue-specific interference part and the cancer common signal part includes: extracting the original sequence features using sequence alignment tools to obtain a sequence feature set; standardizing the sequence feature set to obtain standardized sequence features; separating the tissue-specific interference part from the standardized sequence features to obtain a tissue-specific interference set; extracting the cancer common signal part from the standardized sequence features to obtain a cancer common signal set; processing the cancer common signal set by principal component analysis to obtain the dimensionality-reduced principal components of the common signals; and performing cluster analysis on these dimensionality-reduced principal components to obtain the grouping results of common cancer methylation signals.
[0009] Preferably, the processing to obtain a transferable common signal subset includes: for the tissue-specific interference remaining in the cancer common signal set, adjusting the weights using a domain adaptation technique to obtain an adjusted signal distribution set; calculating the distribution difference of the adjusted signal distribution set and, if the calculated difference value is lower than a preset threshold, determining the set to be a transferable signal subset; applying batch normalization to the transferable signal subset to unify the signal distribution range, resulting in a standardized transferable signal set; based on the standardized transferable signal set, extracting the signal segments that are highly correlated with cancer and identifying them as the core common signal combination; separating signal sub-units related to specific cancer types from the core common signal combination and, if the signal intensity of a separated sub-unit is higher than a preset standard, judging it to be a key signal unit; and constructing a signal correlation matrix for the key signal units and analyzing the matrix to uncover the potential relationships between signals, yielding the final set of common signal correlations.
[0010] Preferably, the step of transferring the parameters of the source cancer type model to the target cancer type model through transfer learning and fine-tuning them on a small amount of sample data of rare cancer types to obtain an initially adapted prediction model includes: obtaining all signal feature data from the source cancer type; screening out, using statistical testing methods, the set of signal features that differ significantly in the source cancer type; for the limited sample data of the target cancer type, calculating the methylation level distribution of each signal feature in the target cancer type; if a significantly different signal feature of the source cancer type also varies in the target cancer type, retaining that signal feature to obtain a transferable common signal subset; loading all parameters of the deep learning model already trained for the source cancer type into an identical network structure for the target cancer type; and fine-tuning the model with the loaded parameters using the small sample of target cancer data to obtain an initially adapted prediction model.
[0011] Preferably, the step of extracting intermediate layer representations from the initially adapted prediction model, assigning temporary labels to unlabeled samples using a pseudo-label generation mechanism, and selecting an enhanced training sample set based on confidence levels includes: Obtain the intermediate layer representation of the prediction model of the initial adaptation; Based on the intermediate layer representation, temporary labels are assigned to unlabeled samples through a pseudo-label generation mechanism; Calculate a confidence score for the temporary label of each unlabeled sample; If the confidence score is higher than a preset threshold, the sample and its temporary label are retained. An enhanced training sample set is constructed by retaining the samples and their temporary labels; The parameters of the initially adapted prediction model are updated using the enhanced training sample set.
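The confidence-based selection described in this step can be sketched as follows. This is only an illustration under stated assumptions: the confidence score is taken to be the maximum softmax probability, and the threshold of 0.9 is a hypothetical value; the source specifies neither.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_pseudo_labeled(logits_per_sample, threshold=0.9):
    """Return (sample_index, temporary_label, confidence) for confident samples.

    Assumption: confidence = maximum softmax probability of the sample.
    """
    selected = []
    for idx, logits in enumerate(logits_per_sample):
        probs = softmax(logits)
        label = max(range(len(probs)), key=probs.__getitem__)
        confidence = probs[label]
        if confidence > threshold:  # keep only high-confidence temporary labels
            selected.append((idx, label, confidence))
    return selected

# Hypothetical model outputs for three unlabeled samples (two classes).
unlabeled_logits = [[4.0, 0.0], [0.2, 0.1], [-1.0, 3.5]]
augmented = select_pseudo_labeled(unlabeled_logits, threshold=0.9)
```

Here the second sample is discarded because its two class scores are nearly equal, so its temporary label would be unreliable.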
[0012] Preferably, the step of using a semi-supervised learning framework combined with the enhanced training sample set to iteratively optimize the initially adapted prediction model to obtain an optimized multi-cancer adaptation model includes: loading the enhanced training sample set and the initially adapted prediction model using a semi-supervised learning framework; recalculating the class probability distribution of the samples in the enhanced training sample set with the current prediction model to obtain updated probability values; calculating the weighted loss value of each sample based on the updated probability values and the original temporary labels; if the weighted loss value is less than the preset loss threshold, retaining the sample to participate in the gradient calculation of the current round; adding an L2 regularization term to the total loss function to obtain the constrained total loss; updating the prediction model parameters based on the constrained total loss using the backpropagation algorithm to obtain the prediction model optimized for the current round; using the prediction model optimized in the current round as the initial prediction model for the next round; and repeating the probability recalculation process until the preset iteration round condition is met, resulting in an optimized multi-cancer adaptation model.
[0013] Preferably, the step of processing newly input DNA methylation data through the optimized multi-cancer adaptation model, separating background features from cancerous features, and outputting cancer risk prediction results includes: loading the optimized multi-cancer adaptation model and performing a preliminary analysis of the newly acquired DNA methylation data to obtain preliminary feature classification results; based on the preliminary feature classification results, using the support vector machine algorithm to further separate background features from cancerous features and determine the boundary between them; if the proportion of cancerous features after separation is higher than a preset threshold, determining that the sample carries potential risk and obtaining preliminary risk assessment data; based on the preliminary risk assessment data, obtaining the signal intensity in specific regions of the sample's methylation data to determine the signal intensity distribution; based on the signal intensity distribution, analyzing how the manifestation of cancerous features differs between regions to obtain regional feature distribution data; and combining the regional feature distribution data with the historical prediction records of the optimized multi-cancer adaptation model to determine whether a consistent risk trend exists, thereby obtaining the final cancer risk prediction result.
[0014] A methylation-based deep learning cancer risk prediction system includes: a data preprocessing module, used to acquire DNA methylation data of multiple cancer types, extract sequence features from the DNA methylation data and standardize them, separate the tissue-specific interference part from the cancer common signal part, and process the result to obtain a transferable common signal subset; a transfer learning adaptation module, used to transfer the parameters of the source cancer type model to the target cancer type model and fine-tune them using a small amount of sample data of rare cancer types, so as to obtain an initially adapted prediction model; a pseudo-label augmentation module, used to extract intermediate layer representations from the initially adapted prediction model, assign temporary labels to unlabeled samples using a pseudo-label generation mechanism, and select the enhanced training sample set based on confidence; a semi-supervised iterative optimization module, used to iteratively optimize the initially adapted prediction model using a semi-supervised learning framework combined with the enhanced training sample set, so as to obtain an optimized multi-cancer adaptation model; and a risk prediction output module, used to process newly input DNA methylation data through the optimized multi-cancer adaptation model, separate background features from cancerous features, and output cancer risk prediction results.
[0015] The technical effects and advantages of this invention are as follows: This application provides a methylation-based deep learning cancer risk prediction method and system that addresses the core clinical challenge of extremely scarce rare cancer samples and the difficulty of cross-cancer generalization caused by strong tissue heterogeneity. It systematically removes tissue-specific background interference, retains and purifies the tumor epigenetic signals shared across cancer types, and constructs a highly transferable subset of common signals. Then, using sufficiently labeled data from common cancer types as the source domain, it employs transfer learning to transfer source model parameters to the target domain of rare cancer types and fine-tunes them. At the same time, it combines pseudo-label generation with semi-supervised iterative optimization to exploit a large number of unlabeled samples and enlarge the training data, enabling rapid model adaptation and continuous performance improvement on rare cancer types with limited samples. Finally, during the inference stage, it separates background features from cancerous features through built-in attention or gradient importance analysis, outputting cancer risk probabilities, risk levels, and key contributing sites.
[0016] This method significantly improves the accuracy and interpretability of early risk screening for rare cancers, providing an efficient and objective auxiliary diagnostic tool for pan-cancer screening of multiple cancer types in clinical settings with limited resources.
Attached Figure Description
[0017] Figure 1 is a flowchart of the methylation-based deep learning cancer risk prediction method of the present invention; Figure 2 is a schematic diagram of the structure of the methylation-based deep learning cancer risk prediction system of the present invention.
Detailed Implementation
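One round of the loss computation in this step might look like the sketch below. The source does not define the weighting scheme or the loss form, so several choices here are assumptions: the per-sample loss is cross-entropy against the temporary label, the weight is the pseudo-label confidence, and `loss_threshold` and `l2_lambda` are hypothetical values.

```python
import math

def weighted_loss(probs, temp_label, weight):
    """Weighted cross-entropy of updated probabilities against the temporary label."""
    return -weight * math.log(max(probs[temp_label], 1e-12))

def constrained_total_loss(samples, params, loss_threshold=1.0, l2_lambda=0.01):
    """samples: list of (updated_probs, temp_label, weight); params: flat list of weights.

    Samples whose weighted loss reaches the threshold are excluded from this round.
    Returns (constrained total loss, number of samples kept).
    """
    losses = [weighted_loss(p, y, w) for p, y, w in samples]
    kept = [loss for loss in losses if loss < loss_threshold]
    data_loss = sum(kept) / len(kept) if kept else 0.0
    l2_term = l2_lambda * sum(w * w for w in params)  # L2 regularization term
    return data_loss + l2_term, len(kept)

# Hypothetical values: the second sample's updated probabilities now disagree
# with its temporary label, so its loss exceeds the threshold and it is dropped.
samples = [([0.9, 0.1], 0, 0.95), ([0.3, 0.7], 0, 0.92), ([0.2, 0.8], 1, 0.97)]
total, n_kept = constrained_total_loss(samples, params=[0.5, -1.0, 2.0])
```

In a full training loop, the returned total would be backpropagated and the procedure repeated with the updated model until the round limit is reached.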
[0018] To further understand the content of this invention, a detailed description of the invention is provided in conjunction with the accompanying drawings and embodiments. The specific embodiments described herein are for illustrative purposes only and are not intended to limit the invention. It should also be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings.
[0019] As shown in Figure 1, the methylation-based deep learning cancer risk prediction method of this embodiment includes the following steps: Step S1: Obtain DNA methylation data for multiple cancer types.
[0020] The DNA methylation data for various cancer types presented here are derived from clinical test samples, public databases, and methylation microarrays or whole-genome bisulfite sequencing data provided by collaborating research institutions. These data typically cover a wide range of common cancers, including lung cancer, breast cancer, colorectal cancer, pancreatic cancer, liver cancer, stomach cancer, prostate cancer, and ovarian cancer, while also including a small number of samples from relatively rare cancers such as neuroblastoma, osteosarcoma, cholangiocarcinoma, and adrenocortical carcinoma. The data is primarily presented as a methylation site β-value matrix or M-value matrix, with each row corresponding to one sample and each column corresponding to one CpG site. The sample label indicates the cancer type and whether the sample has been diagnosed with cancer.
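The two matrix forms mentioned above are related by a logit transform. The sketch below assumes the commonly used definition M = log2(β / (1 − β)); the source does not state which definition it uses, and the small `eps` clamp is an added safeguard against β values of exactly 0 or 1.

```python
import math

def beta_to_m(beta, eps=1e-6):
    """Convert a beta value in [0, 1] to an M-value (assumed: log2 of beta / (1 - beta))."""
    b = min(max(beta, eps), 1.0 - eps)  # clamp to avoid log of 0 or division by 0
    return math.log2(b / (1.0 - b))

def m_to_beta(m):
    """Inverse transform from M-value back to beta value."""
    return 2.0 ** m / (2.0 ** m + 1.0)

# Rows are samples, columns are CpG sites (illustrative values only).
beta_matrix = [[0.5, 0.8, 0.1],
               [0.2, 0.5, 0.9]]
m_matrix = [[beta_to_m(b) for b in row] for row in beta_matrix]
```

β values are easier to interpret as a methylation fraction, while M-values are often preferred for statistical testing because their variance is more uniform across the range.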
[0021] In actual data collection, blood, tissue, or body fluid samples undergo standard procedures such as DNA extraction, bisulfite conversion, and microarray hybridization or sequencing library construction to obtain raw methylation signal intensity values. Preprocessing steps such as background correction, dye bias correction, and probe type correction then yield preliminary methylation level data suitable for analysis. This embodiment does not limit the specific data acquisition platform; it can be an Illumina 450K or EPIC (850K) microarray, or WGBS or RRBS data based on next-generation sequencing.
[0022] Step S2: Extract sequence features from the DNA methylation data and perform standardization processing.
[0023] Specifically, the raw methylation data is first mapped onto a reference genome using sequence alignment tools, and the sequence context information within a certain range around each CpG site is extracted. For example, the target CpG site is used as the center, and the sequence is extended upstream and downstream by 100 to 500 bp to obtain local sequence fragments. These sequence fragments contain sequence patterns related to various regulatory elements such as CpG islands, promoter regions, enhancer regions, and gene body regions.
[0024] In one possible implementation, sequence feature extraction combines several methods, including k-mer counting, one-hot encoding, and embedding representations based on pre-trained language models. For example, local sequence fragments are decomposed into k-mers of several lengths, such as 3-mers, 5-mers, and 7-mers, and the frequency of each k-mer in the context of the site is counted to construct a high-dimensional sparse feature vector. In parallel, the sequence itself is one-hot encoded to obtain a fixed-length position-wise representation. Furthermore, the sequence fragments can be input into a model pre-trained on large-scale genome sequences to extract high-level semantic representations.
[0025] After the sequence feature set is obtained, it is standardized. Standardization includes, but is not limited to, Z-score standardization, min-max normalization, quantile standardization, and ComBat batch-effect removal. In mixed cancer data scenarios, the ComBat algorithm is preferred for removing systematic biases caused by different batches and cancer sources, making the statistical distributions of methylation signals from different sources more similar.
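The k-mer counting and one-hot encoding described above can be illustrated with a short sketch. The function names and the example fragment are hypothetical; the embedding from a pre-trained sequence model is not shown.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def kmer_frequencies(seq, k):
    """Frequency of every possible k-mer over the overlapping windows of seq.

    Windows containing characters outside ACGT (for example N) count toward
    the total but have no slot in the output vector.
    """
    n_windows = max(len(seq) - k + 1, 1)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] / n_windows for p in product(ALPHABET, repeat=k)]

def one_hot(seq):
    """Encode each base as a 4-element indicator vector in A, C, G, T order."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]

# Hypothetical 10 bp context around a CpG site (real contexts span 100 to 500 bp).
fragment = "ACGCGTACGA"
features_3mer = kmer_frequencies(fragment, 3)  # vector of length 4**3 = 64
encoded = one_hot(fragment)                    # 10 x 4 matrix
```

The k-mer vector length grows as 4^k, which is why 5-mer and 7-mer vectors are sparse for short fragments.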
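Two of the listed standardization options can be sketched per feature column as follows. This is a simplified illustration using the standard library; quantile standardization and batch-effect removal are not shown, and the handling of constant columns (mapped to zero) is an added assumption.

```python
import statistics

def z_score(column):
    """Map a feature column to zero mean and unit variance."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    return [0.0 if sd == 0 else (x - mean) / sd for x in column]

def min_max(column):
    """Rescale a feature column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in column]

# One feature measured across three samples (illustrative values).
column = [0.1, 0.5, 0.9]
standardized = z_score(column)
rescaled = min_max(column)
```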
[0026] Step S3: Separate the tissue-specific interference portion and the cancer common signal portion to obtain the cancer common signal set.
[0027] This step is one of the core innovative aspects of this method. Its purpose is to remove background variations mainly determined by tissue origin from highly heterogeneous multi-cancer methylation data, and to preserve as many tumor-related epigenetic alteration signals as possible that are ubiquitous across cancer types.
[0028] In one embodiment, the tissue origin of each sample in the current dataset is first inferred using a large number of known normal tissue methylation reference maps. Based on the inferred tissue category labels, all samples are grouped according to the main tissue type, such as lung-derived, breast-derived, intestinal-derived, liver-derived, etc. Then, within each tissue group, the variation patterns of tissue-specific methylation sites are calculated.
[0029] Specifically, for all samples within a given tissue group, the mean and variance of the β value are calculated at each CpG site. Sites with smaller variances and stable means are initially identified as stable background sites for that tissue. This set of sites is defined as the tissue-specific interference candidate set. Furthermore, through differential analysis with cancer samples, if a site does not show significant differences between cancer samples and normal samples from the corresponding tissue, it is further confirmed as a tissue-specific interference component.
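A minimal sketch of this two-stage screening is given below. The variance cutoff and the mean-difference cutoff are hypothetical values, and the simple mean-difference check stands in for the differential analysis, whose exact form the source does not specify.

```python
import statistics

def stable_background_sites(beta_rows, max_variance=0.001):
    """Indices of sites whose beta values vary little within one tissue group.

    beta_rows: list of samples, each a list of beta values per CpG site.
    """
    n_sites = len(beta_rows[0])
    return [j for j in range(n_sites)
            if statistics.pvariance([row[j] for row in beta_rows]) <= max_variance]

def confirm_interference(candidates, cancer_rows, normal_rows, max_delta=0.05):
    """Keep candidates whose cancer and normal means are close (assumed criterion)."""
    confirmed = []
    for j in candidates:
        cancer_mean = statistics.fmean(row[j] for row in cancer_rows)
        normal_mean = statistics.fmean(row[j] for row in normal_rows)
        if abs(cancer_mean - normal_mean) < max_delta:
            confirmed.append(j)
    return confirmed

# Illustrative data for one tissue group: 2 normal and 2 cancer samples, 3 sites.
normal = [[0.80, 0.20, 0.50], [0.81, 0.22, 0.52]]
cancer = [[0.79, 0.60, 0.51], [0.80, 0.65, 0.49]]
candidates = stable_background_sites(normal + cancer)
interference = confirm_interference(candidates, cancer, normal)
```

In this example the middle site shifts strongly between normal and cancer samples, so it is excluded from the interference set and remains available as a potential cancer signal.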
[0030] On the other hand, sites that exhibit significant hypermethylation or hypomethylation across multiple cancer types are preliminarily classified as candidates for common cancer signals. For example, in multiple cancer types such as lung cancer, colorectal cancer, and pancreatic cancer, CpG sites in the promoter regions of certain tumor suppressor genes generally exhibit high methylation silencing, a pattern that is common across cancer types.
[0031] To further purify common signals, one possible approach is to use principal component analysis (PCA) to reduce the dimensionality of the standardized methylation matrix. The first few principal components are extracted, as these typically capture the largest variation directions in the data. In multi-cancer datasets, the first few principal components are often strongly correlated with tissue origin, while subsequent principal components reflect more tumor-related common changes. By analyzing the principal component loading distribution and its correlation with cancer type labels, the weights corresponding to the principal components that primarily capture tissue differences can be reduced or directly removed, thereby reconstructing a low-dimensional representation that removes most tissue-specific interference.
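The idea of removing tissue-dominated principal components can be sketched with NumPy as follows. This is an illustration under assumptions: a component is treated as tissue-driven when the absolute correlation between its scores and any one-vs-rest tissue indicator exceeds a hypothetical threshold of 0.7, and the data are synthetic.

```python
import numpy as np

def remove_tissue_components(X, tissue_labels, n_components=5, corr_threshold=0.7):
    """Project out leading principal components that track tissue of origin.

    X: samples x sites matrix. Returns (cleaned centered matrix, dropped PC indices).
    """
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    tissues = sorted(set(tissue_labels))
    indicators = np.array([[1.0 if t == u else 0.0 for u in tissues]
                           for t in tissue_labels])
    dropped = []
    for k in range(scores.shape[1]):
        corrs = [abs(np.corrcoef(scores[:, k], indicators[:, j])[0, 1])
                 for j in range(indicators.shape[1])]
        if max(corrs) > corr_threshold:
            dropped.append(k)
    cleaned = Xc.copy()
    for k in dropped:
        cleaned -= np.outer(scores[:, k], Vt[k])  # subtract this PC's contribution
    return cleaned, dropped

# Synthetic example: sites 0 to 2 are dominated by a lung-versus-liver offset.
rng = np.random.default_rng(0)
tissue = ["lung"] * 10 + ["liver"] * 10
offset = np.array([1.0 if t == "lung" else -1.0 for t in tissue])
X = rng.normal(scale=0.05, size=(20, 6))
X[:, :3] += offset[:, None] * 0.5
cleaned, dropped = remove_tissue_components(X, tissue, n_components=3)
```

Removing a component outright is the simplest option; down-weighting it instead, as the text also allows, would scale the subtracted term by a factor between 0 and 1.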
[0032] Preferably, after principal component analysis, cluster analysis is further performed on the obtained low-dimensional common representations. For example, k-means, hierarchical clustering, or graph-based spectral clustering methods can be used to group the common signals along the sample or site dimensions. In the clustering results, some clusters often correspond to known cross-cancer pathways, such as methylation patterns related to DNA repair defects, cell cycle dysregulation, and epigenetic regulatory abnormalities. These grouping results constitute the preliminary grouping of common cancer methylation signals.
[0033] It should be noted that in the process of separating tissue-specific interference from common cancer signals, known cancer gene databases and epigenetic regulatory element annotation information can be used as auxiliary methods. For example, mapping sites to reported cancer driver genes, tumor suppressor genes, and hotspot methylation regions can preferentially preserve common alteration signals within these regions, thereby improving the biological significance and interpretability of the cancer common signal set.
[0034] In one embodiment, three cancer types—lung adenocarcinoma, pancreatic ductal adenocarcinoma, and cholangiocarcinoma—are used as examples. In lung adenocarcinoma samples, certain sites in the EGFR promoter region often exhibit hypomethylation, while in pancreatic and cholangiocarcinoma samples, this region is mostly hypermethylated and silent. Through the aforementioned separation process, the signal in this region is correctly classified as tissue-specific interference rather than a common signal. For the MGMT gene promoter region, varying degrees of methylation elevation are present in all three cancer types, thus multiple sites in this region are successfully preserved in the cancer common signal set. This separation method effectively avoids interference from tissue background on subsequent cross-cancer transfer learning.
[0035] Furthermore, after obtaining the set of common cancer signals, it can optionally be validated again. For example, by comparing it with published multi-omics cancer atlas data, it can be checked whether the retained common sites also exhibit similar cross-cancer patterns in independent cohorts. If the consistency is high, the quality of the set is confirmed to be reliable.
[0036] Through the above multi-level separation strategy, the final set of common cancer signals has far fewer dimensions than the original set of methylation sites, but has a higher information density and is better suited as the input basis for subsequent transfer learning.
[0037] Step S4: Process the set of common cancer signals to obtain a transferable subset of common signals.
[0038] This step further extracts the most suitable subset for cross-cancer transfer from the set of common cancer signals, aiming to reduce the distribution difference between the source and target domains and improve the effectiveness of transfer learning.
[0039] Specifically, the first step is to adjust the weights of the small amount of tissue-specific interference remaining in the common cancer signal set using domain adaptation techniques. One possible implementation uses an adversarial domain adaptation framework, where a domain discriminator attempts to distinguish the feature distributions of source and target cancer samples, while a feature extractor tries to confuse the discriminator. By minimizing the discriminative loss and maximizing the level of confusion, the distributions of the source and target domains in the feature space are gradually made closer. After adversarial training, each common signal is assigned an adjusted weight; signals with higher weights represent greater consistency across different cancer types.
[0040] Based on the adjusted signal distribution set, a quantitative assessment of distribution differences is performed. Commonly used metrics include maximum mean discrepancy (MMD), Wasserstein distance, and KL divergence. If the calculated distribution difference value is lower than a preset threshold, the current signal subset is considered to have good transferability.
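As one example of such a distribution-difference check, the sketch below computes a biased estimate of the squared maximum mean discrepancy with an RBF kernel. The kernel choice, the `gamma` value, and the threshold of 0.05 are assumptions made for illustration; the source does not fix any of them.

```python
import math

def rbf(x, y, gamma):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of squared maximum mean discrepancy between two sample sets."""
    def mean_kernel(A, B):
        return sum(rbf(a, b, gamma) for a in A for b in B) / (len(A) * len(B))
    return (mean_kernel(source, source) + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))

def is_transferable(source, target, threshold=0.05):
    """Treat the signal subset as transferable when the discrepancy is below the threshold."""
    return mmd_squared(source, target) < threshold

# Illustrative two-dimensional signal vectors for source and target samples.
source = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.15]]
target_close = [[0.12, 0.18], [0.18, 0.12], [0.15, 0.16]]
target_far = [[0.90, 0.80], [0.80, 0.90], [0.85, 0.85]]
close_ok = is_transferable(source, target_close)
far_ok = is_transferable(source, target_far)
```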
[0041] Preferably, the transferable subset of signals is further subjected to batch normalization to unify the numerical range of signals from different cancer types. In practice, the mean and standard deviation can be calculated for each signal dimension, and then all samples are mapped to a space with zero mean and unit variance, thereby eliminating the additional bias introduced by the numerical scale.
[0042] In one embodiment, signal fragments highly correlated with cancer biological functions are further extracted based on the standardized transferable signal set. For example, by combining gene function annotation, pathway enrichment analysis, and databases of known cancer biomarkers, signals located in regulatory regions of key tumor suppressor genes, oncogenes, DNA damage repair genes, and immune checkpoint genes are screened. These signal fragments are identified as core common signal combinations.
[0043] Furthermore, signal sub-units related to specific cancer types are isolated from the core common signal combination. By calculating the signal intensity distribution of each sub-unit in a small sample of data of the target cancer type, if its average intensity or degree of variation is significantly higher than a preset standard, the sub-unit is judged to be a key signal unit.
[0044] For these key signal units, a signal correlation matrix is constructed. The rows and columns of the matrix correspond to key signal units, and the matrix elements are the covariance coefficients or mutual information values of pairs of signal units across multiple cancer samples. By performing spectral decomposition, community detection, or master graph extraction on this matrix, potential functional relationships between signals are explored. For example, if multiple key signal units form tightly connected subgraphs in the matrix, and these subgraphs correspond to the same known cancer-driving pathway, then these signals are further identified as highly correlated common modules.
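The matrix construction and grouping can be illustrated with the simplified sketch below. It swaps in Pearson correlation for the matrix entries and connected components of a thresholded graph for community detection; both are stand-ins chosen for brevity, and the cutoff of 0.8 is hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long signal vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def correlation_matrix(units):
    """units: one signal vector per key signal unit, measured across samples."""
    n = len(units)
    return [[pearson(units[i], units[j]) for j in range(n)] for i in range(n)]

def connected_modules(matrix, min_abs_corr=0.8):
    """Group units into connected components of the thresholded correlation graph."""
    n = len(matrix)
    seen, modules = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, module = [start], []
        while stack:
            i = stack.pop()
            module.append(i)
            for j in range(n):
                if j not in seen and abs(matrix[i][j]) >= min_abs_corr:
                    seen.add(j)
                    stack.append(j)
        modules.append(sorted(module))
    return modules

# Four hypothetical key signal units measured in five samples.
units = [[0.1, 0.2, 0.3, 0.4, 0.5],
         [0.2, 0.4, 0.6, 0.8, 1.0],
         [0.5, 0.1, 0.4, 0.2, 0.3],
         [0.52, 0.08, 0.41, 0.22, 0.31]]
modules = connected_modules(correlation_matrix(units))
```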
[0045] The resulting set of common signal associations not only has lower dimensionality but also retains the epigenetic alteration patterns with the strongest consistency across cancer types, providing a high-quality and highly interpretable input foundation for subsequent transfer learning.
[0046] Step S5: For a small amount of sample data of rare cancer types, the parameters of the source cancer type model are transferred to the target cancer type model through transfer learning and fine-tuned to obtain an initially adapted prediction model.
[0047] In one embodiment, common cancer types with sufficient labeled samples, such as lung adenocarcinoma, breast cancer, and colorectal cancer, are first selected as the source domain, and well-trained deep neural network models for one or more of these cancer types serve as the source model. These source models have typically undergone supervised training on thousands to tens of thousands of samples and can distinguish cancer-related signals from normal signals relatively well.
[0048] Specifically, after all signal feature data from the source cancer type are obtained, statistical tests are used to screen out the set of signal features that differ significantly in the source cancer type. For example, t-tests or Wilcoxon rank-sum tests are used to select sites or regions whose p-values fall below the threshold, with the false discovery rate controlled below 0.05. These sites usually correspond to key regulatory regions known to be frequently aberrantly methylated in this cancer type.
[0049] Subsequently, for the small sample of data from the target rare cancer type, the methylation level distribution of each signal feature in the target cancer type is calculated. The calculation covers descriptive statistics such as the median β value, interquartile range, and skewness coefficient of each retained signal in the small sample, followed by a rough comparison with the distribution of the corresponding signal in the source cancer type. If a significantly different signal feature from the source cancer type also shows a similar direction of variation or amplitude range in the small sample of the target cancer type, the signal is considered to have cross-cancer transfer potential and is retained, thus forming a transferable common signal subset.
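False discovery rate control at 0.05 is commonly done with the Benjamini-Hochberg procedure. The sketch below assumes that procedure (the source does not name one) and takes per-site p-values as given, for example from a t-test or Wilcoxon rank-sum test computed elsewhere.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of features declared significant at the given FDR.

    Finds the largest rank r such that the r-th smallest p-value is at most
    (r / m) * fdr, then accepts every feature up to that rank.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

# Hypothetical p-values for six CpG sites in the source cancer type.
p_values = [0.001, 0.04, 0.03, 0.2, 0.012, 0.8]
significant_sites = benjamini_hochberg(p_values, fdr=0.05)
```

Note that sites with raw p-values of 0.03 and 0.04 fall below 0.05 individually but are not retained once multiple testing is taken into account.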
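The retention rule can be sketched as a direction-consistency filter. Everything specific here is an assumption for illustration: the site identifiers are hypothetical, the target shift is measured as the difference of medians against a normal reference group, and `min_delta` is an arbitrary cutoff.

```python
import statistics

def transferable_features(source_delta, target_cancer, target_normal, min_delta=0.1):
    """Keep features whose shift in the target cancer type matches the source direction.

    source_delta: feature -> (cancer minus normal) beta difference in the source type
    target_cancer, target_normal: feature -> list of beta values in the target type
    """
    kept = []
    for feature, s_delta in source_delta.items():
        t_delta = (statistics.median(target_cancer[feature])
                   - statistics.median(target_normal[feature]))
        same_direction = s_delta * t_delta > 0
        if same_direction and abs(t_delta) >= min_delta:
            kept.append(feature)
    return kept

# Hypothetical site names and values.
source_delta = {"site_a": 0.30, "site_b": -0.25, "site_c": 0.20}
target_cancer = {"site_a": [0.70, 0.75, 0.80],
                 "site_b": [0.60, 0.65, 0.62],
                 "site_c": [0.31, 0.30, 0.33]}
target_normal = {"site_a": [0.40, 0.45, 0.42],
                 "site_b": [0.30, 0.35, 0.33],
                 "site_c": [0.30, 0.29, 0.32]}
subset = transferable_features(source_delta, target_cancer, target_normal)
```

Medians are used because they are less sensitive to outliers than means when only a handful of target samples are available.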
[0050] In one possible implementation, all network layer parameters of the deep learning model already trained for the source cancer are directly copied and loaded into the same newly constructed network structure for the target cancer. This network structure remains completely consistent with the source model, including the same number of convolutional layers, number of channels, fully connected layer dimensions, activation function types, etc., to ensure semantic alignment of parameter transfer.
[0051] After the parameters are loaded, the model is fine-tuned using a small number of labeled samples of the target cancer type. During fine-tuning, a small learning rate is typically used, such as 1/10 to 1/50 of the learning rate used in pre-training the source model, to avoid destroying the transferred general knowledge. In addition, some lower-level feature extraction layers can be frozen so that only the higher-level classification head and a few intermediate layers are fine-tuned, which further protects the common knowledge.
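The fine-tuning settings described above can be summarized in a small configuration helper. This is a framework-independent sketch, not the training code itself: the function name `fine_tune_plan`, the layer names, and the default of two frozen layers are hypothetical, and only the 1/10 to 1/50 learning-rate range comes from the text.

```python
def fine_tune_plan(layer_names, source_lr, lr_divisor=10, n_frozen=2):
    """Return the fine-tuning learning rate and the frozen/trainable layers.

    layer_names is ordered from the lowest feature extraction layer to the
    classification head. lr_divisor must lie in the 10 to 50 range given in
    the text; n_frozen is a hypothetical default.
    """
    if not 10 <= lr_divisor <= 50:
        raise ValueError("lr_divisor is expected to lie between 10 and 50")
    # Always leave at least the last layer (the classification head) trainable.
    n_frozen = max(0, min(n_frozen, len(layer_names) - 1))
    return {
        "learning_rate": source_lr / lr_divisor,
        "frozen": list(layer_names[:n_frozen]),
        "trainable": list(layer_names[n_frozen:]),
    }


# Hypothetical five-layer network pre-trained with a learning rate of 1e-3.
plan = fine_tune_plan(["conv1", "conv2", "conv3", "fc1", "head"],
                      source_lr=1e-3, lr_divisor=20, n_frozen=2)
```

In a deep learning framework, the `frozen` list would correspond to layers whose parameters are excluded from gradient updates, and `learning_rate` would be passed to the optimizer for the remaining layers.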
[0052] For example, in a scenario where cholangiocarcinoma is the target cancer and lung adenocarcinoma is the source cancer, the source model has already learned to recognize common hypermethylation patterns in promoter regions of certain DNA damage repair genes on lung adenocarcinoma samples. When these parameters are transferred to the cholangiocarcinoma model, even with only a few dozen labeled cholangiocarcinoma samples, the model can quickly capture similar methylation silencing patterns in cholangiocarcinoma through a few rounds of fine-tuning, thus achieving much higher initial performance on the validation set than training from scratch.
[0053] It should be noted that sample weight adjustment strategies can also be introduced during fine-tuning. For signals whose distributions differ significantly between the source and target domains, the contribution weight in the loss function can be appropriately reduced to lower the risk of negative transfer. After the above transfer and fine-tuning process, an initially adapted prediction model with preliminary adaptability to the target rare cancer type is obtained.
[0054] Step S6: Extract intermediate layer representations from the initial adapted prediction model and assign temporary labels to unlabeled samples using a pseudo-label generation mechanism, and select an enhanced training sample set based on confidence level.
[0055] In one embodiment, all available unlabeled samples of the target cancer type are first input into the initially adapted prediction model obtained in step S5, and the output vectors of the intermediate hidden layers of the model are extracted as high-level feature representations. These intermediate layer representations typically capture more abstract and discriminative tumor-related patterns than the original methylation β values.
[0056] Subsequently, a pseudo-label generation mechanism is used to assign a temporary category label to each unlabeled sample. The specific process is as follows: the intermediate layer representation is input to the classification head of the model to obtain the softmax probability distribution over the cancer-positive and cancer-negative categories for the sample. The category with the highest probability is taken as the pseudo-label of the sample, and the probability of that category is recorded as the confidence score.
[0057] A confidence score is thus obtained for the temporary label of each unlabeled sample. If the confidence score is higher than a preset threshold, such as 0.85 or 0.90, the model is considered to have made a relatively certain prediction for that sample, and the sample and its pseudo-label are retained; if the confidence score is lower than the threshold, the sample is discarded to avoid introducing too many noisy labels.
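The pseudo-labeling and confidence filtering steps above can be sketched as follows. The function names and the example logits are hypothetical; the logits stand in for the raw outputs of the classification head, and class index 1 is assumed to mean cancer positive.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(logits, threshold=0.85):
    """Return (label, confidence), or None if the sample should be discarded.

    The label is the index of the most probable class and the confidence is
    that class's softmax probability. Samples are kept only when the
    confidence is strictly higher than the threshold.
    """
    probs = softmax(logits)
    confidence = max(probs)
    if confidence > threshold:
        return probs.index(confidence), confidence
    return None


# Hypothetical classification-head outputs for two unlabeled samples.
kept = pseudo_label([0.2, 2.5])      # confident prediction, retained
dropped = pseudo_label([0.4, 0.9])   # uncertain prediction, discarded
```

The first sample receives the pseudo-label 1 with a confidence of roughly 0.91, while the second has a maximum probability of roughly 0.62 and is therefore excluded from the enhanced training sample set.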
[0058] In one possible implementation, the density information of the samples in the feature space can be further incorporated for filtering. For example, t-SNE or UMAP dimensionality reduction visualization can be performed on the intermediate layer representations of all unlabeled samples to observe whether there is an obvious clustering structure; then only those samples that fall in high-density regions and have high confidence can be retained to further reduce the risk of false labels.
[0059] Through the above screening, the retained samples and their pseudo-labels together constitute the enhanced training sample set. The sample size of this enhanced set is typically 3-10 times that of the original labeled samples, providing more sufficient training material for subsequent semi-supervised optimization.
[0060] For example, in a scenario targeting neuroblastoma, suppose there are only 30 labeled samples, while another 300 are unlabeled clinically suspected samples. The initial transfer model assigns pseudo-labels to these 300 samples, setting a confidence threshold of 0.88, ultimately retaining approximately 180 high-confidence pseudo-labeled samples. Mixing these samples with the 30 truly labeled samples significantly expands the available data for optimization, enabling the model to learn more about the epigenetic variation patterns within rare cancer types.
[0061] Step S7: The initially adapted prediction model is iteratively optimized using a semi-supervised learning framework combined with the enhanced training sample set to obtain the optimized multi-cancer adaptation model.
[0062] In one embodiment, a semi-supervised learning framework is constructed, which is jointly trained using a small number of real labeled samples and a large number of augmented pseudo-label samples. The training objective consists of two loss components: a supervised cross-entropy loss for real labeled samples, and an unsupervised consistency regularization loss or pseudo-label cross-entropy loss for pseudo-label samples.
[0063] The specific training process is as follows: first, all samples in the enhanced training sample set are passed forward through the current prediction model again to calculate the latest probability distribution of each sample over the categories, yielding the updated probability values. For pseudo-labeled samples, these updated probabilities are compared with the previously assigned temporary pseudo-labels, and the difference between the two is calculated.
[0064] Then, a weighting coefficient is calculated for each pseudo-labeled sample based on the consistency between the updated probability value and the original temporary label. A common approach is to use the square or exponential form of the maximum value in the updated probability as the weight, allowing the model to assign higher weights to samples it is "becoming more confident" in.
[0065] In one possible implementation, a sample is allowed to participate in the gradient calculation of the current round only when its weighting coefficient is higher than a certain dynamic threshold, thereby dynamically eliminating samples whose model predictions fluctuate significantly and avoiding error accumulation.
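A minimal sketch of the weighting and dynamic filtering described in the two paragraphs above is given below. Several details are assumptions that the text leaves open: a sample whose updated prediction no longer matches its pseudo-label is given zero weight, the exponential form is written as exp(p - 1) so that weights stay at or below 1, and the dynamic threshold is a linear schedule that rises from `start` to `end` over the training rounds.

```python
import math


def pseudo_weight(updated_probs, pseudo_label, mode="square"):
    """Weight of one pseudo-labeled sample from its updated probabilities."""
    top = max(updated_probs)
    # Assumption: disagreement with the original pseudo-label gives weight 0.
    if updated_probs.index(top) != pseudo_label:
        return 0.0
    if mode == "square":
        return top ** 2
    # Exponential form, scaled so that a probability of 1.0 gives weight 1.0.
    return math.exp(top - 1.0)


def dynamic_threshold(round_idx, max_rounds, start=0.5, end=0.8):
    """Hypothetical linear schedule for the minimum weight per round."""
    fraction = min(max(round_idx / max_rounds, 0.0), 1.0)
    return start + (end - start) * fraction


def select_for_gradient(weights, round_idx, max_rounds):
    """Indices of samples allowed into this round's gradient calculation."""
    threshold = dynamic_threshold(round_idx, max_rounds)
    return [i for i, w in enumerate(weights) if w > threshold]
```

Raising the threshold over time is one way to tolerate uncertain samples early on while excluding samples whose predictions still fluctuate in later rounds; other schedules would serve the same purpose.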
[0066] Furthermore, to prevent the model from overfitting to and amplifying the pseudo-label noise in the sample set, an L2 regularization term (weight decay) is added to the total loss function to constrain the magnitude of the network weights. The total loss is a weighted sum of the supervised loss, the weighted pseudo-label loss, and the L2 regularization loss.
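One way to write the total loss described above is shown below. The notation is an assumption introduced for illustration: L is the set of real labeled samples, U the set of retained pseudo-labeled samples, CE the cross-entropy, y_j a real label, ỹ_i a pseudo-label, w_i the weighting coefficient of pseudo-labeled sample i, p_θ(x) the model's predicted class distribution, θ the network weights, and λ_u and λ_2 the balancing coefficients for the pseudo-label loss and the L2 term.

```latex
\mathcal{L}_{\text{total}}
  = \underbrace{\frac{1}{|L|} \sum_{j \in L} \mathrm{CE}\bigl(y_j,\, p_\theta(x_j)\bigr)}_{\text{supervised loss}}
  + \lambda_{u}\, \underbrace{\frac{1}{|U|} \sum_{i \in U} w_i\, \mathrm{CE}\bigl(\tilde{y}_i,\, p_\theta(x_i)\bigr)}_{\text{weighted pseudo-label loss}}
  + \lambda_{2}\, \underbrace{\lVert \theta \rVert_2^2}_{\text{L2 regularization}}
```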
[0067] The backpropagation algorithm is used to update the prediction model parameters based on the constrained total loss, resulting in the optimized prediction model for the current round. This model is then used as the prediction model for the next round, and the processes of recalculating the probabilities of the augmented sample set, calculating the weighted loss, and updating the parameters are repeated.
[0068] The iterations typically last 20-100 rounds. The stopping condition can be any of the following: validation set performance no longer improves, loss converges, or the preset maximum number of iterations is reached. When the stopping condition is met, a multi-cancer adaptation model that is fully optimized on the target rare cancer type is finally obtained.
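The three stopping conditions listed above might be checked once per round with a helper such as the following. The function name, the `patience` window of five rounds, and the loss tolerance of 1e-4 are hypothetical choices; only the three conditions themselves and the 20 to 100 round range come from the text.

```python
def should_stop(val_scores, losses, round_idx, max_rounds=100,
                patience=5, loss_tol=1e-4):
    """Return the name of the first stopping condition met, or None.

    val_scores and losses hold one value per completed round, and
    round_idx is the zero-based index of the round just finished.
    """
    # Condition 1: the preset maximum number of iterations is reached.
    if round_idx + 1 >= max_rounds:
        return "max_rounds"
    # Condition 2: validation performance has not improved for `patience` rounds.
    if len(val_scores) > patience:
        recent_best = max(val_scores[-patience:])
        earlier_best = max(val_scores[:-patience])
        if recent_best <= earlier_best:
            return "no_improvement"
    # Condition 3: the loss has converged between consecutive rounds.
    if len(losses) >= 2 and abs(losses[-1] - losses[-2]) < loss_tol:
        return "loss_converged"
    return None
```

Returning the name of the triggered condition, rather than a plain boolean, makes it easy to log why training ended for a given rare cancer type.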
[0069] Preferably, during the iteration process, a small number of real labeled samples can be used as anchor points to periodically evaluate the model's performance on real labels, and the pseudo-label threshold or the pseudo-label loss weight can be dynamically adjusted based on that performance to balance noise tolerance against real generalization ability.
[0070] Through the semi-supervised iterative optimization process described above, even when labeled samples for rare cancer types are extremely scarce, the model can still make full use of the potential patterns in a large amount of unlabeled data to achieve continuous performance improvement.
[0071] Step S8: The newly input DNA methylation data is processed by the optimized multi-cancer adaptation model to separate background features from carcinogenesis features and output cancer risk prediction results.
[0072] In one embodiment, when newly collected DNA methylation data of an individual to be tested is input, the optimized multi-cancer adaptation model is first loaded for forward inference. The model outputs the probability score that the sample belongs to the high-risk class, together with the attention weight distribution over key discriminative regions.
[0073] Based on the preliminary feature classification results from the model output, a support vector machine (SVM) algorithm is further employed to separate background features from cancerous features. The SVM is used here as a secondary refinement classifier, and its training samples are pseudo-labeled high-confidence samples obtained from the correlation analysis between the model's intermediate layer representations and the final labels. The separation process aims to find a maximum margin hyperplane that can distinguish normal background signals from tumor-related abnormal signals as clearly as possible.
[0074] If the proportion of cancerous features after separation is higher than a preset threshold, for example, if the model determines that the proportion of cancer-related signals exceeds 0.25%, then the sample is preliminarily determined to have potential cancer risk, and preliminary risk assessment data is generated.
[0075] Based on the preliminary risk assessment data, the signal intensity of several specific regions in the sample's methylation data is further extracted, such as the promoter regions of tumor suppressor genes, repetitive sequence regions, and regulatory regions of immune-related genes. Distribution characteristics of the sites in these regions, such as the average methylation level, median, and coefficient of variation, are statistically analyzed.
[0076] The differences in the manifestation of cancerous features across different genomic regions are then analyzed. For example, if a consistent trend of hypermethylation is found in the promoter regions of multiple known pan-cancer-associated genes while no abnormalities are observed in tissue-specific regulatory regions, this supports the authenticity of the cancerous signal.
[0077] In one possible implementation, the aforementioned regional feature distribution data is compared with the prediction records of the optimized model on the historical validation set to check whether the risk pattern of the current sample shows similar regional distribution characteristics to previously confirmed high-risk cases. If the consistency is high, the risk confidence level is further increased.
[0078] Finally, evidence from multiple aspects, including the model's original output probability, the separation results of the support vector machine, and the consistency of the regional feature distribution, is combined to output the cancer risk prediction result of the sample to be tested. The result is usually presented as a risk probability score, a risk level, and a list of the major contributing sites or regions.
[0079] In another embodiment, the separation of background features from cancerous features can rely directly on the built-in attention mechanism of the optimized model or on a gradient-weighted activation mapping technique, rather than on an additional support vector machine. Specifically, an importance heatmap is generated by calculating the gradient contribution of each input site or region to the model output. Regions with higher importance are identified as dominated by cancerous features, while regions with lower importance are classified as background features.
[0080] If the heatmap shows that the cumulative importance of cancer-related regions exceeds a preset proportion, high-risk marker data is generated. For the high-risk marker data, the intensity values of several key methylation sites are extracted to form a site intensity set. These sites are typically predefined as hotspot sites that have been reported to be abnormal in multiple cancer types.
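The cumulative importance check above can be illustrated with a small helper. This sketch assumes that the importance heatmap has already been reduced to one score per region and that the set of cancer-related regions is predefined; the region names, scores, and the preset proportion of 0.6 are hypothetical.

```python
def cancer_importance_share(importance, cancer_regions):
    """Share of total absolute importance carried by cancer-related regions."""
    total = sum(abs(v) for v in importance.values())
    if total == 0:
        return 0.0
    cancer_total = sum(abs(importance[r]) for r in cancer_regions if r in importance)
    return cancer_total / total


def is_high_risk(importance, cancer_regions, preset_proportion=0.6):
    """Flag the sample when the cancer-related share exceeds the preset proportion."""
    return cancer_importance_share(importance, cancer_regions) > preset_proportion


# Hypothetical per-region importance scores from the heatmap.
importance = {"promoter_A": 6.0, "promoter_B": 2.0, "background_1": 2.0}
share = cancer_importance_share(importance, {"promoter_A", "promoter_B"})
```

Absolute values are used so that regions with strongly negative gradient contributions still count toward the total importance.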
[0081] Subsequently, the intensity patterns of rare cancer-related sites in the site intensity set are analyzed. For example, it is checked whether several sites simultaneously exhibit abnormally high methylation or simultaneously exhibit low methylation. Such co-occurrence patterns often correspond to abnormalities in specific functional pathways.
[0082] The degree of pattern consistency is determined by comparing the current site intensity pattern with typical patterns in historical rare cancer sample records, such as by calculating Euclidean distance, cosine similarity, or sequence similarity based on dynamic time warping. If the consistency exceeds a certain threshold, the reliability of the risk signal is confirmed to be higher, and the final output is a cancer risk prediction result including risk level and confidence interval.
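The three similarity measures named above can be sketched in plain Python as follows. The inputs are assumed to be lists of site intensity values in a fixed site order, and the dynamic time warping variant shown is the classic form with an absolute-difference cost and no window constraint, which is an assumption since the text does not specify one.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length intensity patterns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either pattern is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 0.0 if norm == 0 else dot / norm


def dtw_distance(a, b):
    """Dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    # prev[j] holds the best cost of aligning the processed part of a with b[:j].
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            curr[j] = abs(x - y) + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[-1]
```

Euclidean distance and dynamic time warping are distances (smaller means more consistent), whereas cosine similarity grows with consistency, so the threshold comparison has to be oriented accordingly for whichever measure is chosen.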
[0083] It will be understood that the two separation and interpretation methods described above can be flexibly selected or combined according to factors such as computing resources, interpretability requirements, and real-time requirements in the actual application scenario, so as to achieve good risk warning capability in screening tasks for different rare cancer types.
[0084] In actual clinical auxiliary diagnostic scenarios, for example elderly patients with suspected pancreatic cancer who are difficult to biopsy, or patients with bile duct space-occupying lesions whose imaging is abnormal but whose histological diagnosis is difficult, the peripheral blood cell-free DNA methylation profile of the patient is input and this method provides a risk stratification reference in a short time. This gives an objective, quantitative basis for deciding whether further invasive examination is needed, and the method is therefore suitable for wider adoption in primary medical institutions with limited resources.
[0085] As shown in Figure 2, in this embodiment, a deep learning-based cancer risk prediction system based on methylation includes:
The data preprocessing module, which is used to acquire DNA methylation data of various cancer types, extract sequence features from the DNA methylation data and standardize them, separate tissue-specific interference parts and cancer common signal parts, and process them to obtain a transferable common signal subset;
The transfer learning adaptation module, which is used to transfer the parameters of the source cancer type model to the target cancer type model and fine-tune them using a small amount of sample data of rare cancer types, so as to obtain an initially adapted prediction model;
The pseudo-label augmentation module, which is used to extract intermediate layer representations from the initially adapted prediction model, assign temporary labels to unlabeled samples using a pseudo-label generation mechanism, and select the enhanced training sample set based on confidence;
The semi-supervised iterative optimization module, which is used to iteratively optimize the initially adapted prediction model using a semi-supervised learning framework combined with the enhanced training sample set, so as to obtain an optimized multi-cancer adaptation model;
The risk prediction output module, which is used to process newly input DNA methylation data through the optimized multi-cancer adaptation model, separate background features from carcinogenesis features, and output cancer risk prediction results.
[0086] The data preprocessing module is specifically configured as follows: It extracts original sequence features using sequence alignment tools to obtain a sequence feature set; it standardizes the sequence feature set to obtain standardized sequence features; it separates tissue-specific interference components and cancer-common signal components from the standardized sequence features; it applies principal component analysis and cluster analysis to the cancer-common signal components to obtain cancer-common methylation signal grouping results; for the tissue-specific interference components included in the cancer-common signal set, it uses domain adaptation techniques to adjust weights and calculate distribution differences, filters out transferable signal subsets, and obtains the final common signal association set through batch normalization and correlation screening.
[0087] The semi-supervised iterative optimization module and the risk prediction output module are specifically configured as follows: a semi-supervised learning framework is used to recalculate the class probability distribution of the enhanced training sample set, and iterative optimization is performed by combining weighted loss and L2 regularization until the preset iteration conditions are met; preliminary feature classification is performed on the new input DNA methylation data, and support vector machine is used to further separate background features and cancer features, and the final cancer risk prediction result is output by combining signal intensity distribution and historical prediction records.
[0088] Through this system, this embodiment can improve the accuracy and interpretability of early risk screening for rare cancers and provide an efficient and objective auxiliary diagnostic tool for multi-cancer screening in clinical settings with limited resources.
[0089] The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and improvements without departing from the inventive concept of the present invention, and these all fall within the protection scope of the present invention.
Claims
1. A deep learning-based cancer risk prediction method based on methylation, characterized in that the method includes: obtaining DNA methylation data of various cancer types, extracting sequence features from the DNA methylation data and standardizing them, separating tissue-specific interference parts and cancer common signal parts, and processing them to obtain a transferable common signal subset; for a small amount of sample data of rare cancer types, transferring the parameters of the source cancer type model to the target cancer type model through transfer learning and fine-tuning them to obtain an initially adapted prediction model; extracting intermediate layer representations from the initially adapted prediction model, assigning temporary labels to unlabeled samples using a pseudo-label generation mechanism, and selecting an enhanced training sample set based on confidence level; iteratively optimizing the initially adapted prediction model using a semi-supervised learning framework combined with the enhanced training sample set to obtain an optimized multi-cancer adaptation model; processing newly input DNA methylation data with the optimized multi-cancer adaptation model to separate background features from carcinogenesis features and output cancer risk prediction results.
2. The deep learning-based cancer risk prediction method based on methylation according to claim 1, characterized in that extracting sequence features from the DNA methylation data, standardizing them, and separating tissue-specific interference parts and cancer common signal parts includes: extracting original sequence features using sequence alignment tools to obtain a sequence feature set; standardizing the sequence feature set to obtain standardized sequence features; separating the tissue-specific interference part from the standardized sequence features to obtain a tissue-specific interference set; extracting the cancer common signal part from the standardized sequence features to obtain a cancer common signal set; processing the cancer common signal set by principal component analysis to obtain dimensionality-reduced principal components of the common signals; performing cluster analysis on the dimensionality-reduced principal components of the common signals to obtain grouping results of cancer common methylation signals.
3. The deep learning-based cancer risk prediction method based on methylation according to claim 2, characterized in that obtaining the transferable common signal subset includes: for the tissue-specific interference part contained in the cancer common signal set, adjusting the weights using a domain adaptation technique to obtain an adjusted signal distribution set; calculating the distribution difference of the adjusted signal distribution set and, if the calculated difference value is lower than a preset threshold, determining the signals to be a transferable signal subset; applying batch normalization to the transferable signal subset to unify the signal distribution range, obtaining a standardized transferable signal group; based on the standardized transferable signal group, extracting the signal segments that are highly correlated with cancer and identifying them as the core common signal combination; separating, from the core common signal combination, signal sub-units related to specific cancer types and, if the signal intensity of a separated sub-unit is higher than a preset standard, judging it to be a key signal unit; constructing a signal correlation matrix for the key signal units and obtaining the potential relationships between signals through matrix analysis, to obtain the final common signal association set.
4. The deep learning-based cancer risk prediction method based on methylation according to claim 1, characterized in that, for a small amount of sample data of rare cancer types, transferring the parameters of the source cancer type model to the target cancer type model through transfer learning and fine-tuning them to obtain an initially adapted prediction model includes: obtaining all signal feature data of the source cancer type; screening out, by statistical testing, the set of signal features that differ significantly in the source cancer type; for the small-sample data of the target cancer type, calculating the expression level distribution of each signal feature in the target cancer type; if a signal feature that differs significantly in the source cancer type also shows expression variation in the target cancer type, retaining the signal feature to obtain the transferable common signal subset; taking the deep learning model already trained for the source cancer type and loading all of its parameters into the same network structure for the target cancer type; fine-tuning the model with the loaded parameters using the small-sample data of the target cancer type to obtain the initially adapted prediction model.
5. The deep learning-based cancer risk prediction method based on methylation according to claim 1, characterized in that extracting intermediate layer representations from the initially adapted prediction model, assigning temporary labels to unlabeled samples using a pseudo-label generation mechanism, and selecting an enhanced training sample set based on confidence level includes: obtaining the intermediate layer representations of the initially adapted prediction model; based on the intermediate layer representations, assigning temporary labels to unlabeled samples through the pseudo-label generation mechanism; calculating a confidence score for the temporary label of each unlabeled sample; if the confidence score is higher than a preset threshold, retaining the sample and its temporary label; constructing the enhanced training sample set from the retained samples and their temporary labels; updating the parameters of the initially adapted prediction model using the enhanced training sample set.
6. The deep learning-based cancer risk prediction method based on methylation according to claim 1, characterized in that iteratively optimizing the initially adapted prediction model using a semi-supervised learning framework combined with the enhanced training sample set to obtain an optimized multi-cancer adaptation model includes: loading the enhanced training sample set and the initially adapted prediction model using the semi-supervised learning framework; recalculating the class probability distribution of the samples in the enhanced training sample set using the current prediction model to obtain updated probability values; calculating the weighted loss value of each sample based on the updated probability values and the original temporary labels; if the weighted loss value is less than a preset loss threshold, retaining the sample to participate in the gradient calculation of the current round; incorporating an L2 regularization term into the total loss function to obtain the constrained total loss; updating the prediction model parameters based on the constrained total loss using the backpropagation algorithm to obtain the prediction model optimized in the current round; using the prediction model optimized in the current round as the initial prediction model for the next round, and repeating the probability recalculation process until the preset iteration round conditions are met, to obtain the optimized multi-cancer adaptation model.
7. The deep learning-based cancer risk prediction method based on methylation according to claim 1, characterized in that processing newly input DNA methylation data with the optimized multi-cancer adaptation model to separate background features from carcinogenesis features and output cancer risk prediction results includes: loading the optimized multi-cancer adaptation model to perform preliminary analysis on the newly input DNA methylation data and obtain preliminary feature classification results; based on the preliminary feature classification results, using a support vector machine algorithm to further separate background features from cancerous features and determine the boundary between them; if the proportion of cancerous features after separation is higher than a preset threshold, determining that the sample has potential risk and obtaining preliminary risk assessment data; based on the preliminary risk assessment data, obtaining the signal intensity of specific regions in the methylation data of the sample to determine the signal intensity distribution; based on the signal intensity distribution, analyzing the differences in the manifestation of cancerous features across regions to obtain regional feature distribution data; combining the regional feature distribution data with the historical prediction records of the optimized multi-cancer adaptation model to determine whether a consistent risk trend exists, and obtaining the final cancer risk prediction result.
8. A deep learning-based cancer risk prediction system based on methylation, characterized in that the system includes: a data preprocessing module, which is used to acquire DNA methylation data of various cancer types, extract sequence features from the DNA methylation data and standardize them, separate tissue-specific interference parts and cancer common signal parts, and process them to obtain a transferable common signal subset; a transfer learning adaptation module, which is used to transfer the parameters of the source cancer type model to the target cancer type model and fine-tune them using a small amount of sample data of rare cancer types, so as to obtain an initially adapted prediction model; a pseudo-label augmentation module, which is used to extract intermediate layer representations from the initially adapted prediction model, assign temporary labels to unlabeled samples using a pseudo-label generation mechanism, and select the enhanced training sample set based on confidence; a semi-supervised iterative optimization module, which is used to iteratively optimize the initially adapted prediction model using a semi-supervised learning framework combined with the enhanced training sample set, so as to obtain an optimized multi-cancer adaptation model; and a risk prediction output module, which is used to process newly input DNA methylation data through the optimized multi-cancer adaptation model, separate background features from carcinogenesis features, and output cancer risk prediction results.