Deep learning and sparse methylation based cross-platform multi-cancer diagnosis method and system

The multi-cancer prediction model built using the CatchME framework leverages deep learning and sparse methylation data to address the cross-platform adaptability and accuracy issues in existing multi-cancer detection technologies, achieving high-precision and interpretable multi-cancer detection.

CN122245437A · Pending · Publication Date: 2026-06-19 · RENJI HOSPITAL AFFILIATED TO SHANGHAI JIAO TONG UNIV SCHOOL OF MEDICINE

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
RENJI HOSPITAL AFFILIATED TO SHANGHAI JIAO TONG UNIV SCHOOL OF MEDICINE
Filing Date
2026-03-10
Publication Date
2026-06-19

Abstract

This invention provides a cross-platform multi-cancer diagnostic method and system based on deep learning and sparse methylation. Specifically, the method utilizes cfDNA methylation data from multiple sequencing platforms and achieves compatibility with sparse methylation data by constructing CpG clusters during data processing. Furthermore, the method obtains interpretable feature results from the training data using deconvolution and deep learning models for downstream classifier learning, thereby enabling cancer diagnosis and prediction. In various cancer prediction tasks, the method and system of this invention exhibit superior prediction accuracy and have broad application prospects in cancer diagnosis and treatment.

Description

Technical Field

[0001] This invention relates to deep learning, sparse cfDNA methylation, and disease typing and diagnosis. Specifically, it relates to a cross-platform multi-cancer diagnostic method and system based on deep learning and sparse methylation.

Background Technology

[0002] Malignant tumors, as one of the leading causes of death worldwide, have long posed a serious threat to human health. Clinical practice shows that the prognosis of a tumor is closely related to its stage at diagnosis; early detection and intervention can significantly improve patient survival rates and reduce treatment costs. Therefore, developing early detection technologies that are highly sensitive, specific, repeatable, and suitable for population screening has always been an important research direction in the field of tumor diagnosis. Although traditional imaging examinations and tissue biopsies play a crucial role in tumor diagnosis, they still have limitations such as low sensitivity and specificity, high invasiveness, inability to dynamically monitor, and inability to achieve simultaneous screening of multiple cancer types, making it difficult to meet the actual needs of precision medicine and population screening.

[0003] Cell-free DNA (cfDNA), as free nucleic acid fragments derived from the bloodstream, can carry molecular information reflecting tissue origin and disease status. In recent years, cfDNA-based liquid biopsy technology has gradually become an important research hotspot in the fields of early cancer screening and companion diagnostics due to its advantages such as non-invasiveness, repeatability, and ability to dynamically monitor disease progression. Among them, DNA methylation, as an important form of epigenetic regulation, exhibits stable and specific modification patterns in different tissue types and disease states. Compared with gene mutation detection, DNA methylation features usually have higher tissue origin resolution and stronger biological stability, and are therefore considered ideal molecular markers for achieving joint detection and source tracing analysis of multiple cancer types.

[0004] Some studies have attempted to use machine learning or deep learning methods to model and analyze cfDNA methylation data, but existing technologies still generally suffer from insufficient feature utilization efficiency, sensitivity to missing values, poor cross-platform adaptability, lack of ability to explain biological mechanisms, and difficulty in balancing the accuracy of multi-cancer classification and source tracing.

[0005] Therefore, there is an urgent need in this field for a new method and system that can optimize modeling of the structural characteristics of sparse cfDNA methylation data, achieve high-precision, multi-cancer detection under complex data conditions, and at the same time improve the interpretability and clinical usability of model results, thereby promoting the practical application and transformation of liquid biopsy technology in the field of early cancer screening.

Summary of the Invention

[0006] The purpose of this invention is to provide a cross-platform multi-cancer diagnostic method and system based on deep learning and sparse methylation data.

[0007] A first aspect of the present invention provides a method for constructing a multi-cancer prediction model, the method comprising the steps of:
(S1) providing a cfDNA methylation dataset, the cfDNA methylation dataset including one or more cfDNA methylation data from multiple cancers and healthy controls;
(S2) preprocessing the cfDNA methylation dataset to obtain preprocessed methylation data; constructing CpG clusters from the preprocessed methylation data according to predetermined rules and classifying the CpG clusters; and evaluating the methylation level of each tissue category according to the classification results of the CpG clusters, thereby obtaining tissue methylation maps of multiple tissues;
(S3) training the prediction model using the tissue methylation maps, wherein the prediction model includes a deconvolution module, a deep learning module, and a classifier module; in the deconvolution module, a deconvolution algorithm is applied to the tissue methylation maps to obtain a deconvolution result; in the deep learning module, a deep learning model converts the tissue methylation maps into a feature sequence and performs feature learning on the feature sequence to obtain the output of the deep learning module; and in the classifier module, a classifier is trained on the concatenation of the deconvolution result and the output of the deep learning module; and
(S4) terminating model training when the prediction model reaches a predetermined termination condition, thereby obtaining a multi-cancer prediction model that uses sparse cfDNA methylation data, namely CatchME.

[0008] In another preferred embodiment, the cancers include breast cancer, colorectal cancer, liver cancer, lung cancer, and prostate cancer.

[0009] In another preferred embodiment, the healthy control is white blood cells in a healthy / normal state.

[0010] In another preferred embodiment, the cfDNA methylation data is selected from the group consisting of: high-depth methylation data, low-depth methylation data, or combinations thereof.

[0011] In another preferred embodiment, the sequencing depth D1 of the high-depth methylation data satisfies D1 ≥ 3×, such as 3×, 6×, or 10×.

[0012] In another preferred embodiment, the sequencing depth D2 of the low-depth methylation data satisfies 0.5×≤D2<3×, such as 1× or 2×.
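The two depth ranges above can be expressed as a small helper. This is only an illustrative sketch: the function name and the "below-range" label for depths under 0.5× are assumptions, not part of the described method.

```python
def depth_category(depth: float) -> str:
    """Classify a sample by mean sequencing depth (fold coverage)."""
    if depth >= 3.0:
        # High-depth data: D1 >= 3x (for example 3x, 6x, 10x).
        return "high-depth"
    if 0.5 <= depth < 3.0:
        # Low-depth data: 0.5x <= D2 < 3x (for example 1x or 2x).
        return "low-depth"
    # Hypothetical label for depths outside both stated ranges.
    return "below-range"
```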

[0013] In another preferred embodiment, the cfDNA methylation data is selected from the group consisting of methylation sequencing data, methylation microarray data, or a combination thereof.

[0014] In another preferred embodiment, the cfDNA methylation sequencing data is based on bisulfite sequencing data.

[0015] In another preferred embodiment, the cfDNA methylation sequencing data is single-end methylation sequencing data and / or paired-end methylation sequencing data.

[0016] In another preferred embodiment, the methylation sequencing data includes reduced representation bisulfite sequencing (RRBS) data and whole-genome bisulfite sequencing (WGBS) data.

[0017] In another preferred embodiment, the methylation microarray data includes Illumina 450K, EPIC, and EPICv2.

[0018] In another preferred embodiment, in step (S2), the preprocessing includes preprocessing the methylation sequencing data and / or preprocessing the methylation microarray data.

[0019] In another preferred embodiment, preprocessing the methylation sequencing data includes the following steps: (a1) performing quality control on the methylation sequencing data; (a2) performing adapter removal and quality trimming on the data obtained in step (a1); (a3) aligning the data obtained in step (a2) to the reference genome and performing screening; and (a4) identifying methylation sites in the data obtained in step (a3) to obtain preprocessed methylation data.

In another preferred embodiment, for paired-end methylation sequencing data, a step of downsampling the paired-end methylation sequencing data is further included before step (a1).

[0020] In another preferred embodiment, for whole-genome methylation sequencing data, the following step is included before step (a4): (a4.0) Remove duplicates from the data obtained in step (a3).

[0021] In another preferred embodiment, the software used for adapter removal and quality trimming includes TrimGalore.

[0022] In another preferred embodiment, the software used for alignment to the reference genome includes Bismark.

[0023] In another preferred embodiment, in step (a3), samples with an alignment rate of less than 60% are removed.
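The conditional ordering of the sequencing preprocessing steps, together with the 60% alignment-rate screen, can be sketched as follows. The function names and step labels are illustrative assumptions; the sketch only plans the steps and does not invoke any trimming or alignment software.

```python
from typing import Dict, List

# Samples with an alignment rate below this percentage are removed in step (a3).
MIN_ALIGNMENT_RATE = 60.0


def preprocessing_plan(paired_end: bool, wgbs: bool) -> List[str]:
    """Return the ordered preprocessing steps for one sequencing dataset."""
    steps: List[str] = []
    if paired_end:
        # Paired-end data are downsampled before step (a1).
        steps.append("downsample paired-end reads")
    steps += [
        "(a1) quality control",
        "(a2) adapter removal and quality trimming",
        "(a3) alignment to the reference genome and screening",
    ]
    if wgbs:
        # Whole-genome data get duplicate removal before step (a4).
        steps.append("(a4.0) duplicate removal")
    steps.append("(a4) methylation site identification")
    return steps


def keep_aligned_samples(alignment_rates: Dict[str, float]) -> List[str]:
    """Keep sample IDs whose alignment rate (percent) is at least 60%."""
    return [s for s, r in alignment_rates.items() if r >= MIN_ALIGNMENT_RATE]
```

For example, `preprocessing_plan(paired_end=False, wgbs=False)` yields only the four steps (a1) to (a4), which corresponds to single-end RRBS-style input.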

[0024] In another preferred embodiment, preprocessing the methylation microarray data includes the following steps: (b1) filtering the probes of the methylation microarray data according to predetermined filtering criteria; (b2) normalizing the data obtained in step (b1); and (b3) annotating the data obtained in step (b2) to obtain preprocessed methylation data.

[0025] In another preferred embodiment, the predetermined filtering criteria include: (x1) removing probes with a detection p-value > 0.01; (x2) removing probes with fewer than 3 beads in at least 5% of the samples; (x3) removing all probes associated with single nucleotide polymorphisms (SNPs); (x4) removing all probes that align to multiple genomic locations; and (x5) removing all probes located on the X and Y chromosomes.
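The five probe-filtering criteria can be sketched as a single predicate. The `Probe` record and its field names are hypothetical, and treating criterion (x1) as "fails in any sample" is an assumption, since the text does not say how the detection p-value is aggregated across samples.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Probe:
    probe_id: str
    chrom: str                  # e.g. "chr1", "chrX"
    detection_p: List[float]    # detection p-value per sample
    bead_counts: List[int]      # bead count per sample
    snp_associated: bool
    multi_mapping: bool


def keep_probe(p: Probe) -> bool:
    """Return True if the probe passes criteria (x1) to (x5)."""
    # (x1) detection p-value > 0.01 (assumed: in any sample).
    if any(v > 0.01 for v in p.detection_p):
        return False
    # (x2) fewer than 3 beads in at least 5% of the samples.
    low_bead_fraction = sum(c < 3 for c in p.bead_counts) / len(p.bead_counts)
    if low_bead_fraction >= 0.05:
        return False
    # (x3) SNP-associated probes and (x4) probes aligning to multiple locations.
    if p.snp_associated or p.multi_mapping:
        return False
    # (x5) probes on the sex chromosomes.
    return p.chrom not in {"chrX", "chrY"}
```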

[0026] In another preferred embodiment, the predetermined rules include: (y1) for each CpG site targeted by a probe in the methylation microarray data, the flanking region extending 100 base pairs upstream and downstream of the site is defined as a CpG cluster, and all CpG sites in this region are assumed to have the same average methylation level as the CpG site covered by the probe; (y2) if the flanking regions of two adjacent CpG sites overlap, they are merged and defined as a single CpG cluster; and (y3) a CpG cluster must contain at least three probe-covered CpG sites.
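A minimal sketch of the cluster-construction rules, assuming probe positions on a single chromosome are given as integer coordinates. The function name and the choice to treat flanking regions that share a boundary base as overlapping are assumptions made for illustration.

```python
from typing import List, Tuple

FLANK = 100      # base pairs upstream and downstream of each probe CpG
MIN_PROBES = 3   # minimum number of probe-covered CpG sites per cluster


def build_cpg_clusters(positions: List[int]) -> List[Tuple[int, int, int]]:
    """Merge overlapping +/-100 bp windows; return (start, end, n_probes)."""
    clusters: List[Tuple[int, int, int]] = []
    for pos in sorted(positions):
        start, end = pos - FLANK, pos + FLANK
        if clusters and start <= clusters[-1][1]:
            # The window overlaps the previous cluster, so extend that cluster.
            prev_start, prev_end, n = clusters[-1]
            clusters[-1] = (prev_start, max(prev_end, end), n + 1)
        else:
            clusters.append((start, end, 1))
    # Keep only clusters with at least three probe-covered CpG sites.
    return [c for c in clusters if c[2] >= MIN_PROBES]
```

With probes at 1000, 1150, and 1300 the three windows chain into one cluster spanning 900 to 1400, while two nearby probes at 5000 and 5100 merge but are discarded for having fewer than three probes.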

[0027] In another preferred embodiment, classifying the CpG clusters includes the following steps: (c1) transforming each CpG cluster into a type-specific (TS) matrix and a type discriminant (TD) matrix, each composed of histogram vectors; (c2) calculating, from the TS matrix and the TD matrix of each CpG cluster, the TS score, the TD score, and the symbolic methylation distance; and (c3) calculating the final TS score and the final TD score from the TS score, the TD score, and the symbolic methylation distance, and classifying the CpG cluster according to the final TS score and the final TD score, wherein the CpG cluster is of type TS if the final TS score meets the corresponding criterion, and of type TD if the final TD score meets the corresponding criterion.

[0028] In another preferred embodiment, assessing the tissue-specific methylation level according to the classification results of CpG clusters specifically includes: assessing the tissue-specific CpG clusters according to the scores corresponding to the classification results of each CpG cluster, and finally integrating them into multiple tissue-specific tissue methylation maps.

[0029] In another preferred embodiment, the type-specific (TS) matrix is composed of the histogram vector of the target tissue type and the histogram vector of the other tissue types, where b is the number of histogram bins.

[0030] In another preferred embodiment, the type discriminant (TD) matrix is composed of the histogram vectors of the individual tissue types, its m-th vector being the histogram vector of the m-th tissue type, where b is the number of histogram bins.

[0031] In another preferred embodiment, b = 10.
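To make the histogram representation concrete, the sketch below bins methylation β values in [0, 1] into b = 10 bins and stacks the resulting vectors into a TS-style and a TD-style matrix. Normalizing counts to fractions and pooling all non-target tissues into one histogram are assumptions; the helper names are hypothetical.

```python
from typing import Dict, List

B = 10  # number of histogram bins


def beta_histogram(betas: List[float], bins: int = B) -> List[float]:
    """Histogram of beta values in [0, 1], normalized to fractions."""
    counts = [0] * bins
    for v in betas:
        # A value of exactly 1.0 falls into the last bin.
        counts[min(int(v * bins), bins - 1)] += 1
    total = len(betas)
    return [c / total for c in counts] if total else [0.0] * bins


def ts_matrix(betas_by_tissue: Dict[str, List[float]], target: str) -> List[List[float]]:
    """2 x b matrix: target-tissue histogram and pooled other-tissue histogram."""
    others = [v for t, vals in betas_by_tissue.items() if t != target for v in vals]
    return [beta_histogram(betas_by_tissue[target]), beta_histogram(others)]


def td_matrix(betas_by_tissue: Dict[str, List[float]]) -> List[List[float]]:
    """M x b matrix: one histogram row per tissue type."""
    return [beta_histogram(vals) for vals in betas_by_tissue.values()]
```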

[0032] In another preferred embodiment, the TS score is calculated from the TS matrix using its nuclear norm ‖·‖_* and its Frobenius norm ‖·‖_F, where M is the number of tissue types.

[0033] In another preferred embodiment, the TD score is calculated from the TD matrix using its nuclear norm ‖·‖_* and its Frobenius norm ‖·‖_F, where M is the number of tissue types.
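The two norms named above can be computed with NumPy as shown below. This sketch computes only the norms themselves and does not attempt the score formulas, which are not recoverable from this text. As general background, the nuclear norm (the sum of singular values) equals the Frobenius norm when all rows are proportional, and exceeds it as the rows become more distinct.

```python
import numpy as np


def matrix_norms(a):
    """Return (nuclear norm, Frobenius norm) of a 2-D matrix."""
    a = np.asarray(a, dtype=float)
    nuclear = float(np.linalg.norm(a, ord="nuc"))    # sum of singular values
    frobenius = float(np.linalg.norm(a, ord="fro"))  # sqrt of sum of squares
    return nuclear, frobenius


# Identical histogram rows (rank one): both norms coincide.
same = matrix_norms([[0.5, 0.5], [0.5, 0.5]])
# Non-overlapping histogram rows: the nuclear norm is larger.
distinct = matrix_norms([[1.0, 0.0], [0.0, 1.0]])
```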

[0034] In another preferred embodiment, the symbolic methylation distance is calculated from the β value of tissue type i and the β value of tissue type j, finally yielding a symmetric distance matrix between the different tissue categories.

[0035] In another preferred embodiment, the symmetric distance matrix between the different tissue categories is represented as a matrix whose entry in row i and column j is the symbolic methylation distance between tissue types i and j.

[0036] In another preferred embodiment, the final TS score is calculated using the minimum value and the maximum value of the symbolic methylation distance d.

[0037] In another preferred embodiment, the final TD score is calculated using the maximum value of the symbolic methylation distance d.

[0038] In another preferred embodiment, the deconvolution algorithm is Nu support vector machine regression (Nu-SVR).

[0039] In another preferred embodiment, the hyperparameter nu in the Nu support vector machine regression is taken from [0.05, 0.1, 0.15, 0.25, 0.5, 0.75].

[0040] In another preferred embodiment, the hyperparameter C in the Nu support vector machine regression is taken from [0.1, 0.25, 0.5, 0.75, 1, 5, 10, 30, 50, 100].
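A hedged sketch of Nu-SVR deconvolution over the stated hyperparameter grids, using scikit-learn's `NuSVR`. The linear kernel, the clipping of negative coefficients, the normalization to fractions, and the selection of the best grid point by reconstruction error are all assumptions borrowed from common SVR-based deconvolution practice; they are not confirmed by the text. The function names are hypothetical.

```python
import numpy as np
from sklearn.svm import NuSVR

NU_GRID = [0.05, 0.1, 0.15, 0.25, 0.5, 0.75]
C_GRID = [0.1, 0.25, 0.5, 0.75, 1, 5, 10, 30, 50, 100]


def deconvolve(reference, sample, nu=0.5, C=1.0):
    """Estimate tissue fractions for one sample.

    reference: (n_clusters, n_tissues) tissue methylation map
    sample:    (n_clusters,) methylation levels of the cfDNA sample
    """
    model = NuSVR(kernel="linear", nu=nu, C=C)
    model.fit(reference, sample)
    # Assumption: negative weights are set to zero, then weights sum to one.
    weights = np.clip(model.coef_.ravel(), 0.0, None)
    total = weights.sum()
    return weights / total if total > 0 else weights


def best_fit(reference, sample):
    """Scan the nu/C grids and keep the fit with the lowest reconstruction RMSE."""
    best = None
    for nu in NU_GRID:
        for C in C_GRID:
            fractions = deconvolve(reference, sample, nu=nu, C=C)
            rmse = float(np.sqrt(np.mean((reference @ fractions - sample) ** 2)))
            if best is None or rmse < best[0]:
                best = (rmse, nu, C, fractions)
    return best
```

Calling `best_fit(reference, sample)` returns the RMSE, the selected `nu` and `C`, and the estimated fraction vector for that sample.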

[0041] In another preferred embodiment, the deep learning model is selected from the group consisting of: gated convolutional neural networks (GCNN), convolutional neural networks (CNN), XGBoost, support vector machines, or random forests.

[0042] In another preferred embodiment, the deep learning model is a gated convolutional neural network.

[0043] In another preferred embodiment, the deep learning module specifically includes: converting the tissue methylation map into a feature sequence using a gated convolutional neural network, and performing convolution on the feature sequence; wherein the gated convolutional neural network includes multiple gated convolutional layers; in each gated convolutional layer, the feature sequence is convolved by two parallel convolutions to obtain two convolution results; the two convolution results are merged to obtain a merged convolution result Z; the merged convolution result Z is pooled to obtain a gated output; and the gated outputs of the multiple gated convolutional layers are pooled to finally obtain the output of the deep learning module.

[0044] In another preferred embodiment, the feature sequence satisfies $X \in \mathbb{R}^{L \times d}$, where L is the number of CpG clusters and d is the feature dimension.

[0045] In another preferred embodiment, the two parallel convolutions include: (d1) a first parallel convolution, which applies the ReLU activation function to obtain the first convolution result H; and (d2) a second parallel convolution, which applies the sigmoid activation function to obtain the second convolution result G.

[0046] In another preferred embodiment, the first convolution result H in (d1) is calculated as follows: $H = \mathrm{ReLU}(W_h * X + b_h)$, where $W_h$ is a learnable convolution kernel, $X$ is the feature sequence, $b_h$ is the bias term, $*$ is the convolution operation, and $\mathrm{ReLU}(\cdot)$ is the activation function.

[0047] In another preferred embodiment, the second convolution result G in (d2) is calculated as follows: $G = \sigma(W_g * X + b_g)$, where $W_g$ is a learnable convolution kernel, $X$ is the feature sequence, $b_g$ is the bias term, $*$ is the convolution operation, and $\sigma(\cdot)$ is the sigmoid activation function.

[0048] In another preferred embodiment, the pooling in each gated convolutional layer is max pooling.

[0049] In another preferred embodiment, adaptive max pooling is performed on multiple gated outputs.

[0050] In another preferred embodiment, element-wise multiplication is used to merge the two convolution results.

[0051] In another preferred embodiment, the merged convolution result Z is calculated as follows: $Z = H \odot G$, where H is the first convolution result, G is the second convolution result, and $\odot$ is the Hadamard (element-wise) product.
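The gated convolutional layer described above (a ReLU branch H, a sigmoid branch G, their element-wise product Z, then max pooling) can be sketched in plain NumPy. The kernel size, pooling width, unpadded ("valid") convolution, and function names are illustrative assumptions, and the convolution is implemented as cross-correlation, as is usual in deep learning libraries.

```python
import numpy as np


def conv1d(x, w, b):
    """Valid 1-D convolution along the sequence axis.

    x: (L, d_in) feature sequence; w: (k, d_in, d_out) kernel; b: (d_out,) bias.
    """
    k = w.shape[0]
    out = np.empty((x.shape[0] - k + 1, w.shape[2]))
    for t in range(out.shape[0]):
        # Contract the (k, d_in) window against the kernel's first two axes.
        out[t] = np.tensordot(x[t:t + k], w, axes=([0, 1], [0, 1])) + b
    return out


def gated_conv_layer(x, w_h, b_h, w_g, b_g, pool=2):
    """One gated layer: Z = ReLU(conv_h(x)) * sigmoid(conv_g(x)), then max pooling."""
    h = np.maximum(conv1d(x, w_h, b_h), 0.0)         # first branch, ReLU
    g = 1.0 / (1.0 + np.exp(-conv1d(x, w_g, b_g)))   # second branch, sigmoid gate
    z = h * g                                        # Hadamard product
    n = (z.shape[0] // pool) * pool                  # drop a trailing remainder
    return z[:n].reshape(-1, pool, z.shape[1]).max(axis=1)


def global_max_pool(z):
    """Adaptive max pooling to length 1 (assumed output size): one value per channel."""
    return z.max(axis=0)
```

Because the sigmoid gate lies strictly between 0 and 1, each element of Z is a damped copy of the corresponding ReLU activation, which lets the layer suppress uninformative CpG clusters.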

[0052] In another preferred embodiment, the deep learning module further includes optimization of the deep learning module.

[0053] In another preferred embodiment, the deep learning module is optimized using a Bayesian optimization algorithm.

[0054] In another preferred embodiment, the classifier employs an algorithm selected from the group consisting of: random forest, support vector machine, or linear regression.

[0055] In another preferred embodiment, the classifier is a random forest classifier.

[0056] In another preferred embodiment, the random forest classifier is configured with 220 decision trees.

[0057] In another preferred embodiment, the maximum depth of the decision trees in the random forest classifier is 7.
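A minimal sketch of the classifier configuration, using scikit-learn's `RandomForestClassifier` with 220 trees and a depth limit of 7. Feeding the classifier the column-wise concatenation of the deconvolution result and the deep learning output follows the description of the classifier module; the function name, array shapes, and fixed random seed are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def train_classifier(deconv_features, dl_features, labels, seed=0):
    """Fit a random forest on concatenated deconvolution and deep learning features.

    deconv_features: (n_samples, n_tissues) estimated tissue fractions
    dl_features:     (n_samples, n_channels) deep learning module output
    labels:          (n_samples,) class labels
    """
    x = np.concatenate([deconv_features, dl_features], axis=1)
    clf = RandomForestClassifier(n_estimators=220, max_depth=7, random_state=seed)
    clf.fit(x, labels)
    return clf
```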

[0058] In another preferred embodiment, the classifier module further includes optimization of the classifier module.

[0059] In another preferred embodiment, the classifier module is optimized using a grid search.

[0060] In another preferred embodiment, the method further includes optimizing the prediction model.

[0061] In another preferred embodiment, the prediction model is optimized using the Adam optimizer.

[0062] In another preferred embodiment, the initial learning rate of the Adam optimizer is 1e-4.

[0063] In another preferred embodiment, the initial learning rate is adjusted using cosine annealing.
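Cosine annealing of the learning rate from the stated initial value of 1e-4 can be written in closed form. The schedule length `total_steps`, the floor `lr_min = 0`, and the function name are assumptions for illustration, since the text gives only the initial rate and the annealing type.

```python
import math


def cosine_annealed_lr(step: int, total_steps: int,
                       lr0: float = 1e-4, lr_min: float = 0.0) -> float:
    """Learning rate at `step` under a single cosine decay from lr0 to lr_min."""
    step = min(max(step, 0), total_steps)  # clamp to the schedule range
    cosine = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr0 - lr_min) * (1.0 + cosine)
```

The rate starts at `lr0`, passes through the midpoint between `lr0` and `lr_min` halfway through the schedule, and reaches `lr_min` at the final step.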

[0064] A second aspect of the present invention provides a multi-cancer prediction system, the system comprising: an input module configured to input data, the data including one or more cfDNA methylation data from one or more tissues of a subject to be tested; a prediction module configured with a prediction model that makes a prediction for the subject based on the one or more cfDNA methylation data from the one or more tissues, thereby obtaining a prediction result, wherein the prediction includes: (i) predicting whether the subject has cancer / inflammation; and / or (ii) predicting the cancer type of the subject; and / or (iii) predicting the cancer stage of the subject; and the prediction model is constructed using the method described in the first aspect of the present invention; and an output module configured to output the prediction result of the prediction module.

[0065] In another preferred embodiment, the methylation data is selected from the group consisting of: high-depth methylation data, low-depth methylation data, or combinations thereof.

[0066] In another preferred embodiment, the sequencing depth D1 of the high-depth methylation data satisfies D1 ≥ 3×, such as 3×, 6×, or 10×.

[0067] In another preferred embodiment, the sequencing depth D2 of the low-depth methylation data satisfies 0.5×≤D2<3×, such as 1× or 2×.

[0068] In another preferred embodiment, the methylation data is selected from the group consisting of methylation sequencing data, methylation microarray data, or combinations thereof.

[0069] In another preferred embodiment, the methylation sequencing data is based on bisulfite sequencing data.

[0070] In another preferred embodiment, the methylation sequencing data is single-end methylation sequencing data and / or paired-end methylation sequencing data.

[0071] In another preferred embodiment, the methylation sequencing data includes reduced representation bisulfite sequencing (RRBS) data and whole-genome bisulfite sequencing (WGBS) data.

[0072] In another preferred embodiment, the methylation microarray data includes Illumina 450K, EPIC, and EPICv2.

[0073] In another preferred embodiment, the subjects to be tested include: healthy individuals, individuals with undiagnosed / suspected cancer / inflammation, cancer patients, and patients with inflammation.

[0074] In another preferred embodiment, the one or more tissues are selected from the group consisting of: breast, colon, rectum, liver, lung, prostate, or combinations thereof.

[0075] It should be understood that, within the scope of this invention, the above-described technical features of this invention and the technical features specifically described below (such as in the embodiments) can be combined with each other to form new or preferred technical solutions. Due to space limitations, they will not be described in detail here.

Attached Figure Description

[0076] Figure 1 shows the CatchME architecture. (A) Tissue-specific methylation cluster screening method; (B) Cross-platform data collection and processing; (C) Deconvolution model (Block 1); (D) Deep learning model for cancer risk prediction (Block 2); (E) Multi-cancer classification model, whose diagnosis relies on the two modules shown in panels C and D.

[0077] Figure 2 shows the performance of the deconvolution model on simulated data. (A) Comparison of deconvolution performance between TS markers alone and TS markers combined with TD markers; (B) Comparison between the matrixNorm method and the traditional ANOVA method; (C) Performance of NuSVR and NNLS deconvolution on simulated data with missing values; (D) Performance of the deconvolution methods on simulated data for each tissue.

[0078] Figure 3 shows the performance of the multi-cancer diagnostic model on patient samples from a public database. (A) ROC curves comparing the multi-cancer diagnostic model, the cancer risk prediction model, and other standard machine learning algorithms in the binary cancer detection task; (B) ROC curves of the multi-cancer diagnostic model for multi-cancer diagnosis on patient samples; (C) Confusion matrix of the overall detection performance of the multi-cancer diagnostic model in distinguishing cancer from non-cancer samples when patients with multiple cancer types and non-cancer controls are included in the analysis; (D) Confusion matrix of the multi-cancer diagnostic model in the multi-cancer detection task; (E) Detection sensitivity for breast cancer at different clinical stages: Stage I (N=19), Stage II (N=50), Stage III (N=15).

[0079] Figure 4 shows the performance of the multi-cancer diagnostic model on samples with different sequencing depths. (A) ROC curves of the multi-cancer diagnostic model on samples with sequencing depths of 0.5-6×; (B) Confusion matrix of the multi-cancer diagnostic model on samples with sequencing depths of 0.5-6×.

[0080] Figure 5 shows the performance of the tissue deconvolution model on patient samples. (A-E) Predicted tissue-of-origin proportions of cfDNA in patients with breast, colorectal, liver, lung, and prostate cancer compared to controls. P-values were calculated using a two-tailed t-test; (F) ROC curve analysis distinguishing patients with the five cancer types from controls based on the predicted tissue-of-origin proportions.

Detailed Implementation

[0081] Through extensive and in-depth research, the inventors have developed a novel CatchME framework. This framework primarily comprises data processing, a main model, and a diagnostic classification module. In data processing, the CatchME framework creatively constructs CpG clusters based on various methylation data, enabling it to be compatible with sparse cfDNA methylation data. In the main model, the CatchME framework combines a deconvolution model with a deep learning model based on a gated convolutional neural network to achieve the acquisition and learning of interpretable features. In the diagnostic classification module, the CatchME framework employs a random forest classifier to diagnose and predict cancer based on the results of the main model. Experimental results show that the CatchME framework can achieve cancer prediction across multiple cancer types and platforms, and its robustness is significantly superior to many existing cancer prediction models. Based on this, the present invention was completed.

[0082] It should be understood that the specific methods and experimental conditions of the invention described below in varying degrees of detail are intended to provide a substantive understanding of the invention. Definitions of certain terms used in this specification are provided below. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.

Terms

[0083] As used herein, the terms “comprising” or “including” can be open-ended, semi-closed, or closed-ended. In other words, the terms also include “consisting essentially of” or “consisting of”.

[0084] As used herein, the term “and / or” refers to and covers any and all possible combinations of one or more of the related listed items.

[0085] As used in this article, the term "significant" means that, in a hypothesis test, the observed effect (such as the difference between the experimental and control groups) is unlikely to be caused solely by random error. A hypothesis test includes: the null hypothesis (H0), which assumes that the observed effect does not exist (such as no difference between the experimental and control groups); the p-value, which is the probability of observing the current or more extreme effect when H0 is true; and the significance threshold (α). The significance threshold is typically used to determine whether a hypothesis test is significant. Generally, the significance threshold is 0.05. If the p-value ≤ α, then H0 is rejected, meaning the observed effect exists, and the result is called "significant."

[0086] The computer system is equipped with at least one processor and a memory. The processor invokes a sequence of computer-executable instructions stored in the memory to implement the evaluation process defined in the claims. Although the flowchart describes the operation steps in a specific logical order, in actual execution, the steps may be processed in parallel, their order adjusted, or partially omitted in some cases. As long as such adjustments do not deviate from the core features of the technical solution described in the claims and achieve the same technical effect, they all fall within the scope of protection of this invention. This flexibility in execution order is determined by the programmable nature of computer instructions.

Methylation Data

[0087] As used herein, the terms "multi-platform methylation data" and "cross-platform methylation data" are used interchangeably, referring to methylation data from multiple sequencing platforms, specifically including methylation sequencing data and methylation microarray data. Methylation sequencing data includes whole-genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS) data; methylation microarray data includes Infinium Human Methylation 450 (Illumina 450K, 450K), Methylation EPIC v1.0 BeadChips (EPIC, 850K), and Infinium Methylation EPIC v2.0 BeadChip (EPICv2, 935K), among others.

[0088] As used herein, the terms "sparse methylation," "sparse methylated data," "sparse cfDNA," and "sparse data" are used interchangeably and refer to data with insufficient sequencing depth. In such data, the reads covering some sites or regions are missing or too few, so information about these sites or regions is lost during quality control and other steps of downstream analysis, which biases the downstream analysis results.

CatchME

[0089] This invention provides a multi-cancer prediction model, CatchME. This framework combines interpretability, cross-platform applicability, and robustness to sparse data, enabling rapid prediction. The method for constructing CatchME includes the following steps: (S1) Provide a cfDNA methylation dataset, which includes one or more cfDNA methylation data from various cancers and healthy controls.

[0090] Preferably, the cancer includes breast cancer, colorectal cancer, liver cancer, lung cancer, and prostate cancer. Preferably, the healthy control is white blood cells in a healthy / normal state. This invention utilizes data from healthy controls to capture tissue-specific cfDNA released due to tissue damage caused by cancer, thereby providing interpretable biological characteristics for subsequent processes.

[0091] The cfDNA methylation data can be high-depth methylation data or low-depth methylation data; it can be methylation sequencing data or methylation microarray data; it can be paired-end methylation sequencing data or single-end methylation sequencing data. Preferably, the sequencing depth D1 of the high-depth methylation data satisfies D1 ≥ 3×, such as 3×, 6×, or 10×. Preferably, the sequencing depth D2 of the low-depth methylation data satisfies 0.5× ≤ D2 < 3×, such as 1× or 2×. The processing in step (S2) makes CatchME particularly suitable for low-depth methylation data.

[0092] (S2) The cfDNA methylation dataset is preprocessed to obtain preprocessed methylation data; CpG clusters are constructed according to predetermined rules based on the preprocessed methylation data, and the CpG clusters are classified; the methylation level of each tissue category is evaluated according to the classification results of the CpG clusters, thereby obtaining tissue methylation maps of multiple tissues.

[0093] The preprocessing methods for the cfDNA methylation dataset are conventional techniques in the field, typically including quality control, adapter removal, quality trimming, alignment, screening, and identification of methylation sites. Methods, platforms, and software capable of achieving the above objectives are all within the scope of this invention. The screening thresholds involved can be set manually according to research needs. For methylation sequencing data, after aligning the methylation sequencing data to the reference genome, samples with an alignment rate below 60% are preferably removed. For methylation microarray data, preferably, probes with a detection p-value > 0.01, and / or probes with a bead count < 3 in at least 5% of samples, and / or all probes associated with single nucleotide polymorphisms (SNPs), and / or all probes that align to multiple genomic locations, and / or all probes located on the X and Y chromosomes are removed.

[0094] CatchME achieves compatibility with low-depth sequencing data by constructing CpG clusters. As used herein, the terms "CpG cluster" and "CpG clustering" are used interchangeably, both referring to a set of multiple CpG sites constructed using predetermined rules. The predetermined rules for constructing CpG clusters include: (y1) for each probe-targeted CpG site in the methylation microarray data, defining the flanking region extending 100 base pairs upstream and downstream of the site as a CpG cluster, and assuming that all CpG sites within this region have the same average methylation level as the probe-covered CpG site; (y2) if the flanking regions of two adjacent CpG sites overlap, merging them and defining them as a single CpG cluster; and (y3) requiring a CpG cluster to contain at least three probe-covered CpG sites.

[0095] Preferably, classifying the CpG clusters includes the steps of: (c1) converting each CpG cluster into a type-specific (TS) matrix and a type discriminant (TD) matrix, each composed of histogram vectors; (c2) calculating, from the TS matrix and the TD matrix of each CpG cluster, the TS score, the TD score, and the symbolic methylation distance; and (c3) calculating the final TS score and the final TD score from the TS score, the TD score, and the symbolic methylation distance, and classifying the CpG cluster according to the final TS score and the final TD score, wherein the CpG cluster is of type TS if the final TS score meets the corresponding criterion, and of type TD if the final TD score meets the corresponding criterion. Based on the score corresponding to the classification result of each CpG cluster, the tissue-specific CpG clusters are evaluated and finally integrated into tissue methylation maps of multiple tissues.

[0096] (S3) The prediction model is trained using the tissue methylation map. The prediction model includes a deconvolution module, a deep learning module, and a classifier module. In the deconvolution module, a deconvolution algorithm is used to deconvolve the tissue methylation map to obtain the deconvolution result. In the deep learning module, the tissue methylation map is converted into a feature sequence using a deep learning model, and feature learning is performed on the feature sequence to obtain the output of the deep learning module. In the classifier module, the classifier is trained on the concatenation of the deconvolution result and the output of the deep learning module.

[0097] Preferably, the deconvolution algorithm is Nu support vector machine regression (Nu-SVR).

[0098] Preferably, the deep learning model is selected from the group consisting of: Gated Convolutional Neural Networks (GCNN), Convolutional Neural Networks (CNN), XgBoost, Support Vector Machines, or Random Forests. This invention uses multiple deep learning models to construct CatchME, among which the model using GCNN exhibits the best performance in specificity, accuracy, sensitivity, and F1-score; therefore, GCNN is ultimately selected as the deep learning model. GCNN enables the acquisition and learning of interpretable features. The deep learning module using GCNN specifically includes: using GCNN to convert the tissue methylation map into a feature sequence, and convolving the feature sequence; wherein the gated convolutional neural network includes multiple gated convolutional layers, and in each gated convolutional layer, two parallel convolutions are used to convolve the feature sequence to obtain two convolutional results; the two convolutional results are merged to obtain a merged convolutional result Z; the merged convolutional result Z is pooled to obtain a gated output; and the multiple gated outputs of the multiple gated convolutional layers are pooled to finally obtain the output of the deep learning module. Preferably, the two parallel convolutions include: (d1) a first parallel convolution using the ReLU function; and (d2) a second parallel convolution using the sigmoid function. Preferably, the two pooling operations include: a first pooling operation using max pooling; and a second pooling operation using adaptive max pooling.

[0099] Preferably, the classifier employs an algorithm selected from the group consisting of: random forest, support vector machine, or linear regression. Preferably, the classifier is a random forest classifier.

[0100] Furthermore, during model building and training, each module of the model is optimized. Preferably, the deep learning module is optimized using a Bayesian optimization algorithm. Preferably, the classifier module is optimized using a grid search algorithm. The overall model is optimized using the Adam optimizer.

[0101] (S4) When the prediction model reaches the predetermined termination condition, the model training is terminated, thereby obtaining a multi-cancer prediction model using sparse cfDNA methylation data, namely CatchME.

[0102] Validated results show that CatchME can effectively predict / diagnose breast cancer, colorectal cancer, liver cancer, lung cancer, and prostate cancer, and its results are not affected by non-cancerous features (such as inflammation). Furthermore, CatchME can also effectively predict / diagnose different cancer stages.

The system of the present invention

[0103] This invention provides a multi-cancer prediction system. The system can be embodied as an electronic device, including but not limited to: smartphones, tablets, personal computers, servers, terminals, or other intelligent terminals with data processing capabilities. The system can also be embodied as a computer-readable storage medium, such as a disk, optical disk, solid-state drive, read-only memory, or flash memory, on which a computer program is stored. When the program is executed by one or more processors, it can implement the data processing flow. Furthermore, the system can also be embodied as a computer program product containing computer instructions that, when executed by a computer, cause the computer to perform all or part of the steps of the data processing flow.

[0104] The data processing flow can be loaded, deployed, or run on any of the aforementioned electronic devices. Through the collaboration of hardware resources and the software logic defined in the flow, the system can complete specific data acquisition, transmission, calculation, analysis, storage, or presentation tasks.

[0105] The main advantages of this invention include: (1) The CatchME of the present invention can use a variety of methylation data, including shallow WGBS, shallow RRBS and microarray data, to achieve cross-platform and rapid cancer prediction, and has universality.

[0106] (2) This invention reduces computational complexity and noise by introducing CpG clusters into methylation data processing, and is compatible with the analysis and prediction of low-depth sequencing data.

[0107] (3) This invention combines deconvolution model and deep learning model to achieve interpretable diagnosis and prediction of multiple cancer types with high accuracy and sensitivity.

[0108] (4) When deconvolving the cfDNA methylation map, the CatchME of the present invention uses normal tissue as reference data to capture tissue-specific cfDNA released due to tissue damage caused by cancer, thereby providing tissue component features with biological interpretability.

[0109] The present invention will be further illustrated below with reference to specific embodiments. Those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. It should be understood that these embodiments are for illustrative purposes only and are not intended to limit the scope of the invention. It is also understood that the purpose of describing the present invention in conjunction with the embodiments is to cover other options or modifications that may be derived based on the claims of the present invention. To provide a deep understanding of the invention, many specific details will be included in the following description. The invention may also be practiced without using these details. Furthermore, to avoid confusion or obscuring the focus of the invention, some specific details will be omitted in the description.

Example 1: CatchME build process

[0110] The CatchME framework of this invention can be used for multi-cancer diagnosis based on sparse cfDNA methylation profiles. This framework combines interpretability, cross-platform applicability, and robustness to sparse data. It enables rapid prediction on various platforms, including shallow WGBS, shallow RRBS, and microarrays (Illumina 450K, EPIC).

[0111] CatchME consists of three modules ( Figure 1 ): (1) Tissue-origin deconvolution module: using normal tissue reference data, it captures tissue-specific cfDNA released due to tissue damage caused by cancer, thereby providing biologically interpretable tissue component features; (2) Feature learning and cancer diagnosis module: Using a deep learning model, cancerous samples and non-cancer samples can still be robustly distinguished under sparse data conditions.

[0112] (3) Classifier: The tissue proportion information is directly concatenated with the cancer detection probability to generate the final multi-cancer classification result.

[0113] Specifically, the CatchME build process is as follows: S1. Data Preparation and Processing: The training cohort includes methylation data from breast cancer, colorectal cancer, liver cancer, lung cancer, prostate cancer, and leukocytes, with leukocytes serving as healthy controls. CatchME can use methylation sequencing data and/or methylation microarray data. Methylation sequencing data include reduced representation bisulfite sequencing (RRBS) data, whole-genome bisulfite sequencing (WGBS) data, etc.; methylation microarray data include 450K, EPIC, EPICv2, etc.

[0114] For bisulfite sequencing data, the following data processing methods were employed: Initial quality assessment of the raw sequencing data was performed using FastQC (v0.12.1), followed by adapter removal and quality trimming using TrimGalore (v0.6.10). This process followed the library preparation kit manufacturer's recommendations and was fine-tuned based on internal empirical testing to optimize read quality and alignment rates. Processed reads were aligned to the GRCh38 (hg38) reference genome (default parameters) using Bismark (v0.24.2). Samples with alignment rates below 60% were systematically excluded from downstream analysis. For WGBS data, duplicates were removed using deduplicate_bismark after alignment, while no duplicates were removed for RRBS data. Methylation sites were identified using bismark_methylation_extractor, and the generated coverage file was subsequently processed using a custom Python script to generate CpG cluster-specific methylation profiles. The complete analysis workflow was implemented using an internally developed Snakemake pipeline.
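
The custom script that converts the Bismark coverage file into CpG cluster-specific methylation profiles is not disclosed in detail. The following is a minimal sketch only, assuming the standard six-column Bismark coverage format (chromosome, start, end, percent methylation, methylated count, unmethylated count) and assuming that the cluster-level β value is obtained by pooling methylated and total read counts over all CpGs that fall inside a cluster. The function name `cluster_betas` and the cluster data layout are hypothetical.

```python
import bisect
from collections import defaultdict


def cluster_betas(cov_lines, clusters):
    """Pool Bismark coverage records into per-cluster beta values.

    cov_lines: iterable of tab-separated coverage lines
               (chrom, start, end, pct_meth, count_meth, count_unmeth).
    clusters:  dict chrom -> sorted list of non-overlapping
               (start, end, cluster_id) intervals (hypothetical layout).
    Returns dict cluster_id -> beta; clusters without reads are absent.
    """
    starts = {c: [s for s, _, _ in ivs] for c, ivs in clusters.items()}
    meth = defaultdict(int)
    total = defaultdict(int)
    for line in cov_lines:
        chrom, pos, _end, _pct, n_meth, n_unmeth = line.rstrip("\n").split("\t")
        if chrom not in clusters:
            continue
        pos = int(pos)
        # Locate the last cluster whose start is <= pos.
        i = bisect.bisect_right(starts[chrom], pos) - 1
        if i < 0:
            continue
        start, end, cid = clusters[chrom][i]
        if start <= pos <= end:
            meth[cid] += int(n_meth)
            total[cid] += int(n_meth) + int(n_unmeth)
    # Assumption: beta = pooled methylated reads / pooled total reads.
    return {cid: meth[cid] / total[cid] for cid in total if total[cid] > 0}
```

Pooling read counts in this way lets a cluster receive a β value even when each individual CpG inside it is covered by only one or two reads.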

[0115] For paired-end sequencing data, the `sample` submodule in SeqKit was used to downsample the data. The same random seed was applied to the FASTQ files of read1 and read2 to ensure correct read pairing. Downsampling was performed using the multi-process script `down_sample.py`, which can simultaneously subsample hundreds of millions of reads to effectively reduce sequencing depth. This method, by combining a fixed seed and paired-end processing, allows for flexible adjustment of sequencing coverage for downstream analysis while maintaining read pairing.
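
The reason a shared random seed preserves read pairing can be illustrated with a short sketch. This is not the `down_sample.py` script or SeqKit itself; it is a simplified stand-in, assuming each FASTQ file has already been parsed into a list of records in the same order. The function name `subsample_fastq` is hypothetical.

```python
import random


def subsample_fastq(records, proportion, seed):
    """Keep each FASTQ record with probability `proportion`.

    The keep/drop decision for record i depends only on the seed and on i,
    so two files with the same number of records, processed with the same
    seed, retain exactly the same record indices.
    """
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() < proportion]


# Usage: read1 and read2 stay paired because both passes use seed=11.
read1 = [("@r%d/1" % i, "ACGT", "+", "IIII") for i in range(1000)]
read2 = [("@r%d/2" % i, "TGCA", "+", "IIII") for i in range(1000)]
kept1 = subsample_fastq(read1, 0.3, seed=11)
kept2 = subsample_fastq(read2, 0.3, seed=11)
```

Because the two passes consume identical random streams, no explicit bookkeeping of read identifiers is needed to keep mates together.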

[0116] For the microarray data, the following methods were used for data processing: methylation microarray data were processed using the ChAMP (v2.30.0) package in R. Probe filtering was performed according to the following criteria: (1) probes with a p-value > 0.01; (2) probes with a bead count < 3 in at least 5% of the samples; (3) all probes associated with SNPs; (4) all multiple alignment probes; and (5) all probes located on the X and Y chromosomes. After filtering, the data were normalized using the champ.norm() function and the BMIQ method. After preprocessing, CpG sites were clustered using the array2cluster.py Python script and GRCh38 (hg38) annotations.

[0117] S2. CpG Cluster Construction: To make CatchME compatible with low-depth data, CpG clusters are constructed to preserve as much low-depth site information as possible. The CpG cluster construction process is as follows: (1) For each CpG site targeted by a probe on the chip, a CpG cluster is defined as the flanking regions of 100 base pairs upstream and downstream of it, and it is assumed that all CpG sites in this region have the same average methylation level as the CpG sites covered by the probe. (2) If the flanking regions of two adjacent CpG sites overlap, they are grouped into the same CpG cluster. (3) Only CpG clusters containing at least three CpGs covered by the chip probes are used in this study. CpG site annotations for the GRCh38 (hg38) reference genome are obtained from the hm450.hg38.manifest and EPIC.hg38.manifest files provided by the R package ELMER.data (v2.33.0). After constructing the CpG clusters, a total of 36,817 CpG clusters accessible through low-coverage sequencing were identified, accounting for about half of the clusters covered by the HM450K chip. These clusters were reserved for subsequent feature selection.
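
The three construction rules above can be sketched as a single sweep over sorted probe positions. This is a minimal illustration, not the `array2cluster.py` script; the function name `build_cpg_clusters` is hypothetical, and windows that merely touch at a boundary are treated as overlapping, which is an assumption.

```python
def build_cpg_clusters(probe_positions, flank=100, min_probes=3):
    """Merge overlapping probe-centred windows into CpG clusters.

    probe_positions: iterable of (chrom, pos) for probe-targeted CpG sites.
    Returns a list of (chrom, start, end, n_probes) for clusters containing
    at least `min_probes` probe-covered CpG sites.
    """
    clusters = []
    current = None  # [chrom, start, end, n_probes]
    for chrom, pos in sorted(set(probe_positions)):
        # Rule 1: each probe defines a window of +/- `flank` bp.
        start, end = max(0, pos - flank), pos + flank
        if current and current[0] == chrom and start <= current[2]:
            # Rule 2: overlapping windows are merged into one cluster.
            current[2] = max(current[2], end)
            current[3] += 1
        else:
            if current:
                clusters.append(tuple(current))
            current = [chrom, start, end, 1]
    if current:
        clusters.append(tuple(current))
    # Rule 3: keep only clusters with enough probe-covered CpGs.
    return [c for c in clusters if c[3] >= min_probes]
```

For example, probes at positions 1000, 1150, and 1300 on the same chromosome produce one cluster spanning 900 to 1400, whereas an isolated probe or a pair of probes is discarded.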

[0118] S3. Tissue-Specific CpG Cluster Screening: To reduce computational complexity and noise, a compact set of discriminative clusters was first identified. This study considered two types of informative clusters: type-specific (TS) clusters, which show a binary contrast between a tissue and all other tissues, and type-discriminative (TD) clusters, which show differences in methylation levels across all tissue categories.

[0119] Inspired by the Robust Feature Downsampling Module (SRFD), a matrix norm-driven selection strategy is designed that preserves the sign of inter-class methylation differences. By explicitly preserving directional information, this method captures the magnitude and polarity (hypomethylation and hypermethylation) of site-specific signals and simultaneously generates TS (containing both hypomethylation and hypermethylation) and TD clusters, thereby enabling accurate and robust deconvolution.

[0120] For each CpG cluster, there exists an N×M matrix of β values for N samples and M tissue types. Assume that the distribution of all N samples of the m-th tissue type in a given CpG cluster can be approximated by a histogram vector with b bins, which reduces computational complexity. Increasing b can improve the fidelity of the distribution approximation, but it also increases the computational complexity quadratically. To balance accuracy and efficiency, this study fixes b=10 in all subsequent analyses. The resulting histogram vectors are then concatenated as columns into a matrix H. The divergence D between the histogram vectors is calculated from the nuclear norm $\|H\|_*$ and the Frobenius norm $\|H\|_F$ of H together with M, the number of tissue types. D ranges from 0 to 1, where D=0 corresponds to the column vectors of H being perfectly linearly dependent, while D=1 indicates that they are mutually orthogonal.
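
To make the histogram approximation and the divergence concrete, the following sketch builds H from per-tissue β values and evaluates one candidate divergence. The exact formula is not reproduced in this text, so the form used here, D = (‖H‖_* / ‖H‖_F − 1) / (√M − 1), is an assumption chosen only because it uses the stated ingredients (nuclear norm, Frobenius norm, M) and meets the stated endpoints: D=0 for linearly dependent columns and D=1 for mutually orthogonal columns of equal norm. The function names are hypothetical.

```python
import numpy as np


def histogram_matrix(beta_by_tissue, bins=10):
    """Stack one normalised b-bin histogram per tissue type as columns of H.

    beta_by_tissue: list of M non-empty sequences of beta values in [0, 1].
    """
    cols = []
    for betas in beta_by_tissue:
        counts, _ = np.histogram(betas, bins=bins, range=(0.0, 1.0))
        cols.append(counts / counts.sum())
    return np.column_stack(cols)  # shape (bins, M)


def divergence(H):
    """Assumed divergence: rescaled nuclear-to-Frobenius norm ratio."""
    M = H.shape[1]  # requires M >= 2
    ratio = np.linalg.norm(H, ord="nuc") / np.linalg.norm(H, ord="fro")
    return (ratio - 1.0) / (np.sqrt(M) - 1.0)


# Two tissues with the same distribution versus two fully separated ones.
same = histogram_matrix([[0.12, 0.15, 0.17], [0.13, 0.14, 0.18]])
apart = histogram_matrix([[0.05, 0.05], [0.95, 0.95]])
```

Under this assumed form, `divergence(same)` is 0 up to floating-point error and `divergence(apart)` is 1 up to floating-point error, matching the two limiting cases described above.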

[0121] To simultaneously identify TS and TD clusters, the corresponding scores were calculated for each CpG cluster. TS score: for a target tissue type, all other tissue types are pooled into a single "other" category. The histogram vectors of the target category and of the pooled category are calculated using the method described above and combined into the TS matrix. The divergence D is then calculated to quantify their distributional difference.

[0122] TD score: the histogram vector of each tissue type is calculated and the vectors are combined into the TD matrix. The divergence D is then calculated in the same way.

[0123] Signed (symbolic) methylation distance: for each pair of tissue types i and j, a signed distance d is defined from their β values, which yields a symmetric distance matrix across tissue categories. The sign of the distance indicates the direction of change in relative methylation level: a positive value indicates a high methylation state relative to the other tissue, while a negative value indicates a relatively low methylation state. In the TD score calculation, the distance between a tissue and itself is set to 0. In the TS score calculation, however, this value is set to 1, which is the maximum distance, so that self-comparison pairs do not interfere with the calculation of the minimum distance d.

[0124] Final score: the final TS score and the final TD score are calculated from the TS score, the TD score, and the signed methylation distance, where t represents the category number of the TS CpG cluster and is used to mark CpG clusters specific to different tissue categories. The higher of the final TS score and the final TD score indicates the optimal label type for the CpG cluster, and that higher value provides its final informative score.

[0125] After performing clustered methylation calculations on the data, a sparse plasma cfDNA methylation map can be obtained.

[0126] S4. Tissue-origin deconvolution model (Nu-SVR): The cfDNA methylation map obtained in S3 is input into a regression model, specifically Nu-SVR, which expresses the cfDNA methylation data as a linear combination of its tissue origins: $y = Xw$. Here, $X$ is a reference matrix (m×n) representing the methylation profiles of m CpG clusters in n tissue types, $w$ is an n×1 vector representing the contribution ratio of each tissue, and $y$ is the measured DNA methylation profile.

[0127] Furthermore, linear-kernel Nu-SVR deconvolution was applied, a method widely used to infer cell type composition from single-cell transcriptome data. The parameter nu is a lower bound on the fraction of support vectors and an upper bound on the fraction of margin errors. Here, the support vectors are methylation profiles from the reference matrix and are used to learn the weights. The learned weights reflect the tissue types in the column space of the reference matrix whose contributions make up the methylation profile of the cfDNA. For each sample, a set of weights is learned by performing a grid search over two SVR hyperparameters (nu taken from [0.05, 0.1, 0.15, 0.25, 0.5, 0.75], C taken from [0.1, 0.25, 0.5, 0.75, 1, 5, 10, 30, 50, 100]).
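
The per-sample grid search can be sketched with scikit-learn's `NuSVR`. The hyperparameter grids are those stated above. Three choices in the sketch are assumptions that the text does not specify: negative weights are clipped to zero, the remaining weights are rescaled to sum to 1, and the best (nu, C) pair is the one with the lowest reconstruction RMSE. Handling of missing values is omitted, and the function name `deconvolve` is hypothetical.

```python
import numpy as np
from sklearn.svm import NuSVR

NU_GRID = [0.05, 0.1, 0.15, 0.25, 0.5, 0.75]
C_GRID = [0.1, 0.25, 0.5, 0.75, 1, 5, 10, 30, 50, 100]


def deconvolve(reference, sample):
    """Estimate tissue fractions for one sample with linear-kernel Nu-SVR.

    reference: (m clusters, n tissues) array of reference beta values.
    sample:    length-m array of measured cluster beta values.
    Returns a length-n array of non-negative fractions summing to 1.
    """
    best_rmse, best_w = np.inf, None
    for nu in NU_GRID:
        for C in C_GRID:
            model = NuSVR(kernel="linear", nu=nu, C=C).fit(reference, sample)
            # Assumption: negative weights are set to zero.
            w = np.clip(model.coef_.ravel(), 0.0, None)
            if w.sum() == 0:
                continue
            # Assumption: weights are rescaled to proportions.
            w = w / w.sum()
            # Assumption: model selection by reconstruction error.
            rmse = np.sqrt(np.mean((reference @ w - sample) ** 2))
            if rmse < best_rmse:
                best_rmse, best_w = rmse, w
    return best_w
```

Each CpG cluster acts as one training example, the tissue columns act as features, and the fitted coefficients play the role of the tissue contribution vector.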

[0128] S5. Gated Convolutional Neural Network: The cfDNA methylation map obtained in S3 is input into a neural network model, preferably a gated convolutional neural network (GCNN). GCNN extends the traditional convolutional neural network (CNN) by introducing a gating mechanism to adaptively regulate the information flow. Unlike standard CNNs, GCNN can learn and utilize missing patterns in the input data, effectively treating missing values as informative signals rather than noise. This feature enhances robustness to low-coverage sequencing data and improves the model's ability to capture complex nonlinear dependencies across CpG clusters.

[0129] Formally, given an input feature sequence $X \in \mathbb{R}^{L \times d}$ (where L is the number of CpG clusters and d is the feature dimension), a gated convolutional layer applies two parallel convolutions: $A = \mathrm{ReLU}(X * W_a + b_a)$ and $G = \sigma(X * W_g + b_g)$, where $W_a$ and $W_g$ are learnable convolutional kernels, $b_a$ and $b_g$ are the bias terms, * indicates the convolution operation, ReLU(·) is the rectified linear unit activation function, and σ(·) is the sigmoid function, producing a gate value in the range [0,1]. The gated output is then calculated through element-wise multiplication, $Z = A \odot G$, where $\odot$ represents the Hadamard (element-wise) product. This gating mechanism allows the network to selectively propagate informative patterns while suppressing noise or redundant signals, which is beneficial for cfDNA methylation data characterized by sparsity and heterogeneity. Bayesian optimization is used to optimize the gated convolutional neural network.
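
The forward pass of a single gated convolutional layer can be sketched in NumPy. This is an illustration of the gating arithmetic only, not the PyTorch implementation used for CatchME. Kernel size, channel count, and the function names are assumptions, and the pooling steps are omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def conv1d(X, W, b):
    """Valid 1-D convolution along the sequence axis.

    Implemented as cross-correlation, as is usual in deep learning.
    X: (L, d) input sequence; W: (k, d, c) kernel; b: (c,) bias.
    Returns an array of shape (L - k + 1, c).
    """
    k = W.shape[0]
    windows = np.stack([X[i:i + k] for i in range(X.shape[0] - k + 1)])
    return np.tensordot(windows, W, axes=([1, 2], [0, 1])) + b


def gated_conv_layer(X, W_a, b_a, W_g, b_g):
    """Two parallel convolutions combined by a Hadamard product."""
    A = np.maximum(conv1d(X, W_a, b_a), 0.0)  # ReLU branch
    G = sigmoid(conv1d(X, W_g, b_g))          # gate values between 0 and 1
    return A * G                              # gated output Z


# Usage with assumed sizes: L=8 clusters, d=2 features, k=3, c=4 channels.
rng = np.random.default_rng(0)
X = rng.random((8, 2))
W_a, W_g = rng.normal(size=(3, 2, 4)), rng.normal(size=(3, 2, 4))
Z = gated_conv_layer(X, W_a, np.zeros(4), W_g, np.zeros(4))
```

Because the gate multiplies the ReLU branch position by position, a gate value near 0 suppresses that position entirely, which is how the layer can down-weight uninformative or missing regions.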

[0130] S6. Diagnostic Model: This model receives the direct concatenation results of S4 and S5, and is implemented as a random forest classifier using scikit-learn (v1.6.1)—configured with 220 decision trees and a maximum depth of 7. Grid search is used to optimize the model.
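
The concatenation step and the stated random forest configuration (220 trees, maximum depth 7) can be sketched as follows. The feature layout, the function name `fit_diagnostic_model`, and the fixed random seed are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def fit_diagnostic_model(tissue_fractions, dl_outputs, labels, seed=0):
    """Train the final classifier on concatenated S4 and S5 outputs.

    tissue_fractions: (n_samples, n_tissues) deconvolution results.
    dl_outputs:       (n_samples, k) deep learning module outputs.
    labels:           length-n_samples class labels.
    """
    # Direct concatenation along the feature axis.
    features = np.hstack([tissue_fractions, dl_outputs])
    clf = RandomForestClassifier(
        n_estimators=220, max_depth=7, random_state=seed
    )
    return clf.fit(features, labels)
```

At prediction time the same two blocks of features would be concatenated in the same order before calling `predict` or `predict_proba`.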

[0131] CatchME is implemented in PyTorch (v2.7) and trained on an NVIDIA RTX 4090 GPU. It is optimized using the Adam optimizer with an initial learning rate of 1e-4, which decays by cosine annealing. Early stopping is applied to prevent overfitting. All hyperparameters are optimized through 4-fold cross-validation on the training cohort.
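
The learning rate schedule and the early stopping rule can be written out in a few lines of plain Python. The initial learning rate of 1e-4 is taken from the text. The minimum learning rate, the patience, and the class and function names are assumptions. In PyTorch the same schedule is available as `torch.optim.lr_scheduler.CosineAnnealingLR`.

```python
import math


def cosine_annealed_lr(step, total_steps, lr_init=1e-4, lr_min=0.0):
    """Learning rate at `step` out of `total_steps` under cosine annealing."""
    cos_term = 1.0 + math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_init - lr_min) * cos_term


class EarlyStopping:
    """Signal a stop when validation loss stops improving.

    patience:  number of consecutive non-improving checks tolerated.
    min_delta: smallest decrease that counts as an improvement.
    """

    def __init__(self, patience=10, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_checks = math.inf, 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best, self.bad_checks = val_loss, 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience
```

The schedule starts at the initial rate, passes through half of it at the midpoint, and reaches the minimum rate at the final step.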

[0132] Model performance was evaluated using an independent test set reserved before any optimization procedure and against predefined clinical benchmarks, calculating sensitivity, specificity, precision, and F1 score.
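
For reference, the four reported metrics follow directly from the confusion counts of a binary cancer versus non-cancer task. The sketch below is a generic implementation, not the evaluation code of this study; the function name and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, precision, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)

    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)  # recall on cancer samples
    specificity = ratio(tn, tn + fp)  # recall on non-cancer samples
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }
```

For a multi-cancer task, the same function can be applied per cancer type in a one-versus-rest fashion.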

Example 2: Verification of deconvolution using simulated data

[0133] Simulation data were generated using a custom implementation of the methods used in CancerLocator and MethylationAtlas. To prevent data leakage, samples designated for simulation were excluded from label selection and reference informative cluster construction. The simulation protocol consisted of three consecutive steps: (1) Tissue fraction generation: for each synthetic sample, a set of tissue fractions summing to 1 was generated. The white blood cell (WBC) fraction was constrained to be the major component, averaging 75% across all simulations with a minimum threshold of 60%. This design reflects the dominance of hematopoietic cell-derived DNA in real cell-free DNA (cfDNA) samples. (2) Random sampling: one sample was randomly selected from each tissue type for β-value synthesis. (3) β-value synthesis: for each CpG cluster, the simulated β value was calculated as $\beta_{\mathrm{sim}} = \sum_{t} f_t \beta_t + \epsilon$, where $f_t$ and $\beta_t$ are the fraction and β value of tissue $t$, and $\epsilon$ represents Gaussian noise. The synthesized values are constrained to the valid range [0,1].
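
The three simulation steps can be sketched as follows. The text fixes only the constraints on the WBC fraction (mean 75%, minimum 60%) and the mixing rule; the specific sampling distributions used here (a shifted Beta(3, 5) for WBC, a flat Dirichlet for the remaining tissues), the noise standard deviation, and the function name are assumptions.

```python
import numpy as np


def simulate_sample(reference, wbc_index, rng, noise_sd=0.01):
    """Synthesize one mixed cfDNA methylation profile.

    reference: (n_clusters, n_tissues) beta values, one randomly chosen
               sample per tissue type (step 2 is assumed done upstream).
    wbc_index: column index of the WBC reference.
    Returns (fractions, simulated_betas).
    """
    n_tissues = reference.shape[1]
    # Step 1: WBC fraction >= 0.6 with mean 0.6 + 0.4 * 3/8 = 0.75.
    wbc = 0.6 + 0.4 * rng.beta(3.0, 5.0)
    others = rng.dirichlet(np.ones(n_tissues - 1)) * (1.0 - wbc)
    fractions = np.insert(others, wbc_index, wbc)
    # Step 3: weighted sum of tissue beta values plus Gaussian noise.
    noise = rng.normal(0.0, noise_sd, reference.shape[0])
    betas = reference @ fractions + noise
    return fractions, np.clip(betas, 0.0, 1.0)


rng = np.random.default_rng(0)
reference = rng.random((50, 6))
fractions, betas = simulate_sample(reference, wbc_index=5, rng=rng)
```

Because the WBC fraction is at least 0.6, it is always the largest component, and the remaining fractions share at most 0.4 between them.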

[0134] Subsequently, the robustness and accuracy of the CatchME method were systematically verified on simulated data. The results (Figure 2) show that the selection strategy based on histogram approximation of the distribution significantly improves performance compared to the traditional ANOVA method, reducing the root mean square error of tissue fraction prediction by 18% and increasing the computation speed by about 15 times in the task of screening differentially methylated sites from 400,000 CpG sites (on a dual-socket AMD EPYC 7302 platform with 128 GB RAM, the ANOVA method took about 6 hours ± 12 minutes, while the present method took 24 minutes ± 2 minutes).

[0135] In simulated data, the Nu-SVR method used in this study demonstrated superior robustness compared to NNLS when handling missing values in methylation data. Simulation experiments controlling for tissue proportions (100 replicates per group) confirmed that this method can accurately reconstruct tissue composition, with an average RMSE of 0.033 and an average correlation of 0.93.

Example 3: Validation results of CatchME on public datasets

[0136] This embodiment involves using CatchME to validate multiple cancer diagnosis results on a public dataset. The results are shown in Figure 3.

[0137] On a publicly available patient dataset (n=372), the diagnostic model in this study achieved an overall sensitivity of 83% and an accuracy of 78%. Specifically, in the binary classification diagnostic task for cancer, CatchME demonstrated outstanding performance, outperforming cancer risk prediction models and other standard machine learning algorithms (XgBoost, Support Vector Machine, and Random Forest) (Figure 3A, 3C). CatchME also achieved high-accuracy diagnosis in breast cancer, colorectal cancer, liver cancer, lung cancer, and prostate cancer (Figure 3B, 3D).

[0138] Furthermore, it was found that the diagnostic performance of CatchME for breast cancer gradually improved as the disease progressed (Figure 3E). In the detection tasks for lung cancer, liver cancer, breast cancer, prostate cancer, and healthy controls, CatchME achieved the highest F1-score, indicating that CatchME has the best accuracy in these four cancer types (Table 1).

[0139] Table 1. Comparison of F1-scores of the multi-cancer diagnostic model with other mainstream multi-cancer detection models in five cancers.

Example 4: CatchME validation results on an external cohort

This embodiment involves using CatchME to validate multiple cancer diagnostic results on an external cohort and exploring the impact of sequencing depth on diagnostic results. The results are shown in Figure 4.

[0140] To comprehensively evaluate the impact of sequencing depth on model performance, ROC analysis was performed for each cancer type at sequencing coverages of 0.5×, 1×, and 3×. The results showed that, consistent with the deconvolution model, CatchME's diagnostic performance significantly improved with increasing sequencing depth. At 0.5× coverage, the model's ability to distinguish between benign and malignant samples was relatively limited, with AUCs for each cancer type ranging from 0.78 to 0.98. As sequencing depth increased to 1×, AUC values generally improved, with AUCs for many cancer types exceeding 0.9, and the diagnostic confusion matrix also showed high diagnostic reliability. When the sequencing depth was further increased to 3×, CatchME's diagnostic performance approached its optimum, with AUCs for each cancer type approaching the level of the original 6× data, indicating that the model could achieve diagnostic results close to full-depth sequencing at medium to high coverage.

Example 5: Model without CpG clusters

[0141] In DNA methylation sequencing, when the sequencing depth is in an unsaturated range (i.e., not yet covering the vast majority of target sites), the number of detected methylation sites is positively correlated with the sequencing depth. The analysis of the four library preparation schemes with different sequencing depths shown in Figure 4B indicates that the number of detectable CpG sites gradually increases with sequencing depth.

[0142] However, a reliable assessment of the methylation level of a single CpG site typically requires sequencing that site at least 10 times (≥10 reads). At a sequencing depth of approximately 10×, the number of CpG sites meeting this requirement is approximately 773,915, comparable to the number of CpG sites covered by the EPIC methylation array (Table 2). This indicates that obtaining high-confidence methylation measurements using single CpG sites as the unit of analysis often requires high sequencing depths, leading to a significant increase in detection time and experimental costs. In contrast, when the minimum read requirement for a single CpG site is relaxed to 1 (i.e., only once), approximately 15-20 million CpG sites can be detected at a sequencing depth of approximately 2-3×, covering most CpG sites in the human genome (approximately 30 million in total).
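
The effect of the minimum read requirement can be checked directly on a per-CpG coverage file. The sketch below assumes the six-column Bismark coverage format (chromosome, start, end, percent methylation, methylated count, unmethylated count) and simply counts CpG sites whose total depth meets a threshold; the function name is hypothetical.

```python
def count_covered_cpgs(cov_lines, min_reads):
    """Count CpG sites covered by at least `min_reads` reads.

    cov_lines: iterable of tab-separated Bismark coverage lines.
    """
    n_sites = 0
    for line in cov_lines:
        fields = line.rstrip("\n").split("\t")
        depth = int(fields[4]) + int(fields[5])  # methylated + unmethylated
        if depth >= min_reads:
            n_sites += 1
    return n_sites


# Usage: compare the strict (>=10 reads) and relaxed (>=1 read) criteria.
example = [
    "chr1\t100\t100\t100\t1\t0",
    "chr1\t200\t200\t75\t9\t3",
    "chr1\t300\t300\t33.3\t1\t2",
    "chr1\t400\t400\t50\t5\t5",
]
strict = count_covered_cpgs(example, min_reads=10)
relaxed = count_covered_cpgs(example, min_reads=1)
```

In this toy input, two of the four sites pass the strict criterion while all four pass the relaxed one, mirroring the gap described above between high-confidence and detectable CpG sites.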

[0143] The aforementioned low-coverage but high-abundance data offers significant advantages in information utilization, but it also comes with higher measurement noise and platform heterogeneity issues. Furthermore, it can be anticipated that modeling is impossible when sequencing depth is low and analysis is still based on single CpG sites. Based on these considerations, to fully utilize the abundant CpG site information obtained from low-coverage sequencing, while improving data compatibility across different detection platforms and technical approaches and reducing the impact of random noise, this study aggregated CpG sites according to pre-defined rules before subsequent analysis, constructing CpG clusters as new analytical units. This strategy not only improves the efficiency of sequencing data utilization but also lays the foundation for the subsequent construction of multi-platform compatible machine learning models.

[0144] Table 2. Relationship between sequencing depth and library preparation method and CpG site coverage

Example 6: A model built using only the deconvolution module

This embodiment involves an ablation experiment on CatchME to explore the impact of using only the deconvolution module for diagnostic model modeling on the model results.

[0145] To evaluate the performance of the proposed tissue deconvolution model in real patient samples, this study integrated data from public databases with independently collected patient samples from the study to validate the model. By performing tissue deconvolution analysis on patient samples, the differences in tissue-derived cfDNA prediction results between non-cancer and cancer patients were compared to assess the model's tissue origin tracing capability in a real disease context. The results are shown in Figure 5A-5E: based on the constructed tissue deconvolution model, in five tissue types (breast, colorectal, liver, lung, and prostate), the predicted proportion of cfDNA from the corresponding tissues in cancer patient samples was significantly higher than that in the corresponding non-cancer control groups.

[0146] Building upon this, this study further evaluated the feasibility of using tissue-derived cfDNA proportions alone for cancer detection. The model's discriminative ability was analyzed using receiver operating characteristic (ROC) curves. As shown in Figure 5F, the deconvolution model demonstrated a certain level of discriminative ability in distinguishing patients with five types of cancer from their corresponding non-cancer controls, with area under the curve (AUC) ranging from 0.636 to 0.873. Among them, breast cancer and liver cancer showed relatively good discriminative performance, while prostate cancer had a relatively low AUC value, exhibiting an overall moderate level of discriminative performance.

[0147] The above results indicate that although the proportion of tissue-derived cfDNA obtained based on the tissue deconvolution model can, to some extent, distinguish between cancer patients and non-cancer controls and reflect the changing characteristics of cfDNA tissue origin in disease states, relying solely on tissue proportion information is still insufficient to accurately distinguish between cancer and non-cancer samples, especially in application scenarios where multiple cancer types are detected simultaneously, which has certain limitations.

Example 7: Diagnostic models constructed with different models for the deep learning module

[0148] This embodiment involves comparing the performance differences of diagnostic models that utilize other deep learning models or classifiers to construct deep learning modules.

[0149] Specifically, in addition to Gated Convolutional Neural Networks (GCNN), Convolutional Neural Networks (CNN), XgBoost, Support Vector Machines, and Random Forests were also employed to construct the feature learning and cancer diagnosis module, thereby building a diagnostic model. The performance of each diagnostic model is shown in Table 3. It can be seen that the diagnostic model using GCNN has the best specificity, accuracy, sensitivity, and F1-score; therefore, GCNN is the preferred choice for constructing the diagnostic model.

[0150] Table 3. Performance comparison of different CatchME models

Example 8: Models using different classifiers

This embodiment involves comparing the performance differences of diagnostic models that utilize other classifiers to generate classification results based on the concatenation results.

[0151] Specifically, in addition to random forest, support vector machine (SVM) and linear regression were also employed. The performance of each diagnostic model is shown in Table 4. It can be seen that the diagnostic model using random forest has the best specificity, accuracy, sensitivity, and F1 score. Therefore, random forest is the preferred method for constructing diagnostic models.

[0152] Table 4. Performance comparison of CatchME with different classifiers

All documents mentioned in this invention are incorporated herein by reference as if each document were individually incorporated by reference. Furthermore, it should be understood that after reading the foregoing teachings of this invention, those skilled in the art can make various alterations or modifications to this invention, and these equivalent forms also fall within the scope defined by the appended claims.

Claims

1. A method for constructing a multi-cancer prediction model, characterized in that the method includes the following steps:
(S1) Provide a cfDNA methylation dataset, the cfDNA methylation dataset including one or more sets of cfDNA methylation data from multiple cancers and healthy controls;
(S2) Preprocess the cfDNA methylation dataset to obtain preprocessed methylation data; based on the preprocessed methylation data, construct CpG clusters according to predetermined rules and classify the CpG clusters; evaluate the methylation level of each tissue category according to the classification results of the CpG clusters, thereby obtaining multiple tissue-specific tissue methylation maps;
(S3) Train the prediction model using the tissue methylation map; the prediction model includes a deconvolution module, a deep learning module, and a classifier module; in the deconvolution module, a deconvolution algorithm is used to deconvolve the tissue methylation map to obtain the deconvolution result; in the deep learning module, the tissue methylation map is converted into a feature sequence using a deep learning model, and feature learning is performed on the feature sequence to obtain the output of the deep learning module; in the classifier module, the classifier is trained on the concatenation of the deconvolution result and the output of the deep learning module;
(S4) When the prediction model reaches the predetermined termination condition, terminate the model training, thereby obtaining a multi-cancer prediction model using sparse cfDNA methylation data, namely CatchME.

2. The method as described in claim 1, characterized in that the cfDNA methylation data are selected from the following group: high-depth methylation data, low-depth methylation data, or a combination thereof.

3. The method as described in claim 1, characterized in that the classification of the CpG clusters comprises the following steps:
(c1) transforming each CpG cluster into a type-specific (TS) matrix and a type-discriminant (TD) matrix composed of histogram vectors;
(c2) calculating, from the TS matrix and the TD matrix of each CpG cluster, a TS score, a TD score, and a symbolic methylation distance d;
(c3) calculating a final TS score and a final TD score from the TS score, the TD score, and the symbolic methylation distance d, and classifying the CpG clusters according to the final TS score and the final TD score, wherein a CpG cluster satisfying a first condition [condition missing] is of type TS, and a CpG cluster satisfying a second condition [condition missing] is of type TD.

4. The method as described in claim 3, characterized in that:
the TS score is calculated by the formula [formula missing], which is defined in terms of the TS matrix, the nuclear norm, the Frobenius norm, and M, the number of tissue types;
the TD score is calculated by the formula [formula missing], which is defined in terms of the TD matrix, the nuclear norm, the Frobenius norm, and M, the number of tissue types;
the symbolic methylation distance d is calculated by the formula [formula missing], which is defined in terms of the β value of tissue type i and the β value of tissue type j, finally yielding a symmetric distance matrix between the different tissue categories.

5. The method as described in claim 3, characterized in that:
the final TS score is calculated by the formulas [formulas missing], which are defined in terms of the minimum value of the symbolic methylation distance d and an operator that takes the maximum value;
the final TD score is calculated by the formula [formula missing], which is defined in terms of the maximum value of the symbolic methylation distance d.

6. The method as described in claim 1, characterized in that the deconvolution algorithm is nu-support vector regression (ν-SVR), and the deep learning model is selected from the following group: gated convolutional neural network (GCNN), convolutional neural network (CNN), XGBoost, support vector machine, or random forest.

7. The method as described in claim 1, characterized in that the deep learning module specifically comprises: using a gated convolutional neural network to convert the tissue methylation map into a feature sequence, and then performing convolution on the feature sequence; wherein the gated convolutional neural network comprises multiple gated convolutional layers; in each gated convolutional layer, two parallel convolutions are applied to the feature sequence to obtain two convolution results; the two convolution results are merged to obtain a merged convolution result Z; the merged convolution result Z is pooled to obtain a gated output; and the gated outputs of the multiple gated convolutional layers are pooled to finally obtain the output of the deep learning module.

8. The method as described in claim 7, characterized in that the two parallel convolutions comprise: (d1) a first parallel convolution, which uses the ReLU activation function and yields the first convolution result H; and (d2) a second parallel convolution, which uses the sigmoid activation function and yields the second convolution result G.

9. The method as described in claim 7, characterized in that the classifier uses an algorithm selected from the group consisting of: random forest, support vector machine, or linear regression.

10. A multi-cancer prediction system, characterized in that the system comprises:
an input module, configured to input data, the data comprising one or more cfDNA methylation data from one or more tissues of a subject to be tested;
a prediction module, configured with a prediction model that makes a prediction for the subject based on the one or more cfDNA methylation data from the one or more tissues to obtain a prediction result, wherein the prediction comprises: (i) predicting whether the subject has cancer/inflammation; and/or (ii) predicting the cancer type of the subject; and/or (iii) predicting the cancer stage of the subject; and wherein the prediction model is constructed using the method of claim 1; and
an output module, configured to output the prediction result of the prediction module.
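
By way of illustration only, the gated convolutional layer recited in claims 7 and 8 can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the implementation of this invention: it assumes that the two convolution results H and G are merged by element-wise multiplication, that pooling is non-overlapping max pooling, and that the outputs of the multiple layers are pooled by taking the global maximum of each layer. The claims do not fix any of these choices. All function names, kernel values, and the example feature sequence are hypothetical.

```python
import math


def conv1d(seq, kernel):
    """Valid 1-D convolution (cross-correlation) of a list with a kernel."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]


def gated_conv_layer(seq, kernel_h, kernel_g, pool=2):
    """One gated convolutional layer: two parallel convolutions, merge, pool."""
    # (d1) first parallel convolution with ReLU activation -> H
    H = [max(0.0, v) for v in conv1d(seq, kernel_h)]
    # (d2) second parallel convolution with sigmoid activation -> G
    G = [1.0 / (1.0 + math.exp(-v)) for v in conv1d(seq, kernel_g)]
    # Assumption: "merged" is taken to mean element-wise multiplication.
    Z = [h * g for h, g in zip(H, G)]
    # Assumption: pooling is non-overlapping max pooling of width `pool`.
    return [max(Z[i:i + pool]) for i in range(0, len(Z) - pool + 1, pool)]


def deep_learning_module(seq, layers, pool=2):
    """Stack gated layers and pool each layer's gated output.

    Assumption: the gated outputs are pooled by taking the global maximum
    of each layer, giving one feature per layer.
    """
    features = []
    for kernel_h, kernel_g in layers:
        seq = gated_conv_layer(seq, kernel_h, kernel_g, pool)
        features.append(max(seq))
    return features


# Hypothetical feature sequence derived from a tissue methylation map.
feature_sequence = [0.1, 0.9, 0.8, 0.2, 0.0, 0.7, 0.6, 0.3]
# Hypothetical (kernel_h, kernel_g) pairs; in practice these are learned.
layers = [([1.0, -1.0], [0.5, 0.5]), ([0.5, 0.5], [1.0, 0.0])]
features = deep_learning_module(feature_sequence, layers)
```

In this sketch the sigmoid branch G acts as a gate with values between 0 and 1 that scales the ReLU branch H position by position. In the method of claim 1, a feature vector of this kind would be concatenated with the deconvolution result before the classifier is trained.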