cfDNA methylation-based early prediction method and prediction apparatus for multi-modal pan-cancer and electronic device

By integrating methylation sequencing data, fragmentation analysis, and methylation entropy features, a multimodal pan-cancer early prediction model was constructed, which solved the accuracy problem of joint detection of multiple cancers and realized efficient pan-cancer early screening and personalized diagnosis and treatment plans.

WO2026124466A1 · PCT designated stage · Publication Date: 2026-06-18 · BIOCHAIN BEIJING SCI & TECH

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: BIOCHAIN BEIJING SCI & TECH
Filing Date: 2025-12-09
Publication Date: 2026-06-18

Application Information

IPC: G16B40/20; G16B30/10; G06F18/27; G06F18/24; G06F18/2135; G16H50/30

AI Tagging

Application Domain

Health-index calculation Biostatistics

AI Technical Summary

⚠Technical Problem

The accuracy of existing technologies for the combined detection of multiple cancers needs to be improved. In particular, detection methods based on cfDNA methylation struggle to achieve efficient pan-cancer early screening, and detecting cancers with low ctDNA content is costly and difficult.

⚗Method used

By integrating methylation sequencing data, fragmentation analysis, and methylation entropy features, a multimodal pan-cancer early prediction model is constructed. Multiple prediction models are trained using various analytical methods, and the accuracy and generalization ability of detection are improved by combining the fragmentation features and methylation pattern entropy of the reference genome.

🎯Benefits of technology

It achieves highly accurate joint detection of multiple cancers, improves the efficiency and accuracy of pan-cancer early screening, enhances the model's generalization ability, can better cope with noise and interference, and provides personalized diagnosis and treatment plans.

✦ Generated by Eureka AI based on patent content.

Abstract

Disclosed are an early prediction method and apparatus for multi-modal pan-cancer, a device, a product and a storage medium, which method comprises: separately collecting cfDNA samples from i types of cancer populations and a healthy population, and extracting methylation data therefrom; merging CpG sites on the basis of the methylation data to obtain a plurality of methylation intervals, screening the plurality of methylation intervals to obtain a plurality of candidate marker intervals, and further screening the plurality of candidate marker intervals; training a first prediction model by using the plurality of candidate marker intervals, training a second prediction model on the basis of fragmentation characteristics of chromosomes extracted from a reference genome, and training a third prediction model by using methylation entropy characteristics of the extracted chromosomes; and training an early prediction model for multi-modal pan-cancer by prediction values for the cancer populations obtained from the first prediction model, the second prediction model, and the third prediction model. The method can provide accurate combined early screening for lung cancer, colorectal cancer, gastric cancer, liver cancer, esophageal cancer, thyroid cancer, and ovarian cancer.

Description

A multimodal pan-cancer early prediction method, prediction device and electronic device based on cfDNA methylation

Technical Field

[0001] This application belongs to the field of molecular biomedical technology, specifically relating to a multimodal pan-cancer early prediction method, prediction device, and electronic device based on cfDNA methylation.

Background Technology

[0002] Cancer is one of the leading causes of death worldwide and a major public health problem that seriously threatens human health. Numerous studies have shown that early detection and timely diagnosis can effectively improve the survival rate of patients with various cancers. In the field of cancer detection, multi-cancer screening is often more efficient than single-cancer screening. This is because individuals eligible for cancer screening typically have a potential risk of developing multiple cancers, and before examination they cannot easily determine the likely site of disease from their own condition, especially for cancers occurring in the abdominal cavity or bloodstream. cfDNA is DNA released into the blood after cell damage and rupture, where it circulates. Tumor genomes usually carry characteristic gene mutation sites; by detecting mutation sites in cfDNA, especially methylation sites, cancer can be detected.

[0003] To continuously improve the performance of cfDNA methylation in cancer detection, researchers have made significant efforts in technological innovation and detection model construction. While some successes have been achieved in the early detection of cancer using methylation panels, the accuracy of combined detection of multiple cancers needs improvement. Based on cfDNA sequencing data, integrating multiple features such as methylation, fragmentation, and methylation entropy to construct a multimodal detection model that more comprehensively considers sample data from the target population can effectively improve the versatility and accuracy of multi-cancer detection. Therefore, this application designs a multimodal pan-cancer detection method based on methylation panel sequencing data through multi-dimensional analysis.

Summary of the Invention

[0004] To address the problems existing in the prior art, this application aims to provide a multimodal pan-cancer early prediction method, prediction device, and electronic device. Based on cfDNA methylation differential analysis, it introduces fragmentation analysis of methylation sequencing data and detection analysis of methylation entropy. By integrating multiple analysis methods, a multimodal pan-cancer early prediction model is obtained. The model is used to analyze sequencing samples from the detection population to achieve highly accurate joint detection of seven cancers, including lung cancer, colorectal cancer, gastric cancer, liver cancer, esophageal cancer, thyroid cancer, and ovarian cancer, providing a more accurate and feasible solution for pan-cancer early screening.

[0005] Specifically, this application relates to the following aspects:

[0006] According to one aspect of this application, a multimodal pan-cancer early prediction method is provided, comprising: collecting multiple cfDNA samples from i types of cancer populations and healthy populations respectively, and extracting methylation data from the multiple cfDNA samples respectively; merging CpG sites based on the methylation data of the multiple cfDNA samples to obtain multiple methylation intervals; extracting differentially methylated intervals between each cancer population and healthy population from the multiple methylation intervals; selecting differentially methylated intervals that appear in at least m types of cancer populations as candidate biomarker intervals; screening the multiple candidate biomarker intervals; training a first prediction model using the screened multiple candidate biomarker intervals; training a second prediction model based on chromosome fragmentation features extracted from a reference genome; and training a third prediction model by extracting chromosome methylation entropy features; and training a multimodal pan-cancer early prediction model using the cancer population prediction values of the first prediction model, the second prediction model, and the third prediction model for pan-cancer early detection.
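The selection of candidate biomarker intervals described above can be illustrated with a minimal sketch. It assumes that every cancer type is scored against the same set of merged intervals, so that an interval can be identified by its coordinates; the function name `select_candidate_intervals`, the tuple layout `(chromosome, start, end)`, and the toy data are hypothetical and are not taken from the application.

```python
from collections import Counter

def select_candidate_intervals(dmrs_by_cancer, m):
    """Keep intervals that are differentially methylated in at least m cancer types.

    dmrs_by_cancer maps a cancer type to the list of intervals found to be
    differentially methylated between that cancer population and the healthy
    population. Intervals are assumed to share coordinates across cancer types.
    """
    counts = Counter()
    for intervals in dmrs_by_cancer.values():
        counts.update(set(intervals))  # each cancer type counts once per interval
    return sorted(iv for iv, n in counts.items() if n >= m)

# Toy input (made-up coordinates, for illustration only).
dmrs_by_cancer = {
    "lung": [("chr1", 100, 250), ("chr2", 500, 640)],
    "liver": [("chr1", 100, 250), ("chr3", 900, 1000)],
    "gastric": [("chr1", 100, 250), ("chr2", 500, 640)],
}
candidates = select_candidate_intervals(dmrs_by_cancer, m=2)
```

With m = 2, the interval on chr3 is dropped because it is differential in only one cancer type.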

[0007] According to some implementation schemes, extracting methylation data from multiple cfDNA samples includes: extracting multiple CpG sites from multiple cfDNA samples and the methylation value of each CpG site; merging CpG sites based on the methylation data of multiple cfDNA samples includes: calculating the difference in methylation values at each CpG site between each cancer population and a healthy population in i types of cancer populations; and merging the corresponding CpG sites in response to a non-zero difference.

[0008] According to some implementation schemes, the screening of multiple candidate marker intervals includes: calculating the importance value of each candidate marker interval among the multiple candidate marker intervals, removing the corresponding candidate marker interval in response to an importance value not greater than a first threshold, and retaining the corresponding candidate marker interval in response to an importance value greater than the first threshold.

[0009] According to some implementation schemes, training a second prediction model based on fragmented features of chromosomes extracted from a reference genome includes: flattening autosomes of the reference genome and dividing them into multiple first intervals on an average basis; recording the number of short-length, medium-length, and long-length fragments in each of the multiple first intervals; sequentially merging the multiple first intervals to obtain multiple second intervals; calculating the coverage of each of the multiple second intervals; reducing the dimensionality of the multiple coverages obtained; and constructing a second prediction model using the multiple coverages after dimensionality reduction.

[0010] According to some implementation schemes, the coverage is the number of short-length, medium-length, and long-length fragments contained in each of the multiple second intervals; the second prediction model is constructed by elastic net regression on the multiple coverages after dimensionality reduction.
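The binning and merging steps for the fragmentation features can be sketched as follows. This is only an illustration under stated assumptions: the length cut-offs `short_max` and `medium_max`, the bin size, the rule of assigning a fragment to the bin containing its start coordinate, and the function names are all hypothetical, since the text above does not specify them. Dimensionality reduction and elastic net regression are not shown.

```python
def count_fragments(fragments, genome_length, bin_size, short_max=150, medium_max=220):
    """Count short, medium and long fragments in each fixed-size first interval.

    fragments: iterable of (start, length) on the flattened autosome coordinate.
    The thresholds short_max and medium_max are assumed values.
    """
    n_bins = -(-genome_length // bin_size)  # ceiling division
    counts = [[0, 0, 0] for _ in range(n_bins)]
    for start, length in fragments:
        if length <= short_max:
            size_class = 0
        elif length <= medium_max:
            size_class = 1
        else:
            size_class = 2
        counts[start // bin_size][size_class] += 1  # assign by start position
    return counts

def merge_bins(counts, group):
    """Sum every `group` consecutive first intervals into one second interval."""
    merged = []
    for i in range(0, len(counts), group):
        chunk = counts[i:i + group]
        merged.append([sum(column) for column in zip(*chunk)])
    return merged

# Toy fragments (made-up values, for illustration only).
fragments = [(10, 120), (40, 167), (130, 300), (250, 140), (390, 180)]
first = count_fragments(fragments, genome_length=400, bin_size=100)
second = merge_bins(first, group=2)
```

Each row of `second` holds the short, medium and long fragment counts of one second interval, which corresponds to the coverage described above.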

[0011] According to some implementation schemes, training a third prediction model by extracting methylation entropy features of chromosomes includes: extracting multiple insert fragments from the methylation data of multiple cfDNA samples to obtain the methylation pattern entropy values of multiple insert fragments; calculating the methylation pattern entropy value of each chromosome; and constructing a third prediction model using logistic regression based on the methylation pattern entropy values of all chromosomes.

[0012] According to some implementation schemes, multiple insert fragments are extracted from the methylation data of multiple cfDNA samples, including: insert fragments that can be compared with the reference genome from both left-to-right and right-to-left reads of the cfDNA samples extracted in CpG mode; the formula for obtaining the methylation pattern entropy value of multiple insert fragments is as follows:

[0013] Where BiEn(s) is the methylation pattern entropy value of the inserted fragment, n represents the number of all CpG sites in the inserted fragment, k is a set of values from 0 to n-2, and p is the probability value of k under a given value.

[0014] According to some implementation schemes, calculating the methylation pattern entropy value of each chromosome includes: calculating the average of the methylation pattern entropy values of all inserted segments on each chromosome as the methylation pattern entropy value of each chromosome.
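The per-chromosome averaging can be sketched in a few lines. The sketch takes the per-fragment entropy values as given inputs and does not compute them; the function name `chromosome_entropy` and the input layout of `(chromosome, entropy)` pairs are assumptions made for illustration.

```python
from collections import defaultdict

def chromosome_entropy(fragment_entropies):
    """Average the per-fragment entropy values of each chromosome.

    fragment_entropies: iterable of (chromosome, entropy) pairs, one per
    insert fragment. Returns a dict mapping chromosome to the mean entropy.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for chrom, value in fragment_entropies:
        sums[chrom] += value
        counts[chrom] += 1
    return {chrom: sums[chrom] / counts[chrom] for chrom in sums}

# Toy values (made up, for illustration only).
values = [("chr1", 0.2), ("chr1", 0.4), ("chr2", 1.0)]
features = chromosome_entropy(values)
```

The resulting dictionary has one value per chromosome, which could then serve as the feature vector for the logistic regression mentioned in [0011].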

[0015] According to another aspect of this application, a multimodal pan-cancer early prediction device is provided, comprising: a data acquisition unit, which collects multiple cfDNA samples from i types of cancer populations and healthy populations respectively, and extracts methylation data from the multiple cfDNA samples respectively; a biomarker screening unit, which merges CpG sites based on the methylation data of the multiple cfDNA samples to obtain multiple methylation intervals, extracts differentially methylated intervals between each type of cancer population and healthy populations from the multiple methylation intervals, and uses the differentially methylated intervals that appear in at least m types of cancer populations as candidate biomarker intervals, and screens the multiple candidate biomarker intervals; a model construction unit, which trains a first prediction model using the screened multiple candidate biomarker intervals, trains a second prediction model based on chromosome fragmentation features extracted from a reference genome, and trains a third prediction model by extracting chromosome methylation entropy features; and a model integration unit, which trains a multimodal pan-cancer early prediction model using the cancer population prediction values of the first prediction model, the second prediction model, and the third prediction model for pan-cancer early detection.

[0016] According to some implementation schemes, the data acquisition unit extracts methylation data from multiple cfDNA samples, including: the data acquisition unit extracts multiple CpG sites from multiple cfDNA samples and the methylation value of each CpG site; the biomarker screening unit merges CpG sites based on the methylation data of multiple cfDNA samples, including: the biomarker screening unit calculates the difference in methylation values at each CpG site between each cancer population and a healthy population in i types of cancer populations; and merges the corresponding CpG sites in response to a non-zero difference.

[0017] According to some implementation schemes, the marker screening unit screens multiple candidate marker intervals by: the marker screening unit calculating the importance value of each candidate marker interval among the multiple candidate marker intervals; removing the corresponding candidate marker interval in response to an importance value not greater than a first threshold; and retaining the corresponding candidate marker interval in response to an importance value greater than the first threshold.

[0018] According to some implementation schemes, the model building unit trains the second prediction model based on the fragmented features of chromosomes extracted from the reference genome, including: the model building unit flattens the autosomes of the reference genome and divides them into multiple first intervals on an average basis; the model building unit records the number of short-length, medium-length, and long-length fragments in each of the multiple first intervals; the model building unit merges the multiple first intervals in sequence to obtain multiple second intervals, calculates the coverage of each of the multiple second intervals, and performs dimensionality reduction on the multiple coverages obtained; and the model building unit constructs the second prediction model using the multiple coverages after dimensionality reduction.

[0019] According to some implementation schemes, the coverage is the number of short-length, medium-length, and long-length fragments contained in each of the multiple second intervals; the model building unit constructs the second prediction model by elastic net regression on the multiple coverages after dimensionality reduction.

[0020] According to some implementation schemes, the model building unit extracts methylation entropy features of chromosomes to train the third prediction model, including: the model building unit extracts multiple insert fragments from the methylation data of multiple cfDNA samples to obtain the methylation pattern entropy values of multiple insert fragments; the model building unit calculates the methylation pattern entropy value of each chromosome respectively; and the model building unit uses logistic regression to construct the third prediction model using the methylation pattern entropy values of all chromosomes.

[0021] According to some implementation schemes, the model building unit extracts multiple insert fragments from the methylation data of multiple cfDNA samples, including: insert fragments that can be compared with the reference genome in both left-to-right and right-to-left reads of the cfDNA samples in CpG mode; the formula for the model building unit to obtain the methylation pattern entropy value of multiple insert fragments is as follows:

[0022] Where BiEn(s) is the methylation pattern entropy value of the inserted fragment, n represents the number of all CpG sites in the inserted fragment, k is a set of values from 0 to n-2, and p is the probability value of k under a given value.

[0023] According to some implementation schemes, the model building unit calculates the methylation pattern entropy value of each chromosome separately, including: the model building unit calculates the mean of the methylation pattern entropy values of all inserted segments on each chromosome as the methylation pattern entropy value of each chromosome.

[0024] According to another aspect of this application, an electronic device is provided, comprising: a processor; and a memory storing computer program instructions, which, when executed by the processor, cause the processor to perform the aforementioned multimodal pan-cancer early prediction method.

[0025] According to another aspect of this application, a computer program product is provided, which includes computer program instructions that, when executed by a processor, cause the processor to perform the aforementioned multimodal pan-cancer early prediction method.

[0026] According to another aspect of this application, a computer-readable storage medium is provided, on which computer program instructions are stored, which, when executed by a processor, cause the processor to perform the aforementioned multimodal pan-cancer early prediction method.

[0027] The multimodal pan-cancer early prediction method, device, and equipment of this application integrate information from different modalities, such as methylation data and chromosome fragment data mentioned in this application. The model can capture more dimensional data features, which not only helps the model better cope with noise and interference during prediction and enhances the model's generalization ability, but also further improves the model's prediction accuracy on the basis of a single modality by revealing the hidden correlations between different modalities. This helps doctors assess the risk of cancer in test subjects and provide them with personalized treatment plans.

Attached Figure Description

[0028] Figure 1 illustrates a flowchart of a multimodal pan-cancer early prediction method according to an embodiment of this application.

[0029] Figure 2 illustrates a flowchart of the multimodal pan-cancer early prediction method according to an embodiment of this application, which constructs three prediction models using three modal features.

[0030] Figure 3 illustrates a schematic diagram of the ROC curve of the multimodal pan-cancer early prediction method according to an embodiment of this application on a test sample set.

[0031] Figure 4 illustrates a block diagram of a multimodal pan-cancer early prediction device according to an embodiment of this application.

[0032] Figure 5 illustrates a block diagram of an electronic device according to an embodiment of this application.

Detailed Implementation

[0033] methylation data

[0034] Data on the methylation levels of DNA or RNA molecules are measured and analyzed using various techniques. Methylation data is particularly important in cancer research, as it can reveal the epigenetic mechanisms behind changes in gene expression and contribute to understanding the occurrence and development of cancer.

[0035] methylation value

[0036] A methylation index used to measure DNA methylation levels, commonly used in biostatistics and genomics research. The methylation value reflects the proportion of methylated cytosine at a specific CpG site, ranging from 0 to 1, where 0 represents complete unmethylation and 1 represents complete methylation. In practical applications, methylation values can be used to compare methylation differences between different samples and cell types, and are of significant importance in disease diagnosis and gene expression regulation research.
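As a small illustration of this definition, the methylation value of a CpG site can be computed as a proportion. The sketch assumes the value is derived from counts of methylated and unmethylated reads covering the site, which is a common convention but is not stated in the text; the function name `methylation_value` is hypothetical.

```python
def methylation_value(methylated_reads, unmethylated_reads):
    """Return the proportion of methylated cytosines at one CpG site.

    The result lies between 0 (completely unmethylated) and 1 (completely
    methylated). Returns None when no read covers the site.
    """
    total = methylated_reads + unmethylated_reads
    if total == 0:
        return None  # no coverage, the value is undefined
    return methylated_reads / total

example = methylation_value(3, 1)
```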

[0037] The present application is further illustrated below with reference to embodiments. It should be understood that the embodiments are only used to further illustrate and explain the present application and are not intended to limit the present application.

[0038] Unless otherwise defined, technical and scientific terms used in this specification have the same meaning as commonly understood by one of ordinary skill in the art. While similar or identical methods and materials may be applied in experimental or practical applications, materials and methods are described herein. In case of conflict, the definitions included herein shall prevail. Furthermore, materials, methods, and examples are for illustrative purposes only and are not intended to be limiting. The present application is further described below with reference to specific embodiments, but is not intended to limit the scope of the application.

[0039] Application Overview

[0040] As mentioned above, in the field of cancer detection, the abnormal characteristics of cfDNA are heterogeneous across different cancer types, subtypes, stages, and causes. Therefore, techniques for gene characterization of pan-cancer features using cfDNA sequencing data can result in a certain degree of false negatives and false positives. Furthermore, due to inconsistencies in the detection standards for tumor tissue and ctDNA, existing studies yield significantly different consistency data, making it difficult to provide useful information for cancer feature extraction. This hinders the development of improved combinations of effective biomarkers for specific cancer types, thus limiting the effectiveness of cfDNA detection, such as cfDNA methylation detection, in predicting individual cancer development. In addition, in the early stages of some cancers or in the minimal residual disease stage, the concentration of ctDNA in plasma is extremely low, significantly increasing the cost and difficulty of detection.

[0041] The genomic cancer information detection system proposed in patent CN114045345A uses enzymatic transfection of cfDNA from plasma samples to perform whole-genome methylation sequencing, analyzing methylation density, fragment length distribution, 5' end motifs, and / or chromosomal stability, while simultaneously enabling early detection and screening of various cancers. However, this monitoring system neglects to capture minute changes in DNA or RNA molecules on the genome, including fragment length variations, specific sequence deletions or additions, etc. These fragmentation features often imply overall changes in the entire genome or transcriptome, and their analysis does not depend on specific cancer biomarkers or genes. This makes it difficult for this monitoring system to provide strong support for pan-cancer detection.

[0042] Patent CN116356021A provides GutSeer, a multi-cancer early screening and localization technology for five digestive system cancers with high mortality rates. It demonstrates that relatively small second-generation sequencing panels can utilize multiple dimensions of features, including methylation, copy number changes, and terminal motifs, to achieve relatively accurate cancer detection.

[0043] Methylation entropy is the entropy value generated during DNA methylation, representing the degree of disorder or uncertainty in the DNA methylation state. Since methylation entropy is an indicator of the complexity and stability of DNA methylation patterns, it can serve as a specific indicator reflecting the distribution and variation of methylation sites on cfDNA molecules. Methylation entropy features therefore help predictive models capture the differences in methylation patterns among target cancer patients and thereby learn the heterogeneity among these cancers. Accordingly, compared with the detection method constructed from methylation panel sequencing features in CN116356021A, this application proposes a multimodal detection model constructed by combining methylation entropy features with cfDNA methylation rate features and genomic fragmentation features. This allows the predictive method, device, and electronic equipment of this application to detect cancer types that are not limited to the digestive system but cover more cancers in multiple systems. Compared with the aforementioned methods or systems, it can achieve high-accuracy joint detection over a wider range.

[0044] Specifically, this application provides a multimodal pan-cancer early prediction method, prediction device, and electronic device. It collects cfDNA samples from i types of cancer populations and healthy individuals, and extracts methylation data from the cfDNA samples. Based on the methylation data of the cfDNA samples, it merges CpG sites in the cfDNA samples to obtain multiple methylation intervals. From these multiple methylation intervals, it screens out multiple differentially methylated intervals for each type of cancer population and healthy individuals. It extracts differentially methylated intervals that appear in at least m types of cancer populations as multiple candidate biomarker intervals, and screens these candidate biomarker intervals. It trains a first prediction model using these candidate biomarker intervals, simultaneously extracts fragmented features from a reference genome to train a second prediction model, and extracts chromosomal methylation entropy features to train a third prediction model. Finally, it trains a multimodal pan-cancer early prediction model using the cancer population prediction values from the first, second, and third prediction models for pan-cancer early detection.
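The final integration step, in which the prediction values of the three models are used to train the multimodal model, can be sketched as a simple stacking scheme. The paragraph above does not state which learner is used for this step, so the logistic regression meta-model below is an assumption, as are the function names, the learning rate, the number of epochs, and the toy scores. In practice the base-model scores fed to a meta-model are usually produced on held-out data to limit overfitting.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_meta_model(rows, labels, lr=0.5, epochs=2000):
    """Fit a logistic regression on base-model scores by batch gradient descent.

    rows: one tuple per sample with the scores of the first, second and third
    prediction models. labels: 1 for cancer, 0 for healthy.
    """
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def predict(model, x):
    """Return the combined cancer probability for one sample's three scores."""
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

# Toy scores from the three base models (made up, for illustration only).
rows = [(0.9, 0.8, 0.7), (0.8, 0.6, 0.9), (0.2, 0.3, 0.1), (0.1, 0.2, 0.3)]
labels = [1, 1, 0, 0]
model = train_meta_model(rows, labels)
```

A sample whose three base-model scores are all high receives a combined probability above 0.5, and one whose scores are all low receives a probability below 0.5.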

[0045] By screening multiple differentially methylated regions between each of the i types of cancer populations and healthy individuals, and further extracting the differentially methylated regions that appear in cfDNA samples from multiple cancer types as candidate biomarker regions, these candidate biomarker regions can better reflect the overall genetic characterization of the i cancer types, providing a basic predictive model for their joint detection. By extracting fragmentation features with respect to the reference genome, subtle changes in genomic DNA fragments can be captured. These changes are closely related to the occurrence and development of various cancers, thereby improving the sensitivity and accuracy of detection for each cancer type. At the same time, they provide rich biological information that reveals the heterogeneity among the i cancer types, providing a predictive model for distinguishing among them.

[0046] Furthermore, entropy reflects the degree of disorder in a system: the more disordered a system is, the greater its entropy; conversely, the more ordered it is, the smaller its entropy. Methylation pattern entropy describes the overall state of the methylation pattern at CpG sites on an insert fragment, reflecting the distribution and variability of CpG site methylation states. It can be used to assess the epigenetic heterogeneity within the cell population from which the cfDNA originates. Extracting methylation pattern entropy features can therefore comprehensively assess the complexity and stability of DNA methylation patterns and accurately identify methylation abnormalities associated with certain cancers. A predictive model constructed using methylation entropy features can further improve the accuracy of overall cancer detection and subtyping beyond what methylation panel data features and genomic fragment features provide, thus providing a highly feasible pan-cancer early screening method.

[0047] After introducing the basic principles of this application, various non-limiting embodiments of this application will be described in detail below with reference to the accompanying drawings.

[0048] Exemplary methods

[0049] Figure 1 illustrates a multimodal pan-cancer early prediction method according to an embodiment of this application.

[0050] As shown in Figure 1, the multimodal pan-cancer early prediction method according to an embodiment of this application includes the following steps.

[0051] S110: Collect cfDNA samples from i types of cancer patients and from healthy individuals, and extract methylation data from each cfDNA sample. In this application, methylation data refers to the data obtained by conventional analysis of FASTQ data generated with a methylation panel, preferably a processable BAM file. The multimodal pan-cancer early prediction method of this embodiment can include up to 7 types of cancer patients, namely lung cancer, colorectal cancer, gastric cancer, liver cancer, esophageal cancer, thyroid cancer, and ovarian cancer. For these 7 types of cancer patients, all cancer samples in the training set of the methylation data samples obtained by methylation panel sequencing of their cfDNA are used as the candidate group, all healthy samples are used as the control group, and methylation intervals are screened to obtain the methylation data. It can be understood that the methylation data can be obtained by further processing the methylation sequencing data from the methylation panel, for example after data quality preprocessing and evaluation (fastp software), genome alignment (Bismark software), or removal of duplicate data caused by sample or experimental techniques. The reference genome is the human genome. Multiple versions of the human reference genome exist in this field, with hg19 being a commonly used version. Those skilled in the art can select the appropriate version.

[0052] Thus, the methylation data samples obtained from the seven cancer populations can be the location information of all CpG sites and their methylation values obtained through methylation panel sequencing from all cancer population samples and healthy population samples. The CpG sites can be CpG sites on cancer-highly relevant target regions obtained from existing publicly available research, detection, or sequencing products, such as sites on the regions used in Bocheng's seven-cancer NGS methylation detection kit, or CpG sites throughout the entire genome. These CpG sites can be merged according to certain rules to form methylation regions containing multiple CpG sites, such as regions with high methylation levels or low methylation levels. These methylation regions may therefore exhibit differentiated methylation states in cancer patients and healthy individuals, thus becoming valuable methylation biomarkers.

[0053] S120: based on the methylation data of the cfDNA samples, CpG sites are merged to obtain multiple methylation intervals. From these intervals, multiple differentially methylated intervals between each of the i cancer populations and the healthy population are selected. Differentially methylated intervals appearing in at least m cancer populations are extracted as multiple candidate biomarker intervals, and these candidate biomarker intervals are then screened. As described above, step S110 obtains the location information and methylation values of all CpG sites in both populations. The methylation difference of each CpG site between the two populations can be obtained by calculating the difference between the methylation values at that site in the cancer population and in the healthy population. This difference can be the difference between the average methylation value of the CpG site in a cancer population and its average methylation value in the healthy population.

[0054] This yields differentiated CpG sites, which can then be used as a basis for constructing differentiated methylation regions. A feasible method for merging differentiated CpG sites into methylation regions is to merge the CpG sites whose methylation value difference is not zero, that is, to sequentially merge adjacent differentiated CpG sites. The purpose of merging CpG sites is to combine co-methylated sites for joint analysis. Compared with differential analysis of individual differentiated methylated sites, analysis of merged sites has greater statistical significance. It should be noted that methylation intervals obtained in this way may be excessively long. Since human cfDNA is typically around 167 bp in length, a maximum length needs to be set for the methylation intervals so that they do not significantly exceed the cfDNA length. For example, a maximum length of 200 bp can be set so that each methylation interval is close to the 167 bp length of cfDNA, which brings the calculated methylation level of the CpG sites within the interval closer to the actual level and ensures data quality. Preferably, a maximum length of 300 bp can be set, with merging terminated once this length is exceeded. This avoids an excessive number of CpG sites within a methylation interval, which would significantly reduce the probability of co-methylation, and thereby ensures that the analysis of the merged intervals is meaningful.
Furthermore, since a lower number of CpG sites within a given interval increases the false positive probability of CpG site methylation values obtained using existing techniques, it is possible to further set the number of CpG sites in each methylation interval to at least 3, and to screen CpG sites within the methylation interval using any feasible CpG site outlier or missing value processing methods to further improve the robustness of the obtained methylation intervals.
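As a non-limiting illustration, the merging rule described above (a maximum interval length, here 300 bp, and at least 3 CpG sites per interval) may be sketched as follows. The function name `merge_cpg_sites`, the greedy left-to-right strategy, and the definition of interval length as `end - start + 1` are assumptions made for this sketch; the input is assumed to be the positions of differentiated CpG sites on a single chromosome.

```python
from typing import List, Tuple


def merge_cpg_sites(positions: List[int], max_len: int = 300,
                    min_cpgs: int = 3) -> List[Tuple[int, int, int]]:
    """Greedily merge CpG positions into (start, end, n_cpgs) intervals.

    Assumption: an interval is closed as soon as adding the next site
    would make its span (end - start + 1) exceed max_len. Intervals with
    fewer than min_cpgs sites are discarded.
    """
    intervals: List[Tuple[int, int, int]] = []
    current: List[int] = []
    for pos in sorted(positions):
        if current and pos - current[0] + 1 > max_len:
            if len(current) >= min_cpgs:
                intervals.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    # Flush the last open interval.
    if len(current) >= min_cpgs:
        intervals.append((current[0], current[-1], len(current)))
    return intervals


# Hypothetical positions of differentiated CpG sites on one chromosome.
sites = [100, 150, 220, 390, 410, 900, 1000, 1100, 1150]
print(merge_cpg_sites(sites))
```

In this toy input the isolated site at position 410 forms an interval with a single CpG site and is therefore dropped by the minimum-count rule.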

[0055] In some implementations, screening the multiple differentially methylated intervals between each cancer population and the healthy population from the multiple methylation intervals of the i types of cancer includes the following screening method: for each of the i cancer types, the sensitivity with which a methylation interval distinguishes the cancer population from the healthy population is required to be not lower than a given threshold at a given specificity. For example, in a preferred implementation, the sensitivity is not lower than 70%, so that more than 70% of the methylation intervals from the cancer population exhibit population-specific methylation values, and the specificity is not lower than 80%, so that more than 80% of the methylation intervals from the healthy control group exhibit population-specific methylation values. This allows differentially methylated intervals with high sensitivity and specificity to be screened, while also reducing the impact of outliers on the differences and ensuring the stability of the differences.
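As a non-limiting illustration, the sensitivity of a single methylation interval at a fixed specificity may be computed as sketched below. The function name, the choice of cutoff, and the assumption that the interval is hypermethylated in cancer (higher values indicate cancer) are assumptions of this sketch and are not specified in the text above.

```python
import math
from typing import Sequence


def sensitivity_at_specificity(cancer: Sequence[float],
                               healthy: Sequence[float],
                               specificity: float = 0.8) -> float:
    """Sensitivity of one interval at a given specificity.

    Assumption: a sample is called positive when its interval
    methylation value is strictly above the cutoff. The cutoff is the
    smallest healthy value that leaves at least `specificity` of the
    healthy samples at or below it.
    """
    ranked = sorted(healthy)
    k = max(1, math.ceil(specificity * len(ranked)))
    cutoff = ranked[k - 1]
    return sum(v > cutoff for v in cancer) / len(cancer)


# Hypothetical per-sample methylation values for one interval.
healthy = [0.05, 0.10, 0.12, 0.15, 0.30]
cancer = [0.10, 0.20, 0.40, 0.50, 0.60, 0.35, 0.45, 0.12, 0.55, 0.25]
sens = sensitivity_at_specificity(cancer, healthy, specificity=0.8)
print(sens, sens >= 0.7)  # the interval is kept if sensitivity >= 70%
```

For a hypomethylated interval the comparison would be reversed; a practical screen would need to handle both directions.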

[0056] Furthermore, the screening method can further include: requiring that, for each of the i cancer types, the AUC obtained when the methylation interval is used to distinguish the corresponding cancer population from the healthy population is not lower than a given threshold; and requiring that the difference between the cancer population and the healthy population in the average methylation value of all methylation sites within the methylation interval is not lower than a given threshold. For example, in a preferred embodiment, the AUC for detecting the cancer population using the methylation interval is not lower than 0.7, and the absolute value of the difference between the cancer population and the healthy population in the average methylation value of the methylation sites within the interval is not lower than 0.02. In this way, the selected differentially methylated intervals have good population classification performance while interference from methylation direction is avoided. That is, the methylation levels of the CpG sites in these intervals differ between the two populations, but the direction of the difference (hypermethylation or hypomethylation) is not specified, which ensures that differentially methylated intervals are not screened out because the signed total difference of their CpG sites is too small or zero.
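As a non-limiting illustration, the AUC and mean-difference criteria (AUC not lower than 0.7, absolute mean difference not lower than 0.02) may be checked as sketched below. The function names are hypothetical, the AUC is computed by pairwise comparison (the Mann-Whitney formulation), and making the AUC direction-agnostic via `max(a, 1 - a)` is an assumption of this sketch.

```python
from statistics import mean
from typing import Sequence


def auc(cancer: Sequence[float], healthy: Sequence[float]) -> float:
    """Probability that a random cancer value exceeds a random healthy
    value, counting ties as one half (Mann-Whitney formulation)."""
    wins = sum((c > h) + 0.5 * (c == h) for c in cancer for h in healthy)
    return wins / (len(cancer) * len(healthy))


def passes_screen(cancer: Sequence[float], healthy: Sequence[float],
                  min_auc: float = 0.7, min_diff: float = 0.02) -> bool:
    """Keep an interval if it separates the populations in either direction."""
    a = auc(cancer, healthy)
    a = max(a, 1.0 - a)  # assumption: methylation direction is ignored
    diff = abs(mean(cancer) - mean(healthy))
    return a >= min_auc and diff >= min_diff


# Hypothetical hypomethylated interval: cancer values are lower.
print(passes_screen([0.10, 0.20, 0.15], [0.50, 0.60, 0.55]))
# Hypothetical uninformative interval.
print(passes_screen([0.50, 0.50], [0.50, 0.50]))
```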

[0057] Thus, through one or more of the aforementioned screening methods, multiple differentially methylated regions with high population classification performance were obtained from multiple methylation regions. Since these differentially methylated regions are used for pan-cancer detection of i different cancer types, to improve detection speed and the generalization performance of the constructed predictive model, it is also necessary to screen for candidate biomarker regions with broader population adaptability than single cancer populations. Specifically, in some advantageous implementations, differentially methylated regions appearing in at least m cancer populations can be extracted as multiple candidate biomarker regions, where m can be 2, 3, 4, 5, 6, or 7; for example, differentially methylated regions appearing in cfDNA samples from at least two cancer populations can be selected as candidate biomarker regions, so that the candidate biomarker regions can exhibit good classification performance for at least two cancer types.
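As a non-limiting illustration, selecting the differentially methylated regions that appear in at least m cancer populations may be sketched as follows. The function name, the dictionary layout, and the string interval identifiers are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def candidate_intervals(dmrs_by_cancer: Dict[str, Iterable[str]],
                        m: int = 2) -> List[str]:
    """Return intervals that are differentially methylated in >= m cancers."""
    counts = Counter(
        interval
        for intervals in dmrs_by_cancer.values()
        for interval in set(intervals)  # count each cancer type once
    )
    return sorted(iv for iv, n in counts.items() if n >= m)


# Hypothetical per-cancer lists of differentially methylated regions.
dmrs = {
    "lung": ["chr1:100-300", "chr2:500-700"],
    "liver": ["chr1:100-300"],
    "gastric": ["chr2:500-700", "chr3:10-200"],
}
print(candidate_intervals(dmrs, m=2))
```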

[0058] Considering that candidate biomarker intervals based on large target intervals or merging CpG sites across the entire genome may still have high data dimensionality even after screening using one or more of the aforementioned methods, placing an excessive burden on the prediction model's learning of classification features and parameter optimization, the multimodal pan-cancer early prediction method according to embodiments of this application further includes screening each candidate biomarker interval, i.e., interval dimensionality reduction. According to some feasible implementations, the screening method may include: calculating the importance value of each candidate biomarker interval among multiple candidate biomarker intervals; deleting the corresponding candidate biomarker interval from the multiple candidate biomarker intervals in response to an importance value not exceeding a first threshold; and retaining the corresponding candidate biomarker interval from the multiple candidate biomarker intervals in response to an importance value exceeding the first threshold.

[0059] Here, importance value is an indicator that reflects the contribution of each candidate marker interval to the prediction results of the prediction model. The higher the importance value, the greater the influence of the candidate marker interval on the model's prediction results. Importance value can be obtained in various ways, such as by inputting multiple candidate marker intervals into a linear regression model, Lasso model, random forest, gradient boosting tree, etc., and filtering intervals with high importance values. According to a feasible implementation plan, multiple candidate marker intervals are input into a random forest model and the importance value of each candidate marker interval is calculated separately. The multiple candidate marker intervals are sorted according to the importance value, and the top l candidate marker intervals with importance values greater than a first threshold are selected. All candidate marker intervals starting from the (l+1)th interval are deleted. The value of l is preferably 45, so as to ensure that the classification performance and construction cost of the simple binary classification prediction model can be optimized.
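As a non-limiting illustration, the selection of the top l candidate marker intervals whose importance values exceed the first threshold may be sketched as follows. The importance values are assumed to have been computed beforehand (for example by a random forest); the function name and the example values are hypothetical.

```python
from typing import Dict, List


def select_top_intervals(importance: Dict[str, float], threshold: float,
                         top_l: int = 45) -> List[str]:
    """Sort intervals by importance and keep at most top_l above threshold."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked if score > threshold][:top_l]


# Hypothetical importance values for four candidate marker intervals.
importance = {"a": 0.30, "b": 0.05, "c": 0.20, "d": 0.10}
print(select_top_intervals(importance, threshold=0.08, top_l=2))
```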

[0060] Step S130: train a first prediction model using the multiple candidate biomarker intervals, train a second prediction model based on fragmentation features of chromosomes extracted from the reference genome, and train a third prediction model based on methylation entropy features extracted from the chromosomes. Step S130 can be further divided into three sub-steps, one each for the first, second, and third prediction models.

[0061] Step S1301 involves training a first prediction model using multiple candidate biomarker intervals, and obtaining predicted values for the cancer population based on this model. Here, since multiple candidate biomarker intervals have already been obtained after multiple rounds of screening and dimensionality reduction, a concise and efficient binary classification model can be selected. This binary classification model is trained using the candidate biomarker intervals to obtain the first prediction model. Therefore, in a favorable implementation, logistic regression is chosen to construct a binary classification model for cancer and healthy populations. Logistic regression has relatively low computational complexity, making it suitable for processing large-scale sequencing data. Furthermore, it can be combined with optimization algorithms to rapidly iterate the classification model parameters, making it suitable for real-time analysis and processing of cancer detection data.

[0062] In this way, the first prediction model is obtained through the logistic regression algorithm. By providing the model with a fully connected layer or any possible classifier, it can output the cancer probability value of each individual in the cancer population for each type of cancer in i types of cancer, which can be used as part of the cancer population prediction value for the final training of the multimodal pan-cancer early prediction model.

[0063] Step S1302: train a second prediction model based on the fragmentation features of chromosomes extracted from the reference genome, and obtain the predicted value for the cancer population based on the second prediction model. Specifically, the autosomes of the reference genome selected during methylation panel sequencing are flattened and divided into adjacent, non-overlapping first intervals, such as multiple intervals of 100 kb in length. ctDNA fragments in plasma samples are shorter than normal cfDNA fragments. In particular, the distribution of ctDNA in cancer patients differs from that of cfDNA in healthy individuals in three fragment length ranges: 90-150 bp, 180-220 bp, and 250-320 bp.

[0064] Therefore, within each first interval, fragments are classified by length into three types: "short" fragments of 100-150 bp, which cover the first type of difference fragments; "middle" fragments of 150-260 bp, which cover the second type; and "long" fragments of 260-320 bp, which cover the third type. Classifying by length type in this way captures more difference fragments and enhances the specificity of the fragmentation features. In a preferred example, the short range can be set to 90-150 bp to capture further difference information, the middle range to 150-250 bp, and the long range to 250-320 bp, so as to better separate fragments longer than 250 bp from those shorter than 250 bp. Furthermore, the three types can be combined into an overall-length type, "nfrags", covering 100-320 bp, preferably extended to 90-320 bp. The numbers of short, middle, long, and nfrags fragments are then counted in each first interval to obtain the fragmentation features.
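As a non-limiting illustration, counting short, middle, long, and nfrags fragments per 100 kb first interval may be sketched as follows, using the preferred length ranges. Because the stated ranges share the boundary values 150 bp and 250 bp, this sketch assumes half-open ranges; assigning a fragment to an interval by its start coordinate, and the `(chrom, start, length)` input layout, are likewise assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

BIN_SIZE = 100_000  # length of a first interval (100 kb)


def length_class(length: int) -> Optional[str]:
    """Classify a fragment by length (assumed half-open ranges)."""
    if 90 <= length < 150:
        return "short"
    if 150 <= length < 250:
        return "middle"
    if 250 <= length <= 320:
        return "long"
    return None  # outside 90-320 bp: not counted


def count_fragments(fragments: Iterable[Tuple[str, int, int]],
                    bin_size: int = BIN_SIZE
                    ) -> Dict[Tuple[str, int], Dict[str, int]]:
    """Count fragment types per (chromosome, bin index)."""
    counts = defaultdict(
        lambda: {"short": 0, "middle": 0, "long": 0, "nfrags": 0})
    for chrom, start, length in fragments:
        cls = length_class(length)
        if cls is None:
            continue
        key = (chrom, start // bin_size)
        counts[key][cls] += 1
        counts[key]["nfrags"] += 1
    return dict(counts)


# Hypothetical fragments: (chromosome, start position, length in bp).
frags = [("chr1", 50_000, 120), ("chr1", 60_000, 167),
         ("chr1", 150_000, 300), ("chr1", 70_000, 400)]
print(count_fragments(frags))
```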

[0065] Then, the first intervals are successively merged to obtain multiple second intervals of a set length, for example 1 Mb, resulting in 2608 non-overlapping second intervals. A coverage is defined for each second interval: j denotes the j-th second interval among the multiple second intervals, where j can take the values 1, 2, ..., 2608. The fragment types (short, middle, long, and nfrags) in the j-th second interval and their corresponding counts can be set as the fragment features and their feature values. For example, 2608 second intervals yield 2608 × 4 = 10432 fragment features with their corresponding fragment counts. In this way, an elastic net regression algorithm can be used to construct a fragmentation-based binary classification model for cancer patients and healthy individuals as the second prediction model.
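As a non-limiting illustration, merging 100 kb first intervals into 1 Mb second intervals and flattening the counts into a feature vector may be sketched as follows. The function names, the merge factor of 10, and the dictionary layout keyed by `(chromosome, bin index)` are assumptions of this sketch.

```python
from typing import Dict, List, Sequence, Tuple

TYPES = ("short", "middle", "long", "nfrags")
Counts = Dict[Tuple[str, int], Dict[str, int]]


def merge_bins(counts_100kb: Counts, factor: int = 10) -> Counts:
    """Sum the counts of `factor` consecutive first intervals
    (10 x 100 kb = 1 Mb) into one second interval."""
    merged: Counts = {}
    for (chrom, idx), c in counts_100kb.items():
        target = merged.setdefault((chrom, idx // factor),
                                   {t: 0 for t in TYPES})
        for t in TYPES:
            target[t] += c.get(t, 0)
    return merged


def feature_vector(merged: Counts,
                   ordered_bins: Sequence[Tuple[str, int]]) -> List[int]:
    """Flatten counts into one value per (second interval, fragment type)."""
    return [merged.get(b, {}).get(t, 0) for b in ordered_bins for t in TYPES]


# Hypothetical counts for two adjacent 100 kb intervals on chr1.
small = {("chr1", 0): {"short": 1, "middle": 1, "long": 0, "nfrags": 2},
         ("chr1", 1): {"short": 0, "middle": 0, "long": 1, "nfrags": 1}}
merged = merge_bins(small)
print(merged)

# With 2608 second intervals, the vector has 2608 * 4 = 10432 features.
bins = [("chr1", j) for j in range(2608)]
print(len(feature_vector(merged, bins)))
```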

[0066] It can be understood that, after the fragment features are obtained, PCA can be used to reduce the dimensionality of the large number of fragment features, so that the principal components explaining 95% of the variance are used to construct the fragment feature matrix, and the elastic net regression algorithm is then used to construct the second prediction model. Elastic net regression is chosen because the number of fragment features in the second intervals may be much higher than the number of second intervals, and elastic net regression can effectively select fragment features to avoid overfitting of the second prediction model.
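As a non-limiting illustration, choosing how many principal components to retain for the 95% criterion may be sketched as follows. The sketch assumes that the explained-variance ratios have already been produced by a PCA routine and are sorted in descending order; the function name and the example ratios are hypothetical.

```python
from typing import Sequence


def n_components_for(explained_variance_ratio: Sequence[float],
                     target: float = 0.95) -> int:
    """Smallest number of leading components whose cumulative
    explained-variance ratio reaches the target."""
    total = 0.0
    for n, ratio in enumerate(explained_variance_ratio, start=1):
        total += ratio
        if total >= target:
            return n
    return len(explained_variance_ratio)


# Hypothetical explained-variance ratios from a PCA of fragment features.
ratios = [0.60, 0.25, 0.08, 0.04, 0.03]
print(n_components_for(ratios))
```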

[0067] Thus, the second prediction model handles the high collinearity among chromosome segments better than the first prediction model. This is because chromosome segments usually have complex interactions and correlations, which the first prediction model, built from candidate marker intervals, does not take into account. Furthermore, elastic net regression is well suited to building sparse models, which reduces model complexity and computational cost. The second prediction model therefore exhibits higher specificity and faster fitting during classification. Constructing the second prediction model and integrating it with the first prediction model effectively improves on the specificity of the first prediction model.

[0068] Step S1303: Extract methylation entropy features of chromosomes to train a third prediction model, and obtain cancer population prediction values based on the third prediction model. Specifically, multiple insert fragments are extracted from the methylation data of cfDNA samples in CpG mode to obtain the methylation pattern entropy values of multiple insert fragments; the methylation pattern entropy value of each chromosome is calculated as the methylation entropy feature. First, insert fragments are extracted from the methylation panel sequencing data of cfDNA samples, and unqualified insert fragments are filtered according to certain criteria.

[0069] It should be noted that insert fragment extraction refers to extracting fragments that can be aligned to the reference genome (e.g., the same fragment on the genome) from both the left-to-right and the right-to-left reads produced during sequencing. Examples include fragments of approximately 300 bp obtained by fragmentation, or fragments of approximately 170 bp for cell-free DNA. In some advantageous implementations, filtering out unqualified insert fragments may include: removing insert fragments containing fewer than 3 or more than 32 CpG sites, and / or removing insert fragments with missing CpG sites, thereby eliminating erroneous fragments or fragments with excessively low methylation levels.
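As a non-limiting illustration, the filtering of unqualified insert fragments may be sketched as follows. The representation of a fragment as a string with one character per CpG site ("1" methylated, "0" unmethylated, any other character a missing call) is a hypothetical encoding chosen for this sketch.

```python
def keep_fragment(pattern: str, min_cpgs: int = 3, max_cpgs: int = 32) -> bool:
    """Return True if an insert fragment passes the filter.

    Assumed encoding: one character per CpG site, "1" or "0"; any other
    character (for example ".") marks a missing CpG call.
    """
    if any(ch not in "01" for ch in pattern):
        return False  # fragment has a missing CpG site
    return min_cpgs <= len(pattern) <= max_cpgs


# Hypothetical CpG patterns of four insert fragments.
patterns = ["101", "10", "1.01", "1" * 33]
print([p for p in patterns if keep_fragment(p)])
```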

[0070] Specifically, a CpG pattern refers to a pattern that captures the methylation status of all CpG sites on the insert fragment. Using CpG patterns, one can statistically obtain results such as chromosome location, the search positions of the first and last CpG sites, a CpG pattern diagram (a pattern diagram of the overall methylation status of CpG sites on the insert fragment, i.e., the magnitude of methylation values), and the frequency of CpG pattern occurrences. Then, the methylation pattern entropy value is calculated for each insert fragment, using chromosomes as the unit. An example of the calculation method is shown in the following formula:

[0071] Here, BiEn(s) represents the methylation pattern entropy value of the insert fragment, n represents the number of CpG sites in the insert fragment, k takes values from 0 to n-2, and p is the probability value for a given k. The calculated methylation pattern entropy value thus reflects the degree of disorder or randomness of the methylation pattern of the insert fragment. The insert fragments screened by the multimodal pan-cancer early prediction method according to the embodiments of this application reflect the collective effect of the entropy of cfDNA fragments from different sources. In contrast, the methylation pattern entropy in the prior art is calculated on a per-site basis: rather than using extracted insert fragments to measure the methylation pattern entropy of cfDNA shed from a specific tumor cell, as this application does, it treats the entropy of several consecutive CpG sites from multiple DNA sources as a whole. Therefore, the methylation pattern entropy value of this application does not mask the CpG site information of the cfDNA of the target tumor cell.

[0072] In this way, the methylation pattern entropy value of each chromosome or specified interval can be further calculated using the formula: RE = (1 / N) × (e1 + e2 + e3 + … + eN) (Formula 2)

[0073] Here, RE represents the methylation entropy value of a region, such as a segment of a chromosome; N represents the total number of insert fragments within that region; and e1 to eN represent the methylation entropy values of the individual insert fragments within that region. Thus, the mean methylation pattern entropy value of all insert fragments on each chromosome can be used as the methylation pattern entropy value of that chromosome. Finally, with the methylation entropy value as the methylation entropy feature, a logistic regression algorithm is used to construct a classification model between cancer patients and healthy individuals as the third prediction model.
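As a non-limiting illustration, Formula 2 (the mean of the per-fragment entropy values within each region) may be sketched as follows. The per-fragment entropy values are taken as given inputs; the function name and the `(region, entropy)` input layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def region_entropy(fragment_entropies: Iterable[Tuple[str, float]]
                   ) -> Dict[str, float]:
    """Compute RE = (e1 + ... + eN) / N for each region (Formula 2)."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for region, e in fragment_entropies:
        sums[region] += e
        counts[region] += 1
    return {region: sums[region] / counts[region] for region in sums}


# Hypothetical (region, per-fragment entropy) pairs.
values = [("chr1", 0.2), ("chr1", 0.4), ("chr2", 1.0)]
print(region_entropy(values))
```

The resulting per-chromosome (or per-target-region) values would form the methylation entropy feature vector of one sample.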

[0074] In another implementation, multiple target regions can be selected, and the average methylation pattern entropy value of all insert fragments within each target region can be used as the methylation pattern entropy value of that target region. The target regions can also be cancer-highly relevant target regions obtained from existing publicly available research, detection, or sequencing products, such as the high-throughput methylation sequencing target regions used in Bocheng's Seven Cancer NGS Methylation Detection Kit. This reduces the computational cost of calculating the methylation pattern entropy values of the insert fragments, prevents data redundancy, and accelerates the fitting of the third prediction model. It can be understood that those skilled in the art can select appropriate target regions based on actual research or production conditions and use the multimodal pan-cancer early prediction method of the embodiments of this application to calculate the methylation pattern entropy values of the insert fragments.

[0075] Step S140: train a multimodal pan-cancer early prediction model for pan-cancer early detection using the cancer population prediction values from the first, second, and third prediction models. For cases with a small sample size and a large number of features, this application combines the cancer population prediction values from the base models constructed from multiple samples, namely the first, second, and third prediction models, to form a cancer population prediction feature set. This feature set is then used as input to train the multimodal pan-cancer early prediction model, which outputs the final prediction result. This approach facilitates the integration of base models from different modalities, enabling the multimodal pan-cancer early prediction model to adapt to different types of data and classification problems. Furthermore, since the training process fully considers the sample predictions of the three base models, it reduces the risk of overfitting that might exist with a single model, and exploits the complementarity of the models to provide more stable classification results and improve robustness.

[0076] In a favorable implementation, a binary classification multimodal pan-cancer early prediction model can be constructed using a logistic regression algorithm to predict cancer in each of the i types of cancer populations. It should be noted that, after the predicted values for the cancer populations (i.e., the predicted labels of at least one prediction model for the cancer populations) are obtained through steps S110-S130, a three-class or higher multi-class prediction model can be trained using any feasible algorithm as the multimodal pan-cancer early prediction model. Alternatively, the predicted values for the cancer populations can be used to fine-tune a large language model to obtain multi-class results; this application does not impose any limitation in this respect.
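As a non-limiting illustration, step S140 with logistic regression as the final model may be sketched as follows: the prediction values of the three base models form a three-column feature set on which a logistic regression is fitted. The plain batch gradient-descent fit, the learning rate, the number of epochs, and the toy data are assumptions of this sketch and do not reflect any particular implementation.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X: Sequence[Sequence[float]], y: Sequence[int],
                 lr: float = 0.5, epochs: int = 2000
                 ) -> Tuple[List[float], float]:
    """Fit logistic regression by batch gradient descent on the log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Final cancer probability from the three base-model predictions."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical prediction values [model 1, model 2, model 3] per sample.
X = [[0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [0.7, 0.6, 0.9],
     [0.2, 0.1, 0.3], [0.1, 0.3, 0.2], [0.3, 0.2, 0.1]]
y = [1, 1, 1, 0, 0, 0]  # 1 = cancer, 0 = healthy
w, b = fit_logistic(X, y)
print(predict(w, b, [0.85, 0.80, 0.75]) > 0.5)
print(predict(w, b, [0.15, 0.20, 0.10]) > 0.5)
```

In practice the base-model predictions used to train the final model would typically be produced on held-out samples, so that the final model does not simply inherit the overfitting of the base models.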

[0077] This application provides a multimodal pan-cancer early prediction method based on cfDNA methylation. It includes a first prediction model constructed using the methylation level of methylation regions detected by a methylation panel, a second prediction model constructed using genomic fragment features, and a third prediction model constructed using chromosome methylation entropy. By integrating the three modal models, this application obtains a multimodal pan-cancer early prediction model, which can demonstrate superior pan-cancer classification accuracy compared to any single modal prediction model in the detection of various cancer types. At the same time, it can also maintain high sensitivity and specificity, and is suitable for early non-invasive screening of various tumors.

[0078] Exemplary device

[0079] Figure 4 illustrates a block diagram of a multimodal pan-cancer early prediction device according to an embodiment of this application.

[0080] As shown in Figure 4, the multimodal pan-cancer early prediction device 200 according to an embodiment of this application includes:

[0081] The data acquisition unit 210 collects multiple cfDNA samples from patients with i types of cancer and from healthy individuals, and extracts methylation data from each of the multiple cfDNA samples. As described in the "Exemplary Method," methylation data refers to data obtained by conventional analysis of the FastQ data generated with a methylation panel, preferably a processable BAM file. The multimodal pan-cancer early prediction device according to embodiments of this application can cover up to seven types of cancer patients: lung cancer, colorectal cancer, gastric cancer, liver cancer, esophageal cancer, thyroid cancer, and ovarian cancer. For these seven cancer types, all cancer samples in the training set of the methylation data samples obtained by methylation panel sequencing of the cfDNA samples are used as the candidate group, all healthy samples are used as the control group, and methylation intervals are screened to obtain the methylation data. It can be understood that the methylation data can be data obtained by further processing of the methylation sequencing data from the methylation panel, such as methylation sequencing data after quality preprocessing and evaluation (FastP software), genome alignment (Bismark software), or removal of duplicate data caused by sample or experimental techniques. The reference genome used for alignment is the human genome. Multiple versions of the human reference genome exist in this field, with hg19 being a commonly used version, and those skilled in the art can select an appropriate version.

[0082] Thus, the methylation data samples of the seven cancer populations obtained by the data acquisition unit 210 can be the location information of all CpG sites and the methylation values of these CpG sites obtained by methylation panel sequencing from all cancer population samples and healthy population samples. The CpG sites can be CpG sites on cancer-highly relevant target regions obtained from existing publicly available research, publicly available detection, or sequencing products, such as sites on the regions used in Bocheng's seven-cancer NGS methylation detection kit, or CpG sites on the entire genome. These CpG sites can be merged according to certain rules to form methylation regions containing multiple CpG sites, such as regions with high methylation levels or regions with low methylation levels. These methylation regions may therefore exhibit differentiated methylation states in cancer patients and healthy individuals, thus becoming useful methylation biomarkers.

[0083] The biomarker screening unit 220 merges CpG sites based on the methylation data of the multiple cfDNA samples to obtain multiple methylation intervals. From these methylation intervals, it selects the differentially methylated intervals between each of the i cancer populations and the healthy population. The differentially methylated intervals appearing in at least m cancer populations are extracted as candidate biomarker intervals, and these candidate intervals are then screened. As described above, the data acquisition unit 210 obtains the location information and methylation values of all CpG sites in both populations. The biomarker screening unit 220 can calculate the difference in methylation values at each CpG site between the cancer population and the healthy population to obtain the methylation difference of each CpG site between the two populations. This difference can be the difference between the average methylation value of the CpG site in a cancer population and its average methylation value in the healthy population.

[0084] This yields differentiated CpG sites, which can then be used as a basis for constructing differentiated methylation regions. A feasible method for merging differentiated CpG sites into methylation regions is as follows: the biomarker screening unit 220 merges the CpG sites whose methylation value difference is not zero, that is, it sequentially merges adjacent differentiated CpG sites. The purpose of merging CpG sites is to combine co-methylated sites for joint analysis. Compared with differential analysis of individual differentiated methylated sites, analysis of merged sites has greater statistical significance. It should be noted that methylation regions obtained in this way may be excessively long. Since human cfDNA is generally around 167 bp in length, a maximum length needs to be set for the methylation regions so that they do not significantly exceed the cfDNA length. For example, the maximum length can be set to 200 bp so that each methylation region is close to the 167 bp length of cfDNA, which brings the calculated methylation level of the CpG sites within the region closer to the actual level and ensures data quality. Preferably, the maximum length can be set to 300 bp, with merging terminated once this length is exceeded. This avoids an excessive number of CpG sites within a methylation region, which would significantly reduce the probability of co-methylation, and thereby ensures that the analysis of the merged regions is meaningful.
Furthermore, since a lower number of CpG sites within a given interval increases the false positive probability of the CpG site methylation values obtained using existing techniques, the biomarker screening unit 220 can further require at least 3 CpG sites in each methylation interval, and can screen the CpG sites within the methylation interval using any feasible method for handling CpG site outliers or missing values, thereby further improving the robustness of the obtained methylation intervals.

[0085] In some implementations, the biomarker screening unit 220 screens multiple differentially methylated intervals between each cancer population and the healthy population from the multiple methylation intervals of the i types of cancer populations. This screening includes the following: for each of the i cancer types, the biomarker screening unit 220 requires that the sensitivity with which a methylation interval distinguishes the cancer population from the healthy population is not lower than a given threshold at a given specificity. For example, in a preferred implementation, the sensitivity is not lower than 70%, so that more than 70% of the methylation intervals from the cancer population exhibit population-specific methylation values, and the specificity is not lower than 80%, so that more than 80% of the methylation intervals from the healthy control group exhibit population-specific methylation values. This allows differentially methylated intervals with high sensitivity and specificity to be screened, while also reducing the impact of outliers on the differences and ensuring the stability of the differences.

[0086] Furthermore, the biomarker screening unit 220 can obtain the multiple differentially methylated intervals through the following screening method: the biomarker screening unit 220 requires that, for each of the i cancer types, the AUC obtained when the methylation interval is used to distinguish the corresponding cancer population from the healthy population is not lower than a given threshold; and that the difference between the cancer population and the healthy population in the average methylation value of all methylation sites within the methylation interval is not lower than a given threshold. For example, in a preferred embodiment, the AUC for detecting the cancer population using the methylation interval is not lower than 0.7, and the absolute value of the difference between the cancer population and the healthy population in the average methylation value of the methylation sites within the interval is not lower than 0.02. In this way, the selected differentially methylated intervals have good population classification performance while interference from methylation direction is avoided. That is, the methylation levels of the CpG sites in these intervals differ between the two populations, but the direction of the difference (hypermethylation or hypomethylation) is not specified, which ensures that differentially methylated intervals are not screened out because the signed total difference of their CpG sites is too small or zero.

[0087] Thus, the biomarker screening unit 220 obtains multiple differentially methylated regions with high population classification performance from multiple methylated regions through one or more of the aforementioned screening methods. Since these differentially methylated regions are used for pan-cancer detection of i different types of cancer, in order to improve detection speed and the generalization performance of the constructed prediction model, it is also necessary to screen for candidate biomarker regions with broader population adaptability than a single cancer population. Specifically, in some advantageous embodiments, the biomarker screening unit 220 can extract differentially methylated regions that appear in at least m types of cancer populations as multiple candidate biomarker regions, where m can be 2, 3, 4, 5, 6, or 7; for example, differentially methylated regions appearing in cfDNA samples from at least two types of cancer populations can be selected as candidate biomarker regions, so that the candidate biomarker regions can exhibit good classification performance for at least two cancer types.

[0088] Candidate biomarker intervals derived from large target intervals, or from merging CpG sites across the entire genome, may still have high data dimensionality even after the biomarker screening unit 220 has applied one or more of the aforementioned screening methods. This places an excessive burden on the prediction model's learning of classification features and on parameter optimization. The multimodal pan-cancer early prediction device according to embodiments of this application therefore further uses the biomarker screening unit 220 to screen the multiple candidate biomarker intervals, i.e., to perform interval dimensionality reduction. According to some feasible implementations, the screening method may include: using the biomarker screening unit 220 to calculate the importance value of each candidate biomarker interval among the multiple candidate biomarker intervals; deleting the corresponding candidate biomarker interval from the multiple candidate biomarker intervals in response to its importance value not exceeding a first threshold; and retaining the corresponding candidate biomarker interval in response to its importance value exceeding the first threshold.

[0089] Here, the importance value is an indicator that reflects the contribution of each candidate biomarker interval to the prediction result of the prediction model. The larger the importance value, the greater the influence of the candidate biomarker interval on the model's prediction result. The importance value can be obtained in various ways, for example by inputting the multiple candidate biomarker intervals into a linear regression model, Lasso model, random forest, gradient boosting tree, etc., running in the biomarker screening unit 220, in order to select intervals with high importance values. According to a feasible implementation, the biomarker screening unit 220 inputs the multiple candidate biomarker intervals into a random forest model and calculates the importance value of each candidate biomarker interval separately. The candidate biomarker intervals are sorted by importance value, the top l candidate biomarker intervals with importance values greater than a first threshold are selected, and all candidate biomarker intervals from the (l+1)-th onward are deleted. The value of l is preferably 45, which balances the classification performance and the construction cost of the simple binary classification prediction model.
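The sort-threshold-truncate step can be sketched as below. The importance values are taken as given (in the text they come from a random forest); the function name `select_by_importance` and the toy values are assumptions made for illustration.

```python
def select_by_importance(importance, threshold, top_l=45):
    """Keep intervals whose importance is strictly greater than threshold,
    ranked from most to least important, and truncate to the top l.

    importance maps an interval name to its importance value.
    """
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    kept = [name for name, value in ranked if value > threshold]
    return kept[:top_l]


# Toy importance values for illustration only.
toy_importance = {"a": 0.02, "b": 0.0055, "c": 0.01, "d": 0.001}
kept_all = select_by_importance(toy_importance, threshold=0.0055)
kept_one = select_by_importance(toy_importance, threshold=0.0055, top_l=1)
```

Note that an interval whose importance equals the threshold is deleted, matching the "not exceeding a first threshold" wording.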

[0090] The model building unit 230 trains a first prediction model using the selected candidate biomarker intervals, trains a second prediction model based on fragmentation features of chromosomes extracted from the reference genome, and trains a third prediction model by extracting methylation entropy features of chromosomes. Here, the model building unit 230 can be further divided into three functional subunits: a first prediction model construction subunit, a second prediction model construction subunit, and a third prediction model construction subunit.

[0091] The first prediction model construction subunit trains a first prediction model using multiple candidate biomarker intervals, and obtains the predicted value for the cancer population based on the first prediction model. Here, since multiple candidate biomarker intervals have already been obtained after multiple screenings and dimensionality reduction, a concise and efficient binary classification model can be selected. The first prediction model construction subunit trains the binary classification model using the candidate biomarker intervals and obtains the first prediction model. Therefore, in an advantageous implementation, the first prediction model construction subunit selects the logistic regression algorithm to construct a binary classification model for cancer and healthy populations. Logistic regression has relatively low computational complexity, making it suitable for processing large-scale sequencing data. In addition, it can be combined with optimization algorithms to quickly iterate the classification model parameters, making it suitable for real-time analysis and processing of cancer detection.

[0092] In this way, the first prediction model is obtained through the logistic regression algorithm. By providing the model with a fully connected layer or any possible classifier, it can output the cancer probability value of each individual in the cancer population for each type of cancer in i types of cancer, which can be used as part of the cancer population prediction value for the final training of the multimodal pan-cancer early prediction model.

[0093] The second prediction model construction subunit trains a second prediction model based on fragmented features extracted from the reference genome, and obtains prediction values for cancer populations based on this model. Specifically, the second prediction model construction subunit flattens the autosomes of the reference genome selected during methylation panel sequencing to divide them into adjacent, non-overlapping first intervals, such as multiple 100kb intervals. Notably, the ctDNA fragment length in plasma samples is shorter than that of normal cfDNA fragments. Specifically, the distribution of ctDNA in cancer populations differs from that in healthy individuals across three fragment types: 90-150bp, 180-220bp, and 250-320bp.

[0094] Therefore, within each first interval, the second prediction model construction subunit defines short-length segments ("short") as those with a length of 100-150 bp, encompassing the first type of difference segments; medium-length segments ("middle") as those with a length of 150-260 bp, encompassing the second type of difference segments; and long-length segments ("long") as those with a length of 260-320 bp, encompassing the third type of difference segments. Classifying by length type in this way captures more difference segments and makes the fragmentation features more specific. In a preferred example, the length of short segments can be set to 90-150 bp to capture further difference information, the length of middle segments can be set to 150-250 bp, and the length of long segments can be set to 250-320 bp, so as to better distinguish segments longer or shorter than 250 bp. Furthermore, the second prediction model construction subunit can also combine the above three types of segments into an overall length class, i.e., nfrags, covering lengths of 100-320 bp, preferably expanded to 90-320 bp. Then, the numbers of short, middle, long, and nfrags segments are counted in each first interval to obtain the fragmentation features.
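Counting the four segment classes per 100 kb first interval can be sketched as follows. The input format `(chrom, start, length)`, the function names, and the choice of half-open boundaries at 150 bp and 260 bp (the text gives ranges that touch at those values) are assumptions made for the example.

```python
from collections import defaultdict

BIN_SIZE = 100_000  # 100 kb first intervals


def length_class(length):
    """Map a fragment length in bp to short / middle / long, or None if out of range.

    Assumption: boundaries are half-open, [100, 150), [150, 260), [260, 320].
    """
    if 100 <= length < 150:
        return "short"
    if 150 <= length < 260:
        return "middle"
    if 260 <= length <= 320:
        return "long"
    return None


def count_fragments(fragments):
    """fragments: iterable of (chrom, start, length). Returns counts per (chrom, bin index)."""
    counts = defaultdict(lambda: {"short": 0, "middle": 0, "long": 0, "nfrags": 0})
    for chrom, start, length in fragments:
        cls = length_class(length)
        if cls is None:
            continue  # outside 100-320 bp, not counted in any class
        key = (chrom, start // BIN_SIZE)
        counts[key][cls] += 1
        counts[key]["nfrags"] += 1  # nfrags is the total over the three classes
    return dict(counts)


# Toy fragments for illustration only.
toy_fragments = [
    ("chr1", 5_000, 120),
    ("chr1", 99_999, 170),
    ("chr1", 100_000, 300),
    ("chr1", 150, 400),
]
bin_counts = count_fragments(toy_fragments)
```

Switching to the preferred 90-150 / 150-250 / 250-320 bp ranges would only require changing the three comparisons in `length_class`.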

[0095] Then, the second prediction model construction subunit sequentially merges the first intervals to obtain multiple second intervals of a given length, for example 1 Mb each, resulting in 2608 non-overlapping second intervals. A coverage is defined for each second interval, where j denotes the j-th second interval among the multiple second intervals (j = 1, 2, ..., 2608). The segment types (short, middle, long, and nfrags) and their counts in the j-th second interval are used as fragment features and their corresponding feature values. For example, 2608 second intervals with four segment types each yield 10432 fragment features and their corresponding fragment counts. In this way, the second prediction model construction subunit can use the elastic net regression algorithm to construct a fragmentation-based binary classification model for cancer patients and healthy individuals as the second prediction model.
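Merging first intervals into second intervals and flattening the counts into a feature vector can be sketched as below. It assumes ten consecutive 100 kb intervals per 1 Mb interval on a single chromosome and keeps a trailing partial group as its own interval; how partial groups are actually handled is not stated in the text. `merge_bins` and `feature_vector` are hypothetical names.

```python
TYPES = ("short", "middle", "long", "nfrags")


def merge_bins(first_counts, per_second=10):
    """Sum the counts of every `per_second` consecutive first intervals.

    first_counts: list of dicts (one per consecutive 100 kb interval of one chromosome).
    """
    merged = []
    for i in range(0, len(first_counts), per_second):
        chunk = first_counts[i:i + per_second]
        merged.append({k: sum(c[k] for c in chunk) for k in TYPES})
    return merged


def feature_vector(second_counts):
    """Flatten per-interval coverage into one vector: four features per second interval."""
    return [c[k] for c in second_counts for k in TYPES]


# Toy input: 20 identical first intervals (illustration only).
toy_first = [{"short": 1, "middle": 2, "long": 3, "nfrags": 6} for _ in range(20)]
toy_second = merge_bins(toy_first)
toy_features = feature_vector(toy_second)

# Feature count stated in the text: 2608 second intervals x 4 segment types.
total_features = 2608 * len(TYPES)
```

The last line reproduces the arithmetic behind the 10432 fragment features.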

[0096] It will be understood that, after obtaining the fragment features, the second prediction model construction subunit applies PCA to reduce the dimensionality of the large number of fragment features, constructs a fragment feature matrix from the principal components that explain 95% of the variance, and then constructs the second prediction model using the elastic net regression algorithm. Elastic net regression is chosen because the number of fragment features over the second intervals may be much higher than the number of second intervals themselves, and elastic net regression can effectively select among the fragment features, thereby avoiding overfitting of the second prediction model.
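Choosing how many principal components to keep for a 95% target can be sketched as below. The sketch assumes the per-component variances (eigenvalues) have already been computed by a PCA routine and reads "95% of the differences" as 95% of the explained variance; `n_components_for` is a hypothetical helper.

```python
def n_components_for(variances, target=0.95):
    """Smallest number of leading components whose cumulative share of
    total variance reaches `target`.

    variances: per-component variances in any order.
    """
    total = sum(variances)
    ordered = sorted(variances, reverse=True)
    cumulative = 0.0
    for k, v in enumerate(ordered, start=1):
        cumulative += v
        if cumulative / total >= target:
            return k
    return len(ordered)  # fallback if rounding keeps the ratio just below target


# Toy variances for illustration only: shares are 0.60, 0.30, 0.06, 0.04.
k_95 = n_components_for([6.0, 3.0, 0.6, 0.4])
k_equal = n_components_for([5.0, 5.0])
```

In this toy case the first two components reach only 90%, so a third is needed.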

[0097] Thus, the second prediction model performs better than the first prediction model in handling the problem of high collinearity among chromosome segments. This is because there are usually complex interactions and correlations between chromosome segments, which the first prediction model, built using candidate biomarker intervals, does not take into account. Furthermore, elastic net regression is well suited to building sparse models, which reduces model complexity and computational cost. Therefore, the second prediction model exhibits higher specificity and faster fitting speed during classification. Constructing the second prediction model and integrating it with the first prediction model effectively improves on the specificity of the first prediction model alone.

[0098] The third prediction model construction subunit extracts methylation entropy features of chromosomes to train the third prediction model, and obtains cancer population prediction values based on the third prediction model. Specifically, the third prediction model construction subunit extracts multiple insert fragments from the methylation data of cfDNA samples in CpG mode to obtain the methylation pattern entropy values of multiple insert fragments; then, the methylation pattern entropy value of each chromosome is calculated as the methylation entropy feature. First, the third prediction model construction subunit extracts insert fragments from the methylation panel sequencing data of cfDNA samples and filters out unqualified insert fragments according to certain criteria.

[0099] It is important to note that the third prediction model construction subunit extracts insert fragments whose forward and reverse reads during sequencing can both be aligned to the reference genome (e.g., to the same fragment on the genome). Examples include fragments of approximately 300 bp obtained by fragmentation, or fragments of approximately 170 bp for cell-free DNA. In some advantageous implementations, filtering unqualified insert fragments by the third prediction model construction subunit may include removing insert fragments containing fewer than 3 or more than 32 CpG sites and / or removing insert fragments with missing CpG sites, thereby eliminating erroneous fragments or fragments with excessively low methylation levels.
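The fragment filter described above can be sketched as a single predicate. The encoding of CpG calls as 1 (methylated), 0 (unmethylated), and `None` (missing), and the name `is_qualified`, are assumptions introduced for the example.

```python
def is_qualified(cpg_states, min_cpg=3, max_cpg=32):
    """Return True if an insert fragment passes the filter.

    cpg_states: list of per-CpG calls on the fragment, 1 / 0 / None (hypothetical encoding).
    """
    # Remove fragments with fewer than 3 or more than 32 CpG sites.
    if not (min_cpg <= len(cpg_states) <= max_cpg):
        return False
    # Remove fragments with any missing CpG call.
    return all(state is not None for state in cpg_states)


# Toy fragments for illustration only.
ok_fragment = is_qualified([1, 0, 1])
too_few = is_qualified([1, 0])
too_many = is_qualified([1] * 33)
has_missing = is_qualified([1, None, 0, 1])
```

This sketch applies both removal criteria together; the text also allows applying either one alone.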

[0100] Specifically, a CpG pattern captures the methylation status of all CpG sites on the insert fragment. Using CpG patterns, the following can be obtained statistically: the chromosome location, the retrieval positions of the first and last CpG sites, a CpG pattern map (a pattern diagram of the overall methylation status of the CpG sites on the insert fragment, i.e., the magnitude of the methylation values), and the frequency with which each CpG pattern occurs. Then, on a chromosome-by-chromosome basis, the methylation pattern entropy value is calculated for each insert fragment. An example of how the third prediction model construction subunit calculates this entropy value is shown in the following formula:

[0101] Here, BiEn(s) represents the methylation pattern entropy value of the insert fragment, n represents the number of CpG sites in the insert fragment, k takes values from 0 to n-2, and p is the probability for a given value of k. The methylation pattern entropy value calculated by the third prediction model construction subunit thus measures the degree of disorder or randomness of the methylation pattern of the insert fragment. The entropy of the insert fragments screened by the multimodal pan-cancer early prediction device according to the embodiments of this application reflects the collective effect of cfDNA fragments from different sources. In contrast, the methylation pattern entropy in the prior art is calculated on a per-site basis. The method used in this application does not statistically analyze the methylation pattern entropy of cfDNA shed from one particular tumor cell; rather, it considers the entropy of several consecutive CpG sites from multiple DNA sources as a whole. Therefore, the methylation pattern entropy value of this application does not mask the CpG site information of the target tumor cell cfDNA.

[0102] In this way, the third prediction model construction subunit can further calculate the methylation pattern entropy value for each chromosome or specified interval, using the formula: $RE = \frac{1}{N}\left(e_1 + e_2 + e_3 + \dots + e_N\right)$ (Formula 2)

[0103] Here, RE represents the methylation entropy value of a region, such as a segment of a chromosome; N represents the total number of inserted segments within that region; and e represents the methylation entropy value of each inserted segment within that region. Thus, the third prediction model construction subunit can use the average methylation pattern entropy value of all inserted segments on each chromosome as the methylation pattern entropy value of that chromosome. Finally, using methylation entropy as the methylation entropy feature, the third prediction model construction subunit uses a logistic regression algorithm to construct a classification model between cancer patients and healthy individuals, serving as the third prediction model.
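Formula 2 is a plain arithmetic mean over the insert fragments of a region, which can be sketched as below. Per-fragment entropy values are taken as given; the input format `(region, entropy)` and the name `region_entropy` are assumptions for illustration.

```python
from collections import defaultdict


def region_entropy(fragment_entropies):
    """Average per-fragment entropy within each region (RE in Formula 2).

    fragment_entropies: iterable of (region, entropy) pairs, where region may be
    a chromosome name or any specified interval label.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for region, entropy in fragment_entropies:
        sums[region] += entropy
        counts[region] += 1
    return {region: sums[region] / counts[region] for region in sums}


# Toy entropy values for illustration only.
toy_re = region_entropy([("chr1", 0.2), ("chr1", 0.4), ("chr2", 1.0)])
```

Each resulting value is one methylation entropy feature, so a per-chromosome grouping yields one feature per chromosome.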

[0104] In another implementation, the third prediction model construction subunit can also select multiple target intervals and use the average methylation pattern entropy value of all insert fragments in each target interval as the methylation pattern entropy value of that target interval. The target intervals can also be highly cancer-relevant target intervals obtained from existing publicly available research, detection, or sequencing products, such as the high-throughput methylation sequencing target intervals used in Bocheng's Seven Cancer NGS Methylation Detection Kit. This reduces the computational cost of calculating the methylation pattern entropy values of the insert fragments, prevents data redundancy, and accelerates the fitting of the third prediction model. It is understood that those skilled in the art can select appropriate target intervals based on actual research or production conditions and use the multimodal pan-cancer early prediction device of the embodiments of this application to calculate the methylation pattern entropy values of the insert fragments.

[0105] The model integration unit 240 trains a multimodal pan-cancer early prediction model for pan-cancer early detection using the cancer population prediction values of the first, second, and third prediction models. For cases with a small sample size and a large number of features, the model integration unit 240 combines the cancer population prediction values of the base models constructed from the multiple samples, namely the first, second, and third prediction models, to form a cancer population prediction feature set. This feature set is then used as input to train the multimodal pan-cancer early prediction model, which outputs the final prediction result. In this way, the model integration unit 240 can integrate base models of different modalities, enabling the multimodal pan-cancer early prediction model to adapt to different types of data and classification problems. Furthermore, since the training process takes into account the sample predictions of all three base models, it also reduces the risk of overfitting that may exist with a single model, and exploits model complementarity to provide more stable classification results and improve robustness.
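The integration step is a form of stacking: each sample's three base-model predictions become a three-element feature row for a final classifier. The sketch below uses a small gradient-descent logistic regression written from scratch so that it is self-contained; the learning rate, epoch count, function names, and toy values are all assumptions and do not reflect the training procedure or data of this application.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the logistic loss. Returns (weights, bias)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Each row: [first model, second model, third model] prediction for one sample.
# Toy values for illustration only; 1 = cancer, 0 = healthy.
stacked = [
    [0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [0.7, 0.6, 0.8],
    [0.2, 0.1, 0.3], [0.1, 0.3, 0.2], [0.3, 0.2, 0.4],
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_logistic(stacked, labels)
p_cancer_like = predict_proba(weights, bias, [0.85, 0.8, 0.7])
p_healthy_like = predict_proba(weights, bias, [0.15, 0.2, 0.3])
```

Because the final model sees only three inputs per sample, it stays small even when the base models are trained on thousands of features.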

[0106] In an advantageous implementation, the model integration unit 240 may again use a logistic regression algorithm to construct a binary multimodal pan-cancer early prediction model and predict cancer in the i types of cancer populations. It should be noted that, after the predicted values for the cancer population (that is, the labels predicted for the cancer population by at least one prediction model) have been obtained through the data acquisition unit 210, the biomarker screening unit 220, and the model building unit 230, the model integration unit 240 may also use any feasible algorithm to train a predictive model with three or more classes as the multimodal pan-cancer early prediction model, or fine-tune a large language model using the predicted values for the cancer population to obtain multi-class results; this application does not impose any limitation in this respect.

[0107] The multimodal pan-cancer early prediction device based on cfDNA methylation provided in this application includes a first prediction model constructed using the methylation level of methylation regions detected by the methylation panel, a second prediction model constructed using genomic fragment features, and a third prediction model constructed using chromosomal methylation entropy. By integrating the three modal models, a multimodal pan-cancer early prediction model is obtained. It can demonstrate superior pan-cancer classification accuracy compared to any single modal prediction model in the detection of various cancer types, while maintaining high sensitivity and specificity, making it suitable for early non-invasive screening of various tumors.

[0108] As described above, the multimodal pan-cancer early prediction device 200 according to the embodiments of this application can be implemented in various terminal devices, such as servers used to train any prediction model and multimodal pan-cancer early prediction models. In one example, the multimodal pan-cancer early prediction device 200 according to the embodiments of this application can be integrated into a terminal device as a software module and / or a hardware module. For example, the multimodal pan-cancer early prediction device 200 can be a software module in the operating system of the terminal device, or it can be an application developed for the terminal device; of course, the multimodal pan-cancer early prediction device 200 can also be one of many hardware modules of the terminal device.

[0109] Alternatively, in another example, the multimodal pan-cancer early prediction device 200 and the terminal device can also be separate devices, and the multimodal pan-cancer early prediction device 200 can be connected to the terminal device via wired and / or wireless networks, and transmit interactive information in accordance with an agreed data format.

[0110] Exemplary electronic devices

[0111] The electronic device according to an embodiment of the present application will now be described with reference to Figure 5.

[0112] Figure 5 illustrates a block diagram of an electronic device according to an embodiment of this application.

[0113] As shown in Figure 5, the electronic device 10 includes one or more processors 11 and memory 12.

[0114] The processor 11 may be a central processing unit (CPU) or other form of processing unit with data processing capabilities and / or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.

[0115] The memory 12 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and / or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and / or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 11 may execute the program instructions to implement the multimodal pan-cancer early prediction methods of the various embodiments of this application described above, and / or other desired functions. Various contents such as cfDNA samples, candidate biomarker regions, fragmentation features, methylation entropy features, etc., may also be stored in the computer-readable storage medium.

[0116] In one example, the electronic device 10 may also include an input device 13 and an output device 14, which are interconnected via a bus system and / or other forms of connection mechanism (not shown).

[0117] The input device 13 may include, for example, a keyboard, a mouse, etc.

[0118] The output device 14 can output various information to the outside, including a trained multimodal pan-cancer early prediction model. The output device 14 may include, for example, a display, a speaker, a printer, and a communication network and its connected remote output devices, etc.

[0119] Of course, for simplicity, Figure 5 only shows some of the components of the electronic device 10 that are relevant to this application, omitting components such as buses, input / output interfaces, etc. In addition, the electronic device 10 may include any other suitable components depending on the specific application.

[0120] Exemplary computer program products and computer-readable storage media

[0121] In addition to the methods, apparatus, and devices described above, embodiments of this application may also be computer program products, which include computer program instructions that, when executed by a processor, cause the processor to perform the steps of the multimodal pan-cancer early prediction method according to embodiments of this application as described in the "Exemplary Methods" section of this specification.

[0122] The computer program product can be written in any combination of one or more programming languages to perform the operations of the embodiments of this application. The programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as C, Python, or similar languages. The program code can be executed entirely on the user's computing device, partially on the user's computing device, as a standalone software package, partially on the user's computing device and partially on a remote computing device, or entirely on a remote computing device or server.

[0123] Furthermore, embodiments of this application may also be computer-readable storage media storing computer program instructions thereon, which, when executed by a processor, cause the processor to perform the steps in the multimodal pan-cancer early prediction method according to embodiments of this application described in the "Exemplary Methods" section of this specification.

[0124] The computer-readable storage medium may be any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may, for example, include, but is not limited to, electrical, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any combination thereof. More specific examples of readable storage media (a non-exhaustive list) include: electrical connections having one or more wires, portable disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.

[0125] Example

[0126] This application provides a general and / or specific description of the materials and test methods used in the experiments. Unless otherwise specified, all raw materials or instruments used are commercially available and readily available.

[0127] Example 1: Construction of the First Prediction Model

[0128] For samples of each of the seven cancer types (lung cancer, colorectal cancer, stomach cancer, liver cancer, esophageal cancer, thyroid cancer, and ovarian cancer) provided by Bocheng, cfDNA samples were collected and divided into training set samples and test set samples. Candidate biomarker intervals were screened on the training set samples using the following method:

[0129] Using the human genome hg19 as the reference genome, methylation sequencing was performed on the cfDNA samples, and methylation regions were obtained from the regions specified in the Bocheng Seven Cancer NGS Methylation Detection Kit. For hypermethylated regions, thresholds were set for sensitivity, for the difference in methylation values of CpG sites within the methylated region (DELTA), and for the significance p-value of CpG sites: sensitivity ≥ 0.6, DELTA ≥ 0.02, and p-value < 0.01, respectively.

[0130] For hypomethylated regions, the thresholds for sensitivity, DELTA, and p-value were set as follows: sensitivity ≥ 0.55, DELTA ≤ -0.02, and p-value < 0.01, respectively. Then, the top 200 differentially methylated regions were selected by absolute DELTA value; where fewer than 200 regions met the thresholds, all of them were selected. Differentially methylated regions appearing in at least two cancer types were then selected as candidate biomarker regions. The final candidate biomarker regions included 152 hypermethylated regions and 313 hypomethylated regions. Finally, the importance value of the candidate biomarker regions was calculated using a random forest algorithm, and 45 candidate biomarker regions with an importance value greater than 0.0055 were selected as features for subsequent modeling.
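The Example 1 thresholds and the top-200 rule can be sketched as below. The dictionary keys (`sensitivity`, `delta`, `p_value`) and function names are hypothetical, and the sketch assumes the top-200 cut is applied separately to each methylation direction, which the text does not state explicitly.

```python
def passes(region, direction):
    """Apply the Example 1 thresholds for a hyper- or hypomethylated region."""
    if region["p_value"] >= 0.01:
        return False
    if direction == "hyper":
        return region["sensitivity"] >= 0.6 and region["delta"] >= 0.02
    return region["sensitivity"] >= 0.55 and region["delta"] <= -0.02


def top_regions(regions, direction, limit=200):
    """Keep passing regions, ranked by |DELTA|; if fewer than `limit` pass, keep all."""
    kept = [r for r in regions if passes(r, direction)]
    kept.sort(key=lambda r: abs(r["delta"]), reverse=True)
    return kept[:limit]


# Toy regions for illustration only.
toy_regions = [
    {"name": "r1", "sensitivity": 0.70, "delta": 0.05, "p_value": 0.001},
    {"name": "r2", "sensitivity": 0.65, "delta": 0.10, "p_value": 0.005},
    {"name": "r3", "sensitivity": 0.58, "delta": -0.04, "p_value": 0.002},
    {"name": "r4", "sensitivity": 0.90, "delta": 0.30, "p_value": 0.050},
]
hyper_names = [r["name"] for r in top_regions(toy_regions, "hyper")]
hypo_names = [r["name"] for r in top_regions(toy_regions, "hypo")]
```

Region `r4` has the largest DELTA but is excluded by its p-value, showing that all three thresholds must hold before ranking.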

[0131] A first predictive model for binary classification was constructed using the logistic regression algorithm. Classification predictions were performed on test set samples, and the model's predictive performance was evaluated. Evaluation results show that on the test set, the AUC was 0.97, the sensitivity was 85.7%, and the specificity was 96.6%.
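Sensitivity and specificity as reported in these examples follow from the confusion counts of a binary classifier. The sketch below uses toy labels that are not the data of this application; `sensitivity_specificity` is a hypothetical helper name.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = cancer, 0 = healthy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # share of cancer samples correctly called
    specificity = tn / (tn + fp)  # share of healthy samples correctly called
    return sensitivity, specificity


# Toy labels for illustration only.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1]
toy_sens, toy_spec = sensitivity_specificity(toy_true, toy_pred)
```

Here three of four cancer samples and four of five healthy samples are called correctly.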

[0132] Example 2: Construction of the Second Prediction Model

[0133] The reference genome autosomes were flattened and divided into adjacent, non-overlapping first intervals of 100 kb each. Four segment classes were defined: short (100-150 bp), middle (150-260 bp), long (260-320 bp), and nfrags (100-320 bp). The number of segments of each class (short, middle, long, nfrags) was counted in each first interval. The 100 kb first intervals were then merged into second intervals of 1 Mb each, resulting in 2608 non-overlapping second intervals on the reference genome autosomes. Let j denote the j-th second interval, with j = 1, 2, ..., 2608. The number of segments of each class (short, middle, long, nfrags) within the j-th second interval represents its coverage. All coverage values were used as segment features, resulting in 10432 features.

[0134] Then, PCA was used to reduce the dimensionality of the fragment features, constructing a fragment feature matrix from the principal components that explained 95% of the variance. Elastic net regression was then used to construct a fragmentation-based binary classification model, i.e., the second prediction model. Classification predictions were performed on the test set samples, and the model's predictive performance was evaluated. The evaluation results showed an AUC of 0.95, a sensitivity of 79.1%, and a specificity of 98.3% on the test set.

[0135] Example 3: Construction of the Third Prediction Model

[0136] The methylation pattern entropy value of the inserted fragments in each chromosome was calculated separately. Specifically, all inserted fragments were extracted and unqualified fragments were filtered out. The methylation pattern entropy value of each inserted fragment was used as the methylation entropy feature, and a binary classification model based on methylation entropy, i.e., the third prediction model, was constructed using a logistic regression algorithm. Classification predictions were performed on the test set samples, and the model's predictive performance was evaluated. The evaluation results showed that the AUC on the test set was 0.85, the sensitivity was 65.9%, and the specificity was 84.5%.

[0137] Example 4: Construction of a Multimodal Pan-Cancer Early Prediction Model

[0138] Using cfDNA samples as input, positive predictive values for the seven cancer populations in Example 1 were obtained using the first, second, and third prediction models. All positive predictive values were used as features, and a multimodal binary classification model, i.e., a multimodal pan-cancer early prediction model, was constructed using a logistic regression algorithm. Classification predictions were performed on the test set samples, and the model's predictive performance was evaluated. The evaluation results show that the multimodal model distinguishes well between cancer and non-cancer samples. The ROC curve of the test set is shown in Figure 3, with an AUC of 0.98, a sensitivity of 89.0%, and a specificity of 98.3% on the test set.

[0139] As can be seen from Examples 1-4 and Figure 3, after integrating the three basic models to obtain the multimodal pan-cancer early prediction model, its cancer sample prediction sensitivity is improved compared to the basic models, while the prediction specificity for healthy samples remains at the highest level compared to the basic models. Therefore, the multimodal pan-cancer early prediction model of this application can be trained using small sample data and accurately distinguish plasma samples from seven types of cancer patients and healthy individuals.

[0140] The basic principles of this application have been described above with reference to specific embodiments. However, it should be noted that the advantages, benefits, and effects mentioned in this application are merely examples and not limitations, and should not be considered as essential features of each embodiment of this application. Furthermore, the specific details disclosed above are for illustrative and facilitative purposes only, and are not limitations. These details do not limit the application to the necessity of employing the aforementioned specific details for implementation.

[0141] In this application, words such as “including,” “comprising,” and “having” are open-ended terms meaning “including but not limited to” and are used interchangeably. The terms “or” and “and” as used herein refer to the terms “and / or” and are used interchangeably unless the context explicitly indicates otherwise. The term “such as” as used herein refers to the phrase “such as but not limited to” and is used interchangeably.

[0142] It should also be noted that in the methods, systems, and devices of this application, each step or module can be decomposed and / or recombined. These decompositions and / or recombinations should be considered as equivalent solutions of this application.

[0143] The above description has been given for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of this application to the forms disclosed herein. Although numerous exemplary aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, alterations, additions, and sub-combinations thereof.

Claims

1. A multimodal pan-cancer early prediction method, comprising: collecting multiple cfDNA samples from populations with i types of cancer and from a healthy population, and extracting methylation data from each of the multiple cfDNA samples; merging CpG sites based on the methylation data of the multiple cfDNA samples to obtain multiple methylation intervals; extracting, from the multiple methylation intervals, differential methylation intervals between each cancer population and the healthy population; selecting differential methylation intervals that appear in no fewer than m cancer populations as multiple candidate biomarker intervals; screening the multiple candidate biomarker intervals; training a first prediction model using the screened candidate biomarker intervals, training a second prediction model based on fragmentation features of chromosomes extracted from a reference genome, and training a third prediction model by extracting methylation entropy features of chromosomes; and training a multimodal pan-cancer early prediction model for pan-cancer early detection using the cancer population prediction values of the first prediction model, the second prediction model, and the third prediction model.

2. The multimodal pan-cancer early prediction method according to claim 1, wherein extracting methylation data from the multiple cfDNA samples comprises: extracting multiple CpG sites from the multiple cfDNA samples, together with the methylation value of each CpG site; and wherein merging CpG sites based on the methylation data of the multiple cfDNA samples comprises: calculating, for each CpG site, the difference in methylation value between each of the i cancer populations and the healthy population; and, in response to the difference being non-zero, merging the corresponding CpG sites.
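A minimal sketch of the merging rule in claim 2 follows. The claim only states that CpG sites with a non-zero methylation difference are merged; the adjacency rule used here (a `max_gap` distance in base pairs), the tolerance `eps`, and the function name are assumptions for illustration, and the sites are assumed to be sorted and to lie on one chromosome.

```python
def merge_cpg_sites(positions, cancer_means, healthy_means, max_gap=200, eps=0.0):
    """Merge CpG sites with a non-zero cancer-vs-healthy difference into intervals.

    Sites whose absolute difference is <= eps are skipped. A retained site
    extends the current interval if it lies within max_gap bp of the
    interval's last retained site; otherwise it starts a new interval.
    max_gap and eps are illustrative parameters, not taken from the claim.
    """
    intervals = []
    for pos, cancer, healthy in zip(positions, cancer_means, healthy_means):
        if abs(cancer - healthy) <= eps:
            continue  # zero difference: site is not merged
        if intervals and pos - intervals[-1][1] <= max_gap:
            intervals[-1][1] = pos          # extend the open interval
        else:
            intervals.append([pos, pos])    # start a new interval
    return [tuple(iv) for iv in intervals]

# Hypothetical toy data: mean methylation per site in one cancer type vs healthy.
positions = [100, 150, 180, 900, 950, 2000]
cancer    = [0.8, 0.7, 0.5, 0.2, 0.9, 0.4]
healthy   = [0.3, 0.7, 0.1, 0.2, 0.5, 0.1]

merged = merge_cpg_sites(positions, cancer, healthy)
print(merged)  # [(100, 180), (950, 950), (2000, 2000)]
```

In practice the comparison would be repeated for each of the i cancer populations, and a tolerance greater than zero may be preferable to an exact non-zero test on floating-point values.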

3. The multimodal pan-cancer early prediction method according to claim 1, wherein screening the candidate marker intervals comprises: calculating an importance value for each of the candidate marker intervals; in response to the importance value not being greater than a first threshold, removing the corresponding candidate marker interval; and, in response to the importance value being greater than the first threshold, retaining the corresponding candidate marker interval.
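The threshold rule in claim 3 is strict: an interval whose importance equals the first threshold is removed. A short sketch makes the boundary case explicit. How the importance value is computed is not stated in the claim, so the scores and the name `screen_candidates` below are placeholders.

```python
def screen_candidates(importance, first_threshold):
    """Keep intervals whose importance is strictly greater than the threshold."""
    return {iv: score for iv, score in importance.items() if score > first_threshold}

# Hypothetical importance scores per candidate marker interval.
importance = {
    ("chr1", 100, 250): 0.42,
    ("chr2", 500, 640): 0.10,  # exactly at the threshold, so it is removed
    ("chr3", 10, 90): 0.03,
}

kept = screen_candidates(importance, first_threshold=0.10)
print(kept)  # {('chr1', 100, 250): 0.42}
```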

4. The multimodal pan-cancer early prediction method according to claim 1, wherein training the second prediction model based on fragmentation features of chromosomes extracted from the reference genome comprises: tiling the autosomes of the reference genome and dividing them evenly into a plurality of first intervals; recording the numbers of short-length, medium-length, and long-length fragments in each of the plurality of first intervals; sequentially merging the plurality of first intervals to obtain a plurality of second intervals; calculating the coverage of each of the plurality of second intervals; performing dimensionality reduction on the resulting coverages; and constructing the second prediction model using the dimensionality-reduced coverages.
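The counting and merging steps of claim 4 can be sketched as follows. The length cutoffs (`short_max`, `medium_max`), the bin size, the use of the fragment midpoint to assign a fragment to a bin, and the single concatenated coordinate system are all assumptions for illustration; the claim does not give these values. The dimensionality-reduction step is omitted here.

```python
def length_class(length, short_max=150, medium_max=220):
    """Return 0 (short), 1 (medium), or 2 (long). Cutoffs are illustrative only."""
    if length <= short_max:
        return 0
    if length <= medium_max:
        return 1
    return 2

def count_fragments(fragments, genome_length, bin_size):
    """Count short/medium/long fragments in equal-sized first intervals.

    fragments are (start, end) pairs on a single concatenated autosome
    coordinate; each fragment is assigned to the bin holding its midpoint.
    """
    n_bins = -(-genome_length // bin_size)  # ceiling division
    counts = [[0, 0, 0] for _ in range(n_bins)]
    for start, end in fragments:
        midpoint = (start + end) // 2
        counts[midpoint // bin_size][length_class(end - start)] += 1
    return counts

def merge_bins(counts, group):
    """Sum every `group` consecutive first intervals into one second interval."""
    return [
        [sum(column) for column in zip(*counts[i:i + group])]
        for i in range(0, len(counts), group)
    ]

# Hypothetical toy fragments on a 400 bp "genome" split into 100 bp bins.
fragments = [(10, 130), (50, 230), (120, 290), (300, 390), (100, 380)]
first = count_fragments(fragments, genome_length=400, bin_size=100)
second = merge_bins(first, group=2)
print(first)   # [[1, 0, 0], [0, 1, 0], [0, 1, 1], [1, 0, 0]]
print(second)  # [[1, 1, 0], [1, 1, 1]]
```

Each row of `second` holds the short, medium, and long fragment counts of one merged interval, which matches the coverage definition given in claim 5.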

5. The multimodal pan-cancer early prediction method according to claim 4, wherein the coverage is defined as the numbers of short-length, medium-length, and long-length fragments contained in each of the plurality of second intervals; and the second prediction model is constructed by elastic net regression on the dimensionality-reduced coverages.
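For reference, elastic net regression combines an L1 and an L2 penalty. The claims do not state the loss function or hyperparameters, so the objective below is the commonly used squared-error parameterization, given only as an illustration. All symbols are assumptions: $x_i$ is the dimensionality-reduced coverage vector of sample $i$, $y_i$ is its label (1 for cancer, 0 for healthy), $\lambda \ge 0$ is the overall penalty strength, and $\alpha \in [0, 1]$ balances the two penalties.

```latex
\hat{\beta}_0, \hat{\beta}
  = \arg\min_{\beta_0,\, \beta}\;
    \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \beta_0 - x_i^{\top} \beta \right)^2
    + \lambda \left[ \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2
    + \alpha \lVert \beta \rVert_1 \right]
```

Setting $\alpha = 1$ gives the lasso and $\alpha = 0$ gives ridge regression. For a binary cancer-versus-healthy outcome, the squared-error term is often replaced by the logistic negative log-likelihood with the same penalty.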

6. The multimodal pan-cancer early prediction method according to claim 1, wherein training the third prediction model based on extracted methylation entropy features of chromosomes comprises: extracting multiple insert fragments from the methylation data of the multiple cfDNA samples and obtaining the methylation pattern entropy value of each of the multiple insert fragments; calculating the methylation pattern entropy value of each chromosome; and constructing the third prediction model by logistic regression on the methylation pattern entropy values of all chromosomes.
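As an illustration of the last step of claim 6, the sketch below fits a logistic regression to per-chromosome entropy features with plain batch gradient descent. It is a toy implementation under stated assumptions: the learning rate, epoch count, two-feature input (a real input would have one feature per chromosome), and all numeric values are hypothetical, and the claim does not specify the fitting procedure.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, x):
    """Predicted probability that sample x belongs to the cancer class."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Hypothetical toy data: each row holds the entropy values of two chromosomes.
X = [[0.20, 0.30], [0.25, 0.20], [0.70, 0.80], [0.80, 0.75]]
y = [0, 0, 1, 1]  # 0 = healthy, 1 = cancer

weights, bias = fit_logistic(X, y)
high = predict_proba(weights, bias, [0.75, 0.80])  # resembles the cancer rows
low = predict_proba(weights, bias, [0.20, 0.25])   # resembles the healthy rows
print(high > 0.5, low < 0.5)
```

A production pipeline would more likely use an established library implementation with regularization and cross-validation; the hand-written loop is used here only to keep the example free of dependencies.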

7. The multimodal pan-cancer early prediction method according to claim 6, wherein extracting multiple insert fragments from the methylation data of the multiple cfDNA samples comprises: extracting, in CpG mode, left-to-right reads and right-to-left reads of the cfDNA samples, both of which can be aligned to the insert fragments in the reference genome; and wherein the methylation pattern entropy values of the multiple insert fragments are obtained by the following formula: [formula not reproduced], where BiEn(s) is the methylation pattern entropy value of the insert fragment, n is the number of all CpG sites in the insert fragment, k takes values from 0 to n-2, and p is the probability for a given value of k.

8. The multimodal pan-cancer early prediction method according to claim 6, wherein calculating the methylation pattern entropy value of each chromosome comprises: calculating the mean of the methylation pattern entropy values of all insert fragments on each chromosome as the methylation pattern entropy value of that chromosome.
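The per-chromosome aggregation in claim 8 is a plain mean over fragments. A short sketch, assuming the per-fragment entropy values have already been computed and are supplied as hypothetical `(chromosome, entropy)` pairs:

```python
from collections import defaultdict
from statistics import fmean

def chromosome_entropy(fragment_entropies):
    """Average per-fragment entropy values within each chromosome."""
    by_chrom = defaultdict(list)
    for chrom, value in fragment_entropies:
        by_chrom[chrom].append(value)
    return {chrom: fmean(values) for chrom, values in by_chrom.items()}

# Hypothetical per-fragment entropy values.
fragment_entropies = [("chr1", 0.5), ("chr1", 1.0), ("chr2", 0.25)]

per_chromosome = chromosome_entropy(fragment_entropies)
print(per_chromosome)  # {'chr1': 0.75, 'chr2': 0.25}
```

The resulting dictionary gives one feature per chromosome for each sample, which is the input form the third prediction model expects.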

9. A multimodal pan-cancer early prediction device, comprising: a data acquisition unit configured to collect multiple cfDNA samples from individuals with i types of cancer and from healthy individuals, and to extract methylation data from each of the multiple cfDNA samples; a biomarker screening unit configured to merge CpG sites based on the methylation data of the multiple cfDNA samples to obtain multiple methylation intervals, to extract, from the multiple methylation intervals, differential methylation intervals between each cancer population and the healthy population, to select, as candidate marker intervals, the differential methylation intervals that appear in no fewer than m cancer populations, and to screen the candidate marker intervals; a model building unit configured to train a first prediction model using the screened candidate marker intervals, to train a second prediction model based on fragmentation features of chromosomes extracted from a reference genome, and to train a third prediction model based on extracted methylation entropy features of chromosomes; and a model integration unit configured to train a multimodal pan-cancer early prediction model, for early pan-cancer detection, using the cancer population prediction values of the first prediction model, the second prediction model, and the third prediction model.

10. The multimodal pan-cancer early prediction device according to claim 9, wherein, to extract methylation data from the multiple cfDNA samples, the data acquisition unit extracts multiple CpG sites from the multiple cfDNA samples, together with the methylation value of each of the multiple CpG sites; and wherein, to merge CpG sites based on the methylation data of the multiple cfDNA samples, the biomarker screening unit calculates, for each CpG site, the difference in methylation value between each of the i cancer populations and the healthy population, and, in response to the difference being non-zero, merges the corresponding CpG sites.

11. The multimodal pan-cancer early prediction device according to claim 9, wherein, to screen the candidate marker intervals, the biomarker screening unit calculates an importance value for each of the candidate marker intervals, removes the corresponding candidate marker interval in response to the importance value not being greater than a first threshold, and retains the corresponding candidate marker interval in response to the importance value being greater than the first threshold.

12. The multimodal pan-cancer early prediction device according to claim 9, wherein, to train the second prediction model based on fragmentation features of chromosomes extracted from the reference genome, the model building unit: tiles the autosomes of the reference genome and divides them evenly into a plurality of first intervals; records the numbers of short-length, medium-length, and long-length fragments in each of the plurality of first intervals; sequentially merges the plurality of first intervals to obtain a plurality of second intervals, calculates the coverage of each of the plurality of second intervals, and performs dimensionality reduction on the resulting coverages; and constructs the second prediction model using the dimensionality-reduced coverages.

13. The multimodal pan-cancer early prediction device according to claim 12, wherein the coverage is defined as the numbers of short-length, medium-length, and long-length fragments contained in each of the plurality of second intervals; and the model building unit constructs the second prediction model by elastic net regression on the dimensionality-reduced coverages.

14. The multimodal pan-cancer early prediction device according to claim 9, wherein, to train the third prediction model based on extracted methylation entropy features of chromosomes, the model building unit: extracts multiple insert fragments from the methylation data of the multiple cfDNA samples and obtains the methylation pattern entropy value of each of the multiple insert fragments; calculates the methylation pattern entropy value of each chromosome; and constructs the third prediction model by logistic regression on the methylation pattern entropy values of all chromosomes.

15. The multimodal pan-cancer early prediction device according to claim 14, wherein, to extract multiple insert fragments from the methylation data of the multiple cfDNA samples, the model building unit extracts, in CpG mode, left-to-right reads and right-to-left reads of the cfDNA samples, both of which can be aligned to the insert fragments in the reference genome; and wherein the model building unit obtains the methylation pattern entropy values of the multiple insert fragments by the following formula: [formula not reproduced], where BiEn(s) is the methylation pattern entropy value of the insert fragment, n is the number of all CpG sites in the insert fragment, k takes values from 0 to n-2, and p is the probability for a given value of k.

16. The multimodal pan-cancer early prediction device according to claim 14, wherein, to calculate the methylation pattern entropy value of each chromosome, the model building unit calculates the mean of the methylation pattern entropy values of all insert fragments on each chromosome as the methylation pattern entropy value of that chromosome.

17. An electronic device, comprising: a processor; and a memory storing computer program instructions that, when executed by the processor, cause the processor to perform the multimodal pan-cancer early prediction method according to any one of claims 1-8.

18. A computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the multimodal pan-cancer early prediction method as described in any one of claims 1-8.

19. A computer-readable storage medium having stored thereon computer program instructions, which, when executed by a processor, cause the processor to perform the multimodal pan-cancer early prediction method as described in any one of claims 1-8.