Whole-genome fragment-level cancer detection method and system based on split-CLS dual-path signal-noise separation

By employing a multi-instance learning method based on split-CLS dual-path signal-noise separation, and utilizing a pre-trained language model with a Transformer architecture to encode and decode cfDNA fragments, effective separation of cancer signals and background noise at the cfDNA fragment level is achieved. This solves the problem of insufficient signal separation in existing technologies and improves the sensitivity and robustness of early cancer screening and MRD monitoring.

CN122201704A · Pending · Publication Date: 2026-06-12 · GENESEEQ TECH INC +2

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
GENESEEQ TECH INC
Filing Date
2025-11-17
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing cfDNA detection methods struggle to effectively separate signal from background noise at the whole-genome fragment level, and their detection performance is limited under different sequencing platforms and strategies, affecting the sensitivity and robustness of early cancer screening and MRD monitoring.

Method used

A multi-instance learning method based on split-CLS dual-path signal-noise separation is adopted. cfDNA fragments are modeled end-to-end using a deep learning sequence language model. A signal-background separation mechanism is introduced, and a pre-trained language model with a Transformer architecture is used for encoding. The signal and background are separated and reconstructed through an attention mechanism and a decoder. The multi-instance learning framework is then used for cancer classification.

Benefits of technology

It achieves effective decoupling of cfDNA fragment-level cancer signals from background noise, improves weak signal detection capabilities, enhances the model's robustness to different sequencing platforms and strategies, and supports high-sensitivity and high-specificity early cancer screening and MRD monitoring.



Abstract

The present application relates to a whole-genome fragment-level cancer detection method based on split-CLS dual-path signal-noise separation and multi-instance learning, and to its application in cancer screening and MRD monitoring. The method uses a split-CLS, dual-branch deep representation structure to achieve effective "signal-background" separation. By modeling cfDNA whole-genome fragments within a multi-instance learning framework, it extracts and discriminates cancer signals automatically without fragment-level labeling. The present application can effectively overcome the influence of differences in sequencing platforms and sequencing strategies (whole-genome sequencing and targeted capture) on detection performance, thereby maintaining the robustness and generalizability of the results. The technology removes the dependence on specific known mutations or a single sequencing scheme, significantly improves the sensitivity, specificity and cross-platform generalization of early cancer screening and minimal residual disease (MRD) monitoring, and provides a robust, broadly applicable solution for non-invasive multi-cancer screening and longitudinal disease monitoring.

Description

Technical Field

[0001] This invention relates to the fields of bioinformatics and medical detection technology, and in particular to a whole-genome fragment-level cancer detection method based on split-CLS dual-path signal-noise separation multi-instance learning, which is applicable to early cancer screening, minimal residual disease (MRD) monitoring and related clinical applications.

Background Technology

[0002] Cancer is one of the major diseases threatening human health, and early detection and accurate monitoring are crucial for improving cure rates and reducing mortality. Circulating cell-free DNA (cfDNA), as genetic material released into the bloodstream by tumor cells, has broad application prospects in early cancer screening and postoperative monitoring. Existing cfDNA detection methods mainly rely on the following two technical pathways:

[0003] Specific mutation or methylation detection: This type of method targets known gene mutations, methylation sites or structural variations for analysis, and has the advantage of high specificity. However, it often requires the prior definition of biomarkers, making it difficult to cover universal signals of multiple cancer types, and it is not sensitive enough to unknown mutations or rare signals.

[0004] Statistical or whole-genome pattern analysis: With the development of high-throughput sequencing, cancer identification using fragment length distribution, nucleosome occupancy patterns, and sequence context features of cfDNA has become a trend. However, these methods are mostly based on artificially designed features or traditional statistical models, making it difficult to fully capture complex sequence patterns. Furthermore, significant performance differences exist between different sequencing platforms (such as Illumina, MGI, Nanopore, etc.) or different sequencing strategies (such as whole-genome sequencing WGS and targeted capture), limiting their clinical application.

[0005] In recent years, deep learning methods have been increasingly applied to cfDNA analysis. Typical examples include DNABERT2 and other sequence language model-based frameworks (such as the pre-trained and fine-tuned models described in Chinese patent application No. 2025116100582, entitled "A Cancer Signal Recognition Method Based on cfDNA Sequence Language Model and Its Application in Cancer Screening and MRD Monitoring"). These models can learn DNA sequence patterns without fragment-level annotation and achieve excellent performance on certain tasks. However, these models primarily rely on a single global CLS representation to aggregate information, lacking a split-CLS dual-path signal-noise separation mechanism tailored to the specific characteristics of cfDNA. This results in the inability to effectively distinguish cancer-related signals from background signals in the context of highly noisy and heterogeneous cfDNA, and insufficient capture of weak signals or rare patterns, thus affecting the sensitivity and robustness of early cancer screening and MRD monitoring.

[0006] Therefore, there is an urgent need for a new cfDNA analysis method that can effectively separate signal from background noise at the whole-genome fragment level, and overcome the impact of differences in sequencing platforms and sequencing strategies on detection performance, thereby achieving high-sensitivity and high-specificity early cancer screening and longitudinal disease monitoring.

Summary of the Invention

[0007] This invention proposes a genome-wide fragment-level cancer detection method based on split-CLS dual-path signal-noise separation and multi-instance learning, applicable to early cancer screening and minimal residual disease (MRD) monitoring. This method uses a deep learning sequence language model (such as a pre-trained language model based on the Transformer architecture) to perform end-to-end modeling of cfDNA genome fragments, enabling automatic extraction and discrimination of cancer-related signals. In addition, a signal-background separation mechanism is introduced to improve the model's sensitivity to weak-signal fragments and its robustness to different sequencing platforms and strategies (such as WGS and targeted capture).

[0008] A genome-wide fragment-level cancer detection method based on split-CLS dual-path signal-noise separation and multi-instance learning, for non-therapeutic and diagnostic purposes, includes the following steps:

[0009] a) Data acquisition and encoding steps: acquire the sequence data of cfDNA fragments in the sample to be tested, and encode the sequence data of each cfDNA fragment using a pre-trained sequence language model to obtain an initial fragment representation vector;

[0010] b) Vector splitting step: Divide each of the initial segment representation vectors along its channel dimension into a signal subspace vector for carrying cancer classification signals and a background subspace vector for carrying background information;

[0011] c) Dual-path processing and classification steps:

[0012] In the classification path, based on the signal subspace vector, a multi-instance learning framework with an attention mechanism is used to weight and aggregate the signals of all segments in the sample to generate a bag-level representation vector, and the cancer prediction result of the sample is calculated based on the bag-level representation vector.

[0013] In the background reconstruction path, the initial fragment representation vector is reconstructed by a decoder based on the background subspace vector;

[0014] d) Model training steps: The trainable parameters in the sequence language model, the multi-instance learning framework, and the decoder are trained by jointly optimizing an overall loss function that includes sample classification loss, background reconstruction loss, and orthogonal constraint loss to promote decoupling between signal and background.

[0015] A whole-genome fragment-level cancer detection system based on split-CLS dual-path signal-noise separation includes:

[0016] a) Data acquisition and encoding module: used to acquire the sequence data of cfDNA fragments in the sample to be tested, and to encode the sequence data of each cfDNA fragment using a pre-trained sequence language model to obtain an initial fragment representation vector;

[0017] b) Vector splitting module: used to divide each of the initial segment representation vectors along its channel dimension into a signal subspace vector carrying cancer classification signals and a background subspace vector carrying background information;

[0018] c) Dual-path processing and classification module:

[0019] In the classification path, based on the signal subspace vector, a multi-instance learning framework with an attention mechanism is used to weight and aggregate the signals of all segments in the sample to generate a bag-level representation vector, and the cancer prediction result of the sample is calculated based on the bag-level representation vector.

[0020] It is also used in the background reconstruction path to reconstruct the initial fragment representation vector based on the background subspace vector through a decoder;

[0021] d) Model training module: used to train the trainable parameters in the sequence language model, multi-instance learning framework and decoder by jointly optimizing an overall loss function that includes sample classification loss, background reconstruction loss and orthogonal constraint loss to promote decoupling between signal and background.

[0022] e) Prediction module: Used to perform inference and prediction on new test samples using the model obtained from the model training module, and obtain cancer prediction results.

[0023] The data acquisition and encoding module is further used to: acquire cfDNA from tumor samples and healthy human samples, sequence them to obtain cfDNA fragment sequences, align them to a reference genome to obtain their positions on the reference genome, and then extend them upstream and downstream by several bases, using the extended sequence information on the reference genome as the data information of a single sample.

[0024] In the data acquisition and encoding module, when the sequencing data is aligned to the reference genome, cfDNA fragments with alignment quality scores not lower than a preset threshold need to be selected; and the length of the selected cfDNA fragments is limited to no more than a predetermined length threshold.

[0025] The preset threshold for the alignment quality score is 20-40; the preset length threshold for the cfDNA fragment length is 300-500 bp.

[0026] The pre-trained sequence language model is a Transformer-based model, and the initial segment representation vector is the [CLS] vector output by the last encoder layer of the model.

[0027] The pre-trained sequence language model is obtained by dividing the reference genomic DNA of multiple species, including humans, into segments that preserve the original base order, and training on them with a masked language model mechanism to learn the latent language structure features of DNA sequences.

[0028] The segments are kept at a consistent size; a self-attention mechanism is used to model the relationship between any two DNA sub-word segments in a sequence; the pre-trained language model built on the Transformer architecture adopts the BERT model and contains 10-20 Transformer encoder layers; the segment feature vector is a 600-900 dimensional vector.

[0029] The vector splitting module is used for:

[0030] The initial fragment representation vector $h_{ij} \in \mathbb{R}^{d}$ of dimension $d$ is divided along the channel dimension into a signal subspace vector $h_{ij}^{c}$ of dimension $d_c$ and a background subspace vector $h_{ij}^{r}$ of dimension $d_r$, where $i$ is the sample index, $j$ is the fragment index, and $d_c + d_r = d$.

[0031] The signal subspace and the background subspace have the same dimension, that is, $d_c = d_r = d/2$.

[0032] The attention mechanism in the classification path is used to calculate the attention weight $\alpha_{ij}$ of the j-th fragment in the i-th sample, as follows:

[0033] $$\alpha_{ij} = \frac{\exp\left(w^{\top}\tanh\left(V h_{ij}^{c}\right)\right)}{\sum_{k=1}^{N_i}\exp\left(w^{\top}\tanh\left(V h_{ik}^{c}\right)\right)}$$

[0034] where $h_{ij}^{c}$ is the signal subspace vector of the j-th fragment, $N_i$ is the total number of fragments in the sample, and $V$ and $w$ are the trainable parameter matrix and vector in the attention network.

[0035] The classification path further includes:

[0036] i) Based on the attention weights $\alpha_{ij}$, the signal subspace vectors $h_{ij}^{c}$ of all fragments are combined by weighted summation to obtain the bag-level representation vector $z_i$:

[0037] $$z_i = \sum_{j=1}^{N_i} \alpha_{ij}\, h_{ij}^{c}$$

[0038] ii) The bag-level representation vector $z_i$ is fed into a multilayer perceptron that acts as a classifier, which outputs the classification logit $o_i$;

[0039] iii) The classification logit $o_i$ is input to the Sigmoid function $\sigma(\cdot)$ to obtain the final cancer prediction probability $p_i = \sigma(o_i)$.

[0040] The background reconstruction path is realized by minimizing the background reconstruction loss $L_{rec}$, which is calculated as follows:

[0041] $$L_{rec} = \sum_{j=1}^{N_i} \left\lVert D\left(\mathrm{sg}\left(h_{ij}^{r}\right)\right) - \mathrm{sg}\left(h_{ij}\right) \right\rVert_2^2$$

[0042] where $D$ is the decoder and $\mathrm{sg}(\cdot)$ is the stop-gradient operation, which ensures that the background reconstruction loss is used only to update the parameters of the decoder $D$ and is not propagated back to the encoder.

[0043] The orthogonal constraint loss $L_{orth}$ is obtained by calculating the sum of the squares of the inner products of the signal subspace vector and the background subspace vector, as follows:

[0044] $$L_{orth} = \sum_{j=1}^{N_i} \left( h_{ij}^{c} \cdot h_{ij}^{r} \right)^2$$

[0045] where $\cdot$ represents the vector dot product operation.

[0046] The overall loss function $L$ is composed as follows:

[0047] $$L = L_{cls} + \lambda_{rec} L_{rec} + \lambda_{orth} L_{orth}$$

[0048] where $L_{cls}$ is the bag-level binary cross-entropy loss calculated from the cancer prediction probability, and $\lambda_{rec}$ and $\lambda_{orth}$ are the preset weighting coefficients used to balance the losses.

[0049] In the prediction module, when making inference predictions for new test samples, only the calculation steps of the classification path are executed, and the calculation steps of the background reconstruction path are not executed, thereby generating cancer prediction results.

[0050] Any one of the above methods may be applied to:

[0051] Non-invasive early screening for cancer;

[0052] Monitoring of minimal residual disease (MRD) after cancer surgery; or

[0053] Identification and contribution analysis of high-risk carcinogenic cfDNA fragments.

[0054] A computer-readable medium storing instructions executable by one or more processors;

[0055] When the instructions are executed, the detection method described above is performed.

[0056] The beneficial effects of this invention are:

[0057] 1. A split-CLS dual-path signal-noise separation mechanism is introduced to effectively decouple cfDNA fragment-level cancer signals from background noise, thereby improving the detection of weak signals.

[0058] 2. Combining dual-branch multi-instance learning, it supports end-to-end training and sample-level prediction under conditions without fragment-level annotation.

[0059] 3. Supports cross-sequencing platforms and cross-sequencing strategies (WGS / Target Capture) to improve model generalization and robustness.

[0060] 4. Fragment contribution is assessed through attention weighting, providing data support for MRD monitoring and biomarker development.

Attached Figure Description

[0061] Figure 1: Flowchart of the method of this patent.

[0062] Figure 2: Comparison of ROC curves on the training set.

[0063] Figure 3: Comparison of ROC curves on validation set 1.

[0064] Figure 4: Comparison of ROC curves on validation set 2.

Detailed Implementation

[0065] With the rapid development of artificial intelligence technology, especially the application of deep learning and sequence language modeling in genomics, deep sequence models such as the Transformer have shown significant advantages in deciphering complex information in DNA sequences. Genome language models can treat DNA sequences as high-dimensional "text," automatically learning latent patterns from massive amounts of data without relying on prior assumptions, thereby capturing weak tumor signals in circulating cell-free DNA (cfDNA). These signals include base combination patterns, fragment length and distribution characteristics, nucleosome occupancy, and potential epigenetic modification information, providing a technological foundation for highly sensitive identification of cfDNA in early cancer detection and minimal residual disease (MRD) monitoring.

[0066] This invention proposes a whole-genome fragment-level cancer detection method based on split-CLS dual-path signal-noise separation and multi-instance learning. It combines a Transformer-based sequence language model with cfDNA whole-genome sequencing data to achieve high-precision, end-to-end identification of cancer signals. This method distinguishes fragment-level cancer information from background noise through a signal-background separation mechanism, and uses multi-instance learning to aggregate the fragments of each sample by attention weighting, thereby generating reliable sample-level cancer predictions even without fragment-level annotation. Furthermore, this method is adaptable to different sequencing platforms and strategies (such as WGS and targeted capture), demonstrating good generalization ability and robustness in multi-center, multi-cancer application scenarios.

[0067] In this embodiment, cfDNA is first extracted from the blood sample, followed by library construction and sequencing. The extraction and library construction methods can be selected and optimized based on existing technologies; the sequencing platform is not limited, and existing high-throughput sequencing technologies can be used to obtain the base information and fragment data of the cfDNA.

[0068] Subsequently, the cfDNA fragments are feature-encoded with the pre-trained Transformer model, and the signal and background are separated through the split-CLS dual-path mechanism, providing high-quality feature representations for subsequent multi-instance learning and end-to-end cancer classification.

[0069] The pre-trained Transformer model here refers to the pre-trained model involved in the Chinese patent application No. 2025116100582 entitled "A Cancer Signal Recognition Method Based on cfDNA Sequence Language Model and Its Application in Cancer Screening and MRD Monitoring".

[0070] The classification information of the multiple cancer samples included in this invention is as follows:

[0071]

[0072] This invention incorporates two different high-depth targeted sequencing panel datasets for validation, verifying the robustness of different sequencing strategies.

[0073]

[0074] The training and testing samples above are derived from real clinical data samples obtained by the applicant of this patent application.

[0075] Extraction and sequencing of plasma cfDNA samples: 8 mL of whole blood was collected from the patient using purple-top blood collection tubes (EDTA anticoagulant tubes). Plasma was separated by centrifugation within 2 hours of collection and transported to the laboratory. cfDNA was extracted from the plasma samples using the QIAGEN plasma DNA extraction kit according to the manufacturer's instructions. After library construction, 5x WGS sequencing was performed on the collected cfDNA samples. The sequencing data were then aligned to the human reference genome to obtain the base information of the corresponding reads.

[0076] In this invention, the cfDNA data input into the model is obtained in the same way for cancer samples and healthy human samples. The cfDNA of each sample is sequenced, and the aligned BAM file records the quality, length, and alignment position of each read. The human reference genome is the hg19 sequence provided by the University of California, Santa Cruz (UCSC). Using the Python package pysam, all paired sequencing reads with a mapping quality score ≥30 are extracted from the BAM file. The start and end coordinates of each cfDNA fragment on the genome are determined from the alignment of its read pair. To better cover variation at fragment endpoints, the start coordinate of each fragment is extended upstream by 6 bases and the end coordinate is extended downstream by 6 bases. Using the adjusted coordinates, the corresponding reference (unmutated) sequence is retrieved from the hg19 reference genome. Only reference fragments with a length of 400 bp or less are retained for subsequent analysis (DNA fragments longer than 400 bp may originate from non-apoptotic processes or from contaminating genomic DNA released by leukocyte lysis during blood collection and processing; these long fragments do not carry typical apoptotic characteristics and are treated as noise for cancer signal identification based on fragment length patterns). The extended reference sequences obtained in this way are used as input to the model. All cfDNA fragment sequences and their chromosomal coordinates are stored in HDF5 format for efficient downstream batch processing and analysis.
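The filtering and extension rules described above can be sketched in a few lines of Python. This is a minimal illustration, not the applicant's pipeline: the function name `reference_fragment`, the 0-based half-open coordinate convention, and applying the 400 bp limit after the 6-base extension are all assumptions, and the reference is a plain string rather than hg19 read through pysam.

```python
from typing import Optional

MIN_MAPQ = 30   # mapping-quality cutoff stated in the text
FLANK = 6       # bases added upstream and downstream
MAX_LEN = 400   # maximum retained reference fragment length (bp)


def reference_fragment(ref_seq: str, start: int, end: int, mapq: int) -> Optional[str]:
    """Return the flank-extended reference sequence of one fragment, or None if filtered.

    start/end are 0-based, half-open coordinates on ref_seq (assumed convention).
    """
    if mapq < MIN_MAPQ:
        return None  # low-confidence alignment
    ext_start = max(0, start - FLANK)          # clamp at the contig start
    ext_end = min(len(ref_seq), end + FLANK)   # clamp at the contig end
    if ext_end - ext_start > MAX_LEN:
        return None  # long fragments are treated as noise
    return ref_seq[ext_start:ext_end]


toy_ref = "ACGT" * 50
print(reference_fragment(toy_ref, 10, 20, 60))
```

In a real run the coordinates and mapping qualities would come from the aligned BAM file, and the sequence from the reference genome, rather than from a toy string.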

[0077] Basic Language Model Training

[0078] This step pre-trains the base model. A large number of DNA sequences are input into the model as language data; including DNA from species other than humans as training text improves the model's ability to understand DNA text. In the cancer signal prediction model training method described in this invention, reference genome sequences from 97 species, including humans, are used first. Each reference genome sequence is divided into DNA fragments of uniform length (e.g., 512 bases each), preserving the original base order. These DNA fragments are then tokenized by byte-pair encoding (BPE), which starts from single characters and iteratively merges the most frequent adjacent pairs to obtain sub-word units; the resulting token sequence forms the input for language modeling. This input is fed into a BERT pre-trained model built on the Transformer architecture. Because short fragments with specific functions tend to recur as a whole, BPE can discover such units automatically. The model therefore processes DNA text at the unit level, which improves efficiency, and uses the self-attention mechanism to model the relationship between any two DNA sub-word fragments in a sequence.
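The byte-pair encoding idea mentioned above (start from single bases, repeatedly merge the most frequent adjacent pair) can be illustrated with a toy learner. This is only a sketch under simplifying assumptions: the function names are hypothetical, ties are broken alphabetically for determinism, and a production tokenizer such as the one shipped with DNABERT-2 is considerably more involved.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with the merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges):
    """Learn BPE merges from DNA strings, starting from single bases."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))  # count adjacent token pairs
        if not pairs:
            break
        # Most frequent pair; ties broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges, corpus


merges, corpus = learn_bpe(["TATATA", "TATACG"], num_merges=1)
print(merges, corpus)
```

Merging never changes the underlying sequence: concatenating the tokens of each entry in `corpus` gives back the original string.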

[0079] The model contains 15 transformer encoder layers to learn latent linguistic structural features from DNA fragments, with 12 attention heads per layer. During pre-training, a masked language model (MLM) mechanism is used to predict randomly masked segments in the input sequence, with the random masking ratio controlled at 15%. The loss function is defined as follows:

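[0080] $$L_{MLM} = -\sum_{i \in M} \log P\left(x_i \mid x_{\setminus M};\, \theta\right)$$

[0081] where $M$ is the set of masked positions, $x_i$ is the i-th masked base fragment, $x_{\setminus M}$ is the unmasked part of the input sequence, and $\theta$ denotes the model parameters. Minimizing this loss maximizes the model's predicted probability for the masked fragments.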
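As a small numeric illustration of the masked-language-model objective, the sketch below picks 15% of token positions and sums the negative log-probabilities assigned to the true tokens at those positions. The function names and the nested-list probability table are hypothetical stand-ins; in practice the probabilities would be the softmax output of the Transformer.

```python
import math
import random


def choose_mask_positions(num_tokens, mask_ratio=0.15, seed=0):
    """Randomly pick about mask_ratio of the token positions to mask."""
    rng = random.Random(seed)
    k = max(1, round(num_tokens * mask_ratio))
    return sorted(rng.sample(range(num_tokens), k))


def mlm_loss(pred_probs, token_ids, masked_positions):
    """Negative log-likelihood of the true tokens, summed over masked positions.

    pred_probs[i][t] is the model probability of token t at position i.
    """
    return -sum(math.log(pred_probs[i][token_ids[i]]) for i in masked_positions)


positions = choose_mask_positions(100)
print(len(positions))  # 15 positions out of 100 tokens
```

Only masked positions contribute to the loss, so unmasked tokens serve purely as context.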


[0084] This invention proposes a genome-wide fragment-level cancer detection method based on split-CLS dual-branch multi-instance learning, the specific implementation steps of which are as follows:

[0085] Step 1: Sample Construction and Label Setting

[0086] In this embodiment, each sample is constructed as a bag containing $N_i$ genomic fragments $\{s_{i1}, s_{i2}, s_{i3}, \ldots, s_{iN_i}\}$. Each bag corresponds to a binary label $y_b \in \{0,1\}$, where 1 represents a positive (cancer) sample and 0 represents a negative sample. This embodiment does not require fragment-level labels, only sample (bag)-level labels. $N_i$ is the total number of DNA fragments contained in the i-th sample; $s_{ij}$ is the j-th DNA fragment in the i-th sample.

[0087] Step 2: Fragment Encoding

[0088] Preferably, each fragment $s_{ij}$ is tokenized and input into the DNABERT-2 pre-trained model. After a cfDNA fragment passes through the Transformer model, a single high-dimensional [CLS] feature vector is generated that summarizes the information of the fragment.

[0089] The [CLS] vector from the last layer of the model is taken as the fragment representation: $h_{ij} \in \mathbb{R}^{d}$. Here $h_{ij}$ is the [CLS] feature vector obtained after the j-th DNA fragment $s_{ij}$ is encoded by the DNABERT-2 model; it is a high-dimensional numerical vector representing the DNA fragment. $\mathbb{R}^{d}$ denotes the d-dimensional real vector space, meaning that $h_{ij}$ is a vector containing $d$ real numbers, and $d$ is the total dimension of the fragment feature vector.

[0090] Step 3: Split-CLS dual-subspace representation

[0091] In this embodiment, the fragment representation $h_{ij}$ is divided along the channel dimension into two non-overlapping subspaces:

[0092] $$h_{ij} = \left[\, h_{ij}^{c} \,;\, h_{ij}^{r} \,\right], \quad h_{ij}^{c} \in \mathbb{R}^{d_c}, \quad h_{ij}^{r} \in \mathbb{R}^{d_r}, \quad d_c + d_r = d$$

[0093] This formula defines the splitting operation, which cuts the original fragment feature vector $h_{ij}$ along its dimension (channel) into two independent subvectors. Preferably, $d_c = d_r = d/2$, meaning an equal division.

[0094] $[\, h_{ij}^{c} \,;\, h_{ij}^{r} \,]$ indicates that the vectors $h_{ij}^{c}$ and $h_{ij}^{r}$, concatenated end to end, reassemble the original $h_{ij}$.

[0095] $h_{ij}^{c}$ is the signal subspace vector: the first part split off from $h_{ij}$, used specifically for the subsequent cancer classification task. $h_{ij}^{r}$ is the background subspace vector: the second part of $h_{ij}$, treated as background noise and used for an auxiliary reconstruction task so that it does not interfere with classification. $d_c$ and $d_r$ are the dimensions of the signal subspace vector and the background subspace vector, respectively.
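The splitting operation in this step amounts to slicing the [CLS] vector along its channel dimension. A minimal sketch, assuming the vector is held as a plain Python list and using the hypothetical name `split_cls`, with the equal split as the default:

```python
def split_cls(h, d_c=None):
    """Split a fragment vector into (signal, background) along the channel dimension.

    With d_c omitted, the vector is divided equally, as in the preferred embodiment.
    """
    d = len(h)
    if d_c is None:
        if d % 2:
            raise ValueError("equal split requires an even dimension")
        d_c = d // 2
    if not 0 < d_c < d:
        raise ValueError("d_c must leave both subspaces non-empty")
    return h[:d_c], h[d_c:]


signal, background = split_cls([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
print(signal, background)
```

Concatenating the two returned parts reproduces the original vector, which is the reassembly property stated above.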

[0096] Step 4: Classification Branch Construction

[0097] 1. In the classification path, only the signal subspace vector $h_{ij}^{c}$ is used to compute the bag-level representation.

[0098] 2. First, calculate the attention weight of each fragment:

[0099] $$\alpha_{ij} = \frac{\exp\left(w^{\top}\tanh\left(V h_{ij}^{c}\right)\right)}{\sum_{k=1}^{N_i}\exp\left(w^{\top}\tanh\left(V h_{ik}^{c}\right)\right)}$$

[0100] The above formula represents the weight calculation of the attention mechanism, which assigns an importance weight to each DNA fragment in the sample bag. A fragment with a higher weight is one the model considers to carry a stronger cancer signal, and it contributes more to the final judgment. $V$ and $w$ are trainable model parameters. During training, the model automatically learns to adjust their values to assign higher weights to fragments that truly contain cancer signals.

[0101] 3. Based on the weighted aggregation of fragment representations, the bag-level representation is obtained:

[0102] $$z_i = \sum_{j=1}^{N_i} \alpha_{ij}\, h_{ij}^{c}$$

[0103] 4. Input the bag-level representation into a multilayer perceptron (MLP) and output the classification logit:

[0104] $$o_i = \mathrm{MLP}(z_i)$$

[0105] Here $\mathrm{MLP}(\cdot)$ is a multilayer perceptron used as a classifier, which maps the input bag-level vector $z_i$ through a nonlinear transformation to the final classification logit $o_i$.

[0106] 5. Obtain the predicted probability using the sigmoid function:

[0107] $$p_i = \sigma(o_i)$$

[0108] This converts the raw logit score $o_i$ into a probability value between 0 and 1, making it more interpretable.

[0109] $p_i$ is the probability that the model predicts the i-th sample as positive (cancer), and $\sigma(\cdot)$ is the sigmoid function.
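The classification branch above can be sketched in plain Python as attention pooling followed by a classifier and a sigmoid. Several details here are assumptions made for illustration: the attention score uses a tanh nonlinearity (a common choice in attention-based multi-instance learning that the text does not confirm), the function names are hypothetical, and `classifier` is any callable standing in for the MLP.

```python
import math


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def attention_weights(H_c, V, w):
    """Softmax over per-fragment scores w . tanh(V h) (tanh is an assumed nonlinearity)."""
    scores = [sum(wk * math.tanh(u) for wk, u in zip(w, matvec(V, h))) for h in H_c]
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bag_probability(H_c, V, w, classifier):
    """Attention-pool the signal vectors, classify the bag, and return (p, alpha)."""
    alpha = attention_weights(H_c, V, w)
    z = [sum(a * h[k] for a, h in zip(alpha, H_c)) for k in range(len(H_c[0]))]
    logit = classifier(z)                 # stand-in for the MLP classifier
    return 1.0 / (1.0 + math.exp(-logit)), alpha


H_c = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # signal vectors of three fragments
V = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, 1.0]
p, alpha = bag_probability(H_c, V, w, classifier=lambda z: sum(z))
print(p, alpha)
```

The returned weights sum to one over the fragments of the bag, so they can be read as each fragment's relative contribution to the sample-level prediction.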

[0110] Step 5: Background Reconstruction Branch Construction

[0111] 1. In the background path, the background subspace vector $h_{ij}^{r}$ is input to a lightweight decoder $D$. The decoder takes as input a vector of dimension $d/2$, half the dimension of the original vector, and outputs a new vector whose dimension is identical to that of the original, unsplit fragment representation vector $h_{ij}$ (dimension $d$). The lightweight decoder $D$ can be a simple neural network, such as the multilayer perceptron (MLP) with one to three fully connected layers used in the tests below, which reconstructs the original fragment vector to provide self-supervision. Non-linear activation functions such as ReLU can be used between layers. Alternatively, a simple regression model, such as a linear regression model, or the decoding part of a more complex autoencoder can be used.

[0112] 2. This decoder is used to reconstruct the original fragment vector. During training, the decoder's output is compared with the original complete fragment vector, and the difference is measured by the background reconstruction loss $L_{rec}$, defined as follows:

[0113] $$L_{rec} = \sum_{j=1}^{N_i} \left\lVert D\left(\mathrm{sg}\left(h_{ij}^{r}\right)\right) - \mathrm{sg}\left(h_{ij}\right) \right\rVert_2^2$$

[0114] Here $\mathrm{sg}(\cdot)$ is the stop-gradient operation, a key technique that blocks gradient flow during backpropagation so that gradients from the reconstruction process do not propagate back to the encoder or the classification branch. The gradients generated by the reconstruction loss $L_{rec}$ are used only to update the parameters of the decoder $D$ and do not affect any parameters of the encoder (DNABERT-2) or the classification branch.
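To make the reconstruction path concrete, the sketch below computes the value of a reconstruction loss with a one-layer linear decoder, the simplest option the text allows. It is an illustration only: the squared-error form and the function names are assumptions, and plain Python has no automatic differentiation, so the stop-gradient appears only as a comment (in a framework it would correspond to detaching the tensors, for example `Tensor.detach()` in PyTorch or `jax.lax.stop_gradient` in JAX).

```python
def decode(h_r, W, b):
    """One-layer linear decoder mapping a d_r-dim input to a d-dim output."""
    return [sum(wk * x for wk, x in zip(row, h_r)) + bk for row, bk in zip(W, b)]


def reconstruction_loss(H, H_r, W, b):
    """Sum of squared errors between decoded background vectors and full vectors.

    In an autodiff framework, both h and h_r would be detached (stop-gradient)
    so that this loss only updates the decoder parameters W and b.
    """
    total = 0.0
    for h, h_r in zip(H, H_r):
        h_hat = decode(h_r, W, b)
        total += sum((a - t) ** 2 for a, t in zip(h_hat, h))
    return total


H = [[1.0, 2.0, 3.0, 4.0]]           # full fragment vector (d = 4)
H_r = [[3.0, 4.0]]                   # its background half (d_r = 2)
W = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(reconstruction_loss(H, H_r, W, b=[0.0, 0.0, 0.0, 0.0]))
```

With a zero bias this toy decoder copies the background half but cannot recover the signal half, so the loss is non-zero; a bias equal to the missing components drives it to zero.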

[0115] Step 6: Orthogonal Collaborative Constraints

[0116] To further ensure the decoupling of the signal subspace and the background subspace, this embodiment introduces an inner product penalty at the fragment level:

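[0117] $L_{orth} = \sum_{i}\sum_{j} \left( \langle h_{ij}^{sig}, h_{ij}^{bg} \rangle \right)^2$, where $h_{ij}^{sig}$ and $h_{ij}^{bg}$ are the signal subspace vector and the background subspace vector of the j-th fragment in the i-th sample.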

[0118] This constraint discourages directional alignment between the two subspaces. Its goal is to push the signal subspace vector and the background subspace vector toward being mathematically orthogonal (perpendicular). Two nonzero orthogonal vectors are linearly independent, so the information they carry is decoupled. This promotes the separation of the classification signal from the background noise.
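A minimal sketch of the fragment-level inner product penalty, following the description in claim 7 (the sum of the squares of the inner products of the signal and background subspace vectors). The function name `orthogonality_penalty` and the toy vectors are illustrative assumptions.

```python
def orthogonality_penalty(signal_vecs, background_vecs):
    """Sum over fragments of the squared inner product of signal and background vectors."""
    total = 0.0
    for s, b in zip(signal_vecs, background_vecs):
        dot = sum(x * y for x, y in zip(s, b))
        total += dot ** 2
    return total

# One fragment whose two halves are perpendicular: no penalty.
orthogonal = orthogonality_penalty([[1.0, 0.0]], [[0.0, 3.0]])
# One fragment whose two halves point in the same direction: large penalty.
aligned = orthogonality_penalty([[1.0, 2.0]], [[2.0, 4.0]])
```

The penalty is zero only when every signal vector is perpendicular to its paired background vector, and it grows as the two become aligned.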

[0119] Step 7: Training Objectives and Optimization

[0120] 1. The total loss function of this invention is:

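[0121] $L = L_{cls} + \lambda_{rec} L_{rec} + \lambda_{orth} L_{orth}$

[0122] where $L_{cls}$ is the bag-level classification loss function, $L_{rec}$ is the background reconstruction loss, $L_{orth}$ is the orthogonal constraint loss, and $\lambda_{rec}$ and $\lambda_{orth}$ are weighting coefficients, preferably both 0.01.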
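The weighted combination of the three losses can be sketched as below, using binary cross-entropy for the bag-level classification loss as stated in claim 7 and the preferred weights of 0.01. The function names and the example loss values are hypothetical.

```python
import math

def binary_cross_entropy(prob, label, eps=1e-12):
    """Bag-level binary cross-entropy for one sample (label is 0 or 1)."""
    prob = min(max(prob, eps), 1.0 - eps)   # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

def total_loss(cls_loss, rec_loss, orth_loss, lambda_rec=0.01, lambda_orth=0.01):
    """Classification loss plus weighted reconstruction and orthogonality terms."""
    return cls_loss + lambda_rec * rec_loss + lambda_orth * orth_loss

# Illustrative values only: a cancer sample predicted with probability 0.8.
cls_loss = binary_cross_entropy(prob=0.8, label=1)
loss = total_loss(cls_loss, rec_loss=2.0, orth_loss=0.5)
```

With small weights such as 0.01, the auxiliary terms shape the representation while the classification loss remains the dominant part of the objective.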

[0123] 2. The optimization process adopts the AdamW algorithm and combines it with commonly used Transformer training strategies, including learning rate warm-up, weight decay, and early stopping mechanism based on validation set performance.
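Two of the training strategies named here, learning rate warm-up and early stopping on validation performance, can be sketched in plain Python. The linear warm-up shape, the patience of 3, the base learning rate, and the validation AUC values are all assumptions for illustration; the AdamW update and weight decay themselves would normally come from a deep learning framework and are not shown.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linearly ramp the learning rate up to base_lr, then hold it constant."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr

class EarlyStopping:
    """Signal a stop when the validation metric fails to improve `patience` times in a row."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("-inf")
        self.bad_checks = 0

    def should_stop(self, metric):
        if metric > self.best:
            self.best = metric
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience

# Hypothetical validation AUC per epoch (illustrative only).
val_auc = [0.80, 0.85, 0.86, 0.86, 0.85, 0.84, 0.83]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, auc in enumerate(val_auc):
    if stopper.should_stop(auc):
        stopped_at = epoch
        break
```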

[0124] Step 8: Inference Stage

[0125] During the inference phase, only the classification branch is retained. Specifically, the signal subspace vectors and their attention weights $\alpha_{ij}$ are used to compute the bag-level representation $z_i$ and output the predicted probability. Preferably, the background reconstruction branch is not invoked during inference, thus avoiding additional computational overhead.
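The classification-only inference path can be sketched as follows. Several details are assumptions made for illustration: the signal subspace is taken to be the first half of each fragment vector, the attention score uses a tanh form with a trainable matrix `V` and vector `w`, a single linear layer stands in for the MLP classifier, and all numbers are toy values.

```python
import math

def attention_weights(signal_vecs, V, w):
    """Softmax over fragments of the score w . tanh(V s) (assumed scoring form)."""
    scores = []
    for s in signal_vecs:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, s))) for row in V]
        scores.append(sum(wk * hk for wk, hk in zip(w, hidden)))
    top = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(sc - top) for sc in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

def predict_bag(fragment_vecs, V, w, clf_weights, clf_bias):
    """Classification path only: split, attend, pool, classify."""
    half = len(fragment_vecs[0]) // 2
    signal_vecs = [h[:half] for h in fragment_vecs]   # background half is never used
    alpha = attention_weights(signal_vecs, V, w)
    # Bag-level representation: attention-weighted sum of the signal vectors.
    z = [sum(a * s[k] for a, s in zip(alpha, signal_vecs)) for k in range(half)]
    logit = sum(c * zk for c, zk in zip(clf_weights, z)) + clf_bias
    return 1.0 / (1.0 + math.exp(-logit)), alpha

# Toy bag of three fragments with d = 4 (all numbers are illustrative).
bag = [[1.0, 0.0, 9.0, 9.0], [0.0, 1.0, -5.0, 2.0], [2.0, 2.0, 0.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, 1.0]
prob, alpha = predict_bag(bag, V, w, clf_weights=[1.0, 0.5], clf_bias=-1.0)
```

Because the decoder is never called and the background half is discarded, changing the background values of a fragment leaves the predicted probability unchanged in this sketch. The attention weights also indicate which fragments contributed most to the prediction.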

[0126] All of the above processes can be implemented on an automated pipeline platform, supporting automatic execution from raw data processing, fragment extraction, feature encoding, classification, ctDNA fragment screening to report generation. The output includes each sample's cancer risk score, classification label, detailed sequence coordinates of high-risk ctDNA fragments, and corresponding scores.

[0127] On the training set, the AUC of the method in this embodiment reaches 0.943 (see Figure 2), which is significantly better than the AUC of 0.904 obtained without the split-CLS dual-path signal-noise separation mechanism, demonstrating that the mechanism can more effectively separate signal and noise, thereby improving the accuracy of ctDNA signal recognition.

[0128] Different sequencing platforms exhibit variations in library construction methods, such as enzyme digestion preferences, adapter structures, and fragment length distributions, which can often affect sensitivity and specificity. The method described in this embodiment was validated on the MGI and Illumina platform test sets, and the results are shown in Table 3.

[0129] Table 3

[0130]

[0131] The results show that the method in this embodiment achieves significant performance improvements on both the MGI and Illumina test sets, especially in terms of sensitivity and accuracy, demonstrating its good robustness and universality in cross-platform applications.

[0132] In early cfDNA screening, differences in library preparation strategies can also affect detection performance. Target capture methods suffer from concentrated coverage areas, altered fragment length distribution, and capture efficiency bias due to probe hybridization selection; while whole-genome sequencing (WGS), although providing uniform coverage, lacks sufficient sensitivity to low-abundance signals. The applicant further validated the effectiveness of this method on colorectal cancer and lung cancer panel test sets, and the results are shown in Table 4.

[0133] Table 4

[0134]

[0135] The results show that the method in this embodiment still outperforms the control method on the target capture dataset, exhibiting higher sensitivity and accuracy, further demonstrating its strong applicability under different library construction strategies.

Claims

1. A genome-wide fragment-level cancer detection method based on split-CLS dual-path signal-noise separation multi-instance learning, used for non-therapeutic and diagnostic purposes, characterized in that it includes the following steps: a) Data acquisition and encoding step: acquire the sequence data of cfDNA fragments in the sample to be tested, and encode the sequence data of each cfDNA fragment using a pre-trained sequence language model to obtain an initial fragment representation vector; b) Vector splitting step: divide each initial fragment representation vector along its channel dimension into a signal subspace vector for carrying cancer classification signals and a background subspace vector for carrying background information; c) Dual-path processing and classification step: in the classification path, based on the signal subspace vectors, a multi-instance learning framework with an attention mechanism is used to weight and aggregate the signals of all fragments in the sample to generate a bag-level representation vector, and the cancer prediction result of the sample is calculated based on the bag-level representation vector; in the background reconstruction path, the initial fragment representation vector is reconstructed by a decoder based on the background subspace vector; d) Model training step: the trainable parameters in the sequence language model, the multi-instance learning framework, and the decoder are trained by jointly optimizing an overall loss function that includes a sample classification loss, a background reconstruction loss, and an orthogonal constraint loss, so as to promote decoupling between signal and background.

2. A whole-genome fragment-level cancer detection system based on split-CLS dual-path signal-noise separation, characterized in that it includes: a) Data acquisition and encoding module: used to acquire the sequence data of cfDNA fragments in the sample to be tested, and to encode the sequence data of each cfDNA fragment using a pre-trained sequence language model to obtain an initial fragment representation vector; b) Vector splitting module: used to divide each initial fragment representation vector along its channel dimension into a signal subspace vector carrying cancer classification signals and a background subspace vector carrying background information; c) Dual-path processing and classification module: in the classification path, based on the signal subspace vectors, a multi-instance learning framework with an attention mechanism is used to weight and aggregate the signals of all fragments in the sample to generate a bag-level representation vector, and the cancer prediction result of the sample is calculated based on the bag-level representation vector; in the background reconstruction path, the module reconstructs the initial fragment representation vector through a decoder based on the background subspace vector; d) Model training module: used to train the trainable parameters in the sequence language model, the multi-instance learning framework, and the decoder by jointly optimizing an overall loss function that includes a sample classification loss, a background reconstruction loss, and an orthogonal constraint loss, so as to promote decoupling between signal and background; e) Prediction module: used to perform inference on new test samples using the model obtained from the model training module, and to obtain cancer prediction results.

3. The detection system according to claim 2, characterized in that the data acquisition and encoding module is further used to: acquire cfDNA from tumor samples and healthy human samples, sequence it to obtain cfDNA fragment sequences, align the sequences to a reference genome to obtain their positions on the reference genome, and then extend each position upstream and downstream by several bases, using the extended sequence on the reference genome as the data of a single sample.

4. The detection system according to claim 2, characterized in that, in the data acquisition and encoding module, when the sequencing data is aligned to the reference genome, cfDNA fragments with alignment quality scores not lower than a preset threshold are selected, and the length of the selected cfDNA fragments is limited to no more than a predetermined length threshold; the preset threshold for the alignment quality score is 20-40, and the predetermined length threshold for the cfDNA fragment length is 300-500 bp.

5. The detection system according to claim 2, characterized in that the pre-trained sequence language model is a Transformer-based model, and the initial fragment representation vector is the [CLS] vector output by the last encoder layer of the model; the pre-trained sequence language model is obtained by dividing the reference genomic DNA of multiple species, including humans, into fragments while maintaining the original order and keeping the DNA fragment size consistent, and training on them with a masked language model mechanism to learn the latent language structure features of DNA sequences; the relationship between any two DNA sub-word fragments in a sequence is modeled by the self-attention mechanism; the pre-trained language model built on the Transformer architecture adopts the BERT model and contains 10-20 Transformer encoder layers; and the fragment feature vector is a 600-900 dimensional vector.

6. The detection system according to claim 2, characterized in that the vector splitting module is used to divide the initial fragment representation vector of dimension d along the channel dimension into a signal subspace vector and a background subspace vector, each of a respective dimension, where i is the sample index and j is the fragment index.

7. The detection system according to claim 2, characterized in that the attention mechanism in the classification path is used to calculate the attention weight $\alpha_{ij}$ of the j-th fragment in the i-th sample from the signal subspace vector of the j-th fragment, the total number of fragments in the sample, and a trainable parameter matrix and a trainable parameter vector in the attention network. The classification path further includes: i) based on the attention weights $\alpha_{ij}$, performing a weighted summation of the signal subspace vectors of all fragments to obtain the bag-level representation vector $z_i$; ii) inputting the bag-level representation vector $z_i$ into a multilayer perceptron that acts as a classifier and outputs the class logit; iii) inputting the class logit into the sigmoid function to obtain the final cancer prediction probability. The background reconstruction path is implemented by minimizing the background reconstruction loss $L_{rec}$, which is calculated from the output of the decoder and the initial fragment representation vector under a stop-gradient operation; this operation ensures that the background reconstruction loss is used only to update the parameters of the decoder and is not backpropagated to the encoder. The orthogonal constraint loss $L_{orth}$ is obtained by calculating the sum of the squares of the inner products (vector dot products) of the signal subspace vector and the background subspace vector. The overall loss function is composed as $L = L_{cls} + \lambda_{rec} L_{rec} + \lambda_{orth} L_{orth}$, where $L_{cls}$ is the bag-level binary cross-entropy loss calculated from the cancer prediction probability, and $\lambda_{rec}$ and $\lambda_{orth}$ are preset weighting coefficients used to balance the losses.

8. The detection system according to claim 2, characterized in that, in the prediction module, when performing inference on new test samples, only the calculation steps of the classification path are executed, and the calculation steps of the background reconstruction path are not executed, thereby generating cancer prediction results.

9. The method according to claim 1, characterized in that it is applied to non-invasive early screening of cancer, to monitoring of minimal residual disease (MRD) after cancer surgery, or to identification and contribution analysis of high-risk oncogenic cfDNA fragments.

10. A computer-readable medium, characterized in that it stores instructions executable by one or more processors; when executed, the instructions cause the one or more processors to implement the method of claim 1.