A method for establishing an early diagnosis model of intestinal adenocarcinoma based on peripheral blood detection
By combining Raman spectroscopy with a deep learning model, the problem of insufficient sensitivity and specificity in the early diagnosis of intestinal adenomas and adenocarcinomas in existing technologies has been solved, enabling non-invasive and rapid adenoma screening and improving the efficiency of early colorectal cancer screening.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG UNIV
- Filing Date
- 2022-09-26
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies for the early diagnosis of intestinal adenomas and adenocarcinomas suffer from low sensitivity and specificity, and the detection process is complex and time-consuming. They also fail to effectively utilize the changes and differences in Raman spectral characteristic peaks, and the diagnostic efficiency of random forest algorithms and support vector machine models is limited.
Peripheral blood was detected using Raman spectroscopy. Combined with a deep learning model, the Raman spectral signals were processed through data cleaning, signal smoothing, baseline correction, and data normalization. A deep learning architecture with multi-scale embedding layers and transformer groups was constructed, and the Adam optimizer and self-attention mechanism were used for feature extraction and classification.
It enables sensitive and accurate screening of patients with adenomas and adenocarcinomas, providing a non-invasive and rapid method for early screening of colorectal cancer with high sensitivity and specificity, filling a technological gap in the field of liquid biopsy.
Figure CN116026807B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of computer medical technology, specifically relating to a method for establishing an early diagnostic model for intestinal adenocarcinoma based on peripheral blood detection.
Background Technology
[0002] Intestinal adenoma is a common precancerous lesion of colorectal cancer, and screening for adenoma in the population is of great significance for the prevention of colorectal cancer.
[0003] Raman spectroscopy is an analytical method based on the Raman scattering effect discovered by C. V. Raman. It analyzes the scattered spectrum, whose frequency differs from that of the incident light, to obtain information about molecular vibrations and rotations, and is applied to molecular structure research. As an optical detection method, Raman spectroscopy has shown great potential in biomedicine, especially in the field of liquid biopsy, due to its advantages such as fingerprint-like characteristics, non-destructive testing, high sensitivity, and ease of use.
[0004] Deep learning algorithms can classify large and complex data and quickly analyze potential data patterns. They have been widely used to identify images, text, and biological data (including medical images, tissue images, and signals).
[0005] Chinese invention patent CN104142320A discloses a diagnostic technique for parotid gland tumors based on serum surface-enhanced Raman spectroscopy. This novel technique addresses the challenge of preoperative diagnosis of parotid gland tumors in head and neck oncology and oral and maxillofacial surgery. It primarily solves the problem of not being able to perform conventional biopsies for definitive diagnosis in parotid gland tumor patients, replacing local puncture biopsy to avoid complications and reduce the workload of pathologists. It also provides surgeons with a basis for choosing different surgical treatments based on the benign or malignant nature of the tumor. This technique involves adding nanoscale gold sol particles to the serum of parotid gland tumor patients to enhance Raman spectral intensity, obtaining characteristic surface-enhanced Raman spectra of the patient's serum. By analyzing the characteristic spectral data using a support vector machine, a differential diagnostic model is established to differentiate between different parotid gland tumors, providing objective evidence for preoperative diagnosis and thus establishing a rapid, non-invasive diagnostic technique.
[0006] However, this method has the following problems: 1. It requires fasting peripheral blood collection (surface-enhanced Raman amplifies the signal intensity of all substances in the analyte, so the noise from interfering substances that are originally minor is also amplified, affecting the judgment of the results); 2. It requires the preparation of gold nanoparticles of a certain particle size and pretreatment such as thorough mixing for 10 minutes (nanoscale metal particles are required to generate the surface plasmon resonance effect, and the particles and the sample need to be in full contact to produce the surface-enhanced Raman effect); 3. The diagnostic model constructed with the support vector machine cannot compare the changes and differences of the relevant surface-enhanced Raman characteristic peaks, and its diagnostic sensitivity and specificity are low (the support vector machine itself is not as good as a deep learning model at classifying complex data, and it cannot perform difference screening on the plots after data visualization; for complex data containing confounding factors and multiple factors affecting classification and diagnosis, the diagnostic efficiency of a model built with the support vector machine is limited).
[0007] Chinese invention patent CN109852714A discloses a biomarker for early screening of colorectal cancer and diagnosis of adenoma and its uses. This invention is the first to study the differences in gut microbiota between colorectal cancer, adenoma and healthy people through plasma cfDNA, and screen out gut microbiota with significant differences. Then, a colorectal cancer risk prediction model is established through random forest method, which is suitable for screening and diagnosis of colorectal cancer and adenoma, and can be used to identify early colorectal cancer and adenoma.
[0008] However, this method has the following problems: 1. It requires sample pretreatment such as cfDNA extraction and sequencing (which is time-consuming, labor-intensive, and increases testing costs); 2. The sequencing results need to be compared with microbial genome databases (which is time-consuming and labor-intensive, and delays the return of test results to the patient); 3. The random forest algorithm, which is used to determine the probability of disease, cannot compare the changes and differences of related surface-enhanced Raman characteristic peaks (the random forest algorithm itself is not as effective as deep learning models at classifying complex data, and it cannot perform difference screening on the plots after data visualization; for complex data containing confounding factors and multiple factors affecting classification and diagnosis, the diagnostic efficiency of a model built with the random forest method is limited).
Summary of the Invention
[0009] To address the problems existing in the prior art, the purpose of this invention is to design and provide a technical solution for establishing an early diagnostic model for intestinal adenocarcinoma based on peripheral blood detection.
[0010] This invention is specifically achieved through the following technical solutions:
[0011] A method for establishing an early diagnostic model for colorectal adenocarcinoma based on peripheral blood detection includes the following steps:
[0012] Step 1, serum sample collection, including:
[0013] Peripheral blood was collected from healthy volunteers, adenoma patients, and adenocarcinoma patients. All participants were aged 18-80 years and had no history of autoimmune diseases or malignant tumors. They were divided into three groups based on inclusion and exclusion criteria:
[0014] Group A, healthy volunteers: colonoscopy confirmed no space-occupying colorectal lesions;
[0015] Group B, patients with adenomas: colonoscopy and pathology confirmed adenomas;
[0016] Group C, adenocarcinoma patients: colonoscopy and pathology confirmed adenocarcinoma, with no non-adenocarcinoma malignant tumors or distant metastases.
[0017] Step 2, using a Raman spectrometer for detection, including:
[0018] The serum samples collected in step 1 were tested on a Raman spectrometer, and the Raman spectrometer spectral data were collected.
[0019] Step 3: Perform graphical analysis and data processing on the acquired Raman spectral signals;
[0020] Step 4: Build the deep learning model architecture.
[0021] Furthermore, the Raman spectroscopy detection in step 2 is specifically as follows:
[0022] The silicon wafer is placed on the sample stage and fixed. The laser power is set to 75 mW, the grating is set to 1200 lines/mm, and the scanning time is 1.0 second to ensure that the silicon signal is not saturated. The detection is then started.
[0023] Use a full-spectrum scan (100-4000 cm⁻¹) to examine the peak bands of the substance;
[0024] Determine the center peak position and region of the scan, and optimize the exposure time and number of accumulations to obtain a better signal-to-noise ratio curve;
[0025] Select the 800-1600 cm⁻¹ range of the biological-tissue Raman spectrum as the detection section;
[0026] Collect the Raman spectral signal of the test specimen;
[0027] Raman spectral data were collected using WiRE 3.2 software.
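The selection of the 800-1600 cm⁻¹ detection section described above can be sketched in a few lines of Python. This is only an illustration: the function name `crop_spectrum` and the paired-list data layout are assumptions, not part of the patent or of the WiRE software.

```python
def crop_spectrum(wavenumbers, intensities, low=800.0, high=1600.0):
    """Keep only the points whose Raman shift lies in [low, high] cm^-1.

    Hypothetical helper: the patent only states that the 800-1600 cm^-1
    range is used as the detection section.
    """
    kept = [(wn, inten) for wn, inten in zip(wavenumbers, intensities)
            if low <= wn <= high]
    return [wn for wn, _ in kept], [inten for _, inten in kept]


# Example: a full-range scan reduced to the detection section.
shifts = [100.0, 800.0, 1200.0, 1600.0, 4000.0]
counts = [1.0, 2.0, 3.0, 4.0, 5.0]
cropped_shifts, cropped_counts = crop_spectrum(shifts, counts)
```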
[0028] Furthermore, step 3, which involves graphical analysis and data processing of the acquired Raman spectral signals, includes: data cleaning, signal smoothing, baseline correction, and data normalization of the data obtained in step 2, specifically:
[0029] In the data cleaning step, the raw data obtained from the detection is initially cleaned, and spectra with a spectral intensity of 0 are deleted;
[0030] In the signal smoothing step, each spectrum is smoothed individually using the asymmetric least squares smoothing method;
[0031] In the baseline correction step, the ZhangFit toolkit provided in Python is used for background correction. The ZhangFit algorithm uses an adaptive iteratively reweighted penalized least squares method, which does not require any user intervention or prior information, such as peak detection. This method works by iteratively changing the weights of the sum of squares error between the fitted baseline and the original signal. The weights of the sum of squares error are adaptively obtained using the difference between the previously fitted baseline and the original signal. For details, please refer to the following literature: Zhang, Z.-M., Chen, S. & Liang, Y.-Z. Baseline correction using adaptive iteratively reweighted penalized least squares. Analyst 135, 1138–1146 (2010);
[0032] In the data normalization step, the signal intensity range is determined according to the lowest and highest intensities. Within this spectral range, the minimum intensity is 0 and the maximum intensity is 1. Then, the signal intensity of different peak positions is converted into corresponding values between 0 and 1 according to the proportion, so that the spectral data is normalized.
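The cleaning, baseline correction, and normalization steps above can be sketched in plain Python as follows. This is a simplified illustration rather than the ZhangFit toolkit itself: it assumes a first-order difference penalty (so the linear system is tridiagonal), assumes that "intensity of 0" means a spectrum whose points are all zero, and omits the separate smoothing step. The function names, `lam=100`, and `max_iter=15` are illustrative choices.

```python
import math


def whittaker_smooth(y, w, lam):
    """Solve (W + lam * D'D) z = W y, with D the first-order difference matrix."""
    n = len(y)
    # D'D is tridiagonal: diagonal [1, 2, ..., 2, 1], off-diagonals -1.
    diag = [w[i] + lam * (1.0 if i in (0, n - 1) else 2.0) for i in range(n)]
    rhs = [w[i] * y[i] for i in range(n)]
    off = -lam
    # Thomas algorithm: forward elimination, then back substitution.
    c = [0.0] * n
    d = [0.0] * n
    c[0] = off / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - off * c[i - 1]
        c[i] = off / denom
        d[i] = (rhs[i] - off * d[i - 1]) / denom
    z = [0.0] * n
    z[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        z[i] = d[i] - c[i] * z[i + 1]
    return z


def airpls_baseline(y, lam=100.0, max_iter=15):
    """Adaptive iteratively reweighted penalized least squares baseline (sketch)."""
    w = [1.0] * len(y)
    total = sum(abs(v) for v in y)
    z = list(y)
    for it in range(1, max_iter + 1):
        z = whittaker_smooth(y, w, lam)
        resid = [yi - zi for yi, zi in zip(y, z)]
        neg = [r for r in resid if r < 0]
        neg_sum = abs(sum(neg))
        if not neg or neg_sum < 0.001 * total or it == max_iter:
            break
        # Points above the fitted baseline (likely peaks) get zero weight;
        # points below it are up-weighted according to their residual.
        w = [math.exp(it * abs(r) / neg_sum) if r < 0 else 0.0 for r in resid]
        w[0] = w[-1] = math.exp(it * max(neg) / neg_sum)
    return z


def clean_spectra(spectra):
    """Drop spectra whose intensities are all zero (assumed reading of the text)."""
    return [s for s in spectra if any(v != 0 for v in s)]


def min_max_normalize(spectrum):
    """Map the lowest intensity to 0 and the highest to 1."""
    lo, hi = min(spectrum), max(spectrum)
    if hi == lo:
        return [0.0] * len(spectrum)
    return [(v - lo) / (hi - lo) for v in spectrum]


def preprocess(spectra):
    """Clean, subtract the fitted baseline, then normalize each spectrum."""
    out = []
    for s in clean_spectra(spectra):
        baseline = airpls_baseline(s)
        corrected = [v - b for v, b in zip(s, baseline)]
        out.append(min_max_normalize(corrected))
    return out


# Synthetic example: one Gaussian peak on a constant background, plus an empty scan.
peak = [5.0 + 10.0 * math.exp(-((i - 100) / 3.0) ** 2) for i in range(200)]
processed = preprocess([peak, [0.0] * 200])
```

In practice the baseline step would more likely call an existing implementation of the cited algorithm; the hand-written solver here only serves to make the reweighting loop visible.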
[0033] Furthermore, the deep learning model in step 4 comprises four modules, as follows:
[0034] a. Data Processing Layer: Uses the raw Raman spectrum as input, with the input size defined as $X \in \mathbb{R}^{N \times H_{in} \times C_{in}}$, where $N$ is the batch size, $H_{in}$ is the height of the feature map, and $C_{in}$ is the number of input channels. This module segments the raw Raman data and parses these three quantities (batch size, feature map height, and number of input channels) from the input data for the next step of feature extraction and recognition.
[0035] b. Multi-Scale Embedding Layer (MEL): This layer consists of 2^N scale embedding strips, where N is a natural number greater than 1. Each scale embedding layer generates an input embedding at a corresponding scale. This module is used to extract feature images at multiple scales, i.e., different resolutions, from the input data of Raman spectroscopy. As the number of MELs increases, the resolution of the extracted features decreases, the receptive field increases, and the channel dimension also increases. MELs construct multi-resolution feature maps on the Raman spectrum, improving the model's feature extraction capability.
[0036] c. Transformer Group: Composed of a multi-head self-attention layer, a normalization layer, and a feedback layer. A learnable positional bias is added before the self-attention mechanism to generate sequence embeddings. Channel-dimensional features are labeled and identified. Dropout techniques and L2 regularization are applied to improve its robustness. Formally, the self-attention mechanism becomes:
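[0037] $$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$
[0038] where $Q, K, V \in \mathbb{R}^{N \times H_{in} \times C_{in}}$ represent the query, key, and value in the self-attention mechanism, respectively, and $\sqrt{d}$ is a constant normalizer;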
[0039] d. Parameter Tuning Group: The Adam optimizer is used, and the model is built using the gradient descent algorithm. It includes a first-stage transformer and a second-stage transformer. The first-stage transformer block consists of two identical layers, and the second-stage transformer block consists of one identical layer. The batch size is set to 512, the learning rate is set to 2e-5, the dropout rate is set to 0.4, and the loss function is defined as:
[0040] $$L(\theta) = -\sum_{k} y_k \log P_k$$
[0041] where $\theta$ represents the model parameters, $y_k \in \{0, 1\}$ represents the ground truth of class $k$, and $P_k$ represents the prediction of class $k$. This module can perform characteristic data processing and model construction on Raman data of different diseases by setting and adjusting the parameters.
[0042] The beneficial effects of this invention are as follows: This invention successfully constructs a natural language model suitable for Raman spectroscopy data processing and analysis, and further constructs an identification model for patients with colorectal adenomas and adenocarcinomas. This invention successfully realizes a screening method for colorectal adenomas and adenocarcinomas based on peripheral blood Raman spectroscopy detection. The detection method of this invention, through simple and rapid Raman spectroscopy detection of serum, can more sensitively and accurately screen and identify patients with adenomas and adenocarcinomas from the population, thereby more effectively carrying out early screening for colorectal cancer. This invention provides a new method for adenoma diagnosis and screening besides colonoscopy, effectively filling the current technological gap in the field of liquid biopsy for colorectal adenoma detection. More importantly, this invention not only has high sensitivity in early screening for colorectal cancer, but also extremely high specificity.
Attached Figure Description
[0043] Figure 1 is a flowchart of Example 1;
[0044] Figure 2 shows the surface-enhanced Raman spectrum of serum from healthy individuals in Example 2;
[0045] Figure 3 shows the surface-enhanced Raman spectrum of serum from adenoma patients in Example 2;
[0046] Figure 4 shows the surface-enhanced Raman spectrum of serum from adenocarcinoma patients in Example 2;
[0047] Figure 5 shows the Raman spectrum of serum from healthy individuals in Example 2;
[0048] Figure 6 shows the Raman spectrum of serum from adenoma patients in Example 2;
[0049] Figure 7 shows the Raman spectrum of serum from adenocarcinoma patients in Example 2;
[0050] Figure 8 shows the results of pairwise comparisons and three-class classification of the three types of patients in Example 2. A: Contingency table results for classifying healthy individuals and adenoma patients; B: Contingency table results for classifying adenoma and adenocarcinoma patients; C: Contingency table results for classifying healthy individuals and adenocarcinoma patients; D: Receiver Operating Characteristic (ROC) curves showing the accuracy of our model in classifying the three groups of patients pairwise; E: Contingency table of model classification results for test set sample points; F: Contingency table of model classification results for patients in the test set.
Detailed Implementation
[0051] In the description of this invention, it should be understood that the terms "one end", "the other end", "outer side", "upper", "inner side", "horizontal", "coaxial", "center", "end", "length", "outer end", etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the drawings, and are only for the convenience of describing this invention and simplifying the description, and are not intended to indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of this invention.
[0052] The invention will now be further described with reference to the accompanying drawings.
[0053] Example 1
[0054] Referring to Figure 1, a method for establishing an early diagnostic model for colorectal adenocarcinoma based on peripheral blood detection includes the following steps:
[0055] Step 1, serum sample collection, including:
[0056] Peripheral blood was collected from healthy volunteers, adenoma patients, and adenocarcinoma patients. All participants were aged 18-80 years and had no history of autoimmune diseases or malignant tumors. They were divided into three groups based on inclusion and exclusion criteria:
[0057] Group A, healthy volunteers: colonoscopy confirmed no space-occupying colorectal lesions;
[0058] Group B, patients with adenomas: colonoscopy and pathology confirmed adenomas;
[0059] Group C consists of adenocarcinoma patients who, upon colonoscopy and pathological examination, were found to have no non-adenocarcinoma malignant tumors or distant metastases.
[0060] Following colonoscopy, 2 ml of peripheral blood was collected from each participant for Raman spectroscopy and analysis. The serum was immediately separated and stored at -80°C.
[0061] Step 2, using a Raman spectrometer for detection, including:
[0062] For each group, 20 μL of the test liquid was dropped onto a silicon substrate, which was then placed on the detection platform of the Raman spectrometer. The specific Raman spectroscopy detection method is as follows: the silicon wafer was fixed on the sample stage, and the laser power was set to 75 mW (785 nm laser at 100% power). Detection was started once the silicon signal was confirmed to be unsaturated. First, a full-spectrum scan (100-4000 cm⁻¹) was used to examine the peak bands of the substance, determine the position of the central peak and the scanning segment, and optimize the exposure time and accumulation time to obtain a better signal-to-noise ratio curve. Serum from healthy individuals and adenoma patients was then measured, and the Raman spectral data were acquired.
[0063] Step 3 involves graphical analysis and data processing of the acquired Raman spectral signals, including:
[0064] Nine hundred spectra were obtained from each sample. Due to the diversity of the human body, the variance within the sample is high. The large number of spectra in each sample can better represent the true distribution, resulting in higher classification accuracy.
[0065] The data processing stage consists of four steps: data cleaning, signal smoothing, baseline correction, and data normalization. Specifically, in the data cleaning step, spectra with an intensity of 0 are removed. In the signal smoothing step, the spectra are individually smoothed using an asymmetric least squares smoothing method. In the baseline correction step, the background is removed using the ZhangFit package in Python. In the data normalization step, the spectra are normalized, with the minimum intensity set to 0 and the maximum intensity set to 1 within this spectral range.
[0066] Step 4, construct the deep learning model architecture, including:
[0067] The model uses the raw Raman spectrum as input. The input size is defined as $X \in \mathbb{R}^{N \times H_{in} \times C_{in}}$, where $N$ is the batch size, $H_{in}$ is the height of the feature map, and $C_{in}$ is the number of input channels. Multi-scale embedding layers (MELs) are used to generate input embeddings for each stage. In practice, as the number of MELs increases, the resolution of the extracted features decreases, the receptive field increases, and the channel dimension also increases. MELs construct multi-resolution feature maps on the Raman spectrum, improving the model's feature extraction capability.
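The trade-off described above, where coarser scales give lower resolution but a larger receptive field, can be illustrated with a toy example. The patent does not specify kernel sizes or the learned projection used in the MELs, so this sketch substitutes plain (non-learned) mean pooling over windows of 2, 4, and 8 points, and does not model the growth in channel dimension. The name `multi_scale_views` is hypothetical.

```python
def multi_scale_views(spectrum, num_scales=3):
    """Return one coarse view of the spectrum per scale.

    Scale s averages non-overlapping windows of 2**s points, so each
    further scale halves the resolution and doubles the receptive field.
    This stands in for a learned embedding and is an assumption.
    """
    views = []
    for s in range(1, num_scales + 1):
        width = 2 ** s
        view = [
            sum(spectrum[i:i + width]) / width
            for i in range(0, len(spectrum) - width + 1, width)
        ]
        views.append(view)
    return views


# A 16-point toy spectrum yields views of length 8, 4, and 2.
toy = [float(i) for i in range(16)]
views = multi_scale_views(toy)
lengths = [len(v) for v in views]
```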
[0068] For the transformer groups, a learnable positional bias is added before the self-attention mechanism to generate sequence embeddings. In this implementation, features along the channel dimension are treated as a label. Dropout techniques and L2 regularization are applied to improve robustness. Formally, the self-attention mechanism becomes:
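[0069] $$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$
[0070] where $Q, K, V \in \mathbb{R}^{N \times H_{in} \times C_{in}}$ represent the query, key, and value in the self-attention mechanism, respectively, and $\sqrt{d}$ is a constant normalizer;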
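A minimal single-head sketch of the self-attention computation is given below. It assumes the standard scaled dot-product form, in which the scores $QK^{\top}$ are divided by the constant normalizer $\sqrt{d}$ before the softmax. The learnable positional bias, the multi-head split, dropout, and the batch dimension are all omitted, and the function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q, K, V):
    """Scaled dot-product attention for one sequence (lists of row vectors)."""
    d = len(K[0])  # feature dimension used in the sqrt(d) normalizer
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, V))
            for j in range(len(V[0]))
        ])
    return out


# With identical keys every query attends uniformly, so each output row
# is the mean of the value rows.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [1.0, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
attended = self_attention(Q, K, V)
```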
[0071] Parameter settings: This invention employs the Adam optimizer, an algorithm for gradient descent models. In this implementation, the first-stage transformer block consists of two identical layers, and the second-stage transformer block consists of one identical layer. The batch size is set to 512, the learning rate is set to 2e-5, the dropout rate is set to 0.4, and the loss function is defined as:
[0072] $$L(\theta) = -\sum_{k} y_k \log P_k$$
[0073] where $\theta$ represents the model parameters, $y_k \in \{0, 1\}$ represents the ground truth of class $k$, and $P_k$ represents the prediction of class $k$.
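As a worked illustration of the loss, the sketch below assumes it is the standard categorical cross-entropy, which matches the symbols defined above (a one-hot ground truth $y_k$ and a predicted probability $P_k$ per class). The `eps` clamp is an added numerical safeguard and the function name is hypothetical.

```python
import math


def cross_entropy(y_true, p_pred, eps=1e-12):
    """Categorical cross-entropy: -sum_k y_k * log(P_k).

    y_true is a one-hot list; p_pred is a list of class probabilities.
    eps guards against log(0) and is not part of the source text.
    """
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, p_pred))


# Three classes (N, A, T); the true class is A and the model assigns it 0.7.
loss = cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])
```

Only the probability assigned to the true class contributes, so here the loss equals $-\log 0.7 \approx 0.357$.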
[0074] Example 2
[0075] This invention constructs a natural language processing model: Ms-Former. See Example 1 for details.
[0076] Example 2 illustrates a specific application of the above-mentioned natural language processing model: This model is used in clinical practice for early screening of colorectal cancer in high-risk populations.
[0077] Based on the inclusion and exclusion criteria, 27 healthy individuals, 28 adenoma patients, and 81 colorectal cancer patients were enrolled; of the cancer cases, 19 were stage I, 19 were stage II, and 24 were stage III. The experimental group consisted of 6 healthy individuals, 6 adenoma patients, and 17 adenocarcinoma patients.
[0078] Raman and enhanced Raman scans were performed on peripheral serum from normal individuals, adenoma patients, and adenocarcinoma patients, and the results were visualized, as shown in Figure 2. As can be seen from Figure 2, the main differential peaks among the three groups were at 440, 660, 840, 1200, 1160, 1220, 1260, and 1500 cm⁻¹. Compared with the control group, the peak plots of the adenoma group and the adenocarcinoma group showed that 3 or 4 peaks were "double-peaked" or even "triple-peaked".
[0079] In serum classification, this invention distinguishes three types of serum Raman spectra: N represents normal individuals, A represents adenoma patients, and T represents cancer patients. Similarly, the applicant conducted experiments on four types of data and four baseline methods. The applicant used RS data from the three groups of patients to train the model and determine the optimal model parameters. The curves relating the epoch and head settings to the model's training loss and classification accuracy are shown in Figure 8A. The applicant found that the model achieved the highest accuracy when the number of heads was 12 (Figure 8B), and the highest classification accuracy when the number of blocks was 1 (Figure 8C).
[0080] The applicant conducted pairwise comparative analyses of the healthy, adenoma, and adenocarcinoma groups, and constructed corresponding classification and diagnostic models. The classification results of the pairwise comparisons among the three groups are shown in Figure 8A-C. Figure 8D shows the receiver operating characteristic (ROC) curves of this invention on the RS dataset. The model has sufficiently high accuracy: the area under the curve (AUC) is 0.9845 for normal individuals (N) and adenoma patients (A), 0.940 for normal individuals (N) and cancer patients (T), and 0.9968 for normal individuals (N) and patients (A and T).
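For readers unfamiliar with the AUC metric quoted above, the following sketch shows one common way to compute it for a two-class comparison: the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The scores in the example are made up for illustration and are not data from this invention.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking (Mann-Whitney form).

    labels: 1 for the positive class (e.g. adenoma), 0 for the negative class.
    scores: model scores, higher meaning "more likely positive".
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Illustrative scores only: 3 of the 4 positive/negative pairs are ranked correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```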
[0081] Then, the applicant used a natural language processing model building method to construct a three-class disease diagnostic model based on Raman test data from the three groups of patients. The classification accuracy of the different analysis methods is shown in Table 1. As shown in Table 1, in this invention, the RS-based data generally outperforms the SERS-based data. In the classification results, the applicant's proposed method achieves an accuracy of 94.78% on the original RS dataset, the best across all datasets. Detailed performance is shown in the confusion matrix in Figure 8E. Additionally, as shown in Figure 8F, the accuracy rate for each test individual reached 100%.
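The text reports one accuracy per spectrum (Figure 8E) and another per patient (Figure 8F) but does not state how spectrum-level predictions are combined into a patient-level result. The sketch below assumes a simple majority vote over each patient's spectra, purely to illustrate how patient-level accuracy can exceed spectrum-level accuracy. The function names and toy labels are hypothetical.

```python
from collections import Counter


def patient_level_predictions(patient_ids, spectrum_preds):
    """Majority vote over the spectra of each patient (assumed rule)."""
    votes = {}
    for pid, pred in zip(patient_ids, spectrum_preds):
        votes.setdefault(pid, Counter())[pred] += 1
    return {pid: counts.most_common(1)[0][0] for pid, counts in votes.items()}


def accuracy(truth, predicted):
    """Fraction of matching labels."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


# Toy data: one of patient p1's three spectra is misclassified.
ids = ["p1", "p1", "p1", "p2", "p2", "p2"]
spectrum_truth = ["A", "A", "A", "T", "T", "T"]
spectrum_preds = ["A", "A", "N", "T", "T", "T"]

per_spectrum_acc = accuracy(spectrum_truth, spectrum_preds)
per_patient = patient_level_predictions(ids, spectrum_preds)
per_patient_acc = accuracy(["A", "T"], [per_patient["p1"], per_patient["p2"]])
```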
[0082]
[0083] Table 1: Accuracy of different analysis methods in classifying normal individuals, adenoma patients, and adenocarcinoma patients based on RS and SERS data before and after correction.
[0084] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.
Claims
1. A method for establishing an early diagnostic model for intestinal adenocarcinoma based on peripheral blood detection, characterized in that it includes the following steps:
Step 1, serum sample collection, including: Peripheral blood was collected from healthy volunteers, adenoma patients, and adenocarcinoma patients. All participants were aged 18-80 years and had no history of autoimmune diseases or malignant tumors other than adenocarcinoma. They were divided into three groups based on inclusion and exclusion criteria: Group A, healthy volunteers: colonoscopy confirmed no space-occupying colorectal lesions; Group B, patients with adenomas: colonoscopy and pathology confirmed adenomas; Group C: Adenocarcinoma patients who, through colonoscopy and pathology, were confirmed to have no non-adenocarcinoma malignant tumors or distant metastases.
Step 2, using a Raman spectrometer for detection, including: The serum samples collected in step 1 were tested on a Raman spectrometer, and the Raman spectrometer spectral data were collected.
Step 3: Perform graphical analysis and data processing on the acquired Raman spectral signals;
Step 4: Construct the deep learning model architecture; the deep learning model consists of 4 modules, as follows:
a. Data Processing Layer: Uses the raw Raman spectrum as input, with the input size defined as $X \in \mathbb{R}^{N \times H_{in} \times C_{in}}$, where $N$ is the batch size, $H_{in}$ is the height of the feature map, and $C_{in}$ is the number of input channels. This module is used to segment the raw Raman data and parse the three indicators of batch size, feature map height, and number of input channels of the input data for the next step of feature extraction and recognition.
b. Multi-scale embedding layer: Consists of 2 to the power of N scale embedding strips, where N is a natural number greater than 1. Each scale embedding layer can generate the input embedding at the corresponding scale. This module is used to extract features at multiple scales from the input Raman spectral data.
c. Transformer group: It consists of a multi-head self-attention layer, a normalization layer, and a feedback layer. A learnable positional bias is added before the self-attention mechanism to generate sequence embeddings and to label and identify channel-dimensional features.
d. Parameter Tuning Group: The Adam optimizer is used, and the model is built using the gradient descent algorithm. It includes a first-stage transformer and a second-stage transformer. The first-stage transformer block consists of two identical layers, and the second-stage transformer block consists of one identical layer. The batch size is set to 512, the learning rate is set to 2e-5, the dropout rate is set to 0.4, and the loss function is defined as: $L(\theta) = -\sum_{k} y_k \log P_k$, where $\theta$ represents the model parameters, $y_k \in \{0, 1\}$ represents the ground truth of class $k$, and $P_k$ represents the prediction of class $k$. This module can perform characteristic-based data processing and model building on Raman data of different diseases by adjusting the parameters.
2. The method for establishing an early diagnostic model for intestinal adenocarcinoma based on peripheral blood detection as described in claim 1, characterized in that the Raman spectroscopy detection in step 2 is as follows: The silicon wafer is placed on the sample stage and fixed. The laser power is set to 75 mW, the grating is set to 1200 lines/mm, and the scanning time is 1.0 second to ensure that the silicon signal is not saturated. The detection is then started. Use a full-spectrum scan (100-4000 cm⁻¹) to examine the peak bands of the substance; Determine the center peak position and region of the scan, and optimize the exposure time and number of accumulations to obtain a better signal-to-noise ratio curve; Select the 800-1600 cm⁻¹ range of the biological-tissue Raman spectrum as the detection section; Collect the Raman spectral signal of the test specimen; Raman spectral data were collected using WiRE 3.2 software.