A non-small cell lung cancer early typing system based on peripheral blood immune cell atlas
The peripheral blood immune cell atlas typing system uses mass cytometry and a random forest classifier to achieve high accuracy and sensitivity in the early typing of non-small cell lung cancer. It avoids the tumor-size requirements and complication risks of traditional methods and is suitable for large-sample analysis.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG UNIV
- Filing Date
- 2023-03-13
- Publication Date
- 2026-06-16
AI Technical Summary
Existing technologies are insufficient for accurate and non-invasive early classification of non-small cell lung cancer. Traditional in vivo diagnostic methods have requirements on tumor size and are prone to complications, while liquid biopsy methods lack sufficient sensitivity and stability.
A typing system based on the peripheral blood immune cell atlas is adopted. It performs early typing of non-small cell lung cancer through peripheral blood single-cell acquisition, mass cytometry detection, data conversion and normalization, gating to remove abnormal cells, clustering with the graph-based algorithm PARC, and a random forest binary classifier.
It achieves high accuracy and sensitivity in the early classification of non-small cell lung cancer, is suitable for large sample size analysis, has a moderate cost, and is suitable for clinical application.
Figure CN116337727B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the technical field of early classification of non-small cell lung cancer, specifically to an early classification system for non-small cell lung cancer based on a peripheral blood immune cell atlas.
Background Technology
[0002] Globally, lung cancer is the leading cause of cancer death. Non-small cell lung cancer (NSCLC) is the predominant type, accounting for approximately 85% of all lung cancers. NSCLC is mainly divided into squamous cell carcinoma and adenocarcinoma, with adenocarcinoma accounting for 78% and squamous cell carcinoma for 18%. The latest national cancer report shows that lung cancer ranks first in both incidence and mortality among malignant tumors in China. The classification of NSCLC directly affects the choice of treatment and the prognosis, so accurate classification provides a solid foundation for treatment.
[0003] Biopsy remains the gold standard for cancer diagnosis, and doctors rely on its results to develop appropriate treatment plans. Currently, the most widely used biopsy methods for tumor diagnosis are surgical extraction of the tumor and tumor biopsy. However, both methods have limitations regarding tumor size, and neither can accurately diagnose small, early-stage tumors. Furthermore, due to the heterogeneity of tumor tissue, biopsy results only represent a portion of the tumor tissue. Additionally, biopsy is an invasive procedure and can lead to complications. The complication rate for biopsy is between 0.03% and 0.07%, while 0.1% of patients undergoing surgical tissue extraction experience complications.
[0004] In recent years, with the advancement of testing technology, various techniques for detecting biomarkers have emerged, especially analytical techniques for non-invasive liquid biopsy biomarkers, such as circulating tumor cells (CTCs), circulating tumor DNA (ctDNA), circulating miRNAs (microRNAs), DNA methylation levels, and exosomes in blood. However, these non-invasive liquid biopsy methods also have many problems, such as low sensitivity for ctDNA and CTCs, and low stability for microRNAs. Moreover, because these technologies are still immature, they have not yet been applied clinically.
Summary of the Invention
[0005] The purpose of this invention is to provide an early classification system for non-small cell lung cancer based on peripheral blood immune cell atlas that has high accuracy, sensitivity, and moderate cost, in order to overcome the shortcomings of existing technologies.
[0006] The present invention adopts the following technical solution:
[0007] An early classification system for non-small cell lung cancer based on peripheral blood immune cell atlas includes:
[0008] The peripheral blood single-cell acquisition unit is used to isolate mononuclear cells from the peripheral blood sample of a patient to be diagnosed, thereby obtaining single cells of the peripheral blood sample;
[0009] The mass cytometry detection unit uses pre-designed metal isotope-coupled protein-specific biomarkers to perform mass cytometry detection on the single cells of the peripheral blood sample, obtaining the raw single-cell mass cytometry data Data_raw of the peripheral blood sample;
[0010] The raw data conversion unit is used to transform the raw single-cell mass cytometry data Data_raw of the peripheral blood sample to obtain the transformed single-cell mass cytometry data Data_trans. The conversion formula is as follows:
[0011] $\mathrm{Data}_{\mathrm{trans}} = \sinh^{-1}(\mathrm{Data}_{\mathrm{raw}} / 10)$;
[0012] The normalization processing unit is used to normalize the data of each biomarker channel in Data_trans, obtaining the normalized single-cell mass cytometry data Data_normalization;
[0013] The gated filtering unit uses the flow cytometry software FlowJo to perform gated filtering, removing abnormal cell populations from Data_normalization, including adherent cells, dead cells, cell debris, and the CD66b+ cell population, to obtain Data_PBMC, which contains lymphocytes and myeloid cells;
[0014] The peripheral blood immune cell clustering acquisition unit uses the graph-based clustering algorithm PARC to cluster Data_PBMC, obtaining the immune cell population data Data_PARC-cluster of the peripheral blood sample, and then filters the residual CD66b+ cell populations out of Data_PARC-cluster to obtain the peripheral blood immune cell atlas data Data_PARC-selected;
[0015] The cell annotation unit performs cluster annotation on Data_PARC-selected, determining cell types from the expression levels of the protein-specific biomarkers of each cell type, to obtain the annotated peripheral blood immune cell atlas data Data_immune-cell;
[0016] The feature vector acquisition unit calculates, from Data_immune-cell, the feature vector Q composed of the proportion of each immune cell subset in a peripheral blood sample, where the proportion of an immune cell subset in a peripheral blood sample = the number of cells belonging to that subset in the sample / the total number of cells in the sample;
[0017] The classification acquisition unit selects, according to the cell types determined in advance, the proportions of the significantly different immune cell subsets, which form the significantly different feature vector Q_selected. Q_selected is then input into a pre-trained random forest binary classifier to compute the sample classification and obtain the prediction result of the early classification system, i.e., whether the patient has adenocarcinoma or squamous cell carcinoma.
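The proportion rule used by the feature vector acquisition unit can be illustrated with a minimal Python sketch. The subset names and cell labels below are hypothetical placeholders, not the 26 subsets of the invention, and the function name `subset_proportions` is an assumption made only for this illustration.

```python
from collections import Counter

def subset_proportions(labels, subsets):
    """Return Q: the fraction of cells in each subset, in the order of `subsets`."""
    if not labels:
        raise ValueError("a sample must contain at least one annotated cell")
    counts = Counter(labels)
    total = len(labels)
    # proportion = cells in the subset / all cells in the sample
    return [counts[name] / total for name in subsets]

# Hypothetical subset names and per-cell annotations for one sample.
subsets = ["CD4 T", "CD8 T", "B", "NK", "Monocyte"]
labels = ["CD4 T"] * 5 + ["CD8 T"] * 2 + ["B"] * 2 + ["NK"]
q = subset_proportions(labels, subsets)
print(q)
```

A subset that is absent from a sample simply receives a proportion of 0, so every sample yields a vector of the same length.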
[0018] Furthermore, the peripheral blood single-cell acquisition unit uses Ficoll separation buffer gradient centrifugation or ACK lysis buffer centrifugation to separate mononuclear cells from the peripheral blood sample of the patient to be diagnosed, thereby obtaining peripheral blood sample single cells.
[0019] Furthermore, the metal isotope-coupled protein-specific biomarkers pre-designed for the mass spectrometry flow cytometry detection unit are shown in Table 1:
[0020] Table 1
[0021]
[0022] Furthermore, the normalization processing unit uses the z-score method to normalize the data of each biomarker channel in Data_trans, obtaining the normalized single-cell mass cytometry data Data_normalization.
[0023] Furthermore, the gated filtering unit performs gated filtering using the flow cytometry data analysis software FlowJo, as follows:
[0024] a) Select the 191Ir channel and the 193Ir channel, and select the cell cluster whose values in both channels are between $10^{2}$ and $10^{3}$, removing cell debris;
[0025] b) Select the 191Ir channel and the 194Pt channel, and select the cell cluster whose 194Pt value is less than $10^{2}$, removing dead cells;
[0026] c) Select the Event_length channel and the 194Pt channel, and select the cell cluster whose Event_length value is less than 20, removing adherent cells;
[0027] d) Select the 194Pt channel and the 165Ho channel, and select the cell cluster whose 165Ho value is less than $1.5 \times 10^{1}$, removing the CD66b+ cell population.
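The four gates above are drawn interactively in FlowJo. As a rough illustration only, they can be approximated by simple threshold checks in Python. This sketch assumes rectangular gates applied to per-event channel intensities, inclusive bounds for the iridium channels, and a dictionary layout for events; none of these details are specified in the text, and the event values are made up.

```python
def passes_gates(event):
    """Return True if an event survives all four gates described above."""
    # a) 191Ir and 193Ir both between 10^2 and 10^3: excludes cell debris
    intact = 1e2 <= event["191Ir"] <= 1e3 and 1e2 <= event["193Ir"] <= 1e3
    # b) 194Pt below 10^2: excludes dead cells
    live = event["194Pt"] < 1e2
    # c) Event_length below 20: excludes adherent cells
    single = event["Event_length"] < 20
    # d) 165Ho below 1.5 x 10^1: excludes the CD66b+ population
    cd66b_negative = event["165Ho"] < 1.5e1
    return intact and live and single and cd66b_negative

# Hypothetical events: one clean cell and one failure for each gate.
events = [
    {"191Ir": 500, "193Ir": 600, "194Pt": 20, "Event_length": 15, "165Ho": 3},
    {"191Ir": 20, "193Ir": 30, "194Pt": 20, "Event_length": 15, "165Ho": 3},
    {"191Ir": 500, "193Ir": 600, "194Pt": 400, "Event_length": 15, "165Ho": 3},
    {"191Ir": 500, "193Ir": 600, "194Pt": 20, "Event_length": 35, "165Ho": 3},
    {"191Ir": 500, "193Ir": 600, "194Pt": 20, "Event_length": 15, "165Ho": 80},
]
kept = [e for e in events if passes_gates(e)]
print(len(kept), "of", len(events), "events kept")
```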
[0028] Furthermore, the peripheral blood immune cell clustering acquisition unit obtains the peripheral blood immune cell atlas data Data_PARC-selected through the following steps:
[0029] a) Randomly sample Data_PBMC, which contains the lymphocytes and myeloid cells of the peripheral blood sample of the patient to be diagnosed, drawing 10,000 cells from each sample to obtain the selected data Data_PBMC-selected;
[0030] b) Use Data_PBMC-selected as the input to the graph-based clustering algorithm PARC to obtain the clustering result, namely the immune cell population data Data_PARC-cluster of the peripheral blood sample;
[0031] c) Based on the immune cell population information, filter the residual CD66b+ cell populations out of Data_PARC-cluster to obtain the peripheral blood immune cell atlas data Data_PARC-selected.
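Step a) above can be sketched in a few lines of Python. The fixed seed and the choice to keep every cell when a sample holds fewer than 10,000 are assumptions of this sketch; the text does not say how small samples are handled.

```python
import random

def subsample_cells(cells, n=10_000, seed=0):
    """Draw n cells without replacement; keep all cells if the sample is smaller."""
    if len(cells) <= n:
        return list(cells)  # assumption: small samples are used in full
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(cells, n)

# Hypothetical sample of 25,000 gated cells, represented by their indices.
cells = list(range(25_000))
picked = subsample_cells(cells)
print(len(picked))
```

Sampling the same number of cells from every sample keeps any single sample from dominating the pooled clustering input.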
[0032] Furthermore, the cell annotation unit performs cluster annotation on Data_PARC-selected, determining cell types from the expression levels of the protein-specific biomarkers of each cell type, to obtain the annotated peripheral blood immune cell atlas data Data_immune-cell, which contains 26 immune cell subsets: 16 T cell subsets, 4 B cell subsets, 1 NK cell subset, and 5 myeloid cell subsets.
[0033] Furthermore, the cell types determined in advance for the classification acquisition unit are obtained through the following steps:
[0034] a) Randomly sample the data containing lymphocytes and myeloid cells from the peripheral blood samples of 121 patients with early-stage lung adenocarcinoma and 12 patients with early-stage lung squamous cell carcinoma. On average 10,000 cells are drawn from each sample, 1,330,000 cells in total, giving the selected data containing lymphocytes and myeloid cells;
[0035] b) Use the selected data as the input to the graph-based clustering algorithm PARC to obtain the immune cell population data of the peripheral blood samples;
[0036] c) Based on the immune cell population information, filter out the residual CD66b+ cell populations to obtain the peripheral blood immune cell atlas;
[0037] d) Perform cluster annotation on the peripheral blood immune cell atlas, determining cell types from the expression levels of the protein-specific biomarkers of each cell type, to obtain the annotated peripheral blood immune cell atlas;
[0038] e) From the annotated peripheral blood immune cell atlas, calculate for each peripheral blood sample the feature vector Q composed of the proportion of each immune cell subset, where the proportion of an immune cell subset in a peripheral blood sample = the number of cells belonging to that subset in the sample / the total number of cells in the sample;
[0039] f) Apply a two-sample t-test to each immune cell subset proportion in the feature vector Q, with a p-value threshold of 0.05. Proportions with a p-value less than 0.05 are taken as the significantly different features, and the immune cell subsets corresponding to them are the cell types determined in advance.
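The selection in step f) can be sketched with SciPy's `ttest_ind`. This is only an illustration under stated assumptions: the test is taken to compare the adenocarcinoma group with the squamous cell carcinoma group, the default equal-variance form of the test is used because the text does not name a variant, and the subset names and proportions are made up.

```python
from scipy.stats import ttest_ind

def select_subsets(q_group_a, q_group_b, subsets, alpha=0.05):
    """Keep the subsets whose proportions differ between groups at p < alpha."""
    selected = []
    for k, name in enumerate(subsets):
        a = [q[k] for q in q_group_a]  # proportion of subset k in each group A sample
        b = [q[k] for q in q_group_b]  # proportion of subset k in each group B sample
        p_value = ttest_ind(a, b).pvalue  # two-sample t-test, equal variances assumed
        if p_value < alpha:
            selected.append(name)
    return selected

# Hypothetical feature vectors Q (one row per sample, one column per subset).
subsets = ["subset_A", "subset_B"]
q_adeno = [[0.30, 0.10], [0.32, 0.12], [0.31, 0.09], [0.29, 0.11]]
q_squamous = [[0.10, 0.10], [0.12, 0.11], [0.11, 0.12], [0.09, 0.09]]
selected = select_subsets(q_adeno, q_squamous, subsets)
print(selected)
```

Only the columns of Q that belong to the selected subsets are carried forward as Q_selected.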
[0040] Furthermore, the steps for obtaining the pre-trained random forest binary classifier are as follows:
[0041] Peripheral blood samples from 121 patients with early-stage lung adenocarcinoma and 12 patients with early-stage lung squamous cell carcinoma are used to obtain the significantly different feature vectors. Because the squamous cell carcinoma group is much smaller, the SMOTE algorithm is used to oversample the feature vectors of the early-stage lung squamous cell carcinoma patients, yielding the oversampled significantly different feature vectors, which are then used to train the random forest binary classifier. The random forest has 500 trees, and each tree selects 3 features.
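The oversampling idea behind SMOTE can be shown with a small standard-library sketch: each synthetic sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration, not the implementation used in the invention. The neighbour count `k=5`, the target of 109 synthetic samples (bringing 12 up to 121), and the random toy vectors are all assumptions.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic

# Hypothetical minority class: 12 samples with 3 selected features each.
toy_rng = random.Random(1)
minority = [[toy_rng.random() for _ in range(3)] for _ in range(12)]
synthetic = smote(minority, n_new=121 - 12)
print(len(minority) + len(synthetic))
```

In practice a maintained implementation such as the one in the imbalanced-learn package would normally be preferred over a hand-written version.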
[0042] The beneficial effects of this invention are:
[0043] The system of this invention analyzes peripheral blood samples from patients, which is easier to operate than existing clinical methods, is non-invasive, and allows continuous monitoring of patients. The technology used is CyTOF mass cytometry, which has a moderate cost. The graph-based clustering algorithm PARC is fast and suitable for clustering large samples. The random forest classifier is fast to train and suitable for large amounts of data. The system also has high accuracy and sensitivity in the early classification of non-small cell lung cancer, which will provide richer indicators for early diagnosis, help clinicians make more accurate treatment decisions for patients, and offers strong prospects for clinical application.
Description of the Drawings
[0044] Figure 1 is a schematic diagram of sample collection and mass cytometry detection in the present invention: 1. patient peripheral blood sample; 2. Ficoll separation solution; 3. mixture of ACK lysis solution and peripheral blood sample after thorough mixing; 4. peripheral blood single cells; 5. metal isotope-coupled protein-specific biomarkers (Table 1); 6. CyTOF instrument; 7. CyTOF raw data.
[0045] Figure 2 is a heatmap of the immune cell subset classification and protein-specific biomarker expression in the present invention.
[0046] Figure 3 is a schematic diagram of the binary classification confusion matrix of the present invention: FOR, false omission rate; FDR, false discovery rate; TNR, true negative rate (i.e., specificity); TPR, true positive rate (i.e., sensitivity); ACC, accuracy.
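For reference, the five quantities named for the confusion matrix follow the standard definitions and can be computed as in the sketch below. The counts are invented for illustration and are not results of the invention. Which subtype is treated as the positive class is not stated in the text, so that choice is left open here.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Standard rates derived from the four cells of a binary confusion matrix."""
    return {
        "TPR": tp / (tp + fn),  # sensitivity
        "TNR": tn / (tn + fp),  # specificity
        "FDR": fp / (fp + tp),  # false discovery rate
        "FOR": fn / (fn + tn),  # false omission rate
        "ACC": (tp + tn) / (tp + fp + tn + fn),  # accuracy
    }

# Hypothetical counts, for illustration only.
metrics = confusion_metrics(tp=8, fp=2, tn=18, fn=2)
print(metrics)
```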
[0047] Figure 4 is the receiver operating characteristic (ROC) curve of the binary classification of samples in the present invention: AUC, area under the curve; Specificity; Sensitivity.
Detailed Implementation
[0048] The present invention will be further described below with reference to the accompanying drawings and embodiments. These embodiments are only used to explain the present invention and do not constitute a limitation on the scope of protection of the present invention.
[0049] An early classification system for non-small cell lung cancer based on peripheral blood immune cell atlas includes:
[0050] The peripheral blood single-cell acquisition unit uses Ficoll separation buffer gradient centrifugation or ACK lysis buffer centrifugation to separate mononuclear cells from peripheral blood samples of patients to be diagnosed, thus obtaining single cells from peripheral blood samples.
[0051] The mass cytometry detection unit uses pre-designed metal isotope-coupled protein-specific biomarkers to perform mass cytometry detection on the single cells of the peripheral blood sample, obtaining the raw single-cell mass cytometry data Data_raw of the peripheral blood sample. The pre-designed metal isotope-coupled protein-specific biomarkers are shown in Table 1:
[0052] Table 1
[0053]
[0054] The raw data conversion unit is used to transform the raw single-cell mass cytometry data Data_raw of the peripheral blood sample to obtain the transformed single-cell mass cytometry data Data_trans. The conversion formula is as follows:
[0055] $\mathrm{Data}_{\mathrm{trans}} = \sinh^{-1}(\mathrm{Data}_{\mathrm{raw}} / 10)$;
[0056] The normalization unit uses the z-score method to normalize the data of each biomarker channel in Data_trans, obtaining the normalized single-cell mass cytometry data Data_normalization;
[0057] The gated filtering unit uses the flow cytometry software FlowJo to perform gated filtering, removing abnormal cell populations from Data_normalization, including adherent cells, dead cells, cell debris, and the CD66b+ cell population, to obtain Data_PBMC, which contains lymphocytes and myeloid cells. The steps of gated filtering are as follows:
[0058] a) Select the 191Ir channel and the 193Ir channel, and select the cell cluster whose values in both channels are between $10^{2}$ and $10^{3}$, removing cell debris;
[0059] b) Select the 191Ir channel and the 194Pt channel, and select the cell cluster whose 194Pt value is less than $10^{2}$, removing dead cells;
[0060] c) Select the Event_length channel and the 194Pt channel, and select the cell cluster whose Event_length value is less than 20, removing adherent cells;
[0061] d) Select the 194Pt channel and the 165Ho channel, and select the cell cluster whose 165Ho value is less than $1.5 \times 10^{1}$, removing the CD66b+ cell population;
[0062] The peripheral blood immune cell clustering acquisition unit uses the graph-based clustering algorithm PARC to cluster Data_PBMC, obtaining the immune cell population data Data_PARC-cluster of the peripheral blood sample, and then filters the residual CD66b+ cell populations out of Data_PARC-cluster to obtain the peripheral blood immune cell atlas data Data_PARC-selected;
[0063] The peripheral blood immune cell clustering acquisition unit obtains the peripheral blood immune cell atlas data Data_PARC-selected through the following steps:
[0064] a) Randomly sample Data_PBMC, which contains the lymphocytes and myeloid cells of the peripheral blood sample of the patient to be diagnosed, drawing 10,000 cells from each sample to obtain the selected data Data_PBMC-selected;
[0065] b) Use Data_PBMC-selected as the input to the graph-based clustering algorithm PARC to obtain the clustering result, namely the immune cell population data Data_PARC-cluster of the peripheral blood sample;
[0066] c) Based on the immune cell population information, filter the residual CD66b+ cell populations out of Data_PARC-cluster to obtain the peripheral blood immune cell atlas data Data_PARC-selected;
[0067] The cell annotation unit performs cluster annotation on Data_PARC-selected, determining cell types from the expression levels of the protein-specific biomarkers of each cell type, to obtain the annotated peripheral blood immune cell atlas data Data_immune-cell. Specifically, cluster annotation of Data_PARC-selected yields annotated atlas data Data_immune-cell containing 26 immune cell subsets: 16 T cell subsets, 4 B cell subsets, 1 NK cell subset, and 5 myeloid cell subsets;
[0068] The feature vector acquisition unit calculates, from Data_immune-cell, the feature vector Q composed of the proportion of each immune cell subset in a peripheral blood sample, where the proportion of an immune cell subset in a peripheral blood sample = the number of cells belonging to that subset in the sample / the total number of cells in the sample;
[0069] The classification acquisition unit selects, according to the cell types determined in advance, the proportions of the significantly different immune cell subsets, which form the significantly different feature vector Q_selected. Q_selected is then input into a pre-trained random forest binary classifier to compute the sample classification and obtain the prediction result of the early classification system, i.e., whether the patient has adenocarcinoma or squamous cell carcinoma.
[0070] The steps for obtaining the cell types required in advance for the classification acquisition unit are as follows:
[0071] a) Randomly sample the data containing lymphocytes and myeloid cells from the peripheral blood samples of 121 patients with early-stage lung adenocarcinoma and 12 patients with early-stage lung squamous cell carcinoma. On average 10,000 cells are drawn from each sample, 1,330,000 cells in total, giving the selected data containing lymphocytes and myeloid cells;
[0072] b) Use the selected data as the input to the graph-based clustering algorithm PARC to obtain the immune cell population data of the peripheral blood samples;
[0073] c) Based on the immune cell population information, filter out the residual CD66b+ cell populations to obtain the peripheral blood immune cell atlas;
[0074] d) Perform cluster annotation on the peripheral blood immune cell atlas, determining cell types from the expression levels of the protein-specific biomarkers of each cell type, to obtain the annotated peripheral blood immune cell atlas;
[0075] e) From the annotated peripheral blood immune cell atlas, calculate for each peripheral blood sample the feature vector Q composed of the proportion of each immune cell subset, where the proportion of an immune cell subset in a peripheral blood sample = the number of cells belonging to that subset in the sample / the total number of cells in the sample;
[0076] f) Apply a two-sample t-test to each immune cell subset proportion in the feature vector Q, with a p-value threshold of 0.05. Proportions with a p-value less than 0.05 are taken as the significantly different features, and the immune cell subsets corresponding to them are the cell types determined in advance.
[0077] The steps to obtain a pre-trained random forest binary classifier are as follows:
[0078] Peripheral blood samples from 121 patients with early-stage lung adenocarcinoma and 12 patients with early-stage lung squamous cell carcinoma are used to obtain the significantly different feature vectors. Because the squamous cell carcinoma group is much smaller, the SMOTE algorithm is used to oversample the feature vectors of the early-stage lung squamous cell carcinoma patients, yielding the oversampled significantly different feature vectors, which are then used to train the random forest binary classifier. The random forest has 500 trees, and each tree selects 3 features.
[0079] The early classification of non-small cell lung cancer using the above system includes the following steps:
[0080] 1) Mononuclear cells were isolated from peripheral blood samples of patients to be diagnosed using Ficoll separation buffer gradient centrifugation or ACK lysis buffer centrifugation to obtain single cells from peripheral blood samples;
[0081] 2) Using the pre-designed metal isotope-coupled protein-specific biomarkers (Table 1), perform mass cytometry detection on the single cells of the peripheral blood sample to obtain the raw single-cell mass cytometry data Data_raw;
[0082] 3) Transform Data_raw to obtain the transformed single-cell mass cytometry data Data_trans. The conversion formula is: $\mathrm{Data}_{\mathrm{trans}} = \sinh^{-1}(\mathrm{Data}_{\mathrm{raw}} / 10)$;
[0083] 4) Use the z-score method to normalize the data of each biomarker channel in Data_trans, obtaining the normalized single-cell mass cytometry data Data_normalization;
[0084] 5) Use the flow cytometry software FlowJo to perform gated filtering on Data_normalization, removing abnormal cell populations, including adherent cells, dead cells, cell debris, and the CD66b+ cell population, to obtain Data_PBMC, which contains lymphocytes and myeloid cells. The specific steps are as follows:
[0085] a) Select the 191Ir channel and the 193Ir channel, and select the cell cluster whose values in both channels are between $10^{2}$ and $10^{3}$, removing cell debris;
[0086] b) Select the 191Ir channel and the 194Pt channel, and select the cell cluster whose 194Pt value is less than $10^{2}$, removing dead cells;
[0087] c) Select the Event_length channel and the 194Pt channel, and select the cell cluster whose Event_length value is less than 20, removing adherent cells;
[0088] d) Select the 194Pt channel and the 165Ho channel, and select the cell cluster whose 165Ho value is less than $1.5 \times 10^{1}$, removing the CD66b+ cell population;
[0089] 6) Use the graph-based clustering algorithm PARC to cluster Data_PBMC, obtaining the immune cell population data Data_PARC-cluster of the peripheral blood sample, then filter out the residual CD66b+ cell populations to obtain the peripheral blood immune cell atlas data Data_PARC-selected. The specific steps are as follows:
[0090] a) Randomly sample Data_PBMC, which contains the lymphocytes and myeloid cells of the peripheral blood sample of the patient to be diagnosed, drawing 10,000 cells from each sample to obtain the selected data Data_PBMC-selected;
[0091] b) Use Data_PBMC-selected as the input to the graph-based clustering algorithm PARC to obtain the clustering result, namely the immune cell population data Data_PARC-cluster of the peripheral blood sample;
[0092] c) Based on the immune cell population information, filter the residual CD66b+ cell populations out of Data_PARC-cluster to obtain the peripheral blood immune cell atlas data Data_PARC-selected;
[0093] 7) Perform cluster annotation on the peripheral blood immune cell atlas data Data_PARC-selected, determining cell types from the expression levels of the protein-specific biomarkers of each cell type, to obtain the annotated peripheral blood immune cell atlas data Data_immune-cell, which contains 26 immune cell subsets: 16 T cell subsets, 4 B cell subsets, 1 NK cell subset, and 5 myeloid cell subsets;
[0094] 8) From Data_immune-cell, calculate the feature vector Q composed of the proportion of each immune cell subset in a peripheral blood sample, where the proportion of an immune cell subset in a peripheral blood sample = the number of cells belonging to that subset in the sample / the total number of cells in the sample;
[0095] 9) According to the cell types determined in advance, select the proportions of the significantly different immune cell subsets, which form the significantly different feature vector Q_selected. Q_selected is then input into a pre-trained random forest binary classifier to compute the sample classification and obtain the prediction result of the early classification system, i.e., whether the patient has adenocarcinoma or squamous cell carcinoma.
[0096] The steps for obtaining the required cell types are as follows:
[0097] a) Randomly sample the data containing lymphocytes and myeloid cells from the peripheral blood samples of 121 patients with early-stage lung adenocarcinoma and 12 patients with early-stage lung squamous cell carcinoma. On average 10,000 cells are drawn from each sample, 1,330,000 cells in total, giving the selected data containing lymphocytes and myeloid cells;
[0098] b) Use the selected data as the input to the graph-based clustering algorithm PARC to obtain the immune cell population data of the peripheral blood samples;
[0099] c) Based on the immune cell population information, filter out the residual CD66b+ cell populations to obtain the peripheral blood immune cell atlas;
[0100] d) Perform cluster annotation on the peripheral blood immune cell atlas, determining cell types from the expression levels of the protein-specific biomarkers of each cell type, to obtain the annotated peripheral blood immune cell atlas;
[0101] e) From the annotated peripheral blood immune cell atlas, calculate for each peripheral blood sample the feature vector Q composed of the proportion of each immune cell subset, where the proportion of an immune cell subset in a peripheral blood sample = the number of cells belonging to that subset in the sample / the total number of cells in the sample;
[0102] f) Apply a two-sample t-test to each immune cell subset proportion in the feature vector Q, with a p-value threshold of 0.05. Proportions with a p-value less than 0.05 are taken as the significantly different features, and the immune cell subsets corresponding to them are the cell types determined in advance.
[0103] The steps to obtain a pre-trained random forest binary classifier are as follows:
[0104] Peripheral blood samples from 121 patients with early-stage lung adenocarcinoma and 12 patients with early-stage lung squamous cell carcinoma are used to obtain the significantly different feature vectors. Because the squamous cell carcinoma group is much smaller, the SMOTE algorithm is used to oversample the feature vectors of the early-stage lung squamous cell carcinoma patients, yielding the oversampled significantly different feature vectors, which are then used to train the random forest binary classifier. The random forest has 500 trees, and each tree selects 3 features.
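A classifier with the stated size can be configured in scikit-learn roughly as follows. This is a sketch under assumptions, not the trained model of the invention: the toy data are synthetic, the label encoding (0 for adenocarcinoma, 1 for squamous cell carcinoma) and the number of input features are invented, and scikit-learn's `max_features=3` limits the features considered at each split, which is only an approximation of "3 features per tree" as worded above.

```python
import random
from sklearn.ensemble import RandomForestClassifier

rng = random.Random(0)

# Hypothetical training data: 5 selected proportions per sample, two separated groups.
X_train = (
    [[rng.gauss(0.2, 0.02) for _ in range(5)] for _ in range(20)]
    + [[rng.gauss(0.6, 0.02) for _ in range(5)] for _ in range(20)]
)
y_train = [0] * 20 + [1] * 20  # assumed encoding: 0 = adenocarcinoma, 1 = squamous

clf = RandomForestClassifier(
    n_estimators=500,  # 500 trees, as stated in the text
    max_features=3,    # 3 candidate features per split (scikit-learn semantics)
    random_state=0,    # fixed seed for reproducibility
)
clf.fit(X_train, y_train)

# Classify two new hypothetical feature vectors Q_selected.
predictions = clf.predict([[0.2] * 5, [0.6] * 5])
print(list(predictions))
```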
[0105] Example 1
[0106] 1) Peripheral blood samples from 121 patients with early-stage lung adenocarcinoma and 12 patients with early-stage lung squamous cell carcinoma were processed using the ACK lysis buffer centrifugation method to obtain peripheral blood cell pellets, i.e., mononuclear cells.
[0107] 2) Using the pre-designed metal isotope-coupled protein-specific biomarkers (Table 1), mass cytometry (CyTOF) detection was performed on the peripheral blood cell pellets to obtain the raw single-cell mass cytometry data Data_raw of the peripheral blood samples:
[0108] $$\mathrm{Data}_{\mathrm{raw}} = \begin{pmatrix} \alpha_{11} & \cdots & \alpha_{1n} \\ \vdots & \ddots & \vdots \\ \alpha_{m1} & \cdots & \alpha_{mn} \end{pmatrix}$$
[0109] where m is the number of metal isotope-coupled protein-specific biomarkers (Table 1), n is the number of cells in a single sample, and $\alpha_{ij}$ is the expression level of the i-th metal isotope-coupled protein-specific biomarker in the j-th cell of a single sample.
[0110] The CyTOF detection procedure is shown in Figure 1. The specific steps are as follows:
[0111] a) Count the cells in the peripheral blood cell pellet obtained in step 1) and take 1–3 × $10^{6}$ cells;
[0112] b) Using phosphate-buffered saline (PBS, pH 7.4), prepare a 0.25 μM 194Pt live/dead staining solution from 194Pt (1 mM). Resuspend the cells in 50 μL to 1.5 mL of the 194Pt live/dead staining solution, preferably 100 μL, and stain on ice for 5 min. This stain is used to distinguish live cells from dead cells in subsequent data analysis;
[0113] c) Add 100 μL to 1.5 mL of FACS Buffer to each sample, preferably 500 μL, resuspend the cells, centrifuge at 400 g / 5 min at 2–8 °C, and discard the supernatant;
[0114] d) Add 20 μL to 1.5 mL of blocking buffer to each sample, preferably 50 μL, resuspend the cells, and block on ice for 20 min;
[0115] e) Add the metal isotope-coupled protein specific biomarker (Table 1) mixture (0.5-4 μL of each metal antibody), stain on ice for 20-60 min, and perform extracellular staining on the sample;
[0116] f) Add 100 μL to 1.5 mL of FACS Buffer (preferably 1 mL) to each sample, resuspend the cells, centrifuge at 400 g / 5 min at 2–8 °C, discard the supernatant, and repeat 2–3 times.
[0117] g) Prepare a staining solution with a final Ir concentration of 100–500 nM using Fix and Perm Buffer. Resuspend the cells of each sample in 200 μL to 1.5 mL of this solution and incubate at room temperature for 1 hour for DNA staining and fixation;
[0118] h) Add 100 μL to 1.5 mL of FACS Buffer (preferably 1 mL) to each sample, resuspend the cells, centrifuge at 800 g / 5 min at 2–8 °C, and discard the supernatant;
[0119] i) Add 0.5-1 mL of ddH2O to each sample to resuspend the cells and transfer them to a 5 mL flow cytometer tube (12*75 mm) with a filter, and filter 1-2 times;
[0120] j) The filtered cell suspension was analyzed using mass flow cytometry.
[0121] Table 1
[0122]
[0123] 3) Apply a data transformation to Data_raw:
[0124] $$\mathrm{Data}_{trans} = \sinh^{-1}\left(\mathrm{Data}_{raw} / 10\right)$$
[0125] where Data_trans is the single-cell mass cytometry data after the transformation.
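As a minimal sketch of this transformation, the inverse hyperbolic sine can be applied element-wise with the divisor 10 taken from the formula. The function name `arcsinh_transform` and the toy intensity values are assumptions for illustration only.

```python
import math

COFACTOR = 10  # divisor stated in the transformation formula


def arcsinh_transform(data_raw, cofactor=COFACTOR):
    """Apply sinh^-1(x / cofactor) to every entry of a biomarker-by-cell matrix."""
    return [[math.asinh(value / cofactor) for value in channel] for channel in data_raw]


# Toy matrix: 2 biomarker channels (rows) by 3 cells (columns), made-up values.
data_raw = [[0.0, 10.0, 250.0], [5.0, 1000.0, 40.0]]
data_trans = arcsinh_transform(data_raw)
```

The transform is close to linear near zero and close to logarithmic for large intensities, which compresses the wide dynamic range of the raw signal.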
[0126] 4) Each biomarker channel of Data_trans was normalized using the z-score method to obtain Data_normalization:
[0127] $$\mathrm{Data}_{normalization} = (\beta_{ij})_{m \times n}$$
[0128] $$\beta_{ij} = \frac{\alpha_{ij} - \mu_i}{\sigma_i}$$
[0129] where Data_normalization is the single-cell mass cytometry data after z-score normalization; m is the number of metal isotope-coupled protein-specific biomarkers (Table 1); n is the number of cells in a single sample; α_ij is the expression level of the i-th metal isotope-coupled protein-specific biomarker in the j-th cell of a single sample; μ_i is the mean expression of the i-th metal isotope-coupled protein-specific biomarker in a single sample; σ_i is the standard deviation of the expression of the i-th metal isotope-coupled protein-specific biomarker in a single sample; and β_ij is the normalized expression level of the i-th metal isotope-coupled protein-specific biomarker in the j-th cell of a single sample.
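The per-channel z-score normalization can be sketched as follows. The text does not say whether the population or the sample standard deviation is used, so the population form (`statistics.pstdev`) is an assumption here, as are the function name `zscore_channels` and the toy values.

```python
import statistics


def zscore_channels(data_trans):
    """Z-score each biomarker channel (row) across the cells of one sample."""
    normalized = []
    for channel in data_trans:
        mean = statistics.fmean(channel)
        std = statistics.pstdev(channel)  # assumption: population standard deviation
        if std == 0:
            # A constant channel carries no contrast; map it to zeros.
            normalized.append([0.0 for _ in channel])
        else:
            normalized.append([(value - mean) / std for value in channel])
    return normalized


# Toy transformed data: 2 biomarker channels by 4 cells, made-up values.
data_trans = [[0.0, 1.0, 2.0, 3.0], [2.0, 2.0, 4.0, 4.0]]
data_normalization = zscore_channels(data_trans)
```

After this step each channel has mean 0 and unit standard deviation within the sample, so biomarkers with different signal ranges contribute on a comparable scale.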
[0130] 5) Use the flow cytometry data analysis software FlowJo to perform gated filtering on Data_normalization, removing abnormal cell populations, including adherent cells, dead cells, cell debris, and the CD66b+ cell population, to obtain Data_PBMC, which contains lymphocytes and myeloid cells.
[0131] The steps of gated filtering are as follows:
[0132] a) Select the 191Ir channel and the 193Ir channel, and select the cell population whose values in both channels lie between 10² and 10³, removing cell debris;
[0133] b) Select the 191Ir channel and the 194Pt channel, and select the cell population whose 194Pt channel value is less than 10², removing dead cells;
[0134] c) Select the Event_length channel and the 194Pt channel, and select the cell population whose Event_length value is less than 20, removing adherent cells;
[0135] d) Select the 194Pt channel and the 165Ho channel, and select the cell population whose 165Ho channel value is less than 1.5 × 10¹, removing the CD66b+ cell population.
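The four gates are drawn interactively in FlowJo, but their combined effect can be sketched programmatically as a set of threshold checks. This is only an analogue of the manual gating: the dictionary layout, the function name `gate_cells`, the inclusive bounds on the Ir channels, and the toy events are assumptions.

```python
def gate_cells(cells):
    """Keep events that pass all four gates described in steps a) to d)."""
    kept = []
    for cell in cells:
        # a) both DNA channels between 10^2 and 10^3 (removes debris)
        intact = 1e2 <= cell["191Ir"] <= 1e3 and 1e2 <= cell["193Ir"] <= 1e3
        # b) 194Pt below 10^2 (removes dead cells)
        live = cell["194Pt"] < 1e2
        # c) Event_length below 20 (removes adherent cells)
        single = cell["Event_length"] < 20
        # d) 165Ho below 1.5 x 10^1 (removes the CD66b+ population)
        cd66b_negative = cell["165Ho"] < 1.5e1
        if intact and live and single and cd66b_negative:
            kept.append(cell)
    return kept


# Toy events (made-up values): only the first one passes every gate.
events = [
    {"191Ir": 300, "193Ir": 400, "194Pt": 10, "Event_length": 15, "165Ho": 2},
    {"191Ir": 20, "193Ir": 30, "194Pt": 10, "Event_length": 15, "165Ho": 2},    # debris
    {"191Ir": 300, "193Ir": 400, "194Pt": 500, "Event_length": 15, "165Ho": 2},  # dead
    {"191Ir": 300, "193Ir": 400, "194Pt": 10, "Event_length": 35, "165Ho": 2},   # adherent
    {"191Ir": 300, "193Ir": 400, "194Pt": 10, "Event_length": 15, "165Ho": 80},  # CD66b+
]
data_pbmc = gate_cells(events)
```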
[0136] 6) Use the graph-based clustering algorithm PARC to cluster Data_PBMC, obtaining the immune cell cluster data of the peripheral blood samples, Data_PARC-cluster; then filter out the residual CD66b+ cell populations to obtain the peripheral blood immune profile data, Data_PARC-selected. The specific steps are as follows:
[0137] a) Randomly sample the Data_PBMC of the peripheral blood samples, which contains lymphocytes and myeloid cells, selecting an average of 10,000 cells per sample for a total of 1,330,000 cells, to obtain the selected data containing lymphocytes and myeloid cells, Data_PBMC-selected;
[0138] b) Use Data_PBMC-selected as the input to the graph-based clustering algorithm PARC to obtain the clustering result, namely the immune cell cluster data of the peripheral blood samples, Data_PARC-cluster;
[0139] c) Based on the immune cell clustering information, filter out the residual CD66b+ cell populations to obtain the peripheral blood immune profile data, Data_PARC-selected.
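The random sampling in step a) can be sketched as below; the clustering call itself is left out because it depends on the PARC package. The function name `subsample_cells`, the handling of samples with fewer cells than the target (they contribute all of their cells), and the toy data are assumptions.

```python
import random


def subsample_cells(cells_by_sample, n_per_sample=10_000, seed=0):
    """Randomly draw up to n_per_sample cells, without replacement, from each sample."""
    rng = random.Random(seed)
    selected = {}
    for sample_id, cells in cells_by_sample.items():
        n_draw = min(n_per_sample, len(cells))  # assumption: small samples are kept whole
        selected[sample_id] = rng.sample(cells, n_draw)
    return selected


# Toy input: cell identifiers stand in for rows of Data_PBMC (made-up sizes).
cells_by_sample = {
    "sample_A": list(range(50)),
    "sample_B": list(range(8)),
}
data_pbmc_selected = subsample_cells(cells_by_sample, n_per_sample=10)

# The pooled cells from all samples would then be passed to PARC for clustering.
pooled = [cell for cells in data_pbmc_selected.values() for cell in cells]
```

Sampling a similar number of cells from every sample keeps samples with many acquired events from dominating the clustering.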
[0140] 7) Perform cluster annotation on the peripheral blood immune profile data Data_PARC-selected, determining the cell types from the expression levels of the protein-specific biomarkers of each cell type, to obtain the peripheral blood immune cell atlas Data_immune-cell, which contains 26 immune cell subsets: 16 T cell subsets, 4 B cell subsets, 1 NK cell subset, and 5 myeloid cell subsets (as shown in Figure 2):
[0141] $$\mathrm{Data}_{immune\text{-}cell} = \{(\beta_{1j}, \beta_{2j}, \ldots, \beta_{mj}, c_j) : j = 1, 2, 3, \ldots, n\}$$
[0142] where β_ij is the normalized expression level of the i-th metal isotope-coupled protein-specific biomarker in the j-th cell of a single sample, and c_j (j = 1, 2, 3, ..., n) is the cell subset to which the j-th cell of a single sample belongs.
[0143] 8) For each sample, based on its Data_immune-cell, calculate the feature vector Q = (q_1, q_2, q_3, …, q_k, …, q_26) composed of the proportion q_k (k = 1, 2, 3, …, 26) of each immune cell subset, where q_k = number of cells in the sample belonging to immune cell subset k / total number of cells in the sample.
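A minimal sketch of this proportion calculation, assuming each cell of a sample carries an integer subset label from 1 to 26; the function name `subset_proportions` and the toy labels are illustrative only.

```python
from collections import Counter

N_SUBSETS = 26  # number of annotated immune cell subsets


def subset_proportions(labels, n_subsets=N_SUBSETS):
    """Return Q = (q_1, ..., q_26): the share of the sample's cells in each subset."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(k, 0) / total for k in range(1, n_subsets + 1)]


# Toy sample of 8 cells with made-up subset labels.
labels = [1, 1, 2, 26, 2, 1, 3, 1]
Q = subset_proportions(labels)
```

Subsets that are absent from a sample receive a proportion of 0, so every sample yields a vector of the same length and the entries sum to 1.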
[0144] 9) A two-sample t-test with a p-value threshold of 0.05 was applied to the feature vectors Q composed of the immune cell subset proportions, and the features with p-values less than 0.05 were selected as significantly different. The immune cell subsets corresponding to these significantly different features are the cell types required in advance. The proportions of these immune cell subsets constitute the significantly different feature vector Q_selected. Q_selected is then input into the pre-trained random forest binary classifier to compute the class of the sample, giving the prediction result of the early non-small cell lung cancer classification system: class 1 means that the sample is predicted to come from a lung adenocarcinoma patient, and class 0 means that the sample is predicted to come from a lung squamous cell carcinoma patient.
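The feature selection in this step can be sketched with a hand-written two-sample t-test. The text does not say whether equal variances are assumed, so the pooled-variance (Student) form used here is an assumption; the p-value is obtained by numerically integrating the t density, and all function names and toy proportions are illustrative. The sketch assumes each feature has non-zero variance in at least one group.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df):
    """Two-sided p-value via Simpson's rule on the interval [0, |t|]."""
    b = abs(t)
    if b == 0:
        return 1.0
    steps = 2 * max(1000, int(100 * b))  # even count, step size of at most 0.005
    h = b / steps
    s = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        s += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * s * h / 3  # probability that |T| <= |t|
    return max(0.0, 1.0 - central_mass)


def pooled_t_test(a, b):
    """Student's two-sample t-test with pooled variance; returns (t, p)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)) / df
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, two_sided_p(t, df)


def select_features(group_a, group_b, alpha=0.05):
    """Indices of the features whose p-value is below alpha."""
    selected = []
    for k in range(len(group_a[0])):
        _, p = pooled_t_test([q[k] for q in group_a], [q[k] for q in group_b])
        if p < alpha:
            selected.append(k)
    return selected


# Toy proportions for two subsets only (made-up values, 5 samples per group).
adeno = [[0.10, 0.50], [0.12, 0.40], [0.11, 0.60], [0.13, 0.50], [0.09, 0.45]]
squamous = [[0.30, 0.50], [0.31, 0.45], [0.29, 0.55], [0.32, 0.40], [0.28, 0.60]]
selected = select_features(adeno, squamous)
```

In this toy example only the first subset differs clearly between the groups, so only its index is retained for Q_selected.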
[0145] Here, K-fold cross-validation is used to test the classifier's performance, with the number of folds set to 6. The steps to obtain the pre-trained random forest binary classifier are as follows:
[0146] The peripheral blood samples from the 121 patients with early-stage lung adenocarcinoma and the 12 patients with early-stage lung squamous cell carcinoma were randomly divided into six groups of roughly equal size. Five of these groups were selected, and the SMOTE algorithm was used to oversample the early-stage lung squamous cell carcinoma samples among them, giving the oversampled significantly different feature vectors Q_selected-smote, which were used to train a random forest binary classifier. These steps were repeated a further 5 times, with a different combination of five groups selected each time, giving a total of 6 random forest binary classifiers. When making predictions, the input to each binary classifier is the group of samples that was not used in its training. The random forest had 500 trees, and each tree selected 3 features.
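The six-fold scheme can be sketched as an index split in which each group serves once as the held-out test set. The function name `six_fold_indices`, the seed, and the round-robin assignment after shuffling are assumptions; oversampling and classifier training are only indicated in comments.

```python
import random


def six_fold_indices(n_samples, n_folds=6, seed=0):
    """Shuffle sample indices and deal them into n_folds groups of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[f::n_folds] for f in range(n_folds)]


folds = six_fold_indices(133)  # 121 adenocarcinoma + 12 squamous cell carcinoma samples

splits = []
for held_out in range(len(folds)):
    test_idx = folds[held_out]
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # Oversampling and classifier training would use train_idx only;
    # the resulting classifier is then evaluated on test_idx.
    splits.append((train_idx, test_idx))
```

Applying the oversampling inside each training split, rather than before splitting, keeps synthetic points derived from a test sample out of that sample's training data.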
[0147] 10) The remaining samples that were not used in the training of the six random forest binary classifiers were used as the test set and substituted into the corresponding random forest binary classifiers to obtain the random forest binary classifier prediction results for 133 samples.
[0148] 11) The receiver operating characteristic (ROC) curve of the random forest binary classifier prediction results for the above 133 samples was plotted. As shown in Figure 4, the random forest binary classifier has a high AUC value (0.832), indicating that the classification method has high discriminative performance; as shown in Figure 3, the sensitivity (TPR = 76.86%) and accuracy (ACC = 75.94%) of the random forest binary classifier are both at a high level.
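For orientation, the sketch below shows how sensitivity and accuracy follow from confusion-matrix counts. The counts are not stated in the text: 93 of 121 adenocarcinoma samples and 8 of 12 squamous cell carcinoma samples correctly classified are values inferred here because they reproduce the reported percentages, with adenocarcinoma (class 1) treated as the positive class.

```python
def sensitivity(tp, fn):
    """True positive rate: correctly predicted positives over all positives."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Share of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Inferred counts (assumption, not given in the text).
tp, fn = 93, 121 - 93  # adenocarcinoma samples: correct / missed
tn, fp = 8, 12 - 8     # squamous cell carcinoma samples: correct / misclassified

tpr_percent = round(100 * sensitivity(tp, fn), 2)
acc_percent = round(100 * accuracy(tp, tn, fp, fn), 2)
```

Under these inferred counts the two percentages match the reported TPR and ACC; because the cohort contains far more adenocarcinoma samples, the accuracy is driven mainly by how well that class is recognized.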
Claims
1. An early classification system for non-small cell lung cancer based on a peripheral blood immune cell atlas, characterized in that it comprises:
a peripheral blood single-cell acquisition unit, used to isolate mononuclear cells from the peripheral blood sample of a patient to be diagnosed, thereby obtaining single cells of the peripheral blood sample;
a mass cytometry detection unit, which uses pre-designed metal isotope-coupled protein-specific biomarkers to perform mass cytometry detection on the single cells of the peripheral blood sample, obtaining the raw single-cell mass cytometry data of the peripheral blood sample, Data_raw; wherein the pre-designed metal isotope-coupled protein-specific biomarkers are shown in Table 1:
Table 1
a raw data conversion unit, used to transform the raw single-cell mass cytometry data Data_raw of the peripheral blood sample to obtain the transformed single-cell mass cytometry data Data_trans, the conversion formula being Data_trans = sinh⁻¹(Data_raw / 10);
a normalization unit, used to normalize each biomarker channel of the transformed single-cell mass cytometry data Data_trans to obtain the normalized single-cell mass cytometry data Data_normalization;
a gated filtering unit, which uses the flow cytometry data analysis software FlowJo to perform gated filtering on the normalized single-cell mass cytometry data Data_normalization, removing abnormal cell populations, including adherent cells, dead cells, cell debris, and the CD66b+ cell population, to obtain Data_PBMC, which contains lymphocytes and myeloid cells;
a peripheral blood immune cell clustering acquisition unit, which uses the graph-based clustering algorithm PARC to cluster Data_PBMC, obtaining the immune cell cluster data of the peripheral blood sample, Data_PARC-cluster, and filters the residual CD66b+ cell populations out of Data_PARC-cluster to obtain the peripheral blood immune cell atlas Data_PARC-selected;
a cell annotation unit, which performs cluster annotation on the peripheral blood immune cell atlas Data_PARC-selected, determining the cell types from the expression levels of the protein-specific biomarkers of each cell type, to obtain the annotated peripheral blood immune cell atlas Data_immune-cell;
a feature vector acquisition unit, which, based on the annotated peripheral blood immune cell atlas Data_immune-cell, calculates the feature vector Q composed of the proportion of each immune cell subset in a peripheral blood sample, where the proportion of a given immune cell subset in a given peripheral blood sample = the number of cells in the peripheral blood sample belonging to that immune cell subset / the total number of cells in the peripheral blood sample;
a classification acquisition unit, which, according to the cell types required in advance, selects the proportions of the significantly different immune cell subsets as the significantly different feature vector Q_selected, and inputs Q_selected into a pre-trained random forest binary classifier to compute the class of the sample, obtaining the prediction result of the early non-small cell lung cancer classification system, i.e., whether the patient has adenocarcinoma or squamous cell carcinoma.
2. The early classification system for non-small cell lung cancer based on peripheral blood immune cell atlas according to claim 1, characterized in that the peripheral blood single-cell acquisition unit uses Ficoll separation buffer gradient centrifugation or ACK lysis buffer centrifugation to isolate mononuclear cells from the peripheral blood sample of the patient to be diagnosed, thereby obtaining single cells of the peripheral blood sample.
3. The early classification system for non-small cell lung cancer based on peripheral blood immune cell atlas according to claim 1, characterized in that the normalization unit uses the z-score method to normalize each biomarker channel of the transformed single-cell mass cytometry data Data_trans to obtain the normalized single-cell mass cytometry data Data_normalization.
4. The early classification system for non-small cell lung cancer based on peripheral blood immune cell atlas according to claim 1, characterized in that the steps by which the gated filtering unit performs gated filtering with the flow cytometry data analysis software FlowJo are as follows: a) select the 191Ir channel and the 193Ir channel, and select the cell population whose values in both channels lie between 10² and 10³, removing cell debris; b) select the 191Ir channel and the 194Pt channel, and select the cell population whose 194Pt channel value is less than 10², removing dead cells; c) select the Event_length channel and the 194Pt channel, and select the cell population whose Event_length value is less than 20, removing adherent cells; d) select the 194Pt channel and the 165Ho channel, and select the cell population whose 165Ho channel value is less than 1.5 × 10¹, removing the CD66b+ cell population.
5. The early classification system for non-small cell lung cancer based on peripheral blood immune cell atlas according to claim 1, characterized in that the steps by which the peripheral blood immune cell clustering acquisition unit obtains the peripheral blood immune cell atlas Data_PARC-selected are as follows: a) randomly sample the Data_PBMC of the peripheral blood sample of the patient to be diagnosed, which contains lymphocytes and myeloid cells, selecting 10,000 cells from each sample, to obtain the selected data containing lymphocytes and myeloid cells, Data_PBMC-selected; b) use Data_PBMC-selected as the input to the graph-based clustering algorithm PARC to obtain the clustering result, namely the immune cell cluster data of the peripheral blood sample, Data_PARC-cluster; c) based on the immune cell clustering information, filter the residual CD66b+ cell populations out of Data_PARC-cluster to obtain the peripheral blood immune cell atlas Data_PARC-selected.
6. The early classification system for non-small cell lung cancer based on peripheral blood immune cell atlas according to claim 1, characterized in that the cell annotation unit performs cluster annotation on the peripheral blood immune cell atlas Data_PARC-selected, determining the cell types from the expression levels of the protein-specific biomarkers of each cell type, to obtain the annotated peripheral blood immune cell atlas Data_immune-cell, which comprises 26 immune cell subsets: 16 T cell subsets, 4 B cell subsets, 1 NK cell subset, and 5 myeloid cell subsets.
7. The early classification system for non-small cell lung cancer based on peripheral blood immune cell atlas according to claim 1, characterized in that the steps for obtaining the cell types required in advance by the classification acquisition unit are as follows: a) randomly sample the data containing lymphocytes and myeloid cells from the peripheral blood samples of 121 patients with early-stage lung adenocarcinoma and 12 patients with early-stage lung squamous cell carcinoma, selecting an average of 10,000 cells per sample for a total of 1,330,000 cells, to obtain the selected data containing lymphocytes and myeloid cells; b) use the selected data containing lymphocytes and myeloid cells as the input to the graph-based clustering algorithm PARC to obtain the immune cell cluster data of the peripheral blood samples; c) based on the immune cell clustering information, filter out the residual CD66b+ cell populations to obtain the peripheral blood immune cell atlas; d) perform cluster annotation on the peripheral blood immune cell atlas, determining the cell types from the expression levels of the protein-specific biomarkers of each cell type, to obtain the annotated peripheral blood immune cell atlas; e) based on the annotated peripheral blood immune cell atlas, calculate the feature vector Q composed of the proportion of each immune cell subset in each peripheral blood sample, where the proportion of a given immune cell subset in a given peripheral blood sample = the number of cells in the peripheral blood sample belonging to that immune cell subset / the total number of cells in the peripheral blood sample; f) apply a two-sample t-test with a p-value threshold of 0.05 to the feature vectors Q composed of the immune cell subset proportions, and select the features with p-values less than 0.05 as significantly different; the immune cell subsets corresponding to these significantly different features are the cell types required in advance.
8. The early classification system for non-small cell lung cancer based on peripheral blood immune cell atlas according to claim 1, characterized in that, The steps to obtain a pre-trained random forest binary classifier are as follows: Peripheral blood samples were selected from 121 patients with early-stage lung adenocarcinoma and 12 patients with early-stage lung squamous cell carcinoma. Significantly different feature vectors were obtained from these samples. The SMOTE algorithm was used to oversample the samples from the early-stage lung squamous cell carcinoma patients to obtain significantly different oversampled feature vectors. These significantly different oversampled feature vectors were then used to train a random forest binary classifier. The random forest had 500 trees, and each tree selected 3 features.