Single-cell analysis system and method based on precise typing of lung cancer immune microenvironment and treatment prediction
By constructing a lung cancer-specific cell atlas and integrating multi-dimensional data, we have achieved precise subtyping and treatment prediction of the lung cancer immune microenvironment. This solves the problems of incomplete subtyping, inaccurate results, and disconnect from clinical practice in existing technologies, and provides an automated and standardized analysis system from single-cell data to clinical reports.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- WEST CHINA HOSPITAL SICHUAN UNIV
- Filing Date
- 2026-01-16
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies cannot efficiently and accurately classify the immune microenvironment of lung cancer and predict treatment. Furthermore, existing methods suffer from low throughput, high subjectivity, poor reproducibility of results, and a disconnect from clinical efficacy.
This invention provides a single-cell analysis system based on the lung cancer immune microenvironment, comprising a data preprocessing and quality control module, a lung cancer-specific cell annotation module, an immune microenvironment typing module, a clinical efficacy prediction module, and a visualization report generation module. Coupled through standardized data interfaces, these modules realize automated and standardized analysis from single-cell omics data to clinical treatment decisions.
It enables accurate subtyping and treatment prediction for lung cancer patients, reduces unlabeled or mislabeled cases, improves the accuracy and reliability of cell type annotation, directly outputs clinical decision indicators, supports treatment strategy decisions, and reduces the dependence on professionals and analysis time.
Figure CN122290983A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of bioinformatics and precision medicine, and more specifically to a single-cell analysis system and method for precise typing and treatment prediction of the lung cancer immune microenvironment.
Background Technology
[0002] Lung cancer is the leading cause of cancer-related deaths worldwide, with non-small cell lung cancer (NSCLC) accounting for the vast majority. In recent years, tumor immunotherapy, represented by immune checkpoint inhibitors (ICIs, such as anti-PD-1 / PD-L1 antibodies), has become one of the standard treatment options for advanced NSCLC, bringing hope for long-term survival to some patients. However, immunotherapy faces a significant bottleneck in clinical practice: limited overall response rates and high heterogeneity among responders. For example, in first-line immunotherapy combined with chemotherapy for advanced NSCLC, a considerable proportion of patients (approximately 40%-60%) still fail to achieve an objective response. This indicates that the predictive ability of widely used biomarkers (such as PD-L1 protein expression) is limited. Therefore, accurately identifying the population that may benefit from immunotherapy while avoiding unnecessary toxic side effects and financial burdens on ineffective patients has become one of the most pressing clinical needs in the current field of lung cancer treatment.
[0003] The efficacy of tumor immunotherapy is determined not by tumor cells alone but also by the surrounding tumor immune microenvironment (TIME). The TIME is a complex ecosystem that includes various immune cells such as T cells, B cells, macrophages, dendritic cells, and NK cells, as well as stromal cells. The composition, functional state, and spatial location of these components collectively determine whether the immune system can effectively kill tumors.
[0004] Scientific research has confirmed that the TIME in lung cancer falls into distinct subtypes. For example, the widely accepted three-category classification divides the TIME into: Immune-inflammatory type: characterized by extensive infiltration of activated T cells into the tumor parenchyma; Immune desert type: characterized by a near-absence of T cell infiltration in the tumor tissue; and Immune rejection (immune-excluded) type: characterized by T cells being confined to the stroma at the tumor periphery and unable to enter the tumor nest.
[0005] These typing methods are of great biological significance, but their translation into clinical applications faces significant obstacles. Existing typing methods mostly rely on traditional techniques such as immunohistochemistry (IHC) or multiplex immunohistochemistry / immunofluorescence (mIHC). These assays suffer from poor stability and low throughput and are affected by spatial heterogeneity within the tissue, making it impossible to comprehensively analyze all cell types and their complex gene expression programs in an unbiased manner.
[0006] Single-cell RNA sequencing (scRNA-seq) can resolve the TIME in an unbiased, high-throughput manner, but its data analysis process is cumbersome, depends heavily on professional bioinformatics knowledge, and lacks a unified, standardized, and clinically relevant typing system, making it difficult to apply its results directly to routine clinical diagnosis and treatment decisions.
[0007] Based on the above, the state of the prior art in this field is as follows:
[0008] I. Preliminary typing based on immunohistochemistry (IHC):
[0009] This preliminary classification is the most commonly used method in current clinical practice. It is a semi-quantitative assessment in which pathologists examine, under a microscope, the expression and localization of specific proteins (such as CD3 and CD8) in tumor tissue sections. For example, by counting the density of CD8+ T cells in the tumor core and at the invasive margin, a tumor can be roughly classified as a "hot tumor" (highly infiltrated) or a "cold tumor" (poorly infiltrated). Its disadvantages are: (1) Low throughput and limited field of view: only a few markers can be detected at a time, which cannot fully reflect the complexity of the TIME, and only a local field of view can be observed rather than a global view. (2) High subjectivity: the results depend heavily on the experience and judgment of the pathologist, and reproducibility is poor. (3) Limited resolution: it cannot distinguish between functionally different cell subpopulations (such as exhausted T cells and effector T cells), which are crucial for predicting the efficacy of treatment.
[0010] II. Transcriptomic typing based on bulk RNA sequencing:
[0011] This transcriptome typing method uses RNA sequencing data from the entire tumor tissue and computes deconvolution or enrichment scores from known immune cell gene signatures (using tools such as CIBERSORT and xCell) to infer the proportions of the various immune cells in the TIME, thereby performing typing. Its disadvantages are: (1) Insufficient accuracy: the results are computational deconvolution estimates rather than actual cell counts, and accuracy decreases significantly in complex samples. (2) It cannot identify new cell subtypes: it relies entirely on predefined gene sets and cannot discover new cell states with important biological significance. (3) It masks cell heterogeneity: it captures the average signal of the entire tissue, which may mask rare but key cell populations.
[0012] III. Bioinformatics workflow based on single-cell / spatial RNA sequencing:
[0013] Its typical process includes: using toolkits such as Seurat or Scanpy to perform quality control, normalization, dimensionality reduction, and clustering on scRNA-seq data, and then manually annotating cell clusters based on known marker genes to obtain the cell composition. Its disadvantages are: (1) The process is fragmented and not integrated: it requires combining multiple tools and steps, each involving complex parameter adjustments and a high level of expertise. (2) The annotation is subjective and inconsistent: the annotation of cell types relies heavily on the researchers' prior knowledge and experience, and different laboratories may give different labels to the same group of cells, making the results difficult to replicate and standardize. (3) It is disconnected from clinical decision-making: the final output of this process is a list of cell types, lacking an automated decision-making layer that directly links biological findings with clinical prognosis. Researchers still need to perform additional statistical analysis to establish the correlation, so the process cannot go directly from input data to typing and prediction as clinical application requires.
[0014] In summary, the present invention aims to provide an end-to-end single-cell immune microenvironment precision typing and treatment prediction system for lung cancer clinical practice, in order to solve the core pain points of existing methods in the above-mentioned technologies, such as incomplete coverage, complex typing, and disconnect from clinical efficacy evaluation.
Summary of the Invention
[0015] The purpose of this invention is to overcome the technical defects of existing technologies, such as fragmented lung cancer immune microenvironment analysis process, low accuracy of cell annotation, and insufficient clinical translation capability. It provides a single-cell analysis system and method based on precise subtyping and treatment prediction of lung cancer immune microenvironment, realizing automated and standardized closed-loop analysis from raw single-cell omics data to clinical treatment decision recommendations.
[0016] To achieve the above objectives, this invention provides a single-cell analysis system based on precise subtyping and treatment prediction of lung cancer immune microenvironment, comprising the following modules sequentially coupled through a standardized data interface:
[0017] Data preprocessing and quality control module: used to perform automated quality filtering, normalization and dimensionality reduction on raw single-cell transcriptome data to generate standardized single-cell data objects that meet the requirements of downstream computing;
[0018] Lung cancer-specific cell annotation module: It has a built-in reference atlas of the lung cancer immune microenvironment and is configured to use a dual-strategy adaptive weighted fusion algorithm based on probability normalization to achieve accurate identification of cell identity;
[0019] Immune microenvironment typing module: configured to integrate the composition ratio and spatial distribution characteristics of cell subpopulations, match the preset typing rule library, and automatically determine the tumor immune phenotypic label;
[0020] Clinical efficacy prediction module: configured as a multi-dimensional feature extraction unit and a pre-trained machine learning prediction model, outputting response probability prediction and prognostic risk score for immunotherapy;
[0021] Visualized report generation module: Used to automatically aggregate the analysis conclusions of multiple modules and generate a comprehensive analysis report that includes standardized bioinformatics charts and natural language clinical interpretations.
[0022] Further optimization of the key modules of the present invention:
[0023] (1) Fusion decision mechanism of lung cancer-specific cell annotation module: The module includes:
[0024] Parallel computing unit: Simultaneously executes SingleR reference mapping based on whole transcriptional profile correlation and AUCell enrichment calculation based on specific marker gene sets;
[0025] Weight optimization unit: computes the final score with the weighted fusion formula $S_{\text{final}} = w_1 \times S_{\text{SingleR}} + w_2 \times S_{\text{AUCell}}$, where $w_1$ and $w_2$ are adaptive weight coefficients. They are determined by constructing a lung cancer single-cell benchmark dataset with manually calibrated ground-truth labels, iteratively optimizing the parameters over the unit interval, and selecting the weight ratio that maximizes the classification F1-score; in the preferred scheme of this embodiment, $w_1$ lies in [0.55, 0.65] and $w_2$ lies in [0.35, 0.45].
[0026] Result determination unit: configured to apply Z-score probability normalization to the raw scores output by the different algorithms, perform weighted aggregation, select the cell type with the highest fused score as the final annotation result, and apply confidence-threshold filtering.
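The fusion decision described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the toy scores for the Treg and myCAF candidates, and in particular the confidence rule (the top fused score must exceed the runner-up by `min_margin`) are assumptions introduced here, since the text does not specify how the confidence threshold is defined.

```python
from statistics import mean, pstdev

def zscore(scores):
    """Z-score a {cell_type: raw_score} mapping across the candidate types."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # every candidate ties, so this method carries no signal
        return {k: 0.0 for k in scores}
    return {k: (v - mu) / sigma for k, v in scores.items()}

def fuse_and_label(singler, aucell, w1=0.6, w2=0.4, min_margin=0.5):
    """Fuse two normalized score sets and pick the best candidate type.

    Both inputs must score the same candidate types. Returns
    (label, fused_scores); the label is 'unassigned' when the best fused
    score does not beat the runner-up by at least `min_margin`
    (an assumed confidence rule, not taken from the source).
    """
    z1, z2 = zscore(singler), zscore(aucell)
    fused = {t: w1 * z1[t] + w2 * z2[t] for t in singler}
    ranked = sorted(fused, key=fused.get, reverse=True)
    best = ranked[0]
    margin = fused[best] - fused[ranked[1]] if len(ranked) > 1 else float("inf")
    return (best if margin >= min_margin else "unassigned"), fused

# Toy per-cell scores; only the CD8_Tex values (0.92, 0.85) come from the text.
singler_scores = {"CD8_Tex": 0.92, "Treg": 0.40, "myCAF": 0.10}
aucell_scores = {"CD8_Tex": 0.85, "Treg": 0.55, "myCAF": 0.05}
label, fused_scores = fuse_and_label(singler_scores, aucell_scores)
print(label)
```

Normalizing each method's scores before weighting keeps one method from dominating simply because its raw scores span a wider range.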
[0027] (2) Judgment logic of the immune microenvironment typing engine: The typing rule base is characterized by performing three-dimensional typing determination by coupling cell abundance indicators with spatial location attributes.
[0028] Immune-inflammatory type: CD8⁺ T cell percentage > 10% and spatial proximity analysis shows extensive infiltration of the tumor core area;
[0029] Immune rejection type: characterized by an abnormally elevated proportion of FAP⁺ fibroblasts, and the spatial distribution of CD8⁺ T cells is mainly restricted to the tumor stroma area;
[0030] Immune desert type: characterized by a total immune cell infiltration rate below a preset threshold and a lack of active T cell signature.
[0031] (3) Clinical efficacy prediction module: The clinical efficacy prediction module further includes:
[0032] Multidimensional feature extraction unit: used to extract multiple feature vectors from the cell matrix and typing results output from upstream; the feature vectors include: classification features (one-hot encoding of immunophenotyping tags), quantitative features (CD8⁺ / Treg ratio, M1 / M2 macrophage polarization index) and functional features (T cell exhaustion and activation signature scores based on specific gene sets).
[0033] Dual-algorithm integrated prediction model unit: It has a built-in nonlinear prediction model pre-trained on a large-scale lung cancer immunotherapy cohort; this unit is configured to input the heterogeneous feature vectors into the model and to output a standardized immunotherapy response score by weighting the features according to their importance.
[0034] (4) A visualization report generation module, further comprising:
[0035] Multidimensional data rendering unit: used to automatically extract the low-dimensional embedding coordinates and expression matrices from the single-cell analysis and to render charts including, but not limited to: a UMAP / t-SNE cell cluster map, an immune checkpoint molecule expression heatmap, and a stacked cell composition chart reflecting microenvironment heterogeneity;
[0036] Knowledge-driven Natural Language Generation (NLG) Unit: It has a built-in lung cancer clinical pathway knowledge base and is configured to automatically generate targeted interpretation text (such as "The sample shows immune rejection characteristics, and it is recommended to pay attention to stromal remodeling-related treatments") from the typing results and predicted scores, using preset logical templates or generative algorithms.
[0037] Clinical Decision Report Encapsulation Unit: This unit automatically populates the aforementioned visualization charts, predictive indicators, and text interpretations into a standardized report template, generating an end-to-end comprehensive analysis report with clinical decision support value.
[0038] The present invention also provides a method for precise subtyping and treatment prediction of the lung cancer immune microenvironment. The method is implemented on the system described above and includes the following steps:
[0039] (1) The raw single-cell transcriptome data are subjected to quality control, normalization and dimensionality reduction through the data preprocessing and quality control module;
[0040] (2) The lung cancer-specific cell annotation module is used to automatically annotate cells using a dual-strategy weighted fusion algorithm;
[0041] (3) The immune microenvironment typing module calculates cell composition based on cell annotation results and outputs immune microenvironment typing labels by matching the typing rule library;
[0042] (4) Through the clinical efficacy prediction module, features are extracted and machine learning models are used to calculate the immunotherapy response probability and relapse risk score;
[0043] (5) The analysis results are integrated and a clinical analysis report is automatically generated through the visualization report generation module.
[0044] Furthermore, the data preprocessing and quality control module specifically includes the following:
[0045] Cells with a gene count between 200 and 2500 are retained;
[0046] Cells in which mitochondrial genes account for more than 10% of expression are removed;
[0047] The total expression level of each cell is normalized, a logarithmic transformation is applied, highly variable genes are selected, and principal component analysis is performed to reduce dimensionality.
[0048] Furthermore, the annotation steps performed by the lung cancer-specific cell annotation module specifically include:
[0049] Parallel execution of SingleR reference mapping and AUCell marker gene enrichment calculations;
[0050] Weighted fusion is performed according to the formula $S_{\text{final}} = w_1 \times S_{\text{SingleR}} + w_2 \times S_{\text{AUCell}}$ to determine the final annotation result for each cell.
[0051] Compared with the prior art, the present invention has the following beneficial effects:
[0052] 1. The system of the present invention can achieve accurate subtyping of a wider range of patients by constructing a lung cancer-specific cell atlas and integrating multi-dimensional data, thereby reducing the occurrence of "unlabeled" or "incorrectly labeled" cases.
[0053] 2. The solution of the present invention directly links complex biological typing with clinical outcomes, and the typing results generated by the system can directly provide decision support for clinical treatment strategies.
[0054] 3. The present invention integrates spatial omics features to reveal drug resistance mechanisms more deeply and utilizes novel spatial biomarkers to improve the accuracy of prognostic prediction.
[0055] 4. The system of the present invention overcomes the shortcomings of existing analysis processes, such as fragmentation and high dependence on professional personnel, through an integrated automated analysis process, and promotes the standardization and repeatability of analysis results.
[0056] 5. The system of the present invention integrates the fragmented single-cell analysis process into an integrated system, realizing end-to-end automatic analysis from raw data to clinical reports, which greatly reduces manual operation time and professional threshold.
[0057] 6. The system of the present invention adopts a dual-strategy weighted fusion algorithm, which combines the advantages of reference atlas mapping and marker gene enrichment, and can significantly improve the accuracy and reliability of cell type annotation.
[0058] 7. The system of the present invention can directly output clinical decision indicators such as immunotherapy response probability and relapse risk score, transforming complex single-cell data into diagnostic and treatment suggestions that doctors can use directly; in addition, the system adopts an incremental computing architecture and block loading technology, supporting stable analysis of millions of cell data, breaking through the memory limitations and computing bottlenecks of traditional tools.
Attached Figure Description
[0059] Figure 1 is a flowchart of the single-cell analysis system based on precise subtyping and treatment prediction of lung cancer immune microenvironment in an embodiment of the present invention;
[0060] Figure 2 is a structural block diagram of the lung cancer-specific cell annotation module in an embodiment of the present invention;
[0061] Figure 3 is a quantitative evaluation of the collaborative fusion weight in an embodiment of the present invention.
Detailed Implementation
[0062] To make the objectives, technical solutions, and advantages of this invention clearer, the present invention is further described in detail below with reference to Figures 1-3 and the embodiments. It should be understood that the specific embodiments described herein are for illustrative purposes only and are not intended to limit the scope of the invention.
[0063] The implementation of the present invention will be described in detail below with reference to specific embodiments.
[0064] Example: This invention provides a single-cell analysis system and method based on precise typing and treatment prediction of the lung cancer immune microenvironment. The system comprises five core modules: a data preprocessing and quality control module, a lung cancer-specific cell annotation module, an immune microenvironment typing engine, a clinical efficacy prediction module, and a visualization report generation module. These modules are coupled through standardized data interfaces, forming a complete technical chain from raw data to clinical insights. This system and method can take single-cell transcriptome data (which can integrate spatial transcriptome data) from lung cancer patients as input, process it through a series of specific computational modules, and finally output immune microenvironment typing results and treatment response predictions with clear clinical guidance. The following is a specific implementation of the system of this invention.
[0065] 1. Implementation process of the system based on the present invention
[0066] This embodiment uses a virtual lung cancer patient sample (sample ID: LC_Patient_001) as an example, and describes in detail the fully automated analysis process of the system from raw data to clinical report, based on the system architecture of this invention, as follows:
[0067] (1) Data preprocessing and quality control module
[0068] 1. Input: The user submits raw single-cell transcriptome data of sample LC_Patient_001 as a standard 10x Genomics file: LC_Patient_001_filtered_feature_bc_matrix.h5. This file is input into the data preprocessing and quality control module.
[0069] 2. The data access unit then calls the `read_10x_h5` function from the Scanpy library to parse the file and create, in memory, an `AnnData` object containing the raw count matrix. Initially, approximately 12,000 cells (cell barcodes) and approximately 30,000 genes are detected.
[0070] 3: The automated quality control unit then starts and performs quality control on the above-mentioned objects:
[0071] Call scanpy.pp.calculate_qc_metrics to calculate the three core quality control metrics for each cell.
[0072] The system applies a preset threshold and calls scanpy.pp.filter_cells to perform filtering:
[0073] Gene count filtering: Cells with a detected gene count between 200 and 2500 are retained. This step filters out approximately 2000 cells (cells with fewer than 200 genes may be empty droplets or dead cells, while those with more than 2500 genes may be doublets or multiplets).
[0074] Mitochondrial gene ratio filtering: Calculate and remove cells with a mitochondrial gene ratio greater than 10%. This step further filters out approximately 500 low-quality cells.
[0075] 4. The standardization and dimensionality reduction unit processes the remaining approximately 9500 high-quality cell data points after filtering:
[0076] The unit calls the following functions in order:
`scanpy.pp.normalize_total(target_sum=1e4)`: Normalizes the total count of each cell to 10,000.
`scanpy.pp.log1p()`: Performs a logarithmic transformation (log(1+x)) on the normalized data.
`scanpy.pp.highly_variable_genes(n_top_genes=2000)`: Selects the 2000 genes with the highest expression variability for downstream analysis.
`scanpy.pp.pca(n_comps=50)`: Performs principal component analysis on the highly variable gene expression matrix, extracting the top 50 principal components to capture the main sources of variation in the data.
[0077] 5: Output and downstream feedback: This module outputs a high-quality, standardized AnnData object containing 50 principal components, which is automatically transferred to the lung cancer-specific cell annotation module through a standardized data interface.
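The filtering and normalization rules of this module can be sketched in plain Python on toy data. This is only an illustration of the thresholds stated above (200 to 2500 detected genes, mitochondrial fraction at most 10%, total count scaled to 10,000, then log(1+x)); it does not call Scanpy, the cell names and counts are invented, and treating the boundary values as inclusive is an assumption.

```python
import math

MIN_GENES, MAX_GENES, MAX_MITO_FRACTION = 200, 2500, 0.10

def passes_qc(n_genes, mito_fraction):
    """Keep cells inside the gene-count window and under the mitochondrial ceiling."""
    return MIN_GENES <= n_genes <= MAX_GENES and mito_fraction <= MAX_MITO_FRACTION

def normalize_log1p(counts, target_sum=1e4):
    """Scale one cell's counts so they sum to `target_sum`, then apply log(1 + x)."""
    total = sum(counts.values())
    return {gene: math.log1p(c / total * target_sum) for gene, c in counts.items()}

# Hypothetical per-cell QC metrics (invented for illustration).
cells = {
    "cell_ok": {"n_genes": 1200, "mito_fraction": 0.04},
    "cell_empty": {"n_genes": 150, "mito_fraction": 0.02},
    "cell_multiplet": {"n_genes": 4100, "mito_fraction": 0.03},
    "cell_dying": {"n_genes": 900, "mito_fraction": 0.22},
}
kept = [name for name, m in cells.items() if passes_qc(m["n_genes"], m["mito_fraction"])]

# Hypothetical raw counts for one retained cell.
normalized = normalize_log1p({"CD8A": 3, "PDCD1": 1, "ACTB": 16})
print(kept, normalized)
```

In the pipeline itself these steps correspond to the Scanpy calls named in the text; the sketch only makes the arithmetic behind them explicit.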
[0078] (2) Description of lung cancer specific cell annotation module
[0079] 1. Information Reception and Preprocessing
[0080] This module first receives a normalized AnnData object output by the upstream preprocessing unit. Then, it automatically loads a pre-built lung cancer-specific reference atlas, lung_cancer_ref.loom, from the system repository via the built-in reference atlas integration unit.
[0081] Resource map features: This map deeply integrates multiple public databases and authoritative literature, and has undergone rigorous manual verification to form a high-precision knowledge base of lung cancer subtypes.
[0082] Examples of subtypes covered include key cell subpopulations such as CD8_Tex (key marker genes: CD8A, CD8B, PDCD1, HAVCR2, LAG3), which characterizes immune exhaustion; SPP1+ Macrophage (key marker genes: SPP1, APOE), which represents tumor-associated macrophages; and the myofibroblast subtype myCAF (key marker genes: ACTA2, PDGFRB, FAP), which is associated with immune rejection.
[0083] 2. Dual-strategy parallel computation execution
[0084] The dual-strategy integrated annotation engine constructed in this invention performs heterogeneous feature extraction in parallel on the test cells in the input data (such as approximately 9500 cell samples in the example). Taking the specific cell Barcode_ACGTTAGACGT as an example, the calculation process is as follows:
[0085] Strategy A (Global Feature Mapping): The SingleR algorithm is used to map the 50-dimensional PCA expression profile of the cell under test to the reference atlas space. Through full-profile alignment, the algorithm identifies CD8_Tex as the most similar type, with a confidence score of SingleR_score = 0.92.
[0086] Strategy B (Local Functional Enrichment): The AUCell algorithm is used to calculate the enrichment intensity of the cell on each predefined marker gene set. For the same cell, its enrichment score on the CD8_Tex gene set is defined as AUCell_score = 0.85.
[0087] 3. Integration of decision-making mechanisms and weighted optimization verification
[0088] The fusion decision unit receives the parallel computing results and executes the final decision based on the optimal weighted model determined by this invention.
[0089] Weighted fusion formula: $S_{\text{final}} = w_1 \times S_{\text{SingleR}} + w_2 \times S_{\text{AUCell}}$. Weight parameters (see Figure 3): this invention systematically demonstrates the superiority of the fusion strategy by scanning the weight coefficient $w_1$ in gradient steps.
[0090] Demonstration of synergistic effect: As shown in Figure 3, the Fusion curve (blue line with triangle markers) shows that when $w_1$ is in the interval [0.5, 0.7], the annotation accuracy (F1-score) exhibits a significant non-linear jump. At $w_1 = 0.6$ (i.e., $w_2 = 0.4$), the curve reaches the global performance peak (approximately 0.94), which is the maximum synergistic gain region defined in this invention.
[0091] Overcoming the bottleneck of a single algorithm: As the comparison shows, the SingleR strategy (green line with square markers) has a performance ceiling of only 0.88 and is unstable as the weight shifts; the AUCell strategy (red dotted line) shows stable performance but limited accuracy (approximately 0.83). By coupling the two feature sets at $w_1 = 0.6$, this invention improves annotation performance by approximately 6.8%-13.2% compared to a single algorithm.
[0092] Stability and reliability assurance: The shaded area of the curve represents the 95% confidence interval. The Fusion curve shows an extremely narrow shade at the optimal weighting point, demonstrating that this ratio exhibits extremely high technical robustness when dealing with single-cell lung cancer data across batches and platforms.
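The weight scan discussed above can be sketched as a simple grid search that maximizes macro-averaged F1 on a labeled benchmark. This is a hedged illustration only: the six-cell benchmark below is invented, the helper names are hypothetical, and the toy data produce a plateau of equally good weights rather than reproducing the curve or the peak value reported for the figure.

```python
def macro_f1(truth, pred):
    """Unweighted mean of per-class F1 over the classes present in `truth`."""
    total = 0.0
    labels = sorted(set(truth))
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(truth, pred))
        fp = sum(t != lab and p == lab for t, p in zip(truth, pred))
        fn = sum(t == lab and p != lab for t, p in zip(truth, pred))
        denom = 2 * tp + fp + fn
        total += (2 * tp / denom) if denom else 0.0
    return total / len(labels)

def fused_label(singler, aucell, w1):
    """Label with the highest w1 * SingleR + (1 - w1) * AUCell score."""
    fused = {t: w1 * singler[t] + (1.0 - w1) * aucell[t] for t in singler}
    return max(fused, key=fused.get)

def scan_weights(benchmark, steps=10):
    """Return {w1: macro-F1} for w1 = 0, 1/steps, ..., 1."""
    truth = [cell["truth"] for cell in benchmark]
    curve = {}
    for i in range(steps + 1):
        w1 = i / steps
        pred = [fused_label(c["singler"], c["aucell"], w1) for c in benchmark]
        curve[w1] = macro_f1(truth, pred)
    return curve

# Invented benchmark: each method alone mislabels two cells the other gets right.
benchmark = [
    {"truth": "A", "singler": {"A": 0.90, "B": 0.10}, "aucell": {"A": 0.60, "B": 0.40}},
    {"truth": "A", "singler": {"A": 0.45, "B": 0.55}, "aucell": {"A": 0.80, "B": 0.20}},
    {"truth": "B", "singler": {"A": 0.20, "B": 0.80}, "aucell": {"A": 0.55, "B": 0.45}},
    {"truth": "B", "singler": {"A": 0.30, "B": 0.70}, "aucell": {"A": 0.40, "B": 0.60}},
    {"truth": "A", "singler": {"A": 0.70, "B": 0.30}, "aucell": {"A": 0.35, "B": 0.65}},
    {"truth": "B", "singler": {"A": 0.60, "B": 0.40}, "aucell": {"A": 0.10, "B": 0.90}},
]
curve = scan_weights(benchmark)
best_w1 = max(curve, key=curve.get)
print(best_w1, curve[best_w1])
```

On a real benchmark the scan would be repeated across batches or platforms to estimate a confidence band like the shaded area described for the figure.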
[0093] 4. Annotation Result Determination and Output Feedback
[0094] Based on the above mechanism, taking the cell Barcode_ACGTTAGACGT as an example, its final fusion score for CD8_Tex is: $S_{\text{final}} = 0.6 \times 0.92 + 0.4 \times 0.85 = 0.552 + 0.340 = 0.892$.
[0095] The system determines the final label of the cell as CD8_Tex by traversing all candidate types and selecting the highest score.
[0096] Output and Integration: The final output of this module is an AnnData object with the cell_type annotation column added.
[0097] Typical results statistics: The annotation results in this example include: CD8_Tex (1200 cells), SPP1+ Macrophage (1800 cells), myCAF (850 cells), etc. This object is then automatically transferred to the immune microenvironment typing engine for higher-level analysis.
[0098] (3) Immune microenvironment typing module
[0099] 1. Multidimensional Cell Feature Extraction
[0100] This module receives annotated AnnData objects from upstream and uses the cell ratio calculation unit to quantitatively characterize the tumor microenvironment (TME). Key metrics calculated include:
[0101] Immune effect indicators: The proportion of CD8⁺ T cells in the total cell population was calculated (approximately 21.1% in this example);
[0102] Immunosuppressive marker: The percentage of Treg cells was counted (approximately 3.7%).
[0103] Stromal barrier indicators: the abundance of myCAF (FAP⁺ fibroblasts) and SPP1⁺ macrophages is quantified. These cells are defined as an "immune rejection-associated cell population" and are used to assess the physical and biochemical barriers to T cell entry into the tumor nest.
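The proportion calculation behind these indicators can be sketched as follows. The label list is a toy stand-in for the `cell_type` column: its counts are chosen so that CD8⁺ T cells and Tregs reproduce the 21.1% and 3.7% figures of this example, while the label names and the remaining counts are assumptions made for illustration.

```python
from collections import Counter

def composition(cell_types):
    """Return {cell_type: fraction of all cells} from a list of per-cell labels."""
    counts = Counter(cell_types)
    total = sum(counts.values())
    return {cell_type: n / total for cell_type, n in counts.items()}

# Toy labels for 1000 cells (hypothetical counts).
labels = ["CD8_T"] * 211 + ["Treg"] * 37 + ["myCAF"] * 90 + ["Other"] * 662
props = composition(labels)
cd8_treg_ratio = props["CD8_T"] / props["Treg"]
print(props["CD8_T"], props["Treg"], cd8_treg_ratio)
```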
[0104] 2. Heuristic Typing Decision Logic
[0105] The typing rule determination unit uses hierarchical logic to automatically classify samples into the following three typical phenotypes: Immune-Inflamed: The core criterion is a CD8⁺ T cell ratio > 10% together with the spatial feature "tumor core infiltration". This subtype indicates an active immune response and suggests a good response to immunotherapy.
[0106] Immune-Excluded: The criterion is that CD8⁺ T cells infiltrate to some extent but are confined to the peripheral stromal area, accompanied by myCAF or SPP1⁺ macrophages accounting for > 5%. By introducing stromal cell indicators, this classification explains why immune cells can be present without a therapeutic effect.
[0107] Immune-Desert: The criterion is a CD8⁺ T cell ratio < 2%. This classification reflects an extreme deficiency of immune cells within the tumor region.
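The three rules above can be written as a small rule function. This is a sketch under stated assumptions: the order in which the rules are checked, the `"core"` / `"stroma"` encoding of the spatial result, the reading of "myCAF or SPP1⁺ macrophages > 5%" as either population alone exceeding 5%, and the `"Unclassified"` fallback are not specified in the text and are introduced here for illustration.

```python
def classify_microenvironment(cd8_fraction, mycaf_fraction, spp1_mac_fraction, cd8_location):
    """Assign an immune phenotype from abundance and spatial indicators.

    cd8_location is 'core' or 'stroma' and is assumed to be supplied by an
    upstream spatial proximity analysis.
    """
    if cd8_fraction < 0.02:
        return "Immune-Desert"
    barrier_present = mycaf_fraction > 0.05 or spp1_mac_fraction > 0.05
    if cd8_location == "stroma" and barrier_present:
        return "Immune-Excluded"
    if cd8_fraction > 0.10 and cd8_location == "core":
        return "Immune-Inflamed"
    return "Unclassified"  # assumed fallback for samples matching no rule

# 21.1% CD8+ T cells as in this example; the other inputs are hypothetical.
phenotype = classify_microenvironment(0.211, 0.03, 0.04, "core")
print(phenotype)
```

Samples that match none of the rules are returned as `"Unclassified"` rather than forced into a category, so borderline cases stay visible for manual review.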
[0108] 3. Results output and clinical application
[0109] The module ultimately outputs a structured typing label (such as "immunoinflammatory type"). This typing result not only includes the proportion of T cells but also integrates regulatory information on stromal cells (myCAF / SPP1⁺), providing a higher-dimensional decision input for downstream clinical efficacy prediction modules.
[0110] (4) Clinical efficacy prediction module
[0111] 1. Feature Engineering and Vectorization: The feature extraction unit in this module transforms the heterogeneous information transmitted from upstream into high-dimensional numerical feature vectors that can be recognized by machine learning models.
[0112] Subtyping variable encoding: Qualitative immunotyping labels (such as "immunoinflammatory type") are converted into categorical feature vectors [1, 0, 0] using one-hot encoding;
[0113] Key immune marker: The CD8⁺ / Treg ratio (5.70) was extracted as a continuous characteristic reflecting the balance between tumor killing and inhibitory activity;
[0114] Exhaustion signature calculation: The T cell exhaustion gene signature score (1.45) was extracted by weighting the expression levels of key immune checkpoint genes such as PDCD1, LAG3, HAVCR2, and TIGIT.
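The feature vector described above can be assembled as in the following sketch. The phenotype order, the equal gene weights, and the per-gene expression values are assumptions; the expression values were chosen only so that the toy score equals the 1.45 quoted in this example, and the text does not state the actual weights.

```python
PHENOTYPES = ("Immune-Inflamed", "Immune-Excluded", "Immune-Desert")
EXHAUSTION_GENES = ("PDCD1", "LAG3", "HAVCR2", "TIGIT")

def one_hot(phenotype):
    """Encode the typing label as a 0/1 vector in the order of PHENOTYPES."""
    return [1 if phenotype == p else 0 for p in PHENOTYPES]

def exhaustion_score(expression, weights=None):
    """Weighted sum of checkpoint-gene expression (equal weights by default)."""
    if weights is None:
        weights = {g: 1.0 / len(EXHAUSTION_GENES) for g in EXHAUSTION_GENES}
    return sum(weights[g] * expression.get(g, 0.0) for g in EXHAUSTION_GENES)

def build_features(phenotype, cd8_fraction, treg_fraction, expression):
    """Concatenate categorical, quantitative and functional features."""
    return one_hot(phenotype) + [
        cd8_fraction / treg_fraction,
        exhaustion_score(expression),
    ]

# Hypothetical mean expression of the signature genes in T cells.
expression = {"PDCD1": 1.8, "LAG3": 1.2, "HAVCR2": 1.6, "TIGIT": 1.2}
features = build_features("Immune-Inflamed", 0.211, 0.037, expression)
print(features)
```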
[0115] 2. Machine learning predictive model decision making
[0116] The machine learning prediction model unit loads a pre-trained ensemble learning model (such as a random forest classifier model_rf_v1.pkl) and performs the following computations:
[0117] Therapeutic response prediction: The fused feature vector is input into the model. The model outputs the immunotherapy response probability (0.82 in this example) based on global nonlinear weights.
[0118] Prognostic risk assessment: The risk prediction subunit simultaneously analyzes positively correlated prognostic features such as effector memory T cell abundance and calculates a relapse risk score (0.18 in this example). This model achieves end-to-end accurate prediction from single cell abundance to complex clinical outcomes.
[0119] (5) Visual report generation module
[0120] The natural language-based clinical interpretation report generation unit automates the transition from data to decision-making through the following logic:
[0121] Data template population: Automatically maps sample metadata, typing results, response probabilities, and risk scores to standard HTML report templates;
[0122] Heuristic Interpretation Generation: The system uses a predefined clinical threshold engine and Natural Language Generation (NLG) technology to synthesize interpretable text. For example, for samples with a high response probability (>0.8) and classified as "immunoinflammatory," the system automatically generates: "This sample is characterized by a significant enrichment of activated CD8⁺ T cells in the core region, predicting high sensitivity to PD-1 / PD-L1 inhibitor treatment and a low prognostic risk. Immunotherapy is recommended as a clinical priority."
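A threshold-driven interpretation rule of this kind might look like the sketch below. Only the single rule quoted above (response probability >0.8 combined with the inflamed type) comes from the text; the function name, the label string "Immune-Inflamed", and the neutral fallback sentence are assumptions added so that the example is complete.

```python
HIGH_RESPONSE_THRESHOLD = 0.8  # threshold quoted in the description

INFLAMED_HIGH_RESPONSE_TEXT = (
    "This sample is characterized by a significant enrichment of activated "
    "CD8+ T cells in the core region, predicting high sensitivity to "
    "PD-1 / PD-L1 inhibitor treatment and a low prognostic risk. "
    "Immunotherapy is recommended as a clinical priority."
)


def generate_interpretation(subtype, response_probability, relapse_risk):
    """Return interpretation text for one sample.

    The first branch mirrors the rule given in the description; the
    fallback wording is an assumption of this sketch.
    """
    if subtype == "Immune-Inflamed" and response_probability > HIGH_RESPONSE_THRESHOLD:
        return INFLAMED_HIGH_RESPONSE_TEXT
    return (
        f"Subtype: {subtype}; predicted response probability "
        f"{response_probability:.2f}; relapse risk score {relapse_risk:.2f}. "
        "No predefined interpretation rule matched; manual review is suggested."
    )
```

Keeping the thresholds and sentences as named constants makes the rule set auditable, which matters when the generated text is shown to clinicians.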
[0123] 3. Automated closed-loop output
[0124] The module ultimately generates a comprehensive report named LC_Patient_001_IM_Report.html. This report is not merely a list of data, but also an intelligent decision-making aid that integrates analytical pathways, quantitative indicators, and clinical recommendations, achieving end-to-end automation from sequencing data to clinical treatment opinions.
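As a rough illustration of template population, the following sketch fills a small HTML template using only the Python standard library. The template layout, field names, and helper functions are hypothetical; the file naming pattern is inferred from the single example name LC_Patient_001_IM_Report.html and may not match the actual convention.

```python
from html import escape
from string import Template

# Hypothetical minimal template; the real report layout is not disclosed.
REPORT_TEMPLATE = Template("""<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>$sample_id immune microenvironment report</title></head>
<body>
<h1>$sample_id</h1>
<table>
<tr><th>Immune subtype</th><td>$subtype</td></tr>
<tr><th>Immunotherapy response probability</th><td>$response_probability</td></tr>
<tr><th>Relapse risk score</th><td>$relapse_risk</td></tr>
</table>
<p>$interpretation</p>
</body>
</html>
""")


def render_report(sample_id, subtype, response_probability, relapse_risk, interpretation):
    """Map analysis results onto the template, escaping free text for HTML."""
    return REPORT_TEMPLATE.substitute(
        sample_id=escape(sample_id),
        subtype=escape(subtype),
        response_probability=f"{response_probability:.2f}",
        relapse_risk=f"{relapse_risk:.2f}",
        interpretation=escape(interpretation),
    )


def report_filename(sample_id):
    """Build the output name, assuming the pattern <sample_id>_IM_Report.html."""
    return f"{sample_id}_IM_Report.html"
```

Escaping every text field before substitution prevents sample metadata or generated sentences from breaking the markup of the report.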
[0125] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A single-cell analysis system based on precise subtyping and treatment prediction of lung cancer immune microenvironment, characterized in that, This includes the following modules that are sequentially coupled through a standardized data interface: The data preprocessing and quality control module is used to receive raw single-cell transcriptome data, complete automated quality control and standardization processing, and generate high-quality single-cell data objects. The lung cancer-specific cell annotation module is connected to the data preprocessing and quality control module. It has a built-in lung cancer-specific cell reference atlas and is configured to automatically annotate cells using a dual-strategy weighted fusion algorithm. The immune microenvironment typing engine, connected to the lung cancer-specific cell annotation module, is configured to automatically determine the immune microenvironment typing of a sample based on cell annotation results and according to a typing rule library that integrates cell composition and spatial distribution characteristics. The clinical efficacy prediction module is connected to the immune microenvironment typing engine and is configured to extract features and use a pre-trained machine learning model to output the immunotherapy response probability and relapse risk score. The visualization report generation module is connected to the clinical efficacy prediction module and is configured to automatically generate a clinical analysis report that includes immune microenvironment typing results, treatment response prediction, relapse risk assessment, and visualization charts.
2. The single-cell analysis system based on precise subtyping and treatment prediction of lung cancer immune microenvironment according to claim 1, characterized in that, The data preprocessing and quality control module includes: The data access unit is used to parse and load raw single-cell transcriptome data; An automated quality control unit is used to calculate and filter cells based on the number of genes, the total number of UMIs, and the threshold for the proportion of mitochondrial genes. The standardization and dimensionality reduction unit is used to standardize, logarithmically transform, screen for highly variable genes, and perform dimensionality reduction through principal component analysis on the quality-controlled data.
3. The single-cell analysis system based on precise subtyping and treatment prediction of lung cancer immune microenvironment according to claim 1, characterized in that, The lung cancer-specific cell annotation module includes: Reference map integration unit, used to load built-in lung cancer-specific cell reference maps; A dual-strategy integrated annotation engine is used to perform SingleR reference mapping and AUCell marker gene enrichment calculations in parallel; The fusion decision-making mechanism is used to calculate the final score according to the formula: Score_Final = w1 × S_SingleR + w2 × S_AUCell; The method for determining weights w1 and w2: The weighting coefficients w1 and w2 are not fixed empirical values, but are determined by an offline iterative optimization algorithm based on the lung cancer microenvironment benchmark dataset. The specific determination process is as follows: S1. Constructing a baseline set: Obtain a single-cell reference dataset of lung cancer cells that has been manually and precisely labeled with cell types by pathology experts; S2. Parameter traversal simulation: In the interval [0, 1], exhaustive simulation calculations are performed on the combinations of values of w1 and w2 with a step size of 0.05; S3. Performance Evaluation: Using the degree of agreement between the annotation results and the expert-annotated true values as the evaluation index, record the accuracy fluctuation curves under different weight combinations; S4. Determine the optimal value: When dealing with the "infiltrating immune cell subset" specific to lung cancer, the system reaches its peak accuracy in identifying key subtypes such as exhausted T cells when w1 is 0.6 and w2 is 0.4. This combination is set as the optimal preset parameters to ensure the highest classification accuracy in complex lung cancer samples.
4. The single-cell analysis system based on precise subtyping and treatment prediction of lung cancer immune microenvironment according to claim 1, characterized in that, The immune microenvironment typing engine includes: The cell proportion calculation unit is used to calculate the proportion of key immune cells and the CD8⁺ / Treg and M1 / M2 macrophage ratios. The typing rule determination unit has the built-in typing rule library and is used to match predefined determination conditions based on cell proportion and spatial distribution characteristics, and output immunotyping labels.
5. The single-cell analysis system based on precise subtyping and treatment prediction of lung cancer immune microenvironment according to claim 4, characterized in that, The typing rule library includes the following determination conditions: (1) If the proportion of CD8⁺ T cells is >10% and they are mainly located in the core area of the tumor, it is determined to be an immune-inflammatory type; (2) If the proportion of CD8⁺ T cells is <2% and the proportion of Treg cells is low, it is determined to be an immune-desert type; (3) If CD8⁺ T cells are mainly distributed in the stromal area and the proportion of FAP⁺ fibroblasts is high, it is determined to be an immune-excluded type.
6. The single-cell analysis system based on precise subtyping and treatment prediction of lung cancer immune microenvironment according to claim 1, characterized in that, The clinical efficacy prediction module includes: The feature extraction unit is used to extract immunotyping labels, the CD8⁺ / Treg ratio, and T cell exhaustion gene signature scores from upstream results as predictive features. A machine learning prediction model unit is used to input the features into a pre-trained logistic regression or random forest model and output the immunotherapy response probability. The risk prediction subunit is used to calculate a relapse risk score based on relapse-related biomarkers.
7. The single-cell analysis system based on precise subtyping and treatment prediction of lung cancer immune microenvironment according to claim 1, characterized in that, The visualization report generation module includes: An automated plotting unit is used to generate UMAP clustering diagrams, cell composition stacked bar charts, and gene expression distribution maps; The report generation unit is used to populate a standardized report template with the analysis results and the technical interpretation of natural language generation, and output a comprehensive clinical analysis report.
8. A method for precise subtyping and treatment prediction of the lung cancer immune microenvironment, characterized in that, The method is implemented based on the system of any one of claims 1-7, and includes the following steps: (1) The raw single-cell transcriptome data were subjected to quality control, standardization and dimensionality reduction through the data preprocessing and quality control module; (2) The lung cancer-specific cell annotation module is used to automatically annotate cells using a dual-strategy weighted fusion algorithm; (3) The immune microenvironment typing engine calculates cell composition based on cell annotation results and outputs immune microenvironment typing labels by matching the typing rule library; (4) Through the clinical efficacy prediction module, features are extracted and machine learning models are used to calculate the immunotherapy response probability and relapse risk score; (5) The analysis results are integrated and a clinical analysis report is automatically generated through the visualization report generation module.
9. The method according to claim 8, characterized in that, The processing performed by the data preprocessing and quality control module specifically includes the following steps: Cells with a gene count between 200 and 2500 are retained; Cells in which mitochondrial genes account for more than 10% are removed; The total expression level of each cell is standardized, logarithmic transformation is performed, highly variable genes are screened, and principal component analysis is performed to reduce dimensionality.
10. The method according to claim 8, characterized in that, The annotation steps performed by the lung cancer-specific cell annotation module specifically include: Parallel execution of SingleR reference mapping and AUCell marker gene enrichment calculations; Calculating the final score according to the formula: Score_Final = w1 × S_SingleR + w2 × S_AUCell; The method for determining weights w1 and w2: The weighting coefficients w1 and w2 are not fixed empirical values, but are determined by an offline iterative optimization algorithm based on a benchmark dataset of the lung cancer microenvironment.