Noise reduction method for single cell immune repertoire sequencing data and system thereof
Multi-parameter dynamic threshold filtering and a bidirectional collaborative noise reduction engine, combined with biological context and machine learning models, address the noise interference problem in single-cell immune repertoire sequencing. The method removes noise in a refined manner while preserving cell information, thereby improving data accuracy and analytical flexibility.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHANGSHA WEISHI MEDICAL LAB CO LTD
- Filing Date
- 2026-02-11
- Publication Date
- 2026-06-19
AI Technical Summary
Existing single-cell immune repertoire sequencing technologies suffer from noise interference, resulting in data contamination by false positive cells, mitochondrial or ribosomal genes, and low pairing rates. Current methods struggle to accurately distinguish between the true biological state of cells and technical noise, affecting the accuracy of quantifying clonal diversity and analyzing cell phenotypes in immune repertoires.
We employ a noise reduction method that integrates multi-parameter dynamic threshold filtering, specific gene contamination analysis, and VDJ sequencing data-oriented optimization. Through a bidirectional collaborative noise reduction engine, intelligent comprehensive judgment and classification, data archiving and background learning, we construct a collaborative noise reduction process. Combining biological context and machine learning models, we dynamically adjust quality control thresholds and feedback mechanisms to remove noise in a refined manner.
It significantly improves the purity of single-cell immune repertoire data and the reliability of analysis results, retains high-value cell information, improves the accuracy of cell identification and the depth of biological interpretation of data, and enhances the flexibility and robustness of analysis.
Figure CN121687192B_ABST
Description
Technical Field
[0001] This invention relates to the field of bioinformatics technology, and in particular to a noise reduction method and system for single-cell immune repertoire sequencing data.
Background Technology
[0002] Single-cell immune repertoire sequencing is a key technology for elucidating the diversity of receptors on T cells and B cells, the core of the adaptive immune system. It can simultaneously acquire gene expression profiles and unique receptor sequence information of immune cells at single-cell resolution. This technology has irreplaceable value for revealing immune response mechanisms, discovering disease biomarkers, and guiding cutting-edge research such as cancer immunotherapy. The accuracy of its analytical results directly affects the reliability of subsequent scientific discoveries.
[0003] However, this technology faces significant noise interference in practice. The noise stems primarily from the technical limitations of the sequencing experiment itself. For example, the droplet microreaction system inevitably captures low-quality cells, dead cells, or ambient RNA, while the "single droplet, multiple cells" phenomenon confuses cell identity. At the data level, this manifests as a large number of false-positive cells, excessively high mitochondrial or ribosomal gene contamination, and low receptor sequence pairing rates. Current mainstream data processing methods rely largely on static threshold filtering of one or a few indicators, such as UMI count and the number of detected genes. This "one-size-fits-all" strategy cannot finely distinguish the true biological state of cells from complex technical noise, and it struggles in particular with valuable cells that appear abnormal in several dimensions. The result is either loss of effective information or residual background noise, which ultimately degrades the accuracy of clonal diversity quantification and cell phenotype analysis in immune repertoires.
[0004] To address the aforementioned issues, this invention proposes a noise reduction method and system for single-cell immune repertoire sequencing data. This invention integrates multiple steps, including multi-parameter dynamic threshold filtering, specific gene contamination analysis, and VDJ sequencing data-oriented optimization, to construct a collaborative noise reduction workflow. This approach comprehensively assesses noise characteristics from various aspects such as cell activity, gene contamination, and background interference, and performs layered and refined removal. This effectively removes technical noise while maximizing the preservation of biologically significant cellular information, significantly improving the purity of single-cell immune repertoire data and the reliability of subsequent analysis results.
Summary of the Invention
[0005] To overcome the problems mentioned in the background art, the present invention proposes a noise reduction method and system for single-cell immune repertoire sequencing data.
[0006] The technical solution of this invention is: a noise reduction method for single-cell immune repertoire sequencing data, comprising the following steps:
[0007] S11: Data preprocessing and feature extraction. The raw sequencing data generated from single-cell immune repertoire sequencing is aligned and quantified to generate gene expression profile data and VDJ receptor sequence data for each cell barcode, and the initial quality indicators for each cell are calculated.
[0008] S12: Perform bidirectional collaborative denoising. Based on the data generated by feature extraction, run a denoising engine with bidirectional information feedback to obtain multi-dimensional cell features.
[0009] S13: Intelligent comprehensive judgment and classification. Multi-dimensional cell features are integrated, and a preset judgment model is used to calculate a retention priority and a classification label for each cell. The multi-dimensional features include the comprehensive quality score and multiple capture evidence;
[0010] S14: Data archiving and background learning. Based on the judgment results, cell data are classified and stored in different datasets. Feature analysis is performed on datasets judged as low quality and technical noise to extract background noise parameters for this sequencing experiment.
[0011] S15: Results output. The output result set includes purified high-quality cell data, a weighted list of clonotype frequencies, cell classification archive information, and a background noise analysis report.
[0012] Preferably, before bidirectional collaborative noise reduction is performed, the method also includes refined coarse filtering based on biological context, specifically:
[0013] S21: A first-level adaptive threshold is set according to the background noise distribution and applied for preliminary filtering;
[0014] S22: Cell subpopulation identification is performed on the initially filtered data;
[0015] S23: For the specific functional subgroups identified, a secondary quality control threshold matching their biological state is applied for secondary filtering.
[0016] Preferably, when performing bidirectional collaborative noise reduction, the noise reduction engine including bidirectional information feedback includes a first feedback loop and a second feedback loop, specifically:
[0017] The first feedback loop of the noise reduction engine is based on VDJ receptor sequence data to verify the integrity of the receptor chain under a single cell barcode, generate evidence identifying potential multiple capture events, and feed this evidence back into the cluster analysis of gene expression profile data to separate confounding cell signals.
[0018] The second feedback loop of the noise reduction engine calculates a comprehensive quality score characterizing cell integrity and activity based on gene expression profile data and initial quality indicators. This score is then used to weight the VDJ clonotype frequency to improve the reliability of clonal quantification analysis based on VDJ receptor sequence data.
[0019] Preferably, the first feedback loop of the noise reduction engine, when in operation, specifically includes:
[0020] S31: Identifying contradictory evidence of cell identity in single-cell VDJ data;
[0021] S32: Quantitative calculation of the multiple capture suspicion score;
[0022] S33: Integrating scores in gene expression profile clustering analysis to separate confounding signals.
[0023] Preferably, when the first feedback loop identifies contradictory evidence of cell identity in single-cell VDJ data, it specifically includes:
[0024] S311: For each valid cell barcode, identify all functional T-cell receptor and B-cell receptor sequences with complete open reading frames and no nonproductive rearrangements in its VDJ sequencing data;
[0025] S312: Based on receptor chain type and pairing rules, pairing analysis is performed on the identified functional sequences to identify potential effective receptor pairs under each barcode;
[0026] S313: Define and detect cell identity conflict events, denoted as event E, specifically:
[0027] For T cells, event E is defined as: under one cell barcode, there is more than one distinct functional TCRα chain and more than one distinct functional TCRβ chain;
[0028] For B cells, event E is defined as: under one cell barcode, there is more than one distinct functional immunoglobulin heavy chain and more than one distinct functional immunoglobulin light chain.
[0029] S314: For the cell barcode that triggers event E, calculate and record the strength of contradictory evidence.
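Steps S311 to S314 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the locus names (`TRA`, `TRB`, `IGH`, `IGK`, `IGL`), the function names, and in particular the strength formula (the number of surplus chains of each kind, added together) are assumptions made for the example. The input is assumed to contain only chains already judged functional in S311.

```python
# Hypothetical locus groups: (first chain kind, second chain kind) per cell type.
CHAIN_GROUPS = {
    "T": ({"TRA"}, {"TRB"}),
    "B": ({"IGH"}, {"IGK", "IGL"}),  # light chain may be kappa or lambda
}


def count_distinct(chains, loci):
    """Count distinct functional sequences whose locus is in `loci`."""
    return len({seq for locus, seq in chains if locus in loci})


def conflict_strength(chains, cell_type):
    """Return 0 if event E is not triggered, else an ASSUMED evidence strength.

    chains: list of (locus, sequence) tuples for one cell barcode.
    Event E follows the text: more than one distinct chain of BOTH kinds.
    """
    first_loci, second_loci = CHAIN_GROUPS[cell_type]
    n_first = count_distinct(chains, first_loci)
    n_second = count_distinct(chains, second_loci)
    if not (n_first > 1 and n_second > 1):
        return 0
    # ASSUMPTION: strength = surplus chains of each kind, summed.
    return (n_first - 1) + (n_second - 1)


doublet_like = [("TRA", "A1"), ("TRA", "A2"), ("TRB", "B1"), ("TRB", "B2")]
dual_alpha_only = [("TRA", "A1"), ("TRA", "A2"), ("TRB", "B1")]
print(conflict_strength(doublet_like, "T"))
print(conflict_strength(dual_alpha_only, "T"))
```

Requiring surplus chains of both kinds means a barcode with two α chains but a single β chain does not trigger event E under this definition.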
[0030] Preferably, the second feedback loop of the noise reduction engine, when in operation, specifically includes:
[0031] S41: Calculate the overall cell quality score based on multi-dimensional gene expression characteristics;
[0032] S42: The VDJ clonotype frequency is weighted using the calculated overall cell quality score.
[0033] As a preferred option, the intelligent comprehensive judgment and classification process specifically includes:
[0034] S51: Construct a comprehensive cell feature vector for model determination;
[0035] S52: Input the comprehensive feature vector into the preset intelligent judgment model. The intelligent judgment model integrates multi-dimensional features and makes a comprehensive decision based on the preset analysis target.
[0036] S53: Obtain and apply the judgment output of the intelligent judgment model to perform the final classification of cells.
[0037] Preferably, the intelligent judgment model used in the intelligent comprehensive judgment and classification step is a supervised machine learning model that integrates multi-dimensional features and preset analysis objectives to perform the final classification of cells, and includes the following components:
[0038] A11: Model input module, used to receive the standardized integrated feature vector of each cell;
[0039] A12: Target weight configuration module, used to integrate user-specified analysis target priority weights;
[0040] A13: Model inference engine, built on gradient boosting tree algorithm, used to calculate the final judgment score of each cell based on the input feature vector and target weights;
[0041] A14: Decision and Output Module, used to combine the judgment score with preset rules to generate the final cell classification label.
[0042] As a preferred option, when performing data archiving and background learning, the specific steps include:
[0043] S61: Based on the final cell classification label, cell data and its associated gene expression profiles, VDJ sequences, quality scores and judgment evidence are classified and stored into four independent data subsets;
[0044] S62: Perform feature analysis on the data subset classified as technical noise, and extract and quantify the core parameters that describe the background noise of this sequencing experiment;
[0045] S63: Feed the extracted background noise parameters back to the biological context-based refined coarse filtering step and to data preprocessing, dynamically calibrate the initial filtering threshold, and generate a background noise analysis report.
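The archiving and background learning steps S61 to S63 can be illustrated with a short Python sketch. The four subset labels, the field names (`label`, `umi`, `mito_pct`), and the choice of medians as the "core parameters" are all hypothetical: the text states only that there are four subsets and that noise parameters are extracted from the technical-noise subset.

```python
from statistics import median

# Hypothetical labels: the text specifies four subsets but not their names.
LABELS = ("high_quality", "special_state", "low_quality", "technical_noise")


def archive_by_label(cells):
    """S61: store each cell record in the subset named by its final label."""
    subsets = {label: [] for label in LABELS}
    for cell in cells:
        subsets[cell["label"]].append(cell)
    return subsets


def background_noise_parameters(noise_cells):
    """S62: summarize the technical-noise subset (ASSUMED summary statistics)."""
    return {
        "n_barcodes": len(noise_cells),
        "median_umi": median([c["umi"] for c in noise_cells]),
        "median_mito_pct": median([c["mito_pct"] for c in noise_cells]),
    }


cells = [
    {"barcode": "c1", "label": "high_quality", "umi": 4000, "mito_pct": 3.0},
    {"barcode": "c2", "label": "technical_noise", "umi": 90, "mito_pct": 22.0},
    {"barcode": "c3", "label": "technical_noise", "umi": 110, "mito_pct": 30.0},
    {"barcode": "c4", "label": "technical_noise", "umi": 70, "mito_pct": 26.0},
]
subsets = archive_by_label(cells)
params = background_noise_parameters(subsets["technical_noise"])
print(params)
```

In a closed-loop setup, `params` would be the values fed back in S63 to recalibrate the first-level thresholds for the next pass; the sketch assumes the technical-noise subset is non-empty.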
[0046] A noise reduction system for single-cell immune repertoire sequencing data comprises:
[0047] The data preprocessing and feature extraction module is used to align and quantify the raw sequencing data, and generate gene expression profile data, VDJ receptor sequence data and initial quality indicators for each cell barcode.
[0048] The refined coarse filtering module is used for preliminary filtering of cells and adjustment of cell type-specific thresholds;
[0049] The bidirectional collaborative noise reduction engine includes an authentication feedback unit and a quality-weighted feedback unit, specifically:
[0050] The authentication feedback unit is used to generate multiple capture suspicion scores based on VDJ sequences and integrate them into gene expression profile clustering analysis;
[0051] The quality-weighted feedback unit is used to calculate the overall cell quality score based on gene expression profiles and to use this score to perform weighted calculations on VDJ clonotype frequencies.
[0052] The intelligent integrated judgment and classification module is used to integrate multi-dimensional features and use a preset judgment model to calculate a retention priority and a classification label for each cell;
[0053] The data archiving and background learning module is used to classify cell data into different datasets and extract background noise parameters based on the classification results.
[0054] The results output module is used to output a set of results including purified data, a weighted clone list, classification and archiving information, and a background noise analysis report.
[0055] The beneficial effects of this invention are:
[0056] 1. Existing technologies rely mainly on static thresholds over a single dimension, such as UMI count or gene count, to filter cells. Such filtering struggles to comprehensively assess and eliminate the multi-dimensional noise caused by mitochondrial gene contamination, ribosomal gene interference, low-quality cells, and background sequences, so purification is incomplete and high-value cell information may be deleted. This invention instead employs a systematic noise reduction scheme integrating multi-parameter dynamic threshold filtering, specific gene contamination analysis, and targeted optimization of VDJ data. It sets dynamic thresholds by integrating multi-dimensional quality control indicators such as UMI count, gene count, and the proportions of mitochondrial and ribosomal genes, and it specifically identifies and filters interfering genes and background sequences. This enables refined, layered removal of complex noise, improving the overall quality of the cell dataset and the accuracy of VDJ receptor sequence analysis, and thus provides a more reliable data foundation for downstream immune repertoire analysis.
[0057] 2. Existing technologies typically filter all cells with static, uniform thresholds, which tends to misclassify highly active immune cells in specific biological states as low-quality cells and eliminate them. This invention employs a two-level adaptive filtering strategy based on biological context. First, an initial filtering threshold is set based on the background noise distribution. Then, after cell subpopulations are identified, a secondary quality control threshold matching the biological state of specific functional subpopulations is applied. This upgrades quality control from a purely technical screen to a refined process that incorporates biological understanding, avoiding the erroneous removal of valid cells, preserving specific cell subpopulations that are crucial for immune research, and enhancing the depth of biological interpretation of the data.
[0058] 3. Existing technologies often analyze gene expression data and VDJ sequence data in isolation or simply merge them, and lack a deep two-way verification mechanism, which makes it difficult to identify cell identity confusion caused by technical noise such as multiple cell capture. This invention designs a collaborative noise reduction engine with a two-way feedback loop. The first loop uses VDJ sequence information to verify the uniqueness of cell identity and generate a multiple capture suspicion score, which is fed back to gene expression clustering analysis to separate confounding signals. Through this two-way collaboration of multi-omics information, the scheme enhances the accuracy of cell identity determination, improves the ability to identify and separate confounding cell signals, and helps to obtain purer cell subpopulations.
[0059] 4. Compared to existing technologies that mainly rely on preset fixed rules for cell screening, which cannot flexibly adapt to the specific needs of different research objectives for cell preservation strategies, this invention introduces an intelligent decision-making module with a machine learning model at its core. This module can integrate multi-dimensional features and allow users to configure priority weights according to specific analytical objectives such as enriching rare clonal types, thereby dynamically adjusting the decision threshold. The advantage of this solution is that it realizes the intelligence and customizability of cell filtering and preservation decisions, enabling the noise reduction process to accurately serve the specific scientific objectives of downstream analysis, and improving the flexibility and purposefulness of the analysis.
[0060] 5. Compared to existing technologies where data processing workflows are mostly unidirectional pipelines, lacking the ability to learn from the final results and optimize the front-end workflow, this invention proposes a closed-loop background learning and feedback mechanism. By analyzing cell data identified as technical noise, it extracts background noise parameters specific to this experiment and dynamically feeds them back to the quality control steps at the front end of the workflow to calibrate the initial filtering threshold. The advantage of this scheme is that it enables the entire noise reduction system to have self-iterative and adaptive capabilities, allowing for dynamic optimization based on the characteristics of different experimental batches, greatly improving the robustness of data cleaning and the relevance of the experiment.
Attached Figure Description
[0061] Figure 1 is a flowchart of the noise reduction method for single-cell immune repertoire sequencing data according to the present invention.
[0062] Figure 2 is a schematic diagram of the structure of the noise reduction system for single-cell immune repertoire sequencing data according to the present invention.
Detailed Implementation
[0063] The present invention will be further described below with reference to the accompanying drawings and embodiments.
[0064] Referring to Figure 1, the present invention provides an embodiment of a noise reduction method for single-cell immune repertoire sequencing data, comprising the following steps:
[0065] Step 1: Data Preprocessing and Feature Extraction
[0066] The raw sequencing data generated from single-cell immune repertoire sequencing is aligned and quantified to generate gene expression profile data and VDJ receptor sequence data for each cell barcode. Initial quality indicators are calculated for each cell, including: total UMI count, number of detected genes, mitochondrial gene expression percentage, ribosomal gene expression percentage, and cell cycle score.
[0067] Step 2: Refined Coarse Filtering Based on Biological Context
[0068] The specific process is as follows:
[0069] S21: Based on the background noise distribution, a first-level adaptive threshold is set and applied for preliminary filtering, specifically including:
[0070] S211: Identify and count the low-quality and background droplet sets in all cell barcodes, and calculate the background distribution of cell quality control indicators in the set. The quality control indicators include the total number of UMIs and the proportion of mitochondrial gene expression.
[0071] S212: Based on the background distribution, the first-level filtering threshold is dynamically calculated using the following formula:
[0072] Lower threshold of the UMI count: ;
[0073] Upper threshold of the mitochondrial gene percentage: ;
[0074] where the coefficients in the two formulas are preset, and the median operator denotes taking the median;
[0075] S213: Apply the first-level filtering threshold and remove cells that meet any of the following conditions:
[0076] The total UMI count is less than the lower threshold of the UMI count;
[0077] The proportion of mitochondrial genes is greater than the upper threshold of the mitochondrial gene percentage;
[0078] S22: Perform cell subpopulation identification on the initially filtered data, specifically including:
[0079] S221: Perform dimensionality reduction and clustering on the gene expression profile data of the cells after preliminary filtering to obtain preliminary cell clustering groups;
[0080] S222: Based on known cell type marker genes, each cell cluster is annotated to identify cell subpopulations with specific biological functions, including effector immune cell subpopulations with high metabolic activity.
[0081] S23: For the specific functional subgroups identified, a secondary quality control threshold matching their biological state is applied for secondary filtering, specifically including:
[0082] S231: Predefine corresponding threshold adjustment factors for different biologically functional cell subpopulations, where the adjustment factor for the proportion of mitochondrial genes is greater than 1 for effector immune cell subpopulations with high metabolic activity.
[0083] S232: For a specific functional cell subset, calculate its applicable secondary quality control threshold using the following formula:
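[0084] $T_{\text{mito}}^{(2)} = k_s \cdot T_{\text{mito}}$;
[0085] where $T_{\text{mito}}^{(2)}$ is the secondary quality control threshold, $k_s$ is the threshold adjustment factor of the cell subpopulation, and $T_{\text{mito}}$ is the first-level upper threshold of the mitochondrial gene percentage;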
[0086] S233: Apply the secondary quality control threshold within the corresponding subpopulation, removing only those cells whose proportion of mitochondrial genes is greater than the secondary quality control threshold. Cells that are below the secondary quality control threshold but above the first-level upper threshold of the mitochondrial gene percentage are retained and marked as having a specific biological state.
[0087] In this embodiment, an adaptive first-level filtering threshold is first set based on the background noise distribution: the background distributions of the UMI count and the mitochondrial gene proportion are computed over low-quality and background droplets, and the resulting thresholds are applied to remove low-quality cells. The filtered data then undergoes dimensionality reduction, clustering, and cell subpopulation annotation to identify specific functional subpopulations, such as highly metabolically active effector immune cells. For these subpopulations, a predefined threshold adjustment factor is used to calculate a higher secondary threshold for the mitochondrial gene proportion, which is applied for secondary filtering. Only cells within the subpopulation that exceed the secondary threshold are removed; cells that exceed the first-level threshold but fall below the secondary threshold are retained and marked as having a specific biological state. This two-level adaptive filtering incorporates biological context and avoids the one-size-fits-all limitation of traditional quality control: it removes background noise and low-quality cells while retaining and identifying special functional cells, such as those with high metabolic activity, improving the accuracy and depth of biological interpretation in single-cell data analysis.
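The two-level filtering of steps S21 to S23 can be sketched as below. The threshold formulas are assumptions made for illustration (a coefficient times the background median for the UMI floor; a scaled background median plus an offset for the mitochondrial ceiling), as are the coefficient values, field names, and subset labels. For brevity the sketch also assumes subpopulation labels are already available, so both levels are applied in one pass rather than around a clustering step.

```python
from statistics import median


def first_level_thresholds(bg_umis, bg_mito_pcts, alpha=3.0, beta=1.0, gamma=5.0):
    """S212 (ASSUMED form): thresholds derived from background-droplet medians."""
    umi_min = alpha * median(bg_umis)
    mito_max = beta * median(bg_mito_pcts) + gamma
    return umi_min, mito_max


def two_level_filter(cells, umi_min, mito_max, adjustment_factors):
    """S213 + S23: keep cells passing the UMI floor and their subset's mito ceiling.

    adjustment_factors maps a subset label to its factor (> 1 for highly
    metabolically active effector subsets); unlisted subsets use 1.0.
    """
    kept = []
    for cell in cells:
        if cell["umi"] < umi_min:
            continue  # below the first-level UMI floor
        secondary_max = adjustment_factors.get(cell["subset"], 1.0) * mito_max
        if cell["mito_pct"] > secondary_max:
            continue  # exceeds even the relaxed secondary threshold
        # Above the first-level ceiling but rescued by the secondary one.
        special = cell["mito_pct"] > mito_max
        kept.append({**cell, "special_state": special})
    return kept


umi_min, mito_max = first_level_thresholds([80, 100, 120], [4.0, 5.0, 6.0])
cells = [
    {"barcode": "a", "umi": 1000, "mito_pct": 5.0, "subset": "naive"},
    {"barcode": "b", "umi": 200, "mito_pct": 5.0, "subset": "naive"},
    {"barcode": "c", "umi": 1000, "mito_pct": 12.0, "subset": "effector"},
    {"barcode": "d", "umi": 1000, "mito_pct": 12.0, "subset": "naive"},
    {"barcode": "e", "umi": 1000, "mito_pct": 20.0, "subset": "effector"},
]
kept = two_level_filter(cells, umi_min, mito_max, {"effector": 1.5})
print([(c["barcode"], c["special_state"]) for c in kept])
```

With these example values, the effector cell at 12% mitochondrial reads is retained and flagged, while a naive cell with the same percentage is removed, which is the behaviour the embodiment describes.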
[0088] Step 3: Perform bidirectional collaborative noise reduction
[0089] Based on the data generated by feature extraction, a noise reduction engine with bidirectional information feedback is run to obtain multi-dimensional cell features.
[0090] Specifically, the noise reduction engine, which includes two-way information feedback, comprises a first feedback loop and a second feedback loop.
[0091] The first feedback loop of the noise reduction engine is based on VDJ receptor sequence data to verify the integrity of the receptor chain under a single cell barcode, generate evidence identifying potential multiple capture events, and feed this evidence back into the cluster analysis of gene expression profile data to separate confounding cell signals.
[0092] The second feedback loop of the noise reduction engine calculates a comprehensive quality score characterizing cell integrity and activity based on gene expression profile data and initial quality indicators. This score is then used to weight the VDJ clonotype frequency to improve the reliability of clonal quantification analysis based on VDJ receptor sequence data.
[0093] In this embodiment, bidirectional collaborative noise reduction is performed by a noise reduction engine containing two feedback loops. The first loop verifies the integrity of the receptor chains under a single cell barcode based on VDJ receptor sequence data, generates evidence identifying potential multiple capture events, and feeds this evidence back into the clustering analysis of gene expression profiles to separate confounding cell signals. The second loop calculates a comprehensive quality score characterizing cell integrity and activity based on gene expression data and initial quality indicators, and uses this score to weight the VDJ clonotype frequency, thereby improving the reliability of clonotype quantification. By cross-verifying and weighting gene expression and VDJ sequence information in both directions, the scheme strengthens the uniqueness of cell identification and reduces noise caused by multiple cell capture or low-quality data, improving the accuracy of integrated multi-omics analysis and the reliability of biological interpretation.
[0094] In one aspect of this embodiment, the first feedback loop of the noise reduction engine, when in operation, specifically includes:
[0095] S31: Identifying contradictory evidence of cell identity in single-cell VDJ data, specifically including:
[0096] S311: For each valid cell barcode, identify all functional T-cell receptor and B-cell receptor sequences with complete open reading frames and no nonproductive rearrangements in its VDJ sequencing data;
[0097] S312: Based on receptor chain type and pairing rules, pairing analysis is performed on the identified functional sequences to identify potential effective receptor pairs under each barcode;
[0098] S313: Define and detect cell identity conflict events, denoted as event E, specifically:
[0099] For T cells, event E is defined as: under one cell barcode, there is more than one distinct functional TCRα chain and more than one distinct functional TCRβ chain;
[0100] For B cells, event E is defined as: under one cell barcode, there is more than one distinct functional immunoglobulin heavy chain and more than one distinct functional immunoglobulin light chain.
[0101] S314: For the cell barcode that triggers event E, calculate and record the strength of contradictory evidence. The calculation formula is as follows:
[0102] For T cells: ;
[0103] For B cells: ;
[0104] where the result is the strength of contradictory evidence, and the four count variables are the numbers of functional receptor chains of the corresponding types (TCRα, TCRβ, immunoglobulin heavy chain, and immunoglobulin light chain);
[0105] S32: Quantitative calculation of the multiple capture suspicion score, specifically including:
[0106] S321: Based on the strength of the contradictory evidence obtained, calculate the original multiple capture suspicion score for each cell barcode using the following formula:
[0107] ;
[0108] where the formula relates the original multiple capture suspicion score for cell barcode i to the strength of contradictory evidence for cell barcode i and a preset scaling constant;
[0109] S322: The multiple capture suspicion score for cell i is then obtained from the original score according to two cases: when , ; when , ;
[0110] S33: Integrating scores in gene expression profiling cluster analysis to separate confounding signals, specifically including:
[0111] S331: When performing cell clustering analysis based on gene expression profile data, first calculate the gene expression similarity distance between cells;
[0112] S332: Obtain the multiple capture suspicion score for cell i and cell j;
[0113] S333: Introduces a penalty term to correct the distance between cells, and calculates the integrated distance using the following formula:
[0114] ;
[0115] where the formula combines the integrated distance matrix, the gene expression similarity distance between cells, the preset penalty intensity coefficient, and the absolute value of the difference between the suspicion scores of the two cells.
[0116] S334: Use the corrected distance matrix for further dimensionality reduction and clustering analysis.
[0117] S335: After clustering is completed, automatically identify and label those cell clusters that are mainly composed of cells with high multiple capture suspicion scores in the clustering results, and mark them as suspected multiple capture cell clusters.
[0118] In this embodiment, the first feedback loop of the noise reduction engine first identifies contradictory evidence of cell identity in single-cell VDJ data: by identifying functional T-cell or B-cell receptor sequences with complete open reading frames under each cell barcode, and based on receptor chain type and pairing rules, it detects events E where multiple different functional chains exist under one barcode, and calculates the strength of contradictory evidence C. Subsequently, based on C, the multiple capture suspicion score of each cell is quantified using a formula. Finally, in gene expression profile clustering analysis, this score is introduced as a penalty term to correct the similarity distance between cells, and this is used for subsequent dimensionality reduction and clustering, automatically identifying cell clusters composed of highly suspicious cells. This scheme effectively verifies the uniqueness of cell identity through VDJ sequence information and dynamically feeds back multiple capture suspicion to gene expression clustering, thereby significantly improving the ability to identify and separate mixed cell signals, helping to obtain purer cell subpopulations with clearer biological significance, and improving the accuracy of single-cell multi-omics data integration analysis.
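A minimal Python sketch of S32 and S33 follows. The exact formulas are not reproduced here, so the forms used are assumptions: the raw score is taken as the evidence strength divided by the scaling constant, the final score is the raw score capped at 1, and the penalty is added to the expression distance. The parameter values and function names are likewise hypothetical.

```python
def suspicion_score(strength, scale=4.0):
    """S321-S322 (ASSUMED form): scale the evidence strength, then cap at 1."""
    raw = strength / scale
    return min(raw, 1.0)


def integrated_distances(expr_dist, scores, penalty=0.5):
    """S333 (ASSUMED additive form): D'_ij = D_ij + penalty * |S_i - S_j|.

    expr_dist: square matrix (list of lists) of expression distances.
    scores: multiple capture suspicion score per cell, same order.
    """
    n = len(scores)
    return [
        [expr_dist[i][j] + penalty * abs(scores[i] - scores[j]) for j in range(n)]
        for i in range(n)
    ]


scores = [suspicion_score(c) for c in (0, 2, 8)]
expr_dist = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
corrected = integrated_distances(expr_dist, scores)
for row in corrected:
    print(row)
```

Because the penalty depends only on the difference in suspicion scores, cells with similar scores keep their original distances, while a clean cell and a highly suspicious cell are pushed apart. This is what lets suspicious barcodes gather into their own clusters for labelling in S335.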
[0119] In another aspect of this embodiment, the second feedback loop of the noise reduction engine, when in operation, specifically includes:
[0120] S41: Calculate the overall cell quality score based on multi-dimensional gene expression characteristics, specifically including:
[0121] S411: For each cell i, extract and compute a set of initial quality and state feature vectors from its gene expression profile, including:
[0122] Activity indicators: total UMI count, number of genes detected;
[0123] Health status indicators: percentage of mitochondrial genes, percentage of ribosomal genes;
[0124] Cell cycle metrics: Cell cycle scores calculated based on a specific gene set;
[0125] S412: Standardize each component of the feature vector to obtain the standardized feature vector. The calculation formula is as follows:
[0126] $z_{i,k} = \dfrac{x_{i,k} - \mu_k}{\sigma_k}$;
[0127] where $z_{i,k}$ is the standardized value of the k-th feature of cell i, $x_{i,k}$ is the original value of the k-th feature of cell i, $\mu_k$ is the mean of the k-th feature over all cells in this batch of data, and $\sigma_k$ is the standard deviation of the k-th feature over all cells in this batch of data;
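[0128] S413: Calculate the baseline quality score of each cell using the following formula:
[0129] $B_i = \sum_{k} w_k \, f_k(z_{i,k})$;
[0130] where $B_i$ is the baseline quality score of cell i, $z_{i,k}$ is the standardized value of the k-th feature of cell i, $w_k$ is the preset importance weight of the k-th feature, and $f_k$ is the scoring transformation function defined for the k-th feature, which maps the standardized feature value to a positive quality contribution.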
[0131] S414: For cell i, based on its initially annotated cell type, apply a cell type-specific correction to the standardized mitochondrial gene proportion, and then recalculate the baseline quality score using the corrected value. The correction formula is as follows:
[0132] $\tilde{z}_{i,\mathrm{mito}} = z_{i,\mathrm{mito}} - \delta_i$;
[0133] where $\tilde{z}_{i,\mathrm{mito}}$ is the standardized value of the proportion of mitochondrial genes in cell i after correction, $z_{i,\mathrm{mito}}$ is the original standardized value of the proportion of mitochondrial genes in cell i, and $\delta_i$ is the preset correction offset for cell i, determined by its annotated cell type;
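[0134] S415: Calculate the overall quality score of cell i using the following formula:
[0135] $Q_i = \dfrac{1}{1 + \exp\left(-(B_i - b)\right)}$;
[0136] where $Q_i$ is the overall quality score of cell i, $B_i$ is the baseline quality score of cell i after the correction in S414, $\exp$ is the natural exponential function, and $b$ is the preset offset parameter;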
[0137] S42: Weight the VDJ clonotype frequency using the calculated overall cell quality score, specifically including:
[0138] S421: Identify and define each unique VDJ clonotype j;
[0139] S422: For each clonotype j, find the set $C_j$ of all cells in which that clonotype was detected;
[0140] S423: Calculate the weighted frequency of clonotype j using the following formula:
[0141] $W_j = \sum_{i \in C_j} Q_i$;
[0142] that is, the overall quality scores $Q_i$ of all cells carrying this clonotype are summed to give the weighted frequency $W_j$.
[0143] S424: The weighted frequencies are normalized to obtain the relative abundance of each clonotype in high-quality cells.
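Steps S411 to S424 can be sketched in a few lines. The sketch assumes z-score standardization with the population standard deviation, a weighted sum of transformed features, and a logistic squashing with an offset. The transform functions, the weights, and the names `quality_scores` and `weighted_clonotype_abundance` are hypothetical, and the cell type-specific correction of S414 is omitted for brevity.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Z-score a list of raw feature values (population std; an assumption)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def logistic(x):
    """Numerically stable logistic function with values in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def quality_scores(features, weights, transforms, offset=0.0):
    """features: {name: [raw value per cell]}; weights and transforms are
    hypothetical presets keyed by the same feature names."""
    z = {k: standardize(features[k]) for k in weights}
    n_cells = len(next(iter(z.values())))
    scores = []
    for i in range(n_cells):
        base = sum(weights[k] * transforms[k](z[k][i]) for k in weights)
        scores.append(logistic(base - offset))
    return scores


def weighted_clonotype_abundance(clonotypes, scores):
    """Sum quality scores per clonotype, then normalize to relative abundance.
    Cells without a detected clonotype are passed as None and skipped."""
    totals = {}
    for clonotype, q in zip(clonotypes, scores):
        if clonotype is not None:
            totals[clonotype] = totals.get(clonotype, 0.0) + q
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {c: w / grand_total for c, w in totals.items()}
```

With this weighting, a clonotype seen in two high-scoring cells outweighs one seen in two low-scoring cells, even though an unweighted count would rank them equally.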
[0144] In this embodiment, before the weighted frequency of clonotype j is calculated, the comprehensive quality score of cell i undergoes a final correction based on the multiple capture suspicion score obtained from the first feedback loop. The corrected comprehensive quality score is then used for the weighted frequency calculation, so that cells suspected of multiple capture contribute less to the clonotype frequency. The correction formula is:
[0145] ;
[0146] where the inputs are the multiple capture suspicion score of cell i and its comprehensive quality score, and the output is the corrected comprehensive quality score.
[0147] In this embodiment, the second feedback loop of the noise reduction engine first calculates a comprehensive quality score for each cell from its gene expression profile. Multidimensional features (total UMI count, number of detected genes, proportions of mitochondrial and ribosomal genes, and cell cycle scores) are extracted and standardized, then combined with preset weights and with a cell type-specific correction of the mitochondrial gene proportion (e.g., for highly metabolically active cells) to give a baseline quality score. A logistic function converts this baseline score into a comprehensive quality score between 0 and 1. The score is then used to weight the VDJ clonotype frequency: the weighted frequency of a clonotype equals the sum of the comprehensive quality scores of all cells carrying it, and normalization yields its relative abundance among high-quality cells. The multiple capture suspicion scores from the first feedback loop are also integrated at this point to correct the comprehensive quality score and further reduce the contribution of suspicious cells. By building cell state and activity into clonal quantification, this approach reduces the bias in clonotype frequency estimates caused by low-quality or dying cells and, together with the correction for multiple capture events, improves the accuracy and biological reliability of VDJ-based clonal abundance analysis of the immune repertoire.
[0148] Step 4: Intelligent Comprehensive Judgment and Classification
[0149] Multi-dimensional cellular features are integrated, and a preset decision model is used to calculate a retention priority and a classification label for each cell. These multi-dimensional features include the comprehensive quality score and the multiple capture evidence, specifically:
[0150] S51: Construct a comprehensive cell feature vector for model determination, specifically including:
[0151] S511: For each cell i, integrate multivariate heterogeneous data from previous steps to generate a comprehensive feature vector. The multivariate heterogeneous data includes the cell comprehensive quality score, multiple capture suspicion score, initial quality control indicators, and the expression levels of marker genes characterizing cell type or functional state.
[0152] S512: Standardize each original feature component in the comprehensive feature vector to obtain a standardized feature vector;
[0153] S52: Input the comprehensive feature vector into the preset intelligent judgment model. The intelligent judgment model integrates multi-dimensional features and makes a comprehensive decision based on the preset analysis target.
[0154] S53: Obtain and apply the judgment output of the intelligent judgment model to perform the final classification of cells, specifically including:
[0155] S531: Compare the final judgment score of each cell with the preset threshold to generate a preliminary binary judgment result;
[0156] S532: Combine biological state marker information to correct the preliminary judgment result: If a cell is initially judged to be excluded, but it has been marked as a specific biological functional state and its comprehensive quality score is higher than the preset lower limit of the state, then the judgment result of the cell is corrected to be retained and a special marker is assigned.
[0157] S533: Output the final classification label for each cell, where the classification labels include: high-quality single cells, suspicious but biologically valuable special cases, and clearly technically noisy cells.
[0158] In this embodiment, the intelligent judgment model used in the intelligent comprehensive judgment and classification step is a supervised machine learning model that integrates multi-dimensional features and preset analysis objectives to perform the final classification of cells, and includes the following components:
[0159] A11: Model input module, used to receive the standardized integrated feature vector of each cell;
[0160] A12: Target weight configuration module, used to integrate user-specified analysis target priority weights;
[0161] A13: Model inference engine, built on gradient boosting tree algorithm, used to calculate the final judgment score of each cell based on the input feature vector and target weights;
[0162] A14: Decision and Output Module, used to combine the judgment score with preset rules to generate the final cell classification label.
[0163] In this embodiment, the target weight configuration module is specifically configured as follows:
[0164] Receive the priorities selected by the user from the predefined analysis objectives, and assign a weight value to each priority. The predefined analysis objectives include:
[0165] Maximizing immune cell diversity: the corresponding weight configuration tends to retain more cell subtypes;
[0166] Enriching rare clonotypes: the corresponding weight configuration tends to retain cells carrying low-frequency clonotypes;
[0167] Focusing on highly active effector cells: the corresponding weight configuration tends to retain cells that highly express effector genes;
[0168] Furthermore, the target weights are integrated into the decision-making process of the model inference engine in the form of a vector, and take effect by adjusting the decision thresholds of the features related to each target in the model.
[0169] In this embodiment, the model inference engine is a gradient boosting tree model, and the final decision score of cell i is calculated as follows:
[0170] $S_i = \sum_{t=1}^{T} \eta \, h_t(\mathbf{z}_i; \mathbf{w})$;
[0171] where $S_i$ is the final decision score of cell i, $T$ is the total number of decision trees, $h_t$ is the prediction output function of the t-th decision tree, $\eta$ is the learning rate, $\mathbf{z}_i$ is the standardized comprehensive feature vector of cell i, and $\mathbf{w}$ is the target weight vector;
[0172] Furthermore, during the growth of each decision tree, the feature importance evaluated at each node split is affected by the target weight vector, so that features more relevant to high-weight targets receive higher priority during splitting.
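The additive structure of the decision score can be illustrated with a toy ensemble. This sketch is an assumption-laden illustration only: the text does not specify how the target weight vector enters each tree, so the hypothetical `Stump` class below simply lowers its split threshold in proportion to the weight of the analysis target it is associated with. The names `Stump`, `shift`, and `decision_score` are invented for the example, and a production system would use a trained gradient boosting library instead of hand-written stumps.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Stump:
    """One-split decision tree (hypothetical stand-in for a trained tree)."""
    feature: int      # index into the standardized feature vector
    threshold: float  # base split threshold
    target: int       # index of the analysis target tied to this feature
    shift: float      # assumed: how strongly the target weight lowers the threshold
    low: float        # output when the feature is below the effective threshold
    high: float       # output when the feature is at or above it

    def predict(self, x: Sequence[float], w: Sequence[float]) -> float:
        effective = self.threshold - self.shift * w[self.target]
        return self.high if x[self.feature] >= effective else self.low


def decision_score(x, w, trees, learning_rate=0.1):
    """Sum of learning-rate-scaled tree outputs for one cell."""
    return sum(learning_rate * tree.predict(x, w) for tree in trees)
```

Raising the weight of a target lowers the bar on its related feature, so a borderline cell can move from the low branch to the high branch of the corresponding tree.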
[0173] In this embodiment, the decision-making and output module is specifically configured as follows:
[0174] Receive the final decision score for each cell from the model inference engine;
[0175] Apply a classification threshold derived from the score distribution of the entire cell population, where the classification threshold is calculated from a preset expected cell retention rate;
[0176] The final score is combined with the classification threshold and biological state marker information, and the final classification label is output according to the following rules:
[0177] If the final score is greater than or equal to the classification threshold, it is classified as a high-quality single cell.
[0178] If the final score is less than the classification threshold, but the cell is in the whitelist of biological status marker information, it is classified as a special case cell with biological value.
[0179] If the final score is less than the classification threshold and the cell is not in the whitelist of biological state marker information, it is classified as a technical noise cell.
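The three rules above can be sketched as a small function. It assumes the classification threshold is the score of the last cell retained when the population is ranked by score and cut at the expected retention rate; the text only says the threshold is calculated from that rate, so this choice, the function name `classify_cells`, and the label strings are illustrative.

```python
import math


def classify_cells(scores, whitelist, retention_rate):
    """scores: {cell_id: final decision score};
    whitelist: set of cell ids carrying a biological state marker;
    retention_rate: expected fraction of cells to retain, in (0, 1]."""
    if not scores:
        return None, {}
    ranked = sorted(scores.values(), reverse=True)
    n_keep = max(1, min(len(ranked), math.ceil(retention_rate * len(ranked))))
    threshold = ranked[n_keep - 1]  # assumed: score of the last retained cell

    labels = {}
    for cell, score in scores.items():
        if score >= threshold:
            labels[cell] = "high_quality_single_cell"
        elif cell in whitelist:
            labels[cell] = "biologically_valuable_special_case"
        else:
            labels[cell] = "technical_noise"
    return threshold, labels
```

Because the comparison is `>=`, cells tied with the threshold score are all retained, so the realized retention rate can slightly exceed the preset value.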
[0180] In this embodiment, the present invention constructs a comprehensive feature vector integrating multi-dimensional features such as comprehensive quality score, multiple capture suspicion score, initial quality control indicators, and cell marker gene expression levels, and inputs it into an intelligent judgment model with gradient boosting tree as the inference engine. The core innovation of this model lies in its target weight configuration module, which allows users to assign priority weights according to preset analysis goals such as "maximizing immune cell diversity" and "enriching rare clones," thereby dynamically adjusting the decision thresholds of relevant features in the model. The model calculates the final judgment score for each cell based on weighted features and combines the threshold set based on the score distribution with the biological status whitelist rules to intelligently classify cells into "high-quality single cells," "special cells with biological value," or "technical noise cells." This scheme organically integrates quantitative indicators and qualitative biological knowledge through machine learning, realizing intelligent and customizable cell filtering and retention decisions. It can effectively remove technical noise while accurately retaining special cells (such as highly active effector cells) with key biological value for specific research goals, significantly improving the accuracy, interpretability, and research goal-oriented flexibility of downstream analysis of single-cell multi-omics data.
[0181] Step 5: Data Archiving and Background Learning
[0182] Based on the judgment results, the cell data are classified and stored in different datasets. Feature analysis is performed on the datasets judged as low-quality and technical noise to extract background noise parameters for this sequencing experiment, specifically including:
[0183] S61: Based on the final cell classification label, cell data and its associated gene expression profiles, VDJ sequences, quality scores and judgment evidence are classified and stored into four independent data subsets;
[0184] S62: Perform feature analysis on the data subset classified as technical noise, extract and quantify the core parameters describing the background noise of this sequencing experiment, specifically including:
[0185] S621: The set of background cells is the union of the subset of Class III technical noise cells and the subset of Class IV excluded background cells;
[0186] S622: Calculate the distribution of quality control indicators for the background cell set and calculate key statistics, including the median and 95th percentile of UMI counts, the median number of genes detected, and the median percentage of mitochondrial gene expression.
[0187] S623: In the background cell set, calculate the average expression level of all genes, and define the genes whose average expression level is higher than the preset global threshold as the background gene feature set.
[0188] S624: Integrate the calculated statistics with the background gene feature set as the background noise parameter output for this experiment;
[0189] S63: Feed back the extracted background noise parameters to the data preprocessing step and to the refined coarse filtering step based on biological context, dynamically calibrate the initial filtering threshold, and generate a background noise analysis report.
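Steps S621 to S624 amount to a handful of summary statistics over the background cell set. The sketch below assumes each background cell is a dictionary with `umi`, `n_genes`, `pct_mito`, and a sparse `expr` mapping from gene to count, uses a linear-interpolation percentile, and treats genes absent from a cell as zero when averaging. The record layout and the names `percentile` and `background_parameters` are hypothetical.

```python
from statistics import median


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def background_parameters(cells, gene_threshold):
    """cells: Class III and Class IV records merged into one list (S621).
    gene_threshold: preset global threshold on mean expression (S623)."""
    if not cells:
        raise ValueError("background cell set is empty")
    umis = [c["umi"] for c in cells]

    # Mean expression per gene over all background cells (missing = 0).
    totals = {}
    for c in cells:
        for gene, count in c["expr"].items():
            totals[gene] = totals.get(gene, 0.0) + count
    n = len(cells)
    background_genes = sorted(g for g, t in totals.items() if t / n > gene_threshold)

    return {
        "umi_median": median(umis),
        "umi_p95": percentile(umis, 95),
        "n_genes_median": median(c["n_genes"] for c in cells),
        "pct_mito_median": median(c["pct_mito"] for c in cells),
        "background_genes": background_genes,
    }
```

The returned dictionary corresponds to the background noise parameter output of S624 and can be stored alongside the archived subsets for use on the next batch.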
[0190] In this embodiment, when cell data and its associated gene expression profiles, VDJ sequences, quality scores, and judgment evidence are classified and stored into four independent data subsets based on the final cell classification label, the definitions and storage contents of the four independent data subsets are as follows:
[0191] Class I core high-quality cell subset: stores data classified as high-quality single cells, which is used for all subsequent core immune repertoire and transcriptome analyses;
[0192] Class II biological special case cell subset: stores data on special case cells classified as having biological value. This subset is retained for exploratory analysis of specific biological questions, and each record is associated with the biological basis for its retention;
[0193] Class III technical noise cell subset: stores data classified as technical noise cells, for background noise feature analysis;
[0194] Class IV excluded background subset: stores data explicitly identified as empty droplets or as having extremely low UMI and gene counts. This subset is used only for statistical calculations of the background distribution.
[0195] In this embodiment, when performing feature analysis on the data subset classified as technical noise, the core parameters that describe the background noise of this sequencing experiment are extracted and quantified.
[0196] In this embodiment, when feeding back the extracted background noise parameters to the data preprocessing stage to dynamically calibrate the initial filtering threshold, the specific method is as follows:
[0197] The first-level lenient threshold is replaced by a dynamic value based on the background distribution, and the update formula includes:
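[0198] The updated threshold for UMI count is: $T_{\mathrm{UMI}}^{\mathrm{new}} = \max\left(T_{\min},\ \alpha \cdot \tilde{m}_{\mathrm{UMI}}\right)$;
[0199] where $T_{\min}$ is the absolute minimum protection threshold, $\alpha$ is a preset coefficient greater than 1, and $\tilde{m}_{\mathrm{UMI}}$ is the median UMI count of the background cell set;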
[0200] In data preprocessing, the background gene feature set is recorded, which can be excluded from the candidate gene list or marked in subsequent analysis steps such as selection of highly variable genes.
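A minimal sketch of the threshold update follows. It assumes the update takes the form of a floor-protected multiple of the background median, which is the reading suggested by the three parameters named above (an absolute minimum protection threshold, a coefficient greater than 1, and the median UMI count); the function and argument names are illustrative.

```python
def updated_umi_threshold(background_median_umi, alpha, floor):
    """Return max(floor, alpha * background median).

    alpha: preset coefficient, required to be greater than 1.
    floor: absolute minimum protection threshold.
    """
    if alpha <= 1:
        raise ValueError("alpha must be greater than 1")
    return max(floor, alpha * background_median_umi)
```

The floor matters for very clean runs: when the background median is tiny, the scaled value would otherwise fall so low that almost no barcode is filtered.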
[0201] In this embodiment, the generated background noise analysis report includes the following:
[0202] Cell count statistics for each data subset;
[0203] Histogram of UMI and gene number distribution of background cell set;
[0204] Background gene feature set: gene list and its functional annotations;
[0205] Comparison of filtering thresholds before and after dynamic updates based on background parameters;
[0206] Summary of the overall assessment of the background noise level in this experiment.
[0207] In this embodiment, the present invention archives cell data based on the final classification results and achieves background learning through in-depth analysis of the "technical noise cells" and "background" subsets. First, all cells are systematically stored in four independent subsets according to their judgment labels, so that only high-quality data is used for core analysis while low-quality and background cells are archived separately. Next, feature extraction is performed specifically on the cells judged as technical noise: the distributions of quality control indicators such as UMI count, gene number, and mitochondrial proportion are quantified, and a "background gene feature set" of genes with excessively high average expression is defined. Together these constitute the core parameters describing the noise specific to this experiment. Finally, the background noise parameters are fed back to the preprocessing and coarse filtering steps at the beginning of the process, where they replace static empirical values, calibrate the initial filtering threshold, and exclude background genes from subsequent highly variable gene selection; a detailed background noise analysis report is also generated. Through this closed-loop "archiving-learning-feedback" design, the results of the end-stage classification become knowledge that improves the accuracy of front-end quality control. The process thus iterates and adapts on its own, tuning its filtering standards to the noise background of each experimental batch, which improves the accuracy and robustness of data cleaning and its fit to the individual experiment.
[0208] Step 6: Output Results
[0209] The output is a result set comprising purified high-quality cell data, a weighted list of clonotype frequencies, cell classification archive information, and a background noise analysis report.
[0210] As shown in Figure 2, this embodiment also provides a noise reduction system for single-cell immune repertoire sequencing data, including:
[0211] The data preprocessing and feature extraction module is used to align and quantify the raw sequencing data, and to generate gene expression profile data, VDJ receptor sequence data and initial quality indicators for each cell barcode.
[0212] The refined coarse filtering module is used for preliminary filtering of cells and adjustment of cell type-specific thresholds;
[0213] The bidirectional collaborative noise reduction engine includes an authentication feedback unit and a quality-weighted feedback unit, specifically:
[0214] The authentication feedback unit is used to generate multiple capture suspicion scores based on VDJ sequences and integrate them into gene expression profile clustering analysis;
[0215] The quality-weighted feedback unit is used to calculate the overall cell quality score based on gene expression profiles and to use this score to weight the VDJ clonotype frequencies.
[0216] The intelligent comprehensive judgment and classification module is used to integrate multi-dimensional features and use a preset judgment model to calculate a retention priority and a classification label for each cell;
[0217] The data archiving and background learning module is used to classify cell data into different datasets and extract background noise parameters based on the classification results.
[0218] The results output module is used to output a set of results including purified data, a weighted clone list, classification and archiving information, and a background noise analysis report.
[0219] Example 1: Analysis of the tumor-infiltrating T-cell immune repertoire
[0220] This embodiment uses single-cell T-cell receptor (TCR) and transcriptome sequencing data from a human non-small cell lung cancer tissue sample as an example to demonstrate the application of this method. First, the raw sequencing data is aligned and quantified to generate gene expression profiles and TCR sequences for each cell. Initial quality indicators are calculated; for example, the median UMI count of all cell barcodes in this batch of data is approximately 1500, and the median of the background distribution of mitochondrial gene proportion is approximately 8%.
[0221] The system then enters a refined coarse-filtering stage based on biological context. First, it sets a first-level adaptive threshold based on the background noise distribution. For example, the threshold for UMI number is set to 0.2 times the background median, and the threshold for mitochondrial gene proportion is set to 1.5 times the background median plus 2%. After this initial filtering, approximately 8000 cells remain. Next, these cells undergo dimensionality reduction and clustering to identify several subpopulations, including CD8+ effector T cells, regulatory T cells, and exhausted T cells. Among them, a group of CD8+ effector T cells highly expressing GZMK and IFNG is identified as a highly metabolically active subpopulation. For this subpopulation, the system applies a predefined threshold adjustment factor, for example, raising its mitochondrial gene proportion threshold to 1.3 times the first-level threshold. This allows a group of cells within this subpopulation with a slightly higher mitochondrial proportion than the general threshold but active effector function to be retained and labeled as "highly active effector T cells."
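The threshold arithmetic in this example can be checked with a short calculation. The sketch takes the two medians quoted in the preceding paragraph (about 1500 UMIs and about 8% mitochondrial genes) as the inputs, which is an assumption about the values the example intends, and reads "plus 2%" as two percentage points. The function names are illustrative.

```python
def first_level_thresholds(median_umi, median_mito_pct):
    """First-level adaptive thresholds as described in Example 1."""
    umi_min = 0.2 * median_umi                # minimum UMI count to keep a cell
    mito_max = 1.5 * median_mito_pct + 2.0    # maximum mitochondrial percentage
    return umi_min, mito_max


def subpopulation_mito_threshold(mito_max, factor=1.3):
    """Relaxed mitochondrial threshold for a highly metabolically active subpopulation."""
    return factor * mito_max


umi_min, mito_max = first_level_thresholds(1500, 8.0)
relaxed = subpopulation_mito_threshold(mito_max)
print(f"UMI >= {umi_min:.0f}, mito <= {mito_max:.1f}%, relaxed mito <= {relaxed:.1f}%")
```

Under these inputs the first-level filter keeps cells with at least 300 UMIs and at most 14% mitochondrial genes, and the effector subpopulation is allowed up to 18.2%.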
[0222] Next, bidirectional collaborative noise reduction is performed. In the first feedback loop, the system checks the TCR sequences under each cell barcode. For example, under one cell barcode the system identifies two distinct functional TCRα chains and two distinct functional TCRβ chains, which triggers a multiple capture event determination and yields a high suspicion score. During gene expression clustering, this score is used to adjust the similarity distance between this cell and other cells, making it more likely to be separated out during clustering. In the second feedback loop, the system calculates a comprehensive quality score between 0 and 1 based on features such as the total UMI count, the number of detected genes, and the mitochondrial gene proportion of each cell. A resting memory T cell with a high total UMI count and a low mitochondrial proportion might receive a high score of 0.95, while a cell with a low UMI count and a high mitochondrial proportion might receive only 0.2. When TCR clonotype frequencies are calculated, the contribution of each cell carrying a given clonotype is weighted by this quality score. Sequences from high-quality cells therefore count almost fully, while the same sequences from low-quality cells contribute much less, resulting in a more reliable estimate of clonotype abundance.
[0223] Subsequently, the intelligent comprehensive judgment and classification module is activated. The system constructs a feature vector for each cell, containing information such as its comprehensive quality score, multiple capture suspicion score, and expression level of cell type marker genes. In this example, the user sets "focusing on highly active effector cells" as a high-priority analysis target. The gradient boosting tree model assigns higher decision importance to features with high expression of effector genes based on this target weight, ultimately calculating a judgment score for each cell. Combining preset thresholds and a biological whitelist, cells are automatically classified. For example, the vast majority of high-quality effector T cells are classified as "Class I core high-quality cells," a few cells with a high proportion of mitochondria but high expression of effector genes are classified as "Class II biological exception cells," and a large number of cells with extremely low UMI numbers and no complete TCR sequence are classified as "Class III technical noise cells."
[0224] Finally, the data archiving and background learning module archives the classification results. The system specifically analyzes the "Class III technical noise cells" subset, extracting background noise parameters for this experiment. For example, it finds that the 95th percentile of the background UMI count is 50, and identifies a group of genes that are generally highly expressed in the background. These parameters are fed back to the workflow starting point to dynamically update the initial filtering thresholds for the next batch of data or similar data analysis, and to exclude the background gene set from the screening of highly variable genes. Ultimately, the system outputs purified, high-quality cell data, a weighted list of TCR clonal frequency, a detailed cell classification report, and a background noise analysis summary, providing a cleaner and more reliable data foundation for subsequent tumor immune microenvironment research.
[0225] Example 2: Analysis of peripheral blood B cell immune repertoire after vaccination
[0226] This example uses single-cell B-cell receptor and transcriptome sequencing data from peripheral blood mononuclear cells of a person following influenza vaccination to further illustrate the universality of this method. After data preprocessing, approximately 12,000 initial cells were obtained, and the background distribution of their ribosomal gene proportions showed heterogeneity.
[0227] In the refined coarse filtering stage, the first-level filter removed approximately 3000 low-quality cells based on UMI and mitochondrial gene percentage thresholds. After clustering and annotating the remaining cells, the system identified several subpopulations, including plasma cells and activated memory B cells. Among them, the plasma cell subpopulation typically has a high ribosomal gene percentage due to its vigorous antibody synthesis function. The system applied a specific adjusting factor to this subpopulation, appropriately relaxing the ribosomal gene percentage filtering threshold, successfully preserving this group of key functional cells with high secretory activity and avoiding the loss that might occur due to the "one-size-fits-all" approach of traditional methods.
[0228] In the bidirectional collaborative noise reduction engine, the first feedback loop verifies B cell identity. For example, under one barcode the system detects two different functional immunoglobulin heavy chains and one light chain, which triggers a B cell identity conflict event, and the corresponding conflict evidence strength and multiple capture suspicion score are calculated. This information is integrated into the gene expression clustering analysis to help distinguish true cell subpopulations from confounding signals caused by multiple capture. The second feedback loop calculates the overall quality score for each B cell. For example, a B cell in an actively proliferating state may have a high cell cycle score while its other activity indicators are good; after cell type-specific correction, it can still obtain a high quality score. When BCR clonotype frequencies are calculated, these high-quality proliferating cells contribute with high weight, while sequences of the same clonotype that come from low-quality cells about to undergo apoptosis contribute less, so that the expansion of antigen-specific clones after vaccination is reflected more faithfully.
[0229] In this example, the intelligent decision-making module set "enrichment of rare clones" as the core objective. Based on this, the model adjusted its decision-making logic, giving a higher retention bias to cells carrying low-frequency, unique BCR sequences, even if some quality control indicators (such as the number of detected genes) were slightly below the conventional threshold. Ultimately, the cells were classified and archived. Specifically, the system discovered a group of plasma cells carrying rare but potentially neutralizing antibody sequences within the "Class II Biological Special Cases" subset.
[0230] During the data archiving and background learning phase, the system analyzes cells identified as technical noise, quantifies the background characteristics unique to this experiment, and feeds these parameters, such as the typical gene expression profiles of background droplets, back to the system's knowledge base. When analyzing subsequent batches of samples from the same project, the system automatically applies the learned background parameters for more precise initial filtering, achieving self-optimization of the process. The final output of purified data, a weighted list of BCR clones, and cell classification information provides crucial support for accurately assessing the breadth and depth of humoral immune responses after vaccination.
[0231] The embodiments of the present invention have been described in detail above with reference to the accompanying drawings. However, the present invention is not limited to the above embodiments. Within the scope of knowledge possessed by those skilled in the art, various changes can be made without departing from the spirit of the present invention.
Claims
1. A method for denoising single-cell immune repertoire sequencing data, characterized in that it includes the following steps: S11: Data preprocessing and feature extraction: The raw sequencing data generated from single-cell immune repertoire sequencing is aligned and quantified to generate gene expression profile data and VDJ receptor sequence data for each cell barcode, and the initial quality indicators for each cell are calculated. S12: Perform bidirectional collaborative denoising. Based on the data generated by feature extraction, run a noise reduction engine with bidirectional feedback to obtain multi-dimensional cell features. These features include a comprehensive quality score and multiple capture evidence. The comprehensive quality score is calculated using activity indicators, health status indicators, and cell cycle indicators. The noise reduction engine with bidirectional feedback includes a first feedback loop and a second feedback loop. Specifically: The first feedback loop of the noise reduction engine is based on VDJ receptor sequence data to verify the integrity of the receptor chain under a single cell barcode, generate evidence to identify potential multiple capture events, obtain multiple capture evidence, and feed the multiple capture evidence back to the cluster analysis of gene expression profile data to separate confounding cell signals. The second feedback loop of the noise reduction engine calculates a comprehensive quality score characterizing cell integrity and activity based on gene expression profile data and initial quality indicators. This score is then used to weight the VDJ clonotype frequency to improve the reliability of clonal quantification analysis based on VDJ receptor sequence data. S13: Intelligent comprehensive judgment and classification, integrating multi-dimensional cell features, and using a preset judgment model to calculate a retention priority and a classification label for each cell.
The intelligent judgment model is a supervised machine learning model that integrates multi-dimensional features and preset analysis targets to perform the final classification of cells. The priority is the user-specified priority weight of the analysis target, which the intelligent judgment model integrates as a feature. S14: Data archiving and background learning. Based on the judgment results, cell data are classified and stored in different datasets. Feature analysis is performed on datasets judged as low quality and technical noise to extract background noise parameters for this sequencing experiment. S15: Results output, which is a result set comprising purified high-quality cell data, a weighted list of clonotype frequencies, cell classification archive information, and a background noise analysis report.
2. The method for denoising single-cell immune repertoire sequencing data according to claim 1, wherein: Before performing bidirectional collaborative noise reduction, the method also includes refined coarse filtering based on biological context, specifically: S21: Initial filtering is performed by setting and applying a first-level adaptive threshold according to the background noise distribution; S22: Cell subpopulation identification is performed on the initially filtered data; S23: For the specific functional subpopulations identified, a secondary quality control threshold matching their biological state is applied for secondary filtering.
3. The method for denoising single-cell immune repertoire sequencing data according to claim 2, wherein the first feedback loop of the noise reduction engine, in operation, specifically includes:
S31: Identifying contradictory evidence of cell identity in the single-cell VDJ data;
S32: Quantitatively calculating a multiple-capture suspicion score;
S33: Integrating the suspicion score into the clustering analysis of the gene expression profiles to separate confounded signals.
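The claim does not say how the suspicion score enters the clustering analysis in step S33. One possible reading, sketched below, is to down-weight suspected multiple-capture barcodes when a cluster centroid is computed, so that mixed signals pull the centroid less. The function `weighted_centroid` and the weighting rule `1 - suspicion` are assumptions made for illustration only.

```python
def weighted_centroid(profiles, suspicion_scores):
    """Centroid of expression profiles, weighting each cell by 1 - suspicion.

    profiles: list of equal-length numeric vectors, one per cell.
    suspicion_scores: multiple-capture suspicion per cell, assumed in [0, 1].
    """
    if len(profiles) != len(suspicion_scores):
        raise ValueError("one suspicion score is required per profile")
    weights = [1.0 - s for s in suspicion_scores]
    total = sum(weights)
    if total <= 0:
        raise ValueError("all cells are fully suspect; centroid is undefined")
    n_dims = len(profiles[0])
    return [
        sum(w * p[d] for w, p in zip(weights, profiles)) / total
        for d in range(n_dims)
    ]
```

With a suspicion score of 1 a cell contributes nothing to the centroid, and with a score of 0 it contributes like any ordinary cell.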
4. The method for denoising single-cell immune repertoire sequencing data according to claim 3, wherein the first feedback loop, when identifying contradictory evidence of cell identity in the single-cell VDJ data, specifically includes:
S311: For each valid cell barcode, identifying in its VDJ sequencing data all functional T-cell receptor and B-cell receptor sequences that have a complete open reading frame and no non-productive rearrangement;
S312: Based on receptor chain type and pairing rules, performing pairing analysis on the identified functional sequences to identify the potential valid receptor pairs under each barcode;
S313: Defining and detecting cell identity conflict events, denoted event E, specifically: for T cells, event E is defined as the presence, under one cell barcode, of more than one distinct functional TCRα chain and more than one distinct functional TCRβ chain; for B cells, event E is defined as the presence, under one cell barcode, of more than one distinct functional immunoglobulin heavy chain and more than one distinct functional immunoglobulin light chain;
S314: For each cell barcode that triggers event E, calculating and recording the strength of the contradictory evidence.
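Steps S311, S313, and S314 can be sketched as below. The contig field names (`chain`, `cdr3`, `productive`, `full_orf`, `umis`) are hypothetical, and the claim does not define how the strength of contradictory evidence is measured. Here it is assumed to be the smaller, across the two chain groups, of the ratio between the second most abundant and the most abundant distinct chain by UMI count.

```python
from collections import defaultdict

# Chain groups that must each hold more than one distinct functional chain.
T_CELL_GROUPS = ({"TRA"}, {"TRB"})
B_CELL_GROUPS = ({"IGH"}, {"IGK", "IGL"})


def functional_chains(contigs):
    """S311: keep productive, complete-ORF contigs as {chain: {cdr3: umis}}."""
    chains = defaultdict(dict)
    for contig in contigs:
        if contig["productive"] and contig["full_orf"]:
            by_sequence = chains[contig["chain"]]
            by_sequence[contig["cdr3"]] = (
                by_sequence.get(contig["cdr3"], 0) + contig["umis"]
            )
    return chains


def conflict_strength(contigs, groups):
    """S313/S314: return None if event E is not triggered, else a strength.

    Strength (assumed definition) is the minimum over chain groups of
    second-highest UMI count / highest UMI count. Assumes positive UMI counts.
    """
    chains = functional_chains(contigs)
    ratios = []
    for group in groups:
        umis = sorted(
            (u for chain in group for u in chains.get(chain, {}).values()),
            reverse=True,
        )
        if len(umis) < 2:
            return None  # this group has at most one distinct chain
        ratios.append(umis[1] / umis[0])
    return min(ratios)
```

Under this definition a barcode with two well-supported chains in both groups scores close to 1, while a barcode whose extra chains are supported by only a few UMIs scores close to 0.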
5. The method for denoising single-cell immune repertoire sequencing data according to claim 4, wherein the second feedback loop of the noise reduction engine, in operation, specifically includes:
S41: Calculating the comprehensive cell quality score based on multi-dimensional gene expression features;
S42: Weighting the VDJ clonotype frequencies using the calculated comprehensive cell quality score.
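Step S42 can be illustrated with a short sketch. It assumes that each cell contributes its quality score, rather than a count of 1, to the total of its clonotype, and that the totals are then normalised to sum to 1. The function name and the input layout are hypothetical, and the claim does not fix this exact weighting rule.

```python
from collections import defaultdict


def weighted_clonotype_frequencies(cells):
    """Return {clonotype_id: quality-weighted frequency}.

    cells: iterable of (clonotype_id, quality_score) pairs, with scores
    assumed to be non-negative.
    """
    totals = defaultdict(float)
    for clonotype_id, quality_score in cells:
        totals[clonotype_id] += quality_score
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {cid: total / grand_total for cid, total in totals.items()}


# Two cells of clonotype c1 (one of them low quality) and one cell of c2.
example = [("c1", 1.0), ("c1", 0.5), ("c2", 0.5)]
frequencies = weighted_clonotype_frequencies(example)
# frequencies == {"c1": 0.75, "c2": 0.25}
```

By plain cell counting the same example would give c1 a frequency of 2/3, so the weighting reduces the influence of lower-quality cells on the clonal estimate without discarding them.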
6. The method for denoising single-cell immune repertoire sequencing data according to claim 5, wherein the intelligent comprehensive judgment and classification specifically includes:
S51: Constructing a comprehensive cell feature vector for model judgment;
S52: Inputting the comprehensive feature vector into the preset intelligent judgment model, which integrates the multi-dimensional features and makes a comprehensive decision based on the preset analysis targets;
S53: Obtaining and applying the judgment output of the intelligent judgment model to perform the final classification of cells.
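Step S51 can be sketched as assembling per-cell features in a fixed order and standardising each feature across cells. The feature names in `FEATURES` are hypothetical examples, and z-score standardisation is an assumption, since the claim does not say which features are used or how they are scaled.

```python
from statistics import mean, pstdev

# Hypothetical feature order; the claim does not enumerate the features.
FEATURES = ("quality_score", "suspicion_score", "mito_fraction",
            "has_paired_receptor")


def build_feature_matrix(cells):
    """Return one z-scored feature vector per cell, in FEATURES order.

    cells: list of dicts holding a numeric (or boolean) value per feature.
    """
    raw = [[float(cell[name]) for name in FEATURES] for cell in cells]
    columns = list(zip(*raw))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; divide by 1 so its z-scores are 0.
    spreads = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, spreads)]
        for row in raw
    ]
```

Fixing the feature order in one place keeps the vectors consistent between model training and later inference.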
7. The method for denoising single-cell immune repertoire sequencing data according to claim 6, wherein the intelligent judgment model used in the intelligent comprehensive judgment and classification step includes the following components:
A11: A model input module, used to receive the standardized comprehensive feature vector of each cell;
A12: A target weight configuration module, used to integrate the user-specified priority weights of the analysis targets;
A13: A model inference engine, built on a gradient boosting tree algorithm, used to calculate the final judgment score of each cell from the input feature vector and the target weights;
A14: A decision and output module, used to combine the judgment score with preset rules to generate the final cell classification label.
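The division of labour between components A11 to A14 can be sketched as below. The claim specifies a gradient boosting tree inference engine; this sketch does not train one and instead takes any scoring callable, with `stand_in_score` serving as a toy placeholder for a trained model. Appending the target weights to the feature vector, the 0.5 threshold, the `hard_reject` rule, and the label names are all assumptions made for illustration.

```python
def judge_cell(features, goal_weights, score_fn,
               keep_threshold=0.5, hard_reject=None):
    """Return (judgment_score, label) for one cell.

    features: standardized feature vector of the cell (A11).
    goal_weights: user-specified analysis target weights (A12).
    score_fn: callable mapping the model input to a score (A13); in practice
        this would wrap a trained gradient boosting tree model.
    hard_reject: optional preset rule on the features that forces the
        'technical_noise' label regardless of the score (A14).
    """
    model_input = list(features) + list(goal_weights)   # A11 + A12
    score = score_fn(model_input)                       # A13
    if hard_reject is not None and hard_reject(features):
        return score, "technical_noise"                 # A14: preset rule
    label = "high_quality" if score >= keep_threshold else "low_quality"
    return score, label


def stand_in_score(model_input):
    """Toy placeholder, NOT a gradient boosting model: feature 0 * weight."""
    return model_input[0] * model_input[-1]
```

For example, `judge_cell([0.9, 0.1], [1.0], stand_in_score)` yields the label `"high_quality"`, while lowering the target weight to `0.5` drops the score below the threshold and yields `"low_quality"`.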
8. The method for denoising single-cell immune repertoire sequencing data according to claim 7, wherein the data archiving and background learning specifically includes:
S61: Based on the final cell classification label, classifying and storing the cell data, together with the associated gene expression profiles, VDJ sequences, quality scores, and judgment evidence, into four independent data subsets;
S62: Performing feature analysis on the data subset classified as technical noise, and extracting and quantifying the core parameters that describe the background noise of this sequencing experiment;
S63: Feeding the extracted background noise parameters back to the refined coarse filtering step based on biological context and to the data preprocessing step, dynamically calibrating the initial filtering threshold, and generating a background noise analysis report.
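Steps S61 to S63 can be sketched as follows. The claims name high-quality, low-quality, and technical-noise data but do not name the fourth subset, so `needs_review` is a placeholder. The choice of median and MAD of UMI counts as background noise parameters, and the recalibration rule `median + k * MAD`, are assumptions rather than the claimed method.

```python
from statistics import median

# The fourth label is a placeholder; the claims do not name all four subsets.
LABELS = ("high_quality", "low_quality", "technical_noise", "needs_review")


def archive(cells):
    """S61: split cell records into one list per classification label."""
    subsets = {label: [] for label in LABELS}
    for cell in cells:
        subsets[cell["label"]].append(cell)
    return subsets


def background_parameters(noise_cells):
    """S62: summarise the technical-noise subset (assumed parameters)."""
    if not noise_cells:
        return None
    counts = [c["umi_count"] for c in noise_cells]
    med = median(counts)
    return {
        "umi_median": med,
        "umi_mad": median([abs(x - med) for x in counts]),
        "mito_median": median([c["mito_fraction"] for c in noise_cells]),
    }


def calibrated_threshold(params, k=3.0):
    """S63: recompute the initial UMI filtering threshold (assumed rule)."""
    return params["umi_median"] + k * params["umi_mad"]
```

Deriving the threshold from cells already judged to be noise lets the filter adapt to each sequencing run instead of relying on a fixed cutoff.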
9. A noise reduction system for single-cell immune repertoire sequencing data, used to implement the method for denoising single-cell immune repertoire sequencing data according to any one of claims 1-8, characterized in that it includes:
A data preprocessing and feature extraction module, used to align and quantify the raw sequencing data and to generate gene expression profile data, VDJ receptor sequence data, and initial quality indicators for each cell barcode;
A refined coarse filtering module, used for preliminary filtering of cells and for adjusting cell-type-specific thresholds;
A bidirectional collaborative noise reduction engine, which includes an identity verification feedback unit and a quality-weighted feedback unit, specifically: the identity verification feedback unit is used to generate multiple-capture suspicion scores based on VDJ sequences and to integrate them into the clustering analysis of gene expression profiles; the quality-weighted feedback unit is used to calculate the comprehensive cell quality score based on gene expression profiles and to use this score to weight the VDJ clonotype frequencies;
An intelligent comprehensive judgment and classification module, used to integrate the multi-dimensional features and to use a preset judgment model to calculate a retention priority and a classification label for each cell;
A data archiving and background learning module, used to classify cell data into different datasets and to extract background noise parameters based on the classification results;
A results output module, used to output a result set that includes the purified data, a weighted clonotype list, classification and archiving information, and a background noise analysis report.