A single-cell rare cell type identification method and system based on divide-and-conquer strategy

By optimizing single-cell data using variational autoencoders and divide-and-conquer strategies, the problem of identifying and clustering rare cell types was solved, achieving high-precision cell type identification and biological interpretation, and providing reliable cell annotation capabilities.

CN115910219B · Active · Publication Date: 2026-06-26 · SHENZHEN INST OF ADVANCED TECH CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHENZHEN INST OF ADVANCED TECH CHINESE ACAD OF SCI
Filing Date
2022-12-02
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies struggle to accurately identify and analyze rare cell types, leading to misclassification or difficulty in identification, and lack reliability assessment of clustering results.

Method used

A variational autoencoder is used to encode single-cell transcriptome data into Gaussian distributed latent variables. The cluster loss is optimized by spectral clustering and divide-and-conquer strategies. Combined with Bayesian neural networks and MC-dropout, the identification and clustering of rare cell types are achieved.

Benefits of technology

It improves the accuracy of rare cell type identification and clustering precision, provides reliable biological interpretation and knowledge-driven cell annotation capabilities, and enhances the reliability and biological significance of single-cell data analysis.

✦ Generated by Eureka AI based on patent content.

Figure CN115910219B_ABST

Abstract

The application provides a single-cell rare cell type identification method and system based on a divide-and-conquer strategy, comprising the following steps: S1, using a variational autoencoder to encode single-cell transcriptome data into Gaussian distribution hidden variables; S2, performing spectral clustering on the Gaussian distribution hidden variables to obtain a single-cell coarse-grained clustering result; and S3, optimizing the single-cell coarse-grained clustering result based on intra-class cluster loss and inter-class cluster loss until the intra-class cluster loss and the inter-class cluster loss are minimized to obtain a single-cell clustering result containing rare cell types. The application can overcome the deficiency that previous single-cell clustering and identification methods do not optimize rare cell types, leading to misclassification or difficulty in identifying rare cell types.

Description

Technical Field

[0001] This invention belongs to the field of pharmaceutical biotechnology and relates to a method and system for identifying rare single-cell cell types based on a divide-and-conquer strategy.

Background Technology

[0002] Cell types that play crucial roles in individual development and disease progression are typically found in low abundance, such as circulating endothelial cells, endothelial progenitor cells, antigen-specific T cells, cancer stem cells, or circulating tumor cells. These rare cell types play important roles in determining disease pathogenesis, mediating immune responses, and angiogenesis in various diseases. Accurate identification and analysis of rare cell types are key to understanding disease progression and biological developmental mechanisms, and are of great significance for a deeper understanding of tissue biology in both normal and disease states.

[0003] Biological tissues exhibit heterogeneity, and with increasing sequencing depth, cell types show greater abundance differences. When the abundance differences between cell types are small, most single-cell clustering methods can achieve good results. However, in real-world sequencing environments, traditional methods often fail to identify or misclassify rare cell types.

[0004] For example, document CN114783526A proposes a deep unsupervised single-cell clustering method based on a Gaussian mixture graph variational autoencoder. It initializes the gene regulation network A using protein-protein interaction relationships (PPIs) (or regulatory element interactions (HiChIPs)); initializes each cell's cell cluster C using the K-means method; passes the gene regulation network A and single-cell gene expression data X (or regulatory element openness data X) through a graph encoder to obtain the hidden layer; obtains cell cluster C; samples the low-dimensional cell representation Z from the Gaussian mixture model (GMM); uses a decoder to predict the gene regulation network A; calculates the loss function; backpropagates to update A and GCN; repeats the above steps until convergence; outputs the gene regulation network A, the low-dimensional cell representation Z, and cell cluster C. This invention completes cell clustering and dimensionality reduction of cell representation during the construction of the gene regulation network A. However, this method also has the problem of failing to identify or misclassifying rare cell types.

[0005] In summary, the integration and optimization of single-cell data structural features and clustering algorithms for identifying rare cell types remains lacking, leading to misclassification or difficulty in identifying rare cell types. Secondly, confidence interval assessments for the reliability of clustering results are still missing in rare cell type identification. Therefore, identifying rare cell types remains a challenge in single-cell data analysis.

Summary of the Invention

[0006] This invention provides a method and system for identifying rare cell types in single cells based on a divide-and-conquer strategy, overcoming the shortcomings of previous single-cell clustering and identification methods that did not optimize for rare cell types, resulting in misclassification or difficulty in identifying rare cell types.

[0007] This invention is achieved through the following technical solution:

[0008] A method for identifying rare single-cell cell types based on a divide-and-conquer strategy, comprising:

[0009] S1. A variational autoencoder is used to encode single-cell transcriptome data into Gaussian distributed latent variables.

[0010] S2. Perform spectral clustering on the Gaussian distributed latent variables to obtain coarse-grained clustering results for single cells;

[0011] S3. Optimize the coarse-grained clustering results of single cells based on intra-cluster loss and inter-cluster loss until the intra-cluster loss and inter-cluster loss are minimized, and obtain single-cell clustering results that include rare cell types.

[0012] Preferably, S1 specifically involves: the variational autoencoder inferring the posterior distribution p(w|X,Y) of the model parameters w through a Bayesian neural network;

[0013] MC-dropout is used to minimize the Kullback-Leibler divergence between the variational distribution q_θ(w) and the posterior distribution p(w|X,Y), yielding the Gaussian distributed latent variables.

[0014] Furthermore, the formula for calculating the posterior distribution p(w|X,Y) is as shown in equation (1):

[0015] $$p(w \mid X, Y) = \frac{p(Y \mid X, w)\, p(w)}{p(Y \mid X)} \tag{1}$$

[0016] Where p(Y|X,w) is the probability of generating single-cell transcriptome data X given model parameters w, p(w) is the distribution of model parameters w, p(Y|X) represents the probability of outputting Y, and Y is the estimate of single-cell transcriptome data X.

[0017] Furthermore, minimizing the Kullback-Leibler divergence between the variational distribution q_θ(w) and the posterior distribution p(w|X,Y) with MC-dropout is specifically achieved by maximizing the function L(θ):

[0018]

[0019] Formula (12) is solved by iterative formula (13):

[0020]

[0021] where w_j denotes the j-th sample drawn from the posterior distribution p(w|X,Y), with w_j ~ q_θ(w), j = 1, ..., N; N is the number of network layers; K is an arbitrary constant; and p(D|w_j) denotes the conditional probability of w_j.

[0022] Furthermore, S1 also includes obtaining the confidence distribution interval of the Gaussian distributed latent variables by training the Bayesian neural network.

[0023] Preferably, in S1, the distribution of the original single-cell transcriptome data is reconstructed from the Gaussian distributed latent variables using a decoder, and the optimized Gaussian distributed latent variables are obtained through backpropagation iterative optimization.

[0024] Preferably, in S1, the single-cell transcriptome data includes a reference dataset, which is a dataset labeled with cell category tags.

[0025] Furthermore, in S3, for the single-cell clustering results, a voting mechanism is used to annotate the clustered single cells based on the cell category labels of the reference dataset.

[0026] Preferably, in S3, formulas (15)-(17) are used to optimize the coarse-grained clustering results of single cells:

[0027]

[0028] where λ_tig represents the loss function within a cluster and λ_seq represents the loss function between clusters;

[0029]

[0030]

[0031] where C represents all clusters after clustering, c represents a specific cluster among all clusters, and x_m and x_n represent two samples within or between clusters.

[0032] A single-cell rare cell type identification system based on a divide-and-conquer strategy, comprising:

[0033] The cell Gaussian distribution latent variable characterization module is used to encode single-cell transcriptome data into Gaussian distribution latent variables through a variational autoencoder and output them to the rare cell type identification module based on the divide-and-conquer strategy clustering algorithm.

[0034] A rare cell type identification module based on a divide-and-conquer clustering algorithm is used to perform spectral clustering on Gaussian distributed latent variables to obtain coarse-grained clustering results for single cells. The coarse-grained clustering results for single cells are optimized based on intra-cluster loss and inter-cluster loss until the intra-cluster loss and inter-cluster loss are minimized to obtain single-cell clustering results that include rare cell types.

[0035] Compared with the prior art, the present invention has the following beneficial effects:

[0036] This invention provides a novel cell type identification scheme for rare cell type identification tasks using single-cell data, combining single-cell characterization optimization with a divide-and-conquer strategy. The invention encodes the feature distribution of single cells using a variational autoencoder, and clusters similar cells based on the hidden layer distribution encoded by the encoder, thereby preserving more of the biological characteristics of single cells. For the initial coarse-grained clustering results, the scheme is optimized based on intra-cluster and inter-cluster losses. Specifically, a divide-and-conquer strategy is used to determine whether secondary clustering of single cells within each cluster is necessary. Cells requiring secondary clustering are clustered again until intra-cluster and inter-cluster losses are minimized, thus preventing the misclassification or neglect of rare cell types due to limited cell numbers. This invention provides reliable support and a solid foundation for downstream analysis tasks in single-cell sequencing.

[0037] Furthermore, this invention can obtain the confidence distribution interval of clustering accuracy through Bayesian neural network training.

[0038] Furthermore, by combining labeled single-cell data as a reference dataset, knowledge-driven, biologically meaningful cell annotation functions can be achieved.

[0039] Furthermore, this invention proposes a self-supervised scheme for clustering algorithms, which can effectively improve clustering accuracy.

[0040] This invention utilizes a variational autoencoder to generate a model for cell characterization and optimizes the cell characterization through backpropagation. It also incorporates a divide-and-conquer strategy to fine-tune cell clustering, enabling the identification of rare cell types. This solves the problem of traditional single-cell clustering algorithms failing to identify rare cell types with small cell numbers. This invention provides assurance for improving the convenience, accuracy, reliability, and biointerpretability of single-cell data analysis tasks in the biomedical field.

Attached Figure Description

[0041] Figure 1 is a flowchart of the single-cell rare cell type identification method based on the divide-and-conquer strategy of the present invention.

Detailed Implementation

[0042] To further understand the present invention, the present invention will be described below with reference to embodiments. These descriptions are only for further explaining the features and advantages of the present invention and are not intended to limit the claims of the present invention.

[0043] As shown in Figure 1, this invention provides a method for identifying rare single-cell cell types based on a divide-and-conquer strategy, comprising:

[0044] 1. Input single-cell transcriptome data:

[0045] Input single-cell transcriptome data X. In the cell gene count matrix (i.e., single-cell transcriptome data) obtained from different batches of single-cell sequencing, each column of the matrix corresponds to a cell and each row corresponds to a gene, with each entry giving the expression level of that gene in that cell. The single-cell transcriptome data X includes a test dataset, and preferably also includes a reference dataset, which is a labeled dataset, i.e., a dataset labeled with cell category labels.
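The matrix orientation described above can be illustrated with a minimal sketch. The gene names, cell names, and counts below are made-up toy values, and `to_cells_by_genes` is a hypothetical helper, not part of the patent. It only shows how a gene-by-cell matrix maps to the "samples and features" view used later in Example 1.

```python
# Toy gene-by-cell count matrix: rows are genes, columns are cells.
# All names and values are illustrative assumptions.
genes = ["gene_a", "gene_b", "gene_c"]
cells = ["cell_1", "cell_2", "cell_3", "cell_4"]
counts = [
    [5, 0, 0, 7],   # gene_a expression in each cell
    [0, 9, 0, 0],   # gene_b
    [1, 0, 12, 0],  # gene_c
]

def to_cells_by_genes(matrix):
    """Transpose a gene-by-cell matrix so that each row is one cell (sample)."""
    return [list(column) for column in zip(*matrix)]

samples = to_cells_by_genes(counts)
n_samples, n_features = len(samples), len(samples[0])
print(n_samples, "samples x", n_features, "features")
```

After transposition, each cell is one sample and each gene is one feature, which is the layout most encoders expect as input.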

[0046] 2. Characterization of latent variables in Gaussian distribution of cells:

[0047] The variational autoencoder was used to encode single-cell transcriptome data into Gaussian distributed latent variables. The specific process is as follows.

[0048] Rare cell types have limited data and far fewer samples than other cell types. To address these challenges, this invention combines Bayesian neural networks with MC-dropout to test and adjust prediction errors under different network architectures. In a Bayesian neural network, a Gaussian distribution is used as the prior probability distribution of each weight parameter, so the single-point estimate of a traditional neural network is expanded into a mean μ and a variance δ². The parameters of a Bayesian neural network with Gaussian-distributed weights can therefore be represented as θ = [μ, δ²]. Compared with a conventional neural network of the same structure, a Bayesian neural network has twice the number of parameters, because both the mean and the variance of each parameter distribution must be optimized. This increases the model's parameter count, computation time, and computational complexity. To address this issue, the system reduces the number of hidden layers: it establishes a simple neural network structure with only one hidden layer, thereby reducing the number of network parameters, the computation time, and the model complexity.
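The parameter doubling described above can be checked with a small counting sketch. The layer sizes are hypothetical and `count_parameters` is an illustrative helper; the patent does not state the actual network dimensions.

```python
def count_parameters(layer_sizes, bayesian=False):
    """Count weights and biases of a fully connected network.

    With bayesian=True every weight and bias is described by a mean and a
    variance (theta = [mu, delta^2]), so the count doubles.
    """
    n = sum(n_in * n_out + n_out
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
    return 2 * n if bayesian else n

# Hypothetical sizes: 2000 input genes, 128 hidden units, 10 latent dimensions.
one_hidden = [2000, 128, 10]
two_hidden = [2000, 128, 128, 10]

point_estimate = count_parameters(one_hidden)
bayes_one_hidden = count_parameters(one_hidden, bayesian=True)
bayes_two_hidden = count_parameters(two_hidden, bayesian=True)
print(point_estimate, bayes_one_hidden, bayes_two_hidden)
```

Under these assumed sizes, dropping the second hidden layer removes its weights twice over in the Bayesian setting, which is the saving the paragraph refers to.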

[0049] In the Bayesian method, the goal is to infer the posterior distribution of the model parameters w that are most likely to generate the observed data X and its estimate Y. A prior distribution p(w) is assigned over the space of model parameters w to represent the prior belief in each candidate model parameter when generating the observed data. Then, a likelihood function p(Y|X,w) is constructed to characterize the probability of generating the observed data given the model parameters w. Bayesian inference aims to compute the posterior distribution p(w|X,Y) over the model parameters w, as shown in Equation (1).

[0050] $$p(w \mid X, Y) = \frac{p(Y \mid X, w)\, p(w)}{p(Y \mid X)} \tag{1}$$

[0051] Where p(Y|X) represents the probability that the output is Y.

[0052] After obtaining the complete posterior distribution p(w|X,Y), the predictive distribution for a new data point X* (i.e., the single-cell transcriptome data that actually needs to be classified) can be obtained by integrating over w, as shown in Equation (2).

[0053] $$p(Y^{*} \mid X^{*}, X, Y) = \int p(Y^{*} \mid X^{*}, w)\, p(w \mid X, Y)\, dw \tag{2}$$

[0054] Here, p(w|X,Y) represents the posterior distribution over the model parameters w. The Bayesian neural network draws a sample w from the posterior distribution p(w|X,Y) and, given w, obtains the estimate Y* for the new data point X*. Through iteration, an estimate of the distribution for the new data point X* is obtained. The Bayesian neural network obtains the distribution of the parameters in each layer, and thus the posterior distribution p(w|X,Y) over the model parameters w. It can therefore effectively mitigate the overfitting problem that arises when only a small amount of sample data is available for rare cell types.
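The sampling procedure behind Equation (2) can be sketched as repeated stochastic forward passes. This is a minimal illustration, not the patent's network: the weights, the ReLU activation, the keep probability, and the 1.96-standard-deviation interval (a normal approximation) are all assumptions, and `mc_dropout_predict` is a hypothetical helper.

```python
import random
import statistics

def mc_dropout_predict(x, w_hidden, w_out, keep_prob, n_passes, rng):
    """Approximate the predictive distribution by averaging stochastic passes.

    Each pass samples one Bernoulli mask over the hidden units, which plays
    the role of one draw of w from the approximate posterior.
    """
    outputs = []
    for _ in range(n_passes):
        hidden = []
        for row in w_hidden:
            activation = max(0.0, sum(w * xi for w, xi in zip(row, x)))  # ReLU
            keep = 1.0 if rng.random() < keep_prob else 0.0
            hidden.append(activation * keep / keep_prob)  # inverted dropout
        outputs.append(sum(w * h for w, h in zip(w_out, hidden)))
    mean = statistics.fmean(outputs)
    std = statistics.pstdev(outputs)
    # Assumed normal approximation for a rough 95% interval.
    return mean, std, (mean - 1.96 * std, mean + 1.96 * std)

x_new = [1.0, 2.0]                                   # toy new data point X*
w_hidden = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]    # toy hidden-layer weights
w_out = [1.0, -0.5, 0.7]                             # toy output weights
mean, std, interval = mc_dropout_predict(
    x_new, w_hidden, w_out, keep_prob=0.8, n_passes=200, rng=random.Random(0))
print(mean, std, interval)
```

The spread of the sampled outputs is what gives a per-prediction uncertainty estimate; with `keep_prob=1.0` the passes become deterministic and the spread collapses to zero.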

[0055] The core idea of variational inference is to approximate the posterior distribution p(w|X,Y) using a variational parameter distribution. The system employs MC-dropout to minimize the Kullback-Leibler divergence between the variational distribution q_θ(w) and the posterior distribution p(w|X,Y), which approximates Bayesian inference in deep neural networks.

[0056]

[0057] Where p(X,Y) represents the joint probability distribution of X and Y. Since p(Y|X) is a constant with respect to θ, minimizing the KL divergence is equivalent to maximizing the evidence lower bound (ELBO) with respect to θ, as shown in Equation (4):

[0058] $$L_{\mathrm{ELBO}}(\theta) = \int q_{\theta}(w) \log p(Y \mid X, w)\, dw - \mathrm{KL}\big(q_{\theta}(w) \,\|\, p(w)\big) \tag{4}$$

[0059] The first term measures the probability that parameter values sampled from the variational distribution q_θ(w) generate the observed data; the second term is the negative divergence between the variational density and the prior, where the closer to the prior distribution the better. A sample is drawn from q_θ(w), and θ is optimized to maximize the objective function in formula (5). Next, q_θ(w) is updated, and a new sample is drawn from the updated q_θ(w). Through iteration, the procedure eventually converges to the θ for which q_θ(w) is closest to the true posterior distribution p(w|X,Y).
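The two terms of the ELBO can be made concrete with a small sketch. It assumes a factorized Gaussian variational distribution and a standard normal prior, for which the KL term has a closed form; the patent only states that the distributions are Gaussian, so the unit-variance, zero-mean prior and the helper names are assumptions.

```python
import math

def kl_gaussian_to_standard_normal(means, variances):
    """KL(q || p) for factorized q = N(mu, sigma^2) and prior p = N(0, 1)."""
    return 0.5 * sum(m * m + v - 1.0 - math.log(v)
                     for m, v in zip(means, variances))

def elbo(expected_log_likelihood, means, variances):
    """ELBO = expected log-likelihood under q minus KL(q || prior)."""
    return expected_log_likelihood - kl_gaussian_to_standard_normal(means, variances)

# When q equals the prior the penalty vanishes; moving the mean away costs KL.
kl_at_prior = kl_gaussian_to_standard_normal([0.0], [1.0])
kl_shifted = kl_gaussian_to_standard_normal([1.0], [1.0])
value = elbo(-10.0, [1.0], [1.0])   # -10.0 is a made-up expected log-likelihood
print(kl_at_prior, kl_shifted, value)
```

The sketch shows the trade-off stated in the text: a better data fit raises the first term, while drifting from the prior lowers the objective through the KL penalty.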

[0060]

[0061] In the model, q is defined as a factorization over the weight matrices W_i of all network layers (i = 1, 2, 3, ..., N), as shown in formula (6):

[0062]

[0063] where, for each W_i, q is defined as the product of the mean weight matrix M_i (M_i = mean(W_i)) and a diagonal matrix of Bernoulli variables, as shown in equations (7)-(9):

[0064]

[0065] $$Z_i \sim \mathrm{Bernoulli}(p_i) \tag{8}$$

[0066]

[0067] where Z_i denotes the diagonal matrix of the Bernoulli variable p_i, p_i denotes the Bernoulli variable, and W_i denotes the weight matrix of network layer i (i = 1, 2, 3, ..., N).
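The product of a mean weight matrix and a diagonal Bernoulli matrix described above can be sketched as follows. This assumes the common MC-dropout construction W_i = M_i · diag(Z_i), in which each Bernoulli draw keeps or zeroes one column of M_i; the matrix values and the helper name are illustrative.

```python
import random

def sample_dropout_weights(mean_weights, keep_prob, rng):
    """Draw one W_i = M_i * diag(Z_i), with one Bernoulli(keep_prob) draw per column."""
    n_cols = len(mean_weights[0])
    z = [1 if rng.random() < keep_prob else 0 for _ in range(n_cols)]
    # Right-multiplying by diag(z) scales column j of M_i by z[j].
    sampled = [[m * zj for m, zj in zip(row, z)] for row in mean_weights]
    return sampled, z

M = [[0.2, -0.5, 1.0],
     [0.7, 0.3, -0.1]]   # toy mean weight matrix M_i

W_sample, z = sample_dropout_weights(M, keep_prob=0.5, rng=random.Random(0))
W_full, z_full = sample_dropout_weights(M, keep_prob=1.0, rng=random.Random(0))
print(z, W_sample)
```

Each call yields one sample of the layer's weights, so repeated calls across all layers give the draws w_j ~ q_θ(w) used in the Monte Carlo estimates.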

[0068] The goal of variational inference is to use a simpler distribution q_θ(w) to approximate the complex distribution p(w|D). The parameter θ of q_θ(w) is adjusted to maximize the function L(θ), as shown in formula (10):

[0069]

[0070]

[0071] p(D) represents the complex distribution of the original single-cell transcriptome data, and p(D|w) represents the conditional probability of the data D given w.

[0072] Optimizing formula (10) is typically achieved by analytically calculating the derivative, setting it to zero, solving a set of coupled nonlinear equations, and deriving an iterative coordinate descent algorithm. However, this approach is only applicable to conjugate distributions and not to highly structured distributions represented by probabilistic programs, which are difficult to handle analytically. To address this problem, this invention performs stochastic gradient descent on L(θ):

[0073]

[0074] Formula (12) is solved by iterative formula (13):

[0075]

[0076]

[0077] where w_j denotes the j-th sample drawn from the posterior distribution p(w|X,Y), with w_j ~ q_θ(w), j = 1, ..., N, and K is an arbitrary constant. Adding the constant K minimizes the variance of the Monte Carlo estimate, yielding the Gaussian distributed latent variables. In fact, p(w_j) is a Gaussian-distributed latent variable that approximates the original distribution of the input single-cell transcriptome data.
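The variance-reducing role of the constant K can be illustrated with a toy Monte Carlo gradient. Formulas (12)-(14) are not reproduced in this text, so the sketch assumes one standard form, the score-function estimator with a constant baseline, applied to a one-dimensional Gaussian q_θ(w) = N(θ, 1) and a made-up log-likelihood. It illustrates the idea only and is not the patent's estimator.

```python
import random
import statistics

def score_function_gradients(theta, log_lik, k, n_samples, seed):
    """Per-sample gradient estimates: grad log q_theta(w_j) * (log_lik(w_j) - K)."""
    rng = random.Random(seed)
    grads = []
    for _ in range(n_samples):
        w_j = rng.gauss(theta, 1.0)   # w_j ~ q_theta(w) = N(theta, 1)
        score = w_j - theta           # d/dtheta of log q_theta(w_j)
        grads.append(score * (log_lik(w_j) - k))
    return grads

def toy_log_lik(w):
    """Made-up log-likelihood peaked at w = 3 (an assumption for illustration)."""
    return -0.5 * (w - 3.0) ** 2

# Same random draws for both runs, so only the baseline K differs.
plain = score_function_gradients(0.0, toy_log_lik, k=0.0, n_samples=10000, seed=0)
baselined = score_function_gradients(0.0, toy_log_lik, k=-5.0, n_samples=10000, seed=0)

print(statistics.fmean(plain), statistics.pvariance(plain))
print(statistics.fmean(baselined), statistics.pvariance(baselined))
```

For this toy problem the exact gradient at θ = 0 is 3. Both estimators average close to that value because the score has zero mean under q, but subtracting K = -5 (the expected log-likelihood here) gives a noticeably smaller variance.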

[0078] The Bayesian neural network simultaneously provides the confidence distribution interval of the hidden layer representations (Gaussian distributed latent variables). The distribution of the original single-cell transcriptome data is reconstructed from the Gaussian distributed latent variables using the decoder, and optimized through backpropagation iterative optimization to obtain the optimized Gaussian distributed latent variables. These Gaussian latent variables serve as the input for clustering in step 3. The confidence level of the generated data can be assessed based on the confidence distribution interval of the hidden layer representations obtained during training, given the distribution of the single-cell transcriptome data reconstructed by the decoder.

[0079] 3. Rare cell type identification based on divide-and-conquer clustering algorithm:

[0080] Variational inference is performed using the distribution of variational autoencoders, and the identification of rare cell types is achieved by using the prediction error of the model and the uncertainty error caused by noise.

[0081] Based on the variational autoencoder in step 2, spectral clustering is performed on the Gaussian distributed latent variables of the intermediate layer between the encoder and the decoder, grouping the cells into coarse-grained single-cell clusters. During spectral clustering, distant node pairs may still be assigned to the same cluster, while closely spaced node pairs may be assigned to different clusters. To address this issue, this invention employs a divide-and-conquer strategy, which is implemented through threshold judgments within and between clusters. Specifically, the coarse-grained single-cell clustering results are optimized based on intra-cluster and inter-cluster losses: secondary clustering is performed on misclassified nodes or clusters until the intra-cluster and inter-cluster losses are minimized, thereby enabling the identification of rare cell types and yielding the single-cell clustering results. The formulas used for secondary clustering are formulas (15)-(17):

[0082]

[0083] where λ_tig represents the loss function within a cluster and λ_seq represents the loss function between clusters.

[0084]

[0085]

[0086] where C represents all clusters after clustering, c represents a specific cluster among all clusters, and x_m and x_n represent two samples within or between clusters.
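The threshold-driven secondary clustering can be sketched in a few lines. Formulas (15)-(17) are not reproduced in this text, so the losses below are assumed stand-ins (mean squared pairwise distance within a cluster and between two clusters), and a simple 2-means split is swapped in for spectral clustering to keep the example short. The data points and the threshold are made up.

```python
from itertools import combinations

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def intra_loss(cluster):
    """Assumed intra-cluster loss: mean squared distance over pairs in one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(sq_dist(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def inter_loss(c1, c2):
    """Assumed inter-cluster loss: mean squared distance over cross-cluster pairs."""
    return sum(sq_dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def centroid(cluster):
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def two_means(cluster, n_iter=20):
    """Split a cluster in two with Lloyd iterations seeded by the farthest pair."""
    a, b = max(combinations(cluster, 2), key=lambda p: sq_dist(*p))
    for _ in range(n_iter):
        left = [p for p in cluster if sq_dist(p, a) <= sq_dist(p, b)]
        right = [p for p in cluster if sq_dist(p, a) > sq_dist(p, b)]
        a, b = centroid(left), centroid(right)
    return left, right

def refine(cluster, threshold):
    """Divide and conquer: re-cluster any cluster whose intra-cluster loss is too high."""
    if len(cluster) < 2 or intra_loss(cluster) <= threshold:
        return [cluster]
    left, right = two_means(cluster)
    return refine(left, threshold) + refine(right, threshold)

# One coarse cluster that hides a small, distant group of two cells.
major = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.0, 0.2), (0.2, 0.3)]
rare = [(5.0, 5.0), (5.1, 4.9)]
clusters = refine(major + rare, threshold=1.0)
print([len(c) for c in clusters])
```

In this toy run the two distant cells end up in their own cluster after the split, and the inter-cluster loss between the resulting clusters is far larger than either intra-cluster loss, which is the separation the optimization aims for.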

[0087] Through distribution fitting, this step places similar cells at the centroid of the most similar cluster and balances the biological and technical differences between clusters. It adjusts the distribution of similar cells through backpropagation, thereby eliminating batch effects, and relies on a generative model such as a variational autoencoder to complete missing values in the hidden layer.

[0088] 4. Knowledge-driven cell annotation:

[0089] In the cell clustering results, cells are annotated using a voting mechanism based on the cell category labels of the reference dataset. This combines past knowledge to achieve a knowledge-driven, biologically meaningful cell annotation function.
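The voting mechanism can be sketched as a majority vote per cluster. The patent does not specify the voting rule, so simple majority voting is an assumption, as are the helper name, the cell-type labels, and the `"unlabeled"` fallback for clusters that contain no reference cells.

```python
from collections import Counter

def annotate_clusters(cluster_ids, reference_labels):
    """Assign each cluster the most common reference label among its cells.

    cluster_ids: cluster index of every cell.
    reference_labels: known label for reference cells, None for test cells.
    """
    votes = {}
    for cid, label in zip(cluster_ids, reference_labels):
        if label is not None:
            votes.setdefault(cid, Counter())[label] += 1
    return {cid: votes[cid].most_common(1)[0][0] if cid in votes else "unlabeled"
            for cid in set(cluster_ids)}

# Toy example: eight cells in three clusters, some carrying reference labels.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 2]
reference_labels = ["T cell", "T cell", "B cell", None,
                    "Monocyte", None, "Monocyte", None]
annotation = annotate_clusters(cluster_ids, reference_labels)
print(annotation)
```

Test cells inherit the label of their cluster, so a cluster with no reference cells stays unlabeled and can be flagged for manual review.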

[0090] This invention discloses a single-cell rare cell type identification system based on a divide-and-conquer strategy, comprising:

[0091] The Gaussian distribution latent variable representation module is used to encode single-cell transcriptome data into Gaussian distributed latent variables through a variational autoencoder. A decoder then reconstructs the distribution of the original single-cell transcriptome data from these Gaussian distributed latent variables. Through backpropagation and iterative optimization, the optimized Gaussian distributed latent variables are output to a rare cell type identification module based on a divide-and-conquer clustering algorithm. The single-cell transcriptome data X includes a test dataset, and preferably also includes a reference dataset, which is a labeled dataset. This module combines a deep neural network variational autoencoder to achieve low-dimensional mapping representation of Gaussian latent variables in single-cell transcriptome data, uses Bayesian network characteristics to provide confidence distribution intervals for the hidden layer representations, and achieves gene completion and batch effect optimization in the low-dimensional space.

[0092] The rare cell type identification module based on the divide-and-conquer clustering algorithm is used to perform spectral clustering on Gaussian distributed latent variables to obtain coarse-grained clustering results for single cells. The coarse-grained clustering results for single cells are optimized based on intra-cluster and inter-cluster losses. The divide-and-conquer strategy is used to perform secondary clustering on misclassified nodes or clusters until the intra-cluster and inter-cluster losses are minimized, thereby realizing the identification of rare cell types and obtaining single-cell clustering results.

[0093] The knowledge-driven cell annotation module is used to store the reference dataset and, in the single-cell clustering results, annotates the clustered single cells using the cell category labels of the reference dataset and a voting mechanism. This combines past knowledge to achieve a knowledge-driven, biologically meaningful cell annotation function.

[0094] Example 1:

[0095]-[0096] The pbmc_10k_v3 dataset (https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/pbmc_10k_v3) and the GSE115189 dataset from the GEO open-source database were selected as test data for analysis.

[0097] 1. Input single-cell transcriptome data:

[0098] The pbmc_10k_v3 dataset contains 11,769 columns (total cell count) and 23,036 rows (total gene count), meaning 11,769 samples and 23,036 features. The GSE115189 dataset contains 3,372 columns (total cell count) and 24,654 rows (total gene count), meaning 3,372 samples and 24,654 features. This example only uses two human peripheral blood datasets; in practical applications, multiple datasets can be added as needed.

[0099] 2. Characterization of latent variables in Gaussian distribution of cells:

[0100] The single-cell transcriptome data is encoded into Gaussian distributed latent variables by a variational autoencoder. The decoder reconstructs the distribution of the original single-cell transcriptome data from the Gaussian distributed latent variables and obtains the optimized Gaussian distributed latent variables through backpropagation iterative optimization.

[0101] 3. Rare cell type identification based on divide-and-conquer clustering algorithm:

[0102] Based on the variational autoencoder from step 2, spectral clustering is performed on the Gaussian distributed latent variables. This spectral clustering is then used to group cells, yielding coarse-grained clustering results for individual cells. The coarse-grained clustering results are optimized based on intra-cluster and inter-cluster losses. A divide-and-conquer strategy is employed to perform secondary clustering on misclassified nodes or clusters until intra-cluster and inter-cluster losses are minimized, thereby achieving the identification of rare cell types and obtaining individual cell clustering results. Self-supervised clustering of individual cells is performed. The algorithm in the model, to a certain extent, ensures that the distance between each cluster is maximized and the distance between nodes within each cluster is minimized. The generative model from step 2, i.e., the variational autoencoder, provides the clustering confidence intervals for each individual cell.

[0103] 4. Knowledge-driven cell annotation module:

[0104] The GSE111108 dataset, containing 1039 columns (total number of cells) and 29451 rows (total number of genes), i.e., 1039 samples and 29451 features, is selected as a reference dataset. It is mixed with the data from step 1 and then used for training in step 2 to achieve knowledge-driven cell annotation.

Claims

1. A method for identifying rare single-cell cell types based on a divide-and-conquer strategy, characterized in that it comprises: S1. using a variational autoencoder to encode single-cell transcriptome data into Gaussian distributed latent variables; S2. performing spectral clustering on the Gaussian distributed latent variables to obtain coarse-grained clustering results for single cells; S3. optimizing the coarse-grained clustering results of single cells based on intra-cluster loss and inter-cluster loss until the intra-cluster loss and inter-cluster loss are minimized, to obtain single-cell clustering results that include rare cell types; wherein, in S1, encoding single-cell transcriptome data into Gaussian distributed latent variables using a variational autoencoder specifically comprises: the variational autoencoder infers the posterior distribution p(w|X,Y) of the model parameters w through a Bayesian neural network, and MC-dropout is used to minimize the Kullback-Leibler divergence between the variational distribution q_θ(w) and the posterior distribution p(w|X,Y), yielding the Gaussian distributed latent variables; X is the single-cell transcriptome data, and Y is the estimate of the single-cell transcriptome data X; wherein minimizing the Kullback-Leibler divergence between the variational distribution q_θ(w) and the posterior distribution p(w|X,Y) with MC-dropout is specifically accomplished by maximizing the function L(θ), with formula (12) solved by iterative formula (13), where w_j denotes the j-th sample drawn from the posterior distribution p(w|X,Y), with w_j ~ q_θ(w), j = 1, ..., N, N is the number of network layers, K is an arbitrary constant, and p(D|w_j) denotes the conditional probability of w_j; in S3, formulas (15)-(17) are specifically used to optimize the coarse-grained clustering results of single cells, where λ_tig represents the loss function within a cluster, λ_seq represents the loss function between clusters, C represents all clusters after clustering, c represents a specific cluster among all the clusters after clustering, and x_m and x_n represent two samples within or between clusters.

2. The method for identifying rare single-cell cell types based on a divide-and-conquer strategy according to claim 1, characterized in that the posterior distribution p(w|X,Y) is calculated as shown in equation (1): $$p(w \mid X, Y) = \frac{p(Y \mid X, w)\, p(w)}{p(Y \mid X)} \tag{1}$$ where p(Y|X,w) is the probability of generating single-cell transcriptome data X given model parameters w, p(w) is the distribution of model parameters w, and p(Y|X) is the probability of outputting Y.

3. The method for identifying rare single-cell cell types based on a divide-and-conquer strategy according to claim 1, characterized in that S1 also includes obtaining the confidence distribution interval of the Gaussian distributed latent variables by training the Bayesian neural network.

4. The method for identifying rare single-cell cell types based on a divide-and-conquer strategy according to claim 1, characterized in that, In S1, the distribution of the original single-cell transcriptome data is reconstructed from the Gaussian distributed latent variables through the decoder, and the optimized Gaussian distributed latent variables are obtained through backpropagation iterative optimization.

5. The method for identifying rare single-cell cell types based on a divide-and-conquer strategy according to claim 1, characterized in that, In S1, the single-cell transcriptome data includes a reference dataset, which is a dataset labeled with cell category labels.

6. The method for identifying rare single-cell cell types based on a divide-and-conquer strategy according to claim 5, characterized in that, In S3, for the single-cell clustering results, a voting mechanism is used to annotate the clustered single cells based on the cell category labels of the reference dataset.

7. A single-cell rare cell type identification system based on a divide-and-conquer strategy, characterized in that it comprises: a cell Gaussian distribution latent variable representation module, used to encode single-cell transcriptome data into Gaussian distributed latent variables through a variational autoencoder and output them to the rare cell type identification module based on a divide-and-conquer clustering algorithm; specifically, the variational autoencoder encodes single-cell transcriptome data into Gaussian distributed latent variables as follows: the variational autoencoder infers the posterior distribution p(w|X,Y) of the model parameters w through a Bayesian neural network, and MC-dropout is used to minimize the Kullback-Leibler divergence between the variational distribution q_θ(w) and the posterior distribution p(w|X,Y), yielding the Gaussian distributed latent variables; X is the single-cell transcriptome data, and Y is the estimate of the single-cell transcriptome data X; wherein minimizing the Kullback-Leibler divergence between the variational distribution q_θ(w) and the posterior distribution p(w|X,Y) with MC-dropout is specifically accomplished by maximizing the function L(θ), with formula (12) solved by iterative formula (13), where w_j denotes the j-th sample drawn from the posterior distribution p(w|X,Y), with w_j ~ q_θ(w), j = 1, ..., N, N is the number of network layers, K is an arbitrary constant, and p(D|w_j) denotes the conditional probability of w_j; and a rare cell type identification module based on the divide-and-conquer strategy clustering algorithm, used to perform spectral clustering on the Gaussian distributed latent variables to obtain single-cell coarse-grained clustering results, and to optimize the single-cell coarse-grained clustering results based on intra-cluster loss and inter-cluster loss until the intra-cluster loss and inter-cluster loss are minimized, to obtain single-cell clustering results containing rare cell types; wherein the optimization of the single-cell coarse-grained clustering results based on intra-cluster loss and inter-cluster loss specifically uses formulas (15)-(17), where λ_tig represents the loss function within a cluster, λ_seq represents the loss function between clusters, C represents all clusters after clustering, c represents a specific cluster among all the clusters after clustering, and x_m and x_n represent two samples within or between clusters.