A method and device for automatic annotation of a cell subpopulation

By combining a modularity-optimized community detection algorithm, iterative similarity-based merging, and a dedicated specificity scoring function, the method addresses the problem of automatically annotating novel cell subtypes in single-cell transcriptome sequencing. It aims at accurate and robust cell subpopulation identification, adapts to the characteristics of multimodal data, and improves the efficiency and reliability of single-cell research.

CN121725895B · Active · Publication Date: 2026-06-26 · INNOVATION CENTER OF YANGTZE RIVER DELTA ZHEJIANG UNIVERSITY

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: INNOVATION CENTER OF YANGTZE RIVER DELTA ZHEJIANG UNIVERSITY
Filing Date: 2026-02-25
Publication Date: 2026-06-26

Abstract

The application discloses a method and device for automatic annotation of cell subpopulations, and relates to the technical field of single-cell sequencing. The method comprises the following steps: acquiring single-cell transcriptome data; within a preset target resolution range, obtaining an initial clustering result at each resolution, i.e., a first target clustering result, by means of a community detection algorithm based on modularity optimization; at each resolution, if any cluster in the first target clustering result does not meet the specificity requirement, obtaining a second target clustering result by iterative similarity-based merging; determining a representative marker of each cluster in the second target clustering result by using a getSpecificityScore function; and recommending a cell subpopulation annotation result by comparing the specificity scores across resolutions. By combining a multi-scale analysis strategy with specificity scoring and similarity merging, the application realizes automatic identification and annotation of novel cell subtypes without relying on prior annotation information, can adapt to diversified data characteristics, and improves the accuracy and robustness of cell subpopulation annotation.

Description

Technical Field

[0001] This application relates to the field of single-cell sequencing technology, and in particular to a method and apparatus for automatic annotation of cell subpopulations.

Background Technology

[0002] The development of single-cell RNA sequencing (scRNA-seq) technology has greatly advanced our understanding of cellular heterogeneity and disease mechanisms. Through high-throughput sequencing, researchers can reveal the distribution and functional status of different cell types and subtypes in complex tissues at the single-cell level. This is of great significance for exploring the mechanisms of disease occurrence, development, and treatment response. Especially in fields such as cancer, immune diseases, and cardiovascular diseases, the precise identification of cell subtypes provides a solid foundation for discovering new therapeutic targets and designing personalized therapies.

[0003] However, current cell subtype annotation faces multiple challenges. Most existing methods rely on reference datasets with rich annotation information or predefined marker genes, which not only limits the identification of novel cell subtypes but also has weak generalization ability across different datasets and experimental conditions. Technical noise, insufficient sequencing depth, cell cycle differences, and biological background heterogeneity lead to a large amount of variation and bias in the data, increasing the uncertainty of cell state identification. Furthermore, cell states often exhibit continuous changes rather than clear boundaries, which increases the difficulty and ambiguity of classification. Traditional manual annotation, while flexible, is time-consuming, labor-intensive, and lacks standardization, making it difficult to meet the needs of large-scale data analysis.

[0004] To address these challenges, machine learning methods such as ensemble learning have been increasingly introduced into single-cell analysis in recent years, improving the stability and accuracy of cell classification by fusing multiple models or feature sources. However, automatically identifying and annotating novel cell subtypes without relying on prior annotation information remains a pressing problem. Meanwhile, the emergence of multimodal data, such as spatial transcriptomics and proteomics data, provides more comprehensive information for the characterization of cell subtypes, placing higher demands on the adaptability of annotation tools.

Summary of the Invention

[0005] The purpose of this application is to provide a method and apparatus for automatic annotation of cell subpopulations, so as to automatically identify and annotate novel cell subtypes without relying on prior annotation information, in order to adapt to diverse data characteristics and improve the accuracy and robustness of cell subpopulation annotation.

[0006] To achieve the above objectives, this application provides the following solution.

[0007] Firstly, this application provides an automatic annotation method for cell subpopulations, including:

[0008] Acquire single-cell transcriptomics data;

[0009] Within a preset target resolution range, obtain the initial clustering result at each resolution, i.e., the first target clustering result, by using a community detection algorithm based on modularity optimization;

[0010] At each resolution, if any cluster in the first target clustering result does not meet the specificity requirement, iteratively merge clusters based on similarity to obtain the second target clustering result; both the first target clustering result and the second target clustering result include multiple clusters; the specificity requirement is that the specificity score is not less than a preset threshold for measuring specificity;

[0011] Use the getSpecificityScore function to determine the representative marker of each cluster in the second target clustering result;

[0012] Compare the specificity scores across resolutions to recommend a cell subpopulation annotation result.

[0013] Optionally, if any clusters in the first target clustering results at each resolution do not meet the specificity requirement, they are iteratively merged using similarity to obtain the second target clustering results, specifically including:

[0014] Set the iteration number s to 0;

[0015] Based on the target resolution, the single-cell transcriptomics data are clustered to obtain the clustering result of the s-th iteration, which is used as the initial clustering result.

[0016] The getSpecificityScore function is used to calculate the specificity score of each cluster in the clustering results of the s-th iteration;

[0017] Clusters with specificity scores below a preset threshold for measuring specificity are merged with clusters of high similarity to obtain the clustering results for the (s+1)th iteration.

[0018] Increment the value of s by 1, and return to the steps of calculating the specificity score of each cluster in the clustering results of the s-th iteration using the getSpecificityScore function, until the termination merging condition is met.

[0019] Optionally, the getSpecificityScore function is used to calculate the specificity score of each cluster in the clustering results of the s-th iteration, specifically including:

[0020] Determine whether gene i in cluster m satisfies the calculation conditions of the getSpecificityScore function to obtain the first judgment result; cluster m is any cluster in the clustering result of the s-th iteration;

[0021] If the first judgment result is negative, then set the specificity score of gene i in cluster m to 0;

[0022] If the first judgment result is yes, then the specificity score of gene i in cluster m is calculated using the getSpecificityScore function;

[0023] Based on the specificity scores of each gene in cluster m, the overall specificity score of cluster m is calculated using the following formula;

[0024] $S_m = \sum_{i=1}^{n} s_{m,(i)}$;

[0025] where $S_m$ is the overall specificity score of cluster m; n is a user-defined parameter giving the number of top-scoring genes selected; and $s_{m,(i)}$ is the specificity score of the gene ranked i-th in cluster m after sorting by specificity score from highest to lowest, with i ranging from 1 to n.
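The scoring steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the per-gene formula and the eligibility condition are supplied by the caller (`score_fn` and `is_eligible` are hypothetical names), because their exact form is not reproduced in this text.

```python
from typing import Callable, Dict, Iterable


def gated_gene_scores(
    genes: Iterable[str],
    is_eligible: Callable[[str], bool],   # hypothetical: the calculation condition
    score_fn: Callable[[str], float],     # hypothetical: the per-gene score formula
) -> Dict[str, float]:
    """Score each gene; genes failing the calculation condition get 0."""
    return {g: (score_fn(g) if is_eligible(g) else 0.0) for g in genes}


def cluster_specificity_score(gene_scores: Dict[str, float], n: int = 5) -> float:
    """Overall cluster score: sum of the n highest per-gene scores."""
    ranked = sorted(gene_scores.values(), reverse=True)
    return sum(ranked[:n])
```

For example, with made-up values `{"SPP1": 2.0, "CD68": 1.5, "ACTB": 0.9, "GAPDH": 0.4}` and an eligibility rule that rejects values below 1.0, the two rejected genes are scored 0 and the top-3 cluster score is 3.5.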

[0026] Optionally, the getSpecificityScore function is evaluated under the following conditions: ;

[0027] where the quantities appearing in the condition are: the proportion of cells expressing gene i in cluster m in the clustering results of the s-th iteration; the proportion of cells expressing gene i in all clusters other than cluster m in the clustering results of the s-th iteration; a preset threshold for measuring the difference in cell proportions; the absolute expression level of gene i in cluster m; and a preset threshold for measuring absolute expression levels.

[0028] Optionally, the getSpecificityScore function is:

[0029] ;

[0030] where the quantities appearing in the function are: the specificity score of gene i in cluster m; the absolute expression level of gene i in cluster m; the proportion of cells expressing gene i in cluster m in the clustering results of the s-th iteration; and the proportion of cells expressing gene i in the clusters other than cluster m in the clustering results of the s-th iteration.

[0031] Optionally, the termination merging condition is that all clusters in the clustering results meet the specificity evaluation condition; the specificity evaluation condition is: the number of merges reaches the maximum number of merges threshold or the specificity score is not less than the preset threshold for measuring specificity.

[0032] Optionally, clusters with specificity scores below a preset threshold for measuring specificity are merged with clusters of high similarity, specifically including:

[0033] Clusters in the clustering results of the s-th iteration whose specificity scores are less than a preset threshold for measuring specificity are identified as first target clusters; the similarity between each first target cluster and every other cluster in the clustering results of the s-th iteration is calculated using the similarity calculation formula.

[0034] Identify the cluster with the highest similarity to the first target cluster and designate it as the cluster to be merged.

[0035] Determine whether the similarity between the first target cluster and the cluster to be merged is greater than a similarity threshold to obtain a second judgment result;

[0036] If the second judgment result is yes, then the first target cluster is merged into the cluster to be merged;

[0037] If the second judgment result is negative, then the first target cluster is not merged.
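The merge decision described above can be sketched as follows. The similarity calculation formula is not given in this text, so the sketch assumes cosine similarity between per-cluster mean expression vectors; that choice, the function names, and the toy profiles are illustrative assumptions only.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pick_merge_target(
    first_target: str,
    profiles: Dict[str, List[float]],   # cluster id -> mean expression vector
    similarity_threshold: float,
) -> Optional[str]:
    """Return the most similar other cluster if it exceeds the threshold, else None."""
    best_id, best_sim = None, float("-inf")
    for cluster_id, profile in profiles.items():
        if cluster_id == first_target:
            continue
        sim = cosine_similarity(profiles[first_target], profile)
        if sim > best_sim:
            best_id, best_sim = cluster_id, sim
    # Merge only when the best candidate is similar enough; otherwise keep the cluster.
    return best_id if best_id is not None and best_sim > similarity_threshold else None
```

Returning `None` corresponds to the negative second judgment result, in which the first target cluster is left unmerged.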

[0038] Optionally, when the specificity score of cluster m in the initial clustering results is less than a preset threshold for measuring specificity, the maximum number of merges threshold for cluster m is determined to be 2.

[0039] Optionally, the getSpecificityScore function is used to determine the representative marker of each cluster in the second target clustering result, specifically including:

[0040] The specificity score of each gene in the second target cluster is calculated using the getSpecificityScore function; the second target cluster is any cluster in the second target clustering result.

[0041] The genes in the second target cluster are sorted in descending order of specificity score to obtain the second gene sequence;

[0042] The phenotypic molecules of the first three genes in the second gene sequence are selected as the representative markers of the second target cluster.
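As a small illustration of the marker selection above, the following sketch ranks genes by a precomputed specificity score and keeps the first three. The function name and the scores are hypothetical, and mapping a gene to its phenotypic molecule is left out because the text does not describe it.

```python
from typing import Dict, List


def representative_markers(gene_scores: Dict[str, float], k: int = 3) -> List[str]:
    """Return the k genes with the highest specificity scores, best first."""
    ranked = sorted(gene_scores.items(), key=lambda item: item[1], reverse=True)
    return [gene for gene, _ in ranked[:k]]


# Made-up scores for one cluster.
example_scores = {"SPP1": 3.2, "FN1": 2.1, "HES1": 0.0, "CD68": 2.8, "APOE": 1.0}
top_three = representative_markers(example_scores)
```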

[0043] Secondly, this application provides an automatic cell subpopulation annotation device, which applies the above-mentioned automatic cell subpopulation annotation method, and the automatic cell subpopulation annotation device includes:

[0044] The data acquisition module is used to acquire single-cell transcriptomics data;

[0045] The first clustering module, within the preset target resolution range, uses a community detection algorithm based on modularity optimization to obtain the initial clustering results at each resolution, i.e., the first target clustering;

[0046] The second clustering module, at each resolution, if there are clusters in the first target clustering results that do not meet the specificity requirement, uses similarity iterative merging to obtain the second target clustering result; both the first target clustering result and the second target clustering result include multiple clusters; the specificity requirement is: the specificity score is not less than a preset threshold for measuring specificity;

[0047] The labeling module uses the getSpecificityScore function to determine the representative marker of each cluster in the second target clustering result;

[0048] The comparison module compares the specificity scores at different resolutions to recommend a cell subpopulation annotation result.

[0049] According to the specific embodiments provided in this application, this application has the following technical effects.

[0050] This application provides a method and apparatus for automatic cell subpopulation annotation. First, single-cell transcriptomics data is acquired. Then, within a preset target resolution range, an initial clustering result (i.e., the first target cluster) is obtained at each resolution using a community detection algorithm based on modularity optimization. If any clusters in the first target clustering result at each resolution do not meet the specificity requirement, they are iteratively merged using similarity to obtain a second target clustering result. Then, the `getSpecificityScore` function is used to determine the representative label of each cluster in the second target clustering result. The specificity scores at different resolutions are compared to recommend cell subpopulation annotation results. This application, through a multi-scale analysis strategy combined with specificity scoring and similarity merging for automatic cell subpopulation annotation, achieves automatic identification and annotation of novel cell subtypes without relying on prior annotation information. It can adapt to diverse data characteristics and improve the accuracy and robustness of cell subpopulation annotation.

Attached Figure Description

[0051] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0052] Figure 1 is a flowchart of an automatic annotation method for cell subpopulations provided in an embodiment of this application.

[0053] Figure 2 is a schematic diagram of an automatic annotation method for cell subpopulations provided in an embodiment of this application.

[0054] Figure 3 is a comparison chart of the performance of Subtypist and other tools on a simulated dataset, provided in an embodiment of this application.

[0055] Figure 4 is a schematic diagram of the performance of Subtypist on real data, provided in an embodiment of this application.

[0056] Figure 5 is an example diagram of the identification of spatially specific macrophage subtypes in hepatocellular carcinoma, provided in an embodiment of this application.

[0057] Figure 6 is an example diagram of the identification, in hepatocellular carcinoma, of SPP1+ macrophages similar to those in the original study, provided in an embodiment of this application.

[0058] Figure 7 is an example diagram of the spatial specificity of SPP1+ macrophages in hepatocellular carcinoma, provided in an embodiment of this application.

[0059] Figure 8 is an example diagram of the identification of a HES1+ epithelial cell subset enriched in stage II & III esophageal squamous cell carcinoma, provided in an embodiment of this application.

[0060] Figure 9 is a schematic diagram of differential expression analysis and enrichment analysis of HES1+ epithelial cells in esophageal squamous cell carcinoma, provided in an embodiment of this application.

[0061] Figure 10 is an example diagram of a HES1+ epithelial tumor cell subtype associated with poor prognosis in esophageal squamous cell carcinoma, provided in an embodiment of this application.

[0062] Figure 11 is an example diagram in which Subtypist reveals FN1+ macrophages enriched in the ischemic area of myocardial infarction, provided in an embodiment of this application.

[0063] Figure 12 is an example diagram of a pro-remodeling macrophage subtype that communicates with cardiomyocytes in the ischemic area of myocardial infarction, provided in an embodiment of this application.

[0064] Figure 13 is an example diagram of a pro-remodeling macrophage subtype spatially co-located with cardiomyocytes in the ischemic area of myocardial infarction, provided in an embodiment of this application.

Detailed Implementation

[0065] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0066] To make the above-mentioned objectives, features and advantages of this application more apparent and understandable, the application will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0067] Against this backdrop, it is particularly necessary to develop an annotation tool that can automatically identify cell subtypes under reference-free conditions, adapt to diverse data characteristics, and possess both accuracy and robustness. This will not only help discover potential novel cell subpopulations but also promote the in-depth development of single-cell omics research in elucidating disease mechanisms and its clinical applications.

[0068] In one exemplary embodiment, an automatic annotation method for cell subpopulations is provided. As shown in Figure 1, it includes the following steps 101-105.

[0069] Step 101: Obtain single-cell transcriptomics data.

[0070] This step aims to provide raw input for the entire annotation process.

[0071] Step 102: Within the preset target resolution range, the initial clustering results for each resolution are obtained using a community detection algorithm based on modularity optimization, i.e., the first target clustering.

[0072] To perform basic data segmentation and establish a preliminary clustering structure, this step identifies the basic structural characteristics of cell populations through preliminary clustering, laying the foundation for subsequent refined classification.

[0073] Step 103: At each resolution, if any cluster in the first target clustering result does not meet the specificity requirement, iteratively merge clusters based on similarity to obtain the second target clustering result; both the first target clustering result and the second target clustering result include multiple clusters; the specificity requirement is that the specificity score is not less than the preset threshold for measuring specificity.

[0074] This step effectively improves the biological specificity and subpopulation resolution of the final clustering results by automatically identifying and refining biologically insignificant cell subpopulations present in the initial clustering, providing a more accurate classification basis for subsequent cell type annotation.

[0075] Step 104: Use the getSpecificityScore function to determine the representative label of each cluster in the second target clustering result.

[0076] Step 105: Compare the specificity scores at different resolutions to recommend cell subpopulation annotation results.

[0077] By calculating the specificity score of each cluster, specific genes or molecular markers that can represent the biological characteristics of the cluster are screened out, providing a key basis for subsequent functional analysis and annotation of cell subpopulations; and by comparing multiple scales within the target resolution range, cell subpopulation annotation results with more biological characteristics are recommended.

[0078] Implementing steps 101-105 above can achieve the following technical effects.

[0079] (1) It has the ability to discover new cell subtypes independently without relying on reference datasets. The method in the above embodiments adopts a reference-free annotation strategy, which does not rely on predefined reference cell atlases or labels, and can discover representative new subtypes in real samples. It is especially suitable for exploratory studies in unknown tissues or complex disease environments.

[0080] (2) Specifically designed for cell subtype annotation tasks, it has the ability to automatically identify phenotypic molecules. The method in the above embodiments constructs a specific scoring function that can automatically screen out the most representative phenotypic marker genes in each cluster, which not only improves the accuracy of annotation but also enhances the biological interpretability of the results.

[0081] (3) An integrated strategy is adopted to improve the stability and reliability of annotation results. By using multi-resolution calculation and automatic scoring mechanism, the process of manual layer-by-layer judgment is simulated, the annotation results under different parameters are systematically evaluated, and the optimal result is recommended. This avoids the uncertainty caused by human selection and improves the degree of standardization.

[0082] (4) It has strong resistance to technical noise and reduces the risk of subtype loss. When faced with common technical noise such as insufficient sequencing depth and sparse gene expression, this method can effectively avoid the neglect of subpopulations through the design of specificity test modules, ensuring comprehensive and reliable results.

[0083] (5) Compatible with multiple data types and suitable for multimodal analysis needs. This method can be directly applied to single-cell transcriptome, single-nucleus transcriptome, and spatial transcriptome data. It is highly adaptable and can be used for subpopulation identification and spatial functional analysis in multiple systems such as tumor, immune, and cardiovascular, and has broad clinical and research application potential.

[0084] As shown in Figure 2, the automatic cell subpopulation annotation method in this application includes a data input section, a feature scoring section, a similarity merging section, a specificity testing section, and a final result integration and visualization section. Based on the data input module, this application simulates the manual subpopulation annotation process and sequentially constructs a multi-resolution feature scoring module, a similarity merging module, a specificity testing module, and a final result integration and visualization module, with each module corresponding to a different functional part. Inspired by ensemble learning strategies, this application supports systematic evaluation and annotation of cell subtypes at multiple clustering resolutions, and finally compares and integrates the multi-resolution annotation results.

[0085] In another exemplary embodiment, in the data input module corresponding to step 101 above, the system supports user input of standard-format single-cell expression data (such as Seurat objects) and automatically extracts gene expression information from them. While ensuring flexibility, the system also provides a default analysis workflow suitable for common application scenarios and rapid deployment, and supports personalized analysis settings. Users can customize multi-sample integration, feature selection, and other strategies according to research needs, thereby adapting to different data sources and biological problems.

[0086] In another exemplary embodiment, step 103 above corresponds to the specificity scoring module and the specificity testing module.

[0087] The working principle of the specificity scoring module is as follows:

[0088] At a specified resolution, clusters are first formed using a community detection algorithm based on modularity optimization or a similarity-based merging algorithm. Then, the method uses the FindMarkers function from the Seurat package to identify marker genes specific to each cluster. The system defines a specificity score, a quantitative indicator used to assess the uniqueness and potential of gene expression within a cluster. Inspired by the manual annotation process, it is calculated from the fold change and expression percentage of marker genes. For a given gene i in cluster m, the specificity score is calculated using the following formula:

[0089] (1);

[0090] where the quantities in formula (1) are: the specificity score of gene i in cluster m; the absolute expression level of gene i in cluster m; the proportion of cells expressing gene i in cluster m in the clustering results of the s-th iteration; and the proportion of cells expressing gene i in the clusters other than cluster m in the clustering results of the s-th iteration.

[0091] The absolute expression level is characterized by the average log2 fold change of the expression of gene i in cluster m, and the second proportion is the proportion of cells expressing gene i in the clusters other than cluster m. To ensure that the selected marker genes have sufficient expression specificity, this method sets the following constraints on the relevant parameters: first, the expression ratio difference must be positive, to ensure that the target gene shows an upregulated expression trend in the current cluster; second, the expression ratio difference must be greater than or equal to the preset threshold, to exclude genes with non-specific high expression. If either condition is not met, the specificity score of the corresponding gene in that cluster is forcibly set to 0 to prevent non-specific markers from interfering with downstream analysis. Then, the overall specificity score of each cluster is defined as the sum of the specificity scores of its top five marker genes:

[0092] $S_m = \sum_{i=1}^{5} s_{m,(i)}$ (2);

[0093] where $S_m$ is the overall specificity score of cluster m, and $s_{m,(i)}$ is the specificity score of the i-th gene in the first gene sequence; the first gene sequence is obtained by sorting the genes in cluster m in descending order of their specificity scores. To evaluate the overall annotation quality at a given resolution, a resolution-level specificity score is defined as the average of the cluster-level specificity scores at that resolution:

[0094] $S_r = \frac{1}{K_r} \sum_{m=1}^{K_r} S_m$ (3);

[0095] where $K_r$ is the total number of clusters at resolution r, and $S_m$ is the overall specificity score of cluster m as defined above. The resolution-level specificity score provides a quantitative metric for comparing annotation results across multiple resolutions. Furthermore, the system considers various practical scenarios that may arise during manual annotation and optimizes the specificity score calculation strategy accordingly.
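The resolution-level comparison can be sketched as follows. The averaging follows the definition above; recommending the resolution with the highest average is an assumption of this sketch, since the text only states that the scores are compared. The function names and the numbers are illustrative.

```python
from statistics import mean
from typing import Dict, List


def resolution_score(cluster_scores: List[float]) -> float:
    """Resolution-level score: mean of the cluster-level specificity scores."""
    return mean(cluster_scores)


def recommend_resolution(scores_by_resolution: Dict[float, List[float]]) -> float:
    """Pick the resolution whose clusters have the highest mean score (assumed rule)."""
    return max(
        scores_by_resolution,
        key=lambda r: resolution_score(scores_by_resolution[r]),
    )


# Made-up cluster-level scores at three clustering resolutions.
toy_scores = {0.4: [4.0, 6.0], 0.8: [5.0, 7.0, 6.0], 1.2: [2.0, 3.0, 1.0, 2.0]}
best = recommend_resolution(toy_scores)
```

Averaging rather than summing keeps resolutions with many small clusters from being favored simply because they contain more clusters.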

[0096] If gene i has a low overall expression level in both the target cluster m and the other clusters but still shows some specificity (i.e., the difference between its expression proportion in cluster m and that in the other clusters is not less than the preset threshold for measuring the difference in cell proportions), the gene is considered to have discriminative power even though its expression proportion in cluster m is low, and its specificity score is calculated according to formula (1).

[0097] If the expression proportion of gene i in cluster m is similar to that in the other clusters, but its absolute expression level in cluster m is significantly higher than in the other clusters, i.e., not less than the preset threshold for measuring the absolute expression level, the gene is considered to have high expression specificity, and its specificity score is calculated according to formula (1).
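One possible reading of the two cases above is an eligibility gate that admits a gene when either the proportion difference or the absolute expression level clears its threshold. The sketch below encodes that reading; combining the cases with a logical OR, the field names, and the threshold values are all assumptions, not statements of the claimed condition.

```python
from dataclasses import dataclass


@dataclass
class GeneStats:
    avg_log2fc: float  # absolute expression level (average log2 fold change)
    pct_in: float      # proportion of cells expressing the gene in cluster m
    pct_out: float     # proportion of cells expressing it in the other clusters


def passes_gate(g: GeneStats, min_pct_diff: float, min_expr: float) -> bool:
    """Assumed gate: either case described in the text makes the gene eligible."""
    proportion_case = (g.pct_in - g.pct_out) >= min_pct_diff  # low but specific
    expression_case = g.avg_log2fc >= min_expr                # similar pct, high level
    return proportion_case or expression_case


low_but_specific = GeneStats(avg_log2fc=0.3, pct_in=0.20, pct_out=0.02)
similar_pct_high_level = GeneStats(avg_log2fc=1.8, pct_in=0.60, pct_out=0.58)
non_specific = GeneStats(avg_log2fc=0.2, pct_in=0.50, pct_out=0.48)
```

Genes that fail the gate would have their specificity score set to 0 rather than being passed to formula (1).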

[0098] Furthermore, the system will calculate the specificity score for each cluster at a specified resolution, and use this score as a standard to measure the specificity level of the cluster, providing basic data support for the subsequent specificity testing module.

[0099] The working principle of the specificity testing module is as follows:

[0100] In the specificity testing module, the system will perform an overall evaluation of the input cluster set based on the above calculation results to determine whether each cluster has sufficient expression specificity. If the specificity score of a cluster is lower than a preset threshold, it is considered to have insufficient discriminative ability. At this time, the system will trigger an automatic merging strategy to merge the cluster with neighboring clusters to improve the accuracy and stability of annotation.

[0101] In particular, during actual cell subtype identification, some clusters are often found to show weak expression features, leading to unclear biological meaning and difficulty in accurate annotation. Especially when certain subtypes lack obvious differential expression characteristics, their representative marker genes cannot provide sufficient discriminative power, and the clusters therefore exhibit low specificity. Blindly merging such clusters often fails to improve clustering quality and may even weaken the specificity and stability of other clusters. To address these issues, this method introduces a cluster merging tracking and control mechanism. Specifically, during the cluster merging operation, the system constructs a vector that records the number of times each initial cluster participates in the merging process. For clusters whose initial specificity score is below the set threshold, the system sets the maximum number of merging attempts they may participate in to 2. If a cluster's specificity score still does not reach the preset threshold after two merging operations, the system considers that the cluster has a certain representativeness of expression features, stops merging it further, and retains it in its current state as an independent cluster for the subsequent process.

[0102] When the specificity scores of all clusters meet the set threshold conditions, the system terminates the merging process and outputs the current cluster set as the final annotation result, simultaneously entering the integration result and visualization module. If there are still clusters whose specificity scores do not meet the standard, the system automatically enters the next round of the similarity merging module to further adjust and optimize the structure of these clusters until the overall cluster set meets the annotation accuracy and specificity requirements.

[0103] In another exemplary embodiment, step 103 described above can be replaced by steps 201-205.

[0104] Step 201: Based on the target resolution, the single-cell transcriptomics data are clustered to obtain the clustering result of the s-th iteration, which is used as the initial clustering result.

[0105] Step 202: Use the getSpecificityScore function to calculate the specificity score of each cluster in the clustering results of the s-th iteration.

[0106] This step is used to determine the specificity of the clustering results in this iteration.

[0107] Step 203: Merge clusters with specificity scores less than the preset threshold for measuring specificity with clusters with high similarity to obtain the clustering results of the (s+1)th iteration.

[0108] This step is used to update and optimize clustering results by merging neighboring clusters that do not meet specificity requirements, effectively reducing biologically insignificant cell subpopulations.

[0109] Step 204: Increment the value of s by 1 and return to step 202, in which the specificity score of each cluster in the clustering results of the s-th iteration is calculated using the getSpecificityScore function, until the termination merging condition is met.

[0110] By incrementing the iterative counter and repeatedly executing the specificity evaluation and optimization process, the clustering results are continuously improved until the preset termination criteria are met, ensuring that the final output clustering results meet both biological specificity requirements and maintain the stability of the cluster structure.
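As a non-limiting illustration, the loop formed by steps 202 to 204 can be sketched as follows. The scoring and merging routines are passed in as callables because their exact form is defined elsewhere in the description; the names `iterative_merge`, `score_cluster`, `merge_once`, and `max_rounds` are assumptions, and the termination test used here (no low-scoring cluster left, or no further merge possible) is a simplification of the termination merging condition.

```python
from typing import Callable, Dict, List, Set

Clustering = Dict[str, Set[str]]  # cluster label -> set of cell identifiers


def iterative_merge(
    clustering: Clustering,
    score_cluster: Callable[[Clustering, str], float],
    merge_once: Callable[[Clustering, List[str]], Clustering],
    threshold: float,
    max_rounds: int = 50,
) -> Clustering:
    """Repeat scoring (step 202) and merging (step 203) until nothing changes."""
    for _ in range(max_rounds):
        # Step 202: score every cluster of the current iteration.
        scores = {c: score_cluster(clustering, c) for c in clustering}
        low = [c for c, s in scores.items() if s < threshold]
        if not low:
            break  # all clusters meet the specificity requirement
        # Step 203: merge low-specificity clusters into similar clusters.
        merged = merge_once(clustering, low)
        if merged == clustering:
            break  # no merge was possible, stop iterating
        # Step 204: the merged result becomes the next iteration's clustering.
        clustering = merged
    return clustering
```

Passing the scoring and merging logic as arguments keeps the loop independent of the particular specificity function and similarity measure used.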

[0111] In another exemplary embodiment, step 202 above can be implemented using steps 301-304 as follows.

[0112] Step 301: Determine whether gene i in cluster m satisfies the calculation conditions of the getSpecificityScore function to obtain the first judgment result; cluster m is any cluster in the clustering result of the s-th iteration.

[0113] Step 302: If the first judgment result is negative, then set the specificity score of gene i in cluster m to 0.

[0114] Step 303: If the first judgment result is yes, then use the getSpecificityScore function to calculate the specificity score of gene i in cluster m.

[0115] Step 304: Based on the specificity scores of each gene in cluster m, calculate the overall specificity score of cluster m using the following formula;

[0116] ;

[0117] where the left-hand side of the formula is the overall specificity score of cluster m; n is a user-defined parameter specifying that the n genes with the highest specificity scores are selected; and the term indexed by i is the specificity score of the gene ranked i-th in cluster m after sorting by specificity score from highest to lowest, with i ranging from 1 to n.

[0118] The calculation condition for the getSpecificityScore function is as follows: ;

[0119] The quantities in the above condition are: the proportion of cells expressing gene i in cluster m in the clustering results of the s-th iteration; the proportion of cells expressing gene i in all clusters other than cluster m in the clustering results of the s-th iteration; a preset threshold for measuring the difference in cell proportions; the absolute expression level of gene i in cluster m; and a preset threshold for measuring absolute expression levels.

[0120] The getSpecificityScore function is as follows:

[0121] ;

[0122] where the function yields the specificity score of gene i in cluster m from three quantities: the absolute expression level of gene i in cluster m; the proportion of cells expressing gene i in cluster m in the clustering results of the s-th iteration; and the proportion of cells expressing gene i in the clusters other than cluster m in the clustering results of the s-th iteration.
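As a non-limiting illustration, steps 301 to 304 can be sketched as follows. Several points are assumptions and not taken from the formulas of this application: the calculation condition is taken to require both the proportion difference and the absolute expression level to reach their thresholds, the cluster-level score is taken to be the mean of the top n gene scores, and the gene-level scoring function is supplied by the caller as `score_fn` rather than reproduced here. All function and parameter names are hypothetical.

```python
from typing import Callable, Sequence


def gated_gene_score(
    p_in: float,           # proportion of cells expressing gene i in cluster m
    p_out: float,          # proportion of cells expressing gene i in the other clusters
    expr: float,           # absolute expression level of gene i in cluster m
    min_prop_diff: float,  # threshold on the proportion difference (assumed form)
    min_expr: float,       # threshold on the absolute expression level (assumed form)
    score_fn: Callable[[float, float, float], float],
) -> float:
    """Steps 301-303: return 0 when the condition fails, else the gene score."""
    if (p_in - p_out) < min_prop_diff or expr < min_expr:
        return 0.0
    return score_fn(expr, p_in, p_out)


def cluster_score(gene_scores: Sequence[float], n: int) -> float:
    """Step 304 (assumed aggregation): mean of the n highest gene scores."""
    top = sorted(gene_scores, reverse=True)[:n]
    return sum(top) / len(top) if top else 0.0


# Placeholder scoring function used only to exercise the sketch.
def placeholder(expr: float, p_in: float, p_out: float) -> float:
    return expr * (p_in - p_out)


example_gene = gated_gene_score(0.8, 0.1, 2.0, 0.25, 1.0, placeholder)
example_cluster = cluster_score([0.5, 3.0, 1.0], n=2)
```

Setting the score of a gene that fails the condition to 0 keeps it out of the top n unless fewer than n genes pass.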

[0123] In another exemplary embodiment, step 203 above may be implemented using steps 401-405.

[0124] Step 401: In the clustering results of the s-th iteration, the clusters with a specificity score less than a preset threshold for measuring specificity are identified as the first target clusters; using the similarity calculation formula, the similarity between the first target clusters and each cluster in the clustering results of the s-th iteration, excluding the first target clusters, is calculated.

[0125] Step 402: Determine the cluster with the highest similarity to the first target cluster and use it as the cluster to be merged;

[0126] Step 403: Determine whether the similarity between the first target cluster and the cluster to be merged is greater than a similarity threshold, and obtain a second determination result;

[0127] Step 404: If the second judgment result is yes, then merge the first target cluster into the cluster to be merged;

[0128] Step 405: If the second judgment result is negative, then the first target cluster is not merged.

[0129] The similarity calculation formula is the above formula (4).
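As a non-limiting illustration, the decision made in steps 402 to 405 can be sketched as follows. The function name `choose_merge_target` and the example similarity values are assumptions; the similarities themselves are assumed to have been computed beforehand with the similarity calculation formula.

```python
from typing import Dict, Optional


def choose_merge_target(
    similarities: Dict[str, float],
    similarity_threshold: float,
) -> Optional[str]:
    """Pick the most similar cluster (step 402) and merge only if the
    similarity exceeds the threshold (steps 403-405)."""
    if not similarities:
        return None
    best = max(similarities, key=similarities.get)
    # Step 404 (merge) when above the threshold, step 405 (no merge) otherwise.
    return best if similarities[best] > similarity_threshold else None


# Hypothetical similarities between one first target cluster and the other clusters.
target = choose_merge_target({"b": 0.4, "c": 0.7}, similarity_threshold=0.5)
```

Returning `None` corresponds to step 405, in which the first target cluster is left unmerged.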

[0130] In another exemplary embodiment, the termination merging condition in step 204 above is that all clusters in the clustering results meet the specificity evaluation condition; the specificity evaluation condition is: the number of merges reaches the maximum number of merges threshold or the specificity score is not less than a preset threshold for measuring specificity. Specifically, when the specificity score of cluster m in the initial clustering results is less than the preset threshold for measuring specificity, the maximum number of merges threshold for cluster m is determined to be 2.

[0131] In another exemplary embodiment, step 103 described above corresponds to a similarity merging module, and the working principle of the similarity merging module is as follows:

[0132] When some clusters fail to meet the preset specificity requirements, the system initiates a similarity-based merging process, similar to hierarchical clustering. To this end, the method first calculates a pairwise similarity matrix over all pairs of clusters to guide subsequent merging operations. This process aims to simulate the cluster merging logic commonly used in manual annotation practice, i.e., deciding whether different clusters should be classified into the same subtype by evaluating the similarity of their marker genes. Specifically, for any cluster $m$, its differentially expressed genes (DEGs) are first extracted to construct the corresponding gene set $G_m$. The genes in $G_m$ are then sorted in descending order of log2 fold change, and the first 300 genes are retained, denoted $G_m^{300}$. For cluster $m$ and cluster $m'$, the Jaccard coefficient $J(m, m')$ based on DEG similarity is calculated as follows:

[0133] $$J(m, m') = \frac{\left|G_m^{300} \cap G_{m'}^{300}\right|}{\left|G_m^{300} \cup G_{m'}^{300}\right|} \quad (4)$$

[0134] Each element $D_{m,m'} = J(m, m')$ of the similarity matrix $D$ constructed in this way measures the gene-set similarity between cluster $m$ and cluster $m'$. Based on this matrix, the system can identify the two most similar clusters and perform a merging operation, thereby achieving progressive integration of clusters with insufficient expression specificity.
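As a non-limiting illustration, the Jaccard-based similarity between clusters can be sketched as follows. The function names and the gene names in the usage lines are assumptions; the retention of the first 300 genes by log2 fold change follows the description.

```python
from typing import Dict, Set, Tuple


def top_degs(deg_log2fc: Dict[str, float], k: int = 300) -> Set[str]:
    """Keep the k DEGs with the largest log2 fold change."""
    ranked = sorted(deg_log2fc, key=deg_log2fc.get, reverse=True)
    return set(ranked[:k])


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient |a & b| / |a | b| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(gene_sets: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """Pairwise Jaccard similarity for every unordered pair of clusters."""
    labels = sorted(gene_sets)
    return {
        (x, y): jaccard(gene_sets[x], gene_sets[y])
        for i, x in enumerate(labels)
        for y in labels[i + 1:]
    }


# Hypothetical gene sets for three clusters.
sims = similarity_matrix({"m1": {"A", "B", "C"}, "m2": {"B", "C", "D"}, "m3": {"X"}})
most_similar_pair = max(sims, key=sims.get)
```

Only the upper triangle of the matrix is stored here, since the Jaccard coefficient is symmetric.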

[0135] In another exemplary embodiment, step 104 above corresponds to the integration result and visualization module, and the working principle of the integration result and visualization module is as follows:

[0136] The system integrates subtype annotation results across all resolution levels and performs aggregate analysis on the specificity scores corresponding to each resolution to recommend the optimal annotation scheme. Simultaneously, based on gene specificity scores, the system automatically selects and recommends the top three most specific phenotypic molecules for each cluster as representative markers for downstream annotation and result output.
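As a non-limiting illustration, recommending a resolution from per-resolution specificity scores can be sketched as follows. The description does not fix the aggregate statistic, so the mean of the cluster specificity scores is an assumption, as are the function name and the example values.

```python
from typing import Dict, Sequence


def recommend_resolution(scores_by_resolution: Dict[float, Sequence[float]]) -> float:
    """Return the resolution whose clusters have the highest mean specificity score.

    The mean is an assumed aggregate; ties go to the lowest resolution.
    """
    aggregated = {
        resolution: sum(scores) / len(scores)
        for resolution, scores in scores_by_resolution.items()
        if scores
    }
    if not aggregated:
        raise ValueError("no resolution has any cluster scores")
    return max(sorted(aggregated), key=aggregated.get)


# Hypothetical cluster specificity scores at three candidate resolutions.
best_resolution = recommend_resolution({
    0.1: [0.4, 0.6],
    0.3: [0.9, 0.8, 0.7],
    0.5: [0.5, 0.5],
})
```

Other aggregates, such as the median or the minimum cluster score, could be substituted without changing the structure of the comparison.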

[0137] The system supports exporting annotation results in multiple ways, facilitating subsequent analysis, reproduction, and sharing. Simultaneously, the system integrates visualization functions to intuitively display core information such as clustering results, subtype specificity scores, marker gene distribution, and resolution comparisons.

[0138] In another exemplary embodiment, step 104 above may be implemented using steps 501-505.

[0139] Step 501: Calculate the specificity score of each gene in the second target cluster using the getSpecificityScore function; the second target cluster is any cluster in the second target clustering result.

[0140] Step 502: Sort the genes in the second target cluster in descending order of specificity score to obtain the second gene sequence.

[0141] Step 503: Select the phenotypic molecules of the first 3 genes in the second gene sequence as representative markers of the second target cluster.
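As a non-limiting illustration, steps 502 and 503 can be sketched as follows. The function name, the alphabetical tie-breaking rule, and the scores in the usage lines are assumptions made only for illustration.

```python
from typing import Dict, List


def representative_markers(gene_scores: Dict[str, float], k: int = 3) -> List[str]:
    """Sort genes by descending specificity score and keep the first k.

    Ties are broken alphabetically so that the output is deterministic.
    """
    ranked = sorted(gene_scores.items(), key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in ranked[:k]]


# Made-up scores for four genes in one cluster.
markers = representative_markers({"G1": 0.9, "G2": 0.4, "G3": 0.7, "G4": 0.2})
```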

[0142] In another exemplary embodiment, in order to comprehensively evaluate the annotation performance and clustering effect of the above-described automatic cell subpopulation annotation method (hereinafter referred to as the annotation method of this application or Subtypist), the following evaluation metrics are adopted:

[0143] The Adjusted Rand Index (ARI) measures the consistency between the clustering results and the true labels; the closer the ARI is to 1, the better the performance. First, the Rand index is calculated. Assume the dataset contains $n$ samples in total, so that the total number of sample pairs is $\binom{n}{2}$. Let the true labels and the clustering results each partition the samples into several classes, and construct a contingency table in which $n_{ij}$ denotes the number of samples assigned both to class $i$ of the true labels and to cluster $j$ of the clustering results, $a_i = \sum_j n_{ij}$ is the number of samples in class $i$, and $b_j = \sum_i n_{ij}$ is the number of samples in cluster $j$. On this basis, the original Rand index RI can be expressed in the standard contingency-table (pair-count) form as:

[0144] $$RI = \sum_{i,j} \binom{n_{ij}}{2} \quad (5);$$

[0145] To further eliminate the interference caused by chance agreement, the expected value $E[RI]$ under random assignment is introduced, calculated as follows:

[0146] $$E[RI] = \frac{\sum_i \binom{a_i}{2} \, \sum_j \binom{b_j}{2}}{\binom{n}{2}} \quad (6);$$

[0147] The formula for the adjusted Rand index is as follows:

[0148] $$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}, \qquad \max(RI) = \frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] \quad (7);$$

[0149] where RI measures the consistency between two clustering results in terms of whether each pair of samples is grouped into the same cluster or into different clusters in both results. The ARI value typically ranges from 0 to 1 (slightly negative values are possible when agreement is worse than chance); a value closer to 1 indicates greater consistency between the two clustering results, while a value closer to 0 indicates near-random consistency.
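As a non-limiting illustration, the adjusted Rand index can be computed from the contingency counts as follows. The sketch uses the standard contingency-table form of the ARI; the function name and the handling of the degenerate case (returning 1.0 when the denominator is zero) are assumptions.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(
    true_labels: Sequence[Hashable],
    pred_labels: Sequence[Hashable],
) -> float:
    """Adjusted Rand index computed from contingency-table pair counts."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)
    cells = Counter(zip(true_labels, pred_labels))  # n_ij
    rows = Counter(true_labels)                     # a_i
    cols = Counter(pred_labels)                     # b_j
    index = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    total_pairs = comb(n, 2)
    expected = sum_rows * sum_cols / total_pairs if total_pairs else 0.0
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both partitions
    return (index - expected) / (max_index - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])      # identical up to renaming
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # worse than chance
```

The second call shows that the value can fall below 0 when the two partitions agree less than expected by chance.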

[0150] Annotation accuracy: calculated as the proportion of correctly identified phenotypic markers across all predicted subtypes. The set of specific marker genes used to determine the true identity of a cell subtype is generated by the method system of this invention. During actual annotation, Subtypist automatically recommends three specific marker genes for each cluster using its built-in specificity scoring function (see the specificity scoring module). Accuracy is calculated at the cell level based on marker gene matching: for each cell $p$, if its predicted phenotypic molecule set and its true marker gene set share at least one element, the cell is considered correctly predicted. The accuracy is defined as follows:

[0151] $$\mathrm{Accuracy} = \frac{1}{N}\sum_{p=1}^{N} \mathbb{I}\left(T_p \cap P_p \neq \varnothing\right) \quad (8);$$

[0152] where $N$ is the total number of cells; $p$ indexes the cells ($p = 1, \dots, N$); $T_p$ is the set of true (initial) phenotypic marker genes of the $p$-th cell; $P_p$ is the set of phenotypic molecules predicted for it; and $\mathbb{I}(\cdot)$ is an indicator function that takes the value 1 when the intersection is non-empty and 0 otherwise. The final accuracy is the average over all cells.
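As a non-limiting illustration, the cell-level annotation accuracy can be computed as follows. The function name and the marker sets in the usage lines are assumptions; returning 0.0 for an empty input is also a choice made only for this sketch.

```python
from typing import Iterable, Sequence


def annotation_accuracy(
    true_sets: Sequence[Iterable[str]],
    pred_sets: Sequence[Iterable[str]],
) -> float:
    """Fraction of cells whose predicted marker set overlaps the true marker set."""
    if len(true_sets) != len(pred_sets):
        raise ValueError("one true set and one predicted set are needed per cell")
    if not true_sets:
        return 0.0
    hits = sum(1 for t, p in zip(true_sets, pred_sets) if set(t) & set(p))
    return hits / len(true_sets)


# Three hypothetical cells: the first and third share a marker, the second does not.
acc = annotation_accuracy(
    [{"A", "B"}, {"C"}, {"D"}],
    [{"B", "X"}, {"Y"}, {"D", "Z"}],
)
```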

[0153] F1-score: To evaluate the performance of the Subtypist tool in classification tasks, the F1 score is calculated using the following formula:

[0154] $$F1 = \frac{2\,TP}{2\,TP + FP + FN} \quad (9);$$

[0155] Where TP represents the number of true positives, FP represents the number of false positives, and FN represents the number of false negatives.

[0156] In multi-class classification scenarios, to comprehensively and objectively measure Subtypist's performance, two aggregate metrics are further introduced: the macro-average F1-score and the weighted average F1-score, to enhance the overall evaluation capability of prediction performance across all categories. The formula for calculating the macro-average F1-score is as follows:

[0157] $$\text{Macro-}F1 = \frac{1}{K}\sum_{k=1}^{K} F1_k \quad (10);$$

[0158] where $F1_k$ is the F1-score of the k-th category and $K$ is the total number of categories. This metric assigns equal weight to each category and therefore effectively reflects the model's classification performance on categories with few samples.
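As a non-limiting illustration, the per-category F1-score and its macro average can be computed as follows. The function names are assumptions, the set of categories is taken to be every label that appears in either the true or the predicted labels, and a category with no true positives, false positives, or false negatives is given an F1 of 0.0 by convention.

```python
from typing import Dict, Hashable, Sequence


def f1_per_class(
    y_true: Sequence[Hashable],
    y_pred: Sequence[Hashable],
) -> Dict[Hashable, float]:
    """One-vs-rest F1 = 2TP / (2TP + FP + FN) for every observed category."""
    scores: Dict[Hashable, float] = {}
    for k in dict.fromkeys([*y_true, *y_pred]):
        tp = sum(t == k and p == k for t, p in zip(y_true, y_pred))
        fp = sum(t != k and p == k for t, p in zip(y_true, y_pred))
        fn = sum(t == k and p != k for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[k] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of the per-category F1-scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


per_class = f1_per_class(["a", "a", "b", "b"], ["a", "b", "b", "b"])
macro = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

In this toy case category "a" scores 2/3 and category "b" scores 4/5, and both contribute equally to the macro average regardless of their sizes.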

[0159] In another exemplary embodiment, the following example is provided to illustrate the annotation performance and clustering effect of the annotation method of this application.

[0160] Example 1: Performance evaluation of subtype annotation based on simulated single-cell dataset.

[0161] As shown in Figure 3, panel a shows UMAP plots of paired scRNA-seq data with the true annotations and the tool-predicted labels. Panel b of Figure 3 shows the performance evaluation on simulated data covering different numbers of clusters, cells, and genes; the metrics are ARI and annotation accuracy, and the box plot shows the clustering performance of scSHC on the various datasets. To comprehensively and systematically evaluate the applicability and stability of the proposed annotation method in various practical application scenarios, this example designs and constructs a series of simulated single-cell transcriptome (scRNA-seq) datasets with ground-truth labels to verify its annotation ability and robustness under different data scales, levels of cell subpopulation heterogeneity, and degrees of expression feature complexity, as shown in Figure 3a. The simulated data were constructed using the publicly available single-cell data simulation tool ESCO. Specifically, the simulation design combined two dimensions, the number of cells (500, 1000, 2000) and the number of genes (500, 1000, 2000), while setting different numbers of cell subtypes (4 to 7) to reflect cell heterogeneity from low to high complexity. Each dataset had explicitly defined ground-truth cluster labels, and a candidate set of phenotypic molecules was automatically generated using the specificity scoring function proposed by Subtypist, ensuring that the simulated data could be used to comprehensively evaluate the annotation accuracy and clustering performance of the tool.

[0162] The annotation method in this application sets the default resolution parameter range to 0.1–1.5 at runtime. It selects the optimal annotation model by scanning multiple candidate resolutions and calculating their corresponding specificity scores. In the final annotation process, a specificity-driven hierarchical clustering strategy is employed, supplemented by a clustering merging method based on specificity distribution, to ensure clear subgroup boundaries and strong interpretability of the results. Performance evaluation uses two core metrics:

[0163] (1) Clustering performance metrics: The Adjusted Rand Index (ARI) is used to measure the similarity between the predicted cluster labels and the true labels. The higher the ARI value, the more consistent the cluster structure is with the true structure.

[0164] (2) Annotation accuracy: defined as the proportion of cells in the predicted cells whose set of annotated phenotypic molecules intersects with the set of true markers, reflecting the accuracy of subgroup characterization molecules.

[0165] The results are shown in Figure 3b, where Subtypist denotes the annotation method of this application. Subtypist demonstrates stable and excellent annotation performance on all simulated datasets. Under typical settings (1000 cells, 1000 genes, 7 cell subtypes, minimum population ratio of 0.025, maximum of 0.2), the tool achieves optimal performance at a resolution of 0.3, with an ARI of 0.987 and an annotation accuracy of 0.983. These results indicate that Subtypist can maintain high recognition resolution and annotation stability even in complex scenarios with high cellular heterogeneity and weak differences in marker molecule expression, and that it has good potential for application.

[0166] Example 2: Performance evaluation of subtype annotation based on real single-cell datasets.

[0167] To verify the cell subtype annotation capabilities of the Subtypist tool in real-world scenarios, this example designed and conducted a multi-dataset evaluation experiment. A total of 13 benchmark datasets were selected, covering multiple species such as humans and mice, and involving multiple organ types such as liver, pancreas, and intestine, including various types of epithelial cells, endothelial cells, myeloid cells, immune cells, and stromal cells, demonstrating strong representativeness and complexity.

[0168] Since the original data annotations were mostly not based on phenotypic molecular classification, in order to unify the standards and facilitate cross-comparison, Subtypist selected the three genes with the highest specificity scores from each cluster as phenotypic molecules for annotation based on the specificity scoring function. Most results showed good consistency with the original labels.

[0169] To systematically evaluate annotation performance, four metrics were used: annotation accuracy, macro-average F1-score, weighted F1-score, and adjusted Rand index (ARI), measuring the classification and clustering effects of the method from different perspectives. As shown in Figure 4, Subtypist performs stably across multiple datasets, with an average annotation accuracy of 0.76 and a clustering accuracy of 0.56. In four pancreatic datasets, the highest annotation accuracy reaches 0.96, demonstrating high accuracy and adaptability.

[0170] Example 3: Space-specific macrophage subtypes involved in tumor immune barrier construction in hepatocellular carcinoma.

[0171] This example applies Subtypist to single-cell transcriptome data from human hepatocellular carcinoma (HCC) macrophages to validate its ability to identify functional subtypes within the complex tumor microenvironment. At the recommended resolution of 0.3, Subtypist classifies macrophages into seven subtypes. One of these, the SPP1+ macrophage subset, was enriched in tumor tissues, and its expression level differed significantly between tumor and normal tissues (p = 0.01). Enrichment analysis showed that this subset was significantly associated with key pathways such as extracellular matrix remodeling, inflammation, and immune cell recruitment, suggesting that it may play a role in constructing the tumor immune microenvironment, as shown in Figure 5.

[0172] In this example, the original study and the annotation method of this application differ in how they partition continuous changes in cell state, as shown in Figure 6. The cells identified by Subtypist show a high degree of similarity to the SPP1+ macrophages labeled in the original study. Subsequent validation showed that the SPP1+ macrophages identified by Subtypist more accurately reflect cell subtype classification and have greater biological significance.

[0173] Specifically, the SPP1+ macrophages identified by the original study and by this tool were first mapped separately to spatial transcriptome data and spatially localized using CellTrek software. The spatial coordinates of the SPP1+ macrophages and of the hepatocellular carcinoma (HCC) tumor regions were obtained from the spatial transcriptome slices and used to calculate the distance between them. As shown in Figure 6, SPP1+ macrophages are significantly closer to the tumor area than other cellular subtypes.

[0174] A further comparison was made of the spatial distances from the SPP1+ macrophage subtypes identified by the original study and by Subtypist to the SPP1+ macrophage / cancer-associated fibroblast (CAF) spots and to the HCC region spots labeled in the spatial transcriptome. The SPP1+ macrophages identified by Subtypist are significantly closer to the labeled areas, as shown in Figure 7, with corresponding p-values of 0.00001 and 0.003, respectively. These results indicate that the cell subtypes identified by this tool have stronger biological relevance, highlighting their potentially important role in the tumor microenvironment.

[0175] In addition, Subtypist also retains a subpopulation labeled MT1G+. Despite its lower specificity, this subpopulation has been regarded as Kupffer cells in some studies. These results demonstrate Subtypist's ability to maintain both sensitivity and subtype diversity, and its potential for discovering biologically significant subtypes.

[0176] Example 4: Subtypist identifies a HES1+ epithelial tumor cell subtype associated with poor prognosis in esophageal squamous cell carcinoma.

[0177] This example applies Subtypist to a single-cell transcriptome dataset of human esophageal squamous cell carcinoma (ESCC) to validate its ability to identify key subtypes in complex tumor microenvironments. The tool identified 11 cell subtypes at the recommended resolution of 0.5, one of which was a HES1+ epithelial cell subset. This subset was significantly enriched in patients with stage II and III esophageal cancer, showing a higher proportion than in early-stage (stage I) samples, which suggests a close association with disease progression, as shown in Figure 8.

[0178] HES1 is an important regulator of the Notch signaling pathway, involved in maintaining stemness, regulating differentiation, promoting tumor cell proliferation, and escaping apoptosis. Existing studies have confirmed that HES1 is highly associated with increased invasiveness and poor prognosis in various tumor types. In ESCC, HES1 is also considered an independent prognostic marker.

[0179] To further characterize the molecular features of this subtype, we performed differential expression analysis and visualized the expression of the subtype's marker genes (as shown in Figure 9). Genes such as NR4A1 and FOSB were significantly upregulated in the HES1+ epithelial cell subset. Among them, NR4A1 is closely associated with T cell dysfunction and immune escape, while FOSB is involved in cell proliferation and differentiation. Subsequent pathway and GO analysis (as shown in Figure 9) showed that this subtype is highly enriched in pathways such as epithelial cell proliferation and response to xenobiotic substances, suggesting that it may play an important role in adapting to metabolic stress and treatment-induced stress, as well as in promoting tumor progression.

[0180] To evaluate the potential value of this subtype in clinical prognosis, we further validated it using 88 ESCC bulk RNA-seq samples from the TCGA database. First, the samples were divided into groups with high and low HES1+ epithelial cell module scores; the Kaplan–Meier survival curves (as shown in Figure 10) showed that the high group had a significantly worse prognosis (p = 0.032). Furthermore, we used a deconvolution algorithm to estimate the proportion of HES1+ epithelial cells in each TCGA sample and divided the samples into high and low groups based on quartiles. The Kaplan–Meier survival curves for these groups showed that the higher the proportion of this subtype, the lower the patient survival rate (p = 0.044), as shown in Figure 10. This further supports its discriminative value as a potential poor-prognosis subtype.

[0181] Example 5: Subtypist reveals a subtype of pro-remodeling macrophages that spatially co-localizes with cardiomyocytes in the ischemic zone of myocardial infarction.

[0182] This example demonstrates the subtype resolution capability of the Subtypist tool in complex tissue contexts. Human myocardial infarction (MI)-related single-nuclear RNA sequencing (snRNA-seq) and spatial transcriptome (ST) data were used for analysis, covering ischemic areas, marginal zones, distal areas, and fibrotic regions.

[0183] In the snRNA-seq data, Subtypist identified a total of 8 functionally specific macrophage subtypes, as shown in Figure 11. Furthermore, the proportions of the eight myeloid cell subtypes in the control group, ischemic area, marginal area, distal area, and fibrotic tissue samples indicate that FN1+ macrophages were significantly enriched in the ischemic area, with their abundance gradually increasing towards the ischemic core. In the ST data, deconvolution analysis showed that FN1+ macrophages exhibited a spatial co-localization trend with mainstream macrophages, and their regional distribution was consistent with the snRNA-seq results, as shown in Figure 11.

[0184] This subtype highly expresses FN1, STAB1, and MYH6, and is involved in extracellular matrix remodeling and immune regulation. Enrichment analysis showed that it is closely related to muscle tissue development and structural remodeling, suggesting its functional potential in myocardial repair.

[0185] Further CellChat analysis reveals the ligand-receptor pairs that mediate FN1+ macrophage-related intercellular communication, as shown in Figure 12. FN1+ macrophages can communicate with cardiomyocytes through the GAS6–MERTK pathway to participate in anti-inflammatory responses and apoptotic cell clearance; at the same time, they can establish signaling connections with fibroblasts and adipocytes through the IGF1–IGF1R pathway to regulate tissue repair.

[0186] Combined with COMMOT analysis, we reconstructed the spatial distribution and communication pathways of IGF signaling in the spatial transcriptome data (as shown in Figure 13). The background heatmap shows the deconvolved proportion of FN1+ macrophages. By comparing IGF signaling activity in different regions based on aggregated pathway scores, we found that this signal exhibited high spatial co-localization with FN1+ macrophages, and that the signal intensity was positively correlated with the abundance of this subtype. This result suggests that FN1+ macrophages may be the main responders to IGF signaling, thereby participating in immune regulation and tissue remodeling after MI.

[0187] In another exemplary embodiment, prior art is used as a comparative example to illustrate how the annotation method of this application solves the problems of the prior art.

[0188] To further verify the performance advantages of the annotation method in the cell subtype identification task, it was compared and evaluated with the existing unsupervised clustering method scSHC. The experiment was conducted on a constructed simulated dataset containing multiple known subtypes as control standards, and simulating common expression noise and feature dimensionality variations found in real single-cell sequencing data.

[0189] With 1000 feature genes, scSHC's average adjusted RAND index (ARI) is 0.817, while the annotation method Subtypist in this application exhibits higher accuracy and stability under the same conditions. Furthermore, as the feature dimension increases, when the number of gene features reaches 2000, scSHC's ARI drops to 0.6702, showing a certain degree of performance degradation; in contrast, Subtypist maintains a high ARI value even under high-dimensional feature conditions, demonstrating good generalization ability and adaptability to large-scale feature data.

[0190] Furthermore, unlike traditional methods that only output clustering structures, Subtypist incorporates phenotypic molecules (i.e., specific marker genes) for biological annotation of subpopulations after clustering. This not only provides the clustering structure but also identifies key molecular features of each subpopulation, enhancing interpretability and usability. Therefore, this method not only outperforms existing algorithms in statistical performance but also possesses practical value at the functional annotation level.

[0191] Based on the same inventive concept, this application also provides an automatic cell subpopulation annotation apparatus for implementing the aforementioned automatic cell subpopulation annotation method. The solution provided by this apparatus is similar to the implementation described in the above method; therefore, the specific limitations in one or more embodiments of the automatic cell subpopulation annotation apparatus provided below can be found in the limitations of the automatic cell subpopulation annotation method described above, and will not be repeated here.

[0192] In one exemplary embodiment, an automatic cell subpopulation annotation device is provided, comprising:

[0193] The data acquisition module is used to acquire single-cell transcriptomics data;

[0194] The first clustering module, within the preset target resolution range, uses a community detection algorithm based on modularity optimization to obtain the initial clustering results at each resolution, i.e., the first target clustering;

[0195] The second clustering module, at each resolution, if there are clusters in the first target clustering results that do not meet the specificity requirement, uses similarity to iteratively merge them to obtain the second target clustering result; both the first target clustering result and the second target clustering result include multiple clusters; the specificity requirement is: the specificity score is not less than a preset threshold for measuring specificity;

[0196] The labeling module uses the getSpecificityScore function to determine the representative marker of each cluster in the second target clustering result;

[0197] The comparison module compares the specificity scores at the different resolutions to recommend the cell subpopulation annotation result.

[0198] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0199] This document uses specific examples to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the methods and core ideas of this application. Furthermore, those skilled in the art will recognize that, based on the ideas of this application, there will be changes in the specific implementation methods and application scope. Therefore, the content of this specification should not be construed as a limitation of this application.

Claims

1. A method for automatic annotation of cell subpopulations, characterized in that it comprises the following steps:
acquiring single-cell transcriptome data;
within a preset target resolution range, obtaining an initial clustering result at each resolution, namely a first target clustering result, by using a community detection algorithm based on modularity optimization;
at each resolution, if any cluster in the first target clustering result does not meet a specificity requirement, performing similarity-based iterative merging to obtain a second target clustering result; wherein both the first target clustering result and the second target clustering result comprise multiple clusters, and the specificity requirement is that the specificity score is not less than a preset threshold for measuring specificity;
wherein performing the similarity-based iterative merging to obtain the second target clustering result specifically comprises:
clustering the single-cell transcriptome data at the target resolution to obtain the clustering result of the s-th iteration, which serves as the initial clustering result;
calculating the specificity score of each cluster in the clustering result of the s-th iteration by using the getSpecificityScore function;
merging each cluster whose specificity score is below the preset threshold for measuring specificity with a cluster of high similarity to obtain the clustering result of the (s+1)-th iteration;
incrementing s by 1 and returning to the step of calculating the specificity score of each cluster in the clustering result of the s-th iteration by using the getSpecificityScore function, until the merging termination condition is met;
determining a representative marker of each cluster in the second target clustering result by using the getSpecificityScore function; wherein the getSpecificityScore function is: [formula not reproduced]; where the terms of the formula denote, in order: the specificity score of gene i in cluster m; the absolute expression level of gene i in cluster m; the proportion of cells expressing gene i in cluster m in the clustering result of the s-th iteration; and the proportion of cells expressing gene i in the clusters other than cluster m in the clustering result of the s-th iteration;
recommending a cell subpopulation annotation result by comparing the specificity scores across resolutions.
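The iterative merging loop recited in claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `iterative_merge`, `score_cluster` and `merge_step` are hypothetical, the scoring and merging rules are passed in as callables because their formulas are not reproduced in this text, a single merge budget is applied to whole rounds rather than per cluster, and the early exit when a round merges nothing is an added safeguard.

```python
from typing import Callable, Dict, Hashable, List

# Hypothetical representation: cluster id -> indices of member cells.
Clusters = Dict[Hashable, List[int]]


def iterative_merge(
    clusters: Clusters,
    score_cluster: Callable[[Clusters, Hashable], float],
    merge_step: Callable[[Clusters, List[Hashable]], Clusters],
    spec_threshold: float,
    max_merges: int,
) -> Clusters:
    """Repeat score -> merge until all clusters are specific or the budget is spent."""
    # Copy the input so the caller's initial clustering result is left untouched.
    current = {cid: list(cells) for cid, cells in clusters.items()}
    for _ in range(max_merges):
        # Score every cluster in the clustering result of the s-th iteration.
        low = [cid for cid in current if score_cluster(current, cid) < spec_threshold]
        if not low:
            break  # every cluster meets the specificity requirement
        merged = merge_step(current, low)
        if merged.keys() == current.keys():
            break  # assumption: stop when a round merges nothing
        current = merged  # clustering result of the (s+1)-th iteration
    return current
```

Passing the scoring and merging rules as callables keeps the loop independent of any particular specificity formula or similarity measure.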

2. The automatic cell subpopulation annotation method according to claim 1, characterized in that calculating the specificity score of each cluster in the clustering result of the s-th iteration by using the getSpecificityScore function specifically comprises:
determining whether gene i in cluster m satisfies the calculation condition of the getSpecificityScore function to obtain a first determination result, cluster m being any cluster in the clustering result of the s-th iteration;
if the first determination result is no, setting the specificity score of gene i in cluster m to 0;
if the first determination result is yes, calculating the specificity score of gene i in cluster m by using the getSpecificityScore function;
calculating the overall specificity score of cluster m from the specificity scores of the genes in cluster m by the following formula: [formula not reproduced]; where the formula yields the overall specificity score of cluster m; n is a user-defined parameter specifying that the n highest-scoring genes are selected; the term indexed by i is the specificity score of the i-th gene in cluster m when the genes are ranked from highest to lowest specificity score; and i ranges from 1 to n.

3. The automatic cell subpopulation annotation method according to claim 2, characterized in that the calculation condition of the getSpecificityScore function is: [formula not reproduced]; where the terms of the condition denote, in order: the proportion of cells expressing gene i in cluster m in the clustering result of the s-th iteration; the proportion of cells expressing gene i in all clusters other than cluster m in the clustering result of the s-th iteration; a preset threshold for measuring the difference in cell proportions; the absolute expression level of gene i in cluster m; and a preset threshold for measuring the absolute expression level.
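The gating of claim 3 and the top-n aggregation of claim 2 might be sketched as below. Because the condition and the aggregation formula are not reproduced in this text, several points are assumptions: the proportion difference is taken as in-cluster minus out-of-cluster, both comparisons are strict, and the overall cluster score is taken as the mean of the n highest gene scores (a sum would differ only by the factor n). All function and parameter names are hypothetical, and `score_fn` stands in for getSpecificityScore, whose formula is not shown here.

```python
from typing import Callable, Sequence


def gated_gene_score(
    expr_level: float,
    frac_in: float,
    frac_out: float,
    min_frac_diff: float,
    min_expr: float,
    score_fn: Callable[[float, float, float], float],
) -> float:
    """Return score_fn(...) if the gene passes both thresholds, otherwise 0."""
    # Assumption: strict inequalities on the proportion difference and expression level.
    passes = (frac_in - frac_out) > min_frac_diff and expr_level > min_expr
    return score_fn(expr_level, frac_in, frac_out) if passes else 0.0


def cluster_score(gene_scores: Sequence[float], n: int) -> float:
    """Aggregate the n highest gene scores (assumed here: their mean)."""
    top = sorted(gene_scores, reverse=True)[:n]
    return sum(top) / len(top) if top else 0.0
```

Genes that fail the gate contribute a score of 0, so they can enter the top n only when fewer than n genes in the cluster pass the condition.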

4. The automatic cell subpopulation annotation method according to claim 1, characterized in that the merging termination condition is that every cluster in the clustering result meets a specificity evaluation criterion; the specificity evaluation criterion is that the number of merges has reached the maximum merge count threshold, or that the specificity score is not less than the preset threshold for measuring specificity.

5. The automatic cell subpopulation annotation method according to claim 1, characterized in that merging each cluster whose specificity score is below the preset threshold for measuring specificity with a cluster of high similarity specifically comprises:
identifying each cluster in the clustering result of the s-th iteration whose specificity score is less than the preset threshold for measuring specificity as a first target cluster;
calculating, by a similarity calculation formula, the similarity between the first target cluster and each other cluster in the clustering result of the s-th iteration;
identifying the cluster with the highest similarity to the first target cluster as the cluster to be merged;
determining whether the similarity between the first target cluster and the cluster to be merged is greater than a similarity threshold to obtain a second determination result;
if the second determination result is yes, merging the first target cluster into the cluster to be merged;
if the second determination result is no, not merging the first target cluster.
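The target selection in claim 5 can be illustrated with a short sketch. The similarity calculation formula is not given in this text, so cosine similarity between per-cluster mean expression profiles is used purely as a stand-in; `cosine`, `pick_merge_target` and the `profiles` mapping are hypothetical names.

```python
import math
from typing import Dict, Hashable, Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_merge_target(
    low_id: Hashable,
    profiles: Dict[Hashable, Sequence[float]],
    sim_threshold: float,
) -> Optional[Hashable]:
    """Return the cluster to merge low_id into, or None if none is similar enough."""
    others = [cid for cid in profiles if cid != low_id]
    if not others:
        return None
    # The most similar other cluster is the candidate "cluster to be merged".
    best = max(others, key=lambda cid: cosine(profiles[low_id], profiles[cid]))
    # Merge only when the best similarity is strictly greater than the threshold.
    return best if cosine(profiles[low_id], profiles[best]) > sim_threshold else None
```

Returning `None` corresponds to the branch in which the first target cluster is left unmerged.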

6. The automatic cell subpopulation annotation method according to claim 4, characterized in that, when the specificity score of cluster m in the initial clustering result is less than the preset threshold for measuring specificity, the maximum number of merges for cluster m is set to 2.

7. The automatic cell subpopulation annotation method according to claim 1, characterized in that determining the representative marker of each cluster in the second target clustering result by using the getSpecificityScore function specifically comprises:
calculating the specificity score of each gene in a second target cluster by using the getSpecificityScore function, the second target cluster being any cluster in the second target clustering result;
sorting the genes in the second target cluster in descending order of specificity score to obtain a second gene sequence;
selecting the phenotypic molecules of the first three genes in the second gene sequence as the representative markers of the second target cluster.
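The marker selection in claim 7 reduces to a sort followed by taking the first three entries, which can be sketched as below. The function name and the gene symbols in the usage example are illustrative only, and the sketch returns gene names directly rather than modeling the mapping from a gene to its phenotypic molecule.

```python
from typing import Dict, List


def representative_markers(gene_scores: Dict[str, float], k: int = 3) -> List[str]:
    """Sort genes by descending specificity score and return the first k."""
    ranked = sorted(gene_scores, key=lambda g: gene_scores[g], reverse=True)
    return ranked[:k]


# Illustrative scores for one cluster (made-up values).
scores = {"CD3E": 0.9, "CD19": 0.1, "CD8A": 0.7, "GZMK": 0.8, "MS4A1": 0.2}
markers = representative_markers(scores)
```

With these example scores, `markers` holds the three highest-scoring genes in rank order.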

8. An automatic cell subpopulation annotation device, characterized in that the device applies the automatic cell subpopulation annotation method according to any one of claims 1-7 and comprises:
a data acquisition module, configured to acquire single-cell transcriptome data;
a first clustering module, configured to obtain, within a preset target resolution range, an initial clustering result at each resolution, namely a first target clustering result, by using a community detection algorithm based on modularity optimization;
a second clustering module, configured to perform, at each resolution, similarity-based iterative merging to obtain a second target clustering result if any cluster in the first target clustering result does not meet a specificity requirement; wherein both the first target clustering result and the second target clustering result comprise multiple clusters, and the specificity requirement is that the specificity score is not less than a preset threshold for measuring specificity;
wherein performing the similarity-based iterative merging to obtain the second target clustering result specifically comprises:
clustering the single-cell transcriptome data at the target resolution to obtain the clustering result of the s-th iteration, which serves as the initial clustering result;
calculating the specificity score of each cluster in the clustering result of the s-th iteration by using the getSpecificityScore function;
merging each cluster whose specificity score is below the preset threshold for measuring specificity with a cluster of high similarity to obtain the clustering result of the (s+1)-th iteration;
incrementing s by 1 and returning to the step of calculating the specificity score of each cluster in the clustering result of the s-th iteration by using the getSpecificityScore function, until the merging termination condition is met;
a labeling module, configured to determine a representative marker of each cluster in the second target clustering result by using the getSpecificityScore function; wherein the getSpecificityScore function is: [formula not reproduced]; where the terms of the formula denote, in order: the specificity score of gene i in cluster m; the absolute expression level of gene i in cluster m; the proportion of cells expressing gene i in cluster m in the clustering result of the s-th iteration; and the proportion of cells expressing gene i in the clusters other than cluster m in the clustering result of the s-th iteration;
a comparison module, configured to recommend a cell subpopulation annotation result by comparing the specificity scores across resolutions.