Cell annotation method, apparatus, device, storage medium, and program product
By clustering the cells of a single-cell transcriptome, calculating differential genes for each cluster, querying a pre-constructed knowledge graph for associations, and fine-tuning the result, the method addresses the time-consuming, labor-intensive, and inconsistent nature of existing annotation approaches and achieves efficient and accurate automatic cell annotation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BGI RES BEIJING
- Filing Date
- 2024-12-10
- Publication Date
- 2026-06-12
Figure CN122201458A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of cell annotation technology, and in particular to a cell annotation method, apparatus, computer device, computer-readable storage medium, and computer program product.
Background Technology
[0002] With the development of single-cell transcriptome data analysis technology, cell annotation has become one of the most common and fundamental analytical tasks. Most downstream analyses, such as differential analysis, trajectory inference, and cell interaction analysis, rely on the results of cell annotation. Current cell annotation methods fall into two categories: manual annotation and automated annotation. Manual annotation requires searching relevant literature and databases, while automated annotation requires an already annotated single-cell transcriptome dataset. Searching for suitable marker genes or datasets for new samples is typically very time-consuming and labor-intensive. Furthermore, different researchers often use different levels of granularity and naming conventions in manual annotation, making standardized annotation difficult.
Summary of the Invention
[0003] Therefore, it is necessary to provide a cell annotation method, apparatus, computer device, computer-readable storage medium, and computer program product that can achieve high accuracy even without a reference transcriptome, addressing the aforementioned technical problems.
[0004] Firstly, this application provides a cell annotation method, including:
[0005] The single-cell transcriptome to be annotated is subjected to cell clustering to obtain multiple cell clusters.
[0006] For each cell cluster, differential gene calculation is performed to obtain a list of differential genes.
[0007] By performing an association query on each marker gene in the differential gene list using a pre-constructed knowledge graph, association data for each marker gene is obtained; the association data includes at least the associated cell type.
[0008] Based on the association data, cell annotation analysis is performed on each of the marker genes to obtain candidate cell annotation results;
[0009] The candidate cell annotation results are fine-tuned to obtain the target cell annotation results, which are used to annotate the single-cell transcriptome.
[0010] In one embodiment, the step of performing cell annotation analysis on each marker gene based on the association data to obtain candidate cell annotation results includes:
[0011] For each of the marker genes, a cell annotation probability is calculated to obtain a cell type annotation probability; the cell type annotation probability represents the association strength of each of the marker genes with respect to each of the associated cell types.
[0012] Gene expression probabilities are calculated for the associated cell types based on the annotation probabilities to obtain cell type expression probabilities; the cell type expression probabilities are used to describe the association strength of each associated cell type with respect to each marker gene;
[0013] The expression probabilities of the cell types are sorted to obtain a sorting result, and the annotation result of the candidate cells is determined based on the sorting result.
[0014] In one embodiment, fine-tuning the candidate cell annotation results to obtain the target cell annotation results includes:
[0015] If it is determined that the cell cluster contains two target cell types, the cells are regrouped based on the difference in the average expression level of marker genes between at least two target cell types. Based on the regrouping results, differential gene calculation, association query and cell annotation analysis are performed sequentially to obtain the target cell annotation results.
[0016] The target cell type is used to indicate the top two associated cell types in the sorting results.
[0017] In one embodiment, determining that the cell cluster includes two target cell types includes:
[0018] When the number of associated genes corresponding to each of the associated cell types in the knowledge graph is not less than half of the total number of genes of the two associated cell types, and the marker genes corresponding to each of the associated cell types are clustered in the cell's embedding space, it is determined that the cell cluster contains two target cell types.
[0019] In one embodiment, fine-tuning the candidate cell annotation results to obtain the target cell annotation results includes:
[0020] Obtain parent-child node relationship data related to the annotation results of the candidate cells from the knowledge graph;
[0021] The parent nodes of the knowledge graph are pruned based on the parent-child node relationship data, and the knowledge graph is updated. The updated knowledge graph is then used to perform association queries on each marker gene in the differential gene list, and the cell annotation analysis steps are performed based on the query results to obtain the target cell annotation results.
[0022] In one embodiment, the method further includes:
[0023] Acquire a marker gene database and a single-cell dataset; the marker gene database includes multiple first cell types and first marker gene data corresponding to each first cell type;
[0024] Data analysis was performed on the single-cell dataset to obtain multiple second cell types and the second marker gene data corresponding to each second cell type;
[0025] The difference in the proportion of expression of the second marker gene data within and outside the group is used as the association weight; the association weight is used to describe the association strength between the second marker gene data and the second cell type.
[0026] The knowledge graph is constructed based on the first cell type and its corresponding first marker gene data, the second cell type and its corresponding second marker gene data, and the association weights.
[0027] Secondly, this application also provides a cell annotation device, comprising:
[0028] The cell clustering module is used to perform cell clustering on the single-cell transcriptome to be annotated, resulting in multiple cell clusters.
[0029] The differential gene calculation module is used to perform differential gene calculations on each of the cell groups to obtain a list of differential genes.
[0030] The associated gene query module is used to perform an associated query on each marker gene in the differential gene list through a pre-constructed knowledge graph to obtain the associated data for each marker gene; the associated data includes at least the associated cell type;
[0031] The cell annotation analysis module is used to perform cell annotation analysis on each of the marker genes based on the association data to obtain candidate cell annotation results.
[0032] The fine-tuning module is used to fine-tune the annotation results of the candidate cells to obtain the annotation results of the target cells, which are used to annotate the single-cell transcriptome.
[0033] Thirdly, this application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to perform the following steps:
[0034] The single-cell transcriptome to be annotated is subjected to cell clustering to obtain multiple cell clusters.
[0035] For each cell cluster, differential gene calculation is performed to obtain a list of differential genes.
[0036] By performing an association query on each marker gene in the differential gene list using a pre-constructed knowledge graph, association data for each marker gene is obtained; the association data includes at least the associated cell type.
[0037] Based on the association data, cell annotation analysis is performed on each of the marker genes to obtain candidate cell annotation results;
[0038] The candidate cell annotation results are fine-tuned to obtain the target cell annotation results, which are used to annotate the single-cell transcriptome.
[0039] Fourthly, this application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, performs the following steps:
[0040] The single-cell transcriptome to be annotated is subjected to cell clustering to obtain multiple cell clusters.
[0041] For each cell cluster, differential gene calculation is performed to obtain a list of differential genes.
[0042] By performing an association query on each marker gene in the differential gene list using a pre-constructed knowledge graph, association data for each marker gene is obtained; the association data includes at least the associated cell type.
[0043] Based on the association data, cell annotation analysis is performed on each of the marker genes to obtain candidate cell annotation results;
[0044] The candidate cell annotation results are fine-tuned to obtain the target cell annotation results, which are used to annotate the single-cell transcriptome.
[0045] Fifthly, this application also provides a computer program product, including a computer program that, when executed by a processor, performs the following steps:
[0046] The single-cell transcriptome to be annotated is subjected to cell clustering to obtain multiple cell clusters.
[0047] For each cell cluster, differential gene calculation is performed to obtain a list of differential genes.
[0048] By performing an association query on each marker gene in the differential gene list using a pre-constructed knowledge graph, association data for each marker gene is obtained; the association data includes at least the associated cell type.
[0049] Based on the association data, cell annotation analysis is performed on each of the marker genes to obtain candidate cell annotation results;
[0050] The candidate cell annotation results are fine-tuned to obtain the target cell annotation results, which are used to annotate the single-cell transcriptome.
[0051] The aforementioned cell annotation method, apparatus, computer device, computer-readable storage medium, and computer program product perform cell clustering on the single-cell transcriptome to be annotated to obtain multiple cell clusters; calculate differentially expressed genes for each cell cluster to obtain a list of differentially expressed genes; perform an association query on each marker gene in the list of differentially expressed genes using a pre-constructed knowledge graph to obtain association data for each marker gene, the association data including at least the associated cell type; perform cell annotation analysis on each marker gene based on the association data to obtain candidate cell annotation results; and fine-tune the candidate cell annotation results to obtain target cell annotation results, which are then used to annotate the single-cell transcriptome. This achieves automatic annotation without relying on a reference transcriptome (annotated single-cell transcriptome) or marker gene database, greatly reducing the time and workload of manual annotation. Further fine-tuning of the candidate cell annotation results effectively improves the accuracy of the target cell annotation results.
Attached Figure Description
[0052] To more clearly illustrate the technical solutions in the embodiments of this application or related technologies, the drawings used in the description of the embodiments of this application or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0053] Figure 1 is a diagram illustrating the application environment of the cell annotation method in one embodiment;
[0054] Figure 2 is a flowchart illustrating a cell annotation method in one embodiment;
[0055] Figure 3 shows the cell regrouping results of the fine-tuning process of the cell annotation method in one embodiment;
[0056] Figure 4 is a schematic diagram of the knowledge graph-based cell annotation analysis process in the cell annotation method of one embodiment;
[0057] Figure 5 is a flowchart illustrating a cell annotation method in one embodiment;
[0058] Figure 6 is a diagram showing the cell annotation results for human pancreas single-cell data obtained with the cell annotation method in one embodiment;
[0059] Figure 7 is a structural block diagram of a cell annotation device in one embodiment;
[0060] Figure 8 is an internal structural diagram of a computer device in one embodiment.
Detailed Implementation
[0061] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0062] The cell annotation method provided in the embodiments of this application can be applied to the application environment shown in Figure 1, in which terminal 102 communicates with server 104 via a network. A data storage system can store the data that server 104 needs to process. The data storage system can be integrated into server 104 or placed on a cloud or other network server. Server 104 performs cell clustering on the single-cell transcriptome to be annotated to obtain multiple cell clusters; calculates differentially expressed genes for each cell cluster to obtain a list of differentially expressed genes; performs an association query on each marker gene in the list using a pre-constructed knowledge graph to obtain association data for each marker gene, the association data including at least the associated cell type; performs cell annotation analysis on each marker gene based on the association data to obtain candidate cell annotation results; and fine-tunes the candidate cell annotation results to obtain target cell annotation results, which are used for cell annotation of the single-cell transcriptome. Terminal 102 can be, but is not limited to, various personal computers, laptops, smartphones, tablets, IoT devices, and portable wearable devices. IoT devices can be smart speakers, smart TVs, smart air conditioners, smart in-vehicle devices, projection devices, etc. Portable wearable devices can be smartwatches, smart bracelets, head-mounted devices, etc. Head-mounted devices can be virtual reality (VR) devices, augmented reality (AR) devices, smart glasses, etc. Server 104 can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services.
[0063] In one exemplary embodiment, as shown in Figure 2, a cell annotation method is provided. Taking the application of the method to server 104 in Figure 1 as an example, the method includes the following steps 202 to 210, wherein:
[0064] Step 202: Perform cell clustering on the single-cell transcriptome to be annotated to obtain multiple cell clusters.
[0065] The single-cell transcriptome to be annotated is single-cell data obtained by single-cell transcriptome sequencing methods. Cell annotation refers to identifying the cell type of each cell in the single-cell data. It should be noted that most of the subsequent downstream analyses, such as differential analysis, trajectory inference, and cell interaction, depend on the results of cell annotation.
[0066] In some embodiments, a cell clustering algorithm can be used to perform cell clustering on the single-cell transcriptome to be annotated, resulting in multiple cell clusters.
[0067] The cell clustering algorithm can be the Leiden clustering algorithm or other similar analysis tools, and is not limited to these.
[0068] Step 204: Calculate the differentially expressed genes for each cell cluster to obtain a list of differentially expressed genes.
[0069] In some embodiments, differential gene calculation is performed on each cell cluster using a differential analysis algorithm to obtain a list of differential genes.
[0070] The differential analysis algorithm can be the Wilcoxon method or another similar analysis tool; no limitation is made here.
[0071] In this embodiment, during differential gene calculation, the top M marker genes in each cell cluster are used as the corresponding gene list, serving as a screening criterion so that the differential gene list can be used as a reference for subsequent cell annotation. M can be 10, and the value of M can be adjusted according to actual conditions; no limitation is made here.
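As a non-limiting illustration, the top-M screening described above can be sketched in Python as follows. The sketch assumes that a differential analysis (for example a Wilcoxon test run elsewhere) has already produced one score per gene and cluster; the function name `top_marker_genes`, the cluster labels, the gene names, and the scores are all hypothetical.

```python
from typing import Dict, List, Tuple


def top_marker_genes(
    ranked: Dict[str, List[Tuple[str, float]]], m: int = 10
) -> Dict[str, List[str]]:
    """Keep the m highest-scoring genes of every cell cluster.

    `ranked` maps a cluster label to (gene, score) pairs; the scores are
    assumed to come from a differential analysis that is not shown here.
    """
    result: Dict[str, List[str]] = {}
    for cluster, scored in ranked.items():
        ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
        result[cluster] = [gene for gene, _ in ordered[:m]]
    return result


# Hypothetical scores for two clusters (illustrative values only).
scores = {
    "cluster_0": [("INS", 9.1), ("GCG", 0.4), ("IAPP", 7.3)],
    "cluster_1": [("GCG", 8.8), ("TTR", 6.0), ("INS", 0.2)],
}
markers = top_marker_genes(scores, m=2)
```

With M set to 2 in this toy input, each cluster keeps its two highest-scoring genes as its differential gene list; M = 10 is the default mentioned in this embodiment.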
[0072] Step 206: Use a pre-constructed knowledge graph to perform an association query on each marker gene in the differential gene list to obtain the association data for each marker gene; the association data includes at least the associated cell type.
[0073] The pre-constructed knowledge graph is a knowledge graph containing a large amount of knowledge about cell annotation. It is constructed in advance by collecting large cell type marker gene databases (including but not limited to CellMarker, CellTaxonomy, and PanglaoDB) and analyzing a large number of single-cell datasets.
[0074] The associated data includes the cell types associated with the marker genes in the knowledge graph and the corresponding association strengths. Association strength refers to the strength of the association between each gene and its corresponding cell type; this strength can be expressed by the weight of the in-group and out-group expression percentages for each gene (relation_confidence). Associated cell types indicate the cell types associated with each gene in the graph.
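As a non-limiting illustration, an association weight of this kind can be sketched as the difference between the fraction of in-group cells and the fraction of out-group cells that express a gene, which is how the knowledge graph construction steps of this application describe the weight. Treating "expressed" as a value greater than zero is an assumption of the sketch, and the function names are hypothetical.

```python
from typing import Sequence


def expression_fraction(values: Sequence[float]) -> float:
    """Fraction of cells in which the gene is detected.

    Assumption: a cell "expresses" the gene when its value is above zero.
    """
    if not values:
        return 0.0
    return sum(1 for v in values if v > 0) / len(values)


def association_weight(
    in_group: Sequence[float], out_group: Sequence[float]
) -> float:
    """In-group expression fraction minus out-group expression fraction."""
    return expression_fraction(in_group) - expression_fraction(out_group)


# Illustrative expression values of one gene inside and outside a cell type.
weight = association_weight([2.0, 0.0, 1.5, 3.1], [0.0, 0.0, 0.4, 0.0, 0.0])
```

In this toy input the gene is detected in 3 of 4 in-group cells and 1 of 5 out-group cells, so the weight is about 0.55.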
[0075] In some embodiments, the association data for each gene can be filtered, i.e., cell types associated with less than 50% of the associated genes in the association data are filtered out so that the more relevant associated cell types are retained, thereby improving the accuracy of subsequent cell annotation analysis based on the association data.
[0076] In some embodiments, a gene association query can be performed on each gene in the differential gene list using a pre-constructed knowledge graph, to obtain the association data for each gene. The specific gene association query statement can be as follows:
[0077] UNWIND $keywords AS keyword
[0078] MATCH (n:Gene)-[r:marker_of]->(m:Cell)
[0079] WHERE keyword IN [n.name] + COALESCE(SPLIT(n.synonym, '|'), [])
[0080] RETURN n.name, r.relation_confidence, r.condition, r.info_source, m.name;
[0081] This query statement is written in Cypher, the query language of Neo4j. UNWIND expands the list parameter $keywords into individual values, each bound to the variable keyword; MATCH specifies the pattern to match, here the marker_of relationships from Gene nodes to Cell nodes, and is used to query the association data of marker genes; WHERE is the filtering condition, which retains the genes in the knowledge graph whose name or synonyms match the keyword; RETURN specifies the fields to return: the gene name, the association weight (relation_confidence), the condition, the information source (info_source), and the cell type name.
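As a non-limiting illustration, the matching logic of the query can be emulated in Python on a small in-memory stand-in for the graph. The `GENES` and `EDGES` structures, their contents, and the function names are hypothetical and only mirror the property names used in the query; the sketch assumes that a gene without synonyms is still matched by its name. A real deployment would send the Cypher statement to the graph database instead.

```python
from typing import Dict, List, Optional

# Hypothetical stand-ins for Gene nodes and marker_of relationships.
GENES: Dict[str, Dict[str, Optional[str]]] = {
    "PTPRC": {"synonym": "CD45|LCA"},
    "INS": {"synonym": None},
}
EDGES: List[Dict[str, object]] = [
    {"gene": "PTPRC", "cell": "leukocyte",
     "relation_confidence": 0.8, "info_source": "database_a"},
    {"gene": "INS", "cell": "pancreatic beta cell",
     "relation_confidence": 0.9, "info_source": "dataset_b"},
]


def gene_aliases(name: str) -> List[str]:
    """Gene name plus its '|'-separated synonyms (none if synonym is missing)."""
    synonym = GENES[name]["synonym"]
    return [name] + (synonym.split("|") if synonym else [])


def query_markers(keywords: List[str]) -> List[Dict[str, object]]:
    rows = []
    for keyword in keywords:                      # UNWIND $keywords AS keyword
        for edge in EDGES:                        # MATCH (n:Gene)-[r:marker_of]->(m:Cell)
            if keyword in gene_aliases(str(edge["gene"])):  # WHERE keyword IN ...
                rows.append(dict(edge))           # RETURN the edge properties
    return rows


rows = query_markers(["CD45", "INS", "GCG"])
```

Here the keyword CD45 is resolved through a synonym, INS through its name, and GCG returns nothing because the toy graph has no such node.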
[0082] Step 208: Perform cell annotation analysis on each marker gene based on the association data to obtain candidate cell annotation results.
[0083] In some embodiments, cell annotation analysis is performed on each marker gene based on association data to obtain candidate cell annotation results, including: calculating the cell annotation probability for each marker gene to obtain the cell type annotation probability; the cell type annotation probability represents the association strength of each marker gene with respect to each associated cell type; calculating the gene expression probability of associated cell types based on the annotation probability to obtain the cell type expression probability; the cell type expression probability is used to describe the association strength of each associated cell type with respect to each marker gene; sorting the cell type expression probabilities to obtain the sorting results, and determining the candidate cell annotation results based on the sorting results.
[0084] In some embodiments, cell annotation probability is calculated for each marker gene to obtain cell type annotation probability. Cell type annotation probability represents the annotation probability of each marker gene with respect to each associated cell type. The specific calculation formula for cell annotation probability is shown in the following formula (1):
[0085] $$P(c_i \mid g) = \frac{w_{g,i}}{\sum_{k=1}^{n} w_{g,k}} \qquad (1)$$
[0086] where $P(c_i \mid g)$ is the cell type annotation probability corresponding to each marker gene $g$, $w_{g,i}$ is the strength of the association between gene $g$ and the i-th associated cell type $c_i$, and $\sum_{k=1}^{n} w_{g,k}$ is the sum over the n annotations corresponding to the gene.
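As a non-limiting illustration, and assuming that formula (1) divides the association strength of a gene with one cell type by the sum of its association strengths over all n associated cell types, the calculation can be sketched in Python as follows. The function name and the example strengths are hypothetical.

```python
from typing import Dict


def annotation_probabilities(strengths: Dict[str, float]) -> Dict[str, float]:
    """Normalize one gene's association strengths so that they sum to 1.

    `strengths` maps each associated cell type to the association strength
    (for example relation_confidence) of a single marker gene.
    """
    total = sum(strengths.values())
    if total <= 0:
        # No usable evidence for this gene: return zero probabilities.
        return {cell: 0.0 for cell in strengths}
    return {cell: weight / total for cell, weight in strengths.items()}


# Illustrative strengths of one marker gene toward three cell types.
probs = annotation_probabilities(
    {"alpha cell": 0.75, "beta cell": 0.25, "delta cell": 0.5}
)
```

In this toy input the strengths sum to 1.5, so the gene supports the alpha cell annotation with probability 0.5 and the other two types with the remaining half.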
[0087] In some embodiments, gene expression probabilities of associated cell types are calculated based on cell type annotation probabilities to obtain cell type expression probabilities. Cell type expression probabilities represent the expression probabilities of each associated cell type on each gene. The specific formula for calculating gene expression probabilities is shown in formula (2) below:
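[0088] $$P(c_i) = \frac{1}{j} \sum_{m=1}^{j} P(c_i \mid g_m) \qquad (2)$$
[0089] where $P(c_i)$ is the cell type expression probability of the associated cell type $c_i$, that is, the average of its cell type annotation probabilities $P(c_i \mid g_m)$ over the j genes.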
[0090] In some embodiments, the expression probabilities of cell types are sorted to obtain a sorting result, and the candidate cell annotation result is determined based on the sorting result.
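As a non-limiting illustration, the averaging and sorting steps can be sketched in Python as follows. The sketch assumes that the average is taken over every marker gene in the list and that a gene with no edge to a cell type contributes zero; the text does not state whether the j genes are instead restricted to those linked to the cell type. The function names and the per-gene probabilities are hypothetical.

```python
from typing import Dict, List, Tuple


def cell_type_scores(per_gene: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average each cell type's annotation probability over all marker genes.

    `per_gene` maps gene -> {cell type: annotation probability}.
    Assumption: a gene not linked to a cell type contributes 0 to its average.
    """
    n_genes = len(per_gene)
    if n_genes == 0:
        return {}
    totals: Dict[str, float] = {}
    for probs in per_gene.values():
        for cell, p in probs.items():
            totals[cell] = totals.get(cell, 0.0) + p
    return {cell: total / n_genes for cell, total in totals.items()}


def rank_cell_types(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort cell types from the highest to the lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative per-gene annotation probabilities for one cell cluster.
per_gene = {
    "INS": {"beta cell": 1.0},
    "IAPP": {"beta cell": 0.75, "delta cell": 0.25},
    "GCG": {"alpha cell": 0.5, "beta cell": 0.5},
    "SST": {"delta cell": 1.0},
}
ranking = rank_cell_types(cell_type_scores(per_gene))
candidate = ranking[0][0]
```

The first entry of the ranking serves as the candidate annotation, and the first two entries are the ones a later fine-tuning step would inspect for a possible mixture.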
[0091] In this embodiment, standardized candidate cell annotation results are obtained through conditional probability calculation. Compared with traditional annotation methods based on large language models, the results are traceable and supported by probability theory models and data, making them more interpretable and reliable, and effectively improving the credibility of the annotation results.
[0092] Step 210: Fine-tune the candidate cell annotation results to obtain the target cell annotation results, which are used for cell annotation of single-cell transcriptomes.
[0093] Fine-tuning adjusts the candidate cell annotation results according to one of several fine-tuning schemes in order to improve the accuracy of the annotation.
[0094] In some embodiments, fine-tuning the candidate cell annotation results to obtain the target cell annotation results includes: when it is determined that the cell cluster contains two target cell types, regrouping the cells according to the difference in the average expression level of the marker genes between the two target cell types, and sequentially performing differential gene calculation, association query and cell annotation analysis according to the regrouping results to obtain the target cell annotation results;
[0095] The target cell type is used to indicate the top two associated cell types in the sorting results.
[0096] In some embodiments, the top two associated cell types are obtained based on the sorting results, and the number of associated genes for each associated cell type in the knowledge graph is counted. This determines whether the cell clusters corresponding to the two target cell types are a mixture of the two cell types. If it is determined that there is a mixture of the two cell types in the current cell cluster, the mixed cell population containing the mixture of the two cell types can be regrouped under the guidance of the knowledge graph. The number of associated genes is used as fine-tuning data to fine-tune the annotation results of the candidate cells to obtain the annotation results of the target cells, thereby completing a more accurate cell annotation fine-tuning.
[0097] In some embodiments, the criteria for determining whether a mixture of two cell types exists in the current cell cluster include: when the number of associated genes corresponding to each of the associated cell types in the knowledge graph is not less than half of the total number of genes of the two associated cell types and the marker genes corresponding to each of the associated cell types are clustered in the cell embedding space, the cell cluster is determined to contain two target cell types.
[0098] It should be noted that when the number of associated genes for each of the two target cell types reaches half of the total number of marker genes for both target cell types, and the associated genes for each cell type exhibit clustering in the cell's embedding space, the cluster is considered a mixture of the two cell types. The clustering of associated genes in the cell's embedding space can be described using the expression levels of the respective associated genes as an observation indicator. By observing the Moran's I index of spatial autocorrelation in the embedding space, if the Moran's I index is greater than 0.5, it is considered a mixture of the two cell types. For mixed cell populations, regrouping can be performed under the guidance of a knowledge graph, thereby completing a more accurate fine-tuning process for candidate cell annotation results and obtaining the target cell annotation results.
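As a non-limiting illustration, the Moran's I check can be sketched in Python using the standard definition I = (N / W) · Σ w_ij (x_i − x̄)(x_j − x̄) / Σ (x_i − x̄)², where x_i is the expression value of cell i and w_ij is a spatial weight between cells i and j. How the weights are derived from the embedding space is not specified in the text; the chain-shaped neighbor graph below is a hypothetical stand-in for a neighbor graph built on the embedding, and the function names are illustrative.

```python
from typing import List, Sequence


def morans_i(values: Sequence[float], weights: Sequence[Sequence[float]]) -> float:
    """Moran's I of `values` under the pairwise weight matrix `weights`."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    variance_sum = sum(d * d for d in dev)
    if total_weight == 0 or variance_sum == 0:
        return 0.0
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    return (n / total_weight) * (cross / variance_sum)


def chain_weights(n: int) -> List[List[float]]:
    """Hypothetical neighbor graph: each cell is adjacent to the next one."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = w[i + 1][i] = 1.0
    return w


weights = chain_weights(6)
clustered = morans_i([1, 1, 1, 0, 0, 0], weights)   # expressing cells adjacent
scattered = morans_i([1, 0, 1, 0, 1, 0], weights)   # expressing cells interleaved
is_clustered = clustered > 0.5                       # threshold from the text
```

In this toy example the adjacent pattern exceeds the 0.5 threshold while the interleaved pattern yields a negative value and does not.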
[0099] Specifically, when regrouping a mixed cell population, the difference in the average expression level of the specific marker genes of the two cell types is first calculated for each cell and denoted as Exp(ab), so that each cell corresponds to one Exp(ab) value. Using Exp(ab) as the index, an adjacency matrix is constructed with a Gaussian kernel function, and the Leiden algorithm is used to regroup the cells based on the adjacency matrix.
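As a non-limiting illustration, the Exp(ab) index and the Gaussian kernel adjacency matrix can be sketched in Python as follows. The kernel bandwidth `sigma`, the zero diagonal, the function names, and the expression values are assumptions of the sketch; the Leiden regrouping itself would be run on the resulting matrix by a graph clustering library and is not shown.

```python
import math
from typing import Dict, List, Sequence


def exp_ab(
    cell_expr: Dict[str, float],
    markers_a: Sequence[str],
    markers_b: Sequence[str],
) -> float:
    """Mean expression of type-a markers minus mean expression of type-b markers."""
    mean_a = sum(cell_expr.get(g, 0.0) for g in markers_a) / len(markers_a)
    mean_b = sum(cell_expr.get(g, 0.0) for g in markers_b) / len(markers_b)
    return mean_a - mean_b


def gaussian_adjacency(scores: Sequence[float], sigma: float = 1.0) -> List[List[float]]:
    """Pairwise Gaussian kernel on the Exp(ab) values (no self-loops)."""
    n = len(scores)
    return [
        [
            0.0 if i == j
            else math.exp(-((scores[i] - scores[j]) ** 2) / (2 * sigma ** 2))
            for j in range(n)
        ]
        for i in range(n)
    ]


# Illustrative expression values for three cells of a mixed cluster.
cells = [
    {"INS": 4.0, "IAPP": 2.0, "GCG": 0.0},
    {"INS": 3.0, "IAPP": 3.0, "GCG": 1.0},
    {"INS": 0.0, "IAPP": 0.0, "GCG": 5.0},
]
index = [exp_ab(c, ["INS", "IAPP"], ["GCG"]) for c in cells]
adjacency = gaussian_adjacency(index, sigma=1.0)
```

Cells with similar Exp(ab) values receive a weight close to 1 and cells with very different values a weight close to 0, so a graph clustering step on this matrix tends to separate the two cell types.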
[0100] The total number of genes for the two target cell types refers to the total number of genes in the knowledge graph that have edges with the target cell type.
[0101] In some embodiments, after fine-tuning the annotation results of candidate cells to obtain the annotation results of target cells, the cell annotation method further includes: obtaining the information source of the annotation results of target cells and obtaining information source prompts; interpreting the gene to be interpreted based on a knowledge graph and obtaining interpretation information; and interactively displaying the information source prompts and interpretation information.
[0102] Here, the information source prompt is interactive visual information, and the gene to be interpreted refers to a marker gene that shows significant differential expression in the single-cell transcriptome data but is not recorded in the knowledge graph.
[0103] Specifically, after annotation is completed, the information sources used in the annotation process can be displayed in an interactive visualization, on charts such as volcano plots, to demonstrate the reliability of the annotation results. At the same time, for marker genes that show significant differential expression in the dataset but are not recorded in the knowledge graph, the information on biological functions and pathways of genes recorded in the knowledge graph can be used to interpret why they might be potential marker genes for that cell type, assisting users in data mining and scientific discovery.
[0104] In some embodiments, fine-tuning the candidate cell annotation results to obtain the target cell annotation results includes: obtaining parent-child node relationship data related to the candidate cell annotation results from the knowledge graph; pruning the parent nodes of the knowledge graph based on the parent-child node relationship data, updating the knowledge graph, performing association queries on each marker gene in the differential gene list through the updated knowledge graph, and performing cell annotation analysis steps based on the query results to obtain the target cell annotation results.
[0105] In some embodiments, parent-child node relationship data related to candidate cell annotation results is obtained from the knowledge graph, and the parent nodes of the knowledge graph are pruned according to the parent-child node relationship data to update the knowledge graph. The updated knowledge graph is then used to perform association queries on each marker gene in the differential gene list, and cell annotation analysis steps are performed based on the query results to obtain target cell annotation results. This makes the cell annotation results closer to the leaf nodes of the cell tree. It should be noted that leaf nodes in the cell tree usually have higher information content, and cell annotation results closer to leaf nodes in the cell tree will be more accurate. Therefore, the candidate cell annotation results are fine-tuned according to the updated knowledge graph to obtain target cell annotation results, thereby making the cell annotation results more granular and obtaining more accurate and detailed target annotation results.
[0106] For example, CD4 T cells are a type of T cell, so during pruning-based fine-tuning the parent node, T cells, is pruned. Because the leaf nodes of the cell tree generally carry more information, annotating target cells with the cell types closer to the leaf nodes of the cell tree is preferred.
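As a non-limiting illustration, one possible reading of the pruning step is sketched below in Python: when a marker gene is linked both to a cell type and to one of that cell type's ancestors, the edge to the ancestor is removed so that the subsequent query favors the more specific type. The `PARENT` hierarchy, the gene-to-cell-type edges, and the function names are hypothetical, and the exact pruning rule used by the application may differ.

```python
from typing import Dict, List, Set

# Hypothetical child -> parent relationships of the cell tree.
PARENT: Dict[str, str] = {"CD4 T cell": "T cell", "T cell": "lymphocyte"}


def ancestors(cell: str, parent: Dict[str, str]) -> List[str]:
    """All ancestors of `cell`, nearest first (stops if a cycle is met)."""
    found: List[str] = []
    while cell in parent:
        cell = parent[cell]
        if cell in found:
            break
        found.append(cell)
    return found


def prune_parent_edges(
    edges: Dict[str, Set[str]], parent: Dict[str, str]
) -> Dict[str, Set[str]]:
    """Drop gene -> ancestor edges when the gene also marks a descendant."""
    pruned: Dict[str, Set[str]] = {}
    for gene, cells in edges.items():
        redundant: Set[str] = set()
        for cell in cells:
            redundant.update(a for a in ancestors(cell, parent) if a in cells)
        pruned[gene] = cells - redundant
    return pruned


# Illustrative marker_of edges before pruning.
edges = {"CD4": {"CD4 T cell", "T cell"}, "CD3E": {"T cell"}}
pruned = prune_parent_edges(edges, PARENT)
```

In this toy graph the gene CD4 keeps only its edge to CD4 T cell, while CD3E, which marks no more specific type, keeps its edge to T cell.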
[0107] In some embodiments, as shown in Figure 4, which is a flowchart illustrating a knowledge graph-based cell annotation algorithm, the left part of Figure 4 illustrates how the input module processes the input single-cell transcriptome to be annotated into multiple cell clusters, and then performs differential gene calculation on each cell cluster to obtain a list of differentially expressed genes [G1, G2, G3, ..., Gn]. The middle module of Figure 4 demonstrates how to perform gene association queries on each gene in the differential gene list by querying a pre-built knowledge graph, and obtain the association data for each gene. The right part of Figure 4 shows the process of performing cell annotation analysis on each marker gene based on the association data to obtain candidate cell annotation results, and then fine-tuning the candidate cell annotation results to obtain the target cell annotation results.
[0108] For example, as shown in Figure 4, the green arrows in the diagram indicate relationships, while multiple pink arrows for the same node in the graph indicate evidence from different data sources (databases or datasets). Figure 4 also illustrates the pruning method with an example: node C3 belongs to node C4, and its parent node (G2) also points to the same cell type node (C4). The edge between the cell type (C4) pointed to by the arrow and its parent node (G2) is pruned, making the cell annotation result closer to the leaf nodes of the cell tree. Then, the pruned and updated knowledge graph is processed again through step 206 to select the associated cell type with the highest probability as the target cell annotation result.
[0109] In some embodiments, after the annotation results are completed, the information sources used in the annotation process of the target cell annotation results can be obtained as information source prompts, and this information can be displayed in an interactive visualization such as a volcano plot to demonstrate the reliability of the target cell annotation results. Simultaneously, by using the biological functions and pathways of the genes to be interpreted recorded in the knowledge graph, the reasons why the genes to be interpreted might be potential marker genes for this cell type can be interpreted, thus assisting users in data mining and scientific discovery.
[0110] In the aforementioned cell annotation method, the single-cell transcriptome to be annotated is divided into multiple cell clusters; differentially expressed genes are calculated for each cell cluster to obtain a list of differentially expressed genes; each marker gene in the list of differentially expressed genes is associated with a pre-constructed knowledge graph to obtain association data for each marker gene; the association data includes at least the associated cell type; cell annotation analysis is performed on each marker gene based on the association data to obtain candidate cell annotation results; the candidate cell annotation results are fine-tuned to obtain target cell annotation results, which are used to annotate single-cell transcriptomes. This achieves automatic annotation without relying on a reference transcriptome (annotated single-cell transcriptome) or marker gene database, greatly reducing the time and workload of manual annotation. Further fine-tuning of the candidate cell annotation results effectively improves the accuracy of the target cell annotation results.
[0111] In one exemplary embodiment, as shown in Figure 5, the steps for constructing a knowledge graph include steps 502 to 508, wherein:
[0112] Step 502: Obtain the marker gene database and single-cell dataset; the marker gene database includes multiple first cell types and the first marker gene data corresponding to each first cell type.
[0113] Here, the marker gene database refers to a database containing cell type marker genes, including but not limited to CellMarker, CellTaxonomy, and PanglaoDB.
[0114] In some embodiments, a large database of marker genes and a dataset of more than 1,000 single cells can be obtained to ensure coverage of existing cell types in subsequent steps.
[0115] Step 504: Perform data analysis on the single-cell dataset to obtain multiple second cell types and the corresponding second marker gene data for each second cell type.
[0116] Data analysis refers to the process of extracting cell annotation information.
[0117] In some embodiments, data analysis is performed on a single-cell dataset to obtain multiple second cell types and corresponding second marker gene data for each second cell type. Then, data analysis is performed on the marker genes corresponding to the cell types and the marker gene database to obtain a large amount of cell annotation analysis data that can reflect cell annotation knowledge. The cell annotation analysis data includes multiple second cell types and corresponding second marker gene data for each second cell type.
[0118] Step 506: The difference between the expression ratios of the second marker gene data within and outside the group is used as the association weight; the association weight is used to describe the association strength between the second marker gene data and the second cell type.
[0119] It should be noted that a weight of 1 is assigned to triples derived from marker gene databases. For triples derived from single-cell datasets, the weight is the difference between the in-group and out-of-group expression proportions (PCTin - PCTout) of the second marker gene data, and this difference lies in the range [0, 1]. PCTin is the proportion of cells in the given cell population with expression levels greater than 0; PCTout is the proportion of all cells outside the given cell population with expression levels greater than 0. This association weight, denoted r.relation_confidence, is used in subsequent calculations to rank cell types and represents the strength of the association between the second marker gene data and the second cell type.
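As a minimal sketch of the weight described in step 506, the following Python code computes PCTin - PCTout for one gene. The function names and the toy expression values are illustrative assumptions; only the definitions of PCTin and PCTout come from the text.

```python
def expressed_fraction(values):
    """Proportion of cells whose expression level is greater than 0."""
    if not values:
        return 0.0
    return sum(1 for v in values if v > 0) / len(values)


def association_weight(expression, labels, group):
    """PCTin - PCTout for one gene, given per-cell expression and cell group labels."""
    inside = [x for x, label in zip(expression, labels) if label == group]
    outside = [x for x, label in zip(expression, labels) if label != group]
    return expressed_fraction(inside) - expressed_fraction(outside)


# Toy data: 8 cells, 4 in group "A" and 4 outside it.
expression = [3, 0, 5, 2, 0, 0, 1, 0]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
weight = association_weight(expression, labels, "A")
print(weight)  # PCTin = 3/4, PCTout = 1/4
```

Here three of the four cells in group "A" express the gene and one of the four cells outside it does, so the weight is 0.75 - 0.25 = 0.5.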
[0120] Step 508: Construct a knowledge graph based on the first cell type and its corresponding first marker gene data, the second cell type and its corresponding second marker gene data, and the association weights.
[0121] In some embodiments, cell annotation analysis data is organized into triples of the form (Gene1, marker_of, Cell1) and then written into the Neo4j graph database to obtain a knowledge graph. This graph database offers a significant speed advantage for neighbor lookups and multi-hop queries.
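The triple organization can be sketched as follows. This is an illustration only: the node labels `Gene` and `CellType`, the `source` property, and the helper `build_triples` are assumptions introduced here, while the relationship name `marker_of` and the property `relation_confidence` follow the text. The sketch only prepares parameter rows and a parameterized Cypher statement; it does not connect to a database.

```python
# Hypothetical Cypher template. The node labels and the "source" property are
# assumptions; "marker_of" and "relation_confidence" are the names used in the text.
MERGE_TRIPLE = (
    "MERGE (g:Gene {name: $gene}) "
    "MERGE (c:CellType {name: $cell_type}) "
    "MERGE (g)-[r:marker_of {source: $source}]->(c) "
    "SET r.relation_confidence = $weight"
)


def build_triples(database_markers, dataset_markers):
    """Turn marker records into parameter rows for the Cypher template above."""
    rows = []
    # Triples from marker gene databases carry a fixed weight of 1.
    for source, cell_type, gene in database_markers:
        rows.append({"gene": gene, "cell_type": cell_type,
                     "source": source, "weight": 1.0})
    # Triples from single-cell datasets carry the weight PCTin - PCTout.
    for source, cell_type, gene, pct_in, pct_out in dataset_markers:
        rows.append({"gene": gene, "cell_type": cell_type,
                     "source": source, "weight": pct_in - pct_out})
    return rows


rows = build_triples(
    database_markers=[("marker_db", "Cell1", "Gene1")],
    dataset_markers=[("dataset_1", "Cell2", "Gene2", 0.9, 0.1)],
)
for row in rows:
    print(row)
```

Each row can then be supplied as the parameter map of the statement when it is executed against the graph database. Keeping the data source on the relationship allows several edges between the same gene and cell type, one per source of evidence.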
[0122] In this embodiment, a marker gene database and a single-cell dataset are acquired. The marker gene database includes multiple first cell types and their corresponding first marker gene data. Data analysis is performed on the single-cell dataset to obtain multiple second cell types and their corresponding second marker gene data. The difference between the in-group and out-of-group expression ratios of the second marker gene data is used as the association weight. The association weight is used to describe the association strength between the second marker gene data and the second cell type. A knowledge graph is constructed based on the first cell type and its corresponding first marker gene data, the second cell type and its corresponding second marker gene data, and the association weight, thereby extracting a large amount of knowledge about cell annotation and ensuring that the knowledge graph has extremely high coverage for existing cell type detection.
[0123] To better understand the cell annotation process in this application, it is explained below with reference to Figures 3 and 6:
[0124] Figure 6 takes the single-cell transcriptome of the human pancreas as an example. After cell clustering of this transcriptome, differential analysis was performed; the results are shown in Figure 6 under the label Leiden.
[0125] Using a pre-constructed knowledge graph, an association query is performed on each gene in the differentially expressed gene list. After the association data for each gene is obtained, cell annotation analysis is performed on each gene based on this data to obtain candidate cell annotation results. These candidate cell annotation results are then fine-tuned to obtain target cell annotation results, which are used for cell annotation of the single-cell transcriptome, giving the results shown under the label KGannotator in Figure 5.
[0126] As the comparison with other methods in Figure 6 shows, this method significantly outperforms traditional marker-based annotation methods such as Cellmarker2 and large language model-based annotation methods such as GPTcelltype. The parameter-free annotation method of this invention is essentially on par with reference transcriptome-based annotation methods.
[0127] Figure 3 specifically demonstrates the fine-tuning process, showing how the fine-tuning scheme of this invention regroups cell group 2 (a cell group identified as a mixed cell type) to distinguish between the delta and gamma cell types.
[0128] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.
[0129] Based on the same inventive concept, this application also provides a cell annotation apparatus for implementing the cell annotation method described above. The solution provided by this apparatus is similar to the implementation described in the above method; therefore, the specific limitations in one or more cell annotation apparatus embodiments provided below can be found in the limitations of the cell annotation method described above, and will not be repeated here.
[0130] In one exemplary embodiment, as shown in Figure 7, a cell annotation device 700 is provided, including: a cell clustering module 701, a differential gene calculation module 702, an associated gene query module 703, a cell annotation analysis module 704, and a fine-tuning module 705, wherein:
[0131] Cell clustering module 701 is used to perform cell clustering on the single-cell transcriptome to be annotated, resulting in multiple cell clusters.
[0132] The differential gene calculation module 702 is used to perform differential gene calculations on each cell group to obtain a list of differential genes.
[0133] The associated gene query module 703 is used to perform an associated query on each marker gene in the differential gene list through a pre-constructed knowledge graph to obtain the associated data for each marker gene; the associated data includes at least the associated cell type.
[0134] The cell annotation analysis module 704 is used to perform cell annotation analysis on each marker gene based on the association data to obtain candidate cell annotation results;
[0135] The fine-tuning module 705 is used to fine-tune the candidate cell annotation results to obtain the target cell annotation results, which are then used for cell annotation of single-cell transcriptomes.
[0136] In some embodiments, the cell annotation analysis module 704 is further configured to calculate the cell annotation probability for each gene to obtain the cell type annotation probability; the cell type annotation probability represents the association strength of each gene with respect to each associated cell type; calculate the gene expression probability for the associated cell type based on the annotation probability to obtain the cell type expression probability; the cell type expression probability is used to describe the association strength of each associated cell type with respect to each marker gene; sort the cell type expression probabilities to obtain the sorting result, and determine the candidate cell annotation result based on the sorting result.
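The ranking performed by the cell annotation analysis module can be sketched as follows. The exact formulas are not given in this passage, so the sketch makes two explicit assumptions: the cell type annotation probability of a gene is its association weight normalized over that gene's associated cell types, and the cell type expression probability is the average of these probabilities over all queried marker genes. Both choices, and the function names, are illustrative only.

```python
def annotation_probabilities(associations):
    """Assumed form: normalize each gene's association weights over its cell types."""
    probabilities = {}
    for gene, targets in associations.items():
        total = sum(targets.values())
        if total > 0:
            probabilities[gene] = {c: w / total for c, w in targets.items()}
    return probabilities


def rank_cell_types(associations):
    """Assumed form: average the per-gene probabilities per cell type, then sort."""
    probabilities = annotation_probabilities(associations)
    totals = {}
    for per_gene in probabilities.values():
        for cell_type, p in per_gene.items():
            totals[cell_type] = totals.get(cell_type, 0.0) + p
    gene_count = len(probabilities)
    scores = [(cell_type, total / gene_count) for cell_type, total in totals.items()]
    # Highest score first; the top entry is the candidate annotation.
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy association data for one cell cluster: gene -> {cell type: weight}.
associations = {
    "G1": {"C1": 1.0},
    "G2": {"C1": 0.75, "C2": 0.25},
    "G3": {"C2": 0.5},
}
ranked = rank_cell_types(associations)
print(ranked)
```

With this toy input, C1 ranks first and would be taken as the candidate cell annotation result, while the top two entries are the ones that the fine-tuning module inspects for a possible mixed cluster.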
[0137] In some embodiments, the fine-tuning module 705 is further configured to, when it is determined that the cell cluster contains two target cell types, perform cell regrouping based on the difference in the average expression level of marker genes between at least two target cell types, and sequentially perform differential gene calculation, association query and cell annotation analysis based on the regrouping results to obtain target cell annotation results; wherein, the target cell type is used to indicate the top two associated cell types in the sorting results.
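A minimal sketch of the regrouping step follows, assuming that each cell is assigned to whichever of the two target cell types has the higher average marker gene expression in that cell. The text specifies only that regrouping is based on the difference in average marker expression; the tie-breaking rule, the function name, and the toy gene names are assumptions, and both marker lists are assumed to be non-empty.

```python
from statistics import mean


def regroup_cells(cells, markers_a, markers_b, label_a, label_b):
    """Split a mixed cluster by the sign of the mean marker expression difference."""
    groups = {label_a: [], label_b: []}
    for cell_id, expression in cells.items():
        score_a = mean(expression.get(g, 0.0) for g in markers_a)
        score_b = mean(expression.get(g, 0.0) for g in markers_b)
        # Assumption: ties go to the first cell type.
        target = label_a if score_a - score_b >= 0 else label_b
        groups[target].append(cell_id)
    return groups


# Toy mixed cluster: per-cell expression of hypothetical marker genes.
cells = {
    "cell_1": {"gA1": 5.0, "gA2": 3.0, "gB1": 0.0},
    "cell_2": {"gA1": 0.0, "gA2": 1.0, "gB1": 4.0},
    "cell_3": {"gA1": 2.0, "gA2": 2.0, "gB1": 1.0},
}
groups = regroup_cells(cells, ["gA1", "gA2"], ["gB1"], "type_A", "type_B")
print(groups)
```

Each resulting group would then go through differential gene calculation, association query, and cell annotation analysis again, as the paragraph above describes.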
[0138] In some embodiments, the fine-tuning module 705 is further configured to determine that the cell cluster contains two target cell types when the number of associated genes corresponding to each associated cell type in the knowledge graph is not less than half of the total number of genes of the two associated cell types and the marker genes corresponding to each associated cell type are clustered in the cell's embedding space.
[0139] In some embodiments, the fine-tuning module 705 is further configured to obtain parent-child node relationship data related to the candidate cell annotation results from the knowledge graph; prune the parent nodes of the knowledge graph according to the parent-child node relationship data, update the knowledge graph, and perform an association query on each marker gene in the differential gene list through the updated knowledge graph, and perform cell annotation analysis steps according to the query results to obtain the target cell annotation results.
[0140] In some embodiments, the cell annotation device further includes: a knowledge graph construction module, used to acquire a marker gene database and a single-cell dataset; the marker gene database includes multiple first cell types and first marker gene data corresponding to each first cell type; data analysis is performed on the single-cell dataset to obtain multiple second cell types and second marker gene data corresponding to each second cell type; the difference between the in-group and out-of-group expression ratios of the second marker gene data is used as the association weight; the association weight is used to describe the association strength between the second marker gene data and the second cell type; a knowledge graph is constructed based on the first cell type and its corresponding first marker gene data, the second cell type and its corresponding second marker gene data, and the association weight.
[0141] In the aforementioned cell annotation device, multiple cell clusters are obtained by performing cell clustering on the single-cell transcriptome to be annotated; differential gene calculation is performed on each cell cluster to obtain a list of differential genes; an association query is performed on each marker gene in the list of differential genes using a pre-constructed knowledge graph to obtain association data for each marker gene; the association data includes at least the associated cell type; cell annotation analysis is performed on each marker gene based on the association data to obtain candidate cell annotation results; the candidate cell annotation results are fine-tuned to obtain target cell annotation results, which are used to annotate single-cell transcriptomes. This achieves automatic annotation without relying on a reference transcriptome (annotated single-cell transcriptome) or marker gene database, greatly reducing the time and workload of manual annotation. Further fine-tuning of the candidate cell annotation results effectively improves the accuracy of the target cell annotation results.
[0142] Each module in the aforementioned cell annotation device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the operations corresponding to each module.
[0143] In one exemplary embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 8. This computer device includes a processor, memory, input/output (I/O) interfaces, and a communication interface. The processor, memory, and I/O interfaces are connected via a system bus, and the communication interface is also connected to the system bus via the I/O interfaces. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides the environment for running the operating system and computer programs stored in the non-volatile storage media. The database stores a knowledge graph. The I/O interfaces are used for exchanging information between the processor and external devices. The communication interface is used for communicating with external terminals via a network. When executed by the processor, the computer program implements a cell annotation method.
[0144] Those skilled in the art will understand that the structure shown in Figure 8 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.
[0145] In one exemplary embodiment, a computer device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the cell annotation method described above.
[0146] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the steps of the cell annotation method described above.
[0147] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps of the cell annotation method described above.
[0148] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with relevant regulations.
[0149] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile memory and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, artificial intelligence (AI) processors, etc., and are not limited to these.
[0150] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this application.
[0151] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A cell annotation method, characterized in that the method includes: performing cell clustering on the single-cell transcriptome to be annotated to obtain multiple cell clusters; performing differential gene calculation on each cell cluster to obtain a list of differential genes; performing an association query on each marker gene in the differential gene list using a pre-constructed knowledge graph to obtain association data for each marker gene, the association data including at least the associated cell type; performing cell annotation analysis on each of the marker genes based on the association data to obtain candidate cell annotation results; and fine-tuning the candidate cell annotation results to obtain target cell annotation results, which are used to annotate the single-cell transcriptome.
2. The method according to claim 1, characterized in that performing cell annotation analysis on each marker gene based on the association data to obtain candidate cell annotation results includes: calculating a cell annotation probability for each of the marker genes to obtain a cell type annotation probability, the cell type annotation probability representing the association strength of each of the marker genes with respect to each of the associated cell types; calculating gene expression probabilities for the associated cell types based on the annotation probabilities to obtain cell type expression probabilities, the cell type expression probabilities describing the association strength of each associated cell type with respect to each marker gene; and sorting the cell type expression probabilities to obtain a sorting result, and determining the candidate cell annotation results based on the sorting result.
3. The method according to claim 2, characterized in that fine-tuning the candidate cell annotation results to obtain the target cell annotation results includes: if it is determined that the cell cluster contains two target cell types, regrouping the cells based on the difference in the average expression level of marker genes between at least two target cell types, and sequentially performing differential gene calculation, association query, and cell annotation analysis based on the regrouping results to obtain the target cell annotation results; wherein the target cell types indicate the top two associated cell types in the sorting result.
4. The method according to claim 3, characterized in that determining that the cell cluster contains two target cell types includes: when the number of associated genes corresponding to each of the associated cell types in the knowledge graph is not less than half of the total number of genes of the two associated cell types, and the marker genes corresponding to each of the associated cell types are clustered in the cell embedding space, determining that the cell cluster contains two target cell types.
5. The method according to claim 1, characterized in that fine-tuning the candidate cell annotation results to obtain the target cell annotation results includes: obtaining, from the knowledge graph, parent-child node relationship data related to the candidate cell annotation results; and pruning the parent nodes of the knowledge graph based on the parent-child node relationship data to update the knowledge graph, performing an association query on each marker gene in the differential gene list using the updated knowledge graph, and performing the cell annotation analysis step based on the query results to obtain the target cell annotation results.
6. The method according to claim 1, characterized in that the method further includes: acquiring a marker gene database and a single-cell dataset, the marker gene database including multiple first cell types and first marker gene data corresponding to each first cell type; performing data analysis on the single-cell dataset to obtain multiple second cell types and second marker gene data corresponding to each second cell type; using the difference between the in-group and out-of-group expression proportions of the second marker gene data as the association weight, the association weight describing the association strength between the second marker gene data and the second cell type; and constructing the knowledge graph based on the first cell types and their corresponding first marker gene data, the second cell types and their corresponding second marker gene data, and the association weights.
7. A cell annotation device, characterized in that the device includes: a cell clustering module, used to perform cell clustering on the single-cell transcriptome to be annotated to obtain multiple cell clusters; a differential gene calculation module, used to perform differential gene calculation on each of the cell clusters to obtain a list of differential genes; an associated gene query module, used to perform an association query on each marker gene in the differential gene list using a pre-constructed knowledge graph to obtain association data for each marker gene, the association data including at least the associated cell type; a cell annotation analysis module, used to perform cell annotation analysis on each of the marker genes based on the association data to obtain candidate cell annotation results; and a fine-tuning module, used to fine-tune the candidate cell annotation results to obtain target cell annotation results, which are used to annotate the single-cell transcriptome.
8. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 6.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 6.
10. A computer program product, comprising a computer program, characterized in that when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 6.