Clustering method, device and equipment of spatial transcriptome data and storage medium

By constructing a hypergraph model and a simple undirected graph, the problem that existing clustering algorithms cannot integrate spatial location information and gene expression data is solved, enabling detailed classification of multiple cell types at the same sampling site in spatial transcriptome data and improving the accuracy of the analysis.

CN116705158B · Active · Publication Date: 2026-06-26 · BEIJING YUANMA MEDICAL LAB CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING YUANMA MEDICAL LAB CO LTD
Filing Date
2023-06-12
Publication Date
2026-06-26

Abstract

The application provides a spatial transcriptome data clustering method and device, equipment and storage medium, and relates to the technical field of biological information. The method comprises: acquiring spatial transcriptome data of a preset biological sample, wherein the spatial transcriptome data comprises gene expression data of the preset biological sample at multiple spatial sampling sites; constructing a hypergraph model of the preset biological sample according to the gene expression data of the multiple spatial sampling sites; constructing a simple undirected graph according to each hyperedge in the hypergraph model; and clustering the points in the simple undirected graph to obtain a gene expression clustering result of at least one point cluster. The spatial transcriptome data clustering method provided by the application can obtain a tissue region with specific biological functions composed of one or more cells contained in the biological sample at each sampling site, so that the analysis result of the clustering algorithm is more matched with the spatial transcriptome data structure.

Description

Technical Field

[0001] This application relates to the field of bioinformatics, and more specifically, to a method, apparatus, device, and storage medium for clustering spatial transcriptome data.

Background Technology

[0002] Spatial transcriptomics is a method of analyzing and describing the expression profile of a specific cell type in a spatial dimension. Spatial transcriptomics not only provides gene expression data at different spatial locations in a biological sample, but also provides spatial location information corresponding to the gene expression data.

[0003] To obtain information about a biological sample from spatial transcriptome data, a clustering algorithm is needed to perform cluster analysis on the multiple sampling sites of the spatial transcriptome data. However, existing clustering algorithms assume that a sampling site belongs to only one cell type and one classification, which does not match the data structure of the spatial transcriptome and may lead to misjudgment of information in biological samples.

Summary of the Invention

[0004] The purpose of this application is to provide a method, apparatus, device, and storage medium for clustering spatial transcriptome data, in order to address the shortcomings of the prior art and solve the problems existing in the prior art.

[0005] To achieve the above objectives, the technical solutions adopted in the embodiments of this application are as follows:

[0006] In a first aspect, embodiments of this application provide a method for clustering spatial transcriptome data, including:

[0007] Acquire spatial transcriptome data of a preset biological sample; wherein, the spatial transcriptome data includes gene expression data of the preset biological sample at multiple spatial sampling sites, and the gene expression data of each spatial sampling site includes: expression data of multiple genes at each spatial sampling site;

[0008] Based on the gene expression data of the multiple spatial sampling sites, a hypergraph model of the preset biological sample is constructed, wherein the hypergraph model has: at least one hyperedge, each hyperedge corresponding to the expression data of the same gene at more than one spatial sampling site and the spatial continuity information of the sampling site;

[0009] Based on each hyperedge in the hypergraph model, a simple undirected graph is constructed, wherein each point in the simple undirected graph corresponds to a hyperedge, and the weight of the undirected edge between two points indicates information about the spatial sampling sites shared by the corresponding hyperedges.

[0010] Clustering is performed on the points in the simple undirected graph to obtain gene expression clustering results for at least one point cluster.

[0011] In one embodiment, before constructing the hypergraph model of the preset biological sample based on the gene expression data of the plurality of spatial sampling sites, the method further includes:

[0012] The gene expression data of the multiple spatial sampling sites are normalized so that the total expression level of all genes within the multiple spatial sampling sites is the same;

[0013] Based on the normalized gene expression data, calculate the variance of the expression level of each gene at the multiple spatial sampling sites;

[0014] Based on the variance of expression levels of the multiple genes at the multiple spatial sampling sites, a predetermined number of target genes are selected from the multiple genes;

[0015] The step of constructing a hypergraph model of the preset biological sample based on gene expression data from the multiple spatial sampling sites includes:

[0016] The hypergraph model is constructed based on the expression data of the target gene at the multiple spatial sampling sites.

[0017] In one embodiment, constructing a hypergraph model of the preset biological sample based on gene expression data from the plurality of spatial sampling sites includes:

[0018] The gene expression data of the multiple spatial sampling sites are binarized to obtain the binarized data of the multiple spatial sampling sites;

[0019] The hypergraph model is constructed based on the binarized data of the multiple spatial sampling sites.

[0020] In one embodiment, before constructing a simple undirected graph based on each hyperedge in the hypergraph model, the method further includes:

[0021] Based on the location information of the plurality of spatial sampling points, the continuity of the spatial sampling points corresponding to the at least one hyperedge in the two-dimensional space where the hypergraph model is located is determined;

[0022] The hyperedges whose continuity in the two-dimensional space does not meet the preset condition are segmented so that the continuity of the spatial sampling points corresponding to each segmented hyperedge meets the preset condition, thus obtaining an optimized hypergraph model.

[0023] The step of constructing a simple undirected graph based on each hyperedge in the hypergraph model includes:

[0024] Based on each hyperedge in the optimized hypergraph model, the simple undirected graph is constructed.
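The segmentation step in [0021] to [0024] can be sketched as follows. This is only an illustrative sketch: the source does not specify the "preset condition", so the code assumes, hypothetically, that a hyperedge is continuous when its sampling sites form one connected group on an integer grid under 4-neighbour adjacency. The function name `split_by_continuity` and the toy coordinates are also assumptions.

```python
def split_by_continuity(sites, coords):
    """Split one hyperedge's sites into groups of spatially adjacent sites.

    Assumption: two sites are adjacent when their integer grid
    coordinates differ by exactly one step horizontally or vertically.
    """
    remaining = set(sites)
    pieces = []
    while remaining:
        start = min(remaining)  # deterministic starting site
        stack, piece = [start], set()
        while stack:
            site = stack.pop()
            if site in piece:
                continue
            piece.add(site)
            x, y = coords[site]
            for other in remaining:
                ox, oy = coords[other]
                if other not in piece and abs(x - ox) + abs(y - oy) == 1:
                    stack.append(other)
        remaining -= piece
        pieces.append(piece)
    return pieces


# Hypothetical toy layout: two spatially separated patches of sites.
toy_coords = {"s1": (0, 0), "s2": (0, 1), "s3": (5, 5), "s4": (5, 6)}
toy_pieces = split_by_continuity({"s1", "s2", "s3", "s4"}, toy_coords)
```

In this toy layout the original hyperedge is split into two hyperedges, one per spatial patch; a real sampling grid may call for a different adjacency rule.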

[0025] In one embodiment, constructing a simple undirected graph based on each hyperedge in the hypergraph model includes:

[0026] Each hyperedge in the hypergraph model is set as a point in the simple undirected graph.

[0027] Based on the number of spatial sampling sites shared by two hyperedges in the hypergraph model, the weight of the undirected edge between the two points corresponding to those two hyperedges in the simple undirected graph is set.

[0028] In one embodiment, the method further includes:

[0029] Based on the gene expression clustering results of each of the at least one clusters, the marker gene of each cluster and the cell type corresponding to each cluster are determined.

[0030] In one embodiment, determining the marker gene of each cluster and the cell type corresponding to each cluster based on the gene expression clustering results of each cluster includes:

[0031] Based on the gene expression clustering results of each cluster, a preset differential analysis algorithm is used to obtain the marker gene of each cluster;

[0032] Based on the marker genes of each cluster and the preset correspondence between marker genes and cell types, the cell type corresponding to each cluster is determined.

[0033] In a second aspect, embodiments of this application provide a clustering device for spatial transcriptome data, comprising:

[0034] An acquisition module is used to acquire spatial transcriptome data of a preset biological sample; wherein, the spatial transcriptome data includes gene expression data of the preset biological sample at multiple spatial sampling sites, and the gene expression data of each spatial sampling site includes: expression data of multiple genes at each spatial sampling site;

[0035] The hypergraph model construction module is used to construct a hypergraph model of the preset biological sample based on the gene expression data of the multiple spatial sampling sites. The hypergraph model has at least one hyperedge, and each hyperedge corresponds to the expression data of the same gene at more than one spatial sampling site.

[0036] An undirected graph construction module is used to construct a simple undirected graph based on each hyperedge in the hypergraph model, wherein each point in the simple undirected graph corresponds to a hyperedge, and the weight of the undirected edge between two points indicates information about the spatial sampling sites shared by the corresponding hyperedges.

[0037] The clustering module is used to cluster the points in the simple undirected graph to obtain the gene expression clustering results of at least one point cluster.

[0038] In a third aspect, embodiments of this application provide a computer device, including: a processor, a storage medium, and a bus. The storage medium stores program instructions executable by the processor. When the computer device is running, the processor communicates with the storage medium via the bus, and the processor executes the program instructions to perform the steps of the spatial transcriptome data clustering method as described in the above embodiments.

[0039] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program, which, when executed by a processor, performs the steps of the spatial transcriptome data clustering method as described in the above embodiments.

[0040] The beneficial effects of this application are as follows: This application provides a method, apparatus, device, and storage medium for clustering spatial transcriptome data. The method includes: First, acquiring spatial transcriptome data of a preset biological sample; wherein the spatial transcriptome data includes gene expression data of the preset biological sample at multiple spatial sampling sites, and the gene expression data of each spatial sampling site includes expression data of multiple genes at each spatial sampling site; Second, constructing a hypergraph model of the preset biological sample based on the gene expression data of multiple spatial sampling sites, wherein the hypergraph model has at least one hyperedge, and each hyperedge corresponds to the expression data of the same gene at more than one spatial sampling site and the spatial continuity information of the sampling sites; Then, constructing a simple undirected graph based on each hyperedge in the hypergraph model, wherein each point in the simple undirected graph corresponds to a hyperedge, and the weight of the undirected edge between points is used to indicate the information of the common spatial sampling sites between corresponding hyperedges; Finally, clustering the points in the simple undirected graph to obtain gene expression clustering results of at least one point cluster.

[0041] With the spatial transcriptome data clustering method provided in this application, the same sampling site of a biological sample can belong to multiple hyperedges simultaneously. Therefore, after the hyperedges are clustered to obtain gene expression clustering results, some sampling sites may be associated with several different gene expression clustering results at the same time; that is, the same sampling site contains multiple different cell types. This achieves detailed classification of multiple cell types at the same sampling site in the spatial transcriptome data. It also makes it possible to obtain tissue regions with specific biological functions composed of one or more cells contained in the biological sample at each sampling site, making the analysis results of the clustering algorithm more consistent with the spatial transcriptome data structure and avoiding misjudgment of information in the preset biological sample due to mismatch.

Attached Figure Description

[0042] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0043] Figure 1 is a flowchart illustrating a method for clustering spatial transcriptome data provided in an embodiment of this application;

[0044] Figure 2 is a schematic diagram of a constructed hypergraph model provided in an embodiment of this application;

[0045] Figure 3 is a schematic flowchart of a spatial transcriptome data preprocessing method provided in an embodiment of this application;

[0046] Figure 4 is a schematic diagram of a method for constructing a hypergraph model according to an embodiment of this application;

[0047] Figure 5 is a schematic flowchart of a method for optimizing a hypergraph model according to an embodiment of this application;

[0048] Figure 6 is a schematic diagram illustrating the result of optimizing a hypergraph model according to an embodiment of this application;

[0049] Figure 7 is a schematic flowchart of a method for constructing a simple undirected graph provided in an embodiment of this application;

[0050] Figure 8 is a schematic diagram of a simple undirected graph provided in an embodiment of this application;

[0051] Figure 9 is a schematic flowchart of a method for obtaining cell types based on clustering results, provided in an embodiment of this application;

[0052] Figure 10 is a schematic diagram of the structure of a clustering device for spatial transcriptome data provided in an embodiment of this application;

[0053] Figure 11 is a schematic diagram of the structure of a computer device provided in an embodiment of this application.

Detailed Implementation

[0054] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are some embodiments of this application, but not all embodiments.

[0055] Therefore, the following detailed description of the embodiments of this application provided in the accompanying drawings is not intended to limit the scope of the claimed application, but merely to illustrate selected embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of this application without inventive effort are within the scope of protection of this application.

[0056] Furthermore, the terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Additionally, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0057] It should be noted that, where there is no conflict, the features in the embodiments of this application can be combined with each other.

[0058] A key challenge in biological analysis is cell classification within biological samples. Current research primarily relies on gene expression data from single-cell sequencing for function-based classification, but single-cell sequencing cannot capture spatial location information. The rise of spatial transcriptomics not only provides gene expression data from different spatial locations within biological samples but also the corresponding spatial location information. Therefore, if spatial transcriptomics data can be used to obtain information from biological samples, it would be possible to acquire cell classification information simultaneously with cell location information, which is of great significance for biological analysis.

[0059] However, current clustering algorithms used for spatial transcriptome data clustering analysis have shortcomings. When classifying cells in biological samples, they fail to fully integrate spatial location information and gene expression data. This is mainly because spatial transcriptome data is obtained by collecting data from multiple sampling sites of biological samples. The same sampling site contains information on multiple cells (typically, a cell diameter is 10-15 μm, while a sampling site in spatial transcriptome has a diameter of 55 μm, meaning one sampling site actually contains 2-10 cells). Existing clustering algorithms, after clustering analysis of spatial transcriptome data, can only group multiple cells at each sampling site into the same category, which does not conform to the data structure of spatial transcriptomes. Therefore, this application provides a clustering method for spatial transcriptome data. This method can classify the types of multiple cells at the same sampling site in spatial transcriptome data in detail, making the analysis results of the clustering algorithm more consistent with the spatial transcriptome data structure.

[0060] The following examples, in conjunction with the accompanying drawings, provide specific illustrations of the clustering method for spatial transcriptome data provided in this application.

[0061] First, it should be noted that the embodiments of this application provide a method for clustering spatial transcriptome data. The spatial transcriptome data clustering provided by this method can be generated by any computer device that integrates a preset spatial transcriptome data clustering generation algorithm. The computer device can be, for example, a terminal-oriented computer device or a backend server.

[0062] Figure 1 is a flowchart illustrating a method for clustering spatial transcriptome data according to an embodiment of this application. As shown in Figure 1, the method includes:

[0063] S101. Obtain spatial transcriptome data of a preset biological sample.

[0064] This embodiment performs cluster analysis on the spatial transcriptome of biological samples. Therefore, before performing cluster analysis, it is necessary to obtain the spatial transcriptome data of the preset biological samples.

[0065] The spatial transcriptome data includes gene expression data of a pre-defined biological sample at multiple spatial sampling sites. Each spatial sampling site's gene expression data includes the expression data of multiple genes at that site. The pre-defined biological sample can be any biological sample to be identified, and the number of spatial sampling sites can be adjusted according to actual needs. This embodiment does not limit the type of pre-defined biological sample or the number of spatial sampling sites.

[0066] The spatial transcriptome data of the pre-defined biological samples can be shown in Table 1. In Table 1, the rows represent the names of sampling sites, the columns represent the names of genes, and each value represents the expression level of the gene corresponding to the column detected at the sampling site in the row. That is, each row represents the expression status of each gene at the sampling site in that row. Since Table 1 is an extremely sparse matrix, the expression level of most genes in Table 1 is 0.

[0067] Table 1 Spatial transcriptome data of the pre-selected biological samples

[0068]

[0069] S102. Based on gene expression data from multiple spatial sampling sites, construct a hypergraph model of the pre-defined biological samples.

[0070] After obtaining the spatial transcriptome data of the preset biological sample, a hypergraph model of the preset biological sample can be constructed based on the gene expression data of multiple spatial sampling sites. The specific steps for constructing the hypergraph model can be referred to the detailed description of the following embodiments, which will not be repeated in this embodiment.

[0071] The hypergraph model established in this embodiment has at least one hyperedge, and each hyperedge corresponds to the expression data of the same gene at more than one spatial sampling site and the spatial continuity information of those sampling sites (that is, a hyperedge in this application contains the expression data and spatial continuity information of two or more sampling sites). Figure 2 is a schematic diagram of a constructed hypergraph model provided in an embodiment of this application. As shown in Figure 2, each dot represents a sampling site and each closed circle represents a hyperedge; the hypergraph model shown in Figure 2 therefore has three hyperedges, each representing the expression of the same gene at different sampling sites. The hypergraph model in Figure 2 does not correspond to the data in Table 1 and is for illustrative purposes only; in practice, a hypergraph model built from spatial transcriptome data corresponds strictly to that data.

[0072] Figure 2 also shows that the same sampling site can belong to multiple hyperedges at the same time; that is, the same sampling site can express a variety of different genes. This matches the structure of spatial transcriptome data. Therefore, the results obtained by performing cluster analysis on spatial transcriptome data according to the method of this embodiment are fully consistent with the structure of spatial transcriptome data.

[0073] S103. Construct a simple undirected graph based on the hyperedges in the hypergraph model.

[0074] After constructing the hypergraph model of the pre-defined biological samples, a simple undirected graph can be built based on the hyperedges in the hypergraph model. This simple undirected graph can then be used to perform cluster analysis on the spatial transcriptome data. The specific steps for constructing the simple undirected graph can be found in the detailed description of the following embodiments, and will not be repeated in this embodiment.

[0075] In this embodiment, each point in the constructed simple undirected graph corresponds to a hyperedge (for example, the hypergraph model shown in Figure 2 has 3 hyperedges, so the simple undirected graph constructed from it has 3 points). The undirected edges connecting the points carry weights, which indicate information about the spatial sampling sites shared by the hyperedges corresponding to the points that each edge connects.
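As an illustration of step S103, the following minimal sketch turns a set of hyperedges into a weighted simple undirected graph. It assumes that the edge weight equals the number of shared sampling sites and that pairs sharing no site get no edge; the source only states that the weight is based on the shared sites, so both choices are assumptions. The function name and the toy hyperedges are hypothetical.

```python
from itertools import combinations


def hyperedges_to_weighted_graph(hyperedges):
    """Return {(point_a, point_b): weight} for hyperedge pairs that share sites.

    Each hyperedge becomes one point of the simple undirected graph.
    Assumption: the weight is the count of shared sampling sites.
    """
    weights = {}
    for a, b in combinations(sorted(hyperedges), 2):
        shared = len(hyperedges[a] & hyperedges[b])
        if shared > 0:
            weights[(a, b)] = shared
    return weights


# Hypothetical toy hypergraph with three hyperedges (one per gene).
toy_hyperedges = {
    "g1": {"s1", "s2", "s3"},
    "g2": {"s2", "s3", "s4"},
    "g3": {"s5", "s6"},
}
toy_graph = hyperedges_to_weighted_graph(toy_hyperedges)
```

Here the points for `g1` and `g2` are joined by an edge of weight 2 because they share sites `s2` and `s3`, while `g3` remains an isolated point.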

[0076] S104. Cluster the points in a simple undirected graph to obtain the gene expression clustering results of at least one point cluster.

[0077] Finally, after constructing a simple undirected graph based on the hyperedges in the hypergraph model, the points in the simple undirected graph are clustered using a preset clustering algorithm. This yields the gene expression clustering results for at least one point cluster. The gene expression clustering results of a point cluster can be used to obtain its corresponding cell type. Here, a point cluster refers to multiple sampling sites corresponding to the same hyperedge. The preset clustering algorithm can be, for example, a graph-based unsupervised algorithm, such as the Louvain or Leiden algorithm. Specific methods for obtaining cell types using gene expression clustering results can be found in the following embodiments of this application.
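To make step S104 concrete, the sketch below groups the points of a weighted graph into clusters. Note that it does not implement Louvain or Leiden: as a deliberately simple stand-in, it takes connected components after discarding edges below a hypothetical `min_weight` threshold. The function name, the threshold, and the toy weights are all assumptions.

```python
def cluster_points(points, weights, min_weight=1):
    """Group points linked by edges with weight >= min_weight.

    Stand-in for a community-detection algorithm: this is plain
    connected-component search, not Louvain or Leiden.
    """
    neighbours = {p: set() for p in points}
    for (a, b), w in weights.items():
        if w >= min_weight:
            neighbours[a].add(b)
            neighbours[b].add(a)
    clusters, seen = [], set()
    for start in sorted(points):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Hypothetical toy graph: four points (hyperedges) and three weighted edges.
toy_points = ["g1", "g2", "g3", "g4"]
toy_weights = {("g1", "g2"): 3, ("g2", "g3"): 1, ("g3", "g4"): 2}
toy_clusters = cluster_points(toy_points, toy_weights, min_weight=2)
```

With `min_weight=2` the weak edge between `g2` and `g3` is ignored, so the four points fall into two clusters. A modularity-based method would weigh edges rather than cut them at a fixed threshold.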

[0078] In summary, this embodiment provides a clustering method for spatial transcriptome data. After constructing a hypergraph model for the spatial transcriptome data of a preset biological sample using the method provided in this embodiment, since the same sampling site of the preset biological sample can belong to multiple hyperedges simultaneously, after obtaining gene expression clustering results by clustering the hyperedges, it is possible to obtain certain sampling sites containing different gene expression clustering results at the same time. That is, the same sampling site contains multiple different cell types simultaneously, realizing a detailed classification of multiple cell types at the same sampling site in the spatial transcriptome data. It is possible to obtain tissue regions with specific biological functions composed of one or more cells contained in the biological sample at each sampling site, making the analysis results of the clustering algorithm more consistent with the spatial transcriptome data structure, and avoiding misjudgment of information in the preset biological sample due to mismatch.
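The point that one sampling site can carry several clustering results can be illustrated with a small sketch that maps clusters of hyperedges back to sampling sites. The function name, the cluster numbering, and the toy data are hypothetical.

```python
def clusters_per_site(hyperedges, clusters):
    """Map each sampling site to the set of cluster ids it participates in.

    A site inherits the cluster of every hyperedge that contains it,
    so a site covered by hyperedges from different clusters gets
    several cluster ids.
    """
    site_to_clusters = {}
    for cluster_id, members in enumerate(clusters):
        for edge in members:
            for site in hyperedges[edge]:
                site_to_clusters.setdefault(site, set()).add(cluster_id)
    return site_to_clusters


# Hypothetical toy data: four hyperedges grouped into two clusters.
toy_hyperedges = {
    "g1": {"s1", "s2"},
    "g2": {"s2", "s3"},
    "g3": {"s3", "s4"},
    "g4": {"s4", "s5"},
}
toy_clusters = [{"g1", "g2"}, {"g3", "g4"}]
toy_assignment = clusters_per_site(toy_hyperedges, toy_clusters)
```

In this toy example site `s3` belongs to hyperedge `g2` (cluster 0) and hyperedge `g3` (cluster 1), so it is assigned both clusters, whereas `s1` is assigned only cluster 0.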

[0079] One embodiment of this application also provides a method for preprocessing spatial transcriptome data. Figure 3 is a schematic flowchart of a spatial transcriptome data preprocessing method provided in an embodiment of this application. As shown in Figure 3, before the hypergraph model of the preset biological sample is constructed from the gene expression data of the multiple spatial sampling sites in step S102 of the above embodiment, the method may further include:

[0080] S301. Normalize the gene expression data of multiple spatial sampling sites so that the total expression level of all genes within the multiple spatial sampling sites is the same.

[0081] Before constructing a hypergraph model of a pre-defined biological sample based on gene expression data from multiple spatial sampling sites, the gene expression data from these sites can be preprocessed. This preprocessing makes the constructed hypergraph model more closely match the data, improving the accuracy of the model and facilitating subsequent processing. The preprocessing methods are described in detail below.

[0082] First, after obtaining the spatial transcriptome data of the preset biological samples, the gene expression data of multiple spatial sampling sites can be normalized. Normalization will make the total expression level of all genes in multiple spatial sampling sites the same (that is, the total gene expression in each row of Table 1 is consistent, but the relative expression level between genes in each row is not changed). For example, the gene expression level in each row can be normalized to 1,000,000 (see Formula 1).

[0083] The normalization formula is shown in equation (1) below, where $G_{i,j}$ represents the normalized expression level of the j-th gene at the i-th sampling site, $C_{i,j}$ represents the expression level of the j-th gene at the i-th sampling site before normalization, and $N$ is the number of all genes, i.e., the number of columns in Table 1. For example, if there are 5 genes in Table 1, then $N$ is 5.

[0084] $$G_{i,j} = \frac{C_{i,j}}{\sum_{k=1}^{N} C_{i,k}} \times 1000000 \tag{1}$$
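Equation (1) can be sketched in a few lines of Python. The function name `normalize_counts`, the handling of all-zero rows, and the toy matrix are assumptions made for illustration; only the scaling to a row total of 1,000,000 comes from the text.

```python
def normalize_counts(counts, target_total=1_000_000):
    """Scale each row (sampling site) so its values sum to target_total."""
    normalized = []
    for row in counts:
        row_total = sum(row)
        if row_total == 0:
            # Assumption: a site with no detected expression stays all zero.
            normalized.append([0.0 for _ in row])
        else:
            normalized.append([value / row_total * target_total for value in row])
    return normalized


# Hypothetical toy matrix: 3 sampling sites (rows) x 4 genes (columns).
toy_counts = [
    [2, 0, 6, 2],
    [0, 5, 5, 0],
    [1, 1, 1, 1],
]
toy_normalized = normalize_counts(toy_counts)
```

After normalization every row sums to the same total, while the ratio between any two genes within a row is unchanged (for instance, the third gene of the first site is still three times the first gene).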

[0085] S302. Based on the normalized gene expression data, calculate the variance of the expression level of each gene at multiple spatial sampling sites.

[0086] After normalizing the gene expression data from multiple spatial sampling sites, the variance of the expression level of each gene at multiple spatial sampling sites can be calculated. That is, the variance between the multiple values in each column of the normalized data in Table 1 can be calculated, and the gene expression data can be further preprocessed by the variance.

[0087] S303. Based on the variance of expression levels of multiple genes at multiple spatial sampling sites, select a predetermined number of target genes from multiple genes.

[0088] After obtaining the expression variance of multiple genes at multiple spatial sampling sites, a predetermined number of genes can be selected as target genes from the normalized multiple genes based on the variance to complete further preprocessing of the gene expression data. The predetermined number can be, for example, 3000. In practice, the predetermined number is not limited to 3000 and can be determined according to actual needs. This embodiment does not limit the specific value of the predetermined number.
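Steps S302 and S303 can be sketched as follows. The text only says that target genes are selected based on variance; the sketch assumes that the genes with the largest variance are kept and that population variance is used. The function name and the toy values are hypothetical.

```python
from statistics import pvariance


def select_high_variance_genes(normalized, gene_names, n_target):
    """Return the n_target genes with the largest variance across sites.

    normalized: list of rows (sampling sites), one value per gene.
    Assumption: higher variance means more informative, and population
    variance is an acceptable choice.
    """
    columns = list(zip(*normalized))  # one tuple of values per gene
    variances = {name: pvariance(col) for name, col in zip(gene_names, columns)}
    ranked = sorted(gene_names, key=lambda name: variances[name], reverse=True)
    return ranked[:n_target], variances


# Hypothetical toy data: 3 sampling sites x 3 genes.
toy_genes = ["gene_a", "gene_b", "gene_c"]
toy_values = [
    [10.0, 0.0, 5.0],
    [10.0, 30.0, 5.0],
    [10.0, 0.0, 20.0],
]
toy_selected, toy_variances = select_high_variance_genes(toy_values, toy_genes, 2)
```

In this toy example `gene_a` is constant across sites (variance 0) and is dropped, leaving `gene_b` and `gene_c` as the target genes.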

[0089] Based on this, in step S102, a hypergraph model of the pre-defined biological sample is constructed according to gene expression data from multiple spatial sampling sites, which may include:

[0090] S304. Construct a hypergraph model based on the expression data of the target gene at multiple spatial sampling sites.

[0091] Step S102 can be: after obtaining the preset number of target genes, construct the hypergraph model based on the expression data of the target genes at the multiple spatial sampling sites.

[0092] In practice, RNA from the current sampling site may flow to nearby sampling sites, causing data contamination. Therefore, in an optional embodiment, before normalizing the gene expression data from multiple spatial sampling sites, noise reduction processing can be performed on the data from each sampling site. This ensures that the transcript data flowing out of each sampling site is only recorded at the original sampling site, reducing this data contamination. Noise reduction methods can include, for example, using the Spotclean or Sprod algorithms. The data structure before and after noise reduction remains the expression matrix shown in Table 1, except that the expression levels of certain genes at certain locations have changed.

[0093] In summary, the method described in this embodiment, which performs preprocessing such as normalization and variance calculation on spatial transcriptome data, can make the total expression level of all genes within multiple spatial sampling sites the same. This makes the hypergraph model constructed based on the preprocessed data more closely match the spatial transcriptome data, thereby improving the accuracy of the constructed hypergraph model.

[0094] One embodiment of this application provides a possible implementation for constructing a hypergraph model. Figure 4 is a schematic diagram of a method for constructing a hypergraph model according to an embodiment of this application. As shown in Figure 4, in step S102, constructing a hypergraph model of a preset biological sample based on gene expression data from multiple spatial sampling sites may include:

[0095] S401. Binarize the gene expression data of multiple spatial sampling sites to obtain binarized data of multiple spatial sampling sites.

[0096] In this embodiment, when constructing a hypergraph model of a preset biological sample based on gene expression data from multiple spatial sampling sites, the gene expression data from multiple spatial sampling sites can be binarized first, and the hypergraph model can be constructed based on the binarized data.

[0097] The specific operation of binarization is as follows: A suitable threshold is pre-set. Based on the threshold, the expression levels of normalized gene expression data at each sampling site are divided. Genes with expression levels greater than or equal to the threshold are set to 1, and genes with expression levels less than the threshold are set to 0. This process is repeated for all genes to complete the binarization of gene expression data at multiple spatial sampling sites, resulting in binarized data. In the obtained binarized data, the gene expression data at each sampling site is labeled as either 0 or 1.
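The thresholding rule described above can be sketched as follows. This is a minimal illustration under the same assumed nested-dictionary layout (site, then gene, then normalized expression level); the function name and the threshold value used in the example are hypothetical, since the source only states that a suitable threshold is preset.

```python
def binarize(normalized, threshold):
    """Set a gene to 1 at a site if its level is >= threshold, otherwise 0."""
    return {
        site: {gene: 1 if value >= threshold else 0 for gene, value in counts.items()}
        for site, counts in normalized.items()
    }


# Made-up normalized levels for two sites and two genes.
levels = {"s1": {"A": 0.5, "B": 0.2}, "s2": {"A": 0.1, "B": 0.9}}
binary = binarize(levels, threshold=0.5)
```

Note that a level exactly equal to the threshold is set to 1, matching the "greater than or equal to" rule in the text.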

[0098] S402. Construct a hypergraph model based on the binarized data of multiple spatial sampling points.

[0099] Once the binarized data of multiple spatial sampling sites is obtained, a hypergraph model can be constructed based on this data. That is, by sequentially connecting the sampling sites at which the same gene is labeled 1, the hyperedge of that gene in the hypergraph model is obtained. For example, in Figure 2, the sampling sites enclosed by the same closed curve are those at which the same gene's expression level reaches the threshold and is set to 1.

[0100] Optionally, if, after the sampling sites labeled 1 are sequentially connected, a resulting hyperedge contains only one sampling site, that hyperedge is deleted. Since the probability that a gene is expressed at only one sampling site is very small, deleting such hyperedges avoids mistaking non-gene factors for gene expression when establishing hyperedges, further improving the accuracy and reliability of the established hypergraph model.
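A minimal sketch of this hyperedge construction, including the optional deletion of single-site hyperedges, might look as follows. It assumes the binarized data is a nested dictionary (site, then gene, then 0 or 1) and represents each hyperedge as the set of sites at which a gene is labeled 1. The function name and the `min_sites` parameter are illustrative assumptions.

```python
def build_hyperedges(binary, min_sites=2):
    """Group the sites labeled 1 for each gene into one hyperedge per gene.

    Hyperedges with fewer than min_sites sites are dropped, mirroring the
    optional deletion of hyperedges that contain only one sampling site.
    """
    edges = {}
    for site, flags in binary.items():
        for gene, flag in flags.items():
            if flag == 1:
                edges.setdefault(gene, set()).add(site)
    return {gene: sites for gene, sites in edges.items() if len(sites) >= min_sites}


# Made-up binarized data: gene C is labeled 1 at a single site only.
binary_data = {
    "s1": {"A": 1, "B": 0, "C": 1},
    "s2": {"A": 1, "B": 1, "C": 0},
    "s3": {"A": 0, "B": 1, "C": 0},
}
hyperedges = build_hyperedges(binary_data)
```

Here gene C yields no hyperedge because it is labeled 1 at only one site.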

[0101] In this embodiment, by binarizing the gene expression data and constructing a hypergraph model based on the obtained binarized data, it is beneficial to reduce the amount of data that the computer device needs to process to construct the hypergraph model, thereby increasing the speed of constructing the hypergraph model. Furthermore, the binarized data enables the computer device to more accurately identify the data, which is beneficial to improving the accuracy of the computer device in constructing the hypergraph model.

[0102] One embodiment of this application also provides a possible implementation for optimizing the constructed hypergraph model. Figure 5 is a schematic diagram of a method for optimizing a hypergraph model according to an embodiment of this application. As shown in Figure 5, before the simple undirected graph is constructed from the hyperedges of the hypergraph model in step S103, the method may further include:

[0103] S501. Based on the location information of multiple spatial sampling points, determine the continuity of at least one hyperedge corresponding to a spatial sampling point in the two-dimensional space of the hypergraph model.

[0104] Spatial transcriptome data can also include the location information of multiple spatial sampling sites in the above embodiments. Therefore, after constructing the hypergraph model, the continuity of at least one hyperedge corresponding to the spatial sampling site in the two-dimensional space of the hypergraph model can be determined based on the location information of multiple spatial sampling sites.

[0105] The location information of multiple spatial sampling points includes the row and column position information of the sampling points, that is, the coordinate position information of multiple sampling points in two-dimensional space.

[0106] S502. Segment the hyperedges whose continuity in the two-dimensional space does not meet the preset conditions, so that the continuity of the spatial sampling points corresponding to each hyperedge after segmentation meets the preset conditions, and obtain the optimized hypergraph model.

[0107] When it is determined that the spatial sampling points corresponding to a certain hyperedge are discontinuous in the two-dimensional space of the hypergraph model, the hyperedge can be divided into multiple hyperedges, such that the continuity of the spatial sampling points corresponding to each hyperedge in the two-dimensional space satisfies a preset condition, thus obtaining an optimized hypergraph model. The number of hyperedges into which the hyperedge is divided is determined by the continuity of the hyperedge before division.

[0108] Figure 6 is a schematic diagram illustrating the result of optimizing a hypergraph model according to an embodiment of this application. Taking Figure 2 and Figure 6 as an example: in the hypergraph model shown in Figure 2, the spatial sampling sites corresponding to one hyperedge are discontinuous in the two-dimensional space of the hypergraph model (this hyperedge contains regions without sampling sites; these regions can be considered not to be gene expression regions, meaning the hyperedge is essentially two different hyperedges of the same gene, belonging to the same gene class but not the same gene). Therefore, after this hyperedge is optimized using the optimization method provided in this embodiment, the hypergraph model shown in Figure 6 is obtained. In Figure 6, the hyperedge is divided into two hyperedges. After the division, the continuity of the spatial sampling sites corresponding to each hyperedge meets the preset condition, which completes the optimization of the hypergraph model.

[0109] Optionally, if, after optimizing the hypergraph model, one of the multiple hyperedges in the optimized hypergraph model has only one sampling site, then that hyperedge should be deleted. This can avoid misidentification of other non-genetic factors when establishing hyperedges, and further improve the accuracy and reliability of the established hypergraph model.
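Steps S501 and S502, together with the optional deletion above, can be illustrated with a connected-component search. This is only a sketch: the source does not define the preset continuity condition, so the code assumes that two sampling sites are continuous when their (row, column) positions differ by at most 1 in each direction. The function name, the `(gene, part)` keys and the adjacency rule are all assumptions made for illustration.

```python
from collections import deque


def split_by_continuity(hyperedges, coords, min_sites=2):
    """Split each hyperedge into spatially connected parts.

    hyperedges maps gene -> set of sites; coords maps site -> (row, col).
    Adjacency is ASSUMED to mean neighbouring grid positions (8-neighbourhood).
    Parts with fewer than min_sites sites are dropped.
    """
    result = {}
    for gene, sites in hyperedges.items():
        site_at = {coords[site]: site for site in sites}
        unvisited = set(site_at)
        part = 0
        while unvisited:
            start = min(unvisited)  # deterministic starting position
            unvisited.remove(start)
            queue = deque([start])
            component = {site_at[start]}
            while queue:
                row, col = queue.popleft()
                for d_row in (-1, 0, 1):
                    for d_col in (-1, 0, 1):
                        neighbour = (row + d_row, col + d_col)
                        if neighbour in unvisited:
                            unvisited.remove(neighbour)
                            component.add(site_at[neighbour])
                            queue.append(neighbour)
            if len(component) >= min_sites:
                result[(gene, part)] = component
                part += 1
    return result


# Made-up positions: two separate patches plus one isolated site.
positions = {"s1": (0, 0), "s2": (0, 1), "s3": (5, 5), "s4": (5, 6), "s5": (9, 9)}
optimized = split_by_continuity({"G": {"s1", "s2", "s3", "s4", "s5"}}, positions)
```

The single hyperedge of gene G is divided into two hyperedges, one per contiguous patch, and the isolated site is discarded because its part would contain only one sampling site.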

[0110] Based on this, the construction of a simple undirected graph from each hyperedge in the hypergraph model described in the above embodiments may include:

[0111] S503. Construct a simple undirected graph based on each hyperedge in the optimized hypergraph model.

[0112] Based on the optimization method provided in this embodiment, the construction of a simple undirected graph based on each hyperedge in the hypergraph model described in the above embodiment can also be: constructing a simple undirected graph based on each hyperedge in the optimized hypergraph model. This can make the constructed simple undirected graph fit the spatial transcriptome data better, and the clustering results based on the simple undirected graph are more accurate.

[0113] In this embodiment, by segmenting the hyperedges whose continuity in two-dimensional space does not meet the preset conditions, an optimized hypergraph model is obtained. A simple undirected graph is then constructed based on the optimized hypergraph model. This allows the hypergraph model and the simple undirected graph to better fit the spatial transcriptome data, making the clustering results obtained based on the hypergraph model and the simple undirected graph more accurate and improving the accuracy of clustering spatial transcriptome data.

[0114] One embodiment of this application also provides a possible implementation for constructing a simple undirected graph. Figure 7 is a schematic diagram of a method for constructing a simple undirected graph according to an embodiment of this application. As shown in Figure 7, in S103 of the above embodiment, constructing a simple undirected graph based on each hyperedge in the hypergraph model may include:

[0115] S701. Set each hyperedge in the hypergraph model to a point in a simple undirected graph.

[0116] After constructing the hypergraph model of the preset biological sample according to step S102, each hyperedge in the hypergraph model can be set as a point in a simple undirected graph.

[0117] Figure 8 is a schematic diagram of a simple undirected graph provided in one embodiment of this application. The simple undirected graph shown in Figure 8 corresponds to the optimized hypergraph model shown in Figure 6. That is, the hyperedge with three sampling sites in the upper left corner of Figure 6 corresponds to the point in the upper left corner of Figure 8; the hyperedge with two sampling sites in the lower right corner of Figure 6 corresponds to the point in the upper right corner of Figure 8; and the two hyperedges with four sampling sites in the middle of Figure 6 correspond to the middle point and the lower left point of Figure 8.

[0118] It is important to note that the number of sampling sites contained in a hyperedge of the hypergraph model in Figure 6 is unrelated to the number of points in the simple undirected graph in Figure 8. Each hyperedge in Figure 6 corresponds strictly to one point in Figure 8; that is, the hyperedges of the hypergraph correspond one-to-one with the points of the simple undirected graph.

[0119] S702. Based on the number of common space sampling points between two hyperedges in the hypergraph model, set the weights of the undirected edges between two points corresponding to two hyperedges in a simple undirected graph.

[0120] After obtaining the simple undirected graph, the weights of the undirected edges between two points corresponding to two hyperedges in the simple undirected graph can be set according to the number of common space sampling points between two hyperedges in the hypergraph model.

[0121] Taking Figure 6 and Figure 8 as an example: in Figure 6, the hyperedge with three sampling sites in the upper left corner shares one common sampling site with a hyperedge in the middle (a common sampling site is an overlapping sampling site, meaning that it is located within two different hyperedges simultaneously), and the two hyperedges in the middle of Figure 6 also share one common sampling site. Then, in the simple undirected graph of Figure 8 constructed from the hypergraph model of Figure 6, the weight of the undirected edge between the upper left point and the middle point is 1, and the weight of the undirected edge between the middle point and the lower left point is also 1 (the weight is determined by the number of common sampling sites; if the number of common sampling sites is 2, then the weight is 2).

[0122] It should be noted that if two hyperedges shown in Figure 6 have no common sampling site, then in the simple undirected graph of Figure 8 obtained from Figure 6 there is no edge connecting the two corresponding points (i.e., in Figure 8 there are no connecting edges between the upper right point and the middle and lower left points).
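Steps S701 and S702 can be sketched as below, assuming each hyperedge is stored as a set of sampling sites. Every hyperedge becomes one point, and an undirected edge is added between two points only when their hyperedges share at least one sampling site, with the weight equal to the number of shared sites. The function name, the hyperedge labels and the site identifiers are made up; the example merely imitates the layout described for Figure 6 and Figure 8.

```python
from itertools import combinations


def hyperedges_to_graph(hyperedges):
    """Turn hyperedges into a weighted simple undirected graph.

    Returns (points, weights): points is the list of hyperedge labels, and
    weights maps an unordered pair (a, b) to the number of shared sites.
    Pairs with no shared site get no edge at all.
    """
    points = list(hyperedges)
    weights = {}
    for a, b in combinations(points, 2):
        shared = len(hyperedges[a] & hyperedges[b])
        if shared > 0:
            weights[(a, b)] = shared
    return points, weights


# Hypothetical hyperedges laid out like the description of Figure 6.
example_edges = {
    "upper_left": {"a", "b", "c"},
    "middle": {"c", "d", "e", "f"},
    "lower_left": {"f", "g", "h", "i"},
    "upper_right": {"j", "k"},
}
points, weights = hyperedges_to_graph(example_edges)
```

In this example the two edges both have weight 1, and the point for the two-site hyperedge stays isolated because it shares no sampling site with any other hyperedge.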

[0123] In this embodiment, by constructing a simple undirected graph from the hypergraph model, and by clustering the points in the simple undirected graph, the gene expression clustering results of the point clusters are obtained. Since the amount of data in the simple undirected graph is less than that in the hypergraph model, the computer device can perform clustering analysis based on the simple undirected graph, which helps to reduce the amount of data for clustering analysis and improves the speed of clustering analysis. Furthermore, the simple undirected graph enables the computer device to identify data more accurately, which helps to improve the accuracy of clustering analysis.

[0124] One embodiment of this application also provides a possible method for obtaining cell types based on clustering results. Based on the spatial transcriptome clustering method provided in the above embodiments, the spatial transcriptome clustering method provided in this application may further include: after obtaining the gene expression clustering results of the point clusters, determining the marker gene of each point cluster and the cell type corresponding to each point cluster based on the gene expression clustering results of each point cluster. The method for determining the marker gene and cell type is described below with reference to Figure 9.

[0125] Figure 9 is a schematic flowchart of a method for obtaining cell types based on clustering results according to an embodiment of this application. As shown in Figure 9, the method specifically includes:

[0126] S901. Based on the gene expression clustering results of each cluster, a preset differential analysis algorithm is used to obtain the marker gene for each cluster.

[0127] After obtaining the gene expression clustering results of at least one cluster according to step S104, a preset differential analysis algorithm can be used to obtain the marker gene of each cluster based on the gene expression clustering results of each cluster, and the cell type of the corresponding cluster can be determined by the marker gene.

[0128] S902. Based on the marker genes of each cluster and the preset correspondence between marker genes and cell types, determine the cell type corresponding to each cluster.

[0129] In this embodiment, the correspondence between marker genes and cell types can be obtained in advance. After obtaining the marker genes of each cluster according to step S901, the cell type corresponding to each cluster can be determined according to the marker genes of each cluster and the correspondence between marker genes and cell types.

[0130] For example, cells with marker genes including CD3 and CD8A are identified as CD8+ T cells; cells with marker genes including CD68 are identified as macrophages; and cells with marker genes including FGF7 and MME are identified as fibroblasts.
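The lookup in step S902 can be sketched with a small correspondence table built from the examples above. This is a simplified illustration: it assumes a cluster is assigned a cell type when all of the marker genes listed for that type appear among the cluster's marker genes, which is one possible reading of "marker genes including". The table name, the function name and the extra gene names in the usage lines are assumptions for illustration.

```python
# Preset correspondence between cell types and their marker genes,
# taken from the examples in the text.
MARKER_TABLE = {
    "CD8+ T cell": {"CD3", "CD8A"},
    "macrophage": {"CD68"},
    "fibroblast": {"FGF7", "MME"},
}


def annotate_cluster(marker_genes, table=MARKER_TABLE):
    """Return every cell type whose listed markers are all present.

    An empty list means that no entry of the table matched the cluster.
    """
    markers = set(marker_genes)
    return [cell_type for cell_type, required in table.items() if required <= markers]


t_cell = annotate_cluster(["CD3", "CD8A", "GZMB"])
macrophage = annotate_cluster(["CD68"])
unknown = annotate_cluster(["ACTB"])
```

Returning a list rather than a single label leaves room for clusters that match several entries or none, which a real correspondence table would need to handle explicitly.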

[0131] In this embodiment, the marker gene of each cluster is determined, and the cell type corresponding to each cluster is then determined based on the marker gene. Compared with gene expression clustering results alone, cell types make the clustering results clearer and facilitate subsequent processing by staff. For example, cell types can be used to accurately identify the microenvironment of the preset biological sample corresponding to the spatial transcriptome data.

[0132] In an optional embodiment, after obtaining the cell type corresponding to each cluster, the computer device can also automatically annotate the cell type in the gene expression clustering results, which can facilitate staff to view the cell type of the spatial transcriptome data.

[0133] The following will continue to explain the apparatus, device, and storage medium for performing the spatial transcriptome data clustering method provided in any of the above embodiments of this application. The specific implementation process and the resulting technical effects are the same as those in the corresponding method embodiments. For the sake of brevity, parts not mentioned in this embodiment can be referred to the corresponding content in the method embodiments.

[0134] One embodiment of this application also provides a clustering device for spatial transcriptome data. Figure 10 is a schematic diagram of the structure of a spatial transcriptome data clustering device provided in an embodiment of this application. As shown in Figure 10, the device includes:

[0135] The acquisition module 100 is used to acquire spatial transcriptome data of a preset biological sample; wherein, the spatial transcriptome data includes gene expression data of the preset biological sample at multiple spatial sampling sites, and the gene expression data of each spatial sampling site includes: expression data of multiple genes at each spatial sampling site.

[0136] The hypergraph model construction module 200 is used to construct a hypergraph model of a preset biological sample based on gene expression data from multiple spatial sampling sites. The hypergraph model has at least one hyperedge, and each hyperedge corresponds to the expression data of the same gene at more than one spatial sampling site and the spatial continuity information of the sampling sites.

[0137] The undirected graph construction module 300 is used to construct a simple undirected graph based on each hyperedge in the hypergraph model. In the simple undirected graph, each point corresponds to a hyperedge, and the weight of the undirected edge between points is used to indicate the information of the common space sampling point between the corresponding hyperedges.

[0138] Clustering module 400 is used to cluster points in a simple undirected graph to obtain gene expression clustering results for at least one point cluster.

[0139] In one embodiment, the clustering device for spatial transcriptome data further includes a data processing module for normalizing gene expression data at multiple spatial sampling sites so that the total expression level of all genes at the multiple spatial sampling sites is the same; calculating the expression level variance of each gene at multiple spatial sampling sites based on the normalized gene expression data; and selecting a preset number of target genes from the multiple genes based on the expression level variance of the multiple genes at multiple spatial sampling sites.

[0140] The hypergraph model building module 200 is also used to build a hypergraph model based on the expression data of the target gene at multiple spatial sampling sites.

[0141] In one embodiment, the hypergraph model construction module 200 is further used to binarize the gene expression data of multiple spatial sampling sites to obtain binarized data of multiple spatial sampling sites; and to construct a hypergraph model based on the binarized data of multiple spatial sampling sites.

[0142] In one embodiment, the data processing module is further configured to determine the continuity of at least one hyperedge corresponding to a spatial sampling point in the two-dimensional space where the hypergraph model is located, based on the location information of multiple spatial sampling points; segment the hyperedges whose continuity in the two-dimensional space does not meet the preset conditions, so that the continuity of the spatial sampling points corresponding to each hyperedge after segmentation meets the preset conditions, thereby obtaining an optimized hypergraph model.

[0143] The undirected graph construction module 300 is also used to construct a simple undirected graph based on each hyperedge in the optimized hypergraph model.

[0144] In one embodiment, the undirected graph construction module 300 is further configured to set each hyperedge in the hypergraph model as a point in the simple undirected graph; and to set the weight of the undirected edge between two points corresponding to two hyperedges in the simple undirected graph according to the number of common space sampling points between two hyperedges in the hypergraph model.

[0145] In one embodiment, the clustering device for spatial transcriptome data further includes a determination module for determining the marker gene of each cluster and the cell type corresponding to each cluster based on the gene expression clustering results of each cluster.

[0146] In one embodiment, the determining module is further configured to obtain the marker gene of each cluster by using a preset differential analysis algorithm based on the gene expression clustering results of each cluster; and determine the cell type corresponding to each cluster based on the marker gene of each cluster and the preset correspondence between the marker gene and cell type.

[0147] The above-described device is used to execute the method provided in the foregoing embodiments, and its implementation principle and technical effect are similar, so they will not be described again here.

[0148] These modules can be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASICs), one or more microprocessors, or one or more Field Programmable Gate Arrays (FPGAs). Alternatively, when a module is implemented as program code scheduled by a processing element, the processing element can be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. Furthermore, these modules can be integrated together and implemented as a system-on-a-chip (SoC).

[0149] One embodiment of this application also provides a computer device. Figure 11 is a schematic diagram of the structure of a computer device provided in an embodiment of this application. As shown in Figure 11, the computer device includes a processor 1, a storage medium 2, and a bus 3. The storage medium stores program instructions that can be executed by the processor. When the computer device is running, the processor communicates with the storage medium through the bus, and the processor executes the program instructions to perform the steps of the spatial transcriptome data clustering method provided in the above embodiments.

[0150] An embodiment of this application also provides a computer-readable storage medium storing a computer program, which, when run by a processor, performs the steps of the clustering method for spatial transcriptome data as provided in the above embodiments.

[0151] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.

[0152] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0153] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or in a combination of hardware and software functional units.

[0154] The integrated units implemented as software functional units described above can be stored in a computer-readable storage medium. These software functional units, stored in a storage medium, include several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute some steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0155] The above are merely specific embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A clustering method for spatial transcriptome data, characterized in that it comprises: acquiring spatial transcriptome data of a preset biological sample, wherein the spatial transcriptome data includes gene expression data of the preset biological sample at multiple spatial sampling sites, and the gene expression data of each spatial sampling site includes expression data of multiple genes at each spatial sampling site; constructing a hypergraph model of the preset biological sample based on the gene expression data of the multiple spatial sampling sites, wherein the hypergraph model has at least one hyperedge, each hyperedge corresponding to the expression data of the same gene at more than one spatial sampling site and the spatial continuity information of the sampling sites; constructing a simple undirected graph based on each hyperedge in the hypergraph model, wherein each point in the simple undirected graph corresponds to a hyperedge, and the weight of the undirected edge between points is used to indicate the information of the common space sampling points between the corresponding hyperedges; and clustering the points in the simple undirected graph to obtain gene expression clustering results for at least one point cluster, wherein each point cluster refers to multiple sampling sites corresponding to the same hyperedge; the method further comprising: determining, based on the gene expression clustering results of each of the at least one point cluster, the marker gene of each cluster and the cell type corresponding to each cluster.

2. The method according to claim 1, characterized in that, before constructing the hypergraph model of the preset biological sample based on the gene expression data of the multiple spatial sampling sites, the method further includes: normalizing the gene expression data of the multiple spatial sampling sites so that the total expression level of all genes within the multiple spatial sampling sites is the same; calculating, based on the normalized gene expression data, the variance of the expression level of each gene at the multiple spatial sampling sites; and selecting a predetermined number of target genes from the multiple genes based on the variance of expression levels of the multiple genes at the multiple spatial sampling sites; and the step of constructing a hypergraph model of the preset biological sample based on gene expression data from the multiple spatial sampling sites includes: constructing the hypergraph model based on the expression data of the target genes at the multiple spatial sampling sites.

3. The method according to claim 1, characterized in that the step of constructing a hypergraph model of the preset biological sample based on gene expression data from the multiple spatial sampling sites includes: binarizing the gene expression data of the multiple spatial sampling sites to obtain the binarized data of the multiple spatial sampling sites; and constructing the hypergraph model based on the binarized data of the multiple spatial sampling sites.

4. The method according to claim 1, characterized in that, before constructing a simple undirected graph based on each hyperedge in the hypergraph model, the method further includes: determining, based on the location information of the plurality of spatial sampling points, the continuity of the spatial sampling points corresponding to the at least one hyperedge in the two-dimensional space where the hypergraph model is located; and segmenting the hyperedges whose continuity in the two-dimensional space does not meet the preset condition, so that the continuity of the spatial sampling points corresponding to each segmented hyperedge meets the preset condition, thus obtaining an optimized hypergraph model; and the step of constructing a simple undirected graph based on each hyperedge in the hypergraph model includes: constructing the simple undirected graph based on each hyperedge in the optimized hypergraph model.

5. The method according to claim 1, characterized in that the step of constructing a simple undirected graph based on each hyperedge in the hypergraph model includes: setting each hyperedge in the hypergraph model as a point in the simple undirected graph; and setting, based on the number of common space sampling points between two hyperedges in the hypergraph model, the weight of the undirected edge between the two points corresponding to the two hyperedges in the simple undirected graph.

6. The method according to claim 1, characterized in that the step of determining the marker gene of each cluster and the cell type corresponding to each cluster based on the gene expression clustering results of each cluster includes: obtaining the marker gene of each cluster by using a preset differential analysis algorithm based on the gene expression clustering results of each cluster; and determining the cell type corresponding to each cluster based on the marker genes of each cluster and the preset correspondence between marker genes and cell types.

7. A clustering device for spatial transcriptome data, characterized in that it comprises: an acquisition module, used to acquire spatial transcriptome data of a preset biological sample, wherein the spatial transcriptome data includes gene expression data of the preset biological sample at multiple spatial sampling sites, and the gene expression data of each spatial sampling site includes expression data of multiple genes at each spatial sampling site; a hypergraph model construction module, used to construct a hypergraph model of the preset biological sample based on the gene expression data of the multiple spatial sampling sites, wherein the hypergraph model has at least one hyperedge, and each hyperedge corresponds to the expression data of the same gene at more than one spatial sampling site and the spatial continuity information of the sampling sites; an undirected graph construction module, used to construct a simple undirected graph based on each hyperedge in the hypergraph model, wherein each point in the simple undirected graph corresponds to a hyperedge, and the weight of the undirected edge between points is used to indicate the information of the common space sampling points between the corresponding hyperedges; and a clustering module, used to cluster the points in the simple undirected graph to obtain the gene expression clustering results of at least one point cluster, wherein each point cluster refers to multiple sampling sites corresponding to the same hyperedge; the device further comprising a determination module, used to determine the marker gene of each cluster and the cell type corresponding to each cluster based on the gene expression clustering results of each cluster in the at least one point cluster.

8. A computer device, characterized in that it comprises a processor, a storage medium, and a bus, wherein the storage medium stores program instructions executable by the processor, and when the computer device is running, the processor communicates with the storage medium via the bus, and the processor executes the program instructions to perform the steps of the clustering method for spatial transcriptome data as described in any one of claims 1 to 6.

9. A computer-readable storage medium, characterized in that the storage medium stores a computer program that, when executed by a processor, performs the steps of the clustering method for spatial transcriptome data as described in any one of claims 1 to 6.