Spatial transcriptomic data clustering method and apparatus

By calculating the Pearson correlation coefficient and sequence abundance of sampling points, and combining sequence flow models and graph convolutional neural networks, the problem of UMI count contamination in spatial transcriptome data clustering was solved, improving the accuracy of clustering results and the reliability of downstream analysis.

CN116825205B · Active · Publication Date: 2026-06-12 · BEIJING YUANMA MEDICAL LAB CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING YUANMA MEDICAL LAB CO LTD
Filing Date
2023-05-10
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

The clustering results of existing spatial transcriptome data are inaccurate, mainly due to UMI count contamination, which leads to the mixing of gene information from multiple cells. Existing algorithms fail to make full use of the spatial location information of cells.

Method used

By calculating the Pearson correlation coefficient between each sampling point and its neighboring sampling points, together with the sequence abundance of each sampling point, blank sampling points and tissue sampling points are distinguished. The true expression level is estimated using a sequence flow model. Data denoising and enhancement are performed by combining a graph convolutional neural network with an adjacency matrix. Finally, clustering is performed using a clustering algorithm.

Benefits of technology

The method improves the accuracy of cluster analysis results for spatial transcriptome data, supports the reliability and accuracy of downstream analysis, and makes full use of the spatial location information of sampling points and of gene expression in the microenvironment.

Abstract

The application provides a spatial transcriptome data clustering method and device. The method comprises the following steps: reading spatial transcriptome data from a spatial transcriptome dataset; determining, based on the spatial transcriptome data, the Pearson correlation coefficient between each sampling point and its adjacent sampling points, and distinguishing the types of sampling points based on the Pearson correlation coefficient and the sequence abundance; correcting the expression profile of the tissue sampling points based on a pre-constructed sequence flow model to obtain a denoised gene expression matrix; and clustering the spatial transcriptome data based on the denoised gene expression matrix. By determining the Pearson correlation coefficient between each sampling point and its adjacent sampling points, the method and device make full use of the spatial position information of cells, denoise the spatial transcriptome data, correct the gene expression levels, and improve the accuracy of the clustering analysis result.

Description

Technical Field

[0001] This invention relates to the field of data mining technology, and in particular to a method and apparatus for spatial transcriptome data clustering.

Background Technology

[0002] Spatial transcriptomics technology captures messenger RNA transcripts at each sampling point using microarray chips. Ideally, a unique molecular identifier (UMI) specific to a given point would represent the gene's expression at that point. However, in reality, the information captured at each sampling point does not necessarily indicate gene expression at that point. Messenger RNA flowing from nearby sampling points leads to significant contamination in UMI counts, resulting in a mixture of gene information from multiple cells. Consequently, the clustering results of spatial transcriptomics data are inaccurate.

Summary of the Invention

[0003] This invention provides a method and apparatus for clustering spatial transcriptome data, which solves the technical problem of inaccurate clustering results of spatial transcriptome data in the prior art.

[0004] In a first aspect, the present invention provides a spatial transcriptome data clustering method, comprising:

[0005] Read spatial transcriptome data from the spatial transcriptome dataset;

[0006] Based on the spatial transcriptome data, the Pearson correlation coefficient between each sampling point and its neighboring sampling points, as well as the sequence abundance, are used to distinguish the types of sampling points.

[0007] Based on a pre-built sequence flow model, the expression profile of the tissue sampling points is corrected to obtain a denoised gene expression matrix;

[0008] Spatial transcriptome data are clustered based on the denoised gene expression matrix.

[0009] In some embodiments, determining the type of sampling point based on the Pearson correlation coefficient and sequence abundance between each sampling point and its neighboring sampling points using the spatial transcriptome data includes:

[0010] Calculate the Pearson correlation coefficient between each sampling point and its adjacent sampling points, as well as the average Pearson correlation coefficient for each sampling point, and calculate the sequence abundance for each sampling point;

[0011] Sampling points with an average Pearson correlation coefficient higher than the first threshold and a sequence abundance higher than the second threshold are denoted as tissue sampling points, and other sampling points are denoted as blank sampling points.

[0012] In some embodiments, correcting the expression profile of the tissue sampling points based on a pre-built sequence flow model to obtain a denoised gene expression matrix includes:

[0013] Based on the blank sampling points, determine the relevant parameters of the sequence outflow sampling points;

[0014] Based on the relevant parameters of the sequence efflux sampling points, the gradient descent algorithm corresponding to the pre-constructed sequence flow model is used to estimate the sequence efflux rate of the tissue sampling sites and the size of the affected neighborhood. The expectation-maximization algorithm is then used to estimate the true expression level, resulting in a denoised gene expression matrix.

[0015] In some embodiments, spatial transcriptome data is clustered based on the denoised gene expression matrix, including:

[0016] An adjacency matrix is generated based on the spatial location of the denoised gene expression matrix;

[0017] The denoised gene expression matrix is amplified based on the adjacency matrix to obtain a gene expression enhancement matrix.

[0018] Spatial transcriptome data are clustered based on the adjacency matrix and the gene expression enhancement matrix.

[0019] In some embodiments, the denoised gene expression matrix is amplified based on the adjacency matrix to obtain a gene expression enhancement matrix, including:

[0020] Based on the adjacency matrix and the denoised gene expression matrix, the neighborhood average expression matrix is determined;

[0021] The denoised gene expression matrix and the neighborhood average expression matrix are spliced and amplified to obtain the gene expression enhancement matrix.

[0022] In some embodiments, clustering of spatial transcriptome data based on the adjacency matrix and the gene expression enhancement matrix includes:

[0023] The adjacency matrix and the gene expression enhancement matrix are input into the graph convolutional neural network model to obtain the node embedding matrix output by the graph convolutional neural network model.

[0024] Spatial transcriptome data are clustered based on the node embedding matrix.

[0025] In some embodiments, clustering of spatial transcriptome data based on the node embedding matrix includes:

[0026] Principal component analysis is used to map the node embedding matrix to a low-dimensional space to obtain dimensionality-reduced data.

[0027] Clustering algorithms are used to cluster the dimensionality-reduced data.

[0028] In a second aspect, the present invention also provides a spatial transcriptome data clustering device, comprising:

[0029] The read module is used to read spatial transcriptome data from the spatial transcriptome dataset.

[0030] The sampling point differentiation module is used to determine the Pearson correlation coefficient and sequence abundance of each sampling point with its neighboring sampling points based on the spatial transcriptome data to differentiate the types of sampling points.

[0031] The correction module is used to correct the expression profile of the tissue sampling points based on a pre-built sequence flow model, so as to obtain a denoised gene expression matrix.

[0032] The clustering module is used to cluster spatial transcriptome data based on the denoised gene expression matrix.

[0033] In a third aspect, the present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the spatial transcriptome data clustering method as described above.

[0034] In a fourth aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the spatial transcriptome data clustering method as described above.

[0035] In a fifth aspect, the present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the spatial transcriptome data clustering method as described above.

[0036] By determining the Pearson correlation coefficient between each sampling point and its neighboring sampling points, the spatial transcriptome data clustering method and apparatus provided by this invention make full use of the spatial location information of cells, perform noise reduction on the spatial transcriptome data, remove contaminating gene expression signal, and improve the accuracy of clustering analysis results.

Attached Figure Description

[0037] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0038] Figure 1 is a flowchart illustrating the spatial transcriptome data clustering method provided by the present invention;

[0039] Figure 2 is a schematic diagram illustrating the principle of spatial transcriptome data clustering provided by the present invention;

[0040] Figure 3 is the first schematic diagram of the related matrices provided by the present invention;

[0041] Figure 4 is the second schematic diagram of the related matrices provided by the present invention;

[0042] Figure 5 is the third schematic diagram of the related matrices provided by the present invention;

[0043] Figure 6 is the fourth schematic diagram of the related matrices provided by the present invention;

[0044] Figure 7 is the fifth schematic diagram of the related matrices provided by the present invention;

[0045] Figure 8 is the sixth schematic diagram of the related matrices provided by the present invention;

[0046] Figure 9 is the seventh schematic diagram of the related matrices provided by the present invention;

[0047] Figure 10 is a schematic diagram of the spatial transcriptome data clustering device provided by the present invention;

[0048] Figure 11 is a schematic diagram of the structure of the electronic device provided by the present invention.

Detailed Implementation

[0049] One of the initial problems in biological analysis is cell clustering. Current research mainly relies on gene expression information provided by single-cell sequencing technology for cell clustering. The rise of spatial transcriptomics technology has provided not only gene expression profiles of tissues but also spatial structures. However, existing cell clustering algorithms have not fully utilized the spatial location information of cells, or they rely solely on the fact that cells in adjacent locations tend to be of the same type.

[0050] Spatial transcriptomics technology captures messenger RNA transcripts at each sampling point using microarray chips. Ideally, the gene-specific UMI at a given point would represent the gene's expression at that point. However, in reality, the information captured at each sampling point does not necessarily indicate the gene's expression at that point. Messenger RNA flowing out from nearby sampling points leads to significant contamination in the UMI count, and the gene information comes from a mixture of multiple cells.

[0051] In summary, spatial transcriptomics data suffers from UMI (unique molecular identifier) count contamination, which can severely impact downstream analyses. For example, it may lead to overestimation of low-expression genes and underestimation of high-expression genes, resulting in incorrect cell grouping or bias in differential expression or clustering analyses. In clustering spatial transcriptomics data, algorithms based only on the gene expression of the sampling point itself do not consider gene expression in the surrounding microenvironment and cannot fully utilize the spatial location information of the sampling points. However, spatial information is coupled with cell type; therefore, spatial structure, as an informational feature for improving the clustering of spatial transcriptome data, is of great significance for downstream analysis.

[0052] This invention aims to address the UMI count contamination problem in spatial transcriptomics data through quality control and correction, ensuring the reliability and accuracy of downstream analysis. It combines this correction with a clustering method that integrates the gene expression of each sampling point with gene expression in its microenvironment and with the spatial location of the sampling point. To address spatial transcriptomics data contamination, mathematical algorithms are used for noise reduction. The spatial location of sampling points and gene expression data are extracted from public spatial transcriptomics datasets. Then, a graph neural network is used to learn the characteristics of the graph data structure, thereby performing spatial transcriptomics data clustering. This invention takes into account the spatial transcriptome data contamination problem, as well as the spatial location of sampling points and gene expression in the microenvironment, further improving spatial transcriptomics data clustering and making the clustering results more accurate, which is beneficial for downstream analysis of this data.

[0053] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0054] Figure 1 is a flowchart illustrating the spatial transcriptome data clustering method provided by the present invention. As shown in Figure 1, the present invention provides a spatial transcriptome data clustering method, the method comprising:

[0055] Step 101: Read the spatial transcriptome data from the spatial transcriptome dataset.

[0056] Specifically, the spatial transcriptome dataset is first read to obtain the spatial transcriptome data, which includes at least the original gene count matrix and the metadata for each sampling point.

[0057] Then, a slide object is created based on the original gene count matrix and the metadata for each sampling point. The metadata includes information associated with the original gene counts, such as cancer type, cell type, image information, and spatial location information.
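The slide object of paragraphs [0056] and [0057] can be pictured as a small container that keeps the count matrix aligned with per-point metadata. The following is only an illustrative sketch: the class name `Slide`, its field names, and the consistency checks are assumptions made here, not structures defined by the patent.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Slide:
    """Hypothetical container for one spatial transcriptome slide."""

    counts: np.ndarray   # original gene count matrix, shape (n_points, n_genes)
    coords: np.ndarray   # spatial location of each sampling point, shape (n_points, 2)
    gene_names: list     # one name per column of `counts`
    metadata: dict = field(default_factory=dict)  # e.g. cancer type, cell type, image info

    def __post_init__(self):
        self.counts = np.asarray(self.counts)
        self.coords = np.asarray(self.coords, dtype=float)
        # Rows of the count matrix and of the coordinates must describe the same points.
        if self.counts.shape[0] != self.coords.shape[0]:
            raise ValueError("counts and coords must have one row per sampling point")
        if self.counts.shape[1] != len(self.gene_names):
            raise ValueError("gene_names must have one entry per column of counts")


# Two sampling points, three genes (made-up values for illustration).
slide = Slide(
    counts=[[5, 0, 2], [1, 3, 0]],
    coords=[[0, 0], [0, 1]],
    gene_names=["g1", "g2", "g3"],
    metadata={"cancer_type": "unknown"},
)
```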

[0058] Step 102: Based on the spatial transcriptome data, determine the Pearson correlation coefficient and sequence abundance of each sampling site with neighboring sampling sites, and distinguish the types of sampling sites.

[0059] Specifically, for each sampling point, the neighboring sampling points around it are determined from the spatial location information in the acquired spatial transcriptome data (the slide object), and the Pearson correlation coefficient between the sampling point and each of its neighboring sampling points is calculated to obtain a set of Pearson correlation coefficients. The Pearson correlation coefficient represents the similarity of gene expression between sampling points. The sequence abundance of a sampling point is its total UMI count. Based on the Pearson correlation coefficients and the sequence abundance, blank sampling points are distinguished from tissue sampling points.

[0060] In some embodiments, the Pearson correlation coefficient between each sampling point and its neighboring sampling points is calculated, as well as the average Pearson correlation coefficient corresponding to each sampling point, and the sequence abundance of each sampling point is calculated.

[0061] Sampling points with an average Pearson correlation coefficient higher than the first threshold and a sequence abundance higher than the second threshold are designated as tissue sampling points, while other sampling points are designated as blank sampling points. The first and second thresholds can be set according to the specific circumstances.
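The rule in paragraphs [0060] and [0061] can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `classify_points`, the representation of neighbors as index lists, and the example threshold values are choices made here; the patent leaves both thresholds to be set according to the circumstances. Each point is assumed to have at least one neighbor.

```python
import numpy as np


def classify_points(counts, neighbors, corr_threshold, abundance_threshold):
    """Return a boolean array: True for tissue points, False for blank points.

    counts:    (n_points, n_genes) array of UMI counts.
    neighbors: neighbors[i] lists the indices of the points adjacent to point i.
    """
    counts = np.asarray(counts, dtype=float)
    # Pearson correlation between the expression profiles of every pair of points.
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.nan_to_num(np.corrcoef(counts))  # constant profiles give 0 here
    # Average correlation of each point with its own neighbors.
    mean_corr = np.array([corr[i, nbrs].mean() for i, nbrs in enumerate(neighbors)])
    # Sequence abundance is the total UMI count of the point.
    abundance = counts.sum(axis=1)
    return (mean_corr > corr_threshold) & (abundance > abundance_threshold)


# Made-up data: three similar, well-covered points and one sparse point.
counts = [[10, 20, 30, 5], [12, 18, 33, 4], [11, 22, 28, 6], [1, 0, 0, 2]]
neighbors = [[1, 2], [0, 2], [0, 1], [2]]
is_tissue = classify_points(counts, neighbors, corr_threshold=0.5, abundance_threshold=50)
```

Both conditions must hold for a point to count as tissue, so a point with high correlation but very few UMIs is still treated as blank.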

[0062] Step 103: Based on the pre-constructed sequence flow model, the expression profile of the tissue sampling points is corrected to obtain the denoised gene expression matrix.

[0063] In some embodiments, blank sampling points and tissue sampling points are distinguished based on the Pearson correlation coefficient. The UMI count (gene expression value) of a blank sampling point comes from inflow from nearby tissue sampling points, while the true expression value of a tissue sampling point includes the observed value and a portion of the values that flow in from other sampling points.

[0064] In some embodiments, relevant parameters of the sequence outflow sampling points are determined based on the blank sampling points; the relevant parameters include one or more of the following: total number of sampling points, number of genes, number of tissue sampling points, number of blank sampling points, observed UMI counts of blank sampling points, distances between sampling points, etc.

[0065] Based on the relevant parameters of the sequence outflow sampling points, the outflow rate of the tissue sampling points and the size of the affected neighborhoods are estimated using the gradient descent algorithm corresponding to the pre-built sequence flow model. The true expression level is then estimated by the Expectation Maximization (EM) algorithm, and the denoised gene expression matrix is obtained.

[0066] For example, suppose there are 5 sampling points, a, b, c, d, and f, where a, b, and c are tissue sampling points, and d and f are blank sampling points. q, w, e, r, and s are the observed gene expression levels at the 5 sampling points, respectively. An initial outflow rate, denoted as P, is obtained from the observed values: P is the average gene expression level of all blank sampling points divided by the average gene expression level of all points (blank sampling points plus tissue sampling points). An initial distant contamination rate, denoted as Q, can also be obtained: Q is the average gene expression level of 25%-50% of all blank sampling points divided by the average expression level of all blank sampling points. Furthermore, the weight values between sampling points are obtained using a Gaussian kernel function based on the distance between the sampling points. This yields the relevant parameters of the gradient descent algorithm, including the total number of sampling points, the observed gene expression level at each sampling point, the initial outflow rate, the initial distant contamination rate, and the weight values between sampling points. The outflow rate and distant contamination rate are estimated using the gradient descent algorithm, and then the true expression level is estimated using the EM algorithm.
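Two of the starting quantities in paragraph [0066] are simple enough to sketch directly: the initial rate P and the Gaussian-kernel weights. The code below is only an illustration. The function names, the `bandwidth` parameter, and the exact kernel form exp(-d^2 / (2 * bandwidth^2)) are assumptions, since the patent names the kernel but does not give its formula. Q is not computed here because the text does not say which 25%-50% of blank points are meant, and the gradient descent and EM updates are not specified in enough detail to reproduce.

```python
import numpy as np


def initial_rate_p(observed, is_blank):
    """P = mean observed expression at blank points / mean over all points."""
    observed = np.asarray(observed, dtype=float)
    is_blank = np.asarray(is_blank, dtype=bool)
    return observed[is_blank].mean() / observed.mean()


def gaussian_weights(coords, bandwidth):
    """Pairwise weights from an assumed Gaussian kernel on Euclidean distance."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


# Points a, b, c are tissue; d and f are blank. The observed values are made up.
observed = np.array([8.0, 10.0, 12.0, 2.0, 3.0])       # q, w, e, r, s
is_blank = np.array([False, False, False, True, True])
p0 = initial_rate_p(observed, is_blank)                 # 2.5 / 7.0

weights = gaussian_weights([[0, 0], [3, 4]], bandwidth=5.0)
```

With these made-up values the blank points average 2.5 and all points average 7.0, so P starts at about 0.357; the weight between two points 5 units apart with bandwidth 5 is exp(-0.5).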

[0067] Optionally, in some embodiments, the denoised gene expression matrix can also be normalized and standardized.

[0068] Step 104: Cluster the spatial transcriptome data based on the denoised gene expression matrix.

[0069] Specifically, after determining the denoised gene expression matrix, the spatial transcriptome data are clustered based on the denoised gene expression matrix.

[0070] The spatial transcriptome data clustering method provided by this invention fully utilizes the spatial location information of the sampling points by determining the Pearson correlation coefficient between each sampling point and its neighboring sampling points, performs noise reduction processing on the spatial transcriptome data, corrects the expression profile, and improves the accuracy of the clustering analysis results.

[0071] The spatial transcriptome data clustering method provided by this invention distinguishes blank sampling points from tissue sampling points using the Pearson correlation coefficient corresponding to each sampling point, estimates the outflow rate of the tissue sampling points and the size of the affected neighborhoods based on the gradient descent algorithm, estimates the true expression level using the EM algorithm, and obtains a denoised gene expression matrix, thereby improving the accuracy of clustering analysis results.

[0072] In some embodiments, spatial transcriptome data is clustered based on the denoised gene expression matrix, including:

[0073] An adjacency matrix is generated based on the spatial location of the denoised gene expression matrix;

[0074] The denoised gene expression matrix is amplified based on the adjacency matrix to obtain a gene expression enhancement matrix.

[0075] Spatial transcriptome data are clustered based on the adjacency matrix and the gene expression enhancement matrix.

[0076] Specifically, as shown in Figure 2, in the embodiments of this application the gene expression matrix can also be enhanced so that gene expression in the neighborhood microenvironment is fully considered. This avoids under-using the spatial location information of the sampling points, uses spatial structure as an information feature to improve the clustering of spatial transcriptome data, improves the accuracy of clustering analysis results, and is of importance for downstream analysis.

[0077] First, an adjacency matrix is generated based on the spatial location of the denoised gene expression matrix.

[0078] Spatial nodes are identified in the denoised data (gene expression matrix), and the number of spatial nodes is determined. Each spatial node represents a sampling point; for example, there may be a total of 3798 spatial nodes. After the spatial nodes are identified, the distances between them are calculated: the spatial coordinate vectors are taken as input and the distances between all pairs of spatial nodes are computed, resulting in a distance matrix whose rows and columns represent sampling points. As shown in Figure 3, taking 4 sampling points as an example, the distance between sampling point a and sampling point b is 3 units.

[0079] After the distance matrix is determined, neighboring nodes are selected. Based on the distance matrix, the one or more spatial nodes closest to each target node are chosen as its neighborhood. This constructs a neighborhood graph in which each spatial node forms an edge with each of its neighboring nodes. The graph is represented by an adjacency matrix whose rows and columns are sampling points: entries for node pairs connected by an edge are 1 (each node is treated as closest to itself, so its own entry is also 1), and all other entries are 0. As shown in Figure 4, continuing the example of Figure 3 and taking b as an example, the distance matrix of Figure 3 shows that the two sampling points closest to b are a and c. Sampling points a and c are therefore the neighboring sampling points of b, and the corresponding entries of the adjacency matrix are 1 (the entry for b itself defaults to 1), while the rest are 0.
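The construction in paragraphs [0078] and [0079] can be sketched as a distance matrix followed by a k-nearest-neighbor adjacency matrix. The coordinates below are made up so that the distance between a and b is 3 units, as in the text; the actual values of Figures 3 and 4 are not available here, and the function names and the use of Euclidean distance are assumptions.

```python
import numpy as np


def distance_matrix(coords):
    """Pairwise Euclidean distances; rows and columns are sampling points."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def knn_adjacency(dist, k):
    """Entry (i, j) is 1 if j is i itself or one of the k points closest to i."""
    n = dist.shape[0]
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        order = np.argsort(dist[i], kind="stable")
        nearest = [j for j in order if j != i][:k]
        adj[i, nearest] = 1
        adj[i, i] = 1  # each node is treated as closest to itself
    return adj


# Four sampling points a, b, c, d placed on a line (assumed coordinates).
coords = [[0, 0], [3, 0], [5, 0], [20, 0]]
dist = distance_matrix(coords)
adj = knn_adjacency(dist, k=2)
# Row b of adj is [1, 1, 1, 0]: its two nearest points are a and c.
```

Note that an adjacency matrix built this way need not be symmetric: d lists b and c as neighbors, but neither b nor c lists d.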

[0080] Then, the denoised gene expression matrix is amplified based on the determined adjacency matrix to obtain the gene expression enhancement matrix.

[0081] In some embodiments, the denoised gene expression matrix is amplified based on the adjacency matrix to obtain a gene expression enhancement matrix, including:

[0082] Based on the adjacency matrix and the denoised gene expression matrix, the neighborhood average expression matrix is determined;

[0083] The denoised gene expression matrix and the neighborhood average expression matrix are spliced and amplified to obtain the gene expression enhancement matrix.

[0084] Specifically, continuing the example of Figure 3, the denoised gene expression matrix is as shown in Figure 5, in which the expression value of gene r at sampling point b is 10 units.

[0085] Taking sampling point a as an example, the adjacency matrix of Figure 4 and the denoised gene expression matrix of Figure 5 can be combined to determine the gene expression matrix of the neighboring sampling points of sampling point a (sampling points b and c), as shown in Figure 6.

[0086] After the gene expression matrix of the neighboring sampling points of the target sampling point is determined, the average value of each gene is taken to obtain the neighborhood average expression matrix corresponding to the target sampling point. Taking sampling point a as an example, the neighborhood average expression matrix corresponding to sampling point a is as shown in Figure 7.

[0087] Using the above method, the gene expression matrix of the neighboring sampling points corresponding to each sampling point is determined, and the overall neighborhood average expression matrix is obtained, as shown in Figure 8 (some values are omitted).

[0088] After the overall neighborhood average expression matrix is determined, the denoised gene expression matrix and the neighborhood average expression matrix are spliced and amplified to obtain the gene expression enhancement matrix.

[0089] Specifically, continuing the example of Figure 3, the denoised gene expression matrix of Figure 5 and the neighborhood average expression matrix of Figure 8 are spliced and amplified to obtain the gene expression enhancement matrix, as shown in Figure 9 (some values are omitted).

[0090] Optionally, in some embodiments, during the splicing and amplification process of the denoised gene expression matrix and the neighborhood average expression matrix, a weighted splicing can also be performed. That is, the denoised gene expression matrix and the neighborhood average expression matrix are each multiplied by a weight value before splicing. The weight value can be configured as needed or determined based on data analysis.
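The enhancement of paragraphs [0082] to [0090] amounts to averaging each gene over a point's neighbors and placing the result next to the point's own expression. The sketch below assumes that "splicing" means column-wise concatenation and that a point is excluded from its own neighborhood average, which matches the example where the neighbors of a are b and c. The function names, the weight parameters `w_self` and `w_neigh`, and the expression values are illustrative and are not taken from Figures 5 to 9.

```python
import numpy as np


def neighborhood_average(adj, expr):
    """Mean expression of each gene over a point's neighbors (self excluded)."""
    adj = np.array(adj, dtype=float)
    np.fill_diagonal(adj, 0.0)                 # drop the self entry
    n_neighbors = adj.sum(axis=1, keepdims=True)
    n_neighbors[n_neighbors == 0] = 1.0        # avoid dividing by zero for isolated points
    return adj @ np.asarray(expr, dtype=float) / n_neighbors


def enhance(adj, expr, w_self=1.0, w_neigh=1.0):
    """Concatenate weighted own expression with weighted neighborhood average."""
    expr = np.asarray(expr, dtype=float)
    return np.hstack([w_self * expr, w_neigh * neighborhood_average(adj, expr)])


# Three points and two genes (made-up values).
adj = [[1, 1, 1], [1, 1, 0], [1, 0, 1]]
expr = [[2, 4], [10, 0], [6, 8]]
enhanced = enhance(adj, expr)
# enhanced has shape (3, 4); its first row is [2, 4, 8, 4].
```

The enhanced matrix has twice as many columns as the denoised matrix: the first half describes the point itself and the second half its microenvironment.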

[0091] The spatial transcriptome data clustering method provided by this invention enhances the gene expression matrix, fully considers the gene expression of the neighborhood microenvironment, avoids the failure to fully utilize the spatial location information of the sampling points, and uses spatial structure as an information feature to improve the clustering of spatial transcriptome data, thereby improving the accuracy of the clustering analysis results.

[0092] In some embodiments, clustering of spatial transcriptome data based on the adjacency matrix and the gene expression enhancement matrix includes:

[0093] The adjacency matrix and the gene expression enhancement matrix are input into the graph convolutional neural network model to obtain the node embedding matrix output by the graph convolutional neural network model.

[0094] Spatial transcriptome data are clustered based on the node embedding matrix.

[0095] Specifically, as shown in Figure 2, in the embodiment of this application the adjacency matrix and the feature enhancement matrix (gene expression enhancement matrix) are used as inputs. By performing graph convolution operations on the adjacency matrix and the feature enhancement matrix, the similarities and differences between spatial nodes are learned. Through information transfer and feature updates between spatial nodes over multiple layers of graph convolution, a node embedding representation is obtained. The node embedding output is a matrix in which each row is the embedding vector of one node.
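The patent does not state which graph convolution formula, layer sizes, or training objective it uses. As a sketch of what a multi-layer forward pass of this kind can look like, the code below assumes the widely used symmetric-normalized propagation rule H' = ReLU(D^(-1/2) A D^(-1/2) H W), with the adjacency matrix symmetrized first. The weights are random and untrained, so the output only illustrates shapes and data flow, not a learned embedding.

```python
import numpy as np


def normalize_adjacency(adj):
    """Symmetrize, keep self-loops, and scale as D^(-1/2) A D^(-1/2)."""
    adj = np.asarray(adj, dtype=float)
    adj = np.maximum(adj, adj.T)           # a k-nearest-neighbor graph may be asymmetric
    np.fill_diagonal(adj, 1.0)             # self-loops, as in the adjacency matrix above
    d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_forward(adj, features, weights):
    """Apply one graph convolution per weight matrix; ReLU between layers."""
    a_hat = normalize_adjacency(adj)
    h = np.asarray(features, dtype=float)
    for idx, w in enumerate(weights):
        h = a_hat @ h @ w                  # aggregate neighbors, then transform features
        if idx < len(weights) - 1:
            h = np.maximum(h, 0.0)         # ReLU on hidden layers only
    return h                               # one embedding row per node


rng = np.random.default_rng(0)
adj = np.array([[1, 1, 0], [0, 1, 1], [0, 1, 1]])
features = rng.normal(size=(3, 4))         # stands in for the enhancement matrix
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 2))]
embeddings = gcn_forward(adj, features, weights)   # shape (3, 2)
```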

[0096] Then, the spatial transcriptome data are clustered based on the obtained node embedding matrix.

[0097] The spatial transcriptome data clustering method provided by this invention learns the similarities and differences between spatial nodes through a graph convolutional neural network model, thereby further improving the accuracy of clustering analysis results.

[0098] In some embodiments, clustering of spatial transcriptome data based on the node embedding matrix includes:

[0099] Principal component analysis is used to map the node embedding matrix to a low-dimensional space to obtain dimensionality-reduced data.

[0100] Clustering algorithms are used to cluster the dimensionality-reduced data.

[0101] Specifically, as shown in Figure 2, in this embodiment clustering spatial transcriptome data based on the node embedding matrix may include the following steps:

[0102] First, principal component analysis can be used to map the node embedding matrix to a low-dimensional space, resulting in dimensionality-reduced data.

[0103] Then, clustering algorithms are used to cluster the dimensionality-reduced data.

[0104] For example, the K-means clustering algorithm can be used to perform clustering and the clustering results can be visualized.
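The two steps of paragraphs [0102] to [0104] can be sketched with NumPy alone: principal component analysis through a singular value decomposition, followed by a plain Lloyd-style K-means. The function names, the number of components, the iteration cap, and the random initialization are assumptions for illustration; the patent names the techniques but not their settings.

```python
import numpy as np


def pca(embeddings, n_components):
    """Project centered data onto its leading principal components."""
    x = np.asarray(embeddings, dtype=float)
    x = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:n_components].T


def kmeans(points, k, n_iter=100, seed=0):
    """Basic K-means: alternate nearest-center assignment and center update."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Made-up node embeddings forming two well-separated groups.
embeddings = np.array([
    [0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
    [10.0, 10.0, 10.0], [10.1, 10.0, 10.0], [10.0, 10.1, 10.0],
])
reduced = pca(embeddings, n_components=2)
labels = kmeans(reduced, k=2)
```

In practice a library implementation such as scikit-learn's `PCA` and `KMeans` would typically be used instead of hand-written versions.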

[0105] The spatial transcriptome data clustering method provided by this invention further improves the accuracy of clustering analysis results by performing dimensionality reduction processing on the data.

[0106] The spatial transcriptome data clustering device provided by the present invention is described below. The description of the device below and the description of the method above may be read with reference to each other.

[0107] Figure 10 is a schematic diagram of the spatial transcriptome data clustering device provided by the present invention. As shown in Figure 10, the present invention provides a spatial transcriptome data clustering device, including a reading module 1001, a sampling point differentiation module 1002, a correction module 1003, and a clustering module 1004, wherein:

[0108] The reading module 1001 is used to read spatial transcriptome data from the spatial transcriptome dataset; the sampling point differentiation module 1002 is used to determine the Pearson correlation coefficient and sequence abundance of each sampling point with its neighboring sampling points based on the spatial transcriptome data to differentiate the types of sampling points; the correction module 1003 is used to correct the expression profile of the tissue sampling points based on a pre-constructed sequence flow model to obtain a denoised gene expression matrix; and the clustering module 1004 is used to cluster the spatial transcriptome data based on the denoised gene expression matrix.

[0109] In some embodiments, the sampling point differentiation module is specifically used for:

[0110] Calculate the Pearson correlation coefficient between each sampling point and its adjacent sampling points, as well as the average Pearson correlation coefficient for each sampling point, and calculate the sequence abundance for each sampling point;

[0111] Sampling points with an average Pearson correlation coefficient higher than the first threshold and a sequence abundance higher than the second threshold are denoted as tissue sampling points, and other sampling points are denoted as blank sampling points.
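The differentiation rule above can be sketched as follows, assuming NumPy is available. The function name `classify_spots`, the neighbor lists, the threshold values, and the toy count matrix are hypothetical. Sequence abundance is taken here to be the total count per sampling point, which is an assumption made for illustration and not a definition given in this passage.

```python
import numpy as np

def classify_spots(expr, neighbors, corr_threshold, abundance_threshold):
    """Label each sampling point as 'tissue' or 'blank'.

    expr:      (n_spots, n_genes) count matrix.
    neighbors: neighbors[i] lists the indices of the points adjacent to point i.
    """
    abundance = expr.sum(axis=1)  # assumed definition of sequence abundance
    corr = np.corrcoef(expr)      # pairwise Pearson correlation between rows
    labels = []
    for i, nbrs in enumerate(neighbors):
        # Average Pearson correlation of point i with its adjacent points.
        mean_corr = np.mean([corr[i, j] for j in nbrs]) if nbrs else 0.0
        is_tissue = mean_corr > corr_threshold and abundance[i] > abundance_threshold
        labels.append("tissue" if is_tissue else "blank")
    return labels

# Four sampling points in a row (hypothetical counts for four genes).
expr = np.array([
    [50.0, 5.0, 30.0, 2.0],
    [48.0, 6.0, 33.0, 1.0],
    [3.0, 1.0, 0.0, 2.0],
    [1.0, 2.0, 2.0, 0.0],
])
neighbors = [[1], [0, 2], [1, 3], [2]]
labels = classify_spots(expr, neighbors, corr_threshold=0.5, abundance_threshold=20)
```

Both conditions must hold for a point to be denoted a tissue sampling point; a point that fails either the first threshold or the second threshold is denoted a blank sampling point.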

[0112] In some embodiments, the correction module includes a determining submodule and a correction submodule;

[0113] The determining submodule is used to determine the relevant parameters of the sequence efflux sampling points based on the blank sampling points;

[0114] The correction submodule is used to estimate, based on the relevant parameters of the sequence efflux sampling points, the sequence efflux rate of the tissue sampling points and the size of the affected neighborhood using the gradient descent algorithm corresponding to the pre-built sequence flow model, and to estimate the true expression level using the expectation-maximization algorithm to obtain the denoised gene expression matrix.

[0115] In some embodiments, the clustering module includes a generation submodule, an enhancement submodule, and a clustering submodule;

[0116] The generation submodule is used to generate an adjacency matrix based on the spatial location of the denoised gene expression matrix;

[0117] The enhancement submodule is used to amplify the denoised gene expression matrix based on the adjacency matrix to obtain a gene expression enhancement matrix.

[0118] The clustering submodule is used to cluster spatial transcriptome data based on the adjacency matrix and the gene expression enhancement matrix.
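As an illustration of the generation submodule, the following minimal sketch builds a binary adjacency matrix from spatial coordinates, assuming NumPy is available. The rule used here (two distinct sampling points are adjacent when their Euclidean distance does not exceed a radius) is an assumption; this passage does not state how adjacency is derived from the spatial locations. The function name `build_adjacency`, the radius, and the coordinates are hypothetical.

```python
import numpy as np

def build_adjacency(coords, radius):
    """Binary adjacency: 1 where two distinct points lie within `radius`."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    adjacency = (dist <= radius).astype(float)
    np.fill_diagonal(adjacency, 0.0)  # no self-loops in this sketch
    return adjacency

# Four sampling points on a unit grid (hypothetical coordinates).
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
adjacency = build_adjacency(coords, radius=1.0)
```

A k-nearest-neighbor rule would be an equally plausible choice; the radius rule is used only because it yields a symmetric matrix without further processing.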

[0119] In some embodiments, the enhancement submodule includes a third determining unit and a splicing unit;

[0120] The third determining unit is used to determine the neighborhood average expression matrix based on the adjacency matrix and the denoised gene expression matrix;

[0121] The splicing unit is used to splice and amplify the denoised gene expression matrix and the neighborhood average expression matrix to obtain a gene expression enhancement matrix.
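The two units above can be sketched as follows, assuming NumPy is available. Splicing is read here as column-wise concatenation, so that each sampling point carries its own expression values followed by the average over its adjacent points; the concatenation axis, the function name `augment_expression`, and the toy matrices are assumptions made for illustration.

```python
import numpy as np

def augment_expression(expr, adjacency):
    """Concatenate each point's expression with its neighborhood average."""
    # Number of adjacent points per row; clamp to 1 so isolated points
    # receive a zero neighborhood average instead of a division by zero.
    degree = np.maximum(adjacency.sum(axis=1, keepdims=True), 1.0)
    neighborhood_mean = (adjacency @ expr) / degree
    return np.hstack([expr, neighborhood_mean])

# Three sampling points in a chain 0-1-2 with two genes each (hypothetical).
expr = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
adjacency = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
enhanced = augment_expression(expr, adjacency)
```

Under this reading, the gene expression enhancement matrix has twice as many columns as the denoised gene expression matrix.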

[0122] In some embodiments, the clustering submodule includes a processing unit and a clustering unit;

[0123] The processing unit is used to input the adjacency matrix and the gene expression enhancement matrix into the graph convolutional neural network model to obtain the node embedding matrix output by the graph convolutional neural network model;

[0124] The clustering unit is used to cluster spatial transcriptome data based on the node embedding matrix.
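For orientation, the forward pass of a single graph convolution layer can be sketched as below, assuming NumPy is available. This passage does not specify the network architecture, the number of layers, or the training objective; the propagation rule ReLU(D^-1/2 (A + I) D^-1/2 X W) is the commonly used symmetric normalization and is an assumption here, as are the function name `gcn_layer` and the random, untrained weights.

```python
import numpy as np

def gcn_layer(adjacency, features, weights):
    """One graph convolution layer: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    # Symmetric normalization of the adjacency matrix.
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weights, 0.0)

rng = np.random.default_rng(0)
adjacency = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
features = rng.random((3, 4))        # stands in for the enhancement matrix
weights = rng.normal(size=(4, 2))    # untrained, for shape illustration only
node_embedding = gcn_layer(adjacency, features, weights)
```

The output has one row per sampling point and one column per embedding dimension. A trained model would learn the weights rather than draw them at random.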

[0125] In some embodiments, the clustering unit includes a dimensionality reduction subunit and a clustering subunit;

[0126] The dimensionality reduction subunit is used to map the node embedding matrix to a low-dimensional space through principal component analysis to obtain dimensionality-reduced data.

[0127] The clustering subunit is used to cluster the dimensionality-reduced data using a clustering algorithm.

[0128] Specifically, the spatial transcriptome data clustering device provided in this application embodiment can implement all the method steps implemented in the above spatial transcriptome data clustering method embodiment, and can achieve the same technical effect. Here, the parts that are the same as those in the method embodiment and the beneficial effects will not be described in detail.

[0129] Figure 11 is a schematic structural diagram of the electronic device provided by the present invention. As shown in Figure 11, the electronic device may include: a processor 1110, a communications interface 1120, a memory 1130, and a communication bus 1140, wherein the processor 1110, the communications interface 1120, and the memory 1130 communicate with each other via the communication bus 1140. The processor 1110 can call logical instructions in the memory 1130 to execute a spatial transcriptome data clustering method, which includes:

[0130] Read spatial transcriptome data from the spatial transcriptome dataset;

[0131] Based on the spatial transcriptome data, the Pearson correlation coefficient between each sampling point and its neighboring sampling points, as well as the sequence abundance, are used to distinguish the types of sampling points.

[0132] Based on a pre-built sequence flow model, the expression profile of the tissue sampling points is corrected to obtain a denoised gene expression matrix;

[0133] Spatial transcriptome data are clustered based on the denoised gene expression matrix.

[0134] Furthermore, the logical instructions in the aforementioned memory 1130 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0135] In another aspect, the present invention also provides a computer program product comprising a computer program that can be stored on a non-transitory computer-readable storage medium, wherein, when the computer program is executed by a processor, the computer is capable of executing the spatial transcriptome data clustering method provided by the above embodiments, the method comprising:

[0136] Read spatial transcriptome data from the spatial transcriptome dataset;

[0137] Based on the spatial transcriptome data, the Pearson correlation coefficient between each sampling point and its neighboring sampling points, as well as the sequence abundance, are used to distinguish the types of sampling points.

[0138] Based on a pre-built sequence flow model, the expression profile of the tissue sampling points is corrected to obtain a denoised gene expression matrix;

[0139] Spatial transcriptome data are clustered based on the denoised gene expression matrix.

[0140] In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the spatial transcriptome data clustering method provided by the above embodiments, the method comprising:

[0141] Read spatial transcriptome data from the spatial transcriptome dataset;

[0142] Based on the spatial transcriptome data, the Pearson correlation coefficient between each sampling point and its neighboring sampling points, as well as the sequence abundance, are used to distinguish the types of sampling points.

[0143] Based on a pre-built sequence flow model, the expression profile of the tissue sampling points is corrected to obtain a denoised gene expression matrix;

[0144] Spatial transcriptome data are clustered based on the denoised gene expression matrix.

[0145] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0146] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0147] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method of spatial transcriptome data clustering, the method comprising: reading spatial transcriptome data from a spatial transcriptome dataset; determining, based on the spatial transcriptome data, the Pearson correlation coefficient between each sampling point and its neighboring sampling points, as well as the sequence abundance of each sampling point, to distinguish the types of sampling points, wherein the types of sampling points include blank sampling points and tissue sampling points; correcting the expression profile of the tissue sampling points based on a pre-built sequence flow model to obtain a denoised gene expression matrix; and clustering the spatial transcriptome data based on the denoised gene expression matrix.

2. The spatial transcriptome data clustering method according to claim 1, characterized in that determining, based on the spatial transcriptome data, the Pearson correlation coefficient between each sampling point and its neighboring sampling points, as well as the sequence abundance of each sampling point, to distinguish the types of sampling points includes: calculating the Pearson correlation coefficient between each sampling point and its adjacent sampling points, as well as the average Pearson correlation coefficient for each sampling point, and calculating the sequence abundance for each sampling point; and denoting sampling points with an average Pearson correlation coefficient higher than a first threshold and a sequence abundance higher than a second threshold as tissue sampling points, and denoting the other sampling points as blank sampling points.

3. The spatial transcriptome data clustering method according to claim 2, characterized in that correcting the expression profile of the tissue sampling points based on a pre-built sequence flow model to obtain a denoised gene expression matrix includes: determining, based on the blank sampling points, the relevant parameters of the sequence efflux sampling points; and, based on the relevant parameters of the sequence efflux sampling points, estimating the sequence efflux rate of the tissue sampling points and the size of the affected neighborhood using the gradient descent algorithm corresponding to the pre-built sequence flow model, and estimating the true expression level using the expectation-maximization algorithm, to obtain the denoised gene expression matrix.

4. The spatial transcriptome data clustering method according to claim 1, characterized in that clustering the spatial transcriptome data based on the denoised gene expression matrix includes: generating an adjacency matrix based on the spatial location of the denoised gene expression matrix; amplifying the denoised gene expression matrix based on the adjacency matrix to obtain a gene expression enhancement matrix; and clustering the spatial transcriptome data based on the adjacency matrix and the gene expression enhancement matrix.

5. The spatial transcriptome data clustering method according to claim 4, characterized in that amplifying the denoised gene expression matrix based on the adjacency matrix to obtain a gene expression enhancement matrix includes: determining the neighborhood average expression matrix based on the adjacency matrix and the denoised gene expression matrix; and splicing and amplifying the denoised gene expression matrix and the neighborhood average expression matrix to obtain the gene expression enhancement matrix.

6. The spatial transcriptome data clustering method according to claim 5, characterized in that clustering the spatial transcriptome data based on the adjacency matrix and the gene expression enhancement matrix includes: inputting the adjacency matrix and the gene expression enhancement matrix into the graph convolutional neural network model to obtain the node embedding matrix output by the graph convolutional neural network model; and clustering the spatial transcriptome data based on the node embedding matrix.

7. The spatial transcriptome data clustering method according to claim 6, characterized in that clustering the spatial transcriptome data based on the node embedding matrix includes: mapping the node embedding matrix to a low-dimensional space through principal component analysis to obtain dimensionality-reduced data; and clustering the dimensionality-reduced data using a clustering algorithm.

8. A spatial transcriptome data clustering device, characterized in that it includes: a reading module, used to read spatial transcriptome data from a spatial transcriptome dataset; a sampling point differentiation module, used to determine, based on the spatial transcriptome data, the Pearson correlation coefficient between each sampling point and its neighboring sampling points, as well as the sequence abundance of each sampling point, so as to differentiate the types of sampling points, wherein the types of sampling points include blank sampling points and tissue sampling points; a correction module, used to correct the expression profile of the tissue sampling points based on a pre-built sequence flow model to obtain a denoised gene expression matrix; and a clustering module, used to cluster the spatial transcriptome data based on the denoised gene expression matrix.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the spatial transcriptome data clustering method as described in any one of claims 1 to 7.

10. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the spatial transcriptome data clustering method as described in any one of claims 1 to 7.