A Cell Type Identification Method and System Based on Multi-Omics Decoupling Representation and Graph Embedding
By using multi-omics decoupling representation and graph embedding methods, a sample similarity graph is constructed and similarity information is integrated, which solves the problem that sample similarity is not utilized in the existing technology and achieves higher accuracy in cell type identification and clustering.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CENT SOUTH UNIV
- Filing Date
- 2024-07-08
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies have failed to effectively integrate sample similarity information in single-cell multi-omics data analysis, resulting in insufficient accuracy in cell type identification, especially when cell type boundaries are not obvious or the state changes continuously, making accurate classification difficult.
We employ a method based on multi-omics decoupled representation and graph embedding. We construct sample similarity graphs through graph embedding models, integrate data in the same dimensional space using variational autoencoders and Gaussian mixture models, incorporate similarity information, and achieve cell clustering by minimizing JS divergence optimization models.
It improves the accuracy and robustness of cell type identification, can handle large multi-omics datasets, enhances the model's adaptability to complex data, and provides more accurate cell type identification and clustering results.
Figure CN119027941B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of cell type recognition technology, specifically to a cell type recognition method and system based on multi-omics decoupled representation and graph embedding.
Background Technology
[0002] Single-cell multi-omics sequencing is a revolutionary biotechnology that allows scientists to simultaneously analyze multiple omics information at the single-cell level, including the transcriptome, epigenome, and proteome. This technology has significantly pushed the boundaries of biological research, enabling researchers to gain a more precise understanding of cellular function, developmental processes, and disease mechanisms. Traditional omics research, limited to analyzing mixed cell populations, struggles to obtain in-depth knowledge about individual cellular differences and complex biological processes. Single-cell multi-omics technology effectively addresses this issue, providing more comprehensive and refined biological data.
[0003] Despite the immense potential of single-cell multi-omics technologies, data analysis still faces several challenges. Currently, various cell type identification methods based on multi-omics data have been developed, which can be broadly categorized into three types: matrix factorization-based methods, graph-based methods, and deep learning-based methods. Each method has its advantages; for example, matrix factorization-based methods can extract the main components of omics data, graph-based methods can utilize the similarities between cells to construct networks, and deep learning-based methods can learn predictive models from complex data.
[0004] However, these existing technologies still have some limitations in practical applications. For example, while deep learning-based ensemble methods can handle large-scale datasets and learn complex patterns from them, most of these methods only focus on learning the latent representations of individual samples, failing to fully utilize the similarity information between samples. This neglect of sample similarity can lead to insufficient accuracy in cell type identification, especially when cell type boundaries are not obvious or cell states change continuously, making correct cell classification particularly difficult. Therefore, how to effectively integrate the similarity information between samples in deep learning models to improve the accuracy and robustness of cell type identification has become an urgent problem to be solved.
Summary of the Invention
[0005] To address the shortcomings of existing technologies, this invention provides a cell type identification method and system based on multi-omics decoupled representation and graph embedding, which solves the problems of existing technologies that cannot capture nonlinear biological processes, cannot be extended to large datasets, and do not fully consider sample similarity information.
[0006] To achieve the above objectives, the present invention provides the following technical solution: a cell type identification method based on multi-omics decoupled representation and graph embedding, comprising the following steps:
[0007] Step 1: Acquire single-cell multi-omics data and perform preprocessing;
[0008] Step 2: Obtain the spectral embedding representation of cells using a graph embedding model, calculate single-cell sample similarity based on the spectral embedding representation, and then construct a sample similarity graph;
[0009] Step 3: Use the variational autoencoder model to map different omics data to the same dimensional space for data integration, and obtain a single-cell shared latent representation;
[0010] Step 4: Cluster the cells using a Gaussian mixture model based on the shared latent representation to obtain the cell cluster assignments;
[0011] Step 5: Construct an objective function based on maximum likelihood estimation, and incorporate similarity information by minimizing the JS divergence between the latent representations, and between the cluster assignments, of similar cells;
[0012] Step 6: Simplify the objective function to obtain the final loss function, and use stochastic gradient descent to iteratively update the variational autoencoder model and the Gaussian mixture model.
[0013] Step 7: Use the trained model to obtain the shared latent representation and cluster assignment of single cells.
[0014] Preferably, the training process of the graph embedding model includes the following steps:
[0015] The preprocessed single-cell multi-omics data are concatenated;
[0016] The concatenated data are reduced in dimensionality using an autoencoder model;
[0017] The sample similarity matrix of the dimensionality-reduced data is learned using a Siamese neural network model;
[0018] Based on spectral clustering, a neural network mapping function from the dimensionality-reduced space to the spectral embedding space is learned. The sample similarity is calculated based on the spectral embedding representation, and the sample similarity graph is then obtained.
[0019] Preferably, further, the objective function of the graph embedding model is:
[0020]
[0021] where $Y = [y_1, y_2, \ldots, y_N]$, $N$ is the number of data samples, $x_i$ and $x_j$ are the encoded representations after dimensionality reduction by the encoder, $w(x_i, x_j)$ represents the similarity between samples $x_i$ and $x_j$ learned through the Siamese neural network, and $y_i$ and $y_j$ are the low-dimensional spectral embedding representations of samples $x_i$ and $x_j$.
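The formula of paragraph [0020] is not reproduced in this text. For orientation only, the objective commonly used by SpectralNet-style models is consistent with the symbols defined in [0021]; that the patent uses exactly this expression, including the normalization by $N^2$ and the orthogonality constraint (with the $y_i$ stacked as the rows of $Y$), is an assumption:

```latex
\mathcal{L}_{\text{SpectralNet}}
  = \frac{1}{N^{2}} \sum_{i=1}^{N} \sum_{j=1}^{N}
    w(x_i, x_j)\, \lVert y_i - y_j \rVert^{2},
\qquad \text{subject to } \frac{1}{N}\, Y^{\top} Y = I .
```

Under this form, pairs with a large similarity $w(x_i, x_j)$ are penalized for lying far apart in the embedding, while the constraint rules out the trivial solution in which all $y_i$ coincide.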
[0022] Preferably, in step five, the objective function of the maximum likelihood estimation is:
[0023]
[0024] where $\phi$ and $\theta$ represent the sets of encoder parameters and decoder parameters, respectively, of the variational autoencoder models learned for each omics; $\{x^{(v)}\}$ represents a generic multi-omics variable, $\{x^{(v)}\} := \{x^{(1)}, \ldots, x^{(v)}, \ldots, x^{(V)}\}$, where $x^{(v)}$ is the generic variable of the $v$-th omics.
[0025] Preferably, by incorporating similarity information constraints into the objective function, the objective function becomes:
[0026]
[0027] where $w_{ij}$ (with $\sum_j w_{ij} = 1$) represents the similarity between samples $\{x_i^{(v)}\}$ and $\{x_j^{(v)}\}$ in the sample similarity graph learned through the graph embedding model; $z$ represents the shared latent representation of cells, with $M$ the feature dimension of the latent variable $z$; the cluster assignment of a cell takes one of $K$ values, with $K$ the number of clusters; and $\mathrm{JS}(\cdot,\cdot)$ represents the JS divergence between two distributions.
[0028] Preferably, further, the objective function is simplified and estimated to obtain the final loss function. Then, the variational autoencoder model and the Gaussian mixture model are iteratively updated using stochastic gradient descent. Specifically, the simplified estimation of the objective function includes:
[0029] Introduce a variational upper bound for JS divergence to replace JS divergence;
[0030] Decompose $\ln p_\theta(\{x_i^{(v)}\})$ into two terms related to $\{x_i^{(v)}\}$ and $\{x_j^{(v)}\}$;
[0031] The objective function is estimated and solved to obtain the final loss function.
[0032] A cell type identification system based on multi-omics decoupled representation and graph embedding includes:
[0033] The similarity graph construction module is used to identify similar samples in the data, capture the interactions and similarities between cells, and provide an accurate graph structure representation for subsequent analysis;
[0034] The decoupled representation learning module is used to learn the latent representations of data from different omics in the same dimensional space.
[0035] The fusion module is used to construct omics weight parameters, and obtains cell-shared latent representations by weighting and fusing the latent representations learned from each omics data.
[0036] The clustering module is used to cluster and assign data from multiple omics datasets, and implements clustering and assignment of cell types based on shared latent representations and Gaussian mixture models.
[0037] The pre-training module is used to pre-train the variational autoencoder model to ensure that the variational autoencoder can reconstruct multi-omics data, and to initialize the Gaussian mixture model to ensure that the Gaussian mixture model can capture the category information of the data.
[0038] The training module further optimizes the variational autoencoder and Gaussian mixture model using the similarity graph constructed by the graph embedding model, thereby enabling the integrated utilization of similarity information.
[0039] The identification module is used to identify cell types based on a trained variational autoencoder model and a Gaussian mixture model.
[0040] An electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor.
[0041] A computer-readable storage medium having a computer program stored thereon.
[0042] This invention provides a cell type identification method and system based on multi-omics decoupled representation and graph embedding. It has the following beneficial effects:
[0043] 1. This invention, by combining pre-training and training, rationally integrates similarity information into decoupled representation learning while ensuring the quality of multi-omics data reconstruction. This phased training strategy progressively optimizes each component and provides a powerful multi-omics data clustering solution by integrating these components. This approach not only improves the model's clustering performance but also enhances its adaptability to the complexity of multi-omics data.
[0044] 2. This invention demonstrates superior performance on single-cell multi-omics datasets with varying omics types and quantities, achieving relatively accurate cell type identification. This implies that the invention possesses the capability to integrate most single-cell multi-omics datasets. This capability is significant for processing large and complex single-cell multi-omics datasets, not only enabling the identification of cell types with different omics types and quantities but also providing new directions and possibilities for the integration of single-cell multi-omics data.
Attached Figure Description
[0045] Figure 1 is a flowchart of the method of the present invention;
[0046] Figure 2 is a schematic diagram of the method of the present invention.
Detailed Implementation
[0047] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0048] Example 1:
[0049] Referring to Figure 1 and Figure 2, this invention provides a cell type identification method based on multi-omics decoupled representation and graph embedding, comprising the following steps:
[0050] Step 1: Obtain commonly used multi-omics datasets in the experiment and preprocess them;
[0051] In single-cell multi-omics studies, the first step is to acquire various types of omics data, such as transcriptomics, proteomics, and metabolomics data. These datasets typically originate from public databases or samples sequenced in-house. After data acquisition, necessary preprocessing steps are performed, including but not limited to data cleaning, standardization, and handling of missing values. The purpose of preprocessing is to ensure data quality and provide accurate input for subsequent analyses. For example, in transcriptomics data, gene expression levels may need to be normalized to eliminate systematic biases from experimental procedures.
[0052] Step 2: For the preprocessed multi-omics dataset, the next step is to use the deep learning-based SpectralNet model to learn the mapping of the data to its spectral embedding space. Spectral embedding is a powerful dimensionality reduction technique that can reveal the underlying structure of data, and is particularly suitable for revealing the intrinsic patterns and associations in complex biological data, specifically including:
[0053] Step 2.1: Data Dimensionality Reduction. In the spectral embedding process, the first step is to effectively reduce the dimensionality of the multi-omics data. This can be achieved using an autoencoder model. An autoencoder is an unsupervised learning neural network that learns to encode high-dimensional input data into a low-dimensional hidden layer, and then decodes it back to the original data space from that hidden layer. Through this process, the model learns a compressed representation of the data. Applying an autoencoder to the concatenated multi-omics data not only reduces the dimensionality of the data but also helps capture complex and abstract biological information across different omics levels.
[0054] Step 2.2: Similarity Calculation. Spectral embedding learning requires sample similarity information; therefore, this invention uses a Siamese neural network model to calculate sample similarity. The K-nearest neighbor algorithm is used to identify the k nearest neighbors of each data point and to construct positive and negative sample pairs: each sample forms positive pairs with its k nearest neighbors and negative pairs with another k randomly selected samples. The Siamese neural network is trained by minimizing the contrastive loss $L_{\text{Siamese}}$, which is defined as follows:
[0055]
[0056] where $z_i^{\text{Siamese}}$ is the latent representation learned by the Siamese neural network, and $c$ is the margin, usually set to 1. Based on the learned latent representation $z_i^{\text{Siamese}}$, the Euclidean distance between samples is calculated, and a Gaussian kernel function is used to convert the distance into sample similarity information.
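The contrastive loss formula of paragraph [0055] is not reproduced in this text. The following is a minimal sketch of the pairwise contrastive loss in its commonly used form, assuming that is what is meant: positive pairs are penalized by their squared distance, and negative pairs only while they are closer than the margin. The function names are hypothetical, and plain Python lists stand in for network outputs.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(z_i, z_j, is_positive, margin=1.0):
    """Contrastive loss for one pair of Siamese embeddings.

    Assumed form (not taken from the patent text):
      positive pair: d^2
      negative pair: max(margin - d, 0)^2
    where d is the Euclidean distance between the two embeddings.
    """
    d = euclidean(z_i, z_j)
    if is_positive:
        return d ** 2
    return max(margin - d, 0.0) ** 2


def mean_contrastive_loss(pairs, margin=1.0):
    """Average loss over (z_i, z_j, is_positive) triples."""
    return sum(contrastive_loss(a, b, pos, margin) for a, b, pos in pairs) / len(pairs)
```

With the margin set to 1, as in [0056], a negative pair already separated by a distance of 1 or more contributes nothing to the loss.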
[0057] Step 2.3: Spectral Embedding Mapping Learning. To address the scalability issues of graph-based methods, this invention employs a deep neural network to learn the spectral embedding mapping function. Based on this mapping function, the optimal low-dimensional representation of the spectral embedding, $\{y_n^*\}$, is obtained as follows:
[0058]
[0059] Step 2.4: Similarity Graph Construction. Based on the low-dimensional representation of the spectral embeddings obtained from SpectralNet, we calculate the Euclidean distance between samples. This distance measures the relative position of samples in the low-dimensional space, reflecting the direct similarity or difference between samples. To transform these Euclidean distances into more expressive similarity information, a Gaussian kernel function is applied. This process involves calculating the similarity between each pair of samples, using the exponential decay property of the Gaussian kernel to strengthen the connections between nearest neighbors while weakening the influence between distant samples. Ultimately, this similarity information is used to construct a comprehensive sample similarity graph, which serves as a key data structure in subsequent analyses, supporting more complex data-driven tasks such as clustering and classification.
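Step 2.4 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the bandwidth `sigma`, the exclusion of self-loops, and the function name are all hypothetical. Row normalization is included because paragraph [0027] states that the weights satisfy $\sum_j w_{ij} = 1$.

```python
import math


def similarity_graph(embeddings, sigma=1.0):
    """Row-normalized Gaussian-kernel similarity matrix.

    embeddings: list of equal-length vectors, one per cell
                (for example, spectral embeddings).
    sigma:      kernel bandwidth (hypothetical default).
    Returns a matrix W with W[i][i] == 0 and each non-empty row summing to 1.
    """
    n = len(embeddings)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: no self-loops in this sketch
            sq_dist = sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
            # Exponential decay: near neighbours get weights close to 1,
            # distant samples get weights close to 0.
            W[i][j] = math.exp(-sq_dist / (2.0 * sigma ** 2))
        row_sum = sum(W[i])
        if row_sum > 0:
            W[i] = [w / row_sum for w in W[i]]
    return W
```

Note that row normalization makes the matrix asymmetric in general, so `W[i][j]` and `W[j][i]` can differ even though the underlying kernel is symmetric.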
[0060] Step 3: To integrate different omics data within a unified dimensional space, a variational autoencoder (VAE) model is employed. A VAE is a deep generative model that not only learns compressed representations of data but also captures uncertainty and variability in the data through probability distributions, specifically including:
[0061] Step 3.1: For each omics dataset, train an independent encoder to learn the distribution of that omics data in the low-dimensional latent space. This step allows the model to capture the biological information and variability specific to each omics, ensuring that key information about specific omics is not lost during integration.
[0062] Step 3.2: The latent representations learned from each independent encoder are weighted and fused using a specific strategy. This fusion process takes into account the potentially varying contributions of different omics data; therefore, the weighting coefficients may be chosen based on the intrinsic quality, information content, or importance of the data in biological processes. In this way, a single-cell shared latent representation that comprehensively reflects the characteristics of multiple omics can be obtained.
[0063] Step 3.3: Each omics dataset is equipped with an independent decoder to reconstruct its original data from the shared latent representation. This step verifies the effectiveness and integrity of the shared latent representation, ensuring that the integrated representation can restore the key features of various omics, while also supporting subsequent biological interpretation and analysis.
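The weighted fusion of Step 3.2 can be illustrated with a small sketch. The patent text states only that per-omics latent representations are combined with weighting coefficients; normalizing the weights with a softmax over unconstrained scores is an assumption made here, and `fuse_latents` and `weight_scores` are hypothetical names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_latents(latents, weight_scores):
    """Weighted sum of per-omics latent vectors for one cell.

    latents:       list of V vectors, each of length M (one per omics).
    weight_scores: V unnormalized scores (hypothetical trainable parameters).
    Returns the shared latent vector of length M.
    """
    weights = softmax(weight_scores)  # assumption: weights are positive and sum to 1
    dim = len(latents[0])
    return [sum(w * z[m] for w, z in zip(weights, latents)) for m in range(dim)]
```

For example, with two omics and equal scores, `fuse_latents([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0])` averages the two latent vectors; raising one score shifts the shared representation toward that omics.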
[0064] Step 4: Based on the integrated shared latent representation, cluster the cells using a Gaussian Mixture Model (GMM). A GMM is a probabilistic model that assumes the data consists of a mixture of multiple Gaussian distributions, making it suitable for handling complex biological data with diverse potential subpopulations. The clustering results can reveal the distribution of different cell types, helping researchers understand cellular heterogeneity and the roles of different omics features in cellular function. This step is crucial in multi-omics data analysis, providing a foundation for a deeper understanding of the organizational structure and functional dynamics of biological systems.
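As a minimal sketch of how a Gaussian mixture model yields cluster assignments from a shared latent vector, the code below computes the posterior probability of each component and picks the most probable one. Diagonal covariances are assumed (the text speaks of a mean and a variance per component), and all function and parameter names are hypothetical.

```python
import math


def log_gaussian_diag(z, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at z."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - mu) ** 2 / v)
        for x, mu, v in zip(z, mean, var)
    )


def cluster_posteriors(z, priors, means, variances):
    """Posterior probability of each of the K components given latent z."""
    log_joint = [
        math.log(p) + log_gaussian_diag(z, mu, var)
        for p, mu, var in zip(priors, means, variances)
    ]
    top = max(log_joint)  # subtract the maximum for numerical stability
    unnorm = [math.exp(v - top) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def assign_cluster(z, priors, means, variances):
    """Index of the most probable component for latent z."""
    post = cluster_posteriors(z, priors, means, variances)
    return max(range(len(post)), key=post.__getitem__)
```

Because the assignment is a probability vector rather than a hard label, cells near a boundary between types receive intermediate probabilities, which is what allows a divergence between the assignments of two cells to be computed.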
[0065] Step 5: Construct the objective function based on maximum likelihood estimation, and incorporate similarity information by minimizing the latent representation of similar cells and the JS divergence of cluster assignments to obtain the final objective function:
[0066]
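The JS divergence used as the similarity constraint in Step 5 can be illustrated on discrete distributions, such as the cluster-assignment probabilities of two cells. This is a sketch of the standard definition only; how the patent weights and combines these terms is given by its own objective function, which is not reproduced in this text.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_k == 0 contribute 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2.
    It is symmetric, and with natural logarithms it lies in [0, ln 2].
    """
    m = [(pk + qk) / 2.0 for pk, qk in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Two similar cells with near-identical assignment vectors give a value close to 0, while cells assigned with certainty to different clusters give the maximum, ln 2. For continuous latent distributions the JS divergence generally has no closed form, which is consistent with Step 6.1 replacing it with a variational upper bound.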
[0067] Step 6: Simplify the objective function to obtain the final loss function. Then, use stochastic gradient descent to iteratively update the variational autoencoder model and the Gaussian mixture model. Specifically, this includes:
[0068] Step 6.1: For the objective function (3), introduce the variational upper bound $G(\{x_i^{(v)}\}, \{x_j^{(v)}\})$ of the JS divergence to replace the JS divergence. $G(\{x_i^{(v)}\}, \{x_j^{(v)}\})$ is defined as:
[0069]
[0070] in .
[0071] Step 6.2: $\ln p_\theta(\{x_i^{(v)}\})$ can be decomposed into two terms related to $\{x_i^{(v)}\}$ and $\{x_j^{(v)}\}$:
[0072]
[0073]
[0074] The two terms obtained from the decomposition of $\ln p_\theta(\{x_i^{(v)}\})$ are added together and averaged. Substituting this average into the objective function, the objective function becomes:
[0075]
[0076] Step 6.3: Further simplify and estimate the objective function, where the first term in objective function (7) can be decomposed as follows:
[0077]
[0078] where $\phi_1$ and $\phi_2$ represent the trainable neural network parameters.
[0079] Further simplification and estimation of the terms in formula (8) yields:
[0080]
[0081] where $L$ represents the number of Monte Carlo samples in the stochastic gradient variational Bayes (SGVB) estimator, $m$ represents the number of omics, $D$ represents the feature dimension of $x_i^{(v)}$ and of its reconstruction, $x_i^{(v)}|_d$ is the $d$-th element of $x_i^{(v)}$, the reconstruction is the mean of the omics data reconstructed by the variational autoencoder model, $*|_{ld}$ represents the $d$-th element of the $l$-th sample, and $\lambda^{(v)}$ is a predefined parameter, usually set to 1.
[0082]
[0083] where $K$ is the number of clusters and $M$ is the feature dimension of the four vectors involved: the mean and variance of the $k$-th component of the Gaussian mixture model, and the mean and variance of the shared latent representation of the $i$-th sample; $*|_m$ represents the $m$-th element of $*$.
[0084]
[0085] where $\pi_{ik}$ represents the prior probability of cluster $k$ and satisfies a normalization constraint.
[0086]
[0087] Integrating formulas (9), (10), (11), and (12), formula (8) can be rewritten as:
[0088]
[0089] Similarly, the second term in formula (7) can be estimated as:
[0090]
[0091] To achieve more flexible alignment in the latent space and obtain more accurate cluster assignments, this invention adjusts the coefficients of the alignment loss and clustering loss to α and β, respectively. Therefore, the final loss function is:
[0092]
[0093]
[0094]
[0095] Step 7: Use the trained variational autoencoder model to obtain the shared latent representation of a single cell, and use the trained Gaussian mixture model to obtain the clustering assignment of the single cell based on the shared latent representation.
[0096] Step 8: Evaluate the performance of the method using a clustering task. Three widely used clustering evaluation metrics are employed: Cluster Accuracy (ACC), Normalized Mutual Information (NMI), and Purity. Cluster accuracy represents the proportion of correctly assigned samples out of the total number of samples; Normalized Mutual Information measures the statistical dependence between the predicted and true labels; and Purity represents the proportion of samples that belong to the dominant class of their cluster. All three metrics range from 0 to 1, with higher values indicating better performance.
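Two of the three metrics can be sketched in a few lines. The code below is an illustration, not the evaluation code used for the invention: the text does not say how NMI is normalized, so the arithmetic mean of the two label entropies is assumed here. Cluster accuracy is omitted because it additionally requires an optimal one-to-one matching between clusters and classes (for example, with the Hungarian algorithm).

```python
import math
from collections import Counter


def purity(true_labels, pred_labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    clusters = {}
    for t, p in zip(true_labels, pred_labels):
        clusters.setdefault(p, Counter())[t] += 1
    majority = sum(max(counts.values()) for counts in clusters.values())
    return majority / len(true_labels)


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(true_labels, pred_labels):
    """Mutual information normalized by the mean of the two entropies (assumed)."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    count_t, count_p = Counter(true_labels), Counter(pred_labels)
    mi = sum(
        (c / n) * math.log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in joint.items()
    )
    denom = (entropy(true_labels) + entropy(pred_labels)) / 2.0
    return mi / denom if denom > 0 else 1.0
```

Both functions are invariant to how cluster indices are numbered, so a prediction that swaps the names of two clusters but groups the cells identically still scores 1.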
[0097] Example 2:
[0098] This invention provides a cell type identification system based on multi-omics decoupled representation and graph embedding, comprising:
[0099] The similarity graph construction module identifies similar samples in the data, captures interactions and similarities between cells, and provides an accurate graph structure representation for subsequent analysis. The core function of this module is to identify similar samples in multi-omics data and construct a detailed graph structure representation accordingly. By calculating the similarity between samples (e.g., using a Gaussian kernel function to transform Euclidean distance), this module can effectively capture interactions and similarities between cells, forming a similarity graph. This graph not only reflects the direct relationships between samples but also reveals potential population structures, providing a foundation for subsequent data analysis such as clustering and classification.
[0100] The decoupled representation learning module is used to learn the latent representations of various omics data in the same dimensional space. In this module, independent encoders are used to learn the latent representations of different omics data within the same dimensional space. This decoupled representation learning approach allows each omics data to independently express its information while maintaining the correlation between the data through a unified framework. This is particularly important for handling complex interactions and intrinsic connections between omics data.
[0101] The fusion module is used to construct omics weight parameters. It obtains a cell-shared latent representation by weighted fusion of the latent representations learned from the various omics data. This process involves determining the contribution of each omics to the overall biological characterization, thereby generating a single-cell shared latent representation. This weighted fusion not only considers the uniqueness of each omics but also optimizes the information integrity and expressiveness of the integrated data representation.
[0102] The clustering module is used to cluster and assign data from multiple omics datasets. It achieves cell type clustering based on a shared latent representation and a Gaussian mixture model. The module utilizes the fused shared latent representation and assigns cells to clusters using a Gaussian mixture model. This model assumes the data consists of multiple Gaussian distributions, allowing it to consider the probability distribution of the data in cell type identification, thus more accurately reflecting the natural classification of cells. The clustering results help identify and interpret biologically meaningful cell populations.
[0103] The pre-training module is used to pre-train the variational autoencoder model, ensuring that it can reconstruct multi-omics data. It also initializes the Gaussian mixture model, ensuring that it can capture the class information of the data. The purpose of the pre-training module is to ensure that the variational autoencoder model can effectively reconstruct multi-omics data and to support the initialization of the Gaussian mixture model. Through pre-training, model parameters can be adjusted in advance, ensuring that the model has the basic ability to handle complex data before more detailed training. This step is crucial for improving the model's stability and performance.
[0104] The training module further optimizes the variational autoencoder and Gaussian mixture model using a graph constructed based on a graph embedding model. This enables the integration and utilization of similarity information. Within the training module, similarity graphs constructed based on the graph embedding model are used to further optimize the variational autoencoder and Gaussian mixture model. In this process, the model improves its learning and representation of data structures by integrating and utilizing similarity information, thereby enhancing the accuracy of cell type recognition and the model's generalization ability.
[0105] The identification module identifies cell types based on a trained variational autoencoder (VAE) model and a Gaussian mixture model. This final identification module is the endpoint of the entire analysis process, and its output can be directly used for biological research and clinical applications, supporting disease diagnosis, biomarker discovery, and the development of treatment strategies.
[0106] Example 3:
[0107] The purpose of this embodiment is to provide an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the cell type identification method based on multi-omics decoupled representation and graph embedding as described in Example 1, in order to identify cell types.
[0108] Memory: The electronic device is equipped with high-speed, high-capacity memory to store various omics data (such as transcriptome data, proteome data, etc.) as well as related analysis tools and programs. This memory can handle large datasets from multiple data sources and ensure data security and integrity.
[0109] Processor: The processor in the device is high-performance and capable of handling complex algorithms and models, including but not limited to variational autoencoders and graph embedding models. The processor's main function is to perform computationally intensive tasks, such as data preprocessing, model training, and cell type identification.
[0110] Computer program: The computer program stored in memory is specifically designed for multi-omics data analysis. This program contains multiple modules, including data processing, decoupled representation learning, fusion, and clustering modules, and can run on a processor to achieve automatic cell type identification and classification.
[0111] Example 4:
[0112] The purpose of this embodiment is to provide a computer-readable storage medium:
[0113] It stores a computer program that, when executed by a processor, implements a cell type identification method based on multi-omics decoupled representation and graph embedding, as described in Example 1.
[0114] Computer-readable storage media may be solid-state drives (SSDs), hard disk drives (HDDs), USB flash drives, optical discs, or any other form of digital data storage device. It features high data transfer rates and sufficient storage capacity to store complex computer programs and their required datasets.
[0115] The steps and methods involved in Examples 2 to 4 above correspond to those in Example 1. For specific implementation details, please refer to the relevant description section of Example 1. The term "computer-readable storage medium" should be understood as a single medium or multiple media including one or more instruction sets; it should also be understood as including any medium capable of storing, encoding, or carrying an instruction set for execution by a processor and enabling the processor to perform any of the methods in this invention.
[0116] Those skilled in the art will understand that the modules or steps of the present invention described above can be implemented using general-purpose computer devices. Optionally, they can be implemented using computer-executable program code, thereby allowing them to be stored in a storage device for execution by a computer device, or they can be fabricated as separate integrated circuit modules, or multiple modules or steps can be fabricated as a single integrated circuit module. The present invention is not limited to any particular combination of hardware and software.
[0117] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims
1. A cell type identification method based on multi-omics decoupled representation and graph embedding, characterized in that it comprises the following steps:
Step 1: Acquire single-cell multi-omics data and perform preprocessing;
Step 2: Obtain the spectral embedding representation of the cells using a graph embedding model, calculate the single-cell sample similarity based on the spectral embedding representation, and then construct a sample similarity graph;
Step 3: Use variational autoencoder models to map the different omics data to the same dimensional space for data integration, and obtain a shared latent representation of the single cells;
Step 4: Cluster the cells using a Gaussian mixture model based on the shared latent representation to obtain the cell cluster assignments;
Step 5: Construct an objective function based on maximum likelihood estimation, and incorporate similarity information by minimizing the JS divergence between the latent representations and cluster assignments of similar cells.
The objective function of the graph embedding model is: [formula not reproduced]; where $Y = [y_1, y_2, \dots, y_N]$, $N$ is the number of data samples, $x_i$ and $x_j$ are the encoded representations after dimensionality reduction by the encoder, $w(x_i, x_j)$ is the similarity between samples $x_i$ and $x_j$ given by the similarity matrix learned through the Siamese neural network, and $y_i$ and $y_j$ are the low-dimensional spectral embedding representations of samples $x_i$ and $x_j$;
The objective function for maximum likelihood estimation is: [formula not reproduced]; where $\phi$ and $\theta$ denote the sets of encoder parameters and decoder parameters, respectively, of the variational autoencoder models learned for each omics, $\{x^{(v)}\}$ denotes the general multi-omics variable, $\{x^{(v)}\} := \{x^{(1)}, \dots, x^{(v)}, \dots, x^{(V)}\}$, and $x^{(v)}$ is the general variable of the $v$-th omics;
By adding the similarity information constraint, the objective function becomes: [formula not reproduced]; where $w_{ij}$ (with $\sum_j w_{ij} = 1$) is the similarity between samples $\{x_i^{(v)}\}$ and $\{x_j^{(v)}\}$ in the sample similarity graph learned through the graph embedding model, $z$ is the shared latent representation of the cells, $M$ is the feature dimension of the latent variable $z$, the cluster assignment of a cell takes one of $K$ values, $K$ is the number of clusters, and $\mathrm{JS}(\cdot, \cdot)$ is the JS divergence between two distributions;
Step 6: Simplify the objective function to obtain the final loss function, and use stochastic gradient descent to iteratively update the variational autoencoder model and the Gaussian mixture model;
Step 7: Use the trained model to obtain the shared latent representation and cluster assignment of the single cells.
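The similarity constraint in Step 5 can be illustrated for the cluster-assignment part alone. The following is a minimal sketch, not the claimed objective: it assumes each cell's cluster assignment is a discrete probability vector over K clusters, and the function names `kl`, `js`, and `similarity_penalty` are hypothetical. The corresponding term for the continuous latent representation is not covered here.

```python
import math

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def similarity_penalty(assignments, w):
    """Sum over all pairs (i, j) of w[i][j] * JS(assignments[i], assignments[j]).

    assignments: one probability vector over clusters per cell (assumed form).
    w: row-normalized similarity graph, i.e. each row of w sums to 1.
    """
    n = len(assignments)
    return sum(w[i][j] * js(assignments[i], assignments[j])
               for i in range(n) for j in range(n))

# Three cells, two clusters: cells 0 and 1 agree, cell 2 differs.
q = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]
w_agree = [[0, 1, 0], [1, 0, 0], [0, 1, 0]]     # neighbors mostly agree
w_disagree = [[0, 0, 1], [0, 0, 1], [1, 0, 0]]  # neighbors disagree
print(similarity_penalty(q, w_agree), similarity_penalty(q, w_disagree))
```

The penalty is small when cells linked in the similarity graph receive similar cluster assignments and grows when linked cells are assigned differently, which is the behavior the constraint is meant to encourage.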
2. The cell type identification method based on multi-omics decoupled representation and graph embedding according to claim 1, characterized in that the training process of the graph embedding model includes the following steps: concatenating the preprocessed single-cell multi-omics data; reducing the dimensionality of the concatenated data using an autoencoder model; learning the sample similarity matrix of the dimensionality-reduced data using a Siamese neural network model; learning, based on spectral clustering, a neural network mapping function from the dimensionality-reduced space to the spectral embedding space; and calculating the sample similarity based on the spectral embedding representation to obtain the sample similarity graph.
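The last step of claim 2, turning spectral embeddings into a sample similarity graph, can be sketched as follows. This is a minimal illustration under stated assumptions: the claim does not specify the similarity function, so the Gaussian kernel, the k-nearest-neighbor sparsification, and the parameters `k` and `sigma` are all hypothetical choices, as is the function name `similarity_graph`. Only the row normalization (each row summing to 1) comes from claim 1.

```python
import math

def similarity_graph(Y, k=2, sigma=1.0):
    """Build a row-normalized k-nearest-neighbor similarity graph.

    Y: list of spectral embedding vectors, one per cell.
    Returns W with W[i][j] >= 0 and sum_j W[i][j] == 1 for every row i.
    Gaussian kernel and kNN sparsification are illustrative assumptions.
    """
    n = len(Y)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Squared Euclidean distance from cell i to every cell.
        d2 = [sum((a - b) ** 2 for a, b in zip(Y[i], Y[j])) for j in range(n)]
        # Indices of the k nearest neighbors, excluding the cell itself.
        nbrs = sorted((j for j in range(n) if j != i), key=lambda j: d2[j])[:k]
        for j in nbrs:
            W[i][j] = math.exp(-d2[j] / (2 * sigma ** 2))
        s = sum(W[i])
        if s == 0.0:
            # Kernel underflow: fall back to uniform weights over neighbors.
            for j in nbrs:
                W[i][j] = 1.0 / len(nbrs)
        else:
            W[i] = [v / s for v in W[i]]
    return W

# Two well-separated pairs of cells in a 2-D embedding space.
Y = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
for row in similarity_graph(Y, k=1):
    print(row)
```

With `k=1`, each cell is linked only to its closest neighbor in the embedding space, so the two pairs form separate components of the graph.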
3. The cell type identification method based on multi-omics decoupled representation and graph embedding according to claim 1, characterized in that the objective function is simplified and estimated to obtain the final loss function, and the variational autoencoder model and the Gaussian mixture model are then iteratively updated using stochastic gradient descent, wherein the simplified estimation of the objective function specifically includes: introducing a variational upper bound on the JS divergence to replace the JS divergence; decomposing $\ln p_\theta(\{x_i^{(v)}\})$ into two terms related to $\{x_i^{(v)}\}$ and $\{x_j^{(v)}\}$; and estimating and solving the objective function to obtain the final loss function.
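Claim 3 does not state which variational upper bound on the JS divergence is used. As one standard example, offered only as an assumption about the kind of bound intended, any auxiliary distribution $r$ yields an upper bound on $\mathrm{JS}(p, q)$ that is tight when $r$ equals the midpoint $m$ of $p$ and $q$. The symbols $p$, $q$, $r$, and $m$ are introduced here for illustration and do not appear in the claims.

```latex
\begin{aligned}
\mathrm{JS}(p, q) &= \tfrac12\,\mathrm{KL}(p \,\|\, m) + \tfrac12\,\mathrm{KL}(q \,\|\, m),
  \qquad m = \tfrac12 (p + q), \\
\tfrac12\,\mathrm{KL}(p \,\|\, r) + \tfrac12\,\mathrm{KL}(q \,\|\, r)
  &= \mathrm{JS}(p, q) + \mathrm{KL}(m \,\|\, r) \;\ge\; \mathrm{JS}(p, q).
\end{aligned}
```

The second line follows by writing $\ln(p/r) = \ln(p/m) + \ln(m/r)$ in each KL term; since $\mathrm{KL}(m \,\|\, r) \ge 0$, minimizing the left-hand side over $r$ recovers the JS divergence. A bound of this form replaces the mixture $m$, which is awkward to evaluate, with a tractable distribution $r$.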
4. A cell type identification system based on multi-omics decoupled representation and graph embedding, characterized in that it is applied to the cell type identification method based on multi-omics decoupled representation and graph embedding according to any one of claims 1-3, and comprises:
a similarity graph construction module, used to identify similar samples in the data, capture the interactions and similarities between cells, and provide an accurate graph structure representation for subsequent analysis;
a decoupled representation learning module, used to learn the latent representations of the data from different omics in the same dimensional space;
a fusion module, used to construct omics weight parameters and obtain the shared latent representation of the cells by weighting and fusing the latent representations learned from each omics data;
a clustering module, used to cluster the multi-omics data and assign cell types based on the shared latent representation and the Gaussian mixture model;
a pre-training module, used to pre-train the variational autoencoder model so that it can reconstruct the multi-omics data, and to initialize the Gaussian mixture model so that it can capture the category information of the data;
a training module, used to further optimize the variational autoencoder model and the Gaussian mixture model based on the graph embedding model, thereby enabling the integrated utilization of similarity information;
an identification module, used to identify cell types based on the trained variational autoencoder model and Gaussian mixture model.
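The fusion and clustering modules of claim 4 can be sketched together. This is a minimal illustration, not the claimed implementation: it assumes the omics weight parameters are normalized with a softmax and that the Gaussian mixture components have diagonal covariance, neither of which is stated in the claims. The function names `softmax`, `fuse`, and `gmm_responsibilities` are hypothetical.

```python
import math

def softmax(a):
    """Numerically stable softmax over a list of numbers."""
    m = max(a)
    e = [math.exp(x - m) for x in a]
    s = sum(e)
    return [x / s for x in e]

def fuse(latents, logits):
    """Weighted fusion of per-omics latent vectors into a shared one.

    latents: V vectors of equal length M, one per omics.
    logits: V unnormalized omics weights (softmax is an assumption).
    """
    w = softmax(logits)
    dim = len(latents[0])
    return [sum(w[v] * latents[v][m] for v in range(len(latents)))
            for m in range(dim)]

def gmm_responsibilities(z, pis, mus, variances):
    """Posterior probability of each cluster given z.

    Assumes a Gaussian mixture with diagonal covariance; the posterior
    is the softmax of log(pi_k) + log N(z | mu_k, diag(var_k)).
    """
    log_joint = []
    for pi_k, mu, var in zip(pis, mus, variances):
        ll = math.log(pi_k)
        for zm, mm, vm in zip(z, mu, var):
            ll += -0.5 * (math.log(2 * math.pi * vm) + (zm - mm) ** 2 / vm)
        log_joint.append(ll)
    return softmax(log_joint)

# Two omics with equal weights, two clusters centered at (2, 2) and (-2, -2).
z = fuse([[1.0, 1.0], [3.0, 3.0]], logits=[0.0, 0.0])
resp = gmm_responsibilities(z, pis=[0.5, 0.5],
                            mus=[[2.0, 2.0], [-2.0, -2.0]],
                            variances=[[1.0, 1.0], [1.0, 1.0]])
cluster = max(range(len(resp)), key=lambda k: resp[k])
print(z, resp, cluster)
```

Taking the cluster with the largest posterior probability gives a hard cluster assignment for the cell, while the full posterior vector can serve as the soft assignment used during training.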
5. An electronic device, characterized in that it is applied to the cell type identification method based on multi-omics decoupled representation and graph embedding according to any one of claims 1-3, and comprises a memory, a processor, and a computer program stored in the memory and executable on the processor.
6. A computer-readable storage medium, characterized in that it is applied to the electronic device according to claim 5 and has a computer program stored thereon.