A method for integrating and clustering single-cell multi-omics data
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NORTHEAST FORESTRY UNIV
- Filing Date
- 2025-05-30
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies struggle to effectively integrate single-cell multi-omics data, and traditional methods fail to preserve the biological characteristics of multi-omics data, leading to information loss or noise amplification, as well as inaccurate cross-modal semantic alignment.
The scECDA method is adopted, which uses a modular design of deep learning, including multi-omics feature extraction, latent feature alignment and fusion, and utilizes contrastive learning strategy and differential attention mechanism to achieve semantic alignment and clustering analysis of cross-omics latent representations.
It significantly improves the integration effect of multi-omics data, enhances clustering accuracy and robustness, effectively handles datasets of different types and qualities, reduces noise interference, and enhances the consistency of feature representation.
Figure CN122245431A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to a method for integrating and clustering single-cell multi-omics data, specifically a method for integrating single-cell multi-omics data and cell clustering based on deep learning, belonging to the field of bioinformatics.
Background Technology
[0002] In recent years, the rapid development of single-cell multi-omics sequencing technologies (such as 10X Multiome, CITE-seq, TEA-seq, etc.) has made it possible to simultaneously measure transcriptome, epigenetics, proteome, and other data of the same cell, providing unprecedented opportunities for in-depth analysis of cellular heterogeneity, cell state transitions, and regulatory mechanisms. However, effectively integrating multi-omics data and performing accurate cell type annotation still faces many challenges.
[0003] Different omics data (such as RNA-seq, ATAC-seq, and proteomics) have different data distributions, sparsity, and noise patterns. Traditional dimensionality reduction methods (such as PCA and t-SNE) struggle to retain the biological characteristics of multiple omics data at the same time, resulting in information loss or noise amplification in the integrated low-dimensional representation.
[0004] Different omics data measure information at different levels of the cell (such as gene expression, chromatin accessibility, and surface proteins), and their latent feature spaces often differ significantly. Existing methods (such as canonical correlation analysis (CCA) and multi-omics factor analysis (MOFA)) usually assume linear relationships or rely on prior distribution alignment, making it difficult to effectively capture complex nonlinear associations, resulting in inaccurate cross-modal semantic alignment.
[0005] Therefore, there is an urgent need for a method for integrating and clustering single-cell multi-omics data that solves the above-mentioned technical problems.
Summary of the Invention
[0006] To address the challenges of integrating multimodal data in traditional single-cell multi-omics sequencing technologies, this invention provides a brief overview below to offer a basic understanding of certain aspects of the invention. It should be understood that this overview is not an exhaustive summary of the invention. It is not intended to identify key or essential parts of the invention, nor is it intended to limit the scope of the invention.
[0007] The technical solution of the present invention:
[0008] A method for generating single-cell sequencing data, characterized by comprising the following steps:
[0009] S1. Acquire single-cell multi-omics data generated by technologies such as 10X Multiome, CITE-seq, TEA-seq, scRNA-seq, scATAC-seq, and scADT-seq.
[0010] S2. Preprocess single-cell multi-omics data for subsequent analysis and data manipulation.
[0011] S3. The overall computational process of the scECDA method adopts a modular design, comprising three core stages: multi-omics feature extraction, latent feature alignment and fusion, and cluster analysis. First, the method extracts omics-specific low-dimensional latent representations from the original single-cell multi-omics data by constructing an independent deep encoder framework, so as to preserve the specificity of each omics data.
[0012] Subsequently, to eliminate noise interference and enhance the consistency of feature representations, the extracted latent representations are smoothed. In the latent feature alignment stage, scECDA introduces a contrastive learning strategy to achieve semantic alignment of cross-omics latent representations by optimizing the distance metric between positive and negative sample pairs.
[0013] To further improve the integration effect of multi-omics data, this method introduces a differential attention mechanism and designs a feature fusion module. This module can adaptively weigh the contribution weight of each omics feature, thereby generating a more discriminative unified latent representation.
[0014] Finally, unsupervised clustering analysis is performed on the fused latent representations, completing the integration and annotation process of single-cell multi-omics data.
[0015] Preferred method: scECDA was compared with the six existing mainstream methods: TriTan, Mowgli, scDMSC, scMCs, K-means, and scRISE.
[0016] Among them, TriTan and Mowgli are based on the nonnegative matrix factorization (NMF) framework and can effectively integrate three or more omics data; scDMSC and scMCs adopt the methods of multi-view subspace learning and optimization of subspace cluster centers, respectively, which are suitable for the integration of dual-omics data; while scRISE based on graph autoencoders and the traditional K-means algorithm only support clustering analysis of single-omics data.
[0017] Experimental results show that scECDA demonstrates significant advantages in processing data of different types and dimensions.
[0018] Preferred: To test the clustering performance of the model on datasets of different quality and size, we selected six datasets for evaluation: Inhouse, BMNC, 10x, pbmc10k, Chen, and Mouse Brain (as shown in Table 1).
[0019] Among these datasets, Chen and Mouse Brain are paired RNA and ATAC datasets, while the remaining datasets are paired RNA and ADT datasets.
[0020] Furthermore, among the selected datasets, Inhouse and pbmc10k contain fewer cell types, while BMNC and Chen contain more cell types; Chen, Mouse Brain, and PBMC 10x have high sparsity rates; Inhouse and Mouse Brain have fewer cells and thus serve as small datasets, while BMNC and Chen have relatively more cells and thus serve as large datasets.
[0021] Because these datasets have different characteristics, selecting them helps to comprehensively compare and analyze the performance of different methods.
[0022] This invention uses four metrics—ARI (Adjusted Rand Index), NMI (Normalized Mutual Information), ACC (Accuracy), and PUR (Purity)—to evaluate the accuracy of cell clustering.
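As an illustration of how two of these metrics can be computed, the following is a minimal sketch, assuming the true cell types and the predicted cluster labels are available as integer arrays. The function names `contingency`, `purity`, and `adjusted_rand_index` are hypothetical helpers, not part of the described method; NMI and ACC are omitted for brevity.

```python
import numpy as np
from math import comb

def contingency(true_labels, pred_labels):
    """Count matrix: rows are true classes, columns are predicted clusters."""
    _, t = np.unique(true_labels, return_inverse=True)
    _, p = np.unique(pred_labels, return_inverse=True)
    table = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(table, (t, p), 1)
    return table

def purity(true_labels, pred_labels):
    """PUR: fraction of cells that belong to the majority class of their cluster."""
    table = contingency(true_labels, pred_labels)
    return table.max(axis=0).sum() / table.sum()

def adjusted_rand_index(true_labels, pred_labels):
    """ARI computed from pair counts of the contingency table."""
    table = contingency(true_labels, pred_labels)
    n = int(table.sum())
    sum_cells = sum(comb(int(v), 2) for v in table.ravel())
    sum_rows = sum(comb(int(v), 2) for v in table.sum(axis=1))
    sum_cols = sum(comb(int(v), 2) for v in table.sum(axis=0))
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Toy example: 8 cells, 3 cell types, one cell of type 1 placed in the wrong cluster.
truth = np.array([0, 0, 0, 1, 1, 1, 2, 2])
clusters = np.array([2, 2, 2, 0, 0, 1, 1, 1])
pur = purity(truth, clusters)
ari = adjusted_rand_index(truth, clusters)
```

Both metrics are invariant to how the cluster IDs are numbered, which is why they suit unsupervised clustering where cluster indices carry no meaning.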
[0023] To ensure fairness in comparing different methods, in all comparative experiments in this paper, we uniformly set the number of clusters for all methods equal to the actual number of cell types. The results are presented in Figure 2.
[0024] With K-means as the benchmark method, the results show that the scECDA method achieved good scores on all datasets.
[0025] Especially on the BMNC and Inhouse datasets, the scores for every metric are the highest:
[0026] For example, on the Inhouse dataset (Figure 2a), scECDA achieved scores of 0.9674, 0.9407, 0.9755, and 0.9941 on the ARI, NMI, ACC, and PUR indicators, respectively, which are significantly better than other methods.
[0027] On the PBMC10K dataset (Figure 2c), scECDA also achieved the highest scores on the three indicators of ARI, NMI, and ACC, and its PUR score differed from Mowgli's by only about 0.0015.
[0028] This invention reveals the quality differences between different datasets by using the K-means algorithm to independently cluster RNA and ATAC omics data.
[0029] As shown in Figure 2, the RNA and ATAC data of the PBMC10k dataset are both of good quality (all metrics > 0.52), while the ATAC data in the Mouse Brain (Figure 2b) and Chen (Fig. S1a) datasets are of significantly lower quality (ARI / NMI < 0.1), indicating that the latter have significant background noise interference.
[0030] In high-quality data scenarios, TriTan, scECDA, and Mowgli demonstrate good overall performance on the PBMC10k dataset.
[0031] It is worth noting that scECDA shows a significant advantage in low-quality ATAC data scenarios: compared with the K-means benchmark, all four key metrics of the Mouse Brain dataset are improved, with the improvement of biologically related metrics such as ARI (0.0973↑) and ACC (0.1518↑) being particularly significant.
[0032] In contrast, other methods generally exhibited performance degradation in such data, with TriTan showing the most significant decline, exhibiting a precipitous drop in metrics such as ARI (0.35↓) and PUR (0.34↓). This phenomenon reveals a common limitation of existing methods (including TriTan, scMCs, and Mowgli) in terms of noise resistance: when processing data with high-dimensional sparsity characteristics such as scATAC-seq, the feature fusion strategies of these methods struggle to effectively distinguish between real biological signals and sequencing noise.
[0033] It is worth noting that, compared with other methods, scECDA effectively reduces the impact of noise on clustering results by smoothing latent features, aligning cross-omics features, and introducing a differential attention mechanism.
[0034] Furthermore, scECDA achieved the highest comprehensive score across four evaluation metrics in various scenarios, including the large dataset BMNC, the small datasets Mouse Brain and Inhouse, the datasets with many cell types BMNC and Chen, the datasets with few cell types PBMC 10k and Inhouse, and the datasets with high sparsity Chen and MouseBrain. This further demonstrates that scECDA has strong robustness and adaptability, which is attributed to the fact that the model minimizes the introduction of data noise in its designed modules.
[0035] Preferred: scECDA can not only fuse two single-cell omics data, but also process data from three omics, and can perform cluster analysis even when there is only one single-cell omics.
[0036] To evaluate the performance of the scECDA method on single-omics and multi-omics data, this invention still uses four metrics—ARI, NMI, ACC, and PUR—to assess the accuracy of cell clustering.
[0037] We selected Chen's RNA omics data, Mouse Brain's RNA omics data, tea's RNA omics data, and three omics datasets from tea as our test dataset.
[0038] Figures 3(a) and 3(b) show the evaluation results of scECDA on single-omics datasets. On the Chen and Mouse Brain datasets, scECDA achieved the highest scores on every metric, significantly outperforming scRISE and K-means. Specifically, in the Chen dataset evaluation, scECDA outperformed scRISE by approximately 0.1124, 0.171, 0.14, and 0.268 on the ARI, NMI, ACC, and PUR metrics, respectively; in the Mouse Brain dataset evaluation, scECDA outperformed scRISE by approximately 0.134, 0.06, 0.092, and 0.13 on these four metrics, respectively.
[0039] Figure 3 (c) shows the evaluation results of scECDA on the tea dataset, which contains three omics data. It can be seen that scECDA has the highest overall score on this dataset, followed by TriTan.
[0040] Although TriTan slightly outperformed scECDA in the NMI score (by approximately 0.0131), scECDA surpassed TriTan by approximately 0.0643, 0.1015, and 0.1173 in the ARI, ACC, and PUR scores, respectively. In RNA omics data, scECDA demonstrated the best performance across all three key indicators: ARI, ACC, and PUR.
[0041] Especially in ACC, a core metric for measuring clustering accuracy, scECDA significantly outperforms scRISE (by approximately 0.1054) and K-means (by approximately 0.18).
[0042] Although scRISE is slightly higher than scECDA (approximately 0.0311) in the NMI index, scECDA performs best overall among these four important evaluation criteria.
[0043] Through the above comparative analysis, scECDA can achieve more accurate results in clustering tasks, whether on single-omics data or three-omics data.
[0044] Preferred: In the design of the model, this invention pairs each omics dataset with an autoencoder, primarily for dimensionality reduction and noise reduction, preserving the most important features of each omics dataset. The obtained latent features can be represented as:
[0045] $Z_u = E_u(X_u)$ (1)
[0046] where $E_u$ denotes the encoder for the u-th omics data together with its parameters. The encoder reduces the input data dimension from $m_u$ to $d_u$. The reconstruction $\hat{X}_u$ is then obtained through the decoder:
[0047] $\hat{X}_u = D_u(Z_u)$ (2)
[0048] where $D_u$ denotes the decoder together with its parameters. Over multiple iterations, the autoencoder minimizes the mean squared error (MSE) loss between $X_u$ and $\hat{X}_u$:
[0049] $L_{MSE}^{u} = \|X_u - \hat{X}_u\|_2^2$ (3)
[0050] $L_{stage1} = \sum_{u} L_{MSE}^{u}$ (4)
[0051] where $\|\cdot\|_2$ is the $\ell_2$ norm.
[0052] The Student's t-distribution is used to apply a spatial transformation to the latent features $Z_u$:
[0053]
[0054] Specifically, the K-means algorithm is first used to cluster the latent features of the u-th omics data and obtain the corresponding cluster centers, which serve as the reference for the feature transformation. This t-distribution-based transformation strategy can enhance the stability of the feature representation, thereby improving the accuracy and reliability of subsequent analysis.
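The t-distribution-based transformation can be sketched as follows. Formula (5) is not reproduced in this text, so the kernel below is an assumption: it follows the Student's t soft-assignment form commonly used in deep embedded clustering, with the degrees of freedom `alpha` fixed at 1. The function name `student_t_assignment` and the toy data are hypothetical.

```python
import numpy as np

def student_t_assignment(z, centers, alpha=1.0):
    """Soft assignment of each latent vector to each cluster center.

    z:       (n_cells, d) latent features of one omics layer
    centers: (n_clusters, d) cluster centers, e.g. obtained with K-means
    alpha:   degrees of freedom of the t-kernel (assumed to be 1.0 here)
    """
    # Squared Euclidean distance between every cell and every center.
    sq_dist = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    kernel = (1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0)
    # Normalize per cell so that each row is a probability distribution.
    return kernel / kernel.sum(axis=1, keepdims=True)

# Toy latent features: four cells near each of two well-separated centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
z = np.vstack([centers[0] + 0.1 * rng.standard_normal((4, 2)),
               centers[1] + 0.1 * rng.standard_normal((4, 2))])
q = student_t_assignment(z, centers)
```

The heavy tails of the t-kernel keep distant cells from collapsing to zero probability, which is one plausible reading of the stability benefit described above.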
[0055] Specific Implementation Method Six: The multi-omics feature extraction stage adopts a dual-encoder architecture, consisting of the original encoder $E_u$ and an auxiliary encoder with the same structure as $E_u$. Both use exactly the same neural network architecture and share parameters. To achieve feature perturbation, the auxiliary encoder adds a Dropout layer to the original structure and injects a small perturbation ε to simulate different sequencing depths in real-world scenarios. The specific process is as follows:
[0056]
[0057] The set of positive sample pairs constructed for cell i using the augmented data is as follows:
[0058]
[0059] where <·,·> represents a sample pair and ∪ is the union operation. Within the same batch, the latent features of every other cell k (k ≠ i) and their associated positive samples are treated as negative samples of cell i. Each batch contains b cells:
[0060]
[0061] The distance between cell i and its positive samples in the latent space is calculated as:
[0062]
[0063] where $i^+$ is a positive sample of cell i, and τ is the adjustment coefficient, with a default value of 1.0.
[0064] The distance between cell i and all of its negative samples in the latent space is calculated as:
[0065]
[0066] where $i^-$ is a negative sample of cell i. The contrastive loss can be expressed as:
[0067]
[0068] To avoid grouping all cells into the same cluster, a cross-entropy regularization loss is introduced:
[0069]
[0070] The contrastive loss function with regularization is:
[0071] $L_{stage2} = L_{contrastive} + L_{entropy}$ (15)
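The contrastive term can be illustrated with a minimal sketch. The exact distance formulas are not reproduced in this text, so the version below is an assumption: a generic InfoNCE-style loss over cosine similarities, in which row i of the two inputs forms the positive pair for cell i and all other rows in the batch act as negatives. The function name `contrastive_alignment_loss` is hypothetical, and the entropy regularization term is not included.

```python
import numpy as np

def contrastive_alignment_loss(z_a, z_b, tau=1.0):
    """InfoNCE-style loss between two views of the same batch of cells.

    z_a, z_b: (b, d) latent features; row i of each is the same cell.
    tau:      adjustment coefficient (default 1.0, as stated in the text).
    """
    a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = a @ b.T / tau                               # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal; negatives are the off-diagonal entries.
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
z_rna = rng.standard_normal((32, 16))
z_adt = z_rna + 0.05 * rng.standard_normal((32, 16))   # second view, nearly aligned
aligned = contrastive_alignment_loss(z_rna, z_adt)
mismatched = contrastive_alignment_loss(z_rna, np.roll(z_adt, 1, axis=0))
```

When the rows of the second view are shifted so that each cell is paired with a different cell, the loss rises, which is the behavior an alignment objective of this kind relies on.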
[0072] Fusing the latent features of the omics data: first, the latent features of each omics obtained in formula (1) are concatenated:
[0073] $Z = [Z_1, Z_2, \ldots, Z_u]$ (16)
[0074] in
[0075] Projecting Z onto different feature spaces:
[0076] $[G_1; G_2] = ZW_1,\quad [K_1; K_2] = ZW_2,\quad Z' = ZW_3$ (17)
[0077] where $W_1$, $W_2$, and $W_3$ are all parameter matrices, with a default value of $2f = 256$.
[0078] According to the differential attention calculation formula, we can obtain:
[0079]
[0080] Where λ is a learnable scalar:
[0081]
[0082] where the remaining terms are all learnable vectors, and $\lambda_0 \in (0, 10)$ is a constant used to initialize λ, set to $\lambda_0 = 5$. Finally, the fused feature H is obtained after normalization, residual connection, and linear transformation:
[0083] $O' = \mathrm{Norm}(Z + O)$ (20)
[0084] $H = \mathrm{Norm}(O'W_4 + O')$ (21)
[0085] where $W_4$ is a parameter matrix that applies a feature transformation to $O'$.
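The fusion step can be sketched as follows. The attention formula (18) and the parameterization of λ (19) are not reproduced in this text, so this sketch makes several assumptions: the attention output is taken as the difference of two softmax attention maps weighted by λ, as in the published differential attention formulation; λ is passed in as a plain scalar `lam`; `Norm` is taken to be layer normalization; and $W_3$ and $W_4$ are square so that the residual connections are well defined. All function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def differential_attention_fusion(z, w1, w2, w3, w4, lam):
    """Fuse concatenated latent features Z into H, following formulas (17), (20), (21)."""
    f = w1.shape[1] // 2
    g1, g2 = np.split(z @ w1, 2, axis=1)        # [G1; G2] = Z W1
    k1, k2 = np.split(z @ w2, 2, axis=1)        # [K1; K2] = Z W2
    z_proj = z @ w3                             # Z' = Z W3
    attn1 = softmax(g1 @ k1.T / np.sqrt(f))
    attn2 = softmax(g2 @ k2.T / np.sqrt(f))
    o = (attn1 - lam * attn2) @ z_proj          # assumed form of formula (18)
    o_prime = layer_norm(z + o)                 # formula (20)
    return layer_norm(o_prime @ w4 + o_prime)   # formula (21)

rng = np.random.default_rng(0)
n_cells, d, f = 6, 8, 4
z = rng.standard_normal((n_cells, d))
w1, w2 = rng.standard_normal((d, 2 * f)), rng.standard_normal((d, 2 * f))
w3, w4 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
h = differential_attention_fusion(z, w1, w2, w3, w4, lam=0.5)
```

Subtracting the second attention map is intended to cancel attention weight that both maps assign to uninformative entries, which matches the noise-suppression role the text attributes to this module.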
[0086] Preferably, the clustering module is responsible for classifying cell types based on the result H of omics feature fusion. First, it calculates the probability of a cell belonging to each category:
[0087] $A = \mathrm{softmax}(HW_5)$ (22)
[0088] where $W_5$ is a parameter matrix, y is the predefined number of categories, and $C_{ij}$ represents the soft clustering assignment. To enhance the discriminative power of the soft clustering results, the target distribution is constructed as follows:
[0089]
[0090] Where P is the clustering assignment probability matrix after enhancing class discrimination, and the index of the maximum value of each cell in the probability matrix is taken as the final class label:
[0091]
[0092] After obtaining the target distribution $P_{ij}$ (Formula 23) and the distribution of the specific omics data (Formula 5), the following loss function is used to guide $P_{ij}$:
[0093]
[0094] Where KL(·||·) is the KL divergence (Kullback-Leibler divergence).
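The clustering stage can be illustrated with a minimal sketch. Formulas (23) to (25) are not reproduced in this text, so the target distribution below is an assumption: it uses the squared-and-renormalized sharpening common in deep embedded clustering, and the direction of the KL term in the actual loss is not specified here. The function names and toy matrices are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def target_distribution(a):
    """Sharpen soft assignments: square them and normalize by cluster frequency."""
    weight = a ** 2 / a.sum(axis=0, keepdims=True)
    return weight / weight.sum(axis=1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), summed over clusters and averaged over cells."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))

rng = np.random.default_rng(0)
h = rng.standard_normal((10, 8))      # fused features H (toy values)
w5 = rng.standard_normal((8, 3))      # parameter matrix W5 with y = 3 categories
a = softmax(h @ w5)                   # formula (22)
p = target_distribution(a)            # assumed form of the target distribution
labels = p.argmax(axis=1)             # final class label per cell
```

Here `kl_divergence(p, a)` shows how such a divergence is evaluated between two assignment matrices; in the described method the comparison is between the target distribution and the omics-specific distributions.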
[0095] scECDA achieves feature extraction and modality alignment through a three-part joint optimization framework. The first part minimizes the reconstruction loss $L_{stage1}$ (Formula 4), ensuring that the latent space effectively preserves the biological characteristics of the original omics data. The second part introduces the contrastive loss $L_{stage2}$ (Formula 15) to align cross-omics data features, i.e., to maximize the similarity of the feature representations of different omics data for the same cell. The third part uses the KL divergence loss $L_{stage3}$ (Formula 25) to constrain the latent space distribution, ensuring that the target distribution is consistent with the prior distribution of the specific omics data.
[0096] The overall loss function L is defined as a linear combination of the losses of each component (Equation 26), and is optimized using the Adam algorithm, with network parameters updated iteratively through backpropagation. The model gradually reduces the overall loss function L through multiple rounds of iterative optimization, and the training process terminates when the clustering accuracy reaches convergence.
[0097] This multi-objective joint optimization achieves effective dimensionality reduction and denoising of omics features, semantic alignment of cross-omics data features, and guidance of the fused features' probability distribution by the probability distributions of the individual omics features.
[0098] $L = L_{stage1} + L_{stage2} + L_{stage3}$ (26)
[0099] The present invention has the following beneficial effects:
[0100] This invention provides an end-to-end joint optimization method that achieves effective noise reduction and feature extraction of multi-omics data, accurate alignment of cross-modal features, and dynamic adaptive feature fusion through the synergistic effect of deep encoders, contrastive learning, and attention mechanisms.
[0101] This invention effectively preserves the specific characteristics of each modality's data through an independent deep encoder architecture, and significantly improves the data signal-to-noise ratio by employing smoothing and noise suppression techniques, thus overcoming the information loss problem in cross-modal data alignment of traditional methods.
[0102] This invention enables semantic-level alignment of the latent spaces of different omics data through a contrastive learning strategy. The nonlinear dimensionality reduction method significantly improves the feature retention rate compared with traditional linear methods such as PCA. The differential attention mechanism automatically learns the optimal feature combination weights.
[0103] This method, through its innovative technical architecture and optimization strategies, significantly outperforms existing technologies in several key performance indicators, providing a more powerful and reliable analytical tool for single-cell multi-omics research.
Attached Figure Description
[0104] Figure 1 is a flowchart of the present invention;
[0105] Figure 2 is a comparison chart of the clustering results of different methods on four datasets;
[0106] Figure 3 is a comparison chart of the clustering results of different methods on (a) Mouse Brain's RNA omics data, (b) Chen's RNA omics data, and (c) tea's three omics data;
[0107] Figure 4 is a comparison graph showing the clustering performance of the models on different types of omics data;
[0108] Figure 5 is a performance evaluation graph of scECDA in multi-batch single-cell multi-omics integration;
[0109] Figure 6 shows the top three differentially expressed genes selected from the Inhouse dataset, plotted as an expression distribution plot (a) and a dot plot (b);
[0110] Figure 7 is a flowchart illustrating the present invention.
[0111] In Figure 5: (a) shows the distribution of the different batches in the original data; (b) shows the distribution of the batches after scECDA removes batch effects; (c) evaluates the clustering results of different methods on the BMMC dataset; (d) explores the impact of the number of data batches on the scECDA clustering results, where s1 (s1d1, s1d2, s1d3), s2 (s2d1, s2d4, s2d5), s3 (s3d3, s3d6, s3d7, s3d10), and s4 (s4d1, s4d8, s4d9) are 4 large batches, and "all" refers to s1+s2+s3+s4.
Detailed Implementation
[0112] To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described below with reference to specific embodiments shown in the accompanying drawings. However, it should be understood that these descriptions are merely exemplary and not intended to limit the scope of the invention. Furthermore, descriptions of well-known structures and technologies are omitted in the following description to avoid unnecessarily obscuring the concept of the present invention.
[0113] Specific implementation method one: This embodiment is described with reference to Figures 1-7. It provides a method for generating single-cell sequencing data, which includes the following steps:
S1. Obtain the scRNA-seq datasets: This invention obtained eight real multi-omics datasets from the GEO database and previous research papers. These datasets are divided into three categories based on the types of omics they contain: paired scRNA-seq and scATAC-seq data; paired scRNA-seq and scADT-seq data; and paired scRNA-seq, scATAC-seq, and scADT-seq data. Data downloaded from the GEO database include Chen (GSE126074), Mouse Brain (GSE140203), BMMC (GSE194122), and InHouse (GSE148665). Datasets downloaded from other papers include BMNC, PBMC 10k, Tea-seq, and PBMC 10x.
S2. Build the deep neural network model: The scECDA model framework is used for the integration and cell clustering of single-cell multi-omics data, as shown in Figure 1. Panel (a) illustrates the structure of the input module, which receives paired single-cell multi-omics data (including scRNA-seq, scATAC-seq, and scADT-seq) and integrates and jointly clusters the multimodal data through a deep neural network. As shown in panel (b), the specific implementation of this framework includes the following key steps. First, low-dimensional latent representations of each single-cell omics data are extracted using an independent encoder architecture, and iterative optimization is performed using a reconstruction loss function. Then, an auxiliary encoder is constructed based on a parameter sharing mechanism to apply controllable perturbations to the latent space features and generate augmented samples. In the feature alignment stage, the latent representations of different omics data of the same cell are constrained by a contrastive learning loss function to maximize their cosine similarity in the shared embedding space. The aligned cross-omics features are then concatenated and input into a differential attention module, which can improve the signal-to-noise ratio of effective features. Finally, the clustering module outputs the cell clustering results from the fused features.
S3. Train the deep neural network model: In designing the model, this invention pairs each omics dataset with an autoencoder, primarily for dimensionality reduction and noise reduction, preserving the most important features of each omics dataset. The obtained latent features can be represented as follows:
$Z_u = E_u(X_u)$ (1)
where $E_u$ denotes the encoder for the u-th omics data together with its parameters. The encoder reduces the input data dimension from $m_u$ to $d_u$. The reconstruction $\hat{X}_u$ is then obtained through the decoder:
$\hat{X}_u = D_u(Z_u)$ (2)
where $D_u$ denotes the decoder together with its parameters. Over multiple iterations, the autoencoder minimizes the mean squared error (MSE) loss between $X_u$ and $\hat{X}_u$:
$L_{MSE}^{u} = \|X_u - \hat{X}_u\|_2^2$ (3)
[0114] Where ||·||2 is the l2 norm.
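The reconstruction objective can be sketched as follows, assuming each omics layer is a cells-by-features matrix and that the per-omics errors are averaged over all entries and then summed across omics. The averaging convention and the function name `stage1_loss` are assumptions made for illustration.

```python
import numpy as np

def stage1_loss(inputs, reconstructions):
    """Sum over omics layers of the mean squared reconstruction error."""
    return float(sum(np.mean((x - x_hat) ** 2)
                     for x, x_hat in zip(inputs, reconstructions)))

# Toy example with two omics layers of different feature dimensions.
rna = np.array([[1.0, 2.0], [3.0, 4.0]])
atac = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]])
# The RNA reconstruction is off by 1 everywhere; the ATAC reconstruction is exact.
loss = stage1_loss([rna, atac], [rna + 1.0, atac])
```

Because each omics layer has its own autoencoder, the layers may have different feature dimensions, as in the toy example.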
[0115] The Student's t-distribution is used to apply a spatial transformation to the latent features $Z_u$:
[0116]
[0117] Specifically, the K-means algorithm is first used to cluster the latent features of the u-th omics data and obtain the corresponding cluster centers, which serve as the reference for the feature transformation. This t-distribution-based transformation strategy can enhance the stability of the feature representation, thereby improving the accuracy and reliability of subsequent analysis.
[0118] An auxiliary encoder with the same structure as the original encoder $E_u$ is constructed. Both employ exactly the same neural network architecture and share parameters (as shown in Fig. 1). To achieve feature perturbation, the auxiliary encoder adds a Dropout layer to the original structure, along with a small perturbation ε to simulate different sequencing depths in real-world scenarios. This design not only ensures the compatibility of feature spaces between the encoders but also generates diverse feature representations through controllable perturbations, enhancing the model's generalization ability. The specific process is as follows:
[0119]
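The perturbation step can be illustrated with a minimal sketch, assuming the Dropout layer and the perturbation ε are applied directly to a matrix of latent features and that ε is Gaussian noise. The function name `perturb_latent`, the default rates, and the noise distribution are assumptions, since the formula for this step is not reproduced in this text.

```python
import numpy as np

def perturb_latent(z, drop_rate=0.1, noise_scale=0.01, rng=None):
    """Return an augmented view of z using dropout plus a small additive perturbation."""
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(z.shape) >= drop_rate                # Dropout mask
    dropped = np.where(keep, z / (1.0 - drop_rate), 0.0)   # inverted-dropout scaling
    return dropped + noise_scale * rng.standard_normal(z.shape)

rng = np.random.default_rng(0)
z = rng.standard_normal((5, 4))
z_aug = perturb_latent(z, rng=rng)   # augmented view used to build positive pairs
```

Setting both `drop_rate` and `noise_scale` to zero returns the input unchanged, which makes the strength of the augmentation easy to control.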
[0120] The set of positive sample pairs constructed for cell i using the augmented data is as follows:
[0121]
[0122] where <·,·> represents a sample pair and ∪ is the union operation. Within the same batch, the latent features of every other cell k (k ≠ i) and their associated positive samples are treated as negative samples of cell i. Each batch contains b cells:
[0123]
[0124] The distance between cell i and its positive samples in the latent space is calculated as:
[0125]
[0126] where $i^+$ is a positive sample of cell i, and τ is the adjustment coefficient, with a default value of 1.0.
[0127] The distance between cell i and all of its negative samples in the latent space is calculated as:
[0128]
[0129] where $i^-$ is a negative sample of cell i. Therefore, the contrastive loss can be expressed as:
[0130]
[0131] To avoid grouping all cells into the same cluster, this invention introduces cross-entropy regularization loss:
[0132]
[0133] The contrastive loss function with regularization is:
[0134] $L_{stage2} = L_{contrastive} + L_{entropy}$ (15)
[0135] Fusing the latent features of the omics data: first, the latent features of each omics obtained in formula (1) are concatenated:
[0136] $Z = [Z_1, Z_2, \ldots, Z_u]$ (16)
[0137] The second step is to project Z onto different feature spaces:
[0138] $[G_1; G_2] = ZW_1,\quad [K_1; K_2] = ZW_2,\quad Z' = ZW_3$ (17)
[0139] where $W_1$, $W_2$, and $W_3$ are all parameter matrices, with a default value of $2f = 256$. The third step, based on the differential attention calculation formula, yields:
[0140]
[0141] Where λ is a learnable scalar:
[0142]
[0143] where the remaining terms are all learnable vectors, and $\lambda_0 \in (0, 10)$ is a constant used to initialize λ. In this invention, $\lambda_0 = 5$. The relevant parameter analysis is detailed in the following sections. Finally, after normalization, residual connection, and linear transformation, the fused feature H is obtained:
[0144] $O' = \mathrm{Norm}(Z + O)$ (20)
[0145] $H = \mathrm{Norm}(O'W_4 + O')$ (21)
[0146] where $W_4$ is a parameter matrix that applies a feature transformation to $O'$.
[0147] The clustering module is responsible for classifying cell types based on the result H of omics feature fusion. First, it calculates the probability of a cell belonging to each category:
[0148] $A = \mathrm{softmax}(HW_5)$ (22)
[0149] where $W_5$ is a parameter matrix, y is the predefined number of categories, and $C_{ij}$ represents the soft clustering assignment. To enhance the discriminative power of the soft clustering results, the target distribution is constructed as follows:
[0150]
[0151] Where P is the clustering assignment probability matrix after enhancing class discrimination, and the index of the maximum value of each cell in the probability matrix is taken as the final class label:
[0152]
[0153] After obtaining the target distribution $P_{ij}$ (Formula 23) and the distribution of the specific omics data (Formula 5), the following loss function is used to guide $P_{ij}$:
[0154]
[0155] Where KL(·||·) is the KL divergence (Kullback-Leibler divergence).
[0156] scECDA achieves feature extraction and modality alignment through a three-part joint optimization framework. The first part minimizes the reconstruction loss $L_{stage1}$ (Formula 4), ensuring that the latent space effectively preserves the biological characteristics of the original omics data. The second part introduces the contrastive loss $L_{stage2}$ (Formula 15) to align cross-omics data features, i.e., to maximize the similarity of the feature representations of different omics data for the same cell. The third part uses the KL divergence loss $L_{stage3}$ (Formula 25) to constrain the latent space distribution, ensuring consistency between the target distribution and the prior distribution of the specific omics data. The model's total loss function L is defined as a linear combination of the losses of each component (Formula 26) and is optimized using the Adam algorithm, with network parameters updated iteratively through backpropagation. The model gradually reduces the overall loss function L through multiple rounds of iterative optimization, and training terminates when the clustering accuracy converges.
[0157] This multi-objective joint optimization achieves effective dimensionality reduction and denoising of omics features, semantic alignment of cross-omics data features, and guidance of the fused features' probability distribution by the probability distributions of the individual omics features.
[0158] $L = L_{stage1} + L_{stage2} + L_{stage3}$ (26)
[0159] Example 1
[0160] The recognition effect of the present invention is further described below with a set of specific experimental examples:
[0161] As stated in S1: the more diverse the types of high-quality omics data, the richer the effective information provided, and the more accurate the clustering results. Therefore, a good model should be able to effectively integrate information from different omics and further improve the accuracy of clustering results. To explore whether scECDA can effectively integrate information from more omics data, we used the TEA dataset for experiments. The K-means clustering results on the RNA and ATAC data (Figure 3c) show that these two omics are of high quality and meet the ideal conditions.
[0162] In the experiment, we processed the tea dataset as follows: it was split into the paired combinations (RNA, ATAC) and (RNA, ADT), and also kept unsplit as the three-omics combination (RNA, ATAC, ADT). As shown in Figure 4, scECDA and TriTan scored significantly higher on the three-omics dataset than on the two-omics datasets, while Mowgli's performance declined instead of improving, worsening as the number of omics increased. This indicates that Mowgli cannot effectively integrate more omics data. Although TriTan ranked second on the three-omics dataset, it ranked last on the two-omics datasets. On the (RNA, ADT) combination, scECDA scored approximately 0.2 higher than TriTan on each indicator, and on the (RNA, ATAC) combination, scECDA scored approximately 0.3 higher than TriTan on each indicator. Compared to other methods, scECDA can effectively integrate information from different types of omics data, thereby improving the accuracy of clustering results.
[0163] As described in S2: Batch effect refers to the technical variation caused by differences in time, operator, reagents, instruments, etc. during the processing and measurement of different batches of experimental samples. These technical variations are unrelated to biological variations and may mask or confuse the real biological differences. In single-cell data analysis, especially when merging sequencing data from different batches, batch effect may lead to bias in the analysis of results.
[0164] Therefore, data needs to be corrected using some method before analysis to reduce batch-to-batch variations and ensure the accuracy and reliability of the data.
[0165] To investigate scECDA's data integration capability and clustering performance on datasets with multiple batches, we selected the BMMC dataset (containing 3 large batches and 12 small batches of data). As shown in Figure 5c, scECDA has the highest overall score, followed by scRISE. Although scRISE is slightly higher than scECDA in NMI and PUR, by about 0.029 and 0.056 respectively, scECDA is significantly better than scRISE in ARI and ACC, by about 0.1484 and 0.068 respectively. Figure 5d shows the impact of combining multiple batches of data on the clustering results. Numerically, the scores fluctuate within only 0.05, indicating that scECDA has good stability. UMAP is used to show the data before and after integration: as shown in Figure 5a and 5b, cells with similar characteristics are effectively clustered together and batch effects are eliminated. These results indicate that scECDA effectively mitigates batch effects and is more stable and reliable than the other methods.
[0166] As described in S3: scECDA identifies biomarkers on the InHouse dataset. As shown in Figure 6, the S100A9, DUSP1, and LYZ genes are highly expressed only in CD14+ monocytes.
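The ARI and PUR scores discussed in this paragraph are computed from ground-truth cell labels and predicted cluster labels. The following is a minimal pure-Python sketch, assuming the standard textbook definitions of purity and the adjusted Rand index; the function names are illustrative and are not taken from the scECDA implementation, and the label lists are toy data, not results from the BMMC experiment.

```python
from collections import Counter
from math import comb


def purity(true_labels, pred_labels):
    """Fraction of cells that belong to the majority true class of their cluster."""
    pair_counts = Counter(zip(pred_labels, true_labels))
    best_per_cluster = {}
    for (cluster, _), count in pair_counts.items():
        best_per_cluster[cluster] = max(best_per_cluster.get(cluster, 0), count)
    return sum(best_per_cluster.values()) / len(true_labels)


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index computed from the contingency table of two labelings."""
    n = len(true_labels)
    if n < 2:
        return 1.0  # degenerate case: no pairs to compare
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # both labelings are trivial (e.g. a single cluster each)
    return (sum_pairs - expected) / (max_index - expected)


# Toy labels (hypothetical): cluster IDs need not match the true class IDs.
truth = [0, 0, 1, 1, 2, 2]
perfect = [1, 1, 0, 0, 2, 2]
coarse = [0, 0, 0, 1, 1, 1]
```

Both metrics are invariant to how cluster IDs are numbered, which is why `perfect` scores 1.0 even though its IDs are permuted relative to `truth`.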
[0167] CD14+ monocytes can differentiate into macrophages or dendritic cells, a process that depends on the orderly activation of the MAPK pathway (such as ERK driving cell proliferation and differentiation).
[0168] DUSP1, through "pulsed" dephosphorylation of ERK, may suppress unnecessary signaling propagation in early differentiation while allowing the expression of specific genes (such as pro-inflammatory factors) in later stages, thus balancing differentiation efficiency and functional specificity. S100A9 is a member of the S100 calcium-binding protein family, and its high expression is highly correlated with the innate immune function of monocytes.
[0169] CD14+ monocytes rapidly release S100A9 upon infection, which induces neutrophil chemotaxis via CXC chemokine receptor 4 (CXCR4) or RAGE receptor, forming an inflammatory cascade.
[0170] S100A9 activates the SYK / PI3K / AKT pathway, enhances neutrophil phagocytosis efficiency and lysosomal enzyme release, and directly participates in pathogen clearance;
[0171] S100A9 itself can induce neutrophil degranulation through a MAPK / ERK-dependent mechanism (forming a feedback loop with DUSP1-mediated MAPK regulation). LYZ encodes lysozyme and is also a key marker of the monocyte-macrophage system; its high expression directly reflects the phagocytic function of monocytes.
[0172] Lysozyme kills Gram-positive bacteria by hydrolyzing peptidoglycans in the bacterial cell wall (such as N-acetylglucosamine bonds) and activates the complement pathway. When lysozyme is released into the extracellular space (such as through lactoferrin), it can enhance antitumor activity (e.g., through the TLR4 / NF-κB pathway) and connect with monocyte-mediated adaptive immunity.
[0173] The genes GNLY, GZMB, and CST7 are all highly expressed in NK cells, and the genes GZMB and CST7 are also highly expressed in some CD8+ T cells.
[0174] GNLY encodes a core bactericidal protein in the cytoplasmic granules of NK cells, which directly kills intracellular pathogens (such as Mycobacterium tuberculosis) by disrupting cell membrane integrity;
[0175] NK cells, as the "first line of defense" of innate immunity, need to respond quickly to pathogen infection. High expression of GNLY ensures that they can efficiently clear infected cells and prevent the spread of pathogens.
[0176] GZMB triggers pyroptosis by cleaving GSDME, which is the core mechanism by which NK cells clear virus-infected cells and tumor cells.
[0177] During the cytotoxic phase (such as in the liver cancer microenvironment), CD8+ T cells clear tumor cells through GZMB-mediated pyroptosis, and its expression level is positively correlated with cytotoxic phenotypes (such as co-expression of GZMH and GNLY).
[0178] CST7 belongs to the cystatin family and prevents excessive inflammatory responses and damage to tissues by inhibiting the activity of cysteine proteases (such as granzymes).
[0179] During NK cell killing, CST7 may balance the activities of GZMB and GNLY, preventing cell autolysis caused by lysosomal membrane rupture.
[0180] GNLY and GZMB (granzyme B) coexist in the cytotoxic granules of NK cells. The former destroys the target cell membrane, while the latter induces pyroptosis by activating GSDME, forming a synergistic killing mechanism of "membrane perforation + contents release".
[0181] This synergistic effect is particularly important when NK cells fight viral or intracellular bacterial infections;
[0182] During the effector phase of CD8+ T cells, CST7 regulates the transition between pyroptosis and apoptosis by inhibiting GSDME cleavage, thus maintaining the precision of the immune response.
[0183] CD74, CD79A, and CD79B are B cell-specific genes;
[0184] CD79A and CD79B, as core components of BCR signaling, drive the response and differentiation of B cells to antigens.
[0185] CD74 stabilizes peptide-free MHC class II heterodimers and guides their transport to endosomes / lysosomes, ensuring the efficiency of antigen processing and presentation.
[0186] As professional antigen-presenting cells (APCs), B cells need to efficiently process extracellular antigens (such as viral proteins) to activate T cells. High expression of CD74 plays a key role in this process, and CD74 strengthens the pivotal role of B cells in adaptive immunity by regulating antigen presentation and antiviral functions.
[0187] This mechanism provides a potential target for the treatment of B-cell-related diseases (such as lymphoma and autoimmune diseases), such as ADC drugs that target CD79B.
[0188] The complementary nature of single-cell omics data helps to provide a more comprehensive understanding of cellular heterogeneity. By integrating transcriptomics and surface proteomics data, we can more accurately identify and differentiate different cell subpopulations.
[0189] In the InHouse dataset, CD8+ T cells and CD4+ T cells have similar functions and are difficult to distinguish in the two-dimensional visualization (Figure 6a), so biomarkers are needed to separate them;
[0190] The distribution of differential gene expression (Figure 6a) shows that the CD8A and CD8B genes are almost exclusively highly expressed in CD8+ T cells, because CD8A can promote the survival and differentiation of activated lymphocytes into memory CD8+ T cells, and CD8B plays a key role in thymic selection of CD8+ T cells.
[0191] According to the differential protein expression distribution (Fig S3), CD4 is specifically expressed in CD4+ T cells. The expression of the CD4 gene is strictly regulated by specific transcription factors and gene regulatory networks. These regulatory mechanisms ensure that the CD4 gene is activated only in specific cell types, such as helper T cells and monocytes.
[0192] Transcription factors such as T-cell factor 1 (TCF-1) and GATA3 are activated in helper T cells, thereby promoting the expression of the CD4 gene;
[0193] In other types of immune cells (such as B cells and CD8+ T cells), these specific regulatory factors are usually inactive, leading to the suppression of CD4 gene expression. Therefore, CD8+ T cells and CD4+ T cells can be distinguished by the specific expression distribution of these biomarkers.
[0194] It should be noted that in the above embodiments, as long as the technical solutions are not contradictory, they can be permuted and combined. Those skilled in the art can exhaust all possibilities based on the mathematical knowledge of permutation and combination. Therefore, the present invention will not describe the technical solutions after permutation and combination one by one, but it should be understood that the technical solutions after permutation and combination have been disclosed by the present invention.
[0195] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A method for combined single-cell multi-omics data integration and clustering, characterized in that it includes the following steps: Step 1: acquire single-cell multi-omics data; Step 2: preprocess the data of each omics dataset; Step 3: construct independent deep encoders to extract low-dimensional latent representations of each omics dataset, and apply smoothing to reduce noise interference; Step 4: optimize the distance metric between positive and negative sample pairs based on a contrastive learning strategy to achieve semantic alignment of the latent representations from different omics; Step 5: employ a differential attention mechanism to dynamically adjust the weights of different omics features and generate a unified low-dimensional representation; Step 6: perform unsupervised clustering on the fused features to achieve accurate annotation of cell types.
2. The method for integrating and clustering single-cell multi-omics data according to claim 1, characterized in that: The method described in step one is compared with six other methods, namely TriTan, Mowgli, scDMSC, scMCs, K-means, and scRISE, to verify its advantages in clustering single-omics and multi-omics data.
3. The method for combined single-cell multi-omics data integration and clustering according to claim 1, characterized in that: Step two uses the Inhouse, BMNC, 10x, pbmc10k, Chen, and Mouse Brain datasets for evaluation, and verifies the robustness of the method under different data quality, scale, and sparsity using ARI, NMI, ACC, and PUR metrics.
4. The method for combined single-cell multi-omics data integration and clustering according to claim 1, characterized in that: each omics dataset is paired with an autoencoder, used primarily for dimensionality reduction and denoising, which preserves the most important features of each omics dataset. The resulting latent features can be represented as: where E_u denotes the encoder for the u-th omics data, with parameters being..., and the input data dimension m_u is reduced to d_u. The reconstruction is then obtained through the decoder, with the parameters corresponding to the decoder. Over multiple iterations of the autoencoder, the mean squared error (MSE) loss between X_u and its reconstruction is minimized: where ||·||_2 is the l2 norm. The Student's t-distribution is then used to perform a spatial transformation on the latent features Z_u:
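The formulas referenced in claim 4 are not reproduced in this text. The block below is a hedged reconstruction of the usual autoencoder form that matches the surrounding definitions (encoder E_u, input dimension m_u, latent dimension d_u, MSE between X_u and its reconstruction). The symbols \theta_u, \phi_u, D_u, \hat{X}_u, n, and \mathcal{L}_{\mathrm{MSE}} are assumptions introduced for illustration, and the normalization constant of the loss may differ from the original.

```latex
% Assumed notation: n cells, encoder parameters \theta_u, decoder D_u with parameters \phi_u.
\begin{aligned}
Z_u &= E_u(X_u;\ \theta_u), & X_u &\in \mathbb{R}^{n \times m_u},\quad Z_u \in \mathbb{R}^{n \times d_u},\\
\hat{X}_u &= D_u(Z_u;\ \phi_u), & \hat{X}_u &\in \mathbb{R}^{n \times m_u},\\
\mathcal{L}_{\mathrm{MSE}} &= \lVert X_u - \hat{X}_u \rVert_2^2 .
\end{aligned}
```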
5. The method for combined single-cell multi-omics data integration and clustering according to claim 1, characterized in that: the smoothing process applies the K-means algorithm to cluster the latent features of the u-th omics data and obtain the corresponding cluster centers. Using these centers as the reference for feature transformation, the Student's t-distribution is used to perform a spatial transformation on the latent features, and the transformed features enhance the stability of cross-omics alignment.
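The Student's t-distribution transform in claim 5 can be illustrated with the soft-assignment kernel commonly used in deep embedded clustering. This is a minimal sketch under that assumption: the exact formula is not reproduced in this text, the degree-of-freedom parameter `alpha` and the function name are hypothetical, and the cluster centers are supplied directly instead of being fitted with K-means.

```python
def student_t_assignment(latent, centers, alpha=1.0):
    """Map each latent vector to a row of soft memberships over cluster centers.

    Uses the kernel (1 + ||z - mu||^2 / alpha) ** (-(alpha + 1) / 2),
    normalized so that each row sums to 1. This form is an assumption.
    """
    memberships = []
    for z in latent:
        weights = []
        for mu in centers:
            sq_dist = sum((a - b) ** 2 for a, b in zip(z, mu))
            weights.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(weights)
        memberships.append([w / total for w in weights])
    return memberships


# Toy latent features and centers (hypothetical values).
latent = [[0.0, 0.0], [4.0, 4.0], [2.0, 2.0]]
centers = [[0.0, 0.0], [4.0, 4.0]]
q = student_t_assignment(latent, centers)
```

A cell that sits on a center receives almost all of its membership from that center, while a cell equidistant from both centers is split evenly; this bounded, normalized representation is what makes the transformed features less sensitive to noise in the raw latent coordinates.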
6. The method for combined single-cell multi-omics data integration and clustering according to claim 4, characterized in that: the multi-omics feature extraction stage employs a dual-encoder architecture comprising the original encoder E_u and an auxiliary encoder with the same structure as E_u; both use exactly the same neural network architecture and share parameters. To achieve feature perturbation, a Dropout layer is added to the original structure and a small perturbation ε is injected to simulate different sequencing depths in real-world scenarios. The specific process is as follows: The set of positive sample pairs constructed for cell i using the augmented data is as follows: where <·,·> represents a sample pair and ∪ is the union operation. Within the same batch, the latent features of other cells k (i≠k) and their associated positive samples are treated as negative samples of cell i; each batch contains b cells: The distance between cell i and its positive sample in the latent space is calculated as: where cell i^+ is a positive sample of cell i, and τ is the adjustment coefficient, with a default value of 1.0. The distance between cell i and all negative samples in the latent space is calculated as: where cell i^- is a negative sample of cell i. The contrastive loss can be expressed as: To avoid grouping all cells into the same cluster, a cross-entropy regularization loss is introduced: The contrastive loss function with regularization is: L_{stage2} = L_{contrastive} + L_{entropy} (15). Latent features of the fused omics data: first, the latent features of each omics obtained in formula (1) are concatenated: Z = [Z_1, Z_2, …, Z_u] (16). Z is then projected onto different feature spaces: [G_1; G_2] = ZW_1, [K_1; K_2] = ZW_2, Z′ = ZW_3 (17), where W_1, W_2, and W_3 are all parameter matrices. According to the differential attention calculation formula, we obtain: where λ is a learnable scalar: in which all are learnable vectors, and λ_0 ∈ (0,10) is a constant used to initialize λ, taken as λ_0 = 5. Finally, the fused feature H is obtained after normalization, residual connection, and linear transformation: O′ = Norm(Z + O) (20), H = Norm(O′W_4 + O′) (21), where W_4 is a parameter matrix that performs a feature transformation on O′.
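The differential attention output O and the scalar λ used in claim 6 are not reproduced in this text. The block below sketches the general form of differential attention as published for the Differential Transformer, rewritten with the symbols of formula (17). It is an assumption that the patent follows this form: the scaling factor \sqrt{d} and the vector names \lambda_{g_1}, \lambda_{k_1}, \lambda_{g_2}, \lambda_{k_2} are hypothetical and may differ from the original.

```latex
% Assumed form: two softmax attention maps are subtracted, weighted by \lambda,
% and applied to the value projection Z'. d is the per-branch projection width.
\begin{aligned}
O &= \left( \operatorname{softmax}\!\left( \frac{G_1 K_1^{\top}}{\sqrt{d}} \right)
      - \lambda \, \operatorname{softmax}\!\left( \frac{G_2 K_2^{\top}}{\sqrt{d}} \right) \right) Z', \\
\lambda &= \exp\!\left( \lambda_{g_1} \cdot \lambda_{k_1} \right)
         - \exp\!\left( \lambda_{g_2} \cdot \lambda_{k_2} \right) + \lambda_0 .
\end{aligned}
```

Subtracting the second attention map cancels attention weight that both maps assign to the same uninformative features, which is consistent with the stated goal of reducing noise while weighting the omics features dynamically.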
7. The method for combined single-cell multi-omics data integration and clustering according to claim 6, characterized in that: based on the fused omics feature H, cell types are classified and the probability of each cell belonging to each category is calculated: C = softmax(HW_5) (22), where W_5 is a parameter matrix, y is the predefined number of categories, and C_ij represents the soft cluster assignment. To enhance the discriminative power of the soft clustering results, the target distribution is constructed as follows: where P is the cluster assignment probability matrix after enhancing class discrimination, and the index of the maximum value of each cell in the probability matrix is taken as the final class label: After obtaining the target distribution P_ij and the distribution of each specific omics data, the following loss function is used to guide P_ij: where KL(·||·) is the KL divergence.
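The target distribution, final labels, and KL term in claim 7 can be illustrated as follows. This is a minimal sketch assuming the sharpening rule commonly used in deep embedded clustering (square each soft assignment, divide by the cluster's total soft frequency, renormalize per cell); the formulas are not reproduced in this text, so this rule, the function names, and the argument order of the KL term are all assumptions.

```python
import math


def target_distribution(soft):
    """Sharpen soft assignments: p_ij is proportional to c_ij^2 / sum_i c_ij (assumed form)."""
    n_clusters = len(soft[0])
    cluster_freq = [sum(row[j] for row in soft) for j in range(n_clusters)]
    target = []
    for row in soft:
        weights = [row[j] ** 2 / cluster_freq[j] for j in range(n_clusters)]
        total = sum(weights)
        target.append([w / total for w in weights])
    return target


def hard_labels(prob):
    """Final class label of each cell: index of its largest probability."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) summed over all cells and clusters; eps guards against log(0)."""
    return sum(
        p_ij * math.log((p_ij + eps) / (q_ij + eps))
        for p_row, q_row in zip(p, q)
        for p_ij, q_ij in zip(p_row, q_row)
    )


# Toy soft assignment matrix C for three cells and two clusters (hypothetical values).
soft = [[0.6, 0.4], [0.3, 0.7], [0.8, 0.2]]
target = target_distribution(soft)
labels = hard_labels(target)
loss = kl_divergence(target, soft)
```

Sharpening pushes each cell's largest probability closer to 1, so the KL term rewards confident, well-separated assignments without requiring labels.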
8. The method for combined single-cell multi-omics data integration and clustering according to claim 7, characterized in that: the overall loss function L is defined as a linear combination of the losses of each component and is optimized using the Adam algorithm. Network parameters are iteratively updated through backpropagation. The model gradually reduces the overall loss function L through multiple rounds of iterative optimization. The training process terminates when the clustering accuracy reaches convergence. This multi-objective joint optimization achieves effective dimensionality reduction and denoising of omics features, semantic alignment of cross-omics data features, and the probability distribution of omics data features guiding the probability distribution of fused features. L = L_{stage1} + L_{stage2} + L_{stage3} (26).