A method for inferring cell type relationships through cross-species single-cell transcriptome integration
By employing a dual-branch network model and a two-stage training strategy, the problems of inconsistent gene sets and inconsistent cell type annotations in cross-species single-cell integration are solved. This achieves broad coverage of cross-species gene relationships and explicit characterization of multi-resolution cell type relationships, supporting cell type tree construction and gene expression lineage analysis.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHEJIANG UNIV
- Filing Date
- 2026-03-24
- Publication Date
- 2026-06-19
AI Technical Summary
Cross-species single-cell integration faces challenges such as inconsistent gene sets and inconsistent cell type annotation systems, making it difficult to consistently characterize cell type relationships at multiple resolution levels. Furthermore, traditional methods have failed to effectively utilize non-homologous gene signals and cell type hierarchical structures, thus limiting the biological significance of cross-species integration.
A dual-branch network model is constructed, using one-to-one homologous genes as anchors in combination with the average expression of each cell type. A two-stage training strategy is adopted, first aligning gene embeddings and then enhancing cell embeddings, to achieve broad coverage of cross-species gene relationships and explicit characterization of cell type relationships.
It improves the coverage and consistency of cross-species gene relationship learning, supports the comparison of multi-resolution cell type relationships, constructs stable cell type trees, and provides gene expression lineage variation analysis at cross-species hierarchical levels.
Figure CN122242759A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of single-cell data analysis technology, and in particular to a method for cross-species single-cell transcriptome integration and cell type relationship inference.
Background Technology
[0002] With the rapid development of single-cell RNA sequencing technology, large-scale single-cell atlases of different species are accumulating steadily. Comparing the lineage relationships of cell types across species at the cellular level, and revealing conserved and divergent transcriptional regulatory programs, has become an important research need. However, cross-species single-cell integration faces two core challenges. First, the gene sets of different species are inconsistent and gene homology relationships are complex; in addition, the annotation systems and granularities for cell types often differ between species, making it difficult to consistently characterize cell type relationships at multiple resolution levels. Existing cross-species integration methods often use one-to-one homologous genes as the common feature set. While this strategy simplifies cross-species alignment, it excludes a large number of non-homologous or species-specific genes, so signals related to lineage identity and subtype structure are hard to exploit. Furthermore, directly treating the expression of one-to-one homologous genes as comparable across species may introduce systematic bias, because regulatory evolution can significantly alter the expression patterns of homologous genes in specific cell types. To expand gene coverage beyond the homologous set, some methods introduce protein language models based on amino acid sequences. However, these sequence embeddings typically do not explicitly encode the cell type-discriminative transcriptional context found in single-cell expression profiles, making it difficult to establish a learnable correspondence between genes and cell type expression programs and thus limiting the biological significance of cross-species integration.
[0003] Secondly, in terms of cell type relationship modeling, many methods mainly learn cell-level representations and indirectly infer cell type relationships through cell neighborhoods or local mixing. However, biological cell identity has a natural hierarchical organizational structure, and it is difficult to stably express the direct similarity and hierarchical relationship between cell types by relying solely on cell-level neighborhoods. This limits downstream tasks such as cell type tree construction, cross-species hierarchical comparison, and multi-resolution consistency analysis.
[0004] Therefore, there is an urgent need for a cross-species single-cell integration framework that can fully utilize homology information while modeling genes outside the homology set, and introduce cell type-level learnable representations to explicitly characterize multi-resolution cell type relationships, thereby improving the reliability of cross-species phylogenetic relationship inference and downstream analysis.
Summary of the Invention
[0005] The purpose of this invention is to provide a method for cross-species single-cell transcriptome integration and cell type relationship inference, so as to achieve aligned gene, cell and cell type embedding learning, and support cell type tree construction and related analysis.
[0006] To achieve the above objectives, this invention provides a method for cross-species single-cell transcriptome integration and cell type relationship inference, comprising the following steps:
S1. Obtain single-cell transcriptome datasets with annotated cell types from multiple species, and sets of one-to-one homologous genes between species;
S2. Preprocess the single-cell transcriptome dataset of each species to form the expression matrix of each species;
S3. Construct a dual-branch network model for each species, consisting of a cell branch and a cell type branch;
S4. Construct cross-species gene-level alignment signals based on the cell type expression characteristics of each species;
S5. Train the dual-branch network model in two stages, based on a joint loss function that combines a gene-level loss function and a cell-level loss function;
S6. Using the trained dual-branch network model, obtain the aligned gene embeddings, cell embeddings, and cell type embeddings of each species, which are used for cross-species gene relationship analysis, embedding phenotypic association analysis, and cell type tree construction.
[0007] Preferably, in S2, the preprocessing includes performing library size normalization and log1p transformation on the single-cell transcriptome datasets of each species, and screening for highly variable gene sets.
[0008] Preferably, in S3, the method for constructing the cell branch is as follows: a randomly initialized matrix is constructed for each species as a species-specific learnable gene embedding matrix. The rows of the gene embedding matrix correspond to different genes and the columns to different embedding dimensions. During model training, the gene embedding matrix is updated as a learnable parameter as the objective function is optimized.
[0009] During each training iteration, a mini-batch of data is sampled from the expression matrix of the corresponding species obtained in S2, a random gene mask is generated and applied to the expression matrix, and the masked expression matrix is multiplied by the gene embedding matrix and then input into a cell-level multilayer perceptron shared across species to obtain cell-level embeddings.
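[0010] The expression for the cell-level embedding is: $Z_i^{\mathrm{cell}} = f_{\theta}(\tilde{X}_i E_i)$; where $f_{\theta}$ is the cell mapping function, $\theta$ denotes its learnable parameters, $\tilde{X}_i$ is the masked expression matrix of species $i$, and $E_i$ is the gene embedding matrix of species $i$.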
[0011] Preferably, in S3, the method for constructing cell type branches is as follows: A learnable cell type prototype matrix is defined for each species, and the cell type prototype matrix is updated as a learnable parameter during model training.
[0012] The cell type prototype matrix is input into a cell type multilayer perceptron shared across species to obtain cell type-level embeddings.
[0013] The expression for the cell type-level embedding is: $Z_i^{\mathrm{type}} = g_{\phi}(P_i)$; where $g_{\phi}$ is the cell type mapping function, $\phi$ denotes its learnable parameters, and $P_i$ is the learnable cell type prototype matrix of species $i$.
[0014] Preferably, in S4, the specific operations for constructing cross-species gene-level alignment signals include: S41. For species $i$, the cell type average expression matrix $\bar{X}_i$ is calculated using the following formula: $\bar{X}_i[c, g] = \frac{1}{|C_{i,c}|} \sum_{n \in C_{i,c}} x_{g,n}$; $g \in G_i$; where $C_{i,c}$ is the set of cells in species $i$ labeled as cell type $c$, $x_{g,n}$ is the expression value of gene $g$ in cell $n$, $G_i$ is the set of highly variable genes of species $i$, and $c$ is a cell type.
[0015] S42. For any pair of species $(i, j)$, using the one-to-one homologous genes between the two species as anchors, cross-species cell type similarity is calculated from transcriptome data at cell type resolution; the expression for the cross-species cell type similarity is: $S_{ij}[a, b] = \cos(\bar{x}_{i,a}, \bar{x}_{j,b})$; where $\cos(\cdot,\cdot)$ is the cosine similarity, $\bar{x}_{i,a}$ is the row of $\bar{X}_i$ corresponding to cell type $a$, and $\bar{x}_{j,b}$ is the row of $\bar{X}_j$ corresponding to cell type $b$, both restricted to the one-to-one homologous genes.
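[0016] S43. Stable cross-species cell type matching relationships are screened by the mutual $k$-nearest-neighbor method. The expression is: $M_{ij}[a, b] = 1$ if $b \in \mathcal{N}_k^{j}(a)$ and $a \in \mathcal{N}_k^{i}(b)$, and $M_{ij}[a, b] = 0$ otherwise; where $\mathcal{N}_k^{j}(a)$ denotes the set of indexes of the $k$ cell types in species $j$ that are most similar to cell type $a$ of species $i$, and $\mathcal{N}_k^{i}(b)$ is defined analogously.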
[0017] S44. The matching relationship is applied to the cell type average expression matrices of the two corresponding species to obtain the cross-species gene association matrix; the expression for the cross-species gene association matrix is: $A_{ij} = \bar{X}_i^{\top} M_{ij} \bar{X}_j$; where $\bar{X}_i$ and $\bar{X}_j$ are the cell type average expression matrices of species $i$ and $j$, and $M_{ij}$ is the cross-species cell type matching matrix of species $i$ and $j$.
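[0018] Preferably, in S5, the gene-level loss function is: $\mathcal{L}_{\mathrm{gene}} = \lVert \hat{A}_{ij} - A_{ij} \rVert_F^2$; where $\hat{A}_{ij}$ denotes the gene embedding similarity matrix calculated from the gene embedding matrices $E_i$ and $E_j$, $A_{ij}$ is the cross-species gene association matrix, and $\lVert \cdot \rVert_F$ is the Frobenius norm.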
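[0019] Preferably, in S5, the cell-level loss function is: $\mathcal{L}_{\mathrm{cell}} = -\frac{1}{B} \sum_{n=1}^{B} \log \frac{\exp(\mathrm{sim}(z_{i,n}, t_{i,y_{i,n}}) / \tau)}{\sum_{c} \exp(\mathrm{sim}(z_{i,n}, t_{i,c}) / \tau)}$; where $z_{i,n}$ denotes the embedding of the $n$-th cell of species $i$, $t_{i,c}$ denotes the embedding of the $c$-th cell type of species $i$, $\mathrm{sim}(\cdot,\cdot)$ denotes similarity, $\tau$ is the temperature coefficient, $B$ is the batch size, and $y_{i,n}$ denotes the cell type label corresponding to the $n$-th cell of species $i$.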
[0020] Preferably, in S5, the specific process of the two-stage training is as follows: In the first stage, the dual-branch network model is trained using only the gene-level loss function to achieve alignment of gene embedding space across species.
[0021] In the second stage, the gene-level loss function and the cell-level loss function are weighted and combined to form a joint optimization objective. While maintaining the alignment of the gene embedding space across species, the dual-branch network model is trained to enhance the cell type discrimination ability of the intra-species cell embeddings.
[0022] Preferably, it also includes a cross-species single-cell transcriptome integration and cell type relationship inference analysis system for performing the method.
[0023] Preferably, the cross-species single-cell transcriptome integration and cell type relationship inference analysis system includes: a data acquisition module, used to perform the operations of S1 and acquire single-cell transcriptome datasets with annotated cell types from multiple species and sets of one-to-one homologous genes between species; a data processing module, used to perform the operations of S2 and preprocess the single-cell transcriptome dataset of each species; a model building module, used to perform the operations of S3 and build a dual-branch network model for each species; a signal construction module, used to perform the operations of S4 and construct cross-species gene-level alignment signals; a model training module, used to perform the operations of S5 and conduct two-stage training of the dual-branch network model; and a results analysis module, used to perform the operations of S6, obtain aligned gene embeddings, cell embeddings, and cell type embeddings, and conduct related analyses.
[0024] Therefore, the method for cross-species single-cell transcriptome integration and cell type relationship inference of the present invention has the following beneficial effects: (1) By using one-to-one homologous genes as anchors, the invention establishes cross-species transcriptional associations for a wider range of highly variable genes (HVGs) rather than being limited to homologous gene pairs, thereby improving the coverage and consistency of cross-species gene relationship learning; (2) By introducing the average expression of cell types, the cell type-discriminative transcriptional context is explicitly injected into the cross-species alignment signal, which distinguishes different cell types while removing the species-induced batch effect between the same cell types; (3) A dual-branch network with a cell branch and a cell type branch is constructed to obtain explicit cell type representations, so that cell type relationships can be compared directly at multiple resolutions, supporting cell type tree construction and addressing the difficulty traditional methods have in characterizing cell type relationships at multiple levels; (4) A two-stage training strategy is adopted that first aligns the gene embeddings and then adds cell-level supervision: the alignment basis at the cross-species gene level is stabilized first, and the intra-species discrimination ability of the cell embeddings is enhanced afterwards, making the optimization process more stable and the convergence result more reliable;
[0025] (5) The model obtains aligned gene, cell, and cell type embeddings, which support multi-dimensional downstream tasks such as cross-species gene relationship analysis, embedding phenotypic association analysis, and cell type tree construction. It can characterize the lineage changes and divergence patterns of gene expression over the cross-species cell type hierarchy, providing an important basis for analyzing conserved and divergent gene regulatory programs.
[0026] The technical solution of the present invention will be further described in detail below with reference to the accompanying drawings and embodiments.
Description of the Drawings
[0027] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0028] Figure 1 is a flowchart of the method for cross-species single-cell transcriptome integration and cell type relationship inference of the present invention; Figure 2 shows the visualization of the cross-species joint cell embeddings after UMAP dimensionality reduction; Figure 3 shows a cross-species cell type tree constructed from the cell type embeddings; Figure 4 shows an example of gene expression differences on the cell type tree.
Detailed Description
[0029] The technical solution of the present invention will be further described below with reference to the accompanying drawings and embodiments.
[0030] To make the objectives, technical solutions, and advantages of this application clearer, more thorough, and more complete, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings and embodiments. The following detailed descriptions are all illustrations of embodiments, intended to provide further detailed explanation of the present invention. Unless otherwise specified, all technical terms used in this invention have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains.
[0031] The instruments, equipment, reagents, and materials used in the examples were all obtained commercially.
[0032] Example 1. A method for cross-species single-cell transcriptome integration and cell type relationship inference, whose process is shown in Figure 1. The specific steps are as follows: S1. Data acquisition: obtain, from public databases, single-cell transcriptome datasets with annotated cell types from multiple species and sets of one-to-one homologous genes between species.
[0033] S2. Preprocess the single-cell transcriptome dataset of each species: perform library size normalization and log1p transformation on the data of each species, and then select the set of highly variable genes to form the expression matrix of each species.
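The preprocessing in S2 can be sketched roughly as follows. This is a minimal illustration assuming NumPy; the function name `preprocess`, the `target_sum` scaling factor, the `n_top_genes` cutoff, and the plain variance ranking used to pick highly variable genes are all illustrative assumptions, since the description does not specify them.

```python
import numpy as np

def preprocess(counts, target_sum=1e4, n_top_genes=2000):
    """Library-size normalize, log1p-transform, and keep the most variable genes.

    counts: (cells x genes) raw count matrix.
    target_sum and n_top_genes are illustrative defaults, not values from the text.
    """
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=1, keepdims=True)
    lib_size[lib_size == 0] = 1.0              # avoid division by zero for empty cells
    norm = counts / lib_size * target_sum      # library size normalization
    logged = np.log1p(norm)                    # log1p transformation
    # Assumption: rank genes by variance as a stand-in for a dedicated HVG procedure.
    variances = logged.var(axis=0)
    n_keep = min(n_top_genes, logged.shape[1])
    hvg_idx = np.sort(np.argsort(variances)[::-1][:n_keep])
    return logged[:, hvg_idx], hvg_idx

# Toy data: 50 cells x 30 genes of Poisson counts.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(2.0, size=(50, 30))
expr, hvg_idx = preprocess(toy_counts, n_top_genes=10)
print(expr.shape)
```

The returned `expr` plays the role of the per-species expression matrix used in the later steps, with `hvg_idx` recording which gene columns were kept.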
[0034] S3. Construct a dual-branch network model for each species. The dual-branch network model includes a cell branch and a cell type branch, as follows: S31. The cell branch is used to extract cell-level embeddings from mini-batches of single-cell expression data of a single species, characterizing the position of each cell and intra-species variation in a shared latent space. The construction method is as follows: a randomly initialized matrix is constructed for each species as a species-specific learnable gene embedding matrix, whose rows correspond to different genes and whose columns correspond to different embedding dimensions. During model training, the gene embedding matrix is updated as a learnable parameter as the objective function is optimized, thereby progressively encoding cross-species transcriptional associations and intra-species structural information. In each training iteration, a mini-batch of data is sampled from the expression matrix obtained in S2, a random gene mask is generated and applied to the expression matrix, and the masked expression matrix is multiplied by the gene embedding matrix and then input into a cell multilayer perceptron (MLP) shared across species to obtain the cell-level embeddings: $Z_i^{\mathrm{cell}} = f_{\theta}(\tilde{X}_i E_i)$; where $f_{\theta}$ is the cell mapping function, $\theta$ denotes its learnable parameters, $\tilde{X}_i$ is the masked expression matrix of species $i$, and $E_i$ is the gene embedding matrix of species $i$.
[0035] In some embodiments, the cellular MLP consists of a multi-layer fully connected network and may include activation functions, normalization layers, and dropout layers to prevent overfitting.
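A forward pass through the cell branch might look like the sketch below, assuming NumPy. The layer sizes, the ReLU activation, the element-wise Bernoulli mask, and the names `init_mlp`, `mlp_forward`, `cell_branch`, and `mask_rate` are assumptions made for illustration; normalization and dropout layers are omitted for brevity, and no gradient update is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes, rng):
    """Weights and biases for a small fully connected network."""
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(x, params):
    """ReLU on hidden layers, linear output layer."""
    for k, (w, b) in enumerate(params):
        x = x @ w + b
        if k < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

def cell_branch(x_batch, gene_emb, mlp_params, mask_rate, rng):
    """Mask random entries, project through the gene embedding, apply the shared MLP."""
    keep = rng.random(x_batch.shape) >= mask_rate   # True where the value is kept
    x_masked = x_batch * keep                       # masked expression matrix
    return mlp_forward(x_masked @ gene_emb, mlp_params)

n_genes, d_emb, d_out = 20, 8, 4
gene_emb = rng.normal(0, 0.1, (n_genes, d_emb))   # species-specific, learnable in training
shared_mlp = init_mlp([d_emb, 16, d_out], rng)    # one MLP shared by all species
x_batch = rng.random((5, n_genes))                # mini-batch of 5 cells
z_cell = cell_branch(x_batch, gene_emb, shared_mlp, mask_rate=0.2, rng=rng)
print(z_cell.shape)
```

Because each species has its own `gene_emb` with its own number of rows, species with different gene sets can feed the same `shared_mlp`, which only sees the fixed embedding dimension.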
[0036] S32. The cell type branch is used to learn cell type-level embeddings that represent the prototype of each annotated cell type in the shared latent space, thereby supporting cell type relationship inference. The construction method is as follows: a learnable cell type prototype matrix is defined for each species, whose rows correspond to different cell types and whose columns correspond to the embedding dimensions. This matrix is updated as a learnable parameter during training. The cell type prototype matrix is then input into a cell type multilayer perceptron shared across species to obtain the cell type-level embeddings: $Z_i^{\mathrm{type}} = g_{\phi}(P_i)$; where $g_{\phi}$ is the cell type mapping function, $\phi$ denotes its learnable parameters, and $P_i$ is the learnable cell type prototype matrix of species $i$.
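The cell type branch can be illustrated in the same spirit. In this sketch (NumPy assumed), the species names, matrix sizes, and two-layer ReLU network are hypothetical; the point is only that each species holds its own prototype matrix while a single MLP is shared.

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp_forward(x, params):
    """ReLU on hidden layers, linear output layer."""
    for k, (w, b) in enumerate(params):
        x = x @ w + b
        if k < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

d_proto, d_hidden, d_out = 8, 16, 4
# Cell type MLP shared across species (sizes are illustrative).
type_mlp = [(rng.normal(0, 0.1, (d_proto, d_hidden)), np.zeros(d_hidden)),
            (rng.normal(0, 0.1, (d_hidden, d_out)), np.zeros(d_out))]

# One learnable prototype matrix per species; the number of rows follows
# each species' own annotation, so the two species may differ.
prototypes = {"species_a": rng.normal(0, 0.1, (6, d_proto)),
              "species_b": rng.normal(0, 0.1, (9, d_proto))}

type_emb = {name: mlp_forward(p, type_mlp) for name, p in prototypes.items()}
print({name: e.shape for name, e in type_emb.items()})
```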
[0037] S4. Construct cross-species gene-level alignment signals, as detailed below: S41. For species $i$, calculate the cell type average expression matrix $\bar{X}_i$: $\bar{X}_i[c, g] = \frac{1}{|C_{i,c}|} \sum_{n \in C_{i,c}} x_{g,n}$; $g \in G_i$; where $C_{i,c}$ is the set of cells in species $i$ labeled as cell type $c$, $x_{g,n}$ is the expression value of gene $g$ in cell $n$, $G_i$ is the set of highly variable genes of species $i$, and $c$ is a cell type.
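The averaging in S41 amounts to a per-label mean over cells. A small sketch, assuming NumPy and a hypothetical helper name `celltype_mean_expression`:

```python
import numpy as np

def celltype_mean_expression(expr, labels):
    """Mean expression per annotated cell type.

    expr: (cells x genes) preprocessed expression matrix.
    labels: one cell type label per cell.
    Returns a (cell types x genes) matrix and the sorted type names.
    """
    labels = np.asarray(labels)
    types = np.unique(labels)                       # sorted unique labels
    means = np.vstack([expr[labels == t].mean(axis=0) for t in types])
    return means, types

# Toy example: three cells, two genes, two cell types.
expr = np.array([[1.0, 0.0],
                 [3.0, 2.0],
                 [0.0, 4.0]])
labels = ["neuron", "neuron", "muscle"]
mean_expr, types = celltype_mean_expression(expr, labels)
print(types, mean_expr)
```

Here the two "neuron" cells average to `[2.0, 1.0]`, and the single "muscle" cell is its own mean.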
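[0038] S42. For any pair of species $(i, j)$, using the one-to-one homologous genes as anchors, cross-species cell type similarity is calculated from transcriptome data at cell type resolution: $S_{ij}[a, b] = \cos(\bar{x}_{i,a}, \bar{x}_{j,b})$; where $\cos(\cdot,\cdot)$ is the cosine similarity, $\bar{x}_{i,a}$ is the row of $\bar{X}_i$ corresponding to cell type $a$, and $\bar{x}_{j,b}$ is the row of $\bar{X}_j$ corresponding to cell type $b$, both restricted to the one-to-one homologous genes.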
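One way to compute the similarity in S42 is shown below, assuming NumPy. Representing the homolog set as a list of column-index pairs (`homolog_pairs`) is an assumption for illustration; the description does not fix a data structure.

```python
import numpy as np

def cross_species_similarity(mean_i, mean_j, homolog_pairs):
    """Cosine similarity between cell types of two species on homologous genes.

    homolog_pairs: list of (column index in species i, column index in species j)
    for one-to-one homologous genes (a hypothetical representation).
    """
    cols_i = [p[0] for p in homolog_pairs]
    cols_j = [p[1] for p in homolog_pairs]
    a = mean_i[:, cols_i]
    b = mean_j[:, cols_j]
    # Normalize rows; the small floor guards against all-zero rows.
    a = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
    b = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), 1e-12)
    return a @ b.T

# Species i has 3 genes (the third has no homolog); species j has 2 genes.
mean_i = np.array([[1.0, 0.0, 5.0],
                   [0.0, 1.0, 5.0]])
mean_j = np.array([[2.0, 0.0],
                   [0.0, 3.0]])
sim = cross_species_similarity(mean_i, mean_j, [(0, 0), (1, 1)])
print(sim)
```

The non-homologous third gene of species `i` is ignored at this step; it re-enters later through the gene association matrix.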
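[0039] S43. Stable cross-species cell type matching relationships are screened by mutual $k$-nearest neighbors: $M_{ij}[a, b] = 1$ if $b \in \mathcal{N}_k^{j}(a)$ and $a \in \mathcal{N}_k^{i}(b)$, and $M_{ij}[a, b] = 0$ otherwise; where $\mathcal{N}_k^{j}(a)$ denotes the set of indexes of the $k$ cell types in species $j$ that are most similar to cell type $a$ of species $i$, and $\mathcal{N}_k^{i}(b)$ is defined analogously.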
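The mutual nearest-neighbor screening in S43 can be sketched as follows, assuming NumPy. The function name `mutual_knn_matches` and the binary 0/1 output are illustrative assumptions.

```python
import numpy as np

def mutual_knn_matches(sim, k):
    """Binary matching matrix: 1 where two cell types are in each other's top-k."""
    n_i, n_j = sim.shape
    top_j = np.argsort(-sim, axis=1)[:, :k]   # for each row, its k best columns
    top_i = np.argsort(-sim, axis=0)[:k, :]   # for each column, its k best rows
    fwd = np.zeros(sim.shape, dtype=bool)
    bwd = np.zeros(sim.shape, dtype=bool)
    fwd[np.arange(n_i)[:, None], top_j] = True
    bwd[top_i, np.arange(n_j)[None, :]] = True
    return (fwd & bwd).astype(float)          # keep only mutual neighbors

sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.4]])
matches = mutual_knn_matches(sim, k=1)
print(matches)
```

In this toy case the third cell type of the second species prefers row 1, but row 1 prefers column 1, so that one-sided match is discarded.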
[0040] S44. Apply the cross-species cell type matching relationship to the cell type average expression matrices of the two species to obtain the cross-species gene association matrix: $A_{ij} = \bar{X}_i^{\top} M_{ij} \bar{X}_j$; where $\bar{X}_i$ and $\bar{X}_j$ are the cell type average expression matrices of species $i$ and $j$, and $M_{ij}$ is the cross-species cell type matching matrix of species $i$ and $j$.
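A plausible reading of S44 is a matrix product that links genes of the two species through matched cell types. The sketch below assumes that form (average expression of species `i`, transposed, times the matching matrix, times average expression of species `j`) and assumes NumPy; any additional normalization the method may use is not represented.

```python
import numpy as np

def gene_association(mean_i, mean_j, match):
    """Genes-by-genes association through matched cell types (assumed product form).

    mean_i: (types_i x genes_i), mean_j: (types_j x genes_j),
    match:  (types_i x types_j) binary matching matrix.
    """
    return mean_i.T @ match @ mean_j

mean_i = np.array([[1.0, 0.0, 2.0],
                   [0.0, 1.0, 0.0]])          # 2 cell types x 3 genes
mean_j = np.array([[1.0, 0.0],
                   [0.0, 2.0],
                   [5.0, 5.0]])               # 3 cell types x 2 genes
match = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
assoc = gene_association(mean_i, mean_j, match)
print(assoc)
```

Note that the result covers all genes of both species, including those without homologs, and that the unmatched third cell type of the second species contributes nothing.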
[0041] S5. Based on the combined loss function of gene level and cell level, the dual-branch network model of each species is trained in two stages.
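[0042] The gene-level loss function is: $\mathcal{L}_{\mathrm{gene}} = \lVert \hat{A}_{ij} - A_{ij} \rVert_F^2$; where $\hat{A}_{ij}$ denotes the gene embedding similarity matrix calculated from the gene embedding matrices $E_i$ and $E_j$, $A_{ij}$ is the cross-species gene association matrix, and $\lVert \cdot \rVert_F$ is the Frobenius norm.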
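To make the gene-level objective concrete, the sketch below assumes NumPy, assumes the embedding similarity is the inner product of the two gene embedding matrices, and assumes a squared Frobenius norm. Both choices are assumptions; the description only states that a Frobenius norm compares an embedding-derived similarity matrix with the association matrix.

```python
import numpy as np

def gene_level_loss(emb_i, emb_j, assoc):
    """Squared Frobenius distance between embedding similarity and the target.

    emb_i: (genes_i x d), emb_j: (genes_j x d), assoc: (genes_i x genes_j).
    Assumption: similarity is the plain inner product emb_i @ emb_j.T.
    """
    pred = emb_i @ emb_j.T
    return float(np.sum((pred - assoc) ** 2))

emb_i = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
emb_j = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
assoc = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 1.0]])
loss = gene_level_loss(emb_i, emb_j, assoc)
print(loss)
```

Only one entry of the predicted similarity differs from the target here (by 1), so the loss is that single squared error.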
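[0043] The cell-level loss function is: $\mathcal{L}_{\mathrm{cell}} = -\frac{1}{B} \sum_{n=1}^{B} \log \frac{\exp(\mathrm{sim}(z_{i,n}, t_{i,y_{i,n}}) / \tau)}{\sum_{c} \exp(\mathrm{sim}(z_{i,n}, t_{i,c}) / \tau)}$; where $z_{i,n}$ denotes the embedding of the $n$-th cell of species $i$, $t_{i,c}$ denotes the embedding of the $c$-th cell type of species $i$, $\mathrm{sim}(\cdot,\cdot)$ denotes similarity, $\tau$ is the temperature coefficient, $B$ is the batch size, and $y_{i,n}$ denotes the cell type label corresponding to the $n$-th cell of species $i$.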
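The listed symbols (cell embedding, cell type embedding, similarity, temperature, batch size, labels) suggest a temperature-scaled softmax over cell-to-prototype similarities. The sketch below assumes that form, assumes cosine similarity, and assumes NumPy; the function name and the default `tau` are illustrative.

```python
import numpy as np

def cell_level_loss(cell_emb, type_emb, labels, tau=0.1):
    """Mean cross-entropy of each cell against its own species' cell type embeddings.

    cell_emb: (batch x d), type_emb: (types x d), labels: integer type index per cell.
    Assumption: sim(.,.) is cosine similarity.
    """
    z = cell_emb / np.maximum(np.linalg.norm(cell_emb, axis=1, keepdims=True), 1e-12)
    t = type_emb / np.maximum(np.linalg.norm(type_emb, axis=1, keepdims=True), 1e-12)
    logits = (z @ t.T) / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    labels = np.asarray(labels)
    return float(-log_prob[np.arange(len(labels)), labels].mean())

cell_emb = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
type_emb = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
loss_correct = cell_level_loss(cell_emb, type_emb, [0, 1])
loss_swapped = cell_level_loss(cell_emb, type_emb, [1, 0])
print(loss_correct, loss_swapped)
```

Cells that sit on their labeled prototype give a loss near zero, while swapping the labels makes the loss large, which is the discrimination pressure the second training stage adds.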
[0044] During training, gene-level and cell-level losses are weighted and combined to form a joint optimization objective. Weighting coefficients are set to balance the contributions of cross-species alignment and intra-species discrimination. A two-stage training strategy is employed: the first stage utilizes only gene-level loss to align cross-species gene embeddings; the second stage introduces cell-level loss while maintaining gene-level alignment, enabling stronger cell type discrimination capabilities for intra-species cell embeddings.
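The two-stage schedule can be expressed as a simple switch on the training epoch. The following sketch uses hypothetical names (`joint_loss`, `stage1_epochs`, `w_gene`, `w_cell`) and illustrative default values; the description does not give the stage length or the weighting coefficients.

```python
def joint_loss(epoch, gene_loss, cell_loss, stage1_epochs=100, w_gene=1.0, w_cell=1.0):
    """Stage 1: gene-level loss only. Stage 2: weighted sum of both losses."""
    if epoch < stage1_epochs:
        return w_gene * gene_loss                      # align gene embeddings first
    return w_gene * gene_loss + w_cell * cell_loss     # then add cell-level supervision

# With a 10-epoch first stage, epoch 0 ignores the cell-level term
# and epoch 10 is the first epoch that includes it.
stage1_value = joint_loss(0, gene_loss=2.0, cell_loss=3.0, stage1_epochs=10)
stage2_value = joint_loss(10, gene_loss=2.0, cell_loss=3.0, stage1_epochs=10, w_cell=0.5)
print(stage1_value, stage2_value)
```

Keeping the gene-level term active in the second stage is what preserves the cross-species alignment while the cell embeddings are being sharpened.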
[0045] S6. Using the trained model of each species, obtain the aligned gene embeddings, cell embeddings, and cell type embeddings, which are used for cross-species gene relationship analysis, embedding phenotypic association analysis, and cell type tree construction.
[0046] Example 2. Using single-cell transcriptome data from frog and zebrafish embryos (https://figshare.com/s/6187811b6c3fae02a4d3?file=50608386) as experimental subjects, the method of Example 1 was used to learn the cross-species cell embedding space and construct a cell type tree. The results are shown in Figures 2-4.
[0047] Figure 2 shows the cross-species joint cell embeddings learned by the model of Example 1 after dimensionality reduction with UMAP, where each point represents a cell. The upper panel is colored by manually annotated cell type, and the lower panel is colored by species of origin. Cells from different species are well mixed within the embedding region corresponding to the same cell type, indicating that the method of Example 1 achieves consistent representation learning and data integration across species.
[0048] Figure 3 shows the cross-species cell type tree, which is inferred from the cell type embeddings by hierarchical clustering. Cell types are generally conserved across species at the transcriptional level, and the cell type tree clearly depicts the hierarchical lineage relationships of cell types, enabling multi-resolution comparison of cell type relationships.
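Building a tree from cell type embeddings by hierarchical clustering can be illustrated with a small agglomerative procedure. The sketch below assumes NumPy, Euclidean distance, and average linkage; the linkage and metric actually used are not stated in the description, and the embeddings and labels are toy values.

```python
import numpy as np

def average_linkage_tree(emb, names):
    """Agglomerative clustering (average linkage, Euclidean); returns nested tuples."""
    clusters = [{"members": [i], "tree": names[i]} for i in range(len(names))]
    # Pairwise Euclidean distances between all cell type embeddings.
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average distance between all members of the two clusters.
                d = dist[np.ix_(clusters[a]["members"], clusters[b]["members"])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = {"members": clusters[a]["members"] + clusters[b]["members"],
                  "tree": (clusters[a]["tree"], clusters[b]["tree"])}
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters[0]["tree"]

# Toy cell type embeddings pooled from two species (hypothetical labels).
emb = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
names = ["frog_neuron", "fish_neuron", "frog_muscle", "fish_muscle"]
tree = average_linkage_tree(emb, names)
print(tree)
```

In this toy case the matching cell types of the two species pair up first and the two pairs join at the root, which is the pattern a conserved cell type hierarchy would produce.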
[0049] Figure 4 shows an example of differential expression of Klf17 homologs, displayed on the same cell type tree topology as in Figure 3.
[0050] In summary, the method of Example 1 can characterize the lineage changes and divergence patterns of gene expression over the hierarchical structure of cell types across species, providing a basis for elucidating conserved and divergent regulatory programs.
[0051] Example 3. A cross-species single-cell transcriptome integration and cell type relationship inference analysis system for performing the method of Example 1 includes the following modules. The data acquisition module is used to acquire single-cell transcriptome datasets with annotated cell types from multiple species and sets of one-to-one homologous genes between species.
[0052] The data processing module is used to preprocess the single-cell transcriptome dataset of each species, performing library size normalization, log1p transformation, and selection of highly variable genes.
[0053] The model building module is used to construct a dual-branch network model for each species, including a cell branch and a cell type branch, generating gene embedding matrices and cell type prototype matrices respectively, and building a perceptron structure shared across species.
[0054] The signal construction module is used to construct cross-species gene-level alignment signals, and sequentially calculates the average expression matrix of cell types, cross-species cell type similarity, stable cell type matching relationships, and cross-species gene association matrix.
[0055] The model training module is used to perform two-stage training on the dual-branch network model based on a combination of gene-level and cell-level loss functions, thereby optimizing gene embedding, cell embedding, and cell type embedding.
[0056] The results analysis module is used to obtain aligned embedding data from the trained model and to conduct cross-species gene relationship analysis, embedding phenotypic association analysis, and cell type tree construction.
[0057] Each module is implemented through software programming and can be deployed on servers or workstations. It supports batch processing and analysis of single-cell transcriptome data from multiple species, providing efficient tool support for bioinformatics research.
[0058] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can still be made to the technical solutions of the present invention, and these modifications or equivalent substitutions cannot cause the modified technical solutions to deviate from the spirit and scope of the technical solutions of the present invention.
Claims
1. A method for cross-species single-cell transcriptome integration and cell type relationship inference, characterized in that it includes the following steps: S1. Obtain single-cell transcriptome datasets with annotated cell types from multiple species, and sets of one-to-one homologous genes between species; S2. Preprocess the single-cell transcriptome dataset of each species to form the expression matrix of each species; S3. Construct a dual-branch network model for each species, consisting of a cell branch and a cell type branch; S4. Construct cross-species gene-level alignment signals based on the cell type expression characteristics of each species; S5. Train the dual-branch network model in two stages, based on a joint loss function that combines a gene-level loss function and a cell-level loss function; S6. Using the trained dual-branch network model, obtain the aligned gene embeddings, cell embeddings, and cell type embeddings of each species, which are used for cross-species gene relationship analysis, embedding phenotypic association analysis, and cell type tree construction.
2. The method for cross-species single-cell transcriptome integration and cell type relationship inference according to claim 1, characterized in that: in S2, the preprocessing includes performing library size normalization and log1p transformation on the single-cell transcriptome dataset of each species, and selecting the set of highly variable genes.
3. The method for cross-species single-cell transcriptome integration and cell type relationship inference according to claim 2, characterized in that, in S3, the method for constructing the cell branch is as follows: a randomly initialized matrix is constructed for each species as a species-specific learnable gene embedding matrix. The rows of the gene embedding matrix correspond to different genes and the columns to different embedding dimensions. During model training, the gene embedding matrix is updated as a learnable parameter as the objective function is optimized. During each training iteration, a mini-batch of data is sampled from the expression matrix of the corresponding species obtained in S2, a random gene mask is generated and applied to the expression matrix, and the masked expression matrix is multiplied by the gene embedding matrix and then input into a cell multilayer perceptron shared across species to obtain cell-level embeddings. The expression for the cell-level embedding is: $Z_i^{\mathrm{cell}} = f_{\theta}(\tilde{X}_i E_i)$; where $f_{\theta}$ is the cell mapping function, $\theta$ denotes its learnable parameters, $\tilde{X}_i$ is the masked expression matrix of species $i$, and $E_i$ is the gene embedding matrix of species $i$.
4. The method for cross-species single-cell transcriptome integration and cell type relationship inference according to claim 3, characterized in that, in S3, the method for constructing the cell type branch is as follows: a learnable cell type prototype matrix is defined for each species, and the cell type prototype matrix is updated as a learnable parameter during model training. The cell type prototype matrix is input into a cell type multilayer perceptron shared across species to obtain cell type-level embeddings; the expression for the cell type-level embedding is: $Z_i^{\mathrm{type}} = g_{\phi}(P_i)$; where $g_{\phi}$ is the cell type mapping function, $\phi$ denotes its learnable parameters, and $P_i$ is the learnable cell type prototype matrix of species $i$.
5. The method for cross-species single-cell transcriptome integration and cell type relationship inference according to claim 4, characterized in that, in S4, the specific operations for constructing cross-species gene-level alignment signals include: S41. For species $i$, the cell type average expression matrix $\bar{X}_i$ is calculated using the following formula: $\bar{X}_i[c, g] = \frac{1}{|C_{i,c}|} \sum_{n \in C_{i,c}} x_{g,n}$; $g \in G_i$; where $C_{i,c}$ is the set of cells in species $i$ labeled as cell type $c$, $x_{g,n}$ is the expression value of gene $g$ in cell $n$, $G_i$ is the set of highly variable genes of species $i$, and $c$ is a cell type; S42. For any pair of species $(i, j)$, using the one-to-one homologous genes between the two species as anchors, cross-species cell type similarity is calculated from transcriptome data at cell type resolution; the expression for the cross-species cell type similarity is: $S_{ij}[a, b] = \cos(\bar{x}_{i,a}, \bar{x}_{j,b})$; where $\cos(\cdot,\cdot)$ is the cosine similarity, $\bar{x}_{i,a}$ is the row of $\bar{X}_i$ corresponding to cell type $a$, and $\bar{x}_{j,b}$ is the row of $\bar{X}_j$ corresponding to cell type $b$, both restricted to the one-to-one homologous genes; S43. Stable cross-species cell type matching relationships are screened by the mutual $k$-nearest-neighbor method. The expression is: $M_{ij}[a, b] = 1$ if $b \in \mathcal{N}_k^{j}(a)$ and $a \in \mathcal{N}_k^{i}(b)$, and $M_{ij}[a, b] = 0$ otherwise; where $\mathcal{N}_k^{j}(a)$ denotes the set of indexes of the $k$ cell types in species $j$ that are most similar to cell type $a$ of species $i$, and $\mathcal{N}_k^{i}(b)$ is defined analogously; S44. The matching relationship is applied to the cell type average expression matrices of the two corresponding species to obtain the cross-species gene association matrix; the expression for the cross-species gene association matrix is: $A_{ij} = \bar{X}_i^{\top} M_{ij} \bar{X}_j$; where $\bar{X}_i$ and $\bar{X}_j$ are the cell type average expression matrices of species $i$ and $j$, and $M_{ij}$ is the cross-species cell type matching matrix of species $i$ and $j$.
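6. The method for cross-species single-cell transcriptome integration and cell type relationship inference according to claim 5, characterized in that, in S5, the gene-level loss function is: $\mathcal{L}_{\mathrm{gene}} = \lVert \hat{A}_{ij} - A_{ij} \rVert_F^2$; where $\hat{A}_{ij}$ denotes the gene embedding similarity matrix calculated from the gene embedding matrices $E_i$ and $E_j$, $A_{ij}$ is the cross-species gene association matrix, and $\lVert \cdot \rVert_F$ is the Frobenius norm.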
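7. The method for cross-species single-cell transcriptome integration and cell type relationship inference according to claim 6, characterized in that, in S5, the cell-level loss function is: $\mathcal{L}_{\mathrm{cell}} = -\frac{1}{B} \sum_{n=1}^{B} \log \frac{\exp(\mathrm{sim}(z_{i,n}, t_{i,y_{i,n}}) / \tau)}{\sum_{c} \exp(\mathrm{sim}(z_{i,n}, t_{i,c}) / \tau)}$; where $z_{i,n}$ denotes the embedding of the $n$-th cell of species $i$, $t_{i,c}$ denotes the embedding of the $c$-th cell type of species $i$, $\mathrm{sim}(\cdot,\cdot)$ denotes similarity, $\tau$ is the temperature coefficient, $B$ is the batch size, and $y_{i,n}$ denotes the cell type label corresponding to the $n$-th cell of species $i$.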
8. The method for cross-species single-cell transcriptome integration and cell type relationship inference according to claim 7, characterized in that, in S5, the specific process of the two-stage training is as follows: in the first stage, only the gene-level loss function is used to train the dual-branch network model to achieve alignment of the gene embedding space across species. In the second stage, the gene-level loss function and the cell-level loss function are weighted and combined to form a joint optimization objective. While maintaining the alignment of the gene embedding space across species, the dual-branch network model is trained to enhance the cell type discrimination ability of the intra-species cell embeddings.
9. The method for cross-species single-cell transcriptome integration and cell type relationship inference according to claim 8, characterized in that: it also includes a cross-species single-cell transcriptome integration and cell type relationship inference analysis system for performing the method.
10. The method for cross-species single-cell transcriptome integration and cell type relationship inference according to claim 9, characterized in that the cross-species single-cell transcriptome integration and cell type relationship inference analysis system includes: a data acquisition module, used to perform the operations of S1 and acquire single-cell transcriptome datasets with annotated cell types from multiple species and sets of one-to-one homologous genes between species; a data processing module, used to perform the operations of S2 and preprocess the single-cell transcriptome dataset of each species; a model building module, used to perform the operations of S3 and build a dual-branch network model for each species; a signal construction module, used to perform the operations of S4 and construct cross-species gene-level alignment signals; a model training module, used to perform the operations of S5 and conduct two-stage training of the dual-branch network model; and a results analysis module, used to perform the operations of S6, obtain aligned gene embeddings, cell embeddings, and cell type embeddings, and conduct related analyses.