Single-cell transcriptome analysis method and device, electronic equipment and storage medium

By constructing a gene regulatory network and using multi-dimensional index analysis, the problems of nonlinear gene interactions and dynamic physiological state transitions in single-cell transcriptome analysis were solved, achieving highly accurate analysis results.

CN122201421A · Pending · Publication Date: 2026-06-12 · PEKING UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
PEKING UNIV
Filing Date
2026-02-13
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing technologies cannot accurately determine the nonlinear interactions between genes and the dynamic transformation of functional configuration under physiological states in single-cell transcriptome analysis, resulting in low accuracy of the analysis results.

Method used

By determining the gene embedding vector, a gene regulatory network is constructed, and gene variability indicators and regulatory indicators are used to quantify the changes in gene properties under different physiological states. Multidimensional indicators are used to analyze the distribution pattern and functional connectivity of genes.

Benefits of technology

It improves the accuracy of single-cell transcriptome analysis, enabling the quantification of gene attribute changes from multiple dimensions, identification of key genes and biological pathways, and providing precise physiological state analysis.

✦ Generated by Eureka AI based on patent content.


Abstract

Embodiments of the present application provide a single-cell transcriptome analysis method and device, electronic equipment and storage medium, relating to the field of bioinformatics technology, the method comprising: determining the embedding vector of each gene in each cell to be analyzed based on the transcriptome of each cell to be analyzed under a preset physiological state; obtaining the degree of change between the distribution of the embedding vector of each gene in each cell to be analyzed under different physiological states to obtain a gene variability index; constructing a gene regulation network of each cell to be analyzed under each physiological state based on the similarity of the embedding vector of each gene; obtaining a gene regulation index corresponding to each gene regulation subnetwork according to the similarity of the embedding vector of the corresponding genes of each two nodes in each gene regulation subnetwork obtained by dividing each gene regulation network; and determining an analysis result representing the attribute change of each cell to be analyzed under different physiological states based on the obtained gene variability index and gene regulation index. An analysis result with high accuracy can be obtained.

Description

Technical Field

[0001] This application relates to the field of bioinformatics, and in particular to a method, apparatus, electronic device, and storage medium for single-cell transcriptome analysis.

Background Technology

[0002] In neuroscience research, it is often necessary to analyze the transcriptome of single cells in the human brain to understand the gene expression regulation mechanisms and functions in cells, and then determine how genes affect the physiological functions of organisms, providing a basis for precision medicine and targeted intervention.

[0003] In related technologies, single-cell transcriptome analysis is often achieved by statistically analyzing gene expression levels in cells, such as through linear comparison or differential expression analysis (DEA). However, these methods can only detect increases or decreases in the expression levels of individual genes in cells, and cannot determine the complex nonlinear interactions between genes, nor can they characterize the dynamic transformation of gene functional configuration under different physiological states, resulting in low accuracy of the analytical results.

Summary of the Invention

[0004] The purpose of this application is to provide a method, apparatus, electronic device, and storage medium for single-cell transcriptome analysis, so as to obtain highly accurate single-cell transcriptome analysis results. The specific technical solution is as follows:

[0005] A first aspect of this application provides a method for single-cell transcriptome analysis, the method comprising:

[0006] Based on the transcriptome of each cell to be analyzed under a preset physiological state, the embedding vector of each gene in each cell to be analyzed is determined; the gene embedding vector characterizes the gene's properties.

[0007] The degree of variation in the distribution of the embedding vectors of each gene in each cell to be analyzed under different preset physiological states is obtained to obtain the gene variability index.

[0008] For each cell to be analyzed under each preset physiological state, a gene regulation network of the cell to be analyzed under that preset physiological state is constructed based on the similarity of the gene embedding vectors in the cell to be analyzed. The gene regulation network includes nodes that correspond one-to-one with genes. A connection exists between the nodes corresponding to two genes that meet the target conditions. The target conditions include: the similarity of the embedding vectors of the two genes is not less than the similarity threshold, and at least one of the two genes is among the preset number of genes whose embedding vectors are most similar to that of the other gene.

[0009] Each gene regulatory network is divided into gene regulatory sub-networks by clustering the nodes in each gene regulatory network. Gene regulatory indicators corresponding to each gene regulatory sub-network are obtained based on the similarity of the embedding vectors of the genes corresponding to each pair of nodes in each gene regulatory sub-network.

[0010] Based on the obtained gene variability indicators and gene regulation indicators, the analysis results that characterize the changes in attributes of each cell under different physiological states are determined.

[0011] Optionally, the gene regulatory networks of each cell to be analyzed under each preset physiological state can be constructed in the following manner:

[0012] Determine the similarity of the embedding vectors of every two genes in the cell to be analyzed, and establish the connection relationship between nodes corresponding to genes with similarity not less than the similarity threshold;

[0013] For each gene in the cell to be analyzed, the first number of genes with the highest similarity to the embedding vector of that gene are identified as the first nearest neighbor genes of that gene.

[0014] If the two genes are not each a first nearest neighbor gene of the other, delete the connection between the nodes corresponding to the two genes.

[0015] For each gene in the cell to be analyzed, the second number of genes with the highest similarity to the embedding vector of that gene are identified as the second nearest neighbor genes of that gene; the second number is less than the first number.

[0016] If one of the two genes is the second nearest neighbor of the other gene and there is no connection between the nodes corresponding to the two genes, establish the connection between the nodes corresponding to the two genes to obtain the gene regulatory network of the cell to be analyzed under the preset physiological state.
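The construction in [0012] to [0016] can be sketched as follows. This is a minimal illustration and not the applicant's implementation: it assumes cosine similarity, collapses the five steps into a single pass, and assumes that an edge restored through the second nearest neighbor rule must still satisfy the similarity threshold, in line with the target conditions of [0008]. The names `build_network`, `first_number` and `second_number` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def nearest(sim, gene, k):
    """Set of the k genes most similar to `gene`, excluding the gene itself."""
    others = [g for g in range(len(sim)) if g != gene]
    others.sort(key=lambda g: sim[gene][g], reverse=True)
    return set(others[:k])

def build_network(embeddings, threshold, first_number, second_number):
    """Return the set of undirected edges (i, j) with i < j."""
    n = len(embeddings)
    sim = [[cosine(embeddings[i], embeddings[j]) for j in range(n)]
           for i in range(n)]
    nn1 = [nearest(sim, g, first_number) for g in range(n)]
    nn2 = [nearest(sim, g, second_number) for g in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] < threshold:
                continue  # below the similarity threshold: never connected
            mutual = j in nn1[i] and i in nn1[j]    # bidirectional rule
            one_way = j in nn2[i] or i in nn2[j]    # unidirectional rule
            if mutual or one_way:
                edges.add((i, j))
    return edges

# Four toy genes with two-dimensional embeddings (made-up values).
toy = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (-1.0, 0.5)]
print(sorted(build_network(toy, threshold=0.0, first_number=2, second_number=1)))
```

In the toy data, genes 0 and 3 point in nearly opposite directions, so the threshold of 0 keeps them unconnected regardless of the neighbor rules.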

[0017] Optionally, before determining the analytical results characterizing the changes in attributes of each cell under different physiological states based on the obtained gene variability indicators and gene regulation indicators, the method further includes:

[0018] The preset physiological states of each cell to be analyzed were randomly reassigned multiple times.

[0019] The degree of change in the distribution of the embedding vector of each gene in each cell to be analyzed under a preset physiological state after each redistribution is obtained.

[0020] Determine the probability that the gene variability index occurs among the obtained degrees of change;

[0021] Determining the analysis results characterizing the changes in attributes of each cell to be analyzed under different physiological states, based on the obtained gene variability indicators and gene regulation indicators, includes:

[0022] If the determined probability is less than the probability threshold, the analysis results characterizing the changes in attributes of each cell under different physiological states are determined based on the obtained gene variability indicators and gene regulation indicators.
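The random reassignment in [0018] to [0022] corresponds to a permutation test, which can be sketched as below. This is an illustrative sketch under stated assumptions: the probability is taken as the fraction of reassignments whose degree of change is at least as large as the observed index, and `mean_gap` is a simple stand-in statistic on scalar values, not the gene variability index of this application. All function names are hypothetical.

```python
import random

def permutation_p_value(values, labels, statistic, n_permutations=1000, seed=0):
    """Return (observed statistic, fraction of shuffles reaching it)."""
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # randomly reassign physiological states
        if statistic(values, shuffled) >= observed:
            hits += 1
    return observed, hits / n_permutations

def mean_gap(values, labels):
    """Stand-in statistic: absolute difference of the two group means."""
    a = [v for v, s in zip(values, labels) if s == "A"]
    b = [v for v, s in zip(values, labels) if s == "B"]
    return abs(sum(a) / len(a) - sum(b) / len(b))

values = [0.1, 0.2, 0.15, 0.05, 5.0, 5.1, 4.9, 5.2]
labels = ["A"] * 4 + ["B"] * 4
observed, p = permutation_p_value(values, labels, mean_gap)
probability_threshold = 0.05  # hypothetical threshold
print(observed, p, p < probability_threshold)
```

Only when the probability falls below the threshold would the analysis of [0022] proceed with the observed index.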

[0023] Optionally, determining the embedding vector of each gene in each cell to be analyzed based on the transcriptome of each cell under a preset physiological state includes:

[0024] For each cell to be analyzed under a preset physiological state, a third number of genes with the highest expression levels are selected from the transcriptome. Sequence elements corresponding to the selected genes are generated based on their expression levels and gene identities. The sequence elements corresponding to the selected genes are arranged in descending order of their expression levels to obtain the input sequence. The obtained input sequence is then input into a pre-trained feature extraction model to obtain the gene embedding vectors of the cells to be analyzed, which are output by the feature extraction model. The feature extraction model is trained using the transcriptome of each sample cell, the gene expression levels of each sample cell, and the sample embedding vectors.
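The input sequence construction described in [0024] can be illustrated with a short sketch. The dictionary layout of each sequence element, the `rank` field, and the gene names are assumptions made for illustration; the application does not specify the element format.

```python
def build_input_sequence(expression, third_number):
    """Select the `third_number` most highly expressed genes and order them.

    `expression` maps a gene identifier to its expression level.
    Each sequence element carries the gene identity, its expression
    level, and its rank position in the descending ordering.
    """
    top = sorted(expression.items(), key=lambda kv: kv[1], reverse=True)
    top = top[:third_number]
    return [
        {"gene": gene, "expression": level, "rank": rank}
        for rank, (gene, level) in enumerate(top)
    ]

# Hypothetical expression levels for one cell.
cell = {"GENE_A": 2.0, "GENE_B": 7.5, "GENE_C": 0.0, "GENE_D": 4.1}
for element in build_input_sequence(cell, third_number=3):
    print(element)
```

The resulting list would then be passed to the feature extraction model, which is outside the scope of this sketch.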

[0025] Optionally, the feature extraction model is trained through the following steps:

[0026] Obtain the transcriptome of each sample cell and the sample embedding vector of the genes in each sample cell;

[0027] From the transcriptome of each sample cell, select a third number of genes with the highest expression levels. Based on the expression level and gene identity of the selected genes, generate the corresponding sequence elements of the selected genes. Arrange the sequence elements of the selected genes in descending order of their expression levels to obtain the sample input sequence.

[0028] Input the sample input sequence into the feature extraction model of the initial structure to obtain the predicted embedding vector of the gene in each sample cell output by the feature extraction model;

[0029] By utilizing the difference between the predicted embedding vector and the sample embedding vector, the parameters of the initial structure feature extraction model are adjusted until the preset convergence condition is met, thus obtaining the trained feature extraction model.
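The training procedure in [0026] to [0029] follows the usual pattern of predicting, measuring the difference from the target, and adjusting parameters until convergence. The sketch below shows that pattern only. It swaps in a tiny linear map trained by gradient descent on mean squared error in place of the feature extraction model, treats each input as a plain numeric vector rather than a gene sequence, and uses a loss-change tolerance as the convergence condition; all of these are assumptions.

```python
def predict(weights, x):
    """Linear stand-in model: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def mse(weights, inputs, targets):
    """Mean squared difference between predicted and sample embeddings."""
    total, count = 0.0, 0
    for x, t in zip(inputs, targets):
        for p, ti in zip(predict(weights, x), t):
            total += (p - ti) ** 2
            count += 1
    return total / count

def train(inputs, targets, lr=0.1, tol=1e-10, max_steps=10000):
    d_in, d_out = len(inputs[0]), len(targets[0])
    weights = [[0.0] * d_in for _ in range(d_out)]  # initial structure
    prev = mse(weights, inputs, targets)
    loss = prev
    for _ in range(max_steps):
        grads = [[0.0] * d_in for _ in range(d_out)]
        for x, t in zip(inputs, targets):
            p = predict(weights, x)
            for o in range(d_out):
                err = p[o] - t[o]
                for i in range(d_in):
                    grads[o][i] += 2 * err * x[i] / (len(inputs) * d_out)
        for o in range(d_out):
            for i in range(d_in):
                weights[o][i] -= lr * grads[o][i]  # adjust parameters
        loss = mse(weights, inputs, targets)
        if abs(prev - loss) < tol:  # assumed convergence condition
            break
        prev = loss
    return weights, loss

inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [[2.0, -1.0], [0.5, 3.0], [2.5, 2.0]]  # made-up sample embeddings
weights, loss = train(inputs, targets)
print(weights, loss)
```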

[0030] Optionally, obtaining the transcriptome of each sample cell and the sample embedding vector of genes in each sample cell includes:

[0031] From the pre-constructed sample dataset, obtain the transcriptome of each sample cell and the sample embedding vector of the gene in each sample cell;

[0032] The sample dataset is constructed through the following steps:

[0033] Obtain the transcriptome of each sample cell and the original attributes of genes in each sample cell;

[0034] The original attributes of genes in each sample cell are input into the large language model so that the large language model can determine the position of the original attributes of genes in each sample cell in the pre-established standard anatomical framework. Based on the determined position, an embedding vector is generated and output. The output embedding vector is used as the sample embedding vector of genes in each sample cell.

[0035] Optionally, the gene embedding vector includes: an identity embedding representing the gene's identity, an expression embedding representing the gene's expression level, and an order embedding representing the rank position of the gene's expression level.

[0036] A second aspect of this application also provides a single-cell transcriptome analysis device, the device comprising:

[0037] The embedding vector determination module is used to determine the embedding vector of each gene in each cell to be analyzed based on the transcriptome of each cell under a preset physiological state; the embedding vector of a gene represents the attribute of the gene.

[0038] The gene variability index acquisition module is used to obtain the degree of change in the distribution of the embedding vectors of each gene in each cell to be analyzed under different preset physiological states, and to obtain the gene variability index.

[0039] The gene regulation network construction module is used to construct a gene regulation network for each cell to be analyzed under each preset physiological state, based on the similarity of the embedding vectors of genes in the cell to be analyzed. The gene regulation network includes nodes that correspond one-to-one with genes. A connection exists between the nodes corresponding to two genes that meet the target conditions. The target conditions include: the similarity of the embedding vectors of the two genes is not less than the similarity threshold, and at least one of the two genes is among the preset number of genes whose embedding vectors are most similar to that of the other gene.

[0040] The gene regulation index acquisition module is used to divide each gene regulation network by clustering the nodes in each gene regulation network to obtain gene regulation sub-networks. Based on the similarity of the embedding vectors of the genes corresponding to each pair of nodes in each gene regulation sub-network, the gene regulation index corresponding to each gene regulation sub-network is obtained.

[0041] The analysis results determination module is used to determine the analysis results characterizing the changes in attributes of each cell under different physiological states based on the obtained gene variability indicators and gene regulation indicators.

[0042] A third aspect of the embodiments of this application also provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;

[0043] Memory, used to store computer programs;

[0044] The processor, when executing a program stored in memory, implements any of the single-cell transcriptome analysis methods described above.

[0045] In a fourth aspect of this application, a computer-readable storage medium is provided, wherein a computer program is stored therein, and when executed by a processor, the computer program implements any of the single-cell transcriptome analysis methods described above.

[0046] This application also provides a computer program product containing instructions that, when run on a computer, cause the computer to execute any of the single-cell transcriptome analysis methods described above.

[0047] Beneficial effects of the embodiments in this application:

[0048] The single-cell transcriptome analysis method provided in this application can determine the embedding vectors characterizing the attributes of genes in the cells under preset physiological conditions using the transcriptomes of each cell to be analyzed. Furthermore, it can calculate the degree of variation in the distribution of embedding vectors of each gene in each cell under different preset physiological conditions, thus obtaining a gene variability index. Correspondingly, the gene variability index reflects the degree of variation in the distribution pattern of genes under different physiological conditions; the larger the gene variability index, i.e., the greater the degree of variation, the greater the difference in the distribution pattern of genes under different physiological conditions.

[0049] Furthermore, the similarity of gene embedding vectors in cells under various preset physiological states can be used to construct a gene regulation network for the cells under those preset physiological states. Each node in the gene regulation network corresponds one-to-one with a gene. The similarity of the embedding vectors of genes corresponding to two connected nodes is not less than a similarity threshold; that is, a high similarity in the embedding vectors of genes corresponding to two connected nodes ensures a high correlation between the genes, thus avoiding noise interference. Moreover, at least one gene among the genes corresponding to two connected nodes belongs to the top preset number of genes with the highest similarity in embedding vectors of the other gene. This further constrains the connection relationships between nodes, further ensuring a high correlation between the two connected genes, and thus enabling the constructed gene regulation network to accurately reflect the tightness of functional connections between genes.

[0050] After constructing the gene regulation networks, each network can be divided into sub-networks by clustering the nodes within that network. For each sub-network, the gene regulation index can be obtained based on the similarity of the embedding vectors of the genes corresponding to each pair of nodes. Correspondingly, the gene regulation index of each sub-network under different preset physiological states can reflect the changes in that sub-network under these states. A larger gene regulation index, i.e., higher similarity, indicates a stronger functional connectivity within the sub-network.

[0051] Furthermore, by combining the obtained gene variability indicators and gene regulation indicators, the analytical results characterizing the changes in attributes of each cell under different physiological states can be determined. In this way, changes in gene distribution and functional connectivity can be quantified using multi-dimensional indicators, i.e., the state shifts of genes under different physiological states can be quantified. Correspondingly, based on quantified multi-dimensional indicators, changes in gene attributes can be analyzed from multiple dimensions, yielding highly accurate analytical results.

[0052] Of course, implementing any product or method of this application does not necessarily require achieving all of the advantages described above at the same time.

Attached Figure Description

[0053] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other embodiments can be obtained based on these drawings.

[0054] Figure 1 is a schematic flowchart of a single-cell transcriptome analysis method provided in an embodiment of this application;

[0055] Figure 2 is a schematic diagram of the permutation test in a single-cell transcriptome analysis method provided in an embodiment of this application;

[0056] Figure 3 is a schematic diagram of the structure of a single-cell transcriptome analysis device provided in an embodiment of this application;

[0057] Figure 4 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0058] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art based on this application are within the scope of protection of this application.

[0059] To obtain highly accurate single-cell transcriptome analysis results, this application provides a single-cell transcriptome analysis method. Referring to Figure 1, the method may include:

Step S101: Based on the transcriptome of each cell to be analyzed under a preset physiological state, determine the embedding vector of each gene in each cell to be analyzed; the gene embedding vector characterizes the gene's attributes;

Step S102: Obtain the degree of variation in the distribution of the embedding vectors of each gene in each cell to be analyzed under different preset physiological states, and obtain a gene variability index;

Step S103: For each cell to be analyzed under each preset physiological state, construct a gene regulation network for the cell to be analyzed under that preset physiological state based on the similarity of the gene embedding vectors in that cell; the gene regulation network includes nodes that correspond one-to-one with genes, and a connection exists between the nodes corresponding to two genes that meet the target conditions; the target conditions include: the similarity of the embedding vectors of the two genes is not less than the similarity threshold, and at least one of the two genes is among the preset number of genes whose embedding vectors are most similar to that of the other gene;

Step S104: Divide each gene regulation network by clustering the nodes in each gene regulation network to obtain gene regulation sub-networks, and obtain the gene regulation index corresponding to each gene regulation sub-network based on the similarity of the embedding vectors of the genes corresponding to each pair of nodes in each gene regulation sub-network;

Step S105: Based on the obtained gene variability index and gene regulation index, determine the analysis results that characterize the attribute changes of each cell to be analyzed under different physiological states.

[0060] Based on the above processing, the embedding vectors characterizing the attributes of genes in the cells under preset physiological conditions can be determined using the transcriptome of each cell under preset physiological conditions. Furthermore, the degree of variation in the distribution of embedding vectors of each gene in each cell under different preset physiological conditions can be calculated to obtain a gene variability index. Correspondingly, the gene variability index reflects the degree of variation in the distribution of genes under different physiological conditions; the larger the gene variability index, the greater the degree of variation, indicating a greater difference in the distribution of genes under different physiological conditions. The similarity of the embedding vectors of genes in each cell under preset physiological conditions can also be used to construct a gene regulatory network for that cell under that preset physiological condition. Nodes in the gene regulatory network correspond one-to-one with genes. The similarity of the embedding vectors of genes corresponding to two connected nodes is not less than a similarity threshold; that is, a high similarity in the embedding vectors of genes corresponding to two connected nodes results in a high correlation between the genes corresponding to the connected nodes, thus avoiding noise interference. Furthermore, if at least one gene in the genes corresponding to two nodes with a connection relationship belongs to the top preset number of genes with the highest similarity in the embedding vector of the other gene, the connection relationship of the nodes can be further constrained, making the correlation between two genes with a connection relationship higher. This allows the constructed gene regulatory network to accurately reflect the degree of functional connection between each gene.

[0061] After constructing the gene regulatory networks, each network can be divided into sub-networks by clustering the nodes within that network. For each sub-network, the gene regulatory index can be obtained based on the similarity of the embedding vectors of the genes corresponding to each pair of nodes. Correspondingly, the gene regulatory index of each sub-network under different preset physiological states reflects the changes in that sub-network under these states; a larger index, indicating higher similarity, signifies stronger functional connectivity within the sub-network. Furthermore, by combining the obtained gene variability index and gene regulatory index, the analytical results characterizing the attribute changes of each cell under different physiological states can be determined. Thus, through multi-dimensional indicators, the changes in gene distribution and functional connectivity can be quantified, i.e., the state shifts of genes under different physiological states can be quantified. Consequently, based on these quantified multi-dimensional indicators, gene attribute changes can be analyzed from multiple dimensions, yielding highly accurate analytical results.
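The division into sub-networks and the per-sub-network index can be sketched as follows. Two assumptions are made here because the text above does not fix them: connected components are used as a simple stand-in for the clustering step, and the gene regulatory index is taken as the mean pairwise cosine similarity within a sub-network (0.0 for a single-gene sub-network). Function names are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def connected_components(n, edges):
    """Stand-in clustering: group nodes that are reachable from each other."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            node = stack.pop()
            comp.append(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(sorted(comp))
    return components

def regulation_index(embeddings, members):
    """Assumed index: mean cosine similarity over all gene pairs."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return 0.0
    return sum(cosine(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)

# Made-up embeddings and edges for five genes.
emb = [(1.0, 0.0), (1.0, 0.1), (0.0, 1.0), (0.1, 1.0), (-1.0, -1.0)]
for sub in connected_components(len(emb), {(0, 1), (2, 3)}):
    print(sub, regulation_index(emb, sub))
```

Computing the index for the same sub-network under two physiological states and comparing the values would then indicate how its functional connectivity changes.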

[0062] Regarding step S101, preset physiological states can be set as needed without specific limitations. There can be multiple preset physiological states; for example, preset physiological states may include different age states, different gender states, healthy states and disease states, as well as states where a specified drug is being used and states where a specified drug is not being used. The cells currently to be analyzed can be called the cells to be analyzed. There can be multiple cells to be analyzed, and the specific cell types can be set as needed without specific limitations. For example, in a neuroscience research scenario, the cells to be analyzed could be human brain cells.

[0063] For each cell to be analyzed, the transcriptome of that cell under each preset physiological state can be pre-acquired. The acquired transcriptome can then be used to determine the embedding vectors of each gene in each cell. The gene embedding vectors characterize gene attributes. The dimensions of the gene embedding vectors can be set as needed and are not specifically limited. The gene attributes characterized by the gene embedding vectors can also be set as needed; for example, gene attributes may include: gene identification, gene expression level, gene expression ranking position, gene function, etc. Elements of different dimensions in the gene embedding vectors can characterize different types of gene attributes. For example, a pre-trained deep learning model can be used to process the transcriptome of each cell under preset physiological states to obtain the embedding vectors of each gene in each cell. For details, please refer to the description of the feature extraction model in subsequent embodiments.

[0064] For step S102, for each preset physiological state, the distribution of the embedding vector of each gene in each cell to be analyzed under that preset physiological state can be determined. Furthermore, the degree of variation in the distribution of the embedding vector of each gene in each cell to be analyzed under different preset physiological states can be determined, yielding a gene variability index. For example, the Sliced Wasserstein Distance (SWD) algorithm can be used to calculate the sliced Wasserstein distance between the distributions of the embedding vector of each gene in each cell to be analyzed under different preset physiological states, i.e., the degree of variation between distributions, as a gene variability index (also known as a gene variability score).
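A sliced Wasserstein distance between two sets of embedding vectors can be approximated by projecting both sets onto random directions and averaging the one-dimensional Wasserstein distances of the projections. The sketch below is a minimal standard-library illustration, not the applicant's implementation. It assumes a Monte Carlo estimate over `n_projections` random directions and compares the projected samples on a fixed quantile grid, which is exact for equal sample sizes that divide the grid and approximate otherwise.

```python
import math
import random

def random_unit_vector(dim, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def wasserstein_1d(a, b, grid=100):
    """1-D Wasserstein-1 distance via empirical quantiles on a fixed grid."""
    a, b = sorted(a), sorted(b)
    total = 0.0
    for k in range(grid):
        q = (k + 0.5) / grid
        total += abs(a[int(q * len(a))] - b[int(q * len(b))])
    return total / grid

def sliced_wasserstein(x, y, n_projections=64, seed=0):
    """Average 1-D distance over random projection directions."""
    rng = random.Random(seed)
    dim = len(x[0])
    total = 0.0
    for _ in range(n_projections):
        d = random_unit_vector(dim, rng)
        px = [sum(c * w for c, w in zip(p, d)) for p in x]
        py = [sum(c * w for c, w in zip(p, d)) for p in y]
        total += wasserstein_1d(px, py)
    return total / n_projections

# Made-up embeddings of one gene across 50 cells in two states:
# state_b is state_a shifted along the first axis.
rng = random.Random(1)
state_a = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(50)]
state_b = [[p[0] + 3.0, p[1], p[2]] for p in state_a]
print(sliced_wasserstein(state_a, state_b))
```

A larger value indicates a larger shift of the gene's embedding distribution between the two states, matching the interpretation of the gene variability index above.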

[0065] Taking a preset physiological state including different age states as an example, different age intervals can be pre-defined as different age states. The specific division method can be set as needed and is not specifically limited. For example, 0-18 years old is the first age state, 18-49 years old is the second age state, 50-75 years old is the third age state, and 75 years old and above is the fourth age state. For each pair of adjacent age states (e.g., the first age state and the second age state), the degree of change in the distribution of the embedding vector of each gene in each cell to be analyzed under the two adjacent age states is calculated to obtain the gene variability index. Correspondingly, the obtained gene variability index can represent the degree of change in the distribution pattern of each gene in each cell to be analyzed with age, and thus it is possible to analyze the continuous nonlinear drift of gene state in cells throughout the entire life cycle based on the gene variability index.

[0066] Correspondingly, the gene variability index quantifies the geometric displacement of a gene's distribution in vector space as the physiological state changes, reflecting how the gene's distribution changes with physiological state. Embedding vectors are often high-dimensional; using the sliced Wasserstein distance mitigates the curse of dimensionality, in which Euclidean distance loses discriminative power in high-dimensional spaces. Furthermore, the sliced Wasserstein distance, which is grounded in optimal transport theory, can sensitively capture geometric changes in gene distribution in vector space (such as manifold distortion). Based on this, changes in genes during continuous biological processes such as aging and development can be captured accurately, for example, by identifying changes in aging-sensitive regions (such as the dentate gyrus).

[0067] Regarding step S103, for each cell to be analyzed under each preset physiological state, the similarity between the embedding vectors of each gene in the cell to be analyzed can be determined, thus determining the correlation between the genes in the cell to be analyzed. The similarity between two embedding vectors can be determined in the following ways: In one way, the Euclidean distance between the two embedding vectors can be calculated, and the similarity between the two embedding vectors can be obtained based on the calculated Euclidean distance, where the similarity between the two embedding vectors is negatively correlated with the Euclidean distance; In another way, the cosine similarity between the two embedding vectors can be calculated as the similarity between the two embedding vectors.
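The two similarity options can be written down directly. In this sketch the mapping from Euclidean distance to similarity, 1 / (1 + distance), is an assumption chosen only because it decreases as the distance grows; the text above requires a negative correlation but does not name a specific function.

```python
import math

def euclidean_similarity(u, v):
    """Similarity that decreases with Euclidean distance (assumed form)."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return 1.0 / (1.0 + distance)

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

gene_1 = [1.0, 2.0]
gene_2 = [2.0, 4.0]   # same direction as gene_1, different magnitude
print(euclidean_similarity(gene_1, gene_2), cosine_similarity(gene_1, gene_2))
```

The example pair shows the practical difference: cosine similarity ignores magnitude and treats the two vectors as nearly identical, while the distance-based similarity does not.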

[0068] Accordingly, using the determined similarity, each gene in the cell to be analyzed can be treated as a node, and the connection relationships between the nodes corresponding to each gene in the cell to be analyzed can be determined, that is, the edges between the nodes corresponding to each gene can be determined, so as to construct the gene regulatory network of the cell to be analyzed under the preset physiological state. The gene regulatory network consists of the nodes corresponding to the genes and the connection relationships (i.e., edges) between the nodes, with a one-to-one correspondence between nodes and genes. Accordingly, the gene regulatory network is the topological graph of the nodes corresponding to each gene.

[0069] In gene regulatory networks, a connection exists between nodes corresponding to two genes that meet certain target conditions. These target conditions include: the similarity of the embedding vectors of the two genes is not less than a similarity threshold, and at least one of the two genes belongs to the top pre-defined number of genes whose embedding vectors have the highest similarity to the other gene. In other words, a higher similarity in the embedding vectors of genes corresponding to two connected nodes ensures a higher correlation between the genes and avoids noise interference. The similarity threshold can be set as needed and is not specifically limited. For example, when the similarity of gene embedding vectors is the cosine similarity between the embedding vectors, the similarity threshold can be 0 to perform non-negative truncation on nodes corresponding to genes with negative embedding vector similarity, i.e., connections are not established between nodes corresponding to genes with negative embedding vector similarity.

[0070] Furthermore, if at least one gene in the genes corresponding to two nodes with a connection relationship belongs to the top preset number of genes with the highest similarity in the embedding vector of the other gene, the connection relationship of the nodes can be further constrained, making the correlation between two genes with a connection relationship higher. This allows the constructed gene regulatory network to accurately reflect the degree of functional connection between each gene, improving the robustness of subsequent analysis based on the gene regulatory network.

[0071] The preset number may include a first number and a second number in subsequent embodiments, where the second number is less than the first number. Correspondingly, for the genes corresponding to two nodes with a connection, each gene belongs to the first number of genes with the highest similarity to the embedding vector of the other gene, or one gene belongs to the second number of genes with the highest similarity to the embedding vector of the other gene. Taking gene 1 and gene 2 as an example, if there is a connection between the nodes corresponding to gene 1 and gene 2, then gene 1 belongs to the first number of genes with the highest similarity to the embedding vector of gene 2, and gene 2 also belongs to the first number of genes with the highest similarity to the embedding vector of gene 1; or, gene 1 belongs to the second number of genes with the highest similarity to the embedding vector of gene 2, regardless of whether gene 2 belongs to the second number of genes with the highest similarity to the embedding vector of gene 1; or, gene 2 belongs to the second number of genes with the highest similarity to the embedding vector of gene 1, regardless of whether gene 1 belongs to the second number of genes with the highest similarity to the embedding vector of gene 2.

[0072] Thus, when each of the two genes belongs to the first number of genes with the highest similarity to the other gene's embedding vector, a bidirectional constraint ensures a high correlation between the two connected genes. Alternatively, either gene can belong to the second number of genes with the highest similarity to the other gene's embedding vector, where the second number is less than the first number. Combining the two conditions uses the bidirectional constraint to ensure a high correlation between connected genes, while the stricter unidirectional constraint preserves strongly correlated connections and improves the overall connectivity of the gene regulatory network. This enables the constructed gene regulatory network to more accurately reflect the tightness of functional connections between genes. The specific first and second numbers can be set as needed and are not specifically limited.

[0073] Regarding step S104, for each gene regulatory network, the nodes in the gene regulatory network can be clustered to divide the gene regulatory network into gene regulatory sub-networks. For example, for each gene regulatory network, the Leiden algorithm (a community detection algorithm) can be used to continuously iterate and optimize to maximize the modularity, dividing the gene regulatory network into multiple dense gene communities (i.e., gene regulatory sub-networks), thus obtaining multi-scale hierarchical gene regulatory sub-networks.

[0074] After obtaining the gene regulatory subnetwork, for each subnetwork, the similarity of the embedding vectors of genes corresponding to every two nodes can be calculated to obtain the gene regulatory index corresponding to that subnetwork. For example, the similarity of the embedding vectors of genes corresponding to every two nodes in the subnetwork can be calculated, and the weighted sum of the obtained similarities can be used as the gene regulatory index corresponding to that subnetwork. The gene regulatory index corresponding to each subnetwork under different preset physiological states can then reflect the changes of the gene regulatory subnetwork under different preset physiological states. The larger the gene regulatory index, that is, the higher the similarity, the stronger the functional connectivity of the gene regulatory subnetwork.
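As a hedged illustration of the weighted sum described above, the sketch below computes a regulation index for one subnetwork from a precomputed similarity matrix. The function name, the example matrix, and the default of uniform weights (which reduces the weighted sum to the mean pairwise similarity) are assumptions; the text does not specify how the weights are chosen.

```python
from itertools import combinations


def regulation_index(members, sim, weights=None):
    """Weighted sum of pairwise similarities inside one subnetwork.

    members: iterable of gene indices belonging to the subnetwork.
    sim:     symmetric matrix (list of lists) of embedding similarities.
    weights: optional dict mapping (i, j) with i < j to a weight.
             Uniform weights are assumed when none are given.
    """
    pairs = list(combinations(sorted(members), 2))
    if not pairs:
        return 0.0  # a single-gene subnetwork has no pairs to score
    if weights is None:
        weights = {pair: 1.0 / len(pairs) for pair in pairs}
    return sum(weights[(i, j)] * sim[i][j] for i, j in pairs)


# Made-up similarity values for four genes under one physiological state.
example_sim = [
    [1.0, 0.9, 0.8, 0.5],
    [0.9, 1.0, 0.7, 0.1],
    [0.8, 0.7, 1.0, 0.2],
    [0.5, 0.1, 0.2, 1.0],
]
index_state_a = regulation_index({0, 1, 2}, example_sim)
```

Computing the same index for the same subnetwork under another physiological state and comparing the two values gives the kind of cross-state comparison the paragraph describes.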

[0075] Correspondingly, gene regulatory indicators can be used to quantify the tightness of functional connections between genes under different physiological states. Comparing gene regulatory indicators under different physiological states allows us to determine the impact of changes in physiological states on the tightness of gene functional connections. Thus, starting from the geometric clustering degree in vector space, we can identify gene regulatory subnetworks, enabling hierarchical assessment of the functional coordination of biological pathways. This allows us to identify latent key biological pathways where expression levels do not change significantly, but network connectivity patterns are significantly restructured—for example, gene regulatory subnetworks with significantly enhanced or dissociated functional connections. This provides a novel perspective beyond traditional differential expression analysis for the mechanistic study of complex diseases (such as neurodegenerative diseases).

[0076] Regarding step S105, the obtained gene variability indicators and gene regulation indicators can be used to determine the analytical results characterizing the changes in attributes of each cell under different physiological states. That is, through multi-dimensional indicators, the changes in gene distribution patterns and gene functional connections can be quantified, i.e., the state shifts of genes under different physiological states can be quantified. Correspondingly, based on quantified multi-dimensional indicators, it is possible to analyze gene attribute changes from multiple dimensions, enabling functional state manifold and gene state drift analysis, and identifying key genes or biological pathways that cannot be determined based on statistical analysis of expression levels, thus obtaining highly accurate analytical results.

[0077] By analyzing gene state drift, it is possible to identify genes whose expression levels remain stable but undergo significant state shifts with changes in physiological state, such as AKT2 (a gene). Unlike traditional expression-based analysis methods that only focus on the "quantity" of genes, gene state drift allows us to perceive the "function" of genes. It quantifies complex biological state evolution, achieving a dimensional shift from expression level to functional state, and can identify key genes missed by differential expression analysis.

[0078] In some embodiments, the gene regulatory network of each cell to be analyzed under each preset physiological state is constructed in the following manner: the similarity of the embedding vectors of every two genes in the cell to be analyzed is determined, and a connection relationship is established between the nodes corresponding to genes whose similarity is not less than a similarity threshold; for each gene in the cell to be analyzed, the first number of genes with the highest similarity to the embedding vector of the gene are determined as the first nearest neighbor genes of the gene; if the two genes are not each a first nearest neighbor gene of the other (that is, at least one of them is not a first nearest neighbor gene of the other), the connection relationship between the nodes corresponding to the two genes is deleted; for each gene in the cell to be analyzed, the second number of genes with the highest similarity to the embedding vector of the gene are determined as the second nearest neighbor genes of the gene, where the second number is less than the first number; if one of the two genes is a second nearest neighbor gene of the other gene and there is no connection relationship between the nodes corresponding to the two genes, the connection relationship between the nodes corresponding to the two genes is established, thereby obtaining the gene regulatory network of the cell to be analyzed under the preset physiological state.

[0079] In this embodiment, for each cell to be analyzed under each preset physiological state, the similarity of the embedding vectors of every two genes in the cell to be analyzed can be determined, and it can be determined whether the obtained similarity is less than a similarity threshold. If the similarity of the embedding vectors of two genes is not less than the similarity threshold, then a connection relationship between the nodes corresponding to the two genes is established. If the similarity of the embedding vectors of two genes is less than the similarity threshold, then the nodes corresponding to the two genes are truncated, that is, no connection relationship is established, and no subsequent connection relationship between the nodes corresponding to the two genes will be established.

[0080] For each gene in the cell to be analyzed, the first number of genes with the highest similarity to the embedding vector of that gene can be identified as its first nearest neighbors. This first number can also be represented as K, and the first nearest neighbors can be called K-nearest neighbors. If each of two genes is a first nearest neighbor of the other, this indicates a strong correlation between the two genes, and the connection between their corresponding nodes can be preserved. If the two genes are not each other's first nearest neighbors (i.e., at least one gene is not a first nearest neighbor of the other), this indicates a weak correlation between the two genes, and the connection between their corresponding nodes can be deleted. In this way, bidirectional constraints (also known as mutual nearest neighbor constraints) can be used to preserve the connections between nodes of strongly correlated genes and to delete transitive false-positive connections caused by indirect regulation, ensuring the network specificity of the constructed gene regulatory network.

[0081] For each gene in the cell to be analyzed, the second number of genes with the highest similarity to its embedding vector can be identified as its second nearest neighbors. Because the second number is less than the first number, the second nearest neighbors single out genes with even stronger correlations. If one gene is a second nearest neighbor of another gene, this indicates a strong correlation between the two genes, so a connection can be established between their corresponding nodes even if none currently exists. Thus, unidirectional constraints (also known as union nearest neighbor constraints) can be used to restore some unidirectional, strongly correlated connections, ensuring the network specificity of the constructed gene regulatory network while maintaining its overall connectivity and preventing network fragmentation.

[0082] After completing the above processing, the gene regulatory network for each cell to be analyzed under each preset physiological state can be constructed. In this way, by combining mutual nearest neighbor constraints and union nearest neighbor constraints, a mixed K-Nearest Neighbor (mKNN) topology gene network can be constructed, ensuring the network specificity of the constructed gene regulatory network while maintaining its overall connectivity. This, in turn, improves the robustness of subsequent analysis using the gene regulatory network, further ensuring the high accuracy of the final analytical results.
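The construction described in the preceding paragraphs can be sketched as below. This is an illustrative reading, not the application's implementation: the function and parameter names are hypothetical, the similarity values are made up, and the sketch assumes that a pair truncated by the similarity threshold is never reconnected by the unidirectional rule.

```python
def build_mknn_edges(sim, first_number, second_number, threshold=0.0):
    """Return undirected edges (i, j), i < j, of a mixed KNN gene network.

    sim: symmetric matrix (list of lists) of embedding similarities.
    first_number:  K for the mutual (bidirectional) constraint.
    second_number: smaller K for the union (unidirectional) constraint.
    """
    n = len(sim)

    def top_neighbors(i, k):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: sim[i][j], reverse=True)
        return set(others[:k])

    first_nn = [top_neighbors(i, first_number) for i in range(n)]
    second_nn = [top_neighbors(i, second_number) for i in range(n)]

    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] < threshold:
                continue  # truncated pairs are never connected (assumption)
            mutual = j in first_nn[i] and i in first_nn[j]
            one_way = j in second_nn[i] or i in second_nn[j]
            if mutual or one_way:
                edges.add((i, j))
    return edges


# Made-up similarities: gene 3 is closest to gene 0, but gene 0 prefers 1 and 2.
toy_sim = [
    [1.0, 0.9, 0.8, 0.5],
    [0.9, 1.0, 0.7, 0.1],
    [0.8, 0.7, 1.0, 0.2],
    [0.5, 0.1, 0.2, 1.0],
]
mutual_only = build_mknn_edges(toy_sim, first_number=2, second_number=0)
mixed = build_mknn_edges(toy_sim, first_number=2, second_number=1)
```

In this toy example the mutual constraint alone leaves gene 3 isolated, while the unidirectional rule adds the edge between gene 0 and gene 3, which is the connectivity benefit the text attributes to the mixed topology.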

[0083] In some embodiments, see Figure 2. Before step S105, the single-cell transcriptome analysis method further includes: step S201: randomly reassigning the preset physiological states of each cell to be analyzed multiple times; step S202: obtaining the degree of change in the distribution of the embedding vector of each gene in each cell to be analyzed under the preset physiological states after each reassignment; step S203: determining the probability of the gene variability index appearing in the obtained degrees of change. Step S105 then includes: if the determined probability is less than the probability threshold, determining the analysis results characterizing the attribute changes of each cell to be analyzed under different physiological states based on the obtained gene variability index and gene regulation index.

[0084] In this embodiment, a permutation test can be performed before determining the analysis results. The preset physiological states of each cell to be analyzed can be randomly redistributed multiple times to shuffle the preset physiological states corresponding to each cell. It should be noted that the preset physiological states are redistributed only during the permutation test, and the redistributed preset physiological states are only used during the permutation test. During the calculation of each indicator, the preset physiological states corresponding to each cell to be analyzed are the initially set preset physiological states.

[0085] After each redistribution, the degree of change in the distribution of the embedding vector of each gene in each cell to be analyzed under the redistributed preset physiological states can be obtained. For the specific method of obtaining the degree of change in distribution, refer to the description of step S102 in the above embodiments; it is not repeated here. The distribution formed by the degrees of change obtained across the redistributions can be called the null distribution. Furthermore, the probability of the gene variability index appearing in the null distribution (also called the empirical P-value) can be determined, and it can be determined whether the obtained probability is less than a probability threshold.

[0086] If the obtained probability is less than the probability threshold, it is unlikely that the change in the distribution of the embedding vector of each gene in each cell under different preset physiological states is caused by random variation. In this case, the gene variability index can effectively represent how the distribution of genes changes with physiological state. Based on the obtained gene variability index and gene regulation index, the analysis results characterizing the changes in the attributes of each cell under different physiological states can be determined, further ensuring that highly accurate analysis results can be obtained.

[0087] If the obtained probability is not less than the probability threshold, the degree of change in the distribution of the embedding vectors of each gene in the cell under different preset physiological states is likely caused by random variation. In this case, the gene variability index cannot effectively represent how the distribution of genes changes with physiological state. The analysis results characterizing the changes in attributes of each cell under different physiological states can then be determined based solely on the obtained gene regulation index, further ensuring that highly accurate analysis results can be obtained.

[0088] Based on the above processing, a permutation test can be used to exclude cases in which the variation in the distribution of embedding vectors of genes in the cells under different preset physiological states is caused by random variation. Using the gene variability index only when it passes this test reduces the influence of random noise on the analysis of attribute changes in the cells under different physiological states, thus ensuring that the analytical results reflect true biological variation in the attributes of the cells across physiological states rather than random noise. In other words, it can further guarantee highly accurate analytical results.
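The permutation test of steps S201 to S203 can be sketched for a single gene as follows. Several parts are assumptions made only for illustration: the degree of change is stood in for by the distance between per-state centroids (step S102 may use a different measure), exactly two physiological states are assumed, the empirical P-value is taken as the share of permuted values at least as large as the observed one, and the add-one correction is a common convention rather than something stated in the text.

```python
import math
import random


def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def degree_of_change(vectors, labels):
    """Hypothetical stand-in for step S102: distance between state centroids."""
    states = sorted(set(labels))
    group_a = [v for v, s in zip(vectors, labels) if s == states[0]]
    group_b = [v for v, s in zip(vectors, labels) if s == states[1]]
    return math.dist(centroid(group_a), centroid(group_b))


def empirical_p_value(vectors, labels, n_permutations=1000, seed=0):
    """Share of label shuffles whose degree of change reaches the observed one."""
    rng = random.Random(seed)
    observed = degree_of_change(vectors, labels)
    shuffled = list(labels)  # the original labels are left untouched
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if degree_of_change(vectors, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)  # add-one correction (assumption)


# Made-up embeddings of one gene in 12 cells, 6 per physiological state.
healthy = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0], [0.1, 0.1], [0.05, 0.0], [0.0, 0.05]]
disease = [[5.0, 5.1], [5.1, 5.0], [5.0, 5.0], [5.1, 5.1], [5.05, 5.0], [5.0, 5.05]]
p_value = empirical_p_value(healthy + disease, ["H"] * 6 + ["D"] * 6)
use_variability_index = p_value < 0.05  # the threshold value is an assumption
```

The shuffled labels exist only inside the test; the index itself is still computed with the originally assigned states, as the text requires.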

[0089] In some embodiments, step S101 includes: for the transcriptome of each cell to be analyzed under a preset physiological state, selecting the third number of genes with the highest expression levels from the transcriptome; generating sequence elements corresponding to the selected genes based on the expression levels and gene identities of the selected genes; arranging the sequence elements corresponding to the selected genes in descending order of their expression levels to obtain an input sequence; and inputting the obtained input sequence into a pre-trained feature extraction model to obtain the gene embedding vectors of the cell to be analyzed output by the feature extraction model; wherein the feature extraction model is trained using the transcriptome of each sample cell, the gene expression levels of each sample cell, and the sample embedding vectors.

[0090] In this embodiment, the feature extraction model can be pre-trained using the transcriptome of each sample cell, the gene expression levels in each sample cell, and the sample embedding vectors to obtain a trained feature extraction model. The sample embedding vectors of genes in each sample cell are also the gene embedding vectors in each sample cell, which can characterize the attributes of genes in the sample cell. The architecture of the feature extraction model can be selected as needed and is not specifically limited. For example, the architecture of the feature extraction model can be a Transformer-based BERT (Bidirectional Encoder Representations from Transformers) architecture.

[0091] For the transcriptome of each cell under a preset physiological state, the third number of genes with the highest expression levels can be selected. The expression levels and gene identities of these selected genes can be used as sequence elements, arranged in descending order of expression level to obtain the input sequence. The third number can be set as needed and is not specifically limited; for example, it could be 2048. The gene identity, that is, the gene's unique identifier, can be represented as a gene ID. The obtained input sequence can then be fed into a pre-trained feature extraction model, which can output the embedding vectors of each gene in the cell being analyzed.
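A minimal sketch of the input sequence construction described above is given below. The function name, the dictionary layout of each sequence element, the tie-breaking by gene ID, and the expression values are all assumptions for illustration; only the selection of the top genes and the descending order come from the text.

```python
def build_input_sequence(expression, third_number=2048):
    """Select the most highly expressed genes and order them by expression.

    expression: dict mapping gene ID to expression level for one cell.
    Returns a list of sequence elements, highest expression first.
    Ties are broken by gene ID to keep the order deterministic (assumption).
    """
    ranked = sorted(expression.items(), key=lambda item: (-item[1], item[0]))
    selected = ranked[:third_number]
    return [
        {"gene_id": gene_id, "expression": level, "rank": rank}
        for rank, (gene_id, level) in enumerate(selected)
    ]


# Made-up expression levels for one cell; a real cell would have thousands of genes.
cell = {"AKT2": 3.0, "GFAP": 7.5, "APOE": 5.0, "SNAP25": 0.5}
sequence = build_input_sequence(cell, third_number=3)
```

Keeping the rank alongside the gene ID and expression level makes the ordering information explicit, so a downstream model could embed it directly.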

[0092] Based on the above processing, a pre-trained feature extraction model can be used to extract the embedding vectors of each gene in the cell to be analyzed. That is, deep representation learning technology can be used to achieve deep functional encoding of the single-cell transcriptome. Genes with high expression levels can be selected from the transcriptome of the cell under preset physiological conditions for subsequent processing, which can reduce the cost of subsequent analysis and improve analysis efficiency. Furthermore, genes with high expression levels are also more active genes, which helps improve the effectiveness and accuracy of subsequent analysis. Simultaneously, the input sequence can also contain gene expression level ranking information; correspondingly, the obtained embedding vectors can also reflect this ranking information, thus characterizing the gene activity level and helping to identify key genes. This further ensures that highly accurate analysis results can be obtained.

[0093] In some embodiments, the feature extraction model is trained through the following steps: Step 1: Obtain the transcriptome of each sample cell and the sample embedding vectors of the genes in each sample cell; Step 2: Select the third number of genes with the highest expression levels from the transcriptome of each sample cell, generate the sequence elements corresponding to the selected genes based on their expression levels and gene identities, and arrange the sequence elements corresponding to the selected genes in descending order of their expression levels to obtain the sample input sequence; Step 3: Input the sample input sequence into the feature extraction model of the initial structure to obtain the predicted embedding vectors of the genes in each sample cell output by the feature extraction model; Step 4: Adjust the parameters of the feature extraction model of the initial structure using the difference between the obtained predicted embedding vectors and the sample embedding vectors until the preset convergence condition is met, to obtain the trained feature extraction model.

[0094] In this embodiment, the feature extraction model can be pre-trained. The transcriptome of each sample cell and the sample embedding vectors of genes in each sample cell can be acquired in advance. For the transcriptome of each sample cell, the third number of genes with the highest expression levels can be selected. Based on the expression levels and gene identities of the selected genes, sequence elements corresponding to the selected genes are generated. These sequence elements are then arranged in descending order of gene expression level to obtain the sample input sequence. For details, refer to the description of generating the input sequence in the above embodiments; it is not repeated here.

[0095] Then, the sample input sequence can be input into the feature extraction model of the initial structure to obtain the predicted embedding vector of the gene in each sample cell output by the feature extraction model. Using the difference between the obtained predicted embedding vector and the sample embedding vector, the parameters of the feature extraction model of the initial structure are adjusted until a preset convergence condition is met, resulting in a trained feature extraction model. For example, the preset convergence condition could be that the difference between the obtained predicted embedding vector and the sample embedding vector is less than a difference threshold, or that the number of parameter adjustments reaches a preset number.

[0096] Based on the above processing, during training, the feature extraction model can learn the cooperative expression patterns of genes in a multi-dimensional feature space, and can also learn to generate embedding vectors representing gene attributes based on the ranking information of gene expression levels and the expression levels of genes with higher expression levels. This further ensures that the trained feature extraction model can obtain the embedding vectors of each gene in each cell to be analyzed, thus guaranteeing that subsequent multi-dimensional indicators can be obtained based on the embedding vectors to achieve transcriptome analysis and obtain highly accurate analytical results.

[0097] For example, during the training of the feature extraction model, a portion of the sample input sequence can be randomly masked, and the feature extraction model can be trained using the sample input sequence after the random masking process.
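The random masking mentioned above might look like the following sketch. The mask token, the masking fraction of 15% (a value borrowed from common BERT-style practice), and the function name are assumptions; the text states only that part of the sample input sequence is randomly masked.

```python
import random

MASK_TOKEN = "<mask>"  # hypothetical placeholder for a masked sequence element


def mask_sequence(sequence, mask_fraction=0.15, seed=None):
    """Replace a random subset of sequence elements with MASK_TOKEN.

    Returns the masked sequence and the sorted list of masked positions,
    so that a training loss could be computed at those positions.
    """
    if not sequence:
        return [], []
    rng = random.Random(seed)
    n_mask = max(1, round(len(sequence) * mask_fraction))
    positions = set(rng.sample(range(len(sequence)), n_mask))
    masked = [MASK_TOKEN if i in positions else element
              for i, element in enumerate(sequence)]
    return masked, sorted(positions)


gene_ids = [f"gene_{i}" for i in range(20)]  # made-up gene identifiers
masked_ids, masked_positions = mask_sequence(gene_ids, seed=1)
```

Returning the masked positions keeps the original elements recoverable, which a reconstruction-style training objective would need.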

[0098] For example, the feature extraction model training process can be divided into two stages: a pre-training stage and a fine-tuning stage. In the pre-training stage, the feature extraction model can be pre-trained using the transcriptomes of all types of sample cells and the gene embedding vectors of each sample cell, which can improve the generalization performance of the feature extraction model. After pre-training, the feature extraction model can be specifically supervised and fine-tuned using the transcriptomes of a specified type of sample cell, enabling the feature extraction model to accurately extract the gene embedding vectors of the specified type of sample cell; that is, enabling the feature extraction model to accurately extract high-dimensional features specific to cell type. The specified type can be set as needed and is not specifically limited. For example, the specified type can be the cell type of the cells to be analyzed.

[0099] In some embodiments, step 1 includes: obtaining the transcriptome of each sample cell and the sample embedding vector of the gene in each sample cell from a pre-constructed sample dataset; wherein the sample dataset is constructed by the following steps: obtaining the transcriptome of each sample cell and the original attributes of the gene in each sample cell; inputting the original attributes of the gene in each sample cell into a large language model so that the large language model determines the position of the original attributes of the gene in each sample cell in a pre-established standard anatomical framework, generating an embedding vector based on the determined position, and outputting the generated embedding vector, and using the output embedding vector as the sample embedding vector of the gene in each sample cell.

[0100] In this application embodiment, a sample dataset can be pre-constructed. The transcriptome of each sample cell and the raw attributes of genes within each sample cell can be obtained. For example, raw single-cell transcriptome data can be collected from multiple independent studies worldwide to construct a massive sample resource library containing a large number of single cells and involving multiple donors, achieving the acquisition and heterogeneous aggregation of multi-source, large-scale data. The solution provided in this application embodiment integrates over 16 million cells and more than 8,000 donors, establishing a world-leading human brain single-cell transcriptome resource library, overcoming the shortcomings of previous studies with small sample sizes and insufficient representativeness. Through systematic aggregation, the transcriptome of sample cells can cover the entire life cycle spectrum of humans from embryonic development to old age (0.5 years to 95 years), and can encompass the transcriptome of sample cells under various neurological diseases or syndromes.

[0101] Furthermore, a standard anatomical framework can be pre-established. The standard anatomical framework can be established by referring to the anatomical skeleton of the Allen Brain Atlas standard (an anatomical standard). In the solution provided in this application, an 11-level hierarchical classification tree (i.e., a standard anatomical framework) is constructed with reference to the Allen Brain Atlas standard. This tree structure clearly defines 165 modified Brodmann partitions and 145 gyral regions, providing a unified anatomical "geographic coordinate" for all dispersed cell data from different sources.

[0102] Furthermore, the raw attributes of genes in each sample cell can be labeled and aligned using a standard anatomical framework. The raw attributes of genes in each sample cell can be input into a large language model (e.g., ChatGPT). The large language model can perform reasoning to determine the position of the raw attributes of genes in each sample cell within the standard anatomical framework, and can generate embedding vectors based on the determined positions, outputting the generated embedding vectors. The rules for generating the embedding vectors corresponding to each position in the standard anatomical framework can be pre-defined. The output embedding vectors can then be used as the sample embedding vectors of genes in each sample cell. In other words, a large language model can be used for labeling and alignment.

[0103] For example, after obtaining the sample embedding vectors of genes in each sample cell using a large language model, biological experts can manually verify and correct these vectors to achieve semi-automated annotation alignment and high-precision review. In this way, fragmented and conflicting original attributes can be mapped to a unified "virtual cell" reference system (i.e., a standard anatomical framework), integrating the attributes of genes in each sample cell into a unified vector representation space, fundamentally eliminating technical noise and batch effects.

[0104] Based on the above processing, a unified, standardized sample dataset can be obtained. This avoids the fragmentation, strong heterogeneity, and severe batch effects of single-cell transcriptome data that would otherwise reduce the prediction accuracy of the trained feature extraction model, and it eliminates "data silos". That is, it can improve the generalization performance and prediction accuracy of the feature extraction model trained using the sample dataset, and further ensure that highly accurate analysis results can be obtained.

[0105] For example, after obtaining the sample dataset, UMAP (Uniform Manifold Approximation and Projection) dimensionality reduction can be applied to the sample dataset, and the resulting low-dimensional representation can be examined to verify whether cells of specified types, such as excitatory neurons, microglia, and astrocytes, are clearly separated. This allows further analysis and validation of the sample dataset to evaluate whether it can effectively train the feature extraction model.

[0106] In some embodiments, the gene embedding vector includes: identity embedding representing the gene's identity, expression level embedding representing the gene's expression level, and order embedding representing the order position of the gene's expression level.

[0107] In the embodiments of this application, for any gene, the identity embedding can characterize the gene's identity, and the expression level embedding can characterize the gene's expression level, which in turn characterizes the gene's activity level. The order position of a gene's expression level in a cell is the position of the gene among the third number of genes with the highest expression levels in the cell's transcriptome, arranged in descending order of expression level. The order embedding of a gene can characterize the order position of its expression level, i.e., the relative level of the gene's activity.

[0108] The gene embedding vector can be represented by the following formula:

[0109] [Formula not reproduced in the source text.]

[0110] In the formula, the symbols (likewise not reproduced in the source text) denote, in order: the embedding vector of a gene; the weight matrices; one-hot encoding; the identity embedding; the expression level embedding; and the order-position embedding.
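Because the formula is not legible in this text, the sketch below shows only one plausible reading of the listed components, under loudly stated assumptions: the identity embedding is a weight matrix applied to a one-hot gene encoding, the expression level embedding is a weight vector scaled by the expression level, the order embedding is looked up by rank, and the three are summed element-wise. None of these choices, nor any of the names, is confirmed by the source.

```python
def one_hot(index, size):
    """One-hot encoding of a gene index within a vocabulary of the given size."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def gene_embedding(gene_index, expression, rank,
                   w_identity, w_expression, order_table):
    """Combine identity, expression level, and order embeddings (assumed additive).

    w_identity:   d x vocabulary weight matrix (hypothetical).
    w_expression: length-d weight vector (hypothetical).
    order_table:  list mapping each rank to a length-d vector (hypothetical).
    """
    vocabulary_size = len(w_identity[0])
    identity = matvec(w_identity, one_hot(gene_index, vocabulary_size))
    expression_part = [w * expression for w in w_expression]
    order = order_table[rank]
    return [a + b + c for a, b, c in zip(identity, expression_part, order)]


# Made-up parameters: embedding dimension 2, vocabulary of 3 genes, 2 ranks.
w_id = [[1.0, 0.0, 2.0],
        [0.0, 1.0, 3.0]]
w_expr = [0.5, -1.0]
order_vectors = [[0.1, 0.1], [0.2, 0.2]]
vector = gene_embedding(2, 2.0, 1, w_id, w_expr, order_vectors)
```

Multiplying a weight matrix by a one-hot vector simply selects one column, so in practice this term is usually implemented as a table lookup.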

[0111] Based on the above processing, gene embedding vectors can represent gene attributes from different dimensions, further ensuring that gene embedding vectors can accurately represent gene attributes. Consequently, this further guarantees the effectiveness and accuracy of subsequent analyses, i.e., it can further ensure that highly accurate analytical results are obtained. Simultaneously, using a dual-feature embedding strategy of "gene identity + expression level" enables the feature extraction model to effectively shield against noise from different experimental platforms, demonstrating excellent zero-shot prediction capabilities.

[0112] Based on the same inventive concept, this application also provides a single-cell transcriptome analysis device; see Figure 3. The single-cell transcriptome analysis device includes:

[0113] The embedding vector determination module 301 is used to determine the embedding vector of each gene in each cell to be analyzed based on the transcriptome of each cell to be analyzed under a preset physiological state; the embedding vector of a gene represents the attribute of the gene.

[0114] The gene variability index acquisition module 302 is used to obtain the degree of change in the distribution of the embedding vectors of each gene in each cell to be analyzed under different preset physiological states, and to obtain the gene variability index.

[0115] The gene regulation network construction module 303 is used to construct a gene regulation network for each cell to be analyzed under each preset physiological state, based on the similarity of the embedding vectors of genes in the cell to be analyzed. The gene regulation network includes nodes that correspond one-to-one with genes. There is a connection between the nodes corresponding to two genes that meet the target conditions. The target conditions include: the similarity of the embedding vectors of the two genes is not less than the similarity threshold, and at least one of the two genes belongs to the top preset number of genes with the highest similarity of the embedding vector of the other gene.

[0116] The gene regulation index acquisition module 304 is used to divide each gene regulation network by clustering the nodes in each gene regulation network to obtain gene regulation sub-networks, and to obtain the gene regulation index corresponding to each gene regulation sub-network based on the similarity of the embedding vectors of the genes corresponding to each pair of nodes in each gene regulation sub-network.

[0117] The analysis results determination module 305 is used to determine the analysis results characterizing the changes in attributes of each cell under different physiological states based on the obtained gene variability indicators and gene regulation indicators.

[0118] Based on the single-cell transcriptome analysis device provided in this application, the embedding vectors characterizing the attributes of genes in the cells under preset physiological conditions can be determined using the transcriptomes of each cell to be analyzed. Furthermore, the degree of variation in the distribution of embedding vectors of each gene in each cell under different preset physiological conditions can be calculated to obtain a gene variability index. Correspondingly, the gene variability index reflects the degree of variation in the distribution pattern of genes under different physiological conditions; the larger the gene variability index, i.e., the greater the degree of variation, the greater the difference in the distribution pattern of genes under different physiological conditions.
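
The following is a minimal sketch of how a gene variability index of this kind could be computed for one gene. This section does not name the measure used for the "degree of change" between distributions, so the energy distance used here is an assumption made for illustration, as are the function names and the array layout (one row per cell, one column per embedding dimension).

```python
import numpy as np


def mean_pairwise_distance(a, b):
    """Mean Euclidean distance over all pairs (row of a, row of b)."""
    diff = a[:, None, :] - b[None, :, :]
    return float(np.sqrt((diff ** 2).sum(axis=-1)).mean())


def gene_variability_index(emb_state_a, emb_state_b):
    """Energy distance between the embeddings of one gene in two states.

    emb_state_a: array of shape (n_cells_a, dim), the gene's embedding
                 vector in each cell under the first physiological state.
    emb_state_b: array of shape (n_cells_b, dim), same for the second state.
    NOTE: energy distance is an assumed stand-in; the source text only
    requires some measure of change between the two distributions.
    """
    a = np.asarray(emb_state_a, dtype=float)
    b = np.asarray(emb_state_b, dtype=float)
    cross = mean_pairwise_distance(a, b)
    within_a = mean_pairwise_distance(a, a)
    within_b = mean_pairwise_distance(b, b)
    return 2.0 * cross - within_a - within_b
```

Under this stand-in measure, a value near zero means the gene's embeddings are distributed almost identically in the two states, and larger values indicate a greater difference, which matches the interpretation of the index given above.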

[0119] Furthermore, the similarity of gene embedding vectors in cells under various preset physiological states can be used to construct a gene regulation network for the cells under those preset physiological states. Each node in the gene regulation network corresponds one-to-one with a gene. The similarity of the embedding vectors of genes corresponding to two connected nodes is not less than a similarity threshold; that is, a high similarity in the embedding vectors of genes corresponding to two connected nodes ensures a high correlation between the genes, thus avoiding noise interference. Moreover, at least one gene among the genes corresponding to two connected nodes belongs to the top preset number of genes with the highest similarity in embedding vectors of the other gene. This further constrains the connection relationships between nodes, further ensuring a high correlation between the two connected genes, and thus enabling the constructed gene regulation network to accurately reflect the tightness of functional connections between genes.

[0120] After the gene regulation network is constructed, it can be divided into gene regulation sub-networks by clustering the nodes within the network. For each sub-network, the gene regulation index can be obtained based on the similarity of the embedding vectors of the genes corresponding to each pair of nodes in that sub-network. Correspondingly, the gene regulation indices of a sub-network under different preset physiological states reflect how that sub-network changes across these states. A larger gene regulation index, i.e., higher similarity, indicates stronger functional connectivity within the sub-network.
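
The division into sub-networks and the per-sub-network index can be sketched as follows. This section names neither the clustering algorithm nor the exact formula, so two assumptions are made here for illustration: connected components of the network stand in for the clustering step, and the index is taken as the mean similarity over all pairs of genes in a sub-network. All function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def connected_components(adj):
    """Label each node with the index of its connected component.

    NOTE: used only as a simple stand-in for the clustering step.
    adj is a symmetric boolean adjacency matrix.
    """
    n = len(adj)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            node = stack.pop()
            for neighbor in np.flatnonzero(adj[node]):
                if labels[neighbor] == -1:
                    labels[neighbor] = current
                    stack.append(int(neighbor))
        current += 1
    return labels


def gene_regulation_index(sim, members):
    """Mean pairwise embedding similarity among the genes in one sub-network."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return float("nan")  # a single-gene sub-network has no pairs
    return float(np.mean([sim[i, j] for i, j in pairs]))


def subnetwork_indices(adj, sim):
    """Map each sub-network label to its gene regulation index."""
    groups = {}
    for gene, label in enumerate(connected_components(adj)):
        groups.setdefault(label, []).append(gene)
    return {label: gene_regulation_index(sim, genes)
            for label, genes in groups.items()}
```

Comparing the value returned for the same sub-network under two physiological states then gives one number per sub-network describing how its internal connectivity changed.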

[0121] Furthermore, by combining the obtained gene variability indicators and gene regulation indicators, the analytical results characterizing the changes in attributes of each cell under different physiological states can be determined. In this way, changes in gene distribution and functional connectivity can be quantified using multi-dimensional indicators, i.e., the state shifts of genes under different physiological states can be quantified. Correspondingly, based on quantified multi-dimensional indicators, changes in gene attributes can be analyzed from multiple dimensions, yielding highly accurate analytical results.

[0122] In some embodiments, the gene regulation network construction module 303 is specifically used to construct the gene regulation network of each cell to be analyzed under each preset physiological state in the following manner: determining the similarity of the embedding vectors of every two genes in the cell to be analyzed, and establishing a connection relationship between the nodes corresponding to genes whose similarity is not less than a similarity threshold; for each gene in the cell to be analyzed, determining a first number of genes with the highest similarity to the embedding vector of the gene, as the first nearest neighbor genes of the gene; if neither of two genes is a first nearest neighbor gene of the other gene, deleting the connection relationship between the nodes corresponding to the two genes; for each gene in the cell to be analyzed, determining a second number of genes with the highest similarity to the embedding vector of the gene, as the second nearest neighbor genes of the gene, where the second number is less than the first number; if one of two genes is a second nearest neighbor gene of the other gene and there is no connection relationship between the nodes corresponding to the two genes, establishing the connection relationship between the nodes corresponding to the two genes, thereby obtaining the gene regulation network of the cell to be analyzed under the preset physiological state.
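
The three-step construction described in this embodiment can be sketched as below. Cosine similarity is an assumption (the text only says "similarity"), as are the function names, the parameter names `threshold`, `first_k` and `second_k`, and the requirement that no embedding vector is all zeros. Note that the third step, as worded, does not re-check the similarity threshold, and the sketch follows that wording.

```python
import numpy as np


def cosine_similarity_matrix(emb):
    """Pairwise cosine similarity between rows (assumes no zero vectors)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T


def top_k_neighbors(sim, k):
    """nn[i, j] is True if gene j is among the k genes most similar to gene i."""
    n = sim.shape[0]
    k = min(k, n - 1)
    s = sim.copy()
    np.fill_diagonal(s, -np.inf)  # a gene is never its own neighbor
    idx = np.argsort(-s, axis=1)[:, :k]
    nn = np.zeros((n, n), dtype=bool)
    np.put_along_axis(nn, idx, True, axis=1)
    return nn


def build_network(emb, threshold, first_k, second_k):
    """Return a symmetric boolean adjacency matrix over genes."""
    if not second_k < first_k:
        raise ValueError("second_k must be less than first_k")
    sim = cosine_similarity_matrix(np.asarray(emb, dtype=float))

    # Step 1: connect every pair whose similarity is not less than the threshold.
    adj = sim >= threshold
    np.fill_diagonal(adj, False)

    # Step 2: delete an edge if neither gene is a first nearest neighbor
    # of the other.
    nn1 = top_k_neighbors(sim, first_k)
    adj &= nn1 | nn1.T

    # Step 3: add an edge if one gene is a second nearest neighbor of the
    # other and the edge is not already present.
    nn2 = top_k_neighbors(sim, second_k)
    adj |= nn2 | nn2.T
    return adj
```

Because `second_k` is smaller than `first_k`, the third step only restores links between a gene and its very closest neighbors, so each gene keeps those links even when its similarities fall below the global threshold.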

[0123] In some embodiments, the apparatus further includes:

[0124] The state allocation module is used to randomly reassign the preset physiological states of each cell to be analyzed multiple times before determining the analysis results that characterize the attribute changes of each cell under different physiological states based on the obtained gene variability index and gene regulation index.

[0125] The degree of change determination module is used to obtain, after each reassignment, the degree of change in the distribution of the embedding vector of each gene in each cell to be analyzed under the reassigned preset physiological states.

[0126] The probability determination module is used to determine the probability that the gene variability index occurs among the obtained degrees of change.

[0127] The analysis result determination module 305 is specifically used to determine the analysis results characterizing the changes in attributes of each cell under different physiological states, based on the obtained gene variability index and gene regulation index, if the determined probability is less than the probability threshold.
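
The random reassignment and probability check performed by these modules resemble a permutation test, which can be sketched as follows. The sketch assumes two states labeled 0 and 1, uses the distance between state centroids as a stand-in for the degree of change, and applies an add-one correction to the empirical probability. None of these choices is specified in this section, and the function names are hypothetical.

```python
import numpy as np


def centroid_shift(emb, labels):
    """Stand-in degree of change: distance between the two state centroids."""
    return float(np.linalg.norm(emb[labels == 0].mean(axis=0)
                                - emb[labels == 1].mean(axis=0)))


def permutation_p_value(emb, labels, n_permutations=1000, seed=0,
                        statistic=centroid_shift):
    """Estimate how often a random state assignment changes at least as much.

    emb:    (n_cells, dim) embedding vectors of one gene, one row per cell.
    labels: (n_cells,) physiological state of each cell (0 or 1).
    Returns the observed statistic and its empirical probability.
    """
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    observed = statistic(emb, labels)
    # Each permutation randomly reassigns the states while keeping the
    # number of cells per state fixed.
    null = np.array([statistic(emb, rng.permutation(labels))
                     for _ in range(n_permutations)])
    p = (1 + np.sum(null >= observed)) / (1 + n_permutations)
    return observed, float(p)
```

The analysis result would then be determined only when the returned probability is below the chosen probability threshold, so that changes no larger than those produced by random state assignments are not reported.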

[0128] In some embodiments, the embedding vector determination module 301 is specifically used to: for each cell to be analyzed under a preset physiological state, select a third number of genes with the highest expression levels from the transcriptome of the cell; generate sequence elements corresponding to the selected genes based on the expression level and gene identity of each selected gene; arrange the sequence elements in descending order of the expression levels of the selected genes to obtain an input sequence; and input the obtained input sequence into a pre-trained feature extraction model to obtain the embedding vectors of the genes in the cell to be analyzed, as output by the feature extraction model; wherein the feature extraction model is trained using the transcriptome of each sample cell, the expression levels of the genes in each sample cell, and the sample embedding vectors.
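
The input-sequence construction described above can be sketched as follows. The dictionary-based representation of a cell's transcriptome, the layout of a sequence element, the function name, and the alphabetical tie-break between equally expressed genes are all assumptions made for illustration.

```python
def build_input_sequence(expression, third_number):
    """Build the model input sequence for one cell.

    expression:   dict mapping gene identity -> expression level in the cell.
    third_number: how many of the most highly expressed genes to keep.
    Returns a list of sequence elements in descending order of expression.
    """
    # Sort by expression level, highest first; ties are broken by gene
    # identity so the result is deterministic (an assumption).
    ranked = sorted(expression.items(), key=lambda item: (-item[1], item[0]))
    selected = ranked[:third_number]
    return [
        {"gene": gene, "expression": level, "rank": position}
        for position, (gene, level) in enumerate(selected)
    ]
```

For example, `build_input_sequence({"A": 5.0, "B": 9.0, "C": 1.0, "D": 7.0}, 3)` keeps genes B, D and A, in that order, and drops C.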

[0129] In some embodiments, the feature extraction model is trained through the following steps: obtaining the transcriptome of each sample cell and the sample embedding vectors of the genes in each sample cell; selecting a third number of genes with the highest expression levels from the transcriptome of each sample cell, generating sequence elements corresponding to the selected genes based on the expression level and gene identity of each selected gene, and arranging the sequence elements in descending order of expression level to obtain a sample input sequence; inputting the sample input sequence into a feature extraction model with an initial structure to obtain the predicted embedding vectors of the genes in each sample cell, as output by the feature extraction model; and adjusting the parameters of the feature extraction model with the initial structure using the difference between the predicted embedding vectors and the sample embedding vectors until a preset convergence condition is reached, thereby obtaining the trained feature extraction model.
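
The parameter-adjustment loop in these steps can be illustrated with a deliberately tiny stand-in. The sketch below replaces the feature extraction model with a single linear map trained by gradient descent on squared error. The model form, the loss, the learning rate, and the convergence test (change in loss below `tol`) are all assumptions and not the architecture or condition used in this application.

```python
import numpy as np


def train_feature_extractor(inputs, sample_embeddings, lr=0.1, tol=1e-10,
                            max_steps=10000):
    """Fit W so that inputs @ W approximates the sample embedding vectors.

    inputs:            (n_samples, d_in) numeric features of the input
                       sequence elements (a simplification).
    sample_embeddings: (n_samples, d_out) target sample embedding vectors.
    Returns the trained weights and the loss at the last evaluated step.
    """
    X = np.asarray(inputs, dtype=float)
    Y = np.asarray(sample_embeddings, dtype=float)
    W = np.zeros((X.shape[1], Y.shape[1]))  # "initial structure" parameters
    previous_loss = np.inf
    loss = np.inf
    for _ in range(max_steps):
        residual = X @ W - Y  # predicted minus sample embedding vectors
        loss = float(np.sum(residual ** 2) / len(X))
        if abs(previous_loss - loss) < tol:  # assumed convergence condition
            break
        gradient = 2.0 * X.T @ residual / len(X)
        W -= lr * gradient
        previous_loss = loss
    return W, loss
```

The structure mirrors the described steps: predict, measure the difference from the sample embedding vectors, adjust the parameters, and stop once the convergence condition is met.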

[0130] In some embodiments, obtaining the transcriptome of each sample cell and the sample embedding vector of the gene in each sample cell includes: obtaining the transcriptome of each sample cell and the sample embedding vector of the gene in each sample cell from a pre-constructed sample dataset; wherein the sample dataset is constructed by the following steps: obtaining the transcriptome of each sample cell and the original attributes of the gene in each sample cell; inputting the original attributes of the gene in each sample cell into a large language model so that the large language model determines the position of the original attributes of the gene in each sample cell in a pre-established standard anatomical framework, generating an embedding vector based on the determined position, and outputting the generated embedding vector, and using the output embedding vector as the sample embedding vector of the gene in each sample cell.

[0131] In some embodiments, the gene embedding vector includes: identity embedding representing the gene's identity, expression level embedding representing the gene's expression level, and order embedding representing the order position of the gene's expression level.
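
One way the three components could be combined is sketched below. Summing the components element-wise, discretizing expression levels into bins, and using randomly initialized lookup tables are assumptions for illustration; this section only states that the gene embedding vector includes the three components. The class and parameter names are hypothetical.

```python
import numpy as np


class GeneTokenEmbedder:
    """Combine identity, expression-level and order embeddings per gene."""

    def __init__(self, n_genes, n_expression_bins, max_positions, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Lookup tables; in a real model these would be learned parameters.
        self.identity = rng.normal(size=(n_genes, dim))
        self.expression = rng.normal(size=(n_expression_bins, dim))
        self.order = rng.normal(size=(max_positions, dim))

    def __call__(self, gene_ids, expression_bins):
        """gene_ids and expression_bins are given in descending expression order."""
        gene_ids = np.asarray(gene_ids)
        expression_bins = np.asarray(expression_bins)
        positions = np.arange(len(gene_ids))  # order position of each gene
        # Element-wise sum of the three components (an assumption).
        return (self.identity[gene_ids]
                + self.expression[expression_bins]
                + self.order[positions])
```

With this layout, two cells that express the same gene at different levels or ranks share the identity component but differ in the other two, which is what lets the combined vector reflect both what the gene is and how it is expressed.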

[0132] This application also provides an electronic device which, as shown in Figure 4, includes a processor 401, a communication interface 402, a memory 403, and a communication bus 404, wherein the processor 401, the communication interface 402, and the memory 403 communicate with each other through the communication bus 404.

[0133] Memory 403 is used to store computer programs;

[0134] When processor 401 executes the program stored in memory 403, it performs the following steps:

[0135] Based on the transcriptome of each cell to be analyzed under a preset physiological state, the embedding vector of each gene in each cell to be analyzed is determined; the gene embedding vector characterizes the gene's properties.

[0136] The degree of variation in the distribution of the embedding vectors of each gene in each cell to be analyzed under different preset physiological states is obtained to obtain the gene variability index.

[0137] For each cell to be analyzed under each preset physiological state, a gene regulation network of the cell to be analyzed under the preset physiological state is constructed based on the similarity of the gene embedding vectors in the cell to be analyzed. The gene regulation network includes nodes that correspond one-to-one with genes. There is a connection between the nodes corresponding to two genes that meet the target conditions. The target conditions include: the similarity of the embedding vectors of the two genes is not less than the similarity threshold, and at least one of the two genes belongs to the top preset number of genes with the highest similarity of the embedding vector of the other gene.

[0138] Each gene regulatory network is divided into gene regulatory sub-networks by clustering the nodes in each gene regulatory network. Gene regulatory indicators corresponding to each gene regulatory sub-network are obtained based on the similarity of the embedding vectors of the genes corresponding to each pair of nodes in each gene regulatory sub-network.

[0139] Based on the obtained gene variability indicators and gene regulation indicators, the analysis results that characterize the changes in attributes of each cell under different physiological states are determined.

[0140] The communication bus mentioned in the above electronic devices can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. This communication bus can be divided into address bus, data bus, control bus, etc. For ease of illustration, only one thick line is used to represent it in the diagram, but this does not mean that there is only one bus or one type of bus.

[0141] The communication interface is used for communication between the aforementioned electronic devices and other devices.

[0142] The memory may include random access memory (RAM) or non-volatile memory (NVM), such as at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

[0143] The processors mentioned above can be general-purpose processors, including central processing units (CPUs), network processors (NPs), etc.; they can also be digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.

[0144] In another embodiment provided in this application, a computer-readable storage medium is also provided, which stores a computer program that, when executed by a processor, implements any of the single-cell transcriptome analysis methods described above.

[0145] In another embodiment provided in this application, a computer program product containing instructions is also provided, which, when run on a computer, causes the computer to execute any of the single-cell transcriptome analysis methods described in the above embodiments.

[0146] The above embodiments can be implemented entirely or partially through software, hardware, firmware, or any combination thereof. When implemented using software, they can be implemented entirely or partially as a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are carried out. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).

[0147] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0148] The embodiments in this specification are described in a related manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the apparatus, electronic device, storage medium, and program product embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the descriptions of the method embodiments.

[0149] The above description is merely a preferred embodiment of this application and is not intended to limit the scope of protection of this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application are included within the scope of protection of this application.

Claims

1. A method for single-cell transcriptome analysis, characterized in that, The method includes: Based on the transcriptome of each cell to be analyzed under a preset physiological state, the embedding vector of each gene in each cell to be analyzed is determined; the gene embedding vector characterizes the gene's properties. The degree of variation in the distribution of the embedding vectors of each gene in each cell to be analyzed under different preset physiological states is obtained to obtain the gene variability index. For each cell to be analyzed under each preset physiological state, a gene regulation network of the cell to be analyzed under the preset physiological state is constructed based on the similarity of the gene embedding vectors in the cell to be analyzed. The gene regulation network includes nodes that correspond one-to-one with genes. There is a connection between the nodes corresponding to two genes that meet the target conditions. The target conditions include: the similarity of the embedding vectors of the two genes is not less than the similarity threshold, and at least one of the two genes belongs to the top preset number of genes with the highest similarity of the embedding vector of the other gene. Each gene regulatory network is divided into gene regulatory sub-networks by clustering the nodes in each gene regulatory network. Gene regulatory indicators corresponding to each gene regulatory sub-network are obtained based on the similarity of the embedding vectors of the genes corresponding to each pair of nodes in each gene regulatory sub-network. Based on the obtained gene variability indicators and gene regulation indicators, the analysis results that characterize the changes in attributes of each cell under different physiological states are determined.

2. The method according to claim 1, characterized in that, The gene regulation network of each cell to be analyzed under each preset physiological state is constructed in the following manner: Determine the similarity of the embedding vectors of every two genes in the cell to be analyzed, and establish the connection relationship between nodes corresponding to genes with a similarity not less than the similarity threshold; For each gene in the cell to be analyzed, determine a first number of genes with the highest similarity to the embedding vector of that gene, as the first nearest neighbor genes of that gene; If neither of two genes is a first nearest neighbor gene of the other gene, delete the connection relationship between the nodes corresponding to the two genes; For each gene in the cell to be analyzed, determine a second number of genes with the highest similarity to the embedding vector of that gene, as the second nearest neighbor genes of that gene; the second number is less than the first number; If one of two genes is a second nearest neighbor gene of the other gene and there is no connection relationship between the nodes corresponding to the two genes, establish the connection relationship between the nodes corresponding to the two genes, thereby obtaining the gene regulation network of the cell to be analyzed under the preset physiological state.

3. The method according to claim 1, characterized in that, Before determining the analysis results characterizing the changes in attributes of each cell under different physiological states based on the obtained gene variability index and gene regulation index, the method further includes: Randomly reassigning the preset physiological state of each cell to be analyzed multiple times; Obtaining, after each reassignment, the degree of change in the distribution of the embedding vector of each gene in each cell to be analyzed under the reassigned preset physiological states; Determining the probability that the gene variability index occurs among the obtained degrees of change; The determining of the analysis results characterizing the changes in attributes of each cell under different physiological states based on the obtained gene variability index and gene regulation index includes: If the determined probability is less than the probability threshold, determining the analysis results characterizing the changes in attributes of each cell under different physiological states based on the obtained gene variability index and gene regulation index.

4. The method according to claim 1, characterized in that, The determining of the embedding vector of each gene in each cell to be analyzed, based on the transcriptome of each cell to be analyzed under a preset physiological state, includes: For each cell to be analyzed under a preset physiological state, select a third number of genes with the highest expression levels from the transcriptome of the cell; Generate sequence elements corresponding to the selected genes based on the expression level and gene identity of each selected gene; Arrange the sequence elements corresponding to the selected genes in descending order of their expression levels to obtain the input sequence; Input the obtained input sequence into a pre-trained feature extraction model to obtain the embedding vectors of the genes in the cell to be analyzed, as output by the feature extraction model; The feature extraction model is trained using the transcriptome of each sample cell, the expression levels of the genes in each sample cell, and the sample embedding vectors.

5. The method according to claim 4, characterized in that, The feature extraction model is trained through the following steps: Obtain the transcriptome of each sample cell and the sample embedding vectors of the genes in each sample cell; From the transcriptome of each sample cell, select a third number of genes with the highest expression levels; Based on the expression level and gene identity of each selected gene, generate the sequence elements corresponding to the selected genes; Arrange the sequence elements of the selected genes in descending order of their expression levels to obtain the sample input sequence; Input the sample input sequence into the feature extraction model with the initial structure to obtain the predicted embedding vectors of the genes in each sample cell, as output by the feature extraction model; Using the difference between the predicted embedding vectors and the sample embedding vectors, adjust the parameters of the feature extraction model with the initial structure until the preset convergence condition is met, thereby obtaining the trained feature extraction model.

6. The method according to claim 5, characterized in that, The acquisition of the transcriptome of each sample cell and the sample embedding vector of genes in each sample cell includes: From the pre-constructed sample dataset, obtain the transcriptome of each sample cell and the sample embedding vector of the gene in each sample cell; The sample dataset is constructed through the following steps: Obtain the transcriptome of each sample cell and the original attributes of genes in each sample cell; The original attributes of genes in each sample cell are input into the large language model so that the large language model can determine the position of the original attributes of genes in each sample cell in the pre-established standard anatomical framework. Based on the determined position, an embedding vector is generated and output. The output embedding vector is used as the sample embedding vector of genes in each sample cell.

7. The method according to any one of claims 1-6, characterized in that, The gene embedding vector includes: identity embedding representing the gene's identity, expression embedding representing the gene's expression level, and order embedding representing the order position of the gene's expression level.

8. A single-cell transcriptome analysis device, characterized in that, The device includes: The embedding vector determination module is used to determine the embedding vector of each gene in each cell to be analyzed based on the transcriptome of each cell under a preset physiological state; the embedding vector of a gene represents the attribute of the gene. The gene variability index acquisition module is used to obtain the degree of change in the distribution of the embedding vectors of each gene in each cell to be analyzed under different preset physiological states, and to obtain the gene variability index. The gene regulation network construction module is used to construct a gene regulation network for each cell to be analyzed under each preset physiological state, based on the similarity of the embedding vectors of genes in the cell to be analyzed. The gene regulation network includes nodes that correspond one-to-one with genes. There is a connection between the nodes corresponding to two genes that meet the target conditions. The target conditions include: the similarity of the embedding vectors of the two genes is not less than the similarity threshold, and at least one of the two genes belongs to the top preset number of genes with the highest similarity of the embedding vector of the other gene. The gene regulation index acquisition module is used to divide each gene regulation network by clustering the nodes in each gene regulation network to obtain gene regulation sub-networks. Based on the similarity of the embedding vectors of the genes corresponding to each pair of nodes in each gene regulation sub-network, the gene regulation index corresponding to each gene regulation sub-network is obtained. The analysis results determination module is used to determine the analysis results characterizing the changes in attributes of each cell under different physiological states based on the obtained gene variability indicators and gene regulation indicators.

9. An electronic device, characterized in that, It includes a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; Memory, used to store computer programs; A processor, when executing a program stored in memory, implements the method described in any one of claims 1-7.

10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that, when executed by a processor, implements the method described in any one of claims 1-7.