Rare cell identification method based on density adaptive graph and related device
By constructing a density-adaptive graph for dimensionality reduction and clustering analysis, the problem of low accuracy in rare cell identification in traditional methods is solved, and accurate identification of rare cells in high-dimensional space is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGDONG INST OF INTELLIGENT SCI & TECH
- Filing Date
- 2026-01-22
- Publication Date
- 2026-06-19
Abstract
Description
Technical Field
[0001] This application relates to the field of data processing technology, and in particular to a rare cell identification method based on a density-adaptive graph and a related device.
Background Technology
[0002] Currently, rare cells play an important role in biological processes such as development, immune response, tumor drug resistance, and disease progression.
[0003] In related technologies, single-cell transcriptome sequencing is typically used to characterize the cell type composition in tissues or samples, thereby identifying rare cells. However, in practical applications, it has been found that traditional single-cell transcriptome sequencing methods generally rely on k-nearest neighbor graphs with a fixed neighborhood size to describe inter-cell relationships, which leads to low accuracy in identifying rare cells due to the highly uneven cell density.
[0004] In summary, the technical problems existing in the related technologies remain to be solved.
Summary of the Invention
[0005] This application provides a method and related equipment for rare cell identification based on density adaptive maps, which can effectively amplify the expression deviation of rare cells and effectively improve the identification accuracy and stability of rare cells.
[0006] In one aspect, embodiments of this application provide a rare cell identification method based on a density-adaptive graph, the method comprising the following steps:
obtaining principal component analysis results of the cell group data to be identified;
constructing a density-adaptive graph, wherein the density-adaptive graph obtains an affinity matrix by calculating local radii and assigning edge weights;
reducing the dimensionality of the principal component analysis results using the density-adaptive graph to obtain a dimensionality-reduced expression matrix;
performing cluster analysis on the dimensionality-reduced expression matrix using the density-adaptive graph to obtain cell clustering results; and
identifying and analyzing the cell clustering results to obtain rare cell identification results for the cell group data to be identified.
[0007] Optionally, the construction of the density adaptive graph includes: The local radius is determined by calculating the Euclidean distance, and then the local connectivity structure is regularized. By calculating the cosine distance and combining it with a locally scaled Gaussian kernel, the edge weights of the local connectivity structure are assigned. Combining the local connectivity structure and edge weights, we obtain the affinity matrix.
[0008] Optionally, the step of reducing the dimensionality of the principal component analysis results using the density-adaptive graph to obtain a dimensionality-reduced expression matrix includes:
constructing, using the density-adaptive graph, a first adaptive affinity graph of the principal component analysis results;
determining the neighborhood of each cell in the cell group from the first adaptive affinity graph;
calculating the difference between gene expression in the neighborhood of each cell and the overall distribution of the cell group, and testing and correcting the difference to obtain a corrected difference value;
converting the corrected difference values into information content by means of surprise, and constructing a surprise matrix;
performing singular value decomposition on the surprise matrix to extract the right singular vectors; and
extracting loading vectors from the right singular vectors based on a preset vector number threshold, and projecting the cell group expression matrix onto the loading vectors to obtain the dimensionality-reduced expression matrix.
[0009] Optionally, the step of performing cluster analysis on the dimensionality-reduced expression matrix using the density adaptive map to obtain cell clustering results includes: A second adaptive affinity graph of the dimensionality-reduced representation matrix is constructed using the density adaptive graph. The second adaptive affinity graph is clustered using a community detection algorithm to obtain cell clustering results.
[0010] Optionally, the step of identifying and analyzing the cell clustering results to obtain the rare cell identification results of the cell group data to be identified includes: The cell clustering results are analyzed to identify and determine the clusters; Differential expression analysis was performed on each cluster, and the identification result of whether each cluster is a rare cell was determined according to preset conditions; Clusters that match the rare cell identification results are selected to obtain the rare cell identification results of the cell group data to be identified.
[0011] Optionally, obtaining the principal component analysis results of the cell group data to be identified includes:
acquiring the cell group data to be identified;
performing quality control and gene filtering on the cell group data to be identified to obtain preprocessed cell group data;
normalizing the library size of each cell in the preprocessed cell group data, and obtaining target highly variable genes based on gene expression variance; and
performing principal component analysis on the target highly variable genes to obtain the principal component analysis results of the cell group data to be identified.
[0012] On the other hand, embodiments of this application provide a rare cell identification device based on a density adaptive map, the device comprising: The data acquisition module is used to acquire the principal component analysis results of the cell group data to be identified; A density-adaptive graph construction module is used to construct a density-adaptive graph; wherein, the density-adaptive graph obtains an affinity matrix by calculating local radii and assigning edge weights; The data dimensionality reduction module is used to reduce the dimensionality of the principal component analysis results using the density adaptive graph to obtain a dimensionality-reduced expression matrix; The data clustering module is used to perform clustering analysis on the dimensionality-reduced expression matrix using the density adaptive map to obtain cell clustering results; The data identification module is used to identify and analyze the cell clustering results to obtain the rare cell identification results of the cell group data to be identified.
[0013] On the other hand, embodiments of this application provide an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the above-described method.
[0014] On the other hand, embodiments of this application provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described method.
[0015] On the other hand, embodiments of this application provide a computer program product, including a computer program that, when executed by a processor, implements the above-described method.
[0016] The embodiments of this application construct a density-adaptive graph that provides a consistent density control mechanism in both the feature dimensionality reduction stage and the clustering stage. The neighborhood structure is adjusted adaptively according to the local density, which amplifies the expression deviation of rare cells and improves the accuracy and stability of rare cell identification.
Attached Figure Description
[0017] Figure 1 is a schematic diagram of the implementation environment of a rare cell identification method based on a density-adaptive graph provided in an embodiment of this application;
Figure 2 is a schematic flowchart of a rare cell identification method based on a density-adaptive graph provided in an embodiment of this application;
Figure 3 is a schematic diagram of a cell group data processing flow provided in an embodiment of this application;
Figure 4 is a schematic diagram of the structure of a rare cell identification device based on a density-adaptive graph provided in an embodiment of this application;
Figure 5 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0018] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to limit it. In the following description, when referring to the accompanying drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with those of this application; they are merely examples of apparatuses and methods consistent with some aspects of the embodiments of this application as detailed in the appended claims.
[0019] It is understood that the terms "first," "second," etc., used in this application may describe various concepts, but unless otherwise stated, these concepts are not limited by these terms. The terms are used only to distinguish one concept from another. For example, without departing from the scope of the embodiments of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "at the time of," "when," or "in response to a determination."
[0020] As used in this application, "at least one" includes one, two, or more; "multiple" includes two or more; "each" refers to every one of the corresponding plurality; and "any" refers to any one of the plurality.
[0021] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.
[0022] Currently, rare cells play an important role in biological processes such as development, immune response, tumor drug resistance, and disease progression.
[0023] In related technologies, single-cell transcriptome sequencing is typically used to characterize the cell type composition in tissues or samples, thereby identifying rare cells. However, in practical applications, it has been found that traditional single-cell transcriptome sequencing methods generally rely on k-nearest neighbor graphs with a fixed neighborhood size to describe inter-cell relationships, which leads to low accuracy in identifying rare cells due to the highly uneven cell density.
[0024] In view of this, this application provides a rare cell identification method and related device based on density adaptive graphs. By constructing a density adaptive graph, a consistent density control mechanism is provided in the feature dimensionality reduction and clustering stages. The neighborhood structure is adaptively adjusted according to the local density, which effectively amplifies the expression deviation of rare cells and effectively improves the identification accuracy and stability of rare cells.
[0025] It should be noted that in all specific embodiments of this application, when processing data related to user identity or characteristics, such as user information, user behavior data, user historical data, and user location information, user permission or consent is obtained first. Furthermore, the collection, use, and processing of this data comply with relevant laws, regulations, and standards. In addition, when embodiments of this application require access to sensitive personal information of users, separate permission or consent from the user is obtained through pop-ups or redirection to confirmation pages. Only after obtaining the user's separate permission or consent is the necessary user-related data required for the proper functioning of these embodiments acquired.
[0026] The specific implementation methods of the embodiments of this application will be described in detail below with reference to the accompanying drawings. First, a rare cell identification method based on density adaptive maps provided in the embodiments of this application will be described with reference to the accompanying drawings.
[0027] Refer to Figure 1, which is a schematic diagram illustrating the implementation environment of a rare cell identification method based on a density-adaptive graph provided in an embodiment of this application. In this implementation environment, the main hardware and software components involved include a terminal processor 110 and a server 120.
[0028] Specifically, the terminal processor 110 may be equipped with a control program for a rare cell identification method based on a density adaptive map, and the server 120 serves as the backend server for this control program. The terminal processor 110 and the backend server 120 are connected in communication. The rare cell identification method based on a density adaptive map provided in this embodiment can be executed on the terminal processor 110 side.
[0029] Server 120 can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.
[0030] In addition, server 120 can also be a node server in a blockchain network.
[0031] The terminal processor 110 and the server 120 can establish a communication connection via a wireless network. This wireless network uses standard communication technologies and / or protocols. The network can be the Internet or any other network, including but not limited to a Local Area Network (LAN), Metropolitan Area Network (MAN), Wide Area Network (WAN), mobile, or any combination of wireless networks, private networks, or virtual private networks. Furthermore, these hardware and software components can use the same or different communication connection methods; this application does not impose specific limitations in this regard.
[0032] It will be understood that the implementation environment shown in Figure 1 is only one optional application scenario for the rare cell identification method based on a density-adaptive graph provided in this embodiment. Actual applications are not restricted to the software and hardware environment shown in Figure 1, and this application places no specific limitation on it.
[0033] Figure 2 is a flowchart illustrating a rare cell identification method based on a density-adaptive graph provided in an embodiment of this application. As shown in Figure 2, the method includes, but is not limited to, steps 100 to 500.
[0034] Step 100: Obtain the principal component analysis results of the cell group data to be identified.
[0035] In this embodiment of the application, the principal component analysis results of the cell group data to be identified are used as input data for the subsequent density-adaptive graph construction and information-theoretic dimensionality reduction.
[0036] The cell group data to be identified includes the raw expression matrix (specifically, cell × gene) of single-cell transcriptome sequencing (scRNA-seq). By performing principal component analysis on the raw expression matrix of single-cell transcriptome sequencing, the original high-dimensional noise data can be transformed into a stable low-noise representation, and the principal component analysis results of the cell group data to be identified can be obtained.
[0037] Specifically, as an optional implementation, obtaining the principal component analysis results of the cell group data to be identified includes:
acquiring the cell group data to be identified;
performing quality control and gene filtering on the cell group data to be identified to obtain preprocessed cell group data;
normalizing the library size of each cell in the preprocessed cell group data, and obtaining target highly variable genes based on gene expression variance; and
performing principal component analysis on the target highly variable genes to obtain the principal component analysis results of the cell group data to be identified.
[0038] In the embodiments of this application, the cell group data to be identified is first obtained, and the cell group data to be identified is preprocessed to ensure data quality and comparability between different cells.
[0039] Specifically, refer to Figure 3, which is a schematic diagram of a cell group data processing flow provided in an embodiment of this application; in the figure, blue and orange are used to distinguish cells belonging to different clusters. First, cells whose total expression is zero or below a preset threshold (e.g., 200) are removed from the acquired single-cell transcriptome sequencing data to filter out low-quality cells. Next, genes expressed in fewer than three cells are removed to reduce extremely sparse features. Mitochondrial genes are then annotated, and cells with abnormal quality indicators are eliminated, including cells with the highest mitochondrial proportions and cells with the highest or lowest detected gene counts. This completes the quality control and gene filtering of the cell group data to be identified, yielding the preprocessed cell group data.
[0040] Next, the library size of each cell in the preprocessed cell group data is normalized and log-transformed to stabilize the variance. Genes with an expression variance below 0.1 are then removed, and the 5000 genes with the highest variance are selected as the target highly variable genes.
[0041] Finally, principal component analysis is performed on the target highly variable genes to obtain a denoised representation matrix, which serves as the principal component analysis result of the cell group data to be identified and as the input data for the subsequent density-adaptive graph construction and information-theoretic analysis.
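As an illustration only, the preprocessing and principal component analysis described in paragraphs [0039] to [0041] might be sketched as follows in Python with NumPy. The function name `preprocess_and_pca`, its parameter names, the scaling target `target_sum`, and the number of components `n_pcs` are assumptions that do not come from this application. The threshold of 200 is interpreted here as a minimum total count per cell, and the mitochondrial and outlier filtering is omitted for brevity.

```python
import numpy as np

def preprocess_and_pca(counts, min_total=200, min_cells=3, min_var=0.1,
                       n_top_genes=5000, n_pcs=50, target_sum=1e4):
    """counts: dense cells x genes matrix of raw counts (hypothetical input)."""
    counts = np.asarray(counts, dtype=float)
    # Cell filter: drop cells with zero total expression or below the threshold.
    totals = counts.sum(axis=1)
    keep_cells = (totals > 0) & (totals >= min_total)
    counts = counts[keep_cells]
    # Gene filter: drop genes expressed in fewer than `min_cells` cells.
    keep_genes = (counts > 0).sum(axis=0) >= min_cells
    counts = counts[:, keep_genes]
    # Library-size normalisation followed by a log transform.
    lib = np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    logx = np.log1p(counts / lib * target_sum)
    # Remove low-variance genes, then keep the highest-variance genes.
    var = logx.var(axis=0)
    candidates = np.flatnonzero(var >= min_var)
    top = candidates[np.argsort(var[candidates])[::-1][:n_top_genes]]
    hvg = logx[:, top]
    # PCA via SVD of the column-centred matrix; scores are U * S.
    centred = hvg - hvg.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    k = min(n_pcs, s.size)
    return u[:, :k] * s[:k], keep_cells, keep_genes
```

The returned score matrix plays the role of the denoised representation matrix; the two boolean masks record which cells and genes survived filtering.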
[0042] Step 200: Construct a density adaptive graph; wherein, the density adaptive graph obtains the affinity matrix by calculating the local radius and assigning edge weights.
[0043] In this embodiment, a density-adaptive graph is constructed to regularize the connectivity and weights of the cell graph. This is primarily achieved by decoupling the weights of edges between cells, specifically by calculating local radii to regularize the local connectivity structure and assigning edge weights, ultimately obtaining the affinity matrix after the density-adaptive graph operation. This allows for the effective characterization of neighborhood relationships between cells in a high-dimensional space, even considering the inherent high dimensionality, high noise, and uneven density distribution of single-cell data.
[0044] Specifically, as an optional implementation, the construction of the density adaptive graph includes: The local radius is determined by calculating the Euclidean distance, and then the local connectivity structure is regularized. By calculating the cosine distance and combining it with a locally scaled Gaussian kernel, the edge weights of the local connectivity structure are assigned. Combining the local connectivity structure and edge weights, we obtain the affinity matrix.
[0045] In this embodiment, the density adaptive graph mainly includes two parts: a regularized local connectivity structure and a regularized nearest neighbor magnitude, i.e., edge weights.
[0046] Specifically, for each cell with a feature representation, the set of candidate cells closest to it is selected, and the local radius is determined by calculating Euclidean distances. The calculation can be expressed as formula (1), whose terms are: the local radius of cell i; M, the number of candidate cells in the nearest candidate cell set; a bias-correction constant, which can be set to 1, for example; and the Euclidean distance between cell i and cell j.
[0047] Furthermore, once the local radius has been determined from the Euclidean distances, the edge from cell i to cell j is retained if the distance from cell i to cell j satisfies the local-radius condition; otherwise the edge is deleted, that is, the weight of the connecting edge is 0. Whether two cells are connected is thus decided by radius gating, which shrinks the neighborhood in dense regions and prevents cells in sparse regions from reaching extremely distant cells.
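The radius gating described above can be illustrated with a minimal NumPy sketch. Because formula (1) is not reproduced in this text, the radius below is a stand-in chosen purely for illustration: the mean Euclidean distance to the M nearest candidate cells, multiplied by a constant `c`. The function name `radius_gated_edges`, the default `m=15`, and the gating rule "distance no greater than the radius" are likewise assumptions, not the formula of this application.

```python
import numpy as np

def radius_gated_edges(x, m=15, c=1.0):
    """Return a boolean edge mask (i -> j) and the per-cell radius."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    m = min(m, n - 1)
    # Pairwise Euclidean distances; a cell is never its own candidate.
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    # Candidate set: the m nearest cells of each cell.
    cand = np.argsort(dist, axis=1)[:, :m]
    cand_dist = np.take_along_axis(dist, cand, axis=1)
    # ASSUMED radius: scaled mean distance to the candidates.
    radius = c * cand_dist.mean(axis=1)
    # Gate: keep only candidate edges that fall within the radius.
    keep = (cand_dist <= radius[:, None]).ravel()
    rows = np.repeat(np.arange(n), m)
    mask = np.zeros((n, n), dtype=bool)
    mask[rows[keep], cand.ravel()[keep]] = True
    return mask, radius
```

With a small isolated group of cells, the distant candidates that a fixed-size neighborhood would force in are rejected by the gate, while the few genuinely close neighbors are kept.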
[0048] Furthermore, edge weights are assigned to the local connectivity structure by calculating the cosine distance and combining it with a locally scaled Gaussian kernel. The cosine distance can be defined by formula (2):
$d_{\cos}(x_i, x_j) = 1 - \cos(x_i, x_j)$ (2)
where $d_{\cos}(x_i, x_j)$ is the cosine distance between cells i and j; $\cos(x_i, x_j)$ is the cosine similarity; $x_i$ represents cell i; and $x_j$ represents cell j.
[0049] Furthermore, when the distance from cell i to cell j satisfies the local-radius condition, the affinity between cells i and j is determined from the cosine distance between them, and can be calculated using formula (3), whose terms are: the affinity between cells i and j; the cosine distance from cell i to its farthest candidate cell; and the cosine distance from cell j to its farthest candidate cell.
[0050] Therefore, the affinity calculated by formula (3) can be regularized in heterogeneous density distribution by locally scaled kernel function, thereby generating an adjusted affinity matrix, making the similarity from dense regions and sparse regions numerically comparable, and avoiding high-density clusters from dominating the community detection process.
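A locally scaled Gaussian kernel on cosine distances can be sketched as below. Formula (3) is not reproduced in this text, so the kernel form used here, exp(-d^2 / (sigma_i * sigma_j)) with sigma taken as the cosine distance to the farthest candidate cell, is the commonly used self-tuning form and is an assumption rather than the formula of this application. The function names and the `cand_mask` / `gate_mask` arguments are hypothetical.

```python
import numpy as np

def cosine_distance_matrix(x):
    """Pairwise cosine distance, 1 - cosine similarity."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.maximum(norms, 1e-12)
    return np.clip(1.0 - unit @ unit.T, 0.0, 2.0)

def locally_scaled_affinity(x, cand_mask, gate_mask=None):
    """cand_mask[i, j]: j is a candidate of i; gate_mask[i, j]: edge kept."""
    d = cosine_distance_matrix(x)
    gate = cand_mask if gate_mask is None else gate_mask
    # sigma_i: cosine distance from cell i to its farthest candidate cell.
    sigma = np.where(cand_mask, d, 0.0).max(axis=1)
    sigma = np.maximum(sigma, 1e-12)
    # ASSUMED kernel form; edges that failed the gate get weight 0.
    w = np.exp(-(d ** 2) / (sigma[:, None] * sigma[None, :]))
    return np.where(gate, w, 0.0)
```

Dividing by the product of the two local scales is what makes weights from dense and sparse regions fall on a comparable numerical range.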
[0051] Furthermore, an affinity matrix is obtained by integrating the local connectivity structure and edge weights of all cells. The graph can be symmetrized by retaining only the smaller of the two directed affinity values between each pair of cells, which eliminates unidirectional long-distance shortcuts and yields a symmetric sparse affinity matrix.
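The minimum-based symmetrization can be shown in a few lines of NumPy. The function name `symmetrize_min` and the dense-matrix representation are illustrative assumptions; a practical implementation would more likely use a sparse matrix.

```python
import numpy as np

def symmetrize_min(w):
    """Keep, for each pair of cells, the smaller of the two directed weights."""
    w = np.asarray(w, dtype=float)
    return np.minimum(w, w.T)

# Cell 0 reaches cell 2, but cell 2 does not reach cell 0:
directed = np.array([[0.0, 0.8, 0.3],
                     [0.5, 0.0, 0.0],
                     [0.0, 0.0, 0.0]])
symmetric = symmetrize_min(directed)
```

In this example the one-way edge from cell 0 to cell 2 disappears, while the mutual edge between cells 0 and 1 keeps the smaller weight, 0.5.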
[0052] Therefore, in a representation space with strongly uneven sampling density, a weighted, symmetric affinity matrix that can reflect the real local geometry and is numerically comparable can be generated by density adaptive graph, avoiding cross-density pseudo-connections or redundant connections in dense regions caused by fixed kNN.
[0053] Step 300: Dimensionality reduction is performed on the principal component analysis results using the density adaptive graph to obtain the dimensionality-reduced expression matrix.
[0054] In this embodiment of the application, density adaptive graphs are applied to the principal component analysis results to calculate information theory-based dimensionality reduction representations that reflect the local and global expression biases. Thus, within the adaptive neighborhood defined by the density adaptive graph, statistical tests are used to quantify the relative global expression bias of each cell within its neighborhood and convert it into a low-dimensional representation to amplify the local features of rare cells.
[0055] Specifically, as an optional implementation, the step of reducing the dimensionality of the principal component analysis results using the density-adaptive graph to obtain a dimensionality-reduced expression matrix includes:
constructing, using the density-adaptive graph, a first adaptive affinity graph of the principal component analysis results;
determining the neighborhood of each cell in the cell group from the first adaptive affinity graph;
calculating the difference between gene expression in the neighborhood of each cell and the overall distribution of the cell group, and testing and correcting the difference to obtain a corrected difference value;
converting the corrected difference values into information content by means of surprise, and constructing a surprise matrix;
performing singular value decomposition on the surprise matrix to extract the right singular vectors; and
extracting loading vectors from the right singular vectors based on a preset vector number threshold, and projecting the cell group expression matrix onto the loading vectors to obtain the dimensionality-reduced expression matrix.
[0056] In the embodiments of this application, referring to Figure 3, the density-adaptive graph construction described in the above steps is applied to the principal component analysis results to construct the first adaptive affinity graph of the principal component analysis results. The neighborhood of each cell in the cell group is then determined from the first adaptive affinity graph. Specifically, the neighborhood of a cell consists of all cells whose affinity to that cell is greater than 0.
[0057] Furthermore, the neighborhood of each cell is used to assess whether gene expression within that neighborhood differs from the global distribution. Specifically, the two-tailed p-value obtained from the Wilcoxon rank-sum test can be used as the difference value, which can be expressed as formula (4):
$p_{g,i} = \mathrm{Wilcoxon}\left(x_g^{N(i)}, x_g\right)$ (4)
where $x_g^{N(i)}$ denotes the expression of gene g in the neighborhood $N(i)$ of cell i; $x_g$ denotes the expression of gene g in all cells; $\mathrm{Wilcoxon}(\cdot,\cdot)$ denotes the Wilcoxon rank-sum test; and $p_{g,i}$ is the two-tailed p-value obtained from the test.
[0058] Furthermore, the two-tailed p-value can be corrected using a multiple-testing correction method that controls the family-wise error rate (FWER), yielding the corrected difference value.
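The neighborhood test and its correction might look like the following sketch, which uses `scipy.stats.ranksums` (two-sided by default) for the Wilcoxon rank-sum test. The application does not name the FWER-controlling method, so Bonferroni is used here only as one example, applied across all cell and gene pairs; that choice of test family, and the function names `neighborhood_pvalues` and `bonferroni`, are assumptions.

```python
import numpy as np
from scipy.stats import ranksums

def neighborhood_pvalues(expr, affinity):
    """expr: cells x genes; affinity: cells x cells affinity matrix."""
    expr = np.asarray(expr, dtype=float)
    affinity = np.asarray(affinity, dtype=float)
    n_cells, n_genes = expr.shape
    pvals = np.ones((n_cells, n_genes))
    for i in range(n_cells):
        # Neighborhood: all cells with affinity > 0 to cell i.
        nbrs = np.flatnonzero(affinity[i] > 0)
        if nbrs.size == 0:
            continue  # isolated cell: leave p = 1
        for g in range(n_genes):
            # Two-sided rank-sum test, neighborhood vs. all cells.
            pvals[i, g] = ranksums(expr[nbrs, g], expr[:, g]).pvalue
    return pvals

def bonferroni(pvals):
    """One FWER-controlling correction (an assumed choice)."""
    pvals = np.asarray(pvals, dtype=float)
    return np.minimum(pvals * pvals.size, 1.0)
```

The double loop is written for clarity rather than speed; it performs one test per cell and gene.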
[0059] Furthermore, the corrected difference value can be converted into information content by means of surprise, which can be expressed as formula (5), whose terms are: the information content of gene g at cell i; the corrected difference value; and a sign term indicating whether the gene is upregulated or downregulated.
Furthermore, by aggregating the information content of all genes, a surprise matrix can be constructed, and singular value decomposition can be performed on the surprise matrix to extract its principal right singular vectors. The singular value decomposition can be expressed as formula (6):
$I = U \Sigma V^{\top}$ (6)
where $I$ is the surprise matrix; $U$ and $V$ are orthogonal matrices; and $\Sigma$ is a diagonal matrix.
[0060] Furthermore, the right singular vectors are obtained by extracting the column vectors of the matrix V. Based on a preset vector number threshold, for example 15, chosen according to the needs of the scenario, the first 15 loading vectors are extracted as metagenes. The original cell group expression matrix is then projected onto these loading vectors to obtain the dimensionality-reduced expression matrix, which highlights the gene features whose local behavior deviates most significantly from the global background.
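The surprise matrix, its singular value decomposition, and the projection onto the leading loading vectors can be sketched with NumPy as follows. Formula (5) is not reproduced in this text; the sketch assumes the usual surprisal form, a signed negative logarithm of the corrected p-value, and the base-10 logarithm, the clipping constant `eps`, and the function name `surprise_embedding` are illustrative assumptions.

```python
import numpy as np

def surprise_embedding(expr, adj_pvals, sign, n_components=15, eps=1e-300):
    """expr, adj_pvals, sign: cells x genes arrays; sign holds +1 / -1."""
    expr = np.asarray(expr, dtype=float)
    # ASSUMED surprisal: signed -log10 of the corrected p-value.
    info = sign * -np.log10(np.clip(adj_pvals, eps, 1.0))
    # SVD of the surprise matrix: info = U @ diag(S) @ Vt.
    _, _, vt = np.linalg.svd(info, full_matrices=False)
    # Rows of Vt are right singular vectors; keep the first k as loadings.
    k = min(n_components, vt.shape[0])
    loadings = vt[:k].T  # genes x k
    # Project the expression matrix onto the loading vectors.
    return expr @ loadings, loadings
```

Each column of `loadings` acts as one metagene, and the first returned array has one row per cell and one column per metagene.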
[0061] Step 400: Using the density adaptive graph, perform cluster analysis on the dimensionality-reduced expression matrix to obtain the cell clustering results.
[0062] In this embodiment, the density-adaptive graph construction is applied again, this time to the information-theoretic dimensionality-reduced expression matrix. Community detection is then performed on the resulting adaptive graph to achieve cluster analysis, yielding cell clustering results in which stable and rare populations are better preserved.
[0063] Specifically, as an optional implementation, the step of performing cluster analysis on the dimensionality-reduced expression matrix using the density adaptive map to obtain cell clustering results includes: A second adaptive affinity graph of the dimensionality-reduced representation matrix is constructed using the density adaptive graph. The second adaptive affinity graph is clustered using a community detection algorithm to obtain cell clustering results.
[0064] In the embodiments of this application, as shown in Figure 3, the information-theoretic dimensionality-reduced expression matrix is further processed with the density-adaptive graph to construct a second adaptive affinity graph, which encodes local similarity in the information-theoretic representation space. The second adaptive affinity graph is then used as input data for a community detection algorithm to perform cluster analysis, thereby obtaining the cell clustering results. In practical applications, the second adaptive affinity graph can be clustered using the Leiden community detection algorithm, and the community structure is obtained by maximizing the modularity, which can be expressed as formula (7):
$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \gamma \frac{k_i k_j}{2m} \right) \delta(c_i, c_j)$ (7)
where $Q$ is the modularity; $m$ is the sum of all edge weights in the second adaptive affinity graph; $A_{ij}$ is the edge weight between cell i and cell j in the second adaptive affinity graph; $\gamma$ is a resolution parameter, which can be set to the default value of 1; $k_i$ is the sum of the edge weights of node i; $k_j$ is the sum of the edge weights of node j; $\delta$ is the indicator function; $c_i$ denotes the community assignment of node i; and $c_j$ denotes the community assignment of node j.
[0065] Specifically, to maximize the modularity, the Leiden community detection algorithm improves the partition by moving nodes between clusters, then merges the current clusters, treating each cluster as a supernode, and constructs a coarsened metagraph for the next round of optimization, until a stable partition is finally obtained.
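The quantity that community detection maximizes can be evaluated directly for any given partition. The sketch below computes the standard weighted modularity with a resolution parameter for a symmetric affinity matrix; it does not implement the Leiden algorithm itself, and the function name `modularity` is an illustrative assumption.

```python
import numpy as np

def modularity(a, labels, gamma=1.0):
    """Weighted modularity of a partition of a symmetric affinity matrix."""
    a = np.asarray(a, dtype=float)
    labels = np.asarray(labels)
    k = a.sum(axis=1)   # weighted degree of each node
    two_m = a.sum()     # each undirected edge is counted twice, so this is 2m
    same = labels[:, None] == labels[None, :]  # indicator: same community
    return ((a - gamma * np.outer(k, k) / two_m) * same).sum() / two_m

# Two disconnected triangles.
tri = np.ones((3, 3)) - np.eye(3)
graph = np.block([[tri, np.zeros((3, 3))],
                  [np.zeros((3, 3)), tri]])
q_split = modularity(graph, [0, 0, 0, 1, 1, 1])   # 0.5
q_merged = modularity(graph, [0, 0, 0, 0, 0, 0])  # 0.0
```

Splitting the two triangles into separate communities scores higher than merging them, which is the direction in which a modularity-maximizing algorithm moves the partition.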
[0066] Therefore, by reconstructing an adaptive neighborhood in the information-theoretic dimensionality-reduced representation matrix space and performing community detection, the amplified local anomalies during dimensionality reduction are not submerged or incorrectly merged during the clustering stage, thus improving the accuracy of rare cell identification.
[0067] Therefore, this application constructs a density adaptive graph to maintain the reliability of the neighborhood structure in a non-uniform density space, so that the local expression deviation of rare cells can be more fully preserved. At the same time, it provides a consistent density regulation mechanism in the feature dimensionality reduction and clustering stages, and adaptively adjusts the neighborhood structure with the local density to make the clustering structure more robust, avoiding the problems of over-clustering or under-clustering. This significantly improves the detection performance of rare cells while maintaining efficient and scalable computing, effectively amplifies the expression deviation of rare cells, and effectively improves the recognition accuracy and stability of rare cells.
[0068] Step 500: Perform identification and analysis on the cell clustering results to obtain the rare cell identification results of the cell group data to be identified.
[0069] In the embodiments of this application, rare cell subpopulations with low abundance but biological significance are identified and verified from cell clustering results as rare cell identification results of cell group data to be identified.
[0070] Specifically, as an optional implementation, the step of identifying and analyzing the cell clustering results to obtain the rare cell identification results of the cell group data to be identified includes: analyzing the cell clustering results to determine the clusters; performing differential expression analysis on each cluster, and determining, according to preset conditions, whether each cluster is a rare cell population; and selecting the clusters determined to be rare cells to obtain the rare cell identification results of the cell group data to be identified.
[0071] In this embodiment of the application, the cell clusters obtained from the cell clustering results can first be visualized by dimensionality reduction (e.g., using a nonlinear mapping method that preserves structural relationships) to intuitively show the distribution and boundary relationships of different cell populations in the expression space, thereby helping to identify potential rare cell clusters.
[0072] Furthermore, each cluster is statistically tested against all other cells, and representative upregulated marker genes are screened under a uniform significance threshold, thereby completing the differential expression analysis of each cluster.
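The application does not name a specific statistical test. As one possible sketch, the one-vs-rest screening could use a one-sided rank-sum (Mann-Whitney) test with a normal approximation. The function names, the input layout, the threshold `alpha=0.05`, and the omission of tie and multiple-testing corrections are all assumptions made for this example.

```python
import math


def rank_sum_upregulated(in_cluster, rest):
    """One-sided Mann-Whitney p-value for 'higher in the cluster'.

    Uses the normal approximation without tie correction; both groups
    are assumed to be non-empty.
    """
    pooled = sorted([(v, 0) for v in in_cluster] + [(v, 1) for v in rest])
    rank_sum_in = 0.0
    pos = 0
    while pos < len(pooled):
        end = pos
        while end < len(pooled) and pooled[end][0] == pooled[pos][0]:
            end += 1
        avg_rank = (pos + 1 + end) / 2.0  # mean of ranks pos+1 .. end
        rank_sum_in += avg_rank * sum(1 for _, g in pooled[pos:end] if g == 0)
        pos = end
    n1, n2 = len(in_cluster), len(rest)
    u = rank_sum_in - n1 * (n1 + 1) / 2.0
    z = (u - n1 * n2 / 2.0) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)


def upregulated_markers(expr, labels, cluster, alpha=0.05):
    """Genes significantly higher in `cluster` than in all other cells.

    expr:   dict mapping gene -> list of per-cell expression values.
    labels: list of per-cell cluster labels, aligned with the value lists.
    """
    markers = []
    for gene, values in expr.items():
        inside = [v for v, lab in zip(values, labels) if lab == cluster]
        outside = [v for v, lab in zip(values, labels) if lab != cluster]
        p = rank_sum_upregulated(inside, outside)
        if p < alpha:
            markers.append((gene, p))
    return sorted(markers, key=lambda item: item[1])
```

In practice, an established single-cell toolkit and a multiple-testing correction would normally replace this hand-written test; the sketch only shows the shape of the one-vs-rest comparison.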
[0073] Furthermore, whether each cluster is a rare cell population is determined according to preset conditions. For example, the preset conditions can be set as follows: the number of cells in the cluster accounts for a certain proportion of the total number of cells; the cluster has a clear boundary and forms an independent structure in the expression space; the cluster is enriched with specific marker genes and can be distinguished from neighboring cell types; and the gene expression pattern of the cluster is consistent with existing biological knowledge or category characteristics.
[0074] Furthermore, the determinations for all clusters are considered together, and the clusters determined to be rare cells are selected as the rare cell identification results for the cell group data to be identified.
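A minimal sketch of such a screening is given below. It encodes only two of the example conditions, cluster abundance and marker gene enrichment. The thresholds `max_fraction` and `min_markers` and the function name are hypothetical values chosen for illustration, not values from the application. The boundary and biological-consistency conditions are assumed to be checked separately.

```python
from collections import Counter


def select_rare_clusters(labels, marker_counts, max_fraction=0.05, min_markers=3):
    """Return the clusters that pass two illustrative preset conditions.

    labels:        list of per-cell cluster labels.
    marker_counts: dict mapping cluster -> number of significant
                   upregulated marker genes found for that cluster.
    max_fraction:  hypothetical upper bound on the cluster's share of cells.
    min_markers:   hypothetical minimum number of marker genes.
    """
    sizes = Counter(labels)
    n_cells = len(labels)
    rare = [
        cluster
        for cluster, size in sizes.items()
        if size / n_cells <= max_fraction
        and marker_counts.get(cluster, 0) >= min_markers
    ]
    return sorted(rare)
```

With 95 cells in cluster `a`, 3 in `b`, and 2 in `c`, and marker counts of 10, 5, and 1, only `b` is returned: `c` is small enough but lacks marker support.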
[0075] In practical applications, cell clustering results can also be used to detect cell subpopulations not included in the original annotations. If any cluster exhibits highly consistent and unique molecular characteristics, it can be regarded as a potential candidate for a new cell category, which can facilitate subsequent life science analysis.
[0076] Referring to Figure 4, Figure 4 is a schematic diagram of a rare cell identification device based on a density-adaptive graph provided in an embodiment of this application. The device can implement the above-mentioned rare cell identification method based on a density-adaptive graph, and includes: a data acquisition module 410, used to acquire the principal component analysis results of the cell group data to be identified; a density-adaptive graph construction module 420, used to construct a density-adaptive graph, wherein the density-adaptive graph obtains the affinity matrix by calculating the local radius and assigning edge weights; a data dimensionality reduction module 430, used to reduce the dimensionality of the principal component analysis results through the density-adaptive graph to obtain a dimensionality-reduced expression matrix; a data clustering module 440, used to perform clustering analysis on the dimensionality-reduced expression matrix through the density-adaptive graph to obtain cell clustering results; and a data identification module 450, used to identify and analyze the cell clustering results to obtain the rare cell identification results of the cell group data to be identified.
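One way to picture how the five modules fit together is the following sketch, in which each module is a pluggable callable and the same graph builder is shared by the dimensionality reduction and clustering stages. The class name `RareCellPipeline`, the field names, and the callable signatures are assumptions for illustration; the application does not prescribe a software interface.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class RareCellPipeline:
    """Illustrative wiring of modules 410 to 450 (hypothetical interface)."""

    acquire: Callable[[Any], Any]            # 410: raw data -> PCA results
    build_graph: Callable[[Any], Any]        # 420: embedding -> affinity matrix
    reduce: Callable[[Any, Callable], Any]   # 430: PCA results -> reduced matrix
    cluster: Callable[[Any, Callable], Any]  # 440: reduced matrix -> clusters
    identify: Callable[[Any], Any]           # 450: clusters -> rare cell results

    def run(self, raw: Any) -> Any:
        pca = self.acquire(raw)
        # The same graph builder is passed to both stages, so the density
        # regulation is consistent across reduction and clustering.
        reduced = self.reduce(pca, self.build_graph)
        clusters = self.cluster(reduced, self.build_graph)
        return self.identify(clusters)
```

Passing `build_graph` into both `reduce` and `cluster` mirrors the description that the first and second adaptive affinity graphs are constructed by the same density-adaptive procedure on different inputs.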
[0077] It is understood that the content of the above method embodiments is applicable to the present device embodiments. The specific functions implemented by the present device embodiments are the same as those of the above method embodiments, and the beneficial effects achieved are also the same as those achieved by the above method embodiments.
[0078] Referring to Figure 5, Figure 5 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of this application. The electronic device includes: a processor 501, which can be implemented using a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this application; a memory 502, which can be implemented as a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM), and which can store the operating system and other applications, wherein, when the technical solutions provided in the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 502 and is called by the processor 501 to execute the methods described in the embodiments of this application; an input/output interface 503, used to implement information input and output; a communication interface 504, used to enable communication and interaction between this device and other devices, where communication can be achieved through wired means (such as USB or a network cable) or wireless means (such as a mobile network, Wi-Fi, or Bluetooth); and a bus 505, which transmits information between the components of the device (e.g., the processor 501, the memory 502, the input/output interface 503, and the communication interface 504). The processor 501, the memory 502, the input/output interface 503, and the communication interface 504 are connected to each other within the device via the bus 505.
[0079] This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described method.
[0080] It is understood that the content of the above method embodiments is applicable to this storage medium embodiment. The specific functions implemented in this storage medium embodiment are the same as those in the above method embodiments, and the beneficial effects achieved are also the same as those achieved in the above method embodiments.
[0081] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the above-described method.
[0082] It is understood that the content of the above method embodiments is applicable to the embodiments of this program product. The specific functions implemented by the embodiments of this program product are the same as those of the above method embodiments, and the beneficial effects achieved are also the same as those achieved by the above method embodiments.
[0083] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs. Furthermore, memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory may optionally include memory remotely located relative to the processor, and these remote memories can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
[0084] This application provides a method and related device for rare cell identification based on a density-adaptive graph. By constructing a density-adaptive graph, it provides a consistent density regulation mechanism across the feature dimensionality reduction and clustering stages. The neighborhood structure is adaptively adjusted according to the local density, which amplifies the expression deviation of rare cells and improves the accuracy and stability of rare cell identification.
[0085] The embodiments described in this application are for the purpose of more clearly illustrating the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions provided by the embodiments of this application. As those skilled in the art will know, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.
[0086] Those skilled in the art will understand that the technical solutions shown in the figures do not constitute a limitation on the embodiments of this application, which may include more or fewer steps than shown, combine certain steps, or include different steps.
[0087] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0088] Those skilled in the art will understand that all or some of the steps in the methods disclosed above, as well as the functional modules / units in the systems and devices, can be implemented as software, firmware, hardware, or suitable combinations thereof.
[0089] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in the specification and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms “comprising” and “having,” and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0090] It should be understood that in this application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" is used to describe the relationship between related objects, indicating that three relationships can exist. For example, "A and/or B" can represent three cases: only A exists, only B exists, and both A and B exist simultaneously, where A and B can be singular or plural. The character "/" generally indicates that the preceding and following related objects are in an "or" relationship. "At least one (item) of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one (item) of a, b, or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.
[0091] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units described above is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.
[0092] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0093] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0094] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes multiple instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing programs, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0095] The preferred embodiments of the present application have been described above with reference to the accompanying drawings, but this does not limit the scope of the claims of the present application. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and substance of the embodiments of the present application shall be within the scope of the claims of the present application.
Claims
1. A method for rare cell identification based on a density-adaptive graph, characterized in that, The method includes the following steps: Obtaining the principal component analysis results of the cell group data to be identified; Constructing a density-adaptive graph; wherein, the density-adaptive graph obtains an affinity matrix by calculating the local radius and assigning edge weights; Reducing the dimensionality of the principal component analysis results using the density-adaptive graph to obtain a dimensionality-reduced expression matrix; Performing cluster analysis on the dimensionality-reduced expression matrix using the density-adaptive graph to obtain cell clustering results; Identifying and analyzing the cell clustering results to obtain the rare cell identification results of the cell group data to be identified.
2. The method according to claim 1, characterized in that, The construction of the density-adaptive graph includes: Determining the local radius by calculating the Euclidean distance, and regularizing the local connectivity structure; Assigning the edge weights of the local connectivity structure by calculating the cosine distance and combining it with a locally scaled Gaussian kernel; Combining the local connectivity structure and the edge weights to obtain the affinity matrix.
3. The method according to claim 1, characterized in that, The step of reducing the dimensionality of the principal component analysis results using the density-adaptive graph to obtain a dimensionality-reduced expression matrix includes: Constructing, using the density-adaptive graph, a first adaptive affinity graph of the principal component analysis results; Determining the neighborhood of each cell in the cell group through the first adaptive affinity graph; Calculating the difference between gene expression in the neighborhood of each cell and the overall distribution of the cell group, and testing and correcting the difference to obtain corrected difference values; Constructing, based on the corrected difference values, a surprise matrix by converting surprise into information content; Performing singular value decomposition on the surprise matrix to extract the right singular vectors; Extracting, based on a preset vector quantity threshold, loading vectors from the right singular vectors, and projecting the cell group expression matrix onto the loading vectors to obtain the dimensionality-reduced expression matrix.
4. The method according to claim 1, characterized in that, The step of performing cluster analysis on the dimensionality-reduced expression matrix using the density-adaptive graph to obtain cell clustering results includes: Constructing, using the density-adaptive graph, a second adaptive affinity graph of the dimensionality-reduced expression matrix; Clustering the second adaptive affinity graph using a community detection algorithm to obtain the cell clustering results.
5. The method according to claim 1, characterized in that, The step of identifying and analyzing the cell clustering results to obtain the rare cell identification results of the cell group data to be identified includes: Analyzing the cell clustering results to determine the clusters; Performing differential expression analysis on each cluster, and determining, according to preset conditions, whether each cluster is a rare cell population; Selecting the clusters determined to be rare cells to obtain the rare cell identification results of the cell group data to be identified.
6. The method according to claim 1, characterized in that, The step of obtaining the principal component analysis results of the cell group data to be identified includes: Acquiring the cell group data to be identified; Performing quality control and gene filtering on the cell group data to be identified to obtain preprocessed cell group data; Normalizing the library size of each cell in the preprocessed cell group data, and obtaining target highly variable genes based on the gene expression variance; Performing principal component analysis on the target highly variable genes to obtain the principal component analysis results of the cell group data to be identified.
7. A rare cell identification device based on a density-adaptive graph, characterized in that, The device includes: A data acquisition module, used to acquire the principal component analysis results of the cell group data to be identified; A density-adaptive graph construction module, used to construct a density-adaptive graph; wherein, the density-adaptive graph obtains an affinity matrix by calculating the local radius and assigning edge weights; A data dimensionality reduction module, used to reduce the dimensionality of the principal component analysis results using the density-adaptive graph to obtain a dimensionality-reduced expression matrix; A data clustering module, used to perform clustering analysis on the dimensionality-reduced expression matrix using the density-adaptive graph to obtain cell clustering results; A data identification module, used to identify and analyze the cell clustering results to obtain the rare cell identification results of the cell group data to be identified.
8. An electronic device, characterized in that, The electronic device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the method according to any one of claims 1 to 6.
9. A computer-readable storage medium storing a computer program, characterized in that, When the computer program is executed by a processor, it implements the method of any one of claims 1 to 6.
10. A computer program product, comprising a computer program, characterized in that, When the computer program is executed by a processor, it implements the method of any one of claims 1 to 6.