Differential gene identification method based on non-parametric test
By employing a nonparametric statistical test method based on kernel functions, and utilizing unbiased empirical estimation based on maximum mean difference (MMD) and random shuffling, differentially expressed genes in time-series RNA-Seq data are identified. This solves the problems of flexibility and accuracy in identifying differentially expressed genes in existing technologies, and enables efficient analysis of complex data.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- WUHAN BOTANICAL GARDEN CHINESE ACAD OF SCI
- Filing Date
- 2024-07-18
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies struggle to effectively identify differentially expressed genes when processing time-series RNA-Seq data, especially when time points and biological reproducibility are insufficient, and lack flexible analytical methods.
We employed a kernel-based nonparametric statistical test method to identify differentially expressed genes in time-series RNA-Seq data through unbiased empirical estimation of maximum mean difference (MMD) and random shuffling.
It improves the flexibility and applicability of analyzing datasets with few time points or large fluctuations in gene expression levels, and can better capture the nonlinear dynamic changes in gene expression, maintaining a low false positive rate while improving the true positive rate.
Smart Images

Figure CN118918949B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of bioinformatics technology. Specifically, it relates to a differential gene identification method based on nonparametric tests, applicable to gene identification. Background Technology
[0002] Because gene expression regulation is a dynamic process, characterizing genome-wide dynamics of gene expression is crucial for understanding transcriptomic changes in biological processes. Time-series RNA-Seq data, by collecting gene expression levels from biological samples of the same subject at different time points, allows us to study the dynamic behavior of gene expression. The availability of time-series RNA-Seq data has increased significantly thanks to the advantages of high-throughput transcriptomics, making time-series gene expression experiments increasingly common. The primary goal of these experiments is to identify differentially expressed genes between two conditions. This process is crucial because detecting differentially expressed genes through time-series analysis not only promotes a deeper understanding of gene regulation mechanisms under time-varying conditions and stimuli but also opens the possibility of discovering new biomarkers. Due to high experimental costs or the limited number of unique individual biological samples, time-series RNA-Seq datasets typically contain few time points and biological replicates. However, the fundamental scientific need to identify differentially expressed genes associated with biological processes has not diminished. Therefore, there is an urgent need to reliably detect differentially expressed genes from datasets with few time points and replicates. Furthermore, the diversity of experimental designs generates different types of data, necessitating flexible differential expression analysis methods to handle data from any experiment.
[0003] Several differential expression analysis methods have been developed for time-series RNA-seq datasets. For example, NextmaSigPro-GLM aims to identify significant differential expression profiles across multiple experimental time points or conditions in RNA-seq data. This model utilizes a negative binomial generalized linear model to adapt to the complexity and dynamics of gene expression changes over time. splineTC is an R package specifically designed for time-series analysis. It uses spline functions to smoothly model and analyze the temporal trends of the data. This approach is particularly valuable for understanding the dynamic patterns of gene expression changes over time. ImpulseDE2 employs a unique algorithm that models gene expression patterns as impulse-like responses, making it adept at detecting transient or dynamic changes in gene expression. DESeq2 and edgeR, both based on negative binomial models, are considered the gold standard in differential expression analysis. While traditional comparisons using these two tools do not take into account the analysis of time-series data, their sophisticated design allows for relatively simple investigations of time-series data. Summary of the Invention
[0004] The purpose of this invention is to address the aforementioned problems in the existing technology by providing a differential gene identification method based on nonparametric tests. This method uses a nonparametric statistical test based on kernel functions, namely the maximum mean difference (MMD), to identify and analyze differentially expressed genes in time-series RNA-Seq data.
[0005] The above-mentioned objectives of the present invention are achieved through the following technical means:
[0006] The differential gene identification method based on nonparametric tests includes the following steps:
[0007] Step 1: Obtain the gene expression matrix of the experimental group and the control group. Each column of the gene expression matrix represents a time point, and each element in each row of the gene expression matrix constitutes the gene expression level sequence of the corresponding gene in the row.
[0008] Step 2: Calculate the unbiased empirical estimate of MMD between the gene expression level sequence of gene i in the experimental group and the gene expression level sequence in the control group, as the original unbiased empirical estimate of MMD for gene i;
[0009] Step 3: Combine the gene expression level sequences of gene i in the experimental group and the control group, then randomly shuffle and rearrange them to generate new gene expression level sequences of gene i in the experimental group and the control group. Calculate the MMD unbiased empirical estimate as the new MMD unbiased empirical estimate. Based on the original MMD unbiased empirical estimate and the new MMD unbiased empirical estimate, calculate the significance parameter P of gene i. adj If the significance parameter P of gene i adj If the expression level is less than the significance threshold, gene i is considered a differentially expressed gene.
[0010] As described above, obtaining the gene expression matrices of the experimental group and the control group in step 1 includes the following steps: collecting raw RNA-seq data of the experimental group and the control group under the same temperature environment and at multiple time points, and obtaining the gene expression matrices of the experimental group and the control group after data preprocessing.
[0011] As described above, the unbiased empirical estimates of MMD in steps 2 and 3 are calculated based on the following formula:
[0012]
[0013] Among them, MMD 2 (X i ,Y i X is the unbiased empirical estimate of the MMD of gene i. i This represents the gene expression level sequence of gene i at m time points in the experimental group. Y represents the gene expression level of gene i from time point 1 to time point m in the gene expression matrix of the experimental group. i This represents the gene expression level sequence of gene i at n time points in the control group. Let represent the gene expression level of gene i from the first time point to the nth time point in the gene expression matrix of the control group, m represent the total number of time points in the gene expression matrix of the experimental group, n represent the total number of time points in the gene expression matrix of the control group, iu1 and iu2 represent the sequence numbers of the time points in the experimental group, iv1 and iv2 represent the sequence numbers of the time points in the control group, and k represent the inner product function.
[0014] The significance parameter P of gene i as described above adj Calculated based on the following formula:
[0015]
[0016] Where r is the index of the random shuffling, R is the total number of random shufflings, I(·) is the indicator function, and let X be the gene expression level sequence of gene i in the experimental group and the gene expression level sequence in the control group after merging and random shuffling. ir The gene expression level sequence of the new control group is Y. ir The new unbiased empirical estimate of MMD is calculated as MMD. 2 [X ir ,Y ir ].
[0017] As mentioned above, the content within the parentheses of the indicator function represents the condition. When the condition is true, the value of the indicator function is 1, and when the condition is false, the value of the indicator function is 0.
[0018] A computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the steps of the differential gene identification method described above.
[0019] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the differential gene identification method described above.
[0020] A computer program product includes a computer program that, when executed by a processor, implements the steps of the differential gene identification method described above.
[0021] Compared with the prior art, the present invention has the following advantages:
[0022] This method does not rely on specific data distribution assumptions and can more flexibly handle datasets with few time points or large fluctuations in gene expression levels. Through unbiased empirical estimation using MMD, this method can better capture the nonlinear dynamic changes in gene expression levels between experimental and control groups. Based on the non-uniformity and complexity of gene expression data, different types of kernel functions (such as linear, polynomial, and radial basis functions) can be selected, making this method applicable to various types of gene expression data and improving the flexibility and applicability of the analysis. Attached Figure Description
[0023] Figure 1 This is a flowchart of the present invention;
[0024] Figure 2 This is a comparative data graph of embodiments of the present invention. The performance of different methods in gene expression data analysis is illustrated by plotting receiver operating characteristic (ROC) curves. KMMDE is the method described in this paper. Comparative analysis was conducted using gene expression data from different time points without biological replicates. In the comparison, other methods shown in the graph (edgeR, DESeq2, maSigPro, splineTC, and Funpat) also had their adjusted P values calculated. adj The values are used to control the false detection rate. The numbers in the top bar of the image represent data with 4, 6, 8, and 10 time points, respectively. The numbers in the right bar represent the number of differentially expressed genes. The horizontal axis (False Positive Rate) represents the false positive rate, which is the proportion of true null hypotheses (genes with no difference) incorrectly identified as differentially expressed. The vertical axis (True Positive Rate) represents the true positive rate, which is the proportion of true non-null hypotheses (genes with difference) correctly identified as differentially expressed. As can be seen from the curves, the KMMDE method (red curve) outperforms other methods in most cases, especially when the number of time points is small, still showing an advantage in the high true positive rate region (curve near the upper left corner). This indicates that KMMDE can more effectively identify truly differentially expressed genes while maintaining a low false positive rate. At the same time, KMMDE shows good stability under different time points and different numbers of differentially expressed genes, which is very important when facing the complexity of real biological data. Detailed Implementation
[0025] To facilitate understanding and implementation of the present invention by those skilled in the art, the present invention will be further described in detail below with reference to examples. The implementation examples described herein are only for illustration and explanation and are not intended to limit the present invention.
[0026] Example:
[0027] A differentially expressed gene identification method based on nonparametric tests is used, taking temporal expression data of wild grapes (experimental group) and common grapes (control group) under the same cold environment as an example. The flowchart is as follows. Figure 1 As shown, it includes the following steps:
[0028] Step 1: Obtain the gene expression matrix. Collect raw RNA-seq data from the experimental and control groups at multiple time points under the same temperature environment (e.g., 5-15 degrees Celsius). After data preprocessing, obtain the gene expression matrices for the experimental and control groups respectively. Each column of the gene expression matrix represents a time point, and each row represents a gene. The elements in the gene expression matrix represent the gene expression level of the corresponding gene in the row at the specific time point in the column. The elements of each row of the gene expression matrix constitute the gene expression level sequence of the corresponding gene in the row.
[0029] Step 2: Calculate the maximum mean difference in gene i. Assume the gene expression matrix of the experimental group has m time points, and the gene expression matrix of the control group has n time points. Gene i in the experimental group (X... i The gene expression level sequences at m time points are as follows: The gene expression level of gene i in the experimental group is from the first time point to the m-th time point in the gene expression matrix. The gene expression level of gene i in the control group (Y) is... i The gene expression level sequences at n time points in Y are... i ={y i1 ,y i2 ,...y in}, X represents the gene expression level of gene i from time point 1 to time point n in the gene expression matrix of the control group. The gene expression level sequence X is then analyzed using a kernel function φ. i and gene expression level sequence Y i Mapping to a high-dimensional feature space. Calculating gene expression level sequences X. i and gene expression level sequence Y i The inner product k(X) in the feature space i ,Y i ):
[0030] k(X i ,Y i )=<φ(X i ),φ(Y i )> F (1)
[0031] Where k is the inner product function, <,> F is the inner product operator, and F represents the Frobenius norm.
[0032] Gene expression level sequence X with sample size mi With a sample size of n, the gene expression level sequence Y i The unbiased empirical estimate of MMD between genes (i.e., the original unbiased empirical estimate of MMD for gene i) is:
[0033]
[0034] Where k is the inner product function, iu1 and iu2 are the time point index variables of the experimental group, with values ranging from 1 to m; iv1 and iv2 are the time point index variables of the control group, with values ranging from 1 to n. iu1≠iu2 indicates that the case where iu1=iu2 is excluded during the summation process, and the same applies to iv1≠iv2.
[0035] Step 3: Significance analysis, calculate the adjusted significance parameter P for gene i. adj To calculate the significance of the difference in gene i expression levels between the experimental and control groups, we used a permutation test to assess the significance of the original MMD values. First, the gene expression levels of gene i in the experimental and control groups were combined into a single sequence as the merged sequence. Then, the merged sequence was randomly shuffled R times, i.e., for {x... i1 ,x i2 ,...x im ,y i1 ,y i2 ,...y in Perform rearrangements R times to generate a sequence from S1 to S2. R R new rearranged sequences are generated. After each rearrangement, the top m gene expression levels from the rearranged sequences are selected as the new gene expression level sequence X for the experimental group. ir The expression levels of the last n genes were selected as the gene expression level sequence Y of the new control group. ir The gene expression level sequences of the new experimental group corresponding to gene i obtained by the r-th shuffling and the new control group are subjected to MMD analysis based on Formula 2 to obtain the new unbiased empirical estimate of MMD corresponding to gene i. Let X be the gene expression level sequence of the new experimental group after the combined sequence of gene i is randomly rearranged for the r-th time. ir The gene expression level sequence of the new control group is Y. ir The new unbiased empirical estimate of MMD is calculated as MMD. 2 [X ir ,Y ir By performing R rearrangements, we can generate a distribution for the unbiased empirical estimate of MMD, which can be used to evaluate the significance of the original unbiased empirical estimate of MMD. The adjusted significance parameter P... adjThe significance of gene expression level differences is determined by comparing the original unbiased empirical estimate of MMD with the new unbiased empirical estimate of MMD generated by rearrangement. The significance parameter P for gene i is used. adj It can be defined as:
[0036]
[0037] Where r is the index of the random shuffle, R is the total number of random shuffles, and formula I(·) is the indicator function. The content in parentheses represents the condition. When the condition is true (condition met), the value of the indicator function is 1; when the condition is false (condition not met), the value of the indicator function is 0. Accumulating all values results in MMD. 2 [X ir ,Y ir MMD 2 [X i ,Y i In order to increase the robustness of the calculation and avoid the situation of P. adj If the sum is 0, increment the accumulated result by 1 and then divide by R+1. This method ensures effective control of the error rate in multiple comparisons, making the results more reliable.
[0038] The significance parameter P of gene i was obtained. adj Next, a significance threshold of 0.05 was set. If the significance parameter P of gene i is... adj If <0.05, gene i is considered a differentially expressed gene, meaning that there is a statistically significant difference in the expression levels of these genes between the experimental group and the control group.
[0039] The following are comparison results between the method of this invention and current mainstream tools. KMMDE is the method of this invention. Comparative analysis was performed using gene expression level sequences with different numbers of time points and without biological replicates. The numbers in the upper column of the image represent gene expression level sequences with 4, 6, 8, and 10 time points, respectively. The numbers in the right column represent gene expression level sequences with different numbers of differentially expressed genes. The horizontal axis (False Positive Rate) represents the false positive rate, i.e., the proportion of incorrectly identifying true null hypotheses (genes with no difference) as differentially expressed. The vertical axis (True Positive Rate) represents the true positive rate, i.e., the proportion of correctly identifying true non-null hypotheses (genes with difference) as differentially expressed. As can be seen from the curves, the KMMDE method (red curve) outperforms other methods in most cases, especially when the number of time points is small, still showing an advantage in the high true positive rate region (curve near the upper left corner). This indicates that KMMDE can more effectively identify truly differentially expressed genes while maintaining a low false positive rate. Meanwhile, KMMDE demonstrates good stability across different time points and varying numbers of differentially expressed genes, which is crucial when dealing with the complexity of real-world biological data.
[0040] In one embodiment, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above method embodiments.
[0041] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the steps in the above method embodiments.
[0042] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above method embodiments.
[0043] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When the computer program is executed, it can include the processes of the embodiments of the above methods.
[0044] It should be noted that the specific embodiments described in this invention are merely illustrative of the spirit of the invention. Those skilled in the art to which this invention pertains can make various modifications or additions to the described specific embodiments or use similar methods to substitute them, without departing from the spirit of the invention or exceeding the scope defined by the appended claims.
Claims
1. A differential gene identification method based on nonparametric tests, characterized in that, Includes the following steps: Step 1: Obtain the gene expression matrix of the experimental group and the control group. Each column of the gene expression matrix represents a time point, and each element in each row of the gene expression matrix constitutes the gene expression level sequence of the corresponding gene in the row. Step 2: Calculate the unbiased empirical estimate of MMD between the gene expression level sequence of gene i in the experimental group and the gene expression level sequence in the control group, as the original unbiased empirical estimate of MMD for gene i; Step 3: Combine the gene expression level sequences of gene i in the experimental group and the control group, then randomly shuffle and rearrange them to generate new gene expression level sequences of gene i in the experimental group and the control group. Calculate the MMD unbiased empirical estimate as the new MMD unbiased empirical estimate. Based on the original MMD unbiased empirical estimate and the new MMD unbiased empirical estimate, calculate the significance parameter P of gene i. adj If the significance parameter P of gene i adj If the expression level is less than the significance threshold, then gene i is considered a differentially expressed gene. Step 1, obtaining the gene expression matrices of the experimental and control groups, includes the following steps: collecting raw RNA-seq data from the experimental and control groups at multiple time points under the same temperature environment; and obtaining the gene expression matrices of the experimental and control groups after data preprocessing. The unbiased empirical estimates of MMD in steps 2 and 3 are calculated based on the following formula: Among them, MMD 2 (X i ,Y i X is the unbiased empirical estimate of the MMD of gene i. i This represents the gene expression level sequence of gene i at m time points in the experimental group. Y represents the gene expression level of gene i from time point 1 to time point m in the gene expression matrix of the experimental group. i This represents the gene expression level sequence of gene i at n time points in the control group. Let represent the gene expression level of gene i from time point 1 to time point n in the gene expression matrix of the control group; m is the total number of time points in the gene expression matrix of the experimental group; n is the total number of time points in the gene expression matrix of the control group; iu1 and iu2 are the ordinal variables of the time points in the experimental group; iv1 and iv2 are the ordinal variables of the time points in the control group; and k is the inner product function. The significance parameter P of gene i adj Calculated based on the following formula: Where r is the index of the random shuffling, R is the total number of random shufflings, I(·) is the indicator function, and let X be the gene expression level sequence of gene i in the experimental group and the gene expression level sequence in the control group after merging and random shuffling. ir The gene expression level sequence of the new control group is Y. ir The new unbiased empirical estimate of MMD is calculated as MMD. 2 [X ir ,Y ir ], The content within the parentheses of the indicator function represents a condition. When the condition is true, the value of the indicator function is 1; when the condition is false, the value of the indicator function is 0.
2. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, When the processor executes the computer program, it implements the steps of the differential gene identification method according to claim 1.
3. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by the processor, it implements the steps of the differential gene identification method according to claim 1.
4. A computer program product, comprising a computer program, characterized in that, When the computer program is executed by the processor, it implements the steps of the differential gene identification method of claim 1.