Method for constructing gene expression regulation network based on time series big data and transfer entropy

By using time-series big data and transfer entropy-based methods, combined with high sampling rates and distributed computing, significant gene regulatory relationships are screened out, solving the problems of long processing times and low efficiency in existing technologies. This enables the rapid construction of gene expression regulatory networks, which are suitable for multi-node analysis.

CN114360634B · Active · Publication Date: 2026-06-12 · 高静 +2

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: 高静
Filing Date: 2021-12-29
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing technologies suffer from high false positive rates, long processing times, and low efficiency when constructing gene expression regulatory networks, especially in multi-node analysis, making it difficult to effectively reveal dynamic biological processes.

Method used

We employ a time-series big data and transfer entropy-based approach, collecting gene expression data at a high sampling rate. We combine fuzzy C-means and cosine similarity clustering, and use the Spark in-memory computing framework for distributed parallel computation to screen out significant gene regulatory relationships. Finally, we construct a gene expression regulatory network using transfer entropy thresholds.

Benefits of technology

It enables the rapid and efficient construction of gene expression regulatory networks, can handle multi-node data, and improves computational efficiency and accuracy while shortening construction time. It is suitable for polymorphic time-series gene expression profile matrix data.

✦ Generated by Eureka AI based on patent content.

Figure CN114360634B_ABST

Abstract

The present application relates to a method for constructing a gene expression regulatory network based on time-series big data and transfer entropy. The method comprises six steps: time-series data sampling, time-series big data preprocessing, target gene screening, parallel computation of transfer entropy, gene regulatory relationship screening, and construction of the gene expression regulatory network. The raw data for constructing the gene regulatory network are polymorphic time-series gene expression profile matrix data with three or more time points. Transfer entropy serves as the index for inferring causal regulatory relationships between genes, and a distributed parallel computing method based on the in-memory computing framework Spark computes the transfer entropy between genes at large scale, from which the gene regulatory network is constructed.

Description

Technical Field

[0001] This patent application belongs to the field of bioinformatics technology. More specifically, it relates to a method for constructing a gene expression regulatory network based on time-series big data and transfer entropy. In particular, it is applicable to polymorphic time-series gene expression data with three or more time points. By using transfer entropy to infer the causal regulatory relationships between genes, a gene expression regulatory network is constructed.

Background Technology

[0002] The construction of gene expression regulatory networks can effectively reveal the dynamic biological processes and molecular dynamics of organisms. However, experiments studying protein-DNA interactions and the role of genes in regulation are expensive and difficult to replicate. Therefore, gene regulatory network (GRN) inference methods are used as an alternative to biological experiments. GRN inference methods can represent the dynamics of transcriptional changes and physiological states, playing a crucial role in understanding the genetic basis of dynamic biological processes and phenotypic traits.

[0003] Because of sequencing costs and traditional data analysis thinking, studies that use gene expression data to investigate dynamic biological processes usually adopt a treatment-versus-control design, and the research focuses on comparing the treatment experiments with the control experiments (i.e., a static comparison of data from two states). Because the sample data are limited in this way, the dynamic biological processes of the treated samples cannot be fully revealed.

[0004] Current correlation-based methods for constructing gene expression regulatory networks produce high false-positive rates and discard nonlinear regulatory relationships between genes.

[0005] Current methods for constructing gene expression regulatory networks are all based on single-machine computation. When there are many gene nodes to be analyzed, constructing gene regulatory networks is time-consuming and inefficient.

Summary of the Invention

[0006] To study dynamic biological processes based on gene expression regulatory networks, this invention provides a novel method for constructing gene expression regulatory networks based on time-series big data and transfer entropy. This method can effectively construct gene expression regulatory networks and has the advantages of short construction time and high efficiency.

[0007] To solve the above problems, the technical solution adopted by the present invention is as follows:

[0008] A method for constructing a gene expression regulatory network based on temporal big data and transfer entropy includes the following steps:

[0009] S1. Time-series-based gene big data sampling: Samples are collected from the target tissue at equal time intervals using a high sampling rate (the higher the better) to obtain time-series-based gene samples. Then, the target tissue's sequencing data is obtained through gene sequencing technology. Upstream analysis is used to calculate gene expression profile matrix data, which is then standardized using FPKM to obtain standardized gene expression profile matrix data.

[0010] S2. Time series gene big data preprocessing: The moving average smoothing method is used to preprocess the outliers and random values in the standardized gene expression profile matrix, reducing their impact on the construction of gene regulatory networks and yielding the process genes.

[0011] S3. Target gene screening: Based on the research objectives, the time-series data of the process genes from step S2 are clustered using pattern clustering methods such as fuzzy C-means and cosine similarity to select target genes.

[0012] S4. Parallel computation of transfer entropy: Using the Spark in-memory computing framework, a big data technology, the transfer entropy between pairs of target genes is calculated in a distributed parallel computing manner to obtain the transfer entropy and p-value between the target genes.

[0013] S5. Gene regulatory relationship screening: Based on the magnitude of the bidirectional transfer entropy values between target genes, unidirectional gene regulatory relationships are obtained. Then, based on the p-value (chi-square), the regulatory relationships of paired genes are further screened. Finally, by setting a transfer entropy threshold, the final major gene regulatory relationships are screened, thereby obtaining the gene pairs with strong causal regulatory relationships.

[0014] S6. Construct a gene expression regulatory network, and visualize the gene expression regulatory network based on the final obtained gene regulatory relationships using visualization tools.

[0015] A further improvement of the technical solution of the present invention is that: in S1, the higher the sampling rate, the better. Collecting time series gene datasets at a high sampling rate means that sample tissues are collected at equal time intervals, and the number of sampling time points is greater than or equal to 3.

[0016] A further improvement of the technical solution of the present invention is that: in S1, a high sampling rate means a sampling frequency of 20 times / hour, a sampling period of 0.05 hours, a total sampling duration of 24 hours, and a total of 480 samplings.

[0017] A further improvement of the technical solution of the present invention is that: in S2, outliers and random values in the standardized gene expression profile matrix are preprocessed using a moving average smoothing method. Because of factors such as the objective environment, equipment, and human intervention during the acquisition of time series big data, outliers and random values are usually present. They not only affect the accuracy of the calculation results but also cause the results to deviate from the underlying trend of the time series. Smoothing the random values of periodic time series big data with a moving average and a smoothing window of 5 effectively reduces the impact of outliers on the data analysis results.

[0018] A further improvement of the technical solution of the present invention is that: in S3, the target gene is selected by pattern clustering methods such as fuzzy C-means and / or cosine similarity.

[0019] A further improvement of the technical solution of the present invention is that: in S4, the causal regulatory relationship between genes is inferred based on the method of transfer entropy.

[0020] A further improvement of the technical solution of this invention is that: in S4, a distributed parallel computing method for transfer entropy is designed based on the big data technology Spark, which improves the computational efficiency of transfer entropy.

[0021] A further improvement of the technical solution of the present invention is as follows: In S5, firstly, the unidirectional gene regulatory relationship is determined based on the magnitude of the bidirectional transfer entropy value; then, the significant causal regulatory relationship between paired related genes is further determined based on the p-value (chi-square); finally, the final major gene regulatory relationship is further screened by setting the transfer entropy threshold.

[0022] A further improvement of the technical solution of the present invention is that: in S6, constructing a gene expression regulation network means: firstly, determining a unidirectional regulatory relationship, then selecting a pair of related gene regulatory relationships with a p-value (chi-square) > 0.05, and finally screening gene pairs with a transfer entropy greater than or equal to 0.5 to obtain genes that strongly regulate causal relationships.

[0023] A further improvement of the technical solution of the present invention is that: in S6, Cytoscape is used to visualize the gene expression regulatory network.

[0024] Due to the adoption of the above technical solution, the beneficial effects achieved by this invention are:

[0025] 1. This method is based on polymorphic time-series gene expression big data. It uses transfer entropy to calculate the interaction information between paired genes and determines the regulatory relationships and directions between genes based on p-values and transfer entropy values, thereby constructing a gene expression regulatory network. The data used in this method is polymorphic (at least 3 time points) time-series gene expression profile matrix data, which serves as the raw data for constructing the gene regulatory network. It is suitable for multi-node gene analysis and features short construction time and high efficiency.

[0026] 2. This method constructs a gene expression regulation network based on large-scale gene expression time series data and parallel computation of transfer entropy. Because a distributed parallel computing method is used in the process of calculating transfer entropy, the number of computing nodes can be flexibly configured according to the amount of computing data, thereby improving computing efficiency and enabling the rapid construction of large-scale gene expression regulation networks.

[0027] 3. This method uses transfer entropy as the indicator for inferring causal regulatory relationships between genes. It is based on a distributed parallel computation method for transfer entropy on the in-memory computing framework Spark, where the parallel computation is responsible for calculating the transfer entropy between genes at large scale.

Attached Figure Description

[0028] Figure 1 is a diagram illustrating the gene regulatory network construction process of the present invention;

[0029] Figure 2 is the standardized gene expression profile matrix diagram of the present invention;

[0030] Figure 3 is a comparison diagram of gene expression time series data before and after smoothing according to the present invention;

[0031] Figure 4 is a time-axis expression trend diagram of the AANAT gene in an embodiment of the present invention;

[0032] Figure 5 is a diagram showing the pattern clustering results of whole-genome time-series data in this invention;

[0033] Figure 6 is the flowchart of the Spark-based distributed parallel computation of transfer entropy in this invention;

[0034] Figure 7 is a graph showing the calculation results of the transfer entropy between paired genes in this invention;

[0035] Figure 8 is a diagram illustrating the unidirectional gene regulatory relationships screened in this invention;

[0036] Figure 9 is a diagram illustrating the major gene regulatory relationships screened according to the present invention;

[0037] Figure 10 is a visualization of the gene expression regulatory network of the present invention.

Detailed Implementation

[0038] The present invention will be further described in detail below with reference to the embodiments.

[0039] This invention discloses a method for constructing a gene expression regulatory network based on time-series big data and transfer entropy. As shown in Figure 1, the method includes six steps.

[0040] (1) Time-series gene big data sampling: sample tissues are collected at equal time intervals at a high sampling rate (the higher the better) to obtain time-series gene samples of the target tissue. The high sampling frequency of this invention is 20 times / hour, i.e., the time interval between samples is 0.05 hours, the total sampling duration is 24 hours, and the total number of samples is 480. Then, gene expression profile matrix data (where rows represent genes and columns represent different time points) are obtained through upstream analysis and standardized using FPKM (Fragments Per Kilobase of exon model per Million mapped fragments).
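The sampling figures above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming a hypothetical start time of 7:00 AM on an arbitrary date; the function name `sampling_schedule` and its parameters are illustrative and not part of the patent.

```python
from datetime import datetime, timedelta

def sampling_schedule(start, per_hour=20, duration_hours=24):
    """Return equally spaced sampling times (hypothetical helper)."""
    interval = timedelta(hours=1) / per_hour   # 20 per hour -> 0.05 h = 3 minutes
    total = per_hour * duration_hours          # 20 * 24 = 480 samples
    return [start + i * interval for i in range(total)]

# Assumed start: 7:00 AM on an arbitrary first day.
schedule = sampling_schedule(datetime(2021, 1, 1, 7, 0))
interval_hours = (schedule[1] - schedule[0]).total_seconds() / 3600
```

With these settings the first sample falls at the start time and the last one interval before the 24-hour mark, so the schedule holds 480 time points in total.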

[0041] "Upstream analysis and computation" refers to the process of first performing quality checks on the raw sequencing data (the original sequencing data), then mapping the raw sequencing data to a reference genome, and performing quantitative analysis on the mapping results to obtain an initial gene expression matrix; finally, the initial gene expression matrix is normalized using FPKM (Fragments Per Kilobase of exon model per Million mapped fragments) to obtain normalized gene expression profile matrix data, as shown in Figure 2.
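To make the normalization step concrete, here is a small sketch of the usual FPKM formula: fragments mapped to a gene, times 10^9, divided by the gene length in base pairs times the total mapped fragments of the sample. The function name `fpkm`, the dictionary layout, and the use of the per-sample column sum as the total mapped fragments are assumptions made for illustration; the patent does not specify an implementation.

```python
def fpkm(counts, lengths_bp):
    """Normalize raw fragment counts to FPKM.

    counts:     {gene: [fragment count at each time point]}
    lengths_bp: {gene: gene (exon model) length in base pairs}
    Assumption: total mapped fragments per sample = sum of the counts column.
    """
    genes = list(counts)
    n_samples = len(counts[genes[0]])
    totals = [sum(counts[g][j] for g in genes) for j in range(n_samples)]
    return {
        g: [counts[g][j] * 1e9 / (lengths_bp[g] * totals[j])
            for j in range(n_samples)]
        for g in genes
    }

# Toy matrix: rows are genes, columns are time points (made-up numbers).
raw = {"GENE_A": [100, 200], "GENE_B": [900, 800]}
lengths = {"GENE_A": 1000, "GENE_B": 2000}
normalized = fpkm(raw, lengths)
```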

[0042] (2) Preprocessing of time series gene big data: Because of factors such as the objective environment, equipment and instruments, and human factors during the collection of time series big data, outliers and random values are usually present. As shown in Figure 3, the darker areas represent the original values, which fluctuate significantly; the lighter areas in the middle represent the smoothed values, which fluctuate considerably less. It can be seen that this time series big data contains some random values or outliers. The presence of outliers or random values not only affects the accuracy of the calculation results but also causes them to deviate from the inherent trend of the time series.

[0043] This invention employs a moving average smoothing method to smooth random values in large-scale periodic time series data, with a smoothing window of 5, thereby effectively reducing the impact of outliers on data analysis results. Figure 3 compares the time series data before and after smoothing; it shows that smoothing the time series big data effectively reduces the impact of random values on the trend of change.
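A moving average with a window of 5 can be sketched as follows. This is an illustrative version only: it uses a centered window and shrinks the window at the two ends of the series, which is an assumption, since the text does not say how boundaries are handled.

```python
def moving_average(values, window=5):
    """Centered moving average; the window is truncated at the series ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

# A flat expression profile with one outlier at index 3 (made-up data).
noisy = [1.0, 1.0, 1.0, 11.0, 1.0, 1.0, 1.0]
smooth = moving_average(noisy, window=5)
```

In the toy profile the spike of 11.0 is averaged with its four neighbours, so its influence on the smoothed trend is reduced to a fifth of its original size.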

[0044] (3) Target gene screening: Data screening is conducted according to the research objectives, and target genes are selected by pattern clustering methods such as fuzzy C-means and / or cosine similarity.
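As an illustration of the cosine similarity option, the sketch below ranks genes by how closely the shape of their time profile matches a reference profile. The mean-centering step, the threshold value, and the names `cosine_similarity` and `select_similar` are assumptions for this example; FPKM values are non-negative, so without centering almost every pair of profiles would score close to 1.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def center(profile):
    """Subtract the mean so that similarity reflects the trend, not the level."""
    mean = sum(profile) / len(profile)
    return [v - mean for v in profile]

def select_similar(profiles, reference, threshold=0.9):
    """Return genes whose centered profile is similar to the reference gene."""
    ref = center(profiles[reference])
    return [
        gene for gene, values in profiles.items()
        if gene != reference
        and cosine_similarity(center(values), ref) >= threshold
    ]

# Made-up profiles over four time points.
profiles = {
    "REF": [5, 3, 6, 2],
    "G1": [10, 6, 12, 4],   # same shape as REF at twice the level
    "G2": [2, 6, 1, 7],     # opposite trend
}
selected = select_similar(profiles, "REF", threshold=0.9)
```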

[0045] For example, the research target of this invention is a gene related to circadian rhythms, such as the AANAT gene, which encodes the rate-limiting enzyme for melatonin secretion. The expression trend of the AANAT gene over time (7:00 AM on the first day to 7:00 AM on the second day) is shown in Figure 4. To select genes consistent with the expression trend of the AANAT gene, cluster analysis was first performed on the whole-genome time series data using the fuzzy C-means clustering method; Figure 5 shows the pattern clustering results. Figure 5 shows that the expression trend of the genes in cluster 4 is most similar to that of the AANAT gene on the time axis, i.e., it first decreases, then increases, and then decreases again. Furthermore, the AANAT gene is included in cluster 4. Therefore, the genes in cluster 4 were selected as the target genes for further analysis and research.

[0046] (4) Parallel computation of transfer entropy is performed using the Spark in-memory computing framework, a big data technology. As shown in Figure 6, the transfer entropy between paired genes is computed in parallel on a Spark cluster, yielding the transfer entropy and p-value (chi-square) between genes, as shown in Figure 7.

[0047] Transfer entropy (TE) is an asymmetric measure between time series based on conditional distributions. The information transferred from y to x differs from that transferred from x to y, and this asymmetry makes it possible to establish causal relationships between drivers and responses. Transfer entropy handles nonlinear time series well and is highly sensitive to Granger causality.
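The clustering step described above might look roughly like the following pure-Python fuzzy C-means sketch, followed by picking the cluster that contains the reference gene. The fuzzifier `m=2`, the fixed iteration count, the random initialization, and all gene names and profile values are assumptions for illustration; the patent does not give these settings.

```python
import math
import random

def fuzzy_c_means(data, n_clusters, m=2.0, n_iter=100, seed=0):
    """Basic fuzzy C-means; returns (centers, membership matrix)."""
    rng = random.Random(seed)
    # Random initial memberships, each row normalized to sum to 1.
    u = []
    for _ in data:
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])
    dim = len(data[0])
    centers = []
    for _ in range(n_iter):
        # Update centers as membership-weighted means.
        centers = []
        for j in range(n_clusters):
            w = [u[i][j] ** m for i in range(len(data))]
            total = sum(w)
            centers.append([
                sum(w[i] * data[i][d] for i in range(len(data))) / total
                for d in range(dim)
            ])
        # Update memberships from distances to the centers.
        for i, x in enumerate(data):
            dist = [max(math.dist(x, c), 1e-12) for c in centers]
            for j in range(n_clusters):
                u[i][j] = 1.0 / sum(
                    (dist[j] / dk) ** (2.0 / (m - 1.0)) for dk in dist
                )
    return centers, u

def genes_clustered_with(reference, names, memberships):
    """Genes whose highest-membership cluster matches the reference gene's."""
    labels = [max(range(len(row)), key=row.__getitem__) for row in memberships]
    target = labels[names.index(reference)]
    return [g for g, lab in zip(names, labels) if lab == target]

# Made-up profiles: three genes share the reference trend, three do not.
names = ["REF", "G1", "G2", "G3", "G4", "G5"]
data = [
    [5.0, 2.0, 6.0, 3.0], [5.2, 2.1, 6.1, 2.9], [4.8, 1.9, 5.9, 3.1],
    [1.0, 7.0, 1.0, 7.0], [1.1, 7.2, 0.9, 6.8], [0.9, 6.9, 1.2, 7.1],
]
centers, memberships = fuzzy_c_means(data, n_clusters=2)
target_genes = genes_clustered_with("REF", names, memberships)
```

Assigning each gene to its highest-membership cluster is one simple way to turn fuzzy memberships into a target gene list; a membership cutoff would be an equally plausible choice.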

[0048] The expression for transfer entropy is:

[0049]

[0050]

[0051] Where n is the length of time series x and y, and k and l are the delay lengths of variables x and y, respectively.
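The expressions of [0049] and [0050] are not reproduced in this text. For orientation only, the widely used definition of transfer entropy (Schreiber's form) can be written with the delay lengths k and l named in [0051] as below. Whether the patent's own expressions match this form term for term is an assumption, as is the history-vector notation $x_t^{(k)} = (x_t, \dots, x_{t-k+1})$ and $y_t^{(l)} = (y_t, \dots, y_{t-l+1})$.

```latex
\[
T_{Y \to X} = \sum_{x_{t+1},\, x_t^{(k)},\, y_t^{(l)}}
  p\left(x_{t+1}, x_t^{(k)}, y_t^{(l)}\right)
  \log \frac{p\left(x_{t+1} \mid x_t^{(k)}, y_t^{(l)}\right)}
            {p\left(x_{t+1} \mid x_t^{(k)}\right)}
\]
\[
T_{X \to Y} = \sum_{y_{t+1},\, y_t^{(l)},\, x_t^{(k)}}
  p\left(y_{t+1}, y_t^{(l)}, x_t^{(k)}\right)
  \log \frac{p\left(y_{t+1} \mid y_t^{(l)}, x_t^{(k)}\right)}
            {p\left(y_{t+1} \mid y_t^{(l)}\right)}
\]
```

In this form the probabilities are estimated from the n observed time points, and the two directions generally give different values, which is the asymmetry the description relies on.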

[0052] Because transfer entropy has high computational complexity, we designed and implemented a distributed parallel computing method for transfer entropy based on Spark big data technology. This significantly improves the computational efficiency of transfer entropy.
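The pairwise structure of this computation can be sketched as a map over ordered gene pairs. The example below is not the patent's Spark implementation: it uses a thread pool from the Python standard library as a local stand-in for the distributed map, a simple plug-in estimator with delay lengths k = l = 1, and equal-width binning into 3 bins, all of which are assumptions. On a Spark cluster the same list of pairs would be distributed across worker nodes instead.

```python
import math
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import permutations

def discretize(series, bins=3):
    """Equal-width binning of a numeric series into integer states."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0] * len(series)
    return [min(int((v - lo) / (hi - lo) * bins), bins - 1) for v in series]

def transfer_entropy(source, target, bins=3):
    """Plug-in estimate of TE(source -> target) in bits, with k = l = 1."""
    s, t = discretize(source, bins), discretize(target, bins)
    triples = Counter(zip(t[1:], t[:-1], s[:-1]))   # (x_{t+1}, x_t, y_t)
    pairs_ts = Counter(zip(t[:-1], s[:-1]))         # (x_t, y_t)
    pairs_tt = Counter(zip(t[1:], t[:-1]))          # (x_{t+1}, x_t)
    singles = Counter(t[:-1])                       # x_t
    n = len(t) - 1
    te = 0.0
    for (x1, x0, y0), count in triples.items():
        p_joint = count / n
        p_full = count / pairs_ts[(x0, y0)]            # p(x_{t+1} | x_t, y_t)
        p_self = pairs_tt[(x1, x0)] / singles[x0]      # p(x_{t+1} | x_t)
        te += p_joint * math.log2(p_full / p_self)
    return te

def pairwise_te(profiles, workers=4):
    """Compute TE for every ordered gene pair; keys are (source, target)."""
    pairs = list(permutations(profiles, 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        values = pool.map(
            lambda p: transfer_entropy(profiles[p[0]], profiles[p[1]]), pairs
        )
        return dict(zip(pairs, values))

# Synthetic check: FOLLOWER copies DRIVER with a one-step delay.
rng = random.Random(0)
driver = [rng.random() for _ in range(300)]
follower = [0.0] + driver[:-1]
te = pairwise_te({"DRIVER": driver, "FOLLOWER": follower})
```

Because each ordered pair is independent of the others, the work splits cleanly across workers, which is what makes the number of computing nodes configurable according to the amount of data.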

[0053] (5) Screening of gene regulatory relationships: As shown in Figure 8, unidirectional gene regulatory relationships are obtained based on the magnitude of the bidirectional transfer entropy values between target genes (for example, if the transfer entropy value from gene A to gene B is greater than that from gene B to gene A, the relationship from gene B to gene A is deleted, thus obtaining a unidirectional gene regulatory relationship). Then, based on the p-value (chi-square), the independence between pairs is tested, and relevant paired gene regulatory relationships with p-values greater than 0.05 are selected, thus obtaining significant causal regulatory relationships between relevant paired genes. Next, by setting a transfer entropy threshold, paired gene regulatory relationships with transfer entropies greater than or equal to 0.5 are screened, thus obtaining strong causal regulatory relationships (gene regulatory relationships with strong informational effects). Finally, the major gene regulatory relationships are screened based on the transfer entropy values, as shown in Figure 9 ("major gene" is a genetics term for a single gene that determines a trait), and the major genes, that is, the genes with strong causal regulatory relationships, are obtained. The whole process runs in order: unidirectional gene regulatory relationships, significant causal regulatory relationships, and major gene regulatory relationships.

[0054] (6) Construct a gene regulatory network: based on the major gene regulatory relationships from the final screening, the gene expression regulatory network is visualized using the Cytoscape tool, as shown in Figure 10.
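The three screening stages can be expressed as a short filter over the pairwise results. This is a sketch under assumptions: the input dictionaries keyed by `(source, target)`, the function name `screen_relationships`, and the handling of ties (both directions dropped) are illustrative. The p-value comparison follows the paragraph above as written (greater than 0.05), and both cutoffs are left as parameters.

```python
def screen_relationships(te, p_values, p_cutoff=0.05, te_threshold=0.5):
    """Apply the three screening stages to pairwise transfer entropy results.

    te, p_values: {(source, target): value}, hypothetical input layout.
    Returns a list of (source, target, transfer_entropy), strongest first.
    """
    edges = []
    for (a, b), value in te.items():
        # Stage 1: keep only the direction with the larger transfer entropy.
        if value <= te.get((b, a), 0.0):
            continue
        # Stage 2: p-value screen, with the comparison as stated in the text.
        if not p_values[(a, b)] > p_cutoff:
            continue
        # Stage 3: transfer entropy threshold for major gene relationships.
        if value >= te_threshold:
            edges.append((a, b, value))
    return sorted(edges, key=lambda edge: -edge[2])

# Made-up values for three genes.
te_values = {
    ("A", "B"): 0.8, ("B", "A"): 0.2,
    ("A", "C"): 0.3, ("C", "A"): 0.1,
    ("B", "C"): 0.9, ("C", "B"): 0.6,
}
p_vals = {
    ("A", "B"): 0.2, ("B", "A"): 0.2,
    ("A", "C"): 0.5, ("C", "A"): 0.5,
    ("B", "C"): 0.01, ("C", "B"): 0.3,
}
major_edges = screen_relationships(te_values, p_vals)
```

In the toy data only A to B survives: A to C falls below the transfer entropy threshold, B to C fails the p-value screen, and the three reverse directions are removed in the first stage.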
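One simple way to hand the screened relationships to a visualization tool is a delimited edge table with source, target, and weight columns, which Cytoscape can import as a network from a text table. The sketch below is an assumption about the hand-off format, not the patent's procedure; the column names, the function `write_edge_table`, and the gene name `GENE_X` are hypothetical.

```python
import csv
import io

def write_edge_table(edges, handle):
    """Write (source, target, transfer_entropy) rows as CSV to a file-like object."""
    writer = csv.writer(handle)
    writer.writerow(["source", "target", "transfer_entropy"])
    for source, target, value in edges:
        writer.writerow([source, target, f"{value:.4f}"])

# Written to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
write_edge_table([("AANAT", "GENE_X", 0.8123)], buffer)
edge_table = buffer.getvalue()
```

Keeping the transfer entropy as an edge column lets the visualization map it to edge width or color, so the strongest regulatory relationships stand out in the network.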

[0055] This method constructs a gene expression regulatory network based on large-scale gene expression time-series data and parallel computation of transfer entropy. The data used in this method consists of polymorphic time-series gene expression profile matrices with at least three time points, serving as the raw data for constructing the gene regulatory network. Transfer entropy is used as an indicator for inferring causal regulatory relationships between genes to determine these relationships. A distributed parallel computation method for transfer entropy, based on the in-memory computing framework Spark, is employed to calculate the transfer entropy between large-scale genes, thereby constructing the gene regulatory network. This method is suitable for multi-node gene analysis and features short construction time and high efficiency.

Claims

1. A method for constructing a gene expression regulatory network based on time-series big data and transfer entropy, characterized in that it includes the following steps:
S1. Time-series gene expression big data sampling: sample tissues are collected at a high sampling rate to obtain time-series gene samples of the target tissue; then, the raw sequencing data of the target tissue are obtained through gene sequencing technology, gene expression profile matrix data are obtained through upstream analysis and computation, and FPKM standardization is performed to obtain a standardized gene expression profile matrix. For the high sampling rate, the higher the sampling rate, the better; that is, the sample tissue is collected at equal time intervals, and the number of sampling time points is greater than or equal to 3. High sampling rate refers to a sampling frequency of 20 times / hour, a sampling period of 0.05 hours, a total sampling duration of 24 hours, and a total of 480 samples. Upstream analysis and computation refer to the following: first, the raw sequencing data (the original sequencing data) are subjected to quality checks; then, the raw sequencing data are mapped to a reference genome, and the mapping results are quantitatively analyzed to obtain an initial gene expression matrix; finally, the initial gene expression matrix is normalized using FPKM (Fragments Per Kilobase of exon model per Million mapped fragments) to obtain normalized gene expression profile matrix data.
S2. Time series gene big data preprocessing: outliers and random values in the standardized gene expression profile matrix are preprocessed to reduce their impact on the construction of gene regulatory networks and obtain process genes. The moving average smoothing method is used to smooth out outliers and random values in the time series big data of the standardized gene expression profile matrix.
S3. Target gene screening: Based on the research objectives, the time series data of the process genes in step S2 are clustered using clustering methods to select target genes.
S4. Parallel computation of transfer entropy: Based on the transfer entropy method, the causal regulatory relationship between target genes is inferred. Big data technology is used to calculate the transfer entropy between pairs of target genes in a distributed parallel computing manner to obtain the bidirectional transfer entropy and p-value between target genes. The parallel computation of transfer entropy is a distributed computation method based on the Spark in-memory computing framework of big data technology, which improves the efficiency of transfer entropy computation. The expression for transfer entropy is: [formula not reproduced in this text], where n is the length of time series x and y, and k and l are the delay lengths of variables x and y, respectively;
S5. Screening of gene regulatory relationships: Based on the magnitude of the bidirectional transfer entropy value between target genes, unidirectional gene regulatory relationships are obtained. Then, based on the p-value, the regulatory relationships of related paired genes are further screened. Finally, by setting the transfer entropy threshold, the regulatory relationships of major genes are further screened. First, based on the magnitude of the bidirectional transfer entropy value, the unidirectional gene regulatory relationship is determined; then, based on the p-value, the significant causal regulatory relationship between the relevant paired genes is further determined; finally, by setting a transfer entropy threshold, the final major gene regulatory relationship is further screened.
S6. Construct a gene expression regulatory network, and visualize the gene expression regulatory network based on the final obtained gene regulatory relationships using visualization tools. Constructing a gene expression regulatory network involves: first, determining unidirectional gene regulatory relationships; then, selecting gene regulatory relationships with p-values greater than 0.05; and finally, screening gene pairs with transfer entropy values greater than or equal to 0.5 to obtain the major genes.

2. The method for constructing a gene expression regulatory network based on time-series big data and transfer entropy according to claim 1, characterized in that: In S3, cluster analysis is performed on the time-series expression profile matrix of process genes based on fuzzy C-means and / or cosine similarity clustering methods to screen out target genes that are consistent with the expression pattern of the research target.

3. The method for constructing a gene expression regulatory network based on time-series big data and transfer entropy according to claim 1, characterized in that: In S6, the Cytoscape tool is used to visualize the gene expression regulatory network.