Gene regulatory network-based maximum probability shortest path query method between genes

By setting appropriate sampling thresholds in gene regulatory networks for random sampling and probability approximation calculations, the problems of low accuracy and long time in querying the shortest path with the highest probability between genes are solved, achieving more efficient path querying, which is suitable for the analysis of biological laws in gene regulatory networks.

CN115881231B · Active · Publication Date: 2026-06-26 · NORTHEASTERN UNIV CHINA

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: NORTHEASTERN UNIV CHINA
Filing Date: 2022-12-09
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing methods have low accuracy and long query times in gene regulatory networks, and cannot accurately describe the strength of gene interactions. Furthermore, existing methods for finding the shortest path with the highest probability are not accurate enough in gene regulatory networks and have long query times.

Method used

A maximum probability shortest path query method based on gene regulatory networks is adopted. By setting an appropriate sampling threshold, possible worlds are generated through random sampling. The maximum probability shortest path is calculated using breadth-first search and probabilistic approximation methods. Error analysis is performed by combining sampling estimation and probabilistic approximation methods to finally obtain the accurate maximum probability shortest path.

Benefits of technology

It improves the accuracy of gene regulatory network queries, shortens query time, and can more accurately find the diagnostic effects of existing drugs on new pathogenic genes, making it suitable for analyzing biological patterns in "drug repurposing".


Abstract

The application provides a method for querying the maximum probability shortest path between genes based on a gene regulatory network, and relates to the field of biogenetics. A gene regulatory network containing probabilities, a source vertex s, and a target vertex t are input. An appropriate sampling threshold is set, and N possible worlds are randomly generated by random sampling according to the edge probabilities. A candidate set of maximum probability shortest paths from the source vertex s to the target vertex t is calculated by a search method. All paths in the candidate set are then estimated by both the sampling estimation method and the probability approximation method, the error between the two methods is analyzed, and the maximum probability shortest path is finally obtained. The application proposes a query algorithm suited to gene regulatory networks, uses the Karp-Luby sampling idea to propose a method for calculating the estimated value of the maximum probability shortest path on a gene regulatory network, and simplifies the probability calculation formula mathematically. The proposed probability approximation query method reduces the number of samplings and effectively improves query efficiency.

Description

Technical Field

[0001] This invention relates to the field of biogenetics, and in particular to a method for querying the shortest path with the highest probability between genes based on gene regulatory networks.

Background Technology

[0002] Genes play an indispensable role in the evolutionary process of biological heredity and are essential factors for understanding and exploring the mysteries of life. Gene regulatory networks provide an important and effective approach for studying the genetic expression relationships between genes. Gene regulatory networks are composed of interdependent relationships between genes, forming a complex network where each gene influences and restricts the others. Utilizing gene regulatory networks to find the shortest path with the highest probability between genes will be beneficial for analyzing pathways between existing drug-targeting genes and new pathogenic genes, calculating the regulatory relationships between drug-targeting genes and potentially pathogenic genes, discovering biological laws, and ultimately achieving "drug repurposing."

[0003] Due to the complexity and massive scale of gene regulatory networks, existing methods mostly either use search algorithms to obtain the set of shortest paths within the network, or use predetermined thresholds on the strength of gene interactions to obtain the set of top-K most probable paths. The former tends to retrieve paths that are short but have low probability, and the latter paths that have high probability but are long; neither approach directly and effectively describes the strength of gene-gene interactions. Furthermore, existing methods for finding the shortest path with the highest probability often combine secondary sampling with estimation, which cannot accurately determine a suitable set of candidate paths within gene regulatory networks, resulting in insufficient accuracy and long query times. Therefore, designing a method that can more accurately and efficiently retrieve the shortest path with the highest probability within a gene regulatory network presents a significant challenge.

Summary of the Invention

[0004] The technical problem to be solved by the present invention is to provide a method for querying the shortest path with the highest probability between genes based on gene regulatory networks, which addresses the shortcomings of the prior art and solves the problems of low query accuracy and long query time.

[0005] To solve the above-mentioned technical problems, the technical solution adopted by the present invention is as follows:

[0006] A method for finding the shortest path with maximum probability between genes based on gene regulatory networks includes the following steps:

[0007] Step 1: Input a gene regulatory network containing probabilities, source vertex s, and target vertex t;

[0008] In a gene regulatory network, each vertex represents a gene, the directed edges between vertices represent the regulatory relationships between genes, and the probability values on the edges represent the strength of the regulatory relationships between genes.

[0009] Step 2: Set an appropriate sampling threshold and use the idea of random sampling to randomly generate N possible worlds based on the edge probabilities;

[0010] Step 3: Calculate the candidate set of the shortest path with the highest probability from the source vertex s to the target vertex t according to the search method;

[0011] Step 4: Estimate all paths in the candidate set using sampling estimation and probability approximation methods, analyze the error between the two methods, and finally obtain the shortest path with the highest probability.

[0012] Furthermore, the specific steps of step 2 are as follows:

[0013] Step 2.1: Determine an appropriate sampling threshold based on the size of the gene regulatory network graph and practical experience. By gradually increasing the number of possible worlds, obtain the relationship curve between the number of paths in the candidate set of the shortest path with the maximum probability and the number of sampled possible worlds. As the number of sampled possible worlds increases, the number of paths in the candidate set of the shortest path with the maximum probability gradually stabilizes. Based on the horizontal axis value corresponding to when the curve flattens out, obtain an appropriate sampling threshold.

[0014] Step 2.2: Randomly sample the edges of the gene regulatory network. Use Python's random module, which generates pseudo-random numbers with a deterministic algorithm, to produce a random number in [0,1] for each edge of the gene regulatory network. Then determine whether the existence probability of the current edge is greater than the random number generated for that edge. If it is greater, keep the edge; otherwise, delete the edge. After random numbers have been generated for all edges and the decision to keep or delete each edge has been made, one sampling is completed and one possible world is obtained. Repeat the above process to obtain N possible worlds.

[0015] Furthermore, the method for obtaining the shortest path candidate set with the highest probability in step 3 is as follows: Query the shortest path for each possible world using breadth-first search, add the shortest path obtained from the query to the shortest path candidate set, and after all possible worlds have been queried, sort all paths in the shortest path candidate set according to the rule of ascending path length; if several paths have the same length, sort these paths according to the rule of descending probability, and finally form the shortest path candidate set with the highest probability.

[0016] Furthermore, the specific steps of step 4 are as follows:

[0017] Step 4.1: Directly calculate the probability estimate of each path in the candidate set of the shortest path with the highest probability as the shortest path with the highest probability using the sampling estimation method;

[0018] Step 4.2: Use the probabilistic approximation method to obtain the probability estimate of each path in the candidate set of the shortest path with the maximum probability as the shortest path with the maximum probability;

[0019] Step 4.3: Perform error analysis on the probability estimates obtained by the above two methods to determine the accuracy of the probability approximation method;

[0020] Step 4.4: Sort all paths in the candidate path set using the probability estimates obtained by the probability approximation method to obtain the shortest path with the highest probability.

[0021] Furthermore, the sampling estimation method in step 4.1 is as follows:

[0022] For each candidate path, the sampling number is set to M. In each sampling, a shorter reachable path than the current candidate path is first selected based on the probability of the shorter reachable path existing; the higher the probability, the more likely it is to be selected. Next, a suitable possible world is constructed, requiring that this possible world contains the shorter reachable path. Then, it is further determined whether there is a path shorter than the selected shorter reachable path in this possible world. If not, it means that this path is the shortest path in this possible world, and the number of times the path is the shortest path in the possible worlds sampled M times is incremented by 1. In this way, after completing all sampling, the probability estimate of all paths shorter than the current candidate path as the shortest path with the highest probability is obtained. Finally, the probability estimate of the path as the shortest path with the highest probability is obtained by combining the probability of the current candidate path existing. The above operation is repeated to obtain the probability estimate of all candidate paths that can become the shortest path with the highest probability, and this is used as a reference value to evaluate the accuracy of the result obtained by the probability approximation method.

[0023] Furthermore, the probability approximation method in step 4.2 is as follows:

[0024] Select a path from the candidate set of shortest paths with the highest probability. If it is the first path, calculate the estimated value of the shortest path with the highest probability using the following formula:

[0025]

[0026] where P_MPSP(n) is the probability estimate of the n-th path, and P_n is the probability that the n-th path exists, i.e. Pr(X(P_n));

[0027] If it is not the first path, the approximate probabilities of all shorter paths being the shortest path with the highest probability are used for iterative calculation.

[0028] Furthermore, in step 4.3, error analysis is performed using the mean absolute error calculation formula.

[0029] Furthermore, in step 4.4, the sorting principle is as follows: sort in descending order according to the obtained probability estimates; if the estimates are the same, sort in ascending order according to the path length.

[0030] The beneficial effects of adopting the above technical solution are as follows. First, the method provided by this invention for querying the shortest path with the highest probability between genes based on gene regulatory networks considers the unique characteristics of gene regulatory networks and proposes a query algorithm suited to them, meeting the query requirements of gene regulatory networks and improving query accuracy. Second, it proposes a probability approximation calculation method: through further derivation of the formula for calculating the shortest path with the highest probability, the secondary sampling process can be omitted when implementing the algorithm, which improves the efficiency of network queries and greatly shortens query time. In actual "drug repurposing," the query method using probability approximation calculation can more accurately determine whether the original drug also has a diagnostic effect on new pathogenic genes.

Attached Figure Description

[0031] Figure 1 is a flowchart of the method for querying the shortest path with the highest probability between genes based on a gene regulatory network, provided in an embodiment of the present invention;

[0032] Figure 2 is a flowchart for obtaining possible worlds by sampling, provided in an embodiment of the present invention;

[0033] Figure 3 is a flowchart for obtaining the set of shortest paths with the highest probability using the sampling estimation method, provided in an embodiment of the present invention;

[0034] Figure 4 is a flowchart for obtaining the set of shortest paths with the highest probability using the probabilistic approximation method, provided in an embodiment of the present invention.

Detailed Implementation

[0035] The specific embodiments of the present invention will be described in further detail below with reference to the accompanying drawings and examples. The following examples are for illustrative purposes only and are not intended to limit the scope of the invention.

[0036] This embodiment uses a gene regulatory network-based method for finding the shortest path with the highest probability between genes. As shown in Figure 1, the method includes the following steps:

[0037] Step 1: Input a gene regulation network containing probabilities, source vertex s and target vertex t; each vertex in the gene regulation network represents a gene, the directed edges between vertices represent the regulatory relationship between genes, and the probability value on the edge represents the strength of the regulatory relationship between genes.

[0038] Step 2: Set an appropriate sampling threshold and, using the idea of random sampling, randomly generate N possible worlds based on the edge probabilities. The process is shown in Figure 2; the specific steps are as follows:

[0039] Step 2.1: Determine an appropriate sampling threshold based on the size of the gene regulatory network graph and practical experience; by gradually increasing the number of possible worlds, obtain the relationship curve between the number of paths in the candidate set of the shortest path with the highest probability and the number of possible worlds sampled. As the number of possible worlds sampled increases, the number of paths in the candidate set of the shortest path with the highest probability gradually stabilizes. Based on the horizontal axis value corresponding to when the curve flattens out, obtain an appropriate sampling threshold.

[0040] In this embodiment, a breast cancer-related gene regulatory network is selected, and the sampling threshold obtained in this way is approximately 10,000 possible worlds.
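
The curve-flattening criterion of Step 2.1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `pick_sampling_threshold`, the `window` and `tol` parameters, and the rule "the candidate count changes by at most `tol` over the next `window` measurements" are hypothetical choices, since the description leaves the exact flattening test to practical experience.

```python
def pick_sampling_threshold(curve, window=3, tol=0.01):
    """Return the smallest sampled N after which the candidate count stays flat.

    curve  -- list of (num_possible_worlds, num_candidate_paths), sorted by
              num_possible_worlds (the horizontal axis of the curve)
    window -- number of following measurements that must stay flat (assumed)
    tol    -- allowed relative change in the candidate count (assumed)
    """
    for idx in range(len(curve) - window):
        n_worlds, count = curve[idx]
        following = [c for _, c in curve[idx + 1: idx + 1 + window]]
        # The curve counts as flat here if every following point is within tol.
        if count > 0 and all(abs(c - count) / count <= tol for c in following):
            return n_worlds
    return None  # the curve has not flattened within the measured range
```

The returned value is the horizontal-axis position at which the curve first flattens; `None` signals that more possible worlds need to be sampled before a threshold can be chosen.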

[0041] Step 2.2: Randomly sample the edges of the gene regulatory network to obtain possible worlds based on the network.

[0042] A possible world is an instance of a probabilistic graph, and a probabilistic graph with E edges has 2^E possible worlds.

[0043] This embodiment uses random sampling. Random numbers are generated with Python's `random` module, whose deterministic (pseudo-random) algorithm produces a random number in [0,1] for each edge of the gene regulatory network. The existence probability of the current edge is then compared with the random number generated for it: if the probability is greater, the edge is retained; otherwise, it is deleted. Because the number used to decide whether to delete an edge is randomly generated, its distribution approaches the uniform distribution on [0,1] as the number of possible worlds increases, so it can serve as the basis for deciding whether to delete or retain an edge in a possible world. Once random numbers have been generated for all edges and the retention decisions have been made, one sampling is complete, yielding one possible world. This process is repeated to obtain N possible worlds.

[0044] In this embodiment, a breast cancer-related gene regulatory network is selected, whose probabilistic graph has V = 547 vertices and E = 3503 edges. The number of samplings is set to N = 10000. After 10000 samplings, 10000 possible worlds are obtained.
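
The edge sampling of Step 2.2 can be sketched with Python's `random` module as below. The data layout (a dict mapping a directed edge `(u, v)` to its probability), the function names, and the optional `seed` argument are assumptions made for illustration. Note that `random.random()` returns a float in the half-open interval [0.0, 1.0).

```python
import random

def sample_possible_world(edge_probs, rng=random):
    """Build one possible world: keep an edge when its probability exceeds a fresh draw."""
    world = []
    for edge, prob in edge_probs.items():
        r = rng.random()  # pseudo-random float in [0.0, 1.0)
        if prob > r:      # keep the edge; otherwise it is deleted from this world
            world.append(edge)
    return world

def sample_possible_worlds(edge_probs, n_worlds, seed=None):
    """Repeat the sampling n_worlds times; seed is optional and only aids reproducibility."""
    rng = random.Random(seed)
    return [sample_possible_world(edge_probs, rng) for _ in range(n_worlds)]
```

With this comparison an edge of probability 1.0 is always kept and an edge of probability 0.0 is never kept.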

[0045] Step 3: Calculate the candidate set of the shortest path with the highest probability from source vertex s to target vertex t according to the search method; the specific method is as follows:

[0046] The shortest path is queried for each possible world using breadth-first search. The shortest path found is added to the shortest path candidate set. After all possible worlds have been queried, all paths in the shortest path candidate set are sorted in ascending order of path length. If several paths have the same length, they are sorted in descending order of probability. Finally, the shortest path candidate set with the highest probability is formed.

[0047] In this embodiment, the shortest path or set of shortest paths found in each of the 10,000 possible worlds is stored in the candidate set of shortest paths with the highest probability (CP), which is then sorted in ascending order of path length and stored in the candidate path set (LP). After sorting, it is found that as path length increases, the probability of a path being the shortest path with the highest probability decreases until it has no reference value, so it is only necessary to focus on the top 100 paths.
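
A minimal sketch of Step 3 in Python follows. It makes several assumptions that the description does not fix: a possible world is a list of directed edges `(u, v)`, path length is the number of edges, only one shortest path is recorded per possible world, and the "probability" used to break ties between paths of equal length is the product of the path's edge probabilities. All function names are hypothetical.

```python
from collections import deque
from math import prod

def bfs_shortest_path(edges, source, target):
    """Return one fewest-edge path from source to target as a tuple of vertices, or None."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            path = []
            while u is not None:  # walk parent links back to the source
                path.append(u)
                u = parent[u]
            return tuple(reversed(path))
        for v in adjacency.get(u, []):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None  # target is unreachable in this possible world

def path_probability(path, edge_probs):
    """Product of the probabilities of the edges along the path (assumed tie-breaker)."""
    return prod(edge_probs[(u, v)] for u, v in zip(path, path[1:]))

def build_candidate_set(worlds, edge_probs, source, target):
    """Collect distinct shortest paths, then sort by length ascending, probability descending."""
    candidates = set()
    for world in worlds:
        path = bfs_shortest_path(world, source, target)
        if path is not None:
            candidates.add(path)
    return sorted(candidates,
                  key=lambda p: (len(p) - 1, -path_probability(p, edge_probs)))
```

Storing candidates in a set removes duplicates, so a path that is shortest in many possible worlds appears once in the sorted list.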

[0048] Step 4: Estimate all paths in the candidate set using sampling estimation and probabilistic approximation methods, analyze the error between the two methods, and finally obtain the shortest path with the highest probability. The specific method is as follows:

[0049] Step 4.1: Directly calculate the probability estimate of each path in the candidate path set as the shortest path with the highest probability using the sampling estimation method. The process is shown in Figure 3.

[0050] For each candidate path, the number of samplings is set to M = 1000. In each sampling, a reachable path shorter than the current candidate path is first selected according to its existence probability; the higher the probability, the more likely the path is to be selected. Next, a suitable possible world is constructed, which is required to contain the selected shorter reachable path. It is then determined whether this possible world contains a path shorter than the selected one. If not, the selected path is the shortest path in this possible world, and the count of times that path is the shortest path among the M sampled possible worlds is incremented by 1. After all samplings are complete, the probability estimates of all paths shorter than the current candidate path being the shortest path with the highest probability are obtained. Finally, combining these with the existence probability of the current candidate path gives the probability estimate of the current candidate path being the shortest path with the highest probability. The above operation is repeated to obtain the probability estimates of all candidate paths being the shortest path with the highest probability, which serve as reference values for evaluating the accuracy of the results obtained by the probabilistic approximation method.

[0051] In Example 1, the probability of each path existing is expressed as:

[0052] Pr(X(P))=ΠP(e)Π(1-P(e)) (1)

[0053] In the formula, P(e) is the probability of edge e existing in the gene regulatory network G.

[0054] The sum of probabilities of all paths shorter than the current candidate path P in the set LP is represented as:

[0055]

[0056] In the formula, P_i is any path shorter than P, n is the total number of paths in the set LP, and i indexes the paths in the set LP that are shorter than P_n.

[0057] Among the 1000 suitable possible worlds sampled, the number of times the i-th path P_i, which is shorter than the current candidate path P, is sampled is expressed as:

[0058]

[0059] In the formula, M is the number of possible worlds to be sampled, and in this embodiment, M is 1000.

[0060] The probability estimate of all paths shorter than the current candidate path P as the shortest path with the highest probability is expressed as:

[0061]

[0062] In the formula, C_i is the number of times path P_i is the shortest path in the 1000 sampled possible worlds.

[0063] The probability estimate of the current candidate path P_n being the shortest path with the highest probability is expressed as:

[0064]

[0065] Step 4.2: Use the probabilistic approximation method to obtain the probability estimate of each path in the candidate path set as the shortest path with the highest probability. The process is shown in Figure 4. A path is selected from the candidate set of shortest paths with the highest probability. If it is the first path, its estimate is calculated directly from the formula. If it is not the first path, the approximate probabilities of all shorter paths being the shortest path with the highest probability are used for iterative calculation.

[0066] The specific method for approximating the probability of the current candidate path P_n as the shortest path with the highest probability is expressed as follows:

[0067]

[0068] In the formula, P_i is the probability that the i-th path exists, i.e. Pr(X(P_i)).

[0069] The specific derivation process from the sampling estimation method to the probability approximation method is expressed as follows:

[0070]

[0071] Here, the conditional term denotes the probability that P_{n-1} is the shortest path in a possible world containing path P_{n-1}, and the corresponding estimate is the probability of path P_{n-1} being the shortest path with the highest probability. The conditional term can therefore be expressed as 1 minus the sum of the probabilities that each path shorter than P_{n-1} is the shortest path with the highest probability. At this point, the probability estimate of the n-th path obtained by the probability approximation method can be derived from the sampling estimation method, specifically expressed as:

[0072]
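
The iterative calculation of Step 4.2 can be illustrated with a short Python sketch. It rests on one reading of the description, which should be treated as an assumption rather than the patented formula: the estimate for the first candidate equals its existence probability, and the estimate for each later candidate is its existence probability multiplied by 1 minus the sum of the estimates of all shorter candidates. The function name and the list-based input are hypothetical.

```python
def approximate_mpsp_probabilities(existence_probs):
    """Iteratively approximate each candidate's probability of being the shortest path.

    existence_probs -- list where entry k is Pr(X(P_k)) for the k-th candidate,
                       with candidates ordered from shortest to longest.
    Assumed recurrence: estimate_n = Pr(X(P_n)) * (1 - sum of earlier estimates).
    """
    estimates = []
    claimed = 0.0  # running sum of the estimates of all shorter candidates
    for p_exist in existence_probs:
        # max() only guards against tiny negative values from floating-point rounding.
        estimate = p_exist * max(0.0, 1.0 - claimed)
        estimates.append(estimate)
        claimed += estimate
    return estimates
```

For existence probabilities 0.5, 0.4 and 0.2, this recurrence gives 0.5, then 0.4 × (1 − 0.5) = 0.2, then 0.2 × (1 − 0.7) = 0.06. No sampling is involved, which matches the stated aim of avoiding the secondary sampling process.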

[0073] Step 4.3: Perform error analysis on the probability estimates obtained by the two methods above.

[0074] In this embodiment, the mean absolute error between the probability values obtained by the two methods for a path being the shortest path with the maximum probability is approximately 0.03. The calculation is expressed as follows:

[0075]
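
For reference, the standard definition of mean absolute error, the average of the absolute differences between paired values, can be computed as in the sketch below. The function and argument names are assumptions; `reference` stands for the sampling-based estimates and `approximation` for the estimates from the probability approximation method, listed for the same candidate paths in the same order.

```python
def mean_absolute_error(reference, approximation):
    """Average absolute difference between two equally long lists of estimates."""
    if len(reference) != len(approximation) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(r - a) for r, a in zip(reference, approximation))
    return total / len(reference)
```

A small value indicates that the probability approximation method closely tracks the sampling-based reference values.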

[0076] Step 4.4: Using the probability estimates obtained by the probability approximation method, sort all paths in the candidate path set in descending order of estimate. If the estimates are the same, sort them in ascending order of path length. The first path after sorting is the shortest path with the highest probability.
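
The sorting rule of Step 4.4 maps directly onto a composite sort key, as in this minimal Python sketch. The representation of a candidate as a `(path, estimate)` pair, with the path given as a tuple of vertices, and the function name are assumptions for illustration.

```python
def rank_candidates(candidates):
    """Sort (path, estimate) pairs: estimate descending, then path length ascending."""
    # Path length is counted in edges, i.e. number of vertices minus one.
    return sorted(candidates, key=lambda item: (-item[1], len(item[0]) - 1))
```

The first element of the returned list is then the path reported as the shortest path with the highest probability.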

[0077] This embodiment takes into account the unique characteristics of gene regulatory networks and proposes a query algorithm suitable for them. It uses the Karp-Luby sampling idea to propose a method for calculating the maximum probability shortest path estimate on gene regulatory networks, and simplifies the probability calculation formula mathematically. By comparing the results of the two estimation methods, the rationality of the proposed simplification is further verified. The probabilistic approximate query method proposed in this embodiment reduces the number of samplings and effectively improves query efficiency.

[0078] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope defined by the claims of the present invention.

Claims

1. A method for querying the shortest path with maximum probability between genes based on gene regulatory networks, characterized in that it includes the following steps:

Step 1: Input a gene regulatory network containing probabilities, source vertex s, and target vertex t; in the gene regulatory network, each vertex represents a gene, the directed edges between vertices represent the regulatory relationships between genes, and the probability values on the edges represent the strength of the regulatory relationships between genes.

Step 2: Set an appropriate sampling threshold and use the idea of random sampling to randomly generate N possible worlds based on the edge probabilities; the specific steps are as follows:

Step 2.1: Determine an appropriate sampling threshold based on the size of the gene regulatory network graph and practical experience. By gradually increasing the number of possible worlds, obtain the relationship curve between the number of paths in the candidate set of the shortest path with the maximum probability and the number of sampled possible worlds. As the number of sampled possible worlds increases, the number of paths in the candidate set of the shortest path with the maximum probability gradually stabilizes. Based on the horizontal axis value corresponding to when the curve flattens out, obtain an appropriate sampling threshold.

Step 2.2: Randomly sample the edges of the gene regulatory network. Use Python's random module, which generates pseudo-random numbers with a deterministic algorithm, to produce a random number in [0,1] for each edge of the gene regulatory network. Then determine whether the existence probability of the current edge is greater than the random number generated for that edge. If it is greater, keep the edge; otherwise, delete the edge. After random numbers have been generated for all edges and the decision to keep or delete each edge has been made, one sampling is completed and one possible world is obtained. Repeat the above process to obtain N possible worlds.

Step 3: Calculate the candidate set of the shortest path with the maximum probability from the source vertex s to the target vertex t according to the search method. The method for obtaining the candidate set of the shortest path with the maximum probability is as follows: query the shortest path of each possible world through breadth-first search, and add the shortest path obtained by the query to the candidate set of the shortest path. After all possible worlds have been queried, sort all paths in the candidate set of the shortest path according to the rule of ascending order of path length. If some paths have the same length, sort these paths according to the rule of descending probability. Finally, the candidate set of the shortest path with the maximum probability is formed.

Step 4: Estimate all paths in the candidate set using sampling estimation and probability approximation methods, analyze the error between the two methods, and finally obtain the shortest path with the highest probability; the specific steps are as follows:

Step 4.1: Directly calculate the probability estimate of each path in the candidate set of the shortest path with the highest probability as the shortest path with the highest probability using the sampling estimation method. The process is as follows: For each candidate path, the number of samplings is set to M. In each sampling, a reachable path shorter than the current candidate path is first selected according to its existence probability; the higher the probability, the more likely the path is to be selected. Next, a suitable possible world is constructed, which is required to contain the selected shorter reachable path. It is then determined whether this possible world contains a path shorter than the selected one. If not, the selected path is the shortest path in this possible world, and the count of times that path is the shortest path among the M sampled possible worlds is incremented by 1. After all samplings are complete, the probability estimates of all paths shorter than the current candidate path being the shortest path with the highest probability are obtained. Finally, combining these with the existence probability of the current candidate path gives the probability estimate of the current candidate path being the shortest path with the highest probability. The above operation is repeated to obtain the probability estimates of all candidate paths being the shortest path with the highest probability, and these are used as reference values to evaluate the accuracy of the result obtained by the probability approximation method.

The probability of each path existing is expressed as: (1); in the formula, P(e) is the probability of edge e existing in the gene regulatory network G. The sum of probabilities of all paths shorter than the current candidate path P in the candidate path set LP is expressed as: (2); in the formula, P_i represents any path shorter than P, n represents the total number of paths in the candidate path set LP, and i indexes the paths in the candidate path set LP that are shorter than P_n. Among the M suitable possible worlds sampled, the number of times the i-th path P_i, which is shorter than the current candidate path P, is sampled is expressed as: (3). The probability estimate of all paths shorter than the current candidate path P as the shortest path with the highest probability is expressed as: (4); in the formula, C_i is the number of times path P_i is the shortest path in the M sampled possible worlds. The probability estimate of the current candidate path P_n being the shortest path with the highest probability is expressed as: (5);

Step 4.2: Obtain the probability estimate of each path in the candidate set of the shortest path with the maximum probability as the shortest path with the maximum probability using a probabilistic approximation method; the probabilistic approximation method in step 4.2 is as follows: Select a path from the candidate set of shortest paths with the highest probability. If it is the first path, directly calculate the estimated value of the shortest path with the highest probability using the following formula: ; where P_MPSP(n) is the probability estimate of the n-th path, and P_n is the probability that the n-th path exists, i.e. Pr(X(P_n)). If it is not the first path, the approximate probabilities of all shorter paths being the shortest path with the highest probability are used for iterative calculation;

Step 4.3: Perform error analysis on the probability estimates obtained by the above two methods to determine the accuracy of the probability approximation method;

Step 4.4: Sort all paths in the candidate path set using the probability estimates obtained by the probability approximation method to obtain the shortest path with the highest probability.

2. The method for querying the shortest path with maximum probability between genes based on gene regulatory networks according to claim 1, characterized in that: In step 4.3, error analysis is performed using the mean absolute error calculation formula.

3. The method for querying the shortest path with maximum probability between genes based on gene regulatory networks according to claim 1, characterized in that: In step 4.4, the sorting principle is as follows: sort in descending order according to the obtained probability estimates; if the estimates are the same, sort in ascending order according to the path length.