A maximum biclique search method and system based on local and global structure feature difference measurement

By calculating the chi-square value of nodes and mapping them to adjacent edges, and combining the difference between local and global structural features, the maximum biclique search is performed using pruning and optimization techniques. This solves the problem of low efficiency in existing methods and achieves a highly efficient and accurate maximum biclique search.

CN122262384A · Pending · Publication Date: 2026-06-23 · DONGHUA UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
DONGHUA UNIV
Filing Date
2026-04-15
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing maximum biclique search methods are inefficient with large-scale data and cannot meet real-time requirements. Furthermore, personalized maximum biclique search requires a large number of iterative calculations to maintain the accuracy of the results.

Method used

By calculating the chi-square value of a node and mapping it to its adjacent edges to construct a search space, and combining the difference between local and global structural features, pruning and optimization techniques are used to perform maximum biclique search, including maximal pruning, maximum pruning, concurrent optimization, and early termination optimization.

Benefits of technology

It improves search efficiency by 1 to 3 orders of magnitude over existing heuristic methods, provides more accurate results, and is highly adaptable, meeting maximum biclique search needs in different scenarios.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a maximum biclique search method and system based on local and global structure feature difference measurement. The method calculates the chi-square value of each node from the input bipartite graph data; maps the chi-square values of the two endpoint nodes to their adjacent edge; constructs a search space from the edges with the largest chi-square values; and executes a maximum biclique search algorithm on the search space to obtain a result. The application models the difference between local and global structure features, takes the local structures with large differences as a heuristic search space, and then obtains a result through an optimized maximum biclique search algorithm. The method achieves high search efficiency, improving on existing heuristic methods by 1-3 orders of magnitude, yields better results, and is highly adaptable, meeting maximum biclique search demands in various scenarios.

Description

[0001] Technical Field

[0002] This invention relates to the field of graph data mining methods, and proposes a heuristic search method and system for finding the maximum biclique on a bipartite graph by utilizing the difference between local and global structural features.

[0003] Background Art

[0004] Bipartite graphs, as an important graph model, are widely used to represent relationships between two different types of entities. A clique in a bipartite graph is called a biclique; it is a dense substructure in which every vertex of one type is connected by an edge to every vertex of the other type. By searching for the maximum biclique, we can identify frequently occurring patterns or association rules within the bipartite structure, thereby uncovering relationships and patterns in the data. Existing research approaches to the maximum biclique search problem can be broadly divided into two categories: exact search methods and heuristic search methods.

[0005] The exact search method is primarily based on the branch and bound framework. Specifically, it takes the nodes on one side of the bipartite graph as a candidate set and recursively enumerates vertices from this candidate set to constrain the vertex set on the other side, thus finding the maximum biclique. Based on this search framework, numerous improvements have been proposed, such as optimizing the search order of candidate nodes to improve efficiency or focusing on data organization to accelerate the enumeration process. However, as the data scale increases, the high complexity of exact algorithms often prevents them from meeting the demands of real-time scenarios.

[0006] Among heuristic search methods, one approach is based on subspace clustering and uses Monte Carlo simulation to obtain an approximate solution. Specifically, it first treats the bipartite graph as an adjacency matrix: if an edge exists between two vertices, the corresponding position in the matrix is marked as 1; otherwise, it is marked as 0. Then, through extensive random sampling, simulations are performed to obtain an approximate solution to the problem. Each simulation proceeds as follows: first, a number of rows are selected at random; then all columns are traversed, and every column that has an edge to all of the selected rows is added to the column result set; next, all rows are traversed, and every row that has an edge to all columns in the column result set is added to the row result set. This yields one simulation result. After multiple rounds of simulation, the largest biclique found is returned as the heuristic result. However, when the maximum biclique is small relative to the entire bipartite graph, a large number of iterations is required to maintain the accuracy of the result.
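The sampling procedure described above can be illustrated with a minimal Python sketch. It is not taken from the patent or from any specific prior-art implementation: the function name `monte_carlo_biclique`, the parameters `seed_size` and `rounds`, and the adjacency-dictionary representation are all assumptions made for illustration.

```python
import random


def monte_carlo_biclique(adj, columns, seed_size=2, rounds=200, rng=None):
    """Heuristic biclique search by repeated random sampling.

    adj maps each row vertex to the set of column vertices it is adjacent to.
    seed_size and rounds are illustrative parameters, not values from the source.
    """
    rng = rng or random.Random(0)
    rows = list(adj)
    best_rows, best_cols = set(), set()
    for _ in range(rounds):
        # Randomly select a few rows as the seed of this simulation.
        seed = rng.sample(rows, min(seed_size, len(rows)))
        # Keep the columns that have an edge to every selected row.
        cols = set(columns)
        for r in seed:
            cols &= adj[r]
        if not cols:
            continue
        # Keep the rows that have an edge to every kept column.
        grown = {r for r in rows if cols <= adj[r]}
        # Remember the simulation result with the most edges.
        if len(grown) * len(cols) > len(best_rows) * len(best_cols):
            best_rows, best_cols = grown, cols
    return best_rows, best_cols
```

Because each round only sees bicliques that contain its random seed rows, a small maximum biclique in a large graph is rarely hit, which is the weakness the paragraph points out.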

[0007] Besides the global maximum biclique search problem, there are scenarios where finding the maximum biclique containing a specific vertex or edge is required; these are also known as personalized maximum bicliques. Compared to the global maximum biclique, this scenario emphasizes local features. Based on this idea, we can observe that every vertex and edge in a bipartite graph exists within a specific local structure, with the local structure features of the maximum biclique being the most prominent. Therefore, we can leverage the differences between local and global structure features to find the nodes and edges most likely to be within the maximum biclique, and then use these features to search for the global maximum biclique, effectively reducing the search space. This provides another approach to heuristic search methods.

Summary of the Invention

[0008] The purpose of this invention is to provide a maximum biclique heuristic search method that balances efficiency and effectiveness. To achieve the above objective, this invention proposes a heuristic search method and search system based on the measurement of differences between local and global structural features.

[0009] A maximum biclique search method based on the difference between local and global structural features includes the following steps:

[0010] Step 1: Calculate the chi-square value of each node based on the input bipartite graph data;

[0011] Step 2: Map the chi-square values of the two endpoints to the adjacent edges;

[0012] Step 3: Construct the search space from the edges with the largest chi-square values;

[0013] Step 4: Execute the maximum biclique search algorithm on the search space and obtain the results.

[0014] Preferably, the specific process of step 1 of the present invention is as follows:

[0015] Input the bipartite graph data, then calculate three statistics of the vertex degrees in the bipartite graph: the maximum vertex degree, the average vertex degree, and the standard deviation of the vertex degree;

[0016] Then, the deviation markers and their probability values are calculated from the distribution of vertex degrees in the graph. The marker index is a positive integer ranging from 1 to a maximum value, so a set of deviation markers representing the degree of deviation can be generated. The expected probability value of each deviation marker is obtained from a transformed form of Chebyshev's inequality, so the probability of each deviation marker is a fixed value;

[0017] Next, the deviation marker of each vertex is obtained from an inequality on its degree. Then, the deviation-marker distribution of each node's one-hop neighbors is used as the observation vector of the local structure. The expected vector of the node is obtained by a formula involving the number of neighbors of the vertex. Finally, the chi-square value of the vertex is obtained by applying the chi-square formula to the observation vector and the expected vector.
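The formulas for this step are not reproduced in the text. For orientation only, the two standard results that the passage names are given below in their textbook forms; the symbols ($X$, $\mu$, $\sigma$, $t$, $O_i$, $E_i$, $m$) are assumptions, and the transformed form actually used by the method may differ.

```latex
% Chebyshev's inequality for a random variable X with mean \mu and
% standard deviation \sigma > 0, and any real t > 0:
\Pr\bigl(\lvert X - \mu \rvert \ge t\,\sigma\bigr) \le \frac{1}{t^{2}}

% Pearson's chi-square statistic between an observation vector O and an
% expected vector E over m deviation markers:
\chi^{2} = \sum_{i=1}^{m} \frac{(O_i - E_i)^{2}}{E_i}
```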

[0018] Preferably, the specific process of step 2 of the present invention is as follows:

[0019] After the chi-square values of the vertices are obtained, they are mapped to the adjacent edges. For any edge, its chi-square value is calculated by applying an aggregation function to the chi-square values of its two endpoints, where the aggregation function is the geometric mean, i.e. the square root of the product of the two endpoint values. Finally, the maximum biclique is found based on the edges with the largest chi-square values.

[0020] Preferably, the specific process of step 3 of the present invention is as follows:

[0021] A search space is established for each selected edge. It consists of the one-hop neighbors of the edge's upper-layer vertex, the one-hop neighbors of its lower-layer vertex, and the edges between them; this search space contains every biclique that includes the edge.

[0022] Preferably, the specific process of step 4 of the present invention is as follows:

[0023] Assuming one side of the search space is no larger than the other, the smaller side is taken as the candidate set, and a result set is constructed on the same side. This result set is stored as a global structure whose size is controlled by an index variable; similarly, the size of the result set on the other side is controlled by its own size variable. The size of the biclique at the current position is obtained from these two sizes.

[0024] A search system for implementing the maximum biclique search method of the present invention includes the following modules:

[0025] Chi-square value calculation module: statistically analyzes the distribution of node degrees over the entire bipartite graph, then calculates the chi-square value corresponding to the local structure of each node, further maps the chi-square values of the two endpoint nodes to their adjacent edge, and finally obtains the chi-square value of each edge;

[0026] Search space construction module: constructs the search space from the edges with the largest chi-square values, where the search space of each edge is independent;

[0027] Maximum biclique search module: In the search space of these candidate edges, the optimized branch and bound algorithm is used to perform maximum biclique search and return the query results.

[0028] The method of the present invention has the following advantages compared with existing methods:

[0029] 1. Modeling based on the differences in local and global structural features in a bipartite graph: Existing heuristic methods require iterative search across the entire bipartite graph, while this method only requires modeling once and then iteratively searching in the local space, resulting in better heuristic solutions.

[0030] 2. Highly efficient maximum biclique search algorithm: For the search process in the local space, this invention provides a variety of pruning methods and optimization techniques, which significantly improve the search efficiency. Compared with existing heuristic algorithms, the processing time is improved by 1 to 3 orders of magnitude.

Brief Description of the Drawings

[0031] Figure 1 is a schematic diagram of the method flow of the present invention;

[0032] Figure 2 shows the graph data of one embodiment of the present invention;

[0033] Figure 3 shows the search space corresponding to one embodiment of the present invention;

[0034] Figure 4 shows the early termination process corresponding to one embodiment of the present invention.

Detailed Description

[0035] To make the objectives, methods, and advantages of this invention clearer, the following detailed description of specific embodiments further illustrates the invention. However, this invention is not limited to the following specific embodiments.

[0036] This invention proposes a maximum biclique search method and system based on a measure of the difference between local and global structural features, including:

[0037] 1. A modeling method based on the difference measurement between local and global structural features;

[0038] 2. A maximum biclique search method based on multiple pruning methods and optimization techniques.

[0039] This invention designs a chi-square statistic-based method to measure the difference between local and global structures, and employs various pruning and optimization techniques to improve the efficiency of the search method. It can meet the need for maximum biclique search in different scenarios, such as highly consistent structures between user groups and product sets in recommendation systems, interaction cliques between microorganisms and drugs in drug development, and the identification of document clusters in text mining. This invention outperforms existing heuristic methods in both query efficiency and result reliability.

[0040] A bipartite graph consists of an upper-layer vertex set, a lower-layer vertex set, and a set of edges between them. The two vertex sets are disjoint, and vertices within the same vertex set are not connected to each other. A biclique is a substructure of a bipartite graph consisting of two vertex subsets, one from each layer, in which the nodes of the two parts are fully connected. The maximum biclique search problem refers to finding the biclique with the most edges in a bipartite graph.
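The definitions in this paragraph can be written compactly as follows. The original symbols are not reproduced in the text, so the notation here ($G$, $U$, $V$, $E$, $X$, $Y$) is assumed for illustration only.

```latex
% Bipartite graph with upper layer U, lower layer V and edge set E:
G = (U, V, E), \qquad U \cap V = \emptyset, \qquad E \subseteq U \times V

% A biclique is a pair of vertex subsets that is fully connected:
B = (X, Y), \qquad X \subseteq U,\; Y \subseteq V, \qquad X \times Y \subseteq E

% Maximum biclique search: maximise the number of edges |X| \cdot |Y|
B^{*} = \operatorname*{arg\,max}_{(X, Y)\ \text{biclique of}\ G} \; \lvert X \rvert \cdot \lvert Y \rvert
```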

[0041] The modeling method described above refers to using the chi-square statistic to characterize the difference between the distributions of local and global structural features. Specifically, the chi-square value of a vertex is calculated from an observation vector and an expected vector: the observation vector represents the degree distribution of all one-hop neighbors of the current vertex, and the expected vector represents the expected distribution of the neighbor degrees of a vertex. After obtaining the chi-square value of the vertex, in order to further improve the robustness of the model and considering the structural characteristics of the bipartite graph, the chi-square values of the two endpoint vertices are mapped to the edges for smoothing, and an induced subgraph is constructed based on the edges for searching. Since the search space of an edge is smaller than that of a vertex, the search efficiency is also improved.

[0042] The pruning methods and optimization techniques mentioned include maximal pruning, concurrency optimization, candidate node search order optimization, and early termination optimization. These can be used in combination to achieve the best results. Pruning methods reduce the search space by utilizing the structural properties of the maximum biclique, thereby avoiding unnecessary search. Optimization techniques, on the other hand, adjust the branch-and-bound framework to improve the overall efficiency of the algorithm. Optimizing the candidate node search order specifically involves sorting candidate nodes according to their dominance. If two nodes have a dominance relationship, visiting the dominant node first allows the dominated node to be pruned in the current branch. If two nodes do not have a dominance relationship, they are sorted in ascending order of degree, making the search tree as balanced as possible and effectively reducing its depth. Early termination optimization, under certain conditions, converts the recursive process in the branch-and-bound framework into a loop, effectively reducing the overhead of recursion and thus improving algorithm efficiency.

[0043] The heuristic search method based on the measurement of differences between local and global structural features is specifically defined as follows: First, the chi-square statistic is used to measure the feature differences between the local and global structures on the bipartite graph. Then, the edges with the largest chi-square values are used as anchor seeds to search for the maximum biclique. From a structural perspective, a larger chi-square value indicates that the edge is more likely to be in the maximum biclique, while a smaller chi-square value indicates that the edge is similar to most structures in the graph and is therefore less likely to appear in the maximum biclique.

[0044] A maximum biclique search method based on the difference between local and global structural features includes the following steps:

[0045] Step 1: Calculate the chi-square value of the nodes based on the input bipartite graph data.

[0046] The specific process is as follows: The input bipartite graph data is shown in Figure 2. First, three statistics of the vertex degrees in the bipartite graph are calculated: the maximum vertex degree, the average vertex degree, and the standard deviation of the vertex degree. Then, the deviation markers and their probability values are calculated from the distribution of vertex degrees in the graph. The marker index is a positive integer ranging from 1 to a maximum value, so a set of deviation markers representing the degree of deviation can be generated. The expected probability value of each deviation marker is obtained using Chebyshev's inequality. Then, the deviation-marker distribution of each node's one-hop neighbors is used as the observation vector of the local structure. The expected vector of the node is obtained by a formula involving the number of neighbors of the vertex. Finally, the chi-square value of the vertex is obtained by applying the chi-square formula to the observation vector and the expected vector.
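A rough Python sketch of this step is given below. Because the exact formulas are missing from the text, several points are assumptions: the population standard deviation is used, the deviation marker of a vertex is taken as the number of standard deviations (rounded up, clamped to a maximum) between its degree and the mean, and the expected probability of each marker is passed in by the caller as `expected_prob` rather than derived from Chebyshev's inequality. All function names are hypothetical.

```python
import math
from collections import Counter


def degree_stats(adj):
    """Return (max degree, mean degree, population standard deviation)."""
    degrees = [len(nbrs) for nbrs in adj.values()]
    mean = sum(degrees) / len(degrees)
    std = math.sqrt(sum((d - mean) ** 2 for d in degrees) / len(degrees))
    return max(degrees), mean, std


def deviation_label(degree, mean, std, max_label):
    """Assumed rule: marker i means the degree is within i standard deviations."""
    if std == 0:
        return 1
    return min(max_label, max(1, math.ceil(abs(degree - mean) / std)))


def vertex_chi_square(v, adj, labels, expected_prob):
    """Pearson chi-square between neighbour marker counts and expected counts.

    expected_prob maps each marker to its expected probability; these values
    are supplied by the caller because the source formula is not available.
    """
    nbrs = adj[v]
    observed = Counter(labels[u] for u in nbrs)
    chi2 = 0.0
    for label, p in expected_prob.items():
        expected = p * len(nbrs)  # expected count among the one-hop neighbours
        if expected > 0:
            chi2 += (observed.get(label, 0) - expected) ** 2 / expected
    return chi2
```

Here `adj` is a symmetric adjacency dictionary covering the vertices of both layers, and `labels` maps each vertex to its deviation marker.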

[0047] Step 2: Map the chi-square values of the two endpoints to the adjacent edges.

[0048] The specific process is as follows: After the chi-square values of the vertices are obtained, they are mapped to the adjacent edges. For any edge, its chi-square value is calculated by applying an aggregation function to the chi-square values of its two endpoints. Since the ChiMBC algorithm is designed to solve the maximum biclique search problem, this invention uses the geometric mean as the aggregation function to minimize the influence of one-sided extrema. The first two columns of Table 1 record the chi-square values of all nodes in Figure 2. The third column shows the results after mapping the node chi-square values to adjacent edges and sorting them in descending order. It can be observed that the first four edges all belong to the maximum biclique. Therefore, only the edges with the largest chi-square values need to be considered in order to find the maximum biclique.

[0049] Table 1

[0050] Columns: vertex, chi-square value; vertex, chi-square value; edge, chi-square value. Chi-square values, row by row:

0.67  0.44  0.89
0.44  0.89  0.77
0.67  0.67  0.77
0.89  0.44  0.77
0.44  0.67  0.77
0.22  0.44  0.67
0.67  0.22  0.67
0.22  0.67
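The edge mapping of step 2 is simple enough to sketch directly. The geometric-mean aggregation comes from the text; the function names and the vertex identifiers used in the usage note are hypothetical, since the labels in Table 1 are not reproduced.

```python
import math


def edge_chi_square(chi_u, chi_v):
    """Geometric mean of the chi-square values of the two endpoints."""
    return math.sqrt(chi_u * chi_v)


def rank_edges(edges, vertex_chi):
    """Score every edge and sort the edges by descending chi-square value."""
    scored = [
        ((u, v), edge_chi_square(vertex_chi[u], vertex_chi[v]))
        for u, v in edges
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

For example, endpoint values of 0.89 and 0.67 give an edge value of about 0.77, and two endpoints at 0.89 give 0.89, which agrees with the edge values listed in Table 1.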

[0051] Step 3: Construct the search space from the edges with the largest chi-square values.

[0052] The specific process is as follows: The search space of an edge is shown in Figure 3. It consists of the one-hop neighbors of the edge's upper-layer vertex, the one-hop neighbors of its lower-layer vertex, and the edges between them. Clearly, this search space contains every biclique that includes the edge; if the maximum biclique contains the edge, it must also lie in this search space.
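A minimal sketch of the search space construction for one edge follows. It assumes the graph is stored as two adjacency dictionaries, `upper_adj` (upper-layer vertex to its lower-layer neighbors) and `lower_adj` (the reverse); these names and the function name are hypothetical.

```python
def edge_search_space(u, v, upper_adj, lower_adj):
    """Build the induced subgraph around the edge (u, v).

    Any biclique containing (u, v) only uses upper-layer vertices adjacent
    to v and lower-layer vertices adjacent to u, so nothing is lost.
    """
    upper = set(lower_adj[v])  # one-hop neighbours of v, includes u
    lower = set(upper_adj[u])  # one-hop neighbours of u, includes v
    # Restrict each upper-layer vertex to the lower-layer part of the space.
    sub_adj = {x: upper_adj[x] & lower for x in upper}
    return upper, lower, sub_adj
```

Since each edge gets its own subgraph, the search spaces of different edges can be processed independently.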

[0053] Step 4: Execute the maximum biclique search algorithm on the search space and obtain the results.

[0054] The specific process is as follows: The main idea of the search method is to use the nodes on one side as a candidate set and to construct a separate result set for each side. The result set on the same side as the candidate set is initialized to empty, while the result set on the other side contains all the vertices of that side. Then, by repeatedly moving nodes from the candidate set into the result set on the same side while filtering the result set on the other side, all bicliques containing the edge can be obtained. In detail, assuming one side of the search space is no larger than the other, the smaller side is used as the candidate set, and a result set is constructed on the same side. Under the branch and bound algorithm, whether recursing downward or enumerating at the same level, this result set is only ever modified at its last element, so it can be stored as a global structure whose size is controlled by an index variable, which reduces unnecessary space overhead. Similarly, a size variable on the other side controls the size of that side's result set. The size of the biclique at the current position is obtained from these two sizes.
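The enumeration idea in this step can be sketched as a small branch-and-bound routine in Python. This is a simplified illustration, not the optimized algorithm of the invention: it keeps one shared list for the candidate-side result set (only its tail is pushed and popped) and applies a plain upper-bound check, while the reduction, ordering, concurrency, and early termination optimizations are left out. The name `max_biclique` and its arguments are assumptions.

```python
def max_biclique(cand_adj, other):
    """Return (left, right) of a biclique with the most edges.

    cand_adj maps each candidate-side vertex to its neighbours on the other
    side; other is the full vertex set of the other side.
    """
    best = {"size": 0, "left": frozenset(), "right": frozenset()}
    left = []  # shared result set on the candidate side

    def search(candidates, right):
        for i, x in enumerate(candidates):
            # Adding x filters the result set on the other side.
            new_right = right & cand_adj[x]
            if not new_right:
                continue
            rest = candidates[i + 1:]
            # Upper bound: all remaining candidates join, right side intact.
            if (len(left) + 1 + len(rest)) * len(new_right) <= best["size"]:
                continue
            left.append(x)
            size = len(left) * len(new_right)
            if size > best["size"]:
                best.update(size=size, left=frozenset(left),
                            right=frozenset(new_right))
            search(rest, new_right)
            left.pop()  # undo only the last element

    search(list(cand_adj), frozenset(other))
    return best["left"], best["right"]
```

Passing the smaller side of the search space as `cand_adj` keeps the recursion depth bounded by the smaller vertex count, in line with the choice of candidate set described above.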

[0055] A maximum biclique search system based on the difference between local and global structural features includes the following modules:

[0056] Chi-square value calculation module: statistically analyzes the distribution of node degrees over the entire bipartite graph, then calculates the chi-square value corresponding to the local structure of each node, further maps the chi-square values of the two endpoint nodes to their adjacent edge, and finally obtains the chi-square value of each edge;

[0057] Search space construction module: constructs the search space from the edges with the largest chi-square values, where the search space of each edge is independent;

[0058] Maximum biclique search module: In the search space of these candidate edges, the optimized branch and bound algorithm is used to perform maximum biclique search and return the query results.

[0059] To improve the efficiency of the algorithm, this invention utilizes multiple strategies to optimize it, with the strategies working in conjunction to achieve the best results. The first is the coordination between reduction and the maximum pruning strategy. Maximum pruning finds the upper bound of the current branch and compares it with the best result found so far; before making this comparison, the sets involved should first be reduced as far as possible, because for the current branch the degree of a node in a set must satisfy a minimum degree requirement.

[0060] Before enumerating candidate nodes for each branch, this invention sorts the candidate nodes. Specifically, they are first sorted according to dominance, because if there is a dominance relationship between two nodes, visiting the dominant node first allows the dominated node to be pruned in the current branch. For example, when one node dominates the other three nodes, only that node needs to be searched during the first level of enumeration to cover all possible cases. Furthermore, if no dominance relationship exists between two nodes, they are sorted in ascending order of degree. This effectively reduces the depth of the search tree, thus improving the overall efficiency of the algorithm. Before entering recursion, the maximality of the current biclique needs to be checked. If the current biclique does not satisfy maximality, the branch can be pruned.
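One possible reading of the ordering rule is sketched below. It assumes that a node dominates another when the neighborhood of the second, restricted to the current result set on the other side, is contained in the neighborhood of the first; the text does not spell out this definition, so it is an assumption, as are the function name `order_candidates` and the tie-breaking by input position.

```python
def order_candidates(candidates, adj, right):
    """Split candidates into (kept, dominated).

    kept holds the non-dominated nodes in ascending order of degree;
    dominated holds the nodes whose neighbourhood is covered by another
    candidate. How dominated nodes are pruned is left to the caller.
    """
    nbrs = {x: adj[x] & right for x in candidates}
    pos = {x: i for i, x in enumerate(candidates)}

    def dominates(y, x):
        # Strict containment, or equal neighbourhoods broken by position so
        # that exactly one node of an identical group survives.
        return nbrs[x] < nbrs[y] or (nbrs[x] == nbrs[y] and pos[y] < pos[x])

    dominated = [
        x for x in candidates
        if any(dominates(y, x) for y in candidates if y != x)
    ]
    kept = sorted(
        (x for x in candidates if x not in dominated),
        key=lambda x: len(nbrs[x]),
    )
    return kept, dominated
```

With one node whose neighborhood covers those of three others, `kept` contains only that node, mirroring the example in the paragraph above.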

[0061] Because the branch-and-bound framework is used to enumerate all combinations in the candidate set during the maximum biclique search phase, the number of branches increases exponentially with the recursion depth. If each subproblem is processed recursively, algorithm performance degrades. During the recursive search, a common phenomenon can be observed: initially, the result set on the other side is necessarily larger than the candidate set, but after several rounds of recursion it is often much smaller than the candidate set. Continuing with the branch-and-bound process at that point would require many more levels of recursion. Given this property, this invention proposes using a consensus algorithm to terminate the recursion early, converting the recursive process into a loop and thereby improving algorithm efficiency. For example, for a branch with three sub-branches, the steps of the consensus algorithm are shown in Figure 4, where the upper part is the initialization phase and the lower part is the consensus phase. It can be seen that after two rounds of iteration all consensuses are reached, and the algorithm finally returns the two vertex sets that form the maximum biclique.

[0062] Through the above implementation methods, this invention can effectively model based on the differences between local and global structural features, using the highly different local structures as a heuristic search space, and then obtaining the results through an optimized maximum biclique search algorithm. This method not only boasts high search efficiency, improving upon existing heuristic methods by 1-3 orders of magnitude, but also delivers better results and exhibits strong adaptability, meeting the maximum biclique search requirements in various scenarios.

[0063] Obviously, the above embodiments are merely examples for clearly illustrating the present invention, and are not intended to limit the implementation of the present invention. Those skilled in the art can make other variations or modifications based on the above description. It is neither necessary nor possible to exhaustively describe all embodiments here. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the claims of the present invention.

Claims

1. A maximum biclique search method based on the difference between local and global structural features, characterized in that it includes the following steps: Step 1: Calculate the chi-square value of each node based on the input bipartite graph data; Step 2: Map the chi-square values of the two endpoints to the adjacent edges; Step 3: Construct the search space from the edges with the largest chi-square values; Step 4: Execute the maximum biclique search algorithm on the search space and obtain the results.

2. The maximum biclique search method according to claim 1, characterized in that the specific process of step 1 is as follows: Input the bipartite graph data, then calculate three statistics of the vertex degrees in the bipartite graph: the maximum vertex degree, the average vertex degree, and the standard deviation of the vertex degree; then, the deviation markers and their probability values are calculated from the distribution of vertex degrees in the graph, where the marker index is a positive integer ranging from 1 to a maximum value, generating a set of deviation markers representing the degree of deviation; the expected probability value of each deviation marker is obtained from a transformed form of Chebyshev's inequality, so the probability of each deviation marker is a fixed value; next, the deviation marker of each vertex is obtained from an inequality on its degree; then, the deviation-marker distribution of each node's one-hop neighbors is used as the observation vector of the local structure; the expected vector of the node is obtained by a formula involving the number of neighbors of the vertex; finally, the chi-square value of the vertex is obtained by applying the chi-square formula to the observation vector and the expected vector.

3. The maximum biclique search method according to claim 2, characterized in that the specific process of step 2 is as follows: After the chi-square values of the vertices are obtained, they are mapped to the adjacent edges; for any edge, its chi-square value is calculated by applying an aggregation function to the chi-square values of its two endpoints, where the aggregation function is the geometric mean, i.e. the square root of the product of the two endpoint values; finally, the maximum biclique is found based on the edges with the largest chi-square values.

4. The maximum biclique search method according to claim 3, characterized in that the specific process of step 3 is as follows: A search space is established for each selected edge; it consists of the one-hop neighbors of the edge's upper-layer vertex, the one-hop neighbors of its lower-layer vertex, and the edges between them; the search space contains every biclique that includes the edge.

5. The maximum biclique search method according to claim 4, characterized in that the specific process of step 4 is as follows: Assuming one side of the search space is no larger than the other, the smaller side is taken as the candidate set, and a result set is constructed on the same side; this result set is stored as a global structure whose size is controlled by an index variable; similarly, the size of the result set on the other side is controlled by its own size variable; the size of the biclique at the current position is obtained from these two sizes.

6. A search system for implementing the maximum biclique search method according to any one of claims 1-5, characterized in that it includes the following modules: Chi-square value calculation module: statistically analyzes the distribution of node degrees over the entire bipartite graph, then calculates the chi-square value corresponding to the local structure of each node, further maps the chi-square values of the two endpoint nodes to their adjacent edge, and finally obtains the chi-square value of each edge; Search space construction module: constructs the search space from the edges with the largest chi-square values, where the search space of each edge is independent; Maximum biclique search module: in the search spaces of these candidate edges, an optimized branch and bound algorithm is used to perform maximum biclique search and return the query results.