Parallel reverse shuffle sampling

By employing a parallel reverse shuffling sampling method and GPU parallelization design, the problem of low efficiency in large-scale graph data processing is solved, achieving efficient and parallel graph neighbor sampling, improving sampling efficiency and reducing time complexity.

CN122242664A · Pending · Publication Date: 2026-06-19 · BEIJING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING UNIV OF POSTS & TELECOMM
Filing Date
2026-03-19
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing graph neighbor sampling methods are inefficient in large-scale graph data processing: their time complexity depends on the degree of the nodes, and they have difficulty exploiting parallel computing resources.

Method used

A parallel reverse shuffle sampling method is adopted. By using the reverse shuffle algorithm and GPU parallelization design, sampling tasks are allocated to GPU warp processing. The label propagation strategy is used to decouple the dependencies between operations and achieve efficient sampling.

Benefits of technology

It significantly improves sampling efficiency, reduces time complexity to O(k·log₂k), achieves a 5.89x performance improvement on GPUs, supports multiple neighbor sampling strategies, and fully utilizes parallel computing capabilities.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a parallel reverse shuffle sampling method, belonging to the fields of graph data processing and machine learning. To sample k elements from n elements, the method performs k reverse shuffle operations that process the shuffling process in reverse order. Each operation randomly selects a position within the corresponding deck size range and decides which element to sample based on whether the element at that position has already been sampled. The time complexity of this method is O(k·log₂k), which depends only on the number of samples, overcoming the dependence of traditional methods on the degree of the original graph nodes. Furthermore, a GPU parallel implementation is provided, which decouples operation dependencies through a label propagation strategy and achieves a time complexity of O(k·log₂k / warp_size). This invention achieves a 5.89x performance improvement in fanout-based neighbor sampling and a 5.30x performance improvement in probability-based neighbor sampling, and can be widely applied to large-scale graph data processing in fields such as social networks, recommender systems, and bioinformatics.

Description

Technical Field

[0001] This invention relates to graph data processing and machine learning, and more particularly to a parallel reverse shuffle sampling method.

Background Technology

[0002] With the widespread application of graph machine learning and graph analysis techniques in fields such as social networks, recommender systems, and bioinformatics, efficiently processing large-scale graph data has become a key challenge. Graph sampling, by extracting representative subgraphs from the original graph, provides an effective method to alleviate computational bottlenecks.

[0003] Existing neighbor sampling methods mainly include probability-based and fanout-based methods. However, these methods still face a key limitation: their time cost is strongly influenced by the degree of nodes in the original graph. When dealing with large-scale graphs, this leads to inefficiency and scalability issues.

[0004] Traditional graph neighbor sampling methods have the following problems:
1. Probability-based methods must scan the data, so their time complexity depends on the degree of the original graph nodes;
2. Fanout-based methods must build an initial deck, so their time complexity is also affected by the number of nodes;
3. Existing methods typically have a time complexity of O(n), where n is the node degree, resulting in low efficiency on large-scale graphs;
4. Sequential sampling methods have difficulty exploiting parallel computing resources such as GPUs.

[0005] Therefore, a new graph neighbor sampling method is needed that can overcome the limitations of the original graph node degree, improve sampling efficiency, and make full use of parallel computing capabilities.

Summary of the Invention

[0006] Purpose of the invention: The purpose of this invention is to provide a parallel reverse shuffle sampling method, which significantly improves the efficiency of graph neighbor sampling through a reverse shuffle sampling algorithm and parallel design.

[0007] Technical solution: A parallel reverse shuffle sampling method, comprising the following steps. Firstly, a reverse shuffle sampling method is provided, including:
(1) Sample k elements from a set containing n elements, where 0 ≤ k ≤ n;
(2) Perform k reverse shuffle operations, in the reverse order of the shuffle operations;
(3) For the i-th reverse shuffle operation (0 ≤ i < k), randomly select a position within the deck size range of the (k-i-1)-th shuffle operation;
(4) Determine whether the element at the selected position is already included in the sampling set: if the element has not been sampled, add it to the sampling set; if the element has been sampled, add the element at the top of the current deck to the sampling set;
(5) Repeat steps (3)-(4) until k reverse shuffle operations are completed to obtain the final sampling set.

[0008] The time complexity of this method is O(k·log₂k), which depends only on the sampling quantity k and is not affected by the size n of the original set.

[0009] In the second aspect, a GPU parallel reverse shuffle sampling method is provided, including:
Assign the sampling task to the warps (thread bundles) of the GPU for processing, with the sampling task of each node executed by one warp;
Assign k reverse shuffle operations to be executed in parallel by warp_size threads, and assign an ID (from 0 to k-1) to each operation;
Generate k random positions in parallel across the threads;
Execute conflict marking: for the operation with ID i, if there exists an operation with ID j < i that selects the same position, mark operation i as a conflict;
Establish the dependency relationship: for the operation i that selects the position x_i, if x_i ≥ n - k and this operation is not marked as a conflict, establish a directed edge from operation (x_i - n + k) to operation i to represent the dependency relationship;
Execute mark propagation: starting from the marked operations, propagate the mark along the dependency edges until all affected operations are correctly marked;
Determine the sampling element according to the marking status: for a marked operation, sample the element corresponding to the position (n - k + i); for an unmarked operation, sample the element corresponding to the random position;
Add the sampled elements to the sampling set to obtain the final result.

[0010] The time complexity of this parallel method is O(k·log₂k / warp_size), which makes full use of the parallel computing power of the GPU.

[0011] In the third aspect, a general graph neighbor sampling framework is provided, including: Support the probability-based neighbor sampling strategy; Support the fanout-based neighbor sampling strategy; Both strategies can be efficiently implemented using the reverse shuffle sampling method described above; It supports GPU parallel execution to accelerate the neighbor sampling process of large-scale graphs.

[0012] Beneficial effects:
(1) It removes the dependence of the time complexity of traditional methods on the degree of the original graph nodes; the sampling time depends only on the sampling fanout size;
(2) It significantly improves sampling efficiency, achieving a 5.89x performance improvement in fanout-based neighbor sampling and a 5.30x performance improvement in probability-based neighbor sampling;
(3) The dependency relationship between operations is decoupled by the mark propagation strategy, so that parallel processing is realized;
(4) It fully utilizes the parallel computing capabilities of the GPU to further accelerate the sampling process;
(5) It provides a general sampling framework that can support multiple neighbor sampling strategies at the same time.

Description of the Drawings

[0013] Figure 1 is a flowchart of the reverse shuffle sampling method; Figure 2 is an example diagram of the reverse shuffle sampling method; Figure 3 is a flowchart of GPU parallel reverse shuffle sampling; Figure 4 is a schematic diagram of the label propagation strategy.

Detailed Implementation

[0014] To make the technical solution of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0015] Example 1: Reverse Shuffle Sampling Method
The core idea of reverse shuffle sampling is to process the shuffle process in reverse order, from "future" to "past". Specifically, k reverse shuffle operations are performed, and these k reverse shuffle operations correspond one-to-one with k shuffle operations in reverse order.

[0016] The specific implementation steps are as follows: Input: n (total size), k (number of samples), e[0..n-1] (all elements);

[0017] Output: set (sample set).

[0018] Step 1: Initialize the sampling set to an empty set.

[0019] Step 2: Perform k reverse shuffle operations. For the i-th operation (i ranges from 0 to k-1):
Step 2.1: Determine the current deck as positions 0 through n-k+i, corresponding to the deck of the (k-i-1)-th shuffle operation; the element at position n-k+i is the top of the deck;
Step 2.2: Randomly select a position within the range [0, n-k+i];
Step 2.3: Determine whether the element at the selected position is already in the sampling set: if the element has not been sampled, add it to the sampling set; if the element has already been sampled, add the top element of the current deck to the sampling set.

[0020] Step 3: Return the sample set.
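The steps above can be sketched in Python as follows. This is a minimal illustration, not the patent's reference implementation: the function name `reverse_shuffle_sample` and the `rng` parameter are hypothetical, and the position is assumed to be drawn from the inclusive range [0, n-k+i], which is the range used in the worked example of paragraphs [0021]-[0023]. A hash set is used for membership tests; an ordered set would give the O(log₂k) lookup discussed in the text.

```python
import random


def reverse_shuffle_sample(elements, k, rng=None):
    """Sample k distinct elements uniformly while touching only k positions."""
    rng = rng or random.Random()
    n = len(elements)
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    chosen = set()   # positions already in the sampling set
    order = []       # positions in the order they were sampled
    for i in range(k):
        top = n - k + i              # top position of the current deck
        pos = rng.randint(0, top)    # inclusive range [0, n-k+i]
        if pos in chosen:            # already sampled: take the top instead
            pos = top
        chosen.add(pos)
        order.append(pos)
    return [elements[p] for p in order]


if __name__ == "__main__":
    print(reverse_shuffle_sample([f"e{i}" for i in range(7)], 3, random.Random(42)))
```

Note that the loop never builds or scans a deck of size n; only the k drawn positions are stored.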

[0021] For example: To sample 3 elements from a set {e0, e1, e2, e3, e4, e5, e6} containing 7 elements, the process is as follows: First reverse shuffle operation (i=0): Deck range: [0, 4] (top position n-k+i = 7-3+0 = 4); Randomly select position 2 to obtain element e2; e2 is not in the sample set; add it to the sample set. Current sample set: {e2}.

[0022] Second reverse shuffle operation (i=1): Deck range: [0, 5] (top position n-k+i = 7-3+1 = 5); Randomly select position 2 to obtain element e2; e2 is already in the sample set; add the top element e5 to the sample set. Current sample set: {e2, e5}.

[0023] Third reverse shuffle operation (i=2): Deck range: [0, 6] (top position n-k+i = 7-3+2 = 6); Randomly select position 5 to obtain element e5; e5 is already in the sample set; add the top element e6 to the sample set. Current sample set: {e2, e5, e6}.

[0024] Final sample set: {e2, e5, e6}.
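The worked example can be replayed by fixing the three random draws to the positions it uses (2, 2, 5). The helper name `replay_reverse_shuffle` is hypothetical and exists only for this illustration.

```python
def replay_reverse_shuffle(n, k, positions):
    """Apply the reverse shuffle rule to pre-chosen positions; return sampled indices."""
    sampled = []
    for i, pos in enumerate(positions):
        top = n - k + i
        assert 0 <= pos <= top, "position must lie inside the current deck"
        # Take the top of the deck if the drawn position was already sampled.
        sampled.append(top if pos in sampled else pos)
    return sampled


trace = replay_reverse_shuffle(7, 3, [2, 2, 5])
print([f"e{p}" for p in trace])  # ['e2', 'e5', 'e6']
```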

[0025] During the reverse shuffle sampling process, a total of k reverse shuffle operations are performed. For each reverse shuffle operation, since there are at most k elements in the sampling set, the time complexity of querying whether an element is in the sampling set is O(log₂k). Therefore, the time complexity of reverse shuffle sampling is O(k·log₂k).

[0026] It needs to be proved that when sampling k elements from a set E_n = {e_0, ..., e_{n-1}} of size n, the probability of sampling any subset A of size k is 1 / C(n, k).

[0027] The proof uses induction: (1) Boundary condition: When k = 0, 1 / C(n, 0) = 1, and when using reverse shuffle sampling to sample 0 elements from n elements, the probability of sampling the empty set is also 1. Therefore, when k = 0, the theorem holds.

[0028] (2) Inductive hypothesis: Let k0 ≥ 1 and suppose the theorem holds for every k with 0 ≤ k < k0 and every n ≥ k. It must be shown that the theorem then also holds for k = k0 and every n ≥ k0.

[0029] Calculate the probability that sampling k0 elements from a set E_n = {e_0, ..., e_{n-1}} of size n (n ≥ k0) yields a specific set A:
Case 1: e_{n-1} ∈ A. Among the k0 reverse shuffle operations, only the k0-th operation can sample e_{n-1}, so the first k0-1 operations must sample the k0-1 elements of A - {e_{n-1}}. In reverse shuffle sampling, the first k0-1 operations of sampling k0 elements from n elements are identical to the operations of sampling k0-1 elements from n-1 elements. Therefore, by the inductive hypothesis, the probability that the first k0-1 operations sample A - {e_{n-1}} is 1 / C(n-1, k0-1). Given this, the k0-th operation adds e_{n-1} to the sampling set whether it selects e_{n-1} itself or any of the k0-1 already-sampled elements of A - {e_{n-1}}, so the probability that it adds e_{n-1} is k0 / n. Therefore, when e_{n-1} ∈ A, the probability of obtaining A is 1 / C(n-1, k0-1) × k0 / n = 1 / C(n, k0).

[0030] Case 2: e_{n-1} ∉ A. The first k0-1 operations must again sample a subset of A of size k0-1. For any α ∈ A, the probability that the first k0-1 operations sample A - {α} is 1 / C(n-1, k0-1), and the probability that the k0-th operation then samples α is 1 / n. Since there are k0 choices for α, the probability of obtaining A by sampling k0 elements from a set of n elements using reverse shuffle sampling is k0 × 1 / C(n-1, k0-1) × 1 / n = 1 / C(n, k0).
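Both cases end with the same product, and the last equality follows from a standard binomial identity. Written out (with k_0 standing for k0 in the text):

```latex
\binom{n}{k_0}
  = \frac{n!}{k_0!\,(n-k_0)!}
  = \frac{n}{k_0}\cdot\frac{(n-1)!}{(k_0-1)!\,(n-k_0)!}
  = \frac{n}{k_0}\binom{n-1}{k_0-1}
\quad\Longrightarrow\quad
\frac{1}{\binom{n-1}{k_0-1}}\cdot\frac{k_0}{n}
  = \frac{1}{\binom{n}{k_0}} .
```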

[0031] In summary, when k0 ≥ 1, if the theorem holds for 0 ≤ k < k0, then it is proved that the theorem also holds for k = k0. Therefore, based on the inductive hypothesis, the theorem always holds when k ≥ 0.
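For small n and k the theorem can also be checked exhaustively. The sketch below enumerates every equally likely tuple of drawn positions, assuming operation i draws uniformly from [0, n-k+i], and counts how often each k-subset results. The function name `subset_counts` is hypothetical.

```python
from collections import Counter
from itertools import product
from math import comb


def subset_counts(n, k):
    """Count how many equally likely position tuples produce each k-subset."""
    ranges = [range(n - k + i + 1) for i in range(k)]  # op i draws from [0, n-k+i]
    counts = Counter()
    for positions in product(*ranges):
        sampled = set()
        for i, pos in enumerate(positions):
            sampled.add(n - k + i if pos in sampled else pos)
        counts[frozenset(sampled)] += 1
    return counts


counts = subset_counts(6, 3)
assert len(counts) == comb(6, 3)        # every 3-subset is reachable
assert len(set(counts.values())) == 1   # and all are equally likely
```

This is a brute-force sanity check on small inputs, not a substitute for the inductive proof.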

[0032] Example 2: GPU Parallel Reverse Shuffle Sampling
The sequential reverse shuffle sampling algorithm has a serial dependency problem because each sampling operation depends on the previous sampling result. To make full use of the parallel computing power of the GPU, this example proposes a parallel method that decouples the dependency relationship through a mark propagation strategy.

[0033] The specific implementation steps are as follows: Input: n (total size), k (number of samples), e[0..n - 1] (all elements); Output: set (sampling set).

[0034] Step 1: Generate k random positions x[0..k - 1] in parallel, where position x[i] lies within the deck range of operation i, that is, [0, n - k + i] (so that every position is within [0, n - 1]).

[0035] Step 2: Mark conflicts:
for i = 0 to k - 1 do (parallel over i, j):
    if ∃ j < i such that x[j] = x[i] then:
        tag[i] = True
    end if
end for
For operation ID i, if there exists an operation ID j < i that selects the same position, then mark operation i as a conflict.

[0036] Step 3: Establish dependency relationships:
for i = 0 to k - 1 do (parallel over i):
    if x[i] ≥ n - k ∧ tag[i] = False then:
        link[x[i] - (n - k)] ← i
    end if
end for
For operation i, if the selected position x_i ≥ n - k and the operation is not marked as a conflict, then establish a dependency link from operation (x_i - n + k) to operation i.

[0037] Step 4: Propagate marks:
for i = 0 to k - 1 do (parallel over i):
    if tag[i] = True then:
        w ← i
        while link[w] is not NULL do:
            link_tag[link[w]] = True
            w = link[w]
        end while
    end if
end for
Starting with the marked operations, the marking is propagated along the dependency edges until all affected operations are correctly marked.

[0038] Step 5: Determine the sampling elements:
for i = 0 to k - 1 do (parallel over i):
    if tag[i] ∨ link_tag[i] then:
        x[i] ← n - k + i
    end if
    set.push(e[x[i]])
end for
The sampling element is determined based on the marking state. For a marked operation, the element corresponding to position (n - k + i) is sampled; for an unmarked operation, the element corresponding to the random position is sampled.

[0039] Step 6: Return the sample set.
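Steps 2 to 5 can be simulated on the CPU to check that they reproduce the sequential result for the same random positions. The sketch below is an illustration under stated assumptions, not GPU code: each loop over i stands in for one warp-wide parallel pass, x[i] is assumed to lie in [0, n-k+i], and both function names are hypothetical. A link is created only when its source operation j precedes i; the case j = i (an operation drawing its own top) needs no link because the sampled element is the same either way. That guard is a defensive choice of this sketch.

```python
import random


def parallel_reverse_shuffle(n, k, x):
    """Phase-by-phase simulation of steps 2-5; returns the sampled positions."""
    base = n - k
    # Step 2: op i conflicts if an earlier op drew the same position.
    tag = [any(x[j] == x[i] for j in range(i)) for i in range(k)]
    # Step 3: an untagged op i that drew the top of op j's deck depends on op j.
    link = [None] * k
    for i in range(k):
        j = x[i] - base
        if 0 <= j < i and not tag[i]:
            link[j] = i
    # Step 4: walk the dependency chain from every tagged op.
    link_tag = [False] * k
    for i in range(k):
        if tag[i]:
            w = i
            while link[w] is not None:
                w = link[w]
                link_tag[w] = True
    # Step 5: marked ops take the top of their own deck.
    return [base + i if (tag[i] or link_tag[i]) else x[i] for i in range(k)]


def sequential_reverse_shuffle(n, k, x):
    """Reference: the sequential rule applied to the same positions."""
    seen, out = set(), []
    for i, pos in enumerate(x):
        pos = n - k + i if pos in seen else pos
        seen.add(pos)
        out.append(pos)
    return out


rng = random.Random(0)
for _ in range(2000):
    n = rng.randint(1, 12)
    k = rng.randint(0, n)
    x = [rng.randint(0, n - k + i) for i in range(k)]
    assert parallel_reverse_shuffle(n, k, x) == sequential_reverse_shuffle(n, k, x)
```

The quadratic conflict test in step 2 is written for clarity only; it is not how a warp would detect duplicates efficiently.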

[0040] The core of the mark propagation strategy is: Conflict marking: If multiple operations select the same position, the earliest operation samples the element at that position first, and the subsequent operations must be marked as conflicts.

[0041] Dependency: When an operation i selects a position x_i ≥ n - k, that position is the top element in the deck of operation (x_i - n + k). If the element selected by operation (x_i - n + k) has been sampled by an earlier operation, then operation (x_i - n + k) samples its top element, i.e., the element at position x_i. This means the element selected by operation i has already been sampled, so a dependency must be established between operation (x_i - n + k) and operation i.

[0042] Mark propagation: Starting with the operations marked as conflicting, the mark is propagated along the established dependency edges to ensure that all affected operations are correctly marked.
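To make the three ingredients concrete, the sketch below prints the intermediate arrays for the positions of the earlier worked example (n = 7, k = 3, positions 2, 2, 5). The function name `mark_state` is hypothetical, and x[i] is assumed to lie in [0, n-k+i].

```python
def mark_state(n, k, x):
    """Return (tag, link, link_tag) for drawn positions x."""
    base = n - k
    # Conflict marking: an earlier operation drew the same position.
    tag = [any(x[j] == x[i] for j in range(i)) for i in range(k)]
    # Dependency: untagged op i drew the top of op j's deck (j = x[i] - base).
    link = [None] * k
    for i in range(k):
        j = x[i] - base
        if 0 <= j < i and not tag[i]:
            link[j] = i
    # Propagation: follow links starting from every tagged operation.
    link_tag = [False] * k
    for i in range(k):
        w = i if tag[i] else None
        while w is not None and link[w] is not None:
            w = link[w]
            link_tag[w] = True
    return tag, link, link_tag


tag, link, link_tag = mark_state(7, 3, [2, 2, 5])
print(tag)       # [False, True, False]
print(link)      # [None, 2, None]
print(link_tag)  # [False, False, True]
```

Operation 1 repeats position 2 and is tagged; operation 2 drew position 5, the top of operation 1's deck, so it inherits the mark. The final positions are therefore 2, 5 and 6, matching the sequential example.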

[0043] In GPU-based parallel reverse shuffle sampling, sampling for each node is handled by a warp, and k sampling operations are distributed across warp_size threads. Each thread independently performs random index generation and comparison; therefore, the basic sampling steps require O(k·log₂k / warp_size) time.

[0044] The main computational overhead comes from conflict detection and resolution between concurrent sampling operations. To efficiently identify overlapping accesses and ensure the correctness of sampling results, a parallel sorting process is employed. This step has a time complexity of O(k·log₂k / warp_size) on the GPU.
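The text does not spell out how sorting is used to find conflicts. One plausible reading, assumed here, is to sort operation IDs by (position, ID) so that operations sharing a position become adjacent, then mark every operation whose predecessor in sorted order drew the same position. The function name `mark_conflicts_by_sorting` is hypothetical, and Python's built-in sort stands in for a parallel GPU sort.

```python
def mark_conflicts_by_sorting(x):
    """tag[i] is True iff some earlier operation j < i drew the same position."""
    # Sort operation IDs by (position, ID): within a run of equal positions
    # the smallest ID comes first and keeps its draw.
    order = sorted(range(len(x)), key=lambda i: (x[i], i))
    tag = [False] * len(x)
    # Each adjacent comparison is independent, so this pass parallelizes.
    for prev, cur in zip(order, order[1:]):
        if x[prev] == x[cur]:
            tag[cur] = True
    return tag


print(mark_conflicts_by_sorting([2, 2, 5]))  # [False, True, False]
```

The result matches the pairwise definition in Step 2 while replacing the all-pairs comparison with a sort followed by a single neighbor pass.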

[0045] After conflict resolution, a small number of dependency updates are required to maintain the consistency of the sampling results. The expected number of such updates required for each operation is proportional to ∑_{i=1}^{k} 1 / (n-k+i), which is bounded above by log₂(k+1). Therefore, the expected time complexity of this adjustment phase is O(log₂k).
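The bound can be checked numerically. Since n ≥ k, each term 1 / (n-k+i) is at most 1 / i, so the sum is at most the k-th harmonic number, which stays below log₂(k+1). The helper name `expected_update_bound` is hypothetical.

```python
from math import log2


def expected_update_bound(n, k):
    """Return (sum_{i=1..k} 1/(n-k+i), log2(k+1)) for comparison."""
    total = sum(1.0 / (n - k + i) for i in range(1, k + 1))
    return total, log2(k + 1)


for n, k in [(10, 10), (1_000, 10), (1_000_000, 25)]:
    total, bound = expected_update_bound(n, k)
    assert total <= bound + 1e-12
    print(n, k, round(total, 6), round(bound, 6))
```

The sum is largest when n = k and shrinks quickly as the node degree n grows relative to the fanout k.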

[0046] Based on the above analysis, the overall time complexity of the GPU-based parallel reverse shuffle sampling algorithm is O(k·log₂k / warp_size).

[0047] In actual testing, the parallel reverse shuffle sampling method proposed in this invention achieved a 5.89-fold performance improvement in fanout-based neighbor sampling and a 5.30-fold performance improvement in probability-based neighbor sampling, significantly improving the sampling efficiency of large-scale graph data.

[0048] Example 3: General Graph Neighbor Sampling Framework
Framework architecture: Sampling strategies supported: probability-based neighbor sampling; fanout-based neighbor sampling.

[0049] Sampling core: Neighbor node sampling is performed using a reverse shuffle sampling algorithm; It supports both CPU sequential execution and GPU parallel execution modes.

[0050] Parallelization support: Automatically detect available GPU resources; Distribute large-scale sampling tasks across multiple GPU cores; Employ warp-level parallel granularity to maximize GPU utilization.
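A sketch of how the two strategies might share one sampling core is given below. The text does not state how the probability-based strategy maps onto the sampler, so this sketch assumes the common reading in which each neighbor is kept independently with probability p; that is equivalent to drawing the sample size from Binomial(deg, p) and then choosing that many neighbors uniformly. All function names are hypothetical, and the binomial draw is a simple O(deg) placeholder that a real implementation would replace with a faster generator.

```python
import random


def reverse_shuffle_positions(n, k, rng):
    """Sampling core: k distinct positions out of n via reverse shuffle sampling."""
    chosen = set()
    for i in range(k):
        top = n - k + i
        pos = rng.randint(0, top)
        chosen.add(top if pos in chosen else pos)
    return chosen


def fanout_sample(neighbors, fanout, rng):
    """Fanout-based strategy: at most `fanout` neighbors per node."""
    k = min(fanout, len(neighbors))
    picked = reverse_shuffle_positions(len(neighbors), k, rng)
    return [neighbors[p] for p in sorted(picked)]


def probability_sample(neighbors, p, rng):
    """Probability-based strategy (assumed semantics: keep each neighbor w.p. p)."""
    # Placeholder binomial draw; O(deg) and for illustration only.
    k = sum(rng.random() < p for _ in neighbors)
    picked = reverse_shuffle_positions(len(neighbors), k, rng)
    return [neighbors[q] for q in sorted(picked)]


if __name__ == "__main__":
    rng = random.Random(7)
    nbrs = list(range(100, 120))
    print(fanout_sample(nbrs, 5, rng))
    print(probability_sample(nbrs, 0.25, rng))
```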

[0051] Application scenarios: Social network analysis: Sample user neighbor nodes from large-scale social network graphs for tasks such as social recommendation and community discovery.

[0052] Recommendation system: Sample the user's historical interaction neighbors in the user-item bipartite graph for use in graph neural network training.

[0053] Bioinformatics: Sample node neighbors from complex biological networks such as protein-protein interaction networks and gene regulatory networks for research purposes such as drug discovery and disease prediction.

[0054] Knowledge graph: Sample entity neighbors from large-scale knowledge graphs for tasks such as knowledge reasoning and entity linking.

[0055] Performance advantages: Compared to traditional methods, this framework has the following advantages: The sampling time complexity is reduced from O(n) to O(k·log₂k), where k is the sampling fanout size. Supports massively parallel processing and fully utilizes GPU computing resources. It is highly versatile and can support multiple neighbor sampling strategies simultaneously. The sampling results are guaranteed to be uniform, ensuring the fairness of graph neural network training.

[0056] Example 4: Application in Graph Neural Network Training
The application process is as follows: Graph data preprocessing: Load large-scale graph data (nodes, edges, features); Construct adjacency lists or other graph data structures.

[0057] Training process: For each training batch: Select a batch of source nodes; For each source node, its neighboring nodes are sampled using the reverse shuffle sampling method of this invention; Recursively sample the neighboring nodes obtained from the sampling to form a sampled subgraph; The sampled subgraph is input into the graph neural network for forward propagation; Calculate the loss and backpropagate to update the parameters.
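The sampling part of this loop can be sketched as a layer-by-layer expansion over an adjacency list. This is a toy illustration: the graph, the function names, and the per-layer fanout list are hypothetical, and the forward and backward passes of the graph neural network are omitted.

```python
import random


def sample_neighbors(neighbors, fanout, rng):
    """Pick up to `fanout` distinct neighbors with reverse shuffle sampling."""
    n, k = len(neighbors), min(fanout, len(neighbors))
    chosen = set()
    for i in range(k):
        top = n - k + i
        pos = rng.randint(0, top)
        chosen.add(top if pos in chosen else pos)
    return [neighbors[p] for p in sorted(chosen)]


def sample_subgraph(adj, seeds, fanouts, rng):
    """Return one list of sampled (src, dst) edges per hop."""
    layers, frontier = [], list(seeds)
    for fanout in fanouts:
        edges = [(u, v)
                 for u in frontier
                 for v in sample_neighbors(adj.get(u, []), fanout, rng)]
        layers.append(edges)
        # The sampled neighbors become the sources of the next hop.
        frontier = sorted({v for _, v in edges})
    return layers


if __name__ == "__main__":
    # Hypothetical toy graph: node -> list of neighbors.
    adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2, 4], 4: [0, 3]}
    for hop, edges in enumerate(sample_subgraph(adj, [0, 1], [2, 2], random.Random(1))):
        print(hop, edges)
```

In a training loop, the edge lists returned for each hop would define the sampled subgraph that is fed to the model for that batch.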

[0058] Sampling optimization: The sampling fanout size is dynamically adjusted based on the available GPU memory. GPU parallel sampling is used to accelerate the neighbor sampling process; Batch processing technology can be used to further improve sampling efficiency.

[0059] Experimental results: Experiments on real datasets (such as Reddit, Yelp, Amazon, etc.) show that: compared to traditional random sampling methods, training speed is increased by 3-5 times; the accuracy of the graph neural network model remains unchanged, indicating that the sampling process does not affect model performance; and efficient sampling is maintained on ultra-large-scale graph datasets (tens of millions of nodes).

[0060] The embodiments described above are merely illustrative of several implementations of the present invention, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these modifications and improvements all fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent should be determined by the appended claims.

Claims

1. A parallel reverse shuffling sampling method, characterized in that, It includes the following steps: Sample k elements from a set containing n elements, where 0 ≤ k ≤ n; Perform k reverse shuffling operations in the reverse order of the shuffling operation; For the i-th reverse shuffling operation, randomly select a position within the range of the deck size of the (k - i - 1)-th shuffling operation; Determine whether the element at the selected position is already included in the sampling set: If the element has not been sampled, add the element to the sampling set; If the element has been sampled, add the element at the top of the current deck to the sampling set; Repeat the above steps until k reverse shuffling operations are completed to obtain the final sampling set.

2. The parallel reverse shuffling sampling according to claim 1, characterized in that, The time complexity of the method is O(k·log₂k), where k is the sampling quantity.

3. A GPU parallel reverse shuffling sampling method, characterized in that, It includes the following steps: Assign the sampling task to the warp processing of the GPU, and the sampling task of each node is executed by one warp; Assign k reverse shuffling operations to be executed in parallel by warp_size threads, and assign an ID to each operation; Generate k random positions in parallel in each thread; Perform conflict marking: For the operation with ID i, if there exists an operation ID j < i that selects the same position, mark operation i as a conflict; Establish a dependency relationship: For the operation i that selects the position xi, if xi ≥ n - k and the operation is not marked as a conflict, establish a directed edge from operation (xi - n + k) to operation i to represent the dependency relationship; Perform marking propagation: Starting from the marked operation, propagate the mark along the dependency edge until all affected operations are correctly marked; Determine the sampling element according to the marking status: For the marked operation, sample the element corresponding to the position (n - k + i); for the unmarked operation, sample the element corresponding to the random position; Add the sampled elements to the sampling set to obtain the final result.

4. The GPU parallel reverse shuffling sampling according to claim 3, characterized in that, The time complexity of the method is O(k·log₂k / warp_size), where k is the number of samples and warp_size is the thread bundle size.

5. The GPU parallel reverse shuffling sampling according to claim 3, characterized in that, The marking propagation strategy includes: Infer whether an element is sampled in the current reverse shuffling operation by propagating the conflict mark between operations, rather than directly relying on the complete sampling set; When multiple operations select the same position, the earlier operation will sample the element at that position first, and the subsequent operations are marked as conflicts; When the position selected by an operation is the top element of another operation's deck and the top element is sampled by an earlier operation, establish a dependency relationship between operations and propagate the mark.

6. A general graph neighbor sampling framework, characterized in that, It includes: Support the probability-based neighbor sampling strategy; Support the fanout-based neighbor sampling strategy; Both strategies are efficiently implemented using the reverse shuffling sampling method described in claim 1 or 2; Support GPU parallel execution, and use the GPU parallel reverse shuffling sampling described in any one of claims 3 - 5.

7. The general graph neighbor sampling framework according to claim 6, characterized in that, The framework is applicable to the following scenarios: Social network analysis: Sample user neighbor nodes from the social network graph; Recommendation system: Sample the historical interaction neighbors of users in the user-item bipartite graph; Bioinformatics: Sample node neighbors from the protein interaction network and gene regulatory network; Knowledge graph: Sample entity neighbors from the large-scale knowledge graph.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the method described in any one of claims 1 - 5.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the program is executed by the processor, it implements the method as described in any one of claims 1-5.