Graph data sampling, graph neural network training methods and systems, equipment and media
By segmenting and distributing graph data, and utilizing the resources of multiple node devices for distributed sampling, the problem of low sampling efficiency in graph neural network training is solved, achieving efficient processing of graph data sampling and improving the training efficiency of graph neural networks.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ALIBABA (CHINA) CO LTD
- Filing Date
- 2023-02-27
- Publication Date
- 2026-06-30
Figure CN116306867B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of data processing technology, and specifically to a graph data sampling method, a graph neural network training method, and a corresponding system, device, and medium.
Background Technology
[0002] Graph data is a data structure that describes entities and the relationships between them, and can be used to model data relationships in many application scenarios. With the development of neural network technology, Graph Neural Networks (GNNs) for processing graph data have emerged; GNNs can be regarded as neural network models for processing graph data, and they are widely used in applications such as data recommendation, security risk control, and drug molecule prediction.
[0003] When training a graph neural network, it is necessary to sample the graph data used as training data. Therefore, how to improve the sampling efficiency of graph data, and thus the training efficiency of graph neural networks, has become a technical problem to be solved by those skilled in the art.
Summary of the Invention
[0004] In view of this, embodiments of this application provide a graph data sampling method, a graph neural network training method, and a corresponding system, device, and medium, to improve the sampling efficiency of graph data and thereby improve the training efficiency of graph neural networks.
[0005] To achieve the above objectives, the embodiments of this application provide the following technical solutions.
[0006] In a first aspect, embodiments of this application provide a graph data sampling method applied to a first node device, the method comprising:
[0007] Obtain a sampling task and determine multiple objects to be sampled corresponding to the sampling task;
[0008] For any object to be sampled, a target data slice for storing the object is determined according to a preset allocation relationship; the allocation relationship records at least the data slices allocated to the segmented graph data, wherein the segmented graph data is allocated to multiple data slices for storage, and the multiple data slices are stored on multiple node devices;
[0009] If the target data slice is stored in the first node device, the resources of the first node device are used to perform a sampling task on the object to be sampled in order to obtain the sampling result of the object to be sampled;
[0010] If the target data slice is stored in the second node device, the resources of the second node device are invoked to perform a sampling task on the object to be sampled, so as to obtain the sampling result of the object to be sampled;
[0011] Based on the sampling results of each object to be sampled, the sampling results of the sampling task are obtained.
[0012] In a second aspect, embodiments of this application provide a graph neural network training method, including:
[0013] Obtain the sampling results of the graph data; the sampling results of the graph data are determined based on the graph data sampling method described in the first aspect above.
[0014] Train a graph neural network based on the sampling results of the graph data.
[0015] In a third aspect, embodiments of this application provide a graph neural network training system, including:
[0016] A storage layer, used to implement the segmentation of graph data and the distributed storage of data slices on the corresponding node devices;
[0017] A graph operator layer, which provides operators for the CPU and operators for the GPU;
[0018] An interface layer and a distributed sampling layer, which at least provide an interface to a sampler, the sampler being configured to perform the graph data sampling method described in the first aspect above;
[0019] A model layer, used to support the training of the graph neural network.
[0020] In a fourth aspect, embodiments of this application provide a node device, including at least one memory and at least one processor. The memory stores one or more computer-executable instructions, and the processor invokes the one or more computer-executable instructions to execute the graph data sampling method described in the first aspect above, or the graph neural network training method described in the second aspect above.
[0021] In a fifth aspect, embodiments of this application provide a storage medium that stores one or more computer-executable instructions. When executed, the one or more computer-executable instructions implement the graph data sampling method described in the first aspect above, or the graph neural network training method described in the second aspect above.
[0022] In a sixth aspect, embodiments of this application provide a computer program that, when executed, implements the graph data sampling method as described in the first aspect above, or the graph neural network training method as described in the second aspect above.
[0023] The graph data sampling method provided in this application embodiment can be implemented based on graph data segmentation and distributed storage. The segmented graph data can be allocated to multiple data slices for storage, and these data slices are stored on multiple node devices. The data slices allocated to the segmented graph data can be recorded through allocation relationships. Therefore, when sampling graph data, a first node device can acquire a sampling task and determine multiple objects to be sampled corresponding to the sampling task. For any object to be sampled, the first node device can determine the target data slice for storing the object to be sampled according to a preset allocation relationship. If the target data slice is stored on the first node device, this application embodiment can use the resources of the first node device to perform the sampling task on the object to be sampled to obtain the sampling result of the object to be sampled. If the target data slice is stored on a second node device, this application embodiment can call the resources of the second node device to perform the sampling task on the object to be sampled to obtain the sampling result of the object to be sampled. Furthermore, the first node device can obtain the sampling result of the sampling task based on the sampling results of each object to be sampled.
[0024] As can be seen, after the segmented graph data is allocated to multiple data slices and these slices are stored on multiple node devices, each node device, when processing a sampling task, performs the sampling task only on the objects to be sampled that are stored in its local data slices. For objects to be sampled that are not stored in a local data slice, the node device has the sampling task executed by calling the other node devices that hold the corresponding non-local data slices. Therefore, the sampling task can be executed for the multiple objects to be sampled asynchronously and in parallel on multiple node devices, efficiently utilizing the resources of those node devices. This achieves reasonable resource allocation and load balancing among multiple node devices and improves the sampling efficiency of graph data. The graph data sampling method provided in this application embodiment can therefore significantly improve the sampling efficiency of graph data, providing a foundation for improving the training efficiency of graph neural networks.
Attached Figure Description
[0025] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.
[0026] Figure 1A is an example diagram of the structure of graph data.
[0027] Figure 1B is an example diagram of graph data.
[0028] Figure 2A is an example diagram of the training process of a batch-based graph neural network.
[0029] Figure 2B is an example diagram of sampled subgraphs.
[0030] Figure 3A is an example diagram of the graph data sampling process provided in an embodiment of this application.
[0031] Figure 3B is an example diagram of an implementation of graph data sampling provided in an embodiment of this application.
[0032] Figure 4A is a flowchart of a graph data segmentation method provided in an embodiment of this application.
[0033] Figure 4B is a flowchart of a distributed storage method provided in an embodiment of this application.
[0034] Figure 4C is a flowchart of a method for determining hot data features and cold data features provided in an embodiment of this application.
[0035] Figure 4D is an example diagram of a distributed storage implementation.
[0036] Figure 5 is a flowchart of a graph data sampling method provided in an embodiment of this application.
[0037] Figure 6A is a flowchart of a subgraph sampling method provided in an embodiment of this application.
[0038] Figure 6B is an example diagram of a subgraph sampling implementation.
[0039] Figure 6C is a flowchart of a subgraph feature sampling method provided in an embodiment of this application.
[0040] Figure 7A is a flowchart of a graph neural network training method provided in an embodiment of this application.
[0041] Figure 7B is a block diagram of the architecture of a graph neural network training system provided in an embodiment of this application.
[0042] Figure 8A is an example diagram of process deployment provided in an embodiment of this application.
[0043] Figure 8B is another example diagram of process deployment provided in an embodiment of this application.
Detailed Implementation
[0044] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0045] Graph data is a data structure that represents entities and the relationships between them. It can be used to represent data in many application scenarios, such as social networks, data recommendation systems, and transportation networks. In graph data, nodes represent entities, and edges (directed or undirected) between nodes represent relationships between entities. Nodes and edges can also have features, which can be vectors of integers and floating-point numbers. Therefore, in the example structure of graph data, the graph data can include nodes, edges, node features, and edge features.
[0046] For ease of understanding, take three nodes as an example. Figure 1A shows an example diagram of the structure of graph data. As shown in Figure 1A, nodes 1, 2, and 3 in the graph data can respectively represent three entities, and the edges connecting the nodes can represent the relationships between the entities; for example, edge 11 represents the relationship between the entity represented by node 1 and the entity represented by node 2, edge 12 represents the relationship between the entity represented by node 2 and the entity represented by node 3, and edge 13 represents the relationship between the entity represented by node 1 and the entity represented by node 3. Each node and each edge can have features.
[0047] In one example, Figure 1B shows an example diagram of graph data, taking three kinds of entities as an example: people, houses, and land. In the graph data, people, houses, and land can be represented by nodes. The ownership relationship between people and houses is represented by edges connecting the corresponding nodes, the inclusion relationship between land and houses is represented by edges connecting the corresponding nodes, and the use relationship between people and land is represented by edges connecting the corresponding nodes.
[0048] It should be noted that in real-world applications, graph data can be quite large. For example, graph data often involves a large number of nodes and edges, with diverse node and edge types and rich features. In applications such as data recommendation and security risk control, the number of edges can reach billions or even tens of billions, and the number of nodes can range from hundreds of millions to billions, with a wealth of features. Therefore, the specific size of the graph data needs to be determined based on the actual application scenario. Figures 1A and 1B are merely brief illustrations of graph data and should not limit the specific structure or size of graph data.
[0049] It should also be noted that, for two nodes connected by a directed edge, the node from which the directed edge starts can be called the source node, and the node to which the directed edge points can be called the destination node. For example, in the example of Figure 1A, node 1 points to node 2 through edge 11 (edge 11 is a directed edge), so node 1 is the source node and node 2 is the destination node.
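The graph data structure described above (nodes, directed edges, node features, and edge features) can be illustrated with a minimal Python sketch. The class name `Graph`, its methods, and the feature values are hypothetical and chosen only for illustration. The direction of edge 11 (node 1 to node 2) follows the text; the directions assumed for edges 12 and 13 are not stated in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Hypothetical in-memory graph: nodes, directed edges, and their features."""

    # node id -> feature vector
    node_features: dict = field(default_factory=dict)
    # edge id -> (source node id, destination node id)
    edges: dict = field(default_factory=dict)
    # edge id -> feature vector
    edge_features: dict = field(default_factory=dict)

    def add_node(self, node_id, features):
        self.node_features[node_id] = features

    def add_edge(self, edge_id, src, dst, features):
        self.edges[edge_id] = (src, dst)
        self.edge_features[edge_id] = features

    def destinations_of(self, node_id):
        # Destination nodes reachable from node_id over one outgoing edge.
        return sorted(dst for src, dst in self.edges.values() if src == node_id)


graph = Graph()
for node_id in (1, 2, 3):
    graph.add_node(node_id, [float(node_id)])  # placeholder feature values
graph.add_edge(11, 1, 2, [0.1])  # direction given in the text
graph.add_edge(12, 2, 3, [0.2])  # direction assumed for illustration
graph.add_edge(13, 1, 3, [0.3])  # direction assumed for illustration
```

Here `graph.edges[11]` records node 1 as the source node and node 2 as the destination node of edge 11.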
[0050] Graph learning is the application of deep learning to graph data. Graph learning techniques, represented by Graph Neural Networks (GNNs), have wide applications in scenarios such as data recommendation, security risk control, and molecular prediction. Training methods for graph neural networks can be divided into full-graph-based training and batch-based training. For applications with large-scale graph data (such as data recommendation and security risk control), batch-based training can be used to train the graph neural network. Figure 2A shows an example diagram of the training process of a batch-based graph neural network. As shown in Figure 2A, in the batch-based training method, the training process of the graph neural network can be divided into a sampling phase 210 and a training phase 220.
[0051] Sampling phase 210 mainly involves sampling the graph data used as training data to obtain multiple subgraphs with features. Sampling phase 210 can be further subdivided into subgraph sampling phase 211 and subgraph feature sampling phase 212.
[0052] The subgraph sampling phase 211 is mainly used to sample and obtain multiple subgraphs, which can be composed of nodes and their neighboring nodes. In the subgraph sampling phase 211, a subgraph composed of nodes and their neighboring nodes can be sampled by specifying a node or edge to be sampled. Optionally, in the subgraph sampling phase 211, for a specified node or edge to be sampled, a subgraph can be formed by sampling the neighboring nodes of the node to be sampled, or by sampling the nodes connected to the edge to be sampled and their neighboring nodes. For example, for a node to be sampled, its neighboring nodes can be sampled, thus forming a subgraph composed of the node to be sampled and its neighboring nodes. As another example, for an edge to be sampled, the two nodes connected to the edge to be sampled and their neighboring nodes can be sampled, thus forming a subgraph composed of the two nodes connected to the edge to be sampled and their neighboring nodes.
[0053] For ease of understanding, Figure 2B shows an example diagram of sampled subgraphs. As shown in Figure 2B, assuming the nodes to be sampled are nodes 1, 2, 3, and 4, then the neighboring nodes 5 and 6 of node 1, the neighboring nodes 7 and 9 of node 2, the neighboring nodes 6 and 8 of node 3, and the neighboring node 7 of node 4 can be sampled. This results in a subgraph formed by each node to be sampled and its neighboring nodes. It should be noted that in batch-based graph data sampling, multiple subgraphs can be obtained by specifying multiple nodes to be sampled (multiple nodes can be specified at once).
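The subgraph sampling in this example can be sketched as follows. This is a minimal illustration, not the method of the embodiments: the function name `sample_subgraph` and the `fanout` cap on the number of sampled neighbors are assumptions introduced here, and the adjacency lists simply restate the neighbors named in the Figure 2B example.

```python
import random


def sample_subgraph(adjacency, seed, fanout, rng):
    """Sample up to `fanout` neighbors of `seed` and return the resulting subgraph."""
    neighbors = list(adjacency.get(seed, []))
    if len(neighbors) > fanout:
        # Only subsample when the node has more neighbors than the cap.
        neighbors = rng.sample(neighbors, fanout)
    return {
        "seed": seed,
        "nodes": [seed, *neighbors],
        "edges": [(seed, n) for n in neighbors],
    }


# Neighbors as listed in the Figure 2B example.
adjacency = {1: [5, 6], 2: [7, 9], 3: [6, 8], 4: [7]}
rng = random.Random(0)

# One batch: several nodes to be sampled are specified at once.
subgraphs = [sample_subgraph(adjacency, seed, fanout=2, rng=rng) for seed in (1, 2, 3, 4)]
```

With a fanout of 2, no node in this example exceeds the cap, so each subgraph contains the node to be sampled and all of its listed neighbors.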
[0054] The subgraph feature sampling stage 212 mainly searches for the features of each subgraph (for example, searching for the features of each node or each edge in each subgraph) to obtain a subgraph with features.
[0055] The training phase 220 primarily involves learning and training a subgraph with features using a graph neural network. For example, the embedding representations of nodes or edges in the subgraph are iteratively updated through message passing, ultimately using embedding vectors to represent the information of nodes or edges.
[0056] In the training process of a graph neural network, the computational cost of the training phase 220 is smaller than that of the sampling phase 210, and the bottleneck in the training of the graph neural network is mainly concentrated in the sampling phase 210. Therefore, improving the efficiency of the sampling phase 210 is the key to improving the training efficiency of the graph neural network.
[0057] Based on this, embodiments of this application provide an improved graph data sampling scheme. By employing a distributed sampling method, the sampling objects (nodes or edges to be sampled) are sampled in a distributed manner, thereby improving the sampling efficiency of graph data. Optionally, the distributed sampling method can be implemented using distributed samplers, which refer to multiple samplers distributed and capable of graph data sampling. Embodiments of this application can set up multiple distributed samplers with graph data sampling capabilities (multiple distributed samplers can run on multiple node devices, and one node device can run at least one sampler) to perform distributed sampling of the sampling objects (nodes or edges to be sampled), thereby improving the sampling efficiency of graph data. In embodiments of this application, the distributed samplers can be used to perform subgraph sampling and / or subgraph feature sampling of the sampling objects (nodes or edges to be sampled).
[0058] As an optional implementation, given the large volume of graph data, a single node device may not be able to store and accommodate it. Therefore, embodiments of this application can divide the graph data into multiple data slices and store these data slices across multiple node devices. The node devices referred to here can be electronic devices with data processing capabilities, such as terminal devices or server devices. Optionally, embodiments of this application can perform distributed sampling of the graph data by utilizing samplers running on multiple node devices, based on the graph data segmentation and distributed storage. For example, before sampling the graph data, embodiments of this application can segment the graph data, distributing the segmented graph data across multiple data slices. These multiple data slices can be stored on multiple node devices, and each node device can run at least one sampler (e.g., one node device runs one or more samplers). Thus, the samplers running on multiple node devices can achieve graph data sampling in a distributed manner.
[0059] Optionally, Figure 3A shows an example diagram of the graph data sampling process provided in an embodiment of this application. As shown in Figure 3A, the process may include a graph data segmentation stage 31, a distributed storage stage 32, and a sampling stage 210. The graph data segmentation stage 31 mainly segments the graph data, distributing the segmented graph data across multiple data slices. The distributed storage stage 32 mainly performs distributed storage of the data slices on the node devices. The sampling stage 210 mainly uses distributed samplers to perform subgraph sampling and subgraph feature sampling based on the data slices and the nodes or edges to be sampled.
[0060] For ease of understanding, take the case where graph data is divided into two data slices as an example. Figure 3B shows an example diagram of an optional implementation of graph data sampling provided in an embodiment of this application. As shown in Figure 3B, since a single node device cannot store and accommodate the graph data, the graph data can be divided into data slices 301 and 302. Data slice 301 can be stored in node device 311, and data slice 302 can be stored in node device 312. At the same time, node device 311 can run multiple samplers 321 to 32n (n is the number of samplers run by node device 311), and node device 312 can also run multiple samplers 331 to 33m (m is the number of samplers run by node device 312). n and m can be the same or different.
[0061] When performing graph data sampling, if a sampler receives a sampling task, it can determine the target data slice where the node or edge to be sampled is located for any node or edge to be sampled indicated by the sampling task (the sampling task can indicate multiple nodes or edges to be sampled). If the node device corresponding to the target data slice and the node device corresponding to the sampler are the same node device, the sampler can perform subgraph sampling and subgraph feature sampling on the node or edge to be sampled using local sampling methods, such as using local resources to perform subgraph sampling and subgraph feature sampling on the node or edge to be sampled. If the node device corresponding to the target data slice and the node device corresponding to the sampler are not the same node device, the sampler can sample the node or edge to be sampled using remote sampling methods, such as passing the node or edge to be sampled to the node device corresponding to the target data slice, and the sampler running on the node device corresponding to the target data slice can perform subgraph sampling and subgraph feature sampling on the node or edge to be sampled.
[0062] In one example, with reference to Figure 3B, assume that sampler 321 running on node device 311 receives a sampling task. For any node to be sampled indicated by the sampling task, if the data slice corresponding to the node to be sampled is data slice 301, then since data slice 301 is stored on node device 311 and sampler 321 also runs on node device 311, sampler 321 can call the local resources of node device 311 to perform subgraph sampling and subgraph feature sampling on the node to be sampled. If the data slice corresponding to the node to be sampled is data slice 302, then since data slice 302 is stored on node device 312 and sampler 321 runs on node device 311, sampler 321 needs to transmit the node to be sampled to node device 312 through remote sampling, so that the sampler running on node device 312 can call the resources of node device 312 to perform subgraph sampling and subgraph feature sampling on the node to be sampled.
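The local versus remote dispatch in this example can be sketched in Python as follows. This is a simplified, single-process illustration under stated assumptions: the names `NodeDevice`, `remote_sample`, and `run_sampling_task` are hypothetical, the remote call is simulated by a direct function call rather than a real network request, and a thread pool stands in for asynchronous parallel execution across devices.

```python
from concurrent.futures import ThreadPoolExecutor


class NodeDevice:
    """Hypothetical node device holding one or more data slices."""

    def __init__(self, name, slices):
        self.name = name
        # slice id -> {node id: list of neighbor ids}
        self.slices = slices

    def sample(self, slice_id, node_id):
        # Sampling runs on the device that stores the slice.
        neighbors = self.slices[slice_id].get(node_id, [])
        return {"node": node_id, "neighbors": list(neighbors), "sampled_on": self.name}


def remote_sample(device, slice_id, node_id):
    # Stand-in for a remote call; a real system would use a network request here.
    return device.sample(slice_id, node_id)


def run_sampling_task(local, devices, node_to_slice, slice_to_device, nodes):
    """Route each node to the device that stores its slice and collect the results."""

    def sample_one(node_id):
        slice_id = node_to_slice[node_id]
        owner = slice_to_device[slice_id]
        if owner == local.name:
            return local.sample(slice_id, node_id)  # local sampling
        return remote_sample(devices[owner], slice_id, node_id)  # remote sampling

    # pool.map keeps the results in the same order as `nodes`.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(sample_one, nodes))


device_311 = NodeDevice("311", {301: {1: [5, 6]}})
device_312 = NodeDevice("312", {302: {2: [7, 9]}})
results = run_sampling_task(
    local=device_311,
    devices={"311": device_311, "312": device_312},
    node_to_slice={1: 301, 2: 302},
    slice_to_device={301: "311", 302: "312"},
    nodes=[1, 2],
)
```

In this sketch, node 1 is sampled with the resources of node device 311, while node 2 is forwarded to node device 312, and the task result is the combination of both per-node results.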
[0063] Based on the above ideas, the optional implementations of the graph data segmentation stage 31, the distributed storage stage 32, and the sampling stage 210 provided in the embodiments of this application will be described below.
[0064] As an optional implementation, graph data can include nodes, edges, node features, and edge features. Therefore, partitioning graph data and distributing it into multiple data slices can be viewed as distributing the nodes, edges, node features, and edge features of the graph data into multiple data slices; the nodes, edges, node features, and edge features can be stored in the allocated data slices. Optionally, Figure 4A shows an exemplary flowchart of a graph data segmentation method provided in an embodiment of this application. The graph data segmentation stage 31 described above can be implemented through the method flow shown in Figure 4A. Referring to Figure 4A, the method flow may include the following steps.
[0065] In step S410, for a node in the graph data, a data slice to be assigned to the node is determined from multiple data slices, wherein the multiple data slices are stored in multiple node devices, and a node device stores at least one data slice.
[0066] This application embodiment can set up multiple data slices for storing data, and the multiple data slices are stored on multiple node devices, with each node device storing at least one data slice. Optionally, the number of data slices stored on each node device can be the same or different, and the amount of data stored in each data slice can be the same or different, depending on the actual situation. This application embodiment does not set any limitations. After setting up multiple data slices, this application embodiment can allocate data slices to each node in the graph data, so that the node is stored in the allocated data slice; here, allocating data slices to nodes means allocating data slices to nodes to store node data.
[0067] As an optional implementation, for nodes in graph data, this embodiment can determine the data slice assigned to a node from multiple data slices based on the node's identifier and the number of data slices. In graph data, nodes and edges can have identifiers. Node identifiers can be used to distinguish different nodes, and edge identifiers can be used to distinguish different edges. The identifier can be in ID form, and the identifier value can be an integer. In the implementation example, for a node, this embodiment can perform a hash operation based on the node's identifier and the number of data slices, and use the result of the hash operation as the sequence number of the data slice assigned to the node (i.e., the sequence number of the data slice assigned to the node corresponds to the result of the hash operation) to determine the data slice assigned to the node. Taking the hash operation as a modulo operation as an example, this embodiment can use the node's identifier to perform a modulo operation on the number of data slices, and then use the result of the modulo operation as the sequence number of the data slice assigned to the node.
[0068] In a specific example, for a node, assuming the node's identifier is 15 (the node's identifier can be an integer), and the number of data slices is 4 (the corresponding data slice numbers are 0, 1, 2, and 3), then the node's identifier (15) can be used to perform a modulo operation on the number of data slices (4), and the modulo operation result is 3 (15%4=3); thus, the node is assigned a data slice with the number 3, that is, the node can be assigned a data slice with the number 3 for storage.
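The modulo-based assignment in this example can be written as a short Python sketch. The function name `assign_slice` is hypothetical, and the sketch assumes non-negative integer node identifiers.

```python
def assign_slice(node_id: int, num_slices: int) -> int:
    """Return the sequence number of the data slice assigned to a node."""
    if num_slices <= 0:
        raise ValueError("num_slices must be positive")
    # Modulo is used as the hash operation, as in the example above.
    return node_id % num_slices


slice_number = assign_slice(15, 4)
```

For the node with identifier 15 and 4 data slices, `slice_number` is 3, matching the example.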
[0069] In an optional implementation, step S410 may be to allocate a data slice to the source node of the graph data; for example, for the source node of the graph data, the embodiments of this application may determine the data slice to be allocated to the source node from multiple data slices according to the identifier of the source node and the number of data slices; the optional implementation process can be referred to the description of the corresponding part above.
[0070] In step S411, for an edge in the graph data, a data slice is assigned to the edge according to the data slice in which the source node of the edge is located.
[0071] When allocating data slices for edges, embodiments of this application can assign edges to the data slices allocated to the source node of the edge, that is, the edges are allocated according to the data slices where their source nodes are located. Here, allocating data slices for edges means allocating data slices to store edge data for the edges.
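A minimal sketch of this edge assignment rule is shown below. It assumes, for illustration only, that source nodes are assigned to slices with the modulo rule described earlier; the function name `assign_edge_slices` and the edge directions for edges 12 and 13 are assumptions.

```python
def assign_edge_slices(edges, num_slices):
    """Assign each edge to the data slice of its source node.

    `edges` maps edge id -> (source node id, destination node id).
    """
    return {
        edge_id: src % num_slices  # slice of the source node (modulo rule assumed)
        for edge_id, (src, _dst) in edges.items()
    }


edges = {11: (1, 2), 12: (2, 3), 13: (1, 3)}
edge_slices = assign_edge_slices(edges, num_slices=2)
```

Edges 11 and 13 share source node 1 and therefore land in the same slice, so all outgoing edges of a node can be read from the slice that stores that node.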
[0072] In step S412, for features in the graph data, data slices for feature assignment are determined from multiple data slices.
[0073] Features in graph data can be the features of nodes and edges in the graph data. In optional implementations, for features in graph data, embodiments of this application can allocate data slices to features according to a random allocation method or an allocation method based on the sampling probability. Optionally, random allocation means randomly assigning features to different data slices; for example, for features in graph data, embodiments of this application can randomly determine data slices from multiple data slices and allocate the features to the randomly determined data slices for storage.
[0074] Optionally, the allocation method based on sampling probability refers to: sorting nodes according to their sampling probability (either in ascending or descending order); then, according to the sorting order, sequentially allocating the features associated with each node (the associated features may be node features or features corresponding to the edges of the destination node) to each data slice in a cyclical manner. Since the number of nodes is generally greater than the number of data slices, the allocation of node-associated features to each data slice according to the sorting order is done in a cyclical manner; that is, when allocating node-associated features to data slices according to the sorting order, if a node-associated feature is allocated to the last data slice, the process returns to the first data slice and continues to allocate data slices to the remaining node-associated features according to the sorting order, until all features have been allocated to data slices. In one implementation example, assuming there are 5 nodes and 4 data slices (corresponding data slice numbers 0, 1, 2, and 3), and after sorting based on the sampling probability of the nodes, the sorting order is (node 2, node 3, node 1, node 4, and node 5); then according to the sorting order, the features associated with node 2 are assigned to data slice 0, the features associated with node 3 are assigned to data slice 1, the features associated with node 1 are assigned to data slice 2, the features associated with node 4 are assigned to data slice 3, and then the loop returns to data slice 0, assigning the features associated with node 5 to data slice 0.
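The cyclical allocation in this example can be sketched as follows. The function name `assign_features_round_robin` is hypothetical, and the sampling probability values are invented solely to reproduce the sorting order (node 2, node 3, node 1, node 4, node 5) used in the example.

```python
def assign_features_round_robin(sampling_prob, num_slices, descending=True):
    """Sort nodes by sampling probability, then deal their features to slices in turn."""
    order = sorted(sampling_prob, key=sampling_prob.get, reverse=descending)
    # rank % num_slices wraps back to slice 0 after the last slice.
    return {node: rank % num_slices for rank, node in enumerate(order)}


# Illustrative probabilities only; they yield the order 2, 3, 1, 4, 5.
sampling_prob = {1: 0.5, 2: 0.9, 3: 0.7, 4: 0.3, 5: 0.1}
feature_slices = assign_features_round_robin(sampling_prob, num_slices=4)
```

The features associated with nodes 2, 3, 1, and 4 go to data slices 0, 1, 2, and 3 respectively, and the features associated with node 5 wrap around to data slice 0.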
[0075] In step S413, the allocation relationship between the nodes, edges, and features of the graph data and the data slices is recorded.
[0076] After assigning nodes, edges, and features (node features and edge features) of graph data to data slices, embodiments of this application can record the data slices assigned to nodes, edges, and features, thereby recording the allocation relationship between nodes, edges, features, and data slices. That is, the allocation relationship records the data slice assigned to any node, the data slice assigned to any edge, and the data slice assigned to any feature (features of any node and features of any edge).
[0077] In an optional implementation, embodiments of this application can record the above allocation relationships through a table. For example, a segmentation routing table can be set up to record the allocation relationships. For instance, the segmentation routing table can record the allocation relationships between nodes, edges, and features, and data slices. The allocation relationships can be stored on each node device, or they can be stored in a storage device that can be read by each node device.
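One possible shape for such a segmentation routing table is sketched below. The class name `SegmentationRoutingTable`, its fields, and the `device_for_node` helper are hypothetical. In particular, the mapping from data slice to node device is an assumption added for illustration, since the text only states that the table records the allocation of nodes, edges, and features to data slices.

```python
from dataclasses import dataclass, field


@dataclass
class SegmentationRoutingTable:
    """Hypothetical routing table recording which data slice holds each item."""

    node_slice: dict = field(default_factory=dict)     # node id -> slice number
    edge_slice: dict = field(default_factory=dict)     # edge id -> slice number
    feature_slice: dict = field(default_factory=dict)  # node or edge id -> slice number
    slice_device: dict = field(default_factory=dict)   # slice number -> device name (assumed)

    def device_for_node(self, node_id):
        # Resolve node -> slice -> device to decide between local and remote sampling.
        return self.slice_device[self.node_slice[node_id]]


table = SegmentationRoutingTable(
    node_slice={1: 0, 2: 1},
    edge_slice={11: 0},
    feature_slice={1: 1, 2: 0},
    slice_device={0: "device-a", 1: "device-b"},
)
```

A sampler could consult such a table to find the target data slice of an object to be sampled before choosing between local and remote sampling.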
[0078] After implementing graph data segmentation and storing the segmented graph data into multiple data slices, this embodiment of the application allows for distributed storage of the data slices across various node devices. That is, based on the allocated nodes, edges, and features stored in a data slice, the node devices storing that data slice can perform distributed storage of the nodes, edges, and features stored in that data slice. Optionally, Figure 4B shows an exemplary flowchart of a distributed storage method provided in an embodiment of this application. The distributed storage stage 32 described above can be implemented through the method flow shown in Figure 4B. Referring to Figure 4B, the method flow may include the following steps.
[0079] In step S420, the nodes and edges stored in the data slice are stored in the CPU paged memory or GPU memory of the node device.
[0080] For the nodes and edges stored in a data slice, the node device storing the data slice can store the nodes and edges in GPU memory or CPU paged memory. Optionally, the node device can choose to store the nodes and edges in CPU paged memory or GPU memory according to the actual storage resources of the node device. For example, if the node device's GPU memory is sufficient to store the nodes and edges of the data slice, then the nodes and edges of the data slice can be stored in GPU memory; if the node device's GPU memory is insufficient to store the nodes and edges of the data slice, then the nodes and edges of the data slice can be stored in CPU paged memory.
[0081] CPU paged memory refers to CPU memory that the operating system has pinned (pinned pages), so that the GPU can access the CPU memory directly and excessive copy operations are avoided. The operating system marks this memory as non-swappable, which is what allows the GPU to access it directly.
[0082] In step S421, for the features stored in the data slice, hot data features and cold data features are determined. The hot data features are stored in the GPU memory of the node device, and the cold data features are stored in the CPU paged memory of the node device.
[0083] In an optional implementation, the CPU memory and GPU memory of the node device can be managed uniformly using a Unified Tensor, thereby reducing data transfer and copying between the GPU and CPU. Features stored in a data slice can then be stored using a Unified Tensor. When storing features in a data slice, this embodiment can distinguish between hot and cold data features, storing hot data features in the node device's GPU memory and cold data features in the node device's CPU paged memory. Optionally, if the node device's GPU group is connected via NVLink, hot data features can be evenly stored in each GPU memory segment of the NVLink-connected GPU group. It should be noted that NVLink is a bus and its communication protocol; NVLink uses a point-to-point structure and serial transmission for connections between the CPU and GPU, and can also be used for interconnections between multiple GPUs.
[0084] In an optional implementation, embodiments of this application can determine whether a feature is a hot data feature or a cold data feature based on the in-degree or sampling probability of the nodes associated with the feature. The in-degree or sampling probability of the nodes associated with hot data features is higher than that of the nodes associated with cold data features. As an optional implementation, Figure 4C shows an exemplary flowchart of a method for determining hot data features and cold data features provided in an embodiment of this application. Referring to Figure 4C, the method flow may include the following steps.
[0085] In step S430, for the features stored in the data slice, the nodes associated with the features are determined; the nodes associated with the features are sorted according to the in-degree or the sampling probability of the nodes.
[0086] Optionally, when the feature is a node feature, the node associated with the feature is the node corresponding to the feature; when the feature is an edge feature, the node associated with the feature is the node connected by the edge corresponding to the feature (e.g., the destination node connected by the edge corresponding to the feature). For each feature stored in the data slice, the embodiments of this application can determine the nodes associated with each feature, and then sort the nodes associated with each feature according to the in-degree of the node or the sampling probability (e.g., ascending order or descending order).
[0087] In an optional implementation, the in-degree of a node is the total number of edges pointing to that node. When a directed edge points to a destination node, the in-degree of a node can be considered as the total number of edges connected to that destination node. The sampling probability of a node refers to the probability that the node will be sampled as a neighbor node; it should be noted that during the sampling of a node's neighbors, each node is visited a different number of times, exhibiting a probability distribution, with some nodes having a higher probability of being sampled and others a lower probability.
[0088] In step S431, hot data features and cold data features are determined based on the node sorting.
[0089] After sorting the nodes associated with features according to their in-degree or sampling probability, based on the fact that the in-degree or sampling probability of nodes associated with hot data features is higher than that of nodes associated with cold data features, this embodiment of the application can determine hot data features and cold data features from the features stored in the data slice according to the node sorting. Optionally, this embodiment of the application can set a sorting threshold. When sorting in ascending order according to the in-degree or sampling probability of nodes, this embodiment of the application can determine the features associated with nodes whose sorting order is higher than the sorting threshold as hot data features; and determine the features associated with nodes whose sorting order is not higher than the sorting threshold as cold data features. The sorting threshold can be a threshold for sorting order (in integer form) or a threshold for sorting proportion.
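Steps S430 and S431 can be sketched as follows, assuming an ascending sort by in-degree and an integer sorting threshold. The function name `split_hot_cold` and the in-degree values are hypothetical; sorting by sampling probability would work the same way.

```python
def split_hot_cold(in_degree, threshold_rank):
    """Split nodes (and thus their associated features) into hot and cold.

    in_degree: dict mapping node id -> in-degree (illustrative values).
    threshold_rank: integer sorting threshold on the ascending order.
    """
    # Ascending sort: a later position means a higher in-degree.
    ordered = sorted(in_degree, key=in_degree.get)
    cold = ordered[:threshold_rank]  # position not above the threshold
    hot = ordered[threshold_rank:]   # position above the threshold
    return hot, cold


degrees = {"a": 1, "b": 40, "c": 3, "d": 25, "e": 2}
hot, cold = split_hot_cold(degrees, threshold_rank=3)
print(hot)   # ['d', 'b']
print(cold)  # ['a', 'e', 'c']
```

A proportion threshold could be converted to a rank first, for example by multiplying it by the number of nodes and rounding.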
[0090] In a further optional implementation, since features are allocated to each data slice according to a random allocation method or an allocation method based on the sampling probability, for a data slice, the data slice stores the features allocated to that data slice, but does not store the globally high-frequency data features of the graph data. Based on this, embodiments of this application can determine the globally high-frequency data features of the graph data and store the globally high-frequency data features of the graph data in each data slice, so that each data slice stores the globally high-frequency data features of the graph data. That is to say, in the optional implementation, in addition to storing the nodes, edges, and features allocated to that data slice, a data slice can also store the globally high-frequency data features of the graph data.
[0091] In an optional implementation, the globally high-frequency data feature of the graph data refers to the feature associated with nodes whose in-degree or sampling probability ranks higher than a preset ranking value for the graph data as a whole. Optionally, in this embodiment, the nodes in the graph data are sorted according to their in-degree or sampling probability, and the features associated with the nodes whose ranking ranks higher than the preset ranking value are determined as the globally high-frequency data features of the graph data as a whole; thereby, the globally high-frequency data features of the graph data are cached in each data slice, so that each data slice has a cache of the globally high-frequency data features of the graph data. For example, after sorting the nodes in the graph data according to their in-degree or sampling probability, the features associated with the top preset number of nodes are determined as the globally high-frequency data features of the graph data as a whole.
[0092] Optionally, the sorting preset value can be a pre-set sorting order value or sorting ratio value based on sorting all nodes of the graph data according to their in-degree or sampling probability; while the sorting threshold described above is a pre-set sorting order value or sorting ratio value based on sorting the nodes associated with the features stored in the data slice according to their in-degree or sampling probability. The specific selection method of the sorting threshold and sorting preset value can be determined according to the actual situation, and this application embodiment does not set any limitations.
[0093] In an optional implementation, for the globally high-frequency data features of the graph data stored in a data slice, embodiments of this application may store these features in the GPU memory of the node device that stores the data slice.
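The caching of globally high-frequency data features in every data slice can be sketched as follows. The function name, the dictionary layout, and the in-degree values are assumptions for illustration; the preset ranking value is represented here by `top_k`.

```python
def add_global_hot_cache(in_degree, slice_nodes, top_k):
    """Give every data slice a cache of the globally top-ranked nodes' features.

    in_degree: dict mapping node id -> in-degree over the whole graph.
    slice_nodes: dict mapping slice id -> node ids whose features were allocated there.
    top_k: preset ranking value (number of top nodes treated as globally hot).
    """
    # Rank all nodes of the graph, highest in-degree first.
    ranked = sorted(in_degree, key=in_degree.get, reverse=True)
    global_hot = ranked[:top_k]
    return {
        slice_id: {"allocated": list(nodes), "global_hot_cache": list(global_hot)}
        for slice_id, nodes in slice_nodes.items()
    }


in_degree = {1: 5, 2: 90, 3: 7, 4: 60, 5: 2}
slices = add_global_hot_cache(in_degree, {0: [1, 5], 1: [2, 3], 2: [4]}, top_k=2)
print(slices[0])  # {'allocated': [1, 5], 'global_hot_cache': [2, 4]}
```

Data slice 0 was not allocated nodes 2 or 4, yet it can serve their features locally, which is the case that avoids cross-machine communication.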
[0094] To facilitate a further understanding of distributed storage, Figure 4D shows an exemplary implementation diagram of distributed storage. As shown in Figure 4D, taking the distributed storage of data slice 301 as an example, assuming that data slice 301 is stored on node device 311, data slice 301 stores its allocated nodes, edges, and features, as well as the globally high-frequency data features of the graph data. The nodes and edges stored in data slice 301 can be stored in either the CPU paged memory or the GPU memory of node device 311. The hot data features stored in data slice 301 can be stored in the GPU memory of node device 311, and the cold data features stored in data slice 301 can be stored in the CPU paged memory of node device 311. In addition, the globally high-frequency data features of the graph data stored in data slice 301 can be stored in the GPU memory of node device 311.
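The placement described for data slice 301 can be summarized in a small planning function. This is a sketch under stated assumptions: the function name, the tier labels `"gpu"` and `"cpu_pinned"`, the byte counts, and the feature identifiers are all hypothetical, and the rule that a globally high-frequency feature is never also placed in CPU memory is an assumption of this sketch.

```python
def plan_placement(topology_bytes, gpu_free_bytes, hot, cold, global_hot):
    """Decide the memory tier for each group of items in one data slice."""
    # Nodes and edges go to GPU memory only when it is large enough to hold them.
    topology_tier = "gpu" if topology_bytes <= gpu_free_bytes else "cpu_pinned"
    return {
        "topology": topology_tier,
        # Hot features and the global cache both live in GPU memory.
        "gpu_features": sorted(set(hot) | set(global_hot)),
        # Cold features live in CPU paged (pinned) memory.
        "cpu_pinned_features": sorted(set(cold) - set(global_hot)),
    }


plan = plan_placement(
    topology_bytes=6 * 2**30,   # 6 GiB of nodes and edges (illustrative)
    gpu_free_bytes=4 * 2**30,   # 4 GiB of free GPU memory (illustrative)
    hot=["f7", "f9"],
    cold=["f1", "f2"],
    global_hot=["f42"],
)
print(plan["topology"])  # cpu_pinned
```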
[0095] The embodiments of this application, by segmenting and distributing graph data, can at least have the following effects:
[0096] Reduce data transfer and copying between the CPU and GPU, accelerate subgraph sampling and subgraph feature sampling in subsequent sampling stages, and provide support for improving the efficiency of subsequent graph data sampling;
[0097] By partitioning and distributing graph data, efficient storage of large-scale, multi-feature, and heterogeneous graph data can be supported. For example, for a graph with tens of billions of edges, billions of nodes, and hundreds of features per node, the data volume may reach the TB level (1 TB equals 1024 GB). Therefore, the CPU memory and GPU memory of a single node device may be insufficient to meet the storage needs of the graph data. In this case, partitioning and distributing the graph data can meet the efficient storage needs of large-scale, multi-feature, and heterogeneous graph data. Heterogeneous graphs refer to graph data containing multiple types of nodes or multiple types of edges;
[0098] It supports storing hot data features and cold data features in data slices separately, and supports caching globally high-frequency data features, which can reduce cross-machine communication between node devices and improve the overall throughput of node devices during subsequent graph data sampling.
[0099] Based on the partitioning and distributed storage of graph data, embodiments of this application can achieve asynchronous distributed graph data sampling through samplers running on multiple node devices. For example, asynchronous distributed subgraph sampling and subgraph feature sampling can be implemented, thereby implementing sampling stage 210. As an optional implementation, Figure 5 shows an exemplary flowchart of a graph data sampling method provided in an embodiment of this application. This method can be applied to a first node device, which can be any node device running a sampler. As shown in Figure 5, the method flow may include the following steps.
[0100] In step S510, a sampling task is obtained, and multiple objects to be sampled corresponding to the sampling task are determined.
[0101] Batch-based graph data sampling can be achieved through different batch sampling tasks. Therefore, the embodiments of this application can achieve graph data sampling through multiple sampling tasks in different batches. In an optional implementation, graph data sampling involves subgraph sampling and subgraph feature sampling. The sampling task can be divided into a subgraph sampling task and a subgraph feature sampling task. The subgraph sampling task can sample a subgraph by specifying multiple nodes or edges to be sampled. The subgraph feature sampling task mainly performs feature lookup on each node or edge to be sampled in the subgraph (i.e., queries the features of each object to be sampled in the subgraph) to obtain the features of the subgraph.
[0102] After the first node device acquires the sampling task, it can determine multiple objects to be sampled corresponding to the sampling task. These objects can be nodes or edges to be sampled. For example, in a subgraph sampling task, multiple nodes or edges can be specified for subgraph sampling. Similarly, in a subgraph feature sampling task, based on the obtained subgraph, each node or edge in the subgraph can be specified (e.g., each node in the subgraph can be considered a node to be sampled in the subgraph feature sampling task, or each edge in the subgraph can be considered an edge to be sampled in the subgraph feature sampling task).
[0103] In step S511, for any object to be sampled, the target data slice for storing the object to be sampled is determined according to a preset allocation relationship.
[0104] As mentioned above, the allocation relationship records at least the data slices allocated to the segmented graph data (e.g., data slices allocated to the nodes, edges, and features of the graph data); wherein the segmented graph data is allocated to multiple data slices for storage, the multiple data slices are stored in multiple node devices, and each node device stores at least one data slice.
[0105] After identifying multiple objects to be sampled corresponding to a sampling task, for any object to be sampled corresponding to the sampling task, this embodiment can determine the target data slice for storing the object to be sampled according to a preset allocation relationship. For ease of explanation, the data slice for storing any object to be sampled corresponding to the sampling task is called the target data slice. In an optional implementation, if the object to be sampled is a node to be sampled, this embodiment can determine the target data slice for storing the node to be sampled according to a preset allocation relationship. In an optional implementation, if the object to be sampled is an edge to be sampled, this embodiment can determine the target data slice for storing the edge to be sampled according to a preset allocation relationship.
[0106] Optionally, in this embodiment of the application, the allocation relationship can be recorded according to a preset segmentation routing table, so that the target data slice for storing the object to be sampled can be determined according to the preset segmentation routing table.
[0107] In step S512, if the target data slice is stored in the first node device, the resources of the first node device are used to perform a sampling task on the object to be sampled in order to obtain the sampling result of the object to be sampled.
[0108] For any object to be sampled corresponding to a sampling task, the target data slice of that object may be stored on a first node device (i.e., the target data slice is stored locally on the first node device) or on a second node device different from the first node device (i.e., the target data slice is stored on a non-local second node device). Depending on the different storage scenarios of the target data slice, embodiments of this application can use different resources to perform sampling tasks on the object to be sampled.
[0109] In an optional implementation, embodiments of this application can determine the node device storing the target data slice based on the corresponding storage relationship between data slices and node devices (i.e., the relationship between the data slices stored by each node device). For example, each node device can save the corresponding storage relationship between data slices and node devices, so that after determining the target data slice, the first node device can determine the node device storing the target data slice based on the corresponding storage relationship.
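The two lookups described above (object to target data slice, then data slice to node device) can be sketched as follows. The function name and the identifiers `n1`, `device_a`, and so on are hypothetical.

```python
def group_by_device(objects, object_to_slice, slice_to_device):
    """Group objects to be sampled by the node device that stores their data slice."""
    groups = {}
    for obj in objects:
        target_slice = object_to_slice[obj]     # allocation relationship
        device = slice_to_device[target_slice]  # storage relationship
        groups.setdefault(device, []).append(obj)
    return groups


object_to_slice = {"n1": 0, "n2": 1, "n3": 0, "n4": 2}
slice_to_device = {0: "device_a", 1: "device_b", 2: "device_b"}
groups = group_by_device(["n1", "n2", "n3", "n4"], object_to_slice, slice_to_device)
print(groups)  # {'device_a': ['n1', 'n3'], 'device_b': ['n2', 'n4']}
```

If the first node device is `device_a`, it samples `n1` and `n3` locally and forwards `n2` and `n4` to `device_b`.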
[0110] Optionally, if the target data slice is stored on the first node device, then since the target data slice is stored locally on the first node device, the first node device can directly use its local resources to perform sampling tasks on the object to be sampled. The resources used to perform the sampling tasks can be either GPU resources or CPU resources. If this embodiment uses a distributed GPU for graph data sampling, the first node device can directly use its GPU resources to perform sampling tasks on the object to be sampled. Of course, this embodiment can also support the first node device using its CPU resources to perform sampling tasks on the object to be sampled.
[0111] Optionally, if the sampling task is a subgraph sampling task, then performing the sampling task on the object to be sampled can be based on the object to be sampled, sampling the subgraph corresponding to the object to be sampled. For example, if the object to be sampled is a node to be sampled, then in this embodiment, the node to be sampled is used as the source node, and the destination node corresponding to the node to be sampled (i.e., the neighbor node pointed to by the node to be sampled through a directed edge) is sampled, thereby combining the node to be sampled and the destination node corresponding to the node to be sampled to determine the subgraph corresponding to the node to be sampled. As another example, if the object to be sampled is an edge to be sampled, then in this embodiment, the source node and destination node connected by the edge to be sampled can be determined, thereby sampling the neighbor nodes corresponding to the source node and the neighbor nodes corresponding to the destination node, and combining the source node, the destination node, and the neighbor nodes corresponding to the source node and the destination node respectively, to determine the subgraph corresponding to the edge to be sampled.
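Sampling the subgraph of a node to be sampled can be sketched as one-hop neighbor sampling over outgoing edges. The function name, the adjacency-list layout, and the `fanout` limit on the number of sampled destination nodes are assumptions of this sketch; the text does not specify how many neighbors are sampled.

```python
import random


def sample_node_subgraph(adjacency, source, fanout, rng):
    """Sample up to `fanout` destination nodes of `source` and build a subgraph.

    adjacency: dict mapping source node -> list of destination nodes.
    """
    neighbors = adjacency.get(source, [])
    # Take all neighbors when there are fewer than `fanout` of them.
    picked = rng.sample(neighbors, min(fanout, len(neighbors)))
    return {
        "nodes": [source] + picked,
        "edges": [(source, dst) for dst in picked],
    }


adjacency = {1: [2, 3, 4], 2: [4]}
subgraph = sample_node_subgraph(adjacency, source=1, fanout=2, rng=random.Random(0))
print(subgraph)
```

For an edge to be sampled, the same function could be applied to both its source node and its destination node and the results merged.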
[0112] Optionally, if the sampling task is to sample features from a subgraph, then performing the sampling task on the object to be sampled can be based on sampling the features of the object to be sampled. For example, if the object to be sampled is a node to be sampled in a subgraph, then embodiments of this application can sample the features of the node to be sampled. As another example, if the object to be sampled is an edge to be sampled in a subgraph, then embodiments of this application can sample the features of the edge to be sampled.
[0113] In step S513, if the target data slice is stored in the second node device, the resources of the second node device are invoked to perform a sampling task on the object to be sampled, so as to obtain the sampling result of the object to be sampled.
[0114] In an optional implementation, if the target data slice is stored on a second node device different from the first node device, since the target data slice is not stored locally on the first node device, the first node device can invoke the resources of the second node device via remote calls to perform sampling tasks on the object to be sampled, thereby obtaining the sampling results of the object to be sampled. For example, the first node device can request the second node device to perform sampling tasks on the object to be sampled via remote calls such as RPC (Remote Procedure Call). The second node device can then use its resources to perform sampling tasks on the object to be sampled and return the obtained sampling results to the first node device. Optionally, the resources of the second node device can be either GPU resources or CPU resources. Optionally, after requesting the second node device to perform sampling tasks on the object to be sampled via remote calls such as RPC, the first node device can respond to the next sampling task and process it without blocking.
[0115] In an optional implementation, the node device may run at least one sampler, which can acquire and process sampling tasks (e.g., the sampler running on the node device executes the process shown in Figure 5). Optionally, in this embodiment, asynchronous distributed samplers can be implemented on multiple node devices to pipeline and execute sampling tasks of different input batches simultaneously; furthermore, each sampler running on a node device can maintain a Python EventLoop to achieve asynchronous concurrent processing of sampling tasks.
[0116] For example, the sampler running on the first node device can acquire sampling tasks. According to the segmentation routing table, the EventLoop in the sampler can determine which objects to be sampled in the sampling task are stored in local data slices and which are stored in non-local data slices. For objects to be sampled stored in local data slices, the sampler can directly use the GPU resources of the first node device to execute the sampling task for those objects. For objects to be sampled stored in non-local data slices, the sampler can, through an asynchronous RPC request, call the node device corresponding to the non-local data slice to execute the sampling task for those objects. Thus, the sampler running on the first node device can respond to the next sampling task and process it without blocking.
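The non-blocking behavior of the sampler can be sketched with Python's `asyncio` event loop. This is a simplified stand-in: `local_sample` and `remote_sample` are hypothetical placeholders (a short sleep simulates the latency of an asynchronous RPC), and the device names are illustrative.

```python
import asyncio


def local_sample(obj):
    """Placeholder for sampling with the first node device's own resources."""
    return f"{obj}:local"


async def remote_sample(device, obj):
    """Placeholder for an asynchronous RPC; the sleep stands in for network latency."""
    await asyncio.sleep(0.01)
    return f"{obj}:{device}"


async def run_sampling_task(objects, object_to_device, local_device):
    async def sample_one(obj):
        device = object_to_device[obj]
        if device == local_device:
            return local_sample(obj)
        # Awaiting yields control, so the event loop can serve other tasks meanwhile.
        return await remote_sample(device, obj)

    # gather() returns results in the order of `objects`, not in completion order.
    return await asyncio.gather(*(sample_one(o) for o in objects))


async def main():
    placement = {"n1": "dev0", "n2": "dev1", "n3": "dev0", "n4": "dev2"}
    # Two batches are in flight at once; neither blocks the other.
    return await asyncio.gather(
        run_sampling_task(["n1", "n2"], placement, "dev0"),
        run_sampling_task(["n3", "n4"], placement, "dev0"),
    )


batch_results = asyncio.run(main())
print(batch_results)  # [['n1:local', 'n2:dev1'], ['n3:local', 'n4:dev2']]
```

While the first batch waits on its remote call, the event loop is free to start the second batch, which is the pipelining effect described above.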
[0117] Optionally, when the sampling task is a subgraph sampling task, the second node device can perform the sampling task based on the object to be sampled, sampling the subgraph corresponding to the object to be sampled; when the sampling task is a subgraph feature sampling task, the second node device can perform the sampling task based on the object to be sampled, querying the features of the object to be sampled. For related details, please refer to the descriptions in the corresponding sections above.
[0118] Optionally, each node device (including the first node device and the second node device) can choose to use CPU or GPU resources to perform sampling tasks based on its machine environment. For example, if a node device only has CPU resources, then CPU resources are used to perform sampling tasks; if a node device has GPU resources, then GPU resources are recommended (GPU resources offer a significant performance improvement compared to CPU resources). It should be noted that when training graph neural networks, the model parameters are relatively few. If distributed CPU sampling is primarily used when training the graph neural network, GPU utilization may be low. Therefore, when sampling graph data (sampling subgraphs and subgraph features), distributed GPU sampling can be prioritized to utilize idle GPU resources and improve GPU utilization.
[0119] In step S514, the sampling results of the sampling task are obtained based on the sampling results of each object to be sampled.
[0120] After obtaining the sampling results of each object to be sampled corresponding to the sampling task (which may include the sampling results of the objects to be sampled obtained by the first node device through local resources, and the sampling results of the objects to be sampled transmitted back by the second node device), the first node device can concatenate the sampling results of each object to obtain the sampling result of the sampling task. The sampling result of the sampling task can be regarded as the overall sampling result of multiple objects to be sampled corresponding to the sampling task. For example, when the sampling task is a subgraph sampling task, the sampling result of the subgraph sampling task can be the subgraph corresponding to multiple objects to be sampled specified by the subgraph sampling task; as another example, when the sampling task is a subgraph feature sampling task, the sampling result of the subgraph feature sampling task is the feature of each edge or the feature of each node in the subgraph.
[0121] In an optional implementation, embodiments of this application may splice the sampling results of each object to be sampled in the order of the multiple objects to be sampled corresponding to the sampling task, thereby obtaining the sampling result of the sampling task.
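Splicing in the order of the objects to be sampled can be sketched as follows, assuming the per-object results arrive in arbitrary order and are keyed by object. The function name and the feature values are hypothetical.

```python
def splice_results(task_objects, results_by_object):
    """Reassemble per-object results in the order given by the sampling task."""
    return [results_by_object[obj] for obj in task_objects]


# Results may arrive out of order, e.g. remote results after local ones.
arrived = {"n3": [0.3, 0.1], "n1": [0.9, 0.4], "n2": [0.5, 0.5]}
features = splice_results(["n1", "n2", "n3"], arrived)
print(features)  # [[0.9, 0.4], [0.5, 0.5], [0.3, 0.1]]
```

Keeping the task order means row i of the spliced result always corresponds to the i-th object to be sampled, regardless of which node device produced it.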
[0122] In a further optional implementation, the first node device may store the sampling results of the sampling task in a prefetch cache implemented by shared memory and paged memory.
[0123] Optionally, when the sampler performs graph data sampling, after all the objects to be sampled corresponding to the sampling task have completed their sampling tasks, the sampler running on the first node device can concatenate the sampling results of each object to be sampled corresponding to the sampling task to obtain the sampling result corresponding to the sampling task, and store it in the prefetch cache of the first node device.
[0124] It should be noted that Figure 5 takes the first node device as an example to explain how a node device (e.g., a sampler within a node device) responds to and processes a sampling task after receiving it; each node device, after receiving a sampling task, can respond to and process the sampling task according to the principle of the process shown in Figure 5. For example, after a node device receives a sampling task, it can distinguish, among the multiple objects to be sampled corresponding to the sampling task, the objects stored in local data slices from the objects stored in non-local data slices (i.e., objects stored in data slices of other node devices). It then uses local resources to execute the sampling task on the objects stored in its local data slice and passes the objects stored in non-local data slices to the corresponding node devices for sampling. Since the node device only executes the sampling task on objects stored in its local data slice, it can immediately respond to the next sampling task and process it without being blocked after passing the objects stored in non-local data slices to the corresponding node devices. Then, after the sampling of the objects stored in non-local data slices is complete, the node device can concatenate the sampling results of the multiple objects corresponding to the sampling task to obtain the sampling result of the sampling task. This embodiment of the application significantly improves sampling efficiency through this asynchronous distributed sampling method.
[0125] In one possible case, the multiple objects to be sampled for a sampling task acquired by the first node device may all be stored in the data slice of the first node device. In this case, the sampling task can be executed on all of these objects using the resources of the first node device. In another possible case, none of the multiple objects to be sampled for the task is stored in the data slice of the first node device. In this case, each object to be sampled needs to be passed to the other node device where its data slice is located, and that node device executes the sampling task.
[0126] The graph data sampling method provided in this application embodiment can be implemented based on graph data segmentation and distributed storage. The segmented graph data can be allocated to multiple data slices for storage, and these data slices are stored on multiple node devices. The data slices allocated to the segmented graph data can be recorded through allocation relationships. Therefore, when sampling graph data, a first node device can acquire a sampling task and determine multiple objects to be sampled corresponding to the sampling task. For any object to be sampled, the first node device can determine the target data slice for storing the object to be sampled according to a preset allocation relationship. If the target data slice is stored on the first node device, this application embodiment can use the resources of the first node device to perform the sampling task on the object to be sampled to obtain the sampling result of the object to be sampled. If the target data slice is stored on a second node device, this application embodiment can call the resources of the second node device to perform the sampling task on the object to be sampled to obtain the sampling result of the object to be sampled. Furthermore, the first node device can obtain the sampling result of the sampling task based on the sampling results of each object to be sampled.
[0127] As can be seen, after allocating the segmented graph data to multiple data slices and storing these slices on multiple node devices, each node device, when processing a sampling task, only performs the sampling task on the objects to be sampled stored in its local data slice. For objects to be sampled not stored in a local data slice, the node device executes the sampling task by calling other node devices corresponding to those non-local data slices. Therefore, multiple objects to be sampled corresponding to a sampling task can execute the sampling task asynchronously and in parallel on multiple node devices, thereby efficiently utilizing the resources of multiple node devices to execute the sampling tasks for multiple objects to be sampled. This achieves reasonable resource allocation and load balancing among multiple node devices, improving the sampling efficiency of graph data. Therefore, the graph data sampling method provided in this application embodiment can significantly improve the sampling efficiency of graph data, thus providing a foundation for improving the training efficiency of graph neural networks.
[0128] It should be noted that the asynchronous distributed sampling method shown in Figure 5 can be applied to both subgraph sampling and subgraph feature sampling. Of course, embodiments of this application can also apply the asynchronous distributed sampling method shown in Figure 5 to only one of subgraph sampling and subgraph feature sampling. For example, subgraph sampling is implemented by the asynchronous distributed sampling method shown in Figure 5, while subgraph feature sampling is implemented by other methods (such as traditional methods); or subgraph feature sampling is implemented by the asynchronous distributed sampling method shown in Figure 5, while subgraph sampling is implemented by other methods (such as traditional methods). From the perspective of improving sampling efficiency, using the asynchronous distributed sampling method shown in Figure 5 for both subgraph sampling and subgraph feature sampling is the better choice; however, it is understood that using it for only subgraph sampling or only subgraph feature sampling can also improve sampling efficiency.
[0129] As an optional implementation, the object to be sampled for the sampling task can be either the node to be sampled or the edge to be sampled; the sampling task can be either the subgraph sampling task or the subgraph feature sampling task, wherein the subgraph sampling task is used to sample the subgraph corresponding to multiple objects to be sampled, and the subgraph feature sampling task is used to sample the features of each object to be sampled in the subgraph.
[0130] Based on the principle of the asynchronous distributed sampling method shown in Figure 5, the optional implementation processes of subgraph sampling and subgraph feature sampling are introduced below.
[0131] Optionally, Figure 6A shows an exemplary flowchart of a subgraph sampling method provided in an embodiment of this application. This method can be implemented by a first node device. Referring to Figure 6A, the method flow may include the following steps.
[0132] In step S610, a subgraph sampling task is obtained, and multiple nodes to be sampled corresponding to the subgraph sampling task are determined.
[0133] Subgraph sampling is a form of sampling task, and a subgraph sampling task can indicate multiple objects to be sampled. Figure 6A takes nodes to be sampled as the objects to be sampled; of course, the embodiments of this application also support the case where the multiple objects to be sampled indicated by the subgraph sampling task are multiple edges to be sampled.
[0134] In step S611, for any node to be sampled, the target data slice for storing the node to be sampled is determined according to a preset allocation relationship.
[0135] Optionally, in this embodiment of the application, the target data slice storing the node to be sampled can be determined according to a preset partitioning routing table.
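As an illustration of how such a routing table might be built and queried, the following is a minimal sketch only. The text states that nodes are assigned to data slices based on their identifiers and the number of data slices, and that an edge is stored with its source node; the modulo rule and the names `assign_slice` and `build_routing_table` are assumptions introduced here for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[int, int]  # (source node id, destination node id)


def assign_slice(node_id: int, num_slices: int) -> int:
    """Hypothetical assignment rule: node id modulo the number of slices."""
    return node_id % num_slices


def build_routing_table(
    node_ids: Iterable[int], edges: List[Edge], num_slices: int
) -> Tuple[Dict[int, int], Dict[Edge, int]]:
    """Record which data slice holds each node and each edge."""
    node_to_slice = {n: assign_slice(n, num_slices) for n in node_ids}
    # An edge lives in the slice that stores its source node.
    edge_to_slice = {(src, dst): node_to_slice[src] for src, dst in edges}
    return node_to_slice, edge_to_slice


nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (4, 1)]
node_to_slice, edge_to_slice = build_routing_table(nodes, edges, num_slices=2)
print(node_to_slice)  # {1: 1, 2: 0, 3: 1, 4: 0}
```

A lookup of the target data slice for a node to be sampled is then a single dictionary access, for example `node_to_slice[3]`.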
[0136] In step S612, if the target data slice is stored in the first node device, the GPU resources of the first node device are used to sample the subgraph of the node to be sampled to obtain the subgraph of the node to be sampled.
[0137] In step S613, if the target data slice is stored on the second node device, the GPU resources of the second node device are called using asynchronous RPC to sample the subgraph of the node to be sampled, so as to obtain the subgraph of the node to be sampled.
[0138] For any node to be sampled corresponding to a subgraph sampling task, the target data slice storing the node to be sampled may be located on a first node device or a second node device. When the target data slice is located on the first node device, the first node device uses its local GPU resources to sample the subgraph of the node to be sampled; when the target data slice is located on the second node device, the first node device can use an asynchronous RPC request to call the GPU resources of the second node device, thereby sampling the subgraph of the node to be sampled. Optionally, when sampling the subgraph of the node to be sampled, this embodiment of the application uses the node to be sampled as the source node and samples the destination node corresponding to the node to be sampled (i.e., the neighbor node pointed to by the node to be sampled through a directed edge), thereby combining the node to be sampled and the destination node corresponding to the node to be sampled to determine the subgraph corresponding to the node to be sampled.
[0139] Furthermore, after executing step S613, the first node device can acquire and respond to the next subgraph sampling task.
[0140] In step S614, the subgraphs of the nodes to be sampled are spliced together according to the order of the multiple nodes to be sampled corresponding to the subgraph sampling task, to obtain the subgraph corresponding to the subgraph sampling task.
[0141] In step S615, the subgraph corresponding to the subgraph sampling task is stored in the prefetch cache of the first node device.
[0142] The subgraph corresponding to the subgraph sampling task can be regarded as the subgraph corresponding to all of the nodes to be sampled specified by the subgraph sampling task.
[0143] In one implementation example, a node device can run at least one sampler, and multiple samplers are distributed across multiple node devices; thus, the sampler running in the first node device can be responsible for executing the method flow shown in Figure 6A. For example, the sampler running in the first node device can maintain a Python EventLoop for asynchronous concurrent processing of subgraph sampling. After the first node device obtains a subgraph sampling task, the EventLoop in the sampler running in the first node device can determine, according to a preset partitioning routing table, which of the multiple nodes to be sampled specified by the subgraph sampling task are stored in the local data slice of the first node device and which are stored in non-local data slices. For the nodes to be sampled stored in the local data slice, the sampler of the first node device can use the GPU of the first node device to operate on the CPU page memory or GPU video memory of the first node device (the nodes and edges of a data slice are stored in the CPU page memory or GPU memory of the node device) to sample the subgraph of the node to be sampled. For the nodes to be sampled stored in non-local data slices, the sampler of the first node device can use asynchronous RPC requests to call the samplers of other node devices, which use the GPU resources of those node devices to sample the subgraph of the node to be sampled. At this time, the sampler of the first node device can respond to the next subgraph sampling task without being blocked. After the subgraph sampling of the nodes to be sampled in non-local data slices is completed, the sampler of the first node device can stitch together the subgraphs of the nodes to be sampled according to the order of the nodes to be sampled specified by the subgraph sampling task, to obtain the subgraph corresponding to the subgraph sampling task. The sampler of the first node device can then save the subgraph corresponding to the subgraph sampling task in the prefetch cache of the first node device.
[0144] For ease of understanding, Figure 6B shows an example diagram illustrating the implementation of subgraph sampling. As shown in Figure 6B, assume that nodes 1, 2, 3, and 4 are designated as nodes to be sampled in the subgraph sampling task obtained by the sampler of the first node device. The sampler of the first node device can determine the data slices storing nodes 1, 2, 3, and 4 according to the partitioning routing table. Assume that nodes 1 and 3 are stored in data slice 601, which is stored in the first node device, and that nodes 2 and 4 are stored in data slice 602, which is stored in the second node device. Based on this, the sampler of the first node device can use the GPU resources of the first node device to sample, in the CPU paged memory or GPU video memory of the first node device, the neighboring nodes of node 1 and of node 3, using nodes 1 and 3 as source nodes, thereby sampling the subgraphs of nodes 1 and 3; and the sampler of the first node device can use asynchronous RPC requests to call the sampler of the second node device to sample the subgraphs of nodes 2 and 4. Assume that the neighboring nodes of node 1 are nodes 5 and 6 (the subgraph of node 1 consists of nodes 1, 5, and 6), the neighboring nodes of node 3 are nodes 6 and 8 (the subgraph of node 3 consists of nodes 3, 6, and 8), the neighboring nodes of node 2 are nodes 7 and 9 (the subgraph of node 2 consists of nodes 2, 7, and 9), and the neighboring node of node 4 is node 7 (the subgraph of node 4 consists of nodes 4 and 7). Thus, the sampler of the first node device can stitch together the subgraphs of the nodes in the order of nodes 1, 2, 3, and 4 to obtain the subgraph corresponding to the subgraph sampling task.
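The Figure 6B example can be sketched with Python's `asyncio` event loop. This is an illustrative sketch under stated assumptions, not the implementation of the embodiment: the remote call is simulated with `asyncio.sleep` instead of a real asynchronous RPC, GPU sampling is replaced by a dictionary lookup that returns every neighbor, and the names `sample_local`, `sample_remote`, and `run_subgraph_task` are hypothetical.

```python
import asyncio
from typing import Dict, List

# Adjacency lists per data slice, taken from the Figure 6B example.
SLICE_601: Dict[int, List[int]] = {1: [5, 6], 3: [6, 8]}  # first node device
SLICE_602: Dict[int, List[int]] = {2: [7, 9], 4: [7]}     # second node device
ROUTING_TABLE: Dict[int, int] = {1: 601, 3: 601, 2: 602, 4: 602}
LOCAL_SLICE_ID = 601


def sample_local(node: int) -> List[int]:
    """Stand-in for GPU sampling on the local slice: source node plus neighbors."""
    return [node] + SLICE_601[node]


async def sample_remote(node: int) -> List[int]:
    """Stand-in for an asynchronous RPC to the sampler of the second node device."""
    await asyncio.sleep(0.01)  # simulated network latency
    return [node] + SLICE_602[node]


async def run_subgraph_task(seeds: List[int]) -> List[List[int]]:
    results: Dict[int, List[int]] = {}
    pending: Dict[int, "asyncio.Task[List[int]]"] = {}
    for node in seeds:
        if ROUTING_TABLE[node] == LOCAL_SLICE_ID:
            results[node] = sample_local(node)
        else:
            # Issue the remote request without waiting for its result.
            pending[node] = asyncio.create_task(sample_remote(node))
    for node, task in pending.items():
        results[node] = await task
    # Stitch the per-node subgraphs in the order given by the task.
    return [results[node] for node in seeds]


subgraph = asyncio.run(run_subgraph_task([1, 2, 3, 4]))
print(subgraph)  # [[1, 5, 6], [2, 7, 9], [3, 6, 8], [4, 7]]
```

Because the remote requests are wrapped in tasks, the event loop remains free to start another sampling task while those requests are outstanding, which is the non-blocking behavior described above.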
[0145] Optionally, Figure 6C shows an exemplary flowchart of a subgraph feature sampling method provided in an embodiment of this application. This method can be implemented by a first node device. Referring to Figure 6C, the method flow may include the following steps.
[0146] In step S630, the subgraph feature sampling task is obtained, and each node to be sampled in the subgraph is determined.
[0147] Subgraph feature sampling is a form of sampling task, and a subgraph feature sampling task can indicate each node to be sampled in a subgraph. Figure 6C takes nodes to be sampled as the objects to be sampled; of course, the embodiments of this application also support the case where the subgraph feature sampling task indicates each edge to be sampled in the subgraph.
[0148] In step S631, for any node to be sampled, the target data slice for storing the node to be sampled is determined according to a preset allocation relationship.
[0149] In step S632, if the target data slice is stored in the first node device, the GPU resources of the first node device are used to sample the features of the node to be sampled in order to obtain the features of the node to be sampled.
[0150] In step S633, if the target data slice is stored in the second node device, according to the network configuration, the CPU or GPU resources of the second node device are called using asynchronous RPC to sample the features of the node to be sampled, so as to obtain the features of the node to be sampled.
[0151] As can be seen, subgraph feature sampling and subgraph sampling have similar asynchronous processing logic. For a node in a subgraph whose features need to be sampled, if the node to be sampled is stored in the local data slice of the first node device, the GPU resources of the first node device can be used to operate the GPU memory and CPU page memory of the first node device (hot data features stored in the data slice are stored in the GPU memory of the corresponding node device, cold data features are stored in the CPU page memory of the corresponding node device, and global high-hot data features of the graph data are stored in the GPU memory of the corresponding node device) to find the features of the node to be sampled. If the node to be sampled is stored in the data slice of the second node device, the first node device can use asynchronous RPC to call the CPU or GPU resources of the second node device according to the network configuration to operate the GPU memory and CPU page memory of the second node device to sample the features of the node to be sampled.
[0152] In an optional implementation, if the network configuration supports RDMA (Remote Direct Memory Access) or GPU-Direct RDMA (GPU Direct Remote Memory Access), the first node device can use asynchronous RPC to call the GPU resources of the second node device to sample features of the node to be sampled; if the network configuration is TCP (Transmission Control Protocol), the first node device can use asynchronous RPC to call the CPU resources of the second node device to sample features of the node to be sampled.
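The selection rule in this paragraph can be written as a small function. This is a sketch only; the enum `Network` and the function `remote_feature_resource` are hypothetical names introduced for illustration.

```python
from enum import Enum


class Network(Enum):
    TCP = "tcp"
    RDMA = "rdma"
    GPU_DIRECT_RDMA = "gpu_direct_rdma"


def remote_feature_resource(network: Network) -> str:
    """Pick the resource of the second node device used for feature sampling."""
    if network in (Network.RDMA, Network.GPU_DIRECT_RDMA):
        return "gpu"  # RDMA-capable networks: call the remote GPU resources
    return "cpu"      # TCP: call the remote CPU resources


print(remote_feature_resource(Network.RDMA))  # gpu
print(remote_feature_resource(Network.TCP))   # cpu
```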
[0153] Furthermore, after executing step S633, the first node device can acquire and respond to the next subgraph feature sampling task.
[0154] In step S634, the features of each node to be sampled are spliced together according to the order of the nodes to be sampled in the subgraph to obtain the features of the subgraph.
[0155] In step S635, the features of the subgraph are stored in the prefetch cache of the first node device.
[0156] In one implementation example, the sampler running on the first node device can maintain a Python EventLoop for asynchronous concurrent processing of subgraph feature sampling. After the first node device receives a subgraph feature sampling task, the EventLoop in the sampler running on the first node device can determine, according to a preset partitioning routing table, which of the nodes to be sampled specified by the subgraph feature sampling task are stored in the local data slice of the first node device and which are stored in non-local data slices. For nodes to be sampled stored in the local data slice, the sampler of the first node device can use the GPU resources of the first node device to find the features of the nodes to be sampled. For nodes to be sampled stored in non-local data slices, when the network configuration supports RDMA or GPU-Direct RDMA, the sampler of the first node device can use asynchronous RPC requests to invoke samplers on other node devices, which use their GPU resources to find the features of the nodes to be sampled; if the network configuration is TCP, the sampler of the first node device uses asynchronous RPC requests to invoke samplers on other node devices, which use their CPU resources to find the features of the nodes to be sampled. Meanwhile, the sampler on the first node device can respond to the next subgraph feature sampling task without being blocked. After feature sampling is completed for the nodes in non-local data slices, the sampler on the first node device can concatenate the features of the nodes to be sampled based on the order of the nodes in the subgraph to obtain the features of the subgraph. Thus, the sampler on the first node device can store the features of the subgraph in its prefetch cache.
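A minimal sketch of this feature sampling flow follows. It assumes in-memory dictionaries in place of GPU memory, CPU paged memory, and the remote node device, and it simulates the asynchronous RPC with `asyncio.sleep`; all names and feature values are hypothetical.

```python
import asyncio
from typing import Dict, List

LOCAL_NODES = {1, 3}  # nodes whose data slice is on the first node device
HOT_FEATURES: Dict[int, List[float]] = {1: [0.1, 0.2]}   # stands in for GPU memory
COLD_FEATURES: Dict[int, List[float]] = {3: [0.5, 0.6]}  # stands in for CPU paged memory
REMOTE_FEATURES: Dict[int, List[float]] = {2: [0.3, 0.4], 4: [0.7, 0.8]}


async def lookup_local(node: int) -> List[float]:
    """Look in the hot store first, then fall back to the cold store."""
    if node in HOT_FEATURES:
        return HOT_FEATURES[node]
    return COLD_FEATURES[node]


async def lookup_remote(node: int) -> List[float]:
    """Stand-in for an asynchronous RPC to the sampler of another node device."""
    await asyncio.sleep(0.01)  # simulated network latency
    return REMOTE_FEATURES[node]


async def sample_features(nodes: List[int]) -> List[List[float]]:
    lookups = [
        lookup_local(n) if n in LOCAL_NODES else lookup_remote(n) for n in nodes
    ]
    # gather returns results in the order of its arguments, so the rows
    # follow the order of the nodes in the subgraph.
    rows = await asyncio.gather(*lookups)
    return list(rows)


features = asyncio.run(sample_features([1, 2, 3, 4]))
print(features)  # [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
```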
[0157] After completing graph data sampling, embodiments of this application can use the sampling results of the graph data to train a graph neural network. Optionally, Figure 7A shows an exemplary flowchart of the graph neural network training method provided in an embodiment of this application. As shown in Figure 7A, the method flow may include the following steps.
[0158] In step S710, the sampling results of the graph data are obtained.
[0159] The sampling results of graph data can be determined based on the graph data sampling method provided in the embodiments of this application. Optionally, the sampling results of graph data may include sampled subgraphs and subgraph features. In conjunction with the foregoing description, after completing subgraph sampling and subgraph feature sampling, the subgraphs and subgraph features can be stored in the prefetch cache of the node device. Thus, when training the graph neural network, the embodiments of this application can load the sampling results (subgraphs and subgraph features) of graph data from the prefetch cache of the node device to obtain the sampling results of graph data.
[0160] In step S711, a graph neural network is trained based on the sampling results of the graph data.
[0161] In an optional implementation, embodiments of this application can utilize PyTorch distributed training technology to train a graph neural network based on the sampling results of graph data. Optionally, embodiments of this application can write PyTorch or PyG code to construct a graph neural network model, then begin training the graph neural network model, and finally use the graph neural network model for prediction tasks. During the training of the graph neural network model, a loader can load subgraphs and subgraph features from the prefetch cache of the node device. The graph data sampling (involving subgraph sampling and subgraph feature sampling) and the training of the graph neural network provided in embodiments of this application can be executed asynchronously.
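The asynchronous relationship between sampling and training can be pictured as a producer and a consumer around the prefetch cache. The sketch below uses only the Python standard library: a bounded `queue.Queue` stands in for the prefetch cache, a thread stands in for the sampler, and the training step is a placeholder rather than PyTorch or PyG code. All names are hypothetical.

```python
import queue
import threading
from typing import List, Optional

Batch = dict  # holds one sampled subgraph and its features


def sampler(cache: "queue.Queue[Optional[Batch]]", num_batches: int) -> None:
    """Producer: write sampling results into the prefetch cache."""
    for batch_id in range(num_batches):
        cache.put({
            "batch": batch_id,
            "subgraph": [batch_id, batch_id + 1],  # placeholder node ids
            "features": [[float(batch_id)]],       # placeholder feature rows
        })
    cache.put(None)  # sentinel: no more batches


def train(cache: "queue.Queue[Optional[Batch]]") -> List[int]:
    """Consumer: the loader reads from the cache and feeds the model."""
    seen: List[int] = []
    while True:
        batch = cache.get()
        if batch is None:
            break
        seen.append(batch["batch"])  # a forward/backward pass would go here
    return seen


prefetch_cache: "queue.Queue[Optional[Batch]]" = queue.Queue(maxsize=2)
producer = threading.Thread(target=sampler, args=(prefetch_cache, 4))
producer.start()
trained = train(prefetch_cache)
producer.join()
print(trained)  # [0, 1, 2, 3]
```

The bounded queue lets the sampler run ahead of training by a fixed number of batches, so neither side waits on the other longer than necessary.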
[0162] As an optional implementation, combining graph data sampling and graph neural network training, embodiments of this application can provide a distributed graph neural network training system. This system can fully utilize the characteristics of hardware such as GPUs, NVLink, and RDMA networks, as well as the characteristics of the graph neural network model, to accelerate single-machine and distributed graph neural network training. Optionally, Figure 7B shows an exemplary diagram of the architecture of the graph neural network training system provided in an embodiment of this application. As shown in Figure 7B, the system architecture may include: a storage layer 721, a graph operator layer 722, an interface layer and distributed sampling layer 723, and a model layer 724.
[0163] The storage layer 721 primarily implements graph data segmentation and distributed storage of data slices on corresponding node devices. When using GPUs for distributed sampling of graph data, data transfer between the CPU and GPU can become a major performance bottleneck. Therefore, to accelerate graph data sampling (involving subgraph sampling and subgraph feature sampling), this embodiment implements unified tensor storage in the storage layer 721 to unify CPU and GPU memory management and reduce data transfer and copying between the CPU and GPU. When implementing distributed storage of data slices, the storage layer 721 can store the nodes and edges of the data slices in the GPU memory or CPU page memory of the node device corresponding to the data slice. For the features stored in the data slices, hot data features and cold data features are distinguished according to the in-degree or sampling probability of the nodes associated with the features. Hot data features are stored in the GPU memory of the node device corresponding to the data slice (for GPU groups with NVLink connections, hot data features can be evenly stored in the GPU memory of each GPU in the GPU group), while cold data features are stored in the CPU page memory of the node device corresponding to the data slice. Simultaneously, the GPU memory of the node device stores globally high-hot data features stored in the data slice.
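The hot and cold split described for the storage layer can be sketched as a ranking by in-degree. This is a sketch under assumptions: the text does not specify a threshold, so the `hot_fraction` parameter and the function name `split_hot_cold` are introduced here for illustration, and ties are broken by node id only to make the result deterministic.

```python
from collections import Counter
from typing import List, Tuple

Edge = Tuple[int, int]  # (source node id, destination node id)


def split_hot_cold(edges: List[Edge], hot_fraction: float) -> Tuple[List[int], List[int]]:
    """Rank nodes by in-degree; the top fraction is hot, the rest is cold."""
    in_degree = Counter(dst for _, dst in edges)
    nodes = sorted(
        {n for edge in edges for n in edge},
        key=lambda n: (-in_degree[n], n),  # highest in-degree first
    )
    num_hot = int(len(nodes) * hot_fraction)
    return nodes[:num_hot], nodes[num_hot:]


# Edges from the Figure 6B example.
edges = [(1, 5), (1, 6), (3, 6), (3, 8), (2, 7), (2, 9), (4, 7)]
hot, cold = split_hot_cold(edges, hot_fraction=0.25)
print(hot)   # [6, 7]
print(cold)  # [5, 8, 9, 1, 2, 3, 4]
```

Under this sketch, the features of the hot nodes would be placed in GPU memory and those of the cold nodes in CPU paged memory.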
[0164] The graph operator layer 722 provides both CPU and GPU operators. CPU operators can be viewed as operations utilizing CPU resources, while GPU operators can be viewed as operations utilizing GPU resources. For example, the graph operator layer provides both CPU-based and GPU-based operator support for tasks such as neighbor node sampling, subgraph sampling, and subgraph feature sampling. CPU-based operators can use multithreading to achieve parallel acceleration when performing tasks such as subgraph sampling and subgraph feature sampling; GPU-based operators can use CUDA (Compute Unified Device Architecture) kernel functions to achieve parallel processing when performing similar tasks. Since the nodes, edges, and features of graph data are stored in GPU memory or CPU paged memory, GPU operators can directly access the data in GPU memory or CPU paged memory, reducing the time spent copying data from the CPU to the GPU.
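A CPU-based neighbor sampling operator that dispatches work across threads might look like the following sketch. It only illustrates the dispatch pattern: pure-Python threads do not run in parallel under CPython's global interpreter lock, so a real operator would typically be implemented in native code. The adjacency data reuses the Figure 6B example, and the names `sample_neighbors` and `cpu_sample` are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

ADJACENCY: Dict[int, List[int]] = {1: [5, 6], 2: [7, 9], 3: [6, 8], 4: [7]}


def sample_neighbors(node: int, fanout: int) -> List[int]:
    """Sample up to `fanout` neighbors of `node` without replacement."""
    neighbors = ADJACENCY.get(node, [])
    return random.sample(neighbors, min(fanout, len(neighbors)))


def cpu_sample(seeds: List[int], fanout: int, workers: int = 4) -> List[List[int]]:
    """Sample the neighbors of each seed node on a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves the order of the seeds in its results.
        return list(pool.map(lambda n: sample_neighbors(n, fanout), seeds))


sampled = cpu_sample([1, 2, 3, 4], fanout=2)
print([sorted(s) for s in sampled])  # [[5, 6], [7, 9], [6, 8], [7]]
```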
[0165] The interface layer and distributed sampling layer 723 can provide interfaces to support distributed sampling of graph data. Regarding the interface, to reduce the learning curve for users, this embodiment can adopt a Python interface compatible with PyTorch, while also being compatible with graph learning frameworks such as PyG; this allows users to accelerate PyG models with minimal code modifications. The interfaces provided by the interface layer and distributed sampling layer 723 can include graph objects (edges and nodes), samplers, features, etc. Optionally, the interface layer and distributed sampling layer 723 can at least provide an interface for a sampler, which can execute the graph data sampling methods provided in this embodiment.
[0166] Optionally, to prevent remote data access from hindering the progress of graph data sampling and graph neural network training, this embodiment implements a high-efficiency RPC framework on top of PyTorch's RPC, supporting TCP and RDMA networks, and employing asynchronous distributed subgraph sampling and feature sampling methods to hide network latency (optional implementation methods can be found in the corresponding sections above), thereby improving the throughput between node devices. Optionally, this embodiment implements distributed graph objects, distributed samplers, and distributed features at the Python layer.
[0167] Model layer 724 is used to support the training of graph neural networks. This embodiment supports different graph neural network models to adapt to graph data of different scales in different application scenarios. Model layer 724 allows users to place graph neural network training and graph data sampling in the same process, or separate them into different processes, or even different node devices. The model interface of model layer 724 is compatible with graph neural network frameworks such as PyG, and models from these frameworks can also be directly used for training in conjunction with the Python interface of this embodiment. PyG is an open-source graph neural network framework based on PyTorch, an open-source Python machine learning library.
[0168] As an optional implementation, a single process on a node device can perform graph data sampling and graph neural network training, a node device can execute graph data sampling and graph neural network training in parallel through multiple processes, and multiple processes on multiple node devices can execute graph data sampling and graph neural network training in a distributed manner. Optionally, Figure 8A shows an example diagram of process deployment provided in an embodiment of this application, in which graph data sampling and graph neural network training are deployed on the same node device through a single process.
[0169] As shown in Figure 8A, the graph data is divided into two data slices, data slice 0 and data slice 1, which are stored on two different machines; for example, data slice 0 is stored on node device 801, and data slice 1 is stored on node device 802. Each node device can run multiple processes (Figure 8A takes one node device running two processes as an example), and one process can run multiple samplers (Figure 8A takes one process running two samplers as an example). The samplers can execute the graph data sampling method provided in the embodiments of this application through the interfaces provided by the interface layer and distributed sampling layer (for the graph data sampling method, refer to the description in the corresponding sections above). At the same time, a process runs a loader and a model (the graph neural network model). The samplers and the loader running in the process are connected through a shared memory channel, so that the loader can obtain the sampling results of the graph data for training the graph neural network model. It should be noted that when graph data sampling and graph neural network training are performed by one process and deployed on the same machine, the prefetch cache of the node device actually resides in CPU shared memory and CPU paged memory. Therefore, after the sampler obtains the sampling results of the graph data, the loader can directly use the GPU to access them. It can be seen that a process on a node device can run multiple samplers and a loader; the loader connects to the samplers through a shared memory channel and uses the GPU resources of the node device to access the sampling results of the graph data in the prefetch cache, in order to train the graph neural network model.
[0170] As an optional implementation, graph data sampling and graph neural network training can be executed through different processes and deployed on different machines. In this embodiment, the process used for graph data sampling can be called a sampling process, and the process used for graph neural network training can be called a training process. Optionally, the sampling process can be deployed on a service node device, and the training process can be deployed on a client node device. One service node device can run multiple sampling processes to execute graph data sampling in parallel, and the sampling processes running on multiple service node devices execute graph data sampling in a distributed manner. Similarly, one client node device can run multiple training processes to execute graph neural network training in parallel, and the training processes running on multiple client node devices execute graph neural network training in a distributed manner. Optionally, Figure 8B shows another example diagram of process deployment provided in an embodiment of this application, in which the sampling processes and the training processes are deployed on different node devices.
[0171] As shown in Figure 8B, the sampling processes are deployed on two service node devices 811 and 812, while the training processes are deployed on two client node devices 813 and 814. After the graph data is sliced, the data slices are stored on the service node devices; for example, data slice 0 is stored on service node device 811, and data slice 1 is stored on service node device 812. One service node device can run multiple sampling processes (Figure 8B takes one service node device running two sampling processes as an example), and one sampling process can run multiple samplers (Figure 8B takes one sampling process running two samplers as an example). One client node device can run multiple training processes (Figure 8B takes one client node device running two training processes as an example), and one training process runs a loader and a model (the graph neural network model). The samplers and the loader are connected via a remote channel. For example, the sampling results of the graph data sampled by a sampler can be saved to the prefetch cache of the corresponding service node device. Then, during graph neural network training, the loader reads the sampling results of the graph data from the prefetch cache of the service node device through the remote channel for training the graph neural network model.
[0172] The solution provided in this application segments graph data and distributes the resulting data slices in GPU memory or CPU paged memory. It then implements graph data sampling through asynchronous concurrent sampling, thereby achieving efficient subgraph sampling and subgraph feature sampling. This application supports distributed graph data sampling and graph neural network training on multiple node devices and multiple GPUs, improving both the sampling efficiency and the training efficiency of graph neural networks. Therefore, the solution provided in this application can solve the performance problems of large-scale graph data sampling and training, and improve GPU resource utilization.
[0173] Furthermore, this application embodiment supports caching of globally high-frequency data features, which can reduce cross-node device communication and improve overall throughput. This application embodiment can support distributed training of graph data with hundreds of billions of edges. Compared with technologies such as DGL (Deep Graph Library), this application embodiment can achieve a 1 to 2 times speedup when sampling and training graph neural networks with billions of edges, and improve GPU utilization by 2 to 3 times. Furthermore, by designing graph data sampling and graph neural network training into a service node device and client node device architecture, and placing graph data sampling and graph neural network training in different processes and deploying them on different machines, or placing graph data sampling and graph neural network training in the same process and deploying them on the same machine, it can provide process-level resource allocation. This allows different resources to be used according to the characteristics of graph data sampling and graph neural network training, enabling reasonable resource allocation and load balancing.
[0174] This application also provides a node device, which may include at least one memory and at least one processor. The memory stores one or more computer-executable instructions, and the processor invokes the one or more computer-executable instructions to execute the graph data sampling method provided in this application embodiment, or the graph neural network training method provided in this application embodiment.
[0175] This application also provides a storage medium that stores one or more computer-executable instructions. When the one or more computer-executable instructions are executed, they implement the graph data sampling method or the graph neural network training method provided in this application.
[0176] This application also provides a computer program that, when executed, implements the graph data sampling method or the graph neural network training method provided in this application.
[0177] The foregoing describes multiple embodiment schemes provided by the embodiments of this application. The optional methods described in each embodiment scheme can be combined and cross-referenced with each other without conflict, thereby extending to a variety of possible embodiment schemes. These can all be considered as the embodiment schemes disclosed and published by the embodiments of this application.
[0178] While the embodiments disclosed above are described in this application, this application is not limited thereto. Any person skilled in the art can make various modifications and alterations without departing from the spirit and scope of this application; therefore, the scope of protection of this application should be determined by the scope defined in the claims.
Claims
1. A graph data sampling method, wherein, Applied to a first node device, the method includes: Obtain a sampling task and determine multiple objects to be sampled corresponding to the sampling task. The objects to be sampled include nodes to be sampled, and the sampling task includes a subgraph sampling task. For any object to be sampled, a target data slice for storing the object is determined according to a preset allocation relationship; the allocation relationship records at least the data slices allocated to the segmented graph data, wherein the segmented graph data is allocated to multiple data slices for storage, and the multiple data slices are stored on multiple node devices; If the target data slice is stored in the first node device, the resources of the first node device are used to perform a sampling task on the object to be sampled in order to obtain the sampling result of the object to be sampled; If the target data slice is stored in the second node device, the resources of the second node device are invoked to perform a sampling task on the object to be sampled, so as to obtain the sampling result of the object to be sampled; Based on the sampling results of each object to be sampled, the sampling results of the sampling task are obtained; wherein the sampling results of the objects to be sampled include subgraphs of the nodes to be sampled, and the step of obtaining the sampling results of the sampling task based on the sampling results of each object to be sampled includes: Based on the order of the multiple nodes to be sampled corresponding to the subgraph sampling task, the subgraphs of each node to be sampled are spliced together to obtain the subgraph corresponding to the subgraph sampling task.
2. The method according to claim 1, wherein, The objects to be sampled also include edges to be sampled; the sampling task also includes a subgraph feature sampling task, wherein the subgraph sampling task is used to sample the subgraphs corresponding to the plurality of objects to be sampled, and the subgraph feature sampling task is used to sample the features of each object to be sampled in the subgraph.
3. The method according to claim 2, wherein, If the target data slice is stored in the first node device, the step of using the resources of the first node device to perform a sampling task on the object to be sampled to obtain the sampling result of the object to be sampled includes: If the target data slice containing the node to be sampled is stored in the first node device, the GPU resources of the first node device are used to sample the subgraph of the node to be sampled in order to obtain the subgraph of the node to be sampled. If the target data slice is stored in the second node device, the step of calling the resources of the second node device to perform a sampling task on the object to be sampled, so as to obtain the sampling result of the object to be sampled, includes: If the target data slice containing the node to be sampled is stored on the second node device, use asynchronous RPC to call the GPU resources of the second node device to sample the subgraph of the node to be sampled in order to obtain the subgraph of the node to be sampled.
4. The method according to claim 2, wherein, The sampling task is a subgraph feature sampling task; the multiple objects to be sampled corresponding to the sampling task are each node to be sampled in the subgraph; if the target data slice is stored in the first node device, the sampling task is performed on the objects to be sampled using the resources of the first node device to obtain the sampling results of the objects to be sampled, including: If the target data slice containing the node to be sampled is stored in the first node device, the GPU resources of the first node device are used to sample the features of the node to be sampled in order to obtain the features of the node to be sampled. If the target data slice is stored in the second node device, the step of calling the resources of the second node device to perform a sampling task on the object to be sampled, so as to obtain the sampling result of the object to be sampled, includes: If the target data slice of the node to be sampled is stored in the second node device, according to the network configuration, the CPU or GPU resources of the second node device are called using asynchronous RPC to sample the features of the node to be sampled in order to obtain the features of the node to be sampled. The step of obtaining the sampling results of the sampling task based on the sampling results of each object to be sampled includes: Based on the order of the nodes to be sampled in the subgraph, the features of each node to be sampled are concatenated to obtain the features of the subgraph.
5. The method according to claim 1, further comprising: after executing the step of calling the resources of the second node device to perform the sampling task on the object to be sampled, obtaining and responding to the next sampling task; and/or storing the sampling results of the sampling task in a prefetch cache of the first node device, the prefetch cache being implemented by the CPU shared memory and CPU paged memory of the first node device.
6. The method according to any one of claims 1-5, wherein: each node device runs at least one sampler, and multiple samplers are distributed across multiple node devices; the samplers running on the node devices are used to execute the graph data sampling method.
7. The method according to claim 3 or 4, wherein: the allocation relationship recording at least the data slices allocated to the segmented graph data includes: the allocation relationship records the data slices allocated to the nodes, edges, and features of the graph data; wherein the nodes and edges stored in a data slice are stored in the CPU page memory or GPU memory of the corresponding node device; among the features stored in a data slice, the hot data features are stored in the GPU memory of the corresponding node device and the cold data features are stored in the CPU page memory of the corresponding node device; each data slice also stores the globally high-hot data features of the graph data, and the globally high-hot data features stored in a data slice are stored in the GPU memory of the corresponding node device; and the node device corresponding to a data slice is the node device that stores that data slice.
8. The method according to claim 7, wherein: the nodes in the graph data are assigned to data slices based on their node identifiers and the number of data slices; each edge in the graph data is stored in the data slice in which the source node of the edge is located; the features in the graph data are classified as hot data features or cold data features based on the in-degree or sampling probability of the nodes associated with those features, the in-degree or sampling probability of the nodes associated with hot data features being higher than that of the nodes associated with cold data features; and the globally high-hot data features are the features associated with those nodes whose in-degree or sampling probability, over the graph data as a whole, ranks ahead of a preset ranking value.
9. A method for training a graph neural network, comprising: obtaining the sampling results of graph data, wherein the sampling results of the graph data are determined by the graph data sampling method according to any one of claims 1-8; and training a graph neural network based on the sampling results of the graph data.
10. A graph neural network training system, comprising: a storage layer, used to implement the segmentation of graph data and the distributed storage of the data slices on the corresponding node devices; a graph operator layer, which provides operators for the CPU and operators for the GPU; an interface layer and a distributed sampling layer, which provide at least an interface to a sampler configured to perform the graph data sampling method according to any one of claims 1-8; and a model layer, used to support the training of the graph neural network.
11. The graph neural network training system according to claim 10, wherein: one process on a node device performs both graph data sampling and graph neural network training; a node device performs graph data sampling and graph neural network training in parallel through multiple processes; and the processes running on multiple node devices perform graph data sampling and graph neural network training in a distributed manner; wherein a single node device runs multiple samplers and loaders, and the loaders connect to the samplers via a shared memory channel to access, using the GPU resources of the node device, the sampling results of the graph data in the prefetch cache, so as to train the graph neural network model.
12. The graph neural network training system according to claim 10, wherein: a single service node device runs multiple sampling processes to perform graph data sampling in parallel, and the sampling processes running on multiple service node devices perform graph data sampling in a distributed manner; a single client node device runs multiple training processes to train the graph neural network in parallel, and the training processes running on multiple client node devices train the graph neural network in a distributed manner; wherein one sampling process runs multiple samplers, one training process runs a loader and a graph neural network model, and the samplers and the loader are connected via a remote channel; and the loader reads, through the remote channel, the graph data sampling results that the sampler has stored in the prefetch cache of the corresponding service node device, so as to train the graph neural network model.
13. A node device, comprising at least one memory and at least one processor, wherein the memory stores one or more computer-executable instructions, and the processor invokes the one or more computer-executable instructions to perform the graph data sampling method according to any one of claims 1-8, or the graph neural network training method according to claim 9.
14. A storage medium, wherein the storage medium stores one or more computer-executable instructions which, when executed, implement the graph data sampling method according to any one of claims 1-8, or the graph neural network training method according to claim 9.
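
By way of illustration only, and without limiting the claims, the slice assignment of claim 8 and the ordered feature concatenation of claim 4 might be sketched as follows. The sketch rests on several assumptions that do not appear in the claims: the node identifier is mapped to a data slice by taking it modulo the number of data slices, hot and cold features are separated by a hypothetical in-degree threshold, and the asynchronous RPC to the second node device is replaced by a plain dictionary lookup. All function and parameter names are hypothetical.

```python
from collections import defaultdict


def slice_of(node_id, num_slices):
    """Assumed rule: node identifier modulo the number of data slices."""
    return node_id % num_slices


def partition_edges(edges, num_slices):
    """Store each (src, dst) edge in the slice that holds its source node."""
    slices = defaultdict(list)
    for src, dst in edges:
        slices[slice_of(src, num_slices)].append((src, dst))
    return dict(slices)


def in_degrees(edges):
    """Count incoming edges per destination node."""
    deg = defaultdict(int)
    for _, dst in edges:
        deg[dst] += 1
    return dict(deg)


def split_hot_cold(node_ids, degrees, hot_threshold):
    """Nodes whose in-degree reaches the hypothetical threshold are hot."""
    hot = [n for n in node_ids if degrees.get(n, 0) >= hot_threshold]
    cold = [n for n in node_ids if degrees.get(n, 0) < hot_threshold]
    return hot, cold


def global_hot(node_ids, degrees, top_k):
    """Nodes ranked in the first top_k positions by in-degree (ties by id)."""
    ranked = sorted(node_ids, key=lambda n: (-degrees.get(n, 0), n))
    return ranked[:top_k]


def gather_features(subgraph_nodes, local_slice, slice_stores, num_slices):
    """Concatenate per-node features in the order of the subgraph's nodes.

    A lookup in a slice other than local_slice stands in for the
    asynchronous RPC to a second node device; it is only counted here.
    """
    features = []
    remote_lookups = 0
    for node in subgraph_nodes:
        owner = slice_of(node, num_slices)
        if owner != local_slice:
            remote_lookups += 1
        features.extend(slice_stores[owner][node])
    return features, remote_lookups
```

In this sketch, every node device can compute the owner of a node from the identifier alone, so no lookup table is needed to decide between the local path and the remote path.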