Graph neural network model training method, device, storage medium and program product
By setting up a cache space in the host memory and using different processors to train the GNN model, the problems of resource waste and disk I/O bottleneck in training large-scale graph data are solved, and more efficient GNN model training is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ALIBABA CLOUD COMPUTING CO LTD
- Filing Date
- 2024-12-12
- Publication Date
- 2026-06-12
AI Technical Summary
Traditional single-GPU-based GNN model training methods cannot handle large-scale graph data. Distributed training systems lead to resource waste and high training costs. Disk-based GNN model training systems cannot effectively alleviate disk I/O bottlenecks, resulting in low GPU and memory utilization and low training efficiency.
By setting up a first cache space in the host memory, the GNN model is trained together using different types of processors (such as CPU and GPU). The graph data is first partitioned and loaded into the cache space. A neighborhood sampling strategy is used to determine cached and uncached neighbor nodes. Neighbor sampling is performed in parallel by different processors, reducing disk access.
It improves the training efficiency of GNN models, reduces disk I/O bottlenecks, and enhances the utilization of host resources, especially GPU utilization.
[Figure: abstract drawing, CN122197960A]
Abstract
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and in particular to a graph neural network model training method, device, storage medium, and program product.
Background Technology
[0002] Graph neural network (GNN) models, a class of models capable of performing machine learning tasks on graph data, have demonstrated significant application value in fields such as social network analysis, recommender systems, and bioinformatics. Graph data can include graph structure data describing the graph's topology and node feature vectors describing the features of the nodes. The graph structure data describes the nodes contained in the graph and the edges connecting the nodes.
[0003] As the scale of graphs in practical applications continues to expand, efficient GNN model training faces significant challenges. Graph data used in industry has reached billions of units in size, and the nodes and edges in these graphs typically possess high-dimensional feature vectors (e.g., each node in the graph data has a 1024-dimensional feature vector). Such large-scale graph data requires several GB or even TB of storage space, far exceeding the capacity of the graphics processing unit (GPU) memory or host memory of a typical single machine. Therefore, traditional single-machine GPU-based GNN model training methods cannot be directly applied to large-scale graph data.
[0004] On the one hand, distributed training is a common technique for training machine learning models on large-scale data. For GNN models, distributed training divides the graph data into several parts and stores them in the memory of different host nodes or GPUs. By increasing the number of host nodes, a distributed GNN model training system can accommodate large-scale graph data in memory. However, while increasing the number of host nodes to expand the total memory capacity, the distributed training system also expands other hardware resources (such as CPUs and GPUs) on the host nodes in the same proportion. This means that when training GNN models on large-scale graph data, some of the expanded resources (such as CPUs and GPUs) exceed the system requirements, resulting in resource waste. In addition, distributed training also introduces additional network communication overhead, which leads to relatively high training costs.
[0005] On the other hand, disks, as cost-effective and high-capacity storage devices, have become a possible solution to address the memory limitations of large-scale graph data. Since disk access speeds are much slower than memory, memory is typically used as a data cache to reduce disk access frequency when training GNN models on large-scale graph data stored on a single machine's disk. However, the caching strategies of current disk-based GNN model training systems still cannot effectively alleviate disk I/O (input/output) bottlenecks. For example, they still require frequent disk access, resulting in low utilization of the host's GPU and memory (including host memory and GPU memory), ultimately reducing the training efficiency of the GNN model.
Summary of the Invention
[0006] This invention provides a graph neural network model training method, device, storage medium, and program product to improve the resource utilization of the training device, thereby improving the training efficiency of the GNN model.
[0007] In a first aspect, embodiments of the present invention provide a method for training a graph neural network model, the method comprising:
[0008] The first processor loads a target number of graph partitions into the first cache space in the host memory. The target number of graph partitions are graph partitions required for multiple iterations of the training process. Each graph partition is a different connected subgraph of the complete graph structure data. The complete graph structure data is stored on the host disk.
[0009] The first processor determines the cached and uncached neighbor nodes corresponding to the neighborhood sampling process of the target training node. The cached neighbor nodes are located in the first cache space, and the uncached neighbor nodes are located on the host disk. The target training node is any training node in the target number of graph partitions.
[0010] The second processor samples a first number of neighbor nodes from the cached neighbor nodes, and the first processor samples a second number of neighbor nodes from the uncached neighbor nodes, and transmits the second number of neighbor nodes to the second processor. The second processor is of a different type than the first processor.
[0011] The second processor determines the neighborhood subgraph corresponding to the target training node based on the first number of neighboring nodes and the second number of neighboring nodes, and inputs the neighborhood subgraph and the node feature vectors of each node in the neighborhood subgraph into the graph neural network model for training.
[0012] In a second aspect, embodiments of the present invention provide a graph neural network model training apparatus, the apparatus comprising:
[0013] A loading module is used to load a target number of graph partitions into a first cache space in the host memory via a first processor. The target number of graph partitions are graph partitions required for multiple iterations of the training process. Each graph partition is a different connected subgraph of the complete graph structure data, and the complete graph structure data is stored on the host disk.
[0014] A determination module is used to determine, through the first processor, the cached neighbor nodes and uncached neighbor nodes corresponding to the neighborhood sampling process of the target training node. The cached neighbor nodes are located in the first cache space, the uncached neighbor nodes are located on the host disk, and the target training node is any training node in the target number of graph partitions.
[0015] A sampling module is used to sample a first number of neighbor nodes from the cached neighbor nodes through the second processor, sample a second number of neighbor nodes from the uncached neighbor nodes through the first processor, and transmit the second number of neighbor nodes to the second processor. The second processor is of a different type than the first processor.
[0016] The training module is used to determine the neighborhood subgraph corresponding to the target training node by the second processor based on the first number of neighbor nodes and the second number of neighbor nodes, and input the neighborhood subgraph and the node feature vectors of each node in the neighborhood subgraph into the graph neural network model for training.
[0017] In a third aspect, embodiments of the present invention provide an electronic device, including a processor and a memory, wherein the memory is used to store one or more computer instructions which, when executed by the processor, implement the graph neural network model training method described in the first aspect above. The electronic device may also include a communication interface for communicating with other devices or communication systems.
[0018] In a fourth aspect, embodiments of the present invention provide a non-transitory machine-readable storage medium storing executable code, wherein when the executable code is executed by a processor of an electronic device, the processor is able to at least implement the graph neural network model training method described in the first aspect above.
[0019] In a fifth aspect, embodiments of the present invention provide a computer program product, the computer program product including a computer program or instructions which, when executed by a processor, cause the processor to implement the graph neural network model training method described in the first aspect above.
[0020] The graph neural network model training method provided in this invention uses graph data including complete graph structure data and node feature vectors. The GNN model is trained on a single host computer. This host computer has a first cache space located in its memory. Furthermore, the host computer uses different types of first processors (e.g., CPU) and second processors (e.g., GPU) to train the GNN model, thereby improving the utilization of processor resources on the host computer. Initially, the complete graph structure data and node feature vectors are fully stored on the host computer's disk. The graph structure data is pre-partitioned to obtain multiple graph partitions. Each graph partition is a different connected subgraph of the complete graph structure data; that is, each graph partition can be considered a subgraph of the complete graph structure data. The multi-hop connections between nodes within each graph partition are preserved, and different graph partitions generally do not contain duplicate nodes or edges.
[0021] During training, the first processor first loads a predetermined number of graph partitions, required for multiple iterations of training the GNN model, into a first cache space. Then, it performs neighborhood sampling on the training nodes within these graph partitions. For any target training node, during neighborhood sampling, the first processor, based on a predetermined sampling strategy, determines which neighboring nodes of the target training node are already cached in the first cache space (i.e., cached neighboring nodes) and which are not cached but stored on the host disk (i.e., uncached neighboring nodes). For cached neighboring nodes, the second processor performs neighborhood sampling to obtain a first number of neighboring nodes. For uncached neighboring nodes, the first processor samples a second number of neighboring nodes and transmits this second number of neighboring nodes to the second processor. The second processor then generates a neighborhood subgraph corresponding to the target training node based on the first and second number of neighboring nodes. The neighborhood subgraph and the node feature vectors of each node in the neighborhood subgraph are then input into the graph neural network model for training.
[0022] Since the neighborhood sampling process uses only the training nodes in the graph partitions loaded in the first cache space as the seed nodes for sampling, and each graph partition retains the multi-hop connections of the nodes it contains, the neighbor nodes of most training nodes hit the first cache space. In this way, the second processor can complete the neighborhood sampling of most training nodes directly in the first cache space, which has a faster access speed, while only the neighbor nodes of a small number of remaining training nodes are sampled from the host disk by the first processor. Thus, the parallel execution of the first and second processors can speed up overall neighborhood sampling, thereby helping to improve the training efficiency of the GNN model.
Attached Figure Description
[0023] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0024] Figure 1 is a flowchart of a graph neural network model training method provided in an embodiment of the present invention.
[0025] Figure 2 is a schematic diagram of a neighborhood sampling process provided in an embodiment of the present invention.
[0026] Figure 3 is a flowchart for obtaining node feature vectors provided in an embodiment of the present invention.
[0027] Figure 4 is a schematic diagram illustrating the application of a graph neural network model training method provided in an embodiment of the present invention.
[0028] Figure 5 is a schematic diagram illustrating a parallel execution process of processors provided in an embodiment of the present invention.
[0029] Figure 6 is a flowchart of a graph neural network model training method provided in an embodiment of the present invention.
[0030] Figure 7 is a schematic diagram of graph structure data compressed and stored in CSC format provided in an embodiment of the present invention.
[0031] Figure 8 is a flowchart illustrating a method for determining disk data access volume during the node feature vector acquisition stage provided in an embodiment of the present invention.
[0032] Figure 9 is a flowchart illustrating a method for determining disk data access volume during the neighborhood acquisition stage provided in an embodiment of the present invention.
[0033] Figure 10 is a schematic diagram of the structure of a graph neural network model training device provided in an embodiment of the present invention.
[0034] Figure 11 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.
Detailed Implementation
[0035] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0036] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in the embodiments of the present invention are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation entry points are provided for users to choose to authorize or refuse.
[0037] First, some concepts involved in the embodiments of this invention will be explained.
[0038] Graph data is a type of structured data composed of vertices and edges, forming a network structure. Each node typically represents an entity, and each edge represents a relationship or connection between nodes. Depending on the type, nodes and edges in a graph may also be associated with several high-dimensional feature vectors, which represent specific attributes of the entities or connections between entities. Graph data is widely used in social networks, knowledge graphs, bioinformatics, and other fields to model complex relationships and dependencies. A key characteristic of graph data is its ability to intuitively represent many-to-many relationships between elements, making it particularly effective for analyzing and mining patterns in complex network structures.
[0039] Graph Neural Networks (GNNs) are neural network models specifically designed for processing graph data. Unlike traditional neural network models, GNNs can learn and infer directly from graph data, capturing complex relationships and interactions between nodes. Each layer in a GNN model updates the node state using the node's feature information and information from its neighbors. By iterating this process, a representation of each node or the entire graph is ultimately obtained. This representation can be used for downstream tasks such as node classification, link prediction, and graph classification.
[0040] Neighborhood sampling: Neighborhood sampling is a commonly used technique in GNN model training, designed to address the excessive computational resource consumption and memory overflow caused by the large number of edges connecting nodes. During the forward and backward propagation of a GNN model, neighborhood sampling aggregates information from a randomly selected subset of a node's neighbors instead of all of them. This significantly reduces the number of nodes processed in each training iteration, thereby lowering computational complexity and memory requirements. In this way, GNN model training can be effectively scaled to large-scale graph data.
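The idea of neighborhood sampling can be illustrated with a minimal Python sketch. It is not taken from the embodiments: the adjacency dictionary, the function name `sample_neighbors`, and the `fanout` parameter are assumptions introduced only for illustration, and uniform sampling without replacement is one possible strategy among several.

```python
import random

def sample_neighbors(adjacency, node, fanout, rng):
    """Return at most `fanout` uniformly chosen neighbors of `node`."""
    neighbors = adjacency.get(node, [])
    if len(neighbors) <= fanout:
        # Few neighbors: keep all of them instead of sampling.
        return list(neighbors)
    return rng.sample(neighbors, fanout)

# Hypothetical toy graph: node -> list of neighbor nodes.
adjacency = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0, 3]}
rng = random.Random(0)

sampled = sample_neighbors(adjacency, 0, fanout=2, rng=rng)
few = sample_neighbors(adjacency, 1, fanout=2, rng=rng)
```

With a fanout of 2, node 0 contributes two of its five neighbors to an iteration instead of all five, which is the reduction in processed nodes described above.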
[0041] Data caching: Caching is a common technique that speeds up overall data access by storing frequently accessed copies of data in faster storage devices. In GNN model training, some node feature vectors or graph data are often cached in host memory or GPU memory to reduce latency caused by data loading and improve training efficiency and speed. Data caching can be divided into static caching and dynamic caching. Static caching involves preloading infrequently changing data into the cache, which remains unchanged once loaded. Dynamic caching automatically adjusts the contents of the cache based on data access patterns and frequency. It uses algorithms to determine which data is most likely to be re-accessed and dynamically replaces entries in the cache based on the prediction results. A suitable caching strategy is crucial for improving the effectiveness of data caching and the speed of data loading.
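As an illustration of dynamic caching, the following Python sketch uses a least-recently-used (LRU) replacement policy. LRU is chosen here only as a familiar example of a dynamic policy and is not stated to be the policy of the embodiments; the class name `LRUCache` and the `fake_disk` loader are hypothetical.

```python
from collections import OrderedDict

class LRUCache:
    """Dynamic cache that evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, load_from_disk):
        if key in self.entries:
            # Cache hit: mark the entry as most recently used.
            self.entries.move_to_end(key)
            self.hits += 1
            return self.entries[key]
        # Cache miss: fall back to the slow loader and insert the result.
        self.misses += 1
        value = load_from_disk(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry
        return value

def fake_disk(node_id):
    # Hypothetical stand-in for reading a node feature vector from disk.
    return [float(node_id)] * 4

cache = LRUCache(capacity=2)
for node_id in [1, 2, 1, 3, 2]:
    cache.get(node_id, fake_disk)
```

A static cache, by contrast, would fill `entries` once before training and never evict.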
[0042] Training epochs, rounds, and steps: In training a GNN model, a training epoch refers to the process of using the complete graph data for one round of training. In other words, a training epoch involves a complete traversal of the graph data used as the training sample set. Training a GNN model requires multiple training epochs, a training epoch contains multiple training rounds, and a training round contains multiple iterations.
[0043] With the increasingly widespread application of graph data, especially in fields such as social network analysis, recommender systems, and bioinformatics, the scale and complexity of graph data are constantly increasing, posing significant scalability challenges to traditional GNN model training methods. Graph data used in industry typically contains over a million nodes and billions of edges, and these nodes and edges often possess high-dimensional feature vectors. This makes it impossible to fully load large-scale graph data into the memory of a single machine (such as host memory or GPU memory). In contrast, disk devices have gained attention as a resource-efficient storage solution. The capacity of disk devices is sufficient to store large-scale graph data, making it possible to train GNN models on large-scale graph data on a single machine. However, this also presents challenges such as disk I/O bottlenecks and low utilization of hardware resources (such as GPUs) during training.
[0044] To this end, the GNN model training scheme provided in this embodiment of the invention fully considers the heterogeneous data characteristics and data access patterns of the GNN model workload. By using different memory resources in the host as cache space, different types of processors in the host jointly participate in the training of the GNN model, thereby improving the utilization of host resources and improving the training efficiency of the GNN model.
[0045] The heterogeneous data characteristics of the GNN model workload refer to the fact that the graph data used to train the GNN model includes two different types of data: one is graph structure data, which describes the topological connections between nodes and edges; the other is node feature vectors, which are high-dimensional vectors describing the feature information of nodes, such as a node feature vector that can be 1024-dimensional. Of course, some graph data may also include edge feature vectors, but the meaning of edges varies in different graph data. For example, each edge in an undirected graph or each edge in graph data composed of the same type of nodes may only have one meaning (such as indicating whether there is a certain relationship between nodes). In this case, the edge feature vector can be ignored. Therefore, this embodiment of the invention will not elaborate further on edge feature vectors.
[0046] The data access pattern of GNN models refers to the fact that in a training epoch, some nodes in the graph data used as training data may be accessed far more often than other nodes, unlike some other neural network models where training data such as images and text are used only once in a training epoch.
[0047] In this embodiment of the invention, based on the heterogeneous data characteristics and data access patterns of the GNN model workload, the neighborhood sampling process and node feature vector acquisition process of the training nodes involved in the model training process are optimized. By formulating a reasonable and efficient data caching mechanism, the utilization rate of hardware resources (such as GPU) in the host for training the GNN model can be improved, the disk I/O bottleneck can be reduced, and the model training efficiency can be improved.
[0048] The following detailed description of some embodiments of the present invention, in conjunction with the accompanying drawings, is provided. Where there is no conflict between the embodiments, the following embodiments and the features and steps described therein can be combined with each other. Furthermore, the timing of the steps in the following method embodiments is merely an example and not a strict limitation.
[0049] Figure 1 is a flowchart illustrating a graph neural network model training method provided in an embodiment of the present invention. This method is applied to a single host computer that trains a GNN model. As shown in Figure 1, the method includes the following steps:
[0050] 101. The first processor loads the target number of graph partitions into the first cache space in the host memory. The target number of graph partitions are the graph partitions required for multiple iterations of the training process. Each graph partition is a different connected subgraph of the complete graph structure data. The complete graph structure data is stored in the host disk.
[0051] 102. The cached and uncached neighbor nodes corresponding to the neighborhood sampling process of the target training node are determined by the first processor. The cached neighbor nodes are located in the first cache space, and the uncached neighbor nodes are located on the host disk. The target training node is any training node in the target number of graph partitions.
[0052] 103. The second processor samples a first number of neighboring nodes from the cached neighboring nodes, the first processor samples a second number of neighboring nodes from the uncached neighboring nodes, and the second number of neighboring nodes is transmitted to the second processor. The second processor is of a different type than the first processor.
[0053] 104. The second processor determines the neighborhood subgraph corresponding to the target training node based on the first number of neighboring nodes and the second number of neighboring nodes, and inputs the neighborhood subgraph and the node feature vectors of each node in the neighborhood subgraph into the graph neural network model for training.
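Steps 102 to 104 can be sketched in Python as follows. This is a simplified, single-threaded illustration under stated assumptions rather than the implementation of the embodiments: both sampling calls run sequentially on the CPU here, whereas the method assigns them to different processors working in parallel; the function names, the way `first_number` and `second_number` are supplied, and the one-hop edge-list form of the neighborhood subgraph are all hypothetical.

```python
import random

def split_neighbors(neighbors, cached_nodes):
    """Step 102: classify neighbors as cached (in memory) or uncached (on disk)."""
    cached = [n for n in neighbors if n in cached_nodes]
    uncached = [n for n in neighbors if n not in cached_nodes]
    return cached, uncached

def sample_up_to(candidates, count, rng):
    """Sample `count` candidates, or all of them if fewer are available."""
    return rng.sample(candidates, min(count, len(candidates)))

def build_neighborhood(target, neighbors, cached_nodes,
                       first_number, second_number, rng):
    cached, uncached = split_neighbors(neighbors, cached_nodes)
    # Step 103, role of the second processor (e.g. GPU): sample from the cache.
    from_cache = sample_up_to(cached, first_number, rng)
    # Step 103, role of the first processor (e.g. CPU): sample from disk data.
    from_disk = sample_up_to(uncached, second_number, rng)
    # Step 104: merge both results into one-hop edges (neighbor -> target).
    return [(src, target) for src in from_cache + from_disk]

rng = random.Random(0)
cached_nodes = {0, 1, 2, 3}          # nodes of the loaded graph partitions
all_neighbors = [1, 2, 3, 7, 8]      # neighbors of target node 0
edges = build_neighborhood(0, all_neighbors, cached_nodes,
                           first_number=2, second_number=1, rng=rng)
```

In this toy case the resulting subgraph has three edges, two whose source nodes come from the cache and one whose source node would have to be read from disk.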
[0054] The GNN model training method provided in this embodiment of the invention can be applied to a single host computer, so that a single host computer can be used as a training device to train the GNN model. The first processor and the second processor mentioned above can be two different types of processors in the host computer, such as the first processor being a CPU and the second processor being a GPU. The type of the second processor is not limited to this, and can also be other acceleration or dedicated chips.
[0055] Training a GNN model involves large-scale graph data. To efficiently store and access this graph data, different storage media on the host machine can be used to store the graph data required during GNN model training. Specifically, before training begins, all the graph data required for training can be stored on the host disk.
[0056] The graph data includes complete graph structure data describing the graph's topology and node feature vectors describing the nodes. The complete graph structure data describes the nodes in the graph and the edges connecting them (which can be directed or undirected). In practical applications, the original storage format for storing graph structure data can be a data structure such as triples, describing the connection relationships between different pairs of nodes.
[0057] The complete graph structure data is complex and massive. During a single iteration of training a GNN model, the entire graph structure data is not input into the model all at once. Therefore, it is necessary to first perform graph partitioning on the complete graph structure data to divide it into multiple graph partitions (assuming a total of B partitions, B>1). Each graph partition is a different connected subgraph of the complete graph structure data; that is, each graph partition can be considered a subgraph of the complete graph structure data, and the multi-hop connections between nodes within the graph partition are preserved. Different graph partitions generally do not contain duplicate nodes and edges. Initially, the multiple graph partitions are stored on the host disk.
[0058] The following describes how to partition the complete graph structure data into B graph partitions. In outline, a breadth-first search (BFS) method can optionally be used to partition the complete graph structure data. Each resulting graph partition includes the nodes located within that partition and all incoming edges ending at those nodes.
[0059] Specifically, we can first randomly select a node from the complete graph structure data as the source node, and perform a breadth-first traversal along the incoming edges of the source node. When the number of traversed nodes reaches a preset threshold, the traversal stops, and we obtain a connected graph tile corresponding to the source node. Then, we randomly select another node as the source node and traverse in the same way until all nodes in the complete graph structure data have been traversed, at which point we will obtain several graph tiles.
[0060] Next, each obtained graph tile can be treated as a super node, and the above traversal process can be repeated: select a super node as the super source node, perform a breadth-first traversal along its incoming edges, and stop the traversal when the number of traversed super nodes reaches a preset threshold. At this point, a graph partition corresponding to the super source node is obtained. Repeating this process eventually yields B graph partitions. Here, for a graph tile x1, if a node in another graph tile x2 has an edge pointing to a node in graph tile x1, then graph tile x2 has an incoming edge pointing to graph tile x1.
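The two-level traversal described above can be sketched in Python as follows. This is a minimal illustration under assumptions: the graph is given as a hypothetical dictionary `in_neighbors` mapping each node to the sources of its incoming edges, the names `bfs_tiles` and `tile_in_neighbors` are invented for the example, and the number of partitions B results from the two thresholds rather than being set directly.

```python
import random
from collections import deque

def bfs_tiles(in_neighbors, num_nodes, tile_size, rng):
    """Group nodes 0..num_nodes-1 into tiles by BFS along incoming edges."""
    unvisited = set(range(num_nodes))
    tiles = []
    while unvisited:
        source = rng.choice(sorted(unvisited))   # random source node
        unvisited.discard(source)
        tile, queue = [source], deque([source])
        while queue and len(tile) < tile_size:
            node = queue.popleft()
            for pred in in_neighbors.get(node, []):
                if pred in unvisited and len(tile) < tile_size:
                    unvisited.discard(pred)
                    tile.append(pred)
                    queue.append(pred)
        tiles.append(tile)
    return tiles

def tile_in_neighbors(in_neighbors, tiles):
    """Tile x2 has an incoming edge to tile x1 if a node of x2 points into x1."""
    owner = {node: t for t, tile in enumerate(tiles) for node in tile}
    result = {t: set() for t in range(len(tiles))}
    for node, preds in in_neighbors.items():
        for pred in preds:
            if owner[pred] != owner[node]:
                result[owner[node]].add(owner[pred])
    return {t: sorted(s) for t, s in result.items()}

# Hypothetical toy graph: in_neighbors[v] lists nodes u with an edge u -> v.
in_neighbors = {0: [1], 1: [2], 2: [3], 3: [0], 4: [5], 5: [4]}
rng = random.Random(0)

tiles = bfs_tiles(in_neighbors, num_nodes=6, tile_size=3, rng=rng)
# Second level: treat each tile as a super node and repeat the traversal.
tile_graph = tile_in_neighbors(in_neighbors, tiles)
groups = bfs_tiles(tile_graph, num_nodes=len(tiles), tile_size=2, rng=rng)
partitions = [[n for t in group for n in tiles[t]] for group in groups]
```

Each node ends up in exactly one tile and each tile in exactly one partition, so the partitions do not share nodes.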
[0061] After dividing the complete graph structure data into B graph partitions, these B partitions can be consecutively numbered: Graph Partition 0, Graph Partition 1, ..., Graph Partition B-1. The node identifiers are then renumbered according to the order of the graph partitions, so that each partition contains consecutive node identifiers and the last node identifier of one partition is immediately followed by the first node identifier of the next. For example, assuming each graph partition contains 1000 nodes, the first node in Graph Partition 0 is represented as node 0 and the last node as node 999; the first node in Graph Partition 1 is represented as node 1000 and the last node as node 1999, and so on, with the last node identifier being node N-1, where N is the total number of nodes in the complete graph structure data.
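The renumbering can be sketched in Python as below. The function name `renumber_nodes` and the toy partitions are assumptions made for illustration, and each partition is assumed to be non-empty.

```python
def renumber_nodes(partitions):
    """Assign consecutive new identifiers following the partition order.

    Returns a mapping old_id -> new_id and, per partition, the inclusive
    (first_id, last_id) range of its new identifiers.
    """
    old_to_new = {}
    ranges = []
    next_id = 0
    for part in partitions:
        start = next_id
        for old_id in part:
            old_to_new[old_id] = next_id
            next_id += 1
        ranges.append((start, next_id - 1))
    return old_to_new, ranges

# Hypothetical partitions holding the original node identifiers.
partitions = [[42, 7, 13], [5, 99], [0]]
old_to_new, ranges = renumber_nodes(partitions)
```

One consequence of this layout is that the partition containing a node can be found from its identifier alone by comparing it against the per-partition ranges.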
[0062] By partitioning the complete graph structure data into graph partitions, the massive complete graph structure data can be split into B smaller graph partitions. Thus, during the GNN model training process, the graph partitions can be loaded as needed to input into the GNN model.
[0063] It is important to note that, corresponding to the training, testing, and validation processes of the GNN model, the nodes in the complete graph structure data are divided into training nodes for training, testing nodes for testing, and validation nodes for validation. The purpose type of each node can be set in advance.
[0064] When training a GNN model begins, multiple graph partitions corresponding to the complete graph structure data stored on the host disk, along with the node feature vectors corresponding to the nodes contained therein, can be loaded into the host's cache space as needed. In this embodiment of the invention, different memory resources on the host are used as cache spaces to cache multiple graph partitions and node feature vectors required during GNN model training. For example, host memory, GPU memory, etc., can be used to cache multiple graph partitions and node feature vectors, making full use of multiple different memory resources to cache more graph partitions and node feature vectors. In this way, during training, the reading of graph partitions and node feature vectors from the host disk can be reduced, and more can be read from the host's cache space, accelerating the reading speed of training data and thus improving the training efficiency of the GNN model.
[0065] The following first describes how graph partitions are obtained during GNN model training when the first cache space set up in the host memory is used to cache graph partitions.
[0066] The amount of graph data used to train the GNN model is large, making it impractical to load all the complete graph structure data stored on the host disk into the first cache space in the host memory at once. Therefore, in this embodiment of the invention, based on the training mechanism of the GNN model, the complete graph structure data is divided into B graph partitions, and different graph partitions can be loaded into the first cache space in batches.
[0067] In the first training mechanism, as mentioned above, the training process of a GNN model in one training epoch can include multiple training rounds, each of which includes multiple iterations (steps). Therefore, the target number of graph partitions (hereinafter referred to as K, K≥1) to be loaded in a training round can be predefined, and K graph partitions are loaded into the first cache space at the granularity of one training round. It should be noted that the total number R of training rounds in a training epoch can be predefined, but the number of iterations can differ between training rounds. This number depends on the total number of training nodes in the K graph partitions corresponding to the training round and on the number of seed nodes for neighborhood sampling set for one training iteration (referred to as batchsize). Batchsize is a preset constant.
[0068] Specifically, given a total of B graph partitions, for the current training round, K graph partitions can be randomly selected from the B partitions and loaded into the first cache space. In the next training round, K more graph partitions can be randomly selected from the remaining partitions and loaded into the first cache space. At this point, the K graph partitions loaded into the first cache space in the previous training round are deleted. The K graph partitions corresponding to different training rounds do not overlap.
[0069] As mentioned above, complete graph structure data contains nodes of different types for various purposes. During the training process of a GNN model, the first stage is "neighborhood sampling." In this stage, it is necessary to sample the neighboring nodes of a node, such as a training node. The number of training nodes contained in the K graph partitions corresponding to different training rounds is likely to be different. The number of iterations in a training round is M / batchsize, where M is the number of training nodes in the K graph partitions corresponding to that training round. Therefore, the number of iterations in different training rounds may vary.
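The round-level loading described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names are hypothetical, and rounding M / batchsize up (so that a final partial batch is still trained) is an assumption that the text does not state.

```python
import math
import random

def schedule_rounds(num_partitions, k, seed=None):
    """Split partition ids 0..B-1 into disjoint, randomly chosen groups of at most k.

    Each group stands for the K graph partitions loaded in one training round.
    """
    rng = random.Random(seed)
    ids = list(range(num_partitions))
    rng.shuffle(ids)
    return [ids[i:i + k] for i in range(0, num_partitions, k)]

def iterations_in_round(num_training_nodes, batch_size):
    """M / batchsize, rounded up (assumption: a partial last batch is kept)."""
    return math.ceil(num_training_nodes / batch_size)

rounds = schedule_rounds(num_partitions=6, k=2, seed=0)
steps = iterations_in_round(num_training_nodes=500, batch_size=100)
```

Because the partition ids are shuffled once and then sliced, the groups assigned to different rounds cannot overlap, which matches the non-overlap condition in the text.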
[0070] Accordingly, the number of multiple iterations of training process in step 101 above is actually the number of iterations contained in the current training round.
[0071] Of course, in the second training mechanism, the granularity of the training round can also be omitted, meaning that a training epoch directly includes a set number of iterations. In this case, the iterations can be divided into groups, with each group containing a set number of iterations. The number of graph partitions loaded for each group is the target number K, so that the corresponding K graph partitions can be loaded into the first cache space at the granularity of the group. However, in this case, the number of seed nodes for neighborhood sampling in each iteration may differ from group to group.
[0072] Regardless of the training mechanism mentioned above, the underlying principle is similar: at the smallest granularity, the GNN model is trained once per iteration.
[0073] After loading the K graph partitions into the first cache space via the first processor, the neighborhood sampling phase can be executed. For the K graph partitions loaded into the first cache space at this time, for each iteration of the multiple training processes, a corresponding number of training nodes can be randomly selected as seed nodes for use in the neighborhood sampling phase. The seed nodes selected in different iterations do not overlap. It can be understood that, for the first training mechanism, the number of seed nodes selected in each iteration is the set batch size. For the second training mechanism, the number of seed nodes selected in each iteration of the current iteration group is: the number of training nodes contained in the current K graph partitions / the total number of iterations in an iteration group.
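The two ways of choosing seed nodes per iteration can be sketched as below. The function names are hypothetical, and the handling of a remainder in the second mechanism (spreading leftover nodes over the first iterations) is an assumption made only so that every training node is used exactly once.

```python
import random

def batches_fixed_size(training_nodes, batch_size, seed=None):
    """First mechanism: every iteration takes `batch_size` seed nodes."""
    rng = random.Random(seed)
    nodes = list(training_nodes)
    rng.shuffle(nodes)
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]

def batches_fixed_count(training_nodes, num_iterations, seed=None):
    """Second mechanism: the iteration count is fixed, so each iteration takes
    roughly len(training_nodes) / num_iterations seed nodes."""
    rng = random.Random(seed)
    nodes = list(training_nodes)
    rng.shuffle(nodes)
    base, extra = divmod(len(nodes), num_iterations)
    batches, start = [], 0
    for i in range(num_iterations):
        size = base + (1 if i < extra else 0)  # assumption: spread the remainder
        batches.append(nodes[start:start + size])
        start += size
    return batches

fixed_size = batches_fixed_size(range(500), batch_size=100, seed=1)
fixed_count = batches_fixed_count(range(10), num_iterations=3, seed=1)
```

In both cases the nodes are shuffled once and sliced, so the seed nodes of different iterations do not overlap.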
[0074] In practice, we can first use each training node contained in the K graph partitions as a seed node for neighborhood sampling, regardless of which iteration the training node belongs to. After sampling, we then aggregate the neighborhood sampling results corresponding to each training node in the same iteration according to the iteration training process to which the training node belongs, and input them together into the GNN model.
[0075] In this embodiment of the invention, the neighborhood sampling process uses a training node as a seed node. Neighborhood sampling of this training node means using it as the source node to determine its surrounding neighboring nodes and sampling the required number of these neighboring nodes. Therefore, a neighborhood sampling strategy is provided in advance, indicating the range of neighboring nodes to be sampled for each seed node during the neighborhood sampling phase. This range may be one hop, two hops, or even multiple hops, and may also indicate the number of neighboring nodes to be sampled, such as how many to sample within one hop or how many within the second hop.
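A neighborhood sampling strategy of this kind (a hop range plus a number of nodes to sample per hop) can be sketched as follows. The adjacency list is a hypothetical toy graph, `fanouts` is an assumed name for the per-hop sample counts, and this single-process sampler ignores where the nodes are stored.

```python
import random

def sample_neighborhood(adj, seed_node, fanouts, rng=None):
    """Sample up to fanouts[h] nodes at hop h+1, expanding from the nodes
    sampled at the previous hop. Returns one list of sampled nodes per hop."""
    rng = rng or random.Random()
    visited = {seed_node}
    frontier = [seed_node]
    sampled_per_hop = []
    for fanout in fanouts:
        # Candidates: neighbors of the current frontier not yet visited.
        candidates = sorted({n for u in frontier for n in adj.get(u, ())} - visited)
        picked = rng.sample(candidates, min(fanout, len(candidates)))
        sampled_per_hop.append(picked)
        visited.update(picked)
        frontier = picked
    return sampled_per_hop

# Hypothetical toy graph: node 0 is the training (seed) node.
adj = {0: [1, 2, 3, 4], 1: [0, 5, 6], 2: [0, 7], 3: [0, 8, 9], 4: [0, 10, 11]}
hops = sample_neighborhood(adj, 0, fanouts=[2, 4], rng=random.Random(0))
```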
[0076] As explained in the definition of graph partitions above, a graph partition is a connected subgraph of a complete graph structure. Typically, a graph partition contains multiple nodes and the edges connecting these nodes (or, for a directed graph, the incoming edges). Therefore, a node within a graph partition may have neighbors one or more hops away within that same partition. Based on this, when a graph partition is loaded into the first cache space and a training node within that partition is used as a seed node for neighborhood sampling, its neighboring nodes are likely to be located within that graph partition; in other words, most of its neighboring nodes may be located in the first cache space. This allows most of the training node's neighboring nodes to be retrieved directly from the first cache space, while only a small number of neighboring nodes may not be cached in the first cache space, i.e., they may be located on the host disk. Thus, during the neighborhood sampling phase, more neighboring nodes are retrieved directly from the first cache space, which can significantly improve the execution efficiency of the neighborhood sampling phase and thereby contribute to improved training efficiency of the GNN model.
[0077] Therefore, for any target training node in the K graph partitions, the first processor determines the cached and uncached neighbor nodes corresponding to the target training node. The cached neighbor nodes are located in the first cache space, and the uncached neighbor nodes are located on the host disk. Then, the second processor samples a first number of neighbor nodes from the cached neighbor nodes. The first processor samples a second number of neighbor nodes from the uncached neighbor nodes and transmits the second number of neighbor nodes to the second processor. Thus, the second processor determines the neighborhood subgraph corresponding to the target training node based on the first and second number of neighbor nodes, that is, the subgraph that reflects the connection relationship between the sampled neighbor nodes and the target training node.
[0078] In practical applications, a neighborhood sampling strategy can be used to pre-store the neighbor node identifiers of each training node in the complete graph structure data (which can be stored on the host disk or in host memory). This allows the first processor to determine all neighbor nodes corresponding to the target training node. Then, based on the nodes contained in the K currently loaded graph partitions in the first cache space, it can determine which neighbor nodes are cached and which are not. It can be understood that different cache addresses in the first cache space store the node identifiers of each node loaded into the first cache space, and the second processor can query each cache address to obtain the cached neighbor nodes.
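Determining which neighbor nodes are cached can be sketched as a set-membership check. The function name is hypothetical, and a Python set stands in for querying the cache addresses of the first cache space; the example values follow the Figure 2 scenario in which nodes 8 and 11 are on the host disk.

```python
def split_neighbors(neighbor_ids, cached_node_ids):
    """Partition a node's neighbors into those present in the first cache
    space and those that are only on the host disk."""
    cached_set = set(cached_node_ids)
    cached = [n for n in neighbor_ids if n in cached_set]
    uncached = [n for n in neighbor_ids if n not in cached_set]
    return cached, uncached

neighbors_of_node_0 = list(range(1, 12))           # nodes 1-11
in_first_cache = [1, 2, 3, 4, 5, 6, 7, 9, 10]      # everything except 8 and 11
cached, uncached = split_neighbors(neighbors_of_node_0, in_first_cache)
```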
[0079] The sampling of cached and uncached neighbor nodes can be performed in parallel by the second and first processors. This allows for parallel sampling of the neighborhood by different processors, resulting in higher execution efficiency. It also improves the utilization of the second processor in the host.
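The parallel split between the two processors can be illustrated with two worker threads. This is only an analogy: the threads stand in for the second processor (sampling cached neighbors) and the first processor (sampling uncached neighbors), and all names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def sample_from(pool, count, seed):
    """Randomly sample up to `count` nodes from `pool`."""
    rng = random.Random(seed)
    return rng.sample(list(pool), min(count, len(pool)))

def parallel_sample(cached, uncached, n_cached, n_uncached, seed=0):
    """Sample cached and uncached neighbors concurrently, then merge.

    The first task plays the role of the second processor (cache side),
    the second task the role of the first processor (disk side)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_cached = pool.submit(sample_from, cached, n_cached, seed)
        fut_uncached = pool.submit(sample_from, uncached, n_uncached, seed + 1)
        return fut_cached.result() + fut_uncached.result()

merged = parallel_sample([5, 6, 7, 9, 10], [8], n_cached=3, n_uncached=1)
```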
[0080] Based on the above process, the second processor can obtain the neighborhood subgraphs corresponding to each training node in the K graph partitions. When training the GNN model at an iterative granularity, the second processor determines the neighborhood subgraphs corresponding to the multiple training nodes in a single iteration and then concatenates them to obtain the training subgraph corresponding to that iteration. Afterward, the node feature vectors of each node in this training subgraph are obtained, and the second processor inputs them together with the training subgraph into the GNN model for one iteration of training. It should be noted that although the seed node in the neighborhood sampling stage is a training node, not all sampled neighbor nodes are training nodes. That is, the purpose type of the sampled nodes is not limited. Therefore, the nodes contained in the training subgraph may include test nodes and / or validation nodes in addition to training nodes.
[0081] In this embodiment, the method for obtaining node feature vectors is not limited. Optionally, when the node feature vectors of all nodes are stored in the host disk, the first processor and / or the second processor can obtain the node feature vectors of each node in the training subgraph from the host disk. Alternatively, the node feature vectors can be obtained based on the caching mechanism described in other embodiments below.
[0082] The neighborhood sampling process described above is illustrated below with reference to Figure 2. Assume the neighborhood sampling strategy is as follows: each training node needs to sample 6 neighboring nodes, 2 within the one-hop range and 4 within the two-hop range. Figure 2 shows training node 0. When sampling neighbors using training node 0 as a seed node, the first step is to determine whether the four one-hop neighbor nodes 1-4 and the seven two-hop neighbor nodes 5-11 corresponding to training node 0 have been cached in the first cache space. Here, it is assumed that all of these nodes except nodes 8 and 11 (which are located on the host disk) have been loaded into the first cache space.
[0083] Since all one-hop neighbor nodes of training node 0 are in the first cache space, the second processor retrieves the aforementioned four one-hop neighbor nodes from the first cache space and randomly samples two of them, such as node 1 and node 3. After the one-hop neighbor nodes are sampled, they are used as new seed nodes to sample neighbor nodes within the two-hop range. Some of the two-hop neighbor nodes of training node 0 reached through node 1 and node 3 (here, node 8) are located on the host disk. In this case, as shown in Figure 2, even though the first cache space already holds 5 two-hop neighbor nodes, which is enough to satisfy the requirement of sampling 4 neighbor nodes within the two-hop range, in practical applications the sampling can optionally be performed from both the first cache space and the host disk rather than only from the 5 cached two-hop neighbor nodes. For example, the set number of neighbor nodes to be sampled within the two-hop range can be allocated according to the ratio of the number of neighbor nodes in the first cache space to the number of neighbor nodes on the host disk (5:1). In the example above, 3 two-hop neighbor nodes are sampled from the first cache space and 1 two-hop neighbor node is sampled from the host disk. Thus, the first processor samples node 8, the only candidate stored on the host disk, and the second processor samples 3 nodes from nodes 5-7 and nodes 9-10 stored in the first cache space, such as nodes 5-7. Finally, the 6 sampled neighbor nodes and training node 0 form the neighborhood subgraph shown in Figure 2.
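The ratio-based allocation in this example can be sketched as follows. The text only gives the 5:1 ratio and the resulting 3 + 1 split; the rounding rule used here (round the disk share to the nearest integer, then give the rest to the cache) is an assumption, and the function name is hypothetical.

```python
def allocate_samples(num_to_sample, num_cached, num_uncached):
    """Split a per-hop sample budget between cached and uncached candidates
    in proportion to how many candidates each storage space holds.

    Returns (from_cache, from_disk)."""
    total = num_cached + num_uncached
    num_to_sample = min(num_to_sample, total)
    if num_to_sample <= 0:
        return 0, 0
    # Proportional share for the disk side (assumption: nearest-integer rounding).
    from_disk = round(num_to_sample * num_uncached / total)
    # The disk must cover whatever the cache cannot supply, but never more
    # than the number of uncached candidates.
    from_disk = min(num_uncached, max(from_disk, num_to_sample - num_cached))
    from_cache = num_to_sample - from_disk
    return from_cache, from_disk

# Two-hop range in the Figure 2 example: 5 cached candidates, 1 on disk, 4 to sample.
split = allocate_samples(4, num_cached=5, num_uncached=1)
```

With 4 samples and a 5:1 ratio, the proportional shares are about 3.33 and 0.67, which round to the 3 and 1 used in the text.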
[0084] As illustrated by the example above, when sampling the neighborhood of a seed node within a certain range, optionally, if the seed node's neighboring nodes within that range are not all located in the same storage space (the storage spaces here being the first cache space and the host disk), that is, if the neighboring nodes within that range are distributed across the first cache space and the host disk, then the neighborhood sampling of the seed node within that range needs to be completed from both storage spaces. The purpose of this is to make fuller use of the seed node's neighboring node information within that range without overly restricting the randomness of the input samples of the GNN model, thus helping to ensure the training effect of the GNN model.
[0085] In summary, in this embodiment of the invention, since only the training nodes in the graph partition loaded in the first cache space are used as seed nodes for neighborhood sampling during the neighborhood sampling process, and the multi-hop connection relationship of the nodes contained in the graph partition is preserved, the neighbor nodes of most training nodes can hit the first cache space. In this way, the second processor can directly complete the neighborhood sampling of most training nodes in the first cache space with faster access speed, while only a small number of the remaining training nodes' neighbor nodes are sampled from the host disk by the first processor. Thus, through the parallel execution of the first and second processors, the overall neighborhood sampling speed can be accelerated, thereby helping to improve the training efficiency of the GNN model.
[0086] As mentioned in the above embodiments, after obtaining the neighborhood subgraph corresponding to a training node through the neighborhood sampling stage, or in other words, after obtaining the training subgraph composed of the multiple neighborhood subgraphs corresponding to the multiple training nodes in one iteration of training, it is necessary to obtain the node feature vectors of each node contained therein in order to input them into the GNN model for training. However, in practical applications, a node feature vector is usually a high-dimensional vector (e.g., 1024-dimensional). If the processor loads the node feature vector directly from the host disk every time it is needed, the disk I/O volume will be large. Since disk access speed is low, this will reduce model training efficiency. Therefore, this embodiment of the invention also provides a caching mechanism for node feature vectors to speed up their acquisition.
[0087] Since node feature vectors occupy a large amount of storage space, in order to cache more node feature vectors in the cache space, in this embodiment of the invention, the host memory and the memory of the second processor are used as the cache space for node feature vectors.
[0088] In practice, the host memory is divided into two cache spaces: the first cache space is used to cache graph partitions, and the second cache space is used to cache node feature vectors. Simultaneously, a third cache space located in the memory of the second processor can also be used to cache node feature vectors. This not only increases the cache space for node feature vectors but also improves the memory utilization of the second processor, thereby further improving the training efficiency of the GNN model.
[0089] Figure 3 is a flowchart for obtaining node feature vectors provided in an embodiment of the present invention. As shown in Figure 3, the method may include the following steps:
[0090] 301. The second processor queries the cache address lookup table to determine whether the node feature vectors corresponding to each node in the neighborhood subgraph are cached in the second cache space or the third cache space.
[0091] 302. For the first node whose node feature vector is cached, the second processor retrieves the node feature vector corresponding to the first node from the corresponding cache address.
[0092] 303. For the second node whose node feature vector is not cached, the first processor retrieves the corresponding node feature vector from the host disk.
[0093] 304. The first processor updates the second cache space and / or the third cache space according to the node feature vector corresponding to the second node.
[0094] In this embodiment, since the first cache space and the second cache space in the host memory are used to cache graph partitions and node feature vectors, respectively, in order to reduce the load on the first processor and balance the load on the first and second processors, optionally, a cache address lookup table used to store the cache addresses of node identifiers and their corresponding node feature vectors can be stored in the memory of the second processor. Of course, this cache address lookup table can also be stored in the host memory. Since node feature vectors can be cached in the second cache space and the third cache space, the cache addresses in the cache address lookup table include the cache addresses corresponding to these two cache spaces.
[0095] As explained in the previous embodiments regarding the neighborhood sampling stage, regardless of the training mechanism used, several neighborhood subgraphs corresponding to the training nodes will be obtained during the neighborhood sampling stage. When training the GNN model on a per-iteration basis, the multiple neighborhood subgraphs of multiple training nodes corresponding to one iteration are simply concatenated into a single training subgraph as input to the graph structure data of the GNN model. Therefore, in this embodiment, step 301 only describes the process of obtaining the node feature vectors of each node in the neighborhood subgraph corresponding to a single training node as an example. The process of obtaining the node feature vectors of each node in the training subgraph is similar, only the scope of acquisition is expanded.
[0096] Taking any of the aforementioned target training nodes as an example, when it is necessary to obtain the node feature vectors of each node in its corresponding neighborhood subgraph, the second processor first queries the cache address lookup table using the node identifier of each node in the neighborhood subgraph to determine whether the node feature vectors corresponding to each node in the neighborhood subgraph are cached in the second cache space or the third cache space. For the first node whose node feature vector is cached, the second processor obtains the node feature vector corresponding to the first node from the corresponding cache address. For the second node whose node feature vector is not cached, the first processor obtains the node feature vector corresponding to the second node from the host disk, and then transmits it to the second processor for aggregation. Thus, the first and second processors complete the acquisition of the node feature vectors of each node in the neighborhood subgraph in parallel, improving the execution efficiency of the node feature vector collection stage.
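The lookup flow of steps 301 to 303 can be sketched with plain dictionaries. All names are hypothetical, a cache address is simplified to the name of the cache space holding the vector, and the hit and miss paths run sequentially here rather than in parallel on two processors.

```python
def gather_features(node_ids, lookup_table, caches, read_from_disk):
    """Collect feature vectors for the nodes of a neighborhood subgraph.

    lookup_table: node id -> name of the cache space holding its vector.
    caches:       cache space name -> {node id: feature vector}.
    read_from_disk: callable used for nodes that miss both cache spaces.
    Returns (features, misses)."""
    features, misses = {}, []
    for node in node_ids:                      # step 301: query the lookup table
        space = lookup_table.get(node)
        if space is None:
            misses.append(node)
        else:                                  # step 302: cache hit
            features[node] = caches[space][node]
    for node in misses:                        # step 303: fall back to the disk
        features[node] = read_from_disk(node)
    return features, misses

caches = {"second": {1: [0.1]}, "third": {3: [0.3]}}
lookup_table = {1: "second", 3: "third"}
disk = {8: [0.8]}
features, misses = gather_features([1, 3, 8], lookup_table, caches, disk.__getitem__)
```

The returned `misses` list identifies the "second nodes" whose vectors are candidates for insertion into the cache in step 304.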
[0097] Understandably, the second and third cache spaces are initially empty. Therefore, if the acquisition of node feature vectors for the neighborhood subgraph corresponding to the target training node is the first such acquisition performed, none of the nodes in the neighborhood subgraph will hit the second or third cache space (i.e., all of these nodes are second nodes). Thus, the first processor acquires the node feature vectors of these nodes and stores them in the second and third cache spaces.
[0098] In one optional embodiment, the storage of these node feature vectors in the second and third cache spaces can be random. For example, for a given node feature vector, the first processor randomly determines a free cache address from the second and third cache spaces and stores it therein. Then, the corresponding node identifier and the cache address are sent to the second processor for writing into the cache address lookup table. It is understood that the node feature vector of the same node will not be stored repeatedly in the second and third cache spaces.
[0099] In another optional embodiment, as can be seen from the method of obtaining node feature vectors jointly by the first and second processors, the node feature vectors that hit the cache are all obtained by the second processor. However, as the data access pattern of the GNN model described above indicates, some nodes may be repeatedly used during a training epoch of the GNN model. Therefore, when storing node feature vectors in the second and third cache spaces, since the third cache space is closer to the second processor and has a higher access speed, the node feature vectors of nodes with higher access frequency can be preferentially stored in the third cache space, while the node feature vectors of nodes with the next highest access frequency are stored in the second cache space. Thus, in this embodiment, it is necessary to measure the access frequency or access heat of nodes in order to use the second and third cache spaces accordingly.
[0100] Furthermore, it's understandable that when the second and third cache spaces are full, if the first processor still needs to retrieve the node feature vector of a node in a neighborhood subgraph from the host disk, then updating the cached node feature vectors will also be involved: removing some cached node feature vectors with low access frequency and replacing them with newly retrieved node feature vectors with higher access frequency. Therefore, assuming the first processor retrieves the node feature vector of the second node from the host disk and finds that both the second and third cache spaces are full, then it needs to perform the update process for the second and third cache spaces based on the access frequency of the second node.
[0101] The calculation of the access popularity value of each node in the complete graph structure data can be performed after the complete graph structure data is divided into B graph partitions, and this calculation can be completed by the second processor. In this embodiment of the invention, the access popularity value of each node is represented by a node popularity matrix H (also referred to below as the node heat matrix). The node popularity matrix describes the access popularity value of each node in the complete graph structure data when neighborhood sampling is performed on the training nodes in each graph partition, where the graph partitions are the B graph partitions into which the complete graph structure data is divided. That is, the node popularity matrix H is a matrix with N rows and B columns, where the N rows correspond to the N nodes and the B columns correspond to the B graph partitions. Column $H_{*j}$ represents the access popularity values of the N nodes when performing neighborhood sampling on each training node in the j-th graph partition.
[0102] The calculation process of the node heat matrix H will be explained in detail below. Here, we first describe the update process of the second and third cache spaces after the node heat matrix H is known. Specifically, the first processor first determines the local node heat vector h based on the node heat matrix H. This local node heat vector h includes the local node heat value of each node in the complete graph structure data. The local node heat value of each node corresponds to the statistical value of the access heat value of each node in the complete graph structure data when performing neighborhood sampling with the training nodes in the K graph partitions currently loaded into the first cache space. This statistical value is, for example, a sum, average, or maximum value. Then, the first processor determines the local node heat value corresponding to the second node based on the local node heat vector h, and updates the second and / or third cache spaces according to the local node heat value corresponding to the second node.
[0103] Following the previous embodiment where K graph partitions are loaded into the first cache space, assuming the K graph partitions are graph partition 0 and graph partition 1, these two graph partitions correspond to the first and second columns in the node popularity matrix H, respectively. The first column represents the access popularity values of N nodes when performing neighborhood sampling on each training node in graph partition 0, and the second column represents the access popularity values of N nodes when performing neighborhood sampling on each training node in graph partition 1. For the same node v, its access popularity values in these two columns are summed (the above statistical values are illustrated using summation as an example) to obtain the local node popularity value of node v, that is, the access popularity statistics of node v relative to the two local graph partitions, graph partition 0 and graph partition 1. The same calculation is performed on each of the N nodes, ultimately resulting in a local node popularity vector h of dimension N. Then, the local node popularity value corresponding to the second node is extracted from the local node popularity vector h, and the second and third cache spaces are updated accordingly.
[0104] In one optional embodiment, the node heat matrix H can be multiplied by the vector $M = [M_0, M_1, \dots, M_{B-1}]^T$ to obtain the local node heat vector $h = HM$, where $M_i$ describes the i-th graph partition: if the i-th graph partition is one of the K graph partitions currently loaded into the first cache space, then $M_i = 1$; otherwise $M_i = 0$.
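The product of H with the 0/1 indicator vector M amounts to summing, for each node, the columns of the currently loaded graph partitions. A minimal pure-Python sketch, with a hypothetical 3-node, 3-partition matrix and an assumed function name:

```python
def local_heat_vector(H, loaded_partitions):
    """Compute h = H * M, where M[j] = 1 if graph partition j is currently
    loaded into the first cache space and 0 otherwise."""
    num_partitions = len(H[0]) if H else 0
    M = [1 if j in loaded_partitions else 0 for j in range(num_partitions)]
    return [sum(row[j] * M[j] for j in range(num_partitions)) for row in H]

# Hypothetical values: rows are nodes, columns are graph partitions.
H = [
    [3, 0, 1],
    [1, 2, 0],
    [0, 4, 5],
]
h = local_heat_vector(H, loaded_partitions={0, 1})
```

This corresponds to the summation variant of the statistic; an average or maximum over the loaded columns would need a different reduction.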
[0105] Specifically, the first processor determines the local node heat value corresponding to the second node based on the local node heat vector h, in order to update the second cache space and / or the third cache space, including:
[0106] If the second cache space and / or the third cache space are not full, the first processor stores the node feature vector corresponding to the second node into the not full cache space and sends the first address update information to the second processor. The first address update information is used to indicate that the cache address of the second node is added to the cache address lookup table.
[0107] If both the second and third cache spaces are full, the first processor removes the node feature vector corresponding to a third node from the second or third cache space and replaces it with the node feature vector corresponding to the second node. The first processor then sends second address update information to the second processor. The second address update information is used to indicate that the second node's cache address is to be added to the cache address lookup table and that the third node's cache address is to be deleted from the cache address lookup table. The local node heat value corresponding to the third node is lower than the local node heat value corresponding to the second node.
[0108] Optionally, if the second cache space is full but the third cache space is not full, the node feature vector of the second node can be stored in the third cache space; if the second cache space is not full but the third cache space is full, the node feature vector of the second node can be stored in the second cache space. If neither the second nor the third cache space is full, the feature vector can be randomly stored in one of the cache spaces.
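The update rules for a single second node can be sketched as below. All names are hypothetical. Two choices are assumptions rather than part of the text: when both cache spaces have free slots, this sketch prefers the third cache space instead of picking one at random, and when both are full it evicts the coldest cached node.

```python
def update_cache(node, feature, h, caches, capacity, lookup_table):
    """Insert one newly fetched feature vector into the second/third cache.

    h:            node id -> local node heat value.
    caches:       cache space name -> {node id: feature vector}.
    capacity:     cache space name -> maximum number of vectors.
    lookup_table: node id -> cache space name (the address lookup table).
    Returns the evicted node id, or None if nothing was evicted."""
    # Not-full case: store in a cache space that still has a free slot.
    for space in ("third", "second"):          # assumption: prefer the third space
        if len(caches[space]) < capacity[space]:
            caches[space][node] = feature
            lookup_table[node] = space         # first address update information
            return None
    if not lookup_table:
        return None                            # zero-capacity caches: nothing to do
    # Full case: replace a cached node that is colder than the new node.
    victim = min(lookup_table, key=lambda n: h[n])
    if h[victim] >= h[node]:
        return None                            # new node is not hotter: skip caching
    space = lookup_table.pop(victim)           # second address update information
    del caches[space][victim]
    caches[space][node] = feature
    lookup_table[node] = space
    return victim

caches = {"second": {}, "third": {}}
capacity = {"second": 1, "third": 1}
lookup_table = {}
h = {1: 5, 2: 3, 3: 9, 4: 1}
evicted = [update_cache(n, [float(n)], h, caches, capacity, lookup_table)
           for n in (1, 2, 3, 4)]
```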
[0109] Understandably, if there are multiple second nodes, and the number of free cache addresses in the second and third cache spaces is less than the number of second nodes, then alternatively, the multiple second nodes can be sorted according to their local node popularity value, and the second nodes with high local node popularity values can be stored in the free cache addresses first. The remaining second nodes can then be processed based on the situation where both the second and third cache spaces are full.
[0110] In an optional embodiment, the following principle can be established for updating the second and third cache spaces: the local node popularity value of a node in the third cache space is not lower than the local node popularity value of a node in the second cache space; that is, the node feature vectors of nodes with higher local node popularity values are stored in the third cache space. Under this principle, when the second and / or third cache spaces are not full, if the number of free cache addresses is greater than or equal to the total number of second nodes, the node feature vectors of these second nodes can be stored in both cache spaces. However, it is necessary to redetermine the cache addresses of the node feature vectors stored in the two cache spaces: according to the sorting of local node popularity values, the higher-ranked nodes fill the third cache space first, and the rest are stored in the second cache space. If the number of free cache addresses is less than the number of second nodes, these second nodes are sorted by their local node popularity values, and nodes with high local node popularity values equal to the number of free cache addresses are selected and stored in the cache space. This also involves updating the cache addresses of the node feature vectors stored in the two cache spaces.
[0111] When both the second and third cache spaces are full, optionally, these second nodes are sorted by their local node popularity values, and each is compared with the local node popularity values of the nodes already stored in the second and third cache spaces. The final result is that the node feature vectors of the nodes with the higher local node popularity values are stored in the cache space. In this case, the node feature vectors of some second nodes may not be stored in the cache space.
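One way to satisfy the principle that nodes in the third cache space are never colder than nodes in the second cache space is to re-rank all candidates together. This is a sketch under that assumption, with hypothetical names; it returns only the intended placement and leaves out the data movement and lookup-table updates.

```python
def place_by_heat(cached_nodes, new_nodes, h, cap_third, cap_second):
    """Rank already cached nodes and newly fetched nodes by local heat value.

    The hottest nodes fill the third cache space, the next hottest fill the
    second cache space, and the rest are left uncached.
    Returns (third, second, dropped)."""
    # Ties are broken by node id so the result is deterministic.
    ranked = sorted(set(cached_nodes) | set(new_nodes), key=lambda n: (-h[n], n))
    third = ranked[:cap_third]
    second = ranked[cap_third:cap_third + cap_second]
    dropped = ranked[cap_third + cap_second:]
    return third, second, dropped

h = {1: 5, 2: 3, 3: 9, 4: 1, 5: 7}
third, second, dropped = place_by_heat([1, 2], [3, 4, 5], h, cap_third=2, cap_second=2)
```

A full re-ranking covers both the not-full and the full case, at the cost of possibly relocating vectors that were already cached.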
[0112] It should be noted that the node feature caching strategies for the second and third cache spaces are not limited to the examples mentioned above. Other strategies that satisfy the requirement of storing more node feature vectors of nodes with higher local node popularity values in the two cache spaces mentioned above can also be applied.
[0113] Based on this strategy, the node feature vectors of nodes with higher access frequency can be retained in the second and third cache spaces for a longer period of time, thereby improving the speed of obtaining node feature vectors and helping to improve the efficiency of model training.
[0114] The process of generating the node heat matrix H is described below: Multiple node groups corresponding to the target graph partition are determined, where the number of training nodes in each node group is determined based on the number of training nodes required for one iteration. The number of node groups is equal to the number of iterations set in one training round. The target graph partition is any one of the multiple graph partitions. Neighborhood sampling is performed on each training node in the target node group to obtain a local subgraph composed of the neighborhood subgraphs of each training node in the target node group. The target node group is any one of the multiple node groups. The frequency of each node in the complete graph structure data in the multiple local subgraphs is determined as the access heat value of each node in the complete graph structure data when neighborhood sampling is performed on the training nodes in the target graph partition. The multiple local subgraphs correspond to multiple node groups. Based on the access heat values of each node in the complete graph structure data when neighborhood sampling is performed on the training nodes in the multiple graph partitions, a node heat matrix is generated and stored in the memory of the second processor.
[0115] In calculating the node heat matrix H, it is assumed that only one graph partition is randomly loaded in a training round. Taking graph partition f1 as an example, assuming that graph partition f1 contains 500 training nodes, and the number of seed nodes for neighborhood sampling in one iteration, i.e., batchsize = 100, then it can be determined that a training round corresponding to graph partition f1 contains 5 iterations, that is, the GNN model is trained through 5 iterations to complete the use of this graph partition f1. In this example, it is equivalent to dividing the 500 training nodes in graph partition f1 into five node groups, with each node group containing 100 training nodes. For each node group, 100 nodes can be randomly selected from these 500 training nodes, and there are no duplicate nodes between different node groups.
[0116] Next, for each node group, neighborhood sampling is performed on each training node contained within it. The neighborhood sampling process is described in the aforementioned related embodiments and will not be repeated here. For any target node group, each training node within it can obtain its corresponding neighborhood subgraph after neighborhood sampling. The neighborhood subgraphs of each training node in the target node group are concatenated together to obtain the local subgraph corresponding to this target node group. The same processing is performed on each node group, thereby obtaining multiple local subgraphs corresponding to multiple node groups (such as the five in the example above).
[0117] Next, for each of the N nodes in the complete graph structure data, the number of times the node appears in the multiple local subgraphs is determined in the order of the node identifiers, and these values form the column of the node heat matrix H corresponding to graph partition f1. Specifically, for any node v, assuming there are five local subgraphs, five occurrence counts are obtained for node v, one per local subgraph (a count of 0 is used for a local subgraph in which the node does not appear). These five occurrence counts are summed to obtain the access heat value of node v.
[0118] By performing the above processing on each of the B graph partitions, the B columns of the node heat matrix H are obtained, which together form the node heat matrix H. The element H_ij represents the access heat value of node i when neighborhood sampling is performed using the seed nodes in graph partition j.
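The construction of the node heat matrix H described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function names, the adjacency-list input `adj`, the `fanout` and `hops` parameters, and the choice to count a node once per neighborhood subgraph in which it appears are all assumptions made for the example.

```python
import random
from collections import Counter


def sample_neighborhood(adj, seed, fanout, hops, rng):
    """Return the node set of the sampled neighborhood subgraph of `seed`."""
    visited = {seed}
    frontier = [seed]
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            neighbors = adj.get(node, [])
            # Randomly keep at most `fanout` neighbors of each frontier node.
            next_frontier.extend(rng.sample(neighbors, min(fanout, len(neighbors))))
        visited.update(next_frontier)
        frontier = next_frontier
    return visited


def heat_column(adj, num_nodes, train_nodes, batch_size, fanout, hops, rng):
    """Access heat values of all nodes for one graph partition (one column of H)."""
    nodes = list(train_nodes)
    rng.shuffle(nodes)
    # Disjoint node groups, one per iteration of the training round.
    groups = [nodes[k:k + batch_size] for k in range(0, len(nodes), batch_size)]
    counts = Counter()
    for group in groups:
        # The local subgraph of a group is the concatenation of the
        # neighborhood subgraphs of its training nodes (assumed counting rule:
        # +1 for every neighborhood subgraph a node appears in).
        for seed in group:
            counts.update(sample_neighborhood(adj, seed, fanout, hops, rng))
    return [counts[v] for v in range(num_nodes)]


def heat_matrix(adj, num_nodes, partitions_train_nodes, batch_size, fanout, hops, seed=0):
    """Return H as a list of rows, where H[i][j] is the heat of node i for partition j."""
    rng = random.Random(seed)
    columns = [
        heat_column(adj, num_nodes, train_nodes, batch_size, fanout, hops, rng)
        for train_nodes in partitions_train_nodes
    ]
    return [[columns[j][i] for j in range(len(columns))] for i in range(num_nodes)]


# Toy graph with 4 nodes and 2 partitions; fanout exceeds every degree,
# so the sampling result does not depend on the random seed.
adj = {0: [1], 1: [2], 2: [0], 3: [0]}
H = heat_matrix(adj, 4, [[0, 1], [2, 3]], batch_size=1, fanout=2, hops=1)
```

In this toy setting each column of H sums the per-group occurrence counts, matching the summation over local subgraphs described in paragraph [0117].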
[0119] To provide a more intuitive understanding of the caching mechanism for graph structure data and node feature vectors in the embodiments of the present invention, an explanation is given below with reference to Figure 4.
[0120] Figure 4 is a schematic diagram illustrating the application of a graph neural network model training method provided in an embodiment of the present invention. As shown in Figure 4, the training of a GNN model can be completed by a single host. This host includes a CPU, a GPU, a host disk, host memory, and GPU memory. The host memory is further divided into a first cache space for storing graph partitions and a second cache space for storing node feature vectors. The GPU memory includes a third cache space for storing node feature vectors.
[0121] Before training the GNN model, the B graph partitions and node feature vectors corresponding to the complete graph structure data used in the training process can be stored on the host disk. In this way, when training the GNN model, the corresponding data can be read from the host disk and loaded into the cache space.
[0122] In practice, the training of the GNN model can be completed collaboratively by the CPU and GPU. The training process mainly includes neighborhood sampling, collection of node feature vectors, and inputting the neighborhood sampling results and the collected node feature vectors into the GNN model for training. When training the GNN model, one training epoch consists of multiple training rounds, and each training round includes multiple iterations. K graph partitions are loaded in one training round.
[0123] To improve the efficiency of neighborhood sampling, before performing neighborhood sampling, the CPU can first load the K graph partitions of the current training round into the first cache space in the host memory. Neighborhood sampling is then performed jointly by the GPU and CPU. The GPU is mainly used to sample neighbor nodes already cached in the first cache space, while the CPU is mainly used to sample neighbor nodes not cached in the first cache space. The neighborhood sampling process is described in the previous embodiments and will not be repeated here. Finally, for one iteration in the current training round, the GPU generates the training subgraph corresponding to the current iteration based on the neighbor nodes sampled by the CPU and the neighbor nodes sampled by the GPU.
[0124] For the node feature acquisition phase: The second cache space in host memory and the third cache space in GPU memory can be used to cache node feature vectors, allowing for the caching of more node feature vectors. Furthermore, after caching node feature vectors to the second or third cache space, the corresponding cache information is stored in a cache address lookup table, which is stored in GPU memory. For a single iteration in the current training round, the CPU can be used to acquire uncached node feature vectors in the training subgraph, while the GPU can be used to acquire cached node feature vectors in the training subgraph.
[0125] In addition, when starting training for the current training round, the CPU obtains the node heat matrix H and calculates the corresponding local node heat vector h based on the currently loaded K graph partitions. This local node heat vector h will serve as the basis for updating the node feature vectors stored in the second and third cache spaces in the current training round.
[0126] Finally, the GPU obtains the training subgraph corresponding to the current iteration and the node feature vectors of each node in the training subgraph, and then inputs them into the GNN model to complete one iteration of training.
[0127] Furthermore, the contents not described in detail in this embodiment and the technical effects that can be achieved can be found in the relevant descriptions in the above embodiments, and will not be repeated here.
[0128] As mentioned in the foregoing embodiments, the neighborhood sampling stage and the node feature vector acquisition stage can both be executed in parallel by the first processor and the second processor. In fact, besides these two stages, the second processor's computation stage of the GNN model (i.e., the stage of collecting the training subgraph corresponding to one iteration and its node feature vectors and inputting them into the GNN model for computation) and the first processor's update process of the cache space storing node feature vectors can also be performed in a parallel pipeline, as shown in Figure 5.
[0129] Figure 5 illustrates the execution pipelines of the first processor (CPU) and the second processor (GPU), where the horizontal axis represents time and the vertical axis represents the parallelism of the processors. S, F, U, and C represent the processes of neighborhood sampling, node feature vector acquisition, node feature vector update, and model computation, respectively. i represents the i-th iteration in a training round.
[0130] As shown in Figure 5, the CPU and GPU execute in parallel during the first time period as follows: the GPU can execute the model computation task for the (i-1)th iteration, the node feature vector acquisition task for the i-th iteration, and the neighborhood sampling task for the (i+1)th iteration in parallel; the CPU can execute the node feature vector acquisition task for the i-th iteration, the node feature vector update task for the i-th iteration, and the neighborhood sampling task for the (i+1)th iteration in parallel. The parallel execution during the second time period is similar and will not be elaborated further. Based on this, in practical applications, multi-threaded parallel task scheduling can be implemented for the CPU and GPU to improve model training efficiency.
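The task overlap within one time period can be written down as a small schedule table. The sketch below only enumerates which stage of which iteration each processor works on; the function name `stage_plan` and the handling of the first and last iterations (tasks for non-existent iterations are simply dropped) are assumptions made for illustration, and no real threads or GPU streams are launched.

```python
def stage_plan(i, num_iterations):
    """Tasks overlapped in the time period centered on iteration i.

    S = neighborhood sampling, F = node feature vector acquisition,
    U = node feature vector update, C = model computation.
    """
    plan = {
        "GPU": [("C", i - 1), ("F", i), ("S", i + 1)],
        "CPU": [("F", i), ("U", i), ("S", i + 1)],
    }
    # Assumed boundary rule: skip tasks whose iteration index does not exist.
    return {
        device: [(stage, it) for stage, it in tasks if 0 <= it < num_iterations]
        for device, tasks in plan.items()
    }


middle = stage_plan(1, 3)
first = stage_plan(0, 3)
```

Each entry pairs a stage letter with an iteration index, so `middle["GPU"]` lists the three GPU tasks that run concurrently while iteration 1 is being prepared.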
[0131] Figure 6 is a flowchart of a graph neural network (GNN) model training method provided in an embodiment of the present invention. The method is applied to a single host that trains a GNN model. The host includes a first processor and a second processor of different types, such as a CPU as the first processor and a GPU as the second processor. As shown in Figure 6, the method includes the following steps:
[0132] 601. Determine the offset array, index array, and pointer array corresponding to the complete graph structure data. Store the offset array and pointer array in the first cache space in the host memory. The index array and pointer array correspond to the set compressed storage format of the adjacency matrix of the complete graph structure data. The offset array is used to store the head node identifiers corresponding to each of the multiple graph partitions. The complete graph structure data is divided into multiple consecutive graph partitions, and the node identifiers in each graph partition are consecutive.
[0133] 602. For any graph partition i in the target number of graph partitions, the first processor determines the head and tail node identifiers corresponding to graph partition i based on the offset array, queries the pointer array based on the head and tail node identifiers to determine the first position segment corresponding to graph partition i in the index array, and loads the node identifiers of the first position segment in the index array into the first cache space. The target number of graph partitions is the graph partitions required for multiple iterations of training in one training round.
[0134] 603. For the target training node in graph partition i determined based on the above head and tail node identifiers, the first processor determines the second position segment corresponding to the target training node in the index array according to the pointer array. Based on the overlap relationship between the second position segment and the first position segment, the cached neighbor nodes and uncached neighbor nodes corresponding to the target training node in the neighborhood sampling process are determined. The node identifier in the index array located in the second position segment is the neighbor node in the neighborhood sampling process of the target training node. The cached neighbor node corresponds to the node identifier of the overlapping part of the first position segment and the second position segment. The cached neighbor node is located in the first cache space, and the uncached neighbor node is located on the host disk.
[0135] 604. The second processor samples a first number of neighboring nodes from the cached neighboring nodes, the first processor samples a second number of neighboring nodes from the uncached neighboring nodes, and the second number of neighboring nodes is transmitted to the second processor.
[0136] 605. The second processor determines the neighborhood subgraph corresponding to the target training node based on the first number of neighboring nodes and the second number of neighboring nodes, and inputs the neighborhood subgraph and the node feature vectors of each node in the neighborhood subgraph into the graph neural network model for training.
[0137] As mentioned above, after the complete graph structure data is partitioned, assuming B graph partitions are obtained, these partitions are numbered consecutively, such as graph partition 0, graph partition 1, ..., graph partition B-1. Furthermore, after partitioning, the nodes in the complete graph structure data can be renumbered so that nodes within the same graph partition have consecutive node identifiers (i.e., node numbers) and the node identifiers of adjacent graph partitions are also consecutive. That is, the head node identifier of the next graph partition immediately follows the tail node identifier of the previous graph partition. This facilitates compressed storage of the complete graph structure data and reduces storage space usage. For example, assuming each graph partition contains 1000 nodes and node identifiers start from node 0, the head node of graph partition 0 is node 0 and its tail node is node 999; the head node of the next graph partition 1 is node 1000 and its tail node is node 1999; and so on.
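The renumbering step can be sketched as follows. The input format (a list giving the partition index of each original node), the function name, and the rule that nodes keep their original relative order inside a partition are assumptions for illustration; the text above only requires that identifiers be consecutive within and across partitions.

```python
def renumber_by_partition(partition_of_node, num_partitions):
    """Assign new consecutive node identifiers, partition by partition.

    partition_of_node[v] is the partition index of original node v.
    Returns (new_id, heads): new_id[v] is the new identifier of original
    node v, and heads[p] is the head node identifier of partition p
    (with the total node count appended as the last element).
    """
    new_id = [None] * len(partition_of_node)
    heads = [0]
    next_id = 0
    for part in range(num_partitions):
        for old, p in enumerate(partition_of_node):
            if p == part:
                new_id[old] = next_id
                next_id += 1
        heads.append(next_id)
    return new_id, heads


new_id, heads = renumber_by_partition([1, 0, 1, 0, 2], num_partitions=3)
```

After renumbering, the tail node identifier of partition p is `heads[p + 1] - 1`, so each partition is fully described by two integers.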
[0138] In the aforementioned embodiments, for complete graph structure data, the graph structure data loaded into the first cache space in the host memory can be in its original format. For example, the graph structure data represented by the K graph partitions (the target number of graph partitions) corresponding to the current training round can be loaded into the first cache space. Loading graph structure data in its original format will occupy a significant amount of storage space. In this embodiment, to further reduce the storage space occupied by graph structure data, a compressed storage format is set for the graph structure data, thereby achieving compressed storage of the graph structure data. In practical applications, this compressed storage format can be, for example, a column-major compressed storage format (Compressed Sparse Column, CSC) or a row-major compressed storage format (Compressed Sparse Row, CSR).
[0139] Regardless of whether it's CSC or CSR format, compressed graph structure data will result in two arrays: an array of indices and an array of pointers (indptr). However, CSC format creates an array of row indices and an array of column pointers, while CSR format creates an array of column indices and an array of row pointers.
[0140] Since the two compression formats mentioned above are similar in principle, and experiments have shown that the CSC format has higher computational efficiency and lower performance overhead for graph structure data, the CSC format will be used as an example in this embodiment of the invention.
[0141] To compress complete graph structure data into CSC format, firstly, an adjacency matrix corresponding to the complete graph structure data can be generated. Then, based on this adjacency matrix, the row index array and column pointer array mentioned above are determined. The adjacency matrix describes whether connections exist between nodes. The dimension of the adjacency matrix depends on the total number of nodes in the complete graph structure data. For example, if there are N nodes, the adjacency matrix is an N*N matrix, where each row and column corresponds to a node identifier. For instance, the row indices are: node 0, node 1… node N-1, and similarly, the column indices are: node 0, node 1… node N-1.
[0142] The value of the element in the adjacency matrix indicating whether a connection exists between two nodes differs depending on whether the connecting edges in the complete graph structure data are undirected (i.e., undirected graph) or directed (i.e., directed graph). In general, for an undirected graph, if there is a connecting edge between any two nodes i and j, the values of the elements in row i, column j (representing the row corresponding to node i and the column corresponding to node j) and row j, column i (representing the row corresponding to node j and the column corresponding to node i) in the adjacency matrix are both 1. Conversely, if there is no connecting edge, the values of these two positions are both 0. For a directed graph, assuming there is an edge from node i to node j, but no edge from node j to node i, the value of the element in row i, column j in the adjacency matrix is 1, and the value of the element in row j, column i is 0.
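The difference between the undirected and directed cases can be illustrated with a short sketch. The function name and the edge-list input format are assumptions for this example; a dense list-of-lists matrix is used only for readability and would not be practical for large graphs.

```python
def build_adjacency(num_nodes, edges, directed):
    """Build an N*N 0/1 adjacency matrix from a list of (i, j) edges."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        matrix[i][j] = 1          # edge from node i to node j
        if not directed:
            matrix[j][i] = 1      # undirected: the entry is mirrored
    return matrix


undirected = build_adjacency(3, [(0, 1)], directed=False)
directed = build_adjacency(3, [(0, 1)], directed=True)
```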
[0143] After obtaining the adjacency matrix corresponding to the complete graph structure data using the above method, the adjacency matrix can be compressed in CSC format to obtain a row index array and a column pointer array. In fact, after this compression, a value array can also be obtained; however, in this embodiment of the invention, since it is assumed that the weights of the connecting edges in the complete graph structure data are all the same (e.g., all 1), the value array can be ignored.
[0144] Since the CSC format iterates through the non-zero elements of the adjacency matrix column-by-column, the row index array stores the row index (node identifier) corresponding to the non-zero element value in each column (i.e., value 1 in the example above). Therefore, the dimension of the row index array is the number of edges in the complete graph structure. The column pointer array stores the position index of the first non-zero element value in each column (i.e., value 1 in the example above) within the row index array. The dimension of the column pointer array is N+1, where N represents the total number of nodes in the complete graph structure. Both arrays start with an index of 0.
[0145] In addition, in this embodiment of the invention, besides the two arrays of the CSC format mentioned above, an offset array `offsets` corresponding to the complete graph structure data is also generated. This offset array stores the head node identifier of each of the B graph partitions of the complete graph structure data, and can be represented as `offsets = [f_0 = 0, f_1, ..., f_B = N]`, where N is the total number of nodes in the complete graph structure data and the node identifier range of the i-th partition is from f_i to f_(i+1) - 1. The last element f_B = N represents the total number of nodes contained in the complete graph structure data. Based on this, the tail node identifier of the last graph partition of the complete graph structure data can be determined: node N-1.
[0146] For example, suppose there are three graph partitions. The node identifier range for partition 0 is: node 0 - node 1000; the node identifier range for partition 1 is: node 1001 - node 1500; and the node identifier range for partition 2 is: node 1501 - node 2500. Then, f0 = 0 means that the head node of partition 0 is node 0; f1 = 1001 means that the head node of partition 1 is node 1001; f2 = 1501 means that the head node of partition 2 is node 1501; and f3 = 2501 represents the total number of nodes in the complete graph structure data.
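Using the three-partition example above, the offset array supports two simple lookups: the identifier range of a partition, and the partition to which a node belongs. The helper names below are hypothetical, and the binary search via the standard `bisect` module is one possible way to perform the second lookup, not something the embodiment prescribes.

```python
from bisect import bisect_right

# Offset array from the example: heads of partitions 0..2, then N = 2501.
offsets = [0, 1001, 1501, 2501]


def partition_range(offsets, part):
    """Return (head, tail) node identifiers of a partition, both inclusive."""
    return offsets[part], offsets[part + 1] - 1


def partition_of(offsets, node_id):
    """Return the index of the partition that contains node_id."""
    if not 0 <= node_id < offsets[-1]:
        raise ValueError("node identifier out of range")
    return bisect_right(offsets, node_id) - 1


range_of_partition_1 = partition_range(offsets, 1)
owner_of_node_1000 = partition_of(offsets, 1000)
```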
[0147] It should be noted that when the total number of nodes N in the complete graph structure data is stored in another way, the last element f_B = N of the above offset array can be omitted.
[0148] To facilitate understanding of the row index array and column pointer array corresponding to the complete graph structure data in CSC format, an illustration is given below with reference to Figure 7. Assume the complete graph structure contains three nodes: 0, 1, and 2, with the following connections: node 0 points to node 2, node 2 points to node 1, and node 1 points to node 0. Based on this, the adjacency matrix for this graph structure is [0 0 1, 1 0 0, 0 1 0], where 0 0 1 represents the element values of the row corresponding to node 0 (corresponding to the three columns of nodes 0, 1, and 2 respectively). Similarly, 1 0 0 represents the element values of the row corresponding to node 1, and 0 1 0 represents the element values of the row corresponding to node 2. Based on this adjacency matrix, a row index array [1 2 0] and a column pointer array [0 1 2 3] can be generated. The generation process can be understood by referring to existing related technologies, and will not be elaborated here.
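For readers who want to reproduce the two arrays of the three-node example, the following sketch performs the column-by-column traversal that the CSC format implies. The function name is hypothetical and the dense matrix input is used only for illustration; the value array is omitted because all edge weights are assumed equal, as stated earlier.

```python
def adjacency_to_csc(matrix):
    """Compress a square 0/1 adjacency matrix into CSC arrays.

    Returns (indices, indptr): the row index array and the column
    pointer array (length N + 1, both zero-based).
    """
    n = len(matrix)
    indices = []
    indptr = [0]
    for col in range(n):                 # traverse column by column
        for row in range(n):
            if matrix[row][col] != 0:
                indices.append(row)      # row index of a non-zero element
        indptr.append(len(indices))      # start position of the next column
    return indices, indptr


adjacency = [
    [0, 0, 1],  # node 0 points to node 2
    [1, 0, 0],  # node 1 points to node 0
    [0, 1, 0],  # node 2 points to node 1
]
indices, indptr = adjacency_to_csc(adjacency)
```

With this layout, the slice `indices[indptr[c]:indptr[c + 1]]` holds the nodes that have an edge pointing to node c.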
[0149] After obtaining the offset array, row index array, and column pointer array corresponding to the complete graph structure data through the above method, since the data types of the element values in the offset array and column pointer array are only simple numeric types and occupy relatively little storage space, the offset array and column pointer array can be stored in the first cache space allocated in the host memory. However, since the row index array stores non-numeric data such as node identifiers, it occupies more storage space, so the row index array is stored in the host disk.
[0150] In practical applications, optionally, the generation process of the above three arrays can be executed by the first processor, or by other types of processors in the host. Then, based on the above three arrays, the first processor and the second processor can jointly complete the "neighborhood sampling" process involved in the training of the GNN model.
[0151] For the training process of the current training round (including multiple iterative training processes) of the GNN model, first, the first processor determines K graph partitions used in the current training round from the B graph partitions obtained by partitioning the complete graph structure data, where K < B. In practical applications, K graph partitions can be randomly selected. For example, if the B graph partitions include graph partition 0 to graph partition 4 and K = 2, graph partition 1 and graph partition 4 can be randomly selected as the graph partitions used in the current training round. Then, the first processor loads these K graph partitions into the first cache space.
[0152] It should be noted that in the foregoing alternative embodiment, the first processor can directly load the original graph structure data corresponding to these K graph partitions into the first cache space. At this time, each node included in the K graph partitions will be loaded into the first cache space. However, in this embodiment, when the first processor loads the K graph partitions based on the above three arrays, based on the definition of the row index array, actually, the first processor loads the incoming edges corresponding to the nodes in the K graph partitions into the first cache space. And an incoming edge can be represented by the identifier of the head node of this edge. Therefore, it can be said that the incoming edge nodes of the nodes in the K graph partitions are loaded into the first cache space. Taking the example of node a pointing to node b, assuming that node b is included in the K graph partitions, then its incoming edge node is node a. Thus, it can be seen that actually, at this time, the loading of the K graph partitions is to load the one-hop neighbor nodes of the nodes in the K graph partitions into the first cache space. Then it can be understood that if the above node b has no incoming edge in its belonging graph partition, that is, for its belonging graph partition, node b belongs to a "leaf node", then the node a connected by its incoming edge will belong to another graph partition. If the graph partition to which node a belongs is not included in the above K graph partitions, it means that not only the nodes in the K graph partitions are loaded into the first cache space at this time, but also some nodes not located in the K graph partitions may be included. Generally speaking, in the loading result of the K graph partitions into the first cache space implemented based on the above three arrays, it may include most of the nodes in the K graph partitions, or it may also include some nodes in other graph partitions.
[0153] The following describes how the first processor loads the K graph partitions into the first cache space based on the above three arrays. Since the loading process is the same for every graph partition, an arbitrary graph partition i is taken as an example.
[0154] First, the first processor determines the head and tail node identifiers corresponding to graph partition i according to the offset array. Then, according to the head and tail node identifiers, the pointer array is queried to determine the first position segment corresponding to graph partition i in the row index array, and the node identifiers located in the first position segment in the index array are loaded into the first cache space.
[0155] To facilitate understanding, let's illustrate with an example. Suppose a complete graph structure contains 12 nodes, labeled as node 0, node 1, ..., node 11. Also, assume this complete graph structure is divided into 5 partitions: partition 0, partition 1, ..., partition 4. Furthermore, assume the following three arrays are generated based on this complete graph structure:
[0156] Offset array offsets = [0, 2, 5, 8, 10, 12]
[0157] The column pointer array indptr = [0,1,3,5,7,8,9,12,12,15,15,17,19]
[0158] Row index array indices = [2,0,3,4,5,0,2,5,3,3,5,9,4,5,7,9,11,7,9]
[0159] Now, assuming we need to load graph partition 1 into the first cache space, this can be achieved through the following process:
[0160] First, the head node identifier and tail node identifier corresponding to graph partition 1 are determined based on the offsets array: offsets[1] = 2, offsets[1+1] = 5. Here, the subscript 1 in offsets[1] is the position index of graph partition 1 in the offsets array (as mentioned above, the position indices in each array start from 0 and increment sequentially). Based on this, the head node identifier of graph partition 1 can be obtained: node 2. offsets[1+1] represents the head node identifier of graph partition 2, the partition following graph partition 1. Since the head node identifier of graph partition 2 differs from the tail node identifier of graph partition 1 by exactly one, the tail node identifier of graph partition 1 is actually node 4. However, for ease of operation, optionally, offsets[1+1] = 5, i.e., node 5, can be treated as the tail node identifier of graph partition 1. In that case, the last node identifier in the position segment determined in the row index array is finally discarded (i.e., the head is included and the tail is excluded).
[0161] Then, based on the determination results corresponding to the head node identifier and the tail node identifier, the indptr array is queried to determine the start and end indexes of the first position segment corresponding to the graph partition 1 in the indices array: indptr[2] = 3, indptr[5] = 8, so the first position segment is indices[3:8], which represents the segment from position index 3 to position index 8 in the indices array.
[0162] Finally, load the element values (i.e., node identifiers) corresponding to the above first-position segment in the indices array [4, 5, 0, 2, 5] into the first cache space, which is the loading result of graph partition 1 into the first cache space. It should be noted that as described above, when querying the head and tail node identifiers of graph partition 1 based on the offsets array at the beginning, the head node identifier of its next graph partition is used as the tail node identifier of graph partition 1, which is equivalent to including one more node. Therefore, when determining the first-position segment indices[3:8] and reading the element values from the corresponding position indices, the last element value is discarded, which is equivalent to only reading the element values corresponding to the five position indices indices[3:7] to be loaded into the first cache space.
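The loading procedure for graph partition 1 can be traced with the three example arrays. In the sketch below, the function names are hypothetical and the row index array is held in an in-memory list purely for illustration; in the embodiment it resides on the host disk, and only the selected segment is read into the first cache space. Python slices are already head-inclusive and tail-exclusive, so no element needs to be discarded explicitly.

```python
offsets = [0, 2, 5, 8, 10, 12]
indptr = [0, 1, 3, 5, 7, 8, 9, 12, 12, 15, 15, 17, 19]
indices = [2, 0, 3, 4, 5, 0, 2, 5, 3, 3, 5, 9, 4, 5, 7, 9, 11, 7, 9]


def first_position_segment(offsets, indptr, part):
    """Return (start, end) of the partition's segment in the indices array.

    The segment is half-open: start is included, end is excluded.
    """
    head = offsets[part]                # head node identifier of the partition
    next_head = offsets[part + 1]       # head node identifier of the next partition
    return indptr[head], indptr[next_head]


def load_partition(offsets, indptr, indices, part):
    """Return the node identifiers to be placed in the first cache space."""
    start, end = first_position_segment(offsets, indptr, part)
    return indices[start:end]


segment = first_position_segment(offsets, indptr, 1)
cached_ids = load_partition(offsets, indptr, indices, 1)
```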
[0163] After the K graph partitions are loaded into the first cache space in the above manner, the neighborhood sampling process can be performed with the training nodes included in the K graph partitions as seed nodes. As mentioned before, the complete graph structure data includes nodes for different purposes such as training nodes, test nodes, and validation nodes. In the neighborhood sampling stage, only the training nodes are used as seed nodes for neighborhood sampling. Among them, the purpose type of each node in the complete graph structure data is known, that is, the first processor can determine which nodes in the K graph partitions are training nodes based on the setting results of these node types.
[0164] Still taking any graph partition i as an example, since the neighborhood sampling process for each training node included in it is similar, an arbitrary target training node included in it is taken as an example. First, the first processor determines the second-position segment corresponding to the target training node in the indices array according to the indptr array, and determines the cached neighbor nodes and uncached neighbor nodes corresponding to the target training node in the neighborhood sampling process according to the overlapping relationship between the second-position segment and the above first-position segment. Among them, the node identifiers located in the second-position segment in the indices array are the neighbor nodes in the neighborhood sampling process of the target training node. The cached neighbor nodes correspond to the node identifiers of the overlapping part between the first-position segment and the second-position segment, and the cached neighbor nodes are located in the first cache space, while the uncached neighbor nodes are located on the host disk.
[0165] For the sake of understanding, continue with the above example. Suppose graph partition i is the above graph partition 1. For the node 3 (which is a training node) included in it, because the node identifier "3" ≥ offsets[1] = 2 and "3" < offsets[1 + 1] = 5, it can be known that node 3 is in graph partition 1, and graph partition 1 has been loaded into the first cache space, so it can be known that node 3 is already in the first cache space. In practical applications, this judgment process can be omitted.
[0166] When neighborhood sampling of node 3 is required, the first processor determines the neighboring nodes of node 3 as follows: indices[indptr[3]:indptr[3+1]], i.e., indices[5:7]. With the head-inclusive, tail-exclusive convention described above, the second position segment corresponding to the neighboring nodes of node 3 in the indices array consists of position indices 5 and 6, whose element values are node 0 and node 2. It is then determined that indices[5:7] is contained within the range of indices[3:8], i.e., the overlapping part is indices[5:7]. Since the node identifiers corresponding to indices[3:8] have already been loaded into the first cache space, it is determined that the neighboring nodes of node 3 are all "cached neighboring nodes." Therefore, the second processor can directly obtain these cached neighboring nodes from the first cache space and then sample the required neighboring nodes based on the set sampling strategy. This sampling strategy is, for example, to randomly sample a set number of nodes.
[0167] Understandably, at this point, the first processor can notify the second processor of the corresponding cache location of the aforementioned cached neighbor node in the first cache space, enabling it to obtain the cached neighbor node and complete the sampling.
[0168] It should be noted that, in one optional embodiment, once it has been determined whether node 3 is already in the first cache space, the subsequent determination of the overlap relationship between the first and second position segments does not need to be performed. That is, when it is determined that node 3 is in the first cache space, it can be directly determined that its corresponding second position segment has been loaded into the first cache space. In another optional embodiment, if the determination of whether node 3 is already in the first cache space is not performed initially, the subsequent determination of the overlap relationship between the first and second position segments is required to determine whether the neighboring nodes of node 3 have been loaded into the first cache space. In this case, the first position segment also needs to be saved in the first cache space for this determination.
[0169] In the example above, since all of node 3's neighboring nodes are located in the first cache space, the neighborhood sampling action is performed only by the second processor. However, in reality, if some of node 3's neighboring nodes are located in the first cache space and the other part are not located in the first cache space, that is, located in the host disk, then the second processor samples the first number of neighboring nodes from the first cache space, and the first processor samples the second number of neighboring nodes from the host disk.
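The overlap test between the second position segment of a node and the cached first position segments can be sketched as below, reusing the example arrays. The function name and the representation of the cache as a list of half-open (start, end) segments are assumptions for illustration; how the first number and second number of sampled neighbors are chosen is not covered by this sketch.

```python
indptr = [0, 1, 3, 5, 7, 8, 9, 12, 12, 15, 15, 17, 19]
indices = [2, 0, 3, 4, 5, 0, 2, 5, 3, 3, 5, 9, 4, 5, 7, 9, 11, 7, 9]


def split_neighbors(indptr, indices, node, cached_segments):
    """Split the neighbors of `node` into cached and uncached lists.

    cached_segments holds the half-open (start, end) first position
    segments of the partitions currently in the first cache space.
    """
    start, end = indptr[node], indptr[node + 1]   # second position segment
    cached, uncached = [], []
    for pos in range(start, end):
        if any(lo <= pos < hi for lo, hi in cached_segments):
            cached.append(indices[pos])     # would be sampled by the second processor
        else:
            uncached.append(indices[pos])   # would be sampled by the first processor
    return cached, uncached


# Only graph partition 1 is cached, i.e. segment indices[3:8].
cached_3, uncached_3 = split_neighbors(indptr, indices, 3, [(3, 8)])
cached_6, uncached_6 = split_neighbors(indptr, indices, 6, [(3, 8)])
```

Node 3 lies in the cached partition, so all of its neighbors fall in the cached list, while node 6 belongs to a partition that was not loaded and its neighbors would have to be read from the host disk.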
[0170] After completing the neighborhood sampling of the target training node in the above manner, a neighborhood subgraph corresponding to the target training node can be formed based on the neighbor nodes obtained from the neighborhood sampling. This neighborhood subgraph reflects the connection relationship between the target training node and the sampled neighbor nodes.
[0171] In an optional embodiment, as described above, training one epoch of the GNN model can include multiple consecutive training rounds, each of which includes multiple consecutive iterations. Based on this, for the current training round, after loading the K graph partitions into the first cache space, the first processor can determine the actual number of iterations S in the current training round based on the set batchsize (the number of seed nodes (i.e., training nodes) to be sampled in each iteration) and the total number of training nodes contained in the K graph partitions. For example, if the K graph partitions contain a total of 1000 training nodes and the batchsize is 100, then 10 iterations will be executed in the current training round.
[0172] For each iteration, multiple training nodes are randomly sampled from K graph partitions. The number of samples is the batch size, and the sampled training nodes are unique across different iterations. For each training node in any iteration, neighborhood sampling is performed as described above to obtain a neighborhood subgraph for each training node. Then, the neighborhood subgraphs of all training nodes in this iteration are concatenated to obtain the training subgraph for this iteration (i.e., the concatenation result of multiple neighborhood subgraphs). The node feature vectors of all nodes contained in this training subgraph are then obtained. The training subgraph and the node feature vectors of each node are input into the GNN model by the second processor for one iteration of training. A similar process is performed for each iteration in the current training round, ultimately completing one training round of the GNN model.
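The division of the training nodes of the K loaded graph partitions into per-iteration batches can be sketched as follows. The function name is hypothetical, and rounding the number of iterations up when the node count is not a multiple of the batch size is an assumption; the example in the text only covers the evenly divisible case.

```python
import math
import random


def plan_iterations(train_nodes, batch_size, seed=0):
    """Shuffle the training nodes and cut them into disjoint batches."""
    rng = random.Random(seed)
    nodes = list(train_nodes)
    rng.shuffle(nodes)                                  # random selection of seed nodes
    num_iterations = math.ceil(len(nodes) / batch_size)
    return [
        nodes[k * batch_size:(k + 1) * batch_size]      # no node is reused across iterations
        for k in range(num_iterations)
    ]


batches = plan_iterations(range(1000), batch_size=100)
```

With 1000 training nodes and a batch size of 100 this yields the 10 iterations mentioned in paragraph [0171], and every training node is used as a seed node exactly once in the round.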
[0173] The process of obtaining the node feature vectors of each node in a training subgraph can be referred to the relevant descriptions in the aforementioned embodiments, and will not be repeated here.
[0174] In this embodiment, by sampling and compressing the complete graph structure data for storage, the space occupied by the first cache when caching K graph partitions can be reduced. Moreover, the calculation based on the numerical data in the compressed array will be faster, thereby improving the training efficiency of the GNN model.
[0175] In the above embodiments of the present invention, a first cache space for caching graph structure data and a second and third cache space for caching node feature vectors are configured in the host. The first and second cache spaces are located in the host memory, while the third cache space is located in the memory of the second processor (e.g., GPU memory). Taking GPU memory as an example, when the available memory capacity in the host memory and GPU memory for the aforementioned caching has been set based on different host types, it is understood that all available GPU memory can be configured as the third cache space, while the available host memory needs to be configured as both the first and second cache spaces. In this case, it is necessary to determine how much of the available host memory is allocated to the first cache space and how much is allocated to the second cache space.
[0176] In practical applications, the proportion of the first cache space and the second cache space in the available host memory capacity can be set empirically, or a more reasonable ratio can be calculated based on some calculation strategies.
[0177] As mentioned above, the cached data in the aforementioned cache space is used in the neighborhood sampling stage and the node feature vector collection stage during the GNN model training process. The fundamental purpose of using these cache spaces is to reduce the amount of data access to the host disk. Therefore, this embodiment of the invention provides the following calculation strategy for determining the capacity of the first cache space and the second cache space:
[0178] In simulated scenarios in which different numbers of graph partitions are loaded in a training round, determine the first disk data access volume of the node feature vector acquisition phase and the second disk data access volume of the neighborhood sampling phase over one training epoch of the GNN model. A training epoch includes multiple training rounds, and a training round includes multiple training iterations.
[0179] Determine the target number K that minimizes the total disk data access volume, where the total disk data access volume is the sum of the first disk data access volume and the second disk data access volume;
[0180] The capacity of the first and second cache spaces is determined based on the target quantity K.
[0181] Optionally, determining the capacity of the first cache space and the second cache space based on the target quantity K includes: obtaining the first available memory capacity M_H of the host memory; determining the capacity of the first cache space to be K*G/B based on the graph data volume G, the total number of graph partitions B into which the complete graph structure data is divided, and the target quantity K; and determining the capacity of the second cache space to be M_H - K*G/B based on the first available memory capacity M_H and the capacity of the first cache space. The graph data includes the complete graph structure data and the node feature vectors of all nodes.
[0182] Regarding the third cache space, the second available memory capacity M_G of the second processor can be obtained, and this second available memory capacity M_G is determined to be the capacity of the third cache space.
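The capacity split described above can be sketched as a small helper. The function and parameter names are hypothetical, and all quantities are assumed to be expressed in the same unit (for example, bytes or GiB).

```python
def cache_capacities(m_h: float, m_g: float, g: float, b: int, k: int):
    """Split available memory into the three cache spaces.

    m_h: first available memory capacity M_H (host memory)
    m_g: second available memory capacity M_G (second processor memory)
    g:   graph data volume G
    b:   total number of graph partitions B
    k:   target quantity K of partitions loaded per training round
    """
    first = k * g / b      # first cache space: K*G/B
    second = m_h - first   # second cache space: M_H - K*G/B
    if second < 0:
        raise ValueError("K partitions do not fit in the available host memory")
    third = m_g            # third cache space: all of M_G
    return first, second, third
```

For instance, with M_H = 64, M_G = 16, G = 100, B = 10 and K = 2 (illustrative numbers only), the three capacities are 20, 44 and 16.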
[0183] As described above, determining the target number K that minimizes the total disk data access is crucial for determining the capacity of the first and second cache spaces. This target number K needs to be determined through multiple simulations. Before actually starting GNN model training, a training round is simulated with different numbers of graph partitions loaded. For each simulated value k of the number of graph partitions loaded (e.g., k1, k2, etc.), a total disk data access is calculated. From these simulations, the target number K that minimizes the total disk data access is selected, for example, K = k2.
[0184] Based on this, it is necessary to calculate the first disk data access volume corresponding to the node feature vector acquisition stage and the second disk data access volume corresponding to the neighborhood sampling process during the simulated training process of the GNN model for each simulated value k, and then sum the two to obtain the total disk data access volume corresponding to the simulated value k.
[0185] The following sections describe the process for determining the first disk data access volume during the node feature vector acquisition phase and the second disk data access volume during the neighborhood sampling phase, respectively, in a simulated scenario of loading k graph partitions. These two determination processes can be performed by the first processor in the host or by other processors in the host; no specific limitation is made.
[0186] Figure 8 is a flowchart of a method for determining the disk data access volume during the node feature vector acquisition phase, as provided in this embodiment of the invention. As shown in Figure 8, the method includes the following steps:
[0187] 801. Based on the total number of graph partitions into which the complete graph structure data is divided, the number of graph partitions loaded k, and the number of nodes whose node feature vectors are cached in each training round, determine the first item of disk data access volume. The node feature vectors are cached in the second cache space or the third cache space. The number of nodes whose node feature vectors are cached in one training round is determined based on the first available memory capacity of the host memory, the second available memory capacity of the second processor, the total number of graph partitions, the number of graph partitions loaded k, the amount of graph data, and the amount of data corresponding to one node feature vector. The graph data includes the complete graph structure data and the node feature vectors of all nodes.
[0188] 802. Determine the second item, disk data access volume, based on the sum of the elements in the node heat matrix.
[0189] 803. Based on the local node heat vectors corresponding to the k graph partitions in each training round, determine the total local node heat value corresponding to multiple target nodes in each training round, and determine the third item, disk data access volume, based on the sum of the total local node heat values corresponding to all training rounds. Here, the multiple target nodes in each training round are the nodes whose node feature vectors in the neighborhood subgraphs formed in that training round are cached in the second or third cache space.
[0190] 804. Based on the first, second, and third disk data access volumes, determine the first disk data access volume during the node feature vector acquisition phase.
[0191] For ease of description, assume that, when one training epoch is simulated with k graph partitions loaded in each training round, the first disk data access volume of the node feature vector acquisition phase of the GNN model is denoted D_F. Then D_F = E[N_F] × s_f, where s_f represents the amount of data corresponding to one node feature vector, in units such as bytes, and E[N_F] represents the expected number of node feature vectors that need to be read from the host disk within one training epoch.
[0192] Optionally, E[N_F] can be calculated with reference to the following process:
[0193]
[0194] In other words, as the derivation of E[N_F] proceeds, it can ultimately be expressed by the three items of disk data access volume shown in the last line. It can be understood that these three items correspond to the three items in the preceding line, which will not be elaborated further.
[0195] In the above calculation formula, R represents the number of training rounds contained in a training epoch, and r represents any one of those training rounds. V represents the set of nodes corresponding to the graph data, and v represents any node within it. C_r represents the set of nodes whose node feature vectors are cached in the r-th training round, and |C_r| represents the size of that set, i.e., the number of nodes it contains. h_r represents the local node heat vector corresponding to the k graph partitions loaded in the r-th training round, and h_r(v) represents the local node heat value corresponding to node v in that local node heat vector.
[0196] As can be seen from the above calculation formula, to calculate E[N_F], it is necessary to calculate the above three items of disk data access volume separately.
[0197] Regarding the calculation of the first item, disk data access volume:
[0198] In the formulas for this item, B represents the total number of graph partitions into which the complete graph structure data is divided, M_H represents the first available memory capacity of the aforementioned host memory, M_G represents the second available memory capacity of the second processor, k represents the number of graph partitions loaded in a training round in the current simulation, and G represents the amount of graph data.
[0199] Regarding the calculation of the second item, disk data access volume:
[0200]
[0201] Where b represents a graph partition loaded in the r-th training round, H represents the node heat matrix calculated in the aforementioned embodiment, H_b represents the column in the node heat matrix corresponding to graph partition b, and sum(H) represents the sum of the elements in the node heat matrix.
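One plausible way to write out the relationship behind the second item is sketched below. It is a reading of the surrounding definitions rather than a reproduction of the original formula. It assumes that the local node heat vector h_r is the sum of the columns of H for the partitions loaded in round r, and that each of the B partitions is loaded in exactly one round of the epoch. The symbol P_r, for the set of partitions loaded in round r, is introduced here only for illustration.

```latex
\sum_{r=1}^{R} \sum_{v \in V} h_r(v)
  = \sum_{r=1}^{R} \sum_{b \in P_r} \sum_{v \in V} H_b(v)
  = \operatorname{sum}(H)
```

Under these assumptions the second item does not depend on which partitions are grouped into which round, which is consistent with the statement that it is determined by the sum of the elements of the node heat matrix.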
[0202] Based on the above calculation process for the first two items of disk data access volume, it can be seen that these two calculations are both mathematical formulas. However, the calculation of the third item, disk data access volume, can be achieved by averaging multiple simulations, where:
[0203]
[0204] For the same training round (e.g., the r-th training round), k graph partitions can be randomly selected from the B graph partitions in one simulation, and another k graph partitions can be randomly selected from the B graph partitions in another simulation. Taking these two simulations as an example, each simulation ultimately yields a value for the third item of disk data access volume, and the average of the two values is taken as the final third item. It can be understood that, because the k graph partitions loaded in the r-th training round differ between simulations, the local node heat vector h_r corresponding to the r-th training round also differs between simulations and must be recalculated based on the k graph partitions actually loaded in the current simulation and the node heat matrix H. Of course, in practical applications, the number of simulations can be reduced to one.
[0205] Since the execution of each simulation process is the same, we will only take one simulation process as an example to illustrate the calculation of the corresponding third item, disk data access volume.
[0206] As shown in the above formula, to calculate the third item of disk data access volume, first determine, for the r-th training round, the total local node heat value corresponding to multiple target nodes in that round based on the local node heat vector corresponding to the k graph partitions in that round. These target nodes are the nodes in the neighborhood subgraphs formed in the r-th training round whose node feature vectors are cached in the second or third cache space, i.e., the nodes belonging to C_r. Each target node has a corresponding local node heat value in the local node heat vector h_r, and summing the local node heat values of the multiple target nodes yields the total local node heat value for the r-th training round. The sum of the total local node heat values over all training rounds is then used as the third item of disk data access volume.
[0207] After the above three items of disk data access volume are obtained, the aforementioned formula yields the first disk data access volume of the node feature vector acquisition phase when one training epoch of the GNN model is simulated with k graph partitions loaded in each training round.
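The third item for a single simulation can be sketched as follows. The data layout (a node heat matrix stored as a list of rows, one row per node and one column per partition) and the function names are assumptions for illustration. The local node heat vector is taken here to be the per-node sum over the loaded columns, which is also an assumption about the statistic used.

```python
def local_heat_vector(heat_matrix, partitions):
    """Sum, per node, the heat-matrix columns of the loaded partitions."""
    return [sum(row[b] for b in partitions) for row in heat_matrix]


def third_item(heat_matrix, rounds, cached_sets):
    """Sum over rounds r of the local heat values of the nodes in C_r.

    rounds:      list of partition-index lists, one list per training round
    cached_sets: list of node-id sets C_r, aligned with `rounds`
    """
    total = 0
    for partitions, cached in zip(rounds, cached_sets):
        h_r = local_heat_vector(heat_matrix, partitions)
        total += sum(h_r[v] for v in cached)
    return total


if __name__ == "__main__":
    # Toy data: 4 nodes, 3 partitions, 2 training rounds.
    H = [[1, 0, 2], [0, 3, 1], [4, 1, 0], [2, 2, 2]]
    print(third_item(H, rounds=[[0, 1], [2]], cached_sets=[{0, 2}, {3}]))
```

When several simulations are run, the same function would be called once per simulation with the randomly selected partitions, and the results averaged.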
[0208] Figure 9 is a flowchart of a method for determining the disk data access volume during the neighborhood sampling phase, as provided in an embodiment of the present invention. As shown in Figure 9, the method includes the following steps:
[0209] 901. Obtain the pre-generated neighbor heat matrix. The neighbor heat matrix describes, for each node in the complete graph structure data, the total number of its neighbor nodes obtained when neighborhood sampling is performed on the training nodes in each graph partition, where the graph partitions are the multiple graph partitions into which the complete graph structure data is divided. One column in the neighbor heat matrix corresponds to one graph partition.
[0210] 902. Determine the local neighbor heat vectors corresponding to the k graph partitions in the target training round. The local neighbor heat vectors include the local neighbor heat values of each node in the complete graph structure data. The local neighbor heat value of each node corresponds to the statistical value of the total number of neighbor nodes of each node in the complete graph structure data when the training nodes in the k graph partitions in the target training round are sampled for neighborhood. The target training round is any training round.
[0211] 903. For multiple nodes that are not located in the k graph partitions, the sum of the local neighbor heat values of these multiple nodes is determined based on the local neighbor heat vector, and this sum is used as the disk data access volume corresponding to the neighborhood sampling phase of the target training round.
[0212] 904. Based on the disk data access volume corresponding to each training round in the neighborhood sampling phase, determine the second disk data access volume in the neighborhood sampling phase.
[0213] In this embodiment, when one training epoch of the GNN model is simulated with k graph partitions loaded in each training round, the second disk data access volume of the neighborhood sampling phase is denoted D_S. It represents the number of accesses in which data must be read from the host disk for sampling during the neighborhood sampling phase. Its calculation can be expressed by the following formula:
[0214]
[0215] Where T represents the neighbor heat matrix, and T_b represents the column in the neighbor heat matrix corresponding to the b-th graph partition among the k graph partitions of the r-th training round. The meanings of the other symbols are as described in the preceding embodiments.
[0216] The calculation process of the neighbor heat matrix T is similar to that of the node heat matrix H mentioned above and will be explained in detail in the following embodiments. Here, the calculation process of the second disk data access volume is introduced first.
[0217] Similar to the calculation of the third item in the first disk data access volume described above, D_S is also calculated by taking the average over multiple simulations.
[0218] Specifically, for the same training round (e.g., the r-th training round), k graph partitions can be randomly selected from the B graph partitions in one simulation, and another k graph partitions can be randomly selected from the B graph partitions in another simulation. Taking these two simulations as an example, each simulation ultimately yields a second disk data access volume, and the average of the two values is taken as the final second disk data access volume D_S. It can be understood that, because the k graph partitions loaded in the r-th training round differ between simulations, the local neighbor heat vector corresponding to the r-th training round also differs between simulations and must be recalculated based on the k graph partitions actually loaded in the current simulation and the neighbor heat matrix T. Of course, in practical applications, the number of simulations can be reduced to one. The calculation process of the local neighbor heat vector is similar to that of the local node heat vector in the aforementioned embodiment and will be explained in detail later.
[0219] Since each simulation process is executed identically, we will only use one simulation process as an example to illustrate the calculation of the corresponding second disk data access volume.
[0220] As can be seen from the above formula, in calculating the second disk data access volume, the disk data access volume of the neighborhood sampling phase is first calculated for each training round, and the disk data access volumes of all training rounds in a training epoch are then accumulated to obtain the total second disk data access volume.
[0221] For any target training round, the calculation process for the disk data access volume during the neighborhood sampling phase is as follows:
[0222] First, determine the local neighbor heat vector corresponding to the k graph partitions in the target training round. This local neighbor heat vector includes the local neighbor heat value of each node in the complete graph structure data. The local neighbor heat value of a node is a statistic of the total number of that node's neighbor nodes obtained when neighborhood sampling is performed on the training nodes in the k graph partitions of the target training round. This statistic is, for example, a sum.
[0223] Specifically, the neighbor heat matrix T is an N*B dimension matrix, where N is the total number of nodes in the complete graph structure data, and B is the total number of graph partitions into which the complete graph structure data is divided. The B columns of this matrix correspond to the B graph partitions, and the N elements in any column represent the neighbor heat values of each of the N nodes during multiple iterations of a training round, using a training node in the corresponding graph partition as a seed node.
[0224] For the k graph partitions corresponding to the target training round, the corresponding k columns can be determined from the neighbor heat matrix T. Then, the element values of these k columns corresponding to the same node are summed together to obtain the local neighbor heat value of that node. Finally, the local neighbor heat values corresponding to N nodes constitute the local neighbor heat vector corresponding to the k graph partitions in the target training round.
[0225] Subsequently, for multiple nodes that are not located in the k graph partitions corresponding to the target training round, the sum of the local neighbor heat values of these multiple nodes is determined based on the local neighbor heat vector, and used as the disk data access volume corresponding to the neighborhood sampling stage of the target training round.
[0226] After completing the disk data access volume corresponding to the neighborhood sampling phase in all training rounds through the above process, the second disk data access volume is obtained by summing them up.
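Steps 902 to 904 can be sketched for a single simulation as follows. The layout of the neighbor heat matrix T (a list of N rows with B columns), the `partition_nodes` mapping from partition index to node identifiers, and the function name are assumptions for illustration.

```python
def sampling_disk_access(neighbor_heat, rounds, partition_nodes):
    """Second disk data access volume D_S for one simulated epoch.

    neighbor_heat:   N x B matrix T as a list of rows
    rounds:          list of partition-index lists, one list per training round
    partition_nodes: partition_nodes[b] lists the node ids stored in partition b
    """
    total = 0
    for partitions in rounds:
        # Local neighbor heat vector: per-node sum of the k loaded columns.
        t_r = [sum(row[b] for b in partitions) for row in neighbor_heat]
        # Nodes inside the loaded partitions are served from the first cache.
        loaded = set()
        for b in partitions:
            loaded.update(partition_nodes[b])
        # Only nodes outside the loaded partitions cause disk accesses.
        total += sum(heat for v, heat in enumerate(t_r) if v not in loaded)
    return total


if __name__ == "__main__":
    T = [[2, 0], [1, 1], [0, 3], [1, 2]]
    print(sampling_disk_access(T, [[0], [1]], [[0, 1], [2, 3]]))
```

Loading both partitions in a single round leaves no node outside the cache, so the sketch returns zero in that case.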
[0227] After obtaining the first disk data access volume and the second disk data access volume corresponding to the simulated scenario of loading k graph partitions in a training round through the above embodiments, the two are summed to obtain the total disk data access volume. For different values of k, the total disk data access volume obtained under different simulated scenarios of loading different numbers of graph partitions in a training round can be obtained. From these, the value of k with the minimum total disk data access volume can be determined as the target quantity K used in the aforementioned embodiments.
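The selection of the target quantity K can be sketched as a simple search over candidate values of k. The two callables stand in for the simulated first and second disk data access volumes; their names and signatures are hypothetical.

```python
from typing import Callable, Iterable, Tuple


def select_target_k(
    candidates: Iterable[int],
    feature_access: Callable[[int], float],
    sampling_access: Callable[[int], float],
) -> Tuple[int, float]:
    """Return the k with the smallest total disk data access volume."""
    best_k, best_total = None, None
    for k in candidates:
        # Total = first (feature acquisition) + second (neighborhood sampling).
        total = feature_access(k) + sampling_access(k)
        if best_total is None or total < best_total:
            best_k, best_total = k, total
    if best_k is None:
        raise ValueError("no candidate values of k were supplied")
    return best_k, best_total
```

A toy call such as `select_target_k([1, 2, 4, 5, 10], lambda k: 100 * k, lambda k: 1000 / k)` illustrates the trade-off: loading more partitions lowers sampling-phase disk access but leaves less host memory for feature caching.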
[0228] Therefore, the capacity of the first and second cache spaces determined based on the target number K can better reduce the amount of access to the host disk during the actual training of the GNN model, so that the neighborhood sampling and node feature vector collection processes access the cache space in the host memory and the second processor memory more often, thereby improving the data acquisition speed.
[0229] The following describes the process of generating the neighbor heat matrix, which includes the following steps:
[0230] Multiple node groups corresponding to the target graph partition are determined, wherein the number of training nodes in each node group is determined according to the number of training nodes required for one iteration, and the number of node groups is equal to the number of iterations in one training round, and the target graph partition is any one of the multiple graph partitions;
[0231] Neighborhood sampling is performed on each training node in the target node group to obtain a target local subgraph composed of the neighborhood subgraphs of the training nodes in the target node group. The target node group can be any one of the multiple node groups.
[0232] Determine the total number of neighbor nodes of each node in the complete graph structure data in multiple local subgraphs, and use this as a column in the neighbor heat matrix corresponding to the target graph partition. Multiple local subgraphs correspond to multiple node groups.
[0233] Similar to the calculation process of the node heat matrix, the calculation of the neighbor heat matrix also assumes that only one graph partition is randomly loaded in a training round. Taking graph partition f1 as an example, assuming that graph partition f1 contains 500 training nodes, and the number of seed nodes for neighborhood sampling in one iteration, i.e., batchsize = 100, then it can be determined that a training round corresponding to graph partition f1 contains 5 iterations. That is, the GNN model is trained through 5 iterations to complete the use of graph partition f1. In this example, it is equivalent to dividing the 500 training nodes in graph partition f1 into five node groups, with each node group containing 100 training nodes. For each node group, 100 nodes can be randomly selected from these 500 training nodes, and there are no duplicate nodes between different node groups.
[0234] Next, for each node group, neighborhood sampling is performed on each training node contained within it. The neighborhood sampling process is described in the aforementioned related embodiments and will not be repeated here. For any target node group, each training node within it can obtain its corresponding neighborhood subgraph after neighborhood sampling. The neighborhood subgraphs of each training node in the target node group are concatenated together to obtain the local subgraph corresponding to this target node group. The same processing is performed on each node group, thereby obtaining multiple local subgraphs corresponding to multiple node groups (such as the five in the example above).
[0235] Next, for each of the N nodes in the complete graph structure data, taken in order of node identifier, the total number of neighbor nodes of that node across the multiple local subgraphs is determined, and the resulting N totals are used as the column in the neighbor heat matrix corresponding to graph partition f1. Specifically, for any node v, assuming there are five local subgraphs, a neighbor count can be determined for node v in each of the five local subgraphs. These five neighbor counts are summed to obtain the total number of neighbor nodes corresponding to node v. It should be noted that, firstly, deduplication of neighbor nodes is not required during the counting and summing process; secondly, if a local subgraph does not contain node v, or contains no other nodes connected to node v (in the case of a directed graph, this means nodes pointing to node v), then the number of neighbor nodes corresponding to node v in that local subgraph is zero.
[0236] By performing the above processing on each of the B graph partitions, the B columns of the neighbor heat matrix are obtained, which together constitute the neighbor heat matrix.
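The counting step for one column can be sketched as follows. The representation of each local subgraph as a list of directed `(src, dst)` edges, where `src` is a sampled neighbor pointing at `dst`, is an assumption made for illustration; the neighborhood sampling that produces the subgraphs is not shown.

```python
def neighbor_heat_column(num_nodes, local_subgraphs):
    """Build the neighbor-heat column for one graph partition.

    num_nodes:       N, the number of nodes in the complete graph
    local_subgraphs: one edge list per node group (i.e., per iteration)
    """
    column = [0] * num_nodes
    for edges in local_subgraphs:
        for _src, dst in edges:
            # Each incoming edge adds one neighbor for dst; no deduplication.
            column[dst] += 1
    # Nodes absent from every local subgraph keep a count of zero.
    return column


if __name__ == "__main__":
    subgraphs = [[(1, 0), (2, 0), (3, 1)], [(1, 0), (0, 2)]]
    print(neighbor_heat_column(4, subgraphs))
```

Repeating this for each of the B partitions and placing the resulting columns side by side gives an N x B matrix.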
[0237] The following will describe in detail one or more embodiments of the graph neural network model training apparatus of the present invention. Those skilled in the art will understand that these graph neural network model training apparatuses can be configured using commercially available hardware components through the steps taught in this solution.
[0238] Figure 10 is a schematic diagram of a graph neural network model training device provided in an embodiment of the present invention. As shown in Figure 10, the device includes: a loading module 11, a determination module 12, a sampling module 13, and a training module 14.
[0239] The loading module 11 is used to load a target number of graph partitions into a first cache space in the host memory through a first processor. The target number of graph partitions are graph partitions required for multiple iterations of training. Each graph partition is a different connected subgraph of the complete graph structure data, and the complete graph structure data is stored in the host disk.
[0240] The determination module 12 is used to determine the cached neighbor nodes and uncached neighbor nodes corresponding to the neighborhood sampling process of the target training node through the first processor. The cached neighbor nodes are located in the first cache space, and the uncached neighbor nodes are located on the host disk. The target training node is any training node in the target number of graph partitions.
[0241] The sampling module 13 is used to sample a first number of neighbor nodes from the cached neighbor nodes through the second processor, sample a second number of neighbor nodes from the uncached neighbor nodes through the first processor, and transmit the second number of neighbor nodes to the second processor, wherein the second processor is of a different type than the first processor.
[0242] The training module 14 is used to determine the neighborhood subgraph corresponding to the target training node by the second processor based on the first number of neighbor nodes and the second number of neighbor nodes, and input the neighborhood subgraph and the node feature vectors of each node in the neighborhood subgraph into the graph neural network model for training.
[0243] Optionally, the determining module 12 is further configured to: sample multiple training nodes corresponding to the current iteration in the multiple iterations from the target number of graph partitions using the first processor, wherein the multiple training nodes include the target training node. Therefore, the training module 14 is configured to: obtain a training subgraph generated from multiple neighborhood subgraphs corresponding to the multiple training nodes using the second processor, and input the training subgraph and the node feature vectors of each node in the training subgraph into the graph neural network model for training in the current iteration.
[0244] Optionally, the host further includes a second cache space located in the host memory and a third cache space located in the memory of the second processor, wherein the memory of the second processor stores a cache address lookup table. Based on this, the device further includes: a node feature acquisition module, configured to query the cache address lookup table through the second processor to determine whether the node feature vector corresponding to each node in the neighborhood subgraph is cached in the second cache space or the third cache space; for a first node whose node feature vector is cached, the second processor retrieves the node feature vector corresponding to the first node from the corresponding cache address; for a second node whose node feature vector is not cached, the first processor retrieves the node feature vector corresponding to the second node from the host disk; and the first processor updates the second cache space and / or the third cache space based on the node feature vector corresponding to the second node.
[0245] Optionally, the node feature acquisition module is specifically used to: obtain a pre-generated node heat matrix through the first processor, the node heat matrix being used to describe the access heat value of each node in the complete graph structure data when performing neighborhood sampling with training nodes in each graph partition, each graph partition being each of all the multiple graph partitions into which the complete graph structure data is divided; determine a local node heat vector through the first processor based on the node heat matrix, the local node heat vector including the local node heat value of each node in the complete graph structure data, wherein the local node heat value of each node corresponds to the statistical value of the access heat value of each node in the complete graph structure data when performing neighborhood sampling with training nodes in the target number of graph partitions; determine the local node heat value corresponding to the second node through the first processor based on the local node heat vector, so as to update the second cache space and / or the third cache space according to the local node heat value corresponding to the second node.
[0246] Optionally, the node feature acquisition module is specifically used for: if the second cache space and / or the third cache space are not full, then the first processor stores the node feature vector corresponding to the second node into the not-full cache space and sends first address update information to the second processor, the first address update information being used to instruct the cache address of the second node to be added to the cache address lookup table; if both the second cache space and the third cache space are full, then the first processor removes the node feature vector corresponding to the third node in the second cache space and the third cache space, replaces it with the node feature vector corresponding to the second node, and sends second address update information to the second processor, the second address update information being used to instruct the cache address of the second node to be added to the cache address lookup table, and the cache address of the third node to be deleted from the cache address lookup table, wherein the local node heat value corresponding to the third node is lower than the local node heat value corresponding to the second node.
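The update rule above can be sketched with plain dictionaries. The cache names `"gpu"` and `"host"` (for the third and second cache spaces), the preference for filling the GPU cache first, and the choice of the coldest cached node as the eviction victim are all assumptions for illustration; the text only requires that the evicted node have a lower local node heat value than the incoming node.

```python
def update_cache(caches, capacities, lookup, heat, node, vector):
    """Insert a node feature vector fetched from disk into the caches.

    caches:     {"gpu": {node: vector}, "host": {node: vector}}
    capacities: maximum number of vectors per cache
    lookup:     cache address lookup table, node -> cache name
    heat:       local node heat value per node
    Returns the evicted node id, or None if nothing was evicted.
    """
    for name in ("gpu", "host"):
        if len(caches[name]) < capacities[name]:
            caches[name][node] = vector
            lookup[node] = name      # first address update: add entry
            return None
    if not lookup:
        return None                  # zero-capacity caches: nothing to do
    victim = min(lookup, key=lambda v: heat[v])
    if heat[victim] >= heat[node]:
        return None                  # no colder node to replace
    name = lookup.pop(victim)        # second address update: delete entry
    del caches[name][victim]
    caches[name][node] = vector
    lookup[node] = name              # second address update: add entry
    return victim
```

In a real system the lookup table would live in the second processor's memory and be updated through the address update messages; here it is an ordinary dictionary.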
[0247] Optionally, the local node heat value of a node in the third cache space is not lower than the local node heat value of a node in the second cache space.
[0248] Optionally, the apparatus further includes: a matrix calculation module, configured to determine multiple node groups corresponding to a target graph partition, wherein the number of training nodes in each node group is determined based on the number of training nodes required for one iteration, the number of node groups is equal to the number of iterations in one training round, and the target graph partition is any one of the multiple graph partitions; performing neighborhood sampling on each training node in the target node group to obtain a local subgraph composed of the neighborhood subgraphs of each training node in the target node group, wherein the target node group is any one of the multiple node groups; determining the number of times each node in the complete graph structure data appears in the multiple local subgraphs as the access heat value of each node in the complete graph structure data when performing neighborhood sampling on the training nodes in the target graph partition, wherein the multiple local subgraphs correspond to the multiple node groups; generating the node heat matrix based on the access heat values of each node in the complete graph structure data when performing neighborhood sampling on the training nodes in the multiple graph partitions respectively, and storing it in the memory of the second processor.
[0249] Optionally, the complete graph structure data is divided into multiple consecutive graph partitions, and the node identifiers in each graph partition are consecutive. Accordingly, the loading module 11 is specifically used to: determine the offset array, index array, and pointer array corresponding to the complete graph structure data; store the offset array and the pointer array in the first cache space, wherein the index array and the pointer array correspond to a preset compressed storage format of the adjacency matrix of the complete graph structure data, and the offset array is used to store the head node identifier of each of the multiple graph partitions; for any graph partition among the target number of graph partitions, determine the head and tail node identifiers of that graph partition based on the offset array; query the pointer array based on the head and tail node identifiers to determine the first position segment corresponding to that graph partition in the index array; and load the node identifiers located in the first position segment of the index array into the first cache space.
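As a non-limiting illustration, the lookup of the first position segment in paragraph [0249] may be sketched as follows. The sketch assumes a CSR-like layout in which the pointer array has one entry per node plus a final end marker, and in which a graph partition ends immediately before the head node of the next partition. Both points are assumptions, as are the function names.

```python
from typing import List, Tuple


def partition_segment(offset: List[int], pointer: List[int], part: int) -> Tuple[int, int]:
    """Half-open range [lo, hi) of the index array that belongs to graph partition `part`."""
    head = offset[part]  # head node identifier of the partition
    if part + 1 < len(offset):
        tail = offset[part + 1] - 1  # node just before the next partition's head
    else:
        tail = len(pointer) - 2      # last node of the graph
    return pointer[head], pointer[tail + 1]


def load_partition(index: List[int], offset: List[int], pointer: List[int],
                   part: int) -> List[int]:
    """Return the node identifiers stored in the partition's first position segment."""
    lo, hi = partition_segment(offset, pointer, part)
    return index[lo:hi]
```

For a four-node graph with `pointer = [0, 2, 3, 5, 6]`, `index = [1, 2, 0, 0, 3, 2]` and `offset = [0, 2]`, partition 0 maps to positions 0 to 2 of the index array and partition 1 maps to positions 3 to 5.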
[0250] The loading module 11 is further configured to: for a target training node in any graph partition determined based on the head and tail node identifiers, determine the second position segment corresponding to the target training node in the index array according to the pointer array, wherein the node identifier located in the second position segment in the index array is a neighbor node in the neighborhood sampling process of the target training node; and determine the cached neighbor node and the uncached neighbor node corresponding to the neighborhood sampling process of the target training node according to the overlap relationship between the second position segment and the first position segment, wherein the cached neighbor node corresponds to the node identifier of the overlapping part of the first position segment and the second position segment.
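As a non-limiting illustration, the overlap test of paragraph [0250] may be sketched as follows. The function name and the representation of loaded first position segments as half-open `(start, end)` pairs are assumptions of this sketch, as is the CSR-like pointer array with a final end marker.

```python
from typing import List, Tuple


def split_neighbors(index: List[int], pointer: List[int], node: int,
                    loaded_segments: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Split the neighbors of `node` into (cached, uncached) by position overlap."""
    lo, hi = pointer[node], pointer[node + 1]  # second position segment of the node
    cached: List[int] = []
    uncached: List[int] = []
    for pos in range(lo, hi):
        # A position is cached when it falls inside any loaded first position segment.
        in_cache = any(start <= pos < end for start, end in loaded_segments)
        (cached if in_cache else uncached).append(index[pos])
    return cached, uncached
```

Only the pointer array and the segment boundaries are consulted to classify a neighbor, so the classification itself requires no disk access; the uncached identifiers are the ones that would have to be read from the host disk.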
[0251] Optionally, the apparatus further includes: a memory allocation module, configured to determine, under simulated scenarios in which different numbers of graph partitions are loaded in a training round, the first disk data access volume corresponding to the node feature vector acquisition phase and the second disk data access volume corresponding to the neighborhood sampling phase in one training epoch of the graph neural network model, wherein one training epoch includes multiple training rounds and one training round includes multiple iterations; determine the target number that minimizes the total disk data access volume, wherein the total disk data access volume is the sum of the first disk data access volume and the second disk data access volume; and determine the capacities of the first cache space and the second cache space based on the target number.
[0252] Optionally, the memory allocation module is specifically used to: obtain the first available memory capacity of the host memory, the amount of graph data, and the total number of graph partitions into which the complete graph structure data is divided, wherein the graph data includes the complete graph structure data and the node feature vectors of all nodes; determine the capacity of the first cache space based on the target number, the amount of graph data, and the total number of graph partitions; and determine the capacity of the second cache space based on the first available memory capacity and the capacity of the first cache space. The memory allocation module is further used to: obtain the second available memory capacity of the second processor, and determine the second available memory capacity as the capacity of the third cache space.
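As a non-limiting illustration, the selection of the target number and the capacity planning of paragraphs [0251] and [0252] may be sketched as follows. The two cost callables stand in for the simulated first and second disk data access volumes. The assumption that graph partitions are equally sized, the use of a separate `structure_bytes` argument, and all function and parameter names are hypothetical and are not taken from the description.

```python
from typing import Callable, Dict


def choose_partition_count(total_partitions: int,
                           feature_cost: Callable[[int], float],
                           sampling_cost: Callable[[int], float]) -> int:
    """Return the k in 1..total_partitions with the smallest total disk access volume."""
    return min(range(1, total_partitions + 1),
               key=lambda k: feature_cost(k) + sampling_cost(k))


def plan_cache_capacities(k: int, total_partitions: int, structure_bytes: int,
                          host_free_bytes: int, device_free_bytes: int) -> Dict[str, int]:
    """Derive the three cache capacities from the chosen partition count k."""
    # Assumption: equally sized partitions, so k partitions take k/P of the structure data.
    first = structure_bytes * k // total_partitions
    if first > host_free_bytes:
        raise ValueError("k graph partitions do not fit in the available host memory")
    return {
        "first": first,                     # graph structure cache in host memory
        "second": host_free_bytes - first,  # remaining host memory caches feature vectors
        "third": device_free_bytes,         # all available second-processor memory
    }
```

The sketch makes the trade-off visible: loading more graph partitions enlarges the first cache space and tends to reduce disk access during neighborhood sampling, but it shrinks the second cache space and tends to increase disk access during node feature vector acquisition.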
[0253] Optionally, in the simulated scenario of loading k graph partitions, the memory allocation module is specifically used to: determine a first item of the disk data access volume based on the total number of graph partitions into which the complete graph structure data is divided, the number k of graph partitions loaded, and the number of nodes whose node feature vectors are cached in each training round, wherein the node feature vectors are cached in the second cache space or the third cache space, and the number of nodes whose node feature vectors are cached in one training round is determined based on the first available memory capacity of the host memory, the second available memory capacity of the second processor, the total number of graph partitions, the number k of graph partitions loaded, the amount of graph data, and the amount of data corresponding to one node feature vector, wherein the graph data includes the complete graph structure data and the node feature vectors of all nodes; determine a second item of the disk data access volume based on the sum of the elements in the node heat matrix; determine the total local node heat value corresponding to multiple target nodes in each training round based on the local node heat vectors corresponding to the k graph partitions in that training round; determine a third item of the disk data access volume based on the sum of the total local node heat values of all training rounds, wherein the multiple target nodes in each training round are the nodes in the neighborhood subgraphs formed in that training round whose node feature vectors are cached in the second cache space or the third cache space; and determine the first disk data access volume based on the first item, the second item, and the third item.
[0254] Optionally, in the simulated scenario of loading k graph partitions, the memory allocation module is specifically used to: obtain a pre-generated neighbor heat matrix, which describes the total number of neighbor nodes of each node in the complete graph structure data when neighborhood sampling is performed with the training nodes in each graph partition, wherein each graph partition is one of the multiple graph partitions into which the complete graph structure data is divided, and one column of the neighbor heat matrix corresponds to one graph partition; determine the local neighbor heat vector corresponding to the k graph partitions in a target training round, wherein the local neighbor heat vector includes the local neighbor heat value of each node in the complete graph structure data, the local neighbor heat value of a node is a statistical value of the total number of neighbor nodes of that node when neighborhood sampling is performed on the training nodes in the k graph partitions in the target training round, and the target training round is any one of the training rounds; for the multiple nodes that are not located in the k graph partitions, determine the sum of their local neighbor heat values according to the local neighbor heat vector as the disk data access volume corresponding to the neighborhood sampling stage of the target training round; and determine the second disk data access volume according to the disk data access volume corresponding to the neighborhood sampling stage of each training round.
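As a non-limiting illustration, the estimate of the second disk data access volume in paragraph [0254] may be sketched as follows. The sketch assumes that the statistical value is a plain sum over the columns of the loaded graph partitions and that the per-round volumes are added together. Both choices, together with the function names and the `node_partition` mapping from node identifier to partition index, are assumptions.

```python
from typing import Iterable, List, Set


def round_sampling_disk_access(neighbor_heat: List[List[int]], loaded: Set[int],
                               node_partition: List[int]) -> int:
    """Disk access volume of the neighborhood sampling stage of one training round."""
    total = 0
    for node, row in enumerate(neighbor_heat):
        if node_partition[node] in loaded:
            continue  # the node's adjacency is already in the first cache space
        # Local neighbor heat value: sum over the columns of the loaded partitions.
        total += sum(row[j] for j in loaded)
    return total


def second_disk_access_volume(neighbor_heat: List[List[int]],
                              rounds: Iterable[Iterable[int]],
                              node_partition: List[int]) -> int:
    """Sum the per-round sampling volumes; `rounds` lists the partitions loaded per round."""
    return sum(round_sampling_disk_access(neighbor_heat, set(r), node_partition)
               for r in rounds)
```

When every graph partition is loaded in the same round, no node lies outside the loaded partitions and the estimated volume is zero, which matches the intent that cached neighbor nodes need no disk access.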
[0255] Optionally, the matrix calculation module is further configured to: determine multiple node groups corresponding to the target graph partition, wherein the number of training nodes in each node group is determined according to the number of training nodes required for one iteration, the number of node groups is equal to the number of iterations in one training round, and the target graph partition is any one of the multiple graph partitions; perform neighborhood sampling on each training node in the target node group to obtain a local subgraph composed of the neighborhood subgraphs of each training node in the target node group, wherein the target node group is any one of the multiple node groups; determine the total number of neighbor nodes of each node in the complete graph structure data in the multiple local subgraphs, as a column in the neighbor heat matrix corresponding to the target graph partition, and the multiple local subgraphs correspond to the multiple node groups.
[0256] The device shown in Figure 10 can perform the steps in the foregoing embodiments. For the detailed execution process and technical effects, refer to the description in the foregoing embodiments, which is not repeated here.
[0257] In one possible design, the structure of the device shown in Figure 10 can be implemented as an electronic device. For example, as shown in Figure 11, the electronic device may include: a processor 21, a memory 22, and a communication interface 23. The memory 22 stores executable code, which, when executed by the processor 21, enables the processor 21 to at least implement the graph neural network model training method provided in the foregoing embodiments.
[0258] In addition, embodiments of the present invention provide a non-transitory machine-readable storage medium storing executable code, which, when executed by a processor of an electronic device, enables the processor to at least implement the graph neural network model training method provided in the foregoing embodiments.
[0259] In addition, embodiments of the present invention provide a computer program product. This computer program product includes a computer program or instructions. When the computer program or instructions are executed by a processor, the processor is able to at least implement the graph neural network model training method provided in the foregoing embodiments.
[0260] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A method for training a graph neural network model, characterized in that, The method includes: The first processor loads a target number of graph partitions into the first cache space in the host memory. The target number of graph partitions are graph partitions required for multiple iterations of the training process. Each graph partition is a different connected subgraph of the complete graph structure data. The complete graph structure data is stored on the host disk. The first processor determines the cached and uncached neighbor nodes corresponding to the neighborhood sampling process of the target training node. The cached neighbor nodes are located in the first cache space, and the uncached neighbor nodes are located on the host disk. The target training node is any training node in the target number of graph partitions. The second processor samples a first number of neighbor nodes from the cached neighbor nodes, and the first processor samples a second number of neighbor nodes from the uncached neighbor nodes, and transmits the second number of neighbor nodes to the second processor. The second processor is of a different type than the first processor. The second processor determines the neighborhood subgraph corresponding to the target training node based on the first number of neighboring nodes and the second number of neighboring nodes, and inputs the neighborhood subgraph and the node feature vectors of each node in the neighborhood subgraph into the graph neural network model for training.
2. The method according to claim 1, characterized in that, Before determining the cached and uncached neighbor nodes corresponding to the target training node through the first processor, the method further includes: The first processor samples multiple training nodes corresponding to the current iteration in the multiple iterations from the target number of graph partitions, and the multiple training nodes include the target training node; The step of inputting the neighborhood subgraph and the node feature vectors of each node in the neighborhood subgraph into the graph neural network model for training includes: The second processor obtains a training subgraph generated from multiple neighborhood subgraphs corresponding to the multiple training nodes, and inputs the training subgraph and the node feature vectors of each node in the training subgraph into the graph neural network model to perform the training of the current iteration.
3. The method according to claim 1, characterized in that, The host also includes a second cache space located in the host memory and a third cache space located in the memory of the second processor. The memory of the second processor stores a cache address lookup table. The method further includes: The second processor queries the cache address lookup table to determine whether the node feature vectors corresponding to each node in the neighborhood subgraph are cached in the second cache space or the third cache space. For the first node whose node feature vector is cached, the second processor retrieves the node feature vector corresponding to the first node from the corresponding cache address; For a second node whose node feature vector is not cached, the first processor retrieves the corresponding node feature vector from the host disk. The first processor updates the second cache space and / or the third cache space based on the node feature vector corresponding to the second node.
4. The method according to claim 3, characterized in that, The step of updating the second cache space and / or the third cache space by the first processor according to the node feature vector corresponding to the second node includes: The first processor obtains a pre-generated node heat matrix, which is used to describe the access heat value of each node in the complete graph structure data when neighborhood sampling is performed on the training nodes in each graph partition. Each graph partition is one of the multiple graph partitions into which the complete graph structure data is divided. The first processor determines a local node heat vector based on the node heat matrix. The local node heat vector includes the local node heat value of each node in the complete graph structure data. The local node heat value of each node corresponds to the statistical value of the access heat values of that node when neighborhood sampling is performed on the training nodes in the target number of graph partitions respectively. The first processor determines the local node heat value corresponding to the second node based on the local node heat vector, and updates the second cache space and / or the third cache space based on the local node heat value corresponding to the second node.
5. The method according to claim 4, characterized in that, The step of determining the local node heat value corresponding to the second node based on the local node heat vector by the first processor, and updating the second cache space and / or the third cache space, includes: If the second cache space and / or the third cache space are not full, the first processor stores the node feature vector corresponding to the second node into the unfilled cache space and sends first address update information to the second processor. The first address update information is used to indicate that the cache address of the second node is added to the cache address lookup table. If both the second cache space and the third cache space are full, the first processor removes the node feature vector corresponding to the third node in the second cache space and the third cache space, replaces it with the node feature vector corresponding to the second node, and sends second address update information to the second processor. The second address update information is used to indicate that the cache address of the second node is added to the cache address lookup table and the cache address of the third node is deleted from the cache address lookup table, wherein the local node heat value corresponding to the third node is lower than the local node heat value corresponding to the second node.
6. The method according to claim 5, characterized in that, The local node heat value of a node in the third cache space is not lower than the local node heat value of a node in the second cache space.
7. The method according to claim 4, characterized in that, The process of generating the node heat matrix through the second processor includes: Multiple node groups corresponding to the target graph partition are determined, wherein the number of training nodes in each node group is determined according to the number of training nodes required for one iteration, and the number of node groups is equal to the number of iterations in one training round, and the target graph partition is any one of the multiple graph partitions; Neighborhood sampling is performed on each training node in the target node group to obtain a local subgraph composed of the neighborhood subgraphs of each training node in the target node group, wherein the target node group is any one of the plurality of node groups; The number of times each node in the complete graph structure data appears in the multiple local subgraphs is determined as the access heat value of each node in the complete graph structure data when performing neighborhood sampling on the training nodes in the target graph partition. The multiple local subgraphs correspond to the multiple node groups. The node heat matrix is generated based on the access heat values of each node in the complete graph structure data when neighborhood sampling is performed on the training nodes in the multiple graph partitions respectively, and is then stored in the memory of the second processor.
8. The method according to any one of claims 1-7, characterized in that, The complete graph structure data is divided into multiple consecutive graph partitions, and the node identifiers in each graph partition are consecutive; the method further includes: Determine the offset array, index array, and pointer array corresponding to the complete graph structure data, and store the offset array and the pointer array in the first cache space. The index array and the pointer array correspond to a preset compressed storage format of the adjacency matrix of the complete graph structure data, and the offset array is used to store the head node identifier of each of the multiple graph partitions. The step of loading the target number of graph partitions into the first cache space in the host memory includes: For any graph partition among the target number of graph partitions, determine the head and tail node identifiers of that graph partition based on the offset array; The pointer array is queried based on the head and tail node identifiers to determine the first position segment corresponding to that graph partition in the index array; Load the node identifiers located in the first position segment of the index array into the first cache space.
9. The method according to claim 8, characterized in that, The process of determining the cached and uncached neighbor nodes in the neighborhood sampling of the target training node includes: For any target training node in the graph partition determined based on the head and tail node identifiers, the second position segment corresponding to the target training node in the index array is determined according to the pointer array, wherein the node identifier in the index array located in the second position segment is a neighbor node in the neighborhood sampling process of the target training node; Based on the overlap between the second position segment and the first position segment, the cached neighbor nodes and uncached neighbor nodes corresponding to the neighborhood sampling process of the target training node are determined, wherein the cached neighbor nodes correspond to the node identifiers of the overlapping parts of the first position segment and the second position segment.
10. The method according to any one of claims 4-7, characterized in that, The method further includes: In simulated scenarios where different numbers of graph partitions are loaded in a training round, the first disk data access volume corresponding to the node feature vector acquisition stage and the second disk data access volume corresponding to the neighborhood sampling stage in one training epoch of the graph neural network model are determined, wherein one training epoch includes multiple training rounds, and one training round includes multiple iterations; Determine the target number that minimizes the total disk data access volume, wherein the total disk data access volume is the sum of the first disk data access volume and the second disk data access volume; The capacities of the first cache space and the second cache space are determined based on the target number.
11. The method according to claim 10, characterized in that, Determining the capacity of the first cache space and the second cache space based on the target quantity includes: The first available memory capacity of the host memory, the amount of graph data, and the total number of graph partitions into which the complete graph structure data is divided are obtained, wherein the graph data includes the complete graph structure data and the node feature vectors of all nodes; The capacity of the first cache space is determined based on the target quantity, the amount of graph data, and the total number of graph partitions; and the capacity of the second cache space is determined based on the first available memory capacity and the capacity of the first cache space. The method further includes: obtaining the second available memory capacity of the second processor, and determining the second available memory capacity as the capacity of the third cache space.
12. The method according to claim 10, characterized in that, In a simulated scenario of loading k graph partitions, the process for determining the first disk data access volume corresponding to the node feature vector acquisition phase includes: A first item of the disk data access volume is determined based on the total number of graph partitions into which the complete graph structure data is divided, the number k of graph partitions loaded, and the number of nodes whose node feature vectors are cached in each training round. The node feature vectors are cached in the second cache space or the third cache space. The number of nodes whose node feature vectors are cached in one training round is determined based on the first available memory capacity of the host memory, the second available memory capacity of the second processor, the total number of graph partitions, the number k of graph partitions loaded, the amount of graph data, and the amount of data corresponding to one node feature vector. The graph data includes the complete graph structure data and the node feature vectors of all nodes. A second item of the disk data access volume is determined based on the sum of the elements in the node heat matrix. Based on the local node heat vectors corresponding to the k graph partitions in each training round, the total local node heat value corresponding to multiple target nodes in that training round is determined. A third item of the disk data access volume is determined based on the sum of the total local node heat values of all training rounds, wherein the multiple target nodes in each training round are the nodes in the neighborhood subgraphs formed in that training round whose node feature vectors are cached in the second cache space or the third cache space. The first disk data access volume is determined based on the first item, the second item, and the third item.
13. The method according to claim 10, characterized in that, In a simulated scenario of loading k graph partitions, the process for determining the second disk data access volume corresponding to the neighborhood sampling phase includes: Obtain a pre-generated neighbor heat matrix, which describes the total number of neighbor nodes of each node in the complete graph structure data when neighborhood sampling is performed using training nodes in each graph partition. Each graph partition is each of the multiple graph partitions into which the complete graph structure data is divided. One column in the neighbor heat matrix corresponds to one graph partition. Determine the local neighbor heat vectors corresponding to k graph partitions in the target training round. The local neighbor heat vectors include the local neighbor heat values of each node in the complete graph structure data. The local neighbor heat values of each node correspond to the statistical value of the total number of neighbor nodes of each node in the complete graph structure data when performing neighborhood sampling on the training nodes in the k graph partitions in the target training round. The target training round is any training round in the total number of training rounds. For multiple nodes that are not located in the k graph partitions, the sum of the local neighbor heat values of the multiple nodes is determined according to the local neighbor heat vector and used as the disk data access volume corresponding to the neighborhood sampling stage of the target training round; The second disk data access volume is determined based on the disk data access volume corresponding to the neighborhood sampling phase in each of the training rounds.
14. The method according to claim 13, characterized in that, The process of generating the neighbor heat matrix includes: Multiple node groups corresponding to the target graph partition are determined, wherein the number of training nodes in each node group is determined according to the number of training nodes required for one iteration, and the number of node groups is equal to the number of iterations in one training round, and the target graph partition is any one of the multiple graph partitions; Neighborhood sampling is performed on each training node in the target node group to obtain a local subgraph composed of the neighborhood subgraphs of each training node in the target node group, wherein the target node group is any one of the plurality of node groups; The total number of neighbor nodes of each node in the complete graph structure data in the multiple local subgraphs is determined and used as the column in the neighbor heat matrix corresponding to the target graph partition. The multiple local subgraphs correspond to the multiple node groups.
15. An electronic device, characterized in that, The electronic device includes: a memory, a processor, and a communication interface; wherein the memory stores executable code, and when the executable code is executed by the processor, the processor performs the graph neural network model training method as described in any one of claims 1 to 14.
16. A non-transitory machine-readable storage medium, characterized in that, The non-transitory machine-readable storage medium stores executable code that, when executed by a processor of an electronic device, causes the processor to perform the graph neural network model training method as described in any one of claims 1 to 14.
17. A computer program product, characterized in that, The computer program product includes: a computer program which, when executed by a processor of an electronic device, causes the processor to perform the graph neural network model training method as described in any one of claims 1 to 14.