Graph neural network optimization method and graph neural network inference system

CN115860061B · Active · Publication Date: 2026-06-26 · ALIBABA (CHINA) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: ALIBABA (CHINA) CO LTD
Filing Date: 2022-11-01
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing graph neural network computing frameworks lack flexibility and dynamism, and cannot effectively adapt to changes in different graph operators and input graph structures, resulting in low execution efficiency.

Method used

A graph neural network optimization method is provided. It represents graph operators as nested loop statements based on a preset abstract format, extracts graph operator information and graph data information, selects an optimized parallel strategy from a parallel strategy library, and generates executable code to be executed on dedicated hardware for neural network computing.

Benefits of technology

It realizes a unified expression of graph operators in graph neural networks and automatic exploration of dynamic parallelization strategies, thereby improving computational efficiency and execution performance.

✦ Generated by Eureka AI based on patent content.

Abstract

A graph neural network optimization method and a graph neural network inference system are disclosed. The method comprises: representing a graph operator of a graph neural network computing task as a nested loop statement based on a preset abstract format; selecting an optimized parallel strategy from a parallel strategy library based on graph operator information extracted from the nested loop statement and graph data information of the operator; and executing the graph neural network computing task according to the selected optimized parallel strategy. The present application provides a unified abstraction with complete semantic representation for various graph operators, so that the various graph operators in a graph neural network are expressed uniformly. By describing operator computation, graph data, and parallelization strategy separately, the best strategy can be determined for computing scenarios with different operators and graph structures, realizing automatic exploration and efficient execution of dynamic parallelization strategies.

Description

Technical Field

[0001] This disclosure relates to the field of deep learning, and more particularly to a graph neural network optimization method and a graph neural network inference system.

Background Technology

[0002] Graph neural network learning, as a deep learning method for graph representation, has been widely applied in many fields such as intelligent transportation, recommendation systems, knowledge graphs, and molecular science, and has been put into practical use in various graph scenarios on an industrial scale.

[0003] Compared to traditional convolutional neural networks that study regular Euclidean space data, graph neural networks learn on irregular graph structures. The irregularity of the graph structure leads to the randomness and complexity of graph operators accessing the graph space, introducing different computation and memory access patterns into the execution process, making it impossible to directly apply the parallel solutions of other neural networks. Furthermore, the computation methods of graph operators in existing graph neural networks mainly rely on handwritten static kernels, lacking flexibility and dynamism.

[0004] Therefore, in order to support the efficient execution of graph neural networks, a graph neural network computation scheme that can achieve adaptive parallelism is needed.

Summary of the Invention

[0005] One technical problem this disclosure aims to solve is to provide a graph neural network optimization method and a graph neural network inference system. The implementation is a high-performance computing optimization scheme for graph operators in graph neural networks. It is based on a unified abstraction that can provide complete semantic representations for various graph operators, so that various graph operators in graph neural networks are uniformly expressed according to a preset nesting format, thereby realizing operator-by-operator parallel optimization under this unified expression, and realizing automatic exploration and efficient execution of dynamic parallelization strategies.

[0006] According to a first aspect of this disclosure, a graph neural network optimization method is provided, comprising: representing graph operators of a graph neural network computation task as nested loop statements based on a preset abstract format; selecting an optimized parallel strategy from a parallel strategy library based on graph operator information extracted from the nested loop statements and graph data information of the operators; and executing the graph neural network computation task according to the selected optimized parallel strategy.

[0007] Optionally, the nested loop statements based on the preset abstract format include: an outer loop statement for traversing all vertices in the graph; an intermediate loop statement for traversing the edges of each vertex; and an inner loop statement for representing the specific operation of a particular operator.

[0008] Optionally, the inner loop statement is used to traverse along the feature dimension and includes: a first operation statement defined by a first operator that operates on each edge feature; and a second operation statement defined by a second operator that reduces the transformed features after the edge operation.

[0009] Optionally, the preset abstract format includes a first input embedding tensor, a second input embedding tensor, and a third input embedding tensor. In the first operation statement, a first operator is used to operate on the first input embedding tensor and the second input embedding tensor. In the second operation statement, a second operator is used to operate on the third input embedding tensor. The first input embedding tensor, the second input embedding tensor, and the third input embedding tensor are each one of the following: source vertex embedding tensor, target vertex embedding tensor, edge embedding tensor, and null (NULL).

[0010] Optionally, the optimized parallel strategy is selected from a parallel strategy library that includes: a thread-edge strategy, in which one thread executes all operations on an edge; a thread-vertex strategy, in which one thread executes all operations on a vertex; a warp-edge strategy, in which one warp (thread bundle) executes all operations on an edge; and a warp-vertex strategy, in which one warp (thread bundle) executes all operations on a vertex.

[0011] Optionally, based on the graph operator information extracted from the nested loop statement and the graph data information of the operator, selecting an optimized parallel strategy from the parallel strategy library further includes introducing one of the following parameters to limit the selected parallel strategy: a grouping parameter, used to enable one thread or thread bundle in the selected parallel strategy to process multiple edges or vertices; a tiling parameter, used to enable multiple threads or thread bundles in the selected parallel strategy to process one edge or vertex.
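As a rough illustration of how the four strategies and the two limiting parameters change the amount of launched parallelism, the following sketch counts the threads each configuration would need. The function name `threads_needed`, the rounding rule, and the restriction to one non-default parameter at a time are assumptions made for this example, not details taken from the disclosure.

```python
from math import ceil

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

STRATEGIES = ("thread-edge", "thread-vertex", "warp-edge", "warp-vertex")


def threads_needed(strategy, num_vertices, num_edges, grouping=1, tiling=1):
    """Return how many GPU threads a strategy would launch (hypothetical model).

    grouping: number of edges/vertices handled by one thread or warp.
    tiling:   number of threads or warps cooperating on one edge/vertex.
    """
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    if grouping > 1 and tiling > 1:
        # Assumption: only one of the two parameters is applied at a time.
        raise ValueError("use either grouping or tiling, not both")

    unit, item = strategy.split("-")            # e.g. ("warp", "vertex")
    items = num_edges if item == "edge" else num_vertices
    workers = ceil(items / grouping) * tiling   # threads or warps required
    threads_per_worker = WARP_SIZE if unit == "warp" else 1
    return workers * threads_per_worker


# A graph with 4 vertices and 10 edges.
print(threads_needed("thread-edge", 4, 10))              # one thread per edge
print(threads_needed("warp-vertex", 4, 10))              # one warp per vertex
print(threads_needed("thread-edge", 4, 10, grouping=4))  # 4 edges per thread
print(threads_needed("warp-vertex", 4, 10, tiling=2))    # 2 warps per vertex
```

Under this model, grouping reduces the number of workers (more work per thread or warp), while tiling increases it (more parallelism per edge or vertex).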

[0012] Optionally, based on the graph operator information extracted from the nested loop statement and the graph data information of the operator, selecting an optimized parallel strategy from the parallel strategy library includes one of the following: feeding the graph operator information and the graph data information as input into a trained optimization strategy prediction model, and obtaining the output of the optimization strategy prediction model as the optimized parallel strategy; or feeding the graph operator information and the graph data information into an optimization strategy decision tree, and determining the optimized parallel strategy based on the decision of the optimization strategy decision tree.
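The decision-tree option can be pictured as a small function that maps operator and graph features to a strategy name. The sketch below is purely illustrative: the feature set, the class names `OperatorInfo` and `GraphInfo`, and both thresholds are hypothetical and are not specified by the disclosure.

```python
from dataclasses import dataclass


@dataclass
class OperatorInfo:
    edge_op: str      # e.g. "copy", "add", "mul", or "null"
    gather_op: str    # e.g. "sum", "max", or "null"
    feature_len: int  # length of the feature dimension


@dataclass
class GraphInfo:
    num_vertices: int
    num_edges: int
    degree_std: float  # standard deviation of non-zeros per adjacency row


def select_strategy(op: OperatorInfo, graph: GraphInfo) -> str:
    """Walk a hand-written decision tree (hypothetical thresholds)."""
    # Long feature vectors give a whole warp enough work per edge/vertex.
    unit = "warp" if op.feature_len >= 32 else "thread"

    if op.gather_op == "null":
        # No reduction: every edge is independent, so parallelize over edges.
        item = "edge"
    elif graph.degree_std > 50.0:
        # Highly imbalanced degrees: per-edge work balances load better.
        item = "edge"
    else:
        # Balanced graph with a reduction: per-vertex work avoids atomics.
        item = "vertex"
    return f"{unit}-{item}"


balanced = GraphInfo(num_vertices=1000, num_edges=5000, degree_std=2.0)
skewed = GraphInfo(num_vertices=1000, num_edges=5000, degree_std=120.0)

print(select_strategy(OperatorInfo("copy", "sum", 64), balanced))
print(select_strategy(OperatorInfo("add", "null", 8), balanced))
print(select_strategy(OperatorInfo("mul", "sum", 16), skewed))
```

In practice the tree (or the prediction model) would be derived from measurements rather than written by hand; the point here is only that the separated operator and graph descriptions form its input.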

[0013] Optionally, performing the graph neural network computation task according to the selected optimized parallel strategy includes: generating executable code based on the optimized parallel strategy and nested loop statements based on the preset abstract format; and executing the executable code by dedicated neural network computation hardware.

[0014] Optionally, generating executable code based on the optimized parallel strategy and the nested loop statement based on the preset abstract format includes: when the first operator or the second operator is empty, merging the first operation statement and the second operation statement, wherein the inner loop statement of the nested loop statement includes a first operation statement defined by a first operator that operates on each edge feature, and a second operation statement defined by a second operator that reduces the transformed features after the edge operation.

[0015] According to a second aspect of this disclosure, a graph neural network inference system is provided, comprising: a compiler configured to: represent graph operators of a graph neural network computation task as nested loop statements based on a preset abstract format; select an optimized parallel strategy from a parallel strategy library based on graph operator information extracted from the nested loop statements and graph data information of the operators; generate executable code according to the optimized parallel strategy and the nested loop statements based on the preset abstract format; and an execution unit configured to: execute the executable code using dedicated hardware for neural network computation.

[0016] According to a third aspect of this disclosure, a computing device is provided, comprising: a processor; and a memory having executable code stored thereon, which, when executed by the processor, causes the processor to perform the method described in the first aspect above.

[0017] According to a fourth aspect of this disclosure, a non-transitory machine-readable storage medium is provided, on which executable code is stored, which, when executed by a processor of an electronic device, causes the processor to perform the method described in the first aspect above.

[0018] Therefore, this invention provides a unified abstraction for the complete semantic representation of various graph operators, enabling the unified expression of various graph operators in graph neural networks. By separating the description of operator computation, graph data, and parallelization strategies, the optimal strategy for computation scenarios with different operators and graph structures is determined, achieving automatic exploration and efficient execution of dynamic parallelization strategies.

Attached Figure Description

[0019] The above and other objects, features and advantages of this disclosure will become more apparent from the more detailed description of exemplary embodiments thereof taken in conjunction with the accompanying drawings, wherein like reference numerals generally denote like parts.

[0020] Figure 1 shows examples of typical neural network processing for images and language.

[0021] Figure 2 illustrates an example of a common application pattern of graph neural networks.

[0022] Figure 3 shows the performance metrics of different operators on different datasets.

[0023] Figure 4 shows a schematic flowchart of a graph neural network optimization method according to an embodiment of the present invention.

[0024] Figure 5 shows an operational overview of the unified graph operator interface according to an embodiment of the present invention.

[0025] Figure 6 shows a schematic diagram of the composition of a graph neural network inference system according to an embodiment of the present invention.

[0026] Figure 7 shows a schematic diagram of a computing device that can be used to implement the above-described graph neural network optimization method according to an embodiment of the present invention.

Detailed Implementation

[0027] Preferred embodiments of the present disclosure will now be described in more detail with reference to the accompanying drawings. While preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

[0028] With the development of machine learning and deep learning, significant breakthroughs have been made in speech, image, and natural language processing. However, speech, images, and text are all structured data presented as sequences or grids, and existing deep learning models are good at processing this type of data. Figure 1 shows examples of typical neural network processing for images and language. As shown in the upper part of Figure 1, an image can be viewed as a fixed grid because its structure remains unchanged. The desired features can then be extracted from the image by an input layer followed by multiple hidden layers (e.g., as the receptive field grows, the extracted features progress from local contrast patterns to facial features, and then to the entire face). The extracted features can be classified by the output layer, thereby achieving face recognition. The speech shown in the lower part of Figure 1 can be regarded as a fixed sequence of text, which can be fed into the LSTM shown on the right for corresponding semantic understanding tasks.

[0029] However, not everything in the real world can be represented as a sequence or a grid. For example, social networks, knowledge graphs, and complex file systems are all unstructured. Compared to simple text and images, this type of unstructured data corresponds to a more general graph structure. In a typical graph, each node has a different number of edges, meaning it has different neighboring nodes. Graph structures have rich problem-representing capabilities, overcoming the data redundancy and information loss problems caused by traditional grid structures, allowing people to efficiently extract more useful information from data.

[0030] However, graph processing is very complex, and the difficulties include: (1) the size of the graph is arbitrary, the topological structure of the graph is complex, and there is no spatial locality like that of an image; (2) the graph does not have a fixed node order, or in other words, there is no reference node; (3) the graph is often a dynamic graph and contains multi-dimensional or even multi-modal features.

[0031] The complexity of graph data poses a challenge to traditional machine learning. Graph Neural Networks (GNNs), as a method for learning graph representations, have received widespread attention and development in recent years and play a crucial role in the aforementioned problem scenarios. Their main idea is to combine the end-to-end learning process of deep learning with the information transfer in graph computation, forming a new computational paradigm. GNNs can capture the relationships in irregular graph structures and extract effective graph embeddings. Downstream task-specific algorithms can utilize these embedding vectors for efficient and effective computation, thereby achieving the task objectives.

[0032] A graph is a data structure that models a set of objects (nodes) and their relationships (edges). A Graph Neural Network (GNN) is a generalized neural network based on graph structures; it's a deep learning model architecture that can run on graph data. GNNs typically use the graph's topology as computational input and learn neural network primitives to generate single-node embedding vectors by passing, transforming, and aggregating node / edge feature information across the entire graph. These generated node embedding vectors can be used as input to any differentiable prediction layer for node classification, edge link prediction, or graph structure recognition. The complete model can be trained end-to-end.

[0033] Figure 2 illustrates an example of a common application pattern of graph neural networks. As shown in the figure, the input can be a graph containing nodes (circles and triangles) and edges representing the relationships between nodes. The input graph undergoes various operations such as multi-layer graph convolution and activation functions to ultimately obtain representations of each node, facilitating tasks such as node classification, link prediction, and graph and subgraph generation.

[0034] The ever-evolving graph neural network (GNN) model offers a vast architectural space. At the same time, the variability and complexity of the graph operators used in GNNs are rapidly increasing. For example, the number of graph operators in GNN models has increased significantly, from early GCN (Graph Convolutional Network) models to later GAT (Graph Attention Network) and GIN (Graph Isomorphism Network) models. Correspondingly, the graph operators of these newer models have become more complex, making high-performance execution of GNNs more challenging. In addition to the large design space for graph operators, GNN models also operate on datasets whose graph structures have distinct characteristics (e.g., varying balance, density, and cluster locality). Traditional graph computing systems explore adaptive parallelization patterns in different scenarios to achieve high performance. Unlike traditional graph algorithms, GNNs do not involve complex control flow due to boundaries; instead, they involve traversing feature dimensions and more complex computations when traversing the graph.

[0035] Existing graph neural network (GNN) computation frameworks rely on hand-written static kernels for graph operator computation, lacking flexibility and dynamism. These frameworks employ fixed execution strategies for different graph operators and input graphs. However, achieving optimal parallel performance for different GNN models and input graph structures is extremely challenging, requiring dynamic trade-offs between locality, parallelism, and efficiency. Graphs, as inputs to GNN models, vary significantly in terms of the number of vertices, edges, sparsity, the size of input features, and the distribution characteristics of edges within the graph. Furthermore, graph operators in different GNN models possess unique computational and memory access characteristics. Existing static execution patterns mean that current frameworks perform well only for specific GNN models and input graph datasets. Their performance deteriorates when the GNN model and its input graph dataset change.

[0036] Before introducing the graph neural network optimization of this invention, we first briefly introduce the background of existing frameworks, including execution frameworks for graph neural networks (GNNs), focusing on their programming interfaces and execution strategies, and demonstrate the low execution efficiency of existing frameworks through experiments on GPUs.

[0037] Graph Neural Networks

[0038] In recent years, graph neural networks (GNNs) have attracted widespread attention from academia and industry due to their powerful ability to learn and reason about graph structures in non-Euclidean spaces. The output of a GNN model is a d-dimensional embedding vector for each node in the input graph. For vertices or subgraph structures with similar properties, their embeddings are also close to each other, enabling fast reasoning for graph-related problems.

[0039] To obtain these embeddings, GNNs combine DNN-based feature transformations with graph-based operations that propagate and aggregate information along the graph structure. Due to this hybrid approach of DNN and graph operations, existing GNN frameworks, such as DGL and PyTorch-Geometric (PyG), extend existing DNN frameworks (such as TensorFlow and PyTorch) with the key concept of "messages." A message can be viewed as an intermediate feature embedding representation associated with each edge. Message-centric graph operations can be formalized using the following formula: For any operation on a graph G = (V, E), depending on the attributes of the data and the direction of data movement, it can be divided into three phases: message creation, message aggregation, and feature update.

[0040]

[0041]

[0042]

[0043] Where u and v are vertex (or node) indices and e is the index of the edge between u and v; h_v is the feature embedding representation of vertex v, and m_e is the message associated with edge e.

[0044] In Equation (1), each edge creates its message m_e by applying an edge-wise message function to its own edge features and associated vertex features. In Equation (2), each vertex uses an aggregation function to aggregate messages from incoming edges. In Equation (3), each vertex uses a vertex-wise combination function to update its features. In GNNs, the set of feature embedding representations of all vertices is called the vertex embedding tensor, and the set of feature embedding representations of all edges is called the edge embedding tensor.
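The three phases described for Equations (1) to (3) can be written in the following generic form. This is a sketch consistent with the description rather than the exact notation of the filing: the function symbols \phi (message function), \rho (aggregation function), and \psi (combination function), the edge feature h_e, the aggregate a_v, and the updated feature h_v' are notation assumed here.

```latex
\begin{align}
m_e  &= \phi\left(h_u,\, h_v,\, h_e\right), \qquad (u, e, v) \in E \tag{1} \\
a_v  &= \rho\left(\{\, m_e : (u, e, v) \in E \,\}\right) \tag{2} \\
h_v' &= \psi\left(h_v,\, a_v\right) \tag{3}
\end{align}
```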

[0045] Definition of graph operators

[0046] Graph operators are defined as operators that require traversing the input graph structure. Message creation and message aggregation, as explained above, are two types of graph operators. When the message creation operator is a simple copy operation, it can be merged into the message aggregation operator to avoid redundant memory accesses (both DGL and PyG apply this optimization). Therefore, there exists a third type of graph operator, called the fused-aggregation operator, which combines the original message creation operator and the message aggregation operator (in this document, unless explicitly specified, "aggregation operator" refers to the fused-aggregation operator).
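To make the fusion concrete, the sketch below compares a two-step version (create a copy message per edge, then sum the messages per destination vertex) with a fused version that reads source features directly. The function names and the edge-list representation are assumptions chosen for this example.

```python
def create_messages(edges, X):
    """Message creation as a plain copy: one message per edge."""
    return [list(X[src]) for src, _dst in edges]


def aggregate_sum(edges, messages, num_vertices, feat_len):
    """Message aggregation: sum the messages of each vertex's incoming edges."""
    Y = [[0.0] * feat_len for _ in range(num_vertices)]
    for (_src, dst), msg in zip(edges, messages):
        for i in range(feat_len):
            Y[dst][i] += msg[i]
    return Y


def fused_aggregate_sum(edges, X, num_vertices):
    """Fused aggregation: no intermediate edge tensor is materialized."""
    feat_len = len(X[0])
    Y = [[0.0] * feat_len for _ in range(num_vertices)]
    for src, dst in edges:
        for i in range(feat_len):
            Y[dst][i] += X[src][i]
    return Y


edges = [(0, 2), (1, 2), (2, 0)]           # (source, destination) pairs
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # vertex embedding tensor

unfused = aggregate_sum(edges, create_messages(edges, X), 3, 2)
fused = fused_aggregate_sum(edges, X, 3)
print(unfused == fused)  # both paths produce the same vertex embeddings
print(fused)
```

The fused version skips writing and re-reading one message per edge, which is the redundant access the paragraph refers to.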

[0047] In short, graph operators include message creation, message aggregation, and fused-aggregation operators. These operators combine irregular memory access behavior, caused by the graph structure, with complex arithmetic computations, posing a significant challenge to high-performance GNN computation. Optimizing the computation of graph operators is the technical problem this invention aims to solve.

[0048] Complexity of graph operators

[0049] Different GNN models use different graph operators, offering considerable design flexibility. Table 1 categorizes the 160 graph operators supported by DGL based on input and output tensor types. As shown in Table 1, even within the three main categories of message creation, message aggregation, and fused aggregation, different graph operators exhibit variations in input type. Furthermore, even when using the same input / output tensors, graph operators can follow different computational patterns. Therefore, providing practical, high-performance support for all these operations is challenging and requires systematic and automated solutions.

[0050]

[0051] Table 1

[0052] Variability of graph data

[0053] Real-world graph datasets also exhibit significant variability. Table 2 below shows 15 commonly used graph datasets selected for analysis. Specifically, the number of vertices and edges of different graphs was collected to reflect the graph's size scale, and the standard deviation of non-zero values in the adjacency matrix rows (the "std of nnz" column) was derived, reflecting the degree of graph balance. Different graph datasets also have different features and class sizes, which affect the memory usage and computational complexity of some graph operators. As can be seen from Table 2, the properties vary considerably between different graphs.
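The "std of nnz" balance metric can be computed directly from an edge list, as in the following sketch. Two details are assumptions made here because the text does not state them: rows of the adjacency matrix are indexed by source vertex, and the population standard deviation is used.

```python
from statistics import pstdev


def nnz_per_row(num_vertices, edges):
    """Count the non-zeros in each adjacency-matrix row (row = source vertex)."""
    counts = [0] * num_vertices
    for src, _dst in edges:
        counts[src] += 1
    return counts


def std_of_nnz(num_vertices, edges):
    """Population standard deviation of per-row non-zero counts."""
    return pstdev(nnz_per_row(num_vertices, edges))


# A ring is perfectly balanced; a star concentrates all edges on one vertex.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
star = [(0, 1), (0, 2), (0, 3)]

print(std_of_nnz(4, ring))
print(std_of_nnz(4, star))
```

A value near zero indicates a balanced graph, while a large value indicates that a few vertices hold most of the edges.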

[0054]

[0055] Table 2

[0056] Execution efficiency analysis on GPU

[0057] Here, a GPU programmed with CUDA is chosen as the execution hardware. The GPU architecture is highly parallel and contains many streaming multiprocessors (SMs). SMs execute threads in SIMT (Single Instruction, Multiple Threads) mode, scheduling them in warps of 32 threads that run concurrently. Their large computational and memory resources make GPUs increasingly important for accelerating deep learning. However, the inventors found that, for lack of systematic optimization methods, the underlying CUDA kernels used in existing GNN frameworks are inefficient and inflexible. The DGL framework is used as an example here, but the kernels in the PyG framework have similar problems.

[0058] DGL calls a static CUDA kernel to support the message-passing programming interface, and this kernel cannot adapt to different computational scenarios. Here, we select two commonly used graph operators in GNNs for quantitative analysis. The first is the weighted-aggr-sum graph operator in GCN and GAT, and the other is the unweighted-aggr-max graph operator in SageMax. Because it also involves edge weights, the former has higher memory access and computational costs than the latter. We use the AR and SO datasets as representatives of imbalanced graphs, and the PR and DD datasets as representatives of balanced graphs, collecting their occupancy metrics using nvprof. Furthermore, we use the CO and CI datasets as representatives of small graphs, and the SW and OV datasets as representatives of large graphs, collecting their SM (streaming multiprocessor) utilization and L2 cache hit rates under different operators.

[0059] The results are shown in Figure 3, which presents the performance metrics of different operators on different datasets. Both operators exhibit some similar patterns. The occupancy of imbalanced graphs is significantly lower than that of balanced graphs. Furthermore, smaller graphs achieve higher L2 cache hit rates but lower SM utilization than larger graphs. There are also differences between operators. For the lightweight unweighted-aggr-max operator, the occupancy differs significantly between imbalanced and balanced graphs, but the differences in SM utilization and cache hit rate between small and large graphs are relatively small.

[0060] These results indicate that, when executing imbalanced graphs, low occupancy leaves hardware resources underutilized. When executing small graphs, GPU performance is typically limited by low hardware utilization caused by insufficient parallelism, while when executing large graphs, memory access bandwidth becomes a bottleneck due to low locality. Furthermore, these metrics may vary depending on the operator.

[0061] Existing GNN frameworks rely on handwritten kernels with fixed execution policies. However, the diversity of graph-related operations and real-world graph structures makes fixed execution policies inefficient for GNN models. This prompted the inventors of this invention to design a unified interface (called μGrapher) to support existing GNN frameworks. The unified interface of this invention captures the complete semantic representations of all common graph operators in GNNs and enables different dynamic and flexible execution policies for input graph data and graph operators with different characteristics.

[0062] The unified abstraction of graph operators in this invention

[0063] Previous work merely decomposed GNNs into different stages for static optimization. The inventors of this invention have innovatively implemented a unified abstraction of all graph operators in current GNN models, abstracting the modeling of underlying graph-related operators, and employing nested sparse-dense loops to separate data acquisition and computation, achieving optimization for different graph datasets. Furthermore, because the pre-defined abstract form of the graph operators decouples the scheduling strategy from computation, and the unified abstract variable portion can be used as input to the prediction model or decision tree described below, the optimized parallel strategy can be directly obtained based on the model's output or the decision tree's decision.

[0064] Figure 4 shows a schematic flowchart of a graph neural network optimization method according to an embodiment of the present invention. This optimization method can be executed by a graph neural network compiler, which generates executable code based on the unified abstraction and the optimized parallel strategy for execution by underlying dedicated hardware, such as a GPU.

[0065] In step S410, the graph operators of the graph neural network computation task are represented as nested loop statements based on a preset abstract format. Subsequently, in step S420, an optimized parallel strategy is selected from the parallel strategy library based on the graph operator information extracted from the nested loop statements and the graph data information of those operators. In step S430, the graph neural network computation task is executed according to the selected optimized parallel strategy.

[0066] Nested loop statements based on the preset abstract format can include three levels of nested loop statements. Specifically, they can include: an outer loop statement for traversing all vertices in the graph; a middle loop statement for traversing the edges of each vertex; and an inner loop statement for representing the specific operation of a particular operator. Thus, the inventors abstract all graph operators in a GNN into three execution stages: moving data from vertices to edges, performing edge-wise computation on all edges, and performing a reduction function from an edge to its associated vertex. Different operators perform different edge computations and reduction computations, and may skip certain stages. Accordingly, the inner statements in the nested loops can differ to represent different operators.

[0067] The inner loop statement is used to traverse along the feature dimension and includes: a first operation statement defined by a first operator (e.g., edge_op below) that operates on each edge feature; and a second operation statement defined by a second operator (e.g., gather_op below) that reduces the transformed features after edge operation.

[0068] To facilitate understanding, the aggregation-sum operator will first be used as an example to illustrate the abstraction method of this invention. Then, its generalization ability to represent all graph operators will be described.

[0069] Nested For loop representation of graph operators

[0070] The same aggregation summation operator example described above is used here to illustrate the graph operator abstraction method of the present invention. This operator is widely used in GNNs, where, for each vertex in the graph, the operator traverses its neighboring vertices and accumulates the feature embedding representations of these neighboring vertices.

[0071] The following is a code example using nested loops to represent the aggregation summation operator. The graph operator abstraction consists of three nested loops. Line 5 (outer loop) iterates through all vertices in the graph, line 6 (middle loop) iterates through the incoming edges of each vertex, and line 8 iterates along the feature dimension. The innermost statement (line 9, inner loop) performs the combined accumulation of data from the source vertex to the target vertex.

[0072]

[0073] The predefined unified abstract format can also include several GNN-specific data structures that capture the graph-level semantics of the operators. In the above abstract representation, the input of the aggregation summation operator is the graph G and the vertex embedding tensor X, and the output is a new vertex embedding tensor Y. The graph is a pair of sets, where V and E represent the sets of all vertices and all edges in the graph. Each element of set V represents a vertex, and the incoming and outgoing edges of each vertex can be obtained through the get_inedges() and get_outedges() interfaces, respectively. Each element of set E represents an edge, and each edge corresponds to a pair of vertices, with the source and destination vertices obtained through src_v and dst_v.
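The nested-loop form of the aggregation summation operator can be sketched as follows. This is an illustrative Python reconstruction, not the original listing, so its line numbering does not match the line references in paragraph [0071]. The names `get_inedges`, `src_v`, `dst_v`, `G`, `X`, and `Y` come from the description; the `Graph` and `Edge` classes, and exposing `get_inedges` as a method of the graph rather than of a vertex, are simplifying assumptions.

```python
class Edge:
    def __init__(self, src_v, dst_v):
        self.src_v = src_v   # source vertex index
        self.dst_v = dst_v   # destination vertex index


class Graph:
    """A graph G = (V, E) with an index of incoming edges per vertex."""

    def __init__(self, num_vertices, edge_pairs):
        self.V = list(range(num_vertices))
        self.E = [Edge(s, d) for s, d in edge_pairs]
        self._in = {v: [] for v in self.V}
        for e in self.E:
            self._in[e.dst_v].append(e)

    def get_inedges(self, v):
        return self._in[v]


def aggregate_sum(G, X):
    """Sum the features of each vertex's in-neighbors into Y."""
    feat_len = len(X[0])
    Y = [[0.0] * feat_len for _ in G.V]
    for v in G.V:                        # outer loop: all vertices
        for e in G.get_inedges(v):       # middle loop: incoming edges of v
            for i in range(feat_len):    # inner loop: feature dimension
                Y[v][i] += X[e.src_v][i]  # accumulate source into target
    return Y


G = Graph(3, [(0, 2), (1, 2), (2, 0)])
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Y = aggregate_sum(G, X)
print(Y)
```

The two outer loops are sparse (they follow the graph structure), while the inner loop is dense (it covers a fixed feature length), which is the sparse-dense nesting the abstraction relies on.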

[0074] Preset Abstract Design

[0075] As mentioned earlier, the inventors abstracted all graph operators in GNNs into three execution phases: moving data from vertices to edges, performing edge-wise computation on all edges, and performing reduction functions from edges to their associated vertices. Different operators perform different edge and reduction computations and may skip certain phases.

[0076] For example, the aggregation summation operator in the SageSum model simply copies the source vertex features of each edge to form edge features without performing edge computation. Then, for each vertex, it reduces the edge features of all its incoming edges to a new vertex feature. In contrast, the GAT model contains several graph operators with different computational patterns. Its first message creation operator is very lightweight, summing the features of the source and destination vertices of each edge as the edge features for calculating attention weights, skipping the final reduction stage. Conversely, its second aggregation summation operator involves computation in all three stages. This operator first copies features from the source vertices, then performs edge-wise multiplication with the previously generated edge weights, and finally reduces the transformed edge features to vertex features. Therefore, the second operator is more computationally expensive than the first.
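The three execution phases can be made explicit with an intermediate per-edge buffer. The sketch below is an assumption-laden illustration: the function name `three_phase`, the edge list of `(src, dst)` pairs, and the use of one scalar weight per edge for the GAT-style case are all hypothetical choices, not taken from the patent.

```python
def three_phase(edges, X, W=None):
    """edges: list of (src, dst); X: vertex features; W: optional per-edge weights."""
    # Phase 1: move data from vertices to edges (copy source features).
    edge_feat = [list(X[src]) for src, _ in edges]
    # Phase 2: edge-wise computation; skipped when no edge weights are given,
    # as in the SageSum-style aggregation summation.
    if W is not None:
        edge_feat = [[x * W[eid] for x in row] for eid, row in enumerate(edge_feat)]
    # Phase 3: reduce edge features to the destination vertex of each edge.
    Y = [[0.0] * len(X[0]) for _ in X]
    for (_, dst), row in zip(edges, edge_feat):
        for f, x in enumerate(row):
            Y[dst][f] += x
    return Y

edges = [(0, 2), (1, 2), (2, 0)]
X = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
sage_like = three_phase(edges, X)                     # phases 1 and 3 only
gat_like = three_phase(edges, X, W=[0.5, 2.0, 1.0])   # all three phases
```

A message creation operator that skips the final reduction would stop after phase 2 and return `edge_feat` instead.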

[0077] Given the similarities and differences among these graph operators, this invention uses nested loops as the basis for graph operator abstraction and allows users to define input tensors and element-wise operations to represent different operators. Details of the unified abstraction are given below. Compared to the aggregation summation representation above, the nested loops in the unified abstraction remain the same, but the innermost code block (inner loop) introduces two additional dynamic operators: edge_op (corresponding to the first operator) and gather_op (corresponding to the second operator), which can be defined by the user.

[0078]

[0079] `edge_op` performs edge-by-edge computation, while `gather_op` performs the reduction operation from edge to vertex. For example, to represent the aggregation summation in the example above, `edge_op` can be set to `copy_lhs` (copy from the left) and `gather_op` to `sum` (summation), so that the source vertex features are copied onto each edge and then accumulated at the destination vertex.

[0080] In addition to `edge_op`, `gather_op`, and the graph structure input G, the unified abstraction requires three additional input embedding tensors. To maintain flexibility in representing different graph operators, each of these three embedding tensors can be of any of the following types: source vertex embedding tensor (Src_V), target vertex embedding tensor (Dst_V), edge embedding tensor (Edge), and NULL (empty). The tensor type also determines the addressing mode used in the loop computation (lines 10-12). For example, the output tensor Y of the aggregation summation in the example above corresponds to a tensor C with the target vertex feature type, ensuring that the addressing dimension in line 9 is always based on `dst`.
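A minimal Python sketch of the unified abstraction follows, assuming list-of-lists tensors and an edge list of `(src, dst)` pairs. The helper names `row_index`, `load`, and `unified_op`, and the restriction to a few operators, are assumptions of this sketch. The aggregation summation is written here with a `sum` reduction, which is also an assumption about the intended configuration.

```python
import operator

# Hypothetical stand-ins for a subset of the edge_op / gather_op choices.
EDGE_OPS = {
    "copy_lhs": lambda a, b: a,
    "copy_rhs": lambda a, b: b,
    "add": operator.add,
    "mul": operator.mul,
}
GATHER_OPS = {
    "sum": lambda acc, msg: acc + msg,
    "copy_rhs": lambda acc, msg: msg,  # no reduction: store the edge result
}

def row_index(tensor_type, src, dst, eid):
    """Addressing mode selected by the tensor type."""
    return {"Src_V": src, "Dst_V": dst, "Edge": eid}[tensor_type]

def load(tensor, tensor_type, src, dst, eid, f):
    if tensor_type == "NULL":
        return None
    return tensor[row_index(tensor_type, src, dst, eid)][f]

def unified_op(num_vertices, edges, edge_op, gather_op,
               A, a_type, B, b_type, C, c_type):
    psi, rho = EDGE_OPS[edge_op], GATHER_OPS[gather_op]
    feat = len(C[0])
    in_edges = [[] for _ in range(num_vertices)]
    for eid, (src, dst) in enumerate(edges):
        in_edges[dst].append((eid, src))
    for dst in range(num_vertices):             # outer loop: vertices
        for eid, src in in_edges[dst]:          # middle loop: incoming edges
            row = row_index(c_type, src, dst, eid)
            for f in range(feat):               # inner loop: features
                a = load(A, a_type, src, dst, eid, f)
                b = load(B, b_type, src, dst, eid, f)
                msg = psi(a, b)                        # first operation statement
                C[row][f] = rho(C[row][f], msg)        # second operation statement
    return C

edges = [(0, 2), (1, 2), (2, 0)]
X = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]

# Aggregation summation: copy source features, sum into the destination vertex.
Y = unified_op(3, edges, "copy_lhs", "sum",
               X, "Src_V", None, "NULL", [[0.0, 0.0] for _ in range(3)], "Dst_V")

# Message creation: add source and destination features, store per edge.
E = unified_op(3, edges, "add", "copy_rhs",
               X, "Src_V", X, "Dst_V", [[0.0, 0.0] for _ in edges], "Edge")
```

Only the operator names and the three tensor types change between the two calls; the loop nest is identical, which is the point of the unified abstraction.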

[0081] In summary, the combination of edge_op and gather_op, along with tensor types A, B, and C, captures the complete semantics of graph operators, including their computation and memory movement patterns. The following formula defines the unified abstraction of this invention in arithmetic form, where ψ is the edge_op function and ρ is the gather_op function.
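The formula itself is not reproduced in this text. One reading that is consistent with the definitions above, offered as an assumption rather than as the patent's exact notation, is the following, where idx_T(e) denotes the row of tensor T addressed by edge e according to the type of T:

```latex
C[i] \;=\; \mathop{\rho}\limits_{\,e \in E \,:\, \mathrm{idx}_C(e) = i}\;
\psi\big(A[\mathrm{idx}_A(e)],\; B[\mathrm{idx}_B(e)]\big),
\qquad
\mathrm{idx}_T(e) =
\begin{cases}
\mathrm{src}(e) & \text{if } T \text{ has type Src\_V} \\
\mathrm{dst}(e) & \text{if } T \text{ has type Dst\_V} \\
e & \text{if } T \text{ has type Edge}
\end{cases}
```

Under this reading, the aggregation summation is the case where ψ copies its first argument, ρ is a sum, A has type Src_V, B is NULL, and C has type Dst_V.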

[0082]

[0083] Table 3 below shows the complete implementation of all graph operation semantics and their corresponding parameter configurations. It is evident that the unified abstraction of this invention supports message creation and aggregation, as well as the fused graph semantics, providing a foundation for flexible optimization for different graph operators.

[0084]

[0085] Table 3

[0086] As shown in Table 3, for different graph operators, the `edge_op` operation can be `copy_lhs` (copy from the left), `copy_rhs` (copy from the right), `mul` (multiplication), `add` (addition), `sub` (subtraction), or `div` (division). The `gather_op` operation can be `copy_lhs` (copy from the left), `copy_rhs` (copy from the right), `sum` (summation), `max` (maximum value), `min` (minimum value), or `mean` (average value). These are the contents of `edge_op_list` (the list of edge operations) and `gather_op_list` (the list of aggregation operations) mentioned above. Furthermore, the tensor types involved in graph operator computation, as defined by `tensor_type_list` (the list of tensor types), include `Src_V` (source vertex embedding tensor), `Dst_V` (destination vertex embedding tensor), and `Edge` (edge embedding tensor). Since the `edge_op` or `gather_op` operations in some operators can be null, `type_idx_dict` (the type index dictionary) defines Src_V:src, Dst_V:dst, Edge:edge, and NULL.
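The parameter lists named above can be written down as plain Python data together with a small consistency check. The list and dictionary names follow the text; the function `check_op_info` and the rule that binary edge operations need two non-NULL inputs are assumptions added for this sketch.

```python
edge_op_list = ["copy_lhs", "copy_rhs", "mul", "add", "sub", "div"]
gather_op_list = ["copy_lhs", "copy_rhs", "sum", "max", "min", "mean"]
tensor_type_list = ["Src_V", "Dst_V", "Edge"]
# Maps each tensor type to the loop variable used to address it.
type_idx_dict = {"Src_V": "src", "Dst_V": "dst", "Edge": "edge", "NULL": None}

# Assumption: these edge operations read both A and B.
BINARY_EDGE_OPS = {"mul", "add", "sub", "div"}

def check_op_info(edge_op, gather_op, a_type, b_type, c_type):
    """Return a list of problems; an empty list means the configuration is accepted."""
    problems = []
    if edge_op not in edge_op_list:
        problems.append(f"unknown edge_op: {edge_op}")
    if gather_op not in gather_op_list:
        problems.append(f"unknown gather_op: {gather_op}")
    for name, t in (("A", a_type), ("B", b_type), ("C", c_type)):
        if t not in type_idx_dict:
            problems.append(f"unknown type for tensor {name}: {t}")
    if edge_op in BINARY_EDGE_OPS and "NULL" in (a_type, b_type):
        problems.append(f"{edge_op} needs both A and B")
    return problems

ok = check_op_info("copy_lhs", "sum", "Src_V", "NULL", "Dst_V")
bad = check_op_info("mul", "sum", "Src_V", "NULL", "Dst_V")
```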

[0087] Therefore, the predetermined abstract format of the present invention includes a first input embedding tensor (tensor A), a second input embedding tensor (tensor B), and a third input embedding tensor (tensor C). In the first operation statement, a first operator is used to operate on the first input embedding tensor and the second input embedding tensor. In the second operation statement, a second operator is used to operate on the third input embedding tensor. The first input embedding tensor, the second input embedding tensor, and the third input embedding tensor are each one of the following: source vertex embedding tensor, target vertex embedding tensor, edge embedding tensor, and NULL.

[0088] As mentioned earlier, after representing a graph operator as a nested loop statement with a predetermined abstract format, an optimized parallel strategy can be selected from the parallel strategy library based on the graph operator information extracted from the nested loop statement and the graph data information of the operator.

[0089] Optimized parallelization strategies need to be explored within the optimization space. The following describes how to determine the optimization space, which is crucial for achieving high-performance graph operator execution. Specifically, the trade-off space of different parallelization strategies for executing graph operators on GPUs is explored, and it is shown that the optimal strategy differs for different datasets and different graph operators.

[0090] Trade-off space

[0091] The trade-off space affecting the performance of graph operators on GPUs has three dimensions: locality, parallelism, and efficiency.

[0092] Locality describes the amount of spatial and temporal reuse in a program. Better locality improves the cache hit rate and potentially improves program performance. GPUs contain an L1 cache per streaming multiprocessor (SM) as well as a shared L2 cache. To improve the locality of graph operators, tiling or blocking can be applied to nested loops (e.g., grouping and tiling parameters as shown below), which can limit the working set of each SM.

[0093] Parallelism refers to the amount of computation that can be performed simultaneously. Modern GPUs typically contain thousands of computing units, so higher parallelism can improve hardware resource utilization, hide memory access latency, and thus improve program performance. The simplest way to increase the parallelism of graph operators is to start more threads, warps, or thread blocks.

[0094] Efficiency is expressed as the reciprocal of overhead. Different execution strategies for the same operator can introduce additional computations, such as address calculations. Furthermore, when graph operators are executed on a GPU, atomic instructions are required whenever write conflicts can occur, which introduces synchronization overhead and reduces efficiency. For example, each edge can be mapped to a single thread. Since different edges may share the same vertices, atomic addition instructions are needed when performing cumulative reduction from edge features to vertex features.

[0095] Locality, parallelism, and efficiency form an impossible triangle, meaning no single strategy can improve all three metrics simultaneously. Different parallelization strategies have both positive and negative impacts on various metrics in the trade-off space. Given the diversity of graph operators and graph dataset features, it can be shown that a fixed parallelization strategy leads to optimal performance only in a few cases.

[0096] Using the aggregation summation graph operator from the previous example as a representative case, we illustrate the impact of various parallelization strategies on the three trade-off dimensions. First, we follow two classic parallelization strategies used in existing graph processing systems: vertex parallelism and edge parallelism, whose GPU implementations mean that one thread handles all computations for a single vertex or edge. Therefore, we define them as thread-vertex and thread-edge strategies, which are executed in parallel by different threads. Since the number of edges in a graph is typically much greater than the number of vertices, the thread-vertex strategy reduces parallelism compared to the thread-edge strategy, but improves the reusability of output data, thus improving locality. However, the thread-edge strategy reduces efficiency because multiple threads can update the same vertex, thus requiring atomic update operations.
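The difference between the two classic strategies can be illustrated by counting threads and output-row writers for a given edge list. The function names and the returned fields below are assumptions of this sketch, which models only the work mapping and not actual GPU execution.

```python
from collections import Counter

def thread_vertex_plan(num_vertices, edges):
    """One thread per destination vertex: each output row has a single writer."""
    in_degree = Counter(dst for _, dst in edges)
    return {
        "threads": num_vertices,
        "needs_atomics": False,
        "max_edges_per_thread": max(in_degree.values(), default=0),
    }

def thread_edge_plan(num_vertices, edges):
    """One thread per edge: edges sharing a destination write the same row."""
    in_degree = Counter(dst for _, dst in edges)
    return {
        "threads": len(edges),
        "needs_atomics": any(n > 1 for n in in_degree.values()),
        "max_edges_per_thread": 1 if edges else 0,
    }

edges = [(0, 2), (1, 2), (3, 2), (2, 0), (0, 1)]
vertex_plan = thread_vertex_plan(4, edges)
edge_plan = thread_edge_plan(4, edges)
```

On this small graph the thread-edge mapping launches more threads, but three of them target vertex 2, so the reduction into that row has to be atomic.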

[0097] Meanwhile, the vertex and edge features in GNNs are vectors, whereas traditional graph processing algorithms such as PageRank use scalar values. This gives rise to GNN-specific strategies that parallelize along the feature dimension, called the warp-vertex (thread bundle-vertex) strategy and the warp-edge (thread bundle-edge) strategy. In these strategies, each warp (a set of 32 threads in the GPU) processes only one vertex or edge at a time, and different threads within a warp process different feature elements. Compared to the thread-vertex and thread-edge strategies, the thread bundle-vertex and thread bundle-edge strategies can launch more warps, thus increasing parallelism. However, they also compromise locality because the cache capacity available to each warp is reduced.

[0098] For the four strategies described above, two fine-grained parameters are introduced to further explore the trade-off between locality and parallelism. The first parameter, which we call the V / E grouping parameter, means grouping multiple edges or vertices into a group. For example, for the thread-edge strategy, setting this parameter to 4 means that one thread can process four edges instead of the original one, which improves locality but reduces parallelism. This also reduces efficiency due to the additional computational overhead of grouping.

[0099] The second parameter is feature tiling, which leverages the parallelism of the feature dimension to launch more threads. For example, for a feature size of 64 and a thread bundle size of 32, setting the feature tiling parameter to 2 maps each vertex / edge to two thread bundles, rather than to the single thread bundle used when no feature tiling is applied. Compared to V / E grouping, this parameter increases parallelism but reduces locality. It also reduces efficiency due to the additional address computation required for feature tiling.
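The effect of the two fine-grained parameters on launch size can be worked through with simple arithmetic. The function `launch_size` and the exact way grouping and tiling combine are assumptions of this sketch; real kernels may map work differently.

```python
import math

WARP_SIZE = 32

def launch_size(strategy, num_vertices, num_edges, feat, grouping=1, tiling=1):
    """Estimate how many execution units (threads or warps) a strategy launches."""
    unit, item = strategy.split("-")   # e.g. "warp-edge" -> ("warp", "edge")
    items = num_vertices if item == "vertex" else num_edges
    # Grouping packs several vertices/edges into one unit;
    # tiling splits the feature dimension of each group across several units.
    units = math.ceil(items / grouping) * tiling
    return {
        "units": units,
        "threads": units * (WARP_SIZE if unit == "warp" else 1),
        "items_per_unit": grouping,
        "features_per_unit": math.ceil(feat / tiling),
    }

# thread-edge with grouping 4: one thread handles four edges.
grouped = launch_size("thread-edge", 100, 1000, 64, grouping=4)
# warp-vertex with feature size 64 and tiling 2: two warps per vertex.
tiled = launch_size("warp-vertex", 100, 1000, 64, tiling=2)
untiled = launch_size("warp-vertex", 100, 1000, 64)
```

Grouping shrinks the launch (fewer, heavier units) while tiling grows it (more, lighter units), which mirrors the locality versus parallelism trade-off described above.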

[0100] Therefore, the parallel strategy library of the present invention may include: a thread-edge strategy in which one thread executes all operations on one edge; a thread-vertex strategy in which one thread executes all operations on one vertex; a thread-bundle-edge strategy in which a thread bundle executes all operations on one edge; and a thread-bundle-vertex strategy in which a thread bundle executes all operations on one vertex. Further, these strategies may also be refined by the parameters described above. Therefore, the search for optimized parallel strategies based on graph operator information extracted from the unified abstract format and the graph data information of the operators further includes introducing one of the following parameters: a grouping parameter (V / E grouping), used to enable one thread or thread bundle in the selected optimized parallel strategy to process multiple edges or vertices; and a feature tiling parameter, used to enable multiple threads or thread bundles in the selected optimized parallel strategy to process one edge or vertex.
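The strategy library combined with the two parameters defines a search space that can be enumerated directly. In the sketch below the candidate values for grouping and tiling, the rule that tiles must divide the feature size evenly, and the `measure` callback are all hypothetical choices for illustration.

```python
from itertools import product

STRATEGIES = ["thread-edge", "thread-vertex", "warp-edge", "warp-vertex"]
GROUPING_CANDIDATES = [1, 2, 4, 8]   # hypothetical candidate values
TILING_CANDIDATES = [1, 2, 4]        # hypothetical candidate values

def candidate_parallel_infos(feat):
    """All [strategy, grouping, tiling] triples whose tiling divides the feature size."""
    return [
        [s, g, t]
        for s, g, t in product(STRATEGIES, GROUPING_CANDIDATES, TILING_CANDIDATES)
        if feat % t == 0
    ]

def grid_search(candidates, measure):
    """Pick the candidate with the lowest measured cost (e.g. kernel run time)."""
    return min(candidates, key=measure)

space_64 = candidate_parallel_infos(64)
space_6 = candidate_parallel_infos(6)

# Stand-in cost function; a real one would time the generated kernel.
def fake_cost(info):
    strategy, grouping, tiling = info
    return abs(grouping - 4) + abs(tiling - 2) + (0 if strategy == "warp-vertex" else 1)

best = grid_search(space_64, fake_cost)
```

Exhaustive search of this kind becomes expensive as the candidate lists grow, since every candidate requires compiling and timing a kernel.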

[0101] As mentioned earlier, the search for optimized parallel strategies is based on graph operator information and also on the graph data information of the operators. In other words, the optimal execution strategy for graph operators varies depending on the dataset and feature size; that is, different strategies achieve optimal results under different conditions. Therefore, it is necessary to determine the optimized parallel strategy based on both graph operator and data information.

[0102] Furthermore, this invention proposes μGrapher, a unified and high-performance graph operator interface for GNNs, which employs the unified abstraction described above and incorporates the parallelization strategies. Figure 5 shows an operational overview of a unified graph operator interface according to an embodiment of the present invention. As shown in Figure 5, the interface has two key features: the ability to provide complete semantic representations for various graph operators, and the capability to achieve efficient execution through the automatic exploration of flexible and dynamic parallelization strategies. μGrapher can provide unified representations for various operators, including graph scatter operators, gather operators, message creation operators, message aggregation operators, and fused graph operators. Dynamic parallelization strategies can be selected from thread-edge, thread-vertex, thread-bundle-edge, and thread-bundle-vertex strategies, and modified using grouping and tiling parameters. Existing GNN frameworks can call μGrapher's unified API or graph operators rewritten in the unified format. Furthermore, the code generated by μGrapher can be used for efficient execution on dedicated hardware, such as GPUs.

[0103] Therefore, μGrapher can provide specialized and optimized kernels for all GNN graph operators on different GPU architectures and graph datasets. Based on a unified abstraction and various decoupled parallelization strategies, an example of the μGrapher API is shown below:

[0104] op_info=[edge_op,gather_op,Tensor_A,A_Type,Tensor_B,B_Type,

[0105] Tensor_C, C_Type]

[0106] parallel_info=[parallel_strategy,Grouping_Param,Tiling_Param]

[0107] uGrapher(Graph_Tensor,op_info,parallel_info)

[0108] The μGrapher API takes three arguments: Graph_Tensor (the graph tensor), which is the graph data; op_info (operator information), which passes edge_op, gather_op, and the input tensors together with their types; and parallel_info (parallelization information), which specifies the parallelization strategy.

[0109] The API described above separates operator computation, graph data, and parallelization strategies, allowing users to propose their own heuristics to determine the optimal strategy for different operators and graph structures. Furthermore, when no parallelization strategy is specified by the user, μGrapher's interface can perform automatic adjustments to find the optimal parallelization strategy (e.g., model-based or decision tree-based operations as described below).
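How the three arguments might be unpacked on the receiving side can be sketched as follows. `uGrapher_sketch` and `KernelRequest` are hypothetical names introduced here; the function only records the request and marks it for automatic tuning when no `parallel_info` is given, and it does not generate or run any kernel.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KernelRequest:
    edge_op: Optional[str]
    gather_op: Optional[str]
    tensor_types: tuple        # (A_Type, B_Type, C_Type)
    strategy: Optional[str]    # None until a strategy has been chosen
    grouping: int
    tiling: int
    autotune: bool

def uGrapher_sketch(Graph_Tensor, op_info, parallel_info=None):
    """Hypothetical stand-in that separates operator, data, and strategy."""
    edge_op, gather_op, _A, a_type, _B, b_type, _C, c_type = op_info
    types = (a_type, b_type, c_type)
    if parallel_info is None:
        # No strategy supplied: leave the choice to an automatic tuner.
        return KernelRequest(edge_op, gather_op, types, None, 1, 1, True)
    strategy, grouping, tiling = parallel_info
    return KernelRequest(edge_op, gather_op, types, strategy, grouping, tiling, False)

X = [[1.0, 2.0], [3.0, 4.0]]
Y = [[0.0, 0.0], [0.0, 0.0]]
graph = [(0, 1)]   # placeholder graph data: a single edge 0 -> 1
op_info = ["copy_lhs", "sum", X, "Src_V", None, "NULL", Y, "Dst_V"]

manual = uGrapher_sketch(graph, op_info, ["warp-vertex", 1, 2])
auto = uGrapher_sketch(graph, op_info)
```

Keeping `op_info` and `parallel_info` as separate arguments means the same operator description can be re-run with different strategies without touching the operator definition.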

[0110] The following describes how to generate CUDA kernels for operators defined via the μGrapher API. At a high level, the CUDA code generator of this invention also follows the μGrapher design principles, completely decoupling the operator's scheduling strategy from its computation.

[0111] To provide full scheduling support for various graph operators, template-based programming can be utilized, and CUDA kernel templates can be manually implemented for each parallelization strategy described above. We then preserve a device function interface in each template to support various graph operators.

[0112] Code generation can be an automated, end-to-end process to ensure correctness and optimize the generated CUDA kernels for different graph operators. The entire process consists of two code traversals and is flexible and scalable to support future operators. The first code traversal merges the innermost two code statements when members of `op_info` (such as `edge_op` or `gather_op`) are NULL, reducing register usage and read / write overhead. The second code traversal generates the final device function code, which can select atomic operations by analyzing whether different threads will compete for the same data.
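The two code traversals can be illustrated with a small generator that emits the body of a CUDA device function as text. This is a sketch under stated assumptions: the function `generate_device_body`, the variable names `a`, `b`, `c`, and `tmp` in the emitted code, the contention rule, and the restriction to a `sum` reduction are all hypothetical simplifications.

```python
EDGE_EXPR = {"copy_lhs": "a", "copy_rhs": "b", "add": "a + b", "mul": "a * b"}

def generate_device_body(edge_op, gather_op, strategy, c_type):
    """Return CUDA statements for the innermost loop body; c is a float pointer."""
    # Pass 1: when one of the operators is NULL (None here), merge the two
    # innermost statements into one so that no temporary register is needed.
    value = "a" if edge_op is None else EDGE_EXPR[edge_op]
    if gather_op is None:
        # Assumes the output is addressed per edge, so each element has one writer.
        return [f"*c = {value};"]
    if edge_op is None:
        lines = []
    else:
        lines = [f"float tmp = {value};"]
        value = "tmp"
    # Pass 2: use an atomic update only if different threads may write the
    # same output element (edge-parallel work reducing into a vertex tensor).
    contended = strategy.endswith("edge") and c_type != "Edge"
    if gather_op != "sum":
        raise NotImplementedError("this sketch only emits sum reductions")
    lines.append(f"atomicAdd(c, {value});" if contended else f"*c += {value};")
    return lines

full = generate_device_body("mul", "sum", "warp-edge", "Dst_V")
merged = generate_device_body(None, "sum", "thread-vertex", "Dst_V")
creation = generate_device_body("add", None, "thread-edge", "Edge")
```

The emitted statements would be spliced into the device function slot of the kernel template that matches the chosen parallelization strategy.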

[0113] The above design, through the free combination of CUDA kernel functions (global functions) and device functions, provides flexible and efficient implementations for different operators. The former supports different parallelization strategies, while the latter supports different arithmetic operations in graph operators.

[0114] Finding the optimal parallelization strategy can be challenging and time-consuming because there are on the order of 10^4 valid parallelization strategies for a graph operator in μGrapher, and a thorough grid search would take several days. Therefore, the gradient boosting framework LightGBM can be used to train a prediction model to select the optimal strategy in the parallelization space. In one embodiment, features from both graph data and operator information can be used for model training. The introduction of an optimization strategy prediction model can almost completely eliminate the overhead of searching for and optimizing the scheduling strategy. In another embodiment, the graph operator information and the graph data can be fed into an optimization strategy decision tree, and the optimization parallel strategy can be determined based on the decisions of the optimization strategy decision tree.
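The decision tree embodiment can be pictured as a few nested tests on graph and operator features. The tree below is purely illustrative: the function `choose_parallel_info`, the chosen features, and every threshold are hypothetical and are not taken from the patent or from any trained model.

```python
def choose_parallel_info(num_vertices, num_edges, feat, output_type):
    """Hand-written decision tree returning [strategy, grouping, tiling]."""
    avg_degree = num_edges / max(num_vertices, 1)
    # Wide features: spread one vertex/edge across a warp (hypothetical threshold).
    unit = "warp" if feat >= 32 else "thread"
    if output_type == "Edge":
        item = "edge"       # per-edge output: no reduction, no write conflicts
    elif avg_degree >= 64:
        item = "edge"       # very dense graph: favor parallelism over atomics cost
    else:
        item = "vertex"     # otherwise avoid atomics with vertex parallelism
    grouping = 4 if (unit == "thread" and item == "edge") else 1
    tiling = 2 if (unit == "warp" and feat >= 64) else 1
    return [f"{unit}-{item}", grouping, tiling]

sparse_wide = choose_parallel_info(1000, 5000, 64, "Dst_V")
dense_narrow = choose_parallel_info(1000, 100000, 16, "Dst_V")
message_creation = choose_parallel_info(1000, 5000, 32, "Edge")
```

A trained model would replace these hand-picked tests with splits learned from measured kernel run times, but the inputs (graph data features and operator information) and the output (a `parallel_info` triple) have the same shape.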

[0115] According to another aspect of the present invention, a graph neural network inference system is proposed. Figure 6 shows a schematic diagram of a graph neural network inference system according to an embodiment of the present invention. System 600 includes a compiler 610 and an execution unit 620. The compiler 610 is configured to: represent graph operators in the computation graph corresponding to the graph neural network computation task as nested loop statements with a predetermined abstract format; search for an optimized parallel strategy based on graph operator information extracted from the unified abstract format and graph data information of the operators; and generate executable code according to the optimized parallel strategy and the nested loop statements with the unified abstract format. The execution unit 620 is configured to execute the executable code using dedicated hardware for neural network computation. When implemented on a neural network computation platform, the compiler and execution unit described above can be arranged on each computation node.

[0116] Figure 7 shows a schematic diagram of a computing device that can be used to implement the above-described graph neural network optimization method according to an embodiment of the present invention.

[0117] Referring to Figure 7, the computing device 700 includes a memory 710 and a processor 720.

[0118] Processor 720 may be a multi-core processor or may include multiple processors. In some embodiments, processor 720 may include a general-purpose main processor and one or more specialized coprocessors, such as a graphics processing unit (GPU), a digital signal processor (DSP), etc. For example, processor 720 may include a GPU dedicated to performing parallel neural network computations. In some embodiments, processor 720 may be implemented using or include custom circuitry, such as application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs).

[0119] Memory 710 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage devices. ROM may store static data or instructions required by the processor 720 or other modules of the computer. A permanent storage device may be a read-write storage device, and may be a non-volatile storage device that retains stored instructions and data even when the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the permanent storage device. In other embodiments, the permanent storage device may be a removable storage device (e.g., a floppy disk or an optical drive). System memory may be a read-write storage device or a volatile read-write storage device, such as dynamic random access memory. System memory may store some or all of the instructions and data required by the processor during operation. Furthermore, memory 710 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), and magnetic disks and / or optical disks may also be used. In some embodiments, memory 710 may include a removable storage device that is readable and / or writable, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-high density optical disc, a flash memory card (e.g., SD card, mini SD card, Micro-SD card, etc.), a magnetic floppy disk, etc. Computer-readable storage media do not contain carrier waves or transient electronic signals transmitted wirelessly or via wired connections.

[0120] The memory 710 stores executable code which, when executed by the processor 720, causes the processor 720 to perform the graph neural network optimization method described above.

[0121] The graph operators in the existing graph neural network framework are highly dynamic. This invention abstracts all graph operators in the graph neural network into the following three execution stages: moving data from vertices to edges, performing edge computation on all edges, and executing a merge function from edge to related vertex.

[0122] Different graph operators perform different edge computations and merge computations, and may skip certain stages. Given the similarities and differences among these graph operators, this invention, based on nested loops as the basis for graph operator scheduling representation, allows users to customize input tensors and function operations at different stages to represent different operators.

[0123] `edge_op` implements the function representation for memory access and computation on edges, while `gather_op` implements the function representation for merging edges to vertices. The preset abstract format also requires type information for three additional input embedding tensors (Tensor A, B, and C). To maintain flexibility in representing different graph operators, each of these three embedding tensors can be of any of the following types: source vertex embedding tensor (Src_V), destination vertex embedding tensor (Dst_V), edge embedding tensor (Edge), and NULL.

[0124] Different data types also determine different addressing modes in loop computations. The unified abstraction of this invention supports the semantics of message creation and message aggregation, as well as the semantics of different graph operators after merging and optimization, which provides a foundation for implementing a unified computation optimization interface.

[0125] The unified high-performance computing interface design for graph operators in this invention can provide specialized and optimal computation scheduling for graph operators in all graph neural networks across different GPU architectures and graph datasets. Based on interfacing with upper-layer graph neural network computing frameworks, the unified high-performance computing interface can implement computational functionality support for different graph operators, including Scatter, Gather, message creation and aggregation, as well as optimized merged operations. It also provides support for parallel strategies including four coarse-grained parallel modes and two fine-grained parallel control adjustment parameters.

[0126] The unified high-performance computing interface includes three parameters: graph_tensor, which is the graph data; op_info, which passes computational information about edge_op, gather_op, and the input tensor; and parallel_info, which specifies the parallelization strategy.

[0127] The aforementioned interface design separates operator computation, graph data, and parallelization strategies, allowing the system to use heuristics to determine the optimal strategy for computational scenarios with different operators and graph structures. This invention leverages template-based high-performance programming to provide comprehensive scheduling support for various graph operators. First, CUDA-level computation templates are manually implemented for each coarse-grained parallelization strategy. Then, a device function parameter interface is retained in each template to support various graph operators. Simultaneously, this invention implements an automated computation code generation process for different graph operators, optimizing the generated CUDA kernel while ensuring correctness. This invention implements a unified high-performance computing interface as the underlying interface for invocation without modifying user code. As seen in the code snippet, a simple replacement is all that's needed to achieve a unified high-performance computing interface for different scenarios using graph operators in existing graph neural network frameworks.

[0128] Furthermore, the method according to the present invention can also be implemented as a computer program or computer program product, which includes computer program code instructions for performing the steps defined in the above-described method of the present invention.

[0129] Alternatively, the present invention can also be implemented as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) storing executable code (or computer program, or computer instruction code) thereon, which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the various steps of the method described above according to the present invention.

[0130] Those skilled in the art will also understand that the various exemplary logic blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein can be implemented as electronic hardware, computer software, or a combination of both.

[0131] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in a different order than those marked in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0132] The various embodiments of the present invention have been described above. These descriptions are exemplary and not exhaustive, nor are they limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles, practical application, or improvement of the technology in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.

Claims

1. A graph neural network optimization method, comprising: Based on a preset abstract format, graph operators for graph neural network computation tasks are represented as nested loop statements. These nested loop statements include an outer loop statement for traversing all vertices in the graph, an intermediate loop statement for traversing the incoming edges of each vertex, and an inner loop statement for representing specific operations of a particular operator. The inner loop statement is used for traversal along the feature dimension and includes a first operation statement defined by a first operator operating on each edge feature, and a second operation statement defined by a second operator reducing the transformed features after edge operations. The nested loop statements also include a first input embedding tensor, a second input embedding tensor, and a third input embedding tensor. In the first operation statement, the first operator is used to operate on the first and second input embedding tensors. In the second operation statement, the second operator is used to operate on the third input embedding tensor. Each of the first, second, and third input embedding tensors is one of the following: a source vertex embedding tensor, a target vertex embedding tensor, an edge embedding tensor, or NULL (empty). Based on the graph operator information extracted from the nested loop statement and the graph data information of the operator, an optimized parallel strategy is selected from the parallel strategy library. The parallel strategy library includes at least two of the following: a thread-edge strategy in which one thread executes all operations on one edge; a thread-vertex strategy in which one thread executes all operations on one vertex; a thread-bundle-edge strategy in which a thread bundle executes all operations on one edge; and a thread-bundle-vertex strategy in which a thread bundle executes all operations on one vertex.
Executable code is generated based on the optimized parallel strategy and the nested loop statements in the preset abstract format; and the executable code is executed by dedicated hardware for neural network computing.

2. The method as described in claim 1, wherein, Based on the graph operator information extracted from the nested loop statement and the graph data information of the operator, selecting an optimized parallel strategy from the parallel strategy library also includes introducing one of the following parameters to limit the selected parallel strategy: a grouping parameter, used to enable one thread or thread bundle in the selected parallel strategy to process multiple edges or vertices; and a tiling parameter, used to enable multiple threads or thread bundles in the selected parallel strategy to process one edge or vertex.

3. The method as described in claim 1, wherein, Based on the graph operator information extracted from the nested loop statement and the graph data information of the operator, the selection of an optimized parallel strategy from the parallel strategy library includes: The graph operator information and the graph data information are fed into a trained optimization strategy prediction model as input, and the output of the optimization strategy prediction model is obtained as the optimized parallel strategy; or The graph operator information and the graph data information are fed into an optimization strategy decision tree, and the optimized parallel strategy is determined based on the decision of the optimization strategy decision tree.

4. The method of claim 3, wherein generating executable code based on the optimized parallel strategy and the nested loop statements in the preset abstract format includes: when the first operator or the second operator is empty, merging the first operation statement and the second operation statement, wherein the inner loop statement of the nested loop statements includes the first operation statement, defined by the first operator that operates on each edge feature, and the second operation statement, defined by the second operator that reduces the features transformed by the edge operation.
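
One way to read the merging step of claim 4 is as a simplification made while emitting the inner-loop body. The sketch below emits the body as lines of text; the function name `emit_inner_body`, the variable names in the emitted statements, and the exact form of the merged statements are assumptions for illustration rather than the generator described in the patent.

```python
def emit_inner_body(edge_op, reduce_op):
    """Return the inner-loop statements as a list of source lines.

    edge_op / reduce_op are operator names, or None when the operator is empty.
    a and b stand for the loaded elements of the first and second tensors.
    """
    if edge_op is None and reduce_op is None:
        raise ValueError("at least one operator must be set")
    if edge_op is None:
        # No edge operation: reduce the loaded element directly (merged form).
        return [f"C[v][f] = {reduce_op}(C[v][f], a)"]
    if reduce_op is None:
        # No reduction: write the edge result straight to the per-edge output.
        return [f"C[e][f] = {edge_op}(a, b)"]
    return [
        f"tmp = {edge_op}(a, b)",
        f"C[v][f] = {reduce_op}(C[v][f], tmp)",
    ]


full_body = emit_inner_body("mul", "add")
merged_body = emit_inner_body(None, "add")
```

In the merged forms the temporary value disappears, so the generated code carries one statement per feature element instead of two.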

5. A graph neural network inference system, comprising a compiler and an execution unit, wherein:
the compiler is used for: representing, based on a preset abstract format, a graph operator of a graph neural network computation task as nested loop statements, wherein the nested loop statements include an outer loop statement for traversing all vertices in the graph, an intermediate loop statement for traversing the incoming edges of each vertex, and an inner loop statement for representing the specific operations of a particular operator; the inner loop statement traverses the feature dimension and includes a first operation statement, defined by a first operator that operates on each edge feature, and a second operation statement, defined by a second operator that reduces the features transformed by the edge operation; the nested loop statements further include a first input embedding tensor, a second input embedding tensor, and a third input embedding tensor; in the first operation statement, the first operator operates on the first and second input embedding tensors; in the second operation statement, the second operator operates on the third input embedding tensor; and each of the first, second, and third input embedding tensors is one of the following: a source vertex embedding tensor, a target vertex embedding tensor, an edge embedding tensor, or empty; and selecting an optimized parallel strategy from a parallel strategy library based on graph operator information extracted from the nested loop statements and graph data information of the operator, wherein the parallel strategy library includes at least two of the following: a thread-edge strategy in which one thread executes all operations on one edge; a thread-vertex strategy in which one thread executes all operations on one vertex; a thread-bundle-edge strategy in which a thread bundle executes all operations on one edge; and a thread-bundle-vertex strategy in which a thread bundle executes all operations on one vertex; and
the execution unit is used for: generating executable code based on the optimized parallel strategy and the nested loop statements in the preset abstract format; and executing the executable code on dedicated hardware for neural network computing.
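
To make the difference between the four strategies concrete, the following sketch counts how many threads each strategy would launch for a given graph. The strategy name strings, the helper `launch_threads`, and the bundle size of 32 threads are assumptions of this sketch; the claims do not fix a bundle size.

```python
BUNDLE_SIZE = 32  # assumed number of threads in one thread bundle


def launch_threads(strategy, num_vertices, num_edges, bundle_size=BUNDLE_SIZE):
    """Return the total thread count implied by a strategy name.

    strategy is one of "thread_edge", "thread_vertex",
    "thread_bundle_edge", "thread_bundle_vertex" (hypothetical labels).
    """
    unit, _, target = strategy.rpartition("_")
    items = {"edge": num_edges, "vertex": num_vertices}[target]
    lanes = {"thread": 1, "thread_bundle": bundle_size}[unit]
    return items * lanes


per_vertex = launch_threads("thread_vertex", num_vertices=1000, num_edges=5000)
per_edge_bundle = launch_threads("thread_bundle_edge", num_vertices=1000, num_edges=5000)
```

The same graph therefore spans more than two orders of magnitude in launched threads depending on the strategy, which is the space the compiler searches when it selects a strategy.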

6. A computing device, comprising: a processor; and a memory having executable code stored thereon, which, when executed by the processor, causes the processor to perform the method as described in any one of claims 1 to 4.

7. A non-transitory machine-readable storage medium having executable code stored thereon, which, when executed by a processor of an electronic device, causes the processor to perform the method as described in any one of claims 1 to 4.