Resource scheduling method of heterogeneous computing system, electronic device and program product

By acquiring hardware and model information of heterogeneous computing systems, performing device performance normalization and model partitioning, the resource scheduling problem caused by the difference in computing power among heterogeneous devices is solved, and efficient distributed operation of graph neural network models is realized.

CN122240281A · Pending · Publication Date: 2026-06-19 · INSPUR SUZHOU INTELLIGENT TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
INSPUR SUZHOU INTELLIGENT TECH CO LTD
Filing Date
2026-05-22
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, distributed graph neural network systems cannot adapt to the differences in computing power between devices in heterogeneous device environments, resulting in poor resource scheduling performance.

Method used

By acquiring hardware and model information of heterogeneous computing systems, calculating the total memory, and performing device performance normalization, quantitative device performance parameters are obtained. Based on this, the graph neural network model is divided into subgraph parameter information adapted to each device for dynamic resource scheduling.

Benefits of technology

It achieves precise resource scheduling of graph neural network models in heterogeneous computing systems, improving the overall efficiency of distributed model operation and hardware resource utilization.

✦ Generated by Eureka AI based on patent content.


Abstract

This application discloses a resource scheduling method, electronic device, and program product for heterogeneous computing systems, relating to the technical field of heterogeneous computing. The method includes acquiring hardware and model information of heterogeneous computing devices; calculating the total memory required by the model and normalizing device performance based on the hardware information to obtain quantified device performance parameters; then, based on these parameters, dividing the graph neural network model into subgraph parameter information adapted to each device; and finally, dynamically scheduling heterogeneous device resources according to the subgraph parameter information, thereby achieving efficient distributed operation of the model. In this way, by normalizing and quantifying the performance of heterogeneous computing devices and adapting the model into appropriate subgraphs, accurate resource scheduling of the graph neural network model in heterogeneous computing systems is achieved, improving the overall efficiency of distributed model operation and hardware resource utilization, and enhancing resource scheduling effectiveness.

Description

Technical Field

[0001] This application relates to the technical field of heterogeneous computing, and in particular to a resource scheduling method, electronic device and program product for a heterogeneous computing system.

Background Technology

[0002] With the rapid development of e-commerce, social networks, bioinformatics, and other fields, the scale of graph data is growing exponentially, and graph neural networks have become the core technology for processing such irregularly structured data. For example, user behavior graphs in social networks may contain billions of vertices and tens of billions of edges, entity relationship graphs in knowledge graph reasoning can reach the scale of tens of billions, and biological graph datasets in molecular property prediction have more than 100 million nodes and 16 billion edges.

[0003] In related technologies, distributed graph neural network systems are mainly used to solve the computational needs of large-scale graph data. However, they have the following limitations: the system generally assumes that all working nodes have similar computing and communication capabilities, which leads to the inability of task partitioning strategies to adapt to the differences in computing power between devices in heterogeneous device environments, resulting in poor resource scheduling performance.

Summary of the Invention

[0004] This application provides a resource scheduling method, electronic device, and program product for heterogeneous computing systems, in order to at least solve the problem of poor resource scheduling performance in related technologies.

[0005] This application provides a resource scheduling method for a heterogeneous computing system, including:

[0006] Obtain hardware information of multiple heterogeneous computing devices in a heterogeneous computing system, as well as model information of the graph neural network model to be scheduled;

[0007] Based on the model information, calculate the total memory required for the graph neural network model;

[0008] Based on hardware information, model information, and total memory, the performance of multiple heterogeneous computing devices is normalized to obtain the device performance parameters of multiple heterogeneous computing devices.

[0009] Based on the device performance parameters of multiple heterogeneous computing devices, the graph neural network model is divided to obtain the subgraph parameter information corresponding to each heterogeneous computing device.

[0010] Based on the parameter information of multiple subgraphs, the resources of multiple heterogeneous computing devices are dynamically scheduled, and the distributed operation of the graph neural network model is executed.

[0011] This application also provides a resource scheduling device for a heterogeneous computing system, including:

[0012] The acquisition module is used to acquire hardware information of multiple heterogeneous computing devices in a heterogeneous computing system, as well as model information of the graph neural network model to be scheduled.

[0013] The calculation module is used to calculate the total amount of memory required by the graph neural network model based on the model information.

[0014] The processing module is used to normalize the performance of multiple heterogeneous computing devices based on hardware information, model information, and total memory, and obtain the device performance parameters of multiple heterogeneous computing devices.

[0015] The partitioning module is used to partition the graph neural network model according to the device performance parameters of multiple heterogeneous computing devices, and obtain the subgraph parameter information corresponding to each of the multiple heterogeneous computing devices.

[0016] The scheduling module is used to dynamically schedule the resources of multiple heterogeneous computing devices based on the parameter information of multiple subgraphs, and to execute the distributed operation of the graph neural network model.

[0017] This application also provides an electronic device, including: a memory for storing a computer program; and a processor for implementing the resource scheduling method of any of the above-described heterogeneous computing systems when executing the computer program.

[0018] This application also provides a computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, it implements the steps of the resource scheduling method for any of the above-described heterogeneous computing systems.

[0019] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the resource scheduling method for any of the above-described heterogeneous computing systems.

[0020] This application obtains hardware and model information from heterogeneous computing devices, calculates the total memory required by the model, and normalizes device performance based on hardware information to obtain quantified device performance parameters. Then, based on these parameters, the graph neural network model is divided into subgraph parameter information adapted to each device. Finally, dynamic scheduling of heterogeneous device resources is implemented according to the subgraph parameter information, thereby achieving efficient distributed operation of the model. In this way, by normalizing and quantifying the performance of heterogeneous computing devices and adapting the model to different devices, precise resource scheduling of the graph neural network model in heterogeneous computing systems is achieved, improving the overall efficiency of distributed model operation and hardware resource utilization, and enhancing resource scheduling effectiveness.

Brief Description of the Drawings

[0021] To more clearly illustrate the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0022] Figure 1 is a schematic diagram of the structure of a heterogeneous computing system provided in an embodiment of this application;

[0023] Figure 2 is a schematic diagram of the structure of an accelerator provided in an embodiment of this application;

[0024] Figure 3 is the first flowchart of a resource scheduling method for a heterogeneous computing system provided in an embodiment of this application;

[0025] Figure 4 is a schematic diagram of the structure of a CXL unified memory pool provided in an embodiment of this application;

[0026] Figure 5 is a flowchart of a sparsity optimization method for a hidden layer computation process provided in an embodiment of this application;

[0027] Figure 6 is a schematic diagram of a graph neural network model load mapping process provided in an embodiment of this application;

[0028] Figure 7 is a schematic diagram of a dynamic scheduling process provided in an embodiment of this application;

[0029] Figure 8 is the second flowchart of a resource scheduling method for a heterogeneous computing system provided in an embodiment of this application;

[0030] Figure 9 is a schematic diagram of the structure of a resource scheduling device for a heterogeneous computing system provided in an embodiment of this application;

[0031] Figure 10 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0032] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the protection scope of this application.

[0033] It should be noted that, in the description of this application, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. The terms "first," "second," etc., in this application are used to distinguish similar objects and are not used to describe a specific order or sequence.

[0034] With the rapid development of e-commerce, social networks, bioinformatics, and other fields, the scale of graph data is growing exponentially, and graph neural networks have become the core technology for processing such irregularly structured data. For example, user behavior graphs in social networks may contain billions of vertices and tens of billions of edges, entity relationship graphs in knowledge graph reasoning can reach the scale of tens of billions, and biological graph datasets in molecular property prediction have more than 100 million nodes and 16 billion edges.

[0035] In related technologies, distributed graph neural network systems are mainly used to solve the computational needs of large-scale graph data. However, they have the following limitations: the system generally assumes that all working nodes have similar computing and communication capabilities, which leads to the inability of task partitioning strategies to adapt to the differences in computing power between devices in heterogeneous device environments, resulting in poor resource scheduling performance.

[0036] To address the aforementioned technical problems, this application provides a resource scheduling method for heterogeneous computing systems. By acquiring hardware and model information of heterogeneous computing devices, calculating the total memory required by the model, and normalizing device performance based on the hardware information, quantified device performance parameters are obtained. Then, based on these parameters, the graph neural network model is divided into subgraph parameter information adapted to each device. Finally, dynamic scheduling of heterogeneous device resources is implemented according to the subgraph parameter information, thereby achieving efficient distributed operation of the model. In this way, by normalizing and quantifying the performance of heterogeneous computing devices and adapting the model for specific needs, accurate resource scheduling of the graph neural network model in heterogeneous computing systems is achieved, improving the overall efficiency of distributed model operation and hardware resource utilization, and enhancing resource scheduling effectiveness.

[0037] To enable those skilled in the art to better understand the present application, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0038] The application environment and hardware architecture on which the resource scheduling method for heterogeneous computing systems depends are described first.

[0039] Figure 1 is a schematic diagram of the structure of a heterogeneous computing system provided in an embodiment of this application, and Figure 2 is a schematic diagram of the structure of an accelerator provided in an embodiment of this application. Referring to Figures 1 and 2, Figure 1 shows the heterogeneous computing system, which can include at least a host, multiple heterogeneous computing devices, and an accelerator; Figure 2 shows the accelerator.

[0040] The host can communicate with the multiple heterogeneous computing devices and the accelerator through the Peripheral Component Interconnect Express (PCIe) bus and the Compute Express Link (CXL) protocol.

[0041] The host can acquire hardware information of various heterogeneous computing devices and model information of the graph neural network model to be scheduled. Based on the above information, the host completes the calculation of the total model memory, normalization of device performance, subgraph partitioning, and generates dynamic scheduling instructions, which are sent to the scheduling module of the accelerator through the on-chip bus to achieve unified management and control of the entire computing process.

[0042] Heterogeneous computing devices can be distributed computing units of a system, consisting of processing nodes with different architectures, and may include central processing units, graphics accelerators from different manufacturers / models, etc.

[0043] Each heterogeneous computing device has an independent computing core, local memory, and high-speed network interface.

[0044] Each heterogeneous computing device can perform local computation and interact with other devices and accelerators through the CXL protocol, supporting the distributed parallel operation of graph neural networks.

[0045] An accelerator can be a computation acceleration unit for this system, employing a dedicated hardware architecture to optimize the computational efficiency of graph neural networks.

[0046] An accelerator may include a scheduling module, on-chip memory, multi-level input buffers, multi-level output buffers, processing unit arrays, functional auxiliary modules, and on-chip buses and interconnect interfaces.

[0047] The scheduling module can receive scheduling instructions from the host through the on-chip bus, temporarily store them in the instruction register, parse them in the instruction decoding unit, and then distribute control signals to each functional module to realize dynamic scheduling of computing tasks and precise management of resources, ensuring that each module operates in a coordinated and efficient manner.

[0048] On-chip storage can serve as a high-speed data cache layer for accelerators, providing low-latency storage for input features, intermediate results, and output data during the computation process, significantly reducing the frequency of off-chip data access and effectively improving computational efficiency.

[0049] Among the multi-level input and output buffers, the input buffers receive data to be computed from on-chip storage or external devices, reassemble and distribute the data according to computing needs, and provide a continuous data stream to the processing unit array. The output buffers collect the computing results of the processing unit array, temporarily store them, and then write them to on-chip storage or send them back to external devices, thereby achieving efficient data transfer and reuse.

[0050] The processing unit array can consist of multiple sets of parallel processing units (denoted "PE" in Figure 2) and is the core computing component of the accelerator. It is specifically optimized for compute-intensive tasks such as matrix multiplication and neighbor aggregation in graph neural networks. Through pipelined parallel processing, it significantly improves computing throughput and meets the real-time computing needs of large-scale graph data.

[0051] The functional auxiliary module can include a sparse transformation operation module, a nonlinear activation module, and a data operation module. The sparse transformation operation module can optimize the computational overhead by performing sparse matrix transposition and compressed storage format conversion, taking advantage of the sparse characteristics of graph structures. The nonlinear activation module can perform activation function calculations. The data operation module can perform auxiliary operations such as normalization and broadcasting, comprehensively optimizing the computational process.

[0052] The on-chip bus and interconnect interfaces build high-speed communication channels between the modules within the accelerator, ensuring low-latency data transmission. Externally, the accelerator connects to the host via the PCIe bus and interconnects with other heterogeneous computing devices through a high-speed switching device based on the CXL protocol (denoted "CXL-Switch" in Figure 2). Relying on the memory coherence (denoted "CXL.mem" in Figure 2) and high bandwidth of the CXL protocol, it enables efficient data synchronization across devices, providing communication support for global resource scheduling.

[0053] This architecture fully leverages the memory consistency and low latency characteristics of the CXL protocol, combined with the flexibility of Field-Programmable Gate Arrays (FPGAs) and the parallelism of Graphics Processing Units (GPUs), to build an efficient and scalable acceleration platform for graph neural network models, suitable for typical application scenarios such as social network analysis and recommendation systems.

[0054] Figure 3 is the first flowchart of a resource scheduling method for a heterogeneous computing system provided in an embodiment of this application. As shown in Figure 3, embodiments of this application provide a resource scheduling method for a heterogeneous computing system. The method is described in detail below:

[0055] S301. Obtain hardware information of multiple heterogeneous computing devices in the heterogeneous computing system, as well as model information of the graph neural network model to be scheduled.

[0056] Graph neural network models can be neural network models used to process graph-structured data.

[0057] Hardware information can include hardware parameters and communication topology information.

[0058] Hardware parameters can include computation-related hardware parameters and memory-related hardware parameters.

[0059] Hardware parameters related to computation can include peak performance corresponding to multiple computational precisions, computational unit utilization, memory bandwidth, arithmetic intensity, and instruction pipeline efficiency.

[0060] Hardware parameters related to memory can include at least one layer of memory sets, as well as the raw hardware bandwidth, hit rate, and concurrent access efficiency of each layer of memory.

[0061] Communication topology information may include the interconnection topology between devices, link bandwidth, and latency parameters for point-to-point / aggregate communication.

[0062] Model information can include network structure, graph dataset features, and sparsity features.

[0063] Network structure can include the number of layers in the model, the number of nodes / edges, feature dimensions, aggregation operation types, etc.

[0064] Hardware information of multiple heterogeneous computing devices in a heterogeneous computing system, as well as model information of graph neural network models to be scheduled, can be obtained from the storage space.
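The information categories listed in [0057] to [0063] can be pictured as plain records. The following is a minimal sketch, assuming Python dataclasses; every class name, field name, and unit here is hypothetical and chosen only for illustration, since the application does not prescribe a data layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MemoryLevel:
    """One layer of a device's memory hierarchy (hypothetical layout)."""
    name: str
    raw_bandwidth: float                 # raw hardware bandwidth, unit assumed
    hit_rate: float                      # fraction in [0, 1]
    concurrent_access_efficiency: float  # fraction in [0, 1]


@dataclass
class HardwareInfo:
    """Hardware parameters plus communication topology for one device."""
    device_id: str
    peak_performance: Dict[str, float]   # precision name -> peak throughput
    compute_unit_utilization: float
    memory_bandwidth: float
    arithmetic_intensity: float
    pipeline_efficiency: float
    memory_levels: List[MemoryLevel] = field(default_factory=list)
    # peer device id -> (link bandwidth, latency); units are assumptions
    links: Dict[str, Tuple[float, float]] = field(default_factory=dict)


@dataclass
class ModelInfo:
    """Network structure, graph dataset features, and sparsity features."""
    num_layers: int
    num_nodes: int
    num_edges: int
    feature_dims: List[int]              # input dim followed by layer output dims
    aggregation: str
    sparsity: float
```

Keeping the memory hierarchy and the link table as nested collections lets one record describe a CPU, a GPU, or an FPGA without device-specific subclasses.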

[0065] Optionally, after obtaining the hardware information of multiple heterogeneous computing devices in the heterogeneous computing system and the model information of the graph neural network model to be scheduled, the method further includes: mapping the local storage of multiple heterogeneous computing devices to the global physical address space through a cache coherence protocol; directly accessing remote memory addresses in the global physical address space through an inter-device communication protocol; and synchronizing local and remote memory data according to the access request of the remote memory address.

[0066] This application can map the local storage of multiple heterogeneous computing devices to the global physical address space through a cache coherence protocol. Based on a rack-level CXL interconnect architecture, the unified mapping of the local storage of heterogeneous devices can be achieved by using the cache coherence function and memory expansion capability provided by the CXL 2.0/3.0 protocol.

[0067] Heterogeneous computing devices can include CPUs, GPUs, FPGAs, etc. Through the above mapping method, the local storage of all devices is integrated into a continuous global physical address space, so that the entire data center rack is abstracted as a large-scale non-uniform memory access machine.

[0068] Synchronizing local and remote memory data based on access requests to remote memory addresses can be achieved using the cache coherence mechanism of the CXL protocol, eliminating the need for explicit software management of data synchronization and significantly simplifying the programming model.

[0069] Combining the computational characteristics of graph neural networks and based on the aforementioned global unified memory pool mechanism, this application keeps all intermediate results residing in the CXL unified memory pool at all times, and only uses the write-back mechanism to synchronize local and remote memory data, thus avoiding repeated copying and storage of intermediate results between devices.

[0070] It's worth noting that the connection method for heterogeneous devices further ensures the high efficiency of memory access and data synchronization. FPGA and GPU device groups are connected internally via high-speed internal communication links. The entire device group is mounted on the global physical address space of the CPU, leveraging the high bandwidth and low latency characteristics of the CXL interconnect architecture to achieve efficient collaboration between devices within and between groups. Simultaneously, the CXL protocol supports memory pooling and expansion functions, unifying the local storage of multiple heterogeneous devices into a single logical memory pool, further improving memory resource utilization and providing hardware foundation support for training ultra-large-scale graph neural networks.

[0071] The CXL unified memory pool is illustrated below with an example, in conjunction with Figure 4.

[0072] Figure 4 is a schematic diagram of the structure of a CXL unified memory pool provided in an embodiment of this application. Referring to Figure 4, the figure can include a traditional architecture and the CXL unified memory pool architecture, each of which includes a CPU, a GPU, and an FPGA.

[0073] In the traditional PCIe interconnect architecture, the CPU, FPGA, and GPU each have their own independent local memory space, and the addresses are completely isolated: the address range of CPU memory is 0x0000~0xFFFF, the address range of FPGA memory is 0x0000~0xFFFF, and the address range of GPU memory is 0x0000~0xFFFF.

[0074] In this architecture, devices cannot directly access each other's memory. For example, if a GPU wants to read graph data from the CPU's memory, it must perform an explicit data copy via the PCIe bus, moving the data from the CPU's memory to the GPU's memory before computation can begin.

[0075] In the heterogeneous acceleration architecture based on CXL of this invention, the cache coherency and memory expansion capabilities of the CXL2.0 / 3.0 protocol are used to map the local storage of all devices to a continuous global physical address space.

[0076] The global address allocation is as follows: CPU memory occupies the low address range 0x0000~0xFFFF, FPGA memory occupies the middle address range, and GPU memory occupies the high address range up to 0x2FFFF.

[0077] Any device can directly access remote memory in the global address space using standard instructions.

[0078] For example, a GPU can directly read data at address 0x8000 in CPU memory without explicit copying; an FPGA can also directly write data to address 0x20000 in GPU memory.
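The address layout in [0076] and the two access examples in [0078] can be sketched as a lookup from a global address to an owning device and a local offset. This is a minimal illustration, not the CXL mechanism itself: the exact FPGA bounds (0x10000 to 0x1FFFF) and the GPU start address (0x20000) are assumptions inferred from "middle range", "up to 0x2FFFF", and the 0x20000 example, and the names `GLOBAL_MAP` and `resolve` are hypothetical.

```python
from typing import List, Tuple

# (device, first global address, last global address).
# CPU range and GPU upper bound follow the text; the remaining bounds are assumed.
GLOBAL_MAP: List[Tuple[str, int, int]] = [
    ("CPU", 0x00000, 0x0FFFF),
    ("FPGA", 0x10000, 0x1FFFF),
    ("GPU", 0x20000, 0x2FFFF),
]


def resolve(global_addr: int) -> Tuple[str, int]:
    """Return (owning device, offset inside that device's local memory)."""
    for device, start, end in GLOBAL_MAP:
        if start <= global_addr <= end:
            return device, global_addr - start
    raise ValueError(f"address {global_addr:#x} is outside the global pool")
```

Under these assumed bounds, `resolve(0x8000)` lands in CPU memory and `resolve(0x20000)` lands at the start of GPU memory, matching the two accesses described above.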

[0079] When a device initiates remote memory access, the CXL protocol's cache coherence mechanism automatically synchronizes local cache and remote memory data through cache-level write-back operations, without software intervention, greatly simplifying the programming model.

[0080] S302. Based on the model information, calculate the total amount of memory required for the graph neural network model.

[0081] This step is to combine the structural information of the graph neural network model to be scheduled, the features of the graph dataset, and the sparsity optimization characteristics to accurately calculate the total amount of memory required for the complete operation of the model.

[0082] Model information can include the network structure of the graph neural network model, graph dataset features, and sparsity features.

[0083] Based on the network structure, graph dataset characteristics, and sparsity characteristics of the graph neural network model, a preset algorithm can be used to calculate the total amount of memory required for the graph neural network model.

[0084] Optionally, the network structure includes multiple hidden layers. The total memory required for the graph neural network model can be calculated based on the model information as follows: Based on the network structure, graph dataset features, and sparsity features, determine the total number of nodes, the dimension of input node features, the graph structure sparsity, and the output dimensions of multiple hidden layers; based on the total number of nodes, the dimension of input node features, the graph structure sparsity, and the output dimensions of multiple hidden layers, determine the memory usage corresponding to each hidden layer; based on the memory usage corresponding to each hidden layer, determine the total memory required for the graph neural network model.

[0085] Among them, network structure can refer to the hierarchical design and topology architecture of graph neural network models.

[0086] Features of a graph dataset can include node features, edge features, graph type, and so on.

[0087] Sparsity features can be used to describe the sparsity of graph structures.

[0088] The total number of nodes can be the total number of nodes contained in the graph data.

[0089] The dimension of the input node features can be the dimension of the original feature vector of each node.

[0090] Graph sparsity can be an indicator of the degree of sparsity of a graph.

[0091] The output dimension of the hidden layer can represent the dimension of the node feature vector output after feature transformation.

[0092] Memory usage can refer to the memory resources consumed by a single hidden layer during computation and storage.

[0093] The total memory can be the total memory required for the graph neural network model to run.
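The per-layer and total memory calculation in [0084] can be sketched as follows. The per-layer expression uses the same pattern as the layer formulas given later in [0115] to [0117] (input features, sparse structure data, output features). Two points are assumptions of this sketch rather than statements of the application: the result is an element count, not bytes, and the total is taken as the plain sum of the per-layer values. The function names are hypothetical.

```python
from typing import List


def layer_memory(num_nodes: int, sparsity: float, c_in: int, c_out: int) -> float:
    """Element count for one hidden layer: N*C_in + N*S + N*C_out."""
    return num_nodes * c_in + num_nodes * sparsity + num_nodes * c_out


def total_memory(num_nodes: int, sparsity: float, dims: List[int]) -> float:
    """Sum the per-layer counts.

    dims = [input feature dim, output dim of layer 1, ..., output dim of layer L].
    Summing the layers is an assumption of this sketch.
    """
    per_layer = [
        layer_memory(num_nodes, sparsity, c_in, c_out)
        for c_in, c_out in zip(dims, dims[1:])
    ]
    return sum(per_layer)
```

For the four-hidden-layer example discussed in this application, `dims` would be `[128, 256, 512, 1024, 172]`; multiplying the element count by the bytes per element gives a byte figure.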

[0094] Below, in conjunction with Figure 5, the specific process of calculating the total memory of the model based on the network structure, graph dataset features, and sparsity features is described in detail, taking the sparsity optimization structure, the ogbn-papers100M citation network dataset, and a graph neural network model with four hidden layers as examples.

[0095] The graph neural network model is used for paper classification on the ogbn-papers100M dataset, which contains 111 million nodes and 1.615 billion edges. Node features are 128-dimensional vectors. The model employs four hidden layers (with 256, 512, 1024, and 172 channels respectively), ultimately producing a 172-class classification output. Because the model has few parameters, a large data volume, strong structural correlation, and a sparse activated feature matrix, it can be further optimized through the sparsity optimization process shown in Figure 5 to reduce memory and computational overhead.

[0096] Figure 5 is a flowchart of a sparsity optimization method for a hidden layer computation process provided in an embodiment of this application. Referring to Figure 5, the figure includes the process before optimization and the process after optimization.

[0097] The unoptimized hidden layer calculation process can be summarized in the following steps:

[0098] 1. Perform ordinary matrix multiplication on the input feature and weight matrices.

[0099] 2. Apply an activation function to the multiplication result to generate a feature matrix containing a large number of zero values.

[0100] 3. Multiply the feature matrix with the graph structure data using sparse matrix multiplication.

[0101] 4. Execute the activation function again and output the hidden layer results.

[0102] In this process, although the activated feature matrix is sparse, it is still stored as a dense matrix, resulting in redundant memory overhead.
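The four unoptimized steps above can be sketched in plain Python. This is an illustration under stated assumptions: ReLU stands in for the unspecified activation function, the graph structure is represented as an unweighted neighbor list so that step 3 reduces to summing neighbor rows, and all function names are hypothetical.

```python
from typing import Dict, List

Matrix = List[List[float]]


def relu(m: Matrix) -> Matrix:
    """Element-wise activation; ReLU is an assumed choice."""
    return [[max(0.0, v) for v in row] for row in m]


def dense_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Ordinary dense matrix multiplication (step 1)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def aggregate(neighbors: Dict[int, List[int]], h: Matrix) -> Matrix:
    """Sparse product of an unweighted adjacency structure with h (step 3)."""
    width = len(h[0])
    out = []
    for i in range(len(h)):
        acc = [0.0] * width
        for j in neighbors.get(i, []):
            for k in range(width):
                acc[k] += h[j][k]
        out.append(acc)
    return out


def hidden_layer(x: Matrix, w: Matrix, neighbors: Dict[int, List[int]]) -> Matrix:
    h = relu(dense_matmul(x, w))           # steps 1 and 2
    return relu(aggregate(neighbors, h))   # steps 3 and 4
```

Note that `h` is held as a dense list of lists even when most of its entries are zero after activation, which is the redundancy the optimized process targets.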

[0103] This application optimizes the computation process based on the sparsity characteristics after activation:

[0104] The sparsity s of the activated feature matrix is calculated using the following formula:

[0105] s = zeroscount / size

[0106] where s is the sparsity, zeroscount is the total number of zero elements in the matrix, and size is the total number of elements in the matrix.
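As a minimal sketch of the sparsity measure s = zeroscount / size, the following Python computes s for a small post-activation feature matrix. The helper names `relu` and `sparsity` and the matrix values are illustrative assumptions, not part of the application.

```python
def relu(matrix):
    """Apply the ReLU activation element-wise (assumed activation function)."""
    return [[max(0.0, v) for v in row] for row in matrix]


def sparsity(matrix):
    """Return zeroscount / size for a matrix given as a list of rows."""
    size = sum(len(row) for row in matrix)
    zeros_count = sum(1 for row in matrix for v in row if v == 0)
    return zeros_count / size


# Hypothetical pre-activation values for a 2 x 4 feature matrix.
pre_activation = [[-1.0, 2.0, -3.0, 0.5],
                  [-0.2, -0.1, 4.0, -5.0]]
activated = relu(pre_activation)
s = sparsity(activated)
print(f"sparsity s = {s}")
```

Five of the eight activated entries are zero in this example, so s is 5/8.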

[0107] When s < 0.15, the original dense input features can be converted into sparse matrices, and ordinary matrix multiplication can be replaced with sparse matrix multiplication.

[0108] The subsequent calculation process remains unchanged, but memory usage and computational overhead are significantly reduced by using sparse matrix storage and computation.

[0109] For graph neural network models with few parameters, large data volume, and strongly correlated data structures, the computational cost of the model's matrix products can be estimated using the following formula:

[0110] M1 = N×C2×(S×N + C1) ≈ N×C2×D

[0111] where M1 is the computational cost of the model's matrix products, S is the sparsity, N is the number of nodes, C1 is the node feature dimension, C2 is the dimension after the matrix transformation, S×N is approximately equal to the average degree of the graph structure, and D = S×N + C1 is approximately constant.

[0112] This shows that the computational cost of the model is strongly correlated with the number of nodes in the graph data. During graph partitioning, an appropriate number of subgraph nodes can therefore be allocated according to the computing power of the hardware. If all data is to be stored on the accelerator devices, the actual memory requirements of the model must also be considered when computing over the entire ogbn-papers100M graph dataset.

[0113] Next, the actual memory requirements of the model are calculated:

[0114] The memory requirement of each hidden layer consists of three parts: input features, sparse structure data, and output features, as shown in the following formulas:

[0115] The memory usage of the first layer is H1 = N×C1 + N×S + N×C2;

[0116] The memory usage of the second layer is H2 = N×C2 + N×S + N×C3;

[0117] The memory usage of the third layer is H3 = N×C3 + N×S + N×C4;

[0118] The memory usage of the fourth layer is H4 = N×C4 + N×S + N×C5;

[0119] where C1=128, C2=256, C3=512, C4=1024, and C5=172. Here S is taken as the ratio of the total number of edges to the total number of nodes in the graph structure, i.e. S ≈ (1.615×10^9) / (1.11×10^8) ≈ 14.5, rounded to 15.

[0120] The total memory usage H is calculated by summing the memory usage of the four layers:

[0121] H = H1 + H2 + H3 + H4

[0122] If float32 is used for storage, the actual physical memory requirement is H×4 bytes.
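The arithmetic in [0115] to [0122] can be checked with a short Python sketch. It uses only the values stated above (N = 111 million nodes, S = 15, the five channel widths, 4 bytes per float32 element); the function name `layer_elements` is an illustrative assumption.

```python
N = 111_000_000                        # number of nodes, as stated for the dataset
S = 15                                 # edges per node, rounded as in [0119]
CHANNELS = [128, 256, 512, 1024, 172]  # C1..C5
BYTES_PER_FLOAT32 = 4


def layer_elements(n, s, c_in, c_out):
    """Elements held by one hidden layer: input features + sparse structure + output features."""
    return n * c_in + n * s + n * c_out


per_layer = [layer_elements(N, S, CHANNELS[i], CHANNELS[i + 1]) for i in range(4)]
H = sum(per_layer)                     # H = H1 + H2 + H3 + H4
total_bytes = H * BYTES_PER_FLOAT32

for i, h in enumerate(per_layer, start=1):
    print(f"H{i} = {h:,} elements")
print(f"H  = {H:,} elements, {total_bytes / 1e12:.2f} TB as float32")
```

Under these inputs H is 3944×N elements, which is roughly 1.75 TB in float32, far more than a single accelerator typically holds; this is why the memory total feeds the later partitioning step.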

[0123] By optimizing the sparsity of the hidden layer computation process and combining it with accurate total memory calculation, a quantitative basis is provided for resource scheduling in heterogeneous computing systems, which can effectively improve the distributed operation efficiency and hardware resource utilization of large-scale graph neural network models.

[0124] S303. Based on hardware information, model information, and total memory, normalize the performance of multiple heterogeneous computing devices to obtain device performance parameters for multiple heterogeneous computing devices.

[0125] Device performance parameters can be indicators that characterize the hardware performance of heterogeneous computing devices.

[0126] The performance parameters of heterogeneous computing devices may include normalized performance scores, maximum number of storage nodes, and load mapping constraints.

[0127] Based on hardware information, model information, and total memory, the performance of multiple heterogeneous computing devices can be normalized using a preset normalization algorithm or model to obtain the device performance parameters of multiple heterogeneous computing devices.

[0128] Because heterogeneous computing clusters contain different types of devices, the units and magnitudes of performance indicators such as computing power, memory capacity, and communication bandwidth vary significantly, making direct and fair comparisons impossible. This step uses normalization to map multi-dimensional hardware performance indicators to a unified parameter range, making the overall performance of each device comparable. This ensures that subsequent subgraph partitioning accurately reflects the actual capabilities of each device, avoiding imbalances such as underloaded powerful devices and overloaded weak devices.
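The application does not fix a particular normalization algorithm at this point. As one possible illustration only, the sketch below scales each metric by the best value in the cluster and averages the results, so that every device receives a unitless score in (0, 1]. The device names, metric names, and numbers are hypothetical.

```python
# Hypothetical raw hardware metrics with different units and magnitudes.
devices = {
    "fpga0": {"tflops": 10.0, "mem_gb": 16.0, "link_gbps": 50.0},
    "gpu0": {"tflops": 40.0, "mem_gb": 32.0, "link_gbps": 100.0},
}


def normalize(devices):
    """Scale each metric by the cluster maximum, then average per device."""
    metrics = sorted(next(iter(devices.values())))
    best = {m: max(d[m] for d in devices.values()) for m in metrics}
    return {
        name: sum(d[m] / best[m] for m in metrics) / len(metrics)
        for name, d in devices.items()
    }


scores = normalize(devices)
for name, score in scores.items():
    print(f"{name}: normalized score {score:.3f}")
```

An equal-weight average is the simplest choice; a real system would likely weight the metrics according to the workload.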

[0129] S304. Based on the device performance parameters of multiple heterogeneous computing devices, the graph neural network model is divided to obtain the subgraph parameter information corresponding to each of the multiple heterogeneous computing devices.

[0130] Subgraph parameter information can be the subgraph information corresponding to each heterogeneous computing device output after model partitioning.

[0131] Subgraph parameter information may include the subgraph's vertex set, edge set, feature data, task load allocation, boundary vertex distribution, computing power requirements, etc.

[0132] Based on the device performance parameters of multiple heterogeneous computing devices, the graph neural network model can be divided using a model partitioning algorithm to obtain the subgraph parameter information corresponding to each heterogeneous computing device.

[0133] Among them, the model partitioning algorithm can adopt a combination of layer-by-layer coarsening and iterative partitioning. On the constructed resource graph, the data graph of the graph neural network model is mapped and partitioned layer by layer. The original complex multi-constraint load mapping problem is transformed into a top-down hierarchical matching process, which greatly reduces the search space and algorithm complexity of the partitioning process.

[0134] Optionally, the graph neural network model can be partitioned based on the device performance parameters of multiple heterogeneous computing devices to obtain subgraph parameter information corresponding to each heterogeneous computing device in the following manner: A cluster resource graph is established based on the device performance parameters of multiple heterogeneous computing devices, and the cluster resource graph is coarsened layer by layer to obtain multiple root nodes; the original data graph corresponding to the graph neural network model is determined; based on the multiple root nodes, iterative partitioning processing is performed on the cluster resource graph and the original data graph until the corresponding subgraph parameter information for each heterogeneous computing device is determined.

[0135] Below, the load mapping process of the graph neural network model is illustrated with an example, with reference to Figure 6.

[0136] Figure 6 is a schematic diagram of a graph neural network model load mapping process provided in an embodiment of this application. Refer to Figure 6.

[0137] In part (a) of Figure 6, a heterogeneous computing cluster containing two FPGA devices and two GPU devices is taken as an example to explain in detail how the load mapping of the graph neural network model is completed, and the subgraph parameter information corresponding to each device is obtained, by establishing a resource graph, coarsening it layer by layer, and partitioning iteratively.

[0138] Step 1: Perform performance testing and normalization on the four heterogeneous computing devices in the cluster: the normalized computing power of the FPGA device is ai=2, and the normalized computing power of the GPU device is ai=4. At the same time, collect the communication bandwidth between the devices as the weight of the edges in the resource graph.

[0139] In part (b) of Figure 6, the entire cluster is abstracted as a resource graph consisting of nodes and edges. The two nodes labeled 2 represent the FPGAs, and the two nodes labeled 4 represent the GPUs. The edges are the physical communication links connecting the devices, and the weight of each edge is determined by the actual communication bandwidth.

[0140] Step 2: Use a bottom-up coarsening strategy to aggregate the resource graph.

[0141] In part (b) of Figure 6, nodes with the same computing power are aggregated. The two FPGA nodes with ai=2 are aggregated into one coarse node with a normalized computing power of 2+2=4; the two GPU nodes with ai=4 are aggregated into one coarse node with a normalized computing power of 4+4=8.

[0142] In part (c) of Figure 6, after one round of coarsening, the resource graph is simplified to a graph containing only two root nodes: one with a computing power of 4 and the other with a computing power of 8.

[0143] Step 3: Determine the original data graph corresponding to the graph neural network model to be processed. This graph contains multiple vertices and edges. Vertices carry feature information, and edges represent the relationships between vertices.

[0144] Step 4: Starting from the two root nodes, perform iterative partitioning from top to bottom, matching the original data graph with the resource graph layer by layer.

[0145] In part (d) of Figure 6, the original data graph is first divided between the two root nodes, with computing power 4 and 8, according to the ratio of their computing power:

[0146] The original data graph is divided into two parts based on a computing power ratio of 4:8 = 1:2, so that the task load matches the computing power of the root node.

[0147] Adjusting the distribution of boundary vertices reduces the number of edge connections between the two subgraphs, thereby decreasing subsequent cross-device communication. The resulting partition yields two coarsened subgraphs, corresponding to root nodes 4 and 8, respectively.

[0148] In part (e) of Figure 6, the coarsened subgraphs and the coarsened resource graph are iterated one level further down:

[0149] The coarsened subgraph corresponding to root node 4 is divided into two smaller subgraphs, which correspond to the two FPGA devices with ai=2 in the original resource graph.

[0150] The coarsened subgraph corresponding to root node 8 is divided into two smaller subgraphs, each corresponding to one of the two GPU devices with ai=4 in the original resource graph. The two-stage process of computation matching and communication optimization described above is repeated for each layer to ensure that the load matches the device's computing power while minimizing cross-device communication.

[0151] Step 5: Output the sub-graph parameter information corresponding to each device.

[0152] In part (f) of Figure 6, after the iterative partitioning is complete, four subgraphs are obtained, corresponding to the two FPGA devices and the two GPU devices respectively.

[0153] The parameter information for each subgraph includes: the set of vertices and edges contained in the subgraph; the task load of the subgraph; the distribution of boundary vertices of the subgraph; and the computing resource requirements corresponding to the subgraph.

[0154] This information provides a clear basis for scheduling subgraph tasks to corresponding heterogeneous computing devices for execution, realizing efficient load mapping of graph neural network (GNN) computing tasks in heterogeneous clusters.
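The coarsening and top-down partitioning of Steps 1 to 5 can be sketched as follows. This is a simplified illustration under stated assumptions: devices are coarsened by device type, vertices are split into contiguous ranges in proportion to computing power, and the boundary-vertex adjustment that reduces cross-device edges is omitted. The vertex count of 1200 and all identifiers are hypothetical.

```python
def split_proportionally(items, weights):
    """Split a list into contiguous chunks whose sizes follow the given weights."""
    total = sum(weights)
    chunks, start, acc = [], 0, 0
    for w in weights:
        acc += w
        end = round(len(items) * acc / total)
        chunks.append(items[start:end])
        start = end
    return chunks


# Step 1: normalized computing power per device (FPGA ai=2, GPU ai=4).
devices = [("FPGA0", "FPGA", 2), ("FPGA1", "FPGA", 2),
           ("GPU0", "GPU", 4), ("GPU1", "GPU", 4)]

# Step 2: bottom-up coarsening, here by aggregating devices of the same type.
groups = {}
for name, kind, power in devices:
    groups.setdefault(kind, []).append((name, power))
root_power = {kind: sum(p for _, p in members) for kind, members in groups.items()}

# Step 3: stand-in for the original data graph (vertex ids only).
vertices = list(range(1200))

# Step 4: top-down partitioning, first across root nodes, then across devices.
kinds = list(groups)
top_level = split_proportionally(vertices, [root_power[k] for k in kinds])
assignment = {}
for kind, chunk in zip(kinds, top_level):
    members = groups[kind]
    parts = split_proportionally(chunk, [p for _, p in members])
    for (name, _), part in zip(members, parts):
        assignment[name] = part

# Step 5: per-device subgraph sizes.
sizes = {name: len(part) for name, part in assignment.items()}
print(root_power)
print(sizes)
```

With a 4:8 split at the root level and equal splits below it, each FPGA receives one sixth of the vertices and each GPU one third, matching the 2:2:4:4 computing power ratio.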

[0155] S305. Based on the parameter information of multiple subgraphs, dynamically schedule the resources of multiple heterogeneous computing devices and execute the distributed operation of the graph neural network model.

[0156] Based on the parameter information of multiple subgraphs, a mapping relationship between subgraph tasks and heterogeneous computing devices can be established. Based on the mapping relationship, the resources of multiple heterogeneous computing devices can be dynamically scheduled, and the distributed operation of the graph neural network model can be executed.

[0157] Dynamic scheduling can include resource reservation and allocation, real-time resource monitoring and adjustment, and communication bandwidth adaptation.

[0158] Optionally, the resources of the multiple heterogeneous computing devices can be dynamically scheduled based on the subgraph parameter information as follows: based on the subgraph parameter information and the hardware information of the multiple heterogeneous computing devices, the devices are grouped into multiple graph node groups, and at least one communication link is established between the graph node groups, where each graph node group includes multiple graph nodes; latency detection is performed on the at least one communication link of the graph nodes to determine at least one target node, where the latency corresponding to each target node is greater than a threshold; for any target node, the remote graph node group connected to the target node is determined, and a replica node corresponding to the target node is generated within the remote graph node group, so that the remote graph node group can complete intra-group communication by calling the replica node, with the replica node synchronizing data with the target node.

[0159] Below, the dynamic scheduling process is illustrated with an example, with reference to Figure 7.

[0160] Figure 7 is a schematic diagram of a dynamic scheduling process provided in an embodiment of this application. Refer to Figure 7.

[0161] Taking a heterogeneous cluster containing two FPGAs and two GPUs as an example, the following describes in detail how device grouping, latency detection, and node replication resolve communication bottlenecks and improve the distributed operation efficiency of graph neural network models.

[0162] In part (a) of Figure 7, the hardware information of each device in the cluster is collected. The two FPGA devices are of the same manufacturer and model and are connected by a built-in high-speed link (shown as double thick black lines in Figure 7). The two GPU devices are likewise of the same manufacturer and model and are connected by a built-in high-speed link (also shown as double thick black lines). The FPGAs and GPUs are connected only via a low-speed link (shown as double thin black lines in Figure 7). Based on the high-speed links, the devices are divided into two node groups: the FPGA node group contains the FPGA devices corresponding to task A and task B, and the GPU node group contains the GPU devices corresponding to task C and task D.

[0163] Devices within a node group communicate via high-speed links, while devices between groups communicate via low-speed Ethernet links, forming a communication topology of "high speed within a group and low speed between groups".

[0164] Latency detection is performed on the at least one communication link of the graph nodes to determine at least one target node.

[0165] In Figure 7, based on the high-frequency communication nodes recorded in the subgraph parameter information (the red lines in Figure 7 represent high-frequency communication), high-frequency cross-group communication is detected between graph nodes 2 and 3 in the FPGA node group and graph nodes 8 and 9 in the GPU node group, and also between graph nodes 6 and 7 in the FPGA node group and graph nodes 12 and 13 in the GPU node group.

[0166] The latency of each of these cross-group communications exceeds the threshold, so nodes 2, 3, 6, 7, 8, 9, 12, and 13 are marked as target nodes.

[0167] Target nodes 2, 3, 6, and 7 belong to the FPGA node group, and their communication remote node group is the GPU node group.

[0168] Target nodes 8, 9, 12, and 13 belong to the GPU node group, and their communication remote node group is the FPGA node group.

[0169] Within the GPU node group, replica nodes of target nodes 2, 3, 6, and 7 are generated, denoted as 2', 3', 6', and 7' (the dashed circles in Figure 7 represent replicas of graph nodes).

[0170] Within the FPGA node group, generate replica nodes of target nodes 8, 9, 12, and 13, denoted as 8', 9', 12', and 13'.

[0171] High-frequency communication that originally had to cross the low-speed links is thereby redirected to the high-speed links within each group: task C within the GPU node group directly calls replica nodes 2' and 3' to complete communication, without needing to access the original nodes 2 and 3 of the FPGA node group across groups.

[0172] Task B within the FPGA node group can directly call replica nodes 9' and 12' to complete communication, without needing to access the original nodes 9 and 12 of the GPU node group across groups.

[0173] The system maintains data consistency between replica nodes and original nodes through an asynchronous incremental synchronization mechanism, ensuring the correctness of calculation results.
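The target-node detection and replica placement of [0164] to [0170] can be sketched in Python as below. The specific link pairs, latency values, and threshold are hypothetical (the text only states which nodes communicate across groups), and the asynchronous incremental synchronization of [0173] is not modeled.

```python
# Graph node -> node group. Nodes 1 and 10 are extra, purely illustrative nodes.
group_of = {1: "FPGA", 2: "FPGA", 3: "FPGA", 6: "FPGA", 7: "FPGA",
            8: "GPU", 9: "GPU", 10: "GPU", 12: "GPU", 13: "GPU"}

# (node u, node v, measured latency in ms); all values are assumptions.
links = [(2, 8, 5.0), (3, 9, 5.0), (6, 12, 5.0), (7, 13, 5.0),
         (1, 2, 0.1), (8, 10, 0.1)]
LATENCY_THRESHOLD_MS = 1.0

targets = set()
replicas = {"FPGA": set(), "GPU": set()}  # group -> ids of remote nodes replicated there

for u, v, latency in links:
    crosses_groups = group_of[u] != group_of[v]
    if crosses_groups and latency > LATENCY_THRESHOLD_MS:
        targets.update((u, v))
        # Place a replica of each endpoint inside the other endpoint's group.
        replicas[group_of[v]].add(u)
        replicas[group_of[u]].add(v)

print("target nodes:", sorted(targets))
for group, ids in replicas.items():
    print(f"{group} group holds replicas:", [f"{i}'" for i in sorted(ids)])
```

Intra-group links are never replicated in this sketch because they already use the high-speed path.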

[0174] This embodiment provides a resource scheduling method for a heterogeneous computing system. The method acquires the hardware information of multiple heterogeneous computing devices and the model information of the graph neural network (GNN) model to be scheduled, and calculates the total memory required by the GNN model from the model information. Based on the hardware information, model information, and total memory, it normalizes the performance of the heterogeneous computing devices to obtain their device performance parameters. Based on these parameters, it partitions the GNN model to obtain the subgraph parameter information for each heterogeneous computing device. Finally, based on this subgraph parameter information, it dynamically schedules the resources of the heterogeneous computing devices and executes the distributed operation of the GNN model. By normalizing and quantifying device performance and partitioning the model to match, the method achieves precise resource scheduling of the GNN model in the heterogeneous computing system, improving the overall efficiency of distributed model operation and hardware resource utilization, and thus the effectiveness of resource scheduling.

[0175] Below, with reference to Figure 8, the process of normalizing the performance of multiple heterogeneous computing devices based on hardware information, model information, and total memory to obtain their device performance parameters is explained.

[0176] Figure 8 is a flowchart of a resource scheduling method for a heterogeneous computing system provided in an embodiment of this application. On the basis of the embodiment shown in Figure 2, and as shown in Figure 8, the method includes:

[0177] S801. Based on the hardware parameters, determine the computational performance characteristics by using a preset roofline model.

[0178] The computational roofline model can be a hardware performance analysis tool that combines the theoretical peak computing power of a device with its memory bandwidth to form a roofline boundary, which is used to describe the upper limit of the actual computing performance of the device under different computing intensities.

[0179] The preset computational roofline model combines the device's theoretical peak computing power with its memory bandwidth to form a roofline boundary, which describes the upper limit of the device's actual computing performance under different computing intensities.

[0180] By inputting the device's hardware parameters, such as the number of CPU / GPU cores, clock frequency, and floating-point operation capability, and using a predefined computing roofline model, the computing performance characteristics of the device under different computing loads can be calculated, including theoretical peak computing power and the curve of computing efficiency changing with computing intensity.

[0181] Optionally, the hardware parameters include the peak performance corresponding to each of multiple computational precisions, computing unit utilization, memory bandwidth, arithmetic intensity, and instruction pipeline efficiency. Determining the computational performance characteristics from the hardware parameters through the preset computational roofline model includes: determining an arithmetic intensity threshold based on the peak performance corresponding to the multiple computational precisions and the memory bandwidth; determining whether the arithmetic intensity is less than the arithmetic intensity threshold; if so, the device is memory-bound, and the computational performance characteristics are determined based on the memory bandwidth, arithmetic intensity, and instruction pipeline efficiency; otherwise, the device is compute-bound, and the computational performance characteristics are determined based on the peak performance corresponding to the multiple computational precisions, the computing unit utilization, and the instruction pipeline efficiency.

[0182] In this way, by using the arithmetic intensity threshold as a key judgment condition, the computing performance characteristics of the device are divided into two intervals for precise characterization, thus better matching the variable computing intensity scenarios in the task.

[0183] When the arithmetic intensity of a task is below this threshold, the device is in the memory-bound region: its performance is limited by memory bandwidth, and the computing units cannot be fully utilized.

[0184] When the arithmetic intensity of a task exceeds this threshold, the device is in the compute-bound region: its performance is limited by the peak capacity of the computing units.
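The two regions described in [0183] and [0184] correspond to the classic roofline bound. The sketch below assumes the threshold is the ratio of peak performance to memory bandwidth and that utilization and pipeline efficiency act as simple multiplicative factors; both are assumptions for illustration, and the hardware numbers are hypothetical.

```python
def attainable_performance(ai, peak_flops, mem_bw, utilization=1.0, pipeline_eff=1.0):
    """Return (performance in FLOP/s, region) for arithmetic intensity ai in FLOP/byte.

    Assumed form: threshold = peak_flops / mem_bw (the roofline ridge point).
    """
    threshold = peak_flops / mem_bw
    if ai < threshold:
        # Memory-bound region: limited by bandwidth times arithmetic intensity.
        return mem_bw * ai * pipeline_eff, "memory-bound"
    # Compute-bound region: limited by the peak capacity of the computing units.
    return peak_flops * utilization * pipeline_eff, "compute-bound"


PEAK_FP32 = 10e12   # hypothetical peak, FLOP/s
MEM_BW = 1e12       # hypothetical memory bandwidth, bytes/s

low = attainable_performance(2.0, PEAK_FP32, MEM_BW)
high = attainable_performance(50.0, PEAK_FP32, MEM_BW)
print(low)
print(high)
```

Sparse graph aggregation usually has low arithmetic intensity, so a GNN workload would often fall in the memory-bound branch.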

[0185] Optionally, the arithmetic intensity threshold can be calculated using the following formula:

[0186] AI_threshold = P_peak(p) / BW_mem

[0187] where P_peak(p) is the peak performance for precision type p, with p being one of FP32 (single-precision floating-point), FP16 (half-precision floating-point), INT8 (8-bit integer), and Tensor (tensor-core-specific precision), and BW_mem is the memory bandwidth.

[0188] Optionally, the computational performance characteristics can be determined using the following formula:

[0189]

[0190] where AI = FLOPs / Bytes is the arithmetic intensity, i.e. the number of floating-point operations performed per byte of memory access; the peak performance is taken for the precision type in use (FP32 is the single-precision floating-point type, FP16 is the half-precision floating-point type, INT8 is the 8-bit integer type, and Tensor is the tensor-core-specific precision type); and the remaining factors are the computing unit utilization and the instruction pipeline efficiency.

[0191] S802. Based on hardware parameters, determine memory performance characteristics using a preset memory roofline model.

[0192] Memory performance characteristics can be used to accurately characterize the actual bandwidth performance of multi-level memory in heterogeneous devices.

[0193] The preset memory roofline model can incorporate the device's multi-level cache, main memory, high-bandwidth memory, and even remote memory into the analysis, constructing a "layered roofline".

[0194] The memory roofline model can be used to characterize the bandwidth features of multi-level memory systems. It combines the cache hit rate and concurrent access efficiency at each level to calculate the effective bandwidth, reflecting the differences in memory access performance at different levels.

[0195] Optionally, the hardware parameters include at least one layer of memory set, and the original hardware bandwidth, hit rate, and concurrent access efficiency corresponding to each layer of memory. The hardware parameters also include main memory access latency and the access latency corresponding to each layer of memory. Based on the hardware parameters, memory performance characteristics are determined through a preset memory roofline model: For any layer of memory, the first effective bandwidth corresponding to the memory is determined based on the original hardware bandwidth, hit rate, and concurrent access efficiency corresponding to the memory, and the target effective bandwidth is determined based on the first effective bandwidth corresponding to each layer of memory in the at least one layer of memory set; For any layer of memory, the first weighted latency corresponding to the memory is determined based on the access latency corresponding to the memory, the hit rate of the previous at least one layer corresponding to the memory, and the hit rate corresponding to the memory, and the total weighted latency is determined based on the first weighted latency corresponding to each layer of memory in the at least one layer of memory set; The main memory weighted latency is determined based on the hit rate corresponding to each layer of memory and the main memory access latency; The average memory access latency is determined based on the total weighted latency and the main memory weighted latency.

[0196] Among them, memory performance characteristics include target effective bandwidth and average memory access latency.

[0197] The target effective bandwidth can be used to represent the actual data transmission capacity that the system can provide after considering cache hit rate and concurrent access efficiency.

[0198] Hit rate can represent the probability that data is successfully found in a given layer.

[0199] Concurrent access efficiency can be used to measure the bandwidth utilization of this layer when multiple requests are concurrent.

[0200] Optionally, the target effective bandwidth can be determined using the following formula:

[0201]

[0202] Here, the memory hierarchy set consists of L1 (L1 cache layer), L2 (L2 cache layer), L3 (L3 cache layer), HBM (high-bandwidth memory layer), DDR (double data rate memory layer), PCIe (high-speed serial bus layer), and NVLink (high-speed interconnect layer). For each layer in the set, the formula uses the raw hardware bandwidth of the layer, the cache hit rate of the layer, and the concurrent access efficiency of the layer; the hit rate and the concurrent access efficiency are each subject to a constraint.

[0203] The memory hierarchy set can represent the complete storage hierarchy from the cache (L1 / L2 / L3) closest to the computing unit, to on-chip high-speed memory (HBM), main memory (DDR), and then to the device interconnect (PCIe / NVLink).

[0204] Optionally, the average memory access latency can be determined using the following formula:

[0205] T_avg = Σ_{j∈ℒ} t_j × h_j × Π_{k<j} (1 - h_k) + t_mem × Π_{j∈ℒ} (1 - h_j)

[0206] where ℒ is the set of memory hierarchies (L1 is the L1 cache layer, L2 is the L2 cache layer, L3 is the L3 cache layer, HBM is the high-bandwidth memory layer, DDR is the double data rate memory layer, PCIe is the high-speed serial bus layer, and NVLink is the high-speed interconnect layer), t_j is the access latency of the j-th layer, h_j is the cache hit rate of the j-th layer, t_mem is the main memory latency, Π_{k<j} (1 - h_k) is the probability that an access misses in all of the first j-1 layers, and Π_{j∈ℒ} (1 - h_j) is the probability that an access misses in every cache layer.
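The average memory access latency described in [0195] (per-layer weighted latency plus main memory weighted latency) can be sketched as follows. The two-layer hierarchy, latencies, and hit rates are hypothetical, and the function name is an illustrative assumption.

```python
def average_access_latency(levels, main_latency):
    """Average memory access latency over a layered hierarchy.

    levels: list of (access_latency, hit_rate), ordered from the layer nearest
    the computing unit to the farthest.
    """
    miss_all_previous = 1.0   # probability of missing in every earlier layer
    total_weighted = 0.0
    for latency, hit_rate in levels:
        # Weighted latency of this layer: reached only if all earlier layers missed.
        total_weighted += miss_all_previous * hit_rate * latency
        miss_all_previous *= 1.0 - hit_rate
    # Main memory is reached only if every layer missed.
    main_weighted = miss_all_previous * main_latency
    return total_weighted + main_weighted


# Hypothetical hierarchy: (latency in ns, hit rate) for two cache layers.
levels = [(1.0, 0.5), (10.0, 0.5)]
t_avg = average_access_latency(levels, main_latency=100.0)
print(f"average latency = {t_avg} ns")
```

Here the result is 0.5×1 + 0.25×10 + 0.25×100 = 28 ns, dominated by the main memory term even though only a quarter of accesses reach it.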

[0207] S803. Based on the communication topology information and total memory, determine the communication performance characteristics by using a preset communication roofline model.

[0208] Communication topology information can refer to the connection methods and link attributes of devices within a heterogeneous computing cluster.

[0209] The pre-defined communication roofline model can be an analytical tool for quantifying the upper limit of a system's communication performance.

[0210] The communication topology information and total memory can be input into the preset communication roofline model to obtain the communication performance characteristics.

[0211] Optionally, the communication topology information includes the startup delay and transmission duration corresponding to each communication link. The communication performance characteristics are determined from the communication topology information and the total memory through the preset communication roofline model as follows: the communication type is determined based on the communication topology information, the communication type being one of point-to-point communication, general communication, and ring communication; a target communication roofline model is selected from multiple preset communication roofline models based on the communication type; and the communication performance characteristics are determined through the target communication roofline model based on the total memory and the startup delay and transmission duration corresponding to the communication link.

[0212] Communication performance characteristics may include latency.

[0213] Optionally, the latency of point-to-point communication can be determined using the following formula:

[0214] T = α_ij + m×β_ij

[0215] where m is the message size and α_ij is the startup delay between device i and device j;

[0216] β_ij is the per-unit data transmission time.

[0217] Optionally, the latency of the general communication type can be determined using the following formula:

[0218]

[0219] Where γ is the computation coefficient (computational overhead of the reduction operation), m is the message size, P is the number of processes, α is the device startup delay, and β is the unit data transmission time.

[0220] Optionally, the latency of the ring communication type can be determined using the following formula:

[0221]

[0222] where m is the message size, P is the number of processes, α is the device startup delay, and β is the per-unit data transmission time.
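For orientation, the sketch below evaluates latency under the widely used startup-plus-transfer cost model with the symbols m, P, α, β, and γ defined above. The point-to-point form follows directly from the definitions of startup delay and per-unit transmission time; the ring form is the common textbook cost of a ring all-reduce and is an assumption, not necessarily the formula used in the application. All numeric values are hypothetical.

```python
def p2p_latency(m, alpha, beta):
    """Point-to-point latency: startup delay plus message size times per-unit time."""
    return alpha + m * beta


def ring_allreduce_latency(m, p, alpha, beta, gamma=0.0):
    """Textbook ring all-reduce cost (assumed form, not taken from the source).

    2(P-1) steps each move m/P data; (P-1) of them also reduce m/P data.
    """
    steps = 2 * (p - 1)
    chunk = m / p
    return steps * alpha + steps * chunk * beta + (p - 1) * chunk * gamma


M_BYTES = 1_000_000   # message size in bytes
ALPHA = 1e-5          # startup delay in seconds
BETA = 1e-9           # seconds per byte, i.e. a 1 GB/s link

t_p2p = p2p_latency(M_BYTES, ALPHA, BETA)
t_ring = ring_allreduce_latency(M_BYTES, 4, ALPHA, BETA)
print(f"point-to-point: {t_p2p * 1e3:.3f} ms")
print(f"ring (P=4):     {t_ring * 1e3:.3f} ms")
```

For small messages the α term dominates, for large messages the β term dominates, which is why the link type selected for each node group matters.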

[0223] S804. Generate a high-dimensional performance tensor based on model information, computational performance characteristics, memory performance characteristics, and communication performance characteristics.

[0224] The model information, computational performance characteristics, memory performance characteristics, and communication performance characteristics can be normalized to determine the high-dimensional performance tensor.

[0225] Optionally, the high-dimensional performance tensor of the i-th heterogeneous computing device can be represented as a three-dimensional real-valued tensor of size D×M×T.

[0226] Here, i indexes the i-th heterogeneous computing device, D is the data type dimension, M is the memory level dimension, and T is the task type dimension.

[0227] Among them, the data type dimension can represent the data types that the model processes.

[0228] The memory hierarchy dimension can represent a multi-layered storage structure.

[0229] The task type dimension can represent the different computational task types of the model.

[0230] S805. Encode and learn the high-dimensional performance tensor and the load features through a preset encoder to generate a device performance profile.

[0231] High-dimensional performance tensors and load features can be input into a preset encoder to generate a device performance profile.

[0232] A Transformer encoder can be used to extract device performance profiles from high-dimensional performance tensors.

[0233] Optionally, the device performance profile is generated by encoding and learning the high-dimensional performance tensor and load features through the preset encoder as follows: the high-dimensional performance tensor is encoded and embedded to obtain a device embedding vector; based on the load features, the attention weights between the graph neural network model and the heterogeneous computing device are determined through multi-head cross-attention; based on the attention weights, the device embedding vector and load features are encoded through the preset encoder to determine the compatibility of the heterogeneous computing device; and based on the high-dimensional performance tensor, the device performance score of the heterogeneous computing device is determined.

[0234] The device performance profile includes the compatibility and the device performance score.

[0235] For example, given the flattened high-dimensional performance tensor and the load features, the encoding process is as follows:

[0236]

[0237] Here, the flattened vector is obtained by applying a flattening operation to the high-dimensional performance tensor; a learnable weight matrix and a bias vector map the flattened vector to the device embedding vector, which lies in the embedding space.

[0238] In other words, the encoding process flattens the high-dimensional performance tensor, maps the flattened vector into a low-dimensional embedding space, and outputs the embedding vector.

[0239] The multi-head cross-attention weights are calculated as follows:

[0240] $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

[0241] where $E$ is the embedding matrix of all devices and $\mathbf{e}_i$ is the embedding vector of the i-th device; the query matrix $Q$, the key matrix $K$, and the value matrix $V$ are obtained by linear transformation of the task load features $\mathbf{l}$ and the device embedding vectors using the weight matrices $W$; $d_k$ is the dimension of the key matrix; and softmax is a preset function used to determine the degree of matching between the task and each device.
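As a rough illustration of how softmax-normalized attention can express the degree of matching between a task and each device, the sketch below computes single-head scaled dot-product attention for one query vector. It is a simplification: the text describes multi-head cross-attention with learned projections, whereas here the projections are omitted and the query, keys, and values are hypothetical hand-picked vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns the attention weights (one per device) and the attended output.
    """
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output


# Hypothetical load query and three device keys/values.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, output = cross_attention(query, keys, values)
```

The device whose key aligns best with the load query receives the largest weight, which is the behaviour the matching step relies on.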

[0242] The compatibility is defined as:

[0243]

[0244] The compatibility quantifies how well a device matches the load features.

[0245] The Transformer encoder can use multi-head attention and feed-forward networks to fuse the device embedding vector and the load features, and finally outputs a scalar score that serves as the basis for task scheduling.

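[0246] The device performance score is calculated by multi-objective optimization:

[0247] $$S_i = w_c\,S_i^{\mathrm{comp}} + w_m\,S_i^{\mathrm{mem}} + w_n\,S_i^{\mathrm{comm}}$$

[0248] where $w_c$ is the computing weight, $w_m$ is the memory weight, and $w_n$ is the communication weight, satisfying $w_c + w_m + w_n = 1$ and $w_c, w_m, w_n \ge 0$; $S_i^{\mathrm{comp}}$ is the computing performance score, $S_i^{\mathrm{mem}}$ is the memory performance score, and $S_i^{\mathrm{comm}}$ is the communication performance score.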
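A minimal sketch of combining the three sub-scores is given below. It assumes the combination is a convex weighted sum (non-negative weights that sum to one); that form, the function name `device_score`, and the default weights are assumptions made for illustration.

```python
def device_score(compute, memory, comm, weights=(0.5, 0.3, 0.2)):
    """Combine compute, memory and communication scores with convex weights."""
    w_c, w_m, w_n = weights
    # Assumed constraint: weights are non-negative and sum to one.
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return w_c * compute + w_m * memory + w_n * comm


# Hypothetical sub-scores for one device, each already scaled to [0, 1].
score = device_score(compute=0.8, memory=0.6, comm=0.5)
```

Shifting weight toward the memory or communication term would favour devices suited to memory-bound or communication-heavy workloads.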

[0249] The computing performance score can be determined using the following formula:

[0250]

[0251] Here, the formula gives the computing performance score of the i-th device. AI denotes the computational intensity, which ranges between its lower and upper limits. The remaining terms are the actual performance of the i-th device at computational intensity AI and the ideal performance at computational intensity AI.

[0252] The memory performance score can be determined using the following formula:

[0253]

[0254] Here, the formula ranges over the set of memory hierarchy levels and gives the memory performance score of the i-th device. For the j-th memory level, it uses the actual available bandwidth, the cache hit rate, and the theoretical peak bandwidth of that level.

[0255] The communication performance score can be determined using the following formula:

[0256]

[0257] Here, the formula gives the communication performance score from the average startup latency, the average transmission time, and the average message size.

[0258] S806: Based on the model information and total memory, normalize the device performance profile to determine the device performance parameters of the heterogeneous computing device.

[0259] Device performance parameters may include normalized performance score, maximum number of storage nodes, and load mapping constraints.

[0260] The device performance profile can be normalized using a preset algorithm to determine the normalized performance score of heterogeneous computing devices. Based on the model information and total memory, the maximum number of storage nodes and load mapping constraints can be determined.

[0261] Optionally, the device performance profile includes a device performance score, and the model information includes the total number of nodes. Based on the total memory, the device performance profile is normalized to determine the device performance parameters of the heterogeneous computing devices, including: obtaining the maximum cluster performance score corresponding to multiple heterogeneous computing devices; normalizing the device performance score based on the maximum cluster performance score to determine the normalized performance score of the heterogeneous computing devices; determining the maximum number of storage nodes for the heterogeneous computing devices based on the total number of nodes, the total memory, and the memory capacity corresponding to the heterogeneous computing devices; and determining the load mapping constraints based on the normalized performance score and the maximum number of storage nodes.

[0262] Here, the device performance parameters of the heterogeneous computing devices include the normalized performance score, the maximum number of storage nodes, and the load mapping constraints.

[0263] Optionally, the normalized performance score of a heterogeneous computing device can be determined in the following way:

[0264] $$\hat{S}_i = \frac{S_i}{S_{\max}}$$

[0265] where $\hat{S}_i$ is the normalized performance score, $S_i$ is the device performance score, and $S_{\max}$ is the maximum performance score of the cluster.
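The normalization step can be sketched as dividing each device performance score by the maximum score in the cluster, so that the fastest device maps to 1. The division form is an assumption consistent with the description; the device names and scores are hypothetical.

```python
def normalize_scores(scores):
    """Scale device scores by the cluster maximum so the best device gets 1.0."""
    peak = max(scores.values())
    if peak <= 0:
        raise ValueError("cluster maximum score must be positive")
    return {device: value / peak for device, value in scores.items()}


# Hypothetical device performance scores for a three-device cluster.
normalized = normalize_scores({"gpu0": 0.9, "gpu1": 0.45, "cpu0": 0.18})
```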

[0266] Optionally, the maximum number of storage nodes can be determined in the following way:

[0267] $$N_i^{\max} = \frac{TN}{TM}\,M_i$$

[0268] where $N_i^{\max}$ is the maximum number of nodes the device can store, $TN$ is the total number of nodes in the graph dataset, $TM$ is the total memory required for the graph neural network model to compute the entire graph dataset, and $M_i$ is the memory capacity of the device.
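One way to read the relation between the total number of nodes, the total memory, and the device memory capacity is that memory is spread evenly over nodes, so a device can hold as many nodes as fit in its capacity. The sketch below follows that assumption; the even-spread model, the rounding down, and the function name are illustrative choices rather than details taken from the text.

```python
def max_storable_nodes(total_nodes, total_memory, device_memory):
    """Nodes that fit on a device if memory is spread evenly over all nodes.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    if total_nodes <= 0 or total_memory <= 0:
        raise ValueError("total_nodes and total_memory must be positive")
    return min(total_nodes, device_memory * total_nodes // total_memory)


GIB = 2 ** 30
# Hypothetical graph: 1,000,000 nodes needing 64 GiB in total, on a 16 GiB device.
limit = max_storable_nodes(1_000_000, 64 * GIB, 16 * GIB)
```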

[0269] Optionally, the load mapping constraint can be expressed by the following formula:

[0270]

[0271] Here, the formula relates the device performance score, the normalized proportion of graph nodes that can be mapped to the device under the load mapping strategy, and the maximum number of nodes that the device can store.
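To make the interplay between performance-proportional mapping and the storage limit concrete, the following sketch assigns graph nodes in proportion to normalized scores and caps each device at its maximum number of storable nodes, redistributing the overflow to the remaining devices. The redistribution loop is a hypothetical strategy, not the mapping procedure of the embodiment; it assumes all scores are positive and returns fractional node counts for simplicity.

```python
def map_load(total_nodes, norm_scores, capacity):
    """Split nodes proportionally to scores without exceeding any capacity."""
    assigned = {}
    open_devices = dict(norm_scores)  # devices that still have headroom
    remaining = float(total_nodes)
    while open_devices:
        weight = sum(open_devices.values())
        # Devices whose proportional share would exceed their storage limit.
        full = [d for d, s in open_devices.items()
                if remaining * s / weight > capacity[d]]
        if not full:
            for d, s in open_devices.items():
                assigned[d] = remaining * s / weight
            return assigned
        for d in full:
            assigned[d] = float(capacity[d])
            remaining -= capacity[d]
            del open_devices[d]
    if remaining > 1e-9:
        raise ValueError("cluster capacity is too small for the graph")
    return assigned


# Hypothetical cluster: device "a" is fastest but has the least memory.
plan = map_load(1000, {"a": 1.0, "b": 0.5, "c": 0.5},
                {"a": 300, "b": 1000, "c": 1000})
```

In this example device "a" would receive 500 nodes by score alone but is capped at 300, and the 200 surplus nodes are shared between "b" and "c".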

[0272] This embodiment provides a resource scheduling method for heterogeneous computing systems. It determines computing performance characteristics based on hardware parameters using a pre-defined computing roofline model; memory performance characteristics based on hardware parameters using a pre-defined memory roofline model; communication performance characteristics based on communication topology information and total memory using a pre-defined communication roofline model; a high-dimensional performance tensor is generated based on the model information, computing performance characteristics, memory performance characteristics, and communication performance characteristics; a device performance profile is generated by encoding and learning the high-dimensional performance tensor and load characteristics using a pre-defined encoder; and the device performance profile is normalized based on the model information and total memory to determine the device performance parameters of the heterogeneous computing devices. In this way, by constructing accurate device performance profiles through multi-dimensional roofline models and encoding learning, and obtaining unified and quantified device performance parameters through normalization, it provides accurate and quantifiable hardware performance basis for load mapping and communication optimization of tasks in heterogeneous clusters, thereby significantly improving the overall efficiency and resource utilization of distributed computing.

[0273] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method.

[0274] Figure 9 is a schematic diagram of the structure of a resource scheduling device for a heterogeneous computing system provided in an embodiment of this application. As shown in Figure 9, embodiments of this application also provide a resource scheduling device for a heterogeneous computing system.

[0275] The resource scheduling device 900 of the heterogeneous computing system includes an acquisition module 901, a computing module 902, a processing module 903, a partitioning module 904, and a scheduling module 905.

[0276] The acquisition module 901 is used to acquire hardware information of multiple heterogeneous computing devices in the heterogeneous computing system, as well as model information of the graph neural network model to be scheduled.

[0277] The computing module 902 is used to calculate the total memory required by the graph neural network model based on the model information.

[0278] The processing module 903 is used to normalize the performance of multiple heterogeneous computing devices based on hardware information, model information and total memory, and obtain the device performance parameters of multiple heterogeneous computing devices.

[0279] The partitioning module 904 is used to partition the graph neural network model according to the device performance parameters of multiple heterogeneous computing devices, and obtain the subgraph parameter information corresponding to each of the multiple heterogeneous computing devices.

[0280] The scheduling module 905 is used to dynamically schedule the resources of multiple heterogeneous computing devices based on multiple subgraph parameter information, and to execute the distributed operation of the graph neural network model.

[0281] In one possible implementation, the hardware information includes hardware parameters and communication topology information, and the model information includes load characteristics. For any heterogeneous computing device, the processing module 903 is specifically used for:

[0282] Based on hardware parameters, communication topology information, and total memory, the computing performance characteristics, memory performance characteristics, and communication performance characteristics are determined using multiple preset performance roofline models.

[0283] Based on model information, computational performance characteristics, memory performance characteristics, and communication performance characteristics, a high-dimensional performance tensor is generated.

[0284] By encoding and learning high-dimensional performance tensors and load features through a preset encoder, a device performance profile is generated.

[0285] Based on the model information and total memory, the device performance profile is normalized to determine the device performance parameters of heterogeneous computing devices.

[0286] In one possible implementation, the processing module 903 is specifically used for:

[0287] Based on hardware parameters, computing performance characteristics are determined using a preset computing roofline model.

[0288] Based on hardware parameters, memory performance characteristics are determined using a preset memory roofline model.

[0289] Based on the communication topology information and total memory, the communication performance characteristics are determined by using a preset communication roofline model.

[0290] In one possible implementation, the hardware parameters include peak performance corresponding to multiple computational precisions, computing unit utilization, memory bandwidth, arithmetic intensity, and instruction pipeline efficiency. The processing module 903 is specifically used for:

[0291] The arithmetic intensity threshold is determined based on the peak performance corresponding to multiple computational precisions and the memory bandwidth.

[0292] Determine whether the arithmetic intensity is less than the arithmetic intensity threshold.

[0293] If so, determine the computing performance characteristics based on the peak performance corresponding to multiple computational precisions, the computing unit utilization, and the instruction pipeline efficiency.

[0294] If not, determine the computing performance characteristics based on the memory bandwidth, the arithmetic intensity, and the instruction pipeline efficiency.

[0295] In one possible implementation, the hardware parameters include at least one layer of memory set, and the raw hardware bandwidth, hit rate, and concurrent access efficiency corresponding to each layer of memory. The hardware parameters also include main memory access latency and the access latency corresponding to each layer of memory. The processing module 903 is specifically used for:

[0296] For any layer of memory, the first effective bandwidth corresponding to the memory is determined based on the raw hardware bandwidth, hit rate, and concurrent access efficiency of the memory, and the target effective bandwidth is determined based on the first effective bandwidth of each layer of memory in the at least one layer of memory set.

[0297] For any layer of memory, the first weighted latency corresponding to the memory is determined based on the access latency corresponding to the memory, the hit rates of the at least one preceding layer of the memory, and the hit rate corresponding to the memory. The total weighted latency is then determined based on the first weighted latency corresponding to each layer of memory in the at least one layer of memory set.

[0298] The main memory weighted latency is determined based on the hit rate of each memory layer and the main memory access latency.

[0299] The average memory access latency is determined based on the total weighted latency and the main memory weighted latency.

[0300] The memory performance characteristics include the target effective bandwidth and the average memory access latency.
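The latency steps above resemble the standard average memory access time calculation, in which each level contributes its latency weighted by the probability that an access misses all earlier levels and hits at that level. The sketch below follows that reading. Two points are assumptions: the first effective bandwidth of a level is taken as the product of raw bandwidth, hit rate, and concurrent access efficiency, and the target effective bandwidth is taken as the sum over levels. All numbers are hypothetical.

```python
def memory_characteristics(levels, main_memory_latency):
    """Return (target effective bandwidth, average memory access latency).

    levels is ordered from the level closest to the compute units outward;
    each entry has raw_bandwidth, hit_rate, concurrency and latency.
    """
    effective_bandwidth = 0.0
    weighted_latency = 0.0
    miss_so_far = 1.0  # probability that every earlier level missed
    for level in levels:
        # First effective bandwidth of this level (assumed plain product),
        # aggregated by summation (also an assumption).
        effective_bandwidth += (
            level["raw_bandwidth"] * level["hit_rate"] * level["concurrency"]
        )
        # First weighted latency: earlier levels missed and this level hit.
        weighted_latency += miss_so_far * level["hit_rate"] * level["latency"]
        miss_so_far *= 1.0 - level["hit_rate"]
    # Main memory is reached only when every cache level missed.
    main_memory_weighted = miss_so_far * main_memory_latency
    return effective_bandwidth, weighted_latency + main_memory_weighted


# Hypothetical two-level hierarchy (bandwidth in GB/s, latency in ns).
levels = [
    {"raw_bandwidth": 1000.0, "hit_rate": 0.9, "concurrency": 0.5, "latency": 1.0},
    {"raw_bandwidth": 400.0, "hit_rate": 0.8, "concurrency": 0.5, "latency": 10.0},
]
bandwidth, latency = memory_characteristics(levels, main_memory_latency=100.0)
```

With these values the weighted latencies are 0.9 ns and 0.8 ns for the two cache levels plus 2.0 ns for main memory, giving an average of 3.7 ns.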

[0301] In one possible implementation, the communication topology information includes the start-up delay and transmission duration corresponding to the communication link, and the processing module 903 is specifically used for:

[0302] Based on the communication topology information, the communication type is determined. The communication types include point-to-point communication, general communication, and ring communication.

[0303] Based on the communication type, determine the target communication roofline model from multiple preset communication roofline models;

[0304] Based on the total memory and the startup delay and transmission duration of the communication link, the communication performance characteristics are determined using the target communication roofline model.
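Selecting a model by communication type and evaluating it from a startup delay and a transmission cost can be sketched with the well-known latency-bandwidth (alpha-beta) cost model, used here as a stand-in for the preset communication roofline models. The registry keys, the per-byte transmission cost, and the ring all-reduce formula with 2(p-1) steps of size n/p are assumptions of this sketch; the "general communication" type is left unmodelled.

```python
def p2p_time(message_bytes, startup, per_byte, devices=2):
    """Point-to-point transfer: one startup delay plus the transmission time."""
    return startup + message_bytes * per_byte


def ring_allreduce_time(message_bytes, startup, per_byte, devices):
    """Ring all-reduce: 2 * (p - 1) steps, each moving a 1/p slice of the data."""
    steps = 2 * (devices - 1)
    return steps * (startup + (message_bytes / devices) * per_byte)


# Hypothetical registry mapping a communication type to its cost model.
COMM_MODELS = {"point_to_point": p2p_time, "ring": ring_allreduce_time}


def comm_time(kind, message_bytes, startup, per_byte, devices=2):
    """Pick the target model for the communication type and evaluate it."""
    try:
        model = COMM_MODELS[kind]
    except KeyError:
        raise ValueError(f"no model registered for {kind!r}") from None
    return model(message_bytes, startup, per_byte, devices)


# Hypothetical link: 5 microseconds startup, 1 ns per byte, 1 MB of data.
p2p = comm_time("point_to_point", 1_000_000, 5e-6, 1e-9)
ring = comm_time("ring", 1_000_000, 5e-6, 1e-9, devices=4)
```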

[0305] In one possible implementation, the processing module 903 is specifically used for:

[0306] Encode and embed the high-dimensional performance tensor to obtain the device embedding vector;

[0307] Based on the load characteristics, the attention weights between the graph neural network model and heterogeneous computing devices are determined by multi-head cross-attention.

[0308] Based on attention weights, the device embedding vector and load features are encoded by a preset encoder to determine the compatibility of heterogeneous computing devices;

[0309] Determine the device performance score of heterogeneous computing devices based on the high-dimensional performance tensor;

[0310] The device performance profile includes compatibility and device performance score.

[0311] In one possible implementation, the device performance profile includes a device performance score, the model information includes the total number of nodes, and the processing module 903 is specifically used for:

[0312] Obtain the maximum performance score of the cluster corresponding to multiple heterogeneous computing devices;

[0313] Based on the cluster's maximum performance score, the device performance scores are normalized to determine the normalized performance scores of heterogeneous computing devices.

[0314] The maximum number of storage nodes for a heterogeneous computing device is determined based on the total number of nodes, the total amount of memory, and the memory capacity corresponding to the heterogeneous computing device.

[0315] The load mapping constraints are determined based on the normalized performance score and the maximum number of storage nodes.

[0316] Here, the device performance parameters of the heterogeneous computing devices include the normalized performance score, the maximum number of storage nodes, and the load mapping constraints.

[0317] In one possible implementation, the model information includes the network structure of the graph neural network model, graph dataset features, and sparsity features. The network structure includes multiple hidden layers, and the computing module 902 is specifically used for:

[0318] Based on the network structure, graph dataset features, and sparsity features, determine the total number of nodes, the dimension of input node features, the sparsity of the graph structure, and the output dimensions of multiple hidden layers;

[0319] The memory usage of each hidden layer is determined based on the total number of nodes, the feature dimension of the input nodes, the sparsity of the graph structure, and the output dimensions of multiple hidden layers.

[0320] The total amount of memory required for the graph neural network model is determined based on the memory usage of each hidden layer.

[0321] In one possible implementation, the partitioning module 904 is specifically used for:

[0322] Based on the device performance parameters of multiple heterogeneous computing devices, a cluster resource graph is established, and the cluster resource graph is coarsened layer by layer to obtain multiple root nodes;

[0323] Determine the original data graph corresponding to the graph neural network model;

[0324] Based on the multiple root nodes, iterative partitioning is performed on the cluster resource graph and the original data graph until the subgraph parameter information corresponding to each of the multiple heterogeneous computing devices is determined.

[0325] In one possible implementation, the scheduling module 905 is specifically used for:

[0326] Based on the multiple pieces of subgraph parameter information and the hardware information of the multiple heterogeneous computing devices, the multiple heterogeneous computing devices are grouped to obtain multiple graph node groups, and at least one communication link is established between the graph node groups. Each graph node group includes multiple nodes.

[0327] Delay detection is performed on the at least one communication link of the multiple nodes to identify at least one target node, where the delay corresponding to each target node is greater than a threshold.

[0328] For any target node, determine the remote graph node group connected to the target node, and generate a replica node corresponding to the target node within the remote graph node group so that the remote graph node group can complete intra-group communication by calling the replica node, and the replica node synchronizes data with the target node.
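The replica planning described above can be sketched as a scan over cross-group links: any node whose link to a remote graph node group exceeds the latency threshold becomes a target node, and a replica is planned inside that remote group. The data layout (a dictionary keyed by node pairs) and all names are hypothetical; data synchronization between a replica and its target node is not modelled.

```python
def plan_replicas(link_latency, node_group, threshold):
    """Return (target node, remote group) pairs that need a replica.

    link_latency maps (node, remote_node) to a measured latency and
    node_group maps each node to the graph node group it belongs to.
    """
    replicas = set()
    for (node, remote_node), latency in link_latency.items():
        crosses_groups = node_group[node] != node_group[remote_node]
        if crosses_groups and latency > threshold:
            # Place a copy of the slow node inside the remote group.
            replicas.add((node, node_group[remote_node]))
    return replicas


# Hypothetical measurements in milliseconds for two groups of two nodes.
node_group = {"a": "g0", "b": "g0", "c": "g1", "d": "g1"}
link_latency = {("a", "c"): 5.0, ("a", "b"): 0.2, ("b", "d"): 0.4}
replicas = plan_replicas(link_latency, node_group, threshold=1.0)
```

Only the link from "a" to "c" crosses groups and exceeds the threshold, so a replica of "a" is planned in group "g1".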

[0329] In one possible implementation, the device further includes a storage module 906, which is used for:

[0330] Map the local storage of multiple heterogeneous computing devices to the global physical address space through a cache coherence protocol;

[0331] In the global physical address space, remote memory addresses are accessed directly through inter-device communication protocols;

[0332] Synchronize local and remote memory data based on access requests for remote memory addresses.

[0333] For a description of the features of the resource scheduling device for a heterogeneous computing system in the corresponding embodiment, please refer to the relevant description of the resource scheduling method for a heterogeneous computing system in the corresponding embodiment, which will not be repeated here.

[0334] Figure 10 is a schematic diagram of the structure of the electronic device provided in this application. As shown in Figure 10, the electronic device 100 provided in this embodiment includes at least one processor 1001 and a memory 1002. Optionally, the electronic device 100 further includes a communication component 1003. The processor 1001, memory 1002, and communication component 1003 are connected via a bus.

[0335] In a specific implementation, the at least one processor 1001 executes computer-executable instructions stored in the memory 1002, causing the at least one processor 1001 to perform the above-described embodiments of the resource scheduling method for heterogeneous computing systems.

[0336] The specific implementation process of processor 1001 can be found in the above method embodiments, and its implementation principle and technical effect are similar. It will not be repeated here.

[0337] In the above embodiments, it should be understood that the processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), etc. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the method disclosed in the application can be directly manifested as being executed by a hardware processor, or executed by a combination of hardware and software modules within the processor.

[0338] The memory may include random access memory (RAM) and may also include non-volatile memory (NVM), such as at least one disk storage device.

[0339] The bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of illustration, the buses shown in the accompanying drawings are not limited to a single bus or a single type of bus.

[0340] Embodiments of this application also provide a computer-readable storage medium storing a computer program, wherein the computer program is configured to execute the steps in any of the above embodiments of the resource scheduling method for heterogeneous computing systems at runtime.

[0341] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard disk, magnetic disk, or optical disk.

[0342] Embodiments of this application also provide a computer program product, which includes a computer program that, when executed by a processor, implements the steps in any of the above embodiments of the resource scheduling method for heterogeneous computing systems.

[0343] Embodiments of this application also provide another computer program product, including a non-volatile computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps in any of the above-described embodiments of the resource scheduling method for heterogeneous computing systems.

[0344] Any of the components, modules, units, parts, methods, and operations described herein can be implemented using software, firmware, hardware (e.g., fixed logic circuitry), manual processing, or any combination thereof. Alternatively or additionally, any functionality described herein can be executed at least in part by one or more hardware logic components, such as, but not limited to, a central processing unit (CPU), a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), an application-specific standard product (ASSP), a system-on-a-chip (SoC), a complex programmable logic device (CPLD), a microcontroller unit (MCU), etc. The terms "system," "computing device," or "apparatus" as used herein encompass various means, devices, and machines for processing data, including, for example, one or more programmable processors, computers, SoCs, or combinations thereof. The apparatus may also include code that creates an execution environment for the computer program in question, such as code constituting processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or one or more combinations thereof. The aforementioned computer program (also known as a program, software, software application, app, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and can be deployed in any form, including as a standalone program or as a module, component, subroutine, object, or other unit suitable for a computing environment.

[0345] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0346] The resource scheduling method for a heterogeneous computing system provided in this application has been described in detail above. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the embodiments above are only for the purpose of helping to understand the method and its core ideas. It should be noted that those skilled in the art can make several improvements and modifications to this application without departing from the principles of this application, and these improvements and modifications also fall within the protection scope of this application.

Claims

1. A resource scheduling method for a heterogeneous computing system, characterized in that it comprises: obtaining hardware information of multiple heterogeneous computing devices in the heterogeneous computing system, and model information of a graph neural network model to be scheduled; calculating, based on the model information, the total memory required by the graph neural network model; normalizing the performance of the multiple heterogeneous computing devices based on the hardware information, the model information, and the total memory, to obtain device performance parameters of the multiple heterogeneous computing devices; partitioning the graph neural network model based on the device performance parameters of the multiple heterogeneous computing devices, to obtain subgraph parameter information corresponding to each of the multiple heterogeneous computing devices; and dynamically scheduling the resources of the multiple heterogeneous computing devices based on the multiple pieces of subgraph parameter information, to execute the distributed operation of the graph neural network model.

2. The method according to claim 1, characterized in that the hardware information includes hardware parameters and communication topology information, and the model information includes load features; for any heterogeneous computing device, normalizing the performance of the heterogeneous computing device based on the hardware information, the model information, and the total memory, to obtain the device performance parameters of the heterogeneous computing device, comprises: determining computing performance characteristics, memory performance characteristics, and communication performance characteristics, respectively, using multiple preset performance roofline models, based on the hardware parameters, the communication topology information, and the total memory; generating a high-dimensional performance tensor based on the model information, the computing performance characteristics, the memory performance characteristics, and the communication performance characteristics; generating a device performance profile by encoding and learning the high-dimensional performance tensor and the load features through a preset encoder; and normalizing the device performance profile based on the model information and the total memory, to determine the device performance parameters of the heterogeneous computing device.

3. The method according to claim 2, characterized in that determining the computing performance characteristics, memory performance characteristics, and communication performance characteristics, respectively, using multiple preset performance roofline models, based on the hardware parameters, the communication topology information, and the total memory, comprises: determining the computing performance characteristics based on the hardware parameters using a preset computing roofline model; determining the memory performance characteristics based on the hardware parameters using a preset memory roofline model; and determining the communication performance characteristics based on the communication topology information and the total memory using a preset communication roofline model.

4. The method according to claim 3, characterized in that the hardware parameters include peak performance corresponding to multiple computational precisions, computing unit utilization, memory bandwidth, arithmetic intensity, and instruction pipeline efficiency; determining the computing performance characteristics based on the hardware parameters using a preset computing roofline model comprises: determining an arithmetic intensity threshold based on the peak performance corresponding to the multiple computational precisions and the memory bandwidth; determining whether the arithmetic intensity is less than the arithmetic intensity threshold; if so, determining the computing performance characteristics based on the peak performance corresponding to the multiple computational precisions, the computing unit utilization, and the instruction pipeline efficiency; and if not, determining the computing performance characteristics based on the memory bandwidth, the arithmetic intensity, and the instruction pipeline efficiency.

5. The method according to claim 3, characterized in that the hardware parameters include at least one layer of memory set, and the raw hardware bandwidth, hit rate, and concurrent access efficiency corresponding to each layer of memory; the hardware parameters also include main memory access latency and the access latency corresponding to each layer of memory; determining the memory performance characteristics based on the hardware parameters using a preset memory roofline model comprises: for any layer of memory, determining the first effective bandwidth corresponding to the memory based on the raw hardware bandwidth, hit rate, and concurrent access efficiency of the memory, and determining the target effective bandwidth based on the first effective bandwidth corresponding to each layer of memory in the at least one layer of memory set; for any layer of memory, determining a first weighted latency based on the access latency corresponding to the memory, the hit rates of the at least one preceding layer of the memory, and the hit rate corresponding to the memory, and determining the total weighted latency based on the first weighted latency corresponding to each layer of memory in the at least one layer of memory set; determining the main memory weighted latency based on the hit rate of each layer of memory and the main memory access latency; and determining the average memory access latency based on the total weighted latency and the main memory weighted latency; wherein the memory performance characteristics include the target effective bandwidth and the average memory access latency.

6. The method according to claim 3, characterized in that the communication topology information includes the startup delay and transmission duration corresponding to the communication link; determining the communication performance characteristics based on the communication topology information and the total memory using a preset communication roofline model comprises: determining the communication type based on the communication topology information, the communication type including a point-to-point communication type, a general communication type, and a ring communication type; determining a target communication roofline model from multiple preset communication roofline models based on the communication type; and determining the communication performance characteristics through the target communication roofline model, based on the total memory and the startup delay and transmission duration corresponding to the communication link.

7. The method according to claim 2, characterized in that encoding and learning the high-dimensional performance tensor and the load features using a preset encoder to generate a device performance profile includes: encoding and embedding the high-dimensional performance tensor to obtain a device embedding vector; determining, based on the load features, attention weights between the graph neural network model and the heterogeneous computing device through multi-head cross-attention; encoding the device embedding vector and the load features with the preset encoder based on the attention weights to determine the adaptability of the heterogeneous computing device; and determining a device performance score of the heterogeneous computing device based on the high-dimensional performance tensor; wherein the device performance profile includes the adaptability and the device performance score.

8. The method according to claim 2, characterized in that the device performance profile includes a device performance score and the model information includes a total number of nodes; and normalizing the device performance profile based on the total memory to determine the device performance parameters of the heterogeneous computing device includes: obtaining a cluster maximum performance score corresponding to the multiple heterogeneous computing devices; normalizing the device performance score based on the cluster maximum performance score to determine a normalized performance score of the heterogeneous computing device; determining a maximum number of storage nodes for the heterogeneous computing device based on the total number of nodes, the total memory, and the memory capacity of the heterogeneous computing device; and determining load mapping constraints based on the normalized performance score and the maximum number of storage nodes; wherein the device performance parameters of the heterogeneous computing device include the normalized performance score, the maximum number of storage nodes, and the load mapping constraints.
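
The normalization in claim 8 might be sketched as below. The claim names the inputs of each step but not the formulas, so this sketch assumes that the normalized score is the device score divided by the cluster maximum, that the maximum number of storage nodes scales with the device's share of the total memory, and that the load mapping constraint is a per-device node limit proportional to the normalized score and capped by the storage limit. The function name and dictionary keys are hypothetical.

```python
def device_parameters(scores, memory_capacities, total_nodes, total_memory):
    """Return one parameter dict per device (illustrative formulas only)."""
    max_score = max(scores)  # cluster maximum performance score
    normalized = [score / max_score for score in scores]
    score_sum = sum(normalized)

    params = []
    for norm, capacity in zip(normalized, memory_capacities):
        # Assumed: nodes that fit in this device's memory, never more than all.
        max_nodes = min(total_nodes, int(total_nodes * capacity / total_memory))
        # Assumed: performance-proportional share, capped by the memory limit.
        share = int(total_nodes * norm / score_sum)
        params.append({
            "normalized_score": norm,
            "max_storage_nodes": max_nodes,
            "load_limit": min(share, max_nodes),
        })
    return params
```

For example, with scores `[100, 50]`, memory capacities `[40, 80]`, 1000 nodes, and a total memory of 100, the faster device is limited by its memory (400 nodes) rather than by its performance share.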

9. The method according to any one of claims 1-8, characterized in that the model information includes the network structure of the graph neural network model, graph dataset features, and sparsity features, the network structure including multiple hidden layers; and calculating the total memory required by the graph neural network model based on the model information includes: determining, based on the network structure, the graph dataset features, and the sparsity features, the total number of nodes, the input node feature dimension, the graph structure sparsity, and the output dimensions of the multiple hidden layers; determining the memory usage corresponding to each hidden layer based on the total number of nodes, the input node feature dimension, the graph structure sparsity, and the output dimensions of the multiple hidden layers; and determining the total memory required by the graph neural network model based on the memory usage of each hidden layer.
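
A minimal sketch of the per-layer memory estimate in claim 9 follows. The claim does not specify the accounting, so the sketch assumes that sparsity is the fraction of zero entries in the adjacency matrix, that each hidden layer holds a dense weight matrix, its output activations, and one copy of the edge index list, and that values and indices are 4 bytes each. These choices and the name `total_model_memory` are illustrative only.

```python
def total_model_memory(total_nodes, input_dim, sparsity, hidden_dims,
                       bytes_per_value=4, bytes_per_edge=8):
    """Estimate total bytes as the sum of per-hidden-layer usage."""
    # Assumed: non-zero adjacency entries = N * N * (1 - sparsity).
    num_edges = round(total_nodes * total_nodes * (1.0 - sparsity))

    per_layer = []
    in_dim = input_dim
    for out_dim in hidden_dims:
        weights = in_dim * out_dim           # dense layer weights
        activations = total_nodes * out_dim  # output features for every node
        usage = (weights + activations) * bytes_per_value
        usage += num_edges * bytes_per_edge  # edge list used for aggregation
        per_layer.append(usage)
        in_dim = out_dim                     # next layer consumes this output

    return sum(per_layer), per_layer
```

Counting the edge list once per layer overestimates memory when the adjacency structure is shared across layers; a shared-adjacency variant would add that term only once.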

10. The method according to any one of claims 1-8, characterized in that dividing the graph neural network model based on the device performance parameters of the multiple heterogeneous computing devices to obtain the subgraph parameter information corresponding to each of the multiple heterogeneous computing devices includes: establishing a cluster resource graph based on the device performance parameters of the multiple heterogeneous computing devices, and coarsening the cluster resource graph layer by layer to obtain multiple root nodes; determining the original data graph corresponding to the graph neural network model; and iteratively partitioning the cluster resource graph and the original data graph based on the multiple root nodes until the subgraph parameter information corresponding to each of the multiple heterogeneous computing devices is determined.

11. The method according to any one of claims 1-8, characterized in that dynamically scheduling the resources of the multiple heterogeneous computing devices based on the multiple pieces of subgraph parameter information includes: grouping the multiple heterogeneous computing devices based on the multiple pieces of subgraph parameter information and the hardware information of the multiple heterogeneous computing devices to obtain multiple graph node groups, and establishing at least one communication link between the graph node groups, wherein each graph node group includes multiple nodes; performing latency detection on the at least one communication link of the multiple nodes to determine at least one target node, wherein the latency corresponding to the at least one target node is greater than a threshold; and, for any target node, determining a remote graph node group connected to the target node and generating, within the remote graph node group, a replica node corresponding to the target node, so that the remote graph node group can complete intra-group communication by calling the replica node, the replica node being kept synchronized with the target node.
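
The latency detection and replica placement of claim 11 can be illustrated with the short sketch below. It assumes that latency measurements are already available as a mapping from a (node, remote graph node group) pair to a measured latency; the measurement method, the synchronization of replicas, and the name `plan_replicas` are all outside the claim text and hypothetical.

```python
def plan_replicas(link_latency_ms, threshold_ms):
    """Map each remote graph node group to the target nodes it should replicate.

    link_latency_ms: dict of (node, remote_group) -> measured latency in ms.
    """
    replicas = {}
    for (node, remote_group), latency in link_latency_ms.items():
        # A node becomes a target node when its link latency exceeds the threshold.
        if latency > threshold_ms:
            replicas.setdefault(remote_group, set()).add(node)
    return replicas
```

Each entry in the result names the replica nodes to create inside a remote group so that later accesses to the target node stay within that group; keeping replicas consistent with their target nodes would require a separate synchronization step.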

12. The method according to any one of claims 1-8, characterized in that, after acquiring the hardware information of the multiple heterogeneous computing devices in the heterogeneous computing system and the model information of the graph neural network model to be scheduled, the method further includes: mapping the local storage of the multiple heterogeneous computing devices into a global physical address space through a cache coherence protocol; directly accessing remote memory addresses in the global physical address space through an inter-device communication protocol; and synchronizing local and remote memory data based on access requests to the remote memory addresses.

13. An electronic device, characterized by comprising: a memory configured to store a computer program; and a processor configured to implement, when executing the computer program, the steps of the resource scheduling method for a heterogeneous computing system according to any one of claims 1 to 12.

14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the resource scheduling method for a heterogeneous computing system according to any one of claims 1 to 12.

15. A computer program product, comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the resource scheduling method for a heterogeneous computing system according to any one of claims 1 to 12.