Method and system for reducing GPU and CPU interaction overhead

By topologically sorting the operators of a DNN model, generating their kernel algorithm code in reverse order, and then running the chained kernels asynchronously on the GPU, the method reduces CPU-GPU interaction overhead and improves the running efficiency of the DNN model.

CN116185615B · Active · Publication Date: 2026-06-26 · GLENFLY TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
GLENFLY TECH CO LTD
Filing Date
2023-02-08
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

In existing technologies, the CPU must pass kernel algorithm parameters to every operator and schedule it. The GPU sits idle while the CPU does this, and the CPU waits while the GPU runs, which causes substantial performance overhead and wastes computing resources. The interaction overhead between the CPU and GPU is also relatively large.

Method used

The operators in the DNN model are topologically sorted, kernel algorithm code is generated for them in reverse order, and a concatenation function is inserted at the end of each operator's code. The code is then compiled with an online compiler and loaded onto the GPU, so that the model runs asynchronously with less CPU intervention.

Benefits of technology

It reduces the interaction overhead between the GPU and CPU, saves computing resources, and improves the running speed of DNN models.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a method and system for reducing GPU and CPU interaction overhead. The method comprises the following steps: according to the type and parameter information of all operators in a DNN model after topological sorting, generating kernel algorithm code operator by operator starting from the last operator, and inserting a concatenation function at the end of each generated kernel algorithm code according to the program information of the kernel algorithm code of the next operator to be run; calling an online compiler of a runtime library to compile the newly generated kernel algorithm code, and loading the resulting binary program into the GPU as input data in preparation for running; and, according to the input data, iteratively running each operator on the GPU through a running function, starting from the first operator, until a run-end event is detected, then ending the run and copying the output data to the CPU. The technical scheme eliminates the interaction overhead that arises between the GPU and the CPU when each operator is executed, saves computing resources, and improves the speed at which the model runs.

Description

Technical Field

[0001] This invention relates to the field of neural network technology, and in particular to a method and system for reducing the overhead of GPU and CPU interaction.

Background Technology

[0002] GPUs (Graphics Processing Units) have become the mainstream devices for accelerating DNNs (Deep Neural Networks), whether for inference or training. A DNN model is a DAG (Directed Acyclic Graph) composed of many operators (ops). For GPUs, each op is implemented as a kernel function that executes on the GPU. During a complete DNN run, the CPU passes kernel parameters to each op in the topologically ordered sequence and schedules them to run on the GPU. The biggest problem with this approach is that each op execution requires the CPU to pass kernel parameters, start its execution on the GPU, and wait for it to finish, resulting in significant performance overhead. Simultaneously, during this time interval, the GPU's execution units are idle, leading to considerable waste of computational resources.

[0003] As shown in Figure 1, the current approach to running DNN networks on GPU devices has the following problems:

[0004] Firstly, in a DNN network, each operator's kernel algorithm must go through the following steps before running: ① build the kernel algorithm code online (i.e., build kernel), which can be optimized by caching the results; ② set the kernel's running parameters through the parameter setting function clSetKernelArg; ③ finally call the running function clEnqueueNDRangeKernel to start running the kernel algorithm.

[0005] Secondly, the repetitive preparation steps before running the kernel algorithm incur significant performance overhead for the CPU, as well as interaction overhead between the CPU and GPU.

Summary of the Invention

[0006] One of the objectives of this invention is to overcome the shortcomings of the prior art. In the prior art, the GPU idles while the CPU passes and schedules kernel algorithm parameters, and the CPU idles while the GPU executes the kernel algorithm, which leads to performance overhead and wasted computing resources on both the CPU and the GPU. In view of these problems, this invention provides a method and system for reducing the interaction overhead between the GPU and the CPU.

[0007] To achieve the above objectives, the present invention is implemented through the following technical solution:

[0008] In a first aspect, the present invention provides a method for reducing GPU and CPU interaction overhead, the method comprising:

[0009] S100: Based on the type and parameter information of all operators in the DNN model after topological sorting, starting from the last operator, generate the kernel algorithm code corresponding to each operator in reverse order according to the code template corresponding to the operator type and the parameter information of the operator. At the same time, at the end of each generated kernel algorithm code, insert a concatenation function to start the next operator to be run after the current operator has finished running, based on the program information of the kernel algorithm code of the next operator to be run.

[0010] S200: Call the online compiler of the runtime library to compile the kernel algorithm code of the newly generated operator, and load the compiled binary program into the GPU as input data for execution;

[0011] S300: Based on the input data, the DNN model is iteratively run on the GPU starting from the first operator through the running function until the end of the run is detected and the output data is copied to the CPU.

[0012] In a preferred embodiment of this application, S100 specifically includes:

[0013] S101: Perform topological sorting on each operator in the DNN model to obtain the type and parameter information of each operator;

[0014] S102: Set the generation order of the kernel algorithm code to reverse order, and set the starting position i of the kernel algorithm code to be generated for each operator to the position n of the last operator;

[0015] S103: Construct the algorithm model corresponding to the i-th operator based on the code template corresponding to the type of the i-th operator and the parameter information of the operator;

[0016] S104: Generate the kernel algorithm code corresponding to the i-th operator based on the algorithm model and parameter information corresponding to the i-th operator;

[0017] S105: At the end of the kernel algorithm code corresponding to the i-th operator, according to the program information of the kernel algorithm code of the (i+1)-th operator, insert a concatenation function to start the (i+1)-th operator after the i-th operator has finished running;

[0018] S106: Determine if i is 1; if i=1, then execute S200; if i≠1, then execute S107.

[0019] S107: Let i = i-1, then return to execute S103.
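The loop control in S101 through S107 can be illustrated with a minimal sketch. It assumes a small, hypothetical operator graph (the operator names and the dependency dictionary are not from the application) and uses Python's standard `graphlib` module for the topological sort; the actual code generation of S103 to S105 is reduced to recording which operator is visited.

```python
from graphlib import TopologicalSorter

# Hypothetical operator DAG: each key maps to the set of operators it depends on.
DEPS = {
    "conv": set(),
    "relu": {"conv"},
    "pool": {"relu"},
    "fc": {"pool"},
}

def reverse_generation_order(deps):
    """Return (topological order, order in which kernel code would be generated)."""
    topo = list(TopologicalSorter(deps).static_order())  # S101: topological sort
    n = len(topo)
    generated = []
    i = n                                # S102: start at the last operator, i = n
    while True:
        generated.append(topo[i - 1])    # S103-S105 would build the code for operator i
        if i == 1:                       # S106: the first operator has been handled
            break
        i -= 1                           # S107: move to the previous operator
    return topo, generated

topo, generated = reverse_generation_order(DEPS)
```

Because the successor of operator i is always generated before operator i itself, its program information is available at the moment the concatenation function has to be inserted.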

[0020] In a preferred embodiment of this application, the type and parameter information of the operator include the attribute information of each operator, the shape of the input and output tensors, the memory address allocated to the tensor for storing data, and the entry address information of the kernel algorithm code corresponding to the subsequent operator;

[0021] The shape of the output tensor is calculated based on the attribute information and the shape of the input tensor;

[0022] If the operator to be used to generate kernel algorithm code is not the last operator, then obtain the entry address information of the executable program obtained after compiling the kernel algorithm code corresponding to the successor operator after this operator.

[0023] In a preferred embodiment of this application, the attribute information of each operator, the shape of the input and output tensors, the memory address allocated to the tensor for storing data, and the entry address information of the kernel algorithm code corresponding to the subsequent operator are generated by a code generation function; the parameter information and algorithm model of each operator are used to construct the corresponding kernel algorithm code by a compilation function.

[0024] In a preferred embodiment of this application, S300 specifically includes:

[0025] S301: Copy input data from the CPU;

[0026] S302: Start running the first operator on the GPU by running the function;

[0027] S303: During operation, the CPU asynchronously listens for execution termination events emitted by the GPU;

[0028] S304: After the CPU detects the end of the process event, the GPU copies the current calculation result as output data to the CPU's memory for further processing.

[0029] Repeat steps S301 through S304.

[0030] In a second aspect, the present invention provides a system for reducing the interaction overhead between GPU and CPU, the system comprising a concatenation module, a compilation module, and a runtime module;

[0031] The concatenation module is used to generate kernel algorithm code for each operator in reverse order, starting from the last operator, based on the type and parameter information of all operators in the DNN model after topological sorting, according to the code template corresponding to the operator type and the parameter information of the operator. At the same time, at the end of each generated kernel algorithm code, a concatenation function is inserted based on the program information of the kernel algorithm code of the next operator to be run, which is used to start the next operator to be run after the current operator has finished running.

[0032] The compilation module is used to call the online compiler of the runtime library to compile the kernel algorithm code of the newly generated operator, and load the compiled binary program into the GPU as input data for execution;

[0033] The running module is used to repeatedly iterate through each operator in the DNN model starting from the first operator on the GPU according to the input data, until the running ends when the running end event is detected, and then copy the output data to the CPU.

[0034] In a third aspect, the present invention provides a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to perform the method for reducing GPU and CPU interaction overhead as described in the first aspect.

[0035] In a fourth aspect, the present invention provides a computer program product comprising a computer program that, when run on a computer, causes the computer to perform the method for reducing GPU and CPU interaction overhead as described in the first aspect.

[0036] The present invention discloses a method and system for reducing GPU and CPU interaction overhead. This ensures that when a DNN model is run iteratively on a GPU, only the preparation process before the first run requires a significant amount of CPU and GPU resources. Subsequent iterations only require the data copying overhead of the input or output tensors that are essential when running the DNN model. This eliminates the interaction overhead caused by the interaction between the GPU and CPU required for the execution of each operator in the DNN model, thus saving computing resources and improving the speed of model operation.

Attached Figure Description

[0037] The present invention is described with reference to the following figures:

[0038] Figure 1 is a flowchart of the existing technology for running DNN models on GPUs;

[0039] Figure 2 is a flowchart of the method for reducing GPU and CPU interaction overhead in Embodiment 1 of the present invention;

[0040] Figure 3 is a schematic diagram of running a DNN model asynchronously on a GPU in the method for reducing GPU and CPU interaction overhead in Embodiment 1 of the present invention;

[0041] Figure 4 is a flowchart of S100 in the method for reducing GPU and CPU interaction overhead in Embodiment 1 of the present invention;

[0042] Figure 5 is an example diagram of generating the kernel algorithm code corresponding to each operator in reverse order in the method for reducing GPU and CPU interaction overhead in Embodiment 1 of the present invention;

[0043] Figure 6 is a flowchart of S300 in the method for reducing GPU and CPU interaction overhead in Embodiment 1 of the present invention;

[0044] Figure 7 is a schematic diagram of a system for reducing GPU and CPU interaction overhead in Embodiment 2 of the present invention.

Detailed Implementation

[0045] To better understand the technical solution of this application, the embodiments of this application will be described in detail below with reference to the accompanying drawings.

[0046] It should be understood that the described embodiments are merely some, not all, of the embodiments in this application. All other embodiments obtained by those skilled in the art based on the embodiments in this application without inventive effort are within the scope of protection of this application.

[0047] This invention provides a method and system for reducing GPU and CPU interaction overhead. By inserting a concatenation function at the end of the kernel algorithms corresponding to each operator after topological sorting, all kernel algorithms are chained together. This allows the entire DNN model to run automatically on the GPU without CPU intervention. This changes the original synchronous operation to an asynchronous operation, eliminating the waste of computational resources caused by the need for CPU intervention in the execution of each operator, and the interaction overhead caused by the interaction between the CPU and GPU.

[0048] Embodiment 1

[0049] Embodiment 1 of the present invention discloses a method for reducing the interaction overhead between GPU and CPU. Based on the OpenCL platform, during the operation of the DNN network, the execution of each operator of the DNN model is always in the GPU without CPU intervention, thus saving a lot of computing and interaction resources for both CPU and GPU.

[0050] The CPU, or Central Processing Unit, is the core of a computer system, responsible for computation and control. It is the final execution unit for information processing and program execution. A CPU typically has 25% ALU (arithmetic and logic unit), 25% control unit, and 50% cache unit. Because the CPU involves general-purpose computing and is highly complex, it needs a small number of ALUs to handle complex calculations. Its powerful logical operation capabilities require sufficient control units to implement complex data control and forwarding; it also needs sufficient cache units to store results from completed calculations or data that will be used soon.

[0051] GPU, or Graphics Processing Unit, also known as a display core, visual processor, or display chip, is a microprocessor specifically designed for performing image and graphics-related computations in personal computers, workstations, game consoles, and some mobile devices (such as tablets and smartphones). A GPU has 90% ALU (Arithmetic Logic Unit), 5% Control Unit, and 5% Cache Unit. GPUs have high throughput capabilities, hence the large number of execution units. These execution units contain a significant number of ALUs and threads. To balance memory latency, we can fully utilize the characteristics of more ALUs to achieve a very high throughput. The smaller number of control and cache units are mainly responsible for merging and forwarding data. Since the GPU's demand for these two tasks is relatively low, the control and cache units occupy a small amount of GPU space. The architectural characteristics of GPUs are a perfect match for the operating characteristics of DNN models, therefore GPUs are widely used for DNN model training and inference. GPU programming interfaces include CUDA (Compute Unified Device Architecture), OpenCL (Open Computing Language), and Vulkan (Graphics Application Programming Interface), among others.

[0052] Referring to Figure 2, the method for reducing GPU and CPU interaction overhead in Embodiment 1 includes:

[0053] S100: Based on the type and parameter information of all operators in the DNN model after topological sorting, starting from the last operator, generate the kernel algorithm code corresponding to each operator in reverse order according to the code template corresponding to the operator type and the parameter information of the operator. At the same time, at the end of each generated kernel algorithm code, insert a concatenation function to start the next operator to be run after the current operator has finished running, based on the program information of the kernel algorithm code of the next operator to be run.

[0054] S200: Call the online compiler of the runtime library to compile the kernel algorithm code of the newly generated operator, and load the compiled binary program into the GPU as input data for execution;

[0055] S300: Based on the input data, the DNN model is iteratively run on the GPU, starting from the first operator, through the running function until the end of the run is detected and the output data is copied to the CPU.

[0056] Specifically, most DNN networks are static networks, or at least contain one static subgraph with a large number of computationally intensive operators (ops). DNN networks use tensors as the processing units of their data flow. A static network is one in which the dimensions and size of each operator's input and output tensors remain unchanged across network iterations, and in which the properties of each operator, such as the kernel size of a convolution, also remain constant. Computationally intensive operators are well suited to GPU acceleration. Based on this static characteristic, in step S100 the method of Embodiment 1 uses construction functions in the GPU's OpenCL programming interface (such as the combination of clCreateProgramWithSource / clBuildProgram / clCreateKernel) to build the kernel algorithm code for each operator online, which allows the kernel algorithm code to be automatically re-edited after generation. At the same time, a concatenation function `enqueueNextKernel` is inserted at the end of each operator's kernel algorithm code, linking the kernel algorithm codes of the topologically sorted operators together. The concatenation function `enqueueNextKernel` calls the kernel algorithm code of the next operator to be run, so that all operators run automatically and in sequence on the GPU. In step S200, an online compiler (CodeGenerator) compiles all the generated kernel algorithm code, with the concatenation function added, into binary programs adapted to the GPU runtime environment. The binary programs compiled from all the kernel algorithm code together form the input data and are loaded into the GPU. Finally, in step S300, the operators of the DNN model are run in sequence on the GPU, started through the function clEnqueueNDRangeKernel, until the end-of-run event occurs. For example, as shown in Figure 3, the part enclosed in curly braces is the portion that runs on the GPU alone without CPU intervention, while the parts outside the braces, corresponding to the two arrows at the input and output ends, are the portion coordinated by the CPU. In Figure 3, the input data is prepared before the GPU runs the DNN network. The DNN model is started from operator 1 through the function clEnqueueNDRangeKernel. After the kernel algorithm code of operator 1 has executed, the kernel algorithm code of operator 2 is called directly through the concatenation function enqueueNextKernel; after operator 2 has executed, operator 3 is called in the same way, and so on, until operator n-1 calls the kernel algorithm code of the last operator n. Operator n has no subsequent operator to call, so once it finishes the run-end event is triggered: the GPU outputs its data, ends the run, and sends the run-end event to the CPU.
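The chained execution described in this paragraph can be modeled on the host with a small simulation. This is only a sketch under stated assumptions: a `deque` stands in for the GPU's device queue, appending to it plays the role of `enqueueNextKernel`, and the helper names `make_kernel` and `run_chain` are hypothetical. No real OpenCL call is made.

```python
from collections import deque

def make_kernel(name, next_name, trace, device_queue, done):
    """Build a stand-in kernel that does its work, then chains to its successor."""
    def kernel():
        trace.append(name)                   # stand-in for the operator's computation
        if next_name is not None:
            device_queue.append(next_name)   # plays the role of enqueueNextKernel
        else:
            done.append(True)                # last operator: raise the run-end event
    return kernel

def run_chain(op_names):
    """Run all operators with a single host-side launch; return what happened."""
    trace, done, device_queue = [], [], deque()
    kernels = {}
    next_name = None
    for name in reversed(op_names):          # build last-to-first so the successor is known
        kernels[name] = make_kernel(name, next_name, trace, device_queue, done)
        next_name = name
    host_launches = 0
    device_queue.append(op_names[0])         # the only host launch (clEnqueueNDRangeKernel)
    host_launches += 1
    while device_queue:                      # the "GPU" drains its own queue
        kernels[device_queue.popleft()]()
    return trace, host_launches, bool(done)

trace, host_launches, finished = run_chain(["op1", "op2", "op3", "op4"])
```

In this model the host touches the queue exactly once, however many operators the chain contains; every later hand-off happens inside the simulated device.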

[0057] By adopting the method of Embodiment 1, the interaction frequency between the GPU and CPU during the running of the DNN model is greatly reduced, so that the running of the DNN model always occurs in the GPU. Moreover, the more operators in the DNN model, the more significant the reduction in computing resources and interaction overhead of the CPU and GPU, and the more significantly the speed of running the DNN network on the GPU is improved.
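The claim that the saving grows with the number of operators can be made concrete with a toy counting model. The model is an assumption made for illustration, not a measurement: it charges the conventional flow three host-side steps per operator (set arguments, enqueue, wait) and the chained flow two steps per run (one launch, one wait), and it ignores the input and output copies that both flows need.

```python
def host_interactions(num_ops: int, chained: bool) -> int:
    """Count host-side steps for one run of the model under the toy model."""
    if chained:
        # One launch of the first operator plus one wait for the run-end event.
        return 2
    # Per operator: set kernel arguments, enqueue the kernel, wait for completion.
    return 3 * num_ops

def saved_interactions(num_ops: int) -> int:
    """Host-side steps avoided per run by chaining the kernels."""
    return host_interactions(num_ops, chained=False) - host_interactions(num_ops, chained=True)

if __name__ == "__main__":
    for n in (7, 50, 200):
        print(n, host_interactions(n, False), host_interactions(n, True), saved_interactions(n))
```

Under these assumptions the conventional count grows linearly with the number of operators while the chained count stays constant, which is the trend the paragraph describes.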

[0058] Referring to Figure 4, in the method of Embodiment 1, S100 specifically includes:

[0059] S101: Perform topological sorting on each operator in the DNN model to obtain the type and parameter information of each operator;

[0060] S102: Set the generation order of the kernel algorithm code to reverse order, and set the starting position i of the kernel algorithm code to be generated for each operator to the position n of the last operator;

[0061] S103: Construct the algorithm model corresponding to the i-th operator based on the code template corresponding to the type of the i-th operator and the parameter information of the operator;

[0062] S104: Generate the kernel algorithm code corresponding to the i-th operator based on the algorithm model and parameter information corresponding to the i-th operator;

[0063] S105: At the end of the kernel algorithm code corresponding to the i-th operator, according to the program information of the kernel algorithm code of the (i+1)-th operator, insert a concatenation function to start the (i+1)-th operator after the i-th operator has finished running;

[0064] S106: Determine if i is 1; if i=1, then execute S200; if i≠1, then execute S107.

[0065] S107: Let i = i-1, then return to execute S103.

[0066] Specifically, before the kernel algorithm code is generated, all operators in the DNN model are topologically sorted to determine the order in which they run, which defines the function of the DNN model and yields the type and parameter information of each operator. This facilitates the subsequent insertion of the concatenation function `enqueueNextKernel` and the calling of subsequent operators through it. The kernel algorithm code is generated in reverse order: generation starts from the last operator in the topologically sorted sequence and proceeds operator by operator toward the first, in an iterative process, as shown in Figure 5. A concatenation function `enqueueNextKernel` is inserted at the end of each operator's kernel algorithm code so that all operators in the DNN model and their kernel algorithm codes are chained together. Specifically, assume the DNN model has a set {1, 2, …, i, …, n} of n operators. An algorithm model is constructed for the type of the i-th operator; this is the kernel template used to generate the kernel algorithm code, and it contains kernel parameters that are still to be filled in. These kernel parameters are filled in according to the parameter information of the i-th operator to generate the kernel algorithm code of the i-th operator. Then, with reference to the program information of the kernel algorithm code already generated for the (i+1)-th operator, a concatenation function `enqueueNextKernel` is inserted at the end of the kernel algorithm code of the i-th operator, so that the (i+1)-th operator is called automatically once the i-th operator has finished running. Finally, check whether i is 1. If i is not 1, set i = i - 1 and return to generate kernel algorithm code for the new operator. Continue in this way until i = 1, that is, until the kernel algorithm code for the first operator has been generated and the concatenation function enqueueNextKernel has been inserted at its end.
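The reverse-order generation with an inserted concatenation call can be sketched as a short source generator. Everything here is illustrative: the template bodies are placeholders, the `<name>_info` handle standing for the successor's program information is hypothetical, and the choice to emit no `enqueueNextKernel` call for the last operator is an assumption, since that operator has no successor.

```python
# Placeholder bodies, one per operator type (assumed, not the application's templates).
BODY_TEMPLATES = {
    "conv2d": "    /* conv2d body */\n",
    "pooling": "    /* pooling body */\n",
    "fully_connected": "    /* fully connected body */\n",
}

def generate_chain(ops):
    """ops: list of (name, op_type) in topological order. Returns {name: source}."""
    sources = {}
    next_info = None                          # the last operator has no successor
    for name, op_type in reversed(ops):       # i = n, n-1, ..., 1
        body = BODY_TEMPLATES[op_type]
        if next_info is None:
            tail = "    /* last operator: the run-end event follows */\n"
        else:
            tail = (
                "    enqueueNextKernel(getDefaultQueue(), dims, global_work_size,\n"
                f"                      local_work_size, {next_info});\n"
            )
        sources[name] = f"__kernel void {name}(void) {{\n{body}{tail}}}\n"
        next_info = f"{name}_info"            # hypothetical handle to this kernel's program info
    return sources

chain_sources = generate_chain(
    [("conv_a", "conv2d"), ("pool_a", "pooling"), ("fc_a", "fully_connected")]
)
```

Generating from the back is what makes `next_info` available: by the time an operator's source is written, the handle of the operator that follows it already exists.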

[0067] The DNN network operates based on OpenCL 1.2. By extending the OpenCL 1.2 standard, a concatenation function `enqueueNextKernel` is implemented, which is a kernel function. After the CPU completes the preparation work for running the DNN network, it transmits the concatenated operators and their corresponding kernel algorithm code to the GPU. This eliminates the need for the CPU to construct kernel algorithm code, set kernel parameters, and start kernel execution for each operator during DNN network operation, significantly reducing CPU computational resource consumption and the overhead of CPU-GPU interaction.

[0068] In the method of this embodiment 1, the operator type and parameter information obtained in step S101 include the attribute information of each operator, the shape of the input tensor, the shape of the output tensor, and the algorithm information of the kernel algorithm code corresponding to the subsequent operator; the shape of the output tensor is deduced based on the attribute information and the shape of the input tensor; if the operator to which the kernel algorithm code is to be generated is not the last operator, the program information of the kernel algorithm code corresponding to the subsequent operator is obtained. Before running the DNN network on the GPU, memory needs to be allocated to the constant tensors of each operator in the DNN model to store constant data, such as weights, biases, etc., and then the specific constant data is copied into the constant tensors as the input tensors of each operator. Then, based on the shape of the input tensor of each operator and its own attribute information (characterizing the characteristics of the operator), the shape of the output tensor of each operator is deduced, and finally, memory is allocated to all tensors that have not been allocated memory.
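The shape deduction and the final memory allocation step can be sketched as follows. The sketch assumes a square convolution kernel, uses the standard convolution output-size formula, and takes 2 bytes per element for fp16; the function names and the tensor dictionary layout are hypothetical.

```python
def conv2d_output_shape(in_h, in_w, kernel, stride=1, padding=0):
    """Deduce the output height and width of a 2D convolution from its attributes."""
    out_h = (in_h + 2 * padding - kernel) // stride + 1
    out_w = (in_w + 2 * padding - kernel) // stride + 1
    return out_h, out_w

def tensor_bytes(shape, bytes_per_element=2):
    """Size of a dense tensor; 2 bytes per element models fp16 storage."""
    size = bytes_per_element
    for dim in shape:
        size *= dim
    return size

def allocate_missing(tensors):
    """Assign an offset to every tensor that has no memory yet.

    tensors: {name: {"shape": tuple, "offset": int or None}}.
    Returns the total number of newly reserved bytes.
    """
    cursor = 0
    for info in tensors.values():
        if info["offset"] is None:            # constant tensors were allocated earlier
            info["offset"] = cursor
            cursor += tensor_bytes(info["shape"])
    return cursor

out_h, out_w = conv2d_output_shape(28, 28, kernel=5)
```

Because the network is static, these shapes and offsets are computed once before the first run and can then be baked into the generated kernel code.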

[0069] The attribute information of each operator, the shape of the input tensor, the shape of the output tensor, and the program information of the kernel algorithm code corresponding to the subsequent operator are generated by the code generator function. The compilation function constructs the kernel algorithm code corresponding to each operator based on the type of the operator, its parameter information, and the algorithm model.

[0070] Taking the 2D convolutional Kernel algorithm code as an example, the original implementation of the Kernel algorithm code was as follows:

[0071] __kernel void conv2d(__global fp16* input, __global fp16* output, __global fp16* weight,

[0072] __global fp16* weights, __global fp16* bias, const int input_width,

[0073] const int input_height, const int output_width,

[0074] const int output_height);

[0075] Now it becomes:

[0076] __kernel void conv2d(void) {

[0077] __global fp16* input;

[0078] __global fp16* output;

[0079] __global fp16* weight;

[0080] __global fp16* weights;

[0081] __global fp16* bias;

[0082] const int input_width;

[0083] const int input_height;

[0084] const int output_width;

[0085] const int output_height;

[0086] ………

[0087] enqueueNextKernel(getDefaultQueue(), dims, global_work_size,

[0088] local_work_size, next_kernel_info);

[0089] }

[0090] In the modified Kernel algorithm code, the specific value of each parameter will be automatically hard-coded into the code by a code generation function using the collected operator type and parameter information.
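How a code generation function might hard-code parameter values into a parameterless kernel can be sketched with Python's `string.Template`. The template text, the field names (including `output_height`), and the helper `render_conv2d` are assumptions made for illustration; the application does not disclose its generator.

```python
from string import Template

# Assumed template: the former kernel arguments become constants filled in at generation time.
CONV2D_TEMPLATE = Template("""\
__kernel void ${name}(void) {
    const int input_width = ${input_width};
    const int input_height = ${input_height};
    const int output_width = ${output_width};
    const int output_height = ${output_height};
    /* ... convolution body elided ... */
}
""")

def render_conv2d(name, params):
    """Fill the template; substitute() raises KeyError if a parameter is missing."""
    return CONV2D_TEMPLATE.substitute(name=name, **params)

conv_src = render_conv2d(
    "conv2d_1",
    {"input_width": 28, "input_height": 28, "output_width": 24, "output_height": 24},
)
```

Since every value is a literal in the generated source, the host no longer needs a `clSetKernelArg` call per parameter before the kernel runs.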

[0091] Referring to Figure 5, take LeNet (a type of convolutional neural network) as an example. LeNet takes a 28x28 monochrome single-channel image as input and outputs 10 numbers, representing the probabilities of the digits 0-9. This is a completely static network: it has no operators that require dynamic memory allocation or implement control flow, so it is easy to run entirely on a GPU. LeNet has seven operators but only three operator types: conv2d, pooling, and fully connected. Implementing the kernel algorithm code for every operator therefore only requires implementing the algorithm templates for these three types. Then, starting from the last operator, fully connected_3, the kernel algorithm code is automatically generated and compiled in reverse order of operator execution. The parameters of conv2d_1 and conv2d_2 differ; each set of parameters is passed to the code generation function and combined with the algorithm template of the conv2d operator to generate the final kernel algorithm code for that operator. The automatically generated kernel algorithm codes are eventually linked together in GPU memory and executed in sequence until the kernel algorithm code of the last operator has executed, at which point the CPU is notified that the run has ended.

[0092] Referring to Figure 6, in the method of Embodiment 1, S300 specifically includes:
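The bookkeeping in this example (seven operators, three templates, generation starting from the last operator) can be checked with a few lines. The operator ordering below follows the classic layout of this network and is an assumption; only conv2d_1, conv2d_2, and the final fully connected operator are named in the text, and the remaining names are made up for the sketch.

```python
# (operator name, operator type) in assumed execution order.
NETWORK_OPS = [
    ("conv2d_1", "conv2d"),
    ("pooling_1", "pooling"),
    ("conv2d_2", "conv2d"),
    ("pooling_2", "pooling"),
    ("fully_connected_1", "fully_connected"),
    ("fully_connected_2", "fully_connected"),
    ("fully_connected_3", "fully_connected"),
]

def templates_needed(ops):
    """Distinct operator types, i.e. the number of templates to implement."""
    return sorted({op_type for _, op_type in ops})

def generation_order(ops):
    """Names in the order their kernel code is generated: last operator first."""
    return [name for name, _ in reversed(ops)]

needed = templates_needed(NETWORK_OPS)
order = generation_order(NETWORK_OPS)
```

Three templates cover all seven operators; the two convolutions differ only in the parameters substituted into the shared conv2d template.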

[0093] S301: Copy input data from the CPU;

[0094] S302: Start running the first operator on the GPU by running the function;

[0095] S303: During operation, the CPU asynchronously listens for execution termination events emitted by the GPU;

[0096] S304: After the CPU detects the end of the process event, the GPU copies the current calculation result as output data to the CPU's memory for further processing.

[0097] Repeat steps S301 through S304.

[0098] Specifically, after steps S100-S200, kernel algorithm code has been constructed for each operator in the DNN model, and the topologically sorted operators and their kernel algorithm codes have been linked together in sequence by the concatenation function `enqueueNextKernel`. Therefore, when running the DNN model, the CPU only needs to prepare the input data and transmit it to the GPU. The first operator in the DNN model is started asynchronously on the GPU by calling the execution function `clEnqueueNDRangeKernel` (the OpenCL host call that enqueues a kernel for execution). Every subsequent operator is executed entirely on the GPU. While the DNN model runs, the CPU does not interfere with the GPU's work and the two do not interact, which greatly saves CPU computing resources and improves GPU speed. During the run, the CPU uses the listening function `clWaitForEvents` to wait for the completion event. Once the completion event is detected, the GPU stops running, and the data produced by the run is sent to the CPU as output data together with the completion event. The method of Embodiment 1 repeats steps S301 to S304 and, through asynchronous operation, eliminates the interaction overhead that would otherwise arise from the CPU and GPU interacting for every operator in the DNN model.
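The host-side loop of S301 to S304 can be modeled with a worker thread standing in for the GPU and a `threading.Event` standing in for the completion event. This is a simulation under those assumptions, not OpenCL code: `done.wait()` is only an analogue of `clWaitForEvents`, and the operators are plain Python functions.

```python
import threading

def run_iteration(input_data, chain):
    """One pass through S301-S304 with a thread playing the role of the GPU."""
    done = threading.Event()
    result = {}

    def device():
        data = list(input_data)        # S301: input copied to the device
        for op in chain:               # operators run back to back, no host involvement
            data = op(data)
        result["output"] = data
        done.set()                     # the device raises the completion event

    worker = threading.Thread(target=device)
    worker.start()                     # S302: single, non-blocking launch
    done.wait()                        # S303: host waits for the completion event
    worker.join()
    return result["output"]            # S304: output copied back to the host

def double(values):
    return [2 * v for v in values]

def add_one(values):
    return [v + 1 for v in values]

# Repeating S301-S304 for successive inputs, as in the embodiment.
outputs = [run_iteration(batch, [double, add_one]) for batch in ([1, 2, 3], [10, 20])]
```

Between the launch and the wait the host thread is free to do other work, which is the sense in which the run is asynchronous.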

[0099] Embodiment 2

[0100] Embodiment 2 of the present invention discloses a system for reducing GPU and CPU interaction overhead. By implementing the method for reducing GPU and CPU interaction overhead disclosed in Embodiment 1, based on the OpenCL platform, during the operation of the DNN network, the execution of each operator of the DNN model is always in the GPU without CPU intervention, thus saving a lot of computing and interaction resources for both CPU and GPU.

[0101] See Figure 7. The system in Embodiment 2 includes a concatenation module 10, a compilation module 20, and a running module 30. Based on the type and parameter information of all operators in the DNN model after topological sorting, the concatenation module 10 generates the kernel algorithm code for each operator in reverse order, starting from the last operator, according to the code template corresponding to the operator type and the operator's parameter information. At the same time, at the end of each generated kernel algorithm code, it inserts a concatenation function, based on the program information of the kernel algorithm code of the next operator to be run, that starts the next operator after the current operator has finished running. The compilation module 20 calls the online compiler of the runtime library to compile the kernel algorithm code of the newly generated operator and loads the compiled binary program into the GPU as input data, ready for execution. Based on the input data, the running module 30 iteratively runs each operator on the GPU, starting from the first operator in the DNN model, until the end-of-run event is detected, at which point the run ends and the output data is copied to the CPU.
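The work of the concatenation module can be sketched as follows, assuming Python 3.9 or later for `graphlib`. The templates, the operator names, the function `generate_kernels`, and the single-argument form of `enqueueNextKernel` are all hypothetical; the patent does not give the function's signature or the template contents. The sketch only shows why generation runs in reverse: by the time an operator's code is emitted, its successor is already known.

```python
from graphlib import TopologicalSorter

# Hypothetical code templates keyed by operator type. Real templates would
# hold complete OpenCL C kernel bodies; these are placeholders.
TEMPLATES = {
    "conv": "__kernel void {name}(/* args */) {{ /* conv, kernel size {ksize} */ {tail} }}",
    "relu": "__kernel void {name}(/* args */) {{ /* relu */ {tail} }}",
}


def generate_kernels(operators, predecessors):
    """Generate kernel source for every operator, last operator first.

    operators:    name -> (operator type, parameter dict)
    predecessors: name -> set of names that must run before it
    """
    order = list(TopologicalSorter(predecessors).static_order())
    sources = {}
    successor = None
    for name in reversed(order):
        op_type, params = operators[name]
        if successor is None:
            tail = "/* last operator: signal end of run */"
        else:
            # Assumed call form; the successor was generated one step earlier.
            tail = f"enqueueNextKernel({successor});"
        sources[name] = TEMPLATES[op_type].format(name=name, tail=tail, **params)
        successor = name
    return order, sources


operators = {
    "conv1": ("conv", {"ksize": 3}),
    "relu1": ("relu", {}),
    "conv2": ("conv", {"ksize": 1}),
}
predecessors = {"relu1": {"conv1"}, "conv2": {"relu1"}}
order, sources = generate_kernels(operators, predecessors)
for name in order:
    print(sources[name])
```

In this toy chain, `conv2` is generated first and carries no call to a successor, while `conv1` is generated last and ends with a call that names `relu1`.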

[0102] Specifically, most DNN networks are static networks, or at least one subgraph of the network that contains a large number of computationally intensive operators (ops) is static. DNN networks use tensors as the unit of data flow. A static network is one in which the dimensions and size of the input and output tensors of each operator remain unchanged across network iterations, and in which the attributes of each operator, such as the kernel size of a convolution, also remain constant. Computationally intensive operators are well suited to GPU acceleration. Based on this static characteristic, the concatenation module 10 uses the construction functions of the GPU's OpenCL programming interface (for example, the combination of `clCreateProgramWithSource`, `clBuildProgram`, and `clCreateKernel`) to build the kernel algorithm code for each operator online, following the method of Embodiment 1. This allows the kernel algorithm code to be edited again automatically after it is generated. At the same time, a concatenation function `enqueueNextKernel` is inserted at the end of each kernel algorithm code, linking the kernel algorithm codes of the topologically sorted operators together. The `enqueueNextKernel` function calls the kernel algorithm code of the next operator to be run, so that all operators run automatically in sequence on the GPU. The compilation module 20 then uses an online compiler (CodeGenerator) to compile all the generated kernel algorithm code, each with its `enqueueNextKernel` call added, into binary programs adapted to the GPU runtime environment. Together, these binary programs form the input data and are loaded into the GPU. Finally, the running module 30 runs the DNN model on the GPU, operator by operator, through the function `clEnqueueNDRangeKernel` until the end-of-run event occurs.
For example, as shown in Figure 3, the part enclosed in curly braces represents the operations that the GPU performs independently without CPU intervention, while the parts outside the braces, corresponding to the two arrows at the input and output ends, represent the processing in which the CPU takes part. In the flow of Figure 3, before the GPU runs the DNN network, the input data is prepared in advance by the concatenation module 10 and the compilation module 20. The running module 30 then calls the running function `clEnqueueNDRangeKernel` to run the DNN model starting from operator 1. After the kernel algorithm code of operator 1 has executed, the concatenation function `enqueueNextKernel` directly calls the kernel algorithm code of operator 2; after the kernel algorithm code of operator 2 has executed, it directly calls the kernel algorithm code of operator 3; and so on, until the kernel algorithm code of operator n-1 has executed and `enqueueNextKernel` directly calls the kernel algorithm code of the last operator n. Operator n has no successor to call, so when it finishes, the end-of-run event is triggered: the GPU outputs the data, ends the run, and sends the end-of-run event to the CPU.
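The static property described in paragraph [0102] means that every tensor shape can be computed once, before any kernel code is generated. A minimal sketch for convolution operators, using the standard output-size formula: the stride and padding parameters and the function names are assumptions added for illustration, since the text itself names only the kernel size.

```python
def conv2d_output_shape(input_shape, kernel_size, stride=1, padding=0):
    """Return (height, width) of a 2-D convolution output.

    Standard formula: out = (in + 2 * padding - kernel) // stride + 1.
    """
    h, w = input_shape
    out_h = (h + 2 * padding - kernel_size) // stride + 1
    out_w = (w + 2 * padding - kernel_size) // stride + 1
    if out_h <= 0 or out_w <= 0:
        raise ValueError("kernel does not fit the padded input")
    return out_h, out_w


def propagate_shapes(input_shape, conv_layers):
    """Compute every intermediate shape once, ahead of code generation."""
    shapes = [input_shape]
    for layer in conv_layers:
        shapes.append(conv2d_output_shape(shapes[-1], **layer))
    return shapes


layers = [
    {"kernel_size": 3, "stride": 1, "padding": 1},
    {"kernel_size": 3, "stride": 2, "padding": 0},
]
shapes = propagate_shapes((32, 32), layers)
print(shapes)  # [(32, 32), (32, 32), (15, 15)]
```

Because neither the input shape nor the operator attributes change between iterations, these values can be fixed in the generated kernel code and never recomputed.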

[0103] The system of Embodiment 2 significantly reduces how often the GPU and CPU interact while the DNN model is running, and ensures that the DNN model always runs on the GPU. Moreover, the more operators the DNN model contains, the greater the savings in CPU and GPU computing resources and interaction overhead, which significantly improves the speed of running the DNN network on the GPU.
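The scaling claim in paragraph [0103] can be made concrete by counting host-issued kernel launches. The sketch below counts launches only and makes no statement about time; the function name and the operator and iteration counts are illustrative assumptions, not measurements.

```python
def host_launches(num_operators, iterations, chained):
    """Count kernel launches issued by the host.

    Per-operator dispatch: one launch per operator in every iteration.
    Chained dispatch: one launch per iteration (only the first operator).
    """
    per_iteration = 1 if chained else num_operators
    return per_iteration * iterations


for n in (10, 100):
    baseline = host_launches(n, iterations=1000, chained=False)
    chained = host_launches(n, iterations=1000, chained=True)
    print(f"{n} operators: {baseline} vs {chained} launches, "
          f"{baseline - chained} avoided")
```

The number of avoided launches grows linearly with the operator count, (n - 1) per iteration, which is the sense in which larger models benefit more.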

[0104] Example 3

[0105] Embodiment 3 of the present invention discloses a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to perform the method for reducing GPU and CPU interaction overhead as disclosed in Embodiment 1.

[0106] Example 4

[0107] Embodiment 4 of the present invention discloses a computer program product, which includes a computer program that, when run on a computer, causes the computer to perform the method for reducing GPU and CPU interaction overhead as disclosed in Embodiment 1.

[0108] The present invention discloses a method and system for reducing GPU and CPU interaction overhead. When a DNN model is run iteratively on a GPU, only the preparation before the first run consumes a significant amount of CPU and GPU resources. Subsequent iterations incur only the cost of copying the input and output tensors, which is unavoidable when running the DNN model. This eliminates the overhead of the GPU and CPU interacting for the execution of each operator in the DNN model, thus saving computing resources and improving the running speed of the model.

[0109] It should be understood that the above description of specific embodiments of the present invention is only for illustrating the technical approach and features of the present invention, and is intended to enable those skilled in the art to understand the content of the present invention and implement it accordingly. However, the present invention is not limited to the specific embodiments described above. All changes or modifications made within the scope of the claims of the present invention should be covered within the protection scope of the present invention.

Claims

1. A method for reducing GPU and CPU interaction overhead, characterized in that the method includes: S100: Based on the type and parameter information of all operators in the DNN model after topological sorting, starting from the last operator, generate the kernel algorithm code corresponding to each operator in reverse order according to the code template corresponding to the operator type and the parameter information of the operator; at the same time, at the end of each generated kernel algorithm code, insert a concatenation function, based on the program information of the kernel algorithm code of the next operator to be run, to start the next operator to be run after the current operator has finished running; the type and parameter information of the operators include the attribute information of each operator, the shape of the input and output tensors, the memory address allocated to the tensors for storing data, and the entry address information of the kernel algorithm code corresponding to the subsequent operators; the shape of the output tensor is calculated based on the attribute information and the shape of the input tensor; if the operator for which kernel algorithm code is being generated is not the last operator, obtain the entry address information of the executable program obtained by compiling the kernel algorithm code corresponding to the successor operator of this operator; S200: Call the online compiler of the runtime library to compile the kernel algorithm code of the newly generated operator, and load the compiled binary program into the GPU as input data, ready for execution; S300: Based on the input data, iteratively run the DNN model on the GPU through the running function, starting from the first operator, until the end-of-run event is detected, then end the run and copy the output data to the CPU.

2. The method for reducing GPU and CPU interaction overhead according to claim 1, characterized in that, S100 specifically includes: S101: Perform topological sorting on each operator in the DNN model to obtain the type and parameter information of each operator; S102: Set the generation order of the kernel algorithm code to reverse order, and set the starting position i of the kernel algorithm code to be generated for each operator to the position n of the last operator; S103: Construct the algorithm model corresponding to the i-th operator based on the code template corresponding to the type of the i-th operator and the parameter information of the operator; S104: Generate the kernel algorithm code corresponding to the i-th operator based on the algorithm model and parameter information corresponding to the i-th operator; S105: At the end of the kernel algorithm code corresponding to the i-th operator, according to the program information of the kernel algorithm code of the (i+1)-th operator, insert a concatenation function to start the (i+1)-th operator after the i-th operator has finished running; S106: Determine if i is 1; if i=1, then execute S200; if i≠1, then execute S107. S107: Let i = i-1, then return to execute S103.

3. The method for reducing GPU and CPU interaction overhead according to claim 1, characterized in that, The attribute information of each operator, the shape of the input and output tensors, the memory address allocated to the tensor for storing data, and the entry address information of the kernel algorithm code corresponding to the subsequent operator are generated by the code generation function; the parameter information and algorithm model of each operator are used to construct the corresponding kernel algorithm code by the compilation function.

4. The method for reducing GPU and CPU interaction overhead according to claim 1, characterized in that S300 specifically includes: S301: Copy the input data from the CPU; S302: Start running the first operator on the GPU through the running function; S303: During the run, the CPU asynchronously listens for the end-of-run event emitted by the GPU; S304: After the CPU detects the end-of-run event, the GPU copies the current calculation result to the CPU's memory as output data for further processing; repeat steps S301 through S304.

5. A system for reducing GPU and CPU interaction overhead, characterized in that the system includes a concatenation module, a compilation module, and a running module; the concatenation module is used to generate kernel algorithm code for each operator in reverse order, starting from the last operator, based on the type and parameter information of all operators in the DNN model after topological sorting, according to the code template corresponding to the operator type and the parameter information of the operator; at the same time, at the end of each generated kernel algorithm code, a concatenation function is inserted based on the program information of the kernel algorithm code of the next operator to be run, which is used to start the next operator to be run after the current operator has finished running; the compilation module is used to call the online compiler of the runtime library to compile the kernel algorithm code of the newly generated operator, and load the compiled binary program into the GPU as input data, ready for execution; the type and parameter information of the operators include the attribute information of each operator, the shape of the input and output tensors, the memory address allocated to the tensors for storing data, and the entry address information of the kernel algorithm code corresponding to the subsequent operators; the shape of the output tensor is calculated based on the attribute information and the shape of the input tensor; if the operator for which kernel algorithm code is being generated is not the last operator, the entry address information of the executable program obtained by compiling the kernel algorithm code corresponding to the successor operator of this operator is obtained; the running module is used to iteratively run each operator in the DNN model on the GPU according to the input data, starting from the first operator, until the end-of-run event is detected, then end the run and copy the output data to the CPU.

6. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that, when run on a computer, causes the computer to perform the method for reducing GPU and CPU interaction overhead as described in any one of claims 1 to 4.

7. A computer program product, characterized in that, The computer program product includes a computer program that, when run on a computer, causes the computer to perform the method for reducing GPU and CPU interaction overhead as described in any one of claims 1 to 4.