Data compilation method and related apparatus

Complex loop bodies are encapsulated into target functions, and the computation graph of each target function is divided into computation subgraphs that are optimized individually. This addresses the low compilation efficiency of complex loop structures in existing technologies and enables efficient optimization and execution of the computation graph.

WO2026123864A1 · PCT designated stage · Publication Date: 2026-06-18 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office: WO
Patent Type: Applications
Current Assignee / Owner: HUAWEI TECH CO LTD
Filing Date: 2025-09-17
Publication Date: 2026-06-18

AI Technical Summary

Technical Problem

When the body of a complex loop contains syntax that the static graph does not support, existing technologies cannot form a complete computation graph for the loop, which wastes resources and reduces compilation efficiency.

Method used

The bodies of complex loops are encapsulated as target functions, and their computation graphs are divided into multiple computation subgraphs for optimization. This avoids the repeated generation and destruction of split graphs and allows optimization techniques such as constant folding and common subexpression elimination to be applied.

Benefits of technology

It improves compilation efficiency, reduces the impact range of the split graph, and ensures the correct execution of the computation graph.

Abstract

A data compilation method, comprising: acquiring a code file to be compiled, wherein the code file comprises a target code segment, and the target code segment comprises a loop header and a loop body corresponding to the loop header; when the loop body comprises a target statement whose syntax does not support conversion into a static computation graph, encapsulating the loop body into a target function; and optimizing the static computation graph corresponding to the code file, to obtain an optimized computation graph, wherein the target statement partitions the static computation graph corresponding to the target function into a plurality of computation subgraphs, and the optimized computation graph comprises a computation subgraph obtained by optimizing at least one of the plurality of computation subgraphs. In the present application, a computation graph can still be optimized even when a split graph is present in the computation graph corresponding to a loop body, so that repeated generation and destruction of the split graph during compilation of a loop structure are avoided, thereby improving compilation efficiency.

Description

A data compilation method and related apparatus

[0001] This application claims priority to Chinese Patent Application No. 202411814293.7, filed on December 10, 2024, entitled “A Data Compilation Method and Related Apparatus”, the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of artificial intelligence (AI) technology, and in particular to a data compilation method and related apparatus.

Background Technology

[0003] This application relates to the field of compiler optimization technology in computer science, particularly to just-in-time (JIT) compilation acceleration technology in AI frameworks. With the development of artificial intelligence, performance optimization of AI frameworks has become an important research direction. JIT compilation is a key technique that improves code execution efficiency at runtime by capturing dynamic-graph code into a computation graph.

[0004] To achieve just-in-time (JIT) acceleration, the industry-standard approach is to optimize code segments that may be executed multiple times (such as loop structures) during code execution. Specifically, this involves unrolling (expanding) loop structures (such as for or while statements) and then performing computation graph capture on the unrolled code. The basic idea behind this method is to transform dynamic parts of the code into static parts, thereby making computation graph capture more efficient.

[0005] However, while this method works well for simple loop structures, its effectiveness diminishes significantly for complex loop structures, especially those whose bodies contain syntax not supported by the static graph (such as if or while statements). In such cases, the loop body cannot form a complete computation graph; instead, it splits into several sub-computation graphs at the unsupported syntax points. Existing techniques, upon finding that a complete computation graph cannot be formed, do not optimize the loop body at all. The loop body is thus unrolled and its computation graph converted, but no optimization is performed, which wastes resources and reduces code compilation efficiency.

Summary of the Invention

[0006] This application provides a data compilation method and related apparatus, which can improve compilation efficiency.

[0007] In a first aspect, this application provides a data compilation method, comprising: obtaining a code file to be compiled, the code file including a target code segment, and the target code segment including a loop header and a loop body corresponding to the loop header; if the loop body includes a target statement whose syntax does not support conversion to a static computation graph, encapsulating the loop body into a target function; and optimizing the static computation graph corresponding to the code file to obtain an optimized computation graph, wherein the target statement divides the static computation graph corresponding to the target function into multiple computation subgraphs, and the optimized computation graph includes a computation subgraph obtained by optimizing at least one of the multiple computation subgraphs. When code contains statements that cannot be converted to a static computation graph, the converted static computation graph may contain a split graph. For loops containing split graphs, existing technologies typically fully unroll the loop and then capture the computation graph. However, once the computation graph corresponding to the unrolled loop body contains a split graph, the entire computation graph of the loop is discarded without optimization and cannot be captured into the graph, which lowers the graph capture rate. In this embodiment, the loop body of the loop structure is encapsulated into a new function (the target function), and the multiple computation subgraphs that the target statement separates within the computation graph of the target function are optimized. That is, even if the computation graph corresponding to the loop body contains a split graph, the computation graph is still optimized. This avoids repeatedly generating and destroying split graphs when compiling the loop structure, thereby improving compilation efficiency.
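The detection and encapsulation steps of this method can be illustrated with a minimal Python sketch built on the standard `ast` module. It is not the implementation of this application: the rule that a call to `print` counts as a target statement, the names `UNSUPPORTED_CALLS`, `has_target_statement`, and `encapsulate_loop_bodies`, and the dictionary used to describe a target function are all hypothetical. Parameter and return-value analysis for the new function is omitted.

```python
import ast

# Hypothetical stand-in for "syntax the static graph cannot express":
# any call to one of these names is treated as a target statement.
UNSUPPORTED_CALLS = {"print"}


def has_target_statement(stmt):
    """Return True if the statement contains a call to an unsupported name."""
    for node in ast.walk(stmt):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in UNSUPPORTED_CALLS):
            return True
    return False


def encapsulate_loop_bodies(source):
    """Describe one target function per loop whose body holds a target statement."""
    target_functions = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.For, ast.While)) and any(
                has_target_statement(s) for s in node.body):
            target_functions.append({
                "name": f"loop_body_{len(target_functions)}",
                "header_line": node.lineno,
                # The body is kept once, as written; the loop is not unrolled.
                "body": [ast.unparse(s) for s in node.body],
            })
    return target_functions


code = """
for i in range(3):
    y = x * 2
    print(y)
    z = y + 1
for j in range(2):
    w = j + 1
"""
result = encapsulate_loop_bodies(code)
```

Only the first loop is encapsulated here, because the second loop body contains no statement matching the hypothetical rule and could be handled by the ordinary capture path.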

[0008] In one possible implementation, before optimizing the static computation graph corresponding to the code file, the method further includes: writing the information of the target function into the data structure of the static computation graph corresponding to the code file.

[0009] In one possible implementation, the target statement is one of the following:

[0010] a statement that calls a third-party library;

[0011] a statement that includes an operator, where the statement selects one branch from multiple branches based on the result of executing the operator; or a manually triggered statement.
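A minimal sketch of how the three kinds of target statement listed above might be told apart, using Python's `ast` module. The module list `THIRD_PARTY_MODULES`, the marker function name `graph_break`, and the returned labels are hypothetical. As a simplification, every `if` statement is treated as a branch on an operator result, whereas a real compiler would inspect the condition.

```python
import ast

THIRD_PARTY_MODULES = {"numpy", "requests"}  # hypothetical list of libraries
MANUAL_BREAK_NAME = "graph_break"            # hypothetical user-inserted marker


def classify_target_statement(stmt_source):
    """Return the kind of target statement, or None if none of the rules match."""
    stmt = ast.parse(stmt_source).body[0]
    if isinstance(stmt, ast.If):
        # Simplification: any branch is assumed to depend on an operator result.
        return "branch_on_operator_result"
    for node in ast.walk(stmt):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id == MANUAL_BREAK_NAME:
                return "manual_trigger"
            if (isinstance(func, ast.Attribute)
                    and isinstance(func.value, ast.Name)
                    and func.value.id in THIRD_PARTY_MODULES):
                return "third_party_call"
    return None


kinds = [
    classify_target_statement("y = numpy.sort(x)"),
    classify_target_statement("if x.sum() > 0:\n    y = 1"),
    classify_target_statement("graph_break()"),
    classify_target_statement("y = x + 1"),
]
```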

[0012] In one possible implementation, each computational subgraph is a computational graph of a continuous segment of code within the loop body without unrolling the loop body.
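The partitioning described here can be sketched as follows: the loop body is treated as an ordered list of statements and is cut at each target statement, so every resulting segment is a contiguous run of the original body. The function name `partition_at_target_statements` and the use of `print(` as the target-statement test are assumptions made for illustration.

```python
def partition_at_target_statements(statements, is_target):
    """Split statements into contiguous segments, cutting at each target statement."""
    segments, current = [], []
    for stmt in statements:
        if is_target(stmt):
            # A target statement ends the current segment and is not part of any subgraph.
            if current:
                segments.append(current)
            current = []
        else:
            current.append(stmt)
    if current:
        segments.append(current)
    return segments


body = ["a = x + 1", "b = a * 2", "print(b)", "c = b - 3", "d = c * c"]
segments = partition_at_target_statements(body, lambda s: s.startswith("print("))
```

The body is processed once, as written, so the number of segments does not grow with the number of loop iterations.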

[0013] The idea of this embodiment is as follows. Existing technology unrolls the entire loop body and then, when a split graph appears, abandons the whole loop graph without optimization. This embodiment instead takes a step back and does not pursue full-graph capture: when a loop contains a split graph, the loop is not unrolled, and the loop body is encapsulated into an independent new function. After the sub-computation graphs are encapsulated in this way, the graph capture rate is reduced, but the influence of the split graph is confined to a limited range.

[0014] In one possible implementation, the encapsulated new function can be optimized, including but not limited to common code optimization techniques such as constant folding, common subexpression elimination, and dead code elimination. This mechanism further improves the efficiency of loop body execution.
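The three optimizations named above can be illustrated on a toy subgraph. In this sketch a subgraph is a list of `(dest, op, a, b)` instructions; it assumes each destination is assigned once, operations have no side effects, and operands are either numbers or variable names. The instruction format and the function `optimize` are hypothetical and not taken from this application.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def optimize(instrs, outputs):
    """Apply constant folding, common subexpression elimination, then dead code elimination."""
    consts, seen, alias, kept = {}, {}, {}, []
    for dest, op, a, b in instrs:
        # Replace operands by earlier equivalent names and by known constants.
        a, b = alias.get(a, a), alias.get(b, b)
        a, b = consts.get(a, a), consts.get(b, b)
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            consts[dest] = OPS[op](a, b)          # constant folding
            kept.append((dest, "const", consts[dest], None))
            continue
        key = (op, a, b)
        if key in seen and dest not in outputs:
            alias[dest] = seen[key]               # common subexpression elimination
            continue
        seen.setdefault(key, dest)
        kept.append((dest, op, a, b))
    # Dead code elimination: walk backwards, keeping only what the outputs need.
    live, result = set(outputs), []
    for dest, op, a, b in reversed(kept):
        if dest in live:
            result.append((dest, op, a, b))
            live.update(v for v in (a, b) if isinstance(v, str))
    result.reverse()
    return result


subgraph = [
    ("t0", "mul", 2, 3),        # constant expression
    ("t1", "add", "x", "t0"),
    ("t2", "add", "x", "t0"),   # same expression as t1
    ("t3", "mul", "t1", "t2"),
    ("t4", "sub", "x", 1),      # result is never used
]
optimized = optimize(subgraph, outputs=["t3"])
```

Five instructions reduce to two: `t0` is folded into the constant 6, `t2` is replaced by `t1`, and `t4` is removed because nothing reads it.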

[0015] In one possible implementation, the optimized computation graph can also be executed. Specifically, the new encapsulated function can be executed to achieve efficient loop execution.

[0016] In one possible implementation, the optimized computation graph includes the computation graph corresponding to the target function; the method further includes: after executing the computation graph corresponding to the target function and obtaining the computation result, writing the computation result to the optimized computation graph, so that the computation nodes that follow the computation graph corresponding to the target function in the optimized computation graph can call the computation result.

[0017] In other words, after executing the encapsulated new function, the result can be written back to the original computation graph so that subsequent computation graph nodes can use these results for computation. This mechanism ensures that the result of the encapsulated new function can be correctly used by subsequent computation graph nodes, thereby guaranteeing the correct execution of the entire computation graph.
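A minimal sketch of this write-back mechanism, assuming the graph is an ordered list of `(name, function, argument names)` nodes and that results are stored in a dictionary keyed by node name. The names `execute_graph`, `loop_body`, and `run_loop` are hypothetical; `loop_body` plays the role of the encapsulated function and contains a branch as a stand-in for unsupported syntax.

```python
def execute_graph(nodes, inputs):
    """Run nodes in order; each result is written back under the node's name."""
    values = dict(inputs)
    for name, fn, arg_names in nodes:
        # Later nodes look up earlier results, including the loop's, by name.
        values[name] = fn(*(values[a] for a in arg_names))
    return values


def loop_body(x):
    """Encapsulated loop body (the target function in this sketch)."""
    x = x * 2
    if x > 10:  # stands in for a statement the static graph cannot express
        x = x - 10
    return x


def run_loop(x, steps):
    """The loop header stays a loop and calls the encapsulated body each iteration."""
    for _ in range(steps):
        x = loop_body(x)
    return x


graph = [
    ("scaled", lambda x: x + 1, ["x"]),
    ("loop_result", run_loop, ["scaled", "steps"]),
    ("final", lambda r: r * 3, ["loop_result"]),
]
values = execute_graph(graph, {"x": 2, "steps": 3})
```

The node `final` runs after the loop node and reads `loop_result` from the shared dictionary, which is the role the written-back result plays for subsequent nodes.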

[0018] In a second aspect, this application provides a data compilation apparatus, the apparatus comprising:

[0019] The acquisition module is used to acquire the code file to be compiled; the code file includes a target code segment, and the target code segment includes a loop header and a loop body corresponding to the loop header;

[0020] The encapsulation module is used to encapsulate the loop body into a target function when the loop body includes a target statement whose syntax does not support conversion to a static computation graph;

[0021] The compilation module is used to optimize the static computation graph corresponding to the code file to obtain an optimized computation graph; wherein, the target statement divides the static computation graph corresponding to the target function into multiple computation subgraphs; the optimized computation graph includes a computation subgraph obtained by optimizing at least one computation subgraph among the multiple computation subgraphs.

[0022] In one possible implementation, before the static computation graph corresponding to the code file is optimized, the encapsulation module is further configured to: write the information of the target function into the data structure of the static computation graph corresponding to the code file.

[0023] In one possible implementation, the target statement is one of the following:

[0024] a statement that calls a third-party library;

[0025] a statement that includes an operator, where the statement selects one branch from multiple branches based on the result of executing the operator; or a manually triggered statement.

[0026] In one possible implementation, each computational subgraph is a computational graph of a continuous segment of code within the loop body without unrolling the loop body.

[0027] In one possible implementation, the optimization includes at least one of the following:

[0028] Constant folding, common subexpression elimination, and dead code elimination.

[0029] In one possible implementation, the device further includes:

[0030] The execution module is used to execute the optimized computation graph.

[0031] In one possible implementation, the optimized computation graph includes multiple computation nodes, and the computation graph corresponding to the target function corresponds to a target node among the multiple computation nodes; the execution module is further configured to:

[0032] after the computation graph corresponding to the target function is executed and the computation result is obtained, write the computation result to the optimized computation graph, so that the computation nodes after the target node in the optimized computation graph can call the computation result.

[0033] A third aspect of this application provides a data compilation apparatus, which may include a processor and a memory coupled together. The memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, the method described in the first aspect or any implementation thereof is implemented. For details regarding the steps in the various possible implementations of the first aspect executed by the processor, please refer to the first aspect; further details will not be repeated here.

[0034] The fourth aspect of this application provides a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to perform the method of any implementation of the first aspect described above.

[0035] The fifth aspect of this application provides a circuit system including a processing circuit configured to perform the method of any implementation of the first aspect described above.

[0036] The sixth aspect of this application provides a computer program product that, when run on a computer, causes the computer to perform any implementation of the first aspect described above.

[0037] A seventh aspect of this application provides a chip system including a processor for supporting a server or threshold value acquisition device in implementing the functions involved in any implementation of the first aspect described above, such as transmitting or processing data and / or information involved in the methods described above. In one possible design, the chip system further includes a memory for storing program instructions and data necessary for the server or communication device. This chip system may be composed of chips or may include chips and other discrete devices.

[0038] For the beneficial effects of the second to seventh aspects above, refer to the description of the first aspect; they are not repeated here.

Attached Figure Description

[0039] Figure 1 is a schematic diagram of a system architecture provided in an embodiment of this application;

[0040] Figure 2 is a schematic diagram of a system architecture provided in an embodiment of this application;

[0041] Figure 3 is a flowchart illustrating a data compilation method provided in an embodiment of this application;

[0042] Figure 4 is a schematic diagram of a data compilation device provided in an embodiment of this application;

[0043] Figure 5 is a schematic diagram of an execution device provided in an embodiment of this application;

[0044] Figure 6 is a schematic diagram of a chip structure provided in an embodiment of this application;

[0045] Figure 7 is a schematic diagram of the structure of a computer-readable storage medium provided in an embodiment of this application.

Detailed Implementation

[0046] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application are described below with reference to the accompanying drawings. Obviously, the described embodiments are merely some, and not all, of the embodiments of this application. Those skilled in the art will understand that, with the emergence of new application scenarios, the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.

[0047] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such descriptions can be used interchangeably where appropriate to allow embodiments to be implemented in a sequence other than that illustrated or described in this application. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or modules is not necessarily limited to those explicitly listed, but may include other steps or modules not explicitly listed or inherent to such processes, methods, products, or devices. The naming or numbering of steps appearing in this application does not imply that the steps in the method flow must be performed in the chronological / logical order indicated by the naming or numbering. The execution order of named or numbered process steps can be changed according to the desired technical purpose, as long as the same or similar technical effect is achieved. The division of units in this application is a logical division. In practical applications, there may be other division methods. For example, multiple units may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be through some interface, and the indirect coupling or communication connection between units may be electrical or other similar forms, none of which are limited in this application. Furthermore, the units or sub-units described as separate components may or may not be physically separated, may or may not be physical units, or may be distributed among multiple circuit units. Some or all of the units can be selected to achieve the purpose of the solution in this application according to actual needs.

[0048] This application relates to the application of neural networks. For ease of understanding, the relevant terms and concepts are explained below:

[0049] 1. Neural Networks

[0050] A neural network (NN) is a machine learning model. A neural network can be composed of neural units, which are computational units that take x_s and an intercept of 1 as input. The output of this computational unit can be:

$$h_{W,b}(x) = f(W^{T}x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

[0051] Where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, used to introduce nonlinear characteristics into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer. The activation function can be a nonlinear function such as ReLU. A neural network is a network formed by connecting many of the above-mentioned individual neural units together; that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field, which can be a region composed of several neural units.
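The neural unit described above, a weighted sum of the inputs plus a bias passed through an activation function, can be sketched in a few lines of Python. The function names `relu` and `neural_unit` and the sample values are illustrative only.

```python
def relu(z):
    """ReLU activation: negative values become zero."""
    return max(0.0, z)


def neural_unit(xs, ws, b, f=relu):
    """Output of one neural unit: f(sum over s of W_s * x_s, plus b)."""
    return f(sum(w * x for w, x in zip(ws, xs)) + b)


out = neural_unit([1.0, 2.0, 3.0], [0.5, -1.0, 0.25], b=1.0)
```

With these values the weighted sum is -0.75, adding the bias gives 0.25, and ReLU leaves that positive value unchanged.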

[0052] 2. Convolutional Neural Networks

[0053] A convolutional neural network (CNN) is a deep neural network with convolutional structures. It is a deep learning architecture, which refers to learning at multiple levels of abstraction using machine learning algorithms. As a deep learning architecture, CNN is a feed-forward artificial neural network, where each neuron responds to an input image. A CNN contains a feature extractor consisting of convolutional layers and pooling layers. This feature extractor can be viewed as a filter, and the convolution process can be seen as performing convolution with a trainable filter and an input image or a convolutional feature map.

[0054] A convolutional layer is a layer of neurons in a convolutional neural network that performs convolution processing on the input signal. A convolutional layer can contain multiple convolution operators, also called kernels. In image processing, these operators act as filters that extract specific information from the input image matrix. Essentially, a convolution operator can be a weight matrix, which is usually predefined. During the convolution operation, the weight matrix typically processes the input image pixel by pixel (or two pixels by two pixels, depending on the stride) along the horizontal direction, thus extracting specific features from the image. The size of the weight matrix should be related to the image size. Note that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during convolution, the weight matrix extends to the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolutional output with a single depth dimension. However, in most cases, multiple weight matrices of the same size (rows × columns) are used instead of a single weight matrix. The outputs of the weight matrices are stacked to form the depth dimension of the convolutional image, so this depth is determined by the number of weight matrices used. Different weight matrices can be used to extract different features from the image. For example, one weight matrix can extract edge information, another can extract specific colors, and yet another can blur unwanted noise. These weight matrices have the same size (rows × columns), so the feature maps they extract also have the same size, and the extracted feature maps are merged to form the output of the convolution operation. In practical applications, the weight values in these weight matrices need to be obtained through extensive training. The weight matrices formed by the trained weight values can be used to extract information from the input image, enabling the convolutional neural network to make correct predictions. When a convolutional neural network has multiple convolutional layers, the initial convolutional layers often extract more general features, which can also be called low-level features. As the depth of the convolutional neural network increases, the features extracted by later convolutional layers become increasingly complex, such as high-level semantic features. Features with higher semantic levels are more suitable for the problem being solved.
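The sliding of a weight matrix over an image with a given stride, and the stacking of several kernel outputs into a depth dimension, can be sketched as follows. This is a single-channel example without padding, written in plain Python; the function name `conv2d` and the sample kernels are illustrative, and, as is conventional in CNN frameworks, the kernel is not flipped.

```python
def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image and sum the elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(0, len(image) - kh + 1, stride):
        row = []
        for c in range(0, len(image[0]) - kw + 1, stride):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out


image = [[1, 2, 3, 0],
         [4, 5, 6, 0],
         [7, 8, 9, 0],
         [0, 0, 0, 0]]
edge_kernel = [[1, -1], [1, -1]]   # responds to horizontal changes
sum_kernel = [[1, 1], [1, 1]]      # adds up each 2x2 window

stride1 = conv2d(image, edge_kernel, stride=1)
stride2 = conv2d(image, edge_kernel, stride=2)
# One feature map per kernel; the list length is the output depth.
feature_maps = [conv2d(image, k, stride=2) for k in (edge_kernel, sum_kernel)]
```

Both kernels have the same size, so both feature maps have the same size, and the output depth equals the number of kernels.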

[0055] Because it is often necessary to reduce the number of training parameters, pooling layers are frequently introduced periodically after convolutional layers. This can be a single convolutional layer followed by a pooling layer, or multiple convolutional layers followed by one or more pooling layers. In image processing, the sole purpose of pooling layers is to reduce the spatial size of the image. Pooling layers can include average pooling and / or max pooling operators to sample the input image to obtain a smaller image size. Average pooling calculates the average value of pixel values within a specific range as the result of average pooling. Max pooling takes the pixel with the largest value within a specific range as the result of max pooling. Furthermore, just as the size of the weight matrix in a convolutional layer should be related to the image size, the operators in a pooling layer should also be related to the image size. The size of the output image after pooling can be smaller than the size of the input image of the pooling layer. Each pixel in the output image represents the average or maximum value of the corresponding sub-region of the input image of the pooling layer.
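Average pooling and max pooling as described above can be sketched in plain Python. The example assumes non-overlapping square windows (the stride equals the window size); the function name `pool2d` and the sample image are illustrative.

```python
def pool2d(image, size, mode="max"):
    """Reduce each non-overlapping size x size window to its max or its average."""
    out = []
    for r in range(0, len(image) - size + 1, size):
        row = []
        for c in range(0, len(image[0]) - size + 1, size):
            window = [image[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        out.append(row)
    return out


image = [[1, 3, 2, 4],
         [5, 7, 6, 8],
         [9, 11, 10, 12],
         [13, 15, 14, 16]]
max_pooled = pool2d(image, 2, mode="max")
avg_pooled = pool2d(image, 2, mode="avg")
```

A 4 × 4 input becomes a 2 × 2 output, and each output pixel is the maximum or the average of the corresponding 2 × 2 sub-region.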

[0056] After processing by convolutional / pooling layers, a convolutional neural network (CNN) is still insufficient to output the required information. As mentioned earlier, convolutional / pooling layers only extract features and reduce the parameters introduced by the input image. However, to generate the final output information (the required class information or other relevant information), the CNN needs to utilize neural network layers to generate one or a set of desired class numbers of output. Therefore, the neural network can include multiple hidden layers, the parameters of which can be pre-trained based on training data relevant to a specific task type, such as image recognition, image classification, image super-resolution reconstruction, etc.

[0057] Optionally, after the multiple hidden layers in the neural network, there is also an output layer of the entire convolutional neural network. This output layer has a loss function similar to categorical cross-entropy, which is used to calculate the prediction error. Once forward propagation through the entire convolutional neural network is completed, backpropagation begins to update the weight values and biases of the aforementioned layers, in order to reduce the loss of the convolutional neural network, that is, the error between the result output by the output layer and the ideal result.
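As an illustration of the prediction error mentioned above, the following sketch computes a softmax cross-entropy loss for one sample. The text only says the loss is similar to categorical cross-entropy, so this exact form, and the names `softmax` and `cross_entropy`, are assumptions.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[label])


loss_uniform = cross_entropy([0.0, 0.0], 0)     # model is undecided
loss_confident = cross_entropy([5.0, 0.0], 0)   # model favors the correct class
```

The loss is lower when the output layer assigns more probability to the correct class, which is the quantity backpropagation tries to reduce.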

[0058] 3. Computation Graph

[0059] A computation graph is a graph data structure that reflects the design principles and implementation process of computational logic by expressing the direction of data flow and the computational relationships within that logic.
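One simple way to hold such a graph in memory is a list of nodes that each name their operation and their input nodes, with a topological sort giving a valid execution order. The `Node` class and `execution_order` function below are hypothetical; `graphlib.TopologicalSorter` is part of the Python standard library (3.9 and later).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Node:
    name: str
    op: str
    inputs: list = field(default_factory=list)  # names of producer nodes


def execution_order(nodes):
    """Order nodes so that every node comes after the nodes it reads from."""
    sorter = TopologicalSorter({n.name: set(n.inputs) for n in nodes})
    return list(sorter.static_order())


nodes = [
    Node("x", "input"),
    Node("w", "input"),
    Node("prod", "mul", ["x", "w"]),
    Node("out", "relu", ["prod"]),
]
order = execution_order(nodes)
```

The edges (`inputs`) express the direction of data flow, and the order returned respects those dependencies.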

[0060] 4. Data Flow Graph Parameters

[0061] In a data flow graph, parameters refer to the data carried by the connecting edges of computing nodes on the graph, which are used for processing by the computing nodes or fed back by the computing nodes.

[0062] The system architecture and application scenarios of embodiments of this application will be described below. Please refer to Figure 1, which is a schematic diagram of a system architecture according to an embodiment of this application.

[0063] As shown in Figure 1, the compilation device 320 compiles the source program 350 to obtain a program compilation result 301. The source program 350 is the code file to be compiled in this embodiment, which can be a program performing multi-dimensional data operations in different scenarios, including image processing, speech recognition, scientific computing, or physical modeling. The compilation device 320 can be any device containing a compiler.

[0064] The program compilation result 301 produced by the compilation device 320 can be applied to different systems or devices, such as the execution device 310 shown in Figure 1. The execution device 310 can be a terminal, such as a mobile terminal, tablet computer, laptop computer, augmented reality (AR) / virtual reality (VR) terminal, or vehicle terminal, or it can be a server or a cloud device. In Figure 1, the execution device 310 is configured with an input / output (I / O) interface 312 for data interaction with external devices.

[0065] The execution device 310 can receive data from the database 330 or input from the client device 340, and use the calculation module 311 to execute the relevant calculation process in the program compilation result 301 to obtain the corresponding processing result.

[0066] Finally, the I / O interface 312 returns the processing results (e.g., image processing results or speech recognition results) to the client device 340 for use by the user.

[0067] It is worth noting that the compilation device 320 can compile source programs for different goals or different tasks to obtain corresponding program compilation results 301. Then, the execution device 310 executes the relevant calculation process in the program compilation results 301 to obtain the processing results required for different goals or different tasks.

[0068] In the scenario shown in Figure 1, the user can manually provide input data, which can be done through the interface provided by I / O interface 312. Alternatively, the client device 340 can automatically send input data to I / O interface 312. If user authorization is required for the client device 340 to automatically send input data, the user can set the corresponding permissions in the client device 340. The user can view the output of the execution device 310 on the client device 340, which can be presented in various ways such as display, sound, or animation.

[0069] It is worth noting that Figure 1 is merely a schematic diagram of a system architecture provided by an embodiment of this application. The positional relationships between the devices, components, modules, etc. shown in the figure do not constitute any limitation. For example, in Figure 1, the compilation device 320 is an external device relative to the execution device 310. In other cases, the compilation device 320 can also be placed within the execution device 310. The execution device 310 is an external device relative to the client device 340. In other cases, the execution device 310 and the client device 340 can be the same device.

[0070] Figure 2 illustrates an application architecture according to an embodiment of this application. The embodiment can take the form of program code contained in platform software, such as a compiler, an interpreter, or other executable code, and deployed on server hardware. Taking the application scenario shown in Figure 2 as an example, the program code of this embodiment resides in the code analysis module 4012, loop identification module 4013, loop encapsulation module 4014, computation graph generation module 4015, and code optimization module 4016 of the platform software. At runtime, the program code can run in the host memory 4022 and / or GPU memory 4024 of server 4001 and belongs to the machine learning framework 4011 of server 4001. The architecture may also include a heterogeneous computing platform 4021, and the hardware that executes the code may include a CPU 4025 and a GPU 4023.

[0071] This application relates to the field of compiler optimization technology in computer science, particularly to just-in-time (JIT) compilation acceleration technology in AI frameworks. With the development of artificial intelligence, performance optimization of AI frameworks has become an important research direction. JIT compilation is a key technique that improves code execution efficiency at runtime by capturing dynamic-graph code into a computation graph.

[0072] To achieve Just-In-Time (JIT) acceleration, the industry standard approach is to optimize code segments that may be executed multiple times (such as loop structures) during code execution. Specifically, this involves expanding loop structures (such as for or while statements) and then performing computation graph capture on the expanded code. The basic idea behind this method is to transform dynamic parts of the code into static parts, thereby making computation graph capture more efficient.
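The effect of expanding a loop before capture can be illustrated with a toy tracer: the Python loop runs normally, each operation is recorded as a graph node instead of being computed, and the captured graph ends up as straight-line code with one copy of the body per iteration. The `Tracer` class and `loop_code` function are hypothetical and do not correspond to the capture mechanism of any particular framework.

```python
class Tracer:
    """Records each operation as a graph node instead of computing it."""

    def __init__(self):
        self.ops = []

    def _emit(self, op, a, b):
        name = f"t{len(self.ops)}"
        self.ops.append((name, op, a, b))
        return name

    def add(self, a, b):
        return self._emit("add", a, b)

    def mul(self, a, b):
        return self._emit("mul", a, b)


def loop_code(t, x, n_iters):
    """A simple loop whose body contains only graph-friendly operations."""
    for _ in range(n_iters):
        x = t.mul(t.add(x, 1), 2)
    return x


tracer = Tracer()
output = loop_code(tracer, "x", 3)
```

Three iterations of a two-operation body yield six recorded nodes and no loop construct in the captured graph, so the dynamic loop has become a static sequence.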

[0073] However, while this method works well for simple loop structures, its effectiveness diminishes significantly for complex loop structures, especially those whose bodies contain syntax not supported by the static graph (such as if or while statements). In such cases, the loop body cannot form a complete computation graph; instead, it splits into several sub-computation graphs at the unsupported syntax points. Existing techniques, upon finding that a complete computation graph cannot be formed, do not optimize the computation graph of the loop body at all. The loop body is thus unrolled and its computation graph converted, but no optimization is performed, which wastes resources and reduces code compilation efficiency.

[0074] As a result, existing technologies limit the scope in which JIT compilation can be applied to complex code. The problem arises mainly because, when handling loop structures, existing technologies do not account for syntax within the loop body that the static graph does not support.

[0075] The system architecture used in the method provided in this embodiment has been described above. The specific execution flow of the method provided in this application embodiment will be described in detail below with reference to the accompanying drawings.

[0076] Please refer to Figure 3, which is a flowchart illustrating a data compilation method provided in an embodiment of this application. As shown in Figure 3, the data compilation method includes the following steps 301-303.

[0077] Step 301: Obtain the code file to be compiled; the code file includes a target code segment, and the target code segment includes a loop header and a loop body corresponding to the loop header.

[0078] In this embodiment, the code file to be compiled can be obtained. The code file includes source code, which can be written in a high-level programming language through an application programming interface. Compared with a low-level language, a high-level language is closer to ordinary human thinking, and it is easier to write and more readable. To implement the same function, a high-level language takes less time, produces shorter code, and is easier to read.

[0079] For example, the high-level programming language can be, but is not limited to, C, C++, Python, Java, Matlab, LabVIEW, or a domain-specific language (DSL). The DSL can be Halide, GraphIt, Spatial, or another custom domain-specific language. Halide is suitable for vector and tensor operations, GraphIt is suitable for graph computation, Spatial is suitable for programmable hardware, and custom domain-specific languages are suitable for their respective domains.

[0080] The code file may include a loop structure, which refers to repeatedly running the corresponding code, such as the target code segment in the embodiments of this application. The target code segment may include a loop header and a loop body corresponding to the loop header.

[0081] Structurally, a loop consists of a loop header and a loop body. Generally, the loop header is used to initialize the loop variable, control changes to the loop variable, and set the loop termination condition. The loop body, on the other hand, contains statements that you want to execute multiple times.
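
As an illustration of this structure, the following minimal sketch uses Python's standard `ast` module to separate the loop header from the loop body of a small snippet. The snippet in `SOURCE` is a made-up example; `ast.unparse` requires Python 3.9 or later.

```python
import ast

SOURCE = """
total = 0
for i in range(4):
    total = total + i
    print(total)
"""

tree = ast.parse(SOURCE)
# Take the first `for` loop found in the module.
loop = next(node for node in ast.walk(tree) if isinstance(node, ast.For))

# Loop header: the loop variable and the iterable that controls termination.
header = f"for {ast.unparse(loop.target)} in {ast.unparse(loop.iter)}"
# Loop body: the statements executed on every iteration.
body = [ast.unparse(stmt) for stmt in loop.body]

print(header)
print(body)
```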

[0082] Step 302: If the loop body includes a target statement that does not support syntax for conversion to a static computation graph, the loop body is encapsulated as a target function.

[0083] In one possible implementation, the target statement whose syntax does not support conversion to a static computation graph can be one of the following: a statement that calls a third-party library; a statement that includes an operator and selects one branch from multiple branches based on the result of executing the operator (such as an if statement or a while statement); or a manually triggered statement.
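
The three categories can be sketched as a small classifier over Python syntax trees. This is only an illustration under stated assumptions: the module name `thirdparty`, the marker function `graph_break`, and the choice of `if` and `while` as the branch-selecting statements are hypothetical and are not taken from this application.

```python
import ast

THIRD_PARTY_MODULES = {"thirdparty"}  # hypothetical: modules the graph capture cannot see into
MANUAL_MARKER = "graph_break"         # hypothetical name for a manually triggered break


def classify(stmt):
    """Return why a statement cannot enter the static graph, or None if it can."""
    if isinstance(stmt, (ast.If, ast.While)):
        return "branch"
    for node in ast.walk(stmt):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name) and func.id == MANUAL_MARKER:
            return "manual"
        if (isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and func.value.id in THIRD_PARTY_MODULES):
            return "third_party"
    return None


BODY = """
y = x * 2
z = thirdparty.resize(y)
if z > 0:
    z = z - 1
graph_break()
w = z + y
"""

reasons = [classify(stmt) for stmt in ast.parse(BODY).body]
print(reasons)
```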

[0084] The target statement includes one or more statements.

[0085] When the code contains statements that do not support conversion to a static computation graph, the resulting static computation graph may contain split graphs (fragmented graphs). For a loop, existing techniques typically capture the computation graph after fully expanding the loop. However, if the computation graph corresponding to the expanded loop body contains split graphs, the entire computation graph of the loop is discarded without optimization and cannot be captured, which lowers the graph capture rate. In this embodiment, the loop body in the loop structure is encapsulated into a new function (the target function), and the multiple computation subgraphs separated by the target statement in the computation graph corresponding to the target function are optimized. That is, computation graph optimization is still performed even if the computation graph corresponding to the loop body contains split graphs, which avoids the repeated generation and destruction of split graphs during compilation of the loop structure and thereby improves compilation efficiency.
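
The encapsulation step can be sketched as a source-to-source rewrite. This is a simplified illustration and not the implementation of this application: the helper `encapsulate` and the function name `loop_body_fn` are hypothetical, and the sketch assumes a `for` loop whose body assigns at least one variable, treating every name read in the body as a parameter and every name assigned as a result.

```python
import ast

SOURCE = """
for i in range(3):
    acc = acc + i
    acc = acc * scale
"""


def encapsulate(loop, name="loop_body_fn"):
    """Return (function source, rewritten loop source) for an ast.For node."""
    names = [n for stmt in loop.body for n in ast.walk(stmt) if isinstance(n, ast.Name)]
    # Names read in the body become parameters; names assigned become results.
    params = sorted({n.id for n in names if isinstance(n.ctx, ast.Load)})
    results = sorted({n.id for n in names if isinstance(n.ctx, ast.Store)})
    body = "\n".join(
        "    " + line
        for stmt in loop.body
        for line in ast.unparse(stmt).splitlines()
    )
    func_src = (
        f"def {name}({', '.join(params)}):\n"
        f"{body}\n"
        f"    return {', '.join(results)}\n"
    )
    loop_src = (
        f"for {ast.unparse(loop.target)} in {ast.unparse(loop.iter)}:\n"
        f"    {', '.join(results)} = {name}({', '.join(params)})\n"
    )
    return func_src, loop_src


loop = ast.parse(SOURCE).body[0]
func_src, loop_src = encapsulate(loop)
print(func_src + loop_src)

# Check that the rewritten code computes the same value as the original loop.
original_env = {"acc": 1, "scale": 2}
exec(SOURCE, original_env)
rewritten_env = {"acc": 1, "scale": 2}
exec(func_src + loop_src, rewritten_env)
print(original_env["acc"], rewritten_env["acc"])
```

The loop is kept as a loop and only calls the new function, so the function's graph is built once rather than once per iteration.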

[0086] In one possible implementation, each computational subgraph is a computational graph of a continuous segment of code within the loop body without unrolling the loop body.
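
A minimal sketch of this partitioning follows, with statements represented as plain strings. The helper `split_into_segments` and the `thirdparty.` prefix used to mark an unsupported statement are assumptions made for illustration.

```python
def split_into_segments(statements, is_target):
    """Group consecutive graph-compatible statements; target statements are cut points."""
    segments, current = [], []
    for stmt in statements:
        if is_target(stmt):
            if current:
                segments.append(current)
                current = []
        else:
            current.append(stmt)
    if current:
        segments.append(current)
    return segments


loop_body = [
    "a = x + 1",
    "b = a * 2",
    "c = thirdparty.resize(b)",  # hypothetical unsupported call
    "d = c - 1",
    "e = d * d",
]
segments = split_into_segments(loop_body, lambda s: "thirdparty." in s)
print(segments)
```

Each resulting segment is a contiguous run of statements from the loop body as written, so it can be turned into one computation subgraph without unrolling the loop.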

[0087] The approach of this embodiment can be summarized as follows. The existing technology expands the entire loop body and then abandons the whole loop graph without optimization. By contrast, this embodiment takes a step back and does not insist on capturing the loop as one full graph. When a loop with a split graph is encountered, the loop is not expanded; instead, the loop body is encapsulated into an independent new function. Capturing each sub-computation graph separately lowers the graph capture rate, but it also keeps the influence of the split graph within a limited range.

[0088] In one possible implementation, before the static computation graph corresponding to the code file is optimized, information about the target function can be written into the data structure of the static computation graph corresponding to the code file.

[0089] For example, a tuple containing information related to the loop body can be written into the data structure of each connection edge between computation graph nodes. The tuple content includes, but is not limited to, the name, size, and memory type of the parameter to be processed that the connection edge represents, together with a loop body identifier. This information is produced when the compiler, interpreter, or other platform software that executes the code processes the application and its input data according to its own built-in strategies.
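
One way to picture such a tuple is a named tuple attached to each edge. The field names, the byte unit for size, the memory type values, and the `Edge` class are assumptions for illustration; the application only lists the kinds of information the tuple may carry.

```python
from typing import NamedTuple


class EdgeInfo(NamedTuple):
    name: str          # name of the parameter carried by the edge
    size: int          # size in bytes (unit is an assumption)
    memory_type: str   # e.g. "host" or "device" (values are assumptions)
    loop_body_id: str  # identifier of the encapsulated loop body, "" if none


class Edge:
    """Connection edge between two computation graph nodes."""

    def __init__(self, src, dst, info):
        self.src, self.dst, self.info = src, dst, info


edges = [
    Edge("load_x", "loop_body_fn", EdgeInfo("x", 4096, "device", "loop_0")),
    Edge("loop_body_fn", "store_y", EdgeInfo("y", 4096, "device", "loop_0")),
    Edge("store_y", "log", EdgeInfo("y", 4096, "host", "")),
]

# Later passes can find the edges that belong to a given loop body.
in_loop = [e.info.name for e in edges if e.info.loop_body_id == "loop_0"]
print(in_loop)
```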

[0090] Step 303: Optimize the static computation graph corresponding to the code file to obtain an optimized computation graph; wherein, the target statement divides the static computation graph corresponding to the target function into multiple computation subgraphs; the optimized computation graph includes a computation subgraph obtained by optimizing at least one computation subgraph among the multiple computation subgraphs.

[0091] In one possible implementation, the encapsulated new function can be optimized, including but not limited to common code optimization techniques such as constant folding, common subexpression elimination, and dead code elimination. This mechanism further improves the efficiency of loop body execution.
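
The three named optimizations can be sketched on a toy intermediate representation. This is an illustrative sketch only: the node format `(dest, op, a, b)` and the function `optimize` are hypothetical, and a real compiler would operate on a much richer graph.

```python
import operator

OPS = {"add": operator.add, "mul": operator.mul}


def optimize(nodes, outputs):
    """Optimize a list of (dest, op, a, b) nodes; operands are names or ints.

    Assumes every name in `outputs` depends on at least one non-constant input.
    """
    consts, alias, seen, kept = {}, {}, {}, []

    def resolve(x):
        x = alias.get(x, x)       # follow a CSE alias, if any
        return consts.get(x, x)   # substitute a folded constant, if any

    for dest, op, a, b in nodes:
        a, b = resolve(a), resolve(b)
        if isinstance(a, int) and isinstance(b, int):
            consts[dest] = OPS[op](a, b)      # constant folding
        elif (op, a, b) in seen:
            alias[dest] = seen[(op, a, b)]    # common subexpression elimination
        else:
            seen[(op, a, b)] = dest
            kept.append((dest, op, a, b))

    live = {alias.get(name, name) for name in outputs}
    result = []
    for dest, op, a, b in reversed(kept):     # dead code elimination
        if dest in live:
            live.update(x for x in (a, b) if isinstance(x, str))
            result.append((dest, op, a, b))
    return result[::-1]


nodes = [
    ("c", "mul", 2, 3),          # both operands constant: folds to 6
    ("t1", "add", "x", "c"),     # becomes x + 6
    ("t2", "add", "x", "c"),     # same expression as t1
    ("t3", "mul", "t1", "t2"),   # becomes t1 * t1
    ("dead", "mul", "x", "x"),   # result is never used
]
optimized = optimize(nodes, outputs=["t3"])
print(optimized)
```

Five nodes shrink to two: the constant product is folded, the duplicate addition is merged, and the unused multiplication is removed.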

[0092] In one possible implementation, the optimized computation graph can also be executed. Specifically, the encapsulated new function can be executed to achieve efficient loop execution.

[0093] In this step, the loop is encapsulated into a new function module, and the encapsulated function is executed to achieve efficient loop body execution. When multiple loop bodies need to be executed during computation graph calculations, this step can execute the encapsulated function separately on each loop body. This mechanism ensures efficient execution of each loop body, thereby improving overall execution efficiency.

[0094] In one possible implementation, the optimized computation graph includes the computation graph corresponding to the target function. After the computation graph corresponding to the target function is executed and a computation result is obtained, the computation result can be written to the optimized computation graph, so that the computation nodes that execute after the computation graph corresponding to the target function in the optimized computation graph can call the computation result.

[0095] In other words, after executing the encapsulated new function, the result can be written back to the original computation graph so that subsequent computation graph nodes can use these results for computation. This mechanism ensures that the result of the encapsulated new function can be correctly used by subsequent computation graph nodes, thereby guaranteeing the correct execution of the entire computation graph.
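
The write-back can be sketched with a toy executor in which each node's result is stored under the node's name so that later nodes can read it. The graph layout, the node names, and the functions `loop_body_fn`, `run_loop`, and `execute` are hypothetical and serve only to illustrate the data flow.

```python
def loop_body_fn(acc, i):
    """Encapsulated loop body (made-up example)."""
    return acc * 2 + i


def run_loop(acc):
    # The loop stays a loop; each iteration calls the encapsulated function.
    for i in range(3):
        acc = loop_body_fn(acc, i)
    return acc


# Each node: (output name, callable, input names), listed in execution order.
graph = [
    ("start", lambda x: x + 1, ["x"]),
    ("loop_out", run_loop, ["start"]),         # node standing for the encapsulated function
    ("final", lambda v: v - 4, ["loop_out"]),  # downstream node that consumes the result
]


def execute(graph, inputs):
    values = dict(inputs)
    for name, fn, args in graph:
        # Write the result back under the node's name so later nodes can look it up.
        values[name] = fn(*(values[a] for a in args))
    return values


values = execute(graph, {"x": 0})
print(values["loop_out"], values["final"])
```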

[0096] The methods provided in the embodiments of this application have been described in detail above. Next, the device for performing the above methods provided in the embodiments of this application will be described.

[0097] Please refer to Figure 4, which is a schematic diagram of the structure of a data compilation device provided in an embodiment of this application. As shown in Figure 4, the data compilation device includes: an acquisition module 401, used to acquire a code file to be compiled; the code file includes a target code segment, and the target code segment includes a loop header and a loop body corresponding to the loop header;

[0098] For a detailed description of the acquisition module 401, please refer to the description of 301 in the above embodiment. The similarities will not be repeated here.

[0099] The encapsulation module 402 is used to encapsulate the loop body into a target function when the loop body includes a target statement whose syntax does not support conversion to a static computation graph;

[0100] For a detailed description of the encapsulation module 402, please refer to the description of 302 in the above embodiment. The similarities will not be repeated here.

[0101] The compilation module 403 is used to optimize the static computation graph corresponding to the code file to obtain an optimized computation graph; wherein, the target statement divides the static computation graph corresponding to the target function into multiple computation subgraphs; the optimized computation graph includes a computation subgraph obtained by optimizing at least one computation subgraph among the multiple computation subgraphs.

[0102] For a detailed description of the compilation module 403, please refer to the description of 303 in the above embodiment. The similarities will not be repeated here.

[0103] In one possible implementation, before optimizing the static computation graph corresponding to the code file, the encapsulation module is further configured to:

[0104] Write the information of the target function into the data structure of the static computation graph corresponding to the code file.

[0105] In one possible implementation, the target statement is one of the following:

[0106] Statements that call third-party libraries;

[0107] Statements that include an operator and that are used to select a branch from multiple branches based on the result of executing the operator; or manually triggered statements.

[0108] In one possible implementation, each computational subgraph is a computational graph of a continuous segment of code within the loop body without unrolling the loop body.

[0109] In one possible implementation, the optimization includes at least one of the following:

[0110] Constant folding, common subexpression elimination, and dead code elimination.

[0111] In one possible implementation, the device further includes:

[0112] The execution module is used to execute the optimized computation graph.

[0113] In one possible implementation, the optimized computation graph includes multiple computation nodes, and the computation graph corresponding to the target function corresponds to a target node among the multiple computation nodes; the execution module is further configured to:

[0114] After the computation graph corresponding to the target function is executed and the computation result is obtained, write the computation result to the optimized computation graph so that the computation nodes after the target node in the optimized computation graph can call the computation result.

[0115] Please refer to Figure 5, which is a schematic diagram of an execution device provided in an embodiment of this application. The execution device 1100 can specifically be a server, personal computer, smartphone, etc., and is not limited here. Specifically, the execution device 1100 includes: a receiver 1101, a transmitter 1102, a processor 1103, and a memory 1104 (the number of processors 1103 in the execution device 1100 can be one or more; Figure 5 uses one processor as an example). The processor 1103 may include an application processor 11031 and a communication processor 11032. In some embodiments of this application, the receiver 1101, transmitter 1102, processor 1103, and memory 1104 can be connected via a bus or other means.

[0116] Memory 1104 may include read-only memory and random access memory, and provides instructions and data to processor 1103. A portion of memory 1104 may also include non-volatile random access memory (NVRAM). Memory 1104 stores processor and operation instructions, executable modules, or data structures, or subsets thereof, or extended sets thereof, wherein the operation instructions may include various operation instructions for implementing various operations.

[0117] Processor 1103 controls the operation of the execution device. In specific applications, the various components of the execution device are coupled together through a bus system, which may include not only the data bus, but also power buses, control buses, and status signal buses. However, for clarity, all buses are referred to as the bus system in the diagram.

[0118] The methods disclosed in the embodiments of this application described above can be applied to processor 1103, or implemented by processor 1103. Processor 1103 can be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method can be completed by the integrated logic circuit in the hardware of processor 1103 or by instructions in the form of software. The processor 1103 described above can be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and may further include application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.

[0119] The processor 1103 can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application can be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules can reside in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other mature storage media in the art. This storage medium is located in memory 1104. The processor 1103 reads information from memory 1104 and, in conjunction with its hardware, completes the steps of the above methods.

[0120] Receiver 1101 can be used to receive input digital or character information, and to generate signal inputs related to the settings and function control of the execution device. Transmitter 1102 can be used to output digital or character information through the first interface; transmitter 1102 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; transmitter 1102 may also include a display device such as a display screen.

[0121] The electronic device provided in this embodiment can specifically be a chip, which includes a processing unit and a communication unit. The processing unit can be, for example, a processor, and the communication unit can be, for example, an input/output interface, pins, or circuits. The processing unit can execute computer-executable instructions stored in a storage unit to cause the chip in the execution device to execute the data compilation method described in the above embodiments, or to cause the chip in the training device to execute the data compilation method described in the above embodiments. Optionally, the storage unit can be a storage unit within the chip, such as a register or cache. Alternatively, the storage unit can be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM), another type of static storage device capable of storing static information and instructions, or a random access memory (RAM).

[0122] Specifically, please refer to Figure 6, which is a schematic diagram of a chip structure provided in an embodiment of this application. The chip can be represented as a neural network processor (NPU) 1200. The NPU 1200 is mounted as a coprocessor on the host CPU, and tasks are assigned by the host CPU. The core part of the NPU is the arithmetic circuit 1203, which is controlled by the controller 1204 to extract matrix data from the memory and perform multiplication operations.

[0123] In some implementations, the arithmetic circuit 1203 internally includes multiple processing engines (PEs). In some implementations, the arithmetic circuit 1203 is a two-dimensional systolic array. The arithmetic circuit 1203 can also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 1203 is a general-purpose matrix processor.

[0124] For example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data of matrix B from the weight memory 1202 and caches it in each PE of the arithmetic circuit. The arithmetic circuit then fetches the data of matrix A from the input memory 1201 and performs a matrix operation with matrix B. The partial results or the final result of the resulting matrix are stored in the accumulator 1208.
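
The role of the accumulator can be illustrated in plain Python. This is a software analogy only and does not model the PE array or the memories: the function `matmul_accumulate` is hypothetical and simply adds one partial product per slice of the inner dimension until the final result is reached.

```python
def matmul_accumulate(a, b):
    """Compute C = A x B by accumulating one partial product per inner index."""
    rows, inner, cols = len(a), len(b), len(b[0])
    accumulator = [[0] * cols for _ in range(rows)]
    for k in range(inner):
        # After this pass the accumulator holds the partial result for slices 0..k.
        for i in range(rows):
            for j in range(cols):
                accumulator[i][j] += a[i][k] * b[k][j]
    return accumulator


A = [[1, 2], [3, 4]]   # input matrix
B = [[5, 6], [7, 8]]   # weight matrix
C = matmul_accumulate(A, B)
print(C)
```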

[0125] Unified memory 1206 is used to store input and output data. Weight data is directly transferred to weight memory 1202 via Direct Memory Access Controller (DMAC) 1205. Input data is also transferred to unified memory 1206 via DMAC.

[0126] The bus interface unit (BIU) 1210 is used for interaction between the AXI bus and both the DMAC 1205 and the instruction fetch buffer (IFB) 1209.

[0127] Specifically, the BIU 1210 is used by the instruction fetch buffer 1209 to fetch instructions from external memory, and by the DMAC 1205 to fetch the original data of the input matrix A or the weight matrix B from external memory.

[0128] The DMAC is mainly used to move input data from the external DDR memory to the unified memory 1206, to move weight data to the weight memory 1202, or to move input data to the input memory 1201.

[0129] The vector computation unit 1207 includes multiple processing units that, when needed, further process the output of the arithmetic circuit 1203, for example by vector multiplication, vector addition, exponential operations, logarithmic operations, or magnitude comparisons. It is mainly used for computation in the layers of a neural network other than the convolutional and fully connected layers, such as batch normalization, pixel-level summation, and upsampling of feature planes.

[0130] In some implementations, the vector computation unit 1207 can store the processed output vector in the unified memory 1206. For example, the vector computation unit 1207 can apply a linear or nonlinear function to the output of the arithmetic circuit 1203, such as linear interpolation of the feature planes extracted by a convolutional layer, or to a vector of accumulated values, to generate activation values. In some implementations, the vector computation unit 1207 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 1203, for example, for use in subsequent layers of the neural network.
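
As a software analogy for this post-processing, the sketch below applies an element-wise linear map followed by a nonlinear function to a matrix of accumulated values. The function `vector_postprocess` is hypothetical, and the choice of ReLU as the nonlinear function is an assumption; the application does not name a specific function.

```python
def vector_postprocess(matrix, scale=1.0, shift=0.0):
    """Apply scale * v + shift, then ReLU (assumed nonlinearity), to every element."""
    return [[max(0.0, scale * v + shift) for v in row] for row in matrix]


accumulated = [[19, -22], [-43, 50]]  # made-up accumulator contents
activations = vector_postprocess(accumulated, scale=0.5, shift=1.0)
print(activations)
```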

[0131] The instruction fetch buffer 1209, which is connected to the controller 1204, is used to store the instructions used by the controller 1204.

[0132] The unified memory 1206, input memory 1201, weight memory 1202, and instruction fetch buffer 1209 are all on-chip memories. External memory is proprietary to this NPU hardware architecture.

[0133] The processor mentioned above can be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits used to control the execution of the above program.

[0134] Please refer to Figure 7, which is a schematic diagram of the structure of a computer-readable storage medium provided in an embodiment of this application. This application also provides a computer-readable storage medium. In some embodiments, the method disclosed in Figure 3 can be implemented as computer program instructions encoded in a machine-readable format on the computer-readable storage medium, or on other non-transitory media or articles of manufacture.

[0135] Figure 7 schematically illustrates a conceptual partial view of an example computer-readable storage medium arranged according to at least some of the embodiments shown herein, the example computer-readable storage medium including a computer program for executing computer processes on a computing device.

[0136] In one embodiment, the computer-readable storage medium 1300 is provided using a signal-bearing medium 1301. The signal-bearing medium 1301 may include one or more program instructions 1302, which, when executed by one or more processors, can provide the functions, or parts of the functions, described above with reference to Figure 3.

[0137] In some examples, the signal-bearing medium 1301 may include a computer-readable medium 1303, such as, but not limited to, a hard disk drive, a compact disc (CD), a digital video disc (DVD), a digital magnetic tape, a memory, ROM, or RAM.

[0138] In some embodiments, the signal-bearing medium 1301 may include a computer-recordable medium 1304, such as, but not limited to, a memory, a read/write (R/W) CD, or an R/W DVD. In some embodiments, the signal-bearing medium 1301 may include a communication medium 1305, such as, but not limited to, digital and/or analog communication media (e.g., fiber optic cables, waveguides, wired communication links, or wireless communication links). Thus, for example, the signal-bearing medium 1301 may be conveyed by a wireless form of the communication medium 1305 (e.g., a wireless communication medium conforming to the IEEE 802.11 standard or another transmission protocol).

[0139] One or more program instructions 1302 may be, for example, computer-executable instructions or logical implementation instructions. In some examples, the computing device may be configured to provide various operations, functions, or actions in response to one or more program instructions 1302 conveyed to the computing device via a computer-readable medium 1303, a computer-recordable medium 1304, and / or a communication medium 1305.

[0140] It should also be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. In addition, in the accompanying drawings of the device embodiments provided in this application, the connection relationship between modules indicates that they have a communication connection, which can be implemented as one or more communication buses or signal lines.

[0141] Through the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general-purpose hardware, or it can be implemented by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memory, special-purpose components, etc. Generally, any function performed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structure used to implement the same function can also be diverse, such as analog circuits, digital circuits, or special-purpose circuits. However, for this application, software program implementation is more often the preferred implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, mobile hard disk, ROM, RAM, magnetic disk, or optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, training equipment, or network device, etc.) to execute the methods of the various embodiments of this application.

[0142] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product.

[0143] A computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the flow or function according to the embodiments of this application is generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transferred from one computer-readable storage medium to another. For example, computer instructions may be transferred from one website, computer, training device, or data center to another website, computer, training device, or data center via wired (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (e.g., infrared, radio, or microwave) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a training device or data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid-state drives (SSDs)).

Claims

1. A data compilation method, characterized in that the method includes: obtaining a code file to be compiled, wherein the code file includes a target code segment, and the target code segment includes a loop header and a loop body corresponding to the loop header; if the loop body includes a target statement whose syntax does not support conversion to a static computation graph, encapsulating the loop body as a target function; and optimizing the static computation graph corresponding to the code file to obtain an optimized computation graph; wherein the target statement divides the static computation graph corresponding to the target function into multiple computation subgraphs, and the optimized computation graph includes a computation subgraph obtained by optimizing at least one of the multiple computation subgraphs.

2. The method according to claim 1, characterized in that, before optimizing the static computation graph corresponding to the code file, the method further includes: writing the information of the target function into the data structure of the static computation graph corresponding to the code file.

3. The method according to claim 1 or 2, characterized in that the target statement is one of the following: a statement that calls a third-party library; a statement that includes an operator and is used to select a branch from multiple branches based on the result of executing the operator; or a manually triggered statement.

4. The method according to any one of claims 1 to 3, characterized in that each computation subgraph is a computation graph of a continuous segment of code within the loop body without expanding the loop body.

5. The method according to any one of claims 1 to 4, characterized in that the optimization includes at least one of the following: constant folding, common subexpression elimination, or dead code elimination.

6. The method according to any one of claims 1 to 5, characterized in that the method further includes: executing the optimized computation graph.

7. The method according to claim 6, characterized in that the optimized computation graph includes the computation graph corresponding to the target function, and the method further includes: after the computation graph corresponding to the target function is executed and the computation result is obtained, writing the computation result to the optimized computation graph so that the computation nodes that execute after the computation graph corresponding to the target function in the optimized computation graph can call the computation result.

8. A data compilation apparatus, characterized in that the apparatus includes: an acquisition module, used to acquire a code file to be compiled, wherein the code file includes a target code segment, and the target code segment includes a loop header and a loop body corresponding to the loop header; an encapsulation module, used to encapsulate the loop body into a target function when the loop body includes a target statement whose syntax does not support conversion to a static computation graph; and a compilation module, used to optimize the static computation graph corresponding to the code file to obtain an optimized computation graph; wherein the target statement divides the static computation graph corresponding to the target function into multiple computation subgraphs, and the optimized computation graph includes a computation subgraph obtained by optimizing at least one computation subgraph among the multiple computation subgraphs.

9. The apparatus according to claim 8, characterized in that, before the static computation graph corresponding to the code file is optimized, the encapsulation module is further configured to: write the information of the target function into the data structure of the static computation graph corresponding to the code file.

10. The apparatus according to claim 8 or 9, characterized in that the target statement is one of the following: a statement that calls a third-party library; a statement that includes an operator and is used to select a branch from multiple branches based on the result of executing the operator; or a manually triggered statement.

11. The apparatus according to any one of claims 8 to 10, characterized in that each computation subgraph is a computation graph of a continuous segment of code within the loop body without expanding the loop body.

12. The apparatus according to any one of claims 8 to 11, characterized in that the optimization includes at least one of the following: constant folding, common subexpression elimination, or dead code elimination.

13. The apparatus according to any one of claims 8 to 12, characterized in that the apparatus further includes: an execution module, used to execute the optimized computation graph.

14. The apparatus according to claim 13, characterized in that the optimized computation graph includes multiple computation nodes, and the computation graph corresponding to the target function corresponds to a target node among the multiple computation nodes; the execution module is further configured to: after the computation graph corresponding to the target function is executed and the computation result is obtained, write the computation result to the optimized computation graph so that the computation nodes after the target node in the optimized computation graph can call the computation result.

15. A data compilation apparatus, characterized in that the apparatus includes a memory and a processor; the memory stores code, and the processor is configured to execute the code, wherein when the code is executed, the apparatus performs the method according to any one of claims 1 to 7.

16. A computer storage medium, characterized in that the computer storage medium stores instructions that, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 7.

17. A computer program product, characterized in that the computer program product stores instructions that, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 7.