Processing method and processing device of neural network computation graph
By splitting the neural network computation graph into computation subgraphs and generating executable files, the problems of high compilation difficulty and insufficient storage resources are solved, achieving efficient compilation and reasonable utilization of storage resources.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- LYNXI TECH CO LTD
- Filing Date
- 2022-05-17
- Publication Date
- 2026-06-26
AI Technical Summary
In existing technologies, the neural network computation graphs of deep learning frameworks are difficult to compile on many-core chips, have low compilation efficiency, and require storage resources that exceed the chip hardware capabilities, resulting in the inefficient use of resources.
By splitting the neural network computation graph into serially connected computation subgraphs, and using the output of the target operator node as the split point, an executable file for each computation subgraph is generated and run on the chip. The output results are stored in external memory or the host, reducing on-chip storage requirements.
It reduces the compilation difficulty of neural network computation graphs, improves compilation efficiency, makes reasonable use of chip hardware storage resources, saves on-chip storage resources, and improves the utilization rate of storage resources.
Figure CN114970814B_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of computer technology, and in particular to a method and apparatus for processing neural network computation graphs, an electronic device, and a computer-readable storage medium.
Background Technology
[0002] Many-core chips based on an in-memory computing architecture reduce data transfer time and power consumption by placing both computing and storage on-chip, making this architecture an important development direction for many-core chips.
[0003] Deep learning frameworks (such as TensorFlow or ONNX) typically use computation graphs to represent the computations of deep learning models (neural networks). For specific acceleration hardware, the neural network computation graph needs to be compiled by a compiler to generate an instruction stream that can run on the hardware. This hardware may be a many-core chip based on in-memory computing, which typically consists of multiple physical cores.
Summary of the Invention
[0004] This disclosure provides a method and apparatus for processing neural network computation graphs, an electronic device, and a computer-readable storage medium.
[0005] In a first aspect, this disclosure provides a method for processing a neural network computation graph, wherein the neural network computation graph includes multiple operator nodes, and the processing method includes:
[0006] Based on the output connection relationship of multiple operator nodes, all target operator nodes in the neural network computation graph are determined. The target operator node is one with a first output terminal and a second output terminal, and the operator node connected to the first output terminal is a data output node.
[0007] Using the second output terminal of the target operator node as the splitting point, the neural network computation graph is split into multiple computation subgraphs connected in series, wherein the operator node connected to the second output terminal has its own output terminal connected to other operator nodes.
[0008] An executable file corresponding to each computation subgraph is generated sequentially based on each computation subgraph.
[0009] In a second aspect, this disclosure provides a processing apparatus for processing a neural network computation graph to be processed, the neural network computation graph including multiple operator nodes, the processing apparatus comprising:
[0010] The determination module is used to determine all target operator nodes in the neural network computation graph based on the output connection relationship of multiple operator nodes. The target operator node is one that has a first output terminal and a second output terminal, and the operator node connected to the first output terminal is a data output node.
[0011] The first splitting module is used to split the neural network computation graph into multiple serially connected computation subgraphs, with the second output terminal of the target operator node as the splitting point, wherein the operator node connected to the second output terminal has its own output terminal connected to other operator nodes.
[0012] The generation module is used to sequentially generate an executable file corresponding to each computational subgraph based on each computational subgraph.
[0013] In a third aspect, this disclosure provides an electronic device comprising:
[0014] At least one processor;
[0015] and a memory communicatively connected to the at least one processor;
[0016] The memory stores one or more computer programs that can be executed by the at least one processor, and the one or more computer programs are executed by the at least one processor to enable the at least one processor to perform the above-described neural network computation graph processing method.
[0017] In a fourth aspect, this disclosure provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor or processing core, implements the above-described method for processing neural network computation graphs.
[0018] According to the technical solution of the neural network computation graph processing method provided in the embodiments of this disclosure, on the one hand, the processing method is applicable to any neural network computation graph, and can realize the automatic decomposition of any neural network computation graph, so that the neural network computation graph can be compiled in segments, reducing the compilation difficulty of the neural network computation graph, improving the compilation efficiency and effect of the neural network computation graph, and effectively reducing the requirements of the compilation of the neural network computation graph on the chip hardware storage resources. This is conducive to solving the problem that the storage resources required for the compilation of the neural network computation graph are large and the actual chip hardware storage resources cannot meet them, realizing the rational use of chip hardware storage resources and improving the utilization efficiency of chip hardware storage resources. On the other hand, the output of each computation subgraph obtained by automatic decomposition is used as part of the output result of the neural network computation graph. It does not need to be stored on the chip or stored on the chip for a long time, but can be stored in external memory or the host corresponding to the chip, thereby effectively saving on-chip storage resources and improving the utilization rate of on-chip storage resources.
[0019] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.
Brief Description of the Drawings
[0020] The accompanying drawings are provided to further illustrate the present disclosure and form part of the specification. They are used together with the embodiments of the present disclosure to explain the disclosure and do not constitute a limitation thereof. The above and other features and advantages will become more apparent to those skilled in the art from the detailed description of exemplary embodiments with reference to the accompanying drawings, in which:
[0021] Figure 1 is a flowchart illustrating a method for processing a neural network computation graph provided in an embodiment of this disclosure;
[0022] Figure 2 is a schematic diagram of the structure of a neural network computation graph;
[0023] Figure 3 is a flowchart illustrating a specific implementation of step S13 in Figure 1;
[0024] Figure 4 is a flowchart illustrating another specific implementation of step S13 in Figure 1;
[0025] Figure 5 is a flowchart illustrating another method for processing a neural network computation graph provided in an embodiment of this disclosure;
[0026] Figure 6 is a flowchart illustrating another method for processing a neural network computation graph provided in an embodiment of this disclosure;
[0027] Figure 7 is a flowchart illustrating another method for processing a neural network computation graph provided in an embodiment of this disclosure;
[0028] Figure 8 is a flowchart illustrating another method for processing a neural network computation graph provided in an embodiment of this disclosure;
[0029] Figure 9 is a block diagram of a processing apparatus provided in an embodiment of this disclosure;
[0030] Figure 10 is a block diagram of an electronic device provided in an embodiment of this disclosure.
Detailed Implementation
[0031] To enable those skilled in the art to better understand the technical solutions of this disclosure, exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments of this disclosure to aid understanding. These should be considered merely exemplary. Therefore, those skilled in the art should recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and conciseness, descriptions of well-known functions and structures are omitted in the following description.
[0032] Where there is no conflict, the various embodiments of this disclosure and the features thereof in the embodiments may be combined with each other.
[0033] As used herein, the term “and / or” includes any and all combinations of one or more related enumerated entries.
[0034] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit this disclosure. As used herein, the singular forms “a” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the terms “comprising” and / or “made of”, when used in this specification, specify the presence of the stated feature, integer, step, operation, element, and / or component, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or groups thereof. Words such as “connected” or “linked” are not limited to physical or mechanical connections but can include electrical connections, whether direct or indirect.
[0035] Unless otherwise specified, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art. It will also be understood that terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art and this disclosure, and will not be interpreted as having an idealized or overly formal meaning, unless expressly so defined herein.
[0036] In related technologies, the computational graph of a large neural network usually requires a large amount of computation and data. The computing and storage resources of the chip are usually insufficient to meet the resource requirements of the entire neural network computational graph, resulting in high compilation difficulty and low efficiency of the neural network computational graph.
[0037] Therefore, this disclosure provides a method and apparatus for processing neural network computation graphs, an electronic device, and a computer-readable storage medium, which are intended to effectively solve at least one of the technical problems existing in the above-mentioned related technologies.
[0038] The processing method of this disclosure can be executed by a processing device, which can be integrated into an electronic device such as a terminal device or a server through software and / or hardware. For example, the terminal device can be an in-vehicle device, user equipment (UE), mobile device, user terminal, terminal, cellular phone, cordless phone, personal digital assistant (PDA), handheld device, computing device, wearable device, etc. In some embodiments, the processing method of this disclosure can be implemented by a processor calling computer-readable program instructions stored in memory, or the processing method of this disclosure can be executed by a server.
[0039] Figure 1 is a flowchart illustrating a method for processing a neural network computation graph, as provided in an embodiment of this disclosure.
[0040] This disclosure provides a method for processing a neural network computation graph. This method is used to automatically split the neural network computation graph to be processed and generate an executable file that runs on a corresponding many-core chip. The neural network computation graph to be processed may include multiple operator nodes, which are the basic computational units that constitute the neural network. The operator nodes may be, for example, operations such as convolution and pooling in the neural network. The neural network may be any type of deep learning network. The neural network may be used to perform any one of image processing tasks, speech processing tasks, text processing tasks, and video processing tasks. The input data of the neural network may be any one of image data, speech data, text data, and video data.
[0041] Referring to Figure 1, the processing method may include steps S11 to S13.
[0042] Step S11: Based on the output connection relationship of multiple operator nodes, determine all target operator nodes in the neural network computation graph. The target operator node is one with a first output terminal and a second output terminal, and the operator node connected to the first output terminal is a data output node.
[0043] Step S12: Using the second output terminal of the target operator node as the splitting point, the neural network computation graph is split into multiple computation subgraphs connected in series, wherein the operator node connected to the second output terminal has its own output terminal connected to other operator nodes.
[0044] Step S13: Generate the executable file corresponding to each computation subgraph in sequence.
[0045] In this embodiment of the disclosure, for a neural network computation graph to be processed, before determining all target operator nodes in the neural network computation graph based on the output connection relationship of multiple operator nodes, that is, before step S11, the processing method further includes: obtaining node information of each operator node in the neural network computation graph.
[0046] The node information of an operator node can include its input connections, output connections, required parameters, attribute information, and execution order. Specifically, the input connections describe the connections between the operator node's input and the outputs of other operators in the neural network computation graph; the output connections describe the connections between the operator node's output and the inputs of other operators; the required parameters include, but are not limited to, pre-configured weight parameters needed to perform the operator's operations; the attribute information characterizes the operator node's features and may include, but is not limited to, the operator type, the computational and storage requirements; and the execution order represents the temporal order in which the operator's operations are performed.
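The node information described above can be pictured as a simple record per operator node. The following is only a minimal sketch: the class name `OperatorNode`, its field names, and the example values are hypothetical and are not taken from this disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class OperatorNode:
    """Hypothetical container for the node information of one operator node."""
    name: str                                         # identifier of the operator node
    op_type: str                                      # attribute information: operator type
    inputs: list[str] = field(default_factory=list)   # input connection relationship
    outputs: list[str] = field(default_factory=list)  # output connection relationship
    params: dict[str, object] = field(default_factory=dict)  # required parameters, e.g. weights
    attrs: dict[str, object] = field(default_factory=dict)   # other attributes, e.g. storage needs
    order: int = 0                                    # position in the execution order


# Example values are illustrative only.
conv = OperatorNode(
    name="op2",
    op_type="Conv",
    inputs=["op1"],
    outputs=["op3", "op4"],
    params={"weight_shape": (16, 3, 3, 3)},
    attrs={"memory_bytes": 1728},
    order=2,
)
print(len(conv.outputs))  # number of output branches of this node
```

Counting the entries of `outputs` is one way to tell single-output, dual-output, and multi-output nodes apart.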
[0047] In step S11, based on the output connection relationship of multiple operator nodes in the neural network computation graph, and using the condition that a node has a first output terminal and a second output terminal, and the output of the first output terminal is a part of the output of the neural network computation graph, all target operator nodes in the neural network computation graph that satisfy the above screening conditions can be determined.
[0048] Specifically, for each operator node in the neural network computation graph, the number of other operator nodes connected to the output of the operator node in the neural network computation graph can be determined based on the output connection relationship of the operator node, thereby determining the number of output branches of the operator node, that is, determining whether the operator node is a single output node, a dual output node, or a multi-output node.
[0049] Furthermore, based on the output connection relationships and node information of multiple operator nodes in the neural network computation graph, the output connection relationships and operator types of other operator nodes connected to each output terminal of each operator node can be determined. This allows us to determine whether other operator nodes connected to each output terminal of each operator node are data output nodes, and consequently, whether each operator node is a target operator node. Here, a data output node refers to an operator node whose operator type is used to output data, and the output data is part of the output result of the neural network computation graph. This type of operator node is not used for data computation operations and does not have output connection relationships with other operator nodes.
[0050] When an operator node is a dual-output or multi-output node, one of its output terminals is connected to an operator node that has no output connection to other operator nodes (i.e., that operator node is a data output node), and another of its output terminals is connected to an operator node that has an output connection to other operator nodes (i.e., that operator node is a data operation node used for data computation), then the output of this operator node serves both as part of the output of the neural network computation graph and as input to subsequent operator nodes. Therefore, this operator node meets the above-mentioned screening criteria for target operator nodes and is identified as a target operator node. For ease of distinction, in this disclosure, the output terminal connected to the data output node is defined as the first output terminal, and the other output terminal is defined as the second output terminal.
[0051] Figure 2 is a schematic diagram of the structure of a neural network computation graph. As an example, referring to Figure 2, operator node 1 is a single-output node and therefore does not meet the above screening criteria for target operator nodes. Operator node 2 is a dual-output node. Operator node 3, which is connected to the first output terminal 21 of operator node 2, has no output connection to other operator nodes, so operator node 3 is a data output node. Operator node 4, which is connected to the second output terminal 22, has an output connection to another operator node, namely operator node 6. Therefore, operator node 2 meets the above screening criteria for target operator nodes. Similarly, operator nodes 5 and 7 are both data output nodes, so operator node 4 and operator node 6 both meet the screening criteria, while the remaining operator nodes do not. Therefore, in the neural network computation graph shown in Figure 2, operator nodes 2, 4, and 6 are each identified as a target operator node.
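The screening in step S11 can be sketched on the Figure 2 example as follows. This is an illustrative sketch under stated assumptions, not the implementation of this disclosure: the graph is modeled as a dictionary of successor lists, the edge from operator node 8 to operator node 9 is assumed, and a data output node is approximated as a node with no successors (the operator-type check is omitted). The names `FIGURE_2`, `is_data_output`, and `find_target_nodes` are hypothetical.

```python
# Successor lists for the Figure 2 example (the edge 8 -> 9 is an assumption).
FIGURE_2 = {
    1: [2],
    2: [3, 4],
    3: [],
    4: [5, 6],
    5: [],
    6: [7, 8],
    7: [],
    8: [9],
    9: [],
}


def is_data_output(graph, node):
    """Simplified test: a data output node has no output connection to other nodes."""
    return not graph[node]


def find_target_nodes(graph):
    """Return nodes that feed both a data output node and a further computing node."""
    targets = []
    for node, successors in graph.items():
        if len(successors) < 2:
            continue  # single-output nodes can never be target operator nodes
        has_first = any(is_data_output(graph, s) for s in successors)
        has_second = any(not is_data_output(graph, s) for s in successors)
        if has_first and has_second:
            targets.append(node)
    return targets


print(find_target_nodes(FIGURE_2))  # [2, 4, 6]
```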
[0052] In step S12, the neural network computation graph is split using the second output terminal of each target operator node as a splitting point, resulting in multiple serially connected computation subgraphs. Taking the neural network computation graph shown in Figure 2 as an example, step S11 determines that the target operator nodes meeting the above screening criteria are operator node 2, operator node 4, and operator node 6. Taking the second output terminal 22 of operator node 2, the second output terminal 42 of operator node 4, and the second output terminal 62 of operator node 6 as splitting points, the neural network computation graph shown in Figure 2 can be split into four serially connected computation subgraphs, denoted computation subgraph 1, computation subgraph 2, computation subgraph 3, and computation subgraph 4. Computation subgraph 1 includes operator node 1, operator node 2, and operator node 3; computation subgraph 2 includes operator node 4 and operator node 5; computation subgraph 3 includes operator node 6 and operator node 7; and computation subgraph 4 includes operator node 8 and operator node 9.
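The split in step S12 can be sketched on the same example. The sketch below rests on several assumptions that are not stated in this disclosure: the graph is a dictionary of successor lists, the edge from operator node 8 to operator node 9 is assumed, and the execution order places each data output node before the node reached through the second output terminal. The function name `split_graph` is hypothetical.

```python
# Successor lists for the Figure 2 example (the edge 8 -> 9 is an assumption).
FIGURE_2 = {1: [2], 2: [3, 4], 3: [], 4: [5, 6], 5: [], 6: [7, 8], 7: [], 8: [9], 9: []}
EXECUTION_ORDER = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # assumed execution order


def split_graph(graph, order):
    """Cut the graph at the second output terminal of every target operator node."""
    starts = set()  # nodes that open a new computation subgraph
    for successors in graph.values():
        data_outputs = [s for s in successors if not graph[s]]
        computing = [s for s in successors if graph[s]]
        if data_outputs and computing:
            # The edge to each computing successor is a splitting point.
            starts.update(computing)
    subgraphs = [[]]
    for node in order:
        if node in starts and subgraphs[-1]:
            subgraphs.append([])
        subgraphs[-1].append(node)
    return subgraphs


print(split_graph(FIGURE_2, EXECUTION_ORDER))
# [[1, 2, 3], [4, 5], [6, 7], [8, 9]]
```

Each inner list corresponds to one computation subgraph, in the order in which the subgraphs are connected in series.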
[0053] It should be noted that the number of first output terminals of the target operator node can be one, and the number of second output terminals can be one or more. This disclosure does not limit this.
[0054] For the multiple computational subgraphs obtained by splitting in step S12, each computational subgraph has one or more operator nodes.
[0055] In step S13, for each computational subgraph obtained by splitting in step S12, an executable file corresponding to the computational subgraph is generated. The executable file is executable code that can be executed on the corresponding chip, thereby enabling the computational subgraph to be compiled and run on the corresponding chip so that the corresponding chip can execute the corresponding computational task of the computational subgraph.
[0056] According to the technical solution of the neural network computation graph processing method provided in the embodiments of this disclosure, on the one hand, the processing method is applicable to any neural network computation graph, and can realize the automatic decomposition of any neural network computation graph, so that the neural network computation graph can be compiled in segments, reducing the compilation difficulty of the neural network computation graph, improving the compilation efficiency and effect of the neural network computation graph, and effectively reducing the requirements of the compilation of the neural network computation graph on the chip hardware storage resources. This is conducive to solving the problem that the storage resources required for the compilation of the neural network computation graph are large and the actual chip hardware storage resources cannot meet them, realizing the rational use of chip hardware storage resources and improving the utilization efficiency of chip hardware storage resources. On the other hand, the output of each computation subgraph obtained by automatic decomposition is used as part of the output result of the neural network computation graph. It does not need to be stored on the chip or stored on the chip for a long time, but can be stored in external memory or the host corresponding to the chip, thereby effectively saving on-chip storage resources and improving the utilization rate of on-chip storage resources.
[0057] In some embodiments, after generating an executable file corresponding to each computational subgraph in sequence according to each computational subgraph, the processing method further includes: loading the executable file corresponding to each computational subgraph into the corresponding chip; and storing the output result corresponding to the computational subgraph into an external memory or the host corresponding to the chip in response to the executable file corresponding to the computational subgraph completing its execution on the chip.
[0058] By storing the output results corresponding to the computation subgraph in external memory (such as double data rate synchronous dynamic random access memory, DDR SDRAM) or in the host corresponding to the chip, and then reading those output results back to the chip from the external memory or the host when needed, the on-chip storage resources of the chip can be effectively saved.
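The run-then-offload behavior can be pictured with the following sketch. It is purely illustrative: each executable file is stood in for by a Python callable, the host or external memory by a dictionary, and the convention that a subgraph returns both its graph output and a value passed to the next subgraph is an assumption. The names `run_pipeline` and `fake_executable` are hypothetical.

```python
def run_pipeline(executables, first_input):
    """Run the subgraph executables in series and keep their outputs off-chip."""
    host_memory = {}  # stands in for external memory or the host
    carried = first_input
    for index, run_on_chip in enumerate(executables, start=1):
        # Assumed convention: returns (graph output, input for the next subgraph).
        graph_output, carried = run_on_chip(carried)
        # Once the subgraph finishes, its output result is stored off-chip.
        host_memory[f"subgraph_{index}"] = graph_output
    return host_memory, carried


def fake_executable(x):
    """Stand-in for a compiled computation subgraph."""
    return x * 10, x + 1


host_memory, final_value = run_pipeline([fake_executable] * 3, first_input=1)
print(host_memory)  # {'subgraph_1': 10, 'subgraph_2': 20, 'subgraph_3': 30}
```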
[0059] Figure 3 is a flowchart illustrating a specific implementation of step S13 in Figure 1. Referring to Figure 3, in some embodiments, step S13, which generates an executable file corresponding to each computation subgraph in sequence according to each computation subgraph, may further include steps S31 to S32.
[0060] Step S31: In response to a failure error that occurs when generating the executable file corresponding to the current computation subgraph, the current computation subgraph is further split into multiple serially connected computation subgraphs.
[0061] Step S32: Generate an executable file corresponding to each computational subgraph obtained from the further subdivision in sequence.
[0062] In step S31, a failure error occurs when generating the executable file corresponding to the current computation subgraph, indicating that the current computation subgraph cannot be supported by the corresponding chip and cannot run normally on the corresponding chip. That is, the corresponding chip cannot support the compilation and execution of the current computation subgraph on the chip. For example, the computing power, computation amount or storage amount required by the current computation subgraph exceeds the limit of the corresponding chip, which causes a failure error when generating the executable file corresponding to the computation subgraph. Therefore, the current computation subgraph is further split into multiple serially connected computation subgraphs.
[0063] In some embodiments, the current computational subgraph can be further subdivided according to the subdivision methods described in steps S11 and S12 above. In some embodiments, the current computational subgraph can be further subdivided as needed by configuring other subdivision methods, or by combining other subdivision methods with the subdivision methods described in steps S11 and S12 above, or by manually subdividing the current computational subgraph. Other subdivision methods may include filtering and subdividing target operator nodes according to other filtering conditions configured for the target operator nodes. This disclosure does not specifically limit the method of further subdividing the current computational subgraph.
[0064] For the multiple computational subgraphs obtained by further splitting in step S31, each computational subgraph may have one or more operator nodes.
[0065] In step S32, for each computational subgraph obtained by further splitting in step S31, an executable file corresponding to the computational subgraph is generated. The executable file is executable code that can be executed on the corresponding chip, thereby enabling the computational subgraph to be compiled and run on the corresponding chip so that the corresponding chip can execute the corresponding computational task of the computational subgraph.
[0066] Through steps S31 and S32 above, when a failure error occurs while generating the executable file corresponding to the current computation subgraph, indicating that the current computation subgraph may not be supported by the corresponding chip, the current computation subgraph is split further to improve the compilation result. This helps ensure that each computation subgraph can be supported by the chip hardware and makes full use of the chip's hardware resources, thereby improving the chip's execution efficiency when processing neural network computation graphs.
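The split-on-failure behavior of steps S31 and S32 can be sketched as a recursive fallback. The sketch is hypothetical throughout: `compile_subgraph` is a stand-in compiler that fails when a subgraph holds more nodes than an assumed capacity, and `split_further` simply halves the node list as a placeholder, since this disclosure does not limit how the further split is performed.

```python
class CompileError(Exception):
    """Stands in for a failure error raised while generating an executable file."""


def compile_subgraph(nodes, capacity=3):
    """Hypothetical compiler: fails when the subgraph exceeds the chip capacity."""
    if len(nodes) > capacity:
        raise CompileError(f"subgraph {nodes} exceeds capacity {capacity}")
    return "exe(" + ",".join(str(n) for n in nodes) + ")"


def split_further(nodes):
    """Placeholder split: halve the subgraph into two serially connected parts."""
    middle = len(nodes) // 2
    return [nodes[:middle], nodes[middle:]]


def compile_with_fallback(nodes):
    """Step S31/S32 sketch: on a failure error, split further and compile each part."""
    try:
        return [compile_subgraph(nodes)]
    except CompileError:
        if len(nodes) == 1:
            raise  # a single operator node cannot be split any further
        executables = []
        for part in split_further(nodes):
            executables.extend(compile_with_fallback(part))
        return executables


print(compile_with_fallback([1, 2, 3, 4, 5]))  # ['exe(1,2)', 'exe(3,4,5)']
```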
[0067] Figure 4 is a flowchart illustrating another specific implementation of step S13 in Figure 1. Referring to Figure 4, in some embodiments, step S13, which generates an executable file corresponding to each computation subgraph in sequence according to each computation subgraph, may further include steps S41 to S42.
[0068] Step S41: In response to the absence of a failure error when generating the executable file corresponding to the current computation subgraph, the current computation subgraph is used as the target computation subgraph;
[0069] Step S42: Load the executable file corresponding to each target computation subgraph into the corresponding chip.
[0070] In step S41, if no failure error occurs when generating the executable file corresponding to the current computation subgraph, it means that the current computation subgraph can be supported by the corresponding chip and can run normally on the corresponding chip. That is, the corresponding chip supports the compilation and execution of the current computation subgraph on the chip. Therefore, there is no need to further process the current computation subgraph. The current computation subgraph can be used as the target computation subgraph, and step S42 can be executed.
[0071] In step S42, since no failure error occurred when generating the executable file corresponding to the target computation subgraph, the executable file of the target computation subgraph can be supported by the corresponding chip and can be executed normally on the corresponding chip. Therefore, for each target computation subgraph, the executable file corresponding to the target computation subgraph can be loaded into the corresponding chip to run the executable file corresponding to the target computation subgraph on the corresponding chip to perform the corresponding computation task.
[0072] It should be noted that the computational subgraph obtained by splitting in step S12 and the computational subgraph obtained by further splitting in step S31 are both subgraphs in the neural network computational graph. Therefore, this embodiment of the present disclosure does not make specific distinctions in name and definition between the computational subgraph obtained by splitting in step S12 and the computational subgraph obtained by further splitting in step S31. Their identities and functions are essentially equivalent. Furthermore, for a specific description of generating the executable file in step S32, please refer to the relevant description of generating the executable file in step S13 in this embodiment of the present disclosure. It will not be elaborated here.
[0073] When no failure error occurs in generating the executable file corresponding to any computation subgraph, all computation subgraphs are supported by the corresponding chips and can run normally on them. However, some computation subgraphs may actually require fewer resources, while others require more. If the executable files of all computation subgraphs are loaded directly into the corresponding chips for execution, the chip hardware resources may not be used reasonably: some chips carry a low hardware resource load and others a high one, which is not conducive to load balancing of chip hardware resources. Therefore, in some embodiments, in order to use chip hardware resources reasonably, improve their utilization efficiency, and achieve load balancing, some computation subgraphs need to be merged into a single computation subgraph before the corresponding executable file is generated and loaded into the corresponding chip.
[0074] Figure 5 is a flowchart illustrating another method for processing neural network computation graphs provided in an embodiment of this disclosure. Referring to Figure 5, in some embodiments, after generating the executable file corresponding to each computation subgraph in sequence according to each computation subgraph, that is, after step S13, the processing method may further include steps S51 to S54.
[0075] Step S51: In response to the absence of any failures or errors when generating the executable file corresponding to each computation subgraph, obtain at least one set of computation subgraphs, each set of computation subgraphs including at least two computation subgraphs connected in series.
[0076] Step S52: Fuse the at least two computation subgraphs in each set of computation subgraphs to obtain the candidate computation subgraph corresponding to each set of computation subgraphs.
[0077] Step S53: If there are no failures or errors when generating the executable file corresponding to the candidate computation subgraph, use the candidate computation subgraph as the target computation subgraph.
[0078] Step S54: Load the executable file corresponding to each target computation subgraph into the corresponding chip.
[0079] In step S51, if no failure error occurs when generating the executable file corresponding to each computation subgraph, in order to make reasonable use of chip hardware resources, improve the utilization efficiency of chip hardware resources, and achieve load balancing of chip hardware resources, at least one set of computation subgraphs is first obtained from all computation subgraphs that are connected in sequence, wherein each set of computation subgraphs includes at least two computation subgraphs connected in sequence.
[0080] For example, referring to the neural network computation graph shown in Figure 2, assume that the graph is currently split into computation subgraph 1, computation subgraph 2, computation subgraph 3, and computation subgraph 4. In step S51, a set of computation subgraphs can be obtained. This set may include, for example, the serially connected computation subgraphs 1 and 2, or the serially connected computation subgraphs 2 and 3, or the serially connected computation subgraphs 2, 3, and 4.
[0081] In step S52, the fusion process refers to connecting the output of the previous computational subgraph and the input of the next computational subgraph in each pair of adjacent computational subgraphs in a set of computational subgraphs in a one-to-one correspondence, so as to merge at least two computational subgraphs in a set of computational subgraphs into one computational subgraph as a candidate computational subgraph.
[0082] For example, referring to the neural network computation graph shown in Figure 2, suppose a set of computation subgraphs comprises the serially connected computation subgraph 1 and computation subgraph 2, where computation subgraph 1 includes operator node 1, operator node 2, and operator node 3, and computation subgraph 2 includes operator node 4 and operator node 5. In step S52, the serially connected computation subgraph 1 and computation subgraph 2 are fused to obtain a candidate computation subgraph that includes operator nodes 1 to 5.
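The fusion step described above can be sketched as follows. This is a minimal illustration only: the `Subgraph` record, its port names, and the `fuse` function are hypothetical and do not come from the disclosure, which does not specify a data structure.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Subgraph:
    """Hypothetical container: operator nodes in execution order plus boundary ports."""
    nodes: List[int]
    inputs: List[str]
    outputs: List[str]
    # Edges created by fusion: (producer output port, consumer input port).
    internal_edges: List[Tuple[str, str]] = field(default_factory=list)


def fuse(prev: Subgraph, nxt: Subgraph) -> Subgraph:
    """Connect each output of `prev` to the matching input of `nxt`, one to one."""
    if len(prev.outputs) != len(nxt.inputs):
        raise ValueError("outputs and inputs must correspond one to one")
    new_edges = list(zip(prev.outputs, nxt.inputs))
    return Subgraph(
        nodes=prev.nodes + nxt.nodes,
        inputs=list(prev.inputs),
        outputs=list(nxt.outputs),
        internal_edges=prev.internal_edges + nxt.internal_edges + new_edges,
    )


# Mirrors the example: subgraph 1 holds nodes 1-3, subgraph 2 holds nodes 4-5.
sub1 = Subgraph(nodes=[1, 2, 3], inputs=["in0"], outputs=["n3.out"])
sub2 = Subgraph(nodes=[4, 5], inputs=["n4.in"], outputs=["n5.out"])
candidate = fuse(sub1, sub2)
print(candidate.nodes)  # [1, 2, 3, 4, 5]
```

The former boundary between the two subgraphs becomes an internal edge of the candidate, so only the inputs of the first subgraph and the outputs of the last remain visible from outside.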
[0083] After fusing each set of computational subgraphs to obtain candidate computational subgraphs, a corresponding executable file can be generated based on the candidate computational subgraphs obtained through fusion processing. In step S53, if no failure error occurs when generating the executable file corresponding to the candidate computational subgraphs obtained through fusion processing, it means that the candidate computational subgraphs obtained through fusion processing can still be supported by the corresponding chip and can run normally on the corresponding chip. That is, the corresponding chip can support at least two computational subgraphs of a set of computational subgraphs to be compiled and executed on the chip. Therefore, the candidate computational subgraphs can be used as the target computational subgraphs.
[0084] In some embodiments, for any set of computational subgraphs that includes at least two computational subgraphs connected in series, if a failure error occurs when generating the executable file corresponding to the candidate computational subgraph obtained by the fusion process, it means that the corresponding chip cannot support compiling and executing the at least two computational subgraphs of that set together on the chip. Therefore, each of the at least two computational subgraphs in the set, as they were before the fusion process, can be used as a target computational subgraph.
[0085] For example, suppose that the set of computation subgraphs obtained in step S51 includes computation subgraph A and computation subgraph B, and after fusion processing, a candidate computation subgraph C is obtained. If a failure occurs when generating the executable file corresponding to the candidate computation subgraph C obtained by fusion processing, then computation subgraph A before fusion processing can be used as a target computation subgraph, and computation subgraph B before fusion processing can also be used as a target computation subgraph.
[0086] In some embodiments, for any set of computational subgraphs, if the set of computational subgraphs includes multiple (e.g., 3 or 4) computational subgraphs connected in series, and a failure occurs when generating the executable file corresponding to the candidate computational subgraph obtained by the fusion process, the set of computational subgraphs can be further divided into one or more sets of computational subgraphs, and the operations of steps S52 and S53 as described above can be performed. The computational subgraphs that are not divided into groups can be used as target computational subgraphs respectively.
[0087] For example, suppose the set of computational subgraphs obtained in step S51 includes computational subgraphs D, E, and F. After fusion processing, a candidate computational subgraph G is obtained. If a failure occurs when generating the executable file corresponding to the candidate computational subgraph G obtained by fusion processing, computational subgraphs D and E can be regrouped into a set of computational subgraphs, and fusion processing can be performed on computational subgraphs D and E again. The operation in step S53 is performed on the fused subgraph. However, computational subgraph F is not grouped and is treated as a separate target computational subgraph.
[0088] Furthermore, the description of step S54 can be found in the above description of step S42, and will not be repeated here.
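The handling of one set of computation subgraphs in steps S52 and S53, including the fallback on failure, can be sketched as below. The sketch rests on several assumptions that are not in the disclosure: a subgraph is modeled as a tuple of names, `try_compile` is a hypothetical callable that returns `True` when executable generation succeeds, and the regrouping rule (retry without the last subgraph) is just one option consistent with the D, E, F example.

```python
from typing import Callable, List, Sequence, Tuple

Subgraph = Tuple[str, ...]  # hypothetical model: a subgraph is a tuple of member names


def merge(group: Sequence[Subgraph]) -> Subgraph:
    """Fuse serially connected subgraphs by concatenating them in execution order."""
    return tuple(name for sub in group for name in sub)


def resolve_group(group: List[Subgraph],
                  try_compile: Callable[[Subgraph], bool]) -> List[Subgraph]:
    """Return the target subgraphs for one set (steps S52 and S53 with fallback)."""
    if len(group) < 2:
        return list(group)
    candidate = merge(group)
    if try_compile(candidate):
        return [candidate]   # S53: the candidate becomes the target
    if len(group) == 2:
        return list(group)   # each original subgraph is its own target
    # Assumed regrouping rule: retry without the last subgraph, which stays separate.
    return resolve_group(group[:-1], try_compile) + [group[-1]]


def fits_on_chip(sub: Subgraph) -> bool:
    """Stand-in for executable generation: at most two members fit on the chip."""
    return len(sub) <= 2


targets = resolve_group([("D",), ("E",), ("F",)], fits_on_chip)
print(targets)  # [('D', 'E'), ('F',)]
```

In the example run, fusing D, E, and F fails, the set is regrouped as D and E, that fusion succeeds, and F remains a separate target.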
[0089] In some embodiments, obtaining at least one set of computational subgraphs in step S51 may further include: determining at least one set of computational subgraphs that can be fused among all computational subgraphs based on the subgraph attribute parameters corresponding to each computational subgraph; wherein the subgraph attribute parameters corresponding to the computational subgraph may include the computational amount, weight information, and the number of nodes in the corresponding vector acceleration unit graph of the computational subgraph.
[0090] The computational cost corresponding to the computational subgraph can be the sum of the computational costs required by all operator nodes contained in the computational subgraph. The weight information corresponding to the computational subgraph can include the sum of the weights required by all operator nodes contained in the computational subgraph. Before splitting the neural network computational graph, the neural network computational graph is pre-mapped to specific chip hardware to obtain the vector acceleration unit (APU) graph corresponding to the neural network computational graph. The vector acceleration unit (APU) graph corresponding to the neural network computational graph represents the mapping relationship of the operator nodes of the neural network computational graph on specific chip hardware (such as many-core chips, physical cores on the chip). Correspondingly, the vector acceleration unit (APU) graph corresponding to the computational subgraph split from the neural network computational graph represents the mapping relationship of the operator nodes of the computational subgraph on specific chip hardware.
[0091] In some embodiments, the step of determining at least one set of computational subgraphs that can be fused based on the subgraph attribute parameters corresponding to each computational subgraph may further include: sequentially checking whether at least two serially connected computational subgraphs meet the fusion condition according to the serial connection relationship and execution order of all computational subgraphs; when it is determined that at least two serially connected computational subgraphs meet the fusion condition, the at least two computational subgraphs are regarded as a set of computational subgraphs.
[0092] The fusion conditions may include: the sum of the computational amounts corresponding to the at least two computational subgraphs is greater than or equal to the minimum computational amount threshold and less than or equal to the maximum computational amount threshold; the sum of the weight information corresponding to the at least two computational subgraphs is greater than or equal to the minimum weight threshold and less than or equal to the maximum weight threshold; and the sum of the number of nodes in the APU graphs corresponding to the at least two computational subgraphs is greater than or equal to the minimum number threshold and less than or equal to the maximum number threshold.
[0093] The minimum and maximum computational cost thresholds can be configured according to actual needs. Similarly, the minimum and maximum weight thresholds, as well as the minimum and maximum number thresholds, can also be configured according to actual needs.
[0094] For example, suppose that all the current computational subgraphs include computational subgraphs H, I, J, K and L connected in sequence. When it is determined that the serially connected computational subgraphs H and I satisfy the above fusion condition, then the serially connected computational subgraphs H and I are regarded as a group of computational subgraphs. When it is determined that the serially connected computational subgraphs J, K and L satisfy the above fusion condition, then the serially connected computational subgraphs J, K and L are regarded as a group of computational subgraphs.
[0095] In some embodiments, by judging at least two computational subgraphs that can be fused through the above-mentioned fusion conditions, the efficiency of obtaining at least one set of computational subgraphs that can be fused can be improved, the compilation effect of the fused subgraph can be improved, and the probability of failure and error in generating the fused subgraph can be reduced.
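As an illustration of the fusion condition, the following sketch checks the three threshold ranges and scans the subgraphs in execution order. The class names, field names, and numeric values are hypothetical. Choosing the longest run that satisfies the condition is also an assumption of this sketch; the disclosure only requires that a set satisfy the condition.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SubgraphStats:
    name: str
    compute: int     # total computation of all operator nodes
    weights: int     # total weight size of all operator nodes
    apu_nodes: int   # node count in the corresponding APU graph


@dataclass(frozen=True)
class Thresholds:
    min_compute: int
    max_compute: int
    min_weights: int
    max_weights: int
    min_apu_nodes: int
    max_apu_nodes: int


def meets_fusion_condition(group: Sequence[SubgraphStats], t: Thresholds) -> bool:
    """All three sums must lie inside their configured [min, max] range."""
    compute = sum(s.compute for s in group)
    weights = sum(s.weights for s in group)
    apu_nodes = sum(s.apu_nodes for s in group)
    return (t.min_compute <= compute <= t.max_compute
            and t.min_weights <= weights <= t.max_weights
            and t.min_apu_nodes <= apu_nodes <= t.max_apu_nodes)


def find_groups(subgraphs: Sequence[SubgraphStats], t: Thresholds) -> List[List[str]]:
    """Scan in execution order and keep the longest run that meets the condition."""
    groups: List[List[str]] = []
    i = 0
    while i < len(subgraphs) - 1:
        best_end = None
        for end in range(i + 2, len(subgraphs) + 1):
            if meets_fusion_condition(subgraphs[i:end], t):
                best_end = end
        if best_end is None:
            i += 1   # this subgraph starts no fusable run
        else:
            groups.append([s.name for s in subgraphs[i:best_end]])
            i = best_end
    return groups


stats = [
    SubgraphStats("H", 8, 10, 1),
    SubgraphStats("I", 8, 10, 1),
    SubgraphStats("J", 5, 10, 1),
    SubgraphStats("K", 5, 10, 1),
    SubgraphStats("L", 5, 10, 1),
]
limits = Thresholds(min_compute=10, max_compute=20,
                    min_weights=0, max_weights=100,
                    min_apu_nodes=2, max_apu_nodes=6)
groups = find_groups(stats, limits)
print(groups)  # [['H', 'I'], ['J', 'K', 'L']]
```

With these illustrative numbers, H and I form one set because adding J would exceed the maximum computation threshold, and J, K, and L form the other set.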
[0096] In some embodiments, obtaining at least one set of computation subgraphs in step S51 may further include: taking each pair of serially connected computation subgraphs in all computation subgraphs as a set of computation subgraphs according to the serial connection relationship and execution order of all computation subgraphs.
[0097] For example, assuming that all current computational subgraphs include computational subgraphs H, I, J, and K connected in sequence, then the serially connected computational subgraphs H and I are considered as one set of computational subgraphs, and the serially connected computational subgraphs J and K are considered as another set of computational subgraphs.
[0098] In some embodiments, by directly treating every two serially connected computation subgraphs in all computation subgraphs as a group of computation subgraphs, the efficiency of obtaining at least one group of computation subgraphs can be effectively improved.
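A minimal sketch of this pairwise grouping follows. The function name is hypothetical, and leaving a trailing unpaired subgraph on its own is an assumption, since the disclosure only gives an example with an even number of subgraphs.

```python
from typing import List, Tuple


def pair_up(subgraphs: List[str]) -> Tuple[List[List[str]], List[str]]:
    """Group subgraphs two by two in execution order; an odd trailing one stays alone."""
    pairs = [subgraphs[i:i + 2] for i in range(0, len(subgraphs) - 1, 2)]
    leftover = subgraphs[len(pairs) * 2:]
    return pairs, leftover


pairs, leftover = pair_up(["H", "I", "J", "K"])
print(pairs)     # [['H', 'I'], ['J', 'K']]
print(leftover)  # []
```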
[0099] Figure 6 is a flowchart of another method for processing neural network computation graphs provided in this disclosure. Referring to Figure 6, in some embodiments, after the executable file corresponding to each computation subgraph is generated in sequence, that is, after step S13, the processing method may further include steps S61 to S64.
[0100] Step S61: In response to the absence of any failures or errors when generating the executable file corresponding to each computation subgraph, merge the current computation subgraph and the next computation subgraph into a candidate computation subgraph according to the execution order of all computation subgraphs.
[0101] Step S62: If a failure error occurs when generating the executable file corresponding to the candidate computation subgraph, use the current computation subgraph as the target computation subgraph.
[0102] Step S63: Take the next computation subgraph as the current computation subgraph, and return to execute the step of merging the current computation subgraph and the next computation subgraph into a candidate computation subgraph.
[0103] Step S64: Load the executable file corresponding to each target computation subgraph into the corresponding chip.
[0104] In step S61, when no failure error occurs while generating the executable file corresponding to each computation subgraph, in order to make reasonable use of chip hardware resources, improve their utilization efficiency, and achieve load balancing, the current computation subgraph and the next computation subgraph are first fused, following the execution order of all serially connected computation subgraphs (e.g., from top to bottom in Figure 2), to obtain a candidate computation subgraph. The current computation subgraph and the next computation subgraph are two adjacent, serially connected computation subgraphs in the neural network computation graph. The next computation subgraph is the one that follows the current computation subgraph along the execution order, that is, the computation subgraph connected to the output of the current computation subgraph.
[0105] For example, referring to the neural network computation graph shown in Figure 2, assume that the graph is currently split into computation subgraph 1, computation subgraph 2, computation subgraph 3, and computation subgraph 4. According to the execution order, the current computation subgraph is computation subgraph 1. In step S61, computation subgraph 1 and computation subgraph 2 are first fused to obtain a candidate computation subgraph.
[0106] After obtaining the candidate computation subgraph, the corresponding executable file can be generated based on the candidate computation subgraph obtained by the fusion process. In step S62, if a failure occurs when generating the executable file corresponding to the candidate computation subgraph, it means that the candidate computation subgraph obtained by the fusion process cannot be supported by the corresponding chip and cannot run normally on the corresponding chip. Therefore, the current computation subgraph is used as the target computation subgraph and step S63 is performed to continue the fusion and judgment according to the execution order.
[0107] In step S63, the next computational subgraph is taken as the current computational subgraph, and the process returns to the step of merging the current computational subgraph and the next computational subgraph into a candidate computational subgraph. That is, the process returns to step S61 to continue to merge and judge the current computational subgraph and the next computational subgraph of the current computational subgraph according to the execution order, until all computational subgraphs have been traversed.
[0108] For example, if a failure error occurs when generating the executable file corresponding to the candidate computation subgraph obtained by fusing computation subgraph 1 and computation subgraph 2, this indicates that computation subgraph 1 and computation subgraph 2 are not suitable for fusion. Computation subgraph 1 is therefore taken as a target computation subgraph, computation subgraph 2 becomes the current computation subgraph, and fusion and judgment continue with computation subgraph 2 and computation subgraph 3, and so on, until all computation subgraphs have been traversed.
[0109] Furthermore, the description of step S64 can be found in the above description of step S42, and will not be repeated here.
[0110] Figure 7 is a flowchart of another method for processing neural network computation graphs provided in this disclosure. Referring to Figure 7, in some embodiments, after the executable file corresponding to each computation subgraph is generated in sequence, that is, after step S13, the processing method may further include steps S71 to S74.
[0111] Step S71: In response to the absence of any failures or errors when generating the executable file corresponding to each computation subgraph, merge the current computation subgraph and the next computation subgraph into a candidate computation subgraph according to the execution order of all computation subgraphs.
[0112] Step S72: If there are no failures or errors when generating the executable file corresponding to the candidate computation subgraph, use the candidate computation subgraph as the target computation subgraph.
[0113] Step S73: Take the computation subgraph whose execution order is after the next computation subgraph as the current computation subgraph, and return to the step of merging the current computation subgraph and the next computation subgraph into a candidate computation subgraph.
[0114] Step S74: Load the executable file corresponding to each target computation subgraph into the corresponding chip.
[0115] For a description of step S71, please refer to the description of step S61 above; it will not be repeated here.
[0116] After obtaining the candidate computation subgraph, the corresponding executable file can be generated based on the candidate computation subgraph obtained by the fusion process. In step S72, if no failure error occurs when generating the executable file corresponding to the candidate computation subgraph, it means that the candidate computation subgraph obtained by the fusion process can still be supported by the corresponding chip and can run normally on the corresponding chip. Therefore, the candidate computation subgraph is used as the target computation subgraph and step S73 is performed to continue the fusion and judgment according to the execution order.
[0117] In step S73, the computational subgraph that is located after the next computational subgraph along the execution order direction and is adjacent to the next computational subgraph is taken as the current computational subgraph, and the execution returns to the step of merging the current computational subgraph and the next computational subgraph into a candidate computational subgraph, that is, returning to the execution step S71, so as to continue to merge and judge the current computational subgraph and the next computational subgraph of the current computational subgraph according to the execution order direction, until all computational subgraphs have been traversed.
[0118] For example, suppose all current computational subgraphs include computational subgraphs H, I, J, and K connected in sequence. The current computational subgraph is computational subgraph H, and the next computational subgraph is computational subgraph I. If no failure error occurs when generating the executable file corresponding to the candidate computational subgraph by fusing the above computational subgraphs H and I, it means that computational subgraphs H and I are suitable for further fusion. Therefore, the candidate computational subgraph obtained after fusing computational subgraphs H and I is taken as the target computational subgraph, and computational subgraph J is taken as the current computational subgraph. The fusion and judgment of computational subgraphs J and K are continued, and so on, until all computational subgraphs have been traversed.
[0119] Furthermore, the description of step S74 can be found in the above description of step S42, and will not be repeated here.
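The failure branch of steps S61 to S63 and the success branch of steps S71 to S73 can be combined into a single pass over the subgraphs, as sketched below. The tuple model of a subgraph, the `try_compile` callable, and the cost table used as a stand-in for executable generation are all hypothetical.

```python
from typing import Callable, List, Tuple

Subgraph = Tuple[str, ...]  # hypothetical model: a subgraph is a tuple of member names


def merge_in_order(subgraphs: List[Subgraph],
                   try_compile: Callable[[Subgraph], bool]) -> List[Subgraph]:
    """Walk the subgraphs in execution order, fusing each one with its successor."""
    targets: List[Subgraph] = []
    i = 0
    while i < len(subgraphs):
        if i + 1 == len(subgraphs):
            targets.append(subgraphs[i])   # last subgraph has no successor
            break
        candidate = subgraphs[i] + subgraphs[i + 1]
        if try_compile(candidate):
            targets.append(candidate)      # S72: candidate becomes a target
            i += 2                         # S73: resume after the next subgraph
        else:
            targets.append(subgraphs[i])   # S62: current subgraph becomes a target
            i += 1                         # S63: next subgraph becomes current
    return targets


# Stand-in for executable generation: a candidate fits if its total cost is at most 8.
COST = {"H": 3, "I": 3, "J": 5, "K": 4}


def fits_on_chip(sub: Subgraph) -> bool:
    return sum(COST[name] for name in sub) <= 8


targets = merge_in_order([("H",), ("I",), ("J",), ("K",)], fits_on_chip)
print(targets)  # [('H', 'I'), ('J',), ('K',)]
```

Here H and I fuse successfully, the fusion of J and K exceeds the assumed budget, so J and K each remain a separate target.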
[0120] Figure 8 is a flowchart of another method for processing neural network computation graphs provided in this disclosure. Referring to Figure 8, in some embodiments, after the executable file corresponding to each computation subgraph is generated in sequence, that is, after step S13, the processing method may further include steps S81 to S84.
[0121] Step S81: In response to the absence of any failures or errors when generating the executable file corresponding to each computation subgraph, the adjacent two computation subgraphs are merged sequentially according to the execution order of all computation subgraphs.
[0122] Step S82: Simultaneously, in the opposite direction to the execution order, merge two adjacent computation subgraphs in sequence.
[0123] Step S83: For any computational subgraph obtained by fusion processing, if there is no failure error when generating the executable file corresponding to the computational subgraph obtained by fusion processing, the computational subgraph obtained by fusion processing shall be used as the target computational subgraph.
[0124] Step S84: Load the executable file corresponding to each target computation subgraph into the corresponding chip.
[0125] In step S81, when no failure error occurs while generating the executable file corresponding to each computational subgraph, in order to make reasonable use of chip hardware resources, improve their utilization efficiency, and achieve load balancing, adjacent computational subgraphs are fused in sequence, following the execution order of all serially connected computational subgraphs (e.g., from top to bottom in Figure 2), to obtain candidate computational subgraphs. After each candidate computational subgraph is obtained, a corresponding executable file can be generated based on it.
[0126] Meanwhile, in step S82, adjacent computational subgraphs are fused in sequence in the direction opposite to the execution order (e.g., from bottom to top in Figure 2) to obtain candidate computational subgraphs. After each candidate computational subgraph is obtained, a corresponding executable file can be generated based on it.
[0127] For example, suppose that all current computational subgraphs include computational subgraphs H, I, J, K, L, and M connected sequentially, with the execution order direction from computational subgraph H to computational subgraph M, and the opposite direction from computational subgraph M to computational subgraph H. Then, in step S81, according to the execution order direction, computational subgraphs H and I are first merged; simultaneously, in step S82, according to the opposite direction from the execution order direction, computational subgraphs M and L are first merged.
[0128] After obtaining each candidate computational subgraph, a corresponding executable file can be generated based on the candidate computational subgraphs obtained through fusion processing. In step S83, for a computational subgraph obtained through fusion processing in any direction, if no error occurs when generating the executable file corresponding to the candidate computational subgraph obtained through fusion processing, it indicates that the computational subgraph obtained through fusion processing can still be supported by the corresponding chip and can run normally on the corresponding chip. Therefore, the computational subgraph obtained through fusion processing is used as the target computational subgraph, and the fusion processing and judgment of two adjacent computational subgraphs are continued simultaneously according to different directions. For example, the fusion processing and judgment of computational subgraphs J and K are continued, and so on, until all computational subgraphs are traversed.
[0129] In some embodiments, if a failure occurs when generating the executable file corresponding to the computational subgraph obtained by fusion processing along any direction, it indicates that the two adjacent computational subgraphs before fusion processing are not suitable for fusion. Therefore, the processing method may further include:
[0130] When the computational subgraph obtained by the fusion process is a computational subgraph obtained by fusion processing in the direction of execution order, the computational subgraph with the earlier execution order among the two adjacent computational subgraphs before the fusion process is taken as the target computational subgraph, and the fusion process is continued in the direction of execution order for the computational subgraph with the later execution order and the next computational subgraph adjacent to the computational subgraph with the later execution order.
[0131] When the computational subgraph obtained by the fusion process is a computational subgraph obtained by fusion processing in the direction opposite to the execution order, the computational subgraph with the later execution order among the two adjacent computational subgraphs before the fusion process is taken as the target computational subgraph, and the fusion process is continued in the direction opposite to the execution order for the computational subgraph with the earlier execution order and the preceding computational subgraph adjacent to the computational subgraph with the earlier execution order.
[0132] Furthermore, the description of step S84 can be found in the above description of step S42, and will not be repeated here.
[0133] In some embodiments, by simultaneously attempting to merge and judge adjacent computational subgraphs along different directions in sequence, the efficiency of attempting to merge and judge can be effectively improved.
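The two-direction procedure of steps S81 to S83, together with the failure handling in each direction, can be sketched as follows. For determinism the sketch alternates between the forward and backward fronts instead of running them truly simultaneously. The tuple model, the `try_compile` callable, and the cost table are hypothetical.

```python
from typing import Callable, List, Tuple

Subgraph = Tuple[str, ...]  # hypothetical model: a subgraph is a tuple of member names


def merge_from_both_ends(subgraphs: List[Subgraph],
                         try_compile: Callable[[Subgraph], bool]) -> List[Subgraph]:
    """Fuse adjacent pairs from the front and from the back, taking turns."""
    head: List[Subgraph] = []   # targets found along the execution order
    tail: List[Subgraph] = []   # targets found against the execution order
    lo, hi = 0, len(subgraphs) - 1
    forward_turn = True
    while hi - lo >= 1:         # at least two unprocessed subgraphs remain
        if forward_turn:
            candidate = subgraphs[lo] + subgraphs[lo + 1]
            if try_compile(candidate):
                head.append(candidate)
                lo += 2
            else:
                head.append(subgraphs[lo])   # earlier subgraph becomes a target
                lo += 1
        else:
            candidate = subgraphs[hi - 1] + subgraphs[hi]
            if try_compile(candidate):
                tail.append(candidate)
                hi -= 2
            else:
                tail.append(subgraphs[hi])   # later subgraph becomes a target
                hi -= 1
        forward_turn = not forward_turn
    if lo == hi:
        head.append(subgraphs[lo])           # one subgraph left between the fronts
    return head + tail[::-1]


# Stand-in for executable generation: a candidate fits if its total cost is at most 8.
COST = {"H": 3, "I": 3, "J": 5, "K": 4, "L": 2, "M": 2}


def fits_on_chip(sub: Subgraph) -> bool:
    return sum(COST[name] for name in sub) <= 8


chain = [("H",), ("I",), ("J",), ("K",), ("L",), ("M",)]
targets = merge_from_both_ends(chain, fits_on_chip)
print(targets)  # [('H', 'I'), ('J',), ('K',), ('L', 'M')]
```

In the example, H and I fuse from the front, L and M fuse from the back, and the attempt on J and K fails, so both stay as separate targets. A real implementation could run the two fronts concurrently, which is where the efficiency gain described above would come from.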
[0134] In some embodiments, after loading the executable file corresponding to the target computation subgraph into the corresponding chip, the processing method further includes: in response to the executable file corresponding to the target computation subgraph completing its execution on the corresponding chip, storing the output result corresponding to the target computation subgraph into an external memory or the host corresponding to the chip.
[0135] By storing the output results corresponding to the target computation subgraph in external memory or the host corresponding to the chip, and then reading them back to the chip from external memory (such as double data rate synchronous dynamic random access memory, DDR SDRAM) or the host when needed, the on-chip storage resources of the chip can be effectively saved.
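The store-and-read-back pattern can be illustrated with the short sketch below. The `run_on_chip` callable and the dictionary standing in for external memory or the host are hypothetical; the disclosure does not define such an interface.

```python
from typing import Any, Callable, Dict, List


def run_serially(targets: List[str],
                 run_on_chip: Callable[[str, Any], Any],
                 first_input: Any,
                 external_store: Dict[str, Any]) -> Any:
    """Run target subgraphs in order, keeping every intermediate result off chip."""
    data = first_input
    for name in targets:
        result = run_on_chip(name, data)   # executable finishes on the chip
        external_store[name] = result      # output goes to external memory or host
        data = external_store[name]        # read back when the next subgraph needs it
    return data


store: Dict[str, Any] = {}
# Toy stand-in for chip execution: add the length of the subgraph name to the input.
final = run_serially(["HI", "J"], lambda name, x: x + len(name), 0, store)
print(final, store)  # 3 {'HI': 2, 'J': 3}
```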
[0136] In some embodiments, the chip described above can be any one of the many-core chips in a many-core system. The many-core system may include one or more many-core chips. The many-core chip is a chip based on a memory-computing many-core architecture. Each many-core chip may include multiple physical cores (also called computing cores), and each physical core has independent memory.
[0137] In some embodiments, for multiple target computation subgraphs obtained through the above processing and connected in series, the chips corresponding to the multiple target computation subgraphs can be the same many-core chip, or they can correspond to different many-core chips respectively, or some target computation subgraphs correspond to different many-core chips respectively, while some target computation subgraphs correspond to the same many-core chip. The specific details can be determined according to actual needs, and this disclosure does not impose specific limitations.
[0138] It is understood that the various method embodiments mentioned above in this disclosure can be combined with each other to form combined embodiments without violating the principle and logic. Due to space limitations, this disclosure will not elaborate further. Those skilled in the art will understand that in the above methods of specific implementation, the specific execution order of each step should be determined by its function and possible internal logic.
[0139] In addition, this disclosure also provides a processing apparatus, an electronic device, and a computer-readable storage medium. The processing apparatus is used to implement the neural network computation graph processing method provided by this disclosure. The electronic device and the computer-readable storage medium can both be used to implement the neural network computation graph processing method provided by this disclosure. For the corresponding technical solutions and descriptions, refer to the method section; they will not be repeated here.
[0140] Figure 9 is a block diagram of a processing apparatus provided in an embodiment of the present disclosure. Referring to Figure 9, this disclosure provides a processing device 90 for processing a neural network computation graph to be processed. The neural network computation graph includes multiple operator nodes. The processing device 90 includes a determination module 91, a first splitting module 92, and a generation module 93.
[0141] The determining module 91 is used to determine all target operator nodes in the neural network computation graph based on the output connection relationship of multiple operator nodes. The target operator node is one that has a first output terminal and a second output terminal, and the operator node connected to the first output terminal is a data output node.
[0142] The first splitting module 92 is used to split the neural network computation graph into multiple serially connected computation subgraphs, with the second output end of the target operator node as the splitting point. The output ends of the operator nodes connected to the second output end are connected to other operator nodes.
[0143] The generation module 93 is used to generate an executable file corresponding to each computational subgraph in sequence based on each computational subgraph.
[0144] In some embodiments, the processing apparatus 90 may further include a loading module (not shown) and a storage module (not shown), wherein the loading module is used to load the executable file corresponding to each computational subgraph to the corresponding chip; and the storage module is used to store the output result corresponding to the computational subgraph to an external memory or the host corresponding to the chip in response to the executable file corresponding to the computational subgraph completing its execution on the chip.
[0145] In some embodiments, the processing apparatus 90 may further include a second splitting module (not shown in the figure); the generation module 93 is configured to: in response to a failure error occurring when generating the executable file corresponding to the current computational subgraph, trigger the second splitting module to further split the current computational subgraph into a plurality of serially connected computational subgraphs; and sequentially generate an executable file corresponding to each of the further split computational subgraphs according to each of the further split computational subgraphs.
[0146] In some embodiments, the generation module 93 is configured to: in response to no failure error when generating the executable file corresponding to the current computation subgraph, use the current computation subgraph as the target computation subgraph.
[0147] In some embodiments, the processing apparatus 90 may further include a loading module (not shown in the figure), which is used to load the executable file corresponding to each target computation subgraph to the corresponding chip.
[0148] In some embodiments, the processing apparatus 90 may further include an acquisition module (not shown in the figure), a first fusion module (not shown in the figure), and a judgment module (not shown in the figure). The acquisition module is configured to acquire at least one set of computational subgraphs in response to no failure errors reported when the generation module 93 generates executable files corresponding to each computational subgraph. Each set of computational subgraphs includes at least two computational subgraphs connected in series. The first fusion module is configured to perform fusion processing on at least two computational subgraphs in each set of computational subgraphs to obtain candidate computational subgraphs corresponding to each set of computational subgraphs. The judgment module is configured to, if no failure errors are reported when the generation module 93 generates executable files corresponding to candidate computational subgraphs, select the candidate computational subgraph as the target computational subgraph.
[0149] In some embodiments, the acquisition module is used to determine at least one set of computational subgraphs that can be fused among all computational subgraphs based on the subgraph attribute parameters corresponding to each computational subgraph; wherein, the subgraph attribute parameters corresponding to the computational subgraph include the computational amount, weight information, and the number of nodes in the corresponding vector acceleration unit (APU) graph of the computational subgraph.
[0150] In some embodiments, the acquisition module is configured to: sequentially check whether at least two serially connected computational subgraphs meet the fusion conditions according to the serial connection relationship and execution order of all computational subgraphs; when it is determined that at least two serially connected computational subgraphs meet the fusion conditions, treat the at least two computational subgraphs as a group of computational subgraphs; wherein, the fusion conditions include: the sum of the computational amounts corresponding to the at least two computational subgraphs is greater than or equal to the minimum computational amount threshold and less than or equal to the maximum computational amount threshold; the sum of the weight information corresponding to the at least two computational subgraphs is greater than or equal to the minimum weight threshold and less than or equal to the maximum weight threshold; the sum of the number of nodes in the APU graphs corresponding to the at least two computational subgraphs is greater than or equal to the minimum number threshold and less than or equal to the maximum number threshold.
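The fusion conditions amount to three range checks on summed attributes. The sketch below assumes integer attribute values and hypothetical names (`SubgraphAttrs`, `Limits`, `meets_fusion_conditions`, `find_fusable_pairs`); scanning non-overlapping adjacent pairs is one possible reading of checking the subgraphs in sequence, not the only one.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class SubgraphAttrs:
    compute: int    # computational amount
    weights: int    # weight information (e.g. size of the weights)
    apu_nodes: int  # number of nodes in the corresponding APU graph


@dataclass(frozen=True)
class Limits:
    compute: Tuple[int, int]    # (minimum, maximum) thresholds
    weights: Tuple[int, int]
    apu_nodes: Tuple[int, int]


def meets_fusion_conditions(group: Sequence[SubgraphAttrs], limits: Limits) -> bool:
    """True if all three attribute sums lie within their [min, max] thresholds."""
    if len(group) < 2:
        return False
    total_compute = sum(s.compute for s in group)
    total_weights = sum(s.weights for s in group)
    total_nodes = sum(s.apu_nodes for s in group)
    return (limits.compute[0] <= total_compute <= limits.compute[1]
            and limits.weights[0] <= total_weights <= limits.weights[1]
            and limits.apu_nodes[0] <= total_nodes <= limits.apu_nodes[1])


def find_fusable_pairs(subgraphs: Sequence[SubgraphAttrs],
                       limits: Limits) -> List[Tuple[int, int]]:
    """Scan in execution order and group adjacent pairs that meet the conditions."""
    groups: List[Tuple[int, int]] = []
    i = 0
    while i + 1 < len(subgraphs):
        if meets_fusion_conditions(subgraphs[i:i + 2], limits):
            groups.append((i, i + 1))
            i += 2  # assumption: a subgraph belongs to at most one group
        else:
            i += 1
    return groups


if __name__ == "__main__":
    sgs = [SubgraphAttrs(10, 5, 3), SubgraphAttrs(20, 5, 4),
           SubgraphAttrs(100, 50, 40), SubgraphAttrs(5, 2, 1)]
    lim = Limits(compute=(10, 60), weights=(1, 20), apu_nodes=(2, 10))
    print(find_fusable_pairs(sgs, lim))
```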
[0151] In some embodiments, the acquisition module is used to treat every two serially connected computation subgraphs in all computation subgraphs as a group of computation subgraphs according to the serial connection relationship and execution order of all computation subgraphs.
[0152] In some embodiments, the processing apparatus 90 may further include a second fusion module (not shown in the figure), the second fusion module being configured to: in response to no failure errors when the generation module generates executable files corresponding to each computational subgraph, merge the current computational subgraph and the next computational subgraph into a candidate computational subgraph according to the execution order of all computational subgraphs; if a failure error occurs when the generation module 93 generates executable files corresponding to the candidate computational subgraphs, use the current computational subgraph as the target computational subgraph, and use the next computational subgraph as the current computational subgraph, and return to execute the step of merging the current computational subgraph and the next computational subgraph into a candidate computational subgraph; if no failure error occurs when the generation module 93 generates executable files corresponding to the candidate computational subgraphs, use the candidate computational subgraph as the target computational subgraph, and use the computational subgraph whose execution order is after the next computational subgraph as the current computational subgraph, and return to execute the step of merging the current computational subgraph and the next computational subgraph into a candidate computational subgraph.
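The loop performed by the second fusion module can be sketched as below. Subgraphs are modeled as lists of node names and `can_compile` is a hypothetical predicate standing in for a trial run of the generation module 93. Keeping a final subgraph that has no successor as its own target is an assumption, since the paragraph does not address that case.

```python
from typing import Callable, List


def fuse_forward(subgraphs: List[List[str]],
                 can_compile: Callable[[List[str]], bool]) -> List[List[str]]:
    """Greedily try to fuse each current subgraph with the next one."""
    targets: List[List[str]] = []
    i = 0
    while i < len(subgraphs):
        if i + 1 == len(subgraphs):
            targets.append(subgraphs[i])  # no next subgraph left (assumption)
            break
        candidate = subgraphs[i] + subgraphs[i + 1]
        if can_compile(candidate):
            targets.append(candidate)     # fused candidate becomes a target
            i += 2                        # continue after the next subgraph
        else:
            targets.append(subgraphs[i])  # current subgraph becomes a target
            i += 1                        # next subgraph becomes current
    return targets


if __name__ == "__main__":
    # Toy rule: an executable can be generated for at most three nodes.
    fits = lambda nodes: len(nodes) <= 3
    print(fuse_forward([["a", "b"], ["c", "d"], ["e"]], fits))
```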
[0153] In some embodiments, the processing apparatus 90 may further include a third fusion module (not shown in the figure). The third fusion module is configured to: in response to no failure errors reported when the generation module 93 generates the executable file corresponding to each computational subgraph, sequentially fuse two adjacent computational subgraphs according to the execution order of all computational subgraphs and, at the same time, sequentially fuse two adjacent computational subgraphs in the direction opposite to the execution order. For any computational subgraph obtained through fusion: if no failure error is reported when generating its executable file, the fused computational subgraph is used as the target computational subgraph; if a failure error is reported and the fused computational subgraph was obtained by fusing in the direction of the execution order, the earlier of the two adjacent computational subgraphs before fusion is used as the target computational subgraph; if a failure error is reported and the fused computational subgraph was obtained by fusing in the direction opposite to the execution order, the later of the two adjacent computational subgraphs before fusion is used as the target computational subgraph.
[0154] In some embodiments, the processing apparatus 90 further includes a storage module (not shown in the figure); the storage module is used to store the output result corresponding to the target computation subgraph to an external memory or the host corresponding to the chip in response to the executable file corresponding to the target computation subgraph completing its execution on the chip.
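A sequential sketch of the two-direction fusion follows. The paragraph describes the two directions as proceeding simultaneously; here they alternate within one loop, and the handling of a single leftover subgraph where the two fronts meet is an assumption. `can_compile` is a hypothetical predicate standing in for a trial generation of the executable file.

```python
from typing import Callable, List


def fuse_bidirectional(subgraphs: List[List[str]],
                       can_compile: Callable[[List[str]], bool]) -> List[List[str]]:
    """Fuse adjacent pairs from the front and from the back toward the middle."""
    front: List[List[str]] = []  # targets found in execution order
    back: List[List[str]] = []   # targets found in reverse order
    lo, hi = 0, len(subgraphs) - 1
    while lo < hi:
        # Forward direction: on failure keep the earlier subgraph.
        candidate = subgraphs[lo] + subgraphs[lo + 1]
        if can_compile(candidate):
            front.append(candidate)
            lo += 2
        else:
            front.append(subgraphs[lo])
            lo += 1
        # Backward direction: needs at least two unprocessed subgraphs.
        if hi - 1 >= lo:
            candidate = subgraphs[hi - 1] + subgraphs[hi]
            if can_compile(candidate):
                back.append(candidate)
                hi -= 2
            else:
                back.append(subgraphs[hi])  # on failure keep the later one
                hi -= 1
    if lo == hi:
        front.append(subgraphs[lo])  # lone middle subgraph (assumption)
    return front + back[::-1]


if __name__ == "__main__":
    fits = lambda nodes: len(nodes) <= 2  # toy limit of two nodes
    print(fuse_bidirectional([["a"], ["b"], ["c"], ["d"], ["e"]], fits))
```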
[0155] The processing apparatus provided in this disclosure is used to implement the processing method provided in the above embodiments. For a detailed description, please refer to the relevant description in the processing method of the above embodiments, which will not be repeated here.
[0156] Figure 10 is a block diagram of an electronic device provided in an embodiment of the present disclosure. Referring to Figure 10, this disclosure provides an electronic device, which includes: at least one processor 101; at least one memory 102; and one or more I/O interfaces 103 connected between the processor 101 and the memory 102; wherein the memory 102 stores one or more computer programs that can be executed by the at least one processor 101, and the one or more computer programs are executed by the at least one processor 101 to enable the at least one processor 101 to perform the above-described neural network computation graph processing method.
[0157] This disclosure also provides a computer-readable storage medium storing a computer program thereon, wherein the computer program, when executed by a processor, implements the aforementioned neural network computation graph processing method. The computer-readable storage medium may be volatile or non-volatile.
[0158] This disclosure also provides a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code. When the computer-readable code is run in a processor of an electronic device, the processor in the electronic device executes the above-described neural network computation graph processing method.
[0159] Those skilled in the art will understand that all or some of the steps, systems, and apparatuses disclosed above, and their functional modules / units, can be implemented as software, firmware, hardware, or suitable combinations thereof. In hardware implementations, the division between functional modules / units mentioned above does not necessarily correspond to the division of physical components; for example, a physical component may have multiple functions, or a function or step may be performed collaboratively by several physical components. Some or all physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit (ASIC). Such software can be distributed on a computer-readable storage medium, which may include computer storage media (or non-transitory media) and communication media (or transient media).
[0160] As is known to those skilled in the art, the term computer storage medium includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable program instructions, data structures, program modules, or other data). Computer storage media includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), static random access memory (SRAM), flash memory or other memory technologies, portable compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical disc storage, magnetic cartridges, magnetic tape, disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and is accessible to a computer. Furthermore, it is known to those skilled in the art that communication media typically contain computer-readable program instructions, data structures, program modules, or other data in modulated data signals such as carrier waves or other transmission mechanisms, and may include any information delivery medium.
[0161] The computer-readable program instructions described herein can be downloaded from computer-readable storage media to various computing / processing devices, or downloaded via a network, such as the Internet, local area network, wide area network, and / or wireless network, to an external computer or external storage device. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and / or edge servers. A network adapter card or network interface in each computing / processing device receives the computer-readable program instructions from the network and forwards them to the computer-readable storage media in the respective computing / processing device.
[0162] Computer program instructions used to perform the operations of this disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, status setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk, C++, etc., and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer via any type of network—including a local area network (LAN) or a wide area network (WAN)—or may be connected to an external computer (e.g., via the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), is personalized by utilizing the status information of the computer-readable program instructions to implement various aspects of this disclosure.
[0163] The computer program product described herein can be implemented specifically through hardware, software, or a combination thereof. In one alternative embodiment, the computer program product is specifically embodied in a computer storage medium; in another alternative embodiment, the computer program product is specifically embodied in a software product, such as a software development kit (SDK), etc.
[0164] Various aspects of this disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this disclosure. It should be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
[0165] These computer-readable program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processor of the computer or other programmable data processing apparatus, they create means for implementing the functions / actions specified in one or more blocks of the flowchart and / or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner; thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions / actions specified in one or more blocks of the flowchart and / or block diagram.
[0166] Computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/actions specified in one or more blocks of the flowchart and/or block diagram.
[0167] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of an instruction containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than those shown in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.
[0168] Example embodiments have been disclosed herein, and while specific terminology has been used, it is for illustrative purposes only and should be construed as such, and is not intended to be limiting. In some instances, it will be apparent to those skilled in the art that features, characteristics, and / or elements described in connection with particular embodiments may be used alone, or in combination with features, characteristics, and / or elements described in connection with other embodiments, unless otherwise expressly indicated. Therefore, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of this disclosure as set forth by the appended claims.
Claims
1. A method for processing a neural network computation graph, characterized in that the neural network computation graph includes multiple operator nodes, and the processing method includes: determining all target operator nodes in the neural network computation graph based on the output connection relationship of the multiple operator nodes, wherein each target operator node has a first output terminal and a second output terminal, and the operator node connected to the first output terminal is a data output node; splitting the neural network computation graph into multiple serially connected computation subgraphs, using the second output terminal of the target operator node as the splitting point, wherein the output terminal of the operator node connected to the second output terminal is connected to other operator nodes; sequentially generating an executable file corresponding to each computation subgraph based on each computation subgraph; in response to no failure error being reported when generating the executable file corresponding to each computation subgraph, obtaining at least one set of computation subgraphs, each set of computation subgraphs including at least two serially connected computation subgraphs; fusing the at least two computation subgraphs in each set of computation subgraphs to obtain a candidate computation subgraph corresponding to each set of computation subgraphs; if no failure error is reported when generating the executable file corresponding to the candidate computation subgraph, using the candidate computation subgraph as a target computation subgraph; and loading the executable file corresponding to each target computation subgraph onto the corresponding chip to execute the corresponding computation task; wherein obtaining at least one set of computation subgraphs includes: determining, based on the subgraph attribute parameters corresponding to each computation subgraph, at least one set of computation subgraphs that can be fused among all the computation subgraphs, wherein the subgraph attribute parameters corresponding to a computation subgraph include the computational amount, the weight information, and the number of nodes in the vector acceleration unit (APU) graph corresponding to the computation subgraph; or treating every two serially connected computation subgraphs among all the computation subgraphs as a group of computation subgraphs according to the serial connection relationship and execution order of all the computation subgraphs.
2. The processing method according to claim 1, characterized in that, After generating the executable file corresponding to each computation subgraph according to each computation subgraph in sequence, the processing method further includes: Load the executable file corresponding to each computational subgraph into the corresponding chip; In response to the executable file corresponding to the computation subgraph completing its execution on the chip, the output result corresponding to the computation subgraph is stored in external memory or the host corresponding to the chip.
3. The processing method according to claim 1, characterized in that the step of sequentially generating an executable file corresponding to each computation subgraph includes: in response to a failure error being reported when generating the executable file corresponding to the current computation subgraph, further splitting the current computation subgraph into multiple serially connected computation subgraphs; and sequentially generating an executable file corresponding to each further-split computation subgraph based on each further-split computation subgraph.
4. The processing method according to claim 1, characterized in that, The step of determining at least one set of computational subgraphs that can be fused among all computational subgraphs based on the subgraph attribute parameters corresponding to each of the computational subgraphs includes: Based on the serial connection relationship and execution order of all computation subgraphs, check in turn whether at least two serially connected computation subgraphs meet the fusion condition; When it is determined that at least two computational subgraphs connected in a serial manner satisfy the fusion condition, the at least two computational subgraphs are treated as a group of computational subgraphs. The fusion conditions include: the sum of the computational amounts corresponding to the at least two computational subgraphs is greater than or equal to the minimum computational amount threshold and less than or equal to the maximum computational amount threshold; the sum of the weight information corresponding to the at least two computational subgraphs is greater than or equal to the minimum weight threshold and less than or equal to the maximum weight threshold; and the sum of the number of nodes in the APU graphs corresponding to the at least two computational subgraphs is greater than or equal to the minimum number threshold and less than or equal to the maximum number threshold.
5. A method for processing neural network computation graphs, characterized in that, The neural network computation graph includes multiple operator nodes, and the processing method includes: Based on the output connection relationship of multiple operator nodes, all target operator nodes in the neural network computation graph are determined. Each target operator node has a first output terminal and a second output terminal, and the operator node connected to the first output terminal is a data output node. Using the second output terminal of the target operator node as the splitting point, the neural network computation graph is split into multiple computation subgraphs connected in series, and the output terminals of the operator nodes connected to the second output terminal are connected to other operator nodes. An executable file corresponding to each computation subgraph is generated sequentially based on each computation subgraph. After generating the executable file corresponding to each computation subgraph according to each computation subgraph in sequence, the processing method further includes: In response to the absence of any failures or errors when generating the executable file corresponding to each computational subgraph, the current computational subgraph and the next computational subgraph are merged into a candidate computational subgraph according to the execution order of all computational subgraphs. If a failure occurs during the generation of the executable file corresponding to the candidate computation subgraph, the current computation subgraph will be used as the target computation subgraph; and The next computational subgraph is used as the current computational subgraph, and the process returns to the step of merging the current computational subgraph and the next computational subgraph into a candidate computational subgraph. 
The executable file corresponding to each target computation subgraph is loaded into the corresponding chip to execute the corresponding computation task.
6. The processing method according to claim 5, characterized in that, After merging the current computation subgraph and the next computation subgraph into a candidate computation subgraph, the processing method further includes: If no errors occur during the generation of the executable file corresponding to the candidate computation subgraph, the candidate computation subgraph will be used as the target computation subgraph; and The computational subgraph whose execution order is after the next computational subgraph is taken as the current computational subgraph, and the process returns to the step of merging the current computational subgraph and the next computational subgraph into a candidate computational subgraph.
7. A method for processing neural network computation graphs, characterized in that, The neural network computation graph includes multiple operator nodes, and the processing method includes: Based on the output connection relationship of multiple operator nodes, all target operator nodes in the neural network computation graph are determined. Each target operator node has a first output terminal and a second output terminal, and the operator node connected to the first output terminal is a data output node. Using the second output terminal of the target operator node as the splitting point, the neural network computation graph is split into multiple computation subgraphs connected in series, and the output terminals of the operator nodes connected to the second output terminal are connected to other operator nodes. An executable file corresponding to each computation subgraph is generated sequentially based on each computation subgraph. After generating the executable file corresponding to each computation subgraph according to each computation subgraph in sequence, the processing method further includes: In response to the absence of any failures or errors when generating the executable file corresponding to each computational subgraph, adjacent computational subgraphs are merged sequentially according to the execution order of all computational subgraphs. Simultaneously, in the opposite direction to the execution order, adjacent computational subgraphs are merged sequentially. For any computational subgraph obtained by fusion processing, if there is no failure or error when generating the executable file corresponding to the computational subgraph obtained by fusion processing, the computational subgraph obtained by fusion processing shall be used as the target computational subgraph. The executable file corresponding to each target computation subgraph is loaded into the corresponding chip to execute the corresponding computation task.
8. The processing method according to claim 7, characterized in that, In the event of a failure to generate the executable file corresponding to the computational subgraph obtained from the fusion process, the processing method further includes: When the computational subgraph obtained by the fusion process is a computational subgraph obtained by fusion processing according to the execution order direction, the computational subgraph with the earlier execution order among the two adjacent computational subgraphs before the fusion process is taken as the target computational subgraph. When the computational subgraph obtained by the fusion process is a computational subgraph obtained by fusion processing in the opposite direction to the execution order, the computational subgraph with the later execution order among the two adjacent computational subgraphs before the fusion process is taken as the target computational subgraph.
9. A processing apparatus, characterized in that the processing apparatus is used to process a neural network computation graph to be processed, the neural network computation graph including multiple operator nodes, and the processing apparatus includes: a determination module, used to determine all target operator nodes in the neural network computation graph based on the output connection relationship of the multiple operator nodes, wherein each target operator node is an operator node with a first output terminal and a second output terminal, and the operator node connected to the first output terminal is a data output node; a first splitting module, used to split the neural network computation graph into multiple serially connected computation subgraphs, using the second output terminal of the target operator node as the splitting point, wherein the output terminal of the operator node connected to the second output terminal is connected to other operator nodes; and a generation module, used to sequentially generate an executable file corresponding to each computation subgraph based on each computation subgraph; wherein, after the executable file corresponding to each computation subgraph has been generated, the processing apparatus is further configured to: in response to no failure error being reported when generating the executable file corresponding to each computation subgraph, obtain at least one set of computation subgraphs, each set of computation subgraphs including at least two serially connected computation subgraphs; fuse the at least two computation subgraphs in each set of computation subgraphs to obtain a candidate computation subgraph corresponding to each set of computation subgraphs; if no failure error is reported when generating the executable file corresponding to the candidate computation subgraph, use the candidate computation subgraph as a target computation subgraph; and load the executable file corresponding to each target computation subgraph onto the corresponding chip to execute the corresponding computation task; wherein obtaining at least one set of computation subgraphs includes: determining, based on the subgraph attribute parameters corresponding to each computation subgraph, at least one set of computation subgraphs that can be fused among all the computation subgraphs, wherein the subgraph attribute parameters corresponding to a computation subgraph include the computational amount, the weight information, and the number of nodes in the vector acceleration unit (APU) graph corresponding to the computation subgraph; or treating every two serially connected computation subgraphs among all the computation subgraphs as a group of computation subgraphs according to the serial connection relationship and execution order of all the computation subgraphs.
10. An electronic device, characterized in that it includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores one or more computer programs that can be executed by the at least one processor, the one or more computer programs being executed by the at least one processor to enable the at least one processor to perform the processing method as described in any one of claims 1-8.
11. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the processing method as described in any one of claims 1-8.