Operator processing method and apparatus, electronic device, and storage medium
By rearranging and fusing operators in a neural network model to generate target operators with linear memory layout, the problems of large workload in operator development and difficulty in fusion are solved, and efficient operator execution and performance improvement are achieved.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: SHANGHAI BIREN TECH CO LTD
- Filing Date: 2023-07-13
- Publication Date: 2026-06-19
AI Technical Summary
In existing technologies, the development of operators for running neural networks is a huge undertaking, and operators with different data arrangements are difficult to merge and eliminate, resulting in poor performance.
By rearranging the memory of operators in the neural network model based on the first memory arrangement operator, a second operator with linear memory arrangement is generated and fused to generate the target operator. The target operator is then used to perform operations related to instruction mapping, thereby realizing the fusion of operators under different hardware scenarios.
The method reduces the workload of operator development, enables the reuse of CPU logic across different hardware scenarios, and improves operator execution performance and efficiency.
[Figure CN116796289B_ABST: abstract drawing]
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and in particular to an operator processing method, apparatus, electronic device, and storage medium.
Background Technology
[0002] With the rapid development of artificial intelligence (AI) technology, neural networks have been widely used in various fields. Running a neural network requires the support of a large number of operators.
[0003] In related technologies, neural networks are typically run in single-operator mode (eager mode), which requires actual memory rearrangement. Furthermore, each single operator needs a different implementation for each hardware scenario. Operators with different data layouts are difficult to merge and eliminate, resulting in a huge workload for operator development.
[0004] Therefore, how to reduce the workload of operator development and improve the performance of operator execution is an urgent problem to be solved.
Summary of the Invention
[0005] To address the problems existing in the prior art, embodiments of the present invention provide an operator processing method, apparatus, electronic device, and storage medium.
[0006] This invention provides an operator processing method, comprising:
[0007] Based on the first memory arrangement operator, the memory of M first operators in the neural network model is rearranged to generate M second operators; the memory arrangement of each second operator is a linear memory arrangement, and M is a positive integer greater than or equal to 1;
[0008] The M second operators are fused to generate a first fusion operator;
[0009] Based on the first memory layout operator, the first fusion operator, and the second memory layout operator, operator fusion is performed to generate a target operator, which is used to execute a target operation; the target operation is related to instruction mapping.
[0010] Optionally, before performing memory rearrangement processing on the M first operators in the neural network model based on the first memory arrangement operator, the method further includes:
[0011] Obtain the computation graph corresponding to the neural network model, wherein the computation graph contains the M first operators.
[0012] Optionally, the step of rearranging the memory of M first operators in the neural network model based on the first memory arrangement operator to generate M second operators includes:
[0013] The first memory arrangement operator is inserted at the beginning of the computation graph, and the first memory arrangement operator is used to perform memory rearrangement processing on the M first operators to generate the M second operators.
[0014] Optionally, the operator fusion based on the first memory layout operator, the first fusion operator, and the second memory layout operator includes:
[0015] The second memory arrangement operator is inserted at the end of the computation graph. Based on the second memory arrangement operator, the first fusion operator is rearranged in memory to generate the second fusion operator. The memory arrangement of the second fusion operator is a non-linear memory arrangement.
[0016] The first memory layout operator, the second fusion operator, and the second memory layout operator are fused together to generate the target operator.
[0017] Optionally, the method further includes:
[0018] Based on the target operator, determine the physical coordinates of the target operator in the hardware device;
[0019] Based on the template schedule and the physical coordinates, instruction mapping is performed to generate read and write instructions;
[0020] Based on the read / write instructions, read / write operations are performed on the data associated with the target operator in the hardware device.
[0021] Optionally, the step of performing read / write operations on the data associated with the target operator in the hardware device based on the read / write instructions includes:
[0022] Based on the memory layout of the hardware device, the data associated with the target operator is segmented to generate multiple data blocks;
[0023] Based on the read / write instructions, at least one data block is read / written in the hardware device.
[0024] Optionally, the step of performing read / write operations on at least one data block in the hardware device based on the read / write instructions includes:
[0025] When performing read and write operations on multiple data blocks simultaneously, the vector instruction is used to perform read and write operations on each data block; the vector instruction is used to call multiple threads to perform read and write operations on multiple data blocks simultaneously.
[0026] Optionally, the step of performing read / write operations on at least one data block in the hardware device based on the read / write instructions includes any one of the following:
[0027] When performing read / write operations on the internal data of any data block, the internal data is written to a shared memory area for data rearrangement to generate target data; and the target data in the shared memory area is read / written based on the vector instruction.
[0028] or,
[0029] Based on the target read / write instruction, read and write operations are performed on the internal data; the target read / write instruction is used to call a single thread to perform read and write operations on a single piece of data.
[0030] Optionally, the vector instruction includes at least one of the following:
[0031] Multi-data load (ldm) instructions;
[0032] Multi-data store (stm) instructions.
[0033] Optionally, the target read / write instruction includes at least one of the following:
[0034] Data load (ld) instruction;
[0035] Data store (st) instruction.
[0036] Optionally, if the data associated with the M first operators exceeds the hardware device limit, the method further includes:
[0037] A third operator is inserted at the beginning of the computation graph. The third operator includes the memory layout operator and a fourth operator. The fourth operator is used to perform operations related to data transformation and reshape.
[0038] The third operator is used to process the data associated with the M first operators.
[0039] Optionally, if the data type associated with the first operator does not conform to the target data type, the method further includes:
[0040] A fifth operator is inserted at the beginning and end of the computation graph. The fifth operator is used to convert the data type of the data associated with the first operator.
[0041] The fifth operator is used to convert the data type of the data associated with the first operator into the target data type.
[0042] Optionally, fusing the M second operators to generate a first fusion operator includes:
[0043] The M second operators are fused using the computational logic of the central processing unit (CPU) to generate the first fusion operator.
[0044] The present invention also provides an operator processing apparatus, comprising:
[0045] The first generation module is used to perform memory rearrangement processing on M first operators in the neural network model based on the first memory arrangement operator to generate M second operators; the memory arrangement of each second operator is a linear memory arrangement, and M is a positive integer greater than or equal to 1;
[0046] The second generation module is used to fuse the M second operators to generate a first fusion operator;
[0047] The execution module is used to perform operator fusion based on the first memory layout operator, the first fusion operator, and the second memory layout operator to generate a target operator, which is used to execute a target operation; the target operation is related to the instruction mapping.
[0048] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the operator processing method as described above.
[0049] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the operator processing method as described above.
[0050] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the operator processing method as described above.
[0051] The operator processing method, apparatus, electronic device, and storage medium provided by this invention perform memory rearrangement processing on M first operators in a neural network model using a first memory arrangement operator to generate M second operators with linear memory arrangements. The M second operators are then fused to generate a first fusion operator. Finally, operator fusion is performed based on the first memory arrangement operator, the first fusion operator, and the second memory arrangement operator to generate a target operator. This method achieves the fusion of operators under different memory arrangements. The target operator is used to execute target operations related to instruction mapping. It eliminates the need to support different hardware-related scenarios for each first operator, allowing for the reuse of CPU logic, thereby significantly reducing the workload of operator development and improving operator execution performance.
Description of the Drawings
[0052] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0053] Figure 1 is the first flowchart of the operator processing method provided by the present invention;
[0054] Figure 2 is the second flowchart of the operator processing method provided by the present invention;
[0055] Figure 3 is the first schematic diagram of the processing logic of the operator processing method provided by the present invention;
[0056] Figure 4 is the second schematic diagram of the processing logic of the operator processing method provided by the present invention;
[0057] Figure 5 is a schematic diagram of the operator processing apparatus provided by the present invention;
[0058] Figure 6 is a schematic diagram of the structure of the electronic device provided by the present invention.
Detailed Description
[0059] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0060] To facilitate a clearer understanding of the various embodiments of this application, some relevant knowledge will be introduced as follows.
[0061] With the rapid development of artificial intelligence technology, neural networks have been widely used in various fields. Running a neural network requires the support of a large number of operators.
[0062] However, the following drawbacks exist in the related technologies for running neural networks:
[0063] 1. Neural networks are typically run in eager mode, which requires actual memory rearrangement. The workload of operator development under eager mode is huge.
[0064] 2. Operators of different layouts, such as memory operators, unary operators, and binary operators, are difficult to merge and eliminate.
[0065] Here, layout refers to the memory arrangement of data on the hardware, for example whether data is read by row or by column, which block is read first and which later, and the different alignment requirements on different dimensions.
[0066] Unary: unary operators, such as sin, cos, exp, log, and erf.
[0067] Binary: binary operators, such as add, mul, mod, and the and operator, covering both broadcast and elementwise scenarios.
[0068] Memory: operators that mainly involve memory mapping, such as transpose and reshape.
[0069] 3. For some extremely large shapes that exceed hardware limitations, explicit reshaping is required. Each handwritten kernel must take large tensors into account.
[0070] 4. Each operator needs to consider the following permutations and combinations of different scenarios:
[0071] (1) Different layouts.
[0072] (2) Different data types, such as fp32, bf16, etc.
[0073] (3) Different shapes.
[0074] (4) Different Burst modes, such as burst1 / 2 / 4;
[0075] Here, Burst mode corresponds to the concept of Single Instruction Multiple Data (SIMD), generally referred to as vectorization, where one instruction processes multiple data elements simultaneously.
[0076] (5) Large tensor.
[0077] (6) Different operator properties, such as the exchange of different axes of transpose.
[0078] If different layouts, different data types, different burst modes, and large tensor scenarios are all combined, the amount of handwritten kernel work required to implement even a single operator is foreseeably enormous.
[0079] 5. Some operators involve rearrangement within a data block and can only use point-to-point data loading (ld) or data storage (st). After fusion, however, multi-data loading (load much, ldm) or multi-data storage (store much, stm) can be used.
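The combinatorial growth described in item 4 can be made concrete with a small count. The sketch below is illustrative only: the specific layouts, data types, and burst modes listed are assumptions chosen for the example, not an enumeration taken from this document.

```python
from itertools import product

# Hypothetical scenario axes; the concrete values are illustrative only.
layouts = ["row_major", "col_major", "tiled"]
dtypes = ["fp32", "bf16"]
burst_modes = [1, 2, 4]
large_tensor = [False, True]

# One handwritten kernel would be needed per combination of scenario axes.
variants = list(product(layouts, dtypes, burst_modes, large_tensor))
kernels_per_operator = len(variants)
print(kernels_per_operator)  # 36
```

Even with these few assumed values, a single operator already needs 36 handwritten variants before different shapes and operator properties are considered.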
[0080] In summary, to enable different single operators to be implemented in different hardware-related scenarios (layout, data type, burst mode, different operator implementations, different shapes, large tensors, etc.), reduce the workload of operator development, and improve the performance of operator execution, this invention provides an operator processing method, apparatus, electronic device, and storage medium.
[0081] The operator processing method provided by this invention is described in detail below with reference to Figures 1 to 4. Figure 1 is the first flowchart of the operator processing method provided by the present invention. As shown in Figure 1, the method includes steps 101-103, wherein:
[0082] Step 101: Based on the first memory arrangement operator, perform memory rearrangement processing on the M first operators in the neural network model to generate M second operators; the memory arrangement of each second operator is a linear memory arrangement, and M is a positive integer greater than or equal to 1.
[0083] First, it should be noted that the executing entity of this invention can be any electronic device capable of operator processing, such as a smartphone, smartwatch, desktop computer, laptop, or any other type of device.
[0084] In this embodiment, the neural network model can be applied to fields such as image recognition, speech processing, and natural language processing. The neural network model can be, for example, a convolutional neural network (CNN) model, a recurrent neural network (RNN) model, etc. Running the neural network requires the support of M first operators.
[0085] The first operator can be, for example, the Memory operator, the Unary operator, the Binary operator, etc., each corresponding to a different non-linear memory arrangement. Therefore, in this embodiment, it is necessary to utilize the basic settings of the AI compiler to rearrange the memory of the M non-linear first operators based on the memory arrangement operators, generating M linear memory arrangement second operators, that is, non-linear arrangement (non-ByteObject) → linear arrangement (ByteObject).
[0086] The first memory layout operator is used to perform memory rearrangement processing on the first operator. The first memory layout operator can be represented as: layout_convert operator.
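As a rough illustration of what a conversion from a non-linear to a linear arrangement involves, the sketch below converts a tiled buffer into a row-major one. The tiled layout, the square tile shape, and the names `tiled_offset` and `to_linear` are assumptions made for this example; the document does not specify the hardware layouts that layout_convert handles.

```python
def tiled_offset(row, col, cols, tile):
    """Offset of logical (row, col) in a buffer stored tile by tile.

    Assumes square tiles (tile x tile), dimensions divisible by tile,
    row-major order of tiles, and row-major order inside each tile.
    """
    tiles_per_row = cols // tile
    tile_index = (row // tile) * tiles_per_row + (col // tile)
    inner = (row % tile) * tile + (col % tile)
    return tile_index * tile * tile + inner


def to_linear(tiled, rows, cols, tile):
    """Toy layout conversion: tiled buffer -> row-major (linear) buffer."""
    linear = [None] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            linear[r * cols + c] = tiled[tiled_offset(r, c, cols, tile)]
    return linear


# A 4x4 matrix whose logical value at (r, c) is r * 4 + c, stored in 2x2 tiles.
tiled = [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
linear = to_linear(tiled, rows=4, cols=4, tile=2)
print(linear == list(range(16)))  # True
```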
[0087] Step 102: Merge the M second operators to generate a first fusion operator.
[0088] In this embodiment, since the memory layout of each second operator is a linear memory layout, each second operator can be directly merged into a large first fusion operator.
[0089] Optionally, the step of fusing the M second operators to generate a first fusion operator can be achieved in the following way:
[0090] The M second operators are fused using the computational logic of the central processing unit (CPU) to generate the first fused operator.
[0091] For example, the CPU computational logic from the Tensor Virtual Machine (TVM) community can be reused to merge the Memory operator, Unary operator, and Binary operator into a single large operator. It should be noted that, in this case, the memory arrangement of the Memory operator, Unary operator, and Binary operator is a linear memory arrangement.
[0092] In the above implementation, since each second operator is merged into a large first fusion operator, it is not necessary to implement corresponding operators for every permutation and combination of different layouts, different data types, different burst modes, and large tensor scenarios. Furthermore, the automatic fusion of operators between different layouts, such as memory operators, unary operators, and binary operators, greatly improves performance.
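The benefit of fusing operators that share a linear layout can be sketched in plain Python. This is a toy model under stated assumptions: operators are modeled as elementwise functions, and `run_unfused` and `fuse` are hypothetical helper names, not part of any compiler API.

```python
import math


def run_unfused(data, ops):
    """Apply each operator in its own pass, materializing every intermediate."""
    passes = 0
    for op in ops:
        data = [op(x) for x in data]
        passes += 1
    return data, passes


def fuse(ops):
    """Compose elementwise operators into one function (a toy fusion operator)."""
    def fused(x):
        for op in ops:
            x = op(x)
        return x
    return fused


# One unary operator followed by two binary operators with a constant operand.
ops = [math.exp, lambda x: x + 1.0, lambda x: x * 2.0]
data = [0.0, 1.0, 2.0]

unfused_result, passes = run_unfused(data, ops)
fused_op = fuse(ops)
fused_result = [fused_op(x) for x in data]  # a single pass over the buffer

print(passes)                          # 3
print(fused_result == unfused_result)  # True
```

The fused version touches the buffer once instead of three times, which is the kind of saving that a shared linear layout makes possible.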
[0093] Step 103: Perform operator fusion based on the first memory layout operator, the first fusion operator, and the second memory layout operator to generate a target operator. The target operator is used to execute a target operation. The target operation is related to instruction mapping.
[0094] In this embodiment, the second memory arrangement operator is used to perform memory rearrangement processing on the first fusion operator. Generating the target operator means realizing a transformation of the coordinates of each of the first operators from logical coordinates to physical coordinates in the hardware device.
[0095] The target operator is used to perform target operations related to instruction mapping.
[0096] Specifically, each hardware device has its own instruction set, and here we need to generate instructions that the hardware device can understand, such as read / write instructions, calculation instructions, etc.
[0097] The operator processing method provided by this invention performs memory rearrangement processing on M first operators in a neural network model using a first memory arrangement operator to generate M second operators with linear memory arrangements. These M second operators are then fused to generate a first fusion operator. Finally, operator fusion is performed based on the first memory arrangement operator, the first fusion operator, and the second memory arrangement operator to generate a target operator. This method achieves the fusion of operators under different memory arrangements. The target operator is used to execute target operations related to instruction mapping. It eliminates the need to support different hardware-related scenarios for each first operator, allowing for the reuse of CPU logic, thereby significantly reducing the workload of operator development and improving operator execution performance.
[0098] Optionally, before performing memory rearrangement processing on the M first operators in the neural network model based on the first memory arrangement operator, it is necessary to obtain the M first operators in the neural network model, which is specifically achieved in the following way:
[0099] Obtain the computation graph corresponding to the neural network model, wherein the computation graph contains the M first operators.
[0100] In this embodiment, the neural network model is first abstracted into a corresponding computation graph, and then each first operator is obtained from the computation graph.
[0101] The computation graph corresponding to the neural network model is a directed acyclic graph used to describe the operations. It has two main elements: nodes and edges. Each node can correspond to an operator, such as a vector, matrix, or tensor. Edges represent operations, such as addition, subtraction, multiplication, division, and convolution.
[0102] Optionally, the process of rearranging the memory of the M first operators in the neural network model based on the first memory arrangement operator to generate M second operators is specifically implemented in the following way:
[0103] The first memory arrangement operator is inserted at the beginning of the computation graph, and the first memory arrangement operator is used to perform memory rearrangement processing on the M first operators to generate the M second operators.
[0104] For example, by inserting the layout_convert operator at the beginning of the computation graph, the non-linear memory layout operators Memory, Unary, and Binary are rearranged into a linear memory layout ByteObject. Then, the Memory, Unary, and Binary operators can be merged into a large merged operator according to the CPU's logic.
[0105] By using the above method, operators from different layouts can be automatically merged, which greatly reduces the workload of operator development and thus improves the performance of operator execution.
[0106] Optionally, the operator fusion based on the first memory layout operator, the first fusion operator, and the second memory layout operator can be implemented through the following steps:
[0107] Step 1) Insert the second memory arrangement operator at the end of the computation graph, and perform memory rearrangement processing on the first fusion operator based on the second memory arrangement operator to generate the second fusion operator; the memory arrangement of the second fusion operator is a non-linear memory arrangement.
[0108] Step 2) Merge the first memory layout operator, the second fusion operator, and the second memory layout operator to generate the target operator.
[0109] In this embodiment, after fusing the second operators to generate the first fusion operator, the first fusion operator with linear memory arrangement needs to be rearranged again based on the second memory arrangement operator to generate the second fusion operator with non-linear arrangement, thereby realizing non-ByteObject→ByteObject→non-ByteObject.
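The non-ByteObject→ByteObject→non-ByteObject pattern can be pictured as a small graph rewrite. In the sketch below, the flat operator list standing in for a computation graph, the tuple encoding, and the function name `wrap_with_layout_converts` are all assumptions made for illustration.

```python
def wrap_with_layout_converts(op_names, in_layout, out_layout):
    """Return a new operator list with layout conversions at both ends.

    op_names is a list of operator names in execution order; this flat list
    is a simplified stand-in for a real computation graph.
    """
    head = ("layout_convert", in_layout, "ByteObject")   # non-linear -> linear
    tail = ("layout_convert", "ByteObject", out_layout)  # linear -> non-linear
    body = [("op", name) for name in op_names]
    return [head] + body + [tail]


wrapped = wrap_with_layout_converts(["transpose", "exp", "add"], "tiled", "tiled")
for node in wrapped:
    print(node)
```

Every operator between the two conversions sees only the linear layout, so its layout-independent implementation can be reused unchanged.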
[0110] Then, using the target operator, target operations related to instruction mapping are performed.
[0111] Optionally, the execution of the target operation based on the target operator can be achieved through the following steps:
[0112] Step 1) Based on the target operator, determine the physical coordinates of the target operator in the hardware device;
[0113] Step 2) Based on the template schedule and the physical coordinates, perform instruction mapping to generate read and write instructions.
[0114] Step 3) Based on the read / write instructions, perform read / write operations on the data associated with the target operator in the hardware device.
[0115] In this embodiment, for the aforementioned "non-ByteObject→ByteObject→non-ByteObject" process, tensor IR fusion can be performed, combining the three computations into a single computation to obtain the target operator, thereby achieving a single coordinate transformation. That is, the original coordinates (logical coordinates) of the target operator are transformed to the coordinates (physical coordinates) in the hardware device. Simultaneously, compiler technology can be used to simplify the fusion of the computational parts.
[0116] Tensor IR is an intermediate representation (IR) used to describe tensor computation, including for loops, if-else statements, read / write operations, and corresponding computations.
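To see why three computations can collapse into a single coordinate transformation, consider a transpose wrapped by two layout conversions. The sketch below is a simplified model, not the tensor IR of any real compiler: the tiled layout and the function names are assumptions. It checks that the three-step pipeline and the single fused index mapping produce the same physical buffer.

```python
def tiled_offset(row, col, cols, tile):
    """Physical offset of logical (row, col) in an assumed square-tiled layout."""
    tiles_per_row = cols // tile
    tile_index = (row // tile) * tiles_per_row + (col // tile)
    return tile_index * tile * tile + (row % tile) * tile + (col % tile)


def three_step_transpose(src, rows, cols, tile):
    # 1) tiled -> linear (row-major)
    linear = [src[tiled_offset(r, c, cols, tile)]
              for r in range(rows) for c in range(cols)]
    # 2) transpose on the linear buffer; the result has shape (cols, rows)
    transposed = [linear[r * cols + c]
                  for c in range(cols) for r in range(rows)]
    # 3) linear -> tiled
    dst = [None] * (rows * cols)
    for i in range(cols):
        for j in range(rows):
            dst[tiled_offset(i, j, rows, tile)] = transposed[i * rows + j]
    return dst


def fused_transpose(src, rows, cols, tile):
    """Single pass: map each output coordinate straight to a source offset."""
    dst = [None] * (rows * cols)
    for i in range(cols):
        for j in range(rows):
            dst[tiled_offset(i, j, rows, tile)] = src[tiled_offset(j, i, cols, tile)]
    return dst


src = list(range(8))  # a 2x4 matrix stored in 2x2 tiles
print(fused_transpose(src, 2, 4, 2) == three_step_transpose(src, 2, 4, 2))  # True
```

The fused version needs no intermediate linear buffer; the linear layout exists only as an intermediate step in the index arithmetic.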
[0117] Optionally, the step of performing read and write operations on the data associated with the target operator in the hardware device based on the read and write instructions can be implemented through the following steps:
[0118] Step 1) Based on the memory layout of the hardware device, the data associated with the target operator is segmented to generate multiple data blocks;
[0119] Step 2) Based on the read / write instructions, perform read / write operations on at least one data block in the hardware device.
[0120] In this embodiment, the data associated with the target operator needs to be segmented according to the memory layout of the hardware device using a schedule to obtain multiple data blocks.
[0121] Then, based on read and write instructions, data read and write operations are performed on at least one data block in the hardware device.
[0122] Here, a schedule is a collection of computation transformations that transform the computation loops in the program in order to achieve different levels of performance.
[0123] In the above implementation, the data associated with the target operator is segmented according to the memory layout of the hardware device so that it can be aligned with the memory layout of the hardware device.
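A minimal sketch of the segmentation step, assuming the hardware layout imposes a fixed block size. Padding the last block to full size is an assumption of this example; the document only states that the data is segmented so that it aligns with the hardware memory layout.

```python
def split_into_blocks(data, block_size, pad_value=0):
    """Split a flat buffer into fixed-size blocks, padding the last one."""
    blocks = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]  # slicing copies, data is untouched
        block += [pad_value] * (block_size - len(block))
        blocks.append(block)
    return blocks


blocks = split_into_blocks(list(range(10)), block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 0, 0]]
```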
[0124] Optionally, the read / write operation on at least one data block in the hardware device based on the read / write instruction can be implemented in at least one of the following ways:
[0125] Scenario 1: This involves reading and writing multiple data blocks simultaneously, which requires data rearrangement between blocks.
[0126] When performing read and write operations on multiple data blocks simultaneously, the vector instruction is used to perform read and write operations on each data block; the vector instruction is used to call multiple threads to perform instruction mapping operations on multiple data blocks simultaneously.
[0127] In this embodiment, the vector instruction corresponds to the concept of Single Instruction Multiple Thread (SIMT), in which multiple threads perform the corresponding read and write operations simultaneously. Certain alignment restrictions apply, for example requiring all 32 threads to operate at the same time rather than only some of them.
[0128] Optionally, the vector instruction includes at least one of the following:
[0129] a) Multi-data load (ldm) instructions;
[0130] b) Multi-data store (stm) instructions.
[0131] It should be noted that the ldm instruction is a type of vector instruction corresponding to the SIMT concept: multiple threads read data from external memory at the same time, and different layouts correspond to different instruction configurations.
[0132] The stm instruction is likewise a type of vector instruction: multiple threads write data to external memory at the same time, and different layouts correspond to different instruction configurations.
[0133] In the above implementation, based on the vector instruction, multiple threads can be invoked simultaneously to perform data read and write operations on each data block, thereby improving the efficiency of data read and write and enhancing performance.
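The effect of the alignment restriction on instruction count can be sketched as follows. The group width of 32 comes from the example in the text; falling back to one single-element instruction for each leftover element is an assumption of this sketch, as is the helper name `plan_block_access`.

```python
GROUP_SIZE = 32  # thread-group width, taken from the 32-thread example in the text


def plan_block_access(num_elements, group=GROUP_SIZE):
    """Count full-group (vector) accesses and leftover single-element accesses."""
    full_groups, remainder = divmod(num_elements, group)
    return {
        "vector_instructions": full_groups,  # each covers `group` elements
        "scalar_instructions": remainder,    # assumed fallback, one element each
        "total": full_groups + remainder,
    }


aligned = plan_block_access(128)
ragged = plan_block_access(100)
print(aligned)  # {'vector_instructions': 4, 'scalar_instructions': 0, 'total': 4}
print(ragged)   # {'vector_instructions': 3, 'scalar_instructions': 4, 'total': 7}
```

With purely single-element access the same blocks would need 128 and 100 instructions respectively, which is why aligning blocks to the group width matters.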
[0134] Scenario 2: Reading and writing operations are performed on the internal data of any data block, which involves data rearrangement within the block.
[0135] Method 1: When performing read / write operations on the internal data of any data block, write the internal data to a shared memory area for data rearrangement to generate target data; and perform read / write operations on the target data in the shared memory area based on the vector instruction.
[0136] In this embodiment, when data rearrangement within a block is involved, contiguous target data is generated by writing the internal data of the data block into shared memory. The target data can then be written out from shared memory using stm instructions. Compared with reading and writing at precise positions, stm instructions are faster, and multiple threads can share the data.
[0137] Method 2: Based on the target read / write instruction, perform read / write operations on the internal data; the target read / write instruction is used to call a single thread to perform read / write operations on a single piece of data.
[0138] Optionally, the target read / write instruction includes at least one of the following:
[0139] a) Data load (ld) instruction;
[0140] b) Data store (st) instruction.
[0141] In this embodiment, when data rearrangement within a block is involved, the ld and st instructions can be used to perform precise position reading and writing of the internal data.
[0142] It should be noted that this part is mainly reflected in layout_convert and the schedule. Here, for non-ByteObject linear layouts, a new byteobject_colmajor layout representation has been added. It flattens the data into a linear layout according to the colmajor memory layout, so that the corresponding point-to-point operations can be performed and the ld or st instructions can be applied.
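The two ways of handling rearrangement within a block can be contrasted in a toy model. Plain Python lists stand in for device memory and the shared memory area, and the function names are hypothetical; real ld, st, and stm instructions are hardware-specific and are not modeled here.

```python
def rearrange_pointwise(block, permutation):
    """Method 2 analogue: one load and one store per element (ld/st-like)."""
    out = [None] * len(block)
    for i in range(len(block)):
        value = block[i]             # ld: a single element
        out[permutation[i]] = value  # st: a single element at an exact position
    return out


def rearrange_via_staging(block, permutation):
    """Method 1 analogue: permute in a staging buffer, then store contiguously."""
    shared = [None] * len(block)     # stand-in for the shared memory area
    for i, value in enumerate(block):
        shared[permutation[i]] = value
    return list(shared)              # stand-in for one contiguous stm-like store


block = ["a", "b", "c", "d"]
permutation = [2, 0, 3, 1]  # permutation[i] is the destination slot of block[i]

print(rearrange_via_staging(block, permutation))  # ['b', 'd', 'a', 'c']
print(rearrange_pointwise(block, permutation))    # ['b', 'd', 'a', 'c']
```

Both paths give the same result; the difference on real hardware is that the staging path ends with a contiguous write that a vector instruction can serve.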
[0143] Optionally, if the data associated with the M first operators exceeds the hardware device limit, the following steps also need to be performed:
[0144] Step 1) Insert a third operator at the beginning of the computation graph. The third operator includes the memory layout operator and a fourth operator. The fourth operator is used to perform operations related to data transformation and reshape.
[0145] Step 2) Use the third operator to process the data associated with the M first operators.
[0146] In this embodiment, for shapes that exceed hardware limitations, a third operator is automatically inserted at the beginning of the computation graph to ensure that the shape does not exceed the hardware device's limitations.
[0147] It should be noted that there is no need to add new implementations for different layouts for reshape. The reshape operator only needs to be split into layout_convert + reshape(ByteObject), thus reusing the existing CPU implementation of reshape. Here, the reshape operator is used in a broad sense, including any operator that can perform reshape-related operations.
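The decomposition of reshape into layout_convert + reshape(ByteObject) can be sketched as follows. The tiled source layout and the helper names are assumptions for illustration. The point is that once the data is linear, reshape only regroups a flat buffer and needs no layout-specific code.

```python
def tiled_offset(row, col, cols, tile):
    """Physical offset of logical (row, col) in an assumed square-tiled layout."""
    tiles_per_row = cols // tile
    tile_index = (row // tile) * tiles_per_row + (col // tile)
    return tile_index * tile * tile + (row % tile) * tile + (col % tile)


def layout_convert_to_linear(tiled, rows, cols, tile):
    """Step 1: convert the tiled buffer to a row-major (linear) buffer."""
    return [tiled[tiled_offset(r, c, cols, tile)]
            for r in range(rows) for c in range(cols)]


def reshape_linear(linear, new_shape):
    """Step 2: layout-independent reshape of a linear buffer."""
    new_rows, new_cols = new_shape
    if new_rows * new_cols != len(linear):
        raise ValueError("new shape must keep the number of elements")
    return [linear[r * new_cols:(r + 1) * new_cols] for r in range(new_rows)]


# 4x4 matrix with logical value r * 4 + c, stored in 2x2 tiles.
tiled = [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
reshaped = reshape_linear(layout_convert_to_linear(tiled, 4, 4, 2), (2, 8))
print(reshaped)  # [[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15]]
```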
[0148] Optionally, if the data type associated with the first operator does not conform to the target data type, the following steps also need to be performed:
[0149] Step 1) Insert a fifth operator at the beginning and end of the computation graph. The fifth operator is used to convert the data type of the data associated with the first operator.
[0150] The fifth operator can be, for example, the cast operator.
[0151] Step 2) Using the fifth operator, the data type of the data associated with the first operator is converted into the target data type.
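Inserting a conversion at both ends of the graph can be pictured as wrapping the computation in a pair of casts. In this sketch the wrapped function stands in for the computation graph, and `with_casts` is a hypothetical helper; Python's built-in `float` and `int` play the role of the cast operator.

```python
def with_casts(kernel, to_target, to_original):
    """Wrap a kernel with a cast before it and a cast after it."""
    def wrapped(values):
        converted = [to_target(v) for v in values]  # cast at the beginning
        result = kernel(converted)
        return [to_original(v) for v in result]     # cast at the end
    return wrapped


def halve(values):
    """A kernel written only for floating-point inputs."""
    return [v / 2 for v in values]


int_halve = with_casts(halve, to_target=float, to_original=int)
print(int_halve([2, 4, 7]))  # [1, 2, 3]
```

The kernel itself is written once for the target data type; other data types are handled entirely by the inserted casts.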
[0152] Figure 2 is the second flowchart of the operator processing method provided by the present invention. As shown in Figure 2, the method includes steps 201-210, wherein:
[0153] Step 201: Obtain the computation graph corresponding to the neural network model, wherein the computation graph contains M first operators.
[0154] Step 202: Insert a first memory arrangement operator at the beginning of the computation graph, and perform memory rearrangement processing on M first operators based on the first memory arrangement operator to generate M second operators.
[0155] Step 203: Reuse the CPU's operational logic to fuse the M second operators to generate the first fusion operator.
[0156] Step 204: Insert a second memory arrangement operator at the end of the computation graph, and perform memory rearrangement processing on the first fusion operator based on the second memory arrangement operator to generate the second fusion operator; wherein, the memory arrangement of the second fusion operator is a non-linear memory arrangement.
[0157] Step 205: Merge the first memory layout operator, the second fusion operator, and the second memory layout operator to generate the target operator.
[0158] Step 206: Based on the target operator, determine the physical coordinates of the target operator in the hardware device.
[0159] Step 207: Map instructions based on the schedule and physical coordinates to generate read and write instructions.
[0160] Step 208: Based on the memory layout of the hardware device, the data associated with the target operator is segmented to generate multiple data blocks.
[0161] Step 209: When performing read and write operations on multiple data blocks simultaneously, perform read and write operations on each data block based on the vector instruction; wherein, the vector instruction is used to call multiple threads to perform read and write operations on multiple data blocks simultaneously.
[0162] Step 210: When performing read / write operations on the internal data of any data block, write the internal data to the shared memory area for data rearrangement to generate the target data; perform read / write operations on the target data in the shared memory area based on the vector instruction; or, perform read / write operations on the internal data based on the target read / write instruction, wherein the target read / write instruction is used to call a single thread to perform read / write operations on a single piece of data.
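Steps 208 to 210 can be summarized with a small sketch that segments a flat buffer into data blocks and selects the kind of read instruction for an access. The function names, the scope labels, and the step labels for the shared-memory path are hypothetical; only the instruction names ldm and ld come from the description, and the block size is an arbitrary assumption.

```python
def split_into_blocks(data, block_size):
    """Step 208: cut a flat buffer into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def select_read_instructions(scope, use_shared_memory=False):
    """Steps 209-210: choose the read path for an access.

    scope is "across_blocks" (several blocks accessed at once) or
    "within_block" (data inside a single block). The returned strings
    are labels only, not real instructions.
    """
    if scope == "across_blocks":
        return ["ldm"]  # vector instruction, multiple threads
    if scope == "within_block":
        if use_shared_memory:
            # Rearrange in shared memory first, then use the vector instruction.
            return ["copy_to_shared", "rearrange_in_shared", "ldm"]
        return ["ld"]  # single thread, single piece of data
    raise ValueError(f"unknown scope: {scope}")


blocks = split_into_blocks(list(range(10)), 4)
```

The write path would mirror this selection with stm and st in place of ldm and ld.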
[0163] The operator processing method provided by the present invention will be further described in detail below with reference to specific embodiments.
[0164] Example 1: Data type is Fp32, Block rearrangement
[0165] Figure 3 is the first schematic diagram of the processing logic of the operator processing method provided by the present invention. As shown in part (a) of Figure 3, the input layer (input), memory operator (memory op), unary op, binary op, and output layer (output) of the neural network model are illustrated.
[0166] Then, insert the layout_convert operator at the beginning and end of (a) to obtain the computation graph shown in (b).
[0167] The layout_convert operator below input is used to convert Col-Major to ByteObject; the memory op, unary op, and binary op in the middle are ByteObjects, and operator fusion is achieved by reusing the tvm community CPU logic.
[0168] The layout_convert operator above output is used to convert ByteObject to Col-Major.
[0169] All the above operators are fused to obtain the computation graph shown in (c), where fused op represents the final fusion operator. Then, the fusion operator code is automatically generated to perform a coordinate transformation, i.e., Col-Major → Col-Major. It should be noted that this coordinate transformation is a coordinate transformation with layout, corresponding to the generation of vector instructions.
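At the graph level, the progression from (a) to (b) to (c) can be sketched as a rewrite over a list of operator names. This is only an illustration: the list-of-strings graph and the helper names are hypothetical and do not reflect the data structures of an actual AI compiler.

```python
def insert_layout_converts(ops):
    """(a) -> (b): add a layout_convert at the beginning and at the end."""
    head = "layout_convert(Col-Major->ByteObject)"
    tail = "layout_convert(ByteObject->Col-Major)"
    return [head] + list(ops) + [tail]


def fuse(ops):
    """(b) -> (c): collapse every operator into a single fused op."""
    return ["fused_op(" + " + ".join(ops) + ")"]


graph_a = ["memory_op", "unary_op", "binary_op"]
graph_b = insert_layout_converts(graph_a)
graph_c = fuse(graph_b)
```

After the rewrite, graph_c holds one operator whose input and output are both Col-Major, which corresponds to the single coordinate transformation described for the fused op.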
[0170] The process of one coordinate transformation is as follows: First, compute inline is used to achieve compute fusion, and then the master schedule template is used to perform a coordinate transformation on the merged compute.
[0171] It can be simply summarized as follows:
[0172] Layer IR -> Tensor Representation IR (for loop level) -> Instruction Layer IR (e.g., MLIR / LLVM IR). MLIR / LLVM IR mainly reflects the corresponding instruction mapping, such as Loadmatrix instruction (coordinates, parameters, ...), Storematrix instruction (coordinates, parameters, ...), etc.
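The effect of compute fusion in Example 1 can be checked numerically with a minimal sketch. It assumes that Col-Major is plain column-major storage, that ByteObject is a row-major linear buffer, and that the unary and binary operators are abs and elementwise addition; all of these choices and all function names are illustrative assumptions.

```python
def to_linear(buf, rows, cols):
    """layout_convert: column-major -> row-major (linear)."""
    return [buf[c * rows + r] for r in range(rows) for c in range(cols)]


def to_colmajor(buf, rows, cols):
    """layout_convert: row-major (linear) -> column-major."""
    return [buf[r * cols + c] for c in range(cols) for r in range(rows)]


def unfused(x_cm, y_cm, rows, cols):
    """Separate operators: two real rearrangements around the compute."""
    x = to_linear(x_cm, rows, cols)
    y = to_linear(y_cm, rows, cols)
    t = [abs(v) for v in x]               # unary op on linear data
    z = [a + b for a, b in zip(t, y)]     # binary op on linear data
    return to_colmajor(z, rows, cols)


def fused(x_cm, y_cm, rows, cols):
    """Fused operator: one loop nest, Col-Major in and Col-Major out."""
    out = [0] * (rows * cols)
    for c in range(cols):
        for r in range(rows):
            i = c * rows + r              # same coordinate for load and store
            out[i] = abs(x_cm[i]) + y_cm[i]
    return out


x_cm = [-1, 2, -3, 4, -5, 6]
y_cm = [10, 20, 30, 40, 50, 60]
result = fused(x_cm, y_cm, 2, 3)
```

For elementwise operators the two layout conversions cancel after inlining, so the fused loop never materializes the linear buffer while producing the same values as the unfused pipeline.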
[0173] A general schedule template can be represented as:
[0174] j0,j1=sch.split(j,factors=[None,j_align_factor])
[0175] k0,k1=sch.split(k,factors=[None,k_align_factor])
[0176] sch.reorder(j0,k0,j1,k1)
[0177] sch.bind(j1, "threadxxx")
[0178] …
[0179] The split alignment parameters, order, and thread binding of different loop axes are all related to the hardware design of the layout.
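To make the effect of the split and reorder steps concrete, the following sketch emulates the resulting loop nest in plain Python instead of calling a scheduling library. It assumes two loop axes j and k whose extents are divisible by the alignment factors; the function name and the concrete extents are hypothetical.

```python
def tiled_iteration_order(j_extent, k_extent, j_align_factor, k_align_factor):
    """Visit (j, k) in the order produced by split + reorder(j0, k0, j1, k1)."""
    assert j_extent % j_align_factor == 0 and k_extent % k_align_factor == 0
    order = []
    for j0 in range(j_extent // j_align_factor):      # outer part of j
        for k0 in range(k_extent // k_align_factor):  # outer part of k
            for j1 in range(j_align_factor):          # inner part of j
                for k1 in range(k_align_factor):      # inner part of k
                    order.append((j0 * j_align_factor + j1,
                                  k0 * k_align_factor + k1))
    return order


order = tiled_iteration_order(4, 4, 2, 2)
```

Each (j0, k0) pair selects one aligned tile, and the inner loops walk that tile. In a real schedule, binding an inner axis to a thread axis would spread those inner iterations across threads rather than running them sequentially.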
[0180] For FP32 data types, with in-block reordering, the Schedule template is as follows:
[0181] Within the block: the data is staged into shared memory with cache_read and rearranged there;
[0182] Between blocks: the previous logic is reused, and the result is finally written out with the stm instruction.
[0183] Example 2:
[0184] Figure 4 is the second schematic diagram of the processing logic of the operator processing method provided by the present invention. As shown in part (a) of Figure 4, the input layer (input), memory op, unary op, binary op, and output layer (output) of the neural network model are illustrated.
[0185] When the applicable data type in the tvm community is Fp32, and the data types of memory op, unary op and binary op are BF16, inserting the cast operator and layout_convert operator at the beginning and end of (a) yields the computation graph shown in (b).
[0186] Among them, the cast operator below input is used to convert the data types of memory op, unary op and binary op from BF16 to Fp32; the layout_convert operator is used to convert Col-Major to ByteObject; the memory op, unary op and binary op in the middle are ByteObject, and the operator fusion is achieved by reusing the tvm community CPU logic.
[0187] The layout_convert operator above output is used to convert ByteObject to Col-Major; the cast operator is used to convert the data types of the memory op, unary op, and binary op from Fp32 to BF16.
[0188] All the above operators are fused to obtain the computation graph shown in (c), where fused op represents the final fusion operator. Then, the fusion operator code is automatically generated to perform a coordinate transformation, i.e., BF16Col-Major → BF16Col-Major. It should be noted that this coordinate transformation is a coordinate transformation with layout, corresponding to the generation of vector instructions.
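The role of the two cast operators in Example 2 can be sketched in plain Python. The sketch relies on the fact that a BF16 value has the same bit pattern as the upper 16 bits of an Fp32 value. It converts Fp32 to BF16 by truncation, which is an assumption (hardware may round instead), and it omits the layout_convert steps to keep the focus on the casts; all function names are hypothetical.

```python
import struct


def bf16_to_fp32(bits):
    """cast BF16 -> Fp32: place the 16 BF16 bits in the upper half of an Fp32."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def fp32_to_bf16(value):
    """cast Fp32 -> BF16 by dropping the lower 16 bits (truncation, an assumption)."""
    return struct.unpack("<I", struct.pack("<f", value))[0] >> 16


def run_in_fp32(bf16_inputs, op):
    """Cast at the start, compute in Fp32, cast back at the end."""
    widened = [bf16_to_fp32(b) for b in bf16_inputs]
    results = [op(v) for v in widened]
    return [fp32_to_bf16(r) for r in results]


inputs = [fp32_to_bf16(v) for v in (1.0, -2.0, 0.5)]
outputs = run_in_fp32(inputs, lambda v: v * 2.0)
decoded = [bf16_to_fp32(b) for b in outputs]
```

The operator passed as op only ever sees Fp32 values, so an existing Fp32 implementation can be reused while the data entering and leaving the fused operator stays in BF16.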
[0189] The operator processing apparatus provided by the present invention is described below; the apparatus described below and the operator processing method described above may be referred to in correspondence with each other. Figure 5 is a schematic diagram of the operator processing device provided by the present invention. As shown in Figure 5, the operator processing device 500 includes: a first generation module 501, a first fusion module 502, and a second fusion module 503, wherein:
[0190] The first generation module 501 is used to perform memory rearrangement processing on M first operators in the neural network model based on the first memory arrangement operator to generate M second operators; the memory arrangement of each second operator is a linear memory arrangement, and M is a positive integer greater than or equal to 1;
[0191] The first fusion module 502 is used to fuse the M second operators to generate a first fusion operator;
[0192] The second fusion module 503 is used to perform operator fusion based on the first memory layout operator, the first fusion operator and the second memory layout operator to generate a target operator, which is used to execute a target operation; the target operation is related to instruction mapping.
[0193] The operator processing device provided by this invention performs memory rearrangement processing on M first operators in a neural network model using a first memory arrangement operator to generate M second operators with linear memory arrangements. The M second operators are then fused to generate a first fusion operator. Finally, operator fusion is performed based on the first memory arrangement operator, the first fusion operator, and the second memory arrangement operator to generate a target operator. This achieves the fusion of operators under different memory arrangements. The target operator is used to execute target operations related to instruction mapping. It eliminates the need to support different hardware-related scenarios for each first operator, allowing for the reuse of CPU logic, thereby significantly reducing the workload of operator development and improving operator execution performance.
[0194] Optionally, the device further includes:
[0195] The acquisition module is used to acquire the computation graph corresponding to the neural network model, wherein the computation graph contains the M first operators.
[0196] Optionally, the first generation module 501 is further configured to:
[0197] The first memory arrangement operator is inserted at the beginning of the computation graph, and the first memory arrangement operator is used to perform memory rearrangement processing on the M first operators to generate the M second operators.
[0198] Optionally, the first generation module 501 is further configured to:
[0199] The second memory arrangement operator is inserted at the end of the computation graph. Based on the second memory arrangement operator, the first fusion operator is rearranged in memory to generate the second fusion operator. The memory arrangement of the second fusion operator is a non-linear memory arrangement.
[0200] The first memory layout operator, the second fusion operator, and the second memory layout operator are fused together to generate the target operator.
[0201] Optionally, the device further includes:
[0202] The determination module is used to determine the physical coordinates of the target operator in the hardware device based on the target operator;
[0203] The second generation module is used to perform instruction mapping based on the template schedule and the physical coordinates to generate read and write instructions.
[0204] The read / write module is used to perform read / write operations on the data associated with the target operator in the hardware device based on the read / write instructions.
[0205] Optionally, the read / write module is further configured to:
[0206] Based on the memory layout of the hardware device, the data associated with the target operator is segmented to generate multiple data blocks;
[0207] Based on the read / write instructions, at least one data block is read / written in the hardware device.
[0208] Optionally, the read / write module is further configured to:
[0209] When performing read and write operations on multiple data blocks simultaneously, the vector instruction is used to perform read and write operations on each data block; the vector instruction is used to call multiple threads to perform read and write operations on multiple data blocks simultaneously.
[0210] Optionally, the read / write module is further used for any of the following:
[0211] When performing read / write operations on the internal data of any data block, the internal data is written to a shared memory area for data rearrangement to generate target data; and the target data in the shared memory area is read / written based on the vector instruction.
[0212] or,
[0213] Based on the target read / write instruction, read and write operations are performed on the internal data; the target read / write instruction is used to call a single thread to perform read and write operations on a single piece of data.
[0214] Optionally, the vector instruction includes at least one of the following:
[0215] Multiple-data load (ldm) instruction;
[0216] Multiple-data store (stm) instruction.
[0217] Optionally, the target read / write instruction includes at least one of the following:
[0218] Data load (ld) instruction;
[0219] Data store (st) instruction.
[0220] Optionally, the device further includes:
[0221] The first insertion module is used to insert a third operator at the beginning of the computation graph. The third operator includes the memory layout operator and a fourth operator. The fourth operator is used to perform operations related to data transformation and reshape.
[0222] The processing module is used to process the data associated with the M first operators using the third operator.
[0223] Optionally, the device further includes:
[0224] The second insertion module is used to insert a fifth operator at the beginning and end of the computation graph, wherein the fifth operator is used to convert the data type of the data associated with the first operator;
[0225] The conversion module is used to convert the data type of the data associated with the first operator into the target data type using the fifth operator.
[0226] Optionally, the first fusion module 502 is further configured to:
[0227] The M second operators are fused using the computational logic of the central processing unit (CPU) to generate the first fused operator.
[0228] Figure 6 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Figure 6, the electronic device may include a processor 610, a communications interface 620, a memory 630, and a communication bus 640, wherein the processor 610, the communications interface 620, and the memory 630 communicate with each other through the communication bus 640. The processor 610 can call logical instructions in the memory 630 to execute an operator processing method, which includes: rearranging the memory of M first operators in a neural network model based on a first memory arrangement operator to generate M second operators; the memory arrangement of each second operator is a linear memory arrangement, and M is a positive integer greater than or equal to 1; fusing the M second operators to generate a first fused operator; and fusing operators based on the first memory arrangement operator, the first fused operator, and the second memory arrangement operator to generate a target operator, which is used to execute a target operation; the target operation is related to instruction mapping.
[0229] Furthermore, the logical instructions in the aforementioned memory 630 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, essentially, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0230] On the other hand, the present invention also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the operator processing method provided by the above methods. The method includes: performing memory rearrangement processing on M first operators in a neural network model based on a first memory arrangement operator to generate M second operators; the memory arrangement of each second operator is a linear memory arrangement, and M is a positive integer greater than or equal to 1; fusing the M second operators to generate a first fusion operator; and performing operator fusion based on the first memory arrangement operator, the first fusion operator, and the second memory arrangement operator to generate a target operator, which is used to perform a target operation; the target operation is related to instruction mapping.
[0231] In another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the operator processing method provided by the above methods. The method includes: rearranging the memory of M first operators in a neural network model based on a first memory arrangement operator to generate M second operators; the memory arrangement of each second operator is a linear memory arrangement, where M is a positive integer greater than or equal to 1; fusing the M second operators to generate a first fusion operator; and fusing operators based on the first memory arrangement operator, the first fusion operator, and the second memory arrangement operator to generate a target operator, wherein the target operator is used to perform a target operation; and the target operation is related to instruction mapping.
[0232] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0233] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0234] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. An operator processing method, characterized in that it includes: Based on the first memory arrangement operator, the memory of M first operators in the neural network model is rearranged to generate M second operators; The M first operators correspond to different nonlinear memory arrangements; the memory arrangement of each second operator is a linear memory arrangement, where M is a positive integer greater than or equal to 1; The step of performing memory rearrangement processing on M first operators in a neural network model based on the first memory arrangement operator to generate M second operators includes: using an AI compiler to perform memory rearrangement processing on the M first operators with different nonlinear memory arrangements based on the first memory arrangement operator, so as to uniformly convert the nonlinear memory arrangements of the M first operators into linear memory arrangements, thereby generating the M second operators; The M second operators are fused to generate a first fusion operator; The first memory layout operator, the first fusion operator, and the second memory layout operator are fused to generate a target operator, which is used to execute a target operation; the target operation is related to instruction mapping; the step of fusing the first memory layout operator, the first fusion operator, and the second memory layout operator to generate the target operator includes: performing memory rearrangement processing on the first fusion operator based on the second memory layout operator to generate a second fusion operator, wherein the memory layout of the second fusion operator is a non-linear memory layout; and fusing the first memory layout operator, the second fusion operator, and the second memory layout operator to generate the target operator.
2. The operator processing method according to claim 1, characterized in that, Before performing memory rearrangement processing on the M first operators in the neural network model based on the first memory arrangement operator, the method further includes: Obtain the computation graph corresponding to the neural network model, wherein the computation graph contains the M first operators.
3. The operator processing method according to claim 2, characterized in that, The process of rearranging the memory of M first operators in the neural network model based on the first memory arrangement operator to generate M second operators includes: The first memory arrangement operator is inserted at the beginning of the computation graph, and the first memory arrangement operator is used to perform memory rearrangement processing on the M first operators to generate the M second operators.
4. The operator processing method according to claim 2, characterized in that, The operator fusion based on the first memory layout operator, the first fusion operator, and the second memory layout operator includes: The second memory arrangement operator is inserted at the end of the computation graph. Based on the second memory arrangement operator, the first fusion operator is rearranged in memory to generate the second fusion operator. The memory arrangement of the second fusion operator is a non-linear memory arrangement. The first memory layout operator, the second fusion operator, and the second memory layout operator are fused together to generate the target operator.
5. The operator processing method according to claim 1, characterized in that, The method further includes: Based on the target operator, determine the physical coordinates of the target operator in the hardware device; Based on the template schedule and the physical coordinates, instruction mapping is performed to generate read and write instructions; Based on the read / write instructions, read / write operations are performed on the data associated with the target operator in the hardware device.
6. The operator processing method according to claim 5, characterized in that, The step of performing read and write operations on the data associated with the target operator in the hardware device based on the read and write instructions includes: Based on the memory layout of the hardware device, the data associated with the target operator is segmented to generate multiple data blocks; Based on the read / write instructions, at least one data block is read / written in the hardware device.
7. The operator processing method according to claim 6, characterized in that, The step of performing read / write operations on at least one data block in the hardware device based on the read / write instructions includes: When performing read and write operations on multiple data blocks simultaneously, the vector instruction is used to perform read and write operations on each data block; the vector instruction is used to call multiple threads to perform read and write operations on multiple data blocks simultaneously.
8. The operator processing method according to claim 6, characterized in that, The read / write operation on at least one data block in the hardware device based on the read / write instruction includes any one of the following: When performing read / write operations on the internal data of any data block, the internal data is written to the shared memory area for data rearrangement to generate the target data; Based on the vector instruction, read and write operations are performed on the target data in the shared memory area; or, Based on the target read / write instruction, read and write operations are performed on the internal data; the target read / write instruction is used to call a single thread to perform read and write operations on a single piece of data.
9. The operator processing method according to claim 7 or 8, characterized in that, The vector instruction includes at least one of the following: Multiple-data load (ldm) instruction; Multiple-data store (stm) instruction.
10. The operator processing method according to claim 8, characterized in that, The target read / write instruction includes at least one of the following: Data load (ld) instruction; Data store (st) instruction.
11. The operator processing method according to any one of claims 2 to 4, characterized in that, If the data associated with the M first operators exceeds the hardware device limit, the method further includes: A third operator is inserted at the beginning of the computation graph. The third operator includes the memory layout operator and a fourth operator. The fourth operator is used to perform operations related to data transformation and reshape. The third operator is used to process the data associated with the M first operators.
12. The operator processing method according to any one of claims 2 to 4, characterized in that, If the data type associated with the first operator does not conform to the target data type, the method further includes: A fifth operator is inserted at the beginning and end of the computation graph. The fifth operator is used to convert the data type of the data associated with the first operator. The fifth operator is used to convert the data type of the data associated with the first operator into the target data type.
13. The operator processing method according to any one of claims 1 to 8, characterized in that, The step of fusing the M second operators to generate a first fusion operator includes: The M second operators are fused using the computational logic of the central processing unit (CPU) to generate the first fused operator.
14. An operator processing apparatus, characterized in that it includes: The first generation module is used to perform memory rearrangement processing on M first operators in the neural network model based on the first memory arrangement operator to generate M second operators; The M first operators correspond to different nonlinear memory arrangements; the memory arrangement of each second operator is a linear memory arrangement, where M is a positive integer greater than or equal to 1; The step of performing memory rearrangement processing on M first operators in a neural network model based on the first memory arrangement operator to generate M second operators includes: using an AI compiler to perform memory rearrangement processing on the M first operators with different nonlinear memory arrangements based on the first memory arrangement operator, so as to uniformly convert the nonlinear memory arrangements of the M first operators into linear memory arrangements, thereby generating the M second operators; The first fusion module is used to fuse the M second operators to generate a first fusion operator; The second fusion module is used to perform operator fusion based on the first memory layout operator, the first fusion operator, and the second memory layout operator to generate a target operator, wherein the target operator is used to execute a target operation; the target operation is related to instruction mapping; the step of performing operator fusion based on the first memory layout operator, the first fusion operator, and the second memory layout operator to generate the target operator includes: performing memory rearrangement processing on the first fusion operator based on the second memory layout operator to generate a second fusion operator, wherein the memory layout of the second fusion operator is a non-linear memory layout; and performing operator fusion on the first memory layout operator, the second fusion operator, and the second memory layout operator to generate the target operator.
15. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the operator processing method as described in any one of claims 1 to 13.
16. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the operator processing method as described in any one of claims 1 to 13.
17. A computer program product, comprising a computer program, characterized in that, When the computer program is executed by a processor, it implements the operator processing method as described in any one of claims 1 to 13.