A Winograd convolution optimization method and system suitable for ARMv8 multi-core architecture
By optimizing the data layout and parallel strategy of Winograd convolution on the ARMv8 multi-core architecture, the problems of insufficient cache locality utilization and strided memory access are solved, improving computational efficiency and parallel efficiency, and achieving efficient convolution computation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INST OF SOFTWARE - CHINESE ACAD OF SCI
- Filing Date
- 2024-08-12
- Publication Date
- 2026-06-30
Figure CN119106710B_ABST
Abstract
Description
Technical Field
[0001] This application belongs to the field of deep learning technology, and specifically relates to a Winograd convolution optimization method and system suitable for ARMv8 multi-core architecture.
Background Technology
[0002] Winograd convolution uses the Winograd minimal filtering algorithm. By dividing the input data into multiple tiles and mapping them from the spatial domain to the Winograd domain, it reduces the number of arithmetic operations that a convolution requires and thus improves computational efficiency. Taking two-dimensional convolution as an example, F(m×m, r×r) denotes computing m×m outputs with an r×r convolution kernel, where the input tile has m+r-1 rows and columns. Let $(\tilde{x}, \tilde{y})$ denote the row and column coordinates of an input tile and b the batch index; the output tile Y corresponding to the b-th batch and the k-th output channel can then be expressed as
[0003] $$Y_{b,k,\tilde{x},\tilde{y}} = A^{T}\left[\sum_{c} \left(G\, g_{k,c}\, G^{T}\right) \odot \left(B^{T}\, d_{b,c,\tilde{x},\tilde{y}}\, B\right)\right] A = A^{T}\left[\sum_{c} U_{k,c} \odot V_{b,c,\tilde{x},\tilde{y}}\right] A$$
[0004] Here, ⊙ denotes element-wise multiplication; B, G, and A are the transformation matrices of the input, the filter, and the output, respectively; $g_{k,c}$ holds the element values of the filter tensor F for the k-th output channel and the c-th input channel; and $d_{b,c,\tilde{x},\tilde{y}}$ holds the element values of the input tile at tile coordinates $(\tilde{x}, \tilde{y})$ in batch b and the c-th input channel. $U_{k,c} = G\, g_{k,c}\, G^{T}$ and $V_{b,c,\tilde{x},\tilde{y}} = B^{T}\, d_{b,c,\tilde{x},\tilde{y}}\, B$ are the filter tile and the input tile (for the b-th batch, k-th output channel, and c-th input channel) mapped to the Winograd domain by multiplication with the transformation matrices. Winograd convolution thus performs element-wise multiplication on the input and filter data in the Winograd domain after the domain transformation.
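As a concrete check of the formulation Y = Aᵀ[(G g Gᵀ) ⊙ (Bᵀ d B)]A, the following minimal Python sketch evaluates one F(2×2, 3×3) tile for a single input channel and compares it with direct convolution. The matrices BT, G, and AT are the commonly used F(2×2, 3×3) transformation matrices and are an assumption here, since the text does not list them; all function names are hypothetical.

```python
# Sketch only: single-channel F(2x2, 3x3) Winograd tile, pure Python.
# BT, G, AT are the widely used F(2x2, 3x3) matrices (assumed, not from the text).
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def winograd_tile(d, g):
    """d: 4x4 input tile, g: 3x3 filter; returns the 2x2 output tile."""
    u = matmul(matmul(G, g), transpose(G))      # filter tile in the Winograd domain
    v = matmul(matmul(BT, d), transpose(BT))    # input tile in the Winograd domain
    m = [[u[i][j] * v[i][j] for j in range(4)] for i in range(4)]  # element-wise product
    return matmul(matmul(AT, m), transpose(AT))  # back to the spatial domain


def direct_tile(d, g):
    """Reference: direct (sliding window) 3x3 convolution producing 2x2 outputs."""
    return [[sum(d[i + p][j + q] * g[p][q] for p in range(3) for q in range(3))
             for j in range(2)] for i in range(2)]
```

With several input channels, the element-wise products are summed over c before the output transformation is applied.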
[0005] Element-wise multiplication is a level-1 BLAS operation. To increase arithmetic intensity, the expression above can be rewritten as follows:
[0006] $$M^{(x,y)}_{k,\xi} = \sum_{c} U^{(x,y)}_{k,c}\, V^{(x,y)}_{c,\xi}$$
[0007] Here, (x, y) denotes the coordinates of the elements taking part in the element-wise multiplication, ξ is the original coordinate, U and V are the filter and input tiles in the Winograd domain, and M denotes the convolution result in the Winograd domain. For each (x, y), this expression can be viewed as a matrix multiplication.
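The rewriting can be illustrated with a small Python sketch: accumulating element-wise products over the input channels gives the same result as running one small matrix multiplication per element coordinate. The array shapes and function names below are illustrative assumptions; the element coordinates (x, y) are flattened into a single index e.

```python
# Sketch only: element-wise accumulation over channels versus one GEMM per element.
# U[k][c][e]: filter tiles, V[c][n][e]: input tiles, e = flattened element coordinate.

def accumulate_elementwise(U, V):
    """M[k][n][e] = sum over c of U[k][c][e] * V[c][n][e] (level-1 style)."""
    K, C, N, E = len(U), len(V), len(V[0]), len(V[0][0])
    return [[[sum(U[k][c][e] * V[c][n][e] for c in range(C)) for e in range(E)]
             for n in range(N)] for k in range(K)]


def per_element_gemm(U, V):
    """Same result, computed as E independent (K x C) by (C x N) matrix products."""
    K, C, N, E = len(U), len(V), len(V[0]), len(V[0][0])
    M = [[[0] * E for _ in range(N)] for _ in range(K)]
    for e in range(E):
        Ue = [[U[k][c][e] for c in range(C)] for k in range(K)]  # K x C slice
        Ve = [[V[c][n][e] for n in range(N)] for c in range(C)]  # C x N slice
        for k in range(K):
            for n in range(N):
                M[k][n][e] = sum(Ue[k][c] * Ve[c][n] for c in range(C))
    return M
```

The second form exposes a reduction over c inside each product, which is what lets a GEMM microkernel reuse loaded data across many multiply-accumulate operations.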
[0008] Currently, Winograd convolution can be divided into three stages: the transformation of the input and the convolution kernel, matrix multiplication, and the inverse transformation of the output result. Examples include the existing implementation disclosed by Zhen Jia et al. ① (Zhen Jia, Aleksandar Zlateski, Fredo Durand, and Kai Li. 2018. Optimizing N-dimensional, winograd-based convolution for manycore CPUs. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 109–123.) and the existing implementation disclosed by Xueying Wang et al. ② (Xueying Wang, Guangli Li, Zhen Jia, Xiaobing Feng, and Yida Wang. 2024. Fast convolution meets low precision: Exploring efficient quantized Winograd convolution on modern CPUs. ACM Transactions on Architecture and Code Optimization 21, 1 (2024), 1–26.). However, in both existing implementations, the transformed data layout considers only block partitioning and vectorization and does not guarantee contiguous memory access for matrix multiplication. The matrix multiplication only optimizes cache blocking and does not optimize the microkernel configuration. The transformation stage uses the same method for different Winograd tile sizes and does not consider register reuse in the overlapping areas between tiles.
[0009] In summary, the main drawbacks of existing technical solutions include the following aspects:
[0010] (1) Winograd convolution is performed with a non-fused model: the input transformation, computation, and output transformation are processed as three independent stages, or only some of the stages are coupled, so cache locality cannot be fully exploited.
[0011] (2) When matrix multiplication is used for the Winograd-domain computation, strided memory access introduces additional domain transformation overhead, and existing implementations that perform the domain transformation with intrinsics cannot reduce this overhead sufficiently.
[0012] (3) The existing data layouts for matrix-multiplication-based Winograd-domain computation consider only block partitioning, which limits performance.
[0013] (4) The parallel strategies adopted are not optimized for specific layers of the network, resulting in low parallel efficiency in some layers.
Summary of the Invention
[0014] To address the technical problems of existing Winograd convolution implementations on ARM, such as insufficient cache locality utilization, additional overhead introduced by strided memory access, and insufficient refinement of parallel strategies, this invention discloses a Winograd convolution optimization method and system suitable for ARMv8 multi-core architecture based on the data characteristics of common convolutional neural network models, which can significantly improve the computational efficiency of the Winograd algorithm.
[0015] To achieve the above objectives, the technical solution of the present invention includes the following:
[0016] A Winograd convolution optimization method suitable for ARMv8 multi-core architecture, the method comprising:
[0017] Determine the tile block value, the input channel block value, and the output channel block value, and divide the input data into tile blocks according to the tile block value;
[0018] The global domain transformation of the filter is completed by a double nested loop, and the global domain transformation result is stored in the FilterOut array according to the data layout of the first matrix multiplication.
[0019] Traverse the tile blocks with the tile block value as the step, and perform intra-block input transformation, coupled matrix multiplication, and output transformation on each tile block to obtain the Winograd convolution optimization result; wherein performing the intra-block input transformation, coupled matrix multiplication, and output transformation on each tile block to obtain the Winograd convolution optimization result includes:
[0020] Traverse with the input channel block value as the step and, in each iteration, call the input transformation kernel to perform the input domain transformation, storing the result in the TransInOut temporary array according to the data layout of the second matrix multiplication;
[0021] Traverse with the output channel block value as the step, performing L matrix multiplications in each iteration. Each matrix multiplication traverses with the input channel block value as the step and calls GEMM to multiply the (tile block value) × (input channel block value) block of the TransInOut temporary array by the (input channel block value) × (output channel block value) block of the FilterOut array, accumulating the results and storing them in the temporary array GEMMOut according to the data layout of the matrix multiplication result. The output transformation kernel then processes the matrix multiplication results in GEMMOut to obtain the Winograd convolution optimization result. Here, L denotes the number of elements in a Winograd convolution tile.
[0022] Further, determining the tile block value, the input channel block value, and the output channel block value includes:
[0023] Determining the tile block value, the input channel block value, and the output channel block value on the principle of minimizing data movement overhead for the given cache capacity, and constraining the three values according to the characteristics of the input and output channels in convolutional neural networks.
[0024] Further, completing the global domain transformation of the filter through a double nested loop includes:
[0025] The outer loop traverses with the output channel block value as the step;
[0026] The inner loop traverses with the input channel block value as the step;
[0027] The filter transformation kernel is called inside the loop to perform the transformation iteratively, where one iteration transforms data of size L × (input channel block value) × (output channel block value).
[0028] Furthermore, when the number of rows and columns of the output tile matrix is m=2, the step of calling the input transformation kernel to complete the input domain transformation includes:
[0029] Step 4.1: Load the L=16 elements of a Winograd convolution tile into vector registers v0~v15, where each element comprises as many data values along the input channel C direction as a vector register can hold, all loaded into the same vector register;
[0030] Step 4.2: Left-multiply the Winograd convolution tile held in the L vector registers by the matrix Bᵀ; after vector registers v0, v1, v8, and v9 are released, use them together with vector registers v16 to v27 to store the temporary result of left-multiplying the tile by Bᵀ, where B denotes the input transformation matrix;
[0031] Step 4.3: Right-multiply the temporary result of left-multiplying the tile by Bᵀ by the matrix B, and store the result in vector registers v16 to v31;
[0032] Step 4.4: Store the results in vector registers v16 to v31 into the TransInOut temporary array;
[0033] Step 4.5: After loading the non-overlapping data from the next Winograd convolution tile into vector registers v0, v1, v4, v5, v8, v9, v12, and v13, execute steps 4.2-4.4 and store the result corresponding to the next Winograd convolution tile into the TransInOut temporary array.
[0034] Step 4.6: After loading the non-overlapping data of the next Winograd convolution tile into vector registers v2, v3, v6, v7, v10, v11, v14, and v15, execute steps 4.2-4.4, and store the result corresponding to the next Winograd convolution tile into the TransInOut temporary array.
[0035] Step 4.7: Return to step 4.5 until all Winograd convolution tiles have been processed.
[0036] Furthermore, when the number of rows and columns of the output tile matrix is m=6, the step of calling the input transformation kernel to complete the input domain transformation includes:
[0037] Perform the d×B transformation on the i-th row of data of a Winograd convolution tile to obtain a result row, where the result row is computed from shared calculation factors; d denotes the input tile matrix used in the input domain transformation, each input element is identified by its row i and column n in the input tile (n ranging over the 8 columns of the tile), and B denotes the input transformation matrix;
[0038] Store the result as one row of data in the temporary array tmp, whose size is 8 × 8 × (the number of input channels spanned by one row of data processed in a single iteration), that channel count being expressed in terms of the number of data items that can be stored in a vector register;
[0039] Left-multiply the temporary array tmp column by column by the matrix Bᵀ, and store the corresponding results in the TransInOut temporary array.
[0040] Further, the data layout of the first matrix multiplication, the data layout of the second matrix multiplication, and the data layout of the matrix multiplication result are each defined in terms of the following quantities: K, the number of output channels of the convolutional layer; C, the number of input channels of the convolutional layer; the two parameters of the matrix multiplication microkernel; and the number of floating-point numbers stored in a vector register.
[0041] Further, the method also includes:
[0042] When the number of rows and columns of the output tile matrix is m=2, the input transformation kernel first traverses the image width W of the input tensor of the convolutional network and then the image height H, until one vector register's worth of elements along the input channel C direction has been transformed for every tile in the tile block; finally, it traverses the input channel direction;
[0043] When the number of rows and columns of the output tile matrix is m=6, the input transformation kernel first completes the transformation of a whole tile along the input channel direction, and then traverses the image width W and height H directions of the input tensor of the convolutional network to transform all tiles;
[0044] Within the filter transformation kernel, the values loaded into the same vector register span the output channel K, and traversal follows the order ….
[0045] Furthermore, the implementation process of the matrix multiplication microkernel includes:
[0046] Under the constraints on the two microkernel parameters, determine the parameter values of the optimal microkernel, which maximize the computation-to-memory-access ratio;
[0047] Based on the characteristics of convolutional neural networks, select the parameter values of a suboptimal microkernel;
[0048] For both the selected optimal microkernel and the suboptimal microkernel, implement the matrix multiplication microkernel operation using a ping-pong strategy.
[0049] Furthermore, the method also includes:
[0050] In the shallow convolutional layers of the convolutional neural network model, OpenMP is used to parallelize the outermost loop of the coupled Winograd algorithm, which traverses the tiles with the tile block value as the step, and the maximum number of threads is set in terms of the total number of tiles generated from the input tensor;
[0051] In the deep convolutional layers of the convolutional neural network model, the tile block value is fixed and OpenMP is used to parallelize over both the C and K dimensions, with the maximum number of threads set in terms of C, the number of input channels of the convolutional layer, and K, the number of output channels of the convolutional layer;
[0052] In the intermediate convolutional layers of the convolutional neural network model, a Pthreads thread pool and lock-free task queues are used for multidimensional parallelism, and atomic snapshots are used for synchronization between stages. Specifically, the T-dimension subtasks form the initial subtasks of the task queue, and each T-dimension subtask pushes C- and K-dimension subtasks into the queue to achieve parallelism; the maximum number of threads is set empirically.
[0053] A Winograd convolution optimization system for ARMv8 multi-core architecture, the system comprising:
[0054] A block value determination module, used to determine the tile block value, the input channel block value, and the output channel block value, and to divide the input data into tile blocks according to the tile block value;
[0055] The filter conversion module is used to complete the global domain conversion of the filter through a double nested loop, and store the global domain conversion result in the FilterOut array according to the data layout of the first matrix multiplication.
[0056] A Winograd convolution optimization module, used to traverse the tile blocks with the tile block value as the step and to perform intra-block input transformation, coupled matrix multiplication, and output transformation on each tile block to obtain the Winograd convolution optimization result; wherein performing the intra-block input transformation, coupled matrix multiplication, and output transformation on each tile block to obtain the Winograd convolution optimization result includes:
[0057] Traverse with the input channel block value as the step and, in each iteration, call the input transformation kernel to perform the input domain transformation, storing the result in the TransInOut temporary array according to the data layout of the second matrix multiplication;
[0058] Traverse with the output channel block value as the step, performing L matrix multiplications in each iteration. Each matrix multiplication traverses with the input channel block value as the step and calls GEMM to multiply the (tile block value) × (input channel block value) block of the TransInOut temporary array by the (input channel block value) × (output channel block value) block of the FilterOut array, accumulating the results and storing them in the temporary array GEMMOut according to the data layout of the matrix multiplication result. The output transformation kernel then processes the matrix multiplication results in GEMMOut to obtain the Winograd convolution optimization result. Here, L denotes the number of elements in a Winograd convolution tile.
[0059] Compared with the prior art, the present invention has at least the following beneficial effects.
[0060] 1. This invention improves cache locality in the convolution calculation process by coupling the three stages of the Winograd algorithm and making full use of the three-level cache of the ARM architecture.
[0061] 2. This invention combines the transformation of input data, output data, and filter data in the spatial domain and the Winograd domain with the data layout designed for the matrix multiplication calculation process, thereby alleviating the memory access overhead caused by converting data to GEMM format.
[0062] 3. This invention features a meticulously designed matrix multiplication microkernel that utilizes ping-pong technology and a customized data layout to maintain continuous memory access and overlap data loading and computation during the calculation process, thereby greatly improving computational efficiency.
[0063] 4. This invention achieves high parallel efficiency for each layer of convolution by adapting the parallel mode to the data features of different layers of the convolutional network.
[0064] 5. This invention improves the overall computational efficiency of Winograd convolution, achieving better results than state-of-the-art (SOTA) libraries in both single-core and multi-core tests.
Attached Figure Description
[0065] Figure 1 is a diagram of the algorithm framework of the present invention.
[0066] Figure 2 is an overview of the entire process of convolving one Winograd tile.
[0067] Figure 3 is a schematic diagram of the assembly-level register reuse optimization for the input transformation when m=2.
[0068] Figure 4 is a schematic diagram of the input data layout for contiguous memory access.
[0069] Figure 5 is a schematic diagram of the filter data layout for contiguous memory access.
[0070] Figure 6 is a diagram of the vector register arrangement of the (4,16) microkernel.
[0071] Figure 7 is a diagram of the vector register arrangement of the (7,8) microkernel.
[0072] Figure 8 compares the single-core convolution runtime of the present invention with that of the prior art.
[0073] Figure 9 compares the 16-core and 32-core convolution runtimes of the present invention with those of the prior art.
[0074] Figure 10 is a flowchart of the further coupling of the matrix multiplication and output transformation steps in the present invention.
Detailed Implementation
[0075] The present invention will now be described in further detail with reference to the accompanying drawings. The examples given are only for explaining the present invention and are not intended to limit the scope of the present invention.
[0076] The present invention provides a convolutional neural network optimization method based on the Winograd algorithm. Figure 1 shows the overall algorithm flow: a fused blocking method for Winograd convolution that couples the input transformation, matrix multiplication, and output transformation stages. The three stages are first coupled by partitioning the tiles into blocks, and matrix multiplication and output transformation are then coupled further by partitioning the output channels K into blocks. The inputs and outputs of the algorithm, and their data storage formats, are the same as those of ordinary convolution (…). The input and output specifications in Figure 1 give the storage formats of the input tensor, filter tensor, and output tensor.
[0077] The specific steps of the fusion and segmentation strategy are described below:
[0078] 1) A heuristic method determines the block sizes from the cache capacity: the tile block value, the input channel block value, and the output channel block value.
[0079] The heuristic in the present invention chooses block sizes that minimize data movement overhead for the given cache capacity. Taking into account the characteristics of the input and output channels in convolutional neural networks, the channel block values in Figure 1 are configured to be integer multiples of 16, which restricts the block values, avoids potential edge cases, and simplifies the computation.
[0080] 2) To exploit cache locality, create a temporary array TransInOut of size L × (tile block value) × C and allocate memory for it to hold the results of transforming the input data within a block to the Winograd domain; create a temporary array GEMMOut of size L × (tile block value) × (output channel block value) and allocate memory for it to hold the matrix multiplication results within a block.
[0081] 3) Using a double nested loop, the outer loop traverses with the output channel block value as the step and the inner loop traverses with the input channel block value as the step. The filter transformation kernel is called in each iteration and transforms data of size L × (input channel block value) × (output channel block value) per iteration, completing the global domain transformation of the filter; the result is stored in the FilterOut[L×C×K] array according to the matrix multiplication data layout. In inference-only mode, the weights of the neural network are pre-trained and do not change, so this step can be carried out and stored in advance as preprocessing.
[0082] 4) Traverse the tile blocks with the tile block value as the step, performing within each block the input transformation followed by the coupled matrix multiplication and output transformation:
[0083] 4.1) Traverse with the input channel block value as the step, calling the input transformation kernel to complete the input domain transformation and storing the result in the temporary array TransInOut of step 2) according to the data layout designed in the present invention; one iteration processes data of size (tile block value) × (input channel block value).
[0084] 4.2) Traverse with the output channel block value as the step to perform the further coupled matrix multiplication and output transformation.
[0085] The process of step 4.2), shown in Figure 10, completes the matrix multiplication and output transformation for a tile block. bk is the iteration index over the output channel K, and each iteration processes data of size L × (tile block value) × (output channel block value), as follows:
[0086] a. Perform L matrix multiplications in a loop. Each matrix multiplication traverses with the input channel block value as the step and calls the GEMM kernel to perform an intra-block matrix multiplication whose operands are the (tile block value) × (input channel block value) block of the TransInOut temporary array and the (input channel block value) × (output channel block value) block of the FilterOut array; the accumulated result, a (tile block value) × (output channel block value) block, is stored in the GEMMOut temporary array.
[0087] If the Winograd-domain computation uses element-wise multiplication (TEWMM) instead of matrix multiplication, the data can still be packed into a layout that gives TEWMM contiguous memory access, with the packing combined with the domain transformation.
[0088] b. Call the output conversion kernel to perform output conversion on the matrix multiplication result data in the GEMMOut temporary array.
[0089] 5) Handling tile edge cases.
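The loop structure of steps 1) to 4) can be sketched in Python as follows. To stay short, this sketch makes several assumptions that are not in the text: it uses the one-dimensional F(2, 3) algorithm (so L = 4) instead of the two-dimensional case, it uses the commonly used F(2, 3) matrices, it stores the arrays as nested lists in a plain [l][tile][channel] order instead of the patented layouts, and it omits the edge handling of step 5). The names t_blk, c_blk, and k_blk are hypothetical names for the tile, input channel, and output channel block values.

```python
# Sketch only: fused blocking for 1D F(2,3) Winograd convolution (pure Python).
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]
L, M_OUT = 4, 2  # elements per tile, outputs per tile


def matvec(mat, vec):
    return [sum(a * x for a, x in zip(row, vec)) for row in mat]


def fused_winograd_1d(inp, flt, t_blk, c_blk, k_blk):
    """inp[c][x], flt[k][c][j]; assumes (width - 2) is a multiple of 2."""
    C, W, K = len(inp), len(inp[0]), len(flt)
    T = (W - 2) // M_OUT  # total number of tiles
    # Step 3: global filter transformation into FilterOut[l][c][k].
    filter_out = [[[0.0] * K for _ in range(C)] for _ in range(L)]
    for k0 in range(0, K, k_blk):
        for c0 in range(0, C, c_blk):
            for k in range(k0, min(k0 + k_blk, K)):
                for c in range(c0, min(c0 + c_blk, C)):
                    u = matvec(G, flt[k][c])
                    for l in range(L):
                        filter_out[l][c][k] = u[l]
    out = [[0.0] * (T * M_OUT) for _ in range(K)]
    for t0 in range(0, T, t_blk):  # Step 4: traverse tile blocks
        tb = min(t_blk, T - t0)
        trans_in_out = [[[0.0] * C for _ in range(tb)] for _ in range(L)]
        for c0 in range(0, C, c_blk):  # Step 4.1: input transformation
            for c in range(c0, min(c0 + c_blk, C)):
                for t in range(tb):
                    start = (t0 + t) * M_OUT
                    v = matvec(BT, inp[c][start:start + L])
                    for l in range(L):
                        trans_in_out[l][t][c] = v[l]
        for k0 in range(0, K, k_blk):  # Step 4.2: GEMM coupled with output transformation
            kb = min(k_blk, K - k0)
            gemm_out = [[[0.0] * kb for _ in range(tb)] for _ in range(L)]
            for l in range(L):  # a. L matrix multiplications, blocked over C
                for c0 in range(0, C, c_blk):
                    for t in range(tb):
                        for k in range(kb):
                            gemm_out[l][t][k] += sum(
                                trans_in_out[l][t][c] * filter_out[l][c][k0 + k]
                                for c in range(c0, min(c0 + c_blk, C)))
            for t in range(tb):  # b. output transformation
                for k in range(kb):
                    y = matvec(AT, [gemm_out[l][t][k] for l in range(L)])
                    pos = (t0 + t) * M_OUT
                    out[k0 + k][pos:pos + M_OUT] = y
    return out


def direct_conv_1d(inp, flt):
    """Reference: direct 1D convolution with a width-3 filter."""
    C, W, K = len(inp), len(inp[0]), len(flt)
    return [[sum(inp[c][x + j] * flt[k][c][j] for c in range(C) for j in range(3))
             for x in range(W - 2)] for k in range(K)]
```

The point of the structure is that trans_in_out and gemm_out only ever hold one block, so they can stay cache-resident between the stage that writes them and the stage that reads them.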
[0090] To control the vector registers precisely and make full use of software prefetching, the present invention implements in assembly language the filter transformation kernel of step 3), the input transformation kernel of step 4.1), the GEMM kernel of step 4.2) a, and the output transformation kernel of step 4.2) b.
[0091] Figure 2 illustrates how the present invention processes one tile of input data through the three stages of Winograd convolution (filter / input transformation, matrix multiplication, and output transformation). Gray rectangles represent the input, filter, and output data blocks to be processed; darker blocks indicate data that is loaded into the same vector register and processed together. The important symbols in Figure 2 have the following meanings:
[0092] The number of floating-point numbers that a vector register can hold. On ARMv8 processors the vector registers are 128 bits wide, so a register holds 4 single-precision or 2 double-precision floating-point numbers.
[0093] L: the number of elements in a Winograd convolution tile, i.e., L = (m + r - 1) × (m + r - 1).
[0094] The total number of tiles.
[0095] The two parameters of the matrix multiplication microkernel: each microkernel call multiplies a number of rows of matrix V given by the first parameter by a number of columns of matrix U given by the second parameter, producing a sub-block of the result matrix of that size.
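The arithmetic behind these symbols is small enough to check directly. The sketch below computes the lane count of a 128-bit register and the tile element count L, and tests whether one tile's inputs and results fit in the 32 vector registers at once; the function names, and the assumption of one register per tile element for inputs and one for results, are illustrative.

```python
# Sketch only: lane counts, tile sizes, and a simple register-fit check.
VECTOR_REGISTER_BITS = 128   # ARMv8 Advanced SIMD register width
NUM_VECTOR_REGISTERS = 32    # v0..v31


def lanes(value_bits):
    """Number of values of the given width that one vector register holds."""
    return VECTOR_REGISTER_BITS // value_bits


def tile_elements(m, r):
    """L = (m + r - 1)^2 for F(m x m, r x r)."""
    return (m + r - 1) ** 2


def fits_in_registers(m, r):
    """True if inputs plus results of one tile fit in the register file (assumed model)."""
    return 2 * tile_elements(m, r) <= NUM_VECTOR_REGISTERS
```

For a 3×3 filter this gives L = 16 for m = 2 and L = 64 for m = 6, which is why the two tile sizes need different input transformation schemes.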
[0096] The following section details the optimization strategies for the three stages of Winograd convolution in this invention, as well as the parallel strategies applicable to the three-stage fusion algorithm framework of this invention.
[0097] 1. Domain Transformation
[0098] Because the value of L affects how the assembly-level vector registers are used, the present invention handles the transformations for m=2 and m=6 differently. The main difference between the two methods is whether the 32 vector registers suffice to complete, in a single pass, the transformation of one tile spanning multiple channels.
[0099] Method 1: When m=2, L=16, and the number of vector registers suffices to hold, across multiple channels, the 16 inputs and the 16 transformation results, so the transformation of one tile can be processed at a time. The Winograd convolution input transformation follows the overlap-add (OLA) method, with data stored in row-major order. The present invention exploits the elements shared by adjacent tiles to reuse registers, greatly reducing the number of elements that must be loaded, as shown in Figure 3. Figure 3 shows the register arrangement used in Method 1 to compute the input transformation of a single tile in assembly language, where the numbers are vector register indices and each vector register holds as many channel elements as fit in one register. The specific processing steps are as follows:
[0100] a. Initial iteration: load the L=16 elements into vector registers v0~v15, and use the remaining registers to store the intermediate results of left-multiplying the tile by Bᵀ during the transformation (the register indices in the upper-left rectangle of Figure 3) and the results after the transformation (the register indices in the upper-right rectangle of Figure 3).
[0101] b. Left-multiply the tile held in vector registers v0~v15 by Bᵀ; registers v0, v1, v8, and v9 are then freed and used to store temporary results of the left multiplication (the last row of register indices in the upper-left rectangle of Figure 3).
[0102] c. Right-multiply the temporary result of the left multiplication (upper-left rectangle of Figure 3) by B and store the result in vector registers v16~v31;
[0103] d. Store the results in vector registers v16~v31 into the corresponding positions in the temporary array TransInOut in step 4.1);
[0104] e. Load the non-overlapping data of the next tile: the values of the registers outlined by the dashed box in Figure 3 can be reused, so only the non-overlapping elements of the next tile need to be loaded, into vector registers v0, v1, v4, v5, v8, v9, v12, and v13 (the 8 register indices to the right of the dashed box in Figure 3).
[0105] f. Perform steps b, c, and d on the newly loaded tile, then load the non-overlapping data of the following tile. Unlike step e, the registers reused in this iteration are the opposite of those in Figure 3: the values in registers v0, v1, v4, v5, v8, v9, v12, and v13 are reused, while registers v2, v3, v6, v7, v10, v11, v14, and v15 are loaded with the non-overlapping elements of the next tile. Steps b, c, and d are then executed on each newly loaded tile, with steps e and f alternating to load the non-overlapping elements of each subsequent tile.
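The alternating reuse in steps e and f can be modeled in a few lines of Python. In this sketch, regs[r][s] stands for vector register v(4r+s), each scalar stands for one register's worth of channel data, and tiles slide horizontally with stride 2 as in F(2×2, 3×3). The function name and the load counter are hypothetical and serve only to show that each tile after the first needs 8 loads instead of 16.

```python
# Sketch only: horizontal register reuse between adjacent 4x4 input tiles.

def stream_tiles(rows, n_tiles):
    """rows: 4 input rows with at least 2 * n_tiles + 2 columns each.

    Returns the list of logical 4x4 tiles and the number of register loads.
    """
    # Step a: the initial tile loads all 16 registers (columns 0..3).
    regs = [[rows[r][s] for s in range(4)] for r in range(4)]
    loads = 16
    order = [0, 1, 2, 3]  # slot order that yields the tile's columns left to right
    tiles = [[[regs[r][s] for s in order] for r in range(4)]]
    for t in range(1, n_tiles):
        odd = t % 2 == 1
        # Step e (odd tiles) refills slots 0,1: v0, v1, v4, v5, v8, v9, v12, v13.
        # Step f (even tiles) refills slots 2,3: v2, v3, v6, v7, v10, v11, v14, v15.
        new_slots = (0, 1) if odd else (2, 3)
        for r in range(4):
            for i, s in enumerate(new_slots):
                regs[r][s] = rows[r][2 * t + 2 + i]  # only the two new columns
                loads += 1
        # The reused slots now hold the tile's two leftmost columns.
        order = [2, 3, 0, 1] if odd else [0, 1, 2, 3]
        tiles.append([[regs[r][s] for s in order] for r in range(4)])
    return tiles, loads
```

For n tiles in a row the sketch performs 16 + 8(n − 1) loads, against 16n without reuse.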
[0106] Method 2: When m=6, L=64, which exceeds the number of available vector registers. The present invention therefore processes one row (8 elements) per iteration, spanning enough channels to make full use of the vector registers. So that data stored in row-major order is accessed contiguously, the d×B step of the input transformation is executed first, with the temporary result stored in a temporary array tmp of size 8×8×8; tmp is then left-multiplied column by column by Bᵀ, and the result is stored in the temporary array TransInOut of step 4.1).
[0107] By exploiting the special structure of the transformation matrix B when m=6 and extracting common calculation factors, the overall amount of computation can be reduced. The following formula shows the simplified d×B transformation of the i-th row of the input tile, which yields one row of the temporary array tmp; the formula uses shared calculation factors, and the input element in row i and column n of the input tile appears with n ranging over the 8 columns of the tile. The left multiplication of the temporary array tmp by Bᵀ can be simplified with shared factors in the same way, in which case the result is a column of the matrix obtained by transforming the input tile into the Winograd domain.
[0108]
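To illustrate how shared factors cut the work of a row transformation, the following Python sketch applies d×B to one 8-element row in two ways: as a plain matrix product and with six shared intermediate factors. The matrix BT6 is the commonly used F(6×6, 3×3) input transformation matrix and is an assumption, as is the particular factoring; neither is claimed to match the formula of the present invention.

```python
# Sketch only: d x B for one row of an 8x8 tile, naive versus shared-factor form.
# BT6 is the widely used F(6x6, 3x3) input transform (assumed, not from the text).
BT6 = [
    [1, 0, -5.25, 0, 5.25, 0, -1, 0],
    [0, 1, 1, -4.25, -4.25, 1, 1, 0],
    [0, -1, 1, 4.25, -4.25, -1, 1, 0],
    [0, 0.5, 0.25, -2.5, -1.25, 2, 1, 0],
    [0, -0.5, 0.25, 2.5, -1.25, -2, 1, 0],
    [0, 2, 4, -2.5, -5, 0.5, 1, 0],
    [0, -2, 4, 2.5, -5, -0.5, 1, 0],
    [0, -1, 0, 5.25, 0, -5.25, 0, 1],
]


def row_times_b_naive(d):
    """(d x B)[j] = sum over n of d[n] * B[n][j], with B the transpose of BT6."""
    return [sum(BT6[j][n] * d[n] for n in range(8)) for j in range(8)]


def row_times_b_factored(d):
    """Same result using shared even/odd factors for output pairs (1,2), (3,4), (5,6)."""
    d0, d1, d2, d3, d4, d5, d6, d7 = d
    e12 = d2 + d6 - 4.25 * d4
    o12 = d1 + d5 - 4.25 * d3
    e34 = d6 + 0.25 * d2 - 1.25 * d4
    o34 = 0.5 * d1 - 2.5 * d3 + 2 * d5
    e56 = d6 + 4 * d2 - 5 * d4
    o56 = 2 * d1 - 2.5 * d3 + 0.5 * d5
    return [
        d0 - d6 + 5.25 * (d4 - d2),
        e12 + o12, e12 - o12,
        e34 + o34, e34 - o34,
        e56 + o56, e56 - o56,
        d7 - d1 + 5.25 * (d3 - d5),
    ]
```

Each pair of outputs shares one even and one odd factor, so the pair costs one addition and one subtraction once the factors are available.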
[0109] 2. Matrix Multiplication Data Layout
[0110] The present invention designs a matrix-multiplication-friendly data layout that ensures contiguous memory access in the matrix multiplication step of Winograd convolution. The layout after the input transformation and the layout after the filter transformation are shown in Figures 4 and 5, respectively. The layout is produced as part of transforming the input / filter into the Winograd domain.
[0111] To mitigate the impact of strided memory access during the transformation, when m=2 the input transformation traverses the W direction first; when m=6, the traversal order is …. In the data layout after the filter transformation, η is the fastest-changing direction, the values loaded into the same vector register span the output channel K, and traversal follows the order ….
[0112] Once the intra-block matrix multiplication of size L × (tile block value) × (output channel block value) is complete, the output transformation begins, to make better use of cache locality. The data layout of the temporary array GEMMOut is likewise chosen to give the matrix multiplication contiguous memory access, and it also allows the output transformation to load data efficiently.
[0113] 3. Matrix Multiplication Microkernel
[0114] This invention designs the matrix multiplication microkernel around the typical dimensions of the input data in commonly used convolutional neural network models. The microkernel size is chosen to achieve a high computation-to-memory-access ratio while minimizing edge-case handling. To accommodate single-precision floating-point SIMD loads, [symbol missing] is set to satisfy [condition missing]. Because vector registers are needed to load the transformed input and filter data, to hold the output data, and to support the "ping-pong" technique that this invention uses to reduce pipeline stalls, [symbols missing] must satisfy the following condition:
[0115]
[0116] Applying the above constraints yields the optimal microkernel size, (7,8), which maximizes the computation-to-memory-access ratio. As a convolutional neural network gets deeper, the dimension T generally decreases while C and K increase, and edge cases along the T dimension become more time-consuming in deeper layers. This invention therefore also selects (4,16) as a suboptimal microkernel, which reduces the edge-handling overhead in deep networks (where C and K exceed T by an order of magnitude). Both microkernel configurations use all 32 vector registers.
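The selection of a microkernel size under a register budget can be sketched as a small search. The budget used below is an assumption inferred from the register assignment described for the (4,16) configuration (result registers, plus two "ping-pong" sets each for input and filter data); the inequality of the patent itself is not reproduced in this text. The ratio formula and all function names are likewise hypothetical.

```python
THETA = 4        # single-precision floats per 128-bit vector register
NUM_VREGS = 32   # vector registers available on ARMv8


def registers_needed(mr, nr):
    """Assumed budget: accumulators + 2 input sets + 2 filter sets."""
    accumulators = mr * nr // THETA
    input_sets = 2 * mr               # one register per microkernel row, doubled
    filter_sets = 2 * (nr // THETA)   # nr filter values per step, doubled
    return accumulators + input_sets + filter_sets


def compute_to_memory_ratio(mr, nr):
    """Assumed metric: multiply-adds per loaded element in one reduction step."""
    return (mr * nr) / (mr + nr)


def feasible_kernels():
    """All (mr, nr) with nr a multiple of THETA that fit in the register file."""
    sizes = []
    for nr in range(THETA, 4 * NUM_VREGS + 1, THETA):
        for mr in range(1, NUM_VREGS + 1):
            if registers_needed(mr, nr) <= NUM_VREGS:
                sizes.append((mr, nr))
    return sizes


best = max(feasible_kernels(), key=lambda s: compute_to_memory_ratio(*s))
```

Under these assumptions the search selects (7,8), and both (7,8) and (4,16) consume exactly 32 registers, which is consistent with the sizes named in the description.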
[0117] Figure 6 and Figure 7 show how the "ping-pong" technique and the register allocation are implemented in the two microkernels, abbreviated as (4,16) and (7,8), respectively. In the figures, "#" marks the pipeline stage number in the "ping-pong" strategy, the numbers are vector register indices, and the rectangles on the left, top, and bottom right represent the input, filter, and result data, respectively.
[0118] In the (4,16) configuration, registers v0~v3 and v4~v7 form two sets for loading input data, registers v8~v11 and v12~v15 form two sets for loading filter data, and the remaining registers (v16~v31) hold the corresponding results. A set of input registers is released after four pipeline stages, whereas a set of filter registers is released after every stage. At the start of microkernel execution, v0~v3 are loaded with input data and v8~v15 with filter data. In each subsequent pipeline stage, four input elements are prefetched into one register of the other input set (v4~v7) and 16 filter elements are prefetched into the other filter set, which keeps sufficient distance between the load and compute instructions that touch the same data.
[0119] In the (7,8) configuration, input prefetching is organized in groups of four pipeline stages: 8 elements are prefetched in each of the first three stages of a group and 4 elements in the last stage.
[0120] Note that the optimal and suboptimal microkernels, (7,8) and (4,16), are designed for single-precision floating point. The same optimization and configuration method extends to other precisions by substituting the number of elements that a vector register can hold (for example, 2 for double-precision floating point).
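The "ping-pong" idea can be illustrated with a simplified double-buffered microkernel. This is a sketch only: it models two buffer sets with Python lists instead of vector registers, prefetches one full reduction step at a time, and does not reproduce the four-stage grouping of the input registers described above. The function name and the panel format are hypothetical.

```python
def microkernel_pingpong(a_panel, b_panel, mr, nr):
    """Accumulate an mr x nr result block.

    a_panel[k] holds mr input values and b_panel[k] holds nr filter values
    for reduction step k.
    """
    depth = len(a_panel)
    acc = [[0.0] * nr for _ in range(mr)]
    if depth == 0:
        return acc
    bufs = [None, None]
    bufs[0] = (a_panel[0], b_panel[0])          # prologue: fill the first set
    for k in range(depth):
        cur = k % 2
        if k + 1 < depth:
            # Load the next step into the idle set while the current set
            # is consumed by the multiply-adds below.
            bufs[1 - cur] = (a_panel[k + 1], b_panel[k + 1])
        a, b = bufs[cur]
        for i in range(mr):
            for j in range(nr):
                acc[i][j] += a[i] * b[j]
    return acc
```

In an assembly implementation the load into the idle set and the multiply-adds on the active set are interleaved instruction by instruction, which is where the reduction in pipeline stalls comes from; the Python version only shows the buffer alternation.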
[0121] 4. Parallel Strategy
[0122] For the framework shown in Figure 1, this invention proposes a multidimensional parallelization method that combines OpenMP and Pthreads and adapts the parallel mode to the problem size.
[0123] Specifically, shallow convolutional layers have a large T and relatively small C and K. For these layers, this invention uses OpenMP to parallelize the outermost loop of the coupled Winograd algorithm, which traverses the tiles with the tile block value as its step size; the maximum number of threads is set to [value missing] to minimize cache contention. In deep layers, T is relatively small and parallelizing over it yields little gain. For these layers, the tile block value is set to [value missing] and OpenMP is used to parallelize over both the C and K dimensions, with the maximum number of threads set to [value missing]. In intermediate layers, the C, K, and T dimensions are all suitable for parallelization, but OpenMP nested parallelism introduces additional overhead. For these layers, this invention uses a Pthreads thread pool and a lock-free task queue for multidimensional parallelism, with atomic snapshots synchronizing the stages. The T-dimension subtasks are the initial entries of the task queue; each T-dimension subtask then pushes C- and K-dimension subtasks into the queue. The total number of subtasks is [value missing], and the maximum number of threads is empirically set to [value missing].
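The task-queue mode for intermediate layers can be sketched as follows. This is a scheduling illustration only: Python threads and the lock-based queue.Queue stand in for the Pthreads thread pool and the lock-free queue, the atomic-snapshot synchronization between stages is omitted, and the function name and block counts are hypothetical.

```python
import queue
import threading


def run_task_queue(n_t, n_c, n_k, n_threads=4):
    """Process every (t, c, k) block exactly once and return the sorted list."""
    tasks = queue.Queue()
    done = []
    done_lock = threading.Lock()

    def worker():
        while True:
            task = tasks.get()
            if task is None:                 # sentinel: stop this worker
                tasks.task_done()
                return
            if task[0] == "T":
                # A T-dimension subtask expands into C- and K-dimension subtasks.
                _, t = task
                for c in range(n_c):
                    for k in range(n_k):
                        tasks.put(("CK", t, c, k))
            else:
                _, t, c, k = task
                with done_lock:
                    done.append((t, c, k))   # placeholder for the real block work
            tasks.task_done()

    for t in range(n_t):                     # initial entries: T-dimension subtasks
        tasks.put(("T", t))
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for th in threads:
        th.start()
    tasks.join()                             # also waits for subtasks pushed later
    for _ in threads:
        tasks.put(None)
    for th in threads:
        th.join()
    return sorted(done)
```

Because each T-dimension subtask enqueues its children before it is marked finished, the queue never appears empty while work remains, so the join call returns only after all blocks are processed.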
[0124] In summary, this invention couples the three stages of Winograd convolution (input/filter transformation, matrix multiplication, and output transformation) to obtain a fused-block traversal strategy that improves cache locality.
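The coupled three-stage traversal can be sketched on a deliberately reduced problem. The example below uses one-dimensional F(2,3) Winograd convolution with the standard transformation matrices rather than the two-dimensional F(m×m, r×r) of the invention, plain nested lists rather than the packed layouts, and hypothetical function names and block values; only the loop structure (global filter transformation, then per tile block: input transformation, L blocked matrix multiplications, output transformation) follows the description.

```python
def transform_input(d):      # B^T d for one 4-element tile
    return [d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]]


def transform_filter(g):     # G g for one 3-tap filter
    return [g[0], (g[0] + g[1] + g[2]) / 2, (g[0] - g[1] + g[2]) / 2, g[2]]


def transform_output(m):     # A^T m gives 2 outputs
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]


def winograd_fused(x, w, t_blk, c_blk, k_blk):
    """x[c][i]: input, w[k][c][j]: 3-tap filters. Returns y[k][i]."""
    C, K, L = len(x), len(w), 4
    T = (len(x[0]) - 2) // 2                     # number of 2-output tiles
    # Global filter transformation, blocked over K (outer) and C (inner).
    filter_out = [[[0.0] * C for _ in range(K)] for _ in range(L)]
    for k0 in range(0, K, k_blk):
        for c0 in range(0, C, c_blk):
            for k in range(k0, min(k0 + k_blk, K)):
                for c in range(c0, min(c0 + c_blk, C)):
                    for l, u in enumerate(transform_filter(w[k][c])):
                        filter_out[l][k][c] = u
    y = [[0.0] * (2 * T) for _ in range(K)]
    for t0 in range(0, T, t_blk):                # traverse tile blocks
        tb = min(t_blk, T - t0)
        # Stage 1: input transformation for this tile block only.
        trans_in_out = [[[0.0] * tb for _ in range(C)] for _ in range(L)]
        for c0 in range(0, C, c_blk):
            for c in range(c0, min(c0 + c_blk, C)):
                for t in range(tb):
                    i = 2 * (t0 + t)
                    for l, v in enumerate(transform_input(x[c][i:i + 4])):
                        trans_in_out[l][c][t] = v
        for k0 in range(0, K, k_blk):
            kb = min(k_blk, K - k0)
            # Stage 2: L matrix multiplications, accumulated over C blocks.
            gemm_out = [[[0.0] * tb for _ in range(kb)] for _ in range(L)]
            for l in range(L):
                for c0 in range(0, C, c_blk):
                    for k in range(kb):
                        for c in range(c0, min(c0 + c_blk, C)):
                            for t in range(tb):
                                gemm_out[l][k][t] += (
                                    filter_out[l][k0 + k][c] * trans_in_out[l][c][t])
            # Stage 3: output transformation right after the block's GEMMs.
            for k in range(kb):
                for t in range(tb):
                    o = transform_output([gemm_out[l][k][t] for l in range(L)])
                    y[k0 + k][2 * (t0 + t)], y[k0 + k][2 * (t0 + t) + 1] = o
    return y
```

The temporary arrays trans_in_out and gemm_out are sized by the block values rather than by the whole tensor, which is the property that lets them stay cache-resident between stages.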
[0125] This invention designs a microkernel for matrix multiplication in the Winograd domain. Based on the typical dimensions of the input data of convolutional neural network models, two microkernel sizes are designed to achieve a high computation-to-memory-access ratio while keeping edge-case handling to a minimum: an optimal microkernel, which has the highest computation-to-memory-access ratio permitted by the number of vector registers needed for computation and data loading; and a suboptimal microkernel, which is chosen to reduce edge-case handling along the tile partitioning direction in deep networks. The microkernel uses a "ping-pong" strategy that divides the work into pipeline stages and the vector registers into groups, so that loading of input and filter data overlaps with computation and pipeline stalls are reduced.
[0126] This invention packs the input data and filter data involved in matrix multiplication in a Z-shaped order determined by the microkernel size and the block size. This layout makes memory access contiguous during matrix multiplication and thereby improves computational efficiency. Data in this layout is used directly as microkernel input, and the packing is integrated with the Winograd domain transformation.
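Panel packing of this general kind can be sketched as follows. The exact layout formulas of the invention are not reproduced in this text, so the example shows only a generic scheme: a row-major matrix is rewritten so that each column panel as wide as the microkernel is stored contiguously, row by row. The function name and the zero padding of the last panel are assumptions.

```python
def pack_panels(mat, width):
    """Pack a row-major matrix (list of rows) into contiguous column panels.

    Each panel covers `width` columns and is stored row by row, so a
    microkernel that consumes `width` values per step reads memory linearly.
    The last panel is zero-padded if the column count is not a multiple of width.
    """
    rows, cols = len(mat), len(mat[0])
    packed = []
    for c0 in range(0, cols, width):         # one panel per microkernel column block
        for r in range(rows):                # walk down the panel
            for c in range(c0, c0 + width):
                packed.append(mat[r][c] if c < cols else 0.0)
    return packed
```

For a 2×3 matrix [[1, 2, 3], [4, 5, 6]] and a panel width of 2, the packed order is 1, 2, 4, 5 for the first panel, followed by 3, 0, 6, 0 for the padded second panel.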
[0127] The Winograd algorithm of this invention divides the input into multiple tiles. Depending on whether the available vector registers can simultaneously hold twice the number of elements in a tile, one of two methods is selected to implement the domain transformation at the assembly level. In the first method, a single iteration processes a tile with each vector register holding four elements across the input channels, and the elements shared by adjacent tiles are exploited to reuse registers. In the second method, a single iteration transforms only a subset of the elements of a tile (here, a single row) and increases the number of elements taken across the input channels so that the vector registers are fully used. In this case, the special structure of the transformation matrix is used to extract common computational factors, which reduces the computational cost.
[0128] This invention proposes a three-mode parallel strategy for the fused-block algorithm framework. The strategy is multidimensional, combining OpenMP and Pthreads to adapt the parallel mode to the problem scale.
[0129] Figure 8 and Figure 9 compare the runtime of the Winograd convolution implemented by this invention with that of NCNN and NNPACK on each convolutional layer of three models: FusionNet, VGG-16, and ResNet-50. Red represents this invention, orange represents NCNN, and blue represents NNPACK. In the single-core case, this invention achieves speedups of 1.21 to 2.35 times over NCNN and 1.30 to 2.39 times over NNPACK. In the 16-thread case, the geometric mean speedups over NCNN and NNPACK are 1.47 and 1.66 times, respectively; in the 32-thread case, they are 2.06 and 1.59 times, respectively.
[0130] Although specific embodiments of the invention have been disclosed for illustrative purposes to aid in understanding and implementing the invention, those skilled in the art will understand that various substitutions, variations, and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the content disclosed in the preferred embodiments, and the scope of protection claimed by the invention is defined by the claims.
Claims
1. A Winograd convolution optimization method suitable for ARMv8 multi-core architecture, characterized in that the method includes:
determining a tile block value, an input channel block value, and an output channel block value, and dividing the input data according to the tile block value to obtain tile blocks;
completing the global domain transformation of the filter by a double nested loop, and storing the global domain transformation result in the FilterOut array according to a first matrix multiplication data layout;
traversing the tile blocks according to the tile block value, and performing intra-block input transformation, coupled matrix multiplication, and output transformation on each tile block to obtain the Winograd convolution optimization result;
wherein performing intra-block input transformation, coupled matrix multiplication, and output transformation on each tile block to obtain the Winograd convolution optimization result includes: traversing with the input channel block value as the step size and, in each iteration, calling the input transformation kernel to complete the input domain transformation and storing the result in the TransInOut temporary array according to a second matrix multiplication data layout; traversing with the output channel block value as the step size and performing L matrix multiplications in each iteration, where each matrix multiplication traverses with the input channel block value as the step size, calls GEMM to multiply the [size missing] block of the TransInOut temporary array by the [size missing] block of the FilterOut array, accumulates the results, and stores them in the temporary array GEMMOut according to the data layout of the matrix multiplication result; and then processing the matrix multiplication results in the temporary array GEMMOut with the output transformation kernel to obtain the Winograd convolution optimization result, where L is the number of elements in a Winograd convolution tile;
wherein determining the tile block value, the input channel block value, and the output channel block value includes: determining the three values from the cache capacity so as to minimize data movement overhead, and further constraining them according to the characteristics of the input and output channels in convolutional neural networks;
the first matrix multiplication data layout is [formula missing]; the second matrix multiplication data layout is [formula missing]; the data layout of the matrix multiplication result is [formula missing]; where the symbols in these layouts denote, respectively, the number of output channels of the convolutional layer, the number of input channels of the convolutional layer, the parameters of the matrix multiplication microkernel, and the number of floating-point numbers stored in a vector register;
the method further includes: when the number of rows and columns of the output tile matrix is m=2, the input transformation kernel first traverses the image width W of the input tensor of the convolutional network and then the image height H, until the elements spanning one vector register in the input channel C direction have been transformed for all tiles within the tile block, and finally traverses the input channel direction; when the number of rows and columns of the output tile matrix is m=6, the input transformation kernel first completes the whole transformation of a tile along the input channel direction and then traverses the image width W and height H of the input tensor of the convolutional network to transform all tiles;
within the filter transformation kernel, the values loaded into the same vector register span the output channel K, and the iteration follows the order [order missing].
2. The method according to claim 1, characterized in that completing the global domain transformation of the filter by a double nested loop includes: an outer loop that traverses with the output channel block value as the step size; an inner loop that traverses with the input channel block value as the step size; and, inside the loops, calling the filter transformation kernel to perform the transformation iteratively, where one iteration processes data of size L × [symbol missing] × [symbol missing].
3. The method according to claim 1, characterized in that when the number of rows and columns of the output tile matrix is m=2, calling the input transformation kernel to complete the input domain transformation includes:
Step 4.1: Load the L=16 elements of a Winograd convolution tile into vector registers v0~v15, where each register holds one tile element for as many input channels C as a vector register has data slots;
Step 4.2: Left-multiply the Winograd convolution tile held in the L vector registers by the matrix B^T; after vector registers v0, v1, v8, and v9 are released, store the temporary result of the left multiplication in them together with vector registers v16 to v27, where B denotes the transformation matrix of the input;
Step 4.3: Right-multiply the temporary result of the left multiplication by the matrix B and store the result in vector registers v16 to v31;
Step 4.4: Store the results in vector registers v16 to v31 into the TransInOut temporary array;
Step 4.5: After loading the non-overlapping data of the next Winograd convolution tile into vector registers v0, v1, v4, v5, v8, v9, v12, and v13, execute steps 4.2-4.4 and store the result for that tile into the TransInOut temporary array;
Step 4.6: After loading the non-overlapping data of the next Winograd convolution tile into vector registers v2, v3, v6, v7, v10, v11, v14, and v15, execute steps 4.2-4.4 and store the result for that tile into the TransInOut temporary array;
Step 4.7: Return to step 4.5 until all Winograd convolution tiles have been processed.
4. The method according to claim 1, characterized in that when the number of rows and columns of the output tile matrix is m=6, calling the input transformation kernel to complete the input domain transformation includes: performing a d×B transformation on the i-th row of data of a Winograd convolution tile to obtain a result whose transpose is [formula missing], using eight calculation factors [formulas missing]; where d denotes the input tile matrix used in the input domain transformation, the indexed symbol denotes the element in the i-th row and n-th column of the input tile, n ranges over the eight columns of the tile, and B denotes the input transformation matrix; storing the result as one row of the 8×8×[symbol missing] temporary array tmp, where [symbol missing] is the number of data items that a vector register can hold, that is, the number of input channels spanned by one row of data processed in a single iteration; and left-multiplying the temporary array tmp column by column by the matrix [symbol missing] and storing the corresponding results in the TransInOut temporary array.
5. The method according to claim 1, characterized in that the implementation of the matrix multiplication microkernel includes: determining, under constraints on the two microkernel parameters, the parameter values of the optimal microkernel that maximize the computation-to-memory-access ratio, the constraints including [formula missing] and [formula missing]; selecting, based on the characteristics of convolutional neural networks, the parameter values of the suboptimal microkernel; and implementing the matrix multiplication microkernel operation with a ping-pong strategy for both the optimal and the suboptimal parameter values.
6. The method according to claim 1, characterized in that the method further includes:
in the shallow convolutional layers of the convolutional neural network model, using OpenMP to parallelize the outermost loop of the coupled Winograd algorithm, which traverses the tiles with the tile block value as the step size, and setting the maximum number of threads to [value missing], where [symbol missing] denotes the total number of tiles generated from the input tensor;
in the deep convolutional layers of the convolutional neural network model, setting the tile block value to [value missing] and using OpenMP to parallelize over both the C and K dimensions, with the maximum number of threads set to [value missing], where [symbol missing] denotes the number of input channels of the convolutional layer and [symbol missing] denotes the number of output channels of the convolutional layer;
in the intermediate convolutional layers of the convolutional neural network model, using a Pthreads thread pool and a lock-free task queue for multidimensional parallelism, with atomic snapshots synchronizing the stages; specifically, the T-dimension subtasks are the initial entries of the task queue, each T-dimension subtask pushes C- and K-dimension subtasks into the queue to achieve parallelization, the total number of subtasks is [value missing], and the maximum number of threads is empirically set to [value missing].
7. A Winograd convolution optimization system suitable for ARMv8 multi-core architecture, characterized in that the system includes:
a block value determination module, used to determine a tile block value, an input channel block value, and an output channel block value, and to divide the input data into tile blocks according to the tile block value;
a filter transformation module, used to complete the global domain transformation of the filter by a double nested loop and to store the global domain transformation result in the FilterOut array according to a first matrix multiplication data layout;
a Winograd convolution optimization module, used to traverse the tile blocks according to the tile block value and to perform intra-block input transformation, coupled matrix multiplication, and output transformation on each tile block to obtain the Winograd convolution optimization result;
wherein performing intra-block input transformation, coupled matrix multiplication, and output transformation on each tile block to obtain the Winograd convolution optimization result includes: traversing with the input channel block value as the step size and, in each iteration, calling the input transformation kernel to complete the input domain transformation and storing the result in the TransInOut temporary array according to a second matrix multiplication data layout; traversing with the output channel block value as the step size and performing L matrix multiplications in each iteration, where each matrix multiplication traverses with the input channel block value as the step size, calls GEMM to multiply the [size missing] block of the TransInOut temporary array by the [size missing] block of the FilterOut array, accumulates the results, and stores them in the temporary array GEMMOut according to the data layout of the matrix multiplication result; and then processing the matrix multiplication results in the temporary array GEMMOut with the output transformation kernel to obtain the Winograd convolution optimization result, where L is the number of elements in a Winograd convolution tile;
wherein determining the tile block value, the input channel block value, and the output channel block value includes: determining the three values from the cache capacity so as to minimize data movement overhead, and further constraining them according to the characteristics of the input and output channels in convolutional neural networks;
the first matrix multiplication data layout is [formula missing]; the second matrix multiplication data layout is [formula missing]; the data layout of the matrix multiplication result is [formula missing]; where the symbols in these layouts denote, respectively, the number of output channels of the convolutional layer, the number of input channels of the convolutional layer, the parameters of the matrix multiplication microkernel, and the number of floating-point numbers stored in a vector register;
wherein, when the number of rows and columns of the output tile matrix is m=2, the input transformation kernel first traverses the image width W of the input tensor of the convolutional network and then the image height H, until the elements spanning one vector register in the input channel C direction have been transformed for all tiles within the tile block, and finally traverses the input channel direction; when the number of rows and columns of the output tile matrix is m=6, the input transformation kernel first completes the whole transformation of a tile along the input channel direction and then traverses the image width W and height H of the input tensor of the convolutional network to transform all tiles;
within the filter transformation kernel, the values loaded into the same vector register span the output channel K, and the iteration follows the order [order missing].