Feature map processing method and device based on systolic array and storage medium

CN116090518B · Active · Publication Date: 2026-06-26 · ZHEJIANG DAHUA TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: ZHEJIANG DAHUA TECH CO LTD
Filing Date: 2023-01-05
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

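The former reduces system processing latency by increasing parallelism and reducing the number of iterations, but it suffers from high fan-in / fan-out caused by directly sharing input data, which lowers the final system speed, and its computation mode generalizes poorly to convolution kernels of different sizes. The latter converts the convolution and fully connected operations in a CNN into matrix multiplication, but how to provide a general convolution kernel computation method has a significant impact on system performance.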



Abstract

The application provides a feature map processing method and device based on a systolic operation array, and a computer-readable storage medium. The feature map processing method comprises: obtaining an input feature map to be processed and a convolution kernel used for a three-dimensional convolution operation on the input feature map; decomposing the convolution kernel into 1×1-dimensional sub-convolution kernels along the width direction and the height direction of the convolution kernel; performing, along the channel direction of the input feature map and the convolution kernel, parallel operations on feature values of corresponding position points in the width-height plane of the input feature map and weight values of the sub-convolution kernels by using the systolic operation array to obtain partial convolution results; and accumulating the partial convolution results to obtain the corresponding feature values of an output feature map. In this way, the feature map processing device converts convolution kernels of different sizes into a data segmentation, scheduling control, and hardware implementation scheme for 1×1 convolution kernel calculation, and solves the compatibility problem of computing with different convolution kernels.

Description

Technical Field

[0001] This application relates to the field of deep learning technology, and in particular to a feature map processing method, apparatus, and computer-readable storage medium based on a systolic operation array.

Background Technology

[0002] With the development of information technologies such as the Internet of Things, cloud computing, and big data, and driven by sensor data and by computing platforms such as graphics processing units (GPUs), deep learning has received widespread attention in industry. Convolutional neural networks (CNNs) in particular have become a research hotspot in fields such as image classification, object detection, and semantic segmentation, and have achieved excellent results. The convolutional and fully connected layers account for most of the computation in a CNN, and the size of the convolution kernel varies from algorithm to algorithm, so accelerating these computations effectively is particularly important. As the computational scale and complexity of CNN models increase, traditional CPU (Central Processing Unit) platforms can no longer meet practical requirements. Therefore, using computing platforms such as GPUs and FPGAs (Field-Programmable Gate Arrays) to accelerate CNN models has attracted widespread attention. Compared with GPUs, FPGAs offer high energy efficiency, easy reconfiguration, rapid iteration, and ease of deployment on mobile edge devices, which makes them better suited to the rapidly evolving needs of deep learning algorithms.

[0003] Current FPGA-based CNN accelerator implementations mainly fall into two categories: loop-unrolled parallel computation and systolic array computation. The former reduces system latency by increasing parallelism and decreasing the number of iterations, but the direct sharing of input data causes high fan-in / fan-out, which lowers the achievable system speed, and the approach generalizes poorly to different convolution kernel sizes. The latter converts the convolution and fully connected operations in CNNs into matrix multiplication, but how to provide a general convolution kernel computation method has a significant impact on system performance. Furthermore, CNN model inference involves a large amount of data, including model parameters and input data, while FPGA on-chip resources are limited and cannot hold the hundreds of megabytes of parameters or data often found in CNNs. CNN accelerators are designed for highly parallel computation, which requires extremely high data throughput, and convolution involves many iterations, which leads to frequent data exchange. Therefore, an efficient and general convolution kernel computation scheme and CNN accelerator device are particularly important for the CNN accelerator field.

Summary of the Invention

[0004] This application provides a feature map processing method, apparatus, and computer-readable storage medium based on a systolic arithmetic array.

[0005] This application provides a feature map processing method based on a systolic arithmetic array, the method comprising:

[0006] Obtain the input feature map to be processed and the convolution kernel for performing three-dimensional convolution operations on the input feature map;

[0007] The convolution kernel is decomposed into 1×1 dimension sub-convolution kernels along the width and height directions of the convolution kernel;

[0008] Along the channel direction of the input feature map and the convolution kernel, the systolic operation array is used to perform parallel operations on the feature values of corresponding points in the width and height plane of the input feature map and the weight values of the sub-convolution kernel to obtain partial convolution results;

[0009] The partial convolution results are accumulated to obtain the corresponding feature values of the output feature map.

[0010] The method further includes:

[0011] According to the preset input channel parallelism, the feature values of each position point of the input feature map are divided into multiple feature value groups along the channel direction, wherein the number of feature values in each feature value group is equal to the input channel parallelism.

[0012] Based on the input channel parallelism, the weight values of the sub-convolution kernel are divided into multiple weight value groups corresponding to the feature value groups along the channel direction, wherein the corresponding feature value groups and weight value groups form operation group pairs;

[0013] The step of performing parallel operations on the feature values of corresponding points in the width and height plane of the input feature map and the weight values of the sub-convolution kernel using the systolic operation array along the channel direction of the input feature map and the convolution kernel includes:

[0014] The systolic operation array is used to perform parallel operations on the operation group pairs, wherein operations within the same operation group pair are performed in parallel or operations on different operation group pairs are performed in parallel.

[0015] The step of performing parallel operations on the operation group pairs using the systolic operation array includes:

[0016] For the same sub-convolution kernel, the input feature map is traversed sequentially along one of the width and height directions and the channel direction, taking the operation group pairs as units.

[0017] The step of dividing the feature values of each position point of the input feature map into multiple feature value groups along the channel direction according to the preset input channel parallelism includes:

[0018] The input feature map is cached from external memory with the feature value group as the read unit, sequentially along one of the width and height directions of the input feature map, the channel direction, and the other of the width and height directions of the input feature map. The cached input feature map is then fed into the systolic operation array for computation.

[0019] The step of caching the input feature map from external memory with the feature value group as the read unit, sequentially along one of the width and height directions of the input feature map, the channel direction, and the other of the width and height directions of the input feature map, includes:

[0020] The input feature map is cached in stages from the external memory in either the width or height direction of the input feature map.

[0021] The number of convolutional kernels is multiple, corresponding to the number of channels in the output feature map;

[0022] The method further includes:

[0023] According to the preset output channel parallelism, the multiple convolutional kernels are divided into multiple convolutional kernel groups, wherein the number of convolutional kernels in each convolutional kernel group is equal to the output channel parallelism;

[0024] The step of performing parallel operations on the feature values of corresponding points in the width and height plane of the input feature map and the weight values of the sub-convolution kernel using the systolic operation array along the channel direction of the input feature map and the convolution kernel includes:

[0025] For each convolution kernel within the same convolution kernel group, the step of performing parallel operations on the operation group pairs using the systolic computing array is executed in parallel.

[0026] Wherein, the parallelism of the output channel is equal to the parallelism of the input channel, and is not less than 1.

[0027] The systolic computation array comprises multiple computation units arranged in an array. Prior to the step of performing parallel computations on the feature values of the input feature map and the weight values of the sub-convolution kernels using the systolic computation array along the channel direction of the input feature map and the convolution kernel, the method further includes:

[0028] The feature values within the same feature value group are fed in parallel along the row direction of the systolic computing array to different rows of the systolic computing array, so that each feature value is transmitted along its respective row direction.

[0029] The weight values corresponding to each convolution kernel within the same convolution kernel group are fed in parallel into different columns of the systolic computation array, so that the weight values within each weight value group are transmitted along their respective column directions.

[0030] The calculation unit is configured to perform a product operation on the corresponding feature value and the weight value, and then add the product result to the output result of the previous level calculation unit input along the column direction to obtain its own output result.
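The behaviour of one calculation unit and of a column of such units can be illustrated with a small functional model. This is a minimal sketch, not the hardware design of this application: it ignores the cycle-by-cycle timing skew of a real systolic array, and the names `processing_element` and `systolic_vector_matrix` are hypothetical. It assumes one row per feature value of a feature value group and one column per convolution kernel of a convolution kernel group.

```python
def processing_element(feature, weight, psum_in):
    """One calculation unit: multiply, then add the partial sum that
    arrives from the previous-level unit along the column direction."""
    return psum_in + feature * weight


def systolic_vector_matrix(features, weights):
    """Functional (untimed) model of the array.

    features: the values of one feature value group, one per array row.
    weights:  weights[c][r] is the weight of kernel c for channel r,
              so each kernel occupies one array column.
    Returns one partial convolution result per column (per kernel).
    """
    outputs = []
    for column_weights in weights:
        psum = 0  # nothing enters the first unit of the column
        for feature, weight in zip(features, column_weights):
            psum = processing_element(feature, weight, psum)
        outputs.append(psum)
    return outputs


if __name__ == "__main__":
    # 3 rows (channels) and 2 columns (kernels), illustrative values only.
    print(systolic_vector_matrix([1, 2, 3], [[1, 0, 0], [1, 1, 1]]))
```

Each column output is the dot product of the feature value group with one kernel's weight value group, which is the vector-matrix form that the partial convolution results take.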

[0031] This application also provides a feature map processing apparatus, which includes a processor and a memory. The memory stores program data, and the processor executes the program data to implement the feature map processing method described above.

[0032] This application also provides a computer-readable storage medium for storing program data, which, when executed by a processor, is used to implement the feature map processing method described above.

[0033] The beneficial effects of this application are as follows: The feature map processing device acquires the input feature map to be processed and the convolution kernel used for three-dimensional convolution operations on the input feature map; it decomposes the convolution kernel into 1×1 dimensional sub-convolution kernels along the width and height directions of the convolution kernel; along the channel directions of the input feature map and the convolution kernel, it uses a systolic operation array to perform parallel operations on the feature values of corresponding points in the width and height plane of the input feature map and the weight values of the sub-convolution kernels to obtain partial convolution results; and it accumulates the partial convolution results to obtain the corresponding feature values of the output feature map. Through the above method, the feature map processing device converts convolution kernels of different sizes into a data segmentation, scheduling control, and hardware implementation scheme for 1×1 convolution kernel computation, solving the compatibility problem of computation with different convolution kernels.

Attached Figure Description

[0034] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort. Wherein:

[0035] Figure 1 is a schematic flowchart of an embodiment of the feature map processing method provided in this application;

[0036] Figure 2 is a schematic diagram of the CNN accelerator device system architecture provided in this application;

[0037] Figure 3 is a schematic diagram of the convolutional layer calculation provided in this application;

[0038] Figure 4 is a detailed flowchart of step S13 of the feature map processing method shown in Figure 1;

[0039] Figure 5 is a schematic diagram of the input channel direction priority storage strategy provided in this application;

[0040] Figure 6 is a schematic diagram of the scheduling of internal convolution computation provided in this application;

[0041] Figure 7 is a schematic diagram illustrating the computational principle of the general convolution kernel provided in this application;

[0042] Figure 8 is a schematic diagram of the structure of the systolic array unit provided in this application;

[0043] Figure 9 is a schematic diagram of convolution kernel computation based on the systolic array provided in this application;

[0044] Figure 10 is a schematic diagram of an embodiment of the feature map processing apparatus provided in this application;

[0045] Figure 11 is a schematic diagram of an embodiment of the computer-readable storage medium provided in this application.

Detailed Implementation

[0046] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of the embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.

[0047] The feature map processing method based on systolic arithmetic array provided in this application aims to solve the following problems:

[0048] 1. Poor compatibility of CNN accelerators with convolution kernels of different sizes.

[0049] 2. Deploying CNN accelerators on resource-constrained FPGAs leads to frequent memory accesses and high bandwidth pressure, because input data and weights are reused.

[0050] 3. Processing unit structures that share input data and weights have high fan-in / fan-out, which lowers the final system speed.

[0051] Please refer to Figure 1 and Figure 2. Figure 1 is a flowchart of an embodiment of the feature map processing method provided in this application, and Figure 2 is a schematic diagram of the CNN accelerator device system architecture provided in this application.

[0052] The feature map processing method of this application is applied to a feature map processing apparatus, which can be a server or a system consisting of a server and a terminal device working together. Accordingly, the various parts of the feature map processing apparatus, such as units, subunits, modules, and submodules, can all be located in the server, or they can be located separately in the server and the terminal device.

[0053] Furthermore, the aforementioned server can be either hardware or software. When the server is hardware, it can be implemented as a distributed server cluster consisting of multiple servers, or as a single server. When the server is software, it can be implemented as multiple software programs or software modules, such as software or software modules used to provide distributed servers, or as a single software program or software module; no specific limitation is made here. In some possible implementations, the feature map processing method of this application embodiment can be implemented by a processor calling computer-readable instructions stored in memory.

[0054] This application proposes a general convolution kernel CNN accelerator architecture design and device based on a systolic computation array; its system architecture is shown in Figure 2. A CNN accelerator typically accelerates the computationally intensive convolutional or fully connected layers in a neural network model. The following describes the functions of each component and the data processing method of the CNN accelerator in this application, using an input feature map of size 256×256×32 and the processing of one convolutional layer as an example. Input feature maps of other sizes are processed similarly and are not described in detail here.

[0055] As shown in Figure 2, the input feature map data serves as the input to a convolutional layer or a fully connected layer and is used, together with the corresponding weight parameters, for the convolution or fully connected computation. The input feature map and the weight parameters of each convolutional and fully connected layer in the model are stored in external memory (e.g., DDR SDRAM, double data rate synchronous dynamic random-access memory).

[0056] It should be noted that the forward computation of a CNN accelerator is usually referred to as inference. The CNN accelerator mentioned in this application is a single-engine architecture, that is, a computing engine that supports general convolution kernel operations. The CNN accelerator schedules and configures the computing engine to realize the inference operation of all layers of the neural network model by using the feature information (such as the convolution kernel size) of different layers of the neural network model.

[0057] Therefore, before each layer performs inference, the software system needs to configure the hardware module parameter registers of the programmable logic (e.g., FPGA).

[0058] For example, during convolution operations, the software system configures the working registers of the Direct Memory Access (DMA) controller via a bus, such as the Avalon bus or APB bus, to control the DMA to move data (i.e., the input feature map and weight parameters in this application) from external DDR memory to the FPGA. Since FPGA devices have limited storage resources, the accelerator device needs input feature map buffer units and weight buffer units to buffer portions of the input feature map and weight parameter data, respectively. After these two portions of data undergo several convolution operations, the DMA reads the next batch of data from DDR.

[0059] The key innovations of this application lie in the input feature map caching unit, the input feature map systolic loading control, the weight caching unit, and the weight systolic loading control. Specifically, by sharing and reusing the input feature maps in the caching section and cyclically reading the weights, combined with a systolic array computation engine, a general convolutional kernel CNN accelerator architecture system is achieved, which will be described in detail below:

[0060] The systolic array unit is the main computational unit. The accelerator device of this application converts the convolution of the input feature map and the weights into vector-matrix multiplication and performs the computation on a systolic array. Each result is a partial sum of the convolution. This partial sum is therefore temporarily stored in the accumulator cache unit and accumulated with the next batch of results until all input feature map data corresponding to the current weight parameters has been processed and the final result is output.

[0061] The bias module adds the bias parameters to the convolution result along the output channel direction. The result is then processed non-linearly by an activation function such as ReLU or Leaky ReLU and output to the next-level module. At this point, the calculation of one convolutional layer is complete.

[0062] Typically, the layer following a convolutional layer is a pooling layer or an element-wise operation layer, such as a concatenation layer. The pooling / element-wise operation processing unit in Figure 2 is mainly responsible for pooling and element-wise operations. Its output is written back to DDR via DMA and used as the input feature map of the next convolutional layer. This sequence of convolutional layer, bias, activation function, and pooling / element-wise operation is repeated until all layers of the CNN model have been computed.

[0063] It should be further explained that the CNN accelerator device in this application is a single-engine architecture. That is, the CNN model is grouped offline, and each group includes operations such as a convolutional layer, bias, activation function, and pooling / element-wise operation. If a group does not contain one of these operations, the corresponding processing unit can be bypassed by setting its bypass enable. After the soft core has scheduled the engine once per group, the final output of the CNN model is obtained.

[0064] In the CNN accelerator device of this application, the configuration parameters of all groups are delivered in advance to the chip (e.g., the FPGA) by the software system. The FPGA then reads the relevant configuration parameters directly to carry out the computation of all groups. Reducing communication and interaction with the software system removes redundant processing latency and thereby improves the inference performance of the system.

[0065] The functional implementation of each part of the CNN accelerator device shown in Figure 2 is further described below in conjunction with the feature map processing method shown in Figure 1:

[0066] Specifically, as shown in Figure 1, the feature map processing method of this embodiment includes the following steps:

[0067] Step S11: Obtain the input feature map to be processed and the convolution kernel used to perform 3D convolution operation on the input feature map.

[0068] In this embodiment of the application, the feature map processing device inputs the input feature map to be processed and the convolution kernel for performing three-dimensional convolution operations on the input feature map from the external memory DDR to the direct memory access controller, wherein the input feature map is the input for each layer of convolution calculation.

[0069] Specifically, the input to each convolutional layer includes the input feature map ifmp(N,Ci,Hi,Wi) and the weight parameters weight(Co,Ci,Ky,Kx). The final calculation result is the output feature map ofmp(N,Co,Ho,Wo), where N represents the batch size, i.e., the number of image frames. In this description, N is set to 1, which represents a single frame image; in addition, multi-frame image processing can be regarded as multiple scheduling of single-frame images.

[0070] Co and Ci represent the number of output channels and the number of input channels, respectively. Kx and Ky represent the width and height of the convolution kernel, respectively. Hi and Wi represent the height and width of the input feature map, respectively. Ho and Wo represent the height and width of the output feature map, respectively. The final result needs to be biased by bias(Co,1), that is, all data of each output channel shares one bias value.

[0071] The pseudocode for calculating the convolutional layer is as follows, where S represents the stride of the sliding window:

[0072]

[0073]

[0074] Where Ho = ((Hi-Ky+2*P) / S)+1, Wo = ((Wi-Kx+2*P) / S)+1, and P represents the number of zero-padding rows. Taking one convolutional layer as an example, if the input feature map ifmp is (1,32,256,256), the weight is (64,32,3,3), S=1, and P=1, then Ho = Wo = ((256-3+2) / 1)+1 = 256, so the output feature map ofmp of this layer is (1,64,256,256).
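As an illustration of the convolutional layer computation defined by the symbols above, the following is a minimal reference sketch for a single frame (N = 1). It is not the pseudocode of this application; the function names are hypothetical, the nested-list layout ifmp[ci][h][w] and weight[co][ci][ky][kx] follows the index order given in the text, and rounding the division down to an integer is an assumption.

```python
def conv_output_size(size_in, kernel, stride, pad):
    """Ho or Wo from the formula in the text (division assumed to round down)."""
    return (size_in - kernel + 2 * pad) // stride + 1


def conv_layer(ifmp, weight, bias, stride=1, pad=0):
    """Direct convolution for N = 1.

    ifmp[ci][h][w], weight[co][ci][ky][kx], bias[co] -> ofmp[co][oh][ow]
    """
    ci_n, hi, wi = len(ifmp), len(ifmp[0]), len(ifmp[0][0])
    co_n = len(weight)
    ky_n, kx_n = len(weight[0][0]), len(weight[0][0][0])
    ho = conv_output_size(hi, ky_n, stride, pad)
    wo = conv_output_size(wi, kx_n, stride, pad)
    ofmp = [[[0] * wo for _ in range(ho)] for _ in range(co_n)]
    for co in range(co_n):
        for oh in range(ho):
            for ow in range(wo):
                acc = bias[co]  # each output channel shares one bias
                for ci in range(ci_n):
                    for ky in range(ky_n):
                        for kx in range(kx_n):
                            h = oh * stride + ky - pad
                            w = ow * stride + kx - pad
                            # positions outside the map are zero padding
                            if 0 <= h < hi and 0 <= w < wi:
                                acc += ifmp[ci][h][w] * weight[co][ci][ky][kx]
                ofmp[co][oh][ow] = acc
    return ofmp


if __name__ == "__main__":
    # The layer from the text: Hi = Wi = 256, Kx = Ky = 3, S = 1, P = 1.
    print(conv_output_size(256, 3, 1, 1))
```

With the sizes used in the text, `conv_output_size(256, 3, 1, 1)` gives 256 for both Ho and Wo.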

[0075] Meanwhile, a fully connected layer can be viewed as a vector-matrix multiplication, i.e., ifmp is (Ci,1), weight is (Co,Ci), and bias is (Co,1), then ofmp(Co,1) = weight × ifmp + bias.
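The fully connected case can be written out directly from the formula above. This is a small sketch with a hypothetical function name, using plain Python lists for ifmp(Ci,1), weight(Co,Ci), and bias(Co,1).

```python
def fully_connected(ifmp, weight, bias):
    """ofmp = weight x ifmp + bias, with weight[co][ci], ifmp[ci], bias[co]."""
    return [
        sum(w * x for w, x in zip(row, ifmp)) + b
        for row, b in zip(weight, bias)
    ]


if __name__ == "__main__":
    # Ci = 2 inputs, Co = 2 outputs, illustrative values only.
    print(fully_connected([1, 2], [[1, 1], [2, 0]], [0, 1]))
```

Each output is a dot product over Ci, the same multiply-accumulate pattern as a 1×1 convolution at a single position, which is why the same engine can serve both layer types.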

[0076] Step S12: Decompose the convolution kernel into 1×1 dimension sub-convolution kernels along the width and height directions of the convolution kernel.

[0077] In this embodiment, the feature map processing device can decompose the convolution kernel into 1×1 dimensional sub-convolution kernels along the width and height directions. For example, for a convolution kernel of size Kx = Ky = 3, it can be decomposed into 9 1×1 dimensional sub-convolution kernels at different positions along the width and height directions.
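The decomposition can be checked numerically: summing the results of the Ky×Kx 1×1 sub-convolution kernels, each applied at its own offset in the width and height plane, gives the same output as the original kernel. The sketch below is illustrative only; it handles a single filter, and the names `decompose_kernel` and `conv_via_1x1` are hypothetical.

```python
def decompose_kernel(kernel):
    """Split kernel[ci][ky][kx] into Ky*Kx sub-kernels of size 1x1xCi.

    Returns a list of (ky, kx, weights_along_ci).
    """
    ci_n, ky_n, kx_n = len(kernel), len(kernel[0]), len(kernel[0][0])
    return [
        (ky, kx, [kernel[ci][ky][kx] for ci in range(ci_n)])
        for ky in range(ky_n)
        for kx in range(kx_n)
    ]


def conv_via_1x1(ifmp, kernel, stride=1, pad=0):
    """Convolve ifmp[ci][h][w] with one filter by accumulating the
    partial results of its 1x1 sub-kernels."""
    ci_n, hi, wi = len(ifmp), len(ifmp[0]), len(ifmp[0][0])
    ky_n, kx_n = len(kernel[0]), len(kernel[0][0])
    ho = (hi - ky_n + 2 * pad) // stride + 1
    wo = (wi - kx_n + 2 * pad) // stride + 1
    ofmp = [[0] * wo for _ in range(ho)]
    for ky, kx, w_ci in decompose_kernel(kernel):
        for oh in range(ho):
            h = oh * stride + ky - pad
            if not 0 <= h < hi:
                continue  # this sub-kernel falls on a zero-padding row
            for ow in range(wo):
                w = ow * stride + kx - pad
                if not 0 <= w < wi:
                    continue  # zero-padding column
                # 1x1 convolution: dot product along the channel direction
                ofmp[oh][ow] += sum(ifmp[ci][h][w] * w_ci[ci] for ci in range(ci_n))
    return ofmp


if __name__ == "__main__":
    ch0 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    ch1 = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    ones = [[[1] * 3 for _ in range(3)] for _ in range(2)]
    print(len(decompose_kernel(ones)))            # number of 1x1 sub-kernels
    print(conv_via_1x1([ch0, ch1], ones, pad=1))  # 3x3 output map
```

Because only the offset (ky, kx) changes between sub-kernels, a kernel of any size reduces to repeated runs of the same 1×1 computation followed by accumulation.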

[0078] Step S13: Along the channel direction of the input feature map and the convolution kernel, use the systolic operation array to perform parallel operations on the feature values of the corresponding positions in the width and height plane of the input feature map and the weight values of the sub-convolution kernel to obtain partial convolution results.

[0079] In the embodiments of this application, please refer to Figure 3, which is a schematic diagram of the convolutional layer calculation provided in this application. The specific convolution calculation process is shown in Figure 3: each filter wgt slides over the input feature map ifmp from left to right and top to bottom, and at each position the overlapping elements of the two are multiplied and accumulated to produce one feature point of the output feature map ofmp. After the sliding window has traversed the entire ifmp, one channel of the output feature map is obtained. The Co filters wgt thus yield the Co channels of ofmp.

[0080] Because convolution involves storing and reusing a large amount of input feature maps and parameter data, but on-chip resources are limited (e.g., FPGA), only a portion of the data can be cached on-chip for computation at a time, as shown in the gray area in the figure. The number of gray cubes in the Ci direction is PCi, which is called the input channel parallelism, and the number of gray cubes in the Co direction is PCo, which is called the output channel parallelism. Therefore, the accelerator only computes the convolution result corresponding to a subset of the entire input feature map and weight parameters (i.e., the gray cubes) at a time.

[0081] Please refer to Figure 4, which is a detailed flowchart of step S13 of the feature map processing method shown in Figure 1.

[0082] Specifically, as shown in Figure 4, the feature map processing method of this embodiment includes the following steps:

[0083] Step S131: According to the preset input channel parallelism, the feature values of each position point of the input feature map are divided into multiple feature value groups along the channel direction, wherein the number of feature values in each feature value group is equal to the input channel parallelism.

[0084] In this embodiment, in order to read the required computational data from external memory efficiently, the prior art uses a "cross-shaped" block division: the input feature map is divided horizontally and vertically into several blocks, and only one block is read and computed at a time. However, because convolution uses a sliding window, the data at the boundary between two adjacent blocks overlaps and requires additional processing. In addition, after the convolution result of each block has been calculated, additional control and operations are needed to restore the physical shape of the original output feature map, which introduces additional latency.

[0085] As the convolution calculation process shown in Figure 3 illustrates, given limited on-chip hardware resources, the input feature map (ifmp) and the weights (wgt) are reused the same number of times. That is, calculating the feature points of each output channel along the Co direction requires traversing all weight parameters once. Reading this much data from DDR each time would limit bandwidth utilization.

[0086] This application proposes a storage strategy that prioritizes the input channel (Ci) direction; its storage principle is shown in Figure 5. Given the parallelism PCi, the storage unit is PCi data points along the Ci direction, and the parallel processing unit during computation is likewise PCi data points. This application stores all data points of the input feature map ifmp along the row (Wi) and Ci directions, but only some of the rows along the column (Hi) direction. That is, the feature values are divided into multiple feature value groups and cached, for example, 8 rows at a time. This row count can be configured according to the largest convolution kernel to be supported. Taking a maximum supported kernel of 7×7 as an example, a total of Wi×Ci×8 data points need to be stored. The on-chip storage addresses are contiguous, and Wi×Ci is of the same order of magnitude in the earlier and later layers of a CNN, so the same storage resources are compatible across layers.
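The size of this buffer and one possible address mapping can be sketched as follows. This is an illustration under stated assumptions rather than the layout of this application: the function names are hypothetical, addresses are counted in PCi units, and the order "width first, then channel group, then row" is only one reading of the caching order described in the summary.

```python
def line_buffer_points(wi, ci, rows=8):
    """Number of data points cached on chip: Wi x Ci x rows."""
    return wi * ci * rows


def pci_unit_address(h, w, c_group, wi, ci, pci):
    """Address, in PCi units, of the unit at buffered row h, column w,
    channel group c_group.

    Assumed order: w changes fastest, then the channel group, then h.
    """
    groups = ci // pci  # number of PCi units along the Ci direction
    return (h * groups + c_group) * wi + w


if __name__ == "__main__":
    # The 256x256x32 example map with 8 buffered rows and PCi = 8.
    points = line_buffer_points(256, 32, rows=8)
    print(points)                                   # data points to store
    print(points * 8 // 8 // 1024, "KiB at 8 bits per value")
    print(pci_unit_address(1, 0, 0, wi=256, ci=32, pci=8))
```

For the example map this gives 65536 data points, which is 64 KiB if each value is 8 bits wide, as in the bit-width example given for the PCi unit.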

[0087] Step S132: Based on the parallelism of the input channels, the weight values of the sub-convolution kernel are divided into multiple weight value groups corresponding to the feature value groups along the channel direction, wherein the corresponding feature value groups and the weight value groups form an operation group pair.

[0088] In this embodiment, the parallelism PCi of the input channel Ci is 8. PCi = 8 means that the ifmp data is grouped in units of 8 values along the Ci direction, and this unit of 8 is used for external DDR storage, on-chip storage, and on-chip computation. For example, when the data bit width is 8 bits, each PCi unit is 64 bits and contains 8 ifmp values along the Ci direction.
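The grouping along Ci and the 64-bit unit can be illustrated with a short sketch. The function names are hypothetical, and placing the lowest channel in the least significant byte is an assumption; the text does not specify the byte order inside a PCi unit. Values are treated as unsigned 8-bit patterns.

```python
def split_channels(values, pci=8):
    """Divide the Ci values of one position point into groups of PCi."""
    if len(values) % pci:
        raise ValueError("Ci must be a multiple of PCi in this sketch")
    return [values[i:i + pci] for i in range(0, len(values), pci)]


def pack_pci_unit(values, bits=8):
    """Pack one group into a single word (lowest channel in the low bits)."""
    mask = (1 << bits) - 1
    word = 0
    for i, value in enumerate(values):
        word |= (value & mask) << (i * bits)
    return word


def unpack_pci_unit(word, pci=8, bits=8):
    """Recover the PCi unsigned values from a packed word."""
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(pci)]


if __name__ == "__main__":
    groups = split_channels(list(range(32)), pci=8)   # Ci = 32 -> 4 groups
    word = pack_pci_unit([1, 2, 3, 4, 5, 6, 7, 8])
    print(len(groups), hex(word), unpack_pci_unit(word))
```

With Ci = 32 and PCi = 8, each position point yields four such 64-bit units.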

[0089] In other embodiments, the parallelism PCi of the input channel Ci can be set to a larger value, for example PCi = 16 or PCi = 32, chosen according to the number of DSP (Digital Signal Processing) computing units of the selected device. The higher the parallelism, the higher the performance.

[0090] The convolution computation scheduling in this application is shown in Figure 6: the input feature map ifmp of size Wi×Hi×Ci is convolved with the filters wgt of size Kx×Ky×Ci×Co to obtain the output feature map ofmp of size Wo×Ho×Co.

[0091] This application describes one embodiment, namely caching 8 rows of ifmp with PCi = 8 and also using a parallelism PCo of 8 on the weight (wgt) side, that is, dividing the weight values of the sub-convolution kernel into multiple weight value groups. The reason why PCi = PCo will be explained later. PCo represents grouping the filters: each filter wgtN is a three-dimensional cube, and wgt0, wgt1, ..., wgt7 (8 in total) form one group, and so on. If there are n groups of PCo filters, the total number of filters wgt is Co = n × PCo.

[0092] In convolution calculation, if Kx = Ky = 3, the calculation process is as follows. Step S1: first proceed along the Wi direction, as shown in Figure 6. Step S2: the PCi weights of wgt0 are multiplied and summed in turn with the corresponding PCi data points of ifmp to obtain a convolution sum for one row of ofmp in the Wo direction. Step S3: step S2 is repeated along the Ci direction, where the PCi weights of wgt0 advance to the next group in the Ci direction and the PCi data points of ifmp likewise advance to the next group in the Ci direction, so that their positions continue to correspond. Step S4: step S2 is repeated until all PCi groups in the Ci direction have been calculated, giving the convolution result of one row of ofmp in the Wo direction for one 1×1 convolution kernel. Since Kx = Ky = 3, there are 9 different 1×1 convolution kernels, so the above 1×1 calculation process is repeated 9 times, finally giving the convolution result of one row of ofmp in the Wo direction for the 3×3 convolution kernel.
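The loop order described in this paragraph can be checked with a small software model. The sketch below is illustrative only and is not the hardware implementation of this application: it assumes a single filter, stride 1, no padding, and nested Python lists indexed as ifmp[h][w][c] and wgt[ky][kx][c]; the function names are hypothetical. It compares a direct convolution against the same convolution computed as Kx×Ky passes of 1×1 kernels, each pass stepping through Ci in groups of PCi and sliding along Wi.

```python
def conv_direct(ifmp, wgt):
    """Reference convolution of one filter (stride 1, no padding)."""
    hi, wi, ci = len(ifmp), len(ifmp[0]), len(ifmp[0][0])
    ky_n, kx_n = len(wgt), len(wgt[0])
    ho, wo = hi - ky_n + 1, wi - kx_n + 1
    return [[sum(ifmp[oy + ky][ox + kx][c] * wgt[ky][kx][c]
                 for ky in range(ky_n) for kx in range(kx_n) for c in range(ci))
             for ox in range(wo)] for oy in range(ho)]


def conv_via_1x1(ifmp, wgt, pci=8):
    """Same convolution, accumulated from 1x1 sub-kernel passes."""
    hi, wi, ci = len(ifmp), len(ifmp[0]), len(ifmp[0][0])
    ky_n, kx_n = len(wgt), len(wgt[0])
    ho, wo = hi - ky_n + 1, wi - kx_n + 1
    out = [[0] * wo for _ in range(ho)]
    for oy in range(ho):                          # one output row at a time
        for ky in range(ky_n):
            for kx in range(kx_n):                # one 1x1 sub-kernel
                for c0 in range(0, ci, pci):      # next PCi group along Ci
                    for ox in range(wo):          # slide along Wi
                        out[oy][ox] += sum(
                            ifmp[oy + ky][ox + kx][c] * wgt[ky][kx][c]
                            for c in range(c0, min(c0 + pci, ci)))
    return out


if __name__ == "__main__":
    ifmp = [[[(7 * h + 3 * w + c) % 5 - 2 for c in range(16)]
             for w in range(6)] for h in range(5)]
    wgt = [[[(ky + 2 * kx + c) % 3 - 1 for c in range(16)]
            for kx in range(3)] for ky in range(3)]
    print(conv_via_1x1(ifmp, wgt) == conv_direct(ifmp, wgt))
```

Because the decomposition only reorders additions, the same function also handles non-square kernels such as 1×3 without modification.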

[0093] The above describes the convolution calculation for wgt0. Since the output channel parallelism is PCo = 8 in the described embodiment, 8 filters (wgt0, wgt1, ..., wgt7) undergo the same convolution calculation as wgt0 in parallel, and the output is one row in the Wo direction for each of the PCo output channels. After the first group of PCo filters has been calculated, the second group of PCo filters is processed to obtain one row in the Wo direction for the second group of PCo output channels, and so on. When all n groups of PCo filters have been calculated, one row in the Wo direction is available for every output channel in the Co direction.

[0094] It can therefore be seen that the data arrangement of ofmp is consistent with that of ifmp, and that subsequent convolutional layers follow the same computation process. The PCo of one convolutional layer serves as the PCi of the next. Because the data arrangement is the same, the engine can be scheduled directly for the next layer, and this continues throughout the computation of the entire model. This also explains why PCi = PCo is needed: to keep the parallelism of adjacent layers in correspondence. After the result of the first row of ofmp in the Wo direction is obtained, the convolution kernel window moves down along the Hi direction to calculate the convolution result of the second row of Wo. When the eight cached rows of ifmp data have been consumed, the next set of eight rows of ifmp is read.

[0095] The above process is only one example among the embodiments of this application. The calculation principle of the general convolution kernel in this application is illustrated in Figure 7. Based on the device of Figure 2 and the principle of Figure 6, this application provides a general convolution kernel calculation method that converts convolution kernels of various sizes into 1×1 convolution kernels for calculation.

[0096] As shown in Figure 7, when the convolution kernel is 1×1, its window sliding path on the input feature map runs from a1 to aa, and the data at corresponding positions are multiplied and accumulated. If the convolution kernel is 3×3, the process is the same as in steps S1, S2, and S3 above, described here in more detail: the 3×3 convolution kernel is divided into 9 groups of 1×1 convolution kernels, a, b, c, d, e, f, g, h, and i. For position a of the convolution kernel, the window sliding path is a1, a2, ..., a8, and the calculation process is the same as that of a 1×1 convolution kernel. The area not covered by the window depends on the stride S and the zero padding P. If S is not 1, the stride can be handled by controlling the cache read addresses.
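How stride and padding translate into read addresses can be sketched for one row. The following Python function is an illustrative assumption rather than the address generator of this application: for a 1×1 sub-kernel at horizontal position kx inside a kernel of width Kx, it lists the input columns visited as the window slides, using the usual output width formula Wo = (Wi + 2P - Kx) // S + 1 and marking positions that fall in the zero padding as None.

```python
def read_columns(wi, kx_size, kx, stride=1, pad=0):
    """Input columns read by the 1x1 sub-kernel at position kx along one row.

    None marks a position inside the zero padding (its contribution is zero).
    """
    wo = (wi + 2 * pad - kx_size) // stride + 1
    columns = []
    for ox in range(wo):
        x = ox * stride + kx - pad
        columns.append(x if 0 <= x < wi else None)
    return columns


if __name__ == "__main__":
    # Assumed input width of 10 with a 3-wide kernel.
    print(read_columns(10, 3, 0))            # leftmost sub-kernel position
    print(read_columns(10, 3, 2))            # rightmost sub-kernel position
    print(read_columns(10, 3, 0, stride=2))  # stride handled purely by addressing
```

With an assumed width of 10 and a 3-wide kernel, the leftmost position visits 8 columns, and the columns it never reaches are exactly the area the paragraph describes as not covered by the window.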

[0097] The calculation process for positions b, c, d, ..., i is similar. When the 9 groups of 1×1 calculations have been completed and accumulated, the output is the final convolution result of the 3×3 convolution kernel. Figure 7 illustrates only the planar case where Ci = 1. In practice, Ci is generally not 1, so PCi data points are treated as one unit, sliding first along the Wi row direction and then along the Ci channel direction, consistent with the process described in steps 1 and 2 above. The calculation process is similar for convolution kernels of other sizes.

[0098] Furthermore, the method provided in this application also supports the case where the convolution kernel Kx is not equal to Ky, such as 1×3, 5×1, etc. It can be seen that the convolution kernel calculation in this application achieves universal compatibility.

[0099] It should be further explained that the weight caching unit and the systolic weight loading control are also key parts of this application. As described above, the filters wgt are divided into groups of PCo for parallel computation, each convolution kernel is calculated as 1×1 convolution kernels, and there is parallelism PCi in the Ci direction. Therefore, as shown in Figure 6, the weight parameters are arranged in external memory, on-chip memory, and on-chip computation as follows. Step SS1: the 0th PCi of wgt0, the 0th PCi of wgt1, ..., the 0th PCi of wgt7 are stored, followed along the Ci direction by the next group of the 0th PCi of wgt0, the 0th PCi of wgt1, ..., the 0th PCi of wgt7, until the maximum value in the Ci direction is reached. Step SS2: the above process is repeated to store the 1st, 2nd, ..., 7th weights of wgt0, wgt1, ..., wgt7, which completes the storage of the wgt weights of the first PCo group. Step SS3: steps SS1 and SS2 are repeated until the wgt weights of all PCo groups have been stored. It should be noted that this application shares a portion of the input feature map and reads the weights cyclically; that is, the input feature map needs to be read from DDR only once, while the weight parameters are read from DDR several times.
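One possible reading of steps SS1 to SS3 is the nested ordering sketched below. It is an interpretation, not a definitive statement of the layout: it assumes that the repetition in step SS2 runs over the Kx×Ky positions of the 1×1 sub-kernels, and that Co and Ci are multiples of PCo and PCi. Each emitted tuple (filter index, sub-kernel position, Ci group) stands for one storage unit of PCi weights; the function name is hypothetical.

```python
def weight_storage_order(co, ci, kx, ky, pci=8, pco=8):
    """List storage units of PCi weights in the assumed SS1-SS3 order.

    Each entry is (filter_index, sub_kernel_position, ci_group).
    Assumes co % pco == 0 and ci % pci == 0.
    """
    order = []
    for group in range(co // pco):              # SS3: next group of PCo filters
        for pos in range(kx * ky):              # SS2: next 1x1 sub-kernel position (assumed)
            for ci_group in range(ci // pci):   # SS1: advance along Ci
                for f in range(pco):            # SS1: wgt0, wgt1, ..., within the group
                    order.append((group * pco + f, pos, ci_group))
    return order


if __name__ == "__main__":
    units = weight_storage_order(co=16, ci=16, kx=3, ky=3)
    print(units[:10])
    print(len(units))
```

Under this reading, the weights needed for one systolic pass (PCo filters, one sub-kernel position, one Ci group) occupy consecutive storage units, which would let them be fetched as a single burst.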

[0100] In summary, the strategy of sharing a partial, temporarily cached input feature map, together with the scheduling strategy of this application, has three advantages: (1) it supports the computation of convolutional layers with different kernel sizes; (2) for devices with limited resources, the input feature maps and weights cached on-chip are reduced and can be adapted to the resource size of the device; (3) the shared partial input feature map needs to be traversed only once to generate and write back the output feature map result directly, without caching intermediate convolution results on-chip, which also reduces the number of external memory accesses and thereby the bandwidth requirement.

[0101] Step S133: Perform parallel operations on the operation group pairs using the systolic computing array, wherein the operations within the same operation group pair are parallel to each other, or the operations on different operation group pairs are parallel to each other.

[0102] For the structure of the systolic array unit provided in this application, refer to Figure 8. It consists of a two-dimensional matrix of processing elements (PEs).

[0103] This application employs intermittent systolic loading of weights and directional propagation of input and output feature map data for computation. Weight parameters are loaded vertically as needed. Input feature map data is then fed horizontally into the systolic array unit for computation, flowing to the adjacent PE on the right. The computation result of each PE is propagated vertically. Each hop takes one clock cycle. Each PE is a multiply-accumulate unit: when a weight wi and a feature value xi meet in a PE (i.e., an operation pair), they are multiplied, and the product is accumulated with the result flowing in from the adjacent PE above before flowing on to the adjacent PE below. After the computation of the portion of the input feature map in the current batch is completed, a partial sum is obtained. Since the final output feature map result is the sum of the products of all data in the input channel direction, the partial sum needs to be temporarily stored and accumulated with that of the next batch until the final computation result is output, which is the output feature map of the current convolutional layer. As can be seen, in the systolic array method each PE is connected only to its adjacent PEs, with a fan-in and fan-out of 1. In addition, the regular rectangular physical structure of the systolic array maps well onto an FPGA, because the DSP resources on the FPGA occupy a rectangular physical area, which benefits placement and routing. The accelerator can therefore achieve a higher clock frequency.

[0104] On the other hand, embodiments of this application provide a general convolution kernel calculation method based on systolic arrays. As shown in Figure 9, when PCi = PCo = 8, a PE array of size PCi × PCo = 8 × 8 is used. The circuit structure of each PE is shown on the right of Figure 9: the weight wi is loaded into a register, multiplied by the feature map data xi, and accumulated with the result psum(i-1) of the PE above it. The registered output psum(i) serves as the input to the PE below it, and the registered xi output is used by the next PE in the horizontal direction.
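The PE behavior described above can be modeled cycle by cycle in software. The sketch below is a simplified model under stated assumptions and is not the circuit of Figure 9: weights are taken as already loaded and held fixed (systolic weight loading is not modeled), the input of array row r is delayed by r cycles so that each product meets the matching partial sum from above (this skew is an assumption), and the function name is hypothetical. Row r holds input channel r, and column c holds the weights of filter c.

```python
def systolic_pass(pixels, weights):
    """Simulate one weight-stationary pass of a rows x cols PE array.

    pixels[t][r]  : channel r of the t-th data point fed from the left.
    weights[r][c] : weight held in PE (r, c), i.e. channel r of filter c.
    Returns out[t][c], the partial sum leaving the bottom of column c.
    """
    rows, cols, n = len(weights), len(weights[0]), len(pixels)
    x_reg = [[0] * cols for _ in range(rows)]      # registered x, passed to the right
    psum_reg = [[0] * cols for _ in range(rows)]   # registered psum, passed downwards
    out = [[0] * cols for _ in range(n)]
    for cycle in range(n + rows + cols - 2):
        new_x = [[0] * cols for _ in range(rows)]
        new_psum = [[0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                if c == 0:
                    t = cycle - r                  # assumed input skew of r cycles
                    x_in = pixels[t][r] if 0 <= t < n else 0
                else:
                    x_in = x_reg[r][c - 1]         # from the left neighbour
                psum_in = psum_reg[r - 1][c] if r > 0 else 0  # from the PE above
                new_x[r][c] = x_in
                new_psum[r][c] = weights[r][c] * x_in + psum_in
        x_reg, psum_reg = new_x, new_psum          # clock edge: all registers update
        for c in range(cols):
            t = cycle - c - (rows - 1)             # data point now leaving column c
            if 0 <= t < n:
                out[t][c] = psum_reg[rows - 1][c]
    return out


if __name__ == "__main__":
    pixels = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, -1]]
    weights = [[1, 2], [0, -1], [3, 1]]
    print(systolic_pass(pixels, weights))
```

In this model each PE reads only its left and upper neighbours, which mirrors the fan-in and fan-out of 1 noted in the description; the result for each data point equals the dot product of its channel values with the weights of each column.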

[0105] In combination with the process shown in Figure 6, the 0th PCi data of ifmp is input from the left side of the systolic array. The PCi ifmp data correspond to the inputs of the 8 PEs in the first column of the systolic array; that is, indices 0, 1, ..., 7 correspond to PE0, PE1, ..., PE7 in the first column. At the same time, the PCi data of wgt0, wgt1, ..., wgt7 of the first PCo group are input from the top side of the systolic array. For each wgt{0, 1, ..., 7}, taking wgt0 as an example, its PCi parameters correspond to PE0, PE1, ..., PE7 in the first column, and so on. During convolution calculation, the weights wgt of each column of PEs are held fixed temporarily while the 1st, 2nd, ..., Width-th data of ifmp are input in sequence. The output at the bottom of the systolic array is one row of Wo, with PCo convolution partial sums in parallel. When the Width data of the first row have been sent, the next row of ifmp data (1st, 2nd, ..., Width-th) along the Ci direction is sent, and the wgt of each PE must be updated accordingly. To reduce idle cycles, the weights wgt are loaded systolically in this application, under the control of a weight systolic loading control unit. That is, once the last of the Width data of a row has entered the PE array, the PE in the upper left corner finishes its calculation for that row and loads its new weight. The other PEs follow in the same manner, so that the data and weights of one row are followed seamlessly by the data and weights of the next, maximizing DSP utilization and improving system throughput. Each row result is a convolution partial sum; using the accumulation buffer unit in Figure 8, it is accumulated with the result of the next systolic array pass until the convolution operation is complete, and is then output to the subsequent modules for operations such as bias, activation, and pooling.

[0106] Furthermore, it should be noted that when calculating a fully connected layer, the data arrangement of the ifmp is still as shown in Figure 4. Unlike in a convolutional layer, each weight filter of the fully connected layer has the same shape and size as the ifmp. For example, when Wi = Hi = 7 and Ci = 256 in the ifmp, then Kx = Ky = 7 and Ci = 256 for each wgt in the fully connected layer. If Co = 512, the final output of the fully connected layer is a 1×512 vector. The device of this application is therefore also compatible with fully connected computation.
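The equivalence used here, a fully connected layer treated as a convolution whose kernel covers the whole input feature map, can be sketched as follows. The function names and the tiny example sizes are assumptions for illustration; data are nested Python lists indexed as ifmp[h][w][c] and filters[o][h][w][c].

```python
def conv_output_size(in_size, k, stride=1, pad=0):
    """Standard output-size formula for one spatial dimension."""
    return (in_size + 2 * pad - k) // stride + 1


def fc_as_conv(ifmp, filters):
    """Fully connected layer as a convolution with kernel size equal to the ifmp size.

    The window fits in exactly one position, so each filter yields one output value.
    """
    outputs = []
    for wgt in filters:
        acc = 0
        for h, row in enumerate(ifmp):
            for w, point in enumerate(row):
                for c, x in enumerate(point):
                    acc += x * wgt[h][w][c]
        outputs.append(acc)
    return outputs


if __name__ == "__main__":
    # With Wi = Hi = 7 and Kx = Ky = 7, the output plane is 1 x 1.
    print(conv_output_size(7, 7), conv_output_size(7, 7))
    ifmp = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
    filters = [[[[1, 1], [1, 1]], [[1, 1], [1, 1]]],
               [[[1, 0], [0, 1]], [[-1, 0], [0, -1]]]]
    print(fc_as_conv(ifmp, filters))
```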

[0107] Step S14: Accumulate the partial convolution results to obtain the corresponding feature values of the output feature map.

[0108] In this embodiment, after the partial input feature map of the current batch is calculated, a partial sum is obtained. Since the final output feature map is the sum of the products of all data in the input channel direction, the partial sum needs to be temporarily stored and accumulated with the partial sum of the next batch until the final calculation result is output, which is the output feature map of the current convolutional layer. The feature map processing device repeatedly invokes the CNN accelerator device shown in Figure 2, using the output feature map of the current convolutional layer as the input feature map of the next convolutional layer, to obtain the final feature values of the output feature map.

[0109] In this embodiment, a general-purpose convolutional kernel CNN accelerator device is provided. It employs a strategy of sharing partially temporary input feature maps and data scheduling, making it compatible with convolutional kernel computations of different sizes. This reduces bandwidth requirements by minimizing access to external memory. Furthermore, the provided systolic array-based general-purpose convolutional kernel computation method implements a pipelined design, which is significant for the bandwidth, resources, and performance of the device system.

[0110] This application converts the computation of convolutional kernels of different sizes into the computation of 1×1 convolutional kernels, segments the input feature maps and weight parameter data according to the parallelism of input and output channels, and provides an effective scheduling control method and hardware implementation scheme, effectively solving the compatibility problem of CNN accelerators for different convolutional kernel computations in other schemes.

[0111] The weight-sharing input feature map data partitioning strategy provided in this application effectively reduces memory access bandwidth pressure and alleviates the problem of limited on-chip storage resources on FPGAs by reusing input data. It also provides a method for prioritizing row-channel direction storage of the input feature map, effectively cooperating with computation modes compatible with convolution kernels of different sizes to achieve efficient scheduling and high parallelism. Compared with other partitioning strategies, this proposal caches data in units of planes composed of row and channel directions, including all data in the sliding window region; therefore, the resulting output feature map does not require additional processing.

[0112] Compared with the loop-unrolling parallel computation scheme, the systolic array-based computation structure provided in this application has a fan-in and fan-out of 1. Its regular matrix-shaped physical structure facilitates FPGA placement and routing, which enables higher operating frequencies and improves the performance of CNN accelerator systems.

[0113] Those skilled in the art will understand that, in the above-described method of the specific implementation, the order in which each step is written does not imply a strict execution order and does not constitute any limitation on the implementation process. The specific execution order of each step should be determined by its function and possible internal logic.

[0114] To implement the feature map processing method of the above embodiments, this application also proposes a feature map processing apparatus. See Figure 10, which is a schematic diagram of an embodiment of the feature map processing apparatus provided in this application.

[0115] The feature map processing apparatus 300 of this application embodiment includes a memory 31 and a processor 32, wherein the memory 31 and the processor 32 are coupled together.

[0116] The memory 31 is used to store program data, and the processor 32 is used to execute the program data to implement the feature map processing method described in the above embodiments.

[0117] In this embodiment, processor 32 can also be referred to as a CPU (Central Processing Unit). Processor 32 may be an integrated circuit chip with signal processing capabilities. Processor 32 can also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. The general-purpose processor can be a microprocessor, or processor 32 can be any conventional processor.

[0118] To implement the feature map processing method of the above embodiments, this application also provides a computer-readable storage medium. As shown in Figure 11, the computer-readable storage medium 400 is used to store program data 41, which, when executed by a processor, implements the feature map processing method described in the above embodiments.

[0119] This application also provides a computer program product, wherein the computer program product includes a computer program operable to cause a computer to perform the feature map processing method as described in the embodiments of this application. The computer program product can be a software installation package.

[0120] The feature map processing method described in the above embodiments of this application, when implemented as a software functional unit and sold or used as an independent product, can be stored in a device, such as a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

[0121] The above description is merely an embodiment of this application and does not limit the patent scope of this application. Any equivalent structural or procedural transformations made using the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this application.

Claims

1. A feature map processing method based on a systolic operation array, characterized in that the method includes: obtaining the input feature map to be processed and the convolution kernel for performing three-dimensional convolution operations on the input feature map; decomposing the convolution kernel into sub-convolution kernels with a spatial dimension of 1×1 along the width and height directions of the convolution kernel; along the channel direction of the input feature map and the convolution kernel, using the systolic operation array to perform parallel operations on the feature values of corresponding positions in the width-height plane of the input feature map and the weight values of the sub-convolution kernel to obtain partial convolution results, wherein the corresponding position is the position on the input feature map that is aligned with the current sub-convolution kernel under the current sliding window; and accumulating the partial convolution results to obtain the corresponding feature values of the output feature map; wherein the parallel operation is performed in the systolic operation array, in parallel, on multiple channels of the input feature map and their corresponding weights for the same sub-convolution kernel, and the accumulation in the channel dimension is completed through the vertical accumulation structure of the systolic operation array.

2. The method according to claim 1, characterized in that the method further includes: according to a preset input channel parallelism, dividing the feature values of each position point of the input feature map into multiple feature value groups along the channel direction, wherein the number of feature values in each feature value group is equal to the input channel parallelism; and, based on the input channel parallelism, dividing the weight values of the sub-convolution kernel into multiple weight value groups corresponding to the feature value groups along the channel direction, wherein corresponding feature value groups and weight value groups form operation group pairs; and the step of using the systolic operation array to perform parallel operations, along the channel direction of the input feature map and the convolution kernel, on the feature values of corresponding positions in the width-height plane of the input feature map and the weight values of the sub-convolution kernel includes: using the systolic operation array to perform parallel operations on the operation group pairs, wherein operations within the same operation group pair are performed in parallel, or operations on different operation group pairs are performed in parallel.

3. The method according to claim 2, characterized in that the step of using the systolic operation array to perform parallel operations on the operation group pairs includes: for the same sub-convolution kernel, traversing the input feature map sequentially along one of the width and height directions and along the channel direction, taking the operation group pairs as units.

4. The method according to claim 3, characterized in that the step of dividing the feature values of each position point of the input feature map into multiple feature value groups along the channel direction according to the preset input channel parallelism includes: reading the input feature map from external memory into a cache, with the feature value group as the reading unit, sequentially along one of the width and height directions of the input feature map, the channel direction, and the other of the width and height directions of the input feature map, so that the cached input feature map is input to the systolic operation array for computation.

5. The method according to claim 4, characterized in that the step of reading the input feature map from external memory, with the feature value group as the reading unit, sequentially along one of the width and height directions of the input feature map, the channel direction, and the other of the width and height directions of the input feature map includes: caching the input feature map from the external memory in stages in either the width or height direction of the input feature map.

6. The method according to claim 2, characterized in that there are multiple convolution kernels, corresponding to the number of channels in the output feature map; the method further includes: according to a preset output channel parallelism, dividing the multiple convolution kernels into multiple convolution kernel groups, wherein the number of convolution kernels in each convolution kernel group is equal to the output channel parallelism; and the step of using the systolic operation array to perform parallel operations, along the channel direction of the input feature map and the convolution kernel, on the feature values of corresponding positions in the width-height plane of the input feature map and the weight values of the sub-convolution kernel includes: for each convolution kernel within the same convolution kernel group, executing in parallel the step of using the systolic operation array to perform parallel operations on the operation group pairs.

7. The method according to claim 6, characterized in that, The parallelism of the output channel is equal to the parallelism of the input channel, and is not less than 1.

8. The method according to claim 6, characterized in that the systolic operation array includes multiple computation units arranged in an array; before the step of using the systolic operation array to perform parallel operations, along the channel direction of the input feature map and the convolution kernel, on the feature values of the input feature map and the weight values of the sub-convolution kernels, the method further includes: feeding the feature values within the same feature value group in parallel, along the row direction of the systolic operation array, into different rows of the systolic operation array, so that each feature value is transmitted along its respective row direction; and feeding the weight value groups corresponding to each convolution kernel within the same convolution kernel group in parallel into different columns of the systolic operation array, so that the weight values within each weight value group are transmitted along their respective column directions.

9. The method according to claim 8, characterized in that, The calculation unit is configured to perform a product operation on the corresponding feature value and the weight value, and then add the product result to the output result of the previous level calculation unit input along the column direction to obtain its own output result.

10. A feature map processing apparatus, characterized in that, The feature map processing apparatus includes a processor and a memory, wherein the memory stores program data, and the processor executes the program data to implement the feature map processing method as described in any one of claims 1-9.

11. A computer-readable storage medium, characterized in that, The computer-readable storage medium is used to store program data, which, when executed by a processor, is used to implement the feature map processing method according to any one of claims 1-9.