Neural network accelerators, neural network acceleration methods, and devices
By optimizing the data arrangement order and matching of computing units, the problem of imbalance between computing resources and data bandwidth in neural network accelerators has been solved, thereby improving the performance and efficiency of neural network accelerators.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INST OF SEMICONDUCTORS - CHINESE ACAD OF SCI
- Filing Date
- 2023-08-18
- Publication Date
- 2026-06-30
AI Technical Summary
In existing neural network accelerators, the computing engine utilization is low and the data bandwidth is insufficient, resulting in the underutilization of computing resources. When input data cannot be provided in a timely manner, the computing resources enter an idle state.
Design a neural network accelerator, including an accelerated computing module, a direct memory access controller, a data rearrangement module, and a data warping and control module, to achieve a balance between data bandwidth and computing resources by optimizing the data arrangement order and the matching of computing units.
It improves data bandwidth capabilities, makes full use of computing resources, enhances the overall performance of neural network accelerators, and reduces power consumption.
Figure CN117273094B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence (AI) chip technology, and in particular to a neural network accelerator and a neural network acceleration method and apparatus.
Background Technology
[0002] In the field of artificial intelligence, neural network accelerators have been widely used to accelerate neural network computations. However, current neural network accelerator designs face two main problems: low utilization of the processing engine (PE) and low data bandwidth. When data bandwidth is too low, input data cannot be provided in a timely manner and the PE often enters an idle state, resulting in low utilization of computing resources. Therefore, how to achieve a balance between data bandwidth and computing resources is a crucial issue that the industry urgently needs to address.
Summary of the Invention
[0003] This invention provides a neural network accelerator, neural network acceleration method, and apparatus to solve the problem of imbalance between data bandwidth and computing resources in the prior art, and to achieve a balance between data bandwidth and computing resources.
[0004] This invention provides a neural network accelerator, comprising: an accelerated computing module, a direct memory access controller, a data rearrangement module, and a data warping and control module; the accelerated computing module includes multiple computing units;
[0005] The direct memory access controller is used to continuously read the target data required for accelerated computation of the current layer of the neural network, which is stored in the external memory according to the data arrangement order, and send it to the data warping and control module. The data arrangement order of the target data is obtained as follows: the target data of the first three-dimensional structure with N channels * R rows * C columns is stored in units of the second three-dimensional structure with F channels * E rows * D columns required for accelerated computation of the current layer of the neural network, following the first arrangement order of the second three-dimensional structures within the first three-dimensional structure; within each second three-dimensional structure, the data is stored in the second arrangement order of all data in that structure.
[0006] The data warping and control module is used to generate at least one target data segment based on the received data according to the bit width of the data warping and control module, the bit width of the computing unit and the convolution parameters, and distribute each target data segment to the accelerated computing module.
[0007] The accelerated computing module is used to perform accelerated computing on the target data segment from the data warping and control module based on at least some of the computing units, and output the computing results of the computing units to the data rearrangement module.
[0008] The data rearrangement module is used to merge the calculation results of each calculation unit according to the data arrangement order required for the accelerated calculation of the next layer of the neural network, so as to serve as the input for the accelerated calculation of the next layer.
[0009] According to a neural network accelerator provided by the present invention, the data length of the target data segment is matched with the bit width of the computing unit and the convolution parameters, and the data length of the target data segment is a first data length; the data warping and control module is specifically used for:
[0010] The received data is buffered, and the length of the received data is less than or equal to the bit width of the data warping and control module;
[0011] Each time, a data segment is selected from the received data in units of the second data length as the main data segment, until all the received data has been selected. The second data length is a preset data length.
[0012] When the length of the main data segment is less than the first data length, first splicing data and second splicing data are obtained. The first splicing data is preset padding data or data in the received data that is adjacent to the start position of the main data segment. The second splicing data is preset padding data or data in the received data that is adjacent to the end position of the main data segment. The first splicing data, the main data segment, and the second splicing data are spliced together to obtain the target data segment.
[0013] When the data length of the main data segment is equal to the first data length, the main data segment is used as the target data segment;
[0014] The target data segment is distributed to the accelerated computing module.
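The segment generation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented circuit: the function name `make_target_segments`, the parameters `left` and `right` (how many items are spliced before and after the main segment), and the relation first data length = second data length + `left` + `right` are all hypothetical, since the text does not say how the splice widths are derived from the convolution parameters.

```python
from typing import List, Sequence


def make_target_segments(received: Sequence[int], second_len: int,
                         left: int, right: int, pad: int = 0) -> List[List[int]]:
    """Cut `received` into main segments of `second_len` items and splice
    neighbouring data (or padding at the buffer edges) onto both ends."""
    # Assumed relation between the first and second data lengths (hypothetical).
    first_len = second_len + left + right
    segments = []
    for start in range(0, len(received), second_len):
        end = min(start + second_len, len(received))
        main = list(received[start:end])
        if len(main) == first_len:
            # Main segment already has the first data length: use it as is.
            segments.append(main)
            continue
        # First splicing data: items adjacent to the start, padding if none exist.
        head = [received[i] if i >= 0 else pad for i in range(start - left, start)]
        # Second splicing data: items adjacent to the end, padding if none exist.
        need = first_len - len(head) - len(main)
        tail = [received[i] if i < len(received) else pad
                for i in range(end, end + need)]
        segments.append(head + main + tail)
    return segments


segments = make_target_segments([1, 2, 3, 4, 5, 6, 7, 8], second_len=4, left=1, right=1)
```

With a 3-wide kernel and stride 1, one neighbour on each side is what a row convolution would need, which is why the usage line picks `left=1, right=1`; other kernel sizes or strides would call for other values.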
[0015] According to a neural network accelerator provided by the present invention, the data rearrangement module is specifically used for:
[0016] The calculation results of each of the computing units are collected and stored in different storage spaces in the first group of storage spaces;
[0017] According to the data arrangement order required for the accelerated computing of the next layer, each storage space of the second group of storage spaces selects the computing result of the corresponding computing unit from the first group of storage spaces for storage, so as to obtain the data arrangement order required for the accelerated computing of the next layer.
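As a rough software analogy of the two groups of storage spaces, the sketch below collects per-unit results into a first group and then lets each slot of a second group pull the result of its designated unit. The function name `rearrange` and the representation of the next layer's order as a list of unit indices (`next_layer_order`) are assumptions made for illustration; the patent does not specify how that order is encoded.

```python
from typing import Dict, List, Sequence


def rearrange(results_by_unit: Dict[int, Sequence[int]],
              next_layer_order: Sequence[int]) -> List[int]:
    """Merge per-unit results into the order the next layer expects."""
    # First group of storage spaces: one space per computing unit.
    first_group = {unit: list(values) for unit, values in results_by_unit.items()}
    # Second group: slot k selects the result of unit next_layer_order[k].
    second_group = [first_group[unit] for unit in next_layer_order]
    # Reading the second group slot by slot gives the merged output stream.
    return [value for space in second_group for value in space]


merged = rearrange({0: [10, 11], 1: [20, 21], 2: [30, 31]}, next_layer_order=[2, 0, 1])
```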
[0018] According to a neural network accelerator provided by the present invention, the computing unit includes a first multiply-accumulate module and a second multiply-accumulate module;
[0019] The first multiply-accumulate module is used to perform parallel multiply-accumulate calculations based on L+(x-1) feature data and x weight data in the row direction from the data warping and control module. The parallel multiply-accumulate calculation includes sequentially selecting each weight data from the x weight data as the weight data W[i] to be calculated, where i ranges from 0 to x-1 and W[i] denotes the (i+1)th weight data among the x weight data, and performing the following operation: calculate the product of W[i] with each of the (i+1)th to (i+L)th feature data in the L+(x-1) feature data, and accumulate each product into the corresponding first intermediate memory. Here, x represents the size parameter of the convolution kernel.
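The parallel calculation in this paragraph amounts to a one-dimensional sliding-window sum over one row. A minimal Python sketch follows; the function name `parallel_mac` and the use of a plain list for the first intermediate memories are illustrative assumptions, and the inner loop stands in for the L products that hardware would presumably form at the same time.

```python
from typing import List, Sequence


def parallel_mac(features: Sequence[int], weights: Sequence[int], L: int) -> List[int]:
    """Accumulate W[i] * feature[i + k] into accumulator k, for k = 0 .. L-1."""
    x = len(weights)  # kernel size parameter
    if len(features) != L + (x - 1):
        raise ValueError("expected L + (x - 1) feature values")
    acc = [0] * L  # one first intermediate memory per output position
    for i, w in enumerate(weights):
        # 0-based indices i .. i+L-1 are the (i+1)th .. (i+L)th feature data.
        for k in range(L):
            acc[k] += w * features[i + k]
    return acc


row_out = parallel_mac([1, 2, 3, 4, 5, 6], [1, 2, 3], L=4)
```

After all x weights have been applied, each accumulator holds one output of a stride-1 row convolution, so a single pass over L+(x-1) inputs yields L results.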
[0020] The second multiply-accumulate module is used to perform serial multiply-accumulate calculations based on K feature data and K weight data in one channel direction from the data warping and control module. The serial multiply-accumulate calculation includes sequentially selecting each weight data in the K weight data as the weight data W[j] to be calculated, where j ranges from 0 to K-1 and W[j] denotes the (j+1)th weight data among the K weight data, and performing the following operation: calculate the product of the (j+1)th feature data in the K feature data and W[j], and accumulate the product into the second intermediate memory.
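The serial calculation in this paragraph is a dot product accumulated one term at a time. The sketch below is illustrative only; the name `serial_mac` and the single integer standing in for the second intermediate memory are assumptions.

```python
from typing import Sequence


def serial_mac(features: Sequence[int], weights: Sequence[int]) -> int:
    """Accumulate feature[j] * W[j] for j = 0 .. K-1 into one accumulator."""
    if len(features) != len(weights):
        raise ValueError("expected K feature values and K weight values")
    acc = 0  # second intermediate memory
    for j in range(len(weights)):
        acc += features[j] * weights[j]
    return acc


channel_out = serial_mac([1, 2, 3, 4], [4, 3, 2, 1])
```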
[0021] The computing unit is used to perform multiply-accumulate calculations using either the first multiply-accumulate module or the second multiply-accumulate module that matches the accelerated calculation of the current layer.
[0022] According to a neural network accelerator provided by the present invention, the method for obtaining the data arrangement order of the target data specifically includes:
[0023] Using the second three-dimensional structure as a unit, the target data of the first three-dimensional structure is divided into blocks to obtain Z1*Z2*Z3 second three-dimensional structures; when some data of the first three-dimensional structure is insufficient to form the second three-dimensional structure, data is filled in based on some data of the first three-dimensional structure to obtain the second three-dimensional structure; wherein, F is not greater than N, E is not greater than R, D is not greater than C, the value of Z1 is the ratio of N to F rounded up, the value of Z2 is the ratio of R to E rounded up, and the value of Z3 is the ratio of C to D rounded up;
[0024] According to the scanning method corresponding to the first three-dimensional structure, the second three-dimensional structure to be scanned is selected sequentially from each of the second three-dimensional structures to obtain the first arrangement order, and the following steps are performed on the currently selected second three-dimensional structure:
[0025] The data within the second three-dimensional structure is scanned and stored according to the scanning method corresponding to the second three-dimensional structure to obtain the second arrangement order.
[0026] According to a neural network accelerator provided by the present invention, the first arrangement order is an arrangement order formed by the second three-dimensional structures in the target data according to the priority order of the channel direction, row direction, and column direction, wherein the priority order is one of the following orders: row direction, column direction, and channel direction; column direction, row direction, and channel direction; channel direction, row direction, and column direction; channel direction, column direction, and row direction; row direction, channel direction, and column direction; column direction, channel direction, and row direction.
[0027] The second arrangement order is the arrangement order of each data in the second three-dimensional structure according to the priority order of the channel direction, row direction and column direction, wherein the priority order is one of the following orders: row direction, column direction and channel direction; column direction, row direction and channel direction; channel direction, row direction and column direction; channel direction, column direction and row direction; row direction, channel direction and column direction; column direction, channel direction and row direction.
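The six priority orders listed for both arrangement orders are exactly the permutations of the three directions. The sketch below enumerates them and walks a small index space in a chosen order. It assumes that the first direction named in a priority order is the one that varies fastest; the text does not state this explicitly, and the names `scan_order` and `ALL_PRIORITIES` are hypothetical.

```python
from itertools import permutations, product
from typing import Iterator, Sequence, Tuple

AXES = ("channel", "row", "column")
# All six possible priority orders of the three directions.
ALL_PRIORITIES = list(permutations(AXES))


def scan_order(shape: Tuple[int, int, int],
               priority: Sequence[str]) -> Iterator[Tuple[int, int, int]]:
    """Yield (channel, row, column) indices; priority[0] varies fastest (assumed)."""
    sizes = dict(zip(AXES, shape))
    # itertools.product varies its last argument fastest, so reverse the priority.
    slow_to_fast = tuple(priority)[::-1]
    for combo in product(*(range(sizes[axis]) for axis in slow_to_fast)):
        index = dict(zip(slow_to_fast, combo))
        yield (index["channel"], index["row"], index["column"])


order = list(scan_order((2, 1, 2), ("channel", "column", "row")))
```

The same generator can be applied twice: once over the Z1*Z2*Z3 grid of blocks for the first arrangement order, and once over the F*E*D indices inside a block for the second.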
[0028] According to a neural network accelerator provided by the present invention, a scheduler is further included; the scheduler is connected to the data warping and control module, the accelerated computing module, the data rearrangement module and the direct memory access controller respectively, and is used to parse instruction sequences and coordinate the operation of the data warping and control module, the accelerated computing module, the data rearrangement module and the direct memory access controller according to the parsing results of the instruction sequences.
[0029] According to a neural network accelerator provided by the present invention, the second three-dimensional structure corresponds to the structure of the neural network; and / or, the data in the second three-dimensional structure is read once and then all the calculations that need to be performed are completed.
[0030] The present invention also provides a neural network acceleration method based on any of the above-described neural network accelerators, comprising:
[0031] The direct memory access controller continuously reads the target data required for accelerated computation of the current layer of the neural network, which is stored in external memory according to the data arrangement order, and sends it to the data warping and control module.
[0032] The data warping and control module generates at least one target data segment based on the received data according to the bit width of the data warping and control module, the bit width of the computing unit, and the convolution parameters, and distributes each target data segment to the accelerated computing module.
[0033] The accelerated computing module performs accelerated computing on the target data segment from the data warping and control module based on at least some computing units, and outputs the computing results of the computing units to the data rearrangement module.
[0034] The data rearrangement module merges the calculation results of each calculation unit according to the data arrangement order required for accelerated calculation of the next layer of the neural network, and uses it as the input for accelerated calculation of the next layer.
[0035] The present invention also provides a neural network acceleration device, including an external memory and a neural network accelerator as described in any of the above.
[0036] The neural network accelerator provided by this invention allows for the storage of target data in the data format of the first three-dimensional structure in an external memory, arranged in a data order. This data order is obtained as follows: for the target data of the first three-dimensional structure with N channels * R rows * C columns, data is stored in units of the second three-dimensional structure with F channels * E rows * D columns required for accelerated neural network computation, according to the first arrangement order of each of the second three-dimensional structures in the first three-dimensional structure; for each second three-dimensional structure, data is stored in the second arrangement order of all data in the second three-dimensional structure. The direct memory access controller in the neural network accelerator can continuously read the target data required for accelerated computation of the current layer of the neural network stored in external memory and send it to the data warping and control module. The data warping and control module can generate at least one target data segment based on the received data, according to its bit width, the bit width of the computation unit, and the convolution parameters. Each target data segment is then distributed to the accelerated computation module. The accelerated computation module can perform accelerated computation on each target data segment from the data warping and control module based on at least some of the computation units, and output the computation results of each computation unit to the data rearrangement module. The data rearrangement module can merge the computation results of each computation unit according to the data arrangement order required for accelerated computation of the next layer of the neural network to serve as the input for accelerated computation of the next layer. 
In this way, reading the data of the current layer of the neural network and generating target data segments maximizes data bandwidth. The accelerated computation of each layer of the neural network uses an appropriate data arrangement method to achieve efficient data interaction, which can greatly improve data bandwidth capability, fully utilize the computational resources of the accelerated computation module, solve the problem of imbalance between computational resources and data bandwidth, improve the overall performance of the neural network accelerator, and reduce power consumption.
Attached Figure Description
[0037] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0038] Figure 1 is the first structural schematic diagram of the neural network accelerator provided by the present invention;
[0039] Figure 2 is the first schematic diagram of the first three-dimensional structure provided by the present invention;
[0040] Figure 3 is a schematic diagram of the first and second three-dimensional structures provided by the present invention;
[0041] Figure 4 is the first schematic diagram of the second three-dimensional structure provided by the present invention;
[0042] Figure 5 is the second schematic diagram of the second three-dimensional structure provided by the present invention;
[0043] Figure 6 is the third schematic diagram of the second three-dimensional structure provided by the present invention;
[0044] Figure 7 is the fourth schematic diagram of the second three-dimensional structure provided by the present invention;
[0045] Figure 8 is the fifth schematic diagram of the second three-dimensional structure provided by the present invention;
[0046] Figure 9 is the sixth schematic diagram of the second three-dimensional structure provided by the present invention;
[0047] Figure 10 is the second structural schematic diagram of the neural network accelerator provided by the present invention;
[0048] Figure 11 is the third structural schematic diagram of the neural network accelerator provided by the present invention;
[0049] Figure 12 is the fourth structural schematic diagram of the neural network accelerator provided by the present invention;
[0050] Figure 13 is the fifth structural schematic diagram of the neural network accelerator provided by the present invention;
[0051] Figure 14 is the sixth structural schematic diagram of the neural network accelerator provided by the present invention;
[0052] Figure 15 is a flowchart illustrating the neural network acceleration method provided by the present invention.
Detailed Implementation
[0053] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0054] The neural network accelerator of the present invention is described below with reference to Figures 1 to 14.
[0055] As shown in Figure 1, this embodiment provides a neural network accelerator 100, including: an accelerated computing module 101, a direct memory access controller 103, a data rearrangement module 104, and a data warping and control module 105. The accelerated computing module 101 includes multiple computing units.
[0056] The direct memory access controller 103 continuously reads the target data required for the accelerated computation of the current layer of the neural network, which is stored in the external memory according to the data arrangement order, and sends it to the data warping and control module 105. The data arrangement order of the target data is obtained as follows: the target data of the first three-dimensional structure with N channels * R rows * C columns is stored in units of the second three-dimensional structure required for the accelerated computation of the current layer of the neural network, following the first arrangement order of the second three-dimensional structures within the first three-dimensional structure; within each second three-dimensional structure, the data is stored in the second arrangement order of all data in that structure.
[0057] The data warping and control module 105 is used to generate at least one target data segment based on the received data according to the bit width of the data warping and control module, the bit width of the computing unit and the convolution parameters, and distribute each target data segment to the accelerated computing module 101.
[0058] The accelerated computing module 101 is used to perform accelerated computing on the target data segment from the data warping and control module 105 based on at least some computing units, and output the computing results of the computing units to the data rearrangement module 104.
[0059] The data rearrangement module 104 is used to merge the calculation results of each calculation unit according to the data arrangement order required for the accelerated calculation of the next layer of the neural network, so as to serve as the input for the accelerated calculation of the next layer.
[0060] The solution in this embodiment can be applied to the field of image processing, specifically to the field of artificial intelligence image processing chip technology, to accelerate the inference of image processing neural networks. The target data can be image data; for example, the image data can include feature map data, where the feature data of each pixel is stored in a certain order.
[0061] The bit width of the data warping and control module refers to the bit width of its interface. The bit width of the computation unit refers to the bit width of its interface. Convolution parameters are the parameters used in the convolution calculation of the current layer of the neural network, and may include parameters such as the size of the convolution kernel and the stride.
[0062] As shown in Figure 1, the neural network accelerator 100 also includes a cache module 102. The direct memory access controller 103, the data rearrangement module 104, and the data warping and control module 105 are each connected to the cache module 102. Specifically, the direct memory access controller 103 continuously reads the target data stored in the external memory 200 in the data arrangement order into the cache module 102. The data warping and control module 105 receives data from the cache module 102. The data rearrangement module 104 merges the calculation results of each computing unit according to the data arrangement order required for the accelerated calculation of the next layer of the neural network, and then outputs the merged results to the cache module 102 as input for the accelerated calculation of the next layer.
[0063] In practical applications, the external memory 200 can be, but is not limited to, Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), which can be abbreviated as DDR. The input data required by the neural network input layer, including feature map data, can be stored in the external memory 200. The external memory 200 can store data in the order of data arrangement.
[0064] In practical applications, data in the format of a three-dimensional structure X can be scanned and ordered. As shown in Figure 2, the data size of X is N*R*C, where N represents channels, R represents rows, and C represents columns, and N, R, and C are all positive integers. A pixel in X is denoted as Xn.r.c, representing the pixel in the r-th row and c-th column of the n-th channel. The last row, last column, or last channel can be marked with an end marker.
[0065] When scanning data, you can first scan the feature map data of one channel and then scan the feature map data of the next channel. This is called the first scanning method, which is either the column direction first, then the row direction, and then the channel direction, or the row direction first, then the column direction, and then the channel direction. For example, the scanning order is as follows:
[0066] The first row of the first channel feature map data: X1.1.1, X1.1.2, X1.1.3, ..., X1.1.end; the second row of the first channel feature map data: X1.2.1, X1.2.2, ..., X1.2.end, ...; the last row of the first channel feature map data: X1.end.1, X1.end.2, ..., X1.end.end; the first row of the second channel feature map data: X2.1.1, X2.1.2, X2.1.3, ..., X2.1.end; the second row of the second channel feature map data: X2.2.1, X2.2.2, ..., X2.2.end, ...; the last row of the second channel feature map data: X2.end.1, X2.end.2, ..., X2.end.end, ...; the last pixel of the last channel feature map data: Xend.end.end. This scanning order can be understood as feature map priority scanning. It's important to note that for feature map data of a single channel, since observing by row and by column only affects the order of the corresponding weight data, the feature map priority scanning method can either scan along the row direction first and then along the column direction, or vice versa. During storage, data can be stored according to the scanning order.
[0067] Alternatively, data can be scanned first in the channel direction, then the column direction, and then the row direction, or first in the channel direction, then the row direction, and then the column direction. This is called the second scanning method. For example, the first pixel of all channels is scanned, then the second pixel of all channels, and so on, until all pixels have been scanned. An example scanning sequence is as follows:
[0068] The data in the first row and first column of the feature map data for all channels is: X1.1.1, X2.1.1, X3.1.1, ..., Xend.1.1; the data in the first row and second column of the feature map data for all channels is: X1.1.2, X2.1.2, X3.1.2, ..., Xend.1.2, ...; the data in the first row and last column of the feature map data for all channels is: X1.1.end, X2.1.end, ..., Xend.1.end; the data in the second row and first column of the feature map data for all channels is: X1.2.1, X2.2.1, ..., Xend.2.1; the data in the second row and second column of the feature map data for all channels is: X1.2.2, X2.2.2, ..., Xend.2.2, ...; the data in the last row and last column of the feature map data for all channels is: X1.end.end, ..., Xend.end.end. This scanning order can be understood as channel-first scanning.
[0069] Alternatively, data can be scanned first in the column direction, then the channel direction, and then the row direction, or first in the row direction, then the channel direction, and then the column direction. This is called the third scanning method. For example, the first row of pixels in the first channel is scanned, then the first row of pixels in the other channels, then the second row of pixels in all channels, and so on, until all pixels have been scanned. An example scanning sequence is as follows:
[0070] The first row of the feature map data for the first channel: X1.1.1, X1.1.2, X1.1.3, ..., X1.1.end; the first row of the feature map data for the second channel: X2.1.1, X2.1.2, ..., X2.1.end, ...; the first row of the feature map data for the last channel: Xend.1.1, Xend.1.2, ..., Xend.1.end; the second row of the feature map data for the first channel: X1.2.1, X1.2.2, X1.2.3, ..., X1.2.end; the second row of the feature map data for the second channel: X2.2.1, X2.2.2, ..., X2.2.end, ...; the last row of the feature map data for the last channel: Xend.end.1, Xend.end.2, ..., Xend.end.end. This is a hybrid scanning sequence, where channel-direction scanning is performed between the row and column directions.
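The three example sequences above differ only in how the channel, row, and column loops are nested. The following sketch reproduces them for a small structure using the Xn.r.c labels from the text; the function names are hypothetical, and each function shows only the one variant of its scanning method that matches the worked example.

```python
from typing import List


def feature_map_first(N: int, R: int, C: int) -> List[str]:
    """First scanning method: finish one channel's feature map before the next."""
    return [f"X{n}.{r}.{c}"
            for n in range(1, N + 1) for r in range(1, R + 1) for c in range(1, C + 1)]


def channel_first(N: int, R: int, C: int) -> List[str]:
    """Second scanning method: visit all channels of a pixel before the next pixel."""
    return [f"X{n}.{r}.{c}"
            for r in range(1, R + 1) for c in range(1, C + 1) for n in range(1, N + 1)]


def hybrid(N: int, R: int, C: int) -> List[str]:
    """Third scanning method: one row of every channel, then the next row."""
    return [f"X{n}.{r}.{c}"
            for r in range(1, R + 1) for n in range(1, N + 1) for c in range(1, C + 1)]


for scan in (feature_map_first, channel_first, hybrid):
    print(scan.__name__, scan(2, 2, 2)[:4])
```

In each comprehension the rightmost `for` is the fastest-varying direction, so swapping the loop order is all it takes to move between the three methods.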
[0071] In this embodiment, the target data of the first three-dimensional structure can be stored in the external memory 200 in the data arrangement order. This target data is all the data required to obtain the calculation results on the output side of the neural network. The direct memory access controller 103 in the neural network accelerator 100 mainly controls data access between the external memory 200 and the storage space of the cache module 102, and follows the characteristics of the external memory 200 to complete high-bandwidth data transfer. The direct memory access controller 103 can be connected to the external memory 200 and read the target data that needs to be input to the neural network from the external memory 200 into the cache module 102 in batches. The cache module 102 can be a cache array formed by multiple caches, including a first cache for storing feature map data, a second cache for storing weight data, and a third cache for storing bias data. The data warping and control module 105 and the direct memory access controller 103 are each connected to the first cache, the second cache, and the third cache. Because feature map data and weight data are the two factors of a multiplication operation, the data stored in the first and second caches can be interchanged: for example, feature map data can be stored in the second cache and weight data in the first cache. This can improve data bandwidth when performing fully connected computations.
[0072] The data arrangement order of the target data is obtained as follows: For the target data of the first three-dimensional structure with N channels * R rows * C columns, the data is stored in the first arrangement order of each second three-dimensional structure in the first three-dimensional structure, taking the second three-dimensional structure with F channels * E rows * D columns required for accelerated computation of the neural network as the unit; for each second three-dimensional structure, the data is stored in the second arrangement order of all data in the second three-dimensional structure.
[0073] For example, as shown in Figure 3, the specific method for obtaining the data arrangement order of the target data includes:
[0074] Using the second three-dimensional structure as a unit, the target data of the first three-dimensional structure is divided into blocks to obtain Z1*Z2*Z3 second three-dimensional structures; when some data of the first three-dimensional structure is insufficient to form the second three-dimensional structure, data is filled in based on some data of the first three-dimensional structure to obtain the second three-dimensional structure; wherein, F is not greater than N, E is not greater than R, D is not greater than C, the value of Z1 is the ratio of N to F rounded up, the value of Z2 is the ratio of R to E rounded up, and the value of Z3 is the ratio of C to D rounded up;
[0075] According to the scanning method corresponding to the first three-dimensional structure, the second three-dimensional structure to be scanned is selected sequentially from each of the second three-dimensional structures to obtain the first arrangement order, and the following steps are performed on the currently selected second three-dimensional structure:
[0076] The data within the second three-dimensional structure is scanned and stored according to the scanning method corresponding to the second three-dimensional structure to obtain the second arrangement order.
[0077] Where D, E, and F are all positive integers, and their specific values can be set according to actual needs. In this way, the data of the first three-dimensional structure is divided into blocks according to the second three-dimensional structure to obtain the required target data.
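The blocking and two-level scanning steps above can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the function names `scan` and `tiled_order` are hypothetical, padded positions are marked with `None` rather than filled with actual padding data, and the sketch assumes that the first direction named in a priority order is the one that varies fastest.

```python
from itertools import product
from math import ceil

# Axis positions inside a (channel, row, column) index tuple.
AXES = {"channel": 0, "row": 1, "column": 2}

def scan(extents, priority):
    """Yield (channel, row, column) indices over `extents`.

    Assumption: priority[0] is the direction that varies fastest.
    """
    # itertools.product varies its LAST range fastest, so list slow axes first.
    slow_to_fast = list(reversed(priority))
    ranges = [range(extents[AXES[name]]) for name in slow_to_fast]
    for combo in product(*ranges):
        idx = [0, 0, 0]
        for name, value in zip(slow_to_fast, combo):
            idx[AXES[name]] = value
        yield tuple(idx)

def tiled_order(shape, block, outer_priority, inner_priority):
    """Return (Z1, Z2, Z3) and the storage order of an N*R*C tensor.

    shape = (N, R, C), block = (F, E, D). Entries of the returned list are
    source coordinates, or None where a block had to be padded.
    """
    counts = tuple(ceil(s / b) for s, b in zip(shape, block))
    order = []
    for blk in scan(counts, outer_priority):        # first arrangement order
        for off in scan(block, inner_priority):     # second arrangement order
            coord = tuple(b * size + o for b, size, o in zip(blk, block, off))
            inside = all(x < s for x, s in zip(coord, shape))
            order.append(coord if inside else None)
    return counts, order
```

For instance, a 2*3*4 tensor blocked with F=1, E=2, D=4 yields Z1*Z2*Z3 = 2*2*1 blocks; because R=3 is not a multiple of E=2, the blocks covering the last rows contain padded positions.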
[0078] The first arrangement order is the order in which the second three-dimensional structures are arranged within the target data, determined by a priority order over the channel direction, the row direction, and the column direction. The priority order is one of the following: row, column, channel; column, row, channel; channel, row, column; channel, column, row; row, channel, column; or column, channel, row. The second arrangement order is the order in which the data are arranged within a second three-dimensional structure, likewise determined by a priority order over the channel direction, the row direction, and the column direction, and it is also one of the six priority orders listed above.
[0079] The second three-dimensional structure is obtained as follows:
[0080] When the target data of the first three-dimensional structure is divided into blocks with the second three-dimensional structure as the unit, the data block format may be any one of a feature map priority format, a channel priority format, or a hybrid format. In the feature map priority format, the second three-dimensional structure has F equal to 1, D from 1 to C, and E from 1 to R. In the channel priority format, D and E are both 1 and F is from 1 to N. In the hybrid format, D is from 1 to C, F is from 1 to N, and E is 1.
[0081] The first three-dimensional structure is divided into blocks according to the data block format to obtain the second three-dimensional structure.
[0082] When the first three-dimensional structure is divided into blocks according to the feature map priority format, as shown in Figure 4, it can be segmented sequentially in that format to obtain the second three-dimensional structures, for example with F=1, D=C, and E=R. As shown in Figure 5, each of the second three-dimensional structures in the first three-dimensional structure can be scanned using the first scanning method described above, and the resulting data is arranged in the first arrangement order.
[0083] When the first three-dimensional structure is divided into blocks according to the channel priority format, as shown in Figure 6, it is segmented in that format to obtain the second three-dimensional structures, for example with F=N, D=1, and E=1. As shown in Figure 7, each of the second three-dimensional structures in the first three-dimensional structure can be scanned using the second scanning method described above, and the resulting data is arranged in the first arrangement order.
[0084] When the first three-dimensional structure is divided into blocks according to the hybrid format, as shown in Figure 8, it can be segmented in that format to obtain the second three-dimensional structures, for example with F=N, D=C, and E=1. As shown in Figure 9, each of the second three-dimensional structures in the first three-dimensional structure can be scanned using the third scanning method described above, and the resulting data is arranged in the first arrangement order.
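As a rough illustration of the three example block shapes given above, the following Python sketch lists the (F, E, D) values for each format and the resulting block counts (Z1, Z2, Z3). The function names and the sample tensor size are hypothetical and chosen only for this illustration.

```python
from math import ceil

def example_block_shapes(N, R, C):
    """Example (F, E, D) block shapes, taken from the three examples above."""
    return {
        "feature map priority": (1, R, C),  # F=1, E=R, D=C
        "channel priority": (N, 1, 1),      # F=N, E=1, D=1
        "hybrid": (N, 1, C),                # F=N, E=1, D=C
    }

def block_counts(shape, block):
    """(Z1, Z2, Z3): blocks along channel, row, column, each rounded up."""
    return tuple(ceil(s / b) for s, b in zip(shape, block))

if __name__ == "__main__":
    shape = (64, 56, 56)  # hypothetical N, R, C
    for name, block in example_block_shapes(*shape).items():
        print(name, block, block_counts(shape, block))
```

With these example shapes, the feature map priority format yields one block per channel, the channel priority format one block per pixel, and the hybrid format one block per row.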
[0085] In this system, the second three-dimensional structure corresponds to the structure of the neural network; and / or, the data in the second three-dimensional structure is read once to complete all the necessary calculations. Different neural network structures can adopt different second three-dimensional structures to match actual computational acceleration requirements. In implementation, data read from external memory can complete all the necessary calculations at once, thus eliminating the need for a second read after the data is used up, thereby further improving computational efficiency.
[0086] Within each second three-dimensional structure, data is stored according to the second arrangement order of all data within the second three-dimensional structure. Each pixel within the second three-dimensional structure can also be scanned using the first, second, or third scanning method described above, resulting in data arranged in the second arrangement order.
[0087] As can be seen, in the first three-dimensional structure, the second three-dimensional structure is used as the granularity, and the first, second, or third scanning methods are used to form the first arrangement order. In the second three-dimensional structure, pixels are used as the granularity, and the first, second, or third scanning methods can also be used to form the second arrangement order.
[0088] The data warping and control module can read into the accelerated computing module the target data that was stored in external memory according to the data block format, the second arrangement order, and the first arrangement order. The computing units in the accelerated computing module can perform convolution calculations or fully connected calculations using parallel multiplication-accumulation or serial multiplication-accumulation. Within one clock cycle, all multipliers are actively working, where active work means multiplying valid input feature data by weight data. The number of multipliers actively working within one clock cycle is the parallelism. For example, if the products of 1024 feature map data and weight data can be calculated in parallel, with 1024 multipliers actively working, the parallelism is 1024. The bit width of the computing unit is fixed and is typically set to 2^a (2 to the power of a) bits. Therefore, the data warping and control module must generate target data segments from the received data so that the data required for accelerated computing fits the computing units, and a matching calculation method is used to accelerate the calculation.
[0089] The data rearrangement module 104 in the neural network accelerator 100 can collect and merge the computation results of the computing units. Since the output of the current layer's computation result is the input of the next layer of the neural network, the data can be merged according to the data arrangement order required for the input of the next layer, and then output to the external memory 200. If the current layer is not the last layer of the neural network, accelerated computation of the next layer is required. Therefore, the computation result of the current layer can be used as the input of the accelerated computation of the next layer. If the current layer is the last layer of the neural network, the computation result of the current layer is output to the external memory 200 as the final computation result. Here, the data arrangement order required for the input of the next layer of the neural network can be the data arrangement order obtained by using the first scan method, the second scan method, or the third scan method.
[0090] Different neural network layers can be stored using different data arrangement orders, allowing for convenient one-time data retrieval and computation, thus improving data utilization. Direct memory access controllers achieve optimal efficiency when accessing external memory via contiguous address access.
[0091] In this way, the data that the current layer of the neural network needs as input is read and the target data segment is generated at the highest data bandwidth. The accelerated computation of each layer of the neural network adopts an appropriate data arrangement order, all calculations are completed in one read, and efficient data interaction is achieved. This can greatly improve the data bandwidth capability, make full use of the computing resources of the accelerated computing module, solve the problem of imbalance between computing resources and data bandwidth, improve the overall performance of the neural network accelerator, and reduce power consumption.
[0092] In an exemplary embodiment, as shown in Figure 1, the neural network accelerator 100 may also include a scheduler 106; the scheduler 106 is connected to the data warping and control module 105, the accelerated computing module 101, the data rearrangement module 104 and the direct memory access controller 103 respectively, and is used to parse the instruction sequence and to coordinate the operation of the data warping and control module 105, the accelerated computing module 101, the data rearrangement module 104 and the direct memory access controller 103 according to the parsing result of the instruction sequence.
[0093] The instruction sequence can be generated by the compiler based on the structure of the neural network and the partitioning of the data address space. The weight data can also be rearranged offline according to the feature map calculation method.
[0094] The cache module 102 also includes an input cache and an output cache connected to the direct memory access controller 103. The data rearrangement module 104 can be connected to the output cache via the scheduler 106, outputting the calculation results of the current layer to the output cache via the scheduler 106 and writing them back to the external memory 200 by the direct memory access controller 103. The scheduler 106 can obtain the instruction sequence from the external memory 200 via the input cache.
[0095] In this embodiment, the neural network accelerator 100 is equipped with a scheduler 106, which can reasonably schedule the data warping and control module 105, the accelerated computing module 101, the data rearrangement module 104 and the direct memory access controller 103 according to the instruction sequence, thereby improving processing efficiency.
[0096] In an exemplary embodiment, the data length of the target data segment is matched with the bit width of the computing unit and the convolution parameters, and the data length of the target data segment is a first data length, which is a preset data length of the target data segment; the data warping and control module 105 is specifically used for:
[0097] The received data is buffered, and the length of the received data is less than or equal to the bit width of the data warping and control module;
[0098] Each time, a data segment is selected from the received data in units of the second data length as the main data segment, until all the received data has been selected. The second data length is a preset data length.
[0099] When the length of the main data segment is less than the first data length, first splicing data and second splicing data are obtained. The first splicing data is preset padding data or data in the received data that is adjacent to the start position of the main data segment. The second splicing data is preset padding data or data in the received data that is adjacent to the end position of the main data segment. The first splicing data, the main data segment, and the second splicing data are spliced together to obtain the target data segment.
[0100] When the data length of the main data segment is equal to the first data length, the main data segment is used as the target data segment;
[0101] The target data segment is distributed to the accelerated computing module.
[0102] For example, as shown in Figure 10, the data warping and control module 105 includes a data cache unit 1051, a first data selection unit 1052, an intermediate cache unit 1053, an address generation unit 1054, a first splicing selection unit 1055, a second splicing selection unit 1056, a third splicing selection unit 1057, and a data output unit 1058.
[0103] The data cache unit 1051 is used to buffer the received data;
[0104] The data output unit 1058 is used to generate and output the target data segment;
[0105] The address generation unit 1054 is used to generate corresponding data addresses for the first data selection unit 1052 and the first splicing selection unit 1055; it is also used to generate corresponding data addresses for the second splicing selection unit 1056 and the third splicing selection unit 1057 when the data length of the main data segment is less than the first data length.
[0106] The first data selection unit 1052 is used to select a data segment from the data cache unit 1051 and store it in the intermediate cache unit 1053 in one clock cycle according to the corresponding data address, with the second data length as the unit.
[0107] Intermediate cache unit 1053 is used to store padding data and data segments from first data selection unit 1052;
[0108] The first splicing selection unit 1055 is used to select a data segment from the intermediate cache unit 1053 as the main data segment according to the corresponding data address and provide it to the data output unit 1058.
[0109] The second splicing selection unit 1056 is used to select the first splicing data from the intermediate cache unit 1053 according to the corresponding data address and provide it to the data output unit 1058; the first splicing data is either padding data or data in the intermediate cache unit 1053 that is adjacent to the start position of the main data segment;
[0110] The third splicing selection unit 1057 is used to select the second splicing data from the intermediate cache unit 1053 according to the corresponding data address and provide it to the data output unit 1058; the second splicing data is either padding data or data in the intermediate cache unit 1053 that is adjacent to the end position of the main data segment.
[0111] The first data selection unit 1052, the first splicing selection unit 1055, the second splicing selection unit 1056, and the third splicing selection unit 1057 may include a gate.
[0112] Data cache unit 1051 may include a cache. Intermediate cache unit 1053 may include a cache or a register.
[0113] Data length can be represented by the number of bytes the data occupies.
[0114] Taking a first data length of 10 bytes and a second data length of 8 bytes as an example, the data cache unit 1051 receives 64 bytes of data. The first data selection unit 1052 can acquire the data in a time-division manner over 8 clock cycles. In each clock cycle, it acquires an 8-byte data segment and stores it in the intermediate cache unit 1053. For example, the intermediate cache unit 1053 stores three 8-byte data segments. The first splicing selection unit 1055 can acquire the middle 8-byte data segment as the main data segment and provide it to the data output unit 1058. The second splicing selection unit 1056 can acquire 1 byte of data from the 8-byte data segment adjacent to the start position of the main data segment and provide it to the data output unit 1058. The third splicing selection unit 1057 can acquire 1 byte of data from the 8-byte data segment adjacent to the end position of the main data segment and provide it to the data output unit 1058. The data output unit 1058 concatenates the 1 byte of data from the second splicing selection unit 1056, the 8-byte main data segment from the first splicing selection unit 1055, and the 1 byte of data from the third splicing selection unit 1057 to obtain a 10-byte target data segment, and sends the target data segment to the accelerated computing module.
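The splicing in this example can be sketched in software as follows. This is a behavioral sketch only, not a description of the hardware units: the function name `make_target_segments` is hypothetical, the extra bytes are assumed to be split evenly between the start and the end of the main data segment, and zero is assumed as the preset padding value wherever no adjacent data exists.

```python
def make_target_segments(received: bytes, main_len: int = 8,
                         target_len: int = 10, pad_byte: int = 0):
    """Split `received` into main segments and splice neighbours or padding.

    Assumptions: symmetric splicing, zero padding at the buffer edges, and
    len(received) being a multiple of main_len.
    """
    if main_len > target_len:
        raise ValueError("main segment must not exceed the target length")
    extra = target_len - main_len
    left = extra // 2          # bytes spliced before the main segment
    right = extra - left       # bytes spliced after the main segment
    segments = []
    for start in range(0, len(received), main_len):
        end = min(start + main_len, len(received))
        head = received[max(0, start - left):start]   # first splicing data
        tail = received[end:end + right]              # second splicing data
        # Fall back to padding where there is no adjacent data.
        head = bytes([pad_byte]) * (left - len(head)) + head
        tail = tail + bytes([pad_byte]) * (right - len(tail))
        segments.append(head + received[start:end] + tail)
    return segments
```

With 64 received bytes this produces eight 10-byte target data segments; when the two lengths are equal, each main data segment is passed through unchanged.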
[0115] In this embodiment, the data warping and control module can combine the bit width of the data warping and control module, the bit width of the computing unit, and the convolution parameters to generate a suitable target data segment, thereby improving resource utilization.
[0116] In an exemplary embodiment, as shown in Figure 11, the computing unit includes a first multiply-accumulate module and a second multiply-accumulate module;
[0117] The first multiplication and accumulation module is used to perform parallel multiplication and accumulation calculations based on L+(x-1) feature data and x weight data in the row direction from the data warping and control module. The parallel multiplication and accumulation calculation includes sequentially selecting each weight data from the x weight data as the weight data W[i] to be calculated, where i ranges from 0 to x-1. W[i] represents the (i+1)th weight data in the x weight data. The module performs the following calculations: calculates the product of W[i] with each of the (i+1)th to (i+L)th feature data in the L+(x-1) feature data, and accumulates each product into the corresponding first intermediate memory. Here, x represents the size parameter of the convolution kernel. For example, if the size of the convolution kernel is 3*3, then x = 3. L is a positive integer, and the specific value of L can be set according to actual needs. For example, the value of L is the second data length.
[0118] The second multiplication and accumulation module is used to perform serial multiplication and accumulation calculations based on K feature data in the channel direction and K weight data from the data warping and control module. The serial multiplication and accumulation calculation includes sequentially selecting each weight data from the K weight data as the weight data W[j] to be calculated, where j ranges from 0 to K-1, and W[j] represents the (j+1)th weight data in the K weight data. The module performs the following calculation: it calculates the product of the (j+1)th feature data in the K feature data and W[j], and accumulates the product into the second intermediate memory. K is a positive integer, and the specific value of K can be set according to actual needs. For example, the value of K is the second data length.
[0119] The computation unit is used to perform multiply-accumulate computation using either the first multiply-accumulate module or the second multiply-accumulate module that matches the accelerated computation of the current layer.
[0120] In practical applications, as shown in Figure 12, the accelerated computing module 101 includes m*h computing units (PEs), and m PEs (denoted as PE_1 to PE_m in the figure) form a PE cluster. Figure 12 illustrates a total of h PE clusters, with PE_cluster_1 to PE_cluster_h representing different PE clusters. The computing unit can perform multiply-accumulate calculations, activation calculations, pooling, distance calculations, matrix accumulation, matrix dot multiplication, etc. During multiply-accumulate calculations, the computing unit can use either the first multiply-accumulate module or the second multiply-accumulate module, depending on the configuration.
[0121] For example, the accelerated computing module includes 8 PE clusters, and each PE cluster includes 8 PEs. The 10-byte target data segment can be distributed to the 8 PE clusters to obtain the computing results of the 8 PE clusters.
[0122] Taking a 3*3 convolution as an example, when the first multiply-accumulate module is used, the data warping and control module 105 sends out L+2 feature data Feature[L+1:0], which are L+2 feature data in one row of a feature map. With the 3 weight data W[2:0] obtained, we have:
[0123] {
[0124] for (i = 0; i < 3; i++)
[0125] mid_ram[L-1:0] = mid_ram[L-1:0] + Feature[i+L-1:i] * W[i];
[0126] }
[0127] Where mid_ram[L-1:0] represents the data stored in the first intermediate memory, and Feature[i+L-1:i] represents the (i+1)th feature data to the (i+L)th feature data.
[0128] For example, the data warping and control module 105 sends out 10 bytes of data, which are 10 feature data in one row of a feature map, Feature[9:0], and we have:
[0129] {
[0130] for (i = 0; i < 3; i++)
[0131] mid_ram[7:0] = mid_ram[7:0] + Feature[i+7:i] * W[i];
[0132] }
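The parallel multiply-accumulate loop above can be written out element by element in Python as a minimal sketch. The function name `parallel_mac` is hypothetical, and the slice Feature[i+L-1:i] is interpreted as the L consecutive elements starting at index i, so that entry k of the intermediate memory accumulates Feature[i+k]*W[i].

```python
def parallel_mac(feature, weights, mid_ram=None):
    """Accumulate W[i] * Feature[i .. i+L-1] into L partial sums.

    `feature` holds L + (x - 1) values of one feature-map row and `weights`
    holds the x weights of one kernel row. `mid_ram` models the first
    intermediate memory and defaults to all zeros.
    """
    x = len(weights)
    L = len(feature) - (x - 1)
    if L <= 0:
        raise ValueError("need at least x feature values")
    acc = [0] * L if mid_ram is None else list(mid_ram)
    for i, w in enumerate(weights):   # one weight per step
        for k in range(L):            # in hardware these L products run in parallel
            acc[k] += feature[i + k] * w
    return acc
```

For 10 feature values and 3 weights this yields 8 partial sums, matching the mid_ram[7:0] example; passing the previous result back in as `mid_ram` models accumulation over further kernel rows.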
[0133] For example, when the second multiply-accumulate module is used, the data warping and control module 105 sends out K feature data Feature[K-1:0], which are the K feature data in the channel direction at one pixel of the feature map. With the K weight data W[K-1:0] obtained, we have:
[0134] {
[0135] for (j = 0; j < K; j++)
[0136] mid_ram = mid_ram + Feature[j] * W[j];
[0137] }
[0138] Here, mid_ram represents the data stored in the second intermediate memory. Feature[j] represents the (j+1)th feature data.
[0139] For example, the data warping and control module 105 sends out 8 feature data Feature[7:0], which are the 8 feature data in the channel direction at one pixel of the feature map. With the 8 weight data W[7:0] obtained, we have:
[0140] {
[0141] for (j = 0; j < 8; j++)
[0142] mid_ram = mid_ram + Feature[j] * W[j];
[0143] }
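For comparison, the serial multiply-accumulate loop can be sketched in Python as a running dot product along the channel direction. The function name `serial_mac` is hypothetical, and `mid_ram` stands in for the single value held in the second intermediate memory.

```python
def serial_mac(feature, weights, mid_ram=0):
    """Accumulate Feature[j] * W[j] for j = 0 .. K-1 into one value."""
    if len(feature) != len(weights):
        raise ValueError("feature and weight vectors must both have length K")
    acc = mid_ram
    for f, w in zip(feature, weights):
        acc += f * w
    return acc

if __name__ == "__main__":
    # 8 channel values of one pixel, all weights equal to 2.
    print(serial_mac([1, 2, 3, 4, 5, 6, 7, 8], [2] * 8))
```

Passing the previous result as `mid_ram` models accumulation over successive groups of K channels.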
[0144] In practical applications, the target data segment can be distributed to the computing unit. In this embodiment, the accelerated computing module 101 is equipped with two types of multiply-accumulate modules. One of the matching multiply-accumulate modules can be selected according to the requirements to accelerate the calculation of the target data segment. When scanning by channel priority, the serial multiply-accumulate module can be used for calculation, and when scanning by feature map priority, the parallel multiply-accumulate module can be used for calculation, which is more efficient.
[0145] In an exemplary embodiment, the computing unit is specifically used for:
[0146] Cache the input weight data;
[0147] The input feature data can be selected and cached using the input selector in the selector group, or the shift controller in the selector group can be used to control the shift selector to shift the input feature data and then the shifted feature data can be selected and cached using the input selector.
[0148] Calculate the product of the corresponding cached feature data and weight data using each multiplier in the multiplier group;
[0149] The product of each multiplier is accumulated in parallel using the parallel accumulator in the adder group, or the product of all multipliers is accumulated serially using the serial accumulator in the adder group, and the accumulated result is output.
[0150] For example, as shown in Figure 13, the computing unit includes:
[0151] The system comprises a result selector group 1010, a feature data selector group 1011, a feature register group 1012, a weight register group 1013, a multiplier group 1014, a product selector group 1015, a first adder group 1016, a second adder group 1017, and a result register group 1018. The selector group includes the feature data selector group 1011.
[0152] Weight register group 1013 includes P weight registers, where P is a positive integer. The weight registers are used to store weight data. The diagram illustrates P = 8, therefore, it includes weight register W0, weight register W1, weight register W2, ..., weight register W7.
[0153] Feature register group 1012 includes P+x-1 feature registers, which are used to store feature data. In the figure, x=3 is used as an illustration, therefore, it includes feature register F0, feature register F1, feature register F2, ..., feature register F9.
[0154] The feature data selector group 1011 includes P+x-2 feature data selectors, each corresponding one-to-one with the first to the (P+x-2)th feature registers. The p-th feature data selector is used to send the received feature data to the p-th feature register, in which case the feature data selector acts as an input selector. It is also used to shift the feature data in the (p+1)th feature register to the p-th feature register, in which case the feature data selector acts as a shift selector. In Figure 13, the feature data selector on the left is the first feature data selector.
[0155] Multiplier group 1014 includes P multipliers. The P multipliers correspond one-to-one with the P weight registers, and also correspond one-to-one with the 1st to Pth feature registers. The q-th multiplier is used to calculate the product of the feature data in the q-th feature register and the weight data in the q-th weight register. The value of q ranges from 1 to P.
[0156] The product selector group 1015 includes P product selectors, each corresponding to one of the P multipliers. The q-th product selector is used to select whether to output the product of the q-th multiplier or to output 0. If it is not necessary to clear the existing accumulated result, the q-th product selector selects to output the product of the q-th multiplier; if it is necessary to clear the existing accumulated result, the q-th product selector selects to output 0.
[0157] The first adder group 1016 includes P first adders and P clear selectors. The P first adders correspond one-to-one with the P product selectors, and the P first adders correspond one-to-one with the P clear selectors.
[0158] The result register group 1018 includes P result registers, each corresponding to one of the P first adders, and each of the P result registers also corresponding to one of the P clear selectors.
[0159] The second adder group 1017 includes ((1 / 2)+(1 / 4)+(1 / 8)+···+(1 / P))*P second adders. The second adders include multiple levels. The number of second adders in the first level is P / 2. The number of second adders in the previous level is twice the number of second adders in the next level. The number of second adders in the last level is 1.
[0160] In the first level, each second adder corresponds to two of the P product selectors to add the outputs of the two product selectors. In the next level, each second adder corresponds to the two second adders in the previous level to add the outputs of the two second adders.
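The multi-level structure of the second adder group can be illustrated with a short Python sketch of a pairwise reduction. The function name `adder_tree` is hypothetical, and the sketch assumes that P is a power of two, as the halving of adder counts from level to level suggests.

```python
def adder_tree(products):
    """Sum P values with a binary adder tree.

    Returns the total and the number of adders used in each level.
    Assumption: P is a power of two and at least 2.
    """
    P = len(products)
    if P < 2 or P & (P - 1):
        raise ValueError("P must be a power of two (assumption of this sketch)")
    level = list(products)
    adders_per_level = []
    while len(level) > 1:
        # Each adder in this level combines two outputs of the previous level.
        level = [level[k] + level[k + 1] for k in range(0, len(level), 2)]
        adders_per_level.append(len(level))
    return level[0], adders_per_level
```

For P = 8 the levels contain 4, 2, and 1 adders, for a total of 7, which agrees with ((1/2)+(1/4)+(1/8))*8 = P-1.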
[0161] The result selector 1010 is connected to the second adder in the last level, to the first product selector, and to the first of the first adders.
[0162] The second adder in the last level provides its calculation result to the result selector 1010. The result selector 1010 is used to select either the calculation result of the second adder in the last level or the output of the first product selector, and to provide the selected value to the first of the first adders.
[0163] The feature data selector group 1011, feature register group 1012, weight register group 1013, multiplier group 1014, product selector group 1015, first adder group 1016, result selector 1010, and result register group 1018 form a parallel multiply-accumulate module, namely the first multiply-accumulate module. The parallel accumulator includes the product selector group 1015, the first adder group 1016, the result selector 1010, and the result register group 1018. When using the parallel multiply-accumulate module, the result selector 1010 selects to receive the output of the first product selector and provides it to the first first adder. In the first adder group 1016, each clear selector is used to select the accumulated result buffered in the corresponding result register and output it to the corresponding first adder when it is not necessary to clear the existing accumulated result; and to select 0 and output it to the corresponding first adder when it is necessary to clear the existing accumulated result. Each first adder is used to accumulate the output of the corresponding product selector with the output of the corresponding clear selector and cache the accumulated result in the corresponding result register.
[0164] The above-mentioned feature data selector group 1011, feature register group 1012, weight register group 1013, multiplier group 1014, product selector group 1015, second adder group 1017, first result register, first clear selector, first first adder, and result selector 1010 form a serial multiply-accumulate module, namely the second multiply-accumulate module. Here, the second multiply-accumulate module shares some structures with the first multiply-accumulate module, but they can also be separate. The above-mentioned serial accumulator includes product selector group 1015, second adder group 1017, first result register, first clear selector, first first adder, and result selector 1010. When using the serial multiply-accumulate module, the result selector 1010 selects to receive the calculation result of the second adder in the last level and provides it to the first first adder. The first adder is used to accumulate the calculation result of the second adder in the last level with the output of the first clear selector and cache the accumulated result in the first result register. This completes the serial multiplication and accumulation.
[0165] In this embodiment, multiplication and addition calculations are performed by shifting the feature data in the feature register, which can realize convolution operations with various convolution kernels and achieve higher resource utilization.
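How shifting the feature registers can realize a row convolution may be sketched as follows. This is a simplified behavioral model under stated assumptions, not the circuit of Figure 13: the function name `shift_mac` is hypothetical, every multiplier is assumed to use the same weight within one step, and the last feature register is assumed to be refilled with zero after each shift.

```python
def shift_mac(feature, weights, P=8):
    """Model P multipliers fed from P + x - 1 shifting feature registers.

    Assumptions: one weight is applied by all multipliers per step, and the
    registers shift by one position between steps.
    """
    x = len(weights)
    if len(feature) != P + x - 1:
        raise ValueError("expected P + x - 1 feature values")
    regs = list(feature)      # feature registers F0 .. F(P+x-2)
    results = [0] * P         # result registers, one per multiplier
    for w in weights:
        for q in range(P):    # the q-th multiplier reads the q-th register
            results[q] += regs[q] * w
        # Shift: register p takes the value of register p+1.
        regs = regs[1:] + [0]
    return results
```

Under these assumptions, after x steps each result register holds the same partial sum as the parallel multiply-accumulate loop with L = P, and a different kernel size only changes the number of steps and feature registers.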
[0166] In an exemplary embodiment, the data rearrangement module is specifically used for:
[0167] The calculation results of each of the computing units are collected and stored in different storage spaces in the first group of storage spaces;
[0168] According to the data arrangement order required for the accelerated computation of the next layer, for each storage space in the second group of storage spaces, the computation result of a corresponding computing unit is selected from the first group of storage spaces and stored to obtain the data arrangement order required for the accelerated computation of the next layer. Specifically, the computation result of a computing unit can be selected from the first group of storage spaces and stored in a corresponding storage space in the second group of storage spaces at one time, or the computation results of multiple computing units can be selected from the first group of storage spaces and stored in multiple corresponding storage spaces in the second group of storage spaces at one time, until the data rearrangement is completed and the data arrangement order required for the accelerated computation of the next layer is obtained.
[0169] For example, as shown in Figure 14, the data rearrangement module 104 includes a data collection cache unit 1041, a second data selection unit 1042, and a data merging and conversion unit 1043; the data collection cache unit 1041 includes multiple collection storage units that form a first group of storage spaces; the data merging and conversion unit 1043 includes multiple conversion storage units that form a second group of storage spaces.
[0170] The data collection cache unit 1041 is used to collect the calculation results of the computing unit and store them in the corresponding collection storage unit. The calculation results of different computing units correspond to different collection storage units.
[0171] The second data selection unit 1042 is used to select the calculation result of the calculation unit stored in the corresponding collection storage unit for each conversion storage unit in the data merging and conversion unit 1043 according to the data arrangement order required for the accelerated calculation of the next layer, and output it to the data merging and conversion unit 1043; different collection storage units correspond to different conversion storage units;
[0172] The data merging and conversion unit 1043 is used to store the calculation results of the calculation units stored in the collection storage unit into the corresponding conversion storage unit, and to merge and output the calculation results of the calculation units stored in each conversion storage unit.
[0173] The collection storage unit and the conversion storage unit can be memory.
[0174] In this embodiment, by converting the order of the calculation results stored in the collection storage unit through the conversion storage unit, the data rearrangement is quickly realized to obtain the data arrangement order required for the accelerated calculation of the next layer.
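The collect, select, and merge flow of the data rearrangement module can be sketched in software as follows. This is only an illustrative model under stated assumptions: the function name `rearrange`, the dictionary standing in for the collection storage units, and the `order` list encoding the arrangement required by the next layer are all hypothetical and do not appear in the source.

```python
from typing import Dict, List, Sequence


def rearrange(collected: Dict[int, bytes], order: Sequence[int]) -> bytes:
    """Model of the collection -> selection -> conversion/merge path.

    collected: first group of storage spaces, keyed by computing-unit index
               (one collection storage unit per computing unit).
    order:     order[k] names the computing unit whose result the k-th
               conversion storage unit must hold (hypothetical encoding of
               the next layer's required data arrangement order).
    """
    if sorted(order) != sorted(collected):
        raise ValueError("order must name every collected unit exactly once")
    # Second data selection unit: each conversion storage unit picks the
    # result held in its corresponding collection storage unit.
    conversion: List[bytes] = [collected[unit] for unit in order]
    # Data merging and conversion unit: merge and output.
    return b"".join(conversion)


# Four computing units, two result bytes each (illustrative values).
results = {0: b"A0", 1: b"B1", 2: b"C2", 3: b"D3"}
merged = rearrange(results, order=[2, 0, 3, 1])
```

In this model the rearrangement is a pure permutation of per-unit results, which is consistent with the statement that different collection storage units correspond to different conversion storage units.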
[0175] The following description, based on the structure shown in Figure 13, uses specific neural networks as examples.
[0176] For example, the neural network is a VGG16 network. Suppose the data arrangement order of the target data required for the accelerated computation of the current layer of the neural network is obtained using the first scanning method, while the data arrangement order required for the next layer of the neural network is obtained using the second scanning method. In this case, the data warping and control module acquires 64 bytes of feature data (in this specific design, the DDR interface width is 64 bytes) and processes it in a time-division multiplexing manner: in each clock cycle it takes 8 bytes of feature data and splices them into 10 bytes of feature data, which is then broadcast to the 8*64 computing units of the accelerated computing module. The computing units perform multiply-add calculations with shifting to obtain a 512-byte calculation result. The data rearrangement module collects the 512-byte calculation result and merges it into the data arrangement order of the second scanning method, yielding eight 64-byte calculation results.
[0177] When the data arrangement order of the target data required for the accelerated computation of the current layer of the neural network is obtained using the second scanning method, and the data arrangement order required for the next layer of the neural network is also obtained using the second scanning method, the data warping and control module acquires 64 bytes of feature data and processes it in a time-division manner: in each clock cycle, 8 bytes of feature data are taken and broadcast to the computing units of the accelerated computing module. The computing units do not perform shifting, and the results inside each computing unit are compressed into one, producing eight 8-byte calculation results in sequence. Once the data rearrangement module has collected a total of 64 bytes of calculation results, it converts them into a 64-byte calculation result in the data arrangement order of the second scanning method.
[0178] As another example, the neural network is a MobileNetV2 network. Suppose the data arrangement order of the target data required for the accelerated computation of the current layer of the neural network is obtained using the second scanning method, while the data arrangement order required for the next layer of the neural network is obtained using the third scanning method. In this case, the data warping and control module acquires 64 bytes of feature data and processes it in a time-division manner: in each clock cycle, 8 bytes of feature data are taken and broadcast to the computing units of the accelerated computing module. The computing units do not perform shifting, and the results inside each computing unit are compressed into one, producing eight 8-byte computation results (64 bytes in total) in sequence. Once the data rearrangement module has collected eight 64-byte computation results, it merges and converts them into computation results in the data arrangement order of the third scanning method.
[0179] When the data arrangement order of the target data required for the accelerated computation of the current layer of the neural network is obtained using the third scanning method, while the data arrangement order required for the next layer of the neural network is obtained using the second scanning method, the data warping and control module acquires 64 bytes of feature data and processes it in a time-division manner: in each clock cycle it takes 8 bytes of feature data and splices them into 10 bytes of feature data, which is broadcast to the computing units of the accelerated computing module. The computing units perform multiply-add calculations with shifting to obtain a 64-byte calculation result. After the data rearrangement module has collected eight 64-byte calculation results, totaling 512 bytes, it merges and converts them into the data arrangement order of the second scanning method.
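The byte counts in the first VGG16 case above can be checked with a few lines of arithmetic. The sketch below rests on two assumptions that the text does not state explicitly: that the 10-byte spliced segment corresponds to L + (x - 1) with L = 8 and a kernel size of x = 3, and that each of the 8*64 computing units contributes one byte to the 512-byte result. All constant names are illustrative.

```python
DDR_WIDTH_BYTES = 64   # DDR interface width given in the text
BYTES_PER_CYCLE = 8    # feature bytes taken per clock cycle (from the text)
KERNEL_SIZE = 3        # assumption: 3x3 convolution, so x = 3
UNITS = 8 * 64         # computing units in the accelerated computing module
BYTES_PER_UNIT = 1     # assumption: one result byte per computing unit

# Clock cycles needed to consume one 64-byte DDR word, 8 bytes at a time.
cycles_per_word = DDR_WIDTH_BYTES // BYTES_PER_CYCLE

# Spliced segment length: L + (x - 1) with L = BYTES_PER_CYCLE.
spliced_len = BYTES_PER_CYCLE + (KERNEL_SIZE - 1)

# Size of the collected result and the number of 64-byte words after merging.
result_bytes = UNITS * BYTES_PER_UNIT
merged_words = result_bytes // DDR_WIDTH_BYTES

print(cycles_per_word, spliced_len, result_bytes, merged_words)
```

Under these assumptions the numbers agree with the description: 8 cycles per word, 10-byte spliced segments, a 512-byte result, and eight 64-byte merged results.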
[0180] The neural network acceleration method provided by the present invention is described below. The neural network acceleration method described below and the neural network acceleration device described above may be referred to in correspondence with each other.
[0181] As shown in Figure 15, this embodiment provides a neural network acceleration method based on the neural network accelerator 100 provided in any of the above embodiments, including:
[0182] Step 1501: The direct memory access controller continuously reads the target data required for accelerated computation of the current layer of the neural network, which is stored in the external memory according to the data arrangement order, and sends it to the data warping and control module.
[0183] Step 1502: The data warping and control module generates at least one target data segment based on the received data according to the bit width of the data warping and control module, the bit width of the computing unit, and the convolution parameters, and distributes each target data segment to the accelerated computing module.
[0184] Step 1503: The accelerated computing module performs accelerated computing on the target data segment from the data warping and control module based on at least some of the computing units, and outputs the computing results of the computing units to the data rearrangement module.
[0185] Step 1504: The data rearrangement module merges the calculation results of each calculation unit according to the data arrangement order required for the accelerated calculation of the next layer of the neural network, and uses it as the input for the accelerated calculation of the next layer.
[0186] In an exemplary embodiment, the data length of the target data segment is matched with the bit width of the computing unit and the convolution parameters, and the data length of the target data segment is a first data length; the data warping and control module buffers the received data, and the data length of the received data is less than or equal to the bit width of the data warping and control module;
[0187] Each time, a data segment is selected from the received data in units of the second data length as the main data segment, until all the received data has been selected. The second data length is a preset data length.
[0188] When the length of the main data segment is less than the first data length, first splicing data and second splicing data are obtained. The first splicing data is preset padding data or data in the received data that is adjacent to the start position of the main data segment. The second splicing data is preset padding data or data in the received data that is adjacent to the end position of the main data segment. The first splicing data, the main data segment, and the second splicing data are spliced together to obtain the target data segment.
[0189] When the data length of the main data segment is equal to the first data length, the main data segment is used as the target data segment;
[0190] The target data segment is distributed to the accelerated computing module.
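The segment selection and splicing steps above can be sketched as follows. This is a minimal model, not the hardware behavior: the function name `make_target_segments` is hypothetical, the extra bytes are assumed to be split evenly between the start and the end of the main segment, padding is assumed to be used only where no adjacent received data exists, and the received length is assumed to be a multiple of the second data length.

```python
from typing import List


def make_target_segments(received: bytes, second_len: int, first_len: int,
                         pad_byte: int = 0) -> List[bytes]:
    """Cut `received` into main segments of `second_len` bytes and splice
    each one up to `first_len` bytes with adjacent data or padding."""
    if second_len <= 0 or first_len < second_len:
        raise ValueError("need 0 < second_len <= first_len")
    if len(received) % second_len:
        raise ValueError("sketch assumes len(received) is a multiple of second_len")
    extra = first_len - second_len
    left = extra // 2          # assumed size of the first splicing data
    right = extra - left       # assumed size of the second splicing data
    segments = []
    for start in range(0, len(received), second_len):
        end = start + second_len
        main = received[start:end]
        # First splicing data: bytes adjacent to the start, padded if missing.
        head = received[max(0, start - left):start]
        head = bytes([pad_byte]) * (left - len(head)) + head
        # Second splicing data: bytes adjacent to the end, padded if missing.
        tail = received[end:end + right]
        tail = tail + bytes([pad_byte]) * (right - len(tail))
        # When extra == 0, head and tail are empty and main is used as is.
        segments.append(head + main + tail)
    return segments


# 64 received bytes with values 1..64, 8-byte main segments, 10-byte targets.
segments = make_target_segments(bytes(range(1, 65)), second_len=8, first_len=10)
```

With these inputs the sketch yields eight 10-byte target data segments; only the first and last segments receive a padding byte, because every other segment has neighbouring received data on both sides.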
[0191] In an exemplary embodiment, the data rearrangement module collects the calculation results of each of the computing units and stores them in different storage spaces within the first group of storage spaces;
[0192] According to the data arrangement order required for the accelerated computing of the next layer, each storage space of the second group of storage spaces selects the computing result of the corresponding computing unit from the first group of storage spaces for storage, so as to obtain the data arrangement order required for the accelerated computing of the next layer.
[0193] In an exemplary embodiment, the data warping and control module includes a data caching unit, a first data selection unit, an intermediate caching unit, an address generation unit, a first splicing selection unit, a second splicing selection unit, a third splicing selection unit, and a data output unit.
[0194] The data caching unit caches the received data.
[0195] The data output unit generates and outputs the target data segment;
[0196] The address generation unit generates corresponding data addresses for the first data selection unit and the first splicing selection unit; when the data length of the main data segment is less than the first data length, it additionally generates corresponding data addresses for the second splicing selection unit and the third splicing selection unit.
[0197] The first data selection unit selects a data segment from the data cache unit and stores it in the intermediate cache unit in one clock cycle, based on the corresponding data address and in units of the second data length.
[0198] The intermediate cache unit stores padding data and the data segments from the first data selection unit;
[0199] The first splicing selection unit selects a data segment from the intermediate cache unit according to the corresponding data address as the main data segment and provides it to the data output unit.
[0200] The second splicing selection unit selects the first spliced data from the intermediate cache unit according to the corresponding data address and provides it to the data output unit; the first spliced data is the padding data or data in the intermediate cache unit that is adjacent to the start position of the main data segment;
[0201] The third splicing selection unit selects the second splicing data from the intermediate cache unit according to the corresponding data address and provides it to the data output unit; the second splicing data is the padding data or data in the intermediate cache unit that is adjacent to the end position of the main data segment.
[0202] In an exemplary embodiment, the data rearrangement module includes a data collection cache unit, a second data selection unit, and a data merging and conversion unit; the data collection cache unit includes multiple collection storage units that form the first group of storage spaces; the data merging and conversion unit includes multiple conversion storage units that form the second group of storage spaces.
[0203] The data collection cache unit collects the calculation results of the computing unit and stores them in the corresponding collection storage unit. The calculation results of different computing units correspond to different collection storage units.
[0204] The second data selection unit selects the calculation result of the calculation unit stored in the corresponding collection storage unit for each conversion storage unit in the data merging and conversion unit according to the data arrangement order required for the accelerated calculation of the next layer, and outputs it to the data merging and conversion unit; different collection storage units correspond to different conversion storage units;
[0205] The data merging and conversion unit stores the calculation results selected from the collection storage units into the corresponding conversion storage units, and merges and outputs the calculation results stored in the conversion storage units.
[0206] In an exemplary embodiment, the computing unit includes a first multiplication accumulation module and a second multiplication accumulation module;
[0207] The first multiplication accumulation module performs parallel multiplication accumulation calculation based on L+(x-1) feature data and x weight data in the row direction from the data warping and control module. The parallel multiplication accumulation calculation includes sequentially selecting each weight data in the x weight data as the weight data W[i] to be calculated, where i ranges from 0 to x-1, and W[i] represents the (i+1)th weight data in the x weight data. The module then performs the following calculation: calculates the product of W[i] with each of the (i+1)th to (i+L)th feature data in the L+(x-1) feature data, and accumulates each product into the corresponding first intermediate memory. Here, x represents the size parameter of the convolution kernel.
[0208] The second multiplication accumulation module performs serial multiplication accumulation calculation based on K feature data and K weight data in one channel direction from the data warping and control module. The serial multiplication accumulation calculation includes sequentially selecting each weight data in the K weight data as the weight data W[j] to be calculated, where j ranges from 0 to K-1, and W[j] represents the (j+1)th weight data in the K weight data. The module then performs the following calculation: calculates the product of the (j+1)th feature data in the K feature data and W[j], and accumulates the product into the second intermediate memory.
[0209] The computing unit performs multiplication accumulation calculations using whichever of the first multiplication accumulation module and the second multiplication accumulation module matches the accelerated calculation of the current layer.
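The two multiplication accumulation schemes described above can be modelled in a few lines. This is a software sketch only: the function names `parallel_mac` and `serial_mac` are hypothetical, plain Python integers stand in for the first and second intermediate memories, and the bit shifting used by the hardware is replaced by simple indexing.

```python
from typing import List, Sequence


def parallel_mac(features: Sequence[int], weights: Sequence[int], L: int) -> List[int]:
    """Row-direction scheme: L + (x - 1) features and x weights give L sums."""
    x = len(weights)
    if len(features) != L + (x - 1):
        raise ValueError("expected L + (x - 1) feature values")
    acc = [0] * L  # one "first intermediate memory" per output position
    for i, w in enumerate(weights):          # select W[i], i = 0 .. x-1
        for k in range(L):
            # features[i + k] covers the (i+1)th to (i+L)th feature (1-based).
            acc[k] += w * features[i + k]
    return acc


def serial_mac(features: Sequence[int], weights: Sequence[int]) -> int:
    """Channel-direction scheme: K features and K weights give one sum."""
    if len(features) != len(weights):
        raise ValueError("expected K features and K weights")
    acc = 0  # the "second intermediate memory"
    for f, w in zip(features, weights):      # pair the (j+1)th feature with W[j]
        acc += f * w
    return acc


row_sums = parallel_mac([1, 2, 3, 4, 5, 6], [1, 0, -1], L=4)
channel_sum = serial_mac([1, 2, 3], [4, 5, 6])
```

In this model the parallel scheme is a one-dimensional sliding-window sum over a row, which is why x - 1 extra feature values are needed beyond the L outputs, while the serial scheme is a dot product along one channel direction.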
[0210] In an exemplary embodiment, the method for obtaining the data arrangement order of the target data specifically includes:
[0211] Using the second three-dimensional structure as a unit, the target data of the first three-dimensional structure is divided into blocks to obtain Z1*Z2*Z3 second three-dimensional structures; when some data of the first three-dimensional structure is insufficient to form the second three-dimensional structure, data is filled in based on some data of the first three-dimensional structure to obtain the second three-dimensional structure; wherein, F is not greater than N, E is not greater than R, D is not greater than C, the value of Z1 is the ratio of N to F rounded up, the value of Z2 is the ratio of R to E rounded up, and the value of Z3 is the ratio of C to D rounded up;
[0212] According to the scanning method corresponding to the first three-dimensional structure, the second three-dimensional structure to be scanned is selected sequentially from each of the second three-dimensional structures to obtain the first arrangement order, and the following steps are performed on the currently selected second three-dimensional structure:
[0213] The data within the second three-dimensional structure is scanned and stored according to the scanning method corresponding to the second three-dimensional structure to obtain the second arrangement order.
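The block counts Z1, Z2, and Z3 and the amount of filled data follow directly from the rounding-up rule above. The sketch below uses a hypothetical helper name `block_counts` and illustrative sizes that are not taken from the source.

```python
from typing import Tuple


def block_counts(N: int, R: int, C: int, F: int, E: int, D: int) -> Tuple[int, int, int]:
    """Return (Z1, Z2, Z3) for splitting an N*R*C structure into F*E*D blocks."""
    if not (0 < F <= N and 0 < E <= R and 0 < D <= C):
        raise ValueError("require 0 < F <= N, 0 < E <= R, 0 < D <= C")
    # -(-a // b) is integer ceiling division.
    return -(-N // F), -(-R // E), -(-C // D)


# Illustrative sizes: 64 channels * 224 rows * 224 columns, 8 * 16 * 60 blocks.
N, R, C = 64, 224, 224
F, E, D = 8, 16, 60
Z1, Z2, Z3 = block_counts(N, R, C, F, E, D)

# Columns that must be filled so that the last column of blocks is complete.
filled_columns = Z3 * D - C
total_blocks = Z1 * Z2 * Z3
```

Here the channel and row sizes divide evenly, so only the column direction needs filling.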
[0214] In an exemplary embodiment, the first arrangement order is the arrangement order of each of the second three-dimensional structures in the target data according to the priority order of the channel direction, row direction, and column direction, wherein the priority order is one of the following: row direction, column direction, and channel direction; column direction, row direction, and channel direction; channel direction, row direction, and column direction; channel direction, column direction, and row direction; row direction, channel direction, and column direction; column direction, channel direction, and row direction.
[0215] The second arrangement order is the arrangement order of each data in the second three-dimensional structure according to the priority order of the channel direction, row direction and column direction, wherein the priority order is one of the following orders: row direction, column direction and channel direction; column direction, row direction and channel direction; channel direction, row direction and column direction; channel direction, column direction and row direction; row direction, channel direction and column direction; column direction, channel direction and row direction.
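The six priority orders listed above are the six permutations of the three directions. A minimal sketch of how one such order turns into a concrete scan sequence is given below; it assumes that the first direction named in a priority order varies fastest, which the text does not specify, and the name `scan_order` is hypothetical. The same routine can model either level: scanning blocks within the first three-dimensional structure or scanning data within one second three-dimensional structure.

```python
from itertools import permutations, product
from typing import Dict, List, Sequence, Tuple

AXES = ("channel", "row", "column")


def scan_order(shape: Dict[str, int], priority: Sequence[str]) -> List[Tuple[int, int, int]]:
    """List (channel, row, column) coordinates in scan order.

    priority: the three directions, fastest-varying first (assumption).
    """
    if sorted(priority) != sorted(AXES):
        raise ValueError("priority must be a permutation of channel/row/column")
    # itertools.product varies its last argument fastest, so reverse the order.
    slowest_first = list(priority)[::-1]
    coords = []
    for idx in product(*(range(shape[a]) for a in slowest_first)):
        point = dict(zip(slowest_first, idx))
        coords.append((point["channel"], point["row"], point["column"]))
    return coords


all_orders = list(permutations(AXES))
order = scan_order({"channel": 2, "row": 1, "column": 2},
                   priority=("channel", "column", "row"))
```

In the example call the channel index changes first, then the column, then the row, so both channel values of a position are visited before moving to the next column.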
[0216] In an exemplary embodiment, the scheduler coordinates the operation of the data warping and control module, the accelerated computing module, the data rearrangement module, and the direct memory access controller based on the parsing results of the instruction sequence.
[0217] In an exemplary embodiment, the second three-dimensional structure corresponds to the structure of a neural network; and / or, the data in the second three-dimensional structure is read once and then all the necessary calculations are completed.
[0218] The neural network acceleration device provided by the present invention is described below. The neural network acceleration device described below and the neural network accelerator described above may be referred to in correspondence with each other.
[0219] The present invention also provides a neural network acceleration device, including an external memory 200 and a neural network accelerator 100 provided in any of the above embodiments.
[0220] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0221] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0222] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A neural network accelerator, characterized in that it comprises: an accelerated computing module, a direct memory access controller, a data rearrangement module, and a data warping and control module. The accelerated computing module includes multiple computing units. The direct memory access controller is used to continuously read the target data required for accelerated computation of the current layer of the neural network, which is stored in the external memory according to the data arrangement order, and send it to the data warping and control module. The data arrangement order of the target data is obtained as follows: for the target data of a first three-dimensional structure with N channels * R rows * C columns, taking a second three-dimensional structure with F channels * E rows * D columns required for accelerated computation of the current layer of the neural network as the unit, the data is stored in the first arrangement order of each second three-dimensional structure in the first three-dimensional structure; for each second three-dimensional structure, the data is stored in the second arrangement order of all data in the second three-dimensional structure; the target data is obtained by feature map priority scanning, channel priority scanning, or hybrid scanning, wherein hybrid scanning means that the channel direction is scanned between the row direction and the column direction. The data warping and control module is used to generate at least one target data segment based on the received data according to the bit width of the data warping and control module, the bit width of the computing unit, and the convolution parameters, and to distribute each target data segment to the accelerated computing module. The accelerated computing module is used to perform accelerated computing on the target data segment from the data warping and control module based on at least some of the computing units, and to output the computing results of the computing units to the data rearrangement module.
The data rearrangement module is used to merge the calculation results of each calculation unit according to the data arrangement order required for the accelerated calculation of the next layer of the neural network, so as to serve as the input for the accelerated calculation of the next layer.
2. The neural network accelerator according to claim 1, characterized in that the data length of the target data segment is matched with the bit width of the computing unit and the convolution parameters, and the data length of the target data segment is a first data length; the data warping and control module is specifically used to: buffer the received data, wherein the data length of the received data is less than or equal to the bit width of the data warping and control module; select, each time, a data segment from the received data in units of a second data length as the main data segment, until all the received data has been selected, wherein the second data length is a preset data length; when the data length of the main data segment is less than the first data length, obtain first splicing data and second splicing data, and splice the first splicing data, the main data segment, and the second splicing data together to obtain the target data segment, wherein the first splicing data is preset padding data or data in the received data that is adjacent to the start position of the main data segment, and the second splicing data is preset padding data or data in the received data that is adjacent to the end position of the main data segment; when the data length of the main data segment is equal to the first data length, use the main data segment as the target data segment; and distribute the target data segment to the accelerated computing module.
3. The neural network accelerator according to claim 1, characterized in that the data rearrangement module is specifically used to: collect the calculation results of each of the computing units and store them in different storage spaces in the first group of storage spaces; and, according to the data arrangement order required for the accelerated computing of the next layer, for each storage space of the second group of storage spaces, select the computing result of the corresponding computing unit from the first group of storage spaces and store it, so as to obtain the data arrangement order required for the accelerated computing of the next layer.
4. The neural network accelerator according to any one of claims 1 to 3, characterized in that the computing unit includes a first multiplication accumulation module and a second multiplication accumulation module. The first multiplication accumulation module is used to perform parallel multiplication accumulation calculation based on L+(x-1) feature data and x weight data in the row direction from the data warping and control module. The parallel multiplication accumulation calculation includes sequentially selecting each weight data in the x weight data as the weight data W[i] to be calculated, where i ranges from 0 to x-1 and W[i] represents the (i+1)th weight data in the x weight data, and performing the following calculation operation: calculating the product of W[i] with each of the (i+1)th to (i+L)th feature data in the L+(x-1) feature data, and accumulating each product into the corresponding first intermediate memory. Here, x represents the size parameter of the convolution kernel. The second multiplication accumulation module is used to perform serial multiplication accumulation calculation based on K feature data and K weight data in one channel direction from the data warping and control module. The serial multiplication accumulation calculation includes sequentially selecting each weight data in the K weight data as the weight data W[j] to be calculated, where j ranges from 0 to K-1 and W[j] represents the (j+1)th weight data in the K weight data, and performing the following calculation operation: calculating the product of the (j+1)th feature data in the K feature data and W[j], and accumulating the product into the second intermediate memory. The computing unit is used to perform multiplication accumulation calculations using whichever of the first multiplication accumulation module and the second multiplication accumulation module matches the accelerated calculation of the current layer.
5. The neural network accelerator according to claim 1, characterized in that the data arrangement order of the target data is specifically obtained as follows: using the second three-dimensional structure as a unit, the target data of the first three-dimensional structure is divided into blocks to obtain Z1*Z2*Z3 second three-dimensional structures; when partial data of the first three-dimensional structure is insufficient to form a second three-dimensional structure, data is filled in based on that partial data to obtain the second three-dimensional structure; wherein F is not greater than N, E is not greater than R, D is not greater than C, the value of Z1 is the ratio of N to F rounded up, the value of Z2 is the ratio of R to E rounded up, and the value of Z3 is the ratio of C to D rounded up; according to the scanning method corresponding to the first three-dimensional structure, the second three-dimensional structure to be scanned is selected sequentially from the second three-dimensional structures to obtain the first arrangement order, and the following step is performed on the currently selected second three-dimensional structure: the data within the second three-dimensional structure is scanned and stored according to the scanning method corresponding to the second three-dimensional structure to obtain the second arrangement order.
6. The neural network accelerator according to claim 1 or 5, characterized in that the first arrangement order is the arrangement order of each second three-dimensional structure in the target data according to the priority order of the channel direction, row direction, and column direction, wherein the priority order is one of the following: row direction, column direction, and channel direction; column direction, row direction, and channel direction; channel direction, row direction, and column direction; channel direction, column direction, and row direction; row direction, channel direction, and column direction; column direction, channel direction, and row direction. The second arrangement order is the arrangement order of each data in the second three-dimensional structure according to the priority order of the channel direction, row direction, and column direction, wherein the priority order is one of the following: row direction, column direction, and channel direction; column direction, row direction, and channel direction; channel direction, row direction, and column direction; channel direction, column direction, and row direction; row direction, channel direction, and column direction; column direction, channel direction, and row direction.
7. The neural network accelerator according to any one of claims 1 to 3, characterized in that it further includes a scheduler; the scheduler is connected to the data warping and control module, the accelerated computing module, the data rearrangement module, and the direct memory access controller respectively, and is used to parse instruction sequences and to coordinate the work of the data warping and control module, the accelerated computing module, the data rearrangement module, and the direct memory access controller according to the parsing results of the instruction sequences.
8. The neural network accelerator according to any one of claims 1 to 3, characterized in that the second three-dimensional structure corresponds to the structure of the neural network; and/or the data in the second three-dimensional structure is read once and then all the necessary calculations are completed.
9. A neural network acceleration method based on a neural network accelerator as described in any one of claims 1 to 8, characterized in that it comprises: the direct memory access controller continuously reads the target data required for accelerated computation of the current layer of the neural network, which is stored in the external memory according to the data arrangement order, and sends it to the data warping and control module, wherein the target data is obtained by feature map priority scanning, channel priority scanning, or hybrid scanning, and hybrid scanning means that the channel direction is scanned between the row direction and the column direction; the data warping and control module generates at least one target data segment based on the received data according to the bit width of the data warping and control module, the bit width of the computing unit, and the convolution parameters, and distributes each target data segment to the accelerated computing module; the accelerated computing module performs accelerated computing on the target data segment from the data warping and control module based on at least some of the computing units, and outputs the computing results of the computing units to the data rearrangement module; the data rearrangement module merges the calculation results of each computing unit according to the data arrangement order required for accelerated calculation of the next layer of the neural network, and uses the merged result as the input for accelerated calculation of the next layer.
10. A neural network acceleration device, characterized in that it comprises an external memory and a neural network accelerator as described in any one of claims 1 to 8.