A convolutional neural network processing method and apparatus
By optimizing the data transmission and accumulation order of convolutional neural networks, the problems of low computational parallelism and difficulty in processing sparse data are solved, achieving higher computational speed and lower power consumption.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING SIFENGKE TECH CO LTD
- Filing Date
- 2021-12-31
- Publication Date
- 2026-06-12
AI Technical Summary
Existing convolutional neural network computation suffers from low computational parallelism, slow computation speed, and difficulty in processing sparse data, especially in dot product accumulation operations where blocking conflicts are prone to occur.
The weight data is input into the multiplier array in a preset manner, the sub-feature data sets are determined, and dot multiplication and accumulation are performed. A three-level storage structure of storage units and registers optimizes the data transmission and accumulation order and avoids calculation conflicts.
It improves data transmission efficiency, optimizes the calculation process, increases calculation speed, reduces power consumption, and solves the computational blocking problem in sparse data processing.
Figure CN116415629B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of convolutional neural networks, and more particularly to a convolutional neural network processing method and apparatus.
Background Technology
[0002] A Convolutional Neural Network (CNN) is a type of feedforward neural network whose artificial neurons can respond to surrounding units within a certain coverage area, making it suitable for processing large images. CNNs are widely used in image recognition, speech recognition, and other fields, but they require a very large amount of computation.
[0003] Existing convolution calculations often use systolic arrays. To achieve correct matrix operations, data entering the array must be properly formatted and passed continuously in a fixed order. This presents the following problems: a computation flow that is not deliberately designed leads to low computational parallelism, resulting in slow computation speed and low computing power; some computational results need to be summed together, and simultaneous summing can lead to data contention, causing computational blockage and reducing computing power; furthermore, if sparse data is compressed, this systolic operation can no longer be carried out, meaning that existing systolic arrays cannot easily handle sparse data.
Summary of the Invention
[0004] This invention provides a convolutional neural network processing method to solve the blocking conflict problem in dot multiplication and accumulation operations of convolutional neural networks, and is suitable for processing sparse data.
[0005] This invention provides a convolutional neural network processing method, the method comprising:
[0006] The weight data is input to the multiplier array according to the first preset method, wherein the convolution kernels corresponding to the weight data in each row of multipliers are different, the coordinates corresponding to the weight data in each row of multipliers are the same, and the channels corresponding to the weight data in each row of multipliers are the same.
[0007] Based on the channels corresponding to the weight data in each row of multipliers, a sub-feature data set corresponding to each row of multipliers is determined; wherein, the feature data set to be processed includes at least one sub-feature data set;
[0008] Based on the sub-feature data set corresponding to each row of multipliers, the sub-feature data set is input into the multiplier array. The weight data in each multiplier of any row of multipliers is multiplied by the element feature data in the sub-feature data set corresponding to this row of multipliers to obtain the first set of dot product results.
[0009] The first set of dot multiplication results is input into the accumulator for accumulation.
[0010] In response to the completion of the accumulator accumulation, obtain the processing result corresponding to the set of feature data to be processed.
[0011] The channels corresponding to the weight data in any row of multipliers are the same as the channels of the sub-feature data set corresponding to that row of multipliers.
[0012] A sub-feature data set consists of all feature data from any one channel of the feature data set to be processed; the number of sub-feature data sets is the same as the number of rows of the multiplier array; the arrangement order of the sub-feature data sets is determined based on the channels corresponding to the weight data in each row of multipliers; the arrangement order of the element feature data in the sub-feature data set is determined based on the coordinates of the element feature data.
[0013] The weight data in any multiplier in any row of multipliers is multiplied by each element feature data in the sub-feature data set corresponding to this row of multipliers to obtain a sub-dot product result set, wherein the first dot product result set includes at least one sub-dot product result set.
[0014] The input weight data includes compressed weight data.
[0015] The steps to compress the weight data include:
[0016] Based on the coordinates of the weight data, the weight data of each channel of different convolutional kernels is expanded into a row vector;
[0017] Arrange the row vectors of the same channel belonging to different convolution kernels into a rearranged kernel matrix;
[0018] Compress the rearranged kernel matrix.
[0019] The set of feature data to be processed includes a compressed set of feature data to be processed. Based on the sub-feature data set corresponding to each row of multipliers, the sub-feature data set is input into the multiplier array, including: based on the sub-feature data set corresponding to any row of multipliers, broadcasting the element feature data in the sub-feature data set corresponding to this row of multipliers in coordinate order to each multiplier of this row of multipliers.
[0020] The channels of the weight data in each row of multipliers are the same as the channels of the sub-feature data set corresponding to each row of multipliers. The sub-feature data sets are arranged in channel order, and then the element feature data in the sub-feature data sets are input into the multiplier array in coordinate order.
[0021] When the number of columns in the multiplier array is less than the number of convolution kernels, and the dot product operation between the feature data set to be processed and the weight data in the current multiplier is completed, the convolution kernels in the multiplier are updated, and the feature data set to be processed is then input into the multiplier array, until the weight data of all convolution kernels has been multiplied with the feature data set to be processed.
[0022] The set of sub-dot multiplication results is stored in the storage unit corresponding to the multiplier, and the convolution kernel number corresponding to the set of sub-dot multiplication results is also saved. Each storage unit stores at least one set of sub-dot multiplication results.
[0023] The first set of dot product results is input into the accumulator, including:
[0024] The set of sub-dot multiplication results in the storage unit corresponding to a row of multipliers is passed into the first-level storage; the set of sub-dot multiplication results belonging to a storage unit is stored in a memory of the first-level storage.
[0025] Based on the indication of the convolution kernel index corresponding to the sub-dot product result set, the sub-dot product result set that can be accumulated is passed into the same register in the secondary storage; the sub-dot product result set in the register is input into the accumulator;
[0026] When the time threshold is not reached, the set of sub-dot multiplication results that cannot be accumulated is stored in the cache space of the secondary storage.
[0027] When the time threshold is reached, the set of sub-dot multiplication results that cannot be accumulated is passed to the third-level storage.
[0028] When no sub-dot product result set that can be accumulated with the sub-dot product result set passed from the first-level storage is found in the second-level storage, the sub-dot product result set that can be accumulated is retrieved from the third-level storage, entered into the register of the second-level storage, and then passed to the accumulator.
[0029] The size of the cache space is determined based on the storage unit corresponding to the multiplier.
[0030] The cache space is three times the size of the storage unit corresponding to the multiplier.
[0031] This invention provides a convolutional neural network processing device, comprising:
[0032] The first input unit is used to input weight data into the multiplier array according to a first preset method;
[0033] The determining unit is used to determine the set of sub-feature data corresponding to each row of multipliers based on the channels corresponding to each weight data in each row of multipliers;
[0034] The second input unit is used to input the sub-feature data set into the multiplier array based on the sub-feature data set corresponding to each row of multipliers;
[0035] The first calculation unit is used to perform a dot product operation between the weight data in each multiplier of any row of multipliers and the element feature data in the sub-feature data set corresponding to this row of multipliers.
[0036] The second calculation unit is used to input the obtained first dot multiplication result set into the accumulator for accumulation;
[0037] The acquisition unit is used to acquire the processing results corresponding to the set of feature data to be processed.
[0038] Beneficial effects:
[0039] The method provided by this invention can improve data transmission efficiency, optimize the computation process of deep learning, increase computation speed, and reduce power consumption.
[0040] The transmission method of weight data and feature data can be adapted to compressed data format. When the weight data or input feature data is sparse, data compression can further improve the calculation speed and reduce power consumption.
[0041] By passing the weight data of a different layer (channel) to each row of multipliers, data input and retrieval are facilitated. This also ensures that additions between layers proceed in an orderly way, and that partial sums within the same layer do not conflict, thus speeding up computation.
[0042] The accumulation of product results is arranged so that multiple multipliers compute in parallel while the accumulation order of the product results is scheduled to reduce conflicts and blocking. This solves the problem of frequent conflicts when accumulating partial sums, thus improving overall efficiency.
Attached Figure Description
[0043] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0044] Figure 1 is a flowchart of a convolutional neural network processing method in one embodiment of the present invention;
[0045] Figure 2 is a schematic diagram of rearranged convolution kernels in one embodiment of the present invention;
[0046] Figure 3 is a schematic diagram of weight data being passed into the multiplier array in one embodiment of the present invention;
[0047] Figure 4 is a schematic diagram of the result of storing weight data into the multiplier array in one embodiment of the present invention;
[0048] Figure 5 is a schematic diagram of a sub-feature data set in one embodiment of the present invention;
[0049] Figure 6 is a schematic diagram of the input of the sub-feature data set into the multiplier array in one embodiment of the present invention;
[0050] Figure 7 is a schematic diagram of a dot product operation between the sub-feature data set and the weight data in the multiplier in one embodiment of the present invention;
[0051] Figure 8 is a schematic diagram of a dot product operation between the sub-feature data set and the weight data in the multiplier in one embodiment of the present invention;
[0052] Figure 9 is a schematic diagram of the storage unit corresponding to the multiplier storing the set of sub-dot multiplication results in one embodiment of the present invention;
[0053] Figure 10 is a schematic diagram of the result of storing weight data into the multiplier array in one embodiment of the present invention;
[0054] Figure 11 is a schematic diagram of refreshing the weight data in the multiplier according to an embodiment of the present invention;
[0055] Figure 12 is a schematic diagram of the multiplier array and storage at each level in one embodiment of the present invention;
[0056] Figure 13 is a schematic diagram of the result of storing the compressed weight data into the multiplier array in one embodiment of the present invention;
[0057] Figure 14 is a schematic diagram of the compressed sub-feature data set in one embodiment of the present invention;
[0058] Figure 15 is a schematic diagram of the compressed sub-feature data set being fed into the multiplier array in one embodiment of the present invention;
[0059] Figure 16 is a schematic diagram of refreshing the compressed weight data in the multiplier according to an embodiment of the present invention;
Detailed Implementation
[0060] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0061] The convolutional neural network processing method provided in this application supports processing of multiple types of data, including image data, audio data, and natural language data. It is particularly suitable for situations where both weight data and feature data are sparse, such as filtering large batches of short videos.
Example 1
[0062] In one embodiment of the present invention, a method for accelerating convolutional neural networks is provided, comprising the following steps. Weight data is input into a multiplier array according to a first preset method, wherein the convolution kernels corresponding to the weight data in each row of multipliers are different, the coordinates corresponding to the weight data in each row of multipliers are the same, and the channels corresponding to the weight data in each row of multipliers are the same. In some optional embodiments, the first preset method can be understood as storing weight data that has the same coordinates and the same channel, but belongs to different convolution kernels, in the same row of multipliers. Inputting weight data in this form makes it easy to determine which feature data must be multiplied with the weight data in each row of multipliers, and allows that feature data to be multiplied simultaneously with weight data belonging to different convolution kernels, thereby improving computational efficiency.
[0063] Based on the channels corresponding to the weight data in each row of multipliers, a sub-feature data set corresponding to each row of multipliers is determined; wherein, the feature data set to be processed includes at least one sub-feature data set; in some optional embodiments, the sub-feature data set can be understood as the set of all feature data on the same channel of the input feature map; if the number of channels in the feature data set to be processed is 64, the number of sub-feature data sets is also 64; since weight data and feature data belonging to the same channel can be multiplied, after inputting weight data into the multiplier array in the first preset manner, the weight data stored in each row of multipliers belongs to the same channel, and the sub-feature data set corresponding to each row of multipliers can be determined in this way.
[0064] Based on the sub-feature data set corresponding to each row of multipliers, the sub-feature data set is input into the multiplier array. The weight data in each multiplier of any row of multipliers is multiplied by each element feature data in the corresponding sub-feature data set to obtain the first set of dot product results. Because a sub-feature data set is multiplied only by the weight data in its own row of multipliers, the resulting first set of dot product results is guaranteed to belong to the same channel of different convolution kernels. Sub-dot product results within one first set of dot product results belong to different convolution kernels and therefore cannot be accumulated with each other. Across any two first sets of dot product results, at most two sub-dot product results belong to the same convolution kernel and can be accumulated. Accumulation therefore never leaves multiple sub-dot product results waiting on the same sum, which avoids conflicts and blocking.
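The pairing property described above can be illustrated with a small sketch. The function name, the dictionary layout keyed by convolution kernel number, and the toy values are assumptions made for illustration only and do not come from the embodiment.

```python
def first_dot_product_result_set(row_weights, sub_feature_set):
    """Multiply every weight in one multiplier row by every element of the
    row's sub-feature data set.

    row_weights: {kernel_number: weight} (hypothetical layout)
    Returns {kernel_number: sub-dot product result set}.
    """
    return {k: [w * x for x in sub_feature_set] for k, w in row_weights.items()}

# Two rows (two channels), two kernels per row; the values are made up.
row1 = first_dot_product_result_set({1: 2, 2: 3}, [1, 0, 4])   # channel 1
row2 = first_dot_product_result_set({1: 5, 2: 7}, [2, 2, 1])   # channel 2

# Inside one set every kernel number appears once, so nothing is summed there.
# Across the two sets each kernel number has exactly one partner.
accumulated = {k: [a + b for a, b in zip(row1[k], row2[k])] for k in row1}
```

Each kernel number matches exactly one partner across the two sets, so the accumulator never has more than one pair per kernel waiting at a time.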
[0065] The first set of dot multiplication results is input into the accumulator for accumulation.
[0066] In response to the completion of the accumulator accumulation, obtain the processing result corresponding to the set of feature data to be processed.
Example 2
[0067] In one embodiment of the present invention, a convolutional neural network processing method is provided, comprising:
[0068] The weight data is input to the multiplier array according to the first preset method, wherein the convolution kernels corresponding to the weight data in each row of multipliers are different, the coordinates corresponding to the weight data in each row of multipliers are the same, and the channels corresponding to the weight data in each row of multipliers are the same.
[0069] Based on the channels corresponding to the weight data in each row of multipliers, a sub-feature data set corresponding to each row of multipliers is determined; wherein, the feature data set to be processed includes at least one sub-feature data set;
[0070] Based on the sub-feature data set corresponding to each row of multipliers, the sub-feature data set is input into the multiplier array. The weight data in each multiplier of any row of multipliers is multiplied by the element feature data in the sub-feature data set corresponding to this row of multipliers to obtain the first set of dot product results.
[0071] The first set of dot multiplication results is input into the accumulator for accumulation.
[0072] In response to the completion of the accumulator accumulation, obtain the processing result corresponding to the set of feature data to be processed.
[0073] The channels corresponding to the weight data in any row of multipliers are the same as the channels of the sub-feature data set corresponding to that row of multipliers. The weight data in the multiplier must belong to the same channel as the input sub-feature data set to perform dot product operation. Therefore, the channels of the weight data stored in each row of multipliers can be used to determine the sub-feature data set corresponding to that row of multipliers. Furthermore, the weight data stored in that row of multipliers can perform dot product operation with the element feature data in the sub-feature data set corresponding to that row of multipliers, which can improve the efficiency of dot product operation.
[0074] A sub-feature data set consists of all feature data from any one channel of the feature data set to be processed; the number of sub-feature data sets is the same as the number of rows of the multiplier array; the arrangement order of the sub-feature data sets is determined based on the channels corresponding to the weight data in each row of multipliers; the arrangement order of the element feature data in a sub-feature data set is determined based on the coordinates of the element feature data. The number of sub-feature data sets equals the number of channels in the feature data set to be processed, which in turn equals the number of channels in the convolution kernel. Since the number of rows in the multiplier array is determined by the number of channels in the convolution kernel, the number of sub-feature data sets equals the number of rows in the multiplier array. The weight data in different rows of multipliers correspond to different channels and are arranged in a specific order, as shown in Figures 4 and 5. Typically, the weight data in the first row of multipliers belongs to channel1, the weight data in the second row belongs to channel2, and so on. The arrangement order of the sub-feature data sets can therefore be determined from the channel of the weight data in each row of multipliers: the sub-feature data set for the first row is channel1, the sub-feature data set for the second row is channel2, and so on. This makes it straightforward to input the sub-feature data sets into the multiplier array and perform dot product operations with the weight data in the multipliers, improving the efficiency of the dot product operation.
[0075] The weight data in any multiplier in any row of multipliers is multiplied by each element feature data in the sub-feature data set corresponding to this row of multipliers to obtain a sub-dot product result set, wherein the first dot product result set includes at least one sub-dot product result set.
[0076] The input weight data includes compressed weight data. This processing method supports inputting compressed weight data. If the weight data is compressed, the compressed weight data is input into the multiplier array, eliminating the need to calculate 0 values and improving computational efficiency.
[0077] The steps to compress the weight data include:
[0078] Based on the coordinates of the weight data, the weight data of each channel of different convolutional kernels is expanded into a row vector;
[0079] The row vectors belonging to the same channel of different convolution kernels are arranged into a rearranged kernel matrix, as shown in Figure 2;
[0080] The rearranged kernel matrix is compressed. For example, in Figure 2, the position (1,1) of W3 and W4 is 0. The result of the first input into the multiplier array is shown in Figure 13. Compression of multiple convolution kernels on the same channel facilitates the sequential input of weight data into the multiplier array, which can reduce transmission energy and facilitate data reading. Moreover, the way the weight data is input into the multiplier array is adapted to the data compression format, which improves transmission efficiency.
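The rearrangement and compression steps can be sketched as follows. The source does not specify the compressed format, so the (kernel index, value) pairs used here are an assumption, as are the function names and the toy weights (chosen so that the third and fourth kernels hold 0 at the first coordinate, similar to W3 and W4 in the example).

```python
def rearrange(weights, channel):
    """Build the rearranged kernel matrix for one channel.

    weights[k][c][i][j] is the weight of kernel k, channel c, coordinate (i, j).
    Each kernel's channel is flattened into a row vector, so rows are kernels
    and columns are kernel coordinates in row-major order.
    """
    return [[w for row in kernel[channel] for w in row] for kernel in weights]

def compress_columns(matrix):
    """Assumed compressed format: for each coordinate column keep only the
    (kernel_index, value) pairs whose value is non-zero."""
    num_cols = len(matrix[0])
    return [[(k, row[col]) for k, row in enumerate(matrix) if row[col] != 0]
            for col in range(num_cols)]

# Four toy 2*2 single-channel kernels; kernels 2 and 3 are 0 at the first coordinate.
weights = [
    [[[1, 2], [3, 4]]],
    [[[5, 6], [7, 8]]],
    [[[0, 9], [1, 2]]],
    [[[0, 3], [4, 5]]],
]
matrix = rearrange(weights, channel=0)
compressed = compress_columns(matrix)
```

In this sketch the first compressed column holds only two entries, so only two weights would need to be sent to the multiplier row for that coordinate.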
[0081] The set of feature data to be processed includes a compressed set of feature data to be processed. Similarly, inputting the compressed set of feature data to be processed into a multiplier array for dot product operation can improve computational efficiency. The processing method provided in this embodiment of the invention is suitable for scenarios where both the weight data and the input feature map are sparse.
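As a rough illustration of why compressed inputs save work, the sketch below multiplies only the non-zero entries of one multiplier row and one sub-feature data set. The index-value pair format and all names are assumptions; the source does not describe the compressed encoding.

```python
def compress_features(sub_feature_set):
    """Assumed format: keep (coordinate_index, value) for non-zero elements."""
    return [(i, x) for i, x in enumerate(sub_feature_set) if x != 0]

def sparse_row_products(row_weights, compressed_features):
    """row_weights: list of (kernel_index, weight) with zero weights removed.

    Returns {(kernel_index, coordinate_index): product}; the two indices are
    kept so the accumulator can still tell where each product belongs.
    """
    return {(k, i): w * x
            for k, w in row_weights
            for i, x in compressed_features}

features = compress_features([0, 3, 0, 2])        # two of four elements survive
products = sparse_row_products([(0, 4), (2, 5)], features)
```

Only four multiplications are carried out here, because the zero feature elements and the zero weights never reach the multipliers.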
[0082] Based on the sub-feature data set corresponding to each row of multipliers, the sub-feature data set is input into the multiplier array, including: Based on the sub-feature data set corresponding to any row of multipliers, the element feature data in the sub-feature data set corresponding to this row of multipliers is broadcast in coordinate order to each multiplier of this row of multipliers. This is one input method for a sub-feature data set.
[0083] The channels of the weight data within each row of multipliers are the same as the channels of the corresponding sub-feature data set. The sub-feature data sets are arranged in channel order, and then the element feature data within each sub-feature data set is input into the multiplier array in coordinate order. The order of all sub-feature data sets is determined by the weight data within the multipliers. Either one sub-feature data set can be input into the multiplier array, or all sub-feature data sets can be sorted and input simultaneously. Inputting the first element feature data of each sub-feature data set at a time, as shown in Figure 6, can improve computation speed.
[0084] When the number of columns in the multiplier array is less than the number of convolution kernels, and the dot product operation between the feature data set to be processed and the weight data in the current multiplier is completed, the convolution kernels in the multiplier are updated. The coordinates of the weight data remain unchanged as the convolution kernels change. The feature data set to be processed is then input into the multiplier array again until the weight data of all convolution kernels has been multiplied with the feature data set to be processed.
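The kernel update described above amounts to processing the convolution kernels in batches no wider than the multiplier array. A minimal sketch, with a hypothetical helper name and 0-based kernel indices:

```python
def kernel_batches(num_kernels, num_cols):
    """Yield the kernel indices held by the multiplier columns in each pass."""
    for start in range(0, num_kernels, num_cols):
        yield list(range(start, min(start + num_cols, num_kernels)))

# 64 kernels on an 8-column array need 8 passes per weight coordinate.
batches = list(kernel_batches(64, 8))
# A kernel count that is not a multiple of the column count leaves a short last pass.
uneven = list(kernel_batches(10, 4))
```

The feature data set to be processed is fed through the array once per batch, while the weight coordinate stays fixed.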
[0085] The set of sub-dot multiplication results is stored in the storage unit corresponding to the multiplier, and the convolution kernel number corresponding to the set of sub-dot multiplication results is also saved. Each storage unit stores at least one set of sub-dot multiplication results.
[0086] The first set of dot product results is input into the accumulator, including:
[0087] The set of sub-dot product results within the storage unit corresponding to a row of multipliers, i.e., any set of first dot product results, is passed to the first-level storage. The set of sub-dot product results belonging to a storage unit is stored in a memory of the first-level storage. The set of first dot product results is obtained by performing dot product operations between the weight data in each multiplier of any row of multipliers and the feature data of each element in the sub-feature data set corresponding to this row of multipliers. Since the weight data in a row of multipliers belong to different convolution kernels, the sub-dot product results in the set of first dot product results cannot be accumulated. Since at most two sub-dot product results in any two sets of first dot product results belong to the same convolution kernel and can be accumulated, the transmission to the accumulator is performed in units of one set of first dot product results to avoid the problem of conflict blocking.
[0088] Based on the indication of the convolution kernel index corresponding to the sub-dot product result set, the sub-dot product result set that can be accumulated is passed into the same register in the secondary storage; the sub-dot product result set in the register is input into the accumulator; this is also the reason for setting up primary storage and secondary storage, so as to facilitate the input of the sub-dot product result set that can be accumulated into the accumulator;
[0089] When the time threshold is not reached, the set of sub-dot product results that cannot be accumulated is stored in the cache space of the second-level storage; when the time threshold is reached, the set of sub-dot product results that cannot be accumulated is passed to the third-level storage; when no sub-dot product result set that can be accumulated with the sub-dot product result set passed from the first-level storage is found in the second-level storage, the set of sub-dot product results that can be accumulated is retrieved from the third-level storage, entered into the register of the second-level storage, and then passed to the accumulator.
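The interplay between the second-level cache, the time threshold, and the third-level storage can be modelled with a toy simulation. This is only a sketch under several assumptions not stated in the source: each result is a single number rather than a set, time is counted in pushes, a partial sum returns to the cache after accumulation, and cache capacity is not modelled. The class and attribute names are hypothetical.

```python
class AccumulationBuffer:
    """Toy model of the second-level cache and third-level storage."""

    def __init__(self, time_threshold):
        self.time_threshold = time_threshold
        self.cache = {}   # second level: kernel_index -> (partial_sum, arrival_tick)
        self.third = {}   # third level:  kernel_index -> partial_sum
        self.tick = 0
        self.third_level_reads = 0

    def push(self, kernel_index, value):
        """Accept one result arriving from the first-level storage."""
        self.tick += 1
        # Entries that have waited for the threshold move to the third level.
        expired = [k for k, (_, t) in self.cache.items()
                   if self.tick - t >= self.time_threshold]
        for k in expired:
            self.third[k] = self.cache.pop(k)[0]
        if kernel_index in self.cache:        # partner found in the second level
            partner = self.cache.pop(kernel_index)[0]
        elif kernel_index in self.third:      # otherwise fetch it from the third level
            partner = self.third.pop(kernel_index)
            self.third_level_reads += 1
        else:                                 # nothing to accumulate with yet
            partner = 0
        # Accumulator step: the pair is summed and the partial sum waits in the cache.
        self.cache[kernel_index] = (partner + value, self.tick)

    def results(self):
        merged = dict(self.third)
        merged.update({k: v for k, (v, _) in self.cache.items()})
        return merged

buf = AccumulationBuffer(time_threshold=3)
for kernel_index, value in [(1, 10), (2, 5), (1, 3), (3, 7), (3, 1), (2, 4)]:
    buf.push(kernel_index, value)
```

In this run the results for kernels 1 and 3 find their partners in the cache, while the first result for kernel 2 waits past the threshold and has to be fetched back from the third level once.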
[0090] The size of the cache space is determined based on the storage unit corresponding to the multiplier, and the cache space is three times the size of the storage unit corresponding to the multiplier. A cache space three times the size of the storage unit is the optimal choice. The purpose of setting the cache space is to allow more sub-dot multiplication results or partial sums to be accumulated before being passed to the third-level storage, thereby reducing the power consumption caused by back-and-forth access.
Example 3
[0091] To facilitate understanding of the overall solution of this application, the complete flow of one embodiment of the present invention is described below.
[0092] In this embodiment, there are 64 convolution kernels, each of size 2*2 with 64 channels.
[0093] Step 1: As shown in Figure 2, rearrange the convolution kernels. Expand each channel of each convolution kernel into a row vector. The row vectors of the same channel belonging to different convolution kernels form a rearranged kernel matrix. After the weight data is rearranged, the first column of the rearranged kernel matrix holds the data at coordinates (1, 1) of the different convolution kernels in the same channel.
[0094] Step 2: Input the weight data of the rearranged kernel matrix in Step 1 into the multiplier array. In this embodiment, the multiplier array is 64*8. The 64 rows correspond to the number of channels of the convolution kernel, and the 8 columns mean that a maximum of 8 different convolution kernel weight data can be input at a time.
[0095] As shown in Figure 3, the first set of weight data within the dashed line is input into the first row of the multiplier array. Each multiplier stores one set of weight data. The convolution kernels corresponding to the weight data in each multiplier are different, namely convolution kernels W1 to W8. The coordinates corresponding to the weight data in each multiplier are the same, which is (1,1). The channels corresponding to the weight data in each multiplier are the same, which is channel1. The other sets of weight data are input into the multiplier array simultaneously with the first set of weight data in the same way. The input result is shown in Figure 4. In this embodiment, the weight data is stored from right to left in the multiplier, but it can also be stored from left to right.
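The loading pattern of Step 2 can be sketched in a few lines. The nested-list layout `weights[k][c][i][j]`, the 0-based indices, the left-to-right column order, and the synthetic weight values are all assumptions made for illustration.

```python
def load_weights(weights, coord, first_kernel, num_cols):
    """Return the multiplier array contents for one weight coordinate.

    One row per channel, one column per kernel; every entry is
    (kernel_index, weight) so the kernel number travels with the weight.
    """
    i, j = coord
    num_channels = len(weights[0])
    last_kernel = min(first_kernel + num_cols, len(weights))
    return [[(k, weights[k][c][i][j]) for k in range(first_kernel, last_kernel)]
            for c in range(num_channels)]

# Embodiment dimensions: 64 kernels, 64 channels, 2*2 kernels, 64*8 array.
# Synthetic weights encode their own position so the result is easy to inspect.
weights = [[[[k * 1000 + c * 10 + i * 2 + j for j in range(2)]
             for i in range(2)]
            for c in range(64)]
           for k in range(64)]

array = load_weights(weights, coord=(0, 0), first_kernel=0, num_cols=8)
next_array = load_weights(weights, coord=(0, 0), first_kernel=8, num_cols=8)
```

Every row of `array` holds the same coordinate and channel for eight different kernels, which is the property the first preset method requires.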
[0096] Step 3: Input the set of feature data to be processed into the multiplier array. In this embodiment, the set of feature data to be processed is a 4*4 feature matrix with 64 channels.
[0097] Step 301: Divide the feature data set to be processed into 64 sub-feature data sets; each feature data in a sub-feature data set belongs to the same channel; as shown in Figure 5, a1 to a64 are 64 sub-feature data sets, and the feature data in each channel of the feature data set to be processed constitutes a sub-feature data set. The area within the dashed box is sub-feature data set a1.
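Step 301 can be sketched as a simple split by channel. The height-width-channel input layout, the function name, and the synthetic values are assumptions; the source does not state how the feature data set is laid out in memory.

```python
def split_by_channel(fmap):
    """Split fmap[y][x][c] into one sub-feature data set per channel.

    Each sub-feature data set lists ((y, x), value) pairs in row-major
    coordinate order, which is the order used when feeding the array.
    """
    height, width, channels = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[((y, x), fmap[y][x][c]) for y in range(height) for x in range(width)]
            for c in range(channels)]

# A 4*4 feature matrix with 64 channels; values encode channel and position.
fmap = [[[100 * c + 4 * y + x for c in range(64)] for x in range(4)]
        for y in range(4)]
subsets = split_by_channel(fmap)
```

The list index of each sub-feature data set equals its channel, so it lines up directly with the multiplier row that holds weights of that channel.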
[0098] Step 302: Simultaneously input the feature data at coordinates (1,1) of all sub-feature data sets, i.e., the feature data within the dashed box shown in Figure 6, into the multiplier array. The channel of sub-feature data set a1 is channel1, and the weight data in the first row of the multiplier array also belongs to channel1, so feature data a1 (1,1) is broadcast to every multiplier in the first row. Likewise, the channel of sub-feature data set a2 is channel2, and the weight data in the second row of the multiplier array belongs to channel2, so feature data a2 (1,1) is broadcast to every multiplier in the second row. The feature data of the other sub-feature data sets is input into the multiplier array in the same way.
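A minimal sketch of one broadcast step, assuming a tiny array of 2 channels by 3 kernels with made-up values; the function name and list layout are hypothetical.

```python
def broadcast_step(weight_rows, sub_feature_sets, coord_index):
    """One input step of the multiplier array.

    The element at one coordinate of every sub-feature data set is broadcast
    to all multipliers in the row of the same channel, and each multiplier
    multiplies it by its stored weight.
    """
    products = []
    for row_weights, subset in zip(weight_rows, sub_feature_sets):
        x = subset[coord_index]              # same element for the whole row
        products.append([w * x for w in row_weights])
    return products

weight_rows = [[1, 2, 3], [10, 20, 30]]      # rows = channels, columns = kernels
sub_feature_sets = [[5, 6], [7, 8]]          # rows = channels, columns = coordinates

step1 = broadcast_step(weight_rows, sub_feature_sets, 0)
step2 = broadcast_step(weight_rows, sub_feature_sets, 1)
```

All rows work in the same step, and each row needs only one feature element per step, which is what makes the broadcast cheap to deliver.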
[0099] Step 4, as shown in Figure 6, involves performing a dot product operation between the feature data input to the multiplier and the weight data already stored in the multiplier. After the feature data at all coordinates (1, 1) has been multiplied by the weight data in all multipliers, the feature data at coordinates (1, 2) of all sub-feature data sets are input into the multiplier array, as shown in Figure 7. The feature data input to the multiplier is then multiplied by the weight data already stored in the multiplier until all feature data in each sub-feature data set has been multiplied by the weight data currently stored in the multiplier.
[0100] Step 5, as shown in Figure 9, stores the dot product result of each sub-feature data set and the weight data into the storage unit corresponding to each multiplier; according to the coordinates of the feature data, it is stored in the corresponding position of the storage unit, and at the same time, the convolution kernel number corresponding to the sub-dot product result and the coordinates of the current weight data are stored; the convolution kernel number and the coordinates of the current weight data are the indexes of the relationship between the dot product results when the dot product results are accumulated; wherein, the dot product result obtained by multiplying a1 with the weight data in the first multiplier of the first row is a sub-dot product result, and the dot product result obtained by multiplying any sub-feature data set in a1-a64 with the weight data in each multiplier of its corresponding row is the first dot product result set;
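Steps 302 to 5 can be sketched in Python as follows. This is only an illustrative model under stated assumptions: the function `run_pass`, the dictionary layout of a storage unit, and the small 2-row, 2-column array are hypothetical, and the hardware processes all rows in parallel whereas the sketch loops over them.

```python
def run_pass(weight_rows, features, weight_coord):
    """Model one pass over the feature data for a fixed set of weights.

    weight_rows[r][j] = (kernel_id, weight) held by the multiplier in
    row r, column j; features[r] is the 2-D feature map of channel r.
    Returns storage[r][j], one storage unit per multiplier.
    """
    height, width = len(features[0]), len(features[0][0])
    storage = [
        [
            {
                "kernel": kernel_id,            # index used later for accumulation
                "weight_coord": weight_coord,   # index used later for accumulation
                "products": [[None] * width for _ in range(height)],
            }
            for kernel_id, _ in row
        ]
        for row in weight_rows
    ]
    for y in range(height):
        for x in range(width):                      # one feature coordinate per step
            for r, row in enumerate(weight_rows):   # parallel in hardware
                value = features[r][y][x]           # broadcast along row r
                for j, (_, weight) in enumerate(row):
                    storage[r][j]["products"][y][x] = value * weight
    return storage


weight_rows = [[(0, 2), (1, 3)], [(0, 5), (1, 7)]]
features = [[[1, 2], [3, 4]], [[1, 0], [0, 1]]]
storage = run_pass(weight_rows, features, weight_coord=(1, 1))
```

Each entry of `storage` corresponds to one sub-dot product result set together with its convolution kernel number and weight coordinates.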
[0101] Step 6: Pass the weight data of the remaining convolution kernels at coordinates (1, 1) to the multiplier array; pass 8 at a time.
[0102] The result of feeding the weight data into the multiplier array for the second time is shown in Figure 10. Repeat steps 3-6 until the dot product operation between the weight data of all convolution kernels with coordinates (1,1) and the set of feature data to be processed is completed.
[0103] Step 7: Refresh the weight data in the multiplier. The refresh result is shown in Figure 11. Repeat steps 2-6 until all weight data of each convolution kernel is reused.
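The order in which weight data enters the multiplier array in Steps 2 to 7 can be summarized by the following sketch. The generator name `weight_schedule`, the kernel size of 3, and the row-major traversal of the weight coordinates are assumptions for illustration; this embodiment does not state the kernel size in this passage.

```python
def weight_schedule(num_kernels, cols, kernel_size):
    """Yield (weight coordinate, kernel numbers) for each load of the array.

    All kernels are processed for one weight coordinate, `cols` kernels
    at a time (Step 6), before the weights are refreshed to the next
    coordinate (Step 7).
    """
    for a in range(1, kernel_size + 1):
        for b in range(1, kernel_size + 1):
            for start in range(0, num_kernels, cols):
                last = min(start + cols, num_kernels)
                yield (a, b), list(range(start + 1, last + 1))


# 64 kernels and 8 columns as in this embodiment; kernel_size=3 is assumed.
schedule = list(weight_schedule(num_kernels=64, cols=8, kernel_size=3))
```

Under these assumptions the array is loaded 72 times: 8 loads for each of the 9 weight coordinates.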
[0104] Step 8: Pass the set of sub-dot multiplication results stored in the storage unit into the accumulator;
[0105] Step 801: As shown in Figure 12, pass the sub-dot product result sets of the first row of multipliers into the first-level storage; the sub-dot product result set of each multiplier is stored in one memory of the first-level storage.
[0106] Step 802: Pass the sub-dot multiplication result set of the first row multiplier stored in the first-level storage to the second-level storage, and pass the sub-dot multiplication result set of the second row multiplier to the first-level storage;
[0107] Step 803: The set of sub-dot product results of the existing second row multiplier in the first-level storage is passed into the register of the second-level storage. The sub-dot product results of the second row multiplier and the first row multiplier that belong to the same convolution kernel are passed into the same register of the second-level storage.
[0108] Step 9: Perform the accumulation operation in the accumulator of the register. Because the sub-dot product result sets within one row of multipliers belong to different convolution kernels, no accumulation is required within a row. Only the sub-dot product result sets of one row of multipliers are passed to the first-level storage at a time, and they are forwarded to the second-level storage before the result sets of the next row are passed in, so only result sets from different layers are added together, and at most two result sets belonging to the same convolution kernel meet at a time, one from each of two rows. After the two result sets of the first cycle are accumulated, the sum is held in the second-level storage as a partial sum for subsequent accumulation. Each sub-dot product result set passed to the second-level storage is therefore accumulated at most once. In this way the partial sums of the same convolution kernel across multiple layers are accumulated one by one, which avoids the accumulation conflicts and blocking that arise when too many partial sums from each layer must be added at the same time.
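The row-by-row accumulation of Steps 8 and 9 can be modeled with a short Python sketch. The function `accumulate_rows` and the use of a dictionary as the set of second-level registers are illustrative assumptions, not the hardware structure itself.

```python
def accumulate_rows(rows):
    """Accumulate sub-dot product result sets one multiplier row at a time.

    rows: list over multiplier rows; each row maps a convolution kernel
    number to its sub-dot product result set (a flat list of products).
    Returns the partial sum per kernel and the number of additions, each
    of which combines exactly two operands.
    """
    registers = {}   # hypothetical model of the second-level registers
    additions = 0
    for row in rows:                      # one row enters first-level storage at a time
        for kernel, result_set in row.items():
            if kernel in registers:       # same kernel, different layer
                registers[kernel] = [
                    p + q for p, q in zip(registers[kernel], result_set)
                ]
                additions += 1
            else:                         # nothing to add yet: keep as partial sum
                registers[kernel] = list(result_set)
    return registers, additions


rows = [
    {1: [1, 2], 2: [10, 20]},
    {2: [1, 1], 3: [5, 5]},
    {1: [3, 3], 2: [2, 2]},
]
partial_sums, additions = accumulate_rows(rows)
```

Because a row never contains the same kernel twice, each incoming result set is added to at most one existing partial sum, so no adder ever receives more than two operands.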
[0109] Step 10: Before the cache time reaches the time threshold, cache any partial sum or sub-dot product result set that cannot yet be accumulated in the cache space of the secondary storage. Since the number of convolution kernels may exceed the number of columns in the multiplier array, each multiplier's storage unit will store multiple sub-dot product result sets. When the primary storage transfers the sub-dot product result sets belonging to the same convolution kernel to the same register in the secondary storage, the other sub-dot product result sets from the storage unit are stored into this register at the same time. After the result sets of the same convolution kernel have been accumulated, the register still holds result sets that have not been accumulated, and some of them may remain unaccumulated for some time. A time threshold is therefore set, and any partial sum or sub-dot product result set that has not been accumulated is kept in the cache space of the secondary storage until the cache time reaches the threshold. The size of the cache space is S times the storage unit of the multiplier, where S is determined according to the sparsity and dispersion of the compressed data; preferably, S=3.
[0110] Step 11: When the cache time reaches the time threshold, transfer the partial sums or sub-dot product result sets cached in the cache space to the third-level storage.
[0111] Step 12: When the primary storage transmits a sub-dot product result set to the secondary storage, if the secondary storage holds no partial sum or sub-dot product result set with which it can be accumulated, the secondary storage retrieves the accumulable partial sum or result set from the tertiary storage, performs the accumulation operation, and stores the result of the accumulation into the secondary storage, thereby refreshing the cache time.
[0112] Step 13: Continue until the sub-dot product result sets of all 64 rows of multipliers have been accumulated, obtaining the accumulated result;
[0113] Step 14: When all weight data of all convolution kernels are reused and the sub-dot multiplication result set is accumulated, the processing result corresponding to the feature data set to be processed is obtained.
Example 4
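The caching behavior of Steps 10 to 12 can be sketched as follows. The class `SecondaryStorage`, the measurement of cache time in discrete ticks, and the use of scalar values in place of whole result sets are simplifying assumptions; the capacity limit S of the cache space is not modeled.

```python
class SecondaryStorage:
    """Toy model of the secondary storage with a time-limited cache."""

    def __init__(self, time_threshold):
        self.time_threshold = time_threshold
        self.cache = {}      # kernel -> [partial sum, cache time in ticks]
        self.tertiary = {}   # kernel -> partial sum spilled to third-level storage

    def receive(self, kernel, value):
        """Accept a value arriving from the primary storage (Step 12)."""
        if kernel in self.cache:
            self.cache[kernel][0] += value
            self.cache[kernel][1] = 0              # refresh the cache time
        elif kernel in self.tertiary:
            spilled = self.tertiary.pop(kernel)    # retrieve from tertiary storage
            self.cache[kernel] = [spilled + value, 0]
        else:
            self.cache[kernel] = [value, 0]        # wait for a partner (Step 10)

    def tick(self):
        """Advance time; spill entries whose cache time reached the threshold (Step 11)."""
        for kernel in list(self.cache):
            self.cache[kernel][1] += 1
            if self.cache[kernel][1] >= self.time_threshold:
                self.tertiary[kernel] = self.cache.pop(kernel)[0]


store = SecondaryStorage(time_threshold=2)
store.receive("W6", 5)
store.tick()
store.tick()                 # W6 has waited two ticks and is spilled
spilled_copy = dict(store.tertiary)
store.receive("W6", 3)       # W6 is fetched back and accumulated
```

In this model a partial sum only travels to the third-level storage when no matching result set arrives within the threshold, which reflects the intent of limiting back-and-forth accesses.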
[0114] S1. Rearrange the weight data: Expand the weight data of each layer of the 64 convolutional kernels into row vectors. Arrange the row vectors of the same layer belonging to different convolutional kernels in order to form a rearranged kernel matrix.
[0115] Taking the Cth layer as an example, the weight data (K×K) of channel C in convolution kernel W1 is expanded into a row vector (1×K²), which serves as the first row of the rearranged kernel matrix A; the weight data (N×N) of channel C in convolution kernel W2 is expanded into a row vector (1×N²), which serves as the second row of the rearranged kernel matrix A, and so on, until the weight data of all M convolution kernels in this layer are rearranged, where WMC(a, b) is the weight data at position (a, b) in channel C of the Mth convolution kernel;
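The rearrangement in S1 can be illustrated with a minimal Python sketch. The function name `rearrange_channel` and the nested-list layout `kernels[m][c][a][b]` are assumptions for illustration, and all kernels are taken to have the same size.

```python
def rearrange_channel(kernels, channel):
    """Build the rearranged kernel matrix A for one channel (layer).

    kernels[m][c] is the 2-D weight matrix of kernel m in channel c
    (hypothetical layout). Row m of the result is that matrix flattened
    row by row into a single row vector.
    """
    return [
        [weight for weight_row in kernel[channel] for weight in weight_row]
        for kernel in kernels
    ]


# Two kernels, one channel, 2x2 weights each.
kernels = [
    [[[1, 0], [0, 2]]],
    [[[0, 3], [4, 0]]],
]
A = rearrange_channel(kernels, channel=0)
first_column = [row[0] for row in A]
```

Each column of `A` then collects the weights of all kernels at one coordinate of this channel, which is the group of values loaded into one row of multipliers.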
[0116] S2. Compress the rearranged kernel matrix A using the CSF compression algorithm; store the compressed rearranged kernel matrix A in external memory;
[0117] S3. Compress the image data using the CSF algorithm; the compression is performed on each layer of the image data;
S4. Perform a dot product operation on the compressed weight data and the image data;
S41. Store the compressed weight data into an M*N multiplier array. As shown in Figure 13, since the convolution kernels in this embodiment have 64 layers, the multiplier array in this embodiment has M=64 rows and N=8 columns, for a total of 64*8 multipliers. The weight data of different convolution kernels with the same coordinates on the same layer are stored in the same row of multipliers, from right to left. For example, the first row of multipliers stores the weight data at coordinates (1,1) of 8 convolution kernels on the 1st layer, and the second row of multipliers stores the weight data at coordinates (1,1) of 8 convolution kernels on the 2nd layer. Because the multiplier array has 8 columns, the weight data of 8 convolution kernels are input into one row of multipliers at a time. Since the weight data has been compressed, there may be fewer than 64 non-zero weight data at coordinates (1,1) across the 64 convolution kernels; 8 data are therefore input at a time until all of them have been input. The storage unit corresponding to each multiplier stores the sub-dot product result sets of more than one convolution kernel, and the corresponding convolution kernel number is stored with each sub-dot product result set. Since all weight data will be refreshed later, the coordinates of the weight data must also be stored for the subsequent accumulation operations.
[0118] S42. Each layer of the input image data is treated as a sub-feature data set, resulting in 64 sub-feature data sets for 64 layers. Due to the compression of the image data, the coordinates of the feature data in the sub-feature data sets are discontinuous. As shown in Figure 14, the dashed box represents a sub-feature data set;
[0119] S43. The first non-zero feature data of all sub-feature data sets are passed into the multiplier array, as shown in the dashed box in Figure 15. The subsequent non-zero feature data of each sub-feature data set are passed in the same way. Each feature data is broadcast to each multiplier in the corresponding row and multiplied with the weight data in each multiplier. The number of convolution kernel layers is the same as the number of layers of image data, which is 64 in this embodiment, so the feature data first transmitted to the multiplier array occupies 64 rows with one data per row, 64 data in total. If the feature data set to be processed is not compressed, it may contain 0 values. If the value at a specific coordinate of the feature data set to be processed is 0, this coordinate is skipped and no calculation is performed; the calculation continues when the next non-zero feature data is passed into that row of multipliers. If all the feature data of a layer is 0, the layer has no effect on the final result, so its calculation is skipped entirely and no product result needs to be stored.
[0120] When the feature data set to be processed is compressed, if the data at a specific coordinate does not exist, this coordinate is skipped and no calculation is performed. The next non-zero feature data of this layer is then passed in for calculation, which reduces the amount of subsequent accumulation. If a layer has no feature data at all, the calculation of that layer is skipped.
S44. Perform a dot product operation between the image data passed to the multiplier and the weight data stored in the multiplier. The dot product result is stored in the storage unit corresponding to each multiplier, and the size of the storage unit is determined according to the output image data. For example, if the input image data is a 4*4 feature matrix with no 0 values and the weight data in the first multiplier of the first row is the weight of W1 at (1,1), then after all the image data has been transmitted and processed, the storage unit corresponding to this multiplier holds 16 product results, stored according to the coordinates of the feature data, which form one sub-dot product result set. At the same time, the convolution kernel number W1 and the coordinates (1,1) of the current weight data corresponding to this sub-dot product result set are saved. The convolution kernel number and the coordinates of the current weight data are the indices of the relationship between the partial sums.
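The zero-skipping behavior of S43 can be sketched as follows. The coordinate-value list used here is only a stand-in for the compressed format and is not the CSF algorithm itself; the function names and the dictionary of row weights are likewise hypothetical.

```python
def nonzero_entries(feature_map):
    """Return ((row, col), value) for every non-zero entry, 1-based coordinates."""
    return [
        ((y, x), value)
        for y, line in enumerate(feature_map, start=1)
        for x, value in enumerate(line, start=1)
        if value != 0
    ]


def sparse_row_products(feature_map, row_weights):
    """Multiply one channel with the non-zero weights of its multiplier row.

    row_weights maps a convolution kernel number to its weight at the
    current coordinate. Returns kernel -> {feature coordinate: product},
    or None when the whole channel is zero and the row is skipped.
    """
    entries = nonzero_entries(feature_map)
    if not entries:
        return None                      # layer contributes nothing; store nothing
    return {
        kernel: {coord: value * weight for coord, value in entries}
        for kernel, weight in row_weights.items()
        if weight != 0                   # zero weights are absent after compression
    }


products = sparse_row_products([[0, 2], [0, 0]], {"W1": 3, "W2": 0, "W14": 5})
skipped = sparse_row_products([[0, 0], [0, 0]], {"W1": 3})
```

Only one multiplication per stored kernel is performed for this channel, because three of its four feature values are zero and are never transmitted.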
[0121] S5. Perform an accumulation operation on the set of results of the sub-dot multiplication operation. The specific steps include:
[0122] In this embodiment there are 64 convolution kernels but only 8 columns of multipliers, so even after compression the weight data at coordinate (1,1) of each layer cannot be passed into the multiplier array all at once. There are therefore two transmission cases, described here taking the first row of multipliers as an example. In the first case, the 8 weight data in the first row of the multiplier array are multiplied with the corresponding sub-feature data set a1, and once each multiplier in the first row has stored a sub-dot product result set, that result set is passed into the first-level storage. The weight data of the remaining convolution kernels at coordinate (1,1) are then passed in 8 at a time, one per column, and each row of multipliers repeats the dot product operation until all non-zero weight data at coordinate (1,1) have been multiplied with their corresponding sub-feature data sets. Each time the weight data has been multiplied with the sub-feature data set, the sub-dot product result set is passed to the first-level storage. After the result sets of the first row of multipliers are passed from the first-level storage to the second-level storage, the result sets of the second row of multipliers are passed to the first-level storage. The dot product operation is performed simultaneously in every row of multipliers, and each dot product result is passed to the accumulator once it is obtained, so the dot product operation and the accumulation operation also run in parallel. In the second case, the weight data of all convolution kernels at coordinate (1,1) are passed to the multiplier array in groups of 8 until all non-zero weight data at coordinate (1,1) have been multiplied with the corresponding sub-feature data set a1. Then the one or more sub-dot product result sets stored in the storage unit of each multiplier in the first row are passed to the first-level storage.
[0123] S51. As shown in Figure 13, in this embodiment, there are 8 multipliers in one row. The convolution kernels of the weight data in the first row multiplier include W1, W2, W3, and W14. There are also 8 memory units in the first-level storage. The storage unit of one multiplier corresponds to one memory unit in the first-level storage. The sub-dot product result set of the storage unit corresponding to the first row multiplier is passed into the memory unit in the first-level storage.
[0124] S52. When the set of sub-dot multiplication results in the first row of the first-level storage is passed to the second-level storage, the set of sub-dot multiplication results in the second row of the multiplier is then passed to the first-level storage.
[0125] The convolution kernels of the weight data in the second row of multipliers include W2, W3, W6, and W20. Sub-dot product result sets that can be accumulated are passed to the same register in the secondary storage. For example, the sub-dot product result set of W2 in the second row can be accumulated with the sub-dot product result set of W2 in the first row, so both are passed to the same register and accumulated in the accumulator of the secondary storage. The same applies to W3. Since only result sets of the same convolution kernel can be accumulated between different layers, the sub-dot product result sets of two rows of multipliers contain at most two result sets of any one convolution kernel, and each accumulation therefore has no more than two operands.
[0126] In the second transmission case, since there are 64 convolution kernels, a register of the secondary storage holds more than one sub-dot product result set. Furthermore, some sub-dot product result sets in the register, such as those of W1 and W6, may not yet have been accumulated. The sub-dot product result sets or partial sums that have not been accumulated are cached in the cache space within the register. The size of the cache space is N times the storage unit of the multiplier, so the cache space can hold N sub-dot product result sets or partial sums waiting to be accumulated.
[0127] Set a cache time. When the cache time reaches the threshold, the sub-dot product result sets or partial sums cached in the cache space, such as that of W6, are stored into the third-level storage.
[0128] If a sub-dot product result set passed to the secondary storage finds no partial sum or sub-dot product result set there with which it can be accumulated, for example when a sub-dot product result set of W6 is subsequently passed to the secondary storage, the cached result for W6 is retrieved from the tertiary storage and added to it; the accumulated result is then stored in the current register and the cache time is refreshed.
[0129] S53. Continue until the product results of all M rows have been accumulated in sequence;
S6. When all the weight data at coordinate (1,1) have been multiplied, refresh the weight data in the multipliers, as shown in Figure 16. The first row of multipliers then stores the weight data at coordinate (1,2) of the first layer of different convolution kernels, the second row of multipliers stores the weight data at coordinate (1,2) of the second layer of different convolution kernels, and so on. Repeat steps S4-S5 until all the weight data of each convolution kernel has been reused, to obtain the processing result of the image data to be processed.
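The following Python sketch illustrates why the weight coordinates must be stored as an index: partial sums obtained for different weight coordinates of the same convolution kernel can be combined into the final output. The combination rule shown (stride 1, no padding, each partial sum shifted by its weight coordinate) is an assumption made for illustration and is not spelled out in this embodiment; all function names are hypothetical and coordinates are 0-based.

```python
def partial_planes(features, kernel):
    """Per weight coordinate (a, b), sum feature * weight over all channels.

    features[c][y][x] and kernel[c][a][b] are nested lists. The channel
    sum corresponds to the row-by-row accumulation of S5.
    """
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    size = len(kernel[0])
    return {
        (a, b): [
            [sum(features[c][y][x] * kernel[c][a][b] for c in range(channels))
             for x in range(width)]
            for y in range(height)
        ]
        for a in range(size)
        for b in range(size)
    }


def combine(planes, height, width, size):
    """Assumed final step: shift each plane by its weight coordinate and add."""
    return [
        [sum(planes[(a, b)][i + a][j + b] for a in range(size) for b in range(size))
         for j in range(width - size + 1)]
        for i in range(height - size + 1)
    ]


def direct_conv(features, kernel):
    """Reference sliding-window computation for comparison."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    size = len(kernel[0])
    return [
        [sum(features[c][i + a][j + b] * kernel[c][a][b]
             for c in range(channels) for a in range(size) for b in range(size))
         for j in range(width - size + 1)]
        for i in range(height - size + 1)
    ]


features = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # one channel, 3x3
kernel = [[[1, 0], [0, 1]]]                      # one channel, 2x2
result = combine(partial_planes(features, kernel), 3, 3, 2)
```

Under these assumptions the combined partial sums match the reference sliding-window result for the toy input.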
[0130] The acceleration method provided by this invention can accumulate all partial sums as much as possible before storing them in the top-level storage space, thereby reducing the performance and power consumption loss caused by the back-and-forth access of partial sums during the accumulation step.
[0131] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
[0132] In one embodiment, a convolutional neural network processing apparatus is provided, which corresponds one-to-one with the convolutional neural network processing methods described in the above embodiments, characterized in that it includes:
[0133] The first input unit is used to input weight data into the multiplier array according to a first preset method;
[0134] The determining unit is used to determine the set of sub-feature data corresponding to each row of multipliers based on the channels corresponding to each weight data in each row of multipliers;
[0135] The second input unit is used to input the sub-feature data set into the multiplier array based on the sub-feature data set corresponding to each row of multipliers;
[0136] The first calculation unit is used to perform a dot product operation between the weight data in each multiplier of any row of multipliers and the element feature data in the sub-feature data set corresponding to this row of multipliers; the second calculation unit is used to input the obtained first dot product operation result set into the accumulator for accumulation; and the acquisition unit is used to acquire the processing result corresponding to the feature data set to be processed.
[0137] Specific limitations regarding the convolutional neural network processing device can be found in the limitations of the convolutional neural network processing method described above, and will not be repeated here. Each unit in the aforementioned processing device can be implemented entirely or partially through software, hardware, or a combination thereof. These units can be embedded in hardware within or independently of the processor in a computer device, or stored in software within the memory of a computer device, so that the processor can invoke and execute the operations corresponding to each unit.
[0138] In one embodiment, a computer device is provided, which may be a server. The computer device includes a processor, memory, a network interface, and a database connected via a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used to communicate with external terminals via a network connection. When the computer program is executed by the processor, it implements a convolutional neural network processing method. Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
[0139] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above.
[0140] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.
Claims
1. A convolutional neural network processing method, characterized in that it includes: The weight data is input to the multiplier array according to the first preset method, wherein the convolution kernels corresponding to the weight data in each row of multipliers are different, the coordinates corresponding to the weight data in each row of multipliers are the same, and the channels corresponding to the weight data in each row of multipliers are the same; Based on the channels corresponding to the weight data in each row of multipliers, a sub-feature data set corresponding to each row of multipliers is determined, wherein the feature data set to be processed includes at least one sub-feature data set; Based on the sub-feature data set corresponding to each row of multipliers, the sub-feature data set is input into the multiplier array; The weight data in each multiplier of any row of multipliers is multiplied by the element feature data in the sub-feature data set corresponding to this row of multipliers to obtain the first dot product result set; The first dot product result set is input into the accumulator for accumulation; In response to the completion of the accumulation by the accumulator, the processing result corresponding to the feature data set to be processed is obtained.
2. The method according to claim 1, characterized in that, The channels corresponding to the weight data in any row of the multiplier are the same as the channels of the sub-feature data set corresponding to that row of the multiplier.
3. The method according to claim 1, characterized in that, A sub-feature data set consists of all feature data from any channel of the feature data set to be processed; the number of sub-feature data sets is the same as the number of rows of the multiplier array, and the arrangement order of the sub-feature data sets is determined based on the channels corresponding to the weight data in each row of the multiplier. Based on the coordinates of the element feature data, determine the order of the element feature data in the sub-feature data set.
4. The method according to claim 1, characterized in that, The weight data in any multiplier in any row of multipliers is multiplied by each element feature data in the sub-feature data set corresponding to this row of multipliers to obtain a sub-dot product result set, wherein the first dot product result set includes at least one sub-dot product result set.
5. The method according to claim 1, characterized in that, The input weight data includes compressed weight data.
6. The method according to claim 5, characterized in that, The steps to compress the weight data include: Based on the coordinates of the weight data, the weight data of each channel of different convolution kernels is expanded into a row vector; Arrange the row vectors of the same channel belonging to different convolution kernels into a rearranged kernel matrix; Compress the rearranged kernel matrix.
7. The method according to claim 1, characterized in that, The set of feature data to be processed includes a compressed set of feature data to be processed.
8. The method according to claim 1, characterized in that, Based on the sub-feature data set corresponding to each row of multipliers, the sub-feature data set is input into the multiplier array, including: Based on the sub-feature data set corresponding to any row of multipliers, the element feature data in the sub-feature data set corresponding to this row of multipliers is broadcast to each multiplier of this row of multipliers in coordinate order.
9. The method according to claim 1, characterized in that, The channels of the weight data in each row of multipliers are the same as the channels of the sub-feature data set corresponding to each row of multipliers. The sub-feature data sets are arranged in channel order, and then the element feature data in the sub-feature data sets are input into the multiplier array in coordinate order.
10. The method according to claim 1, characterized in that, When the number of columns in the multiplier array is less than the number of convolution kernels, and the dot product operation between the feature data set to be processed and the weight data in the current multiplier is completed, the convolution kernels in the multiplier are updated, and the feature data set to be processed is then input into the multiplier array, until the weight data of all convolution kernels has been multiplied with the feature data set to be processed.
11. The processing method according to claim 4, characterized in that, The set of sub-dot multiplication results is stored in the storage unit corresponding to the multiplier, and the convolution kernel number corresponding to the set of sub-dot multiplication results is also saved. Each storage unit stores at least one set of sub-dot multiplication results.
12. The processing method according to claim 1 or 11, characterized in that, The first set of dot product results is input into the accumulator, including: The set of sub-dot multiplication results in the storage unit corresponding to a row multiplier is passed into the first-level storage; the set of sub-dot multiplication results belonging to a storage unit is stored in a memory of the first-level storage. Based on the indication of the convolution kernel index corresponding to the sub-dot product result set, the sub-dot product result set that can be accumulated is passed into the same register in the secondary storage; the sub-dot product result set in the register is input into the accumulator; When the time threshold is not reached, the set of sub-dot multiplication results that cannot be accumulated is stored in the cache space of the secondary storage. When the time threshold is reached, the set of sub-dot multiplication results that cannot be accumulated is passed to the third-level storage. If no sub-dot product result set can be accumulated in the secondary storage and passed from the primary storage, the accumulative sub-dot product result set is retrieved from the tertiary storage and entered into the secondary storage register, and then passed to the accumulator.
13. The method according to claim 12, characterized in that, The size of the cache space is determined based on the storage unit corresponding to the multiplier.
14. The method according to claim 13, characterized in that, The cache space is three times the size of the storage unit corresponding to the multiplier.
15. A convolutional neural network processing device, characterized in that it includes: The first input unit is used to input weight data into the multiplier array according to a first preset method; The determining unit is used to determine the sub-feature data set corresponding to each row of multipliers based on the channels corresponding to the weight data in each row of multipliers; The second input unit is used to input the sub-feature data set into the multiplier array based on the sub-feature data set corresponding to each row of multipliers; The first calculation unit is used to perform a dot product operation between the weight data in each multiplier of any row of multipliers and the element feature data in the sub-feature data set corresponding to this row of multipliers; The second calculation unit is used to input the obtained first dot product result set into the accumulator for accumulation; The acquisition unit is used to acquire the processing result corresponding to the feature data set to be processed.