Sparse convolutional neural network system and ranking computation method

By using a sparse convolutional neural network system and a sorting calculation method, the problem of low acceleration efficiency of sparse neural networks is solved. By rearranging connections and balancing the number of weights to optimize the calculation order, the processing efficiency and acceleration effect are improved.

CN117556878B · Active · Publication Date: 2026-06-19 · NANJING UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NANJING UNIV
Filing Date
2023-01-09
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Sparse neural networks have low acceleration efficiency, especially unstructured sparse neural networks, which suffer from load imbalance and memory access conflicts when accelerated by hardware.

Method used

A sparse convolutional neural network system is adopted, including a storage module, a processing module, and a switching network module. The processing unit is connected to the input storage unit in turn by rearranging the connection, and the number of non-zero weights of the input and output channels is balanced by sorting calculation method to generate a Latin square matrix to optimize the calculation order.

Benefits of technology

It alleviates the load imbalance problem in parallel processing, improves the processing efficiency of processing units, and enhances the acceleration effect of sparse convolutional neural networks.



Abstract

This application provides a sparse convolutional neural network system and a sorting calculation method in some embodiments. The method accelerates the sparse convolutional neural network by using sparse weights and by parallel processing of convolution calculations for different input and output channels, reusing input activation data and weight values from the input channels. During parallel convolution calculations, the order of weight calculations is reordered by solving a Latin square matrix to maintain a balance in the number of non-zero weights in the input and output channels. This method can alleviate the problem of unbalanced load between processing units during parallel processing, improve the processing efficiency of the processing units, and thus improve the acceleration effect of the sparse convolutional neural network.

Description

Technical Field

[0001] This application relates to the field of neural network technology, and in particular to a sparse convolutional neural network system and a sorting calculation method.

Background Technology

[0002] Deep learning frameworks, also known as deep neural networks, are complex pattern recognition systems capable of functions ranging from automatic language translation to image recognition. The sparsity of deep neural networks is often used to compress models and reduce computational burden.

[0003] A sparse neural network contains a large number of zero elements, so when it is processed directly on general-purpose hardware such as a CPU (Central Processing Unit), dedicated hardware is still needed for acceleration. However, structured sparse neural networks offer only limited compression, which lowers the compression efficiency of sparse neural networks.

[0004] To ensure the compression efficiency of sparse neural networks, unstructured sparse neural networks are also used for compression. Unstructured sparse neural networks can guarantee a good compression ratio without sacrificing accuracy. However, the non-zero parameter distribution in sparse neural networks makes it difficult for hardware to accelerate them. That is, due to the random distribution of parameters in unstructured sparse networks, load imbalance and memory access conflicts can occur among multiple processing units in parallel processing, thus affecting the acceleration effect of unstructured sparse accelerators and resulting in low acceleration efficiency of sparse neural networks.

Summary of the Invention

[0005] This application provides a sparse convolutional neural network system and a sorting calculation method to solve the problem of low acceleration efficiency of sparse neural networks.

[0006] In a first aspect, some embodiments of this application provide a sparse convolutional neural network system, including: a storage module, a processing module, and a switching network module, wherein:

[0007] The storage module includes an input storage unit and an output storage unit. The number of input storage units is equal to the number of output storage units. Each input storage unit includes multiple input channels, and each output storage unit includes multiple output channels.

[0008] The processing module includes multiple processing units, the number of which is equal to the number of input storage units. Each input storage unit stores input activation data for the input channels, and the input activation data is a slice of the input feature map. The input storage units are connected to the processing units via a switching network module, which is a reorderable switching network. The switching network module is configured to connect the processing units to the input storage units in rounds according to the processing cycle, so that each processing unit is connected to a different input storage unit once in sequence. Each processing unit is configured to perform convolution calculations on the weights of the input channels and the output channels, along with the corresponding input activation data.

[0009] The processing module is connected to the output storage unit, which is used to store a portion of the output activation data of the output channel calculated by the processing unit.

[0010] In conjunction with the first aspect, in one possible implementation of the first aspect, the processing unit further includes an input slice register for storing slices of input activation data; the input slice register includes a conversion unit and a multiply-accumulate unit, the conversion unit being configured according to a multiplexer array; the conversion unit is connected to the multiply-accumulate unit to multiply each weight by a matrix formed by the slices of input activation data.

[0011] In conjunction with the first aspect, in one possible implementation of the first aspect, the multiply-accumulate unit includes a multiplier, an adder, a first multiplexer, and an internal register; the multiplier is connected to the adder, one end of the first multiplexer is connected to the adder, and the other end of the first multiplexer is connected to the input port of a control signal; the internal register is connected to the adder.

[0012] In conjunction with the first aspect, in one possible implementation of the first aspect, the multiplexer array of the conversion unit includes a register group and a second multiplexer; the register group includes register rows and register columns, and the second multiplexer includes first-level selectors and second-level selectors; two adjacent registers in each register column are connected to one first-level selector to form a first-level array; two adjacent first-level selectors are connected to one second-level selector to form a second-level array.

[0013] In conjunction with the first aspect, in one possible implementation of the first aspect, the rows of the first-level array are connected to the input port of the first control signal, and the rows of the second-level array are connected to the input port of the second control signal; the columns of the first-level array are connected to the input port of the third control signal, and the columns of the second-level array are connected to the input port of the fourth control signal, so that the first-level selectors and the second-level selectors in the rows and columns are controlled by four different control signals.

[0014] In a second aspect, some embodiments of this application also provide a sorting calculation method applied to the sparse convolutional neural network system described in the first aspect, the method comprising:

[0015] Read the first quantity information of non-zero weights in the input channels, and arrange the input channels according to the first quantity information to generate an input channel sequence;

[0016] The input channels are placed in the processing module according to the input channel sequence;

[0017] Read the second quantity information of non-zero weights in the output channels, and arrange the output channels according to the second quantity information;

[0018] The processing module performs the convolution calculations between the input channels and the output channels, and rotates the pairing of input channels and output channels between rounds, to generate a Latin square matrix;

[0019] Initialize the diagonal elements of the Latin square matrix and solve the Latin square matrix to obtain a matrix with a balanced number of non-zero weights in each row.

[0020] In conjunction with the second aspect, in one possible implementation of the second aspect, placing the input channels in the processing module according to the input channel sequence includes: reading the number of processing units to obtain the number of inputs; grouping the input channels according to the number of inputs, and sequentially placing one of the input channels in each processing module; after placing the number of input channels, reversing the arrangement of the processing modules to place all the input channels in the processing module.

[0021] In conjunction with the second aspect, in one possible implementation of the second aspect, after all the input channels have been placed in the processing module, the method further includes: regrouping the input channels according to a preset input channel parameter, so that each input storage unit stores a target number of target input channels; the target number is equal to the value of the input channel parameter, and the target input channels are feature map slices of the same group of input channels corresponding to different processing modules.

[0022] In conjunction with the second aspect, in one possible implementation of the second aspect, arranging the output channels according to the second quantity information includes: grouping the weighted output channels according to preset output channel parameters to generate output channel groups; calculating the second quantity information of non-zero weights in the output channel groups; arranging a target number of output channel groups according to the second quantity information, wherein the target number of outputs is equal to the value of the output channel parameters.

[0023] In conjunction with the second aspect, in one possible implementation of the second aspect, solving the Latin square matrix includes: traversing the positions of each element in the Latin square matrix, and solving the predicted values of blank elements in the Latin square matrix based on the assumption method; filling the positions of the blank elements with the predicted values; and when there are errors that do not meet the conditions of the Latin square matrix, backtracking the solved element positions based on the backtracking method to fill all blank elements in the Latin square matrix.

[0024] As can be seen from the above technical solutions, the sparse convolutional neural network system and sorting calculation method provided in some embodiments of this application can accelerate the sparse convolutional neural network through weight sparsity, and reuse the input activation data and weight values of the input channels by parallel processing of convolution calculations of different input and output channels in the sparse convolutional neural network. During parallel convolution calculation, the weight calculation order is also reordered by solving the Latin square matrix to maintain a balance in the number of non-zero weights in the input and output channels. This method can alleviate the problem of unbalanced load between processing units during parallel processing, improve the processing efficiency of the processing units, and thus improve the acceleration effect of the sparse convolutional neural network.

Attached Figure Description

[0025] To more clearly illustrate the technical solution of this application, the drawings used in the embodiments will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0026] Figure 1 is a schematic diagram of the connection architecture of a sparse convolutional neural network system provided in some embodiments of this application;

[0027] Figure 2 is a schematic diagram of the connection architecture of a multiply-accumulate unit provided in some embodiments of this application;

[0028] Figure 3 is a schematic diagram of the additional fields for weight identifiers provided in some embodiments of this application;

[0029] Figure 4 is a schematic diagram of a slice matrix provided in some embodiments of this application;

[0030] Figure 5 is a schematic diagram of a multiplexer array provided in some embodiments of this application;

[0031] Figure 6 is a flowchart of the sorting calculation method provided in some embodiments of this application.

Detailed Implementation

[0032] The embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described below do not represent all embodiments consistent with this application. They are merely examples of systems and methods consistent with some aspects of this application as detailed in the claims.

[0033] The sparsity of deep neural networks can be used to compress models and reduce computational burden. However, the large number of zero elements contained in sparse neural networks requires dedicated hardware acceleration when processed directly using hardware such as CPUs (Central Processing Units).

[0034] Furthermore, since structured sparse neural networks offer limited compression, some implementations employ unstructured sparse neural networks for compression. Unstructured sparse neural networks can achieve good compression ratios without sacrificing accuracy. However, the non-zero parameter distribution of unstructured sparse neural networks makes hardware acceleration difficult. Therefore, unstructured sparse neural networks require additional modules for storing sparse parameters and skipping meaningless calculations, increasing costs. Moreover, load imbalances and memory access conflicts between multiple processing units in parallel processing can reduce the acceleration effect of sparse neural networks.

[0035] Based on the above application scenarios, in order to alleviate the problems of load imbalance and memory access conflicts and to improve the acceleration effect of sparse neural networks, as shown in Figure 1, some embodiments of this application provide a sparse convolutional neural network system. The system includes a storage module, a processing module, and a switching network module, wherein:

[0036] The storage module includes input storage units and output storage units. The number of input storage units is equal to the number of output storage units. Each input storage unit includes multiple input channels, and each output storage unit includes multiple output channels.

[0037] The processing module includes multiple processing elements (PEs), the number of which is equal to the number of input storage units. The input storage units store input activation data for the input channels, which are slices of input feature maps. The input storage units are connected to the processing units via a switching network module, which is a reorderable switching network, such as a Benes network. The switching network module is configured to connect the processing units to the input storage units in rounds according to the processing cycle, so that each processing unit is connected to a different input storage unit once in sequence. The processing units are configured to perform convolution calculations on the weights of the input and output channels and the corresponding input activation data.

[0038] The processing module is connected to the output storage unit to store a portion of the output activation data of the output channel calculated by the processing unit.

[0039] For example, to maintain the parallelism of convolution computation while reusing input activation data and avoiding the latency caused by access conflicts when multiple processing units access the same storage unit, pe_num PEs process multiple convolution operations with different input and output channels in parallel in each processing cycle. Here, the number of parallel PEs is pe_num, the numbers of input SRAM (Static Random Access Memory) blocks and output SRAM blocks are both pe_num, the number of input feature map slice channels stored in each input SRAM block is inc, and the number of output slice channels stored in each output SRAM block is outc.

[0040] For example, as shown in Figure 1, pe_num = 4. The four input SRAM blocks are M0, M1, M2, and M3, each storing slices of input feature maps for inc channels. Each M block is connected to a different PE through a switch network. The output of each PE is only connected to the corresponding output SRAM block ACC (accumulation, storing partial sums and results), and each PE only performs convolution operations on the corresponding output channel. The four ACCs correspond to different output channel groups, and each ACC stores the partial sum of output activation data for outc output channels calculated by the corresponding PE.

[0041] In other words, within a processing cycle, each PE can connect to different input SRAM blocks via a switch network to compute convolutions of inc input channels and outc output channels, then accumulate the partial sums into the corresponding ACC. In the next processing cycle, each PE connects again via a switch network to M blocks different from those in the previous rounds. After pe_num cycles, each PE can connect to all input M blocks once to consume the currently loaded input activation data, obtaining the partial sums of output activation data corresponding to pe_num*outc output channels. Then, a new input feature map slice is loaded for the next computation. After completing the convolution computation for all input channels, the results stored in the ACC are output and refreshed, allowing for the next round of convolution computation. Therefore, in this embodiment, the input activation data can be reused across multiple output channels, reducing the problem of repeatedly reading input activation data.
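
The round-by-round connection pattern described above can be illustrated with a small sketch. It assumes the simplest possible reorderable schedule, a cyclic rotation in which PE p reads input block (p + cycle) mod pe_num; the function name `rotation_schedule` is hypothetical, and in the embodiments the actual order is determined later by solving a Latin square.

```python
def rotation_schedule(pe_num):
    """Return schedule[cycle][pe] = index of the input SRAM block (M block)
    that the PE reads in that cycle, using a simple cyclic rotation
    (an assumed schedule, not the one fixed by the patent text)."""
    return [[(pe + cycle) % pe_num for pe in range(pe_num)]
            for cycle in range(pe_num)]


schedule = rotation_schedule(4)
for cycle, row in enumerate(schedule):
    print(f"cycle {cycle}: " + ", ".join(f"PE{pe}<-M{m}" for pe, m in enumerate(row)))

# Within a cycle no two PEs read the same M block (no access conflict).
assert all(len(set(row)) == 4 for row in schedule)
# After pe_num cycles every PE has read every M block exactly once.
assert all({schedule[c][pe] for c in range(4)} == set(range(4)) for pe in range(4))
```

Each loaded input slice is therefore consumed by all pe_num output channel groups before a new slice has to be fetched.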

[0042] To facilitate convolution calculations by the processing unit, in some embodiments, the processing unit further includes an input slice register for storing slices of input activation data. The input slice register includes a conversion unit and a multiply-accumulate unit. The conversion unit is configured according to a multiplexer array and is connected to the multiply-accumulate unit to multiply each weight by the matrix formed by the slices of input activation data.

[0043] It is understandable that for a 1×1 convolutional kernel, the corresponding convolution calculation result can be obtained directly, while for convolutional kernels of other sizes, certain processing is required. The sparse convolutional neural network system provided in this application embodiment can perform calculations based on convolutional kernels of a preset size; for convolutional kernels larger than the preset size, multiple convolutional kernels of the preset size can be used instead. For example, for an n×n convolutional kernel, a 3×3 convolutional kernel is used as the basis; for convolutional kernels larger than 3×3, multiple 3×3 convolutional kernels are used instead.

[0044] For example, each PE internally performs element-matrix multiplication, meaning a single weight is multiplied by a slice matrix composed of slices of input activation data. Twi represents the width of the slice matrix, and Thi represents its height. In a convolution operation with 3×3 weights, each weight corresponds to a portion of the slice; the nine weights need to be multiplied by their corresponding slice portions, and the results are then aligned and summed to obtain the final result. Each weight is appended with a 4-bit field that identifies its position within the convolution kernel. In order from left to right and top to bottom, as shown in Figure 3, the weights are labeled 0000, 0001, 0011, 0100, 0101, 0111, 1100, 1101, and 1111. The PE internally stores an input slice register of size Twi×Thi. This input slice register can be connected to (Twi-2)×(Thi-2) MAC (multiply-accumulate) units via a shift unit composed of a MUX (multiplexer) array.

[0045] Therefore, by inputting a weight value at each calculation, the shift unit can be moved according to the additional field to select the corresponding (Twi-2)×(Thi-2) input activation data to be input into the MAC unit for calculation. Multiplying a single weight by (Twi-2)×(Thi-2) inputs yields a partial sum of (Twi-2)×(Thi-2) output activation data. Simultaneously, the partial sum of the output activation data is accumulated with the result in the MAC internal register. After all nine weights in the convolution kernel have been calculated, a (Twi-2)×(Thi-2) size output slice is obtained. Therefore, the convolution operation method provided in this embodiment only needs to store non-zero weights and their position identifiers; zero weights can be skipped directly, reducing unnecessary computation.
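
The weight-by-weight accumulation described above can be sketched in software as follows. This is only an illustration under stated assumptions: the high two bits of the 4-bit field are taken as the row shift and the low two bits as the column shift, with 00, 01, and 11 meaning shifts of 0, 1, and 2 (an encoding inferred from the nine codes listed in the text); the function names and the 5×5 test tile are made up.

```python
# Position codes in reading order (left to right, top to bottom), as listed in the text.
POSITION_CODES = ["0000", "0001", "0011", "0100", "0101", "0111", "1100", "1101", "1111"]
# Assumption: each 2-bit half encodes a shift of 0, 1 or 2 as 00, 01, 11.
SHIFT_OF = {"00": 0, "01": 1, "11": 2}


def decode_position(code):
    """Split a 4-bit field into (row_shift, col_shift)."""
    return SHIFT_OF[code[:2]], SHIFT_OF[code[2:]]


def sparse_conv3x3(slice_matrix, nonzero_weights):
    """Accumulate weight * shifted window for each stored (code, weight) pair."""
    th, tw = len(slice_matrix), len(slice_matrix[0])
    out = [[0] * (tw - 2) for _ in range(th - 2)]
    for code, w in nonzero_weights:          # zero weights are never stored
        dr, dc = decode_position(code)
        for r in range(th - 2):
            for c in range(tw - 2):
                out[r][c] += w * slice_matrix[r + dr][c + dc]
    return out


def dense_conv3x3(slice_matrix, kernel):
    """Reference: ordinary 3x3 valid convolution (cross-correlation form)."""
    th, tw = len(slice_matrix), len(slice_matrix[0])
    return [[sum(kernel[i][j] * slice_matrix[r + i][c + j]
                 for i in range(3) for j in range(3))
             for c in range(tw - 2)] for r in range(th - 2)]


kernel = [[0, 2, 0], [0, 0, -1], [3, 0, 0]]   # made-up sparse kernel
sparse = [(POSITION_CODES[3 * i + j], kernel[i][j])
          for i in range(3) for j in range(3) if kernel[i][j] != 0]
tile = [[r * 5 + c for c in range(5)] for r in range(5)]
assert sparse_conv3x3(tile, sparse) == dense_conv3x3(tile, kernel)
```

Only three of the nine kernel positions are visited here, which is the saving the paragraph attributes to skipping zero weights.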

[0046] Furthermore, in some embodiments, the multiply-accumulate unit includes a multiplier, an adder, a first multiplexer, and an internal register. The multiplier is connected to the adder; one end of the first multiplexer is connected to the adder, and the other end of the first multiplexer is connected to the input port of a control signal; and the internal register is connected to the adder.

[0047] For example, as shown in Figure 2, the MAC unit includes a multiplier, an adder, a mux3_1 (3-to-1 multiplexer), and an internal register. Each weight, after being input, is multiplied by the input activation data. Then, a 3-bit control signal selects one of three options (zero input, external input, or the internal register) to add to the product. The result is stored in the internal register. The control signal can also control whether the result stored in the internal register is transferred to the external SRAM.
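
A behavioural sketch of such a MAC unit may make the role of the multiplexer clearer. The class name, the select constants, and the reading of "external input" as a partial sum supplied from outside the unit are assumptions made for illustration; the text does not specify the control encoding.

```python
class MacUnit:
    """Behavioural model of one MAC: multiplier, adder, 3-to-1 mux, register."""
    ZERO, EXTERNAL, INTERNAL = 0, 1, 2   # hypothetical select encodings

    def __init__(self):
        self.register = 0

    def step(self, weight, activation, select, external=0):
        """Multiply, add the operand chosen by the mux, store the result."""
        product = weight * activation
        addend = {
            self.ZERO: 0,                  # start a fresh partial sum
            self.EXTERNAL: external,       # continue from a supplied partial sum
            self.INTERNAL: self.register,  # keep accumulating locally
        }[select]
        self.register = product + addend
        return self.register


mac = MacUnit()
mac.step(2, 3, MacUnit.ZERO)        # register = 6
mac.step(4, 5, MacUnit.INTERNAL)    # register = 6 + 20
print(mac.register)
```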

[0048] To facilitate the selection of input activation data by the conversion unit, in some embodiments, the multiplexer array of the conversion unit includes a register group and a second multiplexer. The register group is the input slice register within the processing unit, and includes register rows and register columns. The second multiplexer includes first-level selectors and second-level selectors. Two adjacent registers in each register column are connected to one first-level selector to form a first-level array; two adjacent first-level selectors are connected to one second-level selector to form a second-level array.

[0049] Furthermore, in some embodiments, the rows of the first-level array are connected to the input port of the first control signal, and the rows of the second-level array are connected to the input port of the second control signal; the columns of the first-level array are connected to the input port of the third control signal, and the columns of the second-level array are connected to the input port of the fourth control signal. In this way, the first-level and second-level selectors in the rows and columns are controlled by four different control signals, which are taken from the additional field of each weight.

[0050] For example, both the first-level selector and the second-level selector are mux2_1 (two-to-one multiplexers), and the additional field for the weight identifier is as shown in Figure 3. In the slice matrix shown in Figure 4, Twi = Thi = 18. As shown in Figure 5, for an 18×18 register group in the PE, adjacent registers in each column of 18 registers are connected to the same mux2_1, forming a first-level MUX column and requiring a total of 306 mux2_1s. Adjacent first-level mux2_1s are then connected to the same next-level mux2_1, forming a second-level MUX column and requiring a total of 288 mux2_1s. By selecting through these two column levels of mux2_1s, the input activation data of a matrix row can be located. Similarly, for the selected 18×16 input, in which each row contains 18 second-level mux2_1 outputs, a first-level MUX row of 272 mux2_1s and a second-level MUX row of 256 mux2_1s allow the input activation data of a matrix column to be located.

[0051] In other words, in this embodiment, the rows or columns of the MUX at the same level are controlled by the same control signal. The 4-bit additional field of each weight controls the mux2_1 at the horizontal and vertical levels respectively to select the corresponding input activation data. The MUX array can determine the input activation matrix corresponding to each weight by shifting. Compared with the method of using mux9_1 to determine the corresponding input activation data for each MAC unit, this method can effectively reduce the additional cost of the MUX.
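
The multiplexer counts quoted for the 18×18 case can be reproduced with a short calculation. The function name is hypothetical, and the comparison figure assumes that one mux9_1 is built as a tree of eight mux2_1 cells, which the text does not state.

```python
def mux2_counts(twi, thi):
    """Count mux2_1 cells in the two-level column and row shift stages."""
    col_level1 = (thi - 1) * twi          # adjacent registers in each column
    col_level2 = (thi - 2) * twi          # adjacent first-level muxes
    row_level1 = (twi - 1) * (thi - 2)    # applied to the (thi - 2) selected rows
    row_level2 = (twi - 2) * (thi - 2)
    return col_level1, col_level2, row_level1, row_level2


counts = mux2_counts(18, 18)
assert counts == (306, 288, 272, 256)     # the figures given in the text

shared_total = sum(counts)
# Assumption: a mux9_1 in front of every MAC costs 8 mux2_1 cells.
per_mac_total = (18 - 2) * (18 - 2) * 8
print(shared_total, per_mac_total)
```

Under that assumption the shared shift array needs 1122 mux2_1 cells against 2048 for a per-MAC mux9_1, which is consistent with the cost reduction the paragraph claims.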

[0052] Based on the aforementioned sparse convolutional neural network system, some embodiments of this application also provide a sorting calculation method, applied to the sparse convolutional neural network system provided in the above embodiments, to alleviate the problem of load imbalance during parallel processing. As shown in Figure 6, the method includes the following steps:

[0053] S100: Read the first quantity information of non-zero weights in the input channels, and arrange the input channels according to the first quantity information to generate an input channel sequence.

[0054] For single-layer convolution operations, the number of non-zero weights in each input channel can be counted to arrange all input channels according to the number of non-zero weights.

[0055] It should be noted that the sorting can be in descending order, in ascending order, or a combination of both, and the final result of these sorting methods is the same. This application does not impose any restrictions on this.

[0056] S200: Place the input channels into the processing module according to the input channel sequence.

[0057] After sorting the input channels, the processing modules need to be grouped, and then all the input channels are placed in the respective processing modules.

[0058] Therefore, in some embodiments, the number of processing units is read to obtain the number of inputs, and the input channels are then grouped according to the number of inputs, with one input channel being placed sequentially in each processing module. After that number of input channels has been placed, the order of the processing modules is reversed for the next pass, and this repeats until all the input channels have been placed in the processing modules.

[0059] For example, take pe_num PE groups. Using the sorted input channel sequence, the input channels are distributed according to the number of PEs. One input channel is placed in each of PE groups 1 to pe_num in order. After pe_num input channels have been placed, the direction is reversed, and sorted input channels pe_num+1 to 2*pe_num are placed in PE groups pe_num down to 1. The direction is then reversed again, and the process repeats until all input channels have been placed in PE groups.
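
Steps S100 and S200 can be sketched together: sort the input channels by their non-zero weight counts, then deal them to the PE groups back and forth. The function name and the sample counts are made up, and descending order is chosen here only as one of the orders the method allows.

```python
def snake_distribute(nonzero_counts, pe_num):
    """Sort channels by non-zero weight count (descending) and deal them to
    PE groups in a back-and-forth order: 1..pe_num, then pe_num..1, and so on."""
    order = sorted(range(len(nonzero_counts)),
                   key=lambda ch: nonzero_counts[ch], reverse=True)
    groups = [[] for _ in range(pe_num)]
    for pos, ch in enumerate(order):
        rnd, idx = divmod(pos, pe_num)
        # Even passes go left to right, odd passes go right to left.
        target = idx if rnd % 2 == 0 else pe_num - 1 - idx
        groups[target].append(ch)
    return groups


counts = [9, 1, 7, 3, 8, 2, 6, 4]          # made-up non-zero counts per input channel
groups = snake_distribute(counts, pe_num=4)
loads = [sum(counts[ch] for ch in g) for g in groups]
print(groups, loads)
```

With these sample counts every PE group ends up with the same total of non-zero weights, which is the balancing effect the back-and-forth placement is meant to approximate on real weight distributions.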

[0060] Furthermore, in some embodiments, after all input channels have been placed in the processing module, the input channels are regrouped according to a preset input channel parameter so that each input storage unit stores the target number of target input channels. The target number is equal to the value of the input channel parameter, and the target input channels are feature map slices of the same group of input channels corresponding to different processing modules.

[0061] For example, the preset input channel parameter is inc. The input channels in each PE group are further grouped according to the parameter inc. The inc consecutive input channels within a PE group are divided into a smaller group. Each M-block storing input feature map slices stores feature map slices of the inc input channels from the same smaller group in different PE groups. In each computation cycle, for pe_num PEs, each PE processes the convolution calculations of the same group of inc input channels in different PE groups in parallel. At this time, the number of non-zero weights in the multiple input channels corresponding to the pe_num M-blocks is similar.
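
One possible reading of this regrouping is sketched below: each PE group is cut into consecutive sub-groups of inc channels, and at load step k the j-th M block holds the k-th sub-group of PE group j. The function name, the indexing convention, and the sample data (channels numbered by their rank in the sorted sequence) are assumptions.

```python
def regroup_for_m_blocks(pe_groups, inc):
    """Split each PE group into consecutive sub-groups of `inc` channels.
    result[k][j] is the sub-group assumed to be held by M block j at load step k."""
    steps = max((len(g) + inc - 1) // inc for g in pe_groups)
    return [[g[k * inc:(k + 1) * inc] for g in pe_groups] for k in range(steps)]


# Channels are named by sorted rank, already dealt back and forth to 4 PE groups.
pe_groups = [[0, 7, 8, 15], [1, 6, 9, 14], [2, 5, 10, 13], [3, 4, 11, 12]]
plan = regroup_for_m_blocks(pe_groups, inc=2)
for k, blocks in enumerate(plan):
    print(f"load step {k}: {blocks}")
```

In this example the ranks held by the four M blocks at a given load step sum to the same value, which illustrates why their non-zero weight counts tend to be similar.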

[0062] S300: Read the second quantity information of non-zero weights in the output channels, and arrange the output channels according to the second quantity information.

[0063] After all the input channels are put into the processing module, it is also necessary to read the non-zero parameters of the output channels and sort the output channels according to the number of non-zero parameters.

[0064] In some embodiments, the output channels are grouped according to preset output channel parameters to generate output channel groups. Then, a second quantity of non-zero weights in each output channel group is calculated, and the target number of output channel groups is arranged according to the second quantity information. The target number of outputs is equal to the value of the output channel parameters.

[0065] In other words, at this point, the output channel groups of each of the pe_num M blocks are sorted. Based on the input channels in each M block, the number of non-zero elements of each output channel group is calculated, and each M block is sorted separately. The sorting result is used to determine the order in which candidate elements are tried when solving the Latin square matrix using the assumption method.

[0066] For example, the preset output channel parameter is outc. The output channels of the weights are grouped with outc output channels per group. In each calculation cycle, pe_num output channel groups are processed in parallel, so these output channel groups correspond to the pe_num PEs and ACCs. For each output channel group, the number of non-zero weights is counted against the weights of each group of inc input channels. Then, the pe_num output channel groups are sorted in ascending order of their number of non-zero weights, forming the sorted output channel groups corresponding to each of the pe_num input channel groups.
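
The per-M-block sorting of output channel groups can be sketched as follows. The data layout (`nnz_per_pair[o][i]` holding the number of non-zero kernel weights between output channel o and input channel i), the function name, and the sample numbers are all made up for illustration.

```python
def sort_output_groups(nnz_per_pair, in_groups, outc):
    """For each input channel group (M block), return the output channel
    group indices sorted by ascending non-zero weight count."""
    n_out_groups = len(nnz_per_pair) // outc
    result = []
    for in_group in in_groups:
        # Non-zero count of each output group restricted to this input group.
        nnz = [sum(nnz_per_pair[o][i]
                   for o in range(g * outc, (g + 1) * outc)
                   for i in in_group)
               for g in range(n_out_groups)]
        result.append(sorted(range(n_out_groups), key=lambda g: nnz[g]))
    return result


nnz_per_pair = [[1, 0, 0, 1],
                [2, 1, 0, 0],
                [0, 0, 3, 2],
                [1, 0, 1, 1]]
ranking = sort_output_groups(nnz_per_pair, in_groups=[[0, 1], [2, 3]], outc=2)
print(ranking)
```

The ranking differs between the two input groups, which is why each M block is sorted separately.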

[0067] It is understandable that the sorting method for output channels is based on the same principle as that for input channels, and will not be elaborated upon here.

[0068] S400: The processing module performs the convolution calculations between the input channels and the output channels, and rotates the pairing of input channels and output channels between rounds, to generate a Latin square matrix.

[0069] After sorting the output channels, the processing module handles the convolution calculations for different input and output channels. Then, the input and output channels are rotated, and a Latin square is generated based on this rotation. A Latin square is an n×n matrix containing n distinct elements, where each distinct element appears only once in the same row or column.

[0070] For example, in each operation a PE (processing unit) processes convolution calculations for different input channels and different output channels. The convolution calculations between the current input channels and the output channels are performed in rotation pe_num times. Over the pe_num rounds, the PEs and the pe_num M blocks connected via the switching network can be viewed as a pe_num × pe_num matrix. Each column contains elements belonging to a fixed group of input channels. Each row represents the groups of output channels being processed. During this process, it is ensured that each row and each column of the matrix contains unique elements and all numbers from 1 to pe_num; thus, the matrix is a pe_num × pe_num Latin square.
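The rotation described above can be illustrated with a simple cyclic schedule. The Python sketch below is an assumption-laden example rather than the arrangement used in this application: the formula `(rnd + col) % pe_num` and the function names `rotation_schedule` and `is_latin_square` are hypothetical, and indices are 0-based, whereas the text numbers the elements from 1 to pe_num. It only demonstrates that rotating pe_num output channel groups over pe_num rounds yields a Latin square.

```python
def rotation_schedule(pe_num):
    """Row = round, column = fixed input channel group,
    entry = output channel group processed in that round."""
    return [[(rnd + col) % pe_num for col in range(pe_num)]
            for rnd in range(pe_num)]


def is_latin_square(matrix):
    """Check that every row and every column holds each of the
    n symbols 0..n-1 exactly once."""
    n = len(matrix)
    symbols = set(range(n))
    rows_ok = all(len(row) == n and set(row) == symbols for row in matrix)
    cols_ok = all({matrix[r][c] for r in range(n)} == symbols
                  for c in range(n))
    return rows_ok and cols_ok


schedule = rotation_schedule(4)
for row in schedule:
    print(row)
print(is_latin_square(schedule))
```

A plain cyclic schedule satisfies the Latin square conditions but takes no account of non-zero weight counts, which is why the entries are subsequently solved for rather than fixed in advance.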

[0071] S500: Initialize the diagonal elements of the Latin square matrix and solve the Latin square matrix to obtain a matrix with a balanced number of non-zero weights in each row.

[0072] After generating the Latin square matrix, the diagonal elements of the Latin square matrix are initialized to solve the Latin square matrix and obtain a matrix in which the number of non-zero weights in each row is balanced, thus ensuring that the processing modules in each round are load-balanced.

[0073] For example, element [i][i] of the Latin square matrix is initialized to the i-th sorted output channel group corresponding to the i-th input channel group. When pe_num-1 of the pe_num initialized elements are identical, one of the duplicate elements is changed to the nearest output channel group.
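The diagonal initialization in [0073] can be sketched as below. A diagonal in which exactly pe_num-1 entries are equal cannot be completed to a Latin square, because the repeated value would have no free position left in the one remaining row, which is consistent with the adjustment described above. The Python code is a minimal sketch under stated assumptions: "nearest" is interpreted as the closest position in that column's sorted order, the first duplicate is the one that is changed, and the names `init_diagonal` and `has_near_duplicate` are hypothetical.

```python
from collections import Counter


def has_near_duplicate(diag):
    """True when exactly len(diag) - 1 diagonal entries share one value."""
    return max(Counter(diag).values(), default=0) == len(diag) - 1


def init_diagonal(orders):
    """orders[i] is the sorted list of output channel groups for input
    channel group i; element [i][i] starts as orders[i][i]."""
    n = len(orders)
    diag = [orders[i][i] for i in range(n)]
    if not has_near_duplicate(diag):
        return diag
    # Assumption: change the first duplicate, trying positions in its own
    # sorted order by increasing distance from position i.
    value = Counter(diag).most_common(1)[0][0]
    i = diag.index(value)
    for pos in sorted(range(n), key=lambda p: abs(p - i)):
        trial = diag[:i] + [orders[i][pos]] + diag[i + 1:]
        if not has_near_duplicate(trial):
            return trial
    return diag  # no admissible replacement found; left unchanged


print(init_diagonal([[0, 1, 2], [1, 0, 2], [0, 2, 1]]))
# -> [2, 0, 1]
```

In this example the initial diagonal is [0, 0, 1]; replacing the first 0 with its immediate neighbour 1 would again leave two equal entries, so the sketch moves one position further and selects 2.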

[0074] As can be seen, the above-mentioned sparse convolutional neural network system and sorting calculation method can use a conversion module to address, at a lower MUX cost, the need to connect different weight elements of the convolution kernel to different parts of the feature map; they can alleviate the latency caused by memory access conflicts, so that the input feature map can be reused by multiple output channels; and they reduce the load imbalance of sparse convolutional neural networks through the sorting calculation method, thereby improving the acceleration efficiency of sparse convolutional neural networks.

[0075] To facilitate solving the Latin square matrix, in some embodiments the element positions in the Latin square matrix are traversed, and predicted values for the blank elements are calculated using the assumption method and filled into the positions of the blank elements. When a filled value violates the conditions of the Latin square matrix, the backtracking method returns to the previously solved element positions, and the process continues until all the blank elements in the Latin square matrix are filled.

[0076] For example, the Latin square matrix is solved using the assumption method and the backtracking method. First, the assumption method is used to solve for the blank elements in the Latin square matrix from left to right and top to bottom. At this point, the blank elements are uninitialized. Values are tried according to the output channel group order of the current element's column. When an error occurs that does not meet the conditions of the Latin square matrix, the process backtracks to the previous element position and retries the untried values. If all values fail to meet the conditions of the Latin square matrix, the process backtracks to the previous successful element position and continues until all blank elements in the Latin square matrix are filled. After solving, a Latin square matrix whose rows have relatively similar numbers of non-zero weights is obtained. Here, the balanced number of non-zero weights refers to the number of non-zero weights contained in the input and output channels corresponding to the elements of each row. Therefore, in the pe_num rounds of exchange calculation, the PE load in each round can achieve relative balance.
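The assumption-and-backtracking procedure of [0076] corresponds to a standard backtracking search, which can be sketched in Python as follows. This is a hedged illustration rather than the implementation of this application: the function name `solve_latin_square`, the use of recursion, and the 0-based group indices are assumptions, and `orders[c]` stands for the sorted output channel group order of column c.

```python
def solve_latin_square(diag, orders):
    """Fill the blank cells left to right, top to bottom, trying values in
    the order given for each column; return None if no completion exists."""
    n = len(diag)
    grid = [[None] * n for _ in range(n)]
    for i, value in enumerate(diag):
        grid[i][i] = value  # diagonal elements are initialized first
    blanks = [(r, c) for r in range(n) for c in range(n) if r != c]

    def fits(r, c, value):
        # The value must not already appear in row r or column c.
        return (all(grid[r][k] != value for k in range(n))
                and all(grid[k][c] != value for k in range(n)))

    def fill(k):
        if k == len(blanks):
            return True
        r, c = blanks[k]
        for value in orders[c]:      # trial order of this column
            if fits(r, c, value):
                grid[r][c] = value   # assume this value and move on
                if fill(k + 1):
                    return True
                grid[r][c] = None    # backtrack and try the next value
        return False

    return grid if fill(0) else None


orders = [[1, 0, 2], [2, 0, 1], [0, 1, 2]]
print(solve_latin_square([0, 1, 2], orders))
# -> [[0, 2, 1], [2, 1, 0], [1, 0, 2]]
```

Because each column tries its lightest output channel groups first, the search tends to favour assignments that follow the sorted order, while the row and column checks enforce the Latin square conditions.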

[0077] As can be seen from the above technical solutions, the sparse convolutional neural network system and sorting calculation method provided in some embodiments of this application can accelerate the sparse convolutional neural network through weight sparsity, and reuse the input activation data and weight values of the input channels by parallel processing of convolution calculations of different input and output channels in the sparse convolutional neural network. During parallel convolution calculation, the weight calculation order is also reordered by solving the Latin square matrix to maintain a balance in the number of non-zero weights in the input and output channels. This method can alleviate the problem of unbalanced load between processing units during parallel processing, improve the processing efficiency of the processing units, and thus improve the acceleration effect of the sparse convolutional neural network.

[0078] Similar parts between the embodiments provided in this application can be referred to mutually. The specific implementation methods provided above are only a few examples under the overall concept of this application and do not constitute a limitation on the scope of protection of this application. For those skilled in the art, any other implementation methods extended from the solution of this application without creative effort shall fall within the scope of protection of this application.

Claims

1. A sparse convolutional neural network system, comprising a storage module, a processing module, and a switching network module, wherein: the storage module includes input storage units and output storage units, the number of input storage units being equal to the number of output storage units; each input storage unit includes multiple input channels, and each output storage unit includes multiple output channels; the processing module includes multiple processing units, the number of which is equal to the number of input storage units; each input storage unit stores input activation data for the input channels, the input activation data being a slice of the input feature map; the input storage units are connected to the processing units via the switching network module, which is a reorderable switching network; the switching network module is configured to connect the processing units to the input storage units in rounds according to the processing cycle, so that each processing unit is connected to each different input storage unit once in sequence; each processing unit is configured to perform convolution calculations on the weights of the input channels and the output channels, along with the corresponding input activation data; and the processing module is connected to the output storage units, which are used to store a portion of the output activation data of the output channels calculated by the processing units.
The system is also configured to perform the following sorting calculation operations: reading first quantity information of the non-zero weights in the input channels, arranging the input channels according to the first quantity information to generate an input channel sequence, and placing the input channels in the processing module according to the input channel sequence; reading second quantity information of the non-zero weights in the output channels, and arranging the output channels according to the second quantity information; performing, by the processing module, convolution calculations between the input channels and the output channels, with the switching network module performing round-robin connections according to the processing cycle to rotate the convolution calculation order so as to generate a Latin square matrix; and initializing the diagonal elements of the Latin square matrix and solving the Latin square matrix to obtain a matrix with a balanced number of non-zero weights in each row.

2. The sparse convolutional neural network system of claim 1, wherein the processing unit further includes an input slice register, which is used to store slices of input activation data; the input slice register includes a conversion unit and a multiply-accumulate unit; the conversion unit is configured based on a multiplexer array; and the conversion unit is connected to the multiply-accumulate unit so that each weight is multiplied by the matrix formed by the input activation data slice.

3. The sparse convolutional neural network system of claim 2, wherein the multiply-accumulate unit includes a multiplier, an adder, a first multiplexer, and an internal register; the multiplier is connected to the adder, one end of the first multiplexer is connected to the adder, and the other end of the first multiplexer is connected to the input port of the control signal; and the internal register is connected to the adder.

4. The sparse convolutional neural network system of claim 3, wherein the multiplexer array of the conversion unit includes a register group and a second multiplexer; the register group includes register rows and register columns, and the second multiplexer includes first-level selectors and second-level selectors; two adjacent registers in each register column are connected to one first-level selector to form a first-level array; and two adjacent first-level selectors are connected to one second-level selector to form a second-level array.

5. The sparse convolutional neural network system of claim 4, wherein the rows of the first-level array are connected to the input ports of a first control signal, and the rows of the second-level array are connected to the input ports of a second control signal; the columns of the first-level array are connected to the input ports of a third control signal, and the columns of the second-level array are connected to the input ports of a fourth control signal, so that the first-level selectors and the second-level selectors in the rows and columns are controlled through four different control signals.

6. The sparse convolutional neural network system of any one of claims 1-5, wherein, in the sorting calculation operation, placing the input channels into the processing module according to the input channel sequence specifically includes: reading the number of processing units to obtain the number of inputs; grouping the input channels according to the number of inputs, and sequentially placing one of the input channels into each of the processing modules; and, after the input channels have been placed, arranging the processing modules in reverse order, so that all the input channels are placed in the processing modules.

7. The sparse convolutional neural network system of claim 6, wherein the sorting calculation operation also includes: regrouping the input channels according to a preset input channel parameter, so that each input storage unit stores a target number of target input channels; the target number of inputs is equal to the value of the input channel parameter, and the target input channels are feature map slices of the same group of input channels corresponding to different processing modules.

8. The sparse convolutional neural network system of any one of claims 1-5, wherein, in the sorting calculation operation, arranging the output channels according to the second quantity information specifically includes: grouping the output channels of the weights according to a preset output channel parameter to generate output channel groups; calculating the second quantity information of the non-zero weights in each output channel group; and arranging the target number of output channels according to the second quantity information, the target number of outputs being equal to the value of the output channel parameter.

9. The sparse convolutional neural network system of any one of claims 1-5, wherein solving the Latin square matrix in the sorting calculation operation specifically includes: traversing the position of each element in the Latin square matrix, and solving for the predicted values of the blank elements in the Latin square matrix based on the assumption method; filling the blank elements with the predicted values; and, when an error occurs that does not meet the conditions of the Latin square matrix, backtracking through the positions of the solved elements using the backtracking method, so as to fill all the blank elements in the Latin square matrix.