Compressing a set of coefficients for subsequent use in a neural network
By applying sparsity to the set of coefficients of a neural network and compressing it, the high storage and computation requirements caused by a highly parameterized set of coefficients are reduced, enabling efficient implementation on resource-constrained devices.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- IMAGINATION TECH LTD
- Filing Date
- 2021-12-20
- Publication Date
- 2026-06-19
Figure CN114662648B_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to computer-implemented neural networks. Specifically, this disclosure relates to applying sparsity in computer-implemented neural networks.
Background Technology
[0002] Neural networks can be used in machine learning applications. Specifically, they can be used in signal processing applications, including image processing and computer vision applications. For example, convolutional neural networks (CNNs) are a type of neural network commonly used to analyze image data, such as for image classification applications, semantic image segmentation applications, super-resolution applications, object detection applications, and so on.
[0003] In image classification applications, image data representing one or more images can be input into a neural network, and the output of the neural network can be data indicating the probability (or set of probabilities) that each of these images belongs to a specific category (or set of categories). Neural networks typically consist of multiple layers between input and output layers. Within a layer, a set of coefficients can be combined with the data input of that layer. Convolutional layers and fully connected layers are examples of neural network layers where the set of coefficients is combined with the data input of those layers. Neural networks may also include other types of layers that are not configured to combine a set of coefficients with the data input of those layers (such as activation layers and element-wise layers). In image classification applications, the computations performed within layers enable the identification of characteristic features of the input data and the prediction of the category (or set of categories) to which the input data belongs.
[0004] Neural networks are typically trained to improve the accuracy of their outputs by using training data. In the image classification example, training data may include data representing one or more images and a corresponding pre-determined label for each of these images. Training a neural network may involve operating the network on the training input data using an untrained or partially trained set of coefficients to form training output data. For example, a loss function can be used to evaluate the accuracy of the training output data. The coefficient set can be updated based on the accuracy of the training output data through a process called gradient descent and backpropagation. For example, the coefficient set can be updated based on the loss determined using the loss function on the training output data.
[0005] The coefficient set used in a typical neural network can be highly parameterized. That is, the coefficient set used in a typical neural network usually includes a large number of non-zero coefficients. A highly parameterized coefficient set can have a large memory footprint. The memory bandwidth required to read a highly parameterized coefficient set from memory can be substantial. A highly parameterized coefficient set can also subject the neural network to significant computational demands, for example, by requiring the neural network to perform numerous calculations (e.g., multiplications) between the coefficients and the input values. Therefore, it can be difficult to implement neural networks on devices with limited processing or memory resources.
Summary of the Invention
[0006] This summary is provided to introduce, in a simplified form, a series of concepts further described below in the detailed description. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.
[0007] According to a first aspect of the invention, a method is provided for compressing a set of coefficients for subsequent use in a neural network, the method comprising: applying sparsity to a plurality of coefficient groups, each group comprising a predefined plurality of coefficients; and compressing the coefficient groups according to a compression scheme aligned with the coefficient groups so that each coefficient group is represented by an integer number of one or more compressed values.
[0008] Each group may include one or more subsets of coefficients from a set of coefficients. Each group may include n coefficients and each subset may include m coefficients, where m is greater than 1 and n is an integer multiple of m. The method may also include compressing the coefficient group according to a compression scheme by compressing the one or more subsets of coefficients included in each group, so that each subset of coefficients is represented by an integer number of one or more compressed values.
[0009] n can be greater than m, and each coefficient group can be compressed by compressing multiple adjacent or interleaved subsets of coefficients.
[0010] n can be equal to 2m.
[0011] Each group can include 16 coefficients, and each subset can include 8 coefficients. Each group can be compressed by compressing two adjacent or interleaved subsets of coefficients.
[0012] n can be equal to m.
[0013] Applying sparsity to a group of coefficients may include setting each coefficient in the group to zero.
[0014] Sparsity can be applied to the plurality of coefficient groups according to a sparsity mask, which specifies the coefficients in the coefficient set to which sparsity is to be applied.
[0015] The coefficient set can be a coefficient tensor, and the sparsity mask can be a binary tensor with the same dimension as the coefficient tensor. Sparsity can be applied by performing element-wise multiplication between the coefficient tensor and the sparsity mask tensor. The binary tensor can be a tensor consisting of binary 1s and / or 0s.
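As a minimal sketch of the element-wise masking described in [0015], the following Python uses nested lists to stand in for a two-dimensional coefficient tensor. The function name `apply_sparsity_mask`, the example values, and the convention that a mask value of 1 keeps a coefficient while 0 sets it to zero are assumptions made for illustration and are not taken from the disclosure.

```python
def apply_sparsity_mask(coefficients, mask):
    """Element-wise product of a 2-D coefficient tensor and a binary mask."""
    same_shape = len(coefficients) == len(mask) and all(
        len(row) == len(mask_row) for row, mask_row in zip(coefficients, mask)
    )
    if not same_shape:
        raise ValueError("mask must have the same dimensions as the coefficients")
    return [
        [c * m for c, m in zip(row, mask_row)]
        for row, mask_row in zip(coefficients, mask)
    ]


# Hypothetical 2 x 4 coefficient tensor and binary mask (assumed values).
coefficients = [[3, -1, 4, 1], [-5, 9, 2, -6]]
mask = [[1, 1, 0, 0], [0, 0, 1, 1]]
sparse = apply_sparsity_mask(coefficients, mask)
```

Because the mask is binary, the multiplication either passes a coefficient through unchanged or replaces it with zero.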
[0016] The sparsity mask tensor can be formed by the following operations: generating a reduced tensor having one or more dimensions that are an integer multiple smaller than the corresponding dimensions of the coefficient tensor, where the integer is greater than 1; determining the elements of the reduced tensor to which sparsity is to be applied in order to generate a reduced sparsity mask tensor; and expanding the reduced sparsity mask tensor to generate a sparsity mask tensor with the same dimensions as the coefficient tensor.
[0017] Generating a reduced tensor may include: dividing the coefficient tensor into multiple coefficient groups such that each coefficient in the set is assigned to only one group, and all coefficients are assigned to groups; and representing each coefficient group of the coefficient tensor with the maximum coefficient value within the group.
[0018] The method may also include expanding the reduced sparsity mask tensor by performing nearest-neighbor upsampling, such that each value in the reduced sparsity mask tensor is represented by a group comprising multiple identical values in the sparsity mask tensor.
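The three steps in [0016] to [0018] (reduce, decide, expand) can be sketched as follows. This is an illustration under stated assumptions: groups run along each row, each group is represented by its maximum absolute value (the text says "maximum coefficient value"; using the magnitude is an assumption), and the fixed `threshold` argument is a hypothetical stand-in for however the elements to be pruned are actually chosen.

```python
def reduce_by_group_max(tensor, group):
    """Represent each run of `group` coefficients in a row by its largest magnitude."""
    if any(len(row) % group for row in tensor):
        raise ValueError("row length must be an integer multiple of the group size")
    return [
        [max(abs(v) for v in row[i:i + group]) for i in range(0, len(row), group)]
        for row in tensor
    ]


def reduced_sparsity_mask(reduced, threshold):
    """Mark reduced elements to keep (1) or to sparsify (0)."""
    return [[1 if v >= threshold else 0 for v in row] for row in reduced]


def upsample_nearest(mask, group):
    """Nearest-neighbor upsampling: repeat every mask value `group` times."""
    return [[v for v in row for _ in range(group)] for row in mask]


# Hypothetical 2 x 8 coefficient tensor with groups of 4 coefficients.
coefficients = [[3, -1, 0, 1, -7, 2, 1, 1], [1, 0, 5, -6, 0, -1, 8, 2]]
reduced = reduce_by_group_max(coefficients, group=4)
small_mask = reduced_sparsity_mask(reduced, threshold=5)
full_mask = upsample_nearest(small_mask, group=4)
```

The expanded mask has the same dimensions as the coefficient tensor, and every group is either kept or removed as a whole.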
[0019] Compressing each subset of coefficients may include: generating header data including h bits and multiple body portions, each including b bits, wherein each body portion corresponds to a coefficient in the subset, wherein b is fixed within the subset, and wherein the header data of the subset includes an indication of b for the body portions of the subset.
[0020] The method may further include: identifying the body size b by locating the bit position of the most significant leading one among all coefficients in the subset; generating header data including a bit sequence encoding the body size; and generating a body consisting of b bits for each coefficient in the subset by removing none, one or more leading zeros from each coefficient.
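A minimal sketch of the header-and-body scheme in [0019] and [0020] is given below. It assumes non-negative integer coefficients and a hypothetical header width of h = 4 bits. Sign handling and padding to a whole number of compressed values are not modeled, and the function names are illustrative only.

```python
def compress_subset(coefficients, header_bits=4):
    """Return (header, bodies) as bit strings for one subset of coefficients."""
    if any(c < 0 for c in coefficients):
        raise ValueError("this sketch only handles non-negative integers")
    # b is the bit position of the most significant leading one in the subset.
    body_bits = max(c.bit_length() for c in coefficients)
    if body_bits >= 2 ** header_bits:
        raise ValueError("header is too narrow to encode the body size")
    header = format(body_bits, "0{}b".format(header_bits))
    # Removing leading zeros leaves exactly b bits for every coefficient.
    bodies = [
        format(c, "0{}b".format(body_bits)) if body_bits else ""
        for c in coefficients
    ]
    return header, bodies


def compressed_size_in_bits(header, bodies):
    """Total size of the header plus all body portions."""
    return len(header) + sum(len(body) for body in bodies)


dense_header, dense_bodies = compress_subset([5, 0, 3, 12, 1, 0, 7, 2])
zero_header, zero_bodies = compress_subset([0] * 8)
dense_bits = compressed_size_in_bits(dense_header, dense_bodies)
zero_bits = compressed_size_in_bits(zero_header, zero_bodies)
```

In this sketch a subset whose coefficients have all been set to zero has b = 0 and is represented by its header alone, which is why applying sparsity to whole groups pairs well with a compression scheme aligned with those groups.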
[0021] The number of groups to which sparsity needs to be applied can be determined based on the sparsity parameter.
[0022] The method may further include: dividing the coefficient set into multiple coefficient groups such that each coefficient in the set is assigned to only one group, and all coefficients are assigned to one group; determining the significance of each coefficient group; and applying sparsity to coefficient groups with significance below a threshold, wherein the threshold is determined based on a sparsity parameter.
[0023] The threshold can be either the maximum absolute coefficient value or the average absolute coefficient value.
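The group-wise pruning in [0021] to [0023] can be sketched as below. The sketch assumes that the sparsity parameter is the fraction of groups to set to zero and that the threshold is realized implicitly by removing the least significant groups. The name `prune_groups` and the choice of the maximum absolute value as the default significance measure are illustrative assumptions.

```python
def prune_groups(coefficients, group_size, sparsity, significance=None):
    """Zero the least significant groups of a flat list of coefficients."""
    if len(coefficients) % group_size:
        raise ValueError("coefficient count must be a multiple of the group size")
    if significance is None:
        # Assumed default: significance is the largest magnitude in the group.
        significance = lambda group: max(abs(c) for c in group)
    groups = [
        coefficients[i:i + group_size]
        for i in range(0, len(coefficients), group_size)
    ]
    scores = [significance(group) for group in groups]
    # The sparsity parameter fixes how many groups fall below the threshold.
    num_pruned = int(sparsity * len(groups))
    order = sorted(range(len(groups)), key=lambda index: scores[index])
    pruned = set(order[:num_pruned])
    return [
        0 if position // group_size in pruned else c
        for position, c in enumerate(coefficients)
    ]


# Hypothetical example: 16 coefficients in groups of 4, half of the groups pruned.
weights = [1, -2, 0, 1, 9, 3, -4, 2, 0, 1, 1, 0, -6, 5, 2, 7]
pruned_weights = prune_groups(weights, group_size=4, sparsity=0.5)
```

Passing a different `significance` callable, for example the mean absolute value of the group, changes how groups are ranked without changing the rest of the procedure.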
[0024] The method may also include storing the compressed set of coefficients in memory for subsequent use in the neural network.
[0025] The method may also include using the compressed set of coefficients in a neural network.
[0026] According to a second aspect of the invention, a data processing system is provided for compressing a set of coefficients for subsequent use in a neural network, the data processing system comprising: pruning logic configured to apply sparsity to a plurality of coefficient groups, each group including a predefined plurality of coefficients; and a compression engine configured to compress the coefficient groups according to a compression scheme aligned with the coefficient groups, so that each coefficient group is represented by an integer number of one or more compressed values.
[0027] According to a third aspect of the present invention, a computer-implemented method for training a neural network is provided, the neural network comprising a plurality of layers, each layer being configured to combine a corresponding set of filters with data inputs to the layer to form output data of the layer, wherein each set of filters comprises a plurality of coefficient channels, each coefficient channel of the filter set corresponding to a corresponding data channel in the data inputs to the layer, and the output data comprises a plurality of data channels, each data channel corresponding to a corresponding filter of the filter set, the method comprising: identifying a target coefficient channel of the filter set of the layer; identifying a target data channel among the plurality of data channels in the data inputs to the layer, the target data channel corresponding to a target coefficient channel of the filter set; and configuring a runtime implementation of the neural network, wherein the filter set of the preceding layer does not include the filter corresponding to the target data channel.
[0028] The data input of a layer can depend on the output data of the previous layer.
[0029] The method may also include configuring the runtime implementation of the neural network, wherein the filter set of the preceding layer does not include filters corresponding to the target data channel, such that when the runtime implementation of the neural network is executed on the data processing system, combining the filter set of the preceding layer with the data input to the preceding layer will not form a data channel in the output data of the preceding layer corresponding to the target data channel.
[0030] The method may also include configuring the runtime implementation of the neural network such that each filter in the filter set of the layer does not include the target coefficient channel.
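The effect of [0027] to [0030] can be illustrated with a small sketch, assuming two consecutive layers of 1×1 filters so that each filter is simply a list of one coefficient per input channel. The helper names and example values are hypothetical. Removing the filter of the preceding layer that produces the target data channel, together with the matching coefficient channel of the current layer, leaves the output unchanged when that coefficient channel is all zeros.

```python
def layer_output(filters, inputs):
    """Combine each filter with the input channels (1x1 filters, one position)."""
    return [sum(w * x for w, x in zip(f, inputs)) for f in filters]


def prune_channel(prev_filters, cur_filters, target):
    """Drop the filter of the preceding layer and the coefficient channel of the
    current layer that both correspond to the target data channel."""
    new_prev = [f for index, f in enumerate(prev_filters) if index != target]
    new_cur = [
        [w for index, w in enumerate(f) if index != target] for f in cur_filters
    ]
    return new_prev, new_cur


# Hypothetical network: 2 input channels -> 3 channels -> 2 output channels.
prev_filters = [[1, 2], [3, 4], [5, 6]]
cur_filters = [[7, 0, 8], [9, 0, 1]]  # coefficient channel 1 is all zeros
inputs = [1, 1]

before = layer_output(cur_filters, layer_output(prev_filters, inputs))
small_prev, small_cur = prune_channel(prev_filters, cur_filters, target=1)
after = layer_output(small_cur, layer_output(small_prev, inputs))
```

The pruned network stores fewer coefficients in both layers and never computes the data channel that would only have been multiplied by zeros.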
[0031] The method may also include executing the runtime implementation of the neural network on a data processing system.
[0032] The method may also include storing a set of filters from the previous layer that do not include filters corresponding to the target data channel in memory for subsequent use by the runtime implementation of the neural network.
[0033] The filter set of a layer may include a set of coefficients arranged such that each filter in the filter set includes multiple coefficients from the set of coefficients.
[0034] Each filter in the filter set of a layer can include multiple different coefficients.
[0035] Two or more filters in a layer's filter set may include the same number of coefficients.
[0036] The method may also include identifying the target coefficient channel based on a sparsity parameter that indicates the level of sparsity to be applied to the filter set of the layer.
[0037] The sparsity parameter can indicate the percentage of the set of coefficients to be set to zero.
[0038] Identifying a target coefficient channel may include applying a sparsity algorithm to set all coefficients included in the coefficient channel of the filter set of the layer to zero, and identifying the coefficient channel as the target coefficient channel of the filter set.
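One way to picture the identification step in [0038] is to scan the filter set, after a sparsity algorithm has been applied, for coefficient channels whose coefficients are zero in every filter. The sketch below assumes a nested-list layout `filters[filter][channel]` holding a flat list of coefficients. The function name and layout are illustrative assumptions.

```python
def find_target_channels(filters):
    """Return indices of coefficient channels that are zero in every filter."""
    num_channels = len(filters[0])
    return [
        channel
        for channel in range(num_channels)
        if all(all(c == 0 for c in f[channel]) for f in filters)
    ]


# Hypothetical filter set: 2 filters, 3 coefficient channels, 2 coefficients each.
filters = [
    [[1, 2], [0, 0], [3, 0]],
    [[0, 4], [0, 0], [0, 0]],
]
targets = find_target_channels(filters)
```

Channel 2 is zero in the second filter but not in the first, so only channel 1 qualifies as a target coefficient channel in this example.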
[0039] The method may further include, before identifying the target coefficient channel: operating a test implementation of the neural network on training input data using the filter set of the layer to form training output data; evaluating the accuracy of the test implementation of the neural network based on the training output data; and forming the sparsity parameter based on the accuracy of the neural network.
[0040] The method may further include identifying the target coefficient channel by iteratively performing the following operations: applying a sparsity algorithm to the coefficient channels of the filter set of the layer according to the sparsity parameter; operating a test implementation of the neural network on the training input data using the filter set of the layer to form training output data; evaluating the accuracy of the test implementation of the neural network based on the training output data; and forming an updated sparsity parameter based on the accuracy of the neural network.
[0041] The method may also include forming the sparsity parameter according to a parameter optimization technique configured to balance the level of sparsity to be applied to the filter set, as indicated by the sparsity parameter, against the accuracy of the network.
[0042] According to a fourth aspect of the present invention, a data processing system for training a neural network is provided, the neural network comprising multiple layers, each layer being configured to combine a corresponding set of filters with data inputs to the layer to form output data of the layer, wherein each set of filters comprises multiple coefficient channels, each coefficient channel of the filter set corresponding to a corresponding data channel in the data inputs to the layer, and the output data comprises multiple data channels, each data channel corresponding to a corresponding filter of the filter set, the data processing system comprising coefficient identification logic configured to: identify a target coefficient channel of the filter set; and identify a target data channel of the multiple data channels in the data inputs to the layer, the target data channel corresponding to a target coefficient channel of the filter set; and wherein the data processing system is arranged to configure a runtime implementation of the neural network, wherein the filter set of the previous layer does not include a filter corresponding to a target data channel.
[0043] According to a fifth aspect of the present invention, a method for training a computer-implemented neural network is provided, the neural network being configured to combine a set of coefficients with corresponding input data values, the method comprising: for training a test implementation of the neural network: applying sparsity to one or more coefficients in the set of coefficients according to a sparsity parameter indicating the level of sparsity to be applied to the set of coefficients; operating the test implementation of the neural network on training input data using the coefficients to form training output data; evaluating the accuracy of the neural network based on the training output data; updating the sparsity parameter based on the accuracy of the neural network; and configuring a runtime implementation of the neural network based on the updated sparsity parameter.
[0044] The method may also include iteratively performing the applying, operating, evaluating, and updating steps to train the test implementation of the neural network.
[0045] The method may also include iteratively updating the coefficient set based on the accuracy of the neural network.
[0046] The method may also include implementing a neural network based on updated sparsity parameters.
[0047] Applying sparsity to coefficients may include setting the coefficient to zero.
[0048] The accuracy of a neural network can be evaluated by comparing the training output data with the validation output data of the training input data.
[0049] The method may also include operating the test implementation of the neural network on the training input data using the coefficients before applying sparsity to one or more coefficients, in order to form the validation output data.
[0050] The method may also include using a cross-entropy loss equation that depends on the training output data and the validation output data to evaluate the accuracy of the neural network.
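The disclosure does not give the loss equation at this point, so the following is only a sketch using the standard definition of cross-entropy, H(p, q) = -Σ p_i log q_i. It assumes both outputs are probability vectors over the same categories, with the validation output (formed before sparsity is applied) acting as the reference p and the training output (formed after sparsity is applied) acting as q. The function name, the `eps` guard, and the example values are assumptions.

```python
import math


def cross_entropy(validation_probs, training_probs, eps=1e-12):
    """Standard cross-entropy between a reference and a predicted distribution."""
    return -sum(
        p * math.log(max(q, eps))
        for p, q in zip(validation_probs, training_probs)
    )


# Hypothetical outputs for one image over three categories.
validation_output = [0.7, 0.2, 0.1]  # before sparsity is applied
training_output = [0.5, 0.3, 0.2]    # after sparsity is applied

loss_unchanged = cross_entropy(validation_output, validation_output)
loss_sparse = cross_entropy(validation_output, training_output)
```

The loss is smallest when the sparsified network reproduces the validation output exactly and grows as the two outputs diverge, which is what makes it usable as an accuracy measure here.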
[0051] The method may also include updating the sparsity parameter according to a parameter optimization technique configured to balance the level of sparsity to be applied to the coefficient set, as indicated by the sparsity parameter, against the accuracy of the network.
[0052] Parameter optimization techniques can utilize the cross-entropy loss equation, which depends on the sparsity parameters and the accuracy of the neural network.
[0053] Updating the sparsity parameter can further depend on a weighting configured to bias the test implementation of the neural network toward either maintaining the accuracy of the network or increasing the level of sparsity applied to the coefficient set, as indicated by the sparsity parameter.
[0054] Updating the sparsity parameter can further depend on a defined maximum level of sparsity to be indicated by the sparsity parameter.
[0055] A neural network may include multiple layers, each configured to combine a corresponding set of coefficients with the corresponding input data values of the layer to form the output of the layer.
[0056] The method may also include iteratively updating the corresponding sparsity parameters for each layer.
[0057] The number of coefficients in the coefficient set of each layer of a neural network can vary between layers, and updating the sparsity parameters can further depend on the number of coefficients in each coefficient set, such that the test implementation of the neural network is biased toward updating the corresponding sparsity parameters to indicate a higher level of sparsity for coefficient sets that include more coefficients than for coefficient sets that include fewer coefficients.
[0058] The sparsity parameter can indicate the percentage of the coefficient set to which sparsity is to be applied.
[0059] Applying sparsity can include applying sparsity to groups of coefficients, each group consisting of a predefined set of coefficients.
[0060] Applying sparsity to a group of coefficients may include setting each coefficient in the group to zero.
[0061] Configuring a runtime implementation of a neural network may include: applying sparsity to groups of coefficients based on the updated sparsity parameter; compressing the groups of coefficients according to a compression scheme aligned with the groups of coefficients so that each group of coefficients is represented by an integer number of one or more compressed values; and storing the compressed groups of coefficients in memory for subsequent use by the implemented neural network.
[0062] Each group may include one or more subsets of coefficients from a set of coefficients. Each group may include n coefficients and each subset may include m coefficients, where m is greater than 1 and n is an integer multiple of m. The method may also include compressing the coefficient group according to a compression scheme by compressing the one or more subsets of coefficients included in each group so that each subset of coefficients is represented by an integer number of one or more compressed values.
[0063] Applying sparsity may include modeling the set of coefficients using a differentiable function to identify a threshold based on sparsity parameters, and applying sparsity based on said threshold such that the sparsity parameters can be updated by backpropagating one or more gradient vectors using the differentiable function to modify the threshold value.
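To make the idea in [0063] concrete, the sketch below assumes, purely for illustration, that the coefficients are modeled as zero-mean normally distributed values. Under that assumption the fraction of coefficients with magnitude below a threshold τ is 2Φ(τ/σ) − 1, so a sparsity parameter s maps to τ = σ·Φ⁻¹((1 + s)/2), and dτ/ds = σ / (2·φ(Φ⁻¹((1 + s)/2))) is available for backpropagation. The choice of distribution and the function names are assumptions and are not stated in this section.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def model_sigma(coefficients):
    """Standard deviation of the assumed zero-mean normal model."""
    return math.sqrt(sum(c * c for c in coefficients) / len(coefficients))


def threshold_from_sparsity(coefficients, sparsity):
    """Magnitude threshold below which a fraction `sparsity` of the model lies."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in the range [0, 1)")
    z = STANDARD_NORMAL.inv_cdf((1.0 + sparsity) / 2.0)
    return model_sigma(coefficients) * z


def threshold_gradient(coefficients, sparsity):
    """Derivative of the threshold with respect to the sparsity parameter."""
    z = STANDARD_NORMAL.inv_cdf((1.0 + sparsity) / 2.0)
    return model_sigma(coefficients) / (2.0 * STANDARD_NORMAL.pdf(z))


# Hypothetical coefficients with unit root-mean-square value.
example = [1.0, -1.0, 1.0, -1.0]
tau = threshold_from_sparsity(example, 0.5)
dtau_ds = threshold_gradient(example, 0.5)
```

Because the threshold is a smooth function of the sparsity parameter, a gradient arriving at the threshold can be passed on to the sparsity parameter by the chain rule.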
[0064] According to a sixth aspect of the invention, a data processing system for training a neural network is provided, the neural network being configured to combine a set of coefficients with corresponding input data values, the data processing system comprising: pruning logic configured to apply sparsity to one or more coefficients in the set of coefficients according to a sparsity parameter indicating the level of sparsity to be applied to the set of coefficients; a test implementation of the neural network configured to perform operations on training input data using the coefficients to form training output data; network accuracy logic configured to evaluate the accuracy of the neural network based on the training output data; and sparsity learning logic configured to update the sparsity parameter based on the accuracy of the neural network; and wherein the data processing system is arranged to configure the runtime implementation of the neural network according to the updated sparsity parameter.
[0065] A data processing system can be embodied in hardware on an integrated circuit. A method for manufacturing a data processing system at an integrated circuit manufacturing system can be provided. An integrated circuit definition dataset can be provided, which, when processed in the integrated circuit manufacturing system, configures the system to manufacture a data processing system. A non-transitory computer-readable storage medium can be provided, storing a computer-readable description of the data processing system, which, when processed in the integrated circuit manufacturing system, causes the integrated circuit manufacturing system to manufacture an integrated circuit embodying the data processing system.
[0066] An integrated circuit manufacturing system may be provided, comprising: a non-transitory computer-readable storage medium storing a computer-readable description of a data processing system thereon; a layout processing system configured to process the computer-readable description to generate a circuit layout description of an integrated circuit embodying the data processing system; and an integrated circuit generation system configured to manufacture the data processing system according to the circuit layout description.
[0067] Computer program code for performing any of the methods described herein may be provided. A non-transitory computer-readable storage medium may be provided storing computer-readable instructions that, when executed at a computer system, cause the computer system to perform any of the methods described herein.
[0068] As will be apparent to those skilled in the art, the above features can be appropriately combined, and can be combined with any aspect of the examples described herein.
Attached Figure Description
[0069] Examples will now be described in detail with reference to the accompanying drawings, in which:
[0070] Figure 1 shows an exemplary implementation of a neural network.
[0071] Figure 2a shows an example of the data structure used in a convolutional layer of a neural network.
[0072] Figure 2b schematically shows a convolutional layer arranged to combine a set of coefficients with input data to form output data.
[0073] Figure 3a shows the compression of an exemplary set of coefficients according to a compression scheme.
[0074] Figure 3b shows the compression of a sparse subset of coefficients according to the compression scheme.
[0075] Figure 4 shows a graphics processing system configured according to the principles described herein.
[0076] Figure 5 shows a data processing system for compressing a set of coefficients for subsequent use in a neural network, according to the principles described herein.
[0077] Figure 6 shows a method for compressing a set of coefficients for subsequent use in a neural network, according to the principles described herein.
[0078] Figure 7a shows exemplary pruning logic for applying unstructured sparsity.
[0079] Figure 7b shows exemplary pruning logic for applying structured sparsity.
[0080] Figure 7c shows exemplary pruning logic for applying unstructured sparsity.
[0081] Figure 7d shows exemplary pruning logic for applying structured sparsity.
[0082] Figure 8 is a schematic diagram illustrating an exemplary application of structured sparsity.
[0083] Figure 9 shows a data processing system implementing a test implementation of a neural network for learning a sparsity parameter through training, according to the principles described herein.
[0084] Figure 10 shows a method for learning a sparsity parameter by training a neural network, according to the principles described herein.
[0085] Figure 11a shows an exemplary application of channel pruning in a convolutional layer, according to the principles described herein.
[0086] Figure 11b shows an exemplary application of channel pruning in a fully connected layer, according to the principles described herein.
[0087] Figure 12 shows a method for training a neural network using channel pruning, according to the principles described herein.
[0088] Figure 13 shows a manufacturing system for generating integrated circuits that embody a graphics processing system.
[0089] Figure 14a shows an example of unstructured sparsity in a set of coefficients.
[0090] Figures 14b to 14d show examples of structured sparsity in a set of coefficients.
[0091] The accompanying drawings illustrate various examples. Those skilled in the art will understand that the element boundaries (e.g., boxes, groups of boxes, or other shapes) shown in the drawings represent one example of a boundary. In some examples, it may be that one element can be designed as multiple elements, or multiple elements can be designed as one element. Where appropriate, common reference numerals are used throughout the drawings to indicate similar features.
Detailed Implementation
[0092] The following description is given by way of example to enable those skilled in the art to make and use the invention. The invention is not limited to the embodiments described herein, and various modifications to the disclosed embodiments will be readily apparent to those skilled in the art.
[0093] The embodiments will now be described by way of example only.
[0094] Neural Networks
[0095] Figure 1 illustrates a data processing system 100 for implementing a neural network. The data processing system 100 may include hardware components (e.g., a hardware processing unit) and software components (e.g., firmware, and programs and tasks executed at the hardware processing unit). The data processing system 100 includes an accelerator 102 for performing the operations of the neural network. The accelerator 102 may be implemented in hardware, software, or any combination thereof. The accelerator may be referred to as a neural network accelerator (NNA). The accelerator includes multiple configurable resources that enable different types of neural networks to be implemented at the accelerator, such as convolutional neural networks, fully convolutional neural networks, recurrent neural networks, and multilayer perceptrons.
[0096] The implementation of a neural network is described herein with reference to the specific example data processing system shown in Figure 1, in which accelerator 102 includes multiple processing elements 114, each including a convolution engine. However, it should be understood that, unless otherwise stated, the principles described herein are generally applicable to any data processing system that includes an accelerator capable of performing the operations of a neural network.
[0097] The data processing system includes an input 101 for receiving data inputs to the data processing system. In image classification applications, the input to the neural network may include image data representing one or more images. For example, for an RGB image, the image data may be in the format x×y×3, where x and y are the pixel dimensions of the image across three color channels (i.e., R, G, and B). The input data may be referred to as tensor data. It should be understood that the principles described herein are not limited to use in image classification applications. For example, the principles described herein can be used in semantic image segmentation applications, object detection applications, super-resolution applications, speech recognition / speech-to-text applications, or any other suitable type of application. The input to the neural network also includes one or more sets of coefficients that will be combined with the input data. As used herein, the set of coefficients may also be referred to as weights.
[0098] In Figure 1, the accelerator includes an input buffer 106, multiple convolution engines 108, multiple accumulators 110, an accumulation buffer 112, and an output buffer 116. Each convolution engine 108, together with its corresponding accumulator 110 and its share of the resources of the accumulation buffer 112, represents a processing element 114. Figure 1 shows three processing elements, but in general any number of processing elements can be present. Each processing element receives a set of coefficients from coefficient buffer 130 and input values from input buffer 106. The coefficient buffer can be located at the accelerator, for example, on the same semiconductor die and / or in the same integrated circuit package. By combining the coefficient set and the input data, the processing elements can operate to perform the operations of a neural network.
[0099] Generally speaking, accelerator 102 can implement any suitable processing logic. For example, in some examples, the accelerator may include reduction logic (e.g., for implementing max pooling or average pooling operations), element-wise processing logic for performing mathematical operations on each element (e.g., adding two tensors together), or activation logic (e.g., for applying activation functions such as the sigmoid function or the step function). For simplicity, such units are not shown in Figure 1.
[0100] The processing elements of an accelerator are independent processing subsystems that can operate in parallel. Each processing element 114 includes a convolution engine 108 configured to perform a convolution operation between a set of coefficients and input values. Each convolution engine 108 may include multiple multipliers, each configured to multiply a coefficient by its corresponding input data value to produce a multiplication output value. The multipliers may be followed by, for example, an adder tree arranged to compute the sum of the multiplication outputs. In some examples, these multiply-add computations may be pipelined.
[0101] Neural networks are typically described as comprising many “layers.” At each “layer” of a neural network, a set of coefficients can be combined with a corresponding set of input data values. A large number of operations must usually be performed at the accelerator to execute the operations for each “layer” of the neural network. This is because the input data and the set of coefficients are often very large. Since multiple passes of the convolution engine may be used to generate the complete output of a convolution operation (e.g., because the convolution engine may only receive and process a portion of the set of coefficients and input data values), the accelerator may include multiple accumulators 110. Each accumulator 110 receives the output of the convolution engine 108 and adds that output to the previous convolution engine output associated with the same operation. Depending on the accelerator implementation, the convolution engine may not process the same operation in consecutive cycles, so an accumulation buffer 112 may be provided to store the partially accumulated output of a given operation. Appropriate partial results can be provided to the accumulators by the accumulation buffer 112 in each cycle.
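The division of labor described in [0100] and [0101] can be sketched in software as follows. This is an illustration only: `engine_width` is a hypothetical limit on how many coefficient and input pairs a convolution engine handles in one pass, and a plain variable stands in for the accumulator and accumulation buffer.

```python
def convolution_engine_pass(coefficients, inputs):
    """One pass: multiply each coefficient by its input value, then sum the products."""
    products = [c * x for c, x in zip(coefficients, inputs)]  # the multipliers
    return sum(products)                                      # the adder tree


def accumulate_in_passes(coefficients, inputs, engine_width):
    """Build a full result from several partial passes of the engine."""
    accumulator = 0
    for start in range(0, len(coefficients), engine_width):
        stop = start + engine_width
        # Add this pass to the partial result held for the same operation.
        accumulator += convolution_engine_pass(
            coefficients[start:stop], inputs[start:stop]
        )
    return accumulator


# Hypothetical operation of 6 multiply-adds on an engine that handles 4 per pass.
coefficients = [1, 2, 3, 4, 5, 6]
inputs = [6, 5, 4, 3, 2, 1]
result = accumulate_in_passes(coefficients, inputs, engine_width=4)
```

A coefficient that has been set to zero contributes nothing to the sum, which is why a sparse coefficient set reduces the useful work the multipliers have to perform.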
[0102] The accelerator 102 of Figure 1 can be used to implement a "convolutional layer". The data input to the convolutional layer can have dimensions B×C_in×H_in×W_in. For example, as shown in Figure 2a, the data input to the convolutional layer can be arranged as C_in channels of data, where each channel has spatial dimensions H_in×W_in, H_in and W_in being the height dimension and the width dimension, respectively. In Figure 2a, the input data is shown as including four data channels (i.e., C_in = 4). The data input to the convolutional layer can also be defined by a batch size B. The batch size B is not shown in Figure 2a; it defines the number of data sets input to the convolutional layer. For example, in image classification applications, batch size can refer to the number of individual images in the input data.
[0103] The neural network may include J layers, each configured to combine a coefficient set with its data input. Each of these J layers can be associated with a coefficient set w_j, where, as used herein, j is the index of each of the J layers. In other words, w_1, ..., w_J denote the coefficient sets of the J layers. Generally, the number and values of the coefficients in a coefficient set can vary between layers: the number of coefficients in the first layer can be denoted n_1, the number in the second layer n_2, and the number in the J-th layer n_J.
[0104] Generally, the coefficient set used for a layer can be in any suitable format. For example, the coefficient set can be represented by a p-dimensional tensor, where p ≥ 1, or in any other suitable format. Herein, the format of each coefficient set is described with reference to a set of dimensions: the number of input channels C_in, the number of output channels C_out, a height dimension H and a width dimension W. It should be understood, however, that the format of a coefficient set can be defined in any other suitable way.
[0105] The coefficient set used to perform a convolution operation on input data having the format shown in Figure 2a can have dimensions C_out×C_in×H×W. The C_in dimension of the coefficient set is not shown in Figure 2a, but typically the number of coefficient channels in a coefficient set corresponds to (e.g., equals) the number of data channels in the input data with which the coefficient set is to be combined (e.g., in the example shown in Figure 2a, C_in = 4). The C_out dimension is also not shown in Figure 2a; it refers to the number of channels in the output when the coefficient set is combined with the input data. The dimensions of the coefficient sets used by a neural network can vary greatly. As a non-limiting example only, the coefficient set used in a convolutional layer can have dimensions such as 64×3×3×3, 512×512×3×3, or 64×3×11×11.
[0106] In a convolutional layer, the coefficient set can be combined with the input data by convolving it across the input data in a number of steps in the s and t directions, as shown in Figures 2a and 2b. That is, in a convolutional layer, the input data is processed by convolving it with the coefficient set associated with that layer. For example, Figure 2b schematically illustrates a convolutional layer 200 arranged to combine a coefficient set 204 with input data 202 to form output data 206. The data output from the convolutional layer can have dimensions B×C_out×H_out×W_out. In other words, the output data can be arranged as C_out channels of data, where each channel has spatial dimensions H_out×W_out, H_out and W_out being the height dimension and the width dimension, respectively. The data output by the convolutional layer can also be defined by the batch size B. In this example, the coefficient set 204 includes four filters, each including multiple coefficients from the coefficient set. Each filter may include a unique set and / or arrangement of coefficients from the coefficient set, or two or more filters may be identical to each other. The input data 202 has three data channels. Each filter includes three coefficient channels, corresponding to the three data channels in the input data 202 (i.e., C_in = 3). That is, the number of coefficient channels in each filter of the layer's coefficient set can correspond to the number of data channels in the layer's data input. The output data 206 has four channels (i.e., C_out = 4). That is, the number of filters in the layer's coefficient set can correspond to the number of data channels in the output data. In Figure 2b, H_out = H_in and W_out = W_in. However, it should be understood that this need not be the case; for example, H_out may differ from H_in and / or W_out may differ from W_in.
[0107] Input data 202 can be combined with coefficient set 204 by convolving each filter in the coefficient set with the input data, wherein the first coefficient channel of each filter is convolved with the first data channel of the input data, the second coefficient channel of each filter is convolved with the second data channel of the input data, and the third coefficient channel of each filter is convolved with the third data channel of the input data. For each filter, the results of the convolution operations for each input channel can be summed (e.g., accumulated) to form the output data values of the corresponding output channel. It should be understood that the coefficient set need not be arranged as the set of filters shown in Figure 2b, and can in fact be arranged in any other suitable manner.
[0108] There are many other types of neural network "layers" configured to combine a coefficient set with the data input to that layer. Another example of such a neural network layer is a fully connected layer. The coefficient set used to perform a fully connected operation can have dimensions C_out×C_in. A fully connected layer can perform a matrix multiplication between the coefficient set and the input tensor. Fully connected layers are commonly used in recurrent neural networks and multilayer perceptrons. A convolution engine (e.g., one or more of the convolution engines 108 shown in Figure 1) can be used to implement a fully connected layer. Other examples of neural network layers configured to combine coefficient sets with the data inputs to those layers include variations of convolutional layers, such as depthwise convolutional layers, dilated convolutional layers, grouped convolutional layers, and transposed convolutional (deconvolutional) layers. Neural networks can include combinations of different layers. For example, a neural network can include one or more convolutional layers (e.g., for extracting features from an image), followed by one or more fully connected layers (e.g., for providing predictions based on the extracted features).
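The channel-by-channel combination described above can be illustrated with a naive software convolution. This is a minimal sketch under assumptions that are not taken from the original text: a stride of one, no padding (so the output is smaller than the input, unlike the example of Figure 2b), no batch dimension, and no flipping of the filter (as is usual in neural network software). The function name `conv2d` is hypothetical.

```python
def conv2d(inputs, filters):
    """inputs: [C_in][H_in][W_in]; filters: [C_out][C_in][H][W].

    Returns output data laid out as [C_out][H_out][W_out].
    """
    c_in, h_in, w_in = len(inputs), len(inputs[0]), len(inputs[0][0])
    kh, kw = len(filters[0][0]), len(filters[0][0][0])
    h_out, w_out = h_in - kh + 1, w_in - kw + 1  # assumed: stride 1, no padding
    output = []
    for filt in filters:                 # each filter forms one output channel
        channel = [[0] * w_out for _ in range(h_out)]
        for c in range(c_in):            # coefficient channel c meets data channel c
            for y in range(h_out):
                for x in range(w_out):
                    for i in range(kh):
                        for j in range(kw):
                            # Results for every input channel are summed together.
                            channel[y][x] += filt[c][i][j] * inputs[c][y + i][x + j]
        output.append(channel)
    return output
```

With three input channels and four filters, as in the example of Figure 2b, the result has four output channels.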
[0109] For the first layer of a neural network, "input data" can be considered the initial input to the network. The first layer processes the input data and generates a first set of intermediate data that is passed to the second layer. This first set of intermediate data can be considered as input data for the second layer, which processes it to produce output data in the form of second intermediate data. In the case of a neural network containing a third layer, the third layer receives the second intermediate data as input data and processes it to produce third intermediate data as output data. Therefore, the reference to input data herein can be interpreted to include references to the input data of any layer. For example, the term input data can refer to intermediate data that is the output of a particular layer and the input to subsequent layers. This process is repeated until the final layer produces output data that can be considered the output of the neural network.
[0110] Returning to Figure 1, accelerator 102 may include an input buffer 106 and a coefficient buffer 130. The input buffer is arranged to store input data required by the accelerator (e.g., by a convolution engine). The coefficient buffer is arranged to store the coefficient sets required by the accelerator (e.g., by a convolution engine) for combination with the input data according to the operations of the neural network. The input buffer may include some or all of the input data associated with one or more operations performed at the accelerator in a given cycle. The coefficient buffer may include some or all of the coefficient set associated with one or more operations processed at the accelerator in a given cycle. The various buffers of the accelerator shown in Figure 1 can be implemented in any suitable manner, for example, as any number of data stores local to the accelerator (e.g., on the same semiconductor die and / or located within the same integrated circuit package) or accessible to the accelerator via a data bus or other interconnect.
[0111] Memory 104 may be accelerator-accessible; for example, it may be system memory accessible to the accelerator via a data bus. On-chip memory 128 may be provided for storing coefficient sets and / or other data (such as input data, output data, etc.). The on-chip memory may be local to the accelerator, allowing data stored in the on-chip memory to be accessed by the accelerator without consuming the memory bandwidth of memory 104 (e.g., system memory accessible via a system bus). Data (e.g., coefficient sets, input data) may be periodically written from memory 104 to the on-chip memory. A coefficient buffer 130 at the accelerator may be configured to receive coefficient data from on-chip memory 128 to reduce bandwidth between the memory and the coefficient buffer. An input buffer 106 may be configured to receive input data from on-chip memory 128 to reduce bandwidth between the memory and the input buffer. Memory may be coupled to the input buffer and / or the on-chip memory to provide input data to the accelerator.
[0112] The coefficient set received at input 101 can be in a compressed format, such as a data format with a reduced memory footprint. That is, the coefficient set can be compressed so that it is represented by an integer number of one or more compressed values before being input to input 101 of data processing system 100, as will be described in further detail herein. For this purpose, data processing system 100 may include a decompression engine 132. Decompression engine 132 can be configured to decompress any compressed coefficient set read from coefficient buffer 130 into convolution engine 108. Alternatively or additionally, the input data received at input 101 can be in a compressed format. In this example, data processing system 100 may include a decompression engine (not shown in Figure 1) located between the input buffer 106 and the convolution engine 108 and configured to decompress any compressed input data read from the input buffer 106 into the convolution engine 108.
[0113] Accumulation buffer 112 can be coupled to output buffer 116 to allow the output buffer to receive intermediate output data of the neural network operation performed at the accelerator, as well as the output data of the final operation (i.e., the last operation of the network performed at the accelerator). Output buffer 116 can be coupled to on-chip memory 128 to provide intermediate output data and the output data of the final operation to on-chip memory 128.
[0114] Typically, large amounts of data need to be transferred from memory to processing elements. If this transfer cannot be performed efficiently, it can lead to high memory bandwidth requirements and high power consumption for providing input data and coefficient sets to the processing elements. This is especially true when the memory is "off-chip" memory, i.e., implemented in a different integrated circuit or semiconductor die than the processing elements. One such example is the system memory of an accelerator accessible via a data bus. To reduce the memory bandwidth requirements when an accelerator executes neural networks, it is advantageous to provide on-chip memory at the accelerator, where at least some of the coefficient sets and / or input data required to implement the neural network can be stored. Such memory can be "on-chip" (e.g., on-chip memory 128) when it is located on the same semiconductor die and / or in the same integrated circuit package.
[0115] Various exemplary connections are shown in the example of Figure 1; however, in some implementations, some or all of them may be provided by one or more shared data bus connections. It should also be understood that other connections may be provided as alternatives or in addition to the connections shown in Figure 1. For example, output buffer 116 may be coupled to memory 104 to provide output data directly to memory 104. Similarly, in some examples, not all of the connections shown in Figure 1 are necessary. For example, memory 104 does not need to be coupled to input buffer 106, which can obtain input data directly from an input data source, such as a camera subsystem configured to capture images at a device including the data processing system.
[0116] As described herein, in image classification applications, image data representing one or more images can be input into a neural network, and the output of the neural network can be data indicating the probability (or set of probabilities) that each of these images belongs to a specific category (or set of categories). In image classification applications, in each of the multiple layers of the neural network, a coefficient set is combined with the data input of that layer to identify characteristic features of the input data. Neural networks are typically trained using training data to improve the accuracy of their outputs. In the image classification example, training data may include data indicating one or more images and a corresponding pre-determined label for each of these images. Training the neural network may include operating the neural network on the training input data using an untrained or partially trained coefficient set to form training output data. The accuracy of the training output data can be evaluated, for example, using a loss function. The coefficient set can be updated based on the accuracy of the training output data through processes called gradient descent and backpropagation. For example, the coefficient set can be updated based on the loss determined using the loss function. Backpropagation can be considered as the process of calculating the gradient of the loss function with respect to each coefficient. This can be achieved by using the chain rule, starting from the final output of the loss function and working backwards through the coefficients of each layer. Once all gradients are known, gradient descent (or a variant thereof) can be used to update each coefficient based on its gradient as computed via backpropagation. Gradient descent can be performed according to a learning rate parameter, which indicates the extent to which the coefficients can change based on the gradient in each iteration of the training process. These steps can be repeated to iteratively update the coefficient set.
[0117] The coefficient set used within a typical neural network can be highly parameterized. That is, the coefficient set used within a typical neural network typically includes a large number of non-zero coefficients. A highly parameterized coefficient set of a neural network can have a large memory footprint. When the coefficient set is stored in memory (e.g., memory 104 or on-chip memory 128) rather than in a local cache, a significant amount of memory bandwidth may be required at runtime to read the highly parameterized coefficient set (e.g., 50% of the memory bandwidth in some examples). The time spent reading the highly parameterized coefficient set from memory can also increase the time it takes for the neural network to provide an output for a given input, thus increasing the latency of the neural network. A highly parameterized coefficient set can also impose significant computational demands on the processing elements 114 of accelerator 102, for example, by causing the processing elements to perform a large number of multiplication operations between coefficients and their corresponding data values.
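The training loop described in paragraph [0116] can be illustrated on the smallest possible case. This is a minimal sketch under assumptions that are not taken from the original text: a "network" with a single coefficient `w`, a prediction of `w * x`, and a squared-error loss. All function and parameter names are hypothetical.

```python
def train_step(w, x, target, learning_rate):
    """One iteration: forward pass, loss, gradient via the chain rule, update."""
    prediction = w * x                       # forward pass with the current coefficient
    loss = (prediction - target) ** 2        # loss function (assumed: squared error)
    # Chain rule: dL/dw = dL/dprediction * dprediction/dw
    gradient = 2.0 * (prediction - target) * x
    return w - learning_rate * gradient, loss  # gradient descent update


def train(w, x, target, learning_rate, steps):
    """Repeat the update to iteratively refine the coefficient."""
    for _ in range(steps):
        w, _loss = train_step(w, x, target, learning_rate)
    return w
```

Here the learning rate scales how far the coefficient moves in each iteration, which is the role the paragraph above gives to the learning rate parameter.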
[0118] Data processing system
[0119] Figure 4 illustrates a data processing system, based on the principles described herein, for addressing one or more of the problems identified above.
[0120] The data processing system 410 shown in Figure 4 includes a memory 104 and a processor 400. In one example, the processor 400 includes a software implementation of a neural network 102-1. The software implementation of the neural network 102-1 may have the same properties as the accelerator 102 described with reference to Figure 1. In another example, the data processing system 410 includes a hardware implementation of a neural network 102-2. The hardware implementation of the neural network 102-2 may have the same properties as the accelerator 102 described with reference to Figure 1. In some examples, the data processing system may include a neural network accelerator implemented in a combination of hardware and software.
[0121] The processor 400 shown in Figure 4 also includes pruning logic 402, compression logic 404, sparsity learning logic 406, network accuracy logic 408, and coefficient identification logic 412. Each of logics 402, 404, 406, 408, and 412 can be implemented in fixed-function hardware, software running on general-purpose hardware within the processor 400, or any combination thereof. The functionality of each of logics 402, 404, 406, 408, and 412 will be described in more detail herein. In some examples (not shown in Figure 4), one or more of pruning logic 402, compression logic 404, sparsity learning logic 406, network accuracy logic 408, and coefficient identification logic 412 may alternatively or additionally be implemented as logic units within the hardware implementation of the neural network 102-2.
[0122] Memory 104 can be system memory accessible by processor 400 and / or the hardware implementation of neural network 102-2 via a data bus. Alternatively, memory 104 can be on-chip memory local to processor 400 and / or the hardware implementation of neural network 102-2. Memory 104 can store coefficient sets to be operated on by processor 400 and / or the hardware implementation of neural network 102-2, and / or coefficient sets that have already been operated on and output by processor 400 and / or the hardware implementation of neural network 102-2.
[0123] Coefficient compression
[0124] One way to reduce the memory footprint of coefficient sets, and thus reduce the bandwidth required to read coefficient data from memory at runtime, is to compress the coefficient sets. That is, each coefficient set can be compressed so that it is represented by an integer number of one or more compressed data values. This compression can be performed by the compression logic 404 shown in Figure 4. An uncompressed coefficient set stored in memory 104 can be input to the compression logic 404 for compression. The compression logic 404 can output the compressed coefficient set to memory 104.
[0125] The coefficient set can be compressed at compression logic 404 according to a compression scheme. An example of such a compression scheme is the Single Prefix Grouped Coding 8 (SPGC8) compression scheme. It should be understood that many other suitable compression schemes exist, and the principles described herein are not limited to the application of the SPGC8 compression scheme. The SPGC8 compression scheme is fully described in UK patent application GB2579399 (although it is not identified there by the name SPGC8).
[0126] Figure 3a shows the compression of an exemplary coefficient set according to a compression scheme. The compression scheme can be the SPGC8 compression scheme, but the principles described herein can be applied to other compression schemes. Figure 3a shows a coefficient set 300 represented by a 16×16 tensor of coefficients. The coefficient set 300 can be all or part of a two-dimensional coefficient tensor as shown, or a plane of a p-dimensional coefficient tensor, where p ≥ 3. As described herein, the coefficient set can include any number of coefficients and can be in any suitable format.
[0127] Multiple subsets of a coefficient set can be compressed to compress the coefficient set. Each subset of coefficients includes multiple coefficients. For example, a subset of coefficients may include eight coefficients. The coefficients in a subset may be consecutive within the coefficient set. For example, a subset of coefficients is shown in the shaded area covering the coefficient set 300. This subset of coefficients includes eight consecutive coefficients arranged in a single row (e.g., a subset of coefficients with dimension 1×8). More generally, a subset of coefficients can have any dimension, such as 2×2, 4×4, etc. In the example where the coefficient set is a p-dimensional (p≥1) tensor, a subset of coefficients may also be a p-dimensional tensor where p≥1.
[0128] Each coefficient can be an integer. For example, an exemplary 1×8 subset 302 of coefficients includes coefficients 31, 3, 1, 5, 3, 4, 5, 6. Each coefficient can be encoded as a binary number. Each coefficient in the subset shown in Figure 3a is a positive (e.g., unsigned) binary number. In one example, each coefficient could be encoded as a 16-bit binary number, as shown at 304, but more or fewer bits can be chosen. With sixteen bits provided to encode each coefficient, coefficient values up to 65,535 can be encoded. Therefore, in this example, 128 bits are needed to encode a subset of eight coefficients, as shown at 304. However, 16 bits are typically not needed to encode each coefficient. That is, most coefficients in the coefficient set have values lower than, or even significantly lower than, the maximum encodable value.
[0129] If any coefficient in the coefficient set is negative, the coefficient set can first be transformed so that all coefficient values are positive (e.g., unsigned). For example, negative coefficients can be transformed into odd values, while positive coefficients can be transformed into even values in an unsigned representation. This transformed coefficient set can then be used as input to the SPGC8 compression scheme.
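One mapping with the property described above (negative values become odd, non-negative values become even) can be sketched as follows. The original text does not specify the exact transform, so the particular formula used here is an assumption, and the function names are hypothetical.

```python
def to_unsigned(value):
    """Map a signed integer to an unsigned one: non-negative -> even, negative -> odd."""
    return 2 * value if value >= 0 else -2 * value - 1


def to_signed(code):
    """Invert to_unsigned, recovering the original signed coefficient."""
    return code // 2 if code % 2 == 0 else -(code + 1) // 2
```

A mapping of this kind keeps small magnitudes small after the transform, so a coefficient such as -1 still needs only one bit rather than a full-width signed encoding.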
[0130] According to the SPGC8 compression scheme, the number of bits sufficient to encode the largest coefficient value in the coefficient subset is identified. That number of bits is then used to encode each coefficient in the subset. The header data associated with the coefficient subset indicates the number of bits that has been used to encode each coefficient in the subset.
[0131] For example, as shown at 306, a compressed subset of coefficients can be represented by header data and multiple body portions (V0-V7). In the subset of coefficients 302, the largest coefficient value is 31, which can be encoded using 5 bits of data. In this example, the header data indicates that 5 bits will be used to encode each coefficient in the subset of coefficients. The header data itself has a bit cost, for example, 3 bits, while each body portion uses 5 bits to encode a coefficient value. For example, the number of bits used for the header data can be the minimum number of bits required to encode the number of bits in each body portion (e.g., in the example shown in Figure 3a, 3 bits can be used to encode the value 5 in binary). In this example, the subset of coefficients 302 can therefore be encoded in compressed form using 43 bits of data (3 header bits plus 8 × 5 body bits), as shown at 308, rather than the 128 bits used in uncompressed form, as shown at 304.
[0132] In other words, to compress a subset of coefficients, header data comprising h bits is generated, and multiple body portions are generated, each comprising b bits. Each body portion corresponds to a coefficient in the subset. The value of b is fixed within the subset, and the header data of the subset includes an indication of b for the body portions of the subset. The body portion size b is identified by locating the bit position of the most significant leading one among all coefficients in the uncompressed subset. Header data is generated to include a bit sequence encoding the body portion size, and a body portion comprising b bits is generated for each coefficient in the subset by removing zero, one or more leading zeros from each coefficient in the uncompressed subset.
[0133] In some examples, two adjacent subsets of coefficients can be interleaved during compression according to the SPGC8 compression scheme. For example, a first subset of eight coefficients may include coefficients V0, V1, V2, V3, V4, V5, V6, and V7. An adjacent second subset of eight coefficients may include coefficients V8, V9, V10, V11, V12, V13, V14, and V15. When compressing the first and second coefficient subsets according to an interleaved compression scheme, the first compressed coefficient subset may include an integer number of compressed values representing coefficients V0, V2, V4, V6, V8, V10, V12, and V14. The second compressed coefficient subset may include an integer number of compressed values representing coefficients V1, V3, V5, V7, V9, V11, V13, and V15.
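The header-plus-body encoding described above can be sketched for a single subset. This is an illustration of the principle only, not the SPGC8 bitstream format, which is defined in the referenced application and not reproduced here. The function name `compress_subset`, the use of a text bit string, and the choice of the minimum header width are assumptions; a practical format would likely fix the header width so that a decoder can parse it.

```python
def compress_subset(coefficients):
    """Return (h, b, bits) for one subset of unsigned integer coefficients.

    b is the body portion size shared by the whole subset and h is the number
    of header bits used to record b.
    """
    # b: bit position of the most significant leading one across the subset.
    body_bits = max(1, max(c.bit_length() for c in coefficients))
    # h: assumed here to be the minimum number of bits able to represent b.
    header_bits = body_bits.bit_length()
    header = format(body_bits, "0{}b".format(header_bits))
    # Each body portion keeps the low b bits, i.e. leading zeros are removed.
    bodies = "".join(format(c, "0{}b".format(body_bits)) for c in coefficients)
    return header_bits, body_bits, header + bodies
```

For the exemplary subset 31, 3, 1, 5, 3, 4, 5, 6 this gives b = 5 and h = 3, for a total of 3 + 8 × 5 = 43 bits, matching the figure quoted in paragraph [0131].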
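The interleaving of two adjacent subsets can be sketched as a simple regrouping by even and odd position. The function name `interleave_subsets` is hypothetical, and the sketch shows only which coefficients end up in which subset, not how each subset is then encoded.

```python
def interleave_subsets(first, second):
    """Regroup two adjacent 8-coefficient subsets (V0..V7 and V8..V15)."""
    combined = list(first) + list(second)      # V0 .. V15 in their original order
    even_positions = combined[0::2]            # V0, V2, ..., V14
    odd_positions = combined[1::2]             # V1, V3, ..., V15
    return even_positions, odd_positions
```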
[0134] Unstructured sparsity
[0135] The set of coefficients used by a neural network may include one or more coefficient values of zero. A set of coefficients that includes a large number of zero coefficients can be described as sparse. As described herein, a neural network comprises multiple layers, each configured to combine the set of coefficients with the input data values of that layer, for example, by multiplying each coefficient in the set with the corresponding input data value. Therefore, for a sparse set of coefficients, a large number of operations in the layers of the neural network can produce zero outputs.
[0136] Sparsity can be artificially inserted into a set of coefficients. That is, sparsity can be applied to one or more coefficients in the set. Applying sparsity to coefficients includes setting the coefficients to zero. This can be achieved by applying a sparsity algorithm to the coefficients in the set. Figure 4 The pruning logic 402 shown can be configured to apply sparsity to one or more coefficients in a coefficient set. In one example, pruning logic 402 can apply sparsity to the coefficient set by performing a process called magnitude-based pruning. The trained coefficient set typically includes many coefficient values that are close to (or even very close to) zero but not zero. Magnitude-based pruning involves applying sparsity to a percentage, fraction, or portion of the coefficients in the coefficient set that are closest to zero. The proportion of coefficients to be set to zero is determined by a sparsity parameter that indicates the level of sparsity to be applied to the coefficient set. The result of magnitude-based pruning can increase the level of sparsity in the coefficient set, but this typically does not significantly affect the accuracy of the network because the coefficients already set to zero are the lower-valued (and therefore often the least significant) coefficients. Figure 14a An example is shown of a set of coefficients for which sparsity has been applied, for instance, through processes such as magnitude-based pruning. Figure 14a In the diagram, sparse coefficients are indicated using shading. Figure 14a This is an example of unstructured sparsity. Coefficient values with low magnitudes (i.e., magnitudes close to zero) can be randomly (e.g., in an unstructured manner) distributed within a set of coefficients. Therefore, for this reason, sparsity produced by methods such as magnitude-based pruning can be described as unstructured.
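Magnitude-based pruning as described above can be sketched as follows. This is a minimal illustration on a flat list of coefficients, assuming a sparsity parameter already expressed as a fraction between 0 and 1; the function name `magnitude_prune`, the rounding of the count, and the tie-breaking rule are assumptions rather than details from the original text.

```python
def magnitude_prune(coefficients, sparsity):
    """Set to zero the fraction `sparsity` of coefficients closest to zero."""
    count = int(round(sparsity * len(coefficients)))  # how many coefficients to zero
    # Positions ordered by magnitude, smallest first (ties keep their original order).
    order = sorted(range(len(coefficients)), key=lambda i: abs(coefficients[i]))
    pruned = list(coefficients)
    for i in order[:count]:
        pruned[i] = 0
    return pruned
```

The positions that end up zero depend only on the coefficient values, not on where they sit in the set, which is why the resulting sparsity is unstructured.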
[0137] Value-based pruning is just one example of the process of applying sparsity to a set of coefficients. Many other methods can be used to apply sparsity to a set of coefficients. For example, pruning logic 402 can be configured to randomly select a percentage, fraction, or portion of the coefficients in the set to which sparsity is to be applied.
[0138] As described in this paper, for sparse coefficient sets, a large number of operations in the layers of a neural network can produce zero outputs. Therefore, neural networks can be configured to skip (i.e., not perform) "multiply by zero" operations (e.g., operations involving multiplying input data values by zero coefficient values). Thus, in this way, and by artificially inserting sparsity into the coefficient set, the computational requirements on the neural network can be reduced by requiring fewer multiplications (e.g., Figure 1 The processing element 114 of the accelerator 102 shown.
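The saving from skipping "multiply by zero" operations can be made concrete with a small sketch that counts the multiplications actually performed. The function name `sparse_dot` is hypothetical, and the sketch is a software illustration rather than a description of how a processing element 114 is built.

```python
def sparse_dot(coefficients, inputs):
    """Return (result, multiplications_performed), skipping zero coefficients."""
    total = 0
    multiplications = 0
    for c, x in zip(coefficients, inputs):
        if c == 0:
            continue  # the product would be zero, so the multiplication is skipped
        total += c * x
        multiplications += 1
    return total, multiplications
```

With half of the coefficients set to zero, half of the multiplications are avoided while the result is unchanged.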
[0139] Figure 7a Exemplary pruning logic for applying unstructured sparsity is shown. In some examples, Figure 4 The pruning logic 402 shown in the figure has a reference Figure 7aThe description refers to the attributes of pruning logic 402a. It should be understood that... Figure 7a The pruning logic 402a shown is merely one example of logic configured to apply sparsity to the coefficient set. Other forms of logic can be used to apply sparsity to the coefficient set.
[0140] The inputs to the pruning logic 402a include w j 502 represents the set of coefficients for the j-th layer of the neural network. As described herein, this set of coefficients for the layer can be in any suitable format. For example, the set of coefficients can be represented by a p-dimensional coefficient tensor, where p ≥ 1, or in any other suitable format.
[0141] The inputs to the pruning logic 402a also include s j 504 represents the sparsity parameter of the j-th layer of the neural network. In other words, The sparsity parameter s of layer J represents the sparsity of layer J. j The sparsity parameter can indicate the parameter to be applied to the coefficient set w by the pruning logic 402a. j The sparsity level. For example, a sparsity parameter can indicate the percentage, fraction, or portion of the set of coefficients to which sparsity is to be applied by the pruning logic 402a. Sparsity parameter s j The sparsity parameter s can be set (e.g., somewhat arbitrarily by the user) based on the assumption of how much sparsity can be introduced into the coefficient set without significantly affecting the accuracy of the neural network. In other examples, as described further in this paper, the sparsity parameter s j It can be considered part of the training process of a neural network.
[0142] sparsity parameter s j It can be provided in any suitable form. For example, the sparsity parameter can be a decimal number in the range 0 to 1 (inclusive), representing the percentage of the coefficient set to which sparsity is to be applied. For example, a sparsity parameter s of 0.4. j It can be indicated that sparsity is applied to the coefficient set w j The coefficient is 40%.
[0143] In other examples, the sparsity parameter can be provided as a number in any suitable range (e.g., between -5 and 5). In these examples, the pruning logic 402a may include normalization logic 704, which is configured to normalize the sparsity parameter to a range between 0 and 1. An exemplary way to implement this normalization is to use a sigmoid function, such as... For example, a sigmoid function can transform the minimum y value, which is close to 0 when x is -5, to the maximum y value, which is close to 1 when x is 5. In this way, a sigmoid function can be used to transform input sparsity parameters in the range -5 to 5 into normalized sparsity parameters in the range 0 to 1. In one example, normalized logic 704 can use a sigmoid function. In order to normalize the sparsity parameter s j The output of the normalized logic 704 can be a normalized sparsity parameter. It should be understood that normalization logic can use other functions, such as hard-sigmoid(), which achieves the same normalization by applying different sets of mathematical operations to the input sparse parameters. For the purposes of the exemplary equations provided herein, the sparse parameters in the range of 0 to 1 (as provided, or after being normalized by the normalization function) will be... express.
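The normalization step can be sketched with the standard logistic sigmoid, y = 1 / (1 + e^(-x)). The exact function used by normalization logic 704 is not reproduced in the text above, so the choice of the logistic form here is an assumption, and the function name `normalize_sparsity` is hypothetical.

```python
import math


def normalize_sparsity(s):
    """Map an unbounded sparsity parameter to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-s))
```

Under this assumption an input of -5 maps to a value close to 0, an input of 0 maps to 0.5, and an input of 5 maps to a value close to 1.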
[0144] As described herein, each coefficient in the coefficient set can be an integer. In some examples, the coefficient set may include one or more positive integer coefficients and one or more negative integer coefficients. In these examples, the pruning logic 402a may include logic 700, which is configured to determine the absolute value of each coefficient in the coefficient set w_j. In this way, each value in the coefficient set at the output of unit 700 is a non-negative integer value.
[0145] The pruning logic 402a shown in Figure 7a includes quantile logic 706, which is configured to determine a threshold τ according to the sparsity parameter, using the set of absolute coefficient values. For example, the sparsity parameter could indicate the percentage of sparsity to be applied to the coefficient set, such as 40%. In this example, quantile logic 706 would determine a threshold below which 40% of the absolute coefficient values lie. In this example, the quantile logic could be described as using a non-differentiable quantile method. That is, the quantile logic 706 shown in Figure 7a does not attempt to model the set of coefficients using a function, but rather sorts the absolute coefficient values empirically (e.g., in ascending or descending order) and sets the threshold at the appropriate value. For example, the quantile logic 706 can determine the threshold τ according to equation (1).
[0146]
[0147] The pruning logic 402a includes subtraction logic 708, which is configured to subtract the threshold determined by quantile logic 706 from each determined absolute coefficient value. In Figures 7a to 7d, for each subtraction logic (e.g., subtraction logic 708 in Figure 7a), the "minus" sign on one of the inputs indicates that this input is subtracted from the other input, which is marked with a "plus" sign. Therefore, any absolute coefficient value less than the threshold will be represented by a negative number, and any absolute coefficient value greater than the threshold will be represented by a positive number. In this way, the pruning logic 402a has identified the least significant coefficients (e.g., the coefficients least important to the set of coefficients). In this example, the least significant coefficients are those coefficients with absolute values below the threshold. In other words, the pruning logic has identified the required percentage of the coefficients of the input coefficient set w_j having values closest to zero.
[0148] The pruning logic 402a includes step logic 710, which is configured to convert each negative value in the output of the subtraction logic 708 to zero and each positive value in the output of the subtraction logic 708 to one. An exemplary way to achieve this is by using a step function. For example, a step function can output a value of 0 for negative input values and a value of 1 for positive input values. The output of step logic 710 is a binary tensor with the same dimensions as the input coefficient set w_j. A binary tensor is a tensor composed of the binary values 0 and 1. The binary tensor output by step logic 710 can be used as a "sparsity mask".
[0149] The pruning logic 402a includes multiplication logic 714, which is configured to perform an element-wise multiplication of the sparsity mask and the input coefficient set w_j. That is, at each coefficient position where the binary sparsity mask includes a "0", the coefficient in the coefficient set w_j is multiplied by 0, giving an output of zero. In this way, sparsity has been applied to those coefficients, i.e., they have been set to zero. At each coefficient position where the binary sparsity mask includes a "1", the coefficient in the coefficient set w_j is multiplied by 1, so its value remains unchanged. The output of pruning logic 402a is an updated set of coefficients w′_j 506 to which sparsity has been applied. For example, multiplication logic 714 can perform the multiplication according to equation (2), where Step(abs(w_j)-τ) represents the binary tensor output by step logic 710.
[0150] w′_j = Step(abs(w_j) - τ) * w_j (2)
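The unstructured pruning path of Figure 7a (absolute value, empirical quantile, subtraction, step, multiplication) can be sketched as below. This is a minimal illustration under stated assumptions rather than the patent's implementation: plain Python lists stand in for tensors, all function names are hypothetical, the threshold is placed midway between neighbouring sorted values, and an input of exactly 0 to the step function is mapped to 1 (the text only specifies negative and positive inputs).

```python
def step(x):
    """Step function: 0 for negative inputs, 1 otherwise (0 -> 1 is an assumption)."""
    return 0 if x < 0 else 1


def empirical_threshold(values, s):
    """Sort the values and place tau so that a fraction s of them lie below it."""
    ordered = sorted(values)
    n = len(ordered)
    k = min(n, max(0, round(s * n)))  # number of values that should fall below tau
    if k == 0:
        return ordered[0]             # nothing lies strictly below the minimum
    if k == n:
        return ordered[-1] + 1        # every value lies below this threshold
    return (ordered[k - 1] + ordered[k]) / 2  # midpoint between neighbours


def prune_unstructured(w, s):
    abs_w = [abs(c) for c in w]              # logic 700
    tau = empirical_threshold(abs_w, s)      # quantile logic 706
    mask = [step(a - tau) for a in abs_w]    # subtraction logic 708, step logic 710
    return [m * c for m, c in zip(mask, w)]  # multiplication logic 714, equation (2)


w_j = [7, -1, 4, 2, -5, 9, -3, 6, 8, -10]
print(prune_unstructured(w_j, 0.4))  # [7, 0, 0, 0, -5, 9, 0, 6, 8, -10]
```

With s = 0.4 the four coefficients closest to zero are set to zero. If several coefficients share the same absolute value at the cut-off, this sketch may prune slightly fewer coefficients than requested.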
[0151] Figure 7c shows exemplary pruning logic for applying unstructured sparsity. In some examples, the pruning logic 402 shown in Figure 4 has the properties of the pruning logic 402c described with reference to Figure 7c. It should be understood that the pruning logic 402c shown in Figure 7c is merely one example of logic configured to apply sparsity to the coefficient set. Other forms of logic can be used to apply sparsity to the coefficient set.
[0152] As described above with reference to Figure 7a, the inputs of the pruning logic 402c shown in Figure 7c include w_j 502 and s_j 504. The pruning logic 402c shown in Figure 7c also includes normalization logic 704, which performs the same function as the normalization logic 704 described with reference to Figure 7a.
[0153] The pruning logic 402c shown in Figure 7c may be particularly suitable when the coefficients in the coefficient set are normally distributed. A normal (or Gaussian) distribution can be fully described by its mean μ and standard deviation Ψ. The pruning logic 402c shown in Figure 7c includes logic 714, which is configured to determine the standard deviation of the coefficients in the coefficient set 502, and logic 716, which is configured to determine the mean of the coefficients in the coefficient set 502.
[0154] The pruning logic 402c shown in Figure 7c includes quantile logic 706-2. Quantile logic 706-2 can use a differentiable function, such as the inverse error function (erf^-1), to model the coefficient set using the mean and standard deviation of the coefficient set (as determined in logic 716 and logic 714, respectively). The quantile logic 706-2 is configured to determine the threshold τ according to the sparsity parameter. For example, when the differentiable function is the inverse error function, this can be achieved according to equation (3), using the standard deviation determined by logic 714 and the mean determined by logic 716.
[0155]
[0156] The pruning logic 402c shown in Figure 7c includes subtraction logic 708a, which is configured to subtract the mean determined by logic 716 from the threshold τ. Therefore, referring to equation (3), the output of subtraction logic 708a is the difference between the threshold τ and the mean.
[0157] The pruning logic 402c shown in Figure 7c includes subtraction logic 708b, which is configured to subtract the mean determined by logic 716 from each coefficient in the coefficient set w_j 502. This has the effect of centring the distribution of the coefficients at approximately 0.
[0158] The pruning logic 402c shown in Figure 7c includes logic 700, which is configured to determine the absolute value of each value in the output of the subtraction logic 708b. In this way, each value in the output of unit 700 is a non-negative value.
[0159] The pruning logic 402c shown in Figure 7c includes subtraction logic 708c, which is configured to subtract the output of subtraction logic 708a from each absolute value determined by logic 700. Therefore, any absolute value less than the output of subtraction logic 708a will be represented by a negative number, and any absolute value greater than the output of subtraction logic 708a will be represented by a positive number. In this way, the pruning logic 402c has identified the least significant coefficients (e.g., the coefficients least important to the set of coefficients). In this example, the least significant coefficients are those coefficients for which the output of the subtraction logic 708c is negative.
[0160] The pruning logic 402c includes step logic 710, which performs the same function as the step logic 710 described with reference to Figure 7a. The output of step logic 710 is a binary tensor with the same dimensions as the input coefficient set w_j. A binary tensor is a tensor composed of the binary values 0 and 1. The binary tensor output by step logic 710 can be used as a "sparsity mask".
[0161] The pruning logic 402c includes multiplication logic 714, which is configured to perform an element-wise multiplication of the sparsity mask and the input coefficient set w_j, as described for the multiplication logic 714 with reference to Figure 7a. The output of the pruning logic 402c is an updated set of coefficients w′_j 506 to which sparsity has been applied. For example, multiplication logic 714 can perform the multiplication according to equation (4), in which the step-function term represents the binary tensor output by step logic 710.
[0162] w′_j = Step(abs(w_j - μ) - (τ - μ)) * w_j (4)
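The model-based path of Figure 7c can be sketched as below. This is a hedged illustration, not the patent's implementation. It assumes the threshold takes the form τ = μ + ψ·√2·erf^-1(s), which follows from the fact that, for a normal distribution, the fraction of values within a distance t of the mean is erf(t / (ψ·√2)); the patent's own equation (3) may be written differently. The function names are hypothetical, plain lists stand in for tensors, and the inverse error function is obtained from the standard library's normal quantile function. The sparsity parameter s must be strictly less than 1 here.

```python
import math
from statistics import NormalDist, fmean, pstdev


def erfinv(y):
    """Inverse error function, via erf(x) = 2 * Phi(x * sqrt(2)) - 1."""
    return NormalDist().inv_cdf((y + 1.0) / 2.0) / math.sqrt(2.0)


def gaussian_threshold(w, s):
    """Assumed threshold form: tau = mu + psi * sqrt(2) * erfinv(s)."""
    mu = fmean(w)        # logic 716
    psi = pstdev(w)      # standard deviation logic
    tau = mu + psi * math.sqrt(2.0) * erfinv(s)  # quantile logic 706-2
    return tau, mu


def prune_gaussian(w, s):
    tau, mu = gaussian_threshold(w, s)
    offset = tau - mu                                   # subtraction logic 708a
    centred = [abs(c - mu) for c in w]                  # subtraction 708b, logic 700
    mask = [0 if a - offset < 0 else 1 for a in centred]  # subtraction 708c, step 710
    return [m * c for m, c in zip(mask, w)]             # multiplication logic


# Evenly spaced normal quantiles stand in for a normally distributed layer.
w_j = [NormalDist(1.5, 2.0).inv_cdf((i + 0.5) / 1000) for i in range(1000)]
pruned = prune_gaussian(w_j, 0.4)
print(sum(1 for c in pruned if c == 0) / len(pruned))  # close to 0.4
```

Because the threshold is computed from the mean and standard deviation alone, no sorting is needed, and the threshold is a smooth function of s. The achieved sparsity only matches s to the extent that the coefficients really are normally distributed.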
[0163] As described herein, the pruning logic 402c described with reference to Figure 7c may be particularly suitable when the coefficients in the coefficient set are normally distributed. Therefore, the distribution of the coefficients of the coefficient set w_j can be tested or inferred in order to determine which implementation of the pruning logic to use to apply sparsity to those coefficients (e.g., the pruning logic described with reference to Figure 7a or Figure 7c). That is, if the coefficient set is not normally distributed, sparsity can preferably be applied using the pruning logic described with reference to Figure 7a. If the coefficient set is (or is approximately) normally distributed, sparsity can preferably be applied using the pruning logic described with reference to Figure 7c.
[0164] Structured sparsity
[0165] In accordance with the principles described herein, synergistic benefits can be achieved by applying sparsity to multiple coefficients of a coefficient set in a structured manner that is aligned with the compression scheme used to compress the coefficient set. This can be achieved by logically arranging the pruning logic 402 and compression logic 404 of Figure 4 as shown in Figure 5.
[0166] Figure 5 illustrates a data processing system for compressing a set of coefficients for subsequent use in a neural network, in accordance with the principles described herein. The method of compressing a set of coefficients for subsequent use in a neural network will be described with reference to Figure 6.
[0167] The inputs to pruning logic 402 include w_j 502, which represents the set of coefficients for the j-th layer of the neural network as described herein. The inputs to the pruning logic 402 also include s_j 504, which represents the sparsity parameter of the j-th layer of the neural network as described herein. Both w_j 502 and s_j 504 can be read into the pruning logic 402 from memory (such as the memory 104 in Figure 4). The sparsity parameter can indicate the level of sparsity to be applied to the coefficient set w_j by the pruning logic 402.
[0168] The pruning logic 402 is configured to apply sparsity to groups of coefficients, each group comprising a predefined plurality of coefficients. This is step 602 of the method of Figure 6. A group of coefficients can be multiple coefficients occupying adjacent positions within the coefficient set, but this is not always the case. A group of coefficients can have any suitable format. For example, a group can comprise a p-dimensional coefficient tensor (where p ≥ 1) or any other suitable format. In one example, each group includes sixteen coefficients arranged in a single row (e.g., a group with dimensions 1 × 16). More generally, groups can have any dimensions, such as 2 × 2, 4 × 4, etc. As described herein, the coefficient set used to perform a convolution operation on input data can have dimensions C_out × C_in × H × W. A group of coefficients of such a set can have dimensions 1×16×1×1 (i.e., the 16 coefficients in each group are in corresponding positions in each of 16 input channels). As described herein, the coefficient set used to perform a fully connected operation can have dimensions C_out × C_in. A group of coefficients of such a set may have dimensions 1×16 (i.e., the 16 coefficients in each group are in corresponding positions in each of 16 input channels). In another example, a coefficient channel of one or more filters of the set of filters of the layer (e.g., as described with reference to Figure 2b) can be treated as a group of coefficients to which sparsity can be applied.
[0169] Applying sparsity to a group of coefficients may include setting each coefficient in the group to zero. This can be achieved by applying a sparsity algorithm to the coefficients of the coefficient set. The number of groups of coefficients to which sparsity is to be applied can be determined according to the sparsity parameter s_j. The sparsity parameter can indicate the percentage, fraction, or portion of the coefficient set to which sparsity is to be applied by the pruning logic 402. The sparsity parameter s_j can be set (e.g., somewhat arbitrarily by a user) based on an assumption of how much sparsity can be introduced into the coefficient set without significantly affecting the accuracy of the neural network. In other examples, as described further herein, the sparsity parameter s_j can be determined as part of the training process of the neural network. The output of the pruning logic 402 is an updated set of coefficients w′_j 506, which includes multiple sparse groups of coefficients (e.g., groups of coefficients in which every coefficient has the value "0").
[0170] Figure 7b shows exemplary pruning logic for applying structured sparsity. In some examples, the pruning logic 402 shown in Figure 4 and Figure 5 has the properties of the pruning logic 402b described with reference to Figure 7b. It should be understood that the pruning logic 402b shown in Figure 7b is merely one example of logic configured to apply structured sparsity to the coefficient set. Other forms of logic can be used to apply structured sparsity to the coefficient set.
[0171] As described above with reference to Figure 7a, the inputs to the pruning logic 402b shown in Figure 7b include w_j 502 and s_j 504. The pruning logic 402b shown in Figure 7b also includes normalization logic 704 and logic 700, each of which performs the same function as the corresponding logic described with reference to Figure 7a.
[0172] The pruning logic 402b shown in Figure 7b includes reduction logic 702, which is configured to divide the set of coefficients received from logic 700 into groups of coefficients, such that each coefficient in the set is assigned to exactly one group and all coefficients are assigned to groups. Each group may include multiple coefficients. Each group identified by the reduction logic may include the same number of coefficients and may have the same dimensions. The reduction logic is configured to represent each group of coefficients with a single value. For example, the single value representing a group may be the average (e.g., mean, median, or mode) of the coefficients within that group. In another example, the single value for a group may be the maximum coefficient value within the group. This may be referred to as max pooling. In one example, a group may comprise a channel of the coefficient set, as described herein. Reducing a coefficient channel to a single value may be referred to as global pooling. Reducing a coefficient channel to the maximum coefficient value within that channel may be referred to as global max pooling. The output of reduction logic 702 may be a reduced tensor, where one or more dimensions of the tensor representing the coefficient set are an integer multiple of the corresponding dimensions of the reduced tensor, the integer being greater than 1. Each value in the reduced tensor can represent one group of coefficients of the set of absolute coefficient values input to the reduction logic 702. When the reduction logic 702 performs a pooling operation, such as max pooling, global pooling, or global max pooling, the reduced tensor can be called a pooled tensor.
[0173] The function performed by reduction logic 702 is illustrated schematically in Figure 8. In Figure 8, 2×2 pooling 702 is performed on coefficient set 502. Coefficient set 502 can be the output of logic 700 shown in Figure 7b. In this example, coefficient set 502 is represented by an 8×8 coefficient tensor. Coefficient set 502 is logically divided into 16 groups of four coefficients each (e.g., each group is represented by a 2×2 coefficient tensor). In Figure 8, the groups of coefficient set 502 are indicated by thick boundaries surrounding each group of four coefficients. Through the 2×2 pooling operation 702, each group of four coefficients in coefficient set 502 is represented by a single value in the reduced tensor 800, as described herein. For example, the top-left group of coefficients in coefficient set 502 can be represented by the top-left value in the reduced tensor 800. The reduced tensor 800 shown in Figure 8 is represented by a tensor with dimensions 4×4. That is, each dimension of the 8×8 tensor representing the coefficient set 502 is twice the corresponding dimension of the reduced tensor 800.
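The 2×2 max pooling of Figure 8 can be sketched as below. This is an illustrative sketch only: nested Python lists stand in for the coefficient tensor, the function name max_pool is hypothetical, and the 8×8 input values are made up for the example (they are not the values shown in Figure 8).

```python
def max_pool(abs_w, group_h, group_w):
    """Represent each non-overlapping group_h x group_w group by its maximum."""
    rows, cols = len(abs_w), len(abs_w[0])
    if rows % group_h or cols % group_w:
        raise ValueError("tensor dimensions must be a multiple of the group size")
    return [
        [
            max(abs_w[r + i][c + j] for i in range(group_h) for j in range(group_w))
            for c in range(0, cols, group_w)
        ]
        for r in range(0, rows, group_h)
    ]


# Made-up 8x8 tensor of absolute coefficient values: entry (r, c) is 8 * r + c.
abs_w = [[8 * r + c for c in range(8)] for r in range(8)]
reduced = max_pool(abs_w, 2, 2)
print(len(reduced), len(reduced[0]))  # 4 4
print(reduced[0])                     # [9, 11, 13, 15]
```

Each of the 16 values in the 4×4 result stands for one group of four coefficients, so ranking the reduced values ranks the groups.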
[0174] Returning to Figure 7b, the pruning logic 402b includes quantile logic 706, which is configured to determine a threshold τ according to the sparsity parameter, using the reduced tensor. For example, the sparsity parameter can indicate the percentage of sparsity to be applied to the set of coefficients, such as 25%. In this example, quantile logic 706 will determine a threshold below which 25% of the values in the reduced tensor lie. In this example, the quantile logic can be described as using a non-differentiable quantile approach. That is, instead of attempting to model the values in the reduced tensor using a function, quantile logic 706 sorts the values in the reduced tensor empirically (e.g., in ascending or descending order) and sets the threshold at the appropriate value. For example, quantile logic 706 can determine the threshold τ according to equation (5).
[0175]
[0176] The pruning logic 402b includes subtraction logic 708, which is configured to subtract the threshold determined by quantile logic 706 from each value in the reduced tensor. Therefore, any value in the reduced tensor that is less than the threshold will be represented by a negative number, while any value in the reduced tensor that is greater than the threshold will be represented by a positive number. In this way, the pruning logic 402b has identified the least significant values in the reduced tensor. In this example, the least significant values in the reduced tensor are those values below the threshold. The least significant values in the reduced tensor correspond to the least significant groups of coefficients in the coefficient set.
[0177] The pruning logic 402b includes step logic 710, which is configured to convert each negative value in the output of the subtraction logic 708 to zero and each positive value in the output of the subtraction logic 708 to one. An exemplary way to achieve this is by using a step function. For example, a step function can output a value of 0 for negative input values and a value of 1 for positive input values. The output of step logic 710 is a binary tensor with the same dimensions as the reduced tensor output by the reduction logic 702. A binary tensor is a tensor composed of the binary values 0 and 1. This binary tensor can be referred to as the reduced sparsity mask tensor. When the reduction logic 702 performs a pooling operation, such as max pooling or global pooling, the reduced sparsity mask tensor can be referred to as the pooled sparsity mask tensor.
[0178] The functions performed by the quantile logic 706, the subtraction logic 708, and the step logic 710 can be collectively referred to as mask generation 802. Mask generation 802 is illustrated schematically in Figure 8. In Figure 8, mask generation 802 is performed on the reduced tensor 800 (e.g., using quantile logic 706 and subtraction logic 708, as described with reference to Figure 7b) so as to generate a reduced sparsity mask tensor 804. The reduced sparsity mask tensor 804 includes four binary "0"s, represented by shading, and 12 binary "1"s.
[0179] Returning to Figure 7b, the pruning logic 402b includes expansion logic 712, which is configured to expand the reduced sparsity mask tensor to generate a sparsity mask tensor with the same dimensions as the coefficient tensor input to the reduction logic 702. Expansion logic 712 can perform upsampling, such as nearest-neighbor upsampling. For example, where the reduced sparsity mask tensor includes a binary "0", the sparsity mask tensor will include a corresponding group of binary "0"s, the dimensions of which are the same as the dimensions of the groups into which the coefficient set was divided by the reduction logic 702. For example, in the case where the reduction logic 702 performs a global pooling operation, expansion logic 712 can perform an operation called global upsampling. The binary tensor output by expansion logic 712 can be used as a "sparsity mask" and therefore can be referred to herein as a sparsity mask tensor. In one example, nearest-neighbor upsampling can be achieved by implementing the expansion logic 712 as a suitably configured deconvolution (also known as transposed convolution) layer executed by a convolution engine (e.g., the convolution engine 108 shown in Figure 1).
[0180] The function performed by the expansion logic 712 is illustrated schematically in Figure 8. In Figure 8, the reduced sparsity mask tensor 804 is upsampled by 2×2, for example by nearest-neighbor upsampling, to generate the sparsity mask tensor 505. For each binary "0" in the reduced sparsity mask tensor 804, the sparsity mask tensor includes a corresponding 2×2 group of binary "0"s. As described herein, the binary "0"s in the sparsity mask tensor 505 are shown shaded in Figure 8. The sparsity mask tensor 505 has the same dimensions as the coefficient tensor 502 (i.e., 8×8).
[0181] The pruning logic 402b includes multiplication logic 714, which is configured to perform an element-wise multiplication of the sparsity mask tensor and the input coefficient set w_j, as described for the multiplication logic 714 with reference to Figure 7a. Since the sparsity mask tensor includes groups of binary "0"s, sparsity will be applied to groups of coefficients of the coefficient set w_j. The output of the pruning logic 402b is an updated set of coefficients w′_j 506 to which sparsity has been applied. For example, multiplication logic 714 can perform the multiplication according to equation (6), where Expansion(Step(Reduction(abs(w_j))-τ)) represents the binary tensor output by expansion logic 712.
[0182] w′_j = Expansion(Step(Reduction(abs(w_j)) - τ)) * w_j (6)
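Equation (6) can be read as a four-stage pipeline: reduce, threshold, expand, multiply. The sketch below illustrates that reading for square g×g groups with max pooling as the reduction and nearest-neighbour upsampling as the expansion. It is a sketch under stated assumptions, not the patent's implementation: nested lists stand in for tensors, the function names are hypothetical, the threshold is placed midway between neighbouring sorted values, and a step input of exactly 0 is mapped to 1.

```python
def reduction(abs_w, g):
    """g x g max pooling (reduction logic 702)."""
    return [
        [max(abs_w[r + i][c + j] for i in range(g) for j in range(g))
         for c in range(0, len(abs_w[0]), g)]
        for r in range(0, len(abs_w), g)
    ]


def empirical_threshold(values, s):
    """Place tau so that a fraction s of the sorted values lie below it."""
    ordered = sorted(values)
    n = len(ordered)
    k = min(n, max(0, round(s * n)))
    if k == 0:
        return ordered[0]
    if k == n:
        return ordered[-1] + 1
    return (ordered[k - 1] + ordered[k]) / 2


def expansion(mask, g):
    """Nearest-neighbour upsampling by g in both dimensions (expansion logic 712)."""
    out = []
    for row in mask:
        wide = [m for m in row for _ in range(g)]
        out.extend(list(wide) for _ in range(g))
    return out


def prune_structured(w, s, g=2):
    abs_w = [[abs(c) for c in row] for row in w]                     # logic 700
    reduced = reduction(abs_w, g)
    tau = empirical_threshold([v for row in reduced for v in row], s)  # logic 706
    small = [[0 if v - tau < 0 else 1 for v in row] for row in reduced]  # 708, 710
    mask = expansion(small, g)
    return [[m * c for m, c in zip(mr, wr)] for mr, wr in zip(mask, w)]  # logic 714


w_j = [[1, -2, 9, 3], [0, 1, -4, 5], [7, -8, 6, 2], [3, 4, -1, 10]]
for row in prune_structured(w_j, 0.25):
    print(row)
# [0, 0, 9, 3]
# [0, 0, -4, 5]
# [7, -8, 6, 2]
# [3, 4, -1, 10]
```

With s = 0.25 and four 2×2 groups, exactly one group (the one whose largest absolute value is smallest) is zeroed as a block, which is what lets a group-aligned compression scheme exploit the sparsity.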
[0183] Figure 7d shows exemplary pruning logic for applying structured sparsity. In some examples, the pruning logic 402 shown in Figure 4 and Figure 5 has the properties of the pruning logic 402d described with reference to Figure 7d. It should be understood that the pruning logic 402d shown in Figure 7d is merely one example of logic configured to apply structured sparsity to the coefficient set. Other forms of logic can be used to apply structured sparsity to the coefficient set.
[0184] As described above with reference to Figure 7a, the inputs to the pruning logic 402d shown in Figure 7d include w_j 502 and s_j 504. The pruning logic 402d shown in Figure 7d also includes normalization logic 704, which performs the same function as the normalization logic 704 described with reference to Figure 7a.
[0185] The pruning logic 402d includes logic 716, which is configured to determine the mean of the coefficients in the coefficient set 502. It also includes subtraction logic 708d, which is configured to subtract the mean determined by logic 716 from each coefficient value in the input coefficient set 502.
[0186] The pruning logic 402d also includes logic 700, which is configured to determine the absolute value of each value in the output of the subtraction logic 708d. In this way, each value in the output of unit 700 is a non-negative value.
[0187] The pruning logic 402d includes reduction logic 702, which performs the same function as the reduction logic 702 described with reference to Figure 7b. That is, the reduction logic 702 is configured to divide the set of coefficients received from logic 700 into groups of coefficients, and to represent each group of coefficients with a single value. For example, the single value for a group could be the maximum coefficient value within that group. This process is called "max pooling". The output of the reduction logic 702 is a reduced tensor, where one or more dimensions of the tensor representing the set of coefficients are an integer multiple of the corresponding dimensions of the reduced tensor, the integer being greater than 1. Each value in the reduced tensor represents one group of coefficients from the set of coefficients input to the reduction logic 702.
[0188] As with the pruning logic 402c described with reference to Figure 7c, the pruning logic 402d shown in Figure 7d may be particularly suitable when the coefficients in the coefficient set are normally distributed. However, when a reduction, such as max pooling or global max pooling, is performed on a normally distributed set of values, the distribution of the reduced values approximates a Gumbel distribution. A Gumbel distribution can be described by a scale parameter β and a location parameter. Therefore, the pruning logic 402d includes logic 718, which is configured to determine the scale parameter of the output of the reduction logic 702 according to equation (7). It also includes logic 720, which is configured to determine the location parameter of the output of reduction logic 702 according to equation (8), where γ is the Euler-Mascheroni constant (i.e., 0.577216, rounded to six decimal places).
[0189]
[0190]
[0191] The pruning logic 402d shown in Figure 7d includes quantile logic 706-3. The quantile logic 706-3 can use a differentiable function to model the set of values in the reduced tensor, using the scale parameter and location parameter determined by logic 718 and logic 720, respectively. The quantile logic 706-3 is configured to determine the threshold τ according to the sparsity parameter.
[0192] For example, this can be achieved using a differentiable function according to equation (9).
[0193]
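One plausible reading of logic 718, logic 720 and quantile logic 706-3 is a method-of-moments fit of a Gumbel distribution followed by its closed-form quantile function. The sketch below assumes exactly that: the scale is taken as β = σ·√6/π and the location as (mean - β·γ), which are the standard moment relations for a Gumbel distribution, and the threshold is the Gumbel quantile at probability s, i.e. location - β·ln(-ln(s)). The patent's own equations (7) to (9) may take a different form, and the function names here are hypothetical. The sparsity parameter s must lie strictly between 0 and 1.

```python
import math
from statistics import fmean, pstdev

EULER_GAMMA = 0.577216  # Euler-Mascheroni constant, rounded to six decimal places


def gumbel_parameters(reduced_values):
    """Method-of-moments estimates of the Gumbel scale and location (assumed form)."""
    beta = pstdev(reduced_values) * math.sqrt(6.0) / math.pi  # scale, logic 718
    location = fmean(reduced_values) - beta * EULER_GAMMA     # location, logic 720
    return beta, location


def gumbel_threshold(reduced_values, s):
    """Threshold below which a fraction s of a Gumbel-distributed set should lie."""
    beta, location = gumbel_parameters(reduced_values)
    return location - beta * math.log(-math.log(s))           # quantile logic 706-3


# Evenly spaced Gumbel quantiles (location 2, scale 3) stand in for pooled values.
n = 10000
pooled = [2.0 - 3.0 * math.log(-math.log((i + 0.5) / n)) for i in range(n)]
tau = gumbel_threshold(pooled, 0.25)
print(sum(1 for v in pooled if v < tau) / n)  # close to 0.25
```

As with the normal-distribution case, the threshold depends only on summary statistics of the reduced tensor, so it varies smoothly with s and avoids sorting. How closely the achieved group sparsity tracks s depends on how well the pooled values follow a Gumbel distribution.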
[0194] The pruning logic 402d shown in Figure 7d includes subtraction logic 708e, which is configured to subtract the threshold τ from each value in the reduced tensor output by the reduction logic 702. Therefore, any value in the reduced tensor that is less than the threshold τ will be represented by a negative number, while any value in the reduced tensor that is greater than the threshold τ will be represented by a positive number. In this way, the pruning logic 402d has identified the least significant values in the reduced tensor. In this example, the least significant values in the reduced tensor are those values below the threshold τ. The least significant values in the reduced tensor correspond to the least significant groups of coefficients in the coefficient set.
[0195] The pruning logic 402d includes step logic 710, which is configured to convert each negative value in the output of subtraction logic 708e to zero and each positive value in the output of subtraction logic 708e to one. An exemplary way to achieve this is by using a step function. For example, a step function can output a value of 0 for negative input values and a value of 1 for positive input values. The output of step logic 710 is a binary tensor with the same dimensions as the reduced tensor. A binary tensor is a tensor composed of the binary values 0 and 1. This binary tensor can be referred to as a reduced sparsity mask tensor. The functions performed by quantile logic 706-3, logic 718, logic 720, subtraction logic 708e, and step logic 710 can be collectively referred to as mask generation 802.
[0196] The pruning logic 402d shown in Figure 7d includes expansion logic 712, which is configured to expand the reduced sparsity mask tensor to generate a sparsity mask tensor with the same dimensions as the coefficient tensor input to the reduction logic 702, as described for the expansion logic 712 with reference to Figure 7b. The binary tensor output by the expansion logic 712 can be used as a "sparsity mask", and therefore can be referred to herein as a sparsity mask tensor.
[0197] The pruning logic 402d includes multiplication logic 714, which is configured to perform an element-wise multiplication of the sparsity mask tensor and the input coefficient set w_j, as described for the multiplication logic 714 with reference to Figure 7a. Since the sparsity mask tensor includes groups of binary "0"s, sparsity will be applied to groups of coefficients of the coefficient set w_j. The output of the pruning logic 402d is an updated set of coefficients w′_j 506 to which sparsity has been applied. For example, multiplication logic 714 can perform the multiplication according to equation (10), in which the expansion term represents the binary tensor output by the expansion logic 712.
[0198] w′_j = Expansion(Step(Reduction(abs(w_j - μ)) - τ)) * w_j (10)
[0199] As described herein, the pruning logic 402d described with reference to Figure 7d may be particularly suitable when the coefficients in the coefficient set are normally distributed. Therefore, the distribution of the coefficients of the coefficient set w_j can be tested or inferred in order to determine which implementation of the pruning logic to use to apply structured sparsity to those coefficients (e.g., the pruning logic described with reference to Figure 7b or Figure 7d). That is, if the coefficient set is not normally distributed, sparsity can preferably be applied using the pruning logic described with reference to Figure 7b. If the coefficient set is (or is approximately) normally distributed, sparsity can preferably be applied using the pruning logic described with reference to Figure 7d.
[0200] As described herein, Figure 7d provides an example in which the reduction logic 702 performs a reduction, such as max pooling or global max pooling, on a normally distributed set of values, such that the distribution of the reduced values approximates a Gumbel distribution. The Gumbel distribution can be referred to as an extreme value distribution. It should be understood that other types of extreme value distribution, such as the Weibull or Fréchet distributions, can be used in place of the Gumbel distribution. In these examples, the logic described with reference to Figure 7d can be modified so that the quantile logic models the appropriate distribution in order to determine the threshold. It should also be understood that other types of reduction, such as mean, mode, or median pooling, can be performed by the reduction logic 702, such that the reduced set of normally distributed values approximates a different type of distribution. In these examples, the logic described with reference to Figure 7d can be modified so that the quantile logic models the appropriate distribution in order to determine the threshold.
[0201] Returning to Figure 5, the updated coefficient set w′_j 506 can be written directly from pruning logic 402 to compression logic 404 for compression. In other examples, the updated coefficient set w′_j 506 can first be written back to memory, such as the memory 104 in Figure 4, before being read into compression logic 404 for compression.
[0202] Figures 14b to 14d show some examples of structured sparsity applied to coefficient sets according to the principles described herein. The coefficient sets shown in Figures 14b to 14d can be used by fully connected layers. In Figures 14b to 14d, coefficient channels are depicted as horizontal rows of the coefficient set, and filters of coefficients are depicted as vertical columns of the coefficient set. In Figures 14b to 14d, shading is used to indicate sparse coefficients. In each of Figures 14b to 14d, sparsity has been applied to the coefficient set as described herein. In Figure 14b, each group comprises a 2×2 tensor of coefficients. In Figure 14c, each group comprises a channel of coefficients. In Figure 14d, each group comprises a filter of coefficients.
[0203] Compression logic 404 is configured to compress the updated coefficient set w′_j according to a compression scheme aligned with the coefficient groups, such that each coefficient group can be represented by an integer number of one or more compressed values. This is step 604 of the method of Figure 6.
[0204] The compression scheme can be the SPGC8 compression scheme described herein with reference to Figure 3a. The SPGC8 compression scheme compresses the coefficient set by compressing multiple subsets of the coefficients. Each coefficient group to which sparsity is applied by pruning logic 402 may include one or more subsets of the coefficient set according to the compression scheme. For example, each group may include n coefficients, and each subset according to the compression scheme may include m coefficients, where m is greater than 1 and n is an integer multiple of m. In some examples, n equals m. That is, in some examples, each coefficient group is a subset of the coefficients according to the compression scheme. In other examples, n may be greater than m. In these examples, each coefficient group can be compressed by compressing multiple adjacent or interleaved subsets of coefficients. For example, n may equal 2m: each group may include 16 coefficients, and each subset may include 8 coefficients. In this way, each group can be compressed by compressing two adjacent subsets of coefficients. Alternatively, each group can be compressed by compressing two interleaved subsets of coefficients as described herein.
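The relationship between a group of n coefficients and its subsets of m coefficients can be sketched as follows. The function names are hypothetical, and the stride-based interleaving shown is only one plausible arrangement, since the exact interleaving pattern is described elsewhere.

```python
def split_adjacent(group, m):
    """Split a group of n coefficients into n/m adjacent subsets of m."""
    if len(group) % m != 0:
        raise ValueError("group size n must be an integer multiple of m")
    return [group[i:i + m] for i in range(0, len(group), m)]


def split_interleaved(group, m):
    """Split a group into n/m interleaved subsets of m (assumed stride pattern)."""
    if len(group) % m != 0:
        raise ValueError("group size n must be an integer multiple of m")
    num_subsets = len(group) // m
    return [group[i::num_subsets] for i in range(num_subsets)]


# n = 16 and m = 8, as in the example in the text (n = 2m).
group = list(range(16))
adjacent = split_adjacent(group, 8)        # coefficients 0..7 and 8..15
interleaved = split_interleaved(group, 8)  # even-indexed and odd-indexed
```

In either arrangement, a group to which sparsity has been applied yields subsets consisting entirely of zero coefficients, which is what lets the compression scheme encode them cheaply.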
[0205] It should be understood that the number of coefficients in the coefficient set does not necessarily have to be an integer multiple of n. When the number of coefficients in the coefficient set is not a multiple of n, once the coefficient set has been divided into groups of n coefficients, the remaining coefficients can be padded with zero coefficients (e.g., "zero-padding") to form a final group of n coefficients to be compressed according to the compression scheme.
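A minimal sketch of this zero-padding step, assuming the coefficient set is held as a flat list (the name `pad_into_groups` is illustrative):

```python
def pad_into_groups(coefficients, n):
    """Divide coefficients into groups of n, zero-padding the final group."""
    remainder = len(coefficients) % n
    padding = (n - remainder) % n  # 0 when the length is already a multiple of n
    padded = list(coefficients) + [0] * padding
    return [padded[i:i + n] for i in range(0, len(padded), n)]


groups = pad_into_groups([1, 2, 3, 4, 5], 4)
```

Here the five coefficients form one full group and one final group containing the remaining coefficient followed by three padding zeros.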
[0206] The output of compression logic 404 can be stored in memory (such as the memory 104 shown in Figure 4) for subsequent use in the neural network. For example, the compressed coefficient set described with reference to Figure 5 and Figure 6 can be stored in an "offline phase" (e.g., at "design time") before being used later in a "runtime" implementation of the neural network. For example, the compressed coefficient set output by compression logic 404 in Figure 5 can form an input to the neural network (e.g., input 101 of the neural network implementation shown in Figure 1).
[0207] The advantages of compressing a coefficient set to which sparsity has already been applied, according to a compression scheme aligned with the coefficient groups, can be understood with reference to Figure 3b.
[0208] Figure 3b illustrates the compression of a sparse subset of coefficients according to a compression scheme. The compression scheme can be the SPGC8 compression scheme described herein. Here, a sparse subset of coefficients 310 is considered, in which all eight coefficients in the subset have a value of 0 (e.g., due to the application of sparsity to the group of coefficients that includes the subset). As described herein, in uncompressed form each coefficient can typically be encoded as a 16-bit binary number, as shown at 312, although more or fewer bits can be chosen. Therefore, in this example, 128 bits are needed to encode the sparse subset of eight zero coefficients in uncompressed form, as shown at 312. As described herein, according to the SPGC8 compression scheme, the compressed subset of coefficients can be represented by header data and multiple body portions. In the sparse subset of coefficients 310, the largest coefficient value is 0, which can be encoded using 0 bits of data. Therefore, in this example, the header data indicates that 0 bits will be used to encode each coefficient in the subset. The header data itself has a bit cost, for example 1 bit (e.g., only 1 bit is needed to encode, in binary, the number 0 as the number of bits that each body portion will include), while each body portion uses 0 bits to encode its coefficient value. In this example, the subset of coefficients 310 can therefore be encoded in compressed form using 1 bit of data, as shown at 314, rather than in uncompressed form using 128 bits, as shown at 312. Therefore, compressing the coefficient set for subsequent use in the neural network according to the principles described herein is highly advantageous because it allows for a large compression ratio, significantly reduces the memory footprint of the coefficient set, and significantly reduces the memory bandwidth required to read the coefficient set from memory. Furthermore, compressing the coefficient set for subsequent use in the neural network according to the principles described herein significantly reduces the memory footprint of the model file / graph / representation of the neural network.
[0209] On the other hand, if sparsity is applied in an unstructured manner, then even if only one coefficient in a subset of coefficients is non-zero, the compression scheme will use one or more bits to encode each coefficient value in the subset, potentially significantly increasing the memory footprint of the compressed subset. For example, following the reasoning explained with reference to subset 302 in Figure 3a, the subset of coefficients 31, 0, 0, 0, 0, 0, 0, 0 requires 43 bits for encoding (since the maximum value 31 requires 5 bits for encoding, each body portion will use 5 bits). Therefore, it is particularly advantageous to apply sparsity to groups of coefficients, so that the subsets of coefficients to be compressed within those groups include only "0" coefficient values.
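The two bit counts quoted above (1 bit for an all-zero subset, 43 bits for a subset whose only non-zero value is 31) can be reproduced with a simplified cost model. This is a sketch inferred from those two worked examples only. It assumes non-negative coefficients and a header that stores the per-coefficient bit width in the minimum number of binary digits; the actual SPGC8 header format may differ.

```python
def compressed_bits(subset):
    """Estimate the compressed size of one subset, in bits (assumed model).

    width:       bits needed for the largest magnitude in the subset
    header_bits: bits needed to write `width` itself in binary (at least 1)
    """
    width = max(abs(c) for c in subset).bit_length()
    header_bits = max(1, width.bit_length())
    return header_bits + width * len(subset)


UNCOMPRESSED_BITS = 8 * 16  # eight coefficients at 16 bits each

all_zero = compressed_bits([0, 0, 0, 0, 0, 0, 0, 0])
one_nonzero = compressed_bits([31, 0, 0, 0, 0, 0, 0, 0])
```

Under this model the all-zero subset costs 1 bit against 128 bits uncompressed, while a single non-zero coefficient of 31 raises the cost to 3 header bits plus 8 × 5 body bits.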
[0210] It should be understood that many other suitable compression schemes exist, and the principles described herein are not limited to the SPGC8 compression scheme. For example, the principles described herein can be applied to any compression scheme that compresses a coefficient set by compressing multiple subsets of that coefficient set.
[0211] It should be understood that the structured sparsity principle described herein applies to the coefficient set of convolutional layers, fully connected layers, and any other type of neural network layer configured to combine a set of coefficients in a suitable format with the data input of said layer.
[0212] Channel pruning
[0213] The logic units of the data processing system 410 shown in Figure 4 can be used in other ways to address one or more of the problems identified herein. For example, coefficient identification logic 412 can be used to perform channel pruning.
[0214] Figure 11a shows an exemplary application of channel pruning in convolutional layers according to the principles described herein. Figure 11a shows two convolutional layers, 200-1a and 200-2a. It should be understood that a neural network can include any number of layers. The data input to layer 200-2a depends on the output data of layer 200-1a, referred to herein as the "previous layer". That is, the data input to layer 200-2a can be the data output from the previous layer 200-1a. Alternatively, additional processing logic (such as element-wise addition, subtraction, or multiplication logic, not shown) may exist between layers 200-1a and 200-2a, and may perform operations on the output data of layer 200-1a to provide the input data to layer 200-2a.
[0215] Each layer shown in Figure 11a is configured to combine a corresponding set of filters with the data input to the layer to form the output data of that layer. For example, layer 200-2a is configured to combine a set of filters 204-2a with the data input 202-2a to form the output data 206-2a of that layer. Each filter in the filter set of a layer may include a plurality of coefficients from the coefficient set of the layer. Each filter in the filter set of a layer may include a different plurality of coefficients. That is, each filter may include a unique set of coefficients. Alternatively, two or more filters in the filter set of a layer may include the same plurality of coefficients. That is, two or more filters in a set of filters may be identical to each other.
[0216] The filter set of each layer shown in Figure 11a includes multiple coefficient channels, each coefficient channel in the filter set corresponding to a respective data channel in the data input to that layer. For example, input data 202-2a includes four data channels, and each filter in the filter set 204-2a (e.g., each individual filter) includes four coefficient channels. The first, or topmost, filter in the filter set 204-2a includes coefficient channels a, b, c, and d, which correspond to data channels A, B, C, and D of the input data 202-2a, respectively. For simplicity, the coefficient channels of the other two filters in the filter set 204-2a are not labeled, but it should be understood that the same principle applies to those filters. Therefore, the filter set of a layer (e.g., as a set) can be described as including multiple coefficient channels, where each coefficient channel of the filter set (e.g., as a set) includes the coefficient channel of each filter in the filter set (e.g., each individual filter) that corresponds to the same data channel in the data input to the layer.
[0217] The output data of each layer shown in Figure 11a includes multiple data channels, each corresponding to a respective filter in the filter set of that layer. That is, each filter in the filter set of a layer is responsible for forming a data channel in the output data of that layer. For example, the filter set 204-2a of layer 200-2a includes three filters, and the output data 206-2a of that layer includes three data channels. Each of the three filters in filter set 204-2a can correspond to (e.g., be responsible for forming) a respective data channel in the output data 206-2a.
[0218] Figure 12 illustrates a method of training a neural network using channel pruning according to the principles described herein.
[0219] In step 1202, a target coefficient channel of the filter set of a layer is identified. This step is performed by the coefficient identification logic 412 shown in Figure 4. For example, in Figure 11a, the identified target coefficient channel of filter set 204-2a is shown shaded. The target coefficient channel includes coefficient channel d of the first, or topmost, filter in filter set 204-2a, as well as the coefficient channels of the other two filters in filter set 204-2a that correspond to the same data channel in input data 202-2a. Figure 11a shows the identification of one target coefficient channel in filter set 204-2a, but it should be understood that any number of target coefficient channels can be identified in a set of filters in step 1202.
[0220] The target coefficient channel can be identified according to a sparsity parameter. For example, the sparsity parameter can indicate a percentage of sparsity to be applied to filter set 204-2a, such as 25%. Coefficient identification logic 412 can identify that 25% sparsity in filter set 204-2a can be achieved by applying sparsity to the shaded coefficient channel. The target coefficient channel can be the least salient coefficient channel in the filter set. Coefficient identification logic 412 can include logic, similar to that described for the pruning logic 402b or 402d with reference to Figure 7b and Figure 7d, for identifying one or more least salient coefficient channels in a filter set. For example, the coefficient identification logic may include the same arrangement of logic units as the pruning logic 402b or 402d shown in Figure 7b and Figure 7d, other than the multiplication logic 714, so as to provide a binary mask in which the target channel is identified by binary '0'. Alternatively, coefficient identification logic 412 can cause the filter set to be processed by the pruning logic 402b or 402d itself in order to identify the target coefficient channel. It should be understood that, in the channel pruning examples described herein, sparsity may or may not actually be applied to the target coefficient channel. For example, the coefficient identification logic may identify, label, or determine the target coefficient channel according to the sparsity parameter for use in steps 1204 and 1206, without actually applying sparsity to the target coefficient channel. Alternatively, sparsity may be applied to the target coefficient channel in a test implementation of the neural network before steps 1204 and 1206 are performed, in order to determine how removing the coefficient channel would affect the accuracy of the network, as will be described in further detail herein.
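The identification of target coefficient channels as a binary mask can be sketched as below. The saliency measure is not defined in this passage, so the sum of absolute coefficient values across all filters is used as an assumed stand-in, and the data layout (`filters[f][ch]` is a list of coefficients) and the name `channel_mask` are hypothetical.

```python
def channel_mask(filters, sparsity):
    """Return a per-channel binary mask; 0 marks a target coefficient channel.

    filters:  list of filters, each a list of coefficient channels,
              each channel a list of coefficients
    sparsity: fraction of coefficient channels to target (e.g. 0.25)
    """
    num_channels = len(filters[0])
    # Assumed saliency: total magnitude of the channel across every filter.
    saliency = [
        sum(abs(c) for f in filters for c in f[ch])
        for ch in range(num_channels)
    ]
    num_targets = int(sparsity * num_channels)
    targets = sorted(range(num_channels), key=lambda ch: saliency[ch])[:num_targets]
    return [0 if ch in targets else 1 for ch in range(num_channels)]


# Three filters, each with four coefficient channels (a, b, c, d) of two coefficients.
filter_set = [
    [[0.9, -0.8], [0.5, 0.4], [-0.7, 0.6], [0.01, -0.02]],
    [[0.3, 0.2], [-0.6, 0.5], [0.4, -0.4], [0.00, 0.03]],
    [[-0.5, 0.7], [0.8, -0.1], [0.2, 0.9], [-0.01, 0.01]],
]
mask = channel_mask(filter_set, 0.25)
```

With a sparsity parameter of 25% and four channels, exactly one channel is targeted, and in this example it is channel d, whose coefficients are smallest in magnitude.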
[0221] In step 1204, a target data channel among the multiple data channels in the data input to the layer is identified. This step is performed by the coefficient identification logic 412 shown in Figure 4. The target data channel is the data channel corresponding to the target coefficient channel of the filter set. For example, in Figure 11a, the identified target data channel in the input data 202-2a is data channel D, which is shown shaded.
[0222] Steps 1202 and 1204 can be performed by coefficient identification logic 412 during an "offline," "training," or "design" phase. Coefficient identification logic 412 can report the identified target coefficient channel and the identified target data channel to data processing system 410. In step 1206, the runtime implementation of the neural network is configured such that the filter set of the previous layer does not include the filter corresponding to the target data channel. Therefore, when the runtime implementation of the neural network is executed on the data processing system, combining the set of filters of the previous layer with the data input to the previous layer does not form a data channel in the output data of the previous layer that corresponds to the target data channel. Step 1206 can be performed by data processing system 410 itself, which can configure the software implementation 102-1 and/or the hardware implementation 102-2 of the neural network, respectively. Step 1206 may also include storing the filter set of the previous layer, which does not include the filter corresponding to the target data channel, in memory (e.g., the memory 102 shown in Figure 4) for subsequent use by the runtime implementation of the neural network. Step 1206 may also include configuring the runtime implementation of the neural network such that each filter in the filter set of the layer does not include the target coefficient channel. Step 1206 may also include storing the filter set of the layer, which does not include the target coefficient channel, in memory (e.g., the memory 102 shown in Figure 4) for subsequent use by the runtime implementation of the neural network.
[0223] For example, in Figure 11a, filter 1100a (shown shaded) in the filter set 204-1a of the previous layer 200-1a corresponds to the identified target data channel (e.g., data channel D in the input data 202-2a). This is because, as described herein, each filter in the filter set of a layer is responsible for forming a respective data channel in the output data of that layer, and the data input to layer 200-2a depends on the output data of the previous layer 200-1a. Therefore, in Figure 11a, filter 1100a is responsible for forming data channel D in the output data 206-1a. The data channel D in the input data 202-2a depends on the data channel D in the output data 206-1a. In this way, filter 1100a corresponds to the data channel D in the input data 202-2a. By configuring a runtime implementation of the neural network in which the filter set of the previous layer 200-1a does not include filter 1100a, the data channel D in the output data 206-1a will not be formed when the runtime implementation of the neural network is executed on the data processing system. Therefore, the input data 202-2a will not include data channel D. Accordingly, when configuring the runtime implementation of the neural network, the target coefficient channel (shown shaded) can also be omitted from the filter set 204-2a. Alternatively, the target coefficient channel can be included in the filter set 204-2a; however, when the runtime implementation of the neural network is executed on the data processing system, any calculations involving the coefficients in the target coefficient channel can be bypassed.
[0224] As described herein, Figure 11a shows an exemplary application of channel pruning in convolutional layers. However, the set of coefficients used by other types of neural network layer, such as fully connected layers, can also be arranged as a set of filters as described herein. Therefore, it should be understood that the principles described herein apply to the coefficient sets of convolutional layers, fully connected layers, and any other type of neural network layer configured to combine a suitably formatted set of coefficients with the data input to the layer.
[0225] For example, Figure 11b shows an exemplary application of channel pruning in fully connected layers according to the principles described herein. Figure 11b shows two fully connected layers, 200-1b and 200-2b. It should be understood that a neural network can include any number of layers. The data input to layer 200-2b depends on the output data of layer 200-1b, referred to herein as the "previous layer". That is, the data input to layer 200-2b can be the data output from the previous layer 200-1b. Alternatively, additional processing logic (such as element-wise addition, subtraction, or multiplication logic, not shown) may exist between layers 200-1b and 200-2b, and may perform operations on the output data of layer 200-1b to provide the input data to layer 200-2b.
[0226] Each layer shown in Figure 11b is configured to combine a corresponding set of filters with the data input to that layer to form the output data of that layer. For example, layer 200-2b is configured to combine a set of filters 204-2b with the data input 202-2b to form the output data 206-2b of that layer. In Figure 11b, each filter is depicted as a vertical column of the set of filters. That is, the set of filters 204-2b comprises three filters. Each filter in the filter set of a layer may include a plurality of coefficients from the coefficient set of that layer. Each filter in the filter set of a layer may include a different plurality of coefficients. That is, each filter may include a unique set of coefficients. Alternatively, two or more filters in the filter set of a layer may include the same plurality of coefficients. That is, two or more filters in a set of filters may be identical to each other.
[0227] The filter set of each layer shown in Figure 11b includes multiple coefficient channels, and each coefficient channel in the filter set corresponds to a respective data channel in the data input to that layer. In Figure 11b, the coefficient channels are depicted as horizontal rows of the filter set. That is, the set of filters 204-2b comprises four coefficient channels. In Figure 11b, the data channels are depicted as vertical columns of the input and output data. That is, the input data 202-2b comprises four data channels. In Figure 11b, the filter set 204-2b includes coefficient channels a, b, c, and d, which correspond to data channels A, B, C, and D of the input data 202-2b, respectively.
[0228] The output data of each layer shown in Figure 11b comprises multiple data channels, each corresponding to a respective filter in the filter set of that layer. That is, each filter in the filter set of a layer is responsible for forming a data channel in the output data of that layer. For example, the filter set 204-2b of layer 200-2b comprises three filters (shown as vertical columns), and the output data 206-2b of that layer comprises three data channels (shown as vertical columns). Each of the three filters in the set of filters 204-2b can correspond to (e.g., be responsible for forming) a respective data channel in the output data 206-2b.
[0229] Referring again to Figure 12, in step 1202, a target coefficient channel of the filter set of a layer is identified as described herein. For example, in Figure 11b, the identified target coefficient channel of filter set 204-2b is coefficient channel a, which is shown shaded. In step 1204, a target data channel among the multiple data channels in the data input to the layer is identified as described herein. For example, in Figure 11b, the identified target data channel in the input data 202-2b is data channel A, which is shown shaded. In step 1206, the runtime implementation of the neural network is configured such that the filter set of the previous layer does not include the filter corresponding to the target data channel, as described herein. For example, in Figure 11b, filter 1100b (shown shaded) in the set of filters 204-1b of the previous layer 200-1b corresponds to the identified target data channel (e.g., data channel A in input data 202-2b).
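The effect of steps 1202 to 1206 on two fully connected layers can be sketched with small matrices laid out as in Figure 11b (rows are coefficient channels, columns are filters). The numeric values, the helper names, and the absence of any processing between the two layers are assumptions made for this illustration.

```python
def apply_layer(x, w):
    """Combine input data channels x with a filter set w (rows x columns)."""
    return [sum(x[r] * w[r][c] for r in range(len(w))) for c in range(len(w[0]))]


def drop_filter(w, t):
    """Remove filter (column) t from a filter set."""
    return [row[:t] + row[t + 1:] for row in w]


def drop_coefficient_channel(w, t):
    """Remove coefficient channel (row) t from a filter set."""
    return w[:t] + w[t + 1:]


# Previous layer: 2 input channels, 4 filters forming data channels A, B, C, D.
w_prev = [[1, 2, 3, 4],
          [5, 6, 7, 8]]
# Layer: 4 coefficient channels (a, b, c, d), 3 filters.
w_layer = [[1, 0, 2],
           [0, 1, 0],
           [3, 0, 1],
           [1, 1, 1]]
x = [1, 2]
target = 0  # coefficient channel a, hence data channel A

# Reference: keep the network shape but zero the target coefficient channel.
w_zeroed = [[0] * 3 if r == target else row for r, row in enumerate(w_layer)]
reference = apply_layer(apply_layer(x, w_prev), w_zeroed)

# Channel pruning: omit the previous layer's filter and the layer's channel.
pruned = apply_layer(
    apply_layer(x, drop_filter(w_prev, target)),
    drop_coefficient_channel(w_layer, target),
)
```

The pruned network produces the same output as the unpruned network with the target coefficient channel set to zero, while data channel A is never computed or stored.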
[0230] Two distinct bandwidth requirements affecting neural network performance are weight bandwidth and activation bandwidth. Weight bandwidth refers to the bandwidth required to read weights from memory. Activation bandwidth refers to the bandwidth required to read the input data of a layer from memory and to write the corresponding output data of that layer back to memory. Both weight bandwidth and activation bandwidth can be reduced by performing channel pruning. Weight bandwidth is reduced because, with fewer filters in a layer (e.g., where one or more filters in a set of filters are omitted when configuring the runtime implementation of the neural network) and/or smaller filters in a layer (e.g., where one or more coefficient channels in a set of filters are omitted when configuring the runtime implementation of the neural network), the number of coefficients in the coefficient set of the layer is reduced, and therefore fewer coefficients are read from memory when executing the runtime implementation of the neural network. For the same reason, channel pruning also reduces the total memory footprint of the coefficient sets used in the neural network (e.g., when stored in memory 104, as shown in Figure 1 and Figure 4). Activation bandwidth is reduced because fewer filters in a layer (e.g., where one or more filters in a set of filters are omitted when configuring the runtime implementation of the neural network) result in fewer data channels in the output of that layer. This means that less output data is written to memory and less input data is read from memory for the subsequent layer. Channel pruning also reduces the computational requirements of the neural network by reducing the number of operations to be performed (e.g., multiplications between coefficients and their corresponding input data values).
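The reduction in coefficient count can be made concrete with a small calculation based on the layer shapes of Figure 11a (four filters in the previous layer; three filters of four coefficient channels in the layer). The 3×3 kernel size and the three input channels of the previous layer are assumptions chosen only to give definite numbers.

```python
def conv_coefficients(num_filters, num_channels, kernel_h, kernel_w):
    """Number of coefficients in a convolutional filter set."""
    return num_filters * num_channels * kernel_h * kernel_w


# Before channel pruning.
prev_before = conv_coefficients(4, 3, 3, 3)   # previous layer: 4 filters
layer_before = conv_coefficients(3, 4, 3, 3)  # layer: 4 coefficient channels
total_before = prev_before + layer_before

# After pruning one data channel: the previous layer loses one filter and
# every filter in the layer loses one coefficient channel.
prev_after = conv_coefficients(3, 3, 3, 3)
layer_after = conv_coefficients(3, 3, 3, 3)
total_after = prev_after + layer_after

weights_saved_fraction = 1 - total_after / total_before
```

Under these assumed shapes the two layers go from 216 to 162 coefficients, a 25% reduction in the weights read from memory, and the intermediate activation shrinks from four data channels to three.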
[0231] Learnable sparsity parameters
[0232] Methods of "unstructured sparsity," "structured sparsity," and "channel pruning" have been described herein. In each method, a sparsity parameter has been referenced. As described, the sparsity parameter can be set (e.g., somewhat arbitrarily by a user) based on the assumption that a certain proportion of the coefficients in the coefficient set can be set to zero or removed without significantly affecting the accuracy of the neural network. Even so, by learning a value for the sparsity parameter, such as its optimal value, additional advantages can be gained in each of the described methods of "unstructured sparsity," "structured sparsity," and "channel pruning." As described herein, the sparsity parameter can be learned, or trained, as part of the training process of the neural network. This can be achieved by arranging the pruning logic 402, network accuracy logic 408, and sparsity learning logic 406 of Figure 4 as shown in Figure 9. Network accuracy logic 408 and sparsity learning logic 406 can be collectively referred to as learning logic 414.
[0233] Figure 9 shows a data processing system implementing a test implementation of a neural network for learning sparsity parameters through training, according to the principles described herein. The test implementation of the neural network shown in Figure 9 includes three neural network layers: 900-1, 900-2, and 900-j. Neural network layers 900-1, 900-2, and 900-j can be implemented in hardware, software, or any combination thereof (e.g., in the software implementation 102-1 of the neural network and/or the hardware implementation 102-2 of the neural network, as shown in Figure 4). Although Figure 9 shows three neural network layers, it should be understood that a test implementation of a neural network can include any number of layers. A test implementation of a neural network can include one or more convolutional layers, one or more fully connected layers, and/or one or more of any other type of neural network layer configured to combine a set of coefficients with the corresponding data values input to the layer. That is, it should be understood that the learnable sparsity parameter principles described herein apply to the coefficient sets of convolutional layers, fully connected layers, and any other type of neural network layer configured to combine a suitably formatted set of coefficients with the data input to the layer. It should be understood that a test implementation of a neural network can also include other types of layer (not shown) that are not configured to combine a set of coefficients with the data input to those layers (such as activation layers and element-wise layers).
[0234] The test implementation of the neural network also includes three instances of pruning logic, 402-1, 402-2, and 402-j, each receiving as inputs the coefficient set w_1, w_2, w_j for the respective neural network layer 900-1, 900-2, 900-j and the corresponding sparsity parameter s_1, s_2, s_j. The coefficient set can be in any suitable format, as described herein. The sparsity parameter can indicate the level of sparsity to be applied to the coefficient set by the pruning logic. For example, the sparsity parameter can indicate the percentage, fraction, or portion of the coefficient set to which the pruning logic applies sparsity.
[0235] The pruning logic shown in Figure 9 can have the same characteristics as any of the pruning logic 402a, 402b, 402c, or 402d described with reference to Figure 7a, Figure 7b, Figure 7c, and Figure 7d, respectively. The type of pruning logic used in the test implementation of the neural network may depend on the method for which the sparsity parameter is being trained (e.g., "unstructured sparsity," "structured sparsity," or "channel pruning") and/or the distribution of the coefficient set received by the pruning logic (e.g., whether the coefficient set is, or is approximately, normally distributed). For example, if the test implementation of the neural network shown in Figure 9 is used to learn sparsity parameters for applying structured sparsity to normally distributed coefficient sets, the instances of pruning logic 402-1, 402-2, and 402-j can have the same characteristics as the pruning logic 402d described with reference to Figure 7d.
[0236] The test implementation of the neural network shown in Figure 9 also includes network accuracy logic 408, which is configured to evaluate the accuracy of the test implementation of the neural network, and sparsity learning logic 406, which is configured to update one or more of the sparsity parameters s_1, s_2, s_j based on the accuracy of the network, as will be described in further detail herein.
[0237] Figure 10 shows a method of learning sparsity parameters by training a neural network according to the principles described herein. Steps 1002, 1004, 1006, 1008, and 1010 of Figure 10 can be performed using the test implementation of the neural network shown in Figure 9. In the following description, the method of learning the sparsity parameter is described with reference to neural network layer 900-j. It should be understood that the same method can be performed, simultaneously or sequentially, for each of the other layers in the test implementation of the neural network.
[0238] In step 1002, sparsity is applied to one or more coefficients of the coefficient set w_j according to the sparsity parameter s_j. This step is performed by pruning logic 402-j. This can be achieved by applying a sparsity algorithm to the coefficient set. Sparsity can be applied by pruning logic 402-j in the manner described herein with reference to the methods of "unstructured sparsity," "structured sparsity," or "channel pruning."
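As a minimal stand-in for step 1002 in the unstructured case, the sketch below zeroes the fraction s_j of coefficients with the smallest magnitude. Magnitude ranking is an assumption made for this example; the pruning logic of Figures 7a to 7d is described elsewhere and may determine its threshold differently.

```python
def apply_unstructured_sparsity(coefficients, sparsity):
    """Set the `sparsity` fraction of smallest-magnitude coefficients to zero."""
    num_to_prune = int(sparsity * len(coefficients))
    order = sorted(range(len(coefficients)), key=lambda i: abs(coefficients[i]))
    pruned_indices = set(order[:num_to_prune])
    return [0 if i in pruned_indices else c for i, c in enumerate(coefficients)]


w_j = [0.5, -0.1, 0.3, -0.9]
s_j = 0.5  # hypothetical sparsity parameter for layer j
w_j_updated = apply_unstructured_sparsity(w_j, s_j)
```

The updated coefficient set keeps the two largest-magnitude coefficients and replaces the other two with zeros.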
[0239] In step 1004, the test implementation of the neural network uses the set of coefficients output by pruning logic 402-j to process training input data, thereby forming training output data. This step can be described as forward propagation. The solid arrows in Figure 9 indicate forward propagation. For example, in Figure 9, neural network layer 900-j combines the set of coefficients output by pruning logic 402-j with the data input to that layer to form the output data of that layer. In the example shown in Figure 9, the output data of the last layer in the sequence of layers (e.g., layer 900-j) is used as the training output data.
[0240] In step 1006, the accuracy of the neural network is evaluated based on the training output data. This step is performed by network accuracy logic 408. The accuracy of the neural network can be evaluated by comparing the training output data with validation output data for the training input data. The validation output data can be formed by operating the test implementation of the neural network on the training input data using the original set of coefficients (e.g., the set of coefficients before sparsity is applied in step 1002). In another example, the validation output data can be provided along with the training input data. For example, in an image classification application in which the training input data includes many images, the validation output data can include a predetermined category or set of categories for each of those images. In one example, step 1006 includes evaluating the accuracy of the neural network using a cross-entropy loss equation and the training output data formed from the set of coefficients output by pruning logic 402-j (e.g., the coefficient set w_j with sparsity applied to one or more of its coefficients according to the sparsity parameter s_j). For example, the accuracy of the neural network can be evaluated by using the cross-entropy loss function to determine a loss on the training output data.
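For a classification network, the loss evaluation of step 1006 could look like the sketch below, which assumes the training output data are per-class probabilities and the validation output data are integer class labels. The function names and the example values are illustrative only.

```python
import math


def cross_entropy(probabilities, label):
    """Cross-entropy loss for one example: -log of the true class probability."""
    return -math.log(probabilities[label])


def mean_cross_entropy(batch_probabilities, labels):
    """Average cross-entropy loss over a batch of training outputs."""
    losses = [cross_entropy(p, y) for p, y in zip(batch_probabilities, labels)]
    return sum(losses) / len(losses)


# Hypothetical training output data for two images and two categories.
training_output = [[0.5, 0.5], [0.25, 0.75]]
validation_labels = [0, 1]
loss = mean_cross_entropy(training_output, validation_labels)
```

A lower loss indicates that the pruned network's predictions remain close to the validation output data, that is, that applying sparsity has cost little accuracy.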
[0241] In step 1008, the sparsity parameter s_j is updated based on the accuracy of the neural network evaluated in step 1006. This step is performed by sparsity learning logic 406 and can be described as backpropagation through the network. Step 1008 may include updating the sparsity parameter s_j according to a parameter optimization technique. The parameter optimization technique is configured to balance the level of sparsity to be applied to the coefficient set w_j, as indicated by the sparsity parameter s_j, against the accuracy of the network. The sparsity parameter for a layer is a learnable parameter that can be updated in the same manner as the coefficient set for that layer. In one example, the parameter optimization technique uses a cross-entropy loss equation that depends on the sparsity parameter and the network accuracy. For example, the sparsity parameter s_j can be updated via backpropagation and gradient descent based on the loss of the training output data determined using the cross-entropy loss function. Backpropagation can be viewed as the process of calculating the gradient of the cross-entropy loss function with respect to the sparsity parameter. This can be achieved by using the chain rule, starting from the final output of the cross-entropy loss function and working backward to the sparsity parameter s_j. Once the gradient is known, the sparsity parameter can be updated using a gradient descent algorithm (or a variant thereof) based on the gradient calculated via backpropagation. Gradient descent can be performed based on a learning rate parameter, which indicates the extent to which the sparsity parameter can change, according to the gradient, in each iteration of the training process.
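The gradient descent update of step 1008 can be sketched as follows. This is an illustration under stated assumptions only: `toy_loss` is a hypothetical stand-in for the loss of the network, a central finite difference stands in for the gradient that backpropagation would supply via the chain rule, and clamping the sparsity parameter to the range 0 to 1 is an assumed design choice.

```python
def gradient_descent_step(s, grad, learning_rate):
    """Move the sparsity parameter against the gradient, keeping it in [0, 1]."""
    updated = s - learning_rate * grad
    return min(max(updated, 0.0), 1.0)

def numerical_gradient(loss_fn, s, h=1e-6):
    """Finite-difference stand-in for the gradient computed by backpropagation."""
    return (loss_fn(s + h) - loss_fn(s - h)) / (2.0 * h)

def toy_loss(s):
    """Hypothetical loss with its minimum at a sparsity parameter of 0.6."""
    return (s - 0.6) ** 2

s_j = 0.2            # initial sparsity parameter
learning_rate = 0.1  # limits how far s_j moves in each iteration
for _ in range(200):
    grad = numerical_gradient(toy_loss, s_j)
    s_j = gradient_descent_step(s_j, grad, learning_rate)

print(f"learned sparsity parameter: {s_j:.4f}")
```

A smaller learning rate makes each update more conservative, at the cost of requiring more iterations to approach the minimum.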
[0242] Step 1008 can be performed based on a weighting configured to bias the test implementation of the neural network toward either maintaining the accuracy of the network or increasing the level of sparsity applied to the coefficient set, as indicated by the sparsity parameter. The weighting can be a factor in the cross-entropy loss equation. The weighting can be set by the user of the data processing system. For example, the weighting can be set based on the memory and / or processing resources available on the data processing system on which the runtime implementation of the neural network will be executed. For example, if those memory and / or processing resources are relatively small, the weighting can be set to favor increasing the level of sparsity applied to the coefficient set, as indicated by the sparsity parameter.
[0243] Step 1008 can be performed based on a defined maximum level of sparsity to be indicated by the updated sparsity parameter. The maximum level of sparsity can be a factor in the cross-entropy loss equation. The maximum level of sparsity can be set by the user of the data processing system. For example, if the memory and / or processing resources available on the data processing system on which the runtime implementation of the neural network will be executed are relatively small, the maximum level of sparsity can be set relatively high, allowing the method to increase the sparsity applied to the coefficient set, as indicated by the sparsity parameter, to a relatively high level.
[0244] As described herein, a test implementation of a neural network may include multiple layers, each configured to combine a corresponding set of coefficients with corresponding input data values to form the layer's output. The number of coefficients in the coefficient set may vary between layers. In step 1008, a corresponding sparsity parameter may be updated for each of the multiple layers. In these examples, step 1008 may also include updating the sparsity parameter of each of the multiple layers based on the number of coefficients in the coefficient set of that layer, such that the test implementation of the neural network tends to update the sparsity parameters to indicate a higher level of sparsity for coefficient sets that include more coefficients than for coefficient sets that include fewer coefficients. This is because a coefficient set that includes a large number of coefficients typically includes a larger proportion of redundant coefficients. This means that a greater level of sparsity may be applied to a larger coefficient set than to a smaller coefficient set before the accuracy of the network is significantly affected.
[0245] In a specific example, steps 1006 and 1008 can be performed using the cross-entropy loss equation as defined in equation (11).
[0246]
[0247] In equation (11), the training input data set comprises I pairs of input images x_i and validated output labels y_i. The test implementation of the neural network executes the neural network model f, which addresses the problem of mapping the inputs to the target labels. W denotes the coefficient sets w_j of the J layers, and s denotes the sparsity parameters s_j of the J layers. The cross-entropy loss is defined by equation (12), where k is the index of the probability output for each class; λ∥W∥_1 is the L1 regularization term; and the remaining term is the cross-entropy coupled sparsity loss defined by equation (13).
[0248]
[0249]
[0250] The backpropagation and gradient descent process performed in step 1008 may involve working toward or finding a local minimum of the loss function, as shown in equation (12). The sparsity learning logic 406 may evaluate the gradient of the loss function for the set of coefficients and sparsity parameters used in the forward propagation to determine how the set of coefficients and / or the sparsity parameters should be updated to move toward a local minimum of the loss function. For example, in equation (13), minimizing the term -log(1-c(W,s)) can find new values for the sparsity parameters of each of the multiple layers that indicate an overall decreased level of sparsity to be applied to the coefficient sets of the neural network. Minimizing the term -log(c(W,s)) can find new values for the sparsity parameters of each of the multiple layers that indicate an overall increased level of sparsity to be applied to the coefficient sets of the neural network.
[0251] In equation (13), α is a weighting value configured to bias the update toward either maintaining the accuracy of the network or increasing the sparsity applied to the coefficient set, as indicated by the sparsity parameter. The weighting value α can take values between 0 and 1. Lower values of α (e.g., relatively closer to 0) tend to increase the sparsity applied to the coefficient set, as indicated by the sparsity parameter (potentially at the cost of network accuracy). Higher values of α (e.g., relatively closer to 1) tend to maintain the accuracy of the network.
[0252] In equation (13), c(W,s), defined by equation (14), is a function used to update the sparsity parameters based on the number of coefficients in the coefficient set of each of the multiple layers, such that step 1008 tends to update the sparsity parameters to indicate a higher level of sparsity for coefficient sets that include more coefficients than for coefficient sets that include fewer coefficients.
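The following Python sketch illustrates one form that is consistent with the behavior described for the terms -log(1-c(W,s)) and -log(c(W,s)) and for the weighting value α. The exact expressions are assumptions made for illustration: `overall_sparsity` assumes that c(W,s) is an average of the per-layer sparsity levels weighted by the number of coefficients in each layer, and `sparsity_loss` assumes that α multiplies the first term and (1-α) multiplies the second. The helper `best_overall_sparsity` is likewise hypothetical.

```python
import math

def overall_sparsity(layer_sizes, layer_sparsities):
    """Assumed form of c(W, s): per-layer sparsity weighted by coefficient count."""
    total = sum(layer_sizes)
    return sum(n * s for n, s in zip(layer_sizes, layer_sparsities)) / total

def sparsity_loss(c, alpha):
    """Assumed combination: -alpha*log(1 - c) - (1 - alpha)*log(c)."""
    eps = 1e-12  # keeps both logarithms finite
    c = min(max(c, eps), 1.0 - eps)
    return -alpha * math.log(1.0 - c) - (1.0 - alpha) * math.log(c)

def best_overall_sparsity(alpha, steps=100):
    """Grid search for the overall sparsity level that minimizes the loss."""
    candidates = [i / steps for i in range(1, steps)]
    return min(candidates, key=lambda c: sparsity_loss(c, alpha))

# A large layer (1000 coefficients) dominates a small one (100 coefficients).
c_example = overall_sparsity([1000, 100], [0.9, 0.1])
print(f"overall sparsity: {c_example:.3f}")

for alpha in (0.2, 0.8):
    print(f"alpha={alpha}: loss is minimized near c={best_overall_sparsity(alpha):.2f}")
```

Under these assumptions, a lower α moves the minimum of the loss toward a higher overall sparsity level, and layers with more coefficients contribute more to c(W,s), so an update that raises c(W,s) favors increasing the sparsity of larger coefficient sets.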
[0253]
[0254] In a variant, equation (13) can be modified to introduce a maximum level of sparsity θ to be indicated by the updated sparsity parameter. This variant is shown in equation (15).
[0255]
[0256] The maximum level of sparsity θ to be indicated by the updated sparsity parameter can represent the maximum percentage, fraction, or portion of the set of coefficients to which sparsity is applied by the pruning logic. Like the sparsity parameter, the maximum level of sparsity θ can take values between 0 and 1. For example, a maximum level of sparsity θ of 0.7 limits the sparsity indicated by the updated sparsity parameter to no more than 70%.
[0257] Returning to Figure 9, in examples where the test implementation of the neural network includes pruning logic that uses a non-differentiable quantile method (e.g., pruning logic 702a or 702b described with reference to Figure 7a and Figure 7b, respectively), the sparsity parameter s_j can be updated directly by sparsity learning logic 406 in step 1008 (shown in Figure 9 by the dotted line between sparsity learning logic 406 and the sparsity parameter s_j). In examples where the test implementation of the neural network includes pruning logic that uses a differentiable quantile function (e.g., pruning logic 702c or 702d described with reference to Figure 7c and Figure 7d, respectively), the sparsity parameter s_j can be updated in step 1008 by backpropagating one or more gradients output by sparsity learning logic 406 through network accuracy logic 408, neural network layer 900-j, and pruning logic 402-j (as shown by the dashed lines in Figure 9). That is, when applying sparsity in step 1002 includes modeling the coefficient set using a differentiable function to identify a threshold based on the sparsity parameter, and applying sparsity based on that threshold, the sparsity parameter can be updated in step 1008 by backpropagating one or more gradient vectors through the differentiable function.
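The distinction between the two kinds of pruning logic can be sketched in Python as follows. Both functions map a sparsity parameter to a magnitude threshold. The first identifies the threshold by sorting, which is not differentiable with respect to the sparsity parameter. The second assumes, purely for illustration, that the coefficients are modeled as a zero-mean normal distribution, so that the threshold is a smooth closed-form function of the sparsity parameter through which a gradient could be propagated by an automatic differentiation framework. All function names and the choice of distribution are hypothetical.

```python
import math
import statistics

def sorted_quantile_threshold(coeffs, s):
    """Non-differentiable: magnitude below which a fraction s of coefficients fall."""
    mags = sorted(abs(w) for w in coeffs)
    k = int(s * len(mags))  # number of coefficients to set to zero
    return mags[k] if k < len(mags) else math.inf

def gaussian_quantile_threshold(coeffs, s):
    """Smooth in s (requires 0 <= s < 1): threshold from an assumed normal model."""
    sigma = math.sqrt(sum(w * w for w in coeffs) / len(coeffs))
    # For zero-mean normal w, P(|w| < t) = s gives t = sigma * Phi^-1((1 + s) / 2).
    return sigma * statistics.NormalDist().inv_cdf((1.0 + s) / 2.0)

def apply_threshold(coeffs, threshold):
    """Set every coefficient whose magnitude is below the threshold to zero."""
    return [0.0 if abs(w) < threshold else w for w in coeffs]

w_j = [0.1, -0.4, 0.25, -0.05, 0.8, -0.6, 0.3, -0.9]
s_j = 0.5
pruned = apply_threshold(w_j, sorted_quantile_threshold(w_j, s_j))
print(pruned)
print(f"model-based threshold: {gaussian_quantile_threshold(w_j, s_j):.4f}")
```

With the sorting-based threshold, exactly half of the eight example coefficients are set to zero. The model-based threshold only approximates the requested sparsity level, but it varies smoothly with the sparsity parameter.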
[0258] In a combined learnable sparsity parameter and channel pruning method, the sparsity parameters can first be trained using the learnable sparsity parameter method described herein. The sparsity parameters can be trained for channel pruning by configuring the pruning logic to apply sparsity to coefficient channels (e.g., using the pruning logic described with reference to Figure 7b or Figure 7d, with each coefficient channel treated as a group of coefficients). Subsequently, one or more target data channels can be identified based on the trained sparsity parameters, and the following steps of the channel pruning method can be performed (as can be understood with reference to the description of Figure 11a, Figure 11b, and Figure 12).
[0259] Steps 1002, 1004, 1006, and 1008 can be performed once. This can be referred to as "one-time pruning." Alternatively, steps 1002, 1004, 1006, and 1008 can be performed iteratively. That is, in the first iteration, sparsity can be applied in step 1002 based on the original sparsity parameters. In each subsequent iteration, sparsity can be applied in step 1002 based on the sparsity parameters updated in step 1008 of the previous iteration. The coefficient set can also be updated by backpropagation and gradient descent in step 1008 of each iteration. In step 1010, it is determined whether the final iteration of steps 1002, 1004, 1006, and 1008 has been performed. If not, a further iteration of steps 1002, 1004, 1006, and 1008 is performed. A fixed number of iterations can be performed. Alternatively, the test implementation of the neural network can be configured to iteratively execute steps 1002, 1004, 1006, and 1008 until a condition is met, for example, until a target level of sparsity in the coefficient sets of the neural network is reached. When it is determined in step 1010 that the final iteration has been performed, the method proceeds to step 1014.
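The control flow of steps 1002 to 1010 can be sketched as a simple loop. The sketch below is illustrative only: `run_training` and the four toy callables passed to it are hypothetical stand-ins, and the toy update simply raises the sparsity parameter by a fixed amount instead of performing backpropagation.

```python
def run_training(coeffs, s, step_1002, step_1004, step_1006, step_1008,
                 max_iterations=10, target_sparsity=None):
    """Iterate steps 1002-1008 until step 1010 finds the final iteration is done."""
    iterations = 0
    for _ in range(max_iterations):              # fixed upper bound on iterations
        sparse_coeffs = step_1002(coeffs, s)     # apply sparsity
        output = step_1004(sparse_coeffs)        # forward propagation
        loss = step_1006(output)                 # evaluate accuracy
        coeffs, s = step_1008(coeffs, s, loss)   # update parameters
        iterations += 1
        # Step 1010: stop early once the target sparsity level is reached.
        if target_sparsity is not None and s >= target_sparsity:
            break
    return coeffs, s, iterations

def prune(coeffs, s):
    """Toy step 1002: zero the fraction s of smallest-magnitude coefficients."""
    k = int(s * len(coeffs))
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]))
    to_zero = set(order[:k])
    return [0.0 if i in to_zero else w for i, w in enumerate(coeffs)]

def toy_update(coeffs, s, loss):
    """Toy step 1008: keep coefficients, raise the sparsity parameter by 0.125."""
    return coeffs, min(1.0, s + 0.125)

weights = [0.5, -0.125, 0.75, 0.0625]
_, s_once, n_once = run_training(weights, 0.0, prune, sum, abs, toy_update,
                                 max_iterations=1)
_, s_iter, n_iter = run_training(weights, 0.0, prune, sum, abs, toy_update,
                                 target_sparsity=0.5)
print(s_once, n_once, s_iter, n_iter)
```

Setting `max_iterations=1` corresponds to performing the steps once, while supplying `target_sparsity` corresponds to iterating until a condition is met.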
[0260] In step 1014, the runtime implementation of the neural network is configured based on the updated sparsity parameters. When using the "unstructured sparsity" and "structured sparsity" methods described herein, step 1014 may include using pruning logic (e.g., pruning logic 402 shown in Figure 4) to apply sparsity to the set of coefficients (e.g., the most recently updated set of coefficients) using the updated sparsity parameters, so as to provide a sparse set of coefficients. At this stage, sparsity should be applied using the same method that was used when updating the sparsity parameters during training, such as "unstructured sparsity" or "structured sparsity". The sparse set of coefficients can be written to memory (e.g., memory 104 in Figure 4) for subsequent use by the runtime implementation of the neural network. That is, the sparsity parameters and the set of coefficients can be trained in an "offline phase" (e.g., at "design time") as described with reference to steps 1002, 1004, 1006, 1008, and 1010 of Figure 10. Sparsity can then be applied to the trained coefficient set based on the trained sparsity parameters to provide a trained sparse coefficient set, which is stored for subsequent use in the runtime implementation of the neural network. For example, the trained sparse coefficient set can form an input to the runtime implementation of the neural network (e.g., input 101 of the neural network implementation shown in Figure 1). The runtime implementation of the neural network can be implemented by data processing system 410 in Figure 4, which configures the software and / or hardware implementation 102-1 or 102-2 of the neural network, respectively.
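The offline flow of step 1014 for unstructured sparsity can be sketched as follows. The sketch is a simplified illustration: magnitude-based pruning is assumed as the way sparsity is applied, an in-memory byte stream stands in for the memory to which the sparse set of coefficients is written, and the storage layout (a count followed by 32-bit floating-point values) and all function names are hypothetical.

```python
import io
import struct

def apply_unstructured_sparsity(coeffs, s):
    """Zero the fraction s of coefficients with the smallest magnitudes."""
    k = int(s * len(coeffs))
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]))
    to_zero = set(order[:k])
    return [0.0 if i in to_zero else w for i, w in enumerate(coeffs)]

def write_coeffs(stream, coeffs):
    """Store a coefficient count followed by little-endian 32-bit floats."""
    stream.write(struct.pack("<I", len(coeffs)))
    stream.write(struct.pack(f"<{len(coeffs)}f", *coeffs))

def read_coeffs(stream):
    """Read back a coefficient set written by write_coeffs."""
    (count,) = struct.unpack("<I", stream.read(4))
    return list(struct.unpack(f"<{count}f", stream.read(4 * count)))

# Offline phase: trained coefficients and a trained sparsity parameter.
trained_coeffs = [0.5, -0.125, 0.75, 0.0625, -1.0, 0.25]
trained_s = 0.5
sparse_coeffs = apply_unstructured_sparsity(trained_coeffs, trained_s)

memory = io.BytesIO()  # stand-in for the memory used by the runtime implementation
write_coeffs(memory, sparse_coeffs)

# Runtime phase: the stored sparse coefficient set is read back for use.
memory.seek(0)
runtime_coeffs = read_coeffs(memory)
print(runtime_coeffs)
```

The example values are exactly representable as 32-bit floating-point numbers, so the coefficient set read back at runtime matches the sparse set produced offline.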
[0261] When using the "channel pruning" method described herein, step 1014 may include using coefficient identification logic 412 to identify one or more target coefficient channels based on the updated sparsity parameters, before configuring the runtime implementation of the neural network as described herein with reference to Figure 12.
[0262] Learning or training the sparsity parameters as part of the training process of a neural network is advantageous because it allows the sparsity applied to the coefficient set of each of the multiple layers of the neural network to be optimized, so as to maximize sparsity where doing so does not affect the accuracy of the network, while preserving the density of those coefficient sets to whose sparsity the network is sensitive.
[0263] The implementations of the neural network shown in Figure 1, the data processing systems shown in Figure 4, Figure 5, and Figure 9, and the logic shown in Figure 7a, Figure 7b, Figure 7c, and Figure 7d are illustrated as comprising numerous functional blocks. This is merely illustrative and not intended to define a strict division between the different logical elements of such entities. Each functional block can be provided in any suitable manner. It should be understood that the intermediate values described herein as being formed by the data processing system do not need to be physically generated by the data processing system at any point in time, and may merely represent logical values that conveniently describe the processing performed by the data processing system between its inputs and outputs.
[0264] The data processing system described herein can be embodied in hardware on an integrated circuit. The data processing system described herein can be configured to perform any of the methods described herein. Generally, any of the functions, methods, techniques, or components described above can be implemented in software, firmware, hardware (e.g., a fixed logic circuit system), or any combination thereof. The terms “module,” “function,” “component,” “element,” “cell,” “block,” and “logic” are used herein to generally denote software, firmware, hardware, or any combination thereof. In the case of a software implementation, a module, function, component, element, cell, block, or logic represents program code that, when executed on a processor, performs a specified task. The algorithms and methods described herein can be executed by one or more processors that execute code that causes the processor to perform the algorithm / method. Examples of computer-readable storage media include random access memory (RAM), read-only memory (ROM), optical disk, flash memory, hard disk storage, and other memory devices that can use magnetic, optical, and other techniques to store instructions or other data and can be accessed by a machine.
[0265] As used herein, the terms computer program code and computer-readable instructions refer to any kind of executable code for a processor, comprising code expressed in machine language, interpreted language, or scripting language. Executable code includes binary code, machine code, bytecode, code defining integrated circuits (e.g., hardware description languages or netlists), and code expressed in programming languages such as C, Java, or OpenCL. Executable code can be, for example, any kind of software, firmware, script, module, or library that, when properly executed, processed, interpreted, compiled, or executed in a virtual machine or other software environment, causes the processor of a computer system that supports the executable code to perform tasks specified by said code.
[0266] A processor, computer, or computer system can be any kind of device, machine, or special-purpose circuit, or a collection or part thereof, that has the processing capability to execute instructions. A processor can be or include any kind of general-purpose or special-purpose processor, such as a CPU, GPU, NNA, system-on-a-chip, state machine, media processor, application-specific integrated circuit (ASIC), programmable logic array, field-programmable gate array (FPGA), etc. A computer or computer system may include one or more processors.
[0267] This invention also intends to cover software defining the configuration of hardware as described herein, such as hardware description language (HDL) software, for designing integrated circuits or for configuring programmable chips to perform desired functions. That is, a computer-readable storage medium on which computer-readable program code in the form of an integrated circuit definition dataset is encoded may be provided, which, when processed (i.e., run) in an integrated circuit manufacturing system, configures the system to manufacture a data processing system configured to perform any of the methods described herein, or to manufacture a data processing system including any of the devices described herein. The integrated circuit definition dataset may, for example, be an integrated circuit description.
[0268] Therefore, a method for manufacturing a data processing system as described herein can be provided at an integrated circuit manufacturing system. Furthermore, an integrated circuit definition dataset can be provided, which, when processed in the integrated circuit manufacturing system, causes the method for manufacturing the data processing system to be executed.
[0269] An integrated circuit definition dataset can be in the form of computer code, for example as a netlist, as code for configuring a programmable chip, or as a hardware description language defining hardware suitable for manufacture in an integrated circuit at any level, including as register-transfer level (RTL) code, as high-level circuit representations (such as Verilog or VHDL), and as low-level circuit representations (such as OASIS(RTM) and GDSII). Higher-level representations (such as RTL) that logically define hardware suitable for manufacture in an integrated circuit can be processed at a computer system configured to generate a manufacturing definition of the integrated circuit in the context of a software environment that includes definitions of circuit elements and rules for combining those elements, so as to generate the manufacturing definition of the integrated circuit defined by the representation. As is typically the case where software executes at a computer system to define a machine, one or more intermediate user steps (e.g., providing commands, variables, etc.) may be required in order for a computer system configured to generate a manufacturing definition of an integrated circuit to execute the code defining the integrated circuit and thereby generate that manufacturing definition.
[0270] An example of processing an integrated circuit definition dataset at an integrated circuit manufacturing system, so as to configure the system to manufacture a data processing system, will now be described with reference to Figure 13.
[0271] Figure 13 shows an example of an integrated circuit (IC) manufacturing system 1302 configured to manufacture a data processing system as described in any of the examples herein. Specifically, the IC manufacturing system 1302 includes a layout processing system 1304 and an integrated circuit generation system 1306. The IC manufacturing system 1302 is configured to receive an IC definition dataset (e.g., defining a data processing system as described in any of the examples herein), process the IC definition dataset, and generate an IC (e.g., embodying the data processing system as described in any of the examples herein) based on the IC definition dataset. The processing of the IC definition dataset configures the IC manufacturing system 1302 to manufacture an integrated circuit embodying the data processing system as described in any of the examples herein.
[0272] The layout processing system 1304 is configured to receive and process an IC definition dataset to determine a circuit layout. Methods for determining a circuit layout based on an IC definition dataset are known in the art and may involve, for example, synthesizing RTL code to determine the gate-level representation of the circuit to be generated, for example, in relation to logic components (e.g., NAND, NOR, AND, OR, MUX, and FLIP-FLOP components). By determining the location information of the logic components, the circuit layout can be determined based on the gate-level representation of the circuit. This can be done automatically or with user intervention to optimize the circuit layout. Once the layout processing system 1304 has determined the circuit layout, it can output the circuit layout definition to the IC generation system 1306. The circuit layout definition may be, for example, a circuit layout description.
[0273] As is known in the art, IC generation system 1306 generates ICs according to a circuit layout definition. For example, IC generation system 1306 can implement a semiconductor device manufacturing process for generating ICs, which may involve a multi-step sequence of photolithography and chemical processing steps, during which electronic circuits are gradually formed on a wafer made of semiconductor material. The circuit layout definition may be in the form of a mask, which can be used in the photolithography process to generate ICs according to the circuit definition. Alternatively, the circuit layout definition provided to IC generation system 1306 may be in the form of computer-readable code, which IC generation system 1306 can use to form a suitable mask for generating ICs.
[0274] The various processes performed by the IC manufacturing system 1302 may all be implemented in one location, for example, by one party. Alternatively, the IC manufacturing system 1302 may be a distributed system, allowing some processes to be performed in different locations and by different parties. For example, some of the following stages may be performed in different locations and / or by different parties: (i) synthesizing RTL code representing an IC definition dataset to form a gate-level representation of the circuit to be generated; (ii) generating a circuit layout based on the gate-level representation; (iii) forming a mask based on the circuit layout; and (iv) using the mask to manufacture the integrated circuit.
[0275] In other examples, processing an integrated circuit definition dataset at an integrated circuit manufacturing system can configure the system to manufacture a data processing system without the IC definition dataset being processed to determine a circuit layout. For example, an integrated circuit definition dataset can define the configuration of a reconfigurable processor such as an FPGA, and processing that dataset can configure the IC manufacturing system (e.g., by loading the configuration data into the FPGA) to generate a reconfigurable processor with that defined configuration.
[0276] In some implementations, an integrated circuit manufacturing definition dataset, when processed in an integrated circuit manufacturing system, can cause the integrated circuit manufacturing system to generate a device as described herein. For example, configuring an integrated circuit manufacturing system by means of an integrated circuit manufacturing definition dataset, in the manner described above with reference to Figure 13, can cause a device as described herein to be manufactured.
[0277] In some examples, an integrated circuit definition dataset may include software that runs on hardware defined at the dataset, or that runs in combination with hardware defined at the dataset. In the example shown in Figure 13, the IC generation system can be further configured by the integrated circuit definition dataset to, when manufacturing the integrated circuit, load firmware onto the integrated circuit according to program code defined at the integrated circuit definition dataset, or otherwise provide the integrated circuit with program code for use with the integrated circuit.
[0278] Compared to known implementations, the concepts set forth in this application can lead to performance improvements in devices, apparatuses, modules, and / or systems (and in the methods implemented herein). Performance improvements may include one or more of increased computational performance, reduced latency, increased throughput, and / or reduced power consumption. During the manufacture of such devices, apparatuses, modules, and systems (e.g., in integrated circuits), performance improvements can be traded off against the physical implementation, thereby improving the method of manufacture. For example, a performance improvement can be traded off against layout area, matching the performance of known implementations while using less silicon. This can be accomplished, for example, by reusing functional blocks serially or sharing functional blocks among elements of a device, apparatus, module, and / or system. Conversely, the concepts set forth in this application that lead to improvements in the physical implementation of devices, apparatuses, modules, and systems (such as reduced silicon area) can be traded off for performance improvements. This can be accomplished, for example, by manufacturing multiple instances of a module within a predefined area budget.
[0279] The applicant has independently disclosed each individual feature described herein, as well as any combination of two or more such features, to the extent that such features or combinations can be implemented based on the specification as a whole, in accordance with the common knowledge of those skilled in the art, regardless of whether such features or combinations of features solve any problem disclosed herein. In view of the foregoing description, those skilled in the art will understand that various modifications can be made within the scope of this invention.
Claims
1. A computer-implemented method for compressing a set of coefficients for subsequent use in a neural network, the set of coefficients comprising a plurality of non-zero coefficients, the method comprising: applying sparsity to a plurality of coefficient groups of the coefficient set, each coefficient group comprising a predefined number of coefficients, wherein each coefficient group comprises one or more subsets of coefficients of the coefficient set, each coefficient group comprises n coefficients and each subset of coefficients comprises m coefficients, where m is greater than 1 and n is an integer multiple of m, and wherein applying sparsity to a coefficient group comprises setting each coefficient in that group to zero; compressing the plurality of coefficient groups to which sparsity has been applied, according to a compression scheme aligned with the plurality of coefficient groups, by compressing the one or more subsets of coefficients comprised in each of the plurality of coefficient groups, wherein each of the subsets of coefficients that is compressed comprises m zero coefficients, so as to represent each subset of coefficients by an integer number of one or more compressed values; and storing the compressed coefficient set in memory for subsequent use in the neural network.
2. The computer-implemented method according to claim 1, wherein n is greater than m, and each coefficient group is compressed by compressing multiple adjacent or interleaved subsets of coefficients.
3. The computer-implemented method according to claim 1 or 2, wherein n is equal to 2m.
4. The computer-implemented method of claim 3, wherein each group comprises 16 coefficients, and each subset comprises 8 coefficients, and wherein each group is compressed by compressing two adjacent or interleaved subsets of coefficients.
5. The computer-implemented method according to claim 1, wherein n is equal to m.
6. The computer-implemented method according to claim 1 or 2, wherein sparsity is applied to the plurality of coefficient groups according to a sparsity mask, the sparsity mask defining the coefficients of the coefficient set to which sparsity is to be applied.
7. The computer-implemented method of claim 6, wherein the coefficient set is a coefficient tensor, the sparsity mask is a binary tensor having the same dimension as the coefficient tensor, and sparsity is applied by performing element-wise multiplication of the coefficient tensor and the sparsity mask tensor.
8. The computer-implemented method of claim 7, wherein the sparsity mask tensor is formed by: generating a reduced tensor having one or more dimensions, wherein the corresponding dimension of the coefficient tensor is an integer multiple of each of the one or more dimensions, the integer being greater than 1; determining the elements of the reduced tensor to which sparsity is to be applied, so as to generate a reduced sparsity mask tensor; and expanding the reduced sparsity mask tensor so as to generate a sparsity mask tensor having the same dimensions as the coefficient tensor.
9. The computer-implemented method of claim 8, wherein generating the reduced tensor comprises: dividing the coefficient tensor into multiple coefficient groups, such that each coefficient in the set is assigned to only one group and all of the coefficients are assigned to groups; and representing each coefficient group of the coefficient tensor by the maximum coefficient value within that group.
10. The computer-implemented method of claim 8, further comprising expanding the reduced sparse mask tensor by performing nearest-neighbor upsampling, such that each value in the reduced sparse mask tensor is represented by a group comprising a plurality of identical values in the sparse mask tensor.
11. The computer-implemented method according to claim 1 or 2, wherein compressing a subset of coefficients comprising m zero coefficients comprises: encoding each coefficient in the subset of coefficients using zero bits of data, and generating header data that uses one bit of data to indicate that each coefficient in the subset of coefficients is encoded using zero bits of data.
12. The computer-implemented method according to claim 1 or 2, wherein the number of groups to which sparsity needs to be applied is determined based on a sparsity parameter.
13. The computer-implemented method according to claim 12, further comprising: dividing the coefficient set into multiple coefficient groups, such that each coefficient in the set is assigned to only one group and all of the coefficients are assigned to groups; determining the significance of each coefficient group; and applying sparsity to the plurality of coefficient groups having a significance below a threshold, the threshold being determined based on the sparsity parameter, wherein optionally the threshold is the maximum absolute coefficient value or the average absolute coefficient value.
14. The computer-implemented method according to claim 1 or 2, further comprising using the compressed coefficient set in the neural network.
15. The computer-implemented method of claim 1 or 2, the method comprising implementing the neural network at a data processing system by configuring a hardware-implemented neural network accelerator at the data processing system to execute the neural network using the compressed set of coefficients.
16. The computer-implemented method of claim 1 or 2, the method comprising processing image data using the compressed coefficient set, the image data representing one or more images input to the neural network.
17. The computer-implemented method according to claim 1 or 2, wherein compressing a subset of coefficients comprises: identifying a number of bits sufficient to encode the largest coefficient value in the subset, and encoding each coefficient in the subset using that number of bits.
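The bit-width selection in claim 17 can be illustrated with a short sketch. It assumes non-negative integer (for example, quantized) coefficients and ignores sign handling, which the claim does not detail; `compress_subset` is a hypothetical name.

```python
def compress_subset(subset):
    """Return (bit_width, body_bits) for a subset of non-negative integers.

    The bit width is the smallest width that can hold the largest value in
    the subset; every coefficient is then written with that same width.
    """
    if any(c < 0 for c in subset):
        raise ValueError("this sketch handles non-negative integers only")
    bit_width = max(subset).bit_length()
    if bit_width == 0:
        return 0, ""  # all-zero subset: zero bits per coefficient
    body_bits = "".join(format(c, f"0{bit_width}b") for c in subset)
    return bit_width, body_bits


width, body = compress_subset([5, 1, 0, 3])  # largest value 5 needs 3 bits
```

An all-zero subset falls out as the special case of width zero, which is why setting whole groups to zero compresses so well under this kind of scheme.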
18. The computer-implemented method according to claim 1 or 2, the method comprising: applying, at a processor, sparsity to the plurality of coefficient groups; compressing, at the processor, the coefficient groups according to a compression scheme registered with the coefficient groups; and storing the compressed set of coefficients in memory for subsequent use in a neural network implemented in hardware at a neural network accelerator.
19. A data processing system for compressing a set of coefficients for subsequent use in a neural network, the set of coefficients comprising a plurality of non-zero coefficients, the data processing system comprising: pruning logic configured to apply sparsity to a plurality of coefficient groups of the coefficient set, each coefficient group comprising a predefined number of coefficients, wherein each coefficient group comprises one or more subsets of coefficients of the coefficient set, each coefficient group comprises n coefficients and each subset of coefficients comprises m coefficients, where m is greater than 1 and n is an integer multiple of m, and wherein applying sparsity to a coefficient group comprises setting each coefficient in that group to zero; compression logic configured to compress the plurality of coefficient groups to which sparsity has been applied, according to a compression scheme registered with the plurality of coefficient groups, by compressing the one or more subsets of coefficients included in each of the plurality of coefficient groups, wherein each of the subsets of coefficients to be compressed comprises m zero coefficients, so as to represent each coefficient group by an integer number of one or more compressed values; and a memory configured to store the compressed set of coefficients for subsequent use in a neural network.
20. A non-transitory computer-readable storage medium storing computer-readable instructions thereon, the computer-readable instructions, when executed at a computer system, causing the computer system to perform a method for compressing a set of coefficients for subsequent use in a neural network, the set of coefficients comprising a plurality of non-zero coefficients, the method comprising: applying sparsity to a plurality of coefficient groups of the coefficient set, each coefficient group comprising a predefined number of coefficients, wherein each coefficient group comprises one or more subsets of coefficients of the coefficient set, each coefficient group comprises n coefficients and each subset of coefficients comprises m coefficients, where m is greater than 1 and n is an integer multiple of m, and wherein applying sparsity to a coefficient group comprises setting each coefficient in that group to zero; compressing the plurality of coefficient groups to which sparsity has been applied, according to a compression scheme registered with the plurality of coefficient groups, by compressing the one or more subsets of coefficients included in each of the plurality of coefficient groups, wherein each of the subsets of coefficients to be compressed comprises m zero coefficients, so as to represent each coefficient group by an integer number of one or more compressed values; and storing the compressed set of coefficients in memory for subsequent use in the neural network.
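The relationship between n and m recited in claims 19 and 20 can be made concrete with a small sketch: because n is an integer multiple of m, a pruned group of n zeros splits into exactly n/m all-zero subsets, and so maps to a whole number of compressed values. The record layout (`bits_per_coeff`, `payload`) and the function name are hypothetical and serve only as an illustration.

```python
def compress_pruned_group(group, m):
    """Compress a pruned (all-zero) group of n coefficients, one value per subset of m.

    Assumption: each compressed value is a small record holding the number of
    bits per coefficient and the packed payload; both are trivial for zeros.
    """
    n = len(group)
    if m <= 1 or n % m != 0:
        raise ValueError("m must exceed 1 and n must be an integer multiple of m")
    compressed = []
    for start in range(0, n, m):
        subset = group[start:start + m]
        if any(c != 0 for c in subset):
            raise ValueError("sparsity has not been applied to this group")
        compressed.append({"bits_per_coeff": 0, "payload": ""})
    return compressed


# n = 8 and m = 4: the pruned group is represented by exactly two compressed values.
values = compress_pruned_group([0] * 8, m=4)
```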