Operation method of convolutional neural network and related device

By applying group quantization in convolutional neural networks and converting the resulting floating-point multiplications into shift operations, the large error of uniform quantization and the high computational cost of group quantization are both addressed, reducing computational load, improving efficiency, and allowing the training process to converge.

CN114418057B · Active · Publication Date: 2026-06-12 · HUAWEI TECH CO LTD +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: HUAWEI TECH CO LTD
Filing Date: 2020-10-28
Publication Date: 2026-06-12

Application Information

Patent Timeline

28 Oct 2020: Application
12 Jun 2026: Publication (CN114418057B)

IPC: G06N3/0464; G06F18/214; G06F17/15; G06F7/483

CPC: G06F7/483; G06F17/153; G06N3/045; G06F18/214

AI Tagging

Application Domain

Digital data processing details; Biological models

AI Technical Summary

⚠Technical Problem

In existing technologies, uniform quantization introduces large quantization errors into convolutional neural networks, while group quantization adds computationally expensive floating-point multiplications, making it difficult to effectively reduce computational load and improve computational efficiency.

⚗Method used

A group quantization method is adopted that selects first quantization factors with the same mantissa to quantize the input data, quantizes the weight parameters, and converts the floating-point multiplications into shift operations, thereby reducing the amount of calculation and improving calculation efficiency.

🎯Benefits of technology

While reducing computational load and improving computational efficiency, it effectively reduces quantization error, enabling the convolutional neural network to reach the convergent target solution during training and achieving efficient computation for low-bit training.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides a convolutional neural network operation method and related equipment, the method comprising: quantizing input data containing Cin channels according to Cin first quantization factors to obtain Cin data groups, wherein the Cin first quantization factors are floating-point numbers, the Cin first quantization factors have the same mantissa, the Cin first quantization factors correspond to the Cin channels one by one, Cin is a positive integer; quantizing first weight parameters corresponding to a target convolution kernel to obtain second weight parameters; performing convolution calculation on the Cin data groups and the second weight parameters; and performing shift calculation on the result of the convolution calculation to obtain an operation result. The embodiments of the application can reduce the calculation amount of the convolutional neural network and improve the calculation efficiency.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to a method for operating a convolutional neural network and related equipment.

Background Technology

[0002] Matrix multiplication and addition operations account for over 90% of the total computation in Convolutional Neural Networks (CNNs), making acceleration of these operations a primary means of speeding up CNN inference and training. Representing high-bit data in a low-bit format can effectively reduce computational load and improve efficiency. Current technologies often employ uniform quantization or group quantization methods to convert high-bit data into low-bit representations.

[0003] However, uniform quantization introduces significant quantization errors when used to quantize high-bit data, making it difficult to train a convolutional neural network to converge to the target solution. While group quantization reduces quantization errors, it introduces floating-point multiplication, which is computationally expensive, contradicting the goal of reducing computational complexity and improving efficiency. Therefore, how to reduce the computational complexity and improve the efficiency of convolutional neural networks through quantization remains a major challenge.

Summary of the Invention

[0004] This application discloses a method for operating a convolutional neural network and related equipment, which can effectively reduce the computational load and improve the computational efficiency of the convolutional neural network.

[0005] The first aspect of this application discloses a computing device for a convolutional neural network. The device includes a floating-point arithmetic logic unit and a convolution module connected in communication. The floating-point arithmetic logic unit is used to quantize input data containing Cin channels according to Cin first quantization factors to obtain Cin data groups, wherein the Cin first quantization factors are floating-point numbers, the mantissas of the Cin first quantization factors are the same, the Cin first quantization factors correspond one-to-one with the Cin channels, and Cin is a positive integer. The floating-point arithmetic logic unit is also used to quantize the first weight parameter corresponding to the target convolution kernel to obtain the second weight parameter. The convolution module is used to perform convolution calculation on the Cin data groups and the second weight parameter, and to perform shift calculation on the result of the convolution calculation to obtain the operation result. In this embodiment, for the operation of a single convolutional kernel, when grouping and quantizing the input data, the first quantization factor with the same mantissa is selected, so that the floating-point multiplication calculation brought in by the grouping and quantization in the convolution operation is converted into a shift operation, thereby greatly reducing the computational amount corresponding to a single convolutional kernel and improving the computational efficiency of a single convolutional kernel. Moreover, compared with uniform quantization, the quantization method provided in this embodiment can reduce the computational amount and improve the computational efficiency while effectively reducing the quantization error, so that the convergent target solution can be obtained during the training process of the convolutional neural network, and convergence can be achieved through low-bit training. 
For the convolutional neural network as a whole, if all of its convolution operations adopt the quantization method provided in this embodiment, the computational amount of the entire network can be effectively reduced and its computational efficiency can be improved.

[0006] In one exemplary embodiment, the convolution module includes: a low-bit multiplier for performing convolution calculation on the Cin data groups and the second weight parameter, wherein the result of the convolution calculation is Cin integers; and a floating-point adder for performing the following step for each of the Cin integers to obtain Cin floating-point numbers: shifting a target integer according to a first coefficient to obtain the floating-point number corresponding to the target integer, wherein the target integer is any one of the Cin integers, the first coefficient is determined according to the exponent of a first target quantization factor, the first target quantization factor is the first quantization factor corresponding to a first target channel, and the first target channel is the channel corresponding to the target integer. The floating-point adder is also used to accumulate the Cin floating-point numbers. In this example, during the convolution operation after group quantization, the Cin data groups are first convolved with the second weight parameter; that is, the Cin data groups are multiplied and accumulated with the quantized second weight parameter to obtain Cin integers. Each of the Cin integers is then shifted, which is equivalent to multiplying it by a first coefficient determined by the exponent of the first quantization factor corresponding to that integer, yielding the corresponding floating-point number. In this way, Cin floating-point numbers are calculated from the Cin integers. The Cin floating-point numbers are accumulated, and the operation result is calculated from the accumulated result. Because the mantissas of the Cin first quantization factors are the same, the common mantissa of the first quantization factors and the second quantization factor can be factored out of the accumulation of the Cin floating-point numbers and multiplied by the accumulated result at the end to obtain the convolution operation result. That is, the multiplication of floating-point numbers by integers in the convolution operation after group quantization is transformed into shift operations on integers, which can effectively reduce the amount of calculation and improve the calculation efficiency.
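
The shift-then-accumulate flow described in this embodiment can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the function name `grouped_conv_shift`, the sample values, and the use of `math.frexp` / `math.ldexp` to model the exponent shift in software are all assumptions, and each channel's convolution is reduced to a single dot product for brevity.

```python
import math

def grouped_conv_shift(q_inputs, q_weights, input_factors, weight_factor):
    """Dequantize a group-quantized dot product with one shared-mantissa multiply.

    q_inputs / q_weights: one list of integers per input channel (hypothetical layout).
    input_factors: per-channel first quantization factors, assumed to share a mantissa.
    weight_factor: the second quantization factor of the kernel.
    """
    # Split every factor into mantissa and exponent: s == m * 2**e.
    mantissas, exponents = zip(*(math.frexp(s) for s in input_factors))
    if any(m != mantissas[0] for m in mantissas):
        raise ValueError("first quantization factors must share one mantissa")

    acc = 0.0
    for qx, qw, e in zip(q_inputs, q_weights, exponents):
        partial = sum(a * b for a, b in zip(qx, qw))  # integer multiply-accumulate
        acc += math.ldexp(partial, e)                 # scale by 2**e (the "shift")
    # The shared mantissa and the weight factor are applied once, after accumulation.
    return mantissas[0] * weight_factor * acc

# Three channels whose factors are 0.75 * 2**-1, 0.75 * 2**0 and 0.75 * 2**1.
input_factors = [0.375, 0.75, 1.5]
weight_factor = 0.5
q_inputs = [[1, 2], [3, -1], [0, 4]]
q_weights = [[2, 1], [1, 1], [-1, 2]]

shifted = grouped_conv_shift(q_inputs, q_weights, input_factors, weight_factor)

# Reference: one floating-point multiplication per channel, as in plain group quantization.
reference = sum(
    s * weight_factor * sum(a * b for a, b in zip(qx, qw))
    for s, qx, qw in zip(input_factors, q_inputs, q_weights)
)
print(shifted, reference)
```

With these sample values both paths give the same number, while the shift-based path needs only one pair of floating-point multiplications per output instead of one per channel.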

[0007] In one exemplary embodiment, the floating-point arithmetic logic unit is further configured to: calculate the operation result based on the mantissa of the first quantization factor, the second quantization factor, and the result of the accumulation calculation, wherein the second quantization factor is the quantization factor used to quantize the first weight parameter. In this example, the Cin floating-point numbers are first accumulated, and the result of the accumulation is then multiplied by the mantissa of the first quantization factor and by the second quantization factor to obtain the operation result. Because the mantissas of the Cin first quantization factors are the same, the common mantissa and the second quantization factor can be factored out of the accumulation of the Cin floating-point numbers and applied once to the accumulated result to obtain the convolution operation result. In this way, the multiplication of floating-point numbers by integers in the convolution operation after group quantization is converted into shift operations on integers, thereby effectively reducing the amount of calculation and improving the calculation efficiency.

[0008] In one exemplary implementation, the floating-point arithmetic logic unit is further configured to: before quantizing the first weight parameters corresponding to the target convolutional kernel to obtain the second weight parameters; obtain the maximum weight parameter among the first weight parameters corresponding to the target convolutional kernel; and calculate a second quantization factor based on the maximum weight parameter and the quantization bit width, wherein the second quantization factor is a quantization factor used to quantize the first weight parameters. In this example, for a single convolutional kernel (i.e., the target convolutional kernel), the maximum weight parameter among the first weight parameters corresponding to the convolutional kernel is obtained, and then the second quantization factor corresponding to the convolutional kernel is calculated based on the maximum weight parameter and the quantization bit width; thus, for all convolutional kernels, the corresponding second quantization factor can be calculated, which is beneficial for using different second quantization factors to quantize the first weight parameters of different convolutional kernels and reducing the quantization error of weight quantization.
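
As a rough illustration of computing one second quantization factor per convolution kernel, the sketch below uses a common symmetric scheme in which the factor is the largest absolute weight divided by the largest representable signed integer. The text above only states that the factor is calculated from the maximum weight parameter and the quantization bit width, so this exact formula, the name `weight_factor`, and the sample kernels are assumptions.

```python
def weight_factor(kernel_weights, bits=8):
    """Per-kernel quantization factor (assumed formula: max|w| / (2**(bits-1) - 1))."""
    max_abs = max(abs(w) for w in kernel_weights)
    qmax = 2 ** (bits - 1) - 1
    # Fall back to 1.0 for an all-zero kernel to avoid a zero factor.
    return max_abs / qmax if max_abs > 0 else 1.0

# Two hypothetical kernels with very different weight ranges.
kernels = {
    "k0": [0.1, -0.4, 0.25],
    "k1": [2.0, -6.35, 1.5],
}
factors = {name: weight_factor(w) for name, w in kernels.items()}

# Quantize each kernel with its own factor: divide, then round to the nearest integer.
quantized = {
    name: [round(w / factors[name]) for w in weights]
    for name, weights in kernels.items()
}
print(factors, quantized)
```

Because each kernel gets its own factor, the kernel with small weights still uses the full integer range instead of being squeezed into a few levels by a factor sized for the larger kernel.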

[0009] In one exemplary embodiment, the floating-point arithmetic logic unit is further configured to: before quantizing the input data containing Cin channels according to Cin first quantization factors to obtain Cin data groups; obtain the maximum data parameter in the input data of the second target channel, wherein the second target channel is any one of the Cin channels; calculate the second target quantization factor according to the maximum data parameter in the input data of the second target channel and the quantization bit width; select the third target quantization factor from a preset quantization factor set according to the second target quantization factor, wherein the third target quantization factor is the preset quantization factor with the smallest absolute value of the difference between the third target quantization factor and the second target quantization factor in the preset quantization factor set, and the third target quantization factor is the first quantization factor corresponding to the second target channel. In this example, for the input data of a single channel (i.e., the second target channel), the maximum data parameter in the input data of that channel is obtained. Then, the second target quantization factor is calculated based on the maximum data parameter and the quantization bit width. Next, a third target quantization factor is selected from a preset quantization factor set based on the second target quantization factor. The third target quantization factor is the preset quantization factor with the smallest absolute value of the difference between it and the second target quantization factor in the preset quantization factor set. The third target quantization factor is the first quantization factor corresponding to the second target channel. 
Therefore, for the input data of each of the Cin channels, the corresponding second target quantization factor can be calculated, and the corresponding third target quantization factor can be selected from the preset quantization factor set based on this second target quantization factor, thus obtaining Cin first quantization factors, i.e., one first quantization factor for each channel. This facilitates the quantization of the input data of Cin channels using Cin first quantization factors, i.e., using different first quantization factors to quantize the input data of different channels, reducing the quantization error of the data quantization.

[0010] In one exemplary embodiment, the floating-point arithmetic logic unit is further configured to: before selecting a third target quantization factor from a preset quantization factor set based on a second target quantization factor; obtain the maximum data parameter among the input data of Cin channels; calculate a fourth target quantization factor based on the maximum data parameter among the input data of Cin channels and the quantization bit width; and calculate Cin preset quantization factors based on the fourth target quantization factor and Cin second coefficients, wherein the Cin second coefficients correspond one-to-one with the Cin preset quantization factors, and the Cin preset quantization factors constitute a preset quantization factor set. In this example, the maximum data parameter among the input data of Cin channels is obtained. Then, a fourth target quantization factor is calculated based on the maximum data parameter and the quantization bit width. Next, Cin preset quantization factors are calculated based on the fourth target quantization factor and Cin second coefficients. These Cin preset quantization factors constitute a preset quantization factor set. Since these Cin preset quantization factors are calculated from the same fourth target quantization factor and Cin different second coefficients, their mantissas are the same. The Cin first quantization factors used for group quantization of the input data are selected from these Cin preset quantization factors, so their mantissas are also the same. This facilitates converting the floating-point multiplication calculation introduced by group quantization in convolution operations into shift operations, reducing computational load and improving computational efficiency.
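
The construction of the preset quantization factor set and the per-channel selection can be sketched as below. Several details are assumptions made only for illustration: the factor formula (maximum absolute value divided by the largest signed integer for the bit width), the choice of second coefficients as the powers of two 2^0, 2^-1, ..., 2^-(Cin-1), and all function names. Power-of-two coefficients are used because they leave the mantissa of the fourth target quantization factor unchanged, which matches the stated property that all preset factors share one mantissa.

```python
import math

def factor_from_max(max_abs, bits=8):
    # Assumed formula: largest magnitude mapped to the largest signed integer.
    return max_abs / (2 ** (bits - 1) - 1)

def preset_factors(channels, bits=8):
    """Build Cin preset factors from the global maximum (the 'fourth target' factor)."""
    global_max = max(abs(v) for ch in channels for v in ch)
    base = factor_from_max(global_max, bits)
    # Assumed second coefficients: 2**0, 2**-1, ..., 2**-(Cin-1).
    return [math.ldexp(base, -k) for k in range(len(channels))]

def select_first_factors(channels, bits=8):
    """For each channel, pick the preset factor closest to its own ideal factor."""
    presets = preset_factors(channels, bits)
    chosen = []
    for ch in channels:
        target = factor_from_max(max(abs(v) for v in ch), bits)  # 'second target' factor
        chosen.append(min(presets, key=lambda p: abs(p - target)))
    return chosen

# Three hypothetical channels with different value ranges (assumed non-zero).
channels = [[127.0, -3.0], [30.0, 10.0], [-60.0, 5.0]]
first_factors = select_first_factors(channels)
shared_mantissas = {math.frexp(s)[0] for s in first_factors}
print(first_factors, shared_mantissas)
```

Here the channels receive the factors 1.0, 0.25 and 0.5, which differ only in their exponents, so the later dequantization can use shifts for the per-channel part.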

[0011] The second aspect of this application discloses a method for operating a convolutional neural network, comprising: quantizing input data containing Cin channels according to Cin first quantization factors to obtain Cin data groups, wherein the Cin first quantization factors are floating-point numbers, the Cin first quantization factors have the same mantissa, the Cin first quantization factors correspond one-to-one with the Cin channels, and Cin is a positive integer; quantizing the first weight parameter corresponding to the target convolution kernel to obtain a second weight parameter; performing convolution calculation on the Cin data groups and the second weight parameter; and performing a shift calculation on the result of the convolution calculation to obtain the operation result. In this embodiment, for the operation of a single convolutional kernel, when grouping and quantizing the input data, the first quantization factor with the same mantissa is selected, so that the floating-point multiplication calculation brought in by the grouping and quantization in the convolution operation is converted into a shift operation, thereby greatly reducing the computational amount corresponding to a single convolutional kernel and improving the computational efficiency of a single convolutional kernel. Moreover, compared with uniform quantization, the quantization method provided in this embodiment can reduce the computational amount and improve the computational efficiency while effectively reducing the quantization error, so that the convergent target solution can be obtained during the training process of the convolutional neural network, and convergence can be achieved through low-bit training. 
For the convolutional neural network as a whole, if all of its convolution operations adopt the quantization method provided in this embodiment, the computational amount of the entire network can be effectively reduced and its computational efficiency can be improved.

[0012] In one exemplary implementation, the result of the convolution calculation is Cin integers. The result of the convolution calculation is then shifted to obtain the final result, including: for each of the Cin integers, the following steps are performed to obtain Cin floating-point numbers: a target integer is shifted according to a first coefficient to obtain the corresponding floating-point number, wherein the target integer is any one of the Cin integers, the first coefficient is determined based on the exponent of a first target quantization factor, the first target quantization factor is the first quantization factor corresponding to the first target channel, and the first target channel is the channel corresponding to the target integer; the final result is calculated based on the mantissa of the first quantization factor, a second quantization factor, and the Cin floating-point numbers, wherein the second quantization factor is the quantization factor used to quantize the first weight parameter. In this example, during the convolution operation after group quantization, the Cin data groups and the second weight parameter are first convolved. This involves multiplying and summing the Cin data groups according to the quantized second weight parameter to obtain Cin integers. Then, each of these Cin integers is shifted. This means each integer is multiplied by a first coefficient determined by the exponent of the first quantization factor corresponding to that integer, resulting in a floating-point number. Thus, Cin floating-point numbers can be calculated from the Cin integers. Finally, the result is calculated using the mantissa of the first quantization factor, the second quantization factor, and the Cin floating-point numbers. The second quantization factor is used to quantize the first weight parameter. 
That is, the Cin floating-point numbers are first accumulated, and the accumulated result is then multiplied by the mantissa of the first quantization factor and by the second quantization factor to obtain the operation result. Because the mantissas of the Cin first quantization factors are the same, the common mantissa and the second quantization factor can be factored out of the accumulation of the Cin floating-point numbers and applied once to the accumulated result to obtain the convolution operation result. In this way, the multiplication of floating-point numbers by integers in the convolution operation after group quantization is transformed into shift operations on integers, which can effectively reduce the amount of calculation and improve the calculation efficiency.

[0013] In one exemplary implementation, before quantizing the first weight parameters corresponding to the target convolutional kernel to obtain the second weight parameters, the method further includes: obtaining the maximum weight parameter among the first weight parameters corresponding to the target convolutional kernel; and calculating a second quantization factor based on the maximum weight parameter and the quantization bit width, wherein the second quantization factor is a quantization factor used to quantize the first weight parameters. In this example, for a single convolutional kernel (i.e., the target convolutional kernel), the maximum weight parameter among the first weight parameters corresponding to the convolutional kernel is obtained, and then the second quantization factor corresponding to the convolutional kernel is calculated based on the maximum weight parameter and the quantization bit width; thus, for all convolutional kernels, the corresponding second quantization factor can be calculated, which is beneficial to use different second quantization factors to quantize the first weight parameters of different convolutional kernels and reduce the quantization error of weight quantization.

[0014] In one exemplary implementation, before quantizing the input data containing Cin channels according to Cin first quantization factors to obtain Cin data groups, the method further includes: obtaining the maximum data parameter in the input data of the second target channel, wherein the second target channel is any one of the Cin channels; calculating a second target quantization factor according to the maximum data parameter in the input data of the second target channel and the quantization bit width; and selecting a third target quantization factor from a preset quantization factor set according to the second target quantization factor, wherein the third target quantization factor is the preset quantization factor in the preset quantization factor set whose difference from the second target quantization factor has the smallest absolute value, and the third target quantization factor is the first quantization factor corresponding to the second target channel. In this example, for the input data of a single channel (i.e., the second target channel), the maximum data parameter in the input data of that channel is obtained. Then, the second target quantization factor is calculated based on the maximum data parameter and the quantization bit width. Next, a third target quantization factor is selected from the preset quantization factor set based on the second target quantization factor. The third target quantization factor is the preset quantization factor in the set whose difference from the second target quantization factor has the smallest absolute value, and it is used as the first quantization factor corresponding to the second target channel.
Therefore, for the input data of each of the Cin channels, the corresponding second target quantization factor can be calculated, and the corresponding third target quantization factor can be selected from the preset quantization factor set based on this second target quantization factor, thus obtaining Cin first quantization factors, i.e., one first quantization factor for each channel. This facilitates the quantization of the input data of Cin channels using Cin first quantization factors, i.e., using different first quantization factors to quantize the input data of different channels, reducing the quantization error of the data quantization.

[0015] In one exemplary embodiment, before selecting a third target quantization factor from a preset quantization factor set based on a second target quantization factor, the method further includes: obtaining the maximum data parameter among the input data of Cin channels; calculating a fourth target quantization factor based on the maximum data parameter among the input data of Cin channels and the quantization bit width; and calculating Cin preset quantization factors based on the fourth target quantization factor and Cin second coefficients, wherein the Cin second coefficients correspond one-to-one with the Cin preset quantization factors, and the Cin preset quantization factors constitute a preset quantization factor set. In this example, the maximum data parameter among the input data of Cin channels is obtained. Then, a fourth target quantization factor is calculated based on the maximum data parameter and the quantization bit width. Next, Cin preset quantization factors are calculated based on the fourth target quantization factor and Cin second coefficients. These Cin preset quantization factors constitute a preset quantization factor set. Since these Cin preset quantization factors are calculated from the same fourth target quantization factor and Cin different second coefficients, their mantissas are the same. The Cin first quantization factors used for group quantization of the input data are selected from these Cin preset quantization factors, so their mantissas are also the same. This facilitates converting the floating-point multiplication calculation introduced by group quantization in convolution operations into shift operations, reducing computational load and improving computational efficiency.

[0016] A third aspect of this application discloses a computer device including a processor, a memory, a communication interface, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for performing steps in the method as described in any one of the second aspects above.

[0017] The fourth aspect of this application discloses a chip, characterized in that it includes: a processor for calling and running a computer program from a memory, causing a device equipped with the chip to perform the method described in any of the first aspects above.

[0018] The fifth aspect of this application discloses a computer-readable storage medium storing a computer program for electronic data interchange, wherein the computer program causes a computer to perform the method as described in any one of the second aspects above.

[0019] The sixth aspect of this application discloses a computer program product that causes a computer to perform the method as described in any one of the second aspects above.

Attached Figure Description

[0020] The accompanying drawings used in the embodiments of this application are described below.

[0021] Figure 1 is a schematic diagram of the training process of a convolutional neural network provided in an embodiment of this application;

[0022] Figure 2 is a schematic diagram of quantization in an inference process provided in an embodiment of this application;

[0023] Figure 3 is a schematic diagram of a uniform quantization principle provided in an embodiment of this application;

[0024] Figure 4 is another schematic diagram of a uniform quantization principle provided in an embodiment of this application;

[0025] Figure 5 is a schematic diagram illustrating the maximum value distribution of different channels provided in an embodiment of this application;

[0026] Figure 6 is a schematic diagram of convolution calculation provided in an embodiment of this application;

[0027] Figure 7 is a schematic diagram of the architecture of a low-bit training system provided in an embodiment of this application;

[0028] Figure 8 is a schematic diagram of the structure of a convolutional neural network computing device provided in an embodiment of this application;

[0029] Figure 9 is a flowchart illustrating a method for operating a convolutional neural network according to an embodiment of this application;

[0030] Figure 10 is another schematic diagram of convolution calculation provided in an embodiment of this application;

[0031] Figure 11 is a schematic diagram of another training process for a convolutional neural network provided in an embodiment of this application;

[0032] Figure 12 is a schematic diagram of the structure of a computer device provided in an embodiment of this application.

Detailed Implementation

[0033] The embodiments of this application are described below with reference to the accompanying drawings.

[0034] To facilitate understanding of this application by those skilled in the art, some terms used in this application will be explained and the relevant technical knowledge involved in the embodiments of this application will be introduced.

[0035] Low bit: The main data formats used in current neural network computations are FP32 and FP16. Low bit usually refers to the INT8 / INT4 data format.

[0036] Quantization: In the computation process of neural networks, floating-point data format is converted into integer data format, thereby converting floating-point multiplication into integer multiplication, which can complete the calculation faster and more efficiently. The quantization process usually involves dividing the floating-point number by a quantization factor and then rounding it to the nearest integer.

[0037] Dequantization: After the multiplication of integers in quantization is completed, the integer data result is converted into a floating-point number. This ensures that integer multiplication and floating-point multiplication are mathematically equivalent. The dequantization process usually involves multiplying the integer by the quantization factor to obtain a floating-point number.
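
The quantization and dequantization steps defined above can be illustrated with a minimal sketch. The function names, the sample values, and the clamping to the signed integer range are assumptions added for illustration; the definitions above only specify dividing by the quantization factor and rounding, then multiplying the integer result by the quantization factor.

```python
def quantize(values, factor, bits=8):
    """Divide by the quantization factor and round to the nearest integer."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    # Clamping to the signed range is an assumption, not stated in the text.
    return [max(qmin, min(qmax, round(v / factor))) for v in values]

def dequantize(int_value, factor):
    """Multiply an integer result by the quantization factor to get a float."""
    return int_value * factor

# Hypothetical data and weights with one factor each.
x = [0.5, -1.25, 2.0]
w = [1.0, 0.5, -0.75]
factor_x, factor_w = 0.25, 0.25

qx = quantize(x, factor_x)
qw = quantize(w, factor_w)

# The multiplication runs entirely on integers ...
int_dot = sum(a * b for a, b in zip(qx, qw))
# ... and the integer result is dequantized with the product of both factors.
approx = dequantize(int_dot, factor_x * factor_w)
exact = sum(a * b for a, b in zip(x, w))
print(qx, qw, int_dot, approx, exact)
```

The sample values were chosen to be exact multiples of the factors, so the dequantized integer result matches the floating-point result; with arbitrary data the two differ by the rounding error introduced during quantization.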

[0038] Convolutional Neural Networks (CNNs) are one of the most widely used techniques in the field of deep learning, typically consisting of two phases: training and inference. Taking the ImageNet dataset classification task as an example, the training process can be divided into forward propagation and backpropagation. In forward propagation, training images are input into the CNN, processed through convolutional and fully connected layers, and the probability of the image belonging to a specific class is output. This probability is then compared with the actual image label to obtain the error. In backpropagation, the error is propagated back through each layer of the CNN, where the gradients of the parameters are calculated, and the weight parameters of the CNN are optimized by an optimizer. The inference process builds upon the training, inputting the image to be classified into the trained CNN model and outputting the probability of the image belonging to a specific class.

[0039] Please refer to Figure 1, which is a schematic diagram of the training process of a convolutional neural network provided in an embodiment of this application. As shown in Figure 1, both forward and backward propagation involve a large number of matrix calculations. During forward propagation, the input data ($a_l$) and the weight parameters ($w_l$) of each layer are multiplied to obtain the output data, which is passed to the next layer. During backpropagation, the back-propagated error of the current layer ($\delta c_l$) is multiplied by the weight parameters of the current layer ($w_l$) to obtain the error passed back to the preceding layer, and the back-propagated error of the current layer ($\delta c_l$) is multiplied by the data of the current layer ($a_l$) to obtain the gradient ($\Delta w_l$) of the weight parameters of the current layer.

[0040] Matrix multiplication and addition operations account for over 90% of the total computation in convolutional neural networks (CNNs). Therefore, accelerating matrix computation is a primary means of accelerating CNN inference and training. One major method for accelerating matrix computation is to convert high-bit data formats (such as FP32) to low-bit data formats (such as FP16 / FP8 / INT8). Low-bit data formats effectively improve data transfer efficiency and reduce memory access, thus resulting in higher computational efficiency. This method of converting high-precision data formats to low-precision data formats is collectively referred to in the industry as quantization. Currently, the mainstream training and inference data format is FP32. Quantization to even lower-bit formats like FP16 or FP8 / INT8 / INT4 for inference and training is still being explored in the industry.

[0041] Quantizing data and weights during the inference process is a common strategy. For example, converting FP32 data format to INT16 or INT8 can greatly accelerate the inference process.

[0042] Please refer to Figure 2, which is a schematic diagram of quantization in the inference process provided in an embodiment of this application. As shown in Figure 2, before quantization, both the data matrix and the weight matrix are floating-point numbers, and the inference process is a matrix calculation on floating-point numbers. After quantization, both the data matrix and the weight matrix are integers, and the inference process is a matrix calculation on integers, which accelerates inference.

[0043] Current quantization methods mostly employ uniform quantization, also known as linear quantization.

[0044] Please refer to Figure 3, which is a schematic diagram of the uniform quantization principle provided in an embodiment of this application. As shown in Figure 3, taking INT8 quantization as an example, uniform quantization maps data x in the range [-max, max] to the range [-127, 127]. The mapping process can be represented by formulas (1) and (2).

[0045]

[0046]
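Formulas (1) and (2) are not reproduced in this text. Given the [-127, 127] target range and the description of scale as an amplification factor, they presumably take the following form (an assumption, not the original notation):

```latex
\begin{align}
\mathrm{scale} &= \frac{127}{\max} \tag{1} \\
x_q &= \mathrm{round}(x \cdot \mathrm{scale}) \tag{2}
\end{align}
```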

[0047] In formulas (1) and (2), scale is the quantization factor, which represents the amplification factor for mapping from floating-point numbers to INT numbers; round is the rounding calculation, which approximates the mapped floating-point value to the nearest integer, and is also the root cause of the error brought about by quantization, which will be referred to as rounding error in the following text.

[0048] In uniform quantization, the weight parameters and the data of a given layer of a convolutional neural network are each typically quantized using a single quantization factor, denoted $S_w$ and $S_a$ respectively. The matrix calculation for this layer can be represented by formulas (3), (4), and (5).

[0049]

[0050]

[0051] $wa \approx S_w S_a (w_q a_q)$ (5)

[0052] In formulas (3), (4), and (5), $w$ represents the weight parameters; $w_q$ represents the quantized weight parameters; $a$ represents the data; $a_q$ represents the quantized data; and $wa$ represents the product of the weight parameters and the data before quantization, which is approximately equal to the right-hand side of formula (5). Formula (5) can generally be regarded as the condition that the quantization process needs to satisfy.
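Formulas (3) and (4) are not reproduced in this text. One form consistent with formula (5) and with the dequantization definition in [0037] would be the following (an assumption, not the original notation). Under this reading, $S_w$ and $S_a$ are dequantization multipliers, that is, the reciprocal of an amplification factor such as the scale described in [0047].

```latex
\begin{align}
w &\approx S_w \, w_q \tag{3} \\
a &\approx S_a \, a_q \tag{4}
\end{align}
```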

[0053] Because of the existence of quantization error, choosing the maximum value of the data (max) to calculate the quantization factor is not the best solution. Instead, there exists an optimal max' that minimizes the impact of quantization on the result.

[0054] Please refer to Figure 4, which is another schematic diagram of the uniform quantization principle provided in an embodiment of this application. As shown in Figure 4, narrowing the range of the quantized data during quantization, for example by searching for the optimal value of max' within the range 0.6max to max, can effectively reduce the rounding error caused by quantization and minimize the loss of precision after quantization. However, data outside the range [-max', max'] is then discarded directly, which introduces a certain truncation error.
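The trade-off between rounding error and truncation error can be sketched as a small grid search over max'. This is a hypothetical illustration: the mean-squared-error criterion, the grid of 41 candidates, the synthetic data, and the function names are assumptions, since the text does not state how the optimal max' is found.

```python
import random


def quant_mse(data, clip, levels=127):
    """Mean squared error after clipping to [-clip, clip] and rounding."""
    scale = levels / clip
    err = 0.0
    for x in data:
        clipped = max(-clip, min(clip, x))       # source of truncation error
        x_hat = round(clipped * scale) / scale   # source of rounding error
        err += (x - x_hat) ** 2
    return err / len(data)


def search_clip(data, lo_ratio=0.6, steps=41, levels=127):
    """Grid-search max' in [lo_ratio * max, max] for the smallest error."""
    peak = max(abs(x) for x in data)
    candidates = [peak * (lo_ratio + (1 - lo_ratio) * i / (steps - 1))
                  for i in range(steps)]
    candidates.append(peak)  # make sure the unclipped range is always considered
    return min(candidates, key=lambda c: quant_mse(data, c, levels))


random.seed(0)
# Synthetic data: mostly small values plus a single large outlier.
data = [random.gauss(0.0, 1.0) for _ in range(2000)] + [8.0]
peak = max(abs(x) for x in data)
best = search_clip(data, levels=7)   # levels=7 corresponds to a signed 4-bit range
```

Because the unclipped range is always among the candidates, the selected max' can never give a larger error than plain max under the chosen criterion.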

[0055] Existing quantization techniques are mostly used in inference scenarios, such as INT8 / INT4, and even 2-bit and 1-bit inference have solutions. However, research on training scenarios remains limited to INT8, while there are no solutions for INT4 training scenarios. The reason for this is that directly transferring the quantization methods from INT4 inference to the training scenario requires quantizing not only the matrix calculations in forward propagation but also those in backward propagation. In this case, the quantization error introduced by the uniform quantization method results in a large error in the calculated weight gradient, making it impossible for the convolutional neural network to converge to the target solution.

[0056] Please refer to Figure 5, which is a schematic diagram of the distribution of maximum values across different channels provided in an embodiment of this application. Figure 5 shows the per-channel maximum values of a single layer of the ResNet18 deep residual network. It can be seen that the differences between channels are large. If all channels use a single quantization factor, the rounding error caused by quantization will prevent the training process from converging.

[0057] Therefore, in the quantization of each layer of a convolutional neural network, multiple quantization factors can be selected for the weight parameters and data parameters to reduce quantization error, which is the group-wise quantization proposed in the application embodiments. Group-wise quantization groups the parameters to be quantized according to different dimensions, and selects the most suitable quantization factor for each group, thereby reducing quantization error.
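The benefit of one quantization factor per group can be sketched by comparing the error of a shared factor with that of per-channel factors. The data is synthetic, and the rule used to derive a factor (largest magnitude divided by the largest signed integer) is an assumption made for illustration.

```python
import random


def quant_error(values, factor):
    """Sum of squared errors after quantizing (divide, round) and dequantizing."""
    return sum((v - round(v / factor) * factor) ** 2 for v in values)


def factor_for(values, bits=4):
    """Assumed rule: the largest magnitude maps to the largest signed integer."""
    return max(abs(v) for v in values) / (2 ** (bits - 1) - 1)


random.seed(1)
# Three hypothetical channels whose value ranges differ by orders of magnitude.
channels = [[random.uniform(-r, r) for _ in range(500)] for r in (0.05, 1.0, 20.0)]

shared = factor_for([v for ch in channels for v in ch])   # one factor for all channels
err_shared = sum(quant_error(ch, shared) for ch in channels)
err_grouped = sum(quant_error(ch, factor_for(ch)) for ch in channels)
```

With a shared factor, the two small-range channels collapse almost entirely to zero, which is the kind of error that group-wise quantization is meant to avoid.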

[0058] Please refer to Figure 6, which is a schematic diagram of convolution calculation provided in an embodiment of this application. As shown in Figure 6, in the group quantization process, the dimension of the input data matrix (Feature Map) is BatchSize×Cin×H×W, and the dimension of the weight matrix (Weight) is Cout×Cin×Hk×Wk. Based on the convolution calculation principle, the dimension of the output data matrix (Feature Map) is BatchSize×Cout×H×W. Here, BatchSize represents the number of samples selected in one training iteration, H represents the height of the input data, W represents the width of the input data, Hk represents the height of the convolution kernel, Wk represents the width of the convolution kernel, the subscript k denotes the convolution kernel, and Cin and Cout represent numbers of channels. Each unit of Cout corresponds to one convolution kernel, so the number of cubes in Figure 6, Cout, is the number of convolution kernels.

[0059] Depending on the grouping method, group quantization can be divided into the following three types:

[0060] The first method is Batch Quant: The data matrix (feature map) is grouped according to the BatchSize dimension, and a quantization factor is calculated for each image.

[0061] The second method is Channel Quant: The data matrix (feature map) is grouped according to the Cin dimension, and a quantization factor is calculated for each channel.

[0062] The third method is Kernel Quant: The weight matrix is grouped according to the Cout dimension, and a quantization factor is calculated for each convolution kernel.
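The three grouping methods differ only in the dimension along which maxima are collected. The sketch below is a hypothetical illustration: tensors are stored as dictionaries keyed by index tuples, the sizes are arbitrary, and the factor rule (group maximum divided by the largest signed integer) is an assumption.

```python
from collections import defaultdict
from itertools import product
import random


def group_factors(tensor, axis, bits=8):
    """Return one quantization factor per index along `axis`."""
    peaks = defaultdict(float)
    for idx, value in tensor.items():
        peaks[idx[axis]] = max(peaks[idx[axis]], abs(value))
    return {g: p / (2 ** (bits - 1) - 1) for g, p in peaks.items()}


random.seed(2)
B, CIN, COUT, H, W, HK, WK = 2, 3, 4, 5, 5, 3, 3   # hypothetical sizes
feature = {i: random.gauss(0, 1)
           for i in product(range(B), range(CIN), range(H), range(W))}
weight = {i: random.gauss(0, 1)
          for i in product(range(COUT), range(CIN), range(HK), range(WK))}

batch_quant = group_factors(feature, axis=0)     # Batch Quant: one factor per image
channel_quant = group_factors(feature, axis=1)   # Channel Quant: one factor per input channel
kernel_quant = group_factors(weight, axis=0)     # Kernel Quant: one factor per convolution kernel
```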

[0063] The process of a single convolution kernel performing one convolution is shown in formula (6), where a single convolution kernel refers to one of several convolution kernels between the nth and (n+1)th convolutional layers of a convolutional neural network.

[0064]

[0065] In formula (6), conv(W,A) represents the result of the convolution calculation; the channel index i ranges from 1 to Cin; $w_{ijk}$ represents the weight in the j-th row and k-th column of the i-th channel; and $a_{ijk}$ represents the pixel in the j-th row and k-th column of the i-th channel.

[0066] If the data matrix uses the Channel Quant grouping quantization method, with each channel using a quantization factor; and the weight matrix uses the Kernel Quant grouping quantization method, then formula (6) can be transformed into formula (7).

[0067]

[0068] In formula (7), $S_w S_{ai}$ is a floating-point number, and the remaining term, the product of the quantized weights and the quantized data, is an INT integer multiplication. Formula (7) can be simplified to formula (8).

[0069]

[0070] In formula (8), Float represents a floating-point number; Fix represents an integer (INT). It can be seen that floating-point multiplication is introduced at this time.
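Formulas (6) to (8) are not reproduced in this text. A reconstruction consistent with the symbol definitions in [0065] to [0070] is given below. The superscript q for quantized values and the summation limits Hk and Wk are assumed notation, not taken from the original.

```latex
\begin{align}
\mathrm{conv}(W, A) &= \sum_{i=1}^{C_{in}} \sum_{j=1}^{H_k} \sum_{k=1}^{W_k} w_{ijk}\, a_{ijk} \tag{6} \\
\mathrm{conv}(W, A) &\approx \sum_{i=1}^{C_{in}} S_w S_{ai} \sum_{j=1}^{H_k} \sum_{k=1}^{W_k} w^{q}_{ijk}\, a^{q}_{ijk} \tag{7} \\
\mathrm{conv}(W, A) &\approx \sum_{i=1}^{C_{in}} \mathrm{Float}_i \times \mathrm{Fix}_i \tag{8}
\end{align}
```

In this form, each of the Cin channel terms requires its own floating-point multiplication, which is the cost discussed in the text.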

[0071] As can be seen from the above, although group quantization reduces quantization error, it comes at a huge computational cost. Specifically, the process involves floating-point multiplication, which is costly and contradicts the original intention of quantization to reduce computation.

[0072] The technical solution provided in this application will be described in detail below with reference to specific implementation methods.

[0073] Please refer to Figure 7, which is a schematic diagram of the architecture of a low-bit training system provided in an embodiment of this application. As shown in Figure 7, this system is mainly applied to low-bit training scenarios of convolutional neural networks. Specifically, training data and a user model are input into the system, and the quantization bit width used to quantize the training data during training is specified. The data and the weights are quantized separately to obtain quantized data and quantized weights. Forward and backward calculations are then performed on the quantized data and quantized weights, which starts an efficient model training process and finally yields a low-bit neural network model.

[0074] Please refer to Figure 8, which is a schematic diagram of the structure of a convolutional neural network computing device provided in an embodiment of this application. As shown in Figure 8, the computing device includes a floating-point arithmetic logic unit and a convolution (Conv) module that are communicatively connected. The floating-point arithmetic logic unit is used to quantize input data containing Cin channels according to Cin first quantization factors to obtain Cin data groups, wherein the Cin first quantization factors are floating-point numbers, the mantissas of the Cin first quantization factors are the same, the Cin first quantization factors correspond one-to-one with the Cin channels, and Cin is a positive integer. The floating-point arithmetic logic unit is also used to quantize the first weight parameter corresponding to the target convolution kernel to obtain the second weight parameter. The convolution module is used to perform convolution calculation on the Cin data groups and the second weight parameter, and to perform shift calculation on the result of the convolution calculation to obtain the operation result.

[0075] In this embodiment, for the operation of a single convolution kernel, first quantization factors with the same mantissa are selected when the input data is group-quantized, so that the floating-point multiplication introduced by group quantization in the convolution operation is converted into a shift operation. This greatly reduces the computational load of a single convolution kernel and improves its computational efficiency. Moreover, compared with uniform quantization, the quantization method provided in this embodiment effectively reduces the quantization error while reducing the computational load and improving the computational efficiency, so that the training process of the convolutional neural network can converge to the target solution under low-bit training. If all convolution operations of the convolutional neural network adopt the quantization method provided in this embodiment, the computational load of the entire network can be effectively reduced and its computational efficiency improved.

[0076] In one exemplary embodiment, the convolution module includes: a low-bit multiplier for performing convolution calculations on the Cin data groups and the second weight parameter, wherein the result of the convolution calculation is Cin integers; and a floating-point adder for performing the following step for each of the Cin integers to obtain Cin floating-point numbers: shifting a target integer according to a first coefficient to obtain the floating-point number corresponding to the target integer, wherein the target integer is any one of the Cin integers, the first coefficient is determined according to the exponent of a first target quantization factor, the first target quantization factor is the first quantization factor corresponding to a first target channel, and the first target channel is the channel corresponding to the target integer. The floating-point adder is also used to accumulate the Cin floating-point numbers.

[0077] It should be noted that, compared with existing convolutional neural network computing devices, this computing device adds a floating-point adder to the convolution module for shift calculation and floating-point accumulation.

[0078] In this example, during the convolution operation after group quantization, the Cin data groups are first convolved with the second weight parameter; that is, the Cin data groups are multiplied and accumulated with the quantized second weight parameter to obtain Cin integers. Each of the Cin integers is then shifted; that is, each integer is multiplied by a first coefficient determined by the exponent of the first quantization factor corresponding to that integer, yielding the corresponding floating-point number. In this way, Cin floating-point numbers are obtained from the Cin integers. These Cin floating-point numbers are accumulated, and the operation result is calculated from the accumulated result. Because the mantissas of the Cin first quantization factors are the same, the mantissa of the first quantization factors and the second quantization factor can be factored out of the accumulation and multiplied by the accumulated result at the end to obtain the convolution operation result. That is, the multiplication of floating-point numbers and integers in the convolution operation after group quantization is transformed into shift operations on integers, which effectively reduces the amount of calculation and improves calculation efficiency.

[0079] In one exemplary embodiment, the floating-point arithmetic logic unit is further configured to: calculate the operation result based on the mantissa of the first quantization factor, the second quantization factor, and the result of the cumulative calculation, wherein the second quantization factor is a quantization factor used to quantize the first weight parameter.

[0080] In this example, the Cin floating-point numbers are first accumulated, and the accumulated result is then multiplied by the mantissa of the first quantization factor and by the second quantization factor to obtain the operation result. Because the mantissas of the Cin first quantization factors are the same, this common mantissa and the second quantization factor can be factored out of the accumulation and applied once to the accumulated result to obtain the convolution operation result. That is, the multiplication of floating-point numbers and integers in the convolution operation after group quantization is transformed into shift operations on integers, which effectively reduces the amount of calculation and improves calculation efficiency.
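The replacement of per-channel floating-point multiplication by shifts can be sketched as follows. This is a hypothetical illustration under stated assumptions: the first quantization factor of channel i is taken to be a common base value multiplied by a power of 2, `math.ldexp` stands in for the hardware shift by adjusting the exponent, and all names and values are invented for the example.

```python
import math


def conv_with_float_mul(partial_sums, s_w, s_a):
    """Formula (8) style: one floating-point multiplication per channel."""
    return sum(s_w * s * p for s, p in zip(s_a, partial_sums))


def conv_with_shift(partial_sums, s_w, s_base, shifts):
    """Shift each integer partial sum, accumulate, then scale once at the end."""
    acc = sum(math.ldexp(p, -k) for p, k in zip(partial_sums, shifts))
    return s_w * s_base * acc


# Hypothetical integer partial sums, one per channel, from INT multiply-accumulate.
partial_sums = [118, -42, 77]
s_w, s_base = 0.013, 0.37     # second quantization factor, common base of the first factors
shifts = [0, 2, 5]            # channel i uses s_base * 2**-shifts[i]
s_a = [s_base * 2.0 ** -k for k in shifts]

direct = conv_with_float_mul(partial_sums, s_w, s_a)
shifted = conv_with_shift(partial_sums, s_w, s_base, shifts)
mantissas = {math.frexp(s)[0] for s in s_a}   # identical mantissa for every channel
```

Both paths give the same result up to floating-point rounding, but the shifted path performs the floating-point scaling once instead of once per channel.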

[0081] In one exemplary embodiment, the floating-point arithmetic logic unit is further configured to: obtain the maximum weight parameter among the first weight parameters corresponding to the target convolution kernel before quantizing the first weight parameter to obtain the second weight parameter; and calculate the second quantization factor based on the maximum weight parameter and the quantization bit width, wherein the second quantization factor is a quantization factor used to quantize the first weight parameter.

[0082] In this example, for a single convolution kernel (i.e., the target convolution kernel), the maximum weight parameter among the first weight parameters corresponding to the convolution kernel is obtained, and the second quantization factor corresponding to the convolution kernel is then calculated based on the maximum weight parameter and the quantization bit width. A corresponding second quantization factor can thus be calculated for every convolution kernel, which makes it possible to quantize the first weight parameters of different convolution kernels with different second quantization factors and reduces the quantization error of weight quantization.

[0083] In one exemplary embodiment, the floating-point arithmetic logic unit is further configured to, before quantizing the input data containing Cin channels according to the Cin first quantization factors to obtain the Cin data groups: obtain the maximum data parameter in the input data of a second target channel, wherein the second target channel is any one of the Cin channels; calculate a second target quantization factor according to the maximum data parameter in the input data of the second target channel and the quantization bit width; and select a third target quantization factor from a preset quantization factor set according to the second target quantization factor, wherein the third target quantization factor is the preset quantization factor in the set whose difference from the second target quantization factor has the smallest absolute value, and the third target quantization factor is the first quantization factor corresponding to the second target channel.

[0084] In this example, for the input data of a single channel (i.e., the second target channel), the maximum data parameter in the input data of that channel is obtained. Then, the second target quantization factor is calculated based on the maximum data parameter and the quantization bit width. Next, a third target quantization factor is selected from a preset quantization factor set based on the second target quantization factor. The third target quantization factor is the preset quantization factor with the smallest absolute value of the difference between it and the second target quantization factor in the preset quantization factor set. The third target quantization factor is the first quantization factor corresponding to the second target channel. Therefore, for the input data of each of the Cin channels, the corresponding second target quantization factor can be calculated, and the corresponding third target quantization factor can be selected from the preset quantization factor set based on this second target quantization factor, thus obtaining Cin first quantization factors, i.e., one first quantization factor for each channel. This facilitates the quantization of the input data of Cin channels using Cin first quantization factors, i.e., using different first quantization factors to quantize the input data of different channels, reducing the quantization error of the data quantization.

[0085] In one exemplary embodiment, the floating-point arithmetic logic unit is further configured to, before selecting the third target quantization factor from the preset quantization factor set based on the second target quantization factor: obtain the maximum data parameter among the input data of the Cin channels; calculate a fourth target quantization factor based on the maximum data parameter among the input data of the Cin channels and the quantization bit width; and calculate Cin preset quantization factors based on the fourth target quantization factor and Cin second coefficients, wherein the Cin second coefficients correspond one-to-one with the Cin preset quantization factors, and the Cin preset quantization factors constitute the preset quantization factor set.

[0086] In this example, the maximum data parameter among the input data of Cin channels is obtained. Then, a fourth target quantization factor is calculated based on the maximum data parameter and the quantization bit width. Next, Cin preset quantization factors are calculated based on the fourth target quantization factor and Cin second coefficients. These Cin preset quantization factors constitute a preset quantization factor set. Since these Cin preset quantization factors are calculated from the same fourth target quantization factor and Cin different second coefficients, their mantissas are the same. The Cin first quantization factors used for group quantization of the input data are selected from these Cin preset quantization factors, so their mantissas are also the same. This facilitates converting the floating-point multiplication calculation introduced by group quantization in convolution operations into shift operations, reducing computational load and improving computational efficiency.
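The construction of the preset quantization factor set can be sketched as follows. The text does not give the values of the second coefficients here; the sketch assumes they are the powers of two 2^0, 2^-1, ..., and it assumes the factor rule "maximum divided by the largest signed integer". The function name and the data are hypothetical.

```python
import math


def preset_factor_set(all_channel_values, count, bits=4):
    """Assumed construction: the fourth target factor scaled by 2**-k, k = 0..count-1."""
    peak = max(abs(v) for ch in all_channel_values for v in ch)
    s_a_max = peak / (2 ** (bits - 1) - 1)   # fourth target quantization factor (assumed rule)
    return [math.ldexp(s_a_max, -k) for k in range(count)]


# Hypothetical input data with three channels.
channels = [[0.02, -0.04], [0.9, -0.5], [12.0, -18.5]]
presets = preset_factor_set(channels, count=len(channels))

# frexp splits a float into (mantissa, exponent); scaling by a power of two
# changes only the exponent, so every preset factor has the same mantissa.
mantissas = {math.frexp(s)[0] for s in presets}
```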

[0087] In one exemplary embodiment, the computational device of the convolutional neural network further includes a load module, which is communicatively connected to the convolution module and the floating-point arithmetic logic unit, respectively. The load module is used to obtain input data containing Cin channels and a first weight parameter corresponding to the target convolution kernel from an external storage module (e.g., DDR). The input data containing Cin channels is the input data of convolutional layer n, convolutional layer n is any convolutional layer of the convolutional neural network, the target convolution kernel is any one of the convolution kernels between convolutional layer n and convolutional layer n+1, and the input data of convolutional layer n+1 is the output data of convolutional layer n.

[0088] In one exemplary embodiment, the computational device of the convolutional neural network further includes a save module for storing the computational results to an external storage module.

[0089] The actual product of the computing device for the convolutional neural network can be a low-bit training chip or a low-bit training module integrated into an artificial intelligence chip, which can be deployed on a server or terminal device that can be used for training.

[0090] In one exemplary embodiment, the computing device of the convolutional neural network further includes a cache module, which is communicatively connected to the load module, the save module, the convolution module, and the floating-point arithmetic logic unit, respectively. The cache module is used to cache the input data containing Cin channels, the first weight parameter corresponding to the target convolution kernel, the Cin first quantization factors and the second quantization factor, the Cin data groups, and the second weight parameter.

[0091] In one exemplary embodiment, the cache module includes: a floating-point cache unit (FloatCache) for caching input data containing Cin channels, a first weight parameter corresponding to the target convolution kernel, and Cin first quantization factors and a second quantization factor; and a fixed-point cache unit (Fix Cache) for caching Cin data groups and a second weight parameter.

[0092] Specifically, the load module is responsible for moving data from external memory to the cache module, the save module is responsible for storing the calculation results into external memory, the fixed-point cache unit and the floating-point cache unit cache integer data and floating-point data respectively, the convolution module performs matrix calculations such as integer (INT) multiplication, partial summation and accumulation, and the floating-point arithmetic logic unit performs non-matrix calculations such as batch normalization (BN) layer and nonlinear layer.

[0093] It should be noted that, for the terminology, explanations, and implementation of the operations of each module in the device embodiment described in Figure 8, reference may be made to the relevant descriptions in the method embodiment shown in Figure 9.

[0094] Please refer to Figure 9, which is a flowchart of a method for operating a convolutional neural network provided in an embodiment of this application. The method is executed by a computer device and includes, but is not limited to, the following steps.

[0095] Step 901: Quantize the input data containing Cin channels according to Cin first quantization factors to obtain Cin data groups, wherein Cin first quantization factors are floating-point numbers, the mantissas of Cin first quantization factors are the same, Cin first quantization factors correspond one-to-one with Cin channels, and Cin is a positive integer.

[0096] It should be understood that the input data containing Cin channels is the input data of convolutional layer n, where convolutional layer n is any convolutional layer in the convolutional neural network. The input data of convolutional layer n is any one of the multiple samples in the input data matrix (Feature Map) of convolutional layer n. For example, if the dimension of the input data matrix of convolutional layer n is BatchSize × Cin × H × W, then the input data of convolutional layer n is any one of the BatchSize samples in that matrix. Furthermore, a floating-point number can be represented as a mantissa (Man) and an exponent (Exp); the first quantization factor can likewise be divided into a mantissa and an exponent, and the mantissas of the aforementioned Cin first quantization factors are the same.

[0097] It should be understood that the first quantization factor is the quantization factor corresponding to the data parameters. Before quantization, the input data is in floating-point format, and the Cin data groups are the quantized data, that is, the Cin data groups are in integer format. When the input data is grouped for quantization, the data parameters (that is, the pixel data of the input image) are grouped according to the channels of the convolutional neural network, that is, one channel corresponds to one first quantization factor; for the input data of convolutional layer n, there are Cin channels, each channel corresponds to one first quantization factor, so Cin channels correspond to Cin first quantization factors; when grouping and quantizing the input data of convolutional layer n, the first quantization factor corresponding to each channel is used to quantize the data of each channel, so each channel corresponds to one quantized data group, thus Cin channels correspond to Cin data groups.

[0098] Specifically, when data parameters are quantized in groups, for any convolutional layer n, the dimensions of the input data of the layer are BatchSize×Cin×H×W. The data is grouped according to the BatchSize and Cin dimensions, resulting in a total of BatchSize×Cin groups, and a data quantization factor, i.e., a first quantization factor, is calculated for each group. A single sample has Cin channels, so grouping it according to the channel dimension yields Cin groups. A first quantization factor is calculated for each of the Cin groups, giving Cin first quantization factors in total. The data parameters of each group are quantized using the first quantization factor corresponding to that group, yielding Cin quantized data groups and thereby converting the input data from a floating-point data format into an integer data format.

[0099] For example, for a single sample, if the sample image has three channels (RGB), and it is grouped according to the channel dimension, that is, divided into three groups according to the R channel, G channel, and B channel, then the R channel corresponds to one data group, the G channel corresponds to one data group, and the B channel corresponds to one data group.

[0100] In one exemplary implementation, before quantizing the input data containing Cin channels according to the Cin first quantization factors to obtain the Cin data groups, the method further includes: obtaining the maximum data parameter in the input data of a second target channel, wherein the second target channel is any one of the Cin channels; calculating a second target quantization factor according to the maximum data parameter in the input data of the second target channel and the quantization bit width; and selecting a third target quantization factor from a preset quantization factor set according to the second target quantization factor, wherein the third target quantization factor is the preset quantization factor in the set whose difference from the second target quantization factor has the smallest absolute value, and the third target quantization factor is the first quantization factor corresponding to the second target channel.

[0101] Specifically, when grouped quantization is performed on the input data, the data is grouped according to the channel dimension, that is, the input data is divided into the input data of Cin channels. For the input data of each of the Cin channels, for example the i-th channel, the maximum value of the data parameters corresponding to the i-th channel is first determined. A second target quantization factor is then calculated based on this maximum value and the quantization bit width. Finally, the preset quantization factor whose difference from the second target quantization factor has the smallest absolute value is selected from the preset quantization factor set as the first quantization factor $S_{ai}$ corresponding to the i-th channel. The second target quantization factor is calculated as shown in formula (9).

[0102]

[0103] In formula (9), S′_ai represents the second target quantization factor corresponding to the i-th channel; max(a_i) represents the maximum value in the data parameters corresponding to the i-th channel; N represents the quantization bit width, such as that of INT8, INT4, etc.

[0104] To ensure that the mantissa of the first quantization factor is the same for each channel, the first quantization factor S_ai corresponding to the i-th channel is selected, according to S′_ai, from the following preset quantization factor set:

[0105]

[0106] In the above set, S_a_max represents the fourth target quantization factor, which is the mantissa of the first quantization factors and is calculated based on the maximum data parameter among the input data of the Cin channels and the quantization bit width. In this case, the first quantization factors of different channels differ from one another by multiples that are powers of 2.

[0107] In this example, for the input data of a single channel (i.e., the second target channel), the maximum data parameter in the input data of that channel is obtained. Then, the second target quantization factor is calculated based on the maximum data parameter and the quantization bit width. Next, a third target quantization factor is selected from a preset quantization factor set based on the second target quantization factor. The third target quantization factor is the preset quantization factor with the smallest absolute value of the difference between it and the second target quantization factor in the preset quantization factor set. The third target quantization factor is the first quantization factor corresponding to the second target channel. Therefore, for the input data of each of the Cin channels, the corresponding second target quantization factor can be calculated, and the corresponding third target quantization factor can be selected from the preset quantization factor set based on this second target quantization factor, thus obtaining Cin first quantization factors, i.e., one first quantization factor for each channel. This facilitates the quantization of the input data of Cin channels using Cin first quantization factors, i.e., using different first quantization factors to quantize the input data of different channels, reducing the quantization error of the data quantization.

[0108] In one exemplary embodiment, before selecting a third target quantization factor from a preset quantization factor set based on a second target quantization factor, the method further includes: obtaining the maximum data parameter among the input data of Cin channels; calculating a fourth target quantization factor based on the maximum data parameter among the input data of Cin channels and the quantization bit width; and calculating Cin preset quantization factors based on the fourth target quantization factor and Cin second coefficients, wherein the Cin second coefficients correspond one-to-one with the Cin preset quantization factors, and the Cin preset quantization factors constitute a preset quantization factor set.

[0109] It should be understood that the fourth target quantization factor is S_a_max, and the calculation formula for S_a_max is shown in formula (10).

[0110]

[0111] In formula (10), max(a) represents the maximum value of all data parameters in the input data, that is, the maximum data parameter in the input data of Cin channels; N represents the quantization bit width.

[0112] The Cin second coefficients are shown in the following set:

[0113]

[0114] Therefore, the preset quantization factor set can be obtained based on the Cin second coefficients in the above set and S_a_max.

[0115] In this example, the maximum data parameter among the input data of Cin channels is obtained. Then, a fourth target quantization factor is calculated based on the maximum data parameter and the quantization bit width. Next, Cin preset quantization factors are calculated based on the fourth target quantization factor and Cin second coefficients. These Cin preset quantization factors constitute a preset quantization factor set. Since these Cin preset quantization factors are calculated from the same fourth target quantization factor and Cin different second coefficients, their mantissas are the same. The Cin first quantization factors used for group quantization of the input data are selected from these Cin preset quantization factors, so their mantissas are also the same. This facilitates converting the floating-point multiplication calculation introduced by group quantization in convolution operations into shift operations, reducing computational load and improving computational efficiency.
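The construction of the preset set and the per-channel selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: formulas (9) and (10) and the set of second coefficients are not reproduced in this text, so the sketch assumes a symmetric scale of the form max|a| / (2^(N-1) - 1) and second coefficients 2^0, 2^-1, ..., 2^-(Cin-1). All function names are hypothetical.

```python
import math

def scale_from_max(max_abs, bits):
    # Assumed form of formulas (9) and (10): symmetric scale for a signed N-bit integer.
    return max_abs / (2 ** (bits - 1) - 1)

def preset_factors(s_a_max, c_in):
    # Assumed second coefficients: 2^0, 2^-1, ..., 2^-(c_in - 1).
    return [s_a_max * 2.0 ** (-k) for k in range(c_in)]

def select_first_factors(channels, bits):
    """channels: one list of floats per channel. Returns (S_a_max, one factor per channel)."""
    # Fourth target quantization factor: computed from the maximum over all channels.
    s_a_max = scale_from_max(max(abs(v) for ch in channels for v in ch), bits)
    presets = preset_factors(s_a_max, len(channels))
    chosen = []
    for ch in channels:
        # Second target quantization factor for this channel.
        target = scale_from_max(max(abs(v) for v in ch), bits)
        # Third target quantization factor: the preset closest to the target.
        chosen.append(min(presets, key=lambda p: abs(p - target)))
    return s_a_max, chosen

channels = [[0.5, -8.0], [1.9, 0.3], [0.24, -0.1]]
s_a_max, factors = select_first_factors(channels, bits=8)
# Each chosen factor is S_a_max times a power of two, so the binary mantissas agree.
mantissas = {math.frexp(f)[0] for f in factors}
```

In this example the channel with the largest range keeps S_a_max, while the two channels with smaller ranges receive S_a_max / 4, so the selected factors differ only by powers of 2.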

[0116] For any one of the data parameters before quantization corresponding to the i-th channel, after selecting the first quantization factor corresponding to the i-th channel, the data parameter is quantized from a floating-point number to an integer number according to formula (11).

[0117]

[0118] In formula (11), a_q represents any one of the quantized data parameters corresponding to the i-th channel; a_i represents any one of the data parameters before quantization corresponding to the i-th channel; S_ai represents the first quantization factor corresponding to the i-th channel.

[0119] Step 902: Quantize the first weight parameter corresponding to the target convolution kernel to obtain the second weight parameter.

[0120] In one exemplary implementation, before quantizing the first weight parameter corresponding to the target convolutional kernel to obtain the second weight parameter, the method further includes: obtaining the maximum weight parameter among the first weight parameters corresponding to the target convolutional kernel; calculating a second quantization factor based on the maximum weight parameter and the quantization bit width, wherein the second quantization factor is a quantization factor used to quantize the first weight parameter.

[0121] Specifically, when grouping and quantizing the weight parameters, for any convolutional layer in the convolutional neural network, the weight parameters of that layer are grouped according to the convolutional kernel, and a weight quantization factor is calculated for each convolutional kernel; then, the calculation formula for the weight quantization factor corresponding to each convolutional kernel is shown in formula (12).

[0122]

[0123] In formula (12), S_w represents the weight quantization factor corresponding to any convolution kernel, that is, the second quantization factor corresponding to any convolution kernel; max(w) represents the maximum value of the weight parameter corresponding to any convolution kernel, that is, the maximum weight parameter among the first weight parameters corresponding to any convolution kernel; N represents the quantization bit width.

[0124] It should be understood that the second quantization factor is the quantization factor corresponding to the weight parameters. The first weight parameter is the weight parameter before quantization, that is, the first weight parameter is in floating-point data format; the second weight parameter is the weight parameter after quantization, that is, the second weight parameter is in integer data format. During group quantization, the weight parameters are grouped according to the convolutional kernels of the convolutional neural network, meaning one convolutional kernel corresponds to one second quantization factor. For a single convolutional kernel, there is one corresponding second quantization factor, thus the target convolutional kernel corresponds to one second quantization factor. By using the same second quantization factor to quantize the first weight parameters corresponding to the target convolutional kernel, the second weight parameters corresponding to the target convolutional kernel can be obtained. Here, the target convolutional kernel is any one of the convolutional kernels between convolutional layer n and convolutional layer n+1, and the input data of convolutional layer n+1 is the output data of convolutional layer n.

[0125] In this example, for a single convolutional kernel (i.e., the target convolutional kernel), the maximum weight parameter among the first weight parameters corresponding to the convolutional kernel is obtained, and the second quantization factor corresponding to the convolutional kernel is then calculated based on the maximum weight parameter and the quantization bit width. In this way, a second quantization factor can be calculated for every convolutional kernel, which makes it possible to quantize the first weight parameters of different convolutional kernels with different second quantization factors and reduces the quantization error of weight quantization.

[0126] For any convolution kernel, after the second quantization factor corresponding to the convolution kernel is calculated, the first weight parameter in floating-point format is quantized into the second weight parameter in integer format according to formula (13).

[0127]

[0128] In formula (13), w_q represents any quantized weight parameter corresponding to any convolutional kernel, that is, any second weight parameter corresponding to any convolutional kernel; w represents any unquantized weight parameter corresponding to any convolutional kernel, that is, any first weight parameter corresponding to any convolutional kernel; S_w represents the weight quantization factor corresponding to any convolution kernel, which is also the second quantization factor corresponding to any convolution kernel.
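The per-kernel weight quantization described above can be illustrated with a short sketch. Formulas (12) and (13) are not reproduced in this text, so the sketch assumes S_w = max|w| / (2^(N-1) - 1) and w_q = round(w / S_w); the function name and the sample weights are hypothetical.

```python
def quantize_kernel(weights, bits):
    """Quantize the float weights of one kernel with a single factor S_w."""
    s_w = max(abs(w) for w in weights) / (2 ** (bits - 1) - 1)  # assumed formula (12)
    w_q = [round(w / s_w) for w in weights]                      # assumed formula (13)
    return s_w, w_q

kernels = [[0.7, -0.3, 0.0], [0.07, 0.02, -0.07]]

# Group quantization: one second quantization factor per kernel.
per_kernel = [quantize_kernel(k, bits=4) for k in kernels]

# For comparison only: a single factor shared by all kernels.
shared_s = max(abs(w) for k in kernels for w in k) / (2 ** (4 - 1) - 1)
shared_q = [[round(w / shared_s) for w in k] for k in kernels]
```

With one factor per kernel, the second kernel still uses the full 4-bit range; with a shared factor, its small weights collapse to values near zero, which is the quantization error that grouping by kernel is intended to reduce.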

[0129] Step 903: Perform convolution calculation on Cin data groups and the second weight parameter; perform shift calculation on the result of the convolution calculation to obtain the operation result.

[0130] It should be understood that convolving the Cin data groups with the second weight parameter is equivalent to multiplying and accumulating the Cin data groups according to the second weight parameter to obtain Cin integers. The Cin first quantization factors correspond one-to-one with the Cin channels, and the Cin channels correspond one-to-one with the Cin integers; therefore, the Cin first quantization factors correspond one-to-one with the Cin integers.

[0131] In one exemplary embodiment, the result of the above convolution calculation is Cin integers. The result of the convolution calculation is then shifted to obtain the operation result, including: for each of the Cin integers, the following steps are performed to obtain Cin floating-point numbers: The target integer is shifted according to a first coefficient to obtain the floating-point number corresponding to the target integer, wherein the target integer is any one of the Cin integers, the first coefficient is determined according to the exponent of a first target quantization factor, the first target quantization factor is the first quantization factor corresponding to the first target channel, and the first target channel is the channel corresponding to the target integer; the operation result is calculated based on the mantissa of the first quantization factor, a second quantization factor, and the Cin floating-point numbers, wherein the second quantization factor is the quantization factor used to quantize the first weight parameter.

[0132] It should be understood that the value of the first coefficient corresponding to each of the Cin integers is different, and the value of the first coefficient is determined by the exponent of the first quantization factor corresponding to that integer. Shift calculation is a binary operation: shifting an integer by a number of bits is equivalent to multiplying the integer by a power of 2, where the exponent of the power of 2 indicates how many bits to shift.

[0133] Performing the above-mentioned shift calculation on the target integer according to the first coefficient to obtain the floating-point number corresponding to the target integer means that the target integer is multiplied by the first coefficient, which is a power of 2.

[0134] Calculating the operation result based on the mantissa of the first quantization factor, the second quantization factor, and the Cin floating-point numbers means that the Cin floating-point numbers are accumulated, and the accumulated result is then multiplied by the mantissa of the first quantization factor and by the second quantization factor to obtain the operation result.

[0135] Please refer to Figure 10, which is another schematic diagram of convolution calculation provided in an embodiment of this application. As shown in Figure 10, taking one convolutional kernel as an example, the second weight parameter obtained after quantization of the convolutional kernel is multiplied by the data group obtained after quantization of each channel (first multiplication), and the products are then accumulated in the Hk and Wk dimensions (first accumulation) to obtain Cin integers; the accumulated result (Cin integers) is converted from integers to floating-point numbers (INT to Float, I2F) to obtain Cin floating-point numbers; the Cin floating-point numbers are then accumulated in the Cin dimension (second accumulation) to obtain the floating-point accumulation result; finally, the floating-point accumulation result is multiplied by the mantissa of the floating-point number (second multiplication) to obtain the output result of the convolutional kernel.

[0136] In this example, during the convolution operation after group quantization, the Cin data groups and the second weight parameter are first convolved. This involves multiplying and summing the Cin data groups according to the quantized second weight parameter to obtain Cin integers. Then, each of these Cin integers is shifted. This means each integer is multiplied by a first coefficient determined by the exponent of the first quantization factor corresponding to that integer, resulting in a floating-point number. Thus, Cin floating-point numbers can be calculated from the Cin integers. Finally, the result is calculated using the mantissa of the first quantization factor, the second quantization factor, and the Cin floating-point numbers. The second quantization factor is used to quantize the first weight parameter. That is, the Cin floating-point numbers are first accumulated, and then the accumulated result is multiplied by the mantissa of the first quantization factor and the second quantization factor to obtain the operation result. Since the mantissas of the Cin first quantization factors are the same, the mantissas of the first quantization factor and the second quantization factor can be extracted before the accumulation of the Cin floating-point numbers during the operation. Finally, it is multiplied by the accumulated result of the Cin floating-point numbers to obtain the convolution operation result. That is, the multiplication operation of floating-point numbers and integers in the convolution operation after group quantization is transformed into the shift operation of integers, which can effectively reduce the amount of calculation and improve the calculation efficiency.
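The equivalence described above, between multiplying each integer by its own floating-point factor and shifting each integer before a single final multiplication, can be checked with a small sketch. It assumes non-negative integer exponents and a shared mantissa; the function names and sample values are hypothetical.

```python
def conv_output_shift(partial_sums, exponents, mantissa):
    """partial_sums: the Cin integers from the Hk x Wk accumulation.
    exponents: one non-negative shift amount per channel.
    mantissa: the mantissa shared by all channels."""
    acc = 0
    for fix, exp in zip(partial_sums, exponents):
        acc += fix << exp      # the shift replaces a float-by-integer multiplication
    return mantissa * acc      # a single floating-point multiplication at the end

def conv_output_float(partial_sums, exponents, mantissa):
    # Reference path: each integer is multiplied by its own factor Man * 2^Exp.
    return sum(fix * (mantissa * 2.0 ** exp)
               for fix, exp in zip(partial_sums, exponents))

partial_sums = [37, -12, 5]
exponents = [0, 2, 3]
man = 0.75
fast = conv_output_shift(partial_sums, exponents, man)
ref = conv_output_float(partial_sums, exponents, man)
```

The shift path performs Cin integer shifts and one floating-point multiplication, whereas the reference path performs Cin floating-point multiplications.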

[0137] It should be understood that the embodiments of this application have made creative improvements to floating-point calculation in group quantization, simplifying floating-point calculation into shift calculation. Among them, I2F is the shift process of converting an integer to a floating-point number, and its principle can be expressed by formulas (8), (14) and (15), which are specifically described below.

[0138]

[0139] For the floating-point number Float in formula (8), it can be represented as the mantissa part Man and the exponent part Exp, so formula (8) can be transformed into formula (14).

[0140]

[0141] In formula (14), if the mantissas Man_i of all summation terms 1 to Cin (i.e., the data of all channels) are the same, the mantissa can be moved outside the summation, and the multiplication of floating-point numbers and integers becomes a shift calculation on integers. After accumulation, multiplying by the mantissa gives the result of the convolution operation, as shown in formula (15).

[0142]

[0143] In formula (15), the shift operator indicates that Fix_i is shifted by Exp_i bits.

[0144] As shown in formula (15), after the integer Fix is accumulated in the Hk and Wk dimensions and before it is accumulated in the Cin dimension, the integer Fix needs to be multiplied by the floating-point number Float. However, floating-point multiplication has a large computational cost. To simplify the floating-point multiplication into a shift calculation and reduce the computational cost, first quantization factors with the same mantissa need to be selected to quantize the input data, so that the mantissa Man of the floating-point number can be extracted before the accumulation over 1 to Cin, thereby converting floating-point multiplication into integer shift calculation. Specifically, to make the mantissas of the above Float_i the same, it is only necessary to make the mantissas of the quantization factor products S_w·S_ai of the channels the same. Because Float_i is the product of S_w and S_ai, S_w is fixed for every channel, and S_ai is a floating-point number, the mantissas of Float_i are the same as long as the mantissas of all S_ai are the same (the exponents can be different). The mantissa Man of Float_i equals the product of the second quantization factor S_w and the mantissa of the first quantization factor; if the mantissa of the first quantization factor is S_a_max, then Man is S_w·S_a_max.
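For reference, the chain of transformations described in the preceding paragraphs can be written out in LaTeX. Formulas (8), (14) and (15) are not reproduced in this text, so the following is a reconstruction from the prose only; the output symbol y and the shift notation are assumptions.

```latex
\begin{align}
y &= \sum_{i=1}^{C_{in}} \mathrm{Float}_i \cdot \mathrm{Fix}_i \\
  &= \sum_{i=1}^{C_{in}} \mathrm{Man}_i \cdot 2^{\mathrm{Exp}_i} \cdot \mathrm{Fix}_i \\
  &= \mathrm{Man} \cdot \sum_{i=1}^{C_{in}} \left( \mathrm{Fix}_i \ll \mathrm{Exp}_i \right),
  \qquad \text{if } \mathrm{Man}_i = \mathrm{Man} \text{ for all } i
\end{align}
```

Here Float_i = S_w · S_ai, so a shared mantissa across all S_ai yields a shared Man.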

[0145] As can be seen, in this embodiment, for the operation of a single convolution kernel, first quantization factors with the same mantissa are selected when the input data is grouped and quantized, so that the floating-point multiplication introduced into the convolution operation by group quantization is converted into a shift operation, which greatly reduces the computational load of a single convolution kernel and improves its computational efficiency. Moreover, compared with uniform quantization, the quantization method provided in this embodiment reduces the computational load and improves computational efficiency while also effectively reducing the quantization error, so that a convergent target solution can be obtained during training of the convolutional neural network and convergence can be achieved through low-bit training. If all convolution operations of the entire convolutional neural network adopt the quantization method provided in this embodiment, the computational load of the entire convolutional neural network can be effectively reduced and its computational efficiency improved.

[0146] The technical solution provided in this application will be described in detail below with specific examples.

[0147] 1) Grouping and quantization of data parameters. The input data matrix of this layer has dimensions of BatchSize × Cin × H × W. It is grouped according to the BatchSize and Cin dimensions, resulting in a total of BatchSize × Cin groups. One sample in the input data matrix has Cin channels and therefore Cin groups. Each of these Cin groups yields a first quantization factor, resulting in Cin first quantization factors. To ensure that the mantissa of the first quantization factor is the same for each of these Cin channels, the Cin first quantization factors are selected from the following set, in which the first quantization factors of different channels differ by multiples that are powers of 2.

[0148]

[0149] From the above set, it can be seen that the mantissa of the first quantization factor is 1/S_w.

[0150] After selecting Cin first quantization factors, the data parameters corresponding to these Cin channels are quantized from floating-point numbers to integer numbers according to formula (11).

[0151]

[0152] 2) Grouping and quantization of weight parameters. The weight parameters w of this layer are grouped according to the convolution kernel, and a second quantization factor is calculated for each convolution kernel according to formula (12).

[0153]

[0154] After calculating the second quantization factor corresponding to each convolution kernel, the weight parameters are quantized from floating-point numbers to integer numbers according to formula (13).

[0155]

[0156] 3) Perform convolution calculations on the quantized integers. Perform convolution calculations according to formulas (7), (8), (14), and (15). Taking a convolution kernel as an example, the second weight parameter obtained after quantization of the convolution kernel is multiplied by the data group obtained after quantization of each channel, and then accumulated in the Hk and Wk dimensions to obtain Cin integers; the accumulated result (Cin integers) is converted from integers to floating-point numbers through I2F operation to obtain Cin floating-point numbers; then the Cin floating-point numbers are accumulated in the Cin dimension to obtain the accumulated floating-point number result; finally, the accumulated floating-point number result is multiplied by the mantissa of the floating-point number to obtain the output result of the convolution kernel.

[0157]

[0158]

[0159]

[0160]

[0161] The integer Fix, obtained by accumulation in the Hk and Wk dimensions, needs to be multiplied by a floating-point number Float before accumulation in the Cin dimension, and floating-point multiplication is expensive. In this example, the mantissa of the selected first quantization factors is 1/S_w. Because Man equals the product of the second quantization factor S_w and the mantissa of the first quantization factor, the mantissa Man of the floating-point number equals 1. Thus, the multiplication by floating-point numbers in the convolution operation can be directly simplified to a shift operation on integers.
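The case Man = 1 can be illustrated with a short sketch. The set of first quantization factors for this example is not reproduced in this text, so the sketch assumes each Float_i is a power of two of the form 2^(-k_i) with k_i a non-negative integer, and aligns all terms to the largest k_i so that only left shifts are needed. The function name and sample values are hypothetical.

```python
def output_with_unit_mantissa(partial_sums, shifts):
    """partial_sums: the Cin integers from the Hk x Wk accumulation.
    shifts: k_i per channel, where Float_i is assumed to equal 2**(-k_i).
    Returns (acc, k_max); the represented value is acc * 2**(-k_max)."""
    k_max = max(shifts)
    # Align every term to the common exponent -k_max using integer left shifts.
    acc = sum(fix << (k_max - k) for fix, k in zip(partial_sums, shifts))
    return acc, k_max

acc, k_max = output_with_unit_mantissa([3, 5], [0, 2])
value = acc * 2.0 ** (-k_max)
```

Because the shared mantissa is 1, no floating-point multiplication by a mantissa remains; only the common exponent has to be applied to the integer sum.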

[0162] As shown above, the group quantization method effectively reduces quantization error and significantly decreases the computational load in convolution operations. Furthermore, experiments demonstrate that for 4-bit training, the accuracy of ResNet18 on ImageNet reaches 68.14%, which is the best publicly available result in the industry. It should be noted that this scheme can be used not only in training scenarios but also in inference scenarios; this application does not specifically limit this.

[0163] Please refer to Figure 11, which is a schematic diagram of another training process for a convolutional neural network provided in an embodiment of this application. The training process shown in Figure 11 can be implemented by the computing device for the convolutional neural network shown in Figure 8, and is explained below in conjunction with the hardware architecture shown in Figure 8. The training process can be divided into two stages: forward computation and backward computation.

[0164] In the forward computation, the loss function value is obtained by propagating the input data layer by layer. The matrix calculations involved (convolutional layers and fully connected layers) are all performed through low-bit multiplication.

[0165] 1) The load module loads the data a^l of this layer and the weight w^l of this layer from external memory and stores them in the floating-point cache unit.

[0166] 2) The floating-point arithmetic logic unit calculates the quantization factor scale of the data a^l of this layer and of the weight w^l of this layer, and quantizes the data a^l and the weight w^l through Q(a) and Q(w) respectively to obtain the quantized data a^l_q of this layer and the quantized weight w^l_q of this layer, which are stored in the fixed-point cache unit. Here, Q stands for Quantize, that is, the quantization function; Q(a) is the quantization function for the data, and Q(w) is the quantization function for the weights.

[0167] 3) The convolution module loads the quantized data a^l_q of this layer and the quantized weight w^l_q of this layer from the fixed-point cache unit, completes the multiply-accumulate calculation, multiplies by the mantissa to obtain the output result, and saves the output result to the floating-point cache unit.

[0168] 4) The floating-point arithmetic logic unit loads the data a^l of this layer stored in the floating-point cache unit and the quantization factor scale, completes the inverse quantization of the data a^l through Deq(a) to obtain the dequantized data a^l of this layer, and outputs the data a^l of this layer to external memory through the storage module. Here, Deq stands for Dequantize, that is, inverse quantization, which converts integer data back into floating-point numbers; Deq(a) indicates that the data is dequantized.
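The forward steps 1) to 4) can be sketched in software for a single output value. This is only an illustration under stated assumptions: it uses one quantization factor per tensor rather than the group quantization of this application, assumes a symmetric max-based quantization function for Q, and models a multiply-accumulate over a flat list instead of a full convolution. The function names are hypothetical.

```python
def quantize(values, bits):
    # Q(.): assumed symmetric max-based quantization; returns integers and the scale.
    scale = max(abs(v) for v in values) / (2 ** (bits - 1) - 1)
    return [round(v / scale) for v in values], scale

def forward_layer(a, w, bits=8):
    a_q, s_a = quantize(a, bits)                  # step 2: Q(a)
    w_q, s_w = quantize(w, bits)                  # step 2: Q(w)
    acc = sum(x * y for x, y in zip(a_q, w_q))    # step 3: integer multiply-accumulate
    return acc * (s_a * s_w)                      # steps 3 and 4: back to floating point

a = [1.0, -0.6, 0.3]
w = [0.3, 0.8, -0.2]
out = forward_layer(a, w)
exact = sum(x * y for x, y in zip(a, w))
```

The integer path reproduces the floating-point result up to a small quantization error, which shrinks as the bit width grows.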

[0169] In the backpropagation process, the loss function value is propagated back layer by layer, multiplied by the weights and data of each layer to obtain the backpropagation error and gradient of each layer, which is then used to update the weight parameters of each layer.

[0170] 1) The load module reads the backpropagation error δ^l of this layer from external memory and stores it in the floating-point cache unit.

[0171] 2) The floating-point arithmetic logic unit performs Q(δ) quantization to obtain the quantized backpropagation error δ^l_q of this layer, and stores the quantized backpropagation error δ^l_q in the fixed-point cache unit; here, Q(δ) is the quantization function for the backpropagation error.

[0172] 3) The convolution module reads the quantized backpropagation error δ^l_q of this layer from the fixed-point cache unit, multiplies δ^l_q by the quantized weight parameters w^l_q of this layer to obtain the backpropagation error δ^(l+1)_q of the previous layer, and stores δ^(l+1)_q in the fixed-point cache unit.

[0173] 4) The floating-point arithmetic logic unit reads the backpropagation error δ^(l+1)_q of the previous layer from the fixed-point cache unit, dequantizes it through Deq(δ) to obtain the backpropagation error δ^(l+1)_q of the previous layer in floating-point format, and outputs the floating-point backpropagation error of the previous layer to external memory through the storage module.

[0174] 5) The convolution module reads the quantized backpropagation error δ^l_q of this layer from the fixed-point cache unit, multiplies δ^l_q by the quantized data a^l_q of this layer to obtain the gradient Δw^l of this layer in integer format, and stores the integer-format gradient Δw^l in the fixed-point cache unit.

[0175] 6) The floating-point arithmetic logic unit reads the gradient Δw^l of this layer in integer format from the fixed-point cache unit, dequantizes it through Deq(Δw) to obtain the gradient Δw^l of this layer in floating-point format, and outputs the floating-point gradient Δw^l to external memory through the storage module.

[0176] 7) The load module loads the weight w^l of this layer and the gradient Δw^l of this layer in floating-point format again; the gradient update is performed in the floating-point arithmetic logic unit to obtain the updated weight w^l of this layer, which is saved to external memory through the storage module.
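The backward steps 1) to 7) can be sketched for a small fully connected layer. As before, this is an illustration under stated assumptions rather than the implementation of this application: it uses one quantization factor per tensor, assumes a symmetric max-based quantization function, and uses a plain gradient-descent update with a hypothetical learning rate `lr`.

```python
def quantize(values, bits=8):
    # Q(.): assumed symmetric max-based quantization; returns integers and the scale.
    scale = max(abs(v) for v in values) / (2 ** (bits - 1) - 1)
    return [round(v / scale) for v in values], scale

def backward_layer(delta, a, w_rows, lr, bits=8):
    """delta: error per output; a: layer input; w_rows[j][k]: weight from input k to output j."""
    n_out, n_in = len(delta), len(a)
    d_q, s_d = quantize(delta, bits)                      # step 2: Q(delta)
    a_q, s_a = quantize(a, bits)
    flat_q, s_w = quantize([w for row in w_rows for w in row], bits)
    w_q = [flat_q[j * n_in:(j + 1) * n_in] for j in range(n_out)]
    # Steps 3 and 4: integer products give the propagated error, then dequantize.
    prev = [sum(d_q[j] * w_q[j][k] for j in range(n_out)) * (s_d * s_w)
            for k in range(n_in)]
    # Steps 5 and 6: integer products give the gradient, then dequantize.
    grad = [[d_q[j] * a_q[k] * (s_d * s_a) for k in range(n_in)]
            for j in range(n_out)]
    # Step 7: floating-point weight update (plain gradient descent is an assumption).
    new_w = [[w_rows[j][k] - lr * grad[j][k] for k in range(n_in)]
             for j in range(n_out)]
    return prev, grad, new_w

delta = [0.2, -0.1]
a = [1.0, 0.5]
w_rows = [[0.5, -0.25], [0.75, 0.1]]
prev, grad, new_w = backward_layer(delta, a, w_rows, lr=0.1)
```

Only the products in steps 3 and 5 run on integers; quantization, dequantization and the weight update stay in floating point, matching the division of work between the fixed-point and floating-point units described above.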

[0177] It should be understood that the forward and backward computation processes described above are performed alternately until training is complete, resulting in a low-bit model that can be used for inference. The quantization processes mentioned above refer to the group quantization method described earlier, which has been detailed in the above embodiments and will not be repeated here.

[0178] Please refer to Figure 12, which is a schematic diagram of the structure of a computer device 1210 provided in an embodiment of this application. The computer device 1210 includes a processor 1211, a memory 1212 and a communication interface 1213. The processor 1211, the memory 1212 and the communication interface 1213 are interconnected through a bus 1214.

[0179] The memory 1212 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), or compact disc read-only memory (CD-ROM), and is used to store related computer programs and data. The communication interface 1213 is used for receiving and sending data.

[0180] The processor 1211 can be one or more central processing units (CPUs). When the processor 1211 is a CPU, the CPU can be a single-core CPU or a multi-core CPU.

[0181] The processor 1211 in the computer device 1210 is used to read the computer program code stored in the memory 1212 and perform the following operations: quantize the input data containing Cin channels according to Cin first quantization factors to obtain Cin data groups, wherein the Cin first quantization factors are floating-point numbers, the mantissas of the Cin first quantization factors are the same, the Cin first quantization factors correspond one-to-one with the Cin channels, and Cin is a positive integer; quantize the first weight parameter corresponding to the target convolution kernel to obtain the second weight parameter; perform convolution calculation on the Cin data groups and the second weight parameter; and perform shift calculation on the result of the convolution calculation to obtain the operation result.

[0182] It should be noted that, for the implementation of each of the above operations, reference may also be made to the corresponding description of the method embodiment shown in Figure 9.

[0183] In the computer device 1210 described in Figure 12, for the operation of a single convolution kernel, first quantization factors with the same mantissa are selected when the input data is grouped and quantized, so that the floating-point multiplication introduced into the convolution operation by group quantization is converted into a shift operation, which greatly reduces the computational load of a single convolution kernel and improves its computational efficiency. Moreover, compared with uniform quantization, the quantization method provided in this embodiment reduces the computational load and improves computational efficiency while effectively reducing the quantization error, so that a convergent target solution can be obtained during training of the convolutional neural network and convergence can be achieved through low-bit training. If all convolution operations of the entire convolutional neural network adopt the quantization method provided in this embodiment, the computational load of the entire convolutional neural network can be effectively reduced and its computational efficiency improved.

[0184] This application also provides a chip, which includes at least one processor, a memory, and an interface circuit. The memory, the transceiver, and the at least one processor are interconnected via circuits. The at least one memory stores a computer program. When the computer program is executed by the processor, the method flow shown in Figure 9 is implemented.

[0185] This application also provides a computer-readable storage medium storing a computer program. When the computer program is run on a computer, the method flow shown in Figure 9 is implemented.

[0186] This application also provides a computer program product. When the computer program product is run on a computer, the method flow shown in Figure 9 is implemented.

[0187] In summary, by implementing the embodiments of this application, for the operation of a single convolution kernel, first quantization factors with the same mantissa are selected when the input data is grouped and quantized, so that the floating-point multiplication introduced into the convolution operation by group quantization is converted into a shift operation. This greatly reduces the computational load of a single convolution kernel and improves its computational efficiency. Moreover, compared with uniform quantization, the quantization method provided in this embodiment not only reduces the computational load and improves computational efficiency, but also effectively reduces the quantization error, so that a convergent target solution can be obtained during training of the convolutional neural network, and convergence can be achieved with low-bit training. For the entire convolutional neural network, if all convolution operations adopt the quantization method provided in this embodiment, the computational load of the entire network can be effectively reduced and its computational efficiency improved.

[0188] It should be understood that the processor mentioned in the embodiments of this application can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor.

[0189] It should also be understood that the memory mentioned in the embodiments of this application can be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced Synchronous DRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).

[0190] It should be noted that when the processor is a general-purpose processor, DSP, ASIC, FPGA, or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component, the memory (storage module) is integrated into the processor.

[0191] It should be noted that the memories described herein are intended to include, but are not limited to, these and any other suitable types of memories.

[0192] It should also be understood that the first, second, third, fourth and various numerical designations used herein are merely for descriptive convenience and are not intended to limit the scope of this application.

[0193] It should be understood that the term "and/or" herein merely describes an association between related objects and indicates that three relationships can exist. For example, A and/or B can represent: A existing alone, both A and B existing, or B existing alone. Additionally, the character "/" herein generally indicates an "or" relationship between the preceding and following related objects.

[0194] It should be understood that in the various embodiments of this application, the order of the above-mentioned processes does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

[0195] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0196] Those skilled in the art will understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.

[0197] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units described above is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.

[0198] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0199] In addition, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.

[0200] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods shown in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0201] The steps in the method of this application embodiment can be adjusted, combined, or deleted according to actual needs.

[0202] The modules in the device of this application embodiment can be merged, divided, and deleted according to actual needs.

[0203] The above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit it. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application.

Claims

1. A computing device for a convolutional neural network, characterized in that the device includes a floating-point arithmetic logic unit and a convolution module that are communicatively connected; the floating-point arithmetic logic unit is used to quantize the input data containing Cin channels according to Cin first quantization factors to obtain Cin data groups, wherein the Cin first quantization factors are floating-point numbers, the Cin first quantization factors have the same mantissa, the Cin first quantization factors correspond one-to-one with the Cin channels, and Cin is a positive integer; the input data includes images; the floating-point arithmetic logic unit is also used to quantize the first weight parameter corresponding to the target convolution kernel to obtain the second weight parameter; the convolution module is used to perform convolution calculation on the Cin data groups and the second weight parameter, and to perform shift calculation on the result of the convolution calculation to obtain the operation result; the floating-point arithmetic logic unit is also used to, before quantizing the input data containing Cin channels according to Cin first quantization factors to obtain Cin data groups: obtain the maximum data parameter in the input data of a second target channel, wherein the second target channel is any one of the Cin channels; calculate a second target quantization factor based on the maximum data parameter in the input data of the second target channel and the quantization bit width; and select a third target quantization factor from a preset quantization factor set based on the second target quantization factor, wherein the third target quantization factor is the preset quantization factor in the preset quantization factor set whose absolute value of the difference with the second target quantization factor is the smallest, and the third target quantization factor is the first quantization factor corresponding to the second target channel; prior to selecting the third target quantization factor from the preset quantization factor set based on the second target quantization factor, the floating-point arithmetic logic unit is further configured to: obtain the maximum data parameter among the input data of the Cin channels; calculate a fourth target quantization factor based on the maximum data parameter among the input data of the Cin channels and the quantization bit width; and calculate Cin preset quantization factors based on the fourth target quantization factor and Cin second coefficients, wherein the Cin second coefficients correspond one-to-one with the Cin preset quantization factors, and the Cin preset quantization factors constitute the preset quantization factor set.
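The factor selection recited in claim 1 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: a factor is computed as the maximum absolute value divided by the largest representable signed integer for the bit width, and the Cin second coefficients are taken to be the powers of two 2^0, 2^-1, ..., 2^-(Cin-1), which is one choice that gives every preset factor the same mantissa. The function name `select_first_factors` is hypothetical, and the input is assumed not to be all zeros.

```python
import numpy as np


def select_first_factors(x, bits=8):
    """x: (Cin, H, W) float input. Returns (first_factors, preset_set)."""
    cin = x.shape[0]
    qmax = 2 ** (bits - 1) - 1
    # Fourth target factor: from the maximum over all Cin channels.
    global_factor = np.max(np.abs(x)) / qmax
    # Assumed second coefficients: 2**0, 2**-1, ..., 2**-(Cin-1).
    second_coeffs = 2.0 ** -np.arange(cin)
    presets = global_factor * second_coeffs
    # Second target factor: one per channel, from that channel's maximum.
    per_channel = np.max(np.abs(x), axis=(1, 2)) / qmax
    # Third target factor: the preset with the smallest absolute difference.
    diffs = np.abs(per_channel[:, None] - presets[None, :])
    idx = np.argmin(diffs, axis=1)
    return presets[idx], presets
```

With this scale convention, a selected preset that is smaller than the channel's own factor leaves some values outside the representable range, so a quantizer built on it would need to clip.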

2. The apparatus according to claim 1, characterized in that the convolution module includes: a low-bit multiplier, used to perform convolution calculation on the Cin data groups and the second weight parameter, wherein the result of the convolution calculation is Cin integers; and a floating-point adder, used to perform the following step for each of the Cin integers to obtain Cin floating-point numbers: shifting a target integer according to a first coefficient to obtain the floating-point number corresponding to the target integer, wherein the target integer is any one of the Cin integers, the first coefficient is determined according to the exponent of a first target quantization factor, the first target quantization factor is the first quantization factor corresponding to a first target channel, and the first target channel is the channel corresponding to the target integer; the floating-point adder is also used to perform cumulative calculation on the Cin floating-point numbers.
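The shift-then-accumulate step of claim 2 can be sketched in a few lines of Python. The sketch assumes that the first coefficient of each channel equals the exponent of that channel's first quantization factor; the source only states that the coefficient is determined according to that exponent. Both function names are hypothetical, and the second function is a variation not described in the source: it performs the shifts on integers relative to the smallest exponent and converts to floating point once.

```python
import math


def shift_and_accumulate(partial_ints, exponents):
    """Shift each integer by its exponent, then add the floats."""
    total = 0.0
    for value, e in zip(partial_ints, exponents):
        # ldexp(v, e) computes v * 2**e by adjusting the exponent only.
        total += math.ldexp(float(value), e)
    return total


def shift_and_accumulate_int(partial_ints, exponents):
    """Variation: integer left shifts relative to the minimum exponent."""
    e_min = min(exponents)
    total = 0
    for value, e in zip(partial_ints, exponents):
        total += value << (e - e_min)  # shift count is never negative
    return math.ldexp(float(total), e_min)
```

For example, the integers 3, -8, and 5 with exponents -1, 0, and 2 give 1.5 - 8 + 20 = 13.5 with either function.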

3. The apparatus according to claim 2, characterized in that the floating-point arithmetic logic unit is also used to obtain the operation result based on the mantissa of the first quantization factor, the second quantization factor, and the result of the cumulative calculation, wherein the second quantization factor is a quantization factor used to quantize the first weight parameter.

4. The apparatus according to claim 1, characterized in that the floating-point arithmetic logic unit is also used to, before quantizing the first weight parameter corresponding to the target convolution kernel to obtain the second weight parameter: obtain the maximum weight parameter from the first weight parameters corresponding to the target convolution kernel; and calculate a second quantization factor based on the maximum weight parameter and the quantization bit width, wherein the second quantization factor is a quantization factor used to quantize the first weight parameter.
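The weight quantization of claim 4 can be illustrated with a short sketch. The source states only that the second quantization factor is calculated from the maximum weight parameter and the quantization bit width; the symmetric formula used here (maximum absolute weight divided by 2^(bits-1) - 1) and the fallback for an all-zero kernel are assumptions, and `quantize_weights` is a hypothetical name.

```python
import numpy as np


def quantize_weights(w, bits=8):
    """Return (quantized integer weights, second quantization factor)."""
    qmax = 2 ** (bits - 1) - 1
    w_max = float(np.max(np.abs(w)))
    # Assumed fallback: avoid dividing by zero for an all-zero kernel.
    scale = w_max / qmax if w_max > 0 else 1.0
    wq = np.clip(np.rint(w / scale), -qmax, qmax).astype(np.int32)
    return wq, scale
```

Multiplying the integer weights by the returned factor recovers each original weight to within half a quantization step.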

5. A method for operating a convolutional neural network, characterized in that it includes: quantizing the input data containing Cin channels according to Cin first quantization factors to obtain Cin data groups, wherein the Cin first quantization factors are floating-point numbers, the Cin first quantization factors have the same mantissa, the Cin first quantization factors correspond one-to-one with the Cin channels, and Cin is a positive integer; the input data includes images; quantizing the first weight parameter corresponding to the target convolution kernel to obtain the second weight parameter; performing convolution calculation on the Cin data groups and the second weight parameter; and performing shift calculation on the result of the convolution calculation to obtain the operation result; before quantizing the input data containing Cin channels according to Cin first quantization factors to obtain Cin data groups, the method further includes: obtaining the maximum data parameter in the input data of a second target channel, wherein the second target channel is any one of the Cin channels; calculating a second target quantization factor based on the maximum data parameter in the input data of the second target channel and the quantization bit width; and selecting a third target quantization factor from a preset quantization factor set based on the second target quantization factor, wherein the third target quantization factor is the preset quantization factor in the preset quantization factor set whose absolute value of the difference with the second target quantization factor is the smallest, and the third target quantization factor is the first quantization factor corresponding to the second target channel; before selecting the third target quantization factor from the preset quantization factor set based on the second target quantization factor, the method further includes: obtaining the maximum data parameter among the input data of the Cin channels; calculating a fourth target quantization factor based on the maximum data parameter among the input data of the Cin channels and the quantization bit width; and calculating Cin preset quantization factors based on the fourth target quantization factor and Cin second coefficients, wherein the Cin second coefficients correspond one-to-one with the Cin preset quantization factors, and the Cin preset quantization factors constitute the preset quantization factor set.

6. The method according to claim 5, characterized in that the result of the convolution calculation is Cin integers, and the step of performing shift calculation on the result of the convolution calculation to obtain the operation result includes: for each of the Cin integers, performing the following step to obtain Cin floating-point numbers: shifting a target integer according to a first coefficient to obtain the floating-point number corresponding to the target integer, wherein the target integer is any one of the Cin integers, the first coefficient is determined according to the exponent of a first target quantization factor, the first target quantization factor is the first quantization factor corresponding to a first target channel, and the first target channel is the channel corresponding to the target integer; and obtaining the operation result based on the mantissa of the first quantization factor, the second quantization factor, and the Cin floating-point numbers, wherein the second quantization factor is a quantization factor used to quantize the first weight parameter.

7. The method according to claim 5, characterized in that, before quantizing the first weight parameter corresponding to the target convolution kernel to obtain the second weight parameter, the method further includes: obtaining the maximum weight parameter from the first weight parameters corresponding to the target convolution kernel; and calculating a second quantization factor based on the maximum weight parameter and the quantization bit width, wherein the second quantization factor is a quantization factor used to quantize the first weight parameter.

8. A computer device, characterized in that it includes a processor, a memory, a communication interface, and one or more programs, the one or more programs being stored in the memory and configured to be executed by the processor, the programs including instructions for performing the steps of the method according to any one of claims 5-7.

9. A chip, characterized in that it includes a processor for retrieving a computer program from a memory and running it, causing a device on which the chip is mounted to perform the method according to any one of claims 5-7.

10. A computer-readable storage medium, characterized in that it stores a computer program for electronic data interchange, wherein the computer program causes a computer to perform the method according to any one of claims 5-7.

11. A computer program product that causes a computer to perform the method as described in any one of claims 5-7.