A full-addition network hardware acceleration method, device, and electronic equipment

By reordering and merging the basic processing blocks of the full-addition network, adopting a row-level streaming processing strategy, optimizing the loop structure, and deploying the network on the FPGA's DSP, the problem of deploying the full-addition neural network on a space platform is solved, achieving efficient hardware acceleration.

CN118446258B · Active · Publication Date: 2026-06-19 · BEIJING INST OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING INST OF TECH
Filing Date
2024-04-28
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing remote sensing scene classification methods based on convolutional neural networks are difficult to deploy on space platforms, mainly due to limited computing resources and the inability of existing hardware accelerators to efficiently process fully additive neural networks.

Method used

By reordering and merging the basic processing blocks of the full-addition network, a row-level streaming processing strategy is adopted to optimize the loop structure and deploy it in the digital signal processing unit (DSP) of the FPGA, simplifying the processing flow, reducing storage requirements, and improving parallel computing efficiency.

Benefits of technology

It enables efficient hardware deployment of the full-addition network on a space platform, reducing computing resource requirements, increasing processing speed and throughput, and reducing storage and power consumption.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN118446258B_ABST
Patent Text Reader

Abstract

A hardware acceleration method for a full-adder network includes: determining a second processing basic block based on a first processing basic block of the full-adder network; wherein the first processing basic block includes at least a quantization sub-layer, a first integer addition sub-layer, an inverse quantization sub-layer, and a BN layer; the second processing basic block includes a second integer addition sub-layer and a fused BN layer; the full-adder network includes N second processing basic blocks; each of the N second processing basic blocks is obtained by reordering and merging the layers in each first processing basic block; determining a row-level streaming processing strategy, wherein the row-level streaming processing strategy includes dividing the first integer feature map into two-dimensional data by rows, storing only the rows necessary for computation in the second processing basic block, and the second integer feature map output by the current second processing basic block is the input of the next level second processing basic block; and optimizing the loop of the second integer addition sub-layer in the second processing basic block based on the row-level streaming processing strategy.

Description

Technical Field

[0001] This invention relates to the field of remote sensing technology, and more particularly to a method, apparatus, and electronic device for full-addition network hardware acceleration.

Background Technology

[0002] Remote sensing scene classification is a crucial task in remote sensing image interpretation, widely used in environmental monitoring, disaster detection, urban planning, and national security missions. In recent years, deep learning-based methods have been widely applied to remote sensing scene classification due to their powerful feature abstraction and generalization capabilities. Traditional remote sensing image processing involves downloading remote sensing images from satellites to ground stations before scene classification. With the development of remote sensing technology, the size and resolution of acquired remote sensing images have significantly improved, leading to immense downlink transmission pressure. Simultaneously, the time consumed by large-scale data transmission increases the delay for ground personnel in obtaining critical information, posing challenges to time-constrained tasks such as natural disasters, military surveillance, and emergencies. Therefore, an intuitive solution is to deploy deep learning-based models on satellite edge devices. However, most existing convolutional neural network-based remote sensing scene classification methods require billions of multiplication operations. Due to the limited computing resources of space platforms, convolutional neural networks are difficult to deploy directly on space platform edge devices, such as field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs).

[0003] The recently proposed fully additive neural network based on additive kernels transforms all multiplication operations in convolutional layers into addition operations, reducing the computational resource requirements of convolutional neural networks and making it well-suited for deployment on FPGAs or ASICs. As a novel foundational network, the fully additive neural network holds great potential for online intelligent interpretation and scene classification of remote sensing images based on airborne and small satellite platforms.

[0004] However, the underlying operators of fully additive networks (A2NNs) differ from those of commonly used convolutional neural networks, so existing hardware accelerators designed for convolutional neural networks cannot efficiently accelerate A2NNs. Therefore, it is necessary to research efficient hardware acceleration methods for fully additive networks, taking into account their fundamental operational structure.

Summary of the Invention

[0005] In a first aspect, embodiments of this application provide a hardware acceleration method for a full-addition network. The method includes: determining a second processing basic block based on a first processing basic block of the full-addition network; wherein the first processing basic block includes a quantization sub-layer, a first integer addition sub-layer, an inverse quantization sub-layer, and a BN layer; the second processing basic block includes a second integer addition sub-layer and a fused BN layer; the full-addition network includes N second processing basic blocks; each of the N second processing basic blocks is obtained by reordering and merging the layers in each of the first processing basic blocks; determining a row-level streaming processing strategy, wherein the row-level streaming processing strategy includes dividing a first integer feature map into two-dimensional data by rows, the second processing basic block only stores the rows necessary for the operation, and the second integer feature map output by the current second processing basic block is the first integer feature map input to the next level second processing basic block; and optimizing the loop of the second integer addition sub-layer in the second processing basic block based on the row-level streaming processing strategy. Therefore, this application first simplifies the processing flow of the basic blocks of the full addition network by reordering the processing flow and merging the steps, eliminates off-chip access to feature maps in the hardware implementation by using a row-level streaming processing strategy, realizes the transmission and storage of the required feature maps using only on-chip storage resources, and applies the loop optimization based on the row-level streaming processing strategy to the addition layer, which facilitates hardware deployment.
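The row-level streaming idea in the paragraph above can be illustrated with a minimal Python sketch. It is an illustration under assumptions, not the patented implementation: the helper names `stream_block` and `column_l1` are hypothetical, and the per-row operation is a toy stand-in for the integer addition sub-layer. Each block keeps only as many rows as its kernel height requires, and the output rows of one block feed the next block directly.

```python
from collections import deque


def stream_block(rows, kernel_h, row_op):
    """Consume rows one at a time; emit one output row per full window of kernel_h rows."""
    window = deque(maxlen=kernel_h)  # holds only the rows needed for the current output row
    for row in rows:
        window.append(row)           # the oldest row is dropped automatically
        if len(window) == kernel_h:
            yield row_op(list(window))


def column_l1(window):
    """Toy row operation: per-column sum of absolute values over the buffered rows."""
    return [sum(abs(r[c]) for r in window) for c in range(len(window[0]))]


image = [[1, -2, 3], [4, 5, -6], [-7, 8, 9], [1, 1, 1], [0, -1, 2]]
block1 = stream_block(iter(image), 3, column_l1)  # first block, kernel height 3
block2 = stream_block(block1, 2, column_l1)       # second block reads block 1's rows directly
result = list(block2)
print(result)  # [[24, 29, 34], [20, 24, 28]]
```

Because both blocks are generators, no complete intermediate feature map is ever materialized, which mirrors the on-chip-only storage goal described above.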

[0006] In some feasible implementations, determining the second basic processing block based on the first basic processing block of the full-adder network includes: shifting the operation order of the quantization sub-layer forward; determining the second basic processing block with the first integer feature map and integer addition kernel weights as input; and determining the second integer addition layer as:

[0007]

[0008] Wherein, ΔS represents the absolute difference between the feature map quantization factor S_F and the weight quantization factor S_W; the bit width of the result calculated by the second integer addition layer is:

[0009]

[0010] Where k represents the quantization bit width of the first integer feature map and the integer addition kernel weights, and C_in represents the number of input channels of the first integer feature map. Therefore, this application simplifies the processing flow of the integer addition layer of the basic block of the full-addition network by reordering the processing flow, minimizing the input-output bandwidth of the intermediate results.

[0011] In some feasible implementations, determining the second processing basic block based on the first processing basic block of the full-addition network further includes: extracting a shared quantization factor from the first integer addition sub-layer and integrating it into the inverse quantization sub-layer; and merging the floating-point operations in the inverse quantization sub-layer, the BN layer, and the quantization sub-layer to obtain a fused BN layer. The input of the fused BN layer is an intermediate integer feature map value with a bit width of Q_A, which is processed by one floating-point multiplication and one floating-point addition, and the output is a k-bit integer quantized value. Thus, this application simplifies the processing flow of the basic block of the full-addition network by fusing the floating-point calculations in the processing flow.

[0012] In some feasible implementations, the second processing basic block further includes: an activation function layer, used to apply a ReLU function to a k-bit quantized integer value by integer comparison, where a value less than 0 is set to 0 and any other value is kept unchanged, resulting in a k-bit integer activation value; and a pooling layer, used to compare the k-bit integer activation values, retain the larger value, and output a second integer feature map. Thus, this application simplifies the floating-point activation function layer and pooling layer in the original processing flow into an integer processing flow.

[0013] In some feasible implementations, optimizing the loop of the second integer addition sub-layer based on the row-level streaming processing strategy includes: setting the row-level loop for the first integer feature map in the second integer addition sub-layer to the highest level according to the row-level streaming processing strategy, and using pipelined processing for input and output data in the row-level loop; setting the column-level loop for the first integer feature map in the second integer addition sub-layer to the lowest level loop, and keeping the weight of the integer addition kernel unchanged in the column-level loop; and setting the input channel parallelism (ICP) and output channel parallelism (OCP) for multi-channel parallel computation. Thus, this application can optimize the loop structure, keeping the weight of the addition kernel unchanged in low-level loops to facilitate data reuse, using pipelined processing in high-level loops to increase throughput, and designing the parallelism of input and output to achieve parallel computation and improve hardware processing speed.

[0014] In some feasible implementations, the method further includes deploying the optimized second processing basic block onto an adder, wherein the adder is a digital signal processing unit (DSP) of an FPGA that processes eight shift-add operations simultaneously in single-instruction, multiple-data (SIMD) mode. Thus, a single DSP can process eight 4-bit shift-add operations simultaneously without using any additional logic resources. Using the DSP as the hardware accelerator for the restructured basic processing block operations accelerates the full-addition network.

[0015] In some feasible implementations, the method further includes optimizing the data flow of the second processing basic block, including: in the first-level loop, reading ICP data from the independent storage spaces of the feature map circular buffer and updating the ICP data by sliding a window along the feature map width (W) dimension, while keeping the ICP×OCP weight data unchanged; meanwhile, reading the ICP×OCP weight data to be used next from the addition kernel ROM into a data prefetch register over multiple steps for temporary storage, and replacing all ICP×OCP weight data involved in the calculation on entering the second-level loop; in the second-level loop, updating the ICP×OCP weight data and the ICP feature map data read by the adder simultaneously along the input channel (C_in) dimension; adding the ICP data read from the feature map circular buffer to the ICP×OCP weight data from the addition kernel ROM to obtain OCP intermediate results, which are stored in the intermediate result buffer; in the third-level and fourth-level loops, updating the weight data along the width and height of the addition kernel, while reading the data in the intermediate result buffer, accumulating it, and writing the accumulated result back to the intermediate result buffer; at the end of the fourth-level loop, outputting the accumulated result to the fused BN layer, which performs one floating-point multiplication and one floating-point addition to obtain the second integer feature map; and in the fifth-level loop, computing and outputting the second integer feature map along the output channel (C_out) dimension.

[0016] Therefore, weights are reused in the first-level loop, and data prefetching is applied to reduce the input bandwidth requirement. The intermediate result buffer addresses the large gap between input and output bandwidth: the feature map and the addition kernel weights have already been quantized to low bit widths, whereas the bit width Q_A of the data in the intermediate result buffer is much larger. Therefore, the designed data flow minimizes the storage requirements of the result buffer without affecting the processing speed, thereby minimizing the storage requirements of the entire computing module.
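The loop ordering described above can be sketched in software as follows. This is a hedged illustration, not the hardware design: the function names, the `[row][channel][column]` output layout, and the toy tensor sizes are assumptions, and the two innermost `for` loops stand in for ICP×OCP lanes that would run in parallel on the FPGA. The sketch only checks that the reordered loop nest computes the same result as a direct evaluation of the addition layer.

```python
def adder_layer_reference(fmap, weights):
    """Direct definition: out[row][o][col] = -sum |fmap - weight| over the kernel window."""
    c_in, c_out = len(fmap), len(weights)
    k_h, k_w = len(weights[0][0]), len(weights[0][0][0])
    h_out = len(fmap[0]) - k_h + 1
    w_out = len(fmap[0][0]) - k_w + 1
    return [[[-sum(abs(fmap[c][r + kh][col + kw] - weights[o][c][kh][kw])
                   for c in range(c_in) for kh in range(k_h) for kw in range(k_w))
              for col in range(w_out)]
             for o in range(c_out)]
            for r in range(h_out)]


def adder_layer_row_streamed(fmap, weights, icp=2, ocp=2):
    """Same computation with the row loop outermost and the column loop innermost."""
    c_in, c_out = len(fmap), len(weights)
    k_h, k_w = len(weights[0][0]), len(weights[0][0][0])
    h_out = len(fmap[0]) - k_h + 1
    w_out = len(fmap[0][0]) - k_w + 1
    out = []
    for row in range(h_out):                           # row loop: one output row at a time
        acc = [[0] * w_out for _ in range(c_out)]      # stands in for the intermediate result buffer
        for o0 in range(0, c_out, ocp):                # level 5: output-channel groups
            for kh in range(k_h):                      # level 4: kernel height
                for kw in range(k_w):                  # level 3: kernel width
                    for c0 in range(0, c_in, icp):     # level 2: input-channel groups
                        for col in range(w_out):       # level 1: slide along W, weights unchanged
                            for o in range(o0, min(o0 + ocp, c_out)):     # OCP lanes
                                for c in range(c0, min(c0 + icp, c_in)):  # ICP lanes
                                    acc[o][col] -= abs(
                                        fmap[c][row + kh][col + kw] - weights[o][c][kh][kw])
        out.append(acc)                                # this row can be passed to the next block
    return out


# Small deterministic test tensors: fmap[c][h][w], weights[o][c][kh][kw].
fmap = [[[(7 * c + 3 * h + 5 * w) % 9 - 4 for w in range(6)]
         for h in range(5)] for c in range(3)]
weights = [[[[(o + 2 * c + 3 * kh + 5 * kw) % 7 - 3 for kw in range(3)]
             for kh in range(3)] for c in range(3)] for o in range(3)]
print(adder_layer_row_streamed(fmap, weights) == adder_layer_reference(fmap, weights))
```

In the innermost column loop the weight indices `(o, c, kh, kw)` do not change, which is the weight reuse that the first-level loop is meant to provide.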

[0017] In a second aspect, embodiments of this application provide a full-addition network hardware accelerator, the hardware accelerator including a digital signal processing unit (DSP) of an FPGA, the DSP processing eight shift-add operations simultaneously in single-instruction, multiple-data (SIMD) mode; the second processing basic block of the full-addition network as described in any of the methods of the first aspect is deployed on the DSP. Its beneficial effects are as described in the first aspect and will not be repeated here.

[0018] In some feasible implementations, the DSP processing eight shift-add operations simultaneously in single-instruction, multiple-data (SIMD) mode includes: converting the shift-add operations into 4-bit signed addition operations using a simplification algorithm; packing two 4-bit signed addition operations into one 11-bit signed addition operation; and, in SIMD mode, using the four 12-bit-wide adders in the DSP to process four packed 11-bit signed addition operations simultaneously. Thus, by applying shift-add simplification and multi-addition packing, a DSP in SIMD mode can process eight 4-bit shift-add operations simultaneously without using any additional logic resources, improving processing efficiency and speed while reducing resource consumption.
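The packing step can be illustrated with a short Python sketch. The lane layout is an assumption made for illustration: the text does not spell out how the two 4-bit additions are placed inside the 11-bit operand, so the 6-bit lane spacing (`LANE = 6`) and the helper names are hypothetical. With this spacing, each packed operand happens to fit in an 11-bit signed value and each packed sum fits in a 12-bit signed value.

```python
LANE = 6  # assumed spacing between the two packed values (not taken from the patent text)


def pack(lo, hi):
    """Place two 4-bit signed values into one wider signed integer."""
    return lo + (hi << LANE)


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def unpack(total):
    """Recover the two independent sums from one packed sum."""
    lo = sign_extend(total, LANE)
    hi = (total - lo) >> LANE  # exact, because the low lane has been removed first
    return lo, hi


def packed_add(a_lo, a_hi, b_lo, b_hi):
    """Compute a_lo + b_lo and a_hi + b_hi with a single wide addition."""
    return unpack(pack(a_lo, a_hi) + pack(b_lo, b_hi))


print(packed_add(-8, 7, -8, 7))  # (-16, 14)
```

Four such packed additions running side by side in four 12-bit adders would give the eight simultaneous 4-bit additions that the paragraph describes.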

[0019] In a third aspect, embodiments of this application provide a computer storage medium storing instructions that, when executed on a computer, cause the computer to perform the method described in any one of the first aspects. The beneficial effects are as described in the first aspect and will not be repeated here.

Description of the Drawings

[0020] To more clearly illustrate the technical solutions of the various embodiments disclosed in this specification, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only a few embodiments disclosed in this specification. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0021] The accompanying drawings used in the description of the embodiments or prior art are briefly introduced below.

[0022] Figure 1 is a simplified schematic diagram of the full-addition network processing flow in the full-addition network hardware acceleration method provided in the embodiments of this application;

[0023] Figure 2 is a schematic diagram of the row-level streaming processing strategy in the full-addition network hardware acceleration method provided in the embodiments of this application;

[0024] Figure 3a is a schematic diagram of the hardware accelerator provided in an embodiment of this application;

[0025] Figure 3b is a schematic diagram of four 12-bit wide addition operations in a DSP provided in an embodiment of this application;

[0026] Figure 3c is a schematic diagram of the simplified shift-add operation provided in the embodiments of this application;

[0027] Figure 3d is a schematic diagram of the packing of multiple addition operations provided in the embodiments of this application;

[0028] Figure 4 is a schematic diagram of the data flow optimized based on the row-level streaming processing strategy provided in an embodiment of this application;

[0029] Figure 5 is a schematic diagram of the network structure of the full-addition network A2NN-VGGNet-13 provided in Embodiment 1 of this application;

[0030] Figure 6 is a diagram of the resource consumption on the FPGA provided in Embodiment 1 of this application.

Detailed Implementation

[0031] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments of this application will be described below with reference to the accompanying drawings.

[0032] In the description of the embodiments of this application, the words "exemplary," "for example," or "for instance" are used to indicate examples, illustrations, or explanations. Any embodiment or design described as "exemplary," "for example," or "for instance" in the embodiments of this application should not be construed as being more preferred or advantageous than other embodiments or designs. Specifically, the use of the words "exemplary," "for example," or "for instance" is intended to present the relevant concepts in a specific manner.

[0033] In the description of the embodiments in this application, the term "and / or" is merely a description of the association relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, B existing alone, and A and B existing simultaneously. Furthermore, unless otherwise stated, the term "multiple" means two or more. For example, multiple systems refer to two or more systems, and multiple terminals refer to two or more terminals.

[0034] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. The terms "comprising," "including," "having," and their variations all mean "including but not limited to," unless otherwise specifically emphasized.

[0035] In the description of the embodiments in this application, "some embodiments" are mentioned, which describe a subset of all possible embodiments. However, it is understood that "some embodiments" can be the same subset or different subsets of all possible embodiments, and can be combined with each other without conflict.

[0036] In the description of the embodiments of this application, the terms "first, second, third, etc." or module A, module B, module C, etc. are used only to distinguish similar objects and do not represent a specific ordering of objects. It is understood that, where permitted, a specific order or sequence can be interchanged so that the embodiments of this application described herein can be implemented in an order other than that illustrated or described herein.

[0037] In the description of the embodiments of this application, the reference numerals for the steps, such as S110, S120, etc., do not necessarily indicate that the steps will be executed in this manner. Where permissible, the order of the steps can be interchanged or executed simultaneously.

[0038] Unless otherwise defined, all technical and scientific terms used in the embodiments of this application have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the embodiments of this application is for the purpose of describing the embodiments of this application only and is not intended to limit this application.

[0039] In a convolutional neural network, a convolutional layer is defined as follows:

[0040]

[0041] where the three tensors in the formula represent the input feature map, the output feature map, and the weights of the convolutional layer, respectively. The convolution kernel uses the L2 norm as its similarity measure, whereas the addition layers in a full-addition network use the L1 norm to measure the similarity between the addition kernel parameters and the input feature map, in order to eliminate multiplication operations.

[0042] The addition layer in a full addition network is defined as:

[0043]
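The formula itself is not reproduced in this text. As a hedged illustration, the sketch below assumes the form commonly used for adder networks, in which the response is the negative L1 distance between the input window and the kernel, Y = -Σ|X - W|. The function names and the 1-D toy data are hypothetical and chosen only to contrast it with a multiply-accumulate response.

```python
def conv_response(patch, kernel):
    """Convolution-style response: sum of element-wise products."""
    return sum(x * w for x, w in zip(patch, kernel))


def adder_response(patch, kernel):
    """Addition-layer response (assumed form): negative L1 distance, no multiplications."""
    return -sum(abs(x - w) for x, w in zip(patch, kernel))


def adder_layer_1d(signal, kernel):
    """Slide the kernel over a 1-D signal (stride 1, no padding)."""
    k = len(kernel)
    return [adder_response(signal[i:i + k], kernel)
            for i in range(len(signal) - k + 1)]


signal = [1, 3, 2, 5, 4]
kernel = [1, 3, 2]
print(adder_layer_1d(signal, kernel))          # [0, -6, -5]
print(conv_response(signal[:3], kernel))       # 14
```

The response is 0 where the window matches the kernel exactly and becomes more negative as the window differs more, so only subtraction, absolute value, and accumulation are needed.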

[0044] To facilitate hardware deployment, the feature map and parameters of the additive layer are quantized into low-bit-width integers, and the quantization scaling factor is set to an integer power of 2. Therefore, the quantized additive layer can be represented by the following formula:

[0045]

[0046] Although the full-addition network has been quantized to a low bit width, it is still not suitable for direct deployment on FPGA due to its complex processing flow and large amount of data storage and computation.

[0047] To address these issues, this application proposes a hardware processing method for quantized full-adder networks. This method reconstructs the full-adder network algorithm for hardware. First, it simplifies the processing flow of basic blocks by reordering the processing flow and merging steps. Then, it adopts a row-level streaming processing strategy to enable on-chip access to feature maps in the hardware implementation and applies loop optimization based on the row-level streaming processing strategy to the addition layer to facilitate hardware deployment.

[0048] This application proposes a hardware processing method for a quantized full-addition network, including the following steps S1-S4.

[0049] S1 simplifies the processing flow.

[0050] For the hardware implementation of the full-addition network, S1 can simplify the processing flow through the following steps S10-S12.

[0051] S10 simplifies the calculation equations for the quantized addition layer, including:

[0052] S101, convert the subtraction in formula (3) into addition;

[0053] Subtraction in the equation can be converted into addition by preprocessing the weights into two's complement form;

[0054] S102, extract the shared quantization scaling factor and convert the two multiplication operations into shift operations, which gives the simplified calculation equation for the quantized addition layer:

[0055]

[0056] S_A = min(S_F, S_W)

[0057] ΔS = |S_F - S_W|

[0058] where S_A represents the shared quantization factor, and ΔS represents the absolute difference between the feature map quantization factor S_F and the weight quantization factor S_W.
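Steps S101 and S102 can be checked numerically with a small sketch. Several things here are assumptions for illustration: the scaling factors are treated as powers of two 2^S_F and 2^S_W with non-negative integer exponents so that plain integer shifts can be used, the function names are hypothetical, and the simplified equation itself is not reproduced in this text. The sketch verifies that one shift by ΔS, one addition with a pre-negated weight, and a shared factor 2^S_A give the same value as scaling both operands separately.

```python
def reference_term(x_q, w_q, s_f, s_w):
    """|2^S_F * x_q - 2^S_W * w_q| with both operands scaled separately."""
    return abs((x_q << s_f) - (w_q << s_w))


def simplified_term(x_q, neg_w_q, s_f, s_w):
    """Same value using one shift by delta_s, one addition, and a shared scale 2^S_A."""
    s_a = min(s_f, s_w)
    delta_s = abs(s_f - s_w)
    if s_f >= s_w:
        inner = (x_q << delta_s) + neg_w_q   # feature map carries the extra shift
    else:
        inner = x_q + (neg_w_q << delta_s)   # weight carries the extra shift
    return abs(inner) << s_a                 # shared factor applied once, afterwards


weights = [-8, -3, 0, 5, 7]
neg_weights = [-w for w in weights]          # negated offline (S101), so hardware only adds

print(reference_term(5, 7, 3, 1), simplified_term(5, -7, 3, 1))  # 26 26
```

Because the shared factor is applied after the absolute value, it can be deferred to a later stage, which is what allows it to be folded into the inverse quantization step.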

[0059] S11, determine the second processing basic block based on the first processing basic block of the full addition network.

[0060] Based on the simplified quantization addition layer calculation equation (4), the basic block processing flow of the full adder network is simplified. The simplified basic block processing flow of the full adder network includes S111 reordering each basic block and S112 performing merge calculation.

[0061] For example, Figure 1 is a simplified diagram of the full-addition network processing flow. To make the changes in the processing flow before and after simplification easier to understand and compare, this application shows the original algorithm processing flow, the reordered algorithm flow, and the simplified hardware processing flow after reordering and merging together in Figure 1. Figure 1 (1) shows the flowchart of the original algorithm of the full-addition network, Figure 1 (2) shows the flowchart of the algorithm after reordering, and Figure 1 (3) shows the simplified hardware processing flow after floating-point calculation fusion.

[0062] In this embodiment of the application, each processing basic block in the original algorithm processing flow can be denoted as a first processing basic block, and each processing basic block after reordering and merging in Figure 1 (3) is denoted as a second processing basic block.

[0063] The original algorithm processing flow based on the quantized addition layer calculation equation (4) is shown in Figure 1 (1). In the full-addition network model, the original algorithm processing flow can include multiple consecutive processing basic blocks, where the output of the i-th processing basic block is the input of the (i+1)-th processing basic block. Each processing basic block includes a quantized addition layer, a batch normalization (BN) layer, an activation function layer, and a pooling layer. The quantized addition layer includes a quantization sub-layer, an integer addition sub-layer, and an inverse quantization sub-layer.

[0064] In processing basic block i shown in Figure 1 (1), quantization sub-layer i applies a floating-point multiplication to the input floating-point feature map F_i to convert it into a k-bit integer feature map, and the k-bit integer addition kernel weights are obtained. Integer addition sub-layer i left-shifts the integer feature map by S_F and the addition kernel weights by S_W, and then performs integer addition to obtain a 32-bit integer. Inverse quantization sub-layer i performs a floating-point multiplication on the 32-bit integer to convert it into a floating-point number, which is normalized by the BN layer and then passed through the activation function layer and the pooling layer, both of which perform floating-point comparisons, to obtain a floating-point feature map. This floating-point feature map is both the output of the i-th processing basic block and the input of the (i+1)-th processing basic block, which applies the same quantized addition, BN, activation function, and pooling operations to it. The integer addition sub-layer in the original algorithm flow can be denoted as the first integer addition sub-layer.

[0065] To simplify the hardware implementation of the full-addition network model, the original algorithm processing flow shown in Figure 1 (1) needs to be simplified.

[0066] As shown in Figure 1 (2), the basic block reordering in S111 can be achieved through the following steps S1111-S1112.

[0067] S1111, move quantization sub-layer i forward into processing basic block i-1, so that the reordered processing basic block i takes as input the integer feature map output by quantization sub-layer i in processing basic block i-1 and the integer addition kernel weights.

[0068] Integer addition sub-layer i performs left shift and integer addition operations on the integer feature map and the addition kernel weights to obtain a 32-bit integer; inverse quantization sub-layer i performs a floating-point multiplication on the 32-bit integer to convert it into a floating-point number, which then passes through the floating-point multiplication and addition of the BN layer to obtain a floating-point number.

[0069] S1112, move the (i+1)th quantization sub-layer forward into the basic processing block i.

[0070] The quantization sublayer i+1 performs floating-point multiplication on the floating-point number to obtain a k-bit integer, which is then passed through the activation function layer for integer comparison and the pooling layer for integer comparison to obtain the integer feature map.

[0071] Since quantization in a full-addition network does not change the relative relationships of the data, the quantization sub-layer i+1 is moved forward to the basic processing block i, so that the feature maps in the activation function layer and pooling layer operations remain in integer format.
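The reasoning in the paragraph above rests on quantization being order-preserving. A minimal sketch can make this concrete, under assumptions that are not taken from the text: a round-to-nearest quantizer with a power-of-two scale and clamping to k signed bits, the standard ReLU, and hypothetical function names. Because such a quantizer is monotonic and maps 0 to 0, applying ReLU or a max comparison before or after quantization gives the same integers.

```python
import math


def quantize(x, shift, k=8):
    """Map a value to a k-bit signed integer using the power-of-two scale 2**shift."""
    q = math.floor(x / (2.0 ** shift) + 0.5)          # round to nearest (monotonic)
    lo, hi = -(1 << (k - 1)), (1 << (k - 1)) - 1
    return max(lo, min(hi, q))                        # clamp to the k-bit range


def relu(v):
    """Standard ReLU: negative values become 0, others pass through."""
    return v if v > 0 else 0


values = [-3.7, -0.2, 0.0, 0.4, 1.6, 2.5, 9.9, 300.0]
shift = -1  # scale 0.5, chosen only for the demonstration

relu_commutes = all(relu(quantize(v, shift)) == quantize(relu(v), shift) for v in values)
max_commutes = all(max(quantize(a, shift), quantize(b, shift)) == quantize(max(a, b), shift)
                   for a in values for b in values)
print(relu_commutes, max_commutes)  # True True
```

This is why the activation and pooling comparisons can run on k-bit integers once the quantization sub-layer has been moved ahead of them.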

[0072] As shown in Figure 1 (3), the merging calculation in S112 can be achieved through the following steps S1121-S1123.

[0073] S1121, extract the shared quantization factor S_A.

[0074] The shared quantization factor S_A extracted from the integer addition sub-layer is integrated into the inverse quantization sub-layer.

[0075] S1122, left-shift the integer feature map by ΔS and perform integer addition with the integer weights to obtain a Q_A-bit integer.

[0076] S1123, perform floating-point calculation fusion to determine the fused BN layer.

[0077] In Figure 1 (2), the current inverse quantization sub-layer i includes a floating-point multiplication operation; the BN layer includes a floating-point multiplication operation and a floating-point addition operation; and quantization sub-layer i+1 includes a floating-point multiplication operation. The floating-point operations of the same kind in the current inverse quantization sub-layer i, the BN layer, and the next quantization sub-layer i+1 are merged to obtain the fused BN layer i.

[0078] As shown in Figure 1 (3), the fused BN layer i contains one floating-point multiplication and one floating-point addition, and the coefficients used for the multiplication and addition can be calculated before hardware deployment.

[0079] The formula for fusing the i-th BN layer is as follows:

[0080] y = γ_fused × x + β_fused

[0081]

[0082]

[0083] where S_A of the i-th addition layer is its shared scaling factor, S_F of the (i+1)-th addition layer is its feature map scaling factor, γ_fused is the multiplication coefficient after fusion, and β_fused is the addition coefficient after fusion. γ_i, β_i, μ_i, and σ_i^2 are the original parameters of the BN layer, representing the BN output multiplication factor, the BN output addition factor, the mean of the BN input, and the variance of the BN input, respectively. ε is a very small number that prevents the denominator from being zero.
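The two coefficient formulas are not reproduced in this text. The sketch below shows one derivation that is consistent with the definitions above, under stated assumptions: inverse quantization multiplies by the shared scale of layer i, quantization divides by the feature map scale of layer i+1, and rounding and clamping are applied afterwards and left out here. The function and parameter names (`fuse_bn`, `scale_a`, `scale_f_next`) are hypothetical. It checks that the fused multiply-add matches the three separate floating-point stages.

```python
import math


def fuse_bn(gamma, beta, mean, var, scale_a, scale_f_next, eps=1e-5):
    """Fold inverse quantization, BN, and the next quantization scale into y = g * x + b."""
    inv_std = 1.0 / math.sqrt(var + eps)
    gamma_fused = gamma * inv_std * scale_a / scale_f_next
    beta_fused = (beta - gamma * mean * inv_std) / scale_f_next
    return gamma_fused, beta_fused


def unfused(x_int, gamma, beta, mean, var, scale_a, scale_f_next, eps=1e-5):
    """The same computation as three separate floating-point stages."""
    f = scale_a * x_int                                     # inverse quantization
    b = gamma * (f - mean) / math.sqrt(var + eps) + beta    # batch normalization
    return b / scale_f_next                                 # scaling for the next quantization


params = dict(gamma=1.5, beta=-0.25, mean=3.0, var=4.0, scale_a=0.125, scale_f_next=0.5)
g, b = fuse_bn(**params)
print(math.isclose(g * 1000 + b, unfused(1000, **params), rel_tol=1e-9))  # True
```

Since `g` and `b` depend only on trained parameters and fixed scales, they can be computed once before deployment, leaving a single multiplication and a single addition per value at run time.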

[0084] In the second processing basic block, the activation function layer applies the ReLU function to the k-bit quantized integer value by integer comparison: a value less than 0 is set to 0, and any other value is kept unchanged, which gives the k-bit integer activation value. The pooling layer compares the k-bit integer activation values, retains the larger value, and outputs the second integer feature map.
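A minimal sketch of the integer activation and pooling stages follows. It assumes the standard ReLU, in which non-negative values pass through unchanged, and a 2×2 max-pooling window with stride 2; the window size is not stated in this section and is chosen only for illustration, as are the function names.

```python
def relu_int(fmap):
    """Integer ReLU: one comparison per value, negatives become 0."""
    return [[v if v > 0 else 0 for v in row] for row in fmap]


def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 using integer comparisons only."""
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]


quantized = [[-3, 2, 0, -1],
             [4, -5, 1, 7],
             [-2, -2, 3, -8],
             [-1, -6, -4, 5]]
pooled = max_pool_2x2(relu_int(quantized))
print(pooled)  # [[4, 7], [0, 5]]
```

Neither stage needs floating-point hardware, which is the simplification this step is after.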

[0085] At this point, each simplified processing basic block, as shown in Figure 1-(3), includes an integer addition sublayer, a fused BN layer, an activation function layer, and a pooling layer.

[0086] S13, determine the integer addition sublayer for hardware implementation.

[0087] For example, as shown in Figure 1-(2) and Figure 1-(3), since the shared scaling factor of the integer addition sublayer is merged into the BN layer, the formula for the integer addition sublayer used in hardware implementation is:

[0088]

[0089] The bit width of the final result of the integer addition sublayer can be calculated as follows:

[0090]

[0091] Here, k represents the quantization bit width of the feature map and weights, $C_{in}$ represents the number of input channels of the feature map, and ΔS is the number of bits shifted. The integer addition sublayer described above for hardware implementation can be referred to as the second integer addition sublayer.

[0092] In the hardware processing flow, the output feature map of the current processing block becomes the input feature map of the next level block, meaning the feature map needs to be transferred between each block. Typically, the feature map data is very large, making it difficult to store entirely on-chip.

[0093] If these feature maps are to be stored in off-chip memory, then the off-chip memory needs to be accessed frequently during the operation, which will limit the processing speed of the accelerator to the transmission bandwidth, and off-chip memory will also consume additional power.

[0094] To address this issue, the hardware processing method for quantized full-adder networks provided in this application performs loop optimization of the quantized full-adder network model based on a row-level streaming processing strategy (step S2), which allows the required feature maps to be transmitted and stored using only on-chip storage resources. Step S2 is described in detail below.

[0095] S2, determine the row-level streaming processing strategy.

[0096] In this application, the input feature map of each second processing basic block can be denoted as the first integer feature map, and the output feature map of each second processing basic block can be denoted as the second integer feature map.

[0097] The row-level streaming processing strategy involves dividing the first integer feature map into two-dimensional data by rows. The second processing block only stores the rows necessary for the operation. The second integer feature map output by the current second processing block is the first integer feature map input to the next level second processing block.

[0098] As shown in Figure 2, after the reconstruction of the full-addition network, the input feature map of each second processing basic block is three-dimensional data. The three-dimensional data of the input feature map is divided into two-dimensional data by rows. Each second processing basic block stores only the rows necessary for the operation. After shift and addition operations on these rows, the corresponding data is output. The second integer feature map output by the current second processing basic block is the input of the next level second processing basic block.

[0099] For example, the first integer feature map is divided into two-dimensional data by rows. For an adder kernel of size 3x3 with a step size of 1, the zero-padding row, the first row, and the second row necessary for the operation in the i-th second processing block are stored first. After the operation of these three rows, the first row of the output feature map of the i-th second processing block is obtained. Then, the zero-padding row is discarded to save memory, and the third row is stored. After the operation of the first row, the second row, and the third row, the second row of the output feature map is obtained. Next, the first row is discarded to save memory, and the fourth row of the feature map is stored. After the operation of the second row, the third row, and the fourth row, the third row of the output feature map is obtained, and so on. The subsequent operations follow the same pattern. Furthermore, the second integer feature map output by the i-th second processing block is the input of the (i+1)-th second processing block. The relationship between the i-th and i+1-th second processing blocks is pipelined, thereby achieving high throughput processing of the system.
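The row schedule described above can be sketched in Python as a three-row buffer feeding a chained generator. This is only an illustration under simplifying assumptions: a single channel, no shift (ΔS = 0), and an adder operation taken as the sum of |feature + weight| over the 3×3 window. The function names, kernels, and feature map values are hypothetical.

```python
from collections import deque

def adder_row(rows3, kernel):
    """Compute one output row from three buffered input rows (zero padding left/right)."""
    width = len(rows3[0])
    out = []
    for x in range(width):
        acc = 0
        for v in range(3):
            for u in range(3):
                col = x + u - 1
                f = rows3[v][col] if 0 <= col < width else 0
                acc += abs(f + kernel[v][u])  # assumed adder operation, shift omitted
        out.append(acc)
    return out

def stream_rows(rows, kernel, width):
    """Consume input rows one at a time and yield output rows, storing only three rows."""
    zero = [0] * width
    buf = deque([zero], maxlen=3)   # starts with the top zero-padding row
    for row in rows:
        buf.append(row)             # the oldest row is discarded automatically
        if len(buf) == 3:
            yield adder_row(list(buf), kernel)
    buf.append(zero)                # bottom zero-padding row
    if len(buf) == 3:
        yield adder_row(list(buf), kernel)

fmap = [[1, -2, 3], [0, 4, -1], [2, 2, -3], [5, -4, 0]]
k1 = [[1, 0, -1], [2, -2, 1], [0, 1, -1]]
k2 = [[0, 1, 0], [1, -3, 1], [0, 1, 0]]

# Block i+1 pulls rows from block i as soon as they are produced.
stage1 = stream_rows(iter(fmap), k1, 3)
stage2 = stream_rows(stage1, k2, 3)
result = list(stage2)
print(len(result), "output rows")
```

Chaining the generators mirrors the pipelined relationship between the i-th and (i+1)-th blocks: the downstream block starts as soon as the upstream block has emitted enough rows, and neither block ever holds a full feature map.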

[0100] S3, optimize the loop of the second integer addition sub-layer in the second processing basic block based on the row-level streaming processing strategy.

[0101] For example, the original algorithm for the full addition network is shown in Algorithm 1.

[0102] Algorithm 1:

[0103]

[0104] The original Algorithm 1 is a standard addition-layer loop nest consisting of six levels of loops, loop1 (the first-level loop) through loop6 (the sixth-level loop). $c_i$ denotes the input channel index, $c_o$ denotes the output channel index, $C_{in}$ represents the total number of input channels, and $C_{out}$ represents the total number of output channels.

[0105] The original Algorithm 1 has the following shortcomings in hardware implementation. First, in the first- to third-level loops, k×k×$C_{in}$ input feature map data and k×k×$C_{in}$ addition kernel weight data are used to compute a single pixel of the output feature map. If these loops are processed in parallel, the input bandwidth is k×k×$C_{in}$ while the output bandwidth is 1, so there is a large gap between the input and output bandwidths. Second, the weights change continuously in the first- to third-level loops, which prevents data reuse.

[0106] In other words, the lowest-level loops (the first to third levels) consume k×k×$C_{in}$ feature map data and k×k×$C_{in}$ addition kernel weight data to produce only one output pixel. The resulting gap between input and output bandwidth is detrimental to parallel computing, and the constantly changing weights make data reuse impossible.

[0107] To address these issues, this application can reconstruct the loop of the addition layer of Algorithm 1 based on a row-level streaming processing strategy and use loop optimization to obtain Algorithm 2.

[0108] Algorithm 2:

[0109]

[0110] In Algorithm 2, the row-level loop of the first integer feature map is first set to the highest level according to the row-level stream processing strategy, and pipeline processing is adopted.

[0111] Then, the column-level loop (Loop1) of the first integer feature map is set as the lowest level loop. In this level loop, the weight of the addition kernel remains unchanged, which facilitates data reuse.

[0112] In Algorithm 2, input channel parallelism (ICP) and output channel parallelism (OCP) are also set for parallel computation to improve the speed of hardware processing.
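Since the listings of Algorithm 1 and Algorithm 2 are not reproduced in this text, the following Python sketch only illustrates the loop reordering described above; it is not the patent's algorithm. It assumes the adder operation is |(f << ΔS) + w| accumulated over the window, that the shift applies to the feature map value, zero padding, and a stride of 1. All function names, the data layout `F[c_i][y][x]` and `Wt[c_o][c_i][v][u]`, and the sample data are hypothetical.

```python
def feat(F, ci, y, x):
    """Read F[ci][y][x], returning 0 outside the map (zero padding)."""
    H, W = len(F[0]), len(F[0][0])
    return F[ci][y][x] if 0 <= y < H and 0 <= x < W else 0

def adder_reference(F, Wt, ds, k=3):
    """Pixel-by-pixel nest: kernel and input-channel loops are innermost."""
    C_in, H, W, C_out, pad = len(F), len(F[0]), len(F[0][0]), len(Wt), k // 2
    out = [[[0] * W for _ in range(H)] for _ in range(C_out)]
    for co in range(C_out):
        for y in range(H):
            for x in range(W):
                for ci in range(C_in):
                    for v in range(k):
                        for u in range(k):
                            f = feat(F, ci, y + v - pad, x + u - pad)
                            out[co][y][x] += abs((f << ds) + Wt[co][ci][v][u])
    return out

def adder_row_stream(F, Wt, ds, icp, ocp, k=3):
    """Reordered nest: rows outermost, columns innermost, channels tiled by ICP/OCP."""
    C_in, H, W, C_out, pad = len(F), len(F[0]), len(F[0][0]), len(Wt), k // 2
    out = [[None] * H for _ in range(C_out)]
    for y in range(H):                                # row loop: highest level
        for co0 in range(0, C_out, ocp):              # loop5: output-channel tiles
            tile = range(co0, min(co0 + ocp, C_out))
            partial = {co: [0] * W for co in tile}    # intermediate result buffer
            for v in range(k):                        # loop4: kernel height
                for u in range(k):                    # loop3: kernel width
                    for ci0 in range(0, C_in, icp):   # loop2: input-channel tiles
                        chans = range(ci0, min(ci0 + icp, C_in))
                        for x in range(W):            # loop1: columns, weights held fixed
                            for co in tile:           # these two loops model the
                                for ci in chans:      # ICP x OCP parallel adders
                                    f = feat(F, ci, y + v - pad, x + u - pad)
                                    partial[co][x] += abs((f << ds) + Wt[co][ci][v][u])
            for co in tile:
                out[co][y] = partial[co]              # one finished output row per channel
    return out

# Hypothetical 4-bit signed sample data: 2 input channels, 3 output channels, 3x4 map.
F = [[[(3 * c + 5 * y + 7 * x) % 16 - 8 for x in range(4)] for y in range(3)] for c in range(2)]
Wt = [[[[(o + 2 * c + 3 * v + 5 * u) % 16 - 8 for u in range(3)] for v in range(3)]
       for c in range(2)] for o in range(3)]
print(adder_row_stream(F, Wt, 2, icp=2, ocp=2) == adder_reference(F, Wt, 2))
```

In the reordered nest the weight index `(co, ci, v, u)` does not depend on `x`, so the same ICP×OCP weights serve every column of a row, which is the reuse property attributed to the lowest-level column loop.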

[0113] S4, deploy the second integer addition sublayer, optimized based on the row-level streaming processing strategy, on a hardware accelerator.

[0114] In some implementations, the hardware accelerator is a digital signal processing unit (DSP) of an FPGA that processes eight 4-bit shift-add operations simultaneously in single-instruction multiple-data (SIMD) mode.

[0115] For example, as shown in Figure 3a, the input feature map and weights are both quantized to 4-bit signed numbers. Eight weight data and eight feature map row data are input into one DSP48E1, which implements eight parallel shift-add operations. The calculation result output by the DSP48E1 is eight (k+1+ΔS)-bit signed integers.

[0116] It should be understood that, within a second processing basic block, $P_{ic}$ feature map data and $P_{ic} \times P_{oc}$ weight data need to be input each time. Every 8 feature map data and 8 weight data form one group, and each group is input into one DSP.

[0117] To achieve this function, the DSP adopts single-instruction multiple-data (SIMD) mode and optimizes the calculation process using shift-add simplification (step S31) and multi-addition encapsulation (step S32). Subsequently, the calculation result output by the DSP48E1 is input to the absolute value calculation module 32 to calculate the absolute value (step S33). Finally, the results are accumulated through the adder tree 33, and the accumulated result is a 9-bit unsigned number (step S34).

[0118] Steps S31-S34 are described in detail below based on the accompanying drawings and specific embodiments.

[0119] S31, based on the DSP's single-instruction multiple-data (SIMD) mode, use a shift-add simplification algorithm to transform shift-add operations into 4-bit signed number addition operations.

[0120] As shown in Figure 3b, in SIMD mode a DSP can process four 12-bit-wide addition operations, n1+m1, n2+m2, n3+m3, and n4+m4, in one clock cycle. On this basis, this application embodiment adopts a shift-add simplification algorithm to eliminate the influence of the shift on the addition calculation.

[0121] For example, as shown in Figure 3c, taking a shift ΔS = 2 as an example, g = (a << 2) + b is calculated, where a and b are 4-bit signed numbers. The original calculation method first shifts a two bits to the left to obtain a 6-bit signed number and sign-extends b to 6 bits. An adder with a bit width of 6 bits then performs the addition to obtain the 7-bit calculation result g.

[0122] In the shift-add simplification of this application, the last two bits b[1] and b[0] of the addend b are retained and used directly as the last two bits g[1] and g[0] of the result g. The first two bits b[3] and b[2] of b are then sign-extended to 4 bits and added to a using an adder with a bit width of 4 bits, which yields the first 5 bits of the result g. Combined with the last two bits g[1] and g[0], this gives the calculation result g with a bit width of 7 bits.

[0123] This simplifies the shift-add operation to a 4-bit addition operation, allowing the use of a low-width 4-bit adder without consuming any additional logic resources. Therefore, regardless of the value of the shift ΔS, the actual bit width of the addition remains unaffected, and a low-width 4-bit adder can be used directly for addition.
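The simplification can be checked exhaustively with a small Python sketch. The function names are hypothetical, and Python integers stand in for the fixed-width hardware signals; the arithmetic right shift `b >> shift` plays the role of taking the sign-extended upper bits of b.

```python
def shift_add_reference(a, b, shift):
    """Wide computation: g = (a << shift) + b."""
    return (a << shift) + b

def shift_add_simplified(a, b, shift):
    """Narrow computation: only a and the upper bits of b go through the adder."""
    low = b & ((1 << shift) - 1)   # low bits of b pass straight through to g
    high = b >> shift              # arithmetic shift: sign-extended upper bits of b
    upper = a + high               # narrow addition, independent of the shift amount
    return (upper << shift) | low  # concatenate the adder output with the low bits

# Exhaustive check over all 4-bit signed operands and shifts 0..3.
mismatches = [
    (a, b, s)
    for a in range(-8, 8)
    for b in range(-8, 8)
    for s in range(4)
    if shift_add_simplified(a, b, s) != shift_add_reference(a, b, s)
]
print("mismatches:", len(mismatches))
```

The identity holds because b equals `(high << shift) + low`, so the low bits never interact with the shifted a, and the adder only ever sees two 4-bit operands.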

[0124] S32, use multi-addition encapsulation to optimize the calculation process.

[0125] For example, as shown in Figure 3d, two low-bit-width 4-bit signed additions, "a+b" and "c+d", are encapsulated into one 11-bit signed addition, f+e, where f = c+d and e = a+b. To ensure calculation precision and prevent data overflow, each 4-bit datum is extended to 5 bits, and a 1-bit zero is inserted between the two fields to prevent carry propagation. Two additions of 4-bit signed numbers are thereby encapsulated into one 11-bit signed addition.

[0126] For example, a[3,…0], b[3,…0], c[3,…0] and d[3,…0] are signed numbers. When extending them, one bit equal to the sign bit is added, and a zero bit is placed between the two data fields to prevent carry. In this way, the two 4-bit signed additions “a[3,…0]+b[3,…0]” and “c[3,…0]+d[3,…0]” are encapsulated into one 11-bit signed addition “f[4,…0]+e[4,…0]”, where f[4,…0] = c[4,…0]+d[4,…0] and e[4,…0] = a[4,…0]+b[4,…0].

[0127] Thus, after applying shift-add simplification and multi-addition encapsulation, a DSP in single-instruction multiple-data (SIMD) mode can process eight 4-bit shift-add operations simultaneously without using any additional logic resources.
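The packing of two 4-bit additions into one 11-bit addition can be sketched in Python as follows. The field layout (low field in bits 0-4, guard zero in bit 5, high field in bits 6-10) is an assumption consistent with the description above, and the helper names are hypothetical.

```python
MASK5 = 0x1F  # five-bit field mask

def to_field(x):
    """Sign-extend a 4-bit signed value (-8..7) into a 5-bit two's-complement field."""
    return x & MASK5

def from_field(field):
    """Interpret a 5-bit two's-complement field as a signed integer."""
    return field - 32 if field & 0x10 else field

def pack(low, high):
    """Build an 11-bit operand: [high: 5 bits][guard zero: 1 bit][low: 5 bits]."""
    return (to_field(high) << 6) | to_field(low)

def packed_add(a, b, c, d):
    """Compute e = a + b and f = c + d with a single wide addition."""
    total = pack(a, c) + pack(b, d)
    e = from_field(total & MASK5)         # the low field's carry lands in the guard bit
    f = from_field((total >> 6) & MASK5)  # the high field is untouched by that carry
    return e, f

print(packed_add(7, 7, -8, -8))
```

Each 12-bit SIMD lane can hold one such 11-bit packed addition, so four lanes give the eight simultaneous additions described above.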

[0128] It is worth noting that, since the addition calculation result in the addition layer needs to be taken as absolute value and accumulated, a mixed signed-unsigned number data representation method is adopted in the low-bit-width octa-parallel adder 31.

[0129] S33, input the addition calculation result into the absolute value calculation module 32 to calculate the absolute value.

[0130] Before taking the absolute value, the feature maps and weights used for computation are signed numbers, and the computation result of adder 31 is also signed. After taking the absolute value, all data is converted into unsigned integers, which reduces the resource consumption of the subsequent adder tree 33.

[0131] S34, accumulate the results through the adder tree 33; the accumulated result is a 9-bit unsigned number.

[0132] Meanwhile, each datum in the low-bit-width octa-parallel adder 31 is set to the minimum bit width that does not overflow, to further save FPGA resources.

[0133] The full-adder network hardware acceleration method provided in this application also includes optimizing the data flow of the second processing basic block based on the row-level streaming processing strategy.

[0134] The data flow of the second processing basic block corresponds to Algorithm 2, enabling maximum weight reuse and minimizing memory requirements. For example, as shown in Figure 4, it includes the following steps:

[0135] S41, determine the parallelism parameters ICP and OCP, which determine the computing power of the hardware accelerator.

[0136] S42, sample the adder core and obtain ICP×OCP weight data from the adder core ROM.

[0137] Taking the 3x3 adder layer in the adder array 41 shown in Figure 4 as an example, the weight data in the adder core ROM comes from the sampling points in all 3x3 adder cores 42.

[0138] S43, in each clock cycle, the $P_{ic}$ data from the input feature map are added to the ICP×OCP weight data from the adder core ROM to produce OCP intermediate results, which are then stored in the intermediate result buffer 43.

[0139] S44, according to the multi-level loop order in Algorithm 2, data access to the feature map circular buffer 44 can be controlled by the address counter.

[0140] The data stream is processed based on row-level streaming as follows:

[0141] In the first-level loop loop1, ICP data points are read from the independent storage space of the feature map circular buffer 44. The ICP data points are updated by sliding a window along the width W dimension of the feature map, while the ICP×OCP weight data remain unchanged.

[0142] In the second-level loop (loop2), the ICP×OCP weight data and the ICP feature map data are updated simultaneously along the input channel dimension $C_{in}$.

[0143] It is worth noting that since the weights are reused in the first loop, a data preloading technique can be applied to reduce bandwidth requirements. Specifically, during the first loop, the weight data to be used later is read into the data prefetch register 45 in multiple steps at a small bandwidth and temporarily stored. When entering the second loop, all ICP×OCP weight data involved in the calculation are replaced.

[0144] In the third-level loop (loop 3) and the fourth-level loop (loop 4), the weights are updated along the width and height of the adder core 42. Simultaneously, data in the intermediate result buffer 43 is read out, accumulated, and then written back to the adders in the adder array 41.

[0145] At the end of the fourth loop (loop4), the accumulated result is output as an output feature map to the fused BN layer for a floating-point multiplication and addition operation.

[0146] In the fifth-level loop (loop5), the entire output feature map in the intermediate result buffer 43 is calculated and output along the output channel dimension $C_{out}$.

[0147] It is worth noting that this data stream can also be applied to adder layers with other adder core sizes and strides. Furthermore, when processing adder layers with different adder core sizes, the third and fourth level loops concerning the adder core height and width do not necessarily need to be controlled by address counters, but can be processed in parallel by replicating multiple adder arrays.

[0148] The feature map and the weights of the addition kernel have been quantized to a low bit width, whereas the bit width $Q_A$ of the data in the intermediate result buffer 43 is much larger than the bit width of the feature map and the bit width of the addition kernel weights. Therefore, the optimized data flow minimizes the storage requirements of the intermediate result buffer 43 without affecting the processing speed, thereby minimizing the storage requirements of the entire computing module.

[0149] Example 1

[0150] The full-adder network A2NN-VGGNet-13 model was deployed on an AMD-Xilinx VC709 processing platform for hardware acceleration. This platform uses an XC7VLX690T FPGA chip, which provides 433K lookup tables (LUTs), 866K flip-flops (FFs), 1470 block random access memories (BRAMs), and 3600 digital signal processing units (DSPs). The network structure is shown in Figure 5.

[0151] First, the A2NN-VGGNet-13 network was reconstructed and optimized using the algorithm reconstruction method mentioned above. For specific solutions, please refer to some or all of the implementation methods in steps S1-S3.

[0152] Subsequently, each reconstructed processing basic block is deployed to the corresponding quantization full-adder network processing module on the FPGA. For specific implementation schemes, refer to some or all of the implementation methods in step S4.

[0153] For hardware accelerators, throughput, resource utilization, and power consumption are key performance metrics.

[0154] During model inference, the number of operations in the network is measured in giga-operations (GOPs). In a full-addition network model, each addition between a weight and a feature map value, and each addition between intermediate results, is counted as one operation. For a specific model, the accelerator's throughput can be measured in frames per second (fps). However, to reflect the accelerator's performance on different networks and to facilitate comparison with other works, throughput is usually expressed in giga-operations per second (GOPS). The accelerator's resource utilization includes the number of LUTs, FFs, BRAMs, and DSPs consumed in the FPGA. The accelerator's power consumption is evaluated using the AMD-Xilinx power estimator.

[0155] The resource consumption on the FPGA is shown in Figure 6.

[0156] After multiple experimental tests, the system's average throughput was 329.26 frames per second, the equivalent computing power was 7378.7 GOPS, and the power consumption was 9.761 W. The accelerator employing the full-addition network hardware acceleration method provided in this application thus improves average throughput and equivalent computing power while reducing power consumption.
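Two derived figures follow directly from the reported numbers, as the short Python calculation below shows. It treats the reported throughput as frames per second; the per-frame operation count and the energy efficiency are derived here for illustration and are not stated in this application.

```python
fps = 329.26       # reported average throughput, taken as frames per second
gops = 7378.7      # reported equivalent computing power, GOPS
power_w = 9.761    # reported power consumption, W

gop_per_frame = gops / fps       # giga-operations needed for one inference
gops_per_watt = gops / power_w   # energy efficiency

print(f"{gop_per_frame:.2f} GOP per frame, {gops_per_watt:.1f} GOPS/W")
```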

[0157] This application provides a full-adder network hardware accelerator, which includes a digital signal processing unit (DSP) of an FPGA that simultaneously processes eight shift-add operations in single-instruction multiple-data (SIMD) mode. A second processing basic block of a full-adder network, as described in any of the methods in the first aspect, is deployed on the DSP. Its beneficial effects are as described in the first aspect and will not be repeated here.

[0158] In some feasible implementations, the DSP processing eight shift-add operations simultaneously in single-instruction multiple-data (SIMD) mode includes: using a simplification algorithm to convert each shift-add operation into a 4-bit signed number addition operation; encapsulating two 4-bit signed number addition operations into a single 11-bit signed number addition operation; and, in SIMD mode, having four 12-bit-wide adders in the DSP simultaneously process four encapsulated 11-bit signed number addition operations. Thus, by applying shift-add simplification and multi-addition encapsulation, a DSP in SIMD mode can process eight 4-bit shift-add operations simultaneously without using any additional logic resources, improving processing efficiency and speed while reducing resource consumption.

[0159] This application provides a computer storage medium storing instructions that, when executed on a computer, cause the computer to perform the method as described in any of the first aspects. The beneficial effects are as described in the first aspect and will not be repeated here.

[0160] It is understood that the processor in the embodiments of this application can be a central processing unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. A general-purpose processor can be a microprocessor or any conventional processor.

[0161] The method steps in the embodiments of this application can be implemented in hardware or by a processor executing software instructions. The software instructions can consist of corresponding software modules, which can be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disks, portable hard disks, CD-ROMs, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor, enabling the processor to read information from and write information to the storage medium. Of course, the storage medium can also be a component of the processor. The processor and the storage medium can reside in an ASIC.

[0162] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product. A computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the flow or function according to the embodiments of this application is generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in or transmitted through a computer-readable storage medium. The computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state disk (SSD)).

[0163] It is understood that the various numerical designations used in the embodiments of this application are merely for descriptive convenience and are not intended to limit the scope of the embodiments of this application.

Claims

1. A method for full adder network hardware acceleration, the method comprising: determining a second processing basic block based on a first processing basic block of the full addition network; wherein the first processing basic block includes a quantization sub-layer, a first integer addition sub-layer, an inverse quantization sub-layer, and a BN layer; the second processing basic block includes a second integer addition sub-layer and a fused BN layer; the full addition network includes N second processing basic blocks; each of the N second processing basic blocks is obtained by reordering and merging the layers in each of the first processing basic blocks; the determining of the second processing basic block based on the first processing basic block of the full addition network includes: moving the operation sequence of the quantization sub-layer forward, and determining that the second processing basic block takes the first integer feature map and integer addition kernel weights as inputs; the second integer addition sub-layer is determined as follows: wherein $F_{int}$ represents the first integer feature map; $W_{int}$ represents the integer addition kernel weights; ΔS represents the absolute difference between the feature map quantization factor $S_F$ and the weight quantization factor $S_W$; K represents the height and width of the adder kernel; $C_{in}$ represents the number of input channels of the adder layer; $c_i$ indicates the $c_i$-th input channel; $c_o$ indicates the $c_o$-th output channel; x, y represent the y-th row and x-th column of the feature map output by the adder layer; u, v represent the v-th row and u-th column of the adder kernel; << ΔS indicates a left shift by ΔS bits; the bit width $Q_A$ of the calculation result of the second integer addition sub-layer is: wherein k represents the quantization bit width of the first integer feature map and the quantization bit width of the integer addition kernel weights, and $C_{in}$ represents the number of input channels of the first integer feature map; determining a row-level streaming processing strategy, which includes dividing the first integer feature map into two-dimensional data by rows, storing only the rows necessary for the operation in the second processing basic block, and the second integer feature map output by the current second processing basic block being the first integer feature map input to the next level second processing basic block; optimizing the loop of the second integer addition sub-layer in the second processing basic block based on the row-level streaming processing strategy; the method further includes optimizing the data flow of the second processing basic block, including: in the first-level loop, ICP data are read from the independent storage space of the feature map circular buffer; the ICP data are updated by sliding a window along the width W dimension of the feature map while the ICP×OCP weight data remain unchanged; the ICP×OCP weight data to be used later in the adder core ROM are read into the data prefetch register in multiple steps for temporary storage; when entering the second-level loop, all ICP×OCP weight data involved in the calculation are replaced; in the second-level loop, the ICP×OCP weight data and the ICP feature map data are updated simultaneously along the input channel dimension $C_{in}$; the OCP intermediate results generated by the adder adding the ICP data from the feature map circular buffer to the ICP×OCP weight data from the adder core ROM are stored in the intermediate result buffer; in the third-level and fourth-level loops, the weight data is updated along the width and height of the adder kernel; at the same time, the data of the intermediate result buffer is read out and accumulated, and the accumulated result is written back to the adder; at the end of the fourth-level loop, the accumulated result is output to the fused BN layer, and one floating-point multiplication and one floating-point addition are performed to obtain the second integer feature map; in the fifth-level loop, the second integer feature map in the intermediate result buffer is calculated and output along the output channel dimension $C_{out}$.

2. The method of claim 1, wherein the step of determining the second processing basic block based on the first processing basic block of the full addition network further includes: extracting the shared quantization factor from the first integer addition sublayer and integrating it into the inverse quantization sublayer; combining the floating-point operations in the inverse quantization sublayer, the BN layer and the quantization sublayer to obtain the fused BN layer; wherein the input to the fused BN layer is an intermediate value of the integer feature map with a bit width of $Q_A$, which is processed by one floating-point multiplication and one floating-point addition, and the output is an integer quantized value with a bit width of k bits.

3. The method of claim 2, wherein the second processing basic block also includes: an activation function layer, used to compare the k-bit integer quantized value with the ReLU function, wherein if the comparison result is less than 0, the integer value of that bit is set to 0, and if the comparison result is greater than 0, the integer value of that bit is set to 1, thus obtaining the k-bit integer activation value; and a pooling layer, used to compare the integer activation values with a bit width of k bits pairwise, retain the larger shift value, and output the second integer feature map.

4. The method of claim 1, wherein the optimization of the loop in the second integer addition sub-layer based on the row-level streaming processing strategy includes: according to the row-level streaming processing strategy, setting the row-level loop for the first integer feature map in the second integer addition sub-layer to the highest level, and processing the input data and output data in a pipelined manner in the row-level loop; setting the column-level loop for the first integer feature map in the second integer addition sub-layer to the lowest-level loop, and keeping the weights of the integer addition kernel unchanged in the column-level loop; and setting parallelism ICP in the input channel dimension and parallelism OCP in the output channel dimension for multi-channel parallel computation.

5. The method of claim 1, wherein the method further includes: deploying the optimized second processing basic block in the adder, which is a digital signal processing unit (DSP) of an FPGA that processes eight shift-add operations simultaneously in single-instruction multiple-data (SIMD) mode.

6. A full adder network hardware accelerator, wherein the hardware accelerator includes a digital signal processing unit (DSP) of an FPGA that simultaneously processes eight shift-add operations in single-instruction multiple-data (SIMD) mode, and a second processing basic block of the full-addition network as described in any one of claims 1-5 is deployed on the DSP.

7. The full adder network hardware accelerator of claim 6, wherein the DSP processing eight shift-add operations simultaneously in single-instruction multiple-data (SIMD) mode includes: transforming each shift-add operation into a 4-bit signed number addition operation using a simplification algorithm; encapsulating two of the 4-bit signed number addition operations into a single 11-bit signed number addition operation; and, in SIMD mode, four 12-bit-wide adders in the DSP simultaneously processing four encapsulated 11-bit signed number addition operations.

8. A computer storage medium storing instructions that, when executed on a computer, cause the computer to perform the method as described in any one of claims 1-5.