A multi-layer cascaded structure for improving usage rate of neural network MAC
By dividing the block memory into input, intermediate, and output layers and performing inter-layer cascaded computations within the processor, the problem of low data access efficiency between neural network layers is solved, improving memory access efficiency and MAC utilization.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XI AN JIAOTONG UNIV
- Filing Date
- 2022-08-02
- Publication Date
- 2026-06-23
Abstract
Description
Technical Field
[0001] This disclosure belongs to the field of artificial intelligence technology, and specifically relates to a multi-layer cascaded structure for improving the MAC utilization of neural networks.
Background Technology
[0002] Memory access cost (MAC) is a crucial metric when measuring the computational speed of neural networks. Neural network operations (such as convolution or pooling) require extensive accesses to main memory, with the accessed data often stored as vectors or tensors. This means that in current mainstream computing architectures (such as existing CPUs, GPUs, NPUs, or processors of any architecture), main memory accesses consume a significant amount of time when processing these operations. However, in neural network computations, the data between layers is tightly interconnected (the output of an upper layer becomes the input of a lower layer). Accessing main memory for each layer is undoubtedly inefficient. Current computing architectures lack flexible and efficient data access control for this type of computation, thus facing the problem of inefficient memory usage and resulting in severe performance bottlenecks.
Summary of the Invention
[0003] In view of this, the present disclosure provides a multi-layer cascaded structure for improving the MAC utilization of neural networks, where MAC refers to memory access cost. The block memory in the processor that stores feature maps is partitioned, and the partitions serve as data storage areas for different layers of a multi-layer neural network. The block memory is divided into at least three regions for storing feature maps of the multi-layer neural network: an input layer region, an intermediate layer region, and an output layer region. The intermediate layer may comprise one or more layers.
[0004] In other words, since the intermediate layer can be multi-layered, this disclosure essentially divides the block memory into at least three regions for storing feature maps of the input layer, intermediate layer and output layer in a multi-layered neural network.
[0005] Preferably,
[0006] The block memory is an on-chip, high-capacity, wide-bit-width memory used for neural network computation.
[0007] Preferably,
[0008] The smallest unit of the block memory is a dual-port SRAM with a bit width of 64 and a depth of 512, which is combined into a multi-bank storage unit to store feature maps, weights, and biases.
[0009] Preferably,
[0010] The block memory is a static random access memory (SRAM) on the FPGA chip.
[0011] Preferably,
[0012] The block memory is divided into seven regions for storing feature maps of multi-layer neural networks.
[0013] Preferably,
[0014] The seven regions are, respectively, the ifm_mr partition for storing input images or input layer feature maps, the layer1_mr partition, layer2_mr partition, layer3_mr partition, layer4_mr partition, and layer5_mr partition for storing intermediate layer feature maps, and the ofm_mr partition for storing output layer feature maps.
[0015] Preferably,
[0016] When the processor performs calculations, it first accesses the main memory DDR to obtain the input image or input layer feature map, stores it in the ifm_mr partition, and uses it as the input layer of the multi-layer neural network for NPU computation. After the input layer computation of the multi-layer neural network is completed, the NPU stores the results in five partitions from layer1_mr to layer5_mr as input to intermediate layers. The intermediate layers continue to use other areas in the five partitions from layer1_mr to layer5_mr to store computation results, which also serve as input to the next intermediate layer. Finally, the computation result is stored in the ofm_mr partition and then stored in the main memory DDR.
[0017] Preferably,
[0018] The processor utilizes a neural network computing unit (NPU) to process cascaded multi-layer neural networks or neural network building blocks containing branching structures.
[0019] Preferably,
[0020] The neural network building blocks containing branching structures include the bottleneck unit in ResNet and the ShuffleNet unit in ShuffleNet.
[0021] Preferably,
[0022] The block memory can be of any size.
[0023] The above technical solution significantly improves the MAC utilization of neural networks through inter-layer cascading, enabling efficient access to main memory. This design is of great significance in addressing the performance bottleneck of low memory access efficiency in neural network computation.
Attached Figure Description
[0024] Figure 1 is a diagram of a multi-layer cascaded structure for improving the MAC utilization of neural networks, provided in one embodiment of the present disclosure;
[0025] Figure 2 is a schematic diagram of MRfm partitioning in one embodiment of this disclosure;
[0026] Figure 3 is a schematic diagram of a neural network computing and storage structure in one embodiment of this disclosure;
[0027] Figure 4 is a schematic diagram of the specific structure of MRfm in one embodiment of this disclosure;
[0028] Figure 5 is a schematic diagram of a simple neural network module including Conv and Pooling in one embodiment of this disclosure;
[0029] Figure 6 is a diagram illustrating the specific implementation process of a cascaded multi-layer neural network in one embodiment of this disclosure;
[0030] Figure 7 is a schematic diagram of the ShuffleNet unit building block in one embodiment of this disclosure;
[0031] Figure 8 is a diagram illustrating the specific implementation process of the ShuffleNet unit building block in one embodiment of this disclosure;
[0032] Figure 9 is a schematic diagram of the bottleneck residual building block in one embodiment of this disclosure;
[0033] Figure 10 is a diagram illustrating the specific implementation process of the bottleneck residual building block in one embodiment of this disclosure.
Detailed Implementation
[0034] Referring to Figure 1, in one embodiment, a multi-layer cascaded structure for improving the MAC utilization of neural networks is disclosed. The block memory in the processor that stores feature maps is partitioned and used as data storage areas for different layers of a multi-layer neural network. The block memory is divided into at least three regions for storing feature maps of the multi-layer neural network: an input layer region, an intermediate layer region, and an output layer region, where the intermediate layer may comprise one or more layers.
[0035] In this embodiment, a multi-layer cascaded structure (hereinafter referred to as inter-layer cascading) that improves the MAC utilization of a neural network utilizes the block memory partitioning within the processor used to store feature maps, treating it as a data storage area for different layers in the multi-layer neural network. During neural network computation, input layer data is first retrieved from main memory, and the block memory partitioning is fully utilized as the area for computation results from different layers. This allows the processor to process cascaded multi-layer neural networks or neural network building blocks containing branching structures (such as the bottleneck unit in ResNet and the ShuffleNet unit in ShuffleNet) within the processor by reading only one layer of input data from main memory. In a single computation, only the input and output layers interact with main memory.
[0036] Interlayer cascading is a block memory partitioning design that allows for more flexible and granular partitioning schemes to be arranged for specific problems. For example, the number of intermediate layer regions can be increased or decreased, and the address space of each region can be flexibly configured.
[0037] In another embodiment, the block memory is an on-chip high-capacity, high-bit-width memory used for neural network computation.
[0038] In this embodiment, the block memory design can also be used to store any large-bit-width, large-volume on-chip data, not limited to neural network computation.
[0039] In another embodiment, the smallest unit of the block memory is a dual-port SRAM with a bit width of 64 and a depth of 512, which is combined into multiple banks for storing feature maps, weights, and biases.
[0040] In this embodiment, the block memory, with 64x512 SRAM as the smallest unit, can be combined in different ways to realize a variety of neural network storage structures.
[0041] In another embodiment, the block memory is an on-chip static random access memory (SRAM) of a field-programmable gate array (FPGA).
[0042] In this embodiment, the block memory inside the processor is implemented using on-chip SRAM within the FPGA. Simply put, on-chip SRAM is a fixed, hard core within the FPGA that provides storage functionality. On-chip SRAM offers a large storage capacity, making it suitable for storing large-bit-width data, such as parameters, feature maps, weights, and biases required in neural network calculations. Using on-chip SRAM as the processor's internal block memory effectively utilizes the FPGA's on-chip resources.
[0043] In another embodiment, the block memory is divided into seven regions for storing feature maps of a multi-layer neural network.
[0044] In another embodiment, the seven regions are respectively the ifm_mr partition for storing the input image or input layer feature map, the layer1_mr partition, layer2_mr partition, layer3_mr partition, layer4_mr partition, and layer5_mr partition for storing the intermediate layer feature map, and the ofm_mr partition for storing the output layer feature map.
[0045] In this embodiment, as shown in Figure 2, the inter-layer cascaded design divides the processor's internal block memory (MRfm, Matrix Registers for feature map), which stores neural network feature maps, into seven regions for storing feature maps of multi-layer neural networks. The function of each region is described in Table 1.
[0046]
| Partition | Function |
| --- | --- |
| ifm_mr | Stores the input image or input layer feature map |
| layer1_mr | Stores an intermediate layer feature map |
| layer2_mr | Stores an intermediate layer feature map |
| layer3_mr | Stores an intermediate layer feature map |
| layer4_mr | Stores an intermediate layer feature map |
| layer5_mr | Stores an intermediate layer feature map |
| ofm_mr | Stores the output layer feature map |
[0047] Table 1
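The partition roles in Table 1 can be written down as a small lookup structure. The following Python sketch is illustrative only: the `Region` enum, the `ROLE` mapping, and the helper `intermediate_regions` are hypothetical names, and the disclosure does not specify address ranges for the regions, so none are modelled here.

```python
from enum import Enum


class Region(Enum):
    """The seven MRfm partitions named in Table 1."""
    IFM_MR = "ifm_mr"
    LAYER1_MR = "layer1_mr"
    LAYER2_MR = "layer2_mr"
    LAYER3_MR = "layer3_mr"
    LAYER4_MR = "layer4_mr"
    LAYER5_MR = "layer5_mr"
    OFM_MR = "ofm_mr"


# Role of each partition, following Table 1.
ROLE = {
    Region.IFM_MR: "input",
    Region.LAYER1_MR: "intermediate",
    Region.LAYER2_MR: "intermediate",
    Region.LAYER3_MR: "intermediate",
    Region.LAYER4_MR: "intermediate",
    Region.LAYER5_MR: "intermediate",
    Region.OFM_MR: "output",
}


def intermediate_regions():
    """Return the partitions that hold intermediate layer feature maps."""
    return [r for r in Region if ROLE[r] == "intermediate"]


if __name__ == "__main__":
    print([r.value for r in intermediate_regions()])
```

Only ifm_mr and ofm_mr exchange data with main memory in this scheme; the five intermediate partitions are read and written by the NPU alone.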
[0048] In another embodiment, when the processor performs calculations, it first accesses the main memory DDR to obtain the input image or input layer feature map, stores it in the ifm_mr partition, and uses it as the input layer of the multi-layer neural network for NPU calculation. After the input layer calculation of the multi-layer neural network is completed, the NPU stores the result in five partitions from layer1_mr to layer5_mr as the input of intermediate layers. The intermediate layers continue to use other areas in the five partitions from layer1_mr to layer5_mr to store the calculation results, which also serve as the input of the next intermediate layer. Finally, the calculation result is stored in the ofm_mr partition and then stored in the main memory DDR.
[0049] In this embodiment, Figure 3 shows a neural network computation and storage structure designed using inter-layer cascading. When the processor performs computation, it first accesses the main memory to obtain the input image or input feature map, stores it in the ifm_mr region of MRfm, and then passes it to the NPU for computation as the input layer of the multi-layer neural network module.
[0050] After the input layer computation is complete, the NPU stores the result in the region between layer1_mr and layer5_mr, serving as input for intermediate layers. The intermediate layers then continue to use other regions within layer1_mr to layer5_mr to store their computation results, which also serve as input for the next intermediate layer. This process continues until the multi-layer neural network module finishes computation, at which point the final result is placed in the ofm_mr region and then stored in DDR.
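The data flow described above can be sketched as a small software model. This is a minimal illustration, not the hardware implementation: the class `CascadeSim`, its counters, and the policy of rotating through layer1_mr to layer5_mr in order are assumptions made for the example, since the disclosure only says that intermediate layers use "other regions" among those five partitions.

```python
INTERMEDIATE = ["layer1_mr", "layer2_mr", "layer3_mr", "layer4_mr", "layer5_mr"]


class CascadeSim:
    """Toy model of inter-layer cascading that counts main-memory (DDR) accesses."""

    def __init__(self):
        self.regions = {name: None for name in ["ifm_mr", *INTERMEDIATE, "ofm_mr"]}
        self.ddr_reads = 0
        self.ddr_writes = 0

    def run(self, ddr_input, layers):
        """Run `layers` (a list of callables) as one cascaded module."""
        if not layers:
            raise ValueError("at least one layer is required")
        # One DDR read: load the input feature map into ifm_mr.
        self.regions["ifm_mr"] = ddr_input
        self.ddr_reads += 1
        src = "ifm_mr"
        for i, layer in enumerate(layers):
            is_last = i == len(layers) - 1
            # Assumed policy: rotate through the intermediate partitions;
            # the last layer writes to ofm_mr.
            dst = "ofm_mr" if is_last else INTERMEDIATE[i % len(INTERMEDIATE)]
            self.regions[dst] = layer(self.regions[src])
            src = dst
        # One DDR write: store the module output held in ofm_mr.
        self.ddr_writes += 1
        return self.regions["ofm_mr"]


if __name__ == "__main__":
    sim = CascadeSim()
    result = sim.run(
        [1, 2, 3],
        [
            lambda fm: [v * 2 for v in fm],
            lambda fm: [v + 1 for v in fm],
            lambda fm: [v * v for v in fm],
        ],
    )
    print(result, sim.ddr_reads, sim.ddr_writes)
```

However many layers the module contains, the model performs one read from and one write to main memory, which is the property the cascaded structure is intended to provide.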
[0051] The processor cannot compute the entire neural network with only the input layer as input (due to on-chip storage size), so it processes only a few consecutive layers (or one network module) at a time. The input layer of the entire neural network takes an image as input, while the intermediate layers take feature maps as input. Therefore, when the image first enters the neural network (the input part), the processor receives the image. However, when processing the next part, the processor receives the processed feature maps from the input part.
[0052] This allows for full utilization of block memory to compute modules with multi-layered neural networks, reducing access to main memory and greatly improving the MAC utilization of neural networks.
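The reduction in main-memory traffic can be estimated with simple arithmetic. The sketch below is an illustration under stated assumptions: it counts feature-map bytes only (weights and biases are ignored), the function names are hypothetical, and the sizes in the example are made-up numbers, not measurements from the disclosure.

```python
from typing import Sequence


def ddr_traffic_per_layer(sizes: Sequence[int]) -> int:
    """Feature-map bytes moved when every layer reads its input from DDR
    and writes its output back to DDR.

    sizes[0] is the module input; sizes[i] is the output of layer i.
    """
    if len(sizes) < 2:
        raise ValueError("need the input size and at least one layer output size")
    return sum(sizes[i] + sizes[i + 1] for i in range(len(sizes) - 1))


def ddr_traffic_cascaded(sizes: Sequence[int]) -> int:
    """Feature-map bytes moved when only the module input and the module
    output touch DDR, as in inter-layer cascading."""
    if len(sizes) < 2:
        raise ValueError("need the input size and at least one layer output size")
    return sizes[0] + sizes[-1]


if __name__ == "__main__":
    example = [100, 400, 400, 50]  # hypothetical feature-map sizes in bytes
    print(ddr_traffic_per_layer(example))  # 1750
    print(ddr_traffic_cascaded(example))   # 150
```

In the layer-by-layer case each intermediate feature map crosses the memory interface twice (once written, once read back), so the saving grows with the number and size of intermediate layers.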
[0053] The input layer of a neural network takes an image as input, while the intermediate layers take feature maps as input. The processor cannot compute the entire network with only the input layer as input (due to on-chip storage limitations), so it processes only a few consecutive layers (network modules) at a time. Therefore, when an image first enters the neural network (input module), the processor receives the image itself. When processing the next module, the processor receives the feature maps processed by the input module. This process continues until the current neural network computation is complete.
[0054] In another embodiment, a neural network computing unit (NPU) is used within the processor to process cascaded multi-layer neural networks or neural network building blocks containing branching structures.
[0055] In another embodiment, the neural network building block containing the branch structure includes the bottleneck unit in ResNet and the ShuffleNet unit in ShuffleNet.
[0056] In this embodiment, the neural network building blocks containing branch structures include, but are not limited to, the bottleneck unit in ResNet and the ShuffleNet unit in ShuffleNet mentioned above.
[0057] In another embodiment, the block memory can be of any size.
[0058] In this embodiment, interlayer cascading is not targeted at a specific size of block memory; the design can be used in any size of block memory used for neural network computation.
[0059] In another embodiment, MRfm is used in this design to store neural network feature maps; its specific structure is shown in Figure 4. MRfm contains 32 indices, denoted MR0 to MR31. Each index is an SRAM 512 bits wide and 128 entries deep, composed of two of the 64x512 smallest units. Addressing follows this convention: an MR address is 16 bits, where MR[15:7] is the index address and MR[6:0] is the internal address within each index.
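The addressing convention can be illustrated with a short decoder. This is a sketch under the field layout stated in the paragraph (MR[15:7] as the index address, MR[6:0] as the internal address); the function names `decode_mr` and `encode_mr` are hypothetical, and rejecting index values of 32 and above is an assumption based on MRfm having 32 indices.

```python
INDEX_COUNT = 32          # MR0 .. MR31
INDEX_WIDTH_BITS = 512    # width of one index
INDEX_DEPTH = 128         # entries per index
UNIT_BITS = 64 * 512      # one smallest SRAM unit: 64 bits wide, 512 deep


def decode_mr(addr: int):
    """Split a 16-bit MR address into (index, internal address)."""
    if not 0 <= addr < (1 << 16):
        raise ValueError("MR address must fit in 16 bits")
    index = (addr >> 7) & 0x1FF   # MR[15:7]
    offset = addr & 0x7F          # MR[6:0]
    if index >= INDEX_COUNT:
        raise ValueError("index address exceeds MR31")
    return index, offset


def encode_mr(index: int, offset: int) -> int:
    """Build a 16-bit MR address from an index and an internal address."""
    if not 0 <= index < INDEX_COUNT:
        raise ValueError("index must be in 0..31")
    if not 0 <= offset < INDEX_DEPTH:
        raise ValueError("internal address must be in 0..127")
    return (index << 7) | offset


if __name__ == "__main__":
    print(decode_mr(0x0085))     # (1, 5)
    print(encode_mr(31, 127))    # 4095
    # One index holds exactly as many bits as two smallest units.
    print(INDEX_WIDTH_BITS * INDEX_DEPTH == 2 * UNIT_BITS)  # True
```

The 7-bit internal address covers exactly the 128 entries of one index, and with these figures the 32 indices together hold 32 x 65,536 bits, or 256 KiB.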
[0060] In another embodiment, Figure 5 shows a basic convolutional neural network module comprising one Conv layer and one Pooling layer. In this embodiment, only the ifm_mr, layer2_mr, and ofm_mr partitions of MRfm are used.
[0061] When the NPU computes this module, it first accesses main memory to obtain the input feature map and stores it in the ifm_mr region. The Conv layer uses the data in the ifm_mr region as input to begin computation, and after the computation is complete, it stores the result in the layer2_mr region. Then, the Pooling layer uses the data in the layer2_mr region as input to begin computation, and after the computation is complete, it stores the result in the ofm_mr region. Finally, the processor stores the feature map in the ofm_mr region, which is the output of this neural network module, into main memory. The specific process is shown in Figure 6.
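The Conv and Pooling sequence above can be traced with a toy single-channel example. The code is a sketch for illustration only: the kernel size, the "valid" convolution without padding, the 2x2 max pooling, and the function names are all assumptions, since the disclosure does not specify the layer parameters.

```python
def conv2d_valid(fm, kernel):
    """Single-channel 2D convolution (cross-correlation) without padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(fm) - kh + 1, len(fm[0]) - kw + 1
    return [
        [
            sum(fm[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool2x2(fm):
    """2x2 max pooling with stride 2."""
    return [
        [max(fm[r][c], fm[r][c + 1], fm[r + 1][c], fm[r + 1][c + 1])
         for c in range(0, len(fm[0]) - 1, 2)]
        for r in range(0, len(fm) - 1, 2)
    ]


def run_conv_pool_module(ddr_input, kernel):
    """Trace the module through ifm_mr -> layer2_mr -> ofm_mr."""
    mrfm = {"ifm_mr": None, "layer2_mr": None, "ofm_mr": None}
    mrfm["ifm_mr"] = ddr_input                                  # load from main memory
    mrfm["layer2_mr"] = conv2d_valid(mrfm["ifm_mr"], kernel)    # Conv layer
    mrfm["ofm_mr"] = max_pool2x2(mrfm["layer2_mr"])             # Pooling layer
    return mrfm["ofm_mr"]                                       # stored back to main memory


if __name__ == "__main__":
    image = [[r * 5 + c for c in range(5)] for r in range(5)]   # 5x5 test input
    print(run_conv_pool_module(image, [[1, 0], [0, 1]]))        # [[18, 22], [38, 42]]
```

The Conv result never leaves the on-chip dictionary that stands in for MRfm; only the input and the pooled output would cross the main-memory interface.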
[0062] In another embodiment, Figure 7 shows a building block of the ShuffleNet v2 network architecture, which has two branches. The right branch contains three layers of neural network computation, abbreviated as ra, rb, and rc for convenience. The left branch contains two layers of neural network computation, abbreviated as lb and lc. The results of the two branches are concatenated and then channel-shuffled to form the final output of the module.
[0063] When the NPU computes this module, it first accesses main memory to obtain the input feature map and stores it in the ifm_mr region. The two branches can be interleaved, using the layer1_mr to layer5_mr regions to hold the computation results of their respective intermediate layers. For example, the results of ra and rb are stored in layer1_mr and layer3_mr, respectively, while the results of lb are stored in layer2_mr. Finally, the results of lc and rc are channel-shuffled by the NPU and placed into the ofm_mr region, and then stored in main memory. The specific process is shown in Figure 8.
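One possible interleaved schedule for this module, together with the channel shuffle step, can be sketched as follows. The placement of ra, rb, and lb follows the paragraph above; placing lc in layer4_mr and rc in layer5_mr, concatenating the left branch before the right branch, and using two shuffle groups are assumptions made for illustration, as are the names `SCHEDULE`, `validate`, and `channel_shuffle`.

```python
# (operation, source regions, destination region)
SCHEDULE = [
    ("ra", ["ifm_mr"], "layer1_mr"),
    ("lb", ["ifm_mr"], "layer2_mr"),
    ("rb", ["layer1_mr"], "layer3_mr"),
    ("lc", ["layer2_mr"], "layer4_mr"),   # assumed placement
    ("rc", ["layer3_mr"], "layer5_mr"),   # assumed placement
    ("concat+shuffle", ["layer4_mr", "layer5_mr"], "ofm_mr"),
]


def validate(schedule):
    """Check that every operation reads only regions that already hold data
    and never writes into one of its own source regions."""
    written = {"ifm_mr"}  # loaded from main memory before the module starts
    for name, sources, dest in schedule:
        missing = [s for s in sources if s not in written]
        if missing:
            raise ValueError(f"{name} reads unwritten regions: {missing}")
        if dest in sources:
            raise ValueError(f"{name} would overwrite its own input")
        written.add(dest)
    return written


def channel_shuffle(channels, groups):
    """Interleave channels across groups: view the list as (groups, n // groups),
    transpose, and flatten."""
    n = len(channels)
    if groups <= 0 or n % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]


if __name__ == "__main__":
    validate(SCHEDULE)
    left, right = ["L0", "L1"], ["R0", "R1"]
    print(channel_shuffle(left + right, groups=2))  # ['L0', 'R0', 'L1', 'R1']
```

Because ra and lb both read ifm_mr and write to different partitions, either branch can run first, which is what makes the interleaving possible.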
[0064] In another embodiment, a specific implementation of the bottleneck residual block building block is shown.
[0065] Figure 9 shows a building block in the ResNet network architecture, which has two branches. The right branch contains three layers of neural network computation; for convenience, these three layers will be referred to as ra, rb, and rc, respectively. The left branch is a direct connection. The results of the two branches are summed to obtain the final output of the module.
[0066] In this embodiment, if MRfm is divided into seven regions, the layer4_mr and layer5_mr regions are not used during computation, so the MRfm storage space is not fully utilized. Therefore, for this embodiment, the number of partitions can be reduced to five by removing the layer4_mr and layer5_mr regions.
[0067] When the NPU computes this module, it first accesses main memory to obtain the input feature map and stores it in the ifm_mr region. The results of ra, rb, and rc are stored in layer1_mr, layer2_mr, and layer3_mr, respectively. Finally, the result of rc and the input feature map in ifm_mr are added together, placed into the ofm_mr region, and then stored in main memory. The specific process is shown in Figure 10.
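The five-partition flow for the bottleneck block can be traced with a toy model. This is a sketch under simplifying assumptions: feature maps are flat lists, ra, rb, and rc are arbitrary callables standing in for the three layers, and the function name `run_bottleneck` is hypothetical.

```python
def run_bottleneck(ddr_input, ra, rb, rc):
    """Trace a bottleneck residual block through five MRfm partitions."""
    mrfm = dict.fromkeys(["ifm_mr", "layer1_mr", "layer2_mr", "layer3_mr", "ofm_mr"])
    mrfm["ifm_mr"] = list(ddr_input)                 # load from main memory
    mrfm["layer1_mr"] = ra(mrfm["ifm_mr"])
    mrfm["layer2_mr"] = rb(mrfm["layer1_mr"])
    mrfm["layer3_mr"] = rc(mrfm["layer2_mr"])
    # The direct connection reuses ifm_mr, so ifm_mr must stay intact
    # until this element-wise addition has finished.
    mrfm["ofm_mr"] = [a + b for a, b in zip(mrfm["layer3_mr"], mrfm["ifm_mr"])]
    return mrfm["ofm_mr"]                            # stored back to main memory


if __name__ == "__main__":
    out = run_bottleneck(
        [1, 2],
        ra=lambda fm: [v * 2 for v in fm],
        rb=lambda fm: [v + 1 for v in fm],
        rc=lambda fm: [v * 3 for v in fm],
    )
    print(out)  # [10, 17]
```

The shortcut branch needs no partition of its own because the input feature map is still resident in ifm_mr when the addition is performed.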
[0068] Although the embodiments of this disclosure have been described above in conjunction with the accompanying drawings, this disclosure is not limited to the specific embodiments and application fields described above. The specific embodiments described above are merely illustrative and instructive, and not restrictive. Those skilled in the art can make many other forms based on the guidance of this specification and without departing from the scope of protection of the claims of this disclosure, and all of these are within the scope of protection of this disclosure.
Claims
1. A multi-layer cascaded structure for improving the MAC utilization rate of neural networks, characterized in that: The MAC refers to memory usage overhead. The multi-layer cascaded structure partitions the block memory in the processor used to store feature maps, and uses it as a data storage area for different layers in the multi-layer neural network. The block memory is divided into three regions to store the feature maps of the multi-layer neural network. The three regions are the input layer, the intermediate layer and the output layer, and the intermediate layer is one or more layers. When performing neural network calculations, the input layer data is first obtained from the main memory. The block memory partition is fully utilized as the area for the calculation results of different layers. This allows the processor to process cascaded multi-layer neural networks or neural network building blocks with branch structures using the neural network computing unit by reading only one layer of input data from the main memory. During each calculation, only the input layer and the output layer have memory access interaction with the main memory.
2. The multi-layer cascaded structure according to claim 1, wherein the block memory is an on-chip, high-capacity, wide-bit-width memory used for neural network computation.
3. The multi-layer cascaded structure according to claim 1, wherein the smallest unit of the block memory is a dual-port SRAM with a bit width of 64 and a depth of 512, which is combined into multiple storage banks to store feature maps, weights, and biases.
4. The multi-layer cascaded structure according to claim 1, wherein the block memory is a static random access memory (SRAM) on the FPGA chip.
5. The multi-layer cascaded structure according to claim 1, wherein the block memory is divided into seven regions for storing feature maps of multi-layer neural networks.
6. The multi-layer cascaded structure according to claim 5, wherein the seven regions are, respectively, the ifm_mr partition for storing input images or input layer feature maps, the layer1_mr partition, layer2_mr partition, layer3_mr partition, layer4_mr partition, and layer5_mr partition for storing intermediate layer feature maps, and the ofm_mr partition for storing output layer feature maps.
7. The multi-layer cascaded structure according to claim 1, wherein when the processor performs calculations, it first accesses the main memory DDR to obtain the input image or input layer feature map, stores it in the ifm_mr partition, and uses it as the input layer of the multi-layer neural network for NPU computation. After the input layer computation of the multi-layer neural network is completed, the NPU stores the results in five partitions from layer1_mr to layer5_mr as input to intermediate layers. The intermediate layers continue to use other areas in the five partitions from layer1_mr to layer5_mr to store computation results, which also serve as input to the next intermediate layer. Finally, the computation result is stored in the ofm_mr partition and then stored in the main memory DDR.
8. The multi-layer cascaded structure according to claim 1, wherein the processor utilizes a neural network computing unit (NPU) to process cascaded multi-layer neural networks or neural network building blocks containing branching structures.
9. The multi-layer cascaded structure according to claim 8, wherein the neural network building blocks containing branching structures include the bottleneck unit in ResNet and the ShuffleNet unit in ShuffleNet.
10. The multi-layer cascaded structure according to claim 1, wherein the block memory can be of any size.