Method, apparatus and device for promoting neural network acceleration computation

By calculating the memory access ratio of each layer of the neural network, grouping them, and optimizing the processing engine, the problem of mismatch between hierarchical computation features in FPGA neural network accelerators is solved, achieving more efficient computing performance and resource utilization.

CN116663629B · Active · Publication Date: 2026-06-30 · ELECTRIC POWER RES INST OF STATE GRID ZHEJIANG ELECTRIC POWER COMAPNY +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ELECTRIC POWER RES INST OF STATE GRID ZHEJIANG ELECTRIC POWER COMAPNY
Filing Date
2023-06-29
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

In existing technologies, the mismatch between the computational characteristics of different layers in FPGA neural network accelerators leads to performance loss and wasted computing resources.

Method used

By calculating the memory access ratio of each layer in the neural network, grouping them and assigning similar layers to the same processing engine, and combining this with the design space exploration method to optimize the computational performance of the processing engine, the computational characteristics of the layers and the processing engine are matched.

Benefits of technology

It improves the computational efficiency and hardware resource utilization of neural network accelerators, reduces the computational latency differences between processing engines, and enhances overall performance and energy efficiency.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN116663629B_ABST
Patent Text Reader

Abstract

This invention discloses a method for accelerating neural network computation, relating to the field of neural network technology, to solve the problem of feature mismatch during computational allocation in existing neural networks. The method includes the following steps: obtaining the operating parameters of the neural network; calculating the memory access ratio of each layer of the neural network based on the operating parameters; grouping the layers of the neural network according to the calculated memory access ratios, so that layers whose memory access ratios differ within a preset range fall into the same group; and assigning the layers in the same group to the same processing engine. This invention also discloses a neural network computation acceleration device and an electronic device. By grouping neural network layers by memory access ratio and matching each group with an appropriate processing engine, the invention achieves feature matching and better performance during neural network computational allocation.

Description

Technical Field

[0001] This invention relates to the field of neural network technology, and more particularly to an FPGA-based method, apparatus, device, and medium for accelerating neural network computation.

Background Technology

[0002] With the rapid development of neural network models, selecting efficient acceleration hardware platforms to adapt to complex computing applications is crucial. If the acceleration hardware platform cannot support the computation of neural networks, it will cause problems such as the neural network failing to perform its functions or experiencing stuttering. Field-Programmable Gate Arrays (FPGAs) have gradually become an acceleration hardware platform that balances power consumption and performance due to their low power consumption and reconfigurability. In 2015, existing technologies began to introduce the Roofline model to analyze FPGA neural network accelerators under different model parameters, providing guidance and optimization for the design of FPGA neural network accelerators. The Roofline model can describe the theoretical computational performance that a computational task can achieve under the constraints of the hardware platform. The Roofline model includes a computational constraint region and a memory access constraint region. The computational constraint region refers to the highest performance level that can be achieved under the constraints of all available computing resources of the processor; the memory access constraint region refers to the maximum throughput that the processor core can support under a given memory access ratio for a computational task.

[0003] Matrix operations are a crucial part of neural network computation, and when broken down to the element-wise level, they involve a massive number of multiply-accumulate (MAC) operations. Therefore, the speed of MAC operations is paramount for neural network hardware accelerators. However, while the Roofline model can guide accelerator design, the hardware platform cannot reach its performance limit when the algorithm's computational intensity is low. Overall performance is constrained by bandwidth limitations during computation, with many computational cores remaining idle for extended periods. This is one of the challenges in accelerator design, as the inconsistent computational characteristics across different applications and layers can lead to performance losses.

[0004] In summary, building on existing technologies, it is necessary to address the mismatch between the computational characteristics of different layers in neural network accelerator design.

Summary of the Invention

[0005] In order to overcome the shortcomings of the prior art, one of the objectives of the present invention is to provide a method to promote the accelerated computation of neural networks, which achieves grouping of each layer of the neural network through memory access ratio, and then rationally allocates processing engines to each group.

[0006] One of the objectives of this invention is achieved through the following technical solution:

[0007] A method to accelerate computation in neural networks includes the following steps:

[0008] Obtain the operating parameters of the neural network;

[0009] Based on the aforementioned operating parameters, calculate the memory access ratio of each layer in the neural network;

[0010] Based on the memory access ratio of each layer in the neural network, the layers of the neural network are grouped to obtain multiple groups, and the difference between the memory access ratios of any two layers in each group is within a preset range.

[0011] Layers grouped together are assigned to the same processing engine.

[0012] Furthermore, the operating parameters include: the computational load and access load of each layer in the neural network.

[0013] Furthermore, assigning layers grouped into the same group to the same processing engine also includes:

[0014] Calculate the theoretical throughput for each processing engine.

[0015] Based on the theoretical throughput of each processing engine, the tasks of the layers in the same group are mapped to a processing engine such that their computational cost is less than the theoretical throughput of that processing engine.

[0016] Furthermore, the calculation of the theoretical throughput satisfies:

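[0017] $P_{\text{attainable}} = \min\left(P_{\text{peak}},\ P_{\text{mem}}\right)$;

[0018] where $P_{\text{attainable}}$ represents the theoretical throughput, $P_{\text{peak}}$ represents the peak performance of the system's computing hardware, and $P_{\text{mem}}$ represents the maximum performance supported by the memory access bandwidth; the tasks of the layers in the same group are mapped to a processing engine such that their computation-to-memory access ratio is less than the upper limit of that processing engine.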

[0019] Furthermore, the method also includes regenerating the processing engines, in order to optimize their computational performance, based on the equilibrium degree calculated for the processing engines and in conjunction with the design space exploration method, wherein the equilibrium degree (Equilibrium) is calculated according to the formula:

[0020] ,

[0021] where $N$ denotes the number of processing engines, $T_i$ denotes the time overhead of the $i$-th processing engine in completing the computation of a one-layer subsequence, and $\bar{T}$ denotes the average time overhead of the processing engines in processing and computing subsequences.

[0022] Furthermore, the processing engine is regenerated based on the equilibrium calculation and in conjunction with the design space exploration method, including the following steps:

[0023] Calculate the historical and current balance of the processing engine;

[0024] When the current equilibrium degree is less than or equal to the historical equilibrium degree, proceed with the iteration:

[0025] The processing engine with the largest computational latency overhead in the current configuration is called the first processing engine;

[0026] Add computing resources to the first processing engine and regenerate it to obtain the second processing engine;

[0027] The second processing engine is checked. If it is valid, the iteration continues until the current balance is greater than the historical balance. If it is invalid, the second processing engine is restored to the first processing engine, the parallelism of other processing engines is reduced, and the iteration continues.

[0028] Furthermore, when the processing engine processes the computational subsequences of neural network layers, the execution schedule includes:

[0029] Step 1: Divide the input data into n data sources, where n is the same as the number of processing engines; each data source is a neural network task, and each neural network task consists of multiple neural network layers.

[0030] Step 2: Load the first data source into the first processing engine via off-chip access. After processing, return the calculation result of the first data source to off-chip storage.

[0031] Step 3: Load the second data source and the processed first data source into the first processing engine and the second processing engine respectively through off-chip access. After processing, return the calculation results to off-chip storage.

[0032] Repeat steps two through three for each of the n data sources until all n data sources have been processed.

[0033] The second objective of this invention is to provide a neural network computing acceleration device that groups the computational load of neural networks by memory access ratio, thereby matching the neural network layers with appropriate computational load to the processing engine.

[0034] The second objective of this invention is achieved by the following technical solution:

[0035] A neural network computing acceleration device, comprising:

[0036] The data acquisition module is used to acquire the operating parameters of the neural network;

[0037] The calculation module is used to calculate the memory access ratio of each layer of the neural network based on the operating parameters.

[0038] The allocation module is used to group the neural network according to the memory access ratio calculation result, group the layers with memory access ratio differences within a preset range into the same group, and allocate the layers grouped into the same group to the same processing engine.

[0039] A third objective of the present invention is to provide an electronic device for performing one of the objectives of the invention, comprising a processor, a storage medium, and a computer program, wherein the computer program is stored in the storage medium and, when executed by the processor, implements the aforementioned method for accelerating computation of neural networks.

[0040] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0041] This invention employs inter-layer interconnection partitioning and calculates the memory access ratio from the neural network operating parameters to achieve quantitative analysis of the computational characteristics of the neural network model. Based on the memory access ratio, the layers in the neural network are grouped, and each group is assigned to the same processing engine. This achieves matching between the computational characteristics of each layer in the neural network and the processing engine, while maintaining the flexibility of the computing unit.

Attached Figure Description

[0042] Figure 1 is a schematic diagram of the Roofline model in Embodiment 1;

[0043] Figure 2 is a flowchart of the method for accelerating neural network computation in Embodiment 1;

[0044] Figure 3 is a detailed structural diagram of the convolution calculation unit in Embodiment 1;

[0045] Figure 4 is a schematic diagram of the neural network layer partitioning strategy in Embodiment 1;

[0046] Figure 5 is a schematic diagram of the operation scheduling sequence in Embodiment 1;

[0047] Figure 6 is a schematic diagram of the amount of computation and fan-in data in each processing engine in Embodiment 2, as well as the optimized latency;

[0048] Figure 7 is a structural block diagram of the neural network acceleration device in Embodiment 3;

[0049] Figure 8 is a structural block diagram of the electronic device in Embodiment 4.

Detailed Implementation

[0050] The present invention will now be described in more detail with reference to the accompanying drawings. It should be noted that the following description of the present invention with reference to the accompanying drawings is merely illustrative and not restrictive. Various embodiments can be combined with each other to form other embodiments not shown in the following description.

[0051] Embodiment 1

[0052] Embodiment 1 provides a method to accelerate neural network computation. It aims to combine the design ideas of the Overlap and Stream paradigms, propose a new accelerator design paradigm, analyze the computational differences between neural network layers based on the Roofline model, and propose a method for partitioning neural network layers.

[0053] The main principle of this method is to determine different processing clusters by calculating the memory access ratio and map them to the same processing engine, so that the neural network layers processed by the same processing engine have similar computational and data access volumes, thereby making better use of hardware resources and improving the operating efficiency of the accelerator.

[0054] The method for accelerating neural network computation in this embodiment includes accelerator architecture design, design space exploration, and upper-level scheduling strategy.

[0055] The novel accelerator architecture design primarily involves using an architecture that supports different subsequences executed on multiple processing engines as part of the overall operational sequence of the neural network model. This architecture is a general-purpose architecture and will not be elaborated upon further in this embodiment. This general-purpose architecture includes multiple processing engines connected via a bus, with all data transfers handled through external memory access. Simultaneously, each processing engine internally connects different computing cores via a pipelined approach to achieve the most efficient internal computing performance.

[0056] The proposed method for exploring the accelerator architecture design space primarily includes a workload balancing algorithm based on a greedy strategy, used to complete the design space exploration task. During the design phase, the latency of each stage between processing engines is made as close as possible to ensure that the entire computational architecture can operate efficiently through simple software control.

[0057] For upper-level scheduling strategies, the main consideration is how to logically connect the computational subsequences completed by these processing engines with the overall processing order of the network model. Then, at the upper-level task scheduling level, multiple processing engines can be switched to achieve coarse-grained pipelined execution, masking each other's processing time overhead.

[0058] This method analyzes the computational differences between neural network layers based on the Roofline model. As shown in Figure 1, this model consists of constraints determined by two parameters: the theoretical computing power of the hardware platform and the upper limit of off-chip bandwidth. In the figure, computing performance refers to the highest performance level achievable under the constraints of all available computing resources of the processor, while memory access performance refers to the maximum throughput that the processor core can support given the memory access ratio of a computing task. This model is a conventional model, and its principles will not be elaborated upon in this embodiment.

[0059] The design concept of this invention is based on the Overlap paradigm and the Stream paradigm.

[0060] The core idea of the Overlap paradigm is to design a reusable, large processing engine to execute the computational portion of the algorithm. The form of the computational hardware varies depending on the computational task, such as systolic arrays or multiply-accumulate trees. Hardware control and operation scheduling are handled by the upper-layer host code. Therefore, after a single compilation, the same bitstream can be used for many models without reconfiguration and can be scaled based on the input model and available on-chip resources. The advantage of this design is its flexibility; it eliminates the need for reconfiguration when the input model changes but the computation type remains the same. However, due to control mechanisms similar to general-purpose processors, achieving maximum efficiency during computation is not easy. A one-size-fits-all design may lead to inconsistent final performance of network models with different workload characteristics. Furthermore, the various nonlinear functions between computational parts can also affect computational efficiency. Therefore, achieving optimal design size and optimal processing scheduling for computational units can be challenging when implementing the Overlap paradigm.

[0061] Existing accelerator work proposes a design concept based on the Stream paradigm. The Stream paradigm is the opposite of the Overlap paradigm. In the computational flow of the target algorithm, Stream implements different hardware units for each computational part and optimizes them to leverage inter-layer parallelism. Data flows through each unit, and after completing the computation, it flows into the next unit. The overall scheduling strategy of the accelerator during runtime is a coarse-grained pipeline based on the computational units. The upper-layer software only needs to control the inflow of data into the coarse-grained pipeline, thus greatly reducing the scheduling difficulty during processing. Therefore, the Stream paradigm can utilize the parallelism between these pipelined layers and enable them to execute concurrently. The advantage of the Stream paradigm design is that, based on the differences in the characteristics of different computational parts, designers can customize the implementation and parallel solutions in each hardware kernel using different methods to improve overall performance and operating efficiency. However, the disadvantage of the Stream paradigm is its inflexibility and low hardware kernel compatibility; generally, new hardware computational units need to be configured for different network models. Furthermore, the advantages of the Stream paradigm become more apparent when there are many non-linear computational operations in the algorithm, because such operations can often be mapped to parallel hardware implementations.

[0062] In summary, the method for accelerating neural network computation described in this embodiment combines the advantages of the Overlap paradigm and the Stream paradigm. It improves overall computational performance by splitting the computation process of the network model, generating different processing engines to match and compute the resulting subsequences, and performing pipeline optimization within each engine. Then, a design space search method is used to optimize workload balancing to maximize computational efficiency. Specifically, as shown in Figure 2, the method for accelerating computation in neural networks includes the following steps:

[0063] S1. Obtain the operating parameters of the neural network;

[0064] The above operating parameters include: the amount of computation and the amount of memory access of each layer in the neural network.

[0065] S2. Calculate the memory access ratio of each layer in the neural network based on the operating parameters.

[0066] The above memory access ratio calculation satisfies: $\mathrm{CTC} = \dfrac{\text{total computation}}{\text{total data memory access}}$, where $\mathrm{CTC}$ is the memory access ratio, the numerator represents the total computational cost, and the denominator represents the data memory access cost.

[0067] As can be seen from the formula above, the memory access ratio is used to describe the proportion of computational load to data memory access during neural network computation. Its value represents the maximum number of computational results that a unit of data can produce under certain computational patterns. Under the same computational pattern, a larger value indicates less data interaction, meaning less data is needed for the same computational load, and the correlation is lower. Conversely, a smaller value indicates that more data is needed for the same computational load, and the correlation is stronger.

[0068] Specifically, the computing layers are listed in order, and the total amount of computation and the total amount of memory access of each layer are obtained from the relevant runtime parameters.
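The calculation in S2 can be illustrated with a minimal Python sketch. The class name `LayerStats`, the field names, and the numbers below are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class LayerStats:
    """Per-layer operating parameters (illustrative names, not from the patent)."""
    name: str
    ops: int      # total computation of the layer, e.g. MAC operations
    access: int   # total data memory access of the layer, e.g. bytes


def memory_access_ratio(layer: LayerStats) -> float:
    """Total computation divided by total data memory access."""
    if layer.access <= 0:
        raise ValueError("memory access amount must be positive")
    return layer.ops / layer.access


# Hypothetical layers: one compute-heavy, one memory-heavy.
layers = [
    LayerStats("conv_a", ops=1_000_000, access=10_000),
    LayerStats("fc_b", ops=1_000_000, access=1_000_000),
]
ratios = {layer.name: memory_access_ratio(layer) for layer in layers}
```

With these made-up values the compute-heavy layer has a ratio of 100 and the memory-heavy layer a ratio of 1, which is the kind of gap the grouping step is meant to separate.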

[0069] S3. Based on the memory access ratio of each layer in the neural network, the layers of the neural network are grouped to obtain multiple groups, and the difference between the memory access ratios of any two layers in each group is within a preset range.

[0070] The aforementioned preset intervals can be set according to actual computational load and requirements; this embodiment does not limit this. The purpose of S3 grouping is to subsequently allocate similar layers to the same processing engine based on computation-to-memory access ratios. This ensures that neural network layers processed by the same processing engine have similar computational loads and data access volumes, thereby better utilizing hardware resources and improving the accelerator's operating efficiency.
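One possible way to realize the grouping in S3 is sketched below in Python. It assumes that groups are consecutive runs of layers in network order and that a new group starts whenever the spread (maximum minus minimum ratio) would exceed the preset range; the patent does not prescribe this particular procedure. The function name `group_layers` and the ratio values are hypothetical and are not the CTC values of Figure 4.

```python
def group_layers(ratios, preset_range):
    """Split an ordered list of per-layer ratios into consecutive groups.

    Within each returned group, the difference between the ratios of any
    two layers is at most preset_range. Groups are lists of layer indices.
    """
    groups = []
    current = []
    lo = hi = None
    for idx, ratio in enumerate(ratios):
        if current:
            new_lo, new_hi = min(lo, ratio), max(hi, ratio)
            if new_hi - new_lo <= preset_range:
                # The layer fits: extend the current group.
                current.append(idx)
                lo, hi = new_lo, new_hi
                continue
            # The layer does not fit: close the current group.
            groups.append(current)
        current = [idx]
        lo = hi = ratio
    if current:
        groups.append(current)
    return groups


# Hypothetical ratios for an eight-layer network.
example_ratios = [100.0, 110.0, 200.0, 210.0, 205.0, 1.0, 1.5, 1.2]
example_groups = group_layers(example_ratios, preset_range=20.0)
```

Each resulting group would then be assigned to one processing engine in S4.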

[0071] S4. Assign layers that belong to the same group to the same processing engine.

[0072] It should be noted that the specific number and method of the above-mentioned processing engine division are not limited in this embodiment. The number can be set according to actual needs. The division method can be manual or automatic by software. This embodiment will not elaborate on this.

[0073] To ensure computational efficiency during S4 allocation, the following also applies:

[0074] Calculate the theoretical throughput for each processing engine.

[0075] The calculation of theoretical throughput satisfies:

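[0076] $P_{\text{attainable}} = \min\left(P_{\text{peak}},\ P_{\text{mem}}\right)$;

[0077] where $P_{\text{attainable}}$ represents the theoretical throughput, $P_{\text{peak}}$ represents the peak performance of the system's computing hardware, and $P_{\text{mem}}$ represents the maximum performance supported by the memory access bandwidth; the tasks of the layers in the same group are mapped to a processing engine such that their computation-to-memory access ratio is less than the upper limit of that processing engine.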

[0078] The actual throughput of a processing core will not exceed the smaller of the two terms in this formula. A processing engine whose performance is below the system's peak performance is preferred, so that computing power is ensured as far as possible.

[0079] The first term of the formula describes the peak performance attainable with all available computing resources of the system (the compute limit), while the second term is the maximum throughput that a processing core can support under a given task's computation-to-memory access ratio. Therefore, under the same computing mode, differences in computation-to-memory access ratios between network layers can lead to differences in parallel performance. In actual execution, performance is limited either by computational performance or by memory access performance. Typically, the memory access performance and computing resources of a hardware platform are fixed. Therefore, this embodiment maps processing tasks to processing cores so that they fall within the compute-bound region, effectively utilizing the computing power of the hardware platform. For applications with complex computational characteristics, accelerator architectures based on the Overlap paradigm struggle to achieve high computational efficiency. The Stream paradigm's layer-specific optimizations significantly improve overall performance, while the method described in this embodiment adapts the computational characteristics to each type of network layer subsequence to improve computational efficiency.
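The two limits discussed above can be sketched in Python using the standard Roofline form, in which the memory-bound term is the off-chip bandwidth multiplied by the computation-to-memory access ratio. The function names, units, and numbers are assumptions for illustration only.

```python
def attainable_throughput(peak_perf, bandwidth, ctc):
    """Roofline bound: the smaller of the compute limit and the memory limit.

    peak_perf: peak performance of the computing hardware (e.g. GOP/s)
    bandwidth: off-chip memory bandwidth (e.g. GB/s)
    ctc:       computation-to-memory access ratio of the task (e.g. OP/byte)
    """
    return min(peak_perf, bandwidth * ctc)


def is_compute_bound(peak_perf, bandwidth, ctc):
    """True when the memory limit is at or above the compute limit."""
    return bandwidth * ctc >= peak_perf


# Hypothetical platform: 100 GOP/s peak, 10 GB/s off-chip bandwidth.
low_ctc_perf = attainable_throughput(100.0, 10.0, ctc=2.0)    # memory-bound
high_ctc_perf = attainable_throughput(100.0, 10.0, ctc=50.0)  # compute-bound
```

In this made-up setting a layer with a ratio of 2 can reach at most 20 GOP/s, while a layer with a ratio of 50 is capped by the 100 GOP/s compute limit.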

[0080] In terms of architecture design, this embodiment divides the network model's process graph into several common sub-sequences based on the computation-to-memory access ratio. A general design architecture is then used to support the execution of different sub-sequences on multiple processing engines as part of the overall operation sequence of the neural network model. Simultaneously, each processing engine internally connects different computing cores via a pipelined approach to achieve the most efficient internal computing performance.

[0081] In addition, Figure 3 shows the internal computational flow of the convolutional computation unit. At the software layer, this novel architecture can combine different PEs in a specific order to complete the computation of different network models. At the hardware level, these sequential operations run in a predetermined order within the processing engine. The architecture uses pipelining to divide the computation into stages and configures different parameters for the computational units so that the latency of each stage is similar, thereby achieving high computational efficiency. In this way, the computational performance of the processing engine satisfies:

[0082]

[0083] Combining the above formula with Figure 3, it is evident that the overall performance of an accelerator architecture designed using this paradigm is significantly affected by the maximum latency among the computational units within the processing engine. Therefore, during the design phase, to optimize computational performance and hardware utilization, this novel design approach requires making the latency of each stage between processing engines as close as possible to ensure that the entire computational architecture can operate efficiently through simple software control.
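Why the slowest stage dominates can be seen from a textbook pipeline timing model, sketched here in Python. This is not the formula of paragraph [0082], which is not reproduced in this text; it assumes identical work items, in-order processing, and no stalls. All names and numbers are hypothetical.

```python
def pipeline_interval(stage_latencies):
    """Steady-state time between two consecutive results: the slowest stage."""
    return max(stage_latencies)


def pipeline_total_time(stage_latencies, n_items):
    """Time to push n_items through the pipeline.

    The first item needs the sum of all stage latencies; each further item
    completes one slowest-stage interval later.
    """
    if n_items <= 0:
        return 0
    return sum(stage_latencies) + (n_items - 1) * max(stage_latencies)


# Two hypothetical three-stage pipelines with the same total work per item.
unbalanced = pipeline_total_time([2, 5, 3], n_items=10)
balanced = pipeline_total_time([3, 3, 4], n_items=10)
```

Both pipelines spend 10 time units per item in total, yet the better balanced one finishes ten items sooner because its slowest stage is shorter.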

[0084] Based on the theoretical throughput, the tasks of the layers in the same group are mapped to a processing engine such that their computation-to-memory access ratio is less than the upper limit of that processing engine.

[0085] For example, as shown in Figure 4, AlexNet consists of eight network layers, including five convolutional layers and three fully connected layers. These layers contain varying numbers of multiply-accumulate (MAC) operations and parameters. The figure lists the CTC values from layer 1 to layer 8. After grouping, conv-1 to conv-2 are mapped to the first processing engine (PE), conv-3 to conv-5 to the second PE, and FC-6 to FC-8 to the third PE.

[0086] Having explained the design philosophy and architecture of this method, to further improve acceleration performance and avoid latency, it is necessary to make the latency of each stage between processing engines as close as possible during the design phase. This ensures that the entire computational architecture can operate efficiently through simple software control. In summary, to achieve an efficient architecture design, this embodiment also introduces the Design Space Exploration (DSE) method to obtain optimal computational performance.

[0087] The prerequisite for achieving coarse-grained pipelined switching execution is that this novel design method ensures similar time overhead for each processing engine during the design phase, enabling synchronous completion of processing tasks and writing of input results back to external storage. Specifically, the processing engines are regenerated based on a balance calculation combined with a design space exploration method, wherein the balance calculation satisfies:

[0088] ,

[0089] where $N$ denotes the number of processing engines, $T_i$ denotes the time overhead of the $i$-th processing engine in completing the computation of a one-layer subsequence, and $\bar{T}$ denotes the average time taken by the processing engines to process and compute the subsequences.
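The balance formula itself is not reproduced in this text. Purely as an assumption, the Python sketch below uses the mean squared deviation of the per-engine time overheads from their average. It depends only on the three quantities defined above (the number of engines, each engine's time overhead, and the average), and it becomes smaller as the engines become better balanced, which matches the iteration rule that follows. The actual formula in the patent may differ.

```python
def equilibrium(time_overheads):
    """Assumed balance metric: mean squared deviation from the average.

    time_overheads: per-engine time to finish one subsequence.
    Returns 0.0 when all engines take exactly the same time.
    """
    n = len(time_overheads)
    if n == 0:
        raise ValueError("at least one processing engine is required")
    average = sum(time_overheads) / n
    return sum((t - average) ** 2 for t in time_overheads) / n


# Hypothetical per-engine time overheads.
perfectly_balanced = equilibrium([4.0, 4.0, 4.0])
skewed = equilibrium([2.0, 4.0, 6.0])
```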

[0090] The regeneration of the processing engine is based on the balance calculation and combined with the design space exploration method, including the following steps:

[0091] Calculate the historical and current balance of the processing engine;

[0092] When the current equilibrium degree is less than or equal to the historical equilibrium degree, proceed with the iteration:

[0093] The processing engine with the largest computational latency overhead in the current configuration is called the first processing engine;

[0094] Add computing resources to the first processing engine and regenerate it to obtain the second processing engine;

[0095] The second processing engine is checked. If it is valid, the iteration continues until the current balance is greater than the historical balance. If it is invalid, the second processing engine is restored to the first processing engine, the parallelism of other processing engines is reduced, and the iteration continues.

[0096] The regeneration of the processing engines described above can be expressed by the following algorithm:

[0097] Input: Set of computed subsequences, data size, FPGA device, PE set, hardware template

[0098] Output: Detailed implementation of each component

[0099]

[0100] During the iterative phase, the algorithm first identifies the processing engine with the largest computational latency in the current configuration (Line 5). Then, by modifying the configuration file, the corresponding computing resources are added to increase the parallelism of that processing engine (Line 6). After generating the new configuration file, the algorithm regenerates the processing engine (Line 7). After the hardware design is regenerated, a validity check is first performed to ensure that the usage limit of each resource has not been reached, and the result is recorded in a flag (Line 8).
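The algorithm listing is not reproduced in this text, so the following Python sketch is only a simplified model of the greedy loop described above, not the patent's listing. It assumes that an engine's latency is its workload divided by an integer parallelism, that total resource usage is the sum of all parallelism values, and that balance is measured by mean squared deviation. All names (`explore`, `resource_budget`, `max_iters`) are hypothetical; in a real flow, "regenerating" an engine would involve hardware synthesis rather than a division.

```python
def balance(latencies):
    """Assumed balance metric: mean squared deviation (lower is better)."""
    mean = sum(latencies) / len(latencies)
    return sum((t - mean) ** 2 for t in latencies) / len(latencies)


def engine_latencies(workloads, parallelism):
    """Toy latency model: workload divided by parallelism."""
    return [w / p for w, p in zip(workloads, parallelism)]


def explore(workloads, parallelism, resource_budget, max_iters=1000):
    """Greedy workload balancing over per-engine parallelism."""
    best = list(parallelism)
    history = balance(engine_latencies(workloads, best))
    for _ in range(max_iters):  # guard against endless ties
        lat = engine_latencies(workloads, best)
        slowest = max(range(len(lat)), key=lat.__getitem__)
        candidate = list(best)
        candidate[slowest] += 1  # add computing resources to the slowest engine
        if sum(candidate) > resource_budget:  # validity check failed
            candidate[slowest] -= 1  # restore the slowest engine
            fastest = min(range(len(lat)), key=lat.__getitem__)
            if fastest == slowest or candidate[fastest] <= 1:
                break  # nothing left to trade
            candidate[fastest] -= 1  # reduce parallelism of another engine
        current = balance(engine_latencies(workloads, candidate))
        if current > history:  # balance got worse: stop iterating
            break
        best, history = candidate, current
    return best


# Hypothetical workloads for three engines and a budget of 7 parallel units.
result = explore([120.0, 60.0, 30.0], [1, 1, 1], resource_budget=7)
```

In this toy setting the loop keeps giving resources to whichever engine is currently slowest and stops once all three engines take the same time.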

[0101] To further ensure computing performance, runtime scheduling also needs to be determined to achieve maximum performance. Specifically, runtime scheduling includes:

[0102] Step 1: Divide the input data into n data sources, where n is the same as the number of processing engines; each data source is a neural network task, and each neural network task consists of multiple neural network layers.

[0103] Step 2: Load the first data source into the first processing engine via off-chip access. After processing, return the calculation result of the first data source to off-chip storage.

[0104] Step 3: Load the second data source and the processed first data source into the first processing engine and the second processing engine respectively through off-chip access. After processing, return the calculation results to off-chip storage.

[0105] Repeat steps two through three for each of the n data sources until all n data sources have been processed.

[0106] Specifically, Figure 5 shows the architecture scheduling sequence. Solid boxes represent processes in progress, while dashed boxes indicate time reuse of the same processing engine. The horizontal axis represents time, and the vertical axis lists the different processing engines and the off-chip bus of the accelerator hardware. Upper-level scheduling refers to achieving coarse-grained pipelined execution by switching between multiple processing engines at the upper-level task scheduling level, thus masking the processing time overhead. Taking Figure 5 as an example, the scheduler first loads the input-1 data into processing engine-1 via off-chip access and performs the predefined computation operations. In this example, processing engine-1 needs to repeat the computation process twice. Then, the computation result of input-1 is returned to off-chip storage via off-chip access, thus completing one coarse-grained computation process.

[0107] In the next coarse-grained computation round, input-2 and input-1 are loaded into processing engine-1 and processing engine-2 respectively via off-chip access. At this point, input-1 has already been processed by processing engine-1, so processing engine-2 can carry it through the next stage of processing, while processing engine-1 performs the same predefined processing on input-2. In this example, processing engine-2 needs to repeat the computation process three times. Afterwards, the computation results of input-1 and input-2 are returned to off-chip storage via off-chip access, completing this coarse-grained computation process.

[0108] The third coarse-grained computation round follows. Input-3, input-2, and input-1 are loaded into processing engine-1, processing engine-2, and processing engine-3 respectively via off-chip access. At this point, input-1 and input-2 have already been processed by processing engine-2 and processing engine-1 respectively, so processing engine-3 and processing engine-2 can carry them through the next stage of processing, while processing engine-1 performs the same predefined processing on input-3. In this example, processing engine-3 only needs to perform the computation process once. The computation results of input-1, input-2, and input-3 are then returned to off-chip storage via off-chip access, completing this coarse-grained computation process.

[0109] This process continues until all data sources requiring computation are completed. This coarse-grained pipelining at the upper-level scheduling level simplifies external scheduling strategies while maximizing the utilization of on-chip hardware resources.
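The coarse-grained pipeline described above can be illustrated with a minimal Python sketch. It only models which input occupies which processing engine in each round; the function name `coarse_grained_schedule` and the assumption that the pipeline keeps running for extra rounds until the last input has passed through the last engine are illustrative and are not taken from the patent text.

```python
from typing import List, Tuple


def coarse_grained_schedule(num_inputs: int, num_engines: int) -> List[List[Tuple[int, int]]]:
    """Return, for each coarse-grained round, the (engine, input) pairs that run.

    Engines and inputs are numbered from 1, as in the description.
    In round r, engine e works on input r - e + 1 if that input exists.
    """
    rounds: List[List[Tuple[int, int]]] = []
    # Assumption: the pipeline needs extra rounds to drain after the last load.
    total_rounds = num_inputs + num_engines - 1
    for r in range(1, total_rounds + 1):
        active = []
        for e in range(1, num_engines + 1):
            i = r - e + 1
            if 1 <= i <= num_inputs:
                active.append((e, i))
        rounds.append(active)
    return rounds


if __name__ == "__main__":
    for r, active in enumerate(coarse_grained_schedule(3, 3), start=1):
        pairs = ", ".join(f"engine-{e} <- input-{i}" for e, i in active)
        print(f"round {r}: {pairs}")
```

With three inputs and three engines, the first three rounds reproduce the sequence in the description: input-1 on engine-1, then input-2 and input-1 on engine-1 and engine-2, then input-3, input-2, and input-1 on engine-1, engine-2, and engine-3.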

[0110] Example 2

[0111] Example 2 is an experiment conducted using the method described in Example 1.

[0112] This embodiment selects the AlexNet CNN model for experimentation. AlexNet has a very regular network structure consisting of eight layers: five convolutional layers and three fully connected layers. These layers contain varying numbers of multiply-accumulate operations (MACs) and parameters. After processing with the partitioning strategy described in Embodiment 1, conv-1 and conv-2 are mapped to the first PE, conv-3 to conv-5 to the second PE, and FC-6 to FC-8 to the third PE.

[0113] This experiment uses the Xilinx Alveo U50 as the verification platform. The U50 is a Xilinx accelerator card for data center use, built around a custom UltraScale+ FPGA consisting of two Super Logic Regions (SLRs). The first SLR connects to an external High Bandwidth Memory (HBM) interface and also retains a PCIe 3.0 x16 interface for programming the on-board logic and interconnecting with external hosts. The FPGA contains a total of 5952 DSPs, 1344 BRAMs, 872K LUTs, and 1743K FFs. For design tools, this embodiment uses version 2021.2 of Vitis HLS, Vivado, and Vitis: Vitis HLS synthesizes and generates each hardware design, and Vitis and Vivado assemble and deploy the entire architecture. This embodiment selects two common models, AlexNet and VGG-16, for experimental verification. In addition, the host CPU controls the accelerator and the data in HBM via the PCIe interface, and power consumption is taken from Vivado's built-in power report.

[0114] The following is an analysis of the experimental results:

[0115] An accelerator architecture for AlexNet was deployed on a U50 FPGA board. The accelerator employs three different processing engines, each matched to the characteristics of a different group of computational layers in AlexNet. The overall performance and resource usage of the accelerator were recorded in this experiment. Table 1 shows the resource overhead of each processing engine and the overall resource usage of the AlexNet accelerator built with the novel design method.

[0116] Table 1 FPGA Resource Usage of the AlexNet Accelerator

[0117]

[0118] As shown in Figure 6, the raw computation and data volumes handled by each processing engine differ significantly across computational layers, which would normally severely degrade the computational performance of the processing cores. After optimization, however, the overall computational performance of all processing engines is on the same order of magnitude. Ultimately, the optimization scheme improves the overall performance of the accelerator while reducing the differences in computational latency between processing engines, achieving an overall balance of 0.18.

[0119] Table 2 compares the performance and energy efficiency of our proposed AlexNet accelerator with CPU and GPU implementations. All figures are normalized to the CPU's metrics. Compared to the CPU implementation, the prototype of our accelerator achieves a 39.7x speedup when running the AlexNet network model; relative to the GPU, it reaches only 0.526x of the GPU's performance. So while our accelerator has a significant performance advantage over the CPU, it cannot completely replace the GPU. In energy efficiency, the prototype achieves a 19.49x improvement over the CPU and a 7.91x improvement over the GPU when running the AlexNet network model. These results indicate that, in specific application scenarios, our accelerator outperforms the CPU and is more energy efficient than both the CPU and the GPU, providing efficient computational acceleration for large-scale deep learning applications.

[0120] Table 2 Performance and Energy Efficiency of the AlexNet Accelerator Based on the Novel Design Method, Compared with CPU and GPU

[0121]

[0122] Example 3

[0123] Example 3 discloses an apparatus corresponding to the method for accelerating neural network computation in the above embodiments; it is the virtual apparatus structure of those embodiments. As shown in Figure 7, it includes:

[0124] The data acquisition module 210 is used to acquire the operating parameters of the neural network;

[0125] The calculation module 220 is used to calculate the memory access ratio of each layer in the neural network based on the operating parameters;

[0126] The allocation module 230 is used to group the layers of the neural network according to the memory access ratio of each layer in the neural network to obtain multiple groups, wherein the difference between the memory access ratios of any two layers in each group is within a preset range; and to allocate the layers in the same group to the same processing engine.
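The grouping rule applied by the allocation module can be sketched as follows. This is a minimal illustration under stated assumptions: the ratios are taken as already computed, layers are grouped in network order so that each group stays contiguous, and the function name `group_layers_by_ratio` and the sample ratios are hypothetical rather than values from the patent.

```python
from typing import List, Sequence


def group_layers_by_ratio(ratios: Sequence[float], preset_range: float) -> List[List[int]]:
    """Group consecutive layers so that, inside every group, the difference
    between the largest and smallest memory access ratio is <= preset_range.

    Returns groups of layer indices (0-based). Each group would then be
    assigned to one processing engine.
    """
    groups: List[List[int]] = []
    lo = hi = 0.0  # running min and max ratio of the group being built
    for idx, ratio in enumerate(ratios):
        if groups:
            new_lo, new_hi = min(lo, ratio), max(hi, ratio)
            if new_hi - new_lo <= preset_range:
                groups[-1].append(idx)
                lo, hi = new_lo, new_hi
                continue
        # Either the first layer, or the layer does not fit: open a new group.
        groups.append([idx])
        lo = hi = ratio
    return groups


if __name__ == "__main__":
    # Made-up ratios for an eight-layer network; not measured AlexNet values.
    sample = [50.0, 60.0, 120.0, 125.0, 130.0, 1.0, 1.2, 1.1]
    print(group_layers_by_ratio(sample, preset_range=20.0))
```

Tracking the running minimum and maximum guarantees the pairwise condition, because the largest difference between any two layers in a group is exactly the group's maximum minus its minimum.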

[0127] Preferably, assigning layers grouped into the same group to the same processing engine further includes:

[0128] Calculate the theoretical throughput for each processing engine;

[0129] Based on the theoretical throughput of each processing engine, the tasks of the layers in the same group are mapped to a processing engine whose theoretical throughput is higher than their computational demand.
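The throughput check can be sketched in Python. The text here does not reproduce the throughput formula, so the sketch assumes a roofline-style bound: the smaller of the hardware's peak compute performance and the performance that the memory bandwidth can sustain at a given operations-per-byte ratio. That assumption, the function names, and the units are illustrative and are not taken from the patent.

```python
from typing import List, Sequence


def theoretical_throughput(peak_ops_per_s: float,
                           bandwidth_bytes_per_s: float,
                           ops_per_byte: float) -> float:
    """Assumed roofline-style bound on attainable throughput (ops/s)."""
    # Maximum performance the memory system can feed at this ops-per-byte ratio.
    bandwidth_bound = bandwidth_bytes_per_s * ops_per_byte
    return min(peak_ops_per_s, bandwidth_bound)


def engines_that_fit(demand_ops_per_s: float,
                     engine_throughputs: Sequence[float]) -> List[int]:
    """Indices of engines whose theoretical throughput exceeds the demand."""
    return [i for i, t in enumerate(engine_throughputs) if demand_ops_per_s < t]


if __name__ == "__main__":
    # Hypothetical engines: one bandwidth-bound, one compute-bound.
    engines = [
        theoretical_throughput(1e12, 1e11, 4.0),
        theoretical_throughput(1e12, 1e11, 20.0),
    ]
    print(engines_that_fit(5e11, engines))
```

Under this assumption, a group of layers with a low operations-per-byte ratio is limited by bandwidth, so giving its engine more compute units would not raise its throughput.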

[0130] Preferably, the processing engine is further regenerated based on the balance calculation combined with the design space exploration method, including the following steps:

[0131] Calculate the historical balance and the current balance of the processing engine;

[0132] When the current balance is less than or equal to the historical balance, proceed with the iteration:

[0133] The processing engine with the largest computational latency overhead in the current configuration is called the first processing engine;

[0134] Add computing resources to the first processing engine and regenerate it to obtain the second processing engine;

[0135] The second processing engine is checked. If it is valid, the iteration continues until the current balance is greater than the historical balance. If it is invalid, the second processing engine is restored to the first processing engine, the parallelism of other processing engines is reduced, and the iteration continues.
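The iteration above can be sketched as a small greedy search. Several parts are assumptions made only for illustration: each engine is modelled by an integer parallelism, its latency is workload divided by parallelism, a configuration is "valid" when the total parallelism fits a resource budget, and the balance is taken to be the coefficient of variation of the engine latencies (lower means better balanced). None of these modelling choices, nor the names used, come from the patent text.

```python
from statistics import mean, pstdev
from typing import List


def latencies(work: List[float], par: List[int]) -> List[float]:
    """Assumed latency model: workload divided by parallelism."""
    return [w / p for w, p in zip(work, par)]


def balance(lat: List[float]) -> float:
    """Hypothetical balance metric: coefficient of variation of latencies."""
    return pstdev(lat) / mean(lat)


def explore(work: List[float], par: List[int], budget: int,
            max_iters: int = 100) -> List[int]:
    """Greedy design space exploration over engine parallelism."""
    best = list(par)
    historical = balance(latencies(work, best))
    for _ in range(max_iters):
        lat = latencies(work, best)
        # "First processing engine": the one with the largest latency.
        slowest = max(range(len(best)), key=lat.__getitem__)
        candidate = list(best)
        candidate[slowest] += 1  # add resources, i.e. regenerate the engine
        if sum(candidate) > budget:  # validity check (assumed resource budget)
            # Invalid: restore the engine and shrink the fastest other engine.
            others = [i for i in range(len(best)) if i != slowest and best[i] > 1]
            if not others:
                break
            fastest = min(others, key=lat.__getitem__)
            candidate = list(best)
            candidate[fastest] -= 1
        current = balance(latencies(work, candidate))
        if current > historical:  # stop once the balance gets worse
            break
        best, historical = candidate, current
    return best


if __name__ == "__main__":
    print(explore(work=[12.0, 4.0, 2.0], par=[1, 1, 1], budget=10))
```

The sketch keeps the last configuration whose balance did not get worse, which is one reasonable reading of iterating until the current balance exceeds the historical balance.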

[0136] Preferably, when the processing engine processes the computational subsequences of a neural network layer, the execution scheduling includes:

[0137] Step 1: Divide the input data received by the processing engine into n data sources, where n is the same as the number of processing engines;

[0138] Step 2: Load the first data source into the first processing engine via off-chip access. After processing, return the calculation result of the first data source to off-chip storage.

[0139] Step 3: Load the second data source and the processed first data source into the first processing engine and the second processing engine respectively through off-chip access. After processing, return the calculation results to off-chip storage.

[0140] Repeat Steps 2 and 3 for each of the n data sources until all n data sources have been processed.

[0141] Example 4

[0142] Figure 8 is a schematic diagram of the structure of an electronic device provided in Embodiment 4 of the present invention. As shown in Figure 8, the electronic device includes a processor 310, a memory 320, an input device 330, and an output device 340. The number of processors 310 in the electronic device can be one or more; Figure 8 takes one processor 310 as an example. The processor 310, memory 320, input device 330, and output device 340 in the electronic device can be connected via a bus or by other means; Figure 8 takes connection via a bus as an example.

[0143] The memory 320, as a computer-readable storage medium, can be used to store software programs, computer-executable programs, and modules. The processor 310 executes various functional applications and data processing of the electronic device by running the software programs, instructions, and modules stored in the memory 320, thereby implementing the methods for accelerating neural network computation described in Embodiments 1 and 2 above.

[0144] The memory 320 may primarily include a program storage area and a data storage area. The program storage area may store the operating system and at least one application program required for a given function; the data storage area may store data created based on terminal usage. Furthermore, the memory 320 may include high-speed random access memory and non-volatile memory, such as at least one disk storage device, flash memory device, or other non-volatile solid-state storage device. In some instances, the memory 320 may further include memory remotely located relative to the processor 310, which can be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

[0145] The input device 330 can be used to receive input user identity information, neural network operating parameters, etc. The output device 340 may include a display screen or other display device.

[0146] For those skilled in the art, various other corresponding changes and modifications can be made based on the technical solutions and concepts described above, and all such changes and modifications should fall within the protection scope of the claims of this invention.

Claims

1. A method for accelerating computation in neural networks, characterized in that it includes the following steps: obtaining the operating parameters of the neural network; calculating, based on the operating parameters, the memory access ratio of each layer in the neural network; grouping the layers of the neural network according to the memory access ratio of each layer to obtain multiple groups, wherein the difference between the memory access ratios of any two layers in each group is within a preset range; and assigning the layers in the same group to the same processing engine; the method further includes regenerating the processing engine based on the balance calculation combined with the design space exploration method, which includes the following steps: calculating the historical balance and the current balance of the processing engine; when the current balance is less than or equal to the historical balance, performing the iteration: the processing engine with the largest computational latency overhead in the current configuration is designated the first processing engine; computing resources are added to the first processing engine and it is regenerated to obtain the second processing engine; the second processing engine is checked; if it is valid, the iteration continues until the current balance is greater than the historical balance; if it is invalid, the second processing engine is restored to the first processing engine, the parallelism of the other processing engines is reduced, and the iteration continues.

2. The method for accelerating neural network computation as described in claim 1, characterized in that the operating parameters include the computational load and the access load of each layer in the neural network.

3. The method for accelerating neural network computation as described in claim 1, characterized in that assigning the layers in the same group to the same processing engine further includes: calculating the theoretical throughput of each processing engine; and, based on the theoretical throughput of each processing engine, mapping the tasks of the layers in the same group to a processing engine whose theoretical throughput is greater than their computational cost.

4. The method for accelerating neural network computation as described in claim 3, characterized in that the calculation of the theoretical throughput satisfies a formula (not reproduced in this text) whose terms represent, respectively, the theoretical throughput, the peak performance of the system's computing hardware, and the maximum performance supported by the memory access bandwidth.

5. The method for accelerating neural network computation as described in claim 1, characterized in that the current balance is calculated by a formula (not reproduced in this text) whose symbols denote, respectively: the number of processing engines; the time overhead of the i-th processing engine in completing the computation of the subsequence of one layer; and the average time taken by the processing engines to process and compute the subsequence.

6. The method for accelerating neural network computation as described in claim 1, characterized in that, when the processing engine processes the computational subsequences of the neural network layers, the execution scheduling includes: Step 1: dividing the input data into n data sources, where n is the same as the number of processing engines, each data source is a neural network task, and each neural network task consists of multiple neural network layers; Step 2: loading the first data source into the first processing engine via off-chip access and, after processing, returning the calculation result of the first data source to off-chip storage; Step 3: loading the second data source and the processed first data source into the first processing engine and the second processing engine respectively via off-chip access and, after processing, returning the calculation results to off-chip storage; and repeating Steps 2 and 3 for each of the n data sources until all n data sources have been processed.

7. An apparatus for accelerating computation in neural networks, characterized in that it includes: a data acquisition module, used to acquire the operating parameters of the neural network; a calculation module, used to calculate the memory access ratio of each layer in the neural network based on the operating parameters; and an allocation module, used to group the layers of the neural network according to the memory access ratio of each layer to obtain multiple groups, wherein the difference between the memory access ratios of any two layers in each group is within a preset range, and to allocate the layers in the same group to the same processing engine; the apparatus further regenerates the processing engine based on the balance calculation combined with the design space exploration method, which includes the following steps: calculating the historical balance and the current balance of the processing engine; when the current balance is less than or equal to the historical balance, performing the iteration: the processing engine with the largest computational latency overhead in the current configuration is designated the first processing engine; computing resources are added to the first processing engine and it is regenerated to obtain the second processing engine; the second processing engine is checked; if it is valid, the iteration continues until the current balance is greater than the historical balance; if it is invalid, the second processing engine is restored to the first processing engine, the parallelism of the other processing engines is reduced, and the iteration continues.

8. An electronic device comprising a processor, a storage medium, and a computer program, wherein the computer program is stored in the storage medium, characterized in that, when the computer program is executed by the processor, it implements the method for accelerating neural network computation as described in any one of claims 1 to 6.