Hardware accelerators, chips, computer devices suitable for machine learning
By designing a hardware accelerator suitable for machine learning, the problem of low efficiency of CPUs and GPUs when processing broadcast residual networks is solved, achieving efficient data processing and energy saving, and is suitable for various machine learning algorithms.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ANHUI UNIV
- Filing Date
- 2024-01-26
- Publication Date
- 2026-06-19
AI Technical Summary
Existing general-purpose processors such as CPUs and GPUs have low computational efficiency and high resource consumption when handling data processing tasks based on machine learning algorithms, such as broadcast residual networks, making it difficult to meet the high-efficiency operation requirements of resource-constrained devices such as mobile phones.
A hardware accelerator suitable for machine learning was designed, including a data computation module, a data storage module, a data read/write module, and a computation control module. It supports data processing of complex neural networks such as broadcast residual networks, and adopts a variety of operators and buffer designs. It achieves efficient data transmission and computation control through DMA.
It significantly improves the computational speed and efficiency of machine learning algorithms, reduces the utilization of CPU and GPU, saves computer energy, and is suitable for applications of different machine learning algorithms.
Figure CN117933328B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of neural network processing units (NPUs), and specifically relates to a hardware accelerator suitable for machine learning and to the corresponding neural network processor chip and computer device.
Background Technology
[0002] Keyword spotting (KWS) is an important research area that plays a crucial role in device wake-up and user interaction on smart devices. However, minimizing errors while running efficiently on resource-constrained devices such as mobile phones remains a challenge. Existing efficient CNNs typically consist of repeated blocks with the same structure and are based on residual learning and depthwise separable convolutions. This trend persists in CNN-based KWS methods, which process all features through either 1D or 2D convolutions. One-dimensional convolutions are efficient in parameter count and computational cost, but they lack properties such as translation equivariance in the frequency direction. Two-dimensional convolutions, on the other hand, require significantly more computation than one-dimensional methods.
[0003] The Broadcast Residual Network (BC-ResNet) addresses the aforementioned issues. Broadcast residual learning allows 1D and 2D features to be combined: frequency convolution is first performed on the 2D features, and frequency averaging is then applied to the 2D features to obtain 1D temporal features. After some computation on these temporal features, the residual mapping is applied to the input 2D features by broadcasting the 1D residual information. This learning method enables convolutional processing in the frequency direction, gaining the advantages of 2D CNNs while minimizing computational cost. BC-ResNets achieve high accuracy with a small model size and low computational cost. This residual mapping allows the network to represent useful audio features effectively at a lower computational cost than traditional convolutional neural networks.
[0004] In existing computer systems, using CPUs or GPUs to perform data processing tasks for broadcast residual networks results in low computational efficiency and high CPU and graphics card resource consumption. Therefore, developing a data processor more suitable for machine learning algorithms such as broadcast residual networks has become a pressing technical challenge for those skilled in the art.
Summary of the Invention
[0005] To address the problem that existing general-purpose processors such as CPUs and GPUs are not suitable for handling novel computational tasks based on machine learning algorithms, such as broadcast residual networks, this invention provides a hardware accelerator, chip, and computer device suitable for machine learning.
[0006] This invention is achieved using the following technical solution:
[0007] A hardware accelerator for machine learning is disclosed. It handles data processing tasks based on machine learning algorithms in a computer system and improves the computational speed of such tasks. The hardware accelerator includes: a data computation module, a data storage module, a data read / write module, a data allocation module, and a computation control module.
[0008] The data computation module contains all the operators applicable to the specified machine learning algorithm, and each operator is used to perform different computational operations and feature extraction.
[0009] The data storage module includes multiple internal buffers, each used to output data with three different bit widths: 16-bit, 32-bit, and 64-bit, according to instructions. Write operations to each buffer are 64-bit and employ a parameterized design to meet different application scenarios and precision requirements.
[0010] The data read / write module includes two DMAs for accessing external memory. One DMA retrieves data from the external memory, while the other DMA sends computation results to the external memory. The DMAs employ a memory-mapped data transfer mode to transfer data between the external memory and the internal data storage module.
[0011] The data allocation module is used to preprocess the feature maps based on the acquired configuration information. The preprocessed data is then categorized and stored in various internal buffers within the data storage module. The data allocation module is also used to read the calculation results from the output buffer and transfer them to external memory.
[0012] The computational control module acquires network configuration and parameters, and then, upon receiving a start signal, generates appropriate interface timing to initiate DMA and obtain the necessary network configuration information. The computational control module monitors DMA interrupt signals to determine whether configuration information is complete and clears relevant interrupts and statuses.
[0013] As a further improvement of the present invention, the operators in the data calculation module include: convolution, transposed convolution, ReLU, element-wise addition, depthwise separable convolution, pointwise convolution, max pooling, and average pooling.
[0014] As a further improvement of this invention, the data storage module includes four data buffers: an input buffer, a weight buffer, a bias buffer, and an output buffer. The input buffer is used to pre-store the input data of the network model in the machine learning algorithm. The weight buffer is used to pre-store the dynamically updated weights in the network model. The bias buffer pre-stores the bias information during the network model's computation. The output buffer is used to store the data processing results of the network model.
[0015] As a further improvement of the present invention, the data storage module also includes a configuration register group containing 16 64-bit registers, which are used to store configuration information such as the number of channels and feature map size in the network model of the machine learning algorithm.
[0016] As a further improvement of this invention, the data read / write module employs DMA based on a descriptor data structure, including DMA_write and DMA_read. DMA_read is used to acquire data from external memory and write it to the input buffer, weight buffer, bias buffer, and configuration register group in the data storage area. DMA_write is used to write data acquired from the output buffer to external memory.
[0017] As a further improvement of this invention, the data allocation module includes a Scatter module and a Gather module. The Scatter module preprocesses the feature maps read via DMA based on the acquired configuration information. The preprocessed data is then categorized and stored in the input buffer, weight buffer, bias buffer, and configuration register set. The preprocessing methods supported by the Scatter module include data format conversion, padding, rotation, and scaling operations. The Gather module is responsible for reading data from the output buffer during data transmission.
[0018] As a further improvement of this invention, the computation control module includes a Fetch Config module and a Fetch Param module. Upon receiving a start signal, the Fetch Config module generates appropriate interface timing to initiate DMA and acquire the required network configuration information. It monitors DMA interrupt signals to determine whether configuration information has been completed and clears related interrupts and states. The Fetch Param module is used to synchronously acquire network parameters, including weights and biases, based on the received signals.
[0019] As a further improvement of the present invention, the hardware accelerator for machine learning adopts an interface timing that follows the Avalon bus, and the applied interface signals include: clock, reset_n, process_start, process_done, mem_address, mem_write, mem_write_waitrequest, mem_writedata, mem_read, mem_read_waitrequest, mem_readdatavalid, and mem_qout.
[0020] The present invention also includes a neural network processor chip, which encapsulates the aforementioned hardware accelerator suitable for machine learning. This neural network processor chip is used to perform data processing tasks based on broadcast residual networks or any other specified machine learning algorithm.
[0021] The present invention also includes a computer device comprising a memory, a processor, and a computer program stored in the memory and running on the processor. The processor includes a CPU and / or a GPU. In particular, the processor of the computer device further includes a neural network processor chip as described above.
[0022] The technical solution provided by this invention has the following beneficial effects:
[0023] This invention designs a novel hardware accelerator for emerging neural network data processing tasks. This accelerator can flexibly configure the structure of the data computing module according to the adjustment of the network model for different tasks, and is therefore suitable for applications of different machine learning algorithms, especially for applications of complex neural networks containing broadcast network residual structures.
[0024] The hardware accelerator provided by this invention can significantly improve the computing speed and efficiency of computer systems in processing machine learning algorithm tasks, while reducing the occupancy of hardware such as CPU and GPU, thus saving computer energy consumption.
Description of the Drawings
[0025] Figure 1 is a structural diagram of the BC-ResNet block mentioned in Embodiment 1 of the present invention.
[0026] Figure 2 is an architecture diagram of a hardware accelerator suitable for machine learning provided in Embodiment 1 of the present invention.
[0027] Figure 3 is a schematic diagram of the data computation module in a hardware accelerator suitable for machine learning.
[0028] Figure 4 is a timing diagram of the signals for the hardware accelerator performing a write operation in the simulation experiment.
[0029] Figure 5 is a timing diagram of the signals for the hardware accelerator performing a read operation in the simulation experiment.
[0030] Figure 6 shows the test results obtained by the hardware accelerator in the data processing performance test during the simulation experiment.
Detailed Implementation
[0031] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.
[0032] Embodiment 1
[0033] This embodiment provides a hardware accelerator suitable for machine learning, used to process data processing tasks based on machine learning algorithms in a computer system, thereby improving the computational speed of such data processing tasks. The hardware accelerator provided in this embodiment is mainly applied to Broadcast Residual Networks (BC-ResNet Blocks).
[0034] A typical ResNet block can be represented as y = x + f(x), where x and y are the input and output features of the block, respectively, and the function f computes the residual. The identity shortcut x and the residual f(x) typically have the same dimensions. To use one-dimensional and two-dimensional features together, the function f is decomposed into f1 and f2, representing the one-dimensional temporal operation and the two-dimensional operation, respectively. Frequency averaging is then performed on the two-dimensional features after f2 to obtain temporal features, and after f1 these temporal features are expanded back to the two-dimensional shape. This averaging and expansion is repeated in every residual block, which is what is referred to as broadcast residual learning. Furthermore, to achieve frequency-aware convolution in the block, an auxiliary 2D residual connection is added from the 2D features, resulting in the broadcast residual block (BC-ResNet block) shown in Figure 1, which has the following expression:
[0035] y=x+f2(x)+BC(f1(avgpool(f2(x))))
[0036] In the BC-ResNet block of Figure 1, the input feature x is in R^{h×w}, where h and w correspond to the frequency and time dimensions, respectively. The 2D feature part f2 consists of a 3×1 frequency depthwise convolution and sub-spectral normalization (SSN), which divides the input frequencies into multiple groups and normalizes each group separately. Frequency averaging of the f2 output then yields features in R^{1×w}. f1 consists of a 1×3 temporal depthwise convolution, BN, swish activation, a 1×1 pointwise convolution, and channel dropout. The broadcast (BC) operation expands the features in R^{1×w} to R^{h×w}, so that the temporal features are broadcast to the 2D features. In small networks, pointwise convolutions are the most computationally expensive operations. Compared with 2D depthwise separable convolutions, the BC-ResNet block performs the temporal depthwise convolution and pointwise convolution on temporal features, reducing their computation by a factor of h.
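The data flow of the expression y = x + f2(x) + BC(f1(avgpool(f2(x)))) can be illustrated with a minimal pure-Python sketch. The branches `toy_f2` and `toy_f1` are hypothetical placeholders standing in for the real convolution and normalization layers, and the helper names are assumptions made for illustration only; the point is the shape flow (h×w, then 1×w, then back to h×w) and the factor-of-h saving for pointwise convolution.

```python
from typing import Callable, List

Matrix = List[List[float]]  # h rows (frequency) by w columns (time)


def freq_avgpool(x: Matrix) -> List[float]:
    """Average over the frequency axis: (h x w) -> length-w temporal feature."""
    h = len(x)
    return [sum(row[t] for row in x) / h for t in range(len(x[0]))]


def broadcast(v: List[float], h: int) -> Matrix:
    """BC operation: repeat a length-w temporal feature across h frequency rows."""
    return [list(v) for _ in range(h)]


def bc_resnet_block(x: Matrix,
                    f2: Callable[[Matrix], Matrix],
                    f1: Callable[[List[float]], List[float]]) -> Matrix:
    """Compute y = x + f2(x) + BC(f1(avgpool(f2(x))))."""
    h, w = len(x), len(x[0])
    two_d = f2(x)                        # 2D (frequency) branch
    temporal = f1(freq_avgpool(two_d))   # 1D (temporal) branch
    bc = broadcast(temporal, h)
    return [[x[i][t] + two_d[i][t] + bc[i][t] for t in range(w)]
            for i in range(h)]


def pointwise_macs(rows: int, w: int, c_in: int, c_out: int) -> int:
    """Multiply-accumulate count of a 1x1 convolution over rows*w positions."""
    return rows * w * c_in * c_out


# Hypothetical placeholder branches, NOT the layers described in the patent.
def toy_f2(x: Matrix) -> Matrix:
    return [[2.0 * v for v in row] for row in x]


def toy_f1(v: List[float]) -> List[float]:
    return [e + 1.0 for e in v]


x = [[1.0, 2.0], [3.0, 4.0]]          # h = 2 frequency bins, w = 2 time steps
y = bc_resnet_block(x, toy_f2, toy_f1)
# Pointwise convolution on h x w features versus on 1 x w temporal features.
ratio = pointwise_macs(40, 100, 16, 16) // pointwise_macs(1, 100, 16, 16)
```

With h = 40 frequency bins in the cost comparison, `ratio` equals 40, which matches the statement that running the pointwise convolution on temporal features reduces its computation by a factor of h.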
[0037] For a novel neural network of the type shown in Figure 1, traditional general-purpose processors offer limited data processing speed. This embodiment therefore provides a novel dedicated hardware accelerator, as shown in Figure 2. Using it to execute the corresponding machine learning algorithms can significantly improve data processing efficiency. The hardware accelerator includes: a data computation module, a data storage module, a data read / write module, a data allocation module, and a computation control module.
[0038] The data computation module contains all the operators applicable to the specified machine learning algorithm, and each operator performs a different computational operation or feature extraction. As shown in Figure 3, the operators in the data computation module include: convolution, transposed convolution, ReLU, element-wise addition, depthwise convolution, pointwise convolution, max pooling, and average pooling. The hardware accelerator in this embodiment can be configured to use different operators for computation and feature extraction according to the network structure. Each operator is designed independently and deployed in the data computation module as needed.
[0039] The data storage module includes multiple internal buffers, each used to output data with three different bit widths: 16-bit, 32-bit, and 64-bit, according to instructions. Write operations to each buffer are 64-bit and employ a parameterized design to meet different application scenarios and precision requirements.
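A buffer that is written 64 bits at a time but read at 16, 32, or 64 bits can be modeled as splitting each stored word into lanes. The sketch below is a software illustration only; the function name `split_word` and the lowest-lane-first ordering are assumptions, since the text does not specify how lanes are ordered within a word.

```python
from typing import List


def split_word(word64: int, width: int) -> List[int]:
    """Split one 64-bit word into lanes of `width` bits, lowest lane first.

    Lane ordering is an assumption made for this sketch.
    """
    if width not in (16, 32, 64):
        raise ValueError("width must be 16, 32, or 64")
    if not 0 <= word64 < (1 << 64):
        raise ValueError("word must fit in 64 bits")
    mask = (1 << width) - 1
    return [(word64 >> shift) & mask for shift in range(0, 64, width)]


word = 0x0123456789ABCDEF
lanes16 = split_word(word, 16)
lanes32 = split_word(word, 32)
lanes64 = split_word(word, 64)
```

One 64-bit write therefore yields four 16-bit values, two 32-bit values, or one 64-bit value, which is how a single storage layout can serve different precision requirements.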
[0040] The data storage module in this embodiment includes four data buffers: an input buffer, a weight buffer, a bias buffer, and an output buffer. The input buffer is used to pre-store the input data of the network model in the machine learning algorithm. The weight buffer is used to pre-store the dynamically updated weights in the network model. The bias buffer pre-stores the bias information during the network model's computation. The output buffer stores the data processing results of the network model. The input buffer, weight buffer, and output buffer are each formed by concatenating smaller buffers to expand the data bit width.
[0041] The data storage module of the hardware accelerator in this embodiment also includes a configuration register group containing 16 64-bit registers. This configuration register group stores various configuration information from the network model of the machine learning algorithm, including the number of channels, feature map size, DMA descriptor size, and so on. This configuration information plays a crucial role in the entire network model's data processing, controlling parameter acquisition, data preprocessing, network computation, and the saving of output results.
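A configuration register group of 16 registers, each 64 bits wide, can be modeled as below. The register indices (`REG_CHANNELS`, `REG_FMAP_HEIGHT`, `REG_FMAP_WIDTH`) are hypothetical: the text lists the kinds of information stored but not the register layout, and truncating written values to 64 bits is likewise an assumption of this sketch.

```python
class ConfigRegisters:
    """Software model of a group of 16 registers, each 64 bits wide."""

    NUM_REGS = 16
    MASK64 = (1 << 64) - 1

    def __init__(self) -> None:
        self._regs = [0] * self.NUM_REGS

    def write(self, index: int, value: int) -> None:
        if not 0 <= index < self.NUM_REGS:
            raise IndexError("register index out of range")
        # Assumption: values wider than 64 bits are truncated.
        self._regs[index] = value & self.MASK64

    def read(self, index: int) -> int:
        if not 0 <= index < self.NUM_REGS:
            raise IndexError("register index out of range")
        return self._regs[index]


# Hypothetical register assignment, for illustration only.
REG_CHANNELS = 0
REG_FMAP_HEIGHT = 1
REG_FMAP_WIDTH = 2

cfg = ConfigRegisters()
cfg.write(REG_CHANNELS, 16)
cfg.write(REG_FMAP_HEIGHT, 40)
cfg.write(REG_FMAP_WIDTH, (1 << 64) | 101)  # upper bit is dropped
```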
[0042] The data read / write module includes two DMAs for accessing external memory. One DMA retrieves data from external memory, while the other sends computation results to external memory. The DMAs employ memory mapping, specifically a memory-to-memory (MM->MM) data transfer mode, to transfer data between external memory and the internal data storage module. The data read / write module uses a descriptor-based DMA structure, comprising DMA_write and DMA_read. DMA_read retrieves data from external memory and writes it to the input buffer, weight buffer, bias buffer, and configuration register group in the data storage area. DMA_write writes data retrieved from the output buffer to external memory.
[0043] In DMA, the parameters and control information for data transfer are encapsulated in specific data structures called descriptors. Each descriptor contains information about the data transfer, such as the source address, destination address, and data length. The DMA controller executes data transfer operations based on these descriptors, processing each descriptor sequentially according to a pre-defined descriptor linked list to complete the data transfer. Furthermore, DMA includes a set of CSR (Control and Status Registers) that contain control, status, and interrupt fields for configuring and monitoring DMA operations.
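The descriptor mechanism described above, in which each descriptor carries a source address, destination address, and length, and the controller walks a linked list of descriptors, can be sketched as follows. The class `Descriptor`, the field `next_desc`, and the function `run_dma` are hypothetical names for illustration; real DMA descriptors also carry control and status fields that are omitted here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Descriptor:
    src: int                                   # source byte address
    dst: int                                   # destination byte address
    length: int                                # number of bytes to move
    next_desc: Optional["Descriptor"] = None   # next entry in the linked list


def run_dma(head: Optional[Descriptor],
            src_mem: bytearray,
            dst_mem: bytearray) -> int:
    """Process descriptors in list order; return how many were completed."""
    completed = 0
    desc = head
    while desc is not None:
        if (desc.src + desc.length > len(src_mem)
                or desc.dst + desc.length > len(dst_mem)):
            raise ValueError("descriptor exceeds memory bounds")
        dst_mem[desc.dst:desc.dst + desc.length] = \
            src_mem[desc.src:desc.src + desc.length]
        completed += 1
        desc = desc.next_desc
    return completed


external = bytearray(range(16))   # stands in for external memory
buffer = bytearray(8)             # stands in for an internal buffer

second = Descriptor(src=8, dst=4, length=4)
first = Descriptor(src=0, dst=0, length=4, next_desc=second)
count = run_dma(first, external, buffer)
```

Chaining descriptors lets one start command move several non-contiguous regions, which is why the controller only needs the list head to complete a whole transfer sequence.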
[0044] Both the descriptor and the CSR register have their own independent configuration interfaces. In the accelerator design, the configuration of the descriptor and CSR register is not done by the CPU, but by a dedicated hardware module that generates the timing of the configuration interface to configure and read the status. During normal operation of the accelerator, the descriptor and CSR register are configured, enabling the two DMAs to transfer input data, parameters, and computation results between external memory and internal buffers.
[0045] The data allocation module is used to preprocess the feature maps based on the acquired configuration information. The preprocessed data is then categorized and stored in various internal buffers within the data storage module. The data allocation module is also used to read the calculation results from the output buffer and transfer them to external memory.
[0046] As shown in Figure 2, the data allocation module in the hardware accelerator of this embodiment includes a Scatter module and a Gather module. The Scatter module preprocesses the feature maps read via DMA based on the acquired configuration information. The preprocessing methods supported by the Scatter module in this embodiment include data format conversion, padding, rotation, scaling, etc. The preprocessed data is categorized and stored in an input buffer, a weight buffer, a bias buffer, and configuration registers. The Gather module is responsible for reading data from the output buffer during data transmission and passing it to the DMA read interface, ensuring timely and accurate transfer of computation results to external memory. Through this storage and categorization method, the accelerator can easily access and manage the input configuration information, ensuring the accuracy and effectiveness of the data during computation.
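Of the preprocessing methods listed for the Scatter module, padding is the simplest to illustrate. The sketch below zero-pads a 2D feature map before it would be stored in the input buffer; the function name `zero_pad` and the choice of zero as the fill value are assumptions, since the text does not state the padding mode.

```python
from typing import List

FeatureMap = List[List[int]]


def zero_pad(fmap: FeatureMap, pad: int) -> FeatureMap:
    """Surround a 2D feature map with `pad` rows and columns of zeros."""
    if pad < 0:
        raise ValueError("pad must be non-negative")
    width = len(fmap[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]             # top border
    for row in fmap:
        padded.append([0] * pad + list(row) + [0] * pad)   # left and right
    padded.extend([0] * width for _ in range(pad))         # bottom border
    return padded


padded = zero_pad([[1, 2], [3, 4]], 1)
```

A 2×2 input with `pad=1` becomes a 4×4 map, so a 3×3 convolution applied afterwards produces an output with the original 2×2 size.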
[0047] The computational control module acquires network configuration and parameters, and then, upon receiving a start signal, generates appropriate interface timing to initiate DMA and obtain the necessary network configuration information. The computational control module monitors DMA interrupt signals to determine whether configuration information is complete and clears relevant interrupts and statuses.
[0048] Specifically, the computation control module of the hardware accelerator designed in this embodiment includes a Fetch Config module and a Fetch Param module. Upon receiving the start signal, the Fetch Config module generates appropriate interface timing to initiate DMA and acquire the required network configuration information. It monitors DMA interrupt signals to determine whether configuration information is complete and clears related interrupts and states. The Fetch Param module is notified to begin acquiring network parameters, including weights and biases. The parameter acquisition process is similar to the process of acquiring network configuration information.
[0049] In the hardware accelerator provided in this embodiment, when a task request is received from the controller, the data computation module performs the corresponding computation according to the configuration and parameter information. To complete the computation, the hardware accelerator first uses DMA to retrieve the network configuration and parameters from external memory and reads the input feature map for processing. During the computation, various operators such as convolution and pooling are applied. After the computation is completed, the accelerator module stores the computation result in an internal buffer and writes the result back to external memory through the DMA interface for subsequent processing or output, thus completing the computation.
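The sequence described above (start, fetch configuration, fetch parameters, read the input feature map, compute, write back, done) can be sketched as a small state machine. The state names, the `next_state` function, and the single shared `dma_irq` event are assumptions made for illustration; the patent does not give the actual state encoding or the number of states.

```python
from enum import Enum, auto
from typing import Dict, List


class State(Enum):
    IDLE = auto()
    FETCH_CONFIG = auto()
    FETCH_PARAM = auto()
    FETCH_INPUT = auto()
    COMPUTE = auto()
    WRITE_BACK = auto()
    DONE = auto()


# States that wait on a DMA interrupt, mapped to the state that follows.
AFTER_DMA = {
    State.FETCH_CONFIG: State.FETCH_PARAM,
    State.FETCH_PARAM: State.FETCH_INPUT,
    State.FETCH_INPUT: State.COMPUTE,
    State.WRITE_BACK: State.DONE,
}


def next_state(state: State, start: bool = False, dma_irq: bool = False,
               compute_done: bool = False) -> State:
    if state is State.IDLE:
        return State.FETCH_CONFIG if start else State.IDLE
    if state in AFTER_DMA:
        # Stay put until the DMA interrupt reports that the transfer finished.
        return AFTER_DMA[state] if dma_irq else state
    if state is State.COMPUTE:
        return State.WRITE_BACK if compute_done else State.COMPUTE
    return State.IDLE  # DONE lasts one step, like the process_done pulse


def run_events(events: List[Dict[str, bool]]) -> List[State]:
    state = State.IDLE
    trace = [state]
    for event in events:
        state = next_state(state, **event)
        trace.append(state)
    return trace


trace = run_events([
    {"start": True},
    {},                      # configuration DMA still in flight
    {"dma_irq": True},       # configuration fetched
    {"dma_irq": True},       # parameters fetched
    {"dma_irq": True},       # input feature map fetched
    {"compute_done": True},
    {"dma_irq": True},       # results written back
    {},
])
```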
[0050] To achieve a complete data processing procedure, this embodiment also defines corresponding signal interfaces and timing logic for the hardware accelerator. Specifically, the hardware accelerator for machine learning provided in this embodiment adopts interface timing conforming to the Avalon bus, and the applied interface signals include: clock, reset_n, process_start, process_done, mem_address, mem_write, mem_write_waitrequest, mem_writedata, mem_read, mem_read_waitrequest, mem_readdatavalid, and mem_qout. The corresponding interface signal table is shown in Table 1 below:
[0051] Table 1: Interface signals of the hardware accelerator designed in this embodiment
[0052]
| Name | Direction | Width | Description |
| --- | --- | --- | --- |
| clock | I | 1 | Clock input |
| reset_n | I | 1 | Reset input, active low |
| process_start | I | 1 | Accelerator start; held high for at least one clock cycle |
| process_done | O | 1 | Accelerator done; single-clock-cycle pulse |
| mem_address | O | 26 | Memory address |
| mem_write | O | 1 | Memory write |
| mem_write_waitrequest | I | 1 | Memory write wait |
| mem_writedata | O | 32 | Data written to memory |
| mem_read | O | 1 | Memory read |
| mem_read_waitrequest | I | 1 | Memory read wait |
| mem_readdatavalid | I | 1 | Memory read data valid |
| mem_qout | I | 32 | Data read from memory |
[0053] In the above signals, `clock` is the clock signal, and `reset_n` is the reset signal. The `process_start` signal is the accelerator start signal; it must remain high for at least one clock cycle so that the module can detect its rising edge, which indicates the start of a computation. The `process_done` signal is the accelerator end signal, indicating that the accelerator has completed the computation. It is an active-high pulse that stays high for one clock cycle and is then pulled low, indicating the completion of one computation. `mem_read` and `mem_write` represent reading from and writing to memory, respectively; in practical applications, these two signals must remain high for at least one clock cycle. The `mem_write_waitrequest` and `mem_read_waitrequest` signals indicate whether the peripheral can accept new transaction requests. When the peripheral is busy processing the current transaction, it tells the master device to wait by setting `waitrequest` high. The master device can send a new transaction request when `waitrequest` is low, and must wait for the peripheral to complete its current transaction when `waitrequest` is high. The `waitrequest` signal thus coordinates data transmission between the master device and the peripheral: it lets the peripheral control the flow of data according to its processing capability, preventing data loss or collisions. The `mem_readdatavalid` signal indicates whether the data read from memory is valid; it is valid when high. The `mem_qout` signal carries the data read from memory.
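The `waitrequest` handshake can be illustrated with a cycle-level sketch of a master issuing consecutive writes. This is a simplified software model, not the accelerator's RTL: the function `avalon_write_burst`, the use of word indices as addresses, and the rule that cycles beyond the supplied pattern are not stalled are all assumptions of the sketch.

```python
from typing import List, Sequence, Tuple


def avalon_write_burst(words: Sequence[int],
                       wait_pattern: Sequence[bool]) -> List[Tuple[int, int, int]]:
    """Return (cycle, address, data) for each accepted write.

    wait_pattern[c] is True when the peripheral drives waitrequest high in
    cycle c. The master keeps mem_write, mem_address, and mem_writedata
    stable while waitrequest is high and moves on once it is low.
    Cycles past the end of wait_pattern are treated as not waiting.
    """
    accepted = []
    cycle = 0
    for address, data in enumerate(words):
        while cycle < len(wait_pattern) and wait_pattern[cycle]:
            cycle += 1                      # stalled: hold the same request
        accepted.append((cycle, address, data))
        cycle += 1                          # next request starts next cycle
    return accepted


log = avalon_write_burst(
    words=[0xA, 0xB, 0xC],
    wait_pattern=[True, False, True, True, False],
)
```

In this run the first write is held for one cycle and the second for two, so no data is dropped even though the peripheral was busy in three of the first five cycles.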
[0054] Embodiment 2
[0055] Building upon the solution in Embodiment 1, this embodiment further provides a neural network processor chip (NPU). This NPU is a product corresponding to the solution in Embodiment 1: it encapsulates the hardware accelerator suitable for machine learning described in Embodiment 1 as an integrated circuit. Because this hardware accelerator employs a novel computing architecture suitable for machine learning, it can be specifically designed to execute various neural network data processing tasks based on specified machine learning algorithms, especially tasks based on broadcast residual blocks (BC-ResNet blocks). This significantly improves the efficiency of computer systems in processing these types of data processing tasks, while reducing CPU utilization and device power consumption.
[0056] Embodiment 3
[0057] This embodiment provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and running on the processor. The processor controls the overall operation of the computer device. The processor in this embodiment includes a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and a Neural Network Processing Unit (NPU). The NPU uses a product as described in Embodiment 1. This computer device performs routine computing tasks using the CPU and GPU, and utilizes the NPU to perform data processing tasks based on BC-ResNet Blocks and other types of neural networks.
[0058] The computer equipment provided in this embodiment can be a smartphone, tablet computer, laptop computer, desktop computer, rack server, blade server, tower server, or cabinet server (including independent servers or server clusters composed of multiple servers) capable of executing programs.
[0059] The computer device in this embodiment includes, but is not limited to, a memory and a processor that can be interconnected via a system bus. In this embodiment, the memory (i.e., the readable storage medium) includes flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory can be an internal storage unit of the computer device, such as the hard disk or RAM of the computer device. In other embodiments, the memory can also be an external storage device of the computer device, such as a plug-in hard disk, smart media card (SMC), secure digital card (SD), flash card, etc. Of course, the memory can also include both internal storage units and external storage devices of the computer device. In this embodiment, the memory is typically used to store the operating system and various application software installed on the computer device. Furthermore, the memory can also be used to temporarily store various types of data that have been output or will be output.
[0060] Simulation test
[0061] To verify the performance of the hardware accelerator designed for machine learning in this invention, the engineers used the VCS tool developed by Synopsys to simulate the solution, and combined it with the Verdi tool to view the simulation waveforms and test the hardware's functionality and acceleration performance.
[0062] I. Basic Performance
[0063] This test primarily checks the data input / output functions of the simulated hardware accelerator. Figures 4 and 5 are the signal timing diagrams for write and read operations in the simulation experiment. Analysis of the diagrams shows the following. When the `mem_write` signal is high, a write operation is in progress. When the `mem_write_waitrequest` signal is high, it tells the master device to wait because the current write operation is still being completed; when it is low, the master device can send a new request. When the `mem_read` signal is high and the `mem_readdatavalid` signal is also high, the read operation is valid. When the `mem_read_waitrequest` signal is high, it tells the master device to wait because the current read operation is still being completed. It can be seen that the accelerator successfully writes and reads data, achieving its design goals.
[0064] II. Computational Performance
[0065] To verify the data processing capabilities of the hardware accelerator designed in this invention, a reference model was developed in MATLAB and used to calculate the expected results for each layer. The designed hardware accelerator was then run in circuit simulation, and the two sets of results were compared using a MATLAB script. The final output is shown as text in Figure 6. Figure 6 shows that the solution of the present invention passed all computational tests and is logically feasible.
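The layer-by-layer comparison between a reference model and simulation output can be sketched as follows. The source performed this step with a MATLAB script; Python is swapped in here purely for illustration. The function `compare_layers`, the layer names, and the sample values are hypothetical toy data and are not results from the patent.

```python
from typing import Dict, List, Optional, Tuple


def compare_layers(reference: Dict[str, List[int]],
                   simulated: Dict[str, List[int]],
                   tol: int = 0) -> List[Tuple[str, Optional[int]]]:
    """Return (layer, index) of the first mismatch per layer.

    index is None when the layer is missing or has a different length.
    An empty list means every layer matched within `tol`.
    """
    mismatches: List[Tuple[str, Optional[int]]] = []
    for name, ref in reference.items():
        sim = simulated.get(name)
        if sim is None or len(sim) != len(ref):
            mismatches.append((name, None))
            continue
        for i, (r, s) in enumerate(zip(ref, sim)):
            if abs(r - s) > tol:
                mismatches.append((name, i))
                break
    return mismatches


# Toy data for illustration only.
reference = {"conv1": [1, 2, 3], "pool1": [2, 3]}
good_run = {"conv1": [1, 2, 3], "pool1": [2, 3]}
bad_run = {"conv1": [1, 2, 4], "pool1": [2]}

passed = compare_layers(reference, good_run)
failed = compare_layers(reference, bad_run)
```

Reporting the first mismatching index per layer, rather than a single pass or fail flag, makes it easier to locate which operator or buffer stage diverged from the reference.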
[0066] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the various embodiments of the present invention.
Claims
1. A hardware accelerator suitable for machine learning, used to process data processing tasks based on machine learning algorithms in a computer system to improve the computational speed of such data processing tasks, characterized in that the hardware accelerator includes:
a data computation module, which contains all the operators applicable to the specified machine learning algorithm, each operator being used to perform a different computational operation and feature extraction;
a data storage module, which includes multiple internal buffers, each able to output data at three different bit widths, 16-bit, 32-bit, and 64-bit, according to instructions; the write operation of each buffer is 64-bit and adopts a parameterized design to meet different application scenarios and precision requirements;
a data read/write module, which includes two DMAs for accessing external memory, one DMA being used to retrieve source data from external memory and the other DMA being used to send computation results to external memory; the DMAs use a memory-mapped data transfer mode to transfer data between external memory and the internal data storage module;
a data allocation module, which is used to preprocess the feature map according to the acquired configuration information, the preprocessed data being categorized and stored in the internal buffers of the data storage module; the data allocation module is also used to read the calculation results from the output buffer and transfer them to external memory; the data allocation module includes a Scatter module and a Gather module; the Scatter module preprocesses the feature map read via DMA according to the acquired configuration information, and the preprocessed data is categorized and stored in the input buffer, weight buffer, bias buffer, and configuration register group; the preprocessing methods supported by the Scatter module include data format conversion, padding, rotation, and scaling operations; the Gather module is responsible for reading data from the output buffer during data transmission; and
a computation control module, which is used to acquire the network configuration and parameters; after receiving a start signal, the computation control module generates appropriate interface timing to start the DMA and acquire the required network configuration information; the computation control module determines whether acquisition of the configuration information is complete by monitoring the DMA interrupt signal, and clears the relevant interrupts and states; the computation control module includes a FetchConfig module and a Fetch Param module; the FetchConfig module, after receiving a start signal, generates appropriate interface timing to start the DMA and acquire the required network configuration information, determines whether acquisition of the configuration information is complete by monitoring the DMA interrupt signal, and clears the relevant interrupts and states; the Fetch Param module is used to synchronously acquire network parameters, including weights and biases, according to the received signals.
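The buffer behavior recited in claim 1 (64-bit writes, reads at 16, 32, or 64 bits) can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the circuit of the invention: the class name `WordBuffer`, the `lane` argument, and the convention that lane 0 is the least significant part of the word are all hypothetical choices made for the example.

```python
class WordBuffer:
    """Behavioral model of a buffer written in 64-bit words and read
    back at a selectable width of 16, 32, or 64 bits."""

    WRITE_BITS = 64

    def __init__(self, depth):
        self.words = [0] * depth

    def write(self, addr, value):
        # Every write stores one full 64-bit word.
        self.words[addr] = value & ((1 << self.WRITE_BITS) - 1)

    def read(self, addr, width, lane=0):
        # ASSUMPTION: lane 0 is the least significant `width` bits of the word.
        if width not in (16, 32, 64):
            raise ValueError("width must be 16, 32, or 64")
        lanes = self.WRITE_BITS // width
        if not 0 <= lane < lanes:
            raise ValueError("lane out of range for this width")
        return (self.words[addr] >> (lane * width)) & ((1 << width) - 1)


buf = WordBuffer(depth=4)
buf.write(0, 0x1122334455667788)
print(hex(buf.read(0, 16, lane=0)))  # lowest 16 bits of the stored word
print(hex(buf.read(0, 32, lane=1)))  # upper 32 bits of the stored word
```

Keeping the write path at a fixed 64 bits while selecting the read width per access lets the same storage serve operators that work at different precisions.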
2. The hardware accelerator suitable for machine learning of claim 1, wherein the operators in the data computation module include: convolution, transposed convolution, ReLU, element-wise addition, depthwise separable convolution, pointwise convolution, max pooling, and average pooling.
3. The hardware accelerator suitable for machine learning of claim 1, wherein the data storage module includes four data buffers: an input buffer, a weight buffer, a bias buffer, and an output buffer; the input buffer is used to pre-store the input data of the network model in the machine learning algorithm; the weight buffer is used to pre-store the dynamically updated weights in the network model; the bias buffer is used to pre-store the bias information used during operation of the network model; and the output buffer is used to store the data processing results of the network model.
4. The hardware accelerator suitable for machine learning of claim 1, wherein the data storage module also includes a configuration register group containing sixteen 64-bit registers, which are used to store configuration information such as the number of channels and the feature map size in the network model of the machine learning algorithm.
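As an illustration of the configuration register group in claim 4, the sketch below models sixteen 64-bit registers and packs a channel count and feature map size into one of them. The field layout (three 16-bit fields in the low 48 bits) and all names are assumptions made for this example; the document does not specify how the fields are arranged.

```python
NUM_REGS = 16
REG_MASK = (1 << 64) - 1


def pack_layer_config(channels, height, width):
    """Pack three values into one 64-bit word.

    ASSUMED layout (hypothetical): channels in bits 0-15,
    height in bits 16-31, width in bits 32-47.
    """
    for value in (channels, height, width):
        if not 0 <= value < (1 << 16):
            raise ValueError("field does not fit in 16 bits")
    return channels | (height << 16) | (width << 32)


def unpack_layer_config(word):
    """Inverse of pack_layer_config: returns (channels, height, width)."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF, (word >> 32) & 0xFFFF


class ConfigRegisterGroup:
    """Sixteen 64-bit registers addressed by index 0..15."""

    def __init__(self):
        self.regs = [0] * NUM_REGS

    def write(self, index, value):
        if not 0 <= index < NUM_REGS:
            raise IndexError("register index out of range")
        self.regs[index] = value & REG_MASK

    def read(self, index):
        if not 0 <= index < NUM_REGS:
            raise IndexError("register index out of range")
        return self.regs[index]


regs = ConfigRegisterGroup()
regs.write(0, pack_layer_config(channels=16, height=40, width=101))
print(unpack_layer_config(regs.read(0)))
```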
5. The hardware accelerator suitable for machine learning of claim 1, wherein the data read/write module adopts DMA based on a descriptor data structure, including DMA_write and DMA_read; DMA_read is used to acquire data from external memory and write it into the input buffer, weight buffer, bias buffer, and configuration register group in the data storage module; DMA_write is used to write data acquired from the output buffer to external memory.
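Descriptor-based DMA, as named in claim 5, generally means that each transfer is described by a small record and the engine walks a chain of such records. The sketch below shows that general idea only; the descriptor fields (`src`, `dst`, `length`, `next`) and the function `run_dma` are hypothetical and do not reflect the descriptor format actually used by DMA_read or DMA_write.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Descriptor:
    """One transfer in a chain (hypothetical field layout)."""
    src: int       # start word index in the source memory
    dst: int       # start word index in the destination buffer
    length: int    # number of words to copy
    next: Optional["Descriptor"] = None  # following descriptor, or None at the end


def run_dma(first, source, dest):
    """Walk the descriptor chain, copying words from source to dest.

    Returns the total number of words transferred.
    """
    moved = 0
    desc = first
    while desc is not None:
        if desc.src < 0 or desc.dst < 0 or desc.length < 0:
            raise ValueError("negative field in descriptor")
        if desc.src + desc.length > len(source) or desc.dst + desc.length > len(dest):
            raise ValueError("descriptor exceeds memory bounds")
        dest[desc.dst:desc.dst + desc.length] = source[desc.src:desc.src + desc.length]
        moved += desc.length
        desc = desc.next
    return moved


# Two chained transfers from a stand-in external memory into one buffer.
external = list(range(100, 110))
input_buffer = [0] * 4
second = Descriptor(src=6, dst=2, length=2)
first = Descriptor(src=0, dst=0, length=2, next=second)
print(run_dma(first, external, input_buffer), input_buffer)
```

Chaining descriptors lets a single start command move several non-contiguous regions, which fits the claim's requirement that one read engine fills several different buffers.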
6. The hardware accelerator suitable for machine learning of claim 1, wherein the hardware accelerator adopts interface timing that follows the Avalon bus, and the interface signals used include: clock, reset_n, process_start, process_done, mem_address, mem_write, mem_write_waitrequest, mem_writedata, mem_read, mem_read_waitrequest, mem_readdatavalid, and mem_qout.
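The read-side signals listed in claim 6 can be illustrated with a cycle-by-cycle trace of a single read. The sketch assumes the usual Avalon memory-mapped convention that the master holds its read request while the wait-request signal is asserted and that data returns later with a data-valid pulse. The function `avalon_read`, its parameters, and the chosen cycle counts are hypothetical and are not taken from the accelerator's actual timing.

```python
def avalon_read(memory, address, wait_cycles=2, latency=1):
    """Model one read transaction and return (data, trace).

    `trace` holds one dict of signal values per clock cycle.
    wait_cycles: cycles during which the slave stalls the request.
    latency: cycles from command acceptance to valid data (at least 1).
    """
    if wait_cycles < 0 or latency < 1:
        raise ValueError("wait_cycles must be >= 0 and latency >= 1")
    trace = []
    # The master holds mem_read and mem_address steady while stalled.
    for _ in range(wait_cycles):
        trace.append({"mem_read": 1, "mem_address": address,
                      "mem_read_waitrequest": 1, "mem_readdatavalid": 0})
    # The command is accepted in the cycle where waitrequest is low.
    trace.append({"mem_read": 1, "mem_address": address,
                  "mem_read_waitrequest": 0, "mem_readdatavalid": 0})
    # Idle cycles while the slave prepares the data.
    for _ in range(latency - 1):
        trace.append({"mem_read": 0, "mem_address": None,
                      "mem_read_waitrequest": 0, "mem_readdatavalid": 0})
    # The data arrives on mem_qout together with mem_readdatavalid.
    data = memory[address]
    trace.append({"mem_read": 0, "mem_address": None,
                  "mem_read_waitrequest": 0, "mem_readdatavalid": 1,
                  "mem_qout": data})
    return data, trace


data, trace = avalon_read({0x10: 0xABCD}, 0x10)
for cycle, signals in enumerate(trace):
    print(cycle, signals)
```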
7. A neural network processor chip, characterized in that it contains the hardware accelerator suitable for machine learning as described in any one of claims 1-6; the neural network processor chip is used to perform data processing tasks based on broadcast residual networks or any other specified machine learning algorithm.
8. A computer device comprising a memory, a processor, and a computer program stored in the memory and run on the processor, the processor comprising a CPU and/or a GPU, characterized in that the processor also includes the neural network processor chip as described in claim 7.