A data accumulation and unloading system and method for an accumulator buffer
By designing a data accumulation and unloading system for the accumulator buffer and utilizing the result processing unit to process the accumulation results in parallel, the problem of low accumulator buffer efficiency is solved, and efficient accumulation result unloading and caching operations are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JIANGNAN INST OF COMPUTING TECH
- Filing Date
- 2022-08-12
- Publication Date
- 2026-06-19
Figure CN115268837B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of accumulator buffering technology, specifically to a data accumulation and unloading system and method for accumulator buffers.
Background Technology
[0002] A systolic array is a two-dimensional computational structure that accelerates computation through a data-driven approach. Each computational unit in a systolic array can pass data to adjacent units, so data is reused and the number of input/output data accesses falls, which lowers the memory access bandwidth requirement. Systolic arrays can therefore achieve high computational throughput with relatively low memory access bandwidth, addressing the memory access bottleneck faced by most processors, and they show significant advantages in workloads that are both compute-intensive and memory-intensive, such as neural networks.
[0003] The matrix multiplication acceleration unit is a two-dimensional systolic array whose dimensions are flexibly configurable and can be tailored to performance and application requirements. As shown in Figure 1, the matrix multiplication acceleration unit consists of multiple identical operation units. Northbound data is transmitted from north to south through the systolic array and cached in each operation unit as needed. Westbound data is transmitted from west to east. When westbound data arrives at an operation unit, it is multiplied by the northbound data cached inside that unit, and the product is added to the accumulated data received from the operation unit to the north, completing one multiply-add operation. The multiply-add result is then passed to the operation unit to the south, so that multiply-add results propagate from north to south.
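The multiply-add dataflow described in this paragraph can be illustrated with a minimal Python sketch. It models only the arithmetic, not the cycle-by-cycle timing of a real systolic array, and the names `pe_step` and `systolic_matmul` are hypothetical, introduced here for illustration.

```python
def pe_step(north_cached, west_in, psum_from_north):
    """One operation unit: multiply the westbound operand by the cached
    northbound operand and add the partial sum arriving from the north."""
    return psum_from_north + west_in * north_cached


def systolic_matmul(a, b):
    """Compute the matrix product a @ b using the north-to-south dataflow.

    b[k][j] plays the role of the northbound value cached in the unit at
    row k, column j. a[i][k] is the westbound value that passes through
    row k while output row i is being computed. Timing is ignored.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            psum = 0  # nothing arrives from above the northernmost unit
            for k in range(inner):  # partial sum travels north to south
                psum = pe_step(b[k][j], a[i][k], psum)
            out[i][j] = psum  # value leaving the south edge of column j
    return out
```

Each value that leaves the south edge of a column is what the accumulator buffer for that column receives.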
[0004] As shown in Figure 1, the accumulator buffer is located at the south exit of the matrix multiplication acceleration unit. It receives the multiply-add results output by the matrix multiplication acceleration unit, accumulates them, and caches the accumulated results. When the accumulator buffer receives an unload signal, the current round of accumulation ends. The data in the accumulator buffer is first written back to the local data memory, and an unload and write-zero operation is then required to clear the data from the accumulator buffer. Only after the unload and write-zero operation is completed can the accumulator buffer start the next round of accumulation and caching, which makes the accumulator buffer inefficient.
Summary of the Invention
[0005] This invention addresses the problems existing in the prior art by proposing a data accumulation and unloading system and method for an accumulator buffer. While the buffer body is accumulating and caching the accumulation results, the result processing unit can unload the cached accumulation results in the buffer body, which greatly shortens the waiting time for unloading the accumulation results and thus effectively improves the working efficiency of the accumulator buffer.
[0006] The technical solution adopted by this invention to solve its technical problem is a data accumulation and unloading system for an accumulator buffer, comprising accumulator buffer control logic and multiple accumulator buffer modules, each of the accumulator buffer modules including:
[0007] a control register, electrically connected to the accumulator buffer control logic and used to receive and temporarily store control signals issued by the accumulator buffer control logic; and
[0008] a data accumulation and unloading submodule, electrically connected to the control register and including:
[0009] a buffer body, used to cache the accumulated results in the order of the buffer entries; and
[0010] a result processing unit, electrically connected to the buffer body and used to unload the accumulated results cached in the buffer body in the order of the buffer entries.
[0011] Preferably, the result processing unit includes:
[0012] a result extraction subunit, used to acquire and temporarily store the accumulated results in the buffer entries; and
[0013] a result unload and write-zero subunit, electrically connected to the result extraction subunit and used, after the result extraction subunit has completed the extraction of an accumulated result, to unload the accumulated result in the corresponding buffer entry and write 0 to it.
[0014] Preferably, the data accumulation and unloading submodule also includes:
[0015] a first data register, connected one-to-one with an operation unit at the output of the matrix multiplication acceleration unit and used to acquire and temporarily store the multiply-add result output by the corresponding operation unit;
[0016] a second data register, electrically connected to a third data register and used to acquire and temporarily store the accumulated result in the third data register;
[0017] an adder, electrically connected to the first data register and the second data register and used to add the multiply-add result in the first data register to the accumulated result in the second data register to obtain a new accumulated result; and
[0018] the third data register, electrically connected to the adder and the buffer body and used to acquire and temporarily store the new accumulated result; after the third data register has acquired and temporarily stored the new accumulated result, the buffer body caches the accumulated result held in the third data register into a buffer entry of the buffer body.
[0019] Preferably, the data accumulation and unloading system also includes:
[0020] a result write-back module, electrically connected to the result extraction subunit and the local data memory and used to write the accumulated result temporarily stored in the result extraction subunit into the local data memory.
[0021] Preferably, the result extraction subunit includes a first data storage area and a second data storage area, and when the first data storage area is in a first working state, the second data storage area is in a second working state; when the first data storage area is in the second working state, the second data storage area is in the first working state; wherein, the first working state is to acquire and temporarily store the accumulated result in the buffer body, and the second working state is to write the accumulated result into the local data memory through the result write-back module.
[0022] A data accumulation and unloading method for an accumulator buffer includes the following steps:
[0023] S1: The accumulator buffer control logic issues a control signal;
[0024] S2: The control register receives and temporarily stores the control signal issued by the accumulator buffer control logic;
[0025] S3: According to the control signal, the buffer body caches the accumulated results in the order of the buffer entries;
[0026] S4: According to the control signal, the result processing unit unloads the accumulated results cached in the buffer body in the order of the buffer entries.
[0027] Preferably, S4 specifically includes the following steps:
[0028] S41: The result extraction subunit acquires and temporarily stores the accumulated result in the buffer entry;
[0029] S42: After the result extraction subunit completes the extraction of the accumulated result, the result unload and write-zero subunit unloads the accumulated result in the corresponding buffer entry and writes 0 to it.
[0030] Preferably, S3 specifically includes the following steps:
[0031] S31: The first data register acquires and temporarily stores the multiply-add result output by the operation unit at the output of the matrix multiplication acceleration unit;
[0032] S32: The second data register acquires and temporarily stores the accumulated result in the third data register;
[0033] S33: The adder adds the multiply-add result in the first data register to the accumulated result in the second data register to obtain a new accumulated result;
[0034] S34: The third data register acquires and temporarily stores the new accumulated result;
[0035] S35: After the third data register has acquired and temporarily stored the new accumulated result, the buffer body caches the accumulated result held in the third data register into a buffer entry of the buffer body.
[0036] Preferably, the data accumulation and unloading method further includes the following step:
[0037] S5: The result write-back module writes the accumulated result temporarily stored in the result extraction subunit into the local data memory.
[0038] Preferably, S41 specifically includes acquiring and temporarily storing the accumulated result in the buffer entry through whichever of the first data storage area and the second data storage area is in the first working state;
[0039] S5 specifically includes writing the accumulated result held in whichever of the first data storage area and the second data storage area is in the second working state into the local data memory through the result write-back module.
[0040] Beneficial effects
[0041] In embodiments of the present invention, the accumulator buffer module can cache accumulated results through the buffer body while the result processing unit unloads the accumulated results already cached in the buffer body. As a result, when the next control signal is received, the data accumulation and unloading submodule needs to wait only a short time before it can resume caching, which effectively improves the working efficiency of the accumulator buffer.
Description of the Drawings
[0042] Figure 1 is a schematic diagram illustrating the working principle of the matrix multiplication acceleration unit and the accumulator buffer in the prior art;
[0043] Figure 2 is a schematic diagram of the accumulator buffer module in an embodiment of the present invention;
[0044] Figure 3 is a flowchart of the result processing unit extracting the accumulated result from a buffer entry and then unloading it and writing 0, in an embodiment of the present invention.
Detailed Implementation
[0045] The technical solution of the present invention will be further described below with reference to the accompanying drawings and specific embodiments.
[0046] Example 1: A data accumulation and unloading system for an accumulator buffer includes accumulator buffer control logic and multiple accumulator buffer modules. The number of accumulator buffer modules equals the number of columns in the matrix multiplication acceleration unit: each column of operation units corresponds to one accumulator buffer module, and all accumulator buffer modules are controlled by the same accumulator buffer control logic. As shown in Figure 2, each accumulator buffer module includes a control register and a data accumulation and unloading submodule.
[0047] The control register is electrically connected to the accumulator buffer control logic and is used to receive and temporarily store control signals issued by the accumulator buffer control logic. The control signals may include an accumulation result start signal and an accumulation result unload signal.
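The relationship between the shared control logic and the per-column modules can be pictured with a minimal Python sketch. The class and field names (`ControlSignal`, `AccumulatorBufferModule`, `ControlLogic`, `issue`) are hypothetical and chosen only for illustration; the text describes hardware and does not define a software interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ControlSignal:
    """Hypothetical encoding of one control signal."""
    start_accumulation: bool  # accumulation result start signal
    unload: bool              # accumulation result unload signal


@dataclass
class AccumulatorBufferModule:
    """One module per column of the matrix multiplication acceleration unit."""
    column: int
    control_register: Optional[ControlSignal] = None  # latest latched signal


class ControlLogic:
    """Single control logic shared by all accumulator buffer modules."""

    def __init__(self, num_columns: int) -> None:
        self.modules: List[AccumulatorBufferModule] = [
            AccumulatorBufferModule(column) for column in range(num_columns)
        ]

    def issue(self, signal: ControlSignal) -> None:
        # Every control register receives and temporarily stores the same signal.
        for module in self.modules:
            module.control_register = signal
```

For a four-column array, `ControlLogic(4).issue(ControlSignal(True, True))` would leave the same signal latched in all four control registers.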
[0048] The data accumulation and unloading submodule is electrically connected to the control register and includes a buffer body and a result processing unit. Based on the accumulation start signal, the buffer body caches the accumulated results in the order of the buffer entries. The result processing unit is electrically connected to the buffer body and, based on the accumulation result unload signal, unloads the accumulated results cached in the buffer body in the order of the buffer entries. Specifically, as shown in Figure 2 and Figure 3, the result processing unit includes a result extraction subunit and a result unload and write-zero subunit. The result extraction subunit acquires and temporarily stores the accumulated result in a buffer entry. The result unload and write-zero subunit is electrically connected to the result extraction subunit and, once the result extraction subunit has finished extracting the accumulated result, unloads the accumulated result in the corresponding buffer entry and writes 0 to it.
[0049] Initially, the buffer body has no cached data. When the accumulator buffer control logic issues its first control signal, the control register receives and temporarily stores the control signal. The buffer body caches the accumulation result according to the accumulation start signal in the control signal, and the caching order is based on the order of the buffer entries' addresses in the buffer body. After a certain period of time, when the nth (which could be the 3rd) buffer entry has completed caching, the result processing unit will begin unloading the cached accumulation result from the 1st buffer entry according to the accumulation result unloading signal in the control signal. Thereafter, for each accumulation result cached by the buffer body, the result processing unit will unload one accumulation result in the order of the entries.
[0050] When the accumulator buffer control logic issues a second control signal, the control register receives and temporarily stores it. Upon receiving the accumulation start signal in the second control signal, the buffer body first stops caching accumulated results and waits until the remaining buffer entries have been unloaded. Only after all the accumulated results in the buffer body have been unloaded does the buffer body restart caching. After a certain period, when the nth (for example, the 3rd) buffer entry has completed caching, the result processing unit again begins unloading the cached accumulated results, starting from the first buffer entry, according to the accumulation result unload signal in the control signal.
[0051] The subsequent processing steps of the accumulator buffer module are the same as those described above.
[0052] In this embodiment, the accumulator buffer module can cache the accumulation results through the buffer body while simultaneously unloading the cached accumulation results from the buffer body through the result processing unit. Thus, when the next control signal is received, the data accumulation and unloading submodule only needs to wait a short time to re-cache the accumulation results (unlike existing technologies that require waiting for the entire unloading process), thereby effectively improving the working efficiency of the accumulator buffer in this embodiment.
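The overlap between caching and unloading can be made concrete with a small Python simulation. This is a sketch under stated assumptions, not the patented circuit: it assumes that one entry is cached per step and that, once unloading has started, exactly one entry is unloaded per newly cached entry. The names `run_round`, `start_lag` and `wait_steps` are hypothetical.

```python
def run_round(results, start_lag, overlapped=True):
    """Cache `results` entry by entry and unload them in entry order.

    start_lag  : unloading of entry 1 starts once entry `start_lag` is cached.
    overlapped : False models the prior-art flow, where unloading starts
                 only after the whole round has been cached.
    Returns (extracted values, final buffer contents, entries still to be
    unloaded at the moment caching stops).
    """
    buffer_body = [0] * len(results)
    extracted = []
    next_unload = 0

    def unload_one():
        nonlocal next_unload
        extracted.append(buffer_body[next_unload])  # extraction step
        buffer_body[next_unload] = 0                # unload and write 0
        next_unload += 1

    for index, value in enumerate(results):
        buffer_body[index] = value                  # cache in entry order
        if overlapped and index + 1 >= start_lag:
            unload_one()                            # one unload per cached entry

    wait_steps = len(results) - next_unload         # backlog when caching stops
    while next_unload < len(results):
        unload_one()
    return extracted, buffer_body, wait_steps
```

With five entries and `start_lag=3`, two entries remain to be unloaded when caching stops, compared with all five when `overlapped=False`. In both cases every value is extracted in entry order and every entry ends up holding 0.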
[0053] Furthermore, as shown in the figure, the data accumulation and unloading submodule also includes a first data register, a second data register, an adder, and a third data register.
[0054] The first data register is connected one-to-one with the arithmetic units at the output of the matrix multiplication acceleration unit, and is used to acquire and temporarily store the multiplication-accumulation result output by the corresponding arithmetic unit. The second data register is electrically connected to the third data register, and is used to acquire and temporarily store the accumulation result in the third data register. The adder is electrically connected to the first data register and the second data register, and is used to add the multiplication-accumulation result in the first data register to the accumulation result in the second data register to obtain a new accumulation result. The third data register is electrically connected to the adder and the buffer body, and is used to acquire and temporarily store the new accumulation result. After the buffer body acquires and temporarily stores the new accumulation result in the third data register, it caches the accumulation result in the third data register into the buffer entry of the buffer body.
[0055] Initially, upon receiving a control signal, the first data register acquires and temporarily stores the multiply-add result output by the connected operation unit, while the second data register acquires and temporarily stores the initial accumulated result (i.e., 0) held in the third data register. The adder then adds the multiply-add result in the first data register to the accumulated result in the second data register to obtain a new accumulated result, which is temporarily stored in the third data register. Finally, once the third data register holds the current accumulated result, the buffer body caches it into the first buffer entry of the buffer body, at which point the accumulation caching step for the first accumulated result ends. After the accumulation caching step for one accumulated result is completed, the data accumulation and unloading submodule continues with the next accumulated result using the same steps.
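The register-level sequence above can be traced with a minimal Python sketch. The class name `AccumulateDatapath` and its method `step` are hypothetical. The text does not say whether the third data register is cleared between buffer entries, so the sketch follows the described steps literally and simply keeps whatever value the register holds; that behaviour is an assumption.

```python
class AccumulateDatapath:
    """Three data registers, an adder, and a buffer body with N entries."""

    def __init__(self, num_entries):
        self.r1 = 0  # first data register: incoming multiply-add result
        self.r2 = 0  # second data register: copy of the third data register
        self.r3 = 0  # third data register: newest accumulated result
        self.entries = [0] * num_entries  # buffer body

    def step(self, multiply_add_result, entry_index):
        """Perform one accumulation caching step for one buffer entry."""
        self.r1 = multiply_add_result         # first register latches the input
        self.r2 = self.r3                     # second register copies the third
        new_sum = self.r1 + self.r2           # adder
        self.r3 = new_sum                     # third register stores the sum
        self.entries[entry_index] = self.r3   # buffer body caches the entry
        return new_sum
```

Starting from the initial state, `step(5, 0)` stores 5 in the first buffer entry, because the second data register reads the initial value 0 from the third data register.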
[0056] After a certain period of time, for example when the second buffer entry has completed caching its accumulated result (which can be detected when the third data register acquires and temporarily stores the third accumulated result), the result processing unit begins to unload the accumulated result already cached in the first buffer entry: first, the result extraction subunit acquires and temporarily stores the accumulated result in the corresponding buffer entry; then, the result unload and write-zero subunit unloads that buffer entry and writes 0 to it. Thereafter, each time the buffer body caches another accumulated result (which can be detected when the third data register acquires and temporarily stores the next accumulated result), the result processing unit unloads one accumulated result in the order of the buffer entries.
[0057] After a period of time, the accumulator buffer control logic sends a new control signal to the control register. Upon receiving the new control signal, the data accumulation and unloading submodule first stops the accumulation caching operation and waits until the accumulated results in the remaining buffer entries have been unloaded. Only after all the accumulated results in the buffer body have been unloaded does the data accumulation and unloading submodule restart the accumulation caching operation.
[0058] This embodiment uses the first data register, the second data register, the adder, the third data register, the buffer body, and the result processing unit in conjunction to enable the data accumulation and unloading submodule to continuously accumulate and cache the multiplication and addition results transmitted from the matrix multiplication acceleration unit, thereby achieving high working efficiency of the accumulator buffer.
[0059] Furthermore, the data accumulation and unloading system also includes a result write-back module, which is electrically connected to the result extraction subunit and the local data memory, and is used to write the accumulation result temporarily stored in the result extraction subunit into the local data memory.
[0060] The input of the result write-back module is electrically connected to the result extraction subunit in the data accumulation and unloading submodule of every accumulator buffer module, and the output of the result write-back module is connected to the local data memory. The result write-back module can thus write the accumulated results obtained by the result extraction subunits into the local data memory.
[0061] Furthermore, the result extraction subunit includes a first data storage area and a second data storage area. When the first data storage area is in a first working state, the second data storage area is in a second working state; when the first data storage area is in the second working state, the second data storage area is in the first working state. The first working state involves acquiring and temporarily storing the accumulated result in the buffer body, while the second working state involves writing the accumulated result into the local data memory through the result write-back module. The working states of the first data storage area and the second data storage area are always different.
[0062] Initially, neither the first nor the second data storage area contains any data. The control signals issued by the accumulator buffer control logic may also include a working state determination signal, and the initial assignment of working states to the first and second data storage areas can be arbitrary. For example, the first control signal might place the first data storage area in the second working state and the second data storage area in the first working state. Each subsequent working state determination signal then assigns the opposite of the previous states. For example, the second control signal might place the first data storage area in the first working state and the second data storage area in the second working state.
[0063] In this embodiment, the result extraction subunit can first obtain and temporarily store the accumulated results in the buffer body through the first data storage area. When the control register receives a control signal, the data accumulation and unloading submodule first stops the accumulation and caching operation of the accumulated results and waits for the buffer body to complete the unloading of the accumulated results of the remaining buffer entries. After all the accumulated results in the buffer body have been unloaded, the data accumulation and unloading submodule reverses the working state of the first data storage area and the second data storage area. Then, the data accumulation and unloading submodule re-performs the accumulation, caching, extraction, and unloading operation of the accumulated results in the buffer body, and at this time, obtains and temporarily stores the accumulated results in the buffer body through the second data storage area (while the result write-back module writes the accumulated results in the first data storage area into the local data memory).
[0064] The result extraction subunit can simultaneously store the accumulated results that need to be unloaded from the buffer body in the first data storage area (or the second data storage area), and simultaneously write the previous round's accumulated results stored in the second data storage area (or the first data storage area) into the local data memory through the result write-back module. After the write is completed, the second data storage area (or the first data storage area) is empty, so that the next round of accumulated results can be stored. The setting of the first data storage area and the second data storage area allows the accumulator buffer module to perform the accumulation caching operation and the accumulation result write-back operation simultaneously (unlike the existing technology, it does not need to wait for the accumulated results to be written to the local data memory before the accumulation caching operation of the next round of multiply-accumulate results can be performed), thus further improving the working efficiency of the accumulator buffer.
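The alternating use of the two data storage areas is a form of double buffering, which can be sketched in Python as follows. The names `ResultExtractionSubunit`, `extract`, `swap` and `write_back` are hypothetical, the local data memory is modelled as a plain list, and starting with the first area in the first working state is an arbitrary choice made for this sketch.

```python
class ResultExtractionSubunit:
    """Two storage areas that always hold opposite working states."""

    def __init__(self):
        self.areas = [[], []]
        self.filling = 0  # index of the area in the first working state

    def extract(self, accumulated_result):
        """First working state: store a result taken from the buffer body."""
        self.areas[self.filling].append(accumulated_result)

    def swap(self):
        """Reverse the working states of the two areas between rounds."""
        self.filling ^= 1

    def write_back(self, local_data_memory):
        """Second working state: drain the other area into local memory."""
        draining = self.filling ^ 1
        local_data_memory.extend(self.areas[draining])
        self.areas[draining].clear()  # area is empty for the next round
```

In a typical sequence, one round is extracted into one area, `swap()` is called, and the next round is extracted into the other area while `write_back()` drains the previous round into the local data memory.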
[0065] Example 2: A data accumulation and unloading method for an accumulator buffer, using the data accumulation and unloading system in Example 1, specifically including the following steps.
[0066] S1: The accumulator buffer control logic issues a control signal.
[0067] S2: The control register receives and temporarily stores the control signal issued by the accumulator buffer control logic.
[0068] The control register is electrically connected to the accumulator buffer control logic and is used to receive and temporarily store control signals issued by the accumulator buffer control logic. The control signals may include an accumulation result start signal and an accumulation result unload signal.
[0069] S3: According to the accumulation start signal in the control signal, the buffer body caches the accumulated results in the order of the buffer entries.
[0070] S3 specifically includes the following steps:
S31: The first data register acquires and temporarily stores the multiply-add result output by the operation unit at the output of the matrix multiplication acceleration unit.
S32: The second data register acquires and temporarily stores the accumulated result in the third data register.
S33: The adder adds the multiply-add result in the first data register to the accumulated result in the second data register to obtain a new accumulated result.
S34: The third data register acquires and temporarily stores the new accumulated result.
S35: After the third data register has acquired and temporarily stored the new accumulated result, the buffer body caches the accumulated result held in the third data register into a buffer entry of the buffer body.
[0071] Initially, the buffer body has no cached data. When the accumulator buffer control logic issues a control signal for the first time, the control register receives and temporarily stores the control signal. The buffer body then caches the accumulated results according to the accumulation start signal in the control signal, in the order of the buffer entry addresses in the buffer body.
[0072] Specifically, after a control signal is received, the first data register acquires and temporarily stores the multiply-add result output by the connected operation unit, while the second data register acquires and temporarily stores the initial accumulated result (i.e., 0) held in the third data register. The adder then adds the multiply-add result in the first data register to the accumulated result in the second data register to obtain a new accumulated result, which is temporarily stored in the third data register. Finally, once the third data register holds the current accumulated result, the buffer body caches it into the first buffer entry of the buffer body, at which point the accumulation caching step for the first accumulated result ends. After the accumulation caching step for one accumulated result is completed, the data accumulation and unloading submodule continues with the next accumulated result using the same steps.
[0073] S4: According to the accumulation result unload signal in the control signal, the result processing unit unloads the accumulated results cached in the buffer body in the order of the buffer entries.
[0074] S4 specifically includes the following steps:
S41: The result extraction subunit acquires and temporarily stores the accumulated result in the buffer entry.
S42: After the result extraction subunit completes the extraction of the accumulated result, the result unload and write-zero subunit unloads the accumulated result in the corresponding buffer entry and writes 0 to it.
[0075] After a certain period of time, when the nth buffer entry has completed caching its accumulated result, the result processing unit starts unloading the cached accumulated results, beginning with the first buffer entry, according to the accumulation result unload signal in the control signal. Thereafter, each time the buffer body caches another accumulated result, the result processing unit unloads one accumulated result in the order of the buffer entries.
[0076] Specifically, after a certain period of time, for example when the second buffer entry has completed caching its accumulated result (which can be detected when the third data register acquires and temporarily stores the third accumulated result), the result processing unit begins to unload the accumulated result cached in the first buffer entry: first, the result extraction subunit acquires and temporarily stores the accumulated result in the corresponding buffer entry; then, the result unload and write-zero subunit unloads that buffer entry and writes 0 to it. Thereafter, each time the buffer body caches another accumulated result (which can be detected when the third data register acquires and temporarily stores the next accumulated result), the result processing unit unloads one accumulated result in the order of the buffer entries.
[0077] After the accumulator buffer control logic issues a second control signal, the control register receives and temporarily stores it. Upon receiving the accumulation start signal in the second control signal, the buffer body first stops caching accumulated results and waits until the accumulated results in the remaining buffer entries have been unloaded. Only after all the accumulated results in the buffer body have been unloaded does the data accumulation and unloading submodule restart the next round of accumulation and caching.
[0078] In this embodiment, the accumulator buffer module can cache the accumulation results through the buffer body while simultaneously unloading the cached accumulation results from the buffer body through the result processing unit. Thus, when the next control signal is received, the data accumulation and unloading submodule only needs to wait a short time to re-cache the accumulation results (unlike existing technologies that require waiting for the entire unloading process), thereby effectively improving the working efficiency of the accumulator buffer in this embodiment.
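The reduction in waiting time can be estimated with a simple model. It assumes that unloading one buffer entry takes a fixed time $t_u$ and that, once started, unloading keeps pace with caching (one entry unloaded per entry cached). The symbols $N$ (number of buffer entries filled in a round), $n$ (the entry whose caching triggers the first unload) and $t_u$ are introduced here for illustration and do not appear in the original text. Write-back time to the local data memory is ignored.

```latex
W_{\text{serial}} = N \, t_u, \qquad
W_{\text{overlap}} = (n - 1) \, t_u \quad (1 \le n \le N), \qquad
W_{\text{serial}} - W_{\text{overlap}} = (N - n + 1) \, t_u .
```

For instance, under these assumptions a round of $N = 16$ entries with $n = 3$ would leave a wait of $2\,t_u$ instead of $16\,t_u$ when the next control signal arrives.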
[0079] S5: The result write-back module writes the accumulated result temporarily stored in the result extraction subunit into the local data memory.
[0080] Specifically, S5 includes writing the accumulated result in the first data storage area or the second data storage area, which is in the second working state, into the local data memory through the result write-back module.
[0081] The input of the result write-back module is electrically connected to the result extraction subunit in the data accumulation and unloading submodule of every accumulator buffer module, and the output of the result write-back module is connected to the local data memory. The result write-back module can thus write the accumulated results obtained by the result extraction subunits into the local data memory.
[0082] Furthermore, the result extraction subunit includes a first data storage area and a second data storage area. When the first data storage area is in a first working state, the second data storage area is in a second working state; when the first data storage area is in the second working state, the second data storage area is in the first working state. The first working state involves acquiring and temporarily storing the accumulated result in the buffer body, while the second working state involves writing the accumulated result into the local data memory through the result write-back module. The working states of the first data storage area and the second data storage area are always different.
[0083] Initially, neither the first nor the second data storage area contains any data. The accumulator buffer control logic can issue a control signal that randomly determines the initial working states of the first and second data storage areas. The control signals may also include working state determination signals. For example, the first control signal might place the first data storage area in the second working state and the second data storage area in the first working state. Subsequently, the accumulator buffer control logic issues working state determination signals in sequence, each assigning working states opposite to the previous ones. For example, the second control signal would then place the first data storage area in the first working state and the second data storage area in the second working state.
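The alternation of working states described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the names `WorkingState`, `WorkingStateController`, and `on_control_signal` are hypothetical and do not appear in the patent, and the initial state carried by the first control signal is modelled as a plain argument.

```python
from enum import Enum
from typing import Optional, Tuple


class WorkingState(Enum):
    FIRST = 1   # acquire and temporarily store accumulated results from the buffer body
    SECOND = 2  # write stored results to local data memory via the write-back module


def opposite(state: WorkingState) -> WorkingState:
    """Return the other working state."""
    return WorkingState.SECOND if state is WorkingState.FIRST else WorkingState.FIRST


class WorkingStateController:
    """Hypothetical model of how control signals assign states to the two areas."""

    def __init__(self) -> None:
        # Before the first control signal, neither area has a working state.
        self.first_area: Optional[WorkingState] = None
        self.second_area: Optional[WorkingState] = None

    def on_control_signal(
        self, initial_first_area_state: WorkingState = WorkingState.SECOND
    ) -> Tuple[WorkingState, WorkingState]:
        if self.first_area is None:
            # First control signal: it fixes the initial assignment.
            self.first_area = initial_first_area_state
        else:
            # Later control signals: reverse the previous assignment.
            self.first_area = opposite(self.first_area)
        # The two areas are always in different working states.
        self.second_area = opposite(self.first_area)
        return self.first_area, self.second_area
```

Calling `on_control_signal` repeatedly on one controller yields alternating state pairs, so the two areas never share a working state.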
[0084] In this embodiment, the result extraction subunit can first obtain and temporarily store the accumulated results from the buffer body through the first data storage area. When the control register receives a control signal, the data accumulation and unloading submodule first stops the accumulation caching operation and waits for the buffer body to finish unloading the accumulated results of the remaining buffer entries. After all accumulated results in the buffer body have been unloaded, the data accumulation and unloading submodule reverses the working states of the first data storage area and the second data storage area. It then resumes the accumulation caching and extraction unloading operations on the buffer body, now obtaining and temporarily storing the accumulated results through the second data storage area, while the result write-back module writes the accumulated results held in the first data storage area into the local data memory.
[0085] The result extraction subunit can thus store the accumulated results being unloaded from the buffer body in the first data storage area (or the second data storage area) while the result write-back module writes the previous round's accumulated results, held in the second data storage area (or the first data storage area), into the local data memory. Once that write completes, the second data storage area (or the first data storage area) is empty and can store the next round's accumulated results. Providing the first and second data storage areas allows the accumulator buffer module to perform the accumulation caching operation and the write-back operation simultaneously (unlike existing technology, it does not need to wait for the accumulated results to be written to the local data memory before caching the next round's accumulated results), thus further improving the working efficiency of the accumulator buffer.
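The role swap between the two data storage areas resembles a ping-pong (double) buffer. The following is a minimal software sketch of that idea, not the patented hardware: `ResultExtractionSubunit`, `run_rounds`, and the use of Python lists for the storage areas and local data memory are all assumptions made for illustration, and the two activities that overlap in hardware are executed one after the other here.

```python
from typing import List


class ResultExtractionSubunit:
    """Hypothetical model: two storage areas that swap roles on each control signal."""

    def __init__(self) -> None:
        self.areas: List[List[int]] = [[], []]
        self.fill_index = 0  # index of the area currently in the first working state

    def extract(self, accumulated_result: int) -> None:
        # First working state: store a result unloaded from the buffer body.
        self.areas[self.fill_index].append(accumulated_result)

    def swap(self) -> None:
        # Reverse the working states of the two areas.
        self.fill_index = 1 - self.fill_index

    def write_back(self, local_data_memory: List[int]) -> None:
        # Second working state: drain the other area into local data memory,
        # leaving it empty for the next round.
        drain_index = 1 - self.fill_index
        local_data_memory.extend(self.areas[drain_index])
        self.areas[drain_index].clear()


def run_rounds(rounds: List[List[int]]) -> List[int]:
    """Process each round of accumulated results and return local data memory."""
    subunit = ResultExtractionSubunit()
    local_data_memory: List[int] = []
    for results in rounds:
        # In hardware these two activities overlap in time.
        subunit.write_back(local_data_memory)  # previous round's results
        for result in results:
            subunit.extract(result)            # current round's results
        subunit.swap()                         # control signal: reverse the states
    subunit.write_back(local_data_memory)      # flush the final round
    return local_data_memory
```

In this model a round's results never wait for the previous round's write-back to finish before being stored, which is the property the two storage areas are meant to provide.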
[0086] The embodiments described above are merely preferred embodiments of the present invention and are not intended to limit the concept and scope of the present invention. Various modifications and improvements made to the technical solutions of the present invention by those skilled in the art without departing from the design concept of the present invention should fall within the protection scope of the present invention. All technical contents for which protection is sought in this invention have been fully described in the claims.
Claims
1. An accumulator buffered data accumulation offload system, characterized in that it includes an accumulator buffer control logic and multiple accumulator buffer modules, each of the accumulator buffer modules including:
a control register, electrically connected to the accumulator buffer control logic, used to receive and temporarily store control signals issued by the accumulator buffer control logic; and
a data accumulation and unloading submodule, electrically connected to the control register, including:
a buffer body, used to cache the accumulated results in the order of the buffer entries; and
a result processing unit, electrically connected to the buffer body, used to unload the accumulated results cached in the buffer body in the order of the buffer entries;
wherein the data accumulation and unloading submodule also includes:
a first data register, connected one-to-one with an arithmetic unit at the output of the matrix multiplication acceleration unit, used to obtain and temporarily store the multiply-add result output by the corresponding arithmetic unit;
a second data register, electrically connected to a third data register, used to obtain and temporarily store the accumulated result in the third data register;
an adder, electrically connected to the first data register and the second data register, used to add the multiply-add result in the first data register to the accumulated result in the second data register to obtain a new accumulated result; and
the third data register, electrically connected to the adder and the buffer body, used to acquire and temporarily store the new accumulated result;
wherein, after the third data register acquires and temporarily stores the new accumulated result, the buffer body caches the accumulated result in the third data register into a buffer entry of the buffer body.
2. The accumulator buffered data accumulation offload system of claim 1, wherein the result processing unit includes:
a result extraction subunit, used to retrieve and temporarily store the accumulated results in the buffer entries; and
a result unloading and write-0 subunit, electrically connected to the result extraction subunit, used to unload the accumulated result in the corresponding buffer entry and write 0 to that entry after the result extraction subunit has completed the extraction of the accumulated result.
3. The accumulator buffered data accumulation offload system of claim 2, wherein the data accumulation and unloading system also includes a result write-back module, electrically connected to the result extraction subunit and the local data memory, used to write the accumulated result temporarily stored in the result extraction subunit into the local data memory.
4. The accumulator buffered data accumulation offload system of claim 3, wherein: The result extraction subunit includes a first data storage area and a second data storage area. When the first data storage area is in a first working state, the second data storage area is in a second working state; when the first data storage area is in the second working state, the second data storage area is in the first working state. The first working state is to acquire and temporarily store the accumulated result in the buffer body, and the second working state is to write the accumulated result into the local data memory through the result write-back module.
5. An accumulator buffered data accumulation offload method, characterized in that it includes the following steps:
S1: the accumulator buffer control logic issues a control signal;
S2: the control register receives and temporarily stores the control signal issued by the accumulator buffer control logic;
S3: according to the control signal, the buffer body caches the accumulated results in the order of the buffer entries;
S4: according to the control signal, the result processing unit unloads the accumulated results cached in the buffer body in the order of the buffer entries;
wherein S3 specifically includes the following steps:
S31: the first data register acquires and temporarily stores the multiply-add result output by the arithmetic unit at the output of the matrix multiplication acceleration unit;
S32: the second data register retrieves and temporarily stores the accumulated result in the third data register;
S33: the adder adds the multiply-add result in the first data register to the accumulated result in the second data register to obtain a new accumulated result;
S34: the third data register retrieves and temporarily stores the new accumulated result;
S35: after the third data register retrieves and temporarily stores the new accumulated result, the buffer body caches the accumulated result in the third data register into a buffer entry of the buffer body.
6. The method of claim 5, wherein S4 specifically includes the following steps:
S41: the result extraction subunit retrieves and temporarily stores the accumulated result in the buffer entry;
S42: after the result extraction subunit completes the extraction of the accumulated result, the result unloading and write-0 subunit unloads the accumulated result in the corresponding buffer entry and writes 0 to that entry.
7. The method of claim 6, wherein the data accumulation and unloading method further includes the following step:
S5: the result write-back module writes the accumulated result temporarily stored in the result extraction subunit into the local data memory.
8. The method of claim 7, wherein S41 specifically includes obtaining and temporarily storing the accumulated result in the buffer entry through whichever of the first data storage area or the second data storage area is in the first working state; and S5 specifically includes writing, through the result write-back module, the accumulated result held in whichever of the first data storage area or the second data storage area is in the second working state into the local data memory.