A unified systolic array computing unit for ANN and SNN

By using a shift-add computation unit with single-path weight input together with systolic array control, the differing computational characteristics of ANN and SNN are reconciled, achieving efficient hybrid neural network computation, improving operating frequency and area efficiency, and supporting multiple computation modes.

CN122263992A · Pending · Publication Date: 2026-06-23 · GUANGZHOU RES INST OF XIAN UNIV OF ELECTRONIC SCI & TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: GUANGZHOU RES INST OF XIAN UNIV OF ELECTRONIC SCI & TECH
Filing Date: 2026-03-23
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing technologies struggle to simultaneously support the significantly different computational characteristics of ANN and SNN within the same computing unit, and it is difficult to balance the area efficiency and energy efficiency of the computing unit while ensuring the peak computing power of ANN and SNN respectively.

Method used

A shift-add computation unit with single-path weight input is adopted, combined with a soft reset mechanism and systolic array control, to realize the fused computation of ANN and SNN. Through shift-add time-division multiplexing and weight reuse across time steps, bandwidth and power consumption are reduced, and area efficiency and timing convergence are improved.

Benefits of technology

It increases the operating frequency, reduces power consumption and input bandwidth requirements, improves the area efficiency and timing convergence of the computing unit, and supports flexible configuration of multiple computing modes.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a unified systolic array computing unit for ANN and SNN, comprising a shift-add processing unit and a systolic array formed by deploying the shift-add processing units in an array. A mode selection signal terminal switches the unit between artificial neural network (ANN) operation and spiking neural network (SNN) operation. In ANN mode, the multiplication is unrolled in the time dimension, and a shifter and an adder form a time-division-multiplexed operation sequence that completes the multiply-accumulate operation. In SNN mode, the weighted accumulation is completed directly by reusing the addition path, and the time-step dependency chain of the neuron is decoupled through a soft reset mechanism to achieve weight reuse across time steps. Control signals and data propagate in the same direction in the systolic array, which allows the control logic to be reused at array level, and an output-stationary dataflow is adopted so that partial sums preferentially reside inside the array. The operating frequency can thereby be increased, bandwidth and power consumption reduced, and area efficiency and timing convergence improved.

Description

Technical Field

[0001] This invention relates to the field of neural network hardware acceleration, and more particularly to a unified systolic array computing unit for ANN and SNN.

Background Technology

[0002] With the continuous expansion of the scale and application scenarios of deep learning models, Artificial Neural Networks (ANNs) have been widely used in fields such as visual perception, speech recognition, and natural language processing due to their strong ability to represent spatial features. However, ANN inference typically relies on large-scale, continuous, and intensive multiply-accumulate (MAC) operations, which have high requirements for computing power and storage bandwidth, and data migration consumes a significant proportion of energy. This makes it face deployment bottlenecks in scenarios such as edge devices and embedded terminals that are sensitive to power consumption, area, and real-time performance.

[0003] In contrast, Spiking Neural Networks (SNNs) use discrete spike events as information carriers, exhibiting brain-like characteristics such as event-driven processing, natural adaptation to temporal modeling, computational sparsity, and potential high energy efficiency. The core computation of SNNs can be abstracted as spike-triggered weighted accumulation, neuron state updates / threshold comparisons / firing, which significantly reduces invalid computations and memory accesses when spikes are sparse. However, due to limitations such as complex training methods and insufficient model representation and accuracy stability, the accuracy of SNNs on most general tasks is generally still lower than that of mainstream ANNs.

[0004] Due to the complementary nature of ANNs and SNNs in terms of accuracy, energy efficiency, temporal processing, and hardware friendliness, Hybrid Neural Networks (HNNs) have gradually emerged. HNNs, by fusing the high-precision spatial feature extraction capabilities of ANNs with the event-driven temporal processing capabilities of SNNs within the same model, are considered an important technical approach for achieving both high performance and low power consumption. Existing technologies typically use two schemes to be compatible with ANNs and SNNs; however, due to the significant differences in computational characteristics between ANNs and SNNs, existing technologies struggle to achieve an effective balance between computational performance and resource costs, making it difficult to efficiently support both types of computational modes within a unified hardware architecture.

[0005] One approach is to reuse the addition path in the ANN multiply-accumulate unit (MAC) to support the accumulation operation of the SNN. However, the timing performance of this type of structure is constrained by the critical path and clock constraints of the ANN multiplier, which forces the operating frequency of the SNN to be consistent with that of the ANN, making it difficult to take advantage of the high-frequency operation and sparse event-driven computation that the SNN could have utilized.

[0006] Another approach employs an arithmetic-reconfiguration fusion unit with "multiplier elimination," refactoring multiplication operations into shift-add operations to shorten critical path latency, thereby increasing operating frequency and achieving a better trade-off between chip area and energy efficiency. However, this type of approach often maintains throughput and area efficiency by increasing the internal computational parallelism of the PE, which usually comes with higher input bandwidth requirements and therefore greater port and interconnect overhead. Furthermore, when mapped to a systolic array interconnect architecture, data must be propagated between adjacent PEs step by step, and the links between PEs typically rely on pipeline registers for storage and forwarding; increased weight parallelism leads to a corresponding increase in the size of the link pipeline registers, thereby increasing register area and dynamic power consumption.

[0007] Therefore, how to simultaneously support the two significantly different computational characteristics of ANN and SNN within the same computing unit, and to balance the area efficiency and energy efficiency of the computing unit while ensuring the peak computing power of ANN and SNN respectively, is a key technical problem that needs to be solved in the current hybrid neural network acceleration architecture.

Summary of the Invention

[0008] To address the aforementioned issues, this invention provides a unified systolic array computing unit for ANNs and SNNs, which enables single-path weight input, shift-addition time-division multiplexing, soft reset weight reuse, and systolic array control propagation. This improves operating frequency, reduces bandwidth and power consumption, and enhances area efficiency and timing convergence.

[0009] To achieve the above objectives, the technical solution adopted by the present invention is as follows: This invention provides a unified systolic array computing unit for ANNs and SNNs, comprising: a shift-add processing unit with single-path weight input, and a systolic array formed by deploying the shift-add processing units in an array. The shift-add processing unit is equipped with only one weight input interface and with a mode selection signal terminal, which is used to switch between performing artificial neural network operations and spiking neural network operations. In the artificial neural network mode, the multiplication operation is unrolled in the time dimension, and the multiply-accumulate operation is completed by a time-division-multiplexed operation sequence composed of shifters and adders. In the spiking neural network mode, the addition path is directly reused to complete the weighted accumulation, and the time-step dependency chain of the neuron is decoupled through the soft reset mechanism to achieve weight reuse across time steps. In the systolic array, control signals and data propagate in the same direction in a regular pattern, which allows the control logic to be reused at array level, and an output-stationary dataflow is adopted so that partial sums preferentially reside inside the array.

[0010] Preferably, the shift-add processing unit with single-path weight input includes a partial sum reconstruction unit; the partial sum reconstruction unit is composed of multiple isomorphic sub-modules in parallel, its parallel processing scale is configured according to the weight quantization bit width, and it merges, with shifts according to bit weight, the multiple partial sums obtained from the multi-bit weight decomposition.

[0011] Preferably, the shift-add processing unit with single-path weight input is configured with a neuron completion signal terminal; when the neuron completion signal is valid, the remaining partial sum is cleared to zero.

[0012] Preferably, during artificial neural network computation, the multiplication by a weight of a given quantization bit width is decomposed into equivalent partial sum sequences of corresponding length, and the final accumulation result is obtained through shift-add accumulation.

[0013] Preferably, the artificial neural network mode performs three-dimensional sliding window convolution operations, and multiple convolution kernels can be applied in parallel to the same input feature map to generate corresponding output feature maps.

[0014] Preferably, in the spiking neural network mode, the number of time steps for parallel processing is determined by the bit width of the artificial neural network weight quantization, and multiple time steps of input feature maps are processed in parallel within a single computation window.

[0015] Preferably, in the spiking neural network mode, the data input bandwidth is reduced by reusing weights across time steps.

[0016] Preferably, in the spiking neural network mode, input pulse data from multiple time steps are received and accumulated in parallel within a single computation cycle.

[0017] Preferably, different rows of the systolic array can be independently configured as artificial neural network data paths or spiking neural network data paths, supporting a pure artificial neural network computing mode, a pure spiking neural network computing mode, an artificial neural network-spiking neural network fusion computing mode, and a binary neural network computing mode.

[0018] Preferably, the control signals include an operation stage identifier, a valid signal, a clear enable, an accumulation enable, and a mode selection signal; the control signals are forwarded step by step within the systolic array, without using a global broadcast method.

[0019] Compared with the prior art, the beneficial effects of the present invention are as follows: To address the increased on-chip interconnect complexity, worsened wiring congestion, and increased timing convergence pressure caused by the need for parallel input of multiple weights when spatially expanding existing shift-add computation units, as well as the area and power consumption overhead resulting from the requirement to configure multiple sets of weight input registers at the array entry point, this invention utilizes a soft reset mechanism to enable parallelization of multiple time-step computations, improving weight reuse rate and thus reducing power consumption and input bandwidth requirements. With this mechanism, ANNs and SNNs can perform fused computations on a unified shift-add computation unit with a single weight input, further reducing wiring and register resource overhead and improving scalability.

[0020] This invention employs a systolic array as a unified computing carrier, deploying shift-add fusion computing units with single-path weight inputs as arrayed processing elements (PEs). By ensuring that control signals and data propagate in the same direction and in a regular manner within the array, the stage-by-stage transmission and reuse of control information are achieved, thereby reducing the number of independent control units, lowering the large fan-out requirement of global control signals and the complexity of cross-array interconnections, and improving overall area efficiency and timing convergence. Furthermore, the array organization adopts an output-stationary dataflow, allowing partial sums to reside preferentially within the array and minimizing the number of read / write and movement operations at array boundaries, thus reducing the memory access power consumption and interconnect switching power consumption associated with partial sums. In this systolic array, different rows can be independently configured as ANN data paths or SNN data paths, enabling the array to be flexibly configured into SNN computing mode, ANN computing mode, SNN-ANN fusion computing mode, and binary neural network (BNN) computing mode according to task requirements, thereby achieving efficient mapping and operation of multi-paradigm neural networks within a unified hardware framework.

[0021] This invention achieves parallel computation of SNN data across time steps through a soft reset mechanism, improving the weight reuse rate. Under the same implementation conditions, the power consumption of this invention is lower than that of the two existing schemes. Because the unified ANN-SNN computation unit of this invention adopts a shift-add structure and integrates fewer input ports, its maximum operating frequency is higher than that of the two existing schemes under the same implementation conditions. In the same systolic array, compared to the arithmetic-reconfiguration unified computation unit with "multiplier elimination," this invention has fewer intermediate register resources. Compared to computation units with shared MAC data paths, this invention can achieve a higher operating frequency due to the shorter critical path of the shift-add structure. Therefore, under the same target computing power, fewer parallel instances or lower resource configurations can be used, thereby achieving lower resource overhead per unit computing power and further improving area efficiency.

Description of the Drawings

[0022] Figure 1 is a schematic diagram of the shift-add processing unit with single-path weight input of the present invention;
Figure 2 is a schematic diagram of the data flow of the systolic array in artificial neural network mode of the present invention;
Figure 3 is a schematic diagram of the data flow of the systolic array in spiking neural network mode of the present invention.

Detailed Implementation

[0023] To further illustrate the technical means and effects of the present invention in achieving its intended purpose, the following detailed description of the specific implementation methods, structures, features, and effects of the present invention, in conjunction with the accompanying drawings and preferred embodiments, is provided.

[0024] This invention provides a unified systolic array computing unit for ANNs and SNNs, comprising: a shift-add processing unit (PE unit) with single-path weight input, and a systolic array formed by deploying the shift-add processing units in an array. The shift-add processing unit is equipped with only one weight input interface and with a mode selection signal terminal, which is used to switch between performing artificial neural network operations and spiking neural network operations. In the artificial neural network mode, the multiplication operation is unrolled in the time dimension, and the multiply-accumulate operation is completed by a time-division-multiplexed operation sequence composed of shifters and adders. In the spiking neural network mode, the addition path is directly reused to complete the weighted accumulation, and the time-step dependency chain of the neuron is decoupled through the soft reset mechanism to achieve weight reuse across time steps. In the systolic array, control signals and data propagate in the same direction in a regular pattern, which allows the control logic to be reused at array level, and an output-stationary dataflow is adopted so that partial sums preferentially reside inside the array.

[0025] The following is a detailed explanation.

[0026] Because existing arithmetic reconfiguration unified computing units with "multiplier elimination" need to inject multiple weight signals into the array entry in parallel at the same time, this leads to an increase in on-chip interconnect fan-out and a larger wiring scale, which in turn significantly increases the risk of wiring congestion and timing convergence pressure, causing the highest achievable operating frequency (Fmax) of the unified computing unit to decrease. At the same time, multiple weight inputs usually require multiple sets of weight input registers to be set at the systolic array entry for data latching, alignment and buffering, which further increases register resource usage and dynamic power consumption.

[0027] In this embodiment, the aforementioned problem is solved by normalizing multiple weight inputs into a single weight input. In ANN scenarios, the same round of multiply-add operations typically share the same set of weight parameters in the spatial dimension. Even if the traditional multiplier implementation is replaced with a shift-addition decomposition implementation, the weights can still be reused as a unified input within the array, thereby reducing the number of array entry ports and fan-out requirements, and lowering interconnect complexity and wiring pressure.

[0028] In the SNN neuron reset strategy, when a soft reset is used, "integration" and "firing / resetting" are decoupled in time: during the integration phase, membrane potential accumulation is performed for the multiple time steps within a parallel window, without performing a threshold comparison at each time step; after the accumulation within the window is completed, the threshold determination and the equivalent soft reset operations are performed centrally to restore the corresponding spike output. This effectively reduces array-entry interconnect and register overhead, alleviates wiring congestion and timing pressure, and improves overall frequency and energy efficiency.
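The decoupling described above can be illustrated with a small software model. The sketch below is not the patented circuit; it assumes a simple integrate-and-fire neuron whose soft reset subtracts the threshold on firing, and the function name `run_soft_reset` and the sample values are hypothetical. It checks one property that follows from that assumption: the final membrane potential equals the initial potential plus the sum of all inputs in the window, minus the threshold times the number of spikes, so the input accumulation over the window can be computed separately from the firing decisions.

```python
def run_soft_reset(inputs, threshold, v0=0):
    """Step-by-step integrate-and-fire neuron with soft reset (assumed model)."""
    v = v0
    spikes = []
    for current in inputs:
        v += current                 # integrate the weighted input of this time step
        fired = v >= threshold       # threshold comparison
        if fired:
            v -= threshold           # soft reset: subtract threshold, keep the residual
        spikes.append(int(fired))
    return v, spikes


inputs = [3, 5, 1, 6]                # hypothetical per-time-step weighted sums
threshold = 4
v_final, spikes = run_soft_reset(inputs, threshold)

# The window sum of inputs is independent of the firing decisions.
window_sum = sum(inputs)
assert v_final == window_sum - threshold * sum(spikes)
```

A hard reset (forcing the potential to zero on firing) discards the residual, so this identity would not hold for it.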

[0029] Referring to Figure 1, which shows the circuit of a single PE (processing element) unit: the PE unit has only one weight input interface. The Mode_sel signal selects whether the PE unit performs an ANN (artificial neural network) operation or an SNN (spiking neural network) operation. To support pipelined computation, the PE unit is configured with a neuron completion signal terminal. When the neuron completion signal is valid, the remaining partial sum is cleared. That is, after the parallel time-step computation or the current input feature map computation is completed, the Neuron_Done signal is 1 and the remaining partial sum is cleared to zero.

[0030] Specifically, in ANN mode, the PE unit decomposes the multiplication by a weight of quantization bit width Bw into equivalent partial sum sequences of length Bw, and replaces direct multiplication with shift-add accumulation. The PE unit has a built-in partial sum reconstruction unit composed of multiple isomorphic submodules operating in parallel. This unit merges the equivalent partial sum sequences using shift-add reconstruction logic to obtain the final accumulated result. The partial sum reconstruction unit processes the multiple partial sums obtained from the multi-bit weight decomposition, and its parallel processing scale can be configured according to Bw (the structure and the data merging process are shown in Figure 1). In SNN mode, the membrane potential accumulation corresponds to a plain additive accumulation, which does not require shift alignment by bit weight; the PE unit can process the feature map inputs of multiple time steps in parallel within a computation window, and the number of parallel time steps T is determined by the weight quantization bit width Bw.
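As an illustration of the ANN-mode decomposition, the following sketch computes a dot product using only additions and shifts. It is a software model under stated assumptions, not the PE circuit: weights are assumed to be unsigned Bw-bit integers, and the function names `bit_plane_partial_sums` and `reconstruct` are hypothetical. One partial sum is formed per weight bit, and the partial sums are then merged with a left shift equal to the bit position.

```python
def bit_plane_partial_sums(activations, weights, bw):
    """Return one partial sum per weight bit (length-bw sequence).

    Assumes unsigned bw-bit weights. Partial sum b adds the activations
    whose weight has bit b set, so no multiplier is needed.
    """
    sums = []
    for b in range(bw):
        sums.append(sum(x for x, w in zip(activations, weights) if (w >> b) & 1))
    return sums


def reconstruct(partial_sums):
    """Merge partial sums by shifting each one by its bit weight and adding."""
    return sum(p << b for b, p in enumerate(partial_sums))


activations = [3, -2, 5]          # hypothetical inputs
weights = [5, 9, 15]              # hypothetical unsigned 4-bit weights
psums = bit_plane_partial_sums(activations, weights, bw=4)
result = reconstruct(psums)

# The shift-add result matches the ordinary multiply-accumulate.
assert result == sum(x * w for x, w in zip(activations, weights))
```

Signed weights would need an additional convention (for example, a negative weight for the top bit in two's complement), which the source does not specify and which this sketch leaves out.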

[0031] In this invention, the PE unit needs multi-level control over ANN / SNN mode switching, partial sum (psum) accumulation, and write-back in the shift-add computation. If a traditional global control method is used, each PE unit often has to be configured with relatively complete control decoding, state machine, and handshake logic, while control signals are distributed through complex global interconnects. This increases control area overhead, worsens wiring congestion, and makes critical-path timing convergence difficult. To solve these problems, this invention exploits the systolic array property that "control and data propagate in the same direction," propagating critical control information (including but not limited to: computation stage identifiers, valid signals, clear / accumulate enables, and mode selection) through the array in a systolic manner at a fixed rhythm, thereby achieving array-level reuse and simplification of the control logic. Specifically, the regular propagation mechanism of the systolic array provides the following. Control logic sharing and reuse: Most PE units in the array do not need to generate complex control sequences independently. They only need to perform a small amount of decoding and local latching of the control tokens / control fields arriving from upstream in order to trigger and align the operation of the current cycle, thereby reducing the need for redundant control state machines and decoding circuits.

[0032] Reduced global interconnection and fan-out pressure: Control signals are propagated through the array by "step-by-step forwarding," avoiding the large fan-out networks caused by single-point global broadcast, reducing routing congestion and buffer insertion requirements, and improving routability and timing convergence.

[0033] Improved area efficiency and scalability: As the array size expands, the control logic overhead tends to be linear with the array growth and the increment is small, avoiding the control explosion and interconnection bottleneck that occur in centralized control under large-scale arrays, thereby improving the effective computing power per unit area.
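The step-by-step forwarding of control information can be pictured with a small cycle-level model. This is only an illustrative sketch: the one-dimensional chain, the function name `forward_control`, and the token contents are assumptions, not details taken from the source. Each PE latches the token held by its upstream neighbour in the previous cycle, so every control register drives a single downstream register instead of the whole array.

```python
def forward_control(tokens, num_pes):
    """Shift control tokens through a chain of PE-local registers, one hop per cycle.

    Returns a trace in which trace[t][i] is the token held by PE i after cycle t
    (None when the PE holds no valid token).
    """
    regs = [None] * num_pes
    trace = []
    for t in range(len(tokens) + num_pes - 1):
        incoming = tokens[t] if t < len(tokens) else None
        # All registers update at once: PE i takes what PE i-1 held last cycle.
        regs = [incoming] + regs[:-1]
        trace.append(list(regs))
    return trace


# Hypothetical control tokens carrying a mode selection and a clear enable.
tokens = [{"mode_sel": "ANN", "clear": 1}, {"mode_sel": "ANN", "clear": 0}]
trace = forward_control(tokens, num_pes=3)
for cycle, regs in enumerate(trace):
    print(cycle, regs)
```

In this model PE i sees a given token i cycles after PE 0 does, which matches the skew with which data reaches it when data is forwarded in the same direction.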

[0034] Furthermore, the systolic array structure itself possesses high data reuse characteristics, enabling the formation of regular data residence and transport paths within a two-dimensional topology, reducing reliance on long-distance off-chip / on-chip access. Combined with the PE cell design of this invention, the systolic array not only achieves efficient data path reuse but also allows the control path to complete timing unfolding and pattern management in a "fixed rhythm, local transmission" manner, thereby significantly reducing control overhead and improving overall area efficiency while ensuring computational throughput.

[0035] As described above, this invention solves the problem of duplicated logic and interconnect overhead caused by multi-level control in the fused computing architecture by using a systolic array as the carrier for array-level sharing and propagation of control signals. It provides a hardware implementation scheme for unified ANN / SNN computing that offers high area efficiency, easy scaling, and easy timing convergence.

[0036] The following is a detailed description of the specific solution in this embodiment.

[0037] The systolic array designed in this embodiment contains 256 PE units, organized in a 16×16 two-dimensional array. In the sliding-window ANN mode (or for the input spike stream in SNN mode), PE units in the same row share the same input feature map data stream, while PE units in the same column share the weight stream of the same convolution kernel. Data propagates in a pipelined manner within the systolic array; after C×K^2+30 cycles, the partial sum of each PE unit is sent to the neuron module for activation.

[0038] Referring to Figure 2, which shows the overall data flow of the systolic array in ANN mode: the core operator is a three-dimensional sliding window convolution, which applies a convolution kernel of size K×K×C to an input feature map of size M×M×C. N convolution kernels can be applied in parallel to the same input feature map to generate an output feature map of size HO×HO×N.
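For reference, the operator itself can be written as a plain nested loop. The sketch below is a functional model only and says nothing about how the array schedules the work. It assumes stride 1 and no padding, so that HO = M - K + 1; the source does not state the stride or padding, and the function name `conv3d_sliding` is hypothetical.

```python
def conv3d_sliding(fmap, kernels):
    """Three-dimensional sliding window convolution (assumed stride 1, no padding).

    fmap[y][x][c]        : input feature map of size M x M x C
    kernels[n][ky][kx][c]: N kernels of size K x K x C
    Returns out[y][x][n] of size HO x HO x N with HO = M - K + 1.
    """
    m = len(fmap)
    c = len(fmap[0][0])
    k = len(kernels[0])
    ho = m - k + 1
    out = [[[0] * len(kernels) for _ in range(ho)] for _ in range(ho)]
    for n, kern in enumerate(kernels):
        for y in range(ho):
            for x in range(ho):
                acc = 0
                # Each output element accumulates C * K * K products.
                for ky in range(k):
                    for kx in range(k):
                        for ch in range(c):
                            acc += fmap[y + ky][x + kx][ch] * kern[ky][kx][ch]
                out[y][x][n] = acc
    return out


# Hypothetical 3x3x1 input and two 2x2x1 kernels.
fmap = [[[1], [2], [3]], [[4], [5], [6]], [[7], [8], [9]]]
kernels = [
    [[[1], [1]], [[1], [1]]],   # sums the window
    [[[1], [0]], [[0], [0]]],   # picks the top-left element
]
out = conv3d_sliding(fmap, kernels)
```

All N kernels read the same input window, which is the sharing that lets one input stream feed a whole row of PEs.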

[0039] Referring to Figure 3, which shows the overall data flow of the systolic array in SNN mode: the overall process is similar to that of ANN mode, but SNN mode introduces parallelism in the time dimension. Input feature maps from multiple time steps are processed in parallel within a single computation window; that is, input spikes from multiple time steps are computed simultaneously, and the number of parallel time steps S is determined by the ANN weight quantization bit width. The PE unit therefore receives input feature map / spike data from S time steps in each cycle and accumulates them in parallel. By reusing weights across time steps, the data input bandwidth is reduced by a factor of 4.5 compared with existing shift-add fusion computation units. Specifically, the core operator is a three-dimensional sliding window convolution that applies a K×K×C convolution kernel to an M×M×C input feature map. M convolution kernels can be applied in parallel to the same input feature map. At the same time, the computation for the S time steps is unrolled in parallel within the PE unit and accumulated along the time dimension, resulting in a membrane potential accumulation result of size N×M.
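The effect of reusing a weight across time steps can be sketched as follows. This is a simplified software model, not the PE datapath: spikes are assumed to be binary, the function name `snn_window_accumulate` and the read counter are hypothetical, and the 4.5 times figure quoted above is not derived here. The model fetches each weight once per window and applies it to the spikes of all S time steps.

```python
def snn_window_accumulate(spikes, weights):
    """Accumulate weighted spikes for S time steps while fetching each weight once.

    spikes[t][i] is 0 or 1 for time step t and input i; weights[i] is shared
    by all time steps. Returns (per-step sums, window total, weight reads).
    """
    num_steps = len(spikes)
    per_step = [0] * num_steps
    weight_reads = 0
    for i, w in enumerate(weights):
        weight_reads += 1                 # one fetch serves every time step
        for t in range(num_steps):
            if spikes[t][i]:
                per_step[t] += w          # spike-gated addition, no multiplier
    return per_step, sum(per_step), weight_reads


# Hypothetical window of S = 4 time steps over three inputs.
spikes = [[1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 0, 0]]
weights = [2, -1, 3]
per_step, total, reads = snn_window_accumulate(spikes, weights)

# Processing the time steps one by one would fetch every weight S times.
assert reads * len(spikes) == len(weights) * len(spikes)
```

In this model the number of weight fetches per window is the number of inputs, independent of S, whereas a step-by-step schedule without reuse would need S times as many.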

[0040] Furthermore, the accelerator proposed in this embodiment supports multiple reconfigurable modes: (1) Full ANN mode: All PE units are configured as shift-add paths, and the partial sums are fed into the ANN neuron module for subsequent processing; (2) Full SNN mode: All PE units are configured as non-shift (pure accumulation) paths, and the partial sums are fed into the SNN neuron module to complete spike firing; (3) Hybrid mode: PE units are configured by row. Some rows enable the shift-add path to perform ANN computation, while other rows enable the pure accumulation path to perform SNN computation. The partial sums are then sent to the corresponding neuron modules to achieve ANN-SNN collaborative / hybrid inference.

[0041] The overall process of this embodiment is as follows: In this embodiment, the unified ANN-SNN systolic array includes multiple processing units (PEs) arranged in a two-dimensional array. Before the array runs, the operating mode of each PE unit is configured using the control signal Mode_sel according to the target network type and hierarchical task, enabling the systolic array to operate in full ANN mode, full SNN mode, or a hybrid ANN-SNN mode.

[0042] In the hybrid mode, the systolic array is functionally divided by row. Some rows of PE units are configured as ANN computation paths, while other rows of PE units are configured as SNN computation paths, thereby enabling collaborative computation of ANN and SNN in the same hardware array.

[0043] After mode configuration is completed, input data and weight data are input from the systolic array boundary. Specifically, input feature map data or input spike streams are transmitted sequentially along the array rows, while convolutional kernel weights are transmitted sequentially along the array columns. Upon receiving the corresponding row and column data, each PE unit performs calculations for the current clock cycle based on the received data. Simultaneously, it stores the row and column data in its local register and forwards them to adjacent PE units in the next clock cycle, thereby achieving systolic propagation and multiplexing of data within the array.
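The propagation described above can be modelled cycle by cycle. The sketch below is an illustrative output-stationary systolic model, not the patented design: the input skew at the array boundary is an assumption (the source does not describe it), the function name `systolic_matmul` is hypothetical, and an ordinary multiplication stands in for the PE's shift-add or spike-gated addition. Each PE keeps its own accumulator, reads one value from its left neighbour and one from its upper neighbour, and forwards both in the next cycle.

```python
def systolic_matmul(a, b):
    """Cycle-level model of an output-stationary systolic array.

    a: R x L matrix; row r is streamed into array row r from the left.
    b: L x Q matrix; column q is streamed into array column q from the top.
    Returns the R x Q accumulators, equal to the matrix product of a and b.
    """
    rows, depth, cols = len(a), len(b), len(b[0])
    acc = [[0] * cols for _ in range(rows)]
    a_reg = [[0] * cols for _ in range(rows)]   # value each PE forwards to the right
    b_reg = [[0] * cols for _ in range(rows)]   # value each PE forwards downward
    for t in range(depth + rows + cols - 2):
        new_a = [[0] * cols for _ in range(rows)]
        new_b = [[0] * cols for _ in range(rows)]
        for i in range(rows):
            for j in range(cols):
                # Boundary PEs read the skewed input stream (assumed skew of
                # one cycle per row or column); inner PEs read a neighbour.
                if j == 0:
                    k = t - i
                    a_in = a[i][k] if 0 <= k < depth else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    k = t - j
                    b_in = b[k][j] if 0 <= k < depth else 0
                else:
                    b_in = b_reg[i - 1][j]
                acc[i][j] += a_in * b_in        # partial sum stays in the PE
                new_a[i][j] = a_in              # latch locally, forward next cycle
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Only the array boundary is fed from outside; every other operand arrives from a neighbouring register, and the partial sums never leave the PEs until the computation finishes.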

[0044] In ANN mode, the PE unit performs partial sum merging via the shift-add path. Taking 4-bit weight quantization as an example, each submodule receives four equivalent time-step partial sums P0, P1, P2, and P3 during a merging process and performs shift merging according to bit weights. The submodule first executes: Q0 = P0 + (P1 << 1), Q1 = P2 + (P3 << 1); then executes: R0 = Q0 + (Q1 << 2). Here, "<<" indicates a left shift operation. The resulting R0 is further accumulated with the current partial sum, and the accumulated result is stored in an accumulator register for continued use in subsequent calculations. When the current calculation task is completed, the Neuron_Done signal is set, and the remaining partial sum in the accumulator register is cleared.

[0045] In SNN mode, the PE unit performs the partial sum merging operation corresponding to the membrane potential through a non-shift accumulation path. Since the membrane potential update process corresponds to the direct addition and accumulation of multiple time step inputs, there is no need for weighted shift alignment. Taking the partial sums P0, P1, P2, and P3 of four equivalent time steps as an example, the submodule first executes: Q0 = P0 + P1, Q1 = P2 + P3, and then executes: R0 = Q0 + Q1. The resulting R0 is further accumulated with the current partial sum, and the accumulation result is stored in the accumulation register for subsequent calculations. When the current calculation task is completed, the Neuron_Done signal is set, and the remaining partial sum in the accumulation register is cleared.
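The two merging sequences given for ANN mode and SNN mode differ only in their shift amounts, which can be seen in a short sketch. This is a software illustration, not the circuit: encoding the mode selection as the strings "ANN" and "SNN" is an assumption, and the function name `merge_partial_sums` and the sample values are hypothetical.

```python
def merge_partial_sums(p0, p1, p2, p3, mode_sel):
    """Two-level merge of four partial sums on a shared adder tree.

    mode_sel == "ANN": shifts follow the bit weights (1, then 2).
    mode_sel == "SNN": shifts are zero, giving a plain sum over time steps.
    """
    shift1, shift2 = (1, 2) if mode_sel == "ANN" else (0, 0)
    q0 = p0 + (p1 << shift1)
    q1 = p2 + (p3 << shift1)
    return q0 + (q1 << shift2)


p = (6, 5, 8, 3)   # hypothetical partial sums P0..P3

# ANN mode: R0 = P0 + 2*P1 + 4*P2 + 8*P3
assert merge_partial_sums(*p, mode_sel="ANN") == 6 + 2 * 5 + 4 * 8 + 8 * 3

# SNN mode: R0 = P0 + P1 + P2 + P3
assert merge_partial_sums(*p, mode_sel="SNN") == sum(p)
```

Because the adders are the same in both modes, the mode selection only has to control whether the shifts are applied.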

[0046] After mode configuration is completed and the input and weight data have been injected, the systolic array begins the convolution calculation. C×K^2+30 clock cycles after the start of calculation, each PE unit outputs its corresponding partial sum, which is then fed into the subsequent neuron module. Specifically, in ANN mode, the partial sum is used by the ANN neuron module for activation output; in SNN mode, the partial sum is used by the SNN neuron module for membrane potential update, threshold determination, and spike firing; in hybrid mode, the partial sums output by PE units in different rows are fed into the corresponding neuron modules, thus achieving unified collaborative computation between ANN and SNN.

[0047] In another aspect, the present invention provides a unified systolic array computation method for ANN and SNN, applied to the aforementioned unified systolic array computing unit for ANN and SNN, comprising the following steps: A shift-add processing unit with single-path weight input is constructed, and the artificial neural network operation mode and the spiking neural network operation mode are switched through a mode selection signal. During artificial neural network computation, multiplication operations are unrolled in the time dimension, and the multiply-accumulate operation is completed using a time-division-multiplexed sequence composed of shifters and adders. During spiking neural network computation, the addition path is reused to complete the weighted accumulation, and a soft reset mechanism is used to decouple the time-step dependency chain of neurons, so as to realize the reuse of weights across time steps. The shift-add processing units are deployed in an array as a systolic array, so that control signals and data propagate in the same direction and in a regular manner, thereby realizing array-level reuse of the control logic. By using an output-stationary dataflow, partial sums preferentially reside within the array, reducing partial sum memory accesses and data movement.

[0048] To address the shortcomings of the prior art, the present invention aims to provide a unified fusion computing scheme for artificial neural networks and spiking neural networks that balances performance, area efficiency, and array scalability for both ANNs and SNNs within a single hardware framework. The specific objectives include: (1) Reduce the limit that the multiplier critical path places on the frequency of the unified computing unit, and improve area utilization in SNN mode: existing hybrid neural network accelerators reuse the MAC unit of the ANN to support the spike accumulation operation of the SNN. This maps the SNN accumulation onto the ANN multiply-accumulate data path, so the critical timing path in SNN mode inevitably runs through the multiplier and its associated data path. Consequently, the maximum operating frequency of the SNN is "locked" at that of the ANN, making it difficult to realize the peak computing power that event-driven, sparse SNN computation could reach at higher frequencies. At the same time, SNN operations are mainly additions and comparisons, with very little demand for multiplication. In the reused structure, the multiplier remains idle or underutilized for long periods in SNN mode, resulting in low hardware area utilization and insufficient compute density, and making it difficult to balance performance against resource cost.

[0049] This invention achieves ANN multiplication by replacing traditional multipliers with shift-addition operations, making the critical path of the computation unit consist of shifters and adders. This shortens the critical path, increases the maximum operating frequency, avoids the SNN frequency locking problem caused by reusing multipliers, and reduces the area waste caused by idle multipliers in SNN mode.
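The idea of replacing a multiplier with a time-unfolded shift-add sequence can be sketched in software as follows. This is a minimal behavioural model, not the circuit of the invention: it assumes unsigned weights, treats one loop iteration as one clock cycle, and uses hypothetical function names.

```python
def shift_add_multiply(activation: int, weight: int, weight_bits: int = 8) -> int:
    """Multiply by an unsigned weight using only shifts and additions.

    Each loop iteration models one cycle of the time-division multiplexed
    sequence: the weight is examined one bit at a time, and the activation
    shifted by that bit position is added when the bit is set.
    """
    if not 0 <= weight < (1 << weight_bits):
        raise ValueError("weight does not fit in weight_bits (unsigned assumed)")
    acc = 0
    for bit in range(weight_bits):
        if (weight >> bit) & 1:
            acc += activation << bit  # shifter followed by adder
    return acc


def shift_add_mac(activations, weights, weight_bits: int = 8) -> int:
    """Multiply-accumulate built only from the shift-add primitive above."""
    acc = 0
    for a, w in zip(activations, weights):
        acc += shift_add_multiply(a, w, weight_bits)
    return acc


if __name__ == "__main__":
    print(shift_add_multiply(13, 11))            # same value as 13 * 11
    print(shift_add_mac([1, 2, 3], [4, 5, 6]))   # same value as 1*4 + 2*5 + 3*6
```

In this model the longest combinational step is one shift and one addition, which is the property the paragraph above relies on for a shorter critical path; the cost is that a multiplication takes up to `weight_bits` cycles.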

[0050] (2) Reduce the number of weight inputs required by spatially parallel shift-add structures, thereby reducing interconnect complexity and input register overhead: existing techniques replace the multiplier with shift-add logic to handle both the multiplications of ANNs and the additions of SNNs. However, to maintain throughput in ANN scenarios, the spatially parallel expansion often requires multiple weight data paths to be fed into the computing unit in parallel within the same cycle. This significantly increases on-chip interconnect fan-out and routing complexity, raises the risk of routing congestion and the pressure on timing closure, and may lower the maximum operating frequency. Furthermore, multiple weight inputs typically require multiple sets of weight input registers at the array ingress or at the computing unit input for latching and alignment, increasing register resources and power consumption. This makes it difficult to achieve high area efficiency and high scalability when such unified computing units are deployed in an array.

[0051] By improving the structure and data organization, the present invention enables the unified computing unit to perform bit-unfolded computation in ANN scenarios by multiplexing a single, unified weight input. This reduces the weight port bit width and fan-out required in array deployment, thereby reducing routing congestion and timing pressure, reducing the number of weight input register groups at the array ingress and their power consumption, and improving area efficiency and scalability.

[0052] (3) Provide a unified fusion computing architecture suited to systolic array deployment, improving overall timing closure and area efficiency: the present invention deploys the above unified computing unit as the processing element of a systolic array, so that data and control propagate regularly through the array and the control logic is reused, reducing global fan-out and interconnect complexity and improving the scalability and engineering feasibility of the overall structure.

[0053] In summary, the present invention proposes a unified systolic array computing unit architecture for both ANNs and SNNs. The architecture unfolds the multiplication operation of ANNs along the time dimension, replacing the traditional multiplier with a time-division multiplexed operation sequence composed of shifters and adders. This significantly shortens the critical path, reduces timing pressure, and increases the maximum operating frequency of the computing unit. The accumulation operation of SNNs is completed directly on the same addition path, with no need for a complex multiplier structure. A soft reset mechanism decouples the time-step dependency chain of the neurons, enabling cross-time-step weight reuse in SNNs; the higher weight reuse rate reduces power consumption and input bandwidth requirements. Compared with the most advanced existing shift-add fusion computing unit, the data input bandwidth of the processing element (PE) is reduced by a factor of 4.5.
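One possible reading of cross-time-step weight reuse with soft reset is sketched below. It assumes binary input spikes, integer weights, a neuron without leak, and a soft reset that subtracts the threshold on firing; all function names are hypothetical. The weighted accumulation for every time step is produced in a single pass over the weights, and the neuron update is applied afterwards.

```python
def input_currents(spikes_per_step, weights):
    """Weighted accumulation for all time steps in one pass over the weights.

    spikes_per_step[t][i] is the binary input spike of synapse i at step t.
    Each weight is read once and reused for every time step; only additions
    are needed because the inputs are 0 or 1.
    """
    currents = [0] * len(spikes_per_step)
    for i, w in enumerate(weights):                  # single weight fetch
        for t, spikes in enumerate(spikes_per_step):
            if spikes[i]:
                currents[t] += w                     # accumulate, no multiply
    return currents


def soft_reset_neuron(currents, threshold):
    """Integrate-and-fire update with soft reset (subtract the threshold)."""
    v = 0
    out = []
    for i_t in currents:
        v += i_t
        if v >= threshold:
            out.append(1)
            v -= threshold   # soft reset keeps the residual potential
        else:
            out.append(0)
    return out, v


if __name__ == "__main__":
    spikes = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]   # three time steps, three synapses
    currents = input_currents(spikes, [2, 3, 4])
    out, v = soft_reset_neuron(currents, threshold=5)
    print(currents, out, v)
```

Because the reset is a subtraction, the final membrane potential in this model equals the sum of all input currents minus the threshold times the number of output spikes, so the accumulation over time steps stays a plain sum that can be computed independently of the firing decisions.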

[0054] The above embodiments are merely descriptions of preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Various modifications and improvements made by those skilled in the art to the technical solutions of the present invention without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.

Claims

1. A unified systolic array computing unit for ANN and SNN, characterized in that it comprises: a shift-add computation processing unit with single-path weight input, and a systolic array formed by deploying the shift-add computation processing unit in an array; the shift-add computation processing unit has only one weight input interface and a mode selection signal terminal, the mode selection signal terminal being used to switch between artificial neural network operation and spiking neural network operation; in the artificial neural network mode, the multiplication operation is unfolded in the time dimension, and the multiply-accumulate operation is completed by a time-division multiplexed operation sequence composed of shifters and adders; in the spiking neural network mode, the addition path is directly reused to complete the weighted accumulation, and the time-step dependency chain of the neurons is decoupled through a soft reset mechanism to achieve weight reuse across time steps; in the systolic array, control signals and data propagate regularly in the same direction to realize array-level reuse of the control logic, and an output-stationary dataflow is adopted so that partial sums preferentially reside inside the array.

2. The unified systolic array computing unit for ANN and SNN according to claim 1, characterized in that the shift-add computation processing unit with single-path weight input includes a partial-sum reconstruction unit; the partial-sum reconstruction unit is composed of multiple identical sub-modules in parallel, and is used to configure the degree of parallelism according to the weight quantization bit width and to shift the multiple partial sums obtained from the multi-bit weight decomposition by their bit weights and merge them.

3. The unified systolic array computing unit for ANN and SNN according to claim 2, characterized in that the shift-add computation processing unit with single-path weight input has a neuron completion signal terminal; when the neuron completion signal is valid, a zeroing operation is performed on the remaining part.

4. The unified systolic array computing unit for ANN and SNN according to claim 1, characterized in that, during artificial neural network operation, the weight is decomposed according to its quantization bit width into an equivalent partial-sum sequence of corresponding length, and the final accumulated result is obtained through shift-add operations.

5. The unified systolic array computing unit for ANN and SNN according to claim 1, characterized in that the artificial neural network mode performs three-dimensional sliding-window convolution, and multiple convolution kernels can be applied in parallel to the same input feature map to generate the corresponding output feature maps.

6. The unified systolic array computing unit for ANN and SNN according to claim 1, characterized in that, in the spiking neural network mode, the number of time steps processed in parallel is determined by the weight quantization bit width of the artificial neural network, so that input feature maps of multiple time steps are processed in parallel within a single computation window.

7. The unified systolic array computing unit for ANN and SNN according to claim 1, characterized in that, in the spiking neural network mode, the data input bandwidth is reduced by reusing weights across time steps.

8. The unified systolic array computing unit for ANN and SNN according to claim 1, characterized in that, in the spiking neural network mode, input spike data from multiple time steps are received and accumulated in parallel within a single computation cycle.

9. The unified systolic array computing unit for ANN and SNN according to claim 1, characterized in that different rows of the systolic array can be independently configured as artificial neural network data paths or spiking neural network data paths; the computing unit supports a pure artificial neural network computing mode, a pure spiking neural network computing mode, an artificial neural network-spiking neural network fusion computing mode, and a binary neural network computing mode.

10. The unified systolic array computing unit for ANN and SNN according to claim 1, characterized in that the control signals include an operation stage identifier, a valid signal, a clear enable, an accumulation enable, and a mode selection signal; the control signals are forwarded stage by stage within the systolic array.