FISTA algorithm hardware acceleration method and device based on single traversal of the matrix
By using a single-pass matrix traversal method, the problem of frequent access to the sensor matrix in the hardware implementation of the FISTA algorithm is solved, which improves the burst transmission efficiency of DDR and the operating efficiency of the hardware system, reduces latency and power consumption, simplifies the design of the control state machine, and realizes efficient sparse signal reconstruction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XIDIAN UNIV
- Filing Date
- 2026-02-28
- Publication Date
- 2026-06-19
AI Technical Summary
The existing FISTA algorithm hardware implementation suffers from the memory wall problem caused by frequent access to the sensing matrix, resulting in system performance bottlenecks. This is especially true in large-scale problems where on-chip SRAM resources are limited, computational unit utilization is low, complex soft threshold operation has high latency, and momentum parameter calculation occupies unnecessary chip area and power consumption.
A single-pass matrix traversal method is adopted to read data blocks continuously row by row from off-chip memory. The residual components are calculated based on the current data block read and temporarily stored on-chip. Gradient descent and soft thresholding are performed using a pipelined architecture, and the auxiliary variable search points are updated by combining a lookup table method, which simplifies the design of the control state machine.
It improves the burst transfer efficiency of DDR, reduces critical path latency, saves logic resources and power consumption, simplifies the design cost of the control state machine, and significantly improves the operating efficiency and throughput of the hardware system.
Figure CN122240961A_ABST
Description
Technical Field
[0001] This invention belongs to the field of hardware acceleration for the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA), and specifically relates to a FISTA hardware acceleration method and apparatus based on a single traversal of the matrix.
Background Technology
[0002] With the rapid development of big data and artificial intelligence technologies, sparse signal processing is playing an increasingly important role in fields such as compressed sensing (CS), wireless communication, medical imaging, and radar signal processing. The core of these applications often boils down to finding a sparse solution of an underdetermined system of linear equations, namely the well-known LASSO (Least Absolute Shrinkage and Selection Operator) problem. Its mathematical form is typically the following unconstrained convex optimization problem:
$$\min_{x}\ \tfrac{1}{2}\lVert y - Ax \rVert_2^2 + \lambda \lVert x \rVert_1$$
[0003] where $y \in \mathbb{C}^{M}$ is the observation vector, $A \in \mathbb{C}^{M \times N}$ is the sensing matrix (or dictionary matrix), $x \in \mathbb{C}^{N}$ is the sparse signal to be recovered, and $\lambda > 0$ is the regularization parameter.
[0004] To solve this problem efficiently, researchers have proposed several algorithms, among which the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) stands out: its fast convergence rate and low per-iteration complexity have made it the preferred algorithm for engineering implementation. The FISTA algorithm consists of three core steps:
1. Gradient calculation: compute the gradient of the data fidelity term.
[0005] 2. Proximal mapping (soft thresholding): nonlinearly shrink the result of the gradient descent step.
[0006] 3. Momentum update: using Nesterov momentum acceleration, form the search point from the auxiliary variable and the acceleration sequence parameter, and use it to update the iterate.
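As a point of reference, the three steps above can be sketched in software as follows. This is a minimal pure-Python sketch of textbook FISTA, not the hardware architecture of the invention. The function names, the symbols `mu` (step size) and `lam` (regularization parameter), the zero starting point, and the threshold `mu * lam` are assumptions taken from the standard algorithm rather than from this document.

```python
import math

def matvec(A, v):
    """Return A v for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def matvec_h(A, r):
    """Return A^H r (conjugate transpose of A times r)."""
    return [sum(A[i][j].conjugate() * r[i] for i in range(len(A)))
            for j in range(len(A[0]))]

def soft_threshold(z, T):
    """Complex soft thresholding: shrink the modulus of z by T."""
    mag = abs(z)
    return 0j if mag <= T else z * (1.0 - T / mag)

def fista(A, y, lam, mu, iters):
    """Textbook FISTA for min 0.5*||y - A x||^2 + lam*||x||_1 (assumed form)."""
    n = len(A[0])
    x_prev = [0j] * n          # previous iterate (assumed zero start)
    v = [0j] * n               # auxiliary-variable search point
    t = 1.0                    # acceleration sequence parameter
    for _ in range(iters):
        # Step 1: gradient of the data fidelity term at the search point
        r = [p - q for p, q in zip(matvec(A, v), y)]
        g = matvec_h(A, r)
        # Step 2: gradient descent followed by soft thresholding
        x = [soft_threshold(vj - mu * gj, mu * lam) for vj, gj in zip(v, g)]
        # Step 3: Nesterov momentum update of the search point
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        beta = (t - 1.0) / t_next
        v = [xj + beta * (xj - xpj) for xj, xpj in zip(x, x_prev)]
        x_prev, t = x, t_next
    return x_prev
```

With an identity sensing matrix the LASSO solution is simply the soft-thresholded observation, which makes a convenient sanity check for the sketch.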
[0007] In practical hardware implementations (such as FPGAs or ASICs), the dimensions of the sensing matrix are typically very large, so it cannot be fully stored in the limited on-chip storage resources and must be stored in larger off-chip memory (such as DDR). As a result, the frequent accesses to the sensing matrix during algorithm execution have become a major bottleneck limiting system performance, a problem known in computer architecture as the "memory wall".
[0008] Although existing hardware implementations of the FISTA algorithm have solved the problem of sparse signal reconstruction to some extent, they still have the following significant shortcomings in terms of architecture design and processing efficiency: Current technologies generally employ a step-by-step, serial processing approach when calculating gradient terms. This means that in each iteration, the massive sensing matrix must be loaded from off-chip memory twice. In data-intensive applications, I / O bandwidth is often the biggest bottleneck of the system; doubling the access volume directly halves the algorithm iteration speed, severely limiting the overall system throughput.
[0009] In the second step of the gradient calculation in existing technologies (multiplication by the transposed matrix), burst transfers from memory are frequently interrupted and a large row activation/precharge overhead is incurred, so the actual effective bandwidth is far lower than the theoretical peak and the calculation becomes extremely time-consuming.
[0010] To alleviate the aforementioned memory access issues, some existing technologies attempt to cache the entire sensing matrix on-chip. However, for large-scale problems, on-chip static random-access memory (SRAM) resources are extremely limited and expensive, making it impossible to accommodate large matrices. Without caching, computing units remain idle for extended periods while waiting for data to load, resulting in low utilization of expensive digital signal processor (DSP) resources.
[0011] Existing technologies for implementing complex soft thresholding operations rely on standard square root and division units. Dividers in hardware implementations are typically based on iterative logic, resulting in extremely high latency and difficulty in achieving a high-throughput, fully pipelined design. This not only blocks data flow but also often becomes a critical path in system timing, limiting the overall clock speed increase of the hardware accelerator.
[0012] Furthermore, existing technologies calculate the dynamically changing momentum parameters in real time during each iteration, requiring dedicated floating-point square root and division circuits. This logic is used only to update a single scalar parameter, so its utilization is extremely low while it consumes chip area and power unnecessarily.
Summary of the Invention
[0013] To address the aforementioned problems in the prior art, this invention provides a hardware acceleration method and apparatus for the FISTA algorithm based on a single traversal matrix.
[0014] The technical problem to be solved by this invention is achieved through the following technical solution. In a first aspect, the present invention provides a FISTA hardware acceleration method based on a single traversal of the matrix, comprising: reading data blocks of the sensing matrix sequentially, row by row, from off-chip memory, calculating residual components based on the current data block, and loading the current data block into on-chip temporary storage; performing a gradient accumulation operation based on the residual components and the temporarily stored current data block to update the gradient vector, while reading the next data block from the off-chip memory and temporarily storing it; taking the temporarily stored next data block as the current data block and returning to the step of calculating the residual components, until all data blocks of the sensing matrix have been read and the complete gradient vector is obtained; performing gradient descent using the complete gradient vector to generate the intermediate variable of the sparse recovery signal, and performing soft thresholding on that intermediate variable using a pipelined architecture to generate the thresholded reconstructed signal; and updating the auxiliary-variable search point based on the reconstructed signal and a combination coefficient obtained by table lookup, calculating the change in the reconstructed signal in parallel, and deciding whether to terminate the iteration by comparing the change with a preset threshold or by checking whether the iteration count has reached the maximum number of iterations.
[0015] In a second aspect, the present invention provides a FISTA hardware acceleration device based on a single traversal of the matrix, comprising: a read module, configured to continuously read data blocks of the sensing matrix row by row from off-chip memory; a residual calculation unit, configured to calculate the residual components based on the current data block and to load the current data block into on-chip temporary storage; a gradient update unit, configured to perform a gradient accumulation operation based on the residual components and the temporarily stored current data block to update the gradient vector while the next data block is read from the off-chip memory and temporarily stored, and further configured to take the temporarily stored next data block as the current data block and return to the calculation of the residual components, until all data blocks of the sensing matrix have been read and the complete gradient vector is obtained; a soft thresholding module, configured to perform gradient descent using the complete gradient vector to generate the intermediate variable of the sparse recovery signal, and to perform soft thresholding on that intermediate variable using a pipelined architecture to generate the thresholded reconstructed signal; and a convergence judgment module, configured to update the auxiliary-variable search point based on the reconstructed signal and a combination coefficient obtained by table lookup, to calculate the change in the reconstructed signal in parallel, and to decide whether to terminate the iteration by comparing the change with a preset threshold or by checking whether the iteration count has reached the maximum number of iterations.
[0016] In a third aspect, the present invention provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; the memory is used to store a computer program; and the processor, when executing the computer program stored in the memory, implements the steps of any of the above FISTA hardware acceleration methods based on a single traversal of the matrix.
[0017] In a fourth aspect, the present invention provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of any of the above FISTA hardware acceleration methods based on a single traversal of the matrix.
[0018] This invention provides a hardware acceleration method for the FISTA algorithm based on a single matrix traversal. It abandons the inefficient matrix transpose access method and maintains continuous burst readings of the sensing matrix row by row. Whether in the residual calculation stage or the gradient update stage, the data flow strictly follows the physical storage order of the memory, maximizing the burst transfer efficiency of DDR and eliminating the time loss caused by random access.
[0019] By using a pipelined architecture to perform soft thresholding on intermediate variables of the sparse recovery signal, a thresholded reconstructed signal is generated, eliminating pipeline blockage and significantly reducing critical path latency, thereby allowing the hardware system to operate at higher clock frequencies.
[0020] The auxiliary variable search point is updated by combining the reconstructed signal and the lookup table method, which saves logic resources and power consumption while simplifying the design cost of the control state machine.
[0021] The present invention will now be described in further detail with reference to the accompanying drawings.
Description of the Drawings
[0022] Figure 1 is a flowchart of a FISTA hardware acceleration method based on a single traversal of the matrix, provided in an embodiment of the present invention;
Figure 2 is a schematic diagram of the momentum update and convergence detection logic provided in an embodiment of the present invention;
Figure 3 is a simulation waveform diagram of the hardware implementation provided in an embodiment of the present invention;
Figure 4 is a top-level architecture diagram of the FISTA hardware accelerator based on row-interleaved gradient calculation provided in an embodiment of the present invention;
Figure 5 is the row-interleaved gradient calculation data flow and timing diagram provided in an embodiment of the present invention;
Figure 6 is a schematic diagram of the fully pipelined complex soft-threshold circuit structure provided in an embodiment of the present invention;
Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.
Detailed Implementation
[0023] The present invention will be further described in detail below with reference to specific embodiments, but the implementation of the present invention is not limited thereto.
[0024] To address the high cost, low efficiency, and limited clock speed of existing FISTA hardware acceleration methods, this invention provides a FISTA hardware acceleration method based on a single traversal of the matrix. Referring to Figure 1, which is a flowchart of the method provided by an embodiment of the present invention, the method specifically includes the following steps:
Step S101: Read data blocks of the sensing matrix row by row from the off-chip memory, calculate the residual components based on the current data block, and load the current data block into on-chip temporary storage.
[0025] Assume the sensing matrix $A$ has size $M \times N$. The gradient vector $g = A^{H} r$ is in fact a linear combination of the (conjugate-transposed) rows of the sensing matrix, and the weights of this linear combination are the residual components. The order of summation can therefore be changed, and the calculation restructured into a row-by-row accumulation.
[0026] Viewing the gradient as an accumulation of the contributions of each row of the sensing matrix, the gradient formula is:
$$g = A^{H} r = \sum_{i=1}^{M} a_i^{H} r_i$$
where $A$ denotes the sensing matrix; $r$ denotes the residual vector; the superscript $H$ denotes the conjugate transpose operation; $M$ denotes the total number of rows of the sensing matrix; $r_i$ denotes the $i$-th component of the residual vector $r$, which is a scalar; and $a_i$ denotes the $i$-th row of the sensing matrix $A$, which is a row vector of length $N$. Initialize $g^{(0)} = 0$. The $i$-th row is processed as follows:
Calculate the residual component: read the $i$-th row $a_i$ and, based on the current auxiliary-variable search point $v$, immediately calculate the residual component $r_i$:
$$r_i = a_i v - y_i$$
where $y_i$ denotes the $i$-th observation value of the observation vector $y$, a known data vector obtained from actual measurements of the physical system, such as current or voltage signals collected by sensors; and the auxiliary-variable search point $v$ is a column vector of length $N$ with a preset initial value.
Step S102: Perform the gradient accumulation operation based on the residual components and the temporarily stored current data block to update the gradient vector, and at the same time read the next data block from off-chip memory and temporarily store it.
[0027] In this embodiment of the invention, the gradient contribution is accumulated: using the row $a_i$ that was just read in and is still cached on-chip, together with the residual component $r_i$ just calculated, calculate the updated gradient vector corresponding to the $i$-th row:
$$g^{(i)} = g^{(i-1)} + a_i^{H} r_i$$
where $a_i$ denotes the $i$-th row of the sensing matrix $A$; $g^{(i)}$ denotes the updated gradient vector corresponding to the $i$-th row; $g^{(i-1)}$ denotes the updated gradient vector corresponding to the $(i-1)$-th row; $r_i$ denotes the $i$-th component of the residual vector $r$; and the superscript $H$ denotes the conjugate transpose operation.
[0028] Simultaneously, the next data block is read from off-chip memory and temporarily stored: a prefetch operation is initiated for the next data block, which is cached on-chip.
[0029] In this embodiment of the invention, a data block is either a single row of data or a block of data, where a block of data consists of multiple consecutive rows.
[0030] When the data is divided into blocks, the system first reads $K$ consecutive rows in sequence and calculates the $K$ corresponding residual components; after all the data of this block has been loaded into on-chip temporary storage, the gradient is updated in a batch using these $K$ rows, while the next block of data is received and temporarily stored.
[0031] This approach offers higher bus efficiency for matrices with extremely short row lengths, and its essence remains the same: using caching to achieve read-write multiplexing.
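The block-wise variant can be sketched as below. This is an illustrative software model only: the block size `K`, the function names, and the residual sign convention `a_i v - y_i` are assumptions introduced for the example, and the list copy merely stands in for the on-chip block buffer.

```python
def dot(row, v):
    """Dot product of one matrix row with the search point."""
    return sum(a * b for a, b in zip(row, v))

def gradient_blocked(A, v, y, K):
    """Accumulate g = A^H (A v - y) while touching each block of K rows once."""
    g = [0j] * len(v)
    for start in range(0, len(A), K):
        # Load K consecutive rows into the (modelled) on-chip block buffer.
        block = [list(row) for row in A[start:start + K]]
        # First compute the K residual components for this block.
        residuals = [dot(row, v) - y[start + i] for i, row in enumerate(block)]
        # Then update the gradient in a batch from the buffered rows.
        for row, r_i in zip(block, residuals):
            for j, a in enumerate(row):
                g[j] += a.conjugate() * r_i
    return g
```

Setting `K = 1` reduces this to the row-by-row scheme, and any block size yields the same gradient because only the grouping of the accumulation changes.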
[0032] Step S103: The temporarily stored next data block is used as the current data block, and the process returns to the step of calculating the residual components based on the read current data block, until all data blocks of the sensing matrix are read and the complete gradient vector is obtained.
[0033] In this embodiment of the invention, the next row of data read in advance is taken as the current data block, and the process returns to the step of calculating the residual components based on the current data block. After all $M$ rows have been processed, the final updated vector $g^{(M)}$ is the complete gradient vector required. This derivation shows that, as long as the $i$-th row is temporarily cached while it is being processed (only one row needs to be cached, not the entire sensing matrix), its residual component can be "back-projected" onto the gradient vector immediately after it is calculated, so the entire sensing matrix needs to be read only once.
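The equivalence between the single-pass accumulation and the conventional two-pass calculation can be checked with a small software model. In this sketch `OffChipMatrix` is a hypothetical stand-in for the sensing matrix in DDR that simply counts row reads, the residual is assumed to be `a_i v - y_i`, and the two-pass model re-reads rows rather than reproducing the column-wise access of a real transposed multiply.

```python
class OffChipMatrix:
    """Hypothetical model of the sensing matrix in off-chip memory."""
    def __init__(self, rows):
        self._rows = rows
        self.row_reads = 0          # number of rows fetched from "DDR"

    def __len__(self):
        return len(self._rows)

    def read_row(self, i):
        self.row_reads += 1
        return list(self._rows[i])

def gradient_two_pass(mem, v, y):
    """Pass 1 computes all residuals; pass 2 re-reads the matrix for A^H r."""
    m = len(mem)
    r = [sum(a * b for a, b in zip(mem.read_row(i), v)) - y[i] for i in range(m)]
    g = [0j] * len(v)
    for i in range(m):
        for j, a in enumerate(mem.read_row(i)):
            g[j] += a.conjugate() * r[i]
    return g

def gradient_single_pass(mem, v, y):
    """Each row is read once, cached, and back-projected immediately."""
    g = [0j] * len(v)
    for i in range(len(mem)):
        row = mem.read_row(i)                            # cached row a_i
        r_i = sum(a * b for a, b in zip(row, v)) - y[i]  # residual component
        for j, a in enumerate(row):
            g[j] += a.conjugate() * r_i                  # g += a_i^H r_i
    return g
```

For a matrix with M rows the model reports 2M row reads for the two-pass version and M for the single-pass version, with identical gradients.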
[0034] Step S104: Gradient descent is performed using the complete gradient vector to generate intermediate variables of the sparse recovery signal; soft thresholding is performed on the intermediate variables of the sparse recovery signal using a pipeline architecture to generate the thresholded reconstructed signal.
[0035] In this embodiment of the invention, performing gradient descent using the complete gradient vector to generate the intermediate variable of the sparse recovery signal includes:
$$z_k = v_k - \mu\, g_k$$
where $z_k$ denotes the intermediate variable of the sparse recovery signal; $v_k$ denotes the momentum acceleration point in the $k$-th iteration, i.e., the auxiliary-variable search point; $\mu$ denotes the step size; and $g_k$ denotes the complete gradient vector in the $k$-th iteration.
[0036] In this embodiment of the invention, performing soft thresholding on the intermediate variable of the sparse recovery signal using a pipelined architecture to generate the thresholded reconstructed signal includes: calculating the squared modulus (sum of squares) of the intermediate variable and applying a fast reciprocal square root operation to it to obtain the reciprocal of the modulus; calculating the shrinkage factor from the reciprocal of the modulus; and aligning the intermediate variable with the shrinkage factor in time and multiplying them to obtain the thresholded reconstructed signal.
[0037] The traditional soft-threshold formula is transformed into a multiplicative form, applied to each element:
$$x_k = z_k \cdot \max\!\left(0,\ 1 - T \cdot \frac{1}{\lvert z_k \rvert}\right)$$
where $x_k$ denotes the reconstructed signal after thresholding; $z_k$ denotes the intermediate variable of the sparse recovery signal, i.e., the result of the gradient descent step above; and $T$ denotes the threshold. The division in the traditional soft-threshold formula is thus converted into a multiplication by the reciprocal of the modulus.
[0038] Based on this multiplicative soft-threshold formula, the thresholded result is calculated in this embodiment of the invention as follows. First, a complex signal $z = z_{re} + j z_{im}$ is input (the actual input here is the aforementioned intermediate variable of the sparse recovery signal), where $z_{re}$ is the real part of the complex signal and $z_{im}$ is its imaginary part; the sum of squares $s = z_{re}^2 + z_{im}^2$ is calculated using two multipliers and one adder. A fast reciprocal square root is then performed: $s$ is fed into a dedicated IP core that calculates the reciprocal of the modulus, $1/\sqrt{s}$. This module is fully pipelined, producing one result per clock cycle with a fixed latency of 32 cycles, and is implemented using a fast reciprocal square root algorithm. Next, the intermediate product $T \cdot (1/\sqrt{s})$ is calculated, and then the shrinkage factor $c = 1 - T/\sqrt{s}$. A sign check is then required, which is implemented as a zero clamp using the sign bit (MSB) of the comparator: if the multiplication coefficient $c < 0$, the multiplexer outputs 0; otherwise, it outputs $c$. Here $T$ denotes the threshold, which is set from the regularization parameter $\lambda$.
[0039] The raw input $z$ passes through a set of shift registers (FIFO) whose number of delay cycles equals the total latency of the calculation steps above. Finally, the delayed $z$ is multiplied by the clamped shrinkage factor, and the product is output as the final result, i.e., the reconstructed signal $x_k$ after thresholding in the $k$-th iteration.
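The multiplicative soft-threshold datapath can be modelled in a few lines of software. This is a behavioural sketch only: `1.0 / math.sqrt(s)` stands in for the reciprocal square root IP core, the pipeline latency and the delay-matching FIFO are not modelled, and the guard for a zero input is an assumption, since the source does not state how the hardware treats that case.

```python
import math

def soft_threshold_division(z, T):
    """Conventional form: needs a square root and a division."""
    mag = abs(z)
    if mag <= T:
        return 0j
    return z * (mag - T) / mag

def soft_threshold_multiplicative(z, T):
    """Multiplicative form: sum of squares, reciprocal sqrt, clamp, multiply."""
    s = z.real * z.real + z.imag * z.imag   # two multipliers and one adder
    if s == 0.0:
        return 0j                           # assumed handling of a zero input
    inv_mag = 1.0 / math.sqrt(s)            # stands in for the rsqrt IP core
    c = 1.0 - T * inv_mag                   # shrinkage factor
    if c < 0.0:                             # zero clamp driven by the sign bit
        c = 0.0
    return z * c                            # multiply with the delayed input
```

For example, `z = 3 + 4j` with `T = 1` has modulus 5, so the shrinkage factor is 0.8 and both forms return approximately `2.4 + 3.2j`; an input whose modulus is below the threshold is clamped to zero.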
[0040] Step S105: Update the auxiliary-variable search point based on the reconstructed signal and a combination coefficient obtained by table lookup, calculate the change in the reconstructed signal in parallel, and decide whether to terminate the iteration by comparing the change with a preset threshold or by checking whether the iteration count has reached the maximum number of iterations.
[0041] In this embodiment of the invention, updating the auxiliary-variable search point and calculating the change in the reconstructed signal in parallel includes: extracting the combination coefficient from the preset momentum parameter table, using the current iteration number as the index; and updating the auxiliary-variable search point using the combination coefficient and the reconstructed signal, while calculating the change in the reconstructed signal in parallel.
[0042] See Figure 2, which is a schematic diagram of the momentum update and convergence detection logic provided in an embodiment of the present invention. In the momentum update, the FISTA momentum coefficient $t_k$ and the combination coefficient $\beta_k$ depend only on the iteration number $k$.
[0043] First, as data preprocessing, Matlab is used to pre-calculate the combination coefficients $\beta_k$ for the first iterations (for example, the first 300); the values are quantized to single-precision floating point, and a .coe file is generated to initialize the on-chip single-port read-only memory (ROM).
[0044] In the hardware implementation, the state machine uses the iteration counter as the ROM address and reads one coefficient per iteration. This eliminates the need for complex floating-point square root circuits.
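The table generation step can be sketched as follows. The source uses Matlab and does not give the recurrence, so this Python sketch assumes the standard FISTA sequence $t_1 = 1$, $t_{k+1} = (1 + \sqrt{1 + 4 t_k^2})/2$, $\beta_k = (t_k - 1)/t_{k+1}$. The function names are hypothetical, and the output text follows the usual COE layout (a radix line followed by a comma-separated vector), which should be checked against the memory generator actually used.

```python
import math
import struct

def build_beta_table(n):
    """Precompute n combination coefficients, rounded to float32."""
    table = []
    t = 1.0
    for _ in range(n):
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        beta = (t - 1.0) / t_next
        # Round-trip through 4 bytes to model single-precision quantization.
        table.append(struct.unpack("<f", struct.pack("<f", beta))[0])
        t = t_next
    return table

def to_coe_text(table):
    """Render the table as hexadecimal IEEE-754 words in COE-style text."""
    words = [struct.pack(">f", b).hex() for b in table]
    return ("memory_initialization_radix=16;\n"
            "memory_initialization_vector=\n" + ",\n".join(words) + ";\n")

def lookup_beta(table, k):
    """Model of the ROM read: the iteration counter k (from 0) is the address."""
    return table[k]
```

The first entry is zero (no momentum on the first iteration) and the entries approach 1 as the iteration count grows, so a fixed-depth ROM indexed by the iteration counter replaces the square root and division circuits.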
[0045] After the thresholded reconstructed signal $x_k$ is obtained from the soft-thresholding calculation, the auxiliary-variable search point $v_{k+1}$ used to calculate the gradient in the next iteration is calculated:
$$v_{k+1} = x_k + \beta_k\,(x_k - x_{k-1})$$
where $x_k$ denotes the current reconstructed signal; $\beta_k$ denotes the combination coefficient; and $x_{k-1}$ denotes the reconstructed signal from the previous iteration.
[0046] During this process, a parallel accumulator path is added to calculate the change in the reconstructed signal:
$$\Delta_k = \lVert x_k - x_{k-1} \rVert_2^2$$
The squared magnitude of each element difference is calculated immediately and accumulated into the convergence register.
[0047] When a round of vector updates is completed, the value of the convergence register, i.e., the change $\Delta_k$, is compared directly with the preset threshold $\varepsilon$, or the iteration count is checked against the maximum number of iterations, to decide whether to trigger the completion signal and stop the iteration.
[0048] If the change is less than the preset threshold, the algorithm is considered to have converged, the reconstructed signal $x_k$ is output, and the iteration ends. If the iteration count $k$ has reached the maximum limit $K_{\max}$, the reconstructed signal $x_k$ is output and the iteration ends. Otherwise, the updated auxiliary-variable search point $v_{k+1}$, the reconstructed signal $x_k$ calculated in this round, and the observation vector $y$ are taken as input, the iteration count is incremented, and the next iteration proceeds from step S101. This process does not require additional clock cycles to read data.
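The momentum update and the parallel convergence path can be modelled in one loop over the vector elements. This is a sketch under stated assumptions: the change is taken to be the squared Euclidean norm of the difference between successive reconstructed signals (the exact metric is not recoverable from the text), and the names `eps` and `k_max` are hypothetical labels for the preset threshold and the maximum iteration count.

```python
def momentum_and_convergence(x_new, x_old, beta, eps, k, k_max):
    """Return (next search point, change, done flag) for one iteration."""
    v_next = []
    change = 0.0                       # models the convergence register
    for xn, xo in zip(x_new, x_old):
        d = xn - xo                    # difference shared by both paths
        v_next.append(xn + beta * d)   # momentum update of the search point
        change += d.real * d.real + d.imag * d.imag   # parallel accumulation
    done = change < eps or k >= k_max  # converged, or iteration limit reached
    return v_next, change, done
```

Because the element difference is already needed for the momentum update, the convergence metric costs only one extra multiply-accumulate path and no additional pass over the data.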
[0049] In this embodiment of the invention, the inefficient matrix transpose access method is abandoned, and continuous burst readings of the sensing matrix are maintained row by row at all times. Whether in the residual calculation stage or the gradient update stage, the data flow strictly follows the physical storage order of the memory, maximizing the burst transfer efficiency of DDR and eliminating the time loss caused by random access.
[0050] By using a pipelined architecture to perform soft thresholding on intermediate variables of the sparse recovery signal, a thresholded reconstructed signal is generated, eliminating pipeline blockage and significantly reducing critical path latency, thereby allowing the hardware system to operate at higher clock frequencies.
[0051] The auxiliary variable search point is updated by combining the reconstructed signal and the lookup table method, which saves logic resources and power consumption while simplifying the design cost of the control state machine.
[0052] The simulation experiment of the hardware acceleration method of FISTA algorithm based on single traversal matrix provided by the embodiment of the present invention is as follows: to verify the improvement effect of the row interleaved gradient accumulation hardware architecture of single traversal matrix proposed in this invention in terms of latency and external memory access, a behavioral simulation environment is built to compare and evaluate the data paths of the same set of FISTA single iterations.
[0053] 1) The simulation configuration is as follows:
Sensing matrix dimension: 300×300; matrix elements are provided as a continuous, row-by-row stream.
Clock frequency: 200 MHz (clock period 5 ns).
Comparison objects:
Comparative scheme: the existing two-pass scanning structure first completes the residual calculation and then traverses the matrix again to complete the gradient calculation (the transposed matrix multiplied by the residual).
Present invention: a single-pass interleaved structure. While a row is being read, its residual component is calculated and the row is cached; then, with the help of the ping-pong cache, the gradient contribution is accumulated while the next row is read without interruption.
[0054] Statistical definition: The total delay is the number of cycles between the start trigger in the simulation waveform and the validity of the current iteration result (such as x_new_valid or equivalent completion flag).
[0055] 2) The simulation results and analysis are as follows. See Figure 3, which is a simulation waveform diagram of the hardware implementation provided in this embodiment of the invention. The simulation waveform shows that, for a 300×300 matrix at 200 MHz, the solution of this invention completes one iteration in approximately 459.57 µs, measured from the high pulse of the start signal (the start signal in the figure) to the completion of the calculation of the sparse coefficient vector, which corresponds to approximately 91,914 clock cycles. The comparative solution requires two complete traversals of the same matrix, and its residual and gradient stages are difficult to overlap effectively on the bandwidth-limited external memory interface, so its total cycle count is approximately doubled. Table 1 compares the iteration delay of the two solutions (300×300, 200 MHz).
Table 1 Comparison of iteration delay between the two schemes
[0056] Under the same matrix size and clock conditions, this invention reduces the number of matrix traversals from 2 to 1, thereby reducing the total cycle count from approximately 182,782 to approximately 91,914. This is an overall latency reduction of approximately 49.7% and an equivalent throughput increase of approximately 1.989 times; at the same time, the external read bandwidth requirement for the matrix is theoretically reduced by approximately 50%.
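The reported figures can be cross-checked with simple arithmetic from the cycle counts and the clock frequency given above. The helper names below are hypothetical; only the two cycle counts and the 200 MHz clock come from the text.

```python
def iteration_time_us(cycles, clock_hz):
    """Convert a cycle count into microseconds at the given clock."""
    return cycles / clock_hz * 1e6

def compare_schemes(cycles_single, cycles_double, clock_hz):
    """Derive latency and throughput figures from the two cycle counts."""
    return {
        "single_pass_us": iteration_time_us(cycles_single, clock_hz),
        "two_pass_us": iteration_time_us(cycles_double, clock_hz),
        "latency_reduction_pct": (1.0 - cycles_single / cycles_double) * 100.0,
        "throughput_gain": cycles_double / cycles_single,
    }

if __name__ == "__main__":
    for name, value in compare_schemes(91_914, 182_782, 200e6).items():
        print(f"{name}: {value:.3f}")
```

With these inputs the single-pass time comes out at 459.57 µs, the latency reduction at about 49.7%, and the throughput ratio at just under 1.99, in line with the figures quoted in the text.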
[0057] This comparison shows that the present invention, by combining a single matrix read with row cache reuse in its data flow organization, can reduce the number of external matrix reads from 2 to 1, thereby significantly reducing total latency and increasing throughput in I / O-dominated scenarios.
[0058] Based on the same inventive concept, this invention also provides a FISTA hardware acceleration device based on a single traversal of the matrix, comprising: a read module, configured to continuously read data blocks of the sensing matrix row by row from off-chip memory; a residual calculation unit, configured to calculate the residual components based on the current data block and to load the current data block into on-chip temporary storage; a gradient update unit, configured to perform a gradient accumulation operation based on the residual components and the temporarily stored current data block to update the gradient vector while the next data block is read from the off-chip memory and temporarily stored, and further configured to take the temporarily stored next data block as the current data block and return to the calculation of the residual components, until all data blocks of the sensing matrix have been read and the complete gradient vector is obtained; a soft thresholding module, configured to perform gradient descent using the complete gradient vector to generate the intermediate variable of the sparse recovery signal, and to perform soft thresholding on that intermediate variable using a pipelined architecture to generate the thresholded reconstructed signal; and a convergence judgment module, configured to update the auxiliary-variable search point based on the reconstructed signal and a combination coefficient obtained by table lookup, to calculate the change in the reconstructed signal in parallel, and to decide whether to terminate the iteration by comparing the change with a preset threshold or by checking whether the iteration count has reached the maximum number of iterations.
[0059] A FISTA hardware acceleration device based on a single traversal of the matrix, provided by an embodiment of the present invention, is described further below. See Figure 4, which is a top-level architecture diagram of the FISTA hardware accelerator based on row-interleaved gradient calculation provided in this embodiment of the invention. By restructuring the calculation order of the mathematical formulas and combining it with an on-chip ping-pong caching mechanism, a single streaming traversal of a large-scale sensing matrix is achieved: the residual components and the gradient vector are both calculated within a single read of the matrix, which relieves the memory wall bottleneck.
[0060] The hardware device of this invention mainly includes the following sub-modules: 1. DDR Control Interface: Responsible for burst reading of sensing matrix data from external DDR using a bus protocol such as AXI4 (Advanced eXtensible Interface 4).
[0061] 2. Input FIFO (First-In-First-Out) buffer: used for cross-clock domain processing to smooth DDR data flow.
[0062] 3. Residual Calculation Unit: This is a multiply-accumulate array responsible for computing the dot product of each row vector of the sensing matrix and the search-point vector, from which the residual component is obtained.
[0063] 4. Ping-Pong Row Buffer: It adopts a dual-port RAM (Random Access Memory) architecture containing two independent storage banks (Bank 0 and Bank 1), and its storage depth is designed to accommodate at least a single row of matrix elements.
[0064] 5. Gradient Accumulation and Update Unit: Responsible for reading data from the ping-pong buffer, combining it with the calculated residual components, and accumulating the gradient vector.
[0065] 6. Gradient Memory: Stores intermediate gradient results.
[0066] 7. Pipelined Soft Threshold and Momentum Update Module: Handles the nonlinear proximal mapping, calculating the soft-threshold output and the updated search point.
[0067] 8. Vector storage and momentum control unit: stores the signal vectors and updates the momentum coefficient using a lookup table.
[0068] In this embodiment of the invention, the sensing matrix is stored row by row in off-chip memory (DDR SDRAM). The system comprises two core computational units, the residual computation unit (MatVec Unit) and the gradient update unit (GradientUpdate Unit), as well as a crucial ping-pong buffer. The implementation process is carried out in two cyclical phases: Phase One: loading and forward projection of the i-th row.
[0069] Refer to Figure 5, which is a data-flow and timing diagram of the row-interleaved gradient calculation provided by this invention. Burst transmission is initiated via the DDR interface to read the sensing matrix. The data of the i-th row is streamed into the residual calculation unit in real time. This unit reads the corresponding elements of the search-point vector from the BRAM (Block RAM) in a pipelined manner and multiply-accumulates them with the row elements, forming the dot product of the i-th row and the search-point vector.
[0070] While the data flows through the multiplier, a copy of the i-th row is forked off and synchronously written to the write bank (e.g., Bank 0) of the ping-pong buffer. Because FPGA routing is parallel, this copy incurs no additional time overhead.
[0071] Once the row has been read, the corresponding component of the observation vector is subtracted to obtain the residual component. At this point, Bank 0 holds the complete data of the i-th row.
[0072] Phase Two: back projection of the i-th row and loading of the (i+1)-th row.
[0073] Action switching: Once the residual component calculation is complete, the state machine immediately flips the ping-pong control signal.
[0074] Path A (gradient update, using Bank 0): the gradient update unit locks Bank 0 (which now switches to being the read bank) and reads out the just-written i-th row data element by element. The gradient accumulation operation is performed to update the gradient vector, and the new gradient vector is written back to the on-chip BRAM that stores the gradient.
[0075] Path B (new row read, using Bank 1): simultaneously, the DDR interface does not wait and seamlessly continues reading the (i+1)-th row of the matrix. This new row is sent to the residual calculation unit to calculate its residual component and, at the same time, is written to Bank 1 of the ping-pong buffer (which now switches to being the write bank).
[0076] This process is repeated until all M rows of the sensing matrix have been processed. This invention keeps the external DDR bus transmitting continuously at maximum bandwidth throughput, and the matrix data is read only once while the contributions of both the forward projection (residual) and the back projection (gradient) are calculated.
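The two phases described above can be sketched in software as follows. This is a minimal illustrative model, not the patented circuit: the function name `single_pass_gradient`, the symbols `y` (search-point vector) and `b` (observation vector), and the sequential execution are assumptions made for the sketch. In hardware, the back projection of row i and the loading of row i+1 overlap in time, whereas here they simply alternate between two banks.

```python
def single_pass_gradient(rows, y, b):
    """Accumulate g = A^H (A y - b) while reading each row of A exactly once.

    rows : iterable yielding the rows of the sensing matrix (models the DDR stream)
    y    : search-point vector (assumed name)
    b    : observation vector (assumed name)
    Returns the gradient vector and the number of rows fetched from the stream.
    """
    g = [0j] * len(y)
    banks = [None, None]          # ping-pong row buffers: Bank 0 and Bank 1
    rows_read = 0
    for i, row in enumerate(rows):
        rows_read += 1
        bank = i % 2              # the write bank alternates on every row
        banks[bank] = list(row)   # copy of the row kept on chip for reuse
        # Phase one: forward projection, then subtract the observation component.
        r_i = sum(a * v for a, v in zip(banks[bank], y)) - b[i]
        # Phase two: back projection using the buffered copy (no second DDR read).
        for j, a in enumerate(banks[bank]):
            g[j] += a.conjugate() * r_i
    return g, rows_read


A = [[1 + 1j, 2], [0, 1j]]
g, rows_read = single_pass_gradient(iter(A), y=[1, 1j], b=[1, 0])
```

Because `rows` is consumed as a one-shot iterator, the sketch cannot revisit a row, which mirrors the "read once, compute twice" property: the second use of each row comes from the on-chip bank.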
[0077] Phase 3: Gradient descent calculation. For details, please refer to step S104 above, which will not be repeated here.
[0078] Refer to Figure 6, which is a schematic diagram of the fully pipelined complex soft-threshold circuit structure provided in an embodiment of the present invention. First, a complex signal $z = a + jb$ is input, and the sum of squares $a^2 + b^2$ is calculated using two multipliers and one adder. The sum of squares is then fed into a dedicated IP core that computes the reciprocal square root $1/|z|$. This module is implemented using a fast reciprocal square root algorithm and has a fully pipelined architecture, producing one result per clock cycle with a fixed delay of 32 cycles. Next, the intermediate variable $T/|z|$ is calculated, followed by the shrinkage factor $s = 1 - T/|z|$. A sign check is then required, which is implemented as a zero clamp using the sign bit (MSB) of the comparator: if the shrinkage factor is negative, the multiplexer outputs 0; otherwise, it outputs the shrinkage factor $s$. Here, $T$ denotes the threshold, which is determined by the regularization parameter $\lambda$.
[0079] The raw complex input passes through a set of shift registers (FIFO) whose number of delay cycles equals the total delay of the above calculation steps. Finally, the delayed input and the shrinkage factor are multiplied, and the product is output as the final result, namely the thresholded reconstructed signal.
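The arithmetic of this soft-threshold data path can be illustrated with a short sketch. It is a floating-point model under stated assumptions: the function name `soft_threshold_complex` is hypothetical, and `1.0 / math.sqrt(...)` merely stands in for the fast reciprocal square root IP core, which in hardware involves no divider. The pipeline delay and the alignment FIFO are not modelled.

```python
import math


def soft_threshold_complex(z, T):
    """Shrink the modulus of complex z by T without dividing by |z| explicitly."""
    power = z.real * z.real + z.imag * z.imag   # two multipliers and one adder
    if power == 0.0:
        return 0j                               # zero input stays zero
    inv_mag = 1.0 / math.sqrt(power)            # stand-in for the reciprocal sqrt core
    shrink = 1.0 - T * inv_mag                  # shrinkage factor 1 - T/|z|
    if shrink < 0.0:                            # sign bit set: clamp to zero
        shrink = 0.0
    return z * shrink                           # multiply the (delayed) input by the factor


out = soft_threshold_complex(3 + 4j, 1.0)       # |z| = 5, factor 0.8
```

Writing the result as the input multiplied by a real factor is what lets the circuit avoid a normalizing division: the phase of the input is preserved automatically and only the modulus is scaled.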
[0080] In this embodiment of the invention, when the data is divided into blocks, the capacity of each bank in the ping-pong buffer is designed to accommodate a block of multiple rows.
[0081] The working method is as follows: the system first continuously reads a block of rows and calculates the residual value of each row; after all the data of this block has been stored in Bank 0, the rows of the block are used in a batch to update the gradient, while Bank 1 receives the next block of rows.
[0082] In one implementation, instead of using an explicit dual-bank (Bank 0 / 1) address switching structure, a large circular buffer or a sufficiently deep FIFO (First In First Out) can be used to replace the ping-pong buffer.
[0083] The working principle is that matrix data enters the calculation unit to calculate the residual and is written to the FIFO at the same time; after a fixed delay (i.e. the time required for residual calculation), the data flows out from the FIFO output and enters the gradient accumulation unit.
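The FIFO variant can be sketched in the same way. This is an illustrative model only: the name `fifo_gradient` and the symbols `y` and `b` are assumptions, and the "fixed delay" is represented simply by draining the FIFO after the residual of the row is available.

```python
from collections import deque


def fifo_gradient(rows, y, b):
    """Single-pass gradient g = A^H (A y - b) using a FIFO as the delay buffer."""
    fifo = deque()
    g = [0j] * len(y)
    for i, row in enumerate(rows):
        acc = 0
        for a, v in zip(row, y):
            acc += a * v          # element enters the multiply-accumulate unit...
            fifo.append(a)        # ...and is written to the FIFO at the same time
        r_i = acc - b[i]          # residual component of this row
        # After the delay, the same elements leave the FIFO in their original order.
        for j in range(len(y)):
            g[j] += fifo.popleft().conjugate() * r_i
    return g


g = fifo_gradient([[1 + 1j, 2], [0, 1j]], y=[1, 1j], b=[1, 0])
```

The sketch drains one full row before the next one arrives, so the FIFO depth never exceeds one row; a real design would need enough depth to cover the residual latency while the next row is already streaming in.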
[0084] As long as the cache structure serves the purpose of "delaying data for secondary use", regardless of its physical form as RAM, FIFO, or register chain, it is a substitute for the embodiments of this invention.
[0085] In this embodiment of the invention, the soft thresholding of the intermediate variables of the sparse recovery signal using a pipelined architecture can also be implemented through modulus / phase processing based on the CORDIC (Coordinate Rotation Digital Computer) algorithm. The working method is to process the complex number directly with a fully pipelined CORDIC module: the modulus and phase are calculated in vectoring mode; after the threshold is subtracted from the modulus, the complex form is restored using CORDIC's rotation mode or a trigonometric lookup table.
[0086] This is another classic method for implementing complex number operations in hardware. Although the internal mathematical formulas are different, it also achieves the technical effect of "no division and fully pipelined".
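A rough software model of the CORDIC alternative is given below. All names (`vectoring`, `rotation`, `cordic_soft_threshold`, `ITERS`) are hypothetical, floating-point values are used instead of fixed-point, and multiplication by `2.0 ** -k` stands in for a hardware right shift. The iteration count of 24 is an arbitrary choice for the sketch.

```python
import math

ITERS = 24
ANGLES = [math.atan(2.0 ** -k) for k in range(ITERS)]   # micro-rotation angle table
INV_GAIN = 1.0
for k in range(ITERS):
    INV_GAIN /= math.sqrt(1.0 + 4.0 ** -k)              # compensates the CORDIC gain


def vectoring(x, y):
    """Vectoring mode: drive y to zero; return (modulus, phase). Requires x >= 0."""
    angle = 0.0
    for k in range(ITERS):
        d = -1.0 if y > 0.0 else 1.0
        x, y = x - d * y * 2.0 ** -k, y + d * x * 2.0 ** -k
        angle -= d * ANGLES[k]
    return x * INV_GAIN, angle


def rotation(x, y, angle):
    """Rotation mode: rotate (x, y) by angle, for |angle| within the CORDIC range."""
    for k in range(ITERS):
        d = 1.0 if angle >= 0.0 else -1.0
        x, y = x - d * y * 2.0 ** -k, y + d * x * 2.0 ** -k
        angle -= d * ANGLES[k]
    return x * INV_GAIN, y * INV_GAIN


def cordic_soft_threshold(z, T):
    """Soft-threshold complex z via modulus/phase: shrink the modulus, keep the phase."""
    flip = z.real < 0.0
    if flip:
        z = -z                          # fold into the right half-plane
    mag, phase = vectoring(z.real, z.imag)
    mag = max(mag - T, 0.0)             # threshold subtraction with zero clamp
    re, im = rotation(mag, 0.0, phase)  # restore the complex form
    out = complex(re, im)
    return -out if flip else out


out = cordic_soft_threshold(3 + 4j, 1.0)
```

The half-plane fold is needed because the basic CORDIC iteration only converges for angles up to roughly 1.74 rad; folding and negating the result covers the remaining inputs.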
[0087] Furthermore, the modulus of the input complex number can be transformed to the logarithmic domain. In the logarithmic domain, division becomes subtraction, and the square root operation becomes a shift (division by 2). After the shrinkage operation, the result is transformed back to the linear domain. This method also avoids complex dividers and represents an equivalent substitution at the algorithmic level.
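The logarithmic-domain variant can be illustrated as follows. This is a sketch under assumptions: the function name is hypothetical, `math.log2` and `2.0 ** x` stand in for whatever logarithm and antilogarithm approximations (for example lookup tables) a hardware design would use, and the threshold `T` is assumed to be strictly positive.

```python
import math


def soft_threshold_log_domain(z, T):
    """Soft-threshold complex z, computing T/|z| with log-domain subtraction."""
    power = z.real * z.real + z.imag * z.imag
    if power == 0.0:
        return 0j
    log_mag = 0.5 * math.log2(power)        # square root becomes halving (a shift)
    log_ratio = math.log2(T) - log_mag      # division T/|z| becomes subtraction
    shrink = 1.0 - 2.0 ** log_ratio         # back to the linear domain
    return z * max(shrink, 0.0)             # zero clamp, then scale the input


out = soft_threshold_log_domain(3 + 4j, 1.0)
```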
[0088] Alternatives to the matrix storage and access format (dense / sparse), including support for the compressed sparse row (CSR) format: the embodiments of this invention primarily describe dense matrices or simply stored sparse matrices. As an alternative, the design can be adapted to standard sparse matrix formats.
[0089] The operating principle involves adding index decoding logic to the hardware. Non-zero elements and their column indices are read from DDR. During gradient update calculation, data is accumulated into non-contiguous addresses in the gradient RAM based on the column indices.
[0090] Regardless of whether the matrix contains zero elements or whether its storage is compressed, as long as the data flow control follows the principles of "row-by-row reading, single traversal, and bidirectional projection," it can be implemented based on the embodiments of this invention.
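For the compressed sparse row case, the same single-pass flow can be sketched with index decoding. The array names `values`, `col_idx`, and `row_ptr` follow the common CSR convention and are assumptions, as are the function name and the symbols `y` and `b`.

```python
def csr_single_pass_gradient(values, col_idx, row_ptr, y, b):
    """Single-pass g = A^H (A y - b) for a matrix stored in CSR format."""
    g = [0j] * len(y)
    for i in range(len(row_ptr) - 1):
        start, end = row_ptr[i], row_ptr[i + 1]      # non-zeros of row i
        # Forward projection: gather y at the decoded column indices.
        r_i = sum(values[p] * y[col_idx[p]] for p in range(start, end)) - b[i]
        # Back projection: scatter into non-contiguous gradient addresses.
        for p in range(start, end):
            g[col_idx[p]] += values[p].conjugate() * r_i
    return g


# CSR form of [[1+1j, 2], [0, 1j]]
g = csr_single_pass_gradient([1 + 1j, 2, 1j], [0, 1, 1], [0, 2, 3],
                             y=[1, 1j], b=[1, 0])
```

Each non-zero is still fetched once and used twice; only the addressing of the search-point vector and the gradient RAM changes.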
[0091] For multi-channel / high-bandwidth memory parallel architectures: For ultra-large-scale matrices, use multiple DDR channels or HBM (High Bandwidth Memory).
[0092] The working method involves dividing the matrix into row blocks and storing them in different physical memory channels. Multiple "residual calculation + gradient update" processing cores are instantiated inside the FPGA. Each processing core is responsible for the parallel processing of a portion of the rows, and finally the gradient results are aggregated.
[0093] This is a spatial extension of the parallelism of the architecture of this invention embodiment, and the internal operating mechanism of each sub-core is consistent with that of this invention.
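The row-block partitioning across channels can be sketched as below. The names `partial_gradient` and `multi_channel_gradient` are hypothetical, the "cores" run one after another in this model rather than in parallel, and equal-sized contiguous row blocks are an assumed partitioning scheme.

```python
def partial_gradient(rows, y, b):
    """One processing core: single-pass gradient contribution of its row block."""
    g = [0j] * len(y)
    for row, b_i in zip(rows, b):
        r_i = sum(a * v for a, v in zip(row, y)) - b_i
        for j, a in enumerate(row):
            g[j] += a.conjugate() * r_i
    return g


def multi_channel_gradient(A, y, b, channels):
    """Split the rows of A into contiguous blocks, one per channel, then aggregate."""
    m = len(A)
    block = -(-m // channels)                 # ceiling division: rows per channel
    total = [0j] * len(y)
    for c in range(channels):
        lo, hi = c * block, min((c + 1) * block, m)
        part = partial_gradient(A[lo:hi], y, b[lo:hi])
        total = [t + p for t, p in zip(total, part)]   # final aggregation step
    return total


A = [[1 + 1j, 2], [0, 1j]]
g = multi_channel_gradient(A, y=[1, 1j], b=[1, 0], channels=2)
```

The aggregation is a plain sum because the gradient decomposes over rows, so the partial results of the cores do not interact until the final addition.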
[0094] In one implementation, the momentum parameter update and the convergence decision logic can be moved from pure hardware (RTL) to an on-chip embedded soft core.
[0095] In operation, the FPGA hardware logic handles the heavy matrix operations. After each iteration, an interrupt is sent to the CPU. The CPU then reads a small number of status registers (such as the error norm), calculates the momentum parameter for the next iteration, and writes the value to the hardware configuration. This approach replaces part of the hardware state machine and LUT with software, but the core accelerated data path remains unchanged, representing a standard replacement in engineering implementation.
[0096] In one implementation, instead of using ROM lookup tables, a simplified hardware arithmetic logic unit can be designed.
[0097] The working method is to evaluate the momentum update formula approximately and iteratively using fixed-point shifts and additions, generating the momentum coefficient in real time. Any low-cost calculation strategy (whether table lookup or approximate calculation) adopted to avoid the high overhead of floating-point division can serve this purpose.
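For comparison, the lookup-table option that this alternative replaces can be sketched by precomputing the coefficients offline. The sketch assumes the standard FISTA recurrence $t_{k+1} = (1 + \sqrt{1 + 4 t_k^2})/2$ with $t_1 = 1$ and combination coefficient $\beta_k = (t_k - 1)/t_{k+1}$; the function name and the 15 fractional bits of the fixed-point format are illustrative choices.

```python
import math


def build_momentum_table(max_iters, frac_bits=15):
    """Precompute fixed-point combination coefficients for each iteration index."""
    table = []
    t = 1.0
    for _ in range(max_iters):
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        beta = (t - 1.0) / t_next                     # division done offline only
        table.append(round(beta * (1 << frac_bits)))  # quantize to fixed point
        t = t_next
    return table


rom = build_momentum_table(50)   # contents of a 50-entry coefficient ROM
```

At run time the hardware would only index this table with the iteration counter, so neither the square root nor the division appears in the data path.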
[0098] In this embodiment of the invention, row-interleaved gradient accumulation, combined with an on-chip ping-pong buffer mechanism, achieves "read once, compute twice", enabling a single streaming traversal of the sensing matrix in which residual calculation and gradient back projection are completed together. This halves the amount of data read from off-chip memory in each iteration, since the matrix is fetched once rather than twice. In sparse signal processing applications with low computational intensity but high I / O intensity, this can nearly double the system throughput and iteration speed.
[0099] This invention utilizes the row decomposition properties of gradients, eliminating the need for physical or logical transpose operations and maintaining continuous burst reads of the sensing matrix row by row. Whether in the residual calculation stage or the gradient update stage, the data flow strictly follows the physical arrangement order of the memory. This memory access mode eliminates the time penalty associated with random access, allowing the effective bandwidth utilization of external memory to approach its theoretical peak (typically >90%), significantly reducing data loading time.
[0100] This invention designs a fully pipelined, division-free soft-threshold circuit, replacing the traditional division normalization operation with fast reciprocal square root logic and a multiplier. This design transforms complex nonlinear operations into a fixed-delay pipeline stage, eliminating blocking stages and enabling the hardware system to operate at higher clock frequencies, thereby further improving computational efficiency.
[0101] The ping-pong cache of this invention only needs to store a single row of matrix data, resulting in extremely low on-chip storage resource requirements that are independent of the total number of rows in the matrix. Furthermore, it employs a pre-calculated lookup table method to update momentum parameters, eliminating complex real-time floating-point arithmetic logic. This invention can support the reconstruction of extremely large-dimensional sparse signals with very low hardware resource costs (low-cost FPGA or small-area ASIC), offering extremely high cost-effectiveness and scalability.
[0102] This invention integrates parallel norm accumulation logic in the momentum update data path bypass, synchronously calculating the convergence criterion during data flow. This achieves zero-time-overhead convergence detection, simplifies the design of the control state machine, and further optimizes the system's time efficiency.
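The idea of accumulating the convergence measure alongside the momentum update can be sketched as a single loop. The function names, the use of the squared norm of the change as the criterion, and the symbols `x_new`, `x_old`, and `beta` are assumptions for illustration.

```python
def update_search_point(x_new, x_old, beta):
    """Update the search point and accumulate the squared change in one pass."""
    y_next = []
    change_sq = 0.0
    for xn, xo in zip(x_new, x_old):
        diff = xn - xo                      # shared by both computations
        y_next.append(xn + beta * diff)     # momentum (search-point) update
        change_sq += diff.real * diff.real + diff.imag * diff.imag  # bypass accumulator
    return y_next, change_sq


def should_stop(change_sq, tol_sq, k, max_iters):
    """Stop when the change is below the threshold or the iteration limit is reached."""
    return change_sq < tol_sq or k >= max_iters


y_next, change_sq = update_search_point([1 + 1j, 2], [1, 1], beta=0.5)
```

Since the difference between consecutive iterates is already needed for the momentum update, the norm accumulation reuses it and adds no extra pass over the data.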
[0103] This invention also provides an electronic device, as shown in Figure 7, which includes a processor 701, a communication interface 702, a memory 703, and a communication bus 704, wherein the processor 701, the communication interface 702, and the memory 703 communicate with each other through the communication bus 704. The memory 703 is used to store computer programs. When the processor 701 executes the program stored in the memory 703, it implements the method steps of any of the above-mentioned hardware acceleration methods for the FISTA algorithm based on a single traversal matrix.
[0104] The communication bus mentioned in the above electronic devices can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. This communication bus can be divided into address bus, data bus, control bus, etc. For ease of representation, only one thick line is used in the diagram, but this does not indicate that there is only one bus or one type of bus.
[0105] The communication interface is used for communication between the aforementioned electronic devices and other devices.
[0106] The memory may include random access memory (RAM) or non-volatile memory (NVM), such as at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
[0107] The processors mentioned above can be general-purpose processors, including central processing units (CPUs), network processors (NPs), etc.; they can also be digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
[0108] The present invention also provides a computer-readable storage medium. A computer program is stored in the computer-readable storage medium, and when executed by a processor, the computer program implements the method steps of any of the above-described hardware acceleration methods for the FISTA algorithm based on a single traversal matrix.
[0109] Optionally, the computer-readable storage medium may be non-volatile memory (NVM), such as at least one disk storage device.
[0110] Optionally, the aforementioned computer-readable storage medium may also be at least one storage device located remotely from the aforementioned processor.
[0111] In another embodiment of the present invention, a computer program product containing instructions is also provided, which, when run on a computer, causes the computer to execute the steps described in any of the above-described hardware acceleration methods for the FISTA algorithm based on a single traversal matrix.
[0112] It should be noted that the terms "first," "second," etc., are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the invention.
[0113] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Furthermore, those skilled in the art can combine and integrate the different embodiments or examples described in this specification.
[0114] Although the invention has been described herein in conjunction with various embodiments, those skilled in the art will understand and implement other variations of the disclosed embodiments by reviewing the accompanying drawings and the disclosure in carrying out the claimed invention. In the description of the invention, the word "comprising" does not exclude other components or steps, "a" or "an" does not exclude a plurality, and "a plurality" means two or more, unless otherwise explicitly specified. Furthermore, while different embodiments may describe certain measures, this does not mean that these measures cannot be combined to produce good results.
[0115] The method provided in this invention can be applied to electronic devices. Specifically, the electronic device can be a desktop computer, a portable computer, a smart mobile terminal, a server, etc. No limitation is made herein; any electronic device that can implement this invention falls within the protection scope of this invention.
[0116] For the embodiments of the device / electronic device / storage medium, since they are basically similar to the method embodiments, the description is relatively simple, and relevant parts can be referred to in the description of the method embodiments.
[0117] It should be noted that the device, electronic device and storage medium in the embodiments of the present invention are respectively devices, electronic devices and storage media that apply the above-mentioned hardware acceleration method of FISTA algorithm based on a single traversal matrix. Therefore, all embodiments of the above-mentioned hardware acceleration method of FISTA algorithm based on a single traversal matrix are applicable to the device, electronic device and storage medium, and can achieve the same or similar beneficial effects.
[0118] The above description, in conjunction with specific preferred embodiments, provides a further detailed explanation of the present invention. It should not be construed that the specific implementation of the present invention is limited to these descriptions. For those skilled in the art, various simple deductions or substitutions can be made without departing from the concept of the present invention, and all such modifications and substitutions should be considered within the scope of protection of the present invention.
Claims
1. A hardware acceleration method for the FISTA algorithm based on a single-pass matrix traversal, characterized in that the hardware acceleration method for the FISTA algorithm includes: reading data blocks of the sensing matrix sequentially row by row from off-chip memory, calculating residual components based on the currently read data block, and loading the current data block into on-chip temporary storage; based on the residual components and the temporarily stored current data block, performing a gradient accumulation operation to update the gradient vector, while reading the next data block from the off-chip memory and temporarily storing it; taking the temporarily stored next data block as the current data block, and returning to the step of calculating the residual components based on the currently read data block, until all data blocks of the sensing matrix have been read and the complete gradient vector is obtained; performing gradient descent using the complete gradient vector to generate intermediate variables of the sparse recovery signal, and performing soft thresholding on the intermediate variables of the sparse recovery signal using a pipelined architecture to generate a thresholded reconstructed signal; updating the auxiliary-variable search point based on the reconstructed signal and the combination coefficients obtained by the lookup-table method, and calculating the change in the reconstructed signal in parallel; and determining whether to terminate the iteration by comparing the change with a preset threshold or by checking whether the number of iterations has reached the maximum number of iterations.
2. The FISTA algorithm hardware acceleration method according to claim 1, characterized in that, The intermediate variables of the sparse recovery signal are subjected to soft thresholding using a pipelined architecture to generate a thresholded reconstructed signal, including: Perform a modulus square sum and fast reciprocal square root operation on the intermediate variables of the sparse recovery signal to obtain the modulus reciprocal; The shrinkage factor is calculated based on the reciprocal of the modulus; The sparse recovery signal intermediate variable and the shrinkage factor are aligned in time and multiplied to obtain the thresholded reconstructed signal.
3. The FISTA algorithm hardware acceleration method according to claim 1, characterized in that updating the auxiliary-variable search point based on the reconstructed signal and the combination coefficients obtained by the lookup-table method, and calculating the change in the reconstructed signal in parallel, includes: extracting combination coefficients from the preset momentum parameter table based on the current iteration index; and updating the auxiliary-variable search point using the combination coefficients and the reconstructed signal, while calculating the change in the reconstructed signal in parallel.
4. The FISTA algorithm hardware acceleration method according to claim 3, characterized in that, The preset momentum parameter table is pre-calculated and stored during the FISTA algorithm initialization phase.
5. The FISTA algorithm hardware acceleration method according to claim 3, characterized in that updating the auxiliary-variable search point using the combination coefficients and the reconstructed signal includes: $y_{k+1} = x_k + \beta_k (x_k - x_{k-1})$; where $x_k$ represents the reconstructed signal; $\beta_k$ represents the combination coefficient; $x_{k-1}$ represents the reconstructed signal from the previous iteration; and $y_{k+1}$ represents the auxiliary-variable search point.
6. The FISTA algorithm hardware acceleration method according to claim 1, characterized in that the data blocks are divided by row or by block, where a block includes multiple consecutive rows.
7. The FISTA algorithm hardware acceleration method according to claim 6, characterized in that performing the gradient accumulation operation based on the residual components and the temporarily stored current data block to update the gradient vector includes: $g_i = g_{i-1} + a_i^{H} r_i$; where $a_i$ represents the $i$-th row data of the sensing matrix; $g_i$ represents the updated gradient vector corresponding to the $i$-th row data; $g_{i-1}$ represents the updated gradient vector corresponding to the $(i-1)$-th row data; $r_i$ represents the $i$-th component of the residual vector $r$; and the superscript $H$ indicates the conjugate transpose operation.
8. A hardware acceleration device for the FISTA algorithm based on a single traversal of a matrix, characterized in that the FISTA algorithm hardware acceleration device includes: a read module, used to continuously read data blocks of the sensing matrix row by row from off-chip memory; a residual calculation unit, used to calculate the residual components based on the currently read data block and load the current data block into on-chip temporary storage; a gradient update unit, used to perform a gradient accumulation operation based on the residual components and the temporarily stored current data block to update the gradient vector, while the next data block is read from the off-chip memory and temporarily stored; the gradient update unit is also used to take the temporarily stored next data block as the current data block and return to the step of calculating the residual components based on the currently read data block, until all data blocks of the sensing matrix have been read and the complete gradient vector is obtained; a soft thresholding module, used to perform gradient descent using the complete gradient vector to generate intermediate variables of the sparse recovery signal, and to perform soft thresholding on these intermediate variables using a pipelined architecture to generate a thresholded reconstructed signal; and a convergence judgment module, used to update the auxiliary-variable search point based on the reconstructed signal and the combination coefficients obtained by the lookup-table method, calculate the change in the reconstructed signal in parallel, and determine whether to terminate the iteration by comparing the change with a preset threshold or by checking whether the number of iterations has reached the maximum number of iterations.
9. An electronic device, characterized in that, It includes a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus; Memory, used to store computer programs; A processor, when executing a computer program stored in memory, implements a hardware acceleration method for the FISTA algorithm based on a single traversal matrix as described in any one of claims 1-7.
10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program, which, when executed by a processor, implements a hardware acceleration method for the FISTA algorithm based on a single traversal matrix as described in any one of claims 1-7.