A hardware-accelerated loop processing system
The hardware-accelerated loop processing system optimizes instruction fetching by utilizing a hardware loop controller and loop buffer, thus solving the problem of excessive loop control overhead and improving the system's computational efficiency and performance. It is suitable for digital signal processing in embedded systems.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NORTHWESTERN POLYTECHNICAL UNIV
- Filing Date
- 2026-05-25
- Publication Date
- 2026-06-19
AI Technical Summary
The loop control overhead of the existing loop structure accounts for too high a proportion of the total execution time, which seriously reduces the computational efficiency and prevents the effective computing power of the system from being fully used for actual data processing, resulting in a decline in system performance.
A hardware-accelerated loop processing system is adopted, which coordinates other modules through a hardware loop controller, optimizes instruction fetching by using a loop buffer or cache, reduces the overhead of repeated instruction fetching of loop body instructions, and eliminates explicit loop control instructions.
The system improves the average IPC (Instructions Per Cycle) of control-intensive applications by 37%, reduces instruction fetching and branch prediction overhead, improves overall energy efficiency by 3.2 times, reduces fluctuations in loop execution time, and reduces code size by 15%, making it more suitable for hard real-time systems.
[Figure: CN122240183A abstract drawing]
Description
Technical Field
[0001] This invention relates to the field of digital signal processing technology, and more specifically, to a hardware-accelerated loop processing system.
Background Technology
[0002] In embedded systems and digital signal processing, loop structures are frequently used as core computational paradigms (such as finite impulse response filtering, convolution operations, and matrix operations). However, traditional software-based loop implementations rely on explicit instructions to perform operations such as incrementing or decrementing counters, determining loop termination conditions, and jumping to the beginning of the loop. These loop control instructions not only introduce additional time overhead but may also increase CPU cache usage in systems with limited cache capacity, thereby affecting the overall performance and execution efficiency of the program.
[0003] Existing loop structures require a counter comparison instruction to be executed in each iteration. This operation involves a register read, an Arithmetic Logic Unit (ALU) operation, and a status flag update. Taking ARM Cortex-M series processors as an example, a single comparison instruction consumes 1-2 clock cycles. For a loop that executes 1000 iterations, the conditional checks alone therefore generate up to approximately 2000 cycles of pure overhead. Loop control overhead thus accounts for an excessively high proportion of the total execution time, which severely reduces computational efficiency and prevents the system's effective computing power from being fully used for actual data processing, degrading system performance.
Summary of the Invention
[0004] The purpose of this invention is to provide a hardware-accelerated loop processing system that addresses the following technical problem: the loop control overhead of existing loop structures accounts for an excessively high proportion of the total execution time, which severely reduces computational efficiency and prevents the system's effective computing power from being fully used for actual data processing, degrading system performance. To this end, this invention adopts the following solution.
[0005] This invention provides a hardware-accelerated loop processing system, which is connected to an instruction decoding unit, an instruction receiving unit, and a branch prediction unit. The loop processing system includes a control module, a storage module, a loop counting module, and an instruction fetch address calculation module. The control module is configured as a hardware loop controller that coordinates the other modules and makes decisions based on the current processor state. The storage module stores the boundary addresses of the current loop. The loop counting module receives a decrement signal from the control module and an initial value from the instruction decoding unit as inputs; as its output, it feeds the current count value back to the control module in real time as the basis for the control module to determine whether the loop should continue. The instruction fetch address calculation module receives the candidate next-address sources as inputs, including the branch target address from the branch prediction unit and the loop start address from the storage module; as its output, it sends the selected next address to the instruction receiving unit. The hardware loop controller optimizes instruction fetching through a loop buffer or cache: when the loop is executed for the first time, the instructions are fetched and stored in the loop buffer or cache; in subsequent iterations, when it is determined that the loop should continue, the instruction fetch address calculation module is instructed to read the instructions from the loop buffer or cache.
[0006] Compared with the prior art, the hardware-accelerated loop processing system of the present invention is configured to connect an instruction decoding unit, an instruction receiving unit, and a branch prediction unit; the loop processing system is also configured with a control module, a storage module, a loop counting module, and an instruction fetch address calculation module; based on the structure of this loop processing system, the control module is configured as a hardware loop controller for coordinating the other modules and making decisions based on the current processor state; the storage module is used to store the boundary address of the current loop; the instruction fetch address calculation module receives the next address source, the branch target address from the branch prediction unit, and the loop start address from the storage module; and outputs the selected next address to the instruction receiving unit. Based on the above technical solution, the hardware loop controller optimizes instruction fetching through a loop buffer or cache. During the first execution of the loop, the instruction is fetched and stored in the loop buffer or cache. In subsequent iterations, when it is determined that the loop should continue, the instruction fetch address calculation module is instructed to read the instruction from the loop buffer or cache, thereby reducing the overhead of repeated instruction fetching in the loop body. In some embodiments of this invention, the hardware loop can improve the IPC (Instructions Per Cycle) of control-intensive applications by an average of 37%, reduce instruction fetching and branch prediction overhead, and improve overall energy efficiency by 3.2 times. By eliminating explicit loop control instructions, the code size is reduced by an average of 15%, and the fluctuation of loop execution time is reduced, making it more suitable for hard real-time systems. 
Through the above technical solution, this invention solves the problem that the loop control overhead of existing loop structures accounts for too high a proportion of the total execution time, which seriously reduces computational efficiency, prevents the system's effective computing power from being fully used for actual data processing, and degrades system performance.
[0007] Furthermore, in the hardware-accelerated loop processing system of the present invention, the storage module includes a start address register and an end address register; the loop processing system identifies and parses loop instructions through the instruction decoding unit; the start address register stores the address of the first instruction of the loop body, i.e., the start address; the end address register stores the address after the last instruction of the loop body, i.e., the end address.
[0008] Furthermore, in the hardware-accelerated loop processing system of the present invention, the input of the storage module is managed by the control module, and the loop parameters are written into it by the instruction decoding unit when the loop instruction is executed; the output of the storage module provides the stored start address and end address to the control module and the instruction fetch address calculation module.
[0009] Furthermore, in the hardware-accelerated loop processing system of the present invention, the instruction fetch address calculation module is provided with an instruction fetch unit, and the loop counting module is provided with a loop counter. In the initial state of the loop processing system, the instruction fetch unit fetches instructions from the address pointed to by the program counter. When a hardware loop instruction is fetched and decoded, the processor begins to configure the hardware loop unit: the operands of the instruction are decoded, the start address is written to the start address register, the end address is written to the end address register, and the number of iterations is loaded into the loop counter.
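The configuration step above can be sketched as a small register model. This is a minimal illustration, not the patented implementation: the class name `HardwareLoopUnit`, its field names, and the validity checks are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class HardwareLoopUnit:
    """Hypothetical register model of the loop unit (names are illustrative)."""
    start_addr: int = 0       # start address register: first instruction of the body
    end_addr: int = 0         # end address register: address after the last body instruction
    counter: int = 0          # loop counter
    configured: bool = False

    def configure(self, start: int, end: int, iterations: int) -> None:
        """Write the decoded operands of a loop instruction into the registers."""
        # The two checks below are assumptions; the source does not describe error handling.
        if end <= start:
            raise ValueError("end address must lie after the start address")
        if iterations < 0:
            raise ValueError("iteration count must be non-negative")
        self.start_addr = start
        self.end_addr = end
        self.counter = iterations
        self.configured = True


unit = HardwareLoopUnit()
unit.configure(start=0x100, end=0x110, iterations=8)
```

After `configure` returns, the three values named in the text (start address, end address, iteration count) are held in the unit and can be read by the control logic.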
[0010] Furthermore, in the hardware-accelerated loop processing system of the present invention, in each instruction fetch cycle, the control module compares the current program counter value with the start address in the storage module; the two match when they are equal. When the program counter matches the start address, the execution stream enters the first instruction of the loop body, the loop processing system is activated, and the loop is marked as started.
[0011] Furthermore, in the hardware-accelerated loop processing system of the present invention, the control module performs an address check and a count check simultaneously in each instruction fetch cycle. In the address check, the current program counter value is compared with the end address in the storage module to see whether they match. In the count check, the loop counter is checked to see whether it is greater than zero. When the current program counter matches the end address and the loop counter is greater than zero, the loop continues: the control module sends a decrement signal that decrements the loop counter by 1, and sends a redirection signal to the instruction fetch address calculation module.
[0012] Furthermore, in the hardware-accelerated loop processing system of the present invention, after receiving the redirection signal, the instruction fetch address calculation module selects the starting address provided by the storage module as the next instruction fetch address; in the next clock cycle, the instruction fetch unit will start fetching instructions again from the starting address and begin a new round of iteration.
[0013] Furthermore, in the hardware-accelerated loop processing system of the present invention, the loop ends when the value of the current program counter matches the value of the end address and the value of the loop counter is equal to zero.
[0014] Furthermore, in the hardware-accelerated loop processing system of the present invention, after the loop ends, the control module shuts down the loop processing system to return it to an idle state; The instruction fetching unit continues to fetch instructions sequentially after the end address.
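The continue/end decision described in the preceding paragraphs can be written as a pure function of the two checks. This is a hedged sketch: the `Decision` enum and the function name `check` are hypothetical, and real hardware would evaluate both comparisons combinationally rather than sequentially.

```python
from enum import Enum


class Decision(Enum):
    PASS_THROUGH = "pass_through"   # PC is not at the end address: nothing to do
    CONTINUE = "continue"           # redirect to the start address and decrement the counter
    END = "end"                     # loop finished: return to idle, fetch sequentially


def check(pc: int, end_addr: int, counter: int) -> Decision:
    """Combine the address check and the count check for one fetch cycle."""
    at_end = pc == end_addr     # address check
    nonzero = counter > 0       # count check
    if at_end and nonzero:
        return Decision.CONTINUE
    if at_end:
        return Decision.END
    return Decision.PASS_THROUGH
```

For example, `check(0x110, 0x110, 3)` yields `Decision.CONTINUE`, while the same address with a counter of 0 yields `Decision.END`.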
[0015] Furthermore, in the hardware-accelerated loop processing system of the present invention, the instruction decoding unit serves as an interface for configuring the loop processing system; when the decoder in the instruction decoding unit recognizes a hardware loop instruction, it generates a control signal and writes the loop start address, end address, and iteration count carried in the instruction into the storage module and the loop counting module.
Attached Figure Description
[0016] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this invention, illustrate exemplary embodiments of the invention and are used to explain the invention, but do not constitute an undue limitation of the invention. In the drawings: Figure 1 is a schematic diagram of the structure of a hardware-accelerated loop processing system according to the present invention; Figure 2 is a schematic diagram of the processing flow of a hardware-accelerated loop processing system according to the present invention.
Detailed Implementation
[0017] To make the technical problems to be solved, the technical solutions, and the beneficial effects of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit the present invention.
[0018] It should be noted that when a component is referred to as being "fixed to" or "set on" another component, it can be directly on or indirectly on that other component. When a component is referred to as being "connected to" another component, it can be directly connected to or indirectly connected to that other component.
[0019] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this invention, "a plurality of" means two or more, unless otherwise explicitly specified. "Several" means one or more, unless otherwise explicitly specified.
[0020] Existing loop structures require a counter comparison instruction to be executed in each iteration. This operation involves a register read, an Arithmetic Logic Unit (ALU) operation, and a status flag update. Taking ARM Cortex-M series processors as an example, a single comparison instruction in current technology consumes 1-2 clock cycles. For a loop that executes 1000 iterations, the conditional checks alone therefore generate up to approximately 2000 cycles of pure overhead. Loop control overhead thus accounts for an excessively high proportion of the total execution time, which severely reduces computational efficiency and prevents the system's effective computing power from being fully used for actual data processing, degrading system performance.
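The overhead estimate above is simple arithmetic and can be checked directly. In the sketch below, the function names are illustrative, and the 4-cycle loop body used for the overhead share is a hypothetical value chosen only to show the calculation; it does not come from the source.

```python
def control_overhead_cycles(iterations: int, cycles_per_check: int) -> int:
    """Cycles spent only on the per-iteration comparison."""
    return iterations * cycles_per_check


def overhead_fraction(body_cycles: int, control_cycles: int) -> float:
    """Share of each iteration spent on loop control rather than useful work."""
    return control_cycles / (body_cycles + control_cycles)


low = control_overhead_cycles(1000, 1)    # lower bound: 1 cycle per comparison
high = control_overhead_cycles(1000, 2)   # upper bound: 2 cycles per comparison
# Hypothetical example: a 4-cycle body with 2 cycles of control per iteration.
share = overhead_fraction(body_cycles=4, control_cycles=2)
```

With 1 to 2 cycles per comparison, 1000 iterations cost between 1000 and 2000 cycles for the checks alone; for the assumed 4-cycle body, one third of every iteration would be loop control.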
[0021] To address the above technical problems, this invention provides a hardware-accelerated loop processing system, which is connected to an instruction decoding unit, an instruction receiving unit, and a branch prediction unit. The loop processing system includes a control module, a storage module, a loop counting module, and an instruction fetch address calculation module, wherein: the control module is configured as a hardware loop controller that coordinates the other modules and makes decisions based on the current processor state; the storage module stores the boundary addresses of the current loop; the inputs to the loop counting module are a decrement signal from the control module and an initial value from the instruction decoding unit, and its output feeds the current count value back to the control module in real time as the basis for the control module to determine whether the loop should continue; the inputs to the instruction fetch address calculation module are the candidate next-address sources, including the branch target address from the branch prediction unit and the loop start address from the storage module, and its output is the selected next address, which is sent to the instruction receiving unit. The hardware loop controller optimizes instruction fetching through a loop buffer or cache: when the loop is executed for the first time, the instructions are fetched and stored in the loop buffer or cache; in subsequent iterations, when it is determined that the loop should continue, the instruction fetch address calculation module is instructed to read the instructions from the loop buffer or cache.
[0022] In the case of the above technical solution, the hardware-accelerated loop processing system of the present invention is configured to connect an instruction decoding unit, an instruction receiving unit, and a branch prediction unit; the loop processing system is also configured with a control module, a storage module, a loop counting module, and an instruction fetch address calculation module; based on the structure of the loop processing system, wherein: the control module is configured as a hardware loop controller, used to coordinate the other modules and make decisions based on the current processor state; the storage module is used to store the boundary address of the current loop; the instruction fetch address calculation module receives the next address source, the branch target address from the branch prediction unit, and the loop start address from the storage module; and outputs the selected next address to the instruction receiving unit. Based on the above technical solution, the hardware loop controller optimizes instruction fetching through a loop buffer or cache. During the first execution of the loop, the instruction is fetched and stored in the loop buffer or cache. In subsequent iterations, when it is determined that the loop should continue, the instruction fetch address calculation module is instructed to read the instruction from the loop buffer or cache, thereby reducing the overhead of repeated instruction fetching in the loop body. In some embodiments of this invention, the hardware loop can improve the IPC (Instructions Per Cycle) of control-intensive applications by an average of 37%, reduce instruction fetching and branch prediction overhead, and improve overall energy efficiency by 3.2 times. By eliminating explicit loop control instructions, the code size is reduced by an average of 15%, and the fluctuation of loop execution time is reduced, making it more suitable for hard real-time systems. 
Through the above technical solution, this invention solves the problem that the loop control overhead of existing loop structures accounts for too high a proportion of the total execution time, which seriously reduces computational efficiency, prevents the system's effective computing power from being fully used for actual data processing, and degrades system performance.
[0023] To better understand the present invention, the following specific embodiments further illustrate the content of the present invention, but the content of the present invention is not limited to the following embodiments.
[0024] Example 1. This embodiment provides a hardware-accelerated loop processing system, which is connected to an instruction decoding unit, an instruction receiving unit, and a branch prediction unit. The loop processing system includes a control module, a storage module, a loop counting module, and an instruction fetch address calculation module, wherein: the instruction decoding unit serves as the interface for configuring the loop processing system; when the decoder in the instruction decoding unit recognizes a hardware loop instruction, it generates a control signal and writes the loop start address, end address, and iteration count carried in the instruction into the storage module and the loop counting module. The control module is configured as a hardware loop controller that coordinates the other modules and makes decisions based on the current processor state. The storage module stores the boundary addresses of the current loop. Furthermore, the storage module includes a start address register and an end address register; the loop processing system identifies and parses dedicated loop instructions through the instruction decoding unit; the start address register stores the address of the first instruction of the loop body, i.e., the start address; the end address register stores the address after the last instruction of the loop body, i.e., the end address.
[0025] Furthermore, the input of the storage module is managed by the control module, and the loop parameters are written into it through the instruction decoding unit when the loop instruction is executed; the output of the storage module provides the stored start and end addresses to the control module and the instruction fetch address calculation module. Furthermore, the instruction fetch address calculation module is equipped with an instruction fetch unit, and the loop counting module is equipped with a loop counter. In the initial state of the loop processing system, the instruction fetch unit fetches instructions from the address pointed to by the program counter. When a hardware loop instruction is fetched and decoded, the processor begins to configure the hardware loop unit: the operands of the instruction are decoded, the start address is written to the start address register, the end address is written to the end address register, and the number of iterations is loaded into the loop counter. Furthermore, in each instruction fetch cycle, the control module compares the current program counter (PC) value (i.e., the address of the instruction currently being fetched) with the start address in the start address register; the two match when they are equal. When the program counter matches the start address, the execution stream enters the first instruction of the loop body, the loop processing system is activated, and the loop is marked as started. Furthermore, the control module performs an address check and a count check simultaneously. In the address check, the current program counter value is compared with the end address in the storage module to see whether they match. In the count check, the loop counter is checked to see whether it is greater than zero. When the current program counter matches the end address and the loop counter is greater than zero, the loop continues: the control module sends a decrement signal that decrements the loop counter by 1, and sends a redirection signal to the instruction fetch address calculation module. The inputs to the loop counting module are a decrement signal from the control module and an initial value from the instruction decoding unit; its output feeds the current count value back to the control module in real time as the basis for the control module to determine whether the loop should continue. The inputs to the instruction fetch address calculation module are the candidate next-address sources, including the branch target address from the branch prediction unit and the loop start address from the storage module; its output is the selected next address, which is sent to the instruction receiving unit. Furthermore, after receiving the redirection signal, the instruction fetch address calculation module selects the start address provided by the storage module as the next instruction fetch address; in the next clock cycle, the instruction fetch unit starts fetching from the start address again and begins a new iteration. Furthermore, the loop ends when the current program counter matches the end address and the loop counter is zero. After the loop ends, the control module shuts down the loop processing system and returns it to an idle state, and the instruction fetch unit continues to fetch instructions sequentially from the end address onward. The hardware loop controller optimizes instruction fetching through the loop buffer or cache: when the loop is first executed, the instructions are fetched and stored in the loop buffer or cache; in subsequent iterations, when it is determined that the loop will continue, the instruction fetch address calculation module is instructed to read instructions from the loop buffer or cache.
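The effect of the loop buffer can be illustrated by counting where each fetch is served from. This is a minimal sketch under stated assumptions: the buffer is modelled as a dictionary keyed by address, instructions are 4 bytes wide, and the mnemonics in `memory` are placeholders. The parameter `passes` is the number of times the body is traversed.

```python
def run_loop(memory: dict, start: int, end: int, passes: int, step: int = 4):
    """Traverse the loop body `passes` times and count fetch sources.

    Returns (memory_fetches, buffer_hits). The first pass fills the
    hypothetical loop buffer; later passes read from it.
    """
    loop_buffer = {}
    memory_fetches = 0
    buffer_hits = 0
    for _ in range(passes):
        pc = start
        while pc < end:                      # `end` is the address after the last body instruction
            if pc in loop_buffer:
                buffer_hits += 1             # served from the loop buffer
            else:
                loop_buffer[pc] = memory[pc] # first execution: fetch and store
                memory_fetches += 1
            pc += step
    return memory_fetches, buffer_hits


memory = {0x100: "LOAD", 0x104: "MAC", 0x108: "ADD", 0x10C: "STORE"}
fetches, hits = run_loop(memory, start=0x100, end=0x110, passes=5)
```

For a four-instruction body traversed five times, only the four fetches of the first pass reach instruction memory; the remaining sixteen are served from the buffer.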
[0026] Example 2. Referring to Figure 1, this embodiment provides a hardware-accelerated loop processing system, which includes a control module, a storage module, a loop counting module, and an instruction fetch address calculation module, wherein: the control module is built around a hardware loop controller, which is responsible for coordinating the work of all other modules and making decisions based on the current processor state. The core of the controller is a state machine (for example, with IDLE / ACTIVE / END states, i.e., idle / active / end), which switches states by monitoring pipeline signals (such as the program counter value and the loop end flag).
[0027] The connections of the control module are as follows. Input: it receives the current program counter (PC) value from the instruction fetch stage; receives the current count value (whether it is 0) from the loop counting module; and receives the loop end address from the storage module.
[0028] Output: it sends a redirection signal and the loop start address to the instruction fetch address calculation module; sends a decrement signal to the loop counting module; and manages the enabled and disabled status of the entire loop processing system.
[0029] The integrated condition judgment logic of the control module is as follows: continuously compare the value of the current program counter with the value of the end address in the end address register. If they match and the counter is not zero, a redirection is triggered, and the counting module is notified to decrement.
[0030] Furthermore, the storage module described above acts as the "memory" of the loop unit, specifically used to store the boundary addresses of the currently active loop. The storage module typically contains two dedicated registers: a start address register and an end address register. The start address register stores the address of the first instruction in the loop body, i.e., the start address; the end address register stores the address after the last instruction in the loop body, i.e., the end address.
[0031] Regarding the connections of the storage module: its input is managed by the control module and is written through the decoding unit when a LOOP_SETUP-type instruction is executed; its output continuously provides the stored start address and end address to the control module and the instruction fetch address calculation module.
[0032] The loop counting module is described next. The loop counting module acts as a "countdown timer" for the loop, automatically decrementing by 1 after each iteration. It contains a dedicated loop counter register. Regarding its connections: its inputs are the decrement signal from the control module and the initial value from the decoding unit (loaded by the LOOP_SETUP instruction); its output feeds the current count value (in particular the "is zero" status signal) back to the control module in real time, as the core basis for judging whether the loop should continue.
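A countdown register with an "is zero" status output can be modelled in a few lines. The class and method names below are hypothetical, and the choice to ignore a decrement when the counter is already zero is an assumption; the source does not say how that case is handled.

```python
class LoopCounter:
    """Hypothetical model of the dedicated loop counter register."""

    def __init__(self) -> None:
        self.value = 0

    def load(self, initial: int) -> None:
        """Initial value from the decoding unit (loaded by LOOP_SETUP)."""
        self.value = initial

    def decrement(self) -> None:
        """Decrement signal from the control module."""
        if self.value > 0:      # assumption: a decrement at zero is ignored
            self.value -= 1

    @property
    def is_zero(self) -> bool:
        """Status signal fed back to the control module."""
        return self.value == 0


counter = LoopCounter()
counter.load(3)
for _ in range(3):
    counter.decrement()
```

After three decrement signals, a counter loaded with 3 reports `is_zero` as true, which is the signal the control module uses to end the loop.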
[0033] The instruction fetch address calculation module is described next. This module acts as the "traffic controller" for the processor's instruction fetch address, determining where the next instruction will be fetched from. Essentially, it is a multiplexer (MUX). Its inputs are multiple possible next-address sources: the normal sequential address (usually PC+4; most current RISC (Reduced Instruction Set Computer) processors use 32-bit, i.e., 4-byte, fixed-length instructions and are byte-addressed, so during sequential execution the address of the next instruction is the current instruction address + 4), the branch target address from the branch prediction unit, and the loop start address from the storage module (when the loop continues). Its output is the selected next address, which is sent to the instruction receiving unit (i.e., the instruction memory or instruction cache). The module is controlled by a redirection signal issued by the control module (the hardware loop controller). This signal has the highest priority: when it is active, the module selects the loop start address as the next instruction fetch address and ignores all other inputs.
[0034] Furthermore, as can be seen from Figure 1, the address sending unit (i.e., the unit that implements the instruction fetch address calculation module) is responsible for generating the next address to be accessed in the instruction memory; its inputs are the sequential address PC+4, the branch prediction address, and the start address in the start address register; its output is the final selected address, which is sent to the instruction memory.
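The address selection described above is a priority multiplexer, which can be sketched as follows. The function name and the use of `None` to mean "no predicted branch this cycle" are assumptions made for the illustration; the 4-byte instruction width follows the PC+4 convention stated in the text.

```python
from typing import Optional


def next_fetch_address(pc: int,
                       branch_target: Optional[int],
                       loop_start: int,
                       redirect: bool,
                       instr_bytes: int = 4) -> int:
    """Select the next fetch address from the three candidate sources."""
    if redirect:                    # highest priority: loop redirection signal
        return loop_start
    if branch_target is not None:   # predicted branch target, if any
        return branch_target
    return pc + instr_bytes         # default: sequential address (PC+4)
```

For instance, with `redirect=True` the function returns `loop_start` even when a branch target is also present, matching the rule that the redirection signal overrides all other inputs.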
[0035] Furthermore, the aforementioned instruction memory is a hardware component that stores all instruction codes. Its function is to return the instruction data corresponding to a given address. Its input is the address signal from the address sending unit; its output is the instruction data stream corresponding to that address. The instruction receiving / unpacking unit receives the raw instruction word sent from the instruction memory and splits it into individual instructions that can be processed by the decoder (this process is particularly important when processing compressed instruction sets). Its input is the raw instruction data from the instruction memory; its output is the organized, complete single instruction, which is sent to the decoding unit. The execution unit is responsible for executing the actual calculation operations required by the instruction. It may include an arithmetic logic unit, multiplier, shifter, load / store address calculator, etc. The input of the execution unit is the control signals and operands from the decoding unit; the output of the execution unit is writing the calculation result back to the register file or sending it to the subsequent stage.
[0036] Based on the structure of the above loop processing system, the loop processing system is integrated with the pipeline as follows. The hardware loop logic is integrated into the instruction fetch stage to avoid affecting the critical path of the execution stage. The instruction receiving unit and instruction unpacking unit receive the address output by the instruction fetch address calculation module and fetch the instruction from memory; this is the point at which instructions enter the pipeline and the target of loop redirection. The instruction decoding unit is the key interface for configuring the loop unit: when the decoder recognizes a dedicated hardware loop instruction (such as LOOP_SETUP or LOOP_START), it generates the corresponding control signal and writes the loop start address, end address, and iteration count carried in the instruction into the storage module and loop counting module of the loop processing system, thereby completing loop initialization. The relationship between the execution unit and the loop unit is indirect: the execution unit no longer needs to execute loop counter update, comparison, and conditional jump instructions, so it can focus on the core computation instructions in the loop body, which improves efficiency. Here, LOOP means loop, SETUP means setting up or configuring, and START means starting or launching.
[0037] Furthermore, this embodiment also provides instructions for hardware looping. To support hardware looping, new instructions are needed to configure and manage loop parameters. Loop initialization instruction: LOOP_SETUP Rs, Re, Rn (Rs = start address, Re = end address, Rn = iteration count). Function: writes the loop start address Rs to the start address register of the storage module; writes the end address Re (usually the address of the last loop body instruction + 4) to the end address register; loads the iteration count Rn into the counter of the loop counter module; and switches the control module to the ARMED (ready) state, waiting for the loop to start. Hardware action: if a loop buffer (LOOP_BUF) exists, the loop body instructions are prefetched and cached.
[0038] Loop start instruction: LOOP_START. Function: activates the control module, switching its state to ACTIVE, and notifies the instruction fetch unit that the next instruction is to be fetched from the start address (first iteration). Hardware action: the loop counter module locks the initial value to prevent accidental modification.
[0039] Loop end address marker instruction: LOOP_END. Function: serves as a hardware marker indicating the end of the loop body, without generating actual machine code (similar to a pseudo-instruction), and informs the hardware that the loop control logic is triggered when the PC points to this address. Hardware action: when the PC reaches this address, the control module automatically checks the counter: if the counter > 0, a redirection signal is sent to the instruction fetch unit; if the counter = 0, the loop unit is deactivated and normal execution resumes.
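The three instructions above imply a small state machine in the control module. The following is a minimal behavioral sketch of it, not the patented circuit: the class and method names are hypothetical, the IDLE state name is taken from the "idle state" mentioned in the flow description, rejecting LOOP_START outside the ARMED state is an assumption, and the decrement on redirection follows the processing flow of this embodiment rather than the LOOP_END text itself.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LoopState(Enum):
    IDLE = "IDLE"
    ARMED = "ARMED"
    ACTIVE = "ACTIVE"


@dataclass
class LoopUnit:
    start_addr: int = 0
    end_addr: int = 0
    counter: int = 0
    state: LoopState = LoopState.IDLE

    def loop_setup(self, start: int, end: int, count: int) -> None:
        """LOOP_SETUP Rs, Re, Rn: load the registers and arm the unit."""
        self.start_addr, self.end_addr, self.counter = start, end, count
        self.state = LoopState.ARMED

    def loop_start(self) -> int:
        """LOOP_START: activate and return the address to fetch next."""
        if self.state is not LoopState.ARMED:
            # Assumption: the source does not say what happens here.
            raise RuntimeError("LOOP_START issued before LOOP_SETUP")
        self.state = LoopState.ACTIVE
        return self.start_addr

    def at_pc(self, pc: int) -> Optional[int]:
        """LOOP_END check: return a redirect address, or None."""
        if self.state is not LoopState.ACTIVE or pc != self.end_addr:
            return None
        if self.counter > 0:
            self.counter -= 1       # decrement as in the processing flow
            return self.start_addr  # redirect the fetch unit
        self.state = LoopState.IDLE  # counter == 0: resume normal execution
        return None
```

For example, after `loop_setup(0x100, 0x10C, 2)` and `loop_start()`, calling `at_pc(0x10C)` redirects to `0x100` twice and then returns `None` while the unit falls back to `IDLE`.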
[0040] These instructions can be customized based on different instruction sets, allowing programmers or compilers to directly control hardware loops without manually managing loop counts and branch jumps, thereby reducing the number of instructions and branch prediction overhead.
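The reduction in instruction count can be estimated with simple arithmetic. The sketch below is illustrative only and is not the source of any figure reported in this document: it assumes three control instructions per iteration in the software loop (counter update, comparison, conditional jump) and two one-time instructions for the hardware loop (LOOP_SETUP and LOOP_START, since LOOP_END generates no machine code), and it ignores any instructions needed to load Rs, Re, and Rn. The function names are hypothetical.

```python
def software_loop_fetches(body_len: int, iterations: int,
                          control_len: int = 3) -> int:
    """Instructions fetched when every iteration carries its own
    counter update, comparison, and conditional jump (assumed 3)."""
    return iterations * (body_len + control_len)


def hardware_loop_fetches(body_len: int, iterations: int,
                          setup_len: int = 2) -> int:
    """Instructions fetched when loop control is done in hardware:
    a one-time setup cost (assumed LOOP_SETUP + LOOP_START) plus
    the loop body itself on every iteration."""
    return setup_len + iterations * body_len


if __name__ == "__main__":
    # 4-instruction body, 100 iterations
    print(software_loop_fetches(4, 100))  # 700
    print(hardware_loop_fetches(4, 100))  # 402
```

Under these assumptions the saving grows with the ratio of control instructions to body instructions, so short loop bodies benefit the most.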
[0041] Further, referring to Figure 2, the processing flow of the hardware-accelerated loop processing system in this embodiment is as follows. In the initial state, the instruction fetch unit operates normally, fetching instructions from the address pointed to by the program counter (PC). The loop processing system is inactive, and its control logic does not interfere with the selection of the fetch address. The processor executes instructions sequentially until it encounters an instruction that configures the hardware loop. When a LOOP_SETUP instruction is fetched and decoded, the processor begins configuring the hardware loop unit. The operands of the LOOP_SETUP Rs, Re, Rn instruction are decoded: the start address (Rs) is written to the start address register of the loop unit; the end address (Re) is written to the end address register; and the iteration count (Rn) is loaded into the loop counter. The loop unit then enters the ARMED state. The instruction fetch unit continues to operate normally, fetching instructions one by one. In each fetch cycle, the control module compares the current PC with the start address (Rs) held in the storage module. When PC == Rs, the execution flow has entered the first instruction of the loop body and the loop unit is activated. No additional operation is required at this point; the match simply marks the start of the loop. The execution of the loop body is a zero-overhead phase. Specifically, the instruction fetch unit continuously fetches the loop body instructions from the instruction cache or memory, and the execution unit processes only the core computational instructions within the loop body; no counter update or condition check instructions are fetched or executed, which saves bandwidth and power.
In each fetch cycle, the control module performs two checks simultaneously: an address check, which compares the current PC with the end address (Re), and a count check, which tests whether the loop counter value is greater than zero. When PC == Re, an iteration is about to complete and the hardware decides what to do next. If PC == Re and the counter > 0, the loop has not yet ended. The control module sends a decrement signal, which decrements the loop counter by 1, and also sends a redirection signal, a high-priority control signal, to the instruction fetch address calculation module (a multiplexer, MUX). Upon receiving the redirection signal, the instruction fetch address calculation module ignores the normal PC+4 address and the address provided by the branch prediction unit and instead selects the start address (Rs) provided by the storage module as the next instruction fetch address. In the next clock cycle, the instruction fetch unit resumes fetching from the start address (the beginning of the loop), beginning a new iteration. If PC == Re and the counter == 0, the loop ends. The control module deactivates the loop unit, returning it to the idle state, and the instruction fetch unit continues to fetch instructions sequentially after the end address (i.e., the code after the loop body).
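The fetch-stage flow described above can be traced with a short behavioral simulation. This is a sketch under stated assumptions rather than the embodiment's logic: instructions are assumed to be 4 bytes wide, LOOP_SETUP is assumed to have been decoded already, the cycle in which PC == Re with a nonzero counter is assumed to deliver no instruction, and the names `select_next_pc` and `run_fetch` are hypothetical.

```python
from typing import List, Optional


def select_next_pc(pc_plus_4: int, branch_target: Optional[int],
                   loop_start: int, redirect: bool) -> int:
    """Next-address MUX: the loop redirection signal has priority over
    the branch prediction target and the sequential PC+4 address."""
    if redirect:
        return loop_start
    if branch_target is not None:
        return branch_target
    return pc_plus_4


def run_fetch(start: int, end: int, count: int,
              first_pc: int, last_pc: int) -> List[int]:
    """Return the addresses fetched from first_pc through last_pc."""
    trace = []
    pc, counter = first_pc, count
    armed, active = True, False  # assumed: LOOP_SETUP already decoded
    while pc <= last_pc:
        if armed and pc == start:
            armed, active = False, True  # PC == Rs marks loop entry
        redirect = False
        if active and pc == end:         # address check
            if counter > 0:              # count check
                counter -= 1             # decrement signal
                redirect = True          # redirection signal
            else:
                active = False           # loop ends, unit goes idle
        if not redirect:
            trace.append(pc)             # instruction at pc is delivered
        pc = select_next_pc(pc + 4, None, start, redirect)
    return trace
```

With a two-instruction body at `0x104` to `0x108`, an end address of `0x10C`, and a counter loaded with 2, the trace passes through the body three times before continuing at `0x10C`. Under this literal reading of the two checks, a counter value of n gives n + 1 passes, so a toolchain wanting exactly n passes would have to account for that when loading Rn; the source does not address this point.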
[0042] Furthermore, to fully leverage the advantages of hardware loops, this embodiment extends the compiler to automatically detect loop structures, optimize loop boundaries, support nested loops, and cooperate with the prefetch buffer. Automatic loop structure detection converts suitable for/while loops into hardware loop instructions. Loop boundary optimization covers dynamic loop counts and loop body alignment: when the loop count is determined at runtime, the compiler generates code that loads the count into the loop counter; and the loop start address is aligned to the instruction cache boundary, avoiding the additional latency caused by compressed (16-bit) instructions. For nested loops, since the hardware supports nested loops, the compiler prioritizes mapping the innermost loop to the hardware loop, because it executes most frequently and therefore yields the largest performance gain. Cooperation with the prefetch buffer combines the hardware loop with a level-zero instruction buffer to reduce the overhead of repeatedly fetching the loop body instructions.
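Two of the compiler-side steps, choosing the innermost loop and aligning the loop start address, can be sketched as follows. The `Loop` tree representation and both function names are hypothetical, and the alignment boundary is assumed to be a power of two; the source does not describe the compiler's internal data structures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Loop:
    name: str
    children: List["Loop"] = field(default_factory=list)


def innermost_loops(loop: Loop) -> List[str]:
    """Return the names of the loops that contain no nested loop.
    These are the first candidates for mapping to the hardware loop."""
    if not loop.children:
        return [loop.name]
    result = []
    for child in loop.children:
        result.extend(innermost_loops(child))
    return result


def align_up(addr: int, boundary: int) -> int:
    """Round addr up to the next multiple of boundary
    (boundary is assumed to be a power of two)."""
    return (addr + boundary - 1) & ~(boundary - 1)
```

For a nest `i { j { k }, m }`, the candidates are `k` and `m`; a loop body that would otherwise start at `0x102` is moved to `0x104` when a 4-byte boundary is used.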
[0043] In the description of the above embodiments, specific features, structures, materials, or characteristics may be combined in any suitable manner in one or more embodiments or examples.
[0044] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A hardware acceleration-based loop processing system, characterized in that, The loop processing system is connected to the instruction decoding unit, the instruction receiving unit, and the branch prediction unit; the loop processing system includes: The control module is configured as a hardware loop controller to coordinate the other modules and make decisions based on the current processor state. The storage module is used to store the boundary addresses of the current loop. The loop counting module receives a decrement signal from the control module and an initial value from the instruction decoding unit as input; its output feeds back the current count value to the control module in real time as the basis for the control module to determine whether the loop should continue. The instruction fetch address calculation module receives the following inputs: the source of the next address, the branch target address from the branch prediction unit, and the loop start address from the storage module; its output is the selected next address, which is sent to the instruction receiving unit. The hardware loop controller optimizes instruction fetching through a loop buffer or cache. When the loop is executed for the first time, the instruction is fetched and stored in the loop buffer or cache. In subsequent iterations, when it is determined that the loop should continue, the instruction fetch address calculation module is instructed to read the instruction from the loop buffer or cache.
2. The hardware acceleration-based loop processing system of claim 1, wherein, The storage module includes a start address register and an end address register; The loop processing system identifies and parses loop instructions through an instruction decoding unit; The start address register is used to store the address of the first instruction of the loop body, and the address of the first instruction is the starting address; The end address register is used to store the address after the last instruction of the loop body, and the address after the last instruction is the end address.
3. The hardware acceleration-based loop processing system of claim 2, wherein, The input to the storage module is managed by the control module, and the loop parameters are written into the storage module through the instruction decoding unit when the instruction is executed; The output of the storage module provides the stored start address and end address to the control module and the instruction fetch address calculation module.
4. The hardware acceleration-based loop processing system of claim 3, wherein, The instruction fetch address calculation module is equipped with an instruction fetch unit; the loop counting module is equipped with a loop counter; In the initial state of the loop processing system, the instruction fetch unit fetches an instruction from the address pointed to by the program counter. When the instruction fetch unit fetches and decodes an instruction, the processor begins to configure the hardware loop unit. The operands of the instruction are decoded, the start address is written to the start address register, the end address is written to the end address register, and the number of iterations is loaded into the loop counter.
5. The hardware-accelerated loop processing system according to claim 4, characterized in that, In each instruction fetch cycle, the control module compares the current program counter value with the starting address value in the storage module; a match is indicated when the current program counter value is equal to the starting address value. When the value of the program counter matches the value of the starting address, the execution flow enters the first instruction of the loop body, the loop processing system is activated, and the loop is marked as started.
6. The hardware-accelerated loop processing system according to claim 5, characterized in that, During each instruction fetch cycle, the control module performs both address checks and count checks simultaneously. During the address check, the value of the current program counter is compared with the value of the end address in the storage module to see if they match. During the counting check, the value of the loop counter is checked to see if it is greater than zero; When the current program counter value matches the end address value and the loop counter value is greater than zero, the loop continues. The control module sends a decrement signal to automatically decrement the loop counter by 1. The control module sends a redirection signal to the instruction fetch address calculation module.
7. The hardware-accelerated loop processing system according to claim 6, characterized in that, After receiving the redirection signal, the instruction fetch address calculation module selects the starting address provided by the storage module as the next instruction fetch address; in the next clock cycle, the instruction fetch unit will start fetching from the starting address again and begin a new round of iteration.
8. The hardware-accelerated loop processing system according to claim 7, characterized in that, The loop ends when the current program counter value matches the end address value and the loop counter value is zero.
9. The hardware-accelerated loop processing system according to claim 8, characterized in that, After the loop ends, the control module shuts down the loop processing system and returns it to an idle state. The instruction fetch unit continues to fetch instructions sequentially after the end address.
10. The hardware-accelerated loop processing system according to claim 1, characterized in that, The instruction decoding unit serves as an interface for configuring the loop processing system. When the decoder in the instruction decoding unit recognizes a hardware loop instruction, it generates a control signal and writes the loop start address, end address, and iteration count carried in the instruction into the storage module and the loop counting module.