Nested loop control
Combining a nested loop controller with a compiler resolves the difficulty of pipelining nested loops on a DSP and improves the execution efficiency of nested loops, in particular the pipelined execution efficiency of imperfect nested loops.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TEXAS INSTRUMENTS INC
- Filing Date
- 2020-05-22
- Publication Date
- 2026-06-19
Abstract
Description
Technical Field
[0001] The embodiments of this application relate to nested loop control.
Background Technology
[0002] Modern digital signal processors (DSPs) face multiple challenges. DSPs typically execute software containing nested loops, which consist of an inner loop and one or more outer loops. To improve DSP performance, some instructions can be pipelined, so that multiple instructions are executed simultaneously by different functional units of the DSP. However, pipelining nested loops makes it difficult to determine efficiently whether to execute the instructions associated with one or more outer loops (e.g., to determine the assertions for those instructions in an efficient manner).
Summary of the Invention
[0003] According to at least one embodiment of this disclosure, a nested loop controller includes: a first register having a first value, initialized to an initial first value; a second register having a second value, initialized to an initial second value; and a third register configured as an assertion FIFO, initialized to have a third value. The third value includes a first bit equal to an outer loop indicator value. During loop execution, the second value is advanced in response to a tick instruction. In response to the second value reaching a second threshold, the second register is reset to the initial second value. The nested loop controller further includes a comparator coupled to the second register and the assertion FIFO, and configured to: provide the outer loop indicator value as input to the assertion FIFO when the second value equals the second threshold; and provide an inner loop indicator value as input to the assertion FIFO when the second value does not equal the second threshold.
[0004] According to another embodiment of this disclosure, a method includes: initializing a first register having an associated first value to an initial first value; initializing a second register having an associated second value to an initial second value; initializing a third register having an associated third value to an initial third value, wherein the third register is configured as an assertion first-in-first-out buffer (FIFO); during loop execution, advancing the second value in response to a tick instruction; resetting the second register to the initial second value in response to the second value reaching a second threshold; providing an outer loop indicator value as input to the assertion FIFO when the second value is equal to the second threshold; and providing an inner loop indicator value as input to the assertion FIFO when the second value is not equal to the second threshold.
Attached Figure Description
[0005] For a detailed description of each example, reference will now be made to the accompanying drawings, in which:
[0006] Figure 1 shows a dual scalar / vector data path processor according to various examples;
[0007] Figure 2 shows the registers and functional units in the dual scalar / vector data path processor shown in Figure 1, according to various examples;
[0008] Figure 3 shows an exemplary global scalar register file;
[0009] Figure 4 shows an exemplary local scalar register file shared by the arithmetic functional units;
[0010] Figure 5 shows an exemplary local scalar register file shared by the multiplication functional units;
[0011] Figure 6 shows an exemplary local scalar register file shared by the load / store units;
[0012] Figure 7 shows an exemplary global vector register file;
[0013] Figure 8 shows an exemplary assertion register file;
[0014] Figure 9 shows an exemplary local vector register file shared by the arithmetic functional units;
[0015] Figure 10 shows an exemplary local vector register file shared by the multiplication and related functional units;
[0016] Figure 11 shows the pipeline stages of the central processing unit according to various examples;
[0017] Figure 12 shows the sixteen instructions of a single fetch packet according to various examples;
[0018] Figures 13A and 13B show schematic diagrams of a nested loop controller according to various examples;
[0019] Figures 14A-1, 14A-2, 14B and 14C show exemplary pseudo-assembly code instructions obtained by compiling and merging nested loops according to various examples; and
[0020] Figure 15 shows a block diagram of an exemplary system for compiling nested loops according to various examples.
Detailed Implementation
[0021] In many typical DSP applications, loops account for a large portion of the cycle or MIPS count, and loop performance can therefore affect overall application performance. As explained above, DSPs typically execute software containing nested loops, which consist of an inner loop and one or more outer loops. Common examples include finite impulse response (FIR) and infinite impulse response (IIR) filters, fast Fourier transforms (FFTs), and visual code. A perfect nested loop is one in which all instructions to be executed are contained in the inner loop, and the outer loops contain no instructions other than the loops nested within them. An imperfect nested loop is one in which one or more outer loops also contain instructions to be executed.
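The distinction between perfect and imperfect nested loops can be illustrated with a short example. The following is a minimal sketch in Python, chosen only for readability; the function names `perfect_sum` and `imperfect_row_sums` are hypothetical and do not appear in this disclosure.

```python
def perfect_sum(matrix):
    """Perfect nest: the outer loop body contains only the inner loop."""
    total = 0
    for row in matrix:          # outer loop: no statements of its own
        for value in row:       # inner loop: all of the work happens here
            total += value
    return total


def imperfect_row_sums(matrix):
    """Imperfect nest: the outer loop also has statements to execute."""
    sums = []
    for row in matrix:          # outer loop
        acc = 0                 # outer-loop statement, runs once per row
        for value in row:       # inner loop
            acc += value
        sums.append(acc)        # outer-loop statement, runs once per row
    return sums
```

In the second function, the statements that run once per row are the kind of outer-loop instructions whose execution must be decided cycle by cycle once the nest is pipelined.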
[0022] To improve DSP performance, certain instructions can be pipelined, so that multiple instructions are executed simultaneously by different functional units of the DSP. However, pipelining these nested loops, imperfect nested loops in particular, makes it difficult to determine efficiently whether to execute the instructions associated with one or more outer loops (e.g., to determine the assertions for those instructions in an efficient manner).
[0023] Examples of this disclosure that address the aforementioned problems include nested loop controllers that maintain an inner loop counter, an outer loop counter, and an assertion FIFO buffer to improve pipelined execution of nested loops. Other examples include instructions for initializing the nested loop controller, for controlling the nested loop controller during nested loop execution, and for accessing the assertion FIFO during nested loop execution to determine whether an instruction subject to an assertion is to be executed in a particular cycle. Still other examples involve compilers configured to compile nested loops so as to use the functionality of the nested loop controller and the associated instructions explained above.
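The register behaviour summarized in paragraph [0003] can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the disclosed circuit: the class name `NestedLoopController`, the encodings `OUTER = 1` and `INNER = 0`, advancing by decrementing, a threshold of zero, and the FIFO depth are all chosen for illustration. It also assumes that the second register plays the role of the inner loop counter. How the first value advances is not detailed in the passages above, so the model only stores it.

```python
from collections import deque

OUTER = 1  # assumed encoding of the outer loop indicator value
INNER = 0  # assumed encoding of the inner loop indicator value


class NestedLoopController:
    """Toy model of the three registers and the comparator."""

    def __init__(self, first_init, second_init, threshold=0, fifo_depth=8):
        self.first = first_init          # first register (stored only)
        self.second_init = second_init   # initial second value
        self.second = second_init        # second register
        self.threshold = threshold       # second threshold (assumed 0)
        # Third register: assertion FIFO whose first bit is the
        # outer loop indicator value.
        self.fifo = deque([OUTER], maxlen=fifo_depth)

    def tick(self):
        """Model one tick instruction."""
        self.second -= 1                 # assumption: advancing means decrementing
        if self.second == self.threshold:
            self.fifo.append(OUTER)      # comparator output when equal
            self.second = self.second_init  # reset on reaching the threshold
        else:
            self.fifo.append(INNER)      # comparator output when not equal
```

With an initial second value of 3, six ticks leave the FIFO holding 1, 0, 0, 1, 0, 0, 1 in this model: an outer loop indicator appears once per three inner iterations.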
[0024] Figure 1 shows a dual scalar / vector data path processor according to various embodiments of the present disclosure. Processor 100 includes a separate Level 1 instruction cache (L1I) 121 and Level 1 data cache (L1D) 123. Processor 100 includes a Level 2 combined instruction / data cache (L2) 130 that stores both instructions and data. Figure 1 shows the connection (bus 142) between the Level 1 instruction cache 121 and the Level 2 combined instruction / data cache 130. Figure 1 shows the connection (bus 145) between the Level 1 data cache 123 and the Level 2 combined instruction / data cache 130. In one example, the Level 2 combined instruction / data cache 130 of the processor 100 stores both instructions to back up the Level 1 instruction cache 121 and data to back up the Level 1 data cache 123. In this example, the Level 2 combined instruction / data cache 130 is further connected to higher-level caches and / or main memory in a manner known in the art and not shown in Figure 1. In this example, the central processing unit core 110, the Level 1 instruction cache 121, the Level 1 data cache 123, and the Level 2 combined instruction / data cache 130 are formed on a single integrated circuit. This single integrated circuit optionally contains other circuitry.
[0025] The central processing unit core 110, under the control of the instruction fetch unit 111, fetches instructions from the Level 1 instruction cache 121. The instruction fetch unit 111 determines the next instructions to be executed and retrieves a fetch-packet-sized set of these instructions. The nature and size of the fetch packets are described in further detail below. As is known in the art, when a cache hit occurs (the instructions are stored in the Level 1 instruction cache 121), the instructions are fetched directly from the Level 1 instruction cache 121. When a cache miss occurs (the specified instruction fetch packet is not stored in the Level 1 instruction cache 121), these instructions are sought in the Level 2 combined cache 130. In this example, the size of a cache line in the Level 1 instruction cache 121 is equal to the size of a fetch packet. The memory locations of these instructions are either a hit or a miss in the Level 2 combined cache 130. A hit is serviced from the Level 2 combined cache 130. A miss is serviced from a higher-level cache (not shown) or from main memory (not shown). As is known in the art, the requested instructions can be supplied to both the Level 1 instruction cache 121 and the central processing unit core 110 simultaneously to speed up use.
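The lookup order described in paragraph [0025] can be sketched as follows. This is an illustrative model only: caches are represented as plain dictionaries keyed by fetch-packet address, the function name `fetch_packet` is hypothetical, and allocating the packet in the Level 2 cache after a miss there is an assumption that the text does not state.

```python
def fetch_packet(address, l1i, l2, higher_level):
    """Return (packet, level that serviced the request)."""
    if address in l1i:                       # Level 1 instruction cache hit
        return l1i[address], "L1I"
    if address in l2:                        # L1I miss, Level 2 hit
        packet, source = l2[address], "L2"
    else:                                    # miss in both levels
        packet, source = higher_level[address], "higher level"
        l2[address] = packet                 # assumption: L2 allocates on a miss
    # The packet is supplied to the core (returned) and also placed in L1I.
    l1i[address] = packet
    return packet, source
```

A second request for the same address is then serviced from the Level 1 instruction cache in this model.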
[0026] In one example, the central processing unit core 110 contains multiple functional units for performing instruction-specified data processing tasks. An instruction dispatch unit 112 determines the target functional unit for each fetched instruction. In this example, the central processing unit core 110 operates as a Very Long Instruction Word (VLIW) processor capable of operating on multiple instructions simultaneously in corresponding functional units. Preferably, a compiler organizes instructions into execution packets that are executed together. The instruction dispatch unit 112 directs each instruction to its target functional unit. The functional unit assigned to an instruction is entirely specified by the instruction generated by the compiler. The hardware of the central processing unit core 110 plays no role in this functional unit assignment. In this example, the instruction dispatch unit 112 can operate on multiple instructions in parallel. The number of these parallel instructions is set by the size of the execution packet. This is described in further detail below.
[0027] One function of the instruction dispatch unit 112 is to determine whether an instruction is executed on the scalar data path side A 115 or the vector data path side B 116. An instruction bit, referred to as the s-bit, in each instruction determines which data path the instruction controls. This will be explained in further detail below.
[0028] Instruction decoding unit 113 decodes each instruction in the current execution packet. Decoding includes identifying the functional unit that performs the instruction, identifying the registers, from among the possible register files, that supply data for the corresponding data processing operation, and identifying the destination register for the result of the corresponding data processing operation. As explained further below, an instruction may contain a constant field instead of a register number operand field. The result of this decoding is a set of signals that control the target functional unit to perform the data processing operation specified by the corresponding instruction on the specified data.
[0029] The central processing unit core 110 includes a control register 114. The control register 114 stores information for controlling the functional units in the scalar data path side A 115 and the vector data path side B 116. This information may include mode information and the like.
[0030] Decoded instructions from instruction decoding unit 113 and information stored in control register 114 are provided to scalar data path side A 115 and vector data path side B 116. The functional units within scalar data path side A 115 and vector data path side B 116 therefore perform instruction-specified data processing operations on the instruction-specified data and store the results in one or more instruction-specified data registers. Each of scalar data path side A 115 and vector data path side B 116 contains multiple functional units that preferably operate in parallel. These are discussed in further detail below in conjunction with Figure 2. A data path 117 between the scalar data path side A 115 and the vector data path side B 116 allows data exchange.
[0031] The central processing unit core 110 includes additional non-instruction-based modules. The emulation unit 118 allows the determination of the machine state of the central processing unit core 110 in response to instructions. This capability is typically used for algorithm development. The interrupt / exception unit 119 enables the central processing unit core 110 to respond to external asynchronous events (interrupts) and to attempts to perform improper operations (exceptions).
[0032] The central processing unit core 110 includes a streaming engine 125. In this embodiment, the streaming engine 125 supplies two data streams from predetermined addresses, typically cached in the Level 2 combined cache 130, to the register files of the vector data path side B 116. This provides controlled data movement from memory (e.g., as cached in the Level 2 combined cache 130) directly to the functional unit operand inputs. This is described in further detail below.
[0033] Figure 1 shows exemplary data widths of the buses between the various parts. The Level 1 instruction cache 121 provides instructions to the instruction fetch unit 111 via bus 141. Bus 141 is preferably a 512-bit bus. Bus 141 is unidirectional from the Level 1 instruction cache 121 to the central processing unit core 110. The Level 2 combined cache 130 provides instructions to the Level 1 instruction cache 121 via bus 142. Bus 142 is preferably a 512-bit bus. Bus 142 is unidirectional from the Level 2 combined cache 130 to the Level 1 instruction cache 121.
[0034] The Level 1 data cache 123 exchanges data with the register files in the scalar data path side A 115 via bus 143. Bus 143 is preferably a 64-bit bus. The Level 1 data cache 123 exchanges data with the register files in the vector data path side B 116 via bus 144. Bus 144 is preferably a 512-bit bus. Buses 143 and 144 are shown as bidirectional, supporting both data reads and data writes by the central processing unit core 110. The Level 1 data cache 123 exchanges data with the Level 2 combined cache 130 via bus 145. Bus 145 is preferably a 512-bit bus. Bus 145 is shown as bidirectional, supporting cache service for both data reads and data writes by the central processing unit core 110.
[0035] As is known in the art, when a cache hit occurs (the requested data is stored in the Level 1 data cache 123), a CPU data request is serviced directly from the Level 1 data cache 123. When a cache miss occurs (the specified data is not stored in the Level 1 data cache 123), the data is sought in the Level 2 combined cache 130. The memory location of the requested data is either a hit or a miss in the Level 2 combined cache 130. A hit is serviced from the Level 2 combined cache 130. A miss is serviced from another level of cache (not shown) or from main memory (not shown). As is known in the art, the requested data can be supplied simultaneously to the Level 1 data cache 123 and the central processing unit core 110 to speed up use.
[0036] The L2 combined cache 130 provides data of a first data stream to the streaming engine 125 via bus 146. Bus 146 is preferably a 512-bit bus. The streaming engine 125 provides data of the first data stream to the functional unit of the vector data path side B 116 via bus 147. Bus 147 is preferably a 512-bit bus. The L2 combined cache 130 provides data of a second data stream to the streaming engine 125 via bus 148. Bus 148 is preferably a 512-bit bus. The streaming engine 125 provides data of the second data stream to the functional unit of the vector data path side B 116 via bus 149. Bus 149 is preferably a 512-bit bus. According to various embodiments of this disclosure, buses 146, 147, 148, and 149 are shown as unidirectional from the L2 combined cache 130 to the streaming engine 125 and to the vector data path side B 116.
[0037] When a cache hit occurs (the requested data is stored in the Level 2 combined cache 130), a streaming engine 125 data request is serviced directly from the Level 2 combined cache 130. When a cache miss occurs (the specified data is not stored in the Level 2 combined cache 130), the data is fetched from another level of cache (not shown) or from main memory (not shown). In some examples, it is technically feasible for the Level 1 data cache 123 to cache data that is not stored in the Level 2 combined cache 130. If this is supported, then when a streaming engine 125 data request misses in the Level 2 combined cache 130, the Level 2 combined cache 130 should snoop the Level 1 data cache 123 for the data requested by the streaming engine 125. If the Level 1 data cache 123 stores the data, its snoop response contains the data, which is then supplied to service the streaming engine 125 request. If the Level 1 data cache 123 does not store the data, its snoop response indicates this, and the Level 2 combined cache 130 must service the streaming engine 125 request from another level of cache (not shown) or from main memory (not shown).
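The order in which a streaming engine request is serviced in paragraph [0037], including the optional query of the Level 1 data cache (a snoop), can be sketched as follows. This is a hedged illustration: the function name `service_stream_request`, the flag `l1d_can_hold_unshared`, and the dictionary representation of the caches are all assumptions introduced here.

```python
def service_stream_request(address, l2, l1d, higher_level,
                           l1d_can_hold_unshared=True):
    """Return (data, where the streaming engine request was serviced from)."""
    if address in l2:                        # hit in the Level 2 combined cache
        return l2[address], "L2"
    # Miss in L2. If L1D may hold data that L2 does not, L2 queries L1D.
    if l1d_can_hold_unshared and address in l1d:
        return l1d[address], "L1D snoop"     # response carries the data
    # Otherwise the request goes to another cache level or main memory.
    return higher_level[address], "higher level"
```

When the flag is false, the Level 1 data cache is never queried, which corresponds to a design in which it cannot hold data absent from the Level 2 cache.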
[0038] In one example, both the Level 1 data cache 123 and the Level 2 combined cache 130 can be configured as selected amounts of cache or directly addressable memory, in accordance with U.S. Patent No. 6,606,686, entitled "Unified Memory System Architecture Including Cache and Directly Addressable Static Random Access Memory".
[0039] Figure 2 shows further details of the functional units and register files within scalar data path side A 115 and vector data path side B 116. Scalar data path side A 115 contains a global scalar register file 211, an L1 / S1 local register file 212, an M1 / N1 local register file 213, and a D1 / D2 local register file 214. Scalar data path side A 115 contains L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, and D2 unit 226. Vector data path side B 116 contains a global vector register file 231, an L2 / S2 local register file 232, an M2 / N2 / C local register file 233, and an assertion register file 234. Vector data path side B 116 contains L2 unit 241, S2 unit 242, M2 unit 243, N2 unit 244, C unit 245, and P unit 246. There are restrictions on which functional units can read from or write to which register files. These are explained in detail below.
[0040] The scalar data path side A 115 contains L1 unit 221. L1 unit 221 typically accepts two 64-bit operands and produces a 64-bit result. Each operand is retrieved from an instruction-specified register in either the global scalar register file 211 or the L1 / S1 local register file 212. L1 unit 221 preferably performs the following instruction-selected operations: 64-bit addition / subtraction; 32-bit minimum / maximum operations; 8-bit single-instruction multiple-data (SIMD) instructions, such as sum of absolute values and minimum and maximum determination; circular minimum / maximum operations; and various move operations between register files. The result can be written to an instruction-specified register in the global scalar register file 211, the L1 / S1 local register file 212, the M1 / N1 local register file 213, or the D1 / D2 local register file 214.
[0041] The scalar data path side A 115 contains S1 unit 222. S1 unit 222 typically accepts two 64-bit operands and produces a 64-bit result. Each operand is retrieved from an instruction-specified register in either the global scalar register file 211 or the L1 / S1 local register file 212. S1 unit 222 preferably performs the same type of operation as L1 unit 221. Slight differences may optionally exist between the data processing operations supported by L1 unit 221 and S1 unit 222. The result may be written to an instruction-specified register in the global scalar register file 211, the L1 / S1 local register file 212, the M1 / N1 local register file 213, or the D1 / D2 local register file 214.
[0042] The scalar data path side A 115 contains M1 unit 223. M1 unit 223 typically accepts two 64-bit operands and produces a 64-bit result. Each operand is retrieved from an instruction-specified register in either the global scalar register file 211 or the M1 / N1 local register file 213. M1 unit 223 preferably performs the following instruction-selected operations: 8-bit multiplication; complex dot product; 32-bit bit counting; complex conjugate multiplication; and bitwise logical operations, shift operations, addition, and subtraction. The result can be written to an instruction-specified register in the global scalar register file 211, the L1 / S1 local register file 212, the M1 / N1 local register file 213, or the D1 / D2 local register file 214.
[0043] The scalar data path side A 115 contains N1 unit 224. N1 unit 224 typically accepts two 64-bit operands and produces a 64-bit result. Each operand is retrieved from an instruction-specified register in either the global scalar register file 211 or the M1 / N1 local register file 213. N1 unit 224 preferably performs the same type of operations as M1 unit 223. There may be certain dual operations (referred to as dual-issue instructions) that use both M1 unit 223 and N1 unit 224 together. The result can be written to an instruction-specified register in the global scalar register file 211, the L1 / S1 local register file 212, the M1 / N1 local register file 213, or the D1 / D2 local register file 214.
[0044] The scalar data path side A 115 includes D1 unit 225 and D2 unit 226. D1 unit 225 and D2 unit 226 typically each accept two 64-bit operands and each produce a 64-bit result. D1 unit 225 and D2 unit 226 typically perform address calculations and the corresponding load and store operations. D1 unit 225 is used for 64-bit scalar loads and stores. D2 unit 226 is used for 512-bit vector loads and stores. D1 unit 225 and D2 unit 226 preferably also perform: swapping, packing, and unpacking of load and store data; 64-bit SIMD arithmetic operations; and 64-bit bitwise logical operations. The D1 / D2 local register file 214 typically stores the base and offset addresses used in the address calculations for the corresponding load and store operations. The two operands are each retrieved from an instruction-specified register in either the global scalar register file 211 or the D1 / D2 local register file 214. The calculated result can be written to an instruction-specified register in the global scalar register file 211, the L1 / S1 local register file 212, the M1 / N1 local register file 213, or the D1 / D2 local register file 214.
[0045] The vector data path side B 116 contains L2 unit 241. L2 unit 241 typically accepts two 512-bit operands and produces a 512-bit result. Each operand is retrieved from an instruction-specified register in the global vector register file 231, the L2 / S2 local register file 232, or the assertion register file 234. L2 unit 241 preferably performs instructions similar to those of L1 unit 221, but on wider 512-bit data. The result can be written to an instruction-specified register in the global vector register file 231, the L2 / S2 local register file 232, the M2 / N2 / C local register file 233, or the assertion register file 234.
[0046] Vector data path side B 116 contains S2 unit 242. S2 unit 242 typically accepts two 512-bit operands and produces a 512-bit result. Each operand is retrieved from an instruction-specified register in the global vector register file 231, the L2 / S2 local register file 232, or the assertion register file 234. S2 unit 242 preferably performs instructions similar to those in S1 unit 222. The result can be written to an instruction-specified register in the global vector register file 231, the L2 / S2 local register file 232, the M2 / N2 / C local register file 233, or the assertion register file 234.
[0047] The vector data path side B 116 contains M2 unit 243. M2 unit 243 typically accepts two 512-bit operands and produces a 512-bit result. Each operand is retrieved from an instruction-specified register in either the global vector register file 231 or the M2 / N2 / C local register file 233. M2 unit 243 preferably performs instructions similar to those of M1 unit 223, but on wider 512-bit data. The result can be written to an instruction-specified register in the global vector register file 231, the L2 / S2 local register file 232, or the M2 / N2 / C local register file 233.
[0048] The vector data path side B 116 contains N2 unit 244. N2 unit 244 typically accepts two 512-bit operands and produces a 512-bit result. Each operand is retrieved from an instruction-specified register in either the global vector register file 231 or the M2 / N2 / C local register file 233. N2 unit 244 preferably performs the same type of operations as M2 unit 243. There may be certain dual operations (referred to as dual-issue instructions) that use both M2 unit 243 and N2 unit 244 together. The result can be written to an instruction-specified register in the global vector register file 231, the L2 / S2 local register file 232, or the M2 / N2 / C local register file 233.
[0049] The vector data path side B 116 contains C unit 245. C unit 245 typically accepts two 512-bit operands and produces a 512-bit result. Each operand is retrieved from an instruction-specified register in either the global vector register file 231 or the M2 / N2 / C local register file 233. C unit 245 preferably performs: "rake" and "search" instructions; up to 512 2-bit PN * 8-bit I / Q complex multiplications per clock cycle; 8-bit and 16-bit sum-of-absolute-differences (SAD) calculations, up to 512 SADs per clock cycle; horizontal addition and horizontal minimum / maximum instructions; and vector permutation instructions. C unit 245 also contains four vector control registers (CUCR0 to CUCR3) used to control certain operations of C unit 245 instructions. In some C unit 245 operations, control registers CUCR0 to CUCR3 are used as operands. Control registers CUCR0 to CUCR3 are preferably used to control the general permutation instruction (VPERM) and as masks for SIMD multiple dot product operations (DOTPM) and SIMD multiple sum-of-absolute-differences (SAD) operations. Control register CUCR0 is preferably used to store the polynomial for Galois field multiplication operations (GFMPY). Control register CUCR1 is preferably used to store the Galois field polynomial generator function.
[0050] The vector data path side B 116 contains P unit 246. P unit 246 performs basic logical operations on the registers of the local assertion register file 234. P unit 246 can directly read from and write to the assertion register file 234. These operations include single-register unary operations, such as: NEG (negate), which inverts each bit of a single register; BITCNT (bit count), which returns the number of bits in a single register that have a predetermined digital state (1 or 0); RMBD (rightmost bit detect), which returns the number of bit positions from the least significant (rightmost) bit position to the first bit position that has a predetermined digital state (1 or 0); DECIMATE, which selects every instruction-specified Nth (1, 2, 4, etc.) bit for output; and EXPAND, which replicates each bit an instruction-specified N times (2, 4, etc.). These operations include two-register binary operations, such as: AND, the bitwise AND of the data in two registers; NAND, the bitwise AND of the data in two registers followed by inversion; OR, the bitwise OR of the data in two registers; NOR, the bitwise OR of the data in two registers followed by inversion; and XOR, the bitwise exclusive OR of the data in two registers. These operations include transferring data from an assertion register in the assertion register file 234 to another specified assertion register or to a specified data register in the global vector register file 231. A typical intended use of P unit 246 is manipulating the results of SIMD vector comparisons in order to control further SIMD vector operations. The BITCNT instruction can be used to count the 1s in an assertion register to determine the number of valid data elements indicated by that assertion register.
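The single-register operations listed in paragraph [0050] can be modelled on plain integers. The following is a minimal sketch, not the instruction definitions themselves. It assumes a 64-bit register width (taken from the description of the P local register file), zero-based bit positions, that DECIMATE starts at bit 0 and packs the selected bits into the low end of the result, that EXPAND truncates to the register width, and that RMBD returns the register width when no bit matches. The lowercase function names are hypothetical.

```python
WIDTH = 64                    # assumed register width in bits
MASK = (1 << WIDTH) - 1


def neg(x):
    """NEG: invert every bit of the register."""
    return ~x & MASK


def bitcnt(x, state=1):
    """BITCNT: count the bits that have the given state (1 or 0)."""
    ones = bin(x & MASK).count("1")
    return ones if state == 1 else WIDTH - ones


def rmbd(x, state=1):
    """RMBD: bit positions from the LSB up to the first bit with the given state."""
    for pos in range(WIDTH):
        if ((x >> pos) & 1) == state:
            return pos
    return WIDTH              # assumption: no matching bit found


def decimate(x, n):
    """DECIMATE: keep every n-th bit, packed into the low bits of the result."""
    out = 0
    for i, pos in enumerate(range(0, WIDTH, n)):
        out |= ((x >> pos) & 1) << i
    return out


def expand(x, n):
    """EXPAND: replicate each of the low WIDTH // n bits n times."""
    out = 0
    for pos in range(WIDTH // n):
        if (x >> pos) & 1:
            out |= ((1 << n) - 1) << (pos * n)
    return out
```

Under these assumptions, `decimate(expand(x, n), n)` returns `x` for any value that fits in the low `WIDTH // n` bits, and `bitcnt` applied to a comparison mask gives the number of valid elements, as the paragraph describes.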
[0051] Figure 3 shows the global scalar register file 211. There are 16 independent 64-bit wide scalar registers, named A0 to A15. Each register in the global scalar register file 211 can be read or written as 64-bit scalar data. All scalar data path side A 115 functional units (L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, and D2 unit 226) can read from or write to the global scalar register file 211. The global scalar register file 211 can be read as 32 bits or 64 bits, and can only be written as 64 bits. The executing instruction determines the size of the data read. Vector data path side B 116 functional units (L2 unit 241, S2 unit 242, M2 unit 243, N2 unit 244, C unit 245, and P unit 246) can read from the global scalar register file 211 via cross path 117 under the constraints described in detail below.
[0052] Figure 4 shows the D1 / D2 local register file 214. There are 16 independent 64-bit wide scalar registers, named D0 to D15. Each register in the D1 / D2 local register file 214 can be read or written as 64-bit scalar data. All scalar data path side A 115 functional units (L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, and D2 unit 226) can write to the D1 / D2 local scalar register file 214. Only D1 unit 225 and D2 unit 226 can read from the D1 / D2 local scalar register file 214. The data stored in the D1 / D2 local scalar register file 214 is expected to include the base addresses and offset addresses used in address calculations.
[0053] Figure 5 shows the L1 / S1 local register file 212. The example shown in Figure 5 has eight independent 64-bit wide scalar registers, named AL0 through AL7. The preferred instruction encoding (see Figure 15) allows the L1 / S1 local register file 212 to contain up to 16 registers. The example of Figure 5 implements only 8 registers to reduce circuit size and complexity. Each register in the L1 / S1 local register file 212 can be read or written as 64-bit scalar data. All scalar data path side A 115 functional units (L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, and D2 unit 226) can write to the L1 / S1 local scalar register file 212. Only L1 unit 221 and S1 unit 222 can read from the L1 / S1 local scalar register file 212.
[0054] Figure 6 shows the M1 / N1 local register file 213. The example shown in Figure 6 has eight independent 64-bit wide scalar registers, named AM0 through AM7. The preferred instruction encoding (see Figure 15) allows the M1 / N1 local register file 213 to contain up to 16 registers. The example of Figure 6 implements only 8 registers to reduce circuit size and complexity. Each register in the M1 / N1 local register file 213 can be read or written as 64-bit scalar data. All scalar data path side A 115 functional units (L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, and D2 unit 226) can write to the M1 / N1 local scalar register file 213. Only M1 unit 223 and N1 unit 224 can read from the M1 / N1 local scalar register file 213.
[0055] Figure 7 Global vector register file 231 is shown. There are 16 independent 512-bit wide vector registers. Each register in global vector register file 231 can be read or written as 64-bit scalar data, named B0 through B15. Each register in global vector register file 231 can be read or written as 512-bit vector data, named VB0 through VB15. The instruction type determines the data size. All vector data path side B 116 function units (L2 unit 241, S2 unit 242, M2 unit 243, N2 unit 244, C unit 245, and P unit 246) can read or write to global scalar register file 231. Scalar data path side A 115 function units (L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, and D2 unit 226) can read from global vector register file 231 via cross path 117 under the constraints described in detail below.
[0056] Figure 8 shows the P local register file 234. There are eight independent 64-bit wide registers, named P0 through P7. Each register in the P local register file 234 can be read or written as 64-bit scalar data. The vector data path side B 116 functional units L2 unit 241, S2 unit 242, C unit 245, and P unit 246 can write to the P local register file 234. Only L2 unit 241, S2 unit 242, and P unit 246 can read from the P local scalar register file 234. Typical intended uses of the P local register file 234 include: writing the one-bit SIMD vector comparison results from L2 unit 241, S2 unit 242, or C unit 245; manipulating the SIMD vector comparison results with P unit 246; and using the manipulated results to control further SIMD vector operations.
[0057] Figure 9 shows the L2 / S2 local register file 232. The example shown in Figure 9 has eight independent 512-bit wide vector registers. The preferred instruction encoding (see Figure 15) allows the L2 / S2 local register file 232 to contain up to 16 registers. The example of Figure 9 implements only 8 registers to reduce circuit size and complexity. Each register in the L2 / S2 local vector register file 232 can be read or written as 64-bit scalar data, named BL0 through BL7. Each register in the L2 / S2 local vector register file 232 can be read or written as 512-bit vector data, named VBL0 through VBL7. The instruction type determines the data size. All vector data path side B 116 functional units (L2 unit 241, S2 unit 242, M2 unit 243, N2 unit 244, C unit 245, and P unit 246) can write to the L2 / S2 local vector register file 232. Only L2 unit 241 and S2 unit 242 can read from the L2 / S2 local vector register file 232.
[0058] Figure 10 shows the M2 / N2 / C local register file 233. The example shown in Figure 10 has eight independent 512-bit wide vector registers. The preferred instruction encoding (see Figure 15) allows the M2 / N2 / C local vector register file 233 to contain up to 16 registers. The example of Figure 10 implements only 8 registers to reduce circuit size and complexity. Each register in the M2 / N2 / C local vector register file 233 can be read or written as 64-bit scalar data, named BM0 through BM7. Each register in the M2 / N2 / C local vector register file 233 can be read or written as 512-bit vector data, named VBM0 through VBM7. All vector data path side B 116 functional units (L2 unit 241, S2 unit 242, M2 unit 243, N2 unit 244, C unit 245, and P unit 246) can write to the M2 / N2 / C local vector register file 233. Only M2 unit 243, N2 unit 244, and C unit 245 can read from the M2 / N2 / C local vector register file 233.
[0059] One design option is to provide a global register file accessible to all functional units on one side, and a local register file accessible only to some functional units on one side. Some examples of this disclosure use only one type of register file corresponding to the disclosed global register file.
[0060] Returning to Figure 2, cross path 117 allows limited data exchange between scalar data path side A 115 and vector data path side B 116. During each operating cycle, one 64-bit data word can be read from global scalar register file A 211 for use as an operand by one or more functional units of vector data path side B 116, and one 64-bit data word can be read from global vector register file 231 for use as an operand by one or more functional units of scalar data path side A 115. Any scalar data path side A 115 functional unit (L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, and D2 unit 226) can read a 64-bit operand from global vector register file 231. This 64-bit operand is the least significant 64 bits of the 512-bit data in the accessed register of global vector register file 231. Multiple scalar data path side A 115 functional units can use the same 64-bit cross path data as an operand during the same operating cycle. However, only one 64-bit operand is transferred from vector data path side B 116 to scalar data path side A 115 in any single operating cycle. Any vector data path side B 116 functional unit (L2 unit 241, S2 unit 242, M2 unit 243, N2 unit 244, C unit 245, and P unit 246) can read a 64-bit operand from global scalar register file 211. If the corresponding instruction is a scalar instruction, the cross path operand data is treated as any other 64-bit operand. If the corresponding instruction is a vector instruction, the upper 448 bits of the operand are filled with zeros. Multiple vector data path side B 116 functional units can use the same 64-bit cross path data as an operand during the same operating cycle. Only one 64-bit operand is transferred from scalar data path side A 115 to vector data path side B 116 in any single operating cycle.
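The operand handling on cross path 117 can be illustrated with a minimal C sketch. The type `vec512_t`, its eight-lane layout (lane 0 holding bits 63..0), and both function names are assumptions made for illustration only; they do not describe the actual circuit.

```c
#include <stdint.h>
#include <string.h>

/* Assumed model of one 512-bit vector register as eight 64-bit lanes,
 * with lane[0] holding the least significant 64 bits. */
typedef struct {
    uint64_t lane[8];
} vec512_t;

/* Side B to side A: only the least significant 64 bits are transferred. */
uint64_t cross_b_to_a(const vec512_t *vb)
{
    return vb->lane[0];
}

/* Side A to side B for a vector instruction: the 64-bit operand occupies
 * the low lane and the upper 448 bits are filled with zeros. */
vec512_t cross_a_to_b_vector(uint64_t a)
{
    vec512_t v;
    memset(&v, 0, sizeof v);
    v.lane[0] = a;
    return v;
}
```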
[0061] The streaming engine 125 transfers data in certain restricted circumstances. The streaming engine 125 controls two data streams. A stream consists of a sequence of elements of a particular type. Programs that operate on streams read the data sequentially, operating on each element in turn. Every stream has the following basic properties: the stream data has a well-defined beginning and end in time; the stream data has a fixed element size and type throughout the stream; the stream data has a fixed sequence of elements, so programs cannot seek randomly within the stream; and the stream data is read-only while active, so programs cannot write to a stream while reading from it. Once a stream is opened, the streaming engine 125: calculates the address; fetches the defined data type from the L2 unified cache (which may require cache service from a higher level of memory); performs data type manipulations such as zero extension, sign extension, and data element sorting / swapping (e.g., matrix transposition); and delivers the data directly to the programmed data register file within the CPU 110. The streaming engine 125 is therefore useful for real-time digital filtering operations on well-behaved data. The streaming engine 125 frees the corresponding CPU from these memory fetch tasks, enabling other processing functions.
[0062] The streaming engine 125 provides the following benefits. It permits multidimensional memory accesses. It increases the bandwidth available to the functional units. Because the stream buffer bypasses the L1 data cache 123, it minimizes cache miss latency. It reduces the number of scalar operations required to maintain loops. It manages address pointers. It handles address generation, automatically freeing the address generation instruction slots and the D1 unit 225 and D2 unit 226 for other computations.
[0063] The CPU 110 operates on an instruction pipeline. Instructions are fetched in fixed-length instruction packets, as described further below. All instructions require the same number of pipeline stages for fetching and decoding, but a different number of execution stages.
[0064] Figure 11 illustrates the following pipeline stages: program fetch stage 1110, dispatch and decode stage 1120, and execution stage 1130. Program fetch stage 1110 comprises three stages for all instructions. Dispatch and decode stage 1120 comprises three stages for all instructions. Execution stage 1130 comprises one to four stages depending on the instruction.
[0065] The fetch phase 1110 comprises the program address generation phase 1111 (PG), the program access phase 1112 (PA), and the program reception phase 1113 (PR). During the program address generation phase 1111 (PG), the program address is generated in the CPU, and a fetch request is sent to the memory controller of the Level 1 instruction cache (L1I). During the program access phase 1112 (PA), the L1I processes the request, accesses the data in its memory, and sends the fetch packet to the CPU boundary. During the program reception phase 1113 (PR), the CPU registers the fetch packet.
[0066] Instructions are always fetched sixteen 32-bit wide slots at a time, which together constitute a fetch packet. Figure 12 shows the 16 instructions 1201 to 1216 of a single fetch packet. Fetch packets are aligned on 512-bit (16-word) boundaries. One example uses a fixed 32-bit instruction length. Fixed-length instructions are advantageous for several reasons. Fixed-length instructions enable easy decoder alignment. A properly aligned instruction fetch can load multiple instructions into parallel instruction decoders. Such alignment can be achieved by predetermined instruction alignment when instructions are stored in memory, coupled with a fixed instruction packet fetch (fetch packets aligned on 512-bit boundaries). An aligned instruction fetch permits the parallel decoders to operate on instruction-sized portions of the fetched bits. Variable-length instructions require an initial step of locating each instruction boundary before they can be decoded. A fixed-length instruction set generally permits a more regular layout of instruction fields. This simplifies the construction of each decoder, which is an advantage for a wide-issue VLIW CPU.
[0067] The execution of each instruction is controlled in part by a p bit in that instruction. The p bit is preferably bit 0 of the 32-bit wide slot. The p bit determines whether an instruction executes in parallel with the next instruction. Instructions are scanned from lower to higher addresses. If the p bit of an instruction is 1, the next following instruction (at the higher memory address) executes in parallel with that instruction (in the same cycle). If the p bit of an instruction is 0, the next following instruction executes in the cycle after that instruction.
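The p-bit rule can be sketched in C as a scan that groups the slots of one fetch packet into sets of instructions that execute in the same cycle. The function name `split_execute_packets` and the `starts` output array are hypothetical, and the sketch assumes for simplicity that a group never continues past the end of the fetch packet.

```c
#include <stddef.h>
#include <stdint.h>

/* Scan `n` 32-bit slots from the lowest address upward. starts[i] receives
 * the index of the first slot of the i-th parallel group; the return value
 * is the number of groups. `starts` must have room for `n` entries. */
size_t split_execute_packets(const uint32_t *slots, size_t n, size_t *starts)
{
    size_t count = 0;
    int begin_group = 1;                 /* the first slot always opens a group */

    for (size_t i = 0; i < n; i++) {
        if (begin_group) {
            starts[count++] = i;
        }
        /* p bit (bit 0) == 1: the next slot runs in parallel with this one.
         * p bit == 0: the next slot runs in the following cycle. */
        begin_group = (slots[i] & 1u) == 0;
    }
    return count;
}
```

For slots whose p bits are 1, 1, 0, 0, 1, 0, the scan yields three groups starting at slots 0, 3, and 4.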
[0068] The CPU 110 and L1 instruction cache 121 pipelines are separate. The return of a fetch packet from the L1 instruction cache can take a varying number of clock cycles, depending on external conditions, such as whether a hit occurs in L1 instruction cache 121 or L2 combined cache 130. Therefore, program access phase 1112 (PA) can take several clock cycles instead of one clock cycle as in other phases.
[0069] The instructions that execute in parallel constitute an execution packet. In one example, an execution packet can contain up to sixteen instructions. No two instructions in an execution packet may use the same functional unit. A slot is one of the following five types: 1) a self-contained instruction that executes on one of the functional units of the CPU 110 (L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, D2 unit 226, L2 unit 241, S2 unit 242, M2 unit 243, N2 unit 244, C unit 245, and P unit 246); 2) a unitless instruction, such as a NOP (no operation) instruction or a multiple NOP instruction; 3) a branch instruction; 4) a constant field extension; and 5) a condition code extension. Some of these slot types are explained further below.
[0070] The dispatch and decode phase 1120 includes an instruction dispatch to appropriate execution unit phase 1121 (DS), an instruction pre-decode phase 1122 (DC1), and an instruction decode and operand fetch phase 1123 (DC2). During the instruction dispatch to appropriate execution unit phase 1121 (DS), the fetch packet is split into execution packets and assigned to the appropriate functional units. During the instruction pre-decode phase 1122 (DC1), the source registers, destination registers, and associated paths are decoded for the execution of the instructions in the functional units. During the instruction decode and operand fetch phase 1123 (DC2), more detailed unit decoding is performed and operands are read from the register files.
[0071] Execution phase 1130 comprises execution phases 1131 to 1135 (E1 to E5). Different types of instructions require different numbers of these phases to complete their execution. These pipeline phases play a crucial role in understanding the device state at the boundaries of CPU cycles.
[0072] During execution phase 1 1131 (E1), the conditions of the instruction are evaluated and the operands are operated on. As shown in Figure 11, execution phase 1 1131 can receive operands from either the stream buffer 1141 or the register file, shown schematically as 1142. For load and store instructions, address generation is performed and address modifications are written to the register file. For branch instructions, the branch fetch packet in the PG phase is affected. As shown in Figure 11, load and store instructions access memory, shown schematically here as memory 1151. For single-cycle instructions, the result is written to the destination register file. This assumes that any condition of the instruction evaluates as true. If a condition evaluates as false, the instruction does not write any result and has no pipeline activity after execution phase 1 1131.
[0073] During execution phase 2 1132 (E2), load instructions send the address to memory. Store instructions send the address and data to memory. Single-cycle instructions that saturate their results set the SAT bit in the control status register (CSR) if saturation occurs. For 2-cycle instructions, the result is written to the destination register file.
[0074] During execution phase 3 1133 (E3), data memory accesses are performed. Any multiply instruction that saturates its result sets the SAT bit in the control status register (CSR) if saturation occurs. For 3-cycle instructions, the result is written to the destination register file.
[0075] During execution phase 4 1134 (E4), load instructions bring data to the CPU boundary. For 4-cycle instructions, the result is written to the destination register file.
[0076] During execution phase 5 1135 (E5), load instructions write data into a register. This is illustrated schematically in Figure 11 by the input from memory 1151 to execution phase 5 1135.
[0077] In some cases, the processor 100 (e.g., a DSP) is called upon to execute software containing nested loops. Software pipelining initiates a new iteration of a loop before the previous iteration has completed, in order to achieve high throughput. This implies that there are cycles that start up, or pipe up, each inner loop (the loop beginning) and cycles that pipe down the loop (the loop ending). These cycles occur every time the outer loop executes, and they can therefore affect performance, especially when the inner loop count is small. The deeper the pipeline of the DSP 100, the more cycles are required at the beginning and the end. As explained above, and particularly for imperfect nested loops, pipelined execution of nested loops makes it difficult to determine whether to execute the instructions associated with one or more outer loops (e.g., to determine the assertions for those instructions in an efficient manner).
[0078] The following description references the following exemplary nested loops, where the function of each line of pseudocode is explained in the comments:
[0079]
[0080] Example 1. Unmerged nested loops.
[0081] For the sake of terminology consistency, the nested loops described above have an "inner loop count" value M and an "outer loop count" value N.
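The listing for Example 1 is not reproduced in this text. As a stand-in, the following minimal C sketch shows an imperfect loop nest of the kind described later in this description: the inner loop sums part of a vector, and the outer loop clears an accumulator before the inner loop and stores the sum after it. The function name `sum_rows` and the array names `src` and `dst` are assumptions for illustration, not names taken from the original listing.

```c
#include <stddef.h>

/* Imperfect loop nest: the outer loop has work of its own before and after
 * the inner loop. N is the outer loop count and M is the inner loop count. */
void sum_rows(const long *src, long *dst, size_t N, size_t M)
{
    for (size_t i = 0; i < N; i++) {
        long acc = 0;                    /* outer loop work before the inner loop */
        for (size_t j = 0; j < M; j++) {
            acc += src[i * M + j];       /* inner loop work */
        }
        dst[i] = acc;                    /* outer loop work after the inner loop */
    }
}
```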
[0082] This exemplary nested loop can be rewritten or "merged" into a single loop by conditionalizing or asserting instructions associated with the outer loop. An example of the resulting merged nested loop is given below:
[0083]
[0084]
[0085] Example 2a. Nested loops that are merged.
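The merged listing is likewise not reproduced in this text. A hedged sketch of the idea, assuming a loop nest that clears an accumulator, sums M elements, and stores the sum for each of N outer iterations, is shown below. The names `sum_rows_merged`, `src`, and `dst` are hypothetical. The former outer loop statements become conditional on the position within the inner loop, computed here with modulo and division.

```c
#include <stddef.h>

/* Single merged loop over N * M iterations. The two if statements guard
 * the statements that used to belong to the outer loop. */
void sum_rows_merged(const long *src, long *dst, size_t N, size_t M)
{
    long acc = 0;
    for (size_t k = 0; k < N * M; k++) {
        if (k % M == 0) {                /* first inner iteration: former pre-loop code */
            acc = 0;
        }
        acc += src[k];                   /* former inner loop body */
        if (k % M == M - 1) {            /* last inner iteration: former post-loop code */
            dst[k / M] = acc;
        }
    }
}
```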
[0086] The above is one example of merging loops. Other examples can avoid the computationally expensive division and modulo operations by maintaining bookkeeping variables and using pointer arithmetic. For example:
[0087]
[0088] Example 2b. Nested loops that are merged.
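Since the listing for Example 2b is not reproduced here, the following sketch illustrates one way such a variant could look, under the same assumed loop nest (clear an accumulator, sum M elements, store the sum). The counter `j` is a bookkeeping variable that tracks the position within the inner loop, and the arrays are walked with pointers, so no division or modulo is needed. All names are hypothetical.

```c
#include <stddef.h>

/* Merged loop without division or modulo: `j` counts inner iterations and
 * the source and destination pointers advance as data is consumed. */
void sum_rows_merged_nodiv(const long *src, long *dst, size_t N, size_t M)
{
    long acc = 0;
    size_t j = 0;                        /* position within the current inner loop */
    for (size_t k = 0; k < N * M; k++) {
        if (j == 0) {
            acc = 0;                     /* former pre-loop code */
        }
        acc += *src++;                   /* former inner loop body */
        if (++j == M) {
            *dst++ = acc;                /* former post-loop code */
            j = 0;
        }
    }
}
```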
[0089] In some instances, one or more outer loops can be empty, or contain no instructions other than the loop control instruction itself. Such outer loops can be combined or collapsed into the next (inner) outer loop by adjusting variables in the loop control instruction. For example:
[0090]
[0091] This can be folded as follows:
[0092] for (int k = 0; k < N * M; k++) {
[0093]     B0 += D0[k];
[0094] }
[0095] Example 3. Folded nested loops.
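The unfolded listing is not reproduced in this text. The sketch below places an assumed unfolded form, whose outer loop contains nothing but the inner loop, beside the folded form given above, so that the two can be compared. The names `B0` and `D0` follow the folded listing; the function names and the index expression `i * M + j` are assumptions.

```c
#include <stddef.h>

/* Assumed unfolded form: the outer loop body is only the inner loop. */
long sum_unfolded(const long *D0, size_t N, size_t M)
{
    long B0 = 0;
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < M; j++) {
            B0 += D0[i * M + j];
        }
    }
    return B0;
}

/* Folded form: one loop whose count is the product of the two counts. */
long sum_folded(const long *D0, size_t N, size_t M)
{
    long B0 = 0;
    for (size_t k = 0; k < N * M; k++) {
        B0 += D0[k];
    }
    return B0;
}
```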
[0096] Once any such empty outer loop has been collapsed, the resulting imperfect nested loops are merged, as explained above.
[0097] As explained in further detail below, merged nested loops are better suited to software-pipelined execution on the DSP 100. However, because the outer loop instructions are conditional (e.g., executed only when the corresponding if statement in the pseudocode above is true), the corresponding outer loop instructions in the pipeline must be asserted. This requires determining an assertion for each such instruction, to decide whether the outer loop instruction should be executed in a particular cycle. Conventionally, computing and storing assertion values is complex and ties up a large number of registers (e.g., to store assertion values for multiple different instructions). Examples of the nested loop controller of this disclosure (explained further below) address these drawbacks.
[0098] Turning now to Figure 13A, a nested loop controller 1300 is illustrated according to an example of this disclosure. An exemplary hardware and logic construction of the nested loop controller 1300 is shown. It should be understood that alternative hardware and logic constructions can be used to implement the nested loop controller 1300. The nested loop controller 1300 includes an outer loop counter 1302 (with an OLCNT value), an inner loop counter 1308 (with an ILCNT value), and an assertion FIFO 1312. Each of the counters 1302 and 1308 and the assertion FIFO 1312 can be implemented as, for example, a register, such as the control register 114 explained above. In some instances, the nested loop controller 1300 also includes an episkew counter 1316 (with an EPISKEW value), which can also be implemented as a control register 114 and is explained in further detail below.
[0099] Examples of this disclosure also include instructions for interacting with the nested loop controller 1300. The nested loop controller 1300 initialization instruction (NLCINIT) initializes the values of the outer loop counter 1302 and the inner loop counter 1308 (and / or the inner loop reload register 1304). For consistency, it is assumed that NLCINIT initializes the outer loop counter 1302 to the outer loop count value minus 1, and initializes the inner loop counter 1308 (and / or the inner loop reload register 1304, depending on the implementation) to the inner loop count value minus 1. In this particular example, the nested loop controller 1300 operates by decrementing the counter values, and the threshold of the counter values (e.g., for reset purposes) is therefore zero. However, it should be understood that, more generally, the outer loop, inner loop, and episkew counters 1302, 1308, and 1316 are advanced, and can, for example, be initialized to zero and subsequently incremented up to a threshold (e.g., the loop count value minus 1). This disclosure is not limited to any particular counter management method, and all such methods are within the scope of the examples described herein. The compiler generates the NLCINIT instruction, which is executed before the instructions in the nested loop.
[0100] The nested loop controller 1300 advance instruction (TICK) causes the outer loop, inner loop, and episkew counters 1302, 1308, and 1316 of the nested loop controller 1300 to advance according to certain logic, which is explained in further detail below. The TICK instruction is a unitless instruction. As is also explained below, the compiler generates a TICK instruction for each iteration of the merged nested loops.
[0101] The nested loop controller 1300 get-assertion instruction (GETP) reads a value (e.g., a single bit) from the assertion FIFO 1312 and writes that value into a destination register of the DSP 100. In addition to specifying the destination register, the GETP instruction includes an offset value specifying which position in the assertion FIFO 1312 is written into the destination register. As explained in further detail below, this provides a simple way to access the assertion values associated with earlier TICK instructions. The compiler generates a GETP instruction for any asserted instruction (i.e., an instruction that belonged to an outer loop before the nested loops were merged), and the asserted instruction is then executed, or not, based on the value returned by the GETP instruction.
[0102] The function of the nested loop controller 1300 is further explained with reference to Figures 13A and 13B and Figures 14A-1, 14A-2, 14B, and 14C. For simplicity, reference is made to nested loop Examples 1 and 2 above, with an outer loop count (N) of 5 and an inner loop count (M) of 3. Furthermore, it is assumed that all instructions except load instructions complete within one cycle, and that load instructions complete within three cycles. As explained above, the compiler initially generates the NLCINIT instruction, which initializes the outer and inner loop counters 1302 and 1308 upon execution. In this example, in which the counters are advanced by decrementing, the outer loop counter 1302 is initialized to the outer loop count value minus 1, or 4, and the inner loop counter 1308 is initialized to the inner loop count value minus 1, or 2. The inner loop reload register 1304 is also initialized to 2 and remains unchanged in this example.
[0103] Multiplexer (mux) 1306 determines whether to load inner loop counter 1308 with the value held in inner loop reload register 1304 (e.g., for initialization or to reset inner loop counter 1308 at the end of inner loop iteration). Comparator 1310 compares the value held in inner loop counter 1308 with 0 and populates assertion FIFO 1312 accordingly. In this logic example, when inner loop counter 1308 equals 0, the value 1 is pushed into assertion FIFO 1312 as an outer loop indicator. When inner loop counter 1308 is not equal to zero, the value 0 is pushed into assertion FIFO 1312 as an inner loop indicator. The distinction and importance of these assertion FIFO 1312 values will be explained further below. Subtraction circuit 1314 decrements inner loop counter 1308 (in response to execution of the TICK instruction) and updates inner loop counter 1308 via multiplexer 1306.
[0104] When the inner loop counter 1308 reaches its threshold (0 in this example) and the outer loop counter 1302 has not yet reached its threshold (also 0 in this example), the inner loop counter 1308 is reset by way of the multiplexer 1306 and the inner loop reload register 1304, while the outer loop counter 1302 is decremented. Once the outer loop counter 1302 reaches its threshold, the inner loop counter 1308 is no longer reset. Furthermore, although not shown in the logic of Figures 13A and 13B, the outer loop counter 1302 reaching its threshold also prevents the comparator 1310 from pushing the outer loop indicator (i.e., 1 in this example) into the assertion FIFO 1312, even if the inner loop counter 1308 is equal to 0. Finally, the episkew counter 1316 is used in the case where one or more stages at the end of the loop are collapsed.
[0105] Figure 13B shows another example of a nested loop controller 1350 according to the present disclosure. The nested loop controller 1350 is similar to the nested loop controller 1300 shown in Figure 13A, but illustrates further detail. For example, the nested loop controller 1350 also includes the outer loop counter 1302 and the inner loop counter 1308 (and / or the inner loop reload register 1304). The assertion FIFO 1312 is also shown in more detail, illustrating one example of how indexing into the assertion FIFO 1312 is implemented in response to the GETP instruction. Specifically, the offset value of the GETP instruction controls the multiplexer 1352 to output the bit at the desired offset in the assertion FIFO 1312. For example, a GETP instruction with an offset value of 0 and a level value of 0 controls the multiplexer 1352 to select the bit in the assertion FIFO 1312 corresponding to offset 0, shown as offset = 0. As another example, a GETP instruction with an offset value of 3 and a level value of 0 controls the multiplexer 1352 to select the bit in the assertion FIFO 1312 corresponding to offset 3, shown as offset = 3.
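The counter and FIFO behavior described above can be approximated by a small behavioral model in C. This is only a sketch under stated assumptions and not the disclosed circuit: the 32-entry FIFO depth, the convention that bit 0 is the newest entry, and the `next_bit` field are all assumptions. In particular, the model holds the comparator result in `next_bit` and pushes it on the following TICK, with the first bit seeded to 1; this timing was chosen only so that, for an outer count of 5 and an inner count of 3, successive TICKs push 1, 0, 0, 1, 0, 0, and so on. The names `nlc_t`, `nlc_init`, `nlc_tick`, and `nlc_getp` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned olcnt;     /* outer loop counter 1302 (OLCNT) */
    unsigned ilcnt;     /* inner loop counter 1308 (ILCNT) */
    unsigned reload;    /* inner loop reload register 1304 */
    unsigned next_bit;  /* assumption: indicator the next TICK will push */
    uint32_t fifo;      /* assertion FIFO 1312; bit 0 is the newest entry */
} nlc_t;

/* NLCINIT: counters start at (count - 1) and are advanced by decrementing. */
void nlc_init(nlc_t *c, unsigned outer_count, unsigned inner_count)
{
    assert(outer_count > 0 && inner_count > 0);
    c->olcnt = outer_count - 1;
    c->ilcnt = inner_count - 1;
    c->reload = inner_count - 1;
    c->next_bit = 1;    /* the first iteration begins an outer loop pass */
    c->fifo = 0;
}

/* TICK: push one indicator bit, then advance the counters. */
void nlc_tick(nlc_t *c)
{
    c->fifo = (c->fifo << 1) | c->next_bit;
    if (c->ilcnt != 0) {
        c->ilcnt--;
        c->next_bit = 0;            /* inner loop indicator */
    } else if (c->olcnt != 0) {
        c->ilcnt = c->reload;       /* reset by way of multiplexer 1306 */
        c->olcnt--;
        c->next_bit = 1;            /* outer loop indicator */
    } else {
        c->next_bit = 0;            /* both at threshold: indicator suppressed */
    }
}

/* GETP: return the bit pushed `offset` TICKs before the most recent one. */
unsigned nlc_getp(const nlc_t *c, unsigned offset)
{
    assert(offset < 32);
    return (c->fifo >> offset) & 1u;
}
```

After `nlc_init(&c, 5, 3)` and three calls to `nlc_tick(&c)`, `nlc_getp(&c, 0)` returns 0 and `nlc_getp(&c, 2)` returns 1, the bit pushed by the first TICK.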
[0106] The nested loop controller 1350 includes an additional assertion FIFO 1353, which is filled based on whether the outer loop counter 1302 is equal to zero (e.g., based on comparator 1355). In this example, the outer loop counter 1302 may alternatively be referred to as the intermediate loop counter 1302. A separate outer loop counter, not shown for simplicity, is implemented to indicate when the nested loop controller 1350 should stop executing the merged loop (e.g., in an example where counters are advanced by decrementing, when the inner, intermediate, and outer loop counters reach 0). Specifically, the two assertion FIFOs 1312 and 1353 shown allow the merging of nested loops containing inner, intermediate, and outer loops, each containing instructions to be executed. In this example, assertion FIFO 1312 provides an assertion for the conditional execution of instructions in the intermediate loop, while assertion FIFO 1353 provides an assertion for the conditional execution of instructions in the outer loop. Assertion FIFO 1353 is also indexed to use the GETP instruction offset value as a control signal for multiplexer 1354. Furthermore, the GETP instruction level value is provided as a control signal for multiplexer 1356 to select whether a specific offset assertion value is provided by assertion FIFO 1312, which is filled based on the inner loop counter 1308, or by assertion FIFO 1353, which is filled based on the intermediate loop counter 1302. Regardless of whether the assertion value is provided by assertion FIFO 1312, which is filled based on the inner loop counter 1308, or by assertion FIFO 1353, which is filled based on the intermediate loop counter 1302, the assertion value is provided to register 1357 specified by the GETP instruction, and the value of said register can then be used to determine whether the assertion instruction should be executed.
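The read path formed by multiplexers 1352, 1354, and 1356 can be summarized with a short C sketch. The 32-bit FIFO width, the convention that bit 0 corresponds to offset 0, and the mapping of level 1 to assertion FIFO 1353 are assumptions for illustration; the text above only states that level 0 selects assertion FIFO 1312. The names `nlc_fifos_t` and `nlc_getp_level` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t fifo1312;  /* filled based on the inner loop counter 1308 */
    uint32_t fifo1353;  /* filled based on the intermediate loop counter 1302 */
} nlc_fifos_t;

/* GETP with a level and an offset: the offset indexes both FIFOs and the
 * level chooses which indexed bit is written to the destination register. */
unsigned nlc_getp_level(const nlc_fifos_t *f, unsigned level, unsigned offset)
{
    assert(offset < 32);
    unsigned bit1312 = (f->fifo1312 >> offset) & 1u;   /* multiplexer 1352 */
    unsigned bit1353 = (f->fifo1353 >> offset) & 1u;   /* multiplexer 1354 */
    return (level == 0) ? bit1312 : bit1353;           /* multiplexer 1356 */
}
```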
[0107] Additionally, the nested loop controller 1350 illustrates the different ways in which the inner loop counter 1308 and the outer loop counter 1302 can be loaded. For example, under the control of the multiplexer 1358, the inner loop counter 1308 can be loaded with a decremented value from the subtraction circuit 1314 (e.g., when both the inner loop counter 1308 and the outer loop counter 1302 are non-zero) or with the value in the inner loop reload register 1304 (e.g., when the inner loop counter 1308 reaches zero and the outer loop counter 1302 is non-zero). The inner loop counter 1308 can also be loaded initially with a value supplied by a move constant (MVC) instruction, which is also used to initialize the inner loop reload register 1304.
[0108] In another example, under the control of multiplexer 1360, the outer loop counter 1302 is loaded with a decremented value from subtraction circuit 1362 (e.g., when inner loop counter 1308 reaches zero and outer loop counter 1302 is non-zero) or with the current value of outer loop counter 1302 (e.g., when both inner loop counter 1308 and outer loop counter 1302 are non-zero). The outer loop counter 1302 can also be loaded initially with a value supplied by a move constant (MVC) instruction.
[0109] As described above, the nested loop controller 1350 is initialized with the NLCINIT instruction, which initializes the inner loop counter 1308 and outer loop counter 1302 registers and provides an episkew parameter indicating the number of additional branches to be taken once the inner loop counter 1308 and the outer loop counter 1302 reach zero.
[0110] Similarly, as described above, the nested loop controller 1350 advances in response to the TICK instruction, which causes the inner loop counter 1308 and the outer loop counter 1302 to advance and updates the assertion FIFOs 1312 and 1353. In one instance, counters 1308 and 1302 advance in an "odometer" fashion, in which the inner loop counter 1308 reaching its terminal count (e.g., zero) causes the outer loop counter 1302 to advance (e.g., decrement). In some instances, it is assumed that the TICK instruction is the first instruction in the software-pipelined loop.
[0111] Furthermore, as described above, the nested loop controller 1350 assertion FIFOs 1312 and 1353 are accessed in response to the GETP instruction. The GETP instruction provides a query to the nested loop controller 1350 that retrieves an assertion value at a specified level. In a given iteration of the loop, retrieving an assertion before a TICK instruction results in a pre-loop assertion (e.g., the assertion register value of the pre-loop instruction), while retrieving an assertion after a TICK instruction results in a post-loop assertion (e.g., the assertion register value of the post-loop instruction). In one instance, a GETP instruction issued after initialization (e.g., execution of the NLCINIT instruction) but before the first TICK returns "true" or the accepted assertion value. This GETP instruction corresponds to a pre-loop assertion, thus this exemplary behavior allows the pre-loop instruction to be executed on the first loop iteration.
[0112] As explained above, the GETP instruction also includes a level parameter specifying the loop level (e.g., the control signal for multiplexer 1356). The GETP instruction offset is used as the control signal for multiplexers 1352 and 1354 to index into assertion FIFOs 1312 and 1353, which accommodates overlapping lifetimes in the software pipeline. In one instance, unlike the move constant (MVC) instruction, which executes only on the .S1 unit, the GETP instruction can be executed on the .S1, .L1, and .M1 units.
[0113] Figures 14A-1 and 14A-2 show an example of pseudo-assembly instructions obtained by compiling merged nested loops according to the present disclosure. Specifically, Figures 14A-1 and 14A-2 show a merged loop that is not software-pipelined but uses the nested loop controllers 1300 and 1350 described above. The loop in Figures 14A-1 and 14A-2 corresponds to the loops shown in Examples 1 and 2 above. It contains an inner loop that sums a portion of a vector. The outer loop initializes the accumulator to zero before the inner loop. After the inner loop completes, the outer loop stores the accumulated result in memory. Therefore, when the nested loops are merged, the outer loop operations need to be properly asserted. One GETP instruction is used to generate the assertion for the accumulator initialization, and another GETP instruction is used to generate the assertion for storing the accumulated result in memory.
[0114] The vertical axis represents the cycle count and shows which instructions are executed in each cycle. The horizontal axis "slots" represent the functional units that perform specific instructions and thus depict the instructions executed on those specific functional units. In this example, a new iteration of the loop begins every 8 cycles.
[0115] Figure 14A-1 and 14A-2The values of assertion FIFO 1312 for each cycle are also shown, where the leftmost bit is the input or "first" bit. As explained above, the execution of the merged nested loops begins with the NLCINIT instruction, which is cellless. Furthermore, in cycle 0, the TICK instruction is initially executed; however, it should be understood that this tick can be omitted by changing the logic of the subsequent GETP instruction (e.g., to change the offset of assertion FIFO 1312 by -1). Only one TICK instruction occurs before the GETP instruction is executed in cycle 1, which obtains the assertion used to initialize the accumulator (the MVK (shift constant) instruction in cycle 2), and therefore, the offset of assertion FIFO 1312 for the GETP instruction is zero. The GETP instruction in cycle 1 obtains the first bit of the outer loop's assertion FIFO 1312. In some cases, assertion FIFO 1312 is initialized to have the outer loop value (e.g., 1) in its first input bit. In this case, since the loop has just started and the TICK instruction was executed in the previous cycle, the first bit in FIFO 1312 is asserted to be true (1), which is stored in register A5, and therefore the MVK instruction will be executed in this time through the inner loop. Figure 14A-1 and 14A-2 In the text, the bold text indicates when the obtained assertion is true (and therefore, the outer loop instruction is executed).
[0116] In cycle 2, since the assertion value contained in A5 is true (1), the MVK (move constant) instruction is executed to load register B0 with the value 0. However, in the second iteration, which starts in cycle 8, the GETP instruction in cycle 9 returns false (0) because the TICK instruction in cycle 8 pushes 0 into assertion FIFO 1312, so the MVK instruction is not executed.
[0117] By the end of the third iteration, which starts in cycle 16, three inner loop iterations have been executed, and therefore the outer loop code (e.g., storing the accumulated result) should be executed after the inner loop code. For this reason, the GETP instruction in cycle 22 (which returns the assertion for the SDD (store double word) instruction that stores the accumulated result) uses an offset of 2, so that it obtains the assertion from two iterations earlier, i.e., from when the outer loop assertion was true (1). The loop level parameter explained above is omitted here for clarity, since this example has only one level of loop nesting.
[0118] The example of Figures 14A-1 and 14A-2 continues with the second iteration of the outer loop, in which the accumulator initialization (e.g., the MVK instruction) is performed because the TICK instruction in cycle 24 pushes true (1) into assertion FIFO 1312, causing the GETP instruction in cycle 25 to return that value. As shown in the remainder of Figures 14A-1 and 14A-2, the loop iterations continue in a similar manner through cycles 26-55.
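The assertion FIFO behavior walked through above for Figures 14A-1 and 14A-2 can be sketched in Python as follows. This is a simplified software model under stated assumptions, not the hardware of nested loop controller 1300: the class name `NestedLoopControllerModel`, the `pending` field, and the FIFO depth of 16 are hypothetical, and the model assumes decrementing counters with an inner loop count of 3 and an outer loop count of 5. Index 0 of the FIFO is the most recently pushed ("first") bit, so `getp(0)` reads the current iteration's bit and `getp(2)` reads the bit pushed two iterations earlier.

```python
from collections import deque


class NestedLoopControllerModel:
    """Simplified model of the counters and assertion FIFO (assumed design)."""

    def __init__(self, inner_count, outer_count, fifo_depth=16):
        self.inner_reload = inner_count - 1   # value the inner counter resets to
        self.ilc = self.inner_reload          # inner loop counter (decrementing)
        self.olc = outer_count - 1            # outer loop counter (decrementing)
        # Index 0 is the input ("first") bit; old bits fall off the right end.
        self.fifo = deque([0] * fifo_depth, maxlen=fifo_depth)
        # Assumption: the first TICK marks the start of an outer loop iteration.
        self.pending = 1

    def tick(self):
        """Push one assertion bit, then advance the counters."""
        self.fifo.appendleft(self.pending)
        if self.ilc != 0:
            self.ilc -= 1
            self.pending = 0                  # still inside the inner loop
        elif self.olc != 0:
            self.olc -= 1
            self.ilc = self.inner_reload      # inner counter is reset
            self.pending = 1                  # next tick starts an outer iteration
        else:
            self.pending = 0                  # both counters are exhausted

    def getp(self, offset):
        """Return the bit pushed `offset` ticks before the latest one."""
        return self.fifo[offset]


def trace(inner_count=3, outer_count=5):
    """Return (init_bit, store_bit) for each merged-loop iteration."""
    nlc = NestedLoopControllerModel(inner_count, outer_count)
    rows = []
    for _ in range(inner_count * outer_count):
        nlc.tick()
        # Offset 0 guards accumulator initialization; offset 2 (for an inner
        # count of 3) guards the store of the accumulated result.
        rows.append((nlc.getp(0), nlc.getp(inner_count - 1)))
    return rows


rows = trace()
```

In this model, the first element of each row is 1 on iterations 0, 3, 6, 9, and 12 (accumulator initialization) and the second element is 1 on iterations 2, 5, 8, 11, and 14 (result store), which is consistent with the offsets 0 and 2 described above.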
[0119] Figure 14B illustrates the merged nested loop after it has been software-pipelined, again using the aforementioned nested loop controllers 1300 and 1350. In this example, as described above, the inner loop count is 3 and the outer loop count is 5, resulting in a total of 15 original inner loop iterations. A new iteration of the merged loop begins at the start of each cycle. In cycle 0, the TICK instruction pushes true (1) to assertion FIFO 1312, while in cycle 1, the GETP instruction with offset 0 obtains 1 from the front of assertion FIFO 1312 (e.g., during the first part of the cycle). The TICK instruction in cycle 1 causes comparator 1310 of nested loop controller 1300 to push 0 to assertion FIFO 1312 (e.g., during the second part of the cycle). Because the GETP instruction reads assertion FIFO 1312 during the first part of the cycle, the 1 from the previous cycle is still at the front of assertion FIFO 1312, and therefore the GETP instruction uses offset 0.
[0120] In Figure 14B, the instructions ramp up as each successive iteration begins at the start of a software pipeline cycle. As in the example of Figures 14A-1 and 14A-2, bold text indicates when the acquired assertion is true and the outer loop instruction is executed. Once all of the instructions that make up the merged nested loop are executing (as in cycle 7), the loop is considered to be in the kernel. The "beginning" (prologue) refers to the cycles that precede the kernel and build up to it (e.g., cycles 0-6), while the "end" (epilogue) refers to the cycles that follow the kernel and wind the loop down (e.g., cycles 15-21). In cycle 7, the software-pipelined merged loop enters the kernel phase, in which the loop repeats the same sequence of instructions (in this example, a single cycle of instructions because of the 1-cycle start interval). In cycle 8, the third GETP instruction is executed. This GETP instruction acquires the assertion used to store the accumulated value held in register B0; therefore, the acquired assertion should be true. As can be seen, offset 7 of the GETP instruction acquires the outer loop assertion from the first iteration, when the outer loop assertion was true. The inner loop counter 1308 (ILC) decrements each cycle and is reset to 2 after 3 iterations. At the end of every 3 iterations, the outer loop counter 1302 (OLC) is decremented. When both OLC and ILC reach zero, the branch in the kernel is not taken, so control falls through to the end. The end code then winds the loop down, finishing with a GETP instruction and the store of the final accumulated value.
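The cycle ranges quoted above for the beginning, kernel, and end follow from ordinary software-pipelining arithmetic, which the following Python sketch reproduces. The function name `pipeline_phases` is hypothetical, and the stage count of 8 is an assumption inferred from the kernel starting in cycle 7 with a 1-cycle start interval; it is not stated explicitly in the text.

```python
def pipeline_phases(iterations, stages, interval=1):
    """Return inclusive (first, last) cycle ranges for the three phases.

    iterations: number of merged-loop iterations (15 in the example)
    stages:     pipeline stages per iteration (assumed to be 8 here)
    interval:   cycles between the starts of successive iterations
    """
    if stages < 2 or iterations < stages:
        raise ValueError("sketch assumes 2 <= stages <= iterations")
    ramp = (stages - 1) * interval                    # fill (or drain) length
    kernel_len = (iterations - stages + 1) * interval
    return {
        "beginning": (0, ramp - 1),
        "kernel": (ramp, ramp + kernel_len - 1),
        "end": (ramp + kernel_len, 2 * ramp + kernel_len - 1),
    }


phases = pipeline_phases(iterations=15, stages=8)
```

With these assumed inputs, the beginning spans cycles 0-6, the kernel spans cycles 7-14, and the end spans cycles 15-21.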
[0121] Figure 14C shows the merged nested loop, which uses the aforementioned nested loop controllers 1300 and 1350 and has been software-pipelined with the end completely folded. Figure 14C is similar to Figure 14B, except that the kernel is simply executed additional times, demonstrating that the end has been removed by folding its stages into the kernel. In this example, the compiler uses a bias of 7 on NLCINIT to instruct nested loop controllers 1300 and 1350 to execute the kernel an additional 7 times (e.g., by executing additional BNL instructions, which are explained further below). In some instances, the compiler determines that the excess execution of instructions (indicated by dashed box 1402) is safe and does not affect the program's outcome; otherwise, it establishes assertions for conditionally executing these instructions. For simplicity, assertions for controlling or preventing excess execution of kernel code are not shown in this example.
[0122] The BNL instruction is a branch instruction that indicates a return to the kernel (i.e., the execution packet shown in cycle 9). In instances where the inner and outer loop counters 1308 and 1302 are decremented as they are advanced, executing the BNL instruction after these counters reach zero still results in the branch being taken (e.g., to the kernel) until the number of iterations specified by the up-bias parameter (EPISKEW) has completed. As explained above, each TICK instruction reduces the loop count value by decrementing the inner and outer loop counters 1308 and 1302. In one instance, the total number of ticks is calculated as (OLCNT value * ILCNT value) + EPISKEW. The branch is taken if the number of counted ticks is less than this total, and is not taken if the number of counted ticks is greater than or equal to it.
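The branch rule described above can be written out as a short Python sketch. The function names `total_ticks` and `bnl_branch_taken` are hypothetical; the formula and the less-than comparison are taken from the paragraph above, and the values 5, 3, and 7 are those of the Figure 14B and 14C examples.

```python
def total_ticks(olcnt, ilcnt, episkew=0):
    """Total ticks after which the BNL branch stops being taken."""
    return olcnt * ilcnt + episkew


def bnl_branch_taken(ticks_counted, olcnt, ilcnt, episkew=0):
    """Branch back to the kernel only while fewer than the total ticks have occurred."""
    return ticks_counted < total_ticks(olcnt, ilcnt, episkew)


# Outer count 5, inner count 3, no bias: 15 ticks in total.
no_bias_total = total_ticks(5, 3)
# Same loop with a bias of 7: 22 ticks in total.
biased_total = total_ticks(5, 3, episkew=7)
```

Under these values, a BNL executed after 15 counted ticks falls through when the bias is 0 but still branches when the bias is 7.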
[0123] Figures 14B and 14C therefore show an expanded view of content that occupies less space in the instruction store (e.g., instruction cache 121). For example, the instruction store contains the beginning instructions; a single instance of the kernel, which is branched back to repeatedly until the inner loop counter 1308 of nested loop controller 1300 reaches its threshold (e.g., 0) for the final time (e.g., once the outer loop counter 1302 has also reached its threshold of 0), which occurs in cycle 14 as shown; and the end instructions. In the example of Figure 14C, because the end has been folded as explained above, the end instructions are not included.
[0124] As explained, in the example of Figure 14C the end is additionally folded. The end is functionally equivalent to the kernel; the only difference is that, each time the kernel is executed in place of a folded end cycle, some of its instructions are extraneous, namely those shown in dashed box 1402. The up-bias parameter and register discussed above take end folding into account in determining how many BNL instructions to use. The NLCINIT instruction specifies the up-bias value, which can be calculated by the compiler (discussed in further detail below) as, for example, the number of TICK instructions preceding the first BNL instruction minus the number of unfolded end stages. In the current example, the end stages are all single cycles; however, in other examples, an end stage may contain multiple cycles. Thus, in Figure 14B, the number of TICK instructions preceding the first BNL instruction is 7 and the number of unfolded end stages is also 7, resulting in an up-bias parameter of 0. In Figure 14C, however, the number of TICK instructions preceding the first BNL instruction is 7 while the number of unfolded end stages is 0, resulting in an up-bias parameter of 7.
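As a worked check of the two cases above, the bias calculation can be sketched in Python. The function name `up_bias` and its parameter names are hypothetical; the subtraction itself is the example calculation given in the paragraph above.

```python
def up_bias(ticks_before_first_bnl, unfolded_end_stages):
    """Bias value for NLCINIT: ticks before the first BNL minus unfolded end stages."""
    return ticks_before_first_bnl - unfolded_end_stages


bias_14b = up_bias(ticks_before_first_bnl=7, unfolded_end_stages=7)  # end kept
bias_14c = up_bias(ticks_before_first_bnl=7, unfolded_end_stages=0)  # end fully folded
```

Here `bias_14b` is 0 and `bias_14c` is 7, matching the values stated for Figures 14B and 14C.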
[0125] Therefore, the nested loop controller 1300 provides a simplified way to generate, store, and access the assertion values that control the execution of instructions within merged nested loops. For example, assertion values do not have to be stored into and loaded from registers, both of which consume functional units, not to mention the complexity of tracking which assertion value resides in which register; instead, the TICK instruction is unit-free and therefore does not occupy a functional unit. Furthermore, the GETP instruction allows simple indexed access to the assertion FIFO 1312. Additionally, the NLCINIT instruction allows the loop counters to be set easily and uses an up-bias counter to facilitate folding of the end, further reducing the instruction storage required for merged nested loops.
[0126] Referring back to Figures 13A and 13B, in some instances an interrupt or exception event may occur during the processing of nested loops. In response to such an event, the DSP 100 saves the registers of the nested loop controllers 1300 and 1350 (e.g., inner loop counter 1308, outer loop counter 1302, assertion FIFOs 1312 and 1353, up-bias counter 1316, and inner loop overload register 1304). Once the interrupt or exception event has been handled, the DSP 100 resumes by reloading the registers of the nested loop controllers 1300 and 1350 from, for example, a location in memory pointed to by the event handling pointer.
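The save and reload sequence described above might be modeled in software as in the following Python sketch. Everything here is an illustrative assumption: the `ControllerRegisters` dataclass, the dictionary standing in for memory, and the address `0x100` standing in for the event handling pointer are hypothetical and do not describe how the DSP 100 actually lays out the saved state.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ControllerRegisters:
    """Hypothetical snapshot of the registers listed above."""
    inner_loop_counter: int
    outer_loop_counter: int
    inner_loop_overload: int
    up_bias_counter: int
    assertion_fifo: list = field(default_factory=list)


def save_context(regs, memory, pointer):
    """Store a deep copy so later register changes do not alter the saved state."""
    memory[pointer] = copy.deepcopy(regs)


def restore_context(memory, pointer):
    """Reload the registers from the saved location."""
    return copy.deepcopy(memory[pointer])


memory = {}
regs = ControllerRegisters(1, 3, 2, 0, [0, 1, 0])
save_context(regs, memory, pointer=0x100)

# An event handler that itself uses the controller would overwrite its state.
regs.inner_loop_counter = 0
regs.assertion_fifo.append(1)

regs = restore_context(memory, pointer=0x100)
```

After the restore, `regs` again holds the values that were live when the event occurred, so the interrupted merged loop can continue with the correct counters and assertion bits.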
[0127] Figure 15 shows an exemplary system 1500 for compiling nested loops according to an example of this disclosure. System 1500 includes a compiler 1502 coupled to a DSP 1504, which is functionally equivalent to the DSP 100 described above. DSP 1504 includes an instruction storage device 1506 storing assembly or object-level instructions to be executed by DSP 1504. Instruction storage device 1506 may be equivalent to or similar to the instruction cache 121 described above.
[0128] Compiler 1502 is configured to receive higher-level software code containing nested loops and to compile the nested loops into assembly or object-level instructions to be stored in instruction storage device 1506 and subsequently executed by DSP 1504. Compiler 1502 is configured to generate initialization instructions (e.g., NLCINIT as described above) that execute before the instructions for the nested loops. The initialization instructions initialize a nested loop controller, such as the nested loop controllers 1300 and 1350 shown in Figures 13A and 13B above. Specifically, the initialization instructions initialize the outer loop counter value (e.g., outer loop counter 1302), the inner loop counter value (e.g., inner loop counter 1308 and/or inner loop overload register 1304), and the assertion FIFO (e.g., assertion FIFOs 1312, 1353).
[0129] The compiler 1502 is also configured to merge nested loops by conditionalizing, or asserting, instructions associated with the outer loop. Therefore, at least one asserted instruction in the merged nested loop corresponds to an instruction in the outer loop of the nested loop being compiled. The process of merging nested loops is described in more detail above, particularly with respect to Figures 14A and 14B.
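At the source level, the effect of merging can be illustrated with the following Python sketch, which compares an imperfect nested loop with a single merged loop whose outer loop operations are guarded by conditions. The function names are hypothetical, and the modulo tests merely stand in for the assertion bits that the nested loop controller would supply; the sketch does not represent the compiler's actual output.

```python
def nested_sums(data, outer_count, inner_count):
    """Reference form: outer loop code surrounds the inner loop."""
    results = []
    for o in range(outer_count):
        acc = 0                                   # outer loop: initialize accumulator
        for i in range(inner_count):
            acc += data[o * inner_count + i]      # inner loop: accumulate
        results.append(acc)                       # outer loop: store result
    return results


def merged_sums(data, outer_count, inner_count):
    """Merged form: one loop, with the outer loop code conditionally executed."""
    results = []
    acc = 0
    for k in range(outer_count * inner_count):
        starts_outer = (k % inner_count == 0)
        ends_outer = (k % inner_count == inner_count - 1)
        if starts_outer:                          # asserted initialization
            acc = 0
        acc += data[k]                            # unconditional inner loop body
        if ends_outer:                            # asserted store
            results.append(acc)
    return results


data = list(range(1, 16))
reference = nested_sums(data, outer_count=5, inner_count=3)
merged = merged_sums(data, outer_count=5, inner_count=3)
```

Both forms produce the same five partial sums, which is the property the merged, asserted loop has to preserve.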
[0130] As shown in Figures 14A and 14B, compiler 1502 is also configured to generate a TICK instruction for each iteration of the merged nested loop. As explained above, the TICK instruction advances the inner loop counter (and the outer loop counter, e.g., when the inner loop counter reaches a threshold), and generally controls the advancement of the nested loop controller registers during the execution of the merged nested loop. The advancement of the inner loop counter (and the outer loop counter) also populates assertion FIFOs 1312 and 1353, as explained above.
[0131] Furthermore, for each asserted instruction in a nested loop (e.g., corresponding to an outer loop instruction), compiler 1502 is configured to generate a get assertion instruction (e.g., GETP mentioned above) that includes an offset value and that returns the value at the position in the assertion FIFO specified by the offset value. In this way, assertion values are generated, stored, and accessed in a simplified manner. For example, assertion values do not have to be stored into and loaded from registers, both of which consume functional units, not to mention the complexity of tracking which assertion value resides in which register; instead, the TICK instruction is unit-free and therefore does not occupy a functional unit. Moreover, the GETP instruction allows simple indexed access to assertion FIFOs 1312 and 1353.
[0132] As explained above, in some instances, certain outer loops within nested loops may not contain any instructions other than one or more loop control instructions. Compiler 1502 can be configured to collapse these outer loops that do not contain instructions. Furthermore, in some cases, the end of a merged nested loop can be folded, for example, by continuing to branch to the kernel of the merged nested loop after the inner and outer loop counters have reached their thresholds (e.g., 0 in the case of decrementing counters initialized to the decremented inner and outer loop counts, respectively). In this case, compiler 1502 generates an initialization instruction (e.g., NLCINIT) to initialize an up-bias value (e.g., up-bias counter 1316) that specifies the number of additional iterations, or branches to the kernel, to be taken once the inner and outer loop counters have reached their thresholds (0 in the current example).
[0133] In the foregoing discussion and claims, the terms “comprising” and “including” are used in an open-ended manner and should therefore be interpreted as meaning “including but not limited to…”. Furthermore, the term “coupled” is intended to indicate an indirect or direct connection. Thus, if a first device is coupled to a second device, the connection can be a direct connection or an indirect connection via other devices and connections. Similarly, a device coupled between a first component or location and a second component or location can be coupled through a direct connection or through an indirect connection via other devices and connections. Elements or features “configured” to perform a task or function can be configured by the manufacturer at the time of manufacture (e.g., programmed or structurally designed) to perform the function and/or can be configured (or reconfigured) by the user after manufacture to perform the function and/or other additional or alternative functions. Configuration can be achieved through firmware and/or software programming of the device, through the construction and/or arrangement of the device’s hardware components and interconnections, or through a combination thereof. Additionally, in the foregoing discussion, the use of the phrase “ground” or similar terms is intended to encompass chassis ground, earth ground, floating ground, virtual ground, digital ground, common ground, and/or any other form of ground connection applicable to or suitable for the teachings of this disclosure. Unless otherwise stated, “about,” “approximately,” or “substantially” preceding a value means +/- 10% of that value.
[0134] The foregoing discussion is intended to illustrate the principles and various embodiments of this disclosure. Once the foregoing disclosure is fully understood, many variations and modifications will become apparent to those skilled in the art. The following claims are intended to be construed as covering all such variations and modifications.
Claims
1. A nested loop controller, comprising: a first register having a first value and configured to be initialized to an initial first value; a second register having a second value and configured to be initialized to an initial second value; and a third register configured as an assertion first-in-first-out (FIFO) buffer and configured to be initialized to have a third value, wherein the third value includes a first bit equal to an outer loop indicator; wherein, during loop execution, the second value is advanced in response to a unit-free tick instruction, and, in response to the second value reaching a second threshold, the second register is reset to the initial second value; the nested loop controller further comprising a comparator coupled to the second register and the assertion FIFO and configured to: provide an outer loop indicator value as input to the assertion FIFO when the second value equals the second threshold; and provide an inner loop indicator value as input to the assertion FIFO when the second value does not equal the second threshold; wherein the assertion FIFO is configured to store the outer loop indicator value and the inner loop indicator value and to output them in first-in-first-out order, so as to control the execution of nested loop assertion instructions in the pipeline.
2. The nested loop controller according to claim 1, wherein: the initial second value is equal to zero; the second value is incremented in response to the tick instruction; and the second threshold is equal to the inner loop count value minus one.
3. The nested loop controller according to claim 1, wherein: the initial second value is equal to the inner loop count value minus one; the second value is decremented in response to the tick instruction; and the second threshold is equal to zero.
4. The nested loop controller of claim 1, wherein the first value is advanced in response to the second value reaching the second threshold.
5. The nested loop controller of claim 4, wherein the second value stops being reset in response to the first value reaching a first threshold.
6. The nested loop controller of claim 4, wherein the comparator is further configured to provide an inner loop indicator as input to the assertion FIFO when the second value is equal to the second threshold and the first value is equal to the first threshold.
7. The nested loop controller of claim 4, further comprising a fourth register configured to be initialized to a fourth value, wherein the fourth value is advanced in response to a tick instruction once the first value reaches a first threshold.
8. The nested loop controller of claim 1, wherein the first, second and/or third registers comprise control registers of a digital signal processor.
9. A method for nested loop control, comprising: initializing a first register, which has an associated first value, to an initial first value; initializing a second register, which has an associated second value, to an initial second value; initializing a third register, which has an associated third value, to an initial third value, wherein the third register is configured as an assertion first-in-first-out (FIFO) buffer; during loop execution, advancing the second value in response to a unit-free tick instruction; in response to the second value reaching a second threshold, resetting the second register to the initial second value; when the second value equals the second threshold, providing an outer loop indicator value as input to the assertion FIFO; when the second value does not equal the second threshold, providing an inner loop indicator value as input to the assertion FIFO; and storing, by the assertion FIFO, the outer loop indicator value and the inner loop indicator value, and outputting them in first-in-first-out order, so as to control the execution of nested loop assertion instructions in the pipeline.
10. The method of claim 9, wherein the initial second value is equal to zero, and the second threshold is equal to the inner loop count value minus one, the method further comprising incrementing the second value in response to the tick instruction.
11. The method of claim 9, wherein the initial second value is equal to the inner loop count value minus one, and the second threshold is equal to zero, the method further comprising decrementing the second value in response to the tick instruction.
12. The method of claim 9, further comprising causing the first value to advance in response to the second value reaching the second threshold.
13. The method of claim 12, further comprising stopping the resetting of the second value in response to the first value reaching a first threshold.
14. The method of claim 12, further comprising: when the second value equals the second threshold and the first value equals the first threshold, providing an inner loop indicator value as input to the assertion FIFO.
15. The method of claim 12, further comprising: initializing a fourth register, which has an associated fourth value, to an initial fourth value; and once the first value reaches the first threshold, advancing the fourth value in response to a tick instruction.
16. The method of claim 9, wherein the first, second and/or third registers comprise control registers of a digital signal processor.