Vector bit transposition
By interpreting the DSP's source data as a two-dimensional array and reversing the row and column indices, the problem of high instruction overhead in the DSP's transpose function is solved, enabling more efficient bit transpose operations and improving DSP performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TEXAS INSTRUMENTS INC
- Filing Date
- 2020-05-15
- Publication Date
- 2026-06-16
Abstract
Description
Background Technology
[0001] Modern digital signal processors (DSPs) face multiple challenges. DSPs may frequently execute software that requires transposition, for example in operations that rearrange bits (e.g., shuffle instructions that repack data with unusual bit boundaries, such as packing and unpacking 10-bit or 12-bit image data). Transposition can also be used to convert certain algorithms into bit-slice implementations by repacking bytes into bit lanes, or by unpacking dense bitmaps into byte-per-element bitmaps and then repacking them. Transposing a bitmap may require multiple instructions.
Summary of the Invention
[0002] According to at least one example of this disclosure, a method for transposing source data in a processor in response to a vector bit transpose instruction includes specifying, in corresponding fields of the vector bit transpose instruction, a source register containing the source data and a destination register for storing the transposed data. The method further includes executing the vector bit transpose instruction by interpreting N×N bits of the source data as a two-dimensional array having N rows and N columns, creating transposed source data by transposing the bits, that is, by swapping the row index and column index of each bit, and storing the transposed source data in the destination register.
[0003] According to another example of this disclosure, a data processor includes: a source register configured to contain source data; and a destination register. In response to the execution of a single vector bit transpose instruction, the data processor is configured to: interpret N×N bits of the source data as a two-dimensional array having N rows and N columns; create transposed source data by transposing the bits, that is, by swapping the row index and column index of each bit; and store the transposed source data in the destination register.
Description of the Drawings
[0004] For a detailed description of the various examples, reference will now be made to the accompanying drawings, in which:
[0005] Figure 1 illustrates a dual scalar / vector data path processor according to various examples;
[0006] Figure 2 illustrates the registers and functional units in the dual scalar / vector data path processor shown in Figure 1, according to various examples;
[0007] Figure 3 illustrates an exemplary global scalar register file;
[0008] Figure 4 illustrates an exemplary local scalar register file shared by the arithmetic functional units;
[0009] Figure 5 illustrates an exemplary local scalar register file shared by multiple functional units;
[0010] Figure 6 illustrates an exemplary local scalar register file shared by the load / store units;
[0011] Figure 7 illustrates an exemplary global vector register file;
[0012] Figure 8 illustrates an exemplary assertion register file;
[0013] Figure 9 illustrates an exemplary local vector register file shared by the arithmetic functional units;
[0014] Figure 10 illustrates an exemplary local vector register file shared by the multiplication and related functional units;
[0015] Figure 11 illustrates the pipeline stages of a central processing unit according to various examples;
[0016] Figure 12 illustrates sixteen instructions of a single fetch packet according to various examples;
[0017] Figures 13A-13D illustrate examples of instruction execution according to various examples;
[0018] Figure 14 illustrates exemplary vector registers according to various examples;
[0019] Figure 15 illustrates instruction encoding of instructions according to various examples; and
[0020] Figure 16 is a flowchart of a method for executing instructions according to various examples.
Detailed Description
[0021] As mentioned above, DSPs often execute software that requires transpose functionality. Implementing transpose functionality at the processor level (e.g., using assembly or compiler-level instructions) might require multiple instructions to transpose a bit pattern. Since the transpose operations that need to be performed by the DSP are typically frequent and repetitive, increasing instruction and computational overhead is undesirable.
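To give a sense of the overhead described above, the following is a minimal sketch of a software 8×8 bit transpose built from ordinary shift, mask, and XOR operations (the widely used "delta swap" technique). It is not taken from this disclosure. The function names, and the assumption that bit 0 is the least significant bit and that row r occupies bits 8r through 8r+7, are illustrative only.

```python
MASK64 = (1 << 64) - 1


def delta_swap(x: int, mask: int, shift: int) -> int:
    """Swap each bit selected by `mask` with the bit `shift` positions above it."""
    t = ((x >> shift) ^ x) & mask
    return (x ^ t ^ (t << shift)) & MASK64


def transpose8x8_software(x: int) -> int:
    """Transpose one 8x8 bit block held in a 64-bit integer.

    Assumed layout (not from the source): flat bit index = 8 * row + column,
    with bit 0 as the least significant bit.
    """
    x = delta_swap(x, 0x00AA00AA00AA00AA, 7)   # swap single bits inside 2x2 blocks
    x = delta_swap(x, 0x0000CCCC0000CCCC, 14)  # swap 2x2 blocks inside 4x4 blocks
    x = delta_swap(x, 0x00000000F0F0F0F0, 28)  # swap the two off-diagonal 4x4 blocks
    return x
```

Even in this compact form, each of the three steps expands to several shift, AND, and XOR operations on a conventional instruction set, and that work is repeated for every 64-bit block.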
[0022] To improve the performance of a DSP performing transpose functions, at least by reducing the instruction overhead required to execute those transpose functions, the examples in this disclosure pertain to vector bit transpose instructions that interpret the bits of source data as a two-dimensional array (e.g., having N rows and N columns). In this example, the source data comprises a vector having at least N × N bits. The vector transpose instruction then creates the transposed source data by transposing the N × N bits by reversing the "row index" and "column index" for each bit, and stores the transposed source data in a destination register.
[0023] In one example, interpreting the N×N bits, which ordinarily form a one-dimensional vector, as a two-dimensional array involves interpreting the first N bits of the N×N bits as having a first row index (e.g., row index equal to 0), the second N bits of the N×N bits as having a second row index (e.g., row index equal to 1), and so on, through interpreting the last N bits of the N×N bits as having a last row index (e.g., row index equal to N-1). Continuing with this example, the bits in each "row" of the two-dimensional array interpretation are also interpreted as having column indices. For example, the bits in the first row have corresponding column indices ranging from 0 to N-1. The same applies to the second through Nth rows of the two-dimensional array interpretation.
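The row and column interpretation described above can be sketched as a pair of index conversions. This is a minimal illustration; the function names are hypothetical, and it assumes the "first" bit is bit 0 of the flat vector.

```python
from typing import Tuple


def bit_to_row_col(bit_index: int, n: int) -> Tuple[int, int]:
    """Map a flat bit position within an N*N-bit group to (row, column).

    Assumption: bits 0..N-1 form row 0, bits N..2N-1 form row 1, and so on.
    """
    if not 0 <= bit_index < n * n:
        raise ValueError("bit_index must lie within the N*N-bit group")
    return divmod(bit_index, n)


def row_col_to_bit(row: int, col: int, n: int) -> int:
    """Inverse mapping: (row, column) back to the flat bit position."""
    if not (0 <= row < n and 0 <= col < n):
        raise ValueError("row and column must both be in range 0..N-1")
    return row * n + col
```

For N = 8, for instance, flat bit 7 is the last bit of row 0 and flat bit 8 is the first bit of row 1.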
[0024] Then, the N×N bits are transposed by reversing the row and column indices for each bit, thus creating the transposed source data. For example, a bit with a column index of N-1 and a row index of 0 (denoted as the ordered pair (N-1, 0)) will have a transposed column index of 0 and a row index of N-1 (e.g., (0, N-1)). The transposed source data is then stored in the destination register.
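The index swap described above can be modeled in a few lines. This is a behavioral sketch only, not the hardware implementation; the function name is hypothetical, and it assumes bit 0 is the least significant bit and that the flat bit index equals row × N + column.

```python
def transpose_bits(value: int, n: int) -> int:
    """Return the N*N-bit `value` with the row and column index of every bit swapped.

    Assumption: flat bit index = row * n + column, bit 0 least significant.
    """
    result = 0
    for row in range(n):
        for col in range(n):
            bit = (value >> (row * n + col)) & 1
            # The bit at (row, col) moves to (col, row).
            result |= bit << (col * n + row)
    return result
```

With N = 8, the bit at column 7 of row 0 (flat bit 7) moves to column 0 of row 7 (flat bit 56), matching the ordered-pair example in the text. Applying the function twice returns the original value.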
[0025] In some examples, the vector bit transpose instruction is a Single Instruction Multiple Data (SIMD) instruction that operates on source data divided into multiple groups of N×N bits. For example, the source register is a 512-bit vector register, and N = 8, such that there are eight groups of 64 bits (e.g., double words), which are interpreted as a two-dimensional array and transposed in response to the execution of a single vector bit transpose instruction. Other examples are similar within the scope of this disclosure, such as where N = 4, and therefore the 512-bit source data comprises 32 groups of 16 bits (e.g., half words), which are interpreted as a two-dimensional array and transposed in response to the execution of a single vector bit transpose instruction. In yet another example, N = 16, and therefore the 512-bit source data comprises two groups of 256 bits, which are interpreted as a two-dimensional array and transposed in response to the execution of a single vector bit transpose instruction.
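The SIMD grouping described above can be sketched as follows. This is a software model under stated assumptions, not the instruction's actual implementation: the names are hypothetical, the register is modeled as a Python integer, and group 0 is assumed to occupy the least significant N×N bits.

```python
VECTOR_BITS = 512  # width of the source vector register in the example


def groups_per_vector(n: int, vector_bits: int = VECTOR_BITS) -> int:
    """Number of independent N*N-bit groups held in one vector register."""
    if vector_bits % (n * n) != 0:
        raise ValueError("vector width must be a multiple of N*N")
    return vector_bits // (n * n)


def transpose_group(group: int, n: int) -> int:
    """Swap the row and column index of every bit in one N*N-bit group."""
    out = 0
    for row in range(n):
        for col in range(n):
            out |= ((group >> (row * n + col)) & 1) << (col * n + row)
    return out


def vector_bit_transpose(vector: int, n: int, vector_bits: int = VECTOR_BITS) -> int:
    """Transpose every N*N-bit group of the vector independently."""
    group_bits = n * n
    mask = (1 << group_bits) - 1
    result = 0
    for g in range(groups_per_vector(n, vector_bits)):
        group = (vector >> (g * group_bits)) & mask
        result |= transpose_group(group, n) << (g * group_bits)
    return result
```

For a 512-bit register this gives 8 groups for N = 8, 32 groups for N = 4, and 2 groups for N = 16, as in the examples above.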
[0026] By implementing a single vector bit transpose instruction that reduces the instructions and computational overhead required to perform bit transpose operations, the performance of the DSP is improved when executing software that requires transpose functionality.
[0027] Figure 1 illustrates a dual scalar / vector data path processor according to various examples of this disclosure. Processor 100 includes a separate Level 1 instruction cache (L1I) 121 and Level 1 data cache (L1D) 123. Processor 100 includes a Level 2 combined instruction / data cache (L2) 130, which holds both instructions and data. Figure 1 illustrates the connection (bus 142) between the Level 1 instruction cache 121 and the Level 2 combined instruction / data cache 130. Figure 1 also illustrates the connection (bus 145) between the Level 1 data cache 123 and the Level 2 combined instruction / data cache 130. In one example, the Level 2 combined instruction / data cache 130 of processor 100 stores both instructions to back up the Level 1 instruction cache 121 and data to back up the Level 1 data cache 123. In this example, the Level 2 combined instruction / data cache 130 is further connected to higher-level caches and / or main memory in a manner known in the art but not illustrated in Figure 1. In this example, the central processing unit core 110, the Level 1 instruction cache 121, the Level 1 data cache 123, and the Level 2 combined instruction / data cache 130 are formed on a single integrated circuit. This single integrated circuit may optionally include other circuitry.
[0028] The central processing unit core 110, under the control of the instruction fetch unit 111, fetches instructions from the Level 1 instruction cache 121. The instruction fetch unit 111 determines the next instructions to be executed and recalls a fetch-packet-sized set of such instructions. The nature and size of the fetch packet are further detailed below. As is known in the art, upon a cache hit (if the instructions are stored in the Level 1 instruction cache 121), the instructions are fetched directly from the Level 1 instruction cache 121. Upon a cache miss (if the specified instruction fetch packet is not stored in the Level 1 instruction cache 121), the instructions are sought in the Level 2 combined cache 130. In this example, the size of a cache line in the Level 1 instruction cache 121 is equal to the size of a fetch packet. The memory locations of these instructions are either a hit or a miss in the Level 2 combined cache 130. A hit is serviced by the Level 2 combined cache 130. A miss is serviced by a higher-level cache (not illustrated) or by main memory (not illustrated). As is known in the art, the requested instructions can be supplied to both the Level 1 instruction cache 121 and the central processing unit core 110 simultaneously to speed their use.
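The hit and miss sequence described above can be modeled with a minimal sketch. The dictionaries standing in for the caches, the function name, and the decision to fill the Level 2 cache on a miss are assumptions made for illustration; the source does not specify fill or replacement policies.

```python
def fetch_packet(address, l1i, l2, main_memory):
    """Return (packet, source) for the fetch packet at `address`.

    `l1i`, `l2`, and `main_memory` are plain dicts keyed by address; this is a
    toy model and ignores capacity, eviction, and timing.
    """
    if address in l1i:                      # Level 1 instruction cache hit
        return l1i[address], "L1I"
    if address in l2:                       # L1I miss, Level 2 hit
        packet, source = l2[address], "L2"
    else:                                   # miss in both: go to main memory
        packet, source = main_memory[address], "memory"
        l2[address] = packet                # assumption: L2 is filled on a miss
    l1i[address] = packet                   # packet is supplied to L1I and the core
    return packet, source
```

A second request for the same address is then serviced by the Level 1 instruction cache.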
[0029] In one example, the central processing unit core 110 includes a plurality of functional units to execute data processing tasks specified by instructions. An instruction dispatch unit 112 determines the target functional unit for each fetched instruction. In this example, the central processing unit core 110 operates as a Very Long Instruction Word (VLIW) processor, capable of processing a plurality of instructions in corresponding functional units simultaneously. Preferably, a compiler organizes the instructions that execute together into execute packets. The instruction dispatch unit 112 directs each instruction to its target functional unit. The functional unit assigned to an instruction is entirely specified by the instruction generated by the compiler. The hardware of the central processing unit core 110 does not participate in this functional unit assignment. In this example, the instruction dispatch unit 112 can operate on a plurality of instructions in parallel. The number of such parallel instructions is determined by the size of the execute packet. This will be described in further detail below.
[0030] Part of the dispatching task of instruction dispatch unit 112 is to determine whether an instruction is executed on a functional unit in scalar data path A 115 or on vector data path B 116. An instruction bit, referred to as the s bit in each instruction, determines which data path the instruction controls. This will be described in further detail below.
[0031] Instruction decoding unit 113 decodes each instruction in the current execute packet. Decoding includes identifying the functional unit that executes the instruction, identifying, from among the possible register files (RFs), the registers used to supply data for the corresponding data processing operation, and identifying the destination register for the result of the corresponding data processing operation. As explained further below, an instruction may include a constant field in place of a register-number operand field. The result of this decoding is a set of signals used to control the target functional unit to perform the data processing operation specified by the corresponding instruction on the specified data.
[0032] The central processing unit core 110 includes a control register 114. The control register 114 stores information used to control the functional units in the scalar data path side A 115 and the vector data path side B 116. This information may include mode information, etc.
[0033] Decoded instructions from instruction decoding unit 113 and information stored in control register 114 are supplied to scalar data path side A 115 and vector data path side B 116. As a result, functional units in scalar data path side A 115 and vector data path side B 116 perform instruction-specified data processing operations on instruction-specified data and store the results in one or more instruction-specified data registers. Each of scalar data path side A 115 and vector data path side B 116 includes a plurality of functional units that preferably operate in parallel. These are described in further detail in conjunction with Figure 2. A data path 117 exists between scalar data path side A 115 and vector data path side B 116, allowing data exchange.
[0034] The central processing unit core 110 further includes non-instruction-based modules. Emulation unit 118 allows the determination of the machine state of the central processing unit core 110 in response to instructions. This capability is typically used for algorithm development. Interrupt / Exception unit 119 enables the central processing unit core 110 to respond to external asynchronous events (interrupts) and to respond to attempts to perform inappropriate operations (exceptions).
[0035] Central processing unit core 110 includes a streaming engine 125. In the example described herein, streaming engine 125 supplies two data streams from predetermined addresses, typically cached in the Level 2 combined cache 130, to the register files of the vector data path side B 116. This provides controlled data movement directly from memory (such as data cached in the Level 2 combined cache 130) to the operand inputs of the functional units. This will be described in further detail below.
[0036] Figure 1 illustrates exemplary data widths of the buses between the various sections. The Level 1 instruction cache 121 supplies instructions to the instruction fetch unit 111 via bus 141. Preferably, bus 141 is a 512-bit bus. Bus 141 is unidirectional, from the Level 1 instruction cache 121 to the central processing unit core 110. The Level 2 combined cache 130 supplies instructions to the Level 1 instruction cache 121 via bus 142. Preferably, bus 142 is a 512-bit bus. Bus 142 is unidirectional, from the Level 2 combined cache 130 to the Level 1 instruction cache 121.
[0037] The Level 1 data cache 123 exchanges data with the register files in the scalar data path side A 115 via bus 143. Preferably, bus 143 is a 64-bit bus. The Level 1 data cache 123 exchanges data with the register files in the vector data path side B 116 via bus 144. Preferably, bus 144 is a 512-bit bus. Buses 143 and 144 are illustrated as bidirectional, supporting both data reads and data writes by the central processing unit core 110. The Level 1 data cache 123 exchanges data with the Level 2 combined cache 130 via bus 145. Preferably, bus 145 is a 512-bit bus. Bus 145 is illustrated as bidirectional, supporting cache service for both data reads and data writes by the central processing unit core 110.
[0038] As is known in the art, upon a cache hit (if the requested data is stored in Level 1 data cache 123), a CPU data request is serviced directly from Level 1 data cache 123. Upon a cache miss (if the specified data is not stored in Level 1 data cache 123), the data is sought in Level 2 combined cache 130. The memory location of the requested data is either a hit or a miss in Level 2 combined cache 130. A hit is serviced by Level 2 combined cache 130. A miss is serviced by another cache level (not shown) or by main memory (not shown). As is known in the art, the requested data can be supplied to both Level 1 data cache 123 and central processing unit core 110 simultaneously to speed its use.
[0039] The Level 2 combined cache 130 supplies data from a first data stream to the streaming engine 125 via bus 146. Preferably, bus 146 is a 512-bit bus. The streaming engine 125 supplies the data from the first data stream to the functional units of the vector data path side B 116 via bus 147. Preferably, bus 147 is a 512-bit bus. The Level 2 combined cache 130 supplies data from a second data stream to the streaming engine 125 via bus 148. Preferably, bus 148 is a 512-bit bus. The streaming engine 125 supplies the data from the second data stream to the functional units of the vector data path side B 116 via bus 149. Preferably, bus 149 is a 512-bit bus. According to various examples of this disclosure, buses 146, 147, 148, and 149 are illustrated as unidirectional, from the Level 2 combined cache 130 to the streaming engine 125 and on to the vector data path side B 116.
[0040] Upon a cache hit (if the requested data is stored in the Level 2 combined cache 130), a streaming engine 125 data request is serviced directly from the Level 2 combined cache 130. Upon a cache miss (if the specified data is not stored in the Level 2 combined cache 130), the data is sought from another cache level (not illustrated) or from main memory (not illustrated). In some examples, it is technically feasible for the Level 1 data cache 123 to cache data that is not stored in the Level 2 combined cache 130. If this operation is supported, then when a streaming engine 125 data request misses in the Level 2 combined cache 130, the Level 2 combined cache 130 should snoop the Level 1 data cache 123 for the data requested by the streaming engine 125. If the Level 1 data cache 123 stores the data, its snoop response includes the data, which is then supplied to service the streaming engine 125 request. If the Level 1 data cache 123 does not store the data, its snoop response indicates this, and the Level 2 combined cache 130 must service the streaming engine 125 request from another cache level (not shown) or from main memory (not shown).
[0041] In one example, both the Level 1 data cache 123 and the Level 2 combined cache 130 can be configured as selected amounts of cache or directly addressable memory, in accordance with U.S. Patent No. 6,606,686 entitled “UNIFIED MEMORY SYSTEM ARCHITECTURE INCLUDING CACHE AND DIRECTLY ADDRESSABLE STATIC RANDOM ACCESS MEMORY”.
[0042] Figure 2 illustrates further details of the functional units and register files in scalar data path side A 115 and vector data path side B 116. Scalar data path side A 115 includes a global scalar register file (RF) 211, an L1 / S1 local register file 212, an M1 / N1 local register file 213, and a D1 / D2 local register file 214. Scalar data path side A 115 also includes L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, and D2 unit 226. Vector data path side B 116 includes a global vector register file 231, an L2 / S2 local register file 232, an M2 / N2 / C local register file 233, and an assertion register file 234. The vector data path side B 116 also includes L2 unit 241, S2 unit 242, M2 unit 243, N2 unit 244, C unit 245, and P unit 246. There are restrictions on which register files the functional units can read from or write to. These will be detailed below.
[0043] The scalar data path side A 115 includes L1 unit 221. L1 unit 221 typically accepts two 64-bit operands and produces a 64-bit result. Both operands are read from instruction-specified registers in the global scalar register file 211 or the L1 / S1 local register file 212. L1 unit 221 preferably performs the following instruction-selected operations: 64-bit addition / subtraction operations; 32-bit minimum / maximum operations; 8-bit Single Instruction Multiple Data (SIMD) instructions (such as sum of absolute values, and minimum and maximum determinations); circular minimum / maximum operations; and various move operations between register files. The result can be written to an instruction-specified register in the global scalar register file 211, the L1 / S1 local register file 212, the M1 / N1 local register file 213, or the D1 / D2 local register file 214.
[0044] The scalar data path side A 115 includes S1 unit 222. S1 unit 222 typically accepts two 64-bit operands and produces a 64-bit result. Both operands are read from instruction-specified registers in the global scalar register file 211 or the L1 / S1 local register file 212. S1 unit 222 preferably performs the same types of operations as L1 unit 221. There may optionally be slight variations between the data processing operations supported by L1 unit 221 and S1 unit 222. The result can be written to an instruction-specified register in the global scalar register file 211, the L1 / S1 local register file 212, the M1 / N1 local register file 213, or the D1 / D2 local register file 214.
[0045] The scalar data path side A 115 includes M1 unit 223. M1 unit 223 typically accepts two 64-bit operands and produces a 64-bit result. Both operands are read from instruction-specified registers in the global scalar register file 211 or the M1 / N1 local register file 213. M1 unit 223 preferably performs the following instruction-selected operations: 8-bit multiplication; complex dot product; 32-bit bit count; complex conjugate multiplication; and bitwise logic operations, shifts, additions, and subtractions. The result can be written to an instruction-specified register in the global scalar register file 211, the L1 / S1 local register file 212, the M1 / N1 local register file 213, or the D1 / D2 local register file 214.
[0046] The scalar data path side A 115 includes N1 unit 224. N1 unit 224 typically accepts two 64-bit operands and produces a 64-bit result. Both operands are read from instruction-specified registers in the global scalar register file 211 or the M1 / N1 local register file 213. N1 unit 224 preferably performs the same types of operations as M1 unit 223. There may be certain dual operations (referred to as dual-issue instructions) that employ both M1 unit 223 and N1 unit 224 together. The result can be written to an instruction-specified register in the global scalar register file 211, the L1 / S1 local register file 212, the M1 / N1 local register file 213, or the D1 / D2 local register file 214.
[0047] The scalar data path side A 115 includes D1 unit 225 and D2 unit 226. D1 unit 225 and D2 unit 226 each typically accept two 64-bit operands and each produce a 64-bit result. D1 unit 225 and D2 unit 226 typically perform address calculations and the corresponding load and store operations. D1 unit 225 is used for 64-bit scalar loads and stores. D2 unit 226 is used for 512-bit vector loads and stores. Preferably, D1 unit 225 and D2 unit 226 also perform: swapping, packing, and unpacking of load and store data; 64-bit SIMD arithmetic operations; and 64-bit bitwise logic operations. The D1 / D2 local register file 214 typically stores the base and offset addresses used in address calculations for the corresponding loads and stores. Both operands are read from instruction-specified registers in the global scalar register file 211 or the D1 / D2 local register file 214. The calculated result can be written to an instruction-specified register in the global scalar register file 211, the L1 / S1 local register file 212, the M1 / N1 local register file 213, or the D1 / D2 local register file 214.
[0048] The vector data path side B 116 includes an L2 unit 241. The L2 unit 241 typically accepts two 512-bit operands and produces a 512-bit result. Both operands are read from instruction-specified registers in the global vector register file 231, the L2 / S2 local register file 232, or the assertion register file 234. The L2 unit 241 preferably executes instructions similar to those of the L1 unit 221, except on wider 512-bit data. The result can be written to an instruction-specified register in the global vector register file 231, the L2 / S2 local register file 232, the M2 / N2 / C local register file 233, or the assertion register file 234.
[0049] The vector data path side B 116 includes an S2 unit 242. The S2 unit 242 typically accepts two 512-bit operands and produces a 512-bit result. Both operands are read from instruction-specified registers in the global vector register file 231, the L2 / S2 local register file 232, or the assertion register file 234. The S2 unit 242 preferably executes instructions similar to those of the S1 unit 222. The result can be written to an instruction-specified register in the global vector register file 231, the L2 / S2 local register file 232, the M2 / N2 / C local register file 233, or the assertion register file 234.
[0050] The vector data path side B 116 includes M2 unit 243. M2 unit 243 typically accepts two 512-bit operands and produces a 512-bit result. Both operands are read from instruction-specified registers in the global vector register file 231 or the M2 / N2 / C local register file 233. M2 unit 243 preferably executes instructions similar to those of M1 unit 223, except on wider 512-bit data. The result can be written to an instruction-specified register in the global vector register file 231, the L2 / S2 local register file 232, or the M2 / N2 / C local register file 233.
[0051] The vector data path side B 116 includes N2 unit 244. N2 unit 244 typically accepts two 512-bit operands and produces a 512-bit result. Both operands are read from instruction-specified registers in the global vector register file 231 or the M2 / N2 / C local register file 233. N2 unit 244 preferably performs the same types of operations as M2 unit 243. There may be certain dual operations (referred to as dual-issue instructions) that employ both M2 unit 243 and N2 unit 244 together. The result can be written to an instruction-specified register in the global vector register file 231, the L2 / S2 local register file 232, or the M2 / N2 / C local register file 233.
[0052] The vector data path side B 116 includes C unit 245. C unit 245 typically accepts two 512-bit operands and produces a 512-bit result. Both operands are read from instruction-specified registers in the global vector register file 231 or the M2 / N2 / C local register file 233. C unit 245 preferably executes: "search" and "find" instructions; I / Q complex multiplications, at up to 512 2-bit PN * 8-bit multiplications per clock cycle; 8-bit and 16-bit sum of absolute differences (SAD) calculations, at up to 512 SADs per clock cycle; horizontal addition and horizontal minimum / maximum instructions; and vector permutation instructions. C unit 245 also includes four vector control registers (CUCR0 to CUCR3) used to control certain operations of C unit 245 instructions. Control registers CUCR0 to CUCR3 are used as operands in certain C unit 245 operations. Control registers CUCR0 to CUCR3 are preferably used to control the general permutation instruction (VPERM) and as masks for SIMD multiple dot product operations (DOTPM) and SIMD multiple sum of absolute differences (SAD) operations. Control register CUCR0 is preferably used to store the polynomial for Galois field multiplication operations (GFMPY). Control register CUCR1 is preferably used to store the Galois field polynomial generator function.
[0053] The vector data path side B 116 includes a P unit 246. The P unit 246 performs basic logic operations on the registers of the local assertion register file 234. The P unit 246 has direct access to read from and write to the assertion register file 234. These operations include single-register unary operations such as: NEG (negate), which inverts each bit of a single register; BITCNT (bit count), which returns the number of bits in a single register that have a predetermined digital state (1 or 0); RMBD (rightmost bit detect), which returns the number of bit positions from the least significant (rightmost) bit position to the first bit position having a predetermined digital state (1 or 0); DECIMATE, which selects every instruction-specified Nth (1st, 2nd, 4th, etc.) bit for output; and EXPAND, which repeats each bit an instruction-specified N times (2, 4, etc.). These operations also include two-register binary operations such as: AND, which performs a bitwise AND on the data in two registers; NAND, which performs a bitwise AND followed by negation on the data in two registers; OR, which performs a bitwise OR on the data in two registers; NOR, which performs a bitwise OR followed by negation on the data in two registers; and XOR, which performs a bitwise exclusive OR on the data in two registers. These operations further include transferring data from an assertion register in the assertion register file 234 to another specified assertion register or to a specified data register in the global vector register file 231. A commonly expected use of the P unit 246 is manipulating the results of SIMD vector comparisons in order to control further SIMD vector operations. The BITCNT instruction can be used to count the number of 1s in an assertion register to determine the number of valid data elements indicated by that assertion register.
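The unary P unit operations listed above can be modeled on 64-bit values as follows. This is a behavioral sketch, not the hardware definition. The function names mirror the mnemonics, but the details the text leaves open are assumptions: DECIMATE is assumed to start at bit 0 and pack the selected bits toward the least significant end, EXPAND is assumed to truncate its output to 64 bits, and RMBD is assumed to return 64 when no bit matches.

```python
MASK64 = (1 << 64) - 1  # assertion registers are 64 bits wide


def neg(x: int) -> int:
    """NEG: invert every bit of the register."""
    return ~x & MASK64


def bitcnt(x: int, state: int = 1) -> int:
    """BITCNT: count the bits that have the given state (1 or 0)."""
    ones = bin(x & MASK64).count("1")
    return ones if state == 1 else 64 - ones


def rmbd(x: int, state: int = 1) -> int:
    """RMBD: bit positions from the rightmost bit to the first bit with `state`."""
    for pos in range(64):
        if ((x >> pos) & 1) == state:
            return pos
    return 64  # assumption: no matching bit found


def decimate(x: int, n: int) -> int:
    """DECIMATE: keep every Nth bit (assumed to start at bit 0), packed together."""
    out = 0
    for i in range(64 // n):
        out |= ((x >> (i * n)) & 1) << i
    return out


def expand(x: int, n: int) -> int:
    """EXPAND: repeat each bit N times (output assumed truncated to 64 bits)."""
    out = 0
    for i in range(64 // n):
        if (x >> i) & 1:
            out |= ((1 << n) - 1) << (i * n)
    return out
```

Under these assumptions, `expand(decimate(x, 2), 2)` reproduces any `x` whose bits already occur in identical adjacent pairs.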
[0054] Figure 3 illustrates the global scalar register file 211. There are 16 independent 64-bit wide scalar registers, labeled A0-A15. Each register in the global scalar register file 211 can be read or written as 64-bit scalar data. All scalar data path side A 115 functional units (L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, and D2 unit 226) can read from or write to the global scalar register file 211. The global scalar register file 211 can be read as 32 bits or as 64 bits, and can be written only as 64 bits. The executing instruction determines the size of the read data. The vector data path side B 116 functional units (L2 unit 241, S2 unit 242, M2 unit 243, N2 unit 244, C unit 245, and P unit 246) can read from the global scalar register file 211 via cross path 117, subject to the constraints detailed below.
[0055] Figure 4 illustrates the D1 / D2 local register file 214. There are 16 independent 64-bit wide scalar registers, labeled D0-D15. Each register in the D1 / D2 local register file 214 can be read or written as 64-bit scalar data. All scalar data path side A 115 functional units (L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, and D2 unit 226) can write to the D1 / D2 local register file 214. Only D1 unit 225 and D2 unit 226 can read from the D1 / D2 local register file 214. The data stored in the D1 / D2 local register file 214 is expected to include the base addresses and offset addresses used in address calculation.
[0056] Figure 5 illustrates the L1 / S1 local register file 212. The example illustrated in Figure 5 has eight independent 64-bit wide scalar registers, labeled AL0-AL7. The preferred instruction encoding (see Figure 15) allows the L1 / S1 local register file 212 to include up to 16 registers. The example of Figure 5 implements only 8 registers to reduce circuit size and complexity. Each register in the L1 / S1 local register file 212 can be read or written as 64-bit scalar data. All scalar data path side A 115 functional units (L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, and D2 unit 226) can write to the L1 / S1 local register file 212. Only L1 unit 221 and S1 unit 222 can read from the L1 / S1 local register file 212.
[0057] Figure 6 Explanation of M1 / N1 local register file 213. Figure 6 The example described has eight independent 64-bit wide scalar registers, labeled AM0-AM7. Preferred instruction encoding (see...) Figure 15 This allows the M1 / N1 local register file 213 to include up to 16 registers. Figure 6 The example implements only 8 registers to reduce circuit size and complexity. Each register in the M1 / N1 local register file 213 can be read or written as 64-bit scalar data. All scalar data path-side A115 functional units (L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, and D2 unit 226) can write to the M1 / N1 local register file 213. Only M1 unit 223 and N1 unit 224 can be read from the M1 / N1 local register file 213.
[0058] Figure 7 illustrates the global vector register file 231. There are 16 independent 512-bit wide vector registers. Each register in the global vector register file 231 can be read or written as 64-bit scalar data, labeled B0-B15, or as 512-bit vector data. The instruction type determines the data size. All vector data path side B 116 functional units (L2 unit 241, S2 unit 242, M2 unit 243, N2 unit 244, C unit 245, and P unit 246) can read from or write to the global vector register file 231. Scalar data path side A 115 functional units (L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, and D2 unit 226) can read from the global vector register file 231 via cross path 117, subject to the constraints detailed below.
[0059] Figure 8 illustrates the P local register file 234. It contains eight independent 64-bit wide registers, labeled P0-P7. Each register in the P local register file 234 can be read or written as 64-bit scalar data. Vector data path side B 116 functional units L2 unit 241, S2 unit 242, C unit 245, and P unit 246 can write to the P local register file 234. Only L2 unit 241, S2 unit 242, and P unit 246 can read from the P local register file 234. Common intended uses of the P local register file 234 include: writing one-bit SIMD vector comparison results from L2 unit 241, S2 unit 242, or C unit 245; manipulating the SIMD vector comparison results with P unit 246; and using the manipulated results to control further SIMD vector operations.
[0060] Figure 9 illustrates the L2 / S2 local register file 232. The example in Figure 9 has eight independent 512-bit wide vector registers. The preferred instruction encoding (see Figure 15) allows the L2 / S2 local register file 232 to include up to 16 registers. The example in Figure 9 implements only 8 registers to reduce circuit size and complexity. Each register in the L2 / S2 local vector register file 232 can be read or written as 64-bit scalar data, labeled BL0-BL7. Each register in the L2 / S2 local vector register file 232 can be read or written as 512-bit vector data, labeled VBL0-VBL7. The instruction type determines the data size. All vector data path side B 116 functional units (L2 unit 241, S2 unit 242, M2 unit 243, N2 unit 244, C unit 245, and P unit 246) can write to the L2 / S2 local vector register file 232. Only L2 unit 241 and S2 unit 242 can read from the L2 / S2 local vector register file 232.
[0061] Figure 10 illustrates the M2 / N2 / C local register file 233. The example in Figure 10 has eight independent 512-bit wide vector registers. The preferred instruction encoding (see Figure 15) allows the M2 / N2 / C local vector register file 233 to include up to 16 registers. The example in Figure 10 implements only 8 registers to reduce circuit size and complexity. Each register in the M2 / N2 / C local vector register file 233 can be read or written as 64-bit scalar data, labeled BM0-BM7. Each register in the M2 / N2 / C local vector register file 233 can be read or written as 512-bit vector data, labeled VBM0-VBM7. All vector data path side B 116 functional units (L2 unit 241, S2 unit 242, M2 unit 243, N2 unit 244, C unit 245, and P unit 246) can write to the M2 / N2 / C local vector register file 233. Only M2 unit 243, N2 unit 244, and C unit 245 can read from the M2 / N2 / C local vector register file 233.
[0062] Providing a global register file accessible by all functional units on one side and local register files accessible by only some of the functional units on that side is a design choice. Some examples in this disclosure use only one type of register file, corresponding to the disclosed global register files.
[0063] Returning to Figure 2, cross path 117 allows limited data exchange between scalar data path side A 115 and vector data path side B 116. During each operating cycle, one 64-bit data word can be read from the global scalar register file 211 as an operand for one or more functional units on vector data path side B 116, and one 64-bit data word can be read from the global vector register file 231 as an operand for one or more functional units on scalar data path side A 115. Any scalar data path side A 115 functional unit (L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, and D2 unit 226) can read a 64-bit operand from the global vector register file 231. This 64-bit operand is the least significant 64 bits of the 512-bit data in the accessed register of the global vector register file 231. Multiple scalar data path side A 115 functional units can use the same 64-bit cross path data as an operand during the same operating cycle. However, in any single operating cycle, only one 64-bit operand is transferred from vector data path side B 116 to scalar data path side A 115. Any vector data path side B 116 functional unit (L2 unit 241, S2 unit 242, M2 unit 243, N2 unit 244, C unit 245, and P unit 246) can read a 64-bit operand from the global scalar register file 211. If the corresponding instruction is a scalar instruction, the cross path operand data is treated as any other 64-bit operand. If the corresponding instruction is a vector instruction, the upper 448 bits of the operand are filled with zeros. Multiple vector data path side B 116 functional units can use the same 64-bit cross path data as an operand during the same operating cycle. In any single operating cycle, only one 64-bit operand is transferred from scalar data path side A 115 to vector data path side B 116.
[0064] Streaming engine 125 transfers data under certain restricted circumstances. Streaming engine 125 controls two data streams. A stream consists of a sequence of elements of a specific type. Programs that operate on streams read the data sequentially, operating on each element in turn. Each stream has the following basic properties. The stream data has a well-defined start and end in time. The stream data has a fixed element size and type throughout the stream. The stream data has a fixed sequence of elements; therefore, programs cannot seek randomly within the stream. The stream data is read-only while active; programs cannot write to a stream while simultaneously reading from it. Once a stream is opened, streaming engine 125: calculates the addresses; fetches the defined data type from the level two unified cache (which may require cache service from a higher level memory); performs data type manipulations such as zero extension, sign extension, and data element rearrangement / transposition (such as matrix transposition); and delivers the data directly to the programmed data register file in CPU 110. Streaming engine 125 is therefore useful for real-time digital filtering operations on well-behaved data. Streaming engine 125 relieves the corresponding CPU of these memory fetch tasks, freeing it for other processing functions.
[0065] Streaming engine 125 provides the following benefits. Streaming engine 125 permits multidimensional memory accesses. Streaming engine 125 increases the available bandwidth to the functional units. Because the stream buffer bypasses the level one data cache 123, streaming engine 125 minimizes the number of cache miss stalls. Streaming engine 125 reduces the number of scalar operations required to maintain a loop. Streaming engine 125 manages address pointers. Streaming engine 125 handles address generation automatically, freeing the address generation instruction slots and D1 unit 225 and D2 unit 226 for other computations.
[0066] The CPU 110 operates on an instruction pipeline. Instructions are fetched in fixed-length instruction packets, as described further below. All instructions require the same number of pipeline stages for fetch and decode, but require a varying number of execution stages.
[0067] Figure 11 illustrates the following pipeline phases: program fetch phase 1110, dispatch and decode phase 1120, and execution phase 1130. Program fetch phase 1110 comprises three stages for all instructions. Dispatch and decode phase 1120 comprises three stages for all instructions. Execution phase 1130 comprises one to four stages depending on the instruction.
[0068] The fetch phase 1110 includes a program address generation stage 1111 (PG), a program access stage 1112 (PA), and a program receive stage 1113 (PR). During the program address generation stage 1111 (PG), a program address is generated in the CPU and a fetch request is sent to the memory controller of the level one instruction cache (L1I). During the program access stage 1112 (PA), the L1I processes the request, accesses the data in its memory, and sends a fetch packet to the CPU boundary. During the program receive stage 1113 (PR), the CPU registers the fetch packet.
[0069] Instructions are always fetched sixteen 32-bit wide slots at a time, and these sixteen slots constitute a fetch packet. Figure 12 illustrates 16 instructions 1201-1216 in a single fetch packet. Fetch packets are aligned on 512-bit (16-word) boundaries. One example uses a fixed 32-bit instruction length. Fixed-length instructions are advantageous for several reasons. Fixed-length instructions make decoder alignment easier. A properly aligned instruction fetch can load multiple instructions into parallel instruction decoders. Such proper alignment can be achieved by predetermined instruction alignment when instructions are stored in memory, coupled with a fixed instruction packet fetch (fetch packets aligned on 512-bit boundaries). An aligned instruction fetch permits the parallel decoders to operate on instruction-sized portions of the fetched bits. Variable-length instructions require an initial step of locating each instruction boundary before each instruction can be decoded. A fixed-length instruction set generally permits a more regular layout of instruction fields. This simplifies the structure of each decoder, which is an advantage for a wide-issue VLIW CPU.
[0070] The execution of individual instructions is partially controlled by a p bit in each instruction. Preferably, the p bit is bit 0 of the 32-bit wide slot. The p bit determines whether an instruction executes in parallel with the next instruction. Instructions are scanned from lower to higher addresses. If the p bit of an instruction is 1, the next following instruction (at a higher memory address) is executed in parallel with that instruction (in the same cycle). If the p bit of an instruction is 0, the next following instruction is executed in the cycle after that instruction.
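The p bit rule of paragraph [0070] can be sketched in a few lines of Python. This is only an illustrative model, not the dispatch logic of CPU 110: the function name `split_execute_packets` and the sample instruction words are hypothetical, and the sketch examines nothing but bit 0 of each word.

```python
def split_execute_packets(fetch_packet):
    """Group instruction words into execute packets using the p bit (bit 0).

    Words are scanned from lower to higher address. A word with p = 1 executes
    in parallel with the next word; a word with p = 0 ends the current packet.
    """
    packets, current = [], []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:          # p = 0: the next instruction starts a new cycle
            packets.append(current)
            current = []
    if current:                    # trailing words whose last p bit was 1
        packets.append(current)
    return packets


# Hypothetical instruction words; only bit 0 (the p bit) matters here.
words = [0x11, 0x21, 0x30, 0x40, 0x51, 0x60]
packets = split_execute_packets(words)
print([[hex(w) for w in p] for p in packets])
```

With these sample words the first three instructions share one cycle, the fourth executes alone, and the last two share a cycle.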
[0071] The pipelines of the CPU 110 and the L1 instruction cache 121 are decoupled. Fetch packet returns from the L1 instruction cache (L1I) can take a different number of clock cycles, depending on external circumstances such as whether there is a hit in the L1 instruction cache 121 or in the L2 combined cache 130. Therefore, program access stage 1112 (PA) can take several clock cycles instead of one clock cycle as in the other stages.
[0072] Instructions that execute in parallel constitute an execution packet. In one example, an execution packet may include up to sixteen instructions. No two instructions in an execution packet can use the same functional unit. A slot is one of the following five types: 1) a self-contained instruction that executes on one of the functional units of CPU 110 (L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, D2 unit 226, L2 unit 241, S2 unit 242, M2 unit 243, N2 unit 244, C unit 245, and P unit 246); 2) a unitless instruction, such as a NOP (no operation) instruction or a multiple NOP instruction; 3) a branch instruction; 4) a constant field extension; and 5) a condition code extension. Some of these slot types are explained further below.
[0073] The dispatch and decode phase 1120 includes an instruction dispatch to appropriate execution unit stage 1121 (DS), an instruction pre-decode stage 1122 (DC1), and an instruction decode and operand read stage 1123 (DC2). During the instruction dispatch to appropriate execution unit stage 1121 (DS), the fetch packets are split into execution packets and assigned to the appropriate functional units. During the instruction pre-decode stage 1122 (DC1), the source registers, destination registers, and associated paths are decoded for the execution of the instructions in the functional units. During the instruction decode and operand read stage 1123 (DC2), more detailed unit decoding is performed, and operands are read from the register files.
[0074] Execution phase 1130 includes execution stages 1131-1135 (E1-E5). Different types of instructions require different numbers of these stages to complete their execution. These pipeline stages play an important role in understanding the device state at CPU cycle boundaries.
[0075] During execution stage 1 1131 (E1), the conditions for the instruction are evaluated and the operands are operated on. As illustrated in Figure 11, execution stage 1 1131 can receive operands from either the stream buffer 1141 or a register file (shown schematically as 1142). For load and store instructions, address generation is performed and address modifications are written to a register file. For branch instructions, the branch fetch packet in the PG stage is affected. As illustrated in Figure 11, load and store instructions access memory (shown schematically as memory 1151). For single-cycle instructions, the result is written to the destination register file. This assumes that any conditions for the instruction evaluate as true. If a condition evaluates as false, the instruction does not write any result or have any pipeline operation after execution stage 1 1131.
[0076] During execution stage 2 1132 (E2), a load instruction sends the address to memory. A store instruction sends both the address and the data to memory. If saturation occurs, a single-cycle instruction that saturates its result sets the SAT bit in the Control Status Register (CSR). For two-cycle instructions, the result is written to the destination register file.
[0077] During execution stage 3 1133 (E3), data memory accesses are performed. If saturation occurs, any multiply instruction that saturates its result sets the SAT bit in the Control Status Register (CSR). For three-cycle instructions, the result is written to the destination register file.
[0078] During execution stage 4 1134 (E4), a load instruction brings data to the CPU boundary. For four-cycle instructions, the result is written to the destination register file.
[0079] During execution stage 5 1135 (E5), a load instruction writes data into a register. Figure 11 illustrates this schematically with an input from memory 1151 to execution stage 5 1135.
[0080] In some cases, processor 100 (e.g., a DSP) can be invoked to execute software that requires transposition functionality. As mentioned above, implementing transposition functionality at the processor level (e.g., using assembly or compiler-level instructions) requires multiple instructions and increases computational overhead. Since transposition functionality performed by a DSP is often frequent and repetitive, especially in operations that require rearranging bits (e.g., scrambling and repackaging data with unusual bit boundaries, converting an algorithm to a bit-slice implementation, or unpacking a dense bitmap into a byte-by-byte bitmap and then repackaging it), increasing instruction overhead and / or computation time is undesirable.
[0081] Figures 13A-13D illustrate the transposition of bits performed by a vector bit transpose instruction according to an example of this disclosure. Figure 13A shows a vector 1300 (or a portion of a vector) comprising 16 bits. The 16-bit vector 1300 is exemplary and was chosen here to keep the illustration of the transpose operation simple. It should be understood that other examples of this disclosure may apply the transpose operation to groups having more or fewer than 16 bits. For example, vector 1300 may comprise 512 bits, and the transpose operation may be applied to each of eight groups of 64 bits.
[0082] The 16 bits of vector 1300 are consecutively numbered from 0 to 15. The bit number identifies a specific position and does not relate to its value. For the purposes of this example, the actual value of the bit is treated as arbitrary.
[0083] As mentioned above, N×N bits can be interpreted as a two-dimensional array with N rows and N columns. In the example of Figure 13A, vector 1300 has 16 bits, and therefore N = 4.
[0084] Figure 13B An illustrative two-dimensional array 1310 is shown, which is an interpretation of the bit vector 1300, in this case interpreted as a 4×4 two-dimensional array 1310. The two-dimensional array 1310 includes the first N bits (e.g., elements 0-3) of the vector 1300 as its first row. The two-dimensional array 1310 includes the second N bits (e.g., elements 4-7) of the vector 1300 as its second row. The two-dimensional array 1310 includes the third N bits (e.g., elements 8-11) of the vector 1300 as its third row, and the fourth N bits (e.g., elements 12-15) of the vector 1300 as its fourth row.
[0085] When the 16-bit vector 1300 is interpreted as a 4×4 two-dimensional array 1310, the position of each bit in the two-dimensional array 1310 can be described by an ordered pair of indices, written here as (column index, row index). In one example, the position of bit 0 is described as (0, 0); the position of bit 3 is described as (3, 0); and the position of bit 8 is described as (0, 2). In this way, the one-dimensional vector 1300 is interpreted as a two-dimensional array 1310, and the bits within the vector 1300 can therefore be identified by row and column indices.
[0086] Figure 13C Another two-dimensional array 1320 is shown after the bit transposition has occurred according to the vector bit transpose instruction. Specifically, the row index value and column index value of each bit in the two-dimensional array 1310 are reversed to reach the transposed two-dimensional array 1320.
[0087] For example, bits 0, 5, 10, and 15, each having the same row index value as the column index value, are preserved in the same position in both the 2D array 1310 and the transposed 2D array 1320. Bit 3, with an initial row index value of 0 and a column index value of 3 (e.g., position (3, 0)), is transposed to result in a row index value of 3 and a column index value of 0 (e.g., position (0, 3)). Therefore, in the transposed 2D array 1320, bit 3 appears in the first column, fourth row. A similar transpose is applied to all bits of the 2D array 1310 to generate the transposed 2D array 1320.
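The index reversal described above can be written as a mapping from a bit's source position to its destination position. The sketch below is illustrative only and assumes the row-major numbering of Figure 13B, in which position = row × N + column; the function name `transposed_position` is hypothetical.

```python
def transposed_position(position, n):
    """Return where the bit at `position` lands after an n x n bit transpose.

    Assumes row-major numbering: position = row * n + column.
    """
    row, col = divmod(position, n)
    return col * n + row           # swap the row and column indices


# Bit 3 sits at row 0, column 3 and moves to row 3, column 0.
print(transposed_position(3, 4))   # 12

# Bits whose row index equals their column index do not move.
diagonal = [p for p in range(16) if transposed_position(p, 4) == p]
print(diagonal)                    # [0, 5, 10, 15]
```

Applying the mapping twice returns every bit to its original position, since swapping the two indices is its own inverse.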
[0088] It should be understood that Figure 13B and Figure 13C are intended to illustrate the bit transposition. In practice, a two-dimensional array of bits may never actually be created (e.g., in memory); instead, source data from the source register in one-dimensional vector form is transposed as if it were a two-dimensional array, and the transposed source data is stored in the destination register.
[0089] Figure 13D shows the transposed two-dimensional array 1320 of Figure 13C as a one-dimensional vector 1330, which contains the transposed source data as described above. The first N bits of the transposed vector 1330 come from the first row of the transposed two-dimensional array 1320 (e.g., bits 0, 4, 8, 12). The second N bits of the transposed vector 1330 come from the second row of the transposed two-dimensional array 1320 (e.g., bits 1, 5, 9, 13). The third N bits of the transposed vector 1330 come from the third row of the transposed two-dimensional array 1320 (e.g., bits 2, 6, 10, 14). Finally, the fourth N bits of the transposed vector 1330 come from the fourth row of the transposed two-dimensional array 1320 (e.g., bits 3, 7, 11, 15).
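The sequence of Figures 13A-13D can be reproduced with a short sketch that works on position labels rather than bit values. It is illustrative only: the helper name `transpose_labels` is hypothetical, and the explicit two-dimensional list exists purely for clarity, since the disclosure notes that no such array need actually be created.

```python
def transpose_labels(vector, n):
    """Reshape a flat list into n rows of n, transpose it, and flatten it again."""
    rows = [vector[r * n:(r + 1) * n] for r in range(n)]   # as in Figure 13B
    transposed = [list(column) for column in zip(*rows)]   # as in Figure 13C
    return [label for row in transposed for label in row]  # as in Figure 13D


vector_1300 = list(range(16))                 # position labels 0..15
vector_1330 = transpose_labels(vector_1300, 4)
print(vector_1330)
# [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
```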
[0090] The specific numerical examples given in Figures 13A to 13D (e.g., a 16-element vector) are not intended to limit the scope of this disclosure. In another example, vector 1300 comprises a 512-bit vector, and N = 8, such that there are eight groups of 64 bits (e.g., double words), each of which is interpreted as a two-dimensional array 1310 and transposed to a two-dimensional array 1320 in response to the execution of a single vector bit transpose instruction. In other examples, N = 4, and therefore 32 groups are transposed; or N = 16, and therefore two groups are transposed in response to the execution of a single vector bit transpose instruction.
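The group counts quoted above follow from dividing the vector width by N×N. A minimal check, assuming a 512-bit vector and a hypothetical helper `group_count`:

```python
def group_count(vector_bits, n):
    """Number of n x n bit groups that fit exactly in a vector of vector_bits bits."""
    group_bits = n * n
    if vector_bits % group_bits != 0:
        raise ValueError("vector width must be a multiple of n * n")
    return vector_bits // group_bits


counts = {n: group_count(512, n) for n in (4, 8, 16)}
print(counts)   # {4: 32, 8: 8, 16: 2}
```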
[0091] Figure 14 shows an example of a register 1400 used in executing vector bit transpose instructions. Register 1400 can be either a source register or a destination register. In this example, whether it is a source register or a destination register, register 1400 is a 512-bit vector register, such as those contained in the global vector register file 231 explained above. In other examples, register 1400 can have different sizes; the scope of this disclosure is not limited to a particular register size or register group size.
[0092] As described above, the vector bit transpose instruction is a SIMD instruction that operates on source data divided into multiple groups of N×N bits. In this example, the 512-bit vector register 1400 is divided into eight equal-sized groups, each group being 64 bits. Each group can be interpreted as an 8×8 two-dimensional array, and therefore these groups are labeled array 1 through array 8.
[0093] The vector bit transpose instruction includes fields specifying the source and destination registers (e.g., in the global vector register file 231). In some examples, the vector bit transpose instruction also includes a field specifying the group size (e.g., 16 bits for a 4×4 two-dimensional array, 64 bits for an 8×8 two-dimensional array, or 256 bits for a 16×16 two-dimensional array). In response to executing the vector bit transpose instruction, the DSP 100 transposes each group of 64 bits as if the group were an 8×8 two-dimensional array, in accordance with the transpose operation explained above with reference to Figures 13A to 13D. Once the bits of each of arrays 1 through 8 have been transposed, the DSP 100 stores the transposed source data in the destination register.
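The group-wise behavior can be modeled in software as follows. This is a sketch under stated assumptions, not the hardware implementation: registers are modeled as Python integers, position p within a group is taken to be bit p counted from the least significant end (the disclosure numbers bit positions but does not tie them to significance), and the names `transpose_group` and `vector_bit_transpose` are hypothetical.

```python
def transpose_group(group, n):
    """Transpose one n*n-bit group held in an integer (position = row * n + column)."""
    out = 0
    for position in range(n * n):
        if (group >> position) & 1:
            row, col = divmod(position, n)
            out |= 1 << (col * n + row)      # swap the row and column indices
    return out


def vector_bit_transpose(src, n, width=512):
    """Transpose every n*n-bit group of a width-bit source value independently."""
    group_bits = n * n
    if width % group_bits != 0:
        raise ValueError("width must be a multiple of n * n")
    mask = (1 << group_bits) - 1
    dst = 0
    for shift in range(0, width, group_bits):
        group = (src >> shift) & mask
        dst |= transpose_group(group, n) << shift
    return dst


# A full first row (positions 0..7) becomes a full first column (0, 8, ..., 56).
row_of_ones = vector_bit_transpose(0xFF, 8)
print(hex(row_of_ones))   # 0x101010101010101
```

Each group is handled independently, so a bit never crosses a group boundary; that independence is what makes the operation SIMD in nature.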
[0094] Figure 15 illustrates an example of an instruction encoding 1500 for the functional unit instructions used in this disclosure. Those skilled in the art will recognize that other instruction encodings are feasible and within the scope of this disclosure. Each instruction consists of 32 bits and controls the operation of one of the individually controllable functional units (L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, D2 unit 226, L2 unit 241, S2 unit 242, M2 unit 243, N2 unit 244, C unit 245, and P unit 246). The bit fields are defined as follows.
[0095] The dst field 1502 (bits 26 to 31) specifies the destination register in the corresponding vector register file 231 as the destination of the transposed source data generated due to the execution of the vector bit transpose instruction.
[0096] In the exemplary instruction code 1500, bits 20 to 25 contain constant values used as placeholders.
[0097] The src1 field 1504 (bits 14 to 19) specifies the source register from the global vector register file 231.
[0098] The opcode field 1506 (bits 5 to 13) specifies the instruction and indicates the appropriate instruction options (e.g., the dimensions of each group as an N×N two-dimensional array). The unit field 1508 (bits 2 to 4) provides an explicit indication of the functional unit used and the operation performed. A detailed interpretation of the opcodes, apart from the instruction options detailed below, is generally beyond the scope of this disclosure.
[0099] The s bit 1510 (bit 1) specifies either scalar data path side A 115 or vector data path side B 116. If s = 0, scalar data path side A 115 is selected. This limits the functional unit to L1 unit 221, S1 unit 222, M1 unit 223, N1 unit 224, D1 unit 225, and D2 unit 226 and the corresponding register files illustrated in Figure 2. Similarly, s = 1 selects vector data path side B 116, which limits the functional unit to L2 unit 241, S2 unit 242, M2 unit 243, N2 unit 244, C unit 245, and P unit 246 and the corresponding register files illustrated in Figure 2.
[0100] The p-bit (1512, or bit 0) marks the execution packet. The p-bit determines whether the instruction is executed in parallel with subsequent instructions. The p-bit is scanned from lower to higher addresses. If p = 1 for the current instruction, the next instruction is executed in parallel with the current instruction. If p = 0 for the current instruction, the next instruction is executed in the cycle following the current instruction. All instructions executed in parallel form an execution packet. An execution packet can contain up to twelve instructions. Each instruction in an execution packet must use a different functional unit.
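The bit fields of instruction encoding 1500 can be extracted with shifts and masks. The sketch below is illustrative only: the bit ranges are those given in paragraphs [0095] to [0100], while the dictionary keys, the function names `encode` and `decode`, and the sample field values (including the opcode value) are hypothetical and do not correspond to any real instruction.

```python
# (low bit, high bit) for each field of instruction encoding 1500, inclusive.
FIELDS = {
    "dst": (26, 31),          # destination register, field 1502
    "placeholder": (20, 25),  # constant placeholder bits
    "src1": (14, 19),         # source register, field 1504
    "opcode": (5, 13),        # field 1506
    "unit": (2, 4),           # field 1508
    "s": (1, 1),              # side select, bit 1510
    "p": (0, 0),              # parallel bit, bit 1512
}


def decode(word):
    """Split a 32-bit instruction word into its named fields."""
    return {name: (word >> lo) & ((1 << (hi - lo + 1)) - 1)
            for name, (lo, hi) in FIELDS.items()}


def encode(**values):
    """Pack named field values into a 32-bit word; unspecified fields are zero."""
    word = 0
    for name, value in values.items():
        lo, hi = FIELDS[name]
        if value >> (hi - lo + 1):
            raise ValueError(f"{name} does not fit in its field")
        word |= value << lo
    return word


# Hypothetical field values chosen only to exercise the round trip.
word = encode(dst=5, src1=3, opcode=0x1A2, unit=6, s=1, p=0)
fields = decode(word)
print(fields)
```

The seven fields account for 6 + 6 + 6 + 9 + 3 + 1 + 1 = 32 bits, matching the fixed 32-bit instruction length.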
[0101] Figure 16 shows a flowchart of a method 1600 according to an example of this disclosure. Method 1600 begins in block 1602, where a source register containing source data and a destination register for storing transposed data are specified. The source and destination registers are specified in fields of the vector bit transpose instruction, such as the src1 field 1504 and the dst field 1502 described above with reference to Figure 15. Method 1600 continues in block 1604, where the vector bit transpose instruction is executed, specifically by interpreting N×N bits of the source data as a two-dimensional array with N rows and N columns. In one example, the source data comprises a 512-bit vector, and N = 8, resulting in eight groups of 64 bits (e.g., double words), each of which is interpreted as a two-dimensional array, as explained above with reference to Figures 13A-13D.
[0102] Method 1600 continues in block 1606, where the transposed source data is created by transposing the bits by reversing the row and column indices for each bit. This step is described in detail above with reference to Figure 13B and Figure 13C. It should be understood that, in practice, a two-dimensional array of bits may never actually be created (e.g., in memory); instead, the source data, in one-dimensional vector form from the source register, is transposed as if it were a two-dimensional array, and the transposed source data is stored in the destination register. Method 1600 continues in block 1608, where the transposed source data is stored in the destination register, as shown above with reference to Figure 13D.
[0103] As stated above, the specific numerical examples are not intended to limit the scope of this disclosure. For example, although described as a 512-bit vector in which N=8, in other examples N=4, and therefore 32 groups are transposed; or N=16, and therefore two groups are transposed in response to the execution of a single vector bit transpose instruction.
[0104] In the foregoing discussion and claims, the terms “comprising” and “including” are used in an open-ended manner and should therefore be interpreted as meaning “including, but not limited to…”. Furthermore, the terms “couple” or “coupled” are intended to indicate an indirect or direct connection. Thus, if a first device is coupled to a second device, the connection can be either a direct connection or an indirect connection via other devices and connections. Similarly, a device coupled between a first component or location and a second component or location can be connected either directly or indirectly via other devices and connections. Elements or features “configured” to perform a task or function can be configured by the manufacturer at the time of manufacture (e.g., programmed or structurally designed) to perform the function and / or can be configurable (or reconfigurable) by the user after manufacture to perform the function and / or other additional or alternative functions. Configuration can be achieved through firmware and / or software programming of the device, through the construction and / or layout of the device’s hardware components and interconnections, or combinations thereof. Additionally, in the foregoing discussion, the use of the phrase “ground” or similar terms is intended to include chassis ground, earth ground, floating ground, virtual ground, digital ground, common ground, and / or any other form of grounding connection applicable to or suitable for the teachings of this disclosure. Unless otherwise stated, “approximately,” “around,” or “substantially” preceding a value means + / - 10% of that value.
[0105] The foregoing discussion is intended to illustrate the principles and various embodiments of this disclosure. Once the foregoing disclosure is fully understood, many variations and modifications will become apparent to those skilled in the art. The appended claims are intended to be construed as encompassing all such variations and modifications.
Claims
1. A method for transposing source data in a processor in response to a single vector bit transpose instruction, the method comprising: specifying, in corresponding fields of the vector bit transpose instruction, a source register containing an M-bit vector that serves as the source data and a destination register for storing the transposed data; and executing the vector bit transpose instruction, wherein executing the vector bit transpose instruction further includes: interpreting each of multiple groups of N×N bits of the source data as a two-dimensional array with N rows and N columns, where M is a multiple of N×N; creating the transposed source data by transposing each N×N bit group by reversing the row and column indices for each bit in each group; and storing the transposed source data in the destination register.
2. The method according to claim 1, wherein M = 512.
3. The method according to claim 2, wherein N = 8.
4. The method according to claim 2, wherein N = 4.
5. The method according to claim 2, wherein N = 16.
6. A data processor, comprising: a source register configured to contain an M-bit vector that serves as the source data; and a destination register; wherein, in response to the execution of a single vector bit transpose instruction, the data processor is configured to: interpret each of multiple groups of N×N bits of the source data as a two-dimensional array with N rows and N columns, where M is a multiple of N×N; create the transposed source data by transposing each N×N bit group by reversing the row and column indices for each bit in each group; and store the transposed source data in the destination register.
7. The data processor according to claim 6, wherein M = 512.
8. The data processor according to claim 7, wherein N = 8.
9. The data processor according to claim 7, wherein N = 4.
10. The data processor according to claim 7, wherein N = 16.