A method for executing a vector compress instruction in an RVV instruction set

By splitting the vector compression instructions in the RVV instruction set into multiple compression micro-instructions and utilizing the preprocessing and aggregation operations of the aggregation operation array, the problem of low execution efficiency in the prior art is solved, achieving more efficient instruction execution and hardware resource utilization.

CN122240176A · Pending · Publication Date: 2026-06-19 · INST OF COMPUTING TECH CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
INST OF COMPUTING TECH CHINESE ACAD OF SCI
Filing Date
2026-02-05
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

The vector compression instructions in the existing RVV instruction set have low execution efficiency, especially when the VLEN/DLEN ratio or LMUL is large. The large number of microinstructions leads to high hardware storage overhead and subsequent instruction execution blockage.

Method used

The compression instruction is broken down into multiple compression micro-instructions, and its input is converted into an index and selected data adapted to the aggregation operation array through preprocessing. The aggregation operation is then performed using a low-latency, high-throughput aggregation operation array, which reduces the number of micro-instructions and speeds up execution.

Benefits of technology

When the VLEN/DLEN ratio is large or the hardware-supported LMUL is large, the number of micro-instructions generated grows only linearly with VLEN/DLEN and LMUL, which reduces hardware storage overhead, improves execution efficiency, and reduces the blocking of subsequent instruction splitting, scheduling, and execution.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention provides a method for executing vector compression instructions in an RVV instruction set, comprising: when the instruction to be processed is a compression instruction, splitting the compression instruction into multiple compression micro-instructions, and segmenting the mask vector and source data vector corresponding to each compression micro-instruction from the total mask vector corresponding to the compression instruction and the total source data vector to be compressed; performing preprocessing for an adaptive aggregation operation array, comprising: generating an index corresponding to each compression micro-instruction that can be used for operations by the aggregation operation array, and selected data obtained by locally compressing the source data vector corresponding to the compression micro-instruction, based on the partial mask vector, source data vector, and number corresponding to the compression micro-instruction, to match the input requirements of the aggregation instruction; and executing the aggregation instruction on the aggregation operation array to perform aggregation operations based on the index and selected data corresponding to each compression micro-instruction split from the compression instruction to obtain the result corresponding to the compression instruction.

Description

Technical Field

[0001] This invention relates to the field of computer technology, specifically to RISC-V vector extension instruction acceleration, and more specifically to a method for executing vector compression instructions in the RVV instruction set.

Background Technology

[0002] The RVV (RISC-V Vector Extension) instruction set features vector configuration that is not encoded within vector instructions. Instead, it uses a separate vector configuration instruction to set the vector configuration VTYPE CSR register. Vector configuration includes fields such as LMUL and SEW. The LMUL field specifies the number of vector registers operated on by the vector instruction, and the SEW field specifies the bit width of the operands.

[0003] The RVV instruction set includes a class of permutation instructions that permute data in vector registers, such as the vector aggregation (gather) instruction (vrgather.vv) and the vector compression instruction (vcompress.vm). vcompress.vm is a vector extension instruction that filters elements by mask. Its assembly format is "vcompress.vm vd, vs2, vs1". Here, vd is the destination vector register group that stores the result; vs2 is the source vector register group to be compressed. The element bit width in both vd and vs2 is specified by the SEW field in the VTYPE CSR, and the number of vector registers in each of the vd and vs2 register groups is specified by the LMUL field in the VTYPE CSR. vs1 is the source operand register providing the mask vector; each mask element is always 1 bit wide, and the mask always occupies a single vector register.

[0004] VLEN (Vector LENgth) is the length of the vector register, measured in bits. DLEN (Datapath LENgth) refers to the length of a data path for reading and writing vector registers, also measured in bits. In vector processor design, vector instructions are often broken down into multiple microinstructions at the register read / write granularity. Each microinstruction reads multiple vector operands of length DLEN and writes a vector destination operand of length DLEN. The hardware implementation divides the length of the vector register into VLEN / DLEN parts at the DLEN granularity, and each microinstruction reads one part of the data from the corresponding numbered vector register.

[0005] The GATHER array is an existing computing unit capable of processing LMUL vector aggregation micro-instructions, which are derived from a single vector aggregation instruction, with a latency of LMUL*2-1 and a throughput of 1 / LMUL. Vector aggregation micro-instructions are the micro-instructions derived from the vector aggregation instruction and used for computation by the GATHER array.

[0006] Let's illustrate the function of a vector compression instruction with an example. When SEW=32, the element width of both the vd and vs2 vector register groups is 32 bits, while the mask width in the vs1 vector register is 1 bit. In a processor core with VLEN=128, each vector register in each vd and vs2 vector register group contains 4 data elements, while the vs1 vector register contains 128 masks. The function of this instruction is to select data from the vs2 register group using the mask in vs1, and then continuously arrange the selected results at the beginning of the vd vector register group.

[0007] The following C code demonstrates how vcompress.vm operates. vd and vs2 are divided into elements according to the bit width specified by SEW, and vs1 is divided into 1-bit masks. vd[i], vs2[i], and vs1[i] denote the i-th element of each. Each element of vd is computed as follows:

[0008] for (size_t i = 0, pos = 0; i < VL; i++)

[0009]     if (vs1[i] == 1)

[0010]         vd[pos++] = vs2[i];

[0011] Here, size_t i = 0 means declaring an unsigned integer variable i and initializing it to 0.
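The loop above can be turned into a small self-contained reference model. The sketch below is illustrative only: the function name `vcompress_ref`, the storage of one mask bit per byte, and the `uint16_t` element type (matching SEW=16) are assumptions made for readability and are not taken from the RVV specification or from this application. Destination positions at or beyond the number of selected elements are left untouched, which models the tail-undisturbed case.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model of vcompress.vm for SEW=16.
 * vs1 stores one mask bit per byte (0 or 1), purely for readability.
 * Elements of vd at positions >= the returned count keep their old values. */
size_t vcompress_ref(uint16_t *vd, const uint16_t *vs2,
                     const uint8_t *vs1, size_t vl)
{
    size_t pos = 0;                 /* next free slot in vd */
    for (size_t i = 0; i < vl; i++) {
        if (vs1[i] == 1) {
            vd[pos++] = vs2[i];     /* pack the selected element toward index 0 */
        }
    }
    return pos;                     /* number of elements written */
}
```

Called with VL=7 and mask bits set at positions 0, 2, and 5, the function writes three elements and leaves the rest of the destination unchanged.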

[0012] The following is an example of the `vcompress.vm v2, v1, v0` instruction with VLEN=128, LMUL=1, SEW=16, VL=7, and the tail-undisturbed (tu) policy. In the table, the mask bits equal to 1 in v0, the selected elements in v1, and the resulting elements in v2 are marked in bold. Elements numbered 0 to 6 are below VL (Vector Length, commonly written as vl in this field, but capitalized here to avoid confusion with v1 in the table), so these elements participate in the operation (their element numbers are shown in italics in the table). Among them, the mask bits numbered 0, 2, and 5 in v0 are 1, so the corresponding elements a, c, and f are selected from v1 and arranged consecutively at the beginning of the result vector register v2. Thus, elements 0 to 2 of "destination v2" are a, c, and f respectively (shown in bold in the table). Since only three elements are written by the compression, the RVV instruction set specification requires that the tail of v2 remain unchanged under the tu policy. Therefore, every element of "destination v2" numbered 3 or higher keeps the value of the same-numbered element in "source v2" (underlined in the table).

[0013]

[0014] As can be seen, each element in the compressed vector either moves to a lower-numbered position, such as c and f in the example above, or stays in place, such as a. Therefore, for a vcompress.vm instruction with LMUL>1, the elements in each destination vector register may come from the same-positioned or any higher-numbered vector register in the source vector register group. Take the vcompress.vm v16, v8, v0 instruction as an example. When LMUL=8, a logical vector occupies 8 consecutive physical vector registers, and the table below shows the possible source data registers for the data in each destination register. A "Yes" entry indicates that the data in the destination register of that row may come from the source data register of that column. In this example, destination register v16 has 8 possible source data registers, v8 to v15, while destination register v23 has only one, v15.

[0015]

[0016] In one existing method A, the vcompress.vm instruction is executed sequentially, element by element. The arithmetic unit uses a set of data paths connected to the vector register file to transfer data between the arithmetic unit and the vector register file. Specifically, the arithmetic unit first reads the mask register and then traverses its mask bits from least significant to most significant. If a mask bit is 1, the element at the corresponding position, whose width is specified by SEW, is read from the vector register and stored at the next free position of the destination vector register. This method uses two counters to record the number of the mask bit being traversed and the number of mask bits equal to 1 seen so far. These correspond to the variables i and pos in the C code above, respectively.

[0017] In another existing method B, the vcompress.vm instruction is processed by splitting microinstructions out of order. Based on the source data register of each destination register listed in the table above, a microinstruction is generated for each "yes" entry in the table associated with both the destination and source data registers. Thus, the vcompress.vm instruction with LMUL=8 will be split into 36 microinstructions in this method. Similarly, with LMUL=4, it will be split into 10 microinstructions (considering only rows v16-v19 and columns v8-v11). With LMUL=2, it will be split into 3 microinstructions (considering only rows v16-v17 and the first columns v8-v9). With LMUL=1, it will be split into only 1 microinstruction. It should be noted that this number of microinstructions based on the table is predicated on DLEN=VLEN, meaning each microinstruction uses a complete vector register. If DLEN=VLEN / 2, more microinstructions will be split. Let part = VLEN * LMUL / DLEN, meaning that the data of a vector register group will be divided into data of length DLEN in part groups. The formula for calculating the number of microinstructions is (part + 1) * part / 2. According to this formula, when DLEN = VLEN / 2 and LMUL = 8, part = 16, so this method will result in 136 microinstructions.
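The counts quoted above follow directly from the formula (part + 1) * part / 2. The short sketch below evaluates it; the function names `parts` and `method_b_uops` are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* part = VLEN * LMUL / DLEN: number of DLEN-wide pieces in one register group. */
uint32_t parts(uint32_t vlen, uint32_t dlen, uint32_t lmul)
{
    return vlen * lmul / dlen;
}

/* Micro-instruction count of existing method B: one micro-instruction per
 * (destination piece, source piece) pair in which the source piece is not
 * lower-numbered than the destination piece, i.e. part * (part + 1) / 2. */
uint32_t method_b_uops(uint32_t vlen, uint32_t dlen, uint32_t lmul)
{
    uint32_t p = parts(vlen, dlen, lmul);
    return p * (p + 1) / 2;
}
```

With VLEN=DLEN=128 this gives 36, 10, 3, and 1 for LMUL=8, 4, 2, and 1, and with DLEN=VLEN / 2 and LMUL=8 it gives 136, matching the figures in the text.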

[0018] The aforementioned existing method A essentially implements the software loop in hardware. If each clock cycle iterates through n masks and reads and stores the corresponding m elements (m≤n), then executing the vcompress.vm instruction using this method requires VLEN*LMUL / SEW / n clock cycles. Due to hardware limitations preventing the reading and writing of data from multiple different addresses to the vector registers within a single cycle, n in this method is typically 2, and at most 4. This results in the method occupying multiple vector register read / write ports for an extended period, and the excessive number of execution cycles leads to inefficiency.
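As a rough illustration of the cycle count VLEN*LMUL / SEW / n, the sketch below evaluates it for given parameters. The function name `method_a_cycles` is hypothetical, and rounding up when n does not divide the element count evenly is an assumption added here, not something stated in the text.

```c
#include <stdint.h>

/* Cycles needed by existing method A when n mask bits are scanned per cycle.
 * Rounding up for a non-divisible element count is an assumption of this sketch. */
uint32_t method_a_cycles(uint32_t vlen, uint32_t lmul, uint32_t sew, uint32_t n)
{
    uint32_t elements = vlen * lmul / sew;   /* total elements in the register group */
    return (elements + n - 1) / n;
}
```

For example, with VLEN=128, LMUL=8, SEW=8, and n=2, the register group holds 128 elements and the instruction occupies the unit for 64 cycles.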

[0019] The aforementioned existing method B uses out-of-order microinstruction splitting. As can be seen from the formula for calculating the number of microinstructions, the number of microinstructions is proportional to the square of VLEN / DLEN and also proportional to the square of LMUL. Therefore, this method incurs a very large microinstruction storage overhead when the VLEN / DLEN ratio or LMUL (LMUL is a value configured in the software code) is large. Furthermore, the splitting of a single vcompress.vm instruction into numerous microinstructions can block the splitting, scheduling, and execution of subsequent instructions. To conserve silicon area, the number of arithmetic units and pipelines processing vcompress.vm microinstructions cannot increase quadratically. Therefore, this method is highly inefficient when the VLEN / DLEN ratio or LMUL is large.

[0020] As can be seen from the above introduction of the two existing methods, the existing methods have the following disadvantages:

[0021] When compression instructions are directly broken down into compression micro-instructions to complete all compression operations, the number of micro-instructions is large when the VLEN / DLEN ratio is large or the hardware-supported LMUL is large. This will result in a large amount of hardware storage overhead, and a large number of micro-instructions will block the splitting, scheduling and execution of subsequent instructions.

[0022] When the VLEN / DLEN ratio is large or the hardware-supported LMUL is large, the execution efficiency of this instruction is low; or, in order to improve efficiency, the number of arithmetic units and pipelines is quadratic with VLEN / DLEN or LMUL, resulting in large area overhead.

[0023] It should be noted that the background information presented here is only for illustrating relevant information about the present invention to aid in understanding the technical solution of the present invention, and does not imply that the relevant information is necessarily prior art. The relevant information was submitted and disclosed together with the present invention, and should not be considered prior art unless there is evidence that the relevant information was disclosed before the filing date of the present invention.

Summary of the Invention

[0024] Therefore, the purpose of this invention is to overcome the shortcomings of the prior art and provide a method for executing vector compression instructions in the RVV instruction set.

[0025] The objective of this invention is achieved through the following technical solution:

[0026] According to a first aspect of the present invention, a method for executing vector compression instructions in an RVV instruction set is provided, comprising: when the instruction to be processed is a compression instruction, splitting the compression instruction into multiple compression micro-instructions, segmenting a partial mask vector corresponding to each compression micro-instruction from the total mask vector corresponding to the compression instruction, and segmenting a source data vector corresponding to each compression micro-instruction from the total source data vector to be compressed corresponding to the compression instruction; performing preprocessing to adapt to the aggregation operation array, comprising: generating, based on the partial mask vector, source data vector, and number corresponding to each compression micro-instruction, an index that the aggregation operation array can operate on and selected data obtained by locally compressing the source data vector corresponding to that compression micro-instruction, so as to match the input requirements of the aggregation instruction; and executing an aggregation instruction on the aggregation operation array to perform aggregation operations based on the index and selected data corresponding to each compression micro-instruction split from the compression instruction, obtaining the result corresponding to the compression instruction. This scheme achieves at least the following beneficial technical effects: it splits the compression instruction into multiple compression micro-instructions, converts the input of the compression micro-instructions through preprocessing into an index and selected data adapted to the aggregation operation array, and finally uses the aggregation operation of the aggregation operation array directly to obtain the result corresponding to the compression instruction, thereby accelerating the execution of the compression instruction (vcompress.vm) with the low-latency, high-throughput aggregation (GATHER) operation array.

[0027] Optionally, the number of compression micro-instructions split from each compression instruction can be determined as follows:

[0028] $$\text{number of compression micro-instructions} = \frac{\mathrm{VLEN} \times \mathrm{LMUL}}{\mathrm{DLEN}}$$

[0029] where VLEN denotes the length of the vector register; LMUL denotes the number of vector registers operated on by the compression instruction; and DLEN denotes the length of a data path for reading and writing a vector register. This scheme achieves at least the following beneficial technical effects: existing methods generate too many micro-instructions when executing compression instructions, occupying register read / write ports frequently and affecting the execution of other vector instructions. The scheme of this invention generates VLEN / DLEN*LMUL micro-instructions for each compression instruction. When the VLEN / DLEN ratio is large or the hardware-supported LMUL is large, the number of micro-instructions grows only linearly with VLEN / DLEN and LMUL, resulting in less hardware storage overhead and less obstruction to the splitting, scheduling, and execution of subsequent instructions.
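To make the linear growth concrete, the sketch below evaluates VLEN / DLEN*LMUL. The function name `compress_uop_count` is hypothetical and is used only for this illustration.

```c
#include <stdint.h>

/* Number of compression micro-instructions under the proposed scheme:
 * one per DLEN-wide piece of the register group. */
uint32_t compress_uop_count(uint32_t vlen, uint32_t dlen, uint32_t lmul)
{
    return vlen / dlen * lmul;
}
```

For DLEN = VLEN / 2 and LMUL = 8 this yields 16 micro-instructions, compared with the 136 that existing method B produces for the same configuration.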

[0030] Optionally, the selected data corresponding to each compression microinstruction is obtained as follows: elements whose mask value is 1 are selected from the source data vector corresponding to the compression microinstruction and arranged sequentially from low to high as the first local compressed vector corresponding to that compression microinstruction; the first local compressed vector is then cyclically shifted left by Y elements at the smallest data granularity to obtain the second local compressed vector, which is the selected data of the compression microinstruction, where Y is the shift amount, namely the number of mask values equal to 1 in the partial mask vectors of all compression microinstructions whose numbers are less than that of the current compression microinstruction. This scheme achieves at least the following beneficial technical effects through the second local compressed vector. Without the second local compressed vector, calculating the j-th index of the i-th group requires adding the number of 1s in the first j masks of the i-th group to the sum of the partial sums of the first i groups; the carry chain of this addition makes the circuit the longest timing path in the entire module and limits the frequency. With the second local compressed vector, the index calculation becomes I + M x ∑, where I never exceeds M and M must be an integer power of 2. It is therefore sufficient to concatenate the log2(M)-bit value I with ∑, which reduces the use of adders. Fewer additions shorten the longest path of the index generation circuit, allowing the design to reach a higher frequency.
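The two steps described above, local compression followed by a cyclic shift, can be sketched for a single compression microinstruction as follows. Everything here is a hedged illustration: the names `local_compress` and `rotate_left`, the choice M = 4, the byte-sized elements, and the zero fill of unused slots are assumptions, and "left" is taken to mean toward higher element numbers.

```c
#include <stddef.h>
#include <stdint.h>

#define M 4  /* assumed number of smallest-granularity elements per micro-instruction */

/* Step 1: pack the elements whose mask bit is 1 toward index 0 (first local
 * compressed vector). Unused slots are zeroed; their contents are don't-care.
 * Returns the number of selected elements. */
size_t local_compress(uint8_t out[M], const uint8_t src[M], const uint8_t mask[M])
{
    size_t count = 0;
    for (size_t i = 0; i < M; i++)
        out[i] = 0;
    for (size_t i = 0; i < M; i++)
        if (mask[i])
            out[count++] = src[i];
    return count;
}

/* Step 2: cyclic shift by y elements toward higher element numbers
 * (second local compressed vector). y is the count of 1 bits in all
 * lower-numbered micro-instructions. */
void rotate_left(uint8_t out[M], const uint8_t in[M], size_t y)
{
    for (size_t i = 0; i < M; i++)
        out[(i + y) % M] = in[i];
}
```

For a source chunk e, f, g, h with mask 0, 1, 1, 1 and Y = 2, the first local compressed vector is f, g, h, 0 and the second is h, 0, f, g, so each selected element ends up at its final destination position modulo M.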

[0031] Optionally, the index indicates the location, in the register group storing the second local compressed vectors, of the element required by the aggregation instruction, and the index is determined as follows:

[0032] $$\mathrm{index}_s = I_s + M \times \Big(\sum_{k} c_{s,k} - 1\Big), \qquad I_s = s \bmod M$$

[0033] where $\mathrm{index}_s$ denotes the index of the element at overall sequence number $s$; $I_s$ denotes the sequence number of that element within its corresponding compression microinstruction; $\bmod$ denotes the modulo operation; $M$ denotes the number of smallest-bit-width elements in the source data vector corresponding to each compression microinstruction; $\sum_{k} c_{s,k} - 1$ denotes the number of the compression microinstruction to which the sequence number currently being calculated belongs; and $c_{s,k}$ indicates whether sequence number $s$ is greater than or equal to the effective mask sum corresponding to compression microinstruction 0 through compression microinstruction $k$, with $c_{s,k} = 1$ if so and $c_{s,k} = 0$ otherwise. This scheme can achieve at least the following beneficial technical effects: when the index is calculated in this way, the position of the relevant element in the register group after the cyclic left shift can be calculated efficiently.
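One possible reading of the index calculation is sketched below. It assumes that the effective mask sum associated with microinstruction k counts the mask bits equal to 1 in all microinstructions numbered below k (the same quantity as the shift amount Y), so that the first entry is always 0. The function name `gather_index` and the array `prefix` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical index generation for destination element s.
 * m         : elements per micro-instruction (a power of two)
 * prefix[k] : assumed to be the number of 1 mask bits in micro-instructions
 *             0..k-1, so prefix[0] == 0
 * n_uops    : number of compression micro-instructions (must be >= 1) */
size_t gather_index(size_t s, size_t m, const size_t *prefix, size_t n_uops)
{
    size_t hits = 0;
    for (size_t k = 0; k < n_uops; k++)
        if (s >= prefix[k])
            hits++;                 /* sum of the comparison flags */
    size_t group = hits - 1;        /* micro-instruction that supplies element s */
    size_t within = s % m;          /* same as s & (m - 1) when m is a power of two */
    return within + m * group;      /* equivalent to concatenating {group, within} */
}
```

With M = 4 and two microinstructions whose partial mask vectors contain 2 and 3 ones, `prefix` is {0, 2}, and destination elements 0 to 4 read from positions 0, 1, 6, 7, and 4 of the register group holding the second local compressed vectors.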

[0034] According to a second aspect of the present invention, a processor supporting the RVV instruction set is provided, comprising a data sorting arithmetic unit, the arithmetic unit comprising: a preprocessing module for vector compression microinstructions, configured to: when the microinstruction input to the data sorting arithmetic unit is a compression microinstruction, perform preprocessing to adapt to the aggregation arithmetic array, comprising: generating, according to the partial mask vector, source data vector, and number corresponding to each compression microinstruction, the index required for the aggregation arithmetic array to operate on that compression microinstruction and the selected data obtained by locally compressing the source data vector corresponding to that compression microinstruction, wherein the partial mask vector comes from vector operand 1 and the source data vector comes from vector operand 2; and an aggregation arithmetic array, configured to: when the microinstruction input to the data sorting arithmetic unit is a compression microinstruction, obtain the index and selected data from the preprocessing module to perform aggregation operations to obtain the result corresponding to the relevant compression instruction; and when the microinstruction input to the data sorting arithmetic unit is an aggregation microinstruction, directly perform aggregation operations according to each vector operand corresponding to the aggregation microinstruction. This solution achieves at least the following beneficial technical effects: it adds a preprocessing module for vcompress.vm micro-instructions to the aggregation (GATHER) array, so that the instruction can be computed on the aggregation array at the cost of only one additional execution cycle. The area of the aggregation array is linearly related to both VLEN / DLEN and LMUL, which avoids hardware area overhead of quadratic complexity. The instruction computation latency is LMUL*2 cycles and the throughput is 1 / LMUL, reducing latency and increasing throughput compared with existing implementations.

[0035] Optionally, the arithmetic unit includes: a first data selection module, whose inputs are vector operand 1 and the index output by the preprocessing module; and a second data selection module, whose inputs are vector operand 2 and the selected data output by the preprocessing module. When the microinstruction input to the data sorting arithmetic unit is a compression microinstruction, the first data selection module outputs the index from the preprocessing module, and the second data selection module outputs the selected data from the preprocessing module. When the microinstruction input to the data sorting arithmetic unit is an aggregation microinstruction, the first data selection module outputs vector operand 1, and the second data selection module outputs vector operand 2. This scheme can achieve at least the following beneficial technical effects: it integrates compression microinstructions and aggregation microinstructions into one arithmetic unit, and selects what is finally fed to the aggregation arithmetic array according to the type of the input microinstruction.

[0036] Optionally, the preprocessing module includes: a local data compression module, used to select elements with a mask value of 1 from the source data vector corresponding to the compressed microinstruction, and arrange them sequentially from low to high as the first local compressed vector corresponding to the compressed microinstruction; and a data cyclic shift module, used to cyclically shift the first local compressed vector corresponding to the compressed microinstruction to the left by Y elements according to the smallest data granularity, to obtain the second local compressed vector of the selected data corresponding to the compressed microinstruction, where Y represents the shift number, which is the number of elements with a mask value of 1 in the partial mask vectors corresponding to all other compressed microinstructions with numbers less than the number of the compressed microinstruction.

[0037] Optionally, the preprocessing module further includes: a mask segment prefix summation module, used to calculate the number of elements with a mask value of 1 in the partial mask vector corresponding to each numbered compression microinstruction as the partial sum of that partial mask vector, and then to calculate in sequence, for each compression microinstruction, the prefix sum running from compression microinstruction 0 up to that compression microinstruction as its effective mask sum, the effective mask sum being the sum of the partial sums of the corresponding partial mask vectors; prefix sum storage units, used to store the effective mask sum running from compression microinstruction 0 up to each compression microinstruction; and an index generation module, used to determine the index based on the overall sequence number, the number of the compression microinstruction, and the effective mask sum running from compression microinstruction 0 up to the corresponding compression microinstruction.

Attached Figure Description

[0038] The embodiments of the present invention will be further described below with reference to the accompanying drawings, wherein:

[0039] Figure 1 is a schematic diagram of the structure of a data sorting arithmetic unit according to an embodiment of the present invention;

[0040] Figure 2 is a schematic diagram of the architecture of the preprocessing module for compression microinstructions according to an embodiment of the present invention;

[0041] Figure 3 is a schematic diagram illustrating the generation process of the second local compressed vector according to an embodiment of the present invention;

[0042] Figure 4 is a schematic diagram illustrating the principle of local data compression implemented by the preprocessing module according to an embodiment of the present invention;

[0043] Figure 5 is a schematic diagram illustrating the generation of an index by the index generation module in the preprocessing module according to an embodiment of the present invention;

[0044] Figure 6 is a flowchart illustrating a method for executing vector compression instructions in the RVV instruction set according to an embodiment of the present invention;

[0045] Figure 7 is a schematic diagram illustrating the execution process of the micro-instructions decomposed from the VCOMPRESS instruction according to an embodiment of the present invention;

[0046] Figure 8 is another schematic diagram illustrating the execution process of the micro-instructions decomposed from the VCOMPRESS instruction according to an embodiment of the present invention.

Detailed Implementation

[0047] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative and are not intended to limit the invention.

[0048] As mentioned in the background section, directly breaking down compression instructions into compression micro-instructions to complete all compression operations can lead to excessive storage overhead and impact subsequent instructions, or even large area overhead. Therefore, the method of this invention breaks down compression instructions into multiple compression micro-instructions, and preprocesses the input of these micro-instructions into an index and selected data adapted to the aggregation array. Finally, it directly utilizes the aggregation operation of the aggregation array to obtain the result corresponding to the compression instruction. This leverages a low-latency, high-throughput aggregation (GATHER) array to accelerate the execution of the compression instruction (vcompress.vm), reusing the existing aggregation array to speed up the execution of the compression instruction without incurring significant additional area overhead.

[0049] According to one embodiment of the present invention, the present invention provides a processor supporting the RVV instruction set, which includes a data sorting arithmetic unit that supports vector compression microinstructions and vector aggregation microinstructions. Referring to Figure 1, the data sorting arithmetic unit includes:

[0050] The preprocessing module for vector compression microinstructions is configured to: when the microinstruction input to the data sorting arithmetic unit is a compression microinstruction, perform preprocessing to adapt to the aggregation operation array. This preprocessing includes: generating, based on the partial mask vector, source data vector, and number corresponding to each compression microinstruction, the index required for the aggregation operation array to operate on that compression microinstruction and the selected data obtained by locally compressing the source data vector corresponding to that compression microinstruction. The partial mask vector comes from vector operand 1, and the source data vector comes from vector operand 2. In other words, the preprocessing module for vector compression microinstructions receives vector operand 1 (corresponding to vs1) and vector operand 2 (corresponding to vs2) of the compression microinstruction and, combined with the number of the current compression microinstruction, outputs the index and data required by the aggregation operation array.

[0051] The aggregation (GATHER) operation array is configured such that: when the microinstruction input to the data sorting arithmetic unit is a compression microinstruction, the index and selected data are obtained from the preprocessing module to perform aggregation operations and obtain the result corresponding to the relevant compression instruction; when the microinstruction input to the data sorting arithmetic unit is an aggregation microinstruction, the aggregation operation is performed directly according to the vector operands corresponding to the aggregation microinstruction.
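The aggregation step itself is an indexed read. The sketch below models it in software under stated assumptions: the name `gather_apply` is hypothetical, elements are bytes, and destination positions beyond the number of selected elements are simply left unchanged (a tail-undisturbed assumption that the text does not spell out for this unit).

```c
#include <stddef.h>
#include <stdint.h>

/* Gather step: dst[p] = data[index[p]] for the first `selected` positions.
 * Positions at or beyond `selected` keep their old contents (assumed
 * tail-undisturbed behaviour). */
void gather_apply(uint8_t *dst, const uint8_t *data,
                  const size_t *index, size_t selected)
{
    for (size_t p = 0; p < selected; p++)
        dst[p] = data[index[p]];
}
```

For instance, if the preprocessed data for two microinstructions is a, c, 0, 0, h, 0, f, g and the indices are 0, 1, 6, 7, 4, the first five destination elements become a, c, f, g, h, which is the compressed result.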

[0052] The first data selection module XZ1 takes vector operand 1 and the index output by the preprocessing module as input, and its output can be used as the index required by the aggregation operation array.

[0053] The second data selection module XZ2 takes vector operand 2 and the selected data output by the preprocessing module as its inputs, and its output can be used as the source data required by the aggregation operation array.

[0054] The first data selection module XZ1 and the second data selection module XZ2 receive an indication of whether the microinstruction currently input to the arithmetic unit is a compression microinstruction or an aggregation microinstruction. When the microinstruction input to the data sorting arithmetic unit is a compression microinstruction, the first data selection module XZ1 outputs the index from the preprocessing module, and the second data selection module XZ2 outputs the selected data from the preprocessing module; that is, vector operands 1 and 2 are not fed directly into the aggregation operation array, but are first preprocessed by the preprocessing module to fit the aggregation instruction and the aggregation operation array. When the microinstruction input to the data sorting arithmetic unit is an aggregation microinstruction, the first data selection module outputs vector operand 1, and the second data selection module outputs vector operand 2.

[0055] In other words, when the microinstruction input to the data sorting operator is a vector compression microinstruction, the selection circuit on the left side of the aggregation array (first data selection module XZ1) selects the index output from the vector compression microinstruction preprocessing module as the index required for the array operation, and the selection circuit above the aggregation array (second data selection module XZ2) selects the data output from the vector compression microinstruction preprocessing module as the data required for the array operation. However, when the microinstruction input to the data sorting operator is a vector aggregation microinstruction, the selection circuit on the left (first data selection module XZ1) directly selects vector operand 1 as the index input for the aggregation array, and the selection circuit above (second data selection module XZ2) directly selects vector operand 2 as the data input for the aggregation array. It should be noted that, in Figure 1, the two instances of vector operand 1 are identical; they are shown in two places for ease of diagramming. Similarly, the two instances of vector operand 2 are identical. In an actual implementation, only one vector operand 1 and one vector operand 2 enter the data sorting operator.

[0056] According to one embodiment of the present invention, the vector compression instruction (hereinafter referred to as the compression instruction) vcompress.vm vd, vs2, vs1 is split into multiple compression micro-instructions uop.vcompress uvd.a, uvs2.b, uvs1.c. Here, uvd, uvs2, and uvs1 are register numbers, and a, b, and c indicate which part of the data in the respective register is used. The number of splits is calculated as VLEN / DLEN*LMUL, where VLEN and DLEN are hardware-fixed constants: VLEN indicates the length of a vector register, and DLEN indicates the length of a data path for reading and writing vector registers. LMUL, on the other hand, is a configuration value provided by the software code; it indicates the number of vector registers on which the compression instruction operates. Therefore, for the nth microinstruction (n starts counting from 0), uvd is vd+n*VLEN / DLEN, uvs1 is vs1, uvs2 is vs2+n*VLEN / DLEN, a and b are n MOD (VLEN / DLEN), and c is (VLEN*LMUL*n) / (SEW*DLEN).
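
As an illustration of the split described above, the following minimal Python sketch enumerates the compression micro-instructions of one compression instruction. The function name `split_compress` and the dictionary keys are hypothetical, and the sketch assumes that VLEN is a multiple of DLEN. It models only the count VLEN / DLEN*LMUL and the part number n MOD (VLEN / DLEN), not the register numbering.

```python
def split_compress(vlen, dlen, lmul):
    """Enumerate the compression micro-instructions of one vcompress.vm.

    Hypothetical helper: returns one dict per micro-instruction holding its
    number n and the DLEN-wide part (a and b in the text) that it touches.
    """
    if vlen % dlen != 0:
        raise ValueError("this sketch assumes VLEN is a multiple of DLEN")
    parts_per_register = vlen // dlen
    count = parts_per_register * lmul  # VLEN / DLEN * LMUL
    return [{"n": n, "part": n % parts_per_register} for n in range(count)]


uops = split_compress(vlen=256, dlen=128, lmul=8)
print(len(uops))  # number of micro-instructions for this configuration
```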

[0057] According to one embodiment of the present invention, Figure 2 illustrates the architecture of the preprocessing module for compression microinstructions. The preprocessing module includes: a data local compression module, a data cyclic shift module, a mask segment prefix summation module, a prefix sum storage unit, an index generation module, a third data selection module XZ3, and a fourth data selection module XZ4. In Figure 2, the two instances of the microinstruction number n are identical; they are shown separately for ease of diagramming. Vector operand 1 is divided into VLEN*LMUL / DLEN parts, numbered from 0 to VLEN*LMUL / DLEN-1. Where:

[0058] The third data selection module XZ3 uses the number n of the compression microinstruction as the selection signal to select the nth part from the VLEN*LMUL / DLEN parts into which vector operand 1 is divided, as the partial mask vector corresponding to the compression microinstruction; the selected partial mask vector is the mask corresponding to vector operand 2 of this compression microinstruction.

[0059] The data local compression module is used to select elements with a mask value of 1 from the source data vector corresponding to the compressed microinstruction, and arrange them sequentially from low to high as the first local compression vector corresponding to the compressed microinstruction. The partial mask vector and vector operand 2 are input into the data local compression module to obtain the "first local compression vector". Among them, vector operand 2 is divided into DLEN / ELEMMIN parts according to the smallest data granularity ELEMMIN. The elements in vector operand 2 corresponding to the mask with a value of 1 in the partial mask vector will be selected in the data local compression module and then arranged consecutively as the "first local compression vector" for output.
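
The local compression step can be sketched in Python as follows. This is a behavioural model under stated assumptions, not the hardware design: `local_compress` is a hypothetical name, list position 0 stands for the lowest element, and each list entry stands for one element of the smallest granularity.

```python
def local_compress(data, mask):
    """Pack the elements whose mask bit is 1 into the low positions.

    The unused high positions are zero-filled, giving the
    "first local compression vector" of one micro-instruction.
    """
    if len(data) != len(mask):
        raise ValueError("data and mask must have the same number of elements")
    picked = [d for d, m in zip(data, mask) if m == 1]
    return picked + [0] * (len(data) - len(picked))


first_vector = local_compress(list("abcdefgh"), [1, 0, 1, 0, 0, 1, 1, 1])
print(first_vector)
```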

[0060] The data cyclic shift module is used to cyclically shift the first local compressed vector corresponding to the compression microinstruction to the left by Y elements according to the smallest data granularity, to obtain the second local compressed vector, which is the selected data corresponding to the compression microinstruction. Here, Y represents the shift number, which is the number of mask bits with a value of 1 in the partial mask vectors corresponding to all other compression microinstructions whose numbers are less than the number of this compression microinstruction. The "first local compressed vector" and the shift number from the prefix sum storage unit are input into the data cyclic shift module to obtain the "second local compressed vector," which is the data output of the vector compression microinstruction preprocessing module. Specifically, the "first local compressed vector" is divided into DLEN / ELEMMIN parts according to the smallest data granularity ELEMMIN, and then cyclically shifted toward the higher positions by "shift number" elements.
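
A minimal Python sketch of the cyclic shift, assuming list position 0 is the lowest element and each entry is one element of the smallest granularity. The name `rotate_up` is hypothetical.

```python
def rotate_up(vector, shift):
    """Cyclically move every element `shift` positions toward higher indices."""
    size = len(vector)
    rotated = [0] * size
    for position, value in enumerate(vector):
        rotated[(position + shift) % size] = value
    return rotated


# A first local compression vector with five selected elements, shifted by 5.
second_vector = rotate_up(["i", "j", "k", "m", "n", 0, 0, 0], 5)
print(second_vector)
```

Elements pushed past the highest position wrap around to the lowest positions, which is what lets the later index calculation use the position within the micro-instruction directly.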

[0061] The mask segment prefix summation module is used to calculate the number of elements with a mask value of 1 in the partial mask vector corresponding to each numbered compression microinstruction, as the partial sum of that partial mask vector, and then to calculate, for each compression microinstruction in turn, the prefix sum over compression microinstruction 0 through that compression microinstruction as the effective mask sum. That is, the effective mask sum is the sum of the partial sums of the partial mask vectors corresponding to compression microinstruction 0 through that compression microinstruction. The mask segment prefix summation module calculates the prefix sums of VLEN*LMUL / DLEN groups. For example, when VLEN=256, DLEN=128, and LMUL=8, each mask group contains VLEN / (VLEN*LMUL / DLEN)=16 bits. Therefore, the module calculates the number of 1s in bits 0 to 15 (closed interval, the same below), 0 to 31, 0 to 47, ..., 0 to 255 of the mask in vector operand 1, respectively, thus obtaining the sums of multiple groups of 1-bit data. This invention does not impose any restrictions on the implementation method of summing multiple groups of 1-bit data. In this example, the sums can be obtained by using 16 groups of 1-bit adder circuits to perform 16 additions over different numbers of mask bits, or by using 16 identical adders, each summing 16 1-bit values, to obtain 16 groups of 5-bit partial sums (with a maximum result of 16), and then summing the first 1, the first 2, the first 3, and so on, of these partial sums to obtain the prefix sums of the multiple mask groups.
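
The second implementation option (per-group partial sums followed by a running sum) can be sketched in Python as follows. The name `effective_mask_sums` is hypothetical, and the mask is assumed to be given as a list of 0/1 values with bit 0 first.

```python
from itertools import accumulate


def effective_mask_sums(mask, group_bits):
    """Return, for each group n, the number of 1s in groups 0 through n."""
    partial_sums = [
        sum(mask[start:start + group_bits])
        for start in range(0, len(mask), group_bits)
    ]
    return list(accumulate(partial_sums))


# 256 mask bits in 16 groups of 16 bits, as in the VLEN=256, DLEN=128, LMUL=8 case.
sums = effective_mask_sums([1] * 256, 16)
print(len(sums), sums[0], sums[-1])
```

Under this model, the shift number of micro-instruction n is the entry for micro-instruction n-1, or 0 for micro-instruction 0.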

[0062] The prefix sum storage unit is used to store the effective mask sums corresponding to compression microinstruction 0 through each compression microinstruction. The prefix sums in the prefix sum storage unit have two uses: one is as the shift number for the "data cyclic shift module", and the other is as the value passed to the index generation module to generate the index.

[0063] The index generation module is used to determine the index based on the overall sequence number, the number of the compression microinstruction, and the effective mask sums corresponding to compression microinstruction 0 through each compression microinstruction.

[0064] The fourth data selection module XZ4 is used to select the valid mask sum of the nth group of stored data according to the microinstruction number n, that is, the prefix sum corresponding to compressed microinstruction 0 to compressed microinstruction n.

[0065] According to one example of the invention, see Figure 3, which presents a schematic diagram illustrating the principle by which the "data local compression module" and the "data cyclic shift module" generate the "second local compression vector". In this example, DLEN=VLEN, LMUL=2, the VL value is 14, the lower 16 bits of the vs1 mask vector are 10110111 11100101, and the vs2 vector register group contains two vector registers, vs2 and vs2+1. The source data elements numbered 0-7 on the right belong to the vector register numbered vs2, corresponding to microinstruction 0, and the source data elements numbered 8-15 on the left belong to the vector register numbered vs2+1, corresponding to microinstruction 1. In the preprocessing module of the compression microinstruction, a "number less than VL" bit vector is generated based on the value of VL. Bits 0-13 of this bit vector are set to 1 because their numbers are less than 14, while bits 14 and 15 are set to 0. The preprocessing module selects the source data elements whose mask bit is 1 (and whose number is less than VL) and arranges them sequentially in the lower positions of the first local compression vector. For microinstruction 0, the first local compression vector selected is h, g, f, c, a. For microinstruction 1, the first local compression vector selected is n, m, k, j, i. The high-order positions of the first local compression vector are filled with zeros (the principle of generating the first local compression vector will be explained later). It is important to note that since each microinstruction can only carry DLEN bits of data from the vs2 register group into the data sorting arithmetic unit, the source data vector of microinstruction 0 is not available to microinstruction 1 in the preprocessing module of the compression microinstruction. Therefore, only the local compression of the data carried by microinstruction 1 itself can be completed in the preprocessing module. Because this operation step only completes the compression of the source data vector carried by one compression microinstruction, this data compression step is called "local compression".
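
The Figure 3 example can be reproduced with a short Python sketch. It assumes that the source elements are labelled a to p from element 0 upward (inferred from the listed results) and that the mask string is written with the highest bit first. The helper names are hypothetical.

```python
def mask_from_msb_string(text):
    """Convert a mask written highest bit first into a list with bit 0 first."""
    return [int(ch) for ch in reversed(text.replace(" ", ""))]


def apply_vl(mask, vl):
    """AND the mask with the "number less than VL" bit vector."""
    return [bit if number < vl else 0 for number, bit in enumerate(mask)]


elements = list("abcdefghijklmnop")          # assumed labels for elements 0..15
mask = apply_vl(mask_from_msb_string("10110111 11100101"), vl=14)

per_uop = 8                                   # elements carried by one micro-instruction
selected = [
    [e for e, bit in zip(elements[n * per_uop:(n + 1) * per_uop],
                         mask[n * per_uop:(n + 1) * per_uop]) if bit]
    for n in range(2)
]
print(selected)  # lowest element first, so the reverse of the order listed in the text
```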

[0066] The second local compressed vector is obtained by cyclically shifting the first local compressed vector toward the higher index positions by the number of 1s in the masks corresponding to all microinstructions whose numbers are less than that of the current microinstruction. Illustratively, if the number of 1s in the masks corresponding to all other microinstructions whose numbers are less than that of the current compression microinstruction is Y, then the first local compressed vector is cyclically shifted (cyclically left-shifted) by Y elements toward the higher positions. In the example above, microinstruction 0 has the smallest number, so the number of 1s in the masks corresponding to microinstructions numbered lower than it is 0. Therefore, microinstruction 0 obtains its second local compressed vector without shifting. Microinstruction 0 is the only microinstruction numbered lower than microinstruction 1, and the number of 1s in the mask of microinstruction 0 is 5. Therefore, when microinstruction 1 enters the vector compression microinstruction preprocessing module, its first local compressed vector is cyclically shifted by 5 elements toward the higher positions to obtain the second local compressed vector. The second local compressed vector is then passed, as the output of the compression microinstruction preprocessing module, to the aggregation operation array.

[0067] The advantage of the second local compression vector lies in simplifying index calculation and improving its timing by rearranging the data. Without the second local compression vector, calculating the j-th index of the i-th group requires adding the number of 1s in the first j masks of the i-th group to the prefix sum of the first i groups. This addition requires carry propagation, making this circuit the longest timing path in the entire module and limiting the frequency. However, with the second local compression vector, the index calculation becomes I + M x ∑, where I is less than M and M is always an integer power of 2. Because I occupies only the low log2(M) bits, it suffices to concatenate ∑ with the log2(M) bits of I, which reduces the use of adders. The reduced addition operations shorten the longest path of the index generation circuit, allowing the design to achieve higher frequencies.
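
The equivalence between the sum I + M x ∑ and a plain bit concatenation can be checked with a small Python sketch. The function names are hypothetical; the sketch assumes M is a power of 2 and that I lies in the range 0 to M-1.

```python
def index_by_add(i, m, s):
    """Index as written in the text: I + M x (sum)."""
    return i + m * s


def index_by_concat(i, m, s):
    """Same index built without an adder: the sum forms the high bits, I the low bits."""
    if m <= 0 or m & (m - 1):
        raise ValueError("M must be a power of 2")
    if not 0 <= i < m:
        raise ValueError("I must lie in the range 0..M-1")
    low_bits = m.bit_length() - 1  # log2(M)
    return (s << low_bits) | i


print(index_by_add(7, 8, 1), index_by_concat(7, 8, 1))
```

Because I never reaches M, the low log2(M) bits of M x ∑ are all zero, so no carry can propagate between the two fields.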

[0068] Figure 4 illustrates the principle of local data compression implemented by the preprocessing module. First, the running count of 1s in the mask vector is calculated from the lowest to the highest index. Then, if the mask bit is 1, the count at the corresponding position is retained; otherwise, it is cleared to zero, thus obtaining the target position of the element. Assuming the target position of element i is j, this means the local data compression process needs to send the data element with index i to position j-1 of the first local compression vector. If j=0, it means the element with index i is not selected. For example, the target position of element 5 is 3, meaning that element 5 of the data vector needs to be sent to position 2 of the first local compression vector. Next, each target position is compared with a constant value on the left and converted into a one-hot code vector. Multiple one-hot code vectors are concatenated to form a one-hot code matrix. Each row of this matrix is either a one-hot code or all 0s. If row i of the matrix is a one-hot code, it indicates that an element will be selected from the source data vector at the position of the 1 in that one-hot code and stored at position i of the first local compression vector. If row i of the matrix is all 0s, the selection result will always be 0. It should be noted that after obtaining the one-hot code matrix, the existing method of "selection logic based on AND-OR" can be used to implement the function of selecting data using one-hot codes.
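
The target-position and one-hot steps can be modelled in Python as below. This is a sketch under assumptions: the function names are hypothetical, elements are small integers so that the AND-OR selection can be written with bitwise operators, and list position 0 is the lowest element.

```python
def target_positions(mask):
    """Running count of 1s where the mask bit is 1, otherwise 0."""
    running = 0
    targets = []
    for bit in mask:
        running += bit
        targets.append(running if bit else 0)
    return targets


def one_hot_matrix(targets):
    """Row i marks the source element (if any) whose target position is i + 1."""
    size = len(targets)
    return [[1 if targets[e] == i + 1 else 0 for e in range(size)]
            for i in range(size)]


def and_or_select(matrix, data):
    """AND every element with its one-hot bit, then OR the row together."""
    result = []
    for row in matrix:
        value = 0
        for bit, element in zip(row, data):
            value |= element if bit else 0
        result.append(value)
    return result


mask = [1, 0, 1, 0, 0, 1, 1, 1]
data = [ord(c) for c in "abcdefgh"]
targets = target_positions(mask)
packed = and_or_select(one_hot_matrix(targets), data)
print(targets, [chr(v) if v else 0 for v in packed])
```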

[0069] Figure 5 illustrates the index generation process within the preprocessing module of the compression microinstruction. The effective mask sum of compression microinstruction 0, and the effective mask sum of compression microinstructions 0 and 1, are both obtained from the prefix sum storage unit. Since LMUL=2 and DLEN=VLEN=64 in this example, only two effective mask sums (VLEN / DLEN*LMUL=2) are used to generate the index. The "sequence number not less than the effective mask sum" vector is computed as follows: if an element's sequence number is not less than the effective mask sum, its bit is set to 1; otherwise, it is set to 0. In this example, the effective mask sum of compression microinstruction 0 is 5, so positions in the vector with sequence numbers greater than or equal to 5 are set to 1, and the rest are set to 0. Similarly, since the effective mask sum of compression microinstructions 0 and 1 is 10, positions with sequence numbers greater than or equal to 10 are set to 1, and the rest are set to 0, resulting in the "sequence number not less than the effective mask sum of microinstructions 0 and 1" vector shown in the diagram. The number of such vectors is equal to the number of compression micro-instructions split from the compression instruction. Let A[x][y] denote whether the sequence number y is not less than the effective mask sum from micro-instruction 0 to micro-instruction x: if so, A[x][y] = 1; otherwise, A[x][y] = 0. Let the number of minimum-bit-width elements in the source data vector corresponding to a compression micro-instruction be denoted as M. The "sequence number within the compression microinstruction" is a hardware constant representing the element's sequence number within that microinstruction; elements at the same position within a compression microinstruction have the same value, denoted as I. The relationship between I and the sequence number y is I = y MOD M, where y MOD M denotes the remainder of y divided by M. The formula for calculating the "index" is: index = I + M x ∑ A[x][y], where the sum is taken over the compression micro-instruction numbers x. For example, sequence number 4 is less than the effective mask sum of microinstruction 0 (value 5), and also less than the effective mask sum of microinstructions 0 and 1 (value 10), thus A[0][4] = 0 and A[1][4] = 0, and I = 4. Therefore, with M = 8 in this example, the index corresponding to sequence number 4 is calculated to be 4. As another example, sequence number 7 is not less than the effective mask sum of microinstruction 0 (5), but is less than the effective mask sum of microinstructions 0 and 1 (10), thus A[0][7] = 1 and A[1][7] = 0, and I = 7. Therefore, the index corresponding to sequence number 7 is calculated to be 7 + 8 x 1 = 15. Referring to the second local compression vectors in the figure, the meaning of the "index" is that the data at that index position is taken from the second local compression vectors and stored at the position of the result with sequence number y, which is the same as the behavior of the vrgather.vv instruction. Thus, the "index generation module" can generate the index vector required for the aggregation (GATHER) operation array calculation. Using this index vector, data can be selected from multiple second local compression vectors to complete the operation of the vector compression microinstruction.
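
The index generation can be sketched in Python as follows. The formula used here, index = (y MOD M) + M x (number of effective mask sums not greater than y), is an assumption reconstructed from the I + M x ∑ form in paragraph [0067] and the definition of A[x][y]; the original formula did not survive extraction. The name `generate_indices` is hypothetical.

```python
def generate_indices(effective_sums, m):
    """Index vector for the gather, one entry per overall sequence number y.

    effective_sums[x] is the effective mask sum of micro-instructions 0..x;
    m is the number of smallest-granularity elements per micro-instruction.
    """
    indices = []
    for y in range(len(effective_sums) * m):
        i = y % m                                          # I = y MOD M
        a_sum = sum(1 for s in effective_sums if y >= s)   # sum over x of A[x][y]
        indices.append(i + m * a_sum)
    return indices


indices = generate_indices([5, 10], m=8)
print(indices)
```

Indices that point past the last second local compression vector (here 16 and above) select no source element, which in this model corresponds to the positions beyond the compressed result.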

[0070] According to one embodiment of the present invention, the present invention also provides a method for executing vector compression instructions in the RVV instruction set; see Figure 6. The method includes: Step S1: When the instruction to be processed is a compression instruction, the compression instruction is split into multiple compression micro-instructions. A partial mask vector corresponding to each compression micro-instruction is extracted from the total mask vector corresponding to the compression instruction, and a source data vector corresponding to each compression micro-instruction is extracted from the total source data vector to be compressed corresponding to the compression instruction. Step S2: Preprocessing to adapt to the aggregation operation array is performed, including: generating, based on the partial mask vector, source data vector, and number of each compression micro-instruction, an index that the aggregation operation array can operate on, and selected data obtained by locally compressing the source data vector corresponding to the compression micro-instruction, so as to match the input requirements of the aggregation instruction. Step S3: By executing the aggregation instruction on the aggregation operation array, aggregation operations are performed based on the index and selected data corresponding to each compression micro-instruction split from the compression instruction to obtain the result corresponding to the compression instruction.
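
Steps S1 to S3 can be combined into one behavioural Python model and compared against a direct compress. This is a sketch, not the hardware design: the function name is hypothetical, the index formula is the reconstructed form I + M x ∑ A[x][y], out-of-range gather indices are modelled as returning 0, and the element labels a to p are assumed.

```python
from itertools import accumulate


def vcompress_via_gather(data, mask, vl, m):
    """Model of S1 (split), S2 (preprocess) and S3 (gather) for one instruction."""
    mask = [bit if number < vl else 0 for number, bit in enumerate(mask)]
    uop_count = len(data) // m

    # S1: one (source data, partial mask) pair per micro-instruction.
    parts = [(data[n * m:(n + 1) * m], mask[n * m:(n + 1) * m])
             for n in range(uop_count)]
    sums = list(accumulate(sum(part_mask) for _, part_mask in parts))

    # S2: local compression, cyclic shift, and index generation.
    gather_source = []
    for n, (part_data, part_mask) in enumerate(parts):
        packed = [d for d, bit in zip(part_data, part_mask) if bit]
        packed += [0] * (m - len(packed))
        shift = sums[n - 1] if n else 0
        rotated = [0] * m
        for position, value in enumerate(packed):
            rotated[(position + shift) % m] = value
        gather_source.extend(rotated)
    indices = [(y % m) + m * sum(1 for s in sums if y >= s)
               for y in range(uop_count * m)]

    # S3: gather; an out-of-range index is modelled as producing 0.
    return [gather_source[i] if i < len(gather_source) else 0 for i in indices]


data = list("abcdefghijklmnop")
mask = [int(ch) for ch in reversed("1011011111100101")]
result = vcompress_via_gather(data, mask, vl=14, m=8)
print(result)
```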

[0071] According to one embodiment of the present invention, in step S1, the number of compression micro-instructions split from each compression instruction is determined according to the following calculation:

[0072] number of compression micro-instructions = VLEN / DLEN*LMUL

[0073] where VLEN indicates the length of the vector register, LMUL indicates the number of vector registers on which the compression instruction operates, and DLEN indicates the length of a data path for reading and writing a vector register.

[0074] According to an embodiment of the present invention, in step S2, the selected data corresponding to each compressed microinstruction is obtained as follows: elements with a mask value of 1 are selected from the source data vector corresponding to the compressed microinstruction, and arranged sequentially from low to high as the first local compressed vector corresponding to the compressed microinstruction; the first local compressed vector corresponding to the compressed microinstruction is cyclically shifted left by Y elements according to the smallest data granularity to obtain the second local compressed vector corresponding to the selected data of the compressed microinstruction, where Y represents the displacement number, and the displacement number is the number of elements with a mask value of 1 in the partial mask vectors corresponding to all other compressed microinstructions with numbers less than the number of the compressed microinstruction.

[0075] According to one embodiment of the present invention, in step S2, the index indicates the position of the element required by the aggregation instruction in the register group storing the second local compression vector, and the index is determined in the following manner:

[0076] index = I + M x ∑ A[x][y], with I = y MOD M

[0077] where index indicates the index of the element at overall sequence number y; I indicates the sequence number, within its corresponding compression microinstruction, of the element with overall sequence number y; MOD represents the modulo operation; M represents the number of smallest-bit-width elements in the source data vector corresponding to each compression microinstruction; the sum is taken over the compression microinstruction numbers x; and A[x][y] indicates whether sequence number y is greater than or equal to the effective mask sum corresponding to compression microinstruction 0 through compression microinstruction x: if so, A[x][y] = 1; otherwise, A[x][y] = 0.

[0078] To illustrate the throughput and latency of the method of this invention, vcompress.vm v16, v8, v0 with LMUL=8 is used as an example; see Figure 7. Provided that the operands of the compression microinstructions are ready (the case where the operands are not ready is irrelevant to this invention), at time T, the first compression microinstruction split from the compression instruction enters the preprocessing module in the data sorting arithmetic unit for data preprocessing and index generation. At time T+1 (+1 refers to the next clock cycle relative to T), the first compression microinstruction (uop4) enters the first arithmetic unit in the GATHER array (i.e., the aggregation operation array) and obtains a partial result for register v20. At the same time, the next compression microinstruction (uop1) enters the vector compression preprocessing module. At time T+2, the locally compressed data of the first compression microinstruction (uop4) enters the second arithmetic unit, while its index and operation result remain in the first arithmetic unit. The data of the second compression microinstruction (uop1) enters the first arithmetic unit of the GATHER array, where it is looked up using the index of uop4 and the partial result for register v20 is updated. The index of uop1 enters the second arithmetic unit of the GATHER array and is operated with the locally compressed data of uop4 to obtain a partial result for register v17.

[0079] In Figure 8, at time T+8, the last microinstruction enters the GATHER array, freeing up the vector compression microinstruction preprocessing module for use by other vector compression instructions. When LMUL=8, it takes 8 clock cycles from the first microinstruction entering the vector compression microinstruction preprocessing module until the module is emptied and can receive the microinstructions split from the next instruction, so the throughput is 1 / LMUL.

[0080] Meanwhile, in Figure 8, register v20 has completed operations with all locally compressed data (the 8 sets of locally compressed data corresponding to the 8 microinstructions) and can leave the arithmetic unit to be written back to the vector register. The last microinstruction (uop3) leaves the data sorting arithmetic unit at time T+16. Thus, a vector compression instruction with LMUL=8 is completed in the data sorting arithmetic unit after 16 cycles. That is, the latency is LMUL*2.
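
The two figures stated above can be restated as a tiny Python helper. It only encodes the relations given in the text for this example (latency of LMUL*2 cycles and throughput of 1 / LMUL instructions per cycle) and is not a cycle-accurate model; the function name and dictionary keys are hypothetical.

```python
def compress_timing(lmul):
    """Latency and throughput of one compression instruction, as stated in the text."""
    return {
        "latency_cycles": 2 * lmul,        # first uop in at T, last uop out at T + 2*LMUL
        "throughput_per_cycle": 1 / lmul,  # preprocessing module is busy for LMUL cycles
    }


print(compress_timing(8))
```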

[0081] The following is a comparison of the theoretical performance of this invention with existing methods A and B:

[0082]

[0083] The performance analysis and comparison of 8-bit data when LMUL is 8 are as follows:

[0084]

[0085] It should be noted that the number of times the register read / write ports are used is the same as the number of microinstructions. It can be seen that the method of this invention is superior to existing methods in terms of the number of times the register read / write ports are used, resulting in lower instruction latency and higher instruction throughput.

[0086] In summary, some embodiments of the present invention can achieve at least the following beneficial effects:

[0087] During the execution of the vcompress.vm instruction, this invention reads the register data used for each operation from the register file (RegFile) only once into the arithmetic unit, instead of reading all the data at once. This avoids repeatedly occupying the register read / write port and has minimal impact on the execution of other vector instructions.

[0088] This invention always splits the compression instruction into VLEN / DLEN*LMUL micro-instructions. When the VLEN / DLEN ratio is large or the hardware-supported LMUL is large, the number of micro-instructions generated grows only linearly with VLEN / DLEN and LMUL, resulting in less hardware storage overhead and less blocking of subsequent instruction splitting, scheduling and execution due to micro-instruction generation.

[0089] This invention adds a preprocessing module for vcompress.vm micro-instructions to the aggregation operation array. This allows the instruction to be computed using the aggregation operation array with just one additional execution cycle. The area of the aggregation operation array is linearly related to both VLEN / DLEN and LMUL, avoiding the introduction of hardware area overhead of quadratic complexity. The instruction computation latency is LMUL*2 cycles, and the throughput is 1 / LMUL, reducing latency and increasing throughput compared to existing implementations.

[0090] It should be noted that although the steps are described in a specific order above, it does not mean that the steps must be executed in the above specific order. In fact, some of these steps can be executed concurrently, or even in a different order, as long as the required function can be achieved.

[0091] This invention can be a system, method, electronic device, computing device, computer program product and / or computer-readable medium.

[0092] Computer program products mainly refer to software products that implement various aspects of the present invention through computer programs, or hardware products that carry software that implements various aspects of the present invention.

[0093] Computer-readable storage media can be tangible devices that hold and store instructions for use by an instruction execution device. Computer-readable storage media can include, for example, but are not limited to, electrical storage devices, magnetic storage devices, optical storage devices, electromagnetic storage devices, semiconductor storage devices, or any suitable combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination thereof.

[0094] The various embodiments of the present invention have been described above. These descriptions are exemplary and not exhaustive, nor are they limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles, practical application, or technical improvements to the embodiments in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.

Claims

1. A method for executing vector compression instructions in the RVV instruction set, characterized in that it comprises: when the instruction to be processed is a compression instruction, splitting the compression instruction into multiple compression micro-instructions, extracting the partial mask vector corresponding to each compression micro-instruction from the total mask vector corresponding to the compression instruction, and extracting the source data vector corresponding to each compression micro-instruction from the total source data vector to be compressed corresponding to the compression instruction; performing preprocessing to adapt to the aggregation operation array, including: generating, based on the partial mask vector, source data vector and number corresponding to each compression micro-instruction, an index required for the aggregation operation array to operate on that compression micro-instruction, and selected data obtained by locally compressing the source data vector corresponding to the compression micro-instruction, so as to match the input requirements of the aggregation instruction; and executing the aggregation instruction on the aggregation operation array to perform aggregation operations based on the index and the selected data corresponding to each compression micro-instruction split from the compression instruction, thereby obtaining the result corresponding to the compression instruction.

2. The method according to claim 1, characterized in that the number of compression micro-instructions split from each compression instruction is determined using the following calculation: number of compression micro-instructions = VLEN / DLEN*LMUL, where VLEN indicates the length of the vector register, LMUL indicates the number of vector registers on which the compression instruction operates, and DLEN indicates the length of a data path for reading and writing a vector register.

3. The method according to claim 2, characterized in that the selected data corresponding to each compression microinstruction is obtained in the following way: selecting the elements with a mask value of 1 from the source data vector corresponding to the compression microinstruction, and arranging them in order from low to high as the first local compressed vector corresponding to the compression microinstruction; and cyclically shifting the first local compressed vector corresponding to the compression microinstruction to the left by Y elements according to the smallest data granularity to obtain the second local compressed vector, which is the selected data corresponding to the compression microinstruction, where Y represents the shift number, which is the number of mask bits with a value of 1 in the partial mask vectors corresponding to all other compression microinstructions whose numbers are less than the number of this compression microinstruction.

4. The method according to claim 3, characterized in that the index indicates the location of the element required by the aggregation instruction in the register group storing the second local compressed vector, and the index is determined as follows: index = I + M x ∑ A[x][y], with I = y MOD M, where index indicates the index of the element at overall sequence number y; I indicates the sequence number, within its corresponding compression microinstruction, of the element with overall sequence number y; MOD represents the modulo operation; M represents the number of smallest-bit-width elements in the source data vector corresponding to each compression microinstruction; the sum is taken over the compression microinstruction numbers x; and A[x][y] indicates whether sequence number y is greater than or equal to the effective mask sum corresponding to compression microinstruction 0 through compression microinstruction x: if so, A[x][y] = 1; otherwise, A[x][y] = 0.

5. A processor supporting the RVV instruction set, comprising a data sorting arithmetic unit, the arithmetic unit including: a preprocessing module for vector compression microinstructions, configured to: when the microinstruction input to the data sorting arithmetic unit is a compression microinstruction, perform preprocessing to adapt to the aggregation operation array, which includes: generating, according to the partial mask vector, source data vector and number corresponding to the compression microinstruction, the index required for the aggregation operation array to operate on each compression microinstruction, and the selected data obtained by locally compressing the source data vector corresponding to the compression microinstruction, wherein the partial mask vector comes from vector operand 1 and the source data vector comes from vector operand 2; and an aggregation operation array, configured such that: when the microinstruction input to the data sorting arithmetic unit is a compression microinstruction, it obtains the index and selected data from the preprocessing module to perform aggregation operations and obtain the result corresponding to the relevant compression instruction; when the microinstruction input to the data sorting arithmetic unit is an aggregation microinstruction, it directly performs aggregation operations based on the vector operands corresponding to the aggregation microinstruction.

6. The processor according to claim 5, characterized in that the compression micro-instructions are obtained by splitting a compression instruction, and the number of compression micro-instructions obtained from the split is LMUL × VLEN / DLEN, where VLEN denotes the length of a vector register; LMUL denotes the number of vector registers on which the compression instruction is specified to operate; and DLEN denotes the width of the data path used to read and write a vector register.
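The micro-instruction count in claim 6 can be illustrated with a small calculation. This is a minimal sketch, assuming the count equals LMUL × VLEN / DLEN, which is consistent with the stated linear growth in VLEN/DLEN and LMUL. The function name `num_compress_uops` is hypothetical, and an integer LMUL with VLEN a multiple of DLEN is assumed.

```python
def num_compress_uops(vlen: int, lmul: int, dlen: int) -> int:
    """Number of compression micro-instructions for one compression instruction.

    Assumption (not quoted from the claims): count = LMUL * VLEN / DLEN,
    with integer LMUL and VLEN an exact multiple of DLEN.
    """
    if vlen <= 0 or dlen <= 0 or lmul <= 0:
        raise ValueError("vlen, lmul and dlen must be positive")
    if vlen % dlen != 0:
        raise ValueError("vlen is assumed to be a multiple of dlen")
    return lmul * (vlen // dlen)


if __name__ == "__main__":
    # A 512-bit vector register read over a 128-bit data path, register group of 8.
    print(num_compress_uops(vlen=512, lmul=8, dlen=128))
```

Under this assumption, doubling either VLEN/DLEN or LMUL doubles the count.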

7. The processor according to claim 5 or 6, characterized in that the arithmetic unit comprises: a first data selection module, which takes vector operand 1 and the index output by the preprocessing module as inputs; and a second data selection module, which takes vector operand 2 and the selected data output by the preprocessing module as inputs; wherein, when the micro-instruction input to the data sorting arithmetic unit is a compression micro-instruction, the first data selection module outputs the index from the preprocessing module and the second data selection module outputs the selected data from the preprocessing module; and when the micro-instruction input to the data sorting arithmetic unit is an aggregation micro-instruction, the first data selection module outputs vector operand 1 and the second data selection module outputs vector operand 2.

8. The processor according to claim 5, characterized in that the preprocessing module comprises: a data local compression module, configured to select the elements whose mask value is 1 from the source data vector corresponding to the compression micro-instruction and arrange them in order from low to high as the first local compressed vector corresponding to the compression micro-instruction; and a data cyclic shift module, configured to cyclically shift the first local compressed vector corresponding to the compression micro-instruction to the left by Y elements at the smallest data granularity, to obtain the second local compressed vector as the selected data corresponding to the compression micro-instruction, where Y, the shift amount, is the number of mask values equal to 1 in the partial mask vectors corresponding to all compression micro-instructions whose numbers are less than the number of the compression micro-instruction.
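The two preprocessing steps of claim 8 can be sketched in software as follows. This is an illustrative model, not the hardware design: the function names are hypothetical, unused lanes are filled with `None`, and "left" is assumed to mean a shift toward higher element indices, wrapping around at the segment length.

```python
def local_compress(src, mask):
    """First local compressed vector: elements with mask 1, packed low to high.

    Unused upper lanes are padded with None (a modelling choice).
    """
    picked = [x for x, m in zip(src, mask) if m]
    return picked + [None] * (len(src) - len(picked))


def rotate_left(vec, y):
    """Cyclic shift by y elements toward higher indices (assumed meaning of 'left')."""
    n = len(vec)
    return [vec[(i - y) % n] for i in range(n)]


def preprocess_selected_data(src, mask, prior_ones):
    """Second local compressed vector for one compression micro-instruction.

    prior_ones is Y: the number of mask bits equal to 1 in the partial mask
    vectors of all lower-numbered compression micro-instructions.
    """
    return rotate_left(local_compress(src, mask), prior_ones)


if __name__ == "__main__":
    # Segment of 4 elements; 3 elements were already selected by earlier segments.
    print(preprocess_selected_data([10, 11, 12, 13], [0, 1, 0, 1], prior_ones=3))
```

With this convention, an element destined for overall output position p ends up in lane p mod E of its segment, where E is the number of elements per segment. Here 11 (output position 3) lands in lane 3 and 13 (output position 4) wraps to lane 0.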

9. The processor according to claim 8, characterized in that the preprocessing module further comprises: a mask segment prefix summation module, configured to count the elements with a mask value of 1 in the partial mask vector corresponding to each numbered compression micro-instruction, thereby obtaining the partial sum of that partial mask vector, and to compute in sequence the prefix sum corresponding to compression micro-instructions 0 to […] as the effective mask sum, the effective mask sum being the sum of the partial sums of the partial mask vectors corresponding to compression micro-instructions 0 to […]; a prefix sum storage unit, configured to store, for each compression micro-instruction, the effective mask sum corresponding to compression micro-instructions 0 to […], where […]; and an index generation module, configured to determine the index based on the overall sequence number, the number of the compression micro-instruction, and the effective mask sums corresponding to compression micro-instruction 0 through the respective compression micro-instruction.
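The interaction of the prefix sums, the rotated segments, and the index can be checked with a small end-to-end model. This is a sketch under stated assumptions, not the claimed index formula, which is not reproduced in this text. The aggregation step is modelled as a gather (indexed read) over the concatenated second local compressed vectors. The index rule used here, segment number times E plus p mod E, is derived from the rotation convention (shift toward higher indices), and the segment number is found by comparing the output position p against the effective mask sums. All names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def compress_via_gather(src, mask, elems_per_uop):
    """Model a vector compress as per-segment preprocessing followed by a gather.

    Returns (effective_mask_sums, compressed_elements).
    """
    E = elems_per_uop
    assert len(src) == len(mask) and len(src) % E == 0
    n_uops = len(src) // E

    # Partial sum: number of mask bits equal to 1 in each segment.
    partial = [sum(mask[k * E:(k + 1) * E]) for k in range(n_uops)]
    # Effective mask sum for segment k: inclusive prefix sum over segments 0..k.
    eff = list(accumulate(partial))

    regs = []  # concatenated second local compressed vectors
    for k in range(n_uops):
        seg_src = src[k * E:(k + 1) * E]
        seg_mask = mask[k * E:(k + 1) * E]
        y = eff[k] - partial[k]  # ones in all lower-numbered segments
        picked = [x for x, m in zip(seg_src, seg_mask) if m]
        first = picked + [None] * (E - len(picked))
        regs.extend(first[(i - y) % E] for i in range(E))  # cyclic shift by y

    total = eff[-1] if eff else 0
    out = []
    for p in range(total):
        k = bisect_right(eff, p)        # first segment whose prefix sum exceeds p
        out.append(regs[k * E + p % E])  # gather index into the register group
    return eff, out


if __name__ == "__main__":
    eff, out = compress_via_gather(list(range(8)), [1, 0, 1, 1, 0, 1, 1, 0], 4)
    print(eff, out)
```

The comparison of p against the stored effective mask sums plays the role that the claims assign to the prefix sum storage unit and the index generation module. Output positions at or beyond the last effective mask sum are not produced by this model.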

10. The processor according to claim 9, characterized in that the index is determined as follows: […], where […] denotes the index of the element at overall sequence number […]; […] denotes the sequence number of the element at overall sequence number […] within its corresponding compression micro-instruction; […] denotes the modulo operation; […] denotes the number of smallest-bit-width elements in the source data vector corresponding to each compression micro-instruction; […] − 1 denotes the number of the compression micro-instruction to which the sequence number currently being calculated belongs; and […] indicates whether sequence number […] is greater than or equal to the effective mask sum corresponding to compression micro-instructions 0 to […]: if so, […]; otherwise, […].