Instruction processing method and apparatus, storage medium, and electronic device
By identifying and updating vector parameters in the RISC-V processor and breaking down long vector instructions into micro-operations, the problems of complex vector instruction execution circuitry and excessive power consumption are solved, achieving more efficient processor performance and resource utilization.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SANECHIPS TECH CO LTD
- Filing Date
- 2023-07-31
- Publication Date
- 2026-06-16
AI Technical Summary
The execution circuitry for vector instructions in existing RISC-V processors is too complex, resulting in excessive area and power consumption, which leads to low processing efficiency.
By identifying the first type of instruction to be processed in the preset instruction encoding, updating the vector parameters, and breaking down long vector instructions into multiple micro-operations, the area and power consumption of the renaming and instruction issuing units are reduced by using fine-grained micro-operations.
It reduces the design complexity of the execution circuit, optimizes the processing frequency, improves the user experience, and achieves a better balance between performance, power consumption, and area.
Figure CN119473397B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of RISC-V high-performance out-of-order processor microarchitecture, and more specifically, to an instruction processing method, apparatus, storage medium, and electronic device.
Background Technology
[0002] RISC-V is an open reduced instruction set architecture. It follows the RISC design philosophy and draws on the experience and lessons of decades of computer architecture development. The result is a compact instruction set architecture with modular standard extension subsets such as M / A / F / D / C / B / P / V, which can be combined for different application areas.
[0003] The current RISC-V standard Vector extension includes vector instructions such as addition, subtraction, multiplication, multiply-accumulate, bitwise logical AND, OR, and XOR, comparison, shift, min / max, permutation, and reduction. The instruction length is fixed at 32 bits, and the extension provides 32 general-purpose vector registers, a condition mask register (predicate), and related control and status registers (CSRs). It supports conditional instruction execution and multiple memory access modes, including unit-stride, strided, and indexed scatter / gather. Notably, it supports dynamic typing: the vector length (grouping of vector registers) and the bit width of a single element (SEW) are set via `vsetvl`, `vsetvli`, and `vsetivli`, which saves encoding space, supports mixed precision, and maintains software compatibility across different vector lengths. The same three instructions configure the grouping parameter of the vector registers, allowing up to eight vector registers to be combined into one larger vector register, so that the operable register width is up to eight times the width of the implemented hardware registers. This feature makes programs more concise and significantly reduces the amount of code executed. The CSR VTYPE (vector type, i.e., the vector parameter) stores the vector register grouping parameter LMUL (vector register group multiplier). After a vsetvl / vsetvli / vsetivli instruction configures this parameter, subsequent vector instructions group the vector registers according to it, and the processing width of a single vector instruction is based on this group length, until a new vsetvl / vsetvli / vsetivli instruction updates the parameter. After the update, all vector instructions younger than that vsetvl / vsetvli / vsetivli instruction are executed according to the latest configuration value.
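As an illustrative sketch only, the relationship between a vtype value, LMUL, and SEW described above can be expressed as follows. The field positions (vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7) follow the RISC-V Vector extension v1.0; the function names `decode_vtype` and `vlmax` and the `vlen_bits` parameter are assumptions introduced here, and the `vill` bit is not modeled.

```python
from fractions import Fraction

# vlmul[2:0] encoding in the RISC-V Vector extension v1.0; 0b100 is reserved.
LMUL_BY_VLMUL = {
    0b000: Fraction(1), 0b001: Fraction(2), 0b010: Fraction(4), 0b011: Fraction(8),
    0b101: Fraction(1, 8), 0b110: Fraction(1, 4), 0b111: Fraction(1, 2),
}


def decode_vtype(vtype: int) -> dict:
    """Split the low 8 bits of a vtype value into its fields (hypothetical helper)."""
    vlmul = vtype & 0b111          # bits 2:0, register group multiplier
    vsew = (vtype >> 3) & 0b111    # bits 5:3, selected element width
    if vlmul not in LMUL_BY_VLMUL or vsew > 0b011:
        raise ValueError("reserved vtype encoding")
    return {
        "lmul": LMUL_BY_VLMUL[vlmul],
        "sew": 8 << vsew,          # 8, 16, 32, or 64 bits
        "vta": (vtype >> 6) & 1,   # tail-agnostic flag
        "vma": (vtype >> 7) & 1,   # mask-agnostic flag
    }


def vlmax(vlen_bits: int, vtype: int) -> int:
    """Maximum elements per instruction: VLMAX = LMUL * VLEN / SEW."""
    fields = decode_vtype(vtype)
    return int(vlen_bits * fields["lmul"] // fields["sew"])
```

For instance, with an assumed hardware register width of 128 bits, the configuration `e16,m8` (vtype `0b001011`) lets one instruction operate on 64 elements, eight times what a single register holds.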
[0004] In related technologies, the following methods are used. Method 1: The pipeline of a high-performance processor core is generally divided into multiple processing stages. The instruction cache (a high-speed buffer memory) caches the instructions to be executed. The instruction fetch stage reads the instructions to be executed from the instruction cache and sends them to the instruction decoder for decoding. Figure 1 is a schematic diagram of an instruction processing flow in related technologies. When the instruction decoder recognizes that the current instruction is a `vsetvl` instruction that updates the vector parameter register, it sets the vector pause flag signal to 1, indicating that subsequent vector instructions must wait for the `vsetvl` instruction to complete before proceeding to the renaming stage. Method 2: Figure 2 is a schematic diagram of another instruction processing flow in related technologies. After instruction decoding, a vector register group value history table is added to predict the register group value that the vsetvl instruction will write.
[0005] However, the first method is prone to pipeline stalls, because the vector register grouping value (LMUL) or the bit width of the vector element (SEW) can only be obtained after the vsetvl instruction has been executed. In addition, renaming at instruction granularity in the renaming stage is complicated to design. The second method alleviates the stall problem of the first method, but it shares another problem with it: a single vector instruction that supports register grouping must rename far more registers than an ordinary instruction. This requires occupying multiple rename map read / write ports or arbitrating among them, which complicates the design and results in high power consumption and area costs.
[0006] Regarding the problems of excessive complexity of the execution unit circuitry in the processor during vector instruction processing, resulting in excessive area and power consumption, no effective solution has yet been found.
[0007] Summary of the Application
[0008] This application provides an instruction processing method, apparatus, storage medium, and electronic device to at least solve the problems of excessive complexity of the execution unit circuit in the processor and excessive area and power consumption in vector instructions in related technologies.
[0009] According to one embodiment of this application, a method for processing instructions is provided, comprising: determining a first type of instruction to be processed from a plurality of instructions to be processed according to a preset instruction code, wherein the first type of instruction to be processed is used to update a first vector parameter in a speculative value register; updating the first vector parameter to a second vector parameter using an update method corresponding to the first type of instruction to be processed; splitting long vector instructions in other instructions to be processed into a plurality of micro-operations based on the second vector parameter, and renaming the other instructions to be processed according to the plurality of micro-operations, wherein the other instructions to be processed are instructions other than the first type of instruction to be processed among the plurality of instructions to be processed.
[0010] According to another embodiment of this application, an instruction processing apparatus is provided, comprising: a determining module, configured to determine a first type of pending instructions from a plurality of pending instructions according to a preset instruction code, wherein the first type of pending instructions is used to update a first vector parameter in a speculative value register; an updating module, configured to update the first vector parameter to a second vector parameter using an update method corresponding to the first type of pending instructions; and a processing module, configured to split long vector instructions in other pending instructions into a plurality of micro-operations based on the second vector parameter, and to rename the other pending instructions according to the plurality of micro-operations, wherein the other pending instructions are instructions other than the first type of pending instructions among the plurality of pending instructions.
[0011] According to yet another embodiment of this application, a computer-readable storage medium is also provided, wherein a computer program is stored therein, and the computer program is configured to perform the steps in any of the above method embodiments when it is run.
[0012] According to yet another embodiment of this application, an electronic device is also provided, which may include a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the steps in any of the above method embodiments.
[0013] Through the above steps, in the system architecture corresponding to the CPU core, the first type of instruction to be processed is identified using a preset instruction code. The second vector parameter is then obtained by speculating between instruction decoders. Based on the register vector grouping value contained in the second vector parameter, the instruction decomposition subunit configured in the instruction decoding unit instructs the long vector instruction in other instructions to be processed to be decomposed into multiple micro-operations. This fine-grained micro-operation reduces the area and power consumption overhead of renaming and instruction issuing units in processing long vectors. It also reduces the complexity and power consumption overhead of further micro-operation decomposition of long vector instructions in the execution unit. In other words, by pre-determining the vector parameter and using it to decompose the long vector before execution, the problem of excessive complexity, area, and power consumption in the processor's execution unit circuitry for vector instructions in related technologies is solved. This achieves the effect of reducing the design complexity of the execution unit circuitry and realizing a better processing frequency within a limited area, thus improving the user experience.
Attached Figure Description
[0014] Figure 1 is a schematic diagram of an instruction processing flow in related technologies;
[0015] Figure 2 is a schematic diagram of another instruction processing flow in related technologies;
[0016] Figure 3 is a hardware structure block diagram of a chip for an instruction processing method according to an embodiment of this application;
[0017] Figure 4 is a flowchart illustrating a method for processing instructions according to an embodiment of this application;
[0018] Figure 5 is a schematic diagram of the system architecture in which the instruction processing method is applied in a CPU chip according to an embodiment of this application;
[0019] Figure 6 is a schematic diagram of the encoding of the RISC-V vector extension instruction vset{i}vl{i} according to an embodiment of this application;
[0020] Figure 7 is a schematic diagram illustrating the configuration of the vtype_sp update source in the instruction processing method according to an embodiment of this application;
[0021] Figure 8 is a schematic diagram of the vector register grouping configuration according to an embodiment of this application;
[0022] Figure 9 is a schematic diagram illustrating two different computing resource configurations according to embodiments of this application;
[0023] Figure 10 is a schematic diagram of the structure of an instruction processing apparatus according to an embodiment of this application;
[0024] Figure 11 is a schematic diagram of the structure of an electronic device according to an embodiment of this application.
Detailed Implementation
[0025] The present application will be described in detail below with reference to the accompanying drawings and embodiments. It should be noted that, unless otherwise specified, the embodiments and features described in the embodiments of the present application can be combined with each other.
[0026] It should be noted that the terms "first," "second," etc., in the specification, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
[0027] Embodiment 1
[0028] The method embodiment provided in Embodiment 1 of this application can be executed in a chip, a computer terminal, or a similar computing device. Taking execution on a chip as an example, Figure 3 is a hardware structure block diagram of a chip for an instruction processing method according to an embodiment of this application. As shown in Figure 3, chip 10 may include one or more processors 102 (only one is shown in the figure; a processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 104 for storing data, and a transmission device 106 for communication functions. Those skilled in the art will understand that the structure shown in Figure 3 is for illustrative purposes only and does not limit the structure of the electronic device described above. For example, chip 10 may include more or fewer components than shown in Figure 3, or have a configuration different from that shown in Figure 3.
[0029] The memory 104 can be used to store software programs and modules of application software, such as program instructions / modules corresponding to the instruction processing method in the embodiments of this application. The processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, thereby implementing the above-described method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory remotely located relative to the processor 102, and these remote memories can be connected to the chip 10 via a network. Examples of the above-described networks may include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
[0030] The transmission device 106 is used to receive or send data via a network. Specific examples of the network described above may include a wireless network provided by the communication vendor of chip 10. In one example, the transmission device 106 may include a Network Interface Controller (NIC), which can connect to other network devices via a base station to communicate with the Internet. In another example, the transmission device 106 may be a Radio Frequency (RF) module, used for wireless communication with the Internet.
[0031] Optionally, in practical applications, the above instruction processing method can be applied to CPU processors, RISC-V CPU core IPs, or SoC chips that integrate RISC-V CPU cores. This application does not impose any further limitations on this.
[0032] This embodiment provides a method for processing instructions. Figure 4 is a flowchart of the instruction processing method according to an embodiment of this application. As shown in Figure 4, the process may include, but is not limited to, the following steps:
[0033] Step S202: Determine a first type of instruction to be processed from a plurality of instructions to be processed according to a preset instruction code, wherein the first type of instruction to be processed is used to update the first vector parameter in the speculative value register;
[0034] In an optional embodiment, the aforementioned preset instruction encoding can be that of a vector extension instruction determined from the 32-bit instruction encoding of RISC-V, such as the vset{i}vl{i} instruction encoding. The vset{i}vl{i} encodings include vsetvl, vsetvli, and vsetivli. Here, vsetvl denotes instructions that must read registers to obtain the configuration value (including but not limited to vsetvl), while the immediate forms denote instructions that do not need to read registers and obtain the parameter configuration value from the instruction encoding itself (the immediate value) (including but not limited to vsetvli and vsetivli). Thus, when an instruction with a vset{i}vl{i} encoding is identified among the instructions to be processed, the parameters used to update the vector parameter generation unit can be quickly determined from the content of the instruction encoding.
[0035] Step S204: Update the first vector parameter to the second vector parameter using the update method corresponding to the first type of instruction to be processed;
[0036] It should be noted that after the vector parameter held in the speculative value register is updated by a vsetvl / vsetvli / vsetivli instruction, subsequent vector instructions group the vector registers according to the second vector parameter. The processing width of a single vector instruction is also based on this group length, until a new vsetvl / vsetvli / vsetivli instruction updates the second vector parameter again. After the update, all vector instructions younger than that vsetvl / vsetvli / vsetivli instruction are executed according to the latest vector parameter configuration value in the speculative value register.
[0037] Step S206: Based on the second vector parameter, the long vector instruction in the other instructions to be processed is split into multiple micro-operations, and the other instructions to be processed are renamed according to the multiple micro-operations, wherein the other instructions to be processed are instructions other than the first type of instructions to be processed among the multiple instructions to be processed.
[0038] Through the above steps, in the system architecture corresponding to the CPU core, the first type of instruction to be processed is identified using a preset instruction code. The second vector parameter is then obtained by speculating between instruction decoders. Based on the register vector grouping value contained in the second vector parameter, the instruction decomposition subunit configured in the instruction decoding unit instructs the long vector instruction in other instructions to be processed to be decomposed into multiple micro-operations. This fine-grained micro-operation reduces the area and power consumption overhead of renaming and instruction issuing units in processing long vectors. It also reduces the complexity and power consumption overhead of further micro-operation decomposition of long vector instructions in the execution unit. In other words, by pre-determining the vector parameter and using it to decompose the long vector before execution, the problem of excessive complexity, area, and power consumption in the processor's execution unit circuitry for vector instructions in related technologies is solved. This achieves the effect of reducing the design complexity of the execution unit circuitry and realizing a better processing frequency within a limited area, thus improving the user experience.
[0039] Understandably, the presence of long vectors in vector instructions leads to excessive complexity in the renaming circuit (e.g., the source registers of a long vector instruction require lookups of multiple registers, and the destination register requires allocation of multiple physical registers), the instruction issue circuit (e.g., the issue queue needs to store multiple register IDs and perform wake-up operations on multiple registers), and the instruction execution circuit (e.g., the long vector circuit needs to decompose the instruction into multiple short vector micro-operations, requiring a stall to pause the pipeline). This results in high area and power consumption overhead, hinders the fast execution of vector instructions, and significantly reduces overall instruction processing efficiency. With the method described above, however, long vectors are decomposed at the instruction decoding unit. After decomposition, the circuits in the later stages of the pipeline, such as the renaming circuit, the issue circuit, and the instruction execution circuit, no longer need to be aware of the vector register grouping configuration LMUL or of changes in the vector element bit width SEW. This removes the decomposition logic they would otherwise require, thereby reducing circuit complexity in the later stages of the pipeline and optimizing area and power consumption.
[0040] As an optional implementation, the above method can also be applied to high-performance vector computing scenarios, such as multimedia (audio, video, images, etc.) data processing, digital signal processing, scientific computing, AI, and other computationally intensive fields, as well as high-performance CPU applications such as scalar-to-vector optimization in compilers. In practice, by speculatively obtaining the vector parameters and using them to split long vector instructions and resource-constrained vector instructions, the hardware resource overhead (area, power consumption, etc.) caused by long vector instructions in the pipeline can be reduced, achieving a better PPA (performance, power, area) balance.
[0041] In an exemplary embodiment, when the above instruction processing method is applied to a CPU chip, the corresponding system architecture is augmented with the following: a vector parameter generation unit; a vector instruction decomposition subunit added to the instruction decoding unit; storage for vtype_sp (the vector parameter speculative value) added to the issue queue of the instruction issuing unit; and a vtype (vector parameter) speculative correction subunit added to the vsetvl instruction execution unit.
[0042] Optionally, Figure 5 is a schematic diagram of a system architecture in which the instruction processing method is applied in a CPU chip according to an embodiment of this application, including:
[0043] Instruction cache unit 32: used to cache instructions fetched from the next lower cache level. When instructions are executed in a loop or repeatedly, they can be fetched from the instruction cache (a high-speed buffer memory) without going to the lower cache level, which reduces instruction fetch latency and improves performance.
[0044] Instruction fetch processing unit 34: used to query the instruction cache unit for the instructions to be executed according to the fetch request. This includes translating virtual addresses to physical addresses and checking whether the lookup hits; on a miss, a request is issued to the next cache level. After the instructions are retrieved, alignment and other processing are performed.
[0045] Instruction pre-decoding unit 36: used to identify the three instructions vsetvl, vsetvli, and vsetivli based on the 32-bit instruction encoding of RISC-V, and to notify the vector parameter generation unit of the identification result.
[0046] Optionally, the encoding of the three instructions vsetvl, vsetvli, and vsetivli is shown in Figure 6, which is a schematic diagram of the encoding of the RISC-V vector extension instruction vset{i}vl{i} according to an embodiment of this application.
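As a minimal sketch of the pre-decoding check, the three instructions can be told apart from the 32-bit instruction word alone. The bit layout used below is taken from the RISC-V Vector extension v1.0 (major opcode 0x57, funct3 0b111, with the top bits selecting the variant), not from Figure 6 itself; the function name `classify_vset` and its return convention are assumptions made for illustration.

```python
OP_V = 0x57          # major opcode used by vector arithmetic and configuration instructions
FUNCT3_CFG = 0b111   # funct3 value that selects the vset{i}vl{i} group


def classify_vset(insn: int):
    """Return (mnemonic, vtype_immediate_or_None), or None if insn is not a vset{i}vl{i}."""
    insn &= 0xFFFFFFFF
    if (insn & 0x7F) != OP_V or ((insn >> 12) & 0b111) != FUNCT3_CFG:
        return None
    if (insn >> 31) == 0:
        # vsetvli: bit 31 = 0, vtype immediate zimm[10:0] in bits 30:20
        return ("vsetvli", (insn >> 20) & 0x7FF)
    if (insn >> 30) == 0b11:
        # vsetivli: bits 31:30 = 11, vtype immediate zimm[9:0] in bits 29:20
        return ("vsetivli", (insn >> 20) & 0x3FF)
    if ((insn >> 25) & 0x3F) == 0:
        # vsetvl: bit 31 = 1, bits 30:25 = 0; vtype is read from register rs2
        return ("vsetvl", None)
    return None  # remaining encodings are reserved
```

The immediate forms return the vtype value directly, which is what allows vtype_sp to be updated without waiting for execution; for vsetvl the sketch returns `None` because the value is only known once the source register has been read.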
[0047] Vector parameter generation unit 38: used to determine the vector parameters under which subsequent vector instructions are processed. It mainly determines, in one of the ways described below, the value to be saved in the vector parameter speculative value register.
[0048] Specifically, when the current instruction identified by the instruction pre-decoding unit 36 is a vsetvli or vsetivli instruction, the immediate value to be written to the vtype register is parsed from the 32-bit encoding of that instruction, and the vector parameter in the vtype_sp register is updated with the immediate value.
[0049] If the current instruction is a vsetvl instruction, characteristic information of the current thread (such as the ASID and VMID) is used as matching information to search the vector parameter prediction table for matching prediction information. If a match is found, the predicted value of vtype is obtained from the prediction table and the vtype_sp register is updated with it. If no match is found, the original value can be kept, or the register can be updated to a common preset configuration such as LMUL=1.
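The prediction-table lookup for vsetvl can be sketched as follows. The class name `VtypePredictor`, the use of a dictionary keyed by (ASID, VMID), and the `train` method are assumptions made for illustration; the source does not specify how the table is organized or populated.

```python
class VtypePredictor:
    """Hypothetical vector parameter prediction table keyed by (ASID, VMID)."""

    def __init__(self, default_vtype=None):
        self.table = {}
        # None means "keep the current vtype_sp on a miss"; otherwise use this
        # preset value (e.g. the vtype encoding that corresponds to LMUL=1).
        self.default_vtype = default_vtype

    def predict(self, asid: int, vmid: int, current_vtype_sp: int) -> int:
        """Return the value to write into vtype_sp for a vsetvl instruction."""
        hit = self.table.get((asid, vmid))
        if hit is not None:
            return hit
        if self.default_vtype is not None:
            return self.default_vtype
        return current_vtype_sp

    def train(self, asid: int, vmid: int, actual_vtype: int) -> None:
        """Assumed training step: record the vtype that vsetvl actually produced."""
        self.table[(asid, vmid)] = actual_vtype
```

A real table would be bounded in size and would need a replacement policy; both are omitted here.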
[0050] It should be noted that the vector parameter in the vtype_sp register is a speculative value of the vtype register, because it may be updated by vsetvli / vsetivli instructions on a mispredicted branch path, or by a predicted value for a vsetvl instruction.
[0051] If the speculation correction subunit in the execution unit detects a speculation failure for a vsetvl instruction, the vtype_sp register is updated with the execution result produced by the execution unit.
[0052] Optionally, if an interrupt, an exception, or another pipeline flush based on the oldest instruction boundary occurs in the processing unit, the value of the vtype register (the committed value) must be copied to the vtype_sp register.
[0053] The instruction decoding unit 40 is responsible for the decoding of all instructions. When the current instruction is identified as a vector instruction, the vector instruction decomposition subunit is used to decompose the subsequent vector instructions according to the LMUL value carried from the vtype_sp register of the vector parameter generation unit.
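One possible form of the decomposition is sketched below for the simplest case: an element-wise instruction with an integer LMUL, where micro-operation i works on the i-th register of each register group. The names `MicroOp` and `split_vector_instruction` are hypothetical, and widening, narrowing, reduction, masked, and fractional-LMUL instructions are deliberately not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    opcode: str
    vd: int    # destination vector register number
    vs2: int   # first source vector register number
    vs1: int   # second source vector register number


def split_vector_instruction(opcode, vd, vs2, vs1, lmul):
    """Split one long vector instruction into `lmul` single-register micro-operations."""
    if lmul not in (1, 2, 4, 8):
        raise ValueError("this sketch only handles integer LMUL")
    for reg in (vd, vs2, vs1):
        # A register group must start at a register number that is a multiple of LMUL.
        if not 0 <= reg < 32 or reg % lmul != 0:
            raise ValueError(f"v{reg} is not a valid base register for LMUL={lmul}")
    return [MicroOp(opcode, vd + i, vs2 + i, vs1 + i) for i in range(lmul)]
```

With LMUL=4, `split_vector_instruction("vadd.vv", 8, 16, 24, 4)` yields four micro-operations, the last of which reads v19 and v27 and writes v11. Each micro-operation then needs only the rename ports of an ordinary three-operand instruction.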
[0054] Rename Unit 42: Used to rename the split vector instructions.
[0055] It should be noted that the purpose of register renaming is to eliminate two types of false data hazards: write-after-write (WAW) and write-after-read (WAR). This reduces pipeline stalls caused by these hazards, improves ILP (instruction-level parallelism), and thus achieves better performance. High-performance processors usually support renaming the registers of multiple instructions in a single clock cycle. The renaming unit generally supports renaming scalar registers (general-purpose registers, GPRs), vector registers, and CSRs (control and status registers). It is usually equipped with a mapping table from logical registers to physical registers that records the mapping between them.
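The logical-to-physical mapping table can be illustrated with a minimal sketch. The class `RenameMap`, the register counts, and the free-list policy are assumptions chosen for brevity; releasing physical registers at commit and checkpoint recovery are omitted.

```python
class RenameMap:
    """Hypothetical rename table: logical register number -> physical register number."""

    def __init__(self, num_logical=32, num_physical=64):
        self.map = list(range(num_logical))                 # identity mapping at reset
        self.free = list(range(num_logical, num_physical))  # unallocated physical registers

    def rename(self, dst, srcs):
        """Rename one (micro-)operation; returns (physical_dst, physical_srcs)."""
        # Sources are looked up before the destination is remapped, so an
        # instruction that reads and writes the same logical register still
        # reads the older physical register.
        phys_srcs = [self.map[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register; rename must stall")
        phys_dst = self.free.pop(0)
        self.map[dst] = phys_dst
        return phys_dst, phys_srcs
```

Because every write receives a fresh physical register, two writes to the same logical register no longer conflict. After decomposition, each micro-operation passes through `rename` with a single destination, which is the property the method relies on to keep the number of map ports small.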
[0056] Instruction issuing unit 44: It is used to process the decomposed instructions according to the granularity of the vector instructions. Under normal circumstances, the issuing queue needs to store the register ID of the instruction to check whether the instruction operands in the issuing queue are ready. The instruction can only be issued after all operands are ready. If they are not ready, the register ID is also used to match and wake up the instruction. The instruction can only be issued normally after being woken up.
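The wake-up behavior described above can be sketched as follows. The class `IssueQueue` and its method names are hypothetical; selection priority, queue capacity, and the stored vtype_sp field are not modeled.

```python
class IssueQueue:
    """Hypothetical issue queue that tracks which source registers are still pending."""

    def __init__(self):
        self.entries = []

    def insert(self, name, src_regs, ready_regs):
        # Only operands that are not yet ready need to be remembered for wake-up.
        self.entries.append({"name": name, "waiting": set(src_regs) - set(ready_regs)})

    def wakeup(self, reg):
        """Broadcast that physical register `reg` has been produced."""
        for entry in self.entries:
            entry["waiting"].discard(reg)

    def issue_ready(self):
        """Remove and return, in queue order, every entry whose operands are all ready."""
        ready = [e["name"] for e in self.entries if not e["waiting"]]
        self.entries = [e for e in self.entries if e["waiting"]]
        return ready
```

An entry for an undivided LMUL=8 instruction would have to track up to eight times as many register IDs per operand, whereas an entry for a micro-operation tracks the same number as an ordinary instruction.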
[0057] In practical applications, the vsetvl instruction obtains the speculative vector parameter value, i.e., the vtype_sp value, from the vector parameter generation unit and carries this value to each downstream processing unit. In the instruction issuance unit, each type of instruction (e.g., integer instruction, floating-point instruction, memory access instruction, etc.) has one or more issuance queues. The issuance queue that stores the vsetvl instruction must store the vtype_sp value carried by vsetvl. After the instruction is issued, the vtype_sp value must also be passed to the vsetvl execution unit for vtype speculative correction judgment.
[0058] Instruction execution unit 46: responsible for executing the `vsetvl` instruction. The key to `vsetvl` execution is obtaining the configuration value of the vector parameter `vtype` from the general-purpose register and updating the `vtype` register with this value in the instruction commit phase. Another crucial function is correcting the speculative value of `vtype`. The `vtype` configuration value in the execution result of `vsetvl` is compared with the `vtype_sp` value passed from the instruction issuing unit. If they are equal, the speculation succeeded; if they are not equal, the speculation failed. On failure, a pipeline flush (`flushPipeline`) must be initiated to invalidate the instructions in the pipeline, and the value of the `vtype_sp` register in the vector parameter generation unit must be corrected. Consequently, once the instructions in the pipeline have been invalidated, no instruction younger than the `vsetvl` instruction uses the failed speculative value of `vtype_sp` for instruction splitting; younger instructions use the corrected value instead.
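The correction logic, together with the restore on a flush at the oldest instruction boundary, can be sketched as a small state holder. The class `VtypeState` and its method names are assumptions introduced for illustration; the flush itself is represented only by a returned flag.

```python
class VtypeState:
    """Hypothetical pairing of the committed vtype and its speculative copy vtype_sp."""

    def __init__(self, committed_vtype: int):
        self.vtype = committed_vtype      # architectural value, updated at commit
        self.vtype_sp = committed_vtype   # speculative value used for instruction splitting

    def resolve_vsetvl(self, executed_vtype: int, carried_vtype_sp: int) -> bool:
        """Compare the executed result with the carried speculation.

        Returns True when the speculation failed and the pipeline must be flushed.
        """
        if executed_vtype == carried_vtype_sp:
            return False
        self.vtype_sp = executed_vtype    # correct the speculative register
        return True

    def commit_vsetvl(self, executed_vtype: int) -> None:
        self.vtype = executed_vtype

    def flush_at_oldest(self) -> None:
        """Interrupt or exception: fall back to the committed value."""
        self.vtype_sp = self.vtype
```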
[0059] In an exemplary embodiment, before updating the first vector parameter to the second vector parameter using the update method corresponding to the first type of pending instruction, the method further includes: determining the number of instructions of the first type of pending instructions among the plurality of pending instructions; if the number of instructions is greater than a preset number, determining the update time of each instruction of the plurality of first type of pending instructions to update the first vector parameter, thereby obtaining a plurality of update times; determining a plurality of update intervals based on the time relationship of the plurality of update times, wherein the target other pending instructions processed within each update interval use the target second vector parameter most recently updated in the update interval.
[0060] In simple terms, when there are multiple first-type pending instructions updating the first vector parameter in the speculative value register, the time at which each of them updates the value currently stored in the speculative value register must be identified, so that out-of-order parameter updates do not cause subsequent long vector instructions to be split incorrectly. Then, when a long vector instruction is split, the most recent vector parameter update that precedes it within the current clock cycle is used.
[0061] In an exemplary embodiment, the vector parameter generation unit can process multiple vtype vector parameter updates in the same clock cycle. For example, if two vsetvli instructions update the vtype vector parameter in the same clock cycle, the value from the younger of the two is written to the vtype_sp vector parameter speculative value register. In addition, the order between the vector instructions that use the vtype value and the values updated by vsetvli must be preserved, so that each vector instruction obtains the appropriate vtype_sp.
[0062] For example, consider the following sequences of four instructions per clock cycle:
[0063] / / cycle-1 clock cycle 1
[0064] vsetvli t0,a0,e8,m2 / / LMUL=2
[0065] vadd.vv vd,vs2,vs1,vm
[0066] vsetvli t0,a0,e16,m8 / / LMUL=8
[0067] vsub.vv vd,vs2,vs1,vm
[0068] / / cycle-2 clock cycle 2
[0069] vmul.vv vd,vs2,vs1
[0070] vadd.vv vd,vs2,vs1,vm
[0071] vsetvli t0,a0,e16,m4 // LMUL=4
[0072] vsub.vv vd,vs2,vs1,vm
[0073] The instructions are listed from oldest (top) to newest (bottom). Therefore, within clock cycle 1, the `vadd` instruction should be split based on LMUL=2: it carries LMUL=2 to the decoder, while the `vsub` instruction carries LMUL=8 to the decoder for instruction splitting. At the end of clock cycle 1, the immediate value of the second `vsetvli` instruction (LMUL=8) should be written to the `vtype_sp` register. Then, within clock cycle 2, the `vmul` and `vadd` instructions should carry the updated value of `vtype_sp` (LMUL=8) to the decoder, while the last instruction, `vsub`, should carry the value to be written by the third `vsetvli` instruction (LMUL=4) to the decoder for instruction splitting.
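The ordering rule in this example can be sketched in a few lines of Python. This is a minimal illustration, not the circuit described in the application: the `Instr` class and the `process_cycle` function are hypothetical names, only LMUL is tracked, and the initial `vtype_sp` value of 1 is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instr:
    name: str
    lmul: Optional[int] = None  # set only for vsetvli-style instructions (hypothetical model)


def process_cycle(vtype_sp_lmul: int, group: List[Instr]) -> Tuple[List[Tuple[str, int]], int]:
    """Return the LMUL each vector instruction carries to the decoder, plus the new vtype_sp LMUL."""
    current = vtype_sp_lmul
    carried = []
    for ins in group:  # group is ordered oldest first
        if ins.lmul is not None:
            current = ins.lmul  # younger instructions in the same cycle see this value
        else:
            carried.append((ins.name, current))
    return carried, current  # `current` is written back to vtype_sp at the end of the cycle


cycle1 = [Instr("vsetvli", 2), Instr("vadd.vv"), Instr("vsetvli", 8), Instr("vsub.vv")]
cycle2 = [Instr("vmul.vv"), Instr("vadd.vv"), Instr("vsetvli", 4), Instr("vsub.vv")]

carried1, sp_after_1 = process_cycle(1, cycle1)  # assumed initial vtype_sp LMUL of 1
carried2, sp_after_2 = process_cycle(sp_after_1, cycle2)
```

Each vector instruction takes the value from the nearest older `vsetvli` in its own cycle, and falls back to `vtype_sp` when there is none.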
[0074] In an exemplary embodiment, updating the first vector parameter to the second vector parameter using the update method corresponding to the first type of instruction to be processed includes: when the first type of instruction to be processed is an immediate instruction, updating the first vector parameter to the second vector parameter according to the encoding information of the immediate instruction; when the first type of instruction to be processed is a non-immediate instruction, updating the first vector parameter to the second vector parameter through the predicted value associated with the non-immediate instruction.
[0075] In other words, if the instruction to be processed is identified as a vsetivli / vsetvli instruction (immediate value) among the vset{i}vl{i} extended instructions, the configuration value can be obtained from the immediate value encoded in the instruction itself, and the current vtype speculative value (the value of the vtype_sp register) can be updated with this configuration value. All subsequent vector instructions will carry this vtype speculative value to the instruction decoding unit until a new vset{i}vl{i} instruction updates the vtype speculative value, or the instruction commit part repairs the vtype speculative value, or the current value stored in the vtype_sp register is corrected after vector parameter speculation fails.
[0076] If the instruction to be processed is identified as the vsetvl instruction (non-immediate value) under the vset{i}vl{i} extended instruction, then the vtype predicted value is obtained by prediction based on thread characteristic information, and the vtype_sp (vtype speculative value) is updated with the predicted value; where the above non-immediate value is used to indicate that the configuration value needs to be obtained through the register.
[0077] In addition, it should be noted that if the instruction to be processed is not any of the vset{i}vl{i} extended instructions, the current vtype speculative value needs to be maintained. Subsequent vector instructions will carry the current vtype speculative value to the instruction decoding unit until a new vset{i}vl{i} instruction or instruction submission unit updates the speculative value of the vector parameter generation unit.
[0078] In an exemplary embodiment, updating the first vector parameter to a second vector parameter according to the encoding information of the immediate instruction includes: determining a first configuration value for the vector register grouping configuration and a second configuration value corresponding to the bit width of the vector element according to the encoding information of the immediate instruction; and updating the first vector parameter to a second vector parameter using the first configuration value and the second configuration value.
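As an illustration of obtaining the two configuration values from the instruction encoding, the sketch below pre-decodes a 32-bit instruction word and extracts LMUL and SEW from the immediate. The field positions follow the published RISC-V vector extension encoding as understood here (`vlmul` in the low three bits of the immediate, `vsew` in the next three). The function names are hypothetical, fractional LMUL values are omitted, and reserved bits are not checked.

```python
from typing import Tuple

OPCODE_OP_V = 0x57    # major opcode shared by vector arithmetic and vset* instructions
FUNCT3_OPCFG = 0b111  # funct3 value used by the vset{i}vl{i} family

LMUL_BY_VLMUL = {0b000: 1, 0b001: 2, 0b010: 4, 0b011: 8}  # fractional LMUL omitted
SEW_BY_VSEW = {0b000: 8, 0b001: 16, 0b010: 32, 0b011: 64}


def classify(instr: int) -> str:
    """Pre-decode: tell the three vset{i}vl{i} forms apart from all other instructions."""
    if (instr & 0x7F) != OPCODE_OP_V or ((instr >> 12) & 0x7) != FUNCT3_OPCFG:
        return "other"
    if (instr >> 31) == 0:
        return "vsetvli"   # immediate form, vtype immediate in bits 30:20
    if ((instr >> 30) & 0x3) == 0b11:
        return "vsetivli"  # immediate form, vtype immediate in bits 29:20
    return "vsetvl"        # non-immediate form, vtype comes from a register


def decode_immediate_vtype(instr: int) -> Tuple[int, int]:
    """Return (LMUL, SEW) taken from the immediate of a vsetvli / vsetivli instruction."""
    kind = classify(instr)
    if kind not in ("vsetvli", "vsetivli"):
        raise ValueError("not an immediate vset instruction")
    zimm = (instr >> 20) & (0x7FF if kind == "vsetvli" else 0x3FF)
    return LMUL_BY_VLMUL[zimm & 0x7], SEW_BY_VSEW[(zimm >> 3) & 0x7]


def encode_vsetvli(rd: int, rs1: int, sew: int, lmul: int) -> int:
    """Helper for the example only: build a vsetvli word with tail/mask policy bits cleared."""
    vsew = {8: 0, 16: 1, 32: 2, 64: 3}[sew]
    vlmul = {1: 0, 2: 1, 4: 2, 8: 3}[lmul]
    zimm = (vsew << 3) | vlmul
    return (zimm << 20) | (rs1 << 15) | (FUNCT3_OPCFG << 12) | (rd << 7) | OPCODE_OP_V


word = encode_vsetvli(rd=5, rs1=10, sew=16, lmul=8)  # vsetvli t0,a0,e16,m8
lmul, sew = decode_immediate_vtype(word)
```

Because both values sit in the instruction word itself, the immediate forms need no register read before `vtype_sp` can be updated.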
[0079] In an exemplary embodiment, updating the first vector parameter to a second vector parameter using a predicted value associated with the non-immediate instruction includes: determining feature information corresponding to the thread processing the non-immediate instruction; searching for a predicted value matching the feature information in a preset vector parameter prediction table; and using the predicted value to update the first vector parameter to the second vector parameter.
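The lookup can be pictured as a small table keyed by thread feature information. The sketch below is illustrative only: the class name, the integer key, the fallback to the current speculative value on a miss, and the training step are all assumptions, since the application does not specify how the table is organized or filled.

```python
from typing import Dict, Tuple

VType = Tuple[int, int]  # (LMUL, SEW), a simplified stand-in for the full vtype value


class VTypePredictionTable:
    """Hypothetical prediction table mapping thread feature information to a vtype value."""

    def __init__(self) -> None:
        self._table: Dict[int, VType] = {}

    def predict(self, thread_feature: int, fallback: VType) -> VType:
        # Assumption: on a miss, keep the current speculative value.
        return self._table.get(thread_feature, fallback)

    def train(self, thread_feature: int, actual: VType) -> None:
        # Assumption: record the value that vsetvl actually produced for this thread.
        self._table[thread_feature] = actual


table = VTypePredictionTable()
miss = table.predict(thread_feature=7, fallback=(1, 8))
table.train(thread_feature=7, actual=(8, 16))
hit = table.predict(thread_feature=7, fallback=(1, 8))
```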
[0080] In an exemplary embodiment, after updating the first vector parameter to the second vector parameter using the predicted value, the method further includes: obtaining the execution result corresponding to the non-immediate instruction, determining the actual value corresponding to the execution result; comparing the actual value and the predicted value to determine whether the speculative operation of updating the first vector parameter to the second vector parameter using the predicted value is valid based on the comparison result.
[0081] Understandably, since the vsetvl instruction (equivalent to the aforementioned non-immediate instruction) is mostly used for context restoration after thread switching, thread history can be used for prediction. However, the prediction may be incorrect. In order to identify erroneous predictions in a timely and effective manner, the vsetvl instruction carries the vtype prediction value obtained from the prediction table with it to the execution unit. After the execution unit has obtained the source operand, it executes the instruction and obtains the execution result. The execution result is compared with the vtype prediction value to detect whether the prediction is correct.
[0082] In an exemplary embodiment, comparing the actual value and the predicted value to determine whether a speculative operation of updating the first vector parameter to a second vector parameter using the predicted value is valid based on the comparison result includes: determining that the speculative operation of updating the first vector parameter to a second vector parameter using the predicted value is valid if the actual value is the same as the predicted value; and determining that the speculative operation of updating the first vector parameter to a second vector parameter using the predicted value is invalid if the actual value is different from the predicted value.
[0083] Optionally, if the execution result of the execution unit is not equal to the vtype prediction value, it indicates a prediction error and speculation failure, and a new vtype prediction value needs to be determined. If the execution result of the execution unit is equal to the vtype prediction value, speculation is successful, and the vtype prediction value is used to split and rename multiple long vectors within the clock cycle after the vsetvl instruction.
[0084] In an exemplary embodiment, after determining that a speculative operation to update the first vector parameter to a second vector parameter using the predicted value is invalid when the actual value is different from the predicted value, the method further includes: updating the first vector parameter in the speculative value register using the actual value to obtain a third vector parameter; identifying a target instruction to be processed after the non-immediate instruction among the plurality of instructions to be processed; if the target instruction to be processed is a long vector, splitting the target instruction to be processed into a plurality of target micro-operations using the third vector parameter; and renaming the target instruction to be processed according to the plurality of target micro-operations.
[0085] In an exemplary embodiment, since all pending instructions newer than the `vsetvl` instruction are processed based on the `vtype` prediction value, when speculation fails, to ensure that these pending instructions newer than the `vsetvl` instruction can be processed effectively, the instructions in the pipeline are flushed to invalidate them. This invalidates the `vtype` prediction value carried in these pending instructions, preventing the splitting of vector instructions based on the failed `vtype` prediction value. Afterwards, the value of `vtype_sp` needs to be updated with the execution result of the `vsetvl` instruction, and instruction fetching and execution restart from the instruction following `vsetvl`, ensuring that the re-fetched instructions can be processed based on the correct `vtype_sp` value.
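The check and recovery sequence can be summarized with a minimal model. The pipeline is represented here as a plain list of instruction names in program order, and `resolve_vsetvl` is a hypothetical function name; a real implementation would act on pipeline state rather than on a list.

```python
from typing import List, Optional, Tuple

VType = Tuple[int, int]  # (LMUL, SEW), simplified


def resolve_vsetvl(
    pipeline: List[str],
    vsetvl_index: int,
    predicted: VType,
    actual: VType,
    vtype_sp: VType,
) -> Tuple[List[str], VType, Optional[int]]:
    """Return (surviving instructions, new vtype_sp, index to refetch from or None)."""
    if actual == predicted:
        # Speculation valid: nothing is flushed and vtype_sp is left unchanged.
        return list(pipeline), vtype_sp, None
    # Speculation invalid: flush everything newer than vsetvl, correct vtype_sp
    # with the execution result, and restart fetch at the next instruction.
    survivors = pipeline[: vsetvl_index + 1]
    return survivors, actual, vsetvl_index + 1


pipeline = ["add", "vsetvl", "vadd.vv", "vsub.vv"]
kept, sp, refetch = resolve_vsetvl(pipeline, 1, predicted=(8, 16), actual=(2, 8), vtype_sp=(8, 16))
same, sp_ok, no_refetch = resolve_vsetvl(pipeline, 1, predicted=(8, 16), actual=(8, 16), vtype_sp=(8, 16))
```

In the mispredicted case the two vector instructions are discarded and fetched again, so they are split with the corrected value rather than the failed prediction.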
[0086] Obviously, the embodiments described above are only some embodiments of this application, and not all embodiments. To better understand the above method, the following description, in conjunction with embodiments, illustrates the process, but is not intended to limit the technical solutions of the embodiments of this application. Specifically:
[0087] In one exemplary embodiment, a RISC-V vector instruction processing method is provided, specifically including the following steps:
[0088] Step 1: Pre-decode the fetched instruction to identify whether it is a `vset{i}vl{i}` instruction. Based on the identification result, determine the source of the vector parameter that updates the speculative value in the `vtype_sp` register. Figure 7 is a schematic diagram of the vtype_sp update sources in the instruction processing method according to an embodiment of this application. The specific details are as follows:
[0089] Scenario 1: Since the vtype value used for vsetvl is obtained from the prediction table and is a speculative prediction value, the execution unit compares the execution result of the vsetvl instruction with the prediction value. If they do not match, the prediction is wrong and the vtype speculation fails. The execution result of the vsetvl instruction then needs to be used to correct the speculative value stored in the vtype_sp register. That is, the execution unit sends the correct vtype value to restore the vtype_sp speculative value. Subsequent instructions are split based on the restored speculative value.
[0090] Scenario 2: If it is a vsetvl instruction (a non-immediate value, i.e., the configuration value needs to be obtained through a register), then the vtype prediction value is obtained by prediction based on thread characteristic information, and vtype_sp (the vtype speculative value) is updated with the prediction value. Here, prediction means looking up the vtype prediction value of the corresponding thread in the prediction table of the vector parameter register based on the thread characteristic information.
[0091] Scenario 3: If it is a vsetivli / vsetvli instruction (immediate value), the configuration value can be obtained from the immediate value encoded in the instruction itself, and this configuration value is used to update the current vtype speculative value (the value of the vtype_sp register). All subsequent vector instructions will carry this vtype speculative value to the instruction decoding unit until a new vset{i}vl{i} instruction updates the vtype speculative value, or the instruction commit part repairs it, or vtype_sp is corrected after vector parameter speculation fails.
[0092] It should be noted that if it is not a vset{i}vl{i} instruction, the current vtype speculative value is retained, and subsequent vector instructions carry the current vtype speculative value to the instruction decoding unit until a new vset{i}vl{i} instruction or the instruction commit unit updates the speculative value.
[0093] Scenario 4: When a branch misprediction, an interrupt, or an exception causes the pipeline to flush invalid instructions, it is necessary to restore the value in the vtype vector parameter register and also the value in the vtype_sp vector parameter speculative value register. This is because a vset{i}vl{i} instruction in the instruction stream that was processed before the flush may have updated vtype to an incorrect value.
[0094] Step 2: The instruction decoding unit decomposes the vector instruction according to the configuration value of LMUL in the vtype speculative value.
[0095] For example, LMUL=8 means that 8 vector registers form a group, and the vector length is 8 x VLEN. Here, VLEN is a constant parameter that is determined by the specific implementation and specifies the width of a single vector register, such as 128 bits. Figure 8 is a schematic diagram of the vector register grouping configuration according to an embodiment of this application.
[0096] The instruction decoding unit decomposes a vector instruction affected by LMUL into 8 uops (micro-operations) at the VLEN granularity. The equivalent length of a single uop is the vector length when LMUL=1. Since the decoding unit only outputs a specific number of decoded instructions per clock cycle, there is a maximum limit. For example, if a maximum of 4 decoded instructions are output per clock cycle, then the above vector instruction will require 2 clock cycles to complete the decomposition.
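A small sketch of this decomposition follows. It assumes that the members of a register group are consecutively numbered registers, as in the RISC-V vector specification, and it ignores masking, widening and narrowing instructions, and fractional LMUL. `Uop`, `decompose`, and `decode_cycles` are hypothetical names.

```python
from dataclasses import dataclass
from math import ceil
from typing import List


@dataclass(frozen=True)
class Uop:
    op: str
    vd: int   # destination vector register number
    vs2: int  # first source vector register number
    vs1: int  # second source vector register number


def decompose(op: str, vd: int, vs2: int, vs1: int, lmul: int) -> List[Uop]:
    """Split one instruction on register groups into `lmul` uops of VLEN width each."""
    return [Uop(op, vd + i, vs2 + i, vs1 + i) for i in range(lmul)]


def decode_cycles(num_uops: int, decode_width: int) -> int:
    """Clock cycles needed when the decoder emits at most `decode_width` uops per cycle."""
    return ceil(num_uops / decode_width)


uops = decompose("vadd.vv", vd=8, vs2=16, vs1=24, lmul=8)
cycles = decode_cycles(len(uops), decode_width=4)
```

With LMUL=8 and a decode width of 4, the eight uops leave the decoder over two clock cycles, matching the example in the text.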
[0097] Without long vector instruction decomposition, for example with LMUL=8, a vector instruction with two source operands and one destination operand requires reading 8 x 2 = 16 register rename mappings and allocating 8 free physical registers as destination registers. High-performance processor cores typically process multiple instructions per clock cycle, such as 4 instructions. Therefore, a single clock cycle would require 8 x 16 = 128 reads of the rename map table to obtain the mapping relationship, and simultaneously allocate 32 free physical registers per clock cycle. Furthermore, the instruction issue unit needs to store multiple source and destination registers, and instruction wakeup requires comparing multiple source registers, so the area and power consumption overhead is relatively large. If long vector instructions are decomposed to LMUL=1 granularity, the back-end pipeline and register-related processing logic can be reduced to 1/8 of the original, greatly reducing design complexity, area, and power consumption. High-performance processors are generally designed for high frequencies, typically above 3 GHz. After instruction decomposition and simplification, the design can achieve even better frequencies. In addition, renaming units with 128 read ports, or wakeup circuits for multiple registers in the instruction issue unit, can lead to excessive wiring in local areas and cause congestion, which is not conducive to physical implementation and may further worsen timing and increase power consumption.
[0098] In an exemplary embodiment, in order to achieve a balance between area, power consumption, and performance, the execution units for certain special vector instructions (e.g., those whose single execution unit occupies a large area) are not provisioned for the maximum number of vector elements, vlmax. Assuming that the bit width of the vector element is sew = 16 bits and VLEN = 128 bits, then vlmax = 128 / 16 = 8. The instruction decoder can then further decompose the vector instructions according to the sew configuration. The sew value is obtained in the same way as LMUL, from the vector parameter generation unit mentioned above, which processes sew in the same way as LMUL.
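The vlmax arithmetic, and one way the decoder might use it, can be sketched as follows. The `lanes` parameter (how many elements the execution unit handles at once) and the `passes_per_uop` function are assumptions introduced for illustration; the application does not state how the further decomposition is sized.

```python
from math import ceil


def vlmax(vlen_bits: int, sew_bits: int) -> int:
    """Maximum number of elements in one vector register (LMUL=1)."""
    return vlen_bits // sew_bits


def passes_per_uop(vlen_bits: int, sew_bits: int, lanes: int) -> int:
    """Pieces needed when the execution unit handles only `lanes` elements at a time (assumed)."""
    return ceil(vlmax(vlen_bits, sew_bits) / lanes)


elements = vlmax(128, 16)                     # the example from the text: 128 / 16
passes = passes_per_uop(128, 16, lanes=4)     # hypothetical unit that is 4 elements wide
```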
[0099] Decomposing instructions in the decoder avoids the complexity, area, and power consumption overhead of decomposing them further down the pipeline at the execution or issue units. Execution units mostly process instructions in a pipelined manner, with instructions entering the pipeline back-to-back. If an instruction had to be decomposed in the middle, the pipeline would need to be stalled to free up resources for the uops (micro-operations) produced by the decomposition, which can easily cause pipeline bubbles and disrupt pipeline operation. Instructions that have already been issued but cannot enter the execution unit, because the decomposition stalls the pipeline, would have to be flushed, and flushing and then re-issuing them wastes significant power. If instructions were decomposed at the issue unit, decomposition logic would be needed for each entry in the issue queue. Since the issue queue typically consists of many entries, a separate decomposition circuit for each entry would also incur large area and power overhead. With the decomposition process described above, however, the instruction decoding unit only needs to decompose according to the decoding width, which is generally much smaller than the number of entries in the issue queue. After decomposition, the later stages of the pipeline (renaming, issue units, etc.) do not need to be aware of changes in LMUL and SEW, which reduces the circuit complexity of the later stages of the pipeline and optimizes area and power consumption.
[0100] Step 3: The renaming unit processes the instructions at uop granularity.
[0101] It should be noted that after the above processing, although LMUL=8, the vector register width seen by the renaming unit is always the VLEN width, and is not affected by the change of LMUL in the vector register grouping configuration. The renaming unit does not need more renaming mapping table read ports and read port arbitration logic to adapt to the case of LMUL>1. It also does not need to allocate 8 times (LMUL is a maximum of 8) of the number of free registers per clock cycle, which simplifies the allocation and management logic of free physical registers, and can achieve better timing, power consumption and area.
[0102] Step 4: The issue unit processes the decomposed instructions at uop granularity. Normally, the issue queue needs to store the register IDs (identifiers) of an instruction in order to check whether the operands of that instruction are ready. An instruction can be issued only after all of its operands are ready. If an operand is not ready, the register ID is used for matching and wake-up, and the instruction can be issued normally only after wake-up. After the decoding unit decomposes the vector instructions, the entries in the issue queue only need to store register IDs at uop (VLEN) granularity, and do not need to store the register IDs of the entire instruction. Since the maximum LMUL is 8, the maximum number of register IDs for an entire instruction is 8 times that of a single uop. With decomposition, the issue unit is not affected by changes in LMUL, which greatly reduces the area overhead of storing register IDs. In addition, waking up an instruction in the issue queue requires comparing the register IDs stored in the entry. With fewer register IDs, fewer comparators are needed, and area and power consumption are reduced accordingly. Furthermore, the issue logic must confirm that all source operands are ready, and the more registers there are to check, the worse the timing. After instruction decomposition, only the registers at uop granularity need to be checked, so the timing is better.
[0103] Step 5: The execution unit processes the decomposed instructions at uop granularity. The computing units need not be provisioned according to the vector width configured by LMUL. For example, if LMUL=8, the vector width is 8 times the width of a physical register. If the execution unit were provisioned for 8 times the physical register width, the area and power consumption would be relatively large. Moreover, when LMUL is configured to a relatively small value, the execution unit could not be fully utilized.
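The wake-up mechanism and the effect of decomposition on entry size can be modelled briefly. `IssueEntry` and `source_ids_per_entry` are hypothetical names, and the physical register numbers are arbitrary; the model only shows the tag matching, not issue selection.

```python
from typing import Iterable, Set


class IssueEntry:
    """One issue queue entry: it waits until every source physical register is produced."""

    def __init__(self, name: str, src_regs: Iterable[int]) -> None:
        self.name = name
        self.waiting: Set[int] = set(src_regs)  # register IDs not yet ready

    def wakeup(self, produced_reg: int) -> None:
        # Compare the broadcast register ID against the stored IDs (one comparator per ID).
        self.waiting.discard(produced_reg)

    @property
    def ready(self) -> bool:
        return not self.waiting


def source_ids_per_entry(num_sources: int, registers_per_operand: int) -> int:
    """Source register IDs, and therefore comparators, that one entry must hold."""
    return num_sources * registers_per_operand


entry = IssueEntry("vadd.vv uop0", src_regs=[40, 41])
entry.wakeup(40)
half_ready = entry.ready  # still waiting for register 41
entry.wakeup(41)

per_uop = source_ids_per_entry(num_sources=2, registers_per_operand=1)
per_whole_instruction = source_ids_per_entry(num_sources=2, registers_per_operand=8)  # LMUL=8
```

A two-source uop stores two source IDs per entry, whereas an undecomposed LMUL=8 instruction would need sixteen.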
[0104] Figure 9 is a schematic diagram illustrating two different computing resource configurations according to embodiments of this application. Taking a constant VLEN of 128 bits as an example, the figure compares a configuration with four 32-bit dividers against one with 32 dividers. If dividers are provisioned for the maximum vector (vector register group configuration LMUL=8), then when LMUL=1, 32 - 4 = 28 dividers are idle, which is inefficient. Moreover, LMUL=1 is more commonly used in applications. Therefore, a more balanced design is to allocate computing resources (e.g., the number of dividers) based on the VLEN width, and to achieve a better balance between area and performance by decomposing instructions into uops. As shown earlier, decomposing into uops during the decoding stage (before renaming) is the best approach: the later in the pipeline the decomposition is performed, the more information must be carried along, leading to greater power consumption and area overhead.
[0105] In one exemplary embodiment, an optional embodiment of this application also provides a fast recovery solution for vsetvl speculation failure. As shown in Figure 5, the `vsetvl` instruction carries the `vtype` prediction value obtained from the prediction table with it to the execution unit. Once the execution unit has obtained the source operands, it can execute the instruction and obtain the execution result. By comparing the execution result with the `vtype` prediction value, it can detect whether the prediction is correct. If they are not equal, the prediction is wrong and the speculation has failed; if they are equal, the speculation has succeeded. Because all instructions newer than `vsetvl` are processed based on the prediction value (or speculative value), if the speculation fails, all instructions newer than `vsetvl` need to be flushed, and instruction fetching and execution then restart from the instruction following `vsetvl`. Before this, however, the value of `vtype_sp` needs to be updated with the execution result of `vsetvl` to ensure that the re-fetched instructions can be processed (e.g., decomposed into uops) based on the correct `vtype_sp` value.
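The divider counts in this comparison follow from simple arithmetic, reproduced below. `dividers_for` is a hypothetical helper name.

```python
def dividers_for(vlen_bits: int, divider_bits: int, lmul: int) -> int:
    """Dividers needed to process a whole register group of `lmul` registers at once."""
    return (vlen_bits // divider_bits) * lmul


provisioned_for_lmul8 = dividers_for(128, 32, lmul=8)  # sized for the largest register group
provisioned_for_vlen = dividers_for(128, 32, lmul=1)   # sized for one vector register
idle_at_lmul1 = provisioned_for_lmul8 - provisioned_for_vlen
```

Provisioning for VLEN keeps all four dividers busy at every LMUL setting, at the cost of processing an LMUL=8 instruction as eight uops.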
[0106] It is important to note that the source operand of the `vsetvl` instruction (used to update the `vtype` register) comes from a GPR (general-purpose register), and this source operand may also be the result of an older instruction that has not yet undergone register renaming or similar processing. The source operand therefore cannot be obtained during the pre-decoding and decoding phases, yet vector instruction decomposition depends on the `vtype` vector parameter. One approach is to pause the `vsetvl` instruction in the pipeline and wait for the result of the older instruction before processing it. However, this causes pipeline pauses (if the `vsetvl` instruction depends on data returned by a memory access and there are multiple cache misses, this time can be very long; if it does not depend on the results of other instructions, the pause is typically 5-10 clock cycles, depending on the specific microarchitecture implementation), which blocks subsequent out-of-order instruction execution and significantly impacts performance. Another approach is to predict the value using thread characteristic information, since the `vsetvl` instruction is often used for context recovery after thread switching and thread history can be used for prediction. However, predictions can be wrong. Therefore, there must be a way to detect and quickly recover from such prediction errors without waiting until the instruction reaches the commit stage to check whether the prediction is correct, thus improving performance. Because thread characteristics are relatively clear, the probability of prediction errors is relatively low. Prediction therefore reduces pipeline stalls and yields a significant performance improvement.
[0107] Optionally, in practical applications, the above implementation methods can also be applied to the following specific scenarios:
[0108] Scenario 1: A general-purpose CPU core that supports all vector instructions;
[0109] Scenario 2: Application to DSP processors, vector processors, or GPUs;
[0110] Scenario 3: Application to processors that only support scalar instructions, where some complex scalar instructions may require speculative uop decomposition based on a certain configuration parameter and rapid recovery after speculation failure;
[0111] Scenario 4: Application in SOC chips, such as in the main control SOC chips of communication equipment (routers, switches, base station controllers, etc.), used to control and manage various operations of the communication equipment; or used in baseband signal processing SOC chips.
[0112] Through the above embodiments, vector instruction decomposition is performed by speculatively obtaining vector parameter information in advance (including, but not limited to, LMUL; for example, SEW as well), without interrupting the pipeline to wait for the execution result of vsetvl. If vsetvl depends on the results of other instructions, the waiting time, and therefore the pipeline pause, can be relatively long. By predicting and obtaining the vtype value in advance, pipeline pauses can be reduced, thereby achieving better performance. Decomposing long vector instructions can reduce the design complexity of the CPU core back-end pipeline (including the renaming unit, instruction issue unit, instruction execution unit, etc.), reduce area and power consumption overhead, and achieve better area efficiency and energy efficiency. Furthermore, high-performance processing cores typically commit instructions sequentially in program order, so if there are many instructions preceding `vsetvl`, the time between instruction completion and commit can be long. With the fast recovery method described above, there is no need to wait for the `vsetvl` instruction to reach the commit stage before checking whether the `vtype` prediction is correct. Instead, recovery occurs immediately after the `vsetvl` instruction obtains its result during execution, which eliminates the wait between completion and commit and achieves better performance.
[0113] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as read-only memory / random access memory (ROM / RAM), magnetic disk, optical disk), and may include several instructions to cause a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods described in the various embodiments of this application.
[0114] Example 2
[0115] This embodiment also provides an instruction processing apparatus for implementing the above embodiments and preferred embodiments; details already described will not be repeated. As used below, the term "module" can refer to a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
[0116] Figure 10 is a schematic diagram of the structure of an instruction processing apparatus according to an embodiment of this application. As shown in Figure 10, the apparatus may include:
[0117] The determining module 72 is used to determine a first type of instruction to be processed from a plurality of instructions to be processed according to a preset instruction code, wherein the first type of instruction to be processed is used to update the first vector parameter in the speculative value register;
[0118] Update module 74 is used to update the first vector parameter to the second vector parameter using the update method corresponding to the first type of instruction to be processed;
[0119] The processing module 76 is used to split the long vector instruction in other instructions to be processed into multiple micro-operations based on the second vector parameter, and to rename the other instructions to be processed according to the multiple micro-operations, wherein the other instructions to be processed are instructions other than the first type of instructions to be processed among the multiple instructions to be processed.
[0120] Through the aforementioned apparatus, in the system architecture corresponding to the CPU core, a preset instruction code is used to identify the first type of instruction to be processed. This allows a second vector parameter to be obtained speculatively before instruction decoding. Then, based on the vector register grouping value contained in the second vector parameter, the instruction decomposition subunit configured in the instruction decoding unit splits long vector instructions among the other instructions to be processed into multiple micro-operations. These fine-grained micro-operations reduce the area and power consumption overhead of the renaming and instruction issue units when processing long vectors. They also avoid the complexity and power consumption overhead of further micro-operation decomposition of long vector instructions in the execution unit. In other words, by determining the vector parameter in advance and using it to decompose the corresponding long vector instructions, the problem in related technologies of excessive complexity, area, and power consumption in the processor's execution circuitry for vector instructions is solved. This reduces the design complexity of the execution unit circuitry and achieves a better processing frequency within a limited area, thus improving the user experience.
[0121] As an optional implementation, the above apparatus may further include: a time module, configured to: determine the number of instructions of the first type of pending instructions among the plurality of pending instructions before updating the first vector parameter to the second vector parameter using the update method corresponding to the first type of pending instructions; if the number of instructions is greater than a preset number, determine the update time of each instruction of the plurality of first type of pending instructions updating the first vector parameter, thereby obtaining a plurality of update times; and determine a plurality of update intervals based on the time relationship of the plurality of update times, wherein the target other pending instructions processed within each update interval use the target second vector parameter most recently updated in the update interval.
[0122] As an optional implementation, the above-described update module is further configured to update the first vector parameter to a second vector parameter according to the encoding information of the immediate instruction when the first type of instruction to be processed is an immediate instruction; and to update the first vector parameter to a second vector parameter by means of the predicted value associated with the non-immediate instruction when the first type of instruction to be processed is a non-immediate instruction.
[0123] As an optional implementation, the update module is further configured to determine a first configuration value for the vector register grouping configuration and a second configuration value corresponding to the bit width of the vector element based on the encoding information of the immediate instruction; and update the first vector parameter to the second vector parameter using the first configuration value and the second configuration value.
[0124] As an optional implementation, the update module is further configured to determine the feature information corresponding to the thread processing the non-immediate instruction; search for a predicted value matching the feature information in a preset vector parameter prediction table; and use the predicted value to update the first vector parameter to the second vector parameter.
[0125] As an optional implementation, the update module further includes: a comparison unit, configured to, after updating the first vector parameter to the second vector parameter using the predicted value, obtain the execution result corresponding to the non-immediate instruction, determine the actual value corresponding to the execution result; compare the actual value and the predicted value, so as to determine whether the speculative operation of updating the first vector parameter to the second vector parameter using the predicted value is valid based on the comparison result.
[0126] As an optional implementation, the comparison unit is further configured to determine that a speculative operation of updating the first vector parameter to the second vector parameter using the predicted value is valid when the actual value is the same as the predicted value; and to determine that a speculative operation of updating the first vector parameter to the second vector parameter using the predicted value is invalid when the actual value is different from the predicted value.
[0127] As an optional implementation, the update module further includes: an update unit, configured to update the first vector parameter in the speculative value register using the actual value to obtain a third vector parameter; identify the target instruction to be processed after the non-immediate instruction among the plurality of instructions to be processed; if the target instruction to be processed is a long vector, use the third vector parameter to split the target instruction to be processed into a plurality of target micro-operations; and rename the target instruction to be processed according to the plurality of target micro-operations.
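As a hedged illustration of the comparison unit and update unit described above, the following Python sketch checks a speculative update against the actual value and, on a mismatch, writes the actual value to the speculative value register and re-splits the later long vector instructions. The names `VParam`, `split_into_uops`, and `resolve`, the dictionary standing in for the speculative value register, and the rule of one micro-operation per register in the group are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VParam:
    lmul: int  # vector register grouping factor
    sew: int   # element bit width


def split_into_uops(instr, param):
    """Split a long vector instruction: one micro-op per grouped register."""
    return [f"{instr}.uop{i}" for i in range(param.lmul)]


def resolve(spec_reg, predicted, actual, later_long_vector_instrs):
    """Validate a speculative update; return (valid, re-split micro-ops)."""
    if actual == predicted:
        return True, {}                 # speculation valid, nothing to redo
    spec_reg["vparam"] = actual         # third vector parameter
    redo = {
        instr: split_into_uops(instr, actual)
        for instr in later_long_vector_instrs
    }
    return False, redo
```

When the prediction is correct, the micro-operations already produced from the predicted value remain usable; only a misprediction incurs the cost of splitting and renaming again.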
[0128] It should be noted that the above modules can be implemented by software or hardware. For the latter, they can be implemented in the following ways, but are not limited to: all the above modules are located in the same processor; or, the above modules are located in different processors in any combination.
[0129] To facilitate understanding of the technical solutions provided in this application, detailed descriptions will be provided below with reference to specific scenario embodiments.
[0130] Embodiments of this application also provide a computer-readable storage medium storing a computer program, wherein the computer program is configured to execute the steps in any of the above method embodiments when run.
[0131] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard disk, magnetic disk, or optical disk.
[0132] Embodiments of this application also provide an electronic device that may include a memory and a processor, the memory storing a computer program and the processor being configured to run the computer program to perform the steps in any of the above method embodiments.
[0133] Optionally, in this embodiment, the processor can be configured to perform the following steps via a computer program:
[0134] S1, determine a first type of instruction to be processed from a plurality of instructions to be processed according to a preset instruction code, wherein the first type of instruction to be processed is used to update the first vector parameter in the speculative value register;
[0135] S2, update the first vector parameter to the second vector parameter using the update method corresponding to the first type of instruction to be processed;
[0136] S3, based on the second vector parameter, split the long vector instruction among the other instructions to be processed into multiple micro-operations, and rename the other instructions to be processed according to the multiple micro-operations, wherein the other instructions to be processed are instructions other than the first type of instructions to be processed among the multiple instructions to be processed.
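Steps S1 to S3 can be sketched end to end as follows. This is a simplified Python model under stated assumptions, not the claimed circuit: instructions are hypothetical dictionaries, every non-`vset` instruction is assumed to be a vector instruction with a destination register `vd`, the vector parameter is reduced to an integer grouping factor, an instruction is treated as "long" when that factor exceeds 1, and renaming is modeled as drawing a fresh physical register number from a counter.

```python
from itertools import count


def process(instrs, initial_lmul):
    """Return a list of (op, architectural_reg, physical_reg) micro-ops."""
    lmul = initial_lmul            # stands in for the speculative value register
    free_phys = count(start=100)   # hypothetical free list of physical registers
    uops = []
    for ins in instrs:
        if ins["op"] == "vset":    # S1: first type of instruction
            lmul = ins["lmul"]     # S2: update the vector parameter
            continue
        # S3: split by the current parameter, then rename each micro-op.
        n = lmul if lmul > 1 else 1
        for i in range(n):
            uops.append((ins["op"], ins["vd"] + i, next(free_phys)))
    return uops
```

For example, a `vadd` with destination register 8 that follows a `vset` selecting a grouping factor of 4 yields four micro-operations targeting registers 8 through 11, each with its own physical register.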
[0137] Alternatively, as those skilled in the art will understand, the structure shown in Figure 11 is for illustrative purposes only. The electronic device can also be a terminal device such as a smartphone (for example, an Android phone or an iOS phone), a tablet computer, a PDA, a mobile internet device (MID), or a PAD. Figure 11 does not limit the structure of the aforementioned electronic device. For example, the electronic device may include more or fewer components than shown in Figure 11 (such as network interfaces), or have a configuration different from that shown in Figure 11.
[0138] The memory 902 can be used to store software programs and modules, such as the program instructions / modules corresponding to the instruction processing method and apparatus in this embodiment. The processor 904 executes various functional applications and data processing by running the software programs and modules stored in the memory 902, thereby implementing the aforementioned instruction processing method. The memory 902 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 902 may further include memory remotely located relative to the processor 904, and these remote memories can be connected to the terminal via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof. As an example, as shown in Figure 11, the memory 902 may include, but is not limited to, the determination module 72, the update module 74, and the processing module 76 of the instruction processing apparatus. Furthermore, it may include, but is not limited to, other module units of the instruction processing apparatus, which will not be elaborated upon in this example.
[0139] Optionally, the transmission device 906 described above is used to receive or send data via a network. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission device 906 includes a Network Interface Controller (NIC), which can be connected to other network devices and routers via a network cable to communicate with the Internet or a local area network. In another example, the transmission device 906 is a Radio Frequency (RF) module, used for wireless communication with the Internet.
[0140] In addition, the above-mentioned electronic device also includes: a display 908, configured to display the above-mentioned instruction to be processed; and a connection bus 910, configured to connect the various module components in the above-mentioned electronic device.
[0141] In one exemplary embodiment, the electronic device may further include a transmission device and an input / output device, wherein the transmission device is connected to the processor and the input / output device is connected to the processor.
[0142] Specific examples in this embodiment can be found in the examples described in the above embodiments and exemplary implementations, and will not be repeated here.
[0143] Obviously, those skilled in the art should understand that the modules or steps of this application described above can be implemented using general-purpose computing devices. They can be centralized on a single computing device or distributed across a network of multiple computing devices. They can be implemented using computer-executable program code, and thus can be stored in a storage device for execution by a computing device. In some cases, the steps shown or described can be performed in a different order than those presented here, or they can be fabricated as separate integrated circuit modules, or multiple modules or steps can be fabricated as a single integrated circuit module. Thus, this application is not limited to any particular combination of hardware and software.
[0144] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the principles of this application should be included within the protection scope of this application.
Claims
1. An instruction processing method, characterized in that it comprises: determining, according to a preset instruction code, a first type of instruction to be processed from a plurality of instructions to be processed, wherein the first type of instruction to be processed is used to update the first vector parameter in the speculative value register; updating the first vector parameter to the second vector parameter using the update method corresponding to the first type of instruction to be processed; and, based on the second vector parameter, splitting the long vector instruction in other instructions to be processed into multiple micro-operations, and renaming the other instructions to be processed according to the multiple micro-operations, wherein the other instructions to be processed are instructions other than the first type of instructions to be processed among the multiple instructions to be processed; wherein updating the first vector parameter to the second vector parameter using the update method corresponding to the first type of instruction to be processed includes: when the first type of instruction to be processed is an immediate instruction, updating the first vector parameter to the second vector parameter according to the encoding information of the immediate instruction; and when the first type of instruction to be processed is a non-immediate instruction, updating the first vector parameter to the second vector parameter through the predicted value associated with the non-immediate instruction.
2. The method of claim 1, wherein, before updating the first vector parameter to the second vector parameter using the update method corresponding to the first type of instruction to be processed, the method further includes: determining the number of first-type instructions to be processed among the plurality of instructions to be processed; when the number of instructions is greater than a preset number, determining the update time at which each of the plurality of first-type instructions to be processed updates the first vector parameter, to obtain a plurality of update times; and determining multiple update intervals based on the temporal relationship of the multiple update times, wherein the target other instructions to be processed within each update interval use the target second vector parameter most recently updated in that update interval.
3. The method of claim 1, wherein updating the first vector parameter to the second vector parameter according to the encoding information of the immediate instruction includes: determining, based on the encoding information of the immediate instruction, the first configuration value of the vector register grouping configuration and the second configuration value corresponding to the bit width of the vector element; and updating the first vector parameter to the second vector parameter using the first configuration value and the second configuration value.
4. The method of claim 1, wherein updating the first vector parameter to the second vector parameter using the predicted value associated with the non-immediate instruction includes: determining the feature information corresponding to the thread that processes the non-immediate instruction; searching for a predicted value that matches the feature information in a preset vector parameter prediction table; and using the predicted value to update the first vector parameter to the second vector parameter.
5. The method of claim 4, wherein, after updating the first vector parameter to the second vector parameter using the predicted value, the method further includes: obtaining the execution result corresponding to the non-immediate instruction, and determining the actual value corresponding to the execution result; and comparing the actual value and the predicted value to determine, based on the comparison result, whether the speculative operation of updating the first vector parameter to the second vector parameter using the predicted value is valid.
6. The method according to claim 5, characterized in that comparing the actual value and the predicted value to determine, based on the comparison result, whether the speculative operation of updating the first vector parameter to the second vector parameter using the predicted value is valid includes: if the actual value is the same as the predicted value, determining that the speculative operation of updating the first vector parameter to the second vector parameter using the predicted value is valid; and if the actual value is different from the predicted value, determining that the speculative operation of updating the first vector parameter to the second vector parameter using the predicted value is invalid.
7. The method according to claim 6, characterized in that, after determining that the speculative operation of updating the first vector parameter to the second vector parameter using the predicted value is invalid when the actual value is different from the predicted value, the method further includes: updating the first vector parameter in the speculative value register using the actual value to obtain the third vector parameter; identifying, among the plurality of instructions to be processed, the target instruction to be processed that is processed after the non-immediate instruction; when the target instruction to be processed is a long vector instruction, using the third vector parameter to split the target instruction to be processed into multiple target micro-operations; and renaming the target instruction to be processed according to the multiple target micro-operations.
8. An instruction processing apparatus, characterized in that it comprises: a determination module, configured to determine a first type of instruction to be processed from a plurality of instructions to be processed according to a preset instruction code, wherein the first type of instruction to be processed is used to update the first vector parameter in the speculative value register; an update module, configured to update the first vector parameter to the second vector parameter using the update method corresponding to the first type of instruction to be processed; and a processing module, configured to split the long vector instruction in other instructions to be processed into multiple micro-operations based on the second vector parameter, and to rename the other instructions to be processed according to the multiple micro-operations, wherein the other instructions to be processed are instructions other than the first type of instructions to be processed among the multiple instructions to be processed; wherein the update module is further configured to update the first vector parameter to the second vector parameter according to the encoding information of the immediate instruction when the first type of instruction to be processed is an immediate instruction, and to update the first vector parameter to the second vector parameter by means of the predicted value associated with the non-immediate instruction when the first type of instruction to be processed is a non-immediate instruction.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the steps of the method described in any one of claims 1 to 7.
10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the method described in any one of claims 1 to 7.