Processor core, processor, system on a chip, computing device and instruction processing method

By introducing a dedicated vector operation instruction into the processor core, floating-point numbers can be converted into integers with a bit-width reduction of four times or more, solving the performance loss caused by multi-instruction sequences in existing technologies. This extends the instruction set with a quantization instruction and accelerates quantization, improving computing power and hardware performance for deep learning.

CN115469930B · Active · Publication Date: 2026-06-12 · C SKY MICROSYST CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
C SKY MICROSYST CO LTD
Filing Date
2022-09-20
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

In the low-precision quantization process of deep learning, existing technologies require the execution of multiple instructions, resulting in performance loss. In particular, the operation of converting floating-point numbers to 8-bit integers requires 5 instructions, which affects the hardware execution efficiency.

Method used

A processor core is provided that converts floating-point numbers into integers with a reduction of four times or more, using a dedicated vector operation instruction. Quantization is thereby abstracted into a single instruction, reducing the instruction count and avoiding the performance loss caused by dependencies among multiple instructions.

🎯Benefits of technology

It effectively reduces the number of instructions, increases computing power, reduces memory access bandwidth requirements, and improves hardware performance, making it suitable for deep learning inference scenarios in edge computing and IoT devices.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a processor core, a processor, a system on chip, a computing device and an instruction processing method, which can be used in scenarios involving the vector extension instruction set of the RISC-V instruction set. The processor core comprises: an instruction extraction unit configured to extract a vector operation instruction; an instruction decoding unit configured to decode the extracted vector operation instruction; and an instruction execution unit configured to execute the decoded vector operation instruction to convert a floating-point number to an integer with an N-fold reduction, wherein N is an integer greater than or equal to 4. According to the technical scheme of the application, the quantization function of converting a floating-point number to an integer with a reduction of 4 times or more can be realized by a dedicated instruction, thereby extending the instruction set with a quantization instruction and accelerating the quantization of floating-point numbers.

Description

Technical Field

[0001] This application relates to the field of chip technology, and in particular to processor cores, processors, systems-on-a-chip, computing devices, and instruction processing methods.

Background Technology

[0002] The implementation and innovative applications of deep learning in edge computing and Internet of Things (IoT) devices face significant challenges, primarily limited by computational power, memory, and energy consumption requirements. To achieve real-time deployment of deep learning, there is a trend towards using low-precision quantization schemes in artificial intelligence (AI) applications. Quantization, also known as fixed-point quantization, converts expensive floating-point operations into integer operations, effectively improving computational power and reducing memory access bandwidth requirements.

[0003] Currently, optimizations for low-precision quantization in AI instruction sets primarily focus on improving multiply-accumulate performance, without extending the quantization algorithm itself with faster instruction extensions. Taking the ARM instruction set as an example, to quantize a single-precision floating-point number to an 8-bit integer, five instructions are required: 1. Single-precision floating-point to 32-bit integer conversion (vcvt_aq_s32_f32); 2. 32-bit integer to 16-bit integer reduction (vqmovn_s32); 3. 16-bit integer to 8-bit integer reduction (vqmovn_s16); 4. 16-bit data concatenation (vcombine_s16); 5. Signed 8-bit maximum value extraction (vmax_s8). Clearly, this requires a large number of instructions, and due to data dependencies between instructions, the performance loss in practical applications is even greater.

Summary of the Invention

[0004] This application provides a processor core, processor, system-on-a-chip, computing device, and instruction processing method to optimize the performance of vector operations.

[0005] In a first aspect, embodiments of this application provide a processor core, including:

[0006] The instruction extraction unit is used to extract vector operation instructions;

[0007] The instruction decoding unit is used to decode the extracted vector operation instructions;

[0008] The instruction execution unit is used to execute the decoded vector operation instructions to convert a floating-point number into an integer with an N-fold reduction, where N is an integer greater than or equal to 4.

[0009] In a second aspect, embodiments of this application provide a processor, including at least one processor core provided in embodiments of this application.

[0010] In a third aspect, embodiments of this application provide a system-on-a-chip, including at least one processor core provided in embodiments of this application.

[0011] In a fourth aspect, embodiments of this application provide a computing device, including a memory and a processor provided in embodiments of this application, the memory and the processor being coupled to each other.

[0012] In a fifth aspect, embodiments of this application provide an instruction processing method, including: extracting vector operation instructions; decoding the extracted vector operation instructions; and executing the decoded vector operation instructions to convert a floating-point number into an integer with an N-fold reduction, wherein N is an integer greater than or equal to 4.

[0013] According to the technical solution of this application, the operation of converting a floating-point number to an integer with a reduction of four times or more is abstracted into a single vector operation instruction, so that this quantization function can be implemented with a single dedicated instruction. This extends the instruction set with a quantization instruction and accelerates floating-point quantization. Compared with the prior art quantization method that uses multiple vector operation instructions, the technical solution of this application can effectively reduce the number of instructions and avoid the performance loss caused by the need to execute multiple dependent instructions sequentially.

[0014] The above overview is for illustrative purposes only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of this application will become readily apparent from the accompanying drawings and the following detailed description.

Attached Figure Description

[0015] In the accompanying drawings, unless otherwise specified, the same reference numerals throughout the various drawings denote the same or similar parts or elements. These drawings are not necessarily drawn to scale. It should be understood that these drawings depict only some embodiments disclosed in this application and should not be construed as limiting the scope of this application.

[0016] Figure 1 shows a schematic representation of the structure of a computing device 100 according to an embodiment of this application.

[0017] Figure 2 shows a schematic representation of the structure of a system-on-a-chip 200 according to an embodiment of this application.

[0018] Figure 3 shows a schematic representation of the structure of a processor core 300 according to an embodiment of this application.

[0019] Figure 4 shows a flowchart of an instruction processing method according to an embodiment of this application.

Detailed Implementation

[0020] Many specific details are set forth in the following description to provide a full understanding of this application. However, this application can be implemented in many other forms than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of this application; therefore, this application is not limited to the specific embodiments disclosed below.

[0021] The following terms are used in this document.

[0022] Processor: The core of a computer system for computation and control. A processor can be a Complex Instruction Set Computer (CISC) architecture, a Reduced Instruction Set Computer (RISC) architecture, a Very Long Instruction Word (VLIW) architecture, or a combination of the above instruction sets.

[0023] Processor core: The core computing engine in a processor used to process and execute instructions. A processor may include one processor core or multiple processor cores.

[0024] Quantization, also known as fixed-point conversion, converts expensive floating-point numbers into integers. For example, it converts single-precision floating-point numbers (FP32) into 16-bit integers (INT16), half-precision floating-point numbers (FP16) into 8-bit integers (INT8), double-precision floating-point numbers (FP64) into 16-bit integers (INT16), and single-precision floating-point numbers (FP32) into 8-bit integers (INT8).

[0025] N-fold reduction: In quantization, if the number of bits in the unquantized floating-point number is greater than the number of bits in the quantized integer number, it is called reduction. Here, the number of bits in the unquantized floating-point number is N times the number of bits in the quantized integer number, where N is a positive integer, such as 2 or 4. For example, converting a single-precision floating-point number to INT8 is a 4-fold reduction.
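The definition above is simple arithmetic on bit widths, and a minimal sketch can make it concrete. The function name `reduction_factor` and the lookup tables are hypothetical helpers introduced only for illustration; the format names and widths are those listed in paragraph [0024].

```python
# Bit widths of the formats named in the text.
FLOAT_BITS = {"FP16": 16, "FP32": 32, "FP64": 64}
INT_BITS = {"INT8": 8, "INT16": 16, "INT32": 32}


def reduction_factor(float_format: str, int_format: str) -> int:
    """Return N such that the float format is N times as wide as the integer format."""
    src_bits = FLOAT_BITS[float_format]
    dst_bits = INT_BITS[int_format]
    # A reduction requires the source to be strictly wider than the destination.
    if src_bits <= dst_bits or src_bits % dst_bits != 0:
        raise ValueError(f"{float_format} -> {int_format} is not an N-fold reduction")
    return src_bits // dst_bits


if __name__ == "__main__":
    for src, dst in [("FP32", "INT8"), ("FP64", "INT16"), ("FP16", "INT8")]:
        print(src, "->", dst, "is a", reduction_factor(src, dst), "fold reduction")
```

Under this definition, FP32 to INT8 and FP64 to INT16 are both 4-fold reductions, while FP16 to INT8 and FP32 to INT16 are 2-fold.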

[0026] Vector operation instructions: Vector operations are operations that can produce results from multiple elements in parallel. The instructions used to perform these vector operations are called vector operation instructions. Some examples of vector operation instructions include vector addition instructions, vector floating-point multiplication instructions, and vector floating-point arithmetic logic unit (ALU) instructions. The source operands and / or destination operands of vector operation instructions are vector operands.

[0027] Element: In parallel computing of vector operations, the operands targeted by a computation are elements, such as the source operand and destination operand in a vector operation instruction.

[0028] Operands: The objects on which instructions operate. Operands indicate the source of the data required for instruction execution, such as immediate values, register addresses, or memory addresses.

[0029] Vector parameters are resource configuration parameters used when executing vector operation instructions, such as vector type and register configuration. The vector type determines how the elements in the vector register are organized, such as the element width. In other words, vector parameters are not the objects of vector operations, nor are they operands in vector operation instructions; rather, they reflect the resource allocation during vector operations.

[0030] Bit width: This refers to the size of an element in a vector, specifically how many bits an element occupies in the vector register.

[0031] Vector parameter configuration instructions: Instructions separate from vector operation instructions and used to configure the vector parameters used by vector operation instructions.

[0032] Saturation processing: If the calculation result exceeds the maximum value of the data that the required data format can store, then the maximum value is used to represent the calculation result; if the calculation result exceeds the minimum value of the data that the required data format can store, then the minimum value is used to represent the calculation result.
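Saturation as defined above can be sketched in a few lines. The helper name `saturate` is hypothetical, and the sketch assumes the full two's-complement range for signed formats (for example [-128, 127] for 8 bits); a particular instruction may saturate to a narrower range.

```python
def saturate(value: int, bits: int, signed: bool = True) -> int:
    """Clamp value to the range representable in a bits-wide integer format."""
    if signed:
        lowest = -(1 << (bits - 1))      # e.g. -128 for 8 bits
        highest = (1 << (bits - 1)) - 1  # e.g. 127 for 8 bits
    else:
        lowest = 0
        highest = (1 << bits) - 1        # e.g. 255 for 8 bits
    # Results above the maximum become the maximum; results below the
    # minimum become the minimum; everything else is unchanged.
    return max(lowest, min(highest, value))


if __name__ == "__main__":
    print(saturate(300, 8), saturate(-300, 8), saturate(300, 8, signed=False))
```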

[0033] Signed / Unsigned: Signed numbers are data with a sign bit, where the highest bit is the sign bit. If the highest bit is 0, it represents a positive number; if the highest bit is 1, it represents a negative number. Unsigned numbers are data without a sign bit.

[0034] RVV instruction set: Vector extension instruction set of RISC-V instruction set.

[0035] LMUL: The number of registers in a vector register set; it is a vector parameter. The RVV instruction set supports vector register grouping settings. LMUL can be configured in software to group multiple vector registers into a vector register set. For example, LMUL can be configured to a maximum of 8; this can be achieved by changing the size of LMUL using vector parameter configuration instructions.

[0036] In the RVV instruction set, converting a single-precision floating-point number to INT8 requires three instructions: 1. `vfncvt`, a vector floating-point number conversion instruction, to convert a single-precision floating-point number to INT16; 2. `vnclip`, a vector signed right shift instruction with saturation, to shift INT16 to INT8; and 3. `vmax`, a vector signed maximum value instruction for integers. Compared to the ARM instruction set, the RVV instruction set can perform 4x quantization with fewer instructions. However, due to data dependencies between instructions, subsequent instructions must wait for the completion of preceding instructions before being issued and executed, thus affecting hardware execution efficiency.
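The dependency chain described above can be modelled in software as three element-wise steps, each consuming the previous step's result. This is only a behavioural sketch, not the hardware semantics: the function names are hypothetical, round-half-to-even rounding is assumed for the conversion, the shift amount of the narrowing clip is assumed to be 0, the final maximum is assumed to be taken against -127 (consistent with the [-127, 127] range given in paragraph [0060]), and inputs are assumed to be finite.

```python
def clamp(value: int, lowest: int, highest: int) -> int:
    return max(lowest, min(highest, value))


def step1_convert(x: float) -> int:
    """Model of vfncvt: FP32 -> INT16 (round, then saturate to INT16)."""
    return clamp(round(x), -32768, 32767)


def step2_narrow(x: int, shift: int = 0) -> int:
    """Model of vnclip: arithmetic right shift, then saturate INT16 -> INT8."""
    return clamp(x >> shift, -128, 127)


def step3_max(x: int, floor: int = -127) -> int:
    """Model of vmax: element-wise signed maximum against a floor value."""
    return max(x, floor)


def quantize_three_steps(values):
    # Each list depends on the previous one, mirroring the data dependency
    # that forces the three instructions to issue in sequence.
    converted = [step1_convert(v) for v in values]
    narrowed = [step2_narrow(v) for v in converted]
    return [step3_max(v) for v in narrowed]


if __name__ == "__main__":
    print(quantize_three_steps([1.4, -200.0, 40000.0, -128.0, 2.5]))
```

No step can start on an element until the previous step has produced it, which is the serialization a single fused instruction avoids.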

[0037] This application aims to provide a dedicated quantization extension instruction specifically for accelerating the quantization of floating-point numbers to integers with a reduction of four times or more, thereby improving computing power and hardware performance. The specific implementation methods of this application will be described in detail below.

[0038] Figure 1 shows a schematic diagram of a computing device 100 according to an embodiment of this application. The computing device 100 is, for example, a computer system, which may be built based on various processors currently on the market and driven by an operating system such as a version of Windows™, UNIX, or Linux. The computing device 100 may be a laptop computer, desktop computer, workstation, personal digital assistant (PDA), server, blade server, mainframe computer, or mobile communication device, etc. The computing device 100 in the embodiments of this application is not limited to any specific combination of hardware circuitry and software.

[0039] As shown in Figure 1, the computing device 100 includes a processor 101. The processor 101 has data processing capabilities known in the art. The processor 101 can be a CISC architecture, RISC architecture, VLIW architecture, or an architecture combining the above instruction sets, or any processor device built for a specific purpose. Exemplarily, the processor 101 includes a central processing unit (CPU), a graphics processing unit (GPU), or a general-purpose graphics processing unit (GPGPU). The processor 101 also includes a processor core 1011 improved according to the technical solutions provided in the embodiments of this application, the specific details of which will be provided below. There can be one or more processor cores 1011 for processing and executing instructions, the processing and execution of which can be controlled by a user (e.g., through an application) and / or a system platform.

[0040] The processor 101 is coupled to the system bus 102, which can be an interconnect circuit for connecting the processor 101 and various other components. The interconnect circuit can support various interconnect protocols and interface circuits to realize the communication relationship between the processor 101 and various components.

[0041] The computing device 100 also includes a memory 103 for storing instruction information and / or data information represented by digital signals. The memory 103 may be dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, or other memory.

[0042] The computing device 100 also includes: an input device 104, such as a keyboard, mouse, etc.; an output device 105, such as various types of displays, speakers, etc.; a storage device 106, such as a hard disk, optical disk, etc.; and a communication device 107, such as a network interface card, modem, wireless transceiver, etc. The communication device 107 allows the computing device 100 to exchange information or data with other devices or systems through computer networks such as the Internet and / or various telecommunications networks.

[0043] Of course, the structure of different computing devices may vary depending on the motherboard, operating system, and instruction set architecture. For example, many current computing devices have an input / output control center connected between the system bus 102 and each input device 104 or output device 105, and this input / output control center may be integrated into the processor 101 or independent of the processor 101. The embodiments of this application do not limit this.

[0044] Figure 2 shows a schematic diagram of the structure of a system-on-a-chip (SoC) 200 according to an embodiment of this application. The SoC can be manufactured and sold as a standalone device, or it can be combined with other components to form a new device for manufacturing and sale. The SoC 200 can be manufactured using various processors currently available on the market and can be driven by operating systems such as Windows™, UNIX, Linux, Android, and RTOS. The SoC 200 can be implemented in computer devices, handheld devices, and embedded devices. Some examples of handheld devices include cellular phones, Internet Protocol devices, digital cameras, PDAs, and personal computers (PCs). Some examples of embedded devices may include network PCs, set-top boxes, network hubs, wide area network (WAN) switches, etc., and embedded devices can also be any other system executing one or more instructions.

[0045] As shown in Figure 2, the system-on-a-chip 200 includes a processor 201. The processor 201 has data processing capabilities known in the art. It can be a CISC architecture, RISC architecture, VLIW architecture, or an architecture combining the above instruction sets, or any processor device built for a specific purpose. Exemplarily, the processor 201 specifically includes a CPU, GPU, or GPGPU. The processor 201 also includes a processor core 2011 improved according to the technical solutions provided in the embodiments of this application, the specific details of which will be provided below. There can be one or more processor cores 2011 for processing instructions, the processing and execution of which can be controlled by a user (e.g., through an application) and / or the system platform.

[0046] Processor 201 is coupled to system bus 202. System bus 202 can be an interconnect circuit for connecting processor 201 and various other components, and this interconnect circuit can support various interconnect protocols and interface circuits to realize the communication relationship between processor 201 and various components. For example, system bus 202 can be an Advanced High Performance Bus (AHB) or an Advanced eXtensible Interface (AXI). As the complexity of SOC design increases and the capabilities of processors continue to improve, the choice of system bus for SOC can be diverse, and this application does not limit it.

[0047] Static random access memory 204 and flash memory 205 are used to store instruction information and / or data information represented by digital signals. For example, static random access memory 204 can serve as the runtime space for various applications (APPs), holding their heaps, stacks, and intermediate data, while flash memory 205 can store the executable code of various applications and the executable code of the operating system.

[0048] The system-on-a-chip 200 may also include various input / output (I / O) interfaces 203 coupled to the system bus 202. These I / O interfaces 203 include, but are not limited to, the following interface types: Secure Digital High Capacity (SDHC) interface, Inter-Integrated Circuit (I2C), Serial Peripheral Interface (SPI), Universal Asynchronous Receiver Transmitter (UART), Universal Serial Bus (USB), General-purpose input / output (GPIO), and Bluetooth UART. Based on the I / O interfaces 203, the system-on-a-chip 200 can be coupled to peripheral devices of corresponding interface types, such as USB devices, memory cards, message transceivers, and Bluetooth devices.

[0049] It should be noted that the computing device 100 shown in Figure 1 and the system-on-a-chip 200 shown in Figure 2 are only used to exemplify some application scenarios of the embodiments of this application, and are not intended to limit the embodiments of this application. This application improves on existing processors or processor cores, and can therefore theoretically be applied to devices or systems with any processor architecture and instruction set architecture.

[0050] Figure 3 shows a schematic diagram of the structure of a processor core 300 according to an embodiment of this application. The processor core 300 may have the same or similar structure as the processor cores 1011 and 2011 described above.

[0051] As shown in Figure 3, the processor core 300 includes a pipelined architecture. To improve the efficiency of instruction execution, the operation of an instruction is divided into multiple small steps, each completed by a dedicated circuit. The software and hardware combination that implements the instruction pipeline is called the instruction pipeline architecture. Specifically, in this embodiment, the pipelined architecture of the processor core 300 includes an instruction fetch unit 301, an instruction decode unit 302, and an instruction execution unit 303.

[0052] The instruction fetch unit 301 serves as the boot engine for the processing unit 12 and, under the control of the application or system platform, fetches vector operation instructions from memory outside the processor core 300 (e.g., the memory 103 in Figure 1), from the cache memory 18 inside the processor core 300, or from other memory units that may be located inside the processor core 300. Specifically, the fetched vector operation instructions are in encoded form; that is, the instruction fetch unit 301 fetches the encoded information of the vector operation instructions, which includes the source operand encoding information and the destination operand encoding information. In this embodiment, the source operand of the vector operation instructions is a floating-point number, and the destination operand is an integer.

[0053] The instruction decoding unit 302 decodes the extracted vector operation instructions. For example, according to a predetermined instruction format, it decodes the encoding information of the vector operation instructions to obtain the decoding information of the vector operation instructions, including the source operand decoding information and the destination operand decoding information of the vector operation instructions.

[0054] The instruction execution unit 303 executes the decoded vector operation instructions to convert a floating-point number (source operand) into an integer (destination operand) with an N-fold reduction, where N is an integer greater than or equal to 4. In other words, the vector operation instructions in this embodiment are floating-point quantization instructions with a reduction of 4 times or more. It should be noted that N is usually even, but N can also be odd; this embodiment does not limit this.

[0055] According to the technical solution of this application, the operation of converting a floating-point number to an integer with a reduction of four times or more is abstracted into a single vector operation instruction, so that this quantization function can be achieved with a single dedicated vector operation instruction. This extends the instruction set with a quantization instruction and accelerates floating-point quantization. Compared with the existing technology that uses multiple vector operation instructions for quantization, the technical solution of this application can effectively reduce the number of instructions and avoid the performance loss caused by the need to execute multiple dependent instructions sequentially.

[0056] For example, the floating-point quantization instruction with a reduction of 4 times or more provided in the embodiments of this application can quantize the inputs, weight parameters, and other parameters of the operation nodes in a deep learning model, thereby reducing the requirements for data throughput and storage space, improving computing power, and reducing memory access bandwidth requirements. It can be widely used in deep learning inference scenarios on edge computing and IoT devices.

[0057] It should be noted that the programming model may vary depending on the instruction set architecture, and the assembly function for the floating-point quantization instruction with a reduction of more than 4 times may also vary. However, as long as the corresponding quantization function can be achieved, it is acceptable. This embodiment does not limit this.

[0058] In one implementation, the instruction execution unit 303 is specifically used to execute the decoded vector operation instruction to convert single-precision floating-point numbers into INT8. INT8 is widely used in deep learning inference. Taking the XuanTie C910 as an example, if the computing power of INT8 is increased by 4 times on the existing basis, the quantization overhead in some typical convolutional layers will exceed 50%. Therefore, providing an instruction to convert single-precision floating-point numbers to INT8, and converting single-precision floating-point numbers to INT8 for participation in deep learning operations, can accelerate deep learning algorithms.

[0059] In one example, the vector operation instruction is a signed vector operation instruction. The instruction execution unit 303 is specifically used to execute the decoded signed vector operation instruction to convert the floating-point number into a signed integer by N times reduction.

[0060] Taking the instruction of converting a single-precision floating-point number to a signed INT8 number by 4 times reduction as an example, the functions that the assembly function needs to implement include: converting a single-precision floating-point number to a 32-bit integer (INT32); converting INT32 to a signed INT8 number with saturation processing, and saturating the result to [-127, 127]. That is, if the conversion result exceeds the maximum value of 127, then the maximum value of 127 is used to represent the conversion result; if it exceeds the minimum value of -127, then the minimum value of -127 is used to represent the conversion result.

[0061] In another example, the vector operation instruction is an unsigned vector operation instruction. The instruction execution unit 303 is specifically used to execute the decoded unsigned vector operation instruction to convert the floating-point number into an unsigned integer with an N-fold reduction.

[0062] Taking the instruction that converts a single-precision floating-point number to an unsigned INT8 with a 4-fold reduction as an example, the functions that the assembly function needs to implement include: converting a single-precision floating-point number to INT32; converting INT32 to an unsigned 8-bit integer with saturation processing, and saturating the result to [0, 255]. That is, if the conversion result exceeds the maximum value of 255, then the maximum value of 255 is used to represent the conversion result; if it falls below the minimum value of 0, then the minimum value of 0 is used to represent the conversion result.
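The signed and unsigned behaviours described in paragraphs [0060] and [0062] can be sketched as one scalar function applied per element. The names `fp32_to_int8` and `vector_fp32_to_int8` are hypothetical. The saturation ranges [-127, 127] and [0, 255] come from the text; the rounding mode (round-half-to-even here) is an assumption, since the text does not specify one, and inputs are assumed to be finite.

```python
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1


def fp32_to_int8(x: float, signed: bool = True) -> int:
    """Two-stage model: float -> INT32, then INT32 -> 8-bit with saturation."""
    # Stage 1: convert to a 32-bit integer (rounding mode is an assumption).
    as_int32 = max(INT32_MIN, min(INT32_MAX, round(x)))
    # Stage 2: saturate to the range given in the text.
    lowest, highest = (-127, 127) if signed else (0, 255)
    return max(lowest, min(highest, as_int32))


def vector_fp32_to_int8(values, signed: bool = True):
    """Apply the conversion to every element, as a vector instruction would."""
    return [fp32_to_int8(v, signed) for v in values]


if __name__ == "__main__":
    data = [12.3, 300.7, -300.2, -128.0]
    print(vector_fp32_to_int8(data, signed=True))
    print(vector_fp32_to_int8(data, signed=False))
```

Both stages happen inside one call here, which mirrors the idea of a single instruction replacing a chain of dependent conversion, narrowing, and maximum instructions.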

[0063] In another implementation, the instruction execution unit 303 is specifically used to execute the decoded vector operation instructions to convert double-precision floating-point numbers into 16-bit integers. The specific implementation and assembly functions are analogous to those described for converting single-precision floating-point numbers to INT8, and will not be repeated here.

[0064] When designing instructions, the source operand width and destination operand width can be determined according to the application requirements of the instruction set, and then encoded into the vector operation instructions. For example, in an application scenario requiring INT8, a single-precision floating-point number can be converted to INT8 using one assembly instruction, or in an application scenario requiring INT16, a double-precision floating-point number can be converted to INT16 using another assembly instruction. However, this approach is limited by the encoding space and requires the assembly functions to be designed specifically for different application scenarios. Therefore, this embodiment also provides an optimized implementation method, namely, defining vector operation instructions that convert a floating-point number to an integer with an N-fold reduction, but without specifying the specific values of the source and destination operand widths in the vector operation instructions. Taking a 4-fold reduction as an example: the vector operation instruction `vfn4cvt.x` can convert single-precision floating-point numbers to INT8 (signed) and double-precision floating-point numbers to INT16 (signed); the vector operation instruction `vfn4cvt.xu` can convert single-precision floating-point numbers to INT8 (unsigned) and double-precision floating-point numbers to INT16 (unsigned). The source operand width and destination operand width can be configured using vector parameter configuration instructions.

[0065] Specifically, in this embodiment, the instruction extraction unit 301 is also used to extract vector parameter configuration instructions. Based on the vector parameter configuration instructions, vector parameters can be determined, including the source operand width and destination operand width of the vector operation instructions. After determining the source operand width and destination operand width, the instruction execution unit 303 can execute the decoded vector operation instructions based on the source operand width and destination operand width to convert single-precision floating-point numbers to INT8 or double-precision floating-point numbers to INT16. Furthermore, the configured vector parameters can be reused by subsequent vector operation instructions.

[0066] Furthermore, the instruction extraction unit 301 extracts the encoding information of the vector parameter configuration instruction. The vector parameter can be an immediate value, meaning the value of the vector parameter is directly given in the encoding information of the vector parameter configuration instruction. The instruction extraction unit 301 can directly determine the vector parameter (such as the source operand width and destination operand width in a vector operation instruction) from the encoding information of the vector parameter configuration instruction. The vector parameter can also be a non-immediate value. For example, if the encoding information of the vector parameter configuration instruction includes the address of the register storing the vector parameter, the instruction decoding unit 302 needs to decode the encoding information of the vector parameter configuration instruction and then address the corresponding register according to the register address to determine the vector parameter (such as the source operand width and destination operand width in a vector operation instruction).

[0067] Furthermore, as shown in Figure 3, the processor core 300 also includes a register set 304. Register set 304 includes multiple registers used to store the source and destination operands of vector operation instructions. In the encoding information of the vector operation instructions, the source and destination operands can be given as register information. After the instruction decoding unit 302 decodes the encoding information of the vector operation instructions, it addresses the corresponding registers according to their addresses, thereby obtaining the source and destination operands of the vector operation instructions.

[0068] In an application example, as shown in the table below, `sew` represents the destination operand bit width, `vd` represents the destination operand, and `vs2` represents the source operand. The destination operand bit width of a vector operation instruction is given by `sew`, and the source operand bit width is given by `4*sew`. For example, for the vector operation instruction `vfn4cvt`, the source operand is read from the vector register corresponding to `vs2`, its bit width is determined as `4*sew`, and it is converted with 4-fold narrowing into the destination operand, which is stored in the vector register corresponding to `vd`. When `sew` equals 8, the instruction converts a single-precision floating-point number to INT8; when `sew` equals 16, it converts a double-precision floating-point number to INT16.

[0069]

| Vector parameter configuration instruction | Vector operation instruction |
|---|---|
| `vd`: `sew`; `vs2`: `4*sew` | `vfn4cvt vd, vs2` |
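
The relationship between `sew` and the operand formats in this application example can be written out as a small lookup. This is a minimal sketch; the function name `vfn4cvt_widths` and the dictionary layout are illustrative assumptions, not part of the instruction definition.

```python
# Destination width (sew) -> (source format, destination format),
# limited to the two cases named in the application example.
FORMATS = {
    8: ("single-precision", "INT8"),
    16: ("double-precision", "INT16"),
}

def vfn4cvt_widths(sew: int) -> dict:
    """Derive the source operand width and formats for vfn4cvt from sew."""
    if sew not in FORMATS:
        raise ValueError(f"sew={sew} is not covered by the application example")
    source_format, destination_format = FORMATS[sew]
    return {
        "sew": sew,          # destination operand bit width
        "eew": 4 * sew,      # source operand bit width
        "source": source_format,
        "destination": destination_format,
    }
```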

[0070] Furthermore, instruction execution requires valid vector parameter settings. Specifically, in this embodiment, vector parameters include source operand bit width, destination operand bit width, register information (such as the number of registers for the source operand and the number of registers for the destination operand), etc. The instruction decoding unit 302 is also used to decode the extracted vector parameter configuration instruction to determine the vector parameters, and if the data type of the vector parameters conforms to a preset data type and / or if the vector parameters conform to a preset register configuration rule, it sends the source operand bit width and destination operand bit width to the instruction execution unit 303; the instruction execution unit 303 executes the decoded vector operation instruction based on the source operand bit width and destination operand bit width.

[0071] Both the source and destination operand widths must conform to the preset data types. This can be understood as the source operand width (eew) and destination operand width (sew) needing to meet the configuration requirements of the corresponding instruction set architecture. For example, in the RVV instruction set architecture, vector floating-point narrowing conversion instructions support sew = 8 / 16 / 32 and eew = 16 / 32 / 64. If sew or eew is detected to be outside these values, the vector operation instruction is illegal and cannot be executed. As another example, the vector operation instruction vfn4cvt requires eew = 4 * sew. Before the vector operation instruction vfn4cvt is executed, if it is detected that eew is not equal to 4 * sew, the vector operation instruction vfn4cvt is illegal and cannot be executed.
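
The two width checks described above can be combined into a single legality test. The following is a minimal sketch under the values quoted in this paragraph; the function name `vfn4cvt_is_legal` and the constant names are illustrative assumptions.

```python
SUPPORTED_SEW = {8, 16, 32}   # destination widths quoted in the text
SUPPORTED_EEW = {16, 32, 64}  # source widths quoted in the text
NARROWING_FACTOR = 4          # vfn4cvt requires eew == 4 * sew

def vfn4cvt_is_legal(sew: int, eew: int) -> bool:
    """Return True if the width configuration allows vfn4cvt to execute."""
    if sew not in SUPPORTED_SEW or eew not in SUPPORTED_EEW:
        return False  # widths outside the supported data types
    return eew == NARROWING_FACTOR * sew
```

Under these values only (sew, eew) = (8, 32) and (16, 64) pass both checks, which matches the single-precision to INT8 and double-precision to INT16 cases; sew = 32 would need a 128-bit source, which lies outside the supported source widths.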

[0072] The number of registers for the source operand and the number of registers for the destination operand must also conform to the preset data types. This can be understood as the number of registers for the destination operand (LMUL) and the number of registers for the source operand (EMUL) needing to meet the configuration requirements of the corresponding instruction set architecture.

[0073] Register configuration rules include register alignment rules and overlap rules. In this embodiment, the register configuration rule can be: sew is smaller than eew, and any overlap between the destination registers and the source registers lies only in the lowest-indexed portion of the source operand's register group. If the vector parameters are detected not to conform to this rule, the vector operation instruction vfn4cvt is illegal and cannot be executed.
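
One possible reading of the alignment and overlap rules is sketched below. It assumes whole-register groups (register counts of at least 1), that each group must start at a register index that is a multiple of its size, and that "lowest index portion" means the destination group starts at the first register of the source group. These interpretations, and the function name `registers_legal`, are assumptions of the sketch rather than statements from the text.

```python
def registers_legal(vd: int, dst_regs: int, vs2: int, src_regs: int) -> bool:
    """Check a destination/source register-group pairing for a narrowing conversion.

    vd, vs2   : index of the first register of each group
    dst_regs  : number of registers in the destination group (assumed >= 1)
    src_regs  : number of registers in the source group (assumed >= 1)
    """
    # Assumed alignment rule: each group starts at a multiple of its size.
    if vd % dst_regs != 0 or vs2 % src_regs != 0:
        return False
    destination = set(range(vd, vd + dst_regs))
    source = set(range(vs2, vs2 + src_regs))
    if not (destination & source):
        return True  # no overlap at all
    # Assumed overlap rule: overlap is allowed only at the lowest-indexed
    # part of the source group, i.e. both groups start at the same register.
    return vd == vs2
```

For a 4-fold narrowing with one destination register and a four-register source group starting at index 0, this sketch accepts a destination of register 0 or register 4 and rejects a destination of register 1.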

[0074] Furthermore, other components of the processor core 300 in the above embodiments of this application, such as the cache, instruction issuing unit, and instruction rollback unit, can employ various technical solutions known to those skilled in the art now or in the future, and will not be described in detail here.

[0075] Figure 4 is a flowchart illustrating an instruction processing method according to an embodiment of this application. This instruction processing method corresponds to the structure of the processor core 300, meaning it can be executed by the processor core 300. Therefore, the instruction processing method of this embodiment will be described in detail below with reference to the processor core 300 described in the foregoing embodiments.

[0076] As shown in Figure 4, the instruction processing method includes:

[0077] Step S401: Extract vector operation instructions;

[0078] Step S402: Decode the extracted vector operation instructions;

[0079] Step S403: Execute the decoded vector operation instruction to convert a floating-point number into an integer with N-fold narrowing of the bit width, where N is an integer greater than or equal to 4.

[0080] In one embodiment, the instruction processing method further includes: extracting a vector parameter configuration instruction, wherein the vector parameter configuration instruction is used to determine vector parameters, the vector parameters including the source operand bit width and the destination operand bit width of the vector operation instruction; and in step S403, executing the decoded vector operation instruction includes: executing the decoded vector operation instruction based on the source operand bit width and the destination operand bit width.

[0081] In one implementation, the source operand width is 32 bits and the destination operand width is 8 bits; or, the source operand width is 64 bits and the destination operand width is 16 bits.
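
Steps S401 to S403, together with the vector parameter configuration step, can be pictured with a small software model. Everything concrete in it is hypothetical: the textual instruction format, the `vcfg` mnemonic for the configuration instruction, the register names, and the rounding and saturation behavior are assumptions made so that the sketch runs. It also assumes finite input values.

```python
def fetch(program, pc):
    """S401: extract the instruction at position pc."""
    return program[pc]

def decode(text):
    """S402: split an instruction into its opcode and operand list."""
    opcode, _, rest = text.partition(" ")
    operands = [token.strip() for token in rest.split(",") if token.strip()]
    return opcode, operands

def execute(opcode, operands, state):
    """S403: run one decoded instruction against the machine state."""
    if opcode == "vcfg":  # hypothetical vector parameter configuration instruction
        state["sew"] = int(operands[0])
        state["eew"] = 4 * state["sew"]
        return
    if opcode not in ("vfn4cvt.x", "vfn4cvt.xu"):
        raise ValueError(f"unknown opcode: {opcode}")
    vd, vs2 = operands
    sew = state["sew"]  # configured earlier and reused by later instructions
    if opcode == "vfn4cvt.x":
        lo, hi = -(1 << (sew - 1)), (1 << (sew - 1)) - 1
    else:
        lo, hi = 0, (1 << sew) - 1
    # Assumed semantics: round-half-to-even, then saturate to the destination range.
    state["vregs"][vd] = [max(lo, min(hi, round(x))) for x in state["vregs"][vs2]]

def run(program, state):
    """Process every instruction in order: fetch, decode, execute."""
    for pc in range(len(program)):
        opcode, operands = decode(fetch(program, pc))
        execute(opcode, operands, state)
    return state
```

A program such as `["vcfg 8", "vfn4cvt.x v4, v8"]` configures the widths once and then performs the conversion with a single vector operation instruction; further conversion instructions appended to the program would reuse the same configuration.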

[0082] For example, the instruction extraction unit 301 can be used to execute step S401 and the extraction step of the vector parameter configuration instruction, the instruction decoding unit 302 can be used to execute step S402, and the instruction execution unit 303 can be used to execute step S403. Therefore, for the implementation details of this instruction processing method, refer to the preceding detailed description of the processor core 300. Its implementation is similar to that of the embodiment of the processor core 300; only the perspective of the description differs. To save space, further details will not be repeated.

[0083] In the description of this specification, the reference to terms such as "embodiment," "an implementation," and "example" indicates that a specific feature, structure, or characteristic described in connection with that embodiment, implementation, or example is included in at least one embodiment, implementation, or example of this application. Furthermore, the described specific features, structures, or characteristics may be combined in any suitable manner in one or more embodiments, implementations, or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the features of different embodiments or implementations or examples described in this specification.

[0084] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this application, "a plurality of" means two or more, unless otherwise explicitly specified.

[0085] Any process or method description in the flowchart or otherwise herein can be understood as representing a module, segment, or portion of code comprising one or more executable instructions for implementing a particular logical function or process. Furthermore, the scope of the preferred embodiments of this application includes additional implementations in which functions may be performed not in the order shown or discussed, including substantially simultaneously or in reverse order depending on the functionality involved.

[0086] The logic and / or steps represented in the flowchart or otherwise described herein can, for example, be considered a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable medium for use by, or in conjunction with, an instruction execution system, apparatus or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus or device and execute them).

[0087] It should be understood that the various parts of this application can be implemented using hardware, software, firmware, or a combination thereof. Furthermore, the functional units in the various embodiments of this application can be integrated into a single processing module, or each unit can exist physically separately, or two or more units can be integrated into a single module. The integrated module described above can be implemented in hardware or as a software functional module.

[0088] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any person skilled in the art can easily conceive of various variations or substitutions within the technical scope disclosed in this application, and these should all be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A processor core, comprising: an instruction extraction unit configured to extract vector operation instructions and vector parameter configuration instructions, wherein the vector parameter configuration instructions are used to determine vector parameters, the vector parameters include the source operand bit width and the destination operand bit width of the vector operation instructions, and the vector parameters are reused by multiple subsequent vector operation instructions; an instruction decoding unit configured to decode the extracted vector operation instructions; and an instruction execution unit configured to execute the decoded vector operation instructions based on the source operand bit width and the destination operand bit width, so as to convert a floating-point number into an integer by N times reduction, where N is an integer greater than or equal to 4.

2. The processor core according to claim 1, wherein the instruction execution unit is specifically used to execute the decoded vector operation instructions to convert a single-precision floating-point number into an 8-bit integer or a double-precision floating-point number into a 16-bit integer.

3. The processor core according to claim 1, wherein the source operand has a bit width of 32 bits and the destination operand has a bit width of 8 bits; or, the source operand has a bit width of 64 bits and the destination operand has a bit width of 16 bits.

4. The processor core according to claim 1, wherein the instruction decoding unit is further configured to decode the extracted vector parameter configuration instruction to determine the vector parameter, and, if the data type of the vector parameter conforms to a preset data type, send the source operand bit width and the destination operand bit width to the instruction execution unit.

5. The processor core according to claim 1, further comprising: a register set including multiple registers for storing the source operands and destination operands of the vector operation instructions; wherein the vector parameters also include register information, and the instruction decoding unit is further configured to decode the extracted vector parameter configuration instruction to determine the vector parameters, and, if the vector parameters conform to the preset register configuration rules, send the source operand bit width and the destination operand bit width to the instruction execution unit.

6. The processor core according to claim 1, wherein the vector operation instruction is an unsigned vector operation instruction, and the instruction execution unit is specifically used to execute the decoded unsigned vector operation instruction to convert the floating-point number into an unsigned integer by N times reduction.

7. The processor core according to claim 1, wherein the vector operation instruction is a signed vector operation instruction, and the instruction execution unit is specifically used to execute the decoded signed vector operation instruction to convert the floating-point number into a signed integer by N times reduction.

8. A processor comprising at least one processor core as claimed in any one of claims 1 to 7.

9. A system-on-a-chip, comprising at least one processor core as described in any one of claims 1 to 7.

10. A computing device comprising a memory and a processor coupled to each other, the processor being the processor as described in claim 8.

11. An instruction processing method, comprising: extracting vector operation instructions and vector parameter configuration instructions, wherein the vector parameter configuration instructions are used to determine vector parameters, the vector parameters include the source operand bit width and the destination operand bit width of the vector operation instructions, and the vector parameters are reused by multiple subsequent vector operation instructions; decoding the extracted vector operation instructions; and executing, based on the source operand bit width and the destination operand bit width, the decoded vector operation instructions to convert a floating-point number into an integer by N times reduction, where N is an integer greater than or equal to 4.

12. The instruction processing method according to claim 11, wherein the source operand has a bit width of 32 bits and the destination operand has a bit width of 8 bits; or, the source operand has a bit width of 64 bits and the destination operand has a bit width of 16 bits.