Central processing unit core, instruction processing method, electronic device and storage medium

By designing multiple instruction fetch and decode pipelines at the processor core front end and combining them with instruction cache and microinstruction cache, the problems of decoding latency and increased power consumption in existing technologies are solved, achieving more efficient instruction processing and improved multi-threaded performance.

WO2026137680A1 · PCT designated stage · Publication Date: 2026-07-02 · HYGON INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office: WO
Patent Type: Application
Current Assignee / Owner: HYGON INFORMATION TECH CO LTD
Filing Date: 2025-05-19
Publication Date: 2026-07-02

AI Technical Summary

Technical Problem

Existing processor core front-ends suffer from latency and increased power consumption during instruction decoding, especially when using micro-instruction caches, which leads to reduced bandwidth and fails to meet the instruction distribution requirements of high-performance processors.

Method used

The design incorporates multiple fetch-decode pipelines, combines them with instruction caches and microinstruction caches, and couples the caches to the pipelines using memory interleaving or multi-port techniques. This enables flexible selection and processing of instructions and microinstructions, supports both instruction decoding mode and microinstruction cache mode, and allows multiple pipelines to be fully utilized in multi-threaded environments.

Benefits of technology

It increases the processor's front-end bandwidth, improves processor performance and multi-threaded performance, ensures efficient pipeline utilization in different modes, and enhances the processor's overall performance and power efficiency.



Abstract

The embodiments of the present disclosure relate to a central processing unit core comprising a front-end, and an instruction processing method, a device and a medium. The central processing unit core front-end comprises N instruction fetch and decode pipelines, at least one instruction cache and at least one micro-op cache, N being an integer greater than or equal to 2. Each instruction fetch and decode pipeline comprises instruction fetch selection logic, an instruction cache port, a micro-op cache port, a decoder and a first micro-op queue. Prediction information is processed by means of an instruction decode pipeline or a micro-op pipeline among the N instruction fetch and decode pipelines and then enters the first micro-op queue for dispatch. The central processing unit core has a larger average front-end bandwidth and an improved processing performance.

Description

Processor core, instruction processing method, electronic device and storage medium

[0001] Cross-reference of related applications

[0002] This application claims priority to Chinese Patent Application No. 202411919034.0, filed on December 24, 2024, the disclosure of which is incorporated herein by reference in its entirety.

Technical Field

[0003] This disclosure relates to a processor core, an instruction processing method, an electronic device, and a computer-readable storage medium.

Background Technology

[0004] The CPU core front-end typically includes modules for branch prediction (BP), instruction fetch (IF), instruction decode (DE), and instruction dispatch (DI). In some high-performance processors, the instruction decode module also includes a micro-op cache (u-op cache). The micro-op cache stores the micro-instructions obtained from instruction decoding. When the address of these micro-instructions is accessed a second time, the processor accesses the micro-op cache directly and neither accesses the instruction cache nor decodes the instruction again. In some processor architectures, instruction decoding increases latency, reduces bandwidth, and increases power consumption. If a micro-op cache is used, these negative effects of decoding are significantly reduced for a large number of programs. The decision of whether to access the instruction cache or the micro-op cache is made between the branch prediction and instruction fetch phases.

Summary of the Invention

[0005] At least one embodiment of this disclosure provides a processor core including a front-end. The front-end includes N fetch-decode pipelines, at least one instruction cache, and at least one microinstruction cache, where N is an integer greater than or equal to 2. Each fetch-decode pipeline includes fetch selection logic, an instruction cache port, a microinstruction cache port, a decoder, and a first microinstruction queue. The instruction cache port is configured to read and write the instruction cache and provide instructions fetched from the instruction cache to the decoder. The microinstruction cache port is configured to read and write the microinstruction cache and provide a first microinstruction fetched from the microinstruction cache to the first microinstruction queue. The decoder is configured to decode instructions fetched from the instruction cache port into a second microinstruction and provide the decoded second microinstruction to the first microinstruction queue. The fetch selection logic is configured to select the instruction cache port and the decoder for fetch-decode operation, or select the microinstruction cache port for fetch-decode operation.

[0006] For example, at least one embodiment of the present disclosure provides a processor core in which the front end further includes: an instruction dispatch unit configured to dispatch microinstructions received from a first microinstruction queue of one of the N instruction fetch-decode pipelines for execution.

[0007] For example, in at least one embodiment of the present disclosure, the processor core is provided in which the instruction cache is configured to be coupled to the instruction cache ports of the N instruction fetch-decode pipelines via memory interleaving technology or multi-port technology, so that the N instruction fetch-decode pipelines can access the instruction cache; and / or the microinstruction cache is configured to be coupled to the microinstruction cache ports of the N instruction fetch-decode pipelines via memory interleaving technology or multi-port technology, so that the N instruction fetch-decode pipelines can access the microinstruction cache.
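The memory interleaving option mentioned above can be illustrated with a minimal Python sketch. It assumes a cache split into `NUM_BANKS` banks selected by low-order line-address bits, and a fixed-priority grant rule in which a later request to an already busy bank stalls for the cycle. The bank count, line size, function names, and the grant rule are all hypothetical and are not taken from this disclosure.

```python
LINE_BYTES = 64   # assumed cache line size in bytes
NUM_BANKS = 4     # assumed number of interleaved banks


def bank_of(addr: int) -> int:
    """Map an address to a bank using the low-order bits of its line index."""
    return (addr // LINE_BYTES) % NUM_BANKS


def grant_accesses(requests):
    """Grant at most one access per bank in a cycle.

    `requests` is a list of (pipeline_id, addr) in priority order.
    Returns (granted_pipeline_ids, stalled_pipeline_ids).
    """
    busy_banks = set()
    granted, stalled = [], []
    for pipeline_id, addr in requests:
        bank = bank_of(addr)
        if bank in busy_banks:
            stalled.append(pipeline_id)   # bank conflict: retry in a later cycle
        else:
            busy_banks.add(bank)
            granted.append(pipeline_id)
    return granted, stalled
```

In this sketch, two pipelines that touch different banks are both served in the same cycle, which is the property that lets a single cache serve N fetch-decode pipelines without a true multi-port array.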

[0008] For example, at least one embodiment of the present disclosure provides a processor core in which the front end further includes: a branch prediction unit configured to generate a prediction result based on the received instruction address for use in the N instruction fetch-decode pipelines.

[0009] For example, at least one embodiment of the processor core provided in this disclosure includes a front-end that further includes a boundary information determination unit configured to generate boundary information based on the prediction result and generate at least one information stream based on the prediction result; wherein the type of the boundary information includes the last byte of a jump branch instruction, the last byte of the prediction result, or the last byte of an intermediate instruction in the prediction result.

[0010] For example, in at least one embodiment of the present disclosure, the front end of the processor core further includes: a window selection unit configured to allocate the at least one information stream to the N fetch-decode pipelines for processing, wherein the allocation strategy of the window selection unit is based on any one of the following: the number of prediction results, the number of bytes of the prediction results, the blocking degree of the N fetch-decode pipelines, and the window switching frequency in different modes.
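One of the listed allocation strategies, the blocking degree of the pipelines, can be sketched in Python as follows. The sketch assumes that the blocking degree is approximated by the amount of work already queued on each pipeline and that each information stream goes to the least blocked pipeline. This proxy, the tie-breaking rule, and the function names are hypothetical illustrations rather than the strategy defined by this disclosure.

```python
def select_pipeline(blocking):
    """Return the index of the least blocked pipeline (lowest index wins ties)."""
    return min(range(len(blocking)), key=lambda i: blocking[i])


def allocate_streams(stream_sizes, n_pipelines):
    """Assign each information stream to a pipeline, tracking queued work.

    `stream_sizes` lists the size of each stream (for example in bytes).
    Returns the pipeline index chosen for each stream, in order.
    """
    blocking = [0] * n_pipelines      # assumed proxy for the blocking degree
    assignment = []
    for size in stream_sizes:
        target = select_pipeline(blocking)
        assignment.append(target)
        blocking[target] += size      # the chosen pipeline becomes more blocked
    return assignment
```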

[0011] For example, in at least one embodiment of the present disclosure, a processor core is provided, wherein the front end further includes: reordering logic, configured to reorder the microinstructions obtained by the N instruction fetch-decode pipelines processing the at least one information stream and then send them to the instruction dispatch unit.
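A minimal Python sketch of such reordering is given below. It assumes that each information stream carries a sequence number assigned in program order and that the pipelines may finish streams out of order. The class name, the sequence-number tagging, and the release rule are assumptions made for illustration only.

```python
class ReorderLogic:
    """Release microinstructions to dispatch strictly in stream order."""

    def __init__(self):
        self.pending = {}    # sequence number -> microinstructions awaiting release
        self.next_seq = 0    # next sequence number allowed to leave

    def accept(self, seq, uops):
        """Store the output for stream `seq` and return whatever can now be released."""
        self.pending[seq] = list(uops)
        released = []
        while self.next_seq in self.pending:
            released.extend(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released
```

If stream 1 completes before stream 0, `accept(1, ...)` returns an empty list and its microinstructions are held until stream 0 arrives.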

[0012] For example, at least one embodiment of the processor core provided in this disclosure includes a front-end further comprising: a first arbitration logic and a second arbitration logic, wherein the prediction result includes prediction results for M threads, where M is an integer greater than or equal to 2; the first arbitration logic is configured to distribute the prediction results of the M threads to the N instruction fetch-decode pipelines for processing; and the second arbitration logic is configured to sequentially select and send the microinstructions corresponding to the M threads obtained after processing by the N instruction fetch-decode pipelines to the instruction dispatch unit.
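The two arbitration steps can be sketched in Python under simple assumptions: the first arbitration maps thread t to pipeline t mod N, and the second arbitration picks one microinstruction per thread in round-robin order. Both policies and both function names are hypothetical; the disclosure only states that the prediction results are distributed and that the resulting microinstructions are selected sequentially.

```python
from collections import deque


def first_arbitration(predictions, n_pipelines):
    """Distribute (thread_id, address) prediction results across pipelines."""
    lanes = [[] for _ in range(n_pipelines)]
    for thread_id, addr in predictions:
        lanes[thread_id % n_pipelines].append((thread_id, addr))  # assumed static mapping
    return lanes


def second_arbitration(per_thread_uops):
    """Select microinstructions for dispatch, one per thread per round."""
    queues = {t: deque(u) for t, u in sorted(per_thread_uops.items())}
    order = []
    while any(queues.values()):
        for thread_id, queue in queues.items():
            if queue:
                order.append((thread_id, queue.popleft()))
    return order
```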

[0013] For example, at least one embodiment of the present disclosure provides a processor core in which each of the instruction fetch-decode pipelines further includes a second microinstruction queue configured to receive microinstructions from the decoder and / or the microinstruction cache port in each of the instruction fetch-decode pipelines.

[0014] At least one embodiment of this disclosure also provides an instruction processing method, comprising: in response to an instruction processing request, selecting a target fetch-decode pipeline among N fetch-decode pipelines included in the front end of a processor core to respond to the instruction processing request, wherein N is an integer greater than or equal to 2; in the target fetch-decode pipeline, selecting either through an instruction cache port and a decoder, or through a microinstruction cache port, to perform an instruction fetch-decode operation in response to the instruction processing request, wherein in response to selecting the microinstruction cache port to perform the instruction fetch-decode operation in response to the instruction processing request, the method comprises: retrieving a first microinstruction from a microinstruction cache through the microinstruction cache port and providing the first microinstruction to a first microinstruction queue, or in response to selecting through the instruction cache port and the decoder to perform the instruction fetch-decode operation in response to the instruction processing request, the method comprises: retrieving an instruction from an instruction cache through the instruction cache port and providing the instruction to the decoder, wherein the decoder decodes the instruction into a second microinstruction and provides the second microinstruction to the first microinstruction queue.

[0015] For example, at least one embodiment of the instruction processing method provided in this disclosure further includes: distributing microinstructions received from a first microinstruction queue of one of the N instruction fetch-decode pipelines for execution.

[0016] For example, at least one embodiment of the instruction processing method provided in this disclosure includes an instruction cache coupled to the instruction cache port of a target fetch-decode pipeline among the N fetch-decode pipelines via memory interleaving or multi-port technology, so that the target fetch-decode pipeline among the N fetch-decode pipelines can access the instruction cache; and / or the microinstruction cache coupled to the microinstruction cache port of the target fetch-decode pipeline among the N fetch-decode pipelines via memory interleaving or multi-port technology, so that the target fetch-decode pipeline among the N fetch-decode pipelines can access the microinstruction cache.

[0017] For example, at least one embodiment of the instruction processing method provided in this disclosure further includes: generating a prediction result based on the received instruction address to generate the instruction processing request for a target instruction fetch / decode pipeline among the N instruction fetch / decode pipelines.

[0018] For example, at least one embodiment of the instruction processing method provided in this disclosure further includes: generating boundary information based on the prediction result corresponding to the instruction processing request, and generating at least one information stream based on the prediction result, wherein the type of the boundary information includes the last byte of a jump branch instruction, the last byte of the prediction result, or the last byte of an intermediate instruction in the prediction result.

[0019] For example, the instruction processing method provided in at least one embodiment of this disclosure further includes: allocating the at least one information stream to a target fetch-decode pipeline among the N fetch-decode pipelines through window selection logic to process the at least one information stream, wherein the window selection logic includes any one of the following: the number of prediction results, the number of bytes of the prediction results, the blocking degree of the target fetch-decode pipeline among the N fetch-decode pipelines, and the window switching frequency in different modes.

[0020] For example, at least one embodiment of the instruction processing method provided in this disclosure further includes: reordering the microinstructions obtained by the target fetch-decode pipeline in the N fetch-decode pipelines from processing the at least one information stream for distribution.

[0021] For example, at least one embodiment of the present disclosure provides an instruction processing method, wherein the prediction result includes prediction results for M threads, where M is an integer greater than or equal to 2, and the method further includes: allocating the prediction results of the M threads corresponding to the instruction processing request to the target fetch-decode pipeline among the N fetch-decode pipelines for processing; and sequentially selecting the microinstructions corresponding to the M threads obtained after processing by the target fetch-decode pipeline among the N fetch-decode pipelines for distribution.

[0022] For example, at least one embodiment of the present disclosure provides an instruction processing method, wherein responding to the instruction processing request by selecting the microinstruction cache port to perform an instruction fetch and decode operation includes: retrieving a third microinstruction from the microinstruction cache through the microinstruction cache port and providing the third microinstruction to a second microinstruction queue; and responding to the instruction processing request by selecting the instruction cache port and the decoder to perform an instruction fetch and decode operation includes: retrieving an instruction from the instruction cache through the instruction cache port and providing the instruction to the decoder, wherein the decoder decodes the instruction into a fourth microinstruction and provides the fourth microinstruction to the second microinstruction queue.

[0023] At least one embodiment of this disclosure also provides an electronic device including the processor core described in any of the above embodiments.

[0024] At least one embodiment of this disclosure also provides an electronic device, including: a processor; and a memory including one or more computer program instructions; wherein the one or more computer program instructions are executed by the processor to perform the instruction processing method provided in any of the above embodiments.

[0025] At least one embodiment of this disclosure provides a computer-readable storage medium for non-transitory storage of computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, implement the instruction processing method provided in any embodiment of this disclosure.

Attached Figure Description

[0026] To more clearly illustrate the technical solutions of the embodiments of this disclosure, the accompanying drawings of the embodiments will be briefly described below. Obviously, the drawings described below only relate to some embodiments of this disclosure and are not intended to limit this disclosure.

[0027] Figure 1 shows a schematic diagram of a processor core pipeline;

[0028] Figure 2 shows a schematic diagram of the basic structure of a processor;

[0029] Figure 3 shows a schematic diagram of an example of the front-end configuration of a CPU core;

[0030] Figure 4 shows a schematic diagram of a branch prediction module;

[0031] Figure 5 illustrates a schematic diagram of the instruction decoding operation in the processor core;

[0032] Figure 6 shows a schematic diagram of the structure of a processor core according to at least one embodiment of the present disclosure;

[0033] Figure 7 shows a schematic flowchart of an instruction processing method according to at least one embodiment of the present disclosure;

[0034] Figure 8 shows a detailed flowchart of an instruction processing method according to at least one embodiment of the present disclosure;

[0035] Figure 9 illustrates a structural example of a processor core according to at least one embodiment of the present disclosure;

[0036] Figure 10 illustrates a structural example of a processor core according to at least another embodiment of the present disclosure;

[0037] Figure 11 illustrates a structural example of a processor core according to at least one embodiment of the present disclosure;

[0038] Figure 12 shows a schematic block diagram of an electronic device according to at least one embodiment of the present disclosure;

[0039] Figure 13 shows a schematic diagram of the structure of a computer-readable storage medium according to at least one embodiment of the present disclosure; and

[0040] Figure 14 shows a schematic block diagram of an electronic device according to at least another embodiment of the present disclosure.

Detailed Implementation

[0041] To make the objectives, technical solutions, and advantages of the embodiments of this disclosure clearer, the technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this disclosure. All other embodiments obtained by those skilled in the art based on the described embodiments of this disclosure without creative effort are within the scope of protection of this disclosure.

[0042] Unless otherwise defined, the technical or scientific terms used herein should have the ordinary meaning understood by one of ordinary skill in the art to which this disclosure pertains. The terms “first,” “second,” and similar terms used in this disclosure do not indicate any order, quantity, or importance, but are merely used to distinguish different components. Similarly, terms such as “comprising” or “including” mean that the element or object preceding the word encompasses the elements or objects listed following the word and their equivalents, without excluding other elements or objects. Terms such as “connected” or “linked” are not limited to physical or mechanical connections, but can include electrical connections, whether direct or indirect. Terms such as “upper,” “lower,” “left,” and “right” are used only to indicate relative positional relationships, which may change accordingly when the absolute position of the described objects changes.

[0043] Single-core or multi-core processors utilize pipelining technology to improve instruction execution efficiency. Pipelining divides the complete operation of a CPU core into multiple sub-steps and executes these sub-steps in a pipelined manner to improve efficiency.

[0044] Figure 1 illustrates an exemplary scalar central processing unit (CPU) instruction pipeline comprising five stages. As shown in Figure 1, one instruction can be issued per clock cycle and is executed within a fixed time (e.g., 5 clock cycles). The execution of each instruction is divided into 5 steps: Instruction Fetch (IF) stage 1001, Instruction Decode (ID) stage 1002, Execute (EX) stage 1003, Memory Access (MEM) stage 1004, and Write Back (WB) stage 1005. In IF stage 1001, the specified instruction is fetched from the instruction cache. A portion of the fetched instruction specifies the source registers used to execute the instruction. In ID stage 1002, the instruction is decoded, control logic is generated, and the contents of the specified source registers are fetched. Based on the control logic, arithmetic or logical operations are performed in EX stage 1003 using the fetched contents. In MEM stage 1004, the instruction reads data from or writes data to memory through the data cache. Finally, in WB stage 1005, the value obtained by executing the instruction can be written back to a register.
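The timing of this five-stage pipeline can be illustrated with a short Python sketch. It assumes an ideal pipeline with no stalls in which instruction i enters the IF stage in cycle i; the function names are hypothetical.

```python
from typing import Optional

STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_at(instr_index: int, cycle: int) -> Optional[str]:
    """Return the stage occupied by instruction `instr_index` in `cycle`, if any."""
    offset = cycle - instr_index      # instruction i enters IF in cycle i
    if 0 <= offset < len(STAGES):
        return STAGES[offset]
    return None                       # not yet fetched, or already written back


def total_cycles(num_instructions: int) -> int:
    """Cycles needed to complete all instructions when nothing stalls."""
    if num_instructions == 0:
        return 0
    return num_instructions + len(STAGES) - 1
```

Under these assumptions, each instruction still takes 5 cycles from fetch to write-back, but three back-to-back instructions finish in 7 cycles rather than 15 because their stages overlap.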

[0045] The earlier pipeline steps in a CPU are typically classified as the front end. For example, in a five-stage pipeline, the modules used for the instruction fetch and decode stages are classified as the CPU front end. Conversely, the later pipeline steps in a CPU are classified as the back end. For example, in a five-stage pipeline, the modules used for the execution, memory access, and write-back stages are classified as the CPU back end.

[0046] To support high operating frequencies, each pipeline stage may contain multiple pipeline sub-stages (clock cycles). Each sub-stage performs only a limited number of operations, which minimizes the time per clock cycle and thereby increases the CPU core's performance by raising its operating frequency. Each pipeline stage can also further enhance processor core performance by accommodating more instructions (i.e., superscalar technology). Superscalar refers to executing multiple instructions in parallel within a single cycle, which increases instruction-level parallelism. A processor capable of processing multiple instructions in a single cycle is called a superscalar processor. For example, superscalar processors can further support out-of-order execution. Out-of-order execution is a technique in which the CPU allows multiple instructions to be sent to their respective circuit units for processing in an order different from that specified in the program.

[0047] Within the microarchitecture, the processor core translates each architectural instruction into one or more microinstructions. Each microinstruction performs only a limited number of operations, ensuring that each pipeline stage is short and thus increasing the processor core's operating frequency. For example, a memory read instruction can be translated into an address generation microinstruction and a memory read microinstruction. The second microinstruction depends on the result of the first microinstruction; therefore, the second microinstruction will only begin execution after the first microinstruction has completed. Each microinstruction contains multiple microarchitecture-related fields used to pass relevant information between pipeline stages.
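The memory read example above can be sketched in Python as follows. The tuple encoding of instructions, the micro-op names `agen` and `mem_read`, and the temporary register `tmp` are hypothetical and chosen only for illustration.

```python
def translate(instr):
    """Translate one architectural instruction into a list of microinstructions.

    `instr` is a tuple (opcode, dst, base, offset). Only "load" is split;
    every other opcode is passed through as a single microinstruction.
    """
    opcode, dst, base, offset = instr
    if opcode == "load":
        return [
            ("agen", "tmp", base, offset),   # first: compute the address into tmp
            ("mem_read", dst, "tmp"),        # second: read memory at tmp, depends on agen
        ]
    return [instr]
```

The second microinstruction names `tmp` as a source, which is how the dependency on the first microinstruction is expressed in this sketch.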

[0048] Speculative execution is another technique to improve processor performance. This technique executes the instructions following the current instruction before the current instruction has finished executing. One speculative execution technique is branch prediction. As mentioned above, the instruction fetch unit is responsible for providing the processor with the next instruction to be executed. During the fetch phase, in addition to fetching multiple instructions, the fetch address for the next cycle must also be determined. Therefore, this phase determines whether a conditional branch instruction exists, whether to jump (direction) if a branch exists, and the target address. The instruction fetch unit includes a branch prediction unit (branch predictor) to perform branch prediction. The branch prediction unit at the front end of the processor core predicts the jump direction of a conditional branch instruction, and the instructions in the predicted direction are prefetched and executed. Another speculative execution technique is to execute a memory read instruction before the addresses of all preceding memory write instructions have been obtained.

[0049] Speculative execution further improves the parallelism between instructions, thereby significantly improving processor core performance. When a speculation error occurs, such as a branch misprediction or a preceding memory write instruction modifying the same address as a speculatively executed memory read instruction, all instructions in the pipeline following the erroneous instruction need to be flushed. Execution then restarts from the point of the error to ensure the correctness of program execution.

[0050] Figure 2 shows a schematic diagram of the basic structure of a processor. The processor 200 includes at least one CPU core (processor core) and at least one level of cache. For example, the CPU core includes a front end 201 and a back end 202. For example, the at least one level of cache includes an L1 cache (not shown in the figure) located within the CPU core and an L2 cache 203 located outside the CPU core, where the L2 cache 203 is a separate structure.

[0051] Figure 3 illustrates a schematic diagram of an example front-end configuration of a CPU core. The front-end 201 of this CPU core includes an instruction fetch unit, a decode unit, and an issue unit. The instruction fetch unit includes branch prediction 301 and selection logic 302; the decode unit includes an instruction cache 303, an instruction decoder 304, a microinstruction cache 305, and a microinstruction queue 306; the issue unit 307 is connected to the microinstruction queue 306. This CPU core has both an instruction cache and a microinstruction cache, thus enabling microarchitectural optimization. The instruction address (which can be the program counter address, or PC address) obtained by the instruction fetch unit is predicted by the branch prediction 301 to obtain the address of the next instruction to be executed. Simultaneously, the instruction address is selected by the selection logic 302 to determine whether instruction decoding is required for the instruction corresponding to that address. If "yes," the path on the left side of Figure 3 is followed, requiring instruction decoding; if "no," the path on the right side of Figure 3 is followed, not requiring instruction decoding, but instead accessing the microinstruction cache to obtain the corresponding microinstruction set data.

[0052] Figure 4 illustrates a branch prediction module. Referring to Figure 3, the instruction address about to enter the processor pipeline first enters the branch prediction module 301, which then accesses the branch predictor 3011 and other functional modules 3012. The branch predictor 3011 is accessed to determine whether there is a branch instruction within a range starting from the instruction address (e.g., the range from the instruction address to the first aligned address; for example, the processor alignment could be 64 bytes), the branch direction if a branch instruction exists, and the jump address. The other functional modules 3012 are accessed to obtain key information for use by subsequent modules. These other functional modules may include, for example, micro-instruction tag caches and instruction tag caches. If a taken branch is predicted, the next instruction address is the jump address of that branch. If no taken branch is predicted, the next instruction address is the starting address of the next aligned block.
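The rule for choosing the next instruction address can be sketched in Python. The sketch assumes the 64-byte alignment given as an example in the text and represents "no taken branch predicted" by passing `None`; the function and parameter names are hypothetical.

```python
ALIGN = 64  # assumed alignment in bytes (the example value used in the text)


def next_fetch_address(addr, taken_branch_target=None):
    """Return the next instruction address after the prediction for `addr`.

    If a taken branch is predicted, the next address is its target.
    Otherwise it is the start of the next aligned block after `addr`.
    """
    if taken_branch_target is not None:
        return taken_branch_target
    return (addr // ALIGN + 1) * ALIGN
```

For instance, with no taken branch predicted, an address of 0x1008 yields 0x1040, and an already aligned address of 0x1040 yields 0x1080.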

[0053] Figure 5 illustrates the instruction decoding operation in the processor core. Referring to Figure 3, based on instruction address A, the instruction cache 303 is queried to obtain the undecoded instruction data (e.g., one or more instructions) corresponding to instruction address A. Then, the instruction decoder 304 decodes the instruction data into multiple microinstructions (these microinstructions can be called a microinstruction group, such as microinstruction 1, microinstruction 2, microinstruction 3, etc.). The obtained microinstruction group is sent to the microinstruction queue 306 to await allocation and dispatch by the issue unit 307 to the corresponding execution unit in the back end of the CPU core for execution. In addition, under certain conditions (e.g., the microinstruction group is frequently used), the microinstruction group is saved to the microinstruction cache 305 for possible future access.

[0054] In one embodiment, the processor core can update the micro-instruction tag cache in the other functional modules 3012 while saving the microinstruction group to the microinstruction cache 305.

[0055] The inventors of this disclosure have noted that the aforementioned processor has only one pipeline including branch prediction, instruction fetch, decoding, and microinstruction caching. With the trend of increasing instruction dispatch bandwidth (distributing more instructions per unit time) and memory access bandwidth, as well as increasing the number of various execution units, the processor core front end is gradually unable to meet the needs of the processor core back end to execute more instructions. Therefore, how to increase the front end bandwidth has become a problem that mainstream processors need to solve.

[0056] Simultaneous Multithreading (SMT) is an important technology for improving overall CPU performance. It utilizes the multi-issue and out-of-order execution mechanisms of high-performance CPU cores to execute instructions from multiple threads simultaneously. Thus, a physical CPU core appears to the software and operating system as multiple virtual CPU cores. In modern multi-issue high-performance CPUs, when executing a single thread, the multiple execution units and hardware resources within it are often underutilized most of the time. When a thread pauses due to some reason (such as an L2 cache miss), the hardware execution units can only idle, resulting in wasted hardware resources and reduced performance-to-power ratio. In SMT mode, such as dual-threaded mode (SMT2), when one thread pauses, other threads can still run, improving hardware resource utilization and thus increasing the CPU core's multithreaded throughput, overall performance, and performance-to-power ratio. It's important to note that because CPU core resources are shared with other threads, the performance of a thread running in SMT mode is often lower than its performance in single-threaded mode.

[0057] In SMT (Simultaneous Multithreading) mode, at least one stage in the processor core's pipeline is time-multiplexed. This means that if one thread occupies a specific stage of a pipeline at a given time, other threads cannot occupy that same stage at that time. However, if there are two pipelines, each thread can exclusively use one. This results in a significant performance improvement in SMT mode.

[0058] The inventors of this disclosure also noted that, in existing multi-threaded processor cores, multiple threads can use the additional pipelines in only one of the instruction decoding mode and the microinstruction cache mode, and therefore cannot fully and effectively exploit the advantages of multiple pipelines.

[0059] At least one embodiment of this disclosure provides a processor core, an instruction processing method, an electronic device, and a storage medium. The processor core of this disclosure includes a front-end, which comprises N fetch-decode pipelines, at least one instruction cache, and at least one microinstruction cache, where N is an integer greater than or equal to 2. Each fetch-decode pipeline includes fetch selection logic, an instruction cache port, a microinstruction cache port, a decoder, and a first microinstruction queue. The instruction cache port is configured to read and write the instruction cache and provide instructions fetched from the instruction cache to the decoder; the microinstruction cache port is configured to read and write the microinstruction cache and provide a first microinstruction fetched from the microinstruction cache to the first microinstruction queue; the decoder is configured to decode instructions fetched from the instruction cache port into a second microinstruction and provide the decoded second microinstruction to the first microinstruction queue; the fetch selection logic is configured to select the instruction cache port and the decoder for fetch-decode operation, or select the microinstruction cache port for fetch-decode operation. Here, the first microinstruction and the second microinstruction respectively refer to the microinstruction described in the two processing methods.

[0060] At least one embodiment of this disclosure also provides an instruction processing method corresponding to the processor core described above.

[0061] The processor core provided in at least one embodiment of this disclosure, on the one hand, can be designed with multiple fetch-decode pipelines, each of which can operate in instruction decoding mode and microinstruction cache mode. Regardless of whether one pipeline in instruction decoding mode or microinstruction cache mode is blocked, another pipeline can be used to handle the prediction information. This allows more microinstructions to accumulate in the microinstruction queue for execution by the processor core backend, providing greater processor front-end bandwidth and improving processor performance. On the other hand, the processor core provided in at least one embodiment of this disclosure allows multiple threads to use multiple pipelines in both instruction decoding mode and microinstruction cache mode. Each thread can generate a larger average front-end bandwidth, thereby significantly improving multi-threaded performance.

[0062] The embodiments of this disclosure will now be described in detail with reference to the accompanying drawings.

[0063] Figure 6 shows a schematic diagram of the structure of a processor core according to at least one embodiment of the present disclosure.

[0064] As shown in Figure 6, the processor core 600 includes a front-end 601, which includes N instruction fetch-decode pipelines (e.g., N=2, i.e., instruction fetch-decode pipeline 1 and instruction fetch-decode pipeline 2 as shown in the figure), at least one instruction cache 6031-603n, and at least one microinstruction cache 6041-604n. It should be noted that the number of instruction fetch-decode pipelines can be determined based on the specific design performance of the processor core front-end, and this disclosure does not impose any limitations on this.

[0065] The following description uses fetch-decode pipeline 1 as an example. This fetch-decode pipeline 1 includes selection logic 6021, instruction cache port 6051, microinstruction cache port 6061, instruction decoder 6071, and a first microinstruction queue 6081. For example, instruction cache port 6051, instruction cache (e.g., instruction cache 6031), and instruction decoder 6071 constitute an instruction decoding pipeline (for instruction decoding mode); for example, microinstruction cache port 6061 and microinstruction cache (e.g., microinstruction cache 6041) constitute a microinstruction pipeline (for microinstruction cache mode). The internal structure of the remaining N-1 fetch-decode pipelines is the same as that of fetch-decode pipeline 1, and will not be described further here.
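As a minimal sketch of the structure described above, the following Python model treats one fetch-decode pipeline as a choice between two paths that feed the same first microinstruction queue. The names `FetchDecodePipeline` and `decode`, the dict-backed caches, and the rule of preferring the microinstruction cache on a hit are hypothetical illustrations and are not taken from this disclosure.

```python
from collections import deque

def decode(instruction):
    # Hypothetical decoder: produces one micro-op per instruction.
    return f"uop({instruction})"

class FetchDecodePipeline:
    def __init__(self, icache, uop_cache):
        self.icache = icache        # address -> instruction (shared instruction cache)
        self.uop_cache = uop_cache  # address -> micro-op (shared microinstruction cache)
        self.uop_queue = deque()    # first microinstruction queue

    def fetch(self, address):
        # Fetch selection logic (assumed policy: use the micro-op cache on a hit).
        if address in self.uop_cache:
            mode = "uop_cache"
            uop = self.uop_cache[address]       # "first microinstruction"
        else:
            mode = "decode"
            uop = decode(self.icache[address])  # "second microinstruction"
        self.uop_queue.append(uop)
        return mode

icache = {0x10: "add", 0x14: "mul"}
uop_cache = {0x10: "uop(add)"}
pipe = FetchDecodePipeline(icache, uop_cache)
modes = [pipe.fetch(0x10), pipe.fetch(0x14)]
```

In this sketch both paths end in the same queue, which mirrors the statement that the first and second microinstructions are both provided to the first microinstruction queue.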

[0066] In the above embodiments, for example, there is only one instruction cache (e.g., instruction cache 6031) and one microinstruction cache (e.g., microinstruction cache 6041), so that N instruction fetch-decode pipelines share the instruction cache and the microinstruction cache.

[0067] For example, unless otherwise specified, all the following descriptions are based on instruction fetching and decoding pipeline 1.

[0068] For example, in one possible implementation, the front end 601 further includes an instruction dispatch unit configured to dispatch microinstructions received from a first microinstruction queue of one of N instruction fetch-decode pipelines for execution.

[0069] For example, the instruction dispatch unit (not shown in the figure) is configured after the first microinstruction queue 6081 to receive the processed microinstructions from the fetch-decode pipeline 1 and dispatch them to subsequent execution units for execution.

[0070] For example, the instruction cache (e.g., instruction cache 6031) and instruction cache ports 6051-605n are coupled using memory interleaving technology so that instruction cache ports 6051-605n can access instruction cache 6031; the microinstruction cache (e.g., microinstruction cache 6041) and microinstruction cache ports 6061-606n are coupled using memory interleaving technology so that microinstruction cache ports 6061-606n can access microinstruction cache 6041.

[0071] Memory interleaving technology divides memory modules into multiple independent channels and alternately allocates data blocks to different channels, allowing multiple channels to be operated on simultaneously when accessing memory, thereby reducing waiting time and increasing data transfer rate.
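The interleaving idea can be illustrated with a short Python sketch. It assumes low-order interleaving by cache line; the line size, the bank count, and the function names are assumptions chosen for illustration, since this disclosure does not specify them.

```python
LINE_BYTES = 64  # assumed cache-line size
NUM_BANKS = 4    # assumed number of interleaved banks (channels)

def bank_of(address):
    # Consecutive lines are allocated to consecutive banks.
    return (address // LINE_BYTES) % NUM_BANKS

def can_access_together(addr_a, addr_b):
    # Two ports can be served at the same time when they target different banks.
    return bank_of(addr_a) != bank_of(addr_b)

banks = [bank_of(a) for a in (0, 64, 128, 192, 256)]
```

With these assumed parameters, sequential lines rotate through the banks, so two ports fetching neighbouring lines rarely compete for the same bank.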

[0072] For example, the instruction cache (e.g., instruction cache 6031) is coupled to instruction cache ports 6051-605n via multi-port technology, so that instruction cache ports 6051-605n can access instruction cache 6031; the microinstruction cache (e.g., microinstruction cache 6041) is coupled to microinstruction cache ports 6061-606n via multi-port technology, so that microinstruction cache ports 6061-606n can access microinstruction cache 6041.

[0073] This multi-port technology allows each of the multiple ports to perform read and write operations independently, enabling multiple requests to be processed simultaneously, thereby reducing waiting time and improving the overall performance of the system.

[0074] The instruction cache 6031 and microinstruction cache 6041 described above are merely exemplary, and this disclosure does not limit them. For example, if two ports (e.g., instruction cache port 6051 and instruction cache port 6052) encounter an address conflict while accessing simultaneously, access to one of them (e.g., instruction cache port 6052) can be delayed until the non-delayed access to the other port is completed.
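The conflict handling described above can be sketched as follows. The sketch assumes that an "address conflict" means two ports targeting the same interleaved bank in the same cycle, and that priority follows the order of the request list; both are assumptions, and `schedule_accesses` is a hypothetical helper name.

```python
def schedule_accesses(requests, num_banks=4, line_bytes=64):
    """requests: list of (port, address) issued in the same cycle, in priority order.
    Returns {port: cycle}, the cycle in which each port is served."""
    served = {}
    pending = list(requests)
    cycle = 0
    while pending:
        busy = set()
        deferred = []
        for port, address in pending:
            bank = (address // line_bytes) % num_banks
            if bank in busy:
                deferred.append((port, address))  # conflict: delay this port
            else:
                busy.add(bank)
                served[port] = cycle
        pending = deferred
        cycle += 1
    return served

conflict = schedule_accesses([("6051", 0x000), ("6052", 0x100)])
no_conflict = schedule_accesses([("6051", 0x000), ("6052", 0x040)])
```

In the conflicting case, port 6052 is served one cycle after port 6051, which corresponds to delaying one access until the other completes.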

[0075] For example, in one possible implementation, the front end 601 further includes a branch prediction unit configured to generate a prediction result based on the received instruction address for use in the N instruction fetching and decoding pipelines.

[0076] For example, a branch prediction unit (not shown in the figure) is configured before fetch-decode pipelines 1 to N. This branch prediction unit is used in the processor to optimize program execution flow, particularly the performance when handling conditional branch statements. For example, this branch prediction unit can generate a prediction result based on the instruction address of an instruction entering the instruction pipeline (e.g., a jump instruction) for use in the aforementioned fetch-decode pipelines 1 to N. For example, the branch prediction method of this branch prediction unit can include any one or more of static branch prediction, dynamic branch prediction, and indirect branch prediction.

[0077] For example, in one possible implementation, the front-end 601 further includes a boundary information determination unit configured to generate boundary information based on the prediction result and to generate at least one information stream based on the prediction result. For example, depending on the type of branch prediction unit, the type of boundary information includes the last byte of a jump branch instruction, the last byte of the prediction result, or the last byte of an intermediate instruction in the prediction result. For example, one type of branch prediction unit, in the event of a branch target buffer (BTB) hit, will force the prediction information to end at the last byte of a branch instruction, even if the branch instruction does not jump, instead of ending at the last byte of the cached basic block (or prediction window), thus resulting in "the last byte of the prediction result".
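One possible reading of how boundary information yields information streams is sketched here in Python. It assumes that each boundary marks the last byte of a stream inside a prediction window given as an inclusive byte range; the function name `split_into_streams` and the example addresses are hypothetical and are not taken from this disclosure.

```python
def split_into_streams(start, end, boundaries):
    """Split the inclusive byte range [start, end] at each boundary,
    where a boundary is the last byte of a stream.
    Returns a list of (first_byte, last_byte) tuples."""
    streams = []
    first = start
    for last in sorted(b for b in boundaries if start <= b < end):
        streams.append((first, last))
        first = last + 1
    streams.append((first, end))
    return streams

streams = split_into_streams(0x100, 0x13F, [0x10F, 0x127])
```

Under this assumption, two interior boundaries produce three information streams, so the stream count follows directly from the amount of boundary information.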

[0078] For example, a boundary information determination unit (not shown in the figure) is configured after the branch prediction unit and before the instruction fetching and decoding pipeline. For example, the number of generated information streams depends on the number of pieces of boundary information, and this disclosure does not limit this.

[0079] For example, boundary information can be stored in a new cache structure or an existing cache structure. When the boundary information is the last byte of the intermediate instruction in the prediction result, the boundary information is first decoded and then stored in either of the aforementioned cache structures.

[0080] For example, in one possible implementation, the front end 601 further includes a window selection unit configured to allocate the at least one information stream to the N fetch-decode pipelines for processing the at least one information stream. For example, the allocation strategy of the window selection unit includes any one of the following: the number of prediction results, the number of bytes of the prediction results, the blocking degree of the N fetch-decode pipelines, and the window switching frequency in different modes.

[0081] For example, a window selection unit (not shown in the figure) is configured after the boundary information determination unit and before the instruction fetch / decode pipeline. For example, at least one information stream (e.g., 3 streams) is sent to instruction fetch / decode pipeline 1 and instruction fetch / decode pipeline 2 via the window selection unit. For example, the at least one information stream may be sent to one pipeline or different pipelines. For example, 2 information streams are sent to instruction fetch / decode pipeline 1, and the remaining 1 is sent to instruction fetch / decode pipeline 2. The specific number of streams allocated can be dynamically adjusted based on the execution status (e.g., blocking level) of instruction fetch / decode pipeline 1 to instruction fetch / decode pipeline n, and this disclosure does not limit this.
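A minimal sketch of one allocation strategy follows, using only the blocking level as the criterion. The integer blocking counts, the tie-breaking rule, and the name `assign_streams` are assumptions for illustration; the disclosure leaves the concrete strategy open.

```python
def assign_streams(streams, blocking):
    """streams: stream ids in program order.
    blocking: per-pipeline amount of pending work (hypothetical blocking level).
    Returns a list of (stream, pipeline_index) with 0-based pipeline indices."""
    load = list(blocking)
    assignment = []
    for stream in streams:
        # Pick the least-blocked pipeline; ties go to the lowest index.
        target = min(range(len(load)), key=lambda i: load[i])
        assignment.append((stream, target))
        load[target] += 1
    return assignment

assignment = assign_streams(["s0", "s1", "s2"], blocking=[0, 1])
```

With these inputs, two of the three streams go to the first pipeline and the remaining one goes to the second, matching the distribution given as an example in the text.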

[0082] For example, in one possible implementation, the front end 601 further includes reordering logic, which is configured to reorder the microinstructions obtained by the above N instruction fetching and decoding pipelines from processing the above at least one information stream before sending them to the above instruction distribution unit.

[0083] For example, the reordering logic (not shown in the figure) is configured after the fetch-decode pipelines and before the instruction dispatch unit. For example, because fetch-decode pipelines 1 to n (e.g., fetch-decode pipelines 1 and 2) receive prediction results from a single thread, the information streams produced by the boundary information determination unit and allocated by the window selection unit may be processed out of order across fetch-decode pipelines 1 and 2. For example, based on the first-in-first-out (FIFO) principle, the microinstructions processed in fetch-decode pipelines 1 to n and accumulated in the first microinstruction queues 6081 to 608n need to be reordered by the reordering logic to restore the order they had before entering the fetch-decode pipelines. This ensures that microinstructions derived from instructions that are earlier in the information stream can be dispatched in a timely manner, thereby contributing to the efficient and correct processing of the instruction sequence.
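The reordering step can be sketched in Python under the assumption that every microinstruction carries a sequence number assigned before it entered a pipeline. The sequence tags and the function name `reorder` are hypothetical; the disclosure does not state how the original order is tracked.

```python
from collections import deque

def reorder(queues):
    """queues: one FIFO per pipeline, each holding (seq, uop) in increasing seq order.
    Returns the micro-ops merged back into global sequence order."""
    next_seq = 0
    ordered = []
    while any(queues):
        for q in queues:
            # Only the head of each FIFO is visible to the reordering logic.
            if q and q[0][0] == next_seq:
                ordered.append(q.popleft()[1])
                next_seq += 1
                break
        else:
            raise ValueError(f"sequence number {next_seq} missing from all queues")
    return ordered

q1 = deque([(0, "a"), (1, "b"), (4, "e")])  # first microinstruction queue, pipeline 1
q2 = deque([(2, "c"), (3, "d")])            # first microinstruction queue, pipeline 2
ordered = reorder([q1, q2])
```

Because each queue is a FIFO, inspecting only the queue heads is sufficient to restore the original order in this sketch.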

[0084] It should be noted that the boundary information determination unit, window selection unit, and reordering logic mentioned in the above embodiments are applicable to single-threaded mode.

[0085] The processor core provided in at least one embodiment of this disclosure is suitable for single-threaded mode and is designed with multiple fetch-decode pipelines. Each fetch-decode pipeline can operate in instruction decoding mode and microinstruction cache mode. Regardless of whether one pipeline in instruction decoding mode or microinstruction cache mode is blocked, another pipeline can be used to take over the prediction information, thereby allowing more microinstructions to accumulate in the microinstruction queue for the processor core backend to execute, providing greater processor frontend bandwidth and improving processor performance.

[0086] For example, in one possible implementation, the front end 601 further includes a first arbitration logic and a second arbitration logic, wherein the prediction result includes prediction results for M hardware threads (hereinafter referred to as "threads"), where M is an integer greater than or equal to 2 (e.g., M = 2, 4, or 8, etc.); the first arbitration logic is configured to distribute the prediction results of the M threads to the N instruction fetch-decode pipelines for processing; the second arbitration logic is configured to sequentially select and send the microinstructions corresponding to the M threads obtained after processing by the N instruction fetch-decode pipelines to the instruction dispatch unit.

[0087] For example, the first arbitration logic (not shown in the figure) is configured after the branch prediction unit and before the instruction fetch-decode pipeline. The second arbitration logic (not shown in the figure) is configured after the instruction fetch-decode pipeline and before the instruction dispatch unit. Since this embodiment is applicable to multi-threaded mode, it does not include a boundary information determination unit, a window selection unit, or reordering logic.

[0088] For example, the prediction results mentioned above include prediction results for two (M=2) threads (e.g., thread 1 and thread 2). The first arbitration logic allocates the prediction results of these two threads to N instruction fetch-decode pipelines (e.g., two, namely instruction fetch-decode pipeline 1 and instruction fetch-decode pipeline 2) for processing. For example, the prediction result of thread 1 can be allocated to instruction fetch-decode pipeline 1 or instruction fetch-decode pipeline 2, and the specific allocation rules are not limited in this disclosure. For example, the second arbitration logic sequentially selects and sends the microinstructions corresponding to the two threads obtained after processing by instruction fetch-decode pipeline 1 and instruction fetch-decode pipeline 2 to the instruction dispatch unit.
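The two arbitration stages can be sketched as follows. Since the disclosure does not limit the allocation rule, this sketch assumes a thread-id-modulo-N rule for the first arbitration logic and a round-robin selection for the second; the function names and 0-based thread ids are hypothetical.

```python
from collections import deque

def first_arbitration(predictions, num_pipelines):
    """predictions: list of (thread_id, prediction).
    Assumed rule: thread_id modulo N selects the pipeline."""
    per_pipeline = [deque() for _ in range(num_pipelines)]
    for thread_id, prediction in predictions:
        per_pipeline[thread_id % num_pipelines].append((thread_id, prediction))
    return per_pipeline

def second_arbitration(per_pipeline):
    """Round-robin over the pipelines' output queues, one entry per turn."""
    dispatched = []
    while any(per_pipeline):
        for q in per_pipeline:
            if q:
                dispatched.append(q.popleft())
    return dispatched

preds = [(0, "t0-a"), (1, "t1-a"), (0, "t0-b"), (1, "t1-b"), (0, "t0-c")]
queues = first_arbitration(preds, num_pipelines=2)
order = second_arbitration(queues)
```

Round-robin is only one way to select "sequentially"; a priority-based or age-based selection would fit the same description.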

[0089] For example, in one possible implementation, each of the above N instruction fetch-decode pipelines further includes a second microinstruction queue, which is configured to receive microinstructions from the instruction decoder and / or microinstruction cache port in each of the above instruction fetch-decode pipelines.

[0090] For example, in one possible implementation, the second microinstruction queue (not shown) is parallel (and equivalent) to the first microinstruction queue (e.g., the first microinstruction queue 6081) and is configured after the instruction decoder 6071 and the microinstruction cache port 6061 and before the aforementioned second arbitration logic.

[0091] For example, the prediction results described above include prediction results for four (M=4) threads (e.g., threads 1 to 4), and the first arbitration logic distributes the prediction results of these four threads to N instruction fetch-decode pipelines (e.g., two, namely instruction fetch-decode pipeline 1 and instruction fetch-decode pipeline 2). For example, prediction results from threads 1 and 2 are distributed to instruction fetch-decode pipeline 1, and prediction results from threads 3 and 4 are distributed to instruction fetch-decode pipeline 2 for processing. It should be noted that the above thread allocation is merely exemplary, and this disclosure does not impose any limitations on it.

[0092] Taking the above allocation result as an example, for instance, in instruction fetching and decoding pipeline 1, the microinstruction obtained by thread 1 through the instruction decoding pipeline or microinstruction pipeline can be sent to the first microinstruction queue 6081, and the microinstruction obtained by thread 2 through the instruction decoding pipeline or microinstruction pipeline can be sent to the second microinstruction queue.

[0093] For example, the microinstructions obtained by thread 1 through the instruction decoding pipeline or microinstruction pipeline can be sent to the second microinstruction queue, and the microinstructions obtained by thread 2 through the instruction decoding pipeline or microinstruction pipeline can be sent to the first microinstruction queue 6081.

[0094] For example, a portion of the microinstructions obtained by thread 1 through the instruction decoding pipeline or microinstruction pipeline can be sent to the first microinstruction queue 6081, and the remaining microinstructions of thread 1 can be sent to the second microinstruction queue. The same applies to thread 2.

[0095] It should be noted that the instruction fetch-decode pipeline 2, which processes threads 3 and 4, allocates microinstructions in a similar manner to the instruction fetch-decode pipeline 1, and will not be elaborated here.

[0096] The processor core provided in at least one embodiment of this disclosure is applicable to multi-threaded mode and can allow multiple threads to use multiple pipelines regardless of whether it is in instruction decoding mode or micro-instruction cache mode. Each thread can generate a larger average front-end bandwidth, thereby greatly improving multi-threaded performance.

[0097] In addition to the aforementioned front-end, the processor core of the embodiments of this disclosure also includes a back-end, such as various execution units (e.g., arithmetic logic unit (ALU), floating point unit (FPU), load / store unit (LSU), etc.), a reorder buffer (ROB), a retirement unit, etc., which will not be described in detail here.

[0098] The processor cores in the embodiments of this disclosure may be based on the x86 microarchitecture, ARM microarchitecture, RISC-V microarchitecture, MIPS microarchitecture, etc., and this disclosure does not limit them.

[0099] At least one embodiment of this disclosure also provides an instruction processing method, which includes: in response to an instruction processing request, selecting a target fetch-decode pipeline from N fetch-decode pipelines included in the front end of a processor core to respond to the instruction processing request, wherein N is an integer greater than or equal to 2; in the target fetch-decode pipeline, selecting either an instruction cache port and a decoder, or a microinstruction cache port, to perform an instruction fetch-decode operation in response to the instruction processing request. When the microinstruction cache port is selected to perform the instruction fetch-decode operation, the method includes: retrieving a first microinstruction from a microinstruction cache through the microinstruction cache port and providing the first microinstruction to a first microinstruction queue. Alternatively, when the instruction cache port and the decoder are selected to perform the instruction fetch-decode operation, the method includes: retrieving an instruction from an instruction cache through the instruction cache port and providing the instruction to the decoder, whereby the decoder decodes the instruction into a second microinstruction and provides the second microinstruction to the first microinstruction queue.

[0100] Figure 7 shows a schematic flowchart of an instruction processing method according to at least one embodiment of the present disclosure. As shown in Figure 7, the instruction processing method includes steps S710 and S720.

[0101] Step S710: In response to the instruction processing request, select the target fetch-decode pipeline from the N fetch-decode pipelines included in the front end of the processor core to respond to the instruction processing request, where N is an integer greater than or equal to 2;

[0102] Step S720: In the target instruction fetch-decode pipeline, select either the instruction cache port and decoder or the microinstruction cache port to perform the instruction fetch-decode operation in response to the instruction processing request.

[0103] For example, the instruction processing method described above corresponds to the processor core shown in Figure 6.

[0104] For example, in one possible implementation, before responding to an instruction processing request, the instruction processing method includes: generating a prediction result based on the received instruction address for use in a target instruction fetch / decode pipeline among N instruction fetch / decode pipelines. For example, this prediction result is issued by a branch prediction unit, the specific function of which has been described above and will not be repeated here.

[0105] For example, in the instruction processing method described above, the prediction result includes prediction results for a single thread or for multiple threads (i.e., multiple hardware threads in a multi-threaded processor).

[0106] For example, in one possible implementation, boundary information is generated based on the prediction result, and at least one information stream is generated based on the prediction result. For example, the type of boundary information includes the last byte of a jump branch instruction, the last byte of the prediction result, or the last byte of an intermediate instruction in the prediction result. For example, the number of information streams depends on the number of pieces of boundary information. For example, if the last byte of an intermediate instruction in the prediction result is taken as boundary information, it is generated by decoding and then stored in a cache structure. For example, this cache structure can be a newly added cache or a reused existing cache; this disclosure does not limit this.

[0107] For example, in one possible implementation, at least one information stream is assigned to a target fetch / decode pipeline among N fetch / decode pipelines for processing via window selection logic. For example, the window selection logic may be based on any one of the following: the number of prediction results, the number of bytes in the prediction results, the blocking level of the target fetch / decode pipeline among the N fetch / decode pipelines, and the window switching frequency under different modes; this disclosure does not limit this. For example, the aforementioned multiple information streams may be sent to one fetch / decode pipeline or to different fetch / decode pipelines.

[0108] For example, responding to an instruction processing request by selecting a microinstruction cache port to perform an instruction fetch and decode operation includes: retrieving a first microinstruction from a microinstruction cache via the microinstruction cache port and providing the first microinstruction to a first microinstruction queue.

[0109] For example, in response to selecting an instruction cache port and a decoder to perform an instruction fetch and decode operation in response to an instruction processing request, the method includes: fetching an instruction from an instruction cache through the instruction cache port and providing the instruction to the decoder, whereby the decoder decodes the instruction into a second microinstruction and provides the second microinstruction to a first microinstruction queue.

[0110] As described above, the microinstruction cache port and the microinstruction cache constitute the microinstruction pipeline (i.e., for microinstruction cache mode), and the instruction cache port, instruction decoder, and instruction cache constitute the instruction decoding pipeline (i.e., for instruction decoding mode). For example, the first microinstruction refers to the microinstruction obtained after processing by the microinstruction pipeline; the second microinstruction refers to the microinstruction obtained after processing by the instruction decoding pipeline.

[0111] For example, in one possible implementation, the instruction cache is coupled to the instruction cache port of the target fetch-decode pipeline in N fetch-decode pipelines via memory interleaving or multi-port technology, so that the target fetch-decode pipelines in N fetch-decode pipelines can access the instruction cache; or, the microinstruction cache is coupled to the microinstruction cache port of the target fetch-decode pipeline in N fetch-decode pipelines via memory interleaving or multi-port technology, so that the target fetch-decode pipelines in N fetch-decode pipelines can access the microinstruction cache. This disclosure does not limit the caching technology that can implement multi-port access; the specific caching technology depends on the actual design.

[0112] For example, the microinstruction cache ports of the target fetch / decode pipelines in N fetch / decode pipelines access the same microinstruction cache, and the instruction cache ports of the target fetch / decode pipelines in N fetch / decode pipelines access the same instruction cache. It should be noted that if multiple ports (e.g., two instruction cache ports accessing the instruction cache) encounter address conflicts while accessing the instruction cache simultaneously, access will be performed sequentially.

[0113] For example, in one possible implementation, the instruction processing method further includes: reordering the microinstructions obtained by processing at least one information stream in the target fetch-decode pipeline among N fetch-decode pipelines for distribution. For instance, since there are two information streams in a target fetch-decode pipeline, and the operating state of these two information streams in their respective fetch-decode pipelines is determined by their respective pipelines, it is possible that an information stream that enters earlier may exit the pipeline later. Because the information streams need to maintain their order at this stage, the microinstructions need to be reordered for distribution.

[0114] For example, in one possible implementation, the instruction processing method further includes distributing microinstructions received from a first microinstruction queue of one of the N instruction fetch-decode pipelines for execution.

[0115] The instruction processing method provided by at least one embodiment of the present disclosure designs multiple target fetch-decode pipelines, each of which can operate in instruction decoding mode and microinstruction cache mode. Regardless of whether one pipeline in instruction decoding mode or microinstruction cache mode is blocked, another pipeline can be used to take over the prediction information, thereby allowing more microinstructions to accumulate in the microinstruction queue for execution by the processor core backend, providing greater processor frontend bandwidth and improving processor performance.

[0116] For example, in one possible implementation, the prediction results include prediction results for M threads, where M is an integer greater than or equal to 2. The instruction processing method further includes: allocating the prediction results from the M threads to the target fetch-decode pipelines in the N fetch-decode pipelines for processing; and sequentially selecting the microinstructions corresponding to the M threads obtained after processing by the target fetch-decode pipelines in the N fetch-decode pipelines for distribution.

[0117] For example, if M < N (e.g., M = 3, N = 4), then each thread is sent to its own fetch-decode pipeline (e.g., thread 1 corresponds to pipeline 1, and so on), where there are spare fetch-decode pipelines (e.g., spare pipeline 4).

[0118] For example, if M = N (e.g., M = N = 4), then each thread is sent to its own fetch-decode pipeline (e.g., thread 1 corresponds to pipeline 1, and so on), where there are no remaining fetch-decode pipelines.

[0119] For example, if M > N (e.g., M = 5, N = 4), then in addition to the threads allocated to each pipeline (e.g., thread 1 corresponds to pipeline 1, and so on), the remaining threads (e.g., remaining thread 5) are allocated to the existing pipelines for processing based on the execution status (e.g., blocking status) of all pipelines.
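The three cases above (M < N, M = N, M > N) can be sketched with a single Python function. The integer blocking levels and the name `allocate_threads` are assumptions for illustration, and the identity mapping of thread i to pipeline i follows the example order in the text rather than a required rule.

```python
def allocate_threads(num_threads, num_pipelines, blocking=None):
    """Returns {thread: pipeline}, both 1-based as in the text.
    blocking: optional per-pipeline blocking level used for overflow threads."""
    load = list(blocking) if blocking is not None else [0] * num_pipelines
    mapping = {}
    for thread in range(1, num_threads + 1):
        if thread <= num_pipelines:
            pipeline = thread  # thread i -> pipeline i
        else:
            # Remaining threads go to the currently least-blocked pipeline.
            pipeline = min(range(num_pipelines), key=lambda i: load[i]) + 1
        mapping[thread] = pipeline
        load[pipeline - 1] += 1
    return mapping

fewer = allocate_threads(3, 4)                        # M < N: pipeline 4 stays spare
equal = allocate_threads(4, 4)                        # M = N: no spare pipeline
more = allocate_threads(5, 4, blocking=[2, 0, 1, 3])  # M > N: thread 5 shares a pipeline
```

In the M > N call, thread 5 is placed on pipeline 2 because that pipeline has the lowest assumed blocking level after the first four threads are assigned.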

[0120] It should be noted that the allocation order of the threads mentioned above is merely an example, and the specific order can be determined based on the actual design. This disclosure does not impose any restrictions on this.

[0121] For example, the microinstructions corresponding to the M threads (e.g., threads 1 to 4) obtained after being processed by the target fetch-decode pipeline (e.g., fetch-decode pipeline 1 to fetch-decode pipeline 4) in the N fetch-decode pipelines are selected in sequence for distribution (e.g., fetch-decode pipeline 1 corresponds to thread 1, and so on).

[0122] For example, in one possible implementation, in response to selecting a microinstruction cache port to perform an instruction fetch and decode operation in response to an instruction processing request, the method further includes: retrieving a first microinstruction from a microinstruction cache via the microinstruction cache port and providing the first microinstruction to a second microinstruction queue; and in response to selecting an instruction cache port and a decoder to perform an instruction fetch and decode operation in response to an instruction processing request, the method further includes: retrieving an instruction from an instruction cache via the instruction cache port and providing the instruction to a decoder, whereby the decoder decodes the instruction into a second microinstruction and provides the second microinstruction to a second microinstruction queue.

[0123] For example, the first microinstruction and the second microinstruction can come from the same thread or different threads. For example, instruction data from thread 1 and thread 2 are processed in fetch-decode pipeline 1. Exemplarily, the first and second microinstructions obtained from thread 1 through fetch-decode pipeline 1 can be provided separately to the first microinstruction queue or the second microinstruction queue, or a portion of the first and second microinstructions obtained from thread 1 through fetch-decode pipeline 1 can be provided to the first microinstruction queue, and the remainder to the second microinstruction queue. The same applies to the microinstructions of thread 2.

[0124] The instruction processing method provided by at least one embodiment of this disclosure allows multiple threads to use multiple pipelines regardless of whether it is in instruction decoding mode or microinstruction caching mode. Each thread can generate a larger average front-end bandwidth, thereby greatly improving multi-threaded performance.

[0125] Figure 8 shows a detailed flowchart of an instruction processing method according to at least one embodiment of the present disclosure, applicable to a single-threaded mode. As shown in Figure 8, the instruction processing method includes steps S801-S812.

[0126] Steps S801-S804, S806-S808, S810, and S812 have been described above and will not be repeated here.

[0127] Step S805: Enter instruction decoding mode?

[0128] For example, whether to enable the microinstruction cache mode is determined based on the cache hit or cache miss information provided by the microinstruction cache. For instance, if the microinstruction cache provides cache hit information indicating that several consecutive microinstruction groups exist in the microinstruction cache, the microinstruction cache mode is enabled; otherwise, the prediction results are processed in the instruction decoding mode.
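The decision at step S805 can be sketched as a small Python function. The threshold of three consecutive hits and the name `select_mode` are assumptions; the text only says "several consecutive microinstruction groups".

```python
def select_mode(hit_history, required_consecutive_hits=3):
    """hit_history: micro-op cache lookup results, most recent last (True = hit).
    Returns the mode used to process the next prediction result."""
    recent = hit_history[-required_consecutive_hits:]
    if len(recent) == required_consecutive_hits and all(recent):
        return "uop_cache_mode"
    return "decode_mode"

mode_after_hits = select_mode([False, True, True, True])
mode_after_miss = select_mode([True, True, False])
mode_short_history = select_mode([True, True])
```

Requiring several consecutive hits before switching is one way to avoid toggling between the two modes on every lookup.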

[0129] Step S809: Has the microinstruction become the head of the queue?

[0130] For example, if the microinstruction of the current information stream becomes the head of the microinstruction queue (FIFO queue), it means that the microinstructions of the previous information stream have been distributed (emitted). As a result, there are no more microinstructions of the previous information stream in the current microinstruction queue. This allows microinstructions from different fetch-decode pipelines to be reordered to restore the order in which they entered the N fetch-decode pipelines.

[0131] Step S811: Can instruction dispatch be initiated?

[0132] For example, in one possible implementation, whether instruction dispatch can be initiated is determined based on one or more factors, including resource availability, data dependencies, scheduling strategy, and branch prediction and prefetching.

[0133] For example, each execution unit (such as the Arithmetic Logic Unit (ALU), Floating-Point Unit (FPU), Load / Store Unit (LSU), etc.) has a certain throughput limit. If there is an idle execution unit, the corresponding instruction can be dispatched. Otherwise, the instruction must wait until the required resource becomes available.

[0134] For example, to ensure program correctness, instruction dispatch needs to take data dependencies into account. For instance, if the result of one instruction is the input of another, the latter cannot execute before the former has produced its result. By tracking these dependencies, the reordering logic ensures that instructions are dispatched only when all their preconditions are met.
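
The dependency rule above can be sketched as follows. The `Instr` type and `dispatch_waves` function are hypothetical names introduced for illustration; the model tracks only read-after-write dependencies on register names and ignores execution latency and other hazards.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    name: str
    dest: str           # register written by the instruction
    srcs: tuple = ()    # registers read by the instruction

def dispatch_waves(instrs, available):
    """Group instructions into successive waves; an instruction joins a wave
    only once every source register it reads has been produced."""
    ready_regs = set(available)
    pending = list(instrs)
    waves = []
    while pending:
        wave = [i for i in pending if all(s in ready_regs for s in i.srcs)]
        if not wave:
            raise ValueError("remaining instructions have unsatisfiable dependencies")
        waves.append([i.name for i in wave])
        ready_regs.update(i.dest for i in wave)
        pending = [i for i in pending if i not in wave]
    return waves

program = [
    Instr("add", "r3", ("r1", "r2")),
    Instr("mul", "r4", ("r3", "r1")),  # consumes the result of add
    Instr("sub", "r5", ("r1",)),
]
waves = dispatch_waves(program, available={"r1", "r2"})
```

Here `mul` is held back until `add` has produced `r3`, while the independent `sub` can go out together with `add`.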

[0135] For example, complex scheduling algorithms are typically used to determine which instructions should be dispatched with priority. Common scheduling algorithms include:

[0136] Oldest First: Select the instruction that arrived at the reservation station earliest and has no unresolved dependencies.

[0137] Least Delay: Prioritize dispatching the instructions with the shortest expected execution time to reduce the possibility of pipeline stalls.

[0138] Maximize resource utilization: Balance the workload of each execution unit as much as possible to avoid some units being overloaded while others are idle.
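
The three policies listed above can be compared with a minimal sketch. The `Entry` fields and the function names are assumptions for illustration; a real scheduler would typically combine several criteria and operate on hardware structures rather than Python lists.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    name: str
    age: int       # cycles spent waiting in the reservation station (larger = older)
    latency: int   # expected execution time in cycles
    unit: str      # execution unit the instruction needs, e.g. "ALU"
    ready: bool    # True when no dependency is unresolved

def _ready(entries):
    return [e for e in entries if e.ready]

def oldest_first(entries):
    """Pick the ready instruction that has waited longest."""
    cands = _ready(entries)
    return max(cands, key=lambda e: e.age) if cands else None

def least_delay(entries):
    """Pick the ready instruction with the shortest expected execution time."""
    cands = _ready(entries)
    return min(cands, key=lambda e: e.latency) if cands else None

def balance_units(entries, unit_load):
    """Pick the ready instruction whose execution unit is currently least loaded."""
    cands = _ready(entries)
    return min(cands, key=lambda e: unit_load.get(e.unit, 0)) if cands else None

station = [
    Entry("ld", age=5, latency=4, unit="LSU", ready=True),
    Entry("add", age=3, latency=1, unit="ALU", ready=True),
    Entry("fmul", age=9, latency=5, unit="FPU", ready=False),  # still waiting on an operand
]
picked_oldest = oldest_first(station).name
picked_fastest = least_delay(station).name
picked_balanced = balance_units(station, {"LSU": 3, "ALU": 1}).name
```

Note that `fmul` is the oldest entry but is never selected, because it still has an unresolved dependency.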

[0139] The instruction processing method provided in at least one embodiment of this disclosure is applicable to single-threaded or multi-threaded processor cores. Whichever of the multiple instruction fetch / decode pipelines is blocked, a backup pipeline takes over the prediction information, so that more microinstructions accumulate in the microinstruction queue for back-end consumption, providing greater front-end bandwidth and further improving processor performance. Moreover, in both instruction decoding mode and microinstruction cache mode, the processor has two pipelines consuming the increased branch prediction bandwidth, which is therefore utilized more efficiently.
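
The takeover behavior can be sketched as a simple selection rule. The function `pick_pipeline` and its least-occupied tie-breaking are assumptions for illustration; the disclosure lists the blocking degree of the pipelines as one possible allocation criterion but does not prescribe a specific rule.

```python
def pick_pipeline(occupancy, blocked):
    """Choose a fetch-decode pipeline for the next piece of prediction information.

    occupancy: pending work per pipeline (index 0 is pipeline 1, and so on).
    blocked:   True for each pipeline that currently cannot accept work.
    Returns the index of the least-occupied unblocked pipeline, or None if
    every pipeline is blocked.
    """
    free = [i for i, is_blocked in enumerate(blocked) if not is_blocked]
    if not free:
        return None
    return min(free, key=lambda i: occupancy[i])

# Pipeline 1 is blocked, so pipeline 2 takes over even though it holds more work.
chosen = pick_pipeline(occupancy=[0, 5], blocked=[True, False])
```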

[0140] Figure 9 illustrates a structural example of a processor core according to at least one embodiment of the present disclosure, which is suitable for single-threaded mode.

[0141] As shown in Figure 9, the front end of the processor core includes two fetch-decode pipelines (e.g., fetch-decode pipeline 1 and fetch-decode pipeline 2). It should be noted that in other examples, the number of fetch-decode pipelines can be determined based on the specific design performance of the processor core front end, and may include two or more fetch-decode pipelines.

[0142] The processor core front end specifically includes a branch prediction unit 901, selection logic 9021 / 9022, instruction cache 903, instruction cache ports 9031 / 9032, instruction decoder 9041 / 9042, microinstruction cache 905, microinstruction cache ports 9051 / 9052, microinstruction queues 9061 / 9063 (an example of a "first microinstruction queue"), instruction dispatch unit 907, boundary information determination unit 908, window selection unit 909, and reordering logic 910. These components or logic can be implemented through hardware, firmware, software, or any combination thereof, and their functional descriptions are as described above, and will not be repeated here.

[0143] The following explanation uses one of the two instruction fetch-decode pipelines as an example (e.g., instruction fetch-decode pipeline 1). Instruction fetch-decode pipeline 1 includes selection logic 9021, instruction cache 903, instruction cache port 9031, instruction decoder 9041, microinstruction cache 905, microinstruction cache port 9051, and microinstruction queue 9061. Instruction cache 903, instruction cache port 9031, and instruction decoder 9041 constitute instruction decoding pipeline 1; microinstruction cache 905 and microinstruction cache port 9051 constitute microinstruction pipeline 1. Instruction fetch-decode pipeline 2 has a similar structure to instruction fetch-decode pipeline 1 and is not described further here.

[0144] Instruction cache ports 9031 and 9032 in the two instruction fetch / decode pipelines access the same instruction cache 903; microinstruction cache ports 9051 and 9052 in the two instruction fetch / decode pipelines access the same microinstruction cache 905.

[0145] Instruction cache 903 is configured to be coupled to the instruction cache ports of the two instruction fetch / decode pipelines (e.g., instruction cache port 9031 and instruction cache port 9032) via memory interleaving or multi-port technology to enable non-blocking multi-port access, allowing the two instruction fetch / decode pipelines to access the instruction cache; microinstruction cache 905 is configured to be coupled to the microinstruction cache ports of the two instruction fetch / decode pipelines (e.g., microinstruction cache port 9051 and microinstruction cache port 9052) via memory interleaving or multi-port technology to enable non-blocking multi-port access, allowing the two instruction fetch / decode pipelines to access the microinstruction cache.

[0146] For example, if the aforementioned instruction cache ports (e.g., instruction cache port 9031 and instruction cache port 9032) access instruction cache 903 simultaneously and an address conflict occurs, the accesses are performed one after another; this disclosure does not restrict the order in which they are performed. Accesses to the microinstruction cache are handled in the same way as accesses to the instruction cache and are not described again here.
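
Conflict handling under memory interleaving can be illustrated with a minimal sketch. The line size, the number of banks, the address-to-bank mapping, and the use of request order as priority are all assumptions for illustration; the disclosure leaves the access order unrestricted and does not specify the interleaving scheme.

```python
LINE_BYTES = 64   # assumed cache line size
NUM_BANKS = 2     # assumed number of interleaved banks

def bank_of(addr: int) -> int:
    """Map an address to a bank by interleaving consecutive cache lines."""
    return (addr // LINE_BYTES) % NUM_BANKS

def schedule_accesses(requests):
    """Assign a service cycle to each same-cycle request.

    requests: list of (port_id, address) pairs. Requests that map to different
    banks are served in the same cycle; requests that collide on a bank are
    served one after another in list order.
    """
    next_free_cycle = {}
    granted = {}
    for port, addr in requests:
        bank = bank_of(addr)
        cycle = next_free_cycle.get(bank, 0)
        granted[port] = cycle
        next_free_cycle[bank] = cycle + 1
    return granted

# Ports 9031 and 9032 hit different banks: both are served in cycle 0.
no_conflict = schedule_accesses([(9031, 0x000), (9032, 0x040)])
# Both addresses map to bank 0: the second port waits one cycle.
conflict = schedule_accesses([(9031, 0x000), (9032, 0x080)])
```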

[0147] The processor core shown in Figure 9 is designed for single-threaded mode and has two fetch-decode pipelines, each including an instruction decoding pipeline and a microinstruction pipeline. Whichever pipeline is blocked, a backup pipeline takes over the instructions to be processed, providing greater front-end bandwidth and thus further improving processor performance.

[0148] Figure 10 shows a structural example of a processor core according to at least another embodiment of the present disclosure, which is suitable for dual-thread mode, i.e., the processor core is an SMT2 processor core.

[0149] As shown in Figure 10, the processor core front end also includes two instruction fetch and decode pipelines. Compared with the processor core of the embodiment shown in Figure 9, this processor core does not include the boundary information determination unit 908, the window selection unit 909, or the reordering logic 910, but additionally includes the first arbitration logic 911 and the second arbitration logic 912.

[0150] The prediction results from the branch prediction unit include prediction results for two threads, and the first arbitration logic 911 is configured to distribute the prediction results from the two threads to two instruction fetch decoding pipelines for processing.

[0151] The microinstruction queues 9061 / 9063 of the processor core receive the microinstructions obtained by processing the prediction results of their respective threads. The second arbitration logic 912 is configured to select and send the microinstructions corresponding to the two threads obtained by the two instruction fetch and decode pipelines to the instruction dispatch unit in sequence.

[0152] The processor core shown in Figure 10 is suitable for dual-thread mode. In both instruction decoding mode and microinstruction cache mode, it allows two threads to use the two fetch-decode pipelines, so that each thread can obtain a larger average front-end bandwidth, thereby improving multi-threaded performance.

[0153] Figure 11 shows a structural example of a processor core according to at least one embodiment of the present disclosure, which is suitable for a four-thread mode, i.e., the processor core is an SMT4 processor core.

[0154] As shown in Figure 11, the processor core front end also includes two instruction fetch and decode pipelines. Compared to the processor core of the embodiment shown in Figure 10, this processor core adds microinstruction queues 9062 / 9064 (an example of a "second microinstruction queue") to the two instruction fetch and decode pipelines.

[0155] Each of the microinstruction queues 9062 / 9064 is configured to receive microinstructions from the decoder and / or the microinstruction cache port in its corresponding fetch-decode pipeline.

[0156] In the embodiment of Figure 11, the prediction results include prediction results for four threads (e.g., threads 1-4), while there are two instruction fetch / decode pipelines (e.g., instruction fetch / decode pipeline 1 and instruction fetch / decode pipeline 2). After arbitration by the first arbitration logic 911, for example, the prediction results from threads 1 and 2 can be assigned to instruction fetch / decode pipeline 1 for processing, and the prediction results from threads 3 and 4 can be assigned to instruction fetch / decode pipeline 2 for processing.
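
The thread-to-pipeline assignment in this example can be sketched as a static block partition. The function `assign_threads` is hypothetical, and a static mapping is only one possibility; the first arbitration logic 911 could equally assign prediction results dynamically.

```python
def assign_threads(thread_ids, num_pipelines):
    """Map each thread to a pipeline number (1-based) in contiguous groups."""
    per_pipeline = -(-len(thread_ids) // num_pipelines)  # ceiling division
    return {t: i // per_pipeline + 1 for i, t in enumerate(thread_ids)}

# Four threads on two pipelines: threads 1-2 use pipeline 1, threads 3-4 use pipeline 2.
smt4_map = assign_threads([1, 2, 3, 4], num_pipelines=2)
# Two threads on two pipelines: one thread per pipeline.
smt2_map = assign_threads([1, 2], num_pipelines=2)
```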

[0157] For example, in instruction fetch / decode pipeline 1, all microinstructions obtained by thread 1 through the instruction decoding pipeline or microinstruction pipeline can be sent to microinstruction queue 9061 or microinstruction queue 9062; alternatively, some microinstructions obtained by thread 1 through the instruction decoding pipeline or microinstruction pipeline can be sent to microinstruction queue 9061, and the remaining microinstructions can be sent to microinstruction queue 9062. The allocation method for microinstructions from thread 2 in instruction fetch / decode pipeline 1 is the same as that for microinstructions from thread 1, and will not be repeated here. The second arbitration logic 912 is configured to sequentially select and send the microinstructions corresponding to the four threads obtained through the two instruction fetch / decode pipelines to the instruction dispatch unit.
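
The sequential selection performed by the second arbitration logic 912 can be sketched as a round-robin over the microinstruction queues. Round-robin is an assumption chosen for illustration; the disclosure states only that microinstructions are selected and sent in sequence. The function name and the example queue contents are likewise hypothetical.

```python
from collections import deque

def round_robin_select(queues, start=0):
    """Drain the queues in round-robin order, skipping empty queues.

    queues: dict mapping a queue label to a deque of microinstructions.
    Returns a list of (label, microinstruction) in the order sent to dispatch.
    """
    labels = list(queues)
    sent = []
    idx = start
    remaining = sum(len(q) for q in queues.values())
    while remaining:
        label = labels[idx % len(labels)]
        if queues[label]:
            sent.append((label, queues[label].popleft()))
            remaining -= 1
        idx += 1
    return sent

queues = {
    "9061": deque(["t1_u0", "t1_u1"]),
    "9062": deque(["t2_u0"]),
    "9063": deque(["t3_u0"]),
    "9064": deque(),
}
order = round_robin_select(queues)
```

Round-robin keeps any single thread from monopolizing the instruction dispatch unit; a priority-based or age-based arbiter would be an equally valid reading of "in sequence".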

[0158] The processor core shown in Figure 11 is suitable for a four-thread mode. In both instruction decoding mode and microinstruction cache mode, it allows four threads to use the two fetch-decode pipelines, which can increase the number of instructions received by the processor core front end and thus improve multi-threaded performance.

[0159] At least one embodiment of this disclosure also provides an electronic device. Figure 12 shows a schematic block diagram of an electronic device according to at least one embodiment of this disclosure.

[0160] For example, as shown in Figure 12, the electronic device 1200 includes a processor 1210 and a memory 1220. The memory 1220 is used to store non-transitory computer-readable instructions (e.g., one or more computer program modules). The processor 1210 is used to execute the computer-readable instructions, which, when executed by the processor 1210, perform the instruction processing method provided in any embodiment of this disclosure. The memory 1220 and the processor 1210 can be interconnected via a bus system and / or other forms of connection mechanism (not shown).

[0161] The processor 1210 can be a central processing unit (CPU), tensor processor (TPU), network processor (NP), or graphics processing unit (GPU) with data processing and / or program execution capabilities. It can also be a digital signal processor (DSP), application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. For example, the CPU can be based on x86 or ARM architectures. The processor 1210 can be a general-purpose processor or a special-purpose processor, capable of controlling other components in the electronic device 1200 to perform desired functions.

[0162] For example, memory 1220 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and / or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and / or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), hard disk, erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, flash memory, etc. One or more computer program modules may be stored on the computer-readable storage medium, and processor 1210 may run one or more computer program modules to implement various functions of electronic device 1200. Various application programs and various data, as well as various data used and / or generated by the application programs, may also be stored in the computer-readable storage medium.

[0163] It should be noted that, in the embodiments of this disclosure, for the specific functions and technical effects of the electronic device 1200, reference may be made to the description of the instruction processing method above; they are not repeated here.

[0164] At least one embodiment of this disclosure also provides a non-transitory storage medium. Figure 13 is a schematic diagram of a computer-readable storage medium provided in at least one embodiment of this disclosure. For example, as shown in Figure 13, the storage medium 1300 non-transitorily stores computer-executable instructions 1310, and the instruction processing method of any embodiment of this disclosure is performed when the computer-executable instructions 1310 are executed by a computer (including a processor).

[0165] For example, one or more computer instructions may be stored on the storage medium 1300. Some of the computer instructions stored on the storage medium 1300 may be, for example, instructions for implementing one or more steps in the instruction processing method described above.

[0166] For example, the storage medium may include the storage component of a tablet computer, the hard disk of a personal computer, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), optical disc read-only memory (CD-ROM), flash memory, or any combination of the above storage media, or other suitable storage media. For example, storage medium 1300 may include memory 1220 in the aforementioned electronic device 1200.

[0167] For the technical effects of the storage medium provided in the embodiments of this disclosure, reference may be made to the corresponding description of the instruction processing method in the above embodiments; they are not repeated here.

[0168] At least one embodiment of this disclosure also provides an electronic device that includes the processor core of any of the above embodiments.

[0169] Figure 14 shows a schematic block diagram of an electronic device according to at least another embodiment of the present disclosure. The electronic device 1400 shown in Figure 14 is merely an example and should not be construed as limiting the functionality and scope of the embodiments of the present disclosure.

[0170] As shown in Figure 14, in some examples, the electronic device 1400 includes a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 1401, which may include the processor core of any of the above embodiments, capable of performing various appropriate actions and processes according to a program stored in read-only memory (ROM) 1402 or a program loaded from storage device 1408 into random access memory (RAM) 1403. The RAM 1403 also stores various programs and data required for the operation of the computer system. The processing device 1401, ROM 1402, and RAM 1403 are connected via a bus 1404. An input / output (I / O) interface 1405 is also connected to the bus 1404.

[0171] For example, the following components can be connected to I / O interface 1405: input devices 1406 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 1407 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 1408 including, for example, magnetic tapes, hard disks, etc.; and communication devices 1409, such as network interface cards like LAN cards and modems, etc. Communication device 1409 allows electronic device 1400 to communicate with other devices over wireless or wired connections to exchange data, and to perform communication processing via networks such as the Internet. Drive 1410 is also connected to I / O interface 1405 as needed. Removable media 1411, such as magnetic disks, optical disks, magneto-optical disks, semiconductor memories, etc., are installed on drive 1410 as needed so that computer programs read from them can be installed into storage device 1408 as needed.

[0172] Although Figure 14 illustrates an electronic device 1400 including various devices, it should be understood that implementation or inclusion of all the devices shown is not required. More or fewer devices may be implemented or included alternatively.

[0173] For example, the electronic device 1400 may further include a peripheral interface (not shown). This peripheral interface can be any of various types of interfaces, such as a USB interface, a Lightning interface, etc. The communication device 1409 can communicate wirelessly with networks and other devices, such as the Internet, an intranet, and / or a wireless network such as a cellular telephone network, a wireless local area network (WLAN), and / or a metropolitan area network (MAN). Wireless communication can use any of a variety of communication standards, protocols, and technologies, including but not limited to Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), Wideband Code Division Multiple Access (W-CDMA), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Bluetooth, Wi-Fi (e.g., based on IEEE 802.11a, IEEE 802.11b, IEEE 802.11g, and / or IEEE 802.11n standards), Voice over Internet Protocol (VoIP), WiMAX, protocols for email, instant messaging, and / or Short Message Service (SMS), or any other suitable communication protocol.

[0174] For example, the electronic device 1400 may be any device such as a mobile phone, tablet computer, laptop computer, e-book reader, game console, television, digital photo frame, navigator, server, etc., or any combination of data processing devices and hardware. The embodiments of this disclosure are not limited in this respect.

[0175] The following points need to be clarified regarding this disclosure:

[0176] (1) The accompanying drawings of the embodiments of this disclosure show only the structures involved in the embodiments of this disclosure. For other structures, reference may be made to conventional designs.

[0177] (2) Where there is no conflict, features of the same embodiment and different embodiments of this disclosure can be combined with each other.

[0178] The above are merely specific embodiments of this disclosure, but the scope of protection of this disclosure is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this disclosure should be included within the scope of protection of this disclosure. Therefore, the scope of protection of this disclosure should be determined by the scope of the claims.

Claims

1. A processor core comprising a front end, wherein the front end includes N instruction fetch-decode pipelines, at least one instruction cache, and at least one microinstruction cache, where N is an integer greater than or equal to 2; each of the instruction fetch-decode pipelines includes instruction fetch selection logic, an instruction cache port, a microinstruction cache port, a decoder, and a first microinstruction queue; the instruction cache port is configured to read and write the instruction cache and to provide the instructions retrieved from the instruction cache to the decoder; the microinstruction cache port is configured to read and write the microinstruction cache and to provide the first microinstruction obtained from the microinstruction cache to the first microinstruction queue; the decoder is configured to decode instructions obtained from the instruction cache port into second microinstructions and to provide the decoded second microinstructions to the first microinstruction queue; and the instruction fetch selection logic is configured to select the instruction cache port and the decoder for instruction fetch and decoding operations, or to select the microinstruction cache port for instruction fetch and decoding operations.

2. The processor core of claim 1, wherein the front end further includes an instruction dispatch unit configured to dispatch, for execution, microinstructions received from the first microinstruction queue of one of the N instruction fetch-decode pipelines.

3. The processor core of claim 1 or 2, wherein the instruction cache is configured to be coupled to the instruction cache ports of the N instruction fetch-decode pipelines via memory interleaving technology or multi-port technology, so that the N instruction fetch-decode pipelines can access the instruction cache; and / or the microinstruction cache is configured to be coupled to the microinstruction cache ports of the N instruction fetch-decode pipelines via memory interleaving technology or multi-port technology, so that the N instruction fetch-decode pipelines can access the microinstruction cache.

4. The processor core of claim 2 or 3, wherein the front end further includes a branch prediction unit configured to generate a prediction result based on a received instruction address for use by the N instruction fetch-decode pipelines.

5. The processor core of claim 4, wherein the front end further includes a boundary information determination unit configured to generate boundary information based on the prediction result and to generate at least one information stream based on the prediction result, wherein the types of boundary information include the last byte of a jump branch instruction, the last byte of the prediction result, or the last byte of an intermediate instruction in the prediction result.

6. The processor core of claim 5, wherein the front end further includes a window selection unit configured to allocate the at least one information stream to the N instruction fetch-decode pipelines for processing, wherein the allocation strategy of the window selection unit includes any one of the following: the number of prediction results, the number of bytes of the prediction results, the blocking degree of the N instruction fetch-decode pipelines, and the window switching frequency in different modes.

7. The processor core of claim 5 or 6, wherein the front end further includes reordering logic configured to reorder the microinstructions obtained by the N instruction fetch-decode pipelines processing the at least one information stream before sending them to the instruction dispatch unit.

8. The processor core of any of claims 4-7, wherein the front end further includes first arbitration logic and second arbitration logic, wherein the prediction results include prediction results for M threads, where M is an integer greater than or equal to 2; the first arbitration logic is configured to distribute the prediction results from the M threads to the N instruction fetch-decode pipelines for processing; and the second arbitration logic is configured to sequentially select and send the microinstructions corresponding to the M threads obtained after processing by the N instruction fetch-decode pipelines to the instruction dispatch unit.

9. The processor core of claim 8, wherein each of the instruction fetch-decode pipelines further includes a second microinstruction queue, and the second microinstruction queue is configured to receive microinstructions from the decoder and / or the microinstruction cache port in each of the instruction fetch-decode pipelines.

10. An instruction processing method, comprising: In response to an instruction processing request, a target fetch-decode pipeline is selected from the N fetch-decode pipelines included in the front end of the processor core to respond to the instruction processing request, where N is an integer greater than or equal to 2. In the target instruction fetch-decode pipeline, instruction fetch-decode operations are performed in response to the instruction processing request via either the instruction cache port and the decoder, or via the microinstruction cache port. In response to selecting the microinstruction cache port to perform an instruction fetch and decode operation in response to the instruction processing request, the method includes: retrieving a first microinstruction from the microinstruction cache through the microinstruction cache port and providing the first microinstruction to the first microinstruction queue, or In response to selecting the instruction cache port and the decoder to perform an instruction fetch and decode operation in response to the instruction processing request, the method includes: fetching an instruction from the instruction cache through the instruction cache port and providing the instruction to the decoder, wherein the decoder decodes the instruction into a second microinstruction and provides the second microinstruction to the first microinstruction queue.

11. The instruction processing method according to claim 10, further comprising: Microinstructions received from the first microinstruction queue of one of the target fetch-decode pipelines among the N fetch-decode pipelines are distributed for execution.

12. The instruction processing method of claim 10 or 11, wherein, The instruction cache is coupled to the instruction cache port of the target fetch-decode pipeline in the N fetch-decode pipelines through memory interleaving technology or multi-port technology, so that the target fetch-decode pipeline in the N fetch-decode pipelines can access the instruction cache. and / or The microinstruction cache is coupled to the microinstruction cache port of the target fetch-decode pipeline in the N fetch-decode pipelines through memory interleaving technology or multi-port technology, so that the target fetch-decode pipeline in the N fetch-decode pipelines can access the microinstruction cache.

13. The instruction processing method according to claim 11 or 12, further comprising: A prediction result is generated based on the received instruction address to generate the instruction processing request for the target instruction fetch / decode pipeline among the N instruction fetch / decode pipelines.

14. The instruction processing method according to claim 13, further comprising: Boundary information is generated based on the prediction result corresponding to the instruction processing request, and at least one information stream is generated based on the prediction result. The types of boundary information include the last byte of a jump branch instruction, the last byte of the prediction result, or the last byte of an intermediate instruction in the prediction result.

15. The instruction processing method according to claim 14, further comprising: The at least one information stream is assigned to the target fetch / decode pipeline among the N fetch / decode pipelines via window selection logic to process the at least one information stream. The window selection logic includes any one of the following: the number of prediction results, the number of bytes in the prediction results, the blocking degree of the target fetch / decode pipeline in the N fetch / decode pipelines, and the window switching frequency in different modes.

16. The instruction processing method according to claim 15, further comprising: The microinstructions obtained by the target fetch-decode pipeline in the N fetch-decode pipelines from processing the at least one information stream are reordered for distribution.

17. The instruction processing method of any of claims 13-16, wherein, The prediction results include prediction results for M threads, where M is an integer greater than or equal to 2. The method further includes: The prediction results corresponding to the instruction processing requests in the M threads are assigned to the target instruction fetch / decode pipeline in the N instruction fetch / decode pipelines for processing; and The microinstructions corresponding to the M threads obtained after being processed by the target fetch-decode pipeline in the N fetch-decode pipelines are selected in sequence for distribution.

18. The instruction processing method of claim 17, wherein, The response to selecting the microinstruction cache port for instruction fetching and decoding in response to the instruction processing request includes: The first microinstruction is retrieved from the microinstruction cache via the microinstruction cache port and provided to the second microinstruction queue; and The response to selecting the instruction cache port and the decoder to perform an instruction fetch and decode operation in response to the instruction processing request includes: Instructions are retrieved from the instruction cache through the instruction cache port and provided to the decoder, which decodes the instructions into second microinstructions and provides the second microinstructions to the second microinstruction queue.

19. An electronic device comprising the processor core according to any one of claims 1-9.

20. An electronic device, comprising: a processor; and a memory including one or more computer program instructions, wherein the one or more computer program instructions, when executed by the processor, implement the instruction processing method of any one of claims 10-18.

21. A computer-readable storage medium, non-transitorily storing computer-readable instructions, wherein the instruction processing method according to any one of claims 10-18 is implemented when the computer-readable instructions are executed by a processor.