Instruction processing apparatus, method, processor, chip and board
By breaking down instructions into micro-instructions and employing a chained execution mechanism, the pipeline blocking problem caused by instruction dependencies is solved, achieving higher instruction parallelism and improved processor performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CAMBRIAN (KUNSHAN) INFORMATION TECH CO LTD
- Filing Date
- 2024-12-30
- Publication Date
- 2026-06-30
AI Technical Summary
In existing technologies, the dependencies between instructions cause instruction pipeline blockage, limiting the advantages of instruction-level parallelism and making it difficult to fully tap the processor's computational potential.
A microinstruction-level dependency determination scheme is adopted, which breaks down instructions into microinstructions and uses a chain execution mechanism to schedule instructions to execute in parallel with fine granularity, thereby reducing the latency between dependent instructions.
By determining dependencies at the microinstruction level, higher instruction parallelism and processor performance are achieved, while reducing latency between dependent instructions.
[Figure: abstract drawing CN122308905A_ABST]
Description
Technical Field
[0001] This disclosure generally relates to the field of processors. More specifically, this disclosure relates to an instruction processing apparatus, an instruction processing method, a processor, a chip, and a circuit board.
Background Technology
[0002] With the development of computer technology, the computing power of processors has continued to improve. The design of single-core processors, which improves performance by increasing frequency, has encountered a bottleneck due to power consumption limitations. Multi-core processors have gradually taken over the market. In multi-core processors, tasks can be distributed to different cores for processing, improving program parallelism. Within a single processor core, different processes can also be distributed to multiple processing units simultaneously.
[0003] Instruction-level parallelism plays a crucial role in fully utilizing the computational power of multiple arithmetic units and / or multiple processors. However, various dependencies exist between instructions, including architecture dependencies, data dependencies, and control dependencies. Instruction dependencies can easily cause pipeline blockages, limiting the full potential of instruction-level parallelism.
[0004] In view of this, there is an urgent need for an instruction control scheme that can improve instruction-level parallelism, to maximize the potential of instruction parallelism, and to further improve the processor's computing efficiency.
Summary of the Invention
[0005] In order to at least address one or more of the technical problems mentioned above, this disclosure proposes an instruction processing scheme in several aspects.
[0006] In a first aspect, this disclosure provides an instruction processing apparatus, including a broadcast instruction slot, the broadcast instruction slot comprising: an instruction merging unit configured to collect broadcast instructions from various processor cores participating in the broadcast and merge them into a unified broadcast instruction; a microinstruction generation unit configured to generate a plurality of broadcast microinstructions based on the unified broadcast instruction; and a microinstruction slot configured to cache broadcast microinstructions to be executed, and to control whether to emit the currently processed broadcast microinstruction based on whether there is a microinstruction-level dependency between the currently processed broadcast microinstruction and the preceding instructions being executed by the various processor cores.
[0007] In a second aspect, this disclosure provides an instruction processing method, comprising: collecting broadcast instructions from various processor cores participating in broadcasting and merging them into a unified broadcast instruction; generating a plurality of broadcast micro-instructions based on the unified broadcast instruction; determining whether there is a micro-instruction-level dependency between the currently processed broadcast micro-instruction and the preceding instructions being executed by the various processor cores; and controlling whether to emit the currently processed broadcast micro-instruction according to the micro-instruction-level dependency.
[0008] In a third aspect, this disclosure provides a processor including the instruction processing apparatus of the first aspect.
[0009] In a fourth aspect, this disclosure provides a chip including the processor of the aforementioned third aspect.
[0010] In a fifth aspect, this disclosure provides a board including the chip of the aforementioned fourth aspect.
[0011] Through the instruction processing apparatus, method, processor, chip, and board provided above, the embodiments disclosed herein provide microinstruction-level dependency determination for broadcast instruction scenarios, which can schedule parallel instruction execution at a finer granularity, reduce the latency between dependent instructions, and thereby improve instruction parallelism and processor performance.
Attached Figure Description
[0012] The above and other objects, features, and advantages of exemplary embodiments of this disclosure will become readily apparent upon reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of this disclosure are illustrated by way of example and not limitation, and like or corresponding reference numerals denote like or corresponding parts, wherein:
[0013] Figure 1 shows a schematic diagram of the structure of a board according to an embodiment of this disclosure;
[0014] Figure 2 shows a structural diagram of the combined processing apparatus in the chip according to an embodiment of this disclosure;
[0015] Figure 3 shows a schematic diagram of the internal structure of the processor core when the computing device of Figure 2 is a single-core device;
[0016] Figure 4 shows a simplified diagram of the internal structure when the computing device of Figure 2 is a multi-core device;
[0017] Figure 5A shows a schematic diagram of the sequential execution of dependent instructions in the existing approach;
[0018] Figure 5B shows a schematic diagram of the parallel execution of dependent instructions according to an embodiment of this disclosure;
[0019] Figure 6 shows an instruction execution scenario involving various dependencies according to embodiments of this disclosure;
[0020] Figure 7 shows an exemplary structural block diagram of an instruction processing apparatus according to an embodiment of this disclosure;
[0021] Figure 8 shows an exemplary structural block diagram of an instruction processing apparatus according to other embodiments of this disclosure;
[0022] Figure 9 shows several possible relationships between the data memory access addresses of two consecutive instructions;
[0023] Figure 10 shows an exemplary structural block diagram of an instruction processing apparatus according to further embodiments of the present disclosure;
[0024] Figure 11A and Figure 11B show schematic diagrams of an address comparison optimization scheme according to some embodiments of this disclosure;
[0025] Figure 12A and Figure 12B show an example of implementing the chained parallel mechanism on multidimensional instructions;
[0026] Figure 13 shows an exemplary structural block diagram of an instruction processing apparatus according to some embodiments of this disclosure;
[0027] Figure 14 shows an exemplary structural block diagram of an instruction processing apparatus according to other embodiments of this disclosure; and
[0028] Figure 15 shows an exemplary flowchart of an instruction processing method according to an embodiment of this disclosure.
Detailed Implementation
[0029] The technical solutions in the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this disclosure, not all of them. Based on the embodiments in this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.
[0030] It should be understood that the terms "first," "second," "third," and "fourth," etc., that may appear in the claims, specification, and drawings of this disclosure are used to distinguish different objects, rather than to describe a specific order. The terms "comprising" and "including" as used in the specification and claims of this disclosure indicate the presence of the described features, integers, steps, operations, elements, and / or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or groups thereof.
[0031] It should also be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of this disclosure. As used in this disclosure and claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used in this disclosure and claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes such combinations.
[0032] As used in this specification and claims, the term "if" may be interpreted, depending on the context, as "when," "once," "in response to determination," or "in response to detection." Similarly, the phrase "if determined" or "if [described condition or event] is detected" may be interpreted, depending on the context, as "once determined," "in response to determination," "once [described condition or event] is detected," or "in response to detection of [described condition or event]."
[0033] The specific embodiments disclosed herein will now be described in detail with reference to the accompanying drawings.
[0034] Exemplary hardware environment
[0035] Figure 1 shows a schematic diagram of the structure of a board 10 according to an embodiment of this disclosure. As shown in Figure 1, board 10 includes chip 101, which is a system-on-a-chip (SoC) integrating one or more combined processing devices. These combined processing devices are artificial intelligence computing units used to support various deep learning and machine learning algorithms, meeting the intelligent processing needs of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. In particular, deep learning technology is widely used in cloud intelligence, and a significant characteristic of cloud intelligence applications is the large volume of input data, placing high demands on the platform's storage and computing capabilities. Board 10 in this embodiment is suitable for cloud intelligence applications, possessing massive off-chip storage, on-chip storage, and powerful computing capabilities.
[0036] Chip 101 is connected to external device 103 via external interface device 102. External device 103 may be, for example, a server, computer, camera, monitor, mouse, keyboard, network card, or Wi-Fi interface. Data to be processed can be transmitted from external device 103 to chip 101 via external interface device 102. The calculation results from chip 101 can be transmitted back to external device 103 via external interface device 102. Depending on the application scenario, external interface device 102 may have different interface forms, such as a PCIe interface.
[0037] The board 10 also includes a storage device 104 for storing data, which includes one or more memory cells 105. The storage device 104 is connected to and transmits data with the controller 106 and the chip 101 via a bus. The controller 106 in the board 10 is configured to regulate the state of the chip 101. Therefore, in one application scenario, the controller 106 may include a microcontroller (MCU).
[0038] Figure 2 is a structural diagram of the combined processing device in chip 101 of this embodiment. As shown in Figure 2, the combined processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a storage device (DRAM) 204.
[0039] The computing device 201 is configured to perform user-specified operations. It is mainly implemented as a single-core intelligent processor or a multi-core intelligent processor to perform deep learning or machine learning calculations. It can interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
[0040] Interface device 202 is used to transmit data and control commands between computing device 201 and processing device 203. For example, computing device 201 can obtain input data from processing device 203 via interface device 202 and write it to on-chip storage device of computing device 201. Further, computing device 201 can obtain control commands from processing device 203 via interface device 202 and write them to on-chip control cache of computing device 201. Alternatively or optionally, interface device 202 can also read data from storage device of computing device 201 and transmit it to processing device 203.
[0041] Processing device 203, as a general-purpose processing device, performs basic control including but not limited to data transfer, and starting and / or stopping computing device 201. Depending on the implementation, processing device 203 may be one or more types of processors, including but not limited to digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs. As mentioned above, computing device 201 disclosed herein can be considered as having a single-core structure or a homogeneous multi-core structure. However, when computing device 201 and processing device 203 are considered together, they are considered to form a heterogeneous multi-core structure.
[0042] Storage device 204 is used to store data to be processed. It may be DRAM, an off-chip memory, specifically DDR memory, typically 16G or larger, used to store data of computing device 201 and / or processing device 203.
[0043] Figure 3 shows a schematic diagram of the internal structure of the processor core when the computing device 201 of Figure 2 is a single-core device. The computing device 301 is used to process input data in fields such as computer vision, speech, natural language, and data mining. The computing device 301 includes three main modules: control module 31 (also called controller), computation module 32 (also called arithmetic unit), and storage module 33 (also called memory).
[0044] The control module 31 coordinates and controls the operation of the computation module 32 and the storage module 33 to complete the deep learning task. It includes an instruction fetch unit (IFU) 311 and an instruction decode unit (IDU) 312. The instruction fetch unit 311 fetches instructions from the processing device 203, and the instruction decode unit 312 decodes the fetched instructions and sends the decoding result as control information to the computation module 32 and the storage module 33.
[0045] The computation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used to perform vector operations and can support complex operations such as vector multiplication, addition, and nonlinear transformations; the matrix operation unit 322 is responsible for the core computations of deep learning algorithms, namely matrix multiplication and convolution.
[0046] Storage module 33 is used to store or move relevant data, including neuron RAM (NRAM) 331, weight RAM (WRAM) 332, and direct memory access (DMA) module 333. NRAM 331 stores input neurons, output neurons, and intermediate results after computation; WRAM 332 stores the convolution kernels of the deep learning network, i.e., the weights; DMA 333 is connected to DRAM 204 via bus 34 and is responsible for data transfer between computing device 301 and DRAM 204. It should be noted that NRAM and WRAM here can be two storage regions formed by dividing the same memory in logical storage space, or they can be two independent memories; no specific limitation is made here.
[0047] Figure 4 shows a simplified diagram of the internal structure when the computing device 201 of Figure 2 is a multi-core device. A multi-core computing device can be abstracted using a hierarchical hardware model. As shown, the multi-core computing device 400 is a system-on-a-chip that includes at least one computing cluster, and each computing cluster includes multiple processor cores. In other words, the multi-core computing device 400 is constructed in a hierarchical structure of system-on-a-chip, computing cluster, and processor cores.
[0048] From the perspective of the system-on-a-chip hierarchy, as shown in the figure, the multi-core computing device 400 includes an external memory controller 41, a peripheral communication module 42, an on-chip interconnect module 43, a global synchronization module 44, and multiple computing clusters 45.
[0049] There can be multiple external memory controllers 41; two are shown as an example in the figure. These controllers respond to access requests issued by the processor cores in order to access external storage devices (e.g., DRAM 204 in Figure 2), so as to read data from or write data to off-chip storage. The peripheral communication module 42 receives control signals from the processing device (203 in Figure 2) via the interface device (202 in Figure 2) and starts the computing device (201 in Figure 2) to perform tasks. The on-chip interconnect module 43 connects the external memory controllers 41, the peripheral communication module 42, and the multiple computing clusters 45 to transmit data and control signals between the modules. The global synchronization module 44 is, for example, a global barrier controller (GBC) that coordinates the working progress of each computing cluster and ensures information synchronization. The multiple computing clusters 45 are the computing cores of the multi-core computing device 400. Four are shown on each die in the figure by way of example. With the development of hardware, the multi-core computing device 400 disclosed herein may also include 8, 16, 64, or even more computing clusters 45. The computing clusters 45 are used to efficiently execute deep learning algorithms.
[0050] From the perspective of the computing cluster hierarchy, as shown in the figure, each computing cluster 45 includes multiple processor cores 406 as control and computing units, and also a shared memory core 407 as a storage unit. Furthermore, each computing cluster may also include a local synchronization module 412 to coordinate the working progress of each processor core within the computing cluster, ensuring information synchronization. Four processor cores 406 are exemplarily shown in the figure; this disclosure does not limit the number of processor cores 406.
[0051] The storage core 407 is primarily used for storage and communication, namely storing shared data or intermediate results between processor cores 406, and performing communication between computing clusters 45 and DRAM 204, communication between computing clusters 45, and communication between processor cores 406. In other embodiments, the storage core 407 has scalar operation capabilities and is used to perform scalar operations.
[0052] Storage core 407 includes a shared memory unit (SMEM) 408, a broadcast bus 409, a cluster direct memory access (CDMA) module 410, and a global direct memory access (GDMA) module 411. SMEM 408 acts as a high-performance data relay station. Data multiplexed between different processor cores 406 within the same computing cluster 45 does not need to be obtained from the DRAM 204 by each processor core 406 individually. Instead, it is relayed between processor cores 406 via SMEM 408. Storage core 407 only needs to quickly distribute the multiplexed data from SMEM 408 to multiple processor cores 406, thereby improving inter-core communication efficiency and significantly reducing on-chip and off-chip I / O access. Broadcast bus 409, CDMA 410, and GDMA 411 are used to perform communication between processor cores 406, communication between computing clusters 45, and data transfer between computing clusters 45 and DRAM 204, respectively.
[0053] From the perspective of the processor core hierarchy, the structure of a single processor core can be similar to that of the single-core computing device shown in Figure 3, and is not described in detail here.
[0054] Instruction processing scheme
[0055] As mentioned earlier, dependencies between instructions can easily cause pipeline blockages, limiting the advantages of instruction-level parallelism. In most instruction set architectures (ISAs), dependencies between instructions can be determined by judging the address ranges of instruction reads and writes. The hardware preprocesses the instructions to calculate the address ranges of each operand, thus determining which instructions can be executed in parallel.
[0056] For example, for two instructions S1 and S2, if the address ranges of the write operands of S1 and arbitrary operands of S2 overlap, or the address ranges of the read operands of S1 and write operands of S2 overlap, then S1 and S2 are considered to have a dependency relationship. That is, if there is any overlap in the operand ranges between two instructions, and at least one of the overlapping operands is a write operand, then there is a dependency between these two instructions.
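The instruction-level rule above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware described in this disclosure; the names `AddrRange`, `Instr`, and `has_dependency`, and the half-open address ranges, are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AddrRange:
    """Hypothetical half-open address range [start, end)."""
    start: int
    end: int

    def overlaps(self, other: "AddrRange") -> bool:
        return self.start < other.end and other.start < self.end


@dataclass
class Instr:
    """Hypothetical instruction record: read and write operand ranges."""
    name: str
    reads: list = field(default_factory=list)
    writes: list = field(default_factory=list)


def any_overlap(xs, ys) -> bool:
    return any(x.overlaps(y) for x in xs for y in ys)


def has_dependency(s1: Instr, s2: Instr) -> bool:
    """s1 is the earlier instruction, s2 the later one.

    A dependency exists when ranges overlap and at least one of the
    overlapping operands is a write operand.
    """
    return (any_overlap(s1.writes, s2.reads + s2.writes)
            or any_overlap(s1.reads, s2.writes))


s1 = Instr("S1", reads=[AddrRange(0, 64)], writes=[AddrRange(64, 128)])
s2 = Instr("S2", reads=[AddrRange(96, 160)], writes=[AddrRange(256, 320)])
s3 = Instr("S3", reads=[AddrRange(0, 64)], writes=[AddrRange(512, 576)])

print(has_dependency(s1, s2))  # S2 reads part of what S1 writes
print(has_dependency(s1, s3))  # S1 and S3 only share a read range
```

Note that S1 and S3 both read the same range, but since neither overlapping operand is a write, the sketch reports no dependency.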
[0057] Once the dependencies between instructions are identified, the hardware can decide to execute the instructions in parallel, ensuring sequential consistency. That is, if an instruction has no dependency on any earlier, incomplete instruction, it can be issued immediately. From the programmer's perspective, the execution result is consistent with serial execution. When the hardware can determine dependencies between instructions and guarantee correctness, because all instructions are ordered, instructions do not need to be distinguished by their instruction stream, and synchronization instructions within the kernel can be omitted.
[0058] However, the above method only allows completely independent instructions to execute in parallel. If two instructions depend on each other, the later instruction can only be issued after the earlier instruction has completed and committed. As a result, the granularity of parallel instruction scheduling is too coarse, and the latency between dependent instructions is high.
[0059] In view of this, the present disclosure provides a microinstruction-level dependency determination scheme. The basic idea is to break instructions down into microinstructions (uops) and then determine dependencies between those microinstructions. When instruction-level dependencies exist, this approach can significantly reduce the latency between dependent instructions compared with serial, instruction-by-instruction execution, and achieves finer-grained parallelism between instructions.
[0060] Figure 5A shows a schematic diagram of the sequential execution of dependent instructions in the existing approach. Figure 5B shows a schematic diagram of the parallel execution of dependent instructions according to an embodiment of this disclosure. In the figures, the horizontal axis represents time, and the vertical axis represents four different memory access spaces A0 to A3. The figures show two computation instructions, where instruction C1 involves writing data to the four memory spaces A0 to A3, and instruction C2 involves reading data from the four memory spaces A0 to A3.
[0061] As shown in Figure 5A, according to the existing instruction parallelism scheme, due to the dependency between instructions C1 and C2, the subsequent instruction C2 can only be executed after instruction C1 has completed execution and the dependency has been resolved. Specifically, since instruction C1 has no dependency on the preceding instructions, instruction C1 can be executed, sequentially writing data to memory access spaces A0 to A3. After writing to A3 is completed, the dependency of instruction C2 is resolved, and then instruction C2 is executed, sequentially reading data from memory access spaces A0 to A3. Therefore, when instructions C1 and C2 have a dependency, they can only be executed sequentially.
[0062] In contrast, as shown in Figure 5B, according to the instruction parallelism scheme of the embodiments of this disclosure, even if there is a dependency between instructions C1 and C2, both instructions are broken down into several microinstructions, and microinstructions that have no dependency on each other can be executed in parallel without waiting for the entire instruction C1 to finish. Specifically, since instruction C1 has no dependency on its preceding instructions, instruction C1 can be executed. All four microinstructions into which instruction C1 is broken down can then be executed, that is, data is written to memory access spaces A0 to A3 in sequence. For instruction C2, since there is a dependency between instruction C2 and instruction C1, the dependencies of the microinstructions of instruction C2 are determined individually. For example, the microinstruction of instruction C2 that reads space A0 depends on the microinstruction of instruction C1 that writes space A0. Once the microinstruction that writes space A0 has executed, that dependency is removed, and the microinstruction of instruction C2 that reads space A0 can be executed. The dependencies of the remaining microinstructions of instruction C2 are determined in the same way, and microinstructions are blocked and released accordingly. Therefore, even though instructions C1 and C2 have a dependency, a certain degree of parallel instruction execution can still be achieved by determining dependencies at the microinstruction level. As shown in Figure 5B, while the microinstruction of instruction C1 that writes space A1 is executing, the microinstruction of instruction C2 that reads space A0 can execute at the same time; similarly, while the microinstruction of instruction C1 that writes space A2 is executing, the microinstruction of instruction C2 that reads space A1 can execute at the same time, and so on. As can be seen from the diagram, this parallel execution method interleaves the microinstructions of different instructions like the links of a chain; this execution method is referred to as "chaining" in this disclosure. In this way, instructions C1 and C2, which have a dependency, can still achieve parallelism at the microinstruction level, thereby greatly reducing the latency between instructions.
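The difference between the two schedules can be made concrete with a small timing model. The sketch below is an illustration under simplifying assumptions that are not stated in this disclosure: every microinstruction takes one time unit, writes and reads use separate execution units, and each instruction processes spaces A0 to A3 in order. The function names are hypothetical.

```python
def instruction_level_schedule(n_spaces: int, uop_latency: int = 1):
    """Existing approach: C2 starts only after all of C1 has finished."""
    write_start = [i * uop_latency for i in range(n_spaces)]
    c1_done = n_spaces * uop_latency
    read_start = [c1_done + i * uop_latency for i in range(n_spaces)]
    return write_start, read_start


def chained_schedule(n_spaces: int, uop_latency: int = 1):
    """Chained approach: the read of space Ai starts as soon as the
    write of Ai is done and C2's previous read has finished."""
    write_start = [i * uop_latency for i in range(n_spaces)]
    read_start = []
    prev_read_done = 0
    for i in range(n_spaces):
        write_done = write_start[i] + uop_latency
        start = max(write_done, prev_read_done)
        read_start.append(start)
        prev_read_done = start + uop_latency
    return write_start, read_start


def total_time(schedule, uop_latency: int = 1) -> int:
    """Time at which the last microinstruction finishes."""
    _, read_start = schedule
    return read_start[-1] + uop_latency


sequential = instruction_level_schedule(4)
chained = chained_schedule(4)
print(sequential, total_time(sequential))
print(chained, total_time(chained))
```

Under these assumptions the read of A0 overlaps the write of A1, the read of A1 overlaps the write of A2, and so on, so the pair of instructions finishes in 5 time units instead of 8.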
[0063] The example above describes a scenario with a read-after-write dependency between instructions (C2 reads the data that C1 writes). In addition to read-after-write dependencies, the "chained" execution mechanism provided in the embodiments of this disclosure can also be applied to scenarios with write-after-read and write-after-write dependencies, avoiding the performance loss caused by latency during instruction switching.
[0064] Figure 6 This diagram illustrates an instruction execution scenario with various dependencies according to embodiments of this disclosure. In the figure, the horizontal axis represents time, and the vertical axis represents four different memory access spaces, A0 to A3.
[0065] The diagram illustrates three instructions, in order: load instruction L1 loads data from external storage into local memory spaces A0 to A3; computation instruction C1 performs computations on the data in memory spaces A0 to A3; and load instruction L2 loads data from external storage into local memory spaces A0 to A3. That is, after the data loaded by instruction L1 has been used by computation instruction C1, instruction L2 loads new data into the same locations. Here, there is a read-after-write dependency between computation instruction C1 and load instruction L1 (C1 reads what L1 wrote), a write-after-read dependency between load instruction L2 and computation instruction C1 (L2 overwrites what C1 reads), and a write-after-write dependency between load instruction L2 and load instruction L1.
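As a quick check of the three dependency types in this scenario, the following sketch classifies each instruction pair. It uses the conventional naming in which read-after-write (RAW) means a later read of data that an earlier instruction writes, write-after-read (WAR) means a later write to data that an earlier instruction reads, and write-after-write (WAW) means two writes to the same location. The dictionary layout and the `classify` helper are assumptions for illustration, and C1 is modeled as only reading A0 to A3.

```python
SPACES = {"A0", "A1", "A2", "A3"}

# Hypothetical access summaries for the three instructions in the scenario.
L1 = {"name": "L1", "reads": set(), "writes": set(SPACES)}
C1 = {"name": "C1", "reads": set(SPACES), "writes": set()}
L2 = {"name": "L2", "reads": set(), "writes": set(SPACES)}


def classify(earlier: dict, later: dict) -> set:
    """Return the set of data dependency kinds from `earlier` to `later`."""
    kinds = set()
    if earlier["writes"] & later["reads"]:
        kinds.add("RAW")  # later reads what earlier wrote
    if earlier["reads"] & later["writes"]:
        kinds.add("WAR")  # later overwrites what earlier read
    if earlier["writes"] & later["writes"]:
        kinds.add("WAW")  # both write the same location
    return kinds


for a, b in [(L1, C1), (C1, L2), (L1, L2)]:
    print(a["name"], "->", b["name"], sorted(classify(a, b)))
```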
[0066] Through the "chained" execution mechanism of the embodiments of this disclosure, instruction C1 can begin executing without waiting for instruction L1, on which it depends, to finish, and instruction L2 can begin executing without waiting for instructions L1 and C1, on which it depends, to finish. This allows data to flow continuously into the storage space even though a load instruction incurs some delay because it must first issue a read request. As the timeline shows, data is written to the storage space without interruption; that is, the light gray shaded area in the figure is continuous along the time axis.
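A toy timeline for the L1, C1, L2 sequence illustrates how the writes can stay continuous. The model below is an assumption-laden sketch rather than the disclosed hardware: each microinstruction takes one time unit, the two loads share one hypothetical "load" unit while C1 uses a "compute" unit, every instruction walks A0 to A3 in order, and any two accesses to the same space by consecutive instructions are treated as dependent.

```python
def chained_timeline(instrs, n_spaces: int, latency: int = 1) -> dict:
    """instrs: list of (name, unit) in program order.

    Returns {name: [start time of the uop for each space]}.
    """
    unit_free = {}                # time at which each execution unit is free
    space_free = [0] * n_spaces   # time at which the last uop on each space is done
    timeline = {}
    for name, unit in instrs:
        t = unit_free.get(unit, 0)
        starts = []
        for i in range(n_spaces):
            # Wait for the unit and for the preceding uop on the same space.
            start = max(t, space_free[i])
            starts.append(start)
            t = start + latency
            space_free[i] = t
        unit_free[unit] = t
        timeline[name] = starts
    return timeline


timeline = chained_timeline(
    [("L1", "load"), ("C1", "compute"), ("L2", "load")], n_spaces=4
)
for name, starts in timeline.items():
    print(name, starts)

# Start times of all write uops (L1 followed by L2).
write_starts = timeline["L1"] + timeline["L2"]
print(write_starts)
```

In this model the write microinstructions of L1 and L2 occupy consecutive time slots with no gap, which corresponds to the continuous shaded region described above.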
[0067] Figure 7 An exemplary structural block diagram of an instruction processing apparatus according to an embodiment of the present disclosure is shown. As shown, the instruction processing apparatus 700 includes a decoding unit 710, a microinstruction generation unit 720, a microinstruction-level dependency determination unit 730, and an instruction issuing unit 740.
[0068] The decoding unit 710 is configured to decode instructions. The primary task of the decoding unit is to translate instructions (i.e., machine code) fetched from memory into internal signals that the processor can understand. These signals instruct various parts of the processor how to execute the instructions. Specifically, the decoding unit identifies the opcode in the instruction, which is the part of the instruction used to identify the type of operation to be performed. Based on the opcode, the decoding unit determines whether the instruction is an arithmetic operation, logical operation, data transfer, or control flow operation, etc. The decoding unit also parses the operands in the instruction, including the source operand and the destination operand. It determines the operand type (such as immediate value, register, or memory address) and addressing mode (such as direct addressing, indirect addressing, etc.). The decoding unit also generates control signals that are passed to other parts of the processor, such as the arithmetic logic unit (ALU), data storage unit, and control unit, to guide them in executing the operations required by the instructions.
[0069] The decoding unit typically includes an instruction buffer queue for temporary storage of instructions, which are then fed into the decoding unit at a certain rate (e.g., 6 instructions per cycle) for decoding, and the decoded instructions are passed to the next pipeline stage.
[0070] The microinstruction generation unit 720 is configured to break down the decoded instructions into several microinstructions. The primary task of the microinstruction generation unit is to decompose complex machine instructions into a series of simpler microinstructions. These microinstructions are the basic steps to implement specific operations and typically correspond to basic operations within the processor.
[0071] The microinstruction generation unit can split instructions into several microinstructions according to a predetermined splitting strategy. Typically, splitting is done according to the basic operations that implement the instructions. For cases with large operand data volumes, further splitting can be performed based on the data itself. For example, when the operands are long vectors, they can be split into microinstructions that operate on multiple short vectors.
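The data-based splitting described above can be sketched as follows. This is a minimal illustration only: the function name `split_vector_instruction`, the inclusive address convention, and the fixed `chunk` size are assumptions made for the example and are not taken from the disclosure.

```python
def split_vector_instruction(base, length, chunk):
    """Split one long-vector access into per-microinstruction address ranges.

    Hypothetical model: the instruction touches `length` consecutive addresses
    starting at `base`; each microinstruction handles at most `chunk` of them.
    Returns a list of (start, end) pairs with inclusive bounds.
    """
    if length <= 0 or chunk <= 0:
        raise ValueError("length and chunk must be positive")
    limit = base + length  # one past the last address of the instruction
    return [(s, min(s + chunk, limit) - 1) for s in range(base, limit, chunk)]


# A 100-element vector split into short vectors of at most 32 elements.
micro_ranges = split_vector_instruction(base=0, length=100, chunk=32)
```

Each returned range is the data memory access address of one microinstruction, which is the quantity the dependency determination later compares.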
[0072] The microinstruction-level dependency determination unit 730 is configured to determine whether there is a microinstruction-level dependency between the currently processed microinstruction and the preceding instruction being executed. This dependency mainly refers to data dependencies, such as write-after-read dependency, write-after-write dependency, and read-after-write dependency. Therefore, the determination of dependencies is primarily based on the memory access addresses of the data.
[0073] In some embodiments, for any microinstruction involving a write operation, if the data memory access address of the currently processed microinstruction does not overlap with the data memory access addresses of all incomplete microinstructions in the preceding instructions, it can be determined that there is no microinstruction-level dependency between the currently processed microinstruction and the preceding instructions. Conversely, if there is overlap, it indicates that a microinstruction-level dependency exists. Here, "any microinstruction involving a write operation" refers to either of the two parts being compared, encompassing the three possible scenarios of dependency relationships described above.
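The overlap rule of this paragraph can be illustrated with a short sketch. The `Access` record and the function names are hypothetical, and addresses are modeled as inclusive integer ranges; the sketch only shows the comparison logic, not a hardware implementation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Access:
    """Data memory access of one microinstruction (hypothetical model)."""
    start: int       # first address touched, inclusive
    end: int         # last address touched, inclusive
    is_write: bool


def overlaps(a: Access, b: Access) -> bool:
    """Two inclusive ranges overlap when each starts before the other ends."""
    return a.start <= b.end and b.start <= a.end


def has_micro_dependency(current: Access, incomplete: Iterable[Access]) -> bool:
    """True if `current` conflicts with any incomplete preceding microinstruction.

    A conflict requires overlapping addresses and a write on at least one side,
    which covers read-after-write, write-after-read and write-after-write.
    Two reads of the same addresses never conflict.
    """
    return any(
        (current.is_write or prev.is_write) and overlaps(current, prev)
        for prev in incomplete
    )
```

For instance, a read of addresses 32 to 40 conflicts with an incomplete write of addresses 0 to 63, whereas a read that overlaps only another read does not.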
[0074] The instruction issuing unit 740 is configured to determine whether to issue the currently processed microinstruction based on microinstruction-level dependencies. If it is determined that there is no microinstruction-level dependency between the current microinstruction and its preceding instructions, the instruction issuing unit can issue the microinstruction to a subsequent execution unit for execution. If it is determined that there is a microinstruction-level dependency between the current microinstruction and any preceding instruction, the instruction issuing unit can block the microinstruction until the microinstruction-level dependency is resolved.
[0075] By determining dependencies at the microinstruction level, parallel instruction execution can be scheduled at a finer granularity, reducing latency between dependent instructions and thus improving instruction parallelism and processor performance.
[0076] Figure 8 shows an exemplary structural block diagram of an instruction processing apparatus according to other embodiments of this disclosure. In these embodiments, in addition to a decoding unit 810, a microinstruction generation unit 820, a microinstruction-level dependency determination unit 830, and an instruction issuing unit 840, the instruction processing apparatus 800 includes an instruction-level dependency determination unit 850.
[0077] The instruction-level dependency determination unit 850 can be configured to determine whether an instruction-level dependency exists between the currently processed instruction and a preceding instruction being executed. In this case, the micro-instruction-level dependency determination unit 830 can be further configured to: when the instruction-level dependency determination unit determines that an instruction-level dependency exists, further determine whether a micro-instruction-level dependency exists. Accordingly, the instruction issuing unit 840 can be further configured to: issue the currently processed micro-instruction when either the instruction-level determination or the micro-instruction-level determination indicates that there is no dependency. It is understood that the arrows in the figure are merely illustrative and do not mean that the units are connected in this way in all embodiments. For example, although the figure shows a connection between the instruction-level dependency determination unit 850 and the instruction issuing unit 840, the two need not be directly connected; the information can also be transmitted to the micro-instruction-level dependency determination unit 830 and then to the instruction issuing unit 840. For example, the instruction-level dependency determination unit 850 can pass information indicating no instruction-level dependency to the micro-instruction-level dependency determination unit 830, which can determine on this basis that there is definitely no micro-instruction-level dependency and instruct the instruction issuing unit 840 accordingly.
[0078] By first determining instruction-level dependencies and then only determining micro-instruction-level dependencies for instructions with instruction-level dependencies, the overhead of comparison and judgment can be effectively reduced, thereby improving processor efficiency.
[0079] Dependency determination is based on whether data memory access addresses overlap. In principle, determining microinstruction-level dependencies involves comparing the data memory access address of the subsequent microinstruction with the data memory access addresses of all incomplete microinstructions in the preceding instructions. To reduce comparison overhead, a dynamic address range can be recorded for each instruction, representing the range of addresses accessed by the incomplete portion of the instruction, and updated immediately after each microinstruction is committed.
[0080] Therefore, in some embodiments, the instruction processing apparatus 800 may further include an address range maintenance unit 860, configured to maintain a dynamic address range for each preceding instruction being executed. This address range includes the start and end addresses of the data memory access addresses corresponding to the incomplete portion of the preceding instruction. Further, the address range maintenance unit may be configured to update the address range of the corresponding preceding instruction in response to the commit of a microinstruction. For sequentially executed instructions, this address range gradually shrinks, thereby freeing up address space and allowing the microinstructions of subsequent instructions to change from dependent to independent and thus be executed. This real-time updating of the address range of the incomplete portion of each instruction facilitates the determination of dependencies, the timely removal of dependencies, and the release of microinstructions.
[0081] Because instructions exhibit sequential execution characteristics in the "chained" data memory access dimension, such as execution from front to back (as described later in the context of multidimensional instructions), microinstruction-level dependency determination does not need to consider the end addresses of preceding instructions; only their dynamically changing start addresses need to be considered. Accordingly, the dependency of a microinstruction of the current instruction can be determined by comparing its end address with the start addresses of the preceding instructions. If the end address is less than the start addresses of the incomplete portions of all preceding instructions, there is no dependency and the microinstruction can be issued. More specifically, microinstruction-level dependencies can be determined by comparing the end address of the current instruction's microinstruction with the minimum value (limit) of the current start addresses of all preceding instructions.
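The bookkeeping performed by such an address range maintenance unit can be sketched as follows. The class `AddressRangeTable` and its method names are hypothetical; the sketch further assumes inclusive addresses and that the microinstructions of an instruction are committed in ascending address order, so that only the start address needs to move.

```python
class AddressRangeTable:
    """Tracks, per in-flight instruction, the address range not yet committed."""

    def __init__(self):
        self._ranges = {}  # instruction id -> [current_start, end], inclusive

    def register(self, inst_id, start, end):
        """Record the full access range when the instruction begins execution."""
        self._ranges[inst_id] = [start, end]

    def commit(self, inst_id, micro_end):
        """Shrink the range after a microinstruction ending at `micro_end` commits."""
        rng = self._ranges[inst_id]
        rng[0] = micro_end + 1          # everything up to micro_end is complete
        if rng[0] > rng[1]:
            del self._ranges[inst_id]   # the whole instruction has finished

    def current_range(self, inst_id):
        """Return (start, end) of the incomplete portion, or None if finished."""
        rng = self._ranges.get(inst_id)
        return tuple(rng) if rng is not None else None


table = AddressRangeTable()
table.register(inst_id=0, start=0, end=255)
table.commit(inst_id=0, micro_end=63)   # incomplete portion is now 64..255
```

As the start address advances, addresses below it become available to the microinstructions of later instructions.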
[0082] When both instruction-level dependency judgment and microinstruction-level dependency judgment exist, the judgment method can be further optimized to reduce the overhead of comparison judgment.
[0083] In some embodiments, the instruction-level dependency determination unit 850 may be further configured to: for any instruction involving a write operation, if the starting address in the data memory access address of the currently processed instruction is greater than the current ending address of the preceding instruction being executed, determine that there is no instruction-level dependency; otherwise, determine that there is an instruction-level dependency.
[0084] In these embodiments, the microinstruction-level dependency determination unit 830 may be further configured to: for a microinstruction with an instruction-level dependency, if the end address in the data memory access address of the microinstruction is less than the current start address of the preceding instruction being executed, determine that there is no microinstruction-level dependency; otherwise, determine that there is a microinstruction-level dependency.
[0085] Furthermore, the microinstruction-level dependency determination unit 830 can be further configured to: for a microinstruction with an instruction-level dependency, if the end address in the data memory access address of the microinstruction is less than the minimum value limit of the current start address of all preceding instructions with which it has an instruction-level dependency, determine that there is no microinstruction-level dependency.
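The two-level rule of the preceding paragraphs can be sketched in a few lines. The function names and the representation of each preceding instruction as a `(current_start, current_end)` pair are assumptions of this illustration; write involvement is taken for granted to keep the example short.

```python
def has_instruction_dependency(cur_start, prev_end):
    """Instruction-level rule: independent only if the current instruction
    starts strictly beyond the current end address of the preceding one."""
    return not (cur_start > prev_end)


def micro_can_issue(micro_end, cur_start, preceding):
    """Decide whether a microinstruction of the current instruction may issue.

    `micro_end`  end address of the microinstruction (inclusive)
    `cur_start`  start address of the whole current instruction
    `preceding`  list of (current_start, current_end) for in-flight instructions
    """
    # Keep only the preceding instructions with an instruction-level dependency.
    dependent_starts = [
        start for start, end in preceding
        if has_instruction_dependency(cur_start, end)
    ]
    if not dependent_starts:
        return True  # no instruction-level dependency, so no finer check is needed
    # Microinstruction-level rule: compare against the minimum current start.
    return micro_end < min(dependent_starts)
```

Only one comparison against the minimum start address is needed per microinstruction, instead of one comparison per preceding instruction.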
[0086] The following section will explain the logic for determining the above dependency relationship in more detail by combining several scenarios.
[0087] Figure 9 illustrates several possible scenarios for the data memory access addresses of two consecutive instructions. In the figure, the light-colored box represents the preceding instruction `inst0`, and the dark-colored box represents the following instruction `inst1`. Each box represents a data memory access space defined by a start address and an end address. For ease of observation, the boxes are slightly offset vertically. From Figure 9, it can be seen intuitively that there is no dependency between `inst0` and `inst1` in scenarios A and D, while there is a dependency in scenarios B and C. When the "chained" execution of the embodiments disclosed herein is applied:
[0088] In scenario A, if only the microinstruction-level rule is applied, that is, there is no dependency when the end address of the current instruction's microinstruction is less than the current start addresses of all preceding instructions, then since the start address `inst0_start` of `inst0` is always smaller than the end addresses of the microinstructions of `inst1`, `inst1` will always be considered to have a dependency and will be blocked until `inst0` finishes execution, thus creating a false dependency.
[0089] In scenario B, initially, according to the microinstruction-level rule, `inst1` is blocked because the end addresses of its first few microinstructions are greater than the start address of `inst0`. As the microinstructions of `inst0` are executed, its start address `inst0_start` increments, gradually exceeding the end addresses of the first few microinstructions of `inst1`. Therefore, the first few microinstructions of `inst1` can be released one after another and begin execution. When `inst0` completes, it is committed, at which point the remaining microinstructions of `inst1` can be released.
[0090] In scenario C, initially, according to the microinstruction-level rule, since the end addresses of the first few microinstructions of `inst1` are less than the start address of `inst0`, `inst1` can execute some microinstructions. When execution reaches the part that overlaps with `inst0`, `inst1` is blocked. It then waits for the start address `inst0_start` of `inst0` to increment, and as that start address increments, the dependency is gradually released until execution is complete.
[0091] In scenario D, according to the microinstruction-level rule, the end addresses of the microinstructions of `inst1` are always less than the start address `inst0_start` of `inst0`, so `inst1` is considered to have no dependency and the entire instruction can be executed.
[0092] The case where the address ranges of `inst0` and `inst1` contain one another can be regarded as a combination of scenario B and scenario C. The execution process is similar and will not be elaborated here.
[0093] As can be seen from the above scenarios, in the chained execution phase based on microinstruction-level dependency judgment, scenarios B, C, and D can all achieve chained execution normally, while scenario A will result in false dependencies. Therefore, when microinstruction-level dependency judgment and instruction-level dependency judgment exist simultaneously, scenario A can be judged through instruction-level dependency judgment to eliminate false dependencies, while other scenarios can be handled through microinstruction-level dependency judgment.
[0094] Therefore, when determining instruction-level dependencies, for any instruction involving a write operation, if the starting address of the data memory access address of the currently processed instruction is greater than the current ending address of the preceding instruction, it is determined that there is no instruction-level dependency; otherwise, it is determined that there is an instruction-level dependency. Specifically, for the above four scenarios, the instruction-level dependency determination will consider A to have no dependency, while B, C, and D will all have dependencies.
[0095] For instructions identified as having instruction-level dependencies, a micro-instruction-level dependency determination is further performed. At this point, according to the above analysis, in the chained execution phase based on micro-instruction-level dependency determination, all three scenarios (B, C, and D) can achieve chained execution normally.
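The behavior of scenarios A to D can be reproduced with a small step-based simulation. Everything concrete here is an assumption chosen for illustration: the address values, the fixed microinstruction size `chunk`, the rule that `inst0` commits one microinstruction per step in ascending address order, and the rule that `inst1` issues, in order, every microinstruction that is currently allowed.

```python
def simulate(inst0, inst1, chunk=10, use_instruction_level=True):
    """Return the step at which each microinstruction of inst1 is issued.

    `inst0`, `inst1` are (start, end) inclusive address ranges; inst0 precedes inst1.
    """
    inst0_start, inst0_end = inst0
    inst1_start, inst1_end = inst1
    # End address of each microinstruction of inst1, in issue order.
    micro_ends = list(range(inst1_start + chunk - 1, inst1_end + 1, chunk))
    issued_at, nxt, step = [], 0, 0
    while nxt < len(micro_ends):
        inst0_done = inst0_start > inst0_end
        # Instruction-level pre-check: inst1 lies entirely beyond inst0 (scenario A).
        no_inst_dep = use_instruction_level and inst1_start > inst0_end
        skip_micro_check = inst0_done or no_inst_dep
        # Microinstruction-level rule: end address below inst0's current start.
        while nxt < len(micro_ends) and (
            skip_micro_check or micro_ends[nxt] < inst0_start
        ):
            issued_at.append(step)
            nxt += 1
        if not inst0_done:
            inst0_start += chunk  # inst0 commits one microinstruction
        step += 1
    return issued_at


inst0 = (100, 199)
scenario_a = simulate(inst0, (250, 299))   # inst1 entirely after inst0
scenario_b = simulate(inst0, (150, 249))   # inst1 overlaps the tail of inst0
scenario_c = simulate(inst0, (50, 149))    # inst1 overlaps the head of inst0
scenario_d = simulate(inst0, (0, 49))      # inst1 entirely before inst0
```

Under these assumptions, scenario A issues immediately when the instruction-level pre-check is enabled, but is held until `inst0` finishes when `use_instruction_level=False`, which is the false dependency described above. Scenarios B and C release their microinstructions progressively as `inst0` advances.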
[0096] The above describes the dependency determination and chained execution process from the perspective of processing a single instruction or microinstruction. In real-world scenarios, multiple instructions are typically processed simultaneously, with each instruction broken down into multiple microinstructions, resulting in multiple instructions and microinstructions being processed in parallel.
[0097] Figure 10 shows an exemplary structural block diagram of an instruction processing apparatus according to further embodiments of the present disclosure. In these embodiments, the instruction processing apparatus 1000 may include an issue queue 1010 and a plurality of issue slots 1020.
[0098] The issue queue 1010 supports instruction-level dependency determination and can select instructions for execution out of order. Specifically, the issue queue may include a decoding unit 1011, an instruction-level dependency determination unit 1012, and an address range maintenance unit 1013. For the functions and implementations of these components, refer to the previous description; they are not repeated here.
[0099] The issue slot 1020 can be used to store instructions and microinstructions currently being decoded and executed in parallel, and to perform microinstruction-level dependency determination. Specifically, the issue slot 1020 may include a microinstruction generation unit 1021, a microinstruction-level dependency determination unit 1022, an instruction issuing unit 1023, and a microinstruction submission queue 1024. The microinstruction submission queue 1024 responds to microinstruction commit information generated by the execution unit (not shown) and performs the corresponding commit processing, such as sending the information to the address range maintenance unit 1013 to update the address ranges of the instructions being executed. For the functions and implementations of the other components, refer to the previous description; they are not repeated here.
[0100] The instruction processing device 1000 may include multiple issue slots 1020. The specific implementation details of the issue slots may differ slightly depending on the type of instruction, but they will generally include the aforementioned components. Multiple issue slots can exist for the same type of instruction. Instructions cached in different issue slots are independent of each other, and instructions within the same issue slot are issued in an ordered manner.
[0101] In the chained execution phase based on microinstruction-level dependency determination, finding the minimum of the current start addresses of all preceding instructions requires a large number of address comparison operations, which consumes considerable hardware and time resources, especially when a large number of instructions and/or microinstructions are being executed at the same time.
[0102] In some embodiments disclosed herein, the address comparison process can be optimized by leveraging the characteristic that all microinstruction-level dependency judgment units share the address range information of the preceding instruction, thereby reducing hardware and time consumption.
[0103] Specifically, the aforementioned address range maintenance unit 1013 can be further configured to: compare the current start addresses of all preceding instructions being executed pairwise to obtain a comparison matrix. This comparison matrix can then be transmitted to each microinstruction-level dependency determination unit 1022. Each microinstruction-level dependency determination unit 1022 can then be further configured to: based on the comparison matrix, and using the instruction-level dependencies of the currently processed microinstruction, determine the minimum value of the current start addresses of all preceding instructions that have instruction-level dependencies with the currently processed microinstruction. Then, the end address of the currently processed microinstruction can be compared with the minimum value of this start address to determine whether a microinstruction-level dependency exists.
[0104] In other words, a complete address comparison can be performed first, that is, comparing the current starting addresses of any two instructions among all incomplete preceding instructions to obtain a comparison matrix. Then, the comparison matrix is broadcast to each microinstruction-level dependency determination unit. Each microinstruction-level dependency determination unit can select the minimum value of the instruction it depends on from the comparison matrix according to its own instruction-level dependency to compare and determine the microinstruction-level dependency.
[0105] Figures 11A-11B show, as an example, a schematic diagram of an address comparison optimization scheme according to some embodiments of this disclosure.
[0106] Figure 11A shows the complete comparison matrix obtained by pairwise comparison of the current starting addresses of all preceding instructions being executed. In this example, assuming there are four preceding instructions with instruction numbers 1, 3, 5, and 2, the comparison matrix is a 4×4 matrix, where each element represents the result of comparing the current starting addresses of the instructions at the corresponding row and column coordinates. For example, if the current starting address of the column coordinate instruction is less than the current starting address of the row coordinate instruction, the matrix element at that location is 1; otherwise, it is 0. The elements on the diagonal of the matrix are considered invalid. Therefore, the column whose valid values are all 1 identifies the instruction with the minimum starting address; equivalently, the row whose valid values are all 0 identifies that instruction. For example, in Figure 11A, the current starting address of instruction 5 is the smallest.
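A comparison matrix of this kind can be sketched as follows. The sketch assumes the convention that the element at (row, column) is 1 when the column instruction's start address is the smaller one, so that an all-ones column marks the minimum. The start addresses are invented for illustration (the figure's actual values are not given in the text), and distinct start addresses are assumed.

```python
def build_comparison_matrix(starts):
    """Pairwise-compare current start addresses.

    m[r][c] is 1 if starts[c] < starts[r], else 0; diagonal entries are None
    (invalid), since an instruction is not compared with itself.
    """
    n = len(starts)
    return [
        [None if r == c else int(starts[c] < starts[r]) for c in range(n)]
        for r in range(n)
    ]


def index_of_minimum(matrix):
    """Return the index whose column holds 1 in every valid position."""
    n = len(matrix)
    for c in range(n):
        if all(matrix[r][c] == 1 for r in range(n) if r != c):
            return c
    return None  # only possible if start addresses are not distinct


inst_ids = [1, 3, 5, 2]                      # order of rows and columns
starts = [0x400, 0x200, 0x100, 0x300]        # hypothetical current start addresses
matrix = build_comparison_matrix(starts)
min_inst = inst_ids[index_of_minimum(matrix)]
```

The matrix is computed once and can then be shared, so that each consumer only inspects bits instead of repeating the address comparisons.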
[0107] Figure 11B illustrates how a microinstruction-level dependency determination unit extracts the minimum value it needs from the comparison matrix.
[0108] Each microinstruction-level dependency determination unit can determine the instruction-level dependencies of the current microinstruction. Specifically, instruction-level dependencies can be represented using a bitmap, where each bit indicates whether there is a dependency between the current instruction and one of the preceding instructions being executed. For the determination of instruction-level dependencies, refer to the previous description. For example, if the current instruction has no dependency on the first preceding instruction, the first bit in the bitmap can be set to "1", and if it has a dependency on the second preceding instruction, the second bit can be set to "0"; the opposite encoding can also be used.
[0109] Therefore, the microinstruction-level dependency determination unit can use this bitmap as a mask, apply it to the comparison matrix, and select the minimum of the starting addresses of the instructions it depends on. For example, as shown in Figure 11B, assuming that the preceding instructions on which the current microinstruction depends are instruction 3 and instruction 2, and the bitmap is 0101 (NYNY in the figure), the column whose valid values are all 1 is selected from the columns of instruction 3 and instruction 2, which is the column of instruction 3. That is, among the preceding instructions on which the current microinstruction depends, instruction 3 has the smallest current starting address. Subsequently, the microinstruction-level dependency determination unit can compare the current starting address of instruction 3 with the ending address of the current microinstruction to determine whether a microinstruction-level dependency exists.
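The masked selection can be sketched as below. As before, the start addresses are invented, the matrix convention (element 1 when the column instruction's start address is smaller than the row instruction's) is an assumption of the sketch, and the mask uses `True` for "has an instruction-level dependency", matching the 0101 example.

```python
def masked_minimum(matrix, depends):
    """Index of the dependent instruction with the smallest start address.

    `matrix[r][c]` is 1 if instruction c starts below instruction r.
    `depends[i]` is True if the current instruction depends on instruction i.
    Only rows and columns selected by the mask are treated as valid.
    """
    selected = [i for i, dep in enumerate(depends) if dep]
    for c in selected:
        if all(matrix[r][c] == 1 for r in selected if r != c):
            return c
    return None  # no dependent instruction (or non-distinct start addresses)


inst_ids = [1, 3, 5, 2]
starts = [0x400, 0x200, 0x100, 0x300]        # hypothetical current start addresses
n = len(starts)
matrix = [
    [None if r == c else int(starts[c] < starts[r]) for c in range(n)]
    for r in range(n)
]

depends = [False, True, False, True]         # bitmap 0101: instructions 3 and 2
limit_index = masked_minimum(matrix, depends)

micro_end = 0x1FF                            # hypothetical microinstruction end address
has_micro_dependency = not (micro_end < starts[limit_index])
```

Although instruction 5 has the globally smallest start address in this example, the mask excludes it, so the comparison uses instruction 3 instead.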
[0110] In this way, when the number of microinstructions processed in parallel is large, comparison time and hardware resources can be effectively saved.
[0111] Artificial intelligence processors typically process high-dimensional tensor data. For multidimensional instructions whose operands are multidimensional data blocks, there are often many microinstruction splitting strategies. Therefore, different splitting strategies for consecutive instructions can lead to the microinstructions of different instructions covering different dimensions. However, the "chained" parallel mechanism of the embodiments of this disclosure can still be applied to a certain extent.
[0112] In some embodiments, when the operand of the current instruction is a multidimensional data block, and the operand of a preceding instruction with an instruction-level dependency is also the same multidimensional data block, a "chained" parallel mechanism can be implemented on the dimension where the access order of the current instruction and the preceding instruction is consistent. In this case, the microinstruction-level dependency determination unit can be further configured to: perform the aforementioned microinstruction-level dependency determination on the dimension where the access order of the current instruction and the preceding instruction is consistent. That is, when determining the microinstruction-level dependency, the comparison is based on the address range of the data on the dimension where the access order is consistent.
[0113] Figure 12A and Figure 12B illustrate examples of implementing a chained parallel mechanism on multidimensional instructions.
[0114] Figure 12A illustrates a scenario where the preceding instruction is a Load instruction, followed by a Conv instruction. The three data spaces on the left, (a), (b), and (c), show how the loaded data changes over time during the execution of the Load instruction, while the three data spaces on the right, (d), (e), and (f), show the data used over time during the execution of the Conv instruction. The data blocks in the figure are three-dimensional, with C, W, and H dimensions. Dimension C is the lowest dimension, meaning data in dimension C is stored contiguously in the one-dimensional storage space. Dimension W is the intermediate dimension, and dimension H is the highest dimension.
[0115] The micro-instruction splitting strategy of the Load instruction is to load one layer of data at a time, starting from the top layer, for example, one layer of W×C data. As shown in Figure (a), initially, a light gray data block is loaded into the data space. As the micro-instructions are executed, Figure (b) shows that 4 layers of W×C data have been loaded; Figure (c) shows that 12 layers of W×C data have been loaded.
[0116] The micro-instruction splitting strategy of the Conv instruction is to fetch 4×4×4 (H×W×C) data at a time. Therefore, when the Load instruction has loaded only one layer of W×C data, the micro-instructions of the Conv instruction cannot be executed, as shown in Figure (d), and must continue to wait. When the Load instruction has loaded 4 layers of W×C data, the micro-instructions of the Conv instruction can be executed. As shown in Figure (e), the Conv instruction can split the 4×W×C data block into multiple 4×4×4 data blocks, executing the corresponding micro-instructions in the order of C dimension first, then W dimension. Similarly, when the Load instruction has loaded 12 layers of W×C data, the Conv instruction can execute the micro-instructions of the corresponding third 4×W×C data block, as shown in Figure (f).
[0117] Therefore, although the micro-instruction splitting strategies of Load and Conv instructions are different, from the H dimension, their processing / access order is consistent, both being processed from top to bottom. Thus, a "chained" parallel mechanism can be implemented at the H dimension.
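The chaining along the H dimension in this example amounts to simple counting, sketched below. The function name and the requirement that W and C be multiples of the block size are assumptions of the illustration.

```python
def conv_blocks_ready(loaded_layers, width, channels, block=4):
    """Number of block×block×block Conv microinstructions whose input is loaded.

    `loaded_layers` counts W×C layers already committed by the Load instruction,
    from the top of the H dimension downward. A Conv microinstruction needs
    `block` complete layers, so only full groups of layers count.
    """
    if width % block or channels % block:
        raise ValueError("width and channels are assumed to be multiples of block")
    full_groups = loaded_layers // block           # complete 4×W×C slabs along H
    blocks_per_group = (width // block) * (channels // block)
    return full_groups * blocks_per_group


# With hypothetical W = 8 and C = 8, each 4×W×C slab yields 4 Conv microinstructions.
ready_after_1 = conv_blocks_ready(1, 8, 8)
ready_after_4 = conv_blocks_ready(4, 8, 8)
ready_after_12 = conv_blocks_ready(12, 8, 8)
```

The dependency check therefore only needs the progress of the Load instruction along H, not along W or C.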
[0118] Figure 12B illustrates a scenario where the preceding instruction is a matrix multiplication Matmul instruction, followed by a vector Vec instruction. The three data spaces on the left, (a), (b), and (c), show how the data stored during the execution of the Matmul instruction changes over time, while the three data spaces on the right, (d), (e), and (f), show the data used during the execution of the Vec instruction. The data blocks in the figure are two-dimensional, with M and N dimensions, where N is the lowest dimension and M is the highest dimension.
[0119] The Matmul instruction's micro-instruction splitting strategy is to compute a small matrix block at a time, for example, a matrix block of size 2×(N/2). As shown in Figure (a), initially, one light gray matrix block is computed and stored in the data space. As the micro-instructions are executed, Figure (b) shows that two light gray matrix blocks have been computed; Figure (c) shows that six light gray matrix blocks have been computed.
[0120] The Vec instruction's microinstruction splitting strategy is to fetch one row of data at a time. Therefore, when the Matmul instruction only calculates one light gray matrix block, the Vec instruction's microinstructions cannot be executed, as shown in Figure (d), and it must continue to wait. Once the Matmul instruction has calculated two light gray matrix blocks, i.e., two complete rows of data, the Vec instruction's microinstructions can be executed. As shown in Figure (e), the Vec instruction can split the two light gray matrix blocks into two rows of data and execute the corresponding microinstructions row by row. Similarly, when the Matmul instruction has calculated six light gray matrix blocks, the Vec instruction can execute the microinstructions corresponding to the 5th and 6th rows of data, as shown in Figure (f).
[0121] Therefore, although the micro-instruction decomposition strategies of the Matmul and Vec instructions are different, their processing / access order is consistent from the M-dimensional perspective, both being processed from top to bottom. Thus, a "chained" parallel mechanism can be implemented in the M-dimensional space.
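The same counting applies along the M dimension. The sketch below assumes, as the example suggests, that each Matmul block covers 2 rows and half of the N dimension, and that blocks are produced band by band from top to bottom; the function name is hypothetical.

```python
def vec_rows_ready(blocks_done, rows_per_block=2, blocks_per_band=2):
    """Number of complete rows available to the Vec instruction.

    A band of `rows_per_block` rows is complete only once all
    `blocks_per_band` Matmul blocks covering it have been computed.
    """
    complete_bands = blocks_done // blocks_per_band
    return complete_bands * rows_per_block


rows_after_1 = vec_rows_ready(1)   # one block: no complete row yet
rows_after_2 = vec_rows_ready(2)   # two blocks: rows 1 and 2 are complete
rows_after_6 = vec_rows_ready(6)   # six blocks: rows 1 to 6 are complete
```

With six blocks computed, the most recently completed band corresponds to the 5th and 6th rows mentioned above.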
[0122] Optionally or additionally, in some embodiments, the microinstruction generation unit may be further configured to: adjust the splitting strategy of the current instruction so that there is at least one dimension with consistent access order between the split microinstructions and the preceding instruction. In some implementations, when the current splitting strategy yields no dimension with consistent access order between the split microinstructions and the preceding instruction, the splitting strategy can be adjusted so that at least one such dimension exists, thereby enabling a "chained" parallel mechanism to be implemented on that dimension. In other implementations, even if the current splitting strategy already yields a dimension with consistent access order, the splitting strategy can be adjusted so that more dimensions have consistent access order, thereby enabling a "chained" parallel mechanism to be implemented on a larger scale. For example, in the example shown in Figure 12A, if possible, the splitting strategy of the Conv instruction can be adjusted to be consistent with the splitting strategy of the Load instruction. Then, each time the Load instruction finishes executing a micro-instruction (loading a layer of data), the corresponding micro-instruction of the Conv instruction can be executed (performing convolution operations on this layer of data).
[0123] Broadcast instruction processing scheme
[0124] Amplifying input/output (I/O) bandwidth through broadcast transmission is a key optimization technique for artificial intelligence (AI) processors and significantly improves processor performance. In AI processors, a broadcast instruction refers to sending data from one device to multiple devices, enabling the other devices to read that data. For example, a broadcast instruction could involve reading data from off-chip memory (e.g., DRAM 204 in Figure 4) and then copying it to multiple processor cores (e.g., processor cores 406 in Figure 4), allowing all processor cores to access the same dataset. This operation is extremely useful in distributed computing and parallel processing, especially in scenarios where multiple processor cores need to work together to process the same dataset.
[0125] In some embodiments disclosed herein, in order to support the “chained” execution mechanism proposed in the embodiments of this disclosure while broadcasting, a dedicated broadcast instruction slot component is designed to handle the decoding and scheduling of broadcast instructions, so that the “chained” execution mechanism within each processor core can still be supported in the broadcast scenario.
[0126] Figure 13 shows an exemplary structural block diagram of an instruction processing apparatus according to some embodiments of the present disclosure.
[0127] As shown in the figure, in addition to other structures, the instruction processing device 1300 includes a broadcast instruction slot 1310, which further includes an instruction merging unit 1311, a microinstruction generation unit 1312, and a microinstruction slot 1313.
[0128] Broadcast instruction slot 1310 is responsible for decoding and scheduling broadcast instructions, and it is a common module shared by all processor cores. The broadcast instruction slot can be located in the system management controller, and its specific location can vary depending on the design. The embodiments of this disclosure are not limited in this respect.
[0129] The instruction merging unit 1311 is configured to collect broadcast instructions from various processor cores participating in the broadcast and merge them into a unified broadcast instruction. Specifically, the instruction merging unit 1311 can merge multiple broadcast instructions from various processor cores that participate in the same broadcast operation into a single broadcast instruction, thereby avoiding multiple memory accesses to obtain broadcast data.
[0130] The microinstruction generation unit 1312 is configured to generate several broadcast microinstructions based on the merged unified broadcast instruction. It functions similarly to the microinstruction generation unit described above in conjunction with Figure 7, decomposing complex machine instructions into a series of simpler microinstructions, which are the basic steps for implementing specific operations. Here, the microinstruction generation unit 1312 breaks down a broadcast instruction into several microinstructions. These may include, for example, at least a read microinstruction for reading data from a specified address (e.g., in off-chip memory) and a write microinstruction for writing the read data to a specified address (e.g., in the on-chip memory of multiple processor cores). Furthermore, when the operand data volume is large, the broadcast instruction can be further broken down into multiple read microinstructions and multiple write microinstructions. For example, when the operand is a long vector, the instruction can be broken down into microinstructions that each operate on a short vector.
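As a hedged illustration of this decomposition, the following sketch splits one broadcast into read and write microinstructions over fixed-size chunks. The chunk size, the half-open address ranges, and the names `MicroInstr` and `generate_micro_instrs` are assumptions made for the example and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MicroInstr:
    kind: str                      # "read" or "write"
    start: int                     # first address touched (inclusive)
    end: int                       # one past the last address (exclusive)
    core_id: Optional[int] = None  # target core for writes, None for reads


def generate_micro_instrs(src_addr, length, dst_by_core, chunk):
    """Split a broadcast of `length` bytes into per-chunk read/write microinstructions.

    Each chunk yields one read from the source and one write per participating core.
    """
    micro = []
    for off in range(0, length, chunk):
        n = min(chunk, length - off)  # the last chunk may be shorter
        micro.append(MicroInstr("read", src_addr + off, src_addr + off + n))
        for core_id, dst_addr in dst_by_core:
            micro.append(
                MicroInstr("write", dst_addr + off, dst_addr + off + n, core_id)
            )
    return micro
```

For a 100-byte broadcast to two cores with a 64-byte chunk, this produces two reads and four writes, mirroring the long-vector case described above.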
[0131] Microinstruction slot 1313 is configured to cache broadcast microinstructions to be executed, and to control whether to issue the currently processed broadcast microinstruction based on whether there is a microinstruction-level dependency between it and the preceding instructions being executed by the various processor cores participating in the broadcast. Its functions are similar to those of the microinstruction-level dependency judgment unit 730 and the instruction issuing unit 740 in Figure 7, the difference being that it handles broadcast microinstructions.
[0132] In some embodiments, microinstruction slot 1313 may be configured to: determine that the currently processed broadcast microinstruction has no microinstruction-level dependency when its data memory access address does not overlap with the data memory access addresses that have not yet been accessed by the preceding instructions being executed in each processor core; and, in response to determining that the currently processed broadcast microinstruction has no microinstruction-level dependency, issue the broadcast microinstruction.
[0133] Because broadcast microinstructions involve multiple processor cores participating in the broadcast, to support the "chained" execution mechanism within each processor core, the microinstruction-level dependency determination for a broadcast microinstruction needs to be based on the unaccessed data memory access addresses of the preceding instructions in all participating processor cores. Only when no dependency exists with any of the participating processor cores is the broadcast microinstruction considered to have no microinstruction-level dependency, so that it can be issued. The method for determining the microinstruction-level dependency with each processor core can be similar to the method described earlier. For example, the end address of the data memory access address of the current broadcast microinstruction is compared with the start addresses of the data memory access addresses of all incomplete microinstructions in the preceding instructions within a single processor core. More specifically, the end address of the current broadcast microinstruction is compared with the minimum value (limit) of the current start addresses of all preceding instructions within a single processor core to determine whether there is a microinstruction-level dependency with that processor core.
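The comparison against the per-core limit can be sketched as follows. This is an illustrative model only: it assumes inclusive end addresses, represents each core's preceding instructions simply by the list of their current (not yet accessed) start addresses, and uses the hypothetical names `core_limit` and `can_issue`.

```python
def core_limit(unaccessed_starts):
    """Minimum current start address among a core's preceding instructions.

    Returns None when the core has no outstanding preceding accesses.
    """
    return min(unaccessed_starts) if unaccessed_starts else None


def can_issue(micro_end, starts_by_core):
    """True only if the broadcast microinstruction is independent of every core.

    `micro_end` is the (inclusive) end address of the broadcast microinstruction;
    `starts_by_core` maps core id to the current start addresses of that core's
    preceding instructions.
    """
    for starts in starts_by_core.values():
        limit = core_limit(starts)
        if limit is not None and micro_end >= limit:
            return False  # overlaps addresses a preceding instruction has yet to access
    return True
```

Because a single failing core blocks issuance, the check is a conjunction over all participating cores, which matches the requirement that the determination be based on all of them.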
[0134] By using a dedicated broadcast instruction slot to handle the micro-instruction-level dependencies of broadcast instructions, the "chained" execution mechanism within each processor core can still be supported in broadcast scenarios. This allows for fine-grained scheduling of parallel instruction execution, reducing latency between dependent instructions and improving instruction parallelism and processor performance.
[0135] Figure 14 shows an exemplary structural block diagram of an instruction processing apparatus according to other embodiments of this disclosure. In these embodiments, interactions between a broadcast instruction slot shared by multiple processor cores and components located within each processor core of the instruction processing apparatus are further illustrated.
[0136] As shown in the figure, the instruction processing apparatus 1400 includes a broadcast instruction slot 1410 and related components located in multiple processor cores 1420-1, 1420-2, ... 1420-N. It is understood that the instruction processing related components in a single processor core may include the various components described above in conjunction with the accompanying figures. For simplicity and clarity, only components related to the processing of broadcast instructions are shown here. As shown, each processor core includes a microinstruction submission queue 1421, an instruction execution unit 1422, a local instruction slot 1423, and a local instruction-level dependency determination unit 1424.
[0137] In some embodiments, the microinstruction slot 1413 in the broadcast instruction slot 1410 may be further configured to: in response to the issuance of a broadcast microinstruction, broadcast the registration of the broadcast microinstruction to the microinstruction submission queue 1421 of each processor core participating in the broadcast. Since the broadcast microinstruction involves multiple processor cores, the issuance of its microinstruction needs to be registered in the microinstruction submission queue within each processor core to monitor the processing progress of the broadcast microinstruction.
[0138] As previously described, a broadcast instruction is broken down into read microinstructions and write microinstructions. A read microinstruction reads data from off-chip memory. After the read microinstruction is issued, it is executed by the off-chip memory control unit (not shown in the figure), which controls the reading of data from the specified address in the off-chip memory and generates a read commit message. This read commit message is returned to the broadcast instruction slot 1410, for example, to the instruction submission unit 1414.
[0139] In these embodiments, the instruction submission unit 1414 may be configured to, in response to receiving the submission information of a read microinstruction, broadcast that submission information to the microinstruction submission queue 1421 of each processor core. For example, the submission information of the read microinstruction may be passed to the microinstruction submission queue in processor core 1420-N, and then sequentially to the microinstruction submission queues in the other processor cores. In response to the submission of the microinstruction, the relevant components in each processor core (e.g., the address range maintenance unit in Figure 10) can perform the corresponding operations, as described above, which will not be repeated here.
[0140] For write microinstructions in broadcast microinstructions, since the write operation is completed within each processor core, it can be committed directly within each core without needing to be forwarded through the broadcast instruction slot. For write operations, the instruction execution unit in each processor core is an on-chip memory control unit, which controls the writing of data to a specified address in the on-chip memory and generates a write commit message. This write commit message is returned to the microinstruction submission queue within the corresponding processor core, without going through the instruction submission unit 1414 in the broadcast instruction slot 1410.
[0141] In these embodiments, the instruction execution unit 1422 located in each processor core can be configured to send the submission information of the write microinstruction to the microinstruction submission queue 1421 of the corresponding processor core after executing the write microinstruction of the broadcast instruction.
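The two commit paths described above can be modeled in software as follows. This is a sketch under assumptions: the class names, the tuple-based queue entries, and the function `on_write_commit` are hypothetical and only illustrate that registrations and read commits fan out to every core, while write commits stay local.

```python
from collections import deque


class Core:
    def __init__(self, core_id):
        self.core_id = core_id
        # Models the per-core microinstruction submission queue (1421).
        self.commit_queue = deque()


class BroadcastSlot:
    """Models the shared broadcast instruction slot (1410)."""

    def __init__(self, cores):
        self.cores = cores

    def register_issue(self, uop_id):
        # On issuance, register the broadcast microinstruction in every core's queue.
        for core in self.cores:
            core.commit_queue.append(("registered", uop_id))

    def on_read_commit(self, uop_id):
        # Read commits return to the shared slot and are broadcast to all cores.
        for core in self.cores:
            core.commit_queue.append(("read_commit", uop_id))


def on_write_commit(core, uop_id):
    # Write commits are produced inside a core and go only to that core's queue.
    core.commit_queue.append(("write_commit", uop_id))
```

Keeping write commits local avoids a round trip through the shared slot for an operation that completes entirely within one core.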
[0142] Each processor core also includes its own local instruction slot 1423. Each local instruction slot 1423 can be configured to cache and process instructions to be executed by the local processor core, and to maintain the range of data memory access addresses that have not yet been accessed by preceding instructions being executed in the local processor core. The local instruction slot 1423 can include some of the functional components described above in conjunction with Figure 10, both components within the instruction slot (e.g., microinstruction generation unit, microinstruction-level dependency judgment unit, instruction issuing unit, etc.) and components outside the instruction slot (e.g., address range maintenance unit). Therefore, functions and structures similar to those described earlier will not be repeated here; the following description focuses only on content related to the processing of broadcast instructions. Those skilled in the art will understand that the division of functional components in the above schematic structural block diagram is merely for the convenience of describing the device's functions and implementation. This division is based on logical functions and does not imply that an actual implementation must follow it; other divisions are possible.
[0143] In some embodiments, each local instruction slot 1423 may be further configured to: send the broadcast instruction to the broadcast instruction slot 1410 in response to the currently processed instruction being a broadcast instruction; and also send the range of data memory access addresses that have not yet been accessed by the preceding instructions being executed by the local processor core to the broadcast instruction slot 1410.
[0144] Similar to the embodiments described above in conjunction with Figure 8, in some embodiments, instruction-level dependency determination may also be performed first for broadcast instructions. In these embodiments, the local instruction-level dependency determination unit 1424 located in each processor core can be configured to determine whether there is an instruction-level dependency between the currently processed broadcast instruction and the preceding instruction being executed in the local processor core. This instruction-level dependency can be passed to the local instruction slot 1423.
[0145] The operation of the local instruction-level dependency determination unit 1424 is similar to that described above. For example, the local instruction-level dependency determination unit 1424 can be further configured to: determine that no instruction-level dependency exists when the start address of the data memory access address of the currently processed broadcast instruction is greater than the current end address of the preceding instruction being executed in the local processor core; otherwise, determine that an instruction-level dependency exists. It can be understood that what is determined here is the instruction-level dependency of the broadcast instruction within a single processor core. As mentioned earlier, this instruction-level dependency can be represented using a bitmap, where each bit in the bitmap indicates whether there is a dependency between the current broadcast instruction and a preceding instruction being executed within a single processor core.
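A minimal sketch of this per-core check, with the result encoded as a bitmap, is given below. The function name `dependency_bitmap` and the representation of preceding instructions as a list of their current end addresses are assumptions for illustration.

```python
def dependency_bitmap(bcast_start, preceding_end_addrs):
    """Build an instruction-level dependency bitmap within a single core.

    Bit i is set when the broadcast instruction depends on preceding instruction i,
    i.e. when its start address is NOT greater than that instruction's current
    end address.
    """
    bitmap = 0
    for i, end_addr in enumerate(preceding_end_addrs):
        if bcast_start <= end_addr:
            bitmap |= 1 << i
    return bitmap
```

A bitmap of zero corresponds to the case in which the broadcast instruction can be forwarded to the broadcast instruction slot marked as non-dependent; any set bit identifies a preceding instruction whose progress must still be checked at the microinstruction level.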
[0146] At this point, the local instruction slot 1423 can be further configured to: in response to the local instruction-level dependency determination unit 1424 determining that no instruction-level dependency exists, send the broadcast instruction to the broadcast instruction slot 1410 and indicate that the micro-instruction-level dependency of the broadcast instruction is non-dependent; and / or in response to the local instruction-level dependency determination unit 1424 determining that an instruction-level dependency exists, send the broadcast instruction to the broadcast instruction slot 1410 and indicate the instruction-level dependency. It is understood that the above dependency determination is limited to the determination within a single processor core.
[0147] Optionally or additionally, the microinstruction slot 1413 in the broadcast instruction slot 1410 may be further configured to: for broadcast microinstructions with instruction-level dependencies, if the end address in the data memory access address of the broadcast microinstruction is less than the minimum of the current start address of all preceding instructions executed by any processor core, determine that there is no microinstruction-level dependency; otherwise, determine that there is a microinstruction-level dependency.
[0148] Specifically, the end address in the data memory access address of the broadcast microinstruction can be compared one by one with the limit, i.e., the minimum value of the current start addresses of all preceding instructions, of each of the multiple processor cores participating in the broadcast. If it is less than the limit of every such processor core, it is determined that there is no microinstruction-level dependency and the broadcast microinstruction can be issued.
[0149] The above describes the interactive communication between various unit components in the instruction processing device in the context of broadcast instructions, in order to support the "chain" execution mechanism within each processor core, thereby enabling fine-grained scheduling of parallel instruction execution, reducing latency between dependent instructions, and improving instruction parallelism and processor performance.
[0150] Figure 15 shows an exemplary flowchart of an instruction processing method 1500 according to an embodiment of this disclosure.
[0151] As shown in the figure, in step 1510, the broadcast instructions of each processor core participating in the broadcast are collected and merged into a unified broadcast instruction; in step 1520, several broadcast micro-instructions are generated based on the unified broadcast instruction; in step 1530, it is determined whether there is a micro-instruction-level dependency between the currently processed broadcast micro-instruction and the preceding instructions being executed by each processor core; and in step 1540, based on the determined micro-instruction-level dependency, it is controlled whether to issue the currently processed broadcast micro-instruction.
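Steps 1510 to 1540 can be walked through with a small cycle-by-cycle simulation. This is a simplified sketch, not the disclosed method itself: it assumes each core requests the same `(start, length)` range, uses a single address space for the comparison, issues at most one microinstruction per cycle, and takes the per-core limits for each cycle as an input (`None` meaning no outstanding preceding access). All names are hypothetical.

```python
def process_broadcast(requests, limits_per_cycle, chunk):
    """Simulate steps 1510-1540 and return [(cycle, (start, end)), ...] for issued uops."""
    # Step 1510: collect and merge; every participating core must request the same range.
    unified = set(requests.values())
    if len(unified) != 1:
        raise ValueError("cores disagree on the broadcast range")
    (start, length), = unified

    # Step 1520: generate broadcast microinstructions as inclusive (start, end) ranges.
    pending = [
        (start + off, start + min(off + chunk, length) - 1)
        for off in range(0, length, chunk)
    ]

    issued = []
    for cycle, limits in enumerate(limits_per_cycle):
        if not pending:
            break
        _, end = pending[0]
        # Step 1530: no dependency only if the end address is below every core's limit.
        independent = all(lim is None or end < lim for lim in limits.values())
        # Step 1540: issue when independent, otherwise stall this cycle.
        if independent:
            issued.append((cycle, pending.pop(0)))
    return issued
```

In the sketch, a microinstruction stalls only until the preceding instructions have advanced past its address range, rather than until they complete, which is the fine-grained behavior the "chained" mechanism is intended to provide.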
[0152] In some embodiments, determining whether a microinstruction-level dependency exists between the currently processed broadcast microinstruction and the preceding instructions being executed by each processor core may further include: when the data memory access address of the currently processed broadcast microinstruction does not overlap with any data memory access addresses not yet accessed by the preceding instructions being executed by each processor core, determining that the currently processed broadcast microinstruction has no microinstruction-level dependency. In this case, the broadcast microinstruction may be issued in response to determining that it has no microinstruction-level dependency.
[0153] In some embodiments, method 1500 may further include step 1550: in response to the issuance of a broadcast microinstruction, broadcasting the registration of the broadcast microinstruction to a microinstruction submission queue located in each processor core.
[0154] In some embodiments, method 1500 may further include step 1560: in response to receiving the commit information of the read microinstruction in the broadcast microinstruction, broadcasting the commit information to the microinstruction commit queue of each processor core.
[0155] In some embodiments, method 1500 may further include step 1570: in response to completing the write microinstruction in the broadcast microinstruction locally on each processor core, sending the commit information of the write microinstruction to the microinstruction commit queue of the corresponding processor core.
[0156] Those skilled in the art will understand that the various features of the instruction processing apparatus described above in conjunction with the accompanying drawings can be similarly applied to the instruction processing method of Figure 15, for example, determining instruction-level dependencies for broadcast instructions, and they will not be repeated here.
[0157] This disclosure also provides a processor, including the aforementioned instruction processing apparatus for implementing the instruction processing method. This disclosure further provides a chip, which may include the processor of any of the embodiments described above in conjunction with the accompanying drawings. Furthermore, this disclosure also provides a board that may include the aforementioned chip.
[0158] Depending on the application scenario, the electronic devices or apparatuses disclosed herein may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablets, smart terminals, PC devices, IoT terminals, mobile terminals, mobile phones, dashcams, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, means of transport, home appliances, and/or medical devices. The means of transport include airplanes, ships, and/or road vehicles; the home appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, lights, gas stoves, and range hoods; the medical devices include MRI scanners, ultrasound machines, and/or electrocardiographs. The electronic devices or apparatuses disclosed herein can also be applied in fields such as the Internet, IoT, data centers, energy, transportation, public management, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and healthcare. Furthermore, the electronic devices or apparatuses disclosed herein can also be used in cloud, edge, and terminal application scenarios related to artificial intelligence, big data, and/or cloud computing. In one or more embodiments, the high-computing-power electronic devices or apparatuses according to the present disclosure can be applied to cloud devices (e.g., cloud servers), while the low-power electronic devices or apparatuses can be applied to terminal devices and/or edge devices (e.g., smartphones or cameras). In one or more embodiments, the hardware information of the cloud devices and the hardware information of the terminal devices and/or edge devices are compatible with each other, so that, based on the hardware information of the terminal devices and/or edge devices, suitable hardware resources can be matched from the hardware resources of the cloud devices to simulate the hardware resources of the terminal devices and/or edge devices, thereby achieving unified management, scheduling, and collaborative work for device-cloud or cloud-edge-device integration.
[0159] It should be noted that, for the sake of brevity, this disclosure describes some methods and their embodiments as a series of actions and combinations thereof. However, those skilled in the art will understand that the solutions disclosed herein are not limited by the order of the described actions. Therefore, based on the disclosure or teachings of this document, those skilled in the art will understand that some steps can be performed in a different order or simultaneously. Furthermore, those skilled in the art will understand that the embodiments described in this disclosure can be considered optional embodiments, that is, the actions or modules involved are not necessarily essential for the implementation of one or more solutions disclosed herein. In addition, depending on the solution, the description of some embodiments in this disclosure may have different emphases. In view of this, those skilled in the art will understand that parts not described in detail in a certain embodiment of this disclosure can also be referred to the relevant descriptions of other embodiments.
[0160] In terms of specific implementation, based on the disclosure and teachings of this document, those skilled in the art will understand that several embodiments disclosed herein can also be implemented in other ways not disclosed herein. For example, regarding the various units in the electronic device or apparatus embodiments described above, this document divides them based on logical functions, but in actual implementation, there may be other division methods. As another example, multiple units or components can be combined or integrated into another system, or some features or functions in a unit or component can be selectively disabled. Regarding the connection relationships between different units or components, the connections discussed above in conjunction with the accompanying drawings can be direct or indirect couplings between units or components. In some scenarios, the aforementioned direct or indirect couplings involve communication connections utilizing interfaces, where the communication interface can support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
[0161] In this disclosure, the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units. The aforementioned components or units may be located in the same location or distributed across multiple network units. Furthermore, depending on actual needs, some or all of the units can be selected to achieve the purpose of the solution described in the embodiments of this disclosure. Additionally, in some scenarios, multiple units in the embodiments of this disclosure may be integrated into one unit or each unit may exist physically independently.
[0162] In some implementation scenarios, the integrated unit described above can be implemented as a software program module. If implemented as a software program module and sold or used as an independent product, the integrated unit can be stored in a computer-readable memory. Accordingly, when the disclosed solution is embodied as a software product (e.g., a computer-readable storage medium), the software product can be stored in a memory and may include several instructions to cause a computer device (e.g., a personal computer, server, or network device) to execute some or all of the steps of the method described in the embodiments of this disclosure. The aforementioned memory may include, but is not limited to, various media capable of storing program code, such as USB flash drives, flash drives, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.
[0163] In other implementation scenarios, the integrated units described above can also be implemented in hardware, i.e., as specific hardware circuits, which may include digital circuits and / or analog circuits. The physical implementation of the circuit's hardware structure may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors. Therefore, the various devices described herein (e.g., artificial intelligence processor computing devices or other processing devices) can be implemented using appropriate hardware processors, such as CPUs, GPUs, FPGAs, DSPs, and ASICs. Furthermore, the aforementioned storage units or storage devices can be any suitable storage medium (including magnetic storage media or magneto-optical storage media), such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), ROM, and RAM.
[0164] While numerous embodiments of this disclosure have been shown and described herein, it will be apparent to those skilled in the art that such embodiments are provided by way of example only. Many modifications, alterations, and alternatives will occur to those skilled in the art without departing from the spirit and intent of this disclosure. It should be understood that various alternatives to the embodiments of this disclosure described herein may be employed in the practice of this disclosure. The appended claims are intended to define the scope of this disclosure and therefore cover equivalents or alternatives within the scope of these claims.
Claims
1. An instruction processing apparatus, comprising a broadcast instruction slot, the broadcast instruction slot comprising: an instruction merging unit configured to collect broadcast instructions from each processor core participating in the broadcast and merge them into a unified broadcast instruction; a microinstruction generation unit configured to generate several broadcast microinstructions based on the unified broadcast instruction; and a microinstruction slot configured to cache broadcast microinstructions to be executed, and to control whether to issue the currently processed broadcast microinstruction based on whether there is a microinstruction-level dependency between the currently processed broadcast microinstruction and the preceding instructions being executed by each processor core.
2. The instruction processing apparatus according to claim 1, wherein the microinstruction slot is further configured to: determine that the currently processed broadcast microinstruction has no microinstruction-level dependency when the currently processed broadcast microinstruction does not overlap with the data memory access addresses that have not yet been accessed by the preceding instructions being executed by each processor core; and issue the broadcast microinstruction in response to determining that the currently processed broadcast microinstruction has no microinstruction-level dependency.
3. The instruction processing apparatus according to claim 2, wherein the microinstruction slot is further configured to: in response to the issuance of the broadcast microinstruction, broadcast the registration of the broadcast microinstruction to the microinstruction submission queue of each processor core.
4. The instruction processing apparatus according to claim 3, wherein the broadcast microinstruction includes a read microinstruction, and the broadcast instruction slot further comprises: an instruction submission unit configured to, in response to receiving the submission information of the read microinstruction, broadcast the submission information to the microinstruction submission queue of each processor core.
5. The instruction processing apparatus according to claim 4, wherein the broadcast microinstruction further includes a write microinstruction, and the instruction processing apparatus further comprises: an instruction execution unit located in each processor core, configured to send the submission information of the write microinstruction to the microinstruction submission queue of the corresponding processor core after executing the write microinstruction.
6. The instruction processing apparatus according to any one of claims 1-5, wherein the instruction processing apparatus further comprises: a local instruction slot located in each processor core, configured to cache and process instructions to be executed by the local processor core, and to maintain the range of data memory access addresses that have not yet been accessed by preceding instructions being executed in the local processor core.
7. The instruction processing apparatus according to claim 6, wherein the local instruction slot is further configured to: in response to the currently processed instruction being a broadcast instruction, send the broadcast instruction to the broadcast instruction slot; and send the range of data memory access addresses that have not yet been accessed by the preceding instructions being executed by the local processor core to the broadcast instruction slot.
8. The instruction processing apparatus according to claim 7, wherein the instruction processing apparatus further comprises: a local instruction-level dependency determination unit located in each processor core, configured to determine whether there is an instruction-level dependency between the currently processed broadcast instruction and the preceding instruction being executed in the local processor core; and the local instruction slot is further configured to: in response to the local instruction-level dependency determination unit determining that no instruction-level dependency exists, send the broadcast instruction to the broadcast instruction slot and indicate that the microinstruction-level dependency of the broadcast instruction is non-dependent; and/or in response to the local instruction-level dependency determination unit determining that an instruction-level dependency exists, send the broadcast instruction to the broadcast instruction slot and indicate the instruction-level dependency.
9. The instruction processing apparatus according to claim 8, wherein the local instruction-level dependency determination unit is further configured to: determine that there is no instruction-level dependency when the start address of the data memory access address of the currently processed broadcast instruction is greater than the current end address of the preceding instruction being executed by the local processor core; and otherwise determine that an instruction-level dependency exists.
10. The instruction processing apparatus according to claim 9, wherein the microinstruction slot is further configured to: for a broadcast microinstruction with an instruction-level dependency, determine that there is no microinstruction-level dependency if the end address in the data memory access address of the broadcast microinstruction is less than the minimum value of the current start addresses of all preceding instructions executed by any processor core; and otherwise determine that a microinstruction-level dependency exists.
11. A processor comprising the instruction processing apparatus according to any one of claims 1-10.
12. A chip comprising the processor according to claim 11.
13. A circuit board comprising the chip according to claim 12.
14. An instruction processing method, comprising: collecting the broadcast instructions from each processor core participating in the broadcast and merging them into a unified broadcast instruction; generating several broadcast microinstructions based on the unified broadcast instruction; determining whether there is a microinstruction-level dependency between the currently processed broadcast microinstruction and the preceding instructions being executed by each of the processor cores; and controlling, based on the microinstruction-level dependency, whether to issue the currently processed broadcast microinstruction.
15. The instruction processing method according to claim 14, further comprising: determining that the currently processed broadcast microinstruction has no microinstruction-level dependency when the currently processed broadcast microinstruction does not overlap with the data memory access addresses that have not yet been accessed by the preceding instructions being executed by each processor core; and issuing the broadcast microinstruction in response to determining that the currently processed broadcast microinstruction has no microinstruction-level dependency.
16. The instruction processing method according to claim 15, further comprising: in response to the issuance of the broadcast microinstruction, broadcasting the registration of the broadcast microinstruction to the microinstruction submission queue located in each of the processor cores.
17. The instruction processing method according to claim 16, wherein the broadcast microinstruction includes a read microinstruction, and the method further comprises: in response to receiving the submission information of the read microinstruction, broadcasting the submission information to the microinstruction submission queue of each processor core.
18. The instruction processing method according to claim 17, wherein the broadcast microinstruction includes a write microinstruction, and the method further comprises: in response to the write microinstruction being completed locally on each processor core, sending the submission information of the write microinstruction to the microinstruction submission queue of the corresponding processor core.