Elimination of execution of instructions that compute constant values based on load value prediction
By predicting constant load values and using an execution elimination structure, repeated loads and computations in processor loops are avoided, improving performance and reducing power consumption.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- INTEL CORP
- Filing Date
- 2025-12-19
- Publication Date
- 2026-06-30
AI Technical Summary
The processor repeatedly loads the same data and executes the same computations or data processing instructions in a loop, leading to performance limitations and increased power consumption.
Constant load value prediction is used to predict the value of a load instruction, and the results are retained in registers, so the same load and computation instructions are not executed repeatedly. An execution elimination structure is used to mark instructions that do not need to be executed.
It improves processor performance and reduces power consumption by reducing unnecessary loading and computation operations.
Figure CN122308915A_ABST
Abstract
Description
[0001] Technical Field
[0002] The embodiments described herein generally relate to processors. More specifically, the embodiments described herein generally relate to the execution of instructions in a processor.
Background
[0003] Processors typically execute load instructions to load data from system memory into processor registers. Load instructions are sometimes referred to by other names, such as move instructions, swap instructions, etc. In some cases, load instructions can be included in a loop and executed during each iteration of the loop. In this case, computation or data processing instructions can also be included in the loop and executed during each iteration of the loop to perform computation or data processing on the data loaded by the load instructions to produce a result.
Brief Description of the Drawings
[0004] Examples according to this disclosure will be described with reference to the accompanying drawings, in which: Figure 1 shows a detailed example of a loop that can be included in software or code.
[0005] Figure 2A is a block flow diagram of an embodiment of a method.
[0006] Figure 2B is a block diagram of an example embodiment of an execution elimination structure.
[0007] Figure 3 is a block flow diagram of an embodiment of a method for training or otherwise populating an execution elimination structure.
[0008] Figure 4 is a block diagram of a detailed example embodiment of an execution elimination structure, showing the state trained or populated for the specific example loop of Figure 1.
[0009] Figure 5 is a block diagram of an embodiment of an allocation stage of a processor.
[0010] Figure 6 illustrates a computing system.
[0011] Figure 7 shows a block diagram of an example processor and/or System on a Chip (SoC) that may have one or more cores and an integrated memory controller.
[0012] Figure 8(A) is a block diagram illustrating both an example in-order pipeline and an example register renaming, out-of-order issue/execution pipeline, according to examples.
[0013] Figure 8(B) is a block diagram illustrating both an example in-order architecture core and an example register renaming, out-of-order issue/execution architecture core to be included in a processor, according to examples.
[0014] Figure 9 illustrates an example of one or more execution unit circuits.
[0015] Figure 10 is a block diagram of a register architecture according to some examples.
[0016] Figure 11 illustrates an example of an instruction format.
[0017] Figure 12 illustrates an example of an addressing information field.
[0018] Figure 13 illustrates an example of a first prefix.
[0019] Figures 14(A)-14(D) illustrate examples of how the R, X, and B fields of the first prefix of Figure 13 are used.
[0020] Figures 15(A)-15(B) illustrate an example of a second prefix.
[0021] Figure 16 illustrates an example of a third prefix.
[0022] Figure 17 is a block diagram illustrating the use of a software instruction converter to convert binary instructions in a source instruction set architecture into binary instructions in a target instruction set architecture, according to examples.
Detailed Description
[0023] This document discloses a method, apparatus, system, and non-transitory computer-readable storage medium for eliminating the execution of instructions that compute constant values based on load value prediction. Numerous specific details (e.g., specific structures, microarchitecture details, loop types, instruction types, processor configurations, operation sequences, etc.) are set forth in the following description. However, embodiments may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown in detail to avoid obscuring the understanding of this specification.
[0024] As described in the background section, processors typically execute load instructions to load data from system memory into processor registers. Load instructions are sometimes referred to by other names, such as move instructions, swap instructions, etc. In some cases, load instructions can be included in a loop and executed during each iteration of the loop. If the data is not changed or modified between iterations of the loop (e.g., not overwritten by another entity in the system), the load instructions reload the same data during each iteration. Executing load instructions takes time and consumes power. Therefore, unnecessarily reloading the same data multiple times is an inefficiency that tends to limit performance and increase power consumption. Computation or data processing instructions can also be included in loops and executed during each iteration. Some such instructions perform computations or data processing that depend only on the data loaded by the load instructions and on constant values. Such instructions perform the same computation or data processing on the same loaded data during each iteration of the loop and produce the same result. Executing them takes time and consumes power. Therefore, unnecessarily producing the same result multiple times is likewise an inefficiency that tends to limit performance and increase power consumption.
[0025] Figure 1 shows a detailed example of a loop 100 (which may be included in software or code). The loop has generic "loop start" and "loop end" instructions. The loop start and loop end instructions can be implemented using different types of instructions, such as branch instructions, jump instructions, and other types of control flow transfer instructions.
[0026] In this example, nine instructions are included within the body of the loop. In this example, the instructions are examples of instructions from the x86 instruction set architecture. In other embodiments, instructions from other instruction set architectures may be used alternatively. Each instruction within the loop has a corresponding instruction pointer (IP). The instruction pointer may represent a pointer, address, or other value indicating the address or location of the corresponding instruction in system memory. The instruction pointer is sometimes also referred to as the program counter. In this example, the instructions operate on the processor's example registers RAX, RBX, and R8-R15. RAX, RBX, and R8-R15 are general-purpose registers with 64-bit operands in the x86 instruction set architecture. These registers may broadly represent general-purpose registers or simply processor registers. In other embodiments, other registers (e.g., registers of other operand sizes, other types of registers, registers from other instruction set architectures, etc.) may be used alternatively.
[0027] The nine instructions are: the first instruction with instruction pointer A, the second instruction with instruction pointer A+4, the third instruction with instruction pointer A+8, the fourth instruction with instruction pointer A+12, the fifth instruction with instruction pointer A+16, the sixth instruction with instruction pointer A+20, the seventh instruction with instruction pointer A+24, the eighth instruction with instruction pointer A+28, and the ninth instruction with instruction pointer A+32. Instruction pointer A can represent any appropriate instruction pointer value (e.g., an address of or pointer to the first instruction of the loop in system memory). "+4" increments A by four 8-bit bytes, "+8" increments A by eight 8-bit bytes, and so on. The labels first through ninth are used below to refer to the specific instructions described in this paragraph.
[0028] The first instruction, "RBX=LOAD(RAX)", is a load instruction that, when executed, causes the processor to load data (e.g., data elements, values, etc.) from the address or location in system memory indicated by the data in the source register RAX and to store the loaded data in the destination register RBX. In this illustrative example, consider that the loaded data has a value of 2.
[0029] The second instruction, "R8=RBX*2", is an arithmetic data processing instruction (e.g., as opposed to an instruction that simply moves data from one register to another) that, when executed, causes the processor to multiply the data in the source register RBX by 2 and store the result in the destination register R8. Note that the source register RBX of the second instruction is the same as the destination register RBX of the first instruction. Note also that the result of the second instruction depends only on the data loaded into RBX and the constant multiplier 2. After the second instruction executes, register R8 will store twice the loaded value (e.g., 4 in this illustrative example).
[0030] The third instruction, "R9=R8*3", is an arithmetic data processing instruction that, when executed, causes the processor to multiply the data in the source register R8 by 3 and store the result in the destination register R9. Note that the source register R8 of the third instruction is the same as the destination register R8 of the second instruction. Note also that the result of the third instruction depends only on the constant multiplier 3 and the result of the second instruction (which, as mentioned above, depends only on the data loaded into RBX and the constant multiplier 2). After the third instruction executes, register R9 will store 6 times the loaded value (e.g., 12 in this illustrative example).
[0031] The fourth instruction, "R10=R9*4", is an arithmetic data processing instruction that, when executed, causes the processor to multiply the data in the source register R9 by 4 and store the result in the destination register R10. Note that the source register R9 of the fourth instruction is the same as the destination register R9 of the third instruction. Note also that the result of the fourth instruction depends only on the constant multiplier 4 and the result of the third instruction (which, as mentioned above, depends only on the loaded data and constant values). After the fourth instruction executes, register R10 will store 24 times the loaded value (e.g., 48 in this illustrative example).
[0032] Also shown are the fifth instruction "R11=R10+6", the sixth instruction "R12=R11+2", the seventh instruction "R13=R12-3", the eighth instruction "R14=R13+1", and the ninth instruction "R15=R14*8". Note that each of the fifth through ninth instructions similarly has a source register that is the destination register of the preceding instruction in the loop. Note also that the result of each of the fifth through ninth instructions similarly depends only on the loaded data and on constant values that remain the same during each iteration of the loop (e.g., the integers 2, 3, 4, 6, 2, 3, 1, and 8 shown). As a result, if the loaded data (e.g., the data loaded by the first instruction) remains the same, the result of each of the second through ninth instructions will also remain the same.
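To make the dependence chain concrete, the following minimal Python sketch evaluates the loop body for the illustrative loaded value of 2. It is only an illustration of the arithmetic: the function name `loop_body` is hypothetical, the load is modeled as a plain argument, and the ninth instruction is assumed to be a multiplication by 8.

```python
def loop_body(loaded_value: int) -> dict[str, int]:
    """Evaluate one iteration of the example loop for a given loaded value."""
    regs = {"RBX": loaded_value}       # first instruction: RBX = LOAD(RAX)
    regs["R8"] = regs["RBX"] * 2       # second instruction
    regs["R9"] = regs["R8"] * 3        # third instruction
    regs["R10"] = regs["R9"] * 4       # fourth instruction
    regs["R11"] = regs["R10"] + 6      # fifth instruction
    regs["R12"] = regs["R11"] + 2      # sixth instruction
    regs["R13"] = regs["R12"] - 3      # seventh instruction
    regs["R14"] = regs["R13"] + 1      # eighth instruction
    regs["R15"] = regs["R14"] * 8      # ninth instruction (operator assumed to be *)
    return regs


first_iteration = loop_body(2)
second_iteration = loop_body(2)

print(first_iteration["R10"])               # 48, i.e. 24 times the loaded value
print(first_iteration == second_iteration)  # True: same load, same results
```

Because every register after RBX is a pure function of the loaded value and fixed constants, two iterations that load the same value necessarily produce identical register contents.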
[0033] Typically, each of the first through ninth instructions will be executed during each iteration of the loop. For example, the first instruction "RBX = LOAD(RAX)" will be executed during each iteration of the loop to load data from the address or location in system memory indicated by the data in the source register RAX and store the loaded data in the destination register RBX. If the data at that address has not been changed or modified (e.g., by a write to that address performed by another entity in the system), the same data will be reloaded during each iteration of the loop. Executing the load instruction takes time and consumes power. Therefore, unnecessarily reloading the same data multiple times is an inefficiency that tends to limit performance and increase power consumption. Furthermore, each of the second through ninth data processing instructions will be executed during each iteration of the loop. These instructions perform computations or data processing that depend only on the data loaded by the load instruction and on constant values (e.g., 2, 3, 4, 6, 2, 3, 1, and 8). If the loaded data has not been changed or modified, each of the second through ninth instructions, when re-executed during each iteration of the loop, will perform the same computation or data processing on the same loaded data to produce the same result. Executing these instructions takes time and consumes power. Therefore, unnecessarily generating the same result multiple times is likewise an inefficiency that tends to limit performance and increase power consumption.
[0034] To help improve performance and / or reduce power consumption, in some embodiments, the processor can utilize constant load value prediction to predict the value of a load instruction (e.g., the first instruction RBX=LOAD(RAX)). As an example, the processor can keep track of the load instructions being executed (e.g., using their corresponding instruction pointers as identifiers) and the data loaded by those instructions. The processor can then predict that subsequent instances of the same load instruction (e.g., load instructions with the same instruction pointer) will reload the same data again. Based on this, the processor can simply output or provide the predicted data for the load instruction without performing a load operation to reload the same data again. For example, the first instruction “RBX=LOAD(RAX)” can be translated to “RBX=MOV(0x2)”, which simply moves the predicted load value (e.g., 2 in this illustrative example) from the immediate value of the MOV instruction (e.g., shown as “0x2”) to the destination register RBX. This can help improve performance by avoiding the latency associated with reloading the same data. This can also help improve performance by allowing subsequent instructions (e.g., instructions 2 through 9) to execute faster (e.g., they can be dispatched for execution once they are allocated in the out-of-order portion of the processor, because constant load value predictions occur earlier (e.g., at allocation time)).
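As a rough illustration of the tracking described in this paragraph, the sketch below models a constant load value predictor as a table keyed by instruction pointer. It is a simplified software model, not the hardware design: the class names, the confidence counter, and its threshold of 2 are assumptions introduced here, since the text only states that the processor tracks load instructions and the data they loaded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PredictorEntry:
    value: int           # value most recently loaded by this instruction pointer
    confidence: int = 0  # consecutive times the same value was observed again


class ConstantLoadValuePredictor:
    """Hypothetical model: predict a load's value once it has repeated enough."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold  # assumed confidence threshold
        self.table: dict[int, PredictorEntry] = {}

    def train(self, ip: int, loaded_value: int) -> None:
        entry = self.table.get(ip)
        if entry is not None and entry.value == loaded_value:
            entry.confidence += 1
        else:
            # New load, or the value changed: start over with no confidence.
            self.table[ip] = PredictorEntry(loaded_value)

    def predict(self, ip: int) -> Optional[int]:
        entry = self.table.get(ip)
        if entry is not None and entry.confidence >= self.threshold:
            return entry.value
        return None


IP_A = 0x1000  # hypothetical instruction pointer of RBX = LOAD(RAX)
predictor = ConstantLoadValuePredictor()
for _ in range(3):
    predictor.train(IP_A, 2)  # the load keeps returning 2

predicted = predictor.predict(IP_A)
if predicted is not None:
    # Model the conversion of the load into a move-immediate.
    rewritten = f"RBX=MOV({hex(predicted)})"
    print(rewritten)  # RBX=MOV(0x2)
```

A real predictor would also need a way to verify predictions and recover from mispredictions; that is outside the scope of this sketch.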
[0035] In some embodiments, to help improve performance and/or reduce power consumption, the processor can implement the loop more efficiently by leveraging the fact that each of the second through ninth instructions performs computation or data processing that depends only on the data loaded into register RBX by the load instruction, on constant values (e.g., in this example, the integers 2, 3, 4, 6, 2, 3, 1, and 8, which remain the same for each iteration of the loop), or on results derived from them. As a result, if the data loaded into register RBX by the first instruction remains the same, the results of each of the second through ninth instructions will also remain the same. Based on this, in some embodiments, the second through ninth instructions may be executed only once (e.g., during an initial or early iteration in which constant load value prediction is first used to predict the data loaded by the first instruction RBX = LOAD(RAX)). The results generated by executing each of the second through ninth instructions can be stored or retained. For example, during register renaming, physical registers can be assigned to each of the general-purpose or logical registers RBX and R8-R15, and the results of each of the first through ninth instructions can be stored or retained in these assigned physical registers. During subsequent iterations of the loop, as long as the predicted value for the first instruction remains the same, there is no need to re-execute the second through ninth instructions (e.g., because the stored or retained results would be the same as those that would be recalculated if the second through ninth instructions were re-executed).
Instead, during these subsequent iterations, each of the second through ninth instructions can be marked for execution elimination by pointing its destination to the same physical register previously used to store or retain the result of the earlier executed instance of that instruction. Eliminating the execution of the second through ninth instructions in subsequent iterations of the loop can help improve performance and/or reduce power consumption.
[0036] Figure 2A is a block flow diagram of an embodiment of a method 202 performed by a processor. The method includes, at block 203 (e.g., utilizing a prediction unit of the processor), predicting the value that a load instruction of a first iteration of a loop will load. For example, the value (e.g., 2) that the first instruction RBX = LOAD(RAX) will load can be predicted. The method includes, at block 204 (e.g., utilizing one or more execution units of the processor), executing multiple instructions that occur after the load instruction in the first iteration of the loop to produce multiple results. For example, this could include executing each of the second through ninth instructions during the first iteration of loop 100. Each of the multiple results depends only on the value, one or more constant values, values derived therefrom, or any combination thereof. For example, each of the results of the second through ninth instructions in loop 100 depends only on the loaded value (e.g., 2), one or more constants (e.g., 2, 3, 4, 6, 2, 3, 1, and 8), and in some cases one or more values derived therefrom (e.g., the value in R8 is derived from the loaded value and the constant 2, and the value in R9 is derived from the value in R8). The method also includes, at block 205 (e.g., during an allocation stage and/or utilizing allocation circuitry of the processor), generating the multiple results for the multiple instructions during each of one or more iterations following the first iteration, without re-executing the multiple instructions. In some embodiments, the multiple results generated for the first iteration may be stored in multiple physical registers (e.g., the destination physical registers of the instructions), and generating the results may include providing the stored results from those physical registers during each of the one or more iterations.
For example, the result of each of the second through ninth instructions may be stored during the first iteration in a physical register indicated by a physical register identifier in an execution elimination structure (e.g., further disclosed below), and the result may then be provided from that physical register based on the execution elimination structure during each of one or more subsequent iterations.
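The three blocks of method 202 can be sketched in Python as follows. This is a minimal behavioral illustration under stated assumptions rather than the claimed implementation: the function `run_loop`, the dictionary standing in for the physical registers, and the executed-instruction counter are hypothetical, and the ninth instruction is assumed to multiply by 8.

```python
# Instructions two through nine as (destination, source, operation).
CONSTANT_OPS = [
    ("R8", "RBX", lambda x: x * 2),
    ("R9", "R8", lambda x: x * 3),
    ("R10", "R9", lambda x: x * 4),
    ("R11", "R10", lambda x: x + 6),
    ("R12", "R11", lambda x: x + 2),
    ("R13", "R12", lambda x: x - 3),
    ("R14", "R13", lambda x: x + 1),
    ("R15", "R14", lambda x: x * 8),  # operator assumed to be multiplication
]


def run_loop(iterations: int, predicted_value: int):
    """Return the final register results and how many instructions executed."""
    retained = {}   # stands in for physical registers holding retained results
    executed = 0    # number of data processing instructions actually executed
    results = {}
    for iteration in range(iterations):
        if iteration == 0:
            # Block 203: use the predicted value instead of performing the load.
            retained["RBX"] = predicted_value
            # Block 204: execute the dependent instructions once.
            for dst, src, op in CONSTANT_OPS:
                retained[dst] = op(retained[src])
                executed += 1
        # Block 205: later iterations read the retained results; nothing re-executes.
        results = dict(retained)
    return results, executed


results, executed = run_loop(iterations=5, predicted_value=2)
print(executed)        # 8: each of the eight instructions ran once, not 40 times
print(results["R15"])  # 432
```

The sketch assumes the prediction stays valid for every iteration; a change in the predicted value would require executing the chain again.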
[0037] In some embodiments, an execution elimination structure can be used to identify instructions that do not need to be executed. When the data of a load instruction is a predicted constant load value, the execution elimination structure can be trained or otherwise populated during one execution of a loop or other set of instructions. The execution elimination structure can then be used to identify the instructions that do not need to be executed.
[0038] Figure 2B is a block diagram of an example embodiment of an execution elimination structure 210. The execution elimination structure can be implemented in a storage device, a hardware-implemented data structure, etc. As a specific example, the execution elimination structure can be implemented in content addressable memory (CAM) or similar associative memory or associative storage devices. The execution elimination structure shown is formatted as a table with columns and rows, but alternative arrangements and/or other types of data structures may be used instead. The table shown has five columns. The top row is a header row, which indicates the type of data stored in each of the five columns. As indicated by the header row, the first (e.g., leftmost) column is used to store the instruction pointer (IP) of the instruction corresponding to the row. The second column is used to store a first source physical register identifier, which identifies the first source physical register (e.g., mapped to a first source logical register) of the instruction corresponding to the row. The third column is used to store a second source physical register identifier, which identifies the second source physical register (e.g., mapped to a second source logical register) of the instruction corresponding to the row. The fourth column (which is optional) is used to store a third source physical register identifier, which identifies the third source physical register (e.g., mapped to a third source logical register) of the instruction corresponding to the row. Some instruction set architectures may not support instructions with three source registers, and in such cases the fourth column may be omitted. The fifth column is used to store a destination physical register identifier, which identifies the destination physical register (e.g., mapped to a destination logical register) of the instruction corresponding to the row.
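The five-column layout described above might be modeled in software roughly as follows. This is only a sketch: the class and field names are hypothetical, physical register identifiers are represented as strings, and the associative (CAM-like) lookup is approximated with a dictionary keyed by instruction pointer, which is an assumption about how the structure is searched.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EliminationEntry:
    ip: int                     # column 1: instruction pointer
    src1: Optional[str]         # column 2: first source physical register ID
    src2: Optional[str]         # column 3: second source physical register ID
    src3: Optional[str]         # column 4 (optional): third source physical register ID
    dst: str                    # column 5: destination physical register ID


class ExecutionEliminationStructure:
    """Dictionary-backed stand-in for an associative table of entries."""

    def __init__(self) -> None:
        self._by_ip: dict[int, EliminationEntry] = {}

    def insert(self, entry: EliminationEntry) -> None:
        self._by_ip[entry.ip] = entry

    def lookup(self, ip: int) -> Optional[EliminationEntry]:
        return self._by_ip.get(ip)


structure = ExecutionEliminationStructure()
# Hypothetical entry for an instruction at 0x1004 with a single source register.
structure.insert(EliminationEntry(ip=0x1004, src1="PRID1", src2=None, src3=None, dst="PRID2"))

hit = structure.lookup(0x1004)
miss = structure.lookup(0x2000)
print(hit.dst if hit else None)  # PRID2
print(miss)                      # None
```

Unused source columns are left as `None` here, mirroring instructions that have fewer than three source registers.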
[0039] Figure 3 is a block flow diagram of an embodiment of a method 320 for training or otherwise populating the execution elimination structure. In some embodiments, training or populating the execution elimination structure may begin during the first, second, or another early iteration of a loop, when constant load value prediction is first used to predict the data loaded by a load instruction and that load instruction is subsequently retired or otherwise committed. The method includes, at block 321, checking the next instruction to be retired or otherwise committed (e.g., by a retirement unit, commit unit, etc.). For example, instructions committed from a reorder buffer (ROB) may be checked.
[0040] At block 322, a determination can be made as to whether the instruction is a load instruction whose load value is predicted. If so (e.g., the determination at block 322 is "Yes"), the method can proceed to block 323. At block 323, the execution elimination structure can be updated to include the instruction. For example, the execution elimination structure (e.g., a row or other entry thereof) can be updated to include the instruction pointer of the instruction, the predicted load value as the source, and a destination physical register identifier (e.g., identifying the destination physical register that is used to store the predicted load value and/or that is mapped to the destination general-purpose or other logical register of the instruction).
[0041] The method can then proceed to block 324, where the destination physical register identifier of the instruction can be marked as an execution elimination hit. This can include storing execution elimination hit information for the destination physical register identifier. In some embodiments, the execution elimination hit information can be stored in a new execution elimination hit field in the entry for the corresponding physical register identifier in an existing register renaming structure or unit (e.g., a register alias table, a register renaming unit, etc.). For example, the entry can be expanded to include an execution elimination hit field that stores a single bit of execution elimination hit information, which may have a first binary value (e.g., set to binary 1 according to one possible convention) to indicate an execution elimination hit, or a second binary value (e.g., cleared to binary 0 according to one possible convention) to indicate the lack of an execution elimination hit. This execution elimination hit information can be used during training to help determine whether the sources of instructions are constant and whether those instructions should be added to the execution elimination structure. The method can then return from block 324 to block 321, where the next instruction to be committed can be checked.
[0042] If the instruction checked at block 322 is not a load instruction whose load value is predicted (e.g., the determination at block 322 is "No"), the method can proceed to block 325. At block 325, it can be determined whether the source of the instruction is a predicted load value and/or a physical register marked as an execution elimination hit. If so (e.g., the determination at block 325 is "Yes"), the method can proceed to block 323. At block 323, the execution elimination structure can be updated to include the instruction. For example, the execution elimination structure (e.g., a row or other entry thereof) can be updated to include the instruction pointer of the instruction, the source of the instruction (e.g., the predicted load value and/or a source physical register identifier that identifies the source physical register mapped to a source general-purpose or other logical register of the instruction), and a destination physical register identifier that identifies the destination physical register (e.g., mapped to the destination general-purpose or other logical register of the instruction). Then, at block 324, the destination physical register identifier of the instruction can be marked as an execution elimination hit. The method can then return to block 321, where the next instruction to be committed can be checked.
[0043] Alternatively, if the determination at block 325 is "No", the method can return to block 321, where the next instruction to be committed can be checked. This process can be repeated for all instructions in the loop to train or populate the execution elimination structure with the state of the loop instructions that do not need to be executed on subsequent iterations of the loop.
[0044] To further illustrate, consider how this method can be used to process the first through third instructions of loop 100 of Figure 1. For the first instruction, RBX = LOAD(RAX), the determination at block 322 can be "Yes". Consider that RBX is assigned the first physical register identifier (PRID1). At block 323, the first instruction will be added to the execution elimination structure, and at block 324, the destination PRID1 will be marked as an execution elimination hit. For the second instruction, R8 = RBX*2, the determination at block 322 can be "No" and the determination at block 325 can be "Yes". Making the "Yes" determination at block 325 can include recognizing that PRID1 (assigned to RBX) was previously marked as an execution elimination hit. Consider that R8 is assigned the second physical register identifier (PRID2). At block 323, the second instruction will be added to the execution elimination structure, and at block 324, the destination PRID2 will be marked as an execution elimination hit. For the third instruction, R9 = R8*3, the determination at block 322 can be "No" and the determination at block 325 can be "Yes". Making the "Yes" determination at block 325 can include recognizing that PRID2 (assigned to R8) was previously marked as an execution elimination hit. Consider that R9 is assigned the third physical register identifier (PRID3). At block 323, the third instruction will be added to the execution elimination structure, and at block 324, the destination PRID3 will be marked as an execution elimination hit.
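The walk through blocks 321 to 325 for these three instructions can be sketched as a small Python simulation. It is a simplified model under stated assumptions: each instruction has a single source, the hit information is kept in a plain set rather than in a register renaming structure, and the `Committed` record, the `train` function, and the extra unrelated instruction (with PRID50 and PRID51) are hypothetical names added for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Committed:
    ip: str                   # instruction pointer label, e.g. "A+4"
    is_predicted_load: bool   # True for a load whose value was predicted
    source: Union[int, str]   # predicted value for the load, else a source PRID
    dest: str                 # destination PRID


def train(committed_stream):
    """Populate the structure and hit set from a stream of committed instructions."""
    structure = []  # rows of (ip, source, destination PRID)
    hits = set()    # PRIDs marked as execution elimination hits
    for insn in committed_stream:                     # block 321
        if insn.is_predicted_load:                    # block 322
            eligible = True
        else:                                         # block 325
            eligible = insn.source in hits
        if eligible:
            structure.append((insn.ip, insn.source, insn.dest))  # block 323
            hits.add(insn.dest)                                  # block 324
    return structure, hits


stream = [
    Committed("A", True, 2, "PRID1"),           # RBX = LOAD(RAX), predicted value 2
    Committed("A+4", False, "PRID1", "PRID2"),  # R8 = RBX*2
    Committed("A+8", False, "PRID2", "PRID3"),  # R9 = R8*3
    Committed("X", False, "PRID50", "PRID51"),  # hypothetical unrelated instruction
]

structure, hits = train(stream)
for row in structure:
    print(row)
```

The unrelated instruction is skipped because its source register was never marked as a hit, which corresponds to the "No" outcome at block 325.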
[0045] Method 320 has been described in a relatively basic form, but operations may optionally be added to and/or removed from the method. Furthermore, although the flowchart illustrates a specific order of operations according to an embodiment, this order is exemplary. Alternative embodiments may perform operations in a different order, combine certain operations, overlap certain operations, etc.
[0046] Figure 4 is a block diagram of a detailed example embodiment of an execution elimination structure 410, showing the state trained or populated for the specific example loop 100 of Figure 1. The execution elimination structure 410 can be implemented in various ways as described above (e.g., in a storage device, a hardware-implemented data structure, etc.). The execution elimination structure shown is formatted as a table with columns and rows, but alternative arrangements and/or other types of data structures may be used instead. The table shown has three columns. Optionally, additional columns may exist (e.g., as described for Figure 2B), but the instructions of loop 100 do not require such additional columns, so for simplicity they are not shown.
[0047] The top row is the header row, which indicates the type of data stored in each of the three columns. As indicated by the header row, the first (e.g., leftmost) column is used to store the instruction pointer (IP) of the instruction corresponding to the row. The second column is used to store the source physical register identifier, which identifies the source physical register (e.g., mapped to a source logical register) of the instruction corresponding to the row. The third column is used to store the destination physical register identifier, which identifies the destination physical register (e.g., mapped to a destination logical register) of the instruction corresponding to the row. Specific examples of physical register identifiers (PRIDs) (e.g., PRID1 through PRID9) are shown, although these are only illustrative examples. For ease of understanding, the corresponding source and destination general-purpose or logical registers (e.g., RBX and R8-R15) to which the physical register identifiers are mapped are enclosed in parentheses. These general-purpose or other logical registers may optionally not be stored or included in the execution elimination structure.
[0048] To further illustrate certain concepts, it will be provided how instructions for executing the elimination structure 410 can be trained or populated using loop 100 in method 320 to achieve the desired effect. Figure 4 A brief description of the state shown. (Reference) Figure 3 In box 321, the first instruction RBX = LOAD(RAX) can be the next instruction to be submitted and can be checked. At box 322, it can be determined that the first instruction is the load instruction whose load value is predicted, so the method can proceed to box 323. At box 323, the execution elimination structure can be updated to include the first instruction. For example, as... Figure 4 As shown, the instruction pointer A of the first instruction can be stored in the first column of the second row, the predicted load value (e.g., 2) can be stored in the second column of the second row, and the first physical register identifier (PRID1) that identifies the first physical register (e.g., the destination general-purpose register RBX mapped to the first instruction) can be stored in the third column of the second row. As an example, the first instruction can be converted into a move immediate instruction or operation, where the immediate value can provide the predicted load value, such that the immediate value can be written to a physical register that can be read and consumed by dependencies.
[0049] Referring to Figure 3, at box 321, the second instruction R8=RBX. 2 can now be the next instruction to be committed and can be checked. At box 322, it can be determined that the second instruction is not the load instruction whose load value is predicted, so the method can proceed to box 325. At box 325, it can be determined that the source of the second instruction is a physical register identified by a physical register identifier reflected in the execution elimination structure. For example, the second instruction has only a single source general-purpose register RBX, which is mapped to the first physical register identified by PRID1, as reflected in the third column of the second row of the execution elimination structure. The method can proceed to box 323, where the execution elimination structure can be updated to include the second instruction. For example, as shown in Figure 4, the instruction pointer A+4 of the second instruction can be stored in the first column of the third row, the first physical register identifier (PRID1), which identifies the first physical register (e.g., the one to which the source general-purpose register RBX of the second instruction is mapped), can be stored in the second column of the third row, and the second physical register identifier (PRID2), which identifies the second physical register (e.g., the one to which the destination general-purpose register R8 of the second instruction is mapped), can be stored in the third column of the third row.
[0050] Referring to Figure 3, at box 321, the third instruction R9=R8 3 can be the next instruction to be committed and can be checked. At box 322, it can be determined that the third instruction is not the load instruction whose load value is predicted, so the method can proceed to box 325. At box 325, it can be determined that the source of the third instruction is a physical register identified by a physical register identifier reflected in the execution elimination structure. For example, the third instruction has only a single source general-purpose register R8, which is mapped to the second physical register identified by PRID2, as reflected in the third column of the third row of the execution elimination structure. The method can proceed to box 323, where the execution elimination structure can be updated to include the third instruction. For example, as shown in Figure 4, the instruction pointer A+8 of the third instruction can be stored in the first column of the fourth row, the second physical register identifier (PRID2), which identifies the second physical register (e.g., the one to which the source general-purpose register R8 of the third instruction is mapped), can be stored in the second column of the fourth row, and the third physical register identifier (PRID3), which identifies the third physical register (e.g., the one to which the destination general-purpose register R9 of the third instruction is mapped), can be stored in the third column of the fourth row. A similar process can be repeated for each of the fourth through ninth instructions in the loop to train or populate the execution elimination structure to the state shown in Figure 4.
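The training steps described above can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the circuitry of the embodiments: the `Entry` class, the `train` function, the numeric value chosen for instruction pointer A, and the extra unrelated instruction are all hypothetical, and each instruction is modeled with a single source, as in the first three instructions of the walk-through.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union

A = 0x1000  # hypothetical numeric value for instruction pointer A


@dataclass
class Entry:
    source: Union[int, str]  # predicted load value (load row) or source PRID
    dest: str                # destination PRID


def train(committed: List[Tuple[int, Optional[str], str]],
          predicted_loads: Dict[int, int]) -> Dict[int, Entry]:
    """Populate a table keyed by instruction pointer from committed instructions.

    committed holds (ip, source PRID or None, destination PRID) tuples.
    predicted_loads maps the IP of a value-predicted load to its predicted value.
    """
    table: Dict[int, Entry] = {}
    for ip, src, dest in committed:
        if ip in predicted_loads:
            # Box 322 -> box 323: value-predicted load, record value and destination.
            table[ip] = Entry(predicted_loads[ip], dest)
        elif any(entry.dest == src for entry in table.values()):
            # Box 325 -> box 323: the source is a destination already in the table.
            table[ip] = Entry(src, dest)
        # Otherwise the instruction is not added (assumed behavior for this sketch).
    return table


committed = [
    (A, None, "PRID1"),            # RBX = LOAD(RAX), predicted value 2
    (A + 4, "PRID1", "PRID2"),     # second instruction: source RBX, destination R8
    (A + 8, "PRID2", "PRID3"),     # third instruction: source R8, destination R9
    (A + 64, "PRID40", "PRID41"),  # hypothetical unrelated instruction
]
table = train(committed, {A: 2})
```

In this model the rows for A, A+4, and A+8 mirror the second through fourth rows of Figure 4, while the unrelated instruction is left out because its source is not a destination already recorded in the table.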
[0051] Figure 5 is a block diagram of an embodiment of an allocation stage 530 (e.g., allocation circuitry) of a processor. The allocation stage includes a register renaming unit 532, an execution elimination structure 510, a comparison circuit 534, a free list structure 533, and a selection circuit 535.
[0052] Micro-operations, micro-instructions, or other decoded instructions 531 that are ready for allocation are input to the allocation stage. These decoded instructions are used to look up the register renaming unit and the execution elimination structure. The comparison circuit is coupled to the outputs of both the register renaming unit and the execution elimination structure. The selection circuit has inputs coupled to the outputs of both the free list structure and the execution elimination structure. The output of the comparison circuit is coupled to the selection circuit as a control input. The comparison circuit can compare the outputs of the register renaming unit and the execution elimination structure to see whether they are equal or otherwise match (e.g., whether a renamed source matches the renamed source from the execution elimination table). If the comparison circuit determines that the outputs of the register renaming unit and the execution elimination structure match, it can control the selection circuit to select the output of the execution elimination structure as the destination physical register 536 to be assigned to the decoded instruction. In that case, the destination physical register output by the free list structure may not be used, and the execution of the decoded instruction may also be eliminated or omitted. Otherwise, if the comparison circuit determines that the values do not match, the selection circuit may select the destination physical register output by the free list structure as the destination physical register 536 to be assigned to the decoded instruction.
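The compare-and-select behavior of the allocation stage can be summarized with a small software model. This is a sketch under assumptions, not a description of the actual circuits: the function name `select_destination`, the tuple layout of the table entries, and the example PRID names are hypothetical.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical entry layout: (renamed source PRID or predicted value, destination PRID)
Entry = Tuple[Union[int, str], str]


def select_destination(ip: int,
                       renamed_source: Union[int, str],
                       elimination_table: Dict[int, Entry],
                       free_list: List[str]) -> Tuple[str, bool]:
    """Return (destination PRID, execution_eliminated) for one decoded instruction."""
    entry = elimination_table.get(ip)
    # Comparison step: does the freshly renamed source match the stored source?
    if entry is not None and entry[0] == renamed_source:
        # Selection step: reuse the stored destination; the free list is not consumed.
        return entry[1], True
    # No match: take a new destination register from the free list.
    return free_list.pop(0), False


table = {0x1004: ("PRID1", "PRID2")}
free = ["PRID20", "PRID21"]
hit = select_destination(0x1004, "PRID1", table, free)
miss = select_destination(0x1004, "PRID7", table, free)
```

Here the first call models a match, so the stored destination is reused and the instruction is marked as eliminated; the second call models a mismatch, so a register is taken from the free list and the instruction executes normally.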
[0053] The execution elimination structure 510 may be similar or identical to those described elsewhere herein (e.g., execution elimination structure 210 and/or execution elimination structure 410). The execution elimination structure has entries indexed by instruction pointer. The instruction pointer of a decoded instruction being allocated can be used to look up the corresponding entry in the execution elimination structure before renaming by the register renaming unit. The entries store information about the previously renamed source and destination physical registers of the corresponding instructions. Each entry for an instruction in the execution elimination table includes a destination physical register identifier that identifies the destination physical register. When decoded instructions are allocated, they can be assigned the destination physical register identified in the entry indexed by the instruction pointer. The destination physical registers of the instructions reflected in the execution elimination structure can be retained without being reclaimed. The results of previous computations of the instructions can still be stored or retained in these destination physical registers. Decoded instructions can be marked for execution elimination simply by pointing the instruction's destination physical register to the destination physical register used to store or retain the result of the previous computation. As previously described, this allows the execution of the decoded instructions to be eliminated or omitted, which can help improve performance and/or reduce power consumption. As an example, performance monitoring can be used to count or otherwise determine the number of instructions executed, which may be less than the number of instructions decoded and/or the number of instructions issued, and so on.
[0054] As an example, consider subsequent iterations of loop 100 that follow the training iteration(s) of the loop used to train or populate the execution elimination structure 410 of Figure 4. During these subsequent iterations, when the decoded instructions for each of the second through ninth instructions are allocated, they should hit in the execution elimination structure. This can begin with the first instruction RBX = LOAD(RAX). The immediate value of the MOV micro-operation at instruction pointer value A will match the immediate value stored in the execution elimination table at instruction pointer value A. Therefore, the output of this move will remain unchanged from what was observed during training. The destination physical register of this move will be the first physical register identified by PRID1. The register alias table will map RBX to PRID1.
[0055] Next, the second instruction at instruction pointer A+4 can be allocated. Its source is RBX. This source RBX will be renamed to PRID1 because the first instruction uses PRID1 for its destination RBX. The mapping of RBX stored in the VET table is also PRID1. Since the renamed source matches between the execution elimination table and the second instruction being allocated, it is guaranteed that the destination physical register and output identified in the execution elimination structure will remain unchanged. Therefore, the second instruction can be marked for elimination by pointing its destination to PRID2. The instruction at IP A+4 is assigned the destination PRID2, and the register alias table (RAT) will map R8 to PRID2. A similar process can be repeated for each of the third through ninth instructions in the loop.
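The chain of matches in this walk-through, in which each hit sets up the register alias table so that the next instruction also hits, can be illustrated with a short model. It is a sketch under assumptions: the names `LOOP`, `TABLE`, and `allocate_iteration`, the numeric value for instruction pointer A, and the free-list PRID names are hypothetical, and only the first three instructions of the loop are modeled.

```python
A = 0x1000  # hypothetical numeric value for instruction pointer A

# (ip, source logical register or None for the predicted load, destination logical register)
LOOP = [(A, None, "RBX"), (A + 4, "RBX", "R8"), (A + 8, "R8", "R9")]

# Trained entries: ip -> (predicted value or source PRID, destination PRID)
TABLE = {A: (2, "PRID1"), A + 4: ("PRID1", "PRID2"), A + 8: ("PRID2", "PRID3")}


def allocate_iteration(loop, table, load_immediate, free_list):
    """Allocate one loop iteration; return the final RAT and the IPs that hit."""
    rat = {}   # register alias table: logical register -> PRID
    hits = []
    for ip, src_reg, dest_reg in loop:
        # The load is modeled as a move-immediate whose source is the immediate value.
        renamed_src = load_immediate if src_reg is None else rat[src_reg]
        entry = table.get(ip)
        if entry is not None and entry[0] == renamed_src:
            dest_prid = entry[1]           # reuse the retained destination register
            hits.append(ip)
        else:
            dest_prid = free_list.pop(0)   # normal renaming from the free list
        rat[dest_reg] = dest_prid
    return rat, hits


rat_ok, hits_ok = allocate_iteration(LOOP, TABLE, 2, ["PRID10", "PRID11", "PRID12"])
rat_bad, hits_bad = allocate_iteration(LOOP, TABLE, 5, ["PRID10", "PRID11", "PRID12"])
```

With the trained immediate value 2, every instruction hits and the RAT ends up mapping RBX, R8, and R9 to PRID1, PRID2, and PRID3. With a different immediate value, the first instruction misses, RBX is mapped to a new register, and the dependent instructions consequently miss as well.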
[0056] One thing that should be checked is the continued correctness of the constant load value prediction (e.g., for the first instruction in the loop). For example, at each iteration, the load having the constant load value prediction (e.g., the first instruction in loop 100) can be executed to load the value from memory, in order to verify that the value actually loaded remains the same as the predicted constant load value. The processor may include circuitry for comparing the value loaded by executing the load instruction during the iteration with the value predicted for the load instruction, or this may optionally be done in software. If the actual value remains the same as the predicted value, the above-described elimination is valid. Otherwise, a change can be detected if there has been an intervening store, write, or snoop from another core that has changed or modified the value. In this case, the incorrect value can be discarded (e.g., flushed from the pipeline and cleared from the physical register or marked as invalid in the physical register), and the instruction can be re-executed to produce the correct result.
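The per-iteration check described above can be modeled as follows. This is a simplified sketch under assumptions: the function `check_constant_load`, the `register_valid` dictionary used to model invalid marking, and the choice to re-execute every instruction in the table are hypothetical and are not taken from the embodiments.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical entry layout: ip -> (predicted value or source PRID, destination PRID)
Table = Dict[int, Tuple[Union[int, str], str]]


def check_constant_load(actual_value: int,
                        predicted_value: int,
                        elimination_table: Table,
                        register_valid: Dict[str, bool]) -> List[int]:
    """Compare the loaded value with the prediction.

    Returns the instruction pointers that must be re-executed (empty if the
    prediction still holds).
    """
    if actual_value == predicted_value:
        return []  # prediction confirmed: the elimination remains valid
    # Misprediction: mark the retained results invalid (one of the options named above).
    for _source, dest in elimination_table.values():
        register_valid[dest] = False
    return sorted(elimination_table)


table = {0x1000: (2, "PRID1"), 0x1004: ("PRID1", "PRID2")}
valid = {"PRID1": True, "PRID2": True}
ok = check_constant_load(2, 2, table, valid)
valid_after_ok = dict(valid)
redo = check_constant_load(7, 2, table, valid)
```

In the second call the loaded value differs from the prediction (for example, because another core stored to the location), so both retained registers are marked invalid and both instructions are returned for re-execution.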
[0057] Example computer architecture.
[0058] Example computer architectures are described in detail below. Other system designs and configurations known in the art for laptops, desktop computers, personal computers (PCs), personal digital assistants, engineering workstations, servers, disaggregated servers, network devices, network hubs, switches, routers, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cellular phones, portable media players, handheld devices, and a wide variety of other electronic devices are also suitable. In general, a wide variety of systems or electronic devices capable of incorporating the processors and/or other execution logic disclosed herein are suitable.
[0059] Figure 6 illustrates an example computing system. The multiprocessor system 600 is an interfaced system and includes multiple processors or cores, including a first processor 670 and a second processor 680 coupled via an interface 650 such as a point-to-point (P-P) interconnect, a fabric, and/or a bus. In some examples, the first processor 670 and the second processor 680 are homogeneous. In some examples, the first processor 670 and the second processor 680 are heterogeneous. Although the example system 600 is shown as having two processors, the system can have three or more processors, or it can be a single-processor system. In some examples, the computing system is a system-on-a-chip (SoC).
[0060] Processors 670 and 680 are shown to include integrated memory controller (IMC) circuitry 672 and 682, respectively. Processor 670 also includes interface circuitry 676 and 678; similarly, the second processor 680 includes interface circuitry 686 and 688. Processors 670 and 680 can exchange information via interface 650 using interface circuitry 678 and 688. IMCs 672 and 682 couple processors 670 and 680 to their respective memories, namely memories 632 and 634, which may be part of the main memory locally attached to each processor.
[0061] Processors 670 and 680 can each exchange information with a network interface (NW I/F) 690 via individual interfaces 652 and 654, using interface circuits 676, 694, 686, and 698. Network interface 690 (e.g., one or more of an interconnect, a bus, and/or a fabric, and in some examples a chipset) can optionally exchange information with coprocessor 638 via interface circuit 692. In some examples, coprocessor 638 is a dedicated processor, such as a high-throughput processor, network or communication processor, compression engine, graphics processor, general-purpose graphics processing unit (GPGPU), neural-network processing unit (NPU), embedded processor, etc.
[0062] A shared cache (not shown) may be included in either of the processors 670, 680, or may be located outside of both processors but connected to them via an interface such as a P-P interconnect, such that if a processor is placed in a low-power mode, the local cache information of either or both processors may be stored in the shared cache.
[0063] Network interface 690 may be coupled to first interface 616 via interface circuitry 696. In some examples, first interface 616 may be an interface such as a Peripheral Component Interconnect (PCI) interconnect, a PCI Express interconnect, or another I / O interconnect. In some examples, first interface 616 is coupled to power control unit (PCU) 617, which may include circuitry, software, and / or firmware to perform power management operations with respect to processors 670, 680, and / or coprocessor 638. PCU 617 provides control information to a voltage regulator (not shown) to cause the voltage regulator to generate an appropriate regulated voltage. PCU 617 also provides control information to control the generated operating voltage. In various examples, PCU 617 may include various power management logic units (circuitry) to perform hardware-based power management. Such power management can be entirely controlled by the processor (e.g., controlled by various processor hardware and can be triggered by workload and / or power constraints, thermal constraints or other processor constraints), and / or power management can be performed in response to external sources (e.g., platform or power management sources or system software).
[0064] The PCU 617 is illustrated as logic separate from processors 670 and / or 680. In other cases, the PCU 617 may execute on one or more cores of processors 670 or 680 (not shown). In some cases, the PCU 617 may be implemented as a microcontroller (dedicated or general-purpose) or other control logic configured to execute its own dedicated power management code (sometimes called P-code). In still other examples, the power management operations to be performed by the PCU 617 may be implemented externally to the processor, for example, by a separate power management integrated circuit (PMIC) or another component external to the processor. In still other examples, the power management operations to be performed by the PCU 617 may be implemented within the BIOS or other system software.
[0065] Various I/O devices 614 can be coupled to the first interface 616, along with a bus bridge 618 that couples the first interface 616 to a second interface 620. In some examples, one or more additional processors 615 are coupled to the first interface 616. These additional processors may include coprocessors, high-throughput many-integrated-core (MIC) processors, GPGPUs, accelerators (such as graphics accelerators or digital signal processing (DSP) units), field-programmable gate arrays (FPGAs), or any other processor. In some examples, the second interface 620 may be a low-pin-count (LPC) interface. Various devices can be coupled to the second interface 620, including, for example, a keyboard and/or mouse 622, a communication device 627, and storage circuitry 628. Storage circuitry 628 may be one or more non-transitory machine-readable storage media, such as a disk drive or other mass storage device, which in some examples may include instructions/code and data 630 and may implement the storage circuitry 628. Additionally, an audio I/O 624 can be coupled to the second interface 620. Note that other architectures besides the point-to-point architecture described above are also possible. For example, a system such as the multiprocessor system 600 can implement a multi-drop interface or other such architecture instead of a point-to-point architecture.
[0066] Example core architecture, processor, and computer architecture.
[0067] Processor cores can be implemented in different ways, for different purposes, and in different processors. For example, implementations of such cores can include: 1) a general-purpose in-order core intended for general-purpose computing; 2) a high-performance general-purpose out-of-order core intended for general-purpose computing; and 3) a dedicated core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors can include: 1) a CPU comprising one or more general-purpose in-order cores and/or one or more general-purpose out-of-order cores intended for general-purpose computing; and 2) a coprocessor comprising one or more dedicated cores intended primarily for graphics and/or scientific (throughput) computing. These different processors lead to different computer system architectures, which can include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as the CPU; 3) the coprocessor on the same die as the CPU (in which case such a coprocessor is sometimes referred to as dedicated logic, such as integrated graphics and/or scientific (throughput) logic, or as dedicated cores); and 4) a system-on-a-chip (SoC) that may include, on the same die, the described CPU (sometimes referred to as the application core(s) or application processor(s)), the above-described coprocessor, and additional functionality. An example core architecture is described next, followed by descriptions of example processors and computer architectures.
[0068] Figure 7 illustrates a block diagram of an example processor and/or SoC 700 that may have one or more cores and an integrated memory controller. The solid-line boxes illustrate a processor 700 with a single core 702(A), system agent unit circuitry 710, and a set of one or more interface controller unit circuits 716, while the optional addition of the dashed-line boxes illustrates an alternative processor 700 with multiple cores 702(A)-(N), a set of one or more integrated memory controller unit circuits 714 in the system agent unit circuitry 710, dedicated logic 708, and a set of one or more interface controller unit circuits 716. Note that processor 700 may be one of the processors 670 or 680, or the coprocessor 638 or 615, of Figure 6.
[0069] Therefore, different implementations of processor 700 can include: 1) a CPU, where dedicated logic 708 is integrated graphics and/or scientific (throughput) logic (which may include one or more cores, not shown), and cores 702(A)-(N) are one or more general-purpose cores (e.g., general-purpose in-order cores, general-purpose out-of-order cores, or a combination of both); 2) a coprocessor, where cores 702(A)-(N) are a large number of dedicated cores intended primarily for graphics and/or scientific (throughput) purposes; and 3) a coprocessor, where cores 702(A)-(N) are a large number of general-purpose in-order cores. Thus, processor 700 can be a general-purpose processor, a coprocessor, or a dedicated processor, such as a network or communication processor, a compression engine, a graphics processor, a GPGPU (general-purpose graphics processing unit), a high-throughput many-integrated-core (MIC) coprocessor (including 30 or more cores), an embedded processor, etc. The processor can be implemented on one or more chips. The processor 700 may be part of one or more substrates and/or may be implemented on one or more substrates using any of a number of process technologies, such as, for example, complementary metal oxide semiconductor (CMOS), bipolar CMOS (BiCMOS), P-type metal oxide semiconductor (PMOS), or N-type metal oxide semiconductor (NMOS).
[0070] The memory hierarchy includes one or more levels of cache unit circuitry 704(A)-(N) within cores 702(A)-(N), a set of one or more shared cache unit circuitry 706, and external memory (not shown) coupled to the set of integrated memory controller unit circuitry 714. The set of one or more shared cache unit circuitry 706 may include one or more mid-level caches, such as Level 2 (L2), Level 3 (L3), Level 4 (L4), or other levels of cache, such as a last level cache (LLC), and/or combinations thereof. While in some examples interface network circuitry 712 (e.g., a ring interconnect) interfaces the dedicated logic 708 (e.g., integrated graphics logic), the set of shared cache unit circuitry 706, and the system agent unit circuitry 710, alternative examples use any number of well-known techniques to interface these units. In some examples, coherency is maintained between one or more of the shared cache unit circuitry 706 and cores 702(A)-(N). In some examples, the interface controller unit circuitry 716 couples the cores 702(A)-(N) to one or more other devices 718, such as one or more I/O devices, storage devices, or one or more communication devices (e.g., wireless networks, wired networks, etc.).
[0071] In some examples, one or more of the cores 702(A)-(N) have multi-threading capabilities. The system agent unit circuitry 710 includes those components that coordinate and operate the cores 702(A)-(N). The system agent unit circuitry 710 may include, for example, power control unit (PCU) circuitry and / or display unit circuitry (not shown). The PCU may be (or may include) the logic and components required to regulate the power state of the cores 702(A)-(N) and / or dedicated logic 708 (e.g., integrated graphics logic). The display unit circuitry is used to drive one or more externally connected displays.
[0072] Cores 702(A)-(N) can be homogeneous in terms of instruction set architecture (ISA). Alternatively, cores 702(A)-(N) can be heterogeneous in terms of ISA; that is, a subset of cores 702(A)-(N) may be able to execute an ISA, while other cores may be able to execute only a subset of that ISA or be able to execute another ISA.
[0073] Example core architecture - in-order and out-of-order core block diagrams.
[0074] Figure 8(A) is a block diagram illustrating both an example in-order pipeline and an example register renaming, out-of-order issue/execution pipeline according to examples. Figure 8(B) is a block diagram illustrating both an example in-order architecture core and an example register renaming, out-of-order issue/execution architecture core to be included in a processor according to examples. The solid-line boxes in Figures 8(A)-8(B) illustrate the in-order pipeline and in-order core, while the optional dashed-line boxes illustrate the register renaming, out-of-order issue/execution pipeline and core. Given that the in-order aspect is a subset of the out-of-order aspect, the out-of-order aspect will be described.
[0075] In Figure 8(A), the processor pipeline 800 includes a fetch stage 802, an optional length decoding stage 804, a decoding stage 806, an optional allocation (Alloc) stage 808, an optional rename stage 810, a scheduling (also known as dispatch or issue) stage 812, an optional register read/memory read stage 814, an execution stage 816, a write-back/memory write stage 818, an optional exception handling stage 822, and an optional commit stage 824. One or more operations can be performed in each of these processor pipeline stages. For example, during the fetch stage 802, one or more instructions are fetched from instruction memory, and during the decoding stage 806, the fetched instructions can be decoded, addresses (e.g., load store unit (LSU) addresses) can be generated using forwarded register ports, and branch forwarding (e.g., using an immediate offset or a link register (LR)) can be performed. In one example, the decoding stage 806 and the register read/memory read stage 814 can be combined into a single pipeline stage. In one example, during the execution stage 816, the decoded instructions can be executed, LSU address/data pipelining to an Advanced Microcontroller Bus (AMB) interface can be performed, multiply and add operations can be performed, arithmetic operations with branch results can be performed, and so on.
[0076] As an example, the example register renaming, out-of-order issue/execution architecture core of Figure 8(B) can implement pipeline 800 as follows: 1) instruction fetch circuit 838 performs the fetch and length decoding stages 802 and 804; 2) decoding circuit 840 performs the decoding stage 806; 3) rename/allocator unit circuit 852 performs the allocation stage 808 and the rename stage 810; 4) the (one or more) scheduler circuits 856 perform the scheduling stage 812; 5) the (one or more) physical register file circuits 858 and memory unit circuit 870 perform the register read/memory read stage 814, and the (one or more) execution clusters 860 perform the execution stage 816; 6) memory unit circuit 870 and the (one or more) physical register file circuits 858 perform the write-back/memory write stage 818; 7) various circuits may be involved in the exception handling stage 822; and 8) retirement unit circuit 854 and the (one or more) physical register file circuits 858 perform the commit stage 824.
[0077] Figure 8(B) shows a processor core 890 that includes front-end unit circuitry 830 coupled to execution engine unit circuitry 850, both of which are coupled to memory unit circuitry 870. Core 890 can be a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type. Alternatively, core 890 can be a dedicated core, such as a network or communication core, a compression engine, a coprocessor core, a general-purpose computing graphics processing unit (GPGPU) core, a graphics core, etc.
[0078] Front-end unit circuitry 830 may include branch prediction circuitry 832 coupled to instruction cache circuitry 834, which is coupled to an instruction translation lookaside buffer (TLB) 836, which is coupled to instruction fetch circuitry 838, which is coupled to decoding circuitry 840. In one example, instruction cache circuitry 834 is included in memory unit circuitry 870 instead of front-end circuitry 830. Decoding circuitry 840 (or a decoder) may decode instructions and generate as outputs one or more micro-operations, microcode entry points, microinstructions, other instructions, or other control signals, which are decoded from, or otherwise reflect or are derived from, the original instructions. Decoding circuitry 840 may also include address generation unit (AGU, not shown) circuitry. In one example, the AGU uses forwarded register ports to generate LSU addresses and may further perform branch forwarding (e.g., immediate offset branch forwarding, LR register branch forwarding, etc.). Various mechanisms can be used to implement the decoding circuitry 840. Examples of suitable mechanisms include, but are not limited to, lookup tables, hardware implementations, programmable logic arrays (PLAs), microcode read-only memories (ROMs), and so on. In one example, core 890 includes a microcode ROM (not shown) or other medium that stores microcode for certain macro-instructions (e.g., in the decoding circuitry 840 or otherwise within the front-end circuitry 830). In one example, the decoding circuitry 840 includes a micro-operation (micro-op) or operation cache (not shown) to hold/cache decoded operations, micro-tags, or micro-operations generated during decoding or other stages of the processor pipeline 800. The decoding circuitry 840 may be coupled to the rename/allocator unit circuitry 852 in the execution engine circuitry 850.
[0079] The execution engine circuitry 850 includes rename/allocator unit circuitry 852, which is coupled to retirement unit circuitry 854 and a set of one or more scheduler circuits 856. The scheduler circuits 856 represent any number of different schedulers, including reservation stations, a central instruction window, etc. In some examples, the scheduler circuits 856 may include an arithmetic logic unit (ALU) scheduler/scheduling circuit, an ALU queue, an address generation unit (AGU) scheduler/scheduling circuit, an AGU queue, etc. The scheduler circuits 856 are coupled to one or more physical register file circuits 858. Each of the physical register file circuits 858 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating-point, packed integer, packed floating-point, vector integer, vector floating-point, status (e.g., an instruction pointer, which is the address of the next instruction to be executed), etc. In one example, the one or more physical register file circuits 858 include vector register unit circuitry, write mask register unit circuitry, and scalar register unit circuitry. These register units can provide architectural vector registers, vector mask registers, general-purpose registers, etc. The one or more physical register file circuits 858 are coupled to the retirement unit circuitry 854 (also called a retirement queue) to illustrate the various ways in which register renaming and out-of-order execution may be implemented (e.g., using one or more reorder buffers (ROBs) and one or more retirement register files; using one or more future files, one or more history buffers, and one or more retirement register files; using register maps and a pool of registers; etc.). The retirement unit circuitry 854 and the one or more physical register file circuits 858 are coupled to one or more execution clusters 860.
The one or more execution clusters 860 include a set of one or more execution unit circuits 862 and a set of one or more memory access circuits 864. The execution unit circuits 862 can perform various arithmetic, logical, floating-point, or other types of operations (e.g., shift, addition, subtraction, multiplication) on various types of data (e.g., scalar integer, scalar floating-point, packed integer, packed floating-point, vector integer, vector floating-point). While some examples may include a number of execution units or execution unit circuits dedicated to specific functions or sets of functions, other examples may include only one execution unit circuit or multiple execution units/execution unit circuits that all perform all functions. The one or more scheduler circuits 856, the one or more physical register file circuits 858, and the one or more execution clusters 860 are shown as being possibly plural because some examples create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating-point/packed integer/packed floating-point/vector integer/vector floating-point pipeline, and/or a memory access pipeline, each having its own scheduler circuitry, one or more physical register file circuits, and/or execution cluster; in the case of a separate memory access pipeline, some examples are implemented in which only the execution cluster of that pipeline has the one or more memory access circuits 864). It should also be understood that, where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution pipelines while the rest are in-order.
[0080] In some examples, the execution engine unit circuitry 850 can perform load store unit (LSU) address/data pipelining to an Advanced Microcontroller Bus (AMB) interface (not shown), as well as address phase and write-back, and data phase load, store, and branch operations.
[0081] The set of memory access circuitry 864 is coupled to memory unit circuitry 870, which includes data TLB circuitry 872 coupled to data cache circuitry 874, which in turn is coupled to Level 2 (L2) cache circuitry 876. In one example, memory access circuitry 864 may include load unit circuitry, store address unit circuitry, and store data unit circuitry, each of which is coupled to the data TLB circuitry 872 in memory unit circuitry 870. Instruction cache circuitry 834 is further coupled to the Level 2 (L2) cache circuitry 876 in memory unit circuitry 870. In one example, instruction cache 834 and data cache 874 are combined into a single instruction and data cache (not shown) in the L2 cache circuitry 876, Level 3 (L3) cache circuitry (not shown), and/or main memory. L2 cache circuitry 876 is coupled to one or more other levels of cache and ultimately to main memory.
[0082] Core 890 may support one or more instruction sets (e.g., the x86 instruction set architecture (optionally with some extensions added in later versions); the MIPS instruction set architecture; the ARM instruction set architecture (optionally with additional extensions, such as NEON)) that include the instruction(s) described herein. In one example, Core 890 includes logic supporting a packed data instruction set architecture extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.
[0083] Example (one or more) execution unit circuits.
[0084] Figure 9 illustrates examples of execution unit circuits, such as the execution unit circuits 862 in Figure 8(B). As shown, execution unit circuits 862 may include one or more ALU circuits 901, optional vector / single instruction multiple data (SIMD) circuits 903, load / store circuits 905, branch / jump circuits 907, and / or floating-point unit (FPU) circuits 909. ALU circuits 901 perform integer arithmetic and / or Boolean operations. Vector / SIMD circuits 903 perform vector / SIMD operations on packed data (e.g., in SIMD / vector registers). Load / store circuits 905 execute load and store instructions to load data from memory into registers or store data from registers into memory. Load / store circuits 905 may also generate addresses. Branch / jump circuits 907 cause a branch or jump to a memory address depending on the instruction. FPU circuits 909 perform floating-point arithmetic. The width of the execution unit circuit(s) 862 varies depending on the example and can range, for example, from 16 bits to 1024 bits. In some examples, two or more smaller execution units are logically combined to form a larger execution unit (e.g., two 128-bit execution units are logically combined to form a 256-bit execution unit).
[0085] Example register architecture.
[0086] Figure 10 is a block diagram of register architecture 1000 according to some examples. As shown, register architecture 1000 includes vector / SIMD registers 1010, with widths ranging from 128 bits to 1024 bits. In some examples, the vector / SIMD registers 1010 are physically 512 bits wide and, depending on the mapping, only some of the low-order bits are used. For example, in some examples, the vector / SIMD registers 1010 are 512-bit ZMM registers: the lower 256 bits are used for the YMM registers, and the lower 128 bits are used for the XMM registers. The registers are therefore overlaid on one another. In some examples, a vector length field selects between a maximum length and one or more shorter lengths, where each shorter length is half the preceding length. Scalar operations are performed on the lowest-order data element position in the ZMM / YMM / XMM registers; higher-order data element positions are either preserved as they were before the instruction or zeroed out, depending on the example.
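The register overlay can be pictured with a minimal Python sketch. The class name VectorRegister, its methods, and the zero_upper flag are hypothetical illustrations and not part of the disclosure; the sketch only assumes that narrower views alias the low-order bits of one 512-bit value, and that a narrow write either zeroes or preserves the higher-order bits.

```python
class VectorRegister:
    """Hypothetical 512-bit register whose narrower views alias its low bits."""

    WIDTH = 512

    def __init__(self):
        self.value = 0

    def read(self, bits):
        # A 128-bit or 256-bit view is simply the low-order bits of the register.
        return self.value & ((1 << bits) - 1)

    def write(self, bits, data, zero_upper=True):
        # Write the low `bits` bits; the upper bits are zeroed or preserved,
        # mirroring the two behaviours the text allows.
        mask = (1 << bits) - 1
        data &= mask
        if zero_upper:
            self.value = data
        else:
            full = (1 << self.WIDTH) - 1
            self.value = (self.value & (full & ~mask)) | data


reg = VectorRegister()
reg.write(512, (1 << 511) | (0xAB << 200) | 0x1234)
low_128 = reg.read(128)  # XMM-style view: only the low 128 bits
low_256 = reg.read(256)  # YMM-style view: the low 256 bits
```

Because all views share one underlying value, a write through the 128-bit view is visible through the wider views as well.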
[0087] In some examples, the register architecture 1000 includes a write mask / predicate register 1015. For example, in some examples, there are eight write mask / predicate registers (sometimes referred to as k0 to k7), each with a size of 16 bits, 32 bits, 64 bits, or 128 bits. The write mask / predicate register 1015 may allow merging (e.g., allowing any set of elements in the destination to be protected from updates during the execution of any operation) and / or zeroing (e.g., a zeroing vector mask allows any set of elements in the destination to be zeroed during the execution of any operation). In some examples, each data element position in a given write mask / predicate register 1015 corresponds to a data element position in the destination. In other examples, the write mask / predicate register 1015 is scalable and consists of a set number of enable bits for a given vector element (e.g., 8 enable bits for each 64-bit vector element).
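The difference between merging and zeroing can be sketched in Python as follows. The function masked_add and its argument names are hypothetical; the sketch assumes one mask bit per destination element, with bit i controlling element i.

```python
def masked_add(dest, a, b, mask, zeroing=False):
    """Element-wise a + b under a write mask (bit i controls element i)."""
    result = []
    for i, old in enumerate(dest):
        if (mask >> i) & 1:
            result.append(a[i] + b[i])  # mask bit set: element is updated
        elif zeroing:
            result.append(0)            # zeroing: masked-off element becomes 0
        else:
            result.append(old)          # merging: masked-off element keeps old value
    return result


dest = [9, 9, 9, 9]
merged = masked_add(dest, [1, 2, 3, 4], [10, 20, 30, 40], 0b0101)
zeroed = masked_add(dest, [1, 2, 3, 4], [10, 20, 30, 40], 0b0101, zeroing=True)
```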
[0088] The register architecture 1000 includes multiple general-purpose registers 1025. These registers can be 16-bit, 32-bit, 64-bit, etc., and can be used for scalar operations. In some examples, these registers are referred to by the names RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, and R8 through R15.
[0089] In some examples, register architecture 1000 includes a scalar floating-point (FP) register file 1045, which is used for scalar floating-point operations on 32 / 64 / 80-bit floating-point data using the x87 instruction set architecture extension, or as MMX registers to perform operations on 64-bit packed integer data and to hold operands for some operations performed between the MMX and XMM registers.
[0090] One or more flag registers 1040 (e.g., EFLAGS, RFLAGS, etc.) store status and control information used for arithmetic, comparison, and system operations. For example, one or more flag registers 1040 may store condition code information such as carry, parity, auxiliary carry, zero, sign, and overflow. In some examples, one or more flag registers 1040 are referred to as program status and control registers.
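As an illustration of the condition codes listed above, the following Python sketch computes them for an 8-bit addition. The function add8_flags and the flag abbreviations (CF, PF, AF, ZF, SF, OF) are assumptions borrowed from common x86 usage; the text itself only names the conditions.

```python
def add8_flags(a, b):
    """Add two 8-bit values and derive the condition codes for the result."""
    a &= 0xFF
    b &= 0xFF
    total = a + b
    result = total & 0xFF
    flags = {
        "CF": total > 0xFF,                            # carry out of bit 7
        "PF": bin(result).count("1") % 2 == 0,         # even number of 1 bits
        "AF": ((a & 0xF) + (b & 0xF)) > 0xF,           # carry out of bit 3
        "ZF": result == 0,                             # result is zero
        "SF": bool(result & 0x80),                     # sign bit of the result
        "OF": bool(~(a ^ b) & (a ^ result) & 0x80),    # signed overflow
    }
    return result, flags


result, flags = add8_flags(0x7F, 0x01)
```

Overflow is detected when both inputs have the same sign and the result has a different one, which is what the final expression checks.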
[0091] Segment register 1020 contains segment points used to access memory. In some examples, these registers are referred to by the names CS, DS, SS, ES, FS, and GS.
[0092] Machine-specific registers (MSRs) 1035 control and report processor performance. Most MSRs 1035 handle system-related functions and are not accessible to applications. Machine check registers 1060 consist of control, status, and error reporting MSRs used for detecting and reporting hardware errors.
[0093] One or more instruction pointer registers 1030 store instruction pointer values. One or more control registers 1055 (e.g., CR0-CR4) determine the operating mode of the processor (e.g., processors 670, 680, 638, 615, and / or 700) and the characteristics of the currently executing task. Debug register 1050 controls and allows monitoring of debug operations on the processor or core.
[0094] Memory (mem) management registers 1065 specify the locations of data structures used in protected-mode memory management. These registers may include the global descriptor table register (GDTR), the interrupt descriptor table register (IDTR), the task register, and the local descriptor table register (LDTR).
[0095] Alternative examples may use wider or narrower registers. Furthermore, alternative examples may use more, fewer, or different register files and registers. The register architecture 1000 may be used, for example, in a register file / memory or (one or more) physical register file circuits 858.
[0096] Instruction set architecture.
[0097] An instruction set architecture (ISA) may include one or more instruction formats. A given instruction format may define various fields (e.g., number of bits, bit positions) to specify, among other things, the operation to be performed (e.g., opcode), the operand(s) on which that operation is to be performed, and / or other data fields (e.g., mask). Some instruction formats are further broken down through the definition of instruction templates (or subformats). For example, the instruction templates of a given instruction format may be defined to have different subsets of that format's fields (the included fields are usually in the same order, but at least some have different bit positions because fewer fields are included) and / or defined to have a given field interpreted differently. Thus, each instruction of an ISA is expressed using a given instruction format (and, if defined, a given one of the instruction templates of that format) and includes fields for specifying the operation and the operands. For example, an ADD instruction has a specific opcode and an instruction format that includes an opcode field to specify that opcode and operand fields to select the operands (source 1 / destination and source 2); an occurrence of this ADD instruction in an instruction stream will have specific contents in the operand fields that select specific operands. Furthermore, although the following description is made in the context of an x86 ISA, applying the teachings of this disclosure to another ISA is within the knowledge of those skilled in the art.
[0098] Example instruction format.
[0099] Examples of the instruction(s) described herein can be implemented in different formats. Furthermore, example systems, architectures, and pipelines are detailed below. The examples of the instruction(s) can be executed on these systems, architectures, and pipelines, but are not limited to those detailed herein.
[0100] Figure 11 illustrates an example of an instruction format. As shown, an instruction may include multiple components, including but not limited to one or more fields for the following: one or more prefixes 1101, opcode 1103, addressing information 1105 (e.g., register identifier, memory addressing information, etc.), offset value 1107, and / or immediate value 1109. Note that some instructions use some or all of the fields of this format, while others may use only the opcode field 1103. In some examples, the order shown is the order in which these fields are encoded; however, it should be understood that in other examples, these fields may be encoded in a different order, combined, etc.
[0101] The one or more prefix fields 1101, when used, modify an instruction. In some examples, one or more prefixes are used to repeat string instructions (e.g., 0xF0, 0xF2, 0xF3, etc.), to provide segment overrides (e.g., 0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65, 0x2E, 0x3E, etc.), to perform bus lock operations, and / or to change the operand size (e.g., 0x66) and address size (e.g., 0x67). Some instructions require a mandatory prefix (e.g., 0x66, 0xF2, 0xF3, etc.). Some of these prefixes can be considered "legacy" prefixes. Other prefixes (one or more examples of which are detailed herein) indicate and / or provide further capabilities, such as specifying particular registers, etc. These other prefixes typically follow the "legacy" prefixes.
[0102] Opcode field 1103 is used to at least partially define the operation to be performed upon decoding of the instruction. In some examples, the primary opcode encoded in opcode field 1103 is 1, 2, or 3 bytes long. In other examples, the primary opcode can be of other lengths. An additional 3-bit opcode field is sometimes encoded in another field.
[0103] Addressing information field 1105 is used to address one or more operands of an instruction, such as a location in memory or one or more registers. Figure 12 illustrates an example of addressing information field 1105. It shows the optional MOD R / M byte 1202 and the optional Scale, Index, Base (SIB) byte 1204. The MOD R / M byte 1202 and SIB byte 1204 are used to encode up to two operands of an instruction, each operand being either a direct register or an effective memory address. Note that both fields are optional; that is, not all instructions include one or more of these fields. The MOD R / M byte 1202 includes the MOD field 1242, the register (reg) field 1244, and the R / M field 1246.
[0104] The contents of MOD field 1242 distinguish between memory access and non-memory access modes. In some examples, when MOD field 1242 has a binary value of 11 (11b), register direct addressing mode is used; otherwise, register indirect addressing mode is used.
[0105] Register field 1244 can encode either the destination register operand or a source register operand, or it can encode an opcode extension and not be used to encode any instruction operand. The contents of register field 1244, directly or through address generation, specify the location of a source or destination operand (in a register or in memory). In some examples, register field 1244 is supplemented with an extra bit from a prefix (e.g., prefix 1101) to allow for larger addressing.
[0106] R / M field 1246 can be used to encode instruction operands that reference memory addresses, or it can be used to encode destination register operands or source register operands. Note that in some examples, R / M field 1246 can be combined with MOD field 1242 to specify the addressing mode.
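The three fields of the MOD R / M byte can be extracted with simple shifts and masks, as the Python sketch below shows. The bit positions used (MOD in bits 7:6, reg in bits 5:3, R / M in bits 2:0) and the rule that an R / M value of 100b outside register-direct mode signals a following SIB byte are the conventional x86 layout and are assumptions here, since the text does not state them. The function name decode_modrm is hypothetical.

```python
def decode_modrm(byte):
    """Split a MOD R/M byte into its MOD, reg, and R/M fields."""
    mod = (byte >> 6) & 0b11   # assumed position: bits 7:6
    reg = (byte >> 3) & 0b111  # assumed position: bits 5:3
    rm = byte & 0b111          # assumed position: bits 2:0
    return {
        "mod": mod,
        "reg": reg,
        "rm": rm,
        # MOD == 11b selects register-direct addressing, as described above.
        "register_direct": mod == 0b11,
        # Conventional x86 rule (assumption): R/M == 100b in a memory mode
        # means an SIB byte follows.
        "has_sib": mod != 0b11 and rm == 0b100,
    }


fields = decode_modrm(0xC3)  # 11 000 011
```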
[0107] SIB byte 1204 includes a scaling field 1252, an index field 1254, and a base field 1256 for address generation. The scaling field 1252 indicates the scaling factor. The index field 1254 specifies the index register to be used. In some examples, the index field 1254 is supplemented with an extra bit from a prefix (e.g., prefix 1101) to allow for larger addressing. The base field 1256 specifies the base address register to be used. In some examples, the base field 1256 is supplemented with an extra bit from a prefix (e.g., prefix 1101) to allow for larger addressing. In practice, the contents of the scaling field 1252 allow the contents of the index field 1254 to be scaled for memory address generation (e.g., for address generation that uses 2^scale * index + base).
[0108] Some addressing forms use a displacement (offset) value to generate a memory address. For example, a memory address may be generated according to 2^scale * index + base + displacement, index * scale + displacement, r / m + displacement, instruction pointer (RIP / EIP) + displacement, register + displacement, etc. The displacement can be a value of 1 byte, 2 bytes, 4 bytes, etc. In some examples, the offset field 1107 provides this value. Furthermore, in some examples, the use of a displacement factor is encoded in the MOD field of the addressing information field 1105, which indicates a compressed displacement scheme for which the displacement value is calculated and stored in the offset field 1107.
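The scaled-index address forms described above can be worked through in a short Python sketch. The SIB bit positions (scale in bits 7:6, index in bits 5:3, base in bits 2:0) follow the conventional x86 layout and are an assumption, as are the function names and the flat list used as a register file. The x86 special cases (for example, encodings that mean "no index" or "no base") are deliberately ignored.

```python
def decode_sib(byte):
    """Return the (scale, index, base) fields of an SIB byte (assumed layout)."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


def effective_address(regs, sib, displacement=0, width=64):
    """Compute 2^scale * index + base + displacement, wrapped to `width` bits."""
    scale, index, base = decode_sib(sib)
    address = (regs[index] << scale) + regs[base] + displacement
    return address & ((1 << width) - 1)


regs = [0] * 8            # hypothetical register file indexed by 3-bit field values
regs[1] = 0x10            # index register
regs[3] = 0x1000          # base register
sib = 0b10_001_011        # scale field 2 (factor 4), index 1, base 3
address = effective_address(regs, sib, displacement=8)
```

Shifting the index left by the scale field is the same as multiplying it by 2^scale, so a 2-bit field yields factors of 1, 2, 4, or 8.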
[0109] In some examples, the immediate value field 1109 specifies an immediate value for the instruction. Immediate values can be encoded as 1-byte values, 2-byte values, 4-byte values, and so on.
[0110] Figure 13 illustrates an example of the first prefix 1101(A). In some examples, the first prefix 1101(A) is an example of the REX prefix. Instructions using this prefix can specify general-purpose registers, 64-bit packed data registers (e.g., single-instruction multiple-data (SIMD) registers or vector registers), and / or control registers and debug registers (e.g., CR8-CR15 and DR8-DR15).
[0111] Instructions using the first prefix 1101(A) can specify up to three registers using 3-bit fields, depending on the format: 1) using reg field 1244 and R / M field 1246 of MOD R / M byte 1202; 2) using MOD R / M byte 1202 with SIB byte 1204, including using reg field 1244 as well as base field 1256 and index field 1254; or 3) using the register field of the opcode.
[0112] In the first prefix 1101(A), bits 7:4 are set to 0100. Bit 3 (W) can be used to determine the operand size, but cannot determine the operand width alone. Therefore, when W = 0, the operand size is determined by the code segment descriptor (CS.D), while when W = 1, the operand size is 64 bits.
[0113] Note that adding another bit allows 16 (2^4) registers to be addressed, whereas the MOD R / M reg field 1244 and the MOD R / M R / M field 1246 alone can each address only 8 registers.
[0114] In the first prefix 1101(A), bit position 2 (R) can be an extension of the reg field 1244 of the MOD R / M byte, and can be used to modify that reg field 1244 when it encodes a general-purpose register, a 64-bit packed data register (e.g., an SSE register), or a control or debug register. R is ignored when MOD R / M byte 1202 specifies other registers or defines an extended opcode.
[0115] Bit position 1 (X) can modify SIB byte index field 1254.
[0116] Setting bit position 0 (B) can modify the base address in the R / M field 1246 or the SIB byte base address field 1256 of MOD R / M; or it can modify the opcode register field used to access general-purpose registers (e.g., general-purpose register 1025).
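The bit assignments of the first prefix given above (bits 7:4 equal to 0100, then W, R, X, and B in bits 3 through 0) translate directly into the following Python sketch. The function names are hypothetical, and the MOD R / M field positions (reg in bits 5:3, R / M in bits 2:0) are the conventional x86 layout, assumed here. The sketch shows how the extra prefix bit widens a 3-bit register field to 4 bits, so that 16 registers become addressable.

```python
def decode_rex(byte):
    """Return the W, R, X, B bits of the first prefix, or None if not a match."""
    if (byte >> 4) != 0b0100:
        return None  # bits 7:4 must be 0100
    return {
        "W": (byte >> 3) & 1,
        "R": (byte >> 2) & 1,
        "X": (byte >> 1) & 1,
        "B": byte & 1,
    }


def extended_registers(rex, modrm):
    """Combine R and B with the 3-bit reg and R/M fields into 4-bit numbers."""
    reg = (modrm >> 3) & 0b111  # assumed position of the reg field
    rm = modrm & 0b111          # assumed position of the R/M field
    return (rex["R"] << 3) | reg, (rex["B"] << 3) | rm


rex = decode_rex(0x4D)                      # 0100 1101: W=1, R=1, X=0, B=1
reg_num, rm_num = extended_registers(rex, 0xC3)
```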
[0117] Figures 14(A)-14(D) illustrate examples of how the R, X, and B fields of the first prefix 1101(A) are used. Figure 14(A) illustrates how the R and B from the first prefix 1101(A) are used to extend the reg field 1244 and R / M field 1246 of the MOD R / M byte 1202 when SIB byte 1204 is not used for memory addressing. Figure 14(B) illustrates how the R and B from the first prefix 1101(A) are used to extend the reg field 1244 and R / M field 1246 of the MOD R / M byte 1202 when SIB byte 1204 is not used (register-to-register addressing). Figure 14(C) illustrates how the R, X, and B from the first prefix 1101(A) are used to extend the reg field 1244 of the MOD R / M byte 1202 and the index field 1254 and base address field 1256 when SIB byte 1204 is used for memory addressing. Figure 14(D) illustrates that when the register is encoded in opcode 1103, the B from the first prefix 1101(A) is used to extend the reg field 1244 of the MOD R / M byte 1202.
[0118] Figures 15(A)-15(B) illustrate examples of the second prefix 1101(B). In some examples, the second prefix 1101(B) is an example of a VEX prefix. The second prefix 1101(B) encoding allows instructions to have more than two operands and allows SIMD vector registers (e.g., vector / SIMD registers 1010) to be longer than 64 bits (e.g., 128 bits and 256 bits). The use of the second prefix 1101(B) provides a three-operand (or more) syntax. For example, previous two-operand instructions performed operations such as A = A + B, which overwrites a source operand. The use of the second prefix 1101(B) allows non-destructive operations such as A = B + C.
[0119] In some examples, the second prefix 1101(B) has two forms—two-byte and three-byte. The two-byte second prefix 1101(B) is mainly used for 128-bit, scalar, and some 256-bit instructions; while the three-byte second prefix 1101(B) provides a compact replacement for 3-byte opcode instructions and the first prefix 1101(A).
[0120] Figure 15(A) illustrates an example of the two-byte form of the second prefix 1101(B). In one example, format field 1501 (byte 0 1503) contains the value C5H. In one example, byte 1 1505 includes the value "R" in bit [7]. This value is the complement of the value of "R" in the first prefix 1101(A). Bit [2] is used to specify the length (L) of the vector (where a value of 0 indicates a scalar or a 128-bit vector, and a value of 1 indicates a 256-bit vector). Bits [1:0] provide opcode extensions equivalent to some legacy prefixes (e.g., 00 = no prefix, 01 = 66H, 10 = F3H, and 11 = F2H). Bits [6:3], shown as vvvv, can be used to: 1) encode the first source register operand, which is specified in inverted (ones complement) form and is valid for instructions with two or more source operands; 2) encode the destination register operand, which is specified in ones complement form for certain vector shifts; or 3) not encode any operand, in which case the field is reserved and should contain a particular value, such as 1111b.
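The field layout of the two-byte form can be decoded as in the Python sketch below, which follows the bit positions given above: bit [7] holds the complemented R, bits [6:3] hold vvvv in ones complement form, bit [2] is L, and bits [1:0] select an equivalent prefix. The function name decode_vex2 and the dictionary keys are hypothetical.

```python
# Equivalent prefix selected by bits [1:0], as listed in the text.
PP_TO_PREFIX = {0b00: None, 0b01: 0x66, 0b10: 0xF3, 0b11: 0xF2}


def decode_vex2(byte0, byte1):
    """Decode the two-byte form of the second prefix."""
    if byte0 != 0xC5:
        raise ValueError("byte 0 of the two-byte form must be C5H")
    return {
        "R": ((byte1 >> 7) & 1) ^ 1,        # stored complemented, so invert it
        "vvvv": ~(byte1 >> 3) & 0b1111,     # stored in ones complement form
        "L": (byte1 >> 2) & 1,              # 0: scalar or 128-bit, 1: 256-bit
        "prefix": PP_TO_PREFIX[byte1 & 0b11],
    }


fields = decode_vex2(0xC5, 0x75)  # 0 1110 1 01
```

Note that the reserved pattern 1111b in the vvvv bits decodes to 0 after inversion, so a decoder has to know from the opcode whether the field carries an operand at all.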
[0121] Instructions using this prefix can use the R / M field 1246 of MOD R / M to encode instruction operands that reference memory addresses, or to encode destination register operands or source register operands.
[0122] Instructions using this prefix can use the reg field 1244 of MOD R / M to encode either the destination register operand or the source register operand, or they can be treated as an opcode extension without being used to encode any instruction operand.
[0123] For instruction syntax supporting four operands, vvvv, R / M field 1246 of MOD R / M, and reg field 1244 of MOD R / M encode three of the four operands. Then bits [7:4] of the immediate value field 1109 are used to encode the third source register operand.
[0124] Figure 15(B) illustrates an example of the three-byte form of the second prefix 1101(B). In one example, format field 1511 (byte 0 1513) contains the value C4H. Byte 1 1515 includes "R", "X", and "B" in bits [7:5], which are the complements of these values from the first prefix 1101(A). Bits [4:0] of byte 1 1515 (shown as mmmmm) encode, as needed, one or more implied leading opcode bytes. For example, 00001 implies a 0FH leading opcode, 00010 implies a 0F38H leading opcode, 00011 implies a 0F3AH leading opcode, and so on.
[0125] Bit [7] of byte 2 1517 is used similarly to W in the first prefix 1101(A), including helping to determine promotable operand sizes. Bit [2] is used to specify the length (L) of the vector (where a value of 0 indicates a scalar or a 128-bit vector, and a value of 1 indicates a 256-bit vector). Bits [1:0] provide opcode extensions equivalent to some legacy prefixes (e.g., 00 = no prefix, 01 = 66H, 10 = F3H, and 11 = F2H). Bits [6:3], shown as vvvv, can be used to: 1) encode the first source register operand, which is specified in inverted (ones complement) form and is valid for instructions with two or more source operands; 2) encode the destination register operand, which is specified in ones complement form for certain vector shifts; or 3) not encode any operand, in which case the field is reserved and should contain a particular value, such as 1111b.
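A corresponding Python sketch for the three-byte form is given below. It follows the bit positions stated above for byte 1 and byte 2; the function name decode_vex3, the dictionary keys, and the string labels used for the implied opcode bytes are hypothetical. Only the three mmmmm values listed in the text are mapped, and any other value is reported as None rather than guessed.

```python
# Implied opcode bytes for the mmmmm values listed in the text.
MMMMM_TO_OPCODE = {0b00001: "0F", 0b00010: "0F38", 0b00011: "0F3A"}
PP_TO_PREFIX = {0b00: None, 0b01: 0x66, 0b10: 0xF3, 0b11: 0xF2}


def decode_vex3(byte0, byte1, byte2):
    """Decode the three-byte form of the second prefix."""
    if byte0 != 0xC4:
        raise ValueError("byte 0 of the three-byte form must be C4H")
    return {
        "R": ((byte1 >> 7) & 1) ^ 1,   # R, X, B are stored complemented
        "X": ((byte1 >> 6) & 1) ^ 1,
        "B": ((byte1 >> 5) & 1) ^ 1,
        "opcode_bytes": MMMMM_TO_OPCODE.get(byte1 & 0b11111),
        "W": (byte2 >> 7) & 1,
        "vvvv": ~(byte2 >> 3) & 0b1111,  # stored in ones complement form
        "L": (byte2 >> 2) & 1,
        "prefix": PP_TO_PREFIX[byte2 & 0b11],
    }


fields = decode_vex3(0xC4, 0x42, 0x7D)
```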
[0126] Instructions using this prefix can use the R / M field 1246 of MOD R / M to encode instruction operands that reference memory addresses, or to encode destination register operands or source register operands.
[0127] Instructions using this prefix can use the reg field 1244 of MOD R / M to encode either the destination register operand or the source register operand, or they can be treated as an opcode extension without being used to encode any instruction operand.
[0128] For instruction syntax supporting four operands, vvvv, R / M field 1246 of MOD R / M, and reg field 1244 of MOD R / M encode three of the four operands. Then bits [7:4] of the immediate value field 1109 are used to encode the third source register operand.
[0129] Figure 16 illustrates an example of the third prefix 1101(C). In some examples, the third prefix 1101(C) is an example of the EVEX prefix. The third prefix 1101(C) is a four-byte prefix.
[0130] The third prefix 1101(C) enables the encoding of 32 vector registers (e.g., 128-bit, 256-bit, and 512-bit registers) in 64-bit mode. In some examples, instructions that use a write mask / operation mask (see the discussion of registers in a previous figure, e.g., Figure 10) or predication use this prefix. Operation mask registers allow conditional processing or selection control. Operation mask instructions, whose source / destination operands are operation mask registers and which treat the contents of an operation mask register as a single value, are encoded using the second prefix 1101(B).
[0131] The third prefix 1101(C) can encode instruction class-specific features (e.g., a packed instruction with "load + operation" semantics can support embedded broadcast functionality, a floating-point instruction with rounding semantics can support static rounding functionality, a floating-point instruction with non-rounding arithmetic semantics can support "suppress all exceptions" functionality, etc.).
[0132] The first byte of the third prefix 1101(C) is the format field 1611, which in one example has a value of 62H. The subsequent bytes are referred to as payload bytes 1615-1619, and together they form the 24-bit value of P[23:0], which provides specific capabilities in the form of one or more fields (detailed herein).
[0133] In some examples, P[1:0] of payload byte 1619 is the same as the two lower mm bits. In some examples, P[3:2] is reserved. Bit P[4] (R') allows access to the high 16 vector register set when combined with P[7] and MOD R / M's reg field 1244. P[6] can also provide access to the high 16 vector registers when SIB type addressing is not required. P[7:5] consists of R, X, and B, which are operand specifier modifier bits for vector registers, general-purpose registers, and memory addressing, and which, when combined with MOD R / M's register field 1244 and MOD R / M's R / M field 1246, allow access to the next set of 8 registers beyond the lower 8 registers. P[9:8] provides opcode extensions equivalent to some legacy prefixes (e.g., 00 = no prefix, 01 = 66H, 10 = F3H, and 11 = F2H). P[10] is a fixed value of 1 in some examples. P[14:11], shown as vvvv, can be used to: 1) encode the first source register operand, which is specified in inverted (ones complement) form and is valid for instructions with two or more source operands; 2) encode the destination register operand, which is specified in ones complement form for certain vector shifts; or 3) not encode any operand, in which case the field is reserved and should contain a particular value, such as 1111b.
[0134] P[15] is similar to W in the first prefix 1101(A) and the second prefix 1101(B), and can serve as an opcode extension bit or for operand size promotion.
[0135] P[18:16] specifies the index of a register among the operation mask (write mask) registers (e.g., write mask / predicate registers 1015). In one example, the specific value aaa = 000 has special behavior, implying that no operation mask is used for this particular instruction (this can be implemented in various ways, including using an operation mask hardwired to all ones or hardware that bypasses the masking hardware). When merging, the vector mask allows any set of elements in the destination to be protected from updates during the execution of any operation (specified by the base operation and the augmentation operation); in one example, the old value of each element in the destination is preserved where the corresponding mask bit has a value of 0. In contrast, when zeroing, the vector mask allows any set of elements in the destination to be zeroed during the execution of any operation (specified by the base operation and the augmentation operation); in one example, an element in the destination is set to 0 when the corresponding mask bit has a value of 0. A subset of this functionality is the ability to control the vector length of the operation being performed (i.e., the span of elements being modified, from the first to the last); however, the modified elements do not have to be contiguous. Thus, the operation mask field allows for partial vector operations, including loads, stores, arithmetic operations, logical operations, and so on. While in the described example the contents of the operation mask field select the one of several operation mask registers that contains the operation mask to be used (so that the contents of the operation mask field indirectly identify the masking to be performed), alternative examples instead or additionally allow the contents of the mask write field to directly specify the masking to be performed.
[0136] P[19] can be combined with P[14:11] to encode a second source vector register in a non-destructive source syntax, which allows access to the high 16 vector registers using P[19]. P[20] encodes various functions that differ across different classes of instructions and can affect the meaning of the vector length / rounding control specifier field (P[22:21]). P[23] indicates support for merge-write masking (e.g., when set to 0) or support for zeroing and merge-write masking (e.g., when set to 1).
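The payload fields described in the preceding paragraphs can be pulled out of the 24-bit value P[23:0] with a generic bit-slice helper, as in the Python sketch below. The function takes the assembled 24-bit value directly, so it makes no assumption about how the payload bytes are ordered. The function names and dictionary keys are hypothetical labels chosen for this illustration.

```python
def bit_slice(value, hi, lo):
    """Return bits hi..lo (inclusive) of value."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_payload(p):
    """Split the 24-bit payload P[23:0] into the fields described in the text."""
    return {
        "mm": bit_slice(p, 1, 0),
        "R_prime": bit_slice(p, 4, 4),
        "RXB": bit_slice(p, 7, 5),
        "pp": bit_slice(p, 9, 8),
        "fixed": bit_slice(p, 10, 10),
        "vvvv_stored": bit_slice(p, 14, 11),   # still in ones complement form
        "W": bit_slice(p, 15, 15),
        "aaa": bit_slice(p, 18, 16),           # operation mask register index
        "p19": bit_slice(p, 19, 19),
        "p20": bit_slice(p, 20, 20),
        "length_rounding": bit_slice(p, 22, 21),
        "zeroing": bit_slice(p, 23, 23),
        # aaa == 000 implies that no operation mask is used.
        "uses_mask": bit_slice(p, 18, 16) != 0,
    }


payload = (1 << 23) | (0b10 << 21) | (0b001 << 16) | (0b1111 << 11) | (1 << 10)
fields = decode_payload(payload)
```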
[0137] The tables below detail examples of register encoding in instructions using the third prefix 1101(C).
[0138] Table 1: 32 Registers Supported in 64-bit Mode
Table 2: Encoding Register Specifiers in 32-bit Mode
Table 3: Operation Mask Register Specifier Encoding
Program code can be applied to input information to perform the functions described herein and generate output information. The output information can be applied to one or more output devices in a known manner. For the purposes of this application, a processing system includes any system having a processor, such as a digital signal processor (DSP), microcontroller, application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), microprocessor, or any combination thereof.
[0139] The program code can be implemented in a procedural or object-oriented high-level programming language to communicate with the processing system. Assembly or machine language can also be used if desired. In fact, the mechanisms described herein are not limited to any particular programming language. In any case, the language can be a compiled language or an interpreted language.
[0140] Examples of the mechanisms disclosed herein can be implemented in hardware, software, firmware, or a combination of these approaches. The examples can be implemented as computer programs or program code, executing on a programmable system comprising at least one processor, a storage system (including volatile and non-volatile memory and / or storage elements), at least one input device, and at least one output device.
[0141] One or more aspects of at least one example can be implemented by representative instructions stored on a machine-readable medium, representing various logic within a processor, which, when read by a machine, cause the machine to produce logic for performing the technology described herein. These representations, referred to as “intellectual property (IP) cores,” can be stored on a tangible machine-readable medium and provided to various customers or manufacturing facilities for loading into the manufacturing machine that actually produces the logic or processor.
[0142] These machine-readable storage media may include, but are not limited to, non-transitory tangible arrangements of articles made or formed by machines or equipment, including storage media such as: hard disks, any other type of disk (including floppy disks, optical disks, compact disk read-only memory (CD-ROM), compact disk rewritable (CD-RW), and magneto-optical disks), semiconductor devices (e.g., read-only memory (ROM), random access memory (RAM) such as dynamic random access memory (DRAM) and static random access memory (SRAM), erasable programmable read-only memory (EPROM), flash memory, electrically erasable programmable read-only memory (EEPROM), phase change memory (PCM)), magnetic cards or optical cards, or any other type of media suitable for storing electronic instructions.
[0143] Therefore, examples also include non-transitory tangible machine-readable media containing instructions or design data that defines the features of the structures, circuits, devices, processors, and / or systems described herein, such as Hardware Description Language (HDL). Such examples may also be referred to as program products.
[0144] Emulation (including binary translation, code morphing, etc.).
[0145] In some cases, an instruction converter can be used to convert instructions from a source instruction set architecture to a target instruction set architecture. For example, an instruction converter can translate (e.g., using static binary translation or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. Instruction converters can be implemented in software, hardware, firmware, or a combination thereof. Instruction converters can be on-processor, off-processor, or partially on-processor and partially off-processor.
[0146] Figure 17This is a block diagram illustrating the use of a software instruction converter to convert binary instructions in a source ISA into binary instructions in a target ISA, according to an example. In the illustrated example, the instruction converter is a software instruction converter, but alternatively, the instruction converter can be implemented using software, firmware, hardware, or various combinations thereof. Figure 17 A program in high-level language 1702 is shown to be compiled using a first ISA compiler 1704 to generate first ISA binary code 1706, which can be natively executed by a processor 1716 having at least one first ISA core. A processor 1716 having at least one first ISA core represents any processor capable of performing substantially the same function as an Intel processor having at least one first ISA core by compatiblely executing or otherwise processing (1) a substantial portion of the first ISA or (2) a version of object code for an application or other software targeted to run on an Intel® processor having at least one first ISA core, in order to achieve substantially the same results as a processor having at least one first ISA core. The first ISA compiler 1704 represents a compiler operable to generate first ISA binary code 1706 (e.g., object code) that can be executed on a processor 1716 having at least one first ISA core, with or without additional linking processing. Similarly, Figure 17 A program in high-level language 1702 is shown to be compiled using an alternative ISA compiler 1708 to generate alternative ISA binary code 1710, which can be natively executed by a processor 1714 without a first ISA core. An instruction converter 1712 is used to convert the first ISA binary code 1706 into code that can be natively executed by a processor 1714 without a first ISA core. 
This converted code is not necessarily the same as the alternative ISA binary code 1710; however, the converted code will implement the overall operation and consist of instructions from the alternative ISA. Thus, the instruction converter 1712 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, allows a processor or other electronic device without a first ISA processor or core to execute the first ISA binary code 1706.
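As a purely illustrative sketch of the conversion idea described above, the following Python snippet translates a program written for a made-up source ISA into instructions of a made-up target ISA and then interprets the result. The opcodes (`INC`, `ADD`, `SWAP`, `ADDI`, `XOR`), the tuple encoding, and the function names `convert` and `run_target` are all assumptions introduced here; they do not correspond to any real instruction set or to the converter 1712 itself.

```python
# Hypothetical two-operand source ISA and three-operand target ISA, both
# invented for illustration only.
def convert(source_program):
    """Translate each source instruction into one or more target instructions."""
    target = []
    for opcode, *operands in source_program:
        if opcode == "INC":                     # INC r     -> ADDI r, r, 1
            (reg,) = operands
            target.append(("ADDI", reg, reg, 1))
        elif opcode == "ADD":                   # ADD d, s  -> ADD d, d, s
            dst, src = operands
            target.append(("ADD", dst, dst, src))
        elif opcode == "SWAP":                  # SWAP a, b -> three XORs (a != b)
            a, b = operands
            target += [("XOR", a, a, b), ("XOR", b, a, b), ("XOR", a, a, b)]
        else:
            raise ValueError(f"unsupported source opcode: {opcode}")
    return target


def run_target(program, regs):
    """Tiny interpreter for the hypothetical target ISA."""
    regs = dict(regs)
    for opcode, dst, a, b in program:
        if opcode == "ADDI":
            regs[dst] = regs[a] + b             # b is an immediate
        elif opcode == "ADD":
            regs[dst] = regs[a] + regs[b]
        elif opcode == "XOR":
            regs[dst] = regs[a] ^ regs[b]
    return regs


converted = convert([("INC", "r0"), ("ADD", "r1", "r0"), ("SWAP", "r0", "r1")])
final = run_target(converted, {"r0": 1, "r1": 5})
```

The converted sequence is longer than the source program and need not match what a native compiler for the target would emit, yet it carries out the same overall operation using only target instructions.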
[0147] The components, features, and details described for any processor or other device disclosed herein may be optionally applied to any method disclosed herein, which, in embodiments, may optionally be performed by and / or utilized by such a processor. In embodiments, any processor described herein may optionally be included in any system disclosed herein. Any processor disclosed herein may optionally have any of the microarchitectures shown herein.
[0148] References to "an example," "example," etc., indicate that the described example may include a particular feature, structure, or characteristic, but not every example may necessarily include that particular feature, structure, or characteristic. Furthermore, such phrases do not necessarily refer to the same example. Moreover, when a particular feature, structure, or characteristic is described in connection with an example, it is submitted that it is within the knowledge of those skilled in the art to effect such feature, structure, or characteristic in connection with other examples, whether or not explicitly described.
[0149] The processor components disclosed herein may be described and/or claimed as being operative, operable, capable, able, configured, adapted, or otherwise to perform an operation. For example, a decoder may be described and/or claimed as being to decode an instruction, an execution unit may be described and/or claimed as being to store a result, and so on. As used herein, these expressions refer to the characteristics, properties, or attributes of the components when in a powered-off state, and do not imply that these components, or the devices or apparatuses in which they are included, are currently powered on or operating. For clarity, it should be understood that the processors and devices claimed herein are not required to be powered on or operating.
[0150] The terms “coupled” and/or “connected” and their derivatives may be used in the specification and claims. These terms are not intended as synonyms for each other. Rather, in various embodiments, “connected” can be used to indicate that two or more elements are in direct physical and/or electrical contact with each other. “Coupled” can mean that two or more elements are in direct physical and/or electrical contact with each other. However, “coupled” can also mean that two or more elements are not in direct contact with each other, but still cooperate or interact with each other. For example, an execution unit may be coupled to a register and/or decoding unit via one or more intermediate components. In the drawings, arrows are used to illustrate connection and coupling.
[0151] Some embodiments include an article of manufacture (e.g., a computer program product) that includes a machine-readable medium. The medium may include mechanisms for providing (e.g., storing) information in a machine-readable form. The machine-readable medium may provide or store instructions or sequences of instructions that, if executed by a machine, are operable to cause the machine to perform and / or induce the machine to perform one or more operations, methods, or techniques disclosed herein.
[0152] In some embodiments, a machine-readable medium may include tangible and / or non-transitory machine-readable storage media. For example, a non-transitory machine-readable storage medium may include floppy disks, optical storage media, optical disks, optical data storage devices, CD-ROMs, magnetic disks, magneto-optical disks, read-only memory (ROM), programmable ROM (PROM), erasable-and-programmable ROM (EPROM), electrically erasable-and-programmable ROM (EEPROM), random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), flash memory, phase-change memory, phase-change data storage materials, non-volatile memory, non-volatile data storage devices, non-transitory memory, or non-transitory data storage devices, etc. A non-transitory machine-readable storage medium does not consist of transient propagating signals. In some embodiments, the storage medium may include tangible media, which include solid substances or materials such as, for example, semiconductor materials, phase-change materials, magnetic solid materials, solid data storage materials, etc. Alternatively, non-tangible, transient, computer-readable transmission media may be used, such as, for example, electrical, optical, acoustic, or other forms of propagation signals—such as carrier waves, infrared signals, and digital signals.
[0153] Examples of suitable machines include, but are not limited to, general-purpose processors, special-purpose processors, digital logic circuits, integrated circuits, etc. Other examples of suitable machines include computer systems or other electronic devices that contain processors, digital logic circuits, or integrated circuits. Examples of such computer systems or electronic devices include, but are not limited to, desktop computers, laptop computers, notebook computers, tablet computers, netbooks, smartphones, cellular phones, servers, network equipment (e.g., routers and switches), mobile internet devices (MIDs), media players, smart TVs, internet access devices, set-top boxes, and video game controllers.
[0154] Moreover, in the examples described above, unless otherwise specifically stated, disjunctive language such as the phrase “at least one of A, B or C” or “A, B and/or C” is intended to be understood as referring to A, B, or C, or any combination thereof (i.e., A and B, A and C, B and C, and A, B and C).
[0155] In the foregoing description, specific details have been set forth to provide a thorough understanding of the embodiments. However, other embodiments may be practiced without some of these specific details. Various modifications and changes may be made to this disclosure without departing from the broader spirit and scope of the disclosure as set forth in the claims. Therefore, the specification and drawings should be considered illustrative rather than restrictive. The scope of the invention is not intended to be determined by the specific examples provided above, but only by the appended claims. In other instances, well-known circuits, structures, devices, and operations have been shown in block diagram form and / or without detail to avoid obscuring the understanding of the specification.
[0156] Example Implementation
[0157] The following examples relate to further embodiments. Details from the examples may be used anywhere in one or more embodiments.
[0158] Example 1 is a method that includes: predicting a value that a load instruction in a first iteration of a loop will load, and executing multiple instructions that occur after the load instruction in the first iteration of the loop to generate multiple results. Each of the multiple results depends only on the value, one or more constant values, a value derived from the value and/or the one or more constant values, or any combination thereof. The method also includes generating the multiple results for the multiple instructions during each of one or more iterations following the first iteration of the loop, without re-executing the multiple instructions.
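The method of Example 1 can be pictured with a small software model. The Python sketch below is only an approximation under stated assumptions: the register name `r_load`, the tuple encoding of instructions, and the function `run_loop` are hypothetical, and a real processor performs these steps in hardware rather than in a dictionary. The sketch executes the dependent instructions once, using the predicted load value, and then provides the saved results in later iterations without executing them again.

```python
import operator


def operand(regs, src):
    """Resolve a source operand: a register name (str) or a constant (int)."""
    return regs[src] if isinstance(src, str) else src


def run_loop(predicted_value, body, iterations):
    """Toy model of Example 1 (hypothetical encoding, not the hardware design).

    body is a list of (dest, op, src_a, src_b) tuples whose sources are only
    the load destination "r_load", constants, or earlier destinations.
    Returns (results per iteration, number of dependent instructions executed).
    """
    executed = 0
    saved = {}
    per_iteration = []
    for i in range(iterations):
        if i == 0:
            # First iteration: execute the dependents on the predicted value.
            regs = {"r_load": predicted_value}
            for dest, op, a, b in body:
                regs[dest] = op(operand(regs, a), operand(regs, b))
                executed += 1
            saved = {dest: regs[dest] for dest, _, _, _ in body}
        # Every iteration sees the same results; later ones do not re-execute.
        per_iteration.append(dict(saved))
    return per_iteration, executed


body = [
    ("r1", operator.add, "r_load", 4),   # r1 = load + 4
    ("r2", operator.mul, "r1", 3),       # r2 = r1 * 3
]
results, executed = run_loop(predicted_value=10, body=body, iterations=5)
```

With five iterations and two dependent instructions, the model executes two instructions in total instead of ten, which is the saving the example is aiming at.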
[0159] Example 2 includes the method of Example 1, further comprising: storing the multiple results generated for the first iteration in multiple physical registers. Optionally, the generating includes generating the multiple results from the multiple physical registers during each of the one or more iterations.
[0160] Example 3 includes the method of any one of Examples 1 to 2, wherein the generating includes generating the multiple results during allocation.
[0161] Example 4 includes the method of any one of Examples 1 to 3, further comprising, during the first iteration of the loop: storing an instruction pointer for the load instruction, the value predicted for the load instruction, and a destination physical register identifier for the load instruction in an entry of a structure; and storing at least one source physical register identifier and a destination physical register identifier for each of the multiple instructions in different corresponding entries of the structure.
[0162] Example 5 includes the method of any one of Examples 1 to 4, wherein the storing includes, upon submitting the load instruction, storing the instruction pointer for the load instruction, the value predicted for the load instruction, and the destination physical register identifier for the load instruction in the entry of the structure.
[0163] Example 6 includes the method of any one of Examples 1 to 5, and further includes: storing each of the plurality of results generated for the first iteration in a destination physical register identified by a destination physical register identifier for the corresponding instruction.
[0164] Example 7 includes the method of any one of Examples 1 to 6, wherein the generating includes generating each of the plurality of results from the destination physical register identified by the destination physical register identifier for the corresponding instruction.
[0165] Example 8 includes the method of any one of Examples 1 to 7, further comprising, during the first iteration of the loop: indicating an execution elimination hit for the destination physical register identifier of the load instruction; and optionally, indicating an execution elimination hit for the destination physical register identifier of each of the plurality of instructions.
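The structure described in Examples 4 to 8 can be sketched as a small data structure. The following Python model is a hedged illustration only: the class names `LoadEntry`, `DependentEntry`, and `EliminationStructure`, the method names, and the rule that a dependent instruction is recorded only when all of its source physical registers are already marked as hits are assumptions made here, not details confirmed by the examples.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class LoadEntry:
    ip: int                 # instruction pointer of the load instruction
    predicted_value: int    # value predicted for the load instruction
    dest_preg: int          # destination physical register identifier


@dataclass
class DependentEntry:
    src_pregs: Tuple[int, ...]   # at least one source physical register id
    dest_preg: int               # destination physical register identifier


class EliminationStructure:
    """Hypothetical software model of the structure in Examples 4 to 8."""

    def __init__(self) -> None:
        self.load: Optional[LoadEntry] = None
        self.dependents: List[DependentEntry] = []
        self.hit: Dict[int, bool] = {}   # preg id -> execution elimination hit

    def record_load(self, ip: int, predicted_value: int, dest_preg: int) -> None:
        self.load = LoadEntry(ip, predicted_value, dest_preg)
        self.hit[dest_preg] = True

    def record_dependent(self, src_pregs, dest_preg: int) -> bool:
        """Record an instruction only if every source is already a hit (assumed rule)."""
        if not src_pregs or not all(self.hit.get(p, False) for p in src_pregs):
            return False
        self.dependents.append(DependentEntry(tuple(src_pregs), dest_preg))
        self.hit[dest_preg] = True
        return True


s = EliminationStructure()
s.record_load(ip=0x401000, predicted_value=10, dest_preg=40)
ok1 = s.record_dependent([40], 41)       # depends directly on the load result
ok2 = s.record_dependent([41, 40], 42)   # depends on derived values only
ok3 = s.record_dependent([7], 43)        # source is unrelated to the load
```

Propagating the hit indication from sources to destinations is one simple way to track that a result depends only on the predicted value and values derived from it; instructions with an unrelated source, such as the last call above, are left to execute normally.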
[0166] Example 9 includes the method of any one of Examples 1 to 8, further comprising: executing the load instruction during the first iteration and during each of the one or more iterations.
[0167] Example 10 includes a method of any one of Examples 1 to 9, further comprising: if a value loaded by executing a load instruction during a given iteration in one or more iterations differs from a value predicted for the load instruction, then discarding multiple results generated during the given iteration.
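Examples 9 and 10 add a check: the load still executes in every iteration, and the reused results are discarded when the loaded value differs from the prediction. A minimal Python sketch of that check follows. The function name `run_with_verification`, the `compute` callback standing in for the eliminated instructions, and the recovery policy (recompute from the actual value, then continue reusing the saved results) are assumptions for illustration; the examples only state that the results are discarded.

```python
def run_with_verification(memory_values, predicted_value, compute):
    """memory_values[i] is the value the load actually returns in iteration i.

    compute(value) stands in for the eliminated dependent instructions.
    Returns a list of (result, reused) pairs, one per iteration.
    """
    saved = compute(predicted_value)       # produced once, in the first iteration
    outcome = []
    for i, actual in enumerate(memory_values):
        if actual == predicted_value:
            # Prediction confirmed: later iterations reuse the saved result.
            outcome.append((saved, i > 0))
        else:
            # Misprediction: discard the reused result and recompute from the
            # value actually loaded (assumed recovery policy).
            outcome.append((compute(actual), False))
    return outcome


outcome = run_with_verification([10, 10, 11, 10], 10, lambda v: (v + 4) * 3)
```

In this run the third iteration loads 11 instead of the predicted 10, so its reused result is thrown away and replaced by one computed from the actual value.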
[0168] Example 11 is an apparatus comprising: a prediction unit for predicting the value to be loaded by a load instruction in the first iteration of a loop; and one or more execution units for executing multiple instructions occurring after the load instruction in the first iteration of the loop to generate multiple results. Each of the multiple results depends only on the value, one or more constant values, values derived therefrom, or any combination thereof. The apparatus also includes allocation circuitry for generating multiple results for the multiple instructions during each of one or more iterations following the first iteration of the loop, without re-executing the multiple instructions.
[0169] Example 12 includes the apparatus of Example 11, further comprising: a plurality of physical registers for storing the plurality of results generated for the first iteration. Optionally, the allocation circuitry is to generate the plurality of results from the plurality of physical registers during each of the one or more iterations.
[0170] Example 13 includes the apparatus of any one of Examples 11 to 12, further comprising: a structure having a plurality of entries, wherein, during the first iteration of the loop: (1) an instruction pointer for the load instruction, the value predicted for the load instruction, and a destination physical register identifier for the load instruction are to be stored in an entry of the structure; and (2) optionally, for each of the plurality of instructions, at least one source physical register identifier and a destination physical register identifier are to be stored in different corresponding entries of the structure.
[0171] Example 14 includes an apparatus of any one of Examples 11 to 13, further comprising: a submission unit coupled to the structure for submitting a load instruction, wherein, when the load instruction is submitted, an instruction pointer for the load instruction, a value predicted for the load instruction, and a destination physical register identifier for the load instruction are stored in an entry.
[0172] Example 15 includes an apparatus of any of Examples 11 to 14, wherein one or more execution units are configured to execute load instructions during a first iteration and each of the one or more iterations.
[0173] Example 16 includes the apparatus of any one of Examples 11 to 15, further comprising: circuitry for comparing a value loaded by executing the load instruction during an iteration with the value predicted for the load instruction.
[0174] Example 17 is a system comprising: a processor including: a prediction unit for predicting a value to be loaded by a load instruction in the first iteration of a loop; and one or more execution units for executing multiple instructions occurring after the load instruction in the first iteration of the loop to generate multiple results. Each of the multiple results depends only on the value, one or more constant values, values derived therefrom, or any combination thereof. The processor also includes allocation circuitry for generating multiple results for the multiple instructions during each of one or more iterations following the first iteration of the loop, without re-executing the multiple instructions. The system also includes dynamic random access memory (DRAM) coupled to the processor.
[0175] Example 18 includes the system of Example 17, further comprising: a plurality of physical registers for storing a plurality of results generated for a first iteration. Optionally, allocation circuitry is also provided for generating a plurality of results from the plurality of physical registers during each iteration in one or more iterations.
[0176] Example 19 includes the system of any one of Examples 17 to 18, further comprising: a structure having multiple entries, wherein, during the first iteration of the loop: (1) an instruction pointer for the load instruction, the value predicted for the load instruction, and a destination physical register identifier for the load instruction are to be stored in an entry of the structure; and (2) optionally, for each of the multiple instructions, at least one source physical register identifier and a destination physical register identifier are to be stored in different corresponding entries of the structure.
[0177] Example 20 includes the system of any one of Examples 17 to 19, further comprising: a submission unit coupled to the structure for submitting the load instruction, wherein, when the load instruction is submitted, the instruction pointer for the load instruction, the value predicted for the load instruction, and the destination physical register identifier for the load instruction are to be stored in the entry.
[0178] Example 21 is a processor or other device operable to perform a method as described in any of Examples 1 through 10.
[0179] Example 22 is a processor or other device that includes means for performing a method as described in any of Examples 1 to 10.
[0180] Example 23 is a processor or other device that includes any combination of modules and / or units and / or logic and / or circuits and / or means operable to perform the methods of any one of Examples 1 to 10.
Claims
1. A method comprising: predicting a value that a load instruction in a first iteration of a loop will load; executing a plurality of instructions that occur after the load instruction in the first iteration of the loop to generate a plurality of results, wherein each of the plurality of results depends only on the value, one or more constant values, a value derived from the value and/or the one or more constant values, or any combination thereof; and during each of one or more iterations following the first iteration of the loop, generating the plurality of results for the plurality of instructions without re-executing the plurality of instructions.
2. The method of claim 1, further comprising: storing the plurality of results generated for the first iteration in a plurality of physical registers, wherein the generating includes, during each of the one or more iterations, generating the plurality of results from the plurality of physical registers.
3. The method of claim 1, wherein the generating includes generating the plurality of results during allocation.
4. The method of any one of claims 1 to 3, further comprising, during the first iteration of the loop: storing an instruction pointer for the load instruction, the value predicted for the load instruction, and a destination physical register identifier for the load instruction in an entry of a structure; and storing at least one source physical register identifier and a destination physical register identifier for each of the plurality of instructions in different corresponding entries of the structure.
5. The method of claim 4, wherein the storing includes, upon submitting the load instruction, storing the instruction pointer for the load instruction, the value predicted for the load instruction, and the destination physical register identifier for the load instruction in the entry of the structure.
6. The method of claim 4, further comprising: storing each of the plurality of results generated for the first iteration in a destination physical register identified by the destination physical register identifier for the corresponding instruction.
7. The method of claim 6, wherein the generating includes generating each of the plurality of results from the destination physical register identified by the destination physical register identifier for the corresponding instruction.
8. The method of claim 4, further comprising, during the first iteration of the loop: indicating an execution elimination hit for the destination physical register identifier of the load instruction; and indicating an execution elimination hit for the destination physical register identifier of each of the plurality of instructions.
9. The method of any one of claims 1 to 3, further comprising: executing the load instruction during the first iteration and during each of the one or more iterations.
10. The method of claim 9, further comprising: if the value loaded by executing the load instruction during a given iteration of the one or more iterations differs from the value predicted for the load instruction, discarding the plurality of results generated during the given iteration.
11. A computer program product comprising instructions that, when executed by a processor, cause the processor to perform the method as described in any one of claims 1 to 3.
12. An apparatus comprising: a prediction unit to predict a value that a load instruction in a first iteration of a loop will load; one or more execution units to execute a plurality of instructions occurring after the load instruction in the first iteration of the loop to generate a plurality of results, wherein each of the plurality of results depends only on the value, one or more constant values, a value derived from the value and/or the one or more constant values, or any combination thereof; and allocation circuitry to generate the plurality of results for the plurality of instructions during each of one or more iterations following the first iteration of the loop, without re-executing the plurality of instructions.
13. The apparatus of claim 12, further comprising: a plurality of physical registers to store the plurality of results generated for the first iteration, wherein the allocation circuitry is to generate the plurality of results from the plurality of physical registers during each of the one or more iterations.
14. The apparatus of any one of claims 12 to 13, further comprising: a structure having a plurality of entries, wherein, during the first iteration of the loop: an instruction pointer for the load instruction, the value predicted for the load instruction, and a destination physical register identifier for the load instruction are to be stored in an entry of the structure; and for each of the plurality of instructions, at least one source physical register identifier and a destination physical register identifier are to be stored in different corresponding entries of the structure.
15. The apparatus of claim 14, further comprising: a submission unit coupled to the structure, the submission unit to submit the load instruction, wherein, when the load instruction is submitted, the instruction pointer for the load instruction, the value predicted for the load instruction, and the destination physical register identifier for the load instruction are to be stored in the entry.
16. The apparatus of any one of claims 12 to 13, wherein the one or more execution units are to execute the load instruction during the first iteration and during each of the one or more iterations.
17. The apparatus of claim 16, further comprising: circuitry to compare the value loaded by executing the load instruction during an iteration with the value predicted for the load instruction.
18. A system comprising: a processor, the processor comprising: a prediction unit to predict a value that a load instruction in a first iteration of a loop will load; one or more execution units to execute a plurality of instructions occurring after the load instruction in the first iteration of the loop to generate a plurality of results, wherein each of the plurality of results depends only on the value, one or more constant values, a value derived from the value and/or the one or more constant values, or any combination thereof; and allocation circuitry to generate the plurality of results for the plurality of instructions during each of one or more iterations following the first iteration of the loop, without re-executing the plurality of instructions; and a dynamic random access memory (DRAM) coupled to the processor.
19. The system of claim 18, further comprising: a plurality of physical registers to store the plurality of results generated for the first iteration, wherein the allocation circuitry is to generate the plurality of results from the plurality of physical registers during each of the one or more iterations.
20. The system of any one of claims 18 to 19, further comprising: a structure having a plurality of entries, wherein, during the first iteration of the loop: an instruction pointer for the load instruction, the value predicted for the load instruction, and a destination physical register identifier for the load instruction are to be stored in an entry of the structure; and for each of the plurality of instructions, at least one source physical register identifier and a destination physical register identifier are to be stored in different corresponding entries of the structure.
21. The system of claim 20, further comprising: a submission unit coupled to the structure, the submission unit to submit the load instruction, wherein, when the load instruction is submitted, the instruction pointer for the load instruction, the value predicted for the load instruction, and the destination physical register identifier for the load instruction are to be stored in the entry.
22. An apparatus comprising: means for predicting a value that a load instruction in a first iteration of a loop will load; means for executing a plurality of instructions that occur after the load instruction in the first iteration of the loop to generate a plurality of results, wherein each of the plurality of results depends only on the value, one or more constant values, a value derived from the value and/or the one or more constant values, or any combination thereof; and means for generating the plurality of results for the plurality of instructions during each of one or more iterations following the first iteration of the loop, without re-executing the plurality of instructions.
23. The apparatus of claim 22, further comprising: means for storing the plurality of results generated for the first iteration in a plurality of physical registers, wherein the means for generating includes means for generating the plurality of results from the plurality of physical registers during each of the one or more iterations.
24. The apparatus of claim 22, wherein the means for generating includes means for generating the plurality of results during allocation.
25. The apparatus of any one of claims 22 to 24, further comprising: means for storing, during the first iteration of the loop, an instruction pointer for the load instruction, the value predicted for the load instruction, and a destination physical register identifier for the load instruction in an entry of a structure; and means for storing, during the first iteration of the loop, at least one source physical register identifier and a destination physical register identifier for each of the plurality of instructions in different corresponding entries of the structure.