Arithmetic processing unit and arithmetic processing method
By detecting early termination and bypassing intermediate stage results, the arithmetic processing unit optimizes execution latency, enhancing processing performance through out-of-order execution.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Patents
- Current Assignee / Owner
- FUJITSU LTD
- Filing Date
- 2022-08-12
- Publication Date
- 2026-07-01
AI Technical Summary
Existing arithmetic processing units determine execution latency based on the longest cycle of an arithmetic execution unit, leading to inefficient processing performance due to variations in execution latency based on input data and operation type, particularly in multiply-accumulate instructions.
The unit incorporates a detection unit that identifies early termination of intermediate-stage calculations and a bypass control unit that transfers the intermediate result directly to the input of the execution unit when possible, which reduces latency for out-of-order execution.
Improves processing performance by enabling early bypassing of calculation results, reducing overall execution time and enhancing the efficiency of arithmetic processing units.
Abstract
Description
Technical Field
[0001] The present invention relates to an arithmetic processing unit and an arithmetic processing method.
Background Art
[0002] A known type of processor is equipped with an instruction scheduler that issues instructions to an arithmetic execution unit in the order in which they become executable, and improves processing performance by performing out-of-order processing (see, for example, Patent Documents 1-5).
Prior Art Documents
Patent Documents
[0003]
Patent Document 1
Patent Document 2
Patent Document 3
Patent Document 4
Patent Document 5
Summary of the Invention
Problems to be Solved by the Invention
[0004] An arithmetic processing unit that performs out-of-order processing determines the number of cycles of arithmetic execution by an arithmetic execution unit when issuing an instruction from an instruction scheduler. The number of cycles of arithmetic execution is also referred to as execution latency. Then, the arithmetic result is bypassed to the input of the execution unit in a cycle corresponding to the determined execution latency. On the other hand, the net execution latency of the execution unit varies depending on the input data or the type of arithmetic operation.
[0005] For example, when executing a multiply-accumulate instruction and a multiply instruction in a multiply-accumulate unit that includes both a multiplier and an adder, the net execution latency is smaller for the multiply instruction. Also, in a multiply-accumulate instruction, when the multiplication result is added to "0", the result is obtained without waiting for the adder's execution result. Therefore, the net latency when the multiplication result is added to "0" is smaller than the net latency when the multiplication result is added to a value other than "0". However, the instruction scheduler uses the longest latency in the execution unit to determine when to issue an instruction.
[0006] In one aspect, the present invention aims to improve the processing performance of an arithmetic processing unit by detecting whether or not the calculation results at intermediate stages of an execution unit, which includes multiple stages, can be bypassed, and by bypassing them if possible.
Means for Solving the Problem
[0007] From one perspective, the arithmetic processing unit includes an instruction scheduler that issues executable instructions, a register file that holds data used by the instructions, an execution unit that includes a plurality of stages that sequentially execute the instructions issued by the instruction scheduler, a detection unit that detects early termination when the calculation result in an intermediate stage prior to the final stage is the same as the calculation result by the execution unit, and a bypass control unit that transfers data output from the register file or the calculation result from the execution unit to the input of the execution unit, and if the detection unit detects the early termination, transfers the calculation result in the intermediate stage to the input of the execution unit.
Effects of the Invention
[0008] By detecting whether the calculation results at intermediate stages of an execution unit, which includes multiple stages, can be bypassed, and by bypassing them when possible, the processing performance of the arithmetic processing unit can be improved.
Brief Description of the Drawings
[0009]
[Figure 1] It is a block diagram showing an example of the main part of an arithmetic processing unit in one embodiment.
[Figure 2] It is a block diagram showing an example of the main part of an arithmetic processing unit in another embodiment.
[Figure 3] It is a block diagram showing an example of the main part of an arithmetic processing unit in yet another embodiment.
[Figure 4] It is a block diagram showing the continuation of FIG. 3.
[Figure 5] It is a block diagram showing an example of the main part of another arithmetic processing unit.
[Figure 6] It is a block diagram showing the continuation of FIG. 5.
[Figure 7] It is a block diagram showing an example of the main part of yet another arithmetic processing unit.
[Figure 8] It is a block diagram showing the continuation of FIG. 7.
[Figure 9] It is an explanatory diagram showing an example of pipeline operation when performing a floating-point multiply-accumulate operation.
[Figure 10] It is an explanatory diagram showing an example of pipeline operation from another perspective when performing a floating-point multiply-accumulate operation.
[Figure 11] It is a block diagram showing an example of the main part of an arithmetic processing unit in another embodiment.
[Figure 12] It is a block diagram showing the continuation of FIG. 11.
[Figure 13] It is a block diagram showing an example of the main part of an arithmetic processing unit in another embodiment.
[Figure 14] It is a block diagram showing the continuation of FIG. 13.
[Figure 15] It is a block diagram showing an example of the main part of an arithmetic processing unit in another embodiment.
[Figure 16] It is a block diagram showing the continuation of FIG. 15.
[Figure 17] It is a block diagram showing an example of the main part of an arithmetic processing unit in another embodiment.
[Figure 18] It is a block diagram showing the continuation of FIG. 17.
[Figure 19] It is a block diagram showing an example of the main part of an arithmetic processing unit in another embodiment.
[Figure 20] It is a block diagram showing the continuation of FIG. 19.
[Figure 21] It is a block diagram showing an example of the main part of an arithmetic processing unit in another embodiment.
[Figure 22] It is a block diagram showing the continuation of FIG. 21.
[Figure 23] It is a block diagram showing an example of the main part of an arithmetic processing unit in another embodiment.
[Figure 24] It is a block diagram showing the continuation of FIG. 23.
[Figure 25] It is a block diagram showing an example of the main part of an arithmetic processing unit in another embodiment.
[Figure 26] It is a block diagram showing the continuation of FIG. 25.
[Figure 27] It is a block diagram showing an example of the main part of an arithmetic processing unit in another embodiment.
[Figure 28] It is a block diagram showing the continuation of FIG. 27.
[Figure 29] It is a block diagram showing an example of the main part of an arithmetic processing unit in another embodiment.
[Figure 30] It is a block diagram showing the continuation of FIG. 29.
Embodiments for Carrying Out the Invention
[0010] Hereinafter, embodiments will be described with reference to the drawings.
[0011] Figure 1 shows an example of the main components of an arithmetic processing unit in one embodiment. The arithmetic processing unit 10 shown in Figure 1 is a processor such as a CPU (Central Processing Unit). The arithmetic processing unit 10 includes an instruction decoder ID, an instruction scheduler IS, a register file RF, a bypass control unit BCNT, and an execution unit EX. The instruction decoder ID includes a detection unit DET. The execution unit EX has a plurality of stages STG (STG1, STG2). The number of stages STG may be three or more. Below, an example in which the execution unit EX includes a multiply-accumulate circuit will be described, but the execution unit EX may also include other arithmetic circuits.
[0012] The instruction decoder ID decodes the instruction received from the instruction buffer, etc., and outputs instruction information indicating the decoded instruction to the instruction scheduler IS. The detection unit DET determines whether the calculation result by the execution unit of the decoded instruction can be obtained without using all of the multiple stages STG, and outputs the determination result as a shortening flag sht to the instruction scheduler IS. For example, if the decoded instruction is a multiply-accumulate instruction, the instruction decoder ID resets the shortening flag sht, and if it is a multiplication instruction, it sets the shortening flag sht.
[0013] The instruction scheduler IS sequentially stores the instruction information and the shortening flag sht received from the instruction decoder ID. From the stored instruction information, the instruction scheduler IS sequentially feeds the instruction information that can be executed by the execution unit EX, together with its shortening flag sht, to the execution unit EX, and sequentially outputs the register information included in the executable instruction information to the register file RF. Hereafter, the instruction information output by the instruction scheduler IS will also be simply referred to as an instruction.
[0014] In this case, the instruction scheduler IS changes the timing of issuing subsequent dependent instructions that depend on the preceding instruction, depending on whether or not the process terminates early. For example, the instruction scheduler IS issues subsequent dependent instructions so that the timing of their input to the execution unit EX matches the timing at which the calculation result of the preceding instruction is bypassed from the bypass control unit BCNT to the execution unit EX. This allows the calculation result of the preceding instruction to be bypassed to the execution unit EX in accordance with the execution timing of the subsequent dependent instruction, even if the cycle in which the calculation result of the preceding instruction is output from the execution unit EX changes depending on the value of the shortening flag sht.
[0015] The register file RF has multiple registers, each capable of holding data used for instructions and calculation results. The register file RF retrieves the data used by the execution unit EX for calculations from the registers indicated by the register information (source operand) from the instruction scheduler IS, and outputs it to the bypass control unit BCNT.
[0016] Furthermore, the register file RF receives the calculation result rslta, which is output from the final stage STG2 of the execution unit EX, or the calculation result rsltb, which is output from the intermediate stage STG1 of the execution unit EX. The register file RF stores the received calculation result in the register indicated by the register information (destination operand) that it receives from the execution unit EX together with the calculation result.
[0017] When the bypass control unit BCNT receives the enable signal ena from the execution unit EX, it selects the calculation result rslta output from the final stage STG2 and outputs it as a source operand to the input of the execution unit EX (for example, the first stage STG1). When the bypass control unit BCNT receives the enable signal enb from the execution unit EX, it selects the calculation result rsltb output from the intermediate stage STG1 and outputs it as a source operand to the input of the execution unit EX. If the bypass control unit BCNT receives neither the enable signal ena nor the enable signal enb from the execution unit EX, it selects the source operand output from the register file RF and outputs it to the input of the execution unit EX.
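The three-way selection performed by the bypass control unit BCNT can be illustrated with a minimal Python sketch. The function name bypass_select and its argument names are hypothetical. The text does not say what happens if ena and enb are asserted in the same cycle, so giving ena priority here is an arbitrary assumption, not something taken from the embodiment.

```python
def bypass_select(ena: bool, enb: bool, rslta, rsltb, rf_operand):
    """Choose the source operand forwarded to the input of the execution unit EX.

    ena        -- enable signal accompanying rslta (final stage STG2)
    enb        -- enable signal accompanying rsltb (intermediate stage STG1)
    rf_operand -- source operand read from the register file RF
    """
    if ena:
        # Assumption: ena wins if both enables are asserted (not stated in the text).
        return rslta
    if enb:
        return rsltb
    # Neither enable received: fall back to the register file value.
    return rf_operand
```

For instance, bypass_select(False, True, None, 12, 99) forwards the intermediate result 12, while bypass_select(False, False, None, None, 99) forwards the register file value 99.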
[0018] The execution unit EX executes the instructions received from the instruction scheduler IS in a pipeline formed by the multiple stages STG connected in series. In the example shown in Figure 1, the first stage STG1 (the intermediate stage) performs multiplication, and the final stage STG2 performs addition. For example, the execution unit EX is a multiply-accumulate unit. When the execution unit EX executes a multiplication instruction, the calculation result is already obtained at the first stage STG1.
[0019] If the shortening flag sht is set, the intermediate stage STG1 outputs the enable signal enb along with the calculation result rsltb to the bypass control unit BCNT. This allows the execution unit EX to bypass the calculation result rsltb one cycle earlier when executing a multiplication instruction, compared to outputting the calculation result rslta from the final stage STG2.
[0020] Note that if the shortening flag sht is set, the intermediate stage STG1 does not output the enable signal ena. If the shortening flag sht is reset, the intermediate stage STG1 outputs the enable signal ena along with the calculation result rsltb to the final stage STG2. If the final stage STG2 receives the enable signal ena, it outputs the enable signal ena along with the calculation result rslta to the bypass control unit BCNT.
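The handshake between the two stages in paragraphs [0018] to [0020] can be sketched as follows. This is a simplified behavioural model in Python, not the circuit itself: the function names stg1, stg2 and run are hypothetical, each stage is assumed to take exactly one cycle, and plain Python arithmetic stands in for the multiplier and adder.

```python
def stg1(a, b, sht):
    """Intermediate stage STG1: multiply, then raise enb or ena depending on sht."""
    rsltb = a * b
    if sht:
        # Shortening flag set: bypass rsltb directly, ena is not output.
        return {"ena": 0, "enb": 1, "rsltb": rsltb}
    # Shortening flag reset: pass rsltb and ena on to the final stage.
    return {"ena": 1, "enb": 0, "rsltb": rsltb}


def stg2(stg1_out, c):
    """Final stage STG2: add c and forward ena only if ena was received."""
    if stg1_out["ena"]:
        return {"ena": 1, "rslta": stg1_out["rsltb"] + c}
    return {"ena": 0, "rslta": None}


def run(a, b, c, sht):
    """Return (enable signal, cycle of bypass, bypassed value)."""
    s1 = stg1(a, b, sht)
    if s1["enb"]:
        return ("enb", 1, s1["rsltb"])   # assumed: one cycle per stage
    s2 = stg2(s1, c)
    return ("ena", 2, s2["rslta"])
```

Under these assumptions, run(3, 4, 5, sht=0) bypasses 17 with ena in cycle 2, and run(3, 4, 0, sht=1) bypasses 12 with enb in cycle 1, one cycle earlier.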
[0021] In this embodiment, if the arithmetic processing unit 10 detects an early termination by the detection unit DET, where the calculation result from the execution unit can be obtained without using all of the multiple stages STG, it bypasses the calculation result rsltb from the intermediate stage STG1. This allows the calculation result rsltb to be bypassed earlier compared to when the calculation result rslta is output from the final stage STG2, thereby improving the processing performance of the arithmetic processing unit 10.
[0022] Furthermore, by providing the detection unit DET in the instruction decoder ID, it is possible to determine during instruction decoding whether or not the calculation result rsltb in the intermediate stage STG1 can be bypassed (i.e., whether or not it will terminate early). The instruction scheduler IS changes the timing of issuing subsequent dependent instructions that depend on the preceding instruction, depending on whether or not it will terminate early. As a result, even if the cycle in which the calculation result of the preceding instruction is output from the execution unit EX changes according to the value of the shortening flag sht, the calculation result of the preceding instruction can be bypassed to the execution unit EX in accordance with the execution timing of the subsequent dependent instruction.
[0023] The detection unit DET may be provided in the instruction scheduler IS. In this case, the instruction scheduler IS determines whether or not it will terminate early between receiving an instruction from the instruction decoder ID and outputting it to the execution unit EX, and sets or resets the shortening flag sht according to the determination result.
[0024] Figure 2 shows an example of the main components of an arithmetic processing unit in another embodiment. Elements similar to those in Figure 1 are denoted by the same reference numerals, and detailed explanations are omitted. The arithmetic processing unit 10A shown in Figure 2 is, for example, a processor such as a CPU. The arithmetic processing unit 10A has a secondary scheduler 2S in addition to the components of the arithmetic processing unit 10 in Figure 1. The secondary scheduler 2S is located between the instruction scheduler IS and the execution unit EX. In addition, the arithmetic processing unit 10A has the detection unit DET in the intermediate stage STG1 (the first stage in this example) instead of in the instruction decoder ID. The execution unit EX is, for example, a multiply-accumulate arithmetic unit that executes a multiply-accumulate instruction ("a*b+c", where * is the multiplication sign).
[0025] The secondary scheduler 2S includes a plurality of entries that sequentially hold instructions issued from the instruction scheduler IS and source operands transferred from the bypass control unit BCNT. The secondary scheduler 2S has a function to delay the held instructions and source operands by a predetermined number of cycles, and issues the instruction held in an entry whose source operand has been determined, along with that source operand, to the execution unit EX.
[0026] As a result, when the calculation result is bypassed from the final stage STG2, for example, the secondary scheduler 2S can hold back the instruction received from the instruction scheduler IS until the source operand is received from the bypass control unit BCNT. In other words, even when the cycle in which the execution unit EX outputs the calculation result (rslta or rsltb) changes, the secondary scheduler 2S can make the instruction from the instruction scheduler IS wait for the source operand from the bypass control unit BCNT.
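A rough way to picture the entries of the secondary scheduler 2S is a small buffer that releases an instruction only once its source operand has arrived. The Python sketch below is an assumption-laden illustration: the class and method names (Entry, SecondaryScheduler, accept, deliver, issue), the use of a register tag to match operands, and the oldest-first release order are all hypothetical and not specified in the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    instr: str                        # instruction issued by the instruction scheduler IS
    src_tag: int                      # hypothetical tag identifying the awaited operand
    operand: Optional[float] = None   # None until the source operand is determined


class SecondaryScheduler:
    def __init__(self) -> None:
        self.entries: List[Entry] = []

    def accept(self, instr: str, src_tag: int, operand: Optional[float] = None) -> None:
        """Hold an instruction; the operand may still be undetermined."""
        self.entries.append(Entry(instr, src_tag, operand))

    def deliver(self, tag: int, value: float) -> None:
        """Record an operand transferred from the bypass control unit BCNT."""
        for e in self.entries:
            if e.src_tag == tag and e.operand is None:
                e.operand = value

    def issue(self) -> Optional[Tuple[str, float]]:
        """Issue the oldest entry whose source operand has been determined."""
        for i, e in enumerate(self.entries):
            if e.operand is not None:
                del self.entries[i]
                return (e.instr, e.operand)
        return None   # nothing is ready: the instruction keeps waiting
```

An instruction accepted before its operand exists simply stays in its entry; issue() returns None until deliver() supplies the bypassed value, whichever stage it came from.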
[0027] The detection unit DET, located in intermediate stage STG1, resets the enable signal ena and sets the enable signal enb if it detects that "c" in the multiply-accumulate instruction ("a*b+c") is "0". The detection unit DET sets the enable signal ena and resets the enable signal enb if it detects that "c" in the multiply-accumulate instruction ("a*b+c") is not "0".
[0028] This allows the calculation result to be bypassed from the intermediate stage STG1 when "c" in the multiply-accumulate instruction ("a*b+c") is "0". In other words, the calculation result rsltb can be bypassed one cycle earlier compared to bypassing the calculation result rslta from the final stage STG2.
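The operand-dependent detection of paragraphs [0027] and [0028] can be sketched in Python as below. The names stg1_detect and fma_bypass are hypothetical, one cycle per stage is assumed, and the comparison c == 0 deliberately ignores floating-point subtleties such as signed zeros and rounding, which the text does not discuss.

```python
def stg1_detect(c):
    """Detection unit DET in STG1: return (ena, enb) from the addend c."""
    if c == 0:
        return (0, 1)   # a*b+0 equals a*b: reset ena, set enb
    return (1, 0)       # the addition is needed: set ena, reset enb


def fma_bypass(a, b, c):
    """Return (which result is bypassed, assumed cycle, value) for a*b+c."""
    ena, enb = stg1_detect(c)
    product = a * b
    if enb:
        return ("rsltb", 1, product)       # bypassed from the intermediate stage STG1
    return ("rslta", 2, product + c)       # bypassed from the final stage STG2
```

Unlike a decision made from the instruction code alone, this check depends on the value of the source operand, so the same multiply-accumulate instruction may finish in either cycle.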
[0029] For example, suppose the execution unit EX has five stages STG, and the detection unit DET is located in the second stage STG. In this case, the instruction scheduler IS may issue the subsequent dependent instruction so that the timing of its input to the execution unit EX is earlier than the timing at which the calculation result of the preceding instruction in the final stage STG is bypassed and input to the execution unit EX. That is, the instruction scheduler IS may set the cycle from the issuance of the preceding instruction to the issuance of the subsequent dependent instruction to be between two and five cycles. In this case as well, the secondary scheduler 2S can synchronize the timing of the input of the calculation result of the preceding instruction and the subsequent dependent instruction to the execution unit EX.
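The five-stage example can be made concrete with a small calculation of how long the secondary scheduler 2S would have to hold a dependent instruction. The Python function below is only a sketch: its name and parameters are hypothetical, and it assumes one cycle per stage, so that a result bypassed from stage k is available k cycles after the preceding instruction is issued.

```python
def secondary_wait_cycles(issue_gap: int, early: bool,
                          detect_stage: int = 2, total_stages: int = 5) -> int:
    """Cycles the dependent instruction waits in the secondary scheduler 2S.

    issue_gap -- cycles between issuing the preceding and the dependent instruction
    early     -- True if the detection unit DET detected early termination
    """
    if not detect_stage <= issue_gap <= total_stages:
        raise ValueError("issue gap outside the two-to-five-cycle range in the text")
    # Assumption: one cycle per stage, so the result is ready after this many cycles.
    result_ready = detect_stage if early else total_stages
    # If the result is ready before the instruction arrives, no wait is needed.
    return max(0, result_ready - issue_gap)
```

With an optimistic gap of two cycles, the dependent instruction runs immediately when early termination occurs and waits three cycles otherwise; with a gap of five cycles it never waits.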
[0030] As described above, this embodiment provides the same effects as the embodiment described earlier. For example, if the arithmetic processing unit 10A detects an early termination by the detection unit DET, it can bypass the calculation result rsltb at the intermediate stage STG1, thereby bypassing the calculation result rsltb earlier than the calculation result rslta.
[0031] In this case, the detection unit DET provided in the intermediate stage STG1 can determine whether or not the calculation result rsltb can be bypassed. Therefore, it is possible to determine whether or not the calculation result rsltb can be bypassed according to the value of the source operand, which cannot be determined from the instruction code. As a result, the calculation result rsltb can be bypassed earlier compared to when the calculation result rslta is output from the final stage STG2, and the processing performance of the arithmetic processing unit 10A can be improved.
[0032] Because a secondary scheduler 2S is provided, the instruction scheduler IS can make the number of cycles between the issuance of a preceding instruction and the issuance of a subsequent dependent instruction variable. In this case as well, the timing of inputting the calculation result of the preceding instruction and the subsequent dependent instruction to the execution unit EX can be synchronized.
[0033] Furthermore, in addition to the detection unit DET in Figure 2, the arithmetic processing unit 10A may also be provided with a detection unit DET in the instruction decoder ID, as in Figure 1. This makes it possible to determine whether or not the calculation result rsltb can be bypassed both before and after the instruction is issued to the execution unit EX. As a result, the calculation result rsltb can be bypassed more often than when the detection unit DET is provided only in the intermediate stage STG1, and the processing performance of the arithmetic processing unit 10A can be further improved.
[0034] Figures 3 and 4 show an example of the main components of an arithmetic processing unit in yet another embodiment. Elements similar to those in Figure 1 are denoted by the same reference numerals, and detailed descriptions are omitted. The circuit shown in Figure 4 is a continuation of the circuit shown in Figure 3 and is connected to the circuit shown in Figure 3 with the A-A' line as the boundary. The arithmetic processing unit 10B shown in Figures 3 and 4 is, for example, a processor such as a CPU. For example, Figure 3 shows the front-end circuit including the instruction decoder ID and instruction scheduler IS, and Figure 4 shows the back-end circuit including the execution unit EX. Thick signal lines indicate multi-bit data lines.
[0035] The arithmetic processing unit 10B includes an instruction decoder ID, an instruction scheduler IS, a register file RF, comparators C1 and C2, a multiplexer MUX1, multiple FIFO (First-In First-Out) logic sections, and multiple flip-flops FF. The multiple flip-flops FF operate in synchronization with the clock. The arithmetic processing unit 10B also includes comparators C3 and C4, a multiplexer MUX2, logic circuits LGC1-LGC4, and an output control unit OUTCNT.
[0036] The logic circuits LGC1-LGC4 are an example of an execution unit that executes instructions. Hereafter, the five FIFO logic sections will be referred to as FIFO1-FIFO5. The stages of the pipeline of the arithmetic processing unit 10B are separated by series-connected flip-flops FF. Instruction information for executing instructions is transferred to the next stage with each clock cycle.
[0037] The instruction decoder ID decodes the instruction received from the instruction buffer, etc., generates instruction information such as tags tagD, tag1, and the valid flag valid, and outputs the generated instruction information to the instruction scheduler IS. Tag tagD contains the number of the physical register where the destination operand, which is the result of the calculation, is stored. Tag tag1 contains the number of the physical register where the source operand, which is the data used in the calculation, is stored. The valid flag valid is set to "1" when the instruction information such as tags tagD and tag1 is valid. For the sake of simplicity, the instruction code and tags indicating the second and subsequent source operands have been omitted.
[0038] The instruction decoder ID has a shortening detection unit shtdet1. If the shortening detection unit shtdet1 detects that the calculation result by the execution unit of the decoded instruction can be obtained without using all of the logic circuits LGC1-LGC4, it sets the shortening flag sht to "1" and outputs it to the instruction scheduler IS. If the shortening detection unit shtdet1 detects that the calculation result by the decoded instruction can be obtained using all of the logic circuits LGC1-LGC4, it sets the shortening flag sht to "0" and outputs it to the instruction scheduler IS. The shortening detection unit shtdet1 is an example of a detection unit that detects early termination based on the decoded instruction.
[0039] For example, if the execution unit EX is a multiply-accumulate circuit that executes a multiply-accumulate instruction, the shortening detection unit shtdet1 sets the shortening flag sht to "1" if the decoded instruction is a multiplication instruction. The dashed boxes in Figures 3 and 4 show the propagation path of the shortening flag sht from the instruction decoder ID to the execution unit EX and the circuits included in that path.
[0040] The instruction scheduler IS has an instruction selector SEL and a queue Q with multiple entries. Queue Q sequentially holds instruction information, such as tags tagD and tag1, the valid flag valid, and the shortening flag sht, received from the instruction decoder ID. The instruction selector SEL selects, from the multiple entries in queue Q, an entry that holds instruction information that can be executed by the execution unit EX, and issues the selected instruction information as an instruction. The instruction scheduler IS can thus output instructions received in order from the instruction decoder ID to the execution unit EX out of order, enabling out-of-order execution of instructions.
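The role of queue Q and the instruction selector SEL can be pictured with the following Python sketch. It is illustrative only: the names QEntry and select, the use of a set of ready register numbers to decide whether an entry is executable, and the oldest-first policy are assumptions, since the text does not describe the selection criteria in detail.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class QEntry:
    tagD: int    # physical register number of the destination operand
    tag1: int    # physical register number of the source operand
    valid: int   # 1 when the instruction information is valid
    sht: int     # shortening flag from the instruction decoder ID


def select(queue: List[QEntry], ready: Set[int]) -> Optional[QEntry]:
    """Pick the oldest valid entry whose source register is available.

    Assumption: 'ready' lists the registers whose data can be supplied,
    either from the register file RF or by bypass.
    """
    for i, entry in enumerate(queue):
        if entry.valid == 1 and entry.tag1 in ready:
            return queue.pop(i)   # entries may leave the queue out of order
    return None
```

If the older entry is still waiting for register 5 while register 2 is ready, the younger entry that reads register 2 is issued first, which is the out-of-order behaviour described above.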
[0041] Here, the number of cycles until the instruction information selected by the instruction selector SEL of the instruction scheduler IS reaches the flip-flop FF1 of FIFO1-FIFO5 is also called the issue latency. The number of cycles from when the instruction information is output from the flip-flop FF2 of FIFO1-FIFO5 until the calculation by the execution unit EX is completed is also called the execution latency. Each of the multiple flip-flops FF1 and multiple flip-flops FF2 is an example of an entry that holds an instruction issued by the instruction scheduler IS and a source operand transferred from the register file RF and buses B1 and B2.
[0042] The register file RF has multiple registers, each holding the data (source operand) used in the calculation and the calculation result (destination operand). The register file RF outputs the data DT (source operand) held in the register indicated by tag1 to the execution unit EX via FIFO4.
[0043] Furthermore, the register file RF receives the calculation result (destination operand), the tag tagD, and the enable signals en1 and en2 from the execution unit EX via bus B1 or bus B2 at its write port. The register file RF stores the calculation result received at the write port in the register indicated by the tag tagD that accompanies the calculation result.
[0044] Each of the FIFO1-FIFO5 has a two-stage configuration. As shown in Figure 4, each of the FIFO1-FIFO5 has multiplexers MUXa and MUXb that control whether or not to delay instruction information in each flip-flop FF1 and FF2, and a control unit CNT that controls the multiplexers MUXa and MUXb. The control unit CNT generates a selection signal that controls the selection of the multiplexers MUXa and MUXb according to the FIFO control signal FCNT generated in the arithmetic processing unit 10B.
[0045] Flip-flop FF1 receives instruction information nexta output from multiplexer MUXa and outputs it as instruction information curra to multiplexers MUXa and MUXb. Flip-flop FF2 receives instruction information nextb output from multiplexer MUXb and outputs it as instruction information currb to multiplexer MUXb and execution unit EX.
[0046] FIFO1-FIFO5 can adjust the bypass timing of the calculation result after the instruction scheduler IS has issued an instruction by delaying the transfer cycle of instruction information transferred from the instruction scheduler IS to the execution unit EX. In other words, FIFO1-FIFO5 function as a secondary scheduler that adjusts the bypass timing determined by the instruction scheduler IS.
[0047] By providing FIFO1-FIFO5, stalls or flushes can be suppressed even in the event of circumstances that could not be predicted by the instruction scheduler IS. The technique of adjusting bypass timing using FIFO1-FIFO5 within the pipeline is called OoS (Out-of-Step). Furthermore, a pipeline that includes FIFO1-FIFO5 is called an OoS pipeline.
[0048] For example, the OoS pipeline is described in the following paper: Yi Kue et al., "Out-of-Step Pipeline for Efficient Gather/Scatter Processing," Information Processing Society of Japan Research Report, 2021-03-18. <URL:https://ipsj.ixsq.nii.ac.jp/ej/?action=pages_view_main&active_action=repository_view_main_item_detail&item_id=210485&item_no=1&page_id=13&block_id=8>
[0049] Here, the number of cycles it takes for the instruction information selected by the instruction selector SEL of the instruction scheduler IS to reach the flip-flop FF1 of FIFO1-FIFO5 is also called the issue latency. Furthermore, the number of cycles it takes from the time the instruction information is output from the flip-flop FF2 of FIFO1-FIFO5 until the calculation by the execution unit EX is completed is also called the execution latency.
[0050] Execution latency consists of the worst-case execution latency when the calculation result is output using all logic circuits LGC1-LGC4 of the execution unit EX, and the reduced execution latency when the calculation result is output using only logic circuits LGC1-LGC2. In the example shown in Figure 4, the worst-case execution latency is "4" and the reduced execution latency is "2".
[0051] Bus B1 carries the calculation results when bypassing at the worst execution latency, and bus B2 carries the calculation results when bypassing at the reduced execution latency. Each of buses B1 and B2 includes a data line for transferring the calculation results, a signal line for transferring the tag tagD, and a signal line for transferring an enable signal en (en1 or en2) indicating that bypass is enabled.
[0052] Furthermore, when employing an OoS (Out of Execution) method that allows adjustment of bypass timing, the shortened execution latency may be greater than the minimum shortened execution latency, as long as it is less than the worst-case execution latency. The instruction scheduler (IS), when instructions have dependencies, issues subsequent dependent instructions in accordance with the timing at which the calculation result of the preceding dependent instruction is bypassed by the shortened execution latency.
[0053] In Figure 4, the logic circuits LGC1-LGC4, which are arithmetic units, function as execution units EX of the fused multiply-add (FMA) floating-point multiply-add circuit. Hereafter, the fused multiply-add (FMA) circuit will also be simply referred to as FMA. For example, the function of the floating-point multiplier fmul is realized by logic circuits LGC1 and LGC2, and the function of the floating-point adder fadd is realized by logic circuits LGC3 and LGC4.
[0054] Therefore, when the instruction decoder ID decodes a floating-point multiplication instruction, the data output by the multiplier fmul logic circuit LGC2 becomes the calculation result. When a floating-point multiplication instruction is decoded, the instruction decoder ID detects early termination and adds a "1" abbreviation flag sht to bypass the multiplication result from the output of the logic circuit LGC2 via bus B2.
[0055] On the other hand, when the instruction decoder ID decodes a floating-point multiply-accumulate instruction, the data output by the adder fadd becomes the calculation result. Similarly, when the instruction decoder ID decodes a floating-point addition instruction, the data output by the adder fadd becomes the calculation result. In these cases, the instruction decoder ID does not detect early termination and adds a "0" abbreviation flag sht to bypass the addition result from the output of the logic circuit LGC4 via bus B1.
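The decode-time decision in paragraphs [0054] and [0055] amounts to a small lookup from instruction type to shortening flag, bypass bus, and execution latency. The Python sketch below is illustrative: the mnemonics "fmul", "fma" and "fadd" and the function name decode_fp are hypothetical, and the latency values are the example figures given in paragraph [0050].

```python
WORST_LATENCY = 4     # result produced using all of LGC1-LGC4 (example value from [0050])
REDUCED_LATENCY = 2   # result produced using only LGC1-LGC2 (example value from [0050])


def decode_fp(op: str) -> dict:
    """Return the shortening flag sht, the bypass bus, and the execution latency."""
    if op == "fmul":
        # The output of LGC2 is already the result: early termination.
        return {"sht": 1, "bus": "B2", "latency": REDUCED_LATENCY}
    if op in ("fma", "fadd"):
        # The adder fadd (LGC3, LGC4) produces the result: no early termination.
        return {"sht": 0, "bus": "B1", "latency": WORST_LATENCY}
    raise ValueError(f"unsupported instruction in this sketch: {op}")
```

Note that a multiply-accumulate instruction classified here with sht of "0" may still be shortened later inside the execution unit, which a decode-time lookup like this cannot capture.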
[0056] Of the logic circuits LGC1 and LGC2 that execute multiplication, logic circuit LGC2 has a shortening detection unit shtdet2. When the shortening detection unit shtdet2 detects that "c" in a floating-point multiply-accumulate instruction expressed as "a*b+c" (where * is the multiplication sign) is "0", it detects early termination and outputs a shortening detection signal sht2 set to "1" to the output control unit OUTCNT.
[0057] The shortening detection unit shtdet2 is an example of a detection unit that detects early termination based on the calculation result of an intermediate stage in the execution unit EX. The shortening detection unit shtdet2 may also detect early termination and output a shortening detection signal sht2 set to "1" to the output control unit OUTCNT when the instruction executed by the logic circuits LGC1 and LGC2 is a floating-point multiplication instruction.
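The data-dependent detection performed inside logic circuit LGC2 can be sketched as follows. This is an illustrative model only: the function name shtdet2 mirrors the unit name, the parameter is_fmul is a hypothetical input that stands for the optional multiplication-instruction case, and the handling of signed zeros is an assumption not addressed in the description.

```python
def shtdet2(c: float, is_fmul: bool = False) -> int:
    """Return the signal sht2 for an operation a*b+c.

    sht2 is 1 when the addend c is zero, so the product a*b at the
    output of LGC2 is already the final result. The optional is_fmul
    case models a plain multiplication instruction.
    """
    # Python's == treats +0.0 and -0.0 as equal. This sketch does not
    # model any distinction between signed zeros (assumption).
    if is_fmul or c == 0.0:
        return 1
    return 0
```

With this sketch, shtdet2(0.0) returns 1 and shtdet2(1.5) returns 0.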
[0058] The shortening detection unit shtdet2 can determine whether to bypass the calculation result with shortened execution latency or with worst-case execution latency, depending on the data generated inside the execution unit EX. In other words, even if the appropriate execution latency is not determined at the time of instruction issuance, the appropriate execution latency can be set after the instruction issuance. For the sake of simplicity, Figures 3 and 4 show the execution unit of the calculation of interest (FMA in Figure 4) and the circuit elements associated with that execution unit.
[0059] The arithmetic processing unit 10B may have other execution units EX, such as integer arithmetic circuits or arithmetic logic units (ALUs). In this case, the shortening detection unit shtdet1 is provided in common to the multiple execution units EX, and a shortening detection unit shtdet2 is provided for each execution unit EX.
[0060] The output control unit OUTCNT outputs an enable signal en1 of "0" and an enable signal en2 of "1" when it receives a valid flag valid of "1" and a shortening flag sht of "1". The output control unit OUTCNT likewise outputs an enable signal en1 of "0" and an enable signal en2 of "1" when it receives a valid flag valid of "1" and a shortening detection signal sht2 of "1". In addition, the output control unit OUTCNT outputs an enable signal en1 of "1" and an enable signal en2 of "0" when it receives a valid flag valid of "1" and both the shortening flag sht and the shortening detection signal sht2 are "0".
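The behavior of the output control unit OUTCNT in paragraph [0060] amounts to a small truth table, sketched below. The function name outcnt is illustrative. The description does not state the outputs when valid is "0"; returning (0, 0) in that case is an assumption of this sketch.

```python
from typing import Tuple


def outcnt(valid: int, sht: int, sht2: int) -> Tuple[int, int]:
    """Return (en1, en2) from the valid flag and the signals sht and sht2."""
    if valid != 1:
        # Assumption: no bypass is enabled for an invalid entry.
        return (0, 0)
    if sht == 1 or sht2 == 1:
        # Early termination detected: enable the intermediate-stage bus B2.
        return (0, 1)
    # No early termination: enable the final-stage bus B1.
    return (1, 0)
```

For example, outcnt(1, 0, 1) gives (0, 1), and outcnt(1, 0, 0) gives (1, 0).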
[0061] The enable signal en2 is transferred via bus B2 to comparators C2, C4 and register file RF. The enable signal en1 is transferred via the flip-flop FF located in the adder fadd and bus B1 to comparators C1, C3 and register file RF.
[0062] Comparator C1, upon receiving the enable signal en1 of "1", outputs a signal to multiplexer MUX1 to select the calculation result from bus B1 if tagD from bus B1 and tag1(nexta) from FIFO5 match. Comparator C2, upon receiving the enable signal en2 of "1", outputs a signal to multiplexer MUX1 to select the calculation result from bus B2 if tagD from bus B2 and tag1(nexta) from FIFO5 match.
[0063] If multiplexer MUX1 receives a selection signal from comparator C1, it selects the calculation result from bus B1 and outputs it to flip-flop FF1. If multiplexer MUX1 receives a selection signal from comparator C2, it selects the calculation result from bus B2 and outputs it to flip-flop FF1. If multiplexer MUX1 does not receive a selection signal from either comparator C1 or C2, it selects the data nexta from FIFO4 (i.e., the data DT from register file RF) and outputs it to flip-flop FF1.
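The operand selection performed by comparators C1 and C2 together with multiplexer MUX1 can be sketched as follows. The class Bus and the function select_operand are hypothetical names. The description does not state which bus wins if both comparators match in the same cycle; giving priority to bus B1 here is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Bus:
    en: int        # enable signal (en1 on bus B1, en2 on bus B2)
    tag_d: int     # destination tag tagD of the result on the bus
    result: float  # calculation result on the bus


def select_operand(b1: Bus, b2: Bus, next_tag: int, next_data: float) -> float:
    """Model MUX1: choose a bypassed result or the register-file data."""
    c1 = b1.en == 1 and b1.tag_d == next_tag  # comparator C1
    c2 = b2.en == 1 and b2.tag_d == next_tag  # comparator C2
    if c1:
        return b1.result   # bypass from bus B1 (priority is an assumption)
    if c2:
        return b2.result   # bypass from bus B2
    return next_data       # data nexta read from the register file RF
```

A tag match on a bus whose enable signal is "0" is ignored, so the register-file data is selected in that case.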
[0064] Comparator C3, upon receiving the enable signal en1 of "1", outputs a signal to multiplexer MUX2 to select the calculation result from bus B1 if tagD from bus B1 and tag1(nextb) from FIFO5 match. Comparator C4, upon receiving the enable signal en2 of "1", outputs a signal to multiplexer MUX2 to select the calculation result from bus B2 if tagD from bus B2 and tag1(nextb) from FIFO5 match.
[0065] The output control unit OUTCNT and comparators C1-C4 are an example of a bypass control unit that bypasses the calculation result of an intermediate stage to the input of the execution unit EX when one or both of the shortening detection units shtdet1 and shtdet2 detect early termination. In this embodiment, when one or both of the shortening detection units shtdet1 and shtdet2 detect early termination, the output control unit OUTCNT controls comparators C1-C4 to bypass the multiplication result of the floating-point multiplier fmul.
[0066] If the multiplexer MUX2 receives a selection signal from comparator C3, it selects the calculation result from bus B1 and outputs it to flip-flop FF2. If the multiplexer MUX2 receives a selection signal from comparator C4, it selects the calculation result from bus B2 and outputs it to flip-flop FF2. If the multiplexer MUX2 does not receive a selection signal from either comparator C3 or C4, it selects the data nextb from FIFO4 (i.e., the data DT from register file RF) and outputs it to flip-flop FF2.
[0067] The shortening detection unit shtdet1 of the instruction decoder ID may be omitted. In that case, the queue Q that holds the shortening flag sht (shown by the dashed box) and the flip-flop FF and FIFO3 that transfer the shortening flag sht are not provided. Furthermore, the output control unit OUTCNT controls the bypass at the shortened execution latency based solely on the detection result of the shortening detection unit shtdet2.
[0068] The arithmetic processing unit 10B may be a SIMD processor capable of executing SIMD arithmetic instructions. In this case, the arithmetic processing unit 10B has multiple execution units EX that can execute a single instruction in parallel using different data from each other. The register file RF has multiple registers that hold data used by the multiple execution units EX.
[0069] In the SIMD processor, the shortening detection unit shtdet1 is provided in common to the multiple execution units EX and detects early termination based on the instruction issued to the execution units EX. A shortening detection unit shtdet2 is provided for each execution unit EX and detects early termination for that execution unit EX. In addition, multiple sets of comparators C1 and C2, multiple multiplexers MUX1, multiple sets of comparators C3 and C4, multiple multiplexers MUX2, and output control units OUTCNT are provided corresponding to the respective execution units EX.
[0070] Figures 5 and 6 show an example of the main components of another arithmetic processing unit. Elements similar to those in Figures 3 and 4 are denoted by the same reference numerals, and detailed explanations are omitted. The arithmetic processing unit 20 shown in Figures 5 and 6 is similar in configuration to the arithmetic processing unit 10B shown in Figures 3 and 4, but with FIFO1-FIFO5 and the circuits related to the shortening flag sht and the shortening detection signal sht2 removed.
[0071] The execution latency of the FMA installed in the arithmetic processing unit 20 is fixed at "4" corresponding to logic circuits LGC1-LGC4, and there is no reduced execution latency. The only bypass path for the calculation results output from the execution unit EX is bus B1 connected to the output of logic circuit LGC4. Furthermore, since the arithmetic processing unit 20 does not employ an OoS method using FIFO1-FIFO5, bypassing of the calculation results is performed at only one point by the multiplexer MUX2.
[0072] Figures 7 and 8 show an example of the main components of another arithmetic processing unit. Elements similar to those in Figures 3 and 4 are denoted by the same reference numerals, and detailed explanations are omitted. The arithmetic processing unit 30 shown in Figures 7 and 8 is similar in configuration to the arithmetic processing unit 10B shown in Figures 3 and 4, but with the circuits related to the shortening flag sht and the shortening detection signal sht2 removed.
[0073] The execution latency of the FMA mounted on the arithmetic processing unit 30 is fixed to "4" corresponding to logic circuits LGC1-LGC4, similar to the arithmetic processing unit 20 shown in Figures 5 and 6, and there is no reduced execution latency. The only bypass path for the calculation results output from the execution unit EX is bus B1 connected to the output of logic circuit LGC4. However, since the arithmetic processing unit 30 employs an OoS method using FIFO1-FIFO5 (excluding FIFO3), bypassing of the calculation results can be performed at two locations by multiplexers MUX1 and MUX2.
[0074] Figure 9 shows an example of pipeline operation when performing a floating-point multiply-accumulate operation. In the example shown in Figure 9, the preceding floating-point multiply-accumulate operation fmadd and subsequent dependent instructions that depend on this floating-point multiply-accumulate operation fmadd are executed sequentially. The rectangles in the figure represent cycles (i.e., stages). I0-I4 represent the instruction issuance stage, and E0-E3 represent the instruction execution stage. The execution stages E0 and E1 of the preceding instruction represent the execution cycle of the multiplication instruction by the floating-point multiplier fmul. The execution stages E2 and E3 of the preceding instruction represent the execution cycle of the addition instruction by the floating-point adder fadd.
[0075] The instruction scheduler IS of the arithmetic processing units 20 and 30 issues the preceding instruction (fmadd) and then issues the subsequent dependent instruction after the execution latency ("4") of the floating-point multiply-accumulate instruction fmadd. The instruction scheduler IS of the arithmetic processing unit 10B issues the preceding instruction (fmadd) and then issues the subsequent dependent instruction after the shortened execution latency ("2") of the floating-point multiply-accumulate instruction fmadd.
[0076] If "c" in the preceding instruction fmadd(a*b+c) is not "0", the calculation result is obtained by execution stage E3. In arithmetic processing units 20 and 30, the calculation result from the preceding instruction's execution stage E3 is bypassed to the execution stage E0 of the subsequent dependent instruction in the next cycle after the preceding instruction's execution stage E3. In arithmetic processing unit 10B, the calculation result of the preceding instruction is not obtained in the next cycle after the issuance stage I4 of the subsequent dependent instruction. Therefore, the subsequent dependent instruction is stalled for 2 cycles by the secondary scheduler using FIFO1-FIFO5.
[0077] The calculation result from the preceding instruction's execution stage E3 is then bypassed to the subsequent dependent instruction's execution stage E0 in the cycle following the preceding instruction's execution stage E3. Therefore, even when the subsequent dependent instruction is issued to match the timing at which the calculation result of the preceding instruction would be bypassed with the shortened execution latency, the subsequent dependent instruction's execution stage E0 can receive the bypassed calculation result. As a result, malfunctions of the arithmetic processing unit 10B can be suppressed.
[0078] If "c" in the preceding instruction fmadd(a*b+c) is "0", the calculation result is obtained by execution stage E1. In arithmetic processing units 20 and 30, the calculation result of the preceding instruction is bypassed to the execution stage E0 of the subsequent dependent instruction in the cycle following the execution stage E3 of the preceding instruction. In contrast, arithmetic processing unit 10B can bypass the calculation result of the preceding instruction to the execution stage E0 of the subsequent dependent instruction in the cycle following the execution stage E1 of the preceding instruction. As a result, arithmetic processing unit 10B can output the calculation result of the subsequent dependent instruction two cycles earlier than arithmetic processing units 20 and 30, thereby improving the processing efficiency of the calculation.
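The timing relationships of Figure 9 described in paragraphs [0075] to [0078] can be summarized with a simple cycle-count sketch. The function name dependent_e0_cycle and the convention that the preceding instruction's stage E0 occupies cycle 0 are assumptions for illustration; the latencies "4" and "2" and the 2-cycle stall are taken from the description.

```python
WORST_LATENCY = 4  # E0-E3: multiplication followed by addition
SHORT_LATENCY = 2  # E0-E1: multiplication only


def dependent_e0_cycle(unit: str, c_is_zero: bool, e0_cycle: int = 0) -> int:
    """Return the cycle in which the dependent instruction enters stage E0.

    e0_cycle is the cycle of the preceding fmadd instruction's stage E0.
    """
    if unit in ("20", "30"):
        # Fixed latency: the result is always bypassed after stage E3.
        return e0_cycle + WORST_LATENCY
    if unit == "10B":
        # Issued for the shortened latency; stalled in the FIFOs for
        # 2 cycles when c is not zero and the result is not ready yet.
        stall = 0 if c_is_zero else WORST_LATENCY - SHORT_LATENCY
        return e0_cycle + SHORT_LATENCY + stall
    raise ValueError(f"unknown unit: {unit}")
```

Under these assumptions, unit 10B starts the dependent instruction at cycle 2 when c is "0" and at cycle 4 otherwise, whereas units 20 and 30 always start it at cycle 4.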
[0079] Figure 10 shows an example of pipeline operation from a different perspective when performing a floating-point multiply-accumulate operation. A detailed explanation of the operation, which is similar to that in Figure 9, is omitted. In Figure 10, as in Figure 9, the preceding floating-point multiply-accumulate operation fmadd and the subsequent dependent instructions that have a dependency on this floating-point multiply-accumulate operation fmadd are executed sequentially.
[0080] However, in Figure 10, the issuance stage consists of two stages, I0 and I1, while the execution stage for subsequent dependent instructions is one stage, E0. The execution stages E0 and E1 of preceding instructions represent the execution cycle of a multiplication instruction by the floating-point multiplier fmul. The execution stages E2 and E3 of preceding instructions represent the execution cycle of an addition instruction by the floating-point adder fadd.
[0081] If the unit always operates with the worst-case execution latency and "c" in the preceding instruction fmadd(a*b+c) is "0", the calculation result of the preceding instruction is bypassed only after the execution cycles E2 and E3 of the addition, causing the subsequent dependent instruction to wait unnecessarily.
[0082] On the other hand, if the unit always operates with the shortened execution latency and "c" in the preceding instruction fmadd(a*b+c) is not "0", the subsequent dependent instruction acquires its source operand before the preceding instruction has finished executing. As a result, the subsequent dependent instruction is canceled and reissued. Since canceling and reissuing the subsequent dependent instruction significantly degrades instruction processing performance, always operating with the shortened execution latency is undesirable.
[0083] In contrast, the arithmetic processing unit 10B shown in Figures 3 and 4 can wait for the bypass by stalling in FIFO1-FIFO5, even when instructions are issued with the shortened execution latency. As a result, even if "c" in the preceding instruction fmadd(a*b+c) is not "0", the cancellation and reissuance of the subsequent dependent instruction can be suppressed. Furthermore, the arithmetic processing unit 10B can bypass the calculation result of the preceding instruction with the shortened execution latency if "c" in the preceding instruction fmadd(a*b+c) is "0".
[0084] As described above, this embodiment can also obtain the same effects as the embodiments described above. For example, the shortening detection unit shtdet1 can determine whether to bypass the calculation result with shortened execution latency or with worst-case execution latency, depending on the calculation instruction issued to the execution unit EX. The shortening detection unit shtdet2 can determine whether to bypass the calculation result with shortened execution latency or with worst-case execution latency, depending on the data generated inside the execution unit EX. As a result, if the calculation result can be output with shortened execution latency, the calculation result can be bypassed with shortened execution latency.
[0085] For example, when executing a multiplication instruction with FMA, the calculation result can be bypassed using reduced execution latency. Also, when executing a multiply-accumulate operation (a*b+c) with FMA, if "c" is "0", the calculation result can be bypassed using reduced execution latency. As a result, the processing performance of the arithmetic processing unit 10B can be improved compared to when the reduced execution latency bypass is not used.
[0086] Furthermore, in this embodiment, shortening detection units shtdet1 and shtdet2 are provided in the instruction decoder ID and the logic circuit LGC2, respectively. This makes it possible to determine whether the calculation result rsltb can be bypassed based on both the instruction and the source operand. As a result, the frequency with which the calculation result is bypassed at the shortened execution latency can be increased, and the processing performance of the arithmetic processing unit 10B can be further improved.
[0087] Figures 11 and 12 show an example of the main components of an arithmetic processing unit in another embodiment. Elements similar to those in Figures 3 and 4 are denoted by the same reference numerals, and detailed descriptions are omitted. The arithmetic processing unit 10C shown in Figures 11 and 12 is, for example, a processor such as a CPU.
[0088] The arithmetic processing unit 10C has the same configuration as the arithmetic processing unit 10B shown in Figures 3 and 4, except that it has three multiplexers MUX3 (MUX31, MUX32, MUX33 (Figure 12)) that aggregate buses B1 and B2 shown in Figure 12 onto bus B3. Each multiplexer MUX3 selects the calculation result from the logic circuit LGC2, which is an intermediate stage, when the shortening detection unit shtdet1 or the shortening detection unit shtdet2 detects the early termination of the calculation by the execution unit EX. Each multiplexer MUX3 then transfers the selected calculation result to the input of the execution unit EX and to the register file RF. Each multiplexer MUX3 is an example of a selection unit that selects the calculation result from an intermediate stage or the calculation result from the final stage and outputs it to the input of the execution unit.
[0089] For example, if the enable signal en1 is "1", the multiplexer MUX31 outputs the calculation result that would be transferred to bus B1 to bus B3, and if the enable signal en1 is "0", it outputs the calculation result that would be transferred to bus B2 to bus B3. If the enable signal en1 is "1", the multiplexer MUX32 outputs the tagD that would be transferred to bus B1 to bus B3, and if the enable signal en1 is "0", it outputs the tagD that would be transferred to bus B2 to bus B3.
[0090] If the enable signal en1 is "1", the multiplexer MUX33 outputs the enable signal en1, which is transferred to bus B1, as the enable signal en to bus B3. Also, if the enable signal en1 is "0", the multiplexer MUX33 outputs the enable signal en2, which is transferred to bus B2, as the enable signal en to bus B3.
[0091] Furthermore, when both enable signals en1 and en2 are "1", each multiplexer MUX3 transfers data and information that would normally be transferred to bus B1 to bus B3. In other words, when both enable signals en1 and en2 are "1", the bypass operation due to worst-case execution latency takes precedence.
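The selection rule of multiplexers MUX31 to MUX33 in paragraphs [0089] to [0091] can be sketched as a single function. The function name mux3 and the tuple layout (result, tagD) are hypothetical; the priority of bus B1 when both enable signals are "1" follows the description.

```python
from typing import Tuple

Payload = Tuple[float, int]  # (calculation result, tagD)


def mux3(en1: int, en2: int, b1: Payload, b2: Payload) -> Tuple[float, int, int]:
    """Return (result, tagD, en) driven onto bus B3."""
    if en1 == 1:
        # Final-stage result wins, including when en2 is also 1.
        return (b1[0], b1[1], en1)
    # Otherwise forward the intermediate-stage side; en is en2.
    return (b2[0], b2[1], en2)
```

Because every selection is keyed on en1 alone, the case where both en1 and en2 are "1" resolves to the worst-case-latency path without extra logic.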
[0092] Comparators C1 and C3 are the same as those in Figures 3 and 4, except that they receive the enable signal en instead of the enable signal en1. If multiplexer MUX1 receives a selection signal from comparator C1, it selects the calculation result from bus B3. If multiplexer MUX1 does not receive a selection signal from comparator C1, it selects the data nexta from FIFO4 (i.e., the data DT from register file RF).
[0093] If the multiplexer MUX2 receives a selection signal from comparator C3, it selects the calculation result from bus B3. If the multiplexer MUX2 does not receive a selection signal from comparator C3, it selects the data nextb from FIFO4 (i.e., the data DT from register file RF).
[0094] As described above, this embodiment also provides the same effects as the embodiments described above. For example, the shortening detection units shtdet1 and shtdet2 can determine whether to bypass with the shortened execution latency or with the worst-case execution latency. Accordingly, if the calculation result can be output with the shortened execution latency, it can be bypassed with the shortened execution latency. As a result, the processing performance of the arithmetic processing unit 10C can be improved compared to the case where bypassing with the shortened execution latency is not performed.
[0095] Furthermore, in this embodiment, by providing the multiplexer MUX3, buses B1 and B2 can be consolidated into bus B3, reducing the wiring area compared to Figures 3 and 4. Also, compared to Figures 3 and 4, comparators C2 and C4 can be removed, reducing the number of write ports for register file RF. As a result, the circuit size of the arithmetic processing unit 10C can be made smaller than that of the arithmetic processing unit 10B.
[0096] Figures 13 and 14 show an example of the main components of an arithmetic processing unit in another embodiment. Elements similar to those in Figures 3 and 4 are denoted by the same reference numerals, and detailed descriptions are omitted. The arithmetic processing unit 10D shown in Figures 13 and 14 is, for example, a processor such as a CPU.
[0097] The arithmetic processing unit 10D has a configuration similar to that of the arithmetic processing unit 10B shown in Figures 3 and 4, but with the circuitry related to the shortening flag sht removed. Therefore, the instruction decoder ID shown in Figure 13 does not have the shortening detection unit shtdet1 shown in Figure 3. Whether to bypass the calculation result with the shortened execution latency or with the worst-case execution latency is determined by the shortening detection unit shtdet2 provided in the logic circuit LGC2.
[0098] The output control unit OUTCNT outputs an enable signal en1 of "0" and an enable signal en2 of "1" when it receives a valid flag valid of "1" and a shortening detection signal sht2 of "1". Similarly, the output control unit OUTCNT outputs an enable signal en1 of "1" and an enable signal en2 of "0" when it receives a valid flag valid of "1" and a shortening detection signal sht2 of "0". As a result, the arithmetic processing unit 10D can bypass the calculation result with the shortened execution latency if "c" in the preceding instruction fmadd(a*b+c) is "0". The instruction scheduler IS shown in Figure 13 issues the preceding instruction (fmadd) and then issues the subsequent dependent instruction after the shortened execution latency ("2") of the floating-point multiply-accumulate instruction.
[0099] As described above, this embodiment also provides the same effects as the embodiments described above. For example, the shortening detection unit shtdet2 can determine whether to bypass with the shortened execution latency or with the worst-case execution latency. Accordingly, if the calculation result can be output with the shortened execution latency, it can be bypassed with the shortened execution latency. As a result, the processing performance of the arithmetic processing unit 10D can be improved compared to the case where bypassing with the shortened execution latency is not performed.
[0100] Figures 15 and 16 show an example of the main components of a processing unit in another embodiment. Elements similar to those in Figures 3 and 4 are denoted by the same reference numerals, and detailed descriptions are omitted. The processing unit 10E shown in Figures 15 and 16 is, for example, a processor such as a CPU.
[0101] The arithmetic processing unit 10E has a configuration similar to that of the arithmetic processing unit 10B shown in Figures 3 and 4, but with FIFO1-FIFO5 and the circuits related to the shortening detection signal sht2 removed. Therefore, the logic circuit LGC2 shown in Figure 16 does not have the shortening detection unit shtdet2 shown in Figure 4. Whether to bypass the calculation result with the shortened execution latency or with the worst-case execution latency is determined by the shortening detection unit shtdet1 provided in the instruction decoder ID.
[0102] The output control unit OUTCNT outputs an enable signal en1 of "0" and an enable signal en2 of "1" when it receives a valid flag valid of "1" and a shortening flag sht of "1". Similarly, the output control unit OUTCNT outputs an enable signal en1 of "1" and an enable signal en2 of "0" when it receives a valid flag valid of "1" and a shortening flag sht of "0".
[0103] As a result, when the shortening detection unit shtdet1 detects that a floating-point multiplication instruction is to be executed in the floating-point multiply-accumulate circuit, the arithmetic processing unit 10E can bypass the calculation result with the shortened execution latency. The instruction scheduler IS shown in Figure 15 issues the preceding instruction (fmadd), and then issues the subsequent dependent instruction after the shortened execution latency ("2") of the floating-point multiply-accumulate instruction.
[0104] As described above, this embodiment also provides the same effects as the embodiments described above. For example, the shortening detection unit shtdet1 can determine whether to bypass the calculation result with the shortened execution latency or with the worst-case execution latency. Accordingly, if the calculation result can be output with the shortened execution latency, it can be bypassed with the shortened execution latency. As a result, the processing performance of the arithmetic processing unit 10E can be improved compared to the case where the calculation result is not bypassed with the shortened execution latency.
[0105] Figures 17 and 18 show an example of the main components of an arithmetic processing unit in another embodiment. Elements similar to those in Figures 3 and 4 are denoted by the same reference numerals, and detailed descriptions are omitted. The arithmetic processing unit 10F shown in Figures 17 and 18 is, for example, a processor such as a CPU.
[0106] The arithmetic processing unit 10F has a configuration similar to that of the arithmetic processing unit 10B shown in Figures 3 and 4, but with FIFO1-FIFO5 and the circuitry related to the shortening detection signal sht2 removed. Therefore, the execution unit EX shown in Figure 18 does not have the shortening detection unit shtdet2 shown in Figure 4. Whether to bypass the calculation result with the shortened execution latency or with the worst-case execution latency is determined by the shortening detection unit shtdet1 provided in the instruction decoder ID.
[0107] The execution unit EX of the arithmetic processing unit 10F has an arithmetic logic unit ALU that includes serially connected logic circuits LGC1 and LGC2 capable of performing shift operations. Logic circuit LGC1 is an example of a shift circuit that handles shift amounts from "1" to "7". Logic circuit LGC2 is an example of a shift circuit that handles shift amounts from "8" to "63".
[0108] The shortening detection unit shtdet1 of the instruction decoder ID sets the shortening flag sht to "1" if, in a shift instruction whose shift amount is given as an immediate value, the shift amount is between "1" and "7", and sets the shortening flag sht to "0" if the shift amount is between "8" and "63". The shortening flag sht set by the shortening detection unit shtdet1 is transferred to the output control unit OUTCNT.
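The decode-time check on the immediate shift amount can be sketched as follows. The function name shtdet1_shift is illustrative. The description covers shift amounts from "1" to "63" only; rejecting any other value with an exception is an assumption of this sketch.

```python
def shtdet1_shift(shift_amount: int) -> int:
    """Return the flag sht for a shift instruction with an immediate amount."""
    if 1 <= shift_amount <= 7:
        return 1  # result is available at the output of LGC1
    if 8 <= shift_amount <= 63:
        return 0  # result requires LGC2 as well
    # Assumption: amounts outside 1..63 are not handled by this sketch.
    raise ValueError(f"unsupported shift amount: {shift_amount}")
```

For example, shtdet1_shift(7) returns 1 and shtdet1_shift(8) returns 0.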
[0109] The output control unit OUTCNT outputs an enable signal en1 of "0" and an enable signal en2 of "1" when it receives a valid flag valid of "1" and a shortening flag sht of "1". Similarly, the output control unit OUTCNT outputs an enable signal en1 of "1" and an enable signal en2 of "0" when it receives a valid flag valid of "1" and a shortening flag sht of "0".
[0110] As a result, the arithmetic processing unit 10F can bypass the calculation result of a shift instruction with the shortened execution latency if the shift amount of the preceding shift instruction is between "1" and "7". The arithmetic processing unit 10F bypasses the calculation result of a shift instruction with the worst-case execution latency if the shift amount of the preceding shift instruction is between "8" and "63". The instruction scheduler IS shown in Figure 17 issues a preceding instruction (shift instruction) and then issues a subsequent dependent instruction after the shortened execution latency ("1") of the shift instruction. Normally, arithmetic logic units (ALUs) are implemented so that an operation completes in one cycle, so the performance benefit is considered to be small. However, an improvement in operating frequency, for example, can be expected.
[0111] As described above, this embodiment also provides the same effects as the embodiments described above. For example, the shortening detection unit shtdet1 can determine whether to bypass the calculation result with shortened execution latency or with worst-case execution latency. This allows the calculation result to be bypassed with shortened execution latency if it is possible to output the calculation result with shortened execution latency.
[0112] For example, in a shift instruction where the shift amount is given as an immediate value, if the shift amount is small and the calculation result can be obtained by the logic circuit LGC1, the calculation result can be bypassed with reduced execution latency. As a result, the processing performance of the arithmetic processing unit 10F can be improved compared to when the bypass with reduced execution latency is not performed.
[0113] Figures 19 and 20 show an example of the main components of a processing unit in another embodiment. Elements similar to those in Figures 3 and 4 are denoted by the same reference numerals, and detailed descriptions are omitted. The processing unit 10G shown in Figures 19 and 20 is, for example, a processor such as a CPU.
[0114] The arithmetic processing unit 10G has a configuration similar to that of the arithmetic processing unit 10B shown in Figures 3 and 4, but with FIFO1-FIFO5 and the circuit related to the shortening detection signal sht2 removed. Therefore, the logic circuit LGC2 shown in Figure 20 does not have the shortening detection unit shtdet2 shown in Figure 4. Whether to bypass the calculation result with the shortened execution latency or with the worst-case execution latency is determined by the shortening detection unit shtdet1 provided in the instruction decoder ID.
[0115] The execution unit EX of the arithmetic processing unit 10G has floating-point arithmetic circuits including logic circuits LGC1, LGC2, and LGC3. Logic circuits LGC1 and LGC2 operate as floating-point arithmetic units that perform floating-point calculations. Logic circuit LGC3 operates as a rounding circuit that performs rounding by adding "+1" to the calculation result.
[0116] For example, the arithmetic processing unit 10G adopts IEEE 754 (the standard for floating-point arithmetic). IEEE 754 has four rounding modes: "round to nearest," "round to zero," "round to +infinity," and "round to -infinity." Of these four rounding modes, "round to zero" does not involve the addition of "+1."
[0117] The shortening detection unit shtdet1 of the instruction decoder ID detects early termination when decoding a floating-point arithmetic instruction whose rounding mode is "round to zero" and sets the shortening flag sht to "1". The shortening detection unit shtdet1 sets the shortening flag sht to "0" when decoding a floating-point arithmetic instruction whose rounding mode is other than "round to zero".
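The rounding-mode check can be sketched in the same way. The enum RoundingMode and the function shtdet1_round are hypothetical names; the four modes are those listed in the description.

```python
from enum import Enum, auto


class RoundingMode(Enum):
    """The four rounding modes listed in the description."""
    NEAREST = auto()         # round to nearest
    TOWARD_ZERO = auto()     # round to zero
    TOWARD_POS_INF = auto()  # round to +infinity
    TOWARD_NEG_INF = auto()  # round to -infinity


def shtdet1_round(mode: RoundingMode) -> int:
    """Return the flag sht for a floating-point arithmetic instruction.

    Only "round to zero" never needs the "+1" step of the rounding
    circuit LGC3, so only that mode allows the rounding stage to be skipped.
    """
    return 1 if mode is RoundingMode.TOWARD_ZERO else 0
```

For example, shtdet1_round(RoundingMode.TOWARD_ZERO) returns 1, and the other three modes return 0.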
[0118] This allows the calculation result to be bypassed with reduced execution latency when executing a floating-point multiplication instruction with rounding mode set to "round to zero". The instruction scheduler IS shown in Figure 19 issues the preceding instruction (floating-point arithmetic instruction), and then issues the subsequent dependent instruction after the reduced execution latency of the floating-point operation ("2").
[0119] As described above, this embodiment also provides the same effects as the embodiments described above. For example, the shortening detection unit shtdet1 can determine whether to bypass the calculation result with shortened execution latency or with worst-case execution latency. This allows the calculation result to be bypassed with shortened execution latency if it is possible to output the calculation result with shortened execution latency.
[0120] For example, when decoding a floating-point arithmetic instruction with the rounding mode set to "round to zero," the calculation result can be bypassed with the shortened execution latency. As a result, the processing performance of the arithmetic processing unit 10G can be improved compared to the case where bypassing with the shortened execution latency is not performed.
[0121] Figures 21 and 22 show an example of the main components of an arithmetic processing unit in another embodiment. Elements similar to those in Figures 3 and 4 are denoted by the same reference numerals, and detailed descriptions are omitted. The arithmetic processing unit 10H shown in Figures 21 and 22 is, for example, a processor such as a CPU.
[0122] The arithmetic processing unit 10H omits the circuitry related to the shortening flag sht from the arithmetic processing unit 10B shown in Figures 3 and 4. The instruction decoder ID shown in Figure 21 therefore does not have the shortening detection unit shtdet1 shown in Figure 3. The shortening detection unit shtdet2 is mounted in the logic circuit LGC4.
[0123] The arithmetic processing unit 10H adds a logic circuit LGC5 to the floating-point multiply-accumulate circuit FMA of the arithmetic processing unit 10B shown in Figures 3 and 4. The logic circuit LGC5 operates as a denormalized number processing circuit that performs processing when the calculation result of the logic circuits LGC1-LGC4 is a denormalized number. A denormalized number is a floating-point number that represents a value close to "0" that is too small to be represented in normalized form. The other configurations of the arithmetic processing unit 10H are the same as those of the arithmetic processing unit 10B shown in Figures 3 and 4.
[0124] In the floating-point multiply-accumulate circuit FMA of this embodiment, the reduced execution latency is set to 4 cycles, which is the time required for the multiply-accumulate operation (fmul + fadd), and the worst-case execution latency is set to 5 cycles, which is the time required for the multiply-accumulate operation plus the denormalized number processing. The shortening detection unit shtdet2 detects early termination and sets the shortening flag sht2 to "1" if the multiply-accumulate result obtained by the logic circuit LGC4 is not a denormalized number. The shortening detection unit shtdet2 sets the shortening flag sht2 to "0" if the multiply-accumulate result obtained by the logic circuit LGC4 is a denormalized number.
[0125] This allows the calculation result to be bypassed with the reduced execution latency if the multiply-accumulate result is not a denormalized number. The instruction scheduler IS shown in Figure 21 issues the preceding instruction (a multiply-accumulate instruction) and then issues the subsequent dependent instruction once the reduced execution latency ("4") of the floating-point multiply-accumulate instruction has elapsed.
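As an illustration of the detection condition, the following sketch tests whether a binary64 result is a denormalized number and selects the 4-cycle or 5-cycle latency accordingly. The function names are hypothetical, binary64 is an assumed format, and the Python expression a * b + c rounds twice rather than performing a true fused operation, so the sketch models only the flag logic, not the arithmetic of the FMA.

```python
import struct

SHORT_LATENCY = 4  # multiply-accumulate only (from the text)
WORST_LATENCY = 5  # plus denormalized number processing (from the text)


def is_denormal(x: float) -> bool:
    """True if x is a nonzero binary64 value with an all-zero exponent field."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    return exponent == 0 and mantissa != 0


def shtdet2_fma(result: float) -> int:
    """Model of shtdet2: sht2 is 1 unless the result is a denormalized number."""
    return 0 if is_denormal(result) else 1


def fma_with_latency(a: float, b: float, c: float):
    """Return (result, sht2, latency) for a * b + c (not fused; illustration only)."""
    result = a * b + c
    sht2 = shtdet2_fma(result)
    return result, sht2, SHORT_LATENCY if sht2 == 1 else WORST_LATENCY
```

Unlike the decode-time check, this condition depends on the operand values, so it can only be evaluated inside the execution unit once the result of the logic circuit LGC4 is known.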
[0126] As described above, this embodiment also provides the same effects as the embodiments described above. For example, the shortening detection unit shtdet2 can determine whether to bypass the calculation result with shortened execution latency or with worst-case execution latency. This allows the calculation result to be bypassed with shortened execution latency if it is possible to output the calculation result with shortened execution latency.
[0127] For example, if the result of a multiply-accumulate operation is not a denormalized number, the calculation result can be bypassed with the reduced execution latency. As a result, the processing performance of the arithmetic processing unit 10H can be improved compared to when the bypass with the reduced execution latency is not performed.
[0128] Figures 23 and 24 show an example of the main components of an arithmetic processing unit in another embodiment. Elements similar to those in Figures 3 and 4 are denoted by the same reference numerals, and detailed descriptions are omitted. The arithmetic processing unit 10I shown in Figures 23 and 24 is, for example, a processor such as a CPU.
[0129] The arithmetic processing unit 10I omits the circuitry related to the shortening flag sht from the arithmetic processing unit 10B shown in Figures 3 and 4. The instruction decoder ID shown in Figure 23 therefore does not have the shortening detection unit shtdet1 shown in Figure 3.
[0130] The execution unit EX of the arithmetic processing unit 10I has a floating-point arithmetic circuit including logic circuits LGC1, LGC2, and LGC3 that execute floating-point arithmetic instructions, similar to Figure 20. Logic circuits LGC1 and LGC2 operate as floating-point arithmetic units that perform floating-point operations. Logic circuit LGC3 operates as a rounding circuit that performs rounding by adding "+1" to the calculation result.
[0131] The shortening detection unit shtdet2 detects early termination and sets the shortening flag sht2 to "1" if no "+1" addition occurs during the rounding process of the floating-point operation. The shortening detection unit shtdet2 sets the shortening flag sht2 to "0" if a "+1" addition occurs during the rounding process of the floating-point operation.
[0132] This allows the calculation result to be bypassed with the reduced execution latency if no "+1" addition occurs during rounding in the floating-point operation. The instruction scheduler IS shown in Figure 23 issues the preceding instruction (a floating-point arithmetic instruction) and then issues the subsequent dependent instruction once the reduced execution latency ("2") of the floating-point arithmetic instruction has elapsed.
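The condition "no +1 addition occurs" can be illustrated with the following sketch, which assumes round-to-nearest-even applied to an integer significand carrying extra low-order bits. The rounding mode, the integer representation, and all names are assumptions made for illustration; the text does not specify how the rounding circuit LGC3 decides on the increment.

```python
def rounding_increment(wide: int, drop_bits: int) -> int:
    """Return 1 if round-to-nearest-even adds +1 when dropping drop_bits low bits."""
    if drop_bits == 0:
        return 0
    truncated = wide >> drop_bits
    remainder = wide & ((1 << drop_bits) - 1)
    half = 1 << (drop_bits - 1)
    if remainder > half:
        return 1
    # Exact tie: round up only if the truncated value is odd.
    if remainder == half and (truncated & 1) == 1:
        return 1
    return 0


def shtdet2_round(wide: int, drop_bits: int) -> int:
    """Model of shtdet2: sht2 is 1 when rounding leaves the truncated value unchanged."""
    return 0 if rounding_increment(wide, drop_bits) == 1 else 1


def rounded_value(wide: int, drop_bits: int) -> int:
    """Model of the rounding stage: the truncated value plus the increment."""
    return (wide >> drop_bits) + rounding_increment(wide, drop_bits)
```

When shtdet2_round returns 1, the truncated value is already the final result, which corresponds to the case in which the output of the floating-point arithmetic unit can be bypassed without waiting for the rounding stage.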
[0133] Furthermore, as in Figure 19, the instruction decoder ID may have a shortening detection unit shtdet1 that sets the shortening flag sht to "1" when decoding a floating-point arithmetic instruction with a rounding mode of "round to zero". In this case, the output control circuit OUTCNT operates by receiving the shortening flag sht from the shortening detection unit shtdet1 and the shortening flag sht2 from the shortening detection unit shtdet2, as in Figure 4.
[0134] In this case, when a floating-point arithmetic instruction with the rounding mode set to "round to zero" is executed, the calculation result is bypassed with the reduced execution latency in accordance with the shortening flag sht from the shortening detection unit shtdet1. The shortening detection unit shtdet2 detects early termination and sets the shortening flag sht2 to "1" if no "+1" addition occurs during the rounding process of a floating-point arithmetic instruction with the rounding mode set to "round to zero".
[0135] As described above, this embodiment also provides the same effects as the embodiments described above. For example, the shortening detection unit shtdet2 can determine whether to bypass the calculation result with shortened execution latency or with worst-case execution latency. This allows the calculation result to be bypassed with shortened execution latency if it is possible to output the calculation result with shortened execution latency.
[0136] For example, if a "+1" addition does not occur during rounding in floating-point arithmetic, the calculation result can be bypassed using reduced execution latency. As a result, the processing performance of the arithmetic processing unit 10I can be improved compared to when the bypass using reduced execution latency is not performed.
[0137] Figures 25 and 26 show an example of the main components of an arithmetic processing unit in another embodiment. Elements similar to those in Figures 3 and 4 are denoted by the same reference numerals, and detailed descriptions are omitted. The arithmetic processing unit 10J shown in Figures 25 and 26 is, for example, a processor such as a CPU.
[0138] The arithmetic processing unit 10J omits the circuitry related to the shortening flag sht from the arithmetic processing unit 10B shown in Figures 3 and 4. The instruction decoder ID shown in Figure 25 therefore does not have the shortening detection unit shtdet1 shown in Figure 3. The shortening detection unit shtdet2 is mounted in the logic circuit LGC2.
[0139] The execution unit EX of the arithmetic processing unit 10J has a floating-point addition circuit including logic circuits LGC1, LGC2, and LGC3 that execute floating-point addition instructions. Logic circuits LGC1 and LGC2 operate as floating-point adders that perform addition operations. Logic circuit LGC3 operates as a normalization circuit that performs normalization operations (shift operations for digit alignment) of the addition result.
[0140] The shortening detection unit shtdet2 detects whether the addition result (e.g., a subtraction result) obtained by the logic circuit LGC2 requires the final normalization (shifting) for digit alignment. For example, the shortening detection unit shtdet2 refers to the mantissa to detect whether the result is already normalized. If the two values being subtracted are far apart, the final normalization may not be necessary.
[0141] The shortening detection unit shtdet2 detects early termination and sets the shortening flag sht2 to "1" if normalization processing for digit alignment is not required after calculation by the logic circuit LGC2. The shortening detection unit shtdet2 sets the shortening flag sht2 to "0" if normalization processing for digit alignment is required after calculation by the logic circuit LGC2.
[0142] This allows the calculation result to be bypassed with reduced execution latency when the final normalization process of the addition result is unnecessary. The instruction scheduler IS shown in Figure 25 issues a preceding instruction (floating-point addition instruction), and then issues a subsequent dependent instruction after the reduced execution latency ("2") of the floating-point addition instruction.
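The check on the mantissa can be sketched as follows, assuming a 24-bit significand whose most significant bit must be set for the value to be normalized. The width MANT_BITS, the function names, and the decision to route a zero mantissa through the worst-case path are assumptions for illustration only.

```python
MANT_BITS = 24  # assumption: significand width including the leading bit


def shtdet2_norm(mantissa: int) -> int:
    """Model of shtdet2: sht2 is 1 if the leading bit is already in place."""
    if not 0 <= mantissa < (1 << MANT_BITS):
        raise ValueError("mantissa out of range for this sketch")
    return 1 if mantissa.bit_length() == MANT_BITS else 0


def normalize(mantissa: int):
    """Model of the normalization stage: return (shifted mantissa, shift count)."""
    if mantissa == 0:
        return 0, 0
    shift = MANT_BITS - mantissa.bit_length()
    return mantissa << shift, shift
```

For instance, 0xC00000 - 0x000003 leaves the leading bit set, so no shift is needed, whereas 0x800000 - 0x7FFFFF yields 1 and needs a 23-bit shift; the first case matches the remark that widely separated operands may not require final normalization.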
[0143] As described above, this embodiment also provides the same effects as the embodiments described above. For example, the shortening detection unit shtdet2 can determine whether to bypass the calculation result with shortened execution latency or with worst-case execution latency. This allows the calculation result to be bypassed with shortened execution latency if it is possible to output the calculation result with shortened execution latency.
[0144] For example, in an addition by the floating-point adder, if the final normalization for digit alignment is unnecessary, the calculation result can be bypassed with the reduced execution latency. As a result, the processing performance of the arithmetic processing unit 10J can be improved compared to when the bypass with the reduced execution latency is not performed.
[0145] Figures 27 and 28 show an example of the main components of an arithmetic processing unit in another embodiment. Elements similar to those in Figures 3 and 4 are denoted by the same reference numerals, and detailed descriptions are omitted. The arithmetic processing unit 10K shown in Figures 27 and 28 is, for example, a processor such as a CPU.
[0146] The arithmetic processing unit 10K omits the circuitry related to the shortening flag sht from the arithmetic processing unit 10B shown in Figures 3 and 4. The instruction decoder ID shown in Figure 27 therefore does not have the shortening detection unit shtdet1 shown in Figure 3. The shortening detection unit shtdet2 is mounted in the logic circuit LGC1.
[0147] The execution unit EX of the arithmetic processing unit 10K has a logic operation circuit ALU that includes logic circuits LGC1 and LGC2 capable of executing integer addition instructions. The logic circuit LGC1 operates as an integer adder that performs addition operations. The logic circuit LGC2 operates as a carry processing unit that performs carry propagation from the lower bits (e.g., the lower 16 bits).
[0148] In an adder circuit, the propagation of the carry from the least significant bit is the critical path. Therefore, in Figure 28, the integer adder circuit is implemented so that the number of execution stages differs between the worst-case execution latency case where the carry is propagated and the shortened execution latency case where the carry is not propagated. When the carry is not propagated, the addition result is bypassed using the shortened execution latency.
[0149] For example, when adding two 32-bit numbers, the integer adder circuit shown in Figure 28 adds the lower 16 bits and the upper 16 bits separately in the logic circuit LGC1 and generates a carry for each. When a carry propagates from the lower 16 bits, the integer adder circuit adds "+1" to the upper 16 bits in the logic circuit LGC2.
[0150] The shortening detection unit shtdet2 detects early termination if the carry does not propagate from the lower 16 bits and sets the shortening flag sht2 to "1". In other words, the shortening detection unit shtdet2 detects early termination if carry processing by logic circuit LGC2 is unnecessary after addition by logic circuit LGC1. The shortening detection unit shtdet2 sets the shortening flag sht2 to "0" if the carry propagates from the lower 16 bits.
[0151] This allows the calculation result to be bypassed with reduced execution latency if the carry does not propagate from the lower 16 bits. The instruction scheduler IS shown in Figure 27 issues a preceding instruction (integer addition instruction), and then issues a subsequent dependent instruction after the reduced execution latency (="1") of the integer addition instruction.
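The split addition and the carry-based detection can be sketched as follows. The function name and return layout are hypothetical, and the worst-case latency of 2 cycles is an assumption (the text states only the reduced latency of "1").

```python
MASK16 = 0xFFFF
SHORT_LATENCY = 1  # reduced execution latency given in the text
WORST_LATENCY = 2  # assumption: one extra cycle for carry processing


def split_add32(a: int, b: int):
    """Add two 32-bit values as two 16-bit halves; return (result, sht2, latency)."""
    # Stage modeled on LGC1: the halves are added independently.
    lo = (a & MASK16) + (b & MASK16)
    hi = ((a >> 16) & MASK16) + ((b >> 16) & MASK16)
    carry = lo >> 16
    # Model of shtdet2: early termination when no carry leaves the lower half.
    sht2 = 0 if carry == 1 else 1
    if carry == 1:
        # Stage modeled on LGC2: add +1 to the upper half.
        hi += 1
    result = ((hi & MASK16) << 16) | (lo & MASK16)
    return result, sht2, SHORT_LATENCY if sht2 == 1 else WORST_LATENCY
```

When sht2 is 1, the value assembled from the two independent half sums is already the final 32-bit sum, so the output of the first stage can be bypassed directly.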
[0152] As described above, this embodiment also provides the same effects as the embodiments described above. For example, the shortening detection unit shtdet2 can determine whether to bypass the calculation result with shortened execution latency or with worst-case execution latency. If it is possible to output the calculation result with shortened execution latency, the calculation result can be bypassed with shortened execution latency.
[0153] For example, in an integer adder circuit that performs a 32-bit addition by splitting it into 16-bit additions, if no carry propagates from the lower 16 bits, the calculation result can be bypassed with the reduced execution latency. As a result, the processing performance of the arithmetic processing unit 10K can be improved compared to when the bypass with the reduced execution latency is not performed.
[0154] Figures 29 and 30 show an example of the main components of an arithmetic processing unit in another embodiment. Elements similar to those in Figures 3 and 4 are denoted by the same reference numerals, and detailed descriptions are omitted. The arithmetic processing unit 10L shown in Figures 29 and 30 is, for example, a processor such as a CPU.
[0155] The arithmetic processing unit 10L omits the circuitry related to the shortening flag sht from the arithmetic processing unit 10B shown in Figures 3 and 4. The instruction decoder ID shown in Figure 29 therefore does not have the shortening detection unit shtdet1 shown in Figure 3. The shortening detection unit shtdet2 is mounted in the logic circuit LGC1.
[0156] The execution unit EX of the arithmetic processing unit 10L has a logic operation circuit ALU (shift circuit) that includes series-connected logic circuits LGC1 and LGC2, similar to Figure 18, which perform shift operations. Logic circuit LGC1 processes shift amounts from "1" to "7", and logic circuit LGC2 processes shift amounts from "8" to "63".
[0157] The shortening detection unit shtdet2 detects early termination and sets the shortening flag sht2 to "1" if the shift amount supplied from the register (source operand) is between "1" and "7". The shortening detection unit shtdet2 sets the shortening flag sht2 to "0" if the shift amount supplied from the register (source operand) is between "8" and "63".
[0158] This allows the result of the shift instruction to be bypassed with reduced execution latency when the shift amount specified in the register is small. The instruction scheduler IS shown in Figure 29 issues the preceding instruction (shift instruction), and then issues the subsequent dependent instruction after the reduced execution latency of the shift instruction (="1").
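A two-stage shifter with this detection can be sketched as follows. The sketch assumes a 64-bit left shift in which the first stage applies the low three bits of the shift amount and the second stage applies the remainder in multiples of 8; this decomposition, the function name, and the worst-case latency of 2 cycles are assumptions, since the text states only the ranges handled by each logic circuit and the reduced latency of "1".

```python
MASK64 = (1 << 64) - 1
SHORT_LATENCY = 1  # reduced execution latency given in the text
WORST_LATENCY = 2  # assumption: one extra cycle for the second shift stage


def shift_left64(value: int, amount: int):
    """Two-stage 64-bit left shift; return (result, sht2, latency)."""
    if not 1 <= amount <= 63:
        raise ValueError("this sketch covers shift amounts 1 to 63 only")
    # Model of shtdet2: amounts 1..7 finish in the first stage.
    sht2 = 1 if amount <= 7 else 0
    # Stage modeled on LGC1: shift by the low three bits of the amount.
    stage1 = (value << (amount & 7)) & MASK64
    if sht2 == 1:
        return stage1, sht2, SHORT_LATENCY
    # Stage modeled on LGC2: shift by the remaining multiple of 8.
    stage2 = (stage1 << (amount - (amount & 7))) & MASK64
    return stage2, sht2, WORST_LATENCY
```

Because the shift amount comes from a register rather than an immediate, the flag cannot be computed at decode time, which is consistent with placing shtdet2 in the logic circuit LGC1.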
[0159] As described above, this embodiment also provides the same effects as the embodiments described above. For example, the shortening detection unit shtdet2 can determine whether to bypass the calculation result with shortened execution latency or with worst-case execution latency. This allows the calculation result to be bypassed with shortened execution latency if it is possible to output the calculation result with shortened execution latency.
[0160] For example, in a shift instruction where the shift amount is supplied from a register, if the shift amount is small and the calculation result can be obtained by the logic circuit LGC1, the calculation result can be bypassed with reduced execution latency. As a result, the processing performance of the arithmetic processing unit 10L can be improved compared to when the bypass with reduced execution latency is not performed.
[0161] In the embodiments described above, examples were given in which there is only one shortened execution latency. However, multiple shortened execution latencies may be set. For example, the shift amount of the shift instruction may be divided into three stages, and two shortened execution latencies may be set. That is, if the shortening detection unit shtdet1 or the shortening detection unit shtdet2 detects early termination, the output control circuit OUTCNT may bypass any bypassable calculation result from among the calculation results at the multiple intermediate stages.
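The three-stage variation can be sketched as a simple latency selection. The stage boundaries (7, 31, 63) and the one-cycle-per-stage latencies are purely hypothetical values chosen for illustration; the text says only that the shift amount may be divided into three stages with two shortened execution latencies.

```python
from bisect import bisect_left

# Hypothetical upper limits of the shift amount handled by each stage.
STAGE_LIMITS = (7, 31, 63)


def select_latency(amount: int) -> int:
    """Return the execution latency (in cycles) for a given shift amount."""
    if not 1 <= amount <= STAGE_LIMITS[-1]:
        raise ValueError("shift amount out of range for this sketch")
    # The first stage whose limit covers the amount produces the final result.
    return bisect_left(STAGE_LIMITS, amount) + 1
```

Under these assumed boundaries, latencies 1 and 2 are the two shortened execution latencies and latency 3 is the worst-case execution latency.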
[0162] Furthermore, in the above-described embodiment, an example was explained in which both the calculation results output from the execution unit EX with reduced execution latency and the worst-case execution latency are bypassed. However, it is also possible to bypass only the calculation results output with reduced execution latency, and transfer the calculation results output with worst-case execution latency to the register file RF. Note that the calculation results output with reduced execution latency are also transferred to the register file RF.
[0163] Furthermore, in the embodiments described above, for the sake of simplicity, examples were described in which each arithmetic processing unit has an execution unit EX having one of a floating-point multiply-accumulate circuit FMA, a logic operation circuit ALU, a floating-point arithmetic circuit, or a floating-point addition circuit. However, each arithmetic processing unit may have multiple types of execution units.
[0164] The features and advantages of the embodiments will become clear from the detailed description above. The claims are intended to cover the features and advantages of the embodiments described above without departing from their spirit and scope. Furthermore, a person with ordinary skill in the art should be able to readily conceive of any improvements and modifications. Therefore, there is no intention to limit the scope of the inventive embodiments to those described above, and it is also possible to rely on appropriate improvements and equivalents that fall within the scope disclosed in the embodiments.
Explanation of Symbols
[0165]
10, 10A, 10B, 10C, 10D, 10E, 10F, 10G, 10H, 10I, 10J, 10K, 10L Arithmetic processing units
20, 30 Arithmetic processing units
2S Secondary scheduler
B1, B2, B3 Buses
BCNT Bypass control unit
C1, C2, C3, C4 Comparators
DET Detection unit
EX Execution unit
FF Flip-flop
FIFO1-FIFO5 FIFO logic section
ID Instruction decoder
IS Instruction scheduler
sht, sht2 Shortening flags
LGC, LGC1-LGC5 Logic circuits
MUX1, MUX2 Multiplexers
MUX3 (MUX31, MUX32, MUX33) Multiplexer
OUTCNT Output control circuit
Q Queue
RF Register file
SEL Instruction selector
shtdet1, shtdet2 Shortening detection units
STG Stage
Claims
1. An arithmetic processing unit comprising: an instruction scheduler that issues executable instructions; a register file that holds data used by the instructions; an execution unit including multiple stages that sequentially execute the instructions issued by the instruction scheduler; a detection unit that detects early termination, in which the calculation result at an intermediate stage prior to the final stage among the multiple stages is the same as the calculation result of the execution unit; and a bypass control unit that transfers data output from the register file or the calculation result from the execution unit to the input of the execution unit and, if the detection unit detects the early termination, bypasses the calculation result at the intermediate stage to the input of the execution unit.
2. It has an instruction decoder that decodes an instruction and outputs the decoded instruction to the instruction scheduler, The detection unit is provided in the instruction decoder and detects the early termination based on the decoded instruction. The arithmetic processing device according to claim 1.
3. The instruction scheduler issues a subsequent dependent instruction, which depends on a preceding instruction issued earlier, such that the timing at which the subsequent dependent instruction is fed into the execution unit coincides with the timing at which the calculation result of the preceding instruction is fed into the execution unit via the bypass control unit. The arithmetic processing device according to claim 2.
4. A secondary scheduler is located between the instruction scheduler and the execution unit and includes a plurality of entries that sequentially hold instructions issued from the instruction scheduler and source operands transferred from the bypass control unit, and issues the instruction held in the entry in which the source operand has been determined to the execution unit together with the source operand. The detection unit is provided in the execution unit. The arithmetic processing device according to claim 1.
5. The instruction scheduler issues the subsequent dependent instruction, which depends on the preceding instruction issued earlier, so that the timing at which the subsequent dependent instruction is fed to the execution unit is earlier than the timing at which the calculation result in the final stage is fed to the execution unit via the bypass control unit. The arithmetic processing device according to claim 4.
6. The execution unit has a floating-point multiply-accumulate circuit that includes a floating-point multiplier and a floating-point adder connected to the output of the floating-point multiplier. The detection unit detects early termination when floating-point multiplication is performed by the floating-point multiply-accumulate circuit. The bypass control unit bypasses the multiplication result of the floating-point multiplier when the detection unit detects an early termination. The arithmetic processing apparatus according to any one of claims 2 to 5.
7. The execution unit has a logic circuit in which multiple shift circuits are connected in series. The detection unit detects early termination when a shift instruction, in which the shift amount is given as an immediate value, is executed by the multiple shift circuits. The bypass control unit bypasses the calculation result of the intermediate stage when the detection unit detects an early termination. The arithmetic processing apparatus according to any one of claims 2 to 5.
8. The execution unit has a floating-point arithmetic circuit that includes a floating-point arithmetic unit and a rounding circuit that adds "+1" to the calculation result of the floating-point arithmetic unit. The detection unit detects early termination when a floating-point arithmetic instruction executed by the floating-point arithmetic circuit is in a rounding mode that does not generate rounding by adding "+1" to the calculation result. The bypass control unit bypasses the calculation result of the floating-point arithmetic unit when the detection unit detects an early termination. The arithmetic processing apparatus according to any one of claims 2 to 5.
9. The execution unit has a floating-point multiply-accumulate circuit that includes a floating-point multiplier and a floating-point adder connected to the output of the floating-point multiplier. The detection unit detects early termination if the value to be added to the multiplication result is "0". The bypass control unit bypasses the multiplication result of the floating-point multiplier when the detection unit detects an early termination. The arithmetic processing device according to claim 4 or claim 5.
10. The execution unit has a floating-point multiply-accumulate circuit that includes a denormalized number processing circuit that performs processing when the calculation result is a denormalized number. The detection unit detects early termination if the multiply-accumulate result from the floating-point multiply-accumulate circuit is not a denormalized number. The bypass control unit bypasses the multiply-accumulate result when the detection unit detects an early termination. The arithmetic processing device according to claim 4 or claim 5.
11. The execution unit has a floating-point arithmetic circuit that includes a floating-point arithmetic unit and a rounding circuit that adds "+1" to the calculation result of the floating-point arithmetic unit. The detection unit detects early termination if no rounding process occurs that adds "+1" to the calculation result of the floating-point arithmetic unit. The bypass control unit bypasses the calculation result of the floating-point arithmetic unit when the detection unit detects an early termination. The arithmetic processing device according to claim 4 or claim 5.
12. The execution unit has a floating-point adder circuit that includes a floating-point adder and a normalization processing circuit connected to the output of the floating-point adder. The detection unit detects early termination when normalization processing after calculation by the floating-point adder is unnecessary. The bypass control unit bypasses the calculation result of the floating-point adder when the detection unit detects an early termination. The arithmetic processing device according to claim 4 or claim 5.
13. The execution unit has an integer adding circuit including an integer adder and a carry processing unit connected to the output of the integer adder. The detection unit detects early termination when carry processing after addition by the integer adder is unnecessary. The bypass control unit bypasses the addition result of the integer adder when the detection unit detects an early termination. The arithmetic processing device according to claim 4 or claim 5.
14. The execution unit has a logic circuit in which multiple shift circuits are connected in series. The detection unit detects early termination when, in the execution of a shift instruction by the logic circuit for which the shift amount is given from a register file, the shift result is obtained at the intermediate stage. The bypass control unit bypasses the shift result at the intermediate stage when the detection unit detects an early termination. The arithmetic processing device according to claim 4 or claim 5.
15. The system has multiple execution units capable of executing a single instruction in parallel using different data from each other. The detection unit detects whether or not the calculation results at intermediate stages can be bypassed for each of the multiple execution units. The arithmetic processing device according to claim 4 or claim 5.
16. The execution unit has a selection unit that selects the calculation result of the intermediate stage or the calculation result of the final stage and outputs it to the input of the execution unit. If the detection unit detects an early termination, the bypass control unit causes the selection unit to select the calculation result of the intermediate stage. The arithmetic processing apparatus according to any one of claims 1 to 5.
17. When the detection unit detects an early termination, the bypass control unit bypasses one of the bypassable calculation results from among the multiple calculation results at the multiple intermediate stages. The arithmetic processing apparatus according to any one of claims 1 to 5.
18. The calculation results at the bypassed intermediate stage are transferred to the bypass control unit and the register file. The calculation result of the final stage is transferred to the register file. The arithmetic processing apparatus according to any one of claims 1 to 5.
19. An arithmetic processing method for an arithmetic processing unit comprising: an instruction scheduler that issues executable instructions; a register file that holds data used by the instructions; an execution unit that includes a plurality of stages that sequentially execute the instructions issued by the instruction scheduler; and a bypass control unit that transfers data output from the register file or the calculation result from the execution unit to the input of the execution unit, wherein a detection unit of the arithmetic processing unit detects early termination, in which the calculation result at an intermediate stage prior to the final stage among the plurality of stages is the same as the calculation result of the execution unit, and, if the early termination is detected, the calculation result at the intermediate stage is transferred to the bypass control unit.