Optimization method for fusing floating-point addition and subtraction instructions in single-precision floating-point multiply-accumulators
By integrating the addition/subtraction module into the multiply-accumulate module for single-precision floating-point calculation and reusing the multiply-accumulate logic, the method removes the area overhead of a separate addition/subtraction module whose logic cannot be reused, thereby reducing hardware area and power consumption.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HEFEI JUNZHENG TECH CO LTD
- Filing Date
- 2024-12-27
- Publication Date
- 2026-06-30
AI Technical Summary
In existing technologies, the separate design of addition/subtraction and multiplication/addition modules in single-precision floating-point calculation modules results in the inability to reuse logic modules, increases hardware area overhead, and leads to higher power consumption and cost.
By integrating the addition/subtraction module and the multiplication/addition module into a single module, the logic of the multiplication/addition module can be reused, in particular the special value judgment at the P0 level, the alignment shift and mantissa summation at the P1 level, and the post-normalization at the P2 level, thereby reducing the hardware design area.
While ensuring basic performance, by reusing the multiplication and addition module logic, the independent area of the addition and subtraction modules is reduced, thereby lowering the power consumption and cost of the hardware design.
Abstract
Description
Technical Field
[0001] This invention belongs to the field of chip processing technology, and specifically relates to an optimization method for integrating floating-point addition and subtraction instructions in a single-precision floating-point multiply-accumulator.
Background Technology
[0002] Floating-point numbers are a real-number data type in computers; they offer high computational precision and meet the requirements of high-precision calculation. With the continuous development of modern technology, floating-point arithmetic capability has become increasingly important and plays an indispensable role in fields such as scientific computing, computer technology, and engineering. Scientific computing relies on large-scale data simulation for in-depth research: in meteorological forecasting, fluid mechanics, and exploration, for example, scientists require high-precision floating-point calculations to obtain accurate simulation results. Similarly, in the fast-growing field of autonomous driving, accurate floating-point calculations are crucial for the correctness and performance of algorithms. In machine learning, efficient floating-point calculation accelerates computation and improves algorithm performance. Furthermore, with advances in hardware technology, floating-point computing units can now be integrated into CPUs, where they have become an important component of the CPU's core computing power.
[0003] In floating-point computing units, single-precision floating-point computing is a frequently used module, primarily applied in image recognition, deep learning, object detection and tracking, and QR code recognition. The design strategy of this module significantly impacts the overall computing unit. It is well known that a large chip area has a series of negative consequences; as the area increases, power consumption and cost also increase. Therefore, simply increasing the area of single-precision floating-point computing to achieve high performance is counterproductive. A key challenge is how to minimize area overhead, thereby reducing power consumption and cost, in single-precision floating-point applications while improving or maintaining basic performance.
[0004] For addition and subtraction instructions, existing technologies primarily separate addition / subtraction and multiplication / addition instructions into different addition / subtraction and multiplication / addition modules. These modules then employ different algorithms to perform the corresponding operations, aiming to maximize performance. However, this approach prevents the reuse of logic between algorithms, resulting in additional area overhead.
[0005] The addition and subtraction module mainly handles addition and subtraction instructions: in this module, the two operands mainly undergo exponent alignment and mantissa summation, followed by rounding and final result processing, to obtain the result of the addition or subtraction instruction;
[0006] The multiply-add module mainly handles multiply-add instructions: in this module, the mantissas of the multiplied operands are multiplied and the product is aligned with the addend operand. The mantissas are then summed and the sum enters the post-normalization part. Finally, rounding and final result processing are completed to obtain the result of the multiply-add instruction.
[0007] Existing technologies for implementing addition and subtraction instructions separate the addition / subtraction module from the multiplication / accumulation module and execute each calculation in its own module according to the instruction type. Some operations are nearly identical in the two modules: for example, the alignment and mantissa summation operations in addition / subtraction are very similar to those in multiplication / accumulation, and the rounding and final result derivation are essentially the same. Existing technologies nevertheless keep these reusable parts isolated so that each module can use its own algorithm. While this does improve performance, it comes at the cost of excessive area.
[0008] Therefore, it can be seen that the main technical defect of the existing technology is the area overhead caused by the inability to reuse the same logical modules between instructions.
Summary of the Invention
[0009] The technical problem solved by this application is that, in the implementation of addition and subtraction instructions, while ensuring basic performance, the addition and subtraction module and the multiplication and addition module are integrated into the same module, so that the same logic module can be reused, thereby reducing the area overhead.
[0010] For addition and subtraction instructions, the multiplication operation in the multiply-accumulate module is unnecessary, while the multiply-accumulate module already covers all the operations of the addition and subtraction module. This solution therefore reuses the related operations in the multiply-accumulate module to complete addition and subtraction, thereby reducing hardware design area, power consumption, and cost.
[0011] In summary, the technical problem solved by this solution is to reuse devices of the same logic module as much as possible while ensuring the performance of addition and subtraction instructions, so as to reduce additional area overhead and achieve the goal of reducing power consumption and cost.
[0012] Due to differences in data bit width and pipeline division, double-precision multiply-accumulators differ significantly from the single-precision multiply-accumulators used in this method, and their usage methods are not entirely the same. Therefore, this method specifically focuses on the fusion using single-precision floating-point multiply-accumulators.
[0013] Specifically, this invention provides an optimized method for integrating floating-point addition and subtraction instructions in a single-precision floating-point multiply-accumulator. In the process of performing single-precision floating-point addition and subtraction calculations, this method utilizes existing logic to complete the following functions at each pipeline stage:
[0014] P0 level, the first pipeline stage: completes the special value judgment module function, the exponent difference and comparison logic module;
[0015] P1 stage, the second pipeline stage: completes the alignment shift operation, comparison operation, and summation of the mantissa after alignment;
[0016] P2 level, the third pipeline stage: post-processing, rounding, final result processing, and final exception state judgment;
The P0 level includes:
[0017] (a) The instruction control signal module (1) and special value judgment module (2) in the multiply-accumulate module need to be reused to perform special value judgment;
[0018] The special value determination module includes:
[0019] The data entering the special value judgment module is adjusted according to the instruction control signal module. A multiply-add instruction computes a*b+c, and its operands a, b, and c pass through unchanged. An addition or subtraction instruction must complete the a+b operation on the same a*b+c datapath: operand a remains unchanged, operand b is assigned the value 32'h3f80_0000 (1.0 in single precision), and operand c is assigned the original value of input operand b. That is, the expression is transformed into a*1+b.
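The operand remapping described above can be sketched in a few lines of Python. This is only an illustrative model, not the hardware itself: the function names `remap_operands` and `f32_bits` are hypothetical, operands are modelled as raw 32-bit patterns, and the handling of the subtraction sign is deliberately left out because the text resolves it separately through the effective operation.

```python
import struct

ONE_F32 = 0x3F80_0000  # bit pattern of 1.0 in single precision (32'h3f80_0000 in the text)

def f32_bits(x: float) -> int:
    """Hypothetical helper: raw 32-bit pattern of a Python float rounded to single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def remap_operands(is_multiply_add: bool, a: int, b: int, c: int = 0):
    """Return the (a, b, c) triple presented to the shared a*b+c datapath.

    Multiply-add instructions pass through unchanged. For add/sub
    instructions, a stays, b becomes 1.0, and c takes the original b,
    so the datapath evaluates a*1+b.
    """
    if is_multiply_add:
        return a, b, c
    return a, ONE_F32, b

# Example: an FADD_S of 2.5 and 0.75 is presented as 2.5 * 1.0 + 0.75.
print([hex(v) for v in remap_operands(False, f32_bits(2.5), f32_bits(0.75))])
```

Forcing the multiplier operand to 1.0 keeps the downstream special value checks meaningful, since a*1 has the same class (zero, infinity, NaN, or finite) as a.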
[0020] For addition and subtraction instructions, there is also the concept of an effective operation. The extracted instruction information and the sign bits of the two operands together determine whether the actual operation is addition or subtraction; opra represents the original 'a' operand of the addition / subtraction instruction, and oprb represents the original 'b' operand of the addition / subtraction instruction.
[0021]
opra sign bit    Input operation    oprb sign bit    Effective operation
+                +                  +                +
+                +                  -                -
-                +                  +                -
-                +                  -                +
+                -                  +                -
+                -                  -                +
-                -                  +                +
-                -                  -                -
[0022] Note: The input operation is +, which corresponds to the addition instruction (FADD_S); the input operation is -, which corresponds to the subtraction instruction (FSUB_S).
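The effective-operation table reduces to a three-input XOR when "-" is encoded as 1 and "+" as 0. The following sketch checks that reading against all eight rows; the function name `effective_is_sub` and the 0/1 encoding are assumptions made for illustration.

```python
def effective_is_sub(sign_a: int, op_is_sub: int, sign_b: int) -> int:
    """Return 1 if the effective operation is subtraction, 0 if it is addition.

    sign_a, sign_b: sign bits of opra and oprb (1 = negative).
    op_is_sub: 1 for FSUB_S, 0 for FADD_S.
    """
    return sign_a ^ op_is_sub ^ sign_b

# Rows of the table as (opra sign, input operation, oprb sign, effective operation).
TABLE = [
    (0, 0, 0, 0), (0, 0, 1, 1), (1, 0, 0, 1), (1, 0, 1, 0),
    (0, 1, 0, 1), (0, 1, 1, 0), (1, 1, 0, 0), (1, 1, 1, 1),
]
print(all(effective_is_sub(a, op, b) == eff for a, op, b, eff in TABLE))
```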
[0023] After the effective operation is clear, proceed to the special value module for judgment;
[0024] The special value criteria are shown in the following two tables:
[0025] The table below shows the special value judgment results when the effective operation is addition:
[0026]
[0027]
[0028] The table below shows the special value judgment results when the effective operation is subtraction:
[0029]
[0030] Note: (sub)norm is a general term for normalized and denormalized numbers. NaN (Not a Number, a value that cannot be represented) falls into two categories: qNaN and sNaN.
[0031] An sNaN has an exponent of all 1s, a most significant mantissa bit of 0, and a nonzero mantissa overall.
[0032] A qNaN has an exponent of all 1s and a most significant mantissa bit of 1.
[0033] RISC-V specifies that if the result of a floating-point operation is a NaN number, then a fixed NaN number should be used. The NaN value corresponding to single-precision floating-point is 0x7fc0_0000. Therefore, the final result qNaN needs to be assigned a fixed value, i.e., qNaN = 32'h7fc0_0000.
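As a minimal sketch of the NaN rules just stated, the classification can be written directly on the 32-bit pattern. The function name `classify_nan` and the string labels are hypothetical; the field layout (1 sign bit, 8 exponent bits, 23 mantissa bits) is the standard single-precision format.

```python
CANONICAL_QNAN = 0x7FC0_0000  # fixed NaN result named in the text (32'h7fc0_0000)

def classify_nan(bits: int) -> str:
    """Classify a single-precision bit pattern as 'qnan', 'snan', or 'not_nan'."""
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7F_FFFF
    if exponent != 0xFF or mantissa == 0:
        return "not_nan"            # finite value, or infinity when mantissa == 0
    # The most significant mantissa bit distinguishes quiet from signaling NaNs.
    return "qnan" if (mantissa >> 22) & 1 else "snan"

print(classify_nan(CANONICAL_QNAN), classify_nan(0x7F80_0001), classify_nan(0x7F80_0000))
```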
[0034] A crossed-out table cell indicates that the result must be obtained through normal calculation rather than as a special value;
[0035] If one of the above special value results is generated, a special value signal is set to mark the operation as a special case, and the result is assigned the special value obtained above to facilitate subsequent calculations;
(b) The operands pass through the exponent difference and comparison module (8), which calculates the exponent difference and compares the exponents of the two operands to determine the exchange signal;
[0036] (c) Since the addition and subtraction instructions do not need to complete the multiplication operation of the mantissas, there is no need to reuse the product mantissa calculation module (3) to calculate the mantissa product result. It is only necessary to use the selector (4) to select the mantissa or special value of the number a according to the special value signal. The selector (4) is a selector module that selects the mantissa or special value of the operand a according to the instruction control signal and the special value judgment signal.
[0037] The P1 stage: The P1 stage completes the alignment shift operation, the comparison operation, and the mantissa summation after alignment. A right shifter is therefore reused to perform the alignment. Since the exponent difference has already been calculated in the P0 stage, the alignment shift is performed according to that difference, and the mantissa summation and comparison are completed after alignment. Alignment for addition and subtraction only ever shifts right, so no left shift is required here. Since addition and subtraction instructions involve no multiplication, no pipeline stall is needed for normalizing the mantissa product, for comparison, or for alignment.
[0038] The P2 level: The post-normalization, rounding, final result processing, and final abnormal state judgment required at the P2 level are essentially the same as the P2 level operations of the multiply-accumulate algorithm and are fully reused; the only differences are in the leading zero count and the post-normalization shift.
[0039] The leading zero count part includes:
[0040] The leading zeros in the lower half of the result do not need to be counted for addition and subtraction instructions. Because addition and subtraction align the exponents first, the number of leading zeros found in post-normalization never exceeds 24. The most extreme case is a subtraction of two very close numbers whose exponents differ by 1, where the mantissa of the operand with the larger exponent is 1.000……000 and the mantissa of the operand with the smaller exponent is 1.111……111; even then the leading zero count does not exceed 24. Therefore only the leading zeros in the upper half need to be counted, no compensation is needed, and the upper-half count is the exact leading zero count for post-normalization.
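The bound of 24 can be checked numerically on the extreme case named above. The sketch below is an illustration under stated assumptions: both operands are normalized 24-bit mantissas (hidden bit included), the count starts at the hidden-bit position of the larger operand, and the helper name `leading_zeros_after_sub` is hypothetical.

```python
def leading_zeros_after_sub(m_big: int, m_small: int, exp_diff: int) -> int:
    """Leading zeros of (big - small), counted from the hidden bit of the larger operand.

    m_big, m_small: 24-bit normalized mantissas (bit 23 is the hidden 1).
    exp_diff: exponent of the larger operand minus exponent of the smaller one.
    """
    assert (1 << 23) <= m_big < (1 << 24) and (1 << 23) <= m_small < (1 << 24)
    width = 24 + exp_diff                      # field wide enough to keep every shifted-out bit
    diff = (m_big << exp_diff) - m_small       # exact difference in units of the smaller operand's LSB
    assert diff > 0
    return width - diff.bit_length()

# Extreme case from the text: 1.000...000 * 2^(e+1) minus 1.111...111 * 2^e.
print(leading_zeros_after_sub(0x800000, 0xFFFFFF, 1))   # 24
# With an exponent difference of 2 or more, at most one leading zero can appear.
print(leading_zeros_after_sub(0x800000, 0xFFFFFF, 2))   # 1
```

With equal exponents the smallest nonzero difference is one unit in the last place, which gives 23 leading zeros in a 24-bit field, so the exponent-difference-1 case is the one that reaches 24.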
[0041] The post-normalization shift part includes:
[0042] Since the leading zero count will not exceed 24, the post-normalization shift can always operate on the original data.
[0043] The P1 level further includes:
[0044] (a) Comparison logic reuse
[0045] For the multiply-add algorithm, the multiplication operation causes the comparison logic to be split into two parts. The first part is the mantissa comparison logic (13), which compares the 25-bit upper half of the mantissa before the shift; this simplifies the alignment operation at the current stage and improves timing. The second part is the final mantissa comparison logic (28), which compares the remaining bits after the mantissas have been aligned; this ensures that the multiply-add algorithm obtains an accurate comparison result.
[0046] For addition and subtraction instructions there is no multiplication, so the effective bits of the mantissa are a 1-bit hidden bit plus a 23-bit mantissa, 24 bits in total. The first part, the mantissa comparison logic (13) that compares the 25-bit upper half of the mantissa before the shift, therefore already yields an accurate comparison result, and the bits of a mantissa product after the alignment shift do not need to be considered. Addition and subtraction instructions thus only reuse the first part, the mantissa comparison logic (13) in the multiplication module, and do not reuse the second part, the final mantissa comparison logic (28) that compares the remaining bits after alignment.
[0047] (b) Reusing the alignment part
[0048] The overall implementation of the alignment part is as follows:
[0049] The mantissa comparison logic (13), which compares the 25-bit upper half of the mantissa before the alignment shift, determines the exchange signal and the exact magnitude relation. The selector (15), controlled by the shift control logic (19), then selects the corresponding shift data; this selector (15) also appends zeros to the shift data when the shift amount exceeds 15. The shift amount itself is determined by the shift amount adjustment logic (20). After both the shift amount and the shift data are ready, the process proceeds... A 48+SHF_NUM, where SHF_NUM uses a 16-bit right shifter (16) according to operator statistics to obtain the shift result. Since addition and subtraction instructions involve no multiplication, alignment only requires a right shift. Finally, the big mantissa and little mantissa after alignment are selected by selectors (23 / 24): selector (23) selects the aligned big mantissa result according to the shift control logic, and selector (24) selects the aligned little mantissa result according to the shift control logic;
[0050] Accordingly, the exponent adjustment logic (21) is used in the exponent adjustment module to further adjust the exponent and obtain the adjusted exponent;
[0051] (c) Reusing the mantissa summation part: The overall scheme for mantissa summation is implemented as follows:
[0052] After the aligned mantissas are obtained, if the actual operation is subtraction, the aligned mantissa passes through the negation module (25) to obtain its negated value; if the actual operation is addition, no additional operation is required and the aligned mantissa keeps its value. The selector (26) then selects the corresponding mantissa summation data according to the subtraction signal. The selected mantissa, the other mantissa, and the subtraction signal enter the mantissa summation calculation module (27) together. This module contains a 26-bit lower-half adder and a 25-bit upper-half adder; the upper-half adder must wait for the carry out of the lower-half adder before it can complete the mantissa summation and produce the final mantissa sum.
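The split adder can be modelled as two narrower additions chained by a carry. The sketch below is an assumption-laden illustration: the name `split_add` is hypothetical, the operands are treated as 51-bit unsigned values (26 + 25 bits), and the subtraction signal is modelled as a carry-in applied to a bitwise-inverted operand, which is one common way to realize two's-complement subtraction but is not confirmed by the text.

```python
LOW_BITS = 26    # width of the lower-half adder
HIGH_BITS = 25   # width of the upper-half adder
TOTAL_MASK = (1 << (LOW_BITS + HIGH_BITS)) - 1

def split_add(x: int, y: int, carry_in: int = 0):
    """Add two 51-bit values with a 26-bit low adder feeding a 25-bit high adder.

    Returns (sum modulo 2**51, carry_out of the upper-half adder).
    """
    low_mask = (1 << LOW_BITS) - 1
    high_mask = (1 << HIGH_BITS) - 1
    low = (x & low_mask) + (y & low_mask) + carry_in
    carry = low >> LOW_BITS                       # carry passed to the upper half
    high = ((x >> LOW_BITS) & high_mask) + ((y >> LOW_BITS) & high_mask) + carry
    total = ((high & high_mask) << LOW_BITS) | (low & low_mask)
    return total, high >> HIGH_BITS

# Addition that ripples a carry across the half boundary.
print(split_add((1 << 26) - 1, 1))               # (67108864, 0)
# Subtraction 100 - 37 modelled as 100 + ~37 + 1.
print(split_add(100, ~37 & TOTAL_MASK, 1))       # (63, 1)
```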
[0053] The overall implementation scheme for the P2 level operation in the addition and subtraction algorithm is as follows:
[0054] The mantissa summation result passed from the P1 level directly reuses the 24-bit upper-half leading zero count module (2) to obtain the actual number of leading zeros for post-normalization. This count enters the shift amount calculation module (6) together with the exponent to obtain the shift amount, and the shift data is obtained from the shift data adjustment module (7). Since the number of leading zeros does not exceed 24, the shift amount does not exceed 24 either, so the original data is always selected as the shift data. The shifted result is obtained from the 40-bit left shifter (9) according to the shift amount and the shift data. If the mantissa summation result from the P1 level overflows, the right shift logic (8) of this level concatenates a 1-bit 0 at the most significant bit to emulate shifting the mantissa one bit to the right. The final mantissa result is obtained through the mantissa selection module (12);
[0055] Meanwhile, the exponent adjustment module (10) and the rounding adjustment module (11) adjust the exponent and the rounding bits; the sign bit is calculated by the sign bit calculation module.
[0056] These data are collectively entered into the rounding operation, rounding adjustment and abnormal status judgment module (13) to complete the rounding operation, rounding adjustment and abnormal status judgment, and finally enter the final result selection (14) to obtain the final result.
[0057] The P2-level operation in the addition / subtraction algorithm further includes:
[0058] 1) Post-normalization part:
[0059] The mantissa summation data obtained from the P1 pipeline stage totals 50 bits: 2 hidden bits + 46 mantissa bits + 1 rounding bit + 1 sticky bit. The exponent is the adjusted exponent result generated by the P1 stage. The data first enters the upper-half leading zero module (2) for counting and is then shifted.
[0060] Further includes:
[0061] (a) The specific shift details are as follows:
[0062] ① The exponent is large enough for a left shift by the full leading zero count;
[0063] For addition and subtraction instructions, the number of leading zeros in post-normalization does not exceed 24, so the shift amount does not exceed 24. The original data is therefore always selected for shifting; the shift amount is the post-normalization leading zero count.
[0064] ② The exponent is not large enough for a left shift by the full leading zero count;
[0065] For addition and subtraction instructions, the post-normalization leading zero count does not exceed 24. If the exponent cannot cover the full shift, the shift amount is smaller than the leading zero count and therefore also does not exceed 24. For shifts with a shift amount not exceeding 32, the original data is still selected for shifting; the shift amount is the exponent - 1.
[0066] ③ The mantissa sum overflows by one bit;
[0067] If the mantissa overflows after summation, a 1-bit 0 is concatenated at the most significant bit of the mantissa by the right shift control logic (8) to emulate shifting the mantissa one bit to the right; the left shifter is not used in this case.
[0068] ④ No post-normalization shift is required;
[0069] If there are no leading zeros in post-normalization, or if the number obtained is itself a denormalized number, the original mantissa result is selected directly through the mantissa selection module (12), and the shift amount is 0.
[0070] (b) Further reduction of the shift amount
[0071] From the two left-shift cases discussed above, the maximum shift amount is 24. Where timing allows, the shift range of the shifter is reduced: for data with a shift amount exceeding 15, the 16 zero bits in the upper half are removed by an internal control signal and 16 zero bits are appended at the low end. This reduces the maximum shift amount of the shifter from 31 to 15.
[0072] The specific implementation details are as follows:
[0073] (I) When the shift amount is less than 16, the data enters the left shifter directly to perform the shift;
[0074] (II) When the shift amount is 16 or greater, the shift data first enters the shift data adjustment module (7), where 16 zero bits are removed from the front and 16 zero bits are appended to the back; the processed data then enters the left shifter for the remaining shift.
[0075] (c) Further reduction of shifted data;
[0076] Originally, 48 bits of shift data were required for the shift, but since the shifter now shifts by at most 15 bits, the shift data only needs to be wide enough for left-shift normalization, that is, (24+16) bits; the remaining 8 bits can be used as sticky bits. The shift data bit width is thereby reduced.
[0077] In summary, the original 48-bit shifter can be reduced to a 40-bit shifter;
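The two-stage shift described in (b) and (c) can be modelled as a fixed 16-bit pre-adjustment followed by a fine shift of at most 15 bits. This is a behavioural sketch only: `left_shift_two_stage` is a hypothetical name, the data is assumed to be 40 bits wide, and the assertion that the upper 16 bits are zero encodes the precondition implied by a leading zero count of at least 16.

```python
WIDTH = 40
MASK = (1 << WIDTH) - 1

def left_shift_two_stage(data: int, shamt: int):
    """Left-shift 40-bit data by shamt (0..31) using a coarse 16-bit step plus a 0..15 fine shift.

    Returns (shifted data, fine shift amount actually given to the shifter).
    """
    assert 0 <= shamt < 32 and 0 <= data <= MASK
    if shamt >= 16:
        # Shift data adjustment: drop 16 leading zero bits, append 16 zero bits at the low end.
        assert data >> (WIDTH - 16) == 0, "upper 16 bits must be zero"
        data = (data << 16) & MASK
        shamt -= 16
    return (data << shamt) & MASK, shamt

print(left_shift_two_stage(0xABCDEF, 16))   # (0xABCDEF << 16, 0)
print(left_shift_two_stage(0x1, 24))        # (1 << 24, 8)
```

Because the coarse step is a fixed rewiring rather than a variable shift, only the 4-bit fine shift needs a barrel shifter, which is where the area saving described above comes from.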
[0078] 2) Rounding adjustments:
[0079] The rounding adjustment module (11) completes the corresponding function; here the guard bit is the last bit of the mantissa, the rounding bit is the first bit after the last bit of the mantissa, and the sticky bit is the OR-reduction of all bits from the second bit after the last bit of the mantissa onward.
[0080] 3) Exponent adjustment section:
[0081] The exponent adjustment module (10) completes the corresponding function;
[0082] The exponent adjustment is performed in parallel with post-normalization. There are four possible results: exponent + 1, exponent (unchanged), exponent minus the post-normalization leading zero count, and 0; the appropriate one is chosen according to the mantissa selection.
[0083] 4) Rounding operation and post-rounding adjustment:
[0084] This part of the logic is included in the rounding operation, rounding adjustment and abnormal state judgment module (13);
[0085] To improve timing, a 25-bit adder computes the mantissa incremented by 1 directly. The rounding decision is then made according to the rounding mode, selecting either the original mantissa or the incremented mantissa as the final rounded result. This rounded result is used to further adjust the exponent and the mantissa.
[0086] Exponent adjustment: If the rounding increment causes the mantissa to overflow by 1 bit, the exponent is incremented by 1;
[0087] Mantissa adjustment: If the rounding increment causes the mantissa to overflow by 1 bit, the mantissa is shifted one bit to the right.
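The select-after-increment scheme can be sketched as follows. Everything beyond what the text states is an assumption: the function name `round_mantissa`, the 24-bit mantissa width, and the use of the five RISC-V rounding mode mnemonics (RNE, RTZ, RDN, RUP, RMM) with their standard IEEE 754 decision rules.

```python
def round_mantissa(mant24: int, round_bit: int, sticky: int, sign: int, mode: str):
    """Round a 24-bit mantissa; return (rounded mantissa, exponent increment)."""
    plus_one = mant24 + 1                 # computed up front (25-bit result), then selected
    lsb = mant24 & 1                      # last mantissa bit, called the guard bit in the text
    inexact = round_bit | sticky
    if mode == "RNE":                     # nearest, ties to even
        increment = round_bit & (sticky | lsb)
    elif mode == "RTZ":                   # toward zero
        increment = 0
    elif mode == "RDN":                   # toward negative infinity
        increment = sign & inexact
    elif mode == "RUP":                   # toward positive infinity
        increment = (sign ^ 1) & inexact
    elif mode == "RMM":                   # nearest, ties to max magnitude
        increment = round_bit
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    result = plus_one if increment else mant24
    if result >> 24:                      # 1.11...1 + 1 overflowed into a 25th bit
        return result >> 1, 1             # shift mantissa right, bump exponent
    return result, 0

print(round_mantissa(0xFFFFFF, 1, 0, 0, "RNE"))   # (0x800000, 1): overflow after rounding
print(round_mantissa(0x800000, 1, 0, 0, "RNE"))   # (0x800000, 0): tie rounds to even
```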
[0088] 5) Final sign bit calculation:
[0089] Sign bit calculation (1) completes the corresponding function;
[0090] ① If a special value exists, assign the sign bit of the special value;
[0091] ② If the absolute values of the two numbers are equal, that is, operands a and b are equal after alignment, and the effective operation is subtraction, the sign depends on the rounding mode: when rounding toward negative infinity the sign bit is negative; otherwise the sign bit is positive.
[0092] ③ The remaining cases need to be determined based on the exchange signal and the subtraction signal, as shown in the table below:
[0093] Where A is the opra operand, B is the oprb operand, and opra_sign is the sign of the opra operand. The effective operation is the one obtained at level P0.
[0094]
Condition    Effective operation    opra_sign    Final sign
|A|<|B|      -                      +            -
|A|<|B|      -                      -            +
|A|<|B|      +                      +            +
|A|<|B|      +                      -            -
|A|>|B|      -                      +            +
|A|>|B|      -                      -            -
|A|>|B|      +                      +            +
|A|>|B|      +                      -            -
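Read with "-" encoded as 1, the table says the result takes the sign of opra unless the effective operation is subtraction and |A|<|B|, in which case the sign flips. The sketch below encodes that reading together with case ② (equal magnitudes under effective subtraction); the function name `final_sign` and its parameters are hypothetical, and the special value case ① is left out.

```python
def final_sign(opra_sign: int, eff_sub: int, a_lt_b: int,
               magnitudes_equal: int = 0, round_toward_neg_inf: int = 0) -> int:
    """Sign bit of an add/sub result (1 = negative), excluding special values."""
    if eff_sub and magnitudes_equal:
        # Exact zero result: negative only when rounding toward negative infinity.
        return 1 if round_toward_neg_inf else 0
    # The sign flips only when subtracting a larger magnitude from a smaller one.
    return opra_sign ^ (eff_sub & a_lt_b)

# Rows of the table as (a_lt_b, eff_sub, opra_sign, final sign).
TABLE = [
    (1, 1, 0, 1), (1, 1, 1, 0), (1, 0, 0, 0), (1, 0, 1, 1),
    (0, 1, 0, 0), (0, 1, 1, 1), (0, 0, 0, 0), (0, 0, 1, 1),
]
print(all(final_sign(s, sub, lt) == out for lt, sub, s, out in TABLE))
```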
[0095] 6) Exception detection:
[0096] This part of the logic is included in the rounding operation, rounding adjustment and abnormal state judgment module (13);
[0097] The logic for judging abnormal states is as follows:
[0098] ① Invalid exception: Found in the special value judgment at level P0;
[0099] ② Division by zero exception: None;
[0100] ③ Overflow exception: The result is greater than the maximum exponent value after the exponent is updated, and it is not an infinite value, qNaN result, or invalid exception.
[0101] ④ Underflow exception: After rounding, the result lies strictly between the ± minimum normalized numbers (the + minimum normalized number is 32'h0080_0000 and the - minimum normalized number is 32'h8080_0000) and is inexact;
⑤ Inexact exception: The rounding bit or sticky bit is 1, or an overflow exception occurs.
[0102] 7) Final result processing:
[0103] The final result selection (14) completes the corresponding function;
[0104] If an overflow exception occurs, the result is assigned a specific maximum value that depends on the rounding mode. In all other cases, the sign bit, exponent, and mantissa are simply concatenated to yield the final result.
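The text does not list which value each rounding mode selects on overflow. The sketch below fills that in with the standard IEEE 754 rule, on the assumption that the design follows it: modes that round away from zero for the given sign deliver infinity, the others deliver the largest finite number. The function name `overflow_result` and the RISC-V mode mnemonics are illustrative choices.

```python
F32_INF = 0x7F80_0000   # magnitude bits of infinity
F32_MAX = 0x7F7F_FFFF   # magnitude bits of the largest finite single-precision number

def overflow_result(sign: int, mode: str) -> int:
    """Bit pattern delivered on overflow for a given sign (1 = negative) and rounding mode."""
    rounds_to_infinity = {
        "RNE": True,            # nearest, ties to even
        "RMM": True,            # nearest, ties to max magnitude
        "RTZ": False,           # toward zero never reaches infinity
        "RDN": sign == 1,       # toward -inf: only negative results reach infinity
        "RUP": sign == 0,       # toward +inf: only positive results reach infinity
    }[mode]
    return (sign << 31) | (F32_INF if rounds_to_infinity else F32_MAX)

print(hex(overflow_result(0, "RNE")), hex(overflow_result(1, "RTZ")), hex(overflow_result(0, "RDN")))
```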
[0105] The multiply-accumulate module implements the multiply-accumulate algorithm, including:
[0106] After obtaining the three operands a, b, and c, the product result of a*b is calculated first. The product result and the c operand (the addend operand) are aligned. Then, the mantissa is summed. The normalization part is then performed. Finally, rounding, exception handling, and the final result are obtained.
[0107] Before the exponent alignment, the mantissa of the product, the leading zero of the product, the exponent difference, and the c operand (addition operand) are required. Part of the exponent difference's shift amount during alignment can offset the normalization shift amount of the leading zero of the product. The specific algorithm is as follows:
[0108] (1) The result of multiplication exponent > the exponent of operand c
[0109] d. Exponent difference > leading zeros of the product: The product result has a large exponent. It needs to be left-shifted for normalization with a shift amount equal to the leading zeros of the product. Since the c operand has a small exponent, it needs to be right-shifted for alignment with a shift amount equal to the exponent difference minus the number of leading zeros of the product. At this time, two shifts are required, and the multiplication result is still the large operand. If there are no leading zeros generated in the multiplication at this time, only the c operand needs to be shifted by the exponent difference;
[0110] e. Exponent difference < leading zeros of the product: Before shifting, the product result has a large exponent. First, the multiplication mantissa is left-shifted for normalization with a shift amount equal to the leading zeros of the product. At this time, because the exponent difference < leading zeros of the product, the product result instead becomes a small exponent. At this time, the shifted multiplication mantissa still needs to be right-shifted for alignment with a shift amount equal to the leading zeros of the product minus the exponent difference;
[0111] From the overall effect, only the multiplication mantissa needs to be left-shifted by the exponent difference. At this time, the multiplication result becomes the small operand and needs to be swapped;
[0112] f. Exponent difference = leading zeros of the product: At this time, only the multiplication mantissa needs to be left-shifted for normalization with a shift amount equal to the leading zeros of the product to complete the alignment of the c operand and the multiplication result;
[0113] (2) Multiplication exponent result < c operand exponent
[0114] At this time, regardless of the relationship between the leading zeros count of the product and the exponent difference, the multiplication result must be a small exponent. First, the multiplication mantissa is left-shifted for normalization with a shift amount equal to the leading zeros of the product. At this time, the exponent of the multiplication result is smaller, and the shifted multiplication mantissa still needs to be right-shifted for alignment with a shift amount equal to the leading zeros of the product plus the exponent difference; From the overall effect, only the multiplication mantissa needs to be right-shifted by the exponent difference. At this time, the multiplication result is still the small operand;
[0115] (3) Multiplication exponent result = c operand exponent, exponent difference = 0
a. If the leading zeros of the product = 0, the multiplication result and the c operand are already aligned and no shift is required;
b. If the leading zeros of the product ≠ 0, at this time the exponent difference is less than the leading zeros of the product. First, the multiplication mantissa is left-shifted for normalization with a shift amount equal to the leading zeros of the product. After the shift, the exponent difference is the leading zeros of the product. At this time, the adjusted multiplication mantissa result needs to be right-shifted by the leading zeros of the product for alignment; From the overall result, no shift is required.
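The case analysis above collapses each normalize-then-align pair into a single net shift. The following sketch tabulates those net shifts; the function name `net_alignment` and the returned triple are hypothetical conventions chosen for illustration, and the swap of big and small operands is not modelled.

```python
def net_alignment(exp_product: int, exp_c: int, product_lz: int):
    """Return (product left shift, product right shift, c right shift) as net amounts.

    exp_product: exponent of the multiplication result before normalization.
    exp_c: exponent of the addend operand c.
    product_lz: number of leading zeros in the product mantissa.
    """
    if exp_product > exp_c:
        diff = exp_product - exp_c
        if diff >= product_lz:
            # Exponent difference > or = leading zeros: normalize the product fully,
            # then right-shift c by whatever exponent difference remains.
            return product_lz, 0, diff - product_lz
        # Exponent difference < leading zeros: net effect is a left shift by the difference.
        return diff, 0, 0
    if exp_product < exp_c:
        # Normalization and alignment partly cancel: net right shift by the difference.
        return 0, exp_c - exp_product, 0
    # Equal exponents: the left and right shifts cancel completely.
    return 0, 0, 0

print(net_alignment(10, 4, 2))   # (2, 0, 4)
print(net_alignment(10, 8, 5))   # (2, 0, 0)
print(net_alignment(4, 10, 3))   # (0, 6, 0)
```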
[0116] Therefore, the advantage of this application is to reuse the logic of the multiply-add module to complete addition and subtraction instructions: The technical solution only uses the multiply-add module, reuses the relevant logic part of the multiply-add module, and completes addition and subtraction instructions and multiply-add type instructions. On the premise of ensuring the basic performance, it can reduce the independent area of the addition and subtraction module.
[0117] The addition and subtraction instructions reuse the special value judgment, exponent difference calculation and comparison part of the P0 level, the right shifter and mantissa summer adder of the P1 level, and the post-regulation part and final result adjustment of the P2 level, etc., from the multiply-accumulate module. Compared with the previous design where the addition and subtraction instructions were completed in a separate module, the addition and subtraction instructions are completed in the multiply-accumulate module, which reduces the design area. While ensuring basic performance, the hardware design area is reduced, and power consumption and cost are also reduced.
[0118] In summary, this solution achieves significant area improvement while maintaining basic performance, and also reduces power consumption and cost.
Attached Figure Description
[0119] The accompanying drawings, which are provided to further illustrate the invention and form part of this application, are not intended to limit the scope of the invention.
[0120] Figure 1 is a diagram of the P0-P1 pipeline implementation structure for addition and subtraction instructions.
[0121] Figure 2 is a diagram of the P2 pipeline implementation structure for addition and subtraction instructions.
[0122] Figure 3 is a flowchart of the method.
[0123] Wherein:
[0124] The selectors in Figure 1 are as follows:
[0125] Selector 4: A selector module that selects the mantissa or special value result of opra based on the instruction control signal and the special value judgment signal;
[0126] Selector 15: Selects the corresponding shift data according to the corresponding shift logic;
[0127] Selector 23: Selects the larger mantissa result after alignment according to the shift control logic;
[0128] Selector 24: Selects the smaller mantissa result after alignment according to the shift control logic;
[0129] Selector 26: Selects the data required for summing the mantissas based on the subtraction signal. Selectors drawn with dashed lines in Figure 1 and Figure 2 are not annotated, because they do not perform an actual selection function.
Detailed Implementation
[0130] To better understand the technical content and advantages of the present invention, the present invention will now be described in further detail with reference to the accompanying drawings.
[0131] This invention relates to the field of chips, primarily focusing on single-precision floating-point arithmetic. This solution, while maintaining basic performance, utilizes a multiply-accumulate module to perform addition / subtraction instructions and multiply-accumulate instructions, sharing the same functional module as much as possible. This reduces the hardware design area, thereby lowering power consumption and cost. The application of this solution offers the advantages of reduced hardware area and low power consumption.
[0132] In addition, the technical terms included in this article are:
[0133] (1) Operands a, b, c: for three-operand instructions such as multiply-accumulate instructions, they represent a*b+c;
(2) The exponent of the product result is the sum of the exponents of operands a and b minus the offset (the offset is 127 for single-precision floating-point numbers). The exponent difference of the multiply-accumulate algorithm refers to the absolute value of the difference between the exponent of the product result and the exponent of operand c, and the exponent difference of the addition-subtraction algorithm refers to the absolute value of the difference between the exponent of operand a and the exponent of operand b;
[0134] (3) The exponent alignment operation refers to aligning the mantissa of the smaller exponent operand with the mantissa of the larger exponent operand based on the exponent difference between two operands with different exponents.
[0135] (4) Post-normalization: after the floating-point calculation is completed, the exponent may be greater than 0 while the mantissa result is not normalized. In that case the floating-point normalization operation must be performed.
[0136] (5) Leading zeros refer to the number of consecutive zeros in the mantissa, counted from the hidden bit (when the hidden bit is 0) down to the first 1; product leading zeros refer to the number of leading zeros in the mantissa result obtained after multiplying operands a and b; post-normalization leading zeros refer to the number of leading zeros in the mantissa of the calculation result that must be removed by post-normalization.
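The terms defined above can be illustrated with a minimal Python sketch. The helper names (`fields`, `product_exponent`, `exponent_difference`, `leading_zeros`) are hypothetical and chosen only for illustration; they model the definitions, not the hardware modules.

```python
import struct

BIAS = 127  # exponent offset for single-precision floating-point numbers


def fields(x: float):
    """Split a value, encoded as float32, into (sign, biased exponent, fraction)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def product_exponent(ea: int, eb: int) -> int:
    """Biased exponent of a*b before normalization: ea + eb - offset."""
    return ea + eb - BIAS


def exponent_difference(e1: int, e2: int) -> int:
    """Absolute exponent difference that drives the alignment shift."""
    return abs(e1 - e2)


def leading_zeros(value: int, width: int) -> int:
    """Number of zeros above the first 1 in a width-bit field."""
    return width - value.bit_length()
```

For example, 2.0 has biased exponent 128, so `product_exponent(128, 128)` gives 129, the biased exponent of 4.0.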
[0137] Specifically, this solution uses the multiply-accumulate module to execute addition and subtraction instructions. Because an addition or subtraction instruction essentially completes the a+b operation, it mainly reuses the alignment, the mantissa summation adder, and the post-normalization.
[0138] In this solution, the multiply-accumulate module is reused to implement addition and subtraction functions:
[0139] As shown in Figure 3, in the process of performing single-precision floating-point addition and subtraction calculations, this solution uses existing logic to complete the following functions at each pipeline stage:
[0140] P0 level, the first pipeline stage: completes the special value judgment and the exponent difference and comparison logic;
[0141] P1 level, the second pipeline stage: completes the alignment shift operation, comparison operation, and summation of the mantissas after alignment;
[0142] P2 level, the third pipeline stage: post-normalization, rounding, final result processing, and final exception handling. From the perspective of how the pipeline stages implement addition and subtraction instructions, some logic of the multiply-add algorithm is not used. In the implementation structure diagrams for addition and subtraction instructions (Figure 1 and Figure 2), the parts that are not needed are marked with dashed lines;
[0143] The P0 level includes (see Figure 1, the P0-P1 pipeline implementation structure diagram for addition and subtraction instructions):
[0144] (a) The instruction control signal module (1) and the special value judgment module (2) need to be reused to judge special values. See the description of special value judgment in the multiplication-addition algorithm for details.
[0145] (b) It needs to go through the exponent difference and comparison module (8) to complete the calculation of the exponent difference and the exponent comparison of the two addends, and determine the exchange signal, etc.
[0146] (c) The product leading zero statistics module composed of selector (5), leading zero statistics module (6), and selector (7) does not need to be reused. There is no need to judge the leading zero of the product result here, because there is no multiplication operation in the addition and subtraction instructions, so there is no product result.
[0147] (d) Since the addition and subtraction instructions do not need to complete the multiplication mantissa operation, there is no need to reuse the product mantissa calculation module (3) to calculate the mantissa product result. It is only necessary to use the selector (4) to select the mantissa or special value of the number a according to the special value signal. The selector (4) is a selector module that selects the mantissa or special value of the operand a according to the instruction control signal and the special value judgment signal.
[0148] The P1 level (see Figure 1, the P0-P1 pipeline implementation structure diagram for addition and subtraction instructions):
[0149] The P1 level completes the alignment shift operation, the comparison operation, and the mantissa summation after alignment. The right shifter is therefore reused to complete the alignment. Since the exponent difference has already been calculated at the P0 level, the alignment shift is performed based on that exponent difference, followed by the mantissa summation and comparison after alignment.
[0150] Alignment for addition and subtraction is right alignment only. No left shift is needed here, and the pipeline never needs to stall because of the normalization comparison of the mantissa product result or the alignment operation.
[0151] Further includes:
[0152] (a) Comparison logic reuse
[0153] For the multiply-add algorithm, because of the multiplication operation, the comparison logic is divided into two parts. The first part is the mantissa comparison logic (13), which compares the upper 25 bits of the mantissa before the alignment shift. This simplifies the alignment operation at this level and improves timing. The second part is the final mantissa comparison logic (28), which compares the remaining bits after the mantissas have been aligned. This ensures that the multiply-add algorithm obtains an accurate comparison result.
[0154] For addition and subtraction instructions, since there is no multiplication operation, the effective mantissa is 1 hidden bit + 23 mantissa bits, 24 bits in total. Therefore, the first part, the mantissa comparison logic (13), which compares the upper 25 bits of the mantissa before the alignment shift, already yields an accurate comparison result, and there is no need to consider the effect of the alignment shift on a product mantissa. The addition and subtraction instructions therefore only reuse the mantissa comparison logic (13) of the multiply-add module, and do not reuse the final mantissa comparison logic (28), which compares the remaining bits after the mantissas are aligned.
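Why a single pre-shift comparison is exact for two original operands can be sketched as follows. This is an illustrative model, not the circuit of logic (13): the function names are hypothetical, and the sketch compares biased exponents first and then 24-bit significands, which is an assumption about how the magnitude order is derived.

```python
def significand(exp: int, frac: int) -> int:
    """24-bit significand: hidden bit (0 for denormals) plus 23 fraction bits."""
    hidden = 1 if exp != 0 else 0
    return (hidden << 23) | frac


def swap_needed(ea: int, fa: int, eb: int, fb: int) -> bool:
    """True when |b| > |a|, so that b must take the role of the larger operand."""
    if ea != eb:
        return eb > ea
    # Equal exponents: the 24-bit significands decide, and no later
    # alignment shift can change this order.
    return significand(eb, fb) > significand(ea, fa)
```

Because every significand fits in 24 bits, nothing below the compared field exists, which is the property the paragraph above relies on.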
[0155] (b) Reuse of the alignment part
[0156] The overall implementation of the alignment part can be summarized as follows:
[0157] After the mantissa comparison logic (13), which compares the upper 25 bits of the mantissa before the alignment shift, the basic exchange signal and the initial magnitude relation are determined. The selector (15), controlled by the shift control logic (19), selects the corresponding shift data; it also pads the shift data with zeros when the shift amount exceeds 15. The shift amount is determined by the shift amount adjustment logic (20). Once both the shift amount and the shift data are ready, they enter the right shifter (16), which is 48+SHF_NUM bits wide with SHF_NUM taken as 16 according to the operator statistics, to obtain the shift result. Since addition and subtraction instructions have no multiplication operation, the alignment is completed by a right shift only. Finally, the larger and smaller mantissas after alignment are selected by selectors (23 / 24) respectively. Selector (23) selects the larger mantissa result after alignment according to the shift control logic; selector (24) selects the smaller mantissa result after alignment according to the shift control logic;
[0158] Correspondingly, the exponent adjustment logic (21) in the exponent adjustment module further adjusts the exponent to obtain the adjusted exponent;
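The right-shift-only alignment described above can be sketched in a few lines. This is a simplified model under stated assumptions: the function name `align_right` and the `extra_bits` parameter are hypothetical, and the widths do not reproduce the 48+SHF_NUM-bit shifter (16) of the embodiment. Bits shifted out are folded into a sticky bit so later rounding still sees them.

```python
def align_right(small_sig: int, exp_diff: int, extra_bits: int = 3) -> int:
    """Right-shift the smaller significand by exp_diff.

    extra_bits low-order positions are kept below the significand; every bit
    shifted out beyond them is OR-ed into the least significant bit (sticky).
    """
    widened = small_sig << extra_bits
    if exp_diff >= widened.bit_length():
        # Everything is shifted out; only the sticky information survives.
        return 1 if widened else 0
    shifted = widened >> exp_diff
    lost = widened & ((1 << exp_diff) - 1)
    return shifted | (1 if lost else 0)
```

With equal exponents (`exp_diff == 0`) the operand passes through unchanged, matching the case where no alignment is needed.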
[0159] (c) Reuse of the mantissa summation:
[0160] The overall implementation of the mantissa summation is summarized as follows:
[0161] After the aligned mantissas are obtained, if the actual operation is subtraction, the aligned mantissa is passed through the negation module (25) to obtain its opposite; for addition no additional operation is required. The selector (26) then selects the corresponding mantissa summation data according to the subtraction signal. The selected mantissas and the subtraction signal enter the mantissa summation calculation module (27) together. The mantissa summation calculation module contains a 26-bit lower-half adder and a 25-bit upper-half adder. The upper-half adder must wait for the carry from the lower-half adder before it can complete the mantissa summation and produce the final mantissa sum.
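The two-part adder can be modeled as below. This is a sketch, not the circuit of module (27): the names are hypothetical, and treating subtraction as "invert the operand and add 1 through the carry input" (two's complement) is an assumption about how the negation module (25) and the subtraction signal cooperate.

```python
LOW_BITS, HIGH_BITS = 26, 25  # adder widths named in the paragraph above


def split_add(x: int, y: int, carry_in: int = 0) -> int:
    """Add two (LOW_BITS + HIGH_BITS)-bit values with a two-part adder."""
    low_mask = (1 << LOW_BITS) - 1
    high_mask = (1 << HIGH_BITS) - 1
    low = (x & low_mask) + (y & low_mask) + carry_in
    carry = low >> LOW_BITS  # carry out of the lower half
    high = (x >> LOW_BITS) + (y >> LOW_BITS) + carry
    return ((high & high_mask) << LOW_BITS) | (low & low_mask)


def subtract_via_negation(large: int, small: int) -> int:
    """large - small computed as large + ~small + 1, modulo 2**51."""
    width_mask = (1 << (LOW_BITS + HIGH_BITS)) - 1
    return split_add(large, ~small & width_mask, carry_in=1)
```

The upper half depends on `carry`, which mirrors the statement that the upper-half adder must wait for the lower-half carry.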
[0162] The P2 level (see Figure 2, the P2 pipeline implementation structure diagram for addition and subtraction instructions): the post-normalization, rounding, final result processing and final exception state judgment operations required at the P2 level are basically the same as the P2 level operations of the multiply-add algorithm and are completely reused. The only differences are in the leading zero statistics part and the post-normalization shift part.
[0163] The leading zero statistics section includes:
[0164] For addition and subtraction instructions there is no need to count the leading zeros in the lower half of the result. Because addition and subtraction include exponent alignment, the number of leading zeros counted for post-normalization will not exceed 24. Even in the most extreme case, when two numbers are subtracted and their exponents differ by 1, with the mantissa of the larger exponent being 1.000...000 and the mantissa of the smaller exponent being 1.111...111, two very close numbers are subtracted and the number of post-normalization leading zeros still does not exceed 24. Therefore, only the leading zeros in the upper half need to be counted. No compensation of the post-normalization leading zeros is needed: the leading zeros counted in the upper half are the exact post-normalization leading zeros.
[0165] The post-normalization shift section includes:
[0166] Since the leading zero count will not exceed 24, the post-normalization shift can always select the original data for shifting;
[0167] The overall implementation scheme for the P2-level operation in the multiply-accumulate algorithm is as follows:
[0168] The mantissa summation result passed from the P1 level directly reuses the 24-bit upper-half leading zero statistics module (2) to obtain the actual number of post-normalization leading zeros. This count enters the shift amount calculation module (6) together with the exponent to obtain the shift amount, and the shift data is obtained from the shift data adjustment module (7). Since the number of leading zeros will not exceed 24, the shift amount never exceeds 24, so the original data is always selected as the shift data. The shifted result is obtained from the 40-bit left shifter (9) according to the shift amount and the shift data. If the mantissa summation result obtained at the P1 level produces a mantissa overflow, a 1-bit 0 is concatenated at the most significant position by the right shift logic (8) of this level to emulate shifting the mantissa right by one bit. The final mantissa result is obtained through the mantissa selection module (12);
[0169] Meanwhile, the exponent adjustment module (10) and the rounding adjustment module (11) complete the adjustment of the exponent and of the rounding bits; the sign bit is calculated by the sign bit calculation module.
[0170] These data are collectively entered into the rounding operation, rounding adjustment and abnormal status judgment module (13) to complete the rounding operation, rounding adjustment and abnormal status judgment, and finally enter the final result selection (14) to obtain the final result.
[0171] 1) Post-normalization section:
[0172] The mantissa sum data is obtained from the P1 pipeline stage, totaling 50 bits (2 hidden bits + 46 mantissa bits + 1 rounding bit + 1 sticky bit). The exponent is the adjusted exponent result generated at the P1 level. The mantissa sum first enters the leading zero module for counting, and then the shift operation is performed.
[0173] Further includes:
[0174] (a) The specific shift details are as follows:
[0175] ① The exponent is large enough for a left shift by the full leading zero count;
[0176] For addition and subtraction instructions, the number of post-normalization leading zeros will not exceed 24, so the shift amount will not exceed 24. Therefore, the original data is always selected for shifting; the shift amount is the number of post-normalization leading zeros.
[0177] ② The exponent is not large enough for a left shift by the full leading zero count;
[0178] For addition and subtraction instructions, the post-normalization leading zero count will not exceed 24. If the exponent is too small to allow the full shift, the shift amount is less than the post-normalization leading zero count and therefore also does not exceed 24. For shifts with a shift amount not exceeding 32, the original data is still selected for shifting; the shift amount is the exponent - 1.
[0179] ③ The mantissa sum overflows by one bit;
[0180] If the mantissa overflows after summation, a 1-bit 0 is concatenated at the most significant position of the mantissa by the right shift control logic (8) to emulate shifting the mantissa right by one bit; in this case the left shifter is not used.
[0181] ④ No post-normalization shift is required;
[0182] If there is no leading zero to normalize, or if the obtained number is itself a denormalized number, the original mantissa result is selected directly through the mantissa selection module (12), and the shift amount is 0.
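The four shift cases can be summarized in a short model. This is a sketch under stated assumptions: the function name is hypothetical, the input exponent is assumed to be at least 1 (normalized operands), a 24-bit mantissa field is used instead of the 50-bit sum, and the bit dropped on overflow is simply discarded here, whereas a real design would keep it for rounding.

```python
def post_normalize(mant: int, exp: int, width: int = 24):
    """Return (mantissa, exponent) after post-normalization.

    mant may occupy width + 1 bits when the sum carried out of the hidden bit.
    Assumes exp >= 1.
    """
    if mant == 0:
        return 0, 0                      # exact zero result
    if mant >> width:                    # case 3: overflow by one bit
        return mant >> 1, exp + 1
    lz = width - mant.bit_length()
    if lz == 0:                          # case 4: already normalized
        return mant, exp
    if exp > lz:                         # case 1: full left shift allowed
        return mant << lz, exp - lz
    return mant << (exp - 1), 0          # case 2: shift limited by the exponent
```

The returned exponents correspond to the four possibilities named for the exponent adjustment: exponent + 1, exponent, exponent minus the leading zero count, and 0.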
[0183] (b) Further reduction of the shift amount
[0184] From the two left-shift cases discussed above, the maximum shift amount is 24. Where timing allows, the shift range of the shifter is reduced: for data with a shift amount exceeding 15, the upper 16 zero bits are removed under an internal control signal and 16 zero bits are appended at the low end. This reduces the maximum shift amount handled by the shifter from 31 to 15.
[0185] The specific implementation details are as follows:
[0186] (I) When the shift amount is less than 16, the data enters the left shifter directly to perform the shift;
[0187] (II) When the shift amount is 16 or greater, the shift data first enters the shift data adjustment module (7), where 16 zero bits are removed from the front and 16 zero bits are appended at the back; the processed data then enters the left shifter for the remaining shift.
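The two-step shift above is equivalent to a direct left shift, which the following sketch checks. The function name and the 48-bit `WIDTH` are assumptions used for illustration; the pre-adjustment is valid only because the 16 removed bits are known to be zeros.

```python
WIDTH = 48  # assumed shift-data width before the bit-width reduction


def left_shift_reduced(data: int, shift: int) -> int:
    """Left shift by 0..31 using a shifter that only handles 0..15."""
    mask = (1 << WIDTH) - 1
    if shift >= 16:
        # The top 16 bits are leading zeros that would be shifted out anyway:
        # drop them and append 16 zeros at the low end.
        assert data >> (WIDTH - 16) == 0, "upper 16 bits must be zero"
        data = (data << 16) & mask
        shift -= 16
    return (data << shift) & mask
```

After the adjustment the remaining shift amount is at most 15, so a 4-bit shift control is sufficient.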
[0188] (c) Further reduction of shifted data;
[0189] Originally, 48 bits of shift data were required for the shift, but since the shifter can shift at most 15 bits, it is only necessary to satisfy the left shifter normalization, that is, only (24+16) bits of shift data are needed, and the remaining 8 bits can be used as sticky bits; thus, the shift data bit width can be reduced.
[0190] In summary, the original 48-bit shifter can be reduced to a 40-bit shifter;
[0191] 2) Rounding adjustments:
[0192] The rounding adjustment module (11) completes the corresponding function, where the guard bit is the last bit of the mantissa, the rounding bit is the first bit after the last bit of the mantissa, and the sticky bit is the OR-reduction of all remaining bits after the rounding bit.
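The three rounding-related bits can be extracted as in the sketch below. The function name and the `extra` parameter (the number of bits carried below the last mantissa bit) are hypothetical; the naming follows this document, in which the guard bit is the last kept bit.

```python
def rounding_bits(mant: int, extra: int):
    """Split a mantissa that carries `extra` (>= 2) bits below its last bit.

    Returns (kept mantissa, guard, round, sticky): guard is the last kept
    bit, round is the first discarded bit, and sticky is the OR of every
    bit below the round bit.
    """
    kept = mant >> extra
    guard = kept & 1
    round_bit = (mant >> (extra - 1)) & 1
    sticky = 1 if mant & ((1 << (extra - 1)) - 1) else 0
    return kept, guard, round_bit, sticky
```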
[0193] 3) Exponent adjustment section:
[0194] The exponent adjustment module (10) completes the corresponding function;
[0195] The exponent adjustment is performed in parallel with the post-normalization, and there are four possibilities: exponent + 1, exponent, exponent minus the post-normalization leading zero count, and 0; the appropriate one is chosen according to the mantissa selection.
[0196] 4) Rounding operation and post-rounding adjustment:
[0197] This part of the logic is included in the rounding operation, rounding adjustment and abnormal state judgment module (13);
[0198] To improve timing, the mantissa incremented by 1 is computed directly by a 25-bit adder before the rounding decision is known. According to the rounding mode, either the original mantissa or the incremented mantissa is then selected as the final rounded result. This rounded result is used to further adjust the exponent and mantissa.
[0199] Exponent adjustment: If rounding by +1 causes the mantissa to overflow by 1 bit, the exponent needs to be incremented by 1;
[0200] Mantissa adjustment: If rounding by +1 causes the mantissa to overflow by 1 bit, the mantissa needs to be shifted one bit to the right.
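The select-after-increment rounding can be sketched as follows. Only round-to-nearest-even is modeled, which is an assumption made to keep the example short; the other rounding modes would change only the `round_up` condition. The function name is hypothetical.

```python
def round_mantissa(mant: int, exp: int, guard: int, round_bit: int,
                   sticky: int, width: int = 24):
    """Round to nearest even using a precomputed mant + 1."""
    incremented = mant + 1                       # computed before the decision
    round_up = round_bit and (sticky or guard)   # a tie goes to the even mantissa
    result = incremented if round_up else mant
    if result >> width:                          # the +1 overflowed by one bit
        result >>= 1                             # shift the mantissa right
        exp += 1                                 # and increment the exponent
    return result, exp
```

For instance, an all-ones 24-bit mantissa that rounds up becomes 0x800000 with the exponent incremented by 1.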
[0201] 5) Final sign bit calculation:
[0202] Sign bit calculation (1) completes the corresponding function;
[0203] ① If a special value exists, assign the sign bit of the special value;
[0204] ② The absolute values of the two numbers are equal, that is, after alignment, operands a and b are equal, and the effective operation is subtraction. This is related to the rounding mode. If it is rounding to negative infinity, the sign bit is negative; otherwise, the sign bit is positive.
[0205] ③ The remaining cases need to be determined based on the exchange signal and the subtraction signal, as shown in the table below:
[0206] Where A is the opra operand, B is the oprb operand, and opra_sign is the sign of the opra operand. The actual operation is the valid operation obtained at level P0.
[0207]
[0208]
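The priority among the three sign cases can be sketched as below. The function and parameter names are hypothetical. `table_sign` stands for the value taken from the exchange-signal and subtraction-signal table, which is not reproduced in this text, so it is passed in rather than computed. The mode name "RDN" (round toward negative infinity) follows the RISC-V naming.

```python
def result_sign(special_sign, exact_cancel: bool, rounding_mode: str,
                table_sign: int) -> int:
    """Select the final sign bit in the priority order of cases 1 to 3."""
    if special_sign is not None:     # case 1: a special value fixes the sign
        return special_sign
    if exact_cancel:                 # case 2: equal magnitudes, effective subtraction
        return 1 if rounding_mode == "RDN" else 0
    return table_sign                # case 3: from the exchange/subtraction table
```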
[0209] 6) Exception detection:
[0210] This part of the logic is included in the rounding operation, rounding adjustment and abnormal state judgment module (13);
[0211] The logic for judging exceptions is as follows:
[0212] ① Invalid exception: Detected in the special value judgment at the P0 level;
[0213] ② Division by zero exception: None;
[0214] ③ Overflow exception: After the exponent is updated, the result exceeds the maximum exponent value, and the result is not an infinity, a qNaN, or an invalid exception.
[0215] ④ Underflow exception: After rounding, the data lies strictly between the negative and positive minimum normalized numbers (the positive minimum normalized number is 32'h0080_0000, the negative minimum normalized number is 32'h8080_0000) and is inexact;
⑤ Inexact exception: The rounding bit or sticky bit is 1, or an overflow exception occurs.
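The flag conditions listed above can be expressed compactly. This is a sketch with hypothetical names; the `overflow` and `invalid` inputs are assumed to come from the exponent update and from the P0 special value judgment respectively, and interactions between flags beyond those stated in the list are not modeled.

```python
MIN_NORMAL = 0x0080_0000  # magnitude bits of the smallest normalized number


def exception_flags(result_bits: int, round_bit: int, sticky: int,
                    overflow: bool, invalid: bool) -> dict:
    """Derive the flags from a rounded 32-bit result and its rounding bits."""
    magnitude = result_bits & 0x7FFF_FFFF        # drop the sign bit
    inexact = bool(round_bit or sticky or overflow)
    underflow = inexact and magnitude < MIN_NORMAL
    return {"invalid": invalid, "divide_by_zero": False,
            "overflow": overflow, "underflow": underflow, "inexact": inexact}
```

Comparing the magnitude bits against 0x0080_0000 covers both signs at once, since 32'h8080_0000 differs from 32'h0080_0000 only in the sign bit.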
[0216] 7) Final result processing:
[0217] The final result selection (14) completes the corresponding function;
[0218] If an overflow exception occurs, the result must be assigned the appropriate maximum value (infinity or the largest finite number) according to the rounding mode. In all other cases, the sign bit, exponent, and mantissa are simply concatenated. This yields the final result.
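The rounding-mode-dependent overflow result can be sketched as follows. The behavior shown is the standard IEEE 754 overflow rule; the function name is hypothetical, and the mode names (RNE, RTZ, RDN, RUP, RMM) follow the RISC-V naming, which is an assumption about how the modes are encoded in this design.

```python
POS_INF, POS_MAX = 0x7F80_0000, 0x7F7F_FFFF  # +infinity, largest finite value


def overflow_result(sign: int, rounding_mode: str) -> int:
    """32-bit result to assign when an overflow exception occurs."""
    if rounding_mode == "RTZ":       # toward zero: never reaches infinity
        magnitude = POS_MAX
    elif rounding_mode == "RDN":     # toward negative infinity
        magnitude = POS_INF if sign else POS_MAX
    elif rounding_mode == "RUP":     # toward positive infinity
        magnitude = POS_MAX if sign else POS_INF
    else:                            # RNE and RMM round the overflow to infinity
        magnitude = POS_INF
    return (sign << 31) | magnitude
```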
[0219] In summary, the addition and subtraction instructions reuse the logic modules in the multiplication and addition module, which reduces the hardware design area to the greatest extent. However, there are also differences between the two. The specific differences are described as follows:
(1) The addition and subtraction instructions do not need to reuse the product mantissa multiplication or the advance counting of the product leading zeros at the P0 level, because addition and subtraction instructions contain no multiplication-related operations. For the addition and subtraction instructions, only the special value judgment logic and the exponent difference and comparison module at the P0 level need to be reused to complete the special value judgment, the exponent difference calculation, the determination of the exchange signal, and other pre-alignment work.
[0220] (2) Addition and subtraction instructions at level P1 need to reuse the comparison logic, alignment logic, and mantissa summation logic; among them, the alignment logic and comparison logic differ from those of multiplication-addition instructions:
[0221] (a) Alignment logic:
[0222] The alignment for addition and subtraction instructions takes place between two original operands, unlike the alignment between the product result and an original operand in multiply-add instructions. Since no leading-zero-related operations are involved, the alignment algorithm is greatly simplified. For alignment between two original operands there are only two possibilities: ① The mantissa of the operand with the smaller exponent is aligned to the mantissa of the operand with the larger exponent, so the mantissa with the smaller exponent is right-shifted to align the exponents and ensure correct operation; ② The exponents of both are equal, so no alignment is needed. Therefore, alignment is completed with a single right shift in the alignment shifter, and no pipeline stall is required.
[0223] (b) Comparison logic:
[0224] Since the alignment for addition and subtraction instructions takes place between the two original operands, the 25-bit comparison logic can be used before alignment to obtain the comparison result. Only two situations are possible: ① the two numbers are equal, in which case no alignment is needed; ② the two numbers are not equal, in which case alignment is needed. The comparison obtained at this point is therefore accurate, and the alignment cannot affect it. Consequently, there is no need to reuse the post-alignment mantissa comparison logic or the pipeline-stall logic to obtain an accurate comparison result.
[0225] (3) Addition and subtraction instructions at the P2 level reuse operations such as post-normalization, rounding, and final result adjustment; the differences lie in the post-normalization part, as follows:
[0226] (a) Post-normalization leading zero statistics:
[0227] For addition and subtraction instructions, leading zeros that require post-normalization arise when two nearby operands are subtracted. The largest number of leading zeros occurs when two numbers are subtracted and their exponents differ by 1, with the mantissa of the larger exponent being 1.000...000 and the mantissa of the smaller exponent being 1.111...111. After exponent alignment, the mantissa of the smaller exponent becomes 0.111...1111. After the mantissa summation, the result is 0.000...0001. That is, even in this case at most 24 leading zeros are generated. Therefore, the number of leading zeros will not exceed 24, and counting the leading zeros of the upper half of the result is sufficient.
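The worst case described above can be checked numerically. This is a small sketch with a hypothetical function name; it keeps one extra low bit so that the one-position alignment shift loses nothing, and counts leading zeros in the resulting 25-bit field.

```python
def leading_zeros_after_subtract(big_sig: int, small_sig: int, exp_diff: int,
                                 width: int = 25) -> int:
    """Leading zeros of big - small in a width-bit field after alignment.

    Both significands are 24 bits; one extra low bit is kept so that a
    one-position alignment shift is exact.
    """
    big = big_sig << 1
    small = (small_sig << 1) >> exp_diff
    diff = big - small
    return width - diff.bit_length()


# Larger exponent: 1.000...000, smaller exponent: 1.111...111, difference 1.
worst = leading_zeros_after_subtract(0x800000, 0xFFFFFF, exp_diff=1)
```

Here `worst` evaluates to 24, matching the bound stated in the paragraph above.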
[0228] (b) Post-normalization shift:
[0229] Following the leading zero reasoning above, the post-normalization shift amount will not exceed 24. Therefore, it is only necessary to select the original data for shifting.
[0230] In summary, this application, considering the application scenarios of single-precision floating-point data, ensures that the performance requirements of most single-precision floating-point calculation scenarios are met. It utilizes the multiply-accumulate module to perform calculations for both addition / subtraction instructions and multiply-accumulate instructions, reusing the same module logic as much as possible to reduce the area of the computation.
[0231] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, various modifications and variations can be made to the embodiments of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for optimizing the fusion of floating-point addition and subtraction instructions in a single-precision floating-point multiply-accumulator, characterized in that, in performing single-precision floating-point addition and subtraction calculations, the method uses existing logic to perform the following functions at each pipeline stage: P0 level, the first pipeline stage: completes the special value judgment and the exponent difference and comparison logic; P1 level, the second pipeline stage: completes the alignment shift operation, comparison operation, and summation of the mantissas after alignment; P2 level, the third pipeline stage: post-normalization, rounding, final result processing and final abnormal state judgment; The P0 level includes: (a) The instruction control signal module (1) and special value judgment module (2) in the multiply-accumulate module are reused to perform special value judgment; The special value judgment includes: the data entering the special value judgment module is adjusted according to the instruction control signal module; for addition and subtraction instructions, there is the concept of an effective operation: the extracted instruction information and the sign bits of the two operands together determine whether the actual operation is addition or subtraction; opra represents the original operand 'a' of the addition / subtraction instruction, and oprb represents the original operand 'b' of the addition / subtraction instruction. Note: an input operation of + corresponds to the addition instruction (FADD_S); an input operation of - corresponds to the subtraction instruction (FSUB_S). 
After the effective operation is determined, the special value module performs its judgment; The special value criteria are shown in the following two tables: The table below shows the results of special value judgments when the effective operation is addition: The table below shows the results of special value judgments when the effective operation is subtraction: Note: (sub)norm is a general term for normalized and denormalized numbers. NaN values (Not a Number, representing an unrepresentable value) are divided into two categories: qNaN and sNaN. An sNaN is a value whose exponent is all 1s, whose most significant mantissa bit is 0, and whose mantissa as a whole is not 0; a qNaN is a value whose exponent is all 1s and whose most significant mantissa bit is 1. RISC-V specifies that if the result of a floating-point operation is a NaN, a fixed NaN value is used. The NaN value for single-precision floating-point is 0x7fc0_0000. Therefore, the final qNaN result is assigned a fixed value, i.e., qNaN = 32'h7fc0_0000. A crossed-out table cell indicates that the result must be obtained through normal calculation rather than as a special value; If one of the above special value results is generated, a special value signal is set to mark the operation as a special case, and the result is assigned the special value result obtained above to facilitate subsequent calculations; (b) The exponent difference and comparison module (8) completes the calculation of the exponent difference and the exponent comparison of the two operands to determine the exchange signal; (c) Since the addition and subtraction instructions do not need to complete the multiplication of the mantissas, there is no need to reuse the product mantissa calculation module (3) to calculate the mantissa product result. 
It is only necessary to use the selector (4) to select the mantissa or special value of operand a according to the special value signal. The selector (4) is a selector module that selects the mantissa or special value of operand a according to the instruction control signal and the special value judgment signal. The P1 level: The P1 level completes the alignment shift operation, comparison operation, and mantissa summation after alignment. Therefore, the right shifter is reused to complete the alignment operation. Since the relevant exponent difference has been calculated at the P0 level, the alignment shift is performed based on the exponent difference, and the mantissa summation and comparison operations are completed after alignment. Alignment for addition and subtraction operations is right alignment only, so no left shift is required here. Since addition and subtraction instructions have no multiplication operation, there is no need for pipeline stalls caused by the normalization comparison of the mantissa product result or by the alignment operation. The P2 level: The post-normalization, rounding, final result processing and final abnormal state judgment operations required at the P2 level are basically the same as the P2 level operations in the multiply-accumulate algorithm and are completely reused, with differences only in the leading zero statistics part and the post-normalization shift part. The leading zero statistics section includes: There is no need to count the leading zeros in the lower half of the result of the addition and subtraction instructions. Since addition and subtraction operations include exponent alignment, the number of leading zeros counted for post-normalization will never exceed 24. Even in the most extreme case, when two numbers are subtracted and the exponents differ by 1, the mantissa of the larger exponent is 1.000...000 and the mantissa of the smaller exponent is 1.111...111. 
These are two very close numbers being subtracted, and even then the number of leading zeros in post-normalization does not exceed 24. Therefore, only the leading zeros in the higher half need to be counted. In this case no compensation of the post-normalization leading-zero count is needed: the leading zeros counted in the higher half are exactly the post-normalization leading zeros. The post-normalization shift: since the leading-zero count never exceeds 24, the post-normalization shift can always operate on the original data.
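The bound of 24 leading zeros can be checked numerically. The following is a minimal sketch, not the patented circuit: it assumes 24-bit significands (hidden bit included) placed in a 48-bit frame, and the names `leading_zeros_after_sub`, `SIG_BITS`, and `FRAME` are hypothetical.

```python
SIG_BITS = 24            # single-precision significand, hidden bit included
FRAME = 2 * SIG_BITS     # 48-bit frame so right-shifted bits are not lost


def leading_zeros_after_sub(sig_a, sig_b, exp_diff):
    """Leading zeros of |a - b| in a 48-bit frame anchored at a's hidden bit.

    sig_a belongs to the operand with the larger (or equal) exponent and
    exp_diff >= 0 is the exponent difference. Returns None for an exact zero.
    """
    big = sig_a << SIG_BITS
    small = (sig_b << SIG_BITS) >> exp_diff   # alignment is a right shift only
    diff = abs(big - small)                   # abs() stands in for the exchange signal
    if diff == 0:
        return None
    return FRAME - diff.bit_length()


# Extreme case from the text: 1.000...000 (larger exponent) minus 1.111...111.
worst = leading_zeros_after_sub(0x800000, 0xFFFFFF, 1)
print(worst)  # 24
```

Under these assumptions an exponent difference of 0 gives at most 23 leading zeros for a nonzero result, and a difference of 2 or more gives at most 1, so the difference-of-1 case is the one that reaches 24.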
2. The optimization method for integrating floating-point addition and subtraction instructions in a single-precision floating-point multiply-accumulator according to claim 1, characterized in that, The P1 level further includes:
(a) Comparison logic reuse: The addition and subtraction instructions only need to reuse the first part of the comparison logic in the multiplication module, namely the mantissa comparison logic (13), which is the 25-bit comparison of the high half of the mantissa before the shift. They do not need to reuse the second part, the final comparison logic (28), which compares the remaining part after the mantissas are aligned.
(b) Alignment logic reuse: The alignment is implemented as follows. The mantissa comparison logic (13), which compares the high half of the mantissa (25 bits) before the alignment shift, determines the exchange signal and the exact magnitude relationship. The selector (15) is then controlled by the shift control logic (19), that is, the selector selects the corresponding shift data according to the corresponding shift logic. The selector (15) also pads the shift data with 0s if the shift amount exceeds 15. The shift amount itself is determined by the shift amount adjustment logic (20). After both the shift amount and the shift data are ready, the process proceeds... Input a 48+SHF_NUM, and use a 16-bit right shifter (16) according to the operator statistics to obtain the shift result; since addition and subtraction instructions involve no multiplication, the alignment only needs a right shift. Finally, the aligned big mantissa and little mantissa are selected by the selectors (23 / 24) respectively: the selector (23) selects the aligned big mantissa according to the shift control logic, and the selector (24) selects the aligned little mantissa according to the shift control logic.
Accordingly, the exponent adjustment logic (21) in the exponent adjustment module further adjusts the exponent to obtain the adjusted exponent;
(c) Mantissa summation reuse: The mantissa summation is implemented as follows. After the aligned mantissa is obtained, if the effective operation is subtraction, the aligned mantissa passes through the negation module (25) to obtain its opposite; if the effective operation is addition, no additional operation is required and the aligned mantissa is kept unchanged. The selector (26) then selects the corresponding mantissa summation data according to the subtraction signal. The selected mantissa, the other mantissa, and the subtraction signal enter the mantissa summation calculation module (27) together. The mantissa summation calculation module contains a 26-bit lower-half adder and a 25-bit higher-half adder. The higher-half adder must wait for the carry out of the lower-half adder before it can complete the mantissa summation and produce the final mantissa summation result.
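The split summation described above can be sketched as follows. This is an illustration under stated assumptions rather than the claimed hardware: the 51-bit total width is inferred from 26 + 25, subtraction is modeled as bitwise inversion plus a carry-in of 1 (two's complement), and the names `split_add` and `mantissa_sum` are hypothetical.

```python
LOW_BITS = 26                      # lower-half adder width (from the text)
HIGH_BITS = 25                     # higher-half adder width (from the text)
WIDTH = LOW_BITS + HIGH_BITS       # assumed total width: 51 bits


def split_add(x, y, carry_in=0):
    """Add two WIDTH-bit values with a low adder feeding its carry to a high adder."""
    low_mask = (1 << LOW_BITS) - 1
    high_mask = (1 << HIGH_BITS) - 1
    low = (x & low_mask) + (y & low_mask) + carry_in
    carry = low >> LOW_BITS                      # the high half waits for this carry
    high = (x >> LOW_BITS) + (y >> LOW_BITS) + carry
    carry_out = high >> HIGH_BITS
    return ((high & high_mask) << LOW_BITS) | (low & low_mask), carry_out


def mantissa_sum(big, small, is_sub):
    """Effective add or subtract of aligned mantissas (assumed two's complement)."""
    mask = (1 << WIDTH) - 1
    operand = (~small) & mask if is_sub else small   # stand-in for the negation step
    return split_add(big, operand, carry_in=1 if is_sub else 0)


print(mantissa_sum(5, 3, False))  # (8, 0)
print(mantissa_sum(5, 3, True))   # (2, 1)
```

In the subtraction case the carry out of 1 simply reflects that the larger mantissa was the minuend; the masked sum is the difference.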
3. The optimization method for integrating floating-point addition and subtraction instructions in a single-precision floating-point multiply-accumulator according to claim 1, characterized in that, The overall implementation scheme for the P2 level operation in the addition and subtraction algorithm is as follows: The mantissa summation result passed from the P1 level is fed directly into the reused 24-bit high-half leading-zero count module (2) to obtain the actual number of post-normalization leading zeros. This count enters the shift amount calculation module (6) together with the exponent to obtain the shift amount, and the shift data is obtained from the shift data adjustment module (7). Since the number of leading zeros does not exceed 24, the shift amount never exceeds 24, so the shift data is always the original data. The shifted data is obtained from the shift amount and the shift data through the 40-bit left shifter (9). If the mantissa summation result from the P1 level overflowed, a single 0 bit is concatenated above the most significant bit in the right shift logic (8) of this level to emulate shifting the mantissa one bit to the right. The final mantissa result is obtained through the mantissa selection module (12); Meanwhile, the exponent adjustment module (10) and the rounding adjustment module (11) adjust the exponent and the rounding bits; the sign bit is calculated by the sign bit calculation module. All of these data enter the rounding operation, rounding adjustment and exception status judgment module (13), which completes the rounding operation, rounding adjustment and exception status judgment, and finally the final result selection (14) produces the final result.
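The idea of handling a shift amount above 15 with a fixed 16-bit adjustment followed by a small shifter can be sketched as below. This is a behavioral sketch only: it keeps a 48-bit frame for simplicity instead of modeling the 40-bit datapath, and the name `staged_left_shift` is hypothetical.

```python
def staged_left_shift(data, amount, width=48):
    """Left shift by 0..31 using a fixed 16-bit step plus a 0..15 shifter."""
    if not 0 <= amount <= 31:
        raise ValueError("shift amount must be in 0..31")
    mask = (1 << width) - 1
    if amount >= 16:
        # Data adjustment: drop 16 leading bits (zeros for valid inputs)
        # and append 16 zero bits at the low end.
        data = (data << 16) & mask
        amount -= 16
    return (data << amount) & mask   # remaining shift is at most 15


# A value with 20 leading zeros in a 48-bit frame is normalized by a shift of 20.
print(hex(staged_left_shift(1 << 27, 20)))  # 0x800000000000
```

The result matches a single wide shift for every amount in range, which is why the shifter itself only needs to cover 0 to 15.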
4. The optimization method for integrating floating-point addition and subtraction instructions in a single-precision floating-point multiply-accumulator according to claim 3, characterized in that, The P2-level operation in the addition / subtraction algorithm further includes:
1) Post-normalization: The mantissa summation data comes from the P1 pipeline stage and totals 50 bits, namely 2 hidden bits + 46 mantissa bits + 1 rounding bit + 1 sticky bit. The exponent is the adjusted exponent produced by the P1 stage. The data first enters the high-half leading-zero module (2) for counting and is then shifted. Further includes:
(a) The shift cases are as follows:
① The exponent is large enough for a left shift by the leading-zero count; For addition and subtraction instructions, the post-normalization leading-zero count does not exceed 24, so the shift amount does not exceed 24. The shift therefore always uses the original data; the shift amount is the post-normalization leading-zero count.
② The exponent is not large enough for a left shift by the leading-zero count; For addition and subtraction instructions, the post-normalization leading-zero count does not exceed 24. If the exponent cannot cover the full shift, i.e., the shift amount is less than the post-normalization leading-zero count, the shift amount still does not exceed 24. For shifts with a shift amount not exceeding 32, the original data is still selected for shifting; the shift amount is the exponent - 1.
③ The mantissa sum overflows by one bit; If the mantissa overflows after summation, a single 0 bit is prepended above the most significant bit of the mantissa through the right shift control logic (8) to emulate shifting the mantissa one bit to the right; in this case the data does not enter the left shifter.
④ No post-normalization shift is required; If post-normalization finds no leading zeros, or the result is itself a denormalized number, the original mantissa result is selected directly through the mantissa selection module (12), and the shift amount is 0.
(b) Further reduction of the shift amount: From the two left-shift cases above, the maximum shift amount is 24. When timing allows, the shift range of the shifter can be reduced. For data with a shift amount exceeding 15, an internal control signal removes the 16 leading zero bits and appends 16 zero bits at the end. This reduces the maximum shift amount of the shifter from 31 to 15. The implementation details are as follows: (I) When the shift amount is less than 16, the data enters the left shifter directly; (II) When the shift amount is 16 or greater, the shift data first enters the shift data adjustment module (7), where 16 zero bits are removed from the front and 16 zero bits are appended to the back; the processed data then enters the left shifter for the remaining shift.
(c) Further reduction of the shift data; Originally 48 bits of shift data were required, but since the shifter shifts by at most 15 bits, the data only has to cover left-shift normalization, that is, only (24+16) bits of shift data are needed, and the remaining 8 bits can be used as sticky bits; the shift data width can thus be reduced. In summary, the original 48-bit shifter can be reduced to a 40-bit shifter;
2) Rounding adjustment: The rounding adjustment module (11) completes this function; the guard bit is the last bit of the mantissa, the rounding bit is the first bit after the last bit of the mantissa, and the sticky bit is the OR of all bits from the second bit after the last bit of the mantissa onward.
3) Exponent adjustment: The exponent adjustment module (10) completes this function; Exponent adjustment proceeds in parallel with post-normalization and has four possible outcomes: exponent + 1, exponent, exponent - post-normalization leading zeros, and 0; the outcome is chosen according to the mantissa selection.
4) Rounding operation and post-rounding adjustment: This logic is part of the rounding operation, rounding adjustment and exception status judgment module (13); To improve timing, a 25-bit adder computes the mantissa plus 1 directly. The rounding mode then selects either the original mantissa or the incremented mantissa as the final rounded result, which is used to further adjust the exponent and mantissa. Exponent adjustment: if the rounding increment makes the mantissa overflow by 1 bit, the exponent is incremented by 1; Mantissa adjustment: if the rounding increment makes the mantissa overflow by 1 bit, the mantissa is shifted one bit to the right.
5) Final sign bit calculation: The sign bit calculation (1) completes this function; ① If a special value exists, the sign bit of the special value is used; ② If the absolute values of the two numbers are equal, that is, operands a and b are equal after alignment, and the effective operation is subtraction, the sign depends on the rounding mode: when rounding toward negative infinity the sign bit is negative; otherwise the sign bit is positive. ③ The remaining cases are determined by the exchange signal and the subtraction signal, as shown in the table below: Where A denotes operand opra, B denotes operand oprb, and opra_sign is the sign of operand opra. The actual operation is the effective operation obtained at level P0.
6) Exception detection: This logic is part of the rounding operation, rounding adjustment and exception status judgment module (13); The exception flags are determined as follows: ① Invalid exception: detected in the special value judgment at level P0; ② Divide-by-zero exception: none; ③ Overflow exception: after the exponent is updated, the result exceeds the maximum exponent value and is not an infinity, a qNaN result, or an invalid exception. ④ Underflow exception: after rounding, the result lies strictly between ± the minimum normalized number (+ minimum normalized number is 32'h0080_0000, - minimum normalized number is 32'h8080_0000) and is inexact; ⑤ Inexact exception: the round bit or sticky bit contains a 1, or an overflow exception occurs;
7) Final result processing: The final result selection (14) completes this function; If an overflow exception occurs, the result is assigned the appropriate maximum value according to the rounding mode. In other cases, the sign bit, exponent, and mantissa are simply concatenated to obtain the final result.
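The rounding step in item 4) can be sketched as follows: the incremented mantissa is computed up front and the rounding mode only selects between the two candidates. This is a hedged illustration, not the claimed logic. The mode names follow the RISC-V rounding modes (RNE, RTZ, RDN, RUP, RMM), and the function names `needs_increment` and `round_mantissa` are hypothetical.

```python
def needs_increment(mode, sign, lsb, round_bit, sticky):
    """Decide whether the rounded result is mantissa + 1 for a given mode."""
    if mode == "RNE":   # nearest, ties to even
        return bool(round_bit and (sticky or lsb))
    if mode == "RTZ":   # toward zero
        return False
    if mode == "RDN":   # toward negative infinity
        return bool(sign and (round_bit or sticky))
    if mode == "RUP":   # toward positive infinity
        return bool(not sign and (round_bit or sticky))
    if mode == "RMM":   # nearest, ties away from zero
        return bool(round_bit)
    raise ValueError(f"unknown rounding mode: {mode}")


def round_mantissa(mant, mode, sign, round_bit, sticky):
    """Round a 24-bit mantissa; returns (mantissa, exponent_increment)."""
    incremented = mant + 1   # 25-bit adder result, available before the mode decision
    use_inc = needs_increment(mode, sign, mant & 1, round_bit, sticky)
    chosen = incremented if use_inc else mant
    if chosen >> 24:         # rounding overflowed by one bit
        return chosen >> 1, 1
    return chosen, 0


print(round_mantissa(0xFFFFFF, "RNE", 0, 1, 0))  # (8388608, 1)
```

The printed case is a tie on an all-ones mantissa: the increment overflows, so the mantissa is shifted right by one bit and the exponent increment is 1, matching the two adjustments listed in item 4).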
5. The optimization method for integrating floating-point addition and subtraction instructions in a single-precision floating-point multiply-accumulator according to claim 1, characterized in that, The multiply-accumulate module implements the multiply-accumulate algorithm, including: After obtaining the three operands a, b, and c, first calculate the product a * b, align the exponents of the product and the c operand (i.e., the addend operand), then perform the mantissa summation, normalize the sum in the post-normalization part, and finally complete rounding and exception status judgment to obtain the final result; Before exponent alignment, the product mantissa, the leading-zero count of the product, the exponent difference, and the c operand (i.e., the addend operand) must be available. Part of the exponent alignment shift amount can cancel against the normalization shift amount for the leading zeros of the product. The specific algorithm is as follows:
(1) The product exponent > the exponent of the c operand a. The exponent difference > the leading zeros of the product: The product has the larger exponent and is left-shifted for normalization by the number of leading zeros of the product. The c operand has the smaller exponent and is right-shifted for exponent alignment by the exponent difference minus the number of leading zeros of the product. Two shifts are required in this case, and the product remains the larger operand; if the multiplication produces no leading zeros, only the c operand is shifted, by the exponent difference; b. The exponent difference < the leading zeros of the product: Before shifting, the product has the larger exponent.
First, the product mantissa is left-shifted for normalization by the leading zeros of the product. Because the exponent difference < the leading zeros of the product, the product now has the smaller exponent, so the shifted product mantissa must still be right-shifted by the leading zeros of the product minus the exponent difference; The net effect is that the product mantissa only needs to be left-shifted by the exponent difference. The product becomes the smaller operand and the operands are exchanged; c. The exponent difference = the leading zeros of the product: Only the product mantissa needs to be left-shifted for normalization by the leading zeros of the product, which completes the exponent alignment between the c operand and the product;
(2) The product exponent < the exponent of the c operand: Regardless of the relationship between the leading-zero count of the product and the exponent difference, the product has the smaller exponent. First, the product mantissa is left-shifted for normalization by the leading zeros of the product. The exponent of the product is then even smaller, so the shifted product mantissa must still be right-shifted by the leading zeros of the product plus the exponent difference; The net effect is that the product mantissa only needs to be right-shifted by the exponent difference. The product remains the smaller operand;
(3) The product exponent = the exponent of the c operand, so the exponent difference = 0 a. If the leading zeros of the product = 0, the product and the c operand are already exponent-aligned and no shift is required; b.
If the leading zeros of the product ≠ 0, the exponent difference is less than the leading zeros of the product. First, the product mantissa is left-shifted for normalization by the leading zeros of the product. After this shift, the exponent difference equals the leading zeros of the product, so the adjusted product mantissa must be right-shifted by the leading zeros of the product for exponent alignment; The net effect is that no shift is required.
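The net shift amounts in cases (1) to (3) can be collected into one small function. This is a sketch of the case analysis as stated above, not the hardware implementation; the function name `alignment_plan` and the tuple layout are hypothetical, and the operand exchange decision is deliberately left out.

```python
def alignment_plan(exp_prod, exp_c, prod_lz):
    """Net shifts for aligning the product with the c operand.

    Returns (product_left_shift, product_right_shift, c_right_shift).
    prod_lz is the leading-zero count of the product mantissa.
    """
    diff = abs(exp_prod - exp_c)
    if exp_prod > exp_c:
        if diff > prod_lz:            # case (1)a: normalize product, align c
            return prod_lz, 0, diff - prod_lz
        if diff < prod_lz:            # case (1)b: net left shift by the difference
            return diff, 0, 0
        return prod_lz, 0, 0          # case (1)c: normalization completes alignment
    if exp_prod < exp_c:              # case (2): net right shift by the difference
        return 0, diff, 0
    return 0, 0, 0                    # case (3): no net shift in either sub-case


print(alignment_plan(10, 4, 2))   # (2, 0, 4)
print(alignment_plan(10, 8, 5))   # (2, 0, 0)
print(alignment_plan(4, 10, 3))   # (0, 6, 0)
```

In every branch at most one net shift is applied to the product, which is the cancellation between the normalization shift and the alignment shift that the text describes.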