An optimization method for the fused floating-point multiply-accumulate algorithm in a single-precision floating-point multiply-accumulate unit
By designing a single data path and a logic-module reuse scheme based on pipeline pauses, the alignment and post-normalization shifters of the single-precision floating-point multiply-accumulate unit are optimized. This solves the problem of excessive area overhead in the multiply-accumulate algorithm and reduces hardware area and power consumption.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HEFEI JUNZHENG TECH CO LTD
- Filing Date
- 2024-12-27
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies in single-precision floating-point calculations suffer from excessive area overhead and high power consumption and cost because the dual data paths of the multiply-accumulate algorithm prevent the reuse of the same logic module.
A single data path design is adopted, and logic modules are reused through pipeline pauses to reduce area overhead. Performance is sacrificed in special scenarios to reduce hardware design cost. The design of the alignment and post-normalization shifters is optimized based on the application characteristics of single-precision floating-point data.
While ensuring performance in common scenarios, the hardware design area and power consumption have been reduced, and the performance and efficiency of the single-precision floating-point multiply-accumulator have been optimized.
Figure CN122308781A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of chip processing technology, and specifically relates to an optimization method for the fused floating-point multiply-accumulate algorithm in a single-precision floating-point multiply-accumulate unit.
Background Technology
[0002] Floating-point numbers are a real-number data type in computers that offers high computational precision and meets the requirements of high-precision calculation. With the continuous development of modern technology, floating-point arithmetic capability has become increasingly important and plays an indispensable role in fields such as scientific computing, computer technology, and engineering. In scientific computing, in-depth research requires large-scale data simulation. For example, in meteorological forecasting, fluid mechanics, and exploration, scientists need high-precision floating-point calculations to obtain accurate simulation results. Similarly, in the fast-growing field of autonomous driving, accurate floating-point calculations are crucial for the correctness and performance of algorithms. In machine learning algorithm optimization, efficient floating-point calculation accelerates computation and improves algorithm performance and efficiency. Furthermore, with advances in hardware technology, floating-point computing units can now be integrated into CPUs, becoming an important component of the CPU's core computing power.
[0003] In floating-point computing units, the single-precision floating-point module is frequently used, in applications including image recognition, deep learning, object detection and tracking, and QR code recognition. The design strategy of this module significantly affects the overall computing unit. A large chip area has a series of negative consequences: as area increases, so do power consumption and cost. Simply increasing the area of the single-precision floating-point unit to achieve high performance is therefore counterproductive. A key challenge in single-precision floating-point applications is how to minimize area overhead, and thereby power consumption and cost, while improving or maintaining basic performance.
[0004] For multiply-accumulate algorithms, existing technologies typically divide the algorithm into a near path and a far path to ensure performance, but this approach incurs additional area overhead. The solution in this patent takes a different approach. For existing single-precision applications, it guarantees performance in common scenarios and reduces area overhead by sacrificing performance in infrequent scenarios. The core idea is that infrequent scenarios are inherently low-probability events, so there is no need to pay a significant area overhead for them, with the increased power consumption and cost that follow.
[0005] For the multiply-accumulate algorithm, current technologies mainly rely on dividing the algorithm into a near path and a far path to improve performance:
[0006] The near path is primarily used when the multiplication result and the addend operand do not require a large alignment shift. The addend is shifted and aligned according to the exponent difference while the multiplier computes the product in parallel. Both are then fed into the adder to calculate the mantissa sum, followed by post-normalization to obtain the final result. Although the near path allows a narrow shifter in the alignment stage, it requires a significantly larger post-normalization shifter.
[0007] The far path is primarily used when the multiplication result and the addend operand require a large alignment shift. In this case, the path waits for the multiplication to complete and then uses a shifter to align the product or the addend. The aligned values are then sent to the adder to calculate the mantissa sum, which is rounded to obtain the final result. Although the post-normalization shifter is relatively narrow in the far path, the alignment stage consumes a significant amount of shifter area.
[0008] Existing technologies aim to improve performance by changing the floating-point multiply-accumulate operation from a single data path to dual data paths, but this brings a significant problem: components with identical logic in the two data paths cannot be shared, which costs additional area.
[0009] Because their alignment methods differ, the two paths each require their own alignment shifter, operating in different pipeline stages, and these two alignment shifters cannot be shared. Furthermore, in the post-normalization part, the near path requires a very large post-normalization shifter to ensure the correctness of the result. Because of the different alignment methods, the mantissa adders of the two paths are not in the same pipeline stage and therefore cannot be shared either.
[0010] This shows that the main technical drawback of the existing technology is the area overhead caused by the dual data paths, which prevent identical logic modules from being reused. In particular, components with the same logic cannot be shared during multiply-accumulate calculation, which leads to additional area overhead.
Summary of the Invention
[0011] To address the problems of existing technologies, this method takes into account how single-precision floating-point data is used in real-world scenarios and employs a single data path, so that identical internal logic modules can be reused and area overhead reduced. Under the single data path design, performance is maintained for common scenarios, based on how frequently the different single-precision data cases occur, while for low-probability scenarios pipeline pauses are used to reuse existing logic. This reduces area overhead at the cost of performance and avoids spending excessive area on performance improvements for infrequent scenarios. In other words, by reusing components through pipeline pauses in special cases, the hardware design area is reduced, along with power consumption and cost.
[0012] Because of differences in data bit width and pipeline partitioning, double-precision multiply-accumulate units differ significantly from the single-precision multiply-accumulate unit used in this method, and the two are not used in entirely the same way. This method therefore focuses specifically on fused multiply-accumulate in a single-precision floating-point multiply-accumulate unit.
[0013] Specifically, this invention provides an optimization method for the fused floating-point multiply-accumulate algorithm in a single-precision floating-point multiply-accumulate unit. During single-precision floating-point multiply-accumulate calculation, the pipeline corresponding to this method is implemented as follows:
[0014] P0 stage, the first pipeline stage: performs multiplication, leading-zero counting, and special value judgment;
P1 stage, the second pipeline stage: performs the alignment shift, the comparison operations, and the summation of the aligned mantissas;
[0015] P2 stage, the third pipeline stage: performs post-normalization, rounding, final result processing, and final exception handling. The scheme for each pipeline stage is as follows:
[0016] The P0 stage: the first pipeline stage receives the three operands a, b, and c as input and performs special value judgment, product mantissa calculation, product leading-zero pre-counting, and the exponent difference and comparison logic. It outputs the mantissa multiplication result, the comparison result between the product exponent and the exponent of operand c, and the comparison result between the exponent difference and the product leading-zero count.
[0017] The P1 stage: the second pipeline stage receives the multiplication result, the comparison result between the product exponent and the exponent of operand c, and the comparison result between the exponent difference and the product leading-zero count. It performs the comparison logic, the exponent alignment operation, and the mantissa summation, and outputs the mantissa sum, the adjusted exponent, the sign bits to be processed, and the comparison result;
[0018] The P2 stage: the third pipeline stage receives the mantissa sum, the adjusted exponent, the sign bits to be processed, and the comparison result. It performs post-normalization, rounding, final result processing, and exception status judgment,
[0019] and outputs the final calculation result and any exception status.
[0020] In the P0 stage:
[0021] The special value judgment module includes:
[0022] (1) Introduce the concept of the effective operation. The instruction control module (1) decodes the internal information of the instruction; the instruction and the operand sign bits together determine whether the actual operation in the mantissa summation is an addition or a subtraction.
[0023]
mul_res sign bit | Input operation | oprc sign bit | Effective operation
+ | + | + | +
+ | + | - | -
- | + | + | -
- | + | - | +
+ | - | + | -
+ | - | - | +
- | - | + | +
- | - | - | -
[0024] Note: in the effective operation judgment:
[0025] For multiply-accumulate instructions, the mul_res sign bit is the sign bit of the product result in the multiply-accumulate instruction, and the oprc sign bit is the sign bit of the C operand;
[0026] The specific sign bit and input operations correspond to the multiply-accumulate instructions as follows:
[0027] FMADD: that is, a*b+c, where the mul_res sign bit is the sign bit of the product a*b, and the input operation is +;
[0028] FMSUB: that is, a*b-c, where the mul_res sign bit is the sign bit of the product a*b, and the input operation is -;
[0029] FNMADD: that is, -a*b-c, where the mul_res sign bit is the sign bit of the product -a*b, and the input operation is -;
[0030] FNMSUB: that is, -a*b+c, where the mul_res sign bit is the sign bit of the product -a*b, and the input operation is +;
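The table and the instruction mapping above reduce to an exclusive-OR of three sign bits. The following Python sketch illustrates this under stated assumptions: the names `INSTRUCTIONS` and `effective_operation` are illustrative and do not come from the patent, and signs are encoded as 0 for + and 1 for -.

```python
# Illustrative encoding (an assumption): 0 means '+', 1 means '-'.
INSTRUCTIONS = {
    # name: (negate_product, input_operation)
    "FMADD":  (0, 0),  #  a*b + c
    "FMSUB":  (0, 1),  #  a*b - c
    "FNMADD": (1, 1),  # -a*b - c
    "FNMSUB": (1, 0),  # -a*b + c
}


def effective_operation(instr, sign_a, sign_b, sign_c):
    """Return 0 for an effective addition, 1 for an effective subtraction."""
    negate_product, input_op = INSTRUCTIONS[instr]
    # Sign of the product term as seen by the mantissa summation.
    mul_res_sign = sign_a ^ sign_b ^ negate_product
    # The operation is a subtraction when an odd number of the three
    # table columns (mul_res sign, input operation, oprc sign) is '-'.
    return mul_res_sign ^ input_op ^ sign_c
```

For example, FMSUB with three positive operands gives an effective subtraction, while FNMADD with three positive operands gives an effective addition of magnitudes.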
[0031] (2) Once the effective operation is determined, the special value module makes its judgment;
[0032] Multiply-accumulate instructions require the multiplication to be performed first, followed by the mantissa summation according to the effective operation. The special value judgment is therefore given in the following three tables:
[0033] The table below shows the results of special value judgments for multiplication:
[0034]
[0035] The table below shows the results of special value judgments when the effective operation is addition:
[0036]
[0037] The table below shows the results of special value judgments when the effective operation is subtraction:
[0038]
[0039] Note: (sub)norm is a general term for normalized and denormalized numbers. NaN (Not a Number, representing a value that cannot be expressed) is divided into two categories: qNaN and sNaN.
[0040] sNaN is a value whose exponent is all 1s, whose most significant mantissa bit is 0, and whose mantissa as a whole is not 0.
[0041] qNaN is a value whose exponent is all 1s and whose most significant mantissa bit is 1.
[0042] RISC-V specifies that if the result of a floating-point operation is a NaN number, then a fixed NaN number should be used. The NaN value corresponding to single-precision floating-point is 0x7fc0_0000. Therefore, the final result qNaN needs to be assigned a fixed value, i.e., qNaN = 32'h7fc0_0000.
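As an illustration of the NaN definitions above, the following sketch classifies a 32-bit single-precision bit pattern. The function name `classify_nan` and the constant name `CANONICAL_QNAN` are assumptions introduced here; only the value 0x7fc0_0000 comes from the text.

```python
CANONICAL_QNAN = 0x7FC0_0000  # fixed NaN result given in the text


def classify_nan(bits):
    """Return 'qNaN', 'sNaN', or None for a 32-bit single-precision pattern."""
    exponent = (bits >> 23) & 0xFF      # 8 exponent bits
    mantissa = bits & 0x7F_FFFF         # 23 stored mantissa bits
    if exponent != 0xFF or mantissa == 0:
        return None                     # not a NaN (infinity if mantissa is 0)
    # The most significant mantissa bit separates quiet from signaling NaNs.
    return "qNaN" if (mantissa >> 22) & 1 else "sNaN"
```

Under this sketch, any NaN result would be replaced by `CANONICAL_QNAN` before being written back.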
[0043] Table cells containing a horizontal line indicate that the result must be obtained through normal calculation rather than being a special value. For multiply-accumulate instructions, the result of a*b is first assigned to |x| according to the multiplication special value table, and the special value result is then looked up in the table for the corresponding effective operation. If a special value result is produced, a special value signal is set to mark the operation as a special case, and the result is assigned the special value result obtained above for use in subsequent calculation;
(3) After the effective operation is obtained, the invalid-operation exception for special values is determined and kept for use by the P2 stage;
[0044]
[0045] The calculation of the product mantissa includes:
[0046] Operands a and b enter the product mantissa calculation module (3), which computes the product of their mantissas. The module contains a multiplier and two adders. First, the mantissas of operands a and b are multiplied to obtain two partial product results. The lower halves of the partial products enter a 24-bit adder, which computes the sum of the lower 24 bits. The carry generated by this summation is passed to the upper-half adder and enters a 25-bit adder together with the upper halves of the partial products to produce the product mantissa. Finally, the selector (4) chooses the required mantissa result according to the special value signal and the special value result.
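The split addition can be sketched in Python as follows. The sketch assumes the multiplier delivers its two partial product results as two 48-bit integers whose sum is the product (a carry-save style output), which the text does not state explicitly; `split_add` and its parameter names are hypothetical.

```python
MASK24 = (1 << 24) - 1


def split_add(pp_sum, pp_carry):
    """Add two 48-bit partial products with a low-half and a high-half adder."""
    # Low-half adder: sums the lower 24 bits of both partial products.
    low = (pp_sum & MASK24) + (pp_carry & MASK24)
    carry = low >> 24                       # carry passed to the high half
    # High-half adder: upper halves plus the carry from the low half.
    high = (pp_sum >> 24) + (pp_carry >> 24) + carry
    return (high << 24) | (low & MASK24)
```

Splitting the addition this way keeps each adder short, at the price of the high half waiting for the low-half carry.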
[0047] The product leading zero pre-statistics module includes:
[0048] First, the selector (5) chooses the denormalized operand according to the denormalized-number judgment signal. The selected data enters the 24-bit leading-zero counting module (6), which counts its leading zeros, and the count then enters the next selector (7). If both operands are normalized numbers, the count is forced to 0. The output is the product leading-zero pre-count.
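A minimal sketch of this pre-count is shown below. It assumes each mantissa is passed as a 24-bit integer that already includes the hidden bit (0 for a denormalized number), and that operand a is chosen when both operands are denormalized; neither detail is specified in the text, and the function names are illustrative.

```python
def leading_zeros(value, width=24):
    """Count leading zeros of `value` within a field of `width` bits."""
    return width - value.bit_length()


def product_lz_precount(mant_a, mant_b, a_is_denorm, b_is_denorm):
    # Selector (7): two normalized operands give a pre-count of 0.
    if not (a_is_denorm or b_is_denorm):
        return 0
    # Selector (5): pick the denormalized operand (a first, by assumption).
    selected = mant_a if a_is_denorm else mant_b
    # Leading-zero counter (6) on the selected 24-bit mantissa.
    return leading_zeros(selected)
```

Because the count is taken from one operand rather than from the product itself, it is only a pre-count of the product's leading zeros.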
[0049] The exponent difference and comparison logic (8) includes:
[0050] This module computes the product exponent, compares the product exponent with the exponent of operand c, and compares the exponent difference with the product leading-zero count. This prepares the alignment shift in the P1 stage and gives a preliminary determination of the shift method that the P1 stage requires.
[0051] In the P1 stage:
[0052] The comparison logic is divided into two parts:
[0053] The first part, the mantissa comparison logic (13), compares the upper 25 bits before shifting. This comparison gives a preliminary judgment of which of the two numbers is larger, for use by the alignment operation in the current stage. For timing reasons, only the upper 25 bits are compared here.
[0054] The second part, the final comparison logic (28), continues the comparison after alignment is complete by checking whether the remaining bits are equal. This yields the final comparison result, which is used in the subsequent sign bit calculation.
[0055] The exponent alignment part: the overall scheme is as follows:
[0056] After the mantissa comparison logic (13) determines the basic swap signal and the preliminary magnitude ordering, the shift control logic (19) controls the selector (15) to select the data to be shifted. This selector (15) also pads the shifted data with zeros when the shift amount exceeds 15. For a left shift, the data to be shifted must first pass through the bit-reversal module (14). The shift amount is determined by the shift amount adjustment logic (20). Once the shift amount and the data are ready, they enter the shifter (16), a (48 + SHF_NUM)-bit right shifter; SHF_NUM is set to 16 based on operator statistics, giving a 64-bit right shifter that produces the shift result. For a left shift, the result must pass through a second bit-reversal module (17) after the shift to obtain the left-shift result. The selector (18) then chooses between the left-shift and right-shift results to give the final mantissa shift result. Finally, the two aligned mantissas are selected by the selectors (23 / 24).
[0057] If the shifter must be reused through a pipeline pause, the selectors (9 / 10 / 11 / 12) are controlled by the selector control logic (22) during the pause to update each input. In the next cycle, the shift control logic (19) controls the reuse of the right shifter (16). If a pipeline pause is required to further determine the comparison signal after alignment, the comparison is completed in the next cycle by reusing the mantissa comparison logic (13) and the final comparison logic (28).
[0058] Correspondingly, the exponent adjustment logic (21) in the exponent adjustment module further adjusts the exponent to obtain the adjusted exponent;
[0059] The mantissa summation part: the mantissa summation is either an addition or a subtraction of the mantissas:
[0060] (1) Mantissa addition: the two mantissas can be added directly by the adder;
[0061] (2) Mantissa subtraction: because the preceding alignment operation has already performed part of the comparison logic, this is always the larger mantissa minus the smaller mantissa. To compute the expression (x - y) with the adder, the identity (x - y) = (x + (~y) + 1) is used, so the subtraction is carried out by addition.
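The identity (x - y) = (x + (~y) + 1) holds for fixed-width binary values when the result is truncated to the same width. The sketch below checks it in Python; the 50-bit default width is taken from the mantissa sum width given later in paragraph [0089], and the function name is illustrative.

```python
def subtract_via_adder(x, y, width=50):
    """Compute x - y (for x >= y) using only inversion and addition."""
    mask = (1 << width) - 1
    inverted_y = ~y & mask          # bitwise complement within `width` bits
    # Adding the complement plus 1 equals subtracting y modulo 2**width.
    return (x + inverted_y + 1) & mask
```

In hardware the trailing +1 is typically supplied as the carry-in of the adder, so no extra adder is needed.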
[0062] The overall scheme for mantissa summation is implemented as follows:
[0063] After the aligned mantissas are obtained, if the actual operation is a subtraction, the aligned mantissa to be subtracted passes through the inversion module (25) to obtain its bitwise complement; for an addition, no extra operation is needed. The selector (26) chooses the corresponding result. The two selected mantissas and the subtraction signal enter the mantissa summation module (27) together. This module contains a 26-bit lower-half adder and a 25-bit upper-half adder; the upper-half adder must wait for the carry from the lower-half adder before it can complete the mantissa summation and produce the final mantissa sum.
[0064] The hardware design of the alignment section has three important design points, as follows:
[0065] (1) For special alignment cases, the pipeline pauses and the shifter is reused;
[0066] Analysis of the alignment algorithm shows that a left normalization shift of the product mantissa and a right alignment shift of operand c are both required if and only if the product exponent is greater than the exponent of operand c, the exponent difference is greater than the product leading-zero count, and the product leading-zero count is nonzero. This is an uncommon scenario, so instead of using two shifters, a pipeline pause is inserted and the right shifter is reused, sacrificing performance in a low-probability scenario to reduce shifter area.
[0067] (2) Use only right shifters, and implement left shifts using right shifters:
[0068] Analysis of the alignment algorithm shows that left shifts are needed far less often than right shifts, so a dedicated left shifter would waste area. This method uses only one right shifter in the alignment part. A left shift is implemented by bit-reversing the data, right-shifting it, and then bit-reversing the result.
[0069] (3) Reduce the shifter bit width:
[0070] The shifter bit width is set to (48 + SHF_NUM) bits, where SHF_NUM is configurable so that the shift amount can be adjusted for different application scenarios. Currently SHF_NUM is set to 16, so the shifter itself handles a maximum shift amount of 15. If the shift amount exceeds this maximum, the portion of 16 or more is handled directly by padding zeros in front, and the shifter then completes the remaining shift.
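Design point (2) can be illustrated with a short Python sketch that performs a left shift using only a right shift and two bit reversals. The 64-bit width follows from 48 + SHF_NUM with SHF_NUM = 16; the function names are hypothetical, and the sketch assumes the input fits in the given width.

```python
WIDTH = 64  # 48 + SHF_NUM, with SHF_NUM = 16 as stated in the text


def bit_reverse(value, width=WIDTH):
    """Reverse the bit order of `value` within `width` bits."""
    return int(format(value, f"0{width}b")[::-1], 2)


def left_shift_via_right(value, amount, width=WIDTH):
    """Left shift implemented as reverse, right shift, reverse."""
    reversed_in = bit_reverse(value, width)
    shifted = reversed_in >> amount          # the only real shifter
    return bit_reverse(shifted, width)       # bits shifted past the top are lost
```

A bit reversal is pure wiring in hardware, so the cost of supporting left shifts this way is mainly the selectors around the shared right shifter.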
[0071] The specific pipeline pause cases are:
[0072] (1) The pipeline pauses once.
[0073] a. The product exponent is greater than the exponent of operand c, the exponent difference is greater than the product leading-zero count, and the product leading-zero count is not 0.
[0074] In this case, the product mantissa must first be shifted left by the product leading-zero count for normalization, and operand c must then be shifted right to align the exponents. The right shift amount is the exponent difference minus the product leading-zero count, and the product remains the larger number. In the first cycle, the bit-reversed data is selected and enters the shifter for a right shift, and the re-reversed result is selected as the larger-number data. The shift amount is then adjusted at the top level, and operand c is shifted right in the next cycle to obtain the alignment result.
[0075] b. The product exponent is 1, the exponent of operand c is 0, and the hidden bit of the product mantissa is 0;
[0076] The product mantissa is in denormalized form, but the exponent is that of a normalized number. In the first cycle, the product mantissa is shifted left by one bit to obtain the true mantissa. In the next cycle, the pipeline returns to the current stage and compares the true results, to prevent a comparison error.
[0077] c. The exponent difference equals the product leading-zero count, and the product exponent is greater than the exponent of operand c. In the first cycle, the product is left-normalized, which completes the exponent alignment. In the next cycle, the pipeline returns to the current stage and compares the actual results, to prevent a comparison error.
[0078] d. The product exponent is negative and operand c is a denormalized number;
[0079] In the first cycle, the product mantissa is shifted right and the product exponent is adjusted to 0. In the next cycle, the pipeline returns to the current stage and compares the actual results, to prevent a comparison error.
[0080] e. The product exponent is greater than the exponent of operand c, the exponent difference is less than the product leading-zero count, and operand c is a denormalized number;
[0081] In this case, the product mantissa must be shifted left by the exponent difference so that the two values are aligned before comparison. In the first cycle, the product mantissa is shifted left to obtain the aligned result. In the second cycle, the pipeline returns to the current stage, compares the actual results, and updates the comparison result to prevent errors.
[0082] (2) The pipeline pauses twice.
[0083] a. The product exponent is greater than the exponent of operand c, the exponent difference is greater than the product leading-zero count, the product leading-zero count is not 0, and the second right shift amount is 1;
[0084] In the first cycle, the data enters the shifter and the product mantissa is shifted left by the product leading-zero count to normalize it. In the second cycle, the data to be shifted is the mantissa of operand c, and the right shift amount (exponent difference - product leading-zero count) equals 1. However, because the product leading-zero pre-count may be inaccurate, the comparison result may also be inaccurate, so a further pipeline pause is required for a second comparison. In the third cycle, the product mantissa and the mantissa of operand c are compared to determine the true magnitude ordering.
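The pause cases listed above can be summarized as a small decision function. This is only a sketch of the conditions as worded in the list: the function and parameter names are hypothetical, the exponents are treated as plain signed integers, and the order in which overlapping conditions are tested is an assumption, since the text does not give a priority.

```python
def pause_count(exp_mul, exp_c, product_lz, hidden_bit, c_is_denorm):
    """Return the number of pipeline pauses (0, 1, or 2) for one operation."""
    diff = exp_mul - exp_c
    if exp_mul > exp_c and product_lz != 0 and diff > product_lz:
        # Case (1)a pauses once; case (2)a pauses twice when the
        # right shift of operand c is exactly 1.
        return 2 if diff - product_lz == 1 else 1
    if exp_mul == 1 and exp_c == 0 and hidden_bit == 0:
        return 1        # case (1)b
    if exp_mul > exp_c and diff == product_lz:
        return 1        # case (1)c
    if exp_mul < 0 and c_is_denorm:
        return 1        # case (1)d
    if exp_mul > exp_c and diff < product_lz and c_is_denorm:
        return 1        # case (1)e
    return 0            # common scenario: no pause
```

All other operand combinations fall through to the last line and proceed at full pipeline rate, which is the common scenario the design optimizes for.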
[0085] The overall scheme for the P2 stage is implemented as follows:
[0086] The mantissa sum is divided into an upper half and a lower half, which enter the 24-bit upper-half leading-zero counter (2) and the 24-bit lower-half leading-zero counter (3) respectively. The selector (4) then chooses between the two counts according to whether the upper half contains a 1. If the upper-half count is selected, the number of leading zeros is simply the upper-half count. If the lower-half count is selected, it must be compensated by the post-normalization leading-zero compensation module (5): because the upper 24 bits are all 0, the number of leading zeros is 24 plus the lower-half count. This gives the true post-normalization leading-zero count. The exponent and the leading-zero count then enter the shift amount calculation module (6) together to produce the shift amount. The data to be shifted is produced by the shift data adjustment module (7), which selects the lower half of the data when the shift amount exceeds 24, so that the subsequent shifter can satisfy the shift requirement. The shift result is produced by the 40-bit left shifter (9) from the shift amount and the shift data. The right shift control logic (8) determines whether a 1-bit 0 must be concatenated onto the most significant bit, and the final mantissa is obtained through the mantissa selection module (12). In parallel, the exponent adjustment module (10) and the rounding part adjustment module (11) adjust the exponent and the rounding bits, and the sign bit is computed by the sign bit calculation module (1). These values enter the rounding operation, rounding adjustment, and exception status judgment module (13) together, and the final result selection (14) then produces the final result.
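The two-counter leading-zero scheme with compensation can be sketched as follows. The sketch assumes a 48-bit input split into two 24-bit halves, as the counter widths above suggest; the function names are illustrative.

```python
def leading_zeros(value, width):
    """Count leading zeros of `value` within a field of `width` bits."""
    return width - value.bit_length()


def post_norm_leading_zeros(mantissa_sum):
    """Leading-zero count of a 48-bit value using two 24-bit counters."""
    high = (mantissa_sum >> 24) & 0xFFFFFF
    low = mantissa_sum & 0xFFFFFF
    if high != 0:
        # The upper half contains a 1: its count is the answer.
        return leading_zeros(high, 24)
    # Upper half is all zeros: compensate the lower-half count by 24.
    return 24 + leading_zeros(low, 24)
```

Both counters can run in parallel, and the final selection only needs to know whether the upper half is nonzero.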
[0087] In the P2 stage:
[0088] The post-normalization part:
[0089] The mantissa sum obtained from the P1 pipeline stage totals 50 bits: 2 hidden bits + 46 mantissa bits + 1 rounding bit + 1 sticky bit. The exponent is the adjusted exponent produced by the P1 stage. The data first enters the leading-zero module for counting and then undergoes a shift operation. The details are as follows:
[0090] (a) The specific shift details are as follows:
[0091] ① The exponent is large enough to allow a left shift by the leading-zero count;
[0092] For a shift amount not exceeding 24, the original data is selected for shifting, and the upper-half leading-zero count is selected as the shift amount.
[0093] For a shift amount exceeding 24, the lower half of the data followed by 24 bits of 0 is selected for shifting, which emulates having already shifted out the 24 zero bits of the upper half; the lower-half leading-zero count is selected as the shift amount.
[0094] This reduces the shift amount;
[0095] ② The exponent is not large enough to allow a left shift by the leading-zero count;
[0096] For a shift amount not exceeding 32, the original data is still selected for shifting, and the shift amount is the exponent - 1;
[0097] For a shift amount exceeding 32, the shift amount is still derived from the exponent minus 1. In this case, the lower half of the data followed by 24 bits of 0 is selected for shifting, which emulates having already shifted out the 24 zero bits of the upper half. The shift amount must then be adjusted: for an exponent-based shift, (24 - 1), that is 23, is subtracted from the total.
[0098] This reduces the shift amount;
[0099] ③ The mantissa sum overflows by one bit;
[0100] If the mantissa sum overflows, the right shift control logic (8) prepends a 1-bit 0 to the most significant bit of the mantissa to emulate shifting the mantissa right by one bit; the left shifter is not used in this case.
[0101] ④ No post-normalization shift is required;
[0102] If there are no post-normalization leading zeros, or the result is itself a denormalized number, the original mantissa result is selected directly through the mantissa selection module (12), and the shift amount is 0.
[0103] (b) Further reduction of the shift amount
[0104] From the two left-shift cases above, the maximum shift amount is 31. Where timing allows, the shift amount handled by the shifter can be reduced further. For data with a shift amount exceeding 15, an internal control signal removes the upper 16 zero bits and appends the corresponding zeros at the end. This reduces the maximum shift amount handled by the shifter from 31 to 15.
[0105] The specific implementation details are as follows:
[0106] (I) When the shift amount is less than 16, the data enters the left shifter directly;
[0107] (II) When the shift amount is 16 or greater, the data is first processed by the shift data adjustment module (7): 16 zero bits are removed from the front and 16 zero bits are appended at the back. The processed data is then sent to the left shifter for the remaining shift.
[0108] The shift amount is limited to below 16 to keep the control circuit as simple as possible; otherwise, the control circuit may become too complex and add area overhead.
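The two-step shift described in (I) and (II) can be sketched as a coarse 16-bit step followed by a fine shift of at most 15. The 48-bit width and the function name `reduced_left_shift` are assumptions used for illustration; the sketch does not model the narrower 40-bit data path.

```python
WIDTH = 48  # illustrative data width (an assumption for this sketch)


def reduced_left_shift(data, amount, width=WIDTH):
    """Left shift by up to 31 using a coarse 16-bit step plus a 0..15 shifter."""
    assert 0 <= amount <= 31
    mask = (1 << width) - 1
    if amount >= 16:
        # Coarse step: drop 16 leading zero bits and append 16 zero bits.
        data = (data << 16) & mask
        amount -= 16
    # Fine step: the actual shifter only sees shift amounts of 0..15.
    return (data << amount) & mask
```

The coarse step is a fixed rewiring selected by one control signal, so the barrel shifter needs only four shift-amount bits instead of five.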
[0109] (c) Further reduction of shifted data;
[0110] Originally, 48 bits of shift data were required, but the shifter now shifts by at most 15 bits. It is therefore only necessary to supply enough data for left-shift normalization, that is, (24 + 16) bits of shift data, and the remaining 8 bits are used for the sticky bit. The shift data bit width can thus be reduced.
[0111] In summary, the original 48-bit shifter can be reduced to a 40-bit shifter.
[0112] The rounding part is adjusted as follows:
[0113] The rounding adjustment module (11) completes the corresponding function; here the guard bit is the last bit of the mantissa, the rounding bit is the first bit after the last bit of the mantissa, and the sticky bit is the OR-reduction of all bits after the second bit after the last bit of the mantissa.
[0114] The exponent adjustment part:
[0115] The exponent adjustment module (10) completes the corresponding function;
[0116] The exponent adjustment is performed in parallel with post-normalization, and there are four possible results: exponent + 1; exponent; exponent minus the post-normalization leading-zero count; and 0. The appropriate one is chosen according to the mantissa selection.
[0117] The rounding operation and post-rounding adjustment:
[0118] This part of the logic is included in the rounding operation, rounding adjustment and abnormal state judgment module (13);
[0119] To improve timing, a 25-bit adder directly computes the mantissa plus 1. Then, according to the rounding mode, either the original mantissa or the incremented mantissa is selected as the final rounded result. This rounded result is then used to further adjust the exponent and mantissa.
[0120] Exponent adjustment: If rounding by +1 causes the mantissa to overflow by 1 bit, the exponent needs to be incremented by 1;
[0121] Mantissa adjustment: If rounding by +1 causes the mantissa to overflow by 1 bit, the mantissa needs to be shifted one bit to the right.
The final sign bit calculation:
[0122] Sign bit calculation (1) completes the corresponding function;
[0123] ① If a special value exists, assign the sign bit of the special value;
[0124] ② The absolute values of the two numbers are equal, that is, after alignment the product result equals the c operand, and the effective operation is subtraction: the sign then depends on the rounding mode. If the mode is round toward negative infinity, the sign bit is negative; otherwise, the sign bit is positive.
[0125] ③ The remaining cases need to be determined based on the exchange signal and the subtraction signal, as shown in the table below:
[0126] Where A is the product result, B is oprc, and mult_sign is the sign of the product result obtained from the P0-level adjustment; the effective operation here is the same as the effective operation determined at the P0 level.
[0127]
Condition    Effective operation    mult_sign    Final sign
|A|<|B|      -                      +            -
|A|<|B|      -                      -            +
|A|<|B|      +                      +            +
|A|<|B|      +                      -            -
|A|>|B|      -                      +            +
|A|>|B|      -                      -            -
|A|>|B|      +                      +            +
|A|>|B|      +                      -            -
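The eight rows of the sign table reduce to one rule: the final sign follows mult_sign unless the effective operation is subtraction and |A|<|B|, in which case it is inverted. A minimal sketch, with signs encoded as 0 for "+" and 1 for "-" (the encoding and the function name are assumptions made for illustration):

```python
def final_sign(a_less_than_b: bool, effective_sub: bool, mult_sign: int) -> int:
    """Result sign (0 = '+', 1 = '-') for case ③, i.e. no special value and |A| != |B|."""
    if effective_sub and a_less_than_b:
        return mult_sign ^ 1  # the larger operand B has the opposite effective sign
    return mult_sign
```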
[0128] Exception determination:
[0129] This part of the logic is included in the rounding operation, rounding adjustment and abnormal state judgment module (13);
[0130] ① Invalid exception: Found in the special value judgment at level P0;
[0131] ② Division by zero exception: None;
[0132] ③ Overflow exception: After the exponent is updated, the result exceeds the maximum exponent value, and the result at this time is not a maximum value, a qNaN result, or an invalid-exception case;
[0133] ④ Underflow exception: After rounding, the result lies strictly between the ± minimum normalized numbers (the + minimum normalized number is 32'h0080_0000 and the - minimum normalized number is 32'h8080_0000) and is inexact;
⑤ Inexact exception: The rounding bit or sticky bit is 1, or an overflow exception occurs.
[0134] The final result processing:
[0135] The final result selection (14) completes the corresponding function;
[0136] If an overflow exception occurs, special assignments are needed for different maxima depending on the rounding mode. In other cases, only the sign bit, exponent, and mantissa need to be concatenated normally. This yields the final result.
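The text does not list the value assigned for each rounding mode. As a hedged illustration, the sketch below follows the standard IEEE 754 overflow behavior and uses the RISC-V rounding-mode mnemonics (RNE, RTZ, RDN, RUP, RMM); both the function name and the choice of mnemonics are assumptions, not details from the application.

```python
MAX_FINITE = 0x7F7F_FFFF  # largest finite single-precision magnitude
INFINITY = 0x7F80_0000

def overflow_result(sign: int, rounding_mode: str) -> int:
    """Single-precision bit pattern returned on overflow (IEEE 754 behavior)."""
    if rounding_mode in ("RNE", "RMM"):      # round to nearest: overflow to infinity
        magnitude = INFINITY
    elif rounding_mode == "RTZ":             # toward zero: clamp to max finite
        magnitude = MAX_FINITE
    elif rounding_mode == "RDN":             # toward -inf: only negatives reach infinity
        magnitude = INFINITY if sign else MAX_FINITE
    elif rounding_mode == "RUP":             # toward +inf: only positives reach infinity
        magnitude = MAX_FINITE if sign else INFINITY
    else:
        raise ValueError(f"unknown rounding mode: {rounding_mode}")
    return (sign << 31) | magnitude
```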
[0137] The hardware design of the post-normalization section has three important design points, which are described below:
[0138] (1) To improve timing, the leading-zero counting is split into two modules, one for the high half and one for the low half, which count in parallel. The post-normalization leading-zero count is then corrected according to whether the high half is all 0. With better timing, the area also benefits to some extent.
[0139] (2) The post-normalization shift uses only a left shifter;
[0140] The post-normalization part contains only one left shifter, used when the post-normalization result has leading zeros; for the overflow case that needs a right shift, the preceding alignment algorithm analysis shows that at most a 1-bit right shift is required, which can be done directly by bit concatenation;
[0141] (3) Under the condition that the timing permits, directly correct the shifter data in advance through the control circuit to reduce the shift amount and the bit width of the shifted data, so as to achieve the purpose of reducing the bit width of the shifter, thereby reducing the hardware design area.
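Design point (1) can be illustrated in software. The sketch below assumes a 48-bit value split into two 24-bit halves; the two counts do not depend on each other, which is what lets hardware evaluate them in parallel before a final selection. The function names are hypothetical.

```python
def clz(value: int, width: int) -> int:
    """Leading zeros of `value` viewed as a `width`-bit field (all-zero gives `width`)."""
    return width - value.bit_length()

def clz48_split(value: int) -> int:
    """48-bit leading-zero count built from two independent 24-bit counts."""
    high = value >> 24
    low = value & 0xFFFFFF
    lz_high = clz(high, 24)  # independent of lz_low
    lz_low = clz(low, 24)    # independent of lz_high
    # Selection step: if the high half is all zero, compensate by 24.
    return 24 + lz_low if high == 0 else lz_high
```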
[0142] The multiplication and addition module implements the multiplication and addition algorithm including:
[0143] After obtaining the three operands a, b, and c, first calculate the product a*b, align the product with the c operand (the addend), then sum the mantissas, perform post-normalization, and finally complete rounding and abnormal state judgment to obtain the final result;
[0144] For the alignment, before alignment, it is necessary to obtain the product mantissa result, the product leading zero result, the exponent difference, and the c operand, that is, the addend operand. Among them, the alignment shift amount of part of the exponent difference can offset the normalization shift amount of the product leading zeros. The specific algorithm is as follows:
[0145] (1) The multiplication exponent result > the exponent of the c operand
[0146] a. The exponent difference > the product leading zeros: The product result is the large exponent, and a left shift normalization with a shift amount of the product leading zeros is required. The c operand is the small exponent, and a right shift alignment with a shift amount of the exponent difference - the number of product leading zeros is required. At this time, two shifts are required, and the multiplication result is still the large operand; if no product leading zeros are generated in the multiplication at this time, only the c operand needs to be shifted by the exponent difference;
[0147] b. The exponent difference < the product leading zeros: Before shifting, the product result is the large exponent. First, perform a left shift normalization of the multiplication mantissa with a shift amount of the product leading zeros. At this time, because the exponent difference < the product leading zeros, the product result instead becomes the small exponent. At this time, a right shift alignment with a shift amount of the product leading zeros - the exponent difference is still required for the shifted multiplication mantissa;
[0148] From the overall effect, only the multiplication mantissa needs to be left shifted by the exponent difference. At this time, the multiplication result becomes the small operand and needs to be exchanged;
[0149] c. The exponent difference = the product leading zeros: At this time, only the multiplication mantissa needs to be left shifted with a shift amount of the product leading zeros to complete the alignment of the c operand and the multiplication result;
[0150] (2) The multiplication exponent result < the exponent of the c operand
[0151] Regardless of the relationship between the product leading zeros and the exponent difference, the multiplication result always has the small exponent. First, the multiplication mantissa is left-shifted by the number of product leading zeros to normalize it, which makes the exponent of the multiplication result even smaller. The shifted multiplication mantissa must then be right-shifted by the number of product leading zeros plus the exponent difference to align the exponents. Overall, the multiplication mantissa only needs to be right-shifted by the exponent difference, and the multiplication result is still the small operand.
[0152] (3) Multiplication exponent result = c operand exponent, exponent difference = 0.
a. If the product leading zero = 0, the multiplication result and the c operand are already aligned and no shift is needed;
b. If the product leading zero ≠ 0, the exponent difference is less than the product leading zero. First, the multiplication mantissa is left-shifted by the product leading zero for normalization. After this shift, the exponent difference equals the product leading zero, so the adjusted mantissa would then need to be right-shifted by the product leading zero for alignment. Overall, no shift is needed.
[0153] Therefore, the advantage of this application lies in the optimization of the module area of the multiply-accumulate algorithm:
[0154] A) Optimization of the alignment shifter:
[0155] (a) The alignment algorithm is optimized to ensure that the alignment shift and the leading zero normalization shift can cancel each other out to the greatest extent, so as to reduce the bit width and reduce the hardware design area.
[0156] In this alignment algorithm, pipeline pauses and shifter reuse are only required in special alignment cases. Analysis of the alignment algorithm reveals that a left shift normalization shift of the product mantissa and a right shift alignment shift of the c operand are required only if the multiplication exponent result > the c operand exponent and the exponent difference > the product leading zero (leading zero ≠ 0). From an application perspective, this is a low-probability, uncommon scenario. Therefore, instead of using two shifters to satisfy this situation, the performance of this low-probability event is sacrificed by pipeline pauses and shifter reuse, reducing the area overhead of the shifters.
[0157] (b) Use only right shifters, and implement left shifts using right shifters:
[0158] Analysis of the alignment algorithm shows that right shifts far outnumber left shifts, so adding a dedicated left shifter would incur extra area. This scheme therefore uses only one right shifter in the alignment process and performs left shifts with it: the data is bit-reversed, right-shifted, and bit-reversed again to obtain the corresponding left-shift result. This saves the area of a separate left shifter.
[0159] (c) Reduce the shifter bit width:
[0160] Set the shifter bit width to (48 + SHF_NUM) bits. SHF_NUM is configurable and can be adjusted for different application scenarios. Currently, SHF_NUM is set to 16, so the maximum shift amount of the shifter is 15. If the required shift amount exceeds this maximum, the control circuit handles the excess portion directly by prepending 0s to the data before it enters the shifter, and the shifter completes the remaining shift. In this way, the bit width of the shifter can be reduced.
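The reuse of the right shifter for left shifts described in (b) can be checked with a small model. This is a sketch under the assumption of a 64-bit data path (48 + SHF_NUM with SHF_NUM = 16); the helper names are hypothetical, and the string-based bit reversal stands in for what is pure wiring in hardware.

```python
WIDTH = 64  # 48 + SHF_NUM, with SHF_NUM = 16 as stated in the text

def reverse_bits(value: int, width: int = WIDTH) -> int:
    """Mirror the bit order of a `width`-bit value."""
    assert 0 <= value < (1 << width)
    return int(format(value, f"0{width}b")[::-1], 2)

def right_shift(value: int, amount: int) -> int:
    """The only real shifter in this model."""
    return value >> amount

def left_shift_via_right(value: int, amount: int) -> int:
    """Left shift = reverse, right shift, reverse again (truncated to WIDTH bits)."""
    return reverse_bits(right_shift(reverse_bits(value), amount))
```

Bits pushed past the top of the word are discarded, exactly as a WIDTH-bit left shifter would do.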
[0161] B) Optimization of the post-normalization leading-zero counting:
[0162] The original post-normalization leading-zero counting used a single 48-bit leading-zero counting module, but the resulting critical path was too long to meet timing, which would have required splitting the pipeline stage. This application therefore counts the leading zeros of the high half and the low half in parallel and then uses a selector to compensate, obtaining the true post-normalization leading-zero count. This improves timing enough to meet the requirement, so the pipeline stage does not need to be split and design area is saved.
[0163] C) Optimization of the post-normalization shifter:
[0164] A conventional post-normalization shifter is a 48-bit shifter with a maximum shift amount of 48. This solution, when timing allows, corrects the shift data in advance through the control circuit to reduce both the shift amount and the shift data bit width, thereby reducing the shifter bit width and the hardware design area.
[0165] In summary, this application uses a single data path algorithm, which avoids the additional area cost of instantiating the same logic several times. To preserve performance in common scenarios, and in view of real-world applications, pipeline pauses are used for special low-probability cases, sacrificing their performance and reusing the corresponding shifter to reduce area. At the same time, the bit width of the shifter is reduced according to the algorithmic strategies above to minimize area.
Description of the Drawings
[0166] The accompanying drawings, which are provided to further illustrate the invention and form part of this application, are not intended to limit the scope of the invention.
[0167] Figure 1 is a diagram of the P0-P1 pipeline implementation structure using multiply-accumulate instructions.
[0168] Figure 2 is a diagram of the P2 pipeline implementation structure using multiply-accumulate instructions.
[0169] Figure 3 is a flowchart illustrating the method.
[0170] Wherein:
[0171] The selectors in Figure 1 are as follows:
[0172] Selector 4: A selector module that selects the final product mantissa result based on the instruction control signal and the special value judgment signal;
[0173] Selector 5: Selects the denormalized number based on the operands;
[0174] Selector 7: A selector that adjusts the number of leading zeros based on the operands;
[0175] Selector 9: A selector used to update the product mantissa result after a pipeline pause;
[0176] Selector 10: A selector used to update the mantissa of operand c after a pipeline pause;
[0177] Selector 11: A selector used to update the product leading zeros after a pipeline pause;
[0178] Selector 12: A selector used to update results such as the exchange signal, comparisons, and exponent differences after a pipeline pause;
[0179] Selector 15: Selects the corresponding shift data according to the corresponding shift logic;
[0180] Selector 18: Selector that selects the mantissa result after shifting according to the shift control signal;
[0181] Selector 23: A selector that selects the larger-mantissa result according to the shift control logic;
[0182] Selector 24: A selector that selects the smaller-mantissa result according to the shift control logic;
[0183] Selector 26: A selector that selects the data required for summing the mantissas based on the subtraction signal;
The selectors in Figure 2 are as follows:
[0184] Selector 4: Selects the corresponding leading-zero count according to whether the mantissa data of the high half contains a 1.
Detailed Implementation
[0185] To better understand the technical content and advantages of the present invention, the present invention will now be described in further detail with reference to the accompanying drawings.
[0186] This invention relates to the field of chips, and is mainly designed for single-precision floating-point operations. The application of this solution has the advantages of reducing hardware area and low power consumption.
[0187] In addition, the technical terms included in this article are:
[0188] (1) The a, b, and c operands: for three-operand instructions such as multiply-accumulate instructions, they represent a*b+c;
(2) The product result exponent is the sum of the exponents of operands a and b minus the offset (the offset is 127 for single-precision floating-point numbers), and the exponent difference mentioned in the text is the absolute value of the difference between the product result exponent and the exponent of operand c.
[0189] (3) The exponent alignment operation refers to aligning the mantissa of the smaller exponent operand with the mantissa of the larger exponent operand based on the exponent difference between two operands with different exponents.
[0190] (4) Post-normalization: after the floating-point calculation is completed, the exponent may be greater than 0 while the mantissa result is still in denormalized form; in that case the floating-point number must be normalized.
[0191] (5) Leading zeros refer to the number of zeros in the mantissa that precede the first 1 when the hidden bit is 0; product leading zeros refer to the number of leading zeros in the mantissa result obtained after multiplying operands a and b; post-normalization leading zeros refer to the number of leading zeros in the result mantissa that is to be post-normalized after the floating-point calculation.
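Term (2) can be made concrete with a short worked example. The sketch below reads the biased exponent field from the single-precision encoding and assumes normalized operands (denormal exponent handling is deliberately left out); the helper names are hypothetical.

```python
import struct

BIAS = 127  # single-precision exponent offset, as stated in term (2)

def exponent_field(x: float) -> int:
    """Biased 8-bit exponent field of x encoded as a single-precision number."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits >> 23) & 0xFF

def product_exponent(a: float, b: float) -> int:
    """Biased exponent of a*b before any normalization of the product mantissa."""
    return exponent_field(a) + exponent_field(b) - BIAS

def exponent_difference(a: float, b: float, c: float) -> int:
    """Absolute difference between the product exponent and the exponent of c."""
    return abs(product_exponent(a, b) - exponent_field(c))
```

For a = 2.0 and b = 4.0 the exponent fields are 128 and 129, so the product exponent is 128 + 129 - 127 = 130; with c = 1.0 (field 127) the exponent difference is 3.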
[0192] Specifically, this solution utilizes a multiply-accumulate module to complete multiply-accumulate instructions. The main principle enabling these operations is the multiply-accumulate algorithm:
[0193] The main task is to perform the operation a*b+c, which requires multiplication, exponent alignment, mantissa summation, normalization, rounding, and derivation of the final result.
[0194] This method further includes: The multiply-add module implements the multiply-add function. The main body of the multiply-add algorithm of this scheme is as follows: After obtaining the three operands a, b, and c, the product result of a*b is first calculated, the product result and the operand c (addition operand) are aligned, the mantissa is summed, and then the normalization part is entered. Finally, rounding, abnormal state judgment and final result are completed.
[0195] The key part of the algorithm is the exponent alignment. Before alignment, it is necessary to obtain the product mantissa, the product leading zeros, the exponent difference, and the c operand (the addend). Part of the alignment shift amount given by the exponent difference can offset the normalization shift amount given by the product leading zeros. The specific algorithm is described below:
[0196] (1) The result of multiplication exponent > the exponent of operand c
[0197] a. Exponent difference > leading zeros of the product: The product result has a large exponent. It needs to be left-shifted by the number of leading zeros of the product for normalization. The c operand has a small exponent and needs to be right-shifted by the exponent difference - the number of leading zeros of the product for alignment. In this case, two shifts are required, and the multiplication result is still the large operand; if there are no leading zeros generated in the multiplication at this time, only shift the c operand by the exponent difference;
[0198] b. Exponent difference < leading zeros of the product: Before shifting, the product result has a large exponent. First, left-shift the multiplication mantissa by the number of leading zeros of the product for normalization. At this time, because the exponent difference < the leading zeros of the product, the product result becomes a small exponent instead. At this time, it is still necessary to right-shift the shifted multiplication mantissa by the number of leading zeros of the product - the exponent difference for alignment;
[0199] Overall, the multiplication mantissa only needs to be left-shifted by the exponent difference. The multiplication result then becomes the small operand, and the operands need to be exchanged;
[0200] c. Exponent difference = leading zeros of the product: In this case, the multiplication mantissa only needs to be left-shifted by the number of leading zeros of the product to complete the alignment of the c operand and the multiplication result;
[0201] (2) Multiplication exponent result < c operand exponent
[0202] In this case, regardless of the relationship between the leading zeros of the product and the exponent difference, the multiplication result always has the small exponent. First, the multiplication mantissa is left-shifted by the number of leading zeros of the product for normalization. The exponent of the multiplication result is then even smaller, and the shifted multiplication mantissa must be right-shifted by the number of leading zeros of the product + the exponent difference for alignment. Overall, the multiplication mantissa only needs to be right-shifted by the exponent difference, and the multiplication result is still the small operand;
[0203] (3) Multiplication exponent result = c operand exponent (exponent difference = 0)
[0204] a. If the leading zeros of the product = 0, the multiplication result and the c operand are already aligned and no shift is required;
b. If the leading zeros of the product ≠ 0, the exponent difference is less than the leading zeros of the product. First, the multiplication mantissa is left-shifted by the number of leading zeros of the product for normalization. After this shift, the exponent difference equals the leading zeros of the product, so the adjusted multiplication mantissa would then need to be right-shifted by the leading zeros of the product for alignment. Overall, no shift is required.
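The case analysis (1) to (3) can be condensed into a small decision function that returns the net shifts each case needs. This is an illustrative sketch only: the function name and the tuple encoding are assumptions, and a result with two entries corresponds to the one case in which this design pauses the pipeline to reuse the shifter.

```python
def alignment_plan(prod_exp: int, c_exp: int, prod_lz: int) -> list:
    """Net shifts as (operand, direction, amount) tuples for cases (1)-(3)."""
    diff = abs(prod_exp - c_exp)
    if prod_exp > c_exp:
        if diff > prod_lz:
            if prod_lz == 0:
                # Case (1)a without product leading zeros: one shift of c.
                return [("c", "right", diff)]
            # Case (1)a: two shifts are needed.
            return [("product", "left", prod_lz), ("c", "right", diff - prod_lz)]
        # Cases (1)b and (1)c: net effect is a left shift by the exponent difference.
        return [("product", "left", diff)]
    if prod_exp < c_exp:
        # Case (2): net effect is a right shift by the exponent difference.
        return [("product", "right", diff)]
    # Case (3): exponents equal, no net shift.
    return []
```

Only the two-entry result requires both a left and a right shift, which matches the observation that every other case fits a single pass through one shifter.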
[0205] As shown in Figure 3, in completing a single-precision floating-point multiply-accumulate calculation, the pipeline stages of this scheme are as follows:
[0206] P0 stage, the first pipeline stage: completes multiplication, leading-zero counting, and special value judgment;
P1 stage, the second pipeline stage: completes the alignment shift, the comparison operation, and the summation of the mantissas after alignment;
[0207] P2 stage, the third pipeline stage: completes post-normalization, rounding, final result processing, and final exception handling. The pipeline stages are as follows:
[0208] The P0 stage: The first pipeline stage obtains three operands a, b, and c; including:
[0209] Special value judgment; product mantissa calculation; product leading zero pre-statistics; exponent difference and comparison logic;
The P1 stage: In the second pipeline stage, the multiplication result is obtained, the product result exponent is compared with the exponent of the c operand, and the exponent difference is compared with the product leading zero; after exponent alignment, the mantissa summation operation is performed on the aligned result; including:
[0210] Comparison logic; alignment part; summation of mantissas;
[0211] The P2 stage: The third pipeline stage obtains the result of the mantissa summation, the exponent data to be adjusted, the sign bit to be computed, and the comparison result; including:
[0212] Post-normalization adjustment; exponent adjustment; rounding operation and post-rounding adjustment; final sign bit calculation; exception determination; and final result processing.
[0213] Furthermore, the detailed scheme of each pipeline stage is as follows:
[0214] The P0 level: (see Figure 1, the diagram of the P0-P1 pipeline implementation structure using multiply-accumulate instructions) The first pipeline stage obtains the three operands a, b, and c.
[0215] 1. Special value judgment module:
[0216] This special judgment module requires that various special values be judged in advance for different instructions;
[0217] According to the instruction control signal module (1), the data entering the special value judgment module (2) needs to be adjusted;
[0218] (1) Introduce the concept of effective operation, and decompose the internal information of the instruction through the instruction control module (1). The instruction and the operand sign bit together determine whether the actual operation in the mantissa summation is addition or subtraction.
[0219]
mul_res sign bit    Input operation    oprc sign bit    Effective operation
+                   +                  +                +
+                   +                  -                -
-                   +                  +                -
-                   +                  -                +
+                   -                  +                -
+                   -                  -                +
-                   -                  +                +
-                   -                  -                -
[0220] Note: In the effective operation judgment:
[0221] For multiply-accumulate instructions, the mul_res sign bit is the sign bit of the product result in the multiply-accumulate instruction, and the oprc sign bit is the sign bit of the C operand;
[0222] The specific sign bit and input operations correspond to the multiply-accumulate instructions as follows:
[0223] FMADD: That is, a*b+c, where the sign bit of mul_res represents the sign bit of the product result of a*b, and the input operation is +;
[0224] FMSUB: That is, a*b-c, where the mul_res sign bit represents the sign bit of the product result of a*b, and the input operation is -;
[0225] FNMADD: That is, -a*b-c, where the sign bit of mul_res represents the sign bit of the product result of -a*b, and the input operation is -;
[0226] FNMSUB: That is, -a*b+c, where the sign bit of mul_res represents the sign bit of the product result of -a*b, and the input operation is +;
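The effective operation table is equivalent to an XOR of the three sign inputs: the operation is a subtraction exactly when an odd number of them is "-". A minimal sketch, with signs encoded as 0 for "+" and 1 for "-" (the encoding, the dictionary names, and the function name are assumptions made for illustration):

```python
# Input operation and product negation for each instruction, per the list above.
INPUT_OP_IS_MINUS = {"FMADD": 0, "FMSUB": 1, "FNMADD": 1, "FNMSUB": 0}
NEGATE_PRODUCT = {"FMADD": 0, "FMSUB": 0, "FNMADD": 1, "FNMSUB": 1}

def effective_operation(instr: str, sign_a: int, sign_b: int, sign_c: int) -> str:
    """Return '+' or '-' for the mantissa summation of a multiply-accumulate instruction."""
    mul_res_sign = sign_a ^ sign_b ^ NEGATE_PRODUCT[instr]
    is_sub = mul_res_sign ^ INPUT_OP_IS_MINUS[instr] ^ sign_c
    return "-" if is_sub else "+"
```

For example, FNMADD with three positive operands computes -a*b-c, whose magnitudes add, so the effective operation is "+".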
[0227] (2) After the effective operation is clear, you can enter the special value module to make a judgment;
[0228] Multiply-accumulate instructions perform the multiplication first and then sum the mantissas according to the effective operation. The special value judgment is therefore given in the following three tables:
[0229] The table below shows the results of special value judgments for multiplication:
[0230]
[0231] The table below shows the special value judgment results when the effective operation is addition:
[0232]
[0233] The table below shows the special value judgment results when the effective operation is subtraction:
[0234]
[0235] Note: (sub)norm is a general term for normalized and denormalized numbers. NaN numbers (Not a number, representing an inexpressible value) are divided into two categories: qNaN numbers and sNaN numbers.
[0236] sNaN is a number whose exponent is all 1s, whose most significant mantissa bit is 0, and whose mantissa as a whole is not 0.
[0237] qNaN is a number whose exponent is all 1s and whose most significant mantissa bit is 1.
[0238] RISC-V specifies that if the result of a floating-point operation is a NaN number, then a fixed NaN number should be used. The NaN value corresponding to single-precision floating-point is 0x7fc0_0000. Therefore, the final result qNaN needs to be assigned a fixed value, i.e., qNaN = 32'h7fc0_0000.
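The NaN definitions in this note map directly onto bit tests of the single-precision encoding. A minimal sketch (the function name and the returned labels are hypothetical):

```python
CANONICAL_QNAN = 0x7FC0_0000  # fixed qNaN value stated in the text

def classify_nan(bits: int) -> str:
    """Classify a 32-bit single-precision pattern as 'qnan', 'snan', or 'not_nan'."""
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent != 0xFF or mantissa == 0:
        return "not_nan"  # finite value, or infinity when the mantissa is 0
    # The most significant mantissa bit separates quiet from signaling NaNs.
    return "qnan" if mantissa & 0x400000 else "snan"
```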
[0239] Table cells struck through with a horizontal line indicate that the result must be obtained through normal calculation rather than as a special value. For multiply-accumulate instructions, the result of a*b is first assigned to |x| according to the special value table for multiplication, and the special value result is then looked up in the table that corresponds to the effective operation. If a special value result is produced, a special value signal is set to mark the operation as a special case, and the result is assigned the special value result obtained above, which simplifies subsequent calculation;
(3) After the effective operation is obtained, the invalid exception for special values is determined and kept for use by the P2 level;
[0240]
[0241] Specifically, in the special value judgment module, if it is a multiplication instruction, the c operand needs to be assigned the value 32'h0000_0000, while the a and b operands remain unchanged; if it is an addition or subtraction instruction, the b operand is assigned the value 32'h3f80_0000, and the c operand is assigned the value b; if it is a multiplication-addition instruction, then the a, b, and c operands all remain unchanged.
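The operand adjustment lets one a*b+c data path serve all three instruction classes: a multiplication becomes a*b+0, and an addition or subtraction of a and b becomes a*1.0+b. A minimal sketch operating on raw 32-bit patterns; the `kind` strings and the function name are hypothetical labels, not names from the application.

```python
ONE = 0x3F80_0000   # 1.0 in single precision
ZERO = 0x0000_0000  # +0.0 in single precision

def adjust_operands(kind: str, a: int, b: int, c: int = ZERO):
    """Map an instruction's operands onto the a*b+c data path."""
    if kind == "mul":      # a*b  becomes a*b + 0
        return a, b, ZERO
    if kind == "addsub":   # a±b  becomes a*1.0 ± b
        return a, ONE, b
    if kind == "fma":      # a*b+c is passed through unchanged
        return a, b, c
    raise ValueError(f"unknown instruction kind: {kind}")
```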
[0242] 2. Calculation of the mantissa of the product: The difference between single-precision and double-precision multipliers is that ① the bit width of a single-precision multiplier is 24-bit × 24-bit, while the bit width of a double-precision multiplier is 53-bit × 53-bit, with the single-precision multiplier having a smaller bit width; ② the mantissa multiplication part of a single-precision multiplier can be completed in one clock cycle, while the mantissa multiplication part of a double-precision multiplier needs to be completed in two clock cycles due to timing considerations.
[0243] The operands a and b need to enter the product mantissa calculation module (3) to calculate the product mantissa result. Since the single-precision multiplier is 24-bit×24-bit and has a smaller bit width than the 53-bit×53-bit double-precision multiplier, it can be completed in one clock cycle. The product mantissa calculation module (3) contains one multiplier and two adders. First, the mantissas of operands a and b are multiplied to obtain two partial product results. The lower half of the partial product first enters the 24-bit adder to calculate the sum of the lower 24-bit partial products. The carry result generated by the summation will be passed to the higher half adder and enter the 25-bit adder together with the higher half partial product to obtain the product mantissa calculation result. Finally, through the selector (4), the final required mantissa result is selected according to the special value signal and the special value result.
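The split summation of the two partial results can be modeled as follows. The sketch assumes 24-bit mantissas and uses an arbitrary stand-in for the multiplier's two partial results (any pair whose sum equals the product works); how the real multiplier forms them is not specified here, and the function name is hypothetical.

```python
MASK24 = (1 << 24) - 1

def product_mantissa(mant_a: int, mant_b: int) -> int:
    """48-bit product of two 24-bit mantissas using a 24-bit and a 25-bit adder."""
    # Stand-in partial results (assumption): part0 + part1 == mant_a * mant_b.
    part0 = mant_a * (mant_b & 0xFFF)
    part1 = (mant_a * (mant_b >> 12)) << 12
    # 24-bit adder on the lower halves; bit 24 of the sum is the carry out.
    low_sum = (part0 & MASK24) + (part1 & MASK24)
    carry = low_sum >> 24
    # 25-bit adder on the higher halves, taking the carry from the lower adder.
    high_sum = (part0 >> 24) + (part1 >> 24) + carry
    return (high_sum << 24) | (low_sum & MASK24)
```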
[0244] 3. Pre-statistics module for leading zeros in products:
[0245] First, the selector (5) selects the non-normalized data based on the non-normalized number judgment signal, enters the 24-bit leading zero statistics module (6) to count the number of leading zeros of the selected data, and then enters the next selector (7). If both numbers are normalized numbers, the statistical result is adjusted to 0, and the product leading zero pre-statistical result is obtained.
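The pre-statistics path can be sketched as a selection followed by a 24-bit count. The mantissas here are assumed to be 24 bits wide with the hidden bit at bit 23, and when both operands are denormalized the sketch simply picks operand a; that tie-break is an assumption, since the text does not describe it.

```python
def product_lz_precount(mant_a: int, mant_b: int) -> int:
    """Pre-count of product leading zeros from two 24-bit mantissas (hidden bit = bit 23)."""
    a_denorm = (mant_a >> 23) == 0
    b_denorm = (mant_b >> 23) == 0
    if not (a_denorm or b_denorm):
        return 0  # selector (7): both operands normalized, result adjusted to 0
    selected = mant_a if a_denorm else mant_b  # selector (5): pick the denormalized data
    return 24 - selected.bit_length()          # 24-bit leading-zero module (6)
```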
[0246] 4. Exponent difference and comparison logic (8):
[0247] This module mainly completes the calculation of the product result exponent, the comparison of the product result exponent with the exponent of the C operand, and the comparison between the exponent difference and the leading zero of the product. This facilitates the exponent shift of the P1 level and initially determines the shift method required for the P1 level.
[0248] The P1 level: (see Figure 1, the diagram of the P0-P1 pipeline implementation structure using multiply-accumulate instructions)
[0249] In the second pipeline stage, the multiplication result is obtained. The product exponent is compared with the exponent of the c operand, and the exponent difference is compared with the leading zero of the product. After exponent alignment, the mantissa summation operation is performed on the aligned result.
[0250] 1. Comparison logic:
[0251] The comparison logic here is divided into two parts. The first part, the mantissa comparison logic (13), compares the high 25 bits before shifting. This gives a preliminary judgment of which of the two numbers is larger, which the alignment operation of the current level needs. Only the high 25 bits are compared, for timing reasons.
[0252] The second part, the final comparison logic (28), checks after alignment whether the remaining parts are equal. This yields the final comparison result, which is used in the subsequent sign bit calculation.
[0253] 2. Alignment section:
[0254] The basic principle of the alignment part is the same as the previous algorithm introduction, so it will not be repeated here.
[0255] There are three important design points in the hardware design of the alignment section, which are mainly described below:
[0256] (1) For special alignment cases, the pipeline pauses and the shifter is reused;
[0257] From the previous analysis of the alignment algorithm, if and only if the multiplication exponent result > the exponent of the c operand and the exponent difference > the leading zero of the product (leading zero of the product ≠ 0), the product mantissa needs to be left-shifted for normalization and the c operand needs to be right-shifted for alignment. This is an uncommon scenario, so instead of providing two shifters for it, the design pauses the pipeline, sacrificing performance in this low-probability scenario, and reuses the right shifter for both shifts, thereby reducing the shifter area.
[0258] (2) Use only right shifters, and implement left shifts using right shifters:
[0259] Analysis of the alignment algorithm shows that left shifts are used far less often than right shifts, so a dedicated left shifter would cost area unnecessarily. This method uses only one right shifter in the alignment part: for a left shift, the data is bit-reversed, right-shifted, and then bit-reversed again to obtain the corresponding result.
[0260] (3) Reduce the shifter bit width:
[0261] The shifter bit width is set to (48 + SHF_NUM) bits. SHF_NUM is configurable, so the shift amount can be adjusted for different application scenarios. Currently SHF_NUM is set to 16 and the maximum shift amount of the shifter is 15. If the shift amount exceeds this maximum, a shift of 16 is applied directly by prepending 0s to the data, and the shifter then completes the remaining shift.
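Design points (2) and (3) can be checked with a short Python model. This is a behavioural sketch, not the patented circuit: the function names are hypothetical, and the 0..31 range accepted by `align_right` is an assumption derived from one 16-bit coarse step plus a residual shift of at most 15.

```python
WIDTH = 64                 # 48 + SHF_NUM with SHF_NUM = 16, as in the text
MASK = (1 << WIDTH) - 1

def reverse_bits(x: int, width: int = WIDTH) -> int:
    """Mirror the bit order of a width-bit value."""
    out = 0
    for i in range(width):
        out = (out << 1) | ((x >> i) & 1)
    return out

def right_shift(x: int, amount: int) -> int:
    """The only shifter in this sketch: a logical right shift."""
    return (x & MASK) >> amount

def left_shift_via_right(x: int, amount: int) -> int:
    """Left shift built from reverse, right shift, reverse."""
    return reverse_bits(right_shift(reverse_bits(x), amount))

def align_right(data48: int, amount: int) -> int:
    """Right-shift a 48-bit value inside the 64-bit field (assumed range 0..31)."""
    if not 0 <= amount <= 31:
        raise ValueError("sketch only models shift amounts 0..31")
    if amount > 15:
        field = data48         # 16 zeros prepended: data sits in the low 48 bits
        amount -= 16           # the shifter only handles the remainder
    else:
        field = data48 << 16   # data sits in the high 48 bits
    return right_shift(field, amount)
```

In this model the left shift agrees with `(x << amount) & MASK`, and the coarse 16-bit step is pure wiring, so the shifter itself never sees an amount above 15.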
[0262] The overall implementation of the alignment part is summarized as follows:
[0263] After the mantissa comparison logic (13), the basic exchange signal and the preliminary magnitude relation are determined. The shift control logic (19) controls the selector (15) to select the data to be shifted. The selector (15) also prepends 0s to the shifted data when the shift amount exceeds 15. For a left shift, the data must first pass through the reverse order module (14). The shift amount is determined by the shift amount adjustment logic (20). Once the shift amount and the shifted data are ready, they enter the (48 + SHF_NUM)-bit right shifter (16), which is 64 bits wide with SHF_NUM = 16, to obtain the shift result. For a left shift, the result passes through the reverse order module (17) again after the shift to obtain the left shift result. The selector (18) chooses between the left and right shift results to give the final mantissa shift result. Finally, the selectors (23 / 24) select the larger mantissa and the smaller mantissa after alignment.
[0264] If the shifter must be reused during a pipeline stall, the selector control logic (22) controls the selectors (9 / 10 / 11 / 12) during the stall to update each input, and in the next cycle the shift control logic (19) drives the right shifter (16) again. If a pipeline stall is required to resolve the comparison signal after alignment, the comparison is completed in the next cycle by reusing the mantissa comparison logic (13) and the final comparison logic (28).
[0265] Accordingly, the exponent adjustment logic (21) is used in the exponent adjustment module to further adjust the exponent and obtain the adjusted exponent.
[0266] Specifically, the pipeline stall cases are as follows:
[0267] (1) One pipeline stall.
[0268] a. The product exponent is greater than the exponent of the c operand, and the exponent difference is greater than the product leading-zero count (product leading-zero count ≠ 0);
[0269] In this case the product mantissa must be shifted left by the product leading-zero count for normalization, and the c operand must then be shifted right to align the exponents. The right shift amount is the exponent difference minus the product leading-zero count, and the product remains the larger number. In the first cycle, the bit-reversed product data is selected and enters the shifter for right shifting, and the result, reversed back, is selected as the larger-number data. The shift amount is then adjusted at the top level, and the c operand data is right-shifted in the next cycle to obtain the alignment result.
[0270] b. The product exponent is 1, the exponent of the c operand is 0, and the hidden bit of the product mantissa is 0;
[0271] The product mantissa is a denormalized value, but the exponent is that of a normalized number. The product mantissa must be shifted left by one bit in the first cycle to obtain the true mantissa. In the next cycle, the operation returns to the current pipeline stage and compares against the true result to prevent a comparison error.
[0272] c. The exponent difference equals the product leading-zero count, and the product exponent is greater than the exponent of the c operand;
[0273] The first cycle left-shifts the product to normalize it so that the alignment is completed; the next cycle returns to the current pipeline stage and compares against the true result to prevent a comparison error.
[0274] d. The product exponent is negative and the c operand is a denormalized number;
[0275] The first cycle right-shifts the product mantissa and sets the product exponent to 0, which completes the adjustment of the product exponent; the next cycle returns to the current pipeline stage and compares against the true result to prevent a comparison error.
[0276] e. The product exponent is greater than the exponent of the c operand, the exponent difference is less than the product leading-zero count, and the c operand is a denormalized number;
[0277] The product mantissa must be shifted left by the exponent difference so that the two values are aligned before comparison. In the first cycle, the product mantissa is shifted left to obtain the aligned result; in the second cycle, the operation returns to the current pipeline stage, compares against the true result, and updates the result to prevent errors.
[0278] (2) Two pipeline stalls.
[0279] a. The product exponent is greater than the exponent of the c operand, the exponent difference is greater than the product leading-zero count (product leading-zero count ≠ 0), and the second right shift amount is 1;
[0280] In the first cycle, the data enters the shifter and the product mantissa is shifted left by the product leading-zero count to normalize it. In the second cycle, the shifted data is the mantissa of the c operand, and the right shift amount (exponent difference - product leading-zero count) is 1. However, the two may be unequal because the pre-counted product leading-zero count is inaccurate, which makes the comparison result inaccurate. A further pipeline stall is therefore required for a second comparison. In the third cycle, the product mantissa and the c operand mantissa are compared to determine the true magnitude relation.
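The stall cases listed above can be collected into one decision function. The following Python sketch is an illustration only: the function name, the argument names, and the order in which the cases are tested are assumptions, and the exponents are taken to be the values that the P0 exponent logic would supply.

```python
def stall_cycles(exp_prod: int, exp_c: int, prod_lz: int,
                 prod_hidden_bit: int, c_is_denormal: bool) -> int:
    """Return the number of pipeline stalls (0, 1, or 2) for the listed cases."""
    diff = exp_prod - exp_c
    if exp_prod > exp_c and prod_lz != 0 and diff > prod_lz:
        # Case (1)a, or case (2)a when the second right shift is exactly 1.
        return 2 if diff - prod_lz == 1 else 1
    if exp_prod == 1 and exp_c == 0 and prod_hidden_bit == 0:
        return 1  # case (1)b: product mantissa needs a one-bit left shift
    if exp_prod > exp_c and diff == prod_lz:
        return 1  # case (1)c: left normalization, then compare again
    if exp_prod < 0 and c_is_denormal:
        return 1  # case (1)d: right-shift the product, force its exponent to 0
    if exp_prod > exp_c and diff < prod_lz and c_is_denormal:
        return 1  # case (1)e: left shift by the exponent difference
    return 0      # common case: no stall
```

For example, `stall_cycles(10, 5, 2, 1, False)` falls into case (1)a, while a product with no leading zeros and a larger exponent proceeds without a stall.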
[0281] 3. Mantissa summation
[0282] The mantissa summation operation has two possibilities: adding or subtracting the mantissas.
[0283] (1) Mantissa addition: The two mantissas are added directly by the adder;
[0284] (2) Mantissa subtraction: Because the comparison was completed during the preceding alignment operation, this is always the larger mantissa minus the smaller mantissa. To compute (x - y) with the adder, the identity (x - y) = (x + (~y) + 1) is used, so the subtraction is carried out by addition.
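The identity used for subtraction holds modulo the adder width. A minimal Python sketch, assuming a 50-bit datapath (the width of the mantissa sum stated for the P2 input) and a hypothetical function name:

```python
WIDTH = 50                  # assumed datapath width for this sketch
MASK = (1 << WIDTH) - 1

def sub_via_add(x: int, y: int) -> int:
    """Compute x - y with an adder only: x + (~y) + 1, truncated to WIDTH bits."""
    inverted_y = ~y & MASK  # bitwise inversion restricted to WIDTH bits
    return (x + inverted_y + 1) & MASK
```

Because the larger mantissa is always the minuend, the truncated result equals the true difference and no sign handling is needed at this point.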
[0285] The overall implementation of the mantissa summation is summarized as follows:
[0286] After the aligned mantissas are obtained, if the actual operation is subtraction, the aligned mantissa to be subtracted passes through the inversion module (25) to obtain its opposite number; for addition, no extra operation is needed. The selector (26) selects the corresponding result. The two mantissas and the subtraction signal then enter the mantissa summation module (27) together. The mantissa summation module contains a 26-bit low-half adder and a 25-bit high-half adder. The high-half adder must wait for the carry out of the low-half adder before it can complete the mantissa summation and produce the final mantissa summation result.
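The split adder described above can be modelled as two narrower additions chained by a carry. This Python sketch uses the 26-bit and 25-bit widths from the text; the function name and the `carry_in` parameter (which could carry the +1 of a subtraction) are assumptions.

```python
LOW_BITS, HIGH_BITS = 26, 25   # adder widths given in the text
LOW_MASK = (1 << LOW_BITS) - 1
HIGH_MASK = (1 << HIGH_BITS) - 1

def split_add(x: int, y: int, carry_in: int = 0) -> int:
    """Add two 51-bit values with a low-half adder feeding a high-half adder."""
    low = (x & LOW_MASK) + (y & LOW_MASK) + carry_in
    carry = low >> LOW_BITS    # carry out of the low half
    high = ((x >> LOW_BITS) & HIGH_MASK) + ((y >> LOW_BITS) & HIGH_MASK) + carry
    return ((high & HIGH_MASK) << LOW_BITS) | (low & LOW_MASK)
```

The result equals the full-width sum truncated to 51 bits, which is what a single wide adder would produce.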
[0287] The P2 level (see Figure 2, the P2 pipeline implementation structure diagram of the multiply-add instruction): the third pipeline stage receives the mantissa summation result, the adjusted exponent data, the sign bits to be operated on, and the comparison result.
[0288] The overall implementation of the P2-level solution can be summarized as follows:
[0289] The mantissa summation result is split into a high half and a low half, which enter the 24-bit high-half leading zero counter (2) and the 24-bit low-half leading zero counter (3) respectively. The selector (4) chooses between the two counts according to whether the high half contains a 1. If the high-half count is selected, the number of leading zeros is the high-half count. If the low-half count is selected, it must be compensated by the post-normalization leading zero compensation module (5): the 24-bit high half is all 0, so the number of leading zeros is 24 plus the low-half count, which gives the true post-normalization leading-zero count. The exponent and this count then enter the shift amount calculation module (6) to obtain the shift amount. The shift data is produced by the shift data adjustment module (7), which selects the low-half data when the shift amount exceeds 24 so that the subsequent shifter can satisfy the shift requirement. The 40-bit left shifter (9) produces the shift result from the shift amount and the shift data. The right shift control logic (8) determines whether a 1-bit 0 must be concatenated at the most significant bit, and the mantissa selection module (12) outputs the final mantissa. In parallel, the exponent adjustment module (10) and the rounding adjustment module (11) adjust the exponent and the rounding part, and the sign bit calculation module (1) computes the sign bit.
These data enter the rounding operation, rounding adjustment and abnormal state judgment module (13) together, which completes the rounding operation, the post-rounding adjustment, and the abnormal state judgment; the final result selection (14) then produces the final result.
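The parallel leading-zero count with compensation can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes a 48-bit input split into two 24-bit halves, matching the two 24-bit counters named above.

```python
HALF = 24  # width of each leading-zero counter, from the text

def lzc24(x: int) -> int:
    """Leading zeros of a 24-bit value (24 when the value is 0)."""
    return HALF - x.bit_length()

def lzc48(x: int) -> int:
    """Count both halves independently, then select and compensate."""
    high = (x >> HALF) & 0xFFFFFF
    low = x & 0xFFFFFF
    high_count, low_count = lzc24(high), lzc24(low)  # computed in parallel in hardware
    if high != 0:
        return high_count          # the high half contains a 1
    return HALF + low_count        # high half all 0: add the 24-bit compensation
```

Two 24-bit counters plus one selection step have a shorter critical path than a single 48-bit counter, which is the timing benefit the text refers to.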
[0290] 1. Post-normalization section:
[0291] The mantissa sum data comes from the P1 pipeline stage and totals 50 bits (2 hidden bits + 46 mantissa bits + 1 rounding bit + 1 sticky bit). The exponent is the adjusted exponent produced by the P1 stage. The data first enters the leading zero module for counting and is then shifted.
[0292] The hardware design of the post-normalization section has three important design points, which are described below:
[0293] (1) To improve timing, the leading zeros of the high half and the low half are counted in parallel by two modules, and the post-normalization leading-zero count is corrected according to whether the high half is all 0. With the improved timing, the area also benefits to a certain extent.
[0294] (2) Post-normalization shifting uses only a left shifter;
[0295] The post-normalization part has only one left shifter, which shifts out the leading zeros. For the overflow case that needs a right shift, the previous algorithm analysis shows that the right shift is at most one bit, which can be completed directly by bit concatenation;
[0296] (3) When timing permits, the shifter data can be corrected in advance by the control circuit to reduce the shift amount and the shift data bit width, thereby reducing the shifter bit width and thus reducing the hardware design area.
[0297] (a) The specific shift details are as follows:
[0298] ① The exponent is large enough for a left shift by the leading-zero count;
[0299] When the shift amount does not exceed 24, the original data is selected for shifting, and the high-half leading-zero count is used as the shift amount.
[0300] When the shift amount exceeds 24, the low-half data followed by 24 bits of 0 is selected for shifting, which emulates having already shifted out the 24 zero bits of the high half; the low-half leading-zero count is used as the shift amount.
[0301] This reduces the shift amount;
[0302] ② The exponent is too small for a left shift by the leading-zero count;
[0303] When the shift amount does not exceed 32, the original data is still selected for shifting, and the shift amount is the exponent - 1;
[0304] When the shift amount exceeds 32, the shift amount is still based on the exponent minus 1. The low-half data followed by 24 bits of 0 is selected for shifting, which emulates having already shifted out the 24 zero bits of the high half. The shift amount must be adjusted accordingly: for this exponent-based shift, (24 - 1), that is 23, is subtracted from the total.
[0305] This reduces the shift amount;
[0306] ③ The mantissa sum overflows by one bit;
[0307] If the mantissa sum overflows, the right shift control logic (8) concatenates a 1-bit 0 at the most significant bit of the mantissa to emulate a one-bit right shift; in this case the left shifter is not used.
[0308] ④ No post-normalization shift is required;
[0309] If the post-normalization leading-zero count is 0, or if the number obtained is itself a denormalized number, the original mantissa result is selected directly by the mantissa selection module (12), and the shift amount is 0.
[0310] (b) Further reduction of the shift amount
[0311] The two left-shift cases above show that the maximum shift amount is 31. When timing allows, the shift amount of the shifter can be reduced. For data with a shift amount exceeding 15, an internal control signal removes 16 bits of 0 from the high end and appends the corresponding 0s at the low end. This reduces the maximum shift amount from 31 to 15.
[0312] The specific implementation details are as follows:
[0313] (I) When the shift amount is less than 16, the data enters the left shifter directly;
[0314] (II) When the shift amount is 16 or greater, the shift data is first processed by the shift data adjustment module (7): 16 bits of 0 are removed from the front and 16 bits of 0 are appended at the back. The processed data then enters the left shifter for the remaining shift.
[0315] The shift amount is reduced only to below 16 so that the control circuit stays as simple as possible; reducing it further could make the control circuit too complex and add area overhead.
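The reduction of the maximum shift amount from 31 to 15 can be modelled as a coarse 16-bit step followed by a residual shift. In this Python sketch the 48-bit width and the function name are assumptions; the dropped high bits are zeros whenever the shift amount does not exceed the leading-zero count, which is the situation described above.

```python
WIDTH = 48                 # assumed shift data width before further narrowing
MASK = (1 << WIDTH) - 1

def left_shift_reduced(data: int, amount: int) -> int:
    """Left shift by 0..31 using a shifter that only handles 0..15."""
    if not 0 <= amount <= 31:
        raise ValueError("sketch only models shift amounts 0..31")
    if amount > 15:
        # Remove 16 bits from the front and append 16 zeros at the back.
        data = (data << 16) & MASK
        amount -= 16
    return (data << amount) & MASK  # residual shift, at most 15
```

The coarse step is a fixed rewiring selected by one control bit, so only the residual 4-bit shift needs barrel-shifter stages.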
[0316] (c) Further reduction of shifted data;
[0317] The shift originally required 48 bits of shift data, but the shifter now shifts by at most 15 bits. It is therefore only necessary to satisfy the normalization performed by the left shifter, that is, only (24 + 16) bits of shift data are needed, and the remaining 8 bits are used as sticky bits; the shift data bit width can thus be reduced.
[0318] In summary, the original 48-bit shifter can be reduced to a 40-bit shifter.
[0319] 2. Rounding adjustments:
[0320] The rounding adjustment module (11) completes the corresponding functions. The guard bit is the last bit of the mantissa, the rounding bit is the first bit after the last bit of the mantissa, and the sticky bit is the OR-reduction of all bits from the second bit after the last bit of the mantissa onward;
3. Exponent adjustment part:
[0321] The exponent adjustment module (10) completes the corresponding function;
[0322] The exponent adjustment is performed in parallel with the post-normalization. There are four possible results: exponent + 1, exponent, exponent - post-normalization leading-zero count, and 0; the appropriate one is chosen according to the mantissa selection.
[0323] 4. Rounding operation and post-rounding adjustment:
[0324] This part of the logic is included in the rounding operation, rounding adjustment and abnormal state judgment module (13);
[0325] To improve timing, a 25-bit adder computes the mantissa plus 1 directly. The rounding mode then determines whether the original mantissa or the incremented mantissa is selected as the final rounded result. This rounded result is then used to further adjust the exponent and mantissa.
[0326] Exponent adjustment: If the rounding increment causes the mantissa to overflow by 1 bit, the exponent is incremented by 1;
[0327] Mantissa adjustment: If the rounding increment causes the mantissa to overflow by 1 bit, the mantissa is shifted one bit to the right;
5. Final sign bit calculation:
[0328] Sign bit calculation (1) completes the corresponding function;
[0329] ① If a special value exists, assign the sign bit of the special value;
[0330] ② If the absolute values of the two numbers are equal, that is, after alignment the product result equals the c operand, and the effective operation is subtraction, the sign depends on the rounding mode: when rounding toward negative infinity the sign bit is negative; otherwise the sign bit is positive.
[0331] ③ The remaining cases need to be determined based on the exchange signal and the subtraction signal, as shown in the table below:
[0332] Where A is the product result, B is oprc, and mult_sign is the sign of the product result obtained from the P0-level adjustment; the effective operation here is the same as the effective operation at the P0 level.
[0333]
Condition    Effective operation    mult_sign    Final sign
|A|<|B|      -                      +            -
|A|<|B|      -                      -            +
|A|<|B|      +                      +            +
|A|<|B|      +                      -            -
|A|>|B|      -                      +            +
|A|>|B|      -                      -            -
|A|>|B|      +                      +            +
|A|>|B|      +                      -            -
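The eight rows of the table reduce to a single rule: the final sign is mult_sign, inverted only when |A| < |B| and the effective operation is subtraction. A minimal Python sketch, with a hypothetical function name and the encoding 0 for positive and 1 for negative as an assumption:

```python
def final_sign(a_lt_b: bool, effective_sub: bool, mult_sign: int) -> int:
    """Final sign bit for the non-special, non-equal cases (0 = '+', 1 = '-')."""
    if a_lt_b and effective_sub:
        return mult_sign ^ 1   # the smaller product is subtracted from: sign flips
    return mult_sign           # otherwise the product sign carries through
```

In hardware terms this is one AND gate feeding one XOR gate, with the exchange signal and the subtraction signal as the AND inputs.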
[0334] 6. Exception detection:
[0335] This part of the logic is included in the rounding operation, rounding adjustment and abnormal state judgment module (13);
[0336] ① Invalid exception: Found in the special value judgment at level P0;
[0337] ② Division by zero exception: None;
[0338] ③ Overflow exception: After the exponent is updated, the result exceeds the maximum exponent value, and the case is not already a special one (a maximum value, a qNaN result, or an invalid exception);
[0339] ④ Underflow exception: After rounding, the result lies strictly between the ± minimum normalized numbers (+ minimum normalized number is 32'h0080_0000, - minimum normalized number is 32'h8080_0000) and is inexact;
⑤ Inexact exception: The rounding bit or sticky bit is 1, or an overflow exception occurs.
[0340] 7. Final result processing:
[0341] The final result selection (14) completes the corresponding function;
[0342] If an overflow exception occurs, special assignments are needed for different maxima depending on the rounding mode. In other cases, only the sign bit, exponent, and mantissa need to be concatenated normally. This yields the final result.
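The final result selection can be sketched in Python. The overflow values follow the IEEE 754 default overflow results for binary32; the rounding mode mnemonics (RNE, RTZ, RDN, RUP, RMM) are borrowed from RISC-V as an assumption, since the text does not name the modes, and the function names are hypothetical.

```python
INF = 0x7F800000         # binary32 infinity, sign bit cleared
MAX_FINITE = 0x7F7FFFFF  # largest finite binary32 magnitude

def pack(sign: int, exponent: int, mantissa: int) -> int:
    """Concatenate sign (1 bit), exponent (8 bits), and mantissa (23 bits)."""
    return ((sign & 1) << 31) | ((exponent & 0xFF) << 23) | (mantissa & 0x7FFFFF)

def overflow_result(sign: int, mode: str) -> int:
    """Value assigned on overflow, depending on rounding mode and sign."""
    if mode in ("RNE", "RMM"):   # round to nearest: always infinity
        magnitude = INF
    elif mode == "RTZ":          # toward zero: never reaches infinity
        magnitude = MAX_FINITE
    elif mode == "RDN":          # toward -inf: only negative results reach infinity
        magnitude = INF if sign else MAX_FINITE
    elif mode == "RUP":          # toward +inf: only positive results reach infinity
        magnitude = MAX_FINITE if sign else INF
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return ((sign & 1) << 31) | magnitude
```

For instance, `pack(0, 127, 0)` gives the encoding of 1.0, and a negative overflow under RDN yields negative infinity.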
[0343] In summary, this application targets the application scenarios of single-precision floating-point data and optimizes the multiply-accumulate algorithm while meeting the performance requirements of most single-precision floating-point computing scenarios. It replaces the dual data path with a single data path and, based on real application scenarios, ensures that common cases incur no performance loss from pipeline stalls. For low-probability special cases, performance is sacrificed by stalling the pipeline to reuse the shifter, which reduces the shifter area. In addition, an internal post-normalization strategy further reduces the shifter bit width, which reduces the area further.
[0344] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, various modifications and variations can be made to the embodiments of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. An optimization method for integrating floating-point multiply-accumulate algorithms in a single-precision floating-point multiply-accumulate unit, characterized in that, in the process of performing single-precision floating-point multiply-accumulate calculations, the pipeline implementation corresponding to the method is as follows:
P0 level, the first pipeline stage: completes multiplication, leading zero counting, and special value judgment;
P1 level, the second pipeline stage: completes the alignment shift operation, the comparison operation, and the summation of the mantissas after alignment;
P2 level, the third pipeline stage: completes post-processing, rounding, final result processing, and final exception handling;
the schemes for each pipeline stage are as follows:
The P0 level: the input of the first pipeline stage provides the three operands a, b, and c; the stage performs: special value judgment; product mantissa calculation; product leading zero pre-counting; exponent difference and comparison logic; and it produces the product mantissa result, the comparison between the product exponent and the exponent of the c operand, and the comparison between the exponent difference and the product leading-zero count.
The P1 level: the input of the second pipeline stage provides the multiplication result, the comparison between the product exponent and the exponent of the c operand, and the comparison between the exponent difference and the product leading-zero count; the stage performs: comparison logic; alignment operation; and the mantissa summation part, that is, after alignment the mantissa summation is performed on the aligned result and the comparison operation is completed at the same time; and it produces the mantissa summation result, the adjusted exponent data, the sign bits to be operated on, and the comparison result;
The P2 level: the input of the third pipeline stage provides the mantissa summation result, the adjusted exponent data, the sign bits to be operated on, and the comparison result; the stage performs: post-processing, rounding, final result processing, and abnormal status judgment; and it produces the final calculation result and the abnormal status.
2. The optimization method for integrating floating-point multiply-accumulate algorithms in a single-precision floating-point multiply-accumulate unit according to claim 1, characterized in that, in the P0 level:
The special value judgment includes: in the special value judgment module, the corresponding special value judgments must be made in advance for the various instructions; the data entering the special value judgment module (2) is adjusted according to the instruction control signal module (1);
(1) The concept of effective operation is introduced: the instruction control module (1) decodes the internal information of the instruction, which, together with the sign bits of the operands, determines whether the actual operation in the mantissa summation is addition or subtraction;
Note: in the effective operation judgment, for multiply-accumulate instructions, the mul_res sign bit is the sign bit of the product result in the multiply-accumulate instruction, and the oprc sign bit is the sign bit of the c operand;
The sign bits and input operations corresponding to the multiply-accumulate instructions are as follows:
FMADD: a*b+c, where the mul_res sign bit is the sign bit of the product result of a*b, and the input operation is +;
FMSUB: a*b-c, where the mul_res sign bit is the sign bit of the product result of a*b, and the input operation is -;
FNMADD: -a*b-c, where the mul_res sign bit is the sign bit of the product result of -a*b, and the input operation is -;
FNMSUB: -a*b+c, where the mul_res sign bit is the sign bit of the product result of -a*b, and the input operation is +;
(2) Once the effective operation is determined, the special value module makes its judgment; multiply-accumulate instructions require the multiplication to be performed first, followed by the summation of the mantissas according to the effective operation.
Therefore, the special value judgment is shown in the following three tables: The table below shows the results of special value judgments for multiplication: The table below shows the results of special value judgments for valid operations that are addition: The table below shows the results of special value judgments for valid operations that are subtraction: Note: (sub)norm is a general term for normalized and denormalized numbers. NaN numbers (Not a number, representing an inexpressible value) are divided into two categories: qNaN numbers and sNaN numbers. sNaN is a number whose exponent is all 1s, whose first digit is 0, and whose overall mantissa is not 0. qNaN is a value whose exponent is all 1s and whose mantissa is 1 in the first position. RISC-V specifies that if the result of a floating-point operation is a NaN number, then a fixed NaN number should be used. The NaN value corresponding to single-precision floating-point is 0x7fc0_0000. Therefore, the final result qNaN needs to be assigned a fixed value, i.e., qNaN = 32'h7fc0_0000. 
The table crossed out by the horizontal line indicates that the result needs to be obtained through normal calculation, rather than a special value; for multiplication and addition instructions, firstly, the result of a*b needs to be assigned to |x| according to the special value table of multiplication, and then the special value result is obtained from the corresponding table according to the corresponding valid operation; if the above special value result is generated, a special value signal needs to be set to mark that the operation is a special case, and the result is assigned to the special value result obtained above, which is convenient for subsequent calculation; (3) After obtaining the valid operation, the invalidity exception of the special value is judged and left for use by P2 level; The calculation of the product mantissa includes: The operands a and b need to enter the product mantissa calculation module (3) to calculate the product mantissa result. The product mantissa calculation module (3) contains a multiplier and two adders. First, the mantissas of operands a and b are multiplied to obtain two partial product results. The lower half of the partial product first enters the 24-bit adder to calculate the sum of the lower 24-bit partial products. The carry result generated by the summation will be passed to the higher half adder and enter the 25-bit adder together with the higher half partial product to obtain the product mantissa calculation result. Finally, through the selector (4), the final required mantissa result is selected according to the special value signal and the special value result. The product leading zero pre-statistics module includes: First, it is necessary to enter the selector (5) to select the non-normalized data according to the non-normalized number judgment signal, enter the 24-bit leading zero statistics module (6) to count the number of leading zeros of the selected data, and then enter the next selector (7). 
If both numbers are normalized numbers, the statistical result is adjusted to 0 to obtain the product leading zero pre-statistical result. The exponent difference and comparison logic (8) includes: This module mainly completes the calculation of the product result exponent, the comparison of the product result exponent with the exponent of the C operand, and the comparison between the exponent difference and the leading zero of the product. This facilitates the exponent shift of the P1 level and initially determines the shift method required for the P1 level.
3. The optimization method for integrating floating-point multiply-accumulate algorithms in a single-precision floating-point multiply-accumulate unit according to claim 1, characterized in that, In the P1 level: The comparison logic includes: This comparison logic is divided into two parts: The first part of the mantissa comparison logic (13) must perform a comparison of the high half 25-bits before shifting. This comparison is to make a preliminary judgment on the size of the two numbers, which is convenient for the current level of alignment operation. Here, for timing considerations, only the high half 25-bits are compared. The second part of the final comparison logic (28) needs to continue the operation of comparing whether the remaining parts are equal after the alignment is completed. This is to obtain the final comparison result, which is convenient for subsequent sign bit calculation. The order alignment part: The overall solution for the order alignment part is as follows: After the mantissa comparison logic (13), the basic exchange signal and preliminary size are determined. According to the relevant shift control logic (19), the selector (15) is controlled to select the relevant data. This selector (15) also undertakes the function of adding 0 to the corresponding shifted data if the shift amount exceeds 15. If it is a left shift, the shifted data must first enter the reverse order module (14) to perform the reverse order operation. The shift amount is specifically determined by the shift amount adjustment logic (20). After the shift amount and the shifted data are ready, The bit shifter (16) is a 48+SHF_NUM, which is a 64-bit right shifter. SHF_NUM uses 16 according to the operator statistics to obtain the shift result. If it is a left shift, it needs to enter a reverse module (17) again after the shift is completed to obtain the left shift result. 
The final left and right shift results are selected by the selector (18) to obtain the final result of the mantissa shift. Finally, the mantissa and mantissa after alignment need to be selected by the selector (23 / 24). If a shifter needs to be reused for a pipeline pause, the selector (9 / 10 / 11 / 12) is controlled according to the selector control logic (22) during the pipeline pause to perform relevant control and update each input. In the next pipeline, the shifter is controlled according to the shift control logic (19) to reuse the right shifter (16). If a pipeline pause is required to further determine the comparison signal after alignment, the mantissa comparison logic (13) and the final mantissa comparison logic (28) are reused in the next pipeline to complete the comparison. Accordingly, the exponent adjustment logic (21) is used in the exponent adjustment module to further adjust the exponent to obtain the adjusted exponent; The mantissa summation part: The mantissa summation operation has two possibilities, namely, adding or subtracting the mantissas: including: (1) Adding the last two digits: The two last two digits can be directly added together by the adder; (2) Subtraction of mantissas: Since the previous alignment operation completed the part of the comparison logic operation, this must be the big mantissa minus the small mantissa; at this time, for the expression (xy), in order to use the adder to perform the calculation, we can use the expression (xy) = (x + (~y) + 1) to perform the subtraction operation by using addition. 
The overall scheme for mantissa summation is as follows: after the aligned mantissas are obtained, if the effective operation is subtraction, the mantissa to be subtracted passes through the inversion module (25) to obtain its opposite; for addition no extra operation is needed; the corresponding result is selected by the selector (26); the two selected mantissas and the subtraction signal enter the mantissa summation module (27) together. The mantissa summation module contains a 26-bit lower-half adder and a 25-bit upper-half adder; the upper-half adder must wait for the carry out of the lower-half adder before it can complete the mantissa summation and produce the final mantissa sum.
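The subtraction-by-addition identity and the split adder described in this claim can be illustrated with a small software model. This is only a sketch under stated assumptions: the function names are hypothetical, and feeding the subtraction signal in as the carry-in of the lower-half adder (to supply the "+ 1") is an assumed arrangement, not something the claim specifies.

```python
LOW_BITS = 26   # lower-half adder width stated in the claim
HIGH_BITS = 25  # upper-half adder width stated in the claim
WIDTH = LOW_BITS + HIGH_BITS


def split_add(x: int, y: int, carry_in: int) -> int:
    """Add two WIDTH-bit values: the lower adder's carry feeds the upper adder."""
    low_mask = (1 << LOW_BITS) - 1
    high_mask = (1 << HIGH_BITS) - 1
    low_sum = (x & low_mask) + (y & low_mask) + carry_in
    carry = low_sum >> LOW_BITS  # carry out of the lower half
    high_sum = (x >> LOW_BITS) + (y >> LOW_BITS) + carry
    # The carry out of the upper half is discarded (result is modulo 2**WIDTH).
    return ((high_sum & high_mask) << LOW_BITS) | (low_sum & low_mask)


def mantissa_sum(big: int, small: int, subtract: bool) -> int:
    """Return big + small, or big - small computed as big + (~small) + 1."""
    mask = (1 << WIDTH) - 1
    operand = (~small & mask) if subtract else small
    # Assumption: the subtraction signal doubles as the carry-in (the "+ 1").
    return split_add(big, operand, 1 if subtract else 0)
```

Because the larger mantissa is always the minuend, the result of the subtraction is non-negative and the discarded carry out of the upper half carries no information.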
4. The optimization method for integrating floating-point multiply-accumulate algorithms in a single-precision floating-point multiply-accumulate unit according to claim 3, characterized in that the hardware design of the exponent alignment part has three key design points, as follows:
(1) For special alignment cases, stall the pipeline and reuse the shifter: analysis of the alignment algorithm shows that if and only if the product exponent > the exponent of the c operand and the exponent difference > the number of leading zeros of the product (with the number of leading zeros of the product ≠ 0), both a left normalization shift of the product mantissa and a right alignment shift of the c operand are required. This scenario is uncommon, so instead of using two shifters, a pipeline stall is used, sacrificing performance in a low-probability scenario and reusing the right shifter to reduce shifter area.
(2) Use only a right shifter, and implement left shifts with it: analysis of the alignment algorithm shows that left shifts occur far less often than right shifts, so a dedicated left shifter would waste area. The method uses a single right shifter in the exponent alignment part; a left shift is performed by reversing the bit order, shifting right, and then reversing the bit order of the result again.
(3) Reduce the shifter bit width: the shifter width is set to (48 + SHF_NUM) bits, where SHF_NUM is configurable so that the shift range can be tuned to the application scenario. SHF_NUM is currently set to 16, giving a maximum shift amount of 15. If the shift amount exceeds this maximum, the part of the shift beyond 16 is handled directly by prepending zeros, and the shifter then completes the remaining shift.
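Design points (2) and (3) can be checked with a short behavioural model. The sketch below is illustrative only: the function names are hypothetical, the 64-bit width follows from 48 + SHF_NUM with SHF_NUM = 16, and limiting the pre-adjustment to a single fixed step of 16 (so shifts up to 31) is an assumption made for brevity.

```python
SHF_NUM = 16
WIDTH = 48 + SHF_NUM  # 64-bit right shifter, as stated in the claim


def reverse_bits(value: int, width: int = WIDTH) -> int:
    """Reverse the bit order of a width-bit value."""
    result = 0
    for i in range(width):
        if (value >> i) & 1:
            result |= 1 << (width - 1 - i)
    return result


def left_shift_via_right(value: int, amount: int) -> int:
    """Left shift using only a right shifter: reverse, shift right, reverse."""
    return reverse_bits(reverse_bits(value) >> amount)


def right_shift_limited(value: int, amount: int) -> int:
    """Right shift by 0..31 with a shifter limited to 15 positions."""
    if amount >= SHF_NUM:
        value >>= SHF_NUM   # fixed pre-adjustment: 16 zeros prepended
        amount -= SHF_NUM
    assert 0 <= amount < SHF_NUM  # the shifter itself never moves more than 15
    return value >> amount
```

The double reversal gives the same result as a left shift truncated to the shifter width, which is why a single right shifter is sufficient.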
5. The optimization method for integrating floating-point multiply-accumulate algorithms in a single-precision floating-point multiply-accumulate unit according to claim 3, characterized in that the specific pipeline stall cases include:
(1) The pipeline stalls once.
a. The product exponent is greater than the exponent of the c operand, the exponent difference is greater than the number of leading zeros of the product, and the number of leading zeros of the product is not 0. Here the product result must be shifted left by the number of leading zeros of the product for normalization, and the c operand must then be shifted right for exponent alignment. The right-shift amount is the exponent difference minus the number of leading zeros of the product, and the product remains the larger number. In the first cycle, the reversed data is selected and enters the shifter for the right shift; at the same time, the reversed data is selected as the final large-number data. The shift amount is then adjusted at the top level, and in the next cycle the c operand data is shifted right to obtain the alignment result.
b. The product exponent is 1, the exponent of the c operand is 0, and the hidden bit of the product mantissa is 0. The product mantissa is then denormalized while its exponent is that of a normalized number. In the first cycle the product mantissa is shifted left by one bit to obtain the true mantissa; in the next cycle the pipeline returns to the current level and compares the true results, to prevent a comparison error.
c. The exponent difference equals the number of leading zeros of the product, and the product exponent is greater than the exponent of the c operand. In the first cycle the product result is left-normalized so that exponent alignment is completed; in the next cycle the pipeline returns to the current level and compares the true results, to prevent a comparison error.
d. The product exponent is negative and the c operand is a denormalized number. In the first cycle the product mantissa is shifted right and the product exponent is brought to 0, completing the adjustment of the product exponent; in the next cycle the pipeline returns to the current level and compares the true results, to prevent a comparison error.
e. The product exponent is greater than the exponent of the c operand, the exponent difference is less than the number of leading zeros of the product, and the c operand is a denormalized number. The product mantissa must be shifted left by the exponent difference so that the two operands are aligned before comparison. In the first cycle the product mantissa is shifted left to obtain the aligned result; in the second cycle the pipeline returns to the current level, compares the true results, and updates the result to prevent errors.
(2) The pipeline stalls twice.
a. The product exponent is greater than the exponent of the c operand, the exponent difference is greater than the number of leading zeros of the product, the number of leading zeros of the product is not 0, and the second right-shift amount is 1. In the first cycle, the data enters the shifter and the product result is shifted left by the number of leading zeros of the product for normalization. In the second cycle, the data to be shifted is the c operand, and the right-shift amount is (exponent difference - number of leading zeros of the product) = 1. However, because the leading-zero count of the product may be inaccurate, the two may turn out unequal and the comparison result may be inaccurate, so a further pipeline stall is required for a second comparison. In the third cycle, the product mantissa and the c operand mantissa are compared to determine the true magnitude relationship.
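The stall conditions listed in this claim can be summarised as a small decision function. The following is a rough sketch, not the claimed control logic: the function and parameter names are hypothetical, the order in which overlapping conditions are tested is an assumption, and treating every unlisted situation as "no stall" is also an assumption.

```python
def stall_count(product_exp: int, c_exp: int, product_lzc: int,
                product_hidden_bit: int, c_is_denormal: bool) -> int:
    """Return the number of pipeline stalls (0, 1 or 2) for the listed cases."""
    diff = product_exp - c_exp
    larger = product_exp > c_exp

    # (1)a / (2)a: normalize the product left, then align c right.
    if larger and product_lzc != 0 and diff > product_lzc:
        # (2)a: a second right shift of exactly 1 needs one more comparison.
        return 2 if diff - product_lzc == 1 else 1
    # (1)b: product exponent 1, c exponent 0, product hidden bit is 0.
    if product_exp == 1 and c_exp == 0 and product_hidden_bit == 0:
        return 1
    # (1)c: exponent difference equals the product leading-zero count.
    if larger and diff == product_lzc:
        return 1
    # (1)d: negative product exponent with a denormalized c operand.
    if product_exp < 0 and c_is_denormal:
        return 1
    # (1)e: difference smaller than the leading-zero count, c denormalized.
    if larger and diff < product_lzc and c_is_denormal:
        return 1
    return 0  # assumption: all other cases proceed without a stall
```

For example, a product exponent of 10, a c exponent of 7 and two product leading zeros leave a second right shift of 1, which this sketch classifies as the two-stall case.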
6. The optimization method for integrating floating-point multiply-accumulate algorithms in a single-precision floating-point multiply-accumulate unit according to claim 1, characterized in that the overall scheme for the P2 level is as follows: the mantissa sum is split into an upper half and a lower half, which enter the 24-bit upper-half leading zero count (2) and the 24-bit lower-half leading zero count (3) respectively to count leading zeros. The selector (4) then selects one of the two counts according to whether the upper half contains a 1. If the upper-half count is selected, the number of leading zeros for post-normalization equals the upper-half count. If the lower-half count is selected, it must be compensated by the leading zero compensation module (5): because all 24 bits of the upper half are 0, the number of leading zeros is 24 + the lower-half count, which gives the true leading-zero count for post-normalization. This count enters the shift amount calculation module (6) together with the exponent to produce the shift amount. The data to be shifted is obtained from the shift data adjustment module (7); for shift amounts exceeding 24, this module selects the lower half as the data to be shifted so that the subsequent shifter can satisfy the shift requirement. The 40-bit left shifter (9) produces the shift result from the shift amount and the shift data. The right shift control logic (8) determines whether a 1-bit 0 must be concatenated at the most significant bit, and the final mantissa result is obtained through the mantissa selection module (12); meanwhile, the exponent adjustment module (10) and the rounding part adjustment module (11) adjust the exponent and the rounding part; the sign bit is computed by the sign bit calculation module (1); these data enter the rounding operation, rounding adjustment and abnormal state judgment module (13), which performs the rounding operation, the rounding adjustment and the exception judgment, and finally enter the final result selection (14) to obtain the final result.
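The split leading-zero count with compensation can be modelled in a few lines. This is a minimal sketch with hypothetical function names; the two half counts are evaluated one after the other here, whereas the claim describes them as separate modules that operate side by side.

```python
HALF = 24  # width of each half, as stated in the claim


def leading_zeros(value: int, width: int) -> int:
    """Count leading zeros of a width-bit value (returns width for zero)."""
    count = 0
    for i in range(width - 1, -1, -1):
        if (value >> i) & 1:
            break
        count += 1
    return count


def leading_zeros_48(mantissa_sum: int) -> int:
    """Leading-zero count of a 48-bit value from two 24-bit half counts."""
    half_mask = (1 << HALF) - 1
    high = (mantissa_sum >> HALF) & half_mask
    low = mantissa_sum & half_mask
    high_count = leading_zeros(high, HALF)   # upper-half count
    low_count = leading_zeros(low, HALF)     # lower-half count
    if high != 0:
        return high_count                    # upper half contains a 1
    return HALF + low_count                  # compensation: upper half all 0
```

For instance, a value whose only set bit is bit 23 has an all-zero upper half, so the result is 24 plus a lower-half count of 0.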
7. The optimization method for integrating floating-point multiply-accumulate algorithms in a single-precision floating-point multiply-accumulate unit according to claim 1, characterized in that, in the P2 level:
The post-normalization part: the mantissa sum obtained from the P1 pipeline stage is 50 bits in total: 2 hidden bits + 46 mantissa bits + 1 rounding bit + 1 sticky bit. The exponent is the adjusted exponent produced by the P1 stage. The data first enters the leading zero module for counting and then undergoes the shift operation. Further:
(a) The shift details are as follows:
① The exponent is large enough for a left shift by the leading-zero count: for shift amounts not exceeding 24, the original data is selected for shifting, and the upper-half leading-zero count is used as the shift amount. For shift amounts exceeding 24, the lower half of the data followed by 24 zero bits is selected for shifting, which emulates having already shifted out the 24 zero bits of the upper half, and the lower-half leading-zero count is used as the shift amount. This reduces the shift amount;
② The exponent is not large enough for a left shift by the leading-zero count: for shift amounts not exceeding 32, the original data is still selected for shifting, and the shift amount is the exponent - 1; for shift amounts exceeding 32, the shift amount is the exponent - 1, and the lower half of the data followed by 24 zero bits is selected for shifting, which emulates having already shifted out the 24 zero bits of the upper half. The shift amount must then be adjusted: for a shift by the exponent, a total of (24 - 1), that is 23, is subtracted. This reduces the shift amount;
③ The mantissa sum overflows by one bit: if the mantissa overflows after summation, a 1-bit 0 is concatenated at the most significant bit of the mantissa by the right shift control logic (8) to emulate shifting the mantissa right by one bit; the left shifter is not used in this case.
④ No post-normalization shift is required: if there are no leading zeros, or the result is itself a denormalized number, the original mantissa result is selected directly by the mantissa selection module (12), and the shift amount is 0.
(b) Further reduction of the shift amount: from the two left-shift cases above, the maximum shift amount is 31. When timing allows, the shift range of the shifter can be reduced. For data with a shift amount exceeding 15, 16 zero bits are removed from the upper end by an internal control signal and 16 zero bits are appended at the lower end. This reduces the maximum shift amount from 31 to 15. The implementation details are as follows: (I) when the shift amount is less than 16, the data enters the left shifter directly for shifting; (II) when the shift amount is 16 or greater, the data to be shifted is processed by the shift data adjustment module (7): 16 zero bits are removed from the front and 16 zero bits are appended at the back, and the processed data is then sent to the left shifter for the remaining shift. The shift range is reduced only to below 16 so that the control circuit stays as simple as possible; otherwise the control circuit may become too complex and add area overhead.
(c) Further reduction of the shift data: originally 48 bits of shift data were required, but since the shifter shifts by at most 15 bits, it only has to satisfy left normalization, so only (24 + 16) bits of shift data are needed and the remaining 8 bits can be used as sticky bits; the shift data bit width is thereby reduced. In summary, the original 48-bit shifter can be reduced to a 40-bit shifter;
The rounding part adjustment: performed by the rounding part adjustment module (11); the guard bit is the last bit of the mantissa, the round bit is the first bit after the last bit of the mantissa, and the sticky bit is the OR-reduction of all bits from the second bit after the last bit of the mantissa onward.
The exponent adjustment part: performed by the exponent adjustment module (10); the exponent adjustment proceeds in parallel with post-normalization and has four possible outcomes: exponent + 1, exponent, exponent - post-normalization leading-zero count, and 0; the appropriate one is chosen according to the mantissa selection.
The rounding operation and post-rounding adjustment: this logic is contained in the rounding operation, rounding adjustment and abnormal state judgment module (13); to improve timing, the mantissa plus 1 is computed directly by a 25-bit adder; the rounding mode then selects either the original mantissa or the incremented mantissa as the final rounded result. The rounded result is then used to further adjust the exponent and mantissa. Exponent adjustment: if rounding by +1 causes the mantissa to overflow by 1 bit, the exponent is incremented by 1; mantissa adjustment: if rounding by +1 causes the mantissa to overflow by 1 bit, the mantissa is shifted right by one bit.
The final sign bit calculation: performed by the sign bit calculation module (1); ① if a special value exists, the sign bit of the special value is assigned; ② if the absolute values of the two numbers are equal, that is, after alignment the product result equals the c operand, and the effective operation is subtraction, the sign depends on the rounding mode: for rounding toward negative infinity the sign bit is negative, otherwise it is positive; ③ the remaining cases are determined from the swap signal and the subtraction signal, as shown in the table below: where A is the product result, B is oprc, and mult_sign is the sign of the product result obtained from the P0-level adjustment; the effective operation here is the same as the effective operation at the P0 level.
The exception judgment: this logic is contained in the rounding operation, rounding adjustment and abnormal state judgment module (13); ① Invalid exception: detected in the special value judgment at the P0 level; ② Division-by-zero exception: none; ③ Overflow exception: the result exceeds the maximum exponent value after the exponent is updated, and it is not a maximum value at this time; qNaN result and invalid exception; ④ Underflow exception: after rounding, the result lies strictly between ± the minimum normalized number (the + minimum normalized number is 32'h0080_0000, the - minimum normalized number is 32'h8080_0000) and is inexact; ⑤ Inexact exception: the round bit or sticky bit contains a 1, or an overflow exception occurs at this time;
The final result processing: performed by the final result selection (14); if an overflow exception occurs, the result is assigned a special maximum value that depends on the rounding mode. In all other cases, the sign bit, exponent and mantissa are simply concatenated to form the final result.
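The rounding step, in which the incremented mantissa is computed unconditionally and the rounding mode only selects between two candidates, can be sketched as follows. The function name and the mode strings are hypothetical, and the four modes used (nearest-even, toward zero, toward positive infinity, toward negative infinity) are assumed from the IEEE 754 binary formats rather than taken from the claim, which does not enumerate them.

```python
def round_mantissa(mantissa: int, round_bit: int, sticky: int,
                   sign: int, mode: str):
    """Round a 24-bit mantissa (hidden bit included).

    Returns (rounded_mantissa, exponent_increment, inexact).
    """
    guard = mantissa & 1                 # last bit of the mantissa
    incremented = mantissa + 1           # 25-bit adder result, always computed
    inexact = bool(round_bit or sticky)

    if mode == "nearest_even":
        round_up = bool(round_bit and (sticky or guard))
    elif mode == "toward_zero":
        round_up = False
    elif mode == "toward_positive":
        round_up = inexact and sign == 0
    elif mode == "toward_negative":
        round_up = inexact and sign == 1
    else:
        raise ValueError(f"unknown rounding mode: {mode}")

    result = incremented if round_up else mantissa
    exponent_increment = 0
    if result >> 24:                     # +1 overflowed the mantissa by one bit
        result >>= 1                     # shift the mantissa right by one bit
        exponent_increment = 1           # and increment the exponent
    return result, exponent_increment, inexact
```

Computing the increment before the rounding decision is known keeps the adder off the decision path, which matches the timing motivation given in the claim.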
8. The optimization method for integrating floating-point multiply-accumulate algorithms in a single-precision floating-point multiply-accumulate unit according to claim 7, characterized in that the hardware design of the post-normalization part has three key design points, as follows: (1) To improve timing, the leading zeros of the upper half and the lower half are counted in parallel by two separate modules, and the post-normalization leading-zero count is corrected according to whether the upper half is all 0. With good timing, the area also benefits to some extent. (2) Post-normalization shifting uses only a left shifter: the post-normalization part has a single left shifter to perform the leading-zero shift; for the overflow case that needs a right shift, the preceding alignment algorithm analysis shows that the right shift is at most one bit, so it can be completed directly by bit concatenation; (3) When timing permits, the control circuit adjusts the shift data in advance to reduce both the shift amount and the shift data bit width, thereby reducing the shifter bit width and the hardware design area.
9. The optimization method for integrating floating-point multiply-accumulate algorithms in a single-precision floating-point multiply-accumulate unit according to claim 1, characterized in that the multiply-accumulate module implements the multiply-accumulate algorithm as follows: after the three operands a, b and c are obtained, the product a*b is computed first; the product result and the c operand (the addend) are exponent-aligned; the mantissas are then summed; post-normalization follows; finally, rounding and exception handling are performed and the final result is obtained. Exponent alignment requires the product mantissa, the number of leading zeros of the product, the exponent difference, and the c operand (the addend). Part of the alignment shift given by the exponent difference can cancel the normalization shift given by the leading zeros of the product. The specific algorithm is as follows:
(1) Product exponent > c operand exponent
a. Exponent difference > leading zeros of the product: the product result has the larger exponent and needs a left normalization shift equal to the number of leading zeros of the product, while the c operand has the smaller exponent and needs a right alignment shift equal to the exponent difference minus the number of leading zeros of the product. Two shifts are required in this case, and the product result remains the larger operand; if the product has no leading zeros, only the c operand needs to be shifted, by the exponent difference.
b. Exponent difference < leading zeros of the product: before shifting, the product result has the larger exponent. Conceptually, the product mantissa is first shifted left by the number of leading zeros of the product for normalization; because the exponent difference < leading zeros of the product, the product result then has the smaller exponent, and the shifted product mantissa must be shifted right by (leading zeros of the product - exponent difference) for exponent alignment. The net effect is that the product mantissa only needs to be shifted left by the exponent difference. The product result becomes the smaller operand and must be swapped;
c. Exponent difference = leading zeros of the product: the product mantissa only needs to be shifted left by the number of leading zeros of the product for normalization, after which the c operand and the product result are exponent-aligned;
(2) Product exponent < c operand exponent: regardless of the relationship between the number of leading zeros of the product and the exponent difference, the product result has the smaller exponent. Conceptually, the product mantissa is first shifted left by the number of leading zeros of the product for normalization, which makes the exponent of the product result smaller still; the shifted product mantissa must then be shifted right by (leading zeros of the product + exponent difference) for exponent alignment. The net effect is that the product mantissa only needs to be shifted right by the exponent difference. The product result remains the smaller operand;
(3) Product exponent = c operand exponent, exponent difference = 0
a. If the leading zeros of the product = 0, the product result and the c operand are already exponent-aligned and no shift is required;
b. If the leading zeros of the product ≠ 0, the exponent difference is less than the leading zeros of the product. Conceptually, the product mantissa is first shifted left by the number of leading zeros of the product for normalization, after which the exponent difference equals the number of leading zeros of the product; the adjusted product mantissa must then be shifted right by the number of leading zeros of the product for exponent alignment. The net effect is that no shift is required.
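The net shifts in cases (1) to (3) can be collected into one function. The sketch below is an illustration with hypothetical names (`ShiftPlan`, `alignment_plan`); it returns only the net shift amounts and leaves out the swap signal, special values and denormal handling.

```python
from typing import NamedTuple


class ShiftPlan(NamedTuple):
    product_left: int   # net left shift of the product mantissa
    product_right: int  # net right shift of the product mantissa
    c_right: int        # right shift of the c operand mantissa


def alignment_plan(product_exp: int, c_exp: int, product_lzc: int) -> ShiftPlan:
    """Net alignment shifts from the two exponents and the product leading zeros."""
    if product_exp > c_exp:
        diff = product_exp - c_exp
        if diff > product_lzc:
            # (1)a: normalize the product, align c by the remainder.
            return ShiftPlan(product_lzc, 0, diff - product_lzc)
        # (1)b and (1)c: a left shift by the exponent difference suffices
        # (in case (1)c the difference equals the leading-zero count).
        return ShiftPlan(diff, 0, 0)
    if product_exp < c_exp:
        # (2): the product is only shifted right by the exponent difference.
        return ShiftPlan(0, c_exp - product_exp, 0)
    # (3): equal exponents, no net shift in either sub-case.
    return ShiftPlan(0, 0, 0)
```

A simple sanity check is that a left shift lowers an operand's effective exponent and a right shift raises it, so after applying the plan both operands sit at the same exponent: `product_exp - product_left + product_right == c_exp + c_right`.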