A vector floating-point multiplier-accumulator suitable for floating-point operations of various precisions
The vector floating-point multiply-accumulator designed with a four-level modular pipeline solves the problems of excessive hardware area and poor compatibility of floating-point multiply-accumulator operations with multiple precisions in the existing technology. It achieves efficient floating-point operations with multiple precisions, supports the IEEE-754 standard, and is suitable for information technology fields such as image processing, signal transmission, and aerospace.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SOUTH CHINA UNIV OF TECH
- Filing Date
- 2023-03-30
- Publication Date
- 2026-06-30
AI Technical Summary
In the prior art, the hardware design of floating-point arithmetic units with multiple independent arithmetic logic designs has the disadvantage of excessive area and inability to meet the requirements of floating-point multiplication and addition operations of various precisions.
The vector floating-point multiply-adder, which adopts a four-level modular pipeline design, includes a partial product generation module, a Wallace network, an inversion module, an exponent alignment module, an adder, and an exception handling module. It implements half-precision, single-precision, and double-precision floating-point multiply-add operations and shares partial product and adder resources through SIMD optimization and hardware isolation design.
It implements floating-point multiply-accumulate operations with multiple precisions, supports all floating-point number types of the IEEE-754 standard, has high area efficiency, a maximum path delay of 0.32ns, and a maximum operating frequency of 3.125GHz, making it suitable for information technology fields such as image processing, signal transmission, and aerospace.
Figure CN116521124B_ABST (abstract drawing)
Abstract
Description
Technical Field
[0001] This invention relates to the field of floating-point arithmetic technology, and in particular to a vector floating-point multiplier-accumulator suitable for floating-point arithmetic of various precisions.
Background Technology
[0002] In computers, data can be expressed in two ways: fixed-point numbers and floating-point numbers. Fixed-point numbers have the advantage of simple computational logic, which makes the design of functional components easier. However, fixed-point numbers have a smaller numerical range and lower precision. Floating-point numbers, on the other hand, can dynamically adjust the position of the decimal point, allowing for more accurate data representation. Therefore, floating-point numbers have the advantage of representing a much larger data range with higher precision than fixed-point numbers. However, their disadvantage lies in the fact that their computational algorithms are much more complex than fixed-point logic, posing significant challenges in hardware design.
[0003] For floating-point instructions, multiplication and addition are the most frequently used. To improve the computing power of floating-point processors, multiplication and addition instructions can be combined into a single instruction, namely the floating-point multiply-add instruction. Furthermore, since multiply-add eliminates one rounding operation, the loss of precision is also smaller.
[0004] Currently, floating-point multiply-accumulate units often employ a dual-path design. In hardware, dividing the datapath into a close path and a far path based on the exponent difference is a common optimization strategy for shortening the critical path. While this design can achieve high-speed computation, it typically leads to an excessive hardware footprint, reducing area efficiency. Furthermore, existing floating-point multiply-accumulate units are usually suitable only for multiply-accumulate operations of a single precision, which limits their usefulness when multiply-accumulate operations of several precisions are required.
Summary of the Invention
[0005] In view of this, embodiments of the present invention provide a vector floating-point multiplier-accumulator suitable for floating-point operations of various precisions.
[0006] The first aspect of the present invention provides a vector floating-point multiply-accumulator suitable for floating-point operations of various precisions, including a first operation module, a second operation module, a third operation module and a fourth operation module;
[0007] The first operation module specifically includes a partial product generation module, a Wallace network, a first inversion module, an exponent alignment module, a mantissa compound right shifter, a sticky logic module, and an exception pre-judgment module; wherein, the Wallace network is connected to the output of the partial product generation module; the mantissa compound right shifter is connected to the output of the first inversion module; the mantissa compound right shifter is connected to the output of the exponent alignment module; and the sticky logic module is connected to the output of the mantissa compound right shifter.
[0008] The second arithmetic module specifically includes a 3:2 CSA adder, a CPA adder, an increment circuit, a GRS logic module, and a sign pre-determination module; wherein, the 3:2 CSA adder is connected to the output of the Wallace network; the CPA adder is connected to the output of the 3:2 CSA adder; the increment circuit is connected to the output of the CPA adder, the output of the mantissa compound right shifter, and the output of the sticky logic module; the GRS logic module is connected to the output of the mantissa compound right shifter and the output of the sticky logic module; and the sign pre-determination module is connected to the output of the GRS logic module, the output of the CPA adder module, and the output of the increment circuit.
[0009] The third operation module specifically includes a second inversion module, a leading zero detection module, a trailing zero detection module, a normalized compound left shift module, a normalization correction module, a rounding preprocessing module, a fast GRS solver module, and an exponent adjustment module. Specifically, the second inversion module is connected to the output of the GRS logic module, the output of the CPA adder module, and the output of the increment circuit; the leading zero detection module is connected to the output of the second inversion module; the trailing zero detection module is connected to the output of the GRS logic module, the output of the CPA adder module, and the output of the increment circuit; the normalized compound left shift module is connected to the output of the second inversion module, the output of the leading zero detection module, and the output of the exponent alignment module; the exponent adjustment module is connected to the output of the exponent alignment module; the normalization correction module is connected to the output of the normalized compound left shift module; the fast GRS solver module is connected to the outputs of the leading zero detection module and the trailing zero detection module; and the rounding preprocessing module is connected to the outputs of the normalization correction module and the fast GRS solver module.
[0010] The fourth operation module specifically includes a mantissa increment logic module, an exponent increment logic module, a sign judgment module, an exception judgment module, and a control logic output module. The mantissa increment logic module is connected to the output of the normalization correction module and the output of the rounding preprocessing module. The exponent increment logic module is connected to the output of the exponent adjustment module and the output of the rounding preprocessing module. The sign judgment module is connected to the output of the normalization correction module, the output of the rounding preprocessing module, and the output of the sign pre-judgment module. The exception judgment module is connected to the output of the exception pre-judgment module, the output of the exponent adjustment module, the output of the normalization correction module, and the output of the rounding preprocessing module. The control logic output module is connected to the outputs of the mantissa increment logic module, the exponent increment logic module, the sign judgment module, and the exception judgment module.
[0011] Furthermore, in the first arithmetic module:
[0012] The partial product generation module is used to obtain the input floating-point numbers and generate 27 partial products from them;
[0013] The Wallace network is used to calculate the first sum (Sum) and first carry (Carry) from the 27 partial products of the partial product generation module, as the first output result;
[0014] The first inversion module is used to invert the mantissa of the addend when a floating-point multiply-subtract is performed on the input floating-point numbers, i.e., when the product and the addend have opposite sign bits;
[0015] The exponent alignment module is used to obtain the shift value required for alignment shifting in the input floating-point number based on the exponent of the floating-point number that does not require shifting, and to generate the alignment signal;
[0016] The mantissa compound right shifter is used to shift the mantissa of the input floating-point number to the right according to the required shift value;
[0017] The sticky logic module is used to calculate the sticky bits of the input floating-point number;
[0018] The exception pre-judgment module is used to detect NaN and infinity values among the input floating-point numbers and to generate an exception pre-judgment signal.
[0019] Furthermore, the sticky logic module specifically calculates the sticky bit of the input floating-point number using the following formula:
[0020] $\mathrm{sticky} = \mathrm{rshiftnum} > (\mathrm{tzd}_{f_c} + 2 \times \mathrm{width} + 2)$
[0021] In the formula, $f_c$ represents the mantissa of the shifted floating-point number; rshiftnum represents the shift value; $\mathrm{tzd}_{f_c}$ denotes the trailing zero detection value of $f_c$; and width represents the bit width of $f_c$.
[0022] Furthermore, in the second arithmetic module:
[0023] A 3:2 CSA adder is used to store the carry value of the input floating-point number;
[0024] The CPA adder is used to add the first output result of the first arithmetic module to the carry value to obtain the second sum (Sum) and the second carry (Carry) as the second output result.
[0025] The increment (add-1) circuit consists of cascaded half-adders; according to the carry value, the input of each half-adder is selected as either the output of the previous half-adder or the constant 1, which determines whether the second sum (Sum) and the second carry (Carry) need to be incremented by 1;
[0026] The GRS logic module is used to calculate the GRS value of the second output result;
[0027] The sign pre-determination module is used to generate a sign pre-determination signal based on the GRS value and the result of the increment.
[0028] Furthermore, the GRS logic module calculates the GRS value of the second output result through the following steps:
[0029] When the CPA adder performs an addition, the GRS value of the second output result is taken as the GRS value of the shifted floating-point mantissa $f_c$;
[0030] When the CPA adder performs a subtraction, the GRS value of the second output result is taken as the two's complement of the GRS value of the shifted floating-point mantissa $f_c$;
[0031] Wherein, when the CPA adder performs a subtraction and the GRS value of the shifted floating-point mantissa $f_c$ is 0, the least significant bit of the Carry in the second output result is incremented by 1.
[0032] Furthermore, in the third operation module:
[0033] The second inversion module is used to perform a bitwise inversion operation on the second output result when the second output result of the second operation module is negative.
[0034] The leading zero detection module is used to detect leading zeros in the inverted second output result;
[0035] The trailing zero detection module is used to detect trailing zeros in the second output result;
[0036] The normalized compound left shift module is used to normalize and left shift the mantissa of the second output result to obtain the third output result;
[0037] The normalization correction module is used to correct negative results for the third output result;
[0038] The fast GRS solver module is used to calculate the G-bit, R-bit, and S-bit values of floating-point numbers;
[0039] The rounding preprocessing module is used to take the negative-result correction into account and to generate an increment enable signal based on the G-bit, R-bit, and S-bit values;
[0040] The exponent adjustment module generates an exponent adjustment signal based on the exponent alignment signal.
[0041] Furthermore, the normalized left shift is determined by the following steps:
[0042] When the temporary exponent is greater than the preset exponent value, the shift value is the temporary exponent minus 1; the temporary exponent is obtained from the input floating-point number;
[0043] When the temporary exponent is less than the preset exponent value, the shift value is the preset exponent value.
[0044] Furthermore, the negative result correction refers to adding 1 to the mantissa of the floating-point number when the second output result is negative.
[0045] Furthermore, the calculation of the G-bit, R-bit, and S-bit values of the floating-point number is specifically determined by the following formula:
[0046] $\mathrm{sticky} = (\mathrm{lzd}_{invf} + \mathrm{tzd}_{f}) < (2 \times \mathrm{width} + 3)$
[0047] In the formula, $\mathrm{lzd}_{invf}$ is the leading zero detection result, $\mathrm{tzd}_{f}$ is the trailing zero detection result, and width represents the bit width of the mantissa.
[0048] When the second output result is negative and S, as calculated by the above inequality, is 0: if R is 1 it is corrected to 0, and if R is 0 it is corrected to 1. If S is 1 or the result is positive, R does not need to be corrected.
[0049] When the second output result is negative and R is 0 after correction: if G is 1 it is corrected to 0, and if G is 0 it is corrected to 1. If R is 1 after correction or the result is positive, G does not need to be corrected.
[0050] When the result is negative and GRS is 0, the increment enable signal is output.
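One way to read the R/G correction rules above is as the carry chain of the "+1" that completes a two's complement negation after the bitwise inversion. The sketch below encodes that reading and checks it against plain integer arithmetic. The function names, the 8-bit word, and the position of the G and R bits are illustrative assumptions, not taken from the patent.

```python
def correct_grs(g, r, s, negative):
    """Correct G and R for a negative result; return (g, r, s, increment_enable).

    g and r are read from the bitwise-inverted result, s is the sticky bit.
    Assumption: G is only corrected when R was corrected and became 0.
    """
    if not negative or s == 1:
        return g, r, s, False          # no carry reaches R
    r ^= 1                             # the +1 carries into R
    if r == 0:                         # R overflowed, so the carry reaches G
        g ^= 1
    return g, r, s, (g, r) == (0, 0)   # carry leaves G: enable the increment


def matches_twos_complement(x, n=8, k=4):
    """Compare correct_grs with (~x + 1) on an n-bit word whose low k bits
    hold G (bit k-1), R (bit k-2) and the sticky region (bits k-3..0)."""
    mask = (1 << n) - 1
    low_mask = (1 << (k - 2)) - 1
    inv = ~x & mask
    g = (inv >> (k - 1)) & 1
    r = (inv >> (k - 2)) & 1
    s = int((x & low_mask) != 0)
    got = correct_grs(g, r, s, negative=True)
    mag = (inv + 1) & mask             # reference magnitude
    want = (
        (mag >> (k - 1)) & 1,
        (mag >> (k - 2)) & 1,
        int((mag & low_mask) != 0),
        (mag >> k) != (inv >> k),      # did the +1 reach the mantissa bits?
    )
    return got == want
```

Under these assumptions, the three rules reproduce the G, R, and S bits of the negated value, and the increment enable is asserted exactly when the carry has to propagate into the mantissa.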
[0051] Furthermore, in the fourth operation module:
[0052] The mantissa increment logic module is used to increment the mantissa of the third output result by 1 based on the increment enable signal and the normalization correction result;
[0053] The exponent increment logic module is used to increment the exponent of the third output result by 1 according to the increment enable signal and the exponent adjustment signal;
[0054] The sign determination module is used to adjust the sign of the third output result based on the increment enable signal, the normalization correction result, and the sign pre-determination signal;
[0055] The exception judgment module is used to generate an exception indication signal based on the increment enable signal, the exponent adjustment signal, the exponent alignment signal, and the exception pre-judgment signal; the exception indication signal covers the invalid operation, underflow, overflow, division by zero, and inexact exceptions;
[0056] The control logic output module is used to output the multiplication and addition result based on the logical operation results of the mantissa increment logic module, the exponent increment logic module, the sign judgment module, and the exception judgment module.
[0057] The embodiments of the present invention have the following beneficial effects:
[0058] 1. The floating-point multiply-accumulator of the present invention can perform half-precision, single-precision and double-precision floating-point multiply-accumulate operations, and is applicable to all floating-point number types specified in the IEEE-754 standard, including normalized and denormalized numbers, infinity and NaN.
[0059] 2. The floating-point multiply-accumulator of the present invention can realize vectorized floating-point multiply-accumulate operations and can execute four half-precision floating-point multiply-accumulate operations, or two single-precision floating-point multiply-accumulate operations, or one double-precision floating-point multiply-accumulate operation in parallel.
[0060] 3. The floating-point multiply-accumulator of the present invention has significant advantages in speed and area, achieving high area efficiency. When synthesized using Design Compiler in TSMC's 7nm process, the maximum path delay does not exceed 0.32ns, the maximum operating frequency reaches 3.125GHz, and the area is no larger than 3639.744nm².
[0061] Additional aspects and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description or may be learned by practice of the invention.
Brief Description of the Drawings
[0062] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0063] Figure 1 is a structural flowchart of a vector floating-point multiplier-accumulator applicable to floating-point operations of various precisions according to the present invention;
[0064] Figure 2 is a schematic diagram of the data structure of the partial product generation module in a vector floating-point multiplier-accumulator applicable to floating-point operations of various precisions according to the present invention;
[0065] Figure 3 is a schematic diagram of the shared partial products generated by the partial product generation module in a vector floating-point multiplier-accumulator applicable to floating-point operations of various precisions according to the present invention;
[0066] Figure 4 is a schematic diagram of the Wallace network structure in a vector floating-point multiplier-accumulator applicable to floating-point operations of various precisions according to the present invention;
[0067] Figure 5 is a schematic diagram of the mantissa compound right shifter in a vector floating-point multiplier-accumulator applicable to floating-point operations of various precisions according to the present invention;
[0068] Figure 6 is a schematic diagram of the second operation module in a vector floating-point multiplier-accumulator applicable to floating-point operations of various precisions according to the present invention;
[0069] Figure 7 is a schematic diagram of the vectorized implementation of the increment (add-1) circuit in a vector floating-point multiplier-accumulator applicable to floating-point operations of various precisions according to the present invention;
[0070] Figure 8 is a schematic diagram of how the fast GRS solver module solves for the sticky bit in a vector floating-point multiplier-accumulator applicable to floating-point operations of various precisions according to the present invention.
Detailed Description
[0071] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0072] In information technology fields such as image processing, signal transmission, and aerospace, where large data volumes, high data precision, and wide data ranges are required, the precision of fixed-point numbers cannot meet the demands. Therefore, floating-point numbers are needed to represent the relevant data, and calculations on this data rely on floating-point arithmetic units. Since floating-point numbers of different precisions have different bit widths and therefore different calculation errors, current technologies typically implement floating-point operations on different precisions by setting up multiple independent parallel arithmetic logic units. However, this results in an excessively large area occupied by the floating-point arithmetic unit in the hardware design.
[0073] Vector operations are the lowest-cost method to improve data throughput and parallelism, offering more efficient computing power and enabling multiple floating-point operations to be performed with a single instruction. However, the combination of vectorization and multiple independent parallel operations exacerbates the area problem of floating-point units, and there is currently no design solution that balances area and speed.
[0074] To address this issue, this invention proposes a vector floating-point multiplier-accumulator suitable for floating-point operations of various precisions, to achieve floating-point multiplication and addition operations of A*B+C.
[0075] The floating-point multiply-accumulator in this embodiment uses a four-stage modular pipeline to implement vector floating-point multiply-accumulate operations. It can perform four half-precision floating-point multiply-accumulate operations, two single-precision floating-point multiply-accumulate operations, or one double-precision floating-point multiply-accumulate operation in a single operation, and output the correct result and generate an error indication signal.
[0076] Referring to Figure 1, in this embodiment the floating-point multiply-accumulate unit has a four-stage pipeline consisting of a first operation module, a second operation module, a third operation module, and a fourth operation module. The first operation module performs mantissa multiplication, the second operation module performs addition, the third operation module performs normalization and exponent adjustment, and the fourth operation module performs increment operations and exception handling.
[0077] First operation module: referring to Figure 1, the first operation module specifically includes a partial product generation module, a Wallace network, a first inversion module, an exponent alignment module, a mantissa compound right shifter, a sticky logic module, and an exception pre-judgment module; wherein, the Wallace network is connected to the output of the partial product generation module; the mantissa compound right shifter is connected to the output of the first inversion module; the mantissa compound right shifter is connected to the output of the exponent alignment module; and the sticky logic module is connected to the output of the mantissa compound right shifter.
[0078] In the first operation module:
[0079] The partial product generation module is used to obtain the input floating-point numbers and generate 27 partial products from them. In this embodiment, the partial product generation module uses radix-4 Booth multiplication. The multiplicand and multiplier of the input floating-point numbers are first fed to the Booth partial product generation module to generate 27 partial products. The data structure of the multiplicand is shown in Figure 2; the Booth partial products generated from this data structure can be used for calculations at different precisions.
[0080] As shown in Figure 2, $f_d$, $f_s$, and $f_h$ represent the mantissas of double-precision, single-precision, and half-precision data, respectively. Taking single-precision and half-precision data as examples, this data structure makes it possible to obtain the partial products of $f_{h2}$ and $f_{h3}$ at the same time as the partial products of $f_{s1}$ are calculated. Furthermore, the high-order bits of each partial product need to be padded with a sequence related to S. High-order alignment allows partial products of different precisions to share these padding sequences, thereby simplifying the control logic and reducing the area. Therefore, $f_{d0}$, $f_{s0}$, and $f_{h0}$ are aligned at the high-order end, and $f_{s1}$ and $f_{h2}$ are aligned in the same way.
[0081] In particular, in Figure 2, a certain number of "0"s are placed between different mantissas of the same precision so that, during partial product accumulation, the partial products corresponding to different mantissas do not interfere with each other in vector operations. Therefore, the data structure of Figure 2 allows Booth multiplication at all precisions to share one set of partial products, facilitating hardware implementation of vector multiplication.
[0082] As shown in Figure 3, this embodiment of the invention optimizes the partial product generation module using SIMD (Single Instruction Multiple Data, i.e., one instruction completes multiple floating-point operations). Based on the idea of hardware isolation, a mixed-precision vector mantissa multiplication structure is designed, allowing half-precision, single-precision, and double-precision vector operations to share these 27 partial products. Taking single-precision floating-point multiply-add as an example, it requires a 24-bit mantissa multiplication. The 13 partial products it generates, each 27 bits wide (including the high-order extension sequence {1, ~S}, where S is the sign bit of the partial product), can be reused by the two groups of 11-bit mantissa multiplications of half-precision floating-point multiply-add. Each group of 11-bit multiplications uses 6 of these partial products and requires a partial product width of 14 bits.
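The partial product counts quoted above (27, 13, and 6) are what radix-4 Booth recoding gives for unsigned multipliers of 53, 24, and 11 bits. The following is a minimal sketch of plain radix-4 Booth recoding for a single unsigned multiplier; it does not model the shared vector data structure of Figure 2 or the {1, ~S} sign extension, and the function names are hypothetical.

```python
def booth_radix4_digits(multiplier, n_bits):
    """Recode an unsigned n_bits-wide multiplier into radix-4 Booth digits.

    Each digit lies in {-2, -1, 0, 1, 2}. An unsigned n-bit multiplier
    needs n // 2 + 1 digits: 53 -> 27, 24 -> 13, 11 -> 6.
    """
    count = n_bits // 2 + 1
    ext = multiplier << 1                      # append the implicit b[-1] = 0
    digits = []
    for j in range(count):
        t = (ext >> (2 * j)) & 0b111           # bits b[2j+1], b[2j], b[2j-1]
        digits.append((t & 1) + ((t >> 1) & 1) - 2 * (t >> 2))
    return digits


def booth_partial_products(multiplicand, multiplier, n_bits):
    """Return the weighted partial products; their sum is the full product."""
    digits = booth_radix4_digits(multiplier, n_bits)
    return [(d * multiplicand) << (2 * j) for j, d in enumerate(digits)]


# Usage: a double-precision mantissa product (53-bit operands).
a = (1 << 52) | 12345
b = (1 << 53) - 1
pps = booth_partial_products(a, b, 53)
```

In hardware each digit selects 0, ±X, or ±2X of the multiplicand X; the Python integers here simply stand in for those signed rows.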
[0083] The Wallace network is used to calculate the first sum (Sum) and first carry (Carry) from the 27 partial products generated by the partial product generation module, as the first output result. As shown in Figure 4, in this embodiment the Wallace network consists of a three-stage 3:2 CSA (carry-save adder) array and a two-stage 4:2 CSA array. The 3:2 CSA reduces three partial products x, y, z to two partial products Sum and Carry, according to the following formulas:
[0084] sum=x⊕y⊕z
[0085] carry = {x&y|y&z|x&z, 0}
[0086] After each level of 3:2 CSA reduction, the resulting Sum and Carry are used as the x, y, and z for further reduction at the next level.
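The 3:2 CSA formulas above can be sketched on Python integers, where the concatenation {…, 0} becomes a left shift by one bit. The helper `reduce_3to2` is a hypothetical illustration of one reduction level; it assumes non-negative operands and lets leftover terms pass through unchanged.

```python
def csa32(x, y, z):
    """3:2 carry-save adder: returns (sum, carry) with sum + carry == x + y + z."""
    s = x ^ y ^ z
    c = ((x & y) | (y & z) | (x & z)) << 1     # majority bits, weight doubled
    return s, c


def reduce_3to2(terms):
    """One reduction level: every group of three terms becomes two."""
    full = len(terms) - len(terms) % 3
    out = []
    for i in range(0, full, 3):
        out.extend(csa32(*terms[i:i + 3]))
    out.extend(terms[full:])                   # leftovers go to the next level
    return out


# Usage: 27 stand-in partial products shrink to 18 while keeping their total.
level0 = list(range(1, 28))
level1 = reduce_3to2(level0)
```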
[0087] The calculation of the 4:2 CSA is similar, the difference being that the input is four partial products:
[0088] $\mathrm{sum}_i = x_i \oplus y_i \oplus z_i \oplus w_i \oplus c_i$
[0089] $c_{i+1} = \mathrm{Mux}(x_i \oplus y_i,\ z_i,\ x_i)$
[0090] $\mathrm{carry}_i = \mathrm{Mux}(x_i \oplus y_i \oplus z_i \oplus w_i,\ c_i,\ w_i)$
[0091] The 4:2 CSA calculation is expressed bit by bit because it needs the carry from the previous bit; the subscript $i$ denotes the calculation for the $i$-th bit. Mux selects its second argument if its first argument is 1, and its third argument otherwise.
[0092] In this embodiment, the CSA calculates and stores the carry and the sum separately, and each bit of the carry and sum is calculated independently without interference, so it is very fast. At the third stage, the number of partial products is a multiple of 4, so 4:2 CSAs are used to reduce the number of CSA stages. The delay of a 3:2 CSA is two XOR gates, while the delay of a 4:2 CSA is three XOR gates. For the same number of partial products to be accumulated, the two-stage 4:2 CSA array reduces the delay from 8 XOR gates to 6, thus shortening the critical path.
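The bitwise 4:2 CSA equations and the stage counts can be checked with a short sketch. It assumes non-negative integer operands, and that a 3:2 level turns every three terms into two and a 4:2 level turns every four into two; the function names are hypothetical.

```python
def mux(sel, a, b):
    """Return a when sel is 1, otherwise b (second item if the first is 1)."""
    return a if sel else b


def csa42(x, y, z, w):
    """Bitwise 4:2 compressor: returns (sum, carry) with sum + carry == x+y+z+w."""
    n = max(v.bit_length() for v in (x, y, z, w)) + 2   # room for the carries
    s_word = c_word = 0
    c_in = 0                                            # c_0 = 0
    for i in range(n):
        xi, yi, zi, wi = ((v >> i) & 1 for v in (x, y, z, w))
        x4 = xi ^ yi ^ zi ^ wi
        s_word |= (x4 ^ c_in) << i                      # sum_i
        c_word |= mux(x4, c_in, wi) << (i + 1)          # carry_i has weight 2^(i+1)
        c_in = mux(xi ^ yi, zi, xi)                     # c_(i+1)
    return s_word, c_word


def wallace_schedule(n=27, stages_32=3, stages_42=2):
    """Number of terms after each level of the 3:2 then 4:2 arrays."""
    counts = [n]
    for _ in range(stages_32):
        n -= n // 3
        counts.append(n)
    for _ in range(stages_42):
        n -= 2 * (n // 4)
        counts.append(n)
    return counts


def levels_3to2_only(n):
    """Levels needed to reach two terms using 3:2 CSAs alone."""
    levels = 0
    while n > 2:
        n -= n // 3
        levels += 1
    return levels
```

With these assumptions the schedule is 27, 18, 12, 8, 4, 2. Reducing the last 8 terms with 3:2 CSAs alone would take 4 levels (8 XOR delays), whereas two 4:2 levels take 6, which matches the figures quoted in the text.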
[0093] The first inversion module is used when a floating-point multiply-subtract is performed on the inputs A×B and C, i.e., when A×B+C is computed and the sign bits of A×B and C are opposite; in that case it performs a bitwise inversion of the mantissa $f_c$ of floating-point number C.
[0094] The exponent alignment module is used to calculate the shift value required for alignment in the input floating-point number based on the exponent of the floating-point number that does not require shifting, and to generate the alignment signal. As shown in Figure 1, in this embodiment it is the mantissa of floating-point number C that is shifted, and the exponent alignment module obtains the shift value required for the alignment shift. Assume that the exponents of floating-point numbers A, B, and C are $e_a$, $e_b$, and $e_c$, respectively, and that the exponent obtained by multiplying A and B is $e_{ab}$. When $e_c - e_{ab} \geq 56$, the mantissa $f_c$ does not need to be shifted, and the exponent after alignment is $e_c$. When $e_c - e_{ab} < 56$, $f_c$ is shifted to the right; taking the exponent bias into account, the shift length is $r = e_a + e_b - e_c - \mathrm{bias} + \mathrm{width} + 3$, the temporary exponent is $e_c$ plus $r$, and width is the mantissa width.
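The alignment rule can be written as a small function. This sketch assumes that the product exponent is the biased value e_ab = e_a + e_b − bias and that the threshold 56 is width + 3 for double precision (width = 53); both are inferences from the formula for r, not statements in the text. Any upper limit on r for a very small C is not modelled.

```python
def align_shift(e_a, e_b, e_c, bias, width):
    """Return (right-shift amount for f_c, temporary exponent).

    Assumptions: e_ab = e_a + e_b - bias, and the no-shift threshold is
    width + 3 (56 for double precision, where width = 53).
    """
    e_ab = e_a + e_b - bias
    if e_c - e_ab >= width + 3:
        return 0, e_c                       # f_c stays in place
    r = e_a + e_b - e_c - bias + width + 3  # shift length from the text
    return r, e_c + r                       # temporary exponent = e_c + r


# Usage: double precision, A, B and C all with biased exponent 1023.
shift, temp_exp = align_shift(1023, 1023, 1023, bias=1023, width=53)
```

Under these assumptions the temporary exponent of a shifted operand is always e_ab + width + 3, independent of e_c, so the later normalization stage works from a fixed reference point.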
[0095] The mantissa compound right shifter is used to shift the mantissa of the input floating-point number to the right according to the required shift value. As shown in Figure 5, the vector right shifter is based on a stage-by-stage shifter: the shift operation of each stage is controlled by one bit of the shift amount, and the specific control signal is selected according to the precision of the current operation. Take the first stage, which shifts s0, as an example. For a 16-bit vector shift, the shift of s0[31:16] is controlled by r1[0]; for a 32-bit shift, it is controlled by r0[0]. In both cases "0" is shifted in during the right shift. The shift logic of s0[15:0] is similar, and it uses the r0 control signal for both single-precision and half-precision operations; however, the bit shifted into it is not necessarily "0". For a 16-bit operation, the shifts of s0[31:16] and s0[15:0] are independent of each other, so "0" is still shifted in. For a 32-bit operation, s0[31:0] is treated as a whole, so when s0[31:16] is shifted to the right, its low-order bit is shifted into s0[15]. The second to fourth stages are similar, the difference being that they shift right by 2, 4, and 8 bits respectively. In this embodiment, the 32-bit vector right shifter can perform the right shift of one 32-bit number or of two 16-bit numbers.
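The staged vector shifter can be modelled in a few lines. The sketch assumes four stages (shifts of 1, 2, 4, and 8 bits, as listed above), unsigned 32-bit input, and that r0 and r1 are the shift amounts for the low half (or the whole word) and the high half respectively; `vec_rshift32` and `half_mode` are hypothetical names.

```python
def vec_rshift32(s0, r0, r1, half_mode):
    """Stage-by-stage right shift of one 32-bit word or two 16-bit lanes.

    half_mode=True : lanes s0[31:16] and s0[15:0] shift independently,
                     controlled by r1 and r0; zeros enter each lane.
    half_mode=False: the whole word shifts under r0, so bits leaving
                     s0[31:16] enter s0[15:0].
    """
    for stage in range(4):                      # shift by 1, 2, 4, 8
        step = 1 << stage
        if half_mode:
            hi = (s0 >> 16) >> (step if (r1 >> stage) & 1 else 0)
            lo = (s0 & 0xFFFF) >> (step if (r0 >> stage) & 1 else 0)
            s0 = (hi << 16) | lo
        elif (r0 >> stage) & 1:
            s0 >>= step
    return s0


# Usage: two half-precision lanes shifted by 8 and 4, then one 32-bit shift by 5.
lanes = vec_rshift32(0xABCD1234, r0=4, r1=8, half_mode=True)
whole = vec_rshift32(0xABCD1234, r0=5, r1=0, half_mode=False)
```

The only difference between the two modes is what enters bit 15 and which control word drives the upper lane, which is why one shifter can serve both precisions.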
[0096] The sticky logic module is used to calculate the sticky bits of the input floating-point number. In this embodiment, the sticky calculation uses fast sticky logic based on trailing zero detection. A Trailing Zero Detector (TZD) scans a binary number bit by bit from the least significant bit towards the most significant bit to find the position of the first "1"; the zeros below this "1" are the trailing zeros. Its operation is essentially selection logic. Suppose TZD is performed on the binary number X: when X[0] = 1, the TZD result is 0; when X[0] = 0 and X[1] = 1, the TZD result is 1; and so on.
[0097] In this embodiment, the sticky logic module specifically calculates the sticky bit of the input floating-point number using the following formula:
[0098] $\mathrm{sticky} = \mathrm{rshiftnum} > (\mathrm{tzd}_{f_c} + 2 \times \mathrm{width} + 2)$
[0099] In the formula, $f_c$ represents the mantissa of the shifted floating-point number; rshiftnum represents the shift value, obtained through the exponent alignment module described above; $\mathrm{tzd}_{f_c}$ denotes the trailing zero detection value of $f_c$; and width represents the bit width of $f_c$.
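The idea behind the formula is that a sticky bit does not require OR-ing every shifted-out bit: it is enough to compare the shift amount with the number of trailing zeros. The sketch below illustrates this under a simplifying assumption: the mantissa is placed with `window` zero bits below it before shifting, where `window` is a hypothetical parameter standing in for the 2×width+2 offset of the formula.

```python
def tzd(x):
    """Trailing zero count of a nonzero integer (index of its lowest 1)."""
    return (x & -x).bit_length() - 1


def sticky_fast(f_c, rshiftnum, window):
    """Sticky bit from a single comparison, as in the formula above."""
    if f_c == 0:
        return False                      # nothing can be shifted out
    return rshiftnum > tzd(f_c) + window


def sticky_reference(f_c, rshiftnum, window):
    """Sticky bit by directly OR-ing the bits shifted out of the register."""
    placed = f_c << window                # mantissa above `window` zero bits
    return (placed & ((1 << rshiftnum) - 1)) != 0
```

Both functions agree for every mantissa and shift amount, which is why the comparison can replace a wide OR tree on the critical path.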
[0100] The exception pre-judgment module is used to detect NaN and infinity values among the input floating-point numbers and to generate an exception pre-judgment signal. In this embodiment, the exception pre-judgment module mainly handles NaN and infinity. For multiplicand A and multiplier B: when one of them is infinity and the other is neither 0 nor NaN, the result of A*B is infinity; if either A or B is NaN, the result of A*B is treated as qNaN, and if either A or B is sNaN, the invalid operation exception flag is raised; if infinity is multiplied by 0, the result of A*B is qNaN and the invalid operation exception flag is raised. A*B is then treated as a single operand, denoted AB. If AB or C is NaN, the result is qNaN; if C is sNaN, the invalid operation exception flag is raised; when both AB and C are infinity and an effective subtraction is performed, the result is NaN and the invalid operation exception flag is raised.
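The NaN and infinity rules above can be sketched on raw bit patterns. The example uses the half-precision layout (5 exponent bits, 10 fraction bits) and the IEEE 754 convention that a set top fraction bit marks a quiet NaN. The names `Fmt`, `classify`, and `prejudge`, and the shape of the return value, are illustrative assumptions rather than the patent's signals.

```python
from typing import NamedTuple


class Fmt(NamedTuple):
    exp_bits: int
    man_bits: int


HALF = Fmt(5, 10)


def classify(bits, fmt):
    """Return (sign, kind), kind in {'zero', 'finite', 'inf', 'qnan', 'snan'}."""
    sign = bits >> (fmt.exp_bits + fmt.man_bits)
    exp = (bits >> fmt.man_bits) & ((1 << fmt.exp_bits) - 1)
    man = bits & ((1 << fmt.man_bits) - 1)
    if exp == (1 << fmt.exp_bits) - 1:
        if man == 0:
            return sign, "inf"
        return sign, ("qnan" if man >> (fmt.man_bits - 1) else "snan")
    return sign, ("zero" if exp == 0 and man == 0 else "finite")


def prejudge(a, b, c, fmt=HALF):
    """Return (special_result, invalid) for A*B+C.

    special_result is 'qnan', 'inf', or None when no special case applies.
    """
    (sa, ka), (sb, kb), (sc, kc) = (classify(v, fmt) for v in (a, b, c))
    kinds = (ka, kb, kc)
    if {ka, kb} == {"inf", "zero"}:
        return "qnan", True                      # infinity times zero
    if any(k in ("qnan", "snan") for k in kinds):
        return "qnan", "snan" in kinds           # only sNaN raises invalid
    ab_inf = "inf" in (ka, kb)
    if ab_inf and kc == "inf" and (sa ^ sb) != sc:
        return "qnan", True                      # inf - inf
    if ab_inf or kc == "inf":
        return "inf", False
    return None, False
```

For instance, with half-precision patterns 0x7C00 (+inf), 0x0000 (+0), and 0x3C00 (1.0), `prejudge(0x7C00, 0x0000, 0x3C00)` reports a qNaN result with the invalid flag set.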
[0101] Second operation module: Referring to Figure 1, the second operation module specifically includes a 3:2 CSA adder, a CPA adder, an increment (add-1) circuit, a GRS logic module, and a sign pre-determination module. The 3:2 CSA adder is connected to the output of the Wallace network; the CPA adder is connected to the output of the 3:2 CSA adder; the increment circuit is connected to the output of the CPA adder, the output of the mantissa compound right shifter, and the output of the sticky logic module; the GRS logic module is connected to the output of the mantissa compound right shifter and the output of the sticky logic module; and the sign pre-determination module is connected to the output of the GRS logic module, the output of the CPA adder, and the output of the increment circuit. After the second operation module obtains the two addends output by the CSA array, it merges the alignment-shifted addend with them through the CSA, reducing three addends to two, and feeds the result into the carry-propagate adder to complete the addition. To reduce the resource consumption of a wide adder, this invention adopts an improved wide adder: the high 56 bits of the alignment-shifted addend are handled by the add-1 circuit, the low GRS bits are handled by the GRS logic, and the remaining low bits are summed by a 107-bit carry-propagate adder. After the result is obtained, whether to invert it is determined by the sign of the result.
[0102] In the second operation module:
[0103] The 3:2 CSA adder is used to save the carry value of the input floating-point number; the CPA adder adds the first output result of the first operation module to the carry value, yielding a second sum (Sum) and a second carry (Carry) as the second output result. As shown in Figure 6, in this embodiment, adding the aligned addend to the Sum and Carry output by the Wallace network of the first operation module requires a 162-bit addition plus a sign bit. Since the high 56 bits of the addend do not overlap the Sum and Carry output by the multiplier, the result of the high 56 bits depends only on the carry out of the low 107-bit addition. Therefore, the second operation module uses a 107-bit-wide adder, consisting of a single-stage carry-save adder and a single-stage two-input carry-propagate adder (CPA), and preserves its carry-out signal. The carry out of the 107th bit then determines whether the high 56 bits need to be incremented by 1. The add-1 circuit is implemented with cascaded half-adders to save area and runs in parallel with the 107-bit addition, which removes the 56-bit addition from the critical path and greatly shortens the timing path.
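The split between a 107-bit adder and a 56-bit incrementer can be modeled with ordinary integers. The sketch below is a behavioral illustration only: it ignores the sign bit and the GRS bits, and it assumes that the multiplier's Sum and Carry together fit in the low 107 bits, so that the carry out of the low part is at most 1. The names `split_add`, `LOW_BITS` and `HIGH_BITS` are hypothetical.

```python
LOW_BITS = 107   # width of the carry-propagate adder
HIGH_BITS = 56   # width handled by the add-1 circuit


def split_add(aligned_c: int, mul_sum: int, mul_carry: int) -> int:
    """Compute aligned_c + mul_sum + mul_carry without a full-width adder."""
    low_mask = (1 << LOW_BITS) - 1
    assert 0 <= aligned_c < (1 << (LOW_BITS + HIGH_BITS))
    assert mul_sum >= 0 and mul_carry >= 0
    assert mul_sum + mul_carry <= low_mask  # assumption stated in the lead-in

    c_high = aligned_c >> LOW_BITS
    c_low = aligned_c & low_mask

    # 3:2 carry-save step: three addends become two without carry propagation.
    s = c_low ^ mul_sum ^ mul_carry
    cy = ((c_low & mul_sum) | (c_low & mul_carry) | (mul_sum & mul_carry)) << 1

    # Carry-propagate addition of the two remaining addends.
    low_total = s + cy
    carry_out = low_total >> LOW_BITS       # 0 or 1 under the assumption above

    # The incremented high part is prepared in parallel and then selected.
    high_plus_one = c_high + 1
    high = high_plus_one if carry_out else c_high
    return (high << LOW_BITS) | (low_total & low_mask)
```

Because `high_plus_one` does not depend on `low_total`, only the final selection waits for the carry out, which mirrors the timing argument in the paragraph above.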
[0104] The add-1 circuit consists of cascaded half-adders; it selects, according to the carry value, whether the input of a half-adder is the output of the previous half-adder or the constant 1, and thereby determines whether the bit values of the second sum (Sum) and the second carry (Carry) need to be incremented by 1. As shown in Figure 7, this invention applies SIMD optimization to the 56-bit add-1 circuit based on the idea of hardware isolation: the circuit consists of cascaded half-adders, and the input of a half-adder is selected from the previous half-adder or the constant 1 according to the format signal indicating the current operation precision.
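A bit-level model of such a lane-isolated incrementer might look like the following. The lane widths are not given in this passage, so the equal-width lanes used here (for example four 14-bit lanes) are purely hypothetical, as are the function name `simd_increment` and the per-lane `carries_in` list standing in for the carry and format signals.

```python
def simd_increment(x: int, carries_in: list, lane_width: int, total: int = 56) -> int:
    """Half-adder chain with lane isolation.

    At the first bit of every lane the chain is cut: the half-adder takes
    that lane's own increment request instead of the carry from the
    previous half-adder, so carries never cross a lane boundary.
    """
    result = 0
    carry = 0
    for i in range(total):
        if i % lane_width == 0:
            carry = carries_in[i // lane_width]  # isolation point
        bit = (x >> i) & 1
        result |= (bit ^ carry) << i             # half-adder sum
        carry = bit & carry                      # half-adder carry
    return result
```

With `lane_width=56` the same chain behaves as a single 56-bit incrementer, which is the sense in which the hardware is shared between precisions.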
[0105] The GRS logic module is used to calculate the GRS value of the second output result. As shown in Figure 1, the GRS calculation is implemented by the GRS logic. When the CPA adder performs addition, the GRS of the result is the GRS value of f_c; when the CPA adder performs subtraction, the GRS of the result is the two's complement of the GRS of f_c. In addition, when subtraction is performed and the GRS of f_c is 0, a carry is generated and placed in the least significant bit of the Carry output by the Wallace network; that is, the least significant bit of Carry is filled with 1.
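The behavior described above can be illustrated with a 3-bit model. The function name `grs_logic` and the packing of G, R and S into one 3-bit integer (G as the most significant bit) are assumptions made for the example.

```python
def grs_logic(grs_fc: int, subtract: bool):
    """Return (grs_result, carry_into_lsb) for a 3-bit GRS value of f_c.

    Addition passes GRS through unchanged. Subtraction takes the two's
    complement of the 3-bit value; only when GRS is 0 does the complement
    overflow, producing the carry that is placed in the LSB of Carry.
    """
    if not subtract:
        return grs_fc & 0b111, 0
    negated = ((~grs_fc & 0b111) + 1)   # invert and add 1, kept to 4 bits
    return negated & 0b111, negated >> 3
```

For instance, `grs_logic(0b101, True)` gives `(0b011, 0)`, while `grs_logic(0b000, True)` gives `(0b000, 1)`.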
[0106] The sign pre-determination module is used to generate a sign pre-determination signal based on the GRS value and the result of adding 1.
[0107] Third operation module: Referring to Figure 1, the third operation module specifically includes a second inversion module, a leading zero detection module, a trailing zero detection module, a normalized compound left shift module, a normalization correction module, a rounding preprocessing module, a fast GRS solver module, and an exponent adjustment module. Specifically, the second inversion module is connected to the output of the GRS logic module, the output of the CPA adder, and the output of the increment circuit; the leading zero detection module is connected to the output of the second inversion module; the trailing zero detection module is connected to the output of the GRS logic module, the output of the CPA adder, and the output of the increment circuit; the normalized compound left shift module is connected to the output of the second inversion module, the output of the leading zero detection module, and the output of the exponent alignment module; the exponent adjustment module is connected to the output of the exponent alignment module; the normalization correction module is connected to the output of the normalized compound left shift module; the fast GRS solver module is connected to the outputs of the leading zero detection module and the trailing zero detection module; and the rounding preprocessing module is connected to the outputs of the normalization correction module and the fast GRS solver module.
[0108] In the third operation module:
[0109] The second inversion module is used to invert the second output result bit by bit when the second output result of the second operation module is negative.
[0110] The leading zero detection module is used to detect leading zeros in the inverted second output result. The principle of leading zero detection (LZD) is similar to that of trailing zero detection, the difference being the detection direction. In traditional designs, the leading zero detection module should be placed before the adder. However, since the mantissa of the first output result has a width of 161 bits, in order to shorten the timing path, this embodiment performs leading zero detection after the adder.
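For comparison with trailing zero detection, leading zero detection can be sketched as the same scan in the opposite direction. The name `lzd` and the convention of returning `width` for an all-zero input are assumptions for illustration.

```python
def lzd(x: int, width: int) -> int:
    """Leading zero detection on a width-bit value.

    Scans from the most significant bit toward the least significant bit
    and returns how many zeros precede the first 1 (width if x == 0).
    """
    for i in range(width):
        if (x >> (width - 1 - i)) & 1:
            return i
    return width
```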
[0111] The trailing zero detection module is used to detect trailing zeros in the second output result.
[0112] The normalized compound left shift module is used to normalize and left shift the mantissa of the second output result to obtain the third output result.
[0113] In this embodiment, the normalized compound left shift module uses the leading zero detection result z to perform a normalizing left shift on the mantissa and adjust the exponent: for each bit the mantissa is shifted left, the exponent is decremented by 1. When the temporary exponent e_tmp ≤ z, the left shift amount is e_tmp - 1; otherwise it is z. Here, the temporary exponent is the one calculated by the exponent alignment module in the first operation module.
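The choice of shift amount can be sketched as follows. The example assumes a biased temporary exponent of at least 1 and a mantissa that fits in `width` bits; the function name `normalize_left` is hypothetical. When e_tmp ≤ z, the shift stops early so that the exponent does not drop below 1, and the leading bit stays 0, which corresponds to a denormalized result.

```python
def normalize_left(mantissa: int, e_tmp: int, width: int):
    """Return (shifted_mantissa, adjusted_exponent, shift_amount)."""
    assert e_tmp >= 1 and 0 <= mantissa < (1 << width)
    z = width - mantissa.bit_length()          # leading zero count
    shift = e_tmp - 1 if e_tmp <= z else z     # limited by the exponent
    return mantissa << shift, e_tmp - shift, shift
```

For example, with an 8-bit mantissa `0b00010000` (z = 3), an exponent of 10 allows the full shift of 3, while an exponent of 3 limits the shift to 2.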
[0114] The normalization correction module is used to correct negative results for the third output.
[0115] As shown in Figure 1, since the first and second operation modules only invert negative numbers without adding 1, when the resulting negative number consists of consecutive "1"s followed by consecutive "0"s, the normalized mantissa needs to be shifted right by one bit as a normalization correction. When the second output result is negative, only bitwise inversion is performed, and the add-1 operation cannot be merged into the CSA unit of the second operation module. It therefore has to be applied later in the data path, namely as the negative result correction performed by the normalization correction module in the third operation module. The effective add-1 caused by the negative result correction has the same effect as the mantissa increment during rounding, and the two never occur simultaneously, so they are handled together to reduce the number of add-1 circuits.
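Why the "1...10...0" pattern needs a one-bit correction can be seen by comparing the leading zero count of the inverted value with that of the true magnitude (inverted value plus 1). The following sketch is an illustration on small two's complement values; the function names are hypothetical.

```python
def lzd(x: int, width: int) -> int:
    """Leading zero count of a width-bit value."""
    return width - x.bit_length()


def lzd_off_by_one(neg: int, width: int) -> bool:
    """True if LZD on the inverted value overshoots the true magnitude by one."""
    mask = (1 << width) - 1
    inverted = ~neg & mask        # inversion only, as in the datapath
    magnitude = inverted + 1      # true two's complement magnitude
    return lzd(inverted, width) != lzd(magnitude, width)


def is_ones_then_zeros(neg: int, width: int) -> bool:
    """True if neg has the form 1...10...0 (inverted value is 0...01...1)."""
    mask = (1 << width) - 1
    inverted = ~neg & mask
    return (inverted & (inverted + 1)) == 0
```

For `0b11110000` the inverted value `0b00001111` has four leading zeros, but the magnitude `0b00010000` has only three, so a left shift by four overshoots by one position.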
[0116] The fast GRS solver module is used to calculate the G-bit, R-bit, and S-bit values of floating-point numbers. In this embodiment, assume that the trailing zero detection result is tzd_f and the leading zero detection result is lzd_invf. Then, when the following formula is satisfied, the actual value of the sticky bit can be considered to be 1:
[0117] sticky = (lzd_invf + tzd_f) < (2 × width + 3)
[0118] In the formula, lzd_invf is the leading zero detection result, tzd_f is the trailing zero detection result, and width represents the bit width of the mantissa.
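The idea behind this comparison is that the leading and trailing zero counts together bound where the lowest "1" lands after normalization. The sketch below checks that idea in a generic form on a single non-negative value: `total_width` and `kept` are hypothetical parameters, and the threshold `total_width - kept` plays the role that 2 × width + 3 plays for the datapath of the embodiment (that correspondence is an assumption, not stated in the text).

```python
def sticky_fast(value: int, total_width: int, kept: int) -> bool:
    """Sticky bit from zero counts only, without waiting for the shifter."""
    if value == 0:
        return False
    lz = total_width - value.bit_length()
    tz = (value & -value).bit_length() - 1
    return lz + tz < total_width - kept


def sticky_reference(value: int, total_width: int, kept: int) -> bool:
    """Brute force: normalize, then OR every bit below the kept field."""
    if value == 0:
        return False
    lz = total_width - value.bit_length()
    normalized = value << lz
    dropped = normalized & ((1 << (total_width - kept)) - 1)
    return dropped != 0
```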
[0119] If the second output result is negative and S calculated by the above inequality is 0, then R is corrected: if R is 1 it becomes 0, and if R is 0 it becomes 1. If S calculated by the above formula is 1, or the result is positive, R does not need to be corrected.
[0120] If the second output result is negative and R is 0 after correction, then G is corrected: if G is 1 it becomes 0, and if G is 0 it becomes 1. If R is 1 after correction, or the result is positive, G does not need to be corrected.
[0121] When the result is negative and GRS is 0, the add-1 enable signal is output; otherwise, rounding proceeds as usual.
[0122] Figure 8 shows the logic for quickly finding the sticky bit in order to correct negative results. After the sticky bit is found, if it is 0, a 1 must be added into the round bit; if the round bit is 1 at this time, its actual value becomes 0. The same logic applies to the guard bit.
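One way to read the G, R and S correction is as a short carry chain that starts at the sticky bit. The sketch below follows that reading, in which "R is 0 after correction" means that R was actually flipped from 1 to 0; this interpretation and the function name `correct_grs` are assumptions, not statements of the specification.

```python
def correct_grs(g: int, r: int, s: int, negative: bool):
    """Return (g, r, s, increment_enable) after negative-result correction.

    For a negative result with S == 0, the deferred add-1 enters at R and
    ripples upward while each corrected bit wraps from 1 to 0.
    """
    carry = negative and s == 0
    if carry:
        r ^= 1
        carry = r == 0           # R wrapped, so the carry continues into G
    if carry:
        g ^= 1
        carry = g == 0           # G wrapped, so the mantissa must be incremented
    return g, r, s, carry
```

Under this reading, a negative result whose corrected GRS is 000 is exactly the case in which the carry leaves the G bit, matching the add-1 enable condition stated above.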
[0123] In this embodiment, since TZD, LZD, and the normalizing left shift are performed in parallel, the GRS solution does not introduce additional delay in the data path, so the actual value of the S bit is obtained faster.
[0124] The rounding preprocessing module is used to apply the negative result correction and to generate an increment enable signal based on the G, R, and S bit values. In this embodiment, the GRS value is combined with the rounding preprocessing. When the result of an effective subtraction is negative and the final actual value of GRS is 0, the rounding preprocessing outputs an increment enable signal, and 1 is added to the effective mantissa.
[0125] The exponent adjustment module generates an exponent adjustment signal based on the exponent alignment signal.
[0126] Fourth operation module: Referring to Figure 1, the fourth operation module specifically includes a mantissa increment logic module, an exponent increment logic module, a sign judgment module, an exception judgment module, and a control logic output module. The mantissa increment logic module is connected to the output of the normalization correction module and the rounding preprocessing module. The exponent increment logic module is connected to the output of the exponent adjustment module and the rounding preprocessing module. The sign judgment module is connected to the output of the normalization correction module, the rounding preprocessing module, and the sign pre-determination module. The exception judgment module is connected to the output of the exception pre-judgment module, the exponent adjustment module, the normalization correction module, and the rounding preprocessing module. The control logic output module is connected to the outputs of the mantissa increment logic module, the exponent increment logic module, the sign judgment module, and the exception judgment module.
[0127] In the fourth operation module:
[0128] The mantissa increment logic module is used to increment the mantissa of the third output result by 1 based on the increment enable signal and the normalization correction result;
[0129] The exponent increment logic module is used to increment the exponent of the third output result by 1 according to the increment enable signal and the exponent adjustment signal;
[0130] The sign judgment module is used to adjust the sign of the third output result based on the increment enable signal, the normalization correction result, and the sign pre-determination signal;
[0131] The exception judgment module is used to generate exception indication signals based on the increment enable signal, the exponent adjustment signal, the exponent alignment signal, and the exception pre-judgment signal. The exception indication signals include the invalid operation, underflow, overflow, division-by-zero, and inexact exceptions.
[0132] The control logic output module is used to output the multiplication and addition result based on the logical operation results of the mantissa increment logic module, the exponent increment logic module, the sign judgment module, and the exception judgment module.
[0133] In this embodiment, the fourth operation module performs exception judgment and generates the exception indication signals: invalid operation, underflow, overflow, division by zero (which does not occur in multiply-add operations), and inexact. The output control logic selects an output value conforming to the IEEE-754 standard based on the sign, exponent, mantissa, and exception signals, which may be a normalized number, a denormalized number, infinity, or NaN, and outputs a 5-bit exception indication signal.
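A 5-bit exception indication signal could be modeled as below. The bit assignment is not given in the specification; the order used here (invalid in the top bit, inexact in the bottom bit) is borrowed from the RISC-V `fflags` layout purely for illustration, and the names `FpFlag` and `exception_flags` are hypothetical.

```python
from enum import IntFlag


class FpFlag(IntFlag):
    INEXACT = 1 << 0
    UNDERFLOW = 1 << 1
    OVERFLOW = 1 << 2
    DIV_BY_ZERO = 1 << 3   # never raised by a multiply-add operation
    INVALID = 1 << 4


def exception_flags(invalid: bool, overflow: bool,
                    underflow: bool, inexact: bool) -> FpFlag:
    """Pack the individual conditions into one 5-bit flag word."""
    if invalid:
        return FpFlag.INVALID              # result is qNaN, nothing else applies
    flags = FpFlag(0)
    if overflow:
        # Under IEEE-754 default handling, overflow also signals inexact.
        flags |= FpFlag.OVERFLOW | FpFlag.INEXACT
    if underflow:
        flags |= FpFlag.UNDERFLOW
    if inexact:
        flags |= FpFlag.INEXACT
    return flags
```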
[0134] Compared with the prior art, the advantages of this invention are:
[0135] 1. The floating-point multiply-accumulator of the present invention can perform half-precision, single-precision and double-precision floating-point multiply-accumulator operations, and supports all floating-point number types specified by the IEEE-754 standard, including normalized numbers, denormalized numbers, infinity and NaN.
[0136] 2. The floating-point multiply-accumulator of the present invention can realize vectorized floating-point multiply-accumulator operations and can execute four half-precision floating-point multiply-accumulator operations, or two single-precision floating-point multiply-accumulator operations, or one double-precision floating-point multiply-accumulator operation in parallel.
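The lane counts above are consistent with operands sharing one 64-bit word. The following sketch illustrates that packing in software; the 64-bit word, the little-endian lane order, and the helper names are assumptions made for the example and are not taken from the specification.

```python
import struct

# Lanes per 64-bit word: four half, two single, or one double precision value.
LANE_FORMATS = {"half": "<4e", "single": "<2f", "double": "<1d"}


def pack_lanes(precision: str, values) -> bytes:
    """Pack the lane values for one precision into an 8-byte word."""
    return struct.pack(LANE_FORMATS[precision], *values)


def unpack_lanes(precision: str, word: bytes) -> tuple:
    """Split an 8-byte word back into its lanes."""
    return struct.unpack(LANE_FORMATS[precision], word)
```

Each of the three formats yields exactly eight bytes, for example `pack_lanes("half", [1.5, -2.0, 0.25, 3.0])`.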
[0137] 3. The floating-point multiply-accumulator of the present invention has significant advantages in speed and area, achieving high area efficiency. When synthesized using Design Compiler in TSMC's 7 nm process, the maximum path delay is 0.32 ns, the maximum operating frequency is 3.125 GHz, and the area is 3639.744 nm².
[0138] In some alternative embodiments, the functions / operations mentioned in the block diagrams may not occur in the order shown in the operation diagrams. For example, depending on the functions / operations involved, two consecutively shown blocks may actually be executed substantially simultaneously, or the blocks may sometimes be executed in reverse order. Furthermore, the embodiments presented and described in the flowcharts of this invention are provided by way of example to provide a more comprehensive understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and sub-operations described as part of a larger operation are executed independently.
[0139] Furthermore, although the invention has been described in the context of functional modules, it should be understood that, unless otherwise stated, one or more of the functions and / or features may be integrated into a single physical device and / or software module, or one or more functions and / or features may be implemented in a separate physical device or software module. It is also understood that a detailed discussion of the actual implementation of each module is unnecessary for understanding the invention. Rather, given the properties, functions, and internal relationships of the various functional modules in the apparatus disclosed herein, the actual implementation of the module will be understood within the scope of conventional skill of an engineer. Therefore, those skilled in the art can implement the invention as set forth in the claims using ordinary techniques without excessive experimentation. It is also understood that the specific concepts disclosed are merely illustrative and not intended to limit the scope of the invention, which is determined by the full scope of the appended claims and their equivalents.
[0140] In the description of this specification, references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.
[0141] Although embodiments of the invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
[0142] The above is a detailed description of the preferred embodiments of the present invention. However, the present invention is not limited to the embodiments. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention. All such equivalent modifications or substitutions are included within the scope defined by the claims of this application.
Claims
1. A vector floating-point multiplier-accumulator suitable for floating-point operations of various precisions, characterized in that, It includes a first operation module, a second operation module, a third operation module, and a fourth operation module; The first operation module specifically includes a partial product generation module, a Wallace network, a first inversion module, an exponent alignment module, a mantissa compound right shifter, a sticky logic module, and an exception pre-judgment module; wherein, the Wallace network is connected to the output of the partial product generation module; the mantissa compound right shifter is connected to the output of the first inversion module; the mantissa compound right shifter is connected to the output of the exponent alignment module; and the sticky logic module is connected to the output of the mantissa compound right shifter. The second arithmetic module specifically includes a 3:2 CSA adder, a CPA adder, an increment circuit, a GRS logic module, and a sign pre-determination module; wherein, the 3:2 CSA adder is connected to the output of the Wallace network; the CPA adder is connected to the output of the 3:2 CSA adder; the increment circuit is connected to the output of the CPA adder, the output of the mantissa compound right shifter, and the output of the sticky logic module; the GRS logic module is connected to the output of the mantissa compound right shifter and the output of the sticky logic module; and the sign pre-determination module is connected to the output of the GRS logic module, the output of the CPA adder module, and the output of the increment circuit. The third operation module specifically includes a second inversion module, a leading zero detection module, a trailing zero detection module, a normalized compound left shift module, a normalization correction module, a rounding preprocessing module, a fast GRS solver module, and an exponent adjustment module. 
Specifically, the second inversion module is connected to the output of the GRS logic module, the output of the CPA adder module, and the output of the increment circuit; the leading zero detection module is connected to the output of the second inversion module; the trailing zero detection module is connected to the output of the GRS logic module, the output of the CPA adder module, and the output of the increment circuit; the normalized compound left shift module is connected to the output of the second inversion module, the output of the leading zero detection module, and the output of the exponent alignment module; the exponent adjustment module is connected to the output of the exponent alignment module; the normalization correction module is connected to the output of the normalized compound left shift module; the fast GRS solver module is connected to the outputs of the leading zero detection module and the trailing zero detection module; and the rounding preprocessing module is connected to the outputs of the normalization correction module and the fast GRS solver module. The fourth operation module specifically includes a mantissa increment logic module, an exponent increment logic module, a sign judgment module, an exception judgment module, and a control logic output module. The mantissa increment logic module is connected to the output of the normalization correction module and the output of the rounding preprocessing module. The exponent increment logic module is connected to the output of the exponent adjustment module and the output of the rounding preprocessing module. The sign judgment module is connected to the output of the normalization correction module, the output of the rounding preprocessing module, and the output of the sign pre-judgment module. 
The exception judgment module is connected to the output of the exception pre-judgment module, the output of the exponent adjustment module, the output of the normalization correction module, and the output of the rounding preprocessing module. The control logic output module is connected to the outputs of the mantissa increment logic module, the exponent increment logic module, the sign judgment module, and the exception judgment module.
2. A vector floating-point multiplier-accumulator suitable for floating-point operations of various precisions according to claim 1, characterized in that, in the first operation module: the partial product generation module is used to obtain the input floating-point numbers and multiply them to obtain 27 partial products; the Wallace network is used to calculate the first sum (Sum) and the first carry (Carry) from the 27 partial products of the partial product generation module, as the first output result; the first inversion module is used to invert the mantissa of the floating-point number with the opposite sign bit when performing floating-point multiply-subtract operations on the input floating-point numbers; the exponent alignment module is used to obtain the shift value required for alignment shifting of the input floating-point number based on the exponent of the floating-point number that does not require shifting, and to generate the exponent alignment signal; the mantissa compound right shifter is used to shift the mantissa of the input floating-point number to the right according to the required shift value; the sticky logic module is used to calculate the sticky bit of the input floating-point number; and the exception pre-judgment module is used to identify NaN and infinity values among the input floating-point numbers and to generate an exception pre-judgment signal.
3. A vector floating-point multiplier-accumulator suitable for floating-point operations of various precisions according to claim 2, characterized in that the sticky logic module specifically calculates the sticky bit of the input floating-point number using the following formula: sticky = rshiftnum > (tzd_fc + 2 × width + 2), where f_c represents the mantissa of the shifted floating-point number; rshiftnum represents the shift value; tzd_fc represents the trailing zero detection value of f_c; and width represents the bit width of f_c.
4. A vector floating-point multiplier-accumulator suitable for floating-point operations of various precisions according to claim 2, characterized in that, in the second operation module: the 3:2 CSA adder is used to save the carry value of the input floating-point number; the CPA adder is used to add the first output result of the first operation module to the carry value to obtain the second sum (Sum) and the second carry (Carry) as the second output result; the increment circuit consists of cascaded half-adders and is used to select, according to the carry value, whether the input of a half-adder is the previous half-adder or the constant 1, and to determine whether the bit values of the second sum (Sum) and the second carry (Carry) need to be incremented by 1; the GRS logic module is used to calculate the GRS value of the second output result; and the sign pre-determination module is used to generate a sign pre-determination signal based on the GRS value and the result of adding 1.
5. A vector floating-point multiplier-accumulator suitable for floating-point operations of various precisions according to claim 4, characterized in that the GRS logic module calculates the GRS value of the second output result through the following steps: when the CPA adder performs addition, the GRS value of the second output result is the GRS value of the shifted floating-point mantissa f_c; when the CPA adder performs subtraction, the GRS value of the second output result is the two's complement of the GRS value of the shifted floating-point mantissa f_c; wherein, when the CPA adder performs subtraction and the GRS value of the shifted floating-point mantissa f_c is 0, the least significant bit of the carry in the second output result is incremented by 1.
6. A vector floating-point multiplier-accumulator suitable for floating-point operations of various precisions according to claim 4, characterized in that, in the third operation module: the second inversion module is used to perform a bitwise inversion operation on the second output result when the second output result of the second operation module is negative; the leading zero detection module is used to detect leading zeros in the inverted second output result; the trailing zero detection module is used to detect trailing zeros in the second output result; the normalized compound left shift module is used to normalize and left shift the mantissa of the second output result to obtain the third output result; the normalization correction module is used to perform negative result correction on the third output result; the fast GRS solver module is used to calculate the G-bit, R-bit, and S-bit values of floating-point numbers; the rounding preprocessing module is used to apply the negative result correction and to generate an increment enable signal based on the G-bit, R-bit, and S-bit values; and the exponent adjustment module generates an exponent adjustment signal based on the exponent alignment signal.
7. A vector floating-point multiplier-accumulator suitable for floating-point operations of various precisions according to claim 6, characterized in that, The normalized left shift is determined by the following steps: When the temporary exponent is greater than the preset exponent value, the shift value is the temporary exponent minus 1; the temporary exponent is obtained from the input floating-point number; When the temporary exponent is less than the preset exponent value, the shift value is the preset exponent value.
8. A vector floating-point multiplier-accumulator suitable for multi-precision floating-point operations according to claim 6, characterized in that, The negative result correction refers to adding 1 to the mantissa of the floating-point number when the second output result is negative.
9. A vector floating-point multiplier-accumulator suitable for floating-point operations of various precisions according to claim 6, characterized in that the calculation of the G-bit, R-bit, and S-bit values of the floating-point number is specifically determined by the following formula: sticky = (lzd_invf + tzd_f) < (2 × width + 3), where lzd_invf is the leading zero detection result, tzd_f is the trailing zero detection result, and width represents the bit width of the mantissa. When the second output result is negative and S calculated by the above formula is 0, R is corrected: if R is 1 it becomes 0, and if R is 0 it becomes 1. If S calculated by the above formula is 1, or the result is positive, R does not need to be corrected. When the second output result is negative and R is 0 after correction, G is corrected: if G is 1 it becomes 0, and if G is 0 it becomes 1. If R is 1 after correction, or the result is positive, G does not need to be corrected. When the result is negative and GRS is 0, the add-1 enable signal is output.
10. A vector floating-point multiplier-accumulator suitable for floating-point operations of various precisions according to claim 6, characterized in that, in the fourth operation module: the mantissa increment logic module is used to increment the mantissa of the third output result by 1 based on the increment enable signal and the normalization correction result; the exponent increment logic module is used to increment the exponent of the third output result by 1 according to the increment enable signal and the exponent adjustment signal; the sign judgment module is used to adjust the sign of the third output result based on the increment enable signal, the normalization correction result, and the sign pre-determination signal; the exception judgment module is used to generate an exception indication signal based on the increment enable signal, the exponent adjustment signal, the exponent alignment signal, and the exception pre-judgment signal, the exception indication signal including the invalid operation, underflow, overflow, division-by-zero, and inexact exceptions; and the control logic output module is used to output the multiply-add result based on the logical operation results of the mantissa increment logic module, the exponent increment logic module, the sign judgment module, and the exception judgment module.