A 2-group signed tensor computing circuit structure based on 6-bit approximate full adder
By introducing a 6-bit approximate full adder module into the neural network accelerator, the structure of the signed tensor computation circuit is optimized, solving the problems of area and power consumption in tensor computation and improving circuit performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SOUTHEAST UNIV
- Filing Date
- 2022-10-18
- Publication Date
- 2026-06-26
AI Technical Summary
In neural network accelerators, existing technologies struggle to effectively reduce the area and power consumption of tensor computations while maintaining high computational accuracy.
Design a circuit structure for calculating two sets of signed tensors based on a 6-bit approximate full adder. By introducing a 6-bit approximate full adder module into the calculation process, the calculation process of the signed 8*8 approximate multiplier and the two sets of signed tensors is optimized, reducing carry and output bits, and lowering the circuit area and power consumption.
Compared with a precise tensor computation unit, the design occupies less circuit area and avoids the power consumed by the extra carry, thereby optimizing both power consumption and area.
Figure CN115840556B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of neural network hardware acceleration, and in particular to a two-set signed tensor merging structure.
Background Technology
[0002] While the rapid growth of big data applications has fueled the development of neural networks, it has also presented traditional computer systems with significant challenges in terms of data processing speed and scalability. Multilayer neural networks can increase recognition accuracy, but they also introduce a large number of computational units and increased power consumption. Addition and multiplication are the most widely used computational operations. Multipliers and adders play a crucial role in the functionality of any digital circuit or system, and the overall performance of a processor largely depends on the area and power consumption of its multipliers and adders. Approximation computation has become a common approach to reducing the power consumption of neural network hardware.
[0003] Furthermore, neural networks are highly fault tolerant during training, so neural network accelerators can sacrifice some data accuracy in exchange for lower latency, area, and power consumption in the circuit structure. This invention uses a purpose-designed 6-bit approximate full adder module to optimize the calculation process of two sets of signed tensors. Although some data accuracy is lost, area and power consumption are improved.
Summary of the Invention
[0004] Technical Problem: This invention aims to address the issue of reducing the area and power consumption of tensor computation in neural network accelerators, and provides a circuit structure for two sets of signed tensor computation based on a 6-bit approximate full adder. This invention applies the designed 6-bit approximate full adder module to the calculation process of a signed 8*8 approximate multiplier and the calculation process of two sets of signed tensors, thereby optimizing the computation unit and reducing hardware costs such as power consumption and area.
[0005] Technical solution: The present invention provides a circuit structure for calculating two sets of signed tensors based on a 6-bit approximate full adder, comprising:
[0006] 6-bit approximate full adder module: has six data input bits s1, s2, s3, s4, s5, s6; two carry input bits Cin1 and Cin2; two carry output bits Cout1 and Cout2; and one local sum bit S;
[0007] The signed 8*8 approximate multiplier circuit utilizes the aforementioned 6-bit approximate full adder module to optimize the calculation process. It has two 8-bit binary input data x and y, where the range of input data x is -127 to 127, and the range of input data y is -127 to 127. The signed 8*8 approximate multiplier has 16-bit binary output results S1, S2, S3, S4, S5, S6, S7, S8, S9, S10, S11, S12, S13, S14, S15, and S16.
[0008] Two sets of signed tensor computation circuits: These circuits optimize the computation process using the aforementioned 6-bit approximate full adder module. The input data consists of two vectors containing 16 decimal data points, with each data point in each vector ranging from -127 to 127. The output of the two sets of signed tensor computation circuits is a set of 20-bit binary data output bits M1, M2, M3, M4, M5, M6, M7, M8, M9, M10, M11, M12, M13, M14, M15, M16, M17, M18, M19, and M20.
[0009] When the six data input bits s1, s2, s3, s4, s5, and s6 of the 6-bit approximate full adder module are 111111 and the two carry input bits Cin1 and Cin2 are 11, the two carry output bits Cout2 and Cout1 and the sum bit S take one of two possible values: output result one is 111, and output result two is 000.
[0010] In the signed 8*8 approximate multiplier circuit, the first bit of the 16-bit binary output result is the sign bit. When the sign bit is 1, the 16-bit binary output result is negative; when the sign bit is 0, the 16-bit binary output result is positive. The sign bit of the input data x and the sign bit of the input data y together determine the sign bit S1 of the signed 8*8 approximate multiplier. Bits 2 to 16 of the 16-bit binary output result are data bits, which are the binary representation of the absolute value of the output result of the approximate multiplier. Bit 11 of the 16-bit binary output result is calculated by the 6-bit approximate full adder module to obtain the sum bit S11; bit 10 is calculated by the 6-bit approximate full adder module to obtain the sum bit S10; bit 9 is calculated by the 6-bit approximate full adder module to obtain the sum bit S9; bits S2 to S8 and S12 to S16 are calculated by precise full adders.
[0011] The 16 sets of 16-bit binary data required by the two sets of signed tensor calculation circuits are obtained by calculating 16 times using the signed 8*8 approximate multiplier circuit structure. The 16 sets of 16-bit binary data are represented by binary two's complement logic and labeled as m1, m2, m3, m4, m5, m6, m7, m8, m9, m10, m11, m12, m13, m14, m15, m16.
[0012] The two sets of signed tensor computation circuits are used to calculate the first part of the 20-bit binary data output bits. Specifically, the 15th to 20th bits of m1, m2, m3, m4, m5, and m6 are calculated using the 6-bit approximate full adder module to obtain the first part of the data output bits, S15_1, S16_1, S17_1, S18_1, S19_1, and S20_1. The remaining bits S1_1 to S14_1 of the first part are obtained by multiple precise full adders calculating the 1st to 14th bits of m1, m2, m3, m4, and m5.
[0013] The two sets of signed tensor computation circuits are used to calculate the second part of the 20-bit binary data output bits. Specifically, the 15th to 20th bits of m7, m8, m9, m10, m11, and m12 are calculated using the 6-bit approximate full adder module to obtain the second part of the data output bits, S15_2, S16_2, S17_2, S18_2, S19_2, and S20_2. The remaining bits of the second part, S1_2 to S14_2, are obtained by multiple precise full adders calculating the 1st to 14th bits of m7, m8, m9, m10, and m11.
[0014] The two sets of signed tensor computation circuits are used to calculate the third-stage computation circuit structure of the 20-bit binary data output bits to solve for the third part of the data output bits. Specifically, bits 15 to 20 of m13, m14, m15, and m16, as well as S15_1, S16_1, S17_1, S18_1, S19_1, S20_1 and S15_2, S16_2, S17_2, S18_2, S19_2, S20_2, are calculated using the 6-bit approximate full adder module. The remaining bits S1_3 to S14_3 of the third part are obtained by calculating bits 1 to 14 of m13, m14, m15, and m16 using multiple precise full adders.
[0015] The two sets of signed tensor computation circuits are used to calculate the fourth-level computation circuit structure of the 20-bit binary data output bits to solve for the fourth part of the data output bits. Specifically, the first part of the first-level computation circuit structure (S1_1 to S14_1), the second part of the second-level computation circuit structure (S1_2 to S14_2), the third part of the third-level computation circuit structure (S1_3 to S14_3), the first to the 14th bits of m6, and the first to the 14th bits of m12 are used together by multiple precise full adders to calculate the 20-bit binary data output bits M1, M2, M3, M4, M5, M6, M7, M8, M9, M10, M11, M12, M13, and M14.
[0016] Beneficial effects: The designed 6-bit approximate full adder module has one fewer input carry and one fewer output carry than the 6-bit precise full adder. Compared with 16 sets of precise tensor calculation units, the two-set signed tensor merging method proposed in this invention therefore requires a smaller area. In terms of overall circuit performance, the designed signed tensor calculation circuit structure avoids the power consumption generated by the extra carry, thereby further reducing the power consumption of tensor calculation.
Attached Figure Description
[0017] Figure 1 is the circuit diagram of a signed 8*8 approximate multiplier based on a 6-bit approximate full adder.
[0018] Figure 2 is a flowchart showing the overall structure of the two sets of signed tensor computation circuits.
[0019] Figure 3 is a diagram of the first-level computational circuit structure of the two sets of signed tensor computation circuits.
[0020] Figure 4 is a diagram of the second-level computational circuit structure of the two sets of signed tensor computation circuits.
[0021] Figure 5 is a diagram of the third-level computational circuit structure of the two sets of signed tensor computation circuits.
[0022] Figure 6 is a diagram of the fourth-level computational circuit structure of the two sets of signed tensor computation circuits.
[0023] Figure 7 shows the shifted (staggered) accumulation calculation of the signed 8*8 approximate multiplier.
[0024] Figure 8 shows the accumulation of 16 groups of 11111100000101011111 during the tensor calculation process proposed in this invention.
Detailed Implementation
[0025] The present invention will be further illustrated below with reference to specific embodiments. It should be understood that these embodiments are for illustrative purposes only and are not intended to limit the scope of the invention. After reading the present invention, any modifications of the present invention in various equivalent forms by those skilled in the art will fall within the scope defined by the appended claims.
[0026] This invention includes a 6-bit approximate full adder module, a signed 8*8 approximate multiplier circuit, and two sets of signed tensor calculation circuits.
[0027] The 6-bit approximate full adder module is derived from a 6-bit precise full adder by omitting one output carry and one input carry. This approximation only takes effect in a cascade. The 6-bit approximate full adder is obtained as follows. When the 6-bit precise full adder is used as the first stage (without input carry), its maximum input condition is: all 6 input data bits are 1, and there is no carry input. In this case its output is: output carry Cout1_1 is 1, output carry Cout1_2 is 1, and the sum bit S1 of this stage is 0. When the 6-bit precise full adder is used as the second stage (receiving the output carry Cout1_1 from the first stage), its maximum input condition is: all 6 input data bits are 1, and there is one carry input of 1 (i.e., the value of Cout1_1). In this case its output is: output carry Cout2_1 is 1, output carry Cout2_2 is 1, and the sum bit S2 of this stage is 1. When the 6-bit precise full adder is used as the third stage (receiving the output carry Cout1_2 from the first stage and the output carry Cout2_1 from the second stage), its maximum input condition is: all 6 input data bits are 1, and both carry inputs are 1 (i.e., the values of Cout1_2 and Cout2_1). In this case its output is: output carry Cout3_1 is 1, output carry Cout3_2 is 0, output carry Cout3_3 is 0, and the sum bit S3 of this stage is 0. The output of the third-stage 6-bit precise full adder is then approximated by omitting one output carry. There are two options for this approximation. In the first, the output is approximated as output carry Cout3_1 is 1, output carry Cout3_2 is 1, and sum bit S3 is 1. In the second, the output is approximated as output carry Cout3_1 is 0, output carry Cout3_2 is 0, and sum bit S3 is 0. From the output and input carries in this cascade of 6-bit precise full adders, it can be seen that the 6-bit approximate full adder module designed in this invention needs only two input carries and two output carries to meet the cascading requirements.
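The worst-case cascade argument in paragraph [0027] can be checked with a few lines of arithmetic. The following is a minimal sketch under stated assumptions, not the patented circuit: the function name `exact_stage` is hypothetical, each stage is modeled as simply counting its ones, the weight-2 carry of a stage is assumed to feed the next stage, and the weight-4 carry is assumed to feed the stage after that.

```python
def exact_stage(data_bits, carries_in):
    """Exact total of one stage: six data bits plus incoming carries (all weight 1)."""
    return sum(data_bits) + sum(carries_in)

ones = [1] * 6  # maximum input condition: all six data bits are 1

# Stage 1: no incoming carry.
t1 = exact_stage(ones, [])
# Stage 2: receives the weight-2 carry of stage 1 (assumed routing).
t2 = exact_stage(ones, [(t1 >> 1) & 1])
# Stage 3: receives the weight-4 carry of stage 1 and the weight-2 carry of stage 2.
t3 = exact_stage(ones, [(t1 >> 2) & 1, (t2 >> 1) & 1])

for name, total in (("stage 1", t1), ("stage 2", t2), ("stage 3", t3)):
    print(name, total, format(total, "04b"))
```

Under these assumptions the totals are 6, 7, and 8. Only the third stage needs a fourth output bit, which is the extra carry that the approximate module drops.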
[0028] The 6-bit approximate full adder module has the following structure: two carry input bits Cin1 and Cin2, six data input bits s1, s2, s3, s4, s5, s6, two carry output bits Cout1 and Cout2, and one sum bit S. Only when the six data input bits s1, s2, s3, s4, s5, s6 and the two carry input bits Cin1 and Cin2 are all 1 are the carry output bits Cout1 and Cout2 and the sum bit S approximated, as either 111 (output result one) or 000 (output result two). Under all other input conditions, the carry output bits Cout1 and Cout2 and the sum bit S are precisely calculated values.
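The behavior described in paragraph [0028] can be expressed as a small behavioral model. This is a sketch only: the function name `approx_fa6` and the `saturate_high` flag are hypothetical, and it assumes that Cout2 carries weight 4, Cout1 carries weight 2, and all eight inputs carry weight 1.

```python
from itertools import product

def approx_fa6(s, cin1, cin2, saturate_high=True):
    """Return (Cout2, Cout1, S) for six data bits s and two carry inputs."""
    total = sum(s) + cin1 + cin2          # 0..8, every input has weight 1
    if total == 8:                        # the single approximated input pattern
        return (1, 1, 1) if saturate_high else (0, 0, 0)
    # All other patterns fit in three bits and are returned exactly.
    return ((total >> 2) & 1, (total >> 1) & 1, total & 1)

# Count how many of the 256 input patterns deviate from the exact sum.
mismatches = 0
for bits in product((0, 1), repeat=8):
    c2, c1, s = approx_fa6(bits[:6], bits[6], bits[7])
    if 4 * c2 + 2 * c1 + s != sum(bits):
        mismatches += 1
print(mismatches)
```

With `saturate_high=True` the model corresponds to output result one (111); with `saturate_high=False` it corresponds to output result two (000). In both cases only the all-ones input pattern deviates from the exact sum.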
[0029] The signed 8*8 approximate multiplier circuit described above takes the last 7 bits of the binary form of two sets of data x and y as input data, i.e., x2, x3, x4, x5, x6, x7, x8 and y2, y3, y4, y5, y6, y7, y8 are multiplied bitwise to obtain seven sets of seven binary calculation units. The staggered summation result of these seven sets of binary calculation units is the output result of the approximate multiplier. x1 and y1 are the sign bits of the two input data, respectively. The output of the 8*8 approximate multiplier is a set of 16-bit binary data S1, S2, S3, S4, S5, S6, S7, S8, S9, S10, S11, S12, S13, S14, S15, and S16, where S1 represents the sign of the output result, determined by x1 and y1, and S2 to S16 represent the absolute value of the output result. When calculating S9 to S11, the aforementioned 6-bit approximate full adder module and multiple cascaded half adders are introduced for approximate calculation.
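The partial-product arrangement in paragraph [0029] can be illustrated with an exact reference model. This is a sketch with hypothetical helper names; it assumes x2 and y2 are the most significant magnitude bits, S16 is the least significant output bit, and the output sign is the XOR of x1 and y1 (the text only says the sign is determined by x1 and y1). It models the exact staggered sum, not the approximate adders.

```python
def magnitude_bits(value):
    """Bits x2..x8 of a sign-magnitude operand (x2 has weight 2**6)."""
    mag = abs(value)
    return {i: (mag >> (8 - i)) & 1 for i in range(2, 9)}

def partial_product_columns(x, y):
    """Group the 49 partial products x_i*y_j by output column k = i + j."""
    xb, yb = magnitude_bits(x), magnitude_bits(y)
    columns = {}
    for i in range(2, 9):
        for j in range(2, 9):
            columns.setdefault(i + j, []).append(xb[i] & yb[j])
    return columns

def exact_sign_magnitude_product(x, y):
    cols = partial_product_columns(x, y)
    # Column k has weight 2**(16 - k) because S16 is the least significant bit.
    magnitude = sum(sum(bits) << (16 - k) for k, bits in cols.items())
    negative = (x < 0) ^ (y < 0)   # assumed XOR of the sign bits x1 and y1
    return -magnitude if negative else magnitude

cols = partial_product_columns(127, -127)
print({k: len(v) for k, v in sorted(cols.items())})
print(exact_sign_magnitude_product(127, -127))
```

In this model the column feeding S10 holds seven partial products and the columns feeding S9 and S11 hold six each, which is consistent with the text placing the 6-bit approximate full adders on S9 to S11.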
[0030] The circuit structure of the signed 8*8 approximate multiplier mainly consists of three parts: a first precise calculation module 1, a second precise calculation module 2, and an approximate calculation module. S16, S15, S14, S13, S12, S8, S7, S6, S5, S4, S3, and S2 are calculated by multiple cascaded 5-bit precise full adders. In calculating S11, the first 6-bit approximate full adder 301 receives the carries Cout13to11 and Cout12to11 generated by the first precise calculation module 1. In calculating S10, the second 6-bit approximate full adder 302 first adds six of the seven accumulated data items, for example x2*y8, x3*y7, x4*y6, x5*y5, x6*y4, and x7*y3. The second 6-bit approximate full adder 302 receives the carry Cout12to10 generated by the first precise calculation module 1 and the carry Cout12t from the first 6-bit approximate full adder 301, and generates S10_1. S10_1 and x8*y2 are then processed by the first half adder 401 to obtain S10 and the carry Cout10to9_2. When calculating S9, Cout10to9_2 and the carry Cout10to9_1 generated by the second 6-bit approximate full adder 302 are processed together by the second half adder 402 to obtain the output carries Cout9 and Cout8_1. Cout8_1 and Cout10to8 are processed together by the third half adder 403 to obtain the output carries Cout8_2 and Cout7. The third 6-bit approximate full adder 303 receives the carry Cout9 generated by the second half adder 402 and the carry Cout11to9 generated by the first 6-bit approximate full adder 301, and finally outputs S9.
[0031] The computational circuit structure corresponding to the two-set signed tensor merging method described above takes two tensors A and B as input data. A is a tensor containing 16 data points (which can be represented as a matrix [A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16]), with each data point ranging from -127 to 127. B is also a tensor containing 16 data points (which can be represented as a matrix [B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12 B13 B14 B15 B16]), with each data point ranging from -127 to 127. A and B are multiplied element by element, i.e., A1*B1, A2*B2, A3*B3, ..., giving a total of 16 multiplication units (labeled m1, m2, m3, m4, m5, m6, m7, m8, m9, m10, m11, m12, m13, m14, m15, m16). Each multiplication unit enters the merging calculation circuit structure in 20-bit two's complement form. Two's complement is used because a separate sign bit is difficult to handle when accumulating multiple sets of data. In addition, the range of a 20-bit binary result can accommodate the accumulation of 16 sets of 16-bit binary data. The output of the two sets of signed tensor calculation circuits is a set of 20-bit binary data (which can be represented as M1, M2, M3, M4, M5, M6, M7, M8, M9, M10, M11, M12, M13, M14, M15, M16, M17, M18, M19, M20). This output is the two's complement form of the actual value of the tensor calculation, and the actual value must be obtained by further conversion using calculation tools.
[0032] The computational circuit structure corresponding to the two-set signed tensor merging method described above can be divided into a four-level computational circuit structure. The computation result of the first-level computational circuit structure can be divided into an approximate part and an exact part. The approximate part starts with m1(20) (i.e., the 20th bit of m1; the same notation is used below), m2(20), m3(20), m4(20), m5(20), and m6(20) passing through a first 6-bit approximate full adder 11. The first 6-bit approximate full adder 11 is cascaded with the second 6-bit approximate full adder 12, whose inputs are m1(19), m2(19), m3(19), m4(19), m5(19), and m6(19). Similarly, the second 6-bit approximate full adder 12, the third 6-bit approximate full adder 13, the fourth 6-bit approximate full adder 14, the fifth 6-bit approximate full adder 15, and the sixth 6-bit approximate full adder 16 are cascaded together to calculate bits M15_1 to M20_1 of the first part. The exact part consists of 14 5-bit precise full adders, such as the first 5-bit precise full adder 101 and the second 5-bit precise full adder 102, which are cascaded together to calculate bits M1_1 to M14_1 of the first part.
[0033] In the second-level computational circuit structure, the computation result can likewise be divided into an approximate part and an exact part. The approximate part starts with m7(20), m8(20), m9(20), m10(20), m11(20), and m12(20) passing through a first 6-bit approximate full adder 21. The first 6-bit approximate full adder 21 is cascaded with the second 6-bit approximate full adder 22, whose inputs are m7(19), m8(19), m9(19), m10(19), m11(19), and m12(19). Similarly, the second 6-bit approximate full adder 22, the third 6-bit approximate full adder 23, the fourth 6-bit approximate full adder 24, the fifth 6-bit approximate full adder 25, and the sixth 6-bit approximate full adder 26 are cascaded together to calculate bits M15_2 to M20_2 of the second part. The exact part consists of 14 5-bit precise full adders, such as the first 5-bit precise full adder 201 and the second 5-bit precise full adder 202, which are cascaded together to calculate bits M1_2 to M14_2 of the second part.
[0034] In the third-level computational circuit structure, the computation result can likewise be divided into an approximate part and an exact part. The approximate part starts with m13(20), m14(20), m15(20), m16(20), M20_1, and M20_2 passing through the first 6-bit approximate full adder 31. The first 6-bit approximate full adder 31 is cascaded with the second 6-bit approximate full adder 32, whose inputs are m13(19), m14(19), m15(19), m16(19), M19_1, and M19_2. Similarly, the second 6-bit approximate full adder 32, the third 6-bit approximate full adder 33, the fourth 6-bit approximate full adder 34, the fifth 6-bit approximate full adder 35, and the sixth 6-bit approximate full adder 36 are cascaded together to calculate bits M15 to M20. The exact part consists of 14 5-bit precise full adders, such as the first 5-bit precise full adder 301 and the second 5-bit precise full adder 302, which are cascaded together to calculate bits M1_3 to M14_3.
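The 20-bit two's complement encoding mentioned in paragraph [0031] can be sketched as follows. The helper names are hypothetical. The example value -16033 is the approximate product reported in Example 1, and the last lines check that a 16-term sum of exact products fits in 20 signed bits.

```python
WIDTH = 20

def to_twos_complement(value, width=WIDTH):
    """Return the width-bit two's complement bit string of a signed integer."""
    return format(value & ((1 << width) - 1), f"0{width}b")

def from_twos_complement(bits):
    """Decode a two's complement bit string back to a signed integer."""
    raw = int(bits, 2)
    return raw - (1 << len(bits)) if bits[0] == "1" else raw

m = to_twos_complement(-16033)
print(m)                        # 11111100000101011111
print(from_twos_complement(m))  # -16033

# Largest possible magnitude of a sum of 16 exact products.
worst = 16 * 127 * 127
print(worst, worst < (1 << (WIDTH - 1)))  # 258064 True
```

The decoding function corresponds to the "further calculation" needed to read the actual value from the output bits M1 to M20.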
[0035] The computational circuit structure corresponding to the two sets of signed tensor merging methods has only a precise part in its fourth-level computational circuit structure. The precise part is composed of 14 5-bit precise full adders, such as the first 5-bit precise full adder 401 with inputs M14_1, M14_2, M14_3, m6 (14), and m12 (14), which are cascaded together to calculate bits M1 to M14.
[0036] The results M1, M2, M3, M4, M5, M6, M7, M8, M9, M10, M11, M12, M13, M14, M15, M16, M17, M18, M19, and M20 are the two's complement form of the tensor calculation results.
[0037] Example 1:
[0038] Embodiment 1 of this invention provides a calculation method for a signed 8*8 approximate multiplier based on a 6-bit approximate full adder. The main approximation in the functional logic of the 6-bit approximate full adder is that when the 8 inputs (6 data input bits + 2 carries) are 11111111, the output result is 111. When the input data x of the signed 8*8 approximate multiplier is 127 (binary 01111111) and y is -127 (binary 11111111), the last seven bits of the two inputs are multiplied bit by bit and then accumulated with a shifted summation, as shown in Figure 7. The cascaded circuit structure of the approximate part is shown in Figure 1. The calculation results are analyzed as follows:
[0039] The S16 bit is obtained through the first precise calculation module 1, and is 1;
[0040] The S15 bit is derived from the first precise calculation module 1 and is 0;
[0041] Bit S14 is obtained through the first precise calculation module 1 and is 0;
[0042] Bit S13 is obtained through the first precise calculation module 1 and is 0;
[0043] The S12 bit is obtained through the first precise calculation module 1 and is 0;
[0044] Bit S11 is obtained by passing through the first 6-bit approximate full adder 301 (all 6 input bits are 1, and the 2 carry bits are also 1), and is 1;
[0045] The S10 bit is obtained by passing through the second 6-bit approximate full adder 302 and the first half adder 401, and is 0;
[0046] The S9 bit is obtained by passing through the third 6-bit approximate full adder 303, and is 1;
[0047] The S8 bit is obtained through the second precise calculation module 2 and is 0;
[0048] The S7 bit is obtained through the second precise calculation module 2, and is 1;
[0049] The S6 bit is obtained through the second precise calculation module 2, and is 1;
[0050] The S5 bit is obtained through the second precise calculation module 2, and is 1;
[0051] The S4 bit is obtained through the second precise calculation module 2, and is 1;
[0052] Bits S3 and S2 are obtained through the second precise calculation module 2, and are 1 and 0 respectively;
[0053] The S1 bit is obtained through x(1) and y(1), and is 1 (representing a negative number).
[0054] Therefore, the binary representation of the absolute value of the output is 011111010100001, i.e., 16033. Since the sign bit 1 represents a negative number, the approximate result is -16033. The exact result of -127*127 is -16129, so the absolute error is 96, which is within the allowable range.
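The arithmetic of Example 1 can be reproduced directly from the listed output bits. This is a minimal check: the list `S` simply transcribes the values of S1 to S16 given above, and the variable names are illustrative.

```python
# Output bits S1..S16 as listed in Example 1 (S1 is the sign bit).
S = [1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1]

magnitude = int("".join(map(str, S[1:])), 2)  # S2..S16 hold the absolute value
approx = -magnitude if S[0] else magnitude
exact = 127 * -127
error = abs(approx - exact)

print(approx, exact, error)                    # -16033 -16129 96
print(round(100 * error / abs(exact), 2))      # relative error in percent
```

For this input pair the relative error is roughly 0.6 percent of the exact product.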
[0055] Example 2:
[0056] Embodiment 2 of this invention provides a merging method for two groups of signed tensor calculation circuits based on a 6-bit approximate full adder. For ease of calculation, the input values and output result of the signed 8*8 approximate multiplier in Embodiment 1 are reused. Let tensor A be [-127 -127 -127 -127 -127 -127 -127 -127 -127 -127 -127 -127 -127 -127 -127 -127] and tensor B be [127 ...]. After multiplying tensor A and tensor B element by element, the resulting m1 to m16 are all 11111100000101011111. Therefore, the original input data for the tensor calculation is the sum of 16 sets of 11111100000101011111, as shown in Figure 8. The four-level computational circuit structure for calculating two sets of signed tensors is shown in Figures 3, 4, 5, and 6.
[0057] The results of the first-stage computation circuits of the two sets of signed tensor computation circuits are as follows:
[0058] M20_1 is obtained by passing m1(20), m2(20), m3(20), m4(20), m5(20), and m6(20) through the first 6-bit approximate full adder 11, and is 0;
[0059] M19_1 is obtained by passing m1(19), m2(19), m3(19), m4(19), m5(19), and m6(19) through a second 6-bit approximate full adder 12, and is 1;
[0060] M18_1 is obtained by passing m1(18), m2(18), m3(18), m4(18), m5(18), and m6(18) through a third 6-bit approximate full adder 13, and is 1;
[0061] M17_1 is obtained by passing m1(17), m2(17), m3(17), m4(17), m5(17), and m6(17) through a fourth 6-bit approximate full adder 14, and is 1;
[0062] M16_1 is obtained by passing m1(16), m2(16), m3(16), m4(16), m5(16), and m6(16) through a fifth 6-bit approximate full adder 15, and is 1;
[0063] M15_1 is obtained by passing m1(15), m2(15), m3(15), m4(15), m5(15), and m6(15) through a sixth 6-bit approximate full adder 16, and is 0;
[0064] M14_1 is obtained by passing m1(14), m2(14), m3(14), m4(14), and m5(14) through the first 5-bit precise full adder 101, and is 1;
[0065] M13_1 is obtained by passing m1(13), m2(13), m3(13), m4(13), and m5(13) through a second 5-bit precise full adder 102, and is 1;
[0066] M12_1 is obtained by passing m1(12), m2(12), m3(12), m4(12), and m5(12) through a third 5-bit precise full adder 103, and is 0;
[0067] M11_1 is obtained by passing m1(11), m2(11), m3(11), m4(11), and m5(11) through a fourth 5-bit precise full adder 104, and is 1;
[0068] M10_1 is obtained by passing m1(10), m2(10), m3(10), m4(10), and m5(10) through a fifth 5-bit precise full adder 105, and is 1;
[0069] M9_1 is obtained by passing m1(9), m2(9), m3(9), m4(9), and m5(9) through the sixth 5-bit precise full adder 106, and is 0;
[0070] M8_1 is obtained by passing m1(8), m2(8), m3(8), m4(8), and m5(8) through the seventh 5-bit precise full adder 107, and is 0;
[0071] M7_1 is obtained by passing m1(7), m2(7), m3(7), m4(7), and m5(7) through an eighth 5-bit precise full adder 108, and is 0;
[0072] M6_1 is obtained by passing m1(6), m2(6), m3(6), m4(6), and m5(6) through the ninth 5-bit precise full adder 109, and is 1;
[0073] M5_1 is obtained by passing m1(5), m2(5), m3(5), m4(5), and m5(5) through the tenth 5-bit precise full adder 110, and is 1;
[0074] M4_1 is obtained by passing m1(4), m2(4), m3(4), m4(4), and m5(4) through the eleventh 5-bit precise full adder 111, and is 0;
[0075] M3_1 is obtained by passing m1(3), m2(3), m3(3), m4(3), and m5(3) through the twelfth 5-bit precise full adder 112, and is 1;
[0076] M2_1 is obtained by passing m1(2), m2(2), m3(2), m4(2), and m5(2) through the thirteenth 5-bit precise full adder 113, and is 1;
[0077] M1_1 is obtained by passing m1(1), m2(1), m3(1), m4(1), and m5(1) through the fourteenth 5-bit precise full adder 114, and is 1;
[0078] Due to the similarity of the input data and circuit structure, the calculation results of the second-level computation circuit of the two sets of signed tensor merging methods are the same as those of the first-level computation circuit.
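The approximate low-order bits M20_1 to M15_1 of the first level can be reproduced with a short simulation of six cascaded approximate adders. This is a sketch under assumptions: the names `approx_fa6` and `approx_low_bits` are hypothetical, output result one (111) is used, Cout1 (weight 2) is assumed to feed the next stage, and Cout2 (weight 4) is assumed to feed the stage after that. Carries leaving bit 15 toward the precise part are not modeled.

```python
def approx_fa6(s, cin1, cin2):
    """(Cout2, Cout1, S) of the approximate adder, using output result one (111)."""
    total = sum(s) + cin1 + cin2
    if total == 8:
        return 1, 1, 1
    return (total >> 2) & 1, (total >> 1) & 1, total & 1

def approx_low_bits(operands, first_bit=20, last_bit=15):
    """Cascade approximate adders from bit 20 (LSB) up to bit 15.

    operands: six 20-character bit strings; character index b-1 holds bit b.
    """
    result = {}
    carry_prev = 0    # Cout1 of the previous stage
    carry_prev2 = 0   # Cout2 of the stage two positions back
    pending = 0       # Cout2 of the previous stage, consumed one stage later
    for b in range(first_bit, last_bit - 1, -1):
        column = [int(op[b - 1]) for op in operands]
        c2, c1, s = approx_fa6(column, carry_prev, carry_prev2)
        result[b] = s
        carry_prev, carry_prev2, pending = c1, pending, c2
    return result

m = "11111100000101011111"          # every product m1..m6 in Example 2
level1 = approx_low_bits([m] * 6)
print(level1)
```

Under these assumptions the simulated bits agree with the values listed for M20_1 through M15_1 (0, 1, 1, 1, 1, 0).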
[0079] The results of the third-level computation circuits of the two sets of signed tensor computation circuit structures are as follows:
[0080] M20 is obtained by passing M20_1, M20_2, m13(20), m14(20), m15(20), and m16(20) through the first 6-bit approximate full adder 31, and is 0;
[0081] M19 is obtained by passing M19_1, M19_2, m13(19), m14(19), m15(19), and m16(19) through a second 6-bit approximate full adder 32, and is 0;
[0082] M18 is obtained by passing M18_1, M18_2, m13(18), m14(18), m15(18), and m16(18) through a third 6-bit approximate full adder 33, and is 1;
[0083] M17 is obtained by passing M17_1, M17_2, m13(17), m14(17), m15(17), and m16(17) through a fourth 6-bit approximate full adder 34, and is 1;
[0084] M16 is obtained by passing M16_1, M16_2, m13(16), m14(16), m15(16), and m16(16) through a fifth 6-bit approximate full adder 35, and is 1;
[0085] M15 is obtained by passing M15_1, M15_2, m13(15), m14(15), m15(15), and m16(15) through a sixth 6-bit approximate full adder 36, and is 0;
[0086] M14_3 is obtained by passing m13(14), m14(14), m15(14), and m16(14) through the first 5-bit precise full adder 301, and is 0;
[0087] M13_3 is obtained by passing m13(13), m14(13), m15(13), and m16(13) through the second 5-bit precise full adder 302, and is 1;
[0088] M12_3 is obtained by passing m13(12), m14(12), m15(12), and m16(12) through a third 5-bit precise full adder 303, and is 1;
[0089] M11_3 is obtained by passing m13(11), m14(11), m15(11), and m16(11) through the fourth 5-bit precise full adder 304, and is 0;
[0090] M10_3 is obtained by passing m13(10), m14(10), m15(10), and m16(10) through the fifth 5-bit precise full adder 305, and is 1;
[0091] M9_3 is obtained by passing m13(9), m14(9), m15(9), and m16(9) through the sixth 5-bit precise full adder 306, and is 0;
[0092] M8_3 is obtained by passing m13(8), m14(8), m15(8), and m16(8) through the seventh 5-bit precise full adder 307, and is 0;
[0093] M7_3 is obtained by passing m13(7), m14(7), m15(7), and m16(7) through the eighth 5-bit precise full adder 308, and is 0;
[0094] M6_3 is obtained by passing m13(6), m14(6), m15(6), and m16(6) through the ninth 5-bit precise full adder 309, and is 0;
[0095] M5_3 is obtained by passing m13(5), m14(5), m15(5), and m16(5) through the tenth 5-bit precise full adder 310, and is 0;
[0096] M4_3 is obtained by passing m13(4), m14(4), m15(4), and m16(4) through the eleventh 5-bit precise full adder 311, and is 1;
[0097] M3_3 is obtained by passing m13(3), m14(3), m15(3), and m16(3) through the twelfth 5-bit precise full adder 312, and is 1;
[0098] M2_3 is obtained by passing m13(2), m14(2), m15(2), and m16(2) through the thirteenth 5-bit precise full adder 313, and is 1;
[0099] M1_3 is obtained by passing m13(1), m14(1), m15(1), and m16(1) through the fourteenth 5-bit precise full adder 314, and is 1;
[0100] The results of the fourth-level computation circuits of the two sets of signed tensor computation circuit structures are as follows:
[0101] M14 is obtained by passing m6(14), m12(14), M14_1, M14_2, and M14_3 through the first 5-bit precise full adder 401, and is 0;
[0102] M13 is obtained by passing m6(13), m12(13), M13_1, M13_2, and M13_3 through a second 5-bit precise full adder 402, and is 1;
[0103] M12 is obtained by passing m6(12), m12(12), M12_1, M12_2, and M12_3 through a third 5-bit precise full adder 403, and is 1;
[0104] M11 is obtained by passing m6(11), m12(11), M11_1, M11_2, and M11_3 through a fourth 5-bit precise full adder 404, and is 0;
[0105] M10 is obtained by passing m6(10), m12(10), M10_1, M10_2, and M10_3 through a fifth 5-bit precise full adder 405, and is 1;
[0106] M9 is obtained by passing m6(9), m12(9), M9_1, M9_2, and M9_3 through the sixth 5-bit precise full adder 406, and is 0;
[0107] M8 is obtained by passing m6(8), m12(8), M8_1, M8_2, and M8_3 through the seventh 5-bit precise full adder 407, and is 1;
[0108] M7 is obtained by passing m6(7), m12(7), M7_1, M7_2, and M7_3 through the eighth 5-bit precise full adder 408, and is 0;
[0109] M6 is obtained by passing m6(6), m12(6), M6_1, M6_2, and M6_3 through the ninth 5-bit precise full adder 409, and is 0;
[0110] M5 is obtained by passing m6(5), m12(5), M5_1, M5_2, and M5_3 through the tenth 5-bit precise full adder 410, and is 0;
[0111] M4 is obtained by passing m6(4), m12(4), M4_1, M4_2, and M4_3 through the eleventh 5-bit precise full adder 411, and is 0;
[0112] M3 is obtained by passing m6(3), m12(3), M3_1, M3_2, and M3_3 through the twelfth 5-bit precise full adder 412, and is 0;
[0113] M2 is obtained by passing m6(2), m12(2), M2_1, M2_2, and M2_3 through the thirteenth 5-bit precise full adder 413, and is 1;
[0114] M1 is obtained by passing m6(1), m12(1), M1_1, M1_2, and M1_3 through the fourteenth 5-bit precise full adder 414, and is 1;
[0115] In summary, the outputs M1 to M20 of the two sets of signed tensor computation circuits are 11000001010110011100. Since this is in two's complement form, converting it to sign-magnitude form gives the binary value 10111110101001100100, which is -256612. The precise value obtained by multiplying tensor A and tensor B element-wise and then summing is -127*127*16, which is -258064. The result of the adopted merging scheme therefore differs from the precise value by 1452, about 0.56%, which is within the allowable range.
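The conversion in paragraph [0115] can be checked with a short script. This is only a sketch: it assumes that M1 is the most significant (sign) bit of the 20-bit output and that the output is an ordinary 20-bit two's complement number. The helper names are illustrative and do not come from the patent.

```python
def twos_complement_to_int(bits: str) -> int:
    """Interpret an MSB-first bit string as a two's complement integer."""
    value = int(bits, 2)
    if bits[0] == "1":              # sign bit set: subtract 2**width
        value -= 1 << len(bits)
    return value


def to_sign_magnitude(value: int, width: int) -> str:
    """Return a sign bit followed by the (width - 1)-bit magnitude."""
    sign = "1" if value < 0 else "0"
    return sign + format(abs(value), f"0{width - 1}b")


# Output bits M1..M20 as stated in the text (assumption: M1 is the MSB).
output_bits = "11000001010110011100"

approx = twos_complement_to_int(output_bits)
exact = -127 * 127 * 16             # element-wise products, then summed
abs_error = abs(approx - exact)
rel_error = abs_error / abs(exact)

print(approx)                        # -256612
print(to_sign_magnitude(approx, 20)) # 10111110101001100100
print(exact)                         # -258064
print(abs_error)                     # 1452
print(f"{rel_error:.2%}")            # 0.56%
```

Under these assumptions the script reproduces the stated approximate value of -256612 and its sign-magnitude form. Evaluating -127*127*16 directly gives -258064.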
Claims
1. A circuit structure for calculating two sets of signed tensors based on a 6-bit approximate full adder, characterized in that the structure includes:
A 6-bit approximate full adder module, which has six data input bits s1, s2, s3, s4, s5, and s6; two carry input bits Cin1 and Cin2; two carry output bits Cout1 and Cout2; and one local sum bit S.
A signed 8*8 approximate multiplier circuit, which uses the aforementioned 6-bit approximate full adder module to optimize the calculation process. It has two 8-bit binary input data x and y, where the range of input data x is -127 to 127 and the range of input data y is -127 to 127. The signed 8*8 approximate multiplier has 16-bit binary output results S1, S2, S3, S4, S5, S6, S7, S8, S9, S10, S11, S12, S13, S14, S15, and S16. The 11th bit of the 16-bit binary output of the signed 8*8 approximate multiplier is calculated by the 6-bit approximate full adder module, resulting in bit S11; the 10th bit is calculated by the 6-bit approximate full adder module, resulting in bit S10; the 9th bit is calculated by the 6-bit approximate full adder module, resulting in bit S9; and precise full adders calculate bits S2 to S8 and bits S12 to S16.
Two sets of signed tensor computation circuits, which use the aforementioned 6-bit approximate full adder module to optimize the computation process. The input data consists of two vectors, each containing 16 decimal data points, with each data point ranging from -127 to 127. The output of the two sets of signed tensor computation circuits is a set of 20-bit binary data output bits M1, M2, M3, M4, M5, M6, M7, M8, M9, M10, M11, M12, M13, M14, M15, M16, M17, M18, M19, and M20. The 16 sets of 16-bit binary data required by the two sets of signed tensor computation circuits are obtained by running the signed 8*8 approximate multiplier circuit 16 times. The 16 sets of 16-bit binary data are represented in binary two's complement form and labeled m1, m2, m3, m4, m5, m6, m7, m8, m9, m10, m11, m12, m13, m14, m15, and m16.
The two sets of signed tensor computation circuits calculate the first part of the 20-bit binary data output bits. Specifically, the 15th to 20th bits of m1, m2, m3, m4, m5, and m6 are calculated by the 6-bit approximate full adder module to obtain the first part of the data output bits, S15_1, S16_1, S17_1, S18_1, S19_1, and S20_1. The remaining bits of the first part, S1_1 to S14_1, are obtained by multiple precise full adders calculating the 1st to 14th bits of m1, m2, m3, m4, and m5.
The two sets of signed tensor computation circuits calculate the second part of the 20-bit binary data output bits. Specifically, the 15th to 20th bits of m7, m8, m9, m10, m11, and m12 are calculated by the 6-bit approximate full adder module to obtain the second part of the data output bits, S15_2, S16_2, S17_2, S18_2, S19_2, and S20_2. The remaining bits of the second part, S1_2 to S14_2, are obtained by multiple precise full adders calculating the 1st to 14th bits of m7, m8, m9, m10, and m11.
The third-level computation circuit structure of the two sets of signed tensor computation circuits calculates the third part of the 20-bit binary data output bits. Specifically, the 15th to 20th bits of m13, m14, m15, and m16, together with S15_1, S16_1, S17_1, S18_1, S19_1, S20_1 and S15_2, S16_2, S17_2, S18_2, S19_2, S20_2, are calculated by the 6-bit approximate full adder module. The remaining bits of the third part, S1_3 to S14_3, are obtained by multiple precise full adders calculating the 1st to 14th bits of m13, m14, m15, and m16.
2. The circuit structure for calculating two sets of signed tensors based on a 6-bit approximate full adder as described in claim 1, characterized in that when the six data input bits s1, s2, s3, s4, s5, and s6 of the 6-bit approximate full adder module are 111111 and the two carry input bits Cin1 and Cin2 are 11, there are two possible output results for the two carry output bits and the sum bit S: in output result one, the carry output bits Cout2 and Cout1 and the sum bit S are 111; in output result two, the carry output bits Cout2 and Cout1 and the sum bit S are 000.
3. The circuit structure for calculating two sets of signed tensors based on a 6-bit approximate full adder as described in claim 1, characterized in that, in the signed 8*8 approximate multiplier circuit, the first bit of the 16-bit binary output result is the sign bit. When the sign bit is 1, the 16-bit binary output result is a negative number; when the sign bit is 0, the 16-bit binary output result is a positive number. The sign bit of the input data x and the sign bit of the input data y together determine the sign bit S1 of the signed 8*8 approximate multiplier. The second to the sixteenth bits of the 16-bit binary output result are data bits, which are the binary representation of the absolute value of the output result of the approximate multiplier.
4. The circuit structure for calculating two sets of signed tensors based on a 6-bit approximate full adder as described in claim 1, characterized in that the fourth-level computation circuit structure of the two sets of signed tensor computation circuits calculates the fourth part of the 20-bit binary data output bits. Specifically, the first part from the first-level computation circuit structure (S1_1 to S14_1), the second part from the second-level computation circuit structure (S1_2 to S14_2), the third part from the third-level computation circuit structure (S1_3 to S14_3), the 1st to 14th bits of m6, and the 1st to 14th bits of m12 are combined by multiple precise full adders to calculate the data output bits M1, M2, M3, M4, M5, M6, M7, M8, M9, M10, M11, M12, M13, and M14.
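The all-ones case described in claim 2 can be illustrated with a small behavioral sketch. It is not the patented gate-level circuit. It assumes that all eight inputs (s1 to s6, Cin1, Cin2) carry the same weight, that Cout2, Cout1, and S form a 3-bit count with weights 4, 2, and 1, and that every other input combination is counted exactly. The claims state none of these assumptions. Under them, eight ones would need a fourth output bit, so the result either saturates to 111 or wraps to 000. The function name approx_fa6 and the saturate flag are hypothetical.

```python
from typing import Sequence, Tuple


def approx_fa6(data_bits: Sequence[int], cin1: int, cin2: int,
               saturate: bool = True) -> Tuple[int, int, int]:
    """Hypothetical behavioral model returning (Cout2, Cout1, S).

    Assumption: the outputs encode the number of ones among the eight
    inputs as a 3-bit value; only the all-ones case is approximated.
    """
    if len(data_bits) != 6:
        raise ValueError("expected six data bits s1..s6")
    total = sum(data_bits) + cin1 + cin2   # number of ones, 0..8
    if total > 7:                          # 8 does not fit in three bits
        total = 7 if saturate else total & 0b111
    return (total >> 2) & 1, (total >> 1) & 1, total & 1


all_ones = [1] * 6
print(approx_fa6(all_ones, 1, 1, saturate=True))    # (1, 1, 1)
print(approx_fa6(all_ones, 1, 1, saturate=False))   # (0, 0, 0)
```

In this reading, dropping the fourth output bit is what removes one carry per adder compared with an exact 8-input counter, at the cost of an error only when all eight inputs are 1.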