Chip, and data processing method for chip

WO2026137902A1 · PCT designated stage · Publication Date: 2026-07-02 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: HUAWEI TECH CO LTD
Filing Date: 2025-08-15
Publication Date: 2026-07-02

Abstract

Provided in the present application are a chip, and a data processing method for the chip. The chip comprises: a first data converter, which is used for converting N first vectors into N third vectors and M first shared exponents, wherein a plurality of elements in each first vector are floating-point numbers, and a plurality of elements in each third vector are integers; a multiplication array, which is used for acquiring the N third vectors from the first data converter by means of direct transmission, and multiplying P fourth vectors by the N third vectors, so as to obtain N×P vector multiplication results, wherein a plurality of elements in the P fourth vectors are integers, and P≥1; an accumulator, which is used for acquiring the N×P vector multiplication results from the multiplication array by means of direct transmission, and calculating, on the basis of the M first shared exponents, an accumulated sum of the N×P vector multiplication results and a historical accumulation result, so as to obtain an accumulation result; and a second data converter, which is used for converting integers in the accumulation result into floating-point numbers. The chip provided in the present application can realize efficient floating-point number vector multiplication.

Description

Chips and chip data processing methods

[0001] This application claims priority to Chinese Patent Application No. 202411955597.5, filed on December 25, 2024, entitled “Chip and Data Processing Method for Chip”, the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of artificial intelligence technology, specifically to a chip and a data processing method for the chip.

Background

[0003] With the continuous development of artificial intelligence (AI) algorithms, the inference performance of neural networks has attracted increasing attention from academia and industry. Neural networks involve a large number of matrix multiplication calculations. To accelerate neural network inference, chips typically contain multiplication arrays for matrix multiplication or vector inner product operations, thus speeding up matrix multiplication operations such as A×B+C.

[0004] For floating-point multiplication in inner product operations, the multiplication process includes several steps: adding the exponents, multiplying the mantissas, and finally normalizing the result to a floating-point number. Adding two floating-point numbers is more complicated, requiring alignment based on the exponents, shifting the mantissas, calculating the addition, and finally normalizing the result to a floating-point number. Floating-point addition in inner product operations consumes a lot of power, limiting the chip's processing efficiency for floating-point vector multiplication.

Summary of the Invention

[0005] This application provides a chip and a data processing method for the chip, which can improve the chip's processing efficiency for floating-point vector multiplication.

[0006] In a first aspect, a chip is provided, comprising: a first data converter for converting N first vectors into N third vectors and M first shared exponents, wherein multiple elements in each first vector are floating-point numbers and multiple elements in each third vector are integers, N≥1, 1≤M≤N; a multiplication array for obtaining the N third vectors from the first data converter in a direct transmission manner, and multiplying P fourth vectors with the N third vectors to obtain N×P vector multiplication results, wherein multiple elements in the P fourth vectors are integers, P≥1; an accumulator for obtaining the N×P vector multiplication results from the multiplication array in the direct transmission manner, and calculating the sum of the N×P vector multiplication results and historical accumulation results based on the M first shared exponents to obtain an accumulation result; and a second data converter for converting the integers in the accumulation result into floating-point numbers.

[0007] This application provides a chip in which a first data converter can convert a floating-point vector into an integer vector, a multiplication array can obtain an integer vector from the first data converter for multiplication operations via direct transmission, and an accumulator can obtain the vector multiplication result from the multiplication array for accumulation via direct transmission. The chip's vector multiplication process does not require external memory, which can reduce data transmission latency and improve the computational efficiency of floating-point vector multiplication.

[0008] Meanwhile, when performing a multiplication operation on two matrices, the chip can split the two matrices into multiple data blocks for processing. The size of these data blocks is usually much smaller than the size of the original matrix. The first data converter, multiplication array, and accumulator can process these data blocks in a pipelined manner, reducing data processing latency and improving the efficiency of matrix multiplication calculation.

[0009] For example, the first matrix A and the second matrix B are multiplied. The first matrix A is split into data blocks A1, A2, A3, and A4, and the second matrix B is split into data blocks B1, B2, B3, and B4. Each data block includes N vectors, and A×B = A1×B1 + A2×B2 + A3×B3 + A4×B4. At time t1, the first data converter converts the floating-point vectors in A1 and B1 into integer vectors to obtain A11 and B11. At time t2, the first data converter converts the floating-point vectors in A2 and B2 into integer vectors to obtain A12 and B12, and the multiplication array calculates A11×B11. At time t3, the first data converter converts the floating-point vectors in A3 and B3 into integer vectors to obtain A13 and B13, the multiplication array calculates A12×B12, and the accumulator calculates the sum of A11×B11 and the historical accumulation result to obtain the accumulation result 0 + A11×B11, where the historical accumulation result at time t3 is 0. At time t4, the first data converter converts the floating-point vectors in A4 and B4 into integer vectors to obtain A14 and B14, the multiplication array calculates A13×B13, and the accumulator calculates the sum of A12×B12 and the historical accumulation result to obtain the accumulation result 0 + A11×B11 + A12×B12, where the historical accumulation result at time t4 is 0 + A11×B11; and so on.
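The timing in this example can be modeled with a short Python sketch. It is only an illustration and not part of the application: the function name `pipeline_schedule` and the stage labels are hypothetical, and each stage is assumed to take exactly one time step per data block.

```python
def pipeline_schedule(num_blocks):
    """Return, for each time step, the block index handled by each stage (or None)."""
    stages = ("convert", "multiply", "accumulate")  # hypothetical stage labels
    schedule = []
    # The last block leaves the accumulator at step num_blocks + 2.
    for t in range(1, num_blocks + len(stages)):
        row = {}
        for depth, stage in enumerate(stages):
            block = t - depth  # block processed by this stage at step t
            row[stage] = block if 1 <= block <= num_blocks else None
        schedule.append(row)
    return schedule


schedule = pipeline_schedule(4)  # blocks A1..A4 paired with B1..B4
for t, row in enumerate(schedule, start=1):
    print(f"t{t}: {row}")
```

Under these assumptions, step t3 shows block 3 in conversion, block 2 in multiplication, and block 1 in accumulation, which matches the description of time t3 above. Four blocks finish in six steps instead of the twelve a fully serial schedule would need.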

[0010] In conjunction with the first aspect, in some implementations of the first aspect, the chip further includes at least one cache for caching the M first shared exponents, the historical accumulation results, and the accumulation results.

[0011] The at least one cache refers to the storage space within the chip for temporary data storage, such as an on-chip register or cache.

[0012] In some possible implementation scenarios, historical accumulation results and the current accumulation result can be stored in the same cache. For example, the current accumulation result can be used to overwrite the historical accumulation result in the cache.

[0013] This application provides a chip that uses at least one cache to cache M first shared exponents, historical accumulation results, and accumulation results, which can save memory overhead compared with the prior art method of storing vector multiplication results in large-capacity memory.

[0014] In conjunction with the first aspect, in some implementations of the first aspect, the at least one cache includes: a first register for caching the M first shared exponents; and a second register for caching the historical accumulation result and the accumulation result.

[0015] The first register and the second register can be the same register or they can be different registers.

[0016] In some possible implementation scenarios, the M first shared exponents of the (h+1)th data block obtained at time t2 can overwrite the M first shared exponents of the hth data block stored in the first register at time t1, where h ≥ 1, and time t2 is later than time t1.

[0017] In some possible implementation scenarios, the accumulated result obtained at time t2 can overwrite the historical accumulated result stored in the second register at time t1.

[0018] This application provides a chip that uses registers to cache M first shared exponents, historical accumulation results, and accumulation results, which can save memory overhead compared with the prior art method of storing vector multiplication results in large-capacity memory.

[0019] In conjunction with the first aspect, in some implementations of the first aspect, the direct transmission method includes on-chip hardwired connections.

[0020] The multiplication array can obtain N third vectors from the first data converter through on-chip hardwired connections, and the accumulator can obtain N×P vector multiplication results from the multiplication array through on-chip hardwired connections. That is, the first data converter, multiplication array and accumulator can be directly connected through the interface between modules, hardware circuits, wires, etc., without the need for external dynamic random access memory (DRAM), which can reduce transmission latency and enhance the implementation effect of pipelined processing.

[0021] In conjunction with the first aspect, in some implementations of the first aspect, each first vector is a part of a matrix to be processed or a part of a vector to be processed.

[0022] The matrix or vector to be processed can be obtained from any one or more of image data, audio data, video data, or text data.

[0023] This application provides a chip that, when performing multiplication on a matrix or vector, can split the matrix or vector into multiple data blocks for processing. The size of these data blocks is typically much smaller than the size of the matrix or vector. A first data converter, a multiplication array, and an accumulator can process these data blocks in a pipelined manner, reducing data processing latency and improving the efficiency of matrix or vector multiplication.

[0024] In conjunction with the first aspect, in some implementations of the first aspect, the accumulator includes a Kulisch accumulator.

[0025] This application provides a chip in which the accumulator can be a Kulisch accumulator, which can more accurately represent the calculation results, thereby improving the accuracy and stability of numerical calculations.

[0026] In conjunction with the first aspect, in some implementations of the first aspect, the accumulator is also used to limit the bit width of the N×P vector multiplication results to a power of 2.

[0027] Each vector multiplication result includes a single value. In some possible implementation scenarios, the numerical range of the vector multiplication result may not be hardware-friendly. For example, based on the Float16 data range, the accumulator requires a data width of 41 bits ($2^{-24}$ to $2^{15}$, plus a sign bit). Since 41-bit integers are not hardware-friendly, this application can make a trade-off between precision and hardware by limiting the bit width of each vector multiplication result to a power of 2, for example, selecting 32 bits or 16 bits for accumulation.
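The 41-bit figure can be checked with a short Python sketch, which also shows one possible way to confine a value to a narrower width. The helper names are hypothetical, and saturation is only an assumption made for illustration; the text above does not state whether the limiting is done by saturation, by dropping low-order bits, or by another method.

```python
def fixed_point_width(min_exp, max_exp, signed=True):
    """Bits needed to hold every bit weight from 2**min_exp to 2**max_exp."""
    return (max_exp - min_exp + 1) + (1 if signed else 0)


def saturate(value, bits):
    """Clamp an integer to the signed two's-complement range of `bits` bits.

    Assumption for illustration only: limiting is modeled as saturation.
    """
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


# Float16 spans bit weights 2**-24 (smallest subnormal) to 2**15.
print(fixed_point_width(-24, 15))   # 41
print(saturate(1 << 40, 32))        # 2147483647
```

A 32-bit or 16-bit width trades some range or precision for a register size that maps directly onto standard integer datapaths.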

[0028] This application provides a chip that, before calculating the sum of N×P vector multiplication results and historical accumulation results based on M first shared exponents, can limit the bit width of the N×P vector multiplication results to a power of 2, thereby improving the friendliness of hardware design.

[0029] In conjunction with the first aspect, in some implementations of the first aspect, the first data converter is further configured to convert P second vectors into Q second shared exponents and the P fourth vectors, wherein multiple elements in each second vector include floating-point numbers, 1≤Q≤P; the accumulator is specifically configured to calculate the sum of the N×P vector multiplication results and the historical accumulation results based on the M first shared exponents and the Q second shared exponents.

[0030] The chip multiplies N first vectors by P second vectors. The elements in the first vectors are floating-point numbers, the elements in the second vectors can be either integers or floating-point numbers, and the first and second vectors contain the same number of elements. If the elements in the second vectors are integers, the P second vectors can serve directly as the P fourth vectors, so the conversion step is omitted: the N third vectors are multiplied by the P second vectors to obtain N×P vector multiplication results, and the accumulator calculates the sum of the N×P vector multiplication results and the historical accumulation results based on the M first shared exponents. If the elements in the second vectors are floating-point numbers, the first data converter converts the P second vectors into Q second shared exponents and the P fourth vectors. The N third vectors are then multiplied by the P fourth vectors, and the accumulator calculates the sum of the N×P vector multiplication results and the historical accumulation results based on the M first shared exponents and the Q second shared exponents, yielding N×P values.
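One way to picture how the shared exponents enter the accumulation is the following Python sketch. It is a simplified model under stated assumptions, not the accumulator of the application: the accumulator is assumed to be a fixed-point integer whose least significant bit has weight 2**ACC_LSB_EXP, each integer product is assumed to represent int_product × 2**(first_exp + second_exp), and all names are hypothetical.

```python
ACC_LSB_EXP = -48  # hypothetical weight of the accumulator's least significant bit


def accumulate(history, int_product, first_exp, second_exp=0):
    """Add int_product * 2**(first_exp + second_exp) to a fixed-point accumulator.

    `history` and the return value are integers in units of 2**ACC_LSB_EXP.
    `second_exp` stays 0 when the second vectors were already integers.
    """
    shift = first_exp + second_exp - ACC_LSB_EXP
    if shift < 0:
        raise ValueError("product has bits below the accumulator's LSB")
    return history + (int_product << shift)


def accumulator_to_float(acc):
    """Model of the final integer-to-floating-point conversion."""
    return acc * 2.0 ** ACC_LSB_EXP


acc = 0                                   # historical accumulation result starts at 0
acc = accumulate(acc, 1024, -6, -3)       # both operands were floating point
acc = accumulate(acc, 3, -2)              # second operand was already integer
print(accumulator_to_float(acc))          # 2.75
```

In this model the only per-block floating-point work is a shift derived from the shared exponents; the additions themselves are integer additions.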

[0031] This application provides a chip that, when performing vector multiplication, can convert floating-point vectors into integer vectors, enabling the multiplication array to perform integer vector multiplication, and uses a second data converter to restore the final calculation result to floating-point numbers, thereby improving the computational efficiency of floating-point vector multiplication.

[0032] In conjunction with the first aspect, in some implementations of the first aspect, the first data converter, the multiplication array, and the accumulator operate in a pipelined manner.

[0033] This application provides a chip that, when performing multiplication of two matrices, can split the two matrices into multiple data blocks for processing. A first data converter, a multiplication array, and an accumulator operate in a pipelined manner. For example, when the first data converter converts the floating-point vector in the (h+2)th data block into an integer vector, the multiplication array can perform the multiplication operation in the (h+1)th data block, while the accumulator calculates the sum of the first h data blocks. This way, most of the data conversion overhead can be hidden, improving the overall computing performance of the chip.

[0034] In a second aspect, a computing device is provided, comprising a chip as described in the first aspect or any possible implementation thereof.

[0035] For example, the computing device may be a mobile phone, tablet computer, laptop computer, desktop computer, server, in-vehicle device, wearable device, etc.

[0036] In a third aspect, a data processing method for a chip is provided, comprising: a first data converter in the chip converting N first vectors into N third vectors and M first shared exponents, wherein multiple elements in each first vector are floating-point numbers and multiple elements in each third vector are integers, N≥1, 1≤M≤N; a multiplication array in the chip obtaining the N third vectors from the first data converter via direct transmission, multiplying P fourth vectors with the N third vectors to obtain N×P vector multiplication results, wherein multiple elements in the P fourth vectors are integers, P≥1; an accumulator in the chip obtaining the N×P vector multiplication results from the multiplication array via the direct transmission method, calculating the sum of the N×P vector multiplication results and historical accumulation results based on the M first shared exponents to obtain an accumulation result; and a second data converter in the chip converting the integers in the accumulation result into floating-point numbers.

[0037] This application provides a data processing method for a chip. A first data converter can convert a floating-point vector into an integer vector, enabling the multiplication array to perform integer vector multiplication operations. A second data converter is then used to restore the final calculation result to a floating-point number, thereby improving the computational efficiency of floating-point vector multiplication.

[0038] Meanwhile, when multiplying two matrices, the data processing method provided in this application can split the two matrices into multiple data blocks for processing. The size of the data block is usually much smaller than the size of the original matrix. The first data converter, multiplication array and accumulator can process these data blocks in a pipelined manner, reducing data processing latency and improving the efficiency of matrix multiplication calculation.

Brief Description of the Drawings

[0039] Figure 1 is a schematic diagram of an electronic device provided in an embodiment of this application.

[0040] Figure 2 is a schematic diagram of the structure of an AI processor provided in an embodiment of this application.

[0041] Figure 3 is a schematic diagram of a floating-point addition provided in an embodiment of this application.

[0042] Figure 4 is a schematic diagram of the structure of a chip provided in an embodiment of this application.

[0043] Figure 5 is a schematic diagram of another chip structure provided in an embodiment of this application.

[0044] Figure 6 is a schematic diagram of a chip data processing method provided in an embodiment of this application.

[0045] Figure 7 is a schematic diagram of another chip data processing method provided in an embodiment of this application.

[0046] Figure 8 is an exemplary flowchart of a chip data processing method provided in an embodiment of this application.

[0047] Figure 9 is a schematic diagram of converting a floating-point vector into an integer vector according to an embodiment of this application.

[0048] Figure 10 is a schematic diagram of a process for converting a floating-point vector into an integer vector according to an embodiment of this application.

[0049] Figure 11 is a schematic diagram of the accuracy of a Kulisch accumulator provided in an embodiment of this application.

[0050] Figure 12 is a schematic diagram of limiting the bit width of a vector multiplication result provided in an embodiment of this application.

[0051] Figure 13 is a schematic diagram of an improved Kulisch accumulator provided in an embodiment of this application.

[0052] Figure 14 is a timing diagram of a chip data processing method provided in an embodiment of this application.

[0053] Figure 15 is a schematic diagram of another method for converting a floating-point vector into an integer vector according to an embodiment of this application.

Detailed Description

[0054] The technical solutions in the embodiments of this application will now be described with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of this application without creative effort should fall within the scope of protection of this application.

[0055] In the embodiments of this application, the words "exemplary," "for example," etc., are used to indicate that they are examples, illustrations, or descriptions. Any embodiment or design that is described as "exemplary" in this application should not be construed as being more preferred or advantageous than other embodiments or design options. Specifically, the use of the term "exemplary" is intended to present the concept in a concrete manner.

[0056] The business scenarios described in the embodiments of this application are for the purpose of more clearly illustrating the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions provided in the embodiments of this application. As those skilled in the art will know, with the emergence of new business scenarios, the technical solutions provided in the embodiments of this application are also applicable to similar technical problems.

[0057] Hereinafter, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.

[0058] References to "one embodiment" or "some embodiments" as described in this specification mean that one or more embodiments of this application include a specific feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this specification do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless otherwise specifically emphasized.

[0059] In this application, "at least one" means one or more, and "more than one" means two or more. "And / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can mean: A alone, A and B simultaneously, and B alone, where A and B can be singular or plural. The character " / " generally indicates that the preceding and following related objects are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, or c can mean: a, b, c, ab, ac, bc, or abc, where a, b, and c can be single or multiple.

[0060] To facilitate understanding of the embodiments of this application, some definitions involved in this application will be briefly explained first.

[0061] 1. Floating-point numbers: These typically consist of a sign bit, an exponent, and a mantissa. The exponent determines the position of the decimal point in the mantissa to represent values of different orders of magnitude. For example, Float16 (also known as half-precision floating-point numbers) uses 16 bits (2 bytes) to represent a floating-point number, including 1 sign bit, 5 exponent bits, and 10 mantissa bits.
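The Float16 layout described above can be inspected with Python's standard `struct` module, whose `e` format character packs IEEE 754 half-precision values. The function name `decode_float16` is hypothetical, and the exponent bias of 15 comes from the IEEE 754 binary16 format rather than from this application.

```python
import struct


def decode_float16(value):
    """Split a value (rounded to Float16) into its sign, exponent, and fraction fields."""
    (bits,) = struct.unpack("<H", struct.pack("<e", value))
    sign = bits >> 15                # 1 sign bit
    exponent = (bits >> 10) & 0x1F   # 5 exponent bits, stored with a bias of 15
    fraction = bits & 0x3FF          # 10 mantissa (fraction) bits
    return sign, exponent, fraction


print(decode_float16(-1.5))  # (1, 15, 512)
```

Here -1.5 is stored as sign 1, biased exponent 15 (a true exponent of 0), and fraction 512/1024 = 0.5, giving -(1 + 0.5) × 2^0.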

[0062] 2. Kulisch accumulator: An accumulator designed to improve the precision of floating-point calculations. The main idea is to retain more significant digits during calculation, thus reducing the impact of rounding errors. By increasing the number of precision bits, the Kulisch accumulator can represent calculation results more accurately, thereby improving the accuracy and stability of numerical calculations.
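The benefit of retaining more bits can be seen in a small Python sketch that compares exact wide-integer accumulation with accumulation that rounds to Float16 after every operation. It illustrates the general idea only and is not the accumulator of this application: Python's unbounded integers stand in for a wide hardware register, and all function names are hypothetical.

```python
import struct
from fractions import Fraction


def round_to_float16(value):
    """Round a Python float to the nearest Float16 value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def to_fixed(value, frac_bits=24):
    """Exact integer k with value == k * 2**-frac_bits (holds for any Float16 value)."""
    scaled = Fraction(value) * (1 << frac_bits)
    assert scaled.denominator == 1, "value is not on the fixed-point grid"
    return scaled.numerator


def kulisch_dot(xs, ys):
    """Accumulate products exactly in a wide integer and convert once at the end."""
    acc = 0
    for x, y in zip(xs, ys):
        acc += to_fixed(x) * to_fixed(y)  # exact product on a 2**-48 grid
    return float(Fraction(acc, 1 << 48))


def naive_dot(xs, ys):
    """Round to Float16 after every multiplication and every addition."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = round_to_float16(acc + round_to_float16(x * y))
    return acc


xs = [2048.0, 0.5, 0.5, -2048.0]
ys = [1.0, 1.0, 1.0, 1.0]
print(kulisch_dot(xs, ys))  # 1.0
print(naive_dot(xs, ys))    # 0.0
```

Near 2048 the spacing between adjacent Float16 values is 2, so each added 0.5 is lost to rounding in the naive loop, while the wide accumulator keeps it.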

[0063] Figure 1 is a schematic diagram of an electronic device provided in an embodiment of this application.

[0064] The electronic device 100 can be an electronic device or a module, chip, chipset, circuit board, or component integrated within an electronic device. The electronic device can be user equipment (UE), such as a mobile phone, tablet computer, laptop computer, or image capturing device, among other types of devices. The electronic device can be equipped with a data acquisition device 105, which includes image sensors, optical sensors, sound sensors, etc., for acquiring various types of data. The electronic device can also install various software applications, such as camera applications, video call applications, or online video recording applications, to drive the data acquisition device 105 to acquire data. Users can use the data acquisition device 105 to acquire data by launching these applications.

[0065] Electronic device 100 may specifically be a chip or chipset, a circuit board carrying the chip or chipset, or an electronic device including said circuit board, but this embodiment is not limited thereto. The chip or chipset or the circuit board carrying the chip or chipset can operate under software control. Electronic device 100 includes one or more processors, such as signal processor 102 and AI processor 101. Optionally, the one or more processors may be integrated within one or more chips, which can be considered a chipset. When one or more processors are integrated within the same chip, the chip is also called a system on a chip (SOC). In addition to the one or more processors, electronic device 100 may also include one or more other components, such as memory 103. In one possible implementation, memory 103 may be located within the same system on a chip in electronic device 100 as AI processor 101 and signal processor 102, i.e., memory 103 is integrated into the SOC shown in Figure 1. In this case, memory 103 may include on-chip random access memory (RAM).

[0066] It should be understood that in some possible application scenarios, the electronic device 100 may only include the AI processor 101, while the data acquisition device 105, signal processor 102 and memory 103 are optional components.

[0067] In this embodiment, the AI processor 101 may include a dedicated neural processor such as a neural network processing unit (NPU), including but not limited to convolutional neural network processors, tensor processors, or neural processing engines. The AI processor may be a standalone chip or integrated into other digital logic chips, including but not limited to central processing units (CPUs), graphics processing units (GPUs), or digital signal processors (DSPs). The signal processor 102 may have multiple hardware modules or run necessary software programs to process the acquired data or communicate with the AI processor 101. The signal processor 102 and the AI processor 101 can communicate via direct hardware connection or through signal forwarding by a controller.

[0068] Figure 2 is a schematic diagram of the structure of an AI processor provided in an embodiment of this application.

[0069] AI processor 101 includes matrix computing array 210, vector computing array 220, storage unit 230 and data interface 240.

[0070] The matrix computing array 210 is used to accelerate matrix multiplication and is the main part of the AI processor 101 that provides computing power.

[0071] The vector computing array 220 is used to implement activation functions and tensor transformations in neural networks, which are typically calculations other than matrix multiplication.

[0072] Storage unit 230 is a high-speed cache inside AI processor 101.

[0073] Data interface 240 is used to realize data transmission between different modules within AI processor 101, and data transmission between AI processor 101 and external storage units such as system storage unit 250. System storage unit 250 is a storage unit external to AI processor 101, generally a memory unit, such as memory 103. Typically, system storage unit 250 includes DRAM.

[0074] With the continuous development of AI algorithms, the inference performance of neural networks has attracted increasing attention from academia and industry. Neural networks involve a large number of matrix multiplication calculations. To accelerate neural network inference, AI processors typically contain acceleration units for matrix multiplication or vector inner product operations, used to speed up matrix multiplication operations such as A×B+C.

[0075] Mathematically, matrix multiplication can be viewed as the parallel computation of m × n vector inner products, where m and n are integers greater than 1:

$$A \times B = \begin{bmatrix} a_1 b_1 & a_1 b_2 & \cdots & a_1 b_n \\ a_2 b_1 & a_2 b_2 & \cdots & a_2 b_n \\ \vdots & \vdots & \ddots & \vdots \\ a_m b_1 & a_m b_2 & \cdots & a_m b_n \end{bmatrix}$$

[0076] where $B^T = [b_1, b_2, \ldots, b_n]$, $a_i$ denotes a row vector of matrix A, $b_j$ denotes a column vector of matrix B, and $a_i b_j$ denotes the inner product of vector $a_i$ and vector $b_j$, with 1 ≤ i ≤ m and 1 ≤ j ≤ n. Each vector may include one or more elements. Each element represents information or data processed by matrix multiplication, which may be activation data or weights, though this embodiment is not limited in this respect.

[0077] Multiplication arrays typically consist of multiple vector inner product calculation units. A closer look at the vector inner product operation shows that it comprises two parts: 1. multiplying each element in vector $a_i$ by the corresponding element in vector $b_j$; 2. summing the results of the multiplications. Assume that vector $a_i$ and vector $b_j$ each include q elements; the inner product is then given by formula (1), where q is an integer greater than or equal to 1.

$$a_i b_j = \sum_{k=1}^{q} a_{i,k}\, b_{j,k} \quad (1)$$
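The view of matrix multiplication as m × n independent inner products can be written out directly in Python. This is a plain software sketch with hypothetical function names; a hardware multiplication array would evaluate the inner products in parallel rather than in nested loops.

```python
def inner_product(a, b):
    """Part 1: multiply corresponding elements; part 2: sum the products."""
    assert len(a) == len(b)
    return sum(x * y for x, y in zip(a, b))


def matmul_as_inner_products(A, B):
    """A is m x q (list of rows), B is q x n; returns the m x n product."""
    columns = list(zip(*B))  # b_1 ... b_n, the column vectors of B
    return [[inner_product(row, col) for col in columns] for row in A]


A = [[1, 2], [3, 4], [5, 6]]   # m = 3, q = 2
B = [[7, 8], [9, 10]]          # q = 2, n = 2
C = matmul_as_inner_products(A, B)
print(C)  # [[25, 28], [57, 64], [89, 100]]
```

Each entry of the result depends on exactly one row of A and one column of B, which is why the m × n inner products can be computed independently.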

[0078] Typically, for compatibility, AI processors 101 provide support for accelerated half-precision floating-point (Float16) matrix multiplication, meaning that elements in the left matrix A and right matrix B include half-precision floating-point numbers. Optimization of the vector inner product operation unit for half-precision floating-point numbers is an important optimization direction for the device.

[0079] Formula (2) represents the floating-point multiplication in vector inner product operations, where the two floating-point numbers are $x = (-1)^{s_x} \times 2^{E_x - E_0} M_x$ and $y = (-1)^{s_y} \times 2^{E_y - E_0} M_y$. The multiplication process includes: 1. adding the exponents; 2. multiplying the mantissas; 3. normalizing the result to a floating-point number. Here $s_x$ is the sign bit of the floating-point number x, $E_x - E_0$ is the exponent of x, and $M_x$ is the mantissa of x; $s_y$ is the sign bit of the floating-point number y, $E_y - E_0$ is the exponent of y, and $M_y$ is the mantissa of y.

$$x \times y = (-1)^{s_x + s_y} \times 2^{E_x + E_y - 2E_0} M_x \cdot M_y \quad (2)$$
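Formula (2) can be followed step by step on Float16 bit fields in Python. The sketch below is illustrative only: the function names are hypothetical, E0 is taken to be the Float16 exponent bias of 15 from IEEE 754, only normal (non-subnormal, finite) inputs are handled, and the final normalization and rounding step is left out so that the exact product is visible.

```python
import struct

BIAS = 15  # E0 for Float16, from the IEEE 754 binary16 format


def fields(value):
    """Return the (sign, biased exponent, fraction) bit fields of a Float16 value."""
    (bits,) = struct.unpack("<H", struct.pack("<e", value))
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF


def multiply_by_fields(x, y):
    """Multiply two normal Float16 values following formula (2), without rounding."""
    sx, ex, fx = fields(x)
    sy, ey, fy = fields(y)
    assert 0 < ex < 31 and 0 < ey < 31, "sketch handles normal numbers only"
    sign = (sx + sy) % 2                    # (-1) ** (sx + sy)
    exponent = (ex - BIAS) + (ey - BIAS)    # Ex + Ey - 2 * E0
    mantissa = (1024 + fx) * (1024 + fy)    # Mx * My as integers scaled by 2**20
    return (-1) ** sign * mantissa * 2.0 ** (exponent - 20)


print(multiply_by_fields(1.5, -2.5))  # -3.75
```

The multiplication itself needs only an integer add for the exponents and an integer multiply for the mantissas; the costly part in hardware is the normalization that follows.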

[0080] Adding two floating-point numbers is more complicated. It requires first aligning them according to their exponents, then shifting the mantissas, calculating the addition, and finally normalizing the result into a single floating-point number. Figure 3 is a schematic diagram of floating-point addition provided in an embodiment of this application, where $x = 1.000 \times 2^{1}$ and $y = -1.011 \times 2^{-2}$. In step 1, the exponents of x and y are aligned to $2^{1}$, giving $y = -0.0010110 \times 2^{1}$. In step 2, x and y are added, and the result is $0.1101010 \times 2^{1}$. In step 3, $0.1101010 \times 2^{1}$ is normalized by shifting to obtain $1.101010 \times 2^{0}$. In step 4, $1.101010 \times 2^{0}$ is rounded to obtain $1.101 \times 2^{0}$.
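The four steps of the Figure 3 example can be traced with integer arithmetic in Python. This is a minimal sketch with a hypothetical function name and a toy format of 3 fraction bits; rounding to nearest with ties to even is an assumption, since the text does not name the rounding mode (for this example, truncation gives the same result).

```python
def fp_add(sig_x, exp_x, sig_y, exp_y, frac_bits=3):
    """Add two numbers of the form sig * 2**(exp - frac_bits).

    sig is a signed integer significand with frac_bits fraction bits,
    so 1.011b with frac_bits=3 is written as sig=0b1011.
    Returns (sig, exp) in the same format.
    """
    # Step 1: align both operands to the larger exponent. Guard bits keep
    # the bits that would otherwise be shifted out of the smaller operand.
    exp = max(exp_x, exp_y)
    guard = abs(exp_x - exp_y)
    aligned_x = sig_x << (guard - (exp - exp_x))
    aligned_y = sig_y << (guard - (exp - exp_y))
    # Step 2: add the aligned significands (frac_bits + guard fraction bits).
    total = aligned_x + aligned_y
    if total == 0:
        return 0, 0
    sign = -1 if total < 0 else 1
    magnitude = abs(total)
    # Step 3: normalize so exactly one bit sits left of the binary point.
    length = magnitude.bit_length()
    exp += length - (frac_bits + guard + 1)
    # Step 4: round back to frac_bits fraction bits (nearest, ties to even).
    drop = length - (frac_bits + 1)
    if drop > 0:
        half = 1 << (drop - 1)
        remainder = magnitude & ((1 << drop) - 1)
        magnitude >>= drop
        if remainder > half or (remainder == half and magnitude & 1):
            magnitude += 1
            if magnitude.bit_length() > frac_bits + 1:  # rounding carried out
                magnitude >>= 1
                exp += 1
    else:
        magnitude <<= -drop
    return sign * magnitude, exp


# x = 1.000b * 2**1, y = -1.011b * 2**-2
sig, exp = fp_add(0b1000, 1, -0b1011, -2)
print(bin(sig), exp)  # 0b1101 0
```

The result 0b1101 with exponent 0 is $1.101 \times 2^{0}$, matching step 4. Every one of these steps needs dedicated shifters and leading-bit detection in hardware, which an integer adder does not.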

[0081] The acceleration unit of the AI processor 101 typically computes multiple vector inner products in parallel. Each vector inner product unit requires hardware to perform the aforementioned alignment, shifting, and normalization calculations. This results in the power consumption of floating-point addition being significantly higher than that of integer addition, limiting the efficiency of floating-point vector multiplication.

[0082] This application provides a chip and a chip data processing method that can convert floating-point vectors into integer vectors, perform integer vector inner product calculations, and restore the final calculation result to floating-point numbers. The energy efficiency and area efficiency of integer vector inner product calculations are much higher than those of floating-point vector inner product calculations.
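The overall flow (floating point to integers with a shared exponent, integer inner product, back to floating point) can be sketched in Python as follows. The conversion rule used here is an assumption chosen for illustration, not the method of this application: each vector gets one shared exponent derived from its largest magnitude, and elements are rounded to signed 8-bit integers. All names are hypothetical.

```python
import math


def to_shared_exponent(vector, int_bits=8):
    """Quantize floats to signed integers sharing one exponent: v ~= q * 2**shared_exp."""
    largest = max(abs(v) for v in vector)
    if largest == 0.0:
        return [0] * len(vector), 0
    _, e = math.frexp(largest)        # largest = m * 2**e with 0.5 <= m < 1
    shared_exp = e - (int_bits - 1)
    limit = (1 << (int_bits - 1)) - 1
    ints = [max(-limit, min(limit, round(v / 2.0 ** shared_exp))) for v in vector]
    return ints, shared_exp


def block_dot(first, second):
    """Inner product computed with integer multiplications and additions only."""
    a_int, a_exp = to_shared_exponent(first)    # role of the first data converter
    b_int, b_exp = to_shared_exponent(second)
    product = sum(x * y for x, y in zip(a_int, b_int))  # role of the multiplication array
    return product * 2.0 ** (a_exp + b_exp)     # role of the second data converter


print(to_shared_exponent([1.0, 0.5, -0.25]))          # ([64, 32, -16], -6)
print(block_dot([1.0, 0.5, -0.25], [2.0, 4.0, 8.0]))  # 2.0
```

The exponents are handled once per vector rather than once per element, so the inner loop contains no alignment, shifting, or normalization. With inputs that are not exactly representable in 8 bits, this sketch would introduce quantization error.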

[0083] Figure 4 is a schematic diagram of the structure of a chip provided in an embodiment of this application.

[0084] Chip 400 includes a first data converter 410, a multiplication array 420, an accumulator 430, and a second data converter 440. The first data converter 410, multiplication array 420, accumulator 430, and second data converter 440 may be located within the AI processor of chip 400, specifically within the matrix calculation array of the AI processor, but this embodiment is not limited to this.

[0085] The first data converter 410 is coupled on-chip to the multiplication array 420, the multiplication array 420 is coupled on-chip to the accumulator 430, and the second data converter 440 is coupled on-chip to the accumulator 430.

[0086] Chip 400 is used to multiply N first vectors by P second vectors. Multiple elements in each first vector are floating-point numbers, and multiple elements in each second vector can be either integers or floating-point numbers. The first and second vectors contain the same number of elements, N≥1, P≥1.

[0087] The first data converter 410 is used to convert N first vectors into N third vectors and M first shared exponents, wherein multiple elements in each third vector are integers, 1≤M≤N.

[0088] Each first vector can be a part of a matrix to be processed or a part of a vector to be processed. The number of elements in each first vector can be equal to the computational parallelism of the multiplication array 420.

[0089] The multiplication array 420 is used to obtain the N third vectors from the first data converter 410 in a direct transmission manner, and multiply the N third vectors by P fourth vectors to obtain N×P vector multiplication results, where multiple elements of the P fourth vectors are integers. Each vector multiplication result includes a value.

[0090] This direct transmission method eliminates the need to transfer data to external DRAM, reducing data transmission latency and improving the computational efficiency of floating-point vector multiplication. For example, this direct transmission method can be achieved through on-chip hardwired connections, which further reduces transmission latency.

[0091] When multiple elements in the second vector are integers, the P second vectors can be the P fourth vectors, thus omitting the conversion step. The multiplication array 420 can multiply the N converted third vectors with the P second vectors to obtain N×P vector multiplication results. When multiple elements in the second vector are floating-point numbers, the first data converter 410 can convert the P second vectors into Q second shared exponents and the P fourth vectors. The multiplication array 420 can then multiply the N converted third vectors with the P converted fourth vectors, where 1≤Q≤P.

[0092] The accumulator 430 is used to obtain the N×P vector multiplication results from the multiplication array 420 in this direct transmission manner, and to calculate, according to the M first shared exponents, the sum of the N×P vector multiplication results and the historical accumulation results to obtain the accumulation result. The accumulation result may include N×P values.

[0093] When multiple elements in the second vector are floating-point numbers, the accumulator 430 will calculate the sum of the N×P vector multiplication results and the historical accumulation results based on the M first shared exponents and Q second shared exponents.

[0094] The second data converter 440 is used to convert the integers in the accumulation result into floating-point numbers. This application does not limit the specific type of the floating-point numbers, which may be, for example, Float16, Float8, Float32, or Float64.

[0095] Figure 5 is a schematic diagram of another chip structure provided in an embodiment of this application.

[0096] In this embodiment, chip 400 may further include at least one buffer for caching the M first shared exponents, historical accumulation results, and accumulation results. The at least one buffer refers to the storage space within the chip for temporary data storage, such as an on-chip register, cache, or static random-access memory (SRAM). The following embodiments mainly use registers as an example.

[0097] Taking the at least one buffer including a first register and a second register as an example, as shown in Figure 5, relative to chip 400, chip 500 also includes a first register 431 and a second register 432. The first register 431 and the second register 432 may be located in the AI processor of chip 500, specifically, they may be located in the matrix calculation array of the AI processor, but this embodiment is not limited to this.

[0098] The first register 431 is coupled on-chip to the first data converter 410 and the accumulator 430 respectively, and the second register 432 is coupled on-chip to the accumulator 430.

[0099] The first register 431 is used to cache the M first shared exponents, and optionally, it can also cache the Q second shared exponents.

[0100] The second register 432 is used to cache the historical accumulation result and the current accumulation result.

[0101] In some possible implementation scenarios, when performing a multiplication operation on two matrices, chip 500 can split the two matrices into multiple data blocks for processing. The M first shared exponents of the (h+1)th data block obtained at time t2 can overwrite the M first shared exponents of the h-th data block stored in the first register 431 at time t1, where h ≥ 1, and time t2 is later than time t1. In this embodiment, overwriting existing data with new data means that the new data replaces the existing data.

[0102] In some possible implementation scenarios, the accumulated result obtained at time t2 can overwrite the historical accumulated result stored in the second register 432 at time t1.

[0103] Figure 6 is a schematic diagram of a chip data processing method provided in an embodiment of this application. The data processing method shown in Figure 6 can perform vector multiplication between a floating-point type and an integer type. The floating-point type includes, but is not limited to, Float16, Float8, Float32, Float64, etc., and the integer type includes, but is not limited to, INT4, INT8, INT16, etc.

[0104] Taking chip 400 as an example, the first data converter 410 in chip 400 converts N first vectors into N third vectors and M first shared exponents. Multiple elements in each first vector are floating-point numbers, and multiple elements in each third vector are integers, where N≥1 and 1≤M≤N.

[0105] When M = N, each first vector corresponds to one first shared exponent; when M is less than N, multiple first vectors correspond to one first shared exponent. The first shared exponent can be the maximum exponent of the elements included in the corresponding first vector, or it can be the maximum exponent minus a preset value.

[0106] The multiplication array 420 in chip 400 obtains the N third vectors from the first data converter 410 in a direct transmission manner, and multiplies the P fourth vectors with the N third vectors to obtain N×P vector multiplication results, where multiple elements in the P fourth vectors are integers.

[0107] The accumulator 430 in chip 400 obtains the N×P vector multiplication results from the multiplication array 420 in a direct transmission manner, and calculates the sum of the N×P vector multiplication results and the historical accumulation results based on the M first shared exponents to obtain the accumulation result.

[0108] The second data converter 440 in chip 400 converts the integer in the accumulated result into a floating-point number.

[0109] The first data converter 410 can directly send the M first shared exponents to the accumulator 430, or it can store the M first shared exponents in a cache, from which the accumulator 430 reads them.

[0110] Figure 7 is a schematic diagram of another chip data processing method provided in an embodiment of this application. The data processing method shown in Figure 7 can realize vector multiplication of floating-point numbers. The floating-point number type includes, but is not limited to, Float16, Float8, Float32, Float64, etc.

[0111] Taking chip 400 as an example, the first data converter 410 in chip 400 converts N first vectors into N third vectors and M first shared exponents. Multiple elements in each first vector are floating-point numbers, and multiple elements in each third vector are integers, N≥1, 1≤M≤N. At the same time, the first data converter 410 converts P second vectors into Q second shared exponents and P fourth vectors. Multiple elements in each second vector include floating-point numbers, and multiple elements in each fourth vector are integers, 1≤Q≤P.

[0112] When M = N, each first vector corresponds to one first shared exponent; when M is less than N, multiple first vectors correspond to one first shared exponent. The first shared exponent can be the maximum exponent of the elements included in the corresponding first vector, or it can be the maximum exponent minus a preset value.

[0113] When Q = P, each second vector corresponds to one second shared exponent; when Q is less than P, multiple second vectors correspond to one second shared exponent. The second shared exponent can be the maximum exponent of the elements included in the corresponding second vector, or it can be the maximum exponent minus a preset value.

[0114] The multiplication array 420 in chip 400 obtains the N third vectors and P fourth vectors from the first data converter 410 in a direct transmission manner, and multiplies the P fourth vectors with the N third vectors to obtain N×P vector multiplication results.

[0115] The accumulator 430 in chip 400 obtains the N×P vector multiplication results from the multiplication array 420 in a direct transmission manner, and calculates the sum of the N×P vector multiplication results and the historical accumulation results based on the M first shared exponents and Q second shared exponents to obtain the accumulation result.

[0116] The second data converter 440 in chip 400 converts the integer in the accumulated result into a floating-point number.

[0117] The first data converter 410 can directly send the M first shared exponents and the Q second shared exponents to the accumulator 430, or it can store them in a cache, from which the accumulator 430 reads the M first shared exponents and the Q second shared exponents.

[0118] Figure 8 is an exemplary flowchart of a chip data processing method provided in an embodiment of this application.

[0119] Figure 8 uses matrix A×B as an example to specifically describe the multiplication calculation of floating-point matrices in the chip. Both matrices A and B are floating-point matrices, and each element in both matrices is in Float16 data format. Matrix A is split into G data blocks A1 to AG, and matrix B is split into G data blocks B1 to BG. Each data block A1 to AG includes N first vectors, and each data block B1 to BG includes P second vectors. A×B = A1×B1 + A2×B2 + ... + AG×BG. The chip can use a data block-based pipelined processing mechanism. When the multiplication array 420 calculates the multiplication operation of the first data block of matrix A and matrix B, the first data converter 410 can simultaneously initiate the conversion of the second data block of matrix A and matrix B. In this way, the conversion overhead from the second data block to the Gth data block can be masked, improving the efficiency of matrix multiplication calculation. G is an integer greater than 1.
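The identity A×B = A1×B1 + A2×B2 + ... + AG×BG can be checked with a short Python sketch. It assumes that A is split along its columns and B along its rows (the shared inner dimension), which is one reading of the block split described above; the helper names `matmul` and `blocked_matmul` are hypothetical, and G is assumed to divide the inner dimension evenly.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def blocked_matmul(a, b, g):
    """Accumulate G partial products A_h x B_h over the inner dimension."""
    step = len(b) // g                      # inner-dimension size of one block
    n, p = len(a), len(b[0])
    acc = [[0] * p for _ in range(n)]       # historical accumulation result
    for h in range(g):
        lo, hi = h * step, (h + 1) * step
        a_blk = [row[lo:hi] for row in a]   # N first vectors of block h
        b_blk = b[lo:hi]                    # rows lo..hi of B (P second vectors)
        part = matmul(a_blk, b_blk)         # N x P vector multiplication results
        acc = [[acc[i][j] + part[i][j] for j in range(p)] for i in range(n)]
    return acc

a = [[1, 2, 3, 4], [5, 6, 7, 8]]
b = [[1, 0], [0, 1], [2, 1], [1, 2]]
print(blocked_matmul(a, b, 2) == matmul(a, b))
```

Because each block contributes an independent partial product, the conversion of block h+1 can start while block h is still being multiplied.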

[0120] Step 810: convert the floating-point vectors in the data block into integer vectors.

[0121] Taking data blocks A1 and B1 as examples, A1 contains N first vectors and B1 contains P second vectors. The number of elements in the first and second vectors is the same. The number of elements in the first and second vectors is related to the computational parallelism of the multiplication array 420. For example, when the computational parallelism of the multiplication array 420 is 8, the first and second vectors can include 8 elements.

[0122] The following example uses one of the N first vectors to illustrate the data conversion process. Assume that one of the input first vectors is $v_{float1}$. $v_{float1}$ includes k elements, each of which is in Float16 data format, meaning that each element contains a 1-bit sign $s_i$, a 5-bit exponent $e_i$, and a 10-bit mantissa $m_i$, where 0≤i<k and k is an integer greater than 1.

[0123] Figure 9 is a schematic diagram illustrating the conversion of a floating-point vector to an integer vector according to an embodiment of this application. The conversion of a floating-point vector to an integer vector includes two steps: pre-alignment and shift conversion.

[0124] During the pre-alignment process, the shared exponent of the first vector $v_{float1}$ is first determined. It can be the maximum exponent of the elements of $v_{float1}$, or the maximum exponent minus a preset value. The maximum exponent of the elements of $v_{float1}$ can be obtained through formula (3), where $e_{shared}$ denotes the shared exponent of $v_{float1}$; J in formula (4) represents the preset value. For example, this preset value can be equal to 14.
$$e_{shared} = \max(e_0, e_1, \ldots, e_{k-1}) \quad (3)$$
$$e_{shared} = \max(e_0, e_1, \ldots, e_{k-1}) - J \quad (4)$$

[0125] Then, using formula (5), calculate the difference $\Delta e_i$ between the exponent of each element of $v_{float1}$ and the shared exponent.
$$\Delta e_i = e_{shared} - e_i \quad (5)$$

[0126] In the shift conversion stage, the mantissa of each element of $v_{float1}$ is shifted according to the exponent difference $\Delta e_i$. When $e_{shared}$ is the maximum exponent of $v_{float1}$, all element shifts are right shifts, as shown in formula (6), where $m_i$ is the mantissa of the element before the shift and $m_i'$ is the mantissa of the element after the shift.

[0127] When $e_{shared}$ equals the maximum exponent of $v_{float1}$ minus J, $m_i$ is first shifted left by J bits and then shifted right by $\Delta e_i$, as shown in formula (7), where $m_i$ is the mantissa of the element before the shift and $m_i'$ is the mantissa of the element after the shift. This method reduces precision loss.

[0128] Next, the shifted $m_i'$ is combined with the sign bit $s_i$ of the element and converted to an integer $m'_{INTi}$. The conversion method includes, but is not limited to, rounding to nearest, rounding up, or rounding down. For example, if the shifted $m_i'$ is 1.001, it becomes 1 after conversion to an integer.

[0129] Figure 10 is a flowchart illustrating the process of converting a floating-point vector into an integer vector according to an embodiment of this application. A first vector is input to the first data converter 410, and the data format of the elements in the first vector is Float16. First, the exponent bits of all elements in the first vector are extracted. The extracted exponent bits are in INT5 format, i.e., they include 5 integer bits. Then, the shared exponent of the first vector is calculated using formula (3) or formula (4), and this shared exponent is cached. Then, the difference between the exponent of each element in the first vector and the shared exponent is calculated using formula (5). Finally, the mantissa of each element is shifted according to the exponent difference, using formula (6) or formula (7), and converted into an integer in INT16 data format. INT16 includes 16 integer bits. Optionally, the mantissa of each element can also be shifted according to the exponent difference and converted into an integer in INT8 data format. This application does not limit the format of the integers.
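The pre-alignment and shift conversion flow can be sketched in Python as follows. This is only an illustration under stated assumptions: it covers the case where the shared exponent is the maximum exponent (formula (3)), it uses `math.frexp` to split a value into fraction and exponent rather than reading Float16 bit fields, it truncates the shifted mantissa, and the names `convert_vector` and `mant_bits` are hypothetical.

```python
import math

def convert_vector(vec, mant_bits=11):
    """Convert floats to integers that share one exponent.

    Returns (e_shared, ints) with vec[i] ~= ints[i] * 2**(e_shared - mant_bits).
    mant_bits=11 mirrors the 10 stored bits plus the hidden bit of Float16.
    """
    parts = [math.frexp(x) for x in vec]   # x = f * 2**e with 0.5 <= |f| < 1
    # Shared exponent: maximum exponent over the non-zero elements.
    e_shared = max((e for f, e in parts if f != 0.0), default=0)
    ints = []
    for f, e in parts:
        if f == 0.0:
            ints.append(0)
            continue
        m = int(abs(f) * (1 << mant_bits))  # integer mantissa of this element
        delta = e_shared - e                # exponent difference, never negative
        shifted = m >> delta                # right shift drops the low bits
        ints.append(-shifted if f < 0 else shifted)  # re-apply the sign
    return e_shared, ints

e_shared, ints = convert_vector([1.5, -0.25, 3.0])
print(e_shared, ints)
```

Elements whose exponent is far below the shared exponent lose low-order mantissa bits in the right shift, which is the source of the precision loss discussed for larger vector sizes.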

[0130] In one possible implementation scenario, when the input to the first data converter 410 is an N×k floating-point matrix, each row of elements can be independently pre-aligned and shifted to obtain N shared exponents $e_{shared}$ and an INT16 matrix of size N×k.

[0131] The converted INT16 matrix is used as the input to the multiplication array 420, and $e_{shared}$ is used as an input to the subsequent accumulator 430; no external memory is required.

[0132] The size k of the first vector matches the input size of the multiplication array 420, typically 16 or 32. A smaller first vector size results in less precision loss during the conversion. When the first vector size is 1, each element in the first vector has an independent exponent, preventing any precision loss.
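The effect of the vector size on precision can be seen in a small Python sketch. It assumes a truncating conversion with an 11-bit integer mantissa and a shared exponent equal to the maximum exponent; `quantize` and `dequantize` are hypothetical helper names.

```python
import math

def quantize(vec, mant_bits=11):
    """Return (shared, ints) with vec[i] ~= ints[i] * 2**shared."""
    e_max = max((math.frexp(x)[1] for x in vec if x != 0.0), default=0)
    shared = e_max - mant_bits
    ints = [int(math.ldexp(x, -shared)) for x in vec]  # truncate toward zero
    return shared, ints

def dequantize(shared, ints):
    return [math.ldexp(m, shared) for m in ints]

values = [1024.0, 0.75]
together = dequantize(*quantize(values))                   # one shared exponent
separate = [dequantize(*quantize([x]))[0] for x in values]  # vector size 1
print(together, separate)
```

With one shared exponent the small element is shifted out entirely, while with a vector size of 1 every element keeps its own exponent and is recovered exactly.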

[0133] As another example, take a first vector $x_a$ and a second vector $x_b$:

[0134] The first vector $x_a$ includes k floating-point numbers, where $s_{i,a}$ is the sign bit, $e_{i,a}$ is the exponent, and $m_{i,a}$ is the mantissa of the i-th floating-point number. The second vector $x_b$ includes k floating-point numbers, where $s_{i,b}$ is the sign bit, $e_{i,b}$ is the exponent, and $m_{i,b}$ is the mantissa of the i-th floating-point number, 0 ≤ i < k.

[0135] After the two floating-point vectors $x_a$ and $x_b$ are converted, we get:

[0136] $x_a = 2^{shared\_a} \times [m'_{INT0,a}\ m'_{INT1,a}\ \ldots\ m'_{INTk-1,a}]$, where shared_a is the first shared exponent of the elements in the first vector $x_a$, $[m'_{INT0,a}\ m'_{INT1,a}\ \ldots\ m'_{INTk-1,a}]$ is the third vector, and $m_{i,a}$ becomes $m'_{INTi,a}$ after being converted by the first data converter 410. Likewise, $x_b = 2^{shared\_b} \times [m'_{INT0,b}\ m'_{INT1,b}\ \ldots\ m'_{INTk-1,b}]$, where shared_b is the second shared exponent of the elements in the second vector $x_b$, $[m'_{INT0,b}\ m'_{INT1,b}\ \ldots\ m'_{INTk-1,b}]$ is the fourth vector, and $m_{i,b}$ becomes $m'_{INTi,b}$ after being converted by the first data converter 410.

[0137] Step 820: integer vector multiplication.

[0138] The vector dot product is calculated using the integer mantissas of the first and second vectors. Taking the third vector $[m'_{INT0,a}\ m'_{INT1,a}\ \ldots\ m'_{INTk-1,a}]$ and the fourth vector $[m'_{INT0,b}\ m'_{INT1,b}\ \ldots\ m'_{INTk-1,b}]$ as an example, the inner product operation of the third and fourth vectors is shown in formula (8).
$$res_{block} = \sum_{i=0}^{k-1} m'_{INTi,a} \cdot m'_{INTi,b} \quad (8)$$

[0139] Here, $\sum_{i=0}^{k-1} m'_{INTi,a} \cdot m'_{INTi,b}$ is an integer vector dot product operation. The operation includes the integer multiplication $m'_{INTi,c} = m'_{INTi,a} \cdot m'_{INTi,b}$ and the integer summation $\sum_{i=0}^{k-1} m'_{INTi,c}$. $res_{block}$ represents the result of the inner product of the integer mantissas of the first vector $x_a$ and the second vector $x_b$.

[0140] Step 830: accumulate the data block results.

[0141] Formula (9) represents the multiplication of the first vector $x_a$ and the second vector $x_b$. According to the distributive law of multiplication, the accumulator 430 can obtain $res_{block}$ from the multiplication array 420 and multiply it by the common coefficient $2^{shared\_a + shared\_b}$ to obtain the result of multiplying the first vector $x_a$ by the second vector $x_b$. Here shared_a is the shared exponent of $x_a$, i.e., the first shared exponent, and shared_b is the shared exponent of $x_b$, i.e., the second shared exponent; both are calculated in the same way as $e_{shared}$ in the previous embodiments. Since A1 contains N first vectors and B1 contains P second vectors, A1 × B1 yields N × P vector multiplication results.
$$x_a \cdot x_b = 2^{shared\_a + shared\_b} \times res_{block} \quad (9)$$
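The relationship between the integer inner product and the floating-point result can be checked with a short Python sketch. The helper `quantize` is hypothetical: it folds the mantissa width into the shared exponent so that each element equals its integer times 2**shared, and it truncates toward zero. With the example inputs below the conversion is exact, so the scaled integer result matches the floating-point dot product.

```python
import math

def quantize(vec, mant_bits=11):
    """Return (shared, ints) with vec[i] ~= ints[i] * 2**shared."""
    e_max = max((math.frexp(x)[1] for x in vec if x != 0.0), default=0)
    shared = e_max - mant_bits
    ints = [int(math.ldexp(x, -shared)) for x in vec]  # truncate toward zero
    return shared, ints

def shared_exponent_dot(xa, xb):
    shared_a, ints_a = quantize(xa)   # first shared exponent, third vector
    shared_b, ints_b = quantize(xb)   # second shared exponent, fourth vector
    # Integer inner product of the two mantissa vectors.
    res_block = sum(ma * mb for ma, mb in zip(ints_a, ints_b))
    # One scaling by the common coefficient 2**(shared_a + shared_b).
    return res_block, math.ldexp(res_block, shared_a + shared_b)

xa = [1.5, -0.25, 3.0]
xb = [2.0, 4.0, -0.5]
res_block, result = shared_exponent_dot(xa, xb)
print(res_block, result)
```

Only integer multiplications and additions occur inside the sum; the exponents are applied once per vector pair rather than once per element.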

[0142] Since A×B = A1×B1 + A2×B2 + ... + AG×BG, and the shared exponents are usually different for different data blocks, the data block results cannot be directly accumulated. They need to be shifted to a unified exponent domain before accumulation. It should be understood that each data block multiplication operation in A1×B1, A2×B2, ..., AG×BG corresponds to N×P vector multiplication results, and each vector multiplication result includes a value.

[0143] Optionally, embodiments of this application may use a Kulisch accumulator to accumulate the vector multiplication results between multiple data blocks. Figure 11 is a schematic diagram of the precision of a Kulisch accumulator provided in an embodiment of this application. Compared with the traditional Float16 accumulator, the Kulisch accumulator can alleviate the "drowning" problem in floating-point addition, in which small addends are lost against a large sum, and achieve higher precision. When calculating the accumulation of floating-point numbers, the Kulisch accumulator first selects a common exponent, and then shifts the mantissas of all other data according to the common exponent. Taking the accumulation of x1 to x4 shown in Table 1 as an example, x1 to x4 are the vector multiplication results in different data blocks, $x_1 = 1.0100000 \times 2^{3}$, $x_2 = 1.0101100 \times 2^{6}$, $x_3 = 1.0001001 \times 2^{7}$, $x_4 = 1.0000111 \times 2^{0}$.

[0144] Table 1

[0145] As shown in Table 2, taking x4 as the reference, the mantissas of the remaining data are shifted left based on the difference between their exponents and the exponent of x4, and then accumulated to obtain an integer y with a larger bit width, y = x1 + x2 + x3 + x4.

[0146] Table 2
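The accumulation of x1 to x4 can be reproduced with a minimal Python sketch of the Kulisch idea. It assumes each value is given as an integer mantissa with 7 fraction bits plus an exponent, and that the smallest exponent (that of x4) is chosen as the common exponent; the name `kulisch_sum` is hypothetical.

```python
def kulisch_sum(terms):
    """Sum (mantissa, exponent) pairs exactly in a wide integer.

    Returns (acc, e_min): the true sum equals acc * 2**e_min in units of
    the mantissa's least significant bit.
    """
    e_min = min(e for _, e in terms)      # common exponent
    acc = 0
    for m, e in terms:
        acc += m << (e - e_min)           # left shift only, so no bits are lost
    return acc, e_min

FRAC_BITS = 7
terms = [
    (0b10100000, 3),   # x1 = 1.0100000 * 2**3
    (0b10101100, 6),   # x2 = 1.0101100 * 2**6
    (0b10001001, 7),   # x3 = 1.0001001 * 2**7
    (0b10000111, 0),   # x4 = 1.0000111 * 2**0
]
acc, e_min = kulisch_sum(terms)
y = acc * 2.0 ** (e_min - FRAC_BITS)
print(acc, y)
```

Because every mantissa is only shifted left, the small term x4 contributes all of its bits to y, which a narrow floating-point accumulator would round away.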

[0147] Formula (10) represents the summation of the vector multiplication results between multiple data blocks. In the above calculation process, $res_{block}$ is obtained by calculating the inner product of the integer mantissas of the vectors within the data block, while the multiplication by $2^{shared\_a + shared\_b}$ can be achieved through low-cost bit shifting. Therefore, after the accumulator 430 obtains $res_{block}$, it performs a shift operation: if shared_a + shared_b is greater than 0, $res_{block}$ is shifted left by shared_a + shared_b bits; if shared_a + shared_b is less than 0, $res_{block}$ is shifted right by |shared_a + shared_b| bits. The shifted value is then added to the historical accumulated result, where partial_sum represents the historical accumulated result.
$$partial\_sum = partial\_sum + (res_{block} \text{ shifted by } |shared\_a + shared\_b| \text{ bits}) \quad (10)$$

[0148] In some possible implementations, $res_{block}$ can first be shifted left by L + (shared_a + shared_b) bits, and the shifted $res_{block}$ is then added to the historical accumulation result. In step 840, after the multiple data blocks have been accumulated, the final accumulation result is right-shifted by L bits or divided by $2^{L}$, where L is a preset value.
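The shift-then-add update, including the optional offset L, can be sketched in Python as follows. The function name `accumulate_block` and the sample block values are hypothetical; Python integers are unbounded, so the sketch does not model the accumulator bit width.

```python
def accumulate_block(partial_sum, res_block, shared_a, shared_b, L=0):
    """Shift res_block into the accumulator's exponent domain and add it."""
    shift = L + shared_a + shared_b
    if shift >= 0:
        return partial_sum + (res_block << shift)   # left shift, exact
    return partial_sum + (res_block >> -shift)      # right shift, drops low bits

L = 17  # preset offset chosen so that no block needs a right shift here
partial_sum = 0
# (res_block, shared_a, shared_b) for two data blocks
for res_block, shared_a, shared_b in [(65536, -9, -8), (3, 1, 1)]:
    partial_sum = accumulate_block(partial_sum, res_block, shared_a, shared_b, L)

final = partial_sum / 2 ** L   # undo the offset once, after all blocks
print(partial_sum, final)
```

Choosing L large enough to keep every shift non-negative defers the only lossy step to the single final division by 2**L.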

[0149] Optionally, embodiments of this application may also use an improved Kulisch accumulator to accumulate the vector multiplication results between multiple data blocks. Figure 12 is a schematic diagram of a limitation on the bit width of the vector multiplication result provided by an embodiment of this application. According to the data range of Float16, the data bit width required by the accumulator 430 is 41 bits ($2^{-24}$ to $2^{15}$, plus a sign bit). Since 41-bit integers are not hardware-friendly, this embodiment of the application can make a trade-off between precision and hardware design, limiting the bit width of the vector multiplication result to a power of 2, for example, selecting 32 bits or 16 bits for accumulation. This is equivalent to reducing the data bit width to a preset bit width to facilitate subsequent processing. As shown in Figure 12, the bit width of the vector multiplication result can be adjusted according to the data distribution through the split point.

[0150] Figure 13 is a schematic diagram of an improved Kulisch accumulator provided in an embodiment of this application. $res_{block}$ is an INT_b integer with a bit width of b. Before accumulation, $res_{block}$ is converted, according to the split point, to an INT_a integer with a bit width of a to fit the hardware design, where a is a power of 2.
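One possible reading of the split-point conversion is sketched below in Python. The text does not specify how bits outside the kept window are handled, so the choices here are assumptions: bits below the split point are dropped by an arithmetic right shift, and values that do not fit in the target width saturate. The name `narrow` is hypothetical.

```python
def narrow(value, split_point, width):
    """Keep `width` bits of a wider signed integer, starting at `split_point`."""
    shifted = value >> split_point        # drop the bits below the split point
    hi = (1 << (width - 1)) - 1           # largest signed value of this width
    lo = -(1 << (width - 1))              # smallest signed value of this width
    return max(lo, min(hi, shifted))      # saturate instead of wrapping

# Bit positions 2**-24 .. 2**15 plus one sign bit give the 41-bit figure.
full_width = len(range(-24, 16)) + 1

print(full_width, narrow(1000, 2, 16), narrow(0x12345678, 8, 16))
```

Moving the split point up trades resolution for range, which is why it would be chosen according to the data distribution.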

[0151] This application uses a Kulisch accumulator to accumulate the vector multiplication results between multiple data blocks. Compared with Float32 or Float16 accumulators, the Kulisch accumulator can significantly reduce circuit complexity and power consumption.

[0152] Step 840: convert the accumulated result to floating-point numbers.

[0153] After all data blocks have been accumulated, the integers in the accumulated result output by accumulator 430 are converted back to floating-point numbers.

[0154] Figure 14 is a timing diagram of a chip data processing method provided in an embodiment of this application.

[0155] The main steps in the above calculation process are the preprocessing that converts floating-point vectors to integer vectors, the integer vector multiplication, and the accumulation of results. These three steps can work in a pipelined manner, so that their time overheads overlap and hide one another, thereby improving the overall performance.

[0156] At the same time, for matrix A, the first data converter 410 performs data conversion on the floating-point vector in the (h+2)th data block and extracts the N shared exponents of the N vectors in the (h+2)th data block. The multiplication array 420 performs integer multiplication on the vectors in the (h+1)th data block. The accumulator 430 adds the vector multiplication result of the hth data block to the sum of the vector multiplication results of the previous h-1 data blocks, where h≥1.

[0157] For example, when multiplying the first matrix A and the second matrix B, the first matrix A is split into data blocks A1, A2, A3, and A4, and the second matrix B is split into data blocks B1, B2, B3, and B4, so that A×B = A1×B1 + A2×B2 + A3×B3 + A4×B4. Table 3 shows how the chip calculates A×B using the pipeline mechanism at different times. At time t1, the first data converter converts the floating-point vectors in A1 and B1 into integer vectors to obtain A11 and B11. At time t2, the first data converter converts the floating-point vectors in A2 and B2 into integer vectors to obtain A12 and B12, and the multiplication array calculates A11×B11. At time t3, the first data converter converts the floating-point vectors in A3 and B3 into integer vectors to obtain A13 and B13, the multiplication array calculates A12×B12, and the accumulator calculates the sum of A11×B11 and the historical accumulation result, obtaining the accumulation result 0+A11×B11; the historical accumulation result at time t3 is 0. At time t4, the first data converter converts the floating-point vectors in A4 and B4 into integer vectors to obtain A14 and B14, the multiplication array calculates A13×B13, and the accumulator calculates the sum of A12×B12 and the historical accumulation result, obtaining the accumulation result 0+A11×B11+A12×B12; the historical accumulation result at time t4 is 0+A11×B11. And so on.
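The timing described above can be generated with a small Python sketch of a three-stage pipeline. It only models which data block each stage handles at each time step; the name `pipeline_schedule` is hypothetical.

```python
def pipeline_schedule(num_blocks):
    """Map each time step to (convert, multiply, accumulate) block indices.

    Block indices are 1-based; None means the stage is idle at that time.
    """
    schedule = {}
    for t in range(1, num_blocks + 3):    # two extra steps drain the pipeline
        stages = []
        for lag in range(3):              # convert, multiply, accumulate
            blk = t - lag
            stages.append(blk if 1 <= blk <= num_blocks else None)
        schedule[t] = tuple(stages)
    return schedule

for t, (conv, mul, acc) in pipeline_schedule(4).items():
    print(f"t{t}: convert={conv} multiply={mul} accumulate={acc}")
```

For four blocks the pipeline finishes in six time steps instead of the twelve that three strictly sequential stages would take.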

[0158] The data processing method for the chip provided in this application embodiment can achieve an accuracy close to that of floating-point multiplication, enabling seamless replacement and improving chip energy efficiency by more than 40%.

[0159] Table 3

[0160] Figure 15 is a schematic diagram of another method for converting a floating-point vector into an integer vector according to an embodiment of this application.

[0161] With the development of neural network accelerators, the Float8 data type has gradually gained industry adoption. The data format of elements in the vector input to the first data converter 410 can be Float8 data type. The implementation scheme for vector multiplication of type Float8 is similar to the implementation scheme for vector multiplication of type Float16 in Figures 8-13.

[0162] Compared with the implementation scheme of vector multiplication of type Float16 in Figures 8 to 13, the main difference in the implementation scheme of vector multiplication of type Float8 lies in the data type and data bit width.

[0163] A vector in Float8 data format is input into the first data converter 410. First, the exponent bits of all elements in the vector are extracted. The data format of the extracted exponent bits is INT4, i.e., they include 4 integer bits. Then, the shared exponent of the vector is calculated by formula (3) or formula (4), and the shared exponent is cached. Then, the difference between the exponent of each element in the vector and the shared exponent is calculated by formula (5). Finally, the mantissa of each element is shifted according to the exponent difference, using formula (6) or formula (7), and converted into an integer in INT8 data format. INT8 includes 8 integer bits.

[0164] When the input to the first data converter 410 is an N×k floating-point matrix, each row of elements undergoes independent pre-alignment and shift conversion to obtain N shared exponents $e_{shared}$ and an INT8 matrix of size N×k.

[0165] The INT8 matrix obtained from the above conversion is directly used as the input to the multiplication array 420, and $e_{shared}$ is used as an input to the subsequent accumulator 430; no external memory cache is required.

[0166] The remaining calculation process is similar to the implementation scheme of vector multiplication of type Float16 in Figures 8 to 13, and will not be repeated in this application.

[0167] This application also provides a computing device, including the chip shown in Figure 1, Figure 4, or Figure 5.

[0168] Those skilled in the art will clearly understand that, for convenience and brevity, the specific working process of the above-described apparatus and units can be found in the corresponding process in the foregoing method embodiments, and will not be repeated here.

[0169] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A chip, characterized by comprising: a first data converter configured to convert N first vectors into N third vectors and M first shared exponents, wherein each of the first vectors comprises floating-point numbers, each of the third vectors comprises integers, N≥1, and 1≤M≤N; a multiplication array configured to acquire the N third vectors from the first data converter in a direct transmission manner, and multiply P fourth vectors with the N third vectors to obtain N×P vector multiplication results, wherein each of the P fourth vectors comprises integers, and P≥1; an accumulator configured to acquire the N×P vector multiplication results from the multiplication array in the direct transmission manner, and calculate an accumulated sum of the N×P vector multiplication results and a historical accumulated result according to the M first shared exponents to obtain an accumulated result; and a second data converter configured to convert integers in the accumulated result into floating-point numbers.

2. The chip according to claim 1, characterized in that the chip further comprises: at least one buffer configured to buffer the M first shared exponents, the historical accumulated result, and the accumulated result.

3. The chip according to claim 2, wherein the at least one buffer comprises: a first register configured to buffer the M first shared exponents; and a second register configured to buffer the historical accumulated result and the accumulated result.

4. The chip according to any one of claims 1 to 3, characterized in that the direct transmission manner comprises on-chip hardwiring.

5. The chip according to any one of claims 1 to 4, characterized in that each of the first vectors is a part of a to-be-processed matrix or a to-be-processed vector.

6. The chip according to any one of claims 1 to 5, wherein the accumulator comprises a Kulisch accumulator.

7. The chip according to any one of claims 1 to 6, wherein the accumulator is further configured to limit a bit width of the N×P vector multiplication results to a power of 2.

8. The chip according to any one of claims 1 to 7, characterized in that the first data converter is further configured to convert P second vectors into Q second shared exponents and the P fourth vectors, wherein each of the second vectors comprises floating-point numbers, and 1≤Q≤P; and the accumulator is specifically configured to calculate the accumulated sum of the N×P vector multiplication results and the historical accumulated result according to the M first shared exponents and the Q second shared exponents.

9. The chip according to any one of claims 1 to 8, characterized in that the first data converter, the multiplication array, and the accumulator work in a pipeline manner.

10. A computing device, comprising: the chip according to any one of claims 1 to 9.

11. A data processing method of a chip, characterized in that the method comprises the following steps: a first data converter in a chip converts N first vectors into N third vectors and M first shared exponents, wherein each of the first vectors comprises floating-point numbers, each of the third vectors comprises integers, N≥1, and 1≤M≤N; a multiplication array in the chip acquires the N third vectors from the first data converter in a direct transmission manner, and multiplies P fourth vectors with the N third vectors to obtain N×P vector multiplication results, wherein each of the P fourth vectors comprises integers, and P≥1; an accumulator in the chip acquires the N×P vector multiplication results from the multiplication array in the direct transmission manner, and calculates an accumulated sum of the N×P vector multiplication results and a historical accumulated result according to the M first shared exponents to obtain an accumulated result; and a second data converter in the chip converts integers in the accumulated result into floating-point numbers.