Improved apparatus for performing multiply / accumulate operations
By employing parallel processing and finite bit representation in the GPU pipeline, the multiplication and accumulation circuits are optimized, solving the problems of resource waste and high energy consumption in existing technologies, and improving computational efficiency and resource utilization.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- VERISILICON MICROELECTRONICS (SHANGHAI) CO LTD
- Filing Date
- 2021-06-08
- Publication Date
- 2026-06-16
AI Technical Summary
In existing technologies, GPU pipelines or multiprocessing devices suffer from resource waste and excessive energy consumption when performing a large number of mathematical operations, especially when processing large-scale data.
By configuring the graphics processing unit (GPU) into multiple processing pipelines, parallel processing techniques are utilized, combined with finite bit representation and symbolic processing, to optimize multiplication and accumulation circuits, thereby reducing the number of computing units and circuit area.
It achieves a reduction in circuit area and energy consumption while maintaining accuracy, and improves computational efficiency and resource utilization.
Description
Technical Field
[0001] This invention relates to systems and methods for performing a large number of mathematical operations.
Background Technology
[0002] One of the most common ways to improve execution speed is to execute operations in parallel, such as by using multiple processor cores. This principle can be leveraged on a larger scale by configuring graphics processing units (GPUs) with many (e.g., thousands) processing pipelines, where each pipeline can be configured to perform a mathematical function. In this way, large amounts of data can be processed in parallel. Although GPUs were originally used for graphics processing applications, they are also frequently used in other applications, especially artificial intelligence.
[0003] It would be an advancement in the field to improve the functioning of a GPU pipeline or of any processing device that includes many processing units.
Attached Figure Description
[0004] Figure 1 is a schematic block diagram of a computer system that can implement the methods of embodiments of the present invention.
[0005] Figure 2 is a schematic block diagram of a multiplication / accumulation circuit according to an embodiment of the present invention.
[0006] Figure 3 is a flowchart illustrating a method for processing input variables in a multiplication / accumulation circuit according to an embodiment of the present invention.
[0007] Figure 4 is a flowchart of a method for post-processing a product that is to be accumulated in a multiplication / accumulation circuit, according to an embodiment of the present invention.
Detailed Implementation
[0008] To facilitate understanding of the advantages of the present invention, a more specific description of the invention will be presented with reference to specific embodiments shown in the accompanying drawings. It should be understood that these drawings merely illustrate exemplary embodiments of the invention and do not constitute a limitation on its scope. The invention will now be described and explained with additional specificity and detail using the accompanying drawings.
[0009] The components of this invention can be arranged and designed in a variety of different ways. Therefore, as shown in the accompanying drawings, the following more detailed description of embodiments of the invention is not intended to limit the scope of protection claimed by the invention, but rather to illustrate the basic concept of the invention in a schematic manner. Please refer to the accompanying drawings for a better understanding of the embodiments described herein, wherein the same components are always represented by the same numbers.
[0010] Embodiments of the invention may be embodied as apparatus, method, or computer program product. Accordingly, the invention may take the form of a fully hardware embodiment, a fully software embodiment (including firmware, resident software, microcode, etc.), or an embodiment combining software and hardware, which may be referred to herein as a “module” or “system.” Furthermore, the invention may be embodied in any tangible medium having computer-usable program code.
[0011] This invention can utilize any combination of one or more computer-usable or computer-readable media, including non-transitory media. For example, computer-readable media may include portable computer floppy disks, hard disks, random access memory (RAM) devices, read-only memory (ROM) devices, erasable programmable read-only memory (EPROM or flash memory) devices, portable optical disc read-only memory (CDROM), optical storage devices, and magnetic storage devices. In selected embodiments, the computer-readable medium may include any non-transitory medium that can contain, store, communicate, propagate, or transmit programs used by or in connection with an instruction execution system, apparatus, or device.
[0012] This invention can be written in any combination of one or more programming languages to perform the operations of this invention. These programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language. The program code can be executed entirely as a standalone software package on a computer system, on a standalone hardware unit, partially on a remote computer located at a distance from the computer, or entirely on a remote computer or server. In the latter case, the remote computer can be connected to the computer via any type of network (including a local area network (LAN) or a wide area network (WAN)), or can establish a connection with an external computer (e.g., via the Internet through an Internet service provider).
[0013] The present invention will now be described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. Each part of the flowchart illustrations and block diagrams can be implemented by computer program instructions or code. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing apparatus create means for implementing the functions / actions specified in the flowchart illustrations and / or block diagrams.
[0014] These computer program instructions may also be stored in a non-transitory computer-readable medium that can instruct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture that includes instruction means for performing the functions / actions specified in the flowcharts and / or block diagrams.
[0015] Computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions, which are executed by the processor of the computer or other programmable data processing apparatus, provide a process for implementing the functions / actions specified in the flowchart and / or block diagram.
[0016] Figure 1 is a block diagram of an example computing device 100. The computing device 100 can be used to perform various processes, such as those discussed herein. The computing device 100 can be used as a server, client, or any other computing entity. The computing device 100 can perform the various functions disclosed herein and can execute one or more applications, such as those disclosed herein. The computing device 100 can be any of a variety of computing devices, such as a desktop computer, laptop computer, server computer, handheld computer, tablet computer, etc.
[0017] The computing device 100 includes one or more processors 102, one or more storage devices 104, one or more interfaces 106, one or more mass storage devices 108, one or more input / output (I / O) devices 110, and a display device 130, all coupled to a bus 112. The processor 102 includes one or more processors or controllers that execute instructions stored in the storage devices 104 and / or the mass storage devices 108. The processor 102 may also include various types of computer-readable media, such as cache memory.
[0018] Storage device 104 includes various computer-readable media, such as volatile memory (e.g., random access memory (RAM) 114) and / or non-volatile memory (e.g., read-only memory (ROM) 116). Storage device 104 may also include erasable and rewritable ROM, such as flash memory.
[0019] Mass storage devices 108 include various computer-readable media, such as magnetic tape, disks, optical discs, solid-state storage (e.g., flash memory), etc. As shown in Figure 1, in one example, the mass storage device is a hard disk drive 124. The mass storage device 108 may also include various drives to enable it to read from and / or write to various computer-readable media. The mass storage device 108 includes removable media 126 and / or non-removable media.
[0020] I / O device 110 includes various devices that can input data and / or other information to or retrieve said data or other information from computing device 100. Example I / O devices 110 include cursor control devices, keyboards, keypads, microphones, monitors or other display devices, speakers, printers, network interface cards, modems, lenses, CCDs or other image capture devices, etc.
[0021] Display device 130 includes any type of device capable of displaying information to one or more users of computing device 100. Display device 130 may be a monitor, display terminal, or video projection device, etc.
[0022] The graphics processing unit (GPU) 132 may be coupled to the processor 102 and / or the display device 130. The GPU can be used to render computer-generated images and perform other graphics processing. The GPU may have some or all of the functions of a general-purpose processor such as the processor 102. The GPU may also have additional graphics processing-specific functions. The GPU may have hard-coded and / or hard-wired graphics functions related to coordinate transformation, shading, texturing, rasterization, and other functions that help render computer-generated images.
[0023] Interface 106 includes various interfaces that allow computing device 100 to interact with other systems, devices, or computing environments. Interface 106 may include any number of different network interfaces 120, such as interfaces to local area networks (LANs), wide area networks (WANs), wireless networks, and the Internet. Other interfaces include user interfaces 118 and peripheral device interfaces 122, for example interfaces for printers, pointing devices (mice, touchpads, etc.), keyboards, etc.
[0024] Bus 112 allows processor 102, storage device 104, interface 106, mass storage device 108, and I / O device 110 to communicate with each other or with other components or devices coupled to bus 112. Bus 112 represents one or more of several types of bus architectures, such as system bus, PCI bus, IEEE 1394 bus, USB bus, etc.
[0025] In some embodiments, processor 102 may include cache 134, such as one or both of L1 and L2 caches. Similarly, GPU 132 may include cache 136, which may also include one or both of L1 and L2 caches.
[0026] For the purpose of better illustrating the invention, programs and other executable program components are shown herein as discrete blocks, although such programs and components may reside in different storage components of computing device 100 at different times and be executed by processor 102. The systems and processes described herein may also be implemented in hardware, or a combination of hardware, software, and / or firmware. For example, one or more application-specific integrated circuits (ASICs) may be programmed to perform one or more of the systems or processes described herein.
[0027] Referring to Figure 2, the GPU 132 or other components of the computing device 100 may include the components shown in Figure 2. As illustrated, buffers 200 and 202 can store the arguments that will become the objects of multiplication / accumulation operations. For example, buffer 200 can store coefficients used to implement graphics processing operations (e.g., kernels) and artificial intelligence operations (e.g., as part of a convolutional neural network). Buffer 202 can store the value to be multiplied by the coefficients (often called an "activation"). Of course, this is just an example, and any value can be loaded into buffers 200 and 202 and become the object of multiplication / accumulation operations. Buffers 200 and 202 can be defined as part of memory (RAM 114) or cache 134 or 136.
[0028] Each value retrieved from buffers 200 and 202 can be input to separator 204. Separator 204 converts all values into unsigned values. For example, in some applications, values can be represented in the following format: [Type][Amplitude]. The [Type] field indicates whether the bits in [Amplitude] represent a signed or unsigned number; for example, 0 indicates unsigned, and 1 indicates signed. When the [Type] field indicates a signed value, the [Amplitude] field uses two's complement representation, so its most significant bit (MSB) is 1 for negative numbers and 0 for positive numbers.
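The separation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 9-bit word layout (one [Type] bit above an 8-bit [Amplitude] field), the name `separate`, and the choice to compute the magnitude of a negative value directly in the separator are all hypothetical and are not taken from the source.

```python
AMP_BITS = 8  # assumed width of the [Amplitude] field

def separate(raw: int) -> tuple[int, int, int]:
    """Split a raw [Type][Amplitude] word into (type, sign, magnitude).

    type: 0 = unsigned, 1 = signed (two's complement amplitude)
    sign: 0 = non-negative, 1 = negative
    """
    type_bit = (raw >> AMP_BITS) & 1
    amplitude = raw & ((1 << AMP_BITS) - 1)
    msb = (amplitude >> (AMP_BITS - 1)) & 1
    if type_bit == 1 and msb == 1:
        # Negative two's complement value: magnitude is 2^AMP_BITS - amplitude.
        return type_bit, 1, (1 << AMP_BITS) - amplitude
    # Unsigned value, or signed value with a clear MSB: amplitude is the magnitude.
    return type_bit, 0, amplitude
```

For instance, `separate(0x180)` (signed, amplitude bits `1000 0000`) yields a negative sign and a magnitude of 128, the one signed magnitude that does not fit in seven bits.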
[0029] The output of separator 204 is the sign 206 and amplitude 208 of the value from buffer 200, and the sign 210 and amplitude 212 of the value from buffer 202.
[0030] Signs 206 and 210 and amplitudes 208 and 212 can then be input to checker 214. Checker 214 evaluates amplitudes 208 and 212 to detect certain cases requiring special handling. In particular, to limit the size of the circuitry performing the actual multiplication and addition / accumulation operations, the number of bits used to represent amplitudes 208 and 212 can be limited to a number of bits N. For example, when defining a value in the form of [Type][Amplitude], the value of N can be the number of bits of the actual value input to the multiplication circuitry and can be less than the number of bits in [Amplitude]. For example, in the case of a 9-bit input value, there will be 8 bits in the [Amplitude] field. Therefore, in some embodiments, the number of bits is N = 7 for each value from buffers 200 and 202 that is input to the multiplication circuitry.
[0031] However, for a signed value, seven bits are insufficient to represent the magnitude of the most negative number that eight signed bits can represent. For example, seven unsigned bits can only represent 0 to 127, while eight signed bits can represent -128 to 127. The largest positive number that the [Amplitude] field of a signed value can represent may be called MaxSign, and it can be defined as 2^N - 1, where the number of bits in the [Amplitude] field is N+1.
[0032] Therefore, checker 214 can detect instances where amplitudes 208 and 212 exceed MaxSign and respond by making adjustments. How this scenario is handled is described below with respect to Figure 3.
[0033] For unsigned values, the maximum value that N+1 bits can represent is 2^(N+1) - 1. Therefore, values from 2^N to 2^(N+1) - 1 cannot be represented by N bits. Checker 214 can similarly detect when amplitudes 208 and 212 of unsigned values exceed MaxSign and make corresponding adjustments, as described below with respect to Figure 3.
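The ranges above can be checked numerically. The following Python sketch assumes N = 7, as in the example in the text; the variable names are illustrative only.

```python
N = 7
MAX_SIGN = 2**N - 1  # 127, the largest magnitude that fits in N bits

# Magnitudes reachable by an (N+1)-bit signed value: |-128| .. |127|.
signed_magnitudes = {abs(v) for v in range(-(2**N), 2**N)}
# Magnitudes reachable by an (N+1)-bit unsigned value: 0 .. 255.
unsigned_magnitudes = set(range(2**(N + 1)))

# Magnitudes that exceed MaxSign and therefore need special handling.
signed_overflow = sorted(m for m in signed_magnitudes if m > MAX_SIGN)
unsigned_overflow = sorted(m for m in unsigned_magnitudes if m > MAX_SIGN)
```

With these assumptions, `signed_overflow` contains the single value 128, while `unsigned_overflow` covers 128 through 255.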
[0034] The output of checker 214 is a pair of parameters. For example, for a pair of values from buffers 200 and 202, checker 214 outputs one or more pairs of parameters to be input to sequencer 216. Sequencer 216 submits this pair of parameters to computation unit 218. In particular, there can be multiple computation units, for example, 8, 64, 1024, or any number of computation units. Sequencer 216 implements logic to submit the arguments to the correct computation unit. Specifically, sequencer 216 ensures that the parameters of a pair of values from buffers 200 and 202 are submitted to computation unit 218 to accumulate the multiplication / addition result of that pair of values.
[0035] For example, in matrix multiplication, each value in the output matrix is the result of the dot product of a row of the first matrix and a column of the second matrix. Therefore, in this example, the sequencer 216 submits parameters from buffers 200, 202 for the input value pairs, such that each computation unit 218 can accumulate the sum of the products of the elements of a particular row and the corresponding elements of the column. Of course, this is just an example, and the sequencer 216 can be programmed to accumulate products according to any desired functionality.
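As a rough software analogy of this routing, the Python sketch below keeps one accumulation buffer per output element and feeds it every product belonging to that element's dot product. The function name `matmul_by_accumulation` and the list-of-lists matrix layout are assumptions made for illustration; the source does not prescribe how sequencer 216 is programmed.

```python
def matmul_by_accumulation(a, b):
    """Multiply matrices a (rows x inner) and b (inner x cols).

    Each output element plays the role of one accumulation buffer, and the
    innermost loop plays the role of the sequencer submitting value pairs.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    acc = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                # One multiply / accumulate step for output element (i, j).
                acc[i][j] += a[i][k] * b[k][j]
    return acc
```

In hardware the per-element loops would run in parallel across computation units rather than sequentially as written here.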
[0036] Each computation unit 218 may include: an N-bit multiplier 220 that takes a pair of parameters from sequencer 216 as input; and an adder 222 that takes the product of the multiplier 220 and the contents of an accumulation buffer 224 as input. The output of the adder 222 is then written back to the accumulation buffer 224. The result in the accumulation buffer 224 may be read by a controller of the GPU 132, or by the processor 102 under application control, or according to methods known in the art for reading, retrieving, and processing the results of multiplication / accumulation operations. As shown in Figure 2, adder 222 can further take as input the signs of the input parameters separated by separator 204.
[0037] Referring to Figure 3, the illustrated method 300 can be performed by checker 214 to determine whether to divide each of the input amplitudes 208, 212 into two parameters or to output a single parameter containing that input amplitude. Method 300 can perform this operation for each input amplitude 208, 212 (hereinafter referred to as "input amplitude").
[0038] Method 300 may include step 302: receiving the amplitude and type of the input amplitude from separator 204. If it is a signed type (step 304), then method 300 may include obtaining the absolute value of the input amplitude (step 306).
[0039] Then, method 300 may include evaluating whether the absolute value is greater than MaxSign (step 308). If not, the absolute value can be input as an argument (Arg) into sequencer 216 (step 314). If so, method 300 may include splitting the absolute value into two parameters (Arg_1, Arg_2) (step 310). In particular, for signed values, the only magnitude that can be greater than MaxSign is MaxSign+1, so Arg_1 and Arg_2 can each be set to (MaxSign+1) / 2. For example, for MaxSign = 127, step 310 may include setting Arg_1 = Arg_2 = 64.
[0040] Then, method 300 may include inputting Arg_1 and Arg_2 into sequencer 216 (step 312).
[0041] If the input magnitude is found not to be from a signed number, method 300 may include evaluating whether the input magnitude is greater than MaxSign (step 316). If so, two parameters are set according to steps 318 and 320: Arg_1 is set equal to the input magnitude minus MaxSign, and Arg_2 is set equal to MaxSign.
[0042] Arg_1 and Arg_2 are then input into sequencer 216 (step 322). If the input magnitude is not found to exceed MaxSign (step 316), it is input as a single argument (Arg) into sequencer 216 (step 324).
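The decision logic of method 300 can be summarized with a short Python sketch. The function name `check`, its parameters, and the list return type are hypothetical; only the splitting rules come from the description above.

```python
def check(value: int, is_signed: bool, n_bits: int = 7) -> list[int]:
    """Return one or two arguments whose sum equals the magnitude of value."""
    max_sign = (1 << n_bits) - 1
    # For signed inputs the absolute value is taken first (step 306).
    magnitude = abs(value) if is_signed else value
    if magnitude <= max_sign:
        # Fits in N bits: pass through as a single argument.
        return [magnitude]
    if is_signed:
        # Only MaxSign + 1 can occur here: split into two equal halves (step 310).
        half = (max_sign + 1) // 2
        return [half, half]
    # Unsigned: Arg_1 = magnitude - MaxSign, Arg_2 = MaxSign (steps 318 and 320).
    return [magnitude - max_sign, max_sign]
```

With N = 7, `check(-128, True)` gives `[64, 64]` and `check(200, False)` gives `[73, 127]`. Note that applying the stated unsigned rule to 255 gives `[128, 127]`, so the first argument equals MaxSign + 1; the description does not discuss that boundary value separately.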
[0043] The input parameters at steps 312, 314, 322, and 324 can be executed in a coordinated manner. Specifically, the parameters determined for the input amplitude of the value from buffer 200 can be coordinated with the parameters determined for the input amplitude of the corresponding value from buffer 202 and input to sequencer 216.
[0044] As described above, the first and second values to be multiplied can be retrieved from buffers 200 and 202 respectively, and processed by separator 204 and checker 214. Table 1 describes the pairs of parameters to be input into sequencer 216 for the various results of method 300. Specifically, for the first value, the possible result of method 300 is either a single output parameter designated as Arg1 (step 314 or 324) or two output parameters designated as Arg1_1 and Arg1_2 (step 312 or 322). For the second value, the possible result is either a single parameter Arg2 (step 314 or 324) or two output parameters Arg2_1 and Arg2_2 (step 312 or 322). In the "Inputs to Sequencer" column, each pair in parentheses represents a pair of parameters input to sequencer 216, which will be multiplied and accumulated by calculation unit 218.
[0045] Table 1: Checker outputs input to sequencer
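[0046]
First value result | Second value result | Inputs to Sequencer
Arg1 | Arg2 | (Arg1, Arg2)
Arg1 | Arg2_1, Arg2_2 | (Arg1, Arg2_1), (Arg1, Arg2_2)
Arg1_1, Arg1_2 | Arg2 | (Arg1_1, Arg2), (Arg1_2, Arg2)
Arg1_1, Arg1_2 | Arg2_1, Arg2_2 | (Arg1_1, Arg2_1), (Arg1_1, Arg2_2), (Arg1_2, Arg2_1), (Arg1_2, Arg2_2)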
[0047] The sequencer 216 can be programmed to input all of the parameter pairs corresponding to the first and second values into the same computational unit 218. Similarly, the sequencer 216 can associate the sign of each parameter with that parameter. Specifically, when a signed value is split into two parameters Arg_1 and Arg_2, the sequencer 216 associates the sign of that value with both parameters in all parameter pairs containing either Arg_1 or Arg_2.
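The pairing of checker outputs described for Table 1 amounts to forming every combination of one argument from the first value with one argument from the second. A minimal Python sketch, assuming the checker results are passed as lists and using the hypothetical name `sequencer_pairs`:

```python
from itertools import product

def sequencer_pairs(args1, args2):
    """Pair every argument of the first value with every argument of the second."""
    return list(product(args1, args2))
```

For example, `sequencer_pairs([64, 64], [73, 127])` produces four pairs in the order (Arg1_1, Arg2_1), (Arg1_1, Arg2_2), (Arg1_2, Arg2_1), (Arg1_2, Arg2_2), while two unsplit values produce a single pair.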
[0048] Figure 4 shows a method 400 for performing a multiplication / accumulation calculation on a pair of parameters input to a calculation unit 218 by sequencer 216. Each pair of parameters output by sequencer 216 is input to the multiplier 220 of one of the calculation units 218, which then calculates the product P (step 402). Method 400 may further include evaluating the type and / or sign of the parameters in the pair (step 404). For parameters obtained from unsigned values, the sign at step 404 can be assumed to be positive in all cases. For signed values, the sign will be the sign 206 or 210 that separator 204 separated from the signed value.
[0049] If exactly one parameter of the pair is found to have a negative sign (step 406), method 400 may include adjusting the product P (step 408): P is converted to a negative number, for example according to the two's complement definition. The negative product P is then input to adder 222 (step 410), which sums it with the current contents of accumulation buffer 224 and writes the result back to accumulation buffer 224.
[0050] Otherwise, that is, if neither parameter or both parameters are negative (step 406), the product P is input to adder 222 unchanged (step 410), and adder 222 adds P to the current contents of accumulation buffer 224 and writes the accumulated result to accumulation buffer 224.
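One multiply / accumulate step of method 400 can be sketched as follows. The function name `mac_step` and the encoding of signs as 0 (non-negative) and 1 (negative) are assumptions for illustration; Python's arbitrary-precision integers stand in for the fixed-width two's complement arithmetic of the hardware.

```python
def mac_step(acc: int, arg_a: int, arg_b: int, sign_a: int, sign_b: int) -> int:
    """Return the new accumulator value after one multiply / accumulate step."""
    p = arg_a * arg_b          # unsigned product of the two magnitudes (step 402)
    if sign_a != sign_b:       # exactly one operand is negative (step 406)
        p = -p                 # negate the product (step 408)
    return acc + p             # add to the accumulation buffer (step 410)
```

Starting from an accumulator of 10, a pair (3, 4) with exactly one negative sign gives -2, while the same pair with both signs negative gives 22.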
[0051] It is evident from the above description that the multiplier 220 can be made smaller while the approach described with respect to Figures 2 through 4 still provides the same level of accuracy. In applications such as GPUs, where there are hundreds or thousands of computing units 218, this greatly reduces circuit area and power consumption.
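The accuracy claim can be checked exhaustively in software for the N = 7 example. The Python sketch below is a behavioral model only, with hypothetical function names; it does not model the bit width of the multiplier itself, just the arithmetic of splitting, pairing, sign adjustment, and accumulation.

```python
from itertools import product

N = 7
MAX_SIGN = (1 << N) - 1

def split(value, is_signed):
    """Checker model: one or two arguments summing to the magnitude."""
    mag = abs(value) if is_signed else value
    if mag <= MAX_SIGN:
        return [mag]
    if is_signed:
        return [(MAX_SIGN + 1) // 2] * 2
    return [mag - MAX_SIGN, MAX_SIGN]

def split_multiply(x, x_signed, y, y_signed):
    """Multiply x by y using only products of the split magnitudes."""
    negate = (x < 0) != (y < 0)   # exactly one operand is negative
    acc = 0
    for a, b in product(split(x, x_signed), split(y, y_signed)):
        p = a * b
        acc += -p if negate else p
    return acc

def mismatches():
    """Count input pairs where the split result differs from x * y."""
    signed = range(-(1 << N), 1 << N)
    unsigned = range(1 << (N + 1))
    count = 0
    for xs, ys in product([True, False], repeat=2):
        for x in (signed if xs else unsigned):
            for y in (signed if ys else unsigned):
                if split_multiply(x, xs, y, ys) != x * y:
                    count += 1
    return count
```

Calling `mismatches()` walks all signed and unsigned 8-bit operand combinations and returns 0, which follows from the fact that the split arguments always sum to the original magnitude.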
[0052] The above embodiments are merely illustrative of the principles and effects of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or alter the above embodiments without departing from the spirit and scope of the present invention. Therefore, all equivalent modifications or alterations made by those skilled in the art without departing from the spirit and technical concept disclosed in the present invention should still be covered by the claims of the present invention.
Claims
1. An apparatus comprising: a separator configured to receive two input values and, for each of the two input values: if the input value is a signed value, convert the input value into a sign value and an amplitude value; and if the input value is an unsigned value, set the amplitude value of the input value to the input value; a checker configured to convert the amplitude values of the two input values into one or more pairs of input parameters; and a computing unit configured to perform an operation on each of the one or more pairs of input parameters and any sign value of the two input values, and to generate an output based on the operation; wherein each input parameter in the one or more pairs of input parameters has N bits, where N is a predefined integer; wherein the checker is further configured to convert the amplitude values of the two input values into the one or more pairs of input parameters in the following manner: (a) for a first amplitude M1 of the amplitude values of the two input values, if M1 corresponds to a signed first input value of the two input values and is greater than 2^N - 1, then M1 is represented by parameters Arg1_1 = 2^(N-1) and Arg1_2 = 2^(N-1); (b) for a second amplitude M2 of the amplitude values of the two input values, if M2 corresponds to a signed second input value of the two input values and is greater than 2^N - 1, then M2 is represented by parameters Arg2_1 = 2^(N-1) and Arg2_2 = 2^(N-1); and wherein the computing unit includes a multiplication circuit, and the two inputs of the multiplication circuit have N-1 bits.
2. The device according to claim 1, wherein, The computing unit is programmed to perform multiplication and accumulation operations.
3. The device according to claim 1, wherein, The checker is configured to convert the amplitude values of the two input values into one or more pairs of input parameters such that the input parameter in one or more pairs of input parameters has fewer bits than the amplitude values of the two input values.
4. The device according to claim 3, wherein each input parameter in the one or more pairs of input parameters is one bit smaller than the amplitude values of the two input values.
5. The device according to claim 1, wherein, The checker is configured to convert the magnitude values of the two input values into the one or more pairs of input parameters in the following manner: (c) For the first amplitude: If M1 corresponds to the unsigned first of the two input values, and M1 is greater than 2^N-1, then M1 is divided into parameters Arg1_1 = M1-2^N +1 and Arg1_2 = 2^N-1; (d) For the second amplitude: If M2 corresponds to the unsigned second input value of the two input values, and M2 is greater than 2^N-1, then M2 is divided into parameters Arg2_1 = M2-2^N +1 and Arg2_2 = 2^N-1.
6. The device according to claim 5, wherein, The checker is configured to convert the magnitude values of the two input values into the one or more pairs of input parameters in the following manner: (e) If M1 is less than or equal to 2^N-1, then set one of the parameters Arg1 of the pair or more pairs of parameters to M1; and (f) If M2 is less than or equal to 2^N-1, then set one of the parameters Arg2 of the pair or more parameters to M2.
7. The device according to claim 6, wherein, The checker is configured to convert the magnitude values of the two input values into one or more pairs of input parameters in the following manner: If the results of (a) to (f) are Arg1 for M1 and Arg2 for M2, then output one pair of input parameters: (Arg1, Arg2); If the results of (a) to (f) are Arg1 for M1 and Arg2_1 and Arg2_2 for M2, then output two pairs of input parameters: (Arg1, Arg2_1) and (Arg1, Arg2_2); If the results of (a) to (f) are Arg1_1 and Arg1_2 for M1 and Arg2_1 and Arg2_2 for M2, then output four pairs of input parameters: (Arg1_1, Arg2_1), (Arg1_1, Arg2_2), (Arg1_2, Arg2_1), (Arg1_2, Arg2_2); If the results of (a) to (f) are Arg1_1 and Arg1_2 for M1 and Arg2 for M2, then output two pairs of input parameters: (Arg1_1, Arg2) and (Arg1_2, Arg2).
8. The device of claim 7, further comprising a sequencer programmed to input the one or more pairs of input parameters to the computing unit, the computing unit being programmed to perform a multiplication-accumulation operation.
9. The device according to claim 8, wherein, The computing unit is programmed to handle parameters for each of the one or more input pairs: Calculate the product P for each pair of input parameters; (g) If only one of the two input values is a negative signed number, where each pair of input parameters is derived from the two input values according to (a) to (f), set P = -P; After executing (g), P is added to the contents of the accumulator buffer to obtain a sum, and the sum is written to the accumulator buffer.
10. The device according to claim 1, wherein, The separator is configured to read the two input values from a coefficient buffer and an activation buffer.
11. A device, programmed to: receive a first input value and a second input value; convert the first input value and the second input value into one or more pairs of input parameters, wherein each input parameter in the one or more pairs of input parameters has fewer bits than the first input value and the second input value; and input the one or more pairs of input parameters into a computing unit; wherein each input parameter in the one or more pairs of input parameters has N bits, where N is a predefined integer; wherein the device is also configured to convert the first input value and the second input value into the one or more pairs of input parameters in the following manner: (a) if the first input value is signed and the magnitude M1 of the first input value is greater than 2^N - 1, then M1 is split into parameters Arg1_1 = 2^(N-1) and Arg1_2 = 2^(N-1); (b) if the second input value is signed and the magnitude M2 of the second input value is greater than 2^N - 1, then M2 is split into parameters Arg2_1 = 2^(N-1) and Arg2_2 = 2^(N-1); and wherein the computing unit includes a multiplication circuit, and the two inputs of the multiplication circuit have N-1 bits.
12. The device according to claim 11, wherein, The computing unit performs multiplication and accumulation operations.
13. The device according to claim 11, wherein, N is one bit less than the number of bits in the amplitudes M1 and M2.
14. The device according to claim 11, wherein, The device is also configured to convert the first input value and the second input value into one or more pairs of input parameters in the following manner: (c) If the first input value is unsigned and M1 is greater than 2^N-1, then divide M1 into parameters Arg1_1 = M1-2^N + 1 and Arg1_2 = 2^N-1; (d) If the second input value is unsigned and M2 is greater than 2^N-1, then divide M2 into parameters Arg2_1 = M2-2^N+1 and Arg2_2 = 2^N-1.
15. The device according to claim 14, wherein, The device is also configured to convert the first input value and the second input value into one or more pairs of input parameters in the following manner: (e) If M1 is less than or equal to 2^N-1, then set the parameter Arg1 of one or more pairs of parameters to M1; and (f) If M2 is less than or equal to 2^N-1, then set the parameter Arg2 of one or more pairs of parameters to M2.
16. The device according to claim 15, wherein, The device is also configured to convert the first input value and the second input value into one or more pairs of input parameters in the following manner: If the results of (a) to (f) are Arg1 for M1 and Arg2 for M2, then output one pair of input parameters: (Arg1, Arg2); If the results of (a) to (f) are Arg1 for M1 and Arg2_1 and Arg2_2 for M2, then output two pairs of input parameters: (Arg1, Arg2_1) and (Arg1, Arg2_2); If the results of (a) to (f) are Arg1_1 and Arg1_2 for M1 and Arg2_1 and Arg2_2 for M2, then output four pairs of input parameters: (Arg1_1, Arg2_1), (Arg1_1, Arg2_2), (Arg1_2, Arg2_1), (Arg1_2, Arg2_2); If the results of (a) to (f) are Arg1_1 and Arg1_2 for M1 and Arg2 for M2, then output two pairs of input parameters: (Arg1_1, Arg2) and (Arg1_2, Arg2).
17. The device according to claim 16, wherein, The device is also configured to handle parameters for each of the one or more input pairs: Calculate the product P for each pair of input parameters; (g) If only one of the two input values is a negative signed number, where each pair of input parameters is derived from the two input values according to (a) to (f), set P = -P; After executing (g), P is added to the contents of the accumulator buffer to obtain a sum, and the sum is written to the accumulator buffer.
18. The device according to claim 11, wherein, The device is also configured to read the first input value from the coefficient buffer and the second input value from the activation buffer.