Accelerator, method of operating accelerator, and electronic device including accelerator

By determining the data layout in the accelerator based on the word width of the memory and the size of the filter space, packing and storing the input data and performing convolution operations, the problems of memory efficiency and access cost are solved, thus improving the efficiency of neural network processing.

CN114118348B · Active · Publication Date: 2026-06-16 · SAMSUNG ELECTRONICS CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SAMSUNG ELECTRONICS CO LTD
Filing Date
2021-05-07
Publication Date
2026-06-16

Abstract

Disclosed are an accelerator, a method of operating an accelerator, and an electronic device including the accelerator. The method of operating the accelerator configured to perform a target operation includes: packing input data with a data layout determined based on a word width of a memory in the accelerator and a spatial size of a filter to be applied to the target operation, and storing the packed input data in the memory; and performing the target operation between a portion of the input data stored in the same word in the memory and a weight of the filter.

Description

[0001] This application claims the benefit of Korean Patent Application No. 10-2020-0110530, filed on August 31, 2020, with the Korean Intellectual Property Office, the entire disclosure of which is incorporated herein by reference for all purposes.

Technical Field

[0002] The following description relates to an accelerator, a method of operating the accelerator, and an electronic device including the accelerator.

Background Technology

[0003] With advancements in artificial intelligence (AI) technology, there is a need for dedicated AI hardware capable of performing reasoning and learning through computation. Various devices are being developed specifically for implementing AI.

[0004] Research is underway on hardware accelerators to efficiently utilize deep neural networks (DNNs). Neural network processing devices may require massive computations on complex input data. Memory efficiency and access costs can be performance bottlenecks in many processing systems.

Summary of the Invention

[0005] This summary is provided to introduce, in a simplified form, a selection of concepts that are further described in the detailed description below. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to help determine the scope of the claimed subject matter.

[0006] In one general aspect, a method of operating an accelerator includes: packing input data using a data layout determined based on the word width of memory in the accelerator and the spatial size of a filter to be applied to a target operation, and storing the packed input data in memory; and performing a target operation between a portion of the packed input data stored in the same word in memory and the weights of the filter.

[0007] The storage steps may include: packing the input data corresponding to multiple filters based on the data layout, and storing the packed input data in a single word.

[0008] The number of filters can be determined based on the horizontal and vertical size of each filter, the number of input data channels, the stride size of each filter, and the number of operand pairs that the arithmetic unit configured to perform the target operation can process simultaneously.

[0009] The storage step may include storing the packed input data by performing an im2col transform based on the spatial size and stride size of a virtual filter, the spatial size and stride size of which are determined based on the word width of the memory and the spatial size of the filter.

[0010] The steps of performing the target operation may include: acquiring input data from the same word stored in memory into an input register; acquiring the weights of the filter into a filter register; performing a first target operation between a first portion of the input data acquired into the input register and the weights; and performing a second target operation between a second portion of the input data acquired into the input register and the weights.

[0011] The first and second parts of the input data may include redundant data.

[0012] The steps of performing the target operation may include: multiplexing the weights used for the first target operation; and performing a second target operation between a second portion of the input data and the multiplexed weights.

[0013] The steps for performing the target operation may include: multiplexing a second portion of the input data obtained from the input register after the first target operation; and performing a second target operation between the second portion of the multiplexed input data and the weights.

[0014] The steps of performing the target operation may include: multiplexing the weights used for the first target operation and storing the multiplexed weights back in the filter register; and performing a second target operation between a second portion of the input data and the stored weights.

[0015] The steps of performing the target operation may include: after the first target operation, multiplexing a second portion of the input data obtained from the input register and storing the multiplexed second portion back into the input register; and performing a second target operation between the second portion of the input data stored back in the input register and the weights.

[0016] The target operation may include convolution operations performed in a neural network running in an accelerator.

[0017] The steps of performing the target operation may include: performing the target operation in a multi-operand multiplier-accumulator (MAC), where a portion of the input data stored in the same word and the weights of the filter are input to the multi-operand MAC.

[0018] Accelerators can be included in user terminals where data to be inferred by a neural network performing target computations is input, or in servers that receive data to be inferred from user terminals.

[0019] In another general aspect, an accelerator configured to perform a target operation includes: a memory configured to store input data packed using a data layout determined based on the word width of the memory and the spatial size of a filter to be applied to the target operation; and an arithmetic unit configured to perform the target operation between a portion of the input data stored in the same word in the memory and the weights of the filter.

[0020] In another general aspect, an electronic device includes: a host processor configured to generate instructions executable by an accelerator in response to a request for processing, in the accelerator, a neural network in which a target operation is performed; and the accelerator, configured to, when the instructions are executed, pack input data using a data layout determined based on the word width of an internal memory and the spatial size of a filter to be applied to the target operation, store the packed input data in the internal memory, and perform the target operation between a portion of the input data stored in the same word in the internal memory and the weights of the filter.

[0021] In another general aspect, an accelerator configured to perform a target operation includes: an input memory configured to pack input data into a word according to a data layout; a filter memory configured to store weights of filters applied to the target operation; an arithmetic unit including a plurality of multipliers configured to perform the target operation between the packed input data stored in the same word in the input memory and one or more weights stored in the filter memory; and a multiplexer selectively disposed between the arithmetic unit and one of the input memory and the filter memory. When the multiplexer is disposed between the arithmetic unit and the filter memory, the multiplexer is configured to selectively transfer one of the weights stored in the filter memory to each of the plurality of multipliers in the arithmetic unit. When the multiplexer is disposed between the arithmetic unit and an input register, the multiplexer is configured to selectively transfer a set of the packed input data stored in the input memory to each of the plurality of multipliers in the arithmetic unit.

[0022] The accelerator may include: an input register into which the packed input data stored in the same word in the input memory is fetched; and a filter register into which the filter weights are fetched. When the multiplexer is positioned between the arithmetic unit and the filter memory, the multiplexer may be selectively positioned between the filter register and either the filter memory or the arithmetic unit. Similarly, when the multiplexer is positioned between the arithmetic unit and the input memory, the multiplexer may be selectively positioned between the input register and either the input memory or the arithmetic unit.

[0023] Other features and aspects will become clearer from the following detailed description, drawings, and claims.

Brief Description of the Drawings

[0024] Figure 1 shows an example of an electronic device.

[0025] Figure 2 shows an example of an accelerator.

[0026] Figures 3 and 4 show an example of an arithmetic unit.

[0027] Figures 5, 6, 7, and 8 show examples in which input data is packed using a data layout and the packed data is stored in memory.

[0028] Figures 9 through 21 show examples of performing the target operation.

[0029] Figure 22 shows an example flowchart of a method of operating an accelerator.

[0030] Figures 23 and 24 show examples of an electronic device.

[0031] Throughout the accompanying drawings and the detailed description, unless otherwise described or provided, the same reference numerals will be understood to denote the same elements, features, and structures. The drawings may not be to scale, and for clarity, illustration, and convenience, the relative dimensions, scale, and depiction of elements in the drawings may be exaggerated.

Detailed Description

[0032] The following detailed description is provided to help the reader gain a full understanding of the methods, apparatus, and / or systems described herein. However, after understanding the disclosure of this application, various changes, modifications, and equivalents of the methods, apparatus, and / or systems described herein will become clear. For example, the order of operations described herein is merely illustrative and is not limited to those set forth herein, but can be clearly changed after understanding the disclosure of this application, except for operations that must occur in a specific order.

[0033] The features described herein may be implemented in different forms and should not be construed as limited to the examples described herein. Rather, the examples described herein are provided only to illustrate some of the many possible ways of implementing the methods, apparatus, and / or systems described herein, which will be clear upon understanding the disclosure of this application.

[0034] The terminology used herein is for the purpose of describing various examples only and is not intended to limit the disclosure. Unless the context clearly indicates otherwise, the singular form is also intended to include the plural form. The terms “comprising,” “including,” and “having” indicate the presence of the stated features, quantities, operations, components, elements, and / or combinations thereof, but do not preclude the presence or addition of one or more other features, quantities, operations, components, elements, and / or combinations thereof.

[0035] Although terms such as “first,” “second,” and “third” may be used herein to describe various components, assemblies, regions, layers, or parts, these components, assemblies, regions, layers, or parts should not be limited by these terms. Rather, these terms are used only to distinguish one component, assembly, region, layer, or part from another. Thus, without departing from the teaching of the examples described herein, the first component, first assembly, first region, first layer, or first part mentioned in the examples may also be referred to as a second component, second assembly, second region, second layer, or second part.

[0036] Throughout this specification, when a component is described as "connected to" or "attached to" another component, that component may be directly "connected to" or "attached to" said other component, or there may be one or more other components in between. Conversely, when an element is described as "directly connected to" or "directly attached to" another element, there may not be any other elements in between. Similarly, similar expressions (e.g., "between" and "directly between," and "adjacent to" and "immediately adjacent to") will be interpreted in the same manner. As used herein, the term "and / or" includes any one of the associated listed items and any combination of any two or more.

[0037] Unless otherwise defined, all terms used herein (including technical and scientific terms) shall have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains and based on the understanding of the disclosure of this application. Terms (such as those defined in a general dictionary) shall be interpreted as having the same meaning as they have in the relevant field and in the context of the disclosure of this application, and shall not be interpreted in an idealized or overly formal sense unless expressly defined herein.

[0038] Furthermore, in the description of the exemplary embodiments, detailed descriptions of structures or functions that are known after understanding the disclosure of this application will be omitted when they would obscure the exemplary embodiments. Hereinafter, the examples will be described in detail with reference to the accompanying drawings, and the same reference numerals in the drawings always refer to the same elements.

[0039] Figure 1 shows an example of an electronic device.

[0040] Referring to Figure 1, the electronic device 100 includes a host processor 110, off-chip memory 120, memory controller 130, and accelerator 140. The host processor 110, off-chip memory 120, memory controller 130, and accelerator 140 can communicate with each other via a bus.

[0041] The host processor 110 may be a device configured to control the corresponding operation of components included in the electronic device 100, and may include, for example, a central processing unit (CPU). The host processor 110 may receive requests for processing neural networks in the accelerator 140, and generate instructions executable in the accelerator 140 in response to the received requests. Requests may be made for data inference based on neural networks, and requests may be made for obtaining results of data inference by allowing the accelerator 140 to execute neural networks for object recognition, pattern recognition, computer vision, speech recognition, machine translation, machine interpretation, etc. The host processor 110 may transmit inference target data and parameters of the neural network to the accelerator 140.

[0042] The off-chip memory 120 may be a memory located outside the accelerator 140 and may include, for example, dynamic random access memory (DRAM) used as the main memory of the electronic device 100. The off-chip memory 120 can be accessed via the memory controller 130. The off-chip memory 120 may store inference target data and / or parameters of the neural network to be executed in the accelerator 140, and the data stored in the off-chip memory 120 may be transferred to the accelerator 140 for inference. Additionally, the off-chip memory 120 may be used when the on-chip memory within the accelerator 140 is insufficient to execute the neural network within the accelerator 140.

[0043] Accelerator 140 may be an artificial intelligence (AI) accelerator configured to execute neural networks and infer input data according to instructions from host processor 110, and may be a separate processor distinct from host processor 110. Accelerator 140 may be, for example, a neural processing unit (NPU), a graphics processing unit (GPU), a tensor processing unit (TPU), etc.

[0044] Owing to the characteristics of neural network operations, accelerator 140 can handle tasks that are processed more efficiently by a separate dedicated processor (e.g., accelerator 140) than by the general-purpose host processor 110. For such tasks, one or more processing elements (PEs) and the on-chip memory included in accelerator 140 can be used. On-chip memory may be a global buffer included in accelerator 140 and is distinct from off-chip memory 120 located outside accelerator 140. On-chip memory may be, for example, scratchpad memory accessible through an address space, static random access memory (SRAM), etc.

[0045] A neural network can include multiple layers. In one example, a neural network may include an input layer, multiple hidden layers, and an output layer. Each layer may include multiple nodes, each referred to as an artificial neuron. Each node may represent a computational unit or operational unit with at least one input and output, and nodes may be connected to each other. Weights can be set for the connections between nodes and can be adjusted or changed. Weights can determine the influence of relevant data values on the final result by increasing, decreasing, or maintaining data values. Weighted inputs from nodes included in the previous layer can be fed into each node included in the output layer. The processing of weighted data from one layer to subsequent layers of that layer can be called propagation.

[0046] In neural networks, convolution operations can be performed. Convolution operations can be performed by applying filters or kernels to the input data to extract features from the input data. To perform convolution operations more efficiently based on operational characteristics, a new method is proposed. Examples will be described in detail below with reference to the accompanying drawings.

[0047] Figure 2 shows an example of an accelerator.

[0048] Referring to Figure 2, the accelerator 200 includes an input / filter memory 210, one or more multi-operand multiplier-accumulators (MACs) 220, an output memory 230, a direct memory access (DMA) 240, an allocator 250, an im2col (image-to-column) engine 260, and a CPU 270. These internal components of the accelerator 200 can communicate with each other via a bus.

[0049] Input / filter memory 210 may be an on-chip memory (e.g., SRAM) within accelerator 200 and is configured to store input data and filter weights. Multi-operand MAC 220 may perform target operations (e.g., convolution operations included in a neural network) on multiple operands from input / filter memory 210. For example, multi-operand MAC 220 may correspond to the PE in the accelerator described above. Output memory 230 may be an on-chip memory (e.g., SRAM) configured to store result data obtained from the operations performed in multi-operand MAC 220. For ease of description, multi-operand MAC 220 may also be referred to herein as an arithmetic unit.

[0050] DMA 240 can control the data input to input / filter memory 210 and / or the data output to output memory 230. Allocator 250 can allocate target operations to control the execution of target operations in multi-operand MAC 220. im2col engine 260 can transform two-dimensional (2D) image data into one-dimensional (1D) string data based on preset spatial size and stride size. By applying such an im2col transformation to the input data, the same convolution operation result can be obtained even using the matrix multiplication of the input data obtained by the im2col transformation. In one example, the im2col transformation can be performed not only by im2col engine 260, but also by various combinations of DMA 240, allocator 250, and CPU 270. The spatial size and stride size applied to the im2col transformation can be different from the filter to be applied to the convolution operation, which will be described in detail below.
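The equivalence mentioned above, between a sliding-window convolution and a product over im2col-transformed data, can be illustrated with a minimal pure-Python sketch. It is only an illustration and does not represent the im2col engine 260; the function names, the toy 4×5 input, and the kernel values are assumptions made for this example, and a single channel with no padding is assumed.

```python
def im2col(image, filter_h, filter_w, stride):
    """Flatten each filter_h x filter_w window of a 2D image into one 1D list."""
    height, width = len(image), len(image[0])
    columns = []
    for y in range(0, height - filter_h + 1, stride):
        for x in range(0, width - filter_w + 1, stride):
            columns.append([image[y + dy][x + dx]
                            for dy in range(filter_h)
                            for dx in range(filter_w)])
    return columns


def conv2d_direct(image, kernel, stride):
    """Reference sliding-window convolution (no padding, no kernel flip)."""
    filter_h, filter_w = len(kernel), len(kernel[0])
    outputs = []
    for y in range(0, len(image) - filter_h + 1, stride):
        for x in range(0, len(image[0]) - filter_w + 1, stride):
            outputs.append(sum(image[y + dy][x + dx] * kernel[dy][dx]
                               for dy in range(filter_h)
                               for dx in range(filter_w)))
    return outputs


def conv2d_im2col(image, kernel, stride):
    """Same outputs, computed as dot products over im2col-transformed data."""
    flat_kernel = [w for row in kernel for w in row]  # same row-major order as im2col
    columns = im2col(image, len(kernel), len(kernel[0]), stride)
    return [sum(a * w for a, w in zip(column, flat_kernel)) for column in columns]


# Toy data (assumed for illustration only).
image = [[r * 5 + c for c in range(5)] for r in range(4)]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
```

Each list returned by `im2col` holds the data for one filter application, so every output value becomes a single dot product with the flattened filter.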

[0051] Figures 3 and 4 show an example of an arithmetic unit.

[0052] Referring to Figure 3, the arithmetic unit 330 may include an adder-tree-based multi-operand MAC, which is one form of multi-operand MAC. The arithmetic unit 330 may perform convolution operations using multiple multipliers. Convolution operations can be performed in a neural network and are also referred to herein as multiply-accumulate operations or MAC operations. The filter weights used for convolution operations can be included in the parameters of the neural network.

[0053] The arithmetic unit 330 can receive input data from the input memory 310 and filter weights from the filter memory 320. Each of the input memory 310 and the filter memory 320 can be designed to have a data throughput matching the computational throughput of the arithmetic unit 330. For example, a word in each of the input memory 310 and the filter memory 320 can store as many elements as there are multipliers in the arithmetic unit 330. In the example of Figure 3, the arithmetic unit 330 includes 16 multipliers, and each word in the input memory 310 and the filter memory 320 can store 16 elements. In this example, each word in the input memory 310 and the filter memory 320 can be represented by a string. Corresponding input data and weights can be passed from the input memory 310 and the filter memory 320 to the multipliers and multiplied there. The multiplication results can then be summed to determine the convolution value. The arithmetic unit 330 can thus have a computational throughput of 16 multiplications at a time.

[0054] For ease of description, the example of Figure 3 shows an arithmetic unit 330 comprising 16 multipliers and words of 16 elements in each of the input memory 310 and the filter memory 320; however, the examples are not limited to this, and various other examples may be applied without limitation.

[0055] Referring to Figure 4, the output data map can be determined based on the convolution operation between the input data map and the filter. The input data map can also be referred to herein as the input feature map or image data. Each input data included in the input data map can also be referred to herein as an input activation.

[0056] As shown in Figure 4, using the im2col transformation described above, a convolution operation on a dataset continuous in the channel direction of the input data map can be transformed into a convolution operation on a dataset continuous in the spatial direction of the input data map. In this case, only some of the multipliers included in the arithmetic unit may receive operand pairs. For example, only 9 of the 16 multipliers included in the arithmetic unit may perform operations in one cycle, which reduces the utilization of the arithmetic unit. In Figure 4, H, W, and C indicate the height, width, and number of channels of the input data map, respectively; Q, P, and K indicate the height, width, and number of channels of the output data map, respectively.
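The utilization issue can be made concrete with a small software model of an adder-tree MAC. This is a hedged sketch rather than a description of the hardware: the function name `adder_tree_mac`, the zero-padding of idle lanes, and the sample operands are assumptions for illustration.

```python
def adder_tree_mac(inputs, weights, num_multipliers=16):
    """Model one cycle of an adder-tree MAC: multiply operand pairs, then sum pairwise.

    Returns the accumulated value and the fraction of multipliers that received
    an operand pair in this cycle.
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must pair up one to one")
    if len(inputs) > num_multipliers:
        raise ValueError("more operand pairs than multipliers")

    # Lanes without an operand pair are modeled as contributing zero.
    lanes = [a * w for a, w in zip(inputs, weights)]
    lanes += [0] * (num_multipliers - len(lanes))

    # Pairwise reduction, one tree level per loop iteration.
    while len(lanes) > 1:
        if len(lanes) % 2:
            lanes.append(0)
        lanes = [lanes[i] + lanes[i + 1] for i in range(0, len(lanes), 2)]

    return lanes[0], len(inputs) / num_multipliers


# One 3x3 single-channel window: 9 operand pairs on a 16-multiplier unit.
window = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ones = [1] * 9
value, utilization = adder_tree_mac(window, ones)
```

With one 3×3 single-channel window per cycle, `utilization` is 9/16, which is the under-use that the packed layout described below is meant to address.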

[0057] Figures 5 to 8 show examples in which input data is packed using a data layout and the packed data is stored in memory.

[0058] Referring to Figure 5, packing input data can improve memory storage efficiency. For ease of description, assume that the filter applied to the input data map in the convolution operation has a size of 3×3 and a stride of 1. The sizes of the input data map, the filter, and the input memory 570 shown in Figure 5 are provided only as examples, and various other examples can be applied without limitation.

[0059] For example, the first input data 510, to which the filter is applied first in the input data map, can be entirely stored in the first word (or the first string on the right) of the input memory 570. The second input data 520, to which the filter is applied second, may include redundant data where the second input data 520 and the first input data 510 partially overlap. Since the redundant data is already stored in the input memory 570, only the non-overlapping data of the second input data 520 is stored, following on in the first word of the input memory 570. Likewise, the third input data 530, to which the filter is applied third, may include redundant data where the third input data 530 and the second input data 520 partially overlap. Since that redundant data is already stored in the input memory 570, only the non-overlapping data of the third input data 530 is stored, following on in the first word of the input memory 570.

[0060] In one example, the fourth input data 540, to which the filter is applied fourth in the input data map, may include redundant data where the fourth input data 540 and the third input data 530 partially overlap. Here, the remaining space in the first word of the input memory 570 may be insufficient to store the non-overlapping data of the fourth input data 540. Because the fourth input data 540 therefore has to be stored in a different word from the one holding the third input data 530, the fourth input data 540, including the data that overlaps with the third input data 530, may be entirely stored in the second word (or the second string on the right) of the input memory 570. Similarly, for the fifth input data 550 and the sixth input data 560, to which the filter is applied fifth and sixth in the input data map, only the non-overlapping data may be stored consecutively in the second word of the input memory 570.

[0061] As described above, a data layout can be determined based on the word width of the input memory 570 (e.g., the number of elements included in one word, which is 16 in the example of Figure 5) and the spatial size of the filter (e.g., 9 in the example of Figure 5) such that redundant storage of the input data is minimized. Therefore, by packing the input data corresponding to multiple filters and storing the packed input data in a single word of the input memory 570, the number of elements in a word that hold no data can be minimized, significantly improving the storage efficiency of the input memory 570. Furthermore, six convolution operations may need only two words of the input memory 570; a three-fold increase in memory efficiency can therefore be expected compared to the six words required when the data needed for a single convolution operation is stored in a single word.
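The word count in the Figure 5 example can be reproduced with a short greedy-packing sketch. It assumes that all windows lie side by side along the stride direction, that each additional window contributes only its non-overlapping columns, and that a window which no longer fits starts a new word in full; the function name `pack_windows` and its parameters are hypothetical.

```python
def pack_windows(num_windows, filter_w, filter_h, stride, word_width, channels=1):
    """Greedily pack horizontally adjacent filter windows into memory words.

    Returns one [windows_in_word, elements_used] entry per word.
    """
    full = filter_w * filter_h * channels                      # a whole window
    extra = min(stride, filter_w) * filter_h * channels        # non-overlapping part
    if full > word_width:
        raise ValueError("a single window does not fit in one word")

    words = []
    for _ in range(num_windows):
        if words and words[-1][1] + extra <= word_width:
            # Overlapping data is already in the word; append only the new columns.
            words[-1][0] += 1
            words[-1][1] += extra
        else:
            # Not enough room left: store the whole window in a new word.
            words.append([1, full])
    return words


packed = pack_windows(num_windows=6, filter_w=3, filter_h=3, stride=1, word_width=16)
words_unpacked = 6            # one window per word
words_packed = len(packed)    # windows share words
```

For the 3×3, stride-1, 16-element case, each word ends up holding three windows in 15 of its 16 elements, so six windows occupy two words instead of six.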

[0062] Referring to Figure 6, convolution operations can be performed on data packed based on the im2col transform. For ease of description, the sizes of the input data map and the filter and the number of multipliers in the arithmetic unit shown in Figure 6 are only examples; the examples are not limited to these, and various other examples can be applied without restriction.

[0063] As described above, input data corresponding to multiple filters can be packed into a single word and stored in the input memory. This data packing can be easily achieved by adjusting the filter size applied to the im2col transform. For example, the filter size to be applied to the im2col transform can be determined by combining the filters corresponding to the input data packed into a single word. In the example of Figure 5, the filter size is 3×3, and the input data corresponding to three filters can be packed into a single word. In the example of Figure 6, the filter size to be applied to the im2col transform can accordingly be determined to be 5×3. That is, unlike a typical square filter, the filter to be applied to the im2col transform can be a rectangle elongated in the stride direction (e.g., the direction in which the filter moves when performing a convolution operation). With this data packing, 15 input data elements and 9 filter weights can be fed into the arithmetic unit in one cycle. Because it is not used in the actual convolution operation, the filter applied to the im2col transform can also be called a virtual filter to distinguish it from the filter applied to the convolution operation.

[0064] Figure 7 shows a flowchart illustrating an example of performing a convolution operation on input data.

[0065] Referring to Figure 7, in operation 710, it is determined whether a spatial convolution operation needs to be performed. For example, if the number of channels in the input data map and / or the filter is greater than a preset standard, it can be determined that a spatial convolution operation does not need to be performed. Conversely, if the number of channels in the input data map and / or the filter is less than the preset standard, the computational efficiency of a convolution operation in the channel direction may be significantly reduced, so it can be determined that a spatial convolution operation needs to be performed.

[0066] In response to determining that a spatial convolution operation is not required, operation 750 can be performed, in which an ordinary height-width-channel (HWC) convolution is performed. HWC convolution refers to performing convolution on a dataset that is continuous in the channel direction of the input data map.

[0067] In response to determining that a spatial convolution operation needs to be performed, operation 720 can be executed. In operation 720, the maximum number $n_{\max}$ of filters whose corresponding input data can be packed into one word is determined by the following equation.

[0068] [Equation 1]

[0069]

[0070] In Equation 1 above, R represents the horizontal size of the filter, i.e., the size in the x-direction. S represents the vertical size of the filter, i.e., the size in the y-direction. C represents the number of channels in the input data map, and X represents the number of operand pairs (or the word width of the memory) that can be processed simultaneously by the arithmetic unit. T represents the stride size of the filter to be applied to the convolution operation. In the example of Figures 5 and 6, since R = S = 3, C = 1, X = 16, and T = 1, $n_{\max}$ can be determined as 3 ($n_{\max} = 3$).
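The body of Equation 1 is not reproduced in this text, so the following sketch is an assumption rather than the equation itself. It takes $n_{\max}$ to be the largest n for which the combined footprint of n adjacent filters fits in one word, i.e., (R + (n − 1)·T)·S·C ≤ X, a constraint that uses only the symbols defined above and agrees with the worked example (R = S = 3, C = 1, X = 16, T = 1 gives 3). The function name is hypothetical.

```python
def max_filters_per_word(R, S, C, X, T):
    """Largest n such that n horizontally adjacent filters fit in one X-element word.

    Assumed constraint (not taken from the source): (R + (n - 1) * T) * S * C <= X.
    """
    if T < 1:
        raise ValueError("stride must be at least 1")
    if R * S * C > X:
        raise ValueError("a single filter footprint does not fit in one word")

    n = 1
    # Try adding one more filter position while the widened footprint still fits.
    while (R + n * T) * S * C <= X:
        n += 1
    return n


n_max = max_filters_per_word(R=3, S=3, C=1, X=16, T=1)
```

Under the same assumption, the loop is equivalent to the closed form (X // (S·C) − R) // T + 1 for positive integer inputs.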

[0071] In operation 730, it is determined whether $n_{\max}$ is 1. When $n_{\max}$ is 1, operation 770 can be performed. Here, $n_{\max}$ being 1 indicates that only the input data corresponding to a single filter would be placed in a single word. In that case no data packing actually occurs, and an unpacked convolution operation can therefore be performed as described above with reference to Figure 4.

[0072] Conversely, when $n_{\max}$ is not 1, operation 740 can be performed. In operation 740, the virtual filter to be applied to the im2col transform is determined by the following equation, and the im2col transform is performed on the input data map. The input data is thereby packed and stored in the input memory. In one example, the spatial size and stride size of the virtual filter can be determined based on the word width of the memory and the spatial size of the filter.

[0073] [Equation 2]

[0074] R' = R + (n_max - 1)·T

[0075] S' = S

[0076] T' = n_max·T

[0077] In Equation 2 above, R' represents the horizontal size of the virtual filter to be applied to im2col, S' represents the vertical size of the virtual filter to be applied to im2col, and T' represents the stride size of the virtual filter to be applied to im2col. In the example of Figures 5 and 6, R' can be determined as 5 (R' = 5), S' as 3 (S' = 3), and T' as 3 (T' = 3).
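The arithmetic of Equation 2 can be checked with a short sketch. The names `VirtualFilter` and `virtual_filter` are hypothetical helpers introduced here for illustration only.

```python
from typing import NamedTuple


class VirtualFilter(NamedTuple):
    width: int   # R', horizontal size of the virtual filter
    height: int  # S', vertical size of the virtual filter
    stride: int  # T', stride of the virtual filter


def virtual_filter(R: int, S: int, T: int, n_max: int) -> VirtualFilter:
    """Apply Equation 2: R' = R + (n_max - 1) * T, S' = S, T' = n_max * T."""
    if n_max < 1:
        raise ValueError("n_max must be at least 1")
    return VirtualFilter(width=R + (n_max - 1) * T, height=S, stride=n_max * T)


# Values of the example of Figures 5 and 6: R = S = 3, T = 1, n_max = 3.
print(virtual_filter(R=3, S=3, T=1, n_max=3))
```

For these values the sketch yields a 5×3 virtual filter with a stride of 3, as stated above.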

[0078] In operation 760, a convolution operation is performed between the packed input data and the filter weights. This convolution operation can also be referred to herein as packed convolution, and is described in detail below with reference to Figures 9 to 21.

[0079] Referring to Figure 8, the im2col transform can be performed based on the spatial size and stride size of the virtual filter to pack the input data and store the packed input data in the input memory 830.

[0080] In the example of Figure 8, the spatial size of the virtual filter is determined to be 5×3, and its stride is determined to be 3. For example, the first input data 810, to which the virtual filter is applied the first time, can be stored entirely in the first word of the input memory 830 (e.g., the first column on the right). Similarly, the second input data 820, to which the virtual filter is applied the second time, can be stored entirely in the second word of the input memory 830 (e.g., the second column on the right). As described above, the input data can be packed simply through an im2col transform based on the virtual filter.
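The packing step can be sketched as an im2col pass in which each application of the virtual filter produces one memory word. This is a minimal illustration under stated assumptions: the function name `pack_words` is hypothetical, the elements of a window are written column by column (inferred from the later statement that successive operations use the first to ninth, fourth to twelfth, and seventh to fifteenth input data), and the vertical stride is taken to be the original stride of 1.

```python
def pack_words(fmap, vf_width, vf_height, stride_x, stride_y=1):
    """im2col with a virtual filter: each window becomes one memory word.

    Elements are emitted column by column (assumed order), so that
    neighbouring filter windows occupy overlapping slices of the word.
    """
    height, width = len(fmap), len(fmap[0])
    words = []
    for y in range(0, height - vf_height + 1, stride_y):
        for x in range(0, width - vf_width + 1, stride_x):
            word = [fmap[y + dy][x + dx]
                    for dx in range(vf_width)
                    for dy in range(vf_height)]
            words.append(word)
    return words


# 8x8 single-channel input; each value encodes its position as 10 * row + column.
fmap = [[10 * row + col for col in range(8)] for row in range(8)]
words = pack_words(fmap, vf_width=5, vf_height=3, stride_x=3)
print(len(words), len(words[0]))
print(words[0])
```

With a 5×3 virtual filter and a horizontal stride of 3, each word holds 15 values, which leaves one of the 16 positions of a word unused.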

[0081] Although not shown in Figure 8, while the im2col transform based on the virtual filter described above is performed on the input data, an ordinary im2col transform can be performed on the filter weights. Since the filter has a size of 3×3, its nine weights can be stored in one word of the filter memory.

[0082] Although the above primarily describes an example of input data packing in the x-direction, the examples are not limited to this. The preceding description also applies to input data packing in the y-direction.

[0083] Figures 9 to 21 show examples of performing the target operation.

[0084] Referring to Figure 9, the data layout described above can be used to perform the target operation, for example, a convolution operation. To perform a convolution operation, input data stored in the input memory can be fetched into the input register word by word. When a preset number of convolution operations have been performed on the input data fetched into the input register, the next word of input data stored in the input memory can be fetched into the input register. The filter weights to be applied to the convolution operation can be fetched into the filter register. In the example of Figure 9, each box connected to a multiplier of the arithmetic unit indicates an element of the input register or the filter register; a filled box indicates that valid data is stored, and a blank box indicates that 0 is stored. The arithmetic unit of Figure 9 can include a multi-operand MAC based on an adder tree.

[0085] As shown in Figure 9, 15 pieces of input data can be stored in the input register, and 9 weights can be stored in the filter register. A convolution operation can be performed between a portion of the 15 pieces of input data stored in the input register and the weights. That is, a first convolution operation can be performed by multiplying each of the first 9 of the 15 pieces of input data stored in the input register by its corresponding weight and then summing the products. A second convolution operation can be performed based on another portion of the 15 pieces of input data stored in the input register, as described in detail with reference to Figures 10 to 21. As described above, by fetching the input data packed into the same word of the input memory into the input register at once, multiple convolution operations can be performed, thereby improving data reuse and increasing memory efficiency based on the characteristics of convolution operations.

[0086] Figure 10 shows an example of an accelerator configured to perform convolution operations using the data layout. Referring to Figure 10, the accelerator includes an input SRAM 1010, a filter SRAM 1020, an input register 1030, a filter register 1040, a multiplexer (MUX) 1050, a MAC 1060, an output register 1070, and an output SRAM 1080.

[0087] The input SRAM 1010 can pack the input data corresponding to multiple filters into a word according to the data layout and store the packed input data. The input register 1030 can store the input data obtained from the input SRAM 1010 in word units to perform convolution operations.

[0088] The filter SRAM 1020 can store the weights of the filter applied to the convolution operation. The filter register 1040 can store the weights fetched from the filter SRAM 1020.

[0089] The MUX 1050, which is configured to select one of multiple pieces of data and transmit the selected data, can be positioned between the filter register 1040 and the MAC 1060 and can selectively transmit one of the multiple weights stored in the filter register 1040 to each multiplier that receives a weight as input. With the MUX 1050, multiple convolution operations can be performed even though the input data packed into one word is fetched into the input register 1030 only once. For ease of description, this structure is referred to herein as a weight multiplexing structure.

[0090] The MAC 1060 can perform convolution operations between a portion of the input data stored in the input register 1030 and the weights selected by the MUX 1050. The output register 1070 can temporarily store the result of the MAC 1060, and the output SRAM 1080 can receive the result from the output register 1070 and store the received result at the appropriate address.

[0091] Figure 11 shows an example of a convolution operation performed in the weight multiplexing structure in the cycle following the operation described above with reference to Figure 9. After the convolution operation described above with reference to Figure 9, each weight can be selected so as to be input to a multiplier offset by a preset number (e.g., the horizontal or vertical size of the filter). This changes the input data to which the filter is applied, so that the second convolution operation, corresponding to, for example, the second input data 520 of Figure 5, can be implemented simply. In the example of Figure 11, the MUX is omitted for ease of description, as are the first through third multipliers (e.g., the first through third multipliers from the left in Figure 11), to which 0 is input instead of a weight.

[0092] Figure 12 shows an example of a convolution operation performed in the weight multiplexing structure in the cycle following the operation described above with reference to Figure 11. After the convolution operation described above with reference to Figure 11, each weight can be selected so as to be input to a multiplier offset by a preset number. This changes the input data to which the filter is applied, so that the third convolution operation, corresponding to, for example, the third input data 530 of Figure 5, can be implemented simply. As described above, by performing multiple convolution operations while changing the portion of the input data that is used, after the input data packed into one word has been fetched into the input register, memory efficiency can be maximized through data reuse.
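The three cycles of the weight multiplexing structure can be modelled in a few lines. This is a behavioural sketch only, not a description of the circuit: the names `LANES` and `weight_mux_cycle` and the sample values are assumptions, and the 15 packed input values and 9 weights are assumed to be ordered column by column.

```python
LANES = 16  # operand pairs of the adder-tree MAC


def weight_mux_cycle(input_reg, weights, cycle, shift):
    """Route weight j to multiplier lane j + cycle * shift; all other lanes get 0."""
    lane_weights = [0] * LANES
    for j, w in enumerate(weights):
        lane_weights[j + cycle * shift] = w
    # Multi-operand MAC: multiply lane by lane, then sum all products.
    return sum(a * b for a, b in zip(input_reg, lane_weights))


input_reg = list(range(1, 16)) + [0]  # 15 packed inputs, fetched once, plus a zero pad
weights = list(range(1, 10))          # arbitrary 3x3 weights, column by column

outputs = [weight_mux_cycle(input_reg, weights, cycle, shift=3) for cycle in range(3)]
print(outputs)
```

The input register is loaded once and left untouched; only the lane to which each weight is routed changes from cycle to cycle, moving by the filter height of 3 each time.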

[0093] Figure 13 shows another example of an accelerator configured to perform convolution operations using the data layout. In the example of Figure 13, unlike the description given above with reference to Figure 10, the MUX 1320 can be positioned between the input register 1310 and the MAC 1330 and can selectively transfer one of the multiple pieces of input data stored in the input register 1310 to each multiplier that receives a weight from the filter register. For ease of description, this structure is referred to herein as an input data multiplexing structure. The MAC 1330 can perform convolution operations between the selected portion of the input data stored in the input register 1310 and the weights stored in the filter register.

[0094] Figure 14 shows an example of a convolution operation performed in the input data multiplexing structure in the cycle following the operation described above with reference to Figure 9. After the convolution operation is performed on the first portion of the input data (e.g., the first to ninth input data) in the example of Figure 9, a second portion of the input data (e.g., the fourth to twelfth input data) can be selected so as to be input to preset multipliers. Here, the preset multipliers can include the first to ninth multipliers, to which the weights are input. This changes the input data to which the filter is applied, so that the second convolution operation, corresponding to, for example, the second input data 520 of Figure 5, can be implemented simply. In the example of Figure 14, the MUX is omitted for ease of description, as are the tenth to twelfth multipliers (e.g., the tenth to twelfth multipliers from the left in Figure 14), to which 0 is input instead of input data.

[0095] Figure 15 shows an example of a convolution operation performed in the input data multiplexing structure in the cycle following the operation described above with reference to Figure 14. After the convolution operation is performed on the second portion of the input data in the example of Figure 14, a third portion of the input data (e.g., the seventh to fifteenth input data) can be selected so as to be input to the preset multipliers. This changes the input data to which the filter is applied, so that the third convolution operation, corresponding to, for example, the third input data 530 of Figure 5, can be implemented simply.
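The selection performed in the input data multiplexing structure can be sketched as slicing the input register while the weights stay on the first nine multipliers. The helper name `select_inputs` and the sample values are assumptions for illustration; the input values are chosen to equal their 1-based position so that the selected ranges are easy to read.

```python
def select_inputs(input_reg, cycle, shift, taps):
    """Return the `taps` input elements routed to multiplier lanes 0..taps-1."""
    start = cycle * shift
    return input_reg[start:start + taps]


input_reg = list(range(1, 16)) + [0]  # value i sits at the i-th position of the register
weights = list(range(1, 10))          # arbitrary weights, fixed on the first nine lanes

for cycle in range(3):
    selected = select_inputs(input_reg, cycle, shift=3, taps=len(weights))
    result = sum(a * w for a, w in zip(selected, weights))
    print(f"cycle {cycle}: inputs {selected[0]}..{selected[-1]} -> {result}")
```

The three cycles select the first to ninth, fourth to twelfth, and seventh to fifteenth input data, matching the portions named in the description above.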

[0096] Figure 16 shows yet another example of an accelerator configured to perform convolution operations using the data layout. In the example of Figure 16, unlike the description given above with reference to Figure 10, the MUX 1620 can be positioned between the filter SRAM 1610 and the filter register 1630 and can selectively transfer either a weight stored in the filter SRAM 1610 or a weight stored in the filter register 1630 to each element of the filter register 1630. This allows the order of the weights stored in the filter register 1630 to be changed, with the reordered weights being stored back in the filter register 1630. For ease of description, this structure is referred to herein as a weight shift structure.

[0097] The weight shift structure differs from the weight multiplexing structure shown in Figure 10 in the position of the MUX (the MUX 1620 versus the MUX 1050). In the weight multiplexing structure shown in Figure 10, the MUX 1050 is directly connected to the MAC 1060. In the weight shift structure, however, the MUX 1620 can be structurally different because the critical path is directly connected to a small SRAM. Therefore, the weight shift structure can have lower power consumption and occupy less physical area than the weight multiplexing structure. Although the weight shift structure is described here in comparison with the weight multiplexing structure for ease of description, the same description also applies to a comparison between the input data multiplexing structure and the input data shift structure.

[0098] Figure 17 shows an example of a convolution operation performed in the weight shift structure in the cycle following the operation described above with reference to Figure 9. After the convolution operation of Figure 9, each weight can be selected so as to be input to a multiplier offset by a preset number (e.g., the horizontal or vertical size of the filter) and then stored back in the filter register. Each element of the filter register shown in Figure 17 can hold a weight that has been stored back in the filter register after selection. This changes the input data to which the filter is applied, so that the second convolution operation, corresponding to, for example, the second input data 520 of Figure 5, can be implemented simply. In the example of Figure 17, the arrows connecting the elements of the filter register indicate how the weights are shifted by selection after the convolution operation of Figure 17 is completed. In the example of Figure 17, the MUX is omitted for ease of description.

[0099] Figure 18 shows an example of a convolution operation performed in the weight shift structure in the cycle following the operation described above with reference to Figure 17. After the convolution operation of Figure 17, each weight can be selected so as to be input to a multiplier offset by a preset number and then stored back in the filter register. Each element of the filter register shown in Figure 18 can hold a weight that has been stored back in the filter register after selection. This changes the input data to which the filter is applied, so that the third convolution operation, corresponding to, for example, the third input data 530 of Figure 5, can be implemented simply. In the example of Figure 18, the arrows connecting the elements of the filter register indicate how the weights are shifted by selection after the convolution operation of Figure 18 is completed.
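The weight shift structure can be modelled as a filter register whose contents move by a fixed number of lanes between operations. In this behavioural sketch the names `LANES` and `shift_weights` and the sample values are assumptions; the weights are loaded from the filter memory once and afterwards only moved from register element to register element.

```python
LANES = 16  # operand pairs of the adder-tree MAC


def shift_weights(filter_reg, shift):
    """Move every weight `shift` lanes to the right; vacated lanes are filled with 0."""
    return [0] * shift + filter_reg[:len(filter_reg) - shift]


input_reg = list(range(1, 16)) + [0]                 # 15 packed inputs plus a zero pad
filter_reg = list(range(1, 10)) + [0] * (LANES - 9)  # 9 weights loaded once, rest 0

outputs = []
for cycle in range(3):
    if cycle > 0:
        # Between operations, the register is rewritten with its own shifted contents.
        filter_reg = shift_weights(filter_reg, shift=3)
    outputs.append(sum(a * w for a, w in zip(input_reg, filter_reg)))
print(outputs)
```

The results equal those obtained by routing the weights through a MUX placed in front of the MAC; the difference lies only in where the selection takes place.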

[0100] Figure 19 shows yet another example of an accelerator configured to perform convolution operations using the data layout. In the example of Figure 19, unlike the description given above with reference to Figure 10, the MUX 1920 can be positioned between the input SRAM 1910 and the input register 1930 and can selectively transfer either input data stored in the input SRAM 1910 or input data stored in the input register 1930 to each element of the input register 1930. This allows the order of the input data stored in the input register 1930 to be changed, with the reordered input data being stored back in the input register 1930. For ease of description, this structure is referred to herein as an input data shift structure.

[0101] Figure 20 shows an example of a convolution operation performed in the input data shift structure in the cycle following the operation described above with reference to Figure 9. After the convolution operation is performed on the first portion of the input data in the example of Figure 9, the second portion of the input data can be selected so as to be input to the preset multipliers and then stored back in the input register. In the example of Figure 20, the arrows connecting the elements of the input register indicate how the input data is shifted by selection after the convolution operation of Figure 9 is completed, and each element of the input register can hold input data that has been stored back in the input register after selection. This changes the input data to which the filter is applied, so that the second convolution operation, corresponding to, for example, the second input data 520 of Figure 5, can be implemented simply. In the example of Figure 20, the MUX is omitted for ease of description.

[0102] Figure 21 shows an example of a convolution operation performed in the input data shift structure in the cycle following the operation described above with reference to Figure 20. After the convolution operation is performed on the second portion of the input data in the example of Figure 20, the third portion of the input data can be selected so as to be input to the preset multipliers and then stored back in the input register. Each element of the input register shown in Figure 21 can hold input data that has been stored back in the input register after selection. This changes the input data to which the filter is applied, so that the third convolution operation, corresponding to, for example, the third input data 530 of Figure 5, can be implemented simply.
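As an end-to-end check, the sketch below packs an input map with the virtual filter, runs n_max operations per fetched word by shifting the input register, and compares the result with a direct convolution. It is a simplified software model under stated assumptions: a single channel, a stride of 1, column-by-column ordering within a word, and a map width that the virtual filter tiles exactly. The function names are hypothetical.

```python
def conv2d_direct(fmap, kernel):
    """Reference convolution (stride 1, no padding); kernel is indexed [dy][dx]."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(fmap[y + dy][x + dx] * kernel[dy][dx]
                 for dy in range(kh) for dx in range(kw))
             for x in range(len(fmap[0]) - kw + 1)]
            for y in range(len(fmap) - kh + 1)]


def conv2d_packed_input_shift(fmap, kernel, n_max, lanes=16):
    """Packed convolution: one word fetch feeds n_max operations (assumed model)."""
    kh, kw = len(kernel), len(kernel[0])
    vf_width = kw + (n_max - 1)  # Equation 2 with stride T = 1
    # Weights ordered column by column and padded with 0 up to the MAC width.
    filter_reg = [kernel[dy][dx] for dx in range(kw) for dy in range(kh)]
    filter_reg += [0] * (lanes - len(filter_reg))
    out_h, out_w = len(fmap) - kh + 1, len(fmap[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(0, len(fmap[0]) - vf_width + 1, n_max):
            word = [fmap[y + dy][x + dx] for dx in range(vf_width) for dy in range(kh)]
            input_reg = word + [0] * (lanes - len(word))  # single read of the word
            for k in range(n_max):
                out[y][x + k] = sum(a * w for a, w in zip(input_reg, filter_reg))
                # Shift the register by one filter column and store it back.
                input_reg = input_reg[kh:] + [0] * kh
    return out


fmap = [[(7 * row + col * col) % 11 for col in range(8)] for row in range(8)]
kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
packed = conv2d_packed_input_shift(fmap, kernel, n_max=3)
print(packed == conv2d_direct(fmap, kernel))
```

In this model an 8×8 map with a 3×3 filter needs 12 word reads for its 36 outputs, instead of one read per output.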

[0103] The accelerators described above with reference to Figures 10, 13, 16, and 19 can be implemented simply by placing a MUX before or after the input register or the filter register. The circuitry controlling the MUX can be implemented using a simple state machine. Although the added circuitry may incur a power overhead, this overhead can be fully offset by the reduced memory read energy that results from significantly fewer memory accesses for input data, and by the reduced leakage power obtained by disabling the memory while it is not being read. Furthermore, in many cases, the overall system power consumption can be significantly reduced.

[0104] Although an 8×8 input data graph, a 3×3 filter, and a 16-operand MAC based on an adder tree are shown, the examples are not limited to these. The preceding description also applies to convolution operations using input data graphs, filters, and MACs of various structures and sizes.

[0105] For example, the weight / input data multiplexing structures and weight / input data shift structures described above are also applicable when a convolution operation with a 5×5 filter and a stride of 1 is performed in a 32-operand MAC. In this case, n_max can be determined to be 2 by Equation 1 above, and the convolution operations corresponding to two filters can be performed with a single read of the input data packed into one word of the input memory.

[0106] As another example, the weight / input data multiplexing structures and weight / input data shift structures described above are also applicable when a convolution operation with a 3×3 filter and a stride of 2 is performed in a 16-operand MAC. In this case, n_max can be determined to be 2 by Equation 1 above, and the convolution operations corresponding to two filters can be performed with a single read of the input data packed into one word of the input memory.

[0107] Figure 22 shows an example flowchart of a method of operating an accelerator.

[0108] Referring to Figure 22, the method of operating the accelerator includes: operation 2210 of packing input data with a data layout determined based on the word width of the memory in the accelerator and the spatial size of the filter to be applied to the target operation, and storing the packed input data in the memory; and operation 2220 of performing the target operation between a portion of the input data stored in the same word in the memory and the weights of the filter. For a more detailed description of operations 2210 and 2220, refer to the description given above with reference to Figures 1 to 21; for brevity, that description is not repeated here.

[0109] The above can be applied to accelerators included in electronic devices, and the power consumption of electronic devices can be effectively reduced by increasing the storage space efficiency of memory and reducing the number of times input data is read from input memory.

[0110] Figures 23 and 24 show examples of an electronic device.

[0111] Referring to Figure 23, the electronic device can be implemented as a user terminal 2300. Although Figure 23 shows the user terminal 2300 as a smartphone for ease of description, other devices (including, for example, computing devices such as personal computers (PCs), tablet PCs, and laptops; wearable devices such as smartwatches and smart glasses; home appliances such as smart speakers, smart TVs, and smart refrigerators; and other devices such as smart vehicles, smart self-service terminals, Internet of Things (IoT) devices, and robots) can be used without restriction. The user terminal 2300 can directly obtain the data to be inferred using the neural network. The host processor 2310 can generate instructions executable by the accelerator 2320 in response to a request to process the neural network whose target operation is to be executed in the accelerator 2320. When the instructions are executed, the accelerator 2320 can pack the input data with a data layout determined based on the word width of the internal memory and the spatial size of the filter to be applied to the target operation, store the packed input data in the internal memory, and perform the target operation between a portion of the input data stored in the same word in the internal memory and the weights of the filter. The user terminal 2300 can provide the user with the inference result obtained through the neural network including the target operation as is, or can perform subsequent operations based on the inference result through the host processor 2310.

[0112] Referring to Figure 24, the electronic device can be implemented as a server 2400. The server 2400 can be a device separate from the user terminal controlled by the user, and can communicate with the user terminal via wired and / or wireless networks. Data to be inferred using a neural network can be collected by the user terminal and then transmitted to the server 2400. As described above, the host processor 2410 can generate instructions executable by the accelerator 2420 in response to a request to process the neural network whose target operation is to be executed in the accelerator 2420. When the instructions are executed, the accelerator 2420 can pack the input data with a data layout determined based on the word width of the internal memory and the spatial size of the filter to be applied to the target operation, store the packed input data in the internal memory, and perform the target operation between a portion of the input data stored in the same word in the internal memory and the weights of the filter. The server 2400 can return the inference result obtained through the neural network including the target operation to the user terminal, and the user terminal can provide the user with the inference result received from the server 2400 as is, or can perform subsequent operations based on the inference result.

[0113] The accelerators and other devices, units, modules, apparatuses, and other components described with reference to Figures 1, 2, 10, 13, 16, 19, 23, and 24 are implemented by hardware components. Examples of hardware components that can be used to perform the operations described in this application include, where appropriate, controllers, sensors, generators, drivers, memories, comparators, arithmetic logic units, adders, subtractors, multipliers, dividers, integrators, and any other electronic components configured to perform the operations described in this application. In other examples, one or more of the hardware components performing the operations described in this application are implemented by computing hardware (e.g., by one or more processors or computers). A processor or computer can be implemented by one or more processing elements, such as logic gate arrays, controllers and arithmetic logic units, digital signal processors, microcomputers, programmable logic controllers, field-programmable gate arrays, programmable logic arrays, microprocessors, or any other means or combination of means configured to respond to and execute instructions in a defined manner to achieve a desired result. In one example, the processor or computer includes or is connected to one or more memories storing instructions or software executed by the processor or computer. The hardware components implemented by the processor or computer can execute instructions or software for performing the operations described in this application, such as an operating system (OS) and one or more software applications running on the OS. Hardware components can also access, manipulate, process, create, and store data in response to the execution of instructions or software.
For simplicity, the singular terms "processor" or "computer" may be used in the description of the examples described in this application; however, in other examples, multiple processors or computers may be used, or a processor or computer may include multiple processing elements or multiple types of processing elements or both. For example, a single hardware component or two or more hardware components may be implemented by a single processor, or two or more processors, or a processor and a controller. One or more hardware components may be implemented by one or more processors or a processor and a controller, and one or more other hardware components may be implemented by one or more other processors or additional processors and additional controllers. One or more processors or a processor and a controller may implement a single hardware component or two or more hardware components. Hardware components may have any one or more of different processing configurations, examples of which include a single processor, independent processors, parallel processors, single-instruction single-data (SISD) multiprocessing, single-instruction multiple-data (SIMD) multiprocessing, multiple-instruction single-data (MISD) multiprocessing, and multiple-instruction multiple-data (MIMD) multiprocessing.

[0114] The methods shown in Figures 1 to 24 that perform the operations described in this application are executed by computing hardware (e.g., by one or more processors or computers), which is implemented to execute instructions or software as described above to perform the operations performed by the methods described in this application. For example, a single operation or two or more operations may be executed by a single processor, or two or more processors, or a processor and a controller. One or more operations may be executed by one or more processors or a processor and a controller, and one or more other operations may be executed by one or more other processors or additional processors and additional controllers. One or more processors or a processor and a controller may execute a single operation or two or more operations.

[0115] Instructions or software for controlling computing hardware (e.g., one or more processors or computers) to implement hardware components and perform the methods described above can be written as computer programs, code segments, instructions, or any combination thereof to individually or collectively instruct or configure one or more processors or computers to operate as machines or special-purpose computers to perform operations performed by the hardware components and methods described above. In one example, the instructions or software include machine code, such as machine code generated by a compiler, that is directly executed by one or more processors or computers. In another example, the instructions or software include high-level code that is executed by one or more processors or computers using an interpreter. The instructions or software can be written using any programming language based on the block diagrams and flowcharts shown in the accompanying drawings and the corresponding descriptions in the specification, which disclose algorithms for performing operations performed by the hardware components and methods described above.

[0116] Instructions or software used to control computing hardware (e.g., one or more processors or computers) to implement hardware components and perform the methods described above, as well as any associated data, data files, and data structures, may be recorded, stored, or fixed in, or on, one or more non-transitory computer-readable storage media. Examples of non-transitory computer-readable storage media include read-only memory (ROM), random access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, hard disk drive (HDD), solid-state drive (SSD), flash memory, card-type storage (such as multimedia cards or microcards (e.g., Secure Digital (SD) or Extreme Digital (XD))), magnetic tape, floppy disk, magneto-optical data storage device, optical data storage device, hard disk, solid-state drive, and any other device configured to store instructions or software and any associated data, data files, and data structures in a non-transitory manner and to provide instructions or software and any associated data, data files, and data structures to one or more processors or computers, such that one or more processors or computers can execute the instructions. In one example, the instructions or software and any associated data, data files, and data structures are distributed across a networked computer system, such that the instructions and software and any associated data, data files, and data structures are stored, accessed, and executed in a distributed manner through one or more processors or computers.

[0117] Although this disclosure includes specific examples, it will be clear upon understanding this disclosure that various changes in form and detail may be made in these examples without departing from the spirit and scope of the claims and their equivalents. The examples described herein are to be considered descriptive only and not for limiting purposes. The description of features or aspects in each example will be considered applicable to similar features or aspects in other examples. Suitable results may be achieved if the described techniques are performed in a different order, and / or if the components in the described system, architecture, apparatus, or circuit are combined in a different manner, and / or replaced or supplemented by other components or their equivalents.

[0118] Therefore, the scope of the disclosure is not limited by the specific embodiments, but by the claims and their equivalents, and all variations within the scope of the claims and their equivalents shall be interpreted as included in the disclosure.

Claims

1. A method of operating an accelerator, the accelerator being configured to perform a target operation, the method comprising: packing input data included in an input data graph with a data layout determined based on a word width of a memory in the accelerator and a spatial size of a filter to be applied to the target operation, and storing the packed input data in the memory; and performing the target operation between a portion of the packed input data stored in the same word in the memory and weights of the filter, wherein the storing comprises packing the input data corresponding to multiple filters based on the data layout and storing the packed input data in one word, and wherein first input data in the input data graph, to which the filter is applied a first time, is stored entirely in a first word of the memory, and second input data in the input data graph, to which the filter is applied a second time and which does not overlap with the first input data, is stored contiguously in the first word.

2. The method according to claim 1, wherein, The number of filters is determined based on the horizontal and vertical size of each filter, the number of input data channels, the stride size of each filter, and the number of operand pairs that the arithmetic unit configured to perform the target operation can process simultaneously.

3. The method according to claim 1 or 2, wherein, The storage steps include: Packed input data is stored by performing an im2col transform based on the spatial size and stride size of a virtual filter, which are determined based on the word width of the memory and the spatial size of the filter.

4. The method according to claim 1 or 2, wherein, The steps for performing the target operation include: The input data from the same word stored in memory is retrieved into the input register; The filter weights are retrieved into the filter register; Perform a first target operation between the first part of the input data obtained from the input register and the weights; and A second objective operation is performed between the second part of the input data obtained from the input register and the weights.

5. The method according to claim 4, wherein the first part and the second part of the input data partially overlap each other and thus include redundant data.
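The two target operations of claims 4 and 5 can be sketched as follows. This is a hedged illustration only: it assumes a single-channel 3x3 filter, a 12-element word holding a 3x4 patch in column-major order, and weights stored in the same column-major order. The function name `conv_from_word` is hypothetical.

```python
def conv_from_word(word, weights, kh, kw, stride=1):
    """Compute every filter output available in one packed word.

    The word is read once (as if held in an input register); each output
    uses a different, partially overlapping slice of that same word.
    """
    size = kh * kw
    step = kh * stride  # moving one stride skips `stride` columns of kh elements
    outputs = []
    for start in range(0, len(word) - size + 1, step):
        part = word[start:start + size]  # first part, then second part, ...
        outputs.append(sum(x * w for x, w in zip(part, weights)))
    return outputs


word = [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]  # one fetched memory word
weights = [1] * 9                               # illustrative 3x3 filter of ones
print(conv_from_word(word, weights, kh=3, kw=3))  # prints [45, 54]
```

Here the first part is `word[0:9]` and the second part is `word[3:12]`; the six elements they share are the redundant data, which are fetched from memory only once.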

6. The method according to claim 4, wherein performing the target operation comprises: selecting the weights used for the first target operation; and performing the second target operation between the second part of the input data and the selected weights.

7. The method according to claim 4, wherein performing the target operation comprises: after the first target operation, selecting the second part of the input data obtained from the input register; and performing the second target operation between the selected second part of the input data and the weights.

8. The method according to claim 4, wherein performing the target operation comprises: selecting the weights used for the first target operation and storing the selected weights back in the filter register; and performing the second target operation between the second part of the input data and the re-stored weights.

9. The method according to claim 4, wherein performing the target operation comprises: after the first target operation, selecting the second part of the input data obtained from the input register and storing the selected second part of the input data back in the input register; and performing the second target operation between the re-stored second part of the input data and the weights.

10. The method according to claim 1 or 2, wherein the target operation comprises a convolution operation performed in a neural network executed in the accelerator.

11. The method according to claim 1 or 2, wherein performing the target operation comprises performing the target operation in a multi-operand multiplier-accumulator to which the portion of the input data stored in the same word and the weights of the filter are input.

12. The method according to claim 1 or 2, wherein the accelerator is included in a user terminal to which data to be inferred using a neural network that performs the target operation is input, or in a server that receives the data to be inferred from the user terminal.

13. A non-transitory computer-readable storage medium storing instructions which, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 12.

14. An accelerator configured to perform a target operation, the accelerator comprising: a memory configured to store input data included in an input data map, the input data being packed using a data layout determined based on a word width of the memory and a spatial size of a filter to be applied to the target operation; and an arithmetic unit configured to perform the target operation between a portion of the input data stored in a same word in the memory and weights of the filter, wherein the memory is configured to pack the input data corresponding to multiple filters based on the data layout and to store the packed input data in a single word, and wherein, in the input data map, first input data of a first applied filter is stored entirely in a first word of the memory, and second input data of a second applied filter, which does not overlap with the first input data, is stored contiguously in the first word.

15. The accelerator of claim 14, further comprising: an input register into which the input data stored in the same word in the memory is fetched; and a filter register into which the weights of the filter are fetched, wherein the arithmetic unit is configured to: perform a first target operation between a first part of the input data obtained from the input register and the weights; and perform a second target operation between a second part of the input data obtained from the input register and the weights.

16. An electronic device comprising: a host processor configured to generate an instruction executable by an accelerator in response to a request to process, in the accelerator, a neural network in which a target operation is performed; and the accelerator, configured to, when the instruction is executed: pack input data included in an input data map using a data layout determined based on a word width of an internal memory and a spatial size of a filter to be applied to the target operation, and store the packed input data in the internal memory; and perform the target operation between a portion of the input data stored in a same word in the internal memory and weights of the filter, wherein the accelerator is configured to pack the input data corresponding to multiple filters based on the data layout and to store the packed input data in a single word, and wherein, in the input data map, first input data of a first applied filter is stored entirely in a first word of the internal memory, and second input data of a second applied filter, which does not overlap with the first input data, is stored contiguously in the first word.

17. An accelerator configured to perform a target operation, the accelerator comprising: an input memory configured to store input data included in an input data map, the input data being packed according to a data layout determined based on a word width of the input memory in the accelerator and a spatial size of a filter to be applied to the target operation; a filter memory configured to store weights of the filter applied to the target operation; an arithmetic unit including a plurality of multipliers configured to perform the target operation between the packed input data stored in a same word in the input memory and one or more of the weights stored in the filter memory; and a multiplexer selectively positioned between the arithmetic unit and one of the input memory and the filter memory, wherein, when the multiplexer is positioned between the arithmetic unit and the filter memory, the multiplexer is configured to selectively transfer one of the weights stored in the filter memory to each of the plurality of multipliers in the arithmetic unit, and, when the multiplexer is positioned between the arithmetic unit and the input memory, the multiplexer is configured to selectively transfer one of a set of the packed input data stored in the input memory to each of the plurality of multipliers in the arithmetic unit, wherein the input memory is configured to pack the input data corresponding to multiple filters based on the data layout and to store the packed input data in a single word, and wherein, in the input data map, first input data of a first applied filter is stored entirely in a first word of the input memory, and second input data of a second applied filter, which does not overlap with the first input data, is stored contiguously in the first word.

18. The accelerator of claim 17, further comprising: an input register into which the packed input data stored in the same word in the input memory is fetched; and a filter register into which the weights of the filter are fetched, wherein, when the multiplexer is positioned between the arithmetic unit and the filter memory, the multiplexer is selectively positioned between the filter register and one of the filter memory and the arithmetic unit, and, when the multiplexer is positioned between the arithmetic unit and the input memory, the multiplexer is selectively positioned between the input register and one of the input memory and the arithmetic unit.
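The two multiplexer placements of claims 17 and 18 can be compared with a small behavioral model. It is one possible reading, not the claimed circuit. It assumes a 12-element word, a 3x3 single-channel filter in column-major order, and, for the filter-side placement, that multipliers not covered by the filter receive a zero weight. The names `mac`, `filter_side_mux`, and `input_side_mux` are hypothetical.

```python
def mac(inputs, weights):
    """Multi-operand multiply-accumulate over paired operands."""
    return sum(x * w for x, w in zip(inputs, weights))


def filter_side_mux(weights, offset, n_mult):
    """Mux between filter memory and multipliers: multiplier i is wired to
    input element i and receives weights[i - offset], or 0 if out of range
    (the zero fill is an assumption of this sketch)."""
    return [weights[i - offset] if 0 <= i - offset < len(weights) else 0
            for i in range(n_mult)]


def input_side_mux(word, offset, n_mult):
    """Mux between input memory and multipliers: multiplier i holds
    weights[i] and receives input element word[offset + i]."""
    return word[offset:offset + n_mult]


word = [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]  # one packed memory word
weights = list(range(1, 10))                    # illustrative 3x3 weights
kh = 3

results = []
for k in range(2):  # first and second target operations
    offset = k * kh
    via_filter_mux = mac(word, filter_side_mux(weights, offset, len(word)))
    via_input_mux = mac(input_side_mux(word, offset, len(weights)), weights)
    results.append((via_filter_mux, via_input_mux))
# results == [(267, 267), (312, 312)]
```

In this model both placements produce the same outputs. The difference lies in how many multipliers are occupied: the filter-side variant drives one multiplier per word element, while the input-side variant drives one per filter weight.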

19. The accelerator according to claim 18, wherein the target operation comprises a convolution operation performed in a neural network executed in the accelerator.

20. An electronic device comprising: the accelerator according to claim 17; and a host processor configured to generate an instruction to be executed by the accelerator in response to a request to process, in the accelerator, a neural network in which the target operation is performed.