Electronic device for performing data mapping by performing operation of CNN model with low power, and driving method thereof
The DRAM-AiM structure with dynamic multiplier management and PIM controller optimizes CNN computations in mobile devices, addressing power and speed issues by reducing energy consumption and enhancing efficiency.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- FOUND FOR RES & BUSINESS SEOUL NAT UNIV OF SCI & TECH
- Filing Date
- 2025-04-22
- Publication Date
- 2026-06-25
Abstract
Description
Electronic device for performing data mapping by performing CNN model computations with low power and method for driving the same
[0001] Various embodiments of the present invention relate to an electronic device for performing data mapping by performing computations of a CNN model with low power and a method for driving the same.
[0002] Convolutional Neural Networks (CNNs) are recognized as an innovative technology in the field of computer vision, providing high accuracy and efficiency, particularly in tasks such as image classification and object detection. Thanks to advancements in deep learning performance and lightweighting techniques, the development of these CNN-based technologies has enabled their use with excellent security and safety on mobile and edge devices.
[0003] However, while these technologies have conventionally been easy to use on devices with large capacities or various types of memory, such as computers, laptops, supercomputers, and quantum computers, there have been many difficulties in using them in mobile environments.
[0004] Meanwhile, in such mobile environments, Depthwise Separable Convolution (DSC) structures are widely used to balance computational efficiency and high performance against high power consumption, and interest in Processing-in-Memory (PIM) is increasing. However, performing DSC computations in PIM causes a serious speed degradation.
[0005] Therefore, to overcome these problems, an integrated software / hardware solution for efficient computation is currently required.
[0006] Accordingly, the present embodiment relates to an electronic device and a method for operating the same which, upon receiving at least one input data, can provide a low-power, high-efficiency computing device capable of adjusting the size of the data, deactivating unnecessary computing units, and maximizing utilization when external assistance is provided.
[0007] According to various embodiments, an electronic device for performing data mapping by performing operations of a CNN model with low power comprises: a communication interface; a processor; and a memory including a Dynamic Random Access Memory (DRAM) that stores a plurality of total multipliers for performing multiple operations simultaneously, which, when executed by the processor, enable the processor to perform at least one operation related to the CNN model; a control bit for deactivating a second multiplier not required for the operation, as distinct from a plurality of first multipliers required for the operation; and a 2-to-1 selector. The at least one operation comprises: when at least one input data is acquired from an external electronic device through the communication interface and the kernel size of the at least one input data is determined to be greater than or equal to a preset threshold size value, automatically dividing the kernel size of the at least one input data to a preset size; when the kernel size of the at least one input data is less than the preset threshold size value or has been automatically divided, inputting the at least one input data into at least one space (stage) stored in the memory; determining the number of first multipliers required for a weighted sum (MAC, multiplication-accumulation) operation of the at least one input data in the space and, if the number of first multipliers is determined to be less than the total number of multipliers, driving the control bit and the 2-to-1 selector to disable the second multiplier; and receiving, at each of a plurality of sub-processors through the communication interface, a global buffer containing a plurality of weight values, and inputting the at least one input data into the first multiplier for mapping.
[0008] The electronic device according to the present embodiment has the advantage of minimizing the energy consumed according to the kernel size by checking the kernel size of at least one input data, reducing the energy consumed by the multipliers by checking the number of multipliers in use and deactivating unnecessary ones, and maximizing energy efficiency by reading the weights into the global buffer while the input data is being supplied, so that the energy spent is put to maximum use in the minimum amount of time.
[0009] FIG. 1 illustrates a block diagram of an electronic device and a network according to various embodiments of the present invention.
[0010] FIGS. 2a to 2c are specific exemplary diagrams of a memory according to various embodiments of the present invention.
[0011] FIGS. 3a to 3c are exemplary diagrams of a processor unit design change of a kernel convolution layer of an electronic device according to various embodiments of the present invention.
[0012] FIGS. 4a and 4b are exemplary diagrams illustrating the data mapping of a convolution layer in an activation group of an electronic device before and after modification according to various embodiments of the present invention.
[0013] FIG. 5 is a flowchart illustrating a method of operation of an electronic device according to various embodiments of the present invention.
[0014] FIGS. 6a to 6d are exemplary diagrams of data mapping based on the computational process of a convolution layer of an electronic device according to various embodiments of the present invention.
[0015] Hereinafter, various embodiments of this document are described with reference to the accompanying drawings. The embodiments and the terms used therein are not intended to limit the technology described in this document to specific embodiments and should be understood to include various modifications, equivalents, and / or substitutions of said embodiments. In relation to the description of the drawings, similar reference numerals may be used for similar components. A singular expression may include a plural expression unless the context clearly indicates otherwise. In this document, expressions such as "A or B" or "at least one of A and / or B" may include all possible combinations of items listed together. Expressions such as "1st," "2nd," "first," or "second" may modify said components regardless of order or importance and are used only to distinguish one component from another and do not limit said components. When it is mentioned that a certain (e.g., 1st) component is "(functionally or communicatively) coupled" or "connected" to another (e.g., 2nd) component, said certain component may be directly connected to said other component or connected through another component (e.g., 3rd component).
[0016] In this document, "configured to" may be used interchangeably, depending on the context and in terms of hardware or software, with, for example, "suitable for," "capable of," "modified to," "made to," or "designed to." In some cases, the expression "device configured to" may mean that the device is "capable of" operating in conjunction with other devices or components. For example, the phrase "processor configured to perform A, B, and C" may mean a dedicated processor for performing the corresponding operations (e.g., an embedded processor), or a general-purpose processor capable of performing the corresponding operations by executing one or more software programs stored in a memory device (e.g., a CPU or application processor).
[0017] An electronic device according to various embodiments of the present document may include, for example, at least one of a smartphone, a tablet PC, a desktop PC, a laptop PC, a netbook computer, a workstation, and a server.
[0018] Referring to FIG. 1, an electronic device (101) within a network environment (100) in various embodiments is described. The electronic device (101) may include a bus (110), a processor (120), a memory (130), an input / output interface (140), a display (150), and a communication interface (160). In some embodiments, the electronic device (101) may omit at least one of the components or additionally include other components. The bus (110) may include a circuit that connects the components (110-160) to each other and transmits communication (e.g., control messages or data) between the components. The processor (120) may include one or more of a central processing unit, an application processor, or a communication processor (CP). The processor (120) may, for example, perform operations or data processing regarding the control and / or communication of at least one other component of the electronic device (101).
[0019] The memory (130) may include volatile and / or non-volatile memory. The memory (130) may store instructions or data related to at least one other component of the electronic device (101), for example. According to one embodiment, the memory (130) may store software and / or a program (140).
[0020] The input / output interface (140) can, for example, transmit commands or data input from a user or other external device to other component(s) of the electronic device (101), or output commands or data received from other component(s) of the electronic device (101) to the user or other external device.
[0021] The display (150) may include, for example, a liquid crystal display (LCD), a light-emitting diode (LED) display, an organic light-emitting diode (OLED) display, a microelectromechanical system (MEMS) display, or an electronic paper display. The display (150) may display various content (e.g., text, images, videos, icons, and / or symbols, etc.) to a user, for example. The display (150) may include a touch screen and may receive touch, gesture, proximity, or hovering input using, for example, an electronic pen or a part of the user's body. The communication interface (160) may establish communication between, for example, the electronic device (101) and an external device (e.g., a first external electronic device (102), a second external electronic device (104), or a server (108)). For example, the communication interface (160) can be connected to a network (162) via wireless or wired communication to communicate with an external device (e.g., a second external electronic device (104) or a server (108)).
[0022] Wireless communication may include cellular communication using at least one of, for example, LTE, LTE-A (LTE Advanced), CDMA (code division multiple access), WCDMA (wideband CDMA), UMTS (universal mobile telecommunications system), WiBro (Wireless Broadband), or GSM (Global System for Mobile Communications). According to one embodiment, wireless communication may include at least one of, for example, WiFi (wireless fidelity), Bluetooth, Bluetooth Low Energy (BLE), Zigbee, NFC (near field communication), Magnetic Secure Transmission, Radio Frequency (RF), or Body Area Network (BAN). According to one embodiment, wireless communication may include GNSS. GNSS may be, for example, GPS (Global Positioning System), Glonass (Global Navigation Satellite System), Beidou Navigation Satellite System (hereinafter "Beidou"), or Galileo, the European global satellite-based navigation system. Hereinafter, in this document, "GPS" may be used interchangeably with "GNSS". Wired communication may include at least one of, for example, USB (universal serial bus), HDMI (high definition multimedia interface), RS-232 (recommended standard 232), power line communication, or POTS (plain old telephone service). The network (162) may include at least one of a telecommunications network, for example, a computer network (e.g., LAN or WAN), the Internet, or a telephone network.
[0023] Each of the external electronic devices (102, 104, 106) may be the same type of device as the electronic device (101) or a different type. According to various embodiments, all or part of the operations performed on the electronic device (101) may be performed on one or more other electronic devices (e.g., electronic devices (102, 104, 106) or a server (108)). According to one embodiment, when the electronic device (101) needs to perform a function or service automatically or upon request, the electronic device (101) may request at least some of the associated functions from another device (e.g., electronic devices (102, 104, 106) or a server (108)) instead of, or in addition to, performing the function or service itself. The other electronic device (e.g., electronic devices (102, 104, 106) or a server (108)) may perform the requested function or additional functions and transmit the result to the electronic device (101). The electronic device (101) may provide the requested function or service by using the received result as is or by processing it further. For this purpose, for example, cloud computing, distributed computing, or client-server computing technologies may be used.
[0024]
[0025] FIGS. 2a to 2c are specific exemplary diagrams of a memory according to various embodiments of the present invention.
[0026] FIGS. 3a to 3c are exemplary diagrams of a processor unit design change of a kernel convolution layer of an electronic device according to various embodiments of the present invention.
[0027] FIGS. 4a and 4b are exemplary diagrams illustrating the data mapping of a convolution layer in an activation group of an electronic device before and after modification according to various embodiments of the present invention.
[0028]
[0029] Referring to FIGS. 2a to 2c and FIGS. 3a to 3c, the memory (130) according to the present invention may include a Dynamic Random Access Memory (DRAM) that stores a plurality of total multipliers for performing multiple operations simultaneously, which, when executed by the processor (120), enable the processor (120) to perform at least one operation related to a CNN model, together with a control bit for deactivating a second multiplier not required for the operation, as distinct from the plurality of first multipliers required for the operation, and a 2-to-1 selector. Furthermore, as shown in FIGS. 3a to 3c, the memory (130) may include a DRAM-AiM (accelerator in memory) that can accelerate computation by reducing data movement for the memory-intensive and computation-intensive operations of the processor (120), can activate weights while keeping the vector data of at least one input data fixed, and can drive all of the multipliers. Specifically, the memory (130), which has a DRAM-AiM structure, can be configured to improve the computational efficiency of an artificial intelligence model, particularly a Convolutional Neural Network (CNN) model. Additionally, the memory (130) may include multiple multipliers that perform the multiplications at the core of CNN computation, may include control bits and a 2-to-1 selector that use computational resources efficiently by controlling which multipliers are disabled, and may be configured to accelerate computation by reducing the cost of the processor (120) repeatedly moving data.
Furthermore, the memory (130) having a DRAM-AiM structure may maximize computational parallelism by fixing the vector data and activating weights during CNN model computation, and may increase computational throughput by driving all multipliers. It can also be used for weight sparsity, which reduces the amount of computation and the usage of the memory (130) by deactivating weights that are not frequently used in the CNN model and processing only the activated weights, and for parallel computation, which improves model inference speed by running the multipliers within the DRAM simultaneously to process filter-based operations (e.g., convolution operations) in parallel. Additionally, the memory (130) can increase speed and energy efficiency by performing computations within the memory (130) while minimizing data movement between the memory (130) and the processor (120), which is the main cause of bottlenecks in CNN model computation. In addition, the memory (130) can include on-chip computation, which reduces the workload of the processor by using the multipliers and control circuits embedded in the DRAM to perform computations in the DRAM; can apply its multi-computation capability to time-series data processing, such as RNN and Transformer models, and to computations such as attention mechanisms; and can be applied to the small, lightweight artificial intelligence models used in IoT and edge devices to manage hardware resources efficiently.
Furthermore, the memory (130) having a DRAM-AiM structure reduces power consumption and can therefore increase the execution efficiency of CNN and other AI models on battery-powered devices, such as in mobile and edge computing. By disabling the unnecessary multipliers described later, it optimizes energy use and enables high performance with fewer resources, which can help the development of low-power AI hardware. In addition, the DRAM-AiM structure can serve as the basis for custom processor hardware optimized for CNN models, can enable a memory-centric architecture in which calculations are performed directly inside the DRAM through in-memory computing, and can reduce data bottlenecks and improve speed even in the AI training stage, which requires large-scale parallel computing, thereby enabling various applications in the training of artificial intelligence models.
[0030] According to one embodiment, when the memory (130) performs the activation of at least one input data as illustrated in FIGS. 3a to 3c and FIGS. 4a and 4b, 16×32 weights are loaded simultaneously, 16 values are computed simultaneously when performing a weighted sum operation of the at least one input data, and the computed value held in the global buffer can be applied equally to the sub-processors of all activations. According to another embodiment, the memory (130) may further include a processing-in-memory (PIM) controller (command) that reads the address of at least one input data in reverse. Specifically, the memory (130) may be based on a PIM architecture designed to support efficient computation of artificial intelligence models such as CNNs, and can load the 16×32 weights required for large-scale parallel computations, such as CNN layers, at once, thereby reducing data transfer time and accelerating computation. In the process of summing the results after multiplying the input data by the weights, it can perform a weighted sum operation in which 16 values are calculated simultaneously. The memory (130) also serves as a central hub that stores computation results and shares data with sub-processors inside or outside the memory. The computation values stored in the global buffer can be reused in various activation processes, which prevents duplicate calculations and makes more efficient use of system resources. Additionally, the PIM controller is a control circuit embedded in the memory (130) that allows the memory to perform computations rather than act only as storage. In particular, it can read data directly from within the memory (130) and execute multiplication-accumulation (weighted sum) operations.
It can also be designed to maximize efficiency in specific computation patterns of the CNN model (e.g., backpropagation operations or non-linear data access) by reading the address of the input data in reverse, processing the data array in reverse order, or optimizing memory access patterns to reduce data bottlenecks. In operation, at least one input data is loaded into the memory (130) and 16×32 weights are read simultaneously to complete the preparation; after the input data is multiplied by the weights, 16 values are calculated in parallel and the result can be stored in the global buffer. The global buffer then transmits the calculated value to each sub-processor, and all sub-processors can perform additional operations using the same activation data. The PIM controller controls operations within the memory and, if necessary, can invert data addresses to improve memory access efficiency. Thus, the memory (130) according to this embodiment can enable efficient data processing not only in the forward operation of the CNN but also in the backpropagation process. As a result, the memory (130) can load and process multiple weights and input data simultaneously, which improves computation speed. Furthermore, through data sharing using the global buffer, the same data can be reused for multiple activation operations, minimizing unnecessary data movement. Additionally, the cost and power consumption of data transfer between the processor and memory are reduced, while optimized memory access methods such as address inversion can maintain performance even in complex computation patterns.
In addition, using the above advantages, the memory (130) can accelerate the convolution operations of CNN models used in deep learning frameworks for applications such as real-time image processing or object recognition; can accelerate, through data parallelization, natural language processing (NLP) models such as Transformers that require large-scale matrix operations; can be useful in IoT and edge computing environments where low-power, high-efficiency computation is essential; and can also be applied to big data analysis or AI training that requires large-scale data computation.
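The weighted-sum step and the sharing of one global-buffer vector across sub-processors, as described in paragraph [0030], can be sketched in software as follows. This is only an illustrative model, not the hardware design: the function names `mac16` and `broadcast_mac`, the sample values, and the choice of which operand sits in the global buffer are assumptions made for the example.

```python
LANES = 16  # number of multipliers per processing unit, per the description


def mac16(shared_vector, weights):
    """One weighted-sum step: 16 multiplications followed by accumulation."""
    if len(shared_vector) != LANES or len(weights) != LANES:
        raise ValueError("expected exactly 16 values per operand")
    return sum(a * w for a, w in zip(shared_vector, weights))


def broadcast_mac(global_buffer, per_unit_weights):
    """Apply the same global-buffer vector to every sub-processor (hypothetical model)."""
    return [mac16(global_buffer, weights) for weights in per_unit_weights]


# Illustrative data: one shared vector, three sub-processors with different weights.
global_buffer = list(range(16))
per_unit_weights = [[1] * 16, [2] * 16, [0] * 15 + [1]]
results = broadcast_mac(global_buffer, per_unit_weights)
print(results)
```

Because the vector is read once and reused by every sub-processor, the sketch mirrors the reuse of global-buffer data that the paragraph credits with reducing data movement.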
[0031]
[0032] FIG. 5 is a flowchart illustrating a method of operation of an electronic device according to various embodiments of the present invention.
[0033] FIGS. 6a to 6d are exemplary diagrams of data mapping based on the computational process of a convolution layer of an electronic device according to various embodiments of the present invention.
[0034]
[0035] In operation 501, the electronic device (101) (e.g., the processor (120) of FIG. 1) can obtain at least one input data from an external electronic device (102, 104, 106) through a communication interface (e.g., the communication interface (160) of FIG. 1). According to one embodiment, the at least one input data may consist of data input to the memory (130), which has a DRAM-AiM structure, and may have a structure including a kernel size. Additionally, the at least one input data may be obtained from an external electronic device (102, 104, 106), but may also consist of data obtained from a server (108) and / or data generated by the electronic device (101) itself. Subsequently, the electronic device (101) can extract the kernel size of the at least one input data.
[0036] In operation 503, the electronic device (101) (e.g., the processor (120) of FIG. 1) can determine whether the kernel size of at least one input data is greater than or equal to a preset threshold size value, as illustrated in FIGS. 6a to 6d. According to one embodiment, the threshold size value corresponds to a kernel size of 3×3 or greater to be input to each of the at least one space, may include the case where at least one of the horizontal and / or vertical values is 3 or greater, and may be set as a reference value such that a kernel of the same size as the threshold is also treated as exceeding it. Subsequently, the electronic device (101) can compare the kernel size of the at least one input data against the set threshold size value. If the electronic device (101) determines that the kernel size of the at least one input data is 2×2 and / or 1×1, it can determine that the kernel size is smaller than the threshold size value, skip operation 505, and execute operation 507 immediately.
[0037] In operation 505, if the electronic device (101) (e.g., the processor (120) of FIG. 1) determines that the kernel size of at least one input data is greater than or equal to the threshold size value, it can automatically divide the kernel size of the at least one input data to a set size. According to one embodiment, if the kernel size of the at least one input data is greater than or equal to the threshold size value, the electronic device (101) decomposes the kernel into 2×2 and / or 1×1 kernels, inputs the 2×2 kernels into a first space that performs the process at least twice, and performs the CNN operation on the 1×1 kernel. Specifically, when the kernel size of the at least one input data is 3×3, the electronic device (101) can determine that it equals the threshold size value. Subsequently, the electronic device (101) can decompose the 3×3 kernel into two 2×2 kernels and one 1×1 kernel. The electronic device (101) can then input the two 2×2 kernels into the first space and the one 1×1 kernel into the second space, among the spaces (stages) described later. Afterwards, the electronic device (101) can prepare to perform operations on the kernels input into the first space and the second space, respectively.
[0038] In operation 507, the electronic device (101) (e.g., the processor (120) of FIG. 1) can input at least one input data into at least one space (stage) already stored in the memory (130) when the kernel size of the at least one input data is smaller than the threshold size value or has been automatically divided, as illustrated in FIGS. 6a to 6d. According to one embodiment, the electronic device (101) inputs the 2×2 kernels of the at least one input data into the first space and processes them twice within an existing convolution cycle, which can reduce bank activation by more than half, and the remaining kernel can be set to be input into the second space, where the same operation as the basic convolution is performed. Afterwards, the electronic device (101) can lower power consumption by determining the number of multipliers needed for the weighted sum operation.
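As a minimal sketch of the decomposition in operation 505, the nine weights of a 3×3 kernel can be grouped into two sets of four (the size of a 2×2 kernel) and one set of one (a 1×1 kernel), and the partial weighted sums of the groups add up to the full 3×3 result. The description does not state which weights go into which group, so the raster-order grouping below, the function names `split_kernel` and `weighted_sum`, and the sample values are all assumptions made for illustration.

```python
def split_kernel(kernel, threshold=3):
    """Group a square kernel's weights into 4-weight (2x2-sized) and 1-weight (1x1) sets.

    Kernels smaller than the threshold are returned as a single group (no split).
    The raster-order assignment of weights to groups is an assumption.
    """
    size = len(kernel)
    taps = [(r, c, kernel[r][c]) for r in range(size) for c in range(size)]
    if size < threshold:
        return [taps]
    full = len(taps) - len(taps) % 4
    groups = [taps[i:i + 4] for i in range(0, full, 4)]
    groups += [[tap] for tap in taps[full:]]
    return groups


def weighted_sum(patch, groups):
    """Accumulate the partial sums of every group for one input patch."""
    return sum(patch[r][c] * w for group in groups for r, c, w in group)


kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
patch = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
groups = split_kernel(kernel)
direct = sum(patch[r][c] * kernel[r][c] for r in range(3) for c in range(3))
print([len(g) for g in groups], weighted_sum(patch, groups), direct)
```

The point of the check is only that splitting does not change the arithmetic result; the energy benefit claimed in the description comes from how the smaller groups are scheduled in the memory, which this sketch does not model.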
[0039] In operation 509, the electronic device (101) (e.g., the processor (120) of FIG. 1) can determine whether the number of first multipliers required for a weighted sum (MAC, multiplication-accumulation) operation of at least one input data in the space is less than the total number of multipliers. According to one embodiment, the electronic device (101) assigns a bit value to each control bit included in the memory, and if the output from the multiplier connected to a control bit is 1, it determines that the multiplier is a first multiplier required for the weighted sum operation and can activate it. Specifically, the existing processing unit consumes a lot of energy because it always calculates 16 values simultaneously and cannot adjust that number automatically; to overcome this, the electronic device (101) can activate only the first multipliers required for the weighted sum operation, and for this purpose can add a control bit and a 2-to-1 selector to each multiplier, as shown in FIGS. 3a to 3c. Subsequently, the electronic device (101) may add multiple x logic gates and control bits so that each adder module can be deactivated and the depth reduced. As illustrated in FIGS. 3a to 3c, the electronic device (101) may have a total of 16 control bits; by inputting a 1-bit value to each control bit, it can bypass when the value is 1 and keep the multiplier active, determining that the corresponding multiplier is a first multiplier. Additionally, if all 16 multipliers are first multipliers, the electronic device (101) may skip operation 511 and proceed directly to operation 513.
[0040] In operation 511, if the electronic device (101) (e.g., the processor (120) of FIG. 1) determines that the number of first multipliers is less than the total number of multipliers, it can drive the control bit and the 2-to-1 selector to identify and disable the second multiplier. According to one embodiment, the electronic device (101) assigns a bit value to each control bit included in the memory, and if the output from the multiplier connected to a control bit is 0, it determines that the multiplier is a second multiplier not needed for the weighted sum operation and can disable it. Additionally, as shown in FIGS. 3a to 3c, the electronic device (101) may configure a total of 16 control bits, input a 1-bit value to each control bit, and output 0 when the value is 0, thereby determining that the corresponding multiplier is an unnecessary second multiplier and disabling it. Additionally, for the second multiplier, the electronic device (101) can disable the associated adder module and reduce the depth of the adder tree. Afterward, the electronic device (101) can configure the at least one input data to be learned by the required first multipliers.
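The control-bit gating of operations 509 and 511 can be modeled in a few lines. This is a software sketch under stated assumptions, not the circuit: the names `control_bits`, `select`, `gated_mac`, and `adder_tree_depth` are hypothetical, the needed multipliers are assumed to occupy the lowest lanes, and the adder-tree depth is assumed to be that of a balanced binary tree over the active lanes.

```python
TOTAL_MULTIPLIERS = 16  # total multipliers and control bits, per the description


def control_bits(needed):
    """Return 16 control bits: 1 keeps a (first) multiplier active, 0 disables it."""
    if not 0 <= needed <= TOTAL_MULTIPLIERS:
        raise ValueError("needed must be between 0 and 16")
    return [1] * needed + [0] * (TOTAL_MULTIPLIERS - needed)


def select(bit, product):
    """2-to-1 selector: forward the product when the bit is 1, otherwise forward 0."""
    return product if bit else 0


def gated_mac(activations, weights, bits):
    """Weighted sum in which disabled multipliers do no work and contribute 0."""
    total = 0
    for a, w, bit in zip(activations, weights, bits):
        product = a * w if bit else None  # a disabled multiplier is never exercised
        total += select(bit, product)
    return total, sum(bits)


def adder_tree_depth(active):
    """Depth of a balanced binary adder tree over the active lanes (assumption)."""
    return (active - 1).bit_length() if active > 0 else 0


bits = control_bits(4)  # e.g. a 2x2 kernel needs only 4 of the 16 multipliers
total, active = gated_mac([1] * 16, [2] * 16, bits)
print(total, active, adder_tree_depth(active), adder_tree_depth(TOTAL_MULTIPLIERS))
```

With four active lanes the modeled tree is two levels deep instead of four, which is the kind of depth reduction the paragraph attributes to disabling the adder modules of unused multipliers.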
[0041] In operation 513, the electronic device (101) (e.g., the processor (120) of FIG. 1) may receive a global buffer containing multiple weight values in a plurality of sub-processors in a configured first multiplier. According to one embodiment, the electronic device (101) may load weights into the global buffer and schedule activation map data of at least one input data to a PIM-controller in the sub-processor. Specifically, the electronic device (101) may find that, as illustrated in FIG. 4a, the one-to-one correspondence between weights and activation maps is difficult for the entire global buffer obtained from external electronic devices (102, 104, 106) and / or a server (108), and that there may be a problem of consuming a lot of energy or other types of data, which is a cause of slowdown for the depthwise convolution layer. Additionally, the electronic device (101) can be configured to be learned in a one-to-one correspondence as much as possible by assigning weight values to each of the global buffers in advance without consuming a lot of energy, as shown in FIG. 4b.
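The one-to-one correspondence mentioned for the depthwise convolution layer can be illustrated with a small example. In a depthwise convolution, each channel's activation patch is combined only with that channel's own filter, so one shared vector cannot simply be broadcast to every channel. The function name `depthwise_outputs` and the sample values are hypothetical, and the sketch shows only the arithmetic, not the FIG. 4a or FIG. 4b mapping.

```python
def depthwise_outputs(activation_patches, channel_filters):
    """Compute one output per channel: patch i is paired only with filter i."""
    if len(activation_patches) != len(channel_filters):
        raise ValueError("depthwise convolution needs one filter per channel")
    return [
        sum(a * w for a, w in zip(patch, weights))
        for patch, weights in zip(activation_patches, channel_filters)
    ]


# Two channels, each with a flattened 2x2 patch and its own 2x2 filter.
patches = [[1, 2, 3, 4], [5, 6, 7, 8]]
filters = [[1, 0, 0, 1], [0, 1, 1, 0]]
outputs = depthwise_outputs(patches, filters)
print(outputs)
```

Each output depends on exactly one (patch, filter) pair, which is why a mapping that keeps weights and activation maps aligned per channel matters for this layer type.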
[0042] In operation 515, the electronic device (101) (e.g., the processor (120) of FIG. 1) can determine whether the number of groups in the global buffer exceeds a preset bit threshold. According to one embodiment, the preset bit threshold may be set as a reference value for determining whether the number of at least one input data is equal to the total number of subprocessors input into the global buffer. Additionally, the preset bit threshold may generally be set to the case of one input data and thus determined as a value of 1. Subsequently, the electronic device (101) can compare whether the number of groups in the global buffer exceeds the bit threshold value set by the manager. Additionally, if the number of groups in the global buffer determined to be equal to the bit threshold (e.g., 1) is equal, the electronic device (101) may skip operation 517 and proceed to operation 519.
[0043] In operation 517, if the electronic device (101) (e.g., the processor (120) of FIG. 1) determines that the number of groups of the global buffer exceeds the preset bit threshold, it can activate the PIM-controller to activate each of the multiple sub-processors included in the global buffer. According to one embodiment, the electronic device (101) can determine that the number of groups of the global buffer exceeds the preset bit threshold (e.g., 1) when it determines that the global buffer has 4 groups, as shown in FIG. 4b. Subsequently, the electronic device (101) can input the at least one input data to each of the sub-processors of the groups of the global buffer, and can assign weights to the global buffer so that each sub-processor is driven with its corresponding weight.
[0044] In operation 519, the electronic device (101) (e.g., the processor (120) of FIG. 1) can map the at least one input data to the first multipliers for which the global buffer is activated. According to one embodiment, the electronic device (101) applies the first multipliers whose activation is complete to the at least one input data whose kernel size has been arranged, and when the global buffer is activated with weight values, it can process and utilize the at least one input data. This allows the electronic device (101) to reduce the energy consumed according to the kernel size, reduce energy consumption by deactivating unnecessary multipliers, and shorten the time during which energy is used by maximizing utilization of the global buffer, with input data fed to each sub-processor and weights loaded into the global buffer.
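The control flow of operations 515 to 519 can be sketched as follows. This is an illustrative sketch only: the function name `map_inputs`, the list-based representation of weight groups and activation slices, and the round-robin pairing of sub-processors with groups are assumptions, not details from the disclosure. What it takes from the text is the comparison against a bit threshold of 1, the skipping of operation 517 when the threshold is not exceeded, and the final mapping step.

```python
BIT_THRESHOLD = 1  # example value given in the text for one input data


def map_inputs(weight_groups, activation_slices, bit_threshold=BIT_THRESHOLD):
    """Return (trace of operation numbers, per-sub-processor weighted sums)."""
    if not weight_groups:
        raise ValueError("the global buffer needs at least one weight group")

    trace = [515]  # operation 515: compare the group count with the threshold
    if len(weight_groups) > bit_threshold:
        trace.append(517)  # operation 517: PIM-controller activates sub-processors

    trace.append(519)  # operation 519: map input data onto the multipliers
    outputs = []
    for index, activations in enumerate(activation_slices):
        # Assumed pairing: sub-processor i uses weight group i modulo the group count.
        weights = weight_groups[index % len(weight_groups)]
        outputs.append(sum(w * a for w, a in zip(weights, activations)))
    return trace, outputs


trace_one, _ = map_inputs([[1, 1]], [[1, 2]])
trace_four, sums = map_inputs(
    [[1, 1], [2, 2], [3, 3], [4, 4]],
    [[1, 2], [3, 4], [5, 6], [7, 8]],
)
```

With a single group the trace skips operation 517, while the four-group case of FIG. 4b passes through it before mapping.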
[0045]
[0046] The electronic device (101) according to the present embodiment has the advantage of being able to minimize the energy consumed according to the kernel size by checking the kernel size of the at least one input data, reduce the energy consumed by the multipliers by checking the number of multipliers in use and deactivating unnecessary multipliers, and maximize energy efficiency by completing the most work in the least time, loading the weights into the global buffer while the input data is being input.
[0047]
[0048] According to various embodiments, an electronic device for performing data mapping by performing operations of a CNN model with low power comprises: a communication interface; a processor; and a memory including a Dynamic Random Access Memory (DRAM) that includes a plurality of total multipliers that, when operated by the processor, perform multiple operations simultaneously while the processor performs at least one operation related to the CNN model, a control bit for deactivating a second multiplier not required for the operation, as distinct from a plurality of first multipliers required for the operation, and a 2:1 selector, wherein the at least one operation comprises: when at least one input data is acquired from an external electronic device through the communication interface, if it is determined that the kernel size of the at least one input data is greater than or equal to a preset threshold size value, automatically dividing the kernel size of the at least one input data to a preset size; when the kernel size of the at least one input data is less than or equal to the preset threshold size value or has been automatically divided, inputting the at least one input data into at least one space (stage) stored in the memory; determining the number of the first multipliers requiring a weighted sum (MAC, multiplication-accumulation) operation of the at least one input data in the space, and if it is determined that the number of the first multipliers is less than or equal to the total number of multipliers, driving the control bit and the 2-to-1 selector to disable the second multiplier; and receiving, through the communication interface, a global buffer containing a plurality of weight values at each of a plurality of subprocessors, and inputting the at least one input data into the first multiplier for mapping.
[0049] According to various embodiments, the memory includes a DRAM-AiM (accelerator in memory) capable of accelerating operation speed by reducing data movement for computationally intensive operations of the processor, activating weights while keeping the vector data of the at least one input data fixed, and driving all of the multipliers, wherein the at least one operation includes: when performing the activation of the at least one input data, loading 16 × 32 of the plurality of weights simultaneously; when performing the weighted sum operation of the at least one input data, computing 16 values simultaneously; and applying the operation value held in the global buffer equally to the plurality of subprocessors for all activations.
[0050] According to various embodiments, the threshold size value corresponds to the case where the kernel size to be input to each of the at least one space is 3×3 or greater, including the case where at least one of the horizontal and/or vertical values is 3 or greater, and the at least one operation includes: if the kernel size of the at least one input data exceeds the threshold size value, decomposing the kernel of the at least one input data into 2×2 and/or 1×1 kernels; when the kernel size of the at least one input data is 2×2, inputting it into a first space that performs the process at least twice; and when the kernel size of the at least one input data is 1×1, inputting it into a second space that performs the operation of the CNN.
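One way to picture this decomposition is as a tiling of the kernel's taps. The sketch below is an illustration under stated assumptions, not the disclosed procedure: the text does not say how the 2×2 and 1×1 pieces are chosen, so the greedy tiling from the top-left corner and the function name `tile_kernel` are hypothetical. It only shows that a kernel of 3×3 or larger can be covered exactly by 2×2 and 1×1 pieces.

```python
def tile_kernel(height, width):
    """Cover a height x width kernel with non-overlapping 2x2 and 1x1 tiles.

    Returns a list of (row, col, size) tuples, where size is 2 or 1.
    Assumption: greedy tiling that starts at the top-left corner.
    """
    if height < 1 or width < 1:
        raise ValueError("kernel dimensions must be positive")
    tiles = []
    for row in range(0, height, 2):
        for col in range(0, width, 2):
            if row + 1 < height and col + 1 < width:
                tiles.append((row, col, 2))  # a full 2x2 block fits here
            else:
                # Leftover edge cells become individual 1x1 tiles.
                for r in range(row, min(row + 2, height)):
                    for c in range(col, min(col + 2, width)):
                        tiles.append((r, c, 1))
    return tiles


tiles_3x3 = tile_kernel(3, 3)
count_2x2 = sum(1 for _, _, size in tiles_3x3 if size == 2)
count_1x1 = sum(1 for _, _, size in tiles_3x3 if size == 1)
```

Under this assumed tiling, a 3×3 kernel yields one 2×2 tile and five 1×1 tiles, which together cover all nine taps. Other tilings are equally possible.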
[0051] According to various embodiments, the at least one operation includes assigning a bit value to each of the control bits included in the memory, and if the output from the multiplier connected to the control bit is 1, determining that it is a first multiplier requiring the weighted sum operation and activating the multiplier, and if the output from the multiplier connected to the control bit is 0, determining that it is a second multiplier not requiring the weighted sum operation and deactivating the multiplier.
[0052] According to various embodiments, the memory further includes a processing-in-memory (PIM) controller (command) that reads the address of the at least one input data in reverse, and the at least one operation includes loading weights into the global buffer, scheduling the activation map data of the at least one input data to the subprocessors with the PIM controller, and, when the number of groups of the global buffer during the scheduling of the PIM controller is greater than or equal to a preset bit threshold, activating the PIM controller to activate each of the plurality of subprocessors included in the global buffer.
[0053]
[0054] As used in this document, the terms “module” or “part” include a unit composed of hardware, software, or firmware, and may be used interchangeably with terms such as logic, logic block, component, or circuit, for example. A “module” or “part” may be an integrally formed component, or a minimum unit or part thereof, that performs one or more functions. A “module” or “part” may be implemented mechanically or electronically and may include, for example, an application-specific integrated circuit (ASIC) chip, field-programmable gate arrays (FPGAs), or a programmable logic device, known or to be developed, that performs certain operations, and may be executed by the processor (120). At least part of the device (e.g., modules or functions thereof) or method (e.g., operations) according to various embodiments may be implemented as instructions stored in a computer-readable storage medium (e.g., memory (130)) in the form of a program module. When the instructions are executed by a processor (e.g., processor (120)), the processor may perform the functions corresponding to the instructions. Computer-readable recording media may include a hard disk, a floppy disk, a magnetic medium (e.g., magnetic tape), an optical recording medium (e.g., CD-ROM, DVD), a magneto-optical medium (e.g., floptical disk), built-in memory, etc. Instructions may include code generated by a compiler or code that can be executed by an interpreter. A module or program module according to various embodiments may include at least one of the aforementioned components, may omit some of them, or may additionally include other components. Operations performed by a module, program module, or other components according to various embodiments may be executed sequentially, in parallel, iteratively, or heuristically, or at least some operations may be executed in a different order or omitted, or other operations may be added.
[0055] Furthermore, the embodiments disclosed in this document are presented for the purpose of explaining and understanding the disclosed technical content and are not intended to limit the scope of this disclosure. Accordingly, the scope of this disclosure should be interpreted to include all modifications or various other embodiments based on the technical concept of this disclosure.
Claims
1. An electronic device for performing data mapping by performing computations of a CNN model with low power, the electronic device comprising: a communication interface; a processor; and a memory comprising a plurality of total multipliers that, when operated by the processor, perform multiple operations simultaneously while the processor performs at least one operation related to the CNN model, a control bit for deactivating a second multiplier not required for the operation, as distinct from a plurality of first multipliers required for the operation, and a 2:1 selector, wherein the at least one operation comprises: when at least one input data is obtained from an external electronic device through the communication interface, if it is determined that the kernel size of the at least one input data is greater than or equal to a preset threshold size value, automatically dividing the kernel size of the at least one input data to match a preset size; when the kernel size of the at least one input data is less than or equal to the threshold size value or has been automatically divided, inputting the at least one input data into at least one space (stage) already stored in the memory; determining the number of the first multipliers requiring a weighted sum (MAC, multiplication-accumulation) operation of the at least one input data in the space, and if it is determined that the number of the first multipliers is less than or equal to the total number of multipliers, driving the control bit and the 2-to-1 selector to disable the second multiplier; and receiving, through the communication interface, a global buffer containing a plurality of weight values at each of a plurality of subprocessors, inputting the at least one input data to the global buffer, inputting the plurality of weight values to the DRAM, and inputting the at least one input data to the first multiplier for mapping.
2. The electronic device of claim 1, wherein the memory includes a DRAM-AiM (accelerator in memory) capable of accelerating operation speed by reducing data movement for computationally intensive operations of the processor, activating weights while keeping the vector data of the at least one input data fixed, and driving all of the multipliers, and wherein the at least one operation includes: when performing the activation of the at least one input data, loading 16 × 32 of the plurality of weights simultaneously; when performing the weighted sum operation of the at least one input data, computing 16 values simultaneously; and applying the computed value held in the global buffer equally to the plurality of subprocessors for all activations.
3. The electronic device of claim 1, wherein the threshold size value corresponds to the case where the kernel size to be input to each of the at least one space is 3×3 or greater, including the case where at least one of the horizontal and/or vertical values is 3 or greater, and wherein the at least one operation includes: if the kernel size of the at least one input data exceeds the threshold size value, decomposing the kernel of the at least one input data into 2×2 and/or 1×1 kernels; when the kernel size of the at least one input data is 2×2, inputting it into a first space that performs the process at least twice; and when the kernel size of the at least one input data is 1×1, inputting it into a second space that performs the operation of the CNN.
4. The electronic device of claim 1, wherein the at least one operation includes assigning a bit value to each of the control bits included in the memory, and if the multiplier connected to the control bit outputs 1, determining that it is a first multiplier requiring the weighted sum operation and activating the corresponding multiplier, and if the multiplier connected to the control bit outputs 0, determining that it is a second multiplier not requiring the weighted sum operation and deactivating the multiplier.
5. The electronic device of claim 1, wherein the memory further includes a processing-in-memory (PIM) controller (command) that loads the address of the at least one input data in the direction opposite to the loading direction of the global buffer, and wherein the at least one operation includes loading weights and the at least one input data into the global buffer, scheduling the PIM-controller to the subprocessors using activation map data of the at least one input data, and, when the number of groups in the global buffer is greater than or equal to a preset bit threshold value during the scheduling of the PIM-controller, activating the PIM-controller to activate each of the plurality of subprocessors included in the global buffer.