Bit-width allocation method and apparatus in CNN accelerator

The method optimizes CNN accelerator resource allocation by dynamically setting bit widths using an actor-critic model, addressing inefficiencies and bottlenecks, enhancing performance and efficiency in resource-constrained environments.

WO2026127160A1 · PCT designated stage · Publication Date: 2026-06-18 · SEOUL NATIONAL UNIVERSITY R&DB FOUNDATION

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
SEOUL NATIONAL UNIVERSITY R&DB FOUNDATION
Filing Date
2024-12-19
Publication Date
2026-06-18

AI Technical Summary

Technical Problem

Existing CNN accelerators face challenges in efficiently allocating resources such as BRAM and DSP blocks due to varying computational complexity across layers, leading to bottlenecks and resource inefficiencies, which degrade overall performance.

Method used

A method for dynamically allocating optimal bit widths to each CNN layer based on resource constraints, using an actor-critic reinforcement learning model to balance accuracy and resource usage, optimizing BRAM and DSP utilization.

Benefits of technology

Enhances resource efficiency and computational performance by minimizing waste and ensuring balanced resource allocation, enabling effective execution of CNN models in resource-constrained environments like edge devices and IoT devices.

Generated by Eureka AI based on patent content.

Abstract

A bit-width allocation method according to a first aspect of the present invention comprises the steps of: acquiring resource constraints for a field programmable gate array (FPGA) including the number of blocks of a block RAM (BRAM) and the number of blocks of a digital signal processor (DSP); allocating, for a plurality of convolutional neural network (CNN) layers, an initial bit-width for each layer on the basis of the resource constraints; providing the initial bit-width to a CNN model including the plurality of CNN layers to evaluate the accuracy of the CNN model; and adjusting the initial bit-width on the basis of the evaluated accuracy to acquire an adjusted bit-width for each layer.

Description

Bitwidth Allocation Method and Device in CNN Accelerator

[0001] The present invention relates to a bit width allocation method and apparatus in a CNN accelerator. This research is related to a research project funded by the Ministry of Science and ICT (Government) and supported by the Korea Institute of Information and Communications Technology Planning and Evaluation (IITP), an affiliate of the National Research Foundation of Korea (Project Unique ID: 2020001080; Project Number: 2020-0-01080; R&D Project: Development of Next-Generation Intelligent Semiconductor Technology (Design) (R&D); Research Title: Development of Variable-Precision High-Speed Multi-Object Recognition Deep Learning Processor Technology; Project Period: 2020.04.01. – 2024.12.31.), and to a research project funded by the Ministry of Science and ICT (Government) and supported by IITP (Project Unique ID: 2710007826; Project Number: 00256081; R&D Project: Cultivation of Innovative Talent in Information and Communications Broadcasting (R&D); Research Title: Graduate School of Artificial Intelligence Semiconductors (Seoul National University); Project Period: 2023.07.01. – 2028.12.31.).

[0002] For reference, the present application claims priority to Korean Patent Application No. 10-2024-0186035, filed on December 13, 2024. The entire contents of that application, which forms the basis of the priority claim, are incorporated into the present application by reference.

[0003] A convolutional neural network accelerator is a hardware device optimized for specific operations of a CNN (e.g., convolution operations, application of activation functions, etc.) and may generally include specialized hardware such as a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), or a tensor processing unit (TPU).

[0004] In this regard, the pipelined architecture is a structure used to accelerate CNNs on FPGAs. Looking at the computational process in the pipelined architecture, the input feature maps (IFMs) of each layer are first stored in a line buffer, and the weights of each layer are stored in a dedicated buffer. Subsequently, multiplication operations are performed based on the feature maps and weights of each layer, and the result is calculated accordingly. In this computational process, all intermediate results are immediately passed to the next layer after the computation of the corresponding layer is completed, and through this method, overall DRAM access can be minimized. By reducing the number of data accesses to DRAM through the aforementioned hierarchical computation flow, energy efficiency can be improved by preventing the repeated storage and retrieval of calculated data from memory. Additionally, parallel processing performance can be enhanced because each layer of the pipeline can perform computations simultaneously.

[0005] In a pipeline structure, since each layer performs computations simultaneously, bottlenecks can occur if processing speeds vary by layer; therefore, equal throughput may be required for each layer. For example, if one layer of a CNN has a higher computational complexity than another, the layer with higher complexity may cause a bottleneck in the entire pipeline. In other words, because computational complexity and memory requirements differ for each layer, it may be necessary to allocate resources evenly to ensure equal throughput for each layer.

[0006] Meanwhile, FPGAs contain hardware design elements that affect data representation, computation speed, and resource utilization, such as BRAM (block RAM) blocks for storing data, DSP (digital signal processor) blocks for performing operations, and bit width for representing data. For example, a larger bit width requires more memory space, which can increase the usage of BRAM for data storage. As another example, a larger bit width requires more DSP blocks to be used in parallel to process it, which increases DSP block consumption and consequently can slow down computation speed.
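
As a rough illustration of the relationship between bit width and BRAM usage described above, the following sketch estimates how many BRAM blocks the weights of one layer would occupy at several bit widths. The 18 Kb block capacity, the layer dimensions, and the function name `bram_blocks_for_weights` are assumptions made for illustration and are not taken from this application; actual block capacity depends on the FPGA device.

```python
import math

# Assumed capacity of one BRAM block in bits (18 Kb); this is device-dependent.
BRAM_BITS = 18 * 1024


def bram_blocks_for_weights(num_weights: int, bit_width: int) -> int:
    """Estimate the BRAM blocks needed to hold num_weights values of bit_width bits."""
    total_bits = num_weights * bit_width
    return math.ceil(total_bits / BRAM_BITS)


if __name__ == "__main__":
    # Hypothetical 3x3 convolution layer with 64 input and 64 output channels.
    num_weights = 64 * 64 * 3 * 3
    for bits in (4, 8, 16):
        print(bits, bram_blocks_for_weights(num_weights, bits))
```

Under this simple model the block count grows linearly with the bit width, which is why reducing the bit width of a layer by even one bit can free BRAM for other layers.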

[0007] However, in reality, it is difficult to efficiently allocate BRAM and DSP blocks to all layers to accelerate CNNs; consequently, this can lead to problems where resources are excessively used during computation or specific layers become bottlenecks, resulting in a degradation of overall computational performance.

[0008] The problem that the present invention aims to solve includes providing a method for allocating an optimal bit width for each CNN layer by considering the resource constraints of the FPGA and the computational accuracy of the CNN accelerator.

[0009] However, the problems that the present invention aims to solve are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art to which the present invention belongs from the description below.

[0010] A bit-width allocation method according to the first aspect of the present invention comprises: a step of obtaining resource constraints for a field programmable gate array (FPGA) including the number of blocks of a block RAM (BRAM) and the number of blocks of a digital signal processor (DSP); a step of allocating an initial bit-width for each layer based on the resource constraints in a plurality of convolutional neural network (CNN) layers; a step of providing the initial bit-width to a CNN model including the plurality of CNN layers to evaluate the accuracy of the CNN model; and a step of adjusting the initial bit-width based on the evaluated accuracy to obtain an adjusted bit-width for each layer.

[0011] The above resource constraint may include a threshold value for resource usage for the BRAM and the DSP.

[0012] The initial bit width for each of the above layers can be determined by taking into account the resource constraints.

[0013] The initial bit width for each of the above layers may include an initial bit width determined differently for each of the above layers.

[0014] The above bit width may be the number of bits used to represent the weight and activation value of the layer.

[0015] The above CNN model may be a deep learning model that performs a classification operation. In this case, the accuracy may be the proportion of classification predictions that the CNN model makes correctly.

[0016] The bit width allocation model described above may include an actor network and a critic network. Here, the actor network may receive the resource constraints as input and determine the initial bit width for each layer based on a predetermined policy. Additionally, the critic network may evaluate the accuracy and provide the evaluated accuracy to the actor network.

[0017] The above actor network can adjust the initial bit width in a direction that maximizes the evaluated accuracy.

[0018] The step of allocating the initial bit width may include the step of calculating the usage of the BRAM and the usage of the DSP based on the initial bit width, and the step of determining whether the usage of the BRAM and the usage of the DSP exceed the resource constraint.

[0019] The usage of the BRAM and the usage of the DSP may exceed the resource constraints. In this case, during the step of allocating the initial bit width, the bit width for any layer among the plurality of CNN layers may be reduced by 1 to reallocate the initial bit width.

[0020] A bit-width allocation device according to a second aspect of the present invention includes a memory capable of storing computer-executable instructions and a processor, wherein the computer-executable instructions are executed by the processor, thereby obtaining resource constraints for a field-programmable gate array (FPGA) including the number of blocks of a block RAM (BRAM) and the number of blocks of a digital signal processor (DSP), and in a plurality of convolutional neural network (CNN) layers, an initial bit-width for each layer is allocated based on the resource constraints, and the initial bit-width is provided to a CNN model including the plurality of CNN layers so that the accuracy of the CNN model is evaluated, and based on the evaluated accuracy, the initial bit-width is adjusted so that an adjusted bit-width for each layer is obtained.

[0021] A computer-readable recording medium storing computer-executable instructions according to a third aspect of the present invention, wherein the computer-executable instructions, when executed by a processor, enable the processor to perform a method comprising the steps of: obtaining resource constraints for a field-programmable gate array (FPGA) including the number of blocks of a block RAM (BRAM) and the number of blocks of a digital signal processor (DSP); allocating an initial bit-width for each layer based on the resource constraints in a plurality of convolutional neural network (CNN) layers; providing the initial bit-width to a CNN model including the plurality of CNN layers to evaluate the accuracy of the CNN model; and adjusting the initial bit-width based on the evaluated accuracy to obtain an adjusted bit-width for each layer.

[0022] A computer program stored in a computer-readable recording medium according to a fourth aspect of the present invention comprises instructions for causing a processor to perform a method comprising the steps of: obtaining resource constraints for a field programmable gate array (FPGA) including the number of blocks of a block RAM (BRAM) and the number of blocks of a digital signal processor (DSP); allocating an initial bit-width for each layer based on the resource constraints in a plurality of convolutional neural network (CNN) layers; providing the initial bit-width to a CNN model including the plurality of CNN layers to evaluate the accuracy of the CNN model; and adjusting the initial bit-width based on the evaluated accuracy to obtain an adjusted bit-width for each layer.

[0023] According to one embodiment, compared to a conventional CNN accelerator with a fixed bit width method, resource usage is optimized and resource waste is minimized, so that more CNN models can be executed even in an environment with limited FPGA resources.

[0024] In addition, a balance between model accuracy and resource usage can be maintained even within limited resources, thereby maximizing the resource usage and computational performance of FPGA-based CNN accelerators. Therefore, this can provide significant advantages in situations where deep learning models are effectively executed in resource-constrained environments, such as edge devices or IoT devices.

[0025] In addition, the present invention can be used in low-power, high-efficiency applications such as edge computing, autonomous driving, and smart IoT devices. This can greatly expand the potential for utilizing FPGA-based AI accelerators in various industrial fields.

[0026] The effects obtainable from the present invention are not limited to those mentioned above, and other unmentioned effects will be clearly understood by those skilled in the art to which the present disclosure belongs from the description below.

[0027] FIG. 1 is a block diagram exemplarily illustrating a bit width allocation device according to one embodiment.

[0028] FIG. 2 is a block diagram exemplifying the function of a bit width allocation program.

[0029] FIG. 3 is a flowchart exemplarily showing a bit width allocation method according to one embodiment.

[0030] FIG. 4 is an example diagram showing a pipeline structure used to accelerate a CNN on an FPGA.

[0031] FIG. 5 is an example diagram conceptually showing a learning algorithm of a bit width allocation model according to one embodiment.

[0032] FIG. 6 is an example diagram showing the entire network for a bit width allocation model according to one embodiment.

[0033] FIG. 7 is a flowchart specifically showing the learning algorithm of a bit width allocation model according to one embodiment.

[0034] The advantages and features of the present invention and the methods for achieving them will become clear by referring to the embodiments described below in detail together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below but can be implemented in various different forms. These embodiments are provided merely to ensure that the disclosure of the present invention is complete and to fully inform those skilled in the art of the scope of the invention, and the present invention is defined only by the scope of the claims.

[0035] In describing the embodiments of the present invention, specific descriptions of known functions or configurations will be omitted if it is determined that such detailed descriptions could unnecessarily obscure the essence of the invention. Furthermore, the terms described below are defined in consideration of their functions in the embodiments of the present invention, and these definitions may vary depending on the intentions or practices of the user or operator. Therefore, such definitions should be based on the content throughout this specification.

[0036] The terms used in this specification will be briefly explained, and the invention will be described in detail.

[0037] The terms used in this specification have been selected to be as widely used as possible, taking into account the functions of the present invention; however, these terms may vary depending on the intent of those skilled in the art, case law, the emergence of new technologies, etc. Additionally, in specific cases, terms have been arbitrarily selected by the applicant, and in such cases, their meanings will be described in detail in the relevant description of the invention. Therefore, the terms used in this invention should be defined not merely by their names, but based on their meanings and the overall content of the invention.

[0038] When a part of a specification is described as 'comprising' a certain component, this means that, unless specifically stated otherwise, it does not exclude other components but may include additional components.

[0039] Additionally, the term "part" as used in the specification refers to software or hardware components, such as FPGAs or ASICs, and the "part" performs certain roles. However, the meaning of "part" is not limited to software or hardware. The "part" may be configured to reside in an addressable storage medium or configured to run on one or more processors. Thus, by way of example, the "part" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables. The functions provided within the components and "parts" may be combined into a smaller number of components and "parts" or further separated into additional components and "parts."

[0040] Below, embodiments of the present invention are described in detail with reference to the attached drawings so that those skilled in the art can easily implement the present invention.

[0041] FIG. 1 is a block diagram exemplarily illustrating a bit width allocation device according to one embodiment.

[0042] As shown in FIG. 1, the bit width allocation device (100) may include an input unit (110), an output unit (120), a processor (130), a memory (140), or a communication unit (160).

[0043] For convenience of explanation, the following description describes the bit width allocation device (100) as an example that includes an input unit (110), an output unit (120), a processor (130), a memory (140), or a communication unit (160), but is not limited thereto. That is, each unit configuration may be provided outside the bit width allocation device (100) and operate in a manner that interacts with the bit width allocation device (100).

[0044] The input unit (110) may include a user interface for receiving commands, information, etc., used to control the bit width allocation device (100). Additionally, the input unit (110) may be a hardware device (e.g., a keyboard, a mouse, etc.) capable of directly receiving commands, information, etc., used to control the bit width allocation device (100).

[0045] In one embodiment, the input unit (110) may receive information required for a bit width allocation method from a user. Specifically, the user may input information through the input unit (110), including resource constraints such as the number of blocks of BRAM and the number of blocks of DSP, information related to bit width, information related to a CNN model, output data of a CNN model, and accuracy of a CNN model.

[0046] The output unit (120) can provide information to the user as visual information through an interface, including resource constraints such as the number of blocks of BRAM and the number of blocks of DSP, information related to bit width, information related to the CNN model, output data of the CNN model, and accuracy of the CNN model.

[0047] The processor (130) can control the overall operation of the bit width allocation device (100) to carry out the present invention.

[0048] The processor (130) can execute the bit width allocation program (150) by loading the bit width allocation program (150) and information required for the execution of the bit width allocation program (150) from memory (140).

[0049] The processor (130) can control the bit width allocation device (100) to store data received from an external device through the communication unit (160) in the memory (140). Additionally, the processor (130) can control the bit width allocation device (100) to transmit and receive information to and from an external device through the communication unit (160), including resource constraints such as the number of blocks of BRAM and the number of blocks of DSP, information related to bit width, information related to the CNN model, output data of the CNN model, and the accuracy of the CNN model.

[0050] The processor (130) may refer to a processing device such as a microprocessor, a central processing unit (CPU), a graphic processing unit (GPU), a processor core, a multiprocessor, an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), or a micro controller unit (MCU), but is not limited to the embodiments described above.

[0051] The memory (140) can store the bit width allocation program (150) and the information necessary for its execution. Additionally, the memory (140) can store the results of processing by the processor (130).

[0052] The bit width allocation program (150) may mean software that includes instructions programmed to perform the method according to the present invention.

[0053] The memory (140) can store information including resource constraints such as the number of blocks of BRAM and the number of blocks of DSP, information related to bit width, information related to the CNN model, output data of the CNN model, and information including the accuracy of the CNN model. Additionally, the memory (140) can store information received from an external device through the communication unit (160).

[0054] The memory (140) may refer to a computer-readable recording medium, such as a magnetic medium (e.g., a hard disk, a floppy disk, or magnetic tape), an optical recording medium (e.g., a CD-ROM or DVD), a magneto-optical medium (e.g., a floptical disk), a random access memory (e.g., dynamic random access memory (DRAM) or static random access memory (SRAM)), or a hardware device specifically configured to store and execute program instructions (e.g., a flash memory), but is not limited to the embodiments described above.

[0055] The communication unit (160) may be a wireless communication module capable of performing wireless communication by adopting a communication method such as CDMA, GSM, W-CDMA, TD-SCDMA, WiBro, LTE, EPC, 5G, wireless LAN, Wi-Fi, Bluetooth, Zigbee, WFD (Wi-Fi Direct), UWB (Ultra Wide Band), infrared communication (IrDA; infrared data association), BLE (Bluetooth Low Energy) or NFC (Near Field Communication), but is not limited to the above-described embodiment.

[0056] In addition, information input and output through the input unit (110) and output unit (120), information stored in the memory (140), and information transmitted and received through the communication unit (160) include all information related to the present invention and are not limited to the embodiments described above.

[0057] The function or operation of the bit width allocation program (150) will be examined in detail through FIG. 2.

[0058] FIG. 2 is a block diagram exemplifying the function of a bit width allocation program.

[0059] As shown in FIG. 2, the bit width allocation program (150) may include a resource constraint acquisition unit (210), an initial bit width allocation unit (220), an accuracy evaluation unit (230), and a bit width adjustment unit (240). The resource constraint acquisition unit (210), the initial bit width allocation unit (220), the accuracy evaluation unit (230), and the bit width adjustment unit (240) are exemplary divisions of the functions of the bit width allocation program (150) and are not limited thereto.

[0060] According to the embodiment, the functions of the resource constraint acquisition unit (210), the initial bit width allocation unit (220), the accuracy evaluation unit (230), and the bit width adjustment unit (240) can be merged / separated and can be implemented as a series of instructions included in at least one program.

[0061] The resource constraint acquisition unit (210), initial bit width allocation unit (220), accuracy evaluation unit (230), and bit width adjustment unit (240) may be implemented by a processor (130) and may refer to a data processing device embedded in hardware having a physically structured circuit to perform a function expressed by a code or instruction included in a bit width allocation program (150) stored in memory (140).

[0062] The resource constraint acquisition unit (210) can acquire resource constraints for a field programmable gate array (FPGA), including the number of blocks of a block RAM (BRAM) and the number of blocks of a digital signal processor (DSP).

[0063] BRAM (block RAM) can refer to embedded memory blocks used in FPGAs (field-programmable gate arrays). It resides within the FPGA and can be used to store data and handle fast read / write operations. In other words, a BRAM block can refer to a storage device within the FPGA for processing memory operations. The number of available BRAM blocks may be predetermined based on the design of the FPGA chip.

[0064] Digital Signal Processing Blocks (DSP Blocks) refer to hardware acceleration blocks for digital signal processing tasks within an FPGA. They rapidly process arithmetic operations (multiplication, addition, and accumulation) and can be utilized in applications requiring high-performance computation, such as filtering, Fast Fourier Transform (FFT), image and video processing, and AI / ML computation. In other words, DSP Blocks can refer to dedicated hardware for processing high-speed computations. DSP Blocks may include hardware multipliers, accumulators, and adders, and can be optimized for fixed-point and floating-point operations. The number of available DSP Blocks may be predetermined based on the design of the FPGA chip.

[0065] Bitwidth refers to the number of bits (binary digits) used to represent the weights and activation values of a layer. Bitwidth can be a significant factor that directly affects the computational speed and resource usage of a CNN model. Specifically, bitwidth can influence the computational accuracy and resource consumption of a CNN model. Increasing the bitwidth improves computational accuracy but requires more computational resources, while decreasing the bitwidth can save resources but may result in lower computational precision.
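
The trade-off between bit width and precision can be illustrated with a small quantization experiment. The sketch below uses symmetric uniform quantization, which is one common scheme chosen here as an assumption; this application does not specify a particular quantization scheme, and the function names `quantize` and `mean_abs_error` are hypothetical.

```python
def quantize(values, bit_width):
    """Round values onto a signed uniform grid of bit_width bits and map them back."""
    if bit_width < 2:
        raise ValueError("bit_width must be at least 2 for a signed representation")
    qmax = 2 ** (bit_width - 1) - 1          # largest representable integer level
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Quantize to an integer level, clamp to the representable range, then dequantize.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(values, bit_width):
    """Average absolute difference between the values and their quantized versions."""
    quantized = quantize(values, bit_width)
    return sum(abs(v - q) for v, q in zip(values, quantized)) / len(values)


if __name__ == "__main__":
    weights = [0.9, -0.35, 0.12, 0.5, -0.77, 0.03]  # hypothetical weight values
    for bits in (2, 4, 8):
        print(bits, mean_abs_error(weights, bits))
```

For these sample values the error shrinks as the bit width grows, which matches the qualitative statement above that a larger bit width improves precision at the cost of more resources.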

[0066] In one embodiment, resource constraints may include thresholds for resource usage for BRAM and DSP.

[0067] The initial bit width allocation unit (220) can allocate an initial bit width for each layer based on resource constraints in a plurality of convolutional neural network (CNN) layers. Specifically, the initial bit width allocation unit (220) can calculate the usage of BRAM and DSP based on the initial bit width, and allocate an initial bit width for each layer included in the CNN model such that the calculated usage of BRAM and DSP does not exceed the threshold for the number of blocks of BRAM and the threshold for the number of blocks of DSP included in the resource constraints. If the usage of BRAM and DSP exceeds the resource constraints, the initial bit width allocation unit (220) can reallocate the initial bit width by decreasing the bit width for any layer among the plurality of CNN layers by 1.
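
The check-and-reduce behavior of the initial bit width allocation unit (220) can be sketched as follows. The cost model in `usage` (18 Kb BRAM blocks, and DSP usage scaling with the bit width divided by 16) is a placeholder assumption, as are the field names `weights` and `multipliers`. The application states only that the bit width of any layer may be reduced by 1; the choice of the currently widest layer is an assumption of this sketch.

```python
import math

BRAM_BITS = 18 * 1024  # assumed BRAM block capacity in bits (device-dependent)


def usage(layers, bit_widths):
    """Return (bram_blocks, dsp_blocks) under a placeholder cost model."""
    bram = sum(math.ceil(l["weights"] * b / BRAM_BITS) for l, b in zip(layers, bit_widths))
    dsp = sum(math.ceil(l["multipliers"] * b / 16) for l, b in zip(layers, bit_widths))
    return bram, dsp


def fit_to_constraints(layers, bit_widths, max_bram, max_dsp, min_bits=2):
    """Lower one layer's bit width by 1 at a time until both thresholds are met."""
    widths = list(bit_widths)
    while True:
        bram, dsp = usage(layers, widths)
        if bram <= max_bram and dsp <= max_dsp:
            return widths
        candidates = [i for i, b in enumerate(widths) if b > min_bits]
        if not candidates:
            raise ValueError("constraints cannot be met at the minimum bit width")
        # Assumption: reduce the currently widest layer (first one on ties).
        widest = max(candidates, key=lambda i: widths[i])
        widths[widest] -= 1


if __name__ == "__main__":
    layers = [{"weights": 36864, "multipliers": 64},
              {"weights": 73728, "multipliers": 64}]
    print(fit_to_constraints(layers, [8, 8], max_bram=40, max_dsp=64))
```

Because each pass removes a single bit from a single layer, the result naturally ends up with different bit widths for different layers when the layers differ in size.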

[0068] In one embodiment, the initial bit width allocation unit (220) may determine the initial bit width differently for each layer during the process of allocating the initial bit width for each layer based on resource constraints.

[0069] The accuracy evaluation unit (230) can evaluate the accuracy of the CNN model by providing an initial bit width to a CNN model including multiple CNN layers.

[0070] In one embodiment, the CNN model may be a deep learning model that performs a classification operation. In this case, the accuracy of the CNN model may refer to the ratio of correct predictions made by the CNN model in a classification problem.
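
The accuracy measure described above can be written as a short function. The name `classification_accuracy` and the example labels are hypothetical.

```python
def classification_accuracy(predicted, expected):
    """Return the fraction of predictions that match the expected class labels."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("at least one sample is required")
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)


if __name__ == "__main__":
    # Three of the four hypothetical predictions match the labels.
    print(classification_accuracy([1, 0, 2, 2], [1, 1, 2, 2]))
```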

[0071] The bit width adjustment unit (240) can obtain an adjusted bit width for each layer by adjusting the initial bit width based on the evaluated accuracy of the CNN model.

[0072] In one embodiment, the bit width allocation model may include an actor network and a critic network. Here, the actor network may receive resource constraints as input and determine the initial bit width for each layer based on a predetermined policy. Additionally, the critic network may evaluate accuracy and provide the evaluated accuracy to the actor network.

[0073] In one embodiment, the actor network can adjust the initial bit width in a direction that maximizes the accuracy of the CNN model. Accordingly, the optimal bit width for each layer of the CNN model can be allocated in a direction that maximizes the accuracy of the CNN model without exceeding the resource constraints for the FPGA in terms of resource usage.

[0074] FIG. 3 is a flowchart illustrating a bit width allocation method according to one embodiment. Here, the method illustrated in FIG. 3 can be executed by the bit width allocation device (100) illustrated in FIG. 1. Furthermore, since the flowchart illustrated in FIG. 3 is merely illustrative, depending on the embodiment, each step may be executed in a different order than that described in the flowchart, or additional steps not described in the flowchart may be executed, or one or more of the steps described in the flowchart may not be executed.

[0075] As shown in FIG. 3, a bit width allocation method according to one embodiment comprises the steps of: obtaining resource constraints for a field programmable gate array (FPGA) including the number of blocks of a block RAM (BRAM) and the number of blocks of a digital signal processor (DSP) (S310); allocating an initial bit width for each layer based on resource constraints in a plurality of convolutional neural network (CNN) layers (S320); providing the initial bit width to a CNN model including a plurality of CNN layers to evaluate the accuracy of the CNN model (S330); and adjusting the initial bit width based on the evaluated accuracy to obtain an adjusted bit width for each layer (S340).
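
Steps S310 to S340 can be sketched end to end as follows. This is a minimal sketch under stated assumptions: a simple greedy search stands in for the learned bit width allocation model, and the resource model `estimate_usage` and the evaluator `evaluate_accuracy` are hypothetical callables supplied by the caller. None of the names or default values come from this application.

```python
def allocate_bit_widths(constraints, num_layers, estimate_usage, evaluate_accuracy,
                        start_bits=8, min_bits=2, max_bits=16, max_rounds=20):
    """Greedy stand-in for S310-S340; returns (bit_widths, accuracy)."""
    # S310: constraints is a dict such as {"bram": 12, "dsp": 12}.
    def fits(widths):
        bram, dsp = estimate_usage(widths)
        return bram <= constraints["bram"] and dsp <= constraints["dsp"]

    # S320: start uniform, then lower one layer by 1 bit at a time until it fits.
    widths = [start_bits] * num_layers
    layer = 0
    while not fits(widths):
        if all(b <= min_bits for b in widths):
            raise ValueError("constraints cannot be met at the minimum bit width")
        if widths[layer] > min_bits:
            widths[layer] -= 1
        layer = (layer + 1) % num_layers

    # S330: evaluate the CNN model with the initial bit widths.
    best_accuracy = evaluate_accuracy(widths)

    # S340: move one bit between layers whenever that fits and improves accuracy.
    for _ in range(max_rounds):
        improved = False
        for up in range(num_layers):
            for down in range(num_layers):
                if up == down or widths[down] <= min_bits or widths[up] >= max_bits:
                    continue
                trial = list(widths)
                trial[up] += 1
                trial[down] -= 1
                if not fits(trial):
                    continue
                accuracy = evaluate_accuracy(trial)
                if accuracy > best_accuracy:
                    widths, best_accuracy, improved = trial, accuracy, True
        if not improved:
            break
    return widths, best_accuracy


if __name__ == "__main__":
    # Toy models: usage is the sum of bit widths; layer 0 is more accuracy-sensitive.
    toy_usage = lambda w: (sum(w), sum(w))
    toy_accuracy = lambda w: 1 - 0.5 ** w[0] - 0.1 * 0.5 ** w[1]
    print(allocate_bit_widths({"bram": 12, "dsp": 12}, 2, toy_usage, toy_accuracy))
```

With the toy models shown, the search shifts bits toward the layer whose precision matters most for accuracy while keeping total usage within the thresholds.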

[0076] FIG. 4 is an example diagram showing a pipeline structure used to accelerate a CNN on an FPGA.

[0077] As mentioned above, when examining the computation process in the pipeline structure, first, the input feature maps (IFMs) of each layer are stored in a line buffer, and the weights of each layer are stored in a dedicated buffer. Subsequently, a multiplication operation is performed based on the feature maps and weights of each layer, and a result value can be produced accordingly.

[0078] Meanwhile, in a pipeline structure, since each layer performs computations simultaneously, bottlenecks can occur if processing speeds vary by layer; therefore, equal throughput may be required for each layer. For example, if one layer of a CNN has a higher computational complexity than another, the layer with higher complexity may cause a bottleneck in the entire pipeline. In other words, because computational complexity and memory requirements differ for each layer, it may be necessary to allocate resources evenly to ensure equal throughput for each layer.

[0079] However, in order to accelerate CNNs, it is difficult to efficiently allocate BRAM blocks and DSP blocks to all layers. Consequently, problems may arise where resources are overused during computation or a specific layer becomes a bottleneck, leading to a degradation in overall computational performance. To solve the aforementioned problems, the latency (the inverse of throughput) of all layers must be evenly distributed, and the resources used by each layer must be appropriately controlled.
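
The effect of uneven per-layer latency on a pipeline can be illustrated numerically. The sketch below assumes that the latency of a layer is its number of multiply-accumulate operations (MACs) divided by the number of multipliers assigned to it, and that the pipeline advances at the pace of its slowest layer. Both assumptions, and all names and figures, are illustrative only.

```python
def layer_latencies(macs, multipliers):
    """Latency of each layer in cycles, assuming one MAC per multiplier per cycle."""
    return [m / p for m, p in zip(macs, multipliers)]


def balanced_multipliers(macs, total_multipliers):
    """Split a multiplier budget across layers in proportion to each layer's MACs.

    Integer division may leave part of the budget unused, and the floor of one
    multiplier per layer may slightly exceed it for very small layers.
    """
    total_macs = sum(macs)
    return [max(1, (m * total_multipliers) // total_macs) for m in macs]


if __name__ == "__main__":
    macs = [400, 100, 300]      # hypothetical per-layer MAC counts
    uniform = [27, 27, 26]      # equal split of an 80-multiplier budget
    balanced = balanced_multipliers(macs, 80)
    # The slowest layer sets the pace of the whole pipeline.
    print(max(layer_latencies(macs, uniform)))
    print(max(layer_latencies(macs, balanced)))
```

In this example the proportional split equalizes the latencies of all three layers, so no single layer holds back the others, whereas the equal split leaves the first layer as the bottleneck.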

[0080] FIG. 5 is an example diagram conceptually showing a learning algorithm of a bit width allocation model according to one embodiment, FIG. 6 is an example diagram showing the entire network of a bit width allocation model according to one embodiment, and FIG. 7 is a flowchart specifically showing a learning algorithm of a bit width allocation model according to one embodiment.

[0081] According to a bit-width allocation method according to an embodiment of the present invention, the bit-width for the weights and activation values of each layer is appropriately adjusted so that the BRAM and DSP resources of the FPGA can be used efficiently. Specifically, the overall resource usage can be optimized by allocating bit-widths such that layers requiring a lot of computation use a higher bit-width, while layers requiring less computation use a lower bit-width. Through this, the problem of imbalance in resource allocation per layer is resolved, and the performance and energy efficiency of the entire pipeline can be improved. Hereinafter, a learning algorithm of a bit-width allocation model according to an embodiment will be described in detail with reference to FIGS. 5 to 7.

[0082] A bit width allocation model according to one embodiment can be trained through reinforcement learning. The reinforcement learning described here may include a reinforcement learning-based resource optimization technique utilizing the deep deterministic policy gradient (DDPG) algorithm. DDPG is a reinforcement learning algorithm that can be used to learn an optimal policy for a model in a continuous action space. DDPG combines policy-based and value-based approaches in an actor-critic structure, namely an actor network and a critic network, each of which may be a deep neural network. Here, the actor network receives the current state as input to determine the optimal policy, and the critic network can evaluate the value (Q-value) of the policy selected by the actor. The critic network can be updated based on Q-value loss, and the actor network can be updated to maximize the Q-value provided by the critic network. Since DDPG uses a deterministic policy rather than a probabilistic policy, it always outputs the same action for a given state.
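For reference, the critic and actor updates mentioned above are commonly written as follows in the standard DDPG formulation. The symbols are not taken from the specification: $\mu$ and $Q$ denote the actor and critic with parameters $\theta^\mu$ and $\theta^Q$, $\mu'$ and $Q'$ are target networks, $\gamma$ is the discount factor, and $N$ is the minibatch size. Whether the embodiment uses target networks or this exact loss is an assumption.

$$y_i = r_i + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1})\big)$$

$$L(\theta^Q) = \frac{1}{N}\sum_i \big(y_i - Q(s_i, a_i \mid \theta^Q)\big)^2$$

$$\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_i \nabla_a Q(s, a \mid \theta^Q)\big|_{s=s_i,\,a=\mu(s_i)}\, \nabla_{\theta^\mu}\mu(s \mid \theta^\mu)\big|_{s=s_i}$$

The second expression is the Q-value loss used to update the critic, and the third is the gradient along which the actor is updated so that the Q-value of its chosen action increases.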

[0083] Through the DDPG algorithm, the optimal bit configuration for each layer of the CNN accelerator can be determined automatically. Referring to FIG. 6, the determined bit configuration can be transmitted to the main network for a CNN model containing multiple layers. Based on the bit configuration transmitted to the main network, the main network can perform CNN operations, the result of the CNN operations is transmitted to the critic network, and the accuracy based on the CNN operations can be calculated by the critic network.

[0084] A bit width allocation model may include an agent that determines the optimal bit width and an evaluator that evaluates the accuracy of a CNN model based on the bit width determined by the agent. Here, accuracy is one of the metrics used to evaluate the performance of a deep learning model and may refer to the ratio of correctly predicted values in a classification problem. In one embodiment, the agent in FIGS. 5 and 6 may correspond to an actor network, and the evaluator may correspond to a critic network. Accordingly, the agent receives the current state as input to determine an optimal policy, and the evaluator may evaluate the value (Q-value) of the policy selected by the agent. The evaluator may be updated based on Q-value loss, and the agent may be updated to maximize the Q-value provided by the evaluator. Below, the operation of a bit width allocation model according to one embodiment will be described in detail with reference to FIGS. 5 to 7.

[0085] First, resource constraints of the FPGA can be input to the agent. Here, resource constraints may include the number of BRAM blocks and DSP blocks of the FPGA.

[0086] The agent can determine the optimal bit configuration for each CNN layer. Bit configuration refers to the bit width for the weights and activation values of each layer, through which the resource usage and accuracy of the CNN model can be controlled. In this process, the bit width for the weights and the bit width for the activation values can be set separately. During this bit configuration process, the necessary bit width for each layer can be set using the DDPG algorithm, a type of reinforcement learning algorithm. The agent can set the bit width according to the algorithmic policy of DDPG, which aims to maintain the accuracy of the CNN model as much as possible within the given resources.
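One way to picture a per-layer bit configuration with separate weight and activation bit widths is sketched below. The `LayerBits` type, the linear mapping from a continuous action in [0, 1] to an integer bit width, and the 2-to-8-bit range are illustrative assumptions; the specification does not state how the agent's continuous output is converted to a bit width.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerBits:
    """Bit widths for one CNN layer (hypothetical container)."""
    weight_bits: int
    act_bits: int


def action_to_bits(action, min_bits=2, max_bits=8):
    """Map a continuous action to an integer bit width.

    Assumption: the action is clipped to [0, 1] and scaled linearly
    onto the range [min_bits, max_bits].
    """
    a = min(max(action, 0.0), 1.0)
    return min_bits + round(a * (max_bits - min_bits))


def build_config(actions):
    """actions: one (weight_action, act_action) pair per layer."""
    return [LayerBits(action_to_bits(w), action_to_bits(a)) for w, a in actions]


config = build_config([(1.0, 0.5), (0.0, 1.0)])
```

Keeping the two bit widths as separate fields mirrors the statement that weights and activation values can be configured independently.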

[0087] The agent can check whether the determined bit setting exceeds limited resources and adjust the bit setting. Specifically, the agent can check whether the determined bit setting exceeds the FPGA's BRAM and DSP resources.

[0088] In DSP blocks, a smaller bit width allows multiple multiplication operations to be performed simultaneously within a single DSP block, thereby enabling more efficient resource utilization. Accordingly, the agent can calculate whether the DSP block is being overused based on the bit width set for each layer. From this perspective, the agent can determine whether the determined bit settings exceed DSP resources.

[0089] In the case of a BRAM block, since BRAM usage varies depending on the memory size required by each layer, the amount of memory required can be determined by the bit width of the weights and activation values of each layer. As the bit width of the weights or activation values increases, more memory space is required, which may lead to increased BRAM usage. From this perspective, the agent can evaluate whether the memory requirement, that is, the BRAM usage, is within the allowable resources of the FPGA. If BRAM usage exceeds the resource limit of the FPGA, the agent can readjust the bit width by reducing the corresponding bit setting. Through the aforementioned process, the agent can adjust the bit width to reduce resource usage within the constraints, thereby ensuring that the system operates normally. In this context, resource usage may refer to the usage of BRAM and DSP. Resource usage can be calculated using quantized weights for each layer, bit width for representing input data, the number of input channels, the number of output channels, or the kernel size, and the calculation of resource usage can be performed using a previously disclosed method.
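The dependence of BRAM and DSP usage on bit width can be sketched as follows. This is not the previously disclosed calculation method referred to above, which the specification does not reproduce. The block capacity of 36 Kb, the line-buffer model with a `line_width` parameter, and the rule that two multiplications share one DSP block when both operands are 8 bits or narrower are all assumptions chosen for illustration.

```python
import math


def bram_blocks(w_bits, a_bits, in_ch, out_ch, kernel, line_width,
                block_bits=36 * 1024):
    """Estimate BRAM blocks for one layer (illustrative model only)."""
    # Weight buffer: one value per (input channel, output channel, kernel tap).
    weight_total = w_bits * in_ch * out_ch * kernel * kernel
    # Line buffer: `kernel` rows of the input feature map (assumed layout).
    line_total = a_bits * in_ch * kernel * line_width
    return math.ceil(weight_total / block_bits) + math.ceil(line_total / block_bits)


def dsp_blocks(w_bits, a_bits, parallel_mults):
    """Estimate DSP blocks for one layer (illustrative model only)."""
    # Assumed packing rule: narrow operands let two multiplications
    # share a single DSP block.
    mults_per_dsp = 2 if max(w_bits, a_bits) <= 8 else 1
    return math.ceil(parallel_mults / mults_per_dsp)


bram_8bit = bram_blocks(8, 8, in_ch=64, out_ch=64, kernel=3, line_width=32)
bram_4bit = bram_blocks(4, 4, in_ch=64, out_ch=64, kernel=3, line_width=32)
dsp_8bit = dsp_blocks(8, 8, parallel_mults=64)
dsp_16bit = dsp_blocks(16, 8, parallel_mults=64)
```

In this model, halving the bit width of a layer halves its weight storage, and widening an operand beyond the packing limit doubles the number of DSP blocks needed for the same parallelism.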

[0090] If the bit setting configured by the agent does not exceed the resource limit, the bit setting can be passed to the evaluator to perform an accuracy evaluation. The evaluator can run the CNN model using the proposed bit setting and evaluate the model's accuracy and resource usage. Based on the evaluation results, the evaluator can calculate a reward from the model's accuracy. For example, the reward can be determined based on the difference between the accuracy calculated as a floating-point number and the accuracy achieved by the bit setting, as shown in Equation 1.

[0091] $$R = \lambda \times (acc_{quant} - acc_{float})$$

[0092] Here, $R$ is the reward, $\lambda$ is a scaling factor that adjusts the size of the reward according to a given situation, $acc_{quant}$ is the quantized accuracy, that is, the accuracy achieved by the bit settings, and $acc_{float}$ is the accuracy calculated as a floating-point number. The evaluator can transmit the reward calculated in this way to the agent.
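A minimal sketch of this reward is given below. It assumes the sign convention in which the floating-point accuracy is subtracted from the quantized accuracy, so that a smaller accuracy drop gives a larger reward; the function name and the default scaling factor of 0.1 are hypothetical.

```python
def reward(acc_quant, acc_float, scale=0.1):
    """Scaled accuracy difference between the quantized and float models.

    acc_quant: accuracy achieved with the proposed bit settings
    acc_float: accuracy of the floating-point baseline
    scale:     scaling factor (hypothetical default)
    """
    return scale * (acc_quant - acc_float)


r_drop = reward(acc_quant=0.70, acc_float=0.75)   # quantization lost accuracy
r_same = reward(acc_quant=0.75, acc_float=0.75)   # no accuracy loss
```

With this convention the reward is negative whenever quantization reduces accuracy and reaches zero when the quantized model matches the floating-point baseline.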

[0093] Based on the accuracy evaluation results, reinforcement learning can proceed in a direction that maximizes model accuracy while efficiently utilizing resources. Through iterative learning, the agent can acquire a policy for optimal bit settings, thereby enabling the model to achieve optimal performance within limited resources.

[0094] Referring to FIG. 7, a learning algorithm for a bit width allocation model according to one embodiment includes the steps of determining a bit configuration for each layer by the actor, calculating BRAM and DSP usage from the information of each layer, checking whether BRAM or DSP usage exceeds a threshold, measuring accuracy after the main network has been trained for one epoch, and training the bit width allocation model through reinforcement learning with a reward based on the accuracy. If BRAM or DSP usage exceeds a threshold, the bit width of an arbitrarily selected layer is decreased by 1, and the process returns to the step of checking whether BRAM or DSP usage exceeds a threshold, so that the aforementioned operations are performed again in sequence. Decreasing the bit width of an arbitrarily selected layer by 1 may reduce BRAM or DSP usage.
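The threshold check and the one-bit reduction of an arbitrarily selected layer can be sketched as a loop. The helper names, the single bit width per layer, the minimum bit width of 1, and the toy usage function are assumptions made for illustration; a real usage function would depend on the layer shapes.

```python
import random


def enforce_constraints(bits, usage_fn, bram_limit, dsp_limit,
                        min_bits=1, rng=None):
    """Lower randomly chosen layers by 1 bit until usage fits the limits.

    bits:     list of per-layer bit widths proposed by the actor
    usage_fn: callable returning (bram_usage, dsp_usage) for a bit list
    """
    if rng is None:
        rng = random.Random()
    bits = list(bits)  # do not modify the caller's list
    while True:
        bram, dsp = usage_fn(bits)
        if bram <= bram_limit and dsp <= dsp_limit:
            return bits
        # Only layers above the assumed minimum can still be reduced.
        candidates = [i for i, b in enumerate(bits) if b > min_bits]
        if not candidates:
            raise ValueError("constraints cannot be met at the minimum bit width")
        bits[rng.choice(candidates)] -= 1


def toy_usage(bits):
    """Hypothetical usage model: both resources grow with total bits."""
    return 2 * sum(bits), sum(bits)


fitted = enforce_constraints([8, 8, 8, 8], toy_usage,
                             bram_limit=48, dsp_limit=30,
                             rng=random.Random(0))
```

Because each pass removes exactly one bit, the loop stops at the first configuration that satisfies both limits; accuracy evaluation and the reinforcement learning update would then follow on the returned configuration.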

[0095] As described above, according to one embodiment of the present invention, compared to a conventional CNN accelerator with a fixed bit width method, resource usage is optimized and resource waste is minimized, so that more CNN models can be executed even in an environment with limited FPGA resources.

[0096] In addition, a balance between model accuracy and resource usage can be maintained even within limited resources, thereby maximizing the resource utilization and computational performance of FPGA-based CNN accelerators. This can provide significant advantages where deep learning models must be executed effectively in resource-constrained environments, such as edge devices or IoT devices.

[0097] In addition, the present invention can be used in low-power, high-efficiency applications such as edge computing, autonomous driving, and smart IoT devices. This can greatly expand the potential for utilizing FPGA-based AI accelerators in various industrial fields.

[0098] The embodiments of the present invention described above may be implemented through various means. For example, the embodiments of the present invention may be implemented by hardware, firmware, software, or a combination thereof.

[0099] Combinations of the blocks of the block diagrams and the steps of the flowcharts attached to the present invention may be performed by computer program instructions. These computer program instructions may be loaded into an encoding processor of a general-purpose computer, a special-purpose computer, or other programmable data processing equipment, so that the instructions, executed through the encoding processor of the computer or other programmable data processing equipment, create means for performing the functions described in each block of the block diagram or each step of the flowchart. These computer program instructions may also be stored in a computer-usable or computer-readable memory that can direct the computer or other programmable data processing equipment to implement a function in a specific way, so that the instructions stored in the computer-usable or computer-readable memory can produce an article of manufacture containing instruction means for performing the function described in each block of the block diagram or each step of the flowchart. The computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps is performed on the computer or other programmable data processing equipment to create a computer-executed process, and the instructions executed on the computer or other programmable data processing equipment can provide steps for executing the functions described in each block of the block diagram and each step of the flowchart.

[0100] Additionally, each block or each step may represent a module, segment, or part of code containing one or more executable instructions for executing a specified logical function(s). In some embodiments, the functions mentioned in the blocks or steps may occur out of order. For example, two blocks or steps depicted consecutively may actually be performed substantially simultaneously, or the blocks or steps may be performed in reverse order according to the corresponding function.

[0101] The above description is merely an illustrative explanation of the technical concept of the present invention, and those skilled in the art to which the present invention pertains will be able to make various modifications and variations without departing from the essential characteristics of the present invention. Accordingly, the embodiments disclosed in the present invention are intended to explain, not limit, the technical concept of the present invention, and the scope of the technical concept of the present invention is not limited by such embodiments. The scope of protection of the present invention shall be interpreted by the claims below, and all technical concepts within the equivalent scope shall be interpreted as being included within the scope of rights of the present invention.

Claims

1. A bit width allocation method performed by a bit width allocation device including a bit width allocation model, the method comprising: obtaining resource constraints for a field programmable gate array (FPGA), the resource constraints including the number of blocks of a block RAM (BRAM) and the number of blocks of a digital signal processor (DSP); allocating, for a plurality of convolutional neural network (CNN) layers, an initial bit width for each layer based on the resource constraints; providing the initial bit width to a CNN model including the plurality of CNN layers to evaluate the accuracy of the CNN model; and adjusting the initial bit width based on the evaluated accuracy to obtain an adjusted bit width for each layer.

2. The bit width allocation method of claim 1, wherein the resource constraints include a threshold value for resource usage of the BRAM and the DSP.

3. The bit width allocation method of claim 1, wherein the initial bit width for each layer is determined in consideration of the resource constraints.

4. The bit width allocation method of claim 1, wherein the initial bit width for each layer includes an initial bit width determined differently for each layer.

5. The bit width allocation method of claim 1, wherein the bit width is the number of bits used to represent the weights and activation values of a layer.

6. The bit width allocation method of claim 1, wherein the CNN model is a deep learning model that performs a classification operation, and the accuracy is the ratio of correct predictions by the CNN model in a classification problem.

7. The bit width allocation method of claim 1, wherein the bit width allocation model includes an actor network and a critic network, the actor network receives the resource constraints as input and determines the initial bit width for each layer based on a predetermined policy, and the critic network evaluates the accuracy and provides the evaluated accuracy to the actor network.

8. The bit width allocation method of claim 7, wherein the actor network adjusts the initial bit width in a direction that maximizes the evaluated accuracy.

9. The bit width allocation method of claim 1, wherein the step of allocating the initial bit width comprises: calculating the usage of the BRAM and the usage of the DSP based on the initial bit width; and determining whether the usage of the BRAM and the usage of the DSP exceed the resource constraints.

10. The bit width allocation method of claim 9, wherein the usage of the BRAM and the usage of the DSP exceed the resource constraints, and the step of allocating the initial bit width comprises reallocating the initial bit width by decreasing the bit width of any one layer among the plurality of CNN layers by 1.

11. A bit width allocation device comprising: a memory capable of storing computer-executable instructions; and a processor, wherein the computer-executable instructions, when executed by the processor, cause the processor to: obtain resource constraints for a field programmable gate array (FPGA), the resource constraints including the number of blocks of a block RAM (BRAM) and the number of blocks of a digital signal processor (DSP); allocate, for a plurality of convolutional neural network (CNN) layers, an initial bit width for each layer based on the resource constraints; provide the initial bit width to a CNN model including the plurality of CNN layers to evaluate the accuracy of the CNN model; and adjust the initial bit width based on the evaluated accuracy to obtain an adjusted bit width for each layer.

12. A non-transitory computer-readable recording medium storing computer-executable instructions, wherein the computer-executable instructions, when executed by a processor, cause the processor to perform a method comprising: obtaining resource constraints for a field programmable gate array (FPGA), the resource constraints including the number of blocks of a block RAM (BRAM) and the number of blocks of a digital signal processor (DSP); allocating, for a plurality of convolutional neural network (CNN) layers, an initial bit width for each layer based on the resource constraints; providing the initial bit width to a CNN model including the plurality of CNN layers to evaluate the accuracy of the CNN model; and adjusting the initial bit width based on the evaluated accuracy to obtain an adjusted bit width for each layer.