Simultaneous quantization and pruning method for lightweighting deep learning network

An automated method for simultaneous quantization and pruning of deep learning networks addresses the challenges of manual optimization and hardware neglect, enhancing acceleration efficiency for edge devices by classifying and pruning filters based on rank.

WO2026141724A1 · PCT designated stage · Publication Date: 2026-07-02 · KOREA ELECTRONICS TECH INST

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
KOREA ELECTRONICS TECH INST
Filing Date
2024-12-26
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

Simultaneous quantization and pruning is difficult to apply to object detection networks. Finding the optimal pruning requires manual experimentation that is time-consuming and costly, and hardware characteristics are not adequately considered, leading to suboptimal acceleration effects.

Method used

An automated method for simultaneous quantization and pruning that incorporates hardware characteristics, classifies filters into groups based on rank, prunes lower-ranked filters, and fine-tunes the network, with iterative refinement until a target pruning rate is achieved.

Benefits of technology

Minimizes human intervention and optimizes acceleration efficiency by reflecting hardware characteristics, enabling efficient deployment on edge devices and embedded systems.

✦ Generated by Eureka AI based on patent content.


Abstract

A simultaneous quantization and pruning method for lightweighting a deep learning network is provided. The simultaneous quantization and pruning method for a deep learning network, according to an embodiment of the present invention, comprises the steps of: quantizing the deep learning network; classifying filters of the quantized deep learning network into a plurality of groups; pruning the filters of any one group from among the classified groups; fine-tuning the quantized deep learning network in which the filters of the one group are pruned; and, when the pruning ratio of the fine-tuned deep learning network has not reached a target, re-performing the classification step through the fine-tuning step on the fine-tuned deep learning network.

Description

Method for Simultaneous Quantization and Pruning for Deep Learning Network Lightweighting

[0001] The present invention relates to the lightweighting of deep learning networks, and more specifically, to a method for lightweighting deep learning networks so that they can be installed on edge devices and embedded systems.

[0002] Quantization and pruning of deep learning networks are techniques for lightweighting deep learning networks, and there are attempts to effectively combine these two techniques and perform them simultaneously for optimal lightweighting.

[0003] However, simultaneous quantization and pruning are being developed primarily for object classification networks, while progress is lagging for object detection networks, where lightweighting is challenging.

[0004] Meanwhile, when quantization and pruning are performed simultaneously, the optimal pruning setting is chosen manually, and finding it requires multiple experiments, which demands considerable time, cost, and effort.

[0005] In addition, when deep learning networks are lightweighted for deployment on edge devices or embedded systems, hardware characteristics are not well reflected. For instance, the hardware output unit is not taken into account during the pruning process, which reduces the acceleration effect. If the hardware inputs and outputs data in units of 32 and the data length is 512, pruning the data to a length of 358 is expected to save about 30% of the time, but in reality only 25% (= 1 - (upper(358 / 32) × 32) / 512, where upper is the round-up function) is saved, because a length of 358 still occupies 12 units of 32, that is, 384.
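The arithmetic in the example above can be checked with a short sketch. The function names `expected_saving` and `actual_saving` are illustrative assumptions and do not appear in the patent; the sketch also assumes that processing time scales linearly with the number of hardware units processed.

```python
import math


def expected_saving(original_len: int, pruned_len: int) -> float:
    """Saving if cost scaled linearly with the pruned length itself."""
    return 1 - pruned_len / original_len


def actual_saving(original_len: int, pruned_len: int, unit: int) -> float:
    """Saving when the hardware always processes whole blocks of `unit`."""
    # A partially filled block costs as much as a full one, so round up.
    padded_len = math.ceil(pruned_len / unit) * unit  # 358 -> 384 for unit 32
    return 1 - padded_len / original_len


print(f"{expected_saving(512, 358):.1%}")    # 30.1%
print(f"{actual_saving(512, 358, 32):.1%}")  # 25.0%
```

Under these assumptions, pruning to any length between 353 and 384 yields the same 25% saving, which is why group sizes aligned to the hardware unit avoid wasted pruning.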

[0006] The present invention has been devised to solve the above-mentioned problems, and the objective of the present invention is to provide a method for simultaneously performing quantization and pruning of a deep learning network that is automated by minimizing human settings and incorporates hardware characteristics to improve acceleration efficiency.

[0007] A method for simultaneously performing quantization and pruning of a deep learning network according to an embodiment of the present invention for achieving the above objective comprises: a step of quantizing a deep learning network; a step of classifying filters into a plurality of groups for the quantized deep learning network; a step of pruning filters of one of the classified groups; and a step of fine-tuning the quantized deep learning network in which filters of one of the classified groups have been pruned. If the pruning rate of the fine-tuned deep learning network has not reached a target, the process from the classification step to the fine-tuning step can be repeated for the fine-tuned deep learning network.

[0008] The method for simultaneously performing quantization and pruning of a deep learning network according to the present invention may further comprise: a step of calculating the ranks of the output feature maps of the quantized deep learning network and sorting them in descending order; and a step of sorting the filters that generated the output feature maps according to the sorted order of the output feature maps, and the classification step may classify the filters into a plurality of groups according to the sorted order.

[0009] The method for simultaneously performing quantization and pruning of a deep learning network according to the present invention may further comprise a step of fixing the filters of another group among the classified groups, and the fine-tuning step may fine-tune the quantized deep learning network in which the filters of one group are pruned and the filters of another group are fixed.

[0010] The ranks of the filters in the one group (the pruned group) may be lower than the ranks of the filters in the other group (the fixed group).

[0011] The ranks of the filters in the remaining group, which is fine-tuned, may be higher than the ranks of the filters in the one group and lower than the ranks of the filters in the other group.

[0012] The number of filters classified into each group can be an integer multiple of the output unit of the output feature map.

[0013] The classification step may include: a step of generating preliminary groups by grouping the sorted filters into output units of the output feature map; and a step of merging the preliminary groups to form a fixed number of groups if the number of generated preliminary groups exceeds the number of given groups.

[0014] The merging step may merge the preliminary groups based on the rank difference between the preliminary groups.

[0015] A deep learning network may include an object recognition model and an object detection model.

[0016] According to another aspect of the present invention, a system for simultaneous quantization and pruning of a deep learning network is provided, comprising: a processor that quantizes a deep learning network, classifies filters for the quantized deep learning network into a plurality of groups, prunes filters of any one of the classified groups, fine-tunes the quantized deep learning network in which filters of any one group have been pruned, and if the pruning rate of the fine-tuned deep learning network has not reached a target, re-performs the process from group classification to fine-tuning for the fine-tuned deep learning network; and a storage unit in which a deep learning network in which the pruning rate has reached a target is stored.

[0017] According to another aspect of the present invention, a method for simultaneously performing quantization and pruning of a deep learning network is provided, comprising: a step of classifying filters into a plurality of groups for a quantized deep learning network; a step of pruning filters of one of the classified groups; a step of fixing filters of another of the classified groups; and a step of fine-tuning filters of the remaining of the classified groups; wherein if the pruning rate of the fine-tuned deep learning network has not reached a target, the process from the classification step to the fine-tuning step is repeated for the fine-tuned deep learning network.

[0018] According to another aspect of the present invention, a system for simultaneous quantization and pruning of a deep learning network is provided, comprising: a processor for classifying filters into a plurality of groups for a quantized deep learning network, pruning filters of one of the classified groups, fixing filters of another of the classified groups, fine-tuning filters of the remaining of the classified groups, and if the pruning rate of the fine-tuned deep learning network has not reached a target, re-performing from classification to fine-tuning for the fine-tuned deep learning network; and a storage unit for storing a deep learning network in which the pruning rate has reached a target.

[0019] As explained above, according to the embodiments of the present invention, by simultaneously performing quantization and pruning of a deep learning network to which automated pruning is applied, the time, cost, and effort required for human intervention in lightweighting a deep learning network can be minimized.

[0020] In addition, according to embodiments of the present invention, the acceleration efficiency of a lightweight deep learning network can be improved by simultaneously performing quantization and pruning of a deep learning network with pruning applied that reflects hardware characteristics.

[0021] FIG. 1 is a diagram illustrating a method for simultaneously performing quantization / pruning of a deep learning network according to an embodiment of the present invention, and

[0022] FIG. 2 is a diagram illustrating a system for simultaneous quantization / pruning of a deep learning network according to another embodiment of the present invention.

[0023] The present invention will be described in more detail below with reference to the drawings.

[0024] An embodiment of the present invention presents a method for simultaneously performing quantization and pruning to lighten a deep learning network. This is a technology that applies pruning that reflects hardware characteristics to improve acceleration efficiency and is automated by minimizing human settings during the simultaneous quantization and pruning of a deep learning network.

[0025] An embodiment of the present invention pursues a general-purpose method for simultaneous quantization and pruning that is applicable to both object classification networks and object detection networks.

[0026] FIG. 1 is a diagram illustrating the flow of a method for simultaneous quantization and pruning of a deep learning network according to an embodiment of the present invention. The method for simultaneous quantization and pruning of a deep learning network according to an embodiment of the present invention is a method for finding a deep learning network that satisfies a target bit precision and a pruning rate.

[0027] To find such a deep learning network, as described above, the target deep learning network is first quantized according to the target bit precision (S110).

[0028] Following step S110, for the quantized deep learning network, the ranks of the output feature maps for each layer are calculated and sorted in descending order (S120). The ranks can be calculated according to the method disclosed in “HRank: Filter Pruning using High-Rank Feature Map, 2020”, although other methods may also be used.
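As a rough illustration of step S120, the sketch below computes the matrix rank of each output feature map and returns the indices in descending order of rank. It is a simplified assumption rather than the HRank procedure itself: it uses exact Gaussian elimination on small integer matrices and a single feature map per filter, whereas a practical implementation would likely use a floating-point rank routine with a tolerance and average the rank over several input samples. The names `matrix_rank` and `sort_feature_maps_by_rank` are hypothetical.

```python
from fractions import Fraction


def matrix_rank(matrix):
    """Rank of a 2-D list of numbers via exact Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        # Look for a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        # Eliminate the column entry from every row below the pivot row.
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def sort_feature_maps_by_rank(feature_maps):
    """Return (ranks, order); `order` lists map indices by descending rank."""
    ranks = [matrix_rank(fm) for fm in feature_maps]
    order = sorted(range(len(ranks)), key=lambda i: ranks[i], reverse=True)
    return ranks, order


feature_maps = [
    [[1, 2], [2, 4]],  # rank 1: the second row is a multiple of the first
    [[1, 0], [0, 1]],  # rank 2
    [[0, 0], [0, 0]],  # rank 0
]
ranks, order = sort_feature_maps_by_rank(feature_maps)
print(ranks)  # [1, 2, 0]
print(order)  # [1, 0, 2]
```

Because each output feature map corresponds to exactly one filter, the same `order` list can be used to reorder the filters of the layer.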

[0029] Afterward, the filters that generated the output feature maps for each layer are sorted in the same order as the output feature maps sorted in step S120 (S130). Since one output feature map is generated per filter, the output feature maps and filters are matched one-to-one. The filters are sorted in step S130 because the higher the rank of an output feature map, the more important the role of the filter that generated it is considered to be.

[0030] Next, for each layer, the filters sorted in step S130 are classified into three groups (upper rank filter group, middle rank filter group, lower rank filter group) according to the sorting order (S140).

[0031] The ranks of filters classified in the upper rank filter group are higher than the ranks of filters classified in the middle rank filter group, and the ranks of filters classified in the middle rank filter group are higher than the ranks of filters classified in the lower rank filter group.

[0032] Meanwhile, the number of filters classified into each filter group is an integer multiple of the output unit of the output feature map in the hardware. For example, if the output feature map is output in groups of 4, each filter group can contain 4, 8, 12, ... or 4N filters.

[0033] It is sufficient to satisfy the above conditions, and the number of filters classified in filter groups does not have to be the same. That is, the number of filters in a filter group may be the same as or different from the number of filters in other filter groups.

[0034] To classify the filters, preliminary groups are created by first grouping the filters, sorted by rank, into output units of the output feature map. In the following examples, the explanation is changed from classifying the filters to classifying their ranks. This is for the sake of convenience of explanation and understanding, and it should be noted that in practice, filter classification, not rank classification, is being performed.

[0035] For example, when the ranks of the output feature maps of the second layer of a deep learning network are sorted in descending order [313, 298, 275, 212, 200, 186, 173, 153, 134, 112, 100, 88, 68, 54, 21, 7] and the output unit is 4, 4 preliminary groups are generated as follows.

[0036] Preliminary Group #1 : [313, 298, 275, 212]

[0037] Preliminary Group #2 : [200, 186, 173, 153]

[0038] Preliminary Group #3 : [134, 112, 100, 88]

[0039] Preliminary Group #4 : [68, 54, 21, 7]

[0040] If the number of generated preliminary groups exceeds the number of filter groups, preliminary groups are merged to form the fixed number of filter groups. The preliminary groups to be merged are those with the smallest rank difference between them.

[0041] In the above example, since the number of preliminary groups is 4 and the number of filter groups must be 3 (upper rank filter group, middle rank filter group, lower rank filter group), 2 of the preliminary groups must be merged to reduce the number of groups from 4 to 3.

[0042] In the above example, 1) the rank difference between Preliminary Group #1 and Preliminary Group #2 is 12 (=212-200), 2) the rank difference between Preliminary Group #2 and Preliminary Group #3 is 19 (=153-134), and 3) the rank difference between Preliminary Group #3 and Preliminary Group #4 is 20 (=88-68). Preliminary Group #1 and Preliminary Group #2, which have the smallest rank difference of 12, are therefore merged. Accordingly, the filter groups are classified as follows.

[0043] Upper Rank Filter Group = Preliminary Group #1 + #2 : [313, 298, 275, 212, 200, 186, 173, 153]

[0044] Middle Rank Filter Group = Preliminary Group #3 : [134, 112, 100, 88]

[0045] Lower Rank Filter Group = Preliminary Group #4 : [68, 54, 21, 7]
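The grouping and merging in the example above can be sketched as follows. The function names are hypothetical, and two details are assumptions not stated in the text: the filter count is required to be a multiple of the output unit, and when more than one merge is needed the smallest boundary gap is merged repeatedly. The rank difference between adjacent groups is taken as the lowest rank of one group minus the highest rank of the next, as in the worked numbers.

```python
def make_preliminary_groups(sorted_ranks, unit):
    """Split ranks (sorted in descending order) into chunks of `unit`."""
    if len(sorted_ranks) % unit != 0:
        # Assumption: the text does not say how a partial chunk is handled.
        raise ValueError("number of filters must be a multiple of the output unit")
    return [sorted_ranks[i:i + unit] for i in range(0, len(sorted_ranks), unit)]


def merge_groups(groups, target_count):
    """Merge adjacent groups with the smallest rank gap until target_count remain."""
    groups = [list(g) for g in groups]
    while len(groups) > target_count:
        # Gap = lowest rank of a group minus highest rank of the next group.
        gaps = [groups[i][-1] - groups[i + 1][0] for i in range(len(groups) - 1)]
        i = gaps.index(min(gaps))
        groups[i:i + 2] = [groups[i] + groups[i + 1]]
    return groups


ranks = [313, 298, 275, 212, 200, 186, 173, 153,
         134, 112, 100, 88, 68, 54, 21, 7]
preliminary = make_preliminary_groups(ranks, unit=4)
upper, middle, lower = merge_groups(preliminary, target_count=3)
print(upper)   # [313, 298, 275, 212, 200, 186, 173, 153]
print(middle)  # [134, 112, 100, 88]
print(lower)   # [68, 54, 21, 7]
```

Since groups are only ever formed by concatenating whole chunks, every resulting group size stays an integer multiple of the output unit.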

[0046] When filter group classification is completed in this manner, the filters in the upper rank filter group are frozen (S150), and the filters in the lower rank filter group are pruned (S160).

[0047] Through steps S150 and S160, the filters in the middle rank filter group, which are neither fixed nor pruned, become subject to fine-tuning. Accordingly, the deep learning network in which the filters in the upper rank filter group are fixed and the filters in the lower rank filter group are pruned is fine-tuned for only about 10% of the total epochs (S170).
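The division of roles among the three filter groups might be sketched as below. This is a toy illustration under stated assumptions: each filter is represented by a single scalar weight, the update is a plain gradient step, and `assign_roles`, `fine_tune_step`, and `fine_tune_epochs` are hypothetical names rather than part of the disclosed method or of any library.

```python
def assign_roles(upper, middle, lower):
    """Map each filter index to its role for one iteration."""
    roles = {}
    roles.update({i: "fixed" for i in upper})       # S150: frozen
    roles.update({i: "trainable" for i in middle})  # updated during S170
    roles.update({i: "pruned" for i in lower})      # S160: removed
    return roles


def fine_tune_step(weights, grads, roles, lr=0.5):
    """Apply one gradient step to trainable filters only."""
    updated = {}
    for i, w in weights.items():
        if roles[i] == "pruned":
            continue                    # pruned filters leave the network
        if roles[i] == "fixed":
            updated[i] = w              # fixed filters keep their weights
        else:
            updated[i] = w - lr * grads[i]
    return updated


def fine_tune_epochs(total_epochs):
    """About 10% of the total epochs, with at least one epoch (assumption)."""
    return max(1, total_epochs // 10)


roles = assign_roles(upper=[0], middle=[1], lower=[2])
weights = {0: 1.0, 1: 2.0, 2: 3.0}
grads = {0: 0.5, 1: 0.5, 2: 0.5}
print(fine_tune_step(weights, grads, roles))  # {0: 1.0, 1: 1.75}
print(fine_tune_epochs(100))                  # 10
```

In a real framework the same effect would typically be obtained by disabling gradient updates for the fixed filters and removing the pruned filters from the layer.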

[0048] When fine-tuning is complete, the pruning rate is checked (S180). If the pruning rate does not reach the target pruning rate as a result of the check (S180-N), the process returns to step S120 and re-performs the rank calculation.

[0049] On the other hand, when it is confirmed that the pruning rate has reached the target pruning rate (S180-Y), the fine-tuned deep learning network is saved as a lightweight deep learning network (S190). The deep learning network saved in step S190 is loaded into an edge device or embedded system and utilized.
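The loop over steps S120 to S180 can be outlined with a toy simulation. Everything here is a simplified assumption: filters are represented only by their ranks, the fine-tuning step is a placeholder that returns its input unchanged (so ranks are not actually recomputed), and the loop stops with an error when fewer than three groups can be formed, a case the text does not address. The names `classify` and `compress` are hypothetical.

```python
def classify(sorted_ranks, unit, n_groups=3):
    """S140: chunk by output unit, then merge the closest adjacent groups."""
    groups = [sorted_ranks[i:i + unit] for i in range(0, len(sorted_ranks), unit)]
    while len(groups) > n_groups:
        gaps = [groups[i][-1] - groups[i + 1][0] for i in range(len(groups) - 1)]
        i = gaps.index(min(gaps))
        groups[i:i + 2] = [groups[i] + groups[i + 1]]
    return groups


def compress(ranks, unit, target_rate, fine_tune=lambda kept: kept):
    """Repeat classify / prune / fine-tune until the target pruning rate is met."""
    total = len(ranks)
    kept = list(ranks)
    while 1 - len(kept) / total < target_rate:   # S180: check the pruning rate
        kept = sorted(kept, reverse=True)        # stand-in for S120 and S130
        groups = classify(kept, unit)
        if len(groups) < 3:
            raise RuntimeError("too few filters left to form three groups")
        upper, middle, lower = groups
        kept = upper + middle                    # S160: drop the lower group
        kept = fine_tune(kept)                   # S170: placeholder only
    return kept                                  # S190: network to be saved


ranks = [313, 298, 275, 212, 200, 186, 173, 153,
         134, 112, 100, 88, 68, 54, 21, 7]
print(compress(ranks, unit=4, target_rate=0.5))
# [313, 298, 275, 212, 200, 186, 173, 153]
```

With these inputs the first pass prunes 4 of 16 filters (25%) and the second pass prunes 4 more (50%), at which point the target is reached and the loop ends.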

[0050] This allows the deep learning network to be reliably lightweighted into an optimized structure, and the pruning process is automated so that no human intervention is required. The lightweight deep learning network can be efficiently operated on edge devices or embedded systems.

[0051] FIG. 2 is a diagram illustrating the configuration of a system for simultaneous quantization / pruning of a deep learning network according to another embodiment of the present invention.

[0052] A system for simultaneous quantization / pruning of a deep learning network according to an embodiment of the present invention can be implemented as a computing system comprising a communication unit (210), a processor (220), and a storage unit (230) as illustrated.

[0053] The communication unit (210) is a communication interface for connecting to an external network or external device; in relation to an embodiment of the present invention, it provides the communication connection to the edge device or embedded system on which the lightweight deep learning network is to be installed.

[0054] The processor (220) simultaneously performs quantization / pruning of the deep learning network according to the procedure illustrated in FIG. 1 described above to generate a lightweight deep learning network, and loads the generated deep learning network onto an edge device and an embedded system through the communication unit (210).

[0055] The storage unit (230) provides storage space necessary for the processor (220) to function and operate.

[0056] Up until now, a method for simultaneously performing quantization and pruning for lightweighting deep learning networks has been described in detail with preferred embodiments.

[0057] In the above embodiment, the time, cost, and effort required for human intervention in lightweighting a deep learning network can be minimized by simultaneously performing quantization and pruning of a deep learning network with automated pruning applied, and the acceleration efficiency of the lightweight deep learning network can be improved by simultaneously performing quantization and pruning of a deep learning network with pruning applied that reflects hardware characteristics.

[0058] Meanwhile, it goes without saying that the technical concept of the present invention may also be applied to a computer-readable recording medium containing a computer program that enables the device and method according to the present embodiment to perform their functions. Furthermore, the technical concept according to various embodiments of the present invention may be implemented in the form of computer-readable code recorded on a computer-readable recording medium. A computer-readable recording medium may be any data storage device that can be read by a computer and store data. For example, a computer-readable recording medium may be a ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical disk, hard disk drive, etc. Additionally, computer-readable code or a program stored on a computer-readable recording medium may be transmitted through a network connected between computers.

[0059] Furthermore, although preferred embodiments of the present invention have been illustrated and described above, the present invention is not limited to the specific embodiments described above. Various modifications are possible by those skilled in the art without departing from the essence of the invention as claimed in the claims, and such modifications should not be understood separately from the technical spirit or scope of the present invention.

Claims

1. A method for simultaneously performing quantization and pruning of a deep learning network, comprising: a step of quantizing a deep learning network; a step of classifying filters of the quantized deep learning network into a plurality of groups; a step of pruning the filters of any one of the classified groups; and a step of fine-tuning the quantized deep learning network in which the filters of the one group have been pruned, wherein, if the pruning rate of the fine-tuned deep learning network has not reached a target, the classification step through the fine-tuning step are re-performed on the fine-tuned deep learning network.

2. The method of claim 1, further comprising: a step of calculating the ranks of the output feature maps of the quantized deep learning network and sorting them in descending order; and a step of sorting the filters that generated the output feature maps according to the sorted order of the output feature maps, wherein the classification step classifies the filters into the plurality of groups according to the sorted order.

3. The method of claim 2, further comprising a step of fixing the filters of another group among the classified groups, wherein the fine-tuning step fine-tunes the quantized deep learning network in which the filters of the one group are pruned and the filters of the other group are fixed.

4. The method of claim 3, wherein the ranks of the filters in the one group are lower than the ranks of the filters in the other group.

5. The method of claim 3, wherein the ranks of the filters in the remaining group, which is fine-tuned, are higher than the ranks of the filters in the one group and lower than the ranks of the filters in the other group.

6. The method of claim 2, wherein the number of filters classified into each group is an integer multiple of the output unit of the output feature map.

7. The method of claim 6, wherein the classification step comprises: a step of generating preliminary groups by grouping the sorted filters into output units of the output feature map; and a step of merging the preliminary groups to form a fixed number of groups when the number of generated preliminary groups exceeds the given number of groups.

8. The method of claim 7, wherein the merging step merges the preliminary groups based on the rank differences between the preliminary groups.

9. The method of claim 1, wherein the deep learning network includes an object recognition model and an object detection model.

10. A system for simultaneous quantization and pruning of a deep learning network, comprising: a processor that quantizes a deep learning network, classifies filters of the quantized deep learning network into a plurality of groups, prunes the filters of any one of the classified groups, fine-tunes the quantized deep learning network in which the filters of the one group have been pruned, and, if the pruning rate of the fine-tuned deep learning network has not reached a target, re-performs the process from group classification to fine-tuning on the fine-tuned deep learning network; and a storage unit in which a deep learning network whose pruning rate has reached the target is stored.

11. A method for simultaneously performing quantization and pruning of a deep learning network, comprising: a step of classifying filters of a quantized deep learning network into a plurality of groups; a step of pruning the filters of one of the classified groups; a step of fixing the filters of another of the classified groups; and a step of fine-tuning the filters of the remaining group among the classified groups, wherein, if the pruning rate of the fine-tuned deep learning network has not reached a target, the classification step through the fine-tuning step are re-performed on the fine-tuned deep learning network.

12. A system for simultaneous quantization and pruning of a deep learning network, comprising: a processor that classifies filters of a quantized deep learning network into a plurality of groups, prunes the filters of one of the classified groups, fixes the filters of another of the classified groups, fine-tunes the filters of the remaining group among the classified groups, and, if the pruning rate of the fine-tuned deep learning network has not reached a target, re-performs the process from classification to fine-tuning on the fine-tuned deep learning network; and a storage unit in which a deep learning network whose pruning rate has reached the target is stored.