Pyramid knowledge distillation framework-based model compression limit analysis method and device

By adopting a multi-level hierarchical model compression method based on a pyramid knowledge distillation framework, the invention addresses knowledge explosion and knowledge degradation in knowledge distillation, explores the limits of model compression, and balances compression ratio against accuracy, which makes it suitable for model deployment on edge devices.

CN115600672B · Active · Publication Date: 2026-06-26 · HARBIN INST OF TECH SHENZHEN GRADUATE SCHOOL

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HARBIN INST OF TECH SHENZHEN GRADUATE SCHOOL
Filing Date
2022-10-26
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing knowledge distillation methods suffer from knowledge explosion and knowledge degradation during model compression, and have failed to explore the limits of model compression.

Method used

We adopt a pyramid-based knowledge distillation framework to construct a multi-level, hierarchical online deep mutual learning model. The model is structured with a large model at the bottom and a small model at the top. Online learning occurs within each layer, while offline distillation occurs between layers. Furthermore, we use an adaptive multi-teacher distillation method to draw knowledge from all teacher models to guide the top model of the pyramid.

Benefits of technology

It effectively avoids knowledge explosion and knowledge degradation, explores the limits of model compression, provides a balance between model compression ratio and accuracy, and is suitable for model deployment on edge devices.

✦ Generated by Eureka AI based on patent content.

Figure CN115600672B_ABST

Abstract

The application provides a model compression limit analysis method based on a pyramid knowledge distillation framework, comprising the following steps: constructing N groups of online deep mutual learning models arranged in a pyramid structure; performing online deep mutual learning on each group, and recording the parameter count and model performance of the two models in each group, wherein, from the second group upward, each group receives offline knowledge distillation from the previous group while performing online deep mutual learning; extracting the latent representations of all models from the first group to the (N-1)th group and feeding them into an adapter to generate teacher importance weight soft labels; performing online deep mutual learning on the Nth group, and recording the parameter count and model performance of the Nth group of models; and analyzing the balance point between model compression ratio and accuracy according to the parameter count and model performance of the two models in each group and of the Nth group of models.

Description

Technical Field

[0001] This invention relates to the fields of knowledge distillation and model compression.

Background Technology

[0002] Model compression based on knowledge distillation is an important branch of model compression and acceleration. The goal of knowledge distillation is to transfer the knowledge learned by a complex network to a small network, so that the less complex network achieves accuracy comparable to that of the complex one. Currently, knowledge distillation falls into three types: online knowledge distillation (see Figure 2), offline knowledge distillation (see Figure 3), and online / offline hybrid distillation (see Figure 4). Online knowledge distillation is generally based on deep mutual learning and is an efficient, parallel, single-stage, end-to-end training method; however, it has difficulty optimizing highly complex teacher models. Offline knowledge distillation first trains a complex teacher model and then uses it to guide the training of the student model. It is usually a one-way, two-stage knowledge transfer. The performance of the student model depends largely on the teacher model, and the student can absorb only part of the teacher's knowledge, which leaves a significant performance gap between simple student models and complex teacher models. Hybrid online / offline distillation combines the advantages of the first two methods but also inherits their drawbacks: in the offline stage, the student model struggles to fully absorb the knowledge of the teacher model, a phenomenon known as "knowledge explosion"; in the online stage, highly complex models are difficult to optimize. Some scholars have added multiple teacher models for hierarchical guidance on top of offline knowledge distillation (see Figure 5) to address insufficient knowledge acquisition by student models. However, hierarchical guidance suffers from inconsistent knowledge between teachers at different levels, a phenomenon known as "knowledge degradation", in which interference from other knowledge degrades the student model's performance. An offline distillation method that combines the knowledge of multiple teacher models to guide student models has also been proposed (see Figure 5): integrating the knowledge of multiple teachers before instructing the student can alleviate knowledge degradation. However, none of the above knowledge distillation methods can explore the compression limit of models based on knowledge distillation.

[0003] Currently, model compression methods based on offline knowledge distillation suffer from knowledge explosion, where student models cannot fully absorb the knowledge of teacher models. Methods based on online knowledge distillation struggle to optimize highly complex, large models. Hybrid offline-online distillation methods improve the student model's learning of teacher knowledge but inherit the shortcomings of both: offline, student models struggle to fully absorb teacher knowledge; online, highly complex models are difficult to optimize. All these methods transfer knowledge from complex, large models to simpler, smaller models to achieve model compression, but none has explored the limits of model compression.

Summary of the Invention

[0004] The present invention aims to at least partially solve one of the technical problems in the related art.

[0005] Therefore, the purpose of this invention is to propose a model compression limit analysis method based on the pyramid knowledge distillation framework, which allows customers to select models and model distillation depths according to their needs.

[0006] To achieve the above objectives, a first aspect of the present invention proposes a compression limit analysis method based on a pyramid knowledge distillation framework model, comprising:

[0007] Construct N sets of online deep mutual learning models, wherein each set of online deep mutual learning models includes a large model and a small model. There is a gradient in model size among the N sets of online deep mutual learning models, and they are arranged in a pyramid structure with the large model at the bottom and the small model at the top.
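The pyramid arrangement described above can be pictured as a small data structure. The following is a minimal sketch, not the patented implementation: the class names, the `build_pyramid` helper, the rule of pairing consecutive sizes into one group, and the parameter counts are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelSpec:
    name: str        # hypothetical label, e.g. "g1_large"
    num_params: int  # parameter count of the model


@dataclass
class MutualLearningGroup:
    level: int       # 1 = bottom of the pyramid, N = top
    large: ModelSpec
    small: ModelSpec


def build_pyramid(param_counts: List[int]) -> List[MutualLearningGroup]:
    """Pair consecutive sizes into groups, largest group at the bottom (level 1)."""
    if len(param_counts) < 2 or len(param_counts) % 2 != 0:
        raise ValueError("need an even number of model sizes, at least two")
    sizes = sorted(param_counts, reverse=True)
    groups = []
    for i in range(0, len(sizes), 2):
        level = i // 2 + 1
        groups.append(
            MutualLearningGroup(
                level=level,
                large=ModelSpec(f"g{level}_large", sizes[i]),
                small=ModelSpec(f"g{level}_small", sizes[i + 1]),
            )
        )
    return groups


# Illustrative parameter counts only; they are not taken from the patent.
pyramid = build_pyramid(
    [60_000_000, 25_000_000, 11_000_000, 44_000_000, 5_000_000, 2_000_000]
)
```

Here N is simply `len(pyramid)`, and every group holds one large and one small model, with sizes shrinking from the bottom level to the top.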

[0008] Each group of online deep mutual learning models in the pyramid structure is subjected to online deep mutual learning, and the parameter count and model performance of the two models in each group of online deep mutual learning models are recorded; wherein, starting from the second group of online deep mutual learning models from the bottom up, while performing online deep mutual learning, the offline knowledge distillation of the previous group of online deep mutual learning models is received.
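To make the combination of in-group mutual learning and cross-group offline distillation concrete, here is a hedged sketch of a per-sample loss for one model in a group. The patent does not give a loss formula; the form below (cross-entropy plus a KL term toward the peer, plus an optional temperature-scaled KL term toward a frozen teacher from the previous group) follows common practice in the deep mutual learning and knowledge distillation literature. The function names and the `temperature` and `alpha` parameters are assumptions.

```python
import math
from typing import List, Optional


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def peer_loss(
    own_logits: List[float],
    peer_logits: List[float],
    label: int,
    teacher_logits: Optional[List[float]] = None,
    temperature: float = 2.0,
    alpha: float = 1.0,
) -> float:
    """Loss for one model: supervised term + mutual term + optional offline KD term."""
    own = softmax(own_logits)
    loss = -math.log(own[label])                         # cross-entropy with the label
    loss += kl_divergence(softmax(peer_logits), own)     # online mutual learning
    if teacher_logits is not None:                       # groups 2..N only
        t = softmax(teacher_logits, temperature)
        s = softmax(own_logits, temperature)
        loss += alpha * temperature ** 2 * kl_divergence(t, s)
    return loss
```

For the bottom group, `teacher_logits` is left as `None`; for higher groups it would hold the output of the corresponding frozen model from the group below.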

[0009] Extract the latent representations of all models from model group 1 to model group N-1 and feed them into the adaptor to generate teacher importance weight soft labels. Perform online deep mutual learning on the online deep mutual learning model of group N based on the teacher importance weight soft labels, and record the number of parameters and model performance of the model of group N.
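The patent does not specify the internal structure of the adaptor. As one possible reading, the sketch below scores each teacher's latent vector with a shared linear map and normalizes the scores with a softmax, so the importance weights are positive and sum to one. The function name `teacher_importance`, the linear form, and the parameters `w` and `b` are assumptions.

```python
import math
from typing import List


def teacher_importance(
    latents: List[List[float]], w: List[float], b: float = 0.0
) -> List[float]:
    """Map one latent vector per teacher to a normalized importance weight."""
    # Shared linear scoring of each teacher's latent representation.
    scores = [sum(wi * hi for wi, hi in zip(w, h)) + b for h in latents]
    # Softmax across teachers (shifted by the max for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Three hypothetical teachers with two-dimensional latent vectors.
weights = teacher_importance([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], w=[0.5, -0.5])
```

In practice `w` and `b` would be trained together with the top group, which is what would make the weighting adaptive.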

[0010] Based on the parameter count and performance of the two models in each group of online deep mutual learning models, and the parameter count and performance of the Nth group of models, the balance point between model compression ratio and accuracy is analyzed.

[0011] In addition, the compression limit analysis method based on the pyramid knowledge distillation framework model according to the above embodiments of the present invention may also have the following additional technical features:

[0012] Furthermore, in one embodiment of the present invention, the step of performing online deep mutual learning on the Nth group of online deep mutual learning models based on the teacher importance weight soft labels includes:

[0013] Extract the soft labels of the FC layer of all models from model group 1 to model group N-1 and perform matrix multiplication with the teacher importance weight to obtain weighted knowledge. Based on the weighted knowledge, provide online deep mutual learning guidance for the online deep mutual learning model of group N.
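The matrix multiplication described here can be written out directly. In this sketch, `soft_labels` is assumed to be a T × C matrix (one row of class probabilities per teacher) and `weights` a length-T vector of teacher importance weights, so the product is a single length-C soft label. The shapes, names, and numbers are illustrative assumptions.

```python
from typing import List


def weighted_knowledge(
    weights: List[float], soft_labels: List[List[float]]
) -> List[float]:
    """Compute weights (1 x T) times soft_labels (T x C), giving a 1 x C vector."""
    if len(weights) != len(soft_labels):
        raise ValueError("one weight is required per teacher")
    num_classes = len(soft_labels[0])
    return [
        sum(w * row[c] for w, row in zip(weights, soft_labels))
        for c in range(num_classes)
    ]


# Illustrative values only: three teachers, three classes.
combined = weighted_knowledge(
    [0.5, 0.3, 0.2],
    [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],
)
```

If the weights sum to one and each row is a probability distribution, the result is again a probability distribution and can serve as the distillation target for the top group.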

[0014] Furthermore, in one embodiment of the present invention, the step of analyzing the balance point between model compression ratio and accuracy based on the parameter count and model performance of the two models in each group of online deep mutual learning models and the parameter count and model performance of the Nth group of models further includes:

[0015] Based on the parameters and model performance, obtain the model compression ratio and accuracy, plot the compression ratio and accuracy curve, and select the model and model distillation depth according to the requirements based on the compression ratio and accuracy curve.
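The analysis step can be sketched as follows. The patent does not define the compression ratio formula, so this example assumes it is the parameter count of the largest (bottom) model divided by the parameter count of the candidate model. The `Record` type, both function names, and all numbers are placeholders for illustration, not results from the patent.

```python
from typing import List, NamedTuple, Optional, Tuple


class Record(NamedTuple):
    name: str
    num_params: int
    accuracy: float


def compression_curve(
    records: List[Record], baseline_params: int
) -> List[Tuple[float, float, str]]:
    """Return (compression ratio, accuracy, name) points sorted by ratio."""
    return sorted((baseline_params / r.num_params, r.accuracy, r.name) for r in records)


def select_model(
    records: List[Record], baseline_params: int, min_accuracy: float
) -> Optional[Record]:
    """Pick the smallest recorded model that still meets the accuracy requirement."""
    candidates = [r for r in records if r.accuracy >= min_accuracy]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.num_params)


# Placeholder records for a three-level pyramid.
records = [
    Record("g1_large", 60_000_000, 0.920),
    Record("g1_small", 44_000_000, 0.915),
    Record("g2_large", 25_000_000, 0.900),
    Record("g2_small", 11_000_000, 0.880),
    Record("g3_large", 5_000_000, 0.840),
    Record("g3_small", 2_000_000, 0.750),
]
curve = compression_curve(records, baseline_params=60_000_000)
choice = select_model(records, baseline_params=60_000_000, min_accuracy=0.85)
```

The level of the selected model corresponds to the distillation depth: choosing a model from group 2 means the pyramid only needs to be distilled two levels deep.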

[0016] To achieve the above objectives, a second aspect of the present invention provides a compression limit analysis device based on a pyramid knowledge distillation framework model, comprising:

[0017] The building module is used to build N sets of online deep mutual learning models. Each set of online deep mutual learning models includes a large model and a small model. There is a gradient in model size among the N sets of online deep mutual learning models. They are arranged in a pyramid structure with the large model at the bottom and the small model at the top.

[0018] The first learning module is used to perform online deep mutual learning on each group of online deep mutual learning models in the pyramid model, and record the parameter quantity and model performance of the two models in each group of online deep mutual learning models; wherein, starting from the second group of online deep mutual learning models from bottom to top, while performing online deep mutual learning, it receives offline knowledge distillation from the previous group of online deep mutual learning models.

[0019] The second learning module is used to extract the latent representations of all models from model group 1 to model group N-1 and feed them into the adaptor to generate teacher importance weight soft labels. Based on the teacher importance weight soft labels, online deep mutual learning is performed on the online deep mutual learning model of group N, and the number of parameters and model performance of group N are recorded.

[0020] The analysis module is used to analyze the balance point between model compression ratio and accuracy based on the number of parameters and model performance of the two models in each group of online deep mutual learning models and the number of parameters and model performance of the Nth group of models.

[0021] Furthermore, in one embodiment of the present invention, the second learning module is further configured to:

[0022] Extract the soft labels of the FC layer of all models from model group 1 to model group N-1 and perform matrix multiplication with the teacher importance weight to obtain weighted knowledge. Based on the weighted knowledge, provide online deep mutual learning guidance for the online deep mutual learning model of group N.

[0023] Furthermore, in one embodiment of the present invention, the analysis module is further configured to:

[0024] Based on the parameters and model performance, obtain the model compression ratio and accuracy, plot the compression ratio and accuracy curve, and select the model and model distillation depth according to the requirements based on the compression ratio and accuracy curve.

[0025] To achieve the above objectives, a third aspect of the present invention provides a computer device, characterized in that it includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, it implements the compression limit analysis method based on the pyramid knowledge distillation framework model as described above.

[0026] To achieve the above objectives, a fourth aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the compression limit analysis method based on the pyramid knowledge distillation framework model as described above.

[0027] The pyramid-based knowledge distillation framework model compression limit analysis method of this invention addresses the problems of "knowledge explosion and knowledge degradation" in knowledge distillation and fills the gap in exploring the compression limit of knowledge distillation models. It designs a pyramid-based knowledge distillation framework with a large model at the bottom and smaller models at the top, forming a multi-level pyramid structure. Within each layer, two models engage in online deep mutual learning, while offline distillation is used between layers, with each lower layer acting as a mentor to the layer above, thus mitigating "knowledge explosion" to some extent. An adaptive multi-teacher distillation method is employed to extract knowledge from all teacher models (except the top-level model) to guide the top-level model, further preventing "knowledge degradation."

Attached Figure Description

[0028] The above and / or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, wherein:

[0029] Figure 1 is a flowchart of a compression limit analysis method based on a pyramid knowledge distillation framework model, provided in an embodiment of the present invention.

[0030] Figure 2 is a schematic diagram of online knowledge distillation, provided in an embodiment of the present invention.

[0031] Figure 3 is a schematic diagram of offline knowledge distillation, provided in an embodiment of the present invention.

[0032] Figure 4 is a schematic diagram of an online / offline hybrid knowledge distillation method, provided in an embodiment of the present invention.

[0033] Figure 5 is a schematic diagram of a multi-teacher, multi-level, hierarchical knowledge distillation method, provided in an embodiment of the present invention.

[0034] Figure 6 is a schematic diagram of a pyramid knowledge distillation framework, provided in an embodiment of the present invention.

[0035] Figure 7 is a schematic diagram of a detailed technical route for pyramid knowledge distillation, provided in an embodiment of the present invention.

[0036] Figure 8 is a schematic diagram of the model compression ratio and accuracy curve, provided in an embodiment of the present invention.

[0037] Figure 9 is a schematic diagram of a compression limit analysis device based on a pyramid knowledge distillation framework model, provided in an embodiment of the present invention.

Detailed Implementation

[0038] Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain the present invention, and should not be construed as limiting the present invention.

[0039] The compression limit analysis method based on the pyramid knowledge distillation framework model of the present invention is described below with reference to the accompanying drawings.

[0040] Example 1

[0041] Figure 1 is a flowchart of a compression limit analysis method based on a pyramid knowledge distillation framework model, provided in an embodiment of the present invention.

[0042] As shown in Figure 1, the compression limit analysis method based on the pyramid knowledge distillation framework model includes the following steps:

[0043] S101: Construct N sets of online deep mutual learning models, wherein each set of online deep mutual learning models includes a large model and a small model. There is a gradient in model size among the N sets of online deep mutual learning models, and they are arranged in a pyramid structure with the large model at the bottom and the small model at the top.

[0044] As shown in Figure 6, the value of N is determined by a trade-off between model compression requirements and accuracy requirements. The higher the required compression ratio, the larger the value of N; the higher the accuracy requirement, the smaller the value of N.

[0045] S102: Perform online deep mutual learning on each group of online deep mutual learning models in the pyramid structure, and record the parameter quantity and model performance of the two models in each group of online deep mutual learning models; wherein, starting from the second group of online deep mutual learning models from bottom to top, while performing online deep mutual learning, the offline knowledge distillation of the previous group of online deep mutual learning models is received.

[0046] S103: Extract the latent representations of all models from model group 1 to model group N-1 and feed them into the adaptor to generate teacher importance weight soft labels. Perform online deep mutual learning on the online deep mutual learning model of group N according to the teacher importance weight soft labels, and record the number of parameters and model performance of group N.

[0047] Furthermore, in one embodiment of the present invention, the step of performing online deep mutual learning on the Nth group of online deep mutual learning models based on the teacher importance weight soft labels includes:

[0048] Extract the soft labels of the FC layer of all models from model group 1 to model group N-1 and perform matrix multiplication with the teacher importance weight to obtain weighted knowledge. Based on the weighted knowledge, provide online deep mutual learning guidance for the online deep mutual learning model of group N.

[0049] S104: Based on the parameter count and model performance of the two models in each group of online deep mutual learning models and the parameter count and model performance of the Nth group of models, analyze the balance point between model compression ratio and accuracy.

[0050] Furthermore, in one embodiment of the present invention, the step of analyzing the balance point between model compression ratio and accuracy based on the parameter count and model performance of the two models in each group of online deep mutual learning models and the parameter count and model performance of the Nth group of models further includes:

[0051] Based on the parameters and model performance, obtain the model compression ratio and accuracy, plot the compression ratio and accuracy curve, and select the model and model distillation depth according to the requirements based on the compression ratio and accuracy curve.

[0052] Example 2

[0053] The process of exploring the compression limit of the model based on the pyramid knowledge distillation framework is as follows, and the technical route is shown in Figure 7.

[0054] 1) The bottom model group (the first model group from bottom to top) performs online deep mutual learning to obtain two models, one large and one small, and the number of parameters and model performance of the two models are recorded.

[0055] 2) The second model group from bottom to top performs online deep mutual learning, while receiving offline knowledge distillation from the bottom model (offline knowledge distillation for the large model and the small model respectively), resulting in two models, one large and one small, and recording the number of parameters and model performance of the two models.

[0056] 3) The training process from the bottom up from the 3rd model group to the Nth model group (the top model group) is similar to that of the 2nd model group, and will not be repeated here.

[0057] 4) Extract the latent representations of all models from model group 1 to model group N-1 and feed them into the adaptor to generate teacher importance weights. Extract the soft labels of the fully connected (FC) layers of all models from model group 1 to model group N-1 and perform matrix multiplication with the teacher importance weights to obtain weighted knowledge. Use the weighted knowledge to further fine-tune and guide the online deep mutual learning of the top model group, obtaining two models, one large and one small, and record the number of parameters and model performance of the two models.

[0058] 5) Analyze the experimental records to determine the balance point between model compression ratio and accuracy, and plot the compression ratio versus accuracy curve, as shown in Figure 8, so that customers can select the model and model distillation depth according to their needs.
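The ordering of training stages in steps 1) to 4) can be summarized in a short sketch. The function `training_schedule` and its dictionary keys are hypothetical names; the sketch only enumerates which group is trained at each stage and which groups act as its offline teachers, without performing any training.

```python
from typing import Dict, List


def training_schedule(n_groups: int) -> List[Dict[str, object]]:
    """List the training stages for a pyramid of n_groups levels (1 = bottom)."""
    if n_groups < 2:
        raise ValueError("the pyramid needs at least two groups")
    stages: List[Dict[str, object]] = []
    for level in range(1, n_groups + 1):
        stages.append({
            "stage": "mutual_learning",
            "group": level,
            # The bottom group has no teacher; every other group is distilled
            # offline from the group directly below it.
            "offline_teachers": [] if level == 1 else [level - 1],
        })
    # Final stage: the top group is fine-tuned with weighted knowledge
    # drawn from all groups below it.
    stages.append({
        "stage": "adaptive_multi_teacher",
        "group": n_groups,
        "offline_teachers": list(range(1, n_groups)),
    })
    return stages


schedule = training_schedule(4)
```

For N = 4 this yields five stages: four mutual-learning stages from bottom to top, followed by one adaptive multi-teacher stage for the top group.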

[0059] The following is a deployment scheme for edge devices based on a compact deep learning model obtained by pyramid knowledge distillation.

[0060] 1) Select a model M with a suitable compression ratio and accuracy. Edge devices have very limited computing and storage resources, so the selected model must fit within those resource constraints while still meeting the accuracy requirements for detection and recognition. The selection range can be determined from the model compression ratio versus accuracy curve obtained by pyramid knowledge distillation in step 2.

[0061] 2) Verify the model performance on datasets specific to the business scenario (such as remote sensing image recognition datasets or remote sensing image target detection datasets). For the selected model M, refer to the compression ratio versus accuracy curve obtained by pyramid knowledge distillation and repeat steps (1)(2)(3)(4) of step 2. If the verified model M meets the performance requirements of the edge device, proceed to the next step; otherwise, repeat step (5) of step 2 and restart steps (1)(2) of step 3 until a model M that meets the hardware and performance requirements of the edge device is obtained.

[0062] 3) Load model M onto the edge device and test it. Install the deployment environment on the edge device, load the model M obtained above, and run edge-side tests using test data.

[0063] 4) Update and maintain the model on the edge device regularly. Model M is fine-tuned on new datasets to improve its task performance, and the latest model M is periodically pushed to the edge devices.
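The select-and-verify loop in the deployment scheme can be sketched as a simple search. This is an illustration under stated assumptions: `pick_deployable`, the `max_params` hardware budget, and the `verify` callback (standing in for re-distillation and validation on the business dataset) are hypothetical and not defined by the patent.

```python
from typing import Callable, List, Optional, Tuple

Candidate = Tuple[str, int]  # (model name, parameter count)


def pick_deployable(
    candidates: List[Candidate],
    max_params: int,
    verify: Callable[[str], bool],
) -> Optional[str]:
    """Return the smallest candidate within the budget that passes verification."""
    for name, num_params in sorted(candidates, key=lambda c: c[1]):
        if num_params > max_params:
            break  # every remaining candidate is even larger
        if verify(name):
            return name
    return None


# Placeholder candidates; the lambda stands in for validation on the device's dataset.
chosen = pick_deployable(
    [("g2_small", 11_000_000), ("g3_large", 5_000_000), ("g3_small", 2_000_000)],
    max_params=6_000_000,
    verify=lambda name: name != "g3_small",
)
```

Trying the smallest model first reflects the aim of finding the compression limit; when no candidate passes, a new selection from the compression ratio versus accuracy curve is needed.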

[0064] The pyramid-based knowledge distillation framework model compression limit analysis method of this invention addresses the problems of "knowledge explosion and knowledge degradation" in knowledge distillation and fills the gap in exploring the compression limit of knowledge distillation models. It designs a pyramid-based knowledge distillation framework with a large model at the bottom and smaller models at the top, forming a multi-level pyramid structure. Within each layer, two models engage in online deep mutual learning, while offline distillation is used between layers, with each lower layer acting as a mentor to the layer above, thus mitigating "knowledge explosion" to some extent. An adaptive multi-teacher distillation method is employed to extract knowledge from all teacher models (except the top-level model) to guide the top-level model, further preventing "knowledge degradation."

[0065] Figure 9 is a schematic diagram of a compression limit analysis device based on a pyramid knowledge distillation framework model, provided in an embodiment of the present invention.

[0066] As shown in Figure 9, the compression limit analysis device based on the pyramid knowledge distillation framework model includes: a construction module 100, a first learning module 200, a second learning module 300, and an analysis module 400, wherein:

[0067] The building module is used to build N sets of online deep mutual learning models. Each set of online deep mutual learning models includes a large model and a small model. There is a gradient in model size among the N sets of online deep mutual learning models. They are arranged in a pyramid structure with the large model at the bottom and the small model at the top.

[0068] The first learning module is used to perform online deep mutual learning on each group of online deep mutual learning models in the pyramid model, and record the parameter quantity and model performance of the two models in each group of online deep mutual learning models; wherein, starting from the second group of online deep mutual learning models from bottom to top, while performing online deep mutual learning, it receives offline knowledge distillation from the previous group of online deep mutual learning models.

[0069] The second learning module is used to extract the latent representations of all models from model group 1 to model group N-1 and feed them into the adaptor to generate teacher importance weight soft labels. Based on the teacher importance weight soft labels, online deep mutual learning is performed on the online deep mutual learning model of group N, and the number of parameters and model performance of group N are recorded.

[0070] The analysis module is used to analyze the balance point between model compression ratio and accuracy based on the number of parameters and model performance of the two models in each group of online deep mutual learning models and the number of parameters and model performance of the Nth group of models.

[0071] Furthermore, in one embodiment of the present invention, the second learning module is further configured to:

[0072] Extract the soft labels of the FC layer of all models from model group 1 to model group N-1 and perform matrix multiplication with the teacher importance weight to obtain weighted knowledge. Based on the weighted knowledge, provide online deep mutual learning guidance for the online deep mutual learning model of group N.

[0073] Furthermore, in one embodiment of the present invention, the analysis module is further configured to:

[0074] Based on the parameters and model performance, obtain the model compression ratio and accuracy, plot the compression ratio and accuracy curve, and select the model and model distillation depth according to the requirements based on the compression ratio and accuracy curve.

[0075] To achieve the above objectives, a third aspect of the present invention provides a computer device, characterized in that it includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, it implements the compression limit analysis method based on the pyramid knowledge distillation framework model as described above.

[0076] To achieve the above objectives, a fourth aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the compression limit analysis method based on the pyramid knowledge distillation framework model as described above.

[0077] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.

[0078] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this invention, "a plurality of" means at least two, such as two, three, etc., unless otherwise explicitly specified.

[0079] Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention. Those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention.

Claims

1. A compression limit analysis method based on a pyramid knowledge distillation framework model, characterized in that it comprises the following steps: for a remote sensing image target detection dataset, constructing N groups of online deep mutual learning models, wherein each group includes a large model and a small model, there is a gradient in model size among the N groups, and the groups are arranged in a pyramid structure with the large model at the bottom and the small model at the top; performing online deep mutual learning on each group of online deep mutual learning models in the pyramid structure, and recording the parameter count and model performance of the two models in each group; wherein, starting from the second group counting from the bottom up, each group, while performing online deep mutual learning, receives offline knowledge distillation from the previous group; extracting the latent representations of all models from group 1 to group N-1 and feeding them into an adapter to generate teacher importance weight soft labels; performing online deep mutual learning on the Nth group of online deep mutual learning models based on the teacher importance weight soft labels, and recording the parameter count and model performance of the models of the Nth group; obtaining the model compression ratio and accuracy based on the recorded parameter counts and model performance, plotting a compression ratio and accuracy curve, and selecting a model and a model distillation depth according to requirements based on the compression ratio and accuracy curve; performing distillation training on the selected model according to the compression ratio and accuracy curve, and verifying the model performance on the dataset to obtain a model M that meets the hardware and performance requirements of an edge device; and loading the model M into the edge device.
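The per-group training signal described in claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes a plain classification head instead of a full remote sensing detector, uses the standard deep mutual learning loss (cross-entropy plus a KL term toward the peer) and a temperature-scaled KL term for offline distillation, and treats the previous group as a single frozen teacher. The names `group_losses`, `prev_teacher_logits`, `temperature` and `alpha` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with optional temperature scaling."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def group_losses(large_logits, small_logits, label,
                 prev_teacher_logits=None, temperature=2.0, alpha=0.5):
    """Return (loss_large, loss_small) for one group in the pyramid."""
    p_large = softmax(large_logits)
    p_small = softmax(small_logits)

    # Online deep mutual learning: each peer fits the label and
    # also mimics the other peer's current prediction.
    loss_large = cross_entropy(p_large, label) + kl_divergence(p_small, p_large)
    loss_small = cross_entropy(p_small, label) + kl_divergence(p_large, p_small)

    # Offline distillation from the previous (lower) group, applied
    # from the second group upward. The teacher is assumed frozen.
    if prev_teacher_logits is not None:
        teacher = softmax(prev_teacher_logits, temperature)
        scale = alpha * temperature ** 2
        loss_large += scale * kl_divergence(
            teacher, softmax(large_logits, temperature))
        loss_small += scale * kl_divergence(
            teacher, softmax(small_logits, temperature))

    return loss_large, loss_small


# Hypothetical logits for a 3-class toy problem.
bottom = group_losses([2.0, 0.5, -1.0], [1.0, 0.2, 0.1], label=0)
upper = group_losses([1.5, 0.3, -0.5], [0.8, 0.4, 0.2], label=0,
                     prev_teacher_logits=[2.0, 0.5, -1.0])
print(bottom, upper)
```

The bottom group is called without a teacher, while every higher group passes the previous group's outputs, which mirrors the "online within a group, offline between groups" arrangement of the claim.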

2. The method according to claim 1, characterized in that the step of performing online deep mutual learning on the Nth group of online deep mutual learning models based on the teacher importance weight soft labels includes: extracting the soft labels of the FC layer of all models from group 1 to group N-1 and performing matrix multiplication with the teacher importance weights to obtain weighted knowledge; and, based on the weighted knowledge, providing online deep mutual learning guidance for the online deep mutual learning models of the Nth group.
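The matrix multiplication in claim 2 can be illustrated with a small sketch. It assumes that the adapter yields one scalar score per teacher and that the scores are normalized with a softmax to give the importance weights; how the adapter derives those scores from the latent representations is not modeled here. The names `teacher_importance_weights` and `weighted_knowledge` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_importance_weights(adapter_scores):
    """Normalize per-teacher adapter scores so they sum to 1 (assumption)."""
    return softmax(adapter_scores)


def weighted_knowledge(weights, teacher_soft_labels):
    """Compute (1 x K) @ (K x C): blend K teacher soft labels into one.

    weights: K importance weights, one per teacher.
    teacher_soft_labels: K rows, each a distribution over C classes
    taken from a teacher's FC-layer output.
    """
    num_classes = len(teacher_soft_labels[0])
    return [
        sum(w * row[c] for w, row in zip(weights, teacher_soft_labels))
        for c in range(num_classes)
    ]


# Two hypothetical teachers from lower groups, three classes.
teachers = [softmax([2.0, 1.0, 0.1]), softmax([0.5, 2.5, 0.3])]
weights = teacher_importance_weights([1.2, 0.4])
blended = weighted_knowledge(weights, teachers)
print(blended)
```

Because the weights sum to 1 and each teacher row is a distribution, the blended vector is again a distribution and can serve as the soft target that guides the models of the Nth group.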

3. A compression limit analysis device based on a pyramid knowledge distillation framework model, characterized in that it comprises: a construction module, used to construct, for a remote sensing image target detection dataset, N groups of online deep mutual learning models, wherein each group includes a large model and a small model, there is a gradient in model size among the N groups, and the groups are arranged in a pyramid structure with the large model at the bottom and the small model at the top; a first learning module, used to perform online deep mutual learning on each group of online deep mutual learning models in the pyramid structure, and to record the parameter count and model performance of the two models in each group; wherein, starting from the second group counting from the bottom up, each group, while performing online deep mutual learning, receives offline knowledge distillation from the previous group; a second learning module, used to extract the latent representations of all models from group 1 to group N-1 and feed them into an adapter to generate teacher importance weight soft labels, to perform online deep mutual learning on the Nth group of online deep mutual learning models based on the teacher importance weight soft labels, and to record the parameter count and model performance of the models of the Nth group; and an analysis module, used to obtain the model compression ratio and accuracy based on the recorded parameter counts and model performance, plot a compression ratio and accuracy curve, select a model and a model distillation depth according to requirements based on the compression ratio and accuracy curve, perform distillation training on the selected model according to the compression ratio and accuracy curve, verify the model performance on the dataset to obtain a model M that meets the hardware and performance requirements of an edge device, and load the model M into the edge device.
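The analysis step can be sketched as a simple selection over the recorded parameter counts and accuracies. This is only an illustration under stated assumptions: the compression ratio is taken to be the baseline parameter count divided by the model's parameter count, which the claims do not define, and the selection rule (smallest model that satisfies a parameter budget and an accuracy floor) is one possible reading of "according to requirements". The `Record` type, the function names and all numbers are hypothetical placeholders, not results from the patent.

```python
from dataclasses import dataclass


@dataclass
class Record:
    name: str        # label of a trained model in the pyramid
    params: int      # recorded parameter count
    accuracy: float  # recorded model performance on the dataset


def compression_curve(records, baseline_params):
    """Return (compression_ratio, accuracy, name) points sorted by ratio."""
    return sorted(
        (baseline_params / r.params, r.accuracy, r.name) for r in records
    )


def select_model(records, max_params, min_accuracy):
    """Pick the smallest model that fits the device and accuracy needs."""
    feasible = [
        r for r in records
        if r.params <= max_params and r.accuracy >= min_accuracy
    ]
    if not feasible:
        return None  # no balance point satisfies both requirements
    return min(feasible, key=lambda r: r.params)


# Placeholder records; real values would come from the training runs.
records = [
    Record("group1_large", 1_000_000, 0.90),
    Record("group2_small", 250_000, 0.85),
    Record("group3_small", 100_000, 0.70),
]
for ratio, accuracy, name in compression_curve(records, 1_000_000):
    print(name, ratio, accuracy)
print(select_model(records, max_params=300_000, min_accuracy=0.80))
```

Returning `None` when nothing is feasible makes it explicit that the requested compression ratio lies beyond the point at which the required accuracy can still be met.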

4. The device according to claim 3, characterized in that the second learning module is further used to: extract the soft labels of the FC layer of all models from group 1 to group N-1 and perform matrix multiplication with the teacher importance weights to obtain weighted knowledge; and, based on the weighted knowledge, provide online deep mutual learning guidance for the online deep mutual learning models of the Nth group.

5. A computer device, characterized in that it comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein, when the processor executes the computer program, it implements the compression limit analysis method based on the pyramid knowledge distillation framework model as described in any one of claims 1-2.

6. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the compression limit analysis method based on the pyramid knowledge distillation framework model as described in any one of claims 1-2.