Network model compression method and device, electronic equipment and readable medium

By optimizing the parameters of the student model through feature interaction and multi-task training of the teacher model and the student model, the problem of large number of parameters in the pre-trained language model is solved, and the compression and latency reduction of the network model are achieved.

CN116796824B · Active · Publication Date: 2026-06-12 · JD DIGITS HAIYI INFORMATION TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
JD DIGITS HAIYI INFORMATION TECHNOLOGY CO LTD
Filing Date
2023-06-21
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing pre-trained language models have a large number of parameters, resulting in high hardware resource consumption, long latency, and high data maintenance costs.

Method used

The teacher model is trained using labeled sample datasets to generate predictions of sample distribution and perform feature interaction. The student model is then trained using a multi-task approach. Finally, the student model parameters are optimized based on the loss function until convergence. The Teacher-Student model is used to reduce the number of parameters.

Benefits of technology

While ensuring the reliability and accuracy of identification, the size of the network model has been reduced, and the latency has been decreased.

✦ Generated by Eureka AI based on patent content.


Abstract

The present disclosure provides a network model compression method, device, electronic equipment and readable medium, wherein the network model compression method comprises: training a teacher model through a labeled sample data set to obtain a prediction result of a sample distribution of the sample data set; performing feature interaction on a sample vector in the sample data set with the prediction result; performing multi-task training on a student model according to the sample after the feature interaction; and performing parameter optimization on the student model based on a loss function obtained by the multi-task training until convergence, to obtain an optimized student model. Through the embodiment of the present disclosure, the compression of the network model is realized under the premise of ensuring the recognition reliability and accuracy of the network model, the size of the network model is reduced, and the delay of the network model is reduced.

Description

Technical Field

[0001] This disclosure relates to the field of machine recognition technology, and more specifically, to a network model compression method, apparatus, electronic device, and readable medium.

Background Technology

[0002] Currently, in the field of natural language processing, with the rapid development of pre-trained language models, the number of model parameters has reached hundreds of millions. While these increasingly large network models can improve the accuracy and reliability of machine recognition solutions, they also lead to expensive hardware resource consumption, high latency, and high data maintenance costs.

[0003] It should be noted that the information disclosed in the background section above is only used to enhance the understanding of the background of this disclosure, and therefore may include information that does not constitute prior art known to those skilled in the art.

Summary of the Invention

[0004] The purpose of this disclosure is to provide a network model compression method, apparatus, electronic device, and readable medium to overcome, at least to some extent, the problem of large network model size caused by limitations and defects in related technologies.

[0005] According to a first aspect of the present disclosure, a network model compression method is provided, comprising: training a teacher model using a labeled sample dataset to obtain a prediction result of the sample distribution of the sample dataset; performing feature interaction on sample vectors in the sample dataset having the prediction result; performing multi-task training on a student model based on the sample vectors after feature interaction; and optimizing the parameters of the student model based on the loss function obtained from the multi-task training until convergence, to obtain an optimized student model.

[0006] In one exemplary embodiment of this disclosure, before training the teacher model using the labeled sample dataset, the method further includes:

[0007] Clustering processing is performed on the collected language sample data;

[0008] The clustered language sample data are combined in pairs to generate sample pairs, and the labeled sample dataset is generated based on the vector representation of the sample pairs.

[0009] In one exemplary embodiment of this disclosure, the process of combining clustered language sample data pairwise to generate sample pairs, and generating the labeled sample dataset based on the vector representation of the sample pairs, includes:

[0010] Determine the language sample data in each cluster set after clustering;

[0011] The language sample data from any two cluster sets are combined pairwise to generate the sample pairs;

[0012] Calculate the mean of the vector values of the two language sample data in any of the sample pairs;

[0013] Compare the magnitude of the mean with that of a preset vector;

[0014] The samples are labeled as positive or negative sample pairs based on the size relationship.

[0015] The sample dataset is generated based on the positive sample pairs and the negative sample pairs.

[0016] In one exemplary embodiment of this disclosure, feature interaction of sample vectors in a sample dataset having the prediction results includes:

[0017] Perform a specified operation on any two sample vectors in the sample dataset that have the predicted results;

[0018] The results of the specified operation are concatenated.

[0019] The result of the feature interaction is determined based on the concatenation.

[0020] The specified operation includes at least one of difference operation, product operation and similarity calculation.

[0021] In one exemplary embodiment of this disclosure, multi-task training of the student model based on samples after feature interaction includes:

[0022] The first training task is performed by inputting training information into the underlying layer of the student model based on the attribute information of the sample dataset. The training information includes at least one of the sample position information, character information, and sentence information.

[0023] A second training task is performed at the top level of the student model based on the results of the feature interactions.

[0024] In one exemplary embodiment of this disclosure, optimizing the parameters of the student model based on the loss function obtained from the multi-task training until convergence, to obtain an optimized student model, includes:

[0025] Determine the labeled loss function corresponding to the sample dataset based on the first training task;

[0026] Determine the prediction loss function corresponding to the sample dataset based on the second training task;

[0027] The loss function of the student model is determined based on the preset weight coefficients, the labeled loss function, and the predicted loss function;

[0028] The parameters of the student model are optimized until the loss function converges to obtain the optimized student model.

[0029] In one exemplary embodiment of this disclosure, the feature encoder of the student model includes an albert_tiny layer of a four-layer transformer network.

[0030] According to a second aspect of the present disclosure, a network model compression apparatus is provided, comprising:

[0031] The first training module is configured to train the teacher model using the labeled sample dataset to obtain the prediction results of the sample distribution of the sample dataset;

[0032] The second training module is configured to perform feature interaction on sample vectors in the sample dataset containing the predicted results;

[0033] The third training module is configured to perform multi-task training on the student model based on samples after feature interaction;

[0034] The fourth training module is configured to optimize the parameters of the student model based on the loss function obtained from the multi-task training until convergence, so as to obtain the optimized student model.

[0035] According to a third aspect of this disclosure, an electronic device is provided, comprising: a memory; and a processor coupled to the memory, the processor being configured to perform, based on instructions stored in the memory, the method described in any of the above.

[0036] According to a fourth aspect of this disclosure, a computer-readable storage medium is provided having a program stored thereon that, when executed by a processor, implements the network model compression method as described in any of the preceding claims.

[0037] In this embodiment, the teacher model is trained using a labeled sample dataset to obtain a prediction result of the sample distribution of the sample dataset. Feature interaction is performed on the sample vectors in the sample dataset with the prediction result. Then, the student model is trained on the sample vectors with the feature interaction result for multi-task training. Finally, the parameters of the student model are optimized until convergence based on the loss function obtained from the multi-task training to obtain an optimized student model. While ensuring the recognition reliability and accuracy of the network model, the network model is compressed, reducing the size and latency of the network model.

[0038] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.

Description of the Drawings

[0039] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure. It is obvious that the drawings described below are merely some embodiments of this disclosure, and those skilled in the art can obtain other drawings based on these drawings without any inventive effort.

[0040] Figure 1 is a schematic diagram of an exemplary system architecture to which the network model compression scheme of the present invention can be applied;

[0041] Figure 2 is a flowchart of a network model compression method according to an exemplary embodiment of this disclosure;

[0042] Figure 3 is a flowchart of another network model compression method in an exemplary embodiment of this disclosure;

[0043] Figure 4 is a flowchart of another network model compression method in an exemplary embodiment of this disclosure;

[0044] Figure 5 is a flowchart of another network model compression method in an exemplary embodiment of this disclosure;

[0045] Figure 6 is a flowchart of another network model compression method in an exemplary embodiment of this disclosure;

[0046] Figure 7 is a flowchart of another network model compression method in an exemplary embodiment of this disclosure;

[0047] Figure 8 is a schematic diagram of data interaction in a network model compression scheme according to an exemplary embodiment of this disclosure;

[0048] Figure 9 is a data interaction diagram of another network model compression scheme in an exemplary embodiment of this disclosure;

[0049] Figure 10 is a data interaction diagram of another network model compression scheme in an exemplary embodiment of this disclosure;

[0050] Figure 11 is a data interaction diagram of another network model compression scheme in an exemplary embodiment of this disclosure;

[0051] Figure 12 is a data interaction diagram of another network model compression scheme in an exemplary embodiment of this disclosure;

[0052] Figure 13 is a block diagram of a network model compression apparatus according to an exemplary embodiment of the present disclosure;

[0053] Figure 14 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.

Detailed Description

[0054] Example embodiments will now be described more fully with reference to the accompanying drawings. However, example embodiments can be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided to make this disclosure more comprehensive and complete, and to fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a full understanding of embodiments of this disclosure. However, those skilled in the art will recognize that the technical solutions of this disclosure can be practiced with one or more of the specific details omitted, or other methods, components, apparatus, steps, etc., can be employed. In other instances, well-known technical solutions are not shown or described in detail to avoid obscuring various aspects of this disclosure.

[0055] Furthermore, the accompanying drawings are merely illustrative of this disclosure, and the same reference numerals in the drawings denote the same or similar parts, thus repeated descriptions of them will be omitted. Some block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different network and / or processor devices and / or microcontroller devices.

[0056] Figure 1 shows a schematic diagram of an exemplary system architecture to which the network model compression scheme of embodiments of the present invention can be applied.

[0057] As shown in Figure 1, system architecture 100 may include one or more of terminal devices 101, 102, and 103, a network 104, and a server 105. Network 104 serves as the medium for providing communication links between terminal devices 101, 102, and 103 and server 105. Network 104 may include various connection types, such as wired or wireless communication links or fiber optic cables.

[0058] It should be understood that the number of terminal devices, networks, and servers shown in Figure 1 is merely illustrative. Depending on implementation needs, there can be any number of terminal devices, networks, and servers. For example, server 105 could be a server cluster composed of multiple servers.

[0059] Users can use terminal devices 101, 102, and 103 to interact with server 105 via network 104 to receive or send messages, etc. Terminal devices 101, 102, and 103 can be various electronic devices with displays, including but not limited to smartphones, tablets, laptops, and desktop computers, etc.

[0060] In some embodiments, the network model compression method provided in this invention is generally executed by server 105, and correspondingly, the network model compression device is generally located in terminal device 103 (or terminal device 101 or 102). In other embodiments, some terminals may have similar functions to the server device to execute this method.

[0061] The knowledge distillation in the network model compression scheme of the exemplary embodiments of this disclosure adopts the Teacher-Student pattern: the complex and large model is used as the Teacher model (i.e., the teacher model), and the Student model (i.e., the student model) has a relatively simple structure. The Teacher model is used to assist the training of the Student model. The Teacher model has a strong learning ability and can transfer the knowledge it has learned to the Student model with a relatively weak learning ability, thereby enhancing the generalization ability of the Student model.

[0062] The exemplary embodiments of this disclosure will now be described in detail with reference to the accompanying drawings.

[0063] Figure 2 is a flowchart of a network model compression method in an exemplary embodiment of this disclosure.

[0064] Referring to Figure 2, the network model compression method may include:

[0065] Step S202: Train the teacher model using the labeled sample dataset to obtain the prediction result of the sample distribution of the sample dataset.

[0066] Step S204: Perform feature interaction on the sample vectors in the sample dataset with the predicted results.

[0067] Step S206: Perform multi-task training on the student model based on the samples after feature interaction.

[0068] Step S208: Optimize the parameters of the student model based on the loss function obtained from the multi-task training until convergence, to obtain the optimized student model.

[0069] In this embodiment, the teacher model is trained using a labeled sample dataset to obtain a prediction result of the sample distribution of the sample dataset. Feature interaction is performed on the sample vectors in the sample dataset with the prediction result. Then, the student model is trained on the sample vectors with the feature interaction result for multi-task training. Finally, the parameters of the student model are optimized until convergence based on the loss function obtained from the multi-task training to obtain an optimized student model. While ensuring the recognition reliability and accuracy of the network model, the network model is compressed, reducing the size and latency of the network model.

[0070] The following section provides a detailed explanation of each step in the network model compression method.

[0071] In one exemplary embodiment of this disclosure, as shown in Figure 3, before training the teacher model using the labeled sample dataset, the method further includes the following steps:

[0072] Step S302: Cluster the collected language sample data.

[0073] Step S304: Combine the clustered language sample data in pairs to generate sample pairs, and generate the labeled sample dataset based on the vector representation of the sample pairs.

[0074] In one exemplary embodiment of this disclosure, before clustering, a large number of text samples from the intelligent question-answering domain are acquired, and parameters are set, such as the similarity threshold, the number of cluster centers, and the number of data iterations. The acquired text data is divided into two parts. One part serves as index data in the recall engine (ES, Lucene, BM25, etc.). The other part is traversed, and the similarity between each item and the candidate sentences for the recalled cluster centers is calculated. The most similar cluster centers (e.g., top 200) are selected, and finally, a threshold is used to determine whether the data is assigned to the corresponding cluster. The smaller the threshold, the more similar the data between clusters, but the text diversity is relatively poor. The threshold can be set according to different task requirements.

[0075] In one exemplary embodiment of this disclosure, the similarity between texts generally has certain characteristics, such as sentence length, common substrings, sentence structure, etc. The degree of similarity between two texts can be distinguished by combining features. Based on this, text similarity pairs in massive sample data can be preliminarily determined.
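
As a rough illustration of the surface features mentioned above, the following sketch computes a length ratio and the longest common substring for a sentence pair using only the Python standard library. The function name `surface_features`, the returned keys, and the choice of these particular features are assumptions made for illustration; the disclosure does not specify how the features are computed or combined.

```python
from difflib import SequenceMatcher


def surface_features(a: str, b: str) -> dict:
    """Cheap surface features for a sentence pair (illustrative choice only)."""
    # Length ratio in [0, 1]; 1.0 means both sentences have equal length.
    longest = max(len(a), len(b))
    length_ratio = min(len(a), len(b)) / longest if longest else 1.0

    # Longest common contiguous substring, found with the standard library.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    common = a[match.a:match.a + match.size]

    return {
        "length_ratio": length_ratio,
        "longest_common_substring": common,
        # Share of the longer sentence covered by the common substring.
        "common_ratio": len(common) / longest if longest else 1.0,
    }


features = surface_features(
    "how do I reset my password",
    "how can I reset my password",
)
```

Features of this kind are inexpensive to compute, which is why they may be suitable for a preliminary pass over a large corpus before any model is involved.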

[0076] In one exemplary embodiment of this disclosure, as shown in Figure 4, the process of combining clustered language sample data pairwise to generate sample pairs, and generating the labeled sample dataset based on the vector representation of the sample pairs, includes:

[0077] Step S402: Determine the language sample data in each cluster set after clustering.

[0078] Step S404: Combine the language sample data from any two cluster sets in pairs to generate the sample pairs.

[0079] Step S406: Calculate the mean of the vector values of the two language sample data in any of the sample pairs.

[0080] Step S408: Compare the magnitude relationship between the mean and the preset vector mean.

[0081] Step S410: Label the samples as positive sample pairs or negative sample pairs according to the size relationship.

[0082] Step S412: Generate the sample dataset based on the positive sample pairs and the negative sample pairs.

[0083] In one exemplary embodiment of this disclosure, texts from different clusters or within the same cluster are cross-combined into sample pairs, and these sample pairs are labeled as positive and negative samples, thereby enriching the sample combinations and sample quantity and improving the generalization of the network model.
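
The cross-combination of texts into sample pairs can be sketched as follows. This is a minimal example under stated assumptions: the cluster contents are made up, the function name `build_pairs` is hypothetical, and the `same_cluster` field is added only for inspection. In the disclosure, the positive or negative label comes from the vector comparison, not from cluster membership.

```python
from itertools import combinations


def build_pairs(clusters: dict) -> list:
    """Cross-combine texts within and across clusters into unordered pairs."""
    # Flatten to (cluster_id, text) so each pair remembers where it came from.
    tagged = [(cid, text) for cid, texts in clusters.items() for text in texts]
    pairs = []
    for (cid_a, text_a), (cid_b, text_b) in combinations(tagged, 2):
        pairs.append({"a": text_a, "b": text_b, "same_cluster": cid_a == cid_b})
    return pairs


# Hypothetical clusters; real clusters would hold question texts.
clusters = {"A": ["a1", "a2"], "B": ["b1"], "C": ["c1", "c2"]}
pairs = build_pairs(clusters)
```

With n texts this produces n(n-1)/2 pairs, so for a large corpus some sampling of pairs would likely be needed in practice.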

[0084] In one exemplary embodiment of this disclosure, as shown in Figure 5, feature interaction on sample vectors in a sample dataset with the predicted results includes:

[0085] Step S502: Perform a specified operation on any two sample vectors in the sample dataset with the predicted results.

[0086] Step S504: Perform concatenation processing on the results of the specified operation.

[0087] Step S506: Determine the result of the feature interaction based on the concatenation, wherein the specified operation includes at least one of a difference operation, a product operation, and a similarity calculation.

[0088] In one exemplary embodiment of this disclosure, the results of difference operation, product operation and similarity calculation can be concatenated into a feature interaction result to fully integrate the features of the two sample vectors, which is beneficial to improving the reliability of the student model prediction loss function.
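
A minimal sketch of this feature interaction is given below, assuming an element-wise signed difference, an element-wise product, and cosine similarity as the three operations. The function names `cosine` and `interact` are hypothetical, and the disclosure does not state whether the difference is signed or absolute, so that choice is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def interact(u, v):
    """Concatenate difference, product and similarity into one feature vector."""
    diff = [x - y for x, y in zip(u, v)]  # assumed: signed element-wise difference
    prod = [x * y for x, y in zip(u, v)]  # element-wise product
    return diff + prod + [cosine(u, v)]   # length is 2 * dim + 1


features = interact([1.0, 2.0, 2.0], [2.0, 0.0, 1.0])
```

For two vectors of dimension d, the concatenated result has 2d + 1 entries, which would be the input width of whatever layer consumes it.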

[0089] In one exemplary embodiment of this disclosure, as shown in Figure 6, multi-task training of the student model based on samples after feature interaction includes:

[0090] Step S602: Input training information into the underlying layer of the student model according to the attribute information of the sample dataset to perform the first training task. The training information includes at least one of the sample position information, character information, and sentence information.

[0091] Step S604: Perform a second training task at the top level of the student model based on the results of the feature interactions.

[0092] In one exemplary embodiment of this disclosure, the student model is trained using a multi-task approach, which integrates the loss functions of multiple tasks and improves its reliability in downstream tasks.

[0093] In one exemplary embodiment of this disclosure, as shown in Figure 7, optimizing the parameters of the student model based on the loss function obtained from the multi-task training until convergence, to obtain an optimized student model, includes:

[0094] Step S702: Determine the labeled loss function corresponding to the sample dataset based on the first training task.

[0095] Step S704: Determine the prediction loss function corresponding to the sample dataset according to the second training task.

[0096] Step S706: Determine the loss function of the student model based on the preset weight coefficients, the labeled loss function, and the predicted loss function.

[0097] Step S708: Optimize the parameters of the student model until the loss function converges to obtain the optimized student model.

[0098] In one exemplary embodiment of this disclosure, a pre-trained teacher model guides a student model in multi-task training, and a preset weight coefficient is used to reduce the weight of the labeling loss function, thereby increasing the weight of the prediction loss function, which in turn improves the reliability, robustness and generalization ability of the student model.

[0099] In one exemplary embodiment of this disclosure, the feature encoder of the student model includes an albert_tiny layer of a four-layer transformer network.

[0100] In one exemplary embodiment of this disclosure, the feature encoder of the teacher model generally adopts a 12-layer BERT, while the feature encoder of the student model is much smaller than that of the teacher model, thus reducing the workload of parameter optimization.

[0101] A network model compression scheme of an exemplary embodiment of this disclosure is described in detail below with reference to Figures 8 to 12. As shown in Figure 8, the network model compression scheme includes four stages: representative question mining, dataset construction, teacher model training, and knowledge distillation. Specifically, on the one hand, text mining 804 is performed on the log data 802 to obtain a large corpus 806; on the other hand, a small amount of labeled data 816 is used to fine-tune the pre-trained model 814 to obtain the teacher model 818. The teacher model 818 is then used to predict the large corpus 806 to determine the text distribution characteristics of this sample dataset, i.e., the labeled data sample set, which can be considered a high-quality corpus 808 compared to the large corpus 806. The high-quality corpus 808 is then used to train the student model 810, and the student model's parameters are optimized through model distillation 812. The network model compression is achieved in the following four stages:

[0102] As shown in Figure 9, stage 1 of network model compression is the representative question mining stage.

[0103] 1.1 Obtain a large amount of question and answer text data in the field of intelligent customer service.

[0104] 1.2 Configure parameters including, but not limited to, the similarity threshold, Jaccard, the number of recall candidate cluster centers, and the number of iterations.

[0105] 1.3 Dataset Clustering: The acquired text data is divided into two parts. One part serves as the index data in the recall engine (ES, Lucene, BM25, etc.). The other part of the dataset is traversed, and the similarity between each item and the candidate set of recalled cluster centers is calculated. The most similar cluster center is selected, and a similarity threshold is used to determine whether to assign the item to the corresponding cluster. If the similarity is less than or equal to the threshold, a new cluster is created and the data is inserted. After inserting data into the new cluster, a cluster center index is added, and the Lucene index of the traversed data is written to update the Lucene recall data. If the similarity is greater than the threshold, the data is assigned to the corresponding cluster. A smaller similarity threshold indicates greater similarity between clusters, but relatively lower text diversity. The similarity threshold can be set according to different task requirements.
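
The assignment rule in step 1.3 can be sketched as a single-pass clustering loop. This is a simplified illustration, not the disclosed implementation: it assumes Jaccard similarity over whitespace tokens, starts from an empty set of cluster centers instead of a pre-built index, and scans all centers in memory where the disclosure uses a recall engine to fetch candidates. The names `jaccard` and `cluster_texts` are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace tokens."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def cluster_texts(texts, threshold=0.5):
    """Single-pass clustering: join the closest cluster or open a new one."""
    clusters = []  # each cluster: {"center": str, "members": [str, ...]}
    for text in texts:
        best, best_sim = None, -1.0
        for cluster in clusters:  # stands in for the recalled candidate centers
            sim = jaccard(text, cluster["center"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim > threshold:
            # Similarity above the threshold: assign to the closest cluster.
            best["members"].append(text)
        else:
            # Similarity <= threshold: the text becomes a new cluster center.
            clusters.append({"center": text, "members": [text]})
    return clusters


texts = [
    "how do i reset my password",
    "how do i reset my login password",
    "where is my order",
    "where is my order now",
]
clusters = cluster_texts(texts, threshold=0.5)
```

In this sketch the first text of each cluster stays its center. Replacing the inner loop with a top-k query against a search index would bring it closer to the recall-based procedure described above.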

[0106] As shown in Figure 10, stage 2 of network model compression involves constructing dataset labels using word vectors.

[0107] 2.1 After generating a sample dataset 1000 containing multiple clusters such as cluster A, cluster B, and cluster C, the texts from different clusters or within the same cluster are first cross-combined into text pairs. Text pairs are one implementation of sample pairs, where a sample pair contains two sentences.

[0108] 2.2 The words in the text pairs are converted into word vectors using the word2vec / BERT word vector processor, and then the average vector value of each sample pair is calculated.

[0109] 2.3 Calculate the similarity between any two text pairs using a measure such as, but not limited to, cosine similarity or Euclidean distance. Based on the inter-class similarity threshold, classify the highly similar samples as positive samples and the less similar samples as negative samples. Mix the positive and negative samples to obtain a large, high-quality dataset of text pairs.
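
Steps 2.2 and 2.3 can be illustrated with the sketch below, which averages word vectors into a sentence vector and labels a pair by thresholding cosine similarity. The two-dimensional `WORD_VECTORS` table is made up and merely stands in for word2vec or BERT output; the threshold value of 0.8 and the function names are likewise assumptions.

```python
import math

# Hypothetical 2-d word vectors standing in for word2vec/BERT output.
WORD_VECTORS = {
    "reset": [1.0, 0.0],
    "change": [0.9, 0.1],
    "password": [0.8, 0.2],
    "order": [0.0, 1.0],
    "status": [0.1, 0.9],
}


def sentence_vector(sentence):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [WORD_VECTORS[w] for w in sentence.split() if w in WORD_VECTORS]
    if not vectors:
        return [0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def label_pair(a, b, threshold=0.8):
    """Return 1 (positive pair) if the sentences are similar enough, else 0."""
    sim = cosine(sentence_vector(a), sentence_vector(b))
    return 1 if sim >= threshold else 0


positive = label_pair("reset password", "change password")
negative = label_pair("reset password", "order status")
```

Labels produced this way are only as good as the word vectors and the threshold, which is presumably why the teacher model is later used to supply a predicted distribution for the same pairs.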

[0110] Stage 3 of the network model compression trains the teacher model.

[0111] 3.1 The pre-trained language model is fine-tuned on a small labeled dataset to obtain a teacher model with better performance.

[0112] 3.2 Use the teacher model to predict the dataset in stage 2 and obtain the distribution of each sample in the dataset.

[0113] As shown in Figure 11, stage 4 of the network model compression is knowledge distillation 1100.

[0114] The student model employs a lightweight multi-task structure. Its input consists of sentence position information, word information, and sentence A and B information. The features of sentence A and sentence B undergo feature interaction and are fed into an MLP (Multi-Layer Perceptron) for training, which outputs a first loss function (Loss1) and a second loss function (Loss2) for the multi-task process.

[0115] The student model's feature encoder uses a 4-layer Albert_tiny transformer, significantly reducing parameters compared to the teacher model's 12-layer BERT. The top layer utilizes A / B sentence feature interaction, combining text matching characteristics to fully integrate A / B sentence features. Finally, a multilayer perceptron structure is used to obtain different loss functions for different tasks.

[0116] As shown in Figure 12, model distillation obtains the labels and distribution of the dataset through stages 2 (1202) and 3 (1204) to construct two different tasks. The structural characteristics of the student model are used to construct the multi-task loss function, namely the label loss function loss_label and the prediction loss function loss_probs.

[0117] Based on the above calculation results, the total loss function of the student model is determined as loss = w * loss_label + (1-w) * loss_probs. Usually, the weight w is set to a small value, such as 0.1 or 0.15, that is, less weight is given to the label loss and more weight is given to the prediction loss.
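
The weighted total loss can be sketched as follows. The disclosure gives only the combination loss = w * loss_label + (1-w) * loss_probs; the use of cross-entropy for both terms, the two separate output heads, and the names `cross_entropy` and `student_loss` are assumptions made for this illustration.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def student_loss(label, teacher_probs, label_head_probs, probs_head_probs, w=0.1):
    """Combine the label loss and the prediction loss with weight w."""
    # Label task: hard label from dataset construction (assumed cross-entropy).
    one_hot = [1.0 if i == label else 0.0 for i in range(len(label_head_probs))]
    loss_label = cross_entropy(one_hot, label_head_probs)
    # Prediction task: match the teacher's distribution (assumed cross-entropy).
    loss_probs = cross_entropy(teacher_probs, probs_head_probs)
    total = w * loss_label + (1 - w) * loss_probs
    return total, loss_label, loss_probs


total, loss_label, loss_probs = student_loss(
    label=1,
    teacher_probs=[0.2, 0.8],
    label_head_probs=[0.3, 0.7],
    probs_head_probs=[0.25, 0.75],
    w=0.1,
)
```

With w = 0.1, nine tenths of the gradient signal comes from the teacher's predicted distribution, which matches the stated intent of giving less weight to the label loss.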

[0118] Corresponding to the above method embodiments, this disclosure also provides a network model compression apparatus, which can be used to execute the above method embodiments.

[0119] Figure 13 is a block diagram of a network model compression apparatus according to an exemplary embodiment of the present disclosure.

[0120] Referring to Figure 13, the network model compression device 1300 may include:

[0121] The first training module 1302 is configured to train the teacher model using the labeled sample dataset to obtain the prediction results of the sample distribution of the sample dataset.

[0122] The second training module 1304 is configured to perform feature interaction on sample vectors in the sample dataset with the predicted results.

[0123] The third training module 1306 is configured to perform multi-task training on the student model based on samples after feature interaction.

[0124] The fourth training module 1308 is configured to optimize the parameters of the student model based on the loss function obtained from the multi-task training until convergence, so as to obtain the optimized student model.

[0125] In one exemplary embodiment of this disclosure, before training the teacher model using the labeled sample dataset, the first training module 1302 is further configured to:

[0126] Clustering processing is performed on the collected language sample data;

[0127] The clustered language sample data are combined in pairs to generate sample pairs, and the labeled sample dataset is generated based on the vector representation of the sample pairs.

[0128] In one exemplary embodiment of this disclosure, the first training module 1302 is further configured to:

[0129] Determine the language sample data in each cluster set after clustering;

[0130] The language sample data from any two cluster sets are combined pairwise to generate the sample pairs;

[0131] Calculate the mean of the vector values of the two language sample data in any of the sample pairs;

[0132] Compare the magnitude of the mean with that of a preset vector;

[0133] Label each sample pair as a positive sample pair or a negative sample pair based on the result of the comparison;

[0134] The sample dataset is generated based on the positive sample pairs and the negative sample pairs.
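The pairing and labeling steps of paragraphs [0129] to [0134] might be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the text does not say how the "magnitude" of the mean is compared with the preset vector, so the sketch assumes a Euclidean norm comparison and assumes that a mean at least as large as the preset is labeled positive. The names `build_pair_dataset` and `_norm` are hypothetical.

```python
import math
from itertools import combinations, product


def _norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def build_pair_dataset(clusters, preset):
    """Build labeled sample pairs from clustered sample vectors.

    clusters: dict mapping a cluster id to a list of sample vectors.
    preset:   preset vector whose norm serves as the threshold (assumption).
    Returns a list of (vector_a, vector_b, label), label 1 = positive, 0 = negative.
    """
    threshold = _norm(preset)
    dataset = []
    # Take any two distinct cluster sets ...
    for id_a, id_b in combinations(sorted(clusters), 2):
        # ... and combine their samples pairwise.
        for a, b in product(clusters[id_a], clusters[id_b]):
            mean = [(x + y) / 2.0 for x, y in zip(a, b)]
            # Assumed rule: positive when the mean is at least as large as the preset.
            label = 1 if _norm(mean) >= threshold else 0
            dataset.append((a, b, label))
    return dataset
```

For example, with clusters `{0: [[1.0, 0.0]], 1: [[3.0, 0.0], [0.0, 0.2]]}` and preset `[1.0, 1.0]`, the first pair has mean `[2.0, 0.0]` and is labeled positive under the assumed rule, while the second has mean `[0.5, 0.1]` and is labeled negative.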

[0135] In one exemplary embodiment of this disclosure, the second training module 1304 is further configured to:

[0136] Perform a specified operation on any two sample vectors in the sample dataset that have the predicted results;

[0137] The results of the specified operation are concatenated.

[0138] The result of the feature interaction is determined based on the concatenation.

[0139] The specified operation includes at least one of difference operation, product operation and similarity calculation.
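One possible reading of paragraphs [0136] to [0139] is sketched below: the element-wise difference, the element-wise product, and a similarity score of two sample vectors are computed and then concatenated into one feature vector. The choice of cosine similarity, the ordering of the concatenated parts, and the function name `feature_interaction` are assumptions of this sketch; the text only requires at least one of the three operations.

```python
import math


def feature_interaction(u, v):
    """Concatenate difference, product, and cosine similarity of two vectors.

    u, v: sample vectors of equal length, given as lists of floats.
    Returns a list of length 2 * len(u) + 1.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    diff = [a - b for a, b in zip(u, v)]   # difference operation
    prod = [a * b for a, b in zip(u, v)]   # product operation
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Similarity calculation (cosine, assumed); defined as 0.0 for a zero vector.
    cos = sum(prod) / (norm_u * norm_v) if norm_u and norm_v else 0.0
    return diff + prod + [cos]             # concatenation of the results
```

For two vectors of dimension d, the result has dimension 2d + 1, which is the input that a top-layer task of the student model would consume in this reading.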

[0140] In one exemplary embodiment of this disclosure, the third training module 1306 is further configured to:

[0141] The first training task is performed by inputting training information into the bottom layer of the student model based on the attribute information of the sample dataset. The training information includes at least one of sample position information, character information, and sentence information.

[0142] A second training task is performed at the top layer of the student model based on the results of the feature interaction.

[0143] In one exemplary embodiment of this disclosure, the fourth training module 1308 is further configured to:

[0144] Determine the label loss function corresponding to the sample dataset based on the first training task;

[0145] Determine the prediction loss function corresponding to the sample dataset based on the second training task;

[0146] Determine the loss function of the student model based on the preset weight coefficients, the label loss function, and the prediction loss function;

[0147] The parameters of the student model are optimized until the loss function converges to obtain the optimized student model.
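The optimize-until-convergence step can be illustrated with a deliberately simplified toy, assuming a single scalar parameter and two quadratic stand-in losses in place of the real label loss and prediction loss. Everything here (the function name `train_until_converged`, the stand-in losses, plain gradient descent, and the stopping rule based on the change in loss) is an assumption for illustration and is not taken from the disclosure.

```python
def train_until_converged(w=0.1, lr=0.1, tol=1e-10, max_steps=1000):
    """Minimize w * loss_label + (1 - w) * loss_probs over a scalar theta.

    Stand-in losses (assumptions): loss_label = (theta - 1)^2,
    loss_probs = (theta - 3)^2. Stops when the loss changes by less than tol.
    """
    theta = 0.0
    prev = float("inf")
    loss = prev
    for _ in range(max_steps):
        loss_label = (theta - 1.0) ** 2
        loss_probs = (theta - 3.0) ** 2
        loss = w * loss_label + (1.0 - w) * loss_probs
        if abs(prev - loss) < tol:   # convergence check on the total loss
            break
        # Gradient of the weighted total loss with respect to theta.
        grad = 2.0 * w * (theta - 1.0) + 2.0 * (1.0 - w) * (theta - 3.0)
        theta -= lr * grad           # one gradient-descent update
        prev = loss
    return theta, loss
```

With w = 0.1 the minimizer of the combined loss is theta = 0.1 * 1 + 0.9 * 3 = 2.8, much closer to the prediction-loss optimum than to the label-loss optimum, which reflects the smaller weight given to the label loss.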

[0148] In one exemplary embodiment of this disclosure, the feature encoder of the student model includes an albert_tiny layer of a four-layer transformer network.

[0149] Since the functions of the network model compression device 1300 have been described in detail in the corresponding method embodiments, they will not be repeated here.

[0150] It should be noted that although several modules or units for the device used to perform actions have been mentioned in the detailed description above, this division is not mandatory. In fact, according to embodiments of this disclosure, the features and functions of two or more modules or units described above can be embodied in one module or unit. Conversely, the features and functions of one module or unit described above can be further divided and embodied by multiple modules or units.

[0151] In an exemplary embodiment of this disclosure, an electronic device capable of implementing the above-described method is also provided.

[0152] Those skilled in the art will understand that various aspects of the present invention can be implemented as systems, methods, or program products. Therefore, various aspects of the present invention can be specifically implemented in the following forms: entirely hardware implementations, entirely software implementations (including firmware, microcode, etc.), or implementations combining hardware and software aspects, collectively referred to herein as “circuits,” “modules,” or “systems.”

[0153] An electronic device 1400 according to this embodiment of the present invention is described below with reference to Figure 14. The electronic device 1400 shown in Figure 14 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present invention.

[0154] As shown in Figure 14, the electronic device 1400 takes the form of a general-purpose computing device. The components of the electronic device 1400 may include, but are not limited to: at least one processing unit 1410, at least one storage unit 1420, and a bus 1430 connecting different system components (including storage unit 1420 and processing unit 1410).

[0155] The storage unit stores program code that can be executed by the processing unit 1410, causing the processing unit 1410 to perform the steps described in the "Exemplary Methods" section of this specification according to various exemplary embodiments of the present invention. For example, the processing unit 1410 can perform the method shown in the embodiments of this disclosure.

[0156] Storage unit 1420 may include readable media in the form of volatile storage units, such as random access memory (RAM) 14201 and / or cache memory 14202, and may further include read-only memory (ROM) 14203.

[0157] Storage unit 1420 may also include a program / utility 14204 having a set of (at least one) program modules 14205. Such program modules 14205 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data. Each of these examples, or some combination thereof, may include an implementation of a network environment.

[0158] Bus 1430 can represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, and a processing unit or local bus using any of a variety of bus structures.

[0159] Electronic device 1400 can also communicate with one or more external devices 1440 (e.g., keyboard, pointing device, Bluetooth device, etc.), and with one or more devices that enable a user to interact with electronic device 1400, and / or with any device that enables electronic device 1400 to communicate with one or more other computing devices (e.g., router, modem, etc.). This communication can be performed via input / output (I / O) interface 1450. Furthermore, electronic device 1400 can also communicate with one or more networks (e.g., local area network (LAN), wide area network (WAN), and / or public networks, such as the Internet) via network adapter 1460. As shown, network adapter 1460 communicates with other modules of electronic device 1400 via bus 1430. It should be understood that, although not shown in the figures, other hardware and / or software modules can be used in conjunction with electronic device 1400, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

[0160] From the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein can be implemented by software or by combining software with necessary hardware. Therefore, the technical solutions according to the embodiments of this disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, external hard drive, etc.) or on a network, including several instructions to cause a computing device (such as a personal computer, server, terminal device, or network device, etc.) to execute the methods according to the embodiments of this disclosure.

[0161] In exemplary embodiments of this disclosure, a computer-readable storage medium is also provided, on which a program product capable of implementing the methods described above is stored. In some possible embodiments, various aspects of the invention may also be implemented as a program product comprising program code that, when the program product is run on a terminal device, causes the terminal device to perform the steps of the various exemplary embodiments of the invention described in the "Exemplary Methods" section of this specification.

[0162] The program product for implementing the above-described method according to embodiments of the present invention may employ a portable compact disc read-only memory (CD-ROM) and include program code, and may run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited thereto. In this document, the readable storage medium may be any tangible medium containing or storing a program that may be used by or in conjunction with an instruction execution system, apparatus, or device.

[0163] The program product may employ any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of readable storage media (a non-exhaustive list) include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.

[0164] Computer-readable signal media may include data signals propagated in baseband or as part of a carrier wave, carrying readable program code. Such propagated data signals may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A readable signal medium may also be any readable medium other than a readable storage medium, capable of sending, propagating, or transmitting programs for use by or in conjunction with an instruction execution system, apparatus, or device.

[0165] The program code contained on the readable medium may be transmitted using any suitable medium, including but not limited to wireless, wired, optical fiber, RF, etc., or any suitable combination thereof.

[0166] Program code for performing the operations of this invention can be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as C or similar languages. The program code can execute entirely on the user's computing device, partially on the user's device, as a standalone software package, partially on the user's computing device and partially on a remote computing device, or entirely on a remote computing device or server. In cases involving remote computing devices, the remote computing device can be connected to the user's computing device via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via the Internet using an Internet service provider).

[0167] Furthermore, the above figures are merely illustrative of the processes included in the method according to exemplary embodiments of the present invention, and are not intended to be limiting. It is readily understood that the processes shown in the above figures do not indicate or limit the temporal order of these processes. Additionally, it is readily understood that these processes may be executed synchronously or asynchronously, for example, in multiple modules.

[0168] Other embodiments of this disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and concept of this disclosure are indicated by the claims.

Claims

1. A network model compression method, characterized by comprising: training a teacher model using a labeled sample dataset to obtain prediction results of the sample distribution of the sample dataset; performing feature interaction on sample vectors in the sample dataset that have the prediction results, including: performing a specified operation on any two sample vectors in the sample dataset that have the prediction results; concatenating the results of the specified operation; and determining the result of the feature interaction based on the concatenation, wherein the specified operation includes at least one of a difference operation, a product operation, and a similarity calculation; performing multi-task training on a student model based on the samples after the feature interaction, including: performing a first training task by inputting training information into the bottom layer of the student model based on attribute information of the sample dataset, the training information including at least one of sample position information, character information, and sentence information; and performing a second training task at the top layer of the student model based on the result of the feature interaction; and optimizing the parameters of the student model based on a loss function obtained from the multi-task training until convergence to obtain an optimized student model, including: determining a label loss function corresponding to the sample dataset based on the first training task; determining a prediction loss function corresponding to the sample dataset based on the second training task; determining the loss function of the student model based on preset weight coefficients, the label loss function, and the prediction loss function; and optimizing the parameters of the student model until the loss function converges to obtain the optimized student model.

2. The network model compression method as described in claim 1, characterized in that, before training the teacher model using the labeled sample dataset, the method further comprises: performing clustering processing on collected language sample data; and combining the clustered language sample data in pairs to generate sample pairs, and generating the labeled sample dataset based on the vector representation of the sample pairs.

3. The network model compression method as described in claim 2, characterized in that combining the clustered language sample data in pairs to generate sample pairs, and generating the labeled sample dataset based on the vector representation of the sample pairs, includes: determining the language sample data in each cluster set after clustering; combining the language sample data from any two cluster sets pairwise to generate the sample pairs; calculating the mean of the vector values of the two language sample data in any of the sample pairs; comparing the magnitude of the mean with that of a preset vector; labeling the sample pairs as positive sample pairs or negative sample pairs based on the result of the comparison; and generating the sample dataset based on the positive sample pairs and the negative sample pairs.

4. The network model compression method according to any one of claims 1-3, characterized in that the feature encoder of the student model includes an albert_tiny layer of a four-layer transformer network.

5. A network model compression device, characterized by comprising: a first training module configured to train a teacher model using a labeled sample dataset to obtain prediction results of the sample distribution of the sample dataset; a second training module configured to perform feature interaction on sample vectors in the sample dataset that have the prediction results, including: performing a specified operation on any two sample vectors in the sample dataset that have the prediction results; concatenating the results of the specified operation; and determining the result of the feature interaction based on the concatenation, wherein the specified operation includes at least one of a difference operation, a product operation, and a similarity calculation; a third training module configured to perform multi-task training on a student model based on the samples after the feature interaction, including: performing a first training task by inputting training information into the bottom layer of the student model based on attribute information of the sample dataset, the training information including at least one of sample position information, character information, and sentence information; and performing a second training task at the top layer of the student model based on the result of the feature interaction; and a fourth training module configured to optimize the parameters of the student model based on a loss function obtained from the multi-task training until convergence to obtain an optimized student model, including: determining a label loss function corresponding to the sample dataset based on the first training task; determining a prediction loss function corresponding to the sample dataset based on the second training task; determining the loss function of the student model based on preset weight coefficients, the label loss function, and the prediction loss function; and optimizing the parameters of the student model until the loss function converges to obtain the optimized student model.

6. An electronic device, characterized by comprising: a memory; and a processor coupled to the memory, the processor being configured to execute the network model compression method as described in any one of claims 1-4 based on instructions stored in the memory.

7. A computer-readable storage medium having a program stored thereon that, when executed by a processor, implements the network model compression method as described in any one of claims 1-4.