Training methods, inference methods, devices, equipment, and media for deep learning models
By synchronously adjusting parameters in multiple deep learning models based on target loss, the method addresses low training efficiency and enhances predictive capabilities through bidirectional knowledge transfer and accelerated convergence.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Patents
- Current Assignee / Owner
- BEIJING BAIDU NETCOM SCI & TECH CO LTD
- Filing Date
- 2024-12-20
- Publication Date
- 2026-06-10
Abstract
Description
Technical Field
[0001] The present disclosure relates to the technical field of artificial intelligence, particularly to technical fields such as machine learning and deep learning. Specifically, it relates to a method for training a deep learning model, a method for inferring a deep learning model, a device for training a deep learning model, a device for inferring a deep learning model, an electronic device, a computer-readable storage medium, and a computer program product.
Background Art
[0002] Artificial intelligence is a subject that studies how to simulate some human thinking processes and intelligent behaviors (such as learning, inference, thinking, planning, etc.) on a computer. There are both hardware technologies and software technologies. The hardware technologies of artificial intelligence generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing. The software technologies of artificial intelligence mainly include several directions such as natural language processing technology, computer vision technology, voice recognition technology, and machine learning / deep learning, big data processing technology, and knowledge graph technology.
[0003] The methods described in this section are not necessarily the methods that were previously assumed or adopted. Unless otherwise specified, none of the methods described in this section should be considered as prior art just because they are included in this section. Similarly, unless otherwise specified, none of the problems mentioned in this section should be considered as recognized in any prior art.
Summary of the Invention
[0004] The present disclosure provides a method for training a deep learning model, a method for inferring a deep learning model, a device for training a deep learning model, a device for inferring a deep learning model, an electronic device, a computer-readable storage medium, and a computer program product.
[0005] According to one aspect of this disclosure, a computer-implemented method for training deep learning models is provided. A first deep learning model includes a plurality of first parameters, and a second deep learning model includes a plurality of second parameters. The plurality of second parameters are initialized to the parameter values of a plurality of target parameters, among the plurality of first parameters, that correspond to the plurality of second parameters, and the number of second parameters is less than the number of first parameters. The training method comprises: determining a target loss of the first and second deep learning models; and adjusting the parameter values of the plurality of first parameters and the plurality of second parameters based on the target loss to obtain the trained first and second deep learning models. The adjusting includes: in response to determining that the target loss indicates that the parameter values of at least some of the target parameters included in the first deep learning model need to be adjusted, synchronously adjusting the parameter values of the second parameters in the second deep learning model that correspond to those target parameters; and in response to determining that the target loss indicates that the parameter values of at least some of the second parameters included in the second deep learning model need to be adjusted, synchronously adjusting the parameter values of the target parameters in the first deep learning model that correspond to those second parameters.
[0006] According to another aspect of this disclosure, a computer-implemented inference method for deep learning models is provided, wherein a first deep learning model and a second deep learning model are obtained by training using the above-described training method. The inference method includes: sequentially generating a plurality of prediction tokens using the second deep learning model; generating, using the first deep learning model and based on the plurality of prediction tokens, a confidence level for each of the plurality of prediction tokens; and obtaining an inference result by checking the plurality of prediction tokens in the order in which they were generated. The checking includes: in response to determining that the confidence level of the currently checked prediction token is lower than a preset threshold, generating a correction token at the position of that prediction token using the first deep learning model and replacing the prediction token with the correction token as a checked prediction token; in response to determining that a preset token generation condition is met, sequentially generating prediction tokens again from the position following the correction token using the second deep learning model based on the correction token, generating confidence levels for the newly generated prediction tokens using the first deep learning model, and checking them; and obtaining the inference result based on the checked prediction tokens.
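The draft-and-check loop of this inference method can be illustrated with a minimal Python sketch. It is not the disclosed implementation. The function name `checked_decode`, the three callables standing in for the two models, the block size, the threshold value, and the reading of the "preset token generation condition" as "no end token yet and the length limit not reached" are all assumptions made for illustration. Confidence is queried per token against the already checked prefix, which matches a single batched pass over the draft up to the first rejected token.

```python
from typing import Callable, List

def checked_decode(
    draft_next: Callable[[List[str]], str],         # second (smaller) model: next token
    confidence: Callable[[List[str], str], float],  # first (larger) model: confidence in a token
    correct: Callable[[List[str]], str],            # first (larger) model: correction token
    prompt: List[str],
    block_size: int = 4,        # assumed number of tokens drafted per round
    threshold: float = 0.5,     # assumed preset threshold
    max_new_tokens: int = 8,
    eos: str = "<eos>",
) -> List[str]:
    checked: List[str] = []
    # Assumed generation condition: no end token yet and length limit not reached.
    while len(checked) < max_new_tokens and (not checked or checked[-1] != eos):
        # The second model drafts a block of prediction tokens one by one.
        draft: List[str] = []
        for _ in range(block_size):
            tok = draft_next(prompt + checked + draft)
            draft.append(tok)
            if tok == eos:
                break
        # The first model checks the draft in generation order.
        for tok in draft:
            prefix = prompt + checked
            if confidence(prefix, tok) < threshold:
                # Replace the low-confidence token; later draft tokens are dropped
                # and drafting restarts after the correction token.
                checked.append(correct(prefix))
                break
            checked.append(tok)
            if tok == eos or len(checked) >= max_new_tokens:
                break
    return checked

# Toy stand-ins: the "large" model knows the sequence, the "small" one slips once.
TARGET = ["a", "b", "c", "d", "<eos>"]

def large_next(prefix: List[str]) -> str:
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else "<eos>"

def large_confidence(prefix: List[str], tok: str) -> float:
    return 1.0 if large_next(prefix) == tok else 0.0

def small_next(prefix: List[str]) -> str:
    return "x" if len(prefix) == 2 else large_next(prefix)

result = checked_decode(small_next, large_confidence, large_next, prompt=[])
```

In this toy run the third drafted token is rejected, replaced by the larger model's token, and drafting resumes from the next position.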
[0007] According to another aspect of the present disclosure, a training device for deep learning models is provided. A first deep learning model includes a plurality of first parameters, and a second deep learning model includes a plurality of second parameters, wherein the plurality of second parameters are initialized to the parameter values of a plurality of target parameters among the plurality of first parameters, and the number of second parameters is less than the number of first parameters. The training device includes: a determination unit configured to determine a target loss of the first deep learning model and the second deep learning model; and a parameter adjustment unit configured to adjust the parameter values of the plurality of first parameters and the plurality of second parameters based on the target loss to obtain the trained first and second deep learning models. The parameter adjustment unit comprises: a first parameter adjustment subunit configured to, in response to determining that the target loss indicates that the parameter values of at least some of the plurality of target parameters included in the first deep learning model need to be adjusted, synchronously adjust the parameter values of the second parameters in the second deep learning model that correspond to those target parameters; and a second parameter adjustment subunit configured to, in response to determining that the target loss indicates that the parameter values of at least some of the plurality of second parameters included in the second deep learning model need to be adjusted, synchronously adjust the parameter values of the target parameters in the first deep learning model that correspond to those second parameters.
[0008] According to another aspect of this disclosure, an inference device for deep learning models is provided, wherein a first deep learning model and a second deep learning model are obtained by training using the above-mentioned training device. The inference device comprises: a first generation unit configured to sequentially generate a plurality of prediction tokens using the second deep learning model; a second generation unit configured to generate, using the first deep learning model and based on the plurality of prediction tokens, a confidence level for each of the plurality of prediction tokens; and a check unit configured to check the plurality of prediction tokens in the order in which they were generated and obtain an inference result. The check unit includes: a replacement subunit configured to, in response to determining that the confidence level of the currently checked prediction token is lower than a preset threshold, generate a correction token at the position of that prediction token using the first deep learning model and replace the prediction token with the correction token as a checked prediction token; a generation subunit configured to, in response to determining that a preset token generation condition is met, sequentially generate prediction tokens again from the position following the correction token using the second deep learning model based on the correction token, generate confidence levels for the newly generated prediction tokens using the first deep learning model, and check them; and an inference subunit configured to obtain the inference result based on the checked prediction tokens.
[0009] According to another aspect of the present disclosure, an electronic device is provided which includes at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the above method.
[0010] According to another aspect of the present disclosure, a non-transitory computer-readable storage medium is provided in which computer instructions are stored, the computer instructions being used to cause a computer to perform the above method.
[0011] According to another aspect of this disclosure, a computer program product including a computer program is provided, and the above method is realized when the computer program is executed by a processor.
[0012] According to one or more embodiments of this disclosure, the disclosure first determines the target loss for a first deep learning model and a second deep learning model built on some parameters of the first deep learning model. Based on the target loss, it determines the parameters of each of the two models that need to be adjusted. Then, based on the correspondence between the parameters of the two models, it synchronously adjusts the parameter values of the corresponding parameters in the other model, thereby enabling the creation of multiple models with different levels of performance in a single training session, satisfying the demands of various scenarios, various performance levels, and various objective effects. Furthermore, by training the first and second deep learning models as a whole, knowledge and information can be transmitted bidirectionally during the training process, accelerating model convergence and enabling both models to possess better predictive capabilities.
[0013] It should be understood that the content described in this section is not intended to identify the essential or important features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure are readily apparent from the following specification.
Brief Description of the Drawings
[0014] The drawings illustrate the embodiments and constitute part of the specification, and are used to illustrate exemplary embodiments of the embodiments in conjunction with the textual description of the specification. The illustrated embodiments are for illustrative purposes only and do not limit the scope of the claims. In all drawings, the same reference numerals refer to elements that are similar but not necessarily identical.
[Figure 1] This is a schematic diagram of an exemplary system capable of carrying out the various methods described herein according to the embodiments of this disclosure.
[Figure 2] This is a flowchart of the training method for a deep learning model according to the embodiments of this disclosure.
[Figure 3] This is a schematic diagram of the first deep learning model and the second deep learning model according to the embodiments of this disclosure.
[Figure 4] This is a flowchart of the process for determining the target losses of the first deep learning model and the second deep learning model according to the embodiments of this disclosure.
[Figure 5] This is a flowchart of the inference method for a deep learning model according to the embodiments of this disclosure.
[Figure 6] This is a structural block diagram of a training device for a deep learning model according to an embodiment of the present disclosure.
[Figure 7] This is a structural block diagram of a deep learning model inference device according to an embodiment of the present disclosure.
[Figure 8] This is an exemplary structural block diagram of an electronic device that can be used to implement embodiments of the present disclosure.
Modes for Carrying Out the Invention
[0015] The following description illustrates exemplary embodiments of the disclosure, accompanied by drawings, and includes various details of the embodiments for the sake of ease of understanding; however, these should be considered merely illustrative. Therefore, as those skilled in the art should recognize, various changes and modifications can be made to the embodiments described herein without departing from the scope of the disclosure. Similarly, for clarity and brevity, descriptions of known functions and structures are omitted in the following description.
[0016] In this disclosure, unless otherwise specified, terms such as “first,” “second,” etc., used to describe various elements are not intended to limit the spatial, timing, or importance relationships of these elements. Such terms are used solely to distinguish one element from another. In some examples, the first and second elements may refer to the same instance of that element, or, depending on the context, to different instances.
[0017] The terms used in describing the various examples in this disclosure are for the purpose of describing particular examples only and are not intended to be limiting. Unless the context clearly indicates otherwise, an element whose number is not specifically limited may be one or more. The term "and / or" as used in this disclosure covers any one of the listed items and all possible combinations thereof.
[0018] In related technologies, different deep learning models are trained independently, resulting in low training efficiency.
[0019] To solve the above problems, this disclosure first determines the target loss for a first deep learning model and a second deep learning model built on some parameters of the first deep learning model. Based on the target loss, it determines the parameters of each of the two models that need to be adjusted. Then, based on the correspondence between the parameters of the two models, it synchronously adjusts the parameter values of the corresponding parameters in the other model, thereby enabling the creation of multiple models with different levels of performance in a single training session, and meeting the demands for various scenarios, various performance levels, and various objective effects. Furthermore, by training the first and second deep learning models as a whole, knowledge and information can be transmitted bidirectionally during the training process, accelerating model convergence and enabling both models to have better predictive capabilities.
[0020] The embodiments of this disclosure will be described in detail below, with reference to the drawings.
[0021] FIG. 1 shows a schematic diagram of an exemplary system 100 in which the various methods and apparatuses described herein can be implemented, according to an embodiment of the present disclosure. Referring to FIG. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. The client devices 101, 102, 103, 104, 105, 106 may be configured to execute one or more applications.
[0022] In an embodiment of the present disclosure, the server 120 can operate to execute one or more services or software applications of the present disclosure.
[0023] In some embodiments, the server 120 can further provide other services or software applications, which may include non-virtual environments and virtual environments. In some embodiments, these services can be provided as web-based services or cloud services, for example, provided to users of the client devices 101, 102, 103, 104, 105, and / or 106 in a software as a service (SaaS) model.
[0024] In the configuration shown in Figure 1, the server 120 may include one or more components that implement the functions performed by the server 120. These components may include software components, hardware components, or a combination thereof that can be executed by one or more processors. Users operating client devices 101, 102, 103, 104, 105 and / or 106 can in turn use one or more client applications to interact with the server 120 and thereby utilize the services provided by these components. It should be understood that various different system configurations are possible and may differ from system 100. Therefore, Figure 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
[0025] A user can perform human-machine interactions using client devices 101, 102, 103, 104, 105, and / or 106. A client device can provide a user of the client device with an interface that allows interaction with the client device. The client device may further output information to the user through the interface. Although only six client devices are shown in Figure 1, as will be understood by those skilled in the art, this disclosure can support any number of client devices.
[0026] Client devices 101, 102, 103, 104, 105 and / or 106 may include various types of computer equipment, such as portable handheld devices, general-purpose computers (e.g., personal computers and laptop computers), workstation computers, wearable devices, smartscreen devices, self-service terminal equipment, service robots, game systems, thin clients, various message sending and receiving devices, sensors, or other sensing devices. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., GOOGLE Chrome OS), or may include various mobile operating systems, such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular phones, smartphones, tablet computers, and personal digital assistants (PDAs). Wearable devices may include head-mounted displays (e.g., smart glasses) and other devices. Game systems may include various handheld game devices, internet-enabled game devices, and so on. Client devices can run various applications related to the Internet, communication applications (such as email applications), and short message service (SMS) applications, and can use various communication protocols.
[0027] Network 110 may be any type of network known to those skilled in the art, and it may use any one of several available protocols (including but not limited to TCP / IP, SNA, IPX, etc.) to support data communication. For example, one or more networks 110 may be a local area network (LAN), an Ethernet-based network, a token ring, a wide area network (WAN), the internet, a virtual network, a virtual private network (VPN), an intranet, an extranet, a blockchain network, a public switched telephone network (PSTN), an infrared network, a wireless network (e.g., Bluetooth, Wi-Fi), and / or any combination of these and / or other networks.
[0028] Server 120 may include one or more general-purpose computers, dedicated server computers (e.g., PC (personal computer) servers, UNIX servers, midrange servers), blade servers, mainframe computers, server clusters, or any other suitable configuration and / or combination. Server 120 may include one or more virtual machines running a virtual operating system, or other computing architectures related to virtualization (e.g., one or more flexible pools of logical storage devices that can be virtualized to maintain the server's virtual storage devices). In various embodiments, Server 120 may run one or more services or software applications that provide the functions described below.
[0029] The computing units in server 120 may run one or more operating systems, including any of the above-mentioned operating systems and any commercially available server operating systems. Server 120 may also run any one of a variety of additional server applications and / or middle-tier applications, including HTTP servers, FTP servers, CGI servers, Java servers, database servers, etc.
[0030] In some embodiments, the server 120 may include one or more applications for analyzing and integrating data feeds and / or event updates received from users of client devices 101, 102, 103, 104, 105 and / or 106. The server 120 may further include one or more applications for displaying data feeds and / or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105 and / or 106.
[0031] In some embodiments, server 120 may be a server in a distributed system or a server combined with a blockchain. Server 120 may be a cloud server, or a smart cloud computing server or smart cloud host equipped with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the shortcomings of conventional physical hosts and virtual private server (VPS) services, namely that they are difficult to manage and have low scalability.
[0032] The system 100 may further include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information on audio files and video files. The databases 130 may be located in various locations. For example, a database used by server 120 may be located locally at server 120, or it may be located away from server 120 and communicate with server 120 via a network or a dedicated connection. The databases 130 may be of various types. In some embodiments, a database used by server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data from the database in response to commands.
[0033] In some embodiments, one or more of the databases 130 may be used by an application to store application data. The databases used by the application may be of different types, such as a key-value repository, an object repository, or a general-purpose repository supported by a file system.
[0034] The system 100 in Figure 1 can be configured and operated in various ways so as to allow the application of the various methods and apparatus described in this disclosure.
[0035] According to one aspect of this disclosure, a computer-implemented method for training deep learning models is provided. A first deep learning model includes a plurality of first parameters, and a second deep learning model includes a plurality of second parameters, the plurality of second parameters being initialized to the parameter values of a plurality of target parameters, among the plurality of first parameters, that correspond to the plurality of second parameters. The number of second parameters is less than the number of first parameters.
[0036] Figure 2 shows a flowchart of a training method 200 for a deep learning model according to an embodiment of the present disclosure. Method 200 includes step S201 of determining the target loss of the first deep learning model and the second deep learning model, and step S202 of obtaining the trained first and second deep learning models by adjusting the parameter values of the plurality of first parameters and the plurality of second parameters based on the target loss. Step S202 includes step S2021 and step S2022. In step S2021, in response to determining that the target loss indicates that the parameter values of at least some of the plurality of target parameters in the first deep learning model need to be adjusted, the parameter values of the second parameters in the second deep learning model that correspond to those target parameters are adjusted synchronously. In step S2022, in response to determining that the target loss indicates that the parameter values of at least some of the plurality of second parameters in the second deep learning model need to be adjusted, the parameter values of the target parameters in the first deep learning model that correspond to those second parameters are adjusted synchronously.
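Steps S201 and S202 can be illustrated with a toy Python sketch. It assumes, purely for illustration, that both models are linear, that the target loss is the plain sum of the two models' squared errors, and that updates use ordinary gradient descent; the text above fixes none of these choices. The names `forward`, `train_step`, and `target_idx` are hypothetical.

```python
from typing import List

def forward(params: List[float], x: List[float]) -> float:
    """Toy linear model: dot product of parameters and inputs."""
    return sum(p * xi for p, xi in zip(params, x))

def train_step(first: List[float], second: List[float], target_idx: List[int],
               x: List[float], y: float, lr: float = 0.05) -> float:
    """One joint update; second[j] is linked to first[target_idx[j]]."""
    # The toy second model only sees the inputs matching its target parameters.
    x_small = [x[i] for i in target_idx]

    # S201: target loss over both models (assumed here: sum of squared errors).
    err_first = forward(first, x) - y
    err_second = forward(second, x_small) - y
    target_loss = err_first ** 2 + err_second ** 2

    grad_first = [2 * err_first * xi for xi in x]
    grad_second = [2 * err_second * xi for xi in x_small]

    # S202: first parameters that are not target parameters get their own update.
    linked = set(target_idx)
    for i in range(len(first)):
        if i not in linked:
            first[i] -= lr * grad_first[i]

    # S2021 / S2022: each linked pair receives one combined adjustment, applied
    # to both copies, so a change requested by either model reaches the other.
    for j, i in enumerate(target_idx):
        delta = lr * (grad_first[i] + grad_second[j])
        first[i] -= delta
        second[j] -= delta
    return target_loss

first = [0.1, 0.2, 0.3, 0.4]
target_idx = [0, 1]                        # hypothetical choice of target parameters
second = [first[i] for i in target_idx]    # initialized from the target parameters
losses = [train_step(first, second, target_idx, [1.0, 2.0, 0.5, -1.0], 3.0)
          for _ in range(50)]
```

Because every linked pair starts equal and always receives the same adjustment, the second model's parameters stay identical to the corresponding target parameters throughout training.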
[0037] This process first determines the target loss for the first deep learning model and the second deep learning model built based on some parameters of the first deep learning model. Based on the target loss, it determines which parameters need to be adjusted for each of the two models. Then, based on the correspondence between the parameters of the two models, the parameter values of the corresponding parameters in the other model are adjusted synchronously. This enables the creation of multiple models with different levels of performance in a single training session, meeting the demands for various scenarios, performance levels, and objective effects. Furthermore, training the first and second deep learning models as a whole allows for bidirectional transfer of knowledge and information during the training process, accelerating model convergence and enabling both models to possess better predictive capabilities.
[0038] In some embodiments, the first and second deep learning models may be based on various types of deep learning models, such as convolutional neural networks, recurrent neural networks, multilayer perceptrons, fully connected networks, Transformer structures or Transformer-like structures (extensions of each type of Transformer structure), or other types of deep learning models. The first and second deep learning models may be end-to-end complete models or parts of a model for a specific task, but are not limited thereto. It should be noted that the first and second deep learning models belong to the same type of model.
[0039] In some embodiments, a first model architecture and a second model architecture may be used to describe the model architecture and model configuration information of the first and second deep learning models, which may include, for example, the number of layers the model contains and the number of intermediate dimensions of the model layers. The scale of the second model architecture is smaller than that of the first model architecture, i.e., the model size / parameter count of the second deep learning model is smaller than that of the first deep learning model.
[0040] In some embodiments, some of the plurality of first parameters (i.e., the plurality of target parameters) may be selected based on the number of second parameters and used to build the second deep learning model. Alternatively, as many target parameters as there are second parameters may be selected from the plurality of first parameters, and the plurality of second parameters may be initialized to the parameter values of those target parameters. In other words, the parameter value of each second parameter in the initialized second deep learning model is the same as the parameter value of the target parameter corresponding to that second parameter in the first deep learning model. Subsequently, while the first and second deep learning models are trained, the parameter values of the target parameters in the first deep learning model and the parameter values of the second parameters in the second deep learning model are always kept identical, as described below.
[0041] In some embodiments, the target parameters may be obtained by randomly selecting from a plurality of first parameters based on the number of second parameters, or by employing several heuristic policies, such as a combination of uniform distributions, or by employing feature selection / feature engineering methods to select from a plurality of first parameters and obtain a plurality of target parameters that can achieve better training effects.
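As a rough sketch of the simplest option mentioned above, random selection followed by initialization, the following Python uses a flat list of parameter values. The function names, the fixed seed, and the choice to keep the selected indices in sorted order are assumptions for illustration, not details taken from the disclosure.

```python
import random
from typing import List

def select_target_indices(num_first: int, num_second: int, seed: int = 0) -> List[int]:
    """Randomly pick which first parameters act as target parameters."""
    if not 0 < num_second < num_first:
        raise ValueError("the second model must be non-empty and smaller than the first")
    rng = random.Random(seed)
    # Sampling without replacement; sorting keeps the original relative order.
    return sorted(rng.sample(range(num_first), num_second))

def init_second_params(first_params: List[float], target_idx: List[int]) -> List[float]:
    """Initialize each second parameter to its target parameter's value."""
    return [first_params[i] for i in target_idx]

first_params = [0.5, -1.2, 0.3, 2.0, -0.7, 1.1, 0.0, 0.9]
target_idx = select_target_indices(len(first_params), num_second=4)
second_params = init_second_params(first_params, target_idx)
```

The index list doubles as the correspondence between the two models: position `j` of the second model is tied to position `target_idx[j]` of the first.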
[0042] According to some embodiments, the first deep learning model may include a first number of first layers, and each first layer may include a first number of first intermediate dimensions. The second deep learning model may include a second number of second layers, and each second layer may include a second number of second intermediate dimensions. The second number of layers in the second deep learning model may be less than the first number of layers in the first deep learning model, and the second number of intermediate dimensions of each second layer may be less than the first number of intermediate dimensions of each first layer.
[0043] The plurality of target parameters may include first parameters in at least one target intermediate dimension of at least one target layer. The at least one target layer may be obtained by selecting from the first number of first layers based on the second number of layers, and the at least one target intermediate dimension may be obtained by selecting from the first number of first intermediate dimensions based on the second number of intermediate dimensions.
[0044] This method allows for the selection of multiple first parameters from the first deep learning model to obtain multiple target parameters that satisfy the parameter count of the second deep learning model, while maintaining the overall model architecture of the first deep learning model as much as possible, avoiding excessive concentration of selected parameters, and improving the predictive power of the second deep learning model.
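One way to pick target layers and target intermediate dimensions without concentrating them in one region is to spread the picks evenly. The sketch below assumes that heuristic; the disclosure does not prescribe a particular rule, and it also allows a different set of intermediate dimensions per layer, which this simplified version does not do. Indices are zero-based and all names are hypothetical.

```python
from typing import Dict, List

def evenly_spaced(total: int, keep: int) -> List[int]:
    """Pick `keep` distinct zero-based indices spread across range(total)."""
    if not 0 < keep <= total:
        raise ValueError("keep must be between 1 and total")
    return [(k * total) // keep for k in range(keep)]

def select_targets(first_layers: int, first_inter: int,
                   second_layers: int, second_inter: int) -> Dict[int, List[int]]:
    """Map each target layer to its target intermediate dimensions."""
    layers = evenly_spaced(first_layers, second_layers)
    return {layer: evenly_spaced(first_inter, second_inter) for layer in layers}

# Sizes borrowed from a 3-layer / 8-dimension model shrunk to 2 layers / 4 dimensions.
targets = select_targets(first_layers=3, first_inter=8, second_layers=2, second_inter=4)
```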
[0045] According to some embodiments, the first deep learning model and the second deep learning model may be based on a Transformer structure or a Transformer-like structure. The first deep learning model may include a first number of first hidden dimensions, and each of the first number of first layers may include a first number of first attention heads. The second deep learning model may include a second number of second hidden dimensions, and each of the second number of second layers may include a second number of second attention heads.
[0046] The plurality of target parameters may include first parameters corresponding to a plurality of target hidden dimensions, and may also include first parameters in at least one target attention head. The plurality of target hidden dimensions may be obtained by selecting from the first number of first hidden dimensions based on the second number of hidden dimensions, and the at least one target attention head may be obtained by selecting from the first number of first attention heads based on the second number of attention heads.
[0047] This method allows for the selection of multiple first parameters from the first deep learning model to obtain multiple target parameters that satisfy the parameter count of the second deep learning model, while maintaining the overall model architecture of the first deep learning model based on a Transformer structure as much as possible, avoiding excessive concentration of selected parameters, and improving the predictive power of the second deep learning model.
[0048] In some embodiments, the first layer may be, for example, a Transformer layer, and the first intermediate dimension may be, for example, the intermediate dimension in a feedforward neural network (FFN) within the Transformer layer.
[0049] Figure 3 shows schematic diagrams of a first deep learning model and a second deep learning model according to embodiments of the present disclosure. As shown in Figure 3, the first deep learning model 300 includes an embedding input 302, has a hidden dimension of 6, and includes three Transformer layers, each including a set of four attention heads 304 and one feedforward neural network 306 with an intermediate dimension of 8. The second deep learning model 310 includes an embedding input 312, has a hidden dimension of 3, and includes two Transformer layers, each including a set of two attention heads 314 and one feedforward neural network 316 with an intermediate dimension of 4.
[0050] As can be seen in the figure, the multiple target parameters selected from the multiple first parameters include the first parameters in the four intermediate dimensions (1st, 2nd, 4th, and 5th) of the first Transformer layer, and the first parameters in the four intermediate dimensions (3rd, 5th, 7th, and 8th) of the third Transformer layer. Furthermore, the multiple target parameters include the first parameters corresponding to the three hidden dimensions (1st, 4th, and 6th), the first parameters in the two attention heads (2nd and 4th) of the first Transformer layer, and the first parameters in the two attention heads (1st and 2nd) of the third Transformer layer. Based on these target parameters, a second deep learning model can be constructed.
[0051] In some embodiments, the second model architecture (i.e., the number of layers, the number of intermediate dimensions, etc., included in the second deep learning model) may be determined based on the first model architecture and a predetermined proportion. In one exemplary embodiment, the predetermined proportion may be 50%, in which case the second number of layers in the second deep learning model may be half the number of layers in the first deep learning model, and the second number of intermediate dimensions in the second deep learning model may be half the number of intermediate dimensions in the first deep learning model. It should be understood that the second model architecture may also be determined in other ways, for example based on a set of predetermined hyperparameter values, and is not limited to the above.
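As a minimal sketch of the proportion-based derivation, the following Python example scales each size of the first model architecture by a predetermined proportion. The `Architecture` class, the `scale_architecture` function, the application of the proportion to hidden dimensions and attention heads, and the choice to round up are all assumptions made for illustration; the text does not specify how fractional sizes are handled.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Architecture:
    """Hypothetical container for the sizes that define a model architecture."""
    num_layers: int
    hidden_dims: int
    attention_heads: int
    intermediate_dims: int


def scale_architecture(first: Architecture, proportion: float) -> Architecture:
    """Derive the second architecture by scaling every size of the first one.

    Rounding up (and keeping at least 1) is an assumption of this sketch.
    """
    def scale(n: int) -> int:
        return max(1, math.ceil(n * proportion))

    return Architecture(
        num_layers=scale(first.num_layers),
        hidden_dims=scale(first.hidden_dims),
        attention_heads=scale(first.attention_heads),
        intermediate_dims=scale(first.intermediate_dims),
    )


# Sizes of the first deep learning model 300 in Figure 3.
first = Architecture(num_layers=3, hidden_dims=6, attention_heads=4, intermediate_dims=8)
second = scale_architecture(first, 0.5)
print(second)
```

With a proportion of 50% and rounding up, the sizes of model 300 in Figure 3 map to two layers, three hidden dimensions, two attention heads, and four intermediate dimensions, which are the sizes shown for model 310.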
[0052] According to some embodiments, at least one target layer, at least one target intermediate dimension, multiple target hidden dimensions and / or at least one target attention head may be obtained by random selection.
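The random selection can be sketched as follows, under the assumption that selection means drawing distinct indices without replacement. The function name `select_targets`, the 0-based indexing, and the fixed seed are illustrative choices, not part of the disclosure. The sizes follow the Figure 3 example.

```python
import random


def select_targets(first_count: int, second_count: int, rng: random.Random) -> list[int]:
    """Randomly pick `second_count` distinct indices out of `first_count`, in ascending order."""
    if not 0 < second_count <= first_count:
        raise ValueError("second_count must be between 1 and first_count")
    return sorted(rng.sample(range(first_count), second_count))


rng = random.Random(0)  # fixed seed so the selection is reproducible

# Two target layers out of three, three target hidden dimensions out of six.
target_layers = select_targets(3, 2, rng)
target_hidden_dims = select_targets(6, 3, rng)

# Attention heads and intermediate dimensions are chosen separately for each
# target layer, as in Figure 3 where the chosen indices differ between layers.
target_heads = {layer: select_targets(4, 2, rng) for layer in target_layers}
target_intermediate = {layer: select_targets(8, 4, rng) for layer in target_layers}
```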
[0053] According to some embodiments, both the first and second deep learning models may be configured to perform at least one of the following tasks: text processing, image processing, and voice processing. The first and second deep learning models may also be configured to perform the same task. Although the two models perform the same task, they differ in their parameter sizes, resulting in different inference speeds and accuracies. Therefore, they can meet the demands of various scenarios, performance levels, and objectives.
[0054] According to some embodiments, the first deep learning model may be a pre-trained large model. While pre-trained large models perform well when handling various text processing, image processing, and voice processing tasks, their inference costs are high and deployment is difficult. When training a separate "small" model independently, the knowledge acquired by the large model cannot be utilized, and the process of training the "small" model does not further improve the capabilities of the large model. Therefore, by obtaining multiple target parameters in a pre-trained large model using a method of selection, and then constructing a "small" model based on these multiple target parameters, knowledge and information can be transmitted bidirectionally by training the large model and the "small" model together, accelerating model convergence and enabling both models to have better predictive capabilities.
[0055] In some embodiments, multiple target parameters can be selected from the multiple first parameters by initializing one or more fixed mask matrices. Each mask matrix contains only 0s and 1s, where 0 indicates a parameter that is trained only in the first deep learning model, and 1 indicates a parameter that is trained in both the first and second deep learning models. For deep learning models based on Transformer structures, the mask matrices may include a mask matrix for hidden dimensions, a mask matrix for the feedforward neural network, a mask matrix for the attention heads, and a mask matrix for the Transformer layers. It should be understood that, for different types of deep learning models, the mask matrices may correspond to different model structures and are not limited to those listed above.
[0056] In some embodiments, the target loss for the training phase of the first and second deep learning models may be determined in step S201. It should be understood that various samples, loss functions, and training policies may be used to calculate the target loss, without limitation.
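A minimal sketch of such masks is given below. It assumes that a per-dimension mask is a 0/1 vector and that the mask for a weight matrix is the outer product of the masks of its two axes; this composition and the names `index_mask` and `weight_mask` are assumptions for illustration. The indices are the 0-based equivalents of the hidden dimensions (1st, 4th, 6th) and the intermediate dimensions of the first Transformer layer (1st, 2nd, 4th, 5th) in Figure 3.

```python
def index_mask(size: int, selected: list[int]) -> list[int]:
    """Return a 0/1 vector of length `size` with 1 at every selected index."""
    chosen = set(selected)
    return [1 if i in chosen else 0 for i in range(size)]


def weight_mask(row_mask: list[int], col_mask: list[int]) -> list[list[int]]:
    """Outer product of two 0/1 vectors; 1 marks a weight trained in both models."""
    return [[r * c for c in col_mask] for r in row_mask]


hidden_mask = index_mask(6, [0, 3, 5])       # mask for hidden dimensions
ffn_mask = index_mask(8, [0, 1, 3, 4])       # mask for FFN intermediate dimensions
ffn_in_mask = weight_mask(hidden_mask, ffn_mask)  # 6 x 8 mask for an FFN input projection

shared = sum(map(sum, ffn_in_mask))          # number of weights shared with the second model
print(shared)
```

In this example 12 of the 48 weights are marked 1, which corresponds to a 3 × 4 projection in the second model.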
[0057] In some embodiments, Figure 4 shows a flowchart of a process 400 for determining the target losses of a first deep learning model and a second deep learning model according to an embodiment of the present disclosure. Process 400 may be used to implement step S201 in method 200. Process 400 may include step S401 for determining a first loss of the first deep learning model for a first sample, step S402 for determining a second loss of the second deep learning model for a second sample, and step S403 for determining a target loss based on the first and second losses.
[0058] This allows us to determine the first and second losses for the first and second deep learning models for the first and second samples, respectively, and then obtain a target loss based on these two values. This enables simultaneous adjustment of the parameters of the first and second deep learning models in a single training session, achieving synchronous training of the two models.
[0059] In some embodiments, a batch training method may be employed to train the first and second deep learning models, where both the first and second samples are batch data. In one exemplary embodiment, the first deep learning model may be trained using one batch of data (the first sample), and then the second deep learning model may be trained using another batch of data (the second sample), and the overall target loss may be obtained based on the two losses, thereby achieving joint training of the first and second deep learning models.
[0060] According to some embodiments, the first sample and the second sample may be the same. By using identical training samples for the first and second deep learning models, the second deep learning model can learn the attention matrix and probability distribution of the top layer of the first deep learning model, thereby better fitting the training results of the second and first deep learning models, accelerating their joint training, and improving the predictive ability of the two models after training.
[0061] In some embodiments, in steps S401 and S402, an appropriate loss function may be selected as needed to obtain the first loss and the second loss. In step S403, the first loss and the second loss may be added together to obtain the target loss. The target loss may also be obtained from the first loss and the second loss by other methods, without limitation.
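Steps S401 to S403 can be sketched as follows. Cross-entropy is used here only as one example of a loss function that might be selected; the function names and the probability values are hypothetical.

```python
import math


def cross_entropy(probabilities: list[list[float]], labels: list[int]) -> float:
    """Mean negative log-likelihood of the correct labels over a batch."""
    return -sum(math.log(p[y]) for p, y in zip(probabilities, labels)) / len(labels)


def target_loss(first_loss: float, second_loss: float) -> float:
    """Step S403: combine the two losses by simple addition."""
    return first_loss + second_loss


labels = [0, 1]
first_probs = [[0.9, 0.1], [0.2, 0.8]]   # hypothetical outputs of the first model
second_probs = [[0.7, 0.3], [0.4, 0.6]]  # hypothetical outputs of the second model

first_loss = cross_entropy(first_probs, labels)    # step S401
second_loss = cross_entropy(second_probs, labels)  # step S402
loss = target_loss(first_loss, second_loss)        # step S403
```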
[0062] In some embodiments, in step S202, the parameters that need to be adjusted among a plurality of first parameters and a plurality of second parameters may first be determined based on the target loss by backpropagation, and the corresponding adjustment amounts may be determined. Furthermore, in step S2021, in response to the determination that the target loss indicates that the parameter values of at least some of the plurality of target parameters included in the first deep learning model need to be adjusted, the parameter values of this portion of the target parameters in the first deep learning model are adjusted based on the adjustment amounts determined for this portion of the target parameters, and the parameter values of the second parameters corresponding to this portion of the target parameters in the second deep learning model are adjusted synchronously based on similar adjustment amounts. If the target loss indicates that the parameter value of a first parameter other than the plurality of target parameters in the first deep learning model needs to be adjusted, the parameter value of that first parameter in the first deep learning model may be adjusted based only on the adjustment amount determined for that first parameter, without adjusting the parameters of the second deep learning model.
[0063] Similarly, in step S2022, in response to the determination that the target loss indicates that the parameter values of at least some of the multiple second parameters included in the second deep learning model need to be adjusted, the parameter values of the second parameters in the second deep learning model are adjusted based on the adjustment amount determined for the second parameters in this part, and the parameter values of the target parameters (i.e., the first parameters) in the first deep learning model corresponding to the second parameters in this part are adjusted synchronously based on a similar adjustment amount.
[0064] In some embodiments, it may be possible to determine two adjustment values for the same parameter in the same training (and in the methods described below, there may be more adjustment values), and one of these adjustment values may be used to synchronously adjust the parameter value of the corresponding parameter in the two models, or both adjustment values may be used to synchronously adjust the parameter value of the corresponding parameter in the two models, but this is not limited to the above.
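The synchronous adjustment of steps S2021 and S2022 can be sketched as below. The sketch assumes plain gradient descent, flat parameter lists, and the variant in which both adjustment amounts determined for a shared parameter are applied; the function `synchronous_step`, the `target_index` mapping, and the learning rate are hypothetical.

```python
def synchronous_step(first, second, target_index, grad_first, grad_second, lr=0.1):
    """Apply one update that keeps each second parameter equal to its target parameter.

    first / grad_first:   values and gradients of all first parameters
    second / grad_second: values and gradients of the second parameters
    target_index[j]:      position in `first` of the target parameter for second[j]
    """
    # Every first parameter receives the adjustment determined for it.
    new_first = [w - lr * g for w, g in zip(first, grad_first)]
    new_second = list(second)
    for j, i in enumerate(target_index):
        # A shared parameter also receives the adjustment determined through the
        # second model (assumption: both amounts are used; using one is also allowed).
        new_first[i] -= lr * grad_second[j]
        # The corresponding second parameter is adjusted synchronously.
        new_second[j] = new_first[i]
    return new_first, new_second


first = [1.0, 2.0, 3.0, 4.0]
target_index = [1, 3]                       # second[0] <-> first[1], second[1] <-> first[3]
second = [first[i] for i in target_index]   # initialized from the target parameters
new_first, new_second = synchronous_step(
    first, second, target_index,
    grad_first=[0.5, 0.5, 0.5, 0.5],
    grad_second=[1.0, -1.0],
)
```

First parameters outside the target set (positions 0 and 2 here) change only through their own adjustment amounts, and the second model is untouched by those updates.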
[0065] In some embodiments, Method 200 may be extended to training on more models. The models to be trained may further include a third deep learning model. The third deep learning model includes a plurality of third parameters, which are initialized to the parameter values of a plurality of shared parameters corresponding to the plurality of third parameters in a plurality of target parameters, where the number of the plurality of third parameters is less than the number of the plurality of second parameters. In other words, the plurality of target parameters corresponding to the plurality of second parameters are a subset of the plurality of first parameters, and the plurality of shared parameters corresponding to the plurality of third parameters are a subset of the plurality of target parameters. For the structure and operation of the third model, refer to the above description for the second model.
[0066] During the training phase, the target losses for the three models may be determined first, and then the parameter values of each model's parameters may be adjusted based on these target losses. Specifically, in response to the determination that the target loss indicates that the parameter value of one parameter in one of the models needs to be adjusted, the parameter value of that parameter is synchronously adjusted in the other two models, if those models include a corresponding parameter. This method allows for obtaining three models with different levels of performance.
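The three-model case can be illustrated with the following sketch, in which each model is represented as a mapping from a parameter identifier to its value, and an adjustment is propagated to every model that contains the parameter. Keying all models by the identifier of the corresponding first parameter is an assumption of this example, as is the name `propagate`.

```python
def propagate(models: dict[str, dict[int, float]], param_id: int, delta: float) -> None:
    """Apply the same adjustment to `param_id` in every model that contains it."""
    for params in models.values():
        if param_id in params:
            params[param_id] += delta


# Nested parameter sets: third is a subset of second, which is a subset of first.
models = {
    "first": {i: 1.0 for i in range(6)},
    "second": {i: 1.0 for i in (1, 3, 5)},
    "third": {3: 1.0},
}

propagate(models, 3, -0.5)    # shared by all three models
propagate(models, 1, 0.25)    # shared by the first and second models only
propagate(models, 0, 0.125)   # present only in the first model
```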
[0067] It should be understood that, with reference to the above description, Method 200 may be extended to training four or more models, which will not be described further here. The method of this disclosure can scale the number of models to be trained without increasing the training cost, and since each model is trained in each training round, scaling the number of models does not increase the number of training rounds.
[0068] Another aspect of this disclosure provides a computer-implemented inference method for deep learning models. Figure 5 shows a flowchart of the deep learning model inference method 500 according to an embodiment of this disclosure. The first and second deep learning models related to method 500 may be obtained by training using the method 200 described above. Method 500 includes: step S501 of sequentially generating a plurality of prediction tokens using the second deep learning model; step S502 of generating, using the first deep learning model, a confidence level for each of the plurality of prediction tokens based on the plurality of prediction tokens; and step S503 of checking the plurality of prediction tokens in their generation order to obtain an inference result. Step S503 includes: step S5031 of, in response to determining that the confidence level of the currently checked prediction token is lower than a preset threshold, generating a correction token at the position of that prediction token using the first deep learning model and replacing the prediction token with the correction token as a checked prediction token; step S5032 of, in response to determining that a preset token generation condition is met, using the second deep learning model to sequentially generate prediction tokens again from the position following the correction token based on the correction token, and using the first deep learning model to generate confidence levels for the newly generated prediction tokens and check them; and step S5033 of obtaining the inference result based on the checked prediction tokens.
[0069] Prediction in deep learning models generally involves two modes: one is the decoding mode, which requires a prediction for each token and takes more time; the other is the check mode, which involves being given a set of generated tokens and obtaining the confidence level (i.e., the probability distribution of the top layer) for each token in the result. The check mode requires significantly less time than the decoding mode because the entire generation sequence can be processed in a single pass. This disclosure combines the decoding process of a smaller-scale second deep learning model with the check process of a larger-scale first deep learning model, enabling rapid acquisition of high-quality results and further improving performance and inference effectiveness.
[0070] In some embodiments, in step S501, a second deep learning model may be used in decoding mode to sequentially generate multiple prediction tokens. Because the second deep learning model is small in scale, even when using decoding mode, the prediction result sequence, i.e., multiple prediction tokens, can be obtained quickly.
[0071] In one exemplary embodiment, the input token sequence may be input to the second deep learning model to obtain the first predicted token output from the model, and the input token sequence and the first predicted token may be input to the model again to obtain the second predicted token output from the model, and so on until the model outputs an [End] token to indicate completion, or until a predetermined number of tokens have been output. It should be understood that the second deep learning model may also be used in other ways to sequentially generate multiple predicted tokens, without limitation.
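The decoding loop in this exemplary embodiment can be sketched as follows. The `decode` function and the `toy_model` stand-in, which simply replays a fixed script, are hypothetical; a real second deep learning model would compute the next token from the context.

```python
from typing import Callable, List, Sequence

END = "[End]"


def decode(
    next_token: Callable[[Sequence[str]], str],
    prompt: Sequence[str],
    max_new_tokens: int,
) -> List[str]:
    """Generate tokens one at a time, feeding each output back into the model."""
    generated: List[str] = []
    while len(generated) < max_new_tokens:
        token = next_token(list(prompt) + generated)
        generated.append(token)
        if token == END:  # the model signals completion
            break
    return generated


def toy_model(context: Sequence[str]) -> str:
    """Hypothetical stand-in for the second model: replays a fixed script."""
    script = ["the", "cat", "sat", END]
    return script[min(len(context) - 1, len(script) - 1)]


full = decode(toy_model, ["Hello"], max_new_tokens=10)      # stops at the [End] token
truncated = decode(toy_model, ["Hello"], max_new_tokens=2)  # stops at the token limit
```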
[0072] In some embodiments, in step S502, the first deep learning model may be used in check mode to directly generate the confidence level of each of the multiple prediction tokens based on the multiple prediction tokens. Because the first deep learning model is large, using the decoding mode would require a lot of time to obtain the prediction result sequence, whereas using the check mode allows for the rapid acquisition of the top-level probability distribution corresponding to the multiple prediction tokens generated by the second deep learning model.
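The check mode can be illustrated with the sketch below, which assumes that the first deep learning model has already produced top-layer logits for every position in a single pass, and that the confidence level of a drafted token is its softmax probability at that position. The logit values and function names are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one position's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_confidences(logits_per_position: list[list[float]], token_ids: list[int]) -> list[float]:
    """Confidence of each drafted token: its probability in the distribution at its position."""
    return [softmax(logits)[tok] for logits, tok in zip(logits_per_position, token_ids)]


# Hypothetical top-layer logits from one pass over three drafted tokens (vocabulary of 3).
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
drafted = [0, 1, 2]
confidences = token_confidences(logits, drafted)
```

Here the first drafted token receives a high confidence level, the second a uniform one, and the third a low one, so with a threshold such as 0.5 only the first would pass the check.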
[0073] In some embodiments, in step S503, multiple prediction tokens may be checked based on the generation order of the multiple prediction tokens, and the corresponding operation may be performed based on the check result.
[0074] In some embodiments, step S5031 indicates that, in response to determining that the confidence level of the currently checked prediction token is lower than a preset threshold, a larger first deep learning model determines that the currently checked prediction token is inaccurate. Therefore, the first deep learning model may be used in decoding mode to generate a correction token at the location of the prediction token, replacing the prediction token generated by the second deep learning model. The correction token may then be used as the checked prediction token to generate the final inference result.
[0075] In one exemplary embodiment, in response to determining that the confidence level of the currently checked prediction token is lower than a predetermined threshold, all prediction tokens prior to the currently checked prediction token may be input to a first deep learning model to obtain the token at the current position, i.e., a correction token, generated by the first deep learning model.
[0076] In some embodiments, in step S5032, in response to confirming that a predetermined token generation condition is met, a second deep learning model is used in decoding mode to sequentially generate at least one prediction token again, starting from the position following the correction token, based on the correction token. Subsequently, if at least one prediction token has been generated or other generation termination conditions are met, the first deep learning model is used in check mode to generate a confidence score for at least one prediction token based on the newly generated at least one prediction token, and then a check is performed on at least one prediction token. The above process may be repeated until all prediction tokens have been checked.
[0077] According to some embodiments, the preset token generation condition may include the number of generated tokens being less than a preset number. In other words, a preset number may be set, and prediction tokens may be generated and checked in units of the preset number. After a check is complete, prediction tokens may continue to be generated and checked in units of the preset number until the generation process is complete, for example, until a high-confidence [End] token is generated.
[0078] This allows the above method to control the upper limit of the length of the prediction token sequences being generated and checked, thereby avoiding performance degradation caused by the generation and checking of prediction token sequences that are too long.
[0079] According to some embodiments, the preset token generation condition may include the correction token indicating that token generation is not yet complete. In one exemplary embodiment, if the correction token is not an [End] token, the preset token generation condition is met, and prediction tokens are sequentially generated again using the second deep learning model.
[0080] According to some embodiments, step S503, which checks multiple prediction tokens based on the generation order of multiple prediction tokens to obtain an inference result, may further include holding the prediction token as a checked prediction token in response to determining that the confidence level of the prediction token currently being checked is equal to or greater than a preset threshold.
[0081] This method ensures that all checked prediction tokens ultimately held have a high level of confidence, thereby guaranteeing the quality of the inference results generated.
[0082] In some embodiments, in step S5033, content corresponding to each checked prediction token, such as text, images, or voice, may be determined, and these contents may be combined to obtain the corresponding inference result.
[0083] This allows for the rapid acquisition of high-quality inference results: the smaller second deep learning model generates a predicted token sequence in decoding mode, and the larger first deep learning model checks each generated predicted token in check mode. If a check fails (e.g., the confidence level is below the preset threshold), the first deep learning model generates a correction token for the current position in decoding mode, and the second deep learning model then generates predicted tokens one by one again from the next position.
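The overall flow of steps S501 to S503 can be sketched as below. This is an illustrative sketch under stated assumptions, not the disclosed implementation: the three callables stand in for the second model in decoding mode (`draft`), the first model in check mode (`check`), and the first model in decoding mode for a single position (`correct`); the `chunk`, `threshold`, and `max_tokens` parameters and all toy models are hypothetical.

```python
from typing import Callable, List, Sequence

END = "[End]"


def infer(
    draft: Callable[[Sequence[str]], str],
    check: Callable[[Sequence[str], Sequence[str]], List[float]],
    correct: Callable[[Sequence[str]], str],
    prompt: Sequence[str],
    threshold: float = 0.5,
    chunk: int = 4,
    max_tokens: int = 32,
) -> List[str]:
    """Draft with the small model, check with the large one, correct on failure."""
    checked: List[str] = []
    while len(checked) < max_tokens and (not checked or checked[-1] != END):
        context = list(prompt) + checked
        limit = min(chunk, max_tokens - len(checked))
        # S501: the second model drafts up to `limit` tokens in decoding mode.
        drafted: List[str] = []
        while len(drafted) < limit:
            drafted.append(draft(context + drafted))
            if drafted[-1] == END:
                break
        # S502: the first model scores all drafted tokens in check mode.
        confidences = check(context, drafted)
        # S503: check the drafted tokens in generation order.
        for token, confidence in zip(drafted, confidences):
            if confidence < threshold:
                # S5031: the first model generates a correction token; the rest of
                # the draft is discarded and redrafted on the next outer pass (S5032).
                checked.append(correct(list(prompt) + checked))
                break
            checked.append(token)  # confidence is high enough: hold the token
            if token == END:
                break
    return checked  # S5033 would map these tokens to text, images, or voice


# Toy models: the large model knows TARGET; the small model errs at the third position.
TARGET = ["a", "b", "c", "d", END]
SMALL = ["a", "b", "x", "d", END]


def toy_draft(context: Sequence[str]) -> str:
    return SMALL[min(len(context) - 1, len(SMALL) - 1)]


def toy_correct(context: Sequence[str]) -> str:
    return TARGET[min(len(context) - 1, len(TARGET) - 1)]


def toy_check(context: Sequence[str], drafted: Sequence[str]) -> List[float]:
    start = len(context) - 1
    return [
        0.9 if tok == TARGET[min(start + k, len(TARGET) - 1)] else 0.1
        for k, tok in enumerate(drafted)
    ]


result = infer(toy_draft, toy_check, toy_correct, prompt=["<s>"])
```

In this toy run the large model is used in decoding mode only once, at the position where the draft token fails the check; all other tokens come from the small model.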
[0084] In some embodiments, the model used for inference further includes a target deep learning model that is larger in scale than the first deep learning model, i.e., the multiple first parameters included in the first deep learning model are a subset of the multiple parameters included in the target deep learning model. In step S5033, after obtaining a plurality of check tokens checked by the first deep learning model, the target deep learning model may be used to generate confidence levels for the plurality of check tokens, and these confidence levels may be used to perform further checks on the plurality of check tokens. If a check token fails the check of the target deep learning model, the target deep learning model may generate a new token at that position to replace the check token, the second deep learning model may generate subsequent tokens after the token, and then the first deep learning model may check them again, and then the target deep learning model may re-check the result after the check by the first deep learning model.
[0085] In some embodiments, the second deep learning model generates N1 prediction tokens at a time, the first deep learning model checks the generated results, and new tokens are then generated. The first and second deep learning models may be treated as a single unit: each time they complete N2 rounds of generation and checking, the resulting N1 × N2 checked prediction tokens may be passed to the target deep learning model for checking and regeneration.
[0086] It should be understood that the above generation and checking method may be further extended to four or more models, which will not be described further here.
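The N1 × N2 schedule can be illustrated with a short calculation. The sketch assumes that every round yields exactly N1 checked prediction tokens, which is a simplification; the function name and the example values are hypothetical.

```python
def target_check_points(total_tokens: int, n1: int, n2: int) -> list[int]:
    """Token counts at which the checked tokens are passed to the target model."""
    batch = n1 * n2  # tokens accumulated over N2 rounds of N1 tokens each
    return list(range(batch, total_tokens + 1, batch))


# With N1 = 4 and N2 = 3, the target model checks after every 12 tokens.
points = target_check_points(total_tokens=30, n1=4, n2=3)
print(points)
```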
[0087] Another aspect of this disclosure provides a training device for a deep learning model. The first deep learning model includes a plurality of first parameters, and the second deep learning model includes a plurality of second parameters, the plurality of second parameters being initialized to the parameter values of a plurality of target parameters in the plurality of first parameters, the number of the plurality of second parameters being less than the number of first parameters.
[0088] Figure 6 shows a structural block diagram of a training device 600 for a deep learning model according to an embodiment of the present disclosure. The device 600 includes a determination unit 610 configured to determine the target loss of a first deep learning model and a second deep learning model, and a parameter adjustment unit 620 configured to adjust the parameter values of a plurality of first parameters and a plurality of second parameters based on the target loss to obtain the first deep learning model and the second deep learning model after training. The parameter adjustment unit 620 includes a first parameter adjustment subunit 622 configured to, in response to determining that the target loss indicates that the parameter values of at least some of the plurality of target parameters included in the first deep learning model need to be adjusted, synchronously adjust the parameter values of the second parameters in the second deep learning model that correspond to those target parameters, and a second parameter adjustment subunit 624 configured to, in response to determining that the target loss indicates that the parameter values of at least some of the plurality of second parameters included in the second deep learning model need to be adjusted, synchronously adjust the parameter values of the target parameters in the first deep learning model that correspond to those second parameters.
[0089] It should be understood that the operation of units 610 to 620 and their subunits in device 600 corresponds to steps S201 to S202 and their substeps in method 200 described above, and is not repeated here.
[0090] According to some embodiments, the first deep learning model may include a first layer of a first number of layers, and each of the first layers of a first number of layers may include a first intermediate dimension of a first number of intermediate dimensions.
[0091] The second deep learning model may include a second layer with a second number of layers, and each of the second layers may include a second intermediate dimension with a second number of intermediate dimensions. The second number of layers may be less than the first number of layers, and the second number of intermediate dimensions may be less than the first number of intermediate dimensions.
[0092] Multiple target parameters may include a first parameter in at least one target intermediate dimension of at least one target layer. At least one target layer may be obtained by selection among the first layers of the first number of layers based on the second number of layers, and at least one target intermediate dimension may be obtained by selection among the first intermediate dimensions of the first number of intermediate dimensions based on the second number of intermediate dimensions.
[0093] According to some embodiments, the first deep learning model and the second deep learning model may be based on a Transformer structure or a Transformer-like structure.
[0094] The first deep learning model may include a first hidden dimension with a first number of hidden dimensions, and each of the first layers with a first number of layers may include a first attention head with a first number of attention heads.
[0095] The second deep learning model may include a second hidden dimension with a second number of hidden dimensions, and each of the second layers with a second number of layers may include a second attention head with a second number of attention heads.
[0096] Multiple target parameters may include a first parameter corresponding to multiple target hidden dimensions, and may also include a first parameter in at least one target attention head. Multiple target hidden dimensions may be obtained by selection in the first hidden dimension of the first number of hidden dimensions based on a second number of hidden dimensions, and at least one target attention head may be obtained by selection in the first attention head of the first number of attention heads based on a second number of attention heads.
[0097] According to some embodiments, at least one target layer, at least one target intermediate dimension, multiple target hidden dimensions and / or at least one target attention head may be obtained by random selection.
[0098] According to some embodiments, both the first deep learning model and the second deep learning model may be configured to perform at least one of the following tasks: a text processing task, an image processing task, and a voice processing task.
[0099] According to some embodiments, the first deep learning model may be a pre-trained large model.
[0100] According to some embodiments, the determination unit may include a first determination subunit configured to determine a first loss of a first deep learning model for a first sample, a second determination subunit configured to determine a second loss of a second deep learning model for a second sample, and a third determination subunit configured to determine a target loss based on the first and second losses.
[0101] According to some embodiments, the first sample and the second sample may be the same.
[0102] Another aspect of this disclosure provides an inference device for deep learning models. The first deep learning model and the second deep learning model may be obtained by training using the device 600.
[0103] Figure 7 shows a structural block diagram of a deep learning model inference device 700 according to an embodiment of the present disclosure. The device 700 includes a first generation unit 710 configured to sequentially generate a plurality of prediction tokens using a second deep learning model, a second generation unit 720 configured to generate a confidence level for each of the plurality of prediction tokens based on the plurality of prediction tokens using the first deep learning model, and a check unit 730 configured to check the plurality of prediction tokens based on the generation order of the plurality of prediction tokens and obtain an inference result. The check unit 730 includes: a replacement subunit 732 configured to, in response to determining that the confidence level of the currently checked prediction token is lower than a preset threshold, use the first deep learning model to generate a correction token at the position of the prediction token and replace the prediction token with the correction token as a checked prediction token; a generation subunit 734 configured to, in response to determining that a preset token generation condition is met, use the second deep learning model to sequentially generate prediction tokens again from the position following the correction token based on the correction token, and use the first deep learning model to generate and check the confidence levels of the newly generated prediction tokens; and an inference subunit 736 configured to obtain an inference result based on the checked prediction tokens.
[0104] It should be understood that the operation of units 710 to 730 and their subunits in device 700 corresponds to steps S501 to S503 and their substeps in method 500 described above, and is not repeated here.
[0105] According to some embodiments, the preset token generation condition may include the number of generated tokens being less than a preset number.
[0106] According to some embodiments, the preset token generation condition may include the correction token indicating that token generation is not yet complete.
[0107] According to some embodiments, the check unit 730 may include a holding subunit (not shown) configured to hold a prediction token as a checked prediction token in response to determining that the confidence level of the prediction token currently being checked is equal to or greater than a preset threshold.
[0108] In the proposed technology described herein, all processing of user personal information, including collection, storage, use, processing, transmission, provision, and disclosure, complies with the provisions of relevant laws and regulations and does not violate public order and morals.
[0109] Embodiments of this disclosure further provide electronic devices, readable storage media, and computer program products.
[0110] Next, with reference to Figure 8, a structural block diagram of an electronic device 800 that can be used as a server or client in this disclosure is described, which is an example of hardware equipment that can be applied to each aspect of this disclosure. Electronic devices represent various forms of digital electronic computer equipment, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may further represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smartphones, wearable devices, and other similar computing devices. The components, their connections, and their functions shown herein are illustrative only and are not intended to limit the realization of the disclosure described and / or claimed herein.
[0111] As shown in Figure 8, the electronic device 800 includes a computing unit 801, which can perform various appropriate operations and processes based on computer programs stored in read-only memory (ROM) 802 or computer programs loaded from storage unit 808 into random access memory (RAM) 803. RAM 803 may further store various programs and data necessary for the operation of the electronic device 800. The computing unit 801, ROM 802, and RAM 803 are connected to each other via bus 804. An input / output (I / O) interface 805 is also connected to bus 804.
[0112] Multiple components in the electronic device 800 are connected to the I / O interface 805, including an input unit 806, an output unit 807, a storage unit 808, and a communication unit 809. The input unit 806 may be any type of device capable of inputting information into the electronic device 800. It may receive input numeric or character information and generate key signal inputs related to user settings and / or function control of the electronic device, and may include, but is not limited to, a mouse, keyboard, touchscreen, trackpad, trackball, joystick, microphone, and / or remote control. The output unit 807 may be any type of device capable of presenting information, and may include, but is not limited to, a display, speaker, video / audio output terminal, vibrator, and / or printer. The storage unit 808 may include, but is not limited to, a magnetic disk or an optical disk. The communication unit 809 enables the electronic device 800 to exchange information / data with other devices via computer networks, such as the Internet, and / or various telecommunication networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers, and / or chipsets, such as Bluetooth devices, 802.11 devices, WiFi devices, WiMAX devices, cellular communication devices, and / or similar devices.
[0113] The computing unit 801 may be a variety of general-purpose and / or dedicated processing assemblies having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs each of the methods, processes and / or operations described above. For example, in some embodiments, these methods, processes and / or operations may be implemented as computer software programs tangibly contained in a machine-readable medium, for example, a storage unit 808. In some embodiments, part or all of the computer program may be loaded and / or installed into the electronic device 800 via ROM 802 and / or a communication unit 809. When the computer program is loaded into RAM 803 and executed by the computing unit 801, one or more steps of the methods, processes and / or operations described above can be performed. Alternatively, in another embodiment, the computing unit 801 may be configured to perform these methods, processes, and / or operations by any other suitable method (e.g., firmware).
[0114] Various embodiments of the systems and technologies described herein may be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementation in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor. The programmable processor may be a dedicated or general-purpose programmable processor, and may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
[0115] Program code for carrying out the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, a dedicated computer, or other programmable data processing device, so that when the program code is executed by the processor or controller, the functions / operations defined in the flowcharts and / or block diagrams are performed. The program code may be executed entirely on a machine, partly on a machine, as a standalone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
[0116] In the context of this disclosure, a machine-readable medium may be a tangible medium which may contain or store a program used in or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any appropriate combination of the above. More specific examples of machine-readable storage media include one or more wire-based electrical connections, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any appropriate combination of the above.
[0117] To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) and a keyboard and pointing device (e.g., a mouse or trackball) with which the user can provide input to the computer. Other types of devices may also provide interaction with the user. For example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including voice input or tactile input).
[0118] The systems and technologies described herein may be implemented in computing systems including back-end components (e.g., data servers), computing systems including middleware components (e.g., application servers), computing systems including front-end components (e.g., user computers having a graphical user interface or web browser, through which users can interact with embodiments of the systems and technologies described herein), or in computing systems including any combination of such back-end components, middleware components, or front-end components. The components of the system may be interconnected by digital data communication (e.g., communication networks) of any form or medium. Examples of communication networks include local area networks (LANs), wide area networks (WANs), the internet, and blockchain networks.
[0119] A computer system may include a client and a server. The client and server are generally remote from each other and typically interact via a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server may be a cloud server, a server in a distributed system, or a server incorporating blockchain technology.
[0120] It should be understood that steps may be reordered, added, or deleted using the various forms of flows described above. For example, the steps described herein may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed herein are achieved; this specification imposes no limitation in this respect.
[0121] While embodiments or examples of this disclosure have been described with reference to the drawings, it should be understood that the above methods, systems, and apparatus are merely illustrative embodiments or examples, and the scope of the present invention is not limited by these embodiments or examples, but is limited only by the granted claims and their equivalents. Various elements in the embodiments or examples may be omitted or replaced by equivalent elements. Furthermore, each step may be performed in a different order than that described herein. In addition, various elements in the embodiments or examples may be combined in various ways. Importantly, as technology advances, many of the elements described herein may be replaced by equivalent elements that appear after this disclosure.
Claims
1. A computer-implemented method for training a deep learning model, wherein a first deep learning model includes a plurality of first parameters, a second deep learning model includes a plurality of second parameters, the plurality of second parameters are initialized to parameter values of a plurality of target parameters that, among the plurality of first parameters, correspond to the plurality of second parameters, and the number of the plurality of second parameters is less than the number of the plurality of first parameters, the method comprising: determining a target loss of the first deep learning model and the second deep learning model; and adjusting, based on the target loss, the parameter values of the plurality of first parameters and the plurality of second parameters to obtain a trained first deep learning model and a trained second deep learning model, wherein the adjusting comprises: in response to determining that the target loss indicates that the parameter values of at least some of the target parameters included in the first deep learning model need to be adjusted, synchronously adjusting the parameter values of the second parameters, included in the second deep learning model, that correspond to the at least some target parameters; and in response to determining that the target loss indicates that the parameter values of at least some of the second parameters included in the second deep learning model need to be adjusted, synchronously adjusting the parameter values of the target parameters, included in the first deep learning model, that correspond to the at least some second parameters.
2. The method according to claim 1, wherein determining the target loss of the first deep learning model and the second deep learning model comprises: determining a first loss of the first deep learning model for a first sample; determining a second loss of the second deep learning model for a second sample; and determining the target loss based on the first loss and the second loss.
3. The method according to claim 2, wherein the first sample and the second sample are the same.
4. The method according to claim 1, wherein the first deep learning model includes first layers of a first number of layers, and each of the first layers includes first intermediate dimensions of a first number of intermediate dimensions; the second deep learning model includes second layers of a second number of layers, and each of the second layers includes second intermediate dimensions of a second number of intermediate dimensions; the second number of layers is smaller than the first number of layers, and the second number of intermediate dimensions is smaller than the first number of intermediate dimensions; and the plurality of target parameters include first parameters in at least one target intermediate dimension of at least one target layer, where the at least one target layer is selected from the first layers of the first number of layers based on the second number of layers, and the at least one target intermediate dimension is selected from the first intermediate dimensions of the first number of intermediate dimensions based on the second number of intermediate dimensions.
5. The method according to claim 4, wherein the first deep learning model and the second deep learning model are based on a Transformer structure or a Transformer-like structure; the first deep learning model includes first hidden dimensions of a first number of hidden dimensions, and each of the first layers includes first attention heads of a first number of attention heads; the second deep learning model includes second hidden dimensions of a second number of hidden dimensions, and each of the second layers includes second attention heads of a second number of attention heads; and the plurality of target parameters include first parameters corresponding to a plurality of target hidden dimensions and first parameters in at least one target attention head, where the plurality of target hidden dimensions are selected from the first hidden dimensions of the first number of hidden dimensions based on the second number of hidden dimensions, and the at least one target attention head is selected from the first attention heads of the first number of attention heads based on the second number of attention heads.
6. The method according to claim 5, wherein the at least one target layer, the at least one target intermediate dimension, the plurality of target hidden dimensions and / or the at least one target attention head are obtained by random selection.
7. The method according to any one of claims 1 to 6, wherein the first deep learning model is a pre-trained large model.
8. The method according to any one of claims 1 to 6, wherein both the first deep learning model and the second deep learning model are configured to perform at least one of the tasks of text processing, image processing, and voice processing.
9. A computer-implemented method for inference with a deep learning model, wherein a first deep learning model and a second deep learning model are obtained by training using the method according to any one of claims 1 to 6, the method comprising: sequentially generating a plurality of prediction tokens using the second deep learning model; generating, using the first deep learning model and based on the plurality of prediction tokens, a confidence level for each of the plurality of prediction tokens; and checking the plurality of prediction tokens based on the generation order of the plurality of prediction tokens to obtain an inference result, wherein the checking comprises: in response to determining that the confidence level of a currently checked prediction token is lower than a predetermined threshold, generating, using the first deep learning model, a correction token at the position of that prediction token, the correction token replacing that prediction token as a checked prediction token; in response to determining that a pre-set token generation condition is satisfied, sequentially generating, using the second deep learning model and based on the correction token, new prediction tokens starting from the position following the correction token, and generating and checking, using the first deep learning model, the confidence levels of the newly generated prediction tokens; and obtaining the inference result based on the checked prediction tokens.
10. The method according to claim 9, wherein checking the plurality of prediction tokens based on the generation order of the plurality of prediction tokens to obtain an inference result further comprises holding a currently checked prediction token as a checked prediction token in response to determining that the confidence level of the currently checked prediction token is equal to or greater than the predetermined threshold.
11. The method according to claim 9, wherein the pre-set token generation condition includes the number of generated tokens being less than a pre-set number.
12. The method according to claim 9, wherein the pre-set token generation condition includes the correction token indicating that token generation has not finished.
13. A training apparatus for a deep learning model, wherein a first deep learning model includes a plurality of first parameters, a second deep learning model includes a plurality of second parameters, the plurality of second parameters are initialized to the parameter values of a plurality of target parameters among the plurality of first parameters, and the number of the plurality of second parameters is less than the number of the plurality of first parameters, the training apparatus comprising: a determination unit configured to determine a target loss of the first deep learning model and the second deep learning model; and a parameter adjustment unit configured to adjust, based on the target loss, the parameter values of the plurality of first parameters and the plurality of second parameters to obtain a trained first deep learning model and a trained second deep learning model, wherein the parameter adjustment unit comprises: a first parameter adjustment subunit configured to, in response to determining that the target loss indicates that the parameter values of at least some of the target parameters included in the first deep learning model need to be adjusted, synchronously adjust the parameter values of the second parameters, included in the second deep learning model, that correspond to the at least some target parameters; and a second parameter adjustment subunit configured to, in response to determining that the target loss indicates that the parameter values of at least some of the second parameters included in the second deep learning model need to be adjusted, synchronously adjust the parameter values of the target parameters, included in the first deep learning model, that correspond to the at least some second parameters.
14. The apparatus according to claim 13, wherein the determination unit comprises: a first determination subunit configured to determine a first loss of the first deep learning model for a first sample; a second determination subunit configured to determine a second loss of the second deep learning model for a second sample; and a third determination subunit configured to determine the target loss based on the first loss and the second loss.
15. The apparatus according to claim 14, wherein the first sample and the second sample are the same.
16. The apparatus according to claim 13, wherein the first deep learning model includes first layers of a first number of layers, and each of the first layers includes first intermediate dimensions of a first number of intermediate dimensions; the second deep learning model includes second layers of a second number of layers, and each of the second layers includes second intermediate dimensions of a second number of intermediate dimensions; the second number of layers is smaller than the first number of layers, and the second number of intermediate dimensions is smaller than the first number of intermediate dimensions; and the plurality of target parameters include first parameters in at least one target intermediate dimension of at least one target layer, where the at least one target layer is selected from the first layers of the first number of layers based on the second number of layers, and the at least one target intermediate dimension is selected from the first intermediate dimensions of the first number of intermediate dimensions based on the second number of intermediate dimensions.
17. The apparatus according to claim 16, wherein the first deep learning model and the second deep learning model are based on a Transformer structure or a Transformer-like structure; the first deep learning model includes first hidden dimensions of a first number of hidden dimensions, and each of the first layers includes first attention heads of a first number of attention heads; the second deep learning model includes second hidden dimensions of a second number of hidden dimensions, and each of the second layers includes second attention heads of a second number of attention heads; and the plurality of target parameters include first parameters corresponding to a plurality of target hidden dimensions and first parameters in at least one target attention head, where the plurality of target hidden dimensions are selected from the first hidden dimensions of the first number of hidden dimensions based on the second number of hidden dimensions, and the at least one target attention head is selected from the first attention heads of the first number of attention heads based on the second number of attention heads.
18. The apparatus according to claim 17, wherein the at least one target layer, the at least one target intermediate dimension, the plurality of target hidden dimensions and / or the at least one target attention head are obtained by random selection.
19. The apparatus according to any one of claims 13 to 18, wherein the first deep learning model is a pre-trained large model.
20. The apparatus according to any one of claims 13 to 18, wherein both the first deep learning model and the second deep learning model are configured to perform at least one of the tasks of text processing, image processing, and voice processing.
21. An inference apparatus for a deep learning model, wherein a first deep learning model and a second deep learning model are obtained by training using the apparatus according to any one of claims 13 to 18, the inference apparatus comprising: a first generation unit configured to sequentially generate a plurality of prediction tokens using the second deep learning model; a second generation unit configured to generate, using the first deep learning model and based on the plurality of prediction tokens, a confidence level for each of the plurality of prediction tokens; and a check unit configured to check the plurality of prediction tokens based on the generation order of the plurality of prediction tokens to obtain an inference result, wherein the check unit comprises: a replacement subunit configured to, in response to determining that the confidence level of a currently checked prediction token is lower than a predetermined threshold, generate, using the first deep learning model, a correction token at the position of that prediction token, the correction token replacing that prediction token as a checked prediction token; a generation subunit configured to, in response to determining that a pre-set token generation condition is satisfied, sequentially generate, using the second deep learning model and based on the correction token, new prediction tokens starting from the position following the correction token, and to generate and check, using the first deep learning model, the confidence levels of the newly generated prediction tokens; and an inference subunit configured to obtain the inference result based on the checked prediction tokens.
22. The apparatus according to claim 21, wherein the check unit further comprises a holding subunit configured to hold a currently checked prediction token as a checked prediction token in response to determining that the confidence level of the currently checked prediction token is equal to or greater than the predetermined threshold.
23. The apparatus according to claim 21, wherein the pre-set token generation condition includes the number of generated tokens being less than a pre-set number.
24. The apparatus according to claim 21, wherein the pre-set token generation condition includes the correction token indicating that token generation has not finished.
25. An electronic device comprising: at least one processor; and a memory connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, cause the at least one processor to perform the method according to any one of claims 1 to 6.
26. A non-transitory computer-readable storage medium in which computer instructions are stored, wherein the computer instructions are used to cause a computer to execute the method according to any one of claims 1 to 6.
27. A computer program that, when executed by a processor, implements the method described in any one of claims 1 to 6.
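For illustration only, and not as part of the claims, the synchronous parameter adjustment recited in claims 1 to 3 and the random selection of claim 6 can be sketched on a toy linear model. Everything below is an assumption made for the sketch: the function names, the use of a plain sum as the target loss, the squared-error losses, and the choice to realize synchronization by letting the second model read its parameters from the first model's storage. A real system would apply the same idea to layers, hidden dimensions, and attention heads.

```python
import random
from typing import List


def predict(weights: List[float], x: List[float]) -> float:
    """Toy linear model: dot product of weights and features."""
    return sum(w * xi for w, xi in zip(weights, x))


def select_target_indices(n_first: int, n_second: int, seed: int = 0) -> List[int]:
    """Randomly choose which first parameters serve as target parameters."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_first), n_second))


def second_params(first_params: List[float], target_idx: List[int]) -> List[float]:
    """The second model's parameters are the target parameters themselves."""
    return [first_params[i] for i in target_idx]


def train_step(first_params: List[float], target_idx: List[int],
               x: List[float], y: float, lr: float = 0.01) -> float:
    """Apply one joint update in place; return the target loss before it."""
    x_small = [x[i] for i in target_idx]
    err1 = predict(first_params, x) - y                                   # first model
    err2 = predict(second_params(first_params, target_idx), x_small) - y  # second model
    target_loss = err1 ** 2 + err2 ** 2  # assumed: first loss + second loss

    # Gradient of the target loss with respect to every first parameter.
    grads = [2.0 * err1 * xi for xi in x]
    # Target parameters are shared, so they also receive the second model's
    # gradient; one update therefore adjusts both models synchronously.
    for i in target_idx:
        grads[i] += 2.0 * err2 * x[i]

    for i, g in enumerate(grads):
        first_params[i] -= lr * g
    return target_loss


first_params = [0.5, -0.2, 0.1, 0.3]
target_idx = select_target_indices(len(first_params), 2)
x, y = [1.0, 2.0, -1.0, 0.5], 1.0  # the same sample is used for both models
losses = [train_step(first_params, target_idx, x, y) for _ in range(50)]
print(losses[0], losses[-1])
```

Because the shared parameters are stored once, an adjustment driven by either model's loss is immediately visible to the other model, which is one simple way to keep the two parameter sets synchronized.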