Training methods, devices, terminals, and storage media for deep learning models
By approximating the inverse matrix of the network layer matrix of a deep learning model, the problem of high computational resource consumption in deep learning model training is solved, thereby improving training speed.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- PENG CHENG LAB
- Filing Date
- 2022-12-01
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies suffer from high computational resource consumption and long training time during the training of deep learning models.
By approximating the inverses of the first and second matrices of each network layer in the deep learning model, the inverse matrix of the network layer is obtained. Based on the inverse matrix of each network layer, the inverse matrix of the Fisher information matrix of the deep learning model is obtained, thereby enabling iterative training of the model and reducing the actual inversion calculation.
This greatly reduces the computational resources required for training deep learning models, improves training speed, and saves training time.
Figure CN115936103B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and in particular to a training method, apparatus, terminal, and computer-readable storage medium for a deep learning model.
Background Technology
[0002] Artificial intelligence (AI) is a key strategic technology driving the new round of technological revolution and industrial transformation. In recent years, with the rapid development of AI, the demand for computing power has increased dramatically, far exceeding Moore's Law. Currently, AI applications often require massive amounts of computing resources and data; training a viable neural network can take weeks or even months.
[0003] Therefore, how to provide a training scheme for deep learning models that reduces the consumption of computing resources has become an urgent technical problem to be solved.
Summary of the Invention
[0004] The main objective of this invention is to provide a training method, apparatus, terminal, and computer-readable storage medium for deep learning models, aiming to solve the technical problems of high computational resource consumption and long training time in the prior art during deep learning model training.
[0005] To achieve the above objectives, embodiments of the present invention provide a method for training a deep learning model, the method comprising:
[0006] The deep learning model is trained based on training samples to obtain a first matrix and a second matrix for each network layer of the deep learning model.
[0007] The deep learning model consists of several network layers; the first matrix is formed from the expected value of the gradient, back-propagated from the loss function value, with respect to the output of the network layer before its nonlinear mapping; the second matrix is formed from the expected value of the output of the preceding network layer after its nonlinear mapping.
[0008] Based on the third matrix, determine the inverse matrix of the first matrix and the inverse matrix of the second matrix; wherein, the third matrix is the product of a first preset adjustable parameter and a preset identity matrix; the inverse matrix of the first matrix is the difference between the third matrix and the first matrix; the inverse matrix of the second matrix is the difference between the third matrix and the second matrix.
[0009] The inverse matrix of the network layer is determined based on the maximum eigenvalue and inverse matrix of the first matrix and the maximum eigenvalue and inverse matrix of the second matrix.
[0010] Based on the inverse matrix of each network layer, the inverse matrix of the Fisher information matrix of the deep learning model is obtained, and the deep learning model is trained based on the inverse matrix of the Fisher information matrix to obtain a trained deep learning model.
[0011] Optionally, after determining the first and second matrices for each network layer of the Fisher information matrix, the method further includes:
[0012] The maximum eigenvalues of the first matrix and the second matrix are approximated by the power method.
[0013] Optionally, the step of approximating the maximum eigenvalues of the first matrix and the second matrix using the power method specifically includes:
[0014] Randomly generate non-zero vectors and calculate the maximum element value of the non-zero vectors; wherein the dimension of the non-zero vectors is the same as the dimension of the first matrix;
[0015] Based on the maximum element value of the non-zero vector, the non-zero vector is normalized to obtain the normalized non-zero vector.
[0016] The product of the normalized non-zero vector and the first matrix is used as the updated non-zero vector. The process continues, normalizing the non-zero vector based on its maximum element value, and using the product of the normalized non-zero vector and the first matrix as the updated non-zero vector. This process continues until the number of iterations meets a preset condition, and the maximum eigenvalue of the first matrix is obtained.
[0017] Optionally, before determining the inverse matrix of the first matrix and the inverse matrix of the second matrix based on the third matrix, the method further includes:
[0018] Based on the largest eigenvalue of the first matrix, the first matrix is normalized, and the normalized first matrix is used as the first matrix; and
[0019] The second matrix is normalized based on its largest eigenvalue, and the normalized second matrix is then used as the second matrix.
[0020] Optionally, determining the inverse matrix of the network layer based on the maximum eigenvalue and inverse matrix of the first matrix and the maximum eigenvalue and inverse matrix of the second matrix specifically includes:
[0021] The quotient of the inverse matrix of the first matrix and the largest eigenvalue of the first matrix is used as the first ratio matrix; and
[0022] The quotient of the inverse of the second matrix and the largest eigenvalue of the second matrix is used as the second ratio matrix;
[0023] The Kronecker product of the first ratio matrix and the second ratio matrix is used as the inverse matrix of the network layer.
[0024] Optionally, before determining the inverse matrix of the network layer based on the maximum eigenvalue and inverse matrix of the first matrix and the maximum eigenvalue and inverse matrix of the second matrix, the method further includes:
[0025] Obtain the second preset adjustable parameters of the deep learning model;
[0026] The step of using the quotient of the inverse matrix of the first matrix and the largest eigenvalue of the first matrix as the first ratio matrix specifically includes:
[0027] Using the second preset adjustable parameter as the exponent, the largest eigenvalue of the first matrix is raised to that power to obtain the first eigenvalue.
[0028] The quotient of the inverse of the first matrix and the first eigenvalue is used as the first ratio matrix;
[0029] The step of using the quotient of the inverse matrix of the second matrix and the largest eigenvalue of the second matrix as the second ratio matrix specifically includes:
[0030] Using the second preset adjustable parameter as the exponent, the largest eigenvalue of the second matrix is raised to that power to obtain the second eigenvalue;
[0031] The quotient of the inverse matrix of the second matrix and the second eigenvalue is used as the second ratio matrix.
[0032] Optionally, the second preset adjustable parameter is 0.5.
[0033] Optionally, the selection range of the first preset adjustable parameter is: [1.05, 2].
[0034] Optionally, training the deep learning model based on the inverse of the Fisher information matrix to obtain a trained deep learning model specifically includes:
[0035] Obtain the descent gradient value during training of the deep learning model;
[0036] The deep learning model is trained using the natural gradient algorithm based on the descent gradient value and the inverse of the Fisher information matrix.
[0037] Optionally, the deep learning model is an image classification model.
[0038] To achieve the above objectives, embodiments of the present invention also provide a training apparatus for a deep learning model, the apparatus comprising:
[0039] The first determining module is used to train the deep learning model based on training samples to obtain a first matrix and a second matrix for each network layer of the deep learning model.
[0040] The deep learning model consists of several network layers; the first matrix is formed from the expected value of the gradient, back-propagated from the loss function value, with respect to the output of the network layer before its nonlinear mapping; the second matrix is formed from the expected value of the output of the preceding network layer after its nonlinear mapping.
[0041] The second determining module is used to determine the inverse matrix of the first matrix and the inverse matrix of the second matrix based on the third matrix; wherein the third matrix is the product of a first preset adjustable parameter and a preset identity matrix; the inverse matrix of the first matrix is the difference between the third matrix and the first matrix; and the inverse matrix of the second matrix is the difference between the third matrix and the second matrix.
[0042] The third determining module is used to obtain the inverse matrix of the network layer based on the maximum eigenvalue and inverse matrix of the first matrix and the maximum eigenvalue and inverse matrix of the second matrix;
[0043] The training module is used to obtain the inverse matrix of the Fisher information matrix of the deep learning model based on the inverse matrix of each network layer, so as to train the deep learning model based on the inverse matrix of the Fisher information matrix.
[0044] To achieve the above objectives, embodiments of the present invention also provide a computer-readable storage medium storing one or more programs, which can be executed by one or more processors to implement the steps in the training method of the deep learning model as described in any one of claims 1-10.
[0045] To achieve the above objectives, embodiments of the present invention also provide a terminal, comprising: a processor and a memory; the memory storing a computer-readable program executable by the processor; the processor executing the computer-readable program implements the steps in the training method of the deep learning model as described in any one of claims 1-10.
[0046] Existing technologies employ the natural gradient method to accelerate deep learning model training. This requires obtaining the inverse of the Fisher information matrix of the deep learning model, which consumes significant computational resources. When the deep learning model has a large number of parameters, thousands of complex matrix inversions are required, resulting in substantial computational resource consumption and long training times. In contrast, the deep learning model training method provided in this invention approximates the inverses of the first and second matrices of each network layer, thus obtaining the inverse matrix of the network layer. Based on the inverse matrix of each network layer, the inverse of the Fisher information matrix of the deep learning model is obtained. The deep learning model is then iteratively trained using the inverse of the Fisher information matrix to obtain a trained deep learning model. This eliminates the need for actual inversion calculations to obtain the inverse of the Fisher information matrix, significantly reducing the computational resources required for deep learning model training, improving training speed, and saving training time.
Attached Figure Description
[0047] Figure 1 is a flowchart of the training method for a deep learning model provided in an embodiment of the present invention;
[0048] Figure 2 is a flowchart of step S102 provided in an embodiment of the present invention;
[0049] Figure 3 is a diagram showing the effect of the training method for the deep learning model provided in an embodiment of the present invention;
[0050] Figure 4 is a schematic structural diagram of a training device for a deep learning model provided in an embodiment of the present invention;
[0051] Figure 5 is a schematic structural diagram of the terminal provided in an embodiment of the present invention.
Detailed Implementation
[0052] To make the objectives, technical solutions, and advantages of this invention clearer and more explicit, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
[0053] This invention provides a method for training a deep learning model. As shown in Figure 1, the training method for this deep learning model may include at least the following steps:
[0054] S101, Obtain training samples and train the deep learning model based on the training samples to obtain the first matrix and the second matrix of each network layer of the deep learning model.
[0055] The deep learning model consists of several network layers. The first matrix is formed from the expected value of the gradient, back-propagated from the loss function value, with respect to the output of the network layer before its nonlinear mapping; the second matrix is formed from the expected value of the output of the preceding network layer after its nonlinear mapping.
[0056] Specifically, the first matrix is $G_l = \mathbb{E}\left[g_l g_l^{T}\right]$, where $l$ denotes the $l$-th network layer of the deep learning model, $g_l$ is the gradient, back-propagated from the loss function value, with respect to the output of the $l$-th network layer before its nonlinear mapping, and $T$ denotes the transpose.
[0057] The second matrix is $A_{l-1} = \mathbb{E}\left[a_{l-1} a_{l-1}^{T}\right]$, where $a_{l-1}$ denotes the output, after the nonlinear mapping, of the network layer preceding the $l$-th layer.
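As a rough illustration of the two definitions above, the following sketch estimates both matrices from a mini-batch, replacing each expectation with a batch average. The function name `kronecker_factors`, the array shapes, and the use of NumPy are assumptions made for illustration and are not taken from the patent.

```python
import numpy as np

def kronecker_factors(grad_pre_act, prev_act):
    """Estimate G_l = E[g g^T] and A_{l-1} = E[a a^T] by batch averages.

    grad_pre_act: (batch, m) gradients w.r.t. the layer-l output before
                  its nonlinear mapping (assumed layout).
    prev_act:     (batch, n) outputs of layer l-1 after its nonlinear
                  mapping (assumed layout).
    """
    batch = grad_pre_act.shape[0]
    G = grad_pre_act.T @ grad_pre_act / batch   # first matrix, (m, m)
    A = prev_act.T @ prev_act / batch           # second matrix, (n, n)
    return G, A

# Hypothetical data standing in for one layer's statistics.
rng = np.random.default_rng(0)
g = rng.standard_normal((32, 4))
a = rng.standard_normal((32, 6))
G, A = kronecker_factors(g, a)
```

Both estimates are symmetric positive semi-definite by construction, which is the property the later normalization step relies on.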
[0058] In this embodiment of the invention, the deep learning model training method can be executed by a terminal, a processor (for example, an NVIDIA GPU or an Ascend AI processor), a server, or the like; this embodiment imposes no specific limitation.
[0059] Deep learning models can be used for image classification, specifically for image recognition and classification.
[0060] S102, determine the largest eigenvalue of the first matrix and the largest eigenvalue of the second matrix.
[0061] In this embodiment of the invention, the largest eigenvalues of the first matrix $G_l$ and the second matrix $A_{l-1}$ can be approximated by the power method.
[0062] Furthermore, as shown in Figure 2, step S102 can be implemented through at least the following steps:
[0063] S201: Randomly generate a non-zero vector and calculate the maximum element value of the non-zero vector.
[0064] The dimensions of the non-zero vectors are the same as the dimensions of the first matrix.
[0065] S202, based on the maximum element value of the non-zero vector, normalize the non-zero vector to obtain the normalized non-zero vector.
[0066] S203, the product of the normalized non-zero vector and the first matrix is used as the updated non-zero vector.
[0067] S204, continue to execute and normalize the non-zero vectors according to the maximum element value of the non-zero vectors, and use the product of the normalized non-zero vectors and the first matrix as the updated non-zero vectors, until the number of iterations meets the preset condition and the maximum eigenvalue of the first matrix is obtained.
[0068] Take the first matrix $G_l$ as an example and suppose $G_l$ has dimension $m \times m$. To obtain the largest eigenvalue of $G_l$, first randomly generate an $m$-dimensional non-zero vector $u$, then compute its maximum element value $u_{\max}$, and normalize $u$ by $u_{\max}$ to obtain the normalized non-zero vector $\hat{u} = u / u_{\max}$. Next, compute the updated non-zero vector $u = G_l \hat{u}$ and return to the step of computing the maximum element value of $u$. After $k$ iterations the procedure terminates and outputs the largest eigenvalue of the first matrix, $\lambda_{\max}(G_l) = u_{\max}$.
[0069] It is understandable that the method provided in steps S201-S204 above can also be used to determine the largest eigenvalue of the second matrix, and will not be elaborated further here.
[0070] In this embodiment of the invention, the power method yields a reasonably accurate eigenvalue after only a few iterations, thereby further reducing the consumption of computing resources.
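Steps S201 to S204 can be sketched roughly as follows. The function name `largest_eigenvalue`, the fixed iteration count used as the stopping condition, and the choice to normalize by the element of largest magnitude (a common power-method variant; the text only says "maximum element value") are assumptions for illustration.

```python
import numpy as np

def largest_eigenvalue(M, num_iters=50, seed=0):
    """Approximate the largest eigenvalue of a symmetric PSD matrix M."""
    rng = np.random.default_rng(seed)
    u = rng.random(M.shape[0]) + 0.1        # S201: random non-zero vector
    u_max = u[np.argmax(np.abs(u))]         # element of largest magnitude
    for _ in range(num_iters):              # S204: repeat a preset number of times
        u_hat = u / u_max                   # S202: normalize by u_max
        u = M @ u_hat                       # S203: updated non-zero vector
        u_max = u[np.argmax(np.abs(u))]
    return u_max                            # approximates lambda_max(M)

# Hypothetical test matrix with known eigenvalues 5, 2 and 1.
Q, _ = np.linalg.qr(np.random.default_rng(1).standard_normal((3, 3)))
G = Q @ np.diag([5.0, 2.0, 1.0]) @ Q.T
lam = largest_eigenvalue(G, num_iters=100)  # close to 5.0 by construction
```

Each iteration costs one matrix-vector product, and the error shrinks at a rate set by the ratio of the second-largest to the largest eigenvalue, which is why a small iteration count is often enough when that ratio is well below 1.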
[0071] S103, normalize the first matrix based on the largest eigenvalue of the first matrix, and normalize the second matrix based on the largest eigenvalue of the second matrix.
[0072] Specifically, after obtaining the largest eigenvalue of the first matrix and the largest eigenvalue of the second matrix, normalization can be performed on the first matrix and the second matrix respectively, as shown below:
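[0073] $\hat{G}_l = G_l / \lambda_{\max}(G_l)$
[0074] $\hat{A}_{l-1} = A_{l-1} / \lambda_{\max}(A_{l-1})$
[0075] where $\hat{G}_l$ is the normalized first matrix, $\hat{A}_{l-1}$ is the normalized second matrix, $\lambda_{\max}(G_l)$ is the largest eigenvalue of the first matrix $G_l$, and $\lambda_{\max}(A_{l-1})$ is the largest eigenvalue of the second matrix $A_{l-1}$.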
[0076] Because the first matrix $G_l$ and the second matrix $A_{l-1}$ are both symmetric positive semi-definite matrices, their largest eigenvalues are greater than or equal to 0. After normalization, the largest eigenvalues of the normalized matrices $\hat{G}_l$ and $\hat{A}_{l-1}$ lie between 0 and 1, which effectively avoids the loss of computational accuracy caused by excessive differences between matrix eigenvalues, thereby improving the accuracy of deep learning model training.
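The effect of this normalization can be checked numerically with a small sketch. The helper name `normalize_by_largest_eigenvalue` is hypothetical, and the exact eigenvalue from `np.linalg.eigvalsh` is used here only to verify the result; in the method described above the value would come from the power method.

```python
import numpy as np

def normalize_by_largest_eigenvalue(M, lam_max):
    """Divide a symmetric PSD matrix by its largest eigenvalue."""
    return M / lam_max

# Hypothetical symmetric PSD matrix standing in for G_l.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
G = B @ B.T
lam_max = np.linalg.eigvalsh(G)[-1]        # exact value, used for the check only
G_hat = normalize_by_largest_eigenvalue(G, lam_max)
eigs = np.linalg.eigvalsh(G_hat)           # all eigenvalues fall in [0, 1]
```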
[0077] It is understood that, in the embodiments of the present invention, the normalized first matrix and the second matrix can be used as the first matrix and the second matrix in the following embodiments.
[0078] S104: Obtain the first preset adjustable parameters of the deep learning model, and multiply the first preset adjustable parameters with the preset identity matrix as the third matrix.
[0079] In this embodiment of the invention, different parameter values can be selected from a preset parameter range based on the training results of the deep learning model to train the deep learning model. When the deep learning model achieves consistent accuracy, the fastest training value is selected from the preset parameter range as the first preset adjustable parameter. The preset parameter range is [1.05, 2].
[0080] With the preset parameter range of [1.05, 2], the ratio of the largest to the smallest eigenvalue of the inverse matrix of the first matrix can be kept approximately within the range [2, 21]. If this ratio, i.e., the condition number, is too large, it generally hinders convergence. The same applies to the second matrix and is not repeated here.
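The range [2, 21] can be reproduced with a short calculation, assuming the inverse is formed as the difference between c times the identity matrix and a normalized matrix whose eigenvalues lie in [0, 1]. Under that assumption the eigenvalues of the difference lie in [c - 1, c], so their ratio is at most c / (c - 1). The function name `eigen_ratio_bound` and the diagonal test matrix are illustrative only.

```python
import numpy as np

def eigen_ratio_bound(c):
    """Upper bound c / (c - 1) on the max/min eigenvalue ratio of c*I - M_hat,
    assuming the eigenvalues of M_hat lie in [0, 1] and c > 1."""
    return c / (c - 1.0)

# Bounds at the ends and the middle of the preset range [1.05, 2].
bounds = {c: eigen_ratio_bound(c) for c in (1.05, 1.5, 2.0)}

# Numerical check on a hypothetical normalized matrix whose eigenvalues span [0, 1].
M_hat = np.diag([0.0, 0.3, 1.0])
c = 1.05
eigs = np.linalg.eigvalsh(c * np.eye(3) - M_hat)
ratio = eigs.max() / eigs.min()            # attains the bound for this matrix
```

The bound is 21 at c = 1.05 and 2 at c = 2, matching the stated range.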
[0081] It is understood that, in the embodiments of the present invention, the largest eigenvalues of the first and second matrices can also be obtained by the power method, which further reduces the computational resources needed to train deep learning models.
[0082] In this embodiment of the invention, the inverse matrices of the first and second matrices are approximately obtained by using the maximum eigenvalues of the first and second matrices. In this process, the operations are mainly addition and subtraction between vectors and matrices, without involving matrix multiplication, which further reduces the computational resources for training deep learning models.
[0083] S105, Based on the third matrix, determine the inverse matrix of the first matrix and the inverse matrix of the second matrix.
[0084] The inverse of the first matrix is the difference between the third matrix and the first matrix, and the inverse of the second matrix is the difference between the third matrix and the second matrix.
[0085] Specifically, the inverse of the first matrix is:
[0086] $\hat{G}_l^{-1} = cI - \hat{G}_l$
[0087] where $\hat{G}_l^{-1}$ is the inverse matrix of the first matrix, $c$ is the first preset adjustable parameter, $I$ is the preset identity matrix, and $\hat{G}_l$ is the normalized first matrix.
[0088] The inverse of the second matrix is:
[0089] $\hat{A}_{l-1}^{-1} = cI - \hat{A}_{l-1}$
[0090] where $\hat{A}_{l-1}^{-1}$ is the inverse matrix of the second matrix, $c$ is the first preset adjustable parameter, $I$ is the preset identity matrix, and $\hat{A}_{l-1}$ is the normalized second matrix.
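A minimal sketch of step S105, assuming the input matrix has already been normalized by its largest eigenvalue: the inverse is formed as the difference between c times the identity matrix and the normalized matrix, with no call to a matrix inversion routine. The function name `approximate_inverse` and the example values are hypothetical.

```python
import numpy as np

def approximate_inverse(M_hat, c):
    """Return c*I - M_hat, the difference used in place of an explicit inverse."""
    if c <= 1.0:
        raise ValueError("c is expected to be greater than 1")
    return c * np.eye(M_hat.shape[0]) - M_hat

# Hypothetical normalized matrix with eigenvalues 1.0, 0.5 and 0.2.
G_hat = np.diag([1.0, 0.5, 0.2])
G_inv = approximate_inverse(G_hat, c=1.5)   # diagonal entries 0.5, 1.0, 1.3
```

As with a true inverse, directions with larger eigenvalues in the normalized matrix receive smaller eigenvalues in the result, while the cost is a single matrix subtraction.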
[0091] As can be seen from the above, in the embodiments of the present invention, $c > 1$, and $c$ is used to control the ratio of the largest eigenvalue to the smallest eigenvalue of the inverse matrices $\hat{G}_l^{-1}$ and $\hat{A}_{l-1}^{-1}$. This is because the largest eigenvalues of the normalized matrices $\hat{G}_l$ and $\hat{A}_{l-1}$ obtained through step S103 above are compressed to between 0 and 1. When $c$ takes a large value, the difference between the largest and smallest eigenvalues of $\hat{G}_l^{-1}$ and $\hat{A}_{l-1}^{-1}$ is relatively small, meaning their ratio is close to 1; when $c$ is small, the difference between the largest and smallest eigenvalues is relatively large, meaning the ratio can be arbitrarily large.
[0092] It is understood that the ratio of the largest eigenvalue to the smallest eigenvalue of $\hat{G}_l^{-1}$ and $\hat{A}_{l-1}^{-1}$ refers to: the ratio of the largest eigenvalue of $\hat{G}_l^{-1}$ to its smallest eigenvalue; and the ratio of the largest eigenvalue of $\hat{A}_{l-1}^{-1}$ to its smallest eigenvalue.
[0093] S106, Obtain the second preset adjustable parameters of the deep learning model.
[0094] In this embodiment of the invention, the second preset adjustable parameter can be selected based on the training results of the deep learning model, choosing the one with the fastest training speed. The second preset adjustable parameter is 0.5, and can be obtained based on experimental results.
[0095] S107. Based on the maximum eigenvalue and inverse matrix of the first matrix and the maximum eigenvalue and inverse matrix of the second matrix, determine the inverse matrix of the network layer.
[0096] Specifically, the quotient of the inverse matrix of the first matrix and the largest eigenvalue of the first matrix is used as the first ratio matrix; and the quotient of the inverse matrix of the second matrix and the largest eigenvalue of the second matrix is used as the second ratio matrix; and the Kronecker product of the first ratio matrix and the second ratio matrix is used as the inverse matrix of the network layer.
[0097] In this embodiment of the invention, the matrix obtained by the Kronecker product of the first ratio matrix and the second ratio matrix is used as the inverse matrix of the network layer, which can further greatly reduce the computational and storage resources for training deep learning models.
[0098] Furthermore, the second preset adjustable parameter is used as the power value to perform a power operation on the largest eigenvalue of the first matrix to obtain the first eigenvalue; the quotient of the inverse matrix of the first matrix and the first eigenvalue is used as the first ratio matrix.
[0099] The second preset adjustable parameter is used as the exponent to perform a power operation on the largest eigenvalue of the second matrix to obtain the second eigenvalue; the quotient of the inverse matrix of the second matrix and the second eigenvalue is used as the second ratio matrix.
[0100] Specifically, the inverse matrix of the network layer is:
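[0101] $F_l^{-1} = \dfrac{\hat{G}_l^{-1}}{\lambda_{\max}(G_l)^{\alpha}} \otimes \dfrac{\hat{A}_{l-1}^{-1}}{\lambda_{\max}(A_{l-1})^{\alpha}}$
[0102] where $F_l^{-1}$ is the inverse matrix of the network layer, $\alpha$ is the second preset adjustable parameter, $\otimes$ denotes the Kronecker product, and $l$ denotes the $l$-th network layer of the deep learning model.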
[0103] The above method further scales all eigenvalues of the inverse matrix of the first matrix and the inverse matrix of the second matrix. As described above, all eigenvalues of the matrices obtained through step S105 are still limited to between 0 and 1. Through the scaling in this scheme, the eigenvalues can be differentiated to a certain extent according to the characteristics of the different matrices, thereby improving the training effect of deep learning models.
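Step S107 can be sketched as below, assuming the two inverse matrices and the two largest eigenvalues are already available. The function name `layer_inverse`, the argument names, and the tiny identity-matrix example are assumptions for illustration; `alpha` plays the role of the second preset adjustable parameter.

```python
import numpy as np

def layer_inverse(G_inv, A_inv, lam_G, lam_A, alpha=0.5):
    """Kronecker product of the two ratio matrices for one network layer."""
    first_ratio = G_inv / lam_G ** alpha     # inverse of first matrix / lambda^alpha
    second_ratio = A_inv / lam_A ** alpha    # inverse of second matrix / lambda^alpha
    return np.kron(first_ratio, second_ratio)

# Hypothetical inputs: identity inverses and largest eigenvalues 4 and 9.
F_inv = layer_inverse(np.eye(2), np.eye(3), lam_G=4.0, lam_A=9.0, alpha=0.5)
# With alpha = 0.5 the ratio matrices are I/2 and I/3, so F_inv equals I/6.
```

For layers of realistic size the Kronecker product would usually be kept in factored form rather than materialized, since the full matrix has (m*n) x (m*n) entries.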
[0104] S108: Based on the inverse matrix of each network layer, obtain the inverse matrix of the Fisher information matrix of the deep learning model.
[0105] Specifically, the inverse matrix of each network layer is used as the diagonal block unit of the inverse matrix of the Fisher information matrix, and the inverse matrix obtained by concatenating the diagonal block units according to a preset rule is used as the inverse matrix of the Fisher information matrix. The preset rule mentioned here can be the order of the network layers from top to bottom in the deep learning model.
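The block-diagonal assembly described in step S108 might look like the following sketch, where the per-layer inverse matrices are placed along the diagonal in layer order. The function name `assemble_block_diagonal` and the toy blocks are hypothetical.

```python
import numpy as np

def assemble_block_diagonal(blocks):
    """Place square per-layer matrices along the diagonal, in the given order."""
    size = sum(b.shape[0] for b in blocks)
    out = np.zeros((size, size))
    offset = 0
    for b in blocks:
        k = b.shape[0]
        out[offset:offset + k, offset:offset + k] = b
        offset += k
    return out

# Hypothetical per-layer inverses for a two-layer model.
layer_inverses = [2.0 * np.eye(2), 3.0 * np.eye(3)]
fisher_inv = assemble_block_diagonal(layer_inverses)   # shape (5, 5)
```

Because the off-diagonal blocks are zero, each layer can also be updated with its own block independently, without ever building the full matrix.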
[0106] S109. Based on the inverse of the Fisher information matrix, the deep learning model is trained to obtain the trained deep learning model.
[0107] Specifically, the descent gradient value of the deep learning model can be obtained first, and then the natural gradient algorithm can be used to train the deep learning model based on the descent gradient value and the inverse of the Fisher information matrix to obtain the trained deep learning model.
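A natural gradient update for one layer can be sketched as follows, under the assumption that the layer's weights form a matrix of shape (outputs, inputs) flattened row by row. In that layout, multiplying the flattened gradient by the Kronecker product of two matrices P and Q equals flattening P times the gradient times Q transposed, so the large matrix never has to be formed. The function name `natural_gradient_step`, the learning rate, and the example matrices are hypothetical.

```python
import numpy as np

def natural_gradient_step(W, grad_W, first_ratio, second_ratio, lr=0.1):
    """One update W <- W - lr * F_l^{-1} grad, with F_l^{-1} kept in factored form."""
    # (P kron Q) vec(X) == vec(P X Q^T) for row-major flattening.
    preconditioned = first_ratio @ grad_W @ second_ratio.T
    return W - lr * preconditioned

# Hypothetical layer with 2 outputs and 3 inputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))
grad_W = rng.standard_normal((2, 3))
P = np.array([[1.0, 0.2], [0.2, 0.8]])     # stands in for the first ratio matrix
Q = np.diag([0.5, 1.0, 1.5])               # stands in for the second ratio matrix
W_new = natural_gradient_step(W, grad_W, P, Q)

# Reference computation with the explicit Kronecker product, for comparison.
W_ref = W - 0.1 * (np.kron(P, Q) @ grad_W.reshape(-1)).reshape(2, 3)
```

The factored form costs two small matrix products per layer instead of one product with an (m*n) x (m*n) matrix.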
[0108] The deep learning model training method provided in this embodiment of the invention trains the deep learning model using training samples to obtain the first matrix and the second matrix of each network layer of the deep learning model. Then, the inverse matrices of the first matrix and the second matrix are approximated by the third matrix, thereby obtaining the inverse matrix of each network layer. Based on the inverse matrix of each network layer, the inverse matrix of the Fisher information matrix of the deep learning model is obtained. The deep learning model is then iteratively trained according to the inverse matrix of the Fisher information matrix to obtain the trained deep learning model.
[0109] In existing technologies, accelerating the training of deep learning models requires obtaining the inverse matrix of the Fisher information matrix of the deep learning model. The inversion operation requires significant computational resources, especially when the deep learning model has a large number of parameters, in which case thousands or even tens of thousands of matrix inversion operations are needed. In contrast, the deep learning model training method provided in this invention obtains the inverse matrix of the Fisher information matrix without performing actual inversion calculations, greatly reducing the computational resources required during deep learning model training and improving the training speed. The deep learning model training method provided in this invention can be used to train deep neural networks in fields such as image processing and scientific computing. For example, neural networks represented by ResNet50 often consist of stacked convolutional layers, typically with a large number of parameters, leading to high computational resource requirements and long training times. Similarly, the SchNet network in the materials science field is built from network layers that carry multiple physical meanings, resulting in a complex network structure, long training times, and high computational resource requirements.
[0110] As can be seen from the above, the deep learning model training method provided in this embodiment of the invention can be applied to image classification models. Therefore, this embodiment of the invention can also provide an image classification method, as detailed below:
[0111] Image training samples are obtained to train the deep learning model, so as to obtain the first matrix and the second matrix of each network layer of the deep learning model;
[0112] The deep learning model consists of several network layers; the first matrix is formed from the expected value of the gradient, back-propagated from the loss function value, with respect to the output of the network layer before its nonlinear mapping; the second matrix is formed from the expected value of the output of the preceding network layer after its nonlinear mapping.
[0113] Based on the third matrix, determine the inverse matrix of the first matrix and the inverse matrix of the second matrix; wherein, the third matrix is the product of a first preset adjustable parameter and a preset identity matrix; the inverse matrix of the first matrix is the difference between the third matrix and the first matrix; the inverse matrix of the second matrix is the difference between the third matrix and the second matrix.
[0114] The inverse matrix of the network layer is determined based on the maximum eigenvalue and inverse matrix of the first matrix and the maximum eigenvalue and inverse matrix of the second matrix.
[0115] Based on the inverse matrix of each network layer, the inverse matrix of the Fisher information matrix of the deep learning model is obtained, and the deep learning model is trained based on the inverse matrix of the Fisher information matrix to obtain an image classification model.
[0116] The image to be identified is input into the image classification model to obtain the image category of the image to be identified.
[0117] Each training image sample includes at least: a sample image and an image classification label. The image classification label is used to indicate the category of the sample image.
[0118] The above scheme accelerates image classification and recognition speed while ensuring image classification accuracy and saving computational resources. It is understood that the specific process of training the image classification model is as shown in the deep learning model training method provided in the above embodiments, and will not be elaborated further here.
[0119] For example, when an image classification model based on the ResNet50 network is trained in an NVIDIA GPU hardware environment, the image classification model trained with the deep learning model training method (QISH) provided in this embodiment of the invention requires, as shown in Figure 3, significantly less computational resources and training time than image classification models trained with algorithms such as SGD, K-FAC, and Shampoo.
[0120] Based on the training method of the deep learning model described above, the present invention also provides a training device for a deep learning model. As shown in Figure 4, the training device includes at least: a first determining module 410, a second determining module 420, a third determining module 430, and a training module 440.
[0121] The first determining module 410 is used to train the deep learning model based on training samples to obtain a first matrix and a second matrix for each network layer of the deep learning model.
[0122] The first matrix represents the expected value of the gradient of the output before the nonlinear mapping of the current layer, which is the backpropagated loss function value; the second matrix represents the expected value of the output after the nonlinear mapping of the previous layer; the deep learning model consists of several network layers.
[0123] The second determining module 420 is used to determine the inverse matrix of the first matrix and the inverse matrix of the second matrix based on the third matrix.
[0124] Wherein, the third matrix is the product of the first preset adjustable parameter and the preset identity matrix; the inverse of the first matrix is the difference between the third matrix and the first matrix; the inverse of the second matrix is the difference between the third matrix and the second matrix.
[0125] The third determining module 430 is used to obtain the inverse matrix of the network layer based on the maximum eigenvalue and inverse matrix of the first matrix and the maximum eigenvalue and inverse matrix of the second matrix.
[0126] The training module 440 is used to obtain the inverse matrix of the Fisher information matrix of the deep learning model based on the inverse matrix of each network layer, so as to train the deep learning model based on the inverse matrix of the Fisher information matrix.
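A rough sketch of how the per-layer inverses might be used in a training step is given below. It assumes, as in K-FAC-style methods, that the inverse of the Fisher information matrix is block-diagonal with one block per network layer, so each block can be applied to that layer's gradient independently. The function name `natural_gradient_step`, the `learning_rate` argument, and the toy values are assumptions for the example; each gradient is assumed to be flattened in an order consistent with its layer inverse.

```python
import numpy as np

def natural_gradient_step(params, grads, layer_inverses, learning_rate=0.1):
    """Apply one natural-gradient update per layer.

    params, grads: lists of flattened per-layer vectors.
    layer_inverses: list of per-layer inverse matrices, treated here as the
    diagonal blocks of the inverse Fisher information matrix (an assumption).
    """
    updated = []
    for theta, grad, inv_block in zip(params, grads, layer_inverses):
        # Precondition the gradient with the layer's block, then step.
        updated.append(theta - learning_rate * (inv_block @ grad))
    return updated

# Toy single-layer example (hypothetical values).
params = [np.array([1.0, 2.0])]
grads = [np.array([0.5, 0.5])]
layer_inverses = [2.0 * np.eye(2)]

new_params = natural_gradient_step(params, grads, layer_inverses)
```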
[0127] Based on the training method of the deep learning model described above, the present invention also provides a computer-readable storage medium storing one or more programs, which can be executed by one or more processors to implement the steps in the training method of the deep learning model described in the above embodiments.
[0128] Based on the training method of the deep learning model described above, the present invention also provides a terminal which, as shown in Figure 5, includes at least one processor 50, a display screen 51, and a memory 52, and may also include a communications interface 53 and a bus 54. The processor 50, display screen 51, memory 52, and communications interface 53 can communicate with each other via the bus 54. The display screen 51 is configured to display a preset user guide interface in the initial setup mode. The communications interface 53 can transmit information. The processor 50 can invoke logical instructions in the memory 52 to execute the methods described in the above embodiments.
[0129] Furthermore, the logic instructions in the aforementioned memory 52 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium.
[0130] The memory 52, as a computer-readable storage medium, can be configured to store software programs, computer-executable programs, such as program instructions or modules corresponding to the methods in the embodiments of this disclosure. The processor 50 executes functional applications and data processing by running the software programs, instructions, or modules stored in the memory 52, thereby implementing the deep learning model training method described in the above embodiments.
[0131] The memory 52 may include a program storage area and a data storage area. The program storage area may store the operating system and application programs required for at least one function; the data storage area may store data created based on the use of the terminal. Furthermore, the memory 52 may include high-speed random access memory (RAM) and non-volatile memory. Examples include various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks; these can also be transient storage media.
[0132] The various embodiments in this application are described in a progressive manner. Parts that are similar or identical between embodiments can be understood by reference to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the device, terminal, and medium embodiments are essentially similar to the method embodiments, so they are described relatively briefly; for the relevant parts, refer to the descriptions of the method embodiments.
[0133] The apparatus, terminal, and medium provided in this application correspond one-to-one to the method. Therefore, the apparatus, terminal, and medium have beneficial technical effects similar to those of the corresponding method. Since the beneficial technical effects of the method have been described in detail above, those of the apparatus, terminal, and medium will not be repeated here.
[0134] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.
[0135] Of course, those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware (such as a processor, controller, etc.). The program can be stored in a computer-readable storage medium, and when executed, it can include the processes described in the above method embodiments. The computer-readable storage medium can be a memory, magnetic disk, optical disk, etc.
[0136] It should be understood that the application of the present invention is not limited to the examples above. Those skilled in the art can make improvements or modifications based on the above description, and all such improvements and modifications should fall within the protection scope of the appended claims.
Claims
1. A method for training a deep learning model, characterized in that the method includes: training the deep learning model based on training samples to obtain a first matrix and a second matrix for each network layer of the deep learning model, wherein the deep learning model consists of several network layers, the first matrix consists of the expected value of the gradient, backpropagated from the loss function value, of the output of the network layer before the nonlinear mapping, and the second matrix consists of the expected value of the output after the nonlinear mapping of the layer preceding the network layer; determining the inverse matrix of the first matrix and the inverse matrix of the second matrix based on a third matrix, wherein the third matrix is the product of a first preset adjustable parameter and a preset identity matrix, the inverse matrix of the first matrix is the difference between the third matrix and the first matrix, and the inverse matrix of the second matrix is the difference between the third matrix and the second matrix; determining the inverse matrix of the network layer based on the maximum eigenvalue and the inverse matrix of the first matrix and the maximum eigenvalue and the inverse matrix of the second matrix; and obtaining the inverse matrix of the Fisher information matrix of the deep learning model based on the inverse matrix of each network layer, and training the deep learning model based on the inverse matrix of the Fisher information matrix to obtain a trained deep learning model; wherein the deep learning model is an image classification model.
2. The method according to claim 1, characterized in that, after obtaining the first matrix and the second matrix of each network layer of the deep learning model, the method further includes: approximating the maximum eigenvalues of the first matrix and the second matrix by the power method.
3. The method according to claim 2, characterized in that the step of approximating the maximum eigenvalues of the first matrix and the second matrix using the power method specifically includes: randomly generating a non-zero vector and calculating the maximum element value of the non-zero vector, wherein the dimension of the non-zero vector is the same as the dimension of the first matrix; normalizing the non-zero vector based on its maximum element value to obtain a normalized non-zero vector; using the product of the normalized non-zero vector and the first matrix as the updated non-zero vector; and repeating the steps of normalizing the non-zero vector based on its maximum element value and using the product of the normalized non-zero vector and the first matrix as the updated non-zero vector until the number of iterations meets a preset condition, thereby obtaining the maximum eigenvalue of the first matrix.
4. The method according to claim 1, characterized in that, before determining the inverse matrix of the first matrix and the inverse matrix of the second matrix based on the third matrix, the method further includes: normalizing the first matrix based on its largest eigenvalue and using the normalized first matrix as the first matrix; and normalizing the second matrix based on its largest eigenvalue and using the normalized second matrix as the second matrix.
5. The method according to claim 1, characterized in that determining the inverse matrix of the network layer based on the maximum eigenvalue and inverse matrix of the first matrix and the maximum eigenvalue and inverse matrix of the second matrix specifically includes: using the quotient of the inverse matrix of the first matrix and the largest eigenvalue of the first matrix as a first ratio matrix; using the quotient of the inverse matrix of the second matrix and the largest eigenvalue of the second matrix as a second ratio matrix; and using the Kronecker product of the first ratio matrix and the second ratio matrix as the inverse matrix of the network layer.
6. The method according to claim 5, characterized in that, before determining the inverse matrix of the network layer based on the maximum eigenvalue and inverse matrix of the first matrix and the maximum eigenvalue and inverse matrix of the second matrix, the method further includes: obtaining a second preset adjustable parameter of the deep learning model; the step of using the quotient of the inverse matrix of the first matrix and the largest eigenvalue of the first matrix as the first ratio matrix specifically includes: raising the largest eigenvalue of the first matrix to the power of the second preset adjustable parameter to obtain a first eigenvalue, and using the quotient of the inverse matrix of the first matrix and the first eigenvalue as the first ratio matrix; and the step of using the quotient of the inverse matrix of the second matrix and the largest eigenvalue of the second matrix as the second ratio matrix specifically includes: raising the largest eigenvalue of the second matrix to the power of the second preset adjustable parameter to obtain a second eigenvalue, and using the quotient of the inverse matrix of the second matrix and the second eigenvalue as the second ratio matrix.
7. The method according to claim 6, characterized in that the second preset adjustable parameter is 0.5.
8. The method according to claim 1, characterized in that the selection range of the first preset adjustable parameter is [1.05, 2].
9. The method according to claim 1, characterized in that the step of training the deep learning model based on the inverse matrix of the Fisher information matrix to obtain the trained deep learning model specifically includes: obtaining the descent gradient value during training of the deep learning model; and training the deep learning model using the natural gradient algorithm based on the descent gradient value and the inverse matrix of the Fisher information matrix to obtain the trained deep learning model.
10. A training apparatus for a deep learning model, wherein the training apparatus is used to implement the training method of the deep learning model according to any one of claims 1-9, characterized in that the apparatus includes: a first determining module, used to train the deep learning model based on training samples to obtain a first matrix and a second matrix for each network layer of the deep learning model, wherein the deep learning model consists of several network layers, the first matrix consists of the expected value of the gradient, backpropagated from the loss function value, of the output of the network layer before the nonlinear mapping, and the second matrix consists of the expected value of the output after the nonlinear mapping of the layer preceding the network layer; a second determining module, used to determine the inverse matrix of the first matrix and the inverse matrix of the second matrix based on a third matrix, wherein the third matrix is the product of a first preset adjustable parameter and a preset identity matrix, the inverse matrix of the first matrix is the difference between the third matrix and the first matrix, and the inverse matrix of the second matrix is the difference between the third matrix and the second matrix; a third determining module, used to obtain the inverse matrix of the network layer based on the maximum eigenvalue and the inverse matrix of the first matrix and the maximum eigenvalue and the inverse matrix of the second matrix; and a training module, used to obtain the inverse matrix of the Fisher information matrix of the deep learning model based on the inverse matrix of each network layer, so as to train the deep learning model based on the inverse matrix of the Fisher information matrix to obtain the trained deep learning model; wherein the deep learning model is an image classification model.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs, which can be executed by one or more processors to implement the steps in the training method of the deep learning model as described in any one of claims 1-9.
12. A terminal, characterized in that it includes: a processor and a memory; the memory stores a computer-readable program that can be executed by the processor; when the processor executes the computer-readable program, it implements the steps of the training method for a deep learning model as described in any one of claims 1-9.
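For illustration only, the power-method steps recited in claim 3 can be sketched in NumPy as follows. The function name, the fixed iteration count used as the "preset condition", the random seed, and the choice of the element with the largest absolute value as the "maximum element value" are all assumptions made for this sketch. The claim does not state how the eigenvalue is read out; here it is taken as the maximum element of the final updated vector, as in the standard scaled power method.

```python
import numpy as np

def power_method_max_eigenvalue(matrix, num_iters=100, seed=0):
    """Estimate the largest eigenvalue of a square matrix by power iteration."""
    rng = np.random.default_rng(seed)
    # Random vector with strictly positive entries, hence non-zero.
    vector = rng.uniform(0.1, 1.0, size=matrix.shape[0])
    estimate = 0.0
    for _ in range(num_iters):
        # Maximum element value (largest magnitude, an assumption).
        max_element = vector[np.argmax(np.abs(vector))]
        normalized = vector / max_element
        # Product of the matrix and the normalized vector is the updated vector.
        vector = matrix @ normalized
        estimate = vector[np.argmax(np.abs(vector))]
    return estimate

# Toy symmetric matrix with eigenvalues 3 and 1 (hypothetical values).
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam_max = power_method_max_eigenvalue(M)
```

Each iteration costs one matrix-vector product, which is the reason this kind of estimate is cheaper than a full eigendecomposition.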