Model parameter updating method and device for neural network model

By combining singular value decomposition and gradient descent, a subspace update method for neural network models is constructed, which solves the problem of limited update space in multi-task learning of neural network models and improves the average performance and accuracy of the model on multiple tasks.

CN116776958B · Active · Publication Date: 2026-06-16 · JD DIGITS HAIYI INFORMATION TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
JD DIGITS HAIYI INFORMATION TECHNOLOGY CO LTD
Filing Date
2023-06-19
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

In existing technologies, the parameters of the neural network model are updated by constraining the new task to update along an orthogonal direction to the subspace formed by the old task. This results in a limited update space for the new task, which restricts the average performance and accuracy of the neural network model across multiple tasks.

Method used

The left singular vectors of the representation matrix of the training data in each layer of the neural network model are determined by singular value decomposition. A k-rank approximation operation, whose hyperparameter value is less than a hyperparameter threshold, determines how many of these vectors form the subspace. During gradient descent, the objective is to maximize the average loss in the neighborhood of the model parameters, the gradient components orthogonal to the subspace are used for updating, and the training data subspaces of multiple tasks are merged to improve the update space.

Benefits of technology

It improves the average performance and accuracy of neural network models across multiple tasks, reduces the forgetting of old tasks, and enhances the performance of new tasks without requiring additional model parameters or caching of old task samples.



Abstract

The present disclosure relates to a model parameter updating method and device. The model parameter updating method comprises: obtaining first training data and second training data; updating, according to the first training data, initial model parameters of a neural network model along a gradient descent direction to obtain first model parameters, with the goal of maximizing the average loss within the neighborhood of the initial model parameters; determining, through singular value decomposition, a plurality of left singular vectors of a representation matrix of the first training data based on the first model parameters, as a subspace formed by the first training data, the number of the left singular vectors being determined through a k-rank approximation operation in which the value of the hyperparameter used is less than a hyperparameter threshold; determining a second gradient of a loss function according to the second training data, with the goal of maximizing the average loss within the neighborhood of the first model parameters; determining the gradient component of the second gradient that is orthogonal to the subspace formed by the first training data; and updating the first model parameters along the gradient descent direction according to the orthogonal gradient component to obtain second model parameters.

Description

Technical Field

[0001] This disclosure relates to the field of machine learning, and in particular to a method and apparatus for updating model parameters for neural network models, and a computer-storable medium.

Background Technology

[0002] Humans can learn a series of tasks encountered consecutively without forgetting previously learned knowledge. For example, students who learn computer science in university can still remember the math and physics they learned in high school. In recent years, with the rise of artificial intelligence, more and more research has focused on using AI algorithms to serve humanity and society. Therefore, researchers hope that neural networks can also possess this ability of continuous learning. For example, as shown in Figure 1, the neural network model f with continuous learning ability first learns Task 1, namely the digit classification task, and then learns Task 2, namely the cat and dog classification task. The goal of continuous learning is to ensure that after learning Task 2, model f can still accurately classify handwritten digits.

[0003] In related technologies, the forgetting of old tasks that occurs when a neural network model updates its parameters for a new task is mitigated by constraining the gradient direction of the parameter updates for the new task. For example, updates for the new task are performed along directions orthogonal to the subspace formed by the old task.

Summary of the Invention

[0004] In related technologies, in order to reduce the degree of forgetting of old tasks, when the subspace of the old tasks is constructed using singular value decomposition, the hyperparameter used by the k-rank approximation operation to determine the multiple left singular vectors constituting that subspace is usually set to a large value. This limits the update space of new tasks, so the neural network model does not have enough update space to learn new tasks. This greatly limits the accuracy of the neural network model on new tasks and results in low average performance or average accuracy across multiple tasks.

[0005] To address the aforementioned technical problems, this disclosure proposes a solution that can improve the average performance or average accuracy of neural network models used for multiple tasks across multiple tasks.

[0006] According to a first aspect of this disclosure, a method for updating model parameters of a neural network model is provided, comprising: acquiring first training data and second training data corresponding to different learning tasks; updating the initial model parameters of each layer of the neural network model along the direction of gradient descent based on the first training data, with the objective of maximizing the average loss in the neighborhood of the initial model parameters of each layer of the neural network model, to obtain first model parameters; determining, through singular value decomposition, multiple left singular vectors of the representation matrix of the first training data in each layer of the network based on the first model parameters, as a subspace constituted by the first training data, wherein the number of the multiple left singular vectors is determined by a k-rank approximation operation, and the hyperparameter value used in the k-rank approximation operation is less than a hyperparameter threshold; determining, based on the second training data, the second gradient of the loss function in each layer of the network, with the goal of maximizing the average loss in the neighborhood of the first model parameters of each layer of the neural network model; determining the gradient components in the second gradient that are orthogonal to the subspace formed by the first training data as the orthogonal gradient components corresponding to each layer of the network; and updating, based on the orthogonal gradient components corresponding to each layer of the network, the first model parameters of each layer of the network along the direction of gradient descent to obtain the second model parameters.

[0007] In some embodiments, updating the initial model parameters of each network layer includes: determining a first gradient of the loss function in each network layer based on the first training data, with the objective of maximizing the average loss in the neighborhood of the initial model parameters of each network layer of the neural network model; and updating the initial model parameters of each network layer along the direction of gradient descent based on the first gradient of each network layer.

[0008] In some embodiments, determining the first gradient of the loss function in each layer of the network includes: determining a first objective function based on the first training data, wherein the first objective function aims to maximize the average loss in the neighborhood of the initial model parameters of each layer of the neural network model; approximating the first objective function using a first-order Taylor expansion to obtain a first objective value of the neighborhood that maximizes the average loss in the neighborhood of the initial model parameters; and determining the first gradient based on the first training data and the first objective value.

[0009] In some embodiments, determining the first gradient based on the first training data and the first target value includes: perturbing the initial model parameters using the first target value to obtain initial perturbed model parameters; determining a functional expression of the average loss of the first training data under the initial perturbed model parameters as a first functional expression; and determining the gradient of the first functional expression with respect to the initial perturbed model parameters as a first gradient.

[0010] In some embodiments, determining the first objective function based on the first training data includes: determining a functional expression for the average loss in the neighborhood of the initial model parameters of each layer of the neural network model based on the first training data; and determining the first objective function based on the functional expression for the average loss in the neighborhood of the initial model parameters of each layer of the neural network model, wherein the first objective function aims to maximize the average loss in the neighborhood of the initial model parameters of each layer of the neural network model.

[0011] In some embodiments, obtaining first training data and second training data corresponding to different learning tasks includes: obtaining initial training data corresponding to the first learning task and initial training data corresponding to the second learning task; and performing data augmentation on the initial training data corresponding to the first learning task and the initial training data corresponding to the second learning task, respectively, to obtain the first training data and the second training data.

[0012] In some embodiments, determining the gradient components in the second gradient that are orthogonal to the subspace formed by the first training data as orthogonal gradient components corresponding to each layer of the network includes: projecting the second gradient onto the subspace formed by the first training data to obtain the projected gradient components of the second gradient; and removing the projected gradient components from the second gradient to obtain the orthogonal gradient components corresponding to each layer of the network.
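The projection-and-removal step in [0012] can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: it assumes the subspace is supplied as a matrix whose columns are orthonormal left singular vectors and that the layer gradient has been flattened to a vector. The function name `orthogonal_component` is hypothetical.

```python
import numpy as np

def orthogonal_component(grad, basis):
    """Remove from grad its projection onto the column space of basis.

    Assumes the columns of basis are orthonormal (an assumption of this sketch).
    """
    projected = basis @ (basis.T @ grad)  # component lying inside the subspace
    return grad - projected               # component orthogonal to the subspace

# Toy example: the old-task subspace is spanned by the first two axes of R^4.
basis = np.eye(4)[:, :2]
grad = np.array([1.0, 2.0, 3.0, 4.0])
orth = orthogonal_component(grad, basis)
print(orth)
```

In this toy case the first two coordinates of the gradient lie in the old-task subspace and are removed, so an update along `orth` leaves those directions unchanged.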

[0013] In some embodiments, determining the second gradient of the loss function in each layer of the neural network, based on the second training data and with the objective of maximizing the average loss in the neighborhood of the first model parameters of each layer of the neural network, includes: determining a second objective function based on the second training data, wherein the second objective function aims to maximize the average loss in the neighborhood of the first model parameters of each layer of the neural network; approximating the second objective function using a first-order Taylor expansion to obtain a second objective value for the neighborhood that maximizes the average loss in the neighborhood of the first model parameters; and determining the second gradient based on the second training data and the second objective value.

[0014] In some embodiments, determining the second gradient based on the second training data and the second target value includes: perturbing the first model parameters using the second target value to obtain first perturbed model parameters; determining a functional expression of the average loss of the second training data under the first perturbed model parameters as a second functional expression; and determining the gradient of the second functional expression with respect to the first perturbed model parameters as a second gradient.

[0015] In some embodiments, the first training data corresponds to a first learning task, and the second training data corresponds to a second learning task. The neural network model is further configured to learn a third learning task after learning the first learning task and the second learning task. The model parameter update method further includes: determining, through singular value decomposition, multiple left singular vectors of the representation matrix of the second training data in each layer of the network based on the second model parameters, as a subspace formed by the second training data, wherein the number of multiple left singular vectors corresponding to the second training data is determined by a k-rank approximation operation, and the value of the hyperparameter used in the k-rank approximation operation is less than the hyperparameter threshold; merging the multiple left singular vectors corresponding to the first training data and the multiple left singular vectors corresponding to the second training data to obtain a subspace formed by the first training data and the second training data; obtaining third training data corresponding to the third learning task; determining, based on the third training data, the third gradient of the loss function in each layer of the network, with the goal of maximizing the average loss in the neighborhood of the second model parameters of each layer of the neural network model; determining the gradient components in the third gradient that are orthogonal to the subspace formed by the first training data and the second training data as the orthogonal gradient components of the third gradient corresponding to each layer of the network; and updating, based on the orthogonal gradient components of the third gradient corresponding to each layer of the network, the second model parameters of each layer of the network along the direction of gradient descent to obtain the third model parameters.
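The merging of the two sets of left singular vectors in [0015] can be illustrated with a short sketch. The text only states that the vectors are merged. Stacking the two bases and re-orthonormalizing them with an SVD, as done below, is an assumption of this example, as are the function name `merge_bases` and the tolerance `tol`.

```python
import numpy as np

def merge_bases(basis_a, basis_b, tol=1e-10):
    """Return an orthonormal basis for the span of both column sets.

    Stacking and re-orthonormalizing via SVD is an illustrative choice,
    not a procedure stated in the text.
    """
    stacked = np.hstack([basis_a, basis_b])
    u, s, _ = np.linalg.svd(stacked, full_matrices=False)
    return u[:, s > tol]  # drop directions that are duplicated or numerically zero

# Toy example in R^4: task 1 spans axes 0 and 1, task 2 spans axes 1 and 2.
basis_a = np.eye(4)[:, :2]
basis_b = np.eye(4)[:, 1:3]
merged = merge_bases(basis_a, basis_b)
print(merged.shape)
```

The shared direction (axis 1) is counted once, so the merged subspace has dimension 3 and the remaining axis stays free for updates on the third task.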

[0016] In some embodiments, the hyperparameter threshold is less than or equal to 0.96.

[0017] According to a second aspect of this disclosure, a model parameter updating apparatus for a neural network model is provided, comprising: an acquisition module configured to acquire first training data and second training data corresponding to different learning tasks; a first updating module configured to update the initial model parameters of each layer of the neural network model along the direction of gradient descent based on the first training data, with the objective of maximizing the average loss in the neighborhood of the initial model parameters of each layer of the neural network model, to obtain first model parameters; a first determining module configured to determine, through singular value decomposition, multiple left singular vectors of the representation matrix of the first training data in each layer of the network based on the first model parameters, as a subspace constituted by the first training data, wherein the number of the multiple left singular vectors is determined by a k-rank approximation operation, and the hyperparameter value used in the k-rank approximation operation is less than a hyperparameter threshold; a second determining module configured to determine the second gradient of the loss function in each layer of the neural network based on the second training data, with the objective of maximizing the average loss in the neighborhood of the first model parameters of each layer of the neural network; a third determining module configured to determine the gradient components in the second gradient that are orthogonal to the subspace formed by the first training data, as the orthogonal gradient components corresponding to each layer of the network; and a second updating module configured to update the first model parameters of each layer of the network along the gradient descent direction based on the orthogonal gradient components corresponding to each layer of the network, to obtain the second model parameters.

[0018] According to a third aspect of this disclosure, a model parameter update apparatus for a neural network model is provided, comprising: a memory; and a processor coupled to the memory, the processor being configured to execute the model parameter update method for a neural network model according to any of the above embodiments based on instructions stored in the memory.

[0019] According to a fourth aspect of this disclosure, a computer-storable medium is provided that stores computer program instructions thereon, which, when executed by a processor, implement the model parameter update method for a neural network model as described in any of the above embodiments.

[0020] In the above embodiments, the average performance or average accuracy of a neural network model used for multiple tasks can be improved across multiple tasks.

Attached Figure Description

[0021] The accompanying drawings, which form part of this specification, illustrate embodiments of this disclosure and, together with the specification, serve to explain the principles of this disclosure.

[0022] This disclosure will become clearer with reference to the accompanying drawings and the following detailed description, wherein:

[0023] Figure 1 is a schematic diagram illustrating a neural network model with continuous learning capabilities learning multiple tasks;

[0024] Figure 2 is a schematic diagram comparing the use of the gradient projection memory method and the stochastic gradient descent method to update model parameters in related technologies;

[0025] Figure 3 is a flowchart illustrating a method for updating model parameters for a neural network model according to some embodiments of the present disclosure;

[0026] Figure 4 is a schematic diagram comparing loss surfaces of different flatness according to some embodiments of the present disclosure;

[0027] Figure 5 is a schematic diagram comparing the accuracy on a new task of a neural network model obtained by a model parameter update method according to some embodiments of the present disclosure and a neural network model obtained by using a gradient projection memory method;

[0028] Figure 6 is a block diagram illustrating a model parameter updating apparatus for a neural network model according to some embodiments of the present disclosure;

[0029] Figure 7 is a block diagram illustrating a model parameter update apparatus for a neural network model according to other embodiments of the present disclosure;

[0030] Figure 8 is a block diagram illustrating a computer system for implementing some embodiments of the present disclosure.

Detailed Implementation

[0031] Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specifically stated, the relative arrangement, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the present disclosure.

[0032] At the same time, it should be understood that, for ease of description, the dimensions of the various parts shown in the accompanying drawings are not drawn according to actual scale.

[0033] The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit this disclosure or its application or use.

[0034] Techniques, methods, and equipment known to those skilled in the art may not be discussed in detail, but where appropriate, such techniques, methods, and equipment should be considered part of the specification.

[0035] In all examples shown and discussed herein, any specific values should be interpreted as merely exemplary and not as limitations. Therefore, other examples of exemplary embodiments may have different values.

[0036] It should be noted that similar labels and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be discussed further in subsequent figures.

[0037] Figure 2 is a schematic diagram showing a comparison between updating model parameters using the gradient projection memory method and updating model parameters using the stochastic gradient descent method in related technologies.

[0038] As shown in Figure 2, this disclosure compares the performance of the GPM (Gradient Projection Memory) method and the traditional SGD (Stochastic Gradient Descent) method on a new task using the CIFAR100 computer vision dataset. The comparison reveals that the neural network model updated with GPM achieves only 69.66% classification accuracy on the new task T8, while the neural network model updated with SGD achieves 75.84% accuracy. Here, since the traditional SGD method lacks continuous learning capabilities, its performance on a new task can be used as an upper bound for continuous learning tasks. Analysis shows that the strong projection constraints in the GPM method leave insufficient update space on new tasks, resulting in poor performance of the neural network model updated with GPM on new tasks.

[0039] Therefore, there is an urgent need for a model parameter update method that can both reduce the forgetting of old tasks by neural network models and improve the performance of neural network models on new tasks.

[0040] Figure 3 is a flowchart illustrating a method for updating model parameters for a neural network model according to some embodiments of the present disclosure.

[0041] As shown in Figure 3, the model parameter update method for a neural network model includes steps S310 to S360. For example, the model parameter update method for a neural network model is executed by a model parameter update device for a neural network model. The neural network model is a neural network model with continuous learning capabilities and can be used for different learning tasks.

[0042] In step S310, first training data and second training data corresponding to different learning tasks are obtained.

[0043] In some embodiments, initial training data corresponding to a first learning task and initial training data corresponding to a second learning task are first obtained; then, the initial training data corresponding to the first learning task and the initial training data corresponding to the second learning task are respectively augmented to obtain the first training data and the second training data. That is, the first training data corresponds to the first learning task, and the second training data corresponds to the second learning task. For example, the first learning task is the first task learned by the neural network model.

[0044] Data augmentation enhances training data, allowing the optimization of flatness to be explored in a broader data space and adapting to distribution drift issues in continuous learning. This further improves the average performance or average accuracy of the neural network model across all learning tasks. Continuous learning uses a neural network model to learn on a series of sequentially arriving tasks. Since each task has its own data distribution, distribution drift exists between previous and subsequent tasks. Furthermore, data augmentation through perturbation makes the neural network model more robust to adversarial examples.

[0045] In some embodiments, the initial training data corresponding to different learning tasks can be perturbed to expand the quantity of training data. For example, taking images as the initial training data, a mixup approach can be used to overlay another image onto the current image to perform data perturbation or interpolation, i.e., two images $X_i$ and $X_j$ and their corresponding labels $Y_i$ and $Y_j$ are superimposed separately to obtain a new image $\tilde{X}$ and a new label $\tilde{Y}$. This creates new training data $(\tilde{X}, \tilde{Y})$, referred to, for example, as perturbed training data. This process can be represented as $\tilde{X} = \gamma X_i + (1-\gamma) X_j$ and $\tilde{Y} = \gamma Y_i + (1-\gamma) Y_j$, where $\gamma \sim \mathrm{Beta}(\alpha, \alpha) \in [0, 1]$ represents the proportion in which the two images are superimposed. It is sampled from a Beta distribution, where α determines the shape of the Beta distribution and is set to 20 by default. The mixup data augmentation strategy linearly weights the two images and their labels. Other data augmentation strategies can also be used in this disclosure.
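The mixup perturbation described in [0045] can be sketched as follows. This is a minimal illustration, assuming images are NumPy arrays of equal shape and labels are one-hot vectors. The function name `mixup`, the image size, and the class count are hypothetical choices for the example.

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=20.0, rng=None):
    """Blend two samples and their one-hot labels with a Beta-sampled weight."""
    rng = np.random.default_rng() if rng is None else rng
    gamma = rng.beta(alpha, alpha)             # mixing proportion in [0, 1]
    x_new = gamma * x_i + (1.0 - gamma) * x_j  # superimposed image
    y_new = gamma * y_i + (1.0 - gamma) * y_j  # correspondingly weighted label
    return x_new, y_new, gamma

rng = np.random.default_rng(0)
x_i = rng.random((3, 32, 32))  # two stand-in images in channel-first layout
x_j = rng.random((3, 32, 32))
y_i = np.eye(10)[2]            # one-hot label for class 2
y_j = np.eye(10)[7]            # one-hot label for class 7
x_new, y_new, gamma = mixup(x_i, y_i, x_j, y_j, rng=rng)
```

With α = 20 the Beta distribution concentrates around 0.5, so the two images usually contribute in roughly equal proportion.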

[0046] In some embodiments, the initial training data can be preprocessed to obtain the first training data. For example, the input format and label format of the training data for all learning tasks can be unified. Taking images as training data, each input image is processed into a format of [C, H1, H2], where the first dimension C represents the number of image channels, the second dimension H1 represents the width of the image, and the third dimension H2 represents the height of the image. Simultaneously, the labels corresponding to the images are processed into one-hot vector codes, such as [1, 0, 0, 0, 0, 0, 0, 0, 0], where the length of the vector equals the number of categories included in the task, the position corresponding to the category to which the image belongs is encoded as 1, and the remaining positions are encoded as 0.
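The format unification in [0046] can be illustrated with a short sketch. It assumes, for illustration only, that raw images arrive in channel-last layout and that labels are integer class indices. The helper names `channel_first` and `one_hot` are hypothetical.

```python
import numpy as np

def channel_first(image):
    """Move the channel axis of a channel-last image to the front."""
    return np.moveaxis(image, -1, 0)

def one_hot(label, num_classes):
    """Encode an integer class index as a one-hot vector."""
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[label] = 1.0
    return vec

image = np.zeros((32, 28, 3))       # stand-in channel-last image
print(channel_first(image).shape)   # channel axis now comes first
print(one_hot(0, 9))                # class 0 of a 9-category task
```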

[0047] In some embodiments, the first training data is a small batch of training data sampled from the preprocessed initial training data. For example, 256 training samples can be sampled per batch as the first training data. The training process can be performed in batches.

[0048] In step S320, based on the first training data, with the goal of maximizing the average loss in the neighborhood of the initial model parameters of each layer of the neural network model, the initial model parameters of each layer are updated along the direction of gradient descent to obtain the first model parameters.

[0049] In some embodiments, step S320 can be implemented by the following steps 1)-2).

[0050] In step 1), based on the first training data, the first gradient of the loss function in each layer of the neural network is determined with the goal of maximizing the average loss in the neighborhood of the initial model parameters of each layer of the neural network model.

[0051] In some embodiments, step 1 above can be implemented as follows.

[0052] First, based on the first training data, a first objective function is determined, wherein the first objective function aims to maximize the average loss in the neighborhood of the initial model parameters of each layer of the neural network model.

[0053] In some embodiments, determining the first objective function based on the first training data includes: determining a functional expression for the average loss in the neighborhood of the initial model parameters of each layer of the neural network model based on the first training data; and determining the first objective function based on the functional expression for the average loss in the neighborhood of the initial model parameters of each layer of the neural network model, wherein the first objective function aims to maximize the average loss in the neighborhood of the initial model parameters of each layer of the neural network model.

[0054] For example, taking the first training data as including the initial training data (X,Y), the first objective function is expressed as $\max_{\|\delta\| \le \rho} \mathcal{L}(W + \delta; X, Y)$, where $\mathcal{L}$ represents the loss function, W represents the initial model parameters of the neural network model, δ represents a value in the neighborhood (the perturbation applied to W), and ρ represents the neighborhood radius within which W is perturbed (e.g., set to 0.05). $\mathcal{L}(W + \delta; X, Y)$ is a functional expression for the average loss in the neighborhood of the initial model parameters of each layer of the neural network model.

[0055] For example, taking the first training data as including the initial training data (X,Y) and the perturbed training data $(\tilde{X}, \tilde{Y})$, the first objective function is expressed as $\max_{\|\delta\| \le \rho} \left[ \mathcal{L}(W + \delta; X, Y) + \lambda \mathcal{L}(W + \delta; \tilde{X}, \tilde{Y}) \right]$, where $\mathcal{L}$ represents the loss function, W represents the initial model parameters of the neural network model, δ represents a value in the neighborhood, and ρ represents the neighborhood radius within which W is perturbed (e.g., set to 0.05). $\mathcal{L}(W + \delta; X, Y) + \lambda \mathcal{L}(W + \delta; \tilde{X}, \tilde{Y})$ is a functional expression for the average loss in the neighborhood of the initial model parameters of each layer of the neural network model, where λ is used to control the weight of the loss on the perturbed training data (e.g., searched over [0.001, 0.01, 0.1]).

[0056] Then, using a first-order Taylor expansion, the first objective function is approximated to obtain the first objective value of the neighborhood that maximizes the average loss in the neighborhood of the initial model parameters.

[0057] For example, where the first training data includes the initial training data (X,Y) and the perturbed training data, the first objective function aims to find the location where the loss is maximized within a neighborhood of the initial model parameters W with radius ρ in both the original and perturbed data spaces. Since finding the exact solution is extremely difficult, this disclosure uses a first-order Taylor expansion to approximate the function, and thus obtains the perturbation that maximizes the average loss within the neighborhood of the initial model parameters (i.e., the first target value) as follows:

[0058]
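The first-order approximation described in [0057] can be written out as a worked derivation. This is a sketch under stated assumptions rather than the formula of the disclosure: it assumes the neighborhood is a Euclidean ball of radius ρ, and the shorthand $\mathcal{J}(W) = \mathcal{L}(W; X, Y) + \lambda \mathcal{L}(W; \tilde{X}, \tilde{Y})$ for the combined average loss is introduced here only for illustration.

```latex
\begin{aligned}
\mathcal{J}(W + \delta) &\approx \mathcal{J}(W) + \delta^{\top} \nabla_W \mathcal{J}(W), \\
\hat{\delta} &= \arg\max_{\|\delta\|_2 \le \rho} \; \delta^{\top} \nabla_W \mathcal{J}(W)
  = \rho \, \frac{\nabla_W \mathcal{J}(W)}{\|\nabla_W \mathcal{J}(W)\|_2}.
\end{aligned}
```

The second line follows because a linear function over a Euclidean ball is maximized at the boundary point aligned with its gradient.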

[0059] Finally, the first gradient is determined based on the first training data and the first target value.

[0060] The first training data includes the initial training data (X,Y) and the perturbed training data. For example, the first gradient is:

[0061]

[0062] In some embodiments, the first gradient can be determined based on the first training data and the target value in the following manner.

[0063] First, the initial model parameters are perturbed using the target value to obtain initial perturbed model parameters. For example, the initial perturbed model parameters are...

[0064] Then, a functional expression for the average loss of the first training data under the initial perturbed model parameters is determined, which is taken as the first functional expression. For example, the functional expression for the average loss of the first training data under the initial perturbed model parameters is:

[0065] Finally, the gradient of the first functional expression with respect to the initial perturbed model parameters is determined as the first gradient. For example, the first gradient is... Since the goal in calculating the first gradient is to maximize the average loss in the neighborhood of the initial model parameters of each layer of the neural network model, the first gradient is a flatness-aware gradient.

[0066] In step 2), the initial model parameters of each network layer are updated along the gradient descent direction based on the first gradient of each layer. Taking the first learning task as an example, the gradient descent process can be represented as $W_t^l \leftarrow W_t^l - \eta \, g_t^l$, where $l$ represents the layer identifier of the network, $\eta$ represents the step size, $W^l$ represents the initial model parameters of the $l$-th layer network, $t$ represents the identifier of the learning task, and $g_t^l$ denotes the first gradient of the $l$-th layer determined in step 1).
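Steps 1) and 2) can be sketched together on a toy problem. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: it uses a single-layer linear model with a mean squared error loss, omits the perturbed-data term, and assumes a Euclidean neighborhood. The function names `loss_and_grad` and `flatness_aware_grad` are hypothetical.

```python
import numpy as np

def loss_and_grad(w, x, y):
    """Mean squared error of a linear model and its gradient with respect to w."""
    residual = x @ w - y
    loss = np.mean(residual ** 2)
    grad = 2.0 * x.T @ residual / len(y)
    return loss, grad

def flatness_aware_grad(w, x, y, rho=0.05):
    """Gradient evaluated at the approximate worst-case point within radius rho."""
    _, grad = loss_and_grad(w, x, y)
    # First-order estimate of the loss-maximizing perturbation (Euclidean ball assumed).
    delta = rho * grad / (np.linalg.norm(grad) + 1e-12)
    _, grad_perturbed = loss_and_grad(w + delta, x, y)
    return grad_perturbed, delta

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 4))            # one batch of 256 samples
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = x @ w_true
w = np.zeros(4)                          # initial model parameters
eta = 0.1                                # step size
loss_before, _ = loss_and_grad(w, x, y)
for _ in range(200):
    g, delta = flatness_aware_grad(w, x, y)
    w = w - eta * g                      # gradient descent with the perturbed gradient
loss_after, _ = loss_and_grad(w, x, y)
```

Each iteration costs two gradient evaluations, one to estimate the perturbation and one at the perturbed parameters.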

[0067] In the above embodiments, the objective of maximizing the average loss in the neighborhood of the initial model parameters of each layer of the neural network model and the process of updating the model parameters along the gradient descent direction can be simplified as follows:

[0068]

[0069] The first row in the formula represents the original data (X,Y) and the data after perturbation. The empirical risk loss is shown in the second and third rows, representing the worst-case loss surface on both the original and perturbed data. By minimizing multiple losses in P2, the empirical risk loss can be minimized simultaneously, while maximizing the flatness of the network. By eliminating redundant terms in this formula, it is equivalent to optimizing the following objective function.

[0070] In step S330, singular value decomposition is used to determine multiple left singular vectors of the representation matrix of the first training data in each layer of the network based on the first model parameters, which are used as a subspace formed by the first training data. The number of the multiple left singular vectors is determined by a k-rank approximation operation, and the value of the hyperparameter used in the k-rank approximation operation is less than the hyperparameter threshold.

[0071] In some embodiments, step S330 can be implemented in the following manner.

[0072] First, based on the first model parameters, the representation matrix of the first training data in each layer of the network is determined. For example, a batch of data from learning task 1 is randomly sampled and input into each layer of the neural network model to obtain the input representation of each layer, i.e.

[0073] Then, based on the representation matrix of the first training data in each layer of the network, Singular Value Decomposition (SVD) is performed to obtain the basis of the subspace formed by the first training data. For example, after performing SVD, $R^{(1),l} = U^{(1),l} \Sigma^{(1),l} (V^{(1),l})^{\top}$, where $U^{(1),l}$ is the left singular matrix containing multiple left singular vectors, $V^{(1),l}$ is the right singular matrix containing multiple right singular vectors, and $\Sigma^{(1),l}$ contains the singular values sorted along its main diagonal. All left singular vectors in the left singular matrix form the basis of the subspace formed by the first training data.

[0074] Finally, through the k-rank approximation operation, the k left singular vectors corresponding to the k largest singular values are selected from all left singular vectors as the subspace formed by the first training data, used to approximately represent the representation matrix of the first training data in each layer of the network. The value of the hyperparameter used in the k-rank approximation operation is less than the hyperparameter threshold, and a selection rule based on this hyperparameter determines k. For example, the hyperparameter threshold is less than or equal to 0.96. For instance, hyperparameter values include 0.94 or 0.95. Typically, hyperparameter values range from 0.9 to 1.0. For example, the k left singular vectors selected through the k-rank approximation operation can be used as the most salient basis and stored in the projection matrix M of the subspace S representing the learned learning tasks, where the i-th stored vector is the left singular vector in the i-th column of the left singular matrix.
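The SVD and k-rank selection described above can be sketched as follows. The selection rule of this disclosure is not reproduced in the text, so the sketch assumes a commonly used energy criterion: keep the smallest k whose singular values account for at least a fraction `threshold` of the squared Frobenius norm of the representation matrix. The function name `select_basis` and the toy data are hypothetical.

```python
import numpy as np

def select_basis(R, threshold=0.95):
    """Return the k leading left singular vectors of R (features x samples).

    Assumed rule: smallest k such that the top-k singular values carry at
    least `threshold` of the total squared singular value mass.
    """
    U, S, _ = np.linalg.svd(R, full_matrices=False)
    energy = np.cumsum(S ** 2) / np.sum(S ** 2)
    # First index at which the cumulative energy reaches the threshold.
    k = int(np.searchsorted(energy, threshold) + 1)
    k = min(k, len(S))
    return U[:, :k]

rng = np.random.default_rng(0)
# Toy representation matrix: 6 features, 40 samples, approximately rank 3.
R = rng.normal(size=(6, 3)) @ rng.normal(size=(3, 40))
R += 0.01 * rng.normal(size=R.shape)

M_low = select_basis(R, threshold=0.90)
M_mid = select_basis(R, threshold=0.95)
M_high = select_basis(R, threshold=0.99)
```

Under this assumed rule, lowering the hyperparameter can only keep the same number of basis vectors or fewer, which is consistent with the description's point that a smaller value leaves a larger update space for later tasks.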

[0075] In step S340, based on the second training data, the second gradient of the loss function in each layer of the neural network is determined with the objective of maximizing the average loss in the neighborhood of the first model parameter of each layer of the neural network model.

[0076] In some embodiments, the second gradient of the loss function in each layer of the neural network can be determined in the following manner, based on the second training data and with the objective of maximizing the average loss in the neighborhood of the first model parameters of each layer of the neural network model.

[0077] First, based on the second training data, a second objective function is determined, wherein the second objective function aims to maximize the average loss in the neighborhood of the first model parameter of each layer of the neural network model.

[0078] Then, the second objective function is approximated using a first-order Taylor expansion to obtain the second target value of the neighborhood that maximizes the average loss in the neighborhood of the first model parameters.

[0079] Finally, the second gradient is determined based on the second training data and the second target value.
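The role of the first-order Taylor expansion in the steps above can be illustrated with the following worked approximation. It is a sketch under assumptions rather than the formula of this disclosure: the neighborhood is assumed to be a Euclidean ball of radius $\rho$, $W$ stands for the first model parameters, $L$ for the average loss on the second training data, and $\epsilon^{*}$ for the resulting target value; all of these symbols are hypothetical.

```latex
\max_{\|\epsilon\|_2 \le \rho} L(W + \epsilon)
\;\approx\;
\max_{\|\epsilon\|_2 \le \rho} \left[ L(W) + \epsilon^{\top} \nabla_W L(W) \right]
\quad\Longrightarrow\quad
\epsilon^{*} = \rho \, \frac{\nabla_W L(W)}{\|\nabla_W L(W)\|_2}
```

Because the linearized objective is maximized by moving as far as allowed along the gradient, the target value has a closed form, and the second gradient can then be evaluated at the perturbed parameters $W + \epsilon^{*}$.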

[0080] In some embodiments, the second gradient can be determined based on the second training data and the second target value in the following manner.

[0081] First, the first model parameters are perturbed using the second target value to obtain the first perturbed model parameters.

[0082] Then, the functional expression of the average loss of the first training data under the first perturbed model parameters is determined as the second functional expression.

[0083] Finally, the gradient of the second functional expression with respect to the first perturbed model parameters is determined as the second gradient.

[0084] The specific implementation process described above can be found in the process of determining the first gradient, which will not be repeated here.

[0085] In step S350, the gradient components in the second gradient that are orthogonal to the subspace formed by the first training data are determined as the orthogonal gradient components corresponding to each layer of the network.

[0086] In some embodiments, step S350 can be implemented as follows.

[0087] First, the second gradient is projected onto the subspace formed by the first training data to obtain the projected gradient components of the second gradient. For example, the projected gradient components are represented as follows:

[0088] Then, the projected gradient component in the second gradient is removed to obtain the orthogonal gradient component corresponding to each layer of the network. For example, the orthogonal gradient component is represented as...

[0089] In step S360, based on the orthogonal gradient components corresponding to each network layer, the first model parameters of each network layer are updated along the gradient descent direction to obtain the second model parameters. For example, this gradient descent process can be represented as follows:
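Steps S350 and S360 can be sketched as follows. This is a minimal illustration under assumptions: the layer parameters `W` are stored as an (input features x output features) matrix, the columns of `M` form an orthonormal basis of the subspace of the learned task, and the step size `eta` and all names are hypothetical.

```python
import numpy as np

def orthogonal_component(grad, M):
    """Split grad into its projection onto span(M) and the orthogonal remainder."""
    projected = M @ (M.T @ grad)      # projected gradient component
    return grad - projected, projected

rng = np.random.default_rng(0)
M, _ = np.linalg.qr(rng.normal(size=(6, 2)))   # orthonormal basis, shape (6, 2)
grad = rng.normal(size=(6, 3))                 # toy second gradient
W = rng.normal(size=(6, 3))                    # toy first model parameters
eta = 0.1                                      # step size

g_orth, g_proj = orthogonal_component(grad, M)
W_new = W - eta * g_orth                       # update along the orthogonal component

# Any input lying in the old-task subspace produces the same layer output
# before and after the update.
x_old = M @ rng.normal(size=(2,))
out_before = x_old @ W
out_after = x_old @ W_new
```

The last three lines show why removing the projected component limits forgetting: inputs that lie inside the stored subspace are mapped to the same outputs by the updated layer.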

[0090] In some embodiments, the first training data corresponds to a first learning task, the second training data corresponds to a second learning task, and the neural network model is further configured to learn a third learning task after learning the first learning task and the second learning task. In this case, the model parameter update method further includes the following steps.

[0091] First, by singular value decomposition, multiple left singular vectors of the representation matrix of the second training data in each layer of the network based on the second model parameters are determined as a subspace formed by the second training data. The number of multiple left singular vectors corresponding to the second training data is determined by a k-rank approximation operation, wherein the value of the hyperparameter used in the k-rank approximation operation is less than the hyperparameter threshold.

[0092] Next, the multiple left singular vectors corresponding to the first training data and the multiple left singular vectors corresponding to the second training data are merged to obtain a subspace composed of the first training data and the second training data. Here, the subspaces of training data for all learned tasks are merged to facilitate the determination of orthogonal gradient components for subsequent unlearned tasks.

[0093] For example, similar to Task 1 mentioned above, the representation of Task t at each layer of the network is obtained using a batch of samples from Task t. Then, the portion of this representation that overlaps with the previous tasks is removed; that is, the newly introduced input representation becomes... Singular value decomposition is performed on this newly introduced representation, and the k most significant basis vectors are selected as the subspace formed by the second training data, where the selection rule uses the hyperparameter of the k-rank approximation. Adding the multiple left singular vectors representing the subspace formed by the second training data to the aforementioned projection matrix M can be represented as...
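The merging procedure described above (remove the overlap with the stored subspace, decompose the remainder, append the selected vectors to M) can be sketched as follows. The exact selection rule is not reproduced in the text, so the sketch assumes an energy criterion in which the energy already captured by M counts toward the threshold; `extend_basis` and the toy data are hypothetical.

```python
import numpy as np

def extend_basis(M, R_new, threshold=0.95):
    """Append to M the new directions needed to represent R_new.

    M: (features x k_old) orthonormal basis of the learned tasks.
    R_new: (features x samples) representation matrix of the new task.
    """
    # Remove the portion of the representation already covered by M.
    R_res = R_new - M @ (M.T @ R_new)
    total = np.sum(R_new ** 2)
    covered = total - np.sum(R_res ** 2)   # energy already captured by M
    if covered / total >= threshold:
        return M                           # nothing new is needed
    U, S, _ = np.linalg.svd(R_res, full_matrices=False)
    energy = (covered + np.cumsum(S ** 2)) / total
    k = int(np.searchsorted(energy, threshold) + 1)
    # Never add more directions than the remaining dimensions allow.
    k = min(k, len(S), M.shape[0] - M.shape[1])
    return np.hstack([M, U[:, :k]])

rng = np.random.default_rng(0)
M1, _ = np.linalg.qr(rng.normal(size=(6, 2)))   # subspace stored after task 1
R2 = rng.normal(size=(6, 40))                   # toy representation of task 2
M12 = extend_basis(M1, R2, threshold=0.95)      # merged subspace of tasks 1 and 2
```

The merged matrix keeps the earlier basis vectors unchanged in its leading columns, so gradients of later tasks can be projected against all learned tasks with a single matrix.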

[0094] Next, obtain the third training data corresponding to the third learning task.

[0095] Then, based on the third training data, the third gradient of the loss function in each layer of the neural network is determined with the goal of maximizing the average loss in the neighborhood of the second model parameters of each layer of the neural network model.

[0096] Then, the gradient components in the third gradient that are orthogonal to the subspace formed by the first training data and the second training data are determined as the orthogonal gradient components of the third gradient corresponding to each layer of the network.

[0097] Finally, based on the orthogonal gradient components of the third gradient corresponding to each network layer, the second model parameters of each network layer are updated along the gradient descent direction to obtain the third model parameters. The third learning task is taken as the final learning task, and the third model parameters are the final model parameters of the neural network model.

[0098] The neural network model obtained by the model parameter update method in the above embodiments can be used to directly process multiple tasks, for example, classifying images in different scenes.

[0099] In the above embodiments, parameters are updated along the gradient descent direction with the goal of maximizing the average loss within the neighborhood of the initial model parameters of each layer of the neural network model. This improves the flatness of the loss function, resulting in a smaller change in the average loss for the same update amount of model parameters, and consequently, a smaller degree of forgetting of already learned tasks. Furthermore, by controlling the hyperparameter value to be below the hyperparameter threshold, the degree to which the gradient of unlearned tasks is projected onto the subspace of already learned tasks can be reduced, thereby increasing the update space for unlearned tasks. By improving the flatness of the loss function and increasing the update space for unlearned tasks, the performance of new tasks can be improved while reducing the forgetting of old tasks, thus increasing the average performance or average accuracy of the neural network model across all learning tasks. Moreover, the model parameter update method of this disclosure does not require adding additional model parameters, caching any samples of old tasks, or calculating the importance of parameters for each old task, making it more convenient.

[0100] To better understand this disclosure, the factors that affect the flatness of the loss function are described below with reference to Figure 4.

[0101] Figure 4 is a schematic diagram comparing loss surfaces with different flatness according to some embodiments of the present disclosure.

[0102] As shown in Figure 4, the change in average loss on loss surface (a) when the model parameters are updated from W1 to W2 is $F_{1,2}$. The change in average loss on loss surface (b) when the model parameters are updated from W3 to W4 is $F_{3,4}$. The model parameter update amount for both loss surfaces (a) and (b) is ΔW. It can be seen from Figure 4 that, for the same amount of model parameter updates, the change in average loss $F_{3,4}$ is much smaller than $F_{1,2}$. Therefore, loss surface (b) is flatter than loss surface (a).
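The comparison above can be illustrated numerically. The following toy example is only a sketch: the two loss surfaces are assumed to be one-dimensional quadratics with different curvature, and the starting point, curvatures, and update amount are made-up values, not data from Figure 4.

```python
import numpy as np

def loss_change(curvature, w_start, delta_w):
    """Absolute change of the toy loss 0.5 * curvature * w**2 after moving by delta_w."""
    def loss(w):
        return 0.5 * curvature * w ** 2
    return float(np.abs(loss(w_start + delta_w) - loss(w_start)))

delta_w = 0.5   # same update amount on both surfaces
F_12 = loss_change(curvature=8.0, w_start=0.2, delta_w=delta_w)   # sharp surface (a)
F_34 = loss_change(curvature=0.5, w_start=0.2, delta_w=delta_w)   # flat surface (b)
```

For the same update amount, the low-curvature surface yields the smaller change in loss, matching the relation between $F_{3,4}$ and $F_{1,2}$ described above.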

[0103] Based on the above analysis, minimizing the maximum average loss in the neighborhood of the model parameters (i.e., aiming to maximize the average loss in the neighborhood of the initial model parameters of each layer of the neural network and updating the parameters along the gradient descent direction) can make the loss surface flatter. The flatness of the loss surface or loss function is defined as the region on the loss surface where the loss value changes slowly with the model parameters.

[0104] Through the training method described above, the training strategy proposed in this disclosure can make the loss surface of each learning task flatter. Such a flat network is less likely to forget old tasks, so the constraint of gradient projection can be relaxed to a certain extent. This is specifically reflected in the setting of the hyperparameter used in gradient memory updates: the smaller its value, the weaker the gradient projection constraint and the larger the update space of the new task, and therefore the better the performance of the new task. New tasks can thus perform better without serious forgetting of old tasks.

[0105] Figure 5 is a schematic diagram comparing, on new tasks, the accuracy of a neural network model obtained by a model parameter update method according to some embodiments of the present disclosure with that of a neural network model obtained by using a gradient projection memory method.

[0106] As shown in Figure 5, taking the CIFAR100 dataset as an example, the neural network model (labeled DFGP) obtained according to the model parameter update method of some embodiments of this disclosure achieves better accuracy on new tasks than the aforementioned continual learning algorithm GPM. For example, the neural network model of this disclosure achieves an accuracy of 73.70% on the eighth task T8, while GPM only achieves 69.66% accuracy.

[0107] Figure 6 is a block diagram illustrating a model parameter update apparatus for a neural network model according to some embodiments of the present disclosure.

[0108] As shown in Figure 6, the model parameter update device 6 for the neural network model includes an acquisition module 61, a first update module 62, a first determination module 63, a second determination module 64, a third determination module 65, and a second update module 66.

[0109] The acquisition module 61 is configured to acquire first training data and second training data corresponding to different learning tasks, for example, by performing step S310 shown in Figure 3.

[0110] The first update module 62 is configured to update the initial model parameters of each layer of the neural network along the direction of gradient descent according to the first training data, with the goal of maximizing the average loss in the neighborhood of the initial model parameters of each layer of the neural network model, to obtain the first model parameters, for example, by performing step S320 shown in Figure 3.

[0111] The first determining module 63 is configured to determine, through singular value decomposition, multiple left singular vectors of the representation matrix of the first training data in each layer of the network based on the first model parameters, as a subspace formed by the first training data, wherein the number of the multiple left singular vectors is determined by a k-rank approximation operation, and the hyperparameter value used in the k-rank approximation operation is less than a hyperparameter threshold, for example, by performing step S330 shown in Figure 3.

[0112] The second determining module 64 is configured to determine the second gradient of the loss function in each layer of the neural network based on the second training data, with the objective of maximizing the average loss in the neighborhood of the first model parameters of each layer of the neural network model, for example, by performing step S340 shown in Figure 3.

[0113] The third determining module 65 is configured to determine the gradient components in the second gradient that are orthogonal to the subspace formed by the first training data, as the orthogonal gradient components corresponding to each layer of the network, for example, by performing step S350 shown in Figure 3.

[0114] The second update module 66 is configured to update the first model parameters of each layer of the network along the direction of gradient descent according to the orthogonal gradient components corresponding to each layer of the network, to obtain the second model parameters, for example, by performing step S360 shown in Figure 3.

[0115] Figure 7 is a block diagram illustrating a model parameter update apparatus for a neural network model according to other embodiments of the present disclosure.

[0116] As shown in Figure 7, the model parameter update apparatus 7 for a neural network model includes a memory 71 and a processor 72 coupled to the memory 71. The memory 71 is used to store instructions for executing embodiments of the model parameter update method for a neural network model. The processor 72 is configured to execute the model parameter update method for a neural network model in any of the embodiments of this disclosure based on the instructions stored in the memory 71.

[0117] Figure 8 is a block diagram illustrating a computer system for implementing some embodiments of the present disclosure.

[0118] As shown in Figure 8, the computer system 80 can be represented in the form of a general computing device. The computer system 80 includes a memory 810, a processor 820, and a bus 800 connecting different system components.

[0119] The memory 810 may include, for example, system memory, non-volatile storage media, etc. The system memory may store, for example, an operating system, application programs, a boot loader, and other programs. The system memory may include volatile storage media, such as random access memory (RAM) and/or cache memory. The non-volatile storage media may store, for example, instructions for executing at least one of the corresponding embodiments of a model parameter update method for a neural network model. Non-volatile storage media include, but are not limited to, disk storage, optical storage, flash memory, etc.

[0120] The processor 820 can be implemented using a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, or discrete hardware components such as discrete gates or transistors. Accordingly, each module, such as the decision module and the determination module, can be implemented by a central processing unit (CPU) executing instructions in memory to perform the corresponding steps, or by dedicated circuitry that performs the corresponding steps.

[0121] Bus 800 can use any of the various bus architectures. For example, bus architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, and the Peripheral Component Interconnect (PCI) bus.

[0122] The computer system 80 may also include an input/output interface 830, a network interface 840, and a storage interface 850. These interfaces 830, 840, and 850, as well as the memory 810 and processor 820, can be connected via a bus 800. The input/output interface 830 provides a connection interface for input/output devices such as a monitor, mouse, and keyboard. The network interface 840 provides a connection interface for various networked devices. The storage interface 850 provides a connection interface for external storage devices such as floppy disks, USB flash drives, and SD cards.

[0123] Various aspects of this disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus, and computer program products according to embodiments of this disclosure. It should be understood that each block of the flowchart illustrations and/or block diagrams, and combinations thereof, can be implemented by computer-readable program instructions.

[0124] These computer-readable program instructions are provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable device to produce a machine, such that execution of the instructions by the processor produces means for implementing the functions specified in one or more blocks of the flowchart and/or block diagram.

[0125] These computer-readable program instructions may also be stored in a computer-readable storage medium. These instructions cause a computer to work in a particular manner to produce an article of manufacture, including instructions that implement the functions specified in one or more blocks of the flowchart and/or block diagram.

[0126] This disclosure may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects.

[0127] The model parameter update method and apparatus for neural network models and the computer-storable medium described in the above embodiments can improve the average performance or average accuracy, across multiple tasks, of neural network models used for those tasks.

[0128] This concludes the detailed description of the model parameter update method and apparatus for neural network models, as well as the computer-storable medium, according to the present disclosure. To avoid obscuring the concept of this disclosure, some details known in the art have not been described. Those skilled in the art will fully understand how to implement the technical solutions disclosed herein based on the above description.

Claims

1. A method for updating model parameters in a neural network model, comprising:
obtaining first training data and second training data corresponding to different learning tasks, wherein the first training data and the second training data are images, and the neural network model is used to classify the images;
updating, based on the first training data and with the goal of maximizing the average loss in the neighborhood of the initial model parameters of each layer of the neural network model, the initial model parameters of each layer of the network along the direction of gradient descent to obtain first model parameters;
determining, by singular value decomposition, multiple left singular vectors of the representation matrix of the first training data in each layer of the network based on the first model parameters, as a subspace formed by the first training data, wherein the number of the multiple left singular vectors is determined by a k-rank approximation operation, and the value of the hyperparameter used in the k-rank approximation operation is less than a hyperparameter threshold;
determining, based on the second training data and with the objective of maximizing the average loss in the neighborhood of the first model parameters of each layer of the neural network model, a second gradient of the loss function in each layer of the network;
determining the gradient components in the second gradient that are orthogonal to the subspace formed by the first training data, as the orthogonal gradient components corresponding to each layer of the network; and
updating, based on the orthogonal gradient components corresponding to each layer of the network, the first model parameters of each layer of the network along the direction of gradient descent to obtain second model parameters.

2. The model parameter update method according to claim 1, wherein updating the initial model parameters of each layer of the network includes:
determining, based on the first training data and with the goal of maximizing the average loss in the neighborhood of the initial model parameters of each layer of the neural network model, a first gradient of the loss function in each layer of the network; and
updating, based on the first gradient of each layer of the network, the initial model parameters of each layer of the network along the direction of gradient descent.

3. The model parameter update method according to claim 2, wherein determining the first gradient of the loss function in each layer of the network includes:
determining a first objective function based on the first training data, wherein the first objective function aims to maximize the average loss in the neighborhood of the initial model parameters of each layer of the neural network model;
approximating the first objective function using a first-order Taylor expansion to obtain a first target value of the neighborhood that maximizes the average loss in the neighborhood of the initial model parameters; and
determining the first gradient based on the first training data and the first target value.

4. The model parameter update method according to claim 3, wherein determining the first gradient based on the first training data and the first target value includes:
perturbing the initial model parameters using the first target value to obtain initial perturbed model parameters;
determining a functional expression for the average loss of the first training data under the initial perturbed model parameters, as a first functional expression; and
determining the gradient of the first functional expression with respect to the initial perturbed model parameters, as the first gradient.

5. The model parameter update method according to claim 3, wherein determining the first objective function based on the first training data includes:
determining, based on the first training data, a functional expression of the average loss in the neighborhood of the initial model parameters of each layer of the neural network model; and
determining the first objective function based on the functional expression of the average loss in the neighborhood of the initial model parameters of each layer of the neural network model, wherein the first objective function aims to maximize the average loss in the neighborhood of the initial model parameters of each layer of the neural network model.

6. The model parameter update method according to any one of claims 1-5, wherein obtaining the first training data and the second training data corresponding to different learning tasks includes:
obtaining initial training data corresponding to a first learning task and initial training data corresponding to a second learning task; and
augmenting the initial training data corresponding to the first learning task and the initial training data corresponding to the second learning task, respectively, to obtain the first training data and the second training data.

7. The model parameter update method according to any one of claims 1-5, wherein determining the gradient components in the second gradient that are orthogonal to the subspace formed by the first training data, as the orthogonal gradient components corresponding to each layer of the network, includes:
projecting the second gradient onto the subspace formed by the first training data to obtain the projected gradient component of the second gradient; and
removing the projected gradient component from the second gradient to obtain the orthogonal gradient component corresponding to each layer of the network.

8. The model parameter update method according to any one of claims 1-5, wherein determining, based on the second training data and with the objective of maximizing the average loss in the neighborhood of the first model parameters of each layer of the neural network model, the second gradient of the loss function in each layer of the network includes:
determining a second objective function based on the second training data, wherein the second objective function aims to maximize the average loss in the neighborhood of the first model parameters of each layer of the neural network model;
approximating the second objective function using a first-order Taylor expansion to obtain a second target value of the neighborhood that maximizes the average loss in the neighborhood of the first model parameters; and
determining the second gradient based on the second training data and the second target value.

9. The model parameter update method according to claim 8, wherein determining the second gradient based on the second training data and the second target value includes:
perturbing the first model parameters using the second target value to obtain first perturbed model parameters;
determining a functional expression for the average loss of the first training data under the first perturbed model parameters, as a second functional expression; and
determining the gradient of the second functional expression with respect to the first perturbed model parameters, as the second gradient.

10. The model parameter update method according to any one of claims 1-5, wherein the first training data corresponds to a first learning task, the second training data corresponds to a second learning task, and the neural network model is further configured to learn a third learning task after learning the first learning task and the second learning task, the model parameter update method further including:
determining, by singular value decomposition, multiple left singular vectors of the representation matrix of the second training data in each layer of the network based on the second model parameters, as a subspace formed by the second training data, wherein the number of the multiple left singular vectors corresponding to the second training data is determined by a k-rank approximation operation, and the value of the hyperparameter used in the k-rank approximation operation is less than the hyperparameter threshold;
merging the multiple left singular vectors corresponding to the first training data and the multiple left singular vectors corresponding to the second training data to obtain a subspace composed of the first training data and the second training data;
obtaining third training data corresponding to the third learning task;
determining, based on the third training data and with the objective of maximizing the average loss in the neighborhood of the second model parameters of each layer of the neural network model, a third gradient of the loss function in each layer of the network;
determining the gradient components in the third gradient that are orthogonal to the subspace formed by the first training data and the second training data, as the orthogonal gradient components of the third gradient corresponding to each layer of the network; and
updating, based on the orthogonal gradient components of the third gradient corresponding to each layer of the network, the second model parameters of each layer of the network along the direction of gradient descent to obtain third model parameters.

11. The model parameter update method according to any one of claims 1-5, wherein the hyperparameter threshold is less than or equal to 0.96.

12. A model parameter update device for a neural network model, comprising:
an acquisition module configured to acquire first training data and second training data corresponding to different learning tasks, wherein the first training data and the second training data are images, and the neural network model is used to classify the images;
a first update module configured to update the initial model parameters of each layer of the neural network along the direction of gradient descent according to the first training data, with the goal of maximizing the average loss in the neighborhood of the initial model parameters of each layer of the neural network model, to obtain first model parameters;
a first determining module configured to determine, through singular value decomposition, multiple left singular vectors of the representation matrix of the first training data in each layer of the network based on the first model parameters, as a subspace formed by the first training data, wherein the number of the multiple left singular vectors is determined by a k-rank approximation operation, and the value of the hyperparameter used in the k-rank approximation operation is less than a hyperparameter threshold;
a second determining module configured to determine a second gradient of the loss function in each layer of the neural network based on the second training data, with the objective of maximizing the average loss in the neighborhood of the first model parameters of each layer of the neural network model;
a third determining module configured to determine the gradient components in the second gradient that are orthogonal to the subspace formed by the first training data, as the orthogonal gradient components corresponding to each layer of the network; and
a second update module configured to update the first model parameters of each layer of the network along the direction of gradient descent according to the orthogonal gradient components corresponding to each layer of the network, to obtain second model parameters.

13. A model parameter update device for a neural network model, comprising:
a memory; and
a processor coupled to the memory, the processor being configured to execute, based on instructions stored in the memory, the model parameter update method for a neural network model as described in any one of claims 1 to 11.

14. A computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the model parameter update method for a neural network model as described in any one of claims 1 to 11.