Model training and task processing method, device, system, apparatus and storage medium

By using a mask vector mechanism to select training parameters in federated learning, the robustness and generalization issues of personalized federated learning are addressed, improving the applicability and accuracy of the model on different devices while reducing computational complexity and communication load.

CN115759229B · Active · Publication Date: 2026-06-12 · JD DIGITS HAIYI INFORMATION TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
JD DIGITS HAIYI INFORMATION TECHNOLOGY CO LTD
Filing Date
2022-11-22
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing federated learning methods suffer from poor robustness and generalization in highly heterogeneous data environments, and the high communication load of decentralized forms leads to insufficient applicability and accuracy of models on different devices.

Method used

By employing a mask vector mechanism, the initial task processing model is determined by acquiring the training models from other devices and the local device. Based on the mask vector, parameters that need to be updated are selected from the model parameters for personalized training, reducing computational complexity and improving robustness.

Benefits of technology

It improves the applicability and accuracy of the model on different devices, reduces the computational complexity of training, and enhances the stability and speed of decentralized training.

✦ Generated by Eureka AI based on patent content.


Abstract

Embodiments of the present application disclose a model training and task processing method, device, system, equipment and storage medium. The model training method is applied to a local device in a distributed cluster, and the distributed cluster further includes other devices. The model training method comprises the following steps: obtaining a target task processing model obtained by the other devices in a previous round of training; determining an initial task processing model of the local device in the current round of training according to the target task processing model obtained by the other devices in the previous round of training and a target task processing model obtained by the local device in the previous round of training; determining a current gradient at each model parameter of the initial task processing model and determining a mask vector of the local device in the current round of training; determining a model parameter that needs to be updated by the local device in the current round of training from each model parameter based on the mask vector to obtain a target parameter; and updating the target parameter according to the current gradient at the target parameter of the initial task processing model to obtain a target task processing model of the local device in the current round of training.

Description

Technical Field

[0001] The embodiments of the present invention relate to computer technology, and more particularly to a model training and task processing method, apparatus, system, device and storage medium.

Background Technology

[0002] Federated learning is an emerging foundational technology in artificial intelligence. It enables efficient machine learning across multiple participants or computing nodes while ensuring information security during big data exchange, protecting terminal and personal data privacy, and guaranteeing legal compliance. Based on the connections between participants, federated learning can be categorized into centralized and decentralized forms. Traditional federated learning typically employs a centralized model with a parameter server. Local devices train their own data locally, and the central server then aggregates the models. The decentralized model, on the other hand, does not have a central server; communication and training occur directly between devices. Federated learning allows local devices to transmit only the trained model without sending their own data to a central server or other nodes, thus reducing the risk of data leakage.

[0003] Traditional federated learning, whether centralized or decentralized, trains only a single global model for deployment. In situations of high data heterogeneity, a single global model often fails to adapt to different data distributions, resulting in low inference accuracy for models deployed on local devices. To address this limitation, personalized federated learning allows local devices to have different personalized models, rather than sharing the same global model.

[0004] One approach to personalized federated learning divides the model into shared layers and personalized layers. The shared layers are trained jointly across devices, while the personalized layers are trained locally and are not shared with other devices. However, in developing this invention, the inventors discovered that this layer-based personalized federated learning method is less robust to different model architectures and generalizes poorly, because specific personalized and shared layers must be designed for each model.

Summary of the Invention

[0005] This invention provides a model training and task processing method, apparatus, system, device, and storage medium, which can reduce the computational complexity of training, are robust to different model structures, and generalize well as a training method.

[0006] In a first aspect, embodiments of the present invention provide a model training method, wherein the model training method is applied to a local device in a distributed cluster, the distributed cluster further including other devices, and the model training method includes:

[0007] Obtain the target task processing model obtained from the previous round of training on the other devices;

[0008] The initial task processing model for the current training round of the local device is determined based on the target task processing model obtained from the previous training round of the other devices and the target task processing model obtained from the previous training round of the local device.

[0009] Determine the current gradient at each model parameter of the initial task processing model, and determine the mask vector for this round of training on the local device;

[0010] Based on the mask vector, the model parameters that need to be updated in this round of training of the local device are determined from the various model parameters to obtain the target parameters;

[0011] The target parameters are updated based on the current gradient at the target parameters of the initial task processing model to obtain the target task processing model trained on the local device in this round.

[0012] In a second aspect, embodiments of the present invention provide a task processing method, including:

[0013] Obtain the current feature information of the target task;

[0014] The current feature information is input into the local target model trained by the model training method described in the embodiments of the present invention, so as to use the local target model to process the target task based on the current feature information, thereby obtaining the processing result of the target task.

[0015] In a third aspect, embodiments of the present invention provide a model training apparatus, wherein the model training apparatus is applied to a local device in a distributed cluster, the distributed cluster further includes other devices, and the model training apparatus includes:

[0016] The model acquisition module is used to acquire the target task processing model obtained from the previous round of training on the other devices;

[0017] The model determination module is used to determine the initial task processing model for the current training round of the local device based on the target task processing model obtained from the previous training round of the other device and the target task processing model obtained from the previous training round of the local device.

[0018] The gradient determination module is used to determine the current gradient at each model parameter of the initial task processing model;

[0019] A mask determination module is used to determine the mask vector for the current training round of the local device;

[0020] The parameter determination module is used to determine the model parameters that need to be updated in this round of training of the local device from the various model parameters based on the mask vector, so as to obtain the target parameters;

[0021] The parameter update module is used to update the target parameters based on the current gradient at the target parameters of the initial task processing model, so as to obtain the target task processing model trained by the local device in this round.

[0022] In a fourth aspect, embodiments of the present invention provide a task processing apparatus, comprising:

[0023] The feature acquisition module is used to acquire the current feature information of the target task;

[0024] The task processing module is used to input the current feature information into a local target model trained by the model training method described in the embodiments of the present invention, so as to use the local target model to process the target task based on the current feature information, thereby obtaining the processing result of the target task.

[0025] In a fifth aspect, embodiments of the present invention provide a model training system, the model training system comprising a distributed cluster, the distributed cluster including other devices and a local device for executing the model training method as described in any one of the embodiments of the present invention.

[0026] In a sixth aspect, embodiments of the present invention provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements a model training method as described in any embodiment of the present invention, or when the processor executes the program, it implements a task processing method as described in the embodiments of the present invention.

[0027] In a seventh aspect, embodiments of the present invention provide a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the model training method as described in any of the embodiments of the present invention, or the program, when executed by a processor, implements the task processing method as described in the embodiments of the present invention.

[0028] In this embodiment of the invention, the target task processing model obtained from the previous training round on another device can be obtained; the initial task processing model for the current training round on the local device is determined based on the target task processing model obtained from the previous training round on another device and the target task processing model obtained from the previous training round on the local device; the current gradient at each model parameter of the initial task processing model is determined, and the mask vector for the current training round on the local device is determined; the model parameters that need to be updated for the current training round on the local device are determined from the various model parameters based on the mask vector, and the target parameters are obtained; the target parameters are updated according to the current gradient at the target parameters of the initial task processing model, and the target task processing model for the current training round on the local device is obtained. That is, in this embodiment of the invention, the parameters that need to be trained on the local device for the current round are determined by the mask vector, and the model personalization is realized by the mask vector. Therefore, there is no need for hierarchical design and training of the model, which has good robustness to different model structures and good generalization of the training method.

[0029] Furthermore, the parameters that need to be trained in this round are determined from each parameter based on the mask vector. In each round of training, only some parameters participate, which can reduce the computational complexity of training to a certain extent. During gradient backpropagation, only some gradients (the gradients corresponding to the parameters that participate in training in this round) need to be backpropagated, which can further reduce the computational complexity of training.

[0030] In addition, the embodiments of the present invention adopt a decentralized training scheme, which does not rely on a central server, has good cluster stability, and allows each device in the distributed cluster to deploy personalized models, breaking through the constraint of global single model deployment and improving the inference accuracy of models deployed on local devices for their own data.

[0031] In addition, the initial task processing model for this round of training on the local device is obtained based on the target task processing model obtained from the previous round of training on other devices and the target task processing model obtained from the previous round of training on the local device. This is equivalent to a warm start training process, which can improve the training speed of the model.

Description of the Drawings

[0032] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of the present invention and should not be regarded as a limitation on the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0033] Figure 1 is a schematic flowchart of a model training method provided in an embodiment of the present invention;

[0034] Figure 2 is a flowchart illustrating a method for determining a mask vector provided in an embodiment of the present invention;

[0035] Figure 3 is another flowchart illustrating the method for determining the mask vector provided in an embodiment of the present invention;

[0036] Figure 4 is an example diagram of a method for determining a mask vector provided in an embodiment of the present invention;

[0037] Figure 5 is another schematic diagram of the model training method provided in an embodiment of the present invention;

[0038] Figure 6 is a schematic diagram of the model training device provided in an embodiment of the present invention;

[0039] Figure 7 is a schematic diagram of a task processing device provided in an embodiment of the present invention;

[0040] Figure 8 is a schematic diagram of the model training system provided in an embodiment of the present invention;

[0041] Figure 9 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.

Detailed Description

[0042] The present invention will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and not intended to limit it. Furthermore, it should be noted that, for ease of description, the accompanying drawings show only the parts relevant to the present invention, and not all of the structures.

[0043] Before the model training method of the embodiments of the present invention is introduced, the centralized federated learning method and the decentralized federated learning method are first described.

[0044] Centralized federated learning methods may include the following steps:

[0045] (1) The central server distributes the same global model to multiple node devices;

[0046] (2) The node device uses its own data to train the received model;

[0047] (3) The node devices upload the trained model to the central server;

[0048] (4) The central server aggregates the models uploaded by each node device.

[0049] Repeat steps (1) to (4) for the next iteration until the model converges.
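As a rough illustration of steps (1) to (4) above, the following Python sketch runs centralized rounds on a toy problem. The element-wise averaging in `average_models`, the quadratic toy objective in `local_train`, and all names are assumptions made for illustration; the text above does not specify the aggregation rule or the local training procedure.

```python
from typing import List


def average_models(models: List[List[float]]) -> List[float]:
    """Aggregate models by element-wise averaging (an assumed aggregation rule)."""
    return [sum(values) / len(models) for values in zip(*models)]


def local_train(model: List[float], data_mean: List[float], lr: float = 0.5) -> List[float]:
    """Toy local training: one gradient step on 0.5 * ||w - data_mean||^2."""
    return [w - lr * (w - m) for w, m in zip(model, data_mean)]


def centralized_round(global_model: List[float], node_data: List[List[float]]) -> List[float]:
    # (1) the central server distributes the same global model to every node
    received = [list(global_model) for _ in node_data]
    # (2) each node trains the received model on its own data
    trained = [local_train(m, d) for m, d in zip(received, node_data)]
    # (3) + (4) the nodes upload their models and the server aggregates them
    return average_models(trained)


global_model = [0.0, 0.0]
node_data = [[1.0, 2.0], [3.0, 4.0]]  # toy stand-in for each node's local data
for _ in range(20):
    global_model = centralized_round(global_model, node_data)
```

In this toy setting every node ends up with the same global model, which drifts toward the average of the two nodes' data. This is the single-global-model behavior that personalized federated learning is meant to relax.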

[0050] Decentralized federated learning methods may include the following steps:

[0051] (1) The local device receives models sent by other devices;

[0052] (2) The local device aggregates the received models and uses its own data to train the models;

[0053] (3) The local device sends the trained model to other devices;

[0054] (4) Other devices aggregate and train the received models.

[0055] Repeat steps (1) to (4) for the next iteration until the model converges.
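The decentralized steps above can be sketched in the same toy style. The sketch assumes that every device exchanges models with every other device, that aggregation is a plain average, and that local training is one gradient step on a quadratic toy objective; none of these choices is fixed by the text above.

```python
from typing import List


def decentralized_round(models: List[List[float]],
                        node_data: List[List[float]],
                        lr: float = 0.5) -> List[List[float]]:
    """One round in which each device aggregates peer models and trains locally."""
    new_models = []
    for i, data_mean in enumerate(node_data):
        # (1) device i receives the models sent by the other devices
        received = [m for j, m in enumerate(models) if j != i]
        # (2) aggregate them with its own model (assumed: plain average) ...
        pool = received + [models[i]]
        aggregated = [sum(v) / len(pool) for v in zip(*pool)]
        # ... and train on local data (toy objective 0.5 * ||w - data_mean||^2)
        trained = [w - lr * (w - m) for w, m in zip(aggregated, data_mean)]
        # (3) the trained model is what device i sends out in the next round
        new_models.append(trained)
    return new_models


models = [[0.0, 0.0], [0.0, 0.0]]
node_data = [[1.0, 2.0], [3.0, 4.0]]
for _ in range(30):
    models = decentralized_round(models, node_data)
```

No central server appears anywhere in the loop, and each device keeps its own model, pulled partly toward its own data.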

[0056] A horizontal comparison of centralized and decentralized federated learning methods reveals a key difference: the presence or absence of a central server. In practical applications, centralized algorithms heavily rely on a central server; if the server crashes or is maliciously attacked (e.g., by poisoning, backdoors, etc.), the entire algorithm fails. Secondly, regarding communication volume, centralized algorithms primarily involve communication between the central server and node devices, placing high demands on server bandwidth. With numerous nodes, maximum bandwidth cannot be effectively utilized. Conversely, decentralized federated learning methods involve communication primarily between node devices, maximizing bandwidth utilization and offering greater robustness against individual machine failures. However, current decentralized federated learning methods suffer from poor robustness and generalization issues in personalized federated learning. Therefore, this invention further improves decentralized personalized federated learning methods by proposing a novel model training method.

[0057] Figure 1 is a flowchart illustrating a model training method provided in an embodiment of the present invention. This method can be executed by the model training device provided in an embodiment of the present invention, which can be implemented using software and/or hardware. In a specific embodiment, the device can be integrated into an electronic device, such as a computer or server. The following embodiments are described using the case in which the device is integrated into an electronic device as an example. The electronic device can be a local device, which can be a device in a distributed cluster. The distributed cluster can also include other devices, and the local device and the other devices constitute a decentralized training cluster. Referring to Figure 1, the method may specifically include the following steps:

[0058] Step 101: Obtain the target task processing model obtained from the previous round of training on other devices.

[0059] For example, other devices can be all devices in the distributed cluster other than the local device, or devices in the distributed cluster that are geographically adjacent to the local device. In a specific implementation, the local device can maintain a device list, which may include the identification information, location information, communication address information, etc., of each device in the distributed cluster. The local device can obtain the target task processing model trained on other devices in the previous round based on the device list. The methods used by the local device and other devices for model training can be the same.

[0060] In this embodiment of the invention, the task processing model can be a model for processing a target task. This model can be a model implemented by a neural network such as a convolutional neural network or a deep neural network. The target task can be a target detection task, an item recommendation task, a text classification task, a machine translation task, etc., and is not specifically limited here.

[0061] Step 102: Determine the initial task processing model for the current training round of the local device based on the target task processing model obtained from the previous training round on other devices and the target task processing model obtained from the previous training round on the local device.

[0062] For example, the current training round on the local device can be regarded as a non-first round of training, and the previous round can be either the first round or a non-first round. When the previous round is the first round of training on the local device, the set hyperparameters can be obtained first. These hyperparameters may include the learning rate, the pruning rate, the number of devices participating in training, and the resource constraints of each device. The resource constraints can be determined by the hardware or physical configuration of the corresponding device, such as its maximum processing bandwidth, maximum computing power supported per unit time, and maximum storage resources. The model parameters of the local model are then randomly initialized, that is, a random value is assigned to each model parameter, which yields the initial task processing model for the first round of training on the local device. Next, the model sparsity supported by the local device is determined from the resource constraints in the hyperparameters, and the mask vector for the first round of training is determined from the model sparsity. The model parameters that need to be updated in the first round are then determined from this mask vector, the current gradient at those parameters is calculated from the training data on the local device, and the parameters are updated according to that gradient to obtain the target task processing model for the first round of training. When the previous round is not the first round of training on the local device, it is trained in the same way as the current round, which is described in detail below.
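For illustration only, the hyperparameters and the random initialization described above could be held in a structure like the following. The class names, field names, units, and the Gaussian initialization are all assumptions; the text only lists which kinds of values may be included.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class ResourceConstraint:
    """Hypothetical container for one device's resource limits."""
    max_bandwidth_mbps: float     # maximum processing bandwidth
    max_flops_per_second: float   # maximum computing power per unit time
    max_storage_bytes: int        # maximum storage resources


@dataclass
class Hyperparameters:
    """Hypothetical container for the hyperparameters set before round one."""
    learning_rate: float
    pruning_rate: float
    num_devices: int
    resource_constraints: List[ResourceConstraint]  # one entry per device


def random_init(num_params: int, seed: int = 0) -> List[float]:
    """Assign a random value to every model parameter (assumed: small Gaussian)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.1) for _ in range(num_params)]


hp = Hyperparameters(
    learning_rate=0.1,
    pruning_rate=0.2,
    num_devices=2,
    resource_constraints=[
        ResourceConstraint(100.0, 1e9, 2_000_000),
        ResourceConstraint(50.0, 5e8, 1_000_000),
    ],
)
initial_model = random_init(num_params=10)
```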

[0063] In this embodiment of the invention, the mask vector can be a vector indicating the state of the model parameters. The mask vector for this round of training indicates the state of each model parameter in this round. The state of a model parameter can be a masked state or a public state. A model parameter in the masked state does not have its value updated during the corresponding training round; a model parameter in the public state has its value updated according to the corresponding gradient during that round. The mask vector can contain one vector identifier per model parameter, indicating the state that the parameter should take. These vector identifiers can be represented by specific numerical values. For example, 0 can identify a model parameter that should be in the masked state and 1 a model parameter that should be in the public state; the opposite assignment is also possible.

[0064] In specific implementations, the model parameters mentioned in the embodiments of this invention can be structural parameters of the model, such as the weights and biases of each layer of the model. For example, when the task processing model is a convolutional neural network model, the model parameters of the convolutional layer can include the weights of the convolutional kernel and the biases of each channel, and the model parameters of the fully connected layer can include the weights of the fully connected layer and the biases of each channel.

[0065] Step 103: Determine the current gradient at each model parameter of the initial task processing model, and determine the mask vector for this round of training on the local device.

[0066] Specifically, the current gradient at each model parameter of the initial task processing model can be determined based on the local training dataset of the local device. For example, one or a batch of sample data can be selected from the local training dataset of the local device, the loss function of the initial task processing model can be determined using the selected sample data, and then the current gradient of the loss function at each model parameter of the initial task processing model can be calculated to obtain the current gradient at each model parameter of the initial task processing model. The local training dataset can include a large amount of sample data obtained by analyzing and labeling locally generated and acquired data; that is, the sample data in this embodiment of the invention is sample data with sample labels.
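A minimal sketch of this gradient computation is given below, assuming a linear model and a mean squared error loss so that the gradient can be written out by hand. The model, the loss, and the function name are illustrative assumptions; an actual task processing model would typically be a neural network with gradients obtained by automatic differentiation.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (features, sample label)


def mse_gradient(weights: Sequence[float], batch: Sequence[Sample]) -> List[float]:
    """Gradient of the batch mean squared error for the model y_hat = w . x."""
    grad = [0.0] * len(weights)
    for features, label in batch:
        prediction = sum(w * x for w, x in zip(weights, features))
        error = prediction - label
        # d/dw_k of (prediction - label)^2 is 2 * error * x_k; average over the batch
        for k, x in enumerate(features):
            grad[k] += 2.0 * error * x / len(batch)
    return grad


weights = [0.0, 0.0]                               # initial task processing model
batch = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0)]     # labeled samples from the local dataset
current_gradient = mse_gradient(weights, batch)    # one entry per model parameter
```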

[0067] For example, if the target task is object detection, the local training dataset of the local device can be constructed based on the local image database. For instance, targets (such as people or vehicles) in the original images in the local image database can be labeled to obtain labeled images. These labels can be considered as sample labels, thereby constructing a large amount of sample data, and then constructing the local training dataset based on this large amount of sample data.

[0068] For example, the mask vector for the current training round of the local device can be determined based on the current gradient at each model parameter of the initial task processing model, the current value of each model parameter, and the state of each model parameter in the previous training round.

[0069] For example, the magnitude of a model parameter's value can be taken to reflect its importance. Based on these magnitudes, some model parameters that were public in the previous training round can be masked in the current round. For instance, a few of the model parameters that were public in the previous round and have smaller values can be selected and masked in the current round, meaning that these less important parameters are not considered in this round.

[0070] For example, the magnitude of the gradient at a model parameter can be taken to reflect how quickly that parameter changes. Based on these gradient magnitudes, some model parameters that were masked in the previous training round can be made public in the current round. For instance, a few of the model parameters that were masked in the previous round and have larger gradients can be selected and made public in the current round, thus refocusing attention on the rapidly changing model parameters.
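The two selection rules above can be sketched together as follows, using 1 for the public state and 0 for the masked state. The function name and the choice to mask and to open the same number of parameters (`num_swap`), which keeps the number of public parameters constant, are assumptions made for illustration.

```python
from typing import List, Sequence


def update_mask(prev_mask: Sequence[int],
                values: Sequence[float],
                grads: Sequence[float],
                num_swap: int) -> List[int]:
    """Derive this round's mask from the previous mask, parameter values and gradients."""
    public = [i for i, m in enumerate(prev_mask) if m == 1]
    masked = [i for i, m in enumerate(prev_mask) if m == 0]
    # previously public parameters with the smallest magnitude become masked
    to_mask = sorted(public, key=lambda i: abs(values[i]))[:num_swap]
    # previously masked parameters with the largest gradient magnitude become public
    to_open = sorted(masked, key=lambda i: abs(grads[i]), reverse=True)[:num_swap]
    new_mask = list(prev_mask)
    for i in to_mask:
        new_mask[i] = 0
    for i in to_open:
        new_mask[i] = 1
    return new_mask


prev_mask = [1, 1, 1, 0, 0, 0]
values = [0.9, 0.01, -0.5, 0.0, 0.0, 0.0]
grads = [0.1, 0.2, 0.3, 0.05, -0.8, 0.4]
new_mask = update_mask(prev_mask, values, grads, num_swap=1)
```

Here the parameter at index 1 (smallest magnitude among the public ones) is masked, and the parameter at index 4 (largest gradient magnitude among the masked ones) is made public.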

[0071] Step 104: Based on the mask vector, determine the model parameters that need to be updated in this round of training on the local device, and obtain the target parameters.

[0072] Since the mask vector of this training round can indicate the state of the corresponding model parameters in this training round, in this embodiment of the invention, the model parameters that should be public in this training round, i.e., the public parameters of this round, can be determined based on the mask vector of this training round, and the public parameters of this round can be determined as the target parameters.

[0073] Step 105: Update the target parameters based on the current gradient at the target parameters of the initial task processing model to obtain the target task processing model trained on the local device in this round.
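Steps 104 and 105 can be sketched as a masked update. The plain gradient descent rule `w - lr * g` and the names used here are assumptions; the text only states that the target parameters are updated according to their current gradients.

```python
from typing import List, Sequence, Tuple


def masked_update(params: Sequence[float],
                  grads: Sequence[float],
                  mask: Sequence[int],
                  lr: float) -> Tuple[List[float], List[int]]:
    """Update only the parameters whose mask entry is 1 (public state)."""
    # step 104: the public parameters of this round are the target parameters
    target = [i for i, m in enumerate(mask) if m == 1]
    # step 105: update the target parameters; masked parameters keep their values
    new_params = list(params)
    for i in target:
        new_params[i] = params[i] - lr * grads[i]
    return new_params, target


params = [1.0, 2.0, 3.0]
grads = [0.5, 0.5, 0.5]
mask = [1, 0, 1]
new_params, target = masked_update(params, grads, mask, lr=1.0)
```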

[0074] For example, after obtaining the target task processing model trained on the local device in this round, the target task processing model trained on the local device in this round can be sent to other devices, so that the other devices can train their models based on the target task processing model sent by the local device. This training is iterated until the training cutoff condition is met, at which point the target task processing model obtained in the last round of training can be determined as the local target model of the local device. The training cutoff condition can be a set limit on the number of training rounds or a set limit on the loss function; no specific limitation is made here.
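The iteration and its cutoff condition could look like the following sketch, in which `step` stands for one complete training round (receive models, train, send the result) and returns the loss of that round. The function names and the toy round whose loss halves each time are purely illustrative.

```python
from typing import Callable, Tuple


def train_until_cutoff(step: Callable[[], float],
                       max_rounds: int,
                       loss_threshold: float) -> Tuple[int, float]:
    """Run training rounds until the round limit or the loss limit is reached."""
    rounds, loss = 0, float("inf")
    while rounds < max_rounds and loss > loss_threshold:
        loss = step()
        rounds += 1
    return rounds, loss


# Toy stand-in for one training round: the loss halves every round.
state = {"loss": 1.0}


def toy_step() -> float:
    state["loss"] *= 0.5
    return state["loss"]


rounds, final_loss = train_until_cutoff(toy_step, max_rounds=100, loss_threshold=0.1)
```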

[0075] In this embodiment of the invention, the target task processing model obtained from the previous training round on another device can be obtained; the initial task processing model for the current training round on the local device is determined based on the target task processing model obtained from the previous training round on another device and the target task processing model obtained from the previous training round on the local device; the current gradient at each model parameter of the initial task processing model is determined, and the mask vector for the current training round on the local device is determined; the model parameters that need to be updated for the current training round on the local device are determined from the various model parameters based on the mask vector, and the target parameters are obtained; the target parameters are updated according to the current gradient at the target parameters of the initial task processing model, and the target task processing model for the current training round on the local device is obtained. That is, in this embodiment of the invention, the parameters that need to be trained on the local device for the current round are determined by the mask vector, and the model personalization is realized by the mask vector. Therefore, there is no need for hierarchical design and training of the model, which has good robustness to different model structures and good generalization of the training method.

[0076] Furthermore, the parameters to be trained in this round are selected from all parameters based on the mask vector, so only some parameters participate in each round of training, which can reduce the computational complexity of training to a certain extent. During gradient backpropagation, only the gradients corresponding to the parameters that participate in this round need to be backpropagated, that is, only the target parameters are updated according to their current gradients, which can further reduce the computational complexity of training.

[0077] In addition, the embodiments of the present invention adopt a decentralized training scheme, which does not rely on a central server, has good cluster stability, and allows each device in the distributed cluster to deploy personalized models, breaking through the constraint of global single model deployment and improving the inference accuracy of models deployed on local devices for their own data.

[0078] In addition, the initial task processing model of the local device in this round of training is obtained based on the target task processing model obtained in the previous round of training on other devices and the target task processing model obtained in the previous round of training on the local device. This is equivalent to a warm start training process, which can speed up model convergence and improve model training speed.

[0079] The following example illustrates the method for determining the mask vector during the first round of training on a local device. As shown in Figure 2, it may include the following steps:

[0080] Step 201: Determine the resource limitation information of the local device.

[0081] For example, the resource limitation information of a local device can be determined by the hardware or physical configuration of the local device, such as maximum processing bandwidth, maximum computing power supported per unit time, and maximum storage resources.

[0082] Step 202: Determine the model sparsity based on the resource limitation information.

[0083] That is, the model sparsity that the local device can support is determined from the resource limitations of the local device. Model sparsity can be expressed, for example, as a number of model parameters or as the proportion of model parameters in the public state versus the masked state. For example, determining from the resource limitations that the local device supports a model sparsity of 50% or 80% means that the local device supports training 50% or 80% of all model parameters in each round of training.

[0084] Step 203: Determine the mask vector for the first round of training on the local device based on model sparsity.

[0085] Specifically, the mask vector can be randomly initialized based on the model sparsity to obtain the mask vector for the first round of training on the local device, i.e., a randomly initialized mask vector. For example, suppose the model sparsity is 50%, the vector identifier of model parameters in the masked state is 0, and the vector identifier of model parameters in the public state is 1. When randomly initializing the mask vector, the vector identifiers of a randomly chosen half of the positions can be set to 0 and those of the other half set to 1. If the total number of model parameters is 10, that is, the length of the mask vector is 10, then one possible mask vector for the first round of training on the local device after random initialization at 50% model sparsity is (1, 1, 1, 1, 1, 0, 0, 0, 0, 0).
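The random initialization in Step 203 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `init_mask`, the `keep_fraction` argument (the share of parameters left in the public state) and the `seed` argument are hypothetical names introduced here.

```python
import random

def init_mask(num_params, keep_fraction, seed=None):
    """Return a 0/1 mask with round(keep_fraction * num_params) ones at random positions.

    1 marks a public (trained) parameter, 0 a masked one, as in paragraph [0085].
    """
    rng = random.Random(seed)
    num_public = round(keep_fraction * num_params)
    mask = [1] * num_public + [0] * (num_params - num_public)
    rng.shuffle(mask)  # random placement of the public positions
    return mask

mask = init_mask(10, 0.5, seed=0)
print(mask, "public parameters:", sum(mask))
```

Whatever the seed, the mask has length 10 and exactly five entries equal to 1; only their positions vary.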

[0086] For example, when the task processing model is a neural network model, it may consist of multiple layers. When randomly initializing the mask vector based on the model sparsity, the ERK (Erdős-Rényi-Kernel) method can be used to calculate the sparsity of each layer of the network. ERK is a method for initializing the per-layer sparsity ratios of a neural network. In particular, it lets the number of sparse parameters in a layer vary with the layer's input and output channels, so that layers with more parameters receive a higher pruning rate. In a decentralized federated learning setting, each device has its own randomly initialized mask vector, whose per-layer sparsity distribution is calculated according to ERK.
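The text does not state the ERK formula. The sketch below assumes the scaling commonly used for ERK in the sparse-training literature, in which a layer's density (its share of public parameters) is proportional to the sum of its dimensions divided by their product, rescaled so that the whole model meets a target density. The function name, the layer shapes and the target density are hypothetical.

```python
import math

def erk_densities(shapes, target_density):
    """Per-layer densities under an assumed ERK scaling.

    shapes: one tuple of dimensions per layer, e.g. (n_in, n_out).
    target_density: overall fraction of parameters to keep public.
    """
    sizes = [math.prod(s) for s in shapes]
    scores = [sum(s) / math.prod(s) for s in shapes]  # larger layers get smaller scores
    dense = set()  # layers whose density would exceed 1 are kept fully dense
    eps = 0.0
    while True:
        free = [i for i in range(len(shapes)) if i not in dense]
        if not free:
            break
        # Parameter budget left once the fully dense layers are paid for.
        budget = target_density * sum(sizes) - sum(sizes[i] for i in dense)
        eps = budget / sum(scores[i] * sizes[i] for i in free)
        over = [i for i in free if eps * scores[i] > 1.0]
        if not over:
            break
        dense.update(over)
    return [1.0 if i in dense else eps * scores[i] for i in range(len(shapes))]

shapes = [(784, 300), (300, 100), (100, 10)]  # hypothetical fully connected layers
densities = erk_densities(shapes, target_density=0.2)
for shape, d in zip(shapes, densities):
    print(shape, round(d, 3))
```

In this sketch the largest layer ends up with the lowest density, that is, the highest pruning rate, which matches the behaviour described above.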

[0087] The mask vector for the first round of training on the local device is determined based on model sparsity. Subsequent training rounds use this mask vector to determine the model parameters that need to be updated, ensuring that the entire training process is based on a sparse model. A complete neural network contains a massive number of model parameters, while a sparse model refers to a model in which some parameters are masked (i.e., not involved in actual computation). Because some model parameters are masked, the model size is reduced, and training based on a sparse model can significantly reduce training computational overhead.

[0088] The following example illustrates the method for determining the mask vector for a training round other than the first on the local device. As shown in Figure 3, it may include the following steps:

[0089] Step 301: The pruning rate of the previous training round on the local device is attenuated to obtain the pruning rate of the current training round on the local device.

[0090] For example, an initial pruning rate can be set, and in each subsequent training round the pruning rate used in the previous round can be decreased, thereby accelerating the model training process. For instance, if the pruning rate in the previous training round was 50% and the pruning rate decreases by 10 percentage points each round, then the pruning rate in this training round would be 40%.

[0091] Step 302: Determine the number of parameters to prune and the number of parameters to restore based on the pruning rate of the local device in this round of training.

[0092] For example, the pruning rate of this training round can be multiplied by the total number of model parameters to obtain the number of pruned and restored parameters. The number of pruned and restored parameters can be equal; that is, in this embodiment of the invention, the model sparsity remains unchanged in each round of model training. For instance, if the pruning rate of this training round is 40% and the total number of model parameters is 10, then the number of pruned parameters in this training round is 4, and the number of restored parameters is also 4.

[0093] Step 303: Determine the previous-round public parameters and the previous-round masked parameters from the model parameters.

[0094] Here, the previous-round public parameters are the model parameters that the local device needed to make public in the previous round of training, and the previous-round masked parameters are the model parameters that the local device needed to mask in the previous round of training. Both can be determined from historical training record information.

[0095] Step 304: Determine the first state change parameters and the remaining public parameters from the previous-round public parameters, based on the current values of the previous-round public parameters and the number of parameters to prune.

[0096] Here, the first state change parameters are the previous-round public parameters that should change to the masked state in this round of training. For example, the magnitude of a model parameter's value can be taken to reflect its importance. The previous-round public parameters can be sorted by value, and the model parameters with the smallest values, as many as the number of parameters to prune, can be selected as the first state change parameters; that is, these less important model parameters are no longer considered in this round of training. The previous-round public parameters other than the first state change parameters are the remaining public parameters.

[0097] Step 305: Determine the second state change parameters and the remaining masked parameters from the previous-round masked parameters, based on the current gradients at the previous-round masked parameters and the number of parameters to restore.

[0098] Here, the second state change parameters are the previous-round masked parameters that should change to the public state in this round of training. For example, the magnitude of the gradient at a model parameter can be taken to reflect the rate of change of that parameter. The previous-round masked parameters can be sorted by gradient magnitude, and the model parameters with the largest gradients, as many as the number of parameters to restore, can be selected as the second state change parameters; that is, model parameters that change rapidly are given renewed attention in this round of training. The previous-round masked parameters other than the second state change parameters are the remaining masked parameters.

[0099] Step 306: Determine the second state change parameters and the remaining public parameters as the current-round public parameters, and determine the first state change parameters and the remaining masked parameters as the current-round masked parameters.

[0100] That is, the current-round public parameters consist of the second state change parameters and the remaining public parameters, and the current-round masked parameters consist of the first state change parameters and the remaining masked parameters.

[0101] Step 307: Determine the vector identifier of the public parameter and the vector identifier of the masked parameter. The vector identifiers of the public parameter and the masked parameter are different.

[0102] For example, vector identifiers can be represented by specific numerical values. For instance, 0 can represent the vector identifier for masked parameters, and 1 can represent the vector identifier for public parameters. Of course, the reverse is also possible.

[0103] Step 308: Set the vector identifier of the public parameters at the vector positions corresponding to the current-round public parameters, and set the vector identifier of the masked parameters at the vector positions corresponding to the current-round masked parameters, to obtain the mask vector of the local device for this round of training.

[0104] Continuing the previous example, suppose the total number of model parameters is 10. In the previous training round, there were 5 public parameters and 5 masked parameters. The vector identifier for public parameters is 1, and the vector identifier for masked parameters is 0. The mask vector for the previous training round is (1, 1, 1, 1, 1, 0, 0, 0, 0, 0). The pruning rate for this training round is 40%, meaning that 4 parameters are pruned and 4 parameters are restored in this round. If the first public parameter of the previous round has the largest value, then the first public parameter is the remaining public parameter, and the other four public parameters are the first state change parameters. If the gradient at the first masked parameter of the previous round is the smallest, then the first masked parameter is the remaining masked parameter, and the other four masked parameters are the second state change parameters. Therefore, the mask vector for this training round is (1, 0, 0, 0, 0, 0, 1, 1, 1, 1).
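Steps 303 to 308 can be sketched as below. The sketch assumes that "value" and "gradient" are compared by absolute magnitude, and the concrete parameter values and gradients are made up so that they satisfy the conditions of the example (first public parameter largest, gradient at the first masked parameter smallest). The function name `update_mask` is hypothetical.

```python
def update_mask(prev_mask, values, grads, count):
    """Mask `count` public parameters and restore `count` masked ones."""
    public = [i for i, m in enumerate(prev_mask) if m == 1]   # Step 303
    masked = [i for i, m in enumerate(prev_mask) if m == 0]
    count = min(count, len(public), len(masked))
    # Step 304: public parameters with the smallest magnitude change to masked.
    to_mask = sorted(public, key=lambda i: abs(values[i]))[:count]
    # Step 305: masked parameters with the largest gradient magnitude change to public.
    to_restore = sorted(masked, key=lambda i: abs(grads[i]), reverse=True)[:count]
    # Steps 306 to 308: write the identifiers 0 and 1 at the affected positions.
    new_mask = list(prev_mask)
    for i in to_mask:
        new_mask[i] = 0
    for i in to_restore:
        new_mask[i] = 1
    return new_mask

prev_mask = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
values = [0.9, 0.1, 0.2, 0.3, 0.4, 0.5, 0.5, 0.5, 0.5, 0.5]     # illustrative only
grads = [0.0, 0.0, 0.0, 0.0, 0.0, 0.01, 0.5, 0.6, 0.7, 0.8]    # illustrative only
new_mask = update_mask(prev_mask, values, grads, count=4)
print(new_mask)  # [1, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```

Because the same number of positions is switched in each direction, the number of public parameters is the same before and after the update.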

[0105] The process of determining the mask vector for the current training round on the local device in this embodiment is analogous to pruning a tree and then letting it grow new branches. When determining the mask vector for this round, some model parameters are first pruned (masked) from the previous-round public parameters. Then the same number of model parameters are grown (made public) from the previous-round masked parameters. The current-round public parameters and the current-round masked parameters are determined from the pruned and grown model parameters, which yields the mask vector for this training round. For example, as shown in Figure 4, the original state of the tree (model) is (A). Pruning (A) results in (B), as indicated by the dashed arrows in (B). Growing new branches on (B) results in (C), as indicated by the dashed arrows in (C). The number of pruned branches and the number of grown branches are the same, both being 3, so the sparsity of the tree has not changed.

[0106] The method described above for determining the mask vector for each training round is merely an example. In practical applications, the method for determining the mask vector can be improved and replaced to obtain other methods for determining the mask vector. No specific limitations are made here.

[0107] The embodiments of the present invention determine the public parameters and the masked parameters for each training round based on a pruning rate that decays from round to round, and determine the mask vector for each training round from those public and masked parameters. This ensures that the sparsity of the model remains unchanged throughout training, so that the entire training process is carried out with a sparse model, which reduces the computational overhead of training.
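The decaying pruning rate of Step 301 and the equal prune and restore counts of Step 302 can be illustrated with a short sketch. The linear decay by a fixed step, the floor at zero and the function names are assumptions made here; the text only gives the example of 50% decaying to 40%.

```python
def pruning_schedule(initial_rate, step, num_rounds):
    """Pruning rate per round under an assumed linear decay, floored at zero."""
    return [max(round(initial_rate - step * r, 10), 0.0) for r in range(num_rounds)]

def prune_and_restore_counts(pruning_rate, num_params):
    """Equal prune and restore counts, as in paragraph [0092]."""
    count = round(pruning_rate * num_params)
    return count, count

num_params, num_public = 10, 5
schedule = pruning_schedule(0.5, 0.1, 6)
for rate in schedule:
    n_prune, n_restore = prune_and_restore_counts(rate, num_params)
    num_public = num_public - n_prune + n_restore  # unchanged in every round
    print(f"rate={rate:.1f} prune={n_prune} restore={n_restore} public={num_public}")
```

Since the two counts are always equal, the number of public parameters stays at 5 in every round while the amount of change to the mask shrinks as the rate decays.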

[0108] The overall algorithm for determining the mask vector can include:

[0109] 1. Set the hyperparameter: the initial pruning rate;

[0110] 2. Mask initialization method:

[0111] 3. Calculate the per-layer sparsity ratios of the model based on ERK, and randomly initialize the mask vector according to these sparsity ratios;

[0112] 4. Return the randomly initialized mask vector;

[0113] 5. Method for searching for the next mask vector:

[0114] 6. Decay the pruning rate relative to the previous round;

[0115] 7. Perform the following operations on each layer of the model;

[0116] 8. Prune (mask) some of the parameters;

[0117] 9. Restore (make public) some of the parameters;

[0118] 10. Return the next mask vector.

[0119] The following describes the model training method of this invention, based on the mask vector determination method provided in the previous embodiments. As shown in Figure 5, the specific steps may include the following:

[0120] Step 401: Obtain the target task processing model obtained from the previous round of training on other devices.

[0121] Step 402: Determine the initial task processing model for the current training round of the local device based on the target task processing model obtained from the previous training round on other devices and the target task processing model obtained from the previous training round on the local device.

[0122] For example, at the start of this training round, all model parameters of the target task processing models obtained by the other devices in the previous round and all model parameters of the target task processing model obtained by the local device in the previous round can be treated as being in the public state. A model parameter that was public in the previous training round has had its value updated, whereas a model parameter that was masked in the previous training round has kept its value. The initial task processing model for this training round on the local device can therefore be determined from the target task processing models obtained by the other devices in the previous round and the target task processing model obtained by the local device in the previous round.

[0123] Specifically, the model parameters of the target task processing model trained on other devices in the previous round can be fused with the model parameters of the target task processing model trained on the local device in the previous round to obtain the initial task processing model for the current round of training on the local device. The specific fusion method can be: determining the average value of each model parameter based on the current values of the model parameters of the target task processing model trained on other devices in the previous round and the model parameters of the target task processing model trained on the local device in the previous round, thereby obtaining the initial task processing model for the current round of training on the local device.
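The averaging described in this paragraph can be sketched as follows, treating each model as a flat list of parameter values. The function name and the sample values are hypothetical, and a plain unweighted mean over all received models is assumed, since the text does not specify weights.

```python
def fuse_models(local_params, neighbor_params_list):
    """Element-wise average of the local model and the models received from other devices."""
    models = [local_params] + list(neighbor_params_list)
    return [sum(column) / len(models) for column in zip(*models)]

local = [1.0, 2.0, 3.0]
neighbors = [[3.0, 2.0, 1.0], [2.0, 2.0, 2.0]]
initial_model = fuse_models(local, neighbors)
print(initial_model)  # [2.0, 2.0, 2.0]
```

Starting each round from this average rather than from a random initialization is what the description later refers to as a warm start.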

[0124] Step 403: Select training data for this round from the local training dataset.

[0125] For example, one or more sample data can be selected from the local training dataset on the local device. The selected sample data is used to determine the loss function of the initial task processing model. Then, the current gradient of the loss function at each model parameter of the initial task processing model is calculated, thus obtaining the current gradient of each model parameter of the initial task processing model. Specifically, to improve the accuracy of the trained model, the amount of data selected in each batch can be preset. Then, according to this amount of data, a batch of sample data is selected from the local training dataset each time, and the loss function and current gradient of the initial task processing model are determined using a batch of sample data.

[0126] Step 404: Determine the loss function of the initial task processing model based on the training data of this round.

[0127] The training data in this round consists of labeled sample data, which allows the loss function of the initial task processing model to be determined based on the actual output of the initial task processing model and the sample labels.

[0128] Step 405: Calculate the current gradient of the loss function at each model parameter of the initial task processing model to obtain the current gradient at each model parameter of the initial task processing model.

[0129] Step 406: Determine the mask vector for this round of training on the local device.

[0130] For the specific method of determining the mask vector, please refer to the method described in the previous embodiments of this invention, which will not be repeated here.

[0131] Step 407: Determine the public parameters for this round based on the vector identifier in the mask vector, and set the public parameters for this round as the target parameters.

[0132] Specifically, since the model sparsity remains constant throughout the entire model training process, the number of public parameters in this round corresponds to the model sparsity supported by the local device.

[0133] Step 408: Update the target parameters based on the current gradient at the target parameters of the initial task processing model to obtain the target task processing model trained on the local device in this round.

[0134] Step 409: Send the target task processing model trained on the local device in this round to other devices.

[0135] For example, other devices may also train a model based on the target task processing model received from the local device. The method used by other devices to train the model can be the same as that used by the local device.

[0136] Step 410: Determine whether the training cutoff condition has been met. If it has, proceed to step 411; otherwise, return to step 401.

[0137] For example, the training cutoff condition can be a set limit on the number of training epochs or a set limit on the loss function; no specific limitation is made here. A training epoch limit could be, for example, stopping training when a preset number of training epochs is reached. A loss function limit could be, for example, stopping training when the loss function value of the target task processing model obtained from the local device is the lowest or lower than a preset loss function value, or when the average loss function value of the target task processing model obtained from all devices is the lowest or lower than a preset loss function value.

[0138] Step 411: Determine the target task processing model obtained from the last round of training as the local target model of the local device.
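Steps 407, 408 and 410 can be sketched as below. A plain gradient descent step is assumed for the update, and the stopping rule combines a round limit with a loss threshold as in paragraph [0137]; the function names and numbers are hypothetical.

```python
def masked_update(params, grads, mask, learning_rate):
    """Apply a gradient step only where the mask marks a parameter as public (1)."""
    return [p - learning_rate * g if m == 1 else p
            for p, g, m in zip(params, grads, mask)]

def should_stop(round_index, loss, max_rounds, loss_threshold):
    """Stop when the round limit is reached or the loss falls below a preset value."""
    return round_index >= max_rounds or loss < loss_threshold

params = [1.0, 2.0, 3.0, 4.0]
grads = [0.5, 0.5, 0.5, 0.5]
mask = [1, 0, 1, 0]
updated = masked_update(params, grads, mask, learning_rate=0.1)
# Positions 1 and 3 are masked, so they keep their values 2.0 and 4.0.
print(updated)
```

Only the target parameters (mask value 1) move; the masked parameters are carried over unchanged into the target task processing model of this round.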

[0139] The algorithms involved in the entire model training process may include:

[0140] 1. Set hyperparameters: learning rate, number of devices participating in training, and resource limitations for each device;

[0141] 2. Randomly initialize local model parameters; each device randomly initializes a mask vector based on resource constraint information.

[0142] 3. Set the number of training iteration rounds;

[0143] 4. Perform this operation on each device;

[0144] 5. Receive the corresponding sparse model from other devices;

[0145] 6. Integrate the received models to obtain the initial task processing model for this round of training on the local device;

[0146] 7. Set the batch size for training data;

[0147] 8. Obtain a batch of data for this round of training;

[0148] 9. Calculate the current gradient at each model parameter of the initial task processing model;

[0149] 10. Update the initial task processing model based on the mask vector and the current gradient at each model parameter in this round of training to obtain the target task processing model in this round of training.
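The listed procedure can be put together in a toy simulation. Everything concrete here is an assumption made for illustration: two devices, a two-parameter linear model with mean squared error, synthetic data generated from y = 2*x0 - x1, fixed masks, a fixed number of rounds, and plain gradient descent.

```python
import random

def batch_gradient(w, batch):
    """Gradient of the mean squared error of a linear model over one batch."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += 2.0 * err * xj / len(batch)
    return grad

def train_round(own_w, neighbor_ws, mask, data, batch_size, lr, rng):
    # Item 6: fuse the local model with the received models by averaging.
    models = [own_w] + neighbor_ws
    w = [sum(col) / len(models) for col in zip(*models)]
    # Items 7 to 9: draw a batch and compute the gradient at every parameter.
    grad = batch_gradient(w, rng.sample(data, batch_size))
    # Item 10: update only the parameters whose mask entry is 1.
    return [wi - lr * gi if mi == 1 else wi for wi, gi, mi in zip(w, grad, mask)]

rng = random.Random(0)

def make_data(n):
    """Hypothetical local dataset with labels y = 2*x0 - x1."""
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, 2.0 * x[0] - 1.0 * x[1]))
    return data

datasets = [make_data(50), make_data(50)]
masks = [[1, 0], [0, 1]]             # each device trains a different parameter
weights = [[0.0, 0.0], [0.0, 0.0]]

for _ in range(200):                 # item 3: fixed number of rounds
    previous = [list(w) for w in weights]
    for k in range(2):               # item 4: every device runs the same procedure
        neighbors = [previous[j] for j in range(2) if j != k]
        weights[k] = train_round(previous[k], neighbors, masks[k],
                                 datasets[k], 10, 0.1, rng)

print(weights)
```

In this sketch each device updates only one of the two parameters, yet both devices approach the weights (2, -1) because the parameter trained by one device reaches the other through the averaging step.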

[0150] Analysis of the training process revealed a relationship between model sparsity and the generalization performance of personalized models. When the loss function is bounded (i.e., the loss function calculated using the training data is bounded) and the number of training samples exceeds a specified limit, the following conclusion can be drawn: as the sparsity of the so-called global model (a hypothetical model that does not actually exist in decentralized algorithms) decreases, the generalization performance of personalized models improves. However, in practical machine learning applications, model generalization and training error are also related, and the relationship between the two is usually a dynamic balance. In practical applications, the desired training mode can be selected based on the specific scenario and data distribution.

[0151] It should be noted that the local target model trained in the embodiments of the present invention can be used to perform various target tasks, including but not limited to target detection tasks, item recommendation tasks, text classification tasks, and machine translation tasks.

[0152] In this embodiment of the invention, the parameters that the local device needs to train in this round are determined by the mask vector. The mask vector realizes the personalization of the model, so there is no need for hierarchical design and training of the model. It has good robustness to different model structures and good generalization of the training method.

[0153] Furthermore, the parameters that need to be trained in this round are determined from each parameter based on the mask vector. In each round of training, only some parameters participate, which can reduce the computational complexity of training to a certain extent. During gradient backpropagation, only some gradients (the gradients corresponding to the parameters that participate in training in this round) need to be backpropagated (updating the target parameters according to the current gradient at the target parameter), which can further reduce the computational complexity of training.

[0154] In addition, the embodiments of the present invention adopt a decentralized training scheme, which does not rely on a central server, has good cluster stability, and allows each device in the distributed cluster to deploy personalized models, breaking through the constraint of global single model deployment and improving the inference accuracy of models deployed on local devices for their own data.

[0155] In addition, the initial task processing model for this round of training on the local device is obtained based on the target task processing model obtained from the previous round of training on other devices and the target task processing model obtained from the previous round of training on the local device. This is equivalent to a warm start training process, which can improve the training speed of the model.

[0156] It should be understood that, although the steps in the flowcharts of Figures 1, 2, 3 and 5 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless otherwise explicitly stated herein, there is no strict order requirement for the execution of these steps, and they can be performed in other orders. Furthermore, at least some of the steps in Figures 1, 2, 3 and 5 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but may be executed at different times. The execution order of these sub-steps or stages is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

[0157] This invention also provides a task processing method, which may include the following steps:

[0158] (1) Obtain the current feature information of the target task.

[0159] (2) Input the current feature information into the local target model trained according to the model training method of the present invention, so as to use the local target model to process the target task based on the current feature information, thereby obtaining the processing result of the target task.

[0160] The target task can include, but is not limited to, object detection, item recommendation, text classification, and machine translation. Taking object detection as an example, the current feature information of the target task can be the feature information of the current image. For example, the feature information of the current image can be input into the local target model, and the output of the local target model can be the detection result of the current image. The detection result can include, for example, the location of the target, the type of the target (e.g., person, vehicle), and confidence score. The local target model can be trained using the model training method provided in this embodiment of the invention. The specific process of model training and use will not be elaborated here.

[0161] The local target model used in the task processing method of this invention determines the parameters that the local device needs to train in this round through a mask vector during the training process. The mask vector realizes the personalization of the model, so there is no need for hierarchical design and training of the model. It has good robustness to different model structures and good generalization of the training method.

[0162] Furthermore, the parameters that need to be trained in this round are determined from each parameter based on the mask vector. In each round of training, only some parameters participate, which can reduce the computational complexity of training to a certain extent. During gradient backpropagation, only some gradients (the gradients corresponding to the parameters that participate in training in this round) need to be backpropagated, which can further reduce the computational complexity of training.

[0163] In addition, the embodiments of the present invention adopt a decentralized training scheme, which does not rely on a central server, has good cluster stability, and allows each device in the distributed cluster to deploy personalized models, breaking through the constraint of global single model deployment and improving the inference accuracy of models deployed on local devices for their own data.

[0164] In addition, the initial task processing model for this round of training on the local device is obtained based on the target task processing model obtained from the previous round of training on other devices and the target task processing model obtained from the previous round of training on the local device. This is equivalent to a warm start training process, which can improve the training speed of the model.

[0165] Figure 6 is a structural diagram of a model training apparatus provided in an embodiment of the present invention. This apparatus is suitable for executing the model training method provided in this embodiment of the present invention. The model training apparatus is applied to a local device in a distributed cluster, which also includes other devices. As shown in Figure 6, the apparatus may specifically include:

[0166] The model acquisition module 601 is used to acquire the target task processing model obtained from the previous round of training on the other device;

[0167] The model determination module 602 is used to determine the initial task processing model of the local device in the current training round based on the target task processing model obtained in the previous training round on the other device and the target task processing model obtained in the previous training round on the local device.

[0168] The gradient determination module 603 is used to determine the current gradient at each model parameter of the initial task processing model;

[0169] The mask determination module 604 is used to determine the mask vector for the current training round of the local device;

[0170] The parameter determination module 605 is used to determine the model parameters that need to be updated in this round of training of the local device from the various model parameters based on the mask vector, so as to obtain the target parameters;

[0171] The parameter update module 606 is used to update the target parameters according to the current gradient at the target parameters of the initial task processing model, so as to obtain the target task processing model of the local device in this round of training.

[0172] In one embodiment, the mask determination module 604 is specifically used for:

[0173] Determine the current-round public parameters and the current-round masked parameters from the model parameters, where the current-round public parameters are the model parameters that the local device needs to make public in this round of training, and the current-round masked parameters are the model parameters that the local device needs to mask in this round of training;

[0174] Determine the mask vector for this round of training of the local device based on the current-round public parameters and the current-round masked parameters.

[0175] In one embodiment, the mask determination module 604 determining the current-round public parameters and the current-round masked parameters from the model parameters includes:

[0176] Determine the pruning rate of the local device in this round of training;

[0177] Determine the previous-round public parameters and the previous-round masked parameters from the model parameters, where the previous-round public parameters are the model parameters that the local device needed to make public in the previous round of training, and the previous-round masked parameters are the model parameters that the local device needed to mask in the previous round of training;

[0178] Based on the pruning rate, determine the current-round public parameters from the previous-round public parameters and the previous-round masked parameters, and determine the current-round masked parameters from the previous-round public parameters and the previous-round masked parameters.

[0179] In one embodiment, the mask determination module 604 determining, based on the pruning rate, the current-round public parameters and the current-round masked parameters from the previous-round public parameters and the previous-round masked parameters includes:

[0180] Determine the number of parameters to prune and the number of parameters to restore based on the pruning rate;

[0181] Based on the current values of the previous-round public parameters and the number of parameters to prune, determine the first state change parameters and the remaining public parameters from the previous-round public parameters;

[0182] Based on the current gradients at the previous-round masked parameters and the number of parameters to restore, determine the second state change parameters and the remaining masked parameters from the previous-round masked parameters;

[0183] Determine the current-round public parameters and the current-round masked parameters based on the first state change parameters, the remaining public parameters, the second state change parameters, and the remaining masked parameters.

[0184] In one embodiment, the mask determination module 604 determining the current-round public parameters and the current-round masked parameters based on the first state change parameters, the remaining public parameters, the second state change parameters, and the remaining masked parameters includes:

[0185] Determine the second state change parameters and the remaining public parameters as the current-round public parameters, and determine the first state change parameters and the remaining masked parameters as the current-round masked parameters.

[0186] In one embodiment, the mask determination module 604 determines the pruning rate of the local device in this round of training, including:

[0187] The pruning rate of the previous training round on the local device is attenuated to obtain the pruning rate of the current training round on the local device.

[0188] In one embodiment, the mask determination module 604 determines the mask vector of the local device for the current round of training based on the current-round public parameters and the current-round masked parameters, including:

[0189] Determine the vector identifier of the public parameters and the vector identifier of the masked parameters, wherein the vector identifier of the public parameters is different from the vector identifier of the masked parameters;

[0190] The vector identifier of the public parameters is set at the vector positions corresponding to the current-round public parameters, and the vector identifier of the masked parameters is set at the vector positions corresponding to the current-round masked parameters, so as to obtain the mask vector of the local device for the current round of training.

[0191] In one embodiment, the parameter determination module 605 is specifically used for:

[0192] The current-round public parameters are identified according to the vector identifiers in the mask vector, and the current-round public parameters are determined as the target parameters.
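
As a non-limiting illustration, the construction of the mask vector and the lookup of the target parameters described above can be sketched as follows. The identifier values 1 and 0 and the function names are assumptions; the text only requires that the two vector identifiers differ.

```python
PUBLIC_ID = 1   # assumed vector identifier of public parameters
MASKED_ID = 0   # assumed vector identifier of masked parameters

def build_mask_vector(num_params, public_indices):
    """Place PUBLIC_ID at public positions and MASKED_ID everywhere else."""
    public = set(public_indices)
    return [PUBLIC_ID if i in public else MASKED_ID for i in range(num_params)]

def target_parameter_indices(mask_vector):
    """Return the positions whose identifier marks them as public."""
    return [i for i, flag in enumerate(mask_vector) if flag == PUBLIC_ID]
```

In this sketch the target parameters for the round are exactly the positions recovered by `target_parameter_indices`.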

[0193] In one embodiment, the previous training round on the local device is the first training round on the local device, and the mask vector of the first training round on the local device is determined by the mask determination module 604 according to the following method:

[0194] Determine the resource limitation information of the local device;

[0195] The sparsity of the model is determined based on the resource limitation information;

[0196] The mask vector for the first round of training of the local device is determined based on the sparsity of the model.
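
As a non-limiting illustration, the first-round mask vector might be derived as follows. The text does not specify the form of the resource limitation information or how the initial public positions are chosen; this sketch assumes the limitation is expressed as a hypothetical `param_budget` (the number of parameters the device can afford to train) and that the initial public positions are drawn uniformly at random.

```python
import random

def initial_mask(num_params, param_budget, seed=0):
    """Return (sparsity, mask) for the first training round.

    param_budget is an assumed stand-in for the resource limitation
    information; sparsity is the fraction of masked parameters.
    """
    budget = max(0, min(param_budget, num_params))
    sparsity = 1.0 - budget / num_params

    # Assumption: the initial public positions are chosen at random.
    rng = random.Random(seed)
    public = set(rng.sample(range(num_params), budget))
    mask = [1 if i in public else 0 for i in range(num_params)]
    return sparsity, mask
```

A device with a tighter budget obtains a higher sparsity and therefore a mask vector with fewer public positions.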

[0197] In one embodiment, the device further includes:

[0198] The model sending module is used to send the target task processing model trained by the local device in this round to the other devices.

[0199] In one embodiment, the model determination module 602 is specifically used for:

[0200] The average value of each model parameter is determined based on the current values of the model parameters of the target task processing model obtained by the other devices in the previous round of training and the current values of the model parameters of the target task processing model obtained by the local device in the previous round of training, so as to obtain the initial task processing model of the local device for the current round of training.
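
As a non-limiting illustration, the parameter averaging described above can be sketched as follows, with each model represented as a flat list of parameter values. The function name and the flat-list representation are assumptions, as is the use of an unweighted mean over all received models.

```python
def average_models(local_params, other_params_list):
    """Average each parameter over the local model and the other devices' models."""
    models = [local_params] + list(other_params_list)
    if any(len(m) != len(local_params) for m in models):
        raise ValueError("all models must have the same number of parameters")
    n = len(models)
    # zip(*models) groups the values of one parameter across all devices.
    return [sum(vals) / n for vals in zip(*models)]
```

For example, `average_models([1.0, 2.0], [[3.0, 4.0], [5.0, 6.0]])` returns `[3.0, 4.0]`.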

[0201] In one embodiment, the gradient determination module 603 is specifically used for:

[0202] Select training data for this round from the local training dataset;

[0203] The loss function of the initial task processing model is determined based on the training data from this round.

[0204] Calculate the current gradient of the loss function at each of the model parameters of the initial task processing model to obtain the current gradient of the initial task processing model at each of the model parameters.
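
As a non-limiting illustration, the three steps above can be sketched for a simple linear model with a mean squared error loss. The model form, the loss, the batch sampling strategy, and all function names are assumptions chosen to keep the example small; the text does not restrict the task processing model or its loss function.

```python
import random

def sample_batch(dataset, batch_size, seed=0):
    """Select the training data for this round from the local dataset."""
    rng = random.Random(seed)
    return rng.sample(dataset, min(batch_size, len(dataset)))

def predict(params, x):
    """Assumed linear model: dot product of parameters and features."""
    return sum(w * xi for w, xi in zip(params, x))

def mse_loss(params, batch):
    """Loss of the initial model on this round's training data."""
    return sum((predict(params, x) - y) ** 2 for x, y in batch) / len(batch)

def mse_gradient(params, batch):
    """Gradient of the MSE loss at each model parameter."""
    grads = [0.0] * len(params)
    for x, y in batch:
        err = predict(params, x) - y
        for j, xj in enumerate(x):
            grads[j] += 2.0 * err * xj / len(batch)
    return grads
```

The returned list holds one gradient value per model parameter, which is the quantity the later masking and update steps operate on.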

[0205] In one embodiment, the device further includes:

[0206] The condition determination module is used to determine whether the training cutoff condition has been met.

[0207] If the training cutoff condition is not met, the model acquisition module 601 returns to the process of acquiring the target task processing model obtained from the previous round of training on the other device until the training cutoff condition is met. Then, the parameter update module 606 determines the target task processing model obtained from the last round of training as the local target model of the local device.
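
As a non-limiting illustration, the round loop governed by the training cutoff condition can be sketched as follows. The callables `fetch_other_models`, `train_round`, and `should_stop` are hypothetical placeholders for the model acquisition, the per-round training, and the cutoff check; the text does not fix what the cutoff condition is (a maximum number of rounds is one possible choice).

```python
def run_training(initial_model, fetch_other_models, train_round, should_stop):
    """Repeat training rounds until the cutoff condition is met.

    Returns the model from the last completed round, which serves as the
    local target model of the local device.
    """
    model = initial_model
    round_index = 0
    while not should_stop(round_index, model):
        # Acquire the models the other devices obtained in the previous round.
        others = fetch_other_models(round_index)
        # One round: merge, compute gradients, apply the mask, update.
        model = train_round(model, others)
        round_index += 1
    return model
```

A usage example with trivial stand-ins: `run_training([0.0], lambda r: [], lambda m, o: [m[0] + 1.0], lambda r, m: r >= 3)` runs three rounds and returns `[3.0]`.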

[0208] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional modules is merely an example. In practical applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. The specific working process of the functional modules described above can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.

[0209] The model training device of this invention realizes the personalization of the model through mask vectors, thus eliminating the need for hierarchical design and training of the model. It is robust to different model structures and has good generalizability of the training method.

[0210] Furthermore, the parameters that need to be trained in this round are determined from each parameter based on the mask vector. In each round of training, only some parameters participate, which can reduce the computational complexity of training to a certain extent. During gradient backpropagation, only some gradients (the gradients corresponding to the parameters that participate in training in this round) need to be backpropagated, which can further reduce the computational complexity of training.
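
As a non-limiting illustration, a masked update in which only the target parameters change can be sketched as follows. Plain gradient descent with a fixed learning rate `lr` is an assumption; the text does not name an optimizer. The 1/0 mask encoding and the function name are likewise hypothetical.

```python
def masked_sgd_step(params, grads, mask, lr=0.1):
    """Update only the parameters whose mask flag is 1; leave the rest unchanged."""
    return [
        p - lr * g if m == 1 else p
        for p, g, m in zip(params, grads, mask)
    ]
```

Note that this sketch only skips the update for masked parameters. The saving in backpropagation cost mentioned above would additionally require that gradients for masked parameters are not computed at all, which depends on the training framework used.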

[0211] In addition, the embodiments of the present invention adopt a decentralized training scheme, which does not rely on a central server, has good cluster stability, and allows each device in the distributed cluster to deploy personalized models, breaking through the constraint of global single model deployment and improving the inference accuracy of models deployed on local devices for their own data.

[0212] In addition, the initial task processing model for this round of training on the local device is obtained based on the target task processing model obtained from the previous round of training on other devices and the target task processing model obtained from the previous round of training on the local device. This is equivalent to a warm start training process, which can improve the training speed of the model.

[0213] Figure 7 is a structural diagram of a task processing apparatus provided in an embodiment of the present invention, which is suitable for executing the task processing method provided in an embodiment of the present invention. As shown in Figure 7, the device may specifically include:

[0214] The feature acquisition module 701 is used to acquire the current feature information of the target task;

[0215] The task processing module 702 is used to input the current feature information into a local target model trained by the model training method described in the embodiment of the present invention, so as to use the local target model to process the target task based on the current feature information, thereby obtaining the processing result of the target task.

[0216] The local target model used by the task processing device in this embodiment of the invention determines the parameters that the local device needs to train in this round through the mask vector during the training process. The mask vector realizes the personalization of the model, so there is no need for hierarchical design and training of the model. It has good robustness to different model structures and good generalization of the training method.

[0217] Furthermore, the parameters that need to be trained in this round are determined from each parameter based on the mask vector. In each round of training, only some parameters participate, which can reduce the computational complexity of training to a certain extent. During gradient backpropagation, only some gradients (the gradients corresponding to the parameters that participate in training in this round) need to be backpropagated, which can further reduce the computational complexity of training.

[0218] In addition, the embodiments of the present invention adopt a decentralized training scheme, which does not rely on a central server, has good cluster stability, and allows each device in the distributed cluster to deploy personalized models, breaking through the constraint of global single model deployment and improving the inference accuracy of models deployed on local devices for their own data.

[0219] In addition, the initial task processing model for this round of training on the local device is obtained based on the target task processing model obtained from the previous round of training on other devices and the target task processing model obtained from the previous round of training on the local device. This is equivalent to a warm start training process, which can improve the training speed of the model.

[0220] This invention also provides a model training system that includes a distributed cluster. As shown in Figure 8, the distributed cluster includes other devices 802 and a local device 801 for executing the model training method described in any one of the embodiments of the present invention, wherein there may be multiple other devices 802.

[0221] Reference is now made to Figure 9, which shows a schematic diagram of the structure of a computer system 900 suitable for implementing an electronic device according to embodiments of the present invention. The electronic device shown in Figure 9 is merely an example and should not be construed as limiting the functionality and scope of use of the embodiments of the present invention.

[0222] As shown in Figure 9, the computer system 900 includes a central processing unit (CPU) 901, which can perform various appropriate actions and processes based on programs stored in read-only memory (ROM) 902 or programs loaded from storage section 908 into random access memory (RAM) 903. The RAM 903 also stores various programs and data required for the operation of the computer system 900. The CPU 901, ROM 902, and RAM 903 are interconnected via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.

[0223] The following components are connected to I/O interface 905: an input section 906 including a keyboard, mouse, etc.; an output section 907 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and speakers, etc.; a storage section 908 including a hard disk, etc.; and a communication section 909 including a network interface card such as a LAN card, modem, etc. The communication section 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to I/O interface 905 as needed. A removable medium 911, such as a magnetic disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on drive 910 as needed so that computer programs read from it can be installed into storage section 908 as needed.

[0224] In particular, according to the embodiments disclosed in this invention, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments disclosed in this invention include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication section 909, and / or installed from removable medium 911. When the computer program is executed by central processing unit (CPU) 901, it performs the functions defined above in the system of this invention.

[0225] It should be noted that the computer-readable medium shown in this invention can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this invention, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this invention, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wireless, wire, optical fiber, RF, etc., or any suitable combination thereof.

[0226] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0227] The modules and / or units described in the embodiments of this invention can be implemented in software or hardware. The described modules and / or units can also be housed in a processor; for example, a processor may be described as including a model acquisition module, a model determination module, a gradient determination module, a mask determination module, a parameter determination module, and a parameter update module; or, a processor may be described as including a feature acquisition module and a task processing module. The names of these modules do not necessarily constitute a limitation on the module itself.

[0228] In another aspect, the present invention also provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist independently without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to perform operations including: acquiring the target task processing model obtained by the other devices in the previous round of training; determining the initial task processing model of the local device for the current round of training based on the target task processing model obtained by the other devices in the previous round of training and the target task processing model obtained by the local device in the previous round of training; determining the current gradient at each model parameter of the initial task processing model and determining the mask vector of the local device for the current round of training; determining, based on the mask vector, the model parameters that the local device needs to update in the current round of training from among the model parameters, to obtain target parameters; and updating the target parameters based on the current gradients at the target parameters of the initial task processing model, to obtain the target task processing model of the local device for the current round of training.

[0229] Alternatively, the one or more programs, when executed by a device, cause the device to perform operations including: acquiring current feature information of the target task; and inputting the current feature information into a local target model trained by the model training method described in the embodiments of the present invention, so as to use the local target model to process the target task based on the current feature information, thereby obtaining the processing result of the target task.

[0230] According to the technical solution of the present invention, during the model training process, the parameters that the local device needs to train in this round are determined by the mask vector. The personalization of the model is realized by the mask vector, so there is no need for hierarchical design and training of the model. It has good robustness to different model structures and good generalization of the training method.

[0231] Furthermore, the parameters that need to be trained in this round are determined from each parameter based on the mask vector. In each round of training, only some parameters participate, which can reduce the computational complexity of training to a certain extent. During gradient backpropagation, only some gradients (the gradients corresponding to the parameters that participate in training in this round) need to be backpropagated, which can further reduce the computational complexity of training.

[0232] In addition, the embodiments of the present invention adopt a decentralized training scheme, which does not rely on a central server, has good cluster stability, and allows each device in the distributed cluster to deploy personalized models, breaking through the constraint of global single model deployment and improving the inference accuracy of models deployed on local devices for their own data.

[0233] In addition, the initial task processing model for this round of training on the local device is obtained based on the target task processing model obtained from the previous round of training on other devices and the target task processing model obtained from the previous round of training on the local device. This is equivalent to a warm start training process, which can improve the training speed of the model.

[0234] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can occur depending on design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.

Claims

1. A model training method, characterized in that the model training method is applied to a local device in a distributed cluster, the distributed cluster further including other devices, and the model training method comprises: obtaining the target task processing model obtained by the other devices in the previous round of training; determining the initial task processing model of the local device for the current round of training based on the target task processing model obtained by the other devices in the previous round of training and the target task processing model obtained by the local device in the previous round of training; determining the current gradient at each model parameter of the initial task processing model, and determining the mask vector of the local device for the current round of training; determining, based on the mask vector, the model parameters that the local device needs to update in the current round of training from among the model parameters, to obtain target parameters; and updating the target parameters based on the current gradients at the target parameters of the initial task processing model, to obtain the target task processing model of the local device for the current round of training; wherein determining the mask vector of the local device for the current round of training comprises: determining current-round public parameters and current-round masked parameters from among the model parameters, the current-round public parameters being the model parameters that the local device needs to make public in the current round of training, and the current-round masked parameters being the model parameters that the local device needs to mask in the current round of training; and determining the mask vector of the local device for the current round of training based on the current-round public parameters and the current-round masked parameters;
and wherein determining the current-round public parameters and the current-round masked parameters from among the model parameters comprises: determining the pruning rate of the local device for the current round of training; determining previous-round public parameters and previous-round masked parameters from among the model parameters, the previous-round public parameters being the model parameters that the local device needed to make public in the previous round of training, and the previous-round masked parameters being the model parameters that the local device needed to mask in the previous round of training; and determining the current-round public parameters and the current-round masked parameters from the previous-round public parameters and the previous-round masked parameters based on the pruning rate.

2. The model training method according to claim 1, characterized in that determining the current-round public parameters and the current-round masked parameters from the previous-round public parameters and the previous-round masked parameters based on the pruning rate comprises: determining a pruning quantity and a restoration quantity based on the pruning rate; determining first state change parameters and remaining public parameters from the previous-round public parameters based on the current values of the previous-round public parameters and the pruning quantity; determining second state change parameters and remaining masked parameters from the previous-round masked parameters based on the current gradients at the previous-round masked parameters and the restoration quantity; and determining the current-round public parameters and the current-round masked parameters based on the first state change parameters, the remaining public parameters, the second state change parameters, and the remaining masked parameters.

3. The model training method according to claim 2, characterized in that determining the current-round public parameters and the current-round masked parameters based on the first state change parameters, the remaining public parameters, the second state change parameters, and the remaining masked parameters comprises: determining the second state change parameters and the remaining public parameters as the current-round public parameters, and determining the first state change parameters and the remaining masked parameters as the current-round masked parameters.

4. The model training method according to claim 1, characterized in that, Determining the pruning rate of the local device in this round of training includes: The pruning rate of the previous training round on the local device is attenuated to obtain the pruning rate of the current training round on the local device.

5. The model training method according to claim 1, characterized in that determining the mask vector of the local device for the current round of training based on the current-round public parameters and the current-round masked parameters comprises: determining a vector identifier of the public parameters and a vector identifier of the masked parameters, wherein the vector identifier of the public parameters is different from the vector identifier of the masked parameters; and setting the vector identifier of the public parameters at the vector positions corresponding to the current-round public parameters, and setting the vector identifier of the masked parameters at the vector positions corresponding to the current-round masked parameters, so as to obtain the mask vector of the local device for the current round of training.

6. The model training method according to claim 5, characterized in that, The step of determining the model parameters that need to be updated in this round of training on the local device based on the mask vector from the various model parameters to obtain the target parameters includes: The current round of public parameters are determined based on the vector identifier in the mask vector, and the current round of public parameters are determined as the target parameters.

7. The model training method according to claim 1, characterized in that the previous round of training of the local device is the first round of training of the local device, and the mask vector of the local device for the first round of training is determined as follows: determining resource limitation information of the local device; determining the sparsity of the model based on the resource limitation information; and determining the mask vector of the local device for the first round of training based on the sparsity of the model.

8. The model training method according to claim 1, characterized in that, The method further includes: The target task processing model trained by the local device in this round is sent to the other devices.

9. The model training method according to claim 1, characterized in that determining the initial task processing model of the local device for the current round of training based on the target task processing model obtained by the other devices in the previous round of training and the target task processing model obtained by the local device in the previous round of training comprises: determining the average value of each model parameter based on the current values of the model parameters of the target task processing model obtained by the other devices in the previous round of training and the current values of the model parameters of the target task processing model obtained by the local device in the previous round of training, so as to obtain the initial task processing model of the local device for the current round of training.

10. The model training method according to claim 1, characterized in that, Determining the current gradient at each model parameter of the initial task processing model includes: Select training data for this round from the local training dataset; The loss function of the initial task processing model is determined based on the training data from this round. Calculate the current gradient of the loss function at each of the model parameters of the initial task processing model to obtain the current gradient of the initial task processing model at each of the model parameters.

11. The model training method according to claim 1, characterized in that the method further comprises: determining whether the training cutoff condition has been met; and if the training cutoff condition is not met, returning to the step of obtaining the target task processing model obtained by the other devices in the previous round of training until the training cutoff condition is met, and then determining the target task processing model obtained in the last round of training as the local target model of the local device.

12. A task processing method, characterized by comprising: obtaining current feature information of a target task; and inputting the current feature information into a local target model trained by the model training method according to claim 11, so as to use the local target model to process the target task based on the current feature information, thereby obtaining the processing result of the target task.

13. A model training device, characterized in that the model training device is applied to a local device in a distributed cluster, the distributed cluster further including other devices, and the model training device comprises: a model acquisition module, used to acquire the target task processing model obtained by the other devices in the previous round of training; a model determination module, used to determine the initial task processing model of the local device for the current round of training based on the target task processing model obtained by the other devices in the previous round of training and the target task processing model obtained by the local device in the previous round of training; a gradient determination module, used to determine the current gradient at each model parameter of the initial task processing model; a mask determination module, used to determine the mask vector of the local device for the current round of training; a parameter determination module, used to determine, based on the mask vector, the model parameters that the local device needs to update in the current round of training from among the model parameters, so as to obtain target parameters; and a parameter update module, used to update the target parameters based on the current gradients at the target parameters of the initial task processing model, so as to obtain the target task processing model of the local device for the current round of training; wherein the mask determination module is specifically used for: determining current-round public parameters and current-round masked parameters from among the model parameters, the current-round public parameters being the model parameters that the local device needs to make public in the current round of training, and the current-round masked parameters being the model parameters that the local device needs to mask in the current round of training; and determining the mask vector of the local device for the current round of training based on the current-round public parameters and the current-round masked parameters;
and wherein the mask determination module is further used for: determining the pruning rate of the local device for the current round of training; determining previous-round public parameters and previous-round masked parameters from among the model parameters, the previous-round public parameters being the model parameters that the local device needed to make public in the previous round of training, and the previous-round masked parameters being the model parameters that the local device needed to mask in the previous round of training; and determining the current-round public parameters and the current-round masked parameters from the previous-round public parameters and the previous-round masked parameters based on the pruning rate.

14. A task processing device, characterized by comprising: a feature acquisition module, used to acquire current feature information of a target task; and a task processing module, used to input the current feature information into a local target model trained by the model training method according to claim 11, so as to use the local target model to process the target task based on the current feature information, thereby obtaining the processing result of the target task.

15. A model training system, characterized in that, The model training system includes a distributed cluster, which includes other devices and a local device for performing the model training method as described in any one of claims 1 to 11.

16. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the model training method as described in any one of claims 1 to 11, or when the processor executes the program, it implements the task processing method as described in claim 12.

17. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the program is executed by the processor, it implements the model training method as described in any one of claims 1 to 11, or when the program is executed by the processor, it implements the task processing method as described in claim 12.