Model training method and related device
A model training and network model technology, applied in the field of data processing, can solve problems such as high error rate of network models
Pending Publication Date: 2020-04-03
TENCENT TECH (SHENZHEN) CO LTD
AI-Extracted Technical Summary
Problems solved by technology
[0005] However, the network model determined according to this multi-processing node parallel training m...
Abstract
The embodiments of the invention disclose a model training method and a related device. The model training method comprises the following steps: performing parallel training on a network model through N processing nodes; determining M processing nodes among the N processing nodes when the i-th training iteration is finished, wherein M is less than N; and obtaining model parameters of the network models trained by the M processing nodes as parameters to be fused, and determining, according to the parameters to be fused, initial model parameters of the network model trained by a target processing node at the beginning of the (i+1)-th training iteration, wherein the target processing node is a processing node among the N processing nodes other than the M processing nodes. As the M processing nodes are local processing nodes among the N processing nodes, the initial model parameters can reflect the training characteristics of the local processing nodes, the diversity of the initial model parameters is enhanced, the over-fitting problem of the network model when training is finally completed is reduced, and the model quality is ensured on the premise of improving the training efficiency.
Application Domain
Character and pattern recognition; Neural architectures (+1 more)
Technology Topic
Network model; Engineering (+4 more)
Examples
- Experimental program(1)
Example Embodiment
[0030] The embodiments of the present application will be described below in conjunction with the drawings.
[0031] In order to improve the training speed of complex models, related technologies use parallel training with multiple processing nodes. In order to reduce the training differences between the multiple processing nodes, at the end of one or more training iterations, the model parameters of the models trained by all processing nodes are synthesized, and the synthesized model parameters are used as the initial parameters of the model trained by each processing node in the next training stage.
[0032] As shown in Figure 1, the system includes a parameter server and multiple processing nodes (for example, 5 processing nodes). The parameter server serves as the central processing node: it obtains the model parameters of all processing nodes at the end of a training iteration and synthesizes them, and the synthesized model parameters are then returned to each processing node and used as the initial parameters of each processing node at the beginning of the next training iteration.
[0033] Since the above method performs a unified, comprehensive processing of the model parameters of all processing nodes, the initial parameters received by each processing node at the beginning of the next training iteration are the same and lack specificity, which easily causes over-fitting problems. That is, the trained network model fits the training data too closely, which leads to poor results in actual tests on data separate from the training data.
[0034] In order to solve the above technical problems, the present application provides a model training method, which can be applied to a scenario where N processing nodes perform parallel training on the same network model, where N is an integer greater than or equal to 2. Any one of the processing nodes may be a central processing unit (CPU), a graphics processing unit (GPU), or the like. The N processing nodes may be configured in the same processing device, or may be separately configured in different processing devices, and a processing device configured with the above-mentioned processing nodes may be a server, a terminal, or the like.
[0035] The training process requires k training iterations, and the value of k is related to the number of training samples used to train the network model, the number of samples each training node trains on each time, and so on. Generally, k is an integer greater than or equal to 2. The i-th training iteration mentioned later in the embodiments of this application may be any one of the k training iterations. Since the parallel training of the network model is completed after the last training iteration, i ≤ k−1.
[0036] In the embodiments of the present application, after the N processing nodes complete the i-th training iteration, the initial model parameters of the network model trained by each of the N processing nodes at the beginning of the (i+1)-th training iteration can be determined, and the initial model parameters determined for different processing nodes can be different.
[0037] For any one of the processing nodes, such as the target processing node, when the i-th training iteration is completed, the current model parameters of the network models trained by a part of the processing nodes in the parallel training, such as M processing nodes, are used to determine the parameters to be fused corresponding to the target processing node, and the initial model parameters of the network model trained by the target processing node at the beginning of the (i+1)-th training iteration are determined according to the parameters to be fused. Among them, M is less than N.
[0038] The initial model parameters in the embodiments of this application are used to identify the model parameters of the network model at the beginning of the (i+1)-th training iteration, that is, the model parameters from which the network model trained by the target processing node starts the (i+1)-th training iteration.
[0039] Since the parameters to be fused are determined from the model parameters of a part of the processing nodes, the parts of the processing nodes on which the parameters to be fused are determined for different processing nodes may be completely or partially different, so the parameters to be fused for different processing nodes can be different. This difference can reflect the characteristics of local processing node training. The initial model parameters used by different processing nodes in each training iteration are thus diversified, which reduces the over-fitting problem of the network model when training is finally completed and ensures the model quality on the premise of improving the training efficiency.
[0040] The technical solutions provided in the embodiments of the present application can be applied to data processing equipment with model parameter processing and model parameter configuration capabilities, such as servers, terminals, and the like. The data processing device may be a processing device configured with some or all of the N processing nodes, or it may be an independent device not configured with the N processing nodes.
[0041] In order to facilitate the understanding of the technical solution of the present application, the model training method provided in the embodiments of the present application is introduced below in combination with an actual application scenario. In the scenario shown in Figure 2, N = 6: the 6 processing nodes used for parallel training of the same network model can be configured in one or more servers, and these 6 processing nodes are identified by the numbers 10-60.
[0042] The initial model parameters of the network models trained by these 6 processing nodes at the beginning of the (i+1)-th training iteration can be calculated separately. When calculating the initial model parameters corresponding to any one of the six processing nodes, that processing node is used as the target processing node, for example processing node 10.
[0043] The server configured with the processing node 10 can serve as the aforementioned data processing device and calculate the initial model parameters used by the processing node 10 at the beginning of the (i+1)-th training iteration.
[0044] In the scenario shown in Figure 2, the process of parallel training of the network model includes k training iterations. At the end of the i-th training iteration, for the target processing node, the server configured with processing node 10 can determine a part of the 6 processing nodes and use the model parameters of those processing nodes to determine the parameters to be fused. In the scenario shown in Figure 2, the server configured with the processing node 10 determines 2 processing nodes from the 6 processing nodes (for example, the processing node 20 and the processing node 30 shown in the dashed box) for determining the parameters to be fused.
[0045] The model parameters of the network models trained by the processing node 20 and the processing node 30 at the end of the i-th training iteration are used as the parameters to be fused, and according to the parameters to be fused, the initial model parameters of the network model trained by the processing node 10 at the beginning of the (i+1)-th training iteration are determined.
[0046] It is understandable that the determined initial model parameters can be applied to only one processing node or to multiple processing nodes. For example, when the M processing nodes determined for multiple target nodes are the same, the acquired parameters to be fused are also the same; in this case, the parameters to be fused may be suitable for determining the initial model parameters of multiple processing nodes.
[0047] Since the M processing nodes selected when different processing nodes serve as the target processing node can be different (the difference referred to here can be understood as completely or partly different), the corresponding parameters to be fused can also be distinguished when different processing nodes serve as the target processing node, and the initial model parameters determined from the diversified parameters to be fused are likewise diverse.
[0048] For example, in Figure 2, at the beginning of the (i+1)-th training iteration, the initial model parameters of the network models trained by any two processing nodes, such as processing node 10 and processing node 20, may not be exactly the same. Thus, at the beginning of each training iteration, the initial training parameters of the network models trained by the N processing nodes are different, resulting in different starting points for model training; each starting point reflects the training characteristics of a part of all processing nodes on the network model, avoiding excessive homogeneity. The training diversity of some processing nodes is therefore highlighted without affecting the integrity of parallel training. After k such training iterations, the over-fitting problem of the network model when training is finally completed can be effectively reduced, and the quality of the model is guaranteed on the premise of improving the training efficiency.
[0049] Next, a model training method provided in an embodiment of the present application will be introduced with reference to the accompanying drawings.
[0050] Referring to Figure 3, Figure 3 shows a flowchart of a model training method. The method is applied in the process of parallel training of the network model through N processing nodes, and the parallel training process includes k training iterations. The method includes:
[0051] S301: At the end of the i-th training iteration, determine M processing nodes among the N processing nodes.
[0052] At the end of the i-th training iteration, in order to ensure that the subsequently acquired parameters to be fused are the model parameters of local processing nodes, the data processing device needs to determine M processing nodes from the N processing nodes performing parallel training, where M is less than N. There may be multiple ways to determine the M processing nodes from the N processing nodes, which are not limited in the embodiments of the present application. For example, they can be determined randomly, or based on the communication relationship. The communication relationship can reflect the degree of communication convenience between processing nodes; in the scenario shown in Figure 2, the processing node 10 has a direct communication relationship with the processing node 20 and the processing node 60, and they belong to adjacent processing nodes in the parallel training scenario. In a possible implementation manner, the M processing nodes may be determined among the N processing nodes according to the communication relationship between the target processing node of the current calculation and the N processing nodes. When the M processing nodes are determined for the target processing node, local processing nodes can be selected based on the communication convenience embodied by the communication relationship, so that the determined M processing nodes have better communication convenience with the target processing node than the unselected processing nodes. In the scenario shown in Figure 2, if M is 3, the M processing nodes determined for the processing node 10 may include at least its first-level communication neighbors, namely processing node 20 and processing node 60, and may further include one of its second-level communication neighbors, such as processing node 30 or processing node 50.
[0053] The data processing device configured with the target processing node can obtain the model parameters from the M processing nodes more efficiently, and improve the efficiency of determining the initial model parameters.
[0054] Moreover, since different processing nodes among the N processing nodes may have different adjacent communication relationships, determining the M processing nodes corresponding to the target processing node based on the communication relationship also has a diversifying effect. When different processing nodes serve as the target processing node, the difference between the M processing nodes determined based on the communication relationship is greater than with random selection, and, for the N processing nodes as a whole, the coverage of the processing nodes involved in determining the initial model parameters of each processing node is more comprehensive, so that each training iteration does not lose too much of the overall training characteristics of parallel training, which guarantees the final training quality to a certain extent. The embodiments of the present application do not limit the number M of local processing nodes, as long as it is smaller than N. It should be noted that the larger M is, the smaller the diversifying effect on the initial model parameters; the smaller M is, the less the determined initial model parameters reflect the common characteristics of the overall parallel training, which directly affects the quality of the network model finally obtained by parallel training.
[0055] In a possible implementation manner, based on the above principle for determining M, the number M of local processing nodes may be determined according to the total number N of processing nodes undergoing parallel training. For example, M can be set equal to logN. The number M determined in this way is sufficiently large relative to the total number N to account for a certain proportion of the overall parallel training. When M is determined in this way, the model parameters of the network models trained by the M processing nodes at the end of the i-th training iteration can sufficiently reflect the overall training characteristics of this training iteration, and the determined initial model parameters can also reflect the training characteristics of local processing nodes, bringing a diversifying effect. Moreover, since M is smaller than N, the amount of calculation for determining the initial model parameters is reduced, achieving a better balance between calculation efficiency and training quality.
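As a concrete illustration of the two ideas above (preferring communication neighbors and taking M on the order of logN), the following Python sketch shows one possible selection routine. The function name, the ring-style labelling of nodes, and the distance ordering are illustrative assumptions, not part of the claimed method.

import math
import random

def select_fusion_nodes(target, n, m=None):
    """Pick M local processing nodes for a given target node.

    Illustrative assumptions: the N nodes are labelled 0..n-1, and nodes whose
    labels are close (modulo n) have a more convenient communication
    relationship (a ring-like topology); M defaults to about log(N).
    """
    if m is None:
        m = max(1, round(math.log(n)))      # determine M from N, e.g. M = logN
    m = min(m, n - 1)                       # M must stay smaller than N

    # Order the other nodes by "communication distance" from the target node
    # (first-level neighbours first, then second-level neighbours, and so on).
    others = sorted(
        (i for i in range(n) if i != target),
        key=lambda i: min((i - target) % n, (target - i) % n),
    )
    return others[:m]                       # the M most conveniently reachable nodes

# Example: in the 6-node scenario of Figure 2, with node 10 labelled 0, M = 3
# yields both first-level neighbours plus one second-level neighbour.
print(select_fusion_nodes(target=0, n=6, m=3))   # -> [1, 5, 2], i.e. nodes 20, 60 and 30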
[0056] S302: Acquire parameters to be fused from M processing nodes.
[0057] The data processing device obtains the parameters to be fused from the M processing nodes after determining the M processing nodes. Among them, the parameters to be fused are the model parameters of the network models trained by the M processing nodes at the end of the i-th training iteration. In the embodiments of the present application, the parameters to be fused are denoted as W_j(t), where j = 1, 2, 3, ..., M.
[0058] S303: According to the parameters to be fused, determine the initial model parameters of the network model trained by the target processing node at the beginning of the i+1th training iteration.
[0059] The step of determining the initial model parameters according to the parameters to be fused is one step in the parallel training of the network model by multiple processing nodes. There are many methods for parallel training of a network model by multiple processing nodes, such as the model average (MA) algorithm and the blockwise model-update filtering (BMUF) algorithm. In existing parallel training methods, in order to enable the network models trained by different processing nodes to reflect the training characteristics of the overall processing nodes, model fusion is often performed on the model parameters of the network models trained by different processing nodes after a training iteration is completed; that is, the model parameters of all processing nodes are acquired after this training iteration, and the initial model parameters of each processing node at the beginning of the next training iteration are determined based on the acquired model parameters.
[0060] It is understandable that in different parallel training methods, the specific methods of model fusion are also different. For example, in the MA algorithm, model fusion only needs to average the network model parameters of all processing nodes and assign the average to each processing node as the initial model parameters for its next training iteration. In the BMUF algorithm, after the model parameters of all processing nodes are averaged, in order to reflect the individual characteristics of each processing node, the historical parameters of the network model trained by each processing node are used to compensate the average in a personalized way, and the model parameters after this personalized compensation are used as the initial model parameters for the next training iteration of that processing node.
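For readability, the plain MA fusion step described above can be written out explicitly (this restates the standard formulation rather than quoting this application); W_j(t) denotes the model parameters of processing node j at the end of the current training iteration:

W̄(t) = (1/N) · Σ_{j=1}^{N} W_j(t),    W_n^init(t) = W̄(t) for every processing node n

so every processing node starts the next training iteration from the same averaged parameters, which is exactly the homogeneity that averaging over only M < N local nodes is intended to break.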
[0061] The following will introduce the technical improvements provided by the embodiments of this application based on the scenario of model fusion in the BMUF algorithm:
[0062] (1) Calculate the mean value W̄(t) of the parameters to be fused at the end of the i-th training iteration:

[0063] W̄(t) = (1/M) · Σ_{j=1}^{M} W_j(t)
[0064] Among them, t is the current step count at the end of the i-th training iteration, representing the number of times the network model has been updated with mini-batch training samples when it is trained according to the conventional stochastic gradient descent process.
[0065] In the BMUF algorithm, because the first step averages the model parameters of all processing nodes, it cannot reflect the characteristics of the network models trained by local processing nodes; at the same time, because the averaged model parameters are used as the initial model parameters of every processing node at the beginning of the next training iteration, the averaged model parameters obtained by each processing node are the same, lacking diversification and randomness. These technical shortcomings can easily lead to the over-fitting problem of the network model trained by the BMUF algorithm.
[0066] After combining the technical solution of this application with BMUF, the first step only needs to average the acquired parameters to be fused. The parameters to be fused are the determined model parameters of the M processing nodes, and the M nodes are local processing nodes among all processing nodes, so the average value obtained by averaging the parameters to be fused can reflect the training characteristics of the local processing nodes.
[0067] Since the M processing nodes determined when different processing nodes serve as the target processing node can be different, the acquired parameters to be fused can also be different, and the calculated average model parameters can also be different, so that the average model parameters obtained by different processing nodes are characterized by diversification and randomness. It is precisely this improvement brought by the technical solution of the present application in this step that reduces the degree of over-fitting of the network model obtained after the final parallel training.
[0068] (2) Suppose the target processing node is node n. Calculate the difference G_n between the mean value W̄(t) of the parameters to be fused and the model parameters of the target processing node n after the last model fusion:

[0069] G_n = W̄(t) − W_n(t−τ)
[0070] Among them, τ is the number of mini-batch updates performed on the network model trained by the target processing node between two adjacent model fusions, and W_n(t−τ) denotes the model parameters of the target processing node after the last model fusion. Calculating the difference G_n of the target processing node reflects, to a certain extent, the change of the network model trained by the target processing node, and thereby reflects the individualization of the target processing node's training.
[0071] (3) Calculate and update the change amount Δ_n with the historical gradient weight:

[0072] Δ_n := η·G_n + m·Δ_n
[0073] Among them, the ":=" symbol means assigning the calculation result on the right side of the symbol to the variable on the left side; η > 0 is the block learning rate, an important parameter used to supervise the model training of a given processing node, which determines whether the objective function of the trained network model can converge to a local minimum and how quickly it converges; m is the historical gradient weight (block momentum rate), used to reflect the influence of the historical gradients of the target training node. It can be seen that the change amount Δ_n not only reflects the training speed of the target processing node, but also reflects the influence of the previous i training iterations on the initial model parameters of the (i+1)-th training iteration, so that the network model trained by the target processing node better matches the characteristics of its own training data.
[0074] (4) Calculate and update the intermediate variable Ω_n:

[0075] Ω_n := Ω_n + Δ_n
[0076] (5) Calculate the initial model parameters W_n(t) of the network model trained by the target processing node n at the beginning of the (i+1)-th training iteration:

[0077] W_n(t) = Ω_n + η·Δ_n
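Putting steps (1) through (5) together, a minimal NumPy sketch of one per-node fusion update might look as follows. The function and variable names are illustrative assumptions; the parameters are treated as flat arrays, and Ω_n (updated in step (4)) is also taken to be the record of the node's parameters after the previous fusion that step (2) compares against, an assumption made only to keep the example self-contained.

import numpy as np

def fuse_for_target_node(peer_params, omega_n, delta_n, eta=0.5, momentum=0.9):
    """One fusion update for target node n, following steps (1)-(5) above.

    peer_params : list of arrays, the parameters to be fused W_j(t) from the M nodes
    omega_n     : array, intermediate variable Omega_n (initialised to the starting parameters)
    delta_n     : array, change amount Delta_n (initialised to zeros)
    eta         : block learning rate (eta > 0)
    momentum    : historical-gradient weight m (block momentum rate)
    """
    w_bar = np.mean(peer_params, axis=0)       # (1) mean of the parameters to be fused
    g_n = w_bar - omega_n                      # (2) difference G_n against the last fusion
    delta_n = eta * g_n + momentum * delta_n   # (3) Delta_n := eta*G_n + m*Delta_n
    omega_n = omega_n + delta_n                # (4) Omega_n := Omega_n + Delta_n
    w_init = omega_n + eta * delta_n           # (5) initial parameters W_n(t)
    return w_init, omega_n, delta_n

The variant described in the next paragraph, which also folds the target node's own parameters into the mean, would simply append those parameters to peer_params before step (1).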
[0078] It is understandable that, in some cases, there may be large differences between the network models trained by the M processing nodes and the network model trained by the target processing node n. If the initial model parameters of the target processing node n are determined only from the model parameters of the M processing nodes, the determined initial model parameters may reflect the training characteristics of the local processing nodes but differ greatly from the training characteristics of the network model trained by the target processing node itself; it then becomes difficult to reflect the training characteristics of the target processing node itself, the target processing node becomes overly biased towards the local training characteristics when training the network model, and the training effect of the target processing node is reduced. In order to highlight the characteristics of the network model trained by the target processing node itself, the model training of the target processing node can be appropriately modified. In a possible implementation, the initial model parameters of the network model trained by the target processing node at the beginning of the (i+1)-th training iteration may be determined according to both the parameters to be fused and the model parameters of the network model trained by the target processing node at the end of the i-th training iteration. For example, in the above calculation of the mean value of the parameters to be fused, the model parameters W_n(t) of the network model trained by the target processing node n at the end of the i-th training iteration are included in the mean calculation:
[0079] W̄(t) = (1/(M+1)) · ( Σ_{j=1}^{M} W_j(t) + W_n(t) )
[0080] It is understandable that the beneficial effects brought by combining the technical solution of this application with the BMUF algorithm also exist when the technical solution of this application is combined with other parallel training methods that use the model parameters of all processing nodes for model fusion; the over-fitting problem can still be alleviated to a certain extent.
[0081] In a possible implementation, the network model trained in parallel through the N processing nodes may include multiple sub-modules, and a sub-module of the network model may be a layer of the network model. For example, when the network model includes an input layer, a hidden layer, and an output layer, the input layer can be used as one sub-module, the hidden layer as another sub-module, and the output layer as a further sub-module. Each sub-module has corresponding model parameters, and the model parameters of a sub-module of the network model are a part of the model parameters of the network model.
[0082] Different sub-modules are responsible for processing, according to different rules, the data entering different layers of the network model. In the initial model parameter calculation performed when a certain processing node among the N processing nodes acts as the target processing node, in order to further highlight the diversity and randomness of the parameters of each sub-module, the model parameters of the target sub-module in the network models trained by the M processing nodes can be obtained separately for different sub-modules of the network model trained by the target processing node. It is understandable that the target sub-module is one of the multiple sub-modules in the network model trained by the target processing node. In this way, it can be ensured that the acquired parameters to be fused and the sub-modules that need to be trained using those parameters have corresponding training characteristics, thereby meeting the requirements of different sub-modules for different parameters to be fused. At the same time, when training each sub-module, the acquired parameters to be fused are the model parameters of M local processing nodes among all N processing nodes, so that the trained network model with multiple sub-modules further reflects the training characteristics of the local processing nodes.
[0083] In addition, since the model parameters of local processing nodes are selected separately for the different sub-modules of the network model trained by the target processing node, the M processing nodes selected for different sub-modules can be different. For example, the multiple sub-modules include a first sub-module and a second sub-module; a first group of M processing nodes is used to determine the initial model parameters of the first sub-module, a second group of M processing nodes is used to determine the initial model parameters of the second sub-module, and there are different processing nodes between the first group of M processing nodes and the second group of M processing nodes. This strengthens the randomness and diversity in the network model training process to a certain extent and reduces the over-fitting problem.
[0084] For example, when the network model trained in parallel by the multiple processing nodes shown in Figure 2 is a long short-term memory (LSTM) acoustic model, the acoustic model includes an input layer sub-module, a hidden layer sub-module, and an output layer sub-module, and the target processing node is the processing node 10. When determining the initial model parameters of the input layer sub-module of the LSTM acoustic model trained by the target processing node 10 at the beginning of the (i+1)-th training iteration, the processing node 20 and the processing node 60 can be determined as the corresponding M processing nodes, and the model parameters of the input layer sub-module of the LSTM acoustic models trained by these two processing nodes after the i-th training iteration are obtained as the data to be fused. When determining the initial model parameters of the hidden layer sub-module and the output layer sub-module, the model parameters of the corresponding sub-modules of the LSTM acoustic models trained by the processing node 60 and the processing node 50, and by the processing node 30 and the processing node 40, can be determined respectively as the data to be fused. This ensures that the model parameters of the different layer sub-modules of the LSTM acoustic model trained on the processing node 10 have the comprehensive characteristics of the model parameters of different local nodes.

It can be seen from the above technical solution that the process of parallel training of the network model through the N processing nodes includes k training iterations. For one of them, for example at the end of the i-th training iteration, a part of the processing nodes, such as M processing nodes, is determined from the N processing nodes. The parameters to be fused are determined based on the model parameters of the network models trained by the M processing nodes at this time, and the initial model parameters used by the target processing node for the network model at the beginning of the (i+1)-th training iteration are determined based on the parameters to be fused. Since the initial model parameters used by any processing node for the network model at the beginning of a training iteration are determined based on local processing nodes among the N processing nodes, the initial model parameters used by each processing node at the beginning of this training iteration can be different and reflect the characteristics of local processing node training. The diversity of the initial model parameters of each training iteration reduces the over-fitting problem of the network model when training is finally completed and ensures the model quality while improving the training efficiency.
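As a rough illustration of this per-sub-module selection, the LSTM example above could be expressed as follows; the layer names, the dictionary layout, and the helper function are assumptions made for the sketch, not a prescribed data structure.

# Target processing node: node 10. For each sub-module of its LSTM acoustic
# model, a different group of local processing nodes supplies the parameters
# to be fused (following the Figure 2 example above).
fusion_groups = {
    "input_layer":  [20, 60],   # first-level communication neighbours of node 10
    "hidden_layer": [60, 50],
    "output_layer": [30, 40],
}

def gather_parameters_to_fuse(node_models, fusion_groups):
    """node_models: {node_id: {sub_module_name: parameter array}} at the end of iteration i.

    Returns, for each sub-module, the list of peer parameters to be fused.
    """
    return {
        sub_module: [node_models[node_id][sub_module] for node_id in node_ids]
        for sub_module, node_ids in fusion_groups.items()
    }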
[0085] Next, the model training method provided in the embodiments of the present application will be introduced in combination with a practical application scenario: a speech recognition scenario. The speech recognition system includes a preprocessing module 401, a word boundary detection module 402, a Mel frequency cepstrum coefficient feature module 403, an acoustic model and language model module 404, and an authentication module 405. The model training method provided in the embodiments of the present application can be applied to the training of the acoustic model and the language model in this scenario to achieve high-quality and efficient training.
[0086] The work of the above modules in this scenario is briefly described as follows:
[0087] The preprocessing module 401 is used to receive and preprocess the input voice signal;
[0088] The word boundary detection module 402 is used to perform word boundary detection on the preprocessed speech signal to determine whether it is human voice audio;
[0089] The Mel frequency cepstral coefficient feature module 403 is used to extract Mel frequency cepstral coefficient features from the audio data after determining that the audio is human voice audio;
[0090] The acoustic model and language model module 404 is used to recognize the audio data through the acoustic model and the language model;
[0091] The authentication module 405 is used to authenticate and output the identification result.
[0092] Among them, there are n processing nodes in the acoustic model and language model module 404 for parallel training of the LSTM acoustic model, and the adopted parallel training method is the BMUF algorithm optimized with the technical solution of this application. The flowchart of the model training method of the speech recognition system is shown in Figure 5, and the method includes:
[0093] S501: Divide the LSTM acoustic model into m sub-modules.
[0094] First, according to the characteristics of the LSTM acoustic model, it is divided into m sub-modules, such as the input layer, hidden layer, and output layer sub-modules, as shown in the code below for performing parallel training in the embodiment of this application.
[0095] Mark the n nodes as 0, 1, ..., n−1, and assume that each node i is connected to nodes (i−1)%n, ..., (i−k)%n and (i+1)%n, ..., (i+k)%n, where % represents the modulo operator.
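The original code listing referenced above is not reproduced here. The following Python sketch is a hedged reconstruction of its overall structure from steps S501-S507; the Node class, its methods, and the data layout are illustrative assumptions, and the fusion step reuses the fuse_for_target_node sketch given earlier rather than the actual code of this application.

import random
import numpy as np

class Node:
    """Toy stand-in for a processing node: one parameter vector per sub-module."""
    def __init__(self, node_id, sub_modules, dim=4):
        self.id = node_id
        self.params = {s: np.zeros(dim) for s in sub_modules}
        self.delta = {s: np.zeros(dim) for s in sub_modules}    # Delta_n per sub-module
        self.omega = {s: np.zeros(dim) for s in sub_modules}    # Omega_n per sub-module

    def train_one_minibatch(self, batch, lr=0.01):
        # Placeholder SGD step: pull each sub-module's parameters toward the batch mean.
        for s, w in self.params.items():
            self.params[s] = w - lr * (w - batch.mean())

def neighbours(i, n, k):
    # Node i is connected to (i-1)%n,...,(i-k)%n and (i+1)%n,...,(i+k)%n, as described above.
    return [(i + d) % n for d in range(1, k + 1)] + [(i - d) % n for d in range(1, k + 1)]

def parallel_train(n=6, k=2, q=2, tau=5, num_fusions=3,
                   sub_modules=("input_layer", "hidden_layer", "output_layer")):
    nodes = [Node(i, sub_modules) for i in range(n)]
    shards = [np.random.randn(tau, 8) for _ in range(n)]          # S502: split the data into n parts
    for _ in range(num_fusions):
        for node in nodes:                                        # S503: tau local mini-batch updates
            for b in range(tau):
                node.train_one_minibatch(shards[node.id][b])
        snapshot = {(i, s): nodes[i].params[s].copy()             # freeze parameters at iteration end
                    for i in range(n) for s in sub_modules}
        for target in nodes:                                      # S504: every node acts as target once
            for s in sub_modules:
                peers = random.sample(neighbours(target.id, n, k), q)   # S505: q random neighbours
                peer_params = [snapshot[(p, s)] for p in peers]         # S506: parameters to be fused
                target.params[s], target.omega[s], target.delta[s] = \
                    fuse_for_target_node(peer_params, target.omega[s],  # S507: fusion update
                                         target.delta[s])
    return nodes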
[0098] S502: Divide the data to be trained into n parts and send them to n processing nodes.
[0099] S503: Each processing node reads data to perform model training.
[0100] As shown in the code, after each processing node reads its own data, it updates the model for each mini-batch according to the conventional stochastic gradient descent process, where the current training step count is t.
[0101] S504: Determine the target processing node.
[0102] After the n processing nodes complete τ mini-batch model updates, a model fusion is performed. At the beginning of a model fusion, the target processing node of each fusion calculation is first determined. It is understandable that n calculations need to be performed in one model fusion, that is, each processing node must serve as the target processing node for one calculation, and each calculation involves one target processing node.
[0103] S505: Randomly determine q processing nodes for each sub-module in the acoustic model trained on the target processing node.
[0104] After determining the target processing node, for each of the input layer, hidden layer, and output layer sub-modules of the LSTM acoustic model trained on the target processing node, q processing nodes are randomly selected according to the communication relationship.
[0105] S506: Acquire parameters to be fused corresponding to each sub-module.
[0106] After the q processing nodes corresponding to each sub-module are determined, the model parameters of the corresponding sub-module are obtained from the acoustic models trained by those q processing nodes and used as the parameters to be fused for that sub-module.
[0107] S507: Determine the initial model parameters of the acoustic model trained by the target processing node at the beginning of the next model training iteration according to the parameters to be fused.
[0108] After the parameters to be fused are obtained, the initial model parameters of the target processing node at the beginning of the next training iteration are calculated, as shown in the code, using the optimized algorithm obtained by combining the technical solution of the present application described in the above embodiments with the BMUF algorithm.
[0109] Based on the model training method provided in the foregoing embodiments, this embodiment provides a related device 600 for model training. Referring to Figure 6, the device 600 includes a first determining unit 601, an acquiring unit 602, and a second determining unit 603:
[0110] The first determining unit 601 is configured to determine M processing nodes among the N processing nodes at the end of the i-th training iteration, where i ≤ k−1 and M is less than N;
[0111] The obtaining unit 602 is configured to obtain parameters to be fused from M processing nodes, where the parameters to be fused are model parameters of the network model trained by the M processing nodes;
[0112] The second determining unit 603 is configured to determine, according to the parameters to be fused, the initial model parameters of the network model trained by the target processing node at the beginning of the (i+1)-th training iteration; the target processing node is a processing node among the N processing nodes other than the M processing nodes.
[0113] In a possible implementation, where M is less than N, the second determining unit 603 is specifically configured to:
[0114] determine, according to the parameters to be fused and the model parameters of the network model trained by the target processing node at the end of the i-th training iteration, the initial model parameters of the network model trained by the target processing node at the beginning of the (i+1)-th training iteration.
[0115] In a possible implementation, the network model includes multiple sub-modules, and the parameters to be fused are model parameters of the target sub-module in the network model trained by M processing nodes, and the target sub-module is one of the multiple sub-modules:
[0116] The initial model parameters are used to identify the model parameters of the target submodule trained by the target processing node at the beginning of the i+1th training iteration.
[0117] In a possible implementation manner, the multiple sub-modules include a first sub-module and a second sub-module, wherein the first group of M processing nodes are used to determine the initial model parameters of the first sub-module, and the second group of M processing nodes The nodes are used to determine the initial model parameters of the second sub-module, and there are different processing nodes between the first group of M processing nodes and the second group of M processing nodes.
[0118] In a possible implementation, M is determined based on N.
[0119] In a possible implementation manner, the first determining unit 601 is specifically configured to: at the end of the i-th training iteration, determine the M processing nodes among the N processing nodes according to the communication relationship between the target processing node and the N processing nodes.
[0120] The embodiments of the present application also provide a device for model training, which will be introduced below with reference to the accompanying drawings. Referring to Figure 7, an embodiment of the present application provides a device 700 for model training. The device 700 may also be a terminal device, and the terminal device may be an intelligent terminal such as a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale terminal (POS), or an on-board computer. The terminal device being a mobile phone is taken as an example:
[0121] Figure 7 shows a block diagram of a part of the structure of a mobile phone related to the terminal device provided in an embodiment of the present application. Referring to Figure 7, the mobile phone includes: a radio frequency (RF) circuit 710, a memory 720, an input unit 730, a display unit 740, a sensor 750, an audio circuit 760, a wireless fidelity (WiFi) module 770, a processor 780, a power supply 790, and other components. Those skilled in the art can understand that the structure of the mobile phone shown in Figure 7 does not constitute a limitation on the mobile phone, which may include more or fewer components than shown in the figure, combine some components, or adopt a different component arrangement.
[0122] The components of the mobile phone are described in detail below with reference to Figure 7:
[0123] The RF circuit 710 can be used for receiving and sending signals during the process of sending and receiving information or during a call. In particular, after receiving downlink information from the base station, it delivers the information to the processor 780 for processing; in addition, it sends designed uplink data to the base station. Generally, the RF circuit 710 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 710 can also communicate with the network and other devices through wireless communication. The above-mentioned wireless communication can use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), and the like.
[0124] The memory 720 may be used to store software programs and modules. The processor 780 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 720. The memory 720 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playback function, an image playback function, and the like), and the data storage area may store data created according to the use of the mobile phone (such as audio data, a phone book, and the like). In addition, the memory 720 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another solid-state storage device.
[0125] The input unit 730 can be used to receive input digital or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 730 may include a touch panel 731 and other input devices 732. The touch panel 731, also known as a touch screen, can collect the user's touch operations on or near it (for example, operations performed by the user on or near the touch panel 731 using a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 731 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 780, and can receive and execute commands sent by the processor 780. In addition, the touch panel 731 can be implemented in multiple types, such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the touch panel 731, the input unit 730 may also include other input devices 732. Specifically, the other input devices 732 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control buttons and switch buttons), a trackball, a mouse, a joystick, and the like.
[0126] The display unit 740 may be used to display information input by the user or information provided to the user and various menus of the mobile phone. The display unit 740 may include a display panel 741. Optionally, the display panel 741 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 731 can cover the display panel 741. When the touch panel 731 detects a touch operation on or near it, it transmits the operation to the processor 780 to determine the type of the touch event, and then the processor 780 provides a corresponding visual output on the display panel 741 according to the type of the touch event. Although in Figure 7 the touch panel 731 and the display panel 741 are shown as two independent components to realize the input and output functions of the mobile phone, in some embodiments the touch panel 731 and the display panel 741 can be integrated to realize the input and output functions of the mobile phone.
[0127] The mobile phone may also include at least one sensor 750, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor. The ambient light sensor can adjust the brightness of the display panel 741 according to the brightness of the ambient light, and the proximity sensor can turn off the display panel 741 and/or the backlight when the mobile phone is moved to the ear. As a kind of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in various directions (usually three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications that recognize the mobile phone's posture (such as horizontal and vertical screen switching, related games, and magnetometer posture calibration) and for vibration-recognition related functions (such as a pedometer and percussion). Other sensors that can be configured in the mobile phone, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described in detail here.
[0128] The audio circuit 760, the speaker 761, and the microphone 762 can provide an audio interface between the user and the mobile phone. The audio circuit 760 can transmit the electrical signal converted from the received audio data to the speaker 761, and the speaker 761 converts it into a sound signal for output; on the other hand, the microphone 762 converts the collected sound signal into an electrical signal, which is received by the audio circuit 760 and converted into audio data. After the audio data is processed by the processor 780, it is sent to, for example, another mobile phone via the RF circuit 710, or output to the memory 720 for further processing.
[0129] WiFi is a short-distance wireless transmission technology. Through the WiFi module 770, the mobile phone can help the user send and receive emails, browse web pages, and access streaming media, providing the user with wireless broadband Internet access. Although Figure 7 shows the WiFi module 770, it is understandable that it is not an essential component of the mobile phone and can be omitted as needed without changing the essence of the invention.
[0130] The processor 780 is the control center of the mobile phone. It uses various interfaces and lines to connect the various parts of the entire mobile phone, and executes the various functions of the mobile phone and processes data by running or executing software programs and/or modules stored in the memory 720 and calling data stored in the memory 720, thereby monitoring the mobile phone as a whole. Optionally, the processor 780 may include one or more processing units; preferably, the processor 780 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may also not be integrated into the processor 780.
[0131] The mobile phone also includes a power source 790 (such as a battery) for supplying power to various components. Preferably, the power source can be logically connected to the processor 780 through a power management system, so that functions such as charging, discharging, and power consumption management can be managed through the power management system.
[0132] Although not shown, the mobile phone may also include a camera, a Bluetooth module, etc., which will not be repeated here.
[0133] In this embodiment, the processor 780 included in the terminal device also has the following functions:
[0134] At the end of the i-th training iteration, M processing nodes are determined among the N processing nodes, where i ≤ k−1 and M is less than N;
[0135] Acquiring parameters to be fused from the M processing nodes, where the parameters to be fused are model parameters of the network model trained by the M processing nodes;
[0136] According to the parameters to be fused, the initial model parameters of the network model trained by the target processing node at the beginning of the (i+1)-th training iteration are determined; the target processing node is a processing node among the N processing nodes other than the M processing nodes.
[0137] The embodiments of this application also provide a server. Referring to Figure 8, Figure 8 is a structural diagram of the server 800 provided in an embodiment of this application. The server 800 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 822 (for example, one or more processors), a memory 832, and one or more storage media 830 (for example, one or more mass storage devices) storing application programs 842 or data 844. The memory 832 and the storage medium 830 may provide short-term storage or persistent storage. The program stored in the storage medium 830 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Further, the central processing unit 822 may be configured to communicate with the storage medium 830 and execute, on the server 800, the series of instruction operations in the storage medium 830.
[0138] The server 800 may also include one or more power supplies 826, one or more wired or wireless network interfaces 850, one or more input and output interfaces 858, and/or one or more operating systems 841, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
[0139] The steps performed by the server in the above embodiments may be based on the server structure shown in Figure 8. An embodiment of the present application further provides a computer-readable storage medium for storing program code, where the program code is used to execute any implementation of the model training method described in the foregoing embodiments.
[0140] Those of ordinary skill in the art can understand that all or part of the steps of the above method embodiments can be implemented by a program instructing relevant hardware. The foregoing program can be stored in a computer-readable storage medium, and when the program is executed, the steps of the foregoing method embodiments are performed. The foregoing storage medium may be at least one of the following media capable of storing program code: a read-only memory (ROM), a RAM, a magnetic disk, an optical disk, and the like.
[0141] It should be noted that the embodiments in this specification are described in a progressive manner; the same or similar parts between the embodiments can be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, for the device and system embodiments, since they are basically similar to the method embodiments, the description is relatively simple, and for related parts, reference can be made to the description of the method embodiments. The device and system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement them without creative work.
[0142] The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any change or replacement that a person skilled in the art can readily think of within the technical scope disclosed in this application shall be covered within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.