A distributed training method, device, system, storage medium and electronic device
By compressing and compensating stochastic gradients, the problem of high communication costs in distributed training is solved, achieving efficient model training convergence and improving the efficiency and accuracy of distributed training.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JD DIGITS HAIYI INFORMATION TECHNOLOGY CO LTD
- Filing Date
- 2022-11-22
- Publication Date
- 2026-06-16
Figure CN115719093B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of distributed machine learning technology, and in particular to a distributed training method, apparatus, system, storage medium, and electronic device.
Background Technology
[0002] With the development of artificial intelligence technology and the explosive growth of data, machine learning is becoming increasingly large-scale. To improve the training speed of large-scale machine learning models, distributed learning has been proposed and applied to machine learning training in multiple fields such as vision and speech. A common distributed learning deployment environment is a centralized distributed architecture, which consists of several computing nodes and a central server, where the central server is responsible for coordinating the computation results of the computing nodes.
[0003] In the process of realizing this invention, it was found that the prior art has at least the following technical problem: the number of model parameters in large-scale machine learning is usually very large, so the stochastic gradients have a very high dimension. As a result, the communication cost between the computing nodes and the central server in parallel stochastic gradient descent (P-SGD) is very high, which reduces the efficiency of model training.
Summary of the Invention
[0004] This invention provides a distributed training method, apparatus, system, storage medium, and electronic device to ensure the training accuracy of machine learning models while reducing communication costs.
[0005] According to one aspect of the present invention, a distributed training method is provided, applied to a computing node device, the method comprising:
[0006] During the iterative training of a machine learning model, the stochastic gradient of the machine learning model in the current iteration is determined;
[0007] The stochastic gradient of the current iteration is compressed to obtain the compressed gradient of the current iteration, and the compressed gradient is sent to the central server node. The central server node determines the central gradient of the current iteration based on the compressed gradient sent by each computing node device.
[0008] The system receives the central gradient fed back by the central server node, determines the compensation gradient based on the stochastic gradient of the current iteration and the compressed gradient, compensates the central gradient based on the compensation gradient to obtain the target gradient of the current iteration, and updates the machine learning model for the current iteration based on the target gradient.
[0009] According to another aspect of the present invention, a distributed training method is provided, applied to a central server node device, the method comprising:
[0010] During the iterative training of the machine learning model, the compressed gradient of the machine learning model in the current iteration is received from each computing node.
[0011] The central gradient of the current iteration is determined based on the compressed gradients sent by each computing node and the error of the current iteration.
[0012] If the current iteration count does not meet the preset condition, the central gradient of the current iteration is compressed and the compressed central gradient is fed back to each computing node; and if the current iteration count meets the preset condition, the central gradient of the current iteration is fed back to each computing node, wherein the computing node updates the machine learning model for the current iteration based on the stochastic gradient of the current iteration, the compressed gradient, and the central gradient.
[0013] According to another aspect of the present invention, a distributed training apparatus is provided, integrated into a computing node device, the apparatus comprising:
[0014] The stochastic gradient determination module is used to determine the stochastic gradient of the machine learning model in the current iteration during the iterative training process of the machine learning model.
[0015] The compressed gradient determination module is used to compress the stochastic gradient of the current iteration to obtain the compressed gradient of the current iteration, and send the compressed gradient to the central server node. The central server node determines the central gradient of the current iteration based on the compressed gradient sent by each computing node device.
[0016] The model update module is used to receive the central gradient fed back by the central server node, determine the compensation gradient based on the stochastic gradient of the current iteration and the compressed gradient, compensate the central gradient based on the compensation gradient to obtain the target gradient of the current iteration, and update the machine learning model for the current iteration based on the target gradient.
[0017] According to another aspect of the present invention, a distributed training device is provided, integrated into a central server node device, the device comprising:
[0018] The compressed gradient receiving module is used to receive the compressed gradient of the machine learning model in the current iteration sent by each computing node during the iterative training of the machine learning model.
[0019] The central gradient determination module is used to determine the central gradient of the current iteration based on the compressed gradient sent by each computing node and the error of the current iteration.
[0020] The central gradient sending module is used to compress the central gradient of the current iteration when the current iteration number does not meet a preset condition, and to feed back the compressed central gradient to each computing node; and to feed back the central gradient of the current iteration to each computing node when the current iteration number meets a preset condition, wherein the computing node updates the machine learning model for the current iteration based on the stochastic gradient of the current iteration, the compressed gradient, and the central gradient.
[0021] According to another aspect of the present invention, a distributed training system is provided, comprising a central server node and multiple computing nodes, wherein,
[0022] During the iterative training of the machine learning model, the computing node determines the stochastic gradient of the machine learning model in the current iteration, compresses the stochastic gradient of the current iteration to obtain the compressed gradient of the current iteration, and sends the compressed gradient to the central server node.
[0023] The central server node determines the central gradient of the current iteration based on the compressed gradient sent by each computing node device and, if the current iteration number does not meet the preset condition, compresses the central gradient and sends the compressed central gradient to each computing node.
[0024] The computing node receives the central gradient fed back by the central server node, determines the compensation gradient based on the stochastic gradient of the current iteration and the compressed gradient, compensates the central gradient based on the compensation gradient to obtain the target gradient of the current iteration, and updates the machine learning model for the current iteration based on the target gradient.
[0025] According to another aspect of the present invention, an electronic device is provided, the electronic device comprising:
[0026] At least one processor; and
[0027] A memory communicatively connected to the at least one processor; wherein,
[0028] The memory stores a computer program that can be executed by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the distributed training method according to any embodiment of the present invention.
[0029] According to another aspect of the present invention, a computer-readable storage medium is provided, the computer-readable storage medium storing computer instructions for causing a processor to execute and implement the distributed training method described in any embodiment of the present invention.
[0030] The technical solution provided in this embodiment compresses the stochastic gradient of each iteration before transmission to the central server node and transmits the compressed gradient, reducing the communication cost between the computing nodes and the central server node. Furthermore, the central gradient fed back by the central server node can itself be a compressed gradient, so that gradients are compressed in both directions of transmission between the computing nodes and the central server node, further reducing communication costs. At the same time, the error caused by compression is used to determine the compensation gradient, which compensates the central gradient of the current iteration, and the model parameters are updated based on the compensated target gradient. By performing gradient compensation within the current iteration, the slow convergence caused by gradient compression is avoided while communication costs are reduced, thus improving the convergence speed of the distributed training process.
[0031] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of the present invention, nor is it intended to limit the scope of the invention. Other features of the invention will become readily apparent from the following description.
Attached Figure Description
[0032] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0033] Figure 1 is a flowchart of a distributed training method provided in an embodiment of the present invention;
[0034] Figure 2 is a flowchart illustrating an embodiment of the present invention;
[0035] Figure 3 is a flowchart of a distributed training method provided in an embodiment of the present invention;
[0036] Figure 4 is a flowchart of a distributed training method provided in an embodiment of the present invention;
[0037] Figure 5 is a schematic diagram of the structure of a distributed training device provided in an embodiment of the present invention;
[0038] Figure 6 is a schematic diagram of the structure of a distributed training device provided in an embodiment of the present invention;
[0039] Figure 7 is a schematic diagram of the structure of a distributed training system provided in an embodiment of the present invention;
[0040] Figure 8 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.
Detailed Implementation
[0041] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.
[0042] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0043] Training a machine learning model using parallel stochastic gradient descent (P-SGD) in a centralized distributed environment involves the following steps: 1. Each computing node calculates a stochastic gradient based on its local model and data and sends it to the central server; 2. The central server averages the collected stochastic gradients and returns the average gradient to the computing nodes; 3. Each computing node updates its model using the average gradient. The distributed training system consists of multiple computing nodes and a central server node. The central server node can be configured in a central server device, and the computing nodes can be configured in computing node devices. Different computing nodes can be configured in different computing node devices, or two or more computing nodes can be configured in the same computing node device; this is not limited here.
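The three P-SGD steps above can be illustrated with a minimal sketch. It assumes that step 3 is a plain gradient descent step on the averaged gradient; the function names `average` and `psgd_round` and the toy values are hypothetical and are not taken from the specification.

```python
from typing import List

Vector = List[float]


def average(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors (the central server's step 2)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def psgd_round(params: Vector, node_gradients: List[Vector], lr: float) -> Vector:
    """One uncompressed P-SGD round.

    node_gradients holds the stochastic gradient computed by each node (step 1).
    The server averages them (step 2) and every node applies the same
    descent step with learning rate lr (step 3, assumed to be plain SGD).
    """
    avg_grad = average(node_gradients)
    return [p - lr * g for p, g in zip(params, avg_grad)]


# Toy example: two computing nodes, a model with two parameters.
new_params = psgd_round([1.0, 2.0], [[0.2, 0.4], [0.6, 0.0]], lr=0.5)
```

In this baseline every node uploads a full-dimensional gradient in every round, which is the communication cost the compression scheme is meant to reduce.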
[0044] In this embodiment, the computing nodes and the central server node jointly train the machine learning model. The application scenario of distributed training is not limited here, i.e., the type of machine learning model and the functionality of the trained machine learning model are not limited. In some embodiments, the machine learning model may include, but is not limited to, neural network models, logistic regression models, etc., wherein the neural network model may include, but is not limited to, convolutional neural network models (CNN), recurrent neural network models (RNN), long short-term memory network models (LSTM), residual network models (e.g., ResNet50), etc. Distributed training of machine learning models can be applied to image classification models, image segmentation models, image feature extraction models, image compression models, image enhancement models, image denoising models, image label generation models, text classification models, text translation models, text summarization extraction models, text prediction models, keyword conversion models, text semantic analysis models, speech recognition models, audio denoising models, audio synthesis models, audio equalizer conversion models, weather prediction models, product recommendation models, article recommendation models, action recognition models, face recognition models, facial expression recognition models, and other machine learning models. The above application scenarios are only illustrative examples, and this application does not limit the application scenarios of the distributed training method.
[0045] In distributed training of machine learning models, compute nodes iteratively train the machine learning model locally. Accordingly, the compute nodes are pre-configured with sample data, and the machine learning model is iteratively trained based on this sample data. The training method for the machine learning model is not limited; it can be supervised training, unsupervised training, etc., as long as the machine learning model can be trained and the network parameters updated.
[0046] The sample data can be image data, and the prediction result of the machine learning model is the image processing result; or, the sample data can be text data, and the prediction result of the machine learning model is the text processing result; or, the sample data can be audio data, and the prediction result of the machine learning model is the audio processing result.
[0047] For example, if the sample data is image data, the machine learning model can be an image classification model, and the predicted result output by the machine learning model can be an image classification result; or, the machine learning model can be an image segmentation model, and the predicted result can be an image segmentation result; or, the machine learning model can be an image feature extraction model, and the predicted result can be an image feature extraction result; or, the machine learning model can be an image compression model, and the predicted result can be an image compression result; or, the machine learning model can be an image enhancement model, and the predicted result can be an image enhancement result; or, the machine learning model can be an image denoising model, and the predicted result can be an image denoising result; or, the machine learning model can be an image label generation model, and the predicted result can be an image label, etc. If the sample data is text data, the machine learning model can be a text classification model, and the predicted result output by the machine learning model can be a text classification result; or, the machine learning model can be a text prediction model, and the predicted result can be a text prediction result; or, the machine learning model can be a text summarization extraction model, and the predicted result can be a text summarization extraction result; or, the machine learning model can be a text translation model, and the predicted result can be a text translation result; or, the machine learning model can be a keyword conversion model, and the predicted result can be a keyword conversion result; or, the machine learning model can be a text semantic analysis model, and the predicted result can be a text semantic analysis result, etc. 
If the sample data is audio data, the machine learning model can be a speech recognition model, and the prediction result output by the machine learning model can be a speech recognition result; or, the machine learning model can be an audio noise reduction model, and the prediction result can be an audio noise reduction result; or, the machine learning model can be an audio synthesis model, and the prediction result can be an audio synthesis result; or, the machine learning model can be an audio equalizer conversion model, and the prediction result can be an audio equalizer conversion result, etc.
[0048] In any iteration of distributed training, each computing node performs local training of the machine learning model for any of the above scenarios, calculates stochastic gradients based on the local model and data, and sends them to the central server node. Data transmission between each computing node and the central server node incurs significant communication costs. To reduce these costs, the transmitted gradients can be compressed. However, compression slows down the convergence of the machine learning model training process.
[0049] To address the aforementioned technical problems, embodiments of the present invention provide a distributed training method. Figure 1 is a flowchart of a distributed training method provided by an embodiment of the present invention. This embodiment is applicable to the training of machine learning models on computing nodes. The method can be executed by a distributed training device, which can be implemented in hardware and/or software. The distributed training device can be configured in a computing node device, which can be an electronic device such as a computer, mobile phone, or PC. As shown in Figure 1, the method includes:
[0050] S110. During the iterative training of the machine learning model, determine the stochastic gradient of the machine learning model in the current iteration.
[0051] S120. The stochastic gradient of the current iteration is compressed to obtain the compressed gradient of the current iteration, and the compressed gradient is sent to the central server node. The central server node determines the central gradient of the current iteration based on the compressed gradient sent by each computing node device.
[0052] S130. Receive the central gradient fed back by the central server node, determine the compensation gradient based on the stochastic gradient of the current iteration and the compressed gradient, compensate the central gradient based on the compensation gradient to obtain the target gradient of the current iteration, and update the machine learning model for the current iteration based on the target gradient.
[0053] In this embodiment, any computing node, after completing local training, determines the stochastic gradient of the machine learning model in the current iteration. This stochastic gradient can be determined based on the machine learning model's loss function, which can be pre-set and is not limited here. For example, the loss function can be denoted as $f_i(x, \xi_i)$, and the stochastic gradient is determined based on this loss function. Specifically, the derivatives of the model parameters in the machine learning model can be determined based on the loss function. These derivatives can be derivatives of different orders corresponding to different network layers. For example, in the case of a machine learning model with three network layers, the derivatives of the model parameters include the first, second, and third derivatives of the loss function with respect to the model parameters. Combining the derivatives of each model parameter yields the stochastic gradient of the machine learning model in the current iteration. For instance, combining the derivatives of each model parameter in the form of a matrix or vector yields the stochastic gradient of the machine learning model in the current iteration.
[0054] Each computing node needs to send the stochastic gradient of the current iteration to the central server node. To reduce communication costs, the stochastic gradient of the current iteration can be compressed to obtain the compressed gradient, which is then sent to the central server. Compared to the original stochastic gradient, the compressed gradient has a smaller number of values and faster transmission, reducing the communication costs between the computing nodes and the central server node.
[0055] Optionally, compressing the stochastic gradient of the current iteration to obtain the compressed gradient of the current iteration includes: invoking a compressor, and compressing the stochastic gradient of the current iteration with the compressor to obtain the compressed gradient of the current iteration. The computing node is configured with a compressor; the type of compressor is not limited here. For example, the compressor can be a δ-compressor, which satisfies a condition for any vector x (the formula is not reproduced here), where δ is the compression-related parameter of the compressor and the output of the compressor is the compressed form of the vector x.
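As an illustration only, the sketch below implements one possible compressor, top-k sparsification, and checks it against a definition of a δ-compressor that is common in the literature: ||C(x) - x||^2 <= (1 - δ)·||x||^2 with δ in (0, 1]. Both the choice of top-k and this inequality are assumptions; the specification's own condition is not recoverable from the text, and the names `top_k_compress` and `satisfies_delta_condition` are hypothetical.

```python
from typing import List


def top_k_compress(x: List[float], k: int) -> List[float]:
    """Keep the k largest-magnitude entries of x and set the rest to zero."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def sq_norm(x: List[float]) -> float:
    """Squared Euclidean norm."""
    return sum(v * v for v in x)


def satisfies_delta_condition(x: List[float], cx: List[float], delta: float) -> bool:
    """Check the assumed condition ||C(x) - x||^2 <= (1 - delta) * ||x||^2."""
    residual = [c - v for c, v in zip(cx, x)]
    return sq_norm(residual) <= (1.0 - delta) * sq_norm(x) + 1e-12


x = [0.5, -3.0, 0.1, 2.0]
cx = top_k_compress(x, k=2)
# For top-k on a d-dimensional vector, delta = k / d is a valid parameter
# under the assumed definition.
ok = satisfies_delta_condition(x, cx, delta=2 / 4)
```

In practice only the retained values and their indices would be transmitted, which is where the saving in communication comes from; the dense list returned here is kept for readability.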
[0056] In this embodiment, compressed gradients, rather than model parameters, are transmitted between the computing nodes and the central server node. Because model parameters cannot be compressed, transmitting them would incur a high communication cost, whereas transmitting compressed gradients reduces the communication cost of distributed training.
[0057] The central server node receives the compressed gradients transmitted by each computing node and determines the central gradient for the current iteration based on these compressed gradients. In some embodiments, the central gradient is the average of the compressed gradients uploaded by each computing node; in other embodiments, the central gradient is the sum of the average of the compressed gradients uploaded by each computing node and the error compensation value of the current iteration. The error compensation value can be determined based on the central gradient of the previous iteration and its compressed form, for example, $e_t = v_{t-1} - p_{t-1}$, where $e_t$ is the error compensation value in the current iteration, $v_{t-1}$ is the central gradient from the previous iteration, and $p_{t-1}$ is the compressed gradient of the central gradient from the previous iteration. Correspondingly, the central gradient of the current iteration can be $v_t = \frac{1}{N}\sum_{i=1}^{N} \hat{g}_t^{(i)} + e_t$, where $\hat{g}_t^{(i)}$ is the compressed gradient uploaded by the $i$-th computing node and $N$ is the number of computing nodes.
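The server-side computation described above can be sketched as follows. The sketch assumes the variant in which the central gradient is the average of the uploaded compressed gradients plus the error compensation value, and it assumes that the error for the next iteration is the difference between the central gradient and its compressed form. The compressor `keep_largest` and the function `server_round` are hypothetical stand-ins.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def keep_largest(v: Vector) -> Vector:
    """Hypothetical compressor: keep only the largest-magnitude entry."""
    j = max(range(len(v)), key=lambda i: abs(v[i]))
    return [val if i == j else 0.0 for i, val in enumerate(v)]


def server_round(
    compressed_grads: List[Vector],
    error: Vector,
    compress: Callable[[Vector], Vector],
) -> Tuple[Vector, Vector, Vector]:
    """Return (v_t, p_t, e_next) for one iteration on the central server."""
    n = len(compressed_grads)
    mean = [sum(col) / n for col in zip(*compressed_grads)]
    v_t = [m + e for m, e in zip(mean, error)]      # average plus error compensation
    p_t = compress(v_t)                             # compressed central gradient fed back
    e_next = [v - p for v, p in zip(v_t, p_t)]      # error carried to the next iteration
    return v_t, p_t, e_next


# Two nodes upload compressed gradients; the server starts with zero error.
v, p, e_next = server_round([[1.0, 0.0], [0.0, 0.5]], [0.0, 0.0], keep_largest)
```

The part of the central gradient dropped by the server-side compressor is not lost: it is carried in `e_next` and added back in the following iteration.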
[0058] In some embodiments, the central server node sends the central gradient to each computing node. In some embodiments, the central server node compresses the central gradient and then sends the compressed central gradient to each computing node. By compressing the central gradient, bidirectional gradient compression between the central server node and the computing nodes is achieved, thereby reducing communication costs.
[0059] Each computing node updates the machine learning model for the current iteration based on the central gradient fed back from the central server node. The central gradient received by the computing node can be a compressed central gradient or an uncompressed central gradient. Because gradients are compressed in one or both directions between the central server node and each computing node, compression can slow down convergence. To address this issue, in this embodiment, a compensation gradient is determined based on the compression error caused by compression processing, and this compensation gradient is used to compensate the central gradient in the current iteration. Correspondingly, the machine learning model is updated for the current iteration based on the compensated gradient, thus avoiding slow convergence in distributed training. It should be noted that, compared to accumulating the gradient elements left over from compression in an error variable at the computing nodes, using the compensation gradient formed in the current iteration to compensate the central gradient of the same iteration reduces the number of gradient elements held in the error variable over the entire distributed training process. This mitigates the delayed model updates caused by elements waiting in the error variable, thereby improving the convergence speed of the distributed training process.
[0060] Optionally, determining a compensation gradient based on the stochastic gradient of the current iteration and the compressed gradient, and compensating the central gradient based on the compensation gradient to obtain the target gradient of the current iteration, includes: determining the compensation gradient based on the difference between the stochastic gradient of the current iteration and the compressed gradient; and determining the target gradient of the current iteration based on the sum of the compensation gradient and the central gradient. Specifically, writing the stochastic gradient of the current iteration on computing node $i$ as $g_t^{(i)}$ and the compressed gradient as $\hat{g}_t^{(i)}$, the compensation gradient can be $g_t^{(i)} - \hat{g}_t^{(i)}$, and the target gradient can be $p_t + (g_t^{(i)} - \hat{g}_t^{(i)})$, where $p_t$ is the central gradient.
[0061] Based on the above embodiments, the computing node can update the parameters of the machine learning model based on the target gradient using the following formula:
[0062] $x_{t+1} = x_t - \eta \, \tilde{g}_t$, where $x_t$ denotes the model parameters that have not yet been updated in the current iteration, $x_{t+1}$ denotes the model parameters after the current iteration, $\tilde{g}_t$ is the target gradient, and $\eta$ is the learning rate.
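The compensation and update on a computing node can be sketched as below. The sketch follows the textual description (compensation gradient = stochastic gradient minus compressed gradient; target gradient = central gradient plus compensation gradient) and assumes the update is a plain descent step with learning rate η. The function name `node_update` and the numeric values are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def node_update(
    params: Vector,
    stochastic_grad: Vector,
    compressed_grad: Vector,
    central_grad: Vector,
    lr: float,
) -> Tuple[Vector, Vector, Vector]:
    """Return (new_params, compensation, target) for the current iteration."""
    # Compensation gradient: what the node's compressor dropped in this iteration.
    compensation = [g - c for g, c in zip(stochastic_grad, compressed_grad)]
    # Target gradient: central gradient compensated within the same iteration.
    target = [p + d for p, d in zip(central_grad, compensation)]
    # Assumed update rule: plain gradient descent step.
    new_params = [x - lr * t for x, t in zip(params, target)]
    return new_params, compensation, target


new_params, compensation, target = node_update(
    params=[1.0, 1.0],
    stochastic_grad=[0.5, 0.25],
    compressed_grad=[0.5, 0.0],   # the second entry was dropped by compression
    central_grad=[0.25, 0.0],
    lr=0.5,
)
```

In this toy case the second coordinate would receive no update from the central gradient alone; the compensation gradient restores the locally dropped value so that the parameter still moves in the current iteration.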
[0063] In a distributed training system consisting of multiple computing nodes and a central server node, each computing node completes an iteration based on the above process. During each iteration, it interacts with the central server node to determine the target gradient for each iteration, thereby updating the model parameters of the machine learning model. The above process is executed iteratively until convergence is reached, at which point each computing node obtains the trained machine learning model.
[0064] The technical solution provided in this embodiment compresses the stochastic gradient of each iteration before transmission to the central server node and transmits the compressed gradient, reducing the communication cost between the computing nodes and the central server node. Furthermore, the central gradient fed back by the central server node can itself be a compressed gradient, so that gradients are compressed in both directions of transmission between the computing nodes and the central server node, further reducing communication costs. At the same time, the error caused by compression is used to determine the compensation gradient, which compensates the central gradient of the current iteration, and the model parameters are updated based on the compensated target gradient. By performing gradient compensation within the current iteration, the slow convergence caused by gradient compression is avoided while communication costs are reduced, thus improving the convergence speed of the distributed training process.
[0065] Based on the above embodiments, this invention also provides a distributed training method; see Figure 2, which is a flowchart illustrating an embodiment of the present invention. Optionally, before compressing the stochastic gradient of the current iteration, the method further includes determining the current iteration number; if the current iteration number does not meet a preset condition, compressing the stochastic gradient of the current iteration; if the current iteration number meets the preset condition, sending the stochastic gradient of the current iteration as the compressed gradient to the central server node. Accordingly, the method specifically includes the following steps:
[0066] S210. During the iterative training of the machine learning model, determine the stochastic gradient of the machine learning model in the current iteration.
[0067] S220. If the current iteration count does not meet the preset condition, the stochastic gradient of the current iteration is compressed; if the current iteration count meets the preset condition, the stochastic gradient of the current iteration is sent to the central server node as a compressed gradient.
[0068] S230. Receive the central gradient fed back by the central server node, determine the compensation gradient based on the stochastic gradient of the current iteration and the compressed gradient, compensate the central gradient based on the compensation gradient to obtain the target gradient of the current iteration, and update the machine learning model for the current iteration based on the target gradient.
[0069] Different computing nodes differ in how they train the machine learning model, for example in their sample data or compressors, and in particular in the bidirectional gradient compression between each computing node and the central server node during each iteration. As a result, the model parameters of the machine learning models trained on different computing nodes can diverge during iterative training. In this embodiment, to reduce these model parameter differences between computing nodes, gradient compression is performed in only a portion of the iterations, and uncompressed gradients are transmitted in the remaining iterations.
[0070] A pre-defined criterion for the number of iterations is set. If the criterion is met, gradient compression is not performed; otherwise, it is performed. Optionally, the number of iterations for gradient compression is greater than the number of iterations without gradient compression, in order to reduce communication costs and minimize the differences between machine learning models trained on different computing nodes.
[0071] In some embodiments, the preset conditions include a preset interval number condition, such as 50 times or 100 times. That is, if the interval between the current iteration number and the previous iteration number without gradient compression meets the preset interval number, it is determined that the current iteration number meets the preset conditions, and gradient compression is not performed in the current iteration. If the interval between the current iteration number and the previous iteration number without gradient compression does not meet the preset interval number, it is determined that the current iteration number does not meet the preset conditions, and gradient compression is performed in the current iteration.
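The interval condition described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `meets_interval_condition`, the 1-based iteration numbering, and the use of a modulo test to measure the interval since the last uncompressed iteration are all choices made here.

```python
def meets_interval_condition(t: int, interval: int) -> bool:
    """Return True when iteration t should skip gradient compression.

    Assumption: iterations are numbered from 1 and every `interval`-th
    iteration is sent without compression.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return t % interval == 0


# Iterations in 1..200 that are sent uncompressed for an interval of 50.
uncompressed = [t for t in range(1, 201) if meets_interval_condition(t, 50)]
print(uncompressed)  # [50, 100, 150, 200]
```

With an interval of 50, 196 of the 200 iterations are compressed, which matches the statement that compressed iterations outnumber uncompressed ones.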
[0072] In some embodiments, the preset conditions include an iteration number determination condition based on a compression-related parameter of the compressor. Optionally, the preset condition may be ..., where t is the iteration number and δ is the compression-related parameter of the compressor, which can be read from the compressor configured on the computing node. Determining the iteration number condition from the compression-related parameter of the compressor is beneficial for achieving convergence.
[0073] Based on the above embodiments, after determining the stochastic gradient of the current iteration, the current iteration number is determined. If the current iteration number does not meet the preset condition, the stochastic gradient of the current iteration is compressed, and the resulting compressed gradient is sent to the central server node. If the current iteration number meets the preset condition, the stochastic gradient of the current iteration is sent to the central server node as the compressed gradient. Correspondingly, if the current iteration number does not meet the preset condition, the compensation gradient is determined as the difference between the stochastic gradient and the compressed gradient of the current iteration. If the current iteration number meets the preset condition, the stochastic gradient is identical to the compressed gradient, and the compensation gradient is zero. The computing node stores the compensation gradient for later use.
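The computing-node branch in this paragraph can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `prepare_upload`, the list-based vector type, and the `threshold_compressor` helper (a stand-in compressor used only for the demonstration) are assumptions introduced here.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def prepare_upload(
    grad: Vector,
    condition_met: bool,
    compress: Callable[[Vector], Vector],
) -> Tuple[Vector, Vector]:
    """Return (compressed_gradient, compensation_gradient) for one iteration."""
    if condition_met:
        # Preset condition met: the stochastic gradient itself is sent.
        compressed = list(grad)
    else:
        compressed = compress(grad)
    # Compensation gradient = stochastic gradient - compressed gradient.
    compensation = [g - c for g, c in zip(grad, compressed)]
    return compressed, compensation


def threshold_compressor(grad: Vector) -> Vector:
    # Hypothetical compressor for the demo only: zero out small entries.
    return [g if abs(g) >= 1.0 else 0.0 for g in grad]


g = [0.5, -2.0, 0.25]
sent, comp = prepare_upload(g, False, threshold_compressor)
print(sent, comp)            # [0.0, -2.0, 0.0] [0.5, 0.0, 0.25]
sent_full, comp_zero = prepare_upload(g, True, threshold_compressor)
print(sent_full, comp_zero)  # [0.5, -2.0, 0.25] [0.0, 0.0, 0.0]
```

The second call shows the case in which the preset condition is met: nothing is lost in transmission, so the stored compensation gradient is zero.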
[0074] The central server node determines the central gradient based on the received compressed gradients, for example as the sum of the mean of the compressed gradients uploaded by the computing nodes and the current error compensation value. The central server node then determines the current iteration count. Note that the current iteration count is the same for the computing nodes and the central server node, meaning they are in the same iteration. Optionally, the computing nodes and the central server node use the same preset condition to judge the iteration count. If the central server node determines that the current iteration count meets the preset condition, the central gradient is not compressed. If the central server node determines that the current iteration count does not meet the preset condition, the central gradient is compressed. The processed central gradient is then fed back to each computing node.
[0075] The way a computing node updates the model parameters of the machine learning model depends on the iteration number. When the current iteration number does not meet the preset condition, the central gradient is compensated with the compensation gradient, and the model parameters obtained in the previous iteration are updated based on the compensated result. When the current iteration number meets the preset condition, the local model parameters are first set to the returned average model parameters and are then updated based on the central gradient; in this case the compensation gradient is zero.
[0076] The technical solution in this embodiment reduces communication costs during distributed training by performing bidirectional gradient compression between computing nodes and the central server node. Simultaneously, periodic model averaging mitigates the inconsistencies in local models across computing nodes caused by error compensation mechanisms, thus ensuring model accuracy on computing nodes while reducing communication costs.
[0077] Figure 3 is a flowchart of a distributed training method provided by an embodiment of the present invention. This embodiment is applicable to situations where a machine learning model is trained on a central server node. The method can be executed by a distributed training device integrated on the central server node. This distributed training device can be implemented in hardware and / or software and can be configured in the central server node device, which can be an electronic device such as a computer or server. As shown in Figure 3, the method includes:
[0078] S310. During the iterative training of the machine learning model, receive the compressed gradient of the machine learning model in the current iteration sent by each computing node.
[0079] S320. Determine the central gradient of the current iteration based on the compressed gradients sent by each computing node and the error of the current iteration.
[0080] S330. If the current iteration count does not meet a preset condition, the central gradient of the current iteration is compressed, and the compressed central gradient is fed back to each computing node. Conversely, if the current iteration count meets a preset condition, the central gradient of the current iteration is fed back to each computing node. The computing nodes update the machine learning model for the current iteration based on the stochastic gradient of the current iteration, the compressed gradient, and the central gradient.
[0081] In this embodiment, the compressed gradient received by the central server node from the computing node can be obtained by compressing a stochastic gradient, or it can be obtained by using a stochastic gradient as the compressed gradient. The compressed gradient can be determined based on the current iteration number. For example, if the current iteration number does not meet a preset condition, the compressed gradient is obtained by compressing a stochastic gradient; if the current iteration number meets the preset condition, the compressed gradient is obtained by the stochastic gradient calculated by the computing node.
[0082] The central server node determines the central gradient for the current iteration based on the compressed gradients and the error of the current iteration. For example, the central gradient $v_t$ may be the mean of the compressed gradients sent by the computing nodes plus $e_t$, where $e_t$ represents the error of the current iteration.
[0083] The central server node checks the current iteration count. When the current iteration count does not meet the preset condition, the central gradient $v_t$ of the current iteration is compressed, and the compressed central gradient $p_t$ is fed back to each computing node, enabling the computing nodes to update the machine learning model for the current iteration based on the stochastic gradient, the compressed gradient, and the central gradient. When the current iteration count meets the preset condition, the central gradient $p_t = v_t$ of the current iteration and the average model parameters are fed back to each computing node, enabling the computing nodes to update the model parameters based on the central gradient and the average model parameters.
[0084] The technical solution in this embodiment reduces communication costs during distributed training by performing bidirectional gradient compression between computing nodes and the central server node. Simultaneously, periodic model averaging mitigates the inconsistencies in local models across computing nodes caused by error compensation mechanisms, thus ensuring model accuracy on computing nodes while reducing communication costs.
[0085] Based on the above embodiments, this invention also provides a preferred example of a distributed training method. See Figure 4, which is a flowchart of a distributed training method provided in an embodiment of the present invention, specifically the execution flow when the current iteration number does not meet the preset condition. Any i-th computing node calculates a stochastic gradient, compresses it to obtain a compressed gradient, and sends the compressed gradient to the central server; the compression error (i.e., the compensation gradient) is temporarily stored on the computing node. The central server averages the received compressed gradients and compensates the average with the stored global error $e_t$ to obtain the central gradient $v_t$. The central server compresses $v_t$ to obtain $p_t$ and sends $p_t$ to each computing node, after which the global error variable is updated. Each computing node then compensates $p_t$ with its locally stored error and uses the result to update the local model.
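Read together with the claims, the flow for an iteration that does not meet the preset condition can be summarized as the equations below. This is a hedged reconstruction rather than the patent's own formulas: only $e_t$, $v_t$ and $p_t$ appear in the text, while $g_t^{(i)}$ (stochastic gradient of node i), $\hat{g}_t^{(i)}$ (its compressed gradient), $\mathcal{C}$ (compressor), $n$ (number of computing nodes), $x_t^{(i)}$ (local model parameters) and $\eta$ (learning rate) are notation assumed here, and the plain gradient step in the last line is also an assumption.

```latex
\begin{aligned}
\hat{g}_t^{(i)} &= \mathcal{C}\bigl(g_t^{(i)}\bigr)
  && \text{(compression on computing node } i\text{)} \\
v_t &= \frac{1}{n}\sum_{i=1}^{n} \hat{g}_t^{(i)} + e_t
  && \text{(central gradient with error compensation)} \\
p_t &= \mathcal{C}(v_t), \qquad e_{t+1} = v_t - p_t
  && \text{(compression on the central server)} \\
x_{t+1}^{(i)} &= x_t^{(i)} - \eta\,\Bigl(p_t + g_t^{(i)} - \hat{g}_t^{(i)}\Bigr)
  && \text{(compensated local update)}
\end{aligned}
```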
[0086] Specifically, each computing node calculates the stochastic gradient based on its local model and sampled data. When the iteration round t satisfies the preset condition, the computing node sends the stochastic gradient (as the compressed gradient) together with the local model to the central server node; otherwise, the computing node uses the compressor to compress the stochastic gradient, sends the resulting compressed gradient to the server node, and stores the compression error on the computing node.
[0087] The central server node first averages the gradients sent by the computing nodes and then uses the error variable to compensate the result, obtaining $v_t$. When the iteration round t satisfies the preset condition, the central server node also averages the received local model parameters and sends the averaged model parameters and the uncompressed central gradient $v_t$ to each computing node. Otherwise, the central server node uses the compressor to compress $v_t$ to obtain the compressed central gradient $p_t$ and sends $p_t$ to each computing node. The error variable is then updated as $e_{t+1} = v_t - p_t$ (i.e., the error variable for the next iteration). When the central gradient is sent uncompressed, so that $p_t = v_t$, this operation resets the error variable to 0.
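The server-side step in this paragraph can be sketched as follows. It is an illustration under stated assumptions, not the patent's implementation: the function name `server_step`, the list-based vectors, and the `keep_largest_entry` stand-in compressor are hypothetical, and the averaging of local model parameters is omitted for brevity.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def server_step(
    compressed_grads: List[Vector],
    error: Vector,
    condition_met: bool,
    compress: Callable[[Vector], Vector],
) -> Tuple[Vector, Vector]:
    """Return (p_t, e_{t+1}) given the uploaded gradients and e_t."""
    n = len(compressed_grads)
    dim = len(error)
    mean = [sum(g[j] for g in compressed_grads) / n for j in range(dim)]
    # v_t = mean of uploaded gradients + current error variable e_t.
    v = [m + e for m, e in zip(mean, error)]
    # p_t = v_t when the preset condition is met, otherwise compress v_t.
    p = list(v) if condition_met else compress(v)
    # e_{t+1} = v_t - p_t; this is all zeros when nothing was compressed.
    next_error = [vj - pj for vj, pj in zip(v, p)]
    return p, next_error


def keep_largest_entry(vec: Vector) -> Vector:
    # Hypothetical compressor for the demo only: keep one largest-magnitude entry.
    j_max = max(range(len(vec)), key=lambda j: abs(vec[j]))
    return [x if j == j_max else 0.0 for j, x in enumerate(vec)]


uploads = [[2.0, 0.0], [0.0, 1.0]]
p_t, e_next = server_step(uploads, [0.5, 0.0], False, keep_largest_entry)
print(p_t, e_next)        # [1.5, 0.0] [0.0, 0.5]
p_full, e_reset = server_step(uploads, [0.5, 0.0], True, keep_largest_entry)
print(p_full, e_reset)    # [1.5, 0.5] [0.0, 0.0]
```

The second call illustrates the reset behaviour: when the central gradient is sent uncompressed, the next error variable is zero.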
[0088] Model updates are performed on the computing nodes. When the iteration round t satisfies the preset condition, the computing node sets its local model parameters to the returned average model parameters and then updates the model using the returned central gradient. Otherwise, the computing node runs the instantaneous error compensation mechanism: it compensates the returned gradient $p_t$ with the locally stored error and then updates the model.
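The two update branches can be sketched as follows. This is illustrative only: the function name `update_local_model`, the learning-rate parameter `lr`, and the plain gradient-descent step `x - lr * gradient` are assumptions made here, since the text does not reproduce the exact update formula.

```python
from typing import List, Optional

Vector = List[float]


def update_local_model(
    params: Vector,
    central_grad: Vector,
    stored_error: Vector,
    lr: float,
    condition_met: bool,
    avg_params: Optional[Vector] = None,
) -> Vector:
    """Return the updated local model parameters for one iteration."""
    if condition_met:
        if avg_params is None:
            raise ValueError("avg_params is required when the condition is met")
        # Start from the averaged model; the compensation gradient is zero.
        base = avg_params
        target = list(central_grad)
    else:
        # Instantaneous error compensation: add the locally stored error to p_t.
        base = params
        target = [p + e for p, e in zip(central_grad, stored_error)]
    # Assumed plain gradient step.
    return [x - lr * g for x, g in zip(base, target)]


compensated = update_local_model(
    [1.0, 1.0], [1.5, 0.0], [0.5, 0.25], lr=0.5, condition_met=False
)
print(compensated)  # [0.0, 0.875]
averaged = update_local_model(
    [1.0, 1.0], [1.0, 1.0], [0.0, 0.0], lr=0.5,
    condition_met=True, avg_params=[2.0, 2.0],
)
print(averaged)     # [1.5, 1.5]
```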
[0089] For the above embodiments, the following convergence conclusion can be obtained: for a non-convex optimization objective, under the assumptions of continuity, bounded variance, and bounded gradient, when the algorithm uses a δ-compressor and the learning rate is ..., the following convergence conclusion holds:
[0090] This result indicates that the proposed implementation achieves the same upper bound on convergence speed as the traditional one-way gradient compression algorithm under the error compensation mechanism, and that this bound is better than the upper bound on convergence speed of the two-way gradient compression algorithm, thus achieving both faster convergence and lower communication cost.
[0091] Figure 5 is a schematic diagram of the structure of a distributed training device provided in an embodiment of the present invention. As shown in Figure 5, the device includes:
[0092] The stochastic gradient determination module 410 is used to determine the stochastic gradient of the machine learning model in the current iteration during the iterative training of the machine learning model.
[0093] The compression gradient determination module 420 is used to compress the stochastic gradient of the current iteration when the current iteration number does not meet the preset conditions, to obtain the compressed gradient of the current iteration, and to send the compressed gradient to the central server node. The central server node determines the central gradient of the current iteration based on the compressed gradient sent by each computing node device.
[0094] The model update module 430 is used to receive the central gradient fed back by the central server node, determine the compensation gradient based on the stochastic gradient of the current iteration and the compressed gradient, compensate the central gradient based on the compensation gradient to obtain the target gradient of the current iteration, and update the machine learning model for the current iteration based on the target gradient.
[0095] Based on the above embodiments, optionally, the model update module 430 is used to: determine the compensation gradient based on the difference between the stochastic gradient of the current iteration and the compressed gradient; and determine the target gradient of the current iteration based on the sum of the compensation gradient and the central gradient.
[0096] Based on the above embodiments, optionally, the central gradient fed back by the central server node is a compressed central gradient.
[0097] Based on the above embodiments, optionally, the compression gradient determination module 420 is used for:
[0098] Invoke the compressor, and compress the stochastic gradient of the current iteration based on the compressor to obtain the compressed gradient of the current iteration.
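The text does not name a specific compressor. As one commonly used example, the sketch below shows top-k sparsification, which keeps the k largest-magnitude entries and zeroes the rest; the function name `top_k_compress` is hypothetical, and choosing top-k is an assumption for illustration, not the patent's stated compressor.

```python
from typing import List

Vector = List[float]


def top_k_compress(grad: Vector, k: int) -> Vector:
    """Keep the k largest-magnitude entries of grad and zero the others."""
    if not 0 < k <= len(grad):
        raise ValueError("k must satisfy 0 < k <= len(grad)")
    # Indices ordered by decreasing absolute value; the first k are kept.
    order = sorted(range(len(grad)), key=lambda j: abs(grad[j]), reverse=True)
    keep = set(order[:k])
    return [g if j in keep else 0.0 for j, g in enumerate(grad)]


g = [0.1, -3.0, 2.0, 0.5]
c = top_k_compress(g, 2)
print(c)  # [0.0, -3.0, 2.0, 0.0]
```

Under the contraction definition often used for such compressors, namely that the squared error of the compressed vector is at most (1 - δ) times the squared norm of the input, top-k on a d-dimensional vector satisfies the bound with δ = k / d. Whether this matches the compression-related parameter δ intended in this document is not stated in the text.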
[0099] Based on the above embodiments, optionally, the compressed gradient determination module 420 is used to: compress the stochastic gradient of the current iteration when the current iteration number does not meet the preset condition;
[0100] If the current iteration number meets the preset conditions, the stochastic gradient of the current iteration is sent to the central server node as a compressed gradient;
[0101] Correspondingly, the central gradient fed back by the central server node is the uncompressed central gradient under the condition that the previous iteration number meets the preset conditions.
[0102] Based on the above embodiments, optionally, the preset conditions include a preset interval number condition, or an iteration number determination condition based on compression association parameters in the compressor.
[0103] The distributed training device provided in the embodiments of the present invention can execute the distributed training method provided in any embodiment of the present invention, and has the corresponding functional modules and beneficial effects of the method execution.
[0104] Figure 6 is a schematic diagram of the structure of a distributed training device provided in an embodiment of the present invention. As shown in Figure 6, the device includes:
[0105] The compressed gradient receiving module 510 is used to receive the compressed gradient of the machine learning model in the current iteration sent by each computing node during the iterative training of the machine learning model.
[0106] The center gradient determination module 520 is used to determine the center gradient of the current iteration based on the compressed gradient sent by each computing node and the error of the current iteration.
[0107] The central gradient sending module 530 is used to compress the central gradient of the current iteration when the current iteration number does not meet a preset condition, and to feed back the compressed central gradient to each computing node; and to feed back the central gradient of the current iteration to each computing node when the current iteration number meets a preset condition. The computing nodes update the machine learning model for the current iteration based on the stochastic gradient of the current iteration, the compressed gradient, and the central gradient.
[0108] The distributed training device provided in the embodiments of the present invention can execute the distributed training method provided in any embodiment of the present invention, and has the corresponding functional modules and beneficial effects of the method execution.
[0109] This invention provides a distributed training system. See Figure 7, which is a schematic diagram of the structure of a distributed training system provided in this embodiment. As shown in Figure 7, the distributed training system includes a central server node 610 and multiple computing nodes 620. The computing nodes 620 are used to determine the stochastic gradient of the machine learning model in the current iteration during iterative training, compress the stochastic gradient to obtain the compressed gradient, and send the compressed gradient to the central server node.
[0110] The central server node 610 is used to: determine the central gradient of the current iteration based on the compressed gradient sent by each computing node device, and compress the central gradient and send it to each computing node if the current iteration number does not meet the preset conditions.
[0111] The computing node 620 is also used to receive the central gradient fed back by the central server node, and determine the compensation gradient based on the stochastic gradient of the current iteration and the compressed gradient, compensate the central gradient based on the compensation gradient to obtain the target gradient of the current iteration, and update the machine learning model for the current iteration based on the target gradient.
[0112] Optionally, if the number of iterations in the current iteration does not meet the preset condition, the computing node 620 performs compression processing on the stochastic gradient of the current iteration; if the number of iterations in the current iteration meets the preset condition, the stochastic gradient of the current iteration is sent to the central server node as a compressed gradient.
[0113] Optionally, the central server node 610 is configured to: compress the central gradient of the current iteration when the current iteration number does not meet the preset condition, and feed back the compressed central gradient to each computing node 620; and, when the current iteration number meets the preset condition, feed back the central gradient of the current iteration to each computing node 620.
[0114] The technical solution provided in this embodiment compresses the stochastic gradient of each iteration before transmission to the central server node and transmits the compressed gradient, reducing the communication cost between the computing nodes and the central server node. Furthermore, the central gradient fed back by the central server node can itself be a compressed gradient, so that gradients are compressed in both directions of transmission between the computing nodes and the central server node, further reducing communication costs. At the same time, the error caused by compression is used to determine the compensation gradient. This compensation gradient is then used to compensate the central gradient of the current iteration, and the model parameters are updated based on the compensated target gradient. By performing gradient compensation within the current iteration, the slow convergence caused by gradient compression is avoided while communication costs are reduced, thus improving the convergence speed of the distributed training process.
[0115] Figure 8 This is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention. The electronic device 10 is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices (such as helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the invention described and / or claimed herein.
[0116] As shown in Figure 8, the electronic device 10 includes at least one processor 11 and a memory, such as a read-only memory (ROM) 12 or a random access memory (RAM) 13, communicatively connected to the at least one processor 11. The memory stores computer programs executable by the at least one processor. The processor 11 can perform various appropriate actions and processes based on the computer program stored in the ROM 12 or loaded from storage unit 18 into the RAM 13. The RAM 13 may also store various programs and data required for the operation of the electronic device 10. The processor 11, ROM 12, and RAM 13 are interconnected via a bus 14. An input / output (I / O) interface 15 is also connected to the bus 14.
[0117] Multiple components in electronic device 10 are connected to I / O interface 15, including: input unit 16, such as keyboard, mouse, etc.; output unit 17, such as various types of displays, speakers, etc.; storage unit 18, such as disk, optical disk, etc.; and communication unit 19, such as network card, modem, wireless transceiver, etc. Communication unit 19 allows electronic device 10 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.
[0118] Processor 11 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. Processor 11 performs the various methods and processes described above, such as distributed training methods.
[0119] In some embodiments, the distributed training method may be implemented as a computer program tangibly contained in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and / or mounted on electronic device 10 via ROM 12 and / or communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the distributed training method described above may be performed. Alternatively, in other embodiments, processor 11 may be configured to execute the distributed training method by any other suitable means (e.g., by means of firmware).
[0120] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
[0121] Computer programs used to implement the distributed training method of the present invention can be written in any combination of one or more programming languages. These computer programs can be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that when executed by the processor, the functions / operations specified in the flowcharts and / or block diagrams are implemented. The computer programs can be executed entirely on the machine, partially on the machine, as a standalone software package partially on the machine and partially on a remote machine, or entirely on a remote machine or server.
[0122] This invention also provides a computer-readable storage medium storing computer instructions for causing a processor to execute a distributed training method, the method comprising:
[0123] During the iterative training of a machine learning model, the stochastic gradient of the machine learning model in the current iteration is determined;
[0124] The stochastic gradient of the current iteration is compressed to obtain the compressed gradient of the current iteration, and the compressed gradient is sent to the central server node. The central server node determines the central gradient of the current iteration based on the compressed gradient sent by each computing node device.
[0125] The system receives the central gradient fed back by the central server node, determines the compensation gradient based on the stochastic gradient of the current iteration and the compressed gradient, compensates the central gradient based on the compensation gradient to obtain the target gradient of the current iteration, and updates the machine learning model for the current iteration based on the target gradient.
[0126] In the context of this invention, a computer-readable storage medium can be a tangible medium that may contain or store a computer program for use by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable storage medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination thereof. Alternatively, a computer-readable storage medium may be a machine-readable signal medium. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.
[0127] To provide interaction with a user, the systems and techniques described herein can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the electronic device. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).
[0128] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as data servers), or computing systems that include middleware components (e.g., application servers), or computing systems that include frontend components (e.g., user computers with graphical user interfaces or web browsers through which users can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., communication networks). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.
[0129] A computing system can include clients and servers. Clients and servers are generally located far apart and typically interact through communication networks. The client-server relationship is created by computer programs running on the respective computers and having a client-server relationship with each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a hosting product within the cloud computing service system to address the shortcomings of traditional physical hosts and VPS services, such as high management difficulty and weak business scalability.
[0130] It should be understood that the various forms of processes shown above can be used, with steps reordered, added, or deleted. For example, the steps described in this invention can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution of this invention can be achieved, and this is not limited herein.
[0131] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.
Claims
1. A distributed training method, characterized in that the method is applied to a computing node device and includes: during the iterative training of a machine learning model, determining the stochastic gradient of the machine learning model in the current iteration, wherein the stochastic gradient is obtained by combining the different orders of derivatives of the loss function of the machine learning model corresponding to each network layer of the machine learning model; invoking a compressor to compress the stochastic gradient of the current iteration to obtain the compressed gradient of the current iteration, and sending the compressed gradient to a central server node, wherein the central server node determines the central gradient of the current iteration based on the compressed gradients sent by each computing node, the central gradient is the sum of the mean of the compressed gradients uploaded by each computing node and the current error compensation value, and the current error compensation value is determined based on the central gradient and the compressed gradient of the central gradient from the previous iteration; and receiving the central gradient fed back by the central server node, determining the compensation gradient based on the stochastic gradient of the current iteration and the compressed gradient, compensating the central gradient based on the compensation gradient to obtain the target gradient of the current iteration, and updating the machine learning model for the current iteration based on the target gradient.
2. The method according to claim 1, characterized in that, The process of receiving the central gradient fed back by the central server node, determining a compensation gradient based on the stochastic gradient of the current iteration and the compressed gradient, and compensating the central gradient based on the compensation gradient to obtain the target gradient of the current iteration includes: The compensation gradient is determined based on the difference between the stochastic gradient of the current iteration and the compressed gradient. The target gradient for the current iteration is determined based on the sum of the compensation gradient and the central gradient.
3. The method according to claim 1 or 2, characterized in that, The central gradient fed back by the central server node is the central gradient after compression processing.
4. The method according to claim 1, characterized in that, The compression process for the stochastic gradient of the current iteration further includes: If the current iteration number does not meet the preset condition, the stochastic gradient of the current iteration is compressed. If the current iteration number meets the preset conditions, the stochastic gradient of the current iteration is sent to the central server node as a compressed gradient; Correspondingly, the central gradient fed back by the central server node is the uncompressed central gradient under the condition that the previous iteration number meets the preset conditions.
5. The method according to claim 4, characterized in that the preset condition comprises a preset interval number condition, or an iteration number determination condition based on compression-related parameters of the compressor.
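The iteration-dependent branch of claims 4 and 5 can be sketched as follows. Only the preset-interval variant of the condition is shown; the helper names `meets_preset_condition` and `gradient_to_send`, the modulo test, and the top-k compressor are assumptions made for illustration. The condition based on compression-related parameters is not specified in the claims and is therefore not modeled.

```python
def top_k_compress(grad, k):
    """Assumed compressor: keep the k largest-magnitude entries, zero the rest."""
    ranked = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(ranked[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


def meets_preset_condition(iteration, interval):
    """Assumed interval condition: true on every `interval`-th iteration."""
    return iteration % interval == 0


def gradient_to_send(stochastic_grad, iteration, interval, k):
    """Return what the computing node uploads as its 'compressed gradient'."""
    if meets_preset_condition(iteration, interval):
        # Condition met: the raw stochastic gradient is sent unchanged.
        return list(stochastic_grad)
    # Condition not met: the stochastic gradient is compressed first.
    return top_k_compress(stochastic_grad, k)


grad = [0.5, -2.0, 0.25, 1.0]
sent_at_3 = gradient_to_send(grad, iteration=3, interval=4, k=1)
sent_at_4 = gradient_to_send(grad, iteration=4, interval=4, k=1)
```

With `interval=4`, iteration 3 uploads only the largest entry while iteration 4 uploads the full gradient, which trades extra communication on a few iterations for an exact gradient at those points.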
6. A distributed training method, characterized in that it is applied to a central server node device and comprises: during iterative training of a machine learning model, receiving the compressed gradient of the machine learning model in the current iteration sent by each computing node, wherein the compressed gradient is obtained by the computing node compressing the stochastic gradient of the machine learning model in the current iteration, and the stochastic gradient is obtained by combining the derivatives of different orders of the loss function of the machine learning model corresponding to each network layer of the machine learning model; determining the central gradient of the current iteration based on the compressed gradients sent by the computing nodes and the error of the current iteration, wherein the central gradient is the sum of the mean of the compressed gradients uploaded by the computing nodes and a current error compensation value, and the current error compensation value is determined based on the central gradient and the compressed gradient of the central gradient in the previous iteration; and if the current iteration number does not meet a preset condition, compressing the central gradient of the current iteration and feeding the compressed central gradient back to each computing node, and if the current iteration number meets the preset condition, feeding the central gradient of the current iteration back to each computing node, so that each computing node updates the machine learning model for the current iteration based on the stochastic gradient of the current iteration, the compressed gradient, and the central gradient.
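The server-side steps of claim 6 can be sketched in the same spirit. The claim only states that the error compensation value is determined from the previous central gradient and its compressed form; this sketch assumes it is their difference. The class name `CentralServer`, the top-k compressor, and the interval-based preset condition are likewise hypothetical.

```python
def top_k_compress(grad, k):
    """Assumed compressor: keep the k largest-magnitude entries, zero the rest."""
    ranked = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(ranked[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


class CentralServer:
    """Hypothetical central server node holding the error compensation value."""

    def __init__(self, dim, k, interval):
        self.k = k
        self.interval = interval
        self.error = [0.0] * dim  # no compression error before the first iteration

    def step(self, compressed_grads, iteration):
        n = len(compressed_grads)
        mean = [sum(col) / n for col in zip(*compressed_grads)]
        # Central gradient = mean of uploaded gradients + current error compensation.
        central = [m + e for m, e in zip(mean, self.error)]
        if iteration % self.interval == 0:
            feedback = central  # preset condition met: feed back uncompressed
        else:
            feedback = top_k_compress(central, self.k)
        # Assumed error rule: what compression dropped is carried to the next iteration.
        self.error = [c - f for c, f in zip(central, feedback)]
        return feedback


server = CentralServer(dim=2, k=1, interval=10)
first = server.step([[2.0, 0.0], [0.0, 1.0]], iteration=1)
error_after_first = list(server.error)
second = server.step([[0.0, 0.0], [0.0, 0.0]], iteration=2)
tenth = server.step([[4.0, 2.0]], iteration=10)
```

Under this assumed rule, the entry dropped in the first iteration (`0.5`) is fed back in the second, and an uncompressed feedback leaves no error to carry over.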
7. A distributed training device, characterized in that it is integrated into a computing node device and comprises: a stochastic gradient determination module, configured to determine, during iterative training of a machine learning model, the stochastic gradient of the machine learning model in the current iteration, wherein the stochastic gradient is obtained by combining the derivatives of different orders of the loss function of the machine learning model corresponding to each network layer of the machine learning model; a compressed gradient determination module, configured to invoke a compressor to compress the stochastic gradient of the current iteration to obtain the compressed gradient of the current iteration, and to send the compressed gradient to a central server node, so that the central server node determines the central gradient of the current iteration based on the compressed gradients sent by the computing node devices, wherein the central gradient is the sum of the mean of the compressed gradients uploaded by the computing nodes and a current error compensation value, and the current error compensation value is determined based on the central gradient and the compressed gradient of the central gradient in the previous iteration; and a model update module, configured to receive the central gradient fed back by the central server node, determine a compensation gradient based on the stochastic gradient of the current iteration and the compressed gradient, compensate the central gradient based on the compensation gradient to obtain the target gradient of the current iteration, and update the machine learning model for the current iteration based on the target gradient.
8. A distributed training device, characterized in that it is integrated into a central server node device and comprises: a compressed gradient receiving module, configured to receive, during iterative training of a machine learning model, the compressed gradient of the machine learning model in the current iteration sent by each computing node, wherein the compressed gradient is obtained by the computing node compressing the stochastic gradient of the machine learning model in the current iteration, and the stochastic gradient is obtained by combining the derivatives of different orders of the loss function of the machine learning model corresponding to each network layer of the machine learning model; a central gradient determination module, configured to determine the central gradient of the current iteration based on the compressed gradients sent by the computing nodes and the error of the current iteration, wherein the central gradient is the sum of the mean of the compressed gradients uploaded by the computing nodes and a current error compensation value, and the current error compensation value is determined based on the central gradient and the compressed gradient of the central gradient in the previous iteration; and a central gradient sending module, configured to compress the central gradient of the current iteration and feed the compressed central gradient back to each computing node when the current iteration number does not meet a preset condition, and to feed the central gradient of the current iteration back to each computing node when the current iteration number meets the preset condition, so that each computing node updates the machine learning model for the current iteration based on the stochastic gradient of the current iteration, the compressed gradient, and the central gradient.
9. A distributed training system, characterized in that it comprises a central server node and multiple computing nodes, wherein: during iterative training of a machine learning model, each computing node determines the stochastic gradient of the machine learning model in the current iteration, compresses the stochastic gradient of the current iteration to obtain the compressed gradient of the current iteration, and sends the compressed gradient to the central server node, wherein the stochastic gradient is obtained by combining the derivatives of different orders of the loss function of the machine learning model corresponding to each network layer of the machine learning model; the central server node determines the central gradient of the current iteration based on the compressed gradients sent by the computing nodes and, if the current iteration number does not meet a preset condition, compresses the central gradient before sending it to each computing node, wherein the central gradient is the sum of the mean of the compressed gradients uploaded by the computing nodes and a current error compensation value, and the current error compensation value is determined based on the central gradient and the compressed gradient of the central gradient in the previous iteration; and each computing node receives the central gradient fed back by the central server node, determines a compensation gradient based on the stochastic gradient of the current iteration and the compressed gradient, compensates the central gradient based on the compensation gradient to obtain the target gradient of the current iteration, and updates the machine learning model for the current iteration based on the target gradient.
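The message flow of the system in claim 9 can be traced for a single iteration with a small simulation. As with any such sketch, the top-k compressor, the function name `run_round`, the flag `compress_feedback` standing in for the preset condition, and the assumption that the server's error compensation value is the difference between the central gradient and what was fed back are all illustrative choices, not details taken from the claims.

```python
def top_k_compress(grad, k):
    """Assumed compressor: keep the k largest-magnitude entries, zero the rest."""
    ranked = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(ranked[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


def run_round(local_grads, server_error, k, compress_feedback=True):
    """Simulate one iteration; returns (per-node target gradients, new server error)."""
    # Computing nodes: compress the local stochastic gradients and upload them.
    uploads = [top_k_compress(g, k) for g in local_grads]

    # Central server node: mean of uploads plus the current error compensation value.
    n = len(uploads)
    mean = [sum(col) / n for col in zip(*uploads)]
    central = [m + e for m, e in zip(mean, server_error)]
    feedback = top_k_compress(central, k) if compress_feedback else central
    new_error = [c - f for c, f in zip(central, feedback)]  # assumed error rule

    # Computing nodes: target = central feedback + (stochastic - compressed).
    targets = [
        [f + (g - u) for f, g, u in zip(feedback, grad, upload)]
        for grad, upload in zip(local_grads, uploads)
    ]
    return targets, new_error


targets, new_error = run_round([[4.0, 1.0], [0.0, 2.0]], [0.0, 0.0], k=1)
```

Each node's target gradient combines the shared feedback with that node's own compression residual, so the targets may differ between nodes within one iteration.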
10. An electronic device, characterized in that the electronic device comprises: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, and the computer program, when executed by the at least one processor, enables the at least one processor to perform the distributed training method according to any one of claims 1-5, or the distributed training method according to claim 6.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions which, when executed, cause a processor to perform the distributed training method according to any one of claims 1-5, or the distributed training method according to claim 6.