A distributed training method and system for hierarchical networks

By employing a three-layer distributed training method in the hierarchical network, and utilizing quantization compression and sparsity compression for gradient optimization at worker nodes and edge nodes, the problems of communication efficiency and model accuracy in the hierarchical network are solved, achieving efficient collaboration between lightweight terminal devices and network communication.

CN121031718B · Active · Publication Date: 2026-06-19 · BEIJING MIANBI INTELLIGENT TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING MIANBI INTELLIGENT TECH CO LTD
Filing Date
2025-08-25
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing distributed training methods cannot effectively optimize the communication efficiency of each layer in hierarchical networks, resulting in low model training accuracy and excessive computational and storage burden on terminal devices. There is a lack of optimization strategies for collaborative communication between different network layers.

Method used

A three-layer distributed training method is adopted, in which worker nodes perform quantization compression, edge nodes perform error compensation and sparsity compression, and the central server performs global gradient updates. By deploying asymmetric compression strategies at different levels through quantization and sparsity compressors, the burden on terminal devices is reduced and the model accuracy is improved.

Benefits of technology

It achieves an optimal balance between model training speed and training accuracy in a hierarchical network environment, reduces the load on the central server and the computational and storage pressure on terminal devices, and improves communication efficiency and model accuracy.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a distributed training method and system for hierarchical networks, belonging to the field of distributed machine learning technology. It addresses the current technical problem of lacking a distributed training method specifically designed for hierarchical networks that can collaboratively optimize communication efficiency across layers while ensuring model training accuracy and stability, thus failing to meet the communication requirements of hierarchical networks. The method includes: worker nodes calculating local stochastic gradients based on initial model parameters and local data; quantizing and compressing the local stochastic gradients to obtain quantized gradients and uploading them to edge nodes; edge nodes aggregating the quantized gradients sent by each worker node to obtain aggregated gradients and performing error compensation; sparsely compressing the error-compensated aggregated gradients to obtain sparse gradients and uploading them to a central server; the central server globally aggregating the sparse gradients sent by each edge node to obtain a global gradient; and updating the model parameters based on the global gradient and a preset learning rate.

Description

Technical Field

[0001] This invention relates to the field of distributed machine learning technology, and in particular to a distributed training method and system for hierarchical networks.

Background Technology

[0002] With the popularization of artificial intelligence technology, the training scenarios for deep learning models are becoming increasingly diverse, expanding from centralized data centers to geographically dispersed edge computing and federated learning environments. In these scenarios, computing nodes typically form a natural hierarchical topology: a large number of terminal devices (worker nodes) connect to regional edge servers (edge nodes), which in turn connect to a central cloud server.

[0003] In such hierarchical networks, traditional "flat" distributed training algorithms, such as Distributed Stochastic Gradient Descent (D-SGD), face severe communication challenges. First, there's the central server bottleneck: in a flat architecture, all worker nodes communicate directly with the central server. When the number of nodes is large, this places enormous communication and aggregation pressure on the central server. Second, there's network heterogeneity: the network from terminal devices to edge servers (the "last mile") is typically low-bandwidth and unstable (e.g., Wi-Fi, 5G), while the network from edge servers to the cloud is relatively stable and high-speed. A unified communication strategy cannot adapt to this heterogeneity.

[0004] To address communication issues, existing technologies have proposed mechanisms such as gradient compression and error feedback. However, these methods are typically designed for flat network architectures, assuming that each node performs the same operation. When directly applied to hierarchical networks, they suffer from the following drawbacks: (1) Errors accumulate at each level: If simple compression is performed at each layer, the compression error will be propagated and amplified at each level, ultimately severely affecting the convergence accuracy of the model; (2) Resource and algorithm mismatch: Requiring resource-constrained terminal devices to perform the same complex stateful error feedback algorithm as edge servers increases the computational and storage burden on terminal devices, which is impractical; (3) Lack of hierarchical collaborative strategies: Existing methods do not design differentiated and collaborative communication optimization strategies for the characteristics of different network layers (such as bandwidth, latency, and node capabilities).

[0005] Therefore, there is an urgent need for a novel distributed training method specifically designed for hierarchical networks, which can collaboratively optimize the communication efficiency of each layer while ensuring the accuracy and stability of model training.

Summary of the Invention

[0006] This invention provides a distributed training method and system for hierarchical networks to address the following technical problem: Currently, there is a lack of distributed training methods specifically designed for hierarchical networks that can collaboratively optimize the communication efficiency of each layer while ensuring the accuracy and stability of model training, thus failing to meet the communication requirements of hierarchical networks.

[0007] The embodiments of the present invention adopt the following technical solutions:

[0008] On one hand, embodiments of the present invention provide a distributed training method for hierarchical networks, the method comprising: worker nodes calculating local stochastic gradients based on initial model parameters and local data; and quantizing and compressing the local stochastic gradients to obtain quantized gradients and uploading them to edge nodes;

[0009] The edge nodes aggregate the quantization gradients sent by each working node to obtain the aggregated gradient and perform error compensation; the aggregated gradient after error compensation is subjected to sparsification and compression processing to obtain the sparsified gradient and uploaded to the central server.

[0010] The central server performs global aggregation of the sparse gradients sent by each edge node to obtain the global gradient; based on the global gradient and the preset learning rate, the initial model parameters are updated to obtain the final model parameters.

[0011] In one feasible implementation, before the working node calculates the local stochastic gradient based on the initial model parameters and local data, the method further includes:

[0012] The central server initializes the global model parameters of the model to be trained, obtaining the initial model parameters; and then broadcasts the initial model parameters down to all edge nodes.

[0013] The edge node forwards the received initial model parameters to all the working nodes it is connected to;

[0014] Each edge node contains an error buffer for storing compression errors, and the initial value of the error buffer is a zero vector.

[0015] In one feasible implementation, the working node calculates the local stochastic gradient based on the initial model parameters and local data; and quantizes and compresses the local stochastic gradient to obtain the quantized gradient, which is then uploaded to the edge node, specifically including:

[0016] Each worker node aggregates the training data within its connection range and saves it locally as local data;

[0017] After receiving the model training task, the worker node performs calculations on the local data based on the initial model parameters to obtain the local stochastic gradient.

[0018] The working node compresses the local stochastic gradient using a preset quantization compressor to obtain the quantized gradient and uploads it to its corresponding edge node; wherein the data size of the quantized gradient is smaller than the data size of the local stochastic gradient.

[0019] In one feasible implementation, the edge node aggregates the quantization gradients sent by each working node to obtain an aggregated gradient and performs error compensation, specifically including:

[0020] The edge node collects the quantization gradients sent by all the working nodes under its jurisdiction, and calculates the average of all quantization gradients to obtain the aggregate gradient.

[0021] The edge node adds, by vector addition, the historical compression error stored in its local error buffer from the previous iteration to the current aggregated gradient, obtaining the error-compensated aggregated gradient.

[0022] In one feasible implementation, the aggregated gradient after error compensation is subjected to sparsification and compression to obtain a sparse gradient, which is then uploaded to a central server. Specifically, this includes:

[0023] The edge node applies a preset sparsification compressor to the error-compensated aggregated gradient to complete the secondary compression, obtains the sparse gradient, and uploads it to the central server; wherein the data volume of the sparse gradient is smaller than the data volume of the aggregated gradient.

[0024] In one feasible implementation, after performing error compensation and sparsification compression on the aggregated gradient to obtain a sparse gradient and uploading it to the central server, the method further includes:

[0025] The edge node computes the difference between the error-compensated aggregated gradient and the corresponding sparse gradient to obtain the compression error of the sparsification compression.

[0026] The edge node stores the compression error in its local error buffer for use in the next iteration.

[0027] In one feasible implementation, the central server performs global aggregation of the sparse gradients sent by each edge node to obtain a global gradient, specifically including:

[0028] The central server collects the sparse gradients sent by all the edge nodes under its jurisdiction and calculates the average value to obtain the global gradient.

[0029] In one feasible implementation, the initial model parameters are updated based on the global gradient and the preset learning rate to obtain the final model parameters, specifically including:

[0030] The central server substitutes the model parameters obtained from the previous iteration, the global gradient, and the preset learning rate into the update formula to calculate the model parameters after the current iteration.

[0031] This process is repeated until a preset number of iterations or a preset convergence criterion is reached, thus obtaining the final model parameters.

[0032] On the other hand, embodiments of the present invention also provide a distributed training system for hierarchical networks, the system comprising:

[0033] The quantization and compression module is used to calculate the local stochastic gradient based on the initial model parameters and local data through the working node; and to quantize and compress the local stochastic gradient to obtain the quantized gradient and upload it to the edge node.

[0034] The sparsity compression module is used to aggregate the quantization gradients sent by each working node through the edge node to obtain the aggregated gradient and perform error compensation; the aggregated gradient after error compensation is subjected to sparsity compression processing to obtain the sparsity gradient and uploaded to the central server.

[0035] The parameter update module is used to globally aggregate the sparse gradients sent by each edge node through the central server to obtain the global gradient; and update the initial model parameters according to the global gradient and the preset learning rate to obtain the final model parameters.

[0036] In one feasible implementation, the quantization compression module includes a preset quantization compressor for compressing local stochastic gradients.

[0037] The sparsification compression module includes a preset sparsification compressor, which is used to perform sparsification compression processing on the aggregation gradient to complete the secondary compression.

[0038] Compared with the prior art, the distributed training method and system for hierarchical networks provided in this invention have the following advantages:

[0039] This invention proposes a hierarchical error feedback stochastic gradient descent method, and designs an asymmetric, cooperative gradient compression and transmission strategy for a three-layer architecture of worker nodes, edge nodes and central server.

[0040] First, a three-tiered communication and computing architecture of "worker node - edge node - server" is clearly defined. Worker nodes are responsible for local computing, edge nodes are responsible for regional aggregation and critical error compensation, and servers are responsible for the final update of the global model.

[0041] An asymmetric compression strategy is proposed: worker nodes employ stateless gradient quantization compression. This is a lightweight compression method that does not require worker nodes to store historical errors, significantly reducing the computational and storage burden on terminal devices. Edge nodes, after aggregating quantized gradients from multiple worker nodes, undergo secondary compression using gradient sparsification with error feedback. This method can be executed on higher-performance edge nodes, allowing for more aggressive compression to drastically reduce the amount of data sent to the central server, while the error feedback mechanism ensures the final accuracy of the model.
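The effect of the two compression stages on message size can be illustrated with rough arithmetic. The following is only a sketch under assumed encodings that the text does not specify: 32-bit floats for dense gradients, a fixed number of bits per coordinate plus one 32-bit norm for the quantized gradient, and one 32-bit index plus one 32-bit value per retained entry for the sparse gradient. The function names and the example sizes are hypothetical.

```python
def dense_bytes(dim, bytes_per_value=4):
    """Size of an uncompressed gradient with dim float32 coordinates."""
    return dim * bytes_per_value


def quantized_bytes(dim, bits_per_value, norm_bytes=4):
    """Assumed worker-to-edge payload: one norm plus a few bits per coordinate."""
    return norm_bytes + (dim * bits_per_value + 7) // 8  # round bits up to bytes


def sparse_bytes(k, bytes_per_value=4, bytes_per_index=4):
    """Assumed edge-to-server payload: k (index, value) pairs."""
    return k * (bytes_per_value + bytes_per_index)


dim = 1_000_000
print(dense_bytes(dim))                        # uncompressed gradient
print(quantized_bytes(dim, bits_per_value=4))  # worker node upload
print(sparse_bytes(k=10_000))                  # edge node upload, 1% of entries kept
```

Under these assumptions the worker upload shrinks by roughly a factor of eight and the edge upload by a further factor of about six; actual ratios depend entirely on the compressors and encodings chosen.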

[0042] Furthermore, edge-side error control is proposed: this invention cleverly deploys the error feedback mechanism only at edge nodes. Edge nodes accumulate the sparsity error of the aggregated gradients of their subordinate worker nodes. This design avoids the overhead of maintaining state on massive numbers of terminal devices and effectively compensates for critical compression losses before data enters the backbone network, suppressing the cumulative effect of errors.

[0043] In summary, this invention proposes a hierarchical training method with edge error feedback. By deploying asymmetric compression strategies at different levels, it effectively solves the problems of excessive load on the central server and excessive burden on terminal devices, achieving the optimal balance between model training speed and training accuracy in a hierarchical network environment.

Description of the Drawings

[0044] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort. In the drawings:

[0045] Figure 1 is a flowchart of a distributed training method for hierarchical networks provided in an embodiment of the present invention;

[0046] Figure 2 is a schematic diagram of a hierarchical network architecture provided in an embodiment of the present invention;

[0047] Figure 3 is an algorithm logic diagram of a distributed training method for hierarchical networks provided in an embodiment of the present invention;

[0048] Figure 4 is a schematic diagram of the structure of a distributed training system for hierarchical networks provided in an embodiment of the present invention.

Detailed Implementation

[0049] To enable those skilled in the art to better understand the technical solutions of this invention, the technical solutions of the embodiments of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this invention, and not all embodiments. Based on the embodiments of this specification, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of this invention.

[0050] This invention provides a distributed training method for hierarchical networks. As shown in Figure 1, the method specifically includes steps S101-S103:

[0051] S101. The working node calculates the local stochastic gradient based on the initial model parameters and local data; and quantizes and compresses the local stochastic gradient to obtain the quantized gradient and uploads it to the edge node.

[0052] Specifically, the central server initializes the global model parameters of the model to be trained, obtaining initial model parameters; and then broadcasts these initial model parameters down to all edge nodes. Each edge node forwards the received initial model parameters to all worker nodes it is connected to. Each edge node contains an error buffer for storing compression errors, and the initial value of the error buffer is a zero vector.

[0053] As a feasible implementation method, Figure 2 is a schematic diagram of a hierarchical network architecture provided in an embodiment of the present invention. As shown in Figure 2, this invention explicitly defines a three-tiered communication and computing architecture of "worker node-edge node-server". Worker nodes are responsible for local computation, edge nodes are responsible for regional aggregation and critical error compensation, and the server is responsible for the final update of the global model. One server manages multiple edge nodes, and one edge node manages multiple worker nodes, forming a standard hierarchical network structure. On this basis, this invention adds a gradient quantization compressor to the worker nodes, which performs the first compression on the computed local stochastic gradients. A sparsification compressor with error feedback is added to the edge nodes, which performs a second compression on the gradient aggregated from the worker node uploads. This further reduces the amount of data uploaded to the server, reduces the server's load, and enables compression error compensation at the edge nodes to reduce compression-related errors.

[0054] Furthermore, each worker node aggregates the training data within its connection range and saves it locally as local data. After receiving the model training task, the worker node performs calculations on the local data based on the initial model parameters to obtain local stochastic gradients.

[0055] Furthermore, the working node compresses the local stochastic gradient using a preset quantization compressor to obtain the quantized gradient and uploads it to its corresponding edge node; wherein, the data volume of the quantized gradient is smaller than that of the local stochastic gradient.

[0056] The worker nodes employ stateless gradient quantization compression. This is a lightweight compression method that does not require worker nodes to store historical errors, significantly reducing the computational and storage burden on terminal devices.
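The text does not fix a particular quantization compressor. As a minimal sketch of what a stateless quantizer on a worker node could look like, the following assumes a stochastic uniform scheme in the style of QSGD: each coordinate is scaled by the gradient norm and randomly rounded to one of a small number of levels. The function name `quantize`, the `levels` parameter, and the choice of scheme are assumptions made for illustration only.

```python
import numpy as np


def quantize(grad, levels=4, rng=None):
    """Stateless stochastic uniform quantization (an assumed choice of Q).

    Every coordinate is mapped to sign * norm * j / levels for an integer j
    in [0, levels]; j is rounded up or down at random so that the expected
    value of the output equals the input.
    """
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return np.zeros_like(grad)
    scaled = np.abs(grad) / norm * levels          # each entry lies in [0, levels]
    lower = np.floor(scaled)
    prob_up = scaled - lower                       # probability of rounding up
    rounded = lower + (rng.random(grad.shape) < prob_up)
    return np.sign(grad) * norm * rounded / levels


# Usage: nothing is carried over between calls, so the worker keeps no state.
local_gradient = np.array([0.3, -1.2, 0.0, 2.5])
quantized_gradient = quantize(local_gradient, levels=4)
```

Because only the norm and a small integer per coordinate need to be transmitted, the upload is smaller than the full-precision gradient, and the worker node stores no historical error.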

[0057] S102. The edge nodes aggregate the quantization gradients sent by each working node to obtain the aggregated gradient and perform error compensation; the aggregated gradient after error compensation is subjected to sparsification and compression processing to obtain the sparsified gradient and uploaded to the central server.

[0058] Specifically, the edge node collects the quantized gradients sent by all the worker nodes under its jurisdiction and calculates their average to obtain the aggregated gradient. It then adds, by vector addition, the historical compression error stored in its local error buffer from the previous iteration to the current aggregated gradient, obtaining the error-compensated aggregated gradient.

[0059] Furthermore, the edge node applies a preset sparsification compressor to the error-compensated aggregated gradient to complete the secondary compression, obtains the sparse gradient, and uploads it to the central server; wherein the data volume of the sparse gradient is smaller than the data volume of the aggregated gradient.

[0060] As a feasible implementation method, after obtaining the sparsified gradient, the edge node calculates the difference between the aggregated gradient after error compensation and the corresponding sparsified gradient, thus obtaining the compression error in the sparsification compression process. The compression error is then stored in a local error buffer for error compensation in the next iteration.

[0061] After aggregating quantized gradients from multiple worker nodes, edge nodes employ gradient sparsification with error feedback for secondary compression. This method can be executed on higher-performance edge nodes, allowing for more aggressive compression to significantly reduce the amount of data sent to the central server, while the error feedback mechanism ensures the final accuracy of the model. This invention cleverly deploys the error feedback mechanism only on edge nodes. Edge nodes accumulate the sparsification error of the aggregated gradients from their subordinate worker nodes. This design avoids the overhead of maintaining state on massive numbers of terminal devices and effectively compensates for critical compression losses before data enters the backbone network, suppressing the cumulative effect of errors.
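The edge-node processing described in S102 can be sketched as follows. The sparsification compressor is not specified in the text; this sketch assumes top-k selection (keeping the k largest-magnitude entries), and the names `top_k`, `EdgeNode`, and `step` are hypothetical.

```python
import numpy as np


def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec and zero the rest (1 <= k <= len(vec))."""
    out = np.zeros_like(vec)
    idx = np.argpartition(np.abs(vec), -k)[-k:]
    out[idx] = vec[idx]
    return out


class EdgeNode:
    def __init__(self, dim, k):
        self.k = k
        self.error = np.zeros(dim)  # error buffer, initialized to the zero vector

    def step(self, quantized_grads):
        aggregated = np.mean(quantized_grads, axis=0)  # average over worker nodes
        compensated = aggregated + self.error          # error compensation
        sparse = top_k(compensated, self.k)            # secondary compression
        self.error = compensated - sparse              # compression error for the next iteration
        return sparse                                  # uploaded to the central server


edge = EdgeNode(dim=4, k=1)
first = edge.step([np.array([1.0, 0.5, 0.0, 0.0]), np.array([3.0, 0.5, 0.0, 0.0])])
second = edge.step([np.zeros(4), np.zeros(4)])
```

In this usage, the entry dropped in the first call stays in the error buffer and is transmitted in the second call, which is the sense in which error feedback delays information rather than discarding it.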

[0062] S103. The central server performs global aggregation of the sparse gradients sent by each edge node to obtain the global gradient; based on the global gradient and the preset learning rate, it updates the initial model parameters to obtain the final model parameters.

[0063] Specifically, the central server collects the sparse gradients sent by all the edge nodes under its jurisdiction and calculates the average value to obtain the global gradient.

[0064] Furthermore, the central server substitutes the model parameters, global gradient, and preset learning rate obtained from the previous iteration into the update formula to calculate the model parameters after the current iteration.

[0065] This process is repeated until the preset number of iterations or the preset convergence criterion is reached, yielding the final model parameters.
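The update formula referred to in [0064] is not reproduced in the text. One reading that is consistent with steps S101-S103 is sketched below. The symbols for the gradients and the error buffer (g, p, s, δ) and the plain SGD form of the last line are assumed notation rather than the patent's own; x, γ, Q, C, W_e, E and 0_d follow the description.

```latex
\begin{aligned}
\hat{g}_w^{t} &= Q\!\left(g_w^{t}\right) && \text{(worker node } w\text{: stateless quantization)} \\
\bar{g}_e^{t} &= \frac{1}{W_e} \sum_{w=1}^{W_e} \hat{g}_w^{t} && \text{(edge node } e\text{: aggregation)} \\
p_e^{t} &= \bar{g}_e^{t} + \delta_e^{t}, \qquad \delta_e^{0} = 0_d && \text{(error compensation)} \\
s_e^{t} &= C\!\left(p_e^{t}\right) && \text{(sparsification compression)} \\
\delta_e^{t+1} &= p_e^{t} - s_e^{t} && \text{(error buffer update)} \\
g^{t} &= \frac{1}{E} \sum_{e=1}^{E} s_e^{t} && \text{(central server: global aggregation)} \\
x_{t+1} &= x_t - \gamma\, g^{t} && \text{(model update)}
\end{aligned}
```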

[0066] As a feasible implementation method, the specific iterative process of the Hierarchical Error-Feedback SGD (H-EF-SGD) algorithm provided by this invention is shown in Figure 3 and proceeds as follows:

[0067] First, during system initialization, the central server loads or randomly initializes the global model parameters x0 and sets the learning rate γ, the quantization compressor Q for the worker nodes, and the sparsification compressor C for the edge nodes. The initial global model parameters x0 are then broadcast down to all edge nodes, which in turn forward them to all the worker nodes they are connected to. Next, a buffer for storing compression errors is created at each edge node e, and its value, the initial error term of that edge node, is set to 0_d, where 0_d denotes the zero vector with the same dimension as the model parameters.

[0068] Furthermore, the model parameters are calculated in T iterations. The following is the process for a single training iteration (executed in a loop):

[0069] Step 1: Operations on worker nodes (executed in parallel):

[0070] a. Calculate the local stochastic gradient: Each worker node calculates a stochastic gradient on its local data, using the latest global model parameters received from the upper layer (x0 in the first iteration). Here, t denotes the t-th iteration and w denotes the w-th worker node.

[0071] b. Quantization Compression: The working node uses a preset quantization compressor Q to compress the calculated stochastic gradient, generating a quantized gradient with a smaller data size. This operation is stateless, meaning that worker nodes do not store historical compression information.

[0072] c. Data Upload: The worker node uploads the quantized gradient to its home edge node e.

[0073] Step 2: Operations on edge nodes (executed in parallel):

[0074] a. Aggregated Update: Each edge node collects the quantized gradients sent by all worker nodes under its jurisdiction and calculates their average to form an aggregated gradient. Here, W_e denotes the number of worker nodes governed by edge node e.

[0075] b. Error Compensation: The "historical compressed error" stored in the local error buffer from the previous iteration is added to the "aggregated gradient" calculated in the current step to obtain a compensated gradient.

[0076] c. Sparsity Compression: The compensated gradient is compressed a second time using a pre-defined sparsity compressor. This compressor typically has a high compression ratio and can generate a sparse gradient with a very small data size.

[0077] d. Update the error buffer: Calculate the gradient information lost during the sparsification compression in step c, i.e., the difference between the compensated gradient and the sparsified gradient, and store it in the local error buffer for use in the next iteration. Because the compensated gradient already includes the error from the previous iteration, any part of that error that is still not transmitted is carried forward in this difference.

[0078] e. Data upload: The edge node sends the sparsified gradient to the central server.

[0079] Step 3: Operation of the central server:

[0080] a. Global aggregation: The central server collects the sparse gradients sent by all edge nodes and calculates their average to obtain the final global gradient. Here, E represents the number of edge nodes managed by the central server.

[0081] b. Updating the model: The central server uses the global gradient and the preset learning rate to update the global model parameters.

[0082] Step 4, Process Loop and End: After the model update is completed, the new version of the global model parameters will be broadcast again at the start of the next iteration. The system repeats all the above steps until the training reaches the preset convergence criterion or number of iterations.

[0083] Final output: The final global model after training.
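The four steps above can be sketched end to end on a toy problem. This is an illustrative simulation under stated assumptions, not the patented implementation: the compressors (stochastic uniform quantization for Q, top-k for C), the least-squares objective, the plain SGD update, and every size and hyperparameter below are hypothetical choices, and all nodes are simulated in a single process.

```python
import numpy as np

rng = np.random.default_rng(0)


def quantize(g, levels=16):
    """Worker side: stateless stochastic uniform quantization (assumed Q)."""
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return np.zeros_like(g)
    scaled = np.abs(g) / norm * levels
    lower = np.floor(scaled)
    rounded = lower + (rng.random(g.shape) < scaled - lower)
    return np.sign(g) * norm * rounded / levels


def top_k(v, k):
    """Edge side: keep the k largest-magnitude entries (assumed C)."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out


# Toy setup: 2 edge nodes with 3 worker nodes each, noiseless linear data.
dim, n_edges, workers_per_edge, n_samples, batch = 20, 2, 3, 50, 10
x_true = rng.normal(size=dim)
data = [[(A, A @ x_true) for A in rng.normal(size=(workers_per_edge, n_samples, dim))]
        for _ in range(n_edges)]


def loss(x):
    return np.mean([0.5 * np.mean((A @ x - b) ** 2) for edge in data for A, b in edge])


def local_gradient(x, A, b):
    rows = rng.choice(n_samples, size=batch, replace=False)  # random mini-batch
    return A[rows].T @ (A[rows] @ x - b[rows]) / batch


x = np.zeros(dim)                                  # initial global model parameters
gamma, k, T = 0.05, 5, 500                         # learning rate, kept entries, iterations
errors = [np.zeros(dim) for _ in range(n_edges)]   # one error buffer per edge node
initial_loss = loss(x)

for t in range(T):
    sparse_grads = []
    for e, edge in enumerate(data):
        # Step 1: worker nodes compute and quantize local stochastic gradients.
        quantized = [quantize(local_gradient(x, A, b)) for A, b in edge]
        # Step 2: edge node aggregates, compensates, sparsifies, updates its buffer.
        compensated = np.mean(quantized, axis=0) + errors[e]
        sparse = top_k(compensated, k)
        errors[e] = compensated - sparse
        sparse_grads.append(sparse)
    # Step 3: central server averages the sparse gradients and updates the model.
    x = x - gamma * np.mean(sparse_grads, axis=0)

final_loss = loss(x)
print(initial_loss, final_loss)
```

Only the edge nodes hold state (`errors`), while the worker-side call to `quantize` is stateless, which mirrors the asymmetric design described in the text.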

[0084] In one embodiment, in a large smart factory, thousands of sensors and edge devices (worker nodes) continuously collect production data and perform preliminary model inference or training. These devices are limited by power consumption and computing power. They send lightweight quantized model updates to a convergence server (edge node) in the workshop or factory area via the factory's local area network. Edge nodes have stronger computing power and are responsible for aggregating data from all devices within their area, performing more complex sparsification compression with error feedback, and finally sending only highly compressed and accurate information to a central server in the cloud via the backbone network for final model fusion. The H-EF-SGD method of this invention is well suited to this scenario, keeping the terminals lightweight and minimizing communication on the network backbone.

[0085] In another embodiment, in consumer-facing mobile applications, millions of mobile phones (worker nodes) can train models locally using local data while protecting user privacy. Due to the limited and unstable bandwidth of mobile networks, highly efficient communication methods are required on the mobile devices. Using this invention, the mobile devices perform only stateless gradient quantization, resulting in minimal computational overhead. These quantized gradients are sent to geographically proximate regional data centers (edge nodes). The regional data centers are responsible for aggregating updates from massive numbers of mobile phones and correcting, through the error feedback mechanism, the information loss caused by aggressive compression (such as sparsification). Finally, each regional data center, together with the other regional data centers, uploads its result to a global central server for synchronization. This significantly improves the scalability and training efficiency of large-scale federated learning systems.

[0086] This invention achieves comprehensive optimization breakthroughs in a layered architecture compared to existing technologies: it not only supports gradient compression to reduce communication overhead but also introduces a systematic error feedback mechanism to effectively suppress the accumulation of compression errors in multi-layered links. Simultaneously, it employs a non-uniform compression strategy, flexibly adjusting the compression intensity according to the resource conditions of each layer, achieving synergistic optimization of communication efficiency and model accuracy, significantly improving its practicality and robustness in resource-constrained environments. Experiments with different datasets and models revealed that, under conditions of equal data transmission, this invention achieves higher model accuracy than other methods, demonstrating that it achieves optimal training efficiency and model accuracy while maintaining a high compression ratio and low communication cost. This indicates that this invention achieves the optimal balance between communication efficiency, training speed, and accuracy in a multi-layered communication architecture, outperforming existing mainstream layered training methods.

[0087] In addition, embodiments of the present invention also provide a distributed training system for hierarchical networks. As shown in Figure 4, the distributed training system 400 for hierarchical networks specifically includes:

[0088] The quantization and compression module 410 is used to calculate the local stochastic gradient based on the initial model parameters and local data through the working node; and to quantize and compress the local stochastic gradient to obtain the quantized gradient and upload it to the edge node.

[0089] The sparsity compression module 420 is used to aggregate the quantization gradients sent by each working node through the edge node to obtain the aggregated gradient and perform error compensation; and to perform sparsity compression processing on the aggregated gradient after error compensation to obtain the sparsity gradient and upload it to the central server.

[0090] The parameter update module 430 is used to globally aggregate the sparse gradients sent by each edge node through the central server to obtain the global gradient; and update the initial model parameters according to the global gradient and the preset learning rate to obtain the final model parameters.

[0091] As a feasible implementation, the quantization compression module 410 includes a preset quantization compressor for compressing local stochastic gradients; the sparsification compression module 420 includes a preset sparsification compressor for performing sparsification compression on the aggregated gradient to complete the secondary compression.

[0092] The various embodiments in this invention are described in a progressive manner. Similar or identical parts between embodiments can be referred to mutually. Each embodiment focuses on describing the differences from other embodiments. In particular, the embodiments of apparatus, devices, and non-volatile computer storage media are basically similar to the method embodiments, so the descriptions are relatively simple; relevant parts can be referred to the descriptions of the method embodiments.

[0093] The foregoing has described specific embodiments of the present invention. The processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are possible or may be advantageous.

[0094] The above description is merely an embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, various modifications and variations can be made to the embodiments of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principle of the embodiments of the present invention should be included within the protection scope of the present invention.

Claims

1. A distributed training method for a hierarchical network, characterized in that the method includes: a worker node calculates a local stochastic gradient based on initial model parameters and local data, quantizes and compresses the local stochastic gradient to obtain a quantized gradient, and uploads it to an edge node; the edge node aggregates the quantized gradients sent by each worker node to obtain an aggregated gradient and performs error compensation, specifically including: the edge node collects the quantized gradients sent by all the worker nodes under its jurisdiction and calculates the average of all the quantized gradients to obtain the aggregated gradient; the edge node adds, by vector addition, the historical compression error stored in its local error buffer from the previous iteration to the current aggregated gradient to obtain the error-compensated aggregated gradient; the error-compensated aggregated gradient is subjected to sparsification compression to obtain a sparsified gradient, which is uploaded to the central server; the edge node calculates the difference between the error-compensated aggregated gradient and the corresponding sparsified gradient to obtain the compression error of the sparsification compression, and stores the compression error in its local error buffer for use in the next iteration; the central server globally aggregates the sparsified gradients sent by each edge node to obtain a global gradient, and updates the initial model parameters based on the global gradient and a preset learning rate to obtain final model parameters.

2. The distributed training method for a hierarchical network according to claim 1, wherein, before the worker node calculates the local stochastic gradient based on the initial model parameters and local data, the method further includes: the central server initializes the global model parameters of the model to be trained to obtain the initial model parameters, and broadcasts the initial model parameters down to all edge nodes; each edge node forwards the received initial model parameters to all the worker nodes connected to it; each edge node contains an error buffer for storing compression errors, and the initial value of the error buffer is a zero vector.

3. The distributed training method for a hierarchical network according to claim 1, wherein the worker node calculating the local stochastic gradient based on the initial model parameters and local data, quantizing and compressing the local stochastic gradient to obtain the quantized gradient, and uploading it to the edge node specifically includes: each worker node aggregates the training data within its connection range and saves it locally as the local data; after receiving the model training task, the worker node performs calculations on the local data based on the initial model parameters to obtain the local stochastic gradient; the worker node compresses the local stochastic gradient using a preset quantization compressor to obtain the quantized gradient and uploads it to its corresponding edge node; wherein the data size of the quantized gradient is smaller than the data size of the local stochastic gradient.

4. The distributed training method for a hierarchical network according to claim 1, characterized in that subjecting the error-compensated aggregated gradient to sparsification compression to obtain the sparsified gradient and uploading it to the central server specifically includes: the edge node performs sparsification compression on the error-compensated aggregated gradient using a preset sparsification compressor to complete the secondary compression, obtains the sparsified gradient, and uploads it to the central server; wherein the data size of the sparsified gradient is smaller than the data size of the aggregated gradient.

5. The distributed training method for a hierarchical network according to claim 1, characterized in that the central server globally aggregating the sparsified gradients sent by each edge node to obtain the global gradient specifically includes: the central server collects the sparsified gradients sent by all the edge nodes under its jurisdiction and calculates their average to obtain the global gradient.

6. The distributed training method for a hierarchical network according to claim 1, characterized in that updating the initial model parameters based on the global gradient and the preset learning rate to obtain the final model parameters specifically includes: the central server substitutes the model parameters obtained from the previous iteration, the global gradient, and the preset learning rate into the update formula to calculate the model parameters after the current iteration; this process is repeated until a preset number of iterations or a preset convergence criterion is reached, thus obtaining the final model parameters.

7. A distributed training system for a hierarchical network, characterized in that the system includes: a quantization compression module, used to calculate, at a worker node, a local stochastic gradient based on initial model parameters and local data, and to quantize and compress the local stochastic gradient to obtain a quantized gradient, which is uploaded to an edge node; a sparsification compression module, used to aggregate, at the edge node, the quantized gradients sent by each worker node to obtain an aggregated gradient and perform error compensation, specifically including: the edge node collects the quantized gradients sent by all the worker nodes under its jurisdiction and calculates the average of all the quantized gradients to obtain the aggregated gradient; the edge node adds, by vector addition, the historical compression error stored in its local error buffer from the previous iteration to the current aggregated gradient to obtain the error-compensated aggregated gradient; sparsification compression is performed on the error-compensated aggregated gradient to obtain a sparsified gradient, which is uploaded to the central server; the edge node calculates the difference between the error-compensated aggregated gradient and the corresponding sparsified gradient to obtain the compression error of the sparsification compression, and stores the compression error in its local error buffer for use in the next iteration; and a parameter update module, used to globally aggregate, at the central server, the sparsified gradients sent by each edge node to obtain a global gradient, and to update the initial model parameters according to the global gradient and a preset learning rate to obtain final model parameters.

8. The distributed training system for a hierarchical network according to claim 7, characterized in that the quantization compression module includes a preset quantization compressor for compressing local stochastic gradients, and the sparsification compression module includes a preset sparsification compressor for performing sparsification compression on the aggregated gradient to complete the secondary compression.