General in-network synchronization aggregation method, system and device for distributed applications
By dynamically adjusting aggregator resources and execution order in distributed applications, the problem of low resource utilization efficiency in intra-network aggregation technology is solved, achieving efficient data aggregation and improved communication performance, and supporting the concurrency of multiple applications.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NAT UNIV OF DEFENSE TECH
- Filing Date
- 2023-10-13
- Publication Date
- 2026-06-19
AI Technical Summary
Existing in-network aggregation technologies suffer from redundant development, inability to update at runtime, potential security risks, and low resource utilization efficiency in distributed applications, which hinders their parallel reuse across multiple distributed applications.
By acquiring the aggregation task request from the application, the controller determines the aggregator resources and execution order of the target task according to the preset scheduling policy. The controller allocates an isolation area for each aggregator resource and sets offset rules. The switch processes the data packet sequence according to the execution order and aggregation rules. The aggregator merges the data packets and sends them to the receiver. The receiver replies with an ACK message and multicasts it back to the sender.
It enables dynamic adjustment and efficient data aggregation, reduces resource overhead, supports concurrency of multiple applications, reduces development complexity, and improves communication performance and reliability.
Figure CN117354370B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of general in-network aggregation technology, and in particular to a general in-network synchronization aggregation method, system and device for distributed applications.
Background Technology
[0002] Driven by programmable network devices, a new communication and computing paradigm called In-Network Aggregation (INA) has been proposed and applied to various distributed systems, including distributed training (DT), high-performance computing (HPC), distributed block storage, and network monitoring. INA offloads the aggregation of data streams to switches to reduce traffic volume and overall job completion time. Existing prototypes have demonstrated the performance improvements of INA, such as a 66% improvement in DT jobs and a 2.7-6.8x improvement in storage.
[0003] While INA has proven successful for single applications, the tight coupling between applications and INA functionality leads to problems such as redundant development, inability to update at runtime, potential security risks, and inefficient resource utilization. These issues hinder the widespread adoption of INA in development, deployment, and operation, and prevent multiple distributed applications from reusing INA in parallel.
Summary of the Invention
[0004] Therefore, it is necessary to provide a general intra-network synchronization aggregation method, system, and device for distributed applications that can reduce switch resource overhead when distributed applications are used in parallel and multiplexed, in order to address the above-mentioned technical problems.
[0005] A general intra-network synchronization aggregation method for distributed applications, applied to a general network architecture for shared clusters of distributed applications, the method comprising:
[0006] Retrieve the application's aggregate task request.
[0007] Based on the aggregated task request and the preset scheduling strategy, the aggregator resources of the target task and the execution order of the target task are determined. The controller allocates an isolation region for each aggregator resource and sets the offset rules corresponding to the isolation region. The isolation region and the corresponding offset rules are written into the aggregation table of the controller to obtain the aggregation rules.
[0008] The switch receives the data packet sequence sent by the sender of the target task according to the execution order and aggregation rules, and locates the data packet sequence in the aggregation table to obtain the aggregator that matches the data packet sequence.
[0009] The data packet sequence is merged by an aggregator to obtain the result data packet, which is then sent to the receiver of the target task. The receiver replies with an ACK message sequence based on the result data packet and synchronously transmits it back to the sender via multicast from the switch.
[0010] In one embodiment, the system further includes: in a distributed application shared cluster general network architecture, a communication path from the sender to the receiver of the target task is generated according to a routing protocol, and the servers, controllers, and switches corresponding to the applications form an aggregation hierarchy based on the communication path. Multiple applications send aggregation task requests to local agents, and multiple local agents send the aggregation task requests to the controller in parallel.
[0011] In one embodiment, the method further includes: determining the aggregator resources and execution order of the target task based on the aggregation task request and the scheduling policy preset by the controller; and setting up an execution task switch for the target task according to the execution order. On the execution task switch, the controller allocates an isolation region for each aggregator resource and sets offset rules for the isolation region. The controller writes the isolation region and the offset rules corresponding to the isolation region into the aggregation table of the controller to obtain the aggregation rules for the target task.
[0012] In one embodiment, the method further includes: the sender of the target task divides the target task data block into a data packet sequence and sends the data packet sequence to the switch within a maintained window.
[0013] In one embodiment, the method further includes: the switch receiving the data packet sequence sent by the sender of the target task according to the execution order, and performing addressing and positioning on the aggregation table according to the sequence number and offset rule of the data packet sequence.
[0014] Aggregator.index←packet.seq_num+Offset
[0015] Here, `Aggregator.index` is the index of the isolated region, `packet.seq_num` is the sequence number of the data packet sequence, and `Offset` is the offset rule. It retrieves the aggregator within the isolated region corresponding to the sequence number of the data packet sequence.
[0016] In one embodiment, the method further includes: merging data packet sequences with the same message sequence number through an aggregator to obtain a result data packet, sending the result data packet to the receiver of the target task, the receiver replying with an ACK message sequence based on the result data packet, and when the ACK message sequence arrives at the switch, clearing the aggregator corresponding to each target task according to the number of target tasks and the switch group composed of the aggregation hierarchy, and sending back the ACK message sequence to the sender corresponding to the target task.
[0017] In one embodiment, the aggregation task request includes: a target task ID, a sender ID of the target task, a receiver ID of the target task, an aggregation function, and an aggregation type. The aggregation types include Reduce and AllReduce.
[0018] In one embodiment, the method further includes: if the aggregation type is Reduce, the receiver reassembles the ACK message sequence into a feedback message, which is then transmitted by the controller to the sender's local agent as the aggregation result. If the aggregation type is AllReduce, the sender reassembles the payload of the ACK message sequence into a feedback message, which is then transmitted by the controller to the sender's local agent as the aggregation result. The local agent returns the aggregation result to the application that initiated the target task via IPC.
[0019] A general-purpose intra-network synchronization aggregation system for distributed applications, the system comprising:
[0020] The Aggregate Task Request Acquisition Module is used to acquire aggregate task requests from the application.
[0021] The aggregation rule acquisition module is used to determine the aggregator resources of the target task and the execution order of the target task based on the aggregation task request and the preset scheduling strategy. The controller allocates an isolation region for each aggregator resource and sets the offset rule corresponding to the isolation region. The isolation region and the offset rule corresponding to the isolation region are written into the aggregation table of the controller to obtain the aggregation rule.
[0022] The aggregator matching module is used by the switch to receive the data packet sequence sent by the sender of the target task according to the execution order and aggregation rules, locate the data packet sequence in the aggregation table, and obtain the aggregator that matches the data packet sequence.
[0023] The aggregation module is used to merge data packet sequences through an aggregator to obtain a result data packet, which is then sent to the receiver of the target task. The receiver replies with an ACK message sequence based on the result data packet and synchronously transmits it back to the sender via multicast from the switch.
[0024] A computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program performing the following steps:
[0025] Retrieve the application's aggregate task request.
[0026] Based on the aggregated task request and the preset scheduling strategy, the aggregator resources of the target task and the execution order of the target task are determined. The controller allocates an isolation region for each aggregator resource and sets the offset rules corresponding to the isolation region. The isolation region and the corresponding offset rules are written into the aggregation table of the controller to obtain the aggregation rules.
[0027] The switch receives the data packet sequence sent by the sender of the target task according to the execution order and aggregation rules, and locates the data packet sequence in the aggregation table to obtain the aggregator that matches the data packet sequence.
[0028] The data packet sequence is merged by an aggregator to obtain the result data packet, which is then sent to the receiver of the target task. The receiver replies with an ACK message sequence based on the result data packet and synchronously transmits it back to the sender via multicast from the switch.
[0029] In the aforementioned general intra-network synchronization aggregation method, apparatus, computer device, and storage medium for distributed applications, the aggregation task request is first obtained from the application, so the system can adjust and respond dynamically to the application's needs and achieve efficient data aggregation. A preset scheduling strategy determines the aggregator resources and execution order of the target task, so data aggregation can be optimized under different strategies for different application scenarios. The controller allocates an isolation area for each aggregator resource and sets the corresponding offset rules, enabling effective resource utilization and precise control of data packets. The switch processes the data packet sequence sent by the sender according to the execution order and aggregation rules, ensuring that packets are correctly routed and aggregated and thereby reducing resource overhead. The aggregator merges the data packet sequence into a result data packet, which is sent to the receiver of the target task; this merging further reduces communication overhead. The receiver replies with an ACK message sequence based on the result data packet, and the switch multicasts it back to the senders, providing communication reliability and a feedback mechanism. Through dynamic task scheduling, data packet processing and merging, and effective resource management and control, the technical problems of communication performance and resource overhead are addressed. This provides a flexible approach that meets the requirements of different applications, reduces the complexity of development work, and supports a wide range of concurrent applications.
Attached Figure Description
[0030] Figure 1 is an application scenario diagram of a general intra-network synchronization aggregation method for distributed applications in one embodiment;
[0031] Figure 2 is a flowchart of a general intra-network synchronization aggregation method for distributed applications in one embodiment;
[0032] Figure 3 shows the GISA interface code in one embodiment;
[0033] Figure 4 is a schematic diagram of multi-level general in-network synchronization aggregation in one embodiment;
[0034] Figure 5 is a diagram of the GISA message structure in one embodiment;
[0035] Figure 6 is a layout diagram of the switch aggregator in one embodiment;
[0036] Figure 7 is a graph of the aggregation performance when using 7 source nodes and 1 target node in one embodiment;
[0037] Figure 8 is a throughput graph for different packet loss rate settings in one embodiment;
[0038] Figure 9 is a performance graph for different numbers of source nodes and concurrent tasks in one embodiment, where Figure 9(a) shows the throughput for different numbers of source nodes and Figure 9(b) shows the performance when handling multiple concurrent tasks;
[0039] Figure 10 is a single-task cost graph in one embodiment, where Figure 10(a) shows the switch state overhead required for the aggregation task and Figure 10(b) shows the traffic overhead for the entire network;
[0040] Figure 11 is a throughput graph for different distributed training models in one embodiment, where Figure 11(a) shows the throughput under the VGG16 model, Figure 11(b) shows the throughput under the AlexNet model, and Figure 11(c) shows the throughput under the ResNet50 model;
[0041] Figure 12 is a performance graph of an erasure coding storage system in one embodiment, where Figure 12(a) shows the repair time and Figure 12(b) shows the network traffic overhead;
[0042] Figure 13 is a performance graph for a network measurement application in one embodiment, where Figure 13(a) shows the completion time of the CMS from different numbers of monitoring nodes and Figure 13(b) shows the impact of CMS size on transmission completion time;
[0043] Figure 14 is a block diagram of a general intra-network synchronization aggregation system for distributed applications in one embodiment;
[0044] Figure 15 is an internal structural diagram of a computer device in one embodiment.
Detailed Implementation
[0045] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0046] The general intra-network synchronization aggregation method for distributed applications provided in this application can be applied to the general network architecture for distributed-application shared clusters shown in Figure 1 (i.e., Generic In-network Synchronous Aggregation, GISA). The GISA network architecture includes an application plane, a control plane, and a data plane. The application plane consists of multiple servers; Figure 1 uses three servers (Server 1, Server 2, and Server 3) as a brief illustration. Each server hosts multiple applications within its local domain, and these applications exchange data in a distributed manner. The control plane consists of a controller, which can be deployed on any server in the current network and is responsible for determining the task execution order, allocating aggregator resources to tasks, and installing routing rules. In the data plane, a GISA agent is deployed within the local domain of each server and exchanges messages with the applications via inter-process communication (IPC).
[0047] In one embodiment, as shown in Figure 2, a general intra-network synchronization aggregation method for distributed applications is provided. Taking its application to the general network architecture for distributed-application shared clusters in Figure 1 as an example, the method includes the following steps:
[0048] Step 202: Obtain the application's aggregation task request.
[0049] The aggregation request specifies the following information: task ID, sender ID, receiver ID, and other configurations (e.g., aggregation type: Reduce or AllReduce).
[0050] Specifically, when the general in-network aggregation operation starts, GISA starts the controller in the cluster and starts an agent as a daemon on each server. Each agent establishes a communication channel with the controller and is assigned an ID associated with its host MAC address for routing. When an application needs to perform an aggregation operation, each endpoint submits a request to its local agent.
[0051] Furthermore, the application offloads aggregation operations (Reduce or AllReduce) to GISA; each operation has multiple senders and one receiver. In the network, the routing protocol generates a path from each sender to the receiver, and together these paths form a tree-like structure in the topology, i.e., the aggregation hierarchy.
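As a non-limiting illustration of how the per-sender paths combine into an aggregation hierarchy, the following sketch merges routing paths into a parent/child map. The helper `build_hierarchy` and the node names (`h1`, `s1`, `recv`, and so on) are hypothetical and are not taken from this application.

```python
from collections import defaultdict


def build_hierarchy(paths):
    """Merge sender-to-receiver paths into a tree.

    `paths` is a list of node-name lists, each starting at a sender and
    ending at the single receiver. Returns {parent: sorted children},
    where a node's parent is its next hop toward the receiver.
    """
    parent = {}
    for path in paths:
        for node, next_hop in zip(path, path[1:]):
            # In a tree, every node has exactly one next hop to the receiver.
            if parent.setdefault(node, next_hop) != next_hop:
                raise ValueError(f"{node} has two next hops; not a tree")
    children = defaultdict(list)
    for node, next_hop in parent.items():
        children[next_hop].append(node)
    return {p: sorted(c) for p, c in children.items()}


# Hypothetical topology: three senders, two leaf switches, one root switch.
paths = [
    ["h1", "s1", "s3", "recv"],
    ["h2", "s1", "s3", "recv"],
    ["h3", "s2", "s3", "recv"],
]
hierarchy = build_hierarchy(paths)
```

In this sketch each switch's children correspond to the ports whose contributions it would wait for before forwarding an aggregated packet.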
[0052] Furthermore, the agent passes the request to the controller, which can receive multiple aggregation task requests.
[0053] Step 204: Determine the aggregator resources and execution order of the target task based on the aggregation task request and the preset scheduling strategy. Allocate an isolation region for each aggregator resource through the controller and set the offset rules corresponding to the isolation region. Write the isolation region and the offset rules corresponding to the isolation region into the aggregation table of the controller to obtain the aggregation rules.
[0054] Specifically, the controller selects target tasks where "all endpoints are ready" based on a scheduling policy (e.g., first-come, first-served). The scheduling policy needs to determine the execution order of tasks and the aggregator resources allocated to each target task.
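The selection of tasks whose endpoints are all ready can be sketched as follows. This is a minimal illustration that assumes the first-come, first-served policy is applied among ready tasks in order of first arrival; the class `ReadyQueue` and its method names are hypothetical.

```python
from collections import OrderedDict


class ReadyQueue:
    """Tracks which endpoints of each task have submitted a request."""

    def __init__(self):
        # task_id -> (expected endpoints, endpoints seen so far).
        # Insertion order is arrival order, which gives FCFS.
        self._tasks = OrderedDict()

    def submit(self, task_id, endpoint, expected_endpoints):
        expected, seen = self._tasks.setdefault(
            task_id, (frozenset(expected_endpoints), set())
        )
        seen.add(endpoint)

    def next_ready(self):
        """Pop and return the oldest task whose endpoints are all ready."""
        for task_id, (expected, seen) in list(self._tasks.items()):
            if seen >= expected:
                del self._tasks[task_id]
                return task_id
        return None


q = ReadyQueue()
q.submit("A", "h1", {"h1", "h2", "recv"})   # task A arrives first
q.submit("B", "h3", {"h3", "recv"})
q.submit("B", "recv", {"h3", "recv"})
first = q.next_ready()                       # A is still missing endpoints
q.submit("A", "h2", {"h1", "h2", "recv"})
q.submit("A", "recv", {"h1", "h2", "recv"})
second = q.next_ready()
```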
[0055] Furthermore, the controller first configures the task-executing switches in the task's aggregation hierarchy. On each task-executing switch, the controller reserves an isolated region in the Aggregator Table and installs rules carrying the offset of that region to direct the task's traffic to that region. The controller then notifies all endpoints to start the aggregation operation; for example, the senders transmit data packets into the network, and the receivers obtain the results.
[0056] Furthermore, GISA selects any server as the central controller, which can coordinate resource allocation when there are multiple target tasks, and the sender of the target task can only start transmitting data packets to the aggregator when all its endpoints are ready; otherwise, the aggregation will never be completed due to the lack of some senders.
[0057] Step 206: The switch receives the data packet sequence sent by the sender of the target task according to the execution order and aggregation rules, and locates the data packet sequence in the aggregation table to obtain the aggregator that matches the data packet sequence.
[0058] Specifically, in each target task, each sender divides its data block into a sequence of data packets, and all senders initialize with the same sequence number. Each sender maintains a window and always sends data packets within that window.
[0059] Furthermore, the switch aggregates packet sequences according to execution order (e.g., first-come, first-served). When a packet arrives at the switch, it is located to the aggregator in the isolation area corresponding to the target task. The addressing method is to add the packet sequence number to the offset, i.e.:
[0060] Aggregator.index←packet.seq_num+Offset
[0061] Here, Aggregator.index is the index of the isolated region, packet.seq_num is the sequence number of the data packet sequence, and Offset is the offset rule.
[0062] Furthermore, the sequence numbers wrap around within a range equal to the size of the isolation area, which prevents packets from accessing aggregators beyond the region's boundaries. Each aggregator is initialized to EMPTY and accumulates every packet it receives, while also maintaining a PortBitmap that records which of its child nodes have contributed. After a packet is accumulated, if the bitmap is not full, the packet is discarded because aggregation is incomplete; if the bitmap is full, the aggregator value is copied back into the packet, and the packet carries the result along its route to downstream devices.
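The addressing rule can be illustrated with a short sketch. The function name `aggregator_index` and the parameter `region_size` are hypothetical, and the modulo operation is one assumed way of modelling sequence numbers that are circular within the size of the isolation area; the formula in this application is stated simply as the sequence number plus the offset.

```python
def aggregator_index(seq_num: int, offset: int, region_size: int) -> int:
    """Map a packet sequence number to a slot in the task's isolation region.

    The modulo models the circular sequence numbers, so the result always
    lies in the half-open range [offset, offset + region_size).
    """
    if region_size <= 0:
        raise ValueError("region_size must be positive")
    return offset + (seq_num % region_size)


# Hypothetical region: 64 aggregators starting at table index 128.
first_slot = aggregator_index(0, offset=128, region_size=64)
last_slot = aggregator_index(63, offset=128, region_size=64)
wrapped = aggregator_index(64, offset=128, region_size=64)
```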
[0063] Furthermore, GISA incorporates a retransmission mechanism on the host and a deduplication mechanism on the switch. When the sender transmits a data packet within its sliding window, it also maintains the packet's transmission timestamp. If the ACK for the data packet does not arrive within the timeout threshold, the packet will be retransmitted. Retransmission is repeated three times. In the special case of asynchronous windows, the sender sends data packets with special flags to bypass switch aggregation and directly obtain the result from the receiver.
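A minimal sketch of the host-side window and timeout logic is given below. The class `WindowSender`, its default window size and timeout, and the choice to mark a packet as failed once the retransmission limit is reached are assumptions made for illustration; the special-flag bypass path is not modelled.

```python
class WindowSender:
    """Sliding-window sender with timeout-based retransmission."""

    def __init__(self, num_packets, window=4, timeout=1.0, max_retries=3):
        self.num_packets = num_packets
        self.window = window
        self.timeout = timeout
        self.max_retries = max_retries
        self.base = 0        # lowest unacknowledged sequence number
        self.next_seq = 0    # next sequence number that was never sent
        self.sent_at = {}    # seq -> timestamp of the latest transmission
        self.retries = {}    # seq -> retransmissions performed so far
        self.failed = set()  # assumption: packets that exhausted retries

    def poll(self, now):
        """Return the sequence numbers to transmit at time `now`."""
        out = []
        # Retransmit in-flight packets whose ACK is overdue.
        for seq in sorted(self.sent_at):
            if now - self.sent_at[seq] >= self.timeout:
                if self.retries.get(seq, 0) >= self.max_retries:
                    self.failed.add(seq)
                    continue
                self.retries[seq] = self.retries.get(seq, 0) + 1
                self.sent_at[seq] = now
                out.append(seq)
        # Send new packets while the window has room.
        while self.next_seq < min(self.base + self.window, self.num_packets):
            self.sent_at[self.next_seq] = now
            out.append(self.next_seq)
            self.next_seq += 1
        return out

    def on_ack(self, seq):
        """Record an ACK and slide the window past acknowledged packets."""
        self.sent_at.pop(seq, None)
        self.retries.pop(seq, None)
        while self.base < self.next_seq and self.base not in self.sent_at:
            self.base += 1


sender = WindowSender(num_packets=6, window=4, timeout=1.0)
first = sender.poll(now=0.0)    # fills the window
sender.on_ack(0)
sender.on_ack(1)
second = sender.poll(now=0.5)   # window slid forward, nothing timed out
third = sender.poll(now=1.0)    # packets 2 and 3 have timed out
```

The clock is passed in explicitly so that the behaviour can be stepped through deterministically.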
[0064] Furthermore, the first occurrence of each data packet is recorded in the bitmap (with its bit set to 1), and subsequent occurrences can be identified; these packets will not be counted again. The switch only performs deduplication on aggregated packets, not on forwarded packets. That is, after the above steps, if the bitmap is not full, the data packet will be discarded; otherwise, the packet will carry the aggregator's contents to the downstream device.
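The accumulation and deduplication behaviour of a single aggregator can be sketched as follows. The class `Aggregator`, the use of integer port indices, and the choice of addition as the merge operation are assumptions for illustration only.

```python
class Aggregator:
    """One switch aggregator slot with a PortBitmap for its child ports."""

    def __init__(self, num_children):
        self.full_mask = (1 << num_children) - 1
        self.clear()

    def clear(self):
        """Reset to EMPTY, as done when the matching ACK passes through."""
        self.value = None
        self.bitmap = 0

    def on_packet(self, port, payload):
        """Accumulate `payload` arriving from child `port`.

        Returns the aggregated value once every child has contributed,
        otherwise None (the packet is discarded at the switch).
        """
        bit = 1 << port
        if not self.bitmap & bit:
            # First occurrence from this port: record it and accumulate.
            self.bitmap |= bit
            self.value = payload if self.value is None else self.value + payload
        # A duplicate (retransmitted) packet skips the accumulation above.
        if self.bitmap == self.full_mask:
            return self.value  # copied back into the forwarded packet
        return None


agg = Aggregator(num_children=3)
r1 = agg.on_packet(0, 5)
r2 = agg.on_packet(0, 5)   # retransmission from port 0, not counted again
r3 = agg.on_packet(1, 7)
r4 = agg.on_packet(2, 1)   # bitmap is now full
r5 = agg.on_packet(1, 7)   # late duplicate still carries the result onward
```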
[0065] Step 208: The data packet sequence is merged by the aggregator to obtain the result data packet. The result data packet is sent to the receiver of the target task. The receiver replies with an ACK message sequence based on the result data packet and synchronously transmits it back to the sender via multicast through the switch.
[0066] Specifically, the switch merges all sender packets with the same sequence number to obtain a result packet. The receiver of the target task receives the aggregated result packet and replies to the senders with an ACK packet carrying the same sequence number. If the aggregation type is Reduce, the ACK has no payload; if the aggregation type is AllReduce, the ACK packet includes the result. When the ACK packet arrives at the switch, the switch clears the corresponding aggregator and multicasts the ACK packet to the senders along the aggregation hierarchy. Each sender, upon receiving the ACK packet, moves its sliding window forward and continues sending new packets within the window.
[0067] In the aforementioned general intra-network synchronous aggregation method for distributed applications, the aggregation task request is first obtained from the application, so the system can adjust and respond dynamically to the application's needs and achieve efficient data aggregation. A preset scheduling strategy determines the aggregator resources and execution order of the target task, so data aggregation can be optimized under different strategies for different application scenarios. The controller allocates an isolation area for each aggregator resource and sets the corresponding offset rules, ensuring effective resource utilization and precise control of data packets. The switch processes the data packet sequence sent by the sender according to the execution order and aggregation rules, ensuring that packets are correctly routed and aggregated and thereby reducing resource overhead. The aggregator merges the data packet sequence into a result data packet, which is sent to the receiver of the target task; this merging further reduces communication overhead. The receiver replies with an ACK message sequence based on the result data packet, and the switch multicasts it back to the senders, providing communication reliability and a feedback mechanism. Through dynamic task scheduling, data packet processing and merging, and effective resource management and control, the technical problems of communication performance and resource overhead are addressed. This provides a flexible approach that meets the requirements of different applications, reduces the complexity of development work, and supports a wide range of concurrent applications.
[0068] In one embodiment, in a general network architecture for a distributed application shared cluster, a communication path from the sender to the receiver of a target task is generated according to a routing protocol. The servers, controllers, and switches corresponding to the applications form an aggregation hierarchy based on this communication path. Multiple applications send aggregation task requests to local agents, and multiple local agents send these requests to the controller in parallel.
[0069] It is worth noting that, instead of using scalar values as sequence elements, each data stream is described as a "multiset sequence." This multiset representation enriches the aggregation semantics of general in-network synchronization aggregation. For example, if a user requests the average of multiple vectors, each vector element value is converted into a multiset (value, 1). The switch aggregates the multisets by adding the two dimensions separately, and the receiver calculates the average by dividing the first value by the second value.
[0070] Let $D_i$ denote the data stream originating from sending node $i$. $D_i$ is further expressed as:
[0071] $D_i = \langle V_{i,1}, V_{i,2}, V_{i,3}, \ldots, V_{i,k} \rangle$
[0072] where $k$ is the sequence length and each $V_{i,j}$ ($1 \le j \le k$) is a multiset. The aggregated result $D^*$ of the sequences originating from $n$ sending nodes is also a multiset sequence, which can be represented as
[0073] $D^* = \langle V^*_1, V^*_2, \ldots, V^*_k \rangle, \quad V^*_j = \biguplus_{i=1}^{n} V_{i,j}$
[0074] where $1 \le j \le k$ and the symbol $\uplus$ denotes the standard multiset sum.
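The averaging example described above can be worked through in a few lines. This is an illustrative sketch only; the function names `to_pairs`, `merge`, and `finish` are hypothetical, and each element-wise (value, 1) pair stands in for the multiset representation.

```python
def to_pairs(vector):
    """Sender side: wrap each element as a (value, 1) pair."""
    return [(v, 1) for v in vector]


def merge(seq_a, seq_b):
    """Switch side: add the two dimensions separately, element by element."""
    return [(a + c, b + d) for (a, b), (c, d) in zip(seq_a, seq_b)]


def finish(seq):
    """Receiver side: divide the summed values by the counts."""
    return [total / count for total, count in seq]


# Three hypothetical senders, each holding a vector of length k = 2.
streams = [to_pairs(v) for v in ([1.0, 2.0], [3.0, 6.0], [5.0, 10.0])]
merged = streams[0]
for s in streams[1:]:
    merged = merge(merged, s)
average = finish(merged)
```

Because only additions are performed in `merge`, the division is deferred to the receiver and the switch never needs to know how many senders exist.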
[0075] Furthermore, Figure 3 shows the interface provided to applications. To initialize a task, the application calls `init()`, which notifies the controller to allocate switch resources and assigns the task an ID. At runtime, the application calls `request()` to submit data to the GISA agent; the data contains a multiset and its format. `request()` also specifies the operation on the multiset, the role of the endpoint (sender / receiver), and the aggregation mode (Reduce or AllReduce). If the mode is Reduce, the call returns success / failure at the sender and the result at the receiver; if the mode is AllReduce, it returns the result at the sender and success / failure at the receiver.
[0076] In one embodiment, the aggregator resources and execution order of the target task are determined based on the aggregation task request and the scheduling policy preset by the controller. The controller then sets up an execution task switch for the target task according to the execution order. On the execution task switch, the controller allocates an isolation region for each aggregator resource and sets offset rules for the isolation region. The isolation region and the offset rules corresponding to the isolation region are written into the controller's aggregation table to obtain the aggregation rules for the target task.
[0077] In one embodiment, the sender of the target task divides the target task data block into a sequence of data packets and sends the sequence of data packets to the switch within a maintained window.
[0078] In one embodiment, the switch receives the sequence of data packets sent by the sender of the target task according to the execution order, and performs addressing and positioning on the aggregation table according to the sequence number and offset rules of the data packet sequence:
[0079] Aggregator.index←packet.seq_num+Offset
[0080] Here, `Aggregator.index` is the index of the isolated region, `packet.seq_num` is the sequence number of the data packet sequence, and `Offset` is the offset rule. It retrieves the aggregator within the isolated region corresponding to the sequence number of the data packet sequence.
[0081] In one embodiment, data packet sequences with the same message sequence number are merged by an aggregator to obtain a result data packet. The result data packet is sent to the receiver of the target task. The receiver replies with an ACK message sequence based on the result data packet. When the ACK message sequence arrives at the switch, the aggregator corresponding to each target task is cleared according to the number of target tasks and the switch group composed of the aggregation hierarchy, and the ACK message sequence is sent back to the sender corresponding to the target task.
[0082] In one embodiment, the aggregation task request includes: a target task ID, a sender ID of the target task, a receiver ID of the target task, an aggregation function, and an aggregation type. The aggregation types include Reduce and AllReduce.
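By way of example only, the fields of an aggregation task request could be represented as a small record type. The names `AggregationRequest` and `AggregationType`, the field types, and the string encoding of the aggregation function are assumptions and are not specified in this application.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class AggregationType(Enum):
    REDUCE = "Reduce"
    ALLREDUCE = "AllReduce"


@dataclass(frozen=True)
class AggregationRequest:
    task_id: int
    sender_ids: Tuple[int, ...]
    receiver_id: int
    function: str                 # e.g. "SUM"; the encoding is assumed
    agg_type: AggregationType

    def __post_init__(self):
        # Each operation has multiple senders and one receiver.
        if not self.sender_ids:
            raise ValueError("a task needs at least one sender")


req = AggregationRequest(
    task_id=7,
    sender_ids=(1, 2, 3),
    receiver_id=9,
    function="SUM",
    agg_type=AggregationType.ALLREDUCE,
)
```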
[0083] It's worth noting that the task-executing switch performs aggregation operations based on the received aggregation function and aggregation type. If the aggregation function is a self-decomposable aggregation function, the switch performs intra-network aggregation; if the aggregation function is a decomposable aggregation function, the switch decomposes it into a self-decomposable aggregation function before performing intra-network aggregation. The specific details are as follows:
[0084] 1) An aggregation function $f$ is a self-decomposable aggregation function if, for some merge operator $\diamond$ and all nonempty multisets $X$ and $Y$, it satisfies $f(X \uplus Y) = f(X) \diamond f(Y)$, where the symbol $\uplus$ denotes the standard multiset sum.
[0085] The operations $\uplus$ and $\diamond$ are both commutative and associative. Computing $f$ over several multisets can therefore be decomposed recursively, and any order of decomposition produces the same final result.
[0086] Taking distributed training as an example, the multisets degenerate into scalar values in this case. Gradient aggregation uses the SUM function to aggregate gradient data from different worker nodes. The SUM function is a self-decomposable aggregation function, that is:
[0087] $\mathrm{SUM}(X \uplus Y) = \mathrm{SUM}(X) + \mathrm{SUM}(Y)$
[0088] Self-decomposing functions also include MIN, MAX, XOR, COUNT, etc., and can be used in various systems. Self-decomposing functions can be executed independently on the switch without the assistance of the terminal host.
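The self-decomposable property can be checked numerically for the functions listed above. In this sketch, Python lists stand in for multisets and list concatenation stands in for the multiset sum; the table `SELF_DECOMPOSABLE` and the helper names are hypothetical.

```python
from functools import reduce
from itertools import permutations
import operator

# Function name -> (f applied to one multiset, merge operator).
SELF_DECOMPOSABLE = {
    "SUM": (sum, operator.add),
    "MIN": (min, min),
    "MAX": (max, max),
    "COUNT": (len, operator.add),
    "XOR": (lambda xs: reduce(operator.xor, xs), operator.xor),
}


def holds(name, x, y):
    """Check f(X + Y) == merge(f(X), f(Y)) for non-empty lists x and y."""
    f, merge = SELF_DECOMPOSABLE[name]
    return f(x + y) == merge(f(x), f(y))


def merge_any_order(name, parts):
    """Merge partial results in every order; return the set of outcomes."""
    f, merge = SELF_DECOMPOSABLE[name]
    partials = [f(p) for p in parts]
    return {reduce(merge, order) for order in permutations(partials)}


# Three hypothetical partial inputs, e.g. arriving on three switch ports.
parts = [[3, 5], [8], [1, 4, 4]]
sum_outcomes = merge_any_order("SUM", parts)
xor_outcomes = merge_any_order("XOR", parts)
```

Each outcome set contains a single value, which reflects the statement that any order of decomposition yields the same result.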
[0089] 2) If the aggregation function $f$ can be written as the composition of some function $g$ and a self-decomposable aggregation function $h$, then $f$ is a decomposable aggregation function, where $f = g \circ h$.
[0090] GISA executes `h` on the switch and `g` on the receiver. The AVERAGE of a vector can be formalized as follows:
[0091] $\mathrm{AVERAGE}(X) = g(h(X))$
[0092] $h(X \uplus Y) = h(X) + h(Y)$
[0093] $h(X) = (x, 1)$ for a single-element multiset $X = \{x\}$
[0094] $g(a, b) = a / b$
[0095] Here, the operator + is the standard component-wise sum of two pairs, i.e., $(x_1, y_1) + (x_2, y_2) = (x_1 + x_2, y_1 + y_2)$.
[0096] Another example of a decomposable aggregation function (but not a self-decomposable aggregation function) is the RANGE function, which is used to give the difference between the maximum and minimum values in a statistical set. It can be decomposed into a form similar to the one above and can also be used as an aggregation function in this method.
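One possible decomposition of RANGE, given purely as an illustration, carries a (minimum, maximum) pair through the switches and takes the difference at the receiver. The function names `h`, `merge`, and `g` mirror the notation above, but this particular decomposition is an assumption and is not spelled out in this application.

```python
def h(x):
    """Per-element step: a single value becomes a (min, max) pair."""
    return (x, x)


def merge(p, q):
    """Switch step: keep the smaller minimum and the larger maximum."""
    return (min(p[0], q[0]), max(p[1], q[1]))


def g(pair):
    """Receiver step: RANGE is the maximum minus the minimum."""
    lo, hi = pair
    return hi - lo


values = [7, 2, 9, 4]
acc = h(values[0])
for v in values[1:]:
    acc = merge(acc, h(v))
range_result = g(acc)
```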
[0097] Therefore, aggregation operations can be performed on switches along the path using aggregation functions. When data from multiple links arrives at the same switch, the switch aggregates them and then sends the aggregated result to the next node. This process can effectively avoid incast transmission problems and congestion at outgoing ports.
[0098] In one embodiment, if the aggregation type is Reduce, the receiver reassembles the ACK packet sequence into a feedback message, which is then transmitted by the controller to the sender's local agent as the aggregation result. If the aggregation type is Allreduce, the sender reassembles the payload of the ACK packet sequence into a feedback message, which is also transmitted by the controller to the sender's local agent as the aggregation result. The local agent returns the aggregation result to the application that initiated the target task via IPC.
[0099] It is worth noting that this method uses a simple First-Come, First-Served (FCFS) scheduling strategy to handle concurrent tasks. When the controller decides whether to execute a task, it checks all switches in the task's aggregation hierarchy. If any switch lacks N consecutive free aggregators, the task is suspended until resources become available. If every switch has N consecutive free aggregators and the controller decides to execute the task, these aggregators (forming a region identified by its Offset) are allocated to the task. The controller then installs rules on the switches that direct the task's traffic to that region.
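The allocation check described above can be sketched in Python. The `Switch` and `Controller` classes, their method names, and the strict head-of-line blocking of the queue are assumptions made for illustration; the text only specifies FCFS ordering and the requirement of N consecutive aggregators on every switch.

```python
from collections import deque

class Switch:
    def __init__(self, num_aggregators):
        self.free = [True] * num_aggregators
        self.rules = {}  # task_id -> (offset, n)

    def find_region(self, n):
        """Return the first offset with n consecutive free aggregators, or None."""
        run = 0
        for i, is_free in enumerate(self.free):
            run = run + 1 if is_free else 0
            if run == n:
                return i - n + 1
        return None

    def install(self, task_id, offset, n):
        for i in range(offset, offset + n):
            self.free[i] = False
        self.rules[task_id] = (offset, n)

    def remove(self, task_id):
        offset, n = self.rules.pop(task_id)
        for i in range(offset, offset + n):
            self.free[i] = True

class Controller:
    def __init__(self):
        self.pending = deque()  # FCFS queue of (task_id, switches, n)

    def submit(self, task_id, switches, n):
        self.pending.append((task_id, switches, n))
        return self.schedule()

    def schedule(self):
        """Start queued tasks in arrival order while every switch has room."""
        started = []
        while self.pending:
            task_id, switches, n = self.pending[0]
            offsets = [sw.find_region(n) for sw in switches]
            if any(off is None for off in offsets):
                break  # assumed: the head task waits and blocks later tasks
            for sw, off in zip(switches, offsets):
                sw.install(task_id, off, n)
            started.append(task_id)
            self.pending.popleft()
        return started

    def complete(self, task_id, switches):
        """Reclaim the task's regions, then retry suspended tasks."""
        for sw in switches:
            sw.remove(task_id)
        return self.schedule()

s1, s2 = Switch(8), Switch(8)
ctl = Controller()
first = ctl.submit("A", [s1, s2], 6)   # fits on both switches
second = ctl.submit("B", [s1], 4)      # s1 has only 2 free aggregators left
third = ctl.complete("A", [s1, s2])    # releasing A lets B start
```

In this sketch task B is suspended until task A completes and its region is reclaimed.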
[0100] Furthermore, during runtime, packets are matched to a region based on the task ID and mapped to an aggregator within that region by (region_Offset + seq_num). If the packet is a data packet, aggregation is performed; if it is an ACK packet, the aggregator is cleared. Upon task completion, the task agent notifies the controller. The controller reclaims the allocated aggregators by removing the corresponding rules from the involved switches. These idle aggregators can then be reassigned to other tasks by issuing new rules.
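The runtime lookup can be sketched as follows. `AggregatorTable` and its methods are hypothetical names, and SUM is assumed as the aggregation function; the index computation follows the (region_Offset + seq_num) rule stated above.

```python
class AggregatorTable:
    def __init__(self, size):
        self.slots = [None] * size  # None marks an empty aggregator
        self.regions = {}           # task_id -> region offset (installed rule)

    def install_rule(self, task_id, offset):
        self.regions[task_id] = offset

    def remove_rule(self, task_id):
        self.regions.pop(task_id, None)

    def on_packet(self, task_id, seq_num, is_ack, value=0):
        """Return the aggregator index used, or None if no rule matches."""
        offset = self.regions.get(task_id)
        if offset is None:
            return None
        index = offset + seq_num
        if is_ack:
            self.slots[index] = None      # an ACK clears the aggregator
        elif self.slots[index] is None:
            self.slots[index] = value     # first packet for this sequence number
        else:
            self.slots[index] += value    # assumed SUM aggregation
        return index

table = AggregatorTable(16)
table.install_rule(task_id=7, offset=4)
idx = table.on_packet(7, seq_num=2, is_ack=False, value=10)
table.on_packet(7, seq_num=2, is_ack=False, value=5)
merged = table.slots[idx]
table.on_packet(7, seq_num=2, is_ack=True)
after_ack = table.slots[idx]
table.remove_rule(7)
unmatched = table.on_packet(7, seq_num=2, is_ack=False, value=1)
```

Removing the rule is enough to release the region; the aggregator memory itself is never reallocated.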
[0101] This process does not require recompiling the data plane because the switch's in-memory aggregator table remains unchanged. Recompiling is only necessary when the administrator intends to scale the aggregator resources on the switch, such as increasing the number of aggregators to support more tasks or higher aggregation throughput, or adding new aggregation functionality.
[0102] Additionally, the aggregator region size N also affects two configurations on the host. First, the sequence number range should be [0, N-1]; if the message packet sequence is longer than N, the packets cycle through this sequence number range. Second, the sender's sliding window should be limited to no more than N. Together, these two configurations ensure that no two different in-flight packets of a sequence are mapped to the same aggregator, which would result in incorrect aggregation.
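A small Python sketch shows why the window must not exceed N. The helper names are hypothetical; the only assumption is that packet number p carries sequence number p mod N, as implied by the cycling rule above.

```python
def slots_in_flight(first_packet, window, region_size):
    """Sequence numbers used by `window` consecutive packets starting at first_packet."""
    return [(first_packet + i) % region_size for i in range(window)]

def has_collision(first_packet, window, region_size):
    """True if two in-flight packets would share one aggregator."""
    slots = slots_in_flight(first_packet, window, region_size)
    return len(set(slots)) < len(slots)

N = 4
safe = has_collision(first_packet=5, window=4, region_size=N)
unsafe = has_collision(first_packet=5, window=5, region_size=N)
wrapped = slots_in_flight(first_packet=6, window=4, region_size=N)
```

With a window of N the in-flight packets always occupy N distinct aggregators, even when the sequence numbers wrap around.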
[0103] Therefore, compared with existing INA solutions, this method is more versatile in three aspects. First, it is decoupled from the application and supports multiplexing of multiple applications. Second, the interface (Figure 3) is versatile, supporting a wider range of data formats and operations. Third, its deployment does not require assumptions about the network topology; aggregation can be performed on any switch within the topology.
[0104] In one embodiment, Figure 4 illustrates a multi-level synchronous aggregation in which servers H1, H2, H4, and H5 act as source nodes and server H6 acts as the destination node. Solid arrows indicate the direction of data packets, while dashed arrows represent various packet loss scenarios. If a data packet from H2 is lost (Scenario 1 in the figure), H2 retransmits the packet to compensate for the loss, retrying up to three times. Packet loss can also leave the senders' windows out of sync: for example, a loss at H2 prevents H1 from receiving an ACK, so H1 retransmits its data packet as well. In such cases, the sender sends a packet with a special flag that bypasses switch aggregation and obtains the result directly from the receiver. Additionally, if the aggregated data packet from S1 to S5 is lost, H1, H2, and H4 do not receive an ACK and retransmit their packets, and these retransmissions trigger S1 to resend the previous aggregation result to S5.
[0105] Because partial ACK packet loss is complex to handle, this method also provides a remedy for window asynchrony. If the sender still has not received the ACK after three retries, it sends a special packet (carrying the FRD flag). This special packet is unicast: it bypasses the aggregation logic of all switches, reaches the receiver, and triggers the receiver to reply with a unicast ACK (in the case of AllReduce, the ACK carries the result). The unicast ACK synchronizes the window of the lossy sender with the other windows, thus preventing the lossy sender from getting stuck.
[0106] Furthermore, since packet sequence numbers cycle within a fixed range, old packets (from lossy senders) and new packets (from lossless senders) may be mapped to the same aggregator and incorrectly aggregated. This method limits the window size to less than half of the aggregator region to prevent old and new packets from overlapping. However, halving the window size could halve the throughput, resulting in significant resource waste.
[0107] Furthermore, a 1-bit indicator is added to the switch aggregator to distinguish between new and old packets. The packet sequence is divided into batches, each containing as many packets as the aggregator region size N, so that batches alternate between even and odd. Each packet carries the parity of its batch in its header (the VER field, 0 for even, 1 for odd). Each aggregator also has a VER field, which is used to check whether the value held by the aggregator belongs to the same batch as the packet. With the VER field, the switch aggregation logic becomes as follows: when a data packet arrives, if the aggregator is empty (as identified by the PortBitmap), the aggregator accepts the packet; if the aggregator's VER matches the packet's VER, the aggregator processes the packet; otherwise the VER fields do not match and the packet is discarded. When an ACK packet arrives, the aggregator is cleared if the ACK's VER matches the aggregator's; otherwise the aggregator is not cleared. This solves the problem that aggregators might aggregate incorrect packets.
[0108] Furthermore, this method adds an RST flag to the data packet. Retransmitted packets have this bit set. If an RST packet arrives at an empty aggregator, it will be discarded directly; otherwise, it will follow the normal message processing logic. This solves the problem of retransmitted packets potentially causing memory leaks in the switch.
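The VER and RST checks described above can be combined into one decision function. The following Python sketch is illustrative only: the class and field names are hypothetical, SUM is assumed as the aggregation function, and the handling of a duplicate packet from a port whose bit is already set is an assumption that the text does not specify.

```python
from dataclasses import dataclass

@dataclass
class Aggregator:
    port_bitmap: int = 0  # 0 means the aggregator is empty
    ver: int = 0
    value: int = 0

@dataclass
class Packet:
    ver: int              # parity of the packet's batch
    port: int             # switch port the packet arrived on
    value: int = 0
    ack: bool = False
    rst: bool = False     # set on retransmitted packets

def batch_ver(packet_number, region_size):
    """Sender side: parity of the batch that a packet belongs to."""
    return (packet_number // region_size) % 2

def process(agg, pkt):
    """Apply one packet to an aggregator and return the action taken."""
    if pkt.ack:
        if agg.ver == pkt.ver:
            agg.port_bitmap, agg.value = 0, 0
            return "clear"
        return "keep"
    if agg.port_bitmap == 0:
        if pkt.rst:
            return "drop"  # retransmission must not occupy an empty aggregator
        agg.port_bitmap, agg.ver, agg.value = 1 << pkt.port, pkt.ver, pkt.value
        return "accept"
    if agg.ver != pkt.ver:
        return "drop"      # packet belongs to a different batch
    bit = 1 << pkt.port
    if agg.port_bitmap & bit:
        return "duplicate"  # assumed: already counted, do not add twice
    agg.port_bitmap |= bit
    agg.value += pkt.value
    return "merge"

N = 4
agg = Aggregator()
old, new = batch_ver(1, N), batch_ver(5, N)  # same slot, different batches
log = [
    process(agg, Packet(ver=old, port=0, value=5)),
    process(agg, Packet(ver=old, port=1, value=7)),
    process(agg, Packet(ver=new, port=2, value=1)),
    process(agg, Packet(ver=new, port=0, ack=True)),
]
value_before_ack = agg.value
log.append(process(agg, Packet(ver=old, port=0, ack=True)))
log.append(process(agg, Packet(ver=old, port=0, value=5, rst=True)))
```

Packets 1 and 5 map to the same aggregator when N is 4, and the VER bit keeps the newer batch from being merged into, or clearing, the older one.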
[0109] It is worth noting that source routing is used for data packets so that no new forwarding rules, which could interfere with existing routing, need to be introduced. During initialization, the switch calculates each path from a sender to the receiver in the aggregation hierarchy and translates each such path into the switch output ports along that path. The controller then notifies each sender agent of its path (in the format of a list of switch output ports).
[0110] During runtime, each data packet encodes the list of output ports for its path in its header. At each hop, the switch forwards the data packet according to the head of the list and then pops that head. Note that when multiple data packets are aggregated into one, their output port lists cannot be merged incorrectly, because they share the same hop list after the current switch.
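A minimal Python sketch of this per-hop behaviour follows. The dictionary packet layout and the `switch_hop` name are hypothetical, SUM is assumed as the aggregation function, and the HOP counter is omitted for brevity.

```python
def switch_hop(packets):
    """Aggregate packets that meet at one switch and forward the result.

    Returns (out_port, merged_packet). The head of the port list selects
    the output port; the merged packet carries the remaining list.
    """
    heads = {p["ports"][0] for p in packets}
    tails = {tuple(p["ports"][1:]) for p in packets}
    if len(heads) != 1 or len(tails) != 1:
        raise ValueError("packets meeting at a switch must share the remaining path")
    merged = {
        "ports": list(tails.pop()),                # head popped
        "value": sum(p["value"] for p in packets), # assumed SUM aggregation
    }
    return heads.pop(), merged

# Two senders under the same leaf switch; both paths continue via ports 3, then 1.
h1 = {"ports": [3, 1], "value": 2}
h2 = {"ports": [3, 1], "value": 5}
out_port, merged = switch_hop([h1, h2])
next_port, delivered = switch_hop([merged])
```

Because both inputs share the same remaining hop list, the merged packet needs only one copy of it.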
[0111] Routing learning for ACK packets: the PortBitmap in the aggregator is reused for ACK routing. The PortBitmap is configured with as many bits as the switch has ports, and each bit is associated with one switch port. Therefore, a 1 bit in the bitmap indicates not only that a packet from a child node has arrived, but also which physical switch port the packet arrived on.
[0112] During runtime, as a packet sets its bits in the PortBitmap, its incoming switch port is also learned. When its ACK packet arrives, its aggregator's PortBitmap is acquired; the switch finds all the outgoing ports and copies the ACK packet to them. The switch then initializes all bits of the aggregator's PortBitmap to 0 and waits for the next batch of packets or other tasks to reuse it.
[0113] Therefore, "bitmap full" does not mean that it is all 1s, but rather that the number of 1s in the bitmap is equal to the CN value in the packet header.
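The PortBitmap behaviour can be sketched with plain integer bit operations. The function names are hypothetical; `cn` stands for the CN value carried in the packet header.

```python
def set_port(bitmap, in_port):
    """Record that a packet arrived on in_port."""
    return bitmap | (1 << in_port)

def is_full(bitmap, cn):
    """The bitmap is full when the number of 1 bits equals the fan-in CN."""
    return bin(bitmap).count("1") == cn

def ack_out_ports(bitmap, num_ports):
    """Ports to which an arriving ACK is copied (the learned incoming ports)."""
    return [p for p in range(num_ports) if (bitmap >> p) & 1]

bitmap = 0
bitmap = set_port(bitmap, 0)
partial = is_full(bitmap, cn=2)
bitmap = set_port(bitmap, 2)
complete = is_full(bitmap, cn=2)   # full with 2 of 4 bits set
targets = ack_out_ports(bitmap, num_ports=4)
bitmap = 0                         # reset after the ACK has been replicated
```

On a 4-port switch with two child nodes, the bitmap is full as soon as two bits are set, and the ACK is replicated only to those two ports.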
[0114] In one embodiment, as shown in Figures 4 and 5, the design uses an FPGA device to act as the network switch, implementing both the packet forwarding logic and GISA's INA logic. In the implementation, the agent is built on top of DPDK and comprises 1200 lines of C++ code, the controller comprises 800 lines of C++ code, and the switch comprises 1300 lines of Verilog code. Figure 4 shows the packet format. GISA minimizes packet header overhead by replacing the TCP/IP header and compressing the GISA header. The GISA header contains fields such as TaskID, SN (Sequence Number), ACK, VER, ECN, RST, and FRD, which have been explained above. ERR records errors that occur during the aggregation operation, such as addition overflow. FIN indicates the last packet in the transmission sequence. PLD specifies whether the ACK packet should carry the result (for AllReduce). The HOP field and the (OP, CN) tuples are used for source routing, where HOP represents the number of remaining hops, OP represents the output switch port, and CN represents the current switch's fan-in degree, i.e., the number of child nodes. The packet payload can be formatted according to the application, and GISA installs switch rules to specify how it is parsed and aggregated. Figure 5 shows the data structure of the aggregator, which includes PortBitmap, VER, ECN, ERR, and Payload as specified in the design.
[0115] In one embodiment, GISA is implemented on an FPGA-based test platform to evaluate its performance and advantages in various distributed applications. The test platform comprises five Intel Arria 10 FPGA devices and nine workstations. The five FPGA devices are connected into a Layer 2 network: each device has four 10GbE ports, one device acts as the backbone switch, and the other four act as leaf switches, with the backbone switch connected to each leaf switch via a physical link. All FPGA devices are mounted on a workstation equipped with an Intel Xeon Platinum 8124M CPU, 128GB RAM, and a 500GB SSD, which serves as the GISA controller. The remaining eight workstations are connected to the leaf switches via physical links, two per leaf switch. These workstations run the GISA Agent and are equipped with Intel Core i9-13900K CPUs, 64GB RAM, 500GB SSDs, NVIDIA GeForce RTX 2080Ti GPUs, and Intel 82599 10GbE network cards. All workstations run Ubuntu 20.04.6 with kernel 5.15.0-76.
[0116] 1) Throughput and latency: As shown in Figure 7, 512 aggregators are pre-compiled for GISA on each FPGA device, each with a maximum packet payload capacity of 1024 bytes. Figure 6 shows that increasing the number of aggregators leads to higher GISA throughput. This is because more aggregators allow source nodes to inject more packets into the network, thus reducing network idle time. However, when throughput approaches the hardware's performance limits, allocating more aggregators provides only marginal utility. For example, for a 1024-byte payload, throughput stops increasing linearly once more than 192 aggregators are allocated, peaking at 10.02 Gbps. This is because the bottleneck limiting throughput shifts from the number of packets in transit to the hardware processing power. Therefore, injecting more packets into the network does not further improve throughput but instead leads to more packet loss.
[0117] 2) Reliability: The impact of the packet loss rate on throughput is shown in Figure 8, with 7 source nodes and 1 destination node. Packets are randomly dropped at the input port of each node with a specified probability. Because packet loss forces GISA to reduce its sending window size, an increase in the packet loss rate reduces its throughput. GISA's throughput gradually decreases from 9.8 Gbps to around 7.0 Gbps as the packet loss rate increases from 0% to 1%, while the throughput of unicast remains below 1.8 Gbps. Furthermore, GISA can prevent aggregation tasks from being interrupted by triggering its timeout retransmission logic. Therefore, even in unreliable network environments, it maintains a significant advantage over the unicast method.
[0118] 3) Multi-source and multi-tasking: A major advantage of GISA is its ability to transmit data at line rate, regardless of the number of source nodes. Figure 9 (a) presents the throughput performance under different numbers of source nodes. Specifically, GISA uses INA to transmit data, ensuring that the same amount of data is transmitted on all links along the transmission path. Therefore, GISA's throughput is not affected by traffic intersections while still maintaining its maximum line rate. In contrast, unicast throughput decreases rapidly with increasing number of source nodes. This demonstrates that GISA can provide substantial benefits compared to traditional transmission methods, especially when handling large-scale communication groups.
[0119] Figure 9 (b) demonstrates the performance of GISA in handling multiple concurrent tasks, with the number of tasks ranging from 1 to 4, and aggregators evenly distributed among these tasks. Each transmission involves 100MB of data from 7 source nodes to one destination node. Compared to unicast communication, GISA significantly reduces task completion time. Notably, GISA also supports aggregator allocation based on intelligent scheduling policies, which is crucial for applications with various QoS requirements, such as deadlines. Our future work includes exploring GISA's scheduling policies.
[0120] 4) Overhead: In addition, a fat tree topology with 1024 hosts was simulated, which helps to evaluate the routing overhead of GISA in large-scale networks. Figure 10 (a) illustrates the switch state overhead required for the aggregation task. The GISA-naive method relies on forwarding rules to determine packet routes, resulting in a significant increase in switch state as the number of source nodes increases. To construct transmission paths, it publishes numerous routing rules specifying the next hop for each packet. In contrast, the GISA-optimized method utilizes the PortBitmap field in the aggregator to guide the output port of ACK packets and leverages source routing for packet forwarding. Therefore, it significantly reduces switch state consumption. Specifically, with 640 source nodes, the number of entries on the GISA-optimized switch is reduced by 6.4 times compared to the unoptimized GISA.
[0121] Another significant advantage of GISA lies in its ability to reduce traffic. The overall network traffic overhead was assessed by varying the number of source nodes, with each source node transmitting 100MB of data. As shown in Figure 10(b), GISA utilizes switches to aggregate relevant traffic along the transmission path, reducing traffic by up to 69.3% with 5 source nodes and by up to 3.78 times with 640 source nodes. This highlights the potential of GISA to significantly reduce network traffic, which can benefit applications involving a large number of communication nodes.
[0122] 5) Distributed Training: Three typical training models were selected to evaluate the performance of GISA in distributed training (DT) applications, verifying its general applicability. The results are shown in Figure 11. For the VGG16 and AlexNet models (Figure 11(a) and Figure 11(b)), the unicast communication mode exhibits a significant performance degradation as the number of worker nodes increases. ATP achieves higher throughput than unicast by aggregating gradient data on leaf switches, but it does not utilize the backbone (spine) switches for further gradient data aggregation. This leads to a decrease in ATP's training throughput due to congestion at the backbone switches. In contrast, GISA can efficiently aggregate data using switches at each layer along the transmission path. Therefore, GISA's training throughput is almost unaffected by the increase in the number of worker nodes. Specifically, for the AlexNet model (Figure 11(b)), GISA increases throughput by 27.3% compared to ATP and, with 7 worker nodes, by 83.6% compared to unicast.
[0123] For ResNet50 (Figure 11(c)), the performance differences between these methods are not significant because the model is computationally intensive. It is therefore difficult for a network communication optimization such as GISA to achieve a significant improvement in training performance.
[0124] 6) Distributed Storage: To repair a faulty block in an erasure-coded storage system, the traditional method is to retrieve multiple related blocks over the network and restore the faulty block on a single repair node. However, this method often leads to inbound link congestion at the repair node, resulting in high latency for degraded reads. The repair time increases with the coding parameter k of RS(k,m). To address this issue, the state-of-the-art repair pipelining (RP) method splits the block-level repair operation into concurrent sub-operations on sub-blocks, thus effectively avoiding congestion caused by incast transmission. As shown in Figure 12(a), GISA can achieve performance comparable to or even higher than RP. This is mainly because the switch can process packets at line rate, thus achieving higher throughput than aggregation performed on the host.
[0125] While RP can alleviate congestion and reduce repair time, a significant drawback is that it still generates high traffic. We further evaluated this overhead in a simulated fat-tree network with 1024 hosts, using coding parameters that involve more blocks (such as RS(9,3) and RS(12,4)). As shown in Figure 12(b), as the value of the coding parameter k increases, the number of blocks required to repair a faulty block also increases, leading to a significant increase in traffic. However, GISA can effectively mitigate this problem by aggregating coded blocks in the network. When k=12, RP requires 8.01GB of network traffic to repair a faulty block, while GISA requires only 4.39GB, a traffic reduction of up to 45.19%.
[0126] 7) Network Monitoring: Sketch-based network monitoring has become a prominent research area in recent years, and its structure is well-suited for synchronous aggregation. By using the MAX operation, we can aggregate the CMS data obtained from various monitoring nodes and then transmit the aggregated data to the collector. However, in sizable clusters, significant surges in traffic can lengthen the transmission time required to collect these results, as Figure 13(a) illustrates for the unicast method. High traffic can also severely impact other network operations and services.
[0127] GISA can effectively reduce traffic while lowering CMS collection latency. Figure 13 (a) shows the completion time for collecting CMS data from different numbers of monitoring nodes. Notably, the transmission completion time for unicast communication mode increases significantly with the number of monitoring nodes. In contrast, GISA remains almost unaffected, with its completion time remaining relatively stable. This indicates that GISA can effectively serve monitoring tasks in large-scale distributed clusters.
[0128] Figure 13 (b) further illustrates the impact of CMS size on transmission completion time when there are 7 monitoring nodes. When the CMS size increases from 2MB to 16MB, GISA's completion time only increases slightly, while the completion time difference between GISA and Unicast increases by nearly 9.07 times. These two figures confirm GISA's even more significant advantage in handling monitoring tasks with more nodes and larger CMS sizes.
[0129] In summary, even in large networks, GISA maintains its maximum line rate (approximately 10Gbps), significantly reducing traffic by about 3.78 times compared to Unicast. Furthermore, GISA is suitable for various application scenarios and achieves performance acceleration with acceptable system overhead compared to state-of-the-art methods.
[0130] It should be understood that, although the steps in the flowchart of Figure 2 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless otherwise specified herein, there is no strict order in which these steps are executed, and they can be performed in other orders. Moreover, at least some of the steps in Figure 2 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these sub-steps or stages is not necessarily sequential, but can be executed in turn or alternately with other steps or at least some of the sub-steps or stages of other steps.
[0131] In one embodiment, as shown in Figure 14, a general intra-network synchronization aggregation system for distributed applications is provided, including: an aggregation task request acquisition module 1402, an aggregation rule acquisition module 1404, an aggregator matching module 1406, and an aggregation module 1408, wherein:
[0132] The aggregation task request acquisition module 1402 is used to acquire the aggregation task requests of the application.
[0133] The aggregation rule acquisition module 1404 is used to determine the aggregator resources of the target task and the execution order of the target task according to the aggregation task request and the preset scheduling strategy. The controller allocates an isolation region for each aggregator resource and sets the offset rule corresponding to the isolation region. The isolation region and the offset rule corresponding to the isolation region are written into the aggregation table of the controller to obtain the aggregation rule.
[0134] The aggregator matching module 1406 is used by the switch to receive the data packet sequence sent by the sender of the target task according to the execution order and aggregation rules, locate the data packet sequence in the aggregation table, and obtain the aggregator that matches the data packet sequence.
[0135] The aggregation module 1408 is used to merge data packet sequences through an aggregator to obtain result data packets, and send the result data packets to the receiver of the target task. The receiver replies with an ACK message sequence based on the result data packets and synchronously transmits them back to the sender via multicast from the switch.
[0136] Specific limitations regarding the general intra-network synchronization aggregation system for distributed applications can be found in the limitations of the general intra-network synchronization aggregation method for distributed applications described above, and will not be repeated here. Each module in the aforementioned general intra-network synchronization aggregation system for distributed applications can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call and execute the corresponding operations of each module.
[0137] In one embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in Figure 15. The computer device includes a processor, memory, network interface, display screen, and input devices connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The network interface is used to communicate with external terminals via a network connection. When the computer program is executed by the processor, it implements a general intra-network synchronization aggregation method for distributed applications. The display screen can be an LCD screen or an e-ink screen. The input devices can be a touch layer covering the display screen, buttons, a trackball, or a touchpad mounted on the computer device casing, or an external keyboard, touchpad, or mouse.
[0138] Those skilled in the art will understand that the structures shown in Figures 14-15 are merely block diagrams of the portions related to the present application and do not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figures, or combine certain components, or have different component arrangements.
[0139] In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program, the processor executing the computer program to perform the following steps:
[0140] Acquire the application's aggregation task request.
[0141] Based on the aggregation task request and the preset scheduling strategy, the aggregator resources of the target task and the execution order of the target task are determined. The controller allocates an isolation region for each aggregator resource and sets the offset rule corresponding to the isolation region. The isolation region and the corresponding offset rule are written into the aggregation table of the controller to obtain the aggregation rule.
[0142] The switch receives the data packet sequence sent by the sender of the target task according to the execution order and aggregation rules, and locates the data packet sequence in the aggregation table to obtain the aggregator that matches the data packet sequence.
[0143] The data packet sequence is merged by an aggregator to obtain the result data packet, which is then sent to the receiver of the target task. The receiver replies with an ACK message sequence based on the result data packet and synchronously transmits it back to the sender via multicast from the switch.
[0144] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.
[0145] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0146] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A general intra-network synchronization aggregation method for distributed applications, characterized in that, The method is applied to a common network architecture for shared clusters of distributed applications; the method includes: Retrieve the application's aggregate task requests; Based on the aggregated task request and the preset scheduling strategy, the aggregator resources of the target task and the execution order of the target task are determined. An isolation region is allocated to each aggregator resource by the controller, and the offset rule corresponding to the isolation region is set. The isolation region and the offset rule corresponding to the isolation region are written into the aggregation table of the controller to obtain the aggregation rule. The switch receives the data packet sequence sent by the sender of the target task according to the execution order and the aggregation rules, and locates the data packet sequence in the aggregation table to obtain the aggregator that matches the data packet sequence; The aggregator merges the data packet sequences to obtain a result data packet, which is then sent to the receiver of the target task. The receiver replies with an ACK message sequence based on the result data packet and synchronously transmits it back to the sender via multicast from the switch.
2. The method according to claim 1, characterized in that, Prior to the step of obtaining the application's aggregate task request, the following is also included: In the general network architecture of a shared cluster for distributed applications, a communication path from the sender to the receiver of a target task is generated according to a routing protocol. The server, the controller, and the switch corresponding to the application form an aggregated hierarchical structure based on the communication path. Multiple applications send aggregation task requests to local agents, and multiple local agents send the aggregation task requests to the controller in parallel.
3. The method according to claim 2, characterized in that, Based on the aggregated task request and the preset scheduling strategy, the aggregator resources of the target task and the execution order of the target task are determined. An isolation region is allocated to each aggregator resource by the controller, and an offset rule corresponding to the isolation region is set. The isolation region and the corresponding offset rule are written into the aggregation table of the controller to obtain the aggregation rules, including: Based on the aggregated task request and the controller's preset scheduling policy, the controller determines the aggregator resources of the target task and the execution order of the target task. The controller sets up an execution task switch for the target task according to the execution order. On the execution task switch, the controller allocates an isolation region for each aggregator resource and sets the offset rule for the isolation region. The controller writes the isolation region and the offset rule corresponding to the isolation region into the aggregation table to obtain the aggregation rule for the target task.
4. The method according to claim 3, characterized in that, before the step of the switch receiving the data packet sequence sent by the sender of the target task according to the execution order and the aggregation rule and addressing the aggregation table with the data packet sequence to obtain the aggregator that matches the data packet sequence, the method further comprises:
dividing, by the sender of the target task, the data block of the target task into a data packet sequence, and sending the data packet sequence to the switch within a window maintained by the sender.
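The sender behaviour in claim 4 can be pictured as follows. This is a minimal sketch, not the patented implementation: `Packet`, `packetize`, and `SenderWindow` are hypothetical names, the payload is assumed to be a list of integers, and the window is assumed to advance on cumulative acknowledgements, which the claims do not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Packet:
    task_id: int
    seq_num: int
    payload: List[int]


def packetize(task_id: int, block: List[int], payload_len: int) -> List[Packet]:
    """Split a data block into fixed-size packets numbered from 0."""
    return [
        Packet(task_id, seq, block[start:start + payload_len])
        for seq, start in enumerate(range(0, len(block), payload_len))
    ]


class SenderWindow:
    """Sender-side window: at most window_size packets may be unacknowledged."""

    def __init__(self, packets: List[Packet], window_size: int) -> None:
        self.packets = packets
        self.window_size = window_size
        self.base = 0      # lowest sequence number not yet acknowledged
        self.next_seq = 0  # next sequence number to transmit

    def sendable(self) -> List[Packet]:
        """Return the packets that fit in the window and have not been sent."""
        limit = min(self.base + self.window_size, len(self.packets))
        out = self.packets[self.next_seq:limit]
        self.next_seq = max(self.next_seq, limit)
        return out

    def ack(self, seq_num: int) -> None:
        """Assumed cumulative ACK: slide the window past seq_num."""
        if seq_num >= self.base:
            self.base = seq_num + 1
```

A block of ten values with three values per packet yields four packets. With a window of two, the sender first emits packets 0 and 1, then waits; acknowledging packet 0 releases packet 2.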
5. The method according to claim 4, characterized in that the step of the switch receiving the data packet sequence sent by the sender of the target task according to the execution order and the aggregation rule and addressing the aggregation table with the data packet sequence to obtain the aggregator that matches the data packet sequence comprises:
receiving, by the switch and according to the execution order, the data packet sequence sent by the sender of the target task, and addressing the aggregation table according to the sequence number of the data packet sequence and the offset rule:
Aggregator.index ← packet.seq_num + Offset
wherein Aggregator.index is the index of the isolation region, packet.seq_num is the sequence number of the data packet sequence, and Offset is the offset rule; and
obtaining the aggregator within the isolation region that corresponds to the sequence number of the data packet sequence.
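The addressing rule of claim 5 is a single addition, which the sketch below writes out. The `IsolationRegion` type, its `size` field, and the bounds check are assumptions added for illustration; the claim itself only states Aggregator.index ← packet.seq_num + Offset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IsolationRegion:
    task_id: int
    offset: int  # "Offset" in the claim
    size: int    # number of aggregators in the region (assumed field)


def aggregator_index(region: IsolationRegion, seq_num: int) -> int:
    """Apply the claim's rule: index = seq_num + Offset."""
    index = seq_num + region.offset
    # Assumed safeguard: refuse indices that fall outside the task's own region,
    # so one task cannot touch another task's aggregators.
    if not (region.offset <= index < region.offset + region.size):
        raise IndexError(f"seq_num {seq_num} falls outside the isolation region")
    return index
```

For a region with offset 64 and size 8, sequence number 3 maps to aggregator 67, while sequence number 8 is rejected because it would land in a neighbouring region.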
6. The method according to claim 5, characterized in that the step of the aggregator merging the data packet sequence to obtain a result data packet and sending the result data packet to the receiver of the target task, the receiver replying with an ACK message sequence based on the result data packet and the ACK message sequence being synchronously returned to the sender through multicast by the switch comprises:
merging, by the aggregator, the data packet sequences that carry the same message sequence number to obtain a result data packet, and sending the result data packet to the receiver of the target task;
replying, by the receiver, with an ACK message sequence based on the result data packet; and
when the ACK message sequence arrives at the switch, clearing the aggregator corresponding to each target task according to the number of target tasks and the switch group formed by the aggregation hierarchy, and sending the ACK message sequence back to the sender corresponding to the target task.
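The merge-then-clear cycle of claim 6 can be simulated on a single switch. This is a sketch under several assumptions: the class `Switch` and its methods are hypothetical, the aggregation function is assumed to be element-wise sum, an aggregator is considered complete once `fan_in` packets with the same sequence number have arrived, and duplicate or lost packets are not handled.

```python
from typing import Dict, List, Optional, Tuple


class Switch:
    """Hypothetical switch holding aggregator slots for one task."""

    def __init__(self, offset: int, fan_in: int) -> None:
        self.offset = offset  # offset rule of the task's isolation region
        self.fan_in = fan_in  # number of senders expected per sequence number
        self.slots: Dict[int, dict] = {}  # aggregator index -> partial state

    def on_data(self, seq_num: int, payload: List[int]) -> Optional[List[int]]:
        """Merge one packet; return the result payload once all senders arrived."""
        index = seq_num + self.offset
        slot = self.slots.setdefault(index, {"sum": [0] * len(payload), "seen": 0})
        slot["sum"] = [a + b for a, b in zip(slot["sum"], payload)]
        slot["seen"] += 1
        if slot["seen"] == self.fan_in:
            return list(slot["sum"])  # forwarded to the receiver as the result packet
        return None

    def on_ack(self, seq_num: int, senders: List[str]) -> List[Tuple[str, int]]:
        """Clear the aggregator and fan the ACK out to every sender."""
        self.slots.pop(seq_num + self.offset, None)
        return [(sender, seq_num) for sender in senders]
```

With two senders, the first packet for sequence number 0 produces no output, the second produces the summed payload, and the slot stays occupied until the receiver's ACK passes back through the switch and frees it for reuse.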
7. The method according to any one of claims 1 to 6, characterized in that the aggregation task request includes: a target task ID, a sender ID of the target task, a receiver ID of the target task, an aggregation function, and an aggregation type, wherein the aggregation type is Reduce or Allreduce.
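The request fields listed in claim 7 map naturally onto a small record type. The sketch below is illustrative only: the field names, the use of strings for IDs and for the aggregation function, and the choice to allow several sender IDs per task are assumptions, since the claim lists the fields without giving their types.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class AggregationType(Enum):
    REDUCE = "Reduce"
    ALLREDUCE = "Allreduce"


@dataclass(frozen=True)
class AggregationTaskRequest:
    task_id: int
    sender_ids: Tuple[str, ...]        # assumed: a task may have several senders
    receiver_id: str
    aggregation_function: str          # e.g. "sum"; the encoding is an assumption
    aggregation_type: AggregationType


# Example request as an application might hand it to its local agent.
request = AggregationTaskRequest(
    task_id=7,
    sender_ids=("worker-0", "worker-1"),
    receiver_id="ps-0",
    aggregation_function="sum",
    aggregation_type=AggregationType.ALLREDUCE,
)
```

Using an enumeration for the aggregation type keeps the two permitted values explicit, so an unsupported type fails when the request is built rather than at the switch.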
8. The method according to claim 7, characterized in that, after the step of the receiver replying with an ACK message sequence based on the result data packet and the ACK message sequence being synchronously returned to the sender through multicast by the switch, the method further comprises:
if the aggregation type is Reduce, reassembling, by the receiver, the ACK message sequence into a feedback message, the feedback message being transmitted by the controller to the sender's local agent as the aggregation result;
if the aggregation type is Allreduce, reassembling, by the sender, the payload of the ACK message sequence into a feedback message, the feedback message being transmitted by the controller to the sender's local agent as the aggregation result; and
returning, by the local agent, the aggregation result via IPC to the application that launched the target task.
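The reassembly step of claim 8 can be sketched as ordering ACK payloads by sequence number and concatenating them. The function names and the representation of an ACK as a `(seq_num, payload)` pair are hypothetical; the claim states only which endpoint reassembles for each aggregation type.

```python
from typing import Iterable, List, Tuple


def reassembling_endpoint(aggregation_type: str) -> str:
    """Return which endpoint builds the feedback message, per claim 8."""
    if aggregation_type == "Reduce":
        return "receiver"
    if aggregation_type == "Allreduce":
        return "sender"
    raise ValueError(f"unsupported aggregation type: {aggregation_type}")


def reassemble(acks: Iterable[Tuple[int, List[int]]]) -> List[int]:
    """Concatenate ACK payloads in sequence-number order (assumed layout)."""
    message: List[int] = []
    for _seq_num, payload in sorted(acks, key=lambda ack: ack[0]):
        message.extend(payload)
    return message
```

ACKs may arrive out of order, so sorting by sequence number before concatenation restores the original block layout: `reassemble([(1, [3, 4]), (0, [1, 2])])` returns `[1, 2, 3, 4]`.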
9. A general intra-network synchronization aggregation system for distributed applications, characterized in that the system comprises:
an aggregation task request acquisition module, configured to acquire an aggregation task request from an application;
an aggregation rule acquisition module, configured to determine the aggregator resources of a target task and the execution order of the target task according to the aggregation task request and a preset scheduling policy, wherein a controller allocates an isolation region to each aggregator resource and sets an offset rule corresponding to the isolation region, and the isolation region and its corresponding offset rule are written into an aggregation table of the controller to obtain an aggregation rule;
an aggregator matching module, configured to receive, at a switch and according to the execution order and the aggregation rule, a data packet sequence sent by a sender of the target task, and to address the aggregation table with the data packet sequence to obtain the aggregator that matches the data packet sequence; and
an aggregation module, configured to merge the data packet sequence through the aggregator to obtain a result data packet and send the result data packet to a receiver of the target task, wherein the receiver replies with an ACK message sequence based on the result data packet, and the ACK message sequence is synchronously returned to the sender through multicast by the switch.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 8.