A method and system for collective communication based on RoCE

By constructing a hierarchical logical topology tree, decomposing collective operations into multiple stages of point-to-point RDMA communication tasks, and combining dynamic credit pools with pseudo-random delay scheduling, the method addresses network congestion and topology unawareness in RoCE collective communication, improving communication efficiency and stability.

CN122285584A · Pending · Publication Date: 2026-06-26 · CHINA ACADEMY OF INFORMATION & COMM

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHINA ACADEMY OF INFORMATION & COMM
Filing Date
2026-03-09
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing RoCE collective communication technologies suffer from network congestion, topology unawareness, inefficient resource management, and a lack of adaptive capability, resulting in low communication efficiency.

Method used

By acquiring the delay and bandwidth matrices between nodes, a hierarchical logical topology tree is constructed, and the global collective communication operation is decomposed into multiple stages of point-to-point RDMA communication tasks. A dynamic credit pool gates each sender's RDMA writes, a sliding window mechanism maintains concurrent execution, and pseudo-random delay scheduling together with RDMA atomic operations provides fine-grained control.

Benefits of technology

It significantly reduces collective communication latency and CPU overhead, improves communication stability and scalability in large-scale clusters, optimizes cross-level communication efficiency, and effectively avoids Incast congestion.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention provides a RoCE-based collective communication method, comprising: acquiring a delay matrix and a bandwidth matrix between nodes, and constructing a hierarchical logical topology tree based on them; decomposing the global collective communication operation into multiple stages of point-to-point RDMA communication tasks based on the logical topology tree, and planning a physical path for each task; establishing a dynamic credit pool at each receiving node, with each sending node initiating an RDMA write operation only after obtaining credit from the corresponding receiving node; maintaining concurrent execution of multiple RDMA write operations through a sliding window mechanism; and, after the coordinating node detects that all nodes have completed their RDMA write operations, broadcasting the next-stage instruction through an RDMA write-with-immediate operation, the synchronization counter of the coordinating node being updated through RDMA atomic operations issued by each node. The invention makes full use of RoCE RDMA features, significantly reducing collective communication latency and CPU overhead and improving communication stability and scalability in large-scale clusters.

Description

TECHNICAL FIELD

[0001] The present application relates to the field of high-performance and distributed computing, and in particular to a RoCE-based collective communication method and system.

BACKGROUND

[0002] With the rapid development of large-scale AI model training, scientific computing simulation, and big data analysis, distributed computing clusters continue to grow, and collective communication operations such as All-Reduce, Broadcast, and All-Gather have become a core bottleneck of distributed parallel computing. Traditional collective communication libraries built on TCP/IP (such as MPI) suffer from repeated data copying, high CPU overhead, and high communication latency.

[0003] Although InfiniBand networks provide native RDMA support, they are costly and incompatible with general data center Ethernet. RoCE implements RDMA over standard Ethernet and has become an important option for high-performance computing. However, collective communication over RoCE faces the following challenges:
1. Network congestion: Incast traffic generated by large-scale collective communication easily overflows Ethernet switch buffers, causing packet loss and retransmission.
2. Topology unawareness: traditional collective communication algorithms do not fully consider the multi-level network topology of the data center, resulting in low cross-level communication efficiency.
3. Inefficient resource management: frequent small transfers incur large memory registration / deregistration overhead.
4. Lack of adaptive capability: a static communication pattern cannot adapt to a dynamically changing network state.

[0004] Existing technologies such as NVIDIA's SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) implement in-network computing but require dedicated switch support. Alibaba Cloud's ACCL library optimizes RoCE collective communication but still lacks fine-grained congestion control and has limited topology adaptability.

SUMMARY

[0005] The present application provides a RoCE-based collective communication method, which aims to solve the lack of fine-grained congestion control, the topology unawareness, and the inefficient resource management of existing technologies. The method includes: obtaining the delay matrix and bandwidth matrix between nodes, and constructing a hierarchical logical topology tree according to the delay matrix and bandwidth matrix; based on the logical topology tree, decomposing the global collective communication operation into multiple stages of point-to-point RDMA communication tasks, and planning a physical path for each task; establishing a dynamic credit pool at each receiving node, the sending node initiating an RDMA write operation after obtaining credit from the corresponding receiving node; keeping a plurality of RDMA write operations concurrent through a sliding window mechanism; and, after the coordinating node detects that all nodes have completed the RDMA write operation, broadcasting the next-stage instruction through an RDMA write-with-immediate operation, the synchronization counter of the coordinating node being updated through the RDMA atomic operation of each node.

[0006] Optionally, constructing the hierarchical logical topology tree from the delay matrix and the bandwidth matrix comprises: constructing, at each node, an initial physical neighbor connection graph based on the obtained LLDP information; performing end-to-end probing between nodes based on the physical connection graph to obtain a baseline communication delay matrix L and an effective bandwidth matrix B; dividing the nodes into a plurality of domains based on the baseline communication delay matrix L and the effective bandwidth matrix B; and constructing the hierarchical logical topology tree based on the domains.

[0007] Optionally, the baseline communication delay matrix L, the effective bandwidth matrix B and the logical topology tree are cached in the memory of each node and kept updated.

[0008] Optionally, decomposing the global collective communication operation into a plurality of stages of point-to-point RDMA communication tasks based on the logical topology tree, and planning a physical path for each task, comprises: at one level of the logical topology tree, dividing the node set into two subsets of approximately equal size; within each subset, independently performing a Reduce-Scatter operation; between the two subsets, performing pairwise point-to-point Reduce; after the point-to-point Reduce, independently performing an All-Gather operation within each subset, thereby generating the point-to-point RDMA tasks of the current level; and, for each point-to-point RDMA task of the current level, selecting a physical path based on the logical topology tree.

[0009] Optionally, a dynamic credit pool is established at each receiving node, and the sending node initiates the RDMA write operation after obtaining the credit from the corresponding receiving node, comprising: Each receiving node maintains a credit value for each sending node (or each group of sending nodes), and the credit value is notified to the sending node through an update message; Before initiating the RDMA write operation, the sending node must send a lightweight "credit request" message to the receiving node; The receiving node replies with a "credit grant" message based on its own remaining buffer space and the global congestion status. The "credit grant" message contains the number of data blocks that can be sent. The credit value is decremented by 1 for each data block sent by the sending node or each data block received by the receiving node.

[0010] Optionally, when multiple sending nodes send data to a receiving node, a basic offset delay and a pseudo-random delay are calculated for each sending node. The basic offset delay is set in layers according to the number of hops of the sending node in the logical topology, and the pseudo-random delay is generated within a time window based on the node ID and communication round hash. If network congestion is detected, the time window for pseudo-random delays is dynamically expanded.

[0011] Optionally, maintaining concurrent execution of multiple RDMA write operations via a sliding window mechanism includes: The total data to be transmitted is divided into multiple data blocks of a predetermined size; W data block tasks are submitted to the RNIC work queue through W asynchronous RDMA operations, where W is the pipeline window size, which is preset. After the data block task operation in the RNIC's work queue is completed, the buffer is released to obtain the next set of W data blocks until the total data transmission is completed.

[0012] Optionally, broadcasting the next-stage instruction through the RDMA write-with-immediate operation after the coordinating node detects that all nodes have completed the RDMA write operation includes: pre-registering a synchronization counter memory region on the coordinating node; after completing its own data transmission task, each participating node performing an RDMA atomic operation on the synchronization counter address of the coordinating node, incrementing the synchronization counter by 1; the coordinating node determining, based on the value of the synchronization counter, whether all nodes have completed the RDMA write operation; and, after determining that all nodes have completed the RDMA write operation, the coordinating node issuing an RDMA write-with-immediate operation that causes each receiving node's RNIC to immediately generate a completion event with immediate data, which carries a stage identifier.

[0013] Optionally, during a communication session, nodes periodically exchange lightweight heartbeat messages. If the sending node does not receive a heartbeat or credit confirmation from the peer within the expected time, or if an RDMA operation to the peer fails, the corresponding receiving node is marked as faulty. The system uses the logical topology tree to find a node to take over the communication tasks of the fault-marked node.

[0014] This invention also provides a RoCE-based collective communication system, comprising: multiple nodes, each equipped with at least one RoCE-enabled smart network interface card, main memory, and a multi-core processor, the nodes being connected via Ethernet; and a communication management module residing on each node or on a central management node, comprising: a topology management unit, used to acquire the delay matrix and bandwidth matrix between nodes and construct a hierarchical logical topology tree based on them; a task planning unit, used to decompose the global collective communication operation into multiple stages of point-to-point RDMA communication tasks based on the logical topology tree and plan a physical path for each task; a scheduling unit, used to establish a dynamic credit pool at each receiving node, so that the sending node initiates an RDMA write operation after acquiring credit from the corresponding receiving node; an RDMA operation engine unit, used to maintain the concurrent execution of multiple RDMA write operations through a sliding window mechanism; and a synchronization unit, used to broadcast the next-stage instruction through an RDMA write-with-immediate operation after the coordinating node detects that all nodes have completed the RDMA write operation, the synchronization counter of the coordinating node being updated through the RDMA atomic operations of each node.

[0015] The RoCE-based collective communication method and system provided by this invention make full use of RoCE RDMA features to significantly reduce collective communication latency and CPU overhead and to improve communication stability and scalability in large-scale clusters. Specifically, the invention constructs a logical topology tree through active probing and cluster analysis, providing a precise basis for communication planning; it dynamically decomposes collective communication tasks based on the logical topology to optimize cross-level communication; it achieves fine-grained control of the transmission rate through dynamic credit allocation and pseudo-random delay scheduling, effectively avoiding Incast congestion; it uses a multi-block pipelined windowed transmission algorithm to maximize RDMA concurrency and achieve zero-copy, high-throughput transmission; and it achieves microsecond-level barrier synchronization through lightweight synchronization based on RDMA atomic operations, eliminating CPU involvement.

BRIEF DESCRIPTION OF THE DRAWINGS

[0016] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0017] Figure 1 is a flowchart of the RoCE-based collective communication method provided by the present invention.

[0018] Figure 2 is a flowchart of dynamically decomposing a global collective communication operation into multiple stages of point-to-point RDMA communication tasks in the collective communication method provided by this invention.

[0019] Figure 3 is a schematic diagram of credit-based flow control in a specific embodiment.

[0020] Figure 4 is a schematic diagram of pseudo-random delay scheduling in a specific embodiment.

[0021] Figure 5 is a timing diagram of multi-block pipelined windowed RDMA transfer in a specific embodiment.

[0022] Figure 6 is a schematic diagram of barrier synchronization based on RDMA atomic operations in a specific embodiment.

DETAILED DESCRIPTION

[0023] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0024] This invention provides a RoCE-based collective communication method. As shown in Figure 1, the method includes: S1. Obtain the delay matrix and bandwidth matrix between nodes, and construct a hierarchical logical topology tree based on the delay matrix and bandwidth matrix; S2. Based on the logical topology tree, decompose the global collective communication operation into multiple stages of point-to-point RDMA communication tasks, and plan a physical path for each task; S3. Establish a dynamic credit pool at each receiving node; after the sending node obtains credit from the corresponding receiving node, it initiates an RDMA write operation; S4. Maintain concurrent execution of multiple RDMA write operations using a sliding window mechanism; S5. After the coordinating node detects that all nodes have completed the RDMA write operation, it broadcasts the next-stage instruction through an RDMA write-with-immediate operation. The synchronization counter of the coordinating node is updated through the RDMA atomic operation of each node.

[0025] Through the above methods, the present invention can solve the problems of imprecise congestion control, lack of topology awareness, and inefficient resource management in the prior art.

[0026] In step S1, a cluster network topology and performance profile are constructed through active probing, and the delay matrix and bandwidth matrix between nodes are obtained. Based on the delay matrix and bandwidth matrix, a hierarchical logical topology tree is constructed using a graph clustering algorithm. Specifically, this may include sub-steps S11-S14.

[0027] S11. The topology management unit at each node constructs an initial physical neighbor connection graph based on the acquired LLDP information. In this step, LLDP (Link Layer Discovery Protocol) information allows network devices (such as switches, routers, and server network cards) to "announce" their identity, function, and connection port information to their directly connected neighbor devices. As a specific implementation, the topology management unit at each node reads the LLDP information of the switches connected to its RNIC (Remote Direct Memory Access (RDMA) Network Interface Card).

[0028] S12. Based on the physical connection graph, perform end-to-end probing between nodes to obtain the baseline communication delay matrix L and the effective bandwidth matrix B. In this step, as a specific implementation, all node pairs perform a series (e.g., 1000) of small-payload (e.g., 16-byte) RDMA WRITE operations, discard the highest and lowest latency samples, and record the average round-trip time, yielding the baseline communication delay matrix L, where L(i,j) represents the latency from node i to node j. For directly connected node pairs or node pairs with extremely low latency, RDMA read / write bandwidth tests with different data block sizes are performed to obtain the effective bandwidth matrix B.
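
As an illustration only, the latency part of S12 might be post-processed as in the Python sketch below. It assumes that the round-trip samples have already been collected by some probing layer, and that "removing the highest / lowest latency" means dropping one sample at each extreme before averaging. The function names `baseline_latency` and `build_latency_matrix` and the `trim` parameter are hypothetical and do not come from the patent text.

```python
from statistics import mean


def baseline_latency(rtt_samples_us, trim=1):
    """Average the RTT samples after dropping the `trim` highest and lowest values.

    Assumption: the patent only says the highest / lowest latency is removed;
    trimming exactly one sample per side is an illustrative choice.
    """
    if len(rtt_samples_us) <= 2 * trim:
        raise ValueError("need more samples than are trimmed")
    ordered = sorted(rtt_samples_us)
    return mean(ordered[trim:len(ordered) - trim])


def build_latency_matrix(samples):
    """Build L as a dict of dicts, where L[i][j] is the baseline latency i -> j.

    `samples` maps a (i, j) node pair to the list of RTTs measured for it.
    """
    L = {}
    for (i, j), rtts in samples.items():
        L.setdefault(i, {})[j] = baseline_latency(rtts)
    return L
```

The outlier at either end of the sample list no longer skews the average, which is presumably the purpose of the trimming described in S12.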

[0029] S13. Based on the aforementioned baseline communication delay matrix L and effective bandwidth matrix B, the nodes are divided into multiple domains. In this step, as a specific implementation, a graph clustering algorithm (such as delay-based hierarchical clustering) is used to divide the nodes into multiple "domains," for example, one domain corresponds to one rack or one switch. Within a domain, the latency between nodes is low and the bandwidth is high; the latency between domains is relatively high.
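
The domain division in S13 can be pictured with a small Python sketch. The patent only names "delay-based hierarchical clustering" as an example, so the single-linkage grouping with a fixed latency threshold used here is an assumption, as are the function name `split_into_domains` and the parameter `threshold_us`. The effective bandwidth matrix B is ignored for brevity.

```python
def split_into_domains(L, threshold_us):
    """Group nodes into domains by single-linkage clustering on latency.

    L is a square list-of-lists latency matrix. Two nodes end up in the same
    domain if they are connected by a chain of pairs whose latency in both
    directions is below `threshold_us` (an illustrative criterion).
    """
    n = len(L)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if max(L[i][j], L[j][i]) < threshold_us:
                parent[find(i)] = find(j)

    domains = {}
    for node in range(n):
        domains.setdefault(find(node), []).append(node)
    return sorted(domains.values())
```

With a matrix in which nodes 0 and 1 are about 2 µs apart, nodes 2 and 3 about 3 µs apart, and every cross pair 50 µs apart, a threshold of 10 µs yields two domains, which would correspond to two racks or two switches in the patent's terms.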

[0030] S14. Construct a hierarchical logical topology tree based on the domain. In this step, the logical topology is inferred.

[0031] As one implementation, the baseline communication delay matrix L, the effective bandwidth matrix B, and the logical topology tree are cached in the memory of each node and updated periodically or when performance anomalies are detected.

[0032] This invention uses the above-mentioned technical means to construct a logical topology tree through active detection and cluster analysis, providing a precise basis for communication planning.

[0033] In step S2, based on the logical topology tree, the global collective communication operation is dynamically decomposed into multiple stages of point-to-point RDMA communication tasks, and a physical path is planned for each task, thereby optimizing cross-level communication. As shown in Figure 2, this specifically includes sub-steps S20-S25.

[0034] S20. At one level of the logical topology tree, determine whether the size of the current node set is greater than a predetermined threshold T; S21. If the number of nodes exceeds a predetermined threshold T, the node set is divided into two approximately equal subsets. In this step, the partitioning principle is based on the effective bandwidth matrix B, minimizing the logical link hops across subsets and maximizing the available bandwidth. The two approximately equal subsets are labeled Group_A and Group_B, respectively. In one implementation, the threshold T can be set to 8; if the number of nodes is less than or equal to the predetermined threshold T, step S22 can be performed directly.

[0035] S22. Within each subset, perform Reduce-Scatter operations independently; in this step, recursively decompose downwards until the subset size reaches an optimization threshold, at which point the recursion stops, and the threshold is dynamically adjusted based on the network topology and performance profile.

[0036] S23. Perform paired point-to-point Reduce between the two subsets; as a specific implementation, the k-th node in Group_A and the k-th node in Group_B use RDMA READ to obtain each other's data, perform local Reduce, and then write some of the results back using RDMA WRITE. The connection planning between subsets strictly follows the logical topology to ensure that the communication paths between Group_A and Group_B are distributed as widely as possible across different upstream links to avoid congestion.

[0037] S24. After point-to-point Reduce, perform an All-Gather operation independently within each subset to generate point-to-point RDMA tasks for the current level; finally, broadcast the complete final result to all nodes within the subset.

[0038] S25. For each point-to-point RDMA task at the current level, select a physical path for it based on the logical topology tree. In this step, for each point-to-point RDMA task generated by decomposition (e.g., node i -> node j), query the logical topology graph, select a physical path for it (in a multi-path environment), and record the key switch port information on the path for subsequent flow control and congestion avoidance.
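
The decomposition in S20-S24 can be sketched as a planner that emits task descriptors for one level of the tree. This is a simplified illustration under several assumptions: the node list is taken to be already ordered so that contiguous halves approximate the bandwidth-aware split of S21, the recursion of S22 and the path selection of S25 are omitted, and the names `plan_level`, `reduce_scatter`, `pair_reduce`, and `all_gather` are hypothetical labels rather than identifiers from the patent.

```python
def plan_level(nodes, threshold=8):
    """Return the ordered task descriptors for one level of the topology tree.

    Each descriptor is a tuple: ("reduce_scatter", group), ("pair_reduce", a, b),
    or ("all_gather", group). The default threshold of 8 follows the example
    value given for T in the text.
    """
    if len(nodes) > threshold:
        # S21: split into two subsets of approximately equal size.
        mid = len(nodes) // 2
        groups = [tuple(nodes[:mid]), tuple(nodes[mid:])]
    else:
        # Small sets skip the split and go straight to S22.
        groups = [tuple(nodes)]

    tasks = []
    # S22: independent Reduce-Scatter inside each subset.
    tasks += [("reduce_scatter", g) for g in groups]
    # S23: the k-th node of Group_A pairs with the k-th node of Group_B.
    if len(groups) == 2:
        for a, b in zip(groups[0], groups[1]):
            tasks.append(("pair_reduce", a, b))
    # S24: independent All-Gather inside each subset.
    tasks += [("all_gather", g) for g in groups]
    return tasks
```

When the two subsets differ in size by one node, `zip` leaves the last node of the larger subset unpaired; a fuller planner would need to assign it a partner, which the patent text does not specify.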

[0039] In step S3, the transmission rate is finely controlled through dynamic credit allocation and pseudo-random delay scheduling, effectively avoiding Incast congestion. A dynamic credit pool is established at each receiving node to control the transmission rate. After obtaining credit from the corresponding receiving node, the sending node initiates an RDMA write operation. This specifically includes sub-steps S31-S34.

[0040] S31. Each receiving node maintains a credit value for each sending node (or for each group of sending nodes), and the credit value is notified to the sending node via an update message; in this step, the initial credit is equal to the size of the buffer pre-registered by the receiving end / the size of the data block.

[0041] S32. Before initiating an RDMA write operation, the sending node must send a lightweight “credit request” message to the receiving node; this “credit request” message can be sent via RoCE’s unreliable datagram.

[0042] S33. The receiving node replies with a "credit grant" message based on its own buffer remaining space and the global congestion status. The "credit grant" message contains the number of data blocks that can be sent. S34. The credit value is decremented by 1 for each data block sent by the sending node or received by the receiving node. Correspondingly, the credit is reclaimed for each data block consumed (processed) by the receiving node, and the sending node can be notified asynchronously via an update message.
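
The credit accounting of S31-S34 can be modeled in a few lines of Python. This is a receiver-side sketch only: the message transport is left out, the class name `CreditPool` and its methods are hypothetical, and halving the grant limit under congestion is an illustrative policy, since the patent does not say how the global congestion status affects the grant.

```python
class CreditPool:
    """Receiver-side credit bookkeeping for block-granular RDMA writes."""

    def __init__(self, buffer_bytes, block_bytes):
        # S31: initial credit = pre-registered buffer size / data block size.
        self.free = buffer_bytes // block_bytes
        self.granted = {}  # sender -> credits granted but not yet used

    def grant(self, sender, requested, congested=False):
        """S33: answer a credit request with the number of blocks allowed."""
        # Assumption: under congestion, hand out at most half of what is free.
        limit = self.free // 2 if congested else self.free
        n = min(requested, limit)
        self.free -= n
        self.granted[sender] = self.granted.get(sender, 0) + n
        return n

    def on_block_received(self, sender):
        """S34: each received block consumes one of the sender's credits."""
        if self.granted.get(sender, 0) == 0:
            raise RuntimeError("block received without credit")
        self.granted[sender] -= 1

    def on_block_consumed(self):
        """S34: processing a block returns its buffer slot to the pool."""
        self.free += 1
```

With a buffer of eight blocks, a request for five blocks from one sender followed by a request for five from another yields grants of five and three, because only three slots remain for the second sender.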

[0043] As a specific application scenario, when multiple sending nodes send data to a single receiving node, a base offset delay and a pseudo-random delay are calculated for each sending node. The base offset delay is set in layers according to the number of hops of the sending node in the logical topology, and the pseudo-random delay is generated within a time window based on the node ID and communication round hash. If network congestion is detected, the time window of the pseudo-random delay is dynamically expanded.

[0044] As a specific implementation, a transmission scheduling mechanism that combines a layer-based base delay with a pseudo-random delay is adopted for this many-to-one communication scenario to smooth network traffic. Before communication starts, a base offset delay T_base and a pseudo-random delay T_random are calculated for each sending node. T_base is set hierarchically according to the "distance" (hop count) of the sending node in the logical topology: the greater the distance, the greater the base delay, which staggers data arrival times. T_random is generated within a controllable time window (e.g., [0, 2*RTT_avg]) based on a hash of the node ID and the communication round. The actual transmission start time of the sending node is T_start = T_base + T_random. If network telemetry (e.g., ECN marking) detects congestion, the window of T_random is dynamically expanded, strengthening the peak-shaving effect.

[0045] As a specific implementation, the pseudo-random delay is calculated as T_random = Hash(Node_ID, Round_Number) mod (2 * RTT_avg), where Hash() is a hash function, Node_ID is the node identifier, Round_Number is the communication round number, and RTT_avg is the average round-trip delay. When network congestion is detected, the coefficient in the modulus (2 in the formula above) is dynamically increased.
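
A minimal Python sketch of the start-time calculation follows. The patent does not name a hash function or say how T_base scales with hop count, so SHA-256, a linear per-hop delay, integer microsecond units, and doubling the window coefficient from 2 to 4 under congestion are all assumptions made for illustration, as are the function and parameter names.

```python
import hashlib


def t_random_us(node_id, round_number, rtt_avg_us, window_factor=2):
    """T_random = Hash(Node_ID, Round_Number) mod (window_factor * RTT_avg).

    Assumption: SHA-256 over the string "node_id:round_number"; rtt_avg_us is
    a positive integer number of microseconds.
    """
    digest = hashlib.sha256(f"{node_id}:{round_number}".encode()).digest()
    h = int.from_bytes(digest[:8], "big")
    return h % (window_factor * rtt_avg_us)


def t_start_us(node_id, round_number, hops, rtt_avg_us, per_hop_us, congested=False):
    """T_start = T_base + T_random, with T_base growing with the hop count."""
    t_base = hops * per_hop_us           # assumed linear in the hop count
    window_factor = 4 if congested else 2  # assumed expansion under congestion
    return t_base + t_random_us(node_id, round_number, rtt_avg_us, window_factor)
```

Because the delay is derived from a hash rather than a random number generator, every node can recompute any peer's offset for a given round without exchanging messages, while different node IDs still spread across the window.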

[0046] Figure 3 is a schematic diagram of credit-based flow control in a specific embodiment. As shown in Figure 3, receiving node R grants sending node S1 a credit of 5 data blocks and grants sending node S2 a credit of 3 data blocks. The credit grant value is determined based on the maintained credit pool. After the written data blocks are processed, the credit is released, thereby updating the maintained credit pool.

[0047] Figure 4 is a schematic diagram of pseudo-random delay scheduling in a specific embodiment. As shown in Figure 4, the transmission start time of each node is calculated as the base delay T_base plus the pseudo-random delay T_random. The base delay T_base is determined by the network layer, while the pseudo-random delay T_random is generated by a hash function and distributed pseudo-randomly within a given window.

[0048] In step S4, RDMA concurrency is maximized through a multi-block pipelined windowed transmission algorithm to achieve zero-copy, high-throughput transmission. The algorithm divides the data to be transmitted into fixed-size data blocks and keeps multiple RDMA operations in flight through a sliding window, achieving zero-copy transfer and overlap of communication with computation. This specifically includes sub-steps S41-S43.

[0049] S41. Divide the total data to be transmitted into multiple data blocks of a predetermined size; S42. Submit W data block tasks to the RNIC work queue through W asynchronous RDMA operations, where W is the pipeline window size, which is preset. S43. After the data block task operation in the RNIC's work queue is completed, the buffer is released to obtain the next set of W data blocks until the total data transmission is completed.

[0050] In one specific implementation, the total data D to be transferred is first divided into K data blocks of size C (Chunk_0 to Chunk_{K-1}). A fixed-size pipeline window W is set (e.g., W=4). Then, all K data blocks are transferred using the pipeline engine. In the first stage, the window is filled. That is, W asynchronous RDMA operations (WRITE for Reduce-Scatter / All-Gather, READ for Reduce) are initiated consecutively, submitting the tasks from Chunk_0 to Chunk_{W-1} to the RNIC's work queue (SQ). Next, in the second stage, the window slides to poll the completion queue. Whenever a completion notification (WC) is received, the buffer for that data block is released (if it's a READ, the data is processed), and then the next Chunk_{x} to be transferred (x>=W) is immediately retrieved from the task queue, and a new asynchronous RDMA operation is initiated, keeping W operations in progress in the window at all times until all K data blocks are transferred. It is worth noting that buffer management is performed before transmitting data blocks, that is, a fixed memory buffer (Memory Region) is pre-registered for each data block in the pipeline window and reused throughout the communication session, thereby avoiding the overhead of frequent memory registration / deregistration.
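
The fill-then-slide behavior described above can be illustrated with a pure-Python simulation that records the order of post and completion events. No RDMA verbs are called here; completions are assumed to arrive in posting order, and the function name `pipeline_transfer` is hypothetical.

```python
from collections import deque


def pipeline_transfer(num_chunks, window):
    """Simulate a sliding-window transfer of `num_chunks` chunks.

    Returns the ordered list of ("post", k) and ("complete", k) events.
    Assumption: completions are delivered in the order the chunks were posted.
    """
    events = []
    in_flight = deque()
    next_chunk = 0

    # Stage 1: fill the window with up to `window` asynchronous operations.
    while next_chunk < min(window, num_chunks):
        in_flight.append(next_chunk)
        events.append(("post", next_chunk))
        next_chunk += 1

    # Stage 2: slide. Each completion frees a slot, which is refilled at once.
    while in_flight:
        done = in_flight.popleft()
        events.append(("complete", done))
        if next_chunk < num_chunks:
            in_flight.append(next_chunk)
            events.append(("post", next_chunk))
            next_chunk += 1
    return events
```

For three chunks and a window of two, the event order is post 0, post 1, complete 0, post 2, complete 1, complete 2, so at most two operations are ever outstanding. In a real engine each slot would map to one of the pre-registered buffers mentioned in the text.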

[0051] Figure 5 is a timing diagram of multi-block pipelined windowed RDMA transfer in a specific embodiment. As shown in Figure 5, the timing diagram plots the send queue (SQ), the operation execution status (RNIC), the completion queue (CQ), and the current window state against a time axis made up of a series of time points. The send queue (SQ) row shows when the RDMA work queue element (WQE) for each data block (Chunk C0, C1, ...) is placed into the work queue. The operation execution status (RNIC) row shows the actual transmission period of each data block, reflecting network transmission time; the execution periods of different data blocks overlap, representing concurrent transmission. The completion queue (CQ) row shows when the completion notification (WC) for each data block is received. The pipeline window state is divided into two stages, "fill window" and "sliding window." The window size in this embodiment is W=6. The key event points on the timeline include: t1: submit the C0 operation; t2: submit the C1 operation, C0 starts transmission; ... submission continues until C5, completing the window filling stage; t7: C0 transmission completes and generates a WC, the C6 operation is submitted immediately, starting the window sliding stage; t8: C1 transmission completes and generates a WC, the C7 operation is submitted immediately.

[0052] In step S5, lightweight barrier synchronization is achieved based on RDMA atomic operations. The global counter is updated through Fetch-and-Add atomic operations, and control instructions are broadcast through write-with-immediate operations, eliminating CPU involvement and enabling microsecond-level barrier synchronization. After the coordinating node detects that all nodes have completed their RDMA write operations, it broadcasts the next-stage instruction through an RDMA write-with-immediate operation. This includes sub-steps S51-S54.

[0053] S51. Pre-register a synchronization counter memory region on the coordinating node or a specified root node; in this step, the initial value of the synchronization counter is 0.

[0054] S52. After each participating node completes its own data transmission task, i.e., the local pipeline engine ends, it performs an RDMA atomic operation on the synchronization counter address of the coordinating node to increment the synchronization counter by 1; this operation is performed by the RNIC at the network card level without the CPU of the coordinating node participating.

[0055] S53. The coordinating node determines that all nodes have completed the RDMA write operation based on the value of the synchronization counter. In this step, the coordinating node may actively poll the value of the synchronization counter or wait for an RDMA completion notification of an expected value of "all nodes have completed" (i.e., the number of participating nodes N) to confirm that all nodes have completed the current stage.

[0056] S54. After determining that all nodes have completed the RDMA write operation, the coordinating node issues an RDMA write-with-immediate operation, which causes each receiving node's RNIC to immediately generate a completion event carrying immediate data that contains a stage identifier. In this step, the write-with-immediate operation writes a "continue" command to a specified memory address on every node, so that each receiving node's RNIC generates a completion event with immediate data, efficiently waking all nodes to enter the next stage or end the communication.

[0057] Figure 6 is a schematic diagram of barrier synchronization based on RDMA atomic operations in a specific embodiment. As shown in Figure 6, each node independently executes an RDMA atomic operation (Fetch-and-Add) that atomically increments a counter at a shared memory address on the coordinating node. The coordinating node polls or waits for completion events to detect the global completion status. The coordinating node then writes data to a specified address on all nodes, carrying an immediate data value in the operation. This immediate data value encodes control commands; for example, 0x01 indicates entering stage 2, 0x02 indicates entering stage 3, and 0xFF indicates that the entire operation is complete. After receiving the immediate data write operation, each node's RNIC generates a special completion entry in its local CQ containing the received immediate data, thus efficiently waking up the application and informing it of the control instruction.
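The counter-and-broadcast flow of S51-S54 can be sketched in software as follows. This is only an illustrative simulation, not RDMA code: the names `SyncCounter` and `run_barrier` are hypothetical, a lock stands in for the atomicity that the RNIC provides for Fetch-and-Add, threads stand in for participating nodes, and Python lists stand in for each node's CQ. Only the immediate data encoding (0x01, 0x02, 0xFF) is taken from the description above.

```python
import threading
import time

# Immediate-data encoding from the example in the description.
IMM_COMMANDS = {0x01: "enter stage 2", 0x02: "enter stage 3", 0xFF: "operation complete"}


class SyncCounter:
    """Stand-in for the registered counter region; a lock models RNIC atomicity."""

    def __init__(self):
        self._value = 0  # S51: the counter starts at 0
        self._lock = threading.Lock()

    def fetch_and_add(self, delta=1):
        """Return the previous value and add delta, like an RDMA Fetch-and-Add."""
        with self._lock:
            old = self._value
            self._value += delta
            return old

    def read(self):
        with self._lock:
            return self._value


def run_barrier(num_nodes, imm_value):
    """Simulate one barrier round; returns (final counter, command seen per node)."""
    counter = SyncCounter()
    completion_queues = [[] for _ in range(num_nodes)]  # one simulated CQ per node

    # S52: each participant increments the coordinator's counter once.
    workers = [
        threading.Thread(target=counter.fetch_and_add, args=(1,))
        for _ in range(num_nodes)
    ]
    for w in workers:
        w.start()

    # S53: the coordinator polls until the counter reaches N.
    while counter.read() < num_nodes:
        time.sleep(0.0001)
    for w in workers:
        w.join()

    # S54: "broadcast" the immediate value; each node sees it in its CQ.
    for cq in completion_queues:
        cq.append(imm_value)

    return counter.read(), [IMM_COMMANDS[cq[0]] for cq in completion_queues]
```

Under these assumptions, `run_barrier(4, 0x01)` returns `(4, ["enter stage 2"] * 4)`. In a real deployment the increment and the broadcast would be posted as RDMA work requests rather than executed by host threads.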

[0058] To handle faulty links or nodes, this invention also includes a fault recovery step: during a communication session, nodes periodically exchange lightweight heartbeat messages; if the sending node does not receive a heartbeat or credit confirmation from the peer within the expected time, or an RDMA operation fails, the corresponding receiving node is marked as faulty; and a node is then found based on the logical topology tree to take over the communication tasks of the faulty node.

[0059] In a preferred implementation, the lightweight heartbeat message can be transmitted via RoCE unreliable datagrams. If the sending node does not receive a heartbeat or credit confirmation from the peer within the expected time, or if an RDMA operation fails with an error (reported through an error work completion, WC, in the CQ), the peer is marked as "suspicious." Once a peer is marked as "suspicious," a fast retransmission path is initiated: a replacement "partner node" is found in the logical topology to take over the communication tasks of the failed node. For example, in All-Reduce, the partner node of the failed node can send its local backup data to the intended recipient. Retransmission involves only the data of the failed link or node, without restarting the entire aggregated communication operation.
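The detection and takeover logic described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `find_suspicious` and `pick_partner` are hypothetical, the logical topology tree is represented as a child-to-parent dictionary, and the partner policy (prefer a healthy sibling, otherwise fall back to the parent) is an assumption, since the description does not specify how the partner node is chosen.

```python
def find_suspicious(last_seen, now, timeout, failed_ops=()):
    """Return peers whose last heartbeat or credit confirmation is older than
    `timeout`, plus peers for which an RDMA operation reported an error."""
    suspicious = {peer for peer, seen in last_seen.items() if now - seen > timeout}
    suspicious.update(failed_ops)
    return suspicious


def pick_partner(parent_of, failed, suspicious):
    """Pick a replacement for `failed` from the logical topology tree.

    Assumed policy: the first healthy sibling (same parent) in sorted order,
    otherwise the parent if it is healthy, otherwise None.
    """
    parent = parent_of.get(failed)
    siblings = sorted(
        node
        for node, p in parent_of.items()
        if p == parent and node != failed and node not in suspicious
    )
    if siblings:
        return siblings[0]
    if parent is not None and parent not in suspicious:
        return parent
    return None


# Usage: a small tree (the root maps to None) and timestamps in seconds.
parent_of = {"root": None, "a": "root", "b": "root", "a1": "a", "a2": "a", "b1": "b"}
last_seen = {"a1": 9.5, "a2": 4.0, "b1": 9.9}
suspects = find_suspicious(last_seen, now=10.0, timeout=2.0)
partner = pick_partner(parent_of, "a2", suspects)
```

Here `a2` has been silent for 6 seconds against a 2 second timeout, so it is marked suspicious and its healthy sibling `a1` is selected. Only the tasks of `a2` would then be re-planned; the rest of the operation continues unchanged.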

[0060] This invention also provides a RoCE-based aggregated communication system, comprising: multiple nodes, each node equipped with at least one RoCE-enabled smart network interface card, main memory, and multi-core processor, the multiple nodes being connected via Ethernet; a communication management module residing on each node or a central management node, comprising: a topology management unit, used to acquire the delay matrix and bandwidth matrix between nodes, and construct a hierarchical logical topology tree based on the delay matrix and bandwidth matrix; a task planning unit, used to decompose the global aggregated communication operation into multiple stages of point-to-point RDMA communication tasks based on the logical topology tree, and plan physical paths for each task; a scheduling unit, used to establish a dynamic credit pool at each receiving node, enabling the sending node to initiate an RDMA write operation after acquiring credit from the corresponding receiving node; an RDMA operation engine unit, used to maintain the concurrent execution of multiple RDMA write operations through a sliding window mechanism; and a synchronization unit, used to broadcast the next stage instruction through an RDMA immediate write operation after the coordinating node detects that all nodes have completed the RDMA write operation, the synchronization counter of the coordinating node being updated through the RDMA atomic operations of each node.

[0061] The nodes are connected via an Ethernet infrastructure that supports Data Center Bridging (DCB) and Explicit Congestion Notification (ECN). The smart NIC can offload some aggregated communication operations to the NIC hardware, such as the Reduce operation on data blocks in communication task decomposition and the credit management logic in credit-based flow control, thereby reducing the workload of the host CPU. Each node runs a complete aggregated communication management module, and the aggregated communication system coordinates the global state through a distributed coordination architecture.

[0062] In one specific implementation, the topology management unit of each node reads the LLDP information of the switches connected to its RNIC and constructs an initial physical neighbor connection graph as the basis for obtaining the delay matrix and bandwidth matrix between nodes. For each point-to-point RDMA task generated by the decomposition (e.g., node i -> node j), the task planning unit queries the logical topology graph, selects a physical path for it (in a multi-path environment), records the key switch port information on the path, and passes it to the scheduling unit.

[0063] For incast scenarios (for example, multiple nodes simultaneously sending data to a single node), the scheduling unit calculates a base offset delay T_base and a pseudo-random delay T_random for each sending node before communication begins. T_base is set per level according to the sending node's "distance" (hop count) in the logical topology: the greater the distance, the greater the base delay, which staggers data arrival times. T_random is generated within a controllable time window (e.g., [0, 2*RTT_avg]) from a hash of the node ID and the communication round. The actual sending start time of a sending node is T_start = T_base + T_random. If network telemetry (e.g., ECN marking) indicates congestion, the scheduling unit dynamically expands the T_random window to strengthen the peak-shaving effect.
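The start-time calculation can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names, the linear per-hop step used for T_base, the choice of SHA-256 as the hash, and the expansion factor applied under congestion are all assumptions. Only the relation T_start = T_base + T_random and the example window [0, 2*RTT_avg] come from the description.

```python
import hashlib


def t_base(hops, step):
    """Base offset: grows with hop count (linear growth is an assumption)."""
    return hops * step


def t_random(node_id, round_no, window):
    """Deterministic pseudo-random delay in [0, window), derived from a hash
    of the node ID and the communication round."""
    digest = hashlib.sha256(f"{node_id}:{round_no}".encode()).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return fraction * window


def start_time(node_id, round_no, hops, rtt_avg, step, congested=False, expand=2.0):
    """T_start = T_base + T_random; the window widens when congestion is seen."""
    window = 2 * rtt_avg
    if congested:
        window *= expand  # assumed expansion factor
    return t_base(hops, step) + t_random(node_id, round_no, window)


# Usage: delays in seconds for three senders in round 1, RTT_avg = 10 us.
starts = {
    node: start_time(node, round_no=1, hops=h, rtt_avg=10e-6, step=5e-6)
    for node, h in [(1, 1), (2, 1), (3, 2)]
}
```

Because the delay is derived from a hash rather than a random number generator, every node can recompute any peer's offset for a given round without exchanging messages, which keeps the schedule reproducible.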

[0064] The RoCE-based aggregated communication method and system provided by this invention construct a logical topology tree through active probing and cluster analysis, providing a precise basis for communication planning; dynamically decompose aggregated communication tasks based on the logical topology to optimize cross-level communication; achieve fine-grained control of the transmission rate through dynamic credit allocation and pseudo-random delay scheduling, effectively avoiding Incast congestion; use a multi-block pipelined windowed transmission algorithm to maximize RDMA concurrency and achieve zero-copy, high-throughput transmission; and achieve microsecond-level barrier synchronization through lightweight synchronization based on RDMA atomic operations, eliminating CPU involvement.

[0065] This invention fully utilizes the RoCE RDMA feature to significantly reduce aggregation communication latency and CPU overhead, and improve communication stability and scalability in large-scale clusters.

[0066] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0067] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0068] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A RoCE-based aggregated communication method, characterized in that the method includes: obtaining the delay matrix and bandwidth matrix between nodes, and constructing a hierarchical logical topology tree based on the delay matrix and bandwidth matrix; decomposing the global aggregated communication operation into multiple stages of point-to-point RDMA communication tasks based on the logical topology tree, and planning a physical path for each task; establishing a dynamic credit pool at each receiving node, and initiating an RDMA write operation after the sending node obtains credit from the corresponding receiving node; maintaining concurrent execution of multiple RDMA write operations through a sliding window mechanism; and broadcasting the next stage instruction through an RDMA immediate write operation after the coordinating node detects that all nodes have completed the RDMA write operation, the synchronization counter of the coordinating node being updated through the RDMA atomic operations of each node.

2. The RoCE-based aggregated communication method according to claim 1, characterized in that constructing a hierarchical logical topology tree based on the delay matrix and bandwidth matrix includes: the topology management unit at each node constructs an initial physical neighbor connection graph based on the acquired LLDP information; based on the physical neighbor connection graph, end-to-end probing between nodes is performed to obtain the baseline communication delay matrix L and the effective bandwidth matrix B; based on the baseline communication delay matrix L and the effective bandwidth matrix B, the nodes are divided into multiple domains; and a hierarchical logical topology tree is constructed based on the domains.

3. The RoCE-based aggregated communication method according to claim 2, characterized in that, The baseline communication delay matrix L, effective bandwidth matrix B, and logical topology tree are cached in the memory of each node and kept updated.

4. The RoCE-based aggregated communication method according to claim 1, characterized in that decomposing the global aggregated communication operation into multiple stages of point-to-point RDMA communication tasks based on the logical topology tree, and planning a physical path for each task, includes: at one level of the logical topology tree, the set of nodes is divided into two subsets of approximately equal size; within each subset, the Reduce-Scatter operation is performed independently; pairwise point-to-point Reduce is performed between the two subsets; after the point-to-point Reduce, an All-Gather operation is performed independently within each subset, generating the point-to-point RDMA tasks for the current level; and for each point-to-point RDMA task at the current level, a physical path is selected based on the logical topology tree.

5. The RoCE-based aggregated communication method according to claim 1, characterized in that, A dynamic credit pool is established at each receiving node. After obtaining credits from the corresponding receiving node, the sending node initiates an RDMA write operation, including: Each receiving node maintains a credit value for each sending node (or for each group of sending nodes), and the credit value is communicated to the sending node via an update message; Before initiating an RDMA write operation, the sending node must send a lightweight "credit request" message to the receiving node; The receiving node replies with a "credit grant" message based on its own remaining buffer space and the global congestion status. The "credit grant" message contains the number of data blocks that can be sent. The credit value is decremented by 1 for each data block sent by the sending node or each data block received by the receiving node.

6. The RoCE-based aggregated communication method according to claim 1, characterized in that, when multiple sending nodes send data to a receiving node, a base offset delay and a pseudo-random delay are calculated for each sending node; the base offset delay is set per level according to the hop count of the sending node in the logical topology, and the pseudo-random delay is generated within a time window from a hash of the node ID and the communication round; and if network congestion is detected, the time window for the pseudo-random delay is dynamically expanded.

7. The RoCE-based aggregated communication method according to claim 1, characterized in that, The method of maintaining concurrent execution of multiple RDMA write operations through a sliding window mechanism includes: The total data to be transmitted is divided into multiple data blocks of a predetermined size; W data block tasks are submitted to the RNIC work queue through W asynchronous RDMA operations, where W is the pipeline window size, which is preset. After the data block task operation in the RNIC's work queue is completed, the buffer is released to obtain the next set of W data blocks until the total data transmission is completed.

8. The RoCE-based aggregated communication method according to claim 1, characterized in that broadcasting the next stage instruction through the RDMA immediate write operation after the coordinating node detects that all nodes have completed the RDMA write operation includes: pre-registering a synchronization counter memory region on the coordinating node; after completing its own data transmission task, each participating node performs an RDMA atomic operation on the synchronization counter address of the coordinating node, incrementing the synchronization counter by 1; the coordinating node determines whether all nodes have completed the RDMA write operation based on the value of the synchronization counter; and after determining that all nodes have completed the RDMA write operation, the coordinating node issues RDMA immediate write operations so that each receiving node's RNIC immediately generates a completion event with immediate data, which carries a stage identifier.

9. The RoCE-based aggregated communication method according to claim 1, characterized in that the method also includes: during a communication session, nodes periodically exchange lightweight heartbeat messages; if the sending node does not receive a heartbeat or credit confirmation from the peer within the expected time, or an RDMA operation fails, the corresponding receiving node is marked as faulty; and a node is found based on the logical topology tree to take over the communication tasks of the node marked as faulty.

10. A RoCE-based aggregated communication system, characterized in that, The aggregated communication system includes: multiple nodes, each equipped with at least one smart network interface card supporting the RoCE protocol, main memory, and multi-core processor, the multiple nodes being connected via Ethernet; a communication management module residing on each node or a central management node, including: a topology management unit, used to acquire the delay matrix and bandwidth matrix between nodes, and construct a hierarchical logical topology tree based on the delay matrix and bandwidth matrix; a task planning unit, used to decompose the global aggregated communication operation into multiple stages of point-to-point RDMA communication tasks based on the logical topology tree, and plan physical paths for each task; a scheduling unit, used to establish a dynamic credit pool at each receiving node, so that the sending node initiates an RDMA write operation after obtaining credit from the corresponding receiving node; an RDMA operation engine unit, used to maintain the concurrent execution of multiple RDMA write operations through a sliding window mechanism; and a synchronization unit, used to broadcast the next stage instruction through an RDMA immediate write operation after the coordinating node detects that all nodes have completed the RDMA write operation, the synchronization counter of the coordinating node being updated through the RDMA atomic operations of each node.