A data processing unit, distributed system and chip
By introducing a Data Processing Unit (DPU) into a distributed system and having its RDMA engine and processor cooperatively handle collective communication, the application addresses the waste of GPU computing power and PCIe bandwidth in collective communication, achieving more efficient resource utilization and better performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENZHEN JAGUAR MICROSYSTEMS CO LTD
- Filing Date
- 2026-03-31
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, the collective communication operations in distributed AI training result in a serious waste of GPU computing power, host CPU computing power, and PCIe bandwidth, limiting the full potential of the overall system performance.
A Data Processing Unit (DPU) is introduced whose RDMA engine and processor work together to take over the data transmission and reduction calculation tasks in collective communication, reducing the number of data transfers across the PCIe bus, improving PCIe bandwidth utilization, and reducing the load on the host CPU.
By completing data storage and computation within the DPU, the computing power and memory bandwidth of the GPU are freed up, and frequent switching of the host CPU is avoided. This significantly improves the utilization rate of GPU computing power, host CPU computing power and PCIe bandwidth, thereby enhancing system performance.
Figure CN121935205B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of collective communication technology, specifically to a data processing unit, a distributed system, and a chip.
Background Technology
[0002] Collective communication is a core communication mode in distributed parallel computing, in which a group of processes or nodes cooperatively complete batch data exchange operations under unified semantics. In artificial intelligence (AI) applications, collective communication is a core technology of distributed AI training, with a wide range of applications and critical use cases. In existing architectures, collective communication operations are typically carried out with the GPU handling data storage and computation and the host CPU initiating the communication operation. Taking the Allreduce operation as an example, during communication the GPU of each network node must not only perform the reduction computation but is also frequently interrupted to respond to data transmission events; at the same time, the host CPU of each network node must repeatedly take part in scheduling and interrupt handling for the communication process. Furthermore, the frequent transfer of communication data over the PCIe bus consumes a significant amount of bandwidth. This traditional model seriously wastes GPU computing power, host CPU computing power, and PCIe bandwidth in each network node, limiting overall system performance.
Summary of the Invention
[0003] The purpose of this application is to propose a data processing unit, a distributed system, and a chip to improve the utilization rate of GPU computing power, host CPU computing power, and PCIe bandwidth of each network node.
[0004] To achieve the above objectives, according to a first aspect of this application, a data processing unit is provided, applied to an intermediate node of a distributed system, including a first processor, a first RDMA engine, and a first data buffer.
[0005] The first RDMA engine is used to receive, during the reduction phase, a first network packet sent by the previous network node, parse the first network packet to obtain first data, write the first data into the first data buffer, and report a first receive notification to the first processor.
[0006] The first processor is configured to, upon receiving the first receive notification, read the first data from the first data buffer according to the first receive notification, read second data from the local GPU memory, perform a reduction calculation based on the first data and the second data to obtain third data, write the third data into the first data buffer, and send a first send notification to the first RDMA engine.
[0007] The first RDMA engine is configured to, upon receiving the first send notification, read the third data from the first data buffer according to the first send notification, generate a second network packet according to the third data, and send the second network packet to the next network node, wherein the data processing unit and the local GPU belong to the same network node.
[0008] In one possible implementation, the first RDMA engine is configured to receive, during the broadcast phase, a third network packet sent by the broadcast initiation node, parse the third network packet to obtain fourth data, and write the fourth data into the local GPU memory.
[0009] In one possible implementation, the first data buffer includes a first transmit buffer and a first receive buffer, wherein the first transmit buffer is used to buffer the third data and the first receive buffer is used to buffer the first data.
[0010] In one possible implementation, the first processor runs a cross-architecture collective communication library software program to parse the first receive notification to obtain a reduction command and, in response to the reduction command, invokes a system-level DMA module to read the second data from the local GPU memory.
[0011] According to a second aspect of this application, a distributed system is provided, including an initial node, a root node, and at least one intermediate node; the intermediate node is provided with a data processing unit as described in the first aspect of this application.
[0012] The initial node is used in the reduction phase to encapsulate the fifth data cached in its local GPU into the fourth network packet and send the fourth network packet to the corresponding intermediate node.
[0013] The root node is used to receive, during the reduction phase, the fifth network packet sent by its corresponding intermediate node, and perform a reduction calculation on the sixth data in its local GPU cache and the seventh data in the fifth network packet to obtain the fourth data.
[0014] In one possible implementation, the root node is further configured to, during the broadcast phase, encapsulate the fourth data into the third network packet and send the third network packet to the at least one intermediate node and the initial node.
[0015] The initial node is further configured to, during the broadcast phase, receive the third network packet sent by the root node, parse the third network packet to obtain the fourth data, and write the fourth data into its local GPU cache.
[0016] In one possible implementation, the initial node is provided with a data processing unit, which includes a second processor, a second RDMA engine, and a second data buffer.
[0017] The second RDMA engine is used to receive, during the reduction phase, a second send notification issued by the local node host, parse the second send notification to obtain a payload, write the payload into the second data buffer, and report a second receive notification to the second processor, wherein the payload contains an All-Reduce operation instruction.
[0018] The second processor is used to run a preset cross-architecture collective communication library software program; when the second receive notification is received, the cross-architecture collective communication library software program reads the payload from the second data buffer according to the second receive notification, parses the payload to obtain the All-Reduce operation instruction, executes the All-Reduce operation instruction, reads the fifth data from the local GPU memory, writes the fifth data into the second data buffer, and sends a third send notification to the second RDMA engine.
[0019] The second RDMA engine is used to, upon receiving the third send notification, read the fifth data from the second data buffer according to the third send notification, generate the fourth network packet according to the fifth data, and send the fourth network packet to the next network node to initiate the reduction process.
[0020] In one possible implementation, the second RDMA engine is configured to receive a third network packet sent by the root node during the broadcast phase, parse the third network packet to obtain the fourth data, and write the fourth data into the local GPU memory.
[0021] In one possible implementation, the root node is provided with a data processing unit, which includes a third processor, a third RDMA engine, and a third data buffer.
[0022] The third RDMA engine is used to receive, during the reduction phase, the fifth network packet sent by its corresponding intermediate node, parse the fifth network packet to obtain the seventh data, write the seventh data into the third data buffer, and report a third receive notification to the third processor.
[0023] The third processor is used to run a preset cross-architecture collective communication library software program. When the third receive notification is received, the cross-architecture collective communication library software program reads the seventh data from the third data buffer and the sixth data from the local GPU memory according to the third receive notification, performs a reduction calculation based on the sixth data and the seventh data to obtain the fourth data, and calls the system-level DMA module to write the fourth data into the local GPU memory, thus ending the reduction process.
[0024] In one possible implementation, the third RDMA engine is used to generate, during the broadcast phase, the third network packet based on the fourth data, and send the third network packet to the initial node or the intermediate node.
[0025] According to a third aspect of this application, a chip is provided, including a data processing unit as described in the first aspect of this application.
[0026] This application proposes a data processing unit, a distributed system, and a chip, which have the following advantages:
[0027] The Data Processing Unit (DPU) takes over the data transmission and reduction calculation tasks in collective communication. When the DPU is applied to an intermediate network node, during the reduction phase the RDMA engine receives data from the previous network node and stores it in the first data buffer inside the DPU. The DPU's processor runs a preset cross-architecture collective communication library software program, which reads the data from the previous network node and reduces it with the data in the local GPU memory. The result is likewise stored in the first data buffer and sent to the next network node via the RDMA engine. In this process, the storage and calculation of the first and third data are completed within the DPU, eliminating the need for frequent data transfers to GPU memory and thus freeing up GPU computing power and memory bandwidth. At the same time, the host CPU of this node only issues instructions at the beginning of the process; subsequent communication is initiated and processed autonomously by the DPU, which avoids frequent switching of the host CPU from normal tasks to interrupt handlers. Furthermore, data interaction takes place over the DPU's internal bus and network interface, significantly reducing the number of data transfers across the PCIe bus and improving PCIe bandwidth utilization.
[0028] Other features and advantages of this application will be set forth in the following description.
Attached Figure Description
[0029] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the accompanying drawings required in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0030] Figure 1 is a schematic diagram of the structure of a data processing unit (DPU) in one embodiment of this application.
[0031] Figure 2 is a flowchart of a reduction operation in one embodiment of this application.
[0032] Figure 3 is a flowchart of a broadcast operation in one embodiment of this application.
[0033] Figure 4 is a schematic diagram of a distributed system in one embodiment of this application.
[0034] Reference numerals in the figures:
[0035] 1-First processor; 2-First RDMA engine; 3-First data buffer.
Detailed Implementation
[0036] The detailed description given with reference to the accompanying drawings is intended to illustrate the presently preferred embodiments of this application and is not intended to represent the only forms in which this application can be implemented. It should be understood that the same or equivalent functions may be achieved by different embodiments that are intended to be included within the scope of this application.
[0037] As shown in Figure 1, one embodiment of this application provides a data processing unit (DPU) applied to a network node of a distributed system, including a first processor, a first RDMA engine, and a first data buffer.
[0038] Specifically, the DPU, acting as a communication coprocessor for network nodes, is designed to offload the communication load from the host CPU and GPU of the local node. The first processor is a general-purpose CPU core integrated within the DPU, used to run the XCCL (Cross-Architecture Collective Communication Library) software, parsing collective communication instructions and performing computational tasks such as reduction. The first RDMA engine is responsible for handling network packet transmission and reception as well as direct memory access (DMA) operations, capable of directly reading and writing GPU memory or the DPU's internal first data buffer. The first data buffer is the DPU's on-chip memory (such as SRAM) or mounted local memory, used to temporarily store intermediate data during collective communication, avoiding frequent occupancy of the host's memory or GPU memory. For example, in a distributed training system with four computing nodes, each node deploys a DPU network card. The DPU connects to the host node via a PCIe interface and interconnects with DPUs on other nodes via a network interface. It should be noted that both the first processor on the DPU side and the host CPU on the local node communicate with the first RDMA engine through queue pairs (QPs).
[0039] It should be noted that the distributed system includes multiple network nodes, which can be divided into initial nodes, intermediate nodes and root nodes according to their position / role. The initial node is the network node that initiates reduction or ends broadcast, the root node is the network node that ends reduction or initiates broadcast, and the intermediate node is the intermediate relay node between the initial node and the root node. In this embodiment, the DPU is specifically applied to the intermediate node.
[0040] The first RDMA engine is used to receive, during the reduction phase, a first network packet sent by the previous network node, parse the first network packet to obtain first data, write the first data into the first data buffer, and report a first receive notification to the first processor.
[0041] Specifically, in the Reduce phase of the Allreduce operation in distributed training, the previous network node sends a network packet carrying a partial reduction result. The first RDMA engine of this node receives this first network packet, parses the packet header and payload, and extracts the first data (such as the partial data calculated by the previous node). The first RDMA engine does not write this data to the local GPU, but directly writes it to the first data buffer inside the DPU. Subsequently, the first RDMA engine generates a first completion queue element (receive notification), which contains the address pointer of the data in the buffer, and puts it into the completion queue (CQ) to notify the first processor that new data has arrived to be processed. For example, the first data could be an array fragment a_0 sent by the previous node.
[0042] The first processor is configured to run a preset cross-architecture collective communication library software program; the cross-architecture collective communication library software program is configured to, upon detecting the first receive notification, read the first data from the first data buffer according to the first receive notification, read the second data from the local GPU memory, perform a reduction calculation based on the first data and the second data to obtain the third data, write the third data into the first data buffer, and send a first send notification to the first RDMA engine.
[0043] Specifically, the cross-architecture collective communication library software program of the first processor can monitor the completion queue through a polling mechanism. When a first receive notification is detected, the subsequent processing flow is triggered. The program reads the first data from the first data buffer according to the pointer in the first receive notification. At the same time, it initiates a read operation on the local GPU memory through SDMA (System-level DMA) or direct RDMA technology to obtain the second data, such as the corresponding array fragment b_0 on the local node's GPU. Next, the program uses the computing resources inside the DPU to perform a reduction calculation (such as an addition operation), merging the first data and the second data to obtain the third data, such as a_0 + b_0. After the calculation is completed, the program writes the third data back to the first data buffer and constructs a first work queue element (send notification) to request transmission, which is then posted to the send queue (SQ) of the first RDMA engine.
[0044] The first RDMA engine is used to detect the first send notification, read the third data from the first data buffer according to the first send notification, generate a second network packet according to the third data, and send the second network packet to the next network node.
[0045] Specifically, the first RDMA engine monitors the send queue (SQ). When it detects a first send notification issued by the first processor, it parses the first send notification to obtain the data address and length information. Based on this, the first RDMA engine reads the third data from the first data buffer, encapsulates it into a second network packet conforming to the network protocol, and sends the second network packet to the next network node through the physical network interface. This process is completed entirely by the DPU hardware and firmware working together, without intervention from the host CPU of the local node. For example, the first RDMA engine sends a packet encapsulating a_0 + b_0 to the next node to continue the reduction process in the Ring algorithm.
[0046] In some embodiments, the first RDMA engine is configured to, during the broadcast phase, receive a third network packet sent by the broadcast initiation node, parse the third network packet to obtain fourth data, write the fourth data into the local GPU memory, and forward the fourth data to the next network node.
[0047] Specifically, during the Broadcast operation, the first RDMA engine receives the third network packet sent by the broadcast initiating node (i.e., the root node), parses the third network packet to obtain the fourth data (i.e., the data to be broadcast), and writes the fourth data into the local GPU memory using Direct RDMA technology for use by the local GPU. Simultaneously, as an intermediate node, the first RDMA engine also forwards the fourth data to the next network node. For example, in a scenario where Rank 0 broadcasts data to Rank 1 and Rank 2, after the DPU of Rank 1 receives the data, it writes it into the local Rank 1 GPU memory and continues to forward the data to Rank 2.
[0048] In some embodiments, the first data buffer includes a first transmit buffer and a first receive buffer, wherein the first transmit buffer is used to buffer the third data and the first receive buffer is used to buffer the first data.
[0049] Specifically, by dividing the first data buffer into independent transmit and receive buffers, physical or logical isolation between the data receive and transmit paths is achieved. During the reduction phase, the first receive buffer caches the data to be reduced (first data) received from the previous network node, ensuring the continuity of data reception; the first transmit buffer caches the result data after the reduction calculation (third data), enabling the first RDMA engine to quickly read and transmit. This dual-buffer design allows the DPU to execute receive and transmit operations in parallel at the same time. For example, while the first processor reads the first data from the first receive buffer for calculation, the first RDMA engine can read the result data from the previous round of calculation from the first transmit buffer and transmit it, thus avoiding the blocking of read / write operations in a single-buffer structure, effectively improving the pipeline processing efficiency of the reduction operation, and reducing communication latency.
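The split into a receive half and a transmit half can be illustrated with a minimal sketch. It assumes a purely software model in which each half is a Python deque; the class `DualBuffer` and its methods `on_receive`, `reduce_next`, and `transmit_next` are hypothetical names chosen for illustration and do not appear in this application. The model runs sequentially, so it only shows that a new chunk can be accepted while an earlier result is still waiting to be sent; real hardware would run both sides concurrently.

```python
from collections import deque


class DualBuffer:
    """Toy model of a data buffer split into receive and transmit halves."""

    def __init__(self):
        self.rx = deque()  # receive half: chunks awaiting reduction
        self.tx = deque()  # transmit half: reduced chunks awaiting send

    def on_receive(self, chunk):
        # RDMA engine side: store an incoming chunk (the "first data").
        self.rx.append(chunk)

    def reduce_next(self, local_chunk):
        # Processor side: element-wise sum of a received chunk and local data.
        received = self.rx.popleft()
        self.tx.append([a + b for a, b in zip(received, local_chunk)])

    def transmit_next(self):
        # RDMA engine side: hand the oldest reduced chunk to the network.
        return self.tx.popleft()


buf = DualBuffer()
buf.on_receive([1, 2])        # round 0 arrives
buf.reduce_next([10, 20])     # round 0 result moves to the transmit half
buf.on_receive([3, 4])        # round 1 arrives while round 0 is still queued
sent_round0 = buf.transmit_next()
buf.reduce_next([30, 40])
sent_round1 = buf.transmit_next()
```

Because the two halves are separate queues, accepting round 1 never overwrites the pending round 0 result, which is the property the dual-buffer design relies on.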
[0050] In some embodiments, the cross-architecture collective communication library software program is used to read the second data from local GPU memory via direct memory access by calling the system-level DMA module.
[0051] Specifically, the local GPU and DPU are connected via the PCIe bus. Traditional data transfer methods often require the intervention of the host CPU on the local node, using memory copy instructions to move data from GPU memory to host memory, after which the DPU reads the data from host memory. This process involves multiple data copies and bus hops. In this embodiment, the first processor runs a cross-architecture collective communication library software program, which initiates a direct memory access transaction targeting the GPU memory base address by configuring a system-level DMA (SDMA) module. The SDMA module can directly resolve the address mapping of GPU memory, read the second data across the PCIe bus, and move it to the first data buffer inside the DPU. This process is executed entirely by the hardware DMA engine; the host CPU on the local node does not take part in the data transfer, and the data does not need to be staged in host memory. This achieves zero-copy data transfer, significantly reducing CPU load and PCIe bus communication latency.
[0052] An example of the reduction operation procedure is shown in Figure 2. Taking the Reduce phase of the Ring algorithm as an example, it demonstrates how the DPU of an intermediate network node (e.g., Rank 1) offloads the reduce operation. Assuming the previous node is Rank 0 (the initial node) and the next node is Rank 2, the process steps are as follows:
[0053] (1.1) Receiving the first network packet: The first RDMA engine receives the first network packet sent by the previous network node (Rank 0);
[0054] (1.2) Parse and cache data: The first RDMA engine parses the first network packet to obtain the first data and writes the first data into the first data buffer inside the DPU;
[0055] (1.3) Reporting completion event: The first RDMA engine reports the first receive notification to the first processor;
[0056] (1.4) First processor monitoring event: The first processor detects the first receive notification;
[0057] (1.5) Reading the first data: The first processor reads the first data from the first data buffer according to the first receive notification;
[0058] (1.6) Reading local GPU data: The first processor reads the second data (the data that this node participates in the calculation) from the local GPU memory;
[0059] (1.7) Perform reduction calculation: The first processor performs reduction calculation (such as addition operation) based on the first data and the second data to obtain the third data;
[0060] (1.8) Write to buffer: The first processor writes the third data to the first data buffer;
[0061] (1.9) Sending instruction: The first processor sends the first send notification to the first RDMA engine;
[0062] (1.10) Monitoring the send instruction: The first RDMA engine detects the first send notification;
[0063] (1.11) Reading the third data: The first RDMA engine reads the third data from the first data buffer according to the first send notification;
[0064] (1.12) Generate and send packet: The first RDMA engine generates a second network packet based on the third data and sends the second network packet to the next network node (Rank 2).
[0065] If this node is the last node in the reduction process (i.e., the root node), step (1.12) is replaced by the following: the first RDMA engine writes the third data obtained from the final reduction into the local GPU memory, thus ending the reduction process.
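Steps (1.1) to (1.12) can be traced with a minimal software sketch. It assumes that the completion queue and send queue are modeled as Python deques, that the data buffer is a dictionary with two slots, and that the reduction is element-wise addition; the class `IntermediateDpu` and its method names are hypothetical and only stand in for the hardware and library roles described above.

```python
from collections import deque


class IntermediateDpu:
    """Toy model of the DPU on an intermediate node (e.g. Rank 1)."""

    def __init__(self, gpu_memory):
        self.gpu_memory = gpu_memory      # stands in for local GPU memory
        self.data_buffer = {}             # first data buffer, keyed by slot name
        self.completion_queue = deque()   # RDMA engine -> processor
        self.send_queue = deque()         # processor -> RDMA engine

    def rdma_receive(self, packet):
        # (1.1)-(1.3): parse the packet, cache the payload, report a notification.
        self.data_buffer["rx"] = packet["payload"]
        self.completion_queue.append({"slot": "rx"})

    def processor_poll(self):
        # (1.4)-(1.9): detect the notification, read both operands, reduce,
        # write the result back, and post a send notification.
        notification = self.completion_queue.popleft()
        first = self.data_buffer[notification["slot"]]
        second = self.gpu_memory          # the source reads this via SDMA
        third = [a + b for a, b in zip(first, second)]
        self.data_buffer["tx"] = third
        self.send_queue.append({"slot": "tx", "length": len(third)})

    def rdma_send(self):
        # (1.10)-(1.12): pick up the send notification and build the next packet.
        notification = self.send_queue.popleft()
        return {"payload": self.data_buffer[notification["slot"]]}


dpu = IntermediateDpu(gpu_memory=[10.0, 20.0, 30.0])   # b_0 on Rank 1
dpu.rdma_receive({"payload": [1.0, 2.0, 3.0]})         # a_0 from Rank 0
dpu.processor_poll()
second_packet = dpu.rdma_send()                        # a_0 + b_0 for Rank 2
```

In this sketch the local GPU data is only read and never overwritten, which mirrors the point that intermediate results stay inside the DPU during the reduction phase.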
[0066] An example of the broadcast operation procedure is shown in Figure 3. Taking the Broadcast phase of the Ring algorithm as an example, it demonstrates how the DPU of an intermediate network node (e.g., Rank 1) offloads the broadcast operation. Assuming the previous node is Rank 0 (the root node) and the next node is Rank 2, the process steps are as follows:
[0067] (2.1) Receiving the third network packet: The first RDMA engine receives the third network packet sent by the previous network node (Rank 0);
[0068] (2.2) Parsing and obtaining data: The first RDMA engine parses the third network packet to obtain the fourth data;
[0069] (2.3) Write to local GPU memory: The first RDMA engine writes the fourth data to the local GPU memory (for use by the application of this node);
[0070] In hardware logic, step (2.3) can be executed in parallel or completed quickly through internal data copying, without intervention from the host CPU of this node, thereby saving PCIe bandwidth and CPU interrupt overhead.
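The broadcast steps, together with the forwarding behavior described for intermediate nodes, can be sketched as follows. This is a minimal model under stated assumptions: packets are dictionaries, GPU memory is a Python list, and the function `broadcast_step` and its `is_last_node` flag are hypothetical names introduced only for illustration.

```python
def broadcast_step(packet, local_gpu_memory, is_last_node):
    """Handle one broadcast packet on a node; return the packet to forward, if any."""
    fourth_data = packet["payload"]          # (2.1)-(2.2): receive and parse
    local_gpu_memory[:] = fourth_data        # (2.3): write to local GPU memory
    if is_last_node:
        return None                          # end of the broadcast chain
    return {"payload": list(fourth_data)}    # forward to the next network node


result = [111.0, 222.0]                      # reduction result held by the root
gpu_rank1, gpu_rank2 = [0.0, 0.0], [0.0, 0.0]
forwarded = broadcast_step({"payload": result}, gpu_rank1, is_last_node=False)
tail = broadcast_step(forwarded, gpu_rank2, is_last_node=True)
```

After both calls, the GPU memory of Rank 1 and Rank 2 holds the same reduction result, and the last node forwards nothing further.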
[0071] As shown in Figure 4, another embodiment of this application provides a distributed system including an initial node, a root node, and at least one intermediate node; the intermediate node is equipped with a data processing unit (DPU) as described in the above embodiments.
[0072] Specifically, in the reduction phase, the initial node initiates the reduction process. The relevant data for the reduction operation is transmitted from the initial node through the reduction operations of multiple intermediate nodes in sequence, and finally to the root node for the final reduction operation and to end the reduction process. In the broadcast phase, the root node initiates the broadcast of the final reduction result to at least one intermediate node and the initial node.
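The end-to-end data flow of the two phases can be checked with a short sketch. It assumes element-wise addition as the reduction and models each node's data as a Python list; the function `chain_allreduce` is a hypothetical name, and the fourth fragment d_0 is an illustrative value added here for the root node. The sketch captures only the arithmetic result of the chain, not the packet handling inside each DPU.

```python
def chain_allreduce(per_node_data):
    """Reduce along initial -> intermediate(s) -> root, then broadcast back.

    per_node_data[0] belongs to the initial node and per_node_data[-1] to
    the root node; every entry is a list of equal length.
    """
    # Reduction phase: the running partial sum travels node by node.
    partial = list(per_node_data[0])
    for local in per_node_data[1:]:
        partial = [p + x for p, x in zip(partial, local)]
    # Broadcast phase: the root's final result is written to every node.
    return [list(partial) for _ in per_node_data]


ranks = [[1, 2], [10, 20], [100, 200], [1000, 2000]]   # a_0, b_0, c_0, d_0
after = chain_allreduce(ranks)
```

With four nodes, every node ends up holding a_0 + b_0 + c_0 + d_0, which is the state the broadcast phase is meant to reach.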
[0073] In some embodiments, the initial node is also provided with a data processing unit (DPU) for collective communication offloading. The data processing unit of the initial node includes a second processor, a second RDMA engine, and a second data buffer.
[0074] The second RDMA engine is used to receive, during the reduction phase, a second send notification issued by the local node host, parse the second send notification to obtain a payload, write the payload into the second data buffer, and report a second receive notification to the second processor, wherein the payload contains an All-Reduce operation instruction.
[0075] Specifically, at the start of the collective communication, the local host CPU needs to convey its communication intent to the DPU. The local host CPU constructs a second send notification for an RDMA WRITE operation. This second send notification carries a payload, which is not ordinary user data but a software-defined communication control command. In this embodiment, it includes an All-Reduce operation instruction. After receiving the second send notification from the local host, the second RDMA engine parses the second send notification to obtain the payload (i.e., the All-Reduce operation instruction), writes it into a specific command area of the second data buffer, and reports a second receive notification to the second processor in the DPU to notify it that a new command has arrived.
[0076] The second processor is used to run a preset cross-architecture collective communication library software program; when the second receive notification is detected, the cross-architecture collective communication library software program reads the payload from the second data buffer according to the second receive notification, parses the payload to obtain the All-Reduce operation instruction, executes the All-Reduce operation instruction, reads the fifth data from the local GPU memory, writes the fifth data into the second data buffer, and sends a third send notification to the second RDMA engine.
[0077] Specifically, after detecting the second receive notification, the cross-architecture collective communication library software program of the second processor reads the payload from the second data buffer and parses the All-Reduce operation instruction within it. This instruction specifies the operation type (such as Reduce), the address of the data in the GPU, the data length, and other information. The program executes the All-Reduce operation instruction, configures the corresponding data structure, reads the fifth data (i.e., the data this node contributes to the computation) from the local GPU memory via SDMA, and writes it into the second data buffer. Subsequently, the program generates a third send notification and sends it to the second RDMA engine, instructing it to send the fifth data to the next node in the Ring topology, which initiates the first step of the Ring process.
[0078] It should be noted that in this embodiment, the cross-architecture collective communication library software program of the DPU performs collective communication operations by parsing the payload carried in the notification issued by the local node host. The payload is entirely software-defined and can be flexibly extended in the future.
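Because the payload is software-defined, its layout is a design choice of the communication library. The following minimal sketch shows one possible encoding of the fields named above (operation type, GPU data address, data length). The byte layout, the constant `OP_ALLREDUCE_SUM`, and the functions `encode_command` and `decode_command` are assumptions made for illustration only; the application does not specify a wire format.

```python
import struct

# Hypothetical layout: opcode (u8), 3 pad bytes, length (u32), GPU address (u64),
# little-endian, 16 bytes in total.
PAYLOAD_FORMAT = "<BxxxIQ"
OP_ALLREDUCE_SUM = 1   # illustrative opcode value, not taken from the source


def encode_command(opcode, gpu_address, length):
    """Host side: pack a control command into the notification payload."""
    return struct.pack(PAYLOAD_FORMAT, opcode, length, gpu_address)


def decode_command(payload):
    """DPU side: recover the fields the communication library acts on."""
    opcode, length, gpu_address = struct.unpack(PAYLOAD_FORMAT, payload)
    return {"opcode": opcode, "gpu_address": gpu_address, "length": length}


payload = encode_command(OP_ALLREDUCE_SUM, 0x7F0000001000, 4096)
command = decode_command(payload)
```

A fixed-size binary header like this is cheap to parse on the DPU processor, and new fields could be appended in a later revision, which is one way the extensibility mentioned above could be realized.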
[0079] The second RDMA engine is used to detect the third send notification, read the fifth data from the second data buffer according to the third send notification, generate a fourth network packet according to the fifth data, and send the fourth network packet to the next network node to initiate the reduction process.
[0080] Specifically, when the second RDMA engine detects the third send notification issued by the cross-architecture collective communication library software program of the second processor, it reads the fifth data from the second data buffer according to the instructions in the third send notification. The second RDMA engine uses the fifth data to generate a fourth network packet and sends it to the next network node, thereby formally starting the reduction process of the entire collective communication. This step marks the handover of control of the communication from the host CPU of this node to the DPU; all subsequent communication steps (such as reduction and forwarding at intermediate nodes) are completed autonomously and cooperatively by the DPUs of the respective nodes.
[0081] In some embodiments, the second RDMA engine is configured to receive a third network packet sent by the root node during the broadcast phase, parse the third network packet to obtain the fourth data, and write the fourth data into the local GPU memory.
[0082] Specifically, in the final stage of the Broadcast operation, when the second RDMA engine of the initial node (as the last receiving node of the broadcast link) receives the third network message sent by the root node, it parses the message to obtain the fourth data, that is, the reduction result, and writes it into the local GPU memory through Direct RDMA technology.
[0083] In some embodiments, the second data buffer includes a second transmit buffer and a second receive buffer, the second transmit buffer being used to buffer the fifth data and the second receive buffer being used to buffer the fourth data.
[0084] In some embodiments, the root node is also provided with a data processing unit for offloading the collective communication operation. The data processing unit of the root node includes a third processor, a third RDMA engine and a third data buffer.
[0085] The third RDMA engine is used to receive the fifth network message sent by its corresponding intermediate node during the reduction phase, parse the fifth network message to obtain the seventh data, write the seventh data into the third data buffer, and report the third reception notification to the third processor.
[0086] Specifically, at the end of the Reduce phase of the Allreduce operation, when this node is the last node in the reduction chain (or the final aggregation node of a data block), the third RDMA engine receives the fifth network message sent by the previous network node. This fifth network message carries the accumulated sum (seventh data) after reduction by all preceding nodes. The third RDMA engine parses the message to obtain the seventh data and writes it into the third data buffer. Subsequently, the third RDMA engine generates a third receive notification and reports it to the third processor, informing the third processor that it has obtained the complete reduction results of all preceding nodes and is ready to perform the final local reduction operation. For example, the DPU of Rank 3 receives the data a_0+b_0+c_0 from Rank 2, stores it in the third data buffer, and notifies the third processor.
[0087] The third processor is used to run a preset cross-architecture collective communication library software program. When the third receive notification is detected, the cross-architecture collective communication library software program reads the seventh data from the third data buffer according to the third receive notification, reads the sixth data from the local GPU memory, performs a reduction calculation based on the sixth data and the seventh data to obtain the fourth data, and calls the system-level DMA module to write the fourth data into the local GPU memory, thus ending the reduction process.
[0088] Specifically, after detecting the third receive notification, the cross-architecture collective communication library software program of the third processor reads the seventh data from the third data buffer, and at the same time reads the sixth data (e.g., d_0) held by this node from the local GPU memory through the system-level DMA module (SDMA). The program performs the final reduction calculation, merging the sixth and seventh data to obtain the fourth data (e.g., a_0+b_0+c_0+d_0), which is the global reduction result. The program then calls the system-level DMA module to write the fourth data to the local GPU memory via direct memory access, thus ending the reduction process. In another embodiment, the fourth data can be written back to the third data buffer, and a fourth send notification can be sent to the third RDMA engine, instructing it to write the final result back to GPU memory to update local training parameters and to construct network packets for broadcast to other nodes in the distributed system. In that case, when the third RDMA engine detects the fourth send notification sent by the third processor, it reads the fourth data from the third data buffer according to the notification. At this point, no further forwarding to the network is needed; the third RDMA engine directly writes the fourth data to a specified address in local GPU memory using Direct RDMA or SDMA technology, which ends the reduction process. Thus, the reduction process for this data block is fully completed at this node, and the data is ready for use in the subsequent Broadcast phase or by the application layer.
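The arithmetic of the reduction phase for one data block can be sketched as follows. This is a simplified Python model under stated assumptions, not the implementation of the embodiments: the function name reduce_chain is hypothetical, element-wise addition is assumed as the reduction operator, and plain lists stand in for GPU memory and network packets.

```python
from typing import List, Tuple


def reduce_chain(local_blocks: List[List[int]]) -> Tuple[List[List[int]], List[int]]:
    """Model the reduction phase for one data block along a chain of ranks.

    local_blocks[i] stands in for the block held in the GPU memory of rank i.
    Rank 0 plays the initial node, the last rank plays the root node, and the
    ranks in between play intermediate nodes.

    Returns the list of partial sums sent between neighbours and the final
    result computed at the root.
    """
    messages = []
    partial = list(local_blocks[0])  # initial node forwards its own block
    for local in local_blocks[1:]:
        messages.append(list(partial))  # packet sent to the next node
        # The receiving DPU adds its local block to the received partial sum.
        partial = [r + l for r, l in zip(partial, local)]
    return messages, partial


# Blocks a_0, b_0, c_0, d_0 held by Rank 0 to Rank 3 (values are illustrative).
a_0, b_0, c_0, d_0 = [1, 2], [10, 20], [100, 200], [1000, 2000]
messages, result = reduce_chain([a_0, b_0, c_0, d_0])
```

In this model, messages[-1] corresponds to the seventh data (a_0+b_0+c_0) that Rank 3 receives from Rank 2, and result corresponds to the fourth data (a_0+b_0+c_0+d_0).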
[0089] In some embodiments, the third RDMA engine is configured to generate the third network packet based on the fourth data during the broadcast phase, and send the third network packet to the initial node or the intermediate node.
[0090] Specifically, at the start of the Broadcast operation, the root node's DPU initiates the broadcast. The third RDMA engine encapsulates the fourth data (i.e., the reduction result) to generate a third network message, and sends this third network message to other nodes in the distributed system (the initial node or intermediate nodes) through the network interface, thus formally initiating the Broadcast process. In this step, the local node acts as the broadcast source.
[0091] In some embodiments, the third data buffer includes a third transmit buffer and a third receive buffer, wherein the third transmit buffer is used to buffer the fourth data and the third receive buffer is used to buffer the seventh data.
[0092] It should be noted that the data processing units (DPUs) of the initial node, intermediate node, and root node can be integrated into a single general-purpose data processing unit. That is, this single data processing unit can be installed on nodes of different types in the distributed system according to usage requirements. This single data processing unit can be designed to include a processor, an RDMA engine, and a data buffer. By designing a cross-architecture collective communication library software program, the processor can implement the functions of the first processor, the second processor, and the third processor when running that program. The RDMA engine integrates the functions of the first RDMA engine, the second RDMA engine, and the third RDMA engine. The data buffer integrates the functions of the first data buffer, the second data buffer, and the third data buffer.
[0093] Specifically, the aforementioned general-purpose data processing unit adopts a unified hardware architecture design, possessing the flexibility to adapt to different network topologies. At the hardware level, its integrated RDMA engine has full-duplex communication capabilities and the ability to process different message types (such as data packets and command packets); the integrated data buffer capacity is designed to cover the storage needs of the initial node, intermediate node, and root node at different stages; and the integrated processor has general instruction execution capabilities. In practical applications, the same hardware entity can dynamically switch between the roles of initial node, intermediate node, or root node simply by configuring the running mode of the cross-architecture collective communication library software program according to the actual position of the node in the network topology (such as the Rank ID in the Ring algorithm). For example, in the initialization phase of a distributed system, node A is configured as the initial node, running the logic flow of the second processor; while in another communication task, node A may be configured as an intermediate node, running the logic flow of the first processor. This design not only improves the reusability of hardware modules and reduces system development and maintenance costs, but also enhances the scalability and deployment flexibility of the distributed system.
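One way to picture this role switching is a small selection function driven by the Rank ID. The Python sketch below is an assumption-laden illustration: the names Role and role_for_rank, the start_rank parameter, and the rule that the first rank of a reduction chain is the initial node and the last rank is the root node are introduced here for the example and are not prescribed by the embodiments.

```python
from enum import Enum


class Role(Enum):
    INITIAL = "initial"            # runs the logic flow of the second processor
    INTERMEDIATE = "intermediate"  # runs the logic flow of the first processor
    ROOT = "root"                  # runs the logic flow of the third processor


def role_for_rank(rank: int, world_size: int, start_rank: int = 0) -> Role:
    """Pick the running mode of a DPU from its position in a reduction chain.

    Assumed rule: the chain starts at start_rank and visits every rank once
    in ring order, so the last rank visited acts as the root node.
    """
    if world_size < 2:
        raise ValueError("a reduction chain needs at least two ranks")
    if not 0 <= rank < world_size:
        raise ValueError("rank out of range")
    position = (rank - start_rank) % world_size
    if position == 0:
        return Role.INITIAL
    if position == world_size - 1:
        return Role.ROOT
    return Role.INTERMEDIATE


# The same node (rank 0) takes different roles in two communication tasks.
role_task_1 = role_for_rank(0, world_size=4, start_rank=0)
role_task_2 = role_for_rank(0, world_size=4, start_rank=3)
```

Here rank 0 is the initial node in the first task and an intermediate node in the second, while the hardware stays the same and only the software running mode changes.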
[0094] In the distributed system of this application embodiment, the host-side CPU and DPU-side processors of the initial node, intermediate nodes, and root node interact with the RDMA engine using a standard RDMA queue communication mechanism, in which work queue elements (WQEs) and completion queue elements (CQEs) decouple instruction issuance from status feedback. Both the host-side CPU and the DPU-side processors are connected to the RDMA engine through their respective queue pairs (QPs). Each queue pair includes a send queue (SQ) and a receive queue (RQ) to carry communication instructions. In addition, each processor has corresponding completion queues (SQ CQ and RQ CQ) to receive completion events from the RDMA engine.
[0095] The host-side CPU, acting as the initiator of the communication task, is primarily responsible for issuing control commands that express the intent of the collective communication. Specifically, during the initiation phase of the collective communication process, the host-side CPU constructs a send notification whose payload carries the Allreduce instruction. This send notification is a work queue element (WQE). The host-side CPU submits this WQE to the send queue of the RDMA engine. After detecting the WQE, the RDMA engine parses the payload and executes the corresponding hardware operations, such as writing the instruction data into the data buffer inside the DPU. This enables the host-side CPU to delegate tasks to the DPU-side processor, avoiding frequent intervention by the host-side CPU in subsequent data transfer processes.
[0096] As the executor of the collective communication tasks, the DPU-side processor has a bidirectional receive and send notification interaction mechanism with the RDMA engine. On one hand, after completing network packet reception or host instruction parsing, the RDMA engine generates a receive notification, which is a completion queue element (CQE). The RDMA engine places this CQE into the completion queue. The DPU-side processor, after detecting this CQE by polling the completion queue (RQ CQ), reads data or instructions from the data buffer based on the pointer information carried in the CQE, thereby triggering subsequent reduction calculations or data processing flows. On the other hand, when the DPU-side processor completes data processing and is ready to send data, it constructs a send notification and sends it to the RDMA engine's send queue. This send notification is a work queue element (WQE). The WQE indicates the address and length of the data to be sent. The RDMA engine uses this information to perform data reading and network encapsulation and transmission operations, enabling autonomous advancement of the communication process.
[0097] Through the QP / CQ communication mechanism described above, the host-side CPU only needs to issue a WQE at the beginning of the process, after which its computing resources are released and subsequent communication control is transferred entirely to the DPU side. The DPU-side processor and the RDMA engine, in turn, interact through WQEs and CQEs, which separates control flow from data flow and greatly reduces the interaction overhead across the PCIe bus and the interrupt load on the host CPU.
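The WQE / CQE interaction on the DPU side of an intermediate node can be modelled in software as shown below. This Python sketch is only a simplified model under stated assumptions: the classes RdmaEngineModel, CQE and WQE, the function dpu_processor_step, the buffer offsets, and the use of element-wise addition are all hypothetical, and real RDMA hardware queues carry more fields than shown.

```python
from collections import deque
from dataclasses import dataclass
from typing import List


@dataclass
class CQE:
    buffer_offset: int  # where the received data sits in the data buffer
    length: int         # number of elements received


@dataclass
class WQE:
    buffer_offset: int  # where the data to send sits in the data buffer
    length: int         # number of elements to send


class RdmaEngineModel:
    """Software stand-in for the RDMA engine and the DPU data buffer."""

    def __init__(self) -> None:
        self.send_queue = deque()  # SQ: WQEs posted by the DPU processor
        self.recv_cq = deque()     # RQ CQ: CQEs reported to the DPU processor
        self.data_buffer = {}      # offset -> list of values
        self.wire = []             # packets sent to the next network node

    def on_packet(self, offset: int, values: List[int]) -> None:
        """Receive path: buffer the data, then report a receive notification."""
        self.data_buffer[offset] = list(values)
        self.recv_cq.append(CQE(offset, len(values)))

    def process_send_queue(self) -> None:
        """Send path: read the data named by each WQE and put it on the wire."""
        while self.send_queue:
            wqe = self.send_queue.popleft()
            self.wire.append(self.data_buffer[wqe.buffer_offset][: wqe.length])


def dpu_processor_step(engine: RdmaEngineModel, gpu_block: List[int], tx_offset: int) -> bool:
    """Poll the completion queue once; reduce and post a WQE if a CQE is found."""
    if not engine.recv_cq:
        return False
    cqe = engine.recv_cq.popleft()
    received = engine.data_buffer[cqe.buffer_offset][: cqe.length]
    reduced = [r + g for r, g in zip(received, gpu_block)]
    engine.data_buffer[tx_offset] = reduced
    engine.send_queue.append(WQE(tx_offset, len(reduced)))
    return True


engine = RdmaEngineModel()
engine.on_packet(offset=0, values=[1, 2, 3])
handled = dpu_processor_step(engine, gpu_block=[10, 20, 30], tx_offset=64)
engine.process_send_queue()
idle = dpu_processor_step(engine, gpu_block=[10, 20, 30], tx_offset=64)
```

In this model the host CPU does not appear at all after the task has been issued, which reflects the point that control of the flow rests with the DPU-side processor and the RDMA engine.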
[0098] Another embodiment of this application provides a chip that includes the data processing unit described in the above embodiments of this application.
[0099] Specifically, the chip in this embodiment can be a system-on-a-chip (SoC) or an application-specific integrated circuit (ASIC). This chip integrates the data processing unit (DPU) described in the previous embodiments. By integrating the processor, RDMA engine, and data buffer into the same chip, the physical distance and latency of data transmission can be significantly reduced, and the internal bus bandwidth and energy efficiency can be improved. For example, this chip can serve as the main chip of a smart network interface card (NIC), inserted into a server's PCIe slot, providing the server with high-performance collective communication offloading capabilities.
[0100] The various embodiments of this application have been described above. These descriptions are exemplary, not exhaustive, and are not limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.
Claims
1. A data processing unit, applied to an intermediate node in a distributed system, characterized in that it includes a first processor, a first RDMA engine, and a first data buffer; the first RDMA engine is used to receive a first network packet sent by the previous network node during the reduction phase, parse the first network packet to obtain first data, write the first data into the first data buffer, and report a first reception notification to the first processor; the first processor is configured to, upon receiving the first receiving notification, read the first data from the first data buffer according to the first receiving notification, read the second data from the local GPU memory, perform a reduction calculation based on the first data and the second data to obtain the third data, write the third data into the first data buffer, and send a first sending notification to the first RDMA engine; the first RDMA engine is configured to, upon receiving the first transmission notification, read the third data from the first data buffer according to the first transmission notification, generate a second network packet according to the third data, and send the second network packet to the next network node, wherein the data processing unit and the local GPU memory belong to the same intermediate node.
2. The data processing unit according to claim 1, characterized in that, The first RDMA engine is used to receive a third network packet sent by the root node during the broadcast phase, parse the third network packet to obtain fourth data, and write the fourth data into the local GPU memory. The root node is the broadcast initiation node.
3. The data processing unit according to claim 1, characterized in that, The first data buffer includes a first transmit buffer and a first receive buffer. The first transmit buffer is used to buffer the third data, and the first receive buffer is used to buffer the first data.
4. The data processing unit according to claim 1, characterized in that, The first processor runs a cross-architecture collective communication library software program to parse the first received notification to obtain a reduction command, and in response to the reduction command, calls the system-level DMA module to read the second data from the local GPU memory.
5. A distributed system, characterized in that, It includes an initial node, a root node, and at least one intermediate node; the intermediate node is provided with a data processing unit as described in any one of claims 1 to 4; The initial node is used during the reduction phase to encapsulate the fifth data in its local GPU memory into a fourth network packet and send the fourth network packet to the corresponding intermediate node. The root node is used during the reduction phase to receive the fifth network message sent by its corresponding intermediate node, and to perform reduction calculations on the sixth data in its local GPU memory and the seventh data in the fifth network message to obtain the fourth data.
6. The system according to claim 5, characterized in that, The root node is also used to encapsulate the fourth data into the third network message during the broadcast phase, and send the third network message to the at least one intermediate node and the initial node; The initial node is also used during the broadcast phase to receive the third network message sent by the root node, parse the third network message to obtain the fourth data, and write the fourth data into its local GPU memory.
7. The distributed system according to claim 5, characterized in that, The initial node is equipped with a data processing unit, which includes a second processor, a second RDMA engine, and a second data buffer. The second RDMA engine is used to receive a second transmission notification issued by the local node host during the reduction phase, parse the second transmission notification to obtain a payload, write the payload into the second data buffer, and report a second reception notification to the second processor; wherein the payload contains full reduction operation instructions; The second processor is used to run a preset cross-architecture collective communication library software program; when the second receiving notification is received, the cross-architecture collective communication library software program is used to read the payload from the second data buffer according to the second receiving notification, parse the payload to obtain a full reduction operation instruction, execute the full reduction operation instruction, read the fifth data from the local GPU memory, write the fifth data into the second data buffer, and send a third sending notification to the second RDMA engine; The second RDMA engine is used to, upon receiving the third transmission notification, read the fifth data from the second data buffer according to the third transmission notification, generate the fourth network packet according to the fifth data, and send the fourth network packet to the next network node to initiate the reduction process.
8. The distributed system according to claim 7, characterized in that, The second RDMA engine is used to receive the third network packet sent by the root node during the broadcast phase, parse the third network packet to obtain the fourth data, and write the fourth data into the local GPU memory.
9. The distributed system according to claim 8, characterized in that, The root node is provided with a data processing unit, which includes a third processor, a third RDMA engine, and a third data buffer. The third RDMA engine is used to receive the fifth network message sent by its corresponding intermediate node during the reduction phase, parse the fifth network message to obtain the seventh data, write the seventh data into the third data buffer, and report the third reception notification to the third processor. The third processor is used to run a preset cross-architecture collective communication library software program. When the third receiving notification is received, the cross-architecture collective communication library software program reads the seventh data from the third data buffer and the sixth data from the local GPU memory according to the third receiving notification. Based on the sixth data and the seventh data, it performs a reduction calculation to obtain the fourth data, and calls the system-level DMA module to write the fourth data into the local GPU memory, thus ending the reduction process.
10. The distributed system according to claim 9, characterized in that, The third RDMA engine is used to generate the third network message based on the fourth data during the broadcast phase, and send the third network message to the initial node or the intermediate node.
11. A chip, characterized in that, it includes the data processing unit according to any one of claims 1 to 4.