A network card and a distributed system

By offloading the reduction computation to the network card of the intermediate node in the distributed system, the problem of excessive GPU memory and PCIe bandwidth consumption during All Reduce is solved, thus improving the training and inference performance of large models.

CN122247958A · Pending · Publication Date: 2026-06-19 · SHENZHEN JAGUAR MICROSYSTEMS CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: SHENZHEN JAGUAR MICROSYSTEMS CO LTD
Filing Date: 2026-05-22
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

In distributed training of large models, the All Reduce process consumes too much local GPU memory and PCIe bandwidth, leading to system bottlenecks and increased data processing latency, which affects training and inference performance.

Method used

Design a network interface card (NIC) that offloads reduction computation on intermediate nodes of a distributed system, identifies packet types through a mapping table, and performs local data reading and computation, reducing GPU involvement, simplifying the interaction between the CPU and GPU, and completing data processing directly within the NIC.

Benefits of technology

It significantly reduces GPU computing power consumption, alleviates CPU load, reduces PCIe bandwidth usage, shortens communication latency, and improves the overall performance of distributed training and inference for large models.

✦ Generated by Eureka AI based on patent content.


Abstract

This application relates to a network interface card (NIC) and a distributed system. The NIC is applied to an intermediate node in the distributed system and includes: a first receiving module, which receives a first message sent by the previous-level node during the reduction phase. The first message uses an endpoint identifier pre-reserved for the reduction flow; a first message parsing module, which parses the first message and queries a preset mapping table to obtain the type information of the current node as an intermediate node, the reduction flow type information, and the address of the next-level node; a first message processing module, which reads first local data from the GPU memory of the current node according to the type and flow type indications, performs reduction calculation with the payload of the received first message, and generates a second message based on the reduction result; and a first sending module, which sends the second message to the next-level node. This application realizes on-network reduction calculation through a smart NIC, reducing interaction with the GPU and CPU and PCIe bandwidth usage, and reducing the communication latency of the full reduction operation.

Description

Technical Field

[0001] This application relates to the field of distributed training technology for large models, specifically to a network interface card (NIC) and a distributed system.

Background Technology

[0002] In distributed training of large models (LLMs), full reduction (All Reduce) is a key operation for achieving data parallelism, keeping the model parameters on hundreds or thousands of GPUs synchronized. It first performs a reduction calculation (such as summation or averaging) on the local data of each GPU and aggregates the result at the root node, and then broadcasts the global reduction result to all devices, achieving efficient data-parallel training and parameter consistency.

[0003] In the Ring All Reduce algorithm, as shown in Figure 1, taking the All Reduce process on an eight-GPU machine as an example, there are eight network nodes, numbered from 0 to 7. Each network node contains a GPU, memory (DDR), and a network card (NIC). The processing is divided into two sub-processes: the reduction process and the broadcast process.

[0004] Referring to Figure 1, the Reduce process is as follows:

[0005] (1.1) The initial node, namely GPU 7, controls NIC 7 to initiate a network transmission, sending data a7 in DDR 7 to NIC 6 through the Ethernet switch;
(1.2) After receiving the network data, NIC 6 temporarily writes it to DDR 6 for caching through the PCIe channel, and GPU 6 is then notified of the data arrival by some means (such as interrupt or polling);
(1.3) After receiving the notification, GPU 6 knows that a7 has been temporarily cached at a certain location in DDR 6. It initiates a DDR read to load a7 from DDR 6 into the GPU, reads its own data a6 from DDR 6 into the GPU at the same time, and then executes Sum / Avg to obtain the new data a7+a6;
(1.4) GPU 6 writes the newly generated data a7+a6 back to a cache location in DDR 6, and then notifies NIC 6;
(1.5) After receiving the notification from GPU 6, NIC 6 initiates a network transmission and sends the data a7+a6 in DDR 6 to NIC 5 through the Ethernet switch.
The processing of NIC 5~1, DDR 5~1, and GPU 5~1, marked as intermediate nodes in Figure 1, is similar to steps (1.1)~(1.5) above, except that a7 in steps (1.1)~(1.5) is replaced with a7+a6, a7+a6+a5, a7+a6+a5+a4, a7+a6+a5+a4+a3, and a7+a6+a5+a4+a3+a2 respectively. NIC 0 then receives a7+a6+a5+a4+a3+a2+a1 from NIC 1. The processing of NIC 0 is similar to steps (1.2)~(1.4) above, except that a7 in steps (1.2)~(1.4) is replaced with a7+a6+a5+a4+a3+a2+a1. After GPU 0 completes the Sum / Avg calculation, a7+a6+a5+a4+a3+a2+a1+a0 is obtained in DDR 0. At this point the Sum / Avg of the data a on all 8 GPUs has been obtained, completing the Reduce process.
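The running partial sums described in steps (1.1)~(1.5) can be illustrated with a minimal Python sketch. The function name `chain_reduce` and the sample values a0..a7 = 0..7 are assumptions chosen for illustration; the sketch models only the arithmetic, not the DDR, PCIe, or network traffic.

```python
from typing import List


def chain_reduce(local: List[float]) -> List[float]:
    """Return the value each node holds after its reduction step.

    local[i] is the (hypothetical) data a_i of node i. The reduction runs
    from the highest-numbered node down to node 0, as in Figure 1.
    """
    partials = []
    running = 0.0
    for node in range(len(local) - 1, -1, -1):
        running += local[node]      # Sum step performed at this node
        partials.append(running)    # value forwarded to the next node
    return partials


# Hypothetical data: a_i = i for the eight nodes.
a = [float(i) for i in range(8)]
partials = chain_reduce(a)
# partials[0] is a7, partials[1] is a7+a6, ..., partials[-1] is the global sum.
```

The last entry corresponds to a7+a6+a5+a4+a3+a2+a1+a0 held by node 0 at the end of the Reduce process.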

[0006] After the Reduce process obtains a7+a6+a5+a4+a3+a2+a1+a0, the Broadcast process is as follows:
(2.1) GPU 0 notifies NIC 0 to start broadcast processing;
(2.2) After receiving the notification from GPU 0, NIC 0 initiates a network transmission and sends the data a7+a6+a5+a4+a3+a2+a1+a0 in DDR 0 to NIC 1 through the Ethernet switch;
(2.3) After receiving the network data, NIC 1 temporarily writes it to DDR 1 for caching through the PCIe channel, and GPU 1 is then notified of the data arrival by some means (such as interrupt or polling);
(2.4) After receiving the notification, GPU 1 knows that a7+a6+a5+a4+a3+a2+a1+a0 has been temporarily cached at a certain location in DDR 1. GPU 1 initiates a DMA transfer to move the data from the NIC's network port buffer to the DDR space;
(2.5) After the DMA transfer of GPU 1 is completed, it notifies NIC 1;
(2.6) After receiving the notification from GPU 1, NIC 1 initiates a network transmission and sends the data a7+a6+a5+a4+a3+a2+a1+a0 in DDR 1 to NIC 2 through the Ethernet switch.
The processing of NIC 2~6, DDR 2~6, and GPU 2~6, marked as intermediate nodes in Figure 1, is similar to steps (2.3)~(2.6) above. When NIC 7 receives a7+a6+a5+a4+a3+a2+a1+a0 from NIC 6, the processing of NIC 7 is similar to steps (2.3)~(2.4) above. After the DMA transfer of GPU 7 is completed, the Broadcast process is also completed.
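The hop-by-hop forwarding in steps (2.1)~(2.6) can be sketched as follows. This is a simplified model under assumptions: `chain_broadcast` is a hypothetical helper, node memories are plain list slots, and the value 28.0 stands in for the global reduction result.

```python
from typing import List, Optional, Tuple


def chain_broadcast(result: float, num_nodes: int) -> Tuple[List[Optional[float]], List[Tuple[int, int]]]:
    """Forward the global result from node 0 to node num_nodes-1, one hop at a time."""
    memory: List[Optional[float]] = [None] * num_nodes
    memory[0] = result                     # root node already holds the result
    hops = []
    for node in range(1, num_nodes):
        memory[node] = memory[node - 1]    # receive and store locally
        hops.append((node - 1, node))      # record the sender -> receiver hop
    return memory, hops


memory, hops = chain_broadcast(28.0, 8)
```

Every node ends up with the same value after seven hops, the last hop being from node 6 to node 7.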

[0007] From the above processing steps, it can be seen that in the Reduce process, taking the NIC of a typical intermediate node as an example, the following operations are required: two DDR writes (the NIC writes the network data to DDR once, and the GPU writes the computation result to DDR once); three DDR reads (the GPU reads the network data placed in DDR by the NIC once, the GPU reads the data originally in the local DDR once, and the NIC reads from DDR once to send out the data written there by the GPU); and one read and one write on the PCIe channel between the NIC and DDR. In the Broadcast process, taking the NIC of a typical intermediate node as an example, the following operations are required: two DDR writes (the NIC writes the network data to DDR once, and the GPU-initiated DMA moves data from the NIC's network buffer to DDR once); two DDR reads (the GPU-initiated DMA that moves data from the NIC's network buffer to DDR once, and the NIC reads from DDR once to send out the data written there by the GPU); and one read and one write on the PCIe channel between the NIC and DDR.
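As a quick tally of the counts listed above, the following Python sketch adds up the per-phase operations for one intermediate node. The dictionary keys and the helper `all_reduce_cost` are illustrative names, and the numbers are taken directly from the paragraph above.

```python
from collections import Counter

# Per-intermediate-node operation counts as stated in the text above.
BASELINE = {
    "reduce": Counter(ddr_writes=2, ddr_reads=3, pcie_reads=1, pcie_writes=1),
    "broadcast": Counter(ddr_writes=2, ddr_reads=2, pcie_reads=1, pcie_writes=1),
}


def all_reduce_cost(per_phase):
    """Sum the operation counts of all phases of one All Reduce."""
    total = Counter()
    for counts in per_phase.values():
        total += counts
    return total


cost = all_reduce_cost(BASELINE)
# One All Reduce at one intermediate node: 4 DDR writes, 5 DDR reads,
# 2 PCIe reads and 2 PCIe writes in the conventional flow.
```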

[0008] Therefore, completing both the Reduce and Broadcast processes of an All Reduce operation consumes a significant amount of DDR and PCIe bandwidth. This bandwidth consumption can easily lead to system bottlenecks and greatly increases data processing latency, which has a significant negative impact on the performance of large model training and inference.

Summary of the Invention

[0009] This application proposes a network interface card (NIC) and a distributed system to reduce the consumption of GPU local memory and PCIe bandwidth during All Reduce, thereby improving the performance of large model training and inference.

[0010] According to a first aspect, embodiments of this application propose a network interface card (NIC) for use in an intermediate node of a distributed system, comprising:
A first receiving module, used to receive the first message sent by the parent node corresponding to the current node in the reduction phase; wherein, the first message includes first message header information and first payload information, the first message header information includes a first source address, a first destination address and a first endpoint identifier, the first source address is the address of the node that initiated the reduction, the first destination address is the address of the current node, and the first endpoint identifier is an endpoint identifier that the distributed system has reserved in advance for the reduction stream transmission;
A first message parsing module, used to parse the first message to obtain the first message header information and the first payload information, and to query a preset first mapping table based on the first message header information to obtain the first node type information, the first flow type information and the second destination address; wherein, the first node type information indicates that the current node is an intermediate node in the reduction stage, the first flow type information indicates that the first message belongs to the reduction flow, and the second destination address is the address of the next-level node corresponding to the current node in the reduction stage;
A first message processing module, configured to read first local data from the GPU memory of the current node according to the indications of the first node type information and the first stream type information, perform reduction calculations based on the first payload information and the first local data to obtain a first reduction result, and generate a second message based on the first source address, the second destination address, the first endpoint identifier, and the first reduction result; wherein, the second message includes second message header information and second payload information, the second message header information includes the first source address, the second destination address, and the first endpoint identifier, and the second payload information includes the first reduction result;
A first sending module, used to send the second message to the next-level node corresponding to the current node in the reduction phase.
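The receive, parse, process, and send path of the intermediate-node NIC during the reduction phase can be sketched in Python as follows. This is only an illustrative model under stated assumptions: the `Message` class, the table layout keyed by (source, destination, endpoint), the string labels, the reserved value `REDUCE_EP`, and element-wise summation as the reduction are all hypothetical choices, not details given in the application.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

REDUCE_EP = 0x100  # hypothetical endpoint identifier reserved for the reduction stream


@dataclass(frozen=True)
class Message:
    src: str                    # source address (node that initiated the reduction)
    dst: str                    # destination address
    endpoint: int               # endpoint identifier
    payload: Tuple[float, ...]  # payload information


# Hypothetical layout: (src, dst, endpoint) -> (node type, flow type, next-level address)
MappingTable = Dict[Tuple[str, str, int], Tuple[str, str, str]]


def handle_reduce(first: Message, table: MappingTable, gpu_memory: Sequence[float]) -> Message:
    """Build the second message from the first message and the local GPU data."""
    node_type, flow_type, next_dst = table[(first.src, first.dst, first.endpoint)]
    if (node_type, flow_type) != ("intermediate", "reduce"):
        raise ValueError("not a reduction-stream message for an intermediate node")
    # Reduction (summation assumed here) of the payload with the first local data.
    reduced = tuple(x + y for x, y in zip(first.payload, gpu_memory))
    # Same source address and endpoint identifier, new destination address.
    return Message(first.src, next_dst, first.endpoint, reduced)


table = {("node7", "node6", REDUCE_EP): ("intermediate", "reduce", "node5")}
first = Message("node7", "node6", REDUCE_EP, (1.0, 2.0))
second = handle_reduce(first, table, [10.0, 20.0])
```

In this sketch the second message keeps the first source address and the reserved endpoint identifier and changes only the destination and the payload, which mirrors the header fields listed above.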

[0011] In some specific implementations, the first receiving module is further configured to receive a third message sent by the parent node corresponding to the current node during the broadcast phase; wherein, the third message includes third message header information and third payload information, the third message header information includes a second source address, a first destination address and a second endpoint identifier, the second source address is the address of the node that initiated the broadcast, and the second endpoint identifier is an endpoint identifier that the distributed system has reserved in advance for broadcast stream transmission.
The first message parsing module is further configured to parse the third message to obtain the third message header information and the third payload information, and query the first mapping table based on the third message header information to obtain the second node type information, the second stream type information and the third destination address; wherein, the second node type information indicates that the current node is an intermediate node in the broadcast phase, the second stream type information indicates that the third message belongs to the broadcast stream, and the third destination address is the address of the next-level node corresponding to the current node in the broadcast phase.
The first message processing module is further configured to write the third payload information into the GPU memory of the current node according to the indication of the second node type information and the second stream type information, and generate a fourth message according to the second source address, the third destination address, the second endpoint identifier and the third payload information; wherein, the fourth message includes fourth message header information and fourth payload information, the fourth message header information includes the second source address, the third destination address and the second endpoint identifier, and the fourth payload information is the same as the third payload information.
The first sending module is further configured to send the fourth message to the next-level node corresponding to the current node in the broadcast phase.
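The broadcast-phase handling at an intermediate node can be sketched in the same spirit. The `Message` class, the table layout, the string labels, and the reserved value `BROADCAST_EP` are assumptions made for illustration; GPU memory is modelled as a plain Python list.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

BROADCAST_EP = 0x101  # hypothetical endpoint identifier reserved for the broadcast stream


@dataclass(frozen=True)
class Message:
    src: str
    dst: str
    endpoint: int
    payload: Tuple[float, ...]


MappingTable = Dict[Tuple[str, str, int], Tuple[str, str, str]]


def handle_broadcast(third: Message, table: MappingTable, gpu_memory: List[float]) -> Message:
    """Store the payload locally and build the fourth message with an unchanged payload."""
    node_type, flow_type, next_dst = table[(third.src, third.dst, third.endpoint)]
    if (node_type, flow_type) != ("intermediate", "broadcast"):
        raise ValueError("not a broadcast-stream message for an intermediate node")
    gpu_memory[:] = third.payload  # write the third payload into (modelled) GPU memory
    return Message(third.src, next_dst, third.endpoint, third.payload)


table = {("node0", "node1", BROADCAST_EP): ("intermediate", "broadcast", "node2")}
gpu_memory = [0.0, 0.0]
third = Message("node0", "node1", BROADCAST_EP, (28.0, 56.0))
fourth = handle_broadcast(third, table, gpu_memory)
```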

[0012] In some specific implementations, the aforementioned network interface card also includes a DMA module. The first message processing module is configured to generate a DMA read instruction based on the indications of the first node type information and the first stream type information, and send the DMA read instruction to the DMA module; and to generate a DMA write instruction based on the indications of the second node type information and the second stream type information, and send the DMA write instruction to the DMA module. The DMA module is configured to read the first local data from the GPU memory of the current node according to the DMA read instruction and return it to the first message processing module; and to write the third payload information into the GPU memory of the current node according to the DMA write instruction.
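A minimal Python model of the read and write instructions handled by the DMA module might look like the sketch below. The instruction classes `DmaRead` and `DmaWrite`, their offset and length fields, and the use of a `bytearray` as a stand-in for GPU memory are all assumptions; the application does not specify an instruction format.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class DmaRead:
    offset: int   # hypothetical byte offset into GPU memory
    length: int   # number of bytes to read


@dataclass(frozen=True)
class DmaWrite:
    offset: int   # hypothetical byte offset into GPU memory
    data: bytes   # bytes to write (for example, broadcast payload)


class DmaModule:
    """Toy DMA engine operating on a bytearray that stands in for GPU memory."""

    def __init__(self, gpu_memory: bytearray) -> None:
        self._mem = gpu_memory

    def execute(self, instr: Union[DmaRead, DmaWrite]) -> Optional[bytes]:
        if isinstance(instr, DmaRead):
            # Reduction flow: return local data to the message processing module.
            return bytes(self._mem[instr.offset:instr.offset + instr.length])
        if isinstance(instr, DmaWrite):
            # Broadcast flow: store the received payload in GPU memory.
            self._mem[instr.offset:instr.offset + len(instr.data)] = instr.data
            return None
        raise TypeError("unknown DMA instruction")


mem = bytearray(b"abcdefgh")
dma = DmaModule(mem)
local = dma.execute(DmaRead(offset=2, length=3))
dma.execute(DmaWrite(offset=0, data=b"XY"))
```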

[0013] According to a second aspect, this application provides a distributed system including an initial node, a root node, and at least one intermediate node; the initial node and the root node are each equipped with a network interface card (NIC), and the intermediate node is equipped with a NIC as described in the embodiments of this application. The network interface card of the initial node is used, during the reduction phase, to send the second local data stored in the GPU memory of the initial node to the next-level node corresponding to the initial node. The network interface card of the root node is used to receive the corresponding message sent by the upper-level node during the reduction phase, and perform reduction calculation on the payload information of the message and the third local data stored in the GPU memory of the root node to obtain the second reduction result.

[0014] In some specific implementations, the network interface card of the root node is also used to send the second reduction result to the next-level node corresponding to the root node during the broadcast phase. The network interface card of the initial node is also used to receive the second reduction result sent by the corresponding upper-level node during the broadcast phase, and write the second reduction result into the GPU memory of the initial node.

[0015] In some specific implementations, the network interface card (NIC) of the initial node includes:
A first set communication module, used to receive a full reduction set communication command issued by the host of the initial node, read the second local data from the GPU memory of the initial node according to the full reduction set communication command, and generate an initial message according to the second local data; wherein, the initial message includes initial message header information and initial payload information, the initial message header information includes a first source address, a fourth destination address and a third endpoint identifier, the fourth destination address is the address of the root node, the third endpoint identifier is the real endpoint identifier corresponding to the reduction stream allocated by the protocol stack of the initial node, and the initial payload information is the second local data;
A second message parsing module, used to parse the initial message to obtain the initial message header information and the second local data, and to query a preset second mapping table based on the initial message header information to obtain the third node type information, the third flow type information, the fifth destination address, and the first endpoint identifier; wherein, the third node type information indicates that the initial node is the start node of the reduction phase, the third flow type information indicates that the initial message belongs to the reduction flow, and the fifth destination address is the address of the next-level node corresponding to the initial node in the reduction phase;
A second message processing module, configured to generate a fifth message based on the indications of the third node type information and the third flow type information, and according to the first source address, the fifth destination address, the first endpoint identifier, and the second local data; wherein, the fifth message includes fifth message header information and fifth payload information, the fifth message header information includes the first source address, the fifth destination address, and the first endpoint identifier, and the fifth payload information is the second local data;
A second sending module, used to send the fifth message to the next-level node corresponding to the initial node in the reduction phase.
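The header rewriting performed at the initial node, from the real endpoint identifier and root-node destination to the reserved endpoint identifier and next-level destination, can be sketched as follows. The `Message` class, the four-field table entry, and the concrete values `REAL_EP` and `REDUCE_EP` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

REAL_EP = 0x2A      # hypothetical real endpoint identifier allocated by the protocol stack
REDUCE_EP = 0x100   # hypothetical endpoint identifier reserved for the reduction stream


@dataclass(frozen=True)
class Message:
    src: str
    dst: str
    endpoint: int
    payload: Tuple[float, ...]


# Hypothetical layout: (src, dst, endpoint) -> (node type, flow type, next-level address, reserved endpoint)
SecondMappingTable = Dict[Tuple[str, str, int], Tuple[str, str, str, int]]


def rewrite_initial(initial: Message, table: SecondMappingTable) -> Message:
    """Turn the initial message (addressed to the root) into the fifth message."""
    node_type, flow_type, next_dst, reserved_ep = table[(initial.src, initial.dst, initial.endpoint)]
    if (node_type, flow_type) != ("start", "reduce"):
        raise ValueError("not a reduction-stream message at the start node")
    # Payload (the second local data) is forwarded unchanged.
    return Message(initial.src, next_dst, reserved_ep, initial.payload)


table = {("node7", "node0", REAL_EP): ("start", "reduce", "node6", REDUCE_EP)}
initial = Message("node7", "node0", REAL_EP, (7.0,))
fifth = rewrite_initial(initial, table)
```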

[0016] In some specific implementations, the network interface card of the initial node also includes a second receiving module.
The second receiving module is used to receive the sixth message sent by the parent node corresponding to the initial node during the broadcast phase; wherein, the sixth message includes sixth message header information and sixth payload information, the sixth message header information includes a second source address, a sixth destination address and a second endpoint identifier, the second source address is the address of the root node, the sixth destination address is the address of the initial node, the sixth payload information is the second reduction result, and the second endpoint identifier is an endpoint identifier pre-reserved by the distributed system for broadcast stream transmission.
The second message parsing module is further configured to parse the sixth message to obtain the sixth message header information and the sixth payload information, and query the second mapping table based on the sixth message header information to obtain the fourth node type information, the fourth flow type information and the fourth endpoint identifier; wherein, the fourth node type information indicates that the initial node is the end node of the broadcast phase, the fourth flow type information indicates that the sixth message belongs to the broadcast flow, and the fourth endpoint identifier is the real endpoint identifier corresponding to the broadcast flow allocated by the protocol stack of the initial node.
The second message processing module is further configured to generate a seventh message based on the indication of the fourth node type information and the fourth flow type information, according to the second source address, the sixth destination address, the fourth endpoint identifier, and the sixth payload information; the seventh message includes seventh message header information and seventh payload information, the seventh message header information includes the second source address, the sixth destination address, and the fourth endpoint identifier, the second source address is the address of the root node, and the seventh payload information is the second reduction result.
The first set communication module is further configured to perform broadcast stream transmission termination, parse the seventh message to obtain the seventh payload information, and write the seventh payload information into the GPU memory of the initial node.

[0017] In some specific implementations, the network interface card (NIC) of the root node includes:
A third receiving module, used to receive the eighth message sent by the parent node corresponding to the root node in the reduction phase; the eighth message includes the eighth message header information and the eighth payload information, and the eighth message header information includes the first source address, the fourth destination address and the first endpoint identifier;
A third message parsing module, used to parse the eighth message to obtain the eighth message header information and the eighth payload information, and to query a preset third mapping table based on the eighth message header information to obtain the fifth node type information, the fifth flow type information, and the fifth endpoint identifier; wherein, the fifth node type indicates that the current node is the end node of the reduction phase, the fifth flow type indicates that the eighth message belongs to the reduction flow, and the fifth endpoint identifier is the real endpoint identifier corresponding to the reduction flow allocated by the protocol stack of the root node;
A third message processing module, configured to read third local data from the GPU memory of the root node according to the indications of the fifth node type information and the fifth stream type information, perform reduction calculations on the third local data according to the eighth payload information to obtain the second reduction result, and generate a ninth message according to the first source address, the fourth destination address, the fifth endpoint identifier, and the second reduction result; wherein, the ninth message includes ninth message header information and ninth payload information, the ninth message header information includes the first source address, the fourth destination address, and the fifth endpoint identifier, and the ninth payload information is the second reduction result;
A second set communication module, used to perform the reduction stream transmission termination, parse the ninth message to obtain the second reduction result, and write the second reduction result into the GPU memory of the root node.
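The root-node termination of the reduction stream can be sketched as below. Everything concrete here is an assumption for illustration: the `Message` class, the table layout, the endpoint values, summation as the reduction, and a Python list standing in for GPU memory.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

REDUCE_EP = 0x100     # hypothetical reserved endpoint identifier (reduction stream)
ROOT_REAL_EP = 0x3B   # hypothetical real endpoint identifier at the root node


@dataclass(frozen=True)
class Message:
    src: str
    dst: str
    endpoint: int
    payload: Tuple[float, ...]


# Hypothetical layout: (src, dst, endpoint) -> (node type, flow type, real endpoint)
ThirdMappingTable = Dict[Tuple[str, str, int], Tuple[str, str, int]]


def terminate_reduce(eighth: Message, table: ThirdMappingTable, gpu_memory: List[float]) -> Message:
    """Perform the last reduction and hand the result to the local protocol endpoint."""
    node_type, flow_type, real_ep = table[(eighth.src, eighth.dst, eighth.endpoint)]
    if (node_type, flow_type) != ("end", "reduce"):
        raise ValueError("not a reduction-stream message at the end node")
    result = tuple(x + y for x, y in zip(eighth.payload, gpu_memory))
    ninth = Message(eighth.src, eighth.dst, real_ep, result)
    gpu_memory[:] = ninth.payload  # stream termination: store the global result locally
    return ninth


table = {("node7", "node0", REDUCE_EP): ("end", "reduce", ROOT_REAL_EP)}
root_memory = [0.5]
eighth = Message("node7", "node0", REDUCE_EP, (27.5,))
ninth = terminate_reduce(eighth, table, root_memory)
```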

[0018] In some specific implementations, the network interface card of the root node also includes a third sending module.
The second set communication module is further configured to generate a tenth message; wherein the tenth message includes tenth message header information and tenth payload information, the tenth message header information includes the second source address, the sixth destination address and the sixth endpoint identifier, the tenth payload information is the second reduction result; the sixth endpoint identifier is the real endpoint identifier corresponding to the broadcast stream allocated by the protocol stack of the root node.
The third message parsing module is further configured to parse the tenth message to obtain the tenth message header information and the tenth payload information, and query a preset third mapping table based on the tenth message header information to obtain the sixth node type information, the sixth flow type information, the second endpoint identifier, and the seventh destination address; wherein, the sixth node type indicates that the current node is the start node of the broadcast phase, the sixth flow type indicates that the tenth message belongs to the broadcast flow, and the seventh destination address is the address of the next-level node corresponding to the root node of the broadcast phase.
The third message processing module is further configured to generate an eleventh message based on the indications of the sixth node type information and the sixth flow type information, according to the second source address, the seventh destination address, the second endpoint identifier, and the second reduction result; wherein, the eleventh message includes eleventh message header information and eleventh payload information, the eleventh message header information includes the second source address, the seventh destination address, and the second endpoint identifier, and the eleventh payload information is the second reduction result.
The third sending module is used to send the eleventh message to the next-level node corresponding to the root node in the broadcast phase.

[0019] In some specific implementations, the first endpoint identifier, the second endpoint identifier, the third endpoint identifier, the fourth endpoint identifier, the fifth endpoint identifier, and the sixth endpoint identifier are determined according to the transmission protocol between the nodes of the distributed system.

[0020] The network interface card (NIC) and distributed system proposed in this application have the following beneficial effects: When the NIC is applied to an intermediate node of a distributed system, the reduction computation is offloaded from the GPU to the NIC of the intermediate node. The GPU of the intermediate node does not need to participate in data computation; it serves only as a data source or a write target, thus significantly saving valuable computing power in distributed training. Since the NIC of the intermediate node independently completes packet recognition, local data reading, reduction computation, and packet forwarding according to a pre-defined mapping table, the intermediate node does not need to interact with the host CPU during the entire reduction and broadcast process. The root node and the initial node only have minimal interaction with the CPU at the beginning or end of the computation, greatly reducing the CPU load and avoiding frequent CPU interruptions. In traditional solutions, intermediate nodes need to perform four data transfers: writing received data to GPU memory, having the GPU read and compute the data, writing the computed result back to memory, and finally having the network card read and send the data. By comparison, in this embodiment, the intermediate node only needs the network card to read local data once via DMA and generate and send the computed result directly within the network card, reducing PCIe bandwidth usage by about half. By eliminating multiple control plane interactions and waiting between the CPU and GPU, and between the CPU and NIC in the intermediate node, and simplifying the repeated data transfer between GPU memory and the network card, this embodiment significantly shortens the communication latency of the full reduction operation and improves the overall performance of distributed training and inference of large models.

Attached Figure Description
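One way to read the "about half" figure is sketched below. It assumes, as an interpretation rather than a statement from the application, that the conventional reduction step moves the payload over the NIC's PCIe channel twice (one write of received data, one read of the result), that the offloaded step moves it once (one DMA read of local data), and that all transfers are the same size. The payload size is an arbitrary example.

```python
def pcie_bytes(transfers: int, payload_bytes: int) -> int:
    """Bytes moved over the NIC's PCIe channel for equally sized transfers."""
    return transfers * payload_bytes


PAYLOAD_BYTES = 64 * 1024 * 1024  # hypothetical 64 MiB payload per reduction step

baseline = pcie_bytes(2, PAYLOAD_BYTES)    # conventional: one PCIe write + one PCIe read
offloaded = pcie_bytes(1, PAYLOAD_BYTES)   # offloaded: one DMA read of local data
saving = 1 - offloaded / baseline          # fraction of PCIe traffic avoided
```

Under these assumptions the saving is 0.5 regardless of the payload size, since only the number of transfers changes.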

[0021] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the accompanying drawings required in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0022] Figure 1 is a schematic diagram of the All Reduce algorithm for an existing Ring topology.

[0023] Figure 2 is a schematic diagram of the structure of a network interface card (NIC) applied to an intermediate node in one embodiment of this application.

[0024] Figure 3 is a schematic diagram of the structure of a distributed system in one embodiment of this application.

[0025] Figure 4 is a schematic diagram of the structure of a network interface card (NIC) applied to an initial node in one embodiment of this application.

[0026] Figure 5 is a schematic diagram of the structure of a network interface card (NIC) applied to a root node in one embodiment of this application.

[0027] Reference numerals: 1-Intermediate node, 11-First receiving module, 12-First message parsing module, 13-First message processing module, 14-First sending module, 15-DMA module; 2-Initial node, 21-Second receiving module, 22-Second message parsing module, 23-Second message processing module, 24-Second sending module, 25-First set communication module; 3-Root node, 31-Third receiving module, 32-Third message parsing module, 33-Third message processing module, 34-Third sending module, 35-Second set communication module.

Detailed Implementation

[0028] The detailed description of the accompanying drawings is intended to illustrate the present preferred embodiments of this application and is not intended to represent only the forms in which this application can be implemented. It should be understood that the same or equivalent functions can be achieved by different embodiments intended to be included within the spirit and scope of this application.

[0029] One embodiment of this application proposes a network interface card (NIC) that can be applied to intermediate node 1 in a distributed system. The distributed system includes multiple nodes. In the full reduction process of the Ring topology, the initial node 2, as the initiator in the reduction phase, directly sends local data to the corresponding next-level node. In the broadcast phase, it acts as the terminator, receiving and saving the global reduction result. The intermediate node 1, as the relay in the reduction phase, performs reduction calculations on the received packet payload and the intermediate node 1's local data, and then forwards the local reduction result to the corresponding next-level node. In the broadcast phase, it writes the received global reduction result into the intermediate node 1's GPU memory and forwards it to the corresponding next-level node. The root node 3, as the terminator in the reduction phase, saves the global reduction result after completing the last reduction calculation, and in the broadcast phase, it acts as the initiator, sending the global reduction result to the corresponding next-level node.

[0030] As shown in Figure 2, in this embodiment, the network card applied to intermediate node 1 includes a first receiving module 11, a first message parsing module 12, a first message processing module 13, a first sending module 14, and a DMA module 15.

[0031] The first receiving module 11 is used to receive a first message sent by the parent node corresponding to the current node in the reduction phase; wherein, the first message includes first message header information and first payload information, the first message header information includes a first source address, a first destination address and a first endpoint identifier, the first source address is the address of the node that initiated the reduction, the first destination address is the address of the current node, and the first endpoint identifier is an endpoint identifier that the distributed system has reserved in advance for the reduction stream transmission.

[0032] Specifically, in the full reduction process based on Ring topology, the nodes form a logical ring. This embodiment employs a reverse link-building method, where the data flow of the current node is sent to its parent node on the ring. When the reduction phase begins, the first receiving module 11 of the current node (as intermediate node 1) receives the first message sent from its parent node, with the destination address of the first message being the address of the current node. To distinguish regular data flows from collective communication flows, the distributed system pre-reserves a specific set of endpoint identifiers. The header of the first message carries the first endpoint identifier reserved for the reduction flow, rather than a regular endpoint identifier dynamically allocated by the node protocol stack. For example, in the RDMA protocol, the first endpoint identifier could be a reserved queue pair number (QPN).
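The idea of telling collective communication flows apart from regular flows by a reserved endpoint identifier can be sketched as a simple lookup. The numeric values and the name `classify_flow` are hypothetical; the application does not state which identifiers are reserved.

```python
# Hypothetical reserved endpoint identifiers (for example, reserved QPNs under RDMA).
RESERVED_ENDPOINTS = {
    0x00FFF0: "reduce",      # assumed identifier for the reduction flow
    0x00FFF1: "broadcast",   # assumed identifier for the broadcast flow
}


def classify_flow(endpoint_id: int) -> str:
    """Return the flow type for a header's endpoint identifier."""
    # Anything outside the reserved set is treated as a regular data flow.
    return RESERVED_ENDPOINTS.get(endpoint_id, "regular")


flows = [classify_flow(e) for e in (0x00FFF0, 0x00FFF1, 0x000123)]
```

A packet whose identifier falls outside the reserved set would follow the NIC's normal receive path in this model.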

[0033] The first message parsing module 12 is used to parse the first message to obtain the first message header information and the first payload information, and to query a preset first mapping table based on the first message header information to obtain the first node type information, the first flow type information and the second destination address; wherein, the first node type information indicates that the current node is an intermediate node in the reduction stage, the first flow type information indicates that the first message belongs to the reduction flow, and the second destination address is the address of the next-level node corresponding to the current node in the reduction stage.

[0034] Specifically, the first message parsing module 12 first performs protocol parsing on the received first message, extracting the first message header information (including the first source address, first destination address, and first endpoint identifier) and the first payload information. Then, the first message parsing module 12 uses key fields from the first message header information (such as source IP address, destination IP address, and endpoint identifier) as search keys to query the first mapping table pre-configured in the current network interface card (NIC) by the control plane or driver. The lookup results include: first node type information (indicating that the current node is an intermediate node in the reduction phase), first flow type information (indicating that the first message belongs to the reduction flow), and second destination address (i.e., the address of the next-level node in the reduction phase determined according to the Ring table). Through the lookup operation, the NIC can obtain the subsequent processing logic without CPU involvement.
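The table lookup described above can be illustrated with a minimal Python sketch. It assumes the first mapping table behaves like an exact-match table keyed on the three header fields; the names `HeaderKey`, `MappingEntry`, and `lookup`, and all addresses and identifier values, are hypothetical and do not appear in the application.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class HeaderKey:
    """Lookup key built from the first message header (hypothetical layout)."""
    src_addr: str
    dst_addr: str
    endpoint_id: int


@dataclass(frozen=True)
class MappingEntry:
    """Lookup result: node type, flow type, and next-level node address."""
    node_type: str
    flow_type: str
    next_hop: str


def lookup(table: Dict[HeaderKey, MappingEntry], key: HeaderKey) -> Optional[MappingEntry]:
    """Return the matching entry, or None when the message is not an offloaded flow."""
    return table.get(key)


# Illustrative table, as it might be pre-configured by the control plane or driver.
first_mapping_table = {
    HeaderKey("10.0.0.1", "10.0.0.2", 0x100): MappingEntry("intermediate", "reduce", "10.0.0.3"),
}

hit = lookup(first_mapping_table, HeaderKey("10.0.0.1", "10.0.0.2", 0x100))
miss = lookup(first_mapping_table, HeaderKey("10.0.0.1", "10.0.0.2", 0x2000))
```

In this sketch a miss simply returns `None`, which stands in for a message that would follow the regular (non-offloaded) path.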

[0035] The first message processing module 13 is configured to read first local data from the GPU memory of the current node according to the indication of the first node type information and the first stream type information, perform reduction calculation according to the first payload information and the first local data to obtain a first reduction result, and generate a second message according to the first source address, the second destination address, the first endpoint identifier and the first reduction result; wherein, the second message includes second message header information and second payload information, the second message header information includes the first source address, the second destination address and the first endpoint identifier, and the second payload information includes the first reduction result.

[0036] Specifically, the first message processing module 13 initiates the on-network reduction processing flow based on the "intermediate node" and "reduction flow" indications obtained through parsing and table lookup. The first message processing module 13 triggers a direct memory access (DMA) operation, reading the first local data to be reduced from the GPU memory of the current node via the PCIe bus. Then, the first message processing module 13 performs a reduction calculation (e.g., summation or averaging) between the partial reduction result carried in the first payload information and the read first local data to obtain the first reduction result. Finally, the first message processing module 13 generates a new second message. The source address in the header of the second message remains unchanged (still the first source address, i.e., the address of the node initiating the reduction), the destination address is replaced with the second destination address obtained from the table lookup (the address of the next-level node corresponding to the current node), the endpoint identifier still uses the first endpoint identifier reserved for the reduction flow, and the payload is updated to the calculated first reduction result. All of the above processes are completed entirely within the network interface card (NIC) and do not involve the GPU computing core.
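The header handling in this step can be sketched in Python as follows. The sketch assumes element-wise summation as the reduction operation and models the payload as a list of floats; the `Message` type and the function `reduce_and_forward` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Message:
    src_addr: str
    dst_addr: str
    endpoint_id: int
    payload: List[float]


def reduce_and_forward(first_msg: Message, local_data: List[float], next_hop: str) -> Message:
    """Build the second message from the first message and the first local data."""
    if len(first_msg.payload) != len(local_data):
        raise ValueError("payload and local data must have the same length")
    # Element-wise sum stands in for the reduction; averaging would be another option.
    reduced = [a + b for a, b in zip(first_msg.payload, local_data)]
    return Message(
        src_addr=first_msg.src_addr,        # unchanged: node that initiated the reduction
        dst_addr=next_hop,                  # replaced: second destination address from lookup
        endpoint_id=first_msg.endpoint_id,  # unchanged: reserved reduction endpoint identifier
        payload=reduced,                    # updated: first reduction result
    )


first = Message("10.0.0.1", "10.0.0.2", 0x100, [1.0, 2.0])
second = reduce_and_forward(first, [10.0, 20.0], "10.0.0.3")
```

Only the destination address and the payload differ between `first` and `second`, which mirrors the description of the second message header.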

[0037] The first sending module 14 is used to send the second message to the next level node corresponding to the current node in the reduction phase.

[0038] Specifically, the first sending module 14 obtains the encapsulated second message from the first message processing module 13, adds the necessary link layer header (such as Ethernet MAC address) to the second message, and sends it to the switching network through the physical port of the network card. Finally, the network routes the message to the next-level node corresponding to the second destination address. The sending process does not require the participation of the host CPU and is completed independently by the network card hardware.

[0039] In some embodiments, the first receiving module 11 is further configured to receive a third message sent by the parent node corresponding to the current node during the broadcast phase; wherein the third message includes third message header information and third payload information, the third message header information includes a second source address, a first destination address and a second endpoint identifier, the second source address is the address of the node initiating the broadcast, and the second endpoint identifier is an endpoint identifier that the distributed system has reserved in advance for broadcast stream transmission.

[0040] Specifically, in the broadcast phase of All-Reduce, the data flow is reversed compared to the reduction phase. During the broadcast phase, the first receiving module 11 of the current node (as intermediate node 1) receives a third message sent from its parent node. Unlike the reduction stream, the third message uses a second endpoint identifier (e.g., another set of reserved QPN ranges) that the distributed system has reserved for broadcast stream transmission. The second source address in the header of the third message is the address of the node that initiated the broadcast (root node 3), and the first destination address is still the address of the current node.
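As an illustration of how reserved endpoint identifiers could separate the reduction flow, the broadcast flow, and regular traffic, the following Python sketch classifies a QPN by range. The two ranges are purely hypothetical; the application states only that identifiers are reserved in advance and gives no concrete values.

```python
from enum import Enum


class FlowClass(Enum):
    REGULAR = "regular"
    REDUCE = "reduce"
    BROADCAST = "broadcast"


# Hypothetical reserved QPN ranges (assumed values, not taken from the application).
REDUCE_QPNS = range(0x100, 0x110)
BROADCAST_QPNS = range(0x110, 0x120)


def classify_endpoint(qpn: int) -> FlowClass:
    """Map an endpoint identifier to the flow class it was reserved for."""
    if qpn in REDUCE_QPNS:
        return FlowClass.REDUCE
    if qpn in BROADCAST_QPNS:
        return FlowClass.BROADCAST
    return FlowClass.REGULAR
```

Any identifier outside both ranges is treated as dynamically allocated by the protocol stack and therefore as regular traffic.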

[0041] The first message parsing module 12 is further configured to parse the third message to obtain the third message header information and the third payload information, and query the first mapping table according to the third message header information to obtain the second node type information, the second flow type information and the third destination address; wherein, the second node type information indicates that the current node is an intermediate node in the broadcast phase, the second flow type information indicates that the third message belongs to the broadcast flow, and the third destination address is the address of the next-level node corresponding to the current node in the broadcast phase.

[0042] Specifically, the first message parsing module 12 also parses the third message and queries the first mapping table using the third message header information as the key. At this time, the second node type information obtained from the table query indicates that the current node is an intermediate node in the broadcast phase, the second stream type information indicates that the third message belongs to the broadcast stream, and the third destination address is the address of the next-level node corresponding to the current node in the broadcast phase (in the opposite direction to the reduction phase).

[0043] The first message processing module 13 is further configured to write the third payload information into the GPU memory of the current node according to the indication of the second node type information and the second stream type information, and generate a fourth message according to the second source address, the third destination address, the second endpoint identifier and the third payload information; wherein, the fourth message includes fourth message header information and fourth payload information, the fourth message header information includes the second source address, the third destination address and the second endpoint identifier, and the fourth payload information is the same as the third payload information.

[0044] Specifically, the first message processing module 13 performs broadcast processing according to the instructions of the "intermediate node" and the "broadcast stream". Instead of performing calculations, the first message processing module 13 directly writes the third payload information (i.e., the global reduction result) into the GPU memory of the current node by calling the DMA module. Simultaneously, to continue broadcasting the data to the next-level nodes on the Ring topology, the first message processing module 13 generates a fourth message. The source address in the header of the fourth message remains unchanged (the second source address, i.e., the root node address), the destination address is replaced with the third destination address obtained from a lookup table, the endpoint identifier still uses the second endpoint identifier reserved for the broadcast stream, and the payload is exactly the same as the third payload information. In this way, the data is forwarded to the corresponding next-level nodes while being written to the local GPU.
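A minimal Python sketch of this write-and-forward behavior is given below. GPU memory is modeled as a plain list and the DMA write as a slice assignment; `BroadcastMessage`, `handle_broadcast`, and the write offset are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class BroadcastMessage:
    src_addr: str
    dst_addr: str
    endpoint_id: int
    payload: Tuple[float, ...]


def handle_broadcast(third_msg: BroadcastMessage, gpu_memory: List[float],
                     write_offset: int, next_hop: str) -> BroadcastMessage:
    """Write the global result locally, then build the fourth message."""
    end = write_offset + len(third_msg.payload)
    if write_offset < 0 or end > len(gpu_memory):
        raise ValueError("write would fall outside the modeled GPU memory")
    # Stand-in for the DMA write of the third payload information.
    gpu_memory[write_offset:end] = third_msg.payload
    # Only the destination address changes; source, endpoint and payload are kept.
    return replace(third_msg, dst_addr=next_hop)


gpu = [0.0, 0.0, 0.0, 0.0]
third = BroadcastMessage("10.0.0.3", "10.0.0.2", 0x110, (9.0, 12.0))
fourth = handle_broadcast(third, gpu, 2, "10.0.0.1")
```

No arithmetic is performed on the payload here, in line with the statement that the broadcast phase only stores and forwards the global reduction result.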

[0045] The first sending module 14 is further configured to send the fourth message to the next-level node corresponding to the current node in the broadcast phase.

[0046] Specifically, the first sending module 14 sends the generated fourth message to the next-level node indicated by the third destination address. The sending mechanism is the same as that in the reduction phase and likewise requires no CPU intervention.

[0047] In some embodiments, the network card applied to intermediate node 1 in the above embodiments further includes a DMA module 15.

[0048] The first message processing module 13 is configured to generate a DMA read instruction based on the indications of the first node type information and the first stream type information, and send the DMA read instruction to the DMA module 15; and to generate a DMA write instruction based on the indications of the second node type information and the second stream type information, and send the DMA write instruction to the DMA module 15.

[0049] Specifically, to efficiently access GPU memory, the network card integrates a DMA module 15. When the first message processing module 13 identifies that the current stream is a reduction stream and the role is an intermediate node, the first message processing module 13 generates a length-aligned DMA read instruction based on the memory address of the first local data and sends the DMA read instruction to the DMA module 15. When the first message processing module 13 identifies that the current stream is a broadcast stream and the role is an intermediate node, the first message processing module 13 generates a DMA write instruction based on a preset GPU memory write address and sends the DMA write instruction to the DMA module 15.

[0050] The DMA module 15 is configured to read the first local data from the GPU memory of the current node according to the DMA read instruction and return it to the first message processing module 13; and to write the third payload information into the GPU memory of the current node according to the DMA write instruction.

[0051] Specifically, after receiving a DMA read instruction, DMA module 15 initiates a read transaction via the PCIe bus to read the first local data from the GPU memory of the current node and returns the first local data to the first message processing module 13. After receiving a DMA write instruction, DMA module 15 initiates a write transaction via the PCIe bus to directly write the third payload information to a specified location in the GPU memory of the current node. No DMA transfer requires the participation of the host CPU.
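The read and write instructions handled by the DMA module can be modeled with a short Python sketch. GPU memory is represented as a `bytearray`, and the 64-byte alignment used for the length-aligned read is an assumed value; `DmaModule` and `align_up` are hypothetical names.

```python
class DmaModule:
    """Toy model of DMA module 15 operating on a byte buffer that stands in for GPU memory."""

    def __init__(self, gpu_memory: bytearray) -> None:
        self._mem = gpu_memory

    def _check(self, addr: int, length: int) -> None:
        if addr < 0 or length < 0 or addr + length > len(self._mem):
            raise ValueError("DMA access outside the modeled GPU memory")

    def read(self, addr: int, length: int) -> bytes:
        """DMA read instruction: return `length` bytes starting at `addr`."""
        self._check(addr, length)
        return bytes(self._mem[addr:addr + length])

    def write(self, addr: int, data: bytes) -> None:
        """DMA write instruction: store `data` starting at `addr`."""
        self._check(addr, len(data))
        self._mem[addr:addr + len(data)] = data


def align_up(length: int, alignment: int = 64) -> int:
    """Round a transfer length up to a multiple of `alignment` (assumed to be 64 bytes)."""
    return (length + alignment - 1) // alignment * alignment


memory = bytearray(128)
dma = DmaModule(memory)
dma.write(64, b"\x01\x02")
readback = dma.read(64, 2)
```

In real hardware these operations would be PCIe transactions issued by the NIC; the sketch only captures the address and length bookkeeping.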

[0052] As shown in Figure 3, one embodiment of this application proposes a distributed system including an initial node 2, a root node 3, and at least one intermediate node 1; both the initial node 2 and the root node 3 are equipped with network interface cards (NICs), and the intermediate node 1 is equipped with the network card of the aforementioned embodiment shown in Figure 2.

[0053] Specifically, in the distributed system shown in Figure 3, nodes are divided into initial node 2, intermediate node 1, and root node 3. Initial node 2 is the first node to initiate data transmission during the reduction phase, root node 3 is the last node to perform the reduction calculation during the reduction phase, and the remaining nodes are all intermediate nodes 1. All intermediate nodes 1 are configured with the aforementioned network interface card (NIC) shown in Figure 2. During the reduction phase, the NIC of the initial node 2 is responsible for sending its second local data directly to its next-level node in the Ring (i.e., the first intermediate node 1 or the root node 3). The NIC of the root node 3 is responsible for receiving packets from its parent node (the last intermediate node 1 or the initial node 2), extracting the packet payload, and performing the final reduction calculation with the third local data in its GPU memory to obtain the global reduction result (the second reduction result). Throughout the reduction process, intermediate node 1 performs data forwarding and partial reduction, while root node 3 performs the final reduction.

[0054] The network interface card of the initial node 2 is used to send, during the reduction phase, the second local data stored in the GPU memory of the initial node 2 to the next-level node corresponding to the initial node 2.

[0055] Specifically, during the reduction phase, the network card of the initial node 2 is responsible for sending the second local data of this node directly to its next-level node in the Ring topology (i.e., the first intermediate node 1 or the root node 3).

[0056] The network interface card of the root node 3 is used to receive the corresponding message sent by the upper-level node during the reduction phase, and perform reduction calculation on the payload information of the message and the third local data stored in the GPU memory of the root node 3 to obtain the second reduction result.

[0057] Specifically, the network card of root node 3 is responsible for receiving packets from its parent node (the last intermediate node 1 or the initial node 2) during the reduction phase, extracting the payload, and performing the final reduction calculation with the third local data in the GPU memory of this node to obtain the second reduction result (global reduction result).

[0058] In some embodiments, the network interface card of the root node 3 is further configured to send the second reduction result to the next-level node corresponding to the root node 3 during the broadcast phase.

[0059] Specifically, after the reduction phase is completed, the second reduction result (global reduction result) is stored in the GPU memory of root node 3. During the broadcast phase, the network card of root node 3, as the initiator, encapsulates the second reduction result into a broadcast message and sends it to its next-level node in the broadcast phase (i.e., the first intermediate node 1 in the reverse direction of the Ring starting from root node 3).

[0060] The network interface card of the initial node 2 is also used to receive the second reduction result sent by the corresponding upper-level node during the broadcast phase, and write the second reduction result into the GPU memory of the initial node 2.

[0061] Specifically, the initial node 2 acts as the terminator during the broadcast phase. The network interface card (NIC) of the initial node 2 receives a broadcast message carrying the second reduction result from its parent node (i.e., the last intermediate node 1), and writes the second reduction result into the GPU memory of the initial node 2. At this point, all nodes (initial node 2, intermediate node 1, and root node 3) have the same global reduction result in their GPU memory, completing one All-Reduce operation.
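The complete two-phase flow can be checked with a small host-side simulation in Python. It assumes summation as the reduction operation and a single chain from the initial node through the intermediate nodes to the root node; the function name `ring_all_reduce` and the list-based memory model are illustrative only.

```python
from typing import List


def ring_all_reduce(local_data: List[List[float]]) -> List[List[float]]:
    """Simulate one All-Reduce.

    local_data[0] belongs to the initial node, local_data[-1] to the root node,
    and everything in between to intermediate nodes, in reduction order.
    Returns the contents of every node's GPU memory after the broadcast phase.
    """
    n = len(local_data)
    if n < 2:
        raise ValueError("need at least an initial node and a root node")

    # Reduction phase: the initial node sends its data unchanged; every later
    # node adds its own local data to the payload it receives.
    in_flight = list(local_data[0])
    for idx in range(1, n):
        in_flight = [a + b for a, b in zip(in_flight, local_data[idx])]
    global_result = in_flight  # held by the root node after the last reduction

    # Broadcast phase: the root keeps the result and sends it back along the
    # ring in the reverse direction; the initial node is the last to store it.
    gpu_memory = [list(d) for d in local_data]
    gpu_memory[n - 1] = list(global_result)
    for idx in range(n - 2, -1, -1):
        gpu_memory[idx] = list(global_result)
    return gpu_memory


result = ring_all_reduce([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

After the call, every entry of `result` holds the same vector, which corresponds to the statement that all nodes end with the same global reduction result.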

[0062] In some embodiments, as shown in Figure 4, the network card of the initial node 2 includes a first set communication module 25, a second receiving module 21, a second message parsing module 22, a second message processing module 23, and a second sending module 24.

[0063] The first set communication module 25 is used to receive a full reduction set communication command issued by the host of the initial node 2, read the second local data from the GPU memory of the initial node 2 according to the full reduction set communication command, and generate an initial message according to the second local data; wherein, the initial message includes initial message header information and initial payload information, the initial message header information includes a first source address, a fourth destination address and a third endpoint identifier, the fourth destination address is the address of the root node 3, the third endpoint identifier is the real endpoint identifier corresponding to the reduction stream allocated by the protocol stack of the initial node 2, and the initial payload information is the second local data.

[0064] Specifically, the network interface card (NIC) of initial node 2 includes a first set communication module 25 for interacting with the host CPU's driver. When the host CPU issues an All-Reduce set communication command, the first set communication module 25 initiates a DMA operation to read second local data from the GPU memory of this node. Then, the first set communication module 25 generates an initial packet. The destination address of the initial packet is the address of root node 3, but the crucial third endpoint identifier is the actual endpoint identifier (not a reserved value) normally allocated by the initial node 2's protocol stack for the reduce stream. This is because initial node 2 has not yet entered the acceleration path within the NIC.

[0065] The second message parsing module 22 is used to parse the initial message to obtain the initial message header information and the second local data, and to query a preset second mapping table based on the initial message header information to obtain the third node type information, the third flow type information, the fifth destination address and the first endpoint identifier; wherein, the third node type information indicates that the initial node 2 is the starting node of the reduction phase, the third flow type information indicates that the initial message belongs to the reduction flow, and the fifth destination address is the address of the next-level node corresponding to the initial node 2 in the reduction phase.

[0066] Specifically, the initial message is processed within the network interface card's internal pipeline. The second message parsing module 22 receives the initial message generated by the first set communication module 25, parses it, and queries the second mapping table pre-configured by the control plane or driver using the initial message header information. The table lookup result indicates that the current node is the starting node of the reduction phase and that the flow type is a reduction flow, and it also yields the address of the next-level node in the reduction phase (the fifth destination address) and the endpoint identifier (the first endpoint identifier) reserved by the system for the reduction flow. This first endpoint identifier will be used to replace the real endpoint identifier in the message, so that the subsequent intermediate node 1 can recognize the reduction flow.

[0067] The second message processing module 23 is configured to generate a fifth message based on the first source address, the fifth destination address, the first endpoint identifier, and the second local data, according to the indications of the third node type information and the third flow type information; wherein the fifth message includes fifth message header information and fifth payload information, the fifth message header information includes the first source address, the fifth destination address, and the first endpoint identifier; the fifth payload information is the second local data.

[0068] Specifically, the second message processing module 23 performs message conversion operations according to the instructions of the "start node" and the "reduction flow". The second message processing module 23 uses the first endpoint identifier obtained by looking up the table to replace the third endpoint identifier (real endpoint identifier) in the original initial message, and modifies the destination address to the fifth destination address (next-level node address). The source address remains unchanged. The generated fifth message still has the original second local data as its payload, but the message header has been changed to the format of using the reserved first endpoint identifier. After receiving it, the subsequent intermediate node 1 can process it according to the aforementioned process.
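The conversion from the initial message to the fifth message amounts to a header rewrite driven by the second mapping table, which the following Python sketch illustrates. The types `Msg` and `StartEntry`, the function `convert_initial`, and all addresses and identifier values are hypothetical.

```python
from typing import Dict, NamedTuple, Tuple


class Msg(NamedTuple):
    src_addr: str
    dst_addr: str
    endpoint_id: int
    payload: Tuple[float, ...]


class StartEntry(NamedTuple):
    next_hop: str           # fifth destination address
    reserved_endpoint: int  # first endpoint identifier (reserved for the reduction flow)


def convert_initial(msg: Msg, table: Dict[Tuple[str, str, int], StartEntry]) -> Msg:
    """Rewrite the initial message header so that downstream NICs recognize the reduction flow."""
    entry = table[(msg.src_addr, msg.dst_addr, msg.endpoint_id)]
    # Source address and payload are kept; destination and endpoint identifier are replaced.
    return msg._replace(dst_addr=entry.next_hop, endpoint_id=entry.reserved_endpoint)


INITIAL, MIDDLE, ROOT = "10.0.0.1", "10.0.0.2", "10.0.0.3"
second_mapping_table = {(INITIAL, ROOT, 0x2000): StartEntry(MIDDLE, 0x100)}

initial_msg = Msg(INITIAL, ROOT, 0x2000, (1.0, 2.0))
fifth_msg = convert_initial(initial_msg, second_mapping_table)
```

Here `0x2000` plays the role of the real endpoint identifier allocated by the protocol stack and `0x100` the reserved one; both are assumed values.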

[0069] The second sending module 24 is used to send the fifth message to the next-level node corresponding to the current node in the reduction phase.

[0070] Specifically, the second sending module 24 obtains the fifth message generated by the second message processing module 23, encapsulates it at the link layer, and sends it to the next-level node corresponding to the fifth destination address. This sending process is completed autonomously by the network card hardware and does not occupy the host CPU resources.

[0071] In some embodiments, the network interface card of the initial node 2 further includes a second receiving module 21.

[0072] The second receiving module 21 is used to receive the sixth message sent by the parent node corresponding to the initial node 2 during the broadcast phase; wherein the sixth message includes sixth message header information and sixth payload information, the sixth message header information includes the second source address, the sixth destination address and the second endpoint identifier, the second source address is the address of the root node 3, the sixth destination address is the address of the initial node 2; the sixth payload information is the second reduction result.

[0073] Specifically, during the broadcast phase, the second receiving module 21 of the initial node 2 is responsible for receiving the sixth message sent from its parent node (i.e., the last intermediate node 1). The sixth message uses the second endpoint identifier reserved for the broadcast stream, the source address is the address of the root node 3, the destination address is the address of the initial node 2, and the payload is the second reduction result (global reduction result).

[0074] The second message parsing module 22 is further configured to parse the sixth message to obtain the sixth message header information and the sixth payload information, and query the second mapping table according to the sixth message header information to obtain the fourth node type information, the fourth flow type information and the fourth endpoint identifier; wherein, the fourth node type information indicates that the initial node 2 is the end node of the broadcast phase, the fourth flow type information indicates that the sixth message belongs to the broadcast flow, and the fourth endpoint identifier is the real endpoint identifier corresponding to the broadcast flow allocated by the protocol stack of the initial node 2.

[0075] Specifically, the second message parsing module 22 parses the sixth message and uses the header information of the sixth message to query the second mapping table. The table lookup result indicates that the current node is the end node of the broadcast phase, the flow type is broadcast flow, and it also obtains the fourth endpoint identifier (real endpoint identifier) allocated by the protocol stack of the initial node 2 for the broadcast flow.

[0076] The second message processing module 23 is further configured to generate a seventh message based on the indication of the fourth node type information and the fourth flow type information, according to the second source address, the sixth destination address, the fourth endpoint identifier, and the sixth payload information; the seventh message includes seventh message header information and seventh payload information, the seventh message header information includes the second source address, the sixth destination address, and the fourth endpoint identifier, the second source address is the address of the root node 3, and the seventh payload information is the second reduction result.

[0077] Specifically, the second message processing module 23 performs message conversion according to the instructions of the "end node" and "broadcast stream". The second message processing module 23 replaces the reserved endpoint identifier (second endpoint identifier) in the sixth message with the fourth endpoint identifier (real endpoint identifier) obtained by looking up the table and allocated by the protocol stack of the initial node 2. The destination address and source address remain unchanged, and the seventh message is generated. The seventh message then becomes a standard format that the protocol stack can recognize.

[0078] The first set communication module 25 is also used to perform broadcast stream transmission termination, parse the seventh message to obtain the seventh payload information, and write the seventh payload information into the GPU memory of the initial node 2.

[0079] Specifically, the first set communication module 25, in addition to initiating the reduction, is also responsible for terminating the broadcast stream. The first set communication module 25 receives the seventh message generated by the second message processing module 23, parses the seventh message, extracts the seventh payload information (i.e., the second reduction result), and writes the second reduction result into the GPU memory of the initial node 2 through a DMA operation. Finally, it notifies the host CPU that the broadcast operation is complete.

[0080] In some embodiments, as shown in Figure 5, the network interface card of the root node 3 includes a third receiving module 31, a third message parsing module 32, a third message processing module 33, a third sending module 34, and a second set communication module 35.

[0081] The third receiving module 31 is used to receive the eighth message sent by the upper-level node corresponding to the root node 3 in the reduction phase; the eighth message includes the eighth message header information and the eighth payload information, and the eighth message header information includes the first source address, the fourth destination address and the first endpoint identifier.

[0082] Specifically, during the reduction phase, the third receiving module 31 of the root node 3 receives the eighth message sent from its parent node (the last intermediate node 1 or the initial node 2). The eighth message uses the first endpoint identifier reserved for the reduction flow, the source address is the address of the node that initiated the reduction (initial node 2), and the destination address is the address of the root node 3.

[0083] The third message parsing module 32 is used to parse the eighth message to obtain the eighth message header information and the eighth payload information, and to query a preset third mapping table based on the eighth message header information to obtain the fifth node type information, the fifth flow type information and the fifth endpoint identifier; wherein, the fifth node type information indicates that the current node is the end node of the reduction phase, the fifth flow type information indicates that the eighth message belongs to the reduction flow, and the fifth endpoint identifier is the real endpoint identifier corresponding to the reduction flow allocated by the protocol stack of the root node 3.

[0084] Specifically, the third message parsing module 32 parses the eighth message and uses the header information of the eighth message to query the third mapping table pre-configured by the control plane or driver. The lookup result indicates that the current node is the end node of the reduction phase, the flow type is a reduction flow, and also obtains the fifth endpoint identifier (real endpoint identifier) allocated by the root node 3 protocol stack for the reduction flow.

[0085] The third message processing module 33 is configured to read third local data from the GPU memory of the root node 3 according to the instructions of the fifth node type information and the fifth stream type information, perform reduction calculation on the third local data according to the eighth payload information to obtain the second reduction result, and generate a ninth message according to the first source address, the fourth destination address, the fifth endpoint identifier and the second reduction result; wherein, the ninth message includes ninth message header information and ninth payload information, the ninth message header information includes the first source address, the fourth destination address and the fifth endpoint identifier, and the ninth payload information is the second reduction result.

[0086] Specifically, the third message processing module 33 initiates the final reduction based on the instructions of the "end node" and the "reduction stream". The third message processing module 33 reads the third local data from the GPU memory of the root node 3 via DMA, performs a reduction calculation between the eighth payload information of the eighth message and the third local data, and obtains the second reduction result (global reduction result). Then, the third message processing module 33 generates the ninth message: replacing the endpoint identifier in the header of the ninth message with the real endpoint identifier (fifth endpoint identifier) allocated by the root node 3 protocol stack, and the payload is the second reduction result.

[0087] The second set communication module 35 is used to perform the reduction stream transmission termination, parse the ninth message to obtain the second reduction result, and write the second reduction result into the GPU memory of the root node 3.

[0088] Specifically, the second set communication module 35 is responsible for terminating the reduction stream. The second set communication module 35 receives the ninth message generated by the third message processing module 33, parses it to obtain the second reduction result, and writes the second reduction result into the GPU memory of the root node 3 via DMA, notifying the host CPU that the reduction phase is complete and indicating that the subsequent broadcast phase can proceed.

[0089] In some embodiments, the network interface card of the root node 3 further includes a third sending module 34.

[0090] The second set communication module 35 is further configured to generate a tenth message; wherein the tenth message includes tenth message header information and tenth payload information, the tenth message header information includes the second source address, the sixth destination address and the sixth endpoint identifier, the tenth payload information is the second reduction result; the sixth endpoint identifier is the real endpoint identifier corresponding to the broadcast stream allocated by the protocol stack of the root node 3.

[0091] Specifically, after the reduction phase is completed, the second set communication module 35 of root node 3 prepares to initiate a broadcast. The second set communication module 35 generates a tenth message. The payload of the tenth message is the second reduction result. The source address in the header of the tenth message is the address of root node 3 itself, and the destination address is the address of the initial node 2. However, the sixth endpoint identifier is the actual endpoint identifier allocated by the root node 3 protocol stack for the broadcast stream.

[0092] The third message parsing module 32 is further configured to parse the tenth message to obtain the tenth message header information and the tenth payload information, and query a preset third mapping table based on the tenth message header information to obtain the sixth node type information, the sixth flow type information, the second endpoint identifier, and the seventh destination address; wherein, the sixth node type information indicates that the current node is the start node of the broadcast phase, the sixth flow type information indicates that the tenth message belongs to the broadcast flow, and the seventh destination address is the address of the next-level node corresponding to the root node 3 in the broadcast phase.

[0093] Specifically, the third message parsing module 32 receives the tenth message, parses it, and then queries the third mapping table using the header information of the tenth message. The lookup result indicates that the current node is the start node of the broadcast phase and that the flow type is a broadcast flow. It also yields the second endpoint identifier reserved by the system for broadcast flows and the address of the next-level node in the broadcast phase (the seventh destination address).
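By way of a hedged example, the table lookup can be pictured as a dictionary keyed by header fields. The key layout (source address, destination address, endpoint identifier), the enum and field names, and the concrete values `4242` and `61001` are hypothetical choices for illustration; the embodiments only state that the lookup is based on the message header information.

```python
from enum import Enum
from typing import Dict, NamedTuple, Optional, Tuple


class NodeType(Enum):
    START = "start"
    INTERMEDIATE = "intermediate"
    END = "end"


class FlowType(Enum):
    REDUCTION = "reduction"
    BROADCAST = "broadcast"


class Entry(NamedTuple):
    node_type: NodeType   # role of the current node in this phase
    flow_type: FlowType   # flow the message belongs to
    out_endpoint_id: int  # endpoint identifier to place in the outgoing header
    next_hop: str         # address of the next-level node


# Assumed key: (source address, destination address, endpoint identifier).
HeaderKey = Tuple[str, str, int]

THIRD_MAPPING_TABLE: Dict[HeaderKey, Entry] = {
    # Tenth message: real endpoint identifier 4242 (hypothetical value).
    ("root", "initial", 4242): Entry(
        NodeType.START, FlowType.BROADCAST, 61001, "intermediate-1"
    ),
}


def lookup(table: Dict[HeaderKey, Entry], src: str, dst: str,
           endpoint_id: int) -> Optional[Entry]:
    # A miss means the message is ordinary traffic, not a collective flow.
    return table.get((src, dst, endpoint_id))


entry = lookup(THIRD_MAPPING_TABLE, "root", "initial", 4242)
```

A lookup miss returning `None` corresponds to a message that takes the normal, non-accelerated path.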

[0094] The third message processing module 33 is further configured to generate an eleventh message based on the indications of the sixth node type information and the sixth flow type information, according to the second source address, the seventh destination address, the second endpoint identifier, and the second reduction result; wherein, the eleventh message includes eleventh message header information and eleventh payload information, the eleventh message header information includes the second source address, the seventh destination address, and the second endpoint identifier, and the eleventh payload information is the second reduction result.

[0095] Specifically, the third message processing module 33 performs message conversion according to the "start node" and "broadcast stream" indications. It replaces the real endpoint identifier (the sixth endpoint identifier) in the tenth message with the reserved second endpoint identifier obtained from the table lookup, and changes the destination address to the seventh destination address (the address of the next-level node), thereby generating the eleventh message. The payload of the eleventh message is still the second reduction result.
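The conversion described above amounts to a header rewrite that leaves the source address and payload untouched. The following is a minimal sketch under assumed names (`Message`, `convert_for_broadcast`) and with hypothetical address and identifier values; it is not the hardware implementation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Message:
    src: str          # source address
    dst: str          # destination address
    endpoint_id: int  # endpoint identifier
    payload: bytes    # payload information


def convert_for_broadcast(tenth: Message, reserved_endpoint_id: int,
                          next_hop: str) -> Message:
    # Swap the real endpoint identifier for the reserved one and redirect
    # the message to the next-level node; src and payload are unchanged.
    return replace(tenth, dst=next_hop, endpoint_id=reserved_endpoint_id)


# Hypothetical values: 4242 is the real identifier, 61001 the reserved one.
tenth = Message(src="root", dst="initial", endpoint_id=4242, payload=b"sum")
eleventh = convert_for_broadcast(tenth, reserved_endpoint_id=61001,
                                 next_hop="intermediate-1")
```

Using an immutable message type makes it explicit that the eleventh message is a new message derived from the tenth, rather than the tenth modified in place.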

[0096] The third sending module 34 is used to send the eleventh message to the next level node corresponding to the root node 3 in the broadcast phase.

[0097] Specifically, the third sending module 34 obtains the eleventh message, encapsulates it at the link layer, and sends it to the next-level node indicated by the seventh destination address, thereby initiating the data distribution in the broadcast phase.

[0098] In some embodiments, the first endpoint identifier, the second endpoint identifier, the third endpoint identifier, the fourth endpoint identifier, the fifth endpoint identifier, and the sixth endpoint identifier are determined according to the transmission protocol between nodes of the distributed system.

[0099] Specifically, the format and value range of each endpoint identifier mentioned in the embodiments of this application depend on the transmission protocol used for interconnection between nodes in the distributed system. For example, when the InfiniBand or RoCE (RDMA over Converged Ethernet) protocol is used, the endpoint identifier corresponds to the queue pair number (QPN). During system initialization, the system software reserves two dedicated QPN ranges for collective communication: one for the first endpoint identifier (reserved for reduction streams) and one for the second endpoint identifier (reserved for broadcast streams). The QPNs used by the protocol stack for normal communication (such as the third, fourth, fifth, and sixth endpoint identifiers) are dynamically allocated from the remaining range.
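The reservation scheme for QPNs can be illustrated with a toy allocator. The two reserved ranges below are hypothetical, since this application does not give concrete values, and in a real system QPN allocation is performed by the driver or firmware rather than by application code. The only facts relied on are that QPNs are 24-bit values and that QP0 and QP1 are special management queue pairs.

```python
from typing import Iterator

# Hypothetical reserved ranges; concrete values are not given in the source.
REDUCTION_QPNS = range(0x100, 0x200)  # reserved for reduction streams
BROADCAST_QPNS = range(0x200, 0x300)  # reserved for broadcast streams
QPN_LIMIT = 1 << 24                   # QPNs are 24-bit values


def is_reserved(qpn: int) -> bool:
    # True if the QPN belongs to either collective communication range.
    return qpn in REDUCTION_QPNS or qpn in BROADCAST_QPNS


def normal_qpns(start: int = 2) -> Iterator[int]:
    # Yield QPNs for ordinary traffic, skipping both reserved ranges.
    # QP0 and QP1 are management queue pairs, so the default start is 2.
    for qpn in range(start, QPN_LIMIT):
        if not is_reserved(qpn):
            yield qpn


allocator = normal_qpns(start=0xFE)
first_three = [next(allocator) for _ in range(3)]
```

Starting just below the reserved block shows the allocator jumping over both ranges before handing out the next QPN for normal communication.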

[0100] When the TCP/IP protocol is used, an endpoint identifier can be represented as a combination of (IP address, TCP port number). In this case, the reserved endpoint identifiers can be a specific set of port numbers (for example, reduction flows use port numbers 60001-61000, and broadcast flows use port numbers 61001-62000). The mapping table inside the network interface card (NIC) identifies collective communication flows based on these reserved port numbers.
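Using the example port ranges given above, the classification step can be sketched as a simple range check. The function name `classify_port` and the string labels are assumptions for this example; the actual mapping table format inside the NIC is not specified here.

```python
from typing import Optional

# Example ranges from the text; Python ranges exclude the upper bound.
REDUCTION_PORTS = range(60001, 61001)  # ports 60001-61000 inclusive
BROADCAST_PORTS = range(61001, 62001)  # ports 61001-62000 inclusive


def classify_port(port: int) -> Optional[str]:
    # Return the collective flow type for a reserved port, or None for
    # ordinary traffic that should bypass the acceleration path.
    if port in REDUCTION_PORTS:
        return "reduction"
    if port in BROADCAST_PORTS:
        return "broadcast"
    return None


labels = [classify_port(p) for p in (443, 60001, 61000, 61001, 62000, 62001)]
```

The boundary ports are included in the sample to make the inclusive ranges from the text explicit.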

[0101] When using the UDP protocol, the endpoint identifier is similarly a combination of (IP address, UDP port number), and a range of dedicated port numbers can also be reserved.

[0102] For custom lightweight transport protocols, the endpoint identifier can be a single-field flow identifier (FlowID), which is pre-assigned by the control plane and written into the network interface card mapping table.

[0103] Regardless of the transmission protocol used, the design principle is to reserve, for both the reduction stream and the broadcast stream, a set of endpoint identifier values that do not conflict with normal communication, so that the network interface card can distinguish collective communication streams from ordinary data streams based solely on the endpoint identifier in the message header, thereby triggering the on-network acceleration processing path described in the embodiments of this application.

[0104] The various embodiments of this application have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical applications, or their technical improvements over technologies available in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.

Claims

1. A network interface card (NIC) applied to an intermediate node of a distributed system, characterized in that it comprises:
a first receiving module, configured to receive a first message sent by the previous-level node corresponding to the current node in the reduction phase; wherein the first message includes first message header information and first payload information, the first message header information includes a first source address, a first destination address and a first endpoint identifier, the first source address is the address of the node that initiated the reduction, the first destination address is the address of the current node, and the first endpoint identifier is an endpoint identifier reserved in advance by the distributed system for reduction stream transmission;
a first message parsing module, configured to parse the first message to obtain the first message header information and the first payload information, and to query a preset first mapping table based on the first message header information to obtain first node type information, first flow type information and a second destination address; wherein the first node type information indicates that the current node is an intermediate node in the reduction phase, the first flow type information indicates that the first message belongs to the reduction flow, and the second destination address is the address of the next-level node corresponding to the current node in the reduction phase;
a first message processing module, configured to read first local data from the GPU memory of the current node according to the indications of the first node type information and the first flow type information, perform reduction calculation according to the first payload information and the first local data to obtain a first reduction result, and generate a second message according to the first source address, the second destination address, the first endpoint identifier and the first reduction result; and
a first sending module, configured to send the second message to the next-level node corresponding to the current node in the reduction phase.

2. The network interface card according to claim 1, characterized in that:
the first receiving module is further configured to receive a third message sent by the previous-level node corresponding to the current node in the broadcast phase; wherein the third message includes third message header information and third payload information, the third message header information includes a second source address, the first destination address and a second endpoint identifier, the second source address is the address of the node that initiated the broadcast, and the second endpoint identifier is an endpoint identifier reserved in advance by the distributed system for broadcast stream transmission;
the first message parsing module is further configured to parse the third message to obtain the third message header information and the third payload information, and to query the first mapping table based on the third message header information to obtain second node type information, second flow type information and a third destination address; wherein the second node type information indicates that the current node is an intermediate node in the broadcast phase, the second flow type information indicates that the third message belongs to the broadcast flow, and the third destination address is the address of the next-level node corresponding to the current node in the broadcast phase;
the first message processing module is further configured to write the third payload information into the GPU memory of the current node according to the indications of the second node type information and the second flow type information, and to generate a fourth message according to the second source address, the third destination address, the second endpoint identifier and the third payload information; and
the first sending module is further configured to send the fourth message to the next-level node corresponding to the current node in the broadcast phase.

3. The network interface card according to claim 2, characterized in that it further comprises a DMA module;
the first message processing module is configured to generate a DMA read instruction based on the indications of the first node type information and the first flow type information and send the DMA read instruction to the DMA module, and to generate a DMA write instruction based on the indications of the second node type information and the second flow type information and send the DMA write instruction to the DMA module; and
the DMA module is configured to read the first local data from the GPU memory of the current node according to the DMA read instruction and return it to the first message processing module, and to write the third payload information into the GPU memory of the current node according to the DMA write instruction.

4. A distributed system, characterized in that it comprises an initial node, a root node, and at least one intermediate node; the initial node and the root node are each equipped with a network interface card (NIC), and the intermediate node is equipped with the NIC according to any one of claims 1 to 3;
the network interface card of the initial node is configured to send, during the reduction phase, second local data stored in the GPU memory of the initial node to the next-level node corresponding to the initial node; and
the network interface card of the root node is configured to receive, during the reduction phase, a message sent by the corresponding previous-level node, and to perform reduction calculation on the payload information of the message and third local data stored in the GPU memory of the root node to obtain a second reduction result.

5. The distributed system according to claim 4, characterized in that:
the network interface card of the root node is further configured to send, during the broadcast phase, the second reduction result to the next-level node corresponding to the root node; and
the network interface card of the initial node is further configured to receive, during the broadcast phase, the second reduction result sent by the corresponding previous-level node, and to write the second reduction result into the GPU memory of the initial node.

6. The distributed system according to claim 4, characterized in that the network interface card (NIC) of the initial node comprises:
a first set communication module, configured to receive a full reduction set communication command issued by the host of the initial node, read the second local data from the GPU memory of the initial node according to the full reduction set communication command, and generate an initial message according to the second local data; wherein the initial message includes initial message header information and initial payload information, the initial message header information includes the first source address, a fourth destination address and a third endpoint identifier, the fourth destination address is the address of the root node, the third endpoint identifier is the real endpoint identifier corresponding to the reduction stream allocated by the protocol stack of the initial node, and the initial payload information is the second local data;
a second message parsing module, configured to parse the initial message to obtain the initial message header information and the second local data, and to query a preset second mapping table based on the initial message header information to obtain third node type information, third flow type information, a fifth destination address, and the first endpoint identifier; wherein the third node type information indicates that the initial node is the start node of the reduction phase, the third flow type information indicates that the initial message belongs to the reduction flow, and the fifth destination address is the address of the next-level node corresponding to the initial node in the reduction phase;
a second message processing module, configured to generate a fifth message based on the first source address, the fifth destination address, the first endpoint identifier, and the second local data, according to the indications of the third node type information and the third flow type information; and
a second sending module, configured to send the fifth message to the next-level node corresponding to the initial node in the reduction phase.

7. The distributed system according to claim 6, characterized in that the network interface card of the initial node further comprises a second receiving module;
the second receiving module is configured to receive a sixth message sent by the previous-level node corresponding to the initial node in the broadcast phase; wherein the sixth message includes sixth message header information and sixth payload information, the sixth message header information includes a second source address, a sixth destination address and a second endpoint identifier, the second source address is the address of the root node, the sixth destination address is the address of the initial node, the sixth payload information is the second reduction result, and the second endpoint identifier is an endpoint identifier reserved in advance by the distributed system for broadcast stream transmission;
the second message parsing module is further configured to parse the sixth message to obtain the sixth message header information and the sixth payload information, and to query the second mapping table based on the sixth message header information to obtain fourth node type information, fourth flow type information and a fourth endpoint identifier; wherein the fourth node type information indicates that the initial node is the end node of the broadcast phase, the fourth flow type information indicates that the sixth message belongs to the broadcast flow, and the fourth endpoint identifier is the real endpoint identifier corresponding to the broadcast flow allocated by the protocol stack of the initial node;
the second message processing module is further configured to generate a seventh message based on the second source address, the sixth destination address, the fourth endpoint identifier, and the sixth payload information, according to the indications of the fourth node type information and the fourth flow type information; and
the first set communication module is further configured to terminate the broadcast stream transmission, parse the seventh message to obtain the seventh payload information, and write the seventh payload information into the GPU memory of the initial node.

8. The distributed system according to claim 7, characterized in that the network interface card (NIC) of the root node comprises:
a third receiving module, configured to receive an eighth message sent by the previous-level node corresponding to the root node in the reduction phase; wherein the eighth message includes eighth message header information and eighth payload information, and the eighth message header information includes the first source address, the fourth destination address and the first endpoint identifier;
a third message parsing module, configured to parse the eighth message to obtain the eighth message header information and the eighth payload information, and to query a preset third mapping table based on the eighth message header information to obtain fifth node type information, fifth flow type information, and a fifth endpoint identifier; wherein the fifth node type information indicates that the current node is the end node of the reduction phase, the fifth flow type information indicates that the eighth message belongs to the reduction flow, and the fifth endpoint identifier is the real endpoint identifier corresponding to the reduction flow allocated by the protocol stack of the root node;
a third message processing module, configured to read the third local data from the GPU memory of the root node according to the indications of the fifth node type information and the fifth flow type information, perform reduction calculation according to the eighth payload information and the third local data to obtain the second reduction result, and generate a ninth message according to the first source address, the fourth destination address, the fifth endpoint identifier and the second reduction result; and
a second set communication module, configured to terminate the reduction stream transmission, parse the ninth message to obtain the second reduction result, and write the second reduction result into the GPU memory of the root node.

9. The distributed system according to claim 8, characterized in that the network interface card of the root node further comprises a third sending module;
the second set communication module is further configured to generate a tenth message; wherein the tenth message includes tenth message header information and tenth payload information, the tenth message header information includes the second source address, the sixth destination address and a sixth endpoint identifier, the tenth payload information is the second reduction result, and the sixth endpoint identifier is the real endpoint identifier corresponding to the broadcast stream allocated by the protocol stack of the root node;
the third message parsing module is further configured to parse the tenth message to obtain the tenth message header information and the tenth payload information, and to query the third mapping table based on the tenth message header information to obtain sixth node type information, sixth flow type information, the second endpoint identifier, and a seventh destination address; wherein the sixth node type information indicates that the current node is the start node of the broadcast phase, the sixth flow type information indicates that the tenth message belongs to the broadcast flow, and the seventh destination address is the address of the next-level node corresponding to the root node in the broadcast phase;
the third message processing module is further configured to generate an eleventh message based on the second source address, the seventh destination address, the second endpoint identifier, and the second reduction result, according to the indications of the sixth node type information and the sixth flow type information; wherein the eleventh message includes eleventh message header information and eleventh payload information, the eleventh message header information includes the second source address, the seventh destination address, and the second endpoint identifier, and the eleventh payload information is the second reduction result; and
the third sending module is configured to send the eleventh message to the next-level node corresponding to the root node in the broadcast phase.

10. The distributed system according to claim 9, characterized in that the first endpoint identifier, the second endpoint identifier, the third endpoint identifier, the fourth endpoint identifier, the fifth endpoint identifier, and the sixth endpoint identifier are determined according to the transmission protocol used between nodes of the distributed system.