Collective communication offload method, system, device, and medium
By generating communication primitive blueprints and establishing hardware-level communication contexts during the model deployment phase, and combining the parallel operations of DMA and RDMA pipelines within the DPU, the problems of high latency and low bandwidth utilization in tensor parallel inference are solved, achieving microsecond-level synchronization and efficient communication.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- YIHUA TECHNOLOGY (BEIJING) CO LTD
- Filing Date
- 2026-04-08
- Publication Date
- 2026-06-16
AI Technical Summary
Existing approaches to tensor parallel inference of large language models suffer from high end-to-end latency, communication mode mismatch, and limited bandwidth utilization in collective communication. In ultra-high-speed network environments in particular, host CPU resources are occupied, network bandwidth utilization is low, and the communication and computing pipelines are disconnected.
During the model deployment phase, a communication primitive blueprint is generated through the distributed inference controller to establish a hardware-level communication context semantic environment, thereby enabling control plane offloading. During the model inference phase, parallel operation of the DMA and RDMA pipelines within the DPU is used to reduce latency and improve bandwidth utilization.
The approach meets microsecond-level synchronization requirements, frees host CPU resources, improves network bandwidth utilization and inference service throughput, and reduces communication overhead to near the physical limit.
Figure CN121979690B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of computer technology, and in particular to a collective communication offloading method, system, device, and medium, which are especially suitable for tensor parallel distributed inference scenarios with large language models.
Background Technology
[0002] As AI (Artificial Intelligence) models such as large language models grow in scale, tensor parallel inference has become a key technology for distributed inference. In this scenario, frequent, fine-grained, and strictly synchronized collective communication is required among multiple computing nodes. Traditional collective communication, which is initiated by the CPU (Central Processing Unit) and executed by the network card, suffers from control plane latency, which becomes a major bottleneck in ultra-high-speed networks. Existing collective communication offloading schemes based on Data Processing Units (DPUs) offload the data plane, but the control plane still relies on the host CPU to initiate communication primitive configuration and state management. Furthermore, the message passing mode based on MPI (Message Passing Interface) communication semantics does not match the tensor parallel communication mode in AI inference and cannot achieve microsecond-level synchronization. As a result, host CPU resources are occupied and network bandwidth utilization is limited.
Summary of the Invention
[0003] This application provides a collective communication offloading method, system, device, and medium, which address the problems of high end-to-end latency, communication mode mismatch, and limited bandwidth utilization in existing collective communication offloading schemes. The technical solution provided by this application is as follows:
[0004] In one aspect, this application provides a collective communication offloading method, including:
[0005] During the model deployment phase, the distributed inference controller generates a communication primitive blueprint based on the computation graph description file of the tensor parallel inference model and sends the communication primitive blueprint to the data processing unit corresponding to the tensor parallel inference model. When the data processing unit receives the communication primitive blueprint, it establishes a hardware-level communication context semantic environment based on the communication primitive blueprint.
[0006] During the model inference phase, the inference computation unit performs forward propagation computation based on the tensor parallel inference model. When the computation reaches the communication boundary, it sends a trigger signal to the data processing unit. When the data processing unit receives the trigger signal, it executes the DMA (Direct Memory Access) tensor fragment data pull pipeline and the RDMA (Remote Direct Memory Access) tensor fragment data send pipeline in parallel based on the hardware-level communication context semantic environment. After performing aggregation computation on all tensor fragment data to be synchronized, the aggregation computation result is written back to the inference computation unit, and the inference computation unit is triggered to continue forward propagation computation based on the aggregation computation result.
[0007] Optionally, a communication primitive blueprint is generated based on the computation graph description file of the tensor parallel inference model, including:
[0008] The computation graph description file of the tensor parallel inference model is read and parsed to obtain static feature information. This static feature information includes tensor fragment information, collective communication mode information, and communication participant information. Tensor fragment information includes the fragment dimension, the number of fragments, and the fragment index of each fragment. Collective communication mode information covers the AllReduce, AllGather, and ReduceScatter operations. Communication participant information includes the identifiers of the inference computation units and data processing units participating in the communication.
[0009] Based on static feature information, a communication primitive blueprint is generated, which includes the communication tree topology, data sharding strategy, and lock-free memory slot pre-allocation table.
[0010] Optionally, a hardware-level communication context semantic environment is established based on the communication primitive blueprint, including:
[0011] The following communication semantic context solidification operations are performed based on the communication primitive blueprint to form a hardware-level communication context semantic environment:
[0012] Establish a pool of receiver slots in local memory;
[0013] Establish an RDMA one-sided write target address mapping between the physical address of each receiving slot and the pre-registered memory address of the peer data processing unit;
[0014] Solidify the communication tree topology and generate a hardware forwarding flow table;
[0015] Initialize the collective communication state machine.
[0016] Optionally, when the calculation reaches the communication boundary, a trigger signal is sent to the data processing unit, including:
[0017] When the calculation reaches the communication boundary, a single write operation trigger signal is sent to the data processing unit through the lightweight doorbell register.
[0018] Optionally, the DMA tensor fragment data retrieval pipeline includes: retrieving tensor fragment data to be synchronized from the inference computing unit through the PCIe DMA engine of the inference computing unit and storing it in the transmit buffer;
[0019] The RDMA tensor fragment data transmission pipeline includes: based on the communication tree topology in the hardware-level communication context semantic environment, the pre-configured hardware forwarding flow table, and the RDMA one-sided write target address mapping, initiating the RDMA one-sided write operation, and writing the tensor fragment data to be synchronized in the transmission buffer into the preset receive slot of the peer data processing unit.
[0020] Optionally, aggregation computation is performed on all tensor fragment data to be synchronized, including:
[0021] After detecting that all tensor fragment data to be synchronized has been received in each receiving slot, the aggregation calculation unit in the data processing unit is triggered to perform the following aggregation calculation according to the communication primitive type:
[0022] If the communication primitive type is AllReduce operation, then perform element-wise summation / average calculation on the tensor fragment data to be synchronized in each receiving slot;
[0023] If the communication primitive type is AllGather operation, then the tensor fragments to be synchronized in each receiving slot will be spliced into a complete tensor according to a predetermined offset;
[0024] If the communication primitive type is ReduceScatter operation, then after reducing the tensor fragment data to be synchronized in each receiving slot, only the portion of the reduction result required by the local unit is retained.
[0025] Optionally, the aggregated computation result is written back to the inference computation unit, and the inference computation unit is triggered to continue performing forward propagation computation based on the aggregated computation result, including:
[0026] The aggregation calculation results are written back to the inference computing unit via the PCIe DMA engine, and a lightweight completion notification based on the doorbell register is sent to the inference computing unit to notify the inference computing unit to continue the forward propagation calculation of the next stage.
[0027] In another aspect, this application provides a collective communication offloading system, including:
[0028] The distributed inference controller is used to generate a communication primitive blueprint based on the computation graph description file of the tensor parallel inference model during the model deployment phase, and then distribute the communication primitive blueprint to the data processing unit corresponding to the tensor parallel inference model.
[0029] The data processing unit is used to establish a hardware-level communication context semantic environment based on the communication primitive blueprint issued by the distributed inference controller during the model deployment phase; and to execute the DMA tensor fragment data pull pipeline and the RDMA tensor fragment data send pipeline in parallel based on the hardware-level communication context semantic environment when it receives the trigger signal sent by the inference computing unit during the model inference phase. After performing aggregation calculation on all tensor fragment data to be synchronized, the aggregation calculation result is written back to the inference computing unit.
[0030] The inference computation unit is used to perform forward propagation computation based on the tensor parallel inference model during the model inference phase. When the computation reaches the communication boundary, it sends a trigger signal to the data processing unit. Based on the aggregated computation results written back by the data processing unit, it continues to perform forward propagation computation.
[0031] In another aspect, this application provides an electronic device including a memory, a processor, a distributed inference controller, at least one data processing unit, and at least one inference computing unit; the memory stores computer instructions executable by the processor, and when the processor executes the computer instructions, it causes the distributed inference controller, the at least one data processing unit, and the at least one inference computing unit to collaboratively execute the above-mentioned collective communication offloading method.
[0032] In another aspect, this application provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the above-described collective communication offloading method.
[0033] The beneficial effects of this application are as follows:
[0034] (1) Reduce end-to-end communication latency: By completely offloading the control plane to the model deployment stage, zero control path dependency for inference is achieved. By using the dual pipeline parallel operation inside the DPU, end-to-end latency is reduced, meeting the microsecond-level synchronization requirements of large-scale tensor parallelism in ultra-high-speed network environments.
[0035] (2) Achieve zero involvement of the host CPU: Throughout the model inference process, the establishment, execution, and completion notification of collective communication involve no host CPU interrupts or MPI calls, so host CPU resources are fully released to computing tasks, or the CPU can enter a power-saving state.
[0036] (3) Improve network bandwidth utilization: By pre-solidifying the hardware-level communication context semantic environment during the model deployment stage, the overhead of address resolution and route lookup during the model inference stage is eliminated, effectively improving the utilization of high-speed network bandwidth.
[0037] (4) Improve the throughput of the inference service: In large-model tensor parallel inference scenarios, collective communication overhead accounts for 20-30% of the end-to-end latency. By completely offloading the control plane to the model deployment stage and running the dual pipelines in parallel inside the DPU, the communication overhead can be compressed to close to the physical limit, which effectively improves the overall throughput of the tensor parallel inference service.
[0038] Other features and advantages of this application will be set forth in the following description, and will be apparent in part from the description, or may be learned by practicing the application. The objectives and other advantages of this application may be realized and obtained by means of the structures particularly pointed out in the written description, claims, and drawings.
Description of the Drawings
[0039] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings:
[0040] Figure 1 is a schematic diagram outlining the process of the collective communication offloading method in the embodiments of this application;
[0041] Figure 2 is a schematic diagram illustrating the specific process of the collective communication offloading method in the embodiments of this application;
[0042] Figure 3 is a schematic diagram of the composition structure of the collective communication offloading system in the embodiments of this application.
Detailed Implementation
[0043] To make the objectives, technical solutions, and beneficial effects of this application clearer, the technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of the embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0044] To facilitate a better understanding of this application by those skilled in the art, the technical terms used in this application will be briefly introduced below.
[0045] Inference tasks refer to tasks that use a pre-trained model (such as the GPT large language model) to compute prediction results (such as generated text or recognized objects) from new input data (such as a piece of text or an image). In this application, inference tasks specifically refer to large-scale model inference performed in a distributed computing environment using a tensor parallel strategy. That is, a large model is split across multiple inference computing units (such as GPUs / NPUs), each inference computing unit is responsible for a part of the computation of the large model, and the various inference computing units perform parallel computations to complete one inference task.
[0046] Collective communication refers to the communication process in which various inference computing units exchange and synchronize their intermediate computing results (i.e., tensor fragment data to be synchronized) in order to complete an inference task.
[0047] A computation graph description file is a file that graphically represents the computational logic and structure of a large model. This file uses tensors (data) and operators (operations) as basic elements to uniformly describe the structure and computational process of the large model, such as data flow, operation dependencies, forward computation, loss calculation, automatic differentiation, and parameter updates.
[0048] After introducing the technical terms used in this application, the application scenarios and design concepts of this application will be briefly introduced next.
[0049] Currently, in the field of tensor parallel inference, collective communication operations such as AllReduce and AllGather are initiated by the CPU and executed by the network card. The control plane of the communication operation still resides on the CPU side, which leads to the following drawbacks:
[0050] Communication mode mismatch: Traditional DPU collective communication offloading schemes (such as MVAPICH2-DPU) are designed around MPI communication semantics. Their message passing mode is fundamentally different from the Tensor Parallelism communication mode in distributed AI inference. Tensor Parallelism requires fine-grained, high-frequency, and strictly synchronized aggregation of tensor fragments to be completed within a microsecond-level time window. Existing MPI offloading schemes cannot directly adapt to such load characteristics.
[0051] Control plane remains on the host: Although traditional DPU offloading schemes migrate the data plane to the DPU, the orchestration and control logic of communication (such as communication group establishment, synchronization barriers, and error handling) are still executed by the host CPU. In an 800Gbps ultra-high-speed network environment, the CPU's participation in the control path will introduce microsecond-level control plane delays, becoming a key bottleneck for end-to-end delays in tensor parallel inference.
[0052] Lack of inference-specific optimization: Existing DPU offloading solutions are not customized for the characteristics of tensor parallel inference scenarios, such as static computation graphs, fixed tensor shapes, and periodic communication patterns. The computation graph of the tensor parallel inference model is completely determined at deployment time, and the participants, data volume, and communication topology of the collective communication are all fixed, yet existing solutions do not exploit this prior knowledge for deep optimization.
[0053] Communication-computation pipeline fragmentation: Existing DPU offloading schemes treat communication offloading as an independent function, failing to achieve deep collaboration between the aggregate communication engine on the DPU and the tensor computation engine on the host inference computing unit (such as GPU / NPU) in dual pipelines. As a result, although communication and computation overlap in time, data fetching (PCIe DMA) and network sending (RDMA) within the DPU are still executed serially, leading to insufficient utilization of internal DPU resources.
[0054] Therefore, in this application, during the model deployment phase, the distributed inference controller generates a communication primitive blueprint based on the computation graph description file of the tensor parallel inference model and distributes the communication primitive blueprint to the DPU corresponding to the tensor parallel inference model. When the DPU receives the communication primitive blueprint, it establishes a hardware-level communication context semantic environment based on the communication primitive blueprint. During the model inference phase, the GPU / NPU performs forward propagation computation based on the tensor parallel inference model and sends a trigger signal to the DPU when the computation reaches the communication boundary. When the DPU receives the trigger signal, it executes the DMA tensor fragment data pull pipeline and the RDMA tensor fragment data send pipeline in parallel based on the hardware-level communication context semantic environment. After performing aggregation computation on all tensor fragment data to be synchronized, the aggregation computation result is written back to the GPU / NPU, and the GPU / NPU is triggered to continue performing forward propagation computation based on the aggregation computation result.
[0055] By completely offloading the control plane to the model deployment phase, zero control path dependency for inference is achieved. The dual-pipeline parallel operation within the DPU reduces end-to-end communication latency, meeting the microsecond-level synchronization requirements of large-scale tensor parallelism in ultra-high-speed network environments. Furthermore, throughout the entire model inference process, the establishment, execution, and completion notification of collective communication do not involve host CPU interrupts or MPI calls, achieving zero host CPU involvement. This allows host CPU resources to be fully released to computational tasks or enables the CPU to enter a power-saving state. In addition, by pre-solidifying the hardware-level communication context semantic environment during the model deployment phase, address resolution and route lookup overhead during model inference are eliminated, effectively improving high-speed network bandwidth utilization. Moreover, in large-model tensor parallel inference scenarios, collective communication overhead accounts for 20-30% of end-to-end latency. By completely offloading the control plane to the model deployment phase and utilizing the dual-pipeline parallel operation within the DPU, communication overhead can be compressed to near the physical limit, effectively improving the overall throughput of the tensor parallel inference service.
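To put the 20-30% figure in perspective, the following back-of-the-envelope sketch applies an Amdahl-style bound to end-to-end latency. The function name and the reduction factor are illustrative assumptions, not values stated in this application; only the 20-30% communication share comes from the text above.

```python
def latency_speedup(comm_fraction: float, comm_reduction: float) -> float:
    """End-to-end latency speedup when only the communication share shrinks.

    comm_fraction:  share of latency spent in collective communication (0..1)
    comm_reduction: hypothetical factor by which that share is reduced (> 0)
    """
    remaining = (1.0 - comm_fraction) + comm_fraction / comm_reduction
    return 1.0 / remaining


# Upper bounds if communication cost vanished entirely (reduction -> infinity):
for share in (0.20, 0.30):
    bound = 1.0 / (1.0 - share)
    print(f"comm share {share:.0%}: speedup is at most {bound:.2f}x")

# A hypothetical 5x reduction of a 25% communication share:
print(f"{latency_speedup(0.25, 5.0):.2f}x")
```

Under these assumptions the achievable latency gain is bounded by the communication share itself, which is why the benefit is expressed in terms of pushing communication overhead toward the physical limit rather than as an unbounded speedup.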
[0056] After introducing the application scenarios and design concepts of this application, the technical solutions provided by this application will be described in detail below.
[0057] This application provides a collective communication offloading method. Figure 1 shows a general flowchart of the collective communication offloading method provided in this embodiment, illustrating the main processes of the model deployment stage (step 100) and the model inference stage (step 200). The general flow of the method is as follows:
[0058] Step 100: During the model deployment phase, the distributed inference controller generates a communication primitive blueprint based on the computation graph description file of the tensor parallel inference model and distributes the communication primitive blueprint to the DPU corresponding to the tensor parallel inference model. When the DPU receives the communication primitive blueprint, it establishes a hardware-level communication context semantic environment based on the communication primitive blueprint. This step corresponds to Step 1 (Static Analysis of Inference Model and Pre-compilation of Communication Primitives) and Step 2 (DPU Control Plane Offloading and Communication Context Solidification) in Figure 2, which are explained in detail below.
[0059] In this embodiment, during the model deployment phase, the control plane is fully brought forward to the model deployment phase through the static analysis and pre-compilation of the computation graph description file by the distributed inference controller and the solidification of the communication context of the communication primitive blueprint by the DPU. Specifically, this includes the following steps:
[0060] Step 1011: After reading the computation graph description file of the tensor parallel inference model, the distributed inference controller parses the computation graph description file to obtain static feature information; wherein, the static feature information includes:
[0061] Tensor fragmentation information: tensor shape (such as tensor fragmentation dimension, number of fragments, fragmentation index, etc.), data type, communication frequency, used to determine data fragmentation strategy and memory slot size;
[0062] Collective communication mode information: The position of collective communication operations such as AllReduce, AllGather, and ReduceScatter in the computation graph, used to identify communication boundaries and determine the type of communication primitives;
[0063] Communication participant information: The inference computing unit identifiers and DPU list (including DPU identifiers and their network addresses) that participate in each collective communication, used to construct the communication tree topology and establish RDMA connection mapping.
[0064] This static feature information is acquired once during the model deployment phase and used to generate the communication primitive blueprint, so that no runtime parsing overhead is incurred in the subsequent inference phase.
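As a rough illustration of Step 1011, the static feature information could be modeled as plain records extracted from a parsed computation graph. The application does not specify a file format, so the graph representation (a list of dictionaries with `op`, `shape`, `dtype`, and `participants` keys) and every name below are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Collective operations named in the text; they mark communication boundaries.
COLLECTIVE_OPS = {"AllReduce", "AllGather", "ReduceScatter"}


@dataclass(frozen=True)
class CommBoundary:
    node_index: int                 # position of the collective op in the graph
    primitive: str                  # communication primitive type
    shard_shape: Tuple[int, ...]    # shape of the tensor fragment
    dtype: str                      # element data type
    participants: Tuple[str, ...]   # DPU identifiers taking part


def extract_static_features(graph: List[dict]) -> List[CommBoundary]:
    """Scan the graph once and record every collective operation found."""
    boundaries = []
    for index, node in enumerate(graph):
        if node["op"] in COLLECTIVE_OPS:
            boundaries.append(CommBoundary(
                node_index=index,
                primitive=node["op"],
                shard_shape=tuple(node["shape"]),
                dtype=node["dtype"],
                participants=tuple(node["participants"]),
            ))
    return boundaries


# Toy graph: two compute ops, each followed by a collective.
toy_graph = [
    {"op": "MatMul"},
    {"op": "AllReduce", "shape": [4, 1024], "dtype": "fp16",
     "participants": ["dpu0", "dpu1"]},
    {"op": "MatMul"},
    {"op": "AllGather", "shape": [4, 512], "dtype": "fp16",
     "participants": ["dpu0", "dpu1"]},
]
features = extract_static_features(toy_graph)
```

Because the scan runs once at deployment time, its cost does not appear on the inference path, which is the point made in the paragraph above.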
[0065] Step 1012: The distributed inference controller generates a communication primitive blueprint based on static feature information; wherein, the communication primitive blueprint includes:
[0066] Communication tree topology: for example, a dual binary tree or a ring topology, with the optimal structure selected based on the number of participants and the characteristics of the network topology;
[0067] Data sharding strategies: such as fixed-length data sharding strategies, to ensure load balancing among all participants;
[0068] Lock-free memory slot pre-allocation table: pre-allocates receiving slots for each participant, establishing a mapping relationship between physical addresses and remote memory addresses.
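The three blueprint components listed above can be sketched as follows. This is a minimal illustration under stated assumptions: a ring is used as the topology, shards are split evenly, and slot addresses are offsets from a made-up base address. The function names, dictionary layout, and `base_addr` value are all hypothetical and are not taken from this application.

```python
from typing import Dict, List, Tuple


def ring_topology(participants: List[str]) -> Dict[str, str]:
    """Map each participant to its next hop on a ring."""
    n = len(participants)
    return {p: participants[(i + 1) % n] for i, p in enumerate(participants)}


def shard_ranges(num_elements: int, num_shards: int) -> List[Tuple[int, int]]:
    """Return (offset, length) per shard; equal lengths when evenly divisible."""
    base, extra = divmod(num_elements, num_shards)
    ranges, offset = [], 0
    for i in range(num_shards):
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


def preallocate_slots(participants: List[str], slot_bytes: int,
                      base_addr: int = 0x1000_0000) -> Dict[Tuple[str, str], int]:
    """One receive slot per (receiver, source) pair, at a fixed address."""
    return {
        (receiver, source): base_addr + index * slot_bytes
        for receiver in participants
        for index, source in enumerate(participants)
    }


def build_blueprint(participants: List[str], num_elements: int,
                    bytes_per_element: int) -> dict:
    shards = shard_ranges(num_elements, len(participants))
    slot_bytes = max(length for _, length in shards) * bytes_per_element
    return {
        "topology": ring_topology(participants),
        "shards": shards,
        "slot_bytes": slot_bytes,
        "slots": preallocate_slots(participants, slot_bytes),
    }


blueprint = build_blueprint(["dpu0", "dpu1", "dpu2", "dpu3"],
                            num_elements=1024, bytes_per_element=2)
```

Giving each source its own fixed slot means no two writers ever target the same address, which is one plausible way to make the receive path lock-free.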
[0069] In one alternative implementation, the communication primitive blueprint may also include a pre-defined security policy and key, pre-assigning an encryption key to each group of communication participants (i.e., DPUs), so that during the model inference phase, RDMA communication between each DPU can be automatically encrypted / decrypted at the hardware level based on the pre-assigned encryption key, achieving zero-latency secure communication offloading.
[0070] In another alternative implementation, the communication primitive blueprint can also be generated in intermediate representation (IR) form rather than hardware-specific instructions; the DPU includes a lightweight IR interpreter or JIT compiler, allowing the same blueprint to be adapted to DPU hardware from different vendors, thus improving versatility.
[0071] Step 1013: The distributed inference controller sends the communication primitive blueprint to the DPU corresponding to the tensor parallel inference model through the PCIe control channel.
[0072] Step 1014: The DPU performs the following communication semantic context solidification operations based on the communication primitive blueprint:
[0073] Establish a local memory reception slot pool.
[0074] Establish an RDMA one-sided write target address mapping between the physical address of each receiving slot and the pre-registered memory address of the peer data processing unit;
[0075] The communication tree topology is solidified and a hardware forwarding flow table is generated, so that the DPU's data plane can directly route collective communication packets without the host CPU's involvement;
[0076] Initialize the collective communication state machine, which is completely detached from the host CPU and maintained independently by the DPU core.
[0077] In this embodiment, the receive slot pool, RDMA one-sided write target address mapping, communication tree topology, hardware forwarding flow table, and collective communication state machine together constitute a hardware-level communication context semantic environment. Thus, during the model inference phase, the DPU's data plane can directly route collective communication packets without the host CPU's involvement, and the collective communication state machine is completely detached from the host CPU and maintained independently by the DPU core. Compared to existing DPU offloading schemes, where control operations such as communication group establishment and memory registration still rely on the host CPU calling the MPI interface, this embodiment brings the control plane forward entirely to the model deployment phase. During model inference, the DPU has no control path dependency on the host CPU, achieving zero host CPU involvement.
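The four solidification operations of Step 1014 can be pictured with a small software model. This is only a simulation under assumptions: real DPUs would hold this state in hardware tables and registered memory, and the class names, state names, dictionary keys, and addresses below are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto
from types import MappingProxyType
from typing import Dict, Mapping


class CollectiveState(Enum):
    IDLE = auto()          # waiting for a trigger signal
    TRANSFERRING = auto()  # DMA pull and RDMA send in flight
    AGGREGATING = auto()   # all slots filled, aggregation running
    DONE = auto()          # result written back


@dataclass
class CommContext:
    slot_pool: Dict[str, bytearray]        # one receive slot per peer
    rdma_write_targets: Mapping[str, int]  # peer -> pre-registered address
    forwarding_table: Mapping[str, str]    # node -> next hop, read-only
    state: CollectiveState = CollectiveState.IDLE


def solidify_context(self_id: str, blueprint: dict,
                     peer_registered_addrs: Dict[str, int]) -> CommContext:
    peers = [p for p in blueprint["participants"] if p != self_id]
    # 1. Receive slot pool in local memory.
    slot_pool = {p: bytearray(blueprint["slot_bytes"]) for p in peers}
    # 2. RDMA one-sided write target mapping, frozen after setup.
    targets = MappingProxyType({p: peer_registered_addrs[p] for p in peers})
    # 3. Topology frozen into a read-only forwarding table.
    table = MappingProxyType(dict(blueprint["topology"]))
    # 4. State machine starts in IDLE (dataclass default).
    return CommContext(slot_pool, targets, table)


blueprint = {
    "participants": ["dpu0", "dpu1", "dpu2"],
    "slot_bytes": 512,
    "topology": {"dpu0": "dpu1", "dpu1": "dpu2", "dpu2": "dpu0"},
}
ctx = solidify_context("dpu0", blueprint, {"dpu1": 0x2000, "dpu2": 0x3000})
```

The read-only mappings stand in for "solidification": once deployment finishes, nothing on the inference path needs to resolve an address or look up a route.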
[0078] Step 200: During the model inference phase, the inference computation unit performs forward propagation computation based on the tensor parallel inference model. When the computation reaches the communication boundary, it sends a trigger signal to the DPU. Upon receiving the trigger signal, the DPU, based on the hardware-level communication context semantic environment, executes the DMA tensor fragment data pull pipeline and the RDMA tensor fragment data send pipeline in parallel. After performing aggregation computation on all tensor fragment data to be synchronized, it writes the aggregation computation result back to the inference computation unit and triggers the inference computation unit to continue forward propagation computation based on the aggregation computation result. This step corresponds to Step 3 (Tensor computation and collective communication dual-pipeline startup) and Step 4 (Data plane aggregation and result write-back) in Figure 2, which are explained in detail below.
[0079] In this embodiment, during the model inference stage, the end-to-end communication latency is reduced through the dual-pipeline parallel operation of the DPU and the data plane aggregation and write-back operation, thus meeting the microsecond-level synchronization requirements of large-scale tensor parallelism in ultra-high-speed network environments. Specifically, the following steps are included:
[0080] Step 2011: The inference computing unit (such as GPU / NPU) performs forward propagation computation based on the tensor parallel inference model. When the computation reaches the communication boundary, it sends a single write operation trigger signal to the DPU through the lightweight doorbell register.
[0081] In this embodiment, the communication boundary is the precise location in the computation graph where cross-device data synchronization is triggered, i.e., the watershed between the computation flow and the communication flow: before the communication boundary, the GPU / NPU performs forward propagation computation; after the communication boundary, the DPU performs tensor fragment data aggregation and writes it back to the GPU / NPU, allowing the GPU / NPU to continue the next stage of forward propagation computation. During the model deployment phase, the distributed inference controller resolves the pre-defined communication boundary coordinates through the computation graph; when the GPU / NPU reaches the marker layer represented by the boundary coordinates, it sends a single write operation trigger signal to the DPU through the doorbell register. This operation does not require interrupt handling by the host CPU and directly triggers the communication engine on the DPU.
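The single-write doorbell trigger can be modeled in software as one store to a small shared register that the DPU side polls. This is a simulation only: the 32-bit width, the use of a nonzero boundary identifier as the payload, and the read-and-clear polling are assumptions made for illustration, not details given in this application.

```python
import struct


class DoorbellRegister:
    """A 32-bit doorbell modeled as 4 bytes of shared memory (simulation)."""

    def __init__(self) -> None:
        self._mem = bytearray(4)

    def ring(self, boundary_id: int) -> None:
        # Compute side: one write of a nonzero boundary id, no interrupt.
        if boundary_id <= 0:
            raise ValueError("boundary_id must be a positive integer")
        struct.pack_into("<I", self._mem, 0, boundary_id)

    def poll(self) -> int:
        # DPU side: read the register and clear it; 0 means nothing pending.
        (value,) = struct.unpack_from("<I", self._mem, 0)
        if value:
            struct.pack_into("<I", self._mem, 0, 0)
        return value


doorbell = DoorbellRegister()
doorbell.ring(3)            # GPU/NPU reached the layer marked as boundary 3
pending = doorbell.poll()   # DPU communication engine picks up the trigger
```

In this model the compute side performs exactly one write per communication boundary, and the host CPU never appears in the exchange.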
[0082] Step 2012: The communication engine on the DPU executes the following dual-pipeline operations in parallel:
[0083] DMA Tensor Fragment Data Retrieval Pipeline: The tensor fragment data to be synchronized is retrieved from the video memory of the inference computing unit (such as GPU / NPU) through the PCIe DMA engine of the inference computing unit, and the tensor fragment data to be synchronized is stored in the transmit buffer.
[0084] RDMA Tensor Fragment Data Transmission Pipeline: Based on the communication tree topology in the hardware-level communication context semantic environment, the pre-configured hardware forwarding flow table, and the RDMA one-sided write target address mapping, the RDMA one-sided write operation is initiated to write the tensor fragment data to be synchronized in the transmission buffer into the preset receive slot of the peer DPU.
[0085] Compared to existing DPU offloading schemes that only overlap the two coarse-grained tasks of host computing and DPU communication, this application embodiment further constructs a data pull pipeline and a data send pipeline within the DPU, enabling PCIe DMA transmission and RDMA network transmission to be executed in parallel within the DPU, thereby effectively eliminating serial waiting within the DPU.
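As a non-limiting illustration of why removing serial waiting inside the DPU matters, the following sketch compares a serial schedule with a two-stage pipelined schedule. It assumes the tensor is split into equal chunks and that each chunk takes a fixed time for the PCIe DMA pull and a fixed time for the RDMA write; the function names and the timing values are hypothetical and are not taken from this application.

```python
def serial_latency(num_chunks, t_dma, t_rdma):
    """Every chunk is pulled over PCIe and only then sent over the network."""
    return num_chunks * (t_dma + t_rdma)


def pipelined_latency(num_chunks, t_dma, t_rdma):
    """Chunk i+1 is pulled while chunk i is being sent (two-stage pipeline)."""
    if num_chunks == 0:
        return 0.0
    # The first chunk pays both stages; each later chunk is paced by the slower stage.
    return t_dma + t_rdma + (num_chunks - 1) * max(t_dma, t_rdma)


# Illustrative values only: 8 chunks, 2 us per DMA pull, 3 us per RDMA write.
print(serial_latency(8, 2.0, 3.0))     # 40.0
print(pipelined_latency(8, 2.0, 3.0))  # 26.0
```

Under this simplified model the pipelined schedule approaches the cost of the slower stage alone as the number of chunks grows, which is the effect the dual-pipeline design aims at.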
[0086] Step 2013: After the DPU determines that each receive slot has received the tensor fragment data to be synchronized from all source DPUs (i.e., participants), it triggers the aggregation calculation unit within the DPU to perform the following aggregation calculation according to the communication primitive type:
[0087] If the communication primitive type is AllReduce operation, then perform element-wise summation / average calculation on the tensor fragment data to be synchronized in each receiving slot;
[0088] If the communication primitive type is AllGather operation, then the tensor fragments to be synchronized in each receiving slot are spliced into a complete tensor according to a predetermined offset;
[0089] If the communication primitive type is ReduceScatter operation, then after reducing the tensor fragment data to be synchronized in each receiving slot, only the reduction result required by itself is retained.
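The three cases above can be sketched as follows. This is a minimal host-side illustration, not the DPU's aggregation calculation unit: it assumes each receive slot holds an equal-length list of numbers, that slots are ordered by source rank, and that the predetermined offset of a fragment is its rank multiplied by the fragment length. The function name `aggregate` and its parameters are hypothetical.

```python
def aggregate(primitive, slots, rank=0, average=False):
    """Combine the fragments held in the receive slots.

    slots: equal-length lists of numbers, one per source DPU, ordered by rank.
    rank:  index of the local node (used only by ReduceScatter).
    """
    n = len(slots)
    if primitive == "AllGather":
        frag_len = len(slots[0])
        out = [None] * (n * frag_len)
        for src, frag in enumerate(slots):
            offset = src * frag_len          # assumed offset rule for this sketch
            out[offset:offset + frag_len] = frag
        return out
    if primitive in ("AllReduce", "ReduceScatter"):
        reduced = [sum(values) for values in zip(*slots)]   # element-wise sum
        if average:
            reduced = [v / n for v in reduced]
        if primitive == "AllReduce":
            return reduced
        shard = len(reduced) // n            # assumes the length divides evenly
        return reduced[rank * shard:(rank + 1) * shard]
    raise ValueError(f"unsupported primitive: {primitive}")


slots = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(aggregate("AllReduce", slots))              # [11, 22, 33, 44]
print(aggregate("AllGather", slots))              # [1, 2, 3, 4, 10, 20, 30, 40]
print(aggregate("ReduceScatter", slots, rank=1))  # [33, 44]
```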
[0090] In this embodiment, during the model deployment phase, the distributed inference controller can further break down aggregation computation operations (such as the summation operation of AllReduce) into multiple sub-steps (such as different levels of the reduction tree) and embed them into the DPU's tensor shard data retrieval pipeline. That is, data is retrieved while partial aggregation is performed, realizing a deeper parallel pipeline of retrieval-aggregation-send, further reducing end-to-end communication latency. Simultaneously, during the model inference phase, the DPU can also monitor the real-time utilization of PCIe bandwidth and network bandwidth. When the two are mismatched (such as network congestion), the buffer size and scheduling strategy between the two pipelines are dynamically adjusted to prioritize the utilization of bottleneck resources, achieving dynamic load balancing.
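The idea of aggregating while data is still arriving can be illustrated with a small accumulator that folds each chunk into a running sum as soon as it is received. This is only a sketch under assumed conventions: chunks arrive as (offset, values) pairs, the reduction is a sum, and completion is detected by counting elements. The class name `StreamingSum` is hypothetical.

```python
class StreamingSum:
    """Element-wise sum that is updated chunk by chunk as data arrives."""

    def __init__(self, length, num_sources):
        self.acc = [0.0] * length
        self.remaining = length * num_sources   # elements still expected

    def add_chunk(self, offset, values):
        """Fold one chunk into the running sum; return True when all data is in."""
        for i, v in enumerate(values):
            self.acc[offset + i] += v
        self.remaining -= len(values)
        return self.remaining == 0


reducer = StreamingSum(length=4, num_sources=2)
reducer.add_chunk(0, [1, 2])     # first half from source A
reducer.add_chunk(2, [10, 20])   # second half from source B
reducer.add_chunk(2, [3, 4])     # second half from source A
done = reducer.add_chunk(0, [5, 6])   # first half from source B
print(done, reducer.acc)         # True [6.0, 8.0, 13.0, 24.0]
```

Because each chunk is reduced on arrival, no separate aggregation pass over the full tensor is needed once the last chunk lands.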
[0091] Step 2014: After the aggregation calculation unit completes the aggregation calculation, the DPU writes the aggregation calculation result back to the memory of the inference calculation unit (such as the GPU / NPU) through the PCIe DMA engine, and sends a lightweight completion notification based on the doorbell register to the inference calculation unit to notify it to continue the forward propagation calculation of the next stage. For example, after writing the aggregation calculation result back to the GPU / NPU's memory, the DPU sends a single write operation trigger signal to the GPU / NPU's doorbell register. The GPU / NPU's hardware scheduler continuously polls the current state of the doorbell register. If the current state is not ready, the forward propagation calculation of the next stage is paused to keep the pipeline idle. If the current state is that a trigger signal has been received, the forward propagation calculation of the next stage is immediately resumed from the current pause point. In this way, lightweight completion notification is implemented through PCIe memory-mapped I / O (MMIO), without triggering interrupts or system calls, which can effectively reduce end-to-end communication latency and avoid context switching overhead. This mechanism enables hardware-level collaboration between inference computing units (such as GPUs / NPUs) and DPUs without the need for software intervention.
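The write-then-notify ordering and the polling behaviour described above can be modelled in software as follows. This is a host-side simulation only: a Python object stands in for the memory-mapped doorbell register, a thread stands in for the DPU, and the class and function names are hypothetical. Real hardware would use MMIO writes and a hardware scheduler rather than threads.

```python
import threading


class Doorbell:
    """Stand-in for a memory-mapped doorbell register holding one word."""

    def __init__(self):
        self.value = 0

    def ring(self):
        self.value = 1          # a single write; no interrupt, no system call

    def is_set(self):
        return self.value == 1

    def clear(self):
        self.value = 0


def dpu_write_back(device_memory, aggregated, doorbell):
    """DPU side: write the result first, then ring the doorbell."""
    device_memory[:] = aggregated
    doorbell.ring()


def wait_and_resume(device_memory, doorbell):
    """Compute-unit side: poll until the doorbell is set, then continue."""
    while not doorbell.is_set():
        pass                    # next stage stays paused while not ready
    doorbell.clear()
    return sum(device_memory)   # placeholder for the next forward stage


device_memory = [0.0, 0.0, 0.0]
bell = Doorbell()
dpu = threading.Thread(target=dpu_write_back,
                       args=(device_memory, [1.0, 2.0, 3.0], bell))
dpu.start()
result = wait_and_resume(device_memory, bell)
dpu.join()
print(result)                   # 6.0
```

The key point of the sketch is the ordering: the result is fully written before the doorbell is rung, so the poller never resumes on stale data.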
[0092] Furthermore, during the model deployment phase, the distributed inference controller can pre-generate a parameterized blueprint template. This template includes a dynamic instantiation interface and multiple communication configurations, corresponding to different batch sizes. During the model inference phase, when the DPU's communication engine detects a change in batch size (e.g., from N to M) causing a change in tensor shape, it retrieves the communication configuration adapted to the changed batch size from the parameterized blueprint template via the dynamic instantiation interface as the new communication configuration and updates the hardware-level communication context semantic environment based on this new configuration. This dynamic adaptation and fault tolerance of batch size allows the communication strategy to adapt to computational needs in real time, effectively improving communication efficiency in scenarios with dynamically changing batch sizes (such as streaming inference). In another embodiment, the distributed inference controller can also predict the batch size distribution for the next time window based on historical load characteristics, enabling the DPU's communication engine to pre-instantiate the corresponding communication configuration within the DPU. This achieves zero configuration switching latency when the prediction is successful.
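A parameterized blueprint template of this kind can be pictured as a table of precomputed configurations keyed by batch size. The sketch below is illustrative only: the field names, the byte counts, and the behaviour for a batch size with no precomputed entry (raising `LookupError`) are assumptions, since this application does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommConfig:
    batch_size: int
    fragment_bytes: int   # data fragmentation strategy for this batch size
    num_slots: int        # memory slot allocation for this batch size


class BlueprintTemplate:
    """Holds one precomputed configuration per supported batch size."""

    def __init__(self, configs):
        self._by_batch = {c.batch_size: c for c in configs}

    def instantiate(self, batch_size):
        """Hypothetical dynamic instantiation interface: a table lookup."""
        if batch_size not in self._by_batch:
            raise LookupError(f"no precomputed configuration for batch size {batch_size}")
        return self._by_batch[batch_size]


class CommContext:
    def __init__(self, template, batch_size):
        self.template = template
        self.active = template.instantiate(batch_size)

    def on_batch(self, batch_size):
        """Switch configuration only when the batch size actually changed."""
        if batch_size == self.active.batch_size:
            return False
        self.active = self.template.instantiate(batch_size)
        return True


template = BlueprintTemplate([
    CommConfig(batch_size=1, fragment_bytes=4096, num_slots=4),
    CommConfig(batch_size=2, fragment_bytes=8192, num_slots=4),
    CommConfig(batch_size=4, fragment_bytes=16384, num_slots=8),
])
context = CommContext(template, batch_size=1)
```

Calling `context.on_batch(4)` replaces the active configuration with the precomputed entry for batch size 4 without recomputing topology or slot layout, which mirrors the lookup-based switch described above.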
[0093] The following provides a further detailed description of the aggregated communication offloading method provided in the embodiments of this application. (See also...) Figure 2 The diagram shown is a detailed flowchart of the collective communication offloading method provided in this application embodiment, specifically illustrating the complete process (Step 1-Step 5) from static analysis, context fixing, dual pipeline execution to dynamic adaptation, and the interaction relationships between each component. The specific flow of the collective communication offloading method provided in this application embodiment is as follows:
[0094] Step 1: Static analysis of the inference model and pre-compilation of communication primitives.
[0095] During the model deployment phase, the distributed inference controller 101 reads the computation graph description file of the tensor parallel inference model and parses the following static feature information:
[0096] Information on each parallel tensor segmentation: tensor shape 102, data type 103, communication frequency 104;
[0097] Collective communication mode information: The position of collective communication operations such as AllReduce, AllGather, and ReduceScatter in the computation graph, i.e., collective communication mode 105;
[0098] Communication participant information: A list of DPU devices participating in each collective communication and their network addresses, i.e., collective communication participant set 106.
[0099] Based on the above static feature information, the distributed inference controller 101 generates a communication primitive blueprint 107, which includes:
[0100] Pre-computed communication tree topology 108 (such as dual binary tree, ring topology);
[0101] Fixed-length data fragmentation strategy 109;
[0102] Lock-free memory slot pre-allocation table 110.
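One possible in-memory shape for such a blueprint is sketched below. It assumes a ring topology, fragments of a fixed byte length (with a shorter final fragment when the tensor size is not a multiple), and a slot table that gives every (source rank, fragment) pair its own slot so that no two writers share a slot. The names `Blueprint` and `build_blueprint` and this particular slot-numbering scheme are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Blueprint:
    topology: dict     # rank -> next rank in the ring
    fragments: list    # (byte offset, byte length) of each fragment
    slot_table: dict   # (source rank, fragment index) -> slot index


def build_blueprint(num_ranks, tensor_bytes, fragment_bytes):
    # Ring topology: each rank forwards to its successor.
    topology = {r: (r + 1) % num_ranks for r in range(num_ranks)}
    # Fixed-length fragmentation; the last fragment may be shorter.
    fragments = [(off, min(fragment_bytes, tensor_bytes - off))
                 for off in range(0, tensor_bytes, fragment_bytes)]
    # One dedicated slot per (source, fragment), so writers never contend.
    slot_table = {}
    for src in range(num_ranks):
        for idx in range(len(fragments)):
            slot_table[(src, idx)] = len(slot_table)
    return Blueprint(topology, fragments, slot_table)


bp = build_blueprint(num_ranks=4, tensor_bytes=10, fragment_bytes=4)
print(bp.topology)    # {0: 1, 1: 2, 2: 3, 3: 0}
print(bp.fragments)   # [(0, 4), (4, 4), (8, 2)]
```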
[0103] Step 2: DPU control plane offloading and communication context persistence.
[0104] The communication primitive blueprint 107 is distributed to the DPU 112 of each participating node via the PCIe control channel 111. The control plane agent 113 on the DPU performs the following hardening operations:
[0105] Establish receive slot pool 114 in the DPU's local memory;
[0106] Establish RDMA one-sided write target address mapping between each slot and the pre-registered memory region of the peer DPU 115;
[0107] The communication tree topology information is solidified and a hardware forwarding flow table 116 is generated, so that the data plane of the DPU can directly route the collective communication messages without the involvement of the CPU;
[0108] The DPU's internal collective communication state machine 117 is initialized. This collective communication state machine is completely independent of the host CPU and is maintained by the DPU core alone. The above-mentioned receive slot pool 114, RDMA one-sided write target address mapping 115, hardware forwarding flow table 116, and collective communication state machine 117 together constitute the hardware-level communication context semantic environment. This environment is established once during the model deployment phase and requires no control plane intervention during the model inference phase.
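The four components of the context can be grouped in a single structure, as in the following sketch. Everything here is an assumption made for illustration: the state names, the rule that a node writes into the peer's region at an offset equal to its own rank times the slot size, and the names `HardwareContext` and `establish_context`. A real DPU would hold these in device memory and hardware tables rather than Python objects.

```python
from dataclasses import dataclass
from enum import Enum


class CollectiveState(Enum):
    IDLE = "idle"
    TRANSFER = "transfer"
    AGGREGATE = "aggregate"
    WRITE_BACK = "write_back"


@dataclass
class HardwareContext:
    slot_pool: dict          # source rank -> local receive slot
    write_targets: dict      # peer rank -> remote address this node writes to
    forwarding_table: dict   # rank -> next hop in the communication tree
    state: CollectiveState = CollectiveState.IDLE

    def advance(self):
        """Move to the next state, wrapping back to IDLE after write-back."""
        order = list(CollectiveState)
        self.state = order[(order.index(self.state) + 1) % len(order)]
        return self.state


def establish_context(rank, peers, slot_bytes, peer_base_addr, next_hop):
    # Receive slot pool: one local buffer per source peer.
    slot_pool = {peer: bytearray(slot_bytes) for peer in peers}
    # One-sided write targets: this node's reserved slot inside each peer's
    # pre-registered region (assumed layout: base + rank * slot size).
    write_targets = {peer: peer_base_addr[peer] + rank * slot_bytes
                     for peer in peers}
    return HardwareContext(slot_pool, write_targets, dict(next_hop))


ctx = establish_context(rank=1, peers=[0, 2], slot_bytes=64,
                        peer_base_addr={0: 0x1000, 2: 0x2000},
                        next_hop={1: 2})
```

Once built, the structure is only read during inference apart from the state field, which reflects the statement that the environment is established once at deployment time.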
[0109] Step 3: Start the dual pipeline for tensor computation and collective communication.
[0110] The GPU / NPU 118 begins forward propagation computation. When the computation reaches the communication boundary 119, it sends a single write operation 120 to the DPU through the lightweight doorbell register 121. This operation does not require interrupt handling by the host CPU and directly triggers the communication engine 122 on the DPU 112.
[0111] The DPU's communication engine 122 utilizes the DPU's multi-core parallel processing capabilities to execute the following dual-pipeline operations in parallel:
[0112] DMA Tensor Fragment Data Retrieval Pipeline: Tensor fragment data to be synchronized is retrieved from the host GPU / NPU memory 124 via the PCIe DMA engine 123 and stored in the send buffer 126; this operation is scheduled by a dedicated core of the DPU and utilizes PCIe bandwidth to achieve high-speed data retrieval.
[0113] RDMA Tensor Fragment Data Transmission Pipeline: Based on the fixed communication tree topology 108 and hardware forwarding flow table 116, an RDMA one-sided write operation 127 is initiated to write the tensor fragment data to be synchronized in the transmit buffer 126 into the receive slot 114 of the peer DPU. This operation is scheduled by another dedicated core of the DPU, utilizing the RDMA hardware offloading capability to achieve zero-copy network transmission.
[0114] The two pipelines execute in complete parallelism within the DPU, and data consistency is guaranteed through a hardware semaphore mechanism: when the DMA completes the fetching of a partial data block, it immediately notifies the RDMA send thread to start the transmission, thus achieving pipeline-level parallelism.
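The chunk-level handoff between the two pipelines can be simulated with two threads and a counting semaphore. This is a software analogy only: Python threads stand in for the two dedicated DPU cores, a `deque` stands in for the send buffer, and a list stands in for the peer's receive slot. The function name `run_dual_pipeline` and the chunk size are hypothetical.

```python
import threading
from collections import deque


def run_dual_pipeline(tensor, chunk_size):
    """Pull chunks and send them concurrently; return what the peer received."""
    send_buffer = deque()
    ready = threading.Semaphore(0)    # counts chunks pulled but not yet sent
    received = []                     # stand-in for the peer's receive slot
    num_chunks = (len(tensor) + chunk_size - 1) // chunk_size

    def dma_pull():
        for off in range(0, len(tensor), chunk_size):
            send_buffer.append(tensor[off:off + chunk_size])
            ready.release()           # notify the sender as soon as a chunk is in

    def rdma_send():
        for _ in range(num_chunks):
            ready.acquire()           # wait until at least one chunk is available
            received.extend(send_buffer.popleft())

    workers = [threading.Thread(target=dma_pull),
               threading.Thread(target=rdma_send)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return received


print(run_dual_pipeline(list(range(10)), chunk_size=3))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The sender never waits for the whole tensor: it starts on the first chunk while later chunks are still being pulled, and the semaphore guarantees it never reads a chunk that has not been written.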
[0115] Step 4: Data aggregation and result write-back.
[0116] After collecting tensor fragment data from all participants, the receive slot 114 of each DPU 112 triggers the aggregation calculation unit 128 to perform the following operations according to the communication primitive type:
[0117] If it is AllReduce operation 129: Perform element-wise summation / average on each tensor fragment data in multiple receive slots;
[0118] If it is AllGather operation 130: the tensor fragments from multiple receiving slots are spliced together into a complete tensor according to a predetermined offset;
[0119] If it is ReduceScatter operation 131: After reducing the tensor fragments in multiple receive slots, only the reduction result fragments required by this node are retained.
[0120] After aggregation is completed, the DPU writes the aggregation calculation result back to the GPU / NPU memory 124 through the PCIe DMA engine 123 and triggers a lightweight completion notification 132 to notify the GPU / NPU that the next stage of calculation can begin immediately.
[0121] Step 5: Dynamic adaptation and fault tolerance.
[0122] During the model deployment phase, the distributed inference controller 101 pre-generates a parameterized blueprint template 134. This template includes a dynamic instantiation interface and pre-computed multiple sets of communication configurations, corresponding to communication topologies, data sharding strategies, and memory slot allocations for different batch sizes (e.g., BatchSize=1, 2, 4). When the batch size changes, for example, from N to M, the DPU's communication engine 122 detects the tensor shape change in real time and, through the dynamic instantiation interface, searches for a communication configuration matching the changed tensor shape in the parameterized blueprint template 134. It then instantiates the communication configuration corresponding to the new size through a hot-loading mechanism and dynamically updates the hardware-level communication context semantic environment. The entire process takes less than 0.5 microseconds and does not require re-executing the complete control plane flow, achieving a balance between high performance and dynamic adaptation.
[0123] Based on the above embodiments, this application provides a collective communication offloading system. Figure 3 is a structural diagram of the collective communication offloading system provided in this embodiment of the application, illustrating the relationship between the distributed inference controller, the data processing unit, and the inference computing unit. The collective communication offloading system is labeled 300 and includes at least:
[0124] The distributed inference controller is used to generate a communication primitive blueprint based on the computation graph description file of the tensor parallel inference model during the model deployment phase, and then distribute the communication primitive blueprint to the data processing unit corresponding to the tensor parallel inference model.
[0125] The data processing unit is used to establish a hardware-level communication context semantic environment based on the communication primitive blueprint issued by the distributed inference controller during the model deployment phase; and to execute the DMA tensor fragment data pull pipeline and the RDMA tensor fragment data send pipeline in parallel based on the hardware-level communication context semantic environment when it receives the trigger signal sent by the inference computing unit during the model inference phase. After performing aggregation calculation on all tensor fragment data to be synchronized, the aggregation calculation result is written back to the inference computing unit.
[0126] The inference computation unit is used to perform forward propagation computation based on the tensor parallel inference model during the model inference phase. When the computation reaches the communication boundary, it sends a trigger signal to the data processing unit. When it is determined that the data processing unit has written back the aggregation computation result, it continues to perform forward propagation computation based on the aggregation computation result.
[0127] In one possible implementation, the distributed inference controller is used to read the computation graph description file of the tensor parallel inference model and parse the computation graph description file to obtain static feature information; based on the static feature information, a communication primitive blueprint is generated, which includes the communication tree topology, data sharding strategy, and lock-free memory slot pre-allocation table; wherein the static feature information includes: tensor fragment information, collective communication mode information, and communication participant information; the tensor fragment information includes the dimension, number of fragments, and fragment index of the tensor fragments; the collective communication mode information includes AllReduce, AllGather, and ReduceScatter; the communication participant information includes the identifiers of the inference computation unit and the data processing unit participating in the communication.
[0128] In one possible implementation, the data processing unit is configured to perform the following communication semantic context solidification operations based on the communication primitive blueprint to form a hardware-level communication context semantic environment: establishing a receive slot pool in local memory; establishing an RDMA one-sided write target address mapping between the physical address of each receive slot and the pre-registered memory address of the peer data processing unit; solidifying the communication tree topology and generating a hardware forwarding flow table; and initializing the collective communication state machine.
[0129] In one possible implementation, the inference computing unit is configured to send a single write operation trigger signal to the data processing unit via a lightweight doorbell register when computing reaches the communication boundary.
[0130] In one possible implementation, the data processing unit is configured to perform the following dual-pipeline operations based on a hardware-level communication context semantic environment:
[0131] The DMA tensor fragment data retrieval pipeline includes: retrieving tensor fragment data to be synchronized from the inference computing unit through the PCIe DMA engine of the inference computing unit and storing it in the transmit buffer;
[0132] The RDMA tensor fragment data transmission pipeline includes: based on the communication tree topology in the hardware-level communication context semantic environment, the pre-configured hardware forwarding flow table, and the RDMA one-sided write target address mapping, initiating the RDMA one-sided write operation, and writing the tensor fragment data to be synchronized in the transmission buffer into the preset receive slot of the peer data processing unit.
[0133] In one possible implementation, the data processing unit, after detecting that all tensor fragment data to be synchronized has been received in each receiving slot, triggers the aggregation calculation unit within the data processing unit to perform the following aggregation calculations according to the communication primitive type: if the communication primitive type is AllReduce operation, then element-wise summation / average is performed on the tensor fragment data to be synchronized in each receiving slot; if the communication primitive type is AllGather operation, then the tensor fragment data to be synchronized in each receiving slot is concatenated into a complete tensor according to a predetermined offset; if the communication primitive type is ReduceScatter operation, then after reducing the tensor fragment data to be synchronized in each receiving slot, only the reduction result required for itself is retained.
[0134] In one possible implementation, the data processing unit is configured to write the aggregation calculation result back to the inference calculation unit via the PCIe DMA engine and send a lightweight completion notification based on the doorbell register to the inference calculation unit to notify the inference calculation unit to continue executing the next stage of forward propagation calculation.
[0135] In one possible implementation, the distributed inference controller is also used to pre-generate parameterized blueprint templates during the model deployment phase; wherein the parameterized blueprint templates include dynamic instantiation interfaces and communication configurations corresponding to different batch sizes;
[0136] The data processing unit is also used during the model inference phase to detect when the batch size changes, and through a dynamic instantiation interface, retrieve the communication configuration adapted to the changed batch size from the parameterized blueprint template as the new communication configuration, and update the hardware-level communication context semantic environment based on the new communication configuration.
[0137] It should be noted that the collective communication offloading system 300 provided in this application embodiment solves the technical problem on a principle similar to that of the collective communication offloading method provided in this application embodiment. Therefore, for the implementation of the collective communication offloading system 300, reference can be made to the implementation of the collective communication offloading method, and the repeated parts will not be described again.
[0138] Having introduced the collective communication offloading method and system provided in the embodiments of this application, the electronic device provided in the embodiments of this application is briefly introduced next.
[0139] The electronic device provided in this application includes a memory, a processor, a distributed inference controller, at least one data processing unit, and at least one inference computing unit. The memory stores computer instructions that can be executed by the processor. When the processor executes the computer instructions, the distributed inference controller, the at least one data processing unit, and the at least one inference computing unit cooperatively execute the above-mentioned collective communication offloading method provided in this application.
[0140] Furthermore, embodiments of this application also provide a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the aforementioned collective communication offloading method provided in embodiments of this application. Specifically, the computer instructions may be built into or installed in a processor, enabling the processor to implement the aforementioned collective communication offloading method provided in embodiments of this application by executing the built-in or installed computer instructions.
[0141] It should be noted that although several units or sub-units of the system have been mentioned in the detailed description above, this division is merely exemplary and not mandatory. In fact, according to the embodiments of this application, the features and functions of two or more units described above can be embodied in one unit. Conversely, the features and functions of one unit described above can be further divided and embodied by multiple units.
[0142] Furthermore, although the operations of the method of this application are described in a specific order in the accompanying drawings, this does not require or imply that these operations must be performed in that specific order, or that all the operations shown must be performed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and / or one step may be broken down into multiple steps.
[0143] Although preferred embodiments of this application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of this application.
[0144] Obviously, those skilled in the art can make various modifications and variations to the embodiments of this application without departing from the spirit and scope of the embodiments of this application. Therefore, if these modifications and variations to the embodiments of this application fall within the scope of the claims of this application and their equivalents, this application also intends to include these modifications and variations.
Claims
1. A collective communication offloading method, characterized in that it comprises: During the model deployment phase, the distributed inference controller generates a communication primitive blueprint based on the computation graph description file of the tensor parallel inference model, and sends the communication primitive blueprint to the data processing unit corresponding to the tensor parallel inference model. When the data processing unit receives the communication primitive blueprint, it establishes a hardware-level communication context semantic environment based on the communication primitive blueprint; the communication primitive blueprint includes a communication tree topology, a data fragmentation strategy, and a lock-free memory slot pre-allocation table; During the model inference phase, the inference computation unit performs forward propagation computation based on the tensor parallel inference model. When the computation reaches the communication boundary, it sends a trigger signal to the data processing unit. When the data processing unit receives the trigger signal, it executes the DMA tensor fragment data pull pipeline and the RDMA tensor fragment data send pipeline in parallel based on the hardware-level communication context semantic environment. After performing aggregation calculation on all tensor fragment data to be synchronized, the aggregation calculation result is written back to the inference computation unit, and the inference computation unit is triggered to continue to perform forward propagation computation based on the aggregation calculation result.
The establishment of a hardware-level communication context semantic environment based on the communication primitive blueprint includes: Based on the communication primitive blueprint, the following communication semantic context solidification operations are performed to form the hardware-level communication context semantic environment: Establish a receive slot pool in local memory; Establish an RDMA one-sided write target address mapping between the physical address of each receive slot and the pre-registered memory address of the peer data processing unit; Solidify the communication tree topology and generate a hardware forwarding flow table; Initialize the collective communication state machine.
2. The collective communication offloading method as described in claim 1, characterized in that generating the communication primitive blueprint based on the computation graph description file of the tensor parallel inference model includes: The computation graph description file of the tensor parallel inference model is read, and the computation graph description file is parsed to obtain static feature information. The static feature information includes: tensor fragment information, collective communication mode information, and communication participant information. The tensor fragment information includes the dimension, number of fragments, and fragment index of the tensor fragments. The collective communication mode information includes AllReduce, AllGather, and ReduceScatter. The communication participant information includes the identifiers of the inference computation unit and the data processing unit participating in the communication. Based on the static feature information, a communication primitive blueprint is generated, which includes the communication tree topology, data sharding strategy, and lock-free memory slot pre-allocation table.
3. The collective communication offloading method as described in claim 1, characterized in that sending a trigger signal to the data processing unit when the computation reaches the communication boundary includes: When the computation reaches the communication boundary, a single write operation trigger signal is sent to the data processing unit through the lightweight doorbell register.
4. The collective communication offloading method as described in claim 1, characterized in that the DMA tensor fragment data retrieval pipeline includes: retrieving tensor fragment data to be synchronized from the inference computing unit through the PCIe DMA engine of the inference computing unit and storing it in the transmission buffer; and the RDMA tensor fragment data transmission pipeline includes: based on the communication tree topology in the hardware-level communication context semantic environment, the pre-configured hardware forwarding flow table, and the RDMA one-sided write target address mapping, initiating an RDMA one-sided write operation to write the tensor fragment data to be synchronized in the transmission buffer into the preset receive slot of the peer data processing unit.
5. The collective communication offloading method as described in claim 1, characterized in that performing aggregation calculation on all tensor fragment data to be synchronized includes: After detecting that all tensor fragment data to be synchronized has been received in each receive slot, the aggregation calculation unit in the data processing unit is triggered to perform the following aggregation calculation according to the communication primitive type: If the communication primitive type is an AllReduce operation, element-wise summation / average calculation is performed on the tensor fragment data to be synchronized in each receive slot; If the communication primitive type is an AllGather operation, the tensor fragment data to be synchronized in each receive slot is spliced into a complete tensor according to a predetermined offset; If the communication primitive type is a ReduceScatter operation, then after reducing the tensor fragment data to be synchronized in each receive slot, only the reduction result required by the node itself is retained.
6. The collective communication offloading method as described in claim 1, characterized in that writing the aggregation calculation result back to the inference computing unit and triggering the inference computing unit to continue performing forward propagation computation based on the aggregation calculation result includes: The aggregation calculation result is written back to the inference computing unit via the PCIe DMA engine, and a lightweight completion notification based on the doorbell register is sent to the inference computing unit to notify the inference computing unit to continue the forward propagation computation of the next stage.
7. A collective communication offloading system, characterized in that it comprises: A distributed inference controller, used to generate a communication primitive blueprint based on the computation graph description file of the tensor parallel inference model during the model deployment phase, and to distribute the communication primitive blueprint to the data processing unit corresponding to the tensor parallel inference model; the communication primitive blueprint includes a communication tree topology, a data sharding strategy, and a lock-free memory slot pre-allocation table. The data processing unit is used to establish a hardware-level communication context semantic environment based on the communication primitive blueprint issued by the distributed inference controller during the model deployment phase; and to execute the DMA tensor fragment data pull pipeline and the RDMA tensor fragment data send pipeline in parallel based on the hardware-level communication context semantic environment when receiving the trigger signal sent by the inference computing unit during the model inference phase. After performing aggregation calculation on all tensor fragment data to be synchronized, the aggregation calculation result is written back to the inference computing unit. The inference computation unit is used to perform forward propagation computation based on the tensor parallel inference model during the model inference phase. When the computation reaches the communication boundary, it sends a trigger signal to the data processing unit. Based on the aggregation calculation result written back by the data processing unit, it continues to perform forward propagation computation.
The data processing unit is configured to perform the following communication semantic context solidification operations based on the communication primitive blueprint to form the hardware-level communication context semantic environment: establishing a receive slot pool in local memory; establishing an RDMA one-sided write target address mapping between the physical address of each receive slot and the pre-registered memory address of the peer data processing unit; solidifying the communication tree topology and generating a hardware forwarding flow table; and initializing the collective communication state machine.
8. An electronic device, characterized in that it comprises a memory, a processor, a distributed inference controller, at least one data processing unit, and at least one inference computing unit; the memory stores computer instructions executable by the processor, and when the processor executes the computer instructions, it causes the distributed inference controller, the at least one data processing unit, and the at least one inference computing unit to cooperatively execute the collective communication offloading method as described in any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions that, when executed by a processor, implement the collective communication offloading method as described in any one of claims 1 to 6.