A data aggregation auxiliary device, system and data aggregation method

By implementing UALink and PCIe protocol conversion in the switch and adopting payload aggregation and streaming transmission mechanisms, the problems of protocol incompatibility and low transmission efficiency between GPU and large-capacity memory are solved, thereby improving the data aggregation speed and bandwidth utilization of AI computing clusters.

CN122309431A · Pending · Publication Date: 2026-06-30 · SHANGHAI XINLIJI SEMICON CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANGHAI XINLIJI SEMICON CO LTD
Filing Date
2026-01-16
Publication Date
2026-06-30

Abstract

This invention discloses a data aggregation auxiliary device, system, and data aggregation method. The PCIe controller of the aggregation auxiliary device receives UALink data packets from an external processor through each UALink port and forwards them to the converter, identifying the source ID corresponding to the UALink port. When aggregating the first UALink data packet, the converter parses the UALink data packet to extract the sector index and payload data, and stores the payload data in the payload buffer. Based on the corresponding source ID, the converter looks up the associated PCIe base address and sector size in a pre-built protocol conversion table. Combining the sector index, the converter calculates the corresponding PCIe target address, generates the PCIe TLP header data, and stores it in a register. When aggregating the next UALink data packet, the header data in the register is reused, significantly reducing the header overhead on the PCIe bus.

Description

Technical Field

[0001] This invention relates to the field of computer communication, and in particular to a data aggregation auxiliary device, system, and data aggregation method.

Background Technology

[0002] With the explosive growth of artificial intelligence technology, the number of parameters in Large Language Models (LLMs) has increased exponentially, placing extremely high demands on the underlying computing infrastructure. LLMs are deep learning-based AI models that, through training on massive amounts of text data, can understand and generate human language. These models typically employ the Transformer architecture and can perform various natural language processing tasks, such as text generation, translation, question answering, and summarization. Common LLM models include the GPT series, BERT, and PaLM.

[0003] In current AI computing clusters, GPUs are typically interconnected using high-speed interconnect buses (such as UALink and NVLink) to meet the high-bandwidth communication requirements. Although modern GPUs are generally equipped with high-bandwidth memory (HBM), its capacity is still very limited for models with trillions of parameters.

[0004] In the training and inference of AI models, collective communication operations (such as Gather and All-Gather) are frequently involved, requiring the aggregation of data distributed across multiple GPUs into a centralized storage space. Because GPU memory capacity is limited, it is often necessary to use larger-capacity external system memory, such as Dynamic Random Access Memory (DRAM) or other storage devices, to store this aggregated data, as shown in Figure 1.

[0005] In AI computing, the Gather operation aggregates data scattered across various GPUs, requiring significant storage space. It is a prerequisite for many operations, such as gradient aggregation, model synchronization, and data redistribution. However, this interconnected architecture for data aggregation suffers from the following problems: First, there is protocol incompatibility: GPUs primarily use dedicated high-speed interconnect protocols such as UALink (PCIe is not commonly used for multi-GPU interconnection in large-scale AI computing because its transmission speed and clustering capabilities are inferior to those of such protocols). Meanwhile, large-capacity memory expansion devices and standard storage nodes are typically based on the PCIe protocol. As a result, UALink cannot be used end to end across both the memory-to-switch and switch-to-GPU links, and an efficient protocol conversion mechanism is lacking.

[0006] Secondly, there is low transmission efficiency (memory wall problem): Traditional protocol conversion is usually a "packet-by-packet" mapping conversion. When multiple GPUs concurrently write large amounts of data to memory, if the switch simply converts each UALink request into an independent PCIe TLP (Transaction Layer Packet), it will result in a large amount of TLP header overhead on the PCIe bus, with a low effective data payload ratio, which severely restricts the bandwidth utilization of the PCIe bus and cannot meet the needs of rapid disk writing or aggregation of massive amounts of data in AI training.
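
The header cost of packet-by-packet conversion can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a 16-byte TLP header (the 4-DW form used with 64-bit addressing) and ignores framing, sequence numbers, LCRC, and flow-control traffic. The function name and the payload sizes are hypothetical and are not taken from the application.

```python
def payload_ratio(payload_bytes: int, header_bytes: int = 16) -> float:
    """Share of TLP bytes that carry payload for a single packet.

    header_bytes=16 is an assumption (4-DW TLP header); link-layer
    framing and CRC are deliberately ignored in this sketch.
    """
    if payload_bytes <= 0:
        raise ValueError("payload_bytes must be positive")
    return payload_bytes / (payload_bytes + header_bytes)


# Hypothetical payload sizes for packet-by-packet conversion.
for size in (32, 64, 256):
    print(f"{size:>4} B payload -> {payload_ratio(size):.1%} payload share")
```

Under these assumptions, a 64-byte payload leaves 80% of the transferred bytes for data, and the share falls further as the payload per packet shrinks.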

[0007] The disclosure of the above background technical content is only for the purpose of assisting in understanding the concept and technical solution of this application, and does not necessarily provide technical instruction.

Summary of the Invention

[0008] The purpose of this invention is to provide a switch and communication system that can be applied to high-performance computing scenarios in artificial intelligence to realize bus protocol conversion and data transmission. Specifically, it involves realizing the conversion between UALink protocol and PCIe protocol and fast transmission based on data aggregation within the switch.

[0009] To achieve the above objectives, the technical solution adopted by the present invention is as follows: A data aggregation auxiliary device includes a PCIe controller, a converter, a protocol conversion table, a header information register, a payload buffer, multiple UALink ports, and only one PCIe port. The PCIe controller is connected to each UALink port to receive UALink data packets from a processor outside the data aggregation auxiliary device through the UALink port and forward them to the converter. The PCIe controller also identifies the UALink port ID corresponding to the UALink port. The protocol conversion table is configured to store the associated UALink port ID, PCIe base address, and sector size.
When aggregating the first UALink data packet, the data aggregation auxiliary device performs the following steps:
the converter parses the UALink data packet to extract the sector index and payload data in the UALink data packet, and stores the payload data in the payload buffer;
based on the corresponding UALink port ID, the associated PCIe base address and sector size are determined in the protocol conversion table;
based on the associated PCIe base address and sector size, as well as the extracted sector index, the corresponding PCIe target address is determined;
the header data of the PCIe transaction layer packet, which contains the PCIe target address, is generated;
the generated header data is stored in the header information register;
the data in the payload buffer and the header data are encapsulated into a PCIe data packet to be written, which is sent to the memory expansion device through the PCIe port.
The header data in the header information register is reused when aggregating the next UALink data packet.

[0010] Furthermore, following any one or a combination of the aforementioned technical solutions, reusing the header data in the header information register when aggregating the next UALink data packet includes: When aggregating the next UALink data packet, the data aggregation auxiliary device performs the following steps: The converter parses the UALink data packet to extract the payload data in the UALink data packet and stores the payload data in the payload buffer; The header data in the header information register is reused and combined with the payload data in the payload buffer to form a PCIe data packet to be written, and then sent to the memory expansion device through the PCIe port.

[0011] Furthermore, following any one or a combination of the aforementioned technical solutions, after each combination of payload data retrieved from the payload buffer and header data in the header information register, the PCIe target address in the header information register is updated using the length of the payload data combined with the header data as an offset.
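
The address update described in this paragraph can be sketched as follows. This is a minimal model under stated assumptions, not the device's actual logic: `HeaderRegister` and `emit_write` are hypothetical names, a packet is modelled as a plain `(address, payload)` pair, and the starting address and payload sizes are made up for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class HeaderRegister:
    """Hypothetical model of the header information register."""
    target_address: int


def emit_write(reg: HeaderRegister, payload: bytes) -> Tuple[int, bytes]:
    """Combine the stored header with one payload, then advance the address."""
    packet = (reg.target_address, payload)
    reg.target_address += len(payload)  # payload length is used as the offset
    return packet


reg = HeaderRegister(target_address=0x1000)
packets = [emit_write(reg, bytes(n)) for n in (64, 64, 32)]
addresses = [addr for addr, _ in packets]
```

Each payload lands immediately after the previous one, so consecutive writes fill a contiguous region without the target address being recomputed from the conversion table.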

[0012] Furthermore, following any one or a combination of the aforementioned technical solutions, the unique PCIe port of the data aggregation auxiliary device is configured with a port controller. The port controller is configured to obtain header data from the header information register, combine it with the payload data in the payload buffer, and encapsulate the data to obtain the corresponding PCIe data packet.

[0013] Furthermore, in accordance with any or a combination of the aforementioned technical solutions, the PCIe port of the data aggregation auxiliary device is integrated with a PCIe chip, the PCIe port is connected to memory, and the PCIe chip is configured to access the memory.

[0014] Furthermore, following any one or a combination of the aforementioned technical solutions, the PCIe port is configured to connect to a PCIe host outside the data aggregation auxiliary device, and the PCIe host is equipped with memory.

[0015] This invention also provides a data aggregation method in which multiple sending devices send write requests based on the UALink protocol, and data aggregation is achieved at a memory expansion device based on the PCIe protocol through protocol conversion. The data aggregation method includes the following steps: Receive the first UALink data packet sent by the sending device through the UALink port and identify the ID of the UALink port; The UALink data packet is parsed to extract the sector index and payload data, and the payload data is stored in the payload buffer. Look up the PCIe base address and sector size associated with the ID of the UALink port in the pre-established protocol conversion table; Based on the associated PCIe base address and sector size, as well as the extracted sector index, determine the corresponding PCIe target address; generate the header data of the PCIe transaction layer data packet, which contains the PCIe target address; The generated header data is stored in a register, and the data in the payload buffer is encapsulated with the header data into a PCIe packet to be written. The PCIe data packet to be written is sent to the memory expansion device through a unique PCIe port; The header data in the register is reused when aggregating the next UALink data packet.

[0016] Furthermore, following any one or a combination of the aforementioned technical solutions, the next UALink data packet sent by the sending device is received through the UALink port, and the ID of the UALink port is identified. Parse subsequent UALink data packets to extract payload data, and store the payload data in the payload buffer; The header data in the register is reused and combined with the payload data in the payload buffer to form a PCIe data packet to be written. The PCIe data packet to be written is sent to the memory expansion device through a unique PCIe port.

[0017] Furthermore, following any one or a combination of the aforementioned technical solutions, after each combination and encapsulation of the payload data in the payload buffer with the header data in the register, the PCIe target address in the register is updated using the length of the payload data combined with the header data as an offset.

[0018] The present invention also provides a data aggregation system, including multiple electronic devices, a memory expansion device, and a data aggregation auxiliary device as described above, wherein the multiple electronic devices are connected to multiple UALink ports of the data aggregation auxiliary device and communicate based on the UALink protocol; The memory expansion device is connected to the single PCIe port of the data aggregation auxiliary device and communicates based on the PCIe protocol.

[0019] According to another aspect of the present invention, a switch for implementing UALink and PCIe protocol conversion is provided, including a PCIe controller, a converter, a protocol conversion table, a header information register, multiple UALink ports, and multiple PCIe ports. The PCIe controller is connected to each UALink port and is configured to receive UALink data packets from a processor outside the switch through the UALink port and forward them to the converter; the PCIe controller also identifies the UALink port ID corresponding to the UALink port. The converter is configured to parse UALink data packets to extract the target port ID and sector index from the UALink data packets. The protocol conversion table is configured to store the associated UALink port ID, PCIe base address, and sector size.
The switch is configured to perform the following steps:
based on the corresponding UALink port ID, determine the associated PCIe base address and sector size in the protocol conversion table;
based on the associated PCIe base address and sector size, as well as the extracted sector index, determine the corresponding PCIe target address;
generate the header data of the PCIe transaction layer packet, which contains the PCIe target address;
store the generated header data in the header information register.
If the target port IDs extracted by the converter from multiple UALink data packets represent the same PCIe port, the switch repeatedly uses the header data in the header information register to encapsulate the data, obtains a PCIe data packet, and sends the PCIe data packet out through the PCIe port.

[0020] Furthermore, following any of the aforementioned technical solutions or combinations thereof, if the target port ID in the current UALink data packet and the target port ID in the previous UALink data packet represent the same PCIe port, then the PCIe target address will not be calculated again and the header data of the PCIe transaction layer data packet will not be generated again. If the target port ID in the current UALink data packet is different from the target port ID in the previous UALink data packet, and the target port ID in the current UALink data packet represents a PCIe port, then the header data of the PCIe transaction layer data packet is generated again and updated in the header information register.
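
The reuse condition in this paragraph amounts to regenerating the header only when the target port changes. The sketch below illustrates that rule under stated assumptions: `HeaderRegister`, `header_for`, and the `build` callback are hypothetical names, the header is modelled as a dictionary, and port numbers 12 and 13 follow the Figure 2 description.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HeaderRegister:
    """Hypothetical model of the header information register."""
    target_port: Optional[int] = None
    header: Optional[dict] = None
    generations: int = 0  # counts how often a header was built


def header_for(reg: HeaderRegister, target_port: int,
               build: Callable[[int], dict]) -> dict:
    """Return the stored header, rebuilding it only when the target port changes."""
    if reg.header is None or reg.target_port != target_port:
        reg.header = build(target_port)
        reg.target_port = target_port
        reg.generations += 1
    return reg.header


reg = HeaderRegister()
for port in (12, 12, 12, 13, 13):
    header_for(reg, port, lambda p: {"port": p})
```

For this sequence of five packets the header is built twice: once for the first packet to port 12 and once when the target switches to port 13.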

[0021] Furthermore, based on any or a combination of the aforementioned technical solutions, the UALink data packet is defined as originating from the source UALink port. If the target port ID extracted by the converter represents a target UALink port that is different from the source UALink port, the PCIe controller establishes a UALink data channel between the source UALink port and the target UALink port, so that the UALink data packet originating from the source UALink port is sent to the target UALink port through the UALink data channel.
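
The forwarding decision implied here depends on which kind of port the target port ID represents. The sketch below assumes the port numbering given later for Figure 2 (ports 0 to 11 are UALink, ports 12 and 13 are PCIe); the function name, the returned labels, and the error handling for cases the description does not cover are all hypothetical.

```python
UALINK_PORTS = range(0, 12)  # ports 0 to 11, per the Figure 2 description
PCIE_PORTS = range(12, 14)   # ports 12 and 13, per the Figure 2 description


def route(source_port: int, target_port: int) -> str:
    """Classify a packet by the kind of port its target port ID represents."""
    if source_port not in UALINK_PORTS:
        raise ValueError("packets are assumed to arrive on a UALink port")
    if target_port in PCIE_PORTS:
        return "convert_to_pcie"   # protocol conversion path
    if target_port in UALINK_PORTS and target_port != source_port:
        return "ualink_channel"    # UALink-to-UALink data channel
    raise ValueError("case not covered by the description")
```

For example, `route(0, 12)` selects the conversion path, while `route(0, 5)` selects a UALink data channel between ports 0 and 5.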

[0022] Furthermore, following any one or a combination of the aforementioned technical solutions, the PCIe target address is calculated using the following formula: $\mathrm{Add}_{\mathrm{obj}} = \mathrm{Add}_{\mathrm{base}} + (\mathrm{Size}_{\mathrm{sector}} \times \mathrm{Index}_{\mathrm{sector}})$, where $\mathrm{Add}_{\mathrm{obj}}$ is the PCIe target address, $\mathrm{Add}_{\mathrm{base}}$ is the PCIe base address, $\mathrm{Size}_{\mathrm{sector}}$ is the sector size, and $\mathrm{Index}_{\mathrm{sector}}$ is the sector index. Each UALink port is associated with a different PCIe base address.
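
A worked instance of this formula may help. The function below is a direct transcription of it; the function name and the numeric values (a base address of 0x10000000, 4 KiB sectors, sector index 3) are hypothetical and chosen only for illustration.

```python
def pcie_target_address(base: int, sector_size: int, sector_index: int) -> int:
    """Add_obj = Add_base + (Size_sector * Index_sector)."""
    if sector_size <= 0 or sector_index < 0:
        raise ValueError("sector_size must be positive and sector_index non-negative")
    return base + sector_size * sector_index


# Hypothetical values: base 0x1000_0000, 4 KiB sectors, sector index 3.
addr = pcie_target_address(0x1000_0000, 4096, 3)
print(hex(addr))
```

With these values the target address is 0x10003000, that is, three sectors past the base address.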

[0023] Furthermore, following any one or a combination of the aforementioned technical solutions, the PCIe port is configured with a port controller, which is configured to obtain header data from the header information register and encapsulate the data to obtain the corresponding PCIe data packet.

[0024] Furthermore, in accordance with any or a combination of the aforementioned technical solutions, the PCIe port is integrated with a PCIe chip, the PCIe port is connected to memory, and the PCIe chip is configured to access memory. Alternatively, the PCIe port may be configured to connect to a PCIe host external to the switch, the PCIe host being configured with memory.

[0025] Furthermore, following any or a combination of the aforementioned technical solutions, the switch is also configured with a payload buffer, and the processor performs data write operations through the switch in the following manner: the converter parses UALink data packets to extract data including payload data; the switch stores the payload data in the payload buffer; if the target port ID in the current UALink data packet is the same as the target port ID in the previous UALink data packet, the switch combines the payload data in the payload buffer with the header data in the header information register to generate a PCIe data packet to be written; and the switch writes the PCIe data packet to be written to the storage device of the corresponding device through the PCIe port represented by the target port ID.

[0026] Furthermore, following any one or a combination of the aforementioned technical solutions, the PCIe target address in the header information register is updated using the length of the payload data combined with the header data as an offset.

[0027] Furthermore, following any or a combination of the aforementioned technical solutions, the PCIe port is configured to connect to a PCIe host external to the switch, and the PCIe host is equipped with memory; the processor's data read operation through the switch is implemented in the following manner: In response to the processor sending a UALink data packet containing a read request through the UALink port, and the converter parsing the UALink data packet to extract the target port ID to represent the PCIe port, the PCIe data packet obtained by the switch through data encapsulation contains read type field information. In response to a PCIe packet containing read type information, the PCIe host reads memory to obtain the content to be read, encapsulates it into a target PCIe packet, and sends it to the converter through the corresponding PCIe port. The converter forwards the target PCIe data packet to the PCIe controller, which parses out the content to be read and sends it to the processor that issued the read request through the corresponding UALink port.

[0028] Furthermore, in accordance with any or a combination of the aforementioned technical solutions, the information extracted by the converter from the UALink data packet containing the read request also includes the target read address; The PCIe data packets obtained by the switch through data encapsulation also contain field information of the target read address; The PCIe host accesses the memory according to the target read address to determine the content to be read corresponding to the target read address.
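
The read path through a PCIe host can be sketched as a request and a returned payload. Everything below is a simplified model under stated assumptions: the class and function names are hypothetical, host memory is a `bytearray`, and the `length` field is an added assumption, since the text names only the read type and the target read address.

```python
from dataclasses import dataclass


@dataclass
class PcieReadPacket:
    kind: str      # read type field
    address: int   # target read address
    length: int    # hypothetical: number of bytes to read


class PcieHost:
    """Hypothetical PCIe host with attached memory."""

    def __init__(self, memory: bytearray) -> None:
        self.memory = memory

    def handle(self, pkt: PcieReadPacket) -> bytes:
        if pkt.kind != "read":
            raise ValueError("only read packets are modelled here")
        # Read the requested range and return it as the packet content.
        return bytes(self.memory[pkt.address:pkt.address + pkt.length])


def switch_read(host: PcieHost, target_read_address: int, length: int) -> bytes:
    """Encapsulate a read request, send it to the host, and return the content."""
    request = PcieReadPacket("read", target_read_address, length)
    return host.handle(request)


host = PcieHost(bytearray(range(16)))
content = switch_read(host, target_read_address=4, length=4)
```

In the described switch the returned content would then be forwarded to the requesting processor through the corresponding UALink port; that last hop is omitted here.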

[0029] Furthermore, following any or a combination of the aforementioned technical solutions, the PCIe port integrates a PCIe chip, the PCIe port is connected to memory, and the PCIe chip is configured to access the memory; the processor's data read operation through the switch is achieved in the following manner: In response to the processor sending a UALink data packet containing a read request through the UALink port, and the converter parsing the UALink data packet to extract the target port ID to represent the PCIe port, the converter sends the result of parsing the UALink data packet to the PCIe chip of the PCIe port without encapsulating the PCIe data packet. The PCIe chip reads memory to obtain the content to be read, and sends it to the converter without encapsulating PCIe data packets; The PCIe controller receives the content to be read from the converter and sends it to the processor that issued the read request through the corresponding UALink port.

[0030] Furthermore, in accordance with any or a combination of the aforementioned technical solutions, the information extracted by the converter from the UALink data packet containing the read request also includes the target read address; The PCIe chip of the PCIe port accesses the memory according to the target read address to determine the content to be read corresponding to the target read address.

[0031] According to another aspect of the present invention, an efficient communication method for cross-protocol conversion is provided, comprising the following steps: Receive UALink data packets sent by the processor through the source UALink port and identify the ID of the source UALink port; Parse the UALink data packet to extract the target PCIe port ID and sector index; Look up the PCIe base address and sector size associated with the ID of the source UALink port in the pre-established protocol conversion table; Based on the associated PCIe base address and sector size, as well as the extracted sector index, determine the corresponding PCIe target address; Generate the header data of the PCIe transaction layer packet, which contains the PCIe destination address; The generated header data is stored in a register; If multiple consecutive UALink data packets received correspond to the same target PCIe port ID, the header data in the register is reused for data encapsulation to obtain PCIe data packets. The PCIe data packet is sent to the memory through the PCIe port corresponding to the PCIe port ID.

[0032] Furthermore, following any one or a combination of the aforementioned technical solutions, the processor's data writing operation is achieved in the following manner: The data extracted by parsing the UALink data packets also includes payload data; The payload data is stored in a preset buffer; If the target PCIe port ID in the current UALink data packet is the same as the target PCIe port ID in the previous UALink data packet, then the payload data in the buffer and the header data in the register are combined to generate the PCIe data packet to be written. The PCIe data packet is written to the memory connected to it through the PCIe port.

[0033] Furthermore, following any one or a combination of the aforementioned technical solutions, the PCIe target address in the header information register is updated using the length of the payload data combined with the header data as an offset.

[0034] Furthermore, following any one or a combination of the aforementioned technical solutions, the port controller configured on the PCIe port obtains header data from the header information register, combines it with the payload data in the buffer, and encapsulates it to obtain the corresponding PCIe data packet.

[0035] Furthermore, following any one or a combination of the aforementioned technical solutions, the processor's data reading operation is achieved in the following manner: Receive a UALink data packet containing a read request sent by the processor through the source UALink port; the read request includes the target read address. The read request is encapsulated into a PCIe data packet and sent through the PCIe port to the PCIe host to which it is connected; The PCIe host accesses memory according to the target read address to obtain the content to be read. The PCIe host encapsulates the content to be read into a target PCIe data packet and returns it through the PCIe port. It then parses out the content to be read and sends it to the processor that issued the read request through the source UALink port.

[0036] Furthermore, following any one or a combination of the aforementioned technical solutions, a UALink data packet containing a read request is received from the processor via the source UALink port, wherein the read request includes a target read address; The PCIe chip integrated with the PCIe port accesses the memory connected to the PCIe port according to the target read address to obtain the content to be read; The content to be read is sent to the processor that issued the read request through the source UALink port.

[0037] Furthermore, following any one or a combination of the aforementioned technical solutions, it is not necessary to encapsulate the content to be read into a PCIe data packet.

[0038] Furthermore, following any of the aforementioned technical solutions or combinations thereof, if the port ID obtained by parsing the source UALink data packet represents another target UALink port that is different from the source UALink port, then a UALink data channel is established between the source UALink port and the target UALink port, so that the UALink data packet from the source UALink port is sent to the target UALink port through the UALink data channel.

[0039] Furthermore, following any one or a combination of the aforementioned technical solutions, the PCIe target address is calculated using the following formula: $\mathrm{Add}_{\mathrm{obj}} = \mathrm{Add}_{\mathrm{base}} + (\mathrm{Size}_{\mathrm{sector}} \times \mathrm{Index}_{\mathrm{sector}})$, where $\mathrm{Add}_{\mathrm{obj}}$ is the PCIe target address, $\mathrm{Add}_{\mathrm{base}}$ is the PCIe base address, $\mathrm{Size}_{\mathrm{sector}}$ is the sector size, and $\mathrm{Index}_{\mathrm{sector}}$ is the sector index. Each UALink port is associated with a different PCIe base address.

[0040] Furthermore, following any of the aforementioned technical solutions or combinations thereof, if consecutive UALink data packets correspond to different target PCIe port IDs, and the target port ID in the current UALink data packet represents a PCIe port, then the header data of the PCIe transaction layer data packet is generated again and updated in the header information register.

[0041] According to another aspect of the present invention, a communication system is provided, comprising a first electronic device, a second electronic device, and a switch as described above for implementing UALink and PCIe protocol conversion, wherein the first electronic device is configured to connect to the UALink port of the switch and communicate based on the UALink protocol, and the second electronic device is configured to connect to the PCIe port of the switch and communicate based on the PCIe protocol.

[0042] Furthermore, following any one or a combination of the aforementioned technical solutions, the first electronic device is a GPU; the second electronic device is a PCIe host configured with memory; or, the second electronic device is memory, and the PCIe port of the switch is configured with a PCIe chip for accessing the memory.

[0043] The beneficial effects of the technical solution provided by this invention are as follows:
a. By establishing a mapping table based on UALink port ID and sector information, UALink protocol requests (Req) are automatically converted into PCIe TLP headers, enabling the GPU to access PCIe memory resources through the UALink bus and achieving protocol compatibility;
b. A payload aggregation and streaming mechanism is introduced: when consecutive write requests are detected, the TLP header is reused to directly aggregate the payload data in the buffer, and the address is automatically incremented before sending. This mechanism greatly reduces the header overhead on the PCIe bus and significantly increases the proportion of payload transmitted, thereby increasing the transmission bandwidth and accelerating data aggregation in large-model training and inference.

Brief Description of the Drawings

[0044] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0045] Figure 1 is a diagram illustrating how the Gather operation aggregates data from multiple GPUs into a centralized storage space;
Figure 2 is a schematic diagram of the structure of a first type of switch for implementing UALink and PCIe protocol conversion, provided as an exemplary embodiment of the present invention;
Figure 3 is a schematic diagram of the structure of a second type of switch for implementing UALink and PCIe protocol conversion, provided as an exemplary embodiment of the present invention;
Figure 4 is a schematic diagram of the process by which the switch reuses header information, provided as an exemplary embodiment of the present invention;
Figure 5 is a schematic diagram of a processor performing a data writing operation through a switch, provided as an exemplary embodiment of the present invention;
Figure 6 is a schematic diagram of a first process for a processor to perform a data reading operation through a switch, provided as an exemplary embodiment of the present invention;
Figure 7 is a schematic diagram of a second process for a processor to perform a data reading operation through a switch, provided as an exemplary embodiment of the present invention;
Figure 8 is a flowchart of an efficient communication method for cross-protocol conversion, provided as an exemplary embodiment of the present invention;
Figure 9 is a schematic diagram of the structure of a data aggregation auxiliary device, provided as an exemplary embodiment of the present invention;
Figure 10 is a flowchart of a data aggregation method, provided as an exemplary embodiment of the present invention.

Detailed Description of the Embodiments

[0046] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0047] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, apparatus, product, or device that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or devices.

[0048] This invention aims to solve the problem of enabling UALink bus devices (such as GPUs) to efficiently access PCIe devices (such as large-capacity memory) in heterogeneous AI computing systems. In particular, it addresses the common Gather operation in AI computing by resolving the bandwidth waste and high transmission latency caused by frequent TLP packet header encapsulation in traditional protocol conversion methods.

[0049] In one embodiment of the present invention, a switch is provided that implements UALink and PCIe protocol conversion. As shown in Figure 2, the switch includes a PCIe controller, a converter, a protocol conversion table, a header information register, multiple UALink ports (ports 0 to 11 in Figure 2), and multiple PCIe ports (ports 12 and 13 in Figure 2). The PCIe controller is connected to each UALink port and is configured to receive UALink data packets from processors (such as GPUs) outside the switch via the UALink ports and forward them to the converter. The PCIe controller identifies the UALink port ID corresponding to the UALink port. The converter is configured to parse UALink data packets to extract the target port ID and sector index from the UALink data packets. The protocol conversion table is configured to store the associated UALink port ID, PCIe base address (PCIe BAR), and sector size. The BAR is a special configuration register on the PCIe device that stores the "starting address," which is equivalent to the address of the device resource. When the processor accesses the starting address corresponding to this BAR value, it is equivalent to accessing the first resource of the PCIe device. Accessing the "starting address + offset" is equivalent to accessing other resources within the device. This simplifies the calculation of the PCIe target address.
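
One way to picture the protocol conversion table is as a mapping from UALink port ID to a base address and a sector size. The sketch below is an assumption-laden illustration rather than the embodiment's layout: `TableEntry`, `REGION`, `SECTOR`, and the concrete addresses are hypothetical, and only the twelve UALink ports of Figure 2 are modelled.

```python
from typing import NamedTuple


class TableEntry(NamedTuple):
    pcie_base: int    # PCIe base address (BAR) reserved for this port
    sector_size: int  # sector size in bytes


REGION = 0x0100_0000  # hypothetical 16 MiB region per UALink port
SECTOR = 4096         # hypothetical sector size

# One entry per UALink port 0 to 11, each with its own base address.
conversion_table = {
    port: TableEntry(pcie_base=0x4000_0000 + port * REGION, sector_size=SECTOR)
    for port in range(12)
}

entry = conversion_table[3]
```

Because every port receives a distinct base address, writes arriving on different UALink ports land in separate regions of the PCIe address space.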

[0050] The switch is configured to perform the following steps to reuse header information, thereby significantly reducing header overhead on the PCIe bus, as shown in Figure 4: S100: Based on the corresponding UALink port ID, determine the associated PCIe base address and sector size in the protocol conversion table. The protocol conversion table serves as a mapping table that pre-builds the conversion relationship between each UALink port and PCIe. It can associate each UALink port with a different PCIe base address, which is equivalent to pre-configuring a dedicated area (PCIe BAR) for each UALink port for data writing, while the sector size depends on the hardware conditions of the memory.

[0051] S200: Determine the corresponding PCIe target address based on the associated PCIe base address and sector size, as well as the extracted sector index. Specifically, the PCIe target address is calculated using the following formula: $\mathrm{Add}_{\mathrm{obj}} = \mathrm{Add}_{\mathrm{base}} + (\mathrm{Size}_{\mathrm{sector}} \times \mathrm{Index}_{\mathrm{sector}})$, where $\mathrm{Add}_{\mathrm{obj}}$ is the PCIe target address, $\mathrm{Add}_{\mathrm{base}}$ is the PCIe base address, $\mathrm{Size}_{\mathrm{sector}}$ is the sector size, and $\mathrm{Index}_{\mathrm{sector}}$ is the sector index; S300: Generate the header data of the PCIe transaction layer data packet, which contains the PCIe target address; S400: Store the generated header data into the header information register. If the target port IDs extracted by the converter from multiple UALink data packets represent the same PCIe port (the first case), the switch reuses the header data in the header information register for data encapsulation to obtain PCIe data packets, and sends the PCIe data packets outward through this PCIe port. That is, if the target port ID in the current UALink data packet and the target port ID in the previous UALink data packet represent the same PCIe port, the PCIe target address $\mathrm{Add}_{\mathrm{obj}}$ is not recalculated and the header data of the PCIe transaction layer data packet is not regenerated. This reduces the overhead on the PCIe bus, increases data transmission bandwidth, and improves data transmission efficiency.
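Steps S100 to S400 can be illustrated with a minimal Python sketch. The table contents, the names `TableEntry`, `PROTOCOL_TABLE`, `pcie_target_address`, `make_header`, and `convert_first_packet`, and the dictionary used as a stand-in for TLP header data are all assumptions made for illustration; they are not taken from the embodiment and do not represent a real TLP encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableEntry:
    pcie_base: int    # PCIe base address (BAR) associated with a UALink port
    sector_size: int  # sector size in bytes


# Hypothetical table contents: UALink port ID -> associated entry.
PROTOCOL_TABLE = {
    0: TableEntry(pcie_base=0x1000_0000, sector_size=4096),
    1: TableEntry(pcie_base=0x2000_0000, sector_size=4096),
}

# Stand-in for the header information register.
header_register = {}


def pcie_target_address(ualink_port_id: int, sector_index: int) -> int:
    """S100 and S200: look up the entry, then add base + size * index."""
    entry = PROTOCOL_TABLE[ualink_port_id]
    return entry.pcie_base + entry.sector_size * sector_index


def make_header(target_address: int) -> dict:
    """S300: build illustrative header data containing the target address."""
    return {"kind": "MemWr", "address": target_address}


def convert_first_packet(ualink_port_id: int, sector_index: int) -> dict:
    """S100 to S400 for the first packet of a run."""
    header = make_header(pcie_target_address(ualink_port_id, sector_index))
    header_register.clear()
    header_register.update(header)  # S400: store in the register
    return header
```

With these assumed values, a packet arriving on UALink port 0 with sector index 3 maps to base 0x1000_0000 plus 3 sectors of 4096 bytes.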

[0052] The above breaks through the sequential limitations of traditional packet-to-packet conversion and innovatively adopts a payload aggregation and streaming transmission mechanism: by directly stripping and merging effective data from multiple concurrent sources and reusing the TLP packet header, the protocol overhead is greatly reduced, and the PCIe bus bandwidth utilization and transmission efficiency of AI clusters in massive data gathering scenarios are significantly improved.

[0053] Referring again to Figure 4, in the second case the target port ID in the current UALink data packet differs from the target port ID in the previous UALink data packet, and the target port ID in the current UALink data packet represents a PCIe port. In this case, the steps of generating the header data of the PCIe transaction layer data packet (including the protocol conversion table lookup, calculating the PCIe target address, and generating the header) are repeated, and the header information register is updated with the new header.

[0054] In a third case, the target port ID in the current UALink data packet represents a UALink port. For example, GPU0 inputs a UALink data packet through Port0, with the target port being Port10, to be sent to GPU10. In this case, the UALink data packet is defined as originating from the source UALink port, and the target port ID represents a target UALink port that is different from the source UALink port. The PCIe controller establishes a UALink data channel between the source UALink port and the target UALink port, so that the UALink data packet from the source UALink port is sent to the target UALink port through this UALink data channel.

[0055] The converter continuously monitors inbound traffic. Once it receives a UALink data packet with a different target port ID (meaning the access target has changed, i.e., the second and third cases mentioned above), it immediately stops the fast transmission mode that reuses the TLP header information, refreshes the register, and re-executes the normal conversion process.
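The three cases can be summarized as a small dispatch routine. The following Python sketch is illustrative only: the port numbering follows Figure 2 as described above, while the class `HeaderCache`, the `build_header` callback, and the choice to model "refreshes the register" as clearing the cached header are assumptions.

```python
UALINK_PORTS = set(range(12))  # Ports 0 to 11, per the description of Figure 2
PCIE_PORTS = {12, 13}          # Ports 12 and 13


class HeaderCache:
    """Tracks which PCIe port the cached header was generated for."""

    def __init__(self):
        self.port = None      # PCIe port of the cached header, if any
        self.header = None    # stand-in for the header information register
        self.regenerated = 0  # how many times a header was generated

    def route(self, source_port, target_port, build_header):
        if target_port in UALINK_PORTS:
            # Third case: UALink-to-UALink forwarding; leave fast mode.
            self.port = None
            self.header = None
            return ("ualink_channel", source_port, target_port)
        if target_port not in PCIE_PORTS:
            raise ValueError(f"unknown port {target_port}")
        if target_port == self.port:
            # First case: same PCIe port as the previous packet, reuse header.
            return ("reuse", self.header)
        # Second case (or very first packet): regenerate and store the header.
        self.header = build_header(target_port)
        self.port = target_port
        self.regenerated += 1
        return ("regenerate", self.header)
```

In this sketch, a run of packets addressed to Port13 generates one header and reuses it; a packet addressed to Port12 or to a UALink port ends the run.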

[0056] For each of the header reuse steps executed by the switch as described above, the executing entity can be configured according to actual needs. For example, step S100 can be configured as shown in Figure 2, where the converter determines, in the protocol conversion table, the PCIe base address and sector size associated with the UALink port ID of the data sender; or as shown in Figure 3, where the PCIe controller determines the associated information in the protocol conversion table.

[0057] In step S200, the step of calculating the PCIe target address can be performed by the converter or by the PCIe controller.

[0058] In step S300, the header data of the PCIe transaction layer data packet can be generated by the PCIe controller or by the converter.

[0059] In step S400, the header data can be stored in the header information register by the PCIe controller or by the converter.

[0060] The PCIe port is equipped with a port controller. The step of reusing the header data in the header information register for data encapsulation can be performed by the port controller to obtain PCIe data packets. PCIe data encapsulation converts upper-layer application data into transaction layer data packets (TLPs) that can be transmitted on the PCIe bus. It mainly includes transaction layer encapsulation, data link layer encapsulation, and physical layer encapsulation. After the physical layer at the receiving end receives the PCIe data packet, it performs decoding and clock recovery. The data link layer at the receiving end checks the CRC to confirm data integrity. The transaction layer at the receiving end parses the TLP header to extract the payload data and submits it to the upper-layer application.

[0061] As shown in Figure 2 and Figure 3, the PCIe ports connect to memory, and the hardware structures for these connections fall mainly into the following two types: Method 1: The PCIe port is integrated with a PCIe chip, the PCIe port is connected to the memory, and the PCIe chip is configured to access the memory; Method 2: The PCIe port is configured to connect to a PCIe host outside the switch, and the PCIe host is equipped with memory.

[0062] The differences between the two methods are as follows: Method 1, which uses a chip-integrated approach, allows the PCIe chip on the PCIe port to access memory, which is equivalent to the switch directly accessing the memory. Therefore, it does not require data encapsulation to obtain PCIe format data packets to achieve memory access. Method 2 requires an external PCIe host to access its memory. Therefore, the switch (port controller) must encapsulate the PCIe data packets and send them to the PCIe host through the PCIe port. The PCIe host then decapsulates the data packets to obtain the specific information about accessing the memory.

[0063] Access is divided into two types: data writing and data reading. The converter parses the Req data in the UALink protocol layer interface (UPLI) and converts the read / write command it represents into the corresponding Type field in the PCIe TLP packet header. The process for each of these two types of commands is explained below. The processor's data write operation through the switch is shown in Figure 5. The switch is also equipped with a payload buffer. The data extracted by the converter from the UALink data packet also includes payload data. Extracting the effective payload data involves physical layer reception and decoding, link layer parsing, and transaction layer parsing and payload extraction. Specifically, the header field of the transaction layer data packet is parsed, and the starting position and length of the payload data in the data packet are determined according to the value of the data length field in the header field. The payload data at the specified position and of the specified length is then copied from the transaction layer data packet and stored in the payload buffer. This step can be performed by the converter or by the PCIe controller. If the target port ID in the current UALink data packet is the same as the target port ID in the previous UALink data packet, the switch (converter, PCIe controller, or port controller) combines the payload data in the payload buffer with the header data in the header information register. Based on the combined result, the port controller further encapsulates it into a PCIe data packet to be written according to the PCIe data encapsulation rules. The switch (e.g., a port controller) writes the PCIe data packet to be written to the storage device of the corresponding device through the PCIe port represented by the target port ID.
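The read / write command conversion mentioned above can be sketched as follows. In the PCIe Base Specification, memory read and memory write requests share the same Type encoding (0b00000) and are distinguished by the Fmt field, which occupies bits 7:5 of the first header byte; a 4 DW header is used for addresses at or above 4 GB. The command strings "read" and "write" and the function name `tlp_fmt_type` are hypothetical stand-ins, since the UPLI Req encoding is not given in this description.

```python
def tlp_fmt_type(command: str, address: int) -> int:
    """Return byte 0 of a memory request TLP header.

    Fmt occupies bits 7:5 and Type occupies bits 4:0.
    """
    if command not in ("read", "write"):
        raise ValueError(f"unsupported command: {command!r}")
    with_data = command == "write"   # writes carry a payload
    four_dw = address >= (1 << 32)   # 64-bit address needs a 4 DW header
    fmt = (0b010 if with_data else 0b000) | (0b001 if four_dw else 0b000)
    mem_type = 0b00000               # memory request
    return (fmt << 5) | mem_type
```

For example, under this sketch a write to address 0x1000 yields 0x40 (3 DW header with data), and a read from a 64-bit address yields 0x20 (4 DW header without data).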

[0064] The phrase "same target port ID as the previous UALink data packet" means that the packets all represent the same PCIe port, indicating that one or more UALink ports are requesting to transfer / write data to the same memory. In this case, the header information in the register is reused as described above. That is, the PCIe TLP header calculated for the first packet is stored in the TLP Header register. On this basis, the switch no longer generates a separate TLP for each UALink data packet. Instead, it directly strips the payload data from the UALink data packets received by each UALink port and writes it sequentially into the payload buffer of the PCIe port. The data in the payload buffer is then merged with the packet header in the TLP Header register, encapsulated to generate a TLP, and sent.

[0065] Using the length of the payload data combined with the header data as an offset, the PCIe target address in the header information register is updated. For example, suppose the initial address is 0x1000 (directly using the PCIe target address in the header information register). Starting from the first byte of the payload buffer, 128 bytes of data (0x80 in hexadecimal) are retrieved, encapsulated into a TLP, and sent. Then the PCIe target address in the header information register is updated: new address = old address + current payload length = 0x1000 + 0x80 = 0x1080. When sending the second TLP, another 128 bytes of data are retrieved starting from the position immediately following the previous chunk (byte 129) in the payload buffer, encapsulated, and sent. At this point, the address in the header information register is already 0x1080. After sending, the address is updated again to 0x1080 + 0x80 = 0x1100. This process is repeated until all data in the buffer has been retrieved and the buffer is cleared.
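The address update in this example can be reproduced with a short Python sketch. The function name `drain_buffer` and the default chunk size of 128 bytes are assumptions chosen to match the worked example; the embodiment does not fix a chunk size.

```python
def drain_buffer(payload_buffer: bytes, start_address: int, chunk: int = 128):
    """Split the buffer into chunks and pair each chunk with its target address.

    Returns the list of (address, payload) pairs that would be encapsulated
    and the address left in the register after the last chunk is sent.
    """
    address = start_address
    sent = []
    for offset in range(0, len(payload_buffer), chunk):
        payload = payload_buffer[offset:offset + chunk]
        sent.append((address, payload))
        address += len(payload)  # new address = old address + payload length
    return sent, address
```

Draining a 300-byte buffer from address 0x1000 in this sketch produces chunks at 0x1000, 0x1080, and 0x1100, the last of which carries the remaining 44 bytes.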

[0066] For example, if the PCIe target address is 0x1000, and the converter initially parses the UALink data packet sent by GPU0 through Port0 and obtains the target port ID as Port13, then the payload data is sent through Port13 and written to address 0x1000 in its corresponding memory 1. The two PCIe ports shown in Figure 2 and Figure 3 are for illustration only; this invention does not limit the number of PCIe ports or UALink ports.

[0067] The processor's data read operations through the switch are divided into two modes depending on the switch's PCIe port configuration / connection. In the first mode, the PCIe port is configured to connect to an external PCIe host equipped with memory; the processor's data read operation through the switch is shown in Figure 6. When the processor sends a UALink data packet containing a read request through the UALink port, and the target port ID that the converter extracts by parsing the UALink data packet represents a PCIe port, the PCIe data packet obtained by the switch through data encapsulation contains field information of the read type. Specifically, the read request includes the target read address; correspondingly, the converter parses out the target read address, and the PCIe data packet encapsulated by the switch also contains the field information of the target read address. In response to a PCIe data packet containing read type information and a target read address, the PCIe host accesses memory according to the target read address to determine the content to be read, encapsulates it into a target PCIe data packet, and sends it to the converter through the corresponding PCIe port. The converter forwards the target PCIe data packet to the PCIe controller, which parses out the content to be read and sends it to the processor that issued the read request through the corresponding UALink port.

[0068] In the second mode, a PCIe chip is integrated into the PCIe port, the PCIe port is connected to memory, and the PCIe chip is configured to access the memory. The processor's data read operation through the switch is shown in Figure 7. When the processor sends a UALink data packet containing a read request through the UALink port, where the read request includes a target read address that the converter parses out, and the target port ID that the converter extracts from the UALink data packet represents a PCIe port, the converter does not encapsulate a PCIe data packet; instead, it sends the result of parsing the UALink data packet (including the target read address) to the PCIe chip of the PCIe port. The PCIe chip accesses the memory based on the target read address to determine the content to be read. After obtaining the content to be read, the PCIe chip sends it to the converter without encapsulating it into PCIe data packets. The PCIe controller receives the content to be read from the converter and sends it to the processor that issued the read request through the corresponding UALink port.
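The difference between the two read modes can be shown with a simplified Python sketch. Everything here is a stand-in: `MEMORY` is a hypothetical memory keyed by address, the dictionaries are not real PCIe packets, and the function names `pcie_host`, `pcie_chip`, and `read_via_switch` are invented for illustration.

```python
# Hypothetical memory contents keyed by target read address.
MEMORY = {0x1000: b"weights"}


def pcie_host(request_packet: dict) -> dict:
    """First mode: the external host decapsulates the request, reads memory,
    and returns the content encapsulated in a target PCIe data packet."""
    if request_packet["type"] != "read":
        raise ValueError("expected a read request")
    return {"type": "completion", "data": MEMORY[request_packet["address"]]}


def pcie_chip(target_read_address: int) -> bytes:
    """Second mode: the integrated chip reads memory directly."""
    return MEMORY[target_read_address]


def read_via_switch(target_read_address: int, mode: str) -> bytes:
    if mode == "host":
        # Encapsulate the read type and target read address, then parse
        # the content to be read out of the returned packet.
        request = {"type": "read", "address": target_read_address}
        return pcie_host(request)["data"]
    if mode == "chip":
        # No PCIe data packet is encapsulated in either direction.
        return pcie_chip(target_read_address)
    raise ValueError(f"unknown mode: {mode!r}")
```

Both modes return the same content to the requesting processor; only the first involves encapsulation and decapsulation of PCIe data packets.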

[0069] In one embodiment of the present invention, an efficient communication method for cross-protocol conversion is provided. The sending device sends a communication request (read request or write request) based on the UALink protocol. Through protocol conversion, read / write operations are implemented at large-capacity memory expansion devices or standard storage nodes based on the PCIe protocol. As shown in Figure 8, the communication method includes the following steps: receive UALink data packets sent by the sending device (GPU processor) through the UALink port and identify the ID of the UALink port, that is, the ID of the port connected to the sending device; parse the UALink data packet to extract the target PCIe port ID and sector index; look up the PCIe base address and sector size associated with the ID of the UALink port in the pre-established protocol conversion table; based on the associated PCIe base address and sector size, as well as the extracted sector index, determine the corresponding PCIe target address; generate the header data of the PCIe transaction layer data packet, which contains the PCIe target address; store the generated header data in a register; if multiple consecutively received UALink data packets correspond to the same target PCIe port ID, reuse the header data in the register for data encapsulation to obtain PCIe data packets; send the PCIe data packet to a large-capacity memory expansion device or standard storage node based on the PCIe protocol via the PCIe port corresponding to the PCIe port ID.

[0070] In one embodiment, the communication method is based on the above-described switch embodiment as the communication hardware foundation; in another embodiment, the communication method is not limited to using a switch as the hardware foundation.

[0071] Furthermore, the processor's data write operation is implemented as follows: the data extracted by parsing the UALink data packets also includes payload data; the payload data is stored in a preset buffer; if the target PCIe port ID in the current UALink data packet is the same as the target PCIe port ID in the previous UALink data packet, the payload data in the buffer and the header data in the register are combined to generate the PCIe data packet to be written; the PCIe data packet is written through the PCIe port to the memory connected to it. For details, see the example shown in Figure 5.

[0072] Furthermore, following any one or a combination of the aforementioned technical solutions, the processor's data read operation is achieved in the following manner: the UALink port receives a UALink data packet containing a read request sent by the processor, the read request including the target read address; the read request is encapsulated into a PCIe data packet and sent through the PCIe port to the connected PCIe host; the PCIe host accesses memory according to the target read address to obtain the content to be read; the PCIe host encapsulates the content to be read into a target PCIe data packet and returns it through the PCIe port; the content to be read is parsed out and sent through the UALink port to the processor that issued the read request. For details, see the embodiment shown in Figure 6; the communication hardware need not be limited to the above-described switch.

[0073] Alternatively, a PCIe chip integrated with the PCIe port accesses the memory connected to the PCIe port according to the target read address to obtain the content to be read; the content to be read is then sent to the processor that issued the read request via the UALink port. For details, see the embodiment shown in Figure 7; the communication hardware need not be limited to the aforementioned switch.

[0074] In one embodiment of the present invention, a communication system is provided, comprising a first electronic device, a second electronic device, and a switch as described above that implements UALink and PCIe protocol conversion. As shown in Figure 2, the first electronic device of the communication system consists of multiple GPUs, which are configured to connect to the UALink ports of the switch and communicate based on the UALink protocol. The second electronic device is memory or a PCIe host with memory, which is configured to connect to the PCIe port of the switch and communicate based on the PCIe protocol. When the second electronic device is memory, the PCIe port is equipped with a PCIe chip to enable access to the memory based on the PCIe protocol.

[0075] It should be noted that the communication method and communication system provided in this embodiment of the invention belong to the same inventive concept as the switch that implements the UALink and PCIe protocol conversion described above. All contents of the switch embodiment are incorporated into this communication method embodiment and communication system embodiment by reference, and will not be repeated here.

[0076] In one embodiment of the present invention, a data aggregation auxiliary device is also provided, which serves as an intermediate connection device between a sending end device based on the UALink protocol and a large-capacity memory expansion device or standard storage node based on the PCIe protocol. This device helps multiple sending end devices (GPU processors) to quickly aggregate data to a large-capacity memory expansion device or standard storage node. Unlike a switch, the data aggregation auxiliary device has only one PCIe port.

[0077] As shown in Figure 9, the data aggregation auxiliary device includes a PCIe controller, a converter, a protocol conversion table, a header information register, a payload buffer, multiple UALink ports, and only one PCIe port. The PCIe controller is connected to each UALink port and is configured to receive UALink data packets from processors outside the data aggregation auxiliary device through the UALink ports and forward them to the converter. The PCIe controller identifies the UALink port ID corresponding to the UALink port. In the connection architecture of multiple GPU processors, one data aggregation auxiliary device, and a large-capacity memory expansion device, each GPU processor by default sends its communication requests to the same, unique PCIe port. Therefore, it is not necessary to determine whether the target port IDs extracted by the converter from multiple UALink data packets represent the same PCIe port.

[0078] The protocol conversion table is configured to store the associated UALink port ID, PCIe base address, and sector size; When the PCIe controller receives the first UALink data packet, the data aggregation auxiliary device performs the following steps: the converter parses the UALink data packet to extract the sector index and payload data from the UALink data packet, and stores the payload data in the payload buffer; based on the corresponding UALink port ID, it determines the associated PCIe base address and sector size in the protocol conversion table; based on the associated PCIe base address and sector size, and the extracted sector index, it determines the corresponding PCIe target address; it generates the header data of the PCIe transaction layer data packet, which contains the PCIe target address; it stores the generated header data in the header information register; it encapsulates the data in the payload buffer and the header data into a PCIe data packet to be written, and sends it to the mass memory expansion device through the PCIe port; When the PCIe controller receives subsequent UALink data packets, the data aggregation auxiliary device performs the following steps: the converter parses the UALink data packets to extract the payload data in the UALink data packets and stores the payload data in the payload buffer; the header data in the header information register is reused, and the PCIe target address and header data are not recalculated and generated again, and the header data is combined with the payload data in the payload buffer to form a PCIe data packet to be written, and the packet is sent to the mass memory expansion device through the PCIe port.

[0079] Furthermore, using the length of the payload data combined with the header data as an offset, the PCIe target address in the header information register is updated. For details, and for the method of calculating the PCIe target address, refer to the switch embodiment; they are not repeated here.
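The behavior of the data aggregation auxiliary device on the first and subsequent UALink data packets can be sketched in Python as follows. The class name `Aggregator`, the table contents, and the `sent` list that records what leaves the single PCIe port are assumptions for illustration; header data is again represented by a plain dictionary rather than a real TLP header.

```python
class Aggregator:
    """Illustrative model of the single-PCIe-port aggregation device."""

    def __init__(self, table):
        self.table = table         # UALink port ID -> (base address, sector size)
        self.header = None         # stand-in for the header information register
        self.buffer = bytearray()  # payload buffer
        self.sent = []             # (target address, payload) pairs sent out

    def receive(self, ualink_port_id, payload, sector_index=None):
        self.buffer += payload  # strip payload into the buffer
        if self.header is None:
            # First packet: table lookup, address calculation, header generation.
            if sector_index is None:
                raise ValueError("first packet needs a sector index")
            base, size = self.table[ualink_port_id]
            self.header = {"address": base + size * sector_index}
        # Subsequent packets skip the branch above and reuse the header.
        self._send()

    def _send(self):
        data = bytes(self.buffer)
        self.buffer.clear()
        self.sent.append((self.header["address"], data))
        self.header["address"] += len(data)  # payload length used as offset
```

In this sketch, payloads from different UALink ports land back to back behind the address computed for the first packet, because the header is reused rather than recalculated for each source port.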

[0080] The data aggregation auxiliary device has a port controller on its only PCIe port. The port controller is configured to obtain header data from the header information register, combine it with the payload data in the payload buffer, and encapsulate the data to obtain the corresponding PCIe data packet.

[0081] The data aggregation auxiliary device has a PCIe port integrated with a PCIe chip, the PCIe port is connected to memory, and the PCIe chip is configured to access memory. Alternatively, the PCIe port may be configured to connect to a PCIe host external to the data aggregation auxiliary device, the PCIe host being configured with memory.

[0082] In one embodiment of the present invention, a data aggregation method is provided, in which multiple sending devices send write requests based on the UALink protocol, and data aggregation is achieved at a large-capacity memory expansion device or standard storage node based on the PCIe protocol through protocol conversion. As shown in Figure 10, the data aggregation method includes the following steps: receive the first UALink data packet sent by the sending device (GPU processor) through the UALink port, and identify the ID of the UALink port, that is, the ID of the port connected to the sending device; parse the UALink data packet to extract the sector index and payload data, and store the payload data in the payload buffer; look up, in the pre-established protocol conversion table, the PCIe base address and sector size associated with the ID of the UALink port, where each UALink port is associated with a different PCIe base address.

[0083] Based on the associated PCIe base address and sector size, as well as the extracted sector index, the corresponding PCIe target address is determined. The formula for calculating the PCIe target address is as follows: $\mathrm{Add}_{\mathrm{obj}} = \mathrm{Add}_{\mathrm{base}} + (\mathrm{Size}_{\mathrm{sector}} \times \mathrm{Index}_{\mathrm{sector}})$, where $\mathrm{Add}_{\mathrm{obj}}$ is the PCIe target address, $\mathrm{Add}_{\mathrm{base}}$ is the PCIe base address, $\mathrm{Size}_{\mathrm{sector}}$ is the sector size, and $\mathrm{Index}_{\mathrm{sector}}$ is the sector index. The remaining steps are: generate the header data of the PCIe transaction layer data packet, which contains the PCIe target address; store the generated header data in a register, and encapsulate the data in the payload buffer together with the header data into a PCIe data packet to be written; send the PCIe data packet to be written to the large-capacity memory expansion device through the single PCIe port; receive subsequent UALink data packets sent by the sending device (GPU processor) through the UALink port, and identify the ID of the UALink port; parse the subsequent UALink data packets to extract payload data, and store the payload data in the payload buffer; reuse the header data in the register, without recalculating the PCIe target address or regenerating the header data, and combine the header data with the payload data in the payload buffer to encapsulate a PCIe data packet to be written; send the PCIe data packet to be written to the large-capacity memory expansion device through the single PCIe port.

[0084] After each combination and encapsulation of the payload data in the payload buffer with the header data in the register, the PCIe target address in the register is updated using the length of the payload data combined with the header data as the offset.

[0085] In one embodiment of the present invention, a data aggregation system is provided, including multiple electronic devices, a memory expansion device, and a data aggregation auxiliary device as described above, wherein the multiple electronic devices are connected to multiple UALink ports of the data aggregation auxiliary device and communicate based on the UALink protocol; the memory expansion device is connected to a single PCIe port of the data aggregation auxiliary device and communicates based on the PCIe protocol.

[0086] The data aggregation method and system provided in this embodiment belong to the same inventive concept as the data aggregation auxiliary device provided in the above embodiments. All contents of the data aggregation auxiliary device embodiment are incorporated into the data aggregation method and data aggregation system embodiment by reference, and will not be repeated here.

[0087] In AI's Gather operation, the primary task is to aggregate data distributed across various locations into a centralized storage space. Whether the data is continuous or discontinuous is not important. Therefore, this invention breaks through the sequential limitations of traditional "packet-to-packet" conversion and innovatively adopts a Payload aggregation and streaming transmission mechanism: by directly stripping and merging effective data from multiple concurrent sources and reusing TLP packet headers, the protocol overhead is significantly reduced, and the PCIe bus bandwidth utilization and transmission efficiency of AI clusters in massive data aggregation (Gather) scenarios are significantly improved.

[0088] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0089] The above description is only a specific embodiment of this application. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principle of this application, and these improvements and modifications should also be considered within the scope of protection of this application.

Claims

1. A data aggregation auxiliary device, characterized in that, It includes a PCIe controller, a converter, a protocol conversion table, a header information register, a payload buffer, multiple UALink ports and only one PCIe port, wherein the PCIe controller is connected to each UALink port to receive UALink data packets from a processor outside the data aggregation auxiliary device through the UALink port and forward them to the converter, and the PCIe controller identifies the UALink port ID corresponding to the UALink port. The protocol conversion table is configured to store the associated UALink port ID, PCIe base address, and sector size; When aggregating the first UALink data packet, the data aggregation auxiliary device performs the following steps: The converter parses the UALink data packet to extract the sector index and payload data in the UALink data packet, and stores the payload data in the payload buffer; Based on the corresponding UALink port ID, determine the associated PCIe base address and sector size in the protocol conversion table; Based on the associated PCIe base address and sector size, as well as the extracted sector index, determine the corresponding PCIe target address; Generate the header data of the PCIe transaction layer packet, which contains the PCIe destination address; The generated header data is stored in the header information register; the payload data in the payload buffer and the header data are encapsulated into a PCIe data packet to be written, and sent to the memory expansion device through the PCIe port; The header data in the header information register is reused when aggregating the next UALink data packet.

2. The data aggregation auxiliary device according to claim 1, characterized in that, When aggregating the next UALink data packet, the header data in the header information register is reused, including: When aggregating the next UALink data packet, the data aggregation auxiliary device performs the following steps: The converter parses the UALink data packet to extract the payload data in the UALink data packet and stores the payload data in the payload buffer; The header data in the header information register is reused and combined with the payload data in the payload buffer to form a PCIe data packet to be written, and then sent to the memory expansion device through the PCIe port.

3. The data aggregation auxiliary device according to claim 1, characterized in that, Each time payload data retrieved from the payload buffer is combined with the header data in the header information register, the PCIe target address in the header information register is updated using the length of the payload data combined with the header data as an offset.

4. The data aggregation auxiliary device according to claim 1, characterized in that, The data aggregation auxiliary device has a port controller on its only PCIe port. The port controller is configured to obtain header data from the header information register, combine it with the payload data in the payload buffer, and encapsulate the data to obtain the corresponding PCIe data packet.

5. The data aggregation auxiliary device according to claim 1, characterized in that, The data aggregation auxiliary device has a PCIe port integrated with a PCIe chip, which is connected to memory and configured to access memory.

6. The data aggregation auxiliary device according to claim 1, characterized in that the PCIe port is configured to connect to a PCIe host external to the data aggregation auxiliary device, the PCIe host being configured with memory.

7. A data aggregation method, characterized in that multiple sending devices send write requests based on the UALink protocol, and data aggregation is achieved at a memory expansion device based on the PCIe protocol through protocol conversion, the data aggregation method including the following steps: receiving, through a UALink port, the first UALink data packet sent by a sending device and identifying the ID of the UALink port; parsing the UALink data packet to extract the sector index and payload data, and storing the payload data in a payload buffer; looking up, in a pre-established protocol conversion table, the PCIe base address and sector size associated with the ID of the UALink port; determining the corresponding PCIe target address based on the associated PCIe base address and sector size, as well as the extracted sector index; generating header data of a PCIe transaction layer packet, the header data containing the PCIe target address; storing the generated header data in a register, and encapsulating the payload data in the payload buffer together with the header data into a PCIe data packet to be written; sending the PCIe data packet to be written to the memory expansion device through a single PCIe port; and reusing the header data in the register when aggregating the next UALink data packet.

8. The data aggregation method according to claim 7, characterized in that it further includes: receiving, through the UALink port, the next UALink data packet sent by the sending device and identifying the ID of the UALink port; parsing the next UALink data packet to extract the payload data, and storing the payload data in the payload buffer; reusing the header data in the register and combining it with the payload data in the payload buffer to form a PCIe data packet to be written; and sending the PCIe data packet to be written to the memory expansion device through the single PCIe port.

9. The data aggregation method according to claim 7, characterized in that, each time the payload data in the payload buffer is combined and encapsulated with the header data in the register, the PCIe target address in the register is updated using the length of that payload data as the offset.

10. A data aggregation system, characterized in that it includes multiple electronic devices, a memory expansion device, and a data aggregation auxiliary device as described in any one of claims 1 to 6, wherein the multiple electronic devices are respectively connected to the multiple UALink ports of the data aggregation auxiliary device and communicate based on the UALink protocol, and the memory expansion device is connected to the single PCIe port of the data aggregation auxiliary device and communicates based on the PCIe protocol.
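
For illustration only, the aggregation flow recited in claims 1 to 3 and 7 to 9 can be sketched as a small software model. This is a minimal sketch under stated assumptions, not the claimed hardware: the names `ConversionEntry`, `Aggregator`, `first_packet`, and `next_packet` are hypothetical, the header is reduced to its target address, and the formula target address = base address + sector index × sector size is an assumed reading of "determine the corresponding PCIe target address", which the claims do not spell out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ConversionEntry:
    """One row of the protocol conversion table (field names are illustrative)."""
    pcie_base: int
    sector_size: int


@dataclass
class Aggregator:
    """Toy model of the converter, header information register, and PCIe port."""
    table: Dict[int, ConversionEntry]          # keyed by UALink port ID
    header_addr: Optional[int] = None          # stands in for the header register
    sent: List[Tuple[int, bytes]] = field(default_factory=list)

    def first_packet(self, port_id: int, sector_index: int, payload: bytes) -> None:
        entry = self.table[port_id]
        # ASSUMPTION: target address = base + sector_index * sector_size.
        self.header_addr = entry.pcie_base + sector_index * entry.sector_size
        self._emit(payload)

    def next_packet(self, payload: bytes) -> None:
        # Reuse the stored header instead of repeating the table lookup.
        if self.header_addr is None:
            raise RuntimeError("no header stored; send a first packet before reusing it")
        self._emit(payload)

    def _emit(self, payload: bytes) -> None:
        # Record the (address, payload) pair as the PCIe packet to be written,
        # then advance the stored address by the payload length (claims 3 and 9).
        self.sent.append((self.header_addr, payload))
        self.header_addr += len(payload)


# Usage with made-up values: one port, 4 KiB sectors, two 64-byte payloads.
agg = Aggregator({0: ConversionEntry(pcie_base=0x1000_0000, sector_size=0x1000)})
agg.first_packet(port_id=0, sector_index=2, payload=bytes(64))
agg.next_packet(bytes(64))
print([hex(addr) for addr, _ in agg.sent])  # ['0x10002000', '0x10002040']
```

In this model the second write lands immediately after the first, because the stored address is advanced by the payload length rather than recomputed from the table.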