Semantic switching chips, computer systems, semantic switching methods, network interface cards, devices, media, and software products
By introducing a semantic switching chip into the AI system, requests are converted between the scale-up (vertically scaled) and scale-out (horizontally scaled) networks. This removes the complexity caused by the separation of the two networks, reduces GPU resource consumption, and improves AI training efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN JAGUAR MICROSYSTEMS CO LTD
- Filing Date
- 2026-05-29
- Publication Date
- 2026-06-30
AI Technical Summary
In existing technologies, the separation of scale-up and scale-out networks leads to high complexity in collective communication, which consumes GPU computing resources and affects the efficiency of training large AI models.
A semantic switching chip is provided, comprising at least two ports and a switching module, for translating between memory requests of the vertically scaled (scale-up) network and RDMA read/write requests of the horizontally scaled (scale-out) network. The semantic switching chip shields the differences between the interconnect networks at the hardware level, reduces the complexity of the collective communication library software, and offloads communication to the semantic switching chip.
It reduces GPU resource consumption, improves the efficiency of training large AI models, and reduces the complexity of the collective communication library software.
Figure CN122309441A_ABST
Description
Technical Field
[0001] This application relates to the field of computer network technology, and in particular to a semantic switching chip, computer system, semantic switching method, network interface card, device, medium and program product.
Background Technology
[0002] AI applications are developing rapidly, and AI models are becoming increasingly large-scale. Distributed parallel computing is a key means of training large AI models, typically including data parallelism, pipeline parallelism, and tensor parallelism. These parallel modes all require multiple collective communication operations between multiple computing devices (GPUs) before the next iteration of training can proceed. High-performance interconnect networks, as the foundation for communication between computing devices, need to provide high bandwidth and low latency.
[0003] Based on bandwidth and scope, interconnect networks between computing devices can be divided into two categories: Scale-up (vertical scaling networks) and Scale-out (horizontal scaling networks). Scale-up networks feature high bandwidth and low latency, support Load / Store memory semantics, and are primarily used for high-speed interconnection between multiple GPUs within a single node (or rack). They typically support a limited number of GPUs (8-1K). In contrast, Scale-out networks have lower bandwidth and higher latency, but support a larger number of GPUs (1K-100K). They are typically used in scenarios where increasing the number of GPUs expands the overall processing power and capacity of the system. Scale-out networks support RDMA message semantics.
[0004] However, since scale-up and scale-out networks are separate, for collective communication that requires cross-node communication, the NCCL (NVIDIA Collective Communications Library) calls the UCX / libibverbs API interface and communicates with other nodes through the scale-out network via CUDA drivers, RDMA drivers, and RNICs that support the RDMA protocol. This is quite complex, and this complexity is currently handled by collective communication software like NCCL, which consumes the GPU's computing resources.
Summary of the Invention
[0005] Therefore, to address the above-mentioned technical problems, it is necessary to provide a semantic switching chip, computer system, semantic switching method, network interface card, device, medium, and program product that can reduce GPU resource consumption.
[0006] In a first aspect, this application provides a semantic exchange chip, the semantic exchange chip comprising:
[0007] At least two ports, each of which is used to implement the conversion between memory requests of the vertically scaling network and RDMA read / write requests of the horizontally scaling network;
[0008] The switching module communicates with each of the ports and is used to schedule RDMA read / write requests received from the horizontal scaling network in one port to another port for conversion to obtain memory requests to be transmitted to the vertical scaling network, and to schedule RDMA read / write requests obtained from the memory requests of the vertical scaling network in the other port to the first port for execution to transmit the RDMA read / write requests to the ports corresponding to other semantic switching chips in the horizontal scaling network.
[0009] In one embodiment, the port includes:
[0010] The first adapter is used to receive and parse memory requests transmitted via the vertically extended network route, convert the memory requests into RDMA read / write requests, send the converted RDMA read / write requests to the switching module, and receive RDMA read / write requests scheduled by the switching module, and convert the RDMA read / write requests into at least one independent memory request.
[0011] The second adapter is configured to execute the RDMA read / write request to send it to the semantic switching chip in another computer system via the horizontal scaling network, and to receive and parse the RDMA read / write request sent by the semantic switching chip in the other computer system from the horizontal scaling network, and to transmit the RDMA read / write request to the switching module.
[0012] In one embodiment, the first adapter is specifically configured to receive and parse at least one memory request sent by the vertically extended network routing, merge and package the at least one memory request into a first target data packet, and encapsulate the first target data packet into an RDMA read / write request; and
[0013] Receive RDMA read / write requests scheduled by the switching module, convert the RDMA read / write requests into multiple second data packets, and generate a memory request for each second data packet.
[0014] In one embodiment, the second adapter is specifically configured to receive an RDMA read request sent by a semantic switching chip in another computer system, verify the legality of the read / write permissions and address carried in the RDMA read request, and transmit the RDMA read request to the switching module after the verification is successful; and upon receiving an RDMA write request sent by a semantic switching chip in another computer system, transmit the RDMA write request to the switching module.
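For illustration only, the legality check performed by the second adapter could be sketched in Python as follows. The application does not specify how permissions and addresses are represented, so the region table, the `key` field, and the `Access` flags below are all assumptions, loosely modelled on RDMA memory registration.

```python
from dataclasses import dataclass
from enum import Flag, auto
from typing import Dict


class Access(Flag):
    """Hypothetical permission bits carried by a registered region."""
    READ = auto()
    WRITE = auto()


@dataclass(frozen=True)
class Region:
    key: int        # hypothetical key presented by the remote side
    base: int       # first valid address of the region
    length: int     # size of the region in bytes
    access: Access  # operations the remote side is allowed to perform


def is_legal(regions: Dict[int, Region], key: int, address: int,
             length: int, wanted: Access) -> bool:
    """Return True only if the key is known, the range is in bounds,
    and the requested operation is permitted for that region."""
    region = regions.get(key)
    if region is None or length <= 0:
        return False
    in_bounds = (region.base <= address
                 and address + length <= region.base + region.length)
    return in_bounds and wanted in region.access
```

In this sketch a request that fails the check would simply not be passed on to the switching module; how rejections are reported is not described in the application.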
[0015] In one embodiment, the semantic switching chip further includes a processing core, or the semantic switching chip reuses an off-chip processing core; the processing core is used to establish the intra-system topology and the inter-system RDMA channels.
[0016] In one embodiment, the processing core is used for:
[0017] Receive and store the computer system topology information corresponding to the current computer system, which is uploaded when the graphics processor driver is loaded. The computer system topology information is used by the vertically extended network to perform data transmission within the computer system.
[0018] Receive the communication address broadcast by the semantic exchange chip in other computer systems, and generate global topology information based on the communication address broadcast by the semantic exchange chip in other computer systems;
[0019] Based on the global topology information, an RDMA channel and graphics processor memory address mapping are established between computer systems, and the RDMA channel and mapping relationship are written into the memory of each graphics processor through vertically extended network routing.
[0020] In one embodiment, receiving the communication address broadcast by the semantic exchange chip in another computer system, and generating global topology information based on the communication address broadcast by the semantic exchange chip in the other computer system, includes:
[0021] Each of the semantic exchange chips determines the Rendezvous endpoint on the horizontally extended network and receives the communication address corresponding to each of the semantic exchange chips issued by the scheduling system;
[0022] The communication address is broadcast through the horizontal expansion network, and the communication address broadcast by the semantic exchange chip in other computer systems is received.
[0023] Based on the communication address broadcast by the semantic exchange chip in other computer systems, global topology information is generated, which includes the computer system identifier, the target graphics processor identifier, and the communication address corresponding to the target semantic exchange chip.
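As a minimal sketch of the step above, the following Python code merges received broadcasts into a global topology table. The `Announcement` record and its fields are hypothetical; the application only states that the resulting information includes the computer system identifier, the target graphics processor identifier, and the communication address of the target semantic switching chip.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Announcement(NamedTuple):
    """Hypothetical broadcast payload sent by one semantic switching chip."""
    system_id: int            # computer system (node) identifier
    gpu_ids: Tuple[int, ...]  # GPUs reachable through the announcing chip
    address: str              # communication address issued by the scheduler


def build_global_topology(
    announcements: Iterable[Announcement],
) -> Dict[Tuple[int, int], str]:
    """Map (system id, GPU id) to the address of the chip serving that GPU."""
    topology: Dict[Tuple[int, int], str] = {}
    for ann in announcements:
        for gpu_id in ann.gpu_ids:
            topology[(ann.system_id, gpu_id)] = ann.address
    return topology
```

A dictionary keyed by (system, GPU) is only one possible layout; it is chosen here because it makes the later lookup of a target chip a single access.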
[0024] In one embodiment, the step of establishing RDMA channels and graphics processor memory address mappings between computer systems based on the global topology information, and writing the RDMA channels and mapping relationships into the memory of each graphics processor through the vertically extended network routing, includes:
[0025] Based on the global topology information, an RDMA communication channel for the semantic exchange chip between different computer systems is established through the horizontally extended network;
[0026] Based on the global topology information, a mapping relationship is established between the graphics processor memory address and the memory address of the RDMA communication channel, and read / write permissions for the memory address of the RDMA communication channel are obtained.
[0027] Through the vertically extended network routing, the RDMA communication channel, the mapping relationship, and the read / write permissions of the RDMA communication channel's memory address are written into the graphics processor's video memory based on the computer system's topology information.
[0028] In a second aspect, this application also provides a computer system, the system comprising:
[0029] At least one central processing unit;
[0030] At least two graphics processors, each of which is connected to a central processing unit, and the graphics processors are interconnected via a vertically extended network routing;
[0031] At least two of the aforementioned semantic switching chips, each of which is attached to a vertically extended network route, wherein the semantic switching chip is used to convert memory requests of the vertically extended network into RDMA read / write requests of the horizontally extended network.
[0032] In one embodiment, the graphics processor is specifically used to call a communication library through a user program, and to process memory requests by calling the API interface provided by the driver through a shared memory library based on the communication library.
[0033] In a third aspect, this application also provides a semantic switching method applied to a semantic switching chip, the method comprising:
[0034] Receive and parse the memory request transmitted from the vertically expanded network route, convert the memory request into an RDMA read / write request, and then send it to the horizontally expanded network route;
[0035] Receive and parse the RDMA read / write request transmitted from the horizontal scaling network route, convert the RDMA read / write request into a memory request, and then send it to the vertical scaling network route.
[0036] In one embodiment, receiving and parsing the memory request sent by the vertical scaling network route, converting the memory request into an RDMA read / write request, and then sending it to the horizontal scaling network route includes:
[0037] The first adapter receives and parses at least one memory request sent via the vertically extended network route;
[0038] At least one of the memory requests is merged and packaged into a first target data packet using a first adapter;
[0039] The first target data packet is encapsulated into an RDMA read / write request by the first adapter, and the RDMA read / write request is transmitted to the second adapter via the switching module;
[0040] The RDMA read / write request is executed via the second adapter to be sent to the horizontally extended network route.
[0041] In one embodiment, receiving and parsing RDMA read / write requests sent by the horizontal scaling network, and converting the RDMA read / write requests into memory requests, includes:
[0042] The second adapter receives RDMA read / write requests from semantic switching chips in other computer systems via the horizontally extended network routing, verifies the legality of the read / write permissions and addresses carried in the write request, and transmits the RDMA read / write request to the first adapter via the switching module after the verification is successful.
[0043] The first adapter receives the RDMA read / write request transmitted by the switching module.
[0044] The RDMA read / write request is split into multiple second data packets via the first adapter;
[0045] The first adapter generates a memory request for each of the second packets and sends it to the vertically extended network route.
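The splitting step on the receive path can be illustrated with a short Python sketch. It assumes an RDMA write whose payload is cut into fixed-size pieces, each becoming one Store request at a consecutive address; the function name and the 64-byte default are hypothetical, since the application does not state how the second data packets are sized.

```python
from typing import List, Tuple


def split_write(base_address: int, payload: bytes,
                max_store_bytes: int = 64) -> List[Tuple[int, bytes]]:
    """Split one RDMA write payload into (address, data) Store requests.

    Each piece holds at most max_store_bytes bytes and targets the
    address that follows the previous piece.
    """
    if max_store_bytes <= 0:
        raise ValueError("max_store_bytes must be positive")
    return [
        (base_address + offset, payload[offset:offset + max_store_bytes])
        for offset in range(0, len(payload), max_store_bytes)
    ]
```

For example, a 150-byte payload with the 64-byte limit yields three Store requests of 64, 64, and 22 bytes.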
[0046] In one embodiment, the method further includes:
[0047] Establish the topology within the computer system and the RDMA channel between computer systems.
[0048] In one embodiment, establishing the intra-system topology and inter-system RDMA channels includes:
[0049] Receive and store the computer system topology information corresponding to the current computer system, which is uploaded when the graphics processor driver is loaded. The computer system topology information is used by the vertically extended network to perform data transmission within the computer system.
[0050] Receive the communication address broadcast by the semantic exchange chip in other computer systems, and generate global topology information based on the communication address broadcast by the semantic exchange chip in other computer systems;
[0051] Based on the global topology information, an RDMA channel and graphics processor memory address mapping are established between computer systems, and the RDMA channel and mapping relationship are written into the memory of each graphics processor through the vertically extended network routing.
[0052] In one embodiment, receiving the communication address broadcast by the semantic exchange chip in another computer system, and generating global topology information based on the communication address broadcast by the semantic exchange chip in the other computer system, includes:
[0053] Determine the Rendezvous endpoint on the horizontally extended network and receive the communication address corresponding to the semantic exchange chip issued by the scheduling system;
[0054] The communication address is broadcast through the horizontal expansion network, and the communication address broadcast by the semantic exchange chip in other computer systems is received.
[0055] Based on the communication address broadcast by the semantic exchange chip in other computer systems, global topology information is generated, which includes the computer system identifier, the target graphics processor identifier, and the communication address corresponding to the target semantic exchange chip.
[0056] In one embodiment, the step of establishing RDMA channels and graphics processor memory address mappings between computer systems based on the global topology information, and writing the RDMA channels and mapping relationships into the memory of each graphics processor through the vertically extended network routing, includes:
[0057] Based on the global topology information, an RDMA communication channel for the semantic exchange chip between different computer systems is established through the horizontally extended network routing;
[0058] Based on the global topology information, a mapping relationship is established between the graphics processor memory address and the memory address of the RDMA communication channel, and read / write permissions for the memory address of the RDMA communication channel are obtained.
[0059] Through the vertically extended network routing, the RDMA communication channel, the mapping relationship, and the read / write permissions of the RDMA communication channel's memory address are written into the graphics processor's video memory based on the computer system's topology information.
[0060] In a fourth aspect, this application also provides a network interface card, including a semantic switching chip as described in any of the above embodiments and multiple interfaces, wherein the chip processes data or communicates externally through the interfaces.
[0061] In a fifth aspect, this application also provides a computer device including a network interface card in any of the above embodiments, the network interface card being used for processing data or external communication.
[0062] In a sixth aspect, this application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the methods in any of the above embodiments.
[0063] In a seventh aspect, this application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the method in any of the above embodiments.
[0064] The aforementioned semantic switching chip, computer system, semantic switching method, network interface card, device, medium, and program product involve a semantic switching chip comprising at least two ports and a switching module. Each port is used to convert between memory requests of the vertically scaled (scale-up) network and RDMA read / write requests of the horizontally scaled (scale-out) network. The switching module, communicating with each port, is used to schedule RDMA read / write requests received from the horizontally scaled network at one port to another port for conversion into memory requests to be transmitted to the vertically scaled network, and to schedule RDMA read / write requests obtained at the other port from memory requests of the vertically scaled network to the first port for execution, so as to transmit the RDMA read / write requests to the ports of other semantic switching chips within the horizontally scaled network. This shields the differences in the interconnect network at the hardware level, eliminating the need for the communication library software to be aware of the underlying interconnect network implementation details and greatly reducing the complexity of the communication library software. Furthermore, it offloads communication between different nodes from the GPU to the semantic switching chip, thereby reducing GPU resource consumption.
Description of the Drawings
[0065] To more clearly illustrate the technical solutions in the embodiments of this application or related technologies, the drawings used in the description of the embodiments of this application or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0066] Figure 1 is an internal topology diagram of a node in a related embodiment;
[0067] Figure 2 is a diagram illustrating the call hierarchy of AI application software in related technologies;
[0068] Figure 3 is a block diagram of the semantic exchange chip in one embodiment;
[0069] Figure 4 is a block diagram of the computer system in one embodiment;
[0070] Figure 5 is a schematic diagram of the graphics processor software call hierarchy in one embodiment;
[0071] Figure 6 is a flowchart illustrating a semantic exchange method in one embodiment;
[0072] Figure 7 is an internal structural diagram of a computer device in one embodiment.
Detailed Implementation
[0073] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0074] It should be noted that the terms "first," "second," etc., used in this application can be used to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish the first element from the second element. The terms "comprising" and "having," and any variations thereof, used in this application, are intended to cover non-exclusive inclusion. The term "multiple" used in this application refers to two or more. The term "and / or" used in this application refers to one of the embodiments, or any combination of multiple embodiments.
[0075] Figure 1 is an internal topology diagram of a node in a related embodiment. As shown in Figure 1, the node (GPU server) consists of two CPUs, each equipped with DRAM (memory) and network interface cards (NIC0~NIC1, primarily used for distributed storage and in-band management). Each CPU is connected via a PCIe switch to four GPU compute cards (GPU0~GPU3, GPU4~GPU7) and four RNICs supporting the RDMA protocol (RNIC0~RNIC3, RNIC4~RNIC7). Each RNIC is dedicated to one GPU compute card.
[0076] The eight GPU computing cards in the GPU server are interconnected through multiple NvSwitch to form a high-bandwidth scale-up network (this example uses the common NvSwitch interconnection; other methods may be used in other embodiments).
[0077] The eight RNIC network cards in the GPU server are connected to the TOR (Top Of Rack) switch (the TOR switch is not shown in Figure 1). Multiple GPU servers are interconnected through TOR switches, Leaf switches, and Spine switches to form a computing cluster of thousands or tens of thousands of GPUs. A network consisting of RNIC network cards, TOR switches, Leaf switches, and Spine switches is typically referred to as a low-bandwidth scale-out network.
[0078] When the GPU executes the corresponding AI application, it can call the communication library to shield the underlying network implementation details. Figure 2 is a diagram illustrating the call hierarchy of AI application software in related technologies. As shown in Figure 2, taking NCCL as an example of a communication library and CUDA as an example of a driver, users develop and deploy AI applications based on AI programming frameworks such as PyTorch or TensorFlow. These applications are deployed to run on multiple GPU computing cards. During training, these GPU computing cards may need to use the scale-up and scale-out interconnects. The NCCL communication library shields the upper-layer applications from the underlying network details.
[0079] Small models that can be trained on a single node with a single GPU need neither the scale-up nor the scale-out interconnect network. Models trained on a single node with multiple GPUs do not need the scale-out network. In these two scenarios, as shown in Figure 2(a), the NCCL library can directly call the API interface provided by the CUDA driver through the shared memory library.
[0080] For collective communication that requires cross-node communication, as shown in Figure 2(b), the NCCL collective communication library calls the UCX / libibverbs API interface and communicates with other nodes over the scale-out network through the CUDA driver, the RDMA driver, and an RNIC that supports the RDMA protocol.
[0081] Therefore, in current related technologies, scale-up and scale-out networks are separated. Although the NCCL collective communication library software shields the upper-layer applications from the differences in the underlying network implementation, the resulting software complexity still exists. This complexity is currently borne by collective communication software like NCCL, which consumes the computing resources of the GPU.
[0082] To address this technical problem, this application provides a semantic switching chip. As shown in Figure 3, it includes at least two ports and a switching module, wherein the switching module communicates with each of the ports. Each port is used to convert between memory requests of the vertical scaling network and RDMA read / write requests of the horizontal scaling network. The switching module is used to schedule RDMA read / write requests received from the horizontal scaling network at one port to another port for conversion into memory requests to be transmitted to the vertical scaling network, and to schedule RDMA read / write requests obtained at the other port from memory requests of the vertical scaling network to the first port for execution, so that they are transmitted to the ports of other semantic switching chips in the horizontal scaling network.
[0083] Each port of the semantic exchange chip is used to convert memory requests from the vertically scaling network to RDMA read / write requests from the horizontally scaling network. The converted memory requests and the converted RDMA read / write requests are then forwarded to the corresponding ports through the switching module. In this way, semantic exchange is performed on a separate semantic exchange chip, rather than on the GPU, which can reduce the GPU resource usage and allow more GPU resources to be used for training.
[0084] Specifically, the port can receive memory requests sent by the GPU or RDMA read / write requests sent by semantic exchange chips in other computer systems from the horizontal scaling network. The port can convert memory requests sent by the GPU into RDMA read / write requests, and convert RDMA read / write requests sent by semantic exchange chips in other computer systems from the horizontal scaling network into memory requests.
[0085] Memory requests can include LOAD requests and STORE requests, while RDMA read / write requests include RDMA read requests and RDMA write requests.
[0086] Specifically, the port can convert LOAD requests sent by the GPU into RDMA read requests, and STORE requests sent by the GPU into RDMA write requests; convert RDMA read requests sent by semantic exchange chips in other computer systems into LOAD requests; and convert RDMA write requests received from the horizontal scaling network from semantic exchange chips in other computer systems into STORE requests.
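The request pairing described above (LOAD with RDMA read, STORE with RDMA write) can be written down as a small Python sketch. The class and function names are hypothetical and the request records carry only an address and a payload; real requests would carry more fields that the application does not enumerate here.

```python
from dataclasses import dataclass
from enum import Enum


class MemOp(Enum):
    LOAD = "LOAD"
    STORE = "STORE"


class RdmaOp(Enum):
    READ = "RDMA_READ"
    WRITE = "RDMA_WRITE"


# Pairing taken from the text: LOAD <-> RDMA read, STORE <-> RDMA write.
MEM_TO_RDMA = {MemOp.LOAD: RdmaOp.READ, MemOp.STORE: RdmaOp.WRITE}
RDMA_TO_MEM = {rdma: mem for mem, rdma in MEM_TO_RDMA.items()}


@dataclass(frozen=True)
class MemRequest:
    op: MemOp
    address: int
    payload: bytes = b""  # empty for LOAD


@dataclass(frozen=True)
class RdmaRequest:
    op: RdmaOp
    remote_address: int
    payload: bytes = b""  # empty for RDMA read


def to_rdma(req: MemRequest) -> RdmaRequest:
    """Convert a memory request from the scale-up side into an RDMA request."""
    return RdmaRequest(MEM_TO_RDMA[req.op], req.address, req.payload)


def to_memory(req: RdmaRequest) -> MemRequest:
    """Convert an RDMA request from the scale-out side into a memory request."""
    return MemRequest(RDMA_TO_MEM[req.op], req.remote_address, req.payload)
```

Because the pairing is one-to-one, converting a request to RDMA and back returns the original memory request in this sketch.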
[0087] In practical applications, take a semantic switching chip with one port (the first port) and another port (the second port) as an example. When the second port receives, from the vertically extended network route, a Load request that accesses a remote address on another node (computer system), for example 0xB00000002000, it encapsulates the Load request into an RDMA read message and sends this message to the first port through the switching module. The first port executes the RDMA read message and sends it to the remote semantic switching chip over an RDMA QP connection. Similarly, the second port encapsulates a Store request into an RDMA write message and sends it to the first port through the switching module, and the first port executes the RDMA write message and forwards it to the remote semantic switching chip over an RDMA QP connection. In both cases, after the second port completes the request conversion, it sends the converted RDMA read / write request to the switching module, which forwards it to the corresponding port. That port executes the RDMA read / write request so that it is transmitted to the corresponding ports of other semantic switching chips within the horizontally extended network.
[0088] In addition, the port of the semantic switching chip can also receive RDMA read / write requests from the horizontal scaling network. The switching module schedules the RDMA read / write request to another port, which then converts the RDMA read / write request to obtain a memory request to be transmitted to the vertical scaling network.
[0089] Specifically, the switching module can schedule the RDMA read / write request converted by another port to the port, and the port routes the converted RDMA read / write request to the corresponding target semantic switching chip in the horizontal expansion network. The path to the target semantic switching chip can be determined from the address of the pre-conversion memory request that corresponds to the RDMA read / write request, and the port routes the converted RDMA read / write request to the target semantic switching chip along that path.
[0090] The switching module can also schedule RDMA read / write requests received from the horizontal expansion network in one port to another port. The other port converts the RDMA read / write request to obtain a memory request to be transmitted to the vertical expansion network, and determines the target semantic switching chip based on the address of the converted memory request.
[0091] During initialization, the semantic switching chip generates global topology information, which stores the mapping between UVA address segments and the tuple (node ID, target GPU Rank, target SSNIC RDMA endpoint). A UVA address segment corresponds to the address of a memory request, so the target semantic switching chip (SSNIC) can be determined from the address of the memory request in order to generate a path. If the converted request is an RDMA read / write request, the target semantic switching chip (SSNIC) is determined from the address of the memory request before conversion. The aforementioned semantic switching chip includes at least two ports, each used to convert between memory requests of the vertically scaling network and RDMA read / write requests of the horizontally scaling network. A switching module, communicating with each port, is used to schedule RDMA read / write requests received from the horizontally scaling network at one port to the other port for conversion into memory requests to be transmitted to the vertically scaling network. It also schedules RDMA read / write requests obtained at the other port from memory requests of the vertically scaling network to the first port for execution, so that the RDMA read / write requests can be transmitted to the ports of other semantic switching chips within the horizontally scaling network. This shields the differences in the interconnect network at the hardware level, eliminating the need for the communication library software to be aware of the underlying interconnect network implementation details, greatly reducing the complexity of the communication library software and thus reducing GPU resource consumption.
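The lookup from a memory-request address to its target, as described for the global topology information in paragraph [0091], can be sketched in Python as an interval search. The class names, the string form of the endpoint, and the assumption that segments do not overlap are all hypothetical; only the mapping from a UVA address segment to (node ID, target GPU Rank, target SSNIC RDMA endpoint) comes from the text.

```python
import bisect
from typing import List, NamedTuple, Optional, Tuple


class Route(NamedTuple):
    node_id: int
    gpu_rank: int
    ssnic_endpoint: str  # hypothetical textual form of the RDMA endpoint


class GlobalTopology:
    """Sorted table of non-overlapping address segments (an assumption)."""

    def __init__(self) -> None:
        self._starts: List[int] = []
        self._segments: List[Tuple[int, int, Route]] = []

    def add_segment(self, start: int, size: int, route: Route) -> None:
        # Keep both lists sorted by segment start so lookup can bisect.
        idx = bisect.bisect_left(self._starts, start)
        self._starts.insert(idx, start)
        self._segments.insert(idx, (start, start + size, route))

    def lookup(self, address: int) -> Optional[Route]:
        # Find the last segment that starts at or before the address.
        idx = bisect.bisect_right(self._starts, address) - 1
        if idx < 0:
            return None
        start, end, route = self._segments[idx]
        return route if start <= address < end else None
```

With a segment registered at 0xB00000000000, a request for the example address 0xB00000002000 resolves to that segment's route, while an address outside every segment returns `None`, which a caller could treat as a local or invalid access.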
[0092] In some optional embodiments, the port employs an Ethernet protocol stack, including a physical layer (e.g., 800Gbps PHY), a media access control layer (ETH MAC), and an adaptation layer. The adaptation layer includes a first adapter and a second adapter. The first adapter receives and parses memory requests routed from the vertically extending network, converts the memory requests into RDMA read / write requests, and sends the converted RDMA read / write requests to the switching module. It also receives RDMA read / write requests scheduled by the switching module and converts these RDMA read / write requests into at least one independent memory request. The second adapter executes the RDMA read / write requests to send them to semantic switching chips in other computer systems via the horizontally extending network, receives and parses RDMA read / write requests sent by semantic switching chips in other computer systems from the horizontally extending network, and transmits these RDMA read / write requests to the switching module.
[0093] The first adapter is used to convert memory requests and RDMA read / write requests, while the second adapter is used to communicate with semantic exchange chips in other computer systems through the horizontal scaling network.
[0094] In some optional embodiments, the first adapter is specifically configured to receive and parse at least one memory request sent by the vertically extended network route, merge and package at least one memory request to obtain a first target data packet, and encapsulate the first target data packet into an RDMA read / write request; and receive an RDMA read / write request scheduled by the switching module, convert the RDMA read / write request into a plurality of second target data packets, and generate a memory request for each second target data packet.
[0095] As shown in Figure 3, the Load / Store request is sent from the vertical expansion network route to the first adapter. The first adapter interfaces with the vertical expansion network route, receives the command stream from it, and parses the target UVA address carried in each command, which provides the basis for routing decisions in the subsequent switching module (the semantic switch).
[0096] Specifically, the first adapter can receive and parse memory requests sent by the vertically extended network route and convert them into RDMA read / write requests (see the clockwise data flow indicator line in Figure 3). To do so, the first adapter receives and parses at least one memory request sent by the vertically extended network route, merges and packages the at least one memory request into a first target data packet, and encapsulates the first target data packet into an RDMA read / write request.
[0097] Specifically, as shown in Figure 3, for the packing direction, the mapping and packing layer of the first adapter groups memory requests according to the target address and packs multiple memory requests into a single first target data packet (SUE PDU). In this embodiment, taking a Store request as an example, multiple fine-grained Store requests issued by the GPU (e.g., a large number of small-sized write operations in gradient synchronization) can be aggregated into a single first target data packet (SUE PDU) by the mapping and packing layer of the first adapter (SUE Adapter), then encapsulated as an RDMA Write message and sent to the switching module. The switching module then sends this RDMA Write message to a second adapter on another port. The second adapter on the other port executes the RDMA read / write request to send it to the semantic switching chip in another computer system via the horizontally scaled network. Aggregating the Store requests in this way effectively reduces header overhead.
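The packing direction described above can be illustrated with a minimal sketch. It is not the SUE Adapter's actual logic: the `StoreRequest` type, the `NODE_SHIFT` value (chosen only so that the upper bits of a UVA identify the target node), and the 64-byte header length are all hypothetical assumptions introduced for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StoreRequest:
    uva: int      # target unified virtual address
    data: bytes   # payload of one fine-grained Store


# Hypothetical layout: the bits above NODE_SHIFT identify the target node.
NODE_SHIFT = 44


def pack_stores(stores):
    """Group Stores by target node and pack each group into one PDU."""
    groups = defaultdict(list)
    for store in stores:
        groups[store.uva >> NODE_SHIFT].append(store)
    # One PDU per destination: an address-ordered list of (uva, payload).
    return {
        node: [(s.uva, s.data) for s in sorted(reqs, key=lambda s: s.uva)]
        for node, reqs in groups.items()
    }


def header_bytes_saved(num_stores, num_pdus, header_len=64):
    """Header bytes avoided by sending num_pdus messages, not num_stores."""
    return (num_stores - num_pdus) * header_len
```

Each resulting PDU would then be wrapped in a single RDMA Write message, so the per-message header is paid once per destination rather than once per Store.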
[0098] For the unpacking direction, the first adapter can receive RDMA read / write requests scheduled by the switching module, convert each RDMA read / write request into multiple second target data packets, and generate a memory request for each second target data packet. For example, after the first adapter receives an RDMA Write message forwarded by the switching module from the horizontal scaling network route, the first adapter (SUE Adapter) splits the RDMA Write message into multiple second target data packets (SUE PDUs) and generates a memory request for each second target data packet (SUE PDU), so that the data can be written to the memory addresses of the target GPU one by one through the vertical scaling (scale-up) network.
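The unpacking direction can be sketched in the same spirit. The fixed `pdu_size` of 64 bytes and the function name are assumptions made for illustration; the application does not specify how large each second target data packet is.

```python
def unpack_rdma_write(base_uva, payload, pdu_size=64):
    """Split one RDMA Write payload into per-PDU Store requests.

    Returns a list of (target_address, chunk) pairs, one per PDU, in the
    order in which they would be written to the target GPU memory.
    """
    if pdu_size <= 0:
        raise ValueError("pdu_size must be positive")
    stores = []
    for offset in range(0, len(payload), pdu_size):
        chunk = payload[offset:offset + pdu_size]
        stores.append((base_uva + offset, chunk))
    return stores
```

Concatenating the chunks in order reproduces the original payload, which is the property the receiving side relies on when writing the data one request at a time.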
[0099] The second adapter sends and receives Ethernet packets at its underlying layer. Ordering and Reliability (through ACK and retransmission mechanisms) are implemented at the Transport layer, and Payload encapsulation and decapsulation are implemented at the Packing layer.
[0100] In some optional embodiments, the second adapter is specifically configured to receive an RDMA read request sent by a semantic switching chip in another computer system, verify the legality of the read / write permissions and address carried in the RDMA read request, and transmit the RDMA read request to the switching module after the verification is successful; and upon receiving an RDMA write request sent by a semantic switching chip in another computer system, transmit the RDMA write request to the switching module.
[0101] The second adapter mainly performs standard operations in the RDMA network, including RDMA Send and RDMA Write operations.
[0102] Specifically, when the second adapter receives an RDMA write message from a semantic exchange chip in another computer system via the horizontal scaling network (see the counter-clockwise data flow indicator line in Figure 3), it sends the RDMA write message to the switching module. The switching module schedules the RDMA write message to the first adapter in another port of the semantic switching chip. The first adapter converts the RDMA write request into a memory request and executes the memory request to write the data into the GPU's memory through the vertically extended network.
[0103] When the second adapter receives an RDMA read request message from a semantic exchange chip in another computer system via the horizontal scaling network, it first verifies the validity of the permission (R_Key) and the address range carried in the request. After successful verification, the second adapter sends the RDMA read request message to the switching module. The switching module then sends the RDMA read request message to the first adapter on another port of the semantic exchange chip. The first adapter converts the RDMA read request message into a corresponding memory request, such as a Load operation, and issues that Load operation to the local vertical scaling (scale-up) network to read the corresponding data from the target GPU memory. After reading, the first adapter encapsulates the data into an RDMA Read Response message and sends the RDMA Read Response message to the switching module. The switching module schedules the RDMA Read Response message to the corresponding first adapter, and the first adapter returns the message to the requesting semantic exchange chip through the original QP connection.
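The read path above (verify R_Key and address range, then issue a Load) can be sketched as follows. This is a simplified software model under stated assumptions: `MemoryRegion`, `ReadRejected`, and `handle_rdma_read` are hypothetical names, and a `bytearray` stands in for GPU memory reached over the scale-up network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRegion:
    base: int     # first address covered by the registered region
    length: int   # size of the region in bytes
    r_key: int    # remote access key issued at registration


class ReadRejected(Exception):
    """Raised when the R_Key or address range check fails."""


def handle_rdma_read(mr, gpu_memory, r_key, addr, length):
    """Validate an RDMA read request, then perform the stand-in Load."""
    if r_key != mr.r_key:
        raise ReadRejected("R_Key mismatch")
    if length < 0 or addr < mr.base or addr + length > mr.base + mr.length:
        raise ReadRejected("address range outside the registered region")
    offset = addr - mr.base
    # Stand-in for the Load issued over the scale-up network.
    return bytes(gpu_memory[offset:offset + length])
```

The returned bytes correspond to the payload that would be encapsulated in the RDMA Read Response message.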
[0104] In some optional embodiments, the semantic exchange chip further includes a processing core, or the semantic exchange chip reuses an off-chip processing core; the processing core is used to establish the intra-system topology and the inter-system RDMA channels.
[0105] The semantic exchange chip can include a processing core, or it can use an off-chip processing core. This processing core can pre-establish the topology within the computer system and the RDMA channels between computer systems. In other words, the semantic exchange chip assists the GPU in topology discovery and, after the topology is established, establishes RDMA connections between semantic exchange chips across nodes. The establishment and maintenance of RDMA QP connections are handled by the processing core within the semantic exchange chip or by the off-chip processing core. This offloads the cross-node awareness work, previously handled by the GPU / CPU, to the semantic exchange chip, reducing resource consumption on the GPU.
[0106] For ease of explanation, establishing the intra-node topology in traditional technology proceeds as follows: during NCCL initialization, the PCIe bus topology within the node is enumerated by reading the operating system's /sys file system (sysfs), identifying all GPUs, NICs, PCIe switches, and NVSwitches. The discovery results are serialized into an XML topology tree, which records attributes such as the PCI Bus ID, device type, link speed, and NVLink connection target of each device. Establishing the inter-node topology includes: 1. Rendezvous handshake: Rank0 calls ncclGetUniqueId() to generate a globally unique identifier (containing Rank0's TCP listening address), which is broadcast to all processes participating in training via MPI or Socket. 2. Bootstrap AllGather: the Ranks are interconnected in a Socket Ring and perform an AllGather operation to aggregate metadata such as the host_hash (hostname hash), local Rank information, and GPU device information of each Rank, so that each Rank knows the global node distribution of the entire training cluster (which Ranks are on the same node and which are on different nodes). 3. Intra-node XML sharing: multiple Ranks within the same node exchange their local XML topology trees via shared memory (/dev/shm), merging them into a complete node-level topology graph. 4. Path calculation and graph search: based on the merged topology graph, NCCL calls ncclTopoComputePaths() to calculate the communication bandwidth matrix between all Rank pairs, then calls ncclTopoCompute() to search for the optimal Ring / Tree communication graph, ultimately establishing RDMA QP connections (for cross-node communication).
[0107] In this application, the steps for establishing the topology between nodes are offloaded from the GPU to the semantic exchange chip. In some optional embodiments, the processing core is used to: receive the intra-computer-system topology information of the current computer system, uploaded when the graphics processor driver is loaded, and store this topology information, which is used by the vertical expansion network for data transmission within the computer system; receive the communication addresses broadcast by the semantic exchange chips in other computer systems, and generate global topology information based on those communication addresses; establish RDMA channels and graphics processor memory address mappings between computer systems based on the global topology information, and write the RDMA channels and mapping relationships into the memory of each graphics processor through the vertical expansion network routing.
[0108] The establishment of the local topology within a node includes: after the semantic exchange chip is powered on and initialized, it collects the hardware topology information of the node through the following means: when the GPU driver is loaded, it actively reports the topology information of the node, that is, the topology information within the computer system, such as XML topology information, to the semantic exchange chip; the semantic exchange chip parses the topology information and maintains a "node topology table" locally, recording the information of each GPU: GPU number and its rank in the communication group, uniform virtual address (UVA) segment base address and length, GPU Direct RDMA (GDR) support capability, and port number and corresponding bandwidth of the vertical scaling network.
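As an illustration of the "node topology table" described above, the following sketch models one record per GPU with the fields the paragraph lists. The field names, the `build_node_topology_table` helper, and the overlap check are assumptions added for illustration; the application does not define a concrete record layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuEntry:
    gpu_id: int             # GPU number within the node
    rank: int               # rank in the communication group
    uva_base: int           # base address of the GPU's UVA segment
    uva_length: int         # length of the UVA segment in bytes
    gdr_supported: bool     # GPU Direct RDMA (GDR) capability
    scale_up_port: int      # port number on the scale-up network
    bandwidth_gbps: float   # bandwidth of that port


def build_node_topology_table(entries):
    """Index reported GPU entries by GPU number, rejecting conflicts."""
    table = {}
    for entry in entries:
        if entry.gpu_id in table:
            raise ValueError(f"duplicate GPU {entry.gpu_id}")
        table[entry.gpu_id] = entry
    # Overlapping UVA segments would make address-based routing ambiguous.
    ordered = sorted(table.values(), key=lambda e: e.uva_base)
    for prev, cur in zip(ordered, ordered[1:]):
        if prev.uva_base + prev.uva_length > cur.uva_base:
            raise ValueError("overlapping UVA segments")
    return table
```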
[0109] The establishment of an RDMA channel involves: first, performing a handshake to generate global topology information, and then establishing an RDMA channel based on the global topology information.
[0110] In some optional embodiments, receiving the communication addresses broadcast by the semantic exchange chips in other computer systems and generating global topology information based on the communication addresses broadcast by the semantic exchange chips in other computer systems includes: each semantic exchange chip determining a Rendezvous endpoint on the horizontal expansion network and receiving the communication addresses corresponding to each semantic exchange chip issued by the scheduling system; broadcasting the communication addresses through the horizontal expansion network and receiving the communication addresses broadcast by the semantic exchange chips in other computer systems; generating global topology information based on the communication addresses broadcast by the semantic exchange chips in other computer systems, wherein the global topology information includes a computer system identifier, a destination graphics processor identifier, and a communication address corresponding to the destination semantic exchange chip.
[0111] Here, the handshake and metadata collection phases correspond to the Bootstrap AllGather of NCCL, but the executing entity changes from the CPU / GPU to the semantic exchange chip (SSNIC).
[0112] For the Rendezvous service: the SSNIC of each node in the cluster listens on a fixed TCP / RDMA port on the scale-out network, serving as the node's Rendezvous endpoint. When the training task starts, the scheduling system (SLURM / K8s) distributes the addresses of all semantic switching chips (SSNICs), such as a list of IP addresses, to each SSNIC (or broadcasts the address of a Leader SSNIC as a rendezvous point through a centralized controller).
[0113] For metadata AllGather: the SSNIC of each node performs an AllGather operation over the scale-out network (RDMA message semantics), broadcasting the node's metadata packet to all other SSNICs in the cluster.
[0114] After AllGather is completed, global topology information is generated. Each SSNIC now holds the metadata of all nodes in the cluster and builds a global topology table from it. This table is a two-dimensional mapping: it stores the mapping relationship between UVA address ranges and (node ID, target GPU Rank, target SSNIC RDMA endpoint). This mapping table is the routing basis for subsequent Load / Store semantic conversion.
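The metadata AllGather and table construction can be sketched as below. The exchange is simulated in a single process, and the metadata layout (`node_id`, `endpoint`, and a list of `gpus` with `rank`, `uva_base`, `uva_len`) is a hypothetical format chosen for illustration, not the packet format used by the SSNIC.

```python
def all_gather(local_metadata):
    """Simulated AllGather: every node ends with every node's metadata."""
    gathered = list(local_metadata.values())
    return {node: list(gathered) for node in local_metadata}


def build_global_table(gathered):
    """Build rows of (uva_base, uva_len, node_id, gpu_rank, endpoint)."""
    rows = []
    for meta in gathered:
        for gpu in meta["gpus"]:
            rows.append((gpu["uva_base"], gpu["uva_len"],
                         meta["node_id"], gpu["rank"], meta["endpoint"]))
    # Sorting by base address gives every node an identical table.
    return sorted(rows)
```

Because each node sorts the same gathered rows, all SSNICs arrive at the same global topology table without further coordination.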
[0115] Subsequently, communication links can be established based on the global topology table. This allows each GPU to use a unified virtual address (UVA) mechanism, enabling the system to determine which node the current Load / Store request should be forwarded to via the virtual address (for example, GPU0 address 0x1000 of node A is mapped to 0xA00000001000 in the unified virtual address space, and GPU1 address 0x2000 of node B is mapped to 0xB00000002000 in the unified virtual address space).
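A minimal sketch of the address-based routing lookup follows, using the two example addresses from the paragraph above. The segment length of 4 GiB, the rank values, and the endpoint strings are hypothetical; only the example UVA values come from the text.

```python
import bisect


class GlobalTopology:
    """Maps a UVA to (node ID, GPU rank, SSNIC endpoint, local offset)."""

    def __init__(self, segments):
        # segments: iterable of (uva_base, length, node_id, rank, endpoint)
        self._segments = sorted(segments)
        self._bases = [seg[0] for seg in self._segments]

    def lookup(self, uva):
        # Find the last segment whose base is <= uva.
        index = bisect.bisect_right(self._bases, uva) - 1
        if index < 0:
            raise KeyError(hex(uva))
        base, length, node_id, rank, endpoint = self._segments[index]
        if uva >= base + length:
            raise KeyError(hex(uva))
        return node_id, rank, endpoint, uva - base


topology = GlobalTopology([
    (0xA00000000000, 1 << 32, "A", 0, "ssnic-a"),  # node A, GPU0
    (0xB00000000000, 1 << 32, "B", 1, "ssnic-b"),  # node B, GPU1
])
```

With this table, `topology.lookup(0xA00000001000)` resolves to node A with local offset 0x1000, and `topology.lookup(0xB00000002000)` resolves to node B with local offset 0x2000, matching the example mapping.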
[0116] Optionally, the step of establishing an RDMA channel and graphics processor memory address mapping between computer systems based on the global topology information, and writing the RDMA channel and mapping relationship into the memory of each graphics processor through the vertical expansion network routing, includes: establishing an RDMA communication channel between the semantic exchange chip of different computer systems through the horizontal expansion network based on the global topology information; establishing a mapping relationship between the graphics processor memory address and the memory address of the RDMA communication channel based on the global topology information, and obtaining read and write permissions for the memory address of the RDMA communication channel; and writing the RDMA communication channel, the mapping relationship, and the read and write permissions for the memory address of the RDMA communication channel into the memory of the graphics processor through the vertical expansion network routing based on the topology information within the computer system.
[0117] The establishment of an RDMA QP connection involves establishing an RDMA QP connection between the semantic exchange chip SSNIC and the SSNIC of the remote end (another computer system, also known as a node) that needs to communicate, through a horizontal scaling-out network. The QP connection can be established in RC (Reliable Connection) mode, where the local semantic exchange chip SSNIC CPU sends a CM (Connection Management) request, and the remote semantic exchange chip SSNIC CPU responds. For a cluster of N nodes, at least one QP is established between each pair of node semantic exchange chip SSNICs. Multiple QPs can be established to support multi-path transmission; the specific number of QPs is configurable.
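The connection count implied by this scheme can be worked out with a short sketch. It assumes one SSNIC per node and a full mesh between nodes (an assumption for illustration; the embodiments also allow several SSNICs per computer system), with `qps_per_pair` as the configurable number of QPs.

```python
def qp_counts(num_nodes, qps_per_pair=1):
    """Count QP connections for a full mesh of one SSNIC per node."""
    if num_nodes < 1 or qps_per_pair < 1:
        raise ValueError("num_nodes and qps_per_pair must be >= 1")
    pairs = num_nodes * (num_nodes - 1) // 2
    return {
        "node_pairs": pairs,
        "total_connections": pairs * qps_per_pair,
        "connections_per_chip": (num_nodes - 1) * qps_per_pair,
    }
```

For example, 8 nodes with 2 QPs per pair give 28 node pairs, 56 connections in total, and 14 connections terminating on each chip.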
[0118] Registering virtual addresses to the memory addresses of the RDMA communication channel establishes a mapping between the graphics processor's memory addresses and the RDMA communication channel's memory addresses. This mapping is achieved through virtual addresses, where each graphics processor memory address corresponds to a virtual address. Therefore, only the mapping between virtual addresses and RDMA communication channel memory addresses needs to be established. The semantic exchange chip SSNIC registers the virtual addresses (UVA) of each GPU in the global topology table as RDMA Memory Regions (MRs) for the RDMA communication channel, obtaining the corresponding read / write permissions (L_Key and R_Key) for subsequent RDMA read / write operations. It should be noted that this step completely offloads work that originally occupied the GPU's Streaming Multiprocessors (SMs) to the semantic exchange chip SSNIC, releasing GPU SM resources.
[0119] Local GPU UVA Mapping Notification: The semantic exchange chip SSNIC writes the established QP connection information and the mapping table between remote virtual address UVA and RDMA endpoints to the driver cache of each local GPU through vertical scaling network scale-up. This allows the GPU to issue Load / Store requests without being aware of the underlying RDMA details, with the semantic exchange SSNIC responsible for complete semantic translation and forwarding.
[0120] In the above embodiments, the differences between interconnect networks are shielded at the hardware level. The collective communication library software does not need to be aware of the implementation details of the underlying interconnect network, which greatly simplifies the collective communication library software and makes programming and use easier for users. At the same time, the semantic exchange architecture also offloads the GPU computing resources originally used for communication, reducing the hardware cost of the training cluster.
[0121] As shown in Figure 4, which is a structural block diagram of a computer system in one embodiment, the computer system includes: at least one central processing unit (CPU); at least two graphics processors (GPUs), each GPU being connected to a node CPU and interconnected via a vertically extended network route; and at least two semantic switching chips as described in any of the above embodiments, each semantic switching chip being connected to one of the vertically extended network routes, the semantic switching chips being used to convert between memory requests in the vertically extended network and RDMA read / write requests in the horizontally extended network.
[0122] As shown in Figure 4, the computer system includes two central processing units (CPUs), each CPU corresponding to four graphics processing units (GPUs), which are interconnected via a vertically extended network route.
[0123] Each graphics processing unit (GPU) can correspond to a semantic switching chip. The semantic switching chip is connected to the vertical scaling network route and realizes the conversion between memory requests of the vertical scaling network and RDMA read / write requests of the horizontal scaling network. For specific conversion methods, please refer to the above.
[0124] For memory requests within a node, they are handled directly through vertically expanded network routing. The memory of each GPU and CPU within a node (computer system) can be shared through this vertically expanded network routing.
[0125] For memory requests between nodes, they are routed to the semantic exchange chip. The semantic exchange chip converts the received memory request into an RDMA read / write request. For RDMA read / write requests received from the horizontal scaling network, they are converted into memory requests. After the request is converted, the target semantic exchange chip is determined based on the address corresponding to the request, so that the request is routed to the target semantic exchange chip. For the converted RDMA read / write requests, they are routed to the target semantic exchange chip in other nodes through the horizontal scaling network.
[0126] In this way, the software does not need to be aware of the underlying scale-out network and can be programmed uniformly using Load / Store semantics, which simplifies the implementation of the NCCL collective communication library. It also saves SM resources on the GPU computing card: RDMA communication is offloaded to the semantic exchange chip SSNIC, so that the SM resources of the GPU computing card are used only for computing work.
[0127] In some alternative embodiments, the graphics processor is specifically used to call a communication library through a user program and, based on the communication library, to call API interfaces provided by the driver through a shared memory library to process memory requests.
[0128] Figure 5 is a schematic diagram of the graphics processor software call hierarchy in one embodiment. In this embodiment, the communication library is NCCL and the driver is CUDA, which are used as examples for illustration. In other embodiments, other communication libraries and drivers can be used, and no specific limitation is made here.
[0129] Graphics processing units (GPUs) store AI applications developed and deployed by users based on AI programming frameworks such as PyTorch or TensorFlow. When these applications are executed, the GPU computing card sees only the vertical (scale-up) interconnect network during the training process. The simplified NCCL collective communication library can directly call the API interface provided by the CUDA driver through the shared memory library.
[0130] In one exemplary embodiment, as shown in Figure 6, a semantic exchange method is provided. Taking the application of this method to the semantic exchange chip in Figure 1 as an example, the method includes the following steps:
[0131] S602: Receive and parse the memory request transmitted from the vertically expanded network route, convert the memory request into an RDMA read / write request, and then send it to the horizontally expanded network route;
[0132] S604: Receive and parse the RDMA read / write request transmitted from the horizontal expansion network route, convert the RDMA read / write request into a memory request, and then send it to the vertical expansion network route.
[0133] In some optional embodiments, receiving and parsing the memory request transmitted from the vertical scaling network route, converting the memory request into an RDMA read / write request, and sending it to the horizontal scaling network route includes: receiving and parsing at least one memory request transmitted from the vertical scaling network route through a first adapter; merging and packaging the at least one memory request into a first target data packet through the first adapter; encapsulating the first target data packet into an RDMA read / write request through the first adapter, and transmitting the RDMA read / write request to a second adapter via a switching module; and executing the RDMA read / write request through the second adapter to send it to the horizontal scaling network route.
[0134] In some optional embodiments, receiving and parsing the RDMA read / write request transmitted from the horizontal scaling network route, and converting the RDMA read / write request into a memory request, includes: receiving the RDMA read / write request transmitted from the semantic switching chip in another computer system from the horizontal scaling network route via a second adapter; verifying the legality of the read / write permissions and address carried in the write request; and transmitting the RDMA read / write request to the first adapter via the switching module after successful verification; receiving the RDMA read / write request transmitted from the switching module via the first adapter; splitting the RDMA read / write request into multiple second data packets via the first adapter; generating a memory request for each second data packet via the first adapter, and sending it to the vertical scaling network route.
[0135] In some optional embodiments, the method further includes: establishing intra-system topology and inter-system RDMA channels.
[0136] In some optional embodiments, establishing the intra-system topology and inter-system RDMA channels includes: receiving the intra-system topology information of the current computer system, uploaded when the graphics processor driver is loaded, and storing this topology information, which is used by the vertical expansion network for data transmission within the system; receiving communication addresses broadcast by the semantic exchange chips in other computer systems, and generating global topology information based on those communication addresses; establishing inter-system RDMA channels and graphics processor memory address mappings based on the global topology information, and writing the RDMA channels and mapping relationships into the memory of each graphics processor through the vertical expansion network routing.
[0137] In some optional embodiments, receiving the communication address broadcast by the semantic exchange chip in other computer systems and generating global topology information based on the communication address broadcast by the semantic exchange chip in other computer systems includes: determining the Rendezvous endpoint on the horizontal expansion network and receiving the communication address corresponding to each semantic exchange chip issued by the scheduling system; broadcasting the communication address through the horizontal expansion network and receiving the communication address broadcast by the semantic exchange chip in other computer systems; generating global topology information based on the communication address broadcast by the semantic exchange chip in other computer systems, wherein the global topology information includes a computer system identifier, a destination graphics processor identifier, and a communication address corresponding to the destination semantic exchange chip.
[0138] In some optional embodiments, the step of establishing an RDMA channel and graphics processor memory address mapping between computer systems based on the global topology information, and writing the RDMA channel and mapping relationship into the memory of each graphics processor through the vertically extended network routing, includes: establishing an RDMA communication channel between the semantic exchange chip of different computer systems through the horizontally extended network routing based on the global topology information; establishing a mapping relationship between the graphics processor memory address and the memory address of the RDMA communication channel based on the global topology information, and obtaining read and write permissions for the memory address of the RDMA communication channel; and writing the RDMA communication channel, the mapping relationship, and the read and write permissions for the memory address of the RDMA communication channel into the memory of the graphics processor through the vertically extended network routing based on the topology information within the computer system.
[0139] The limitations of each step involved in the above method can be found above, and will not be repeated here.
[0140] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages in other steps. It is understood that the steps in different embodiments can be freely combined as needed, and all non-contradictory solutions formed by such combinations are within the scope of protection of this application.
[0141] In one exemplary embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 7. The computer device includes a processor, memory, input / output (I / O) interfaces, and a communication interface. The processor, memory, and I / O interfaces are connected via a system bus, and the communication interface is also connected to the system bus via the I / O interfaces. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The database stores the data involved in the aforementioned methods. The I / O interfaces are used for exchanging information between the processor and external devices. The communication interface is used for communicating with external terminals via a network connection. When the computer program is executed by the processor, it implements a semantic exchange method.
[0142] Those skilled in the art will understand that the structure shown in Figure 7 is merely a block diagram of the portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, combine certain components, or have different component arrangements.
[0143] In one embodiment, a network interface card is also provided, including a chip as described in any of the above embodiments and multiple interfaces, through which the chip processes data or communicates externally.
[0144] In one embodiment, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above method embodiments.
[0145] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the steps in the above method embodiments.
[0146] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above method embodiments.
[0147] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile memory and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, artificial intelligence (AI) processors, etc., and are not limited to these.
[0148] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this application.
[0149] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A semantic switching chip, characterized in that the semantic switching chip includes: at least two ports, each of which is used to implement conversion between memory requests of the vertically scaling network and RDMA read/write requests of the horizontally scaling network; and a switching module that communicates with each of the ports and is used to schedule RDMA read/write requests received from the horizontally scaling network at one port to another port, where they are converted into memory requests to be transmitted to the vertically scaling network, and to schedule RDMA read/write requests obtained at the other port from memory requests of the vertically scaling network to the one port, where the RDMA read/write requests are executed so as to transmit them to the corresponding ports of other semantic switching chips in the horizontally scaling network.
2. The semantic switching chip according to claim 1, characterized in that the port includes: a first adapter, used to receive and parse memory requests transmitted via the vertically extended network route, convert the memory requests into RDMA read/write requests, and send the converted RDMA read/write requests to the switching module, and further used to receive RDMA read/write requests scheduled by the switching module and convert the RDMA read/write requests into at least one independent memory request; and a second adapter, configured to execute the RDMA read/write request so as to send it via the horizontal scaling network to the semantic switching chip in another computer system, to receive and parse the RDMA read/write request sent over the horizontal scaling network by the semantic switching chip in the other computer system, and to transmit that RDMA read/write request to the switching module.
3. The semantic switching chip according to claim 2, characterized in that the first adapter is specifically used to: receive and parse at least one memory request sent by the vertically extended network route, merge and package the at least one memory request to obtain a first target data packet, and encapsulate the first target data packet into an RDMA read/write request; and receive RDMA read/write requests scheduled by the switching module, convert the RDMA read/write requests into multiple second data packets, and generate a memory request for each second data packet.
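By way of illustration only, and without limiting the claim, the merge-and-encapsulate and split-and-regenerate behaviour of the first adapter can be sketched in software as follows. The class names, the 64-byte memory request payload, and the requirement that merged requests be address-contiguous are all assumptions made for this sketch; the application does not specify them.

```python
from dataclasses import dataclass
from typing import List

MAX_MEM_PAYLOAD = 64  # assumed maximum payload of one memory request, in bytes


@dataclass
class MemoryRequest:
    """Hypothetical memory write request on the vertically scaled network."""
    address: int
    data: bytes


@dataclass
class RdmaWriteRequest:
    """Hypothetical RDMA write request on the horizontally scaled network."""
    remote_address: int
    payload: bytes


def merge_to_rdma(requests: List[MemoryRequest]) -> RdmaWriteRequest:
    """Merge address-contiguous memory requests into a single RDMA write."""
    if not requests:
        raise ValueError("at least one memory request is required")
    ordered = sorted(requests, key=lambda r: r.address)
    expected = ordered[0].address
    for req in ordered:
        # Assumption of this sketch: only contiguous requests are merged.
        if req.address != expected:
            raise ValueError("requests are not address-contiguous")
        expected += len(req.data)
    payload = b"".join(r.data for r in ordered)  # the "first target data packet"
    return RdmaWriteRequest(ordered[0].address, payload)


def split_to_memory_requests(
    rdma: RdmaWriteRequest, max_payload: int = MAX_MEM_PAYLOAD
) -> List[MemoryRequest]:
    """Split an RDMA write into packets and emit one memory request per packet."""
    return [
        MemoryRequest(rdma.remote_address + off, rdma.payload[off:off + max_payload])
        for off in range(0, len(rdma.payload), max_payload)
    ]
```

In this sketch, splitting the result of a merge with the same payload size reproduces the original memory requests, which is the round-trip property the two conversion directions are expected to have.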
4. The semantic switching chip according to claim 2, characterized in that the second adapter is specifically used to: receive an RDMA read request sent by a semantic switching chip in another computer system, verify the legality of the read/write permissions and the address carried in the RDMA read request, and transmit the RDMA read request to the switching module after the verification succeeds; and, upon receiving an RDMA write request from a semantic switching chip in another computer system, transmit the RDMA write request to the switching module.
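By way of illustration only, the legality check on permissions and address can be pictured as a lookup against a table of registered memory regions. The region table, the `key` field, and every name below are assumptions of this sketch and are not taken from the application; which request types are subjected to the check is governed by the claims, not by this example.

```python
from dataclasses import dataclass
from enum import Flag, auto
from typing import Dict


class Access(Flag):
    READ = auto()
    WRITE = auto()


@dataclass(frozen=True)
class Region:
    """Hypothetical registered memory region."""
    base: int
    length: int
    access: Access


@dataclass(frozen=True)
class RdmaRequest:
    """Hypothetical incoming RDMA request."""
    op: Access     # Access.READ or Access.WRITE
    key: int       # assumed handle identifying the target region
    address: int
    length: int


def is_legal(req: RdmaRequest, regions: Dict[int, Region]) -> bool:
    """Return True only if the request's permission and address range are valid."""
    region = regions.get(req.key)
    if region is None:
        return False                      # unknown region handle
    if req.op not in region.access:
        return False                      # operation not permitted on this region
    if req.length <= 0:
        return False
    # The whole accessed range must lie inside the registered region.
    return (region.base <= req.address
            and req.address + req.length <= region.base + region.length)
```

A request that passes this check would then be handed to the switching module; a request that fails would be rejected at the port without reaching the vertically scaled network.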
5. The semantic switching chip according to claim 1, characterized in that the semantic switching chip further includes a processing core, or reuses an off-chip processing core, the processing core being used to establish the intra-system topology and the inter-system RDMA channel.
6. The semantic switching chip according to claim 5, characterized in that the processing core is used to: receive and store the computer system topology information of the current computer system, which is uploaded when the graphics processor driver is loaded, the computer system topology information being used by the vertically extended network to perform data transmission within the computer system; receive the communication addresses broadcast by the semantic switching chips in other computer systems, and generate global topology information based on those communication addresses; and, based on the global topology information, establish an RDMA channel between computer systems and a graphics processor memory address mapping, and write the RDMA channel and the mapping relationship into the memory of each graphics processor through the vertically extended network routing.
7. The semantic switching chip according to claim 6, characterized in that the step of receiving the communication address broadcast by the semantic switching chip in other computer systems and generating global topology information based on that communication address includes: each of the semantic switching chips determining the Rendezvous endpoint on the horizontally extended network and receiving the communication address corresponding to each of the semantic switching chips issued by the scheduling system; broadcasting the communication address through the horizontally extended network, and receiving the communication addresses broadcast by the semantic switching chips in other computer systems; and generating, based on the communication addresses broadcast by the semantic switching chips in other computer systems, global topology information that includes the computer system identifier, the target graphics processor identifier, and the communication address corresponding to the target semantic switching chip.
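By way of illustration only, the generation of global topology information from broadcast communication addresses can be sketched as building a lookup table keyed by computer system identifier and graphics processor identifier. The `Announcement` structure, the assumption that each broadcast also lists the graphics processors behind the chip, and the address format are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Announcement:
    """Hypothetical broadcast sent by one semantic switching chip."""
    system_id: str              # computer system identifier
    gpu_ids: Tuple[int, ...]    # graphics processors reachable through the chip
    chip_address: str           # communication address issued by the scheduler


def build_global_topology(
    announcements: Iterable[Announcement],
) -> Dict[Tuple[str, int], str]:
    """Map (system identifier, GPU identifier) to the serving chip's address."""
    topology: Dict[Tuple[str, int], str] = {}
    for ann in announcements:
        for gpu_id in ann.gpu_ids:
            topology[(ann.system_id, gpu_id)] = ann.chip_address
    return topology
```

With such a table, a chip that needs to reach a given graphics processor in another computer system can look up the communication address of the target semantic switching chip directly.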
8. The semantic switching chip according to claim 6, characterized in that the step of establishing RDMA channels and graphics processor memory address mappings between computer systems based on the global topology information, and writing the RDMA channels and mapping relationships into the memory of each graphics processor through the vertically extended network routing, includes: establishing, based on the global topology information and through the horizontally extended network, an RDMA communication channel between the semantic switching chips of different computer systems; establishing, based on the global topology information, a mapping relationship between the graphics processor memory address and the memory address of the RDMA communication channel, and obtaining read/write permissions for the memory address of the RDMA communication channel; and writing, through the vertically extended network routing and based on the computer system topology information, the RDMA communication channel, the mapping relationship, and the read/write permissions of the memory address of the RDMA communication channel into the memory of the graphics processor.
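By way of illustration only, the mapping relationship between graphics processor memory addresses and RDMA communication channel memory addresses, together with the associated read/write permissions, can be sketched as a table of address windows. The window layout, the single `writable` flag, and all names are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChannelMapping:
    """Hypothetical mapping entry written into graphics processor memory."""
    gpu_base: int       # start of the graphics processor memory window
    channel_base: int   # start of the matching RDMA channel memory window
    length: int         # window size in bytes
    writable: bool      # whether write permission was obtained for the window


def to_channel_address(
    mappings: List[ChannelMapping], gpu_address: int, length: int, write: bool
) -> int:
    """Translate a graphics processor address range to its RDMA channel address."""
    for m in mappings:
        offset = gpu_address - m.gpu_base
        if offset >= 0 and offset + length <= m.length:
            if write and not m.writable:
                raise PermissionError("channel window is read-only")
            return m.channel_base + offset
    raise LookupError("no mapping covers the requested range")
```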
9. A computer system, characterized in that, The system includes: At least one central processing unit; At least two graphics processors, each of which is connected to a central processing unit, and the graphics processors are interconnected via a vertically extended network routing; At least two semantic switching chips according to any one of claims 1 to 8, each of the semantic switching chips being attached to a vertically extended network route, the semantic switching chip being used to implement the conversion between memory requests of the vertically extended network and RDMA read / write requests of the horizontally extended network.
10. The system according to claim 9, characterized in that the graphics processor is specifically used to call the communication library through the user program and, based on the communication library, to process memory requests by calling, through the shared memory library, the API provided by the driver.
11. A semantic switching method, characterized in that the method is applied to a semantic switching chip and includes: receiving and parsing a memory request transmitted from the vertically expanded network route, converting the memory request into an RDMA read/write request, and sending the RDMA read/write request to the horizontally expanded network route; and receiving and parsing an RDMA read/write request transmitted from the horizontal scaling network route, converting the RDMA read/write request into a memory request, and sending the memory request to the vertical scaling network route.
12. The method according to claim 11, characterized in that, The process of receiving and parsing memory requests transmitted from the vertically scaling network route, converting the memory requests into RDMA read / write requests, and then sending them to the horizontally scaling network route includes: The first adapter receives and parses at least one memory request routed from the vertically extended network. At least one of the memory requests is merged and packaged into a first target data packet using a first adapter; The first target data packet is encapsulated into an RDMA read / write request by the first adapter, and the RDMA read / write request is transmitted to the second adapter via the switching module; The RDMA read / write request is executed via the second adapter to be sent to the horizontally extended network route.
13. The method according to claim 11, characterized in that, The process of receiving and parsing RDMA read / write requests transmitted via horizontal scaling network routing, and converting the RDMA read / write requests into memory requests, includes: The second adapter receives RDMA read / write requests from semantic switching chips in other computer systems via the horizontally extended network routing, verifies the legality of the read / write permissions and addresses carried in the write request, and transmits the RDMA read / write request to the first adapter via the switching module after the verification is successful. The first adapter receives RDMA read / write requests transmitted from the switching module. The RDMA read / write request is split into multiple second data packets via the first adapter; The first adapter generates a memory request for each of the second packets and sends it to the vertically extended network route.
14. The method according to any one of claims 11 to 13, characterized in that the method further includes: establishing the topology within the computer system and the RDMA channel between computer systems.
15. The method according to claim 14, characterized in that establishing the intra-system topology and the inter-system RDMA channel includes: receiving and storing the computer system topology information of the current computer system, which is uploaded when the graphics processor driver is loaded, the computer system topology information being used by the vertically extended network to perform data transmission within the computer system; receiving the communication addresses broadcast by the semantic switching chips in other computer systems, and generating global topology information based on those communication addresses; and, based on the global topology information, establishing an RDMA channel between computer systems and a graphics processor memory address mapping, and writing the RDMA channel and the mapping relationship into the memory of each graphics processor through the vertically extended network routing.
16. The method according to claim 15, characterized in that the step of receiving the communication address broadcast by the semantic switching chip in other computer systems and generating global topology information based on that communication address includes: determining the Rendezvous endpoint on the horizontally extended network and receiving the communication address corresponding to the semantic switching chip issued by the scheduling system; broadcasting the communication address through the horizontally extended network, and receiving the communication addresses broadcast by the semantic switching chips in other computer systems; and generating, based on the communication addresses broadcast by the semantic switching chips in other computer systems, global topology information that includes the computer system identifier, the target graphics processor identifier, and the communication address corresponding to the target semantic switching chip.
17. The method according to claim 15, characterized in that the step of establishing RDMA channels and graphics processor memory address mappings between computer systems based on the global topology information, and writing the RDMA channels and mapping relationships into the memory of each graphics processor through the vertically extended network routing, includes: establishing, based on the global topology information and through the horizontally extended network routing, an RDMA communication channel between the semantic switching chips of different computer systems; establishing, based on the global topology information, a mapping relationship between the graphics processor memory address and the memory address of the RDMA communication channel, and obtaining read/write permissions for the memory address of the RDMA communication channel; and writing, through the vertically extended network routing and based on the computer system topology information, the RDMA communication channel, the mapping relationship, and the read/write permissions of the memory address of the RDMA communication channel into the memory of the graphics processor.
18. A network interface card, characterized in that it includes a semantic switching chip as described in any one of claims 1 to 8 and a plurality of interfaces, wherein the chip processes data or communicates externally through the interfaces.
19. A computer device, characterized in that it includes the network interface card as described in claim 18, the network interface card being used for processing data or for external communication.
20. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the steps of the method according to any one of claims 11 to 17.
21. A computer program product, comprising a computer program, characterized in that, When the computer program is executed by a processor, it implements the steps of the method according to any one of claims 11 to 17.