Network-connected MPI processing architecture in SmartNICs

The MPI shell in SmartNICs addresses data copying overhead by enabling direct data processing, improving performance and efficiency in MPI applications by integrating with existing libraries and facilitating network-centric data centers.

JP7874110B2 · Active · Publication Date: 2026-06-15 · XILINX INC

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Patents
Current Assignee / Owner: XILINX INC
Filing Date: 2022-03-16
Publication Date: 2026-06-15

Application Information

IPC: G06F13/12; G06F13/28; G06F13/38

CPC: G06F9/546; Y02D10/00; H04L61/2525; H04L69/22

AI Tagging

Application Domain

Electric digital data processing

Abstract

Embodiments herein describe an interface shell in a SmartNIC that reduces data copy overhead in CPU-centric solutions that rely on hardware compute engines (which may include one or more accelerators). The interface shell offloads tag matching and address translation without CPU involvement. Additionally, the interface shell allows the compute engine to read messages directly from the network without extra data copies, i.e., without first copying the data to the CPU's memory.

Description

[Technical Field]

[0001] The examples in this disclosure generally relate to message passing interface (MPI) shells for smart network interface cards (SmartNICs).

[Background Art]

[0002] The expansion of data and scale-out workloads / applications presents scalability and performance challenges for modern data centers. To achieve low latency, high throughput, and low power consumption for modern applications, data centers often place their computing tasks in distributed and networked configurations. For example, a data center can include multiple nodes connected via a network, with each node in the architecture containing a host with a multi-core central processing unit (CPU) and hardware accelerators in the form of ASICs (Application-Specific Integrated Circuits), FPGAs (Field-Programmable Gate Arrays), or GPUs (Graphics Processing Units).

[0003] MPI is widely deployed in numerous distributed applications across various domains, including scientific computing, genetic computing, and machine learning. For decades, it has been the de facto programming model for developing parallel and distributed computing applications. MPI provides various primitives, such as point-to-point communication and collective and synchronization operations. Data communication (sending / receiving) between nodes takes place over a network. In traditional MPI applications where computation is offloaded to accelerators, data received from the network is first stored in the host's memory and then copied to the accelerator's memory (hereinafter referred to as device memory) via the PCIe (Peripheral Component Interconnect Express) bus for computational acceleration. After computation, the results are usually copied back to the host's memory. The overhead of multiple data copies significantly degrades system performance and results in high latency in this CPU-centric solution. In other words, when receiving a task from another node over the network, the CPU on the receiving node must first process the data and then send it to the memory corresponding to the accelerator within that node. Upon completion, the accelerator returns the processed data to the CPU's memory before the node finally transmits it to the requesting node. Therefore, using accelerators in an MPI environment can generate significant overhead because data is repeatedly copied between the CPU's memory and the accelerator's memory.

[Summary of the Invention]

[0004] One embodiment describes a network interface card (NIC) that includes an MPI shell, which includes a circuit configured to sniff packets received from a network to identify Message Passing Interface (MPI) messages and to transfer the data contained in the MPI messages to a computing circuit for processing without first copying the data contained in the MPI messages to a memory corresponding to a central processing unit (CPU). The CPU is located on the same computing node as the NIC.

[0005] Another embodiment described herein is a NIC that includes a hardware computing circuit and an MPI shell, which includes a circuit configured to sniff packets received from a network to identify an MPI message, transfer the data contained in the MPI message to the computing circuit for processing, and receive instructions from a CPU external to the NIC that instruct the computing circuit to process the data contained in the MPI message.

[0006] Another embodiment described herein is a NIC including an interface shell, which includes a circuit configured to sniff packets received from a network to identify messages corresponding to a distributed computing system in which tasks are transmitted between nodes in the distributed computing system using messages, and to transfer the data contained in the messages to hardware computing circuits for processing without first copying the data contained in the messages to memory corresponding to the CPU. The CPU is located on the same computing node as the NIC.

[0007] So that the above features can be understood in detail, a more specific description, briefly summarized above, is provided with reference to exemplary implementations, some of which are shown in the attached drawings. However, it should be noted that the attached drawings show only typical exemplary implementations and should therefore not be considered limiting in scope.

[Brief Description of the Drawings]

[0008]
[Figure 1] An example of a parallel computing system with a SmartNIC including an MPI shell.
[Figure 2] A block diagram of nodes in a parallel computing system, according to one example.
[Figure 3] A block diagram of an MPI shell, according to one example.
[Figure 4] An example of a packet classifier within the MPI shell.
[Figure 5] An example of a tag matcher within the MPI shell.
[Figure 6] An example of an address converter within the MPI shell.
[Figure 7] An example of a data mover within the MPI shell.
[Figure 8] An example of a compute engine within the MPI shell.
[Figure 9] An example of a data controller within the MPI shell.
[Figure 10A] An example of integrating the MPI shell into different SmartNIC implementation configurations.
[Figure 10B] An example of integrating the MPI shell into different SmartNIC implementation configurations.
[Figure 10C] An example of integrating the MPI shell into different SmartNIC implementation configurations.

[Description of Embodiments]

[0009] Various features are described below with reference to the drawings. Note that the drawings may or may not be drawn to scale, and elements of similar structure or function are represented by the same reference numerals throughout the drawings. Note that the drawings are intended solely to facilitate the description of features. They are not characterized as an exhaustive description of the specification or as a limitation on the claims. In addition, illustrated examples do not necessarily have all the embodiments or advantages shown. Embodiments or advantages described in relation to a particular embodiment are not necessarily limited to that embodiment and may be implemented in any other embodiment even if not illustrated or explicitly described in that way.

[0010] Embodiments of this specification describe an MPI shell in a SmartNIC that reduces data copy overhead in CPU-centric solutions that rely on hardware accelerators. The MPI shell offloads tag matching and address translation without CPU involvement. Furthermore, the MPI shell enables accelerators to read messages directly from the network without extra data copying, i.e., without first copying the data to the CPU's memory. Moreover, the MPI shell enables the MPI programming model to encompass network-centric data center architectures with SmartNICs, and the MPI shell can be seamlessly integrated into existing MPI libraries without major changes to applications. The MPI shell brings data computation as close to the data as possible (e.g., to the compute engine or accelerator on the SmartNIC) to achieve high performance, low latency, and low power consumption.

[0011] In one embodiment, the SmartNIC and compute engine can be designed on a single device, such as an FPGA-based SmartNIC device. This type of data center architecture targets high-speed (40Gbps to 200Gbps) networks and provides improved computing power through its distributed adaptive computing capabilities. Because of its inherent heterogeneity, scalability, and efficiency, this data center architecture is well suited to modern distributed system applications that demand high performance, low latency, and low power consumption.

[0012] Figure 1 shows an example of a computing system 100 having a SmartNIC including an MPI shell. As shown in the figure, the computing system 100 (e.g., a parallel computing system) includes a plurality of nodes 105 interconnected via a network 150 (e.g., a local area network (LAN)). Each node 105 may include a CPU 110 and a SmartNIC 115, but a node 105 may also include multiple CPUs (which may include multiple cores) and multiple SmartNICs 115. In one embodiment, the nodes 105 communicate using MPI, but the embodiments described herein can be extended to include any distributed computing system in which tasks are transmitted between the nodes 105.

[0013] At node 105A, CPU 110 communicates with network 150 and therefore with other nodes 105, relying on SmartNIC 115. SmartNIC 115 includes an MPI shell 120 that enables SmartNIC 115 to "sniff" or "intercept" data transmitted from other nodes 105 in system 100 to node 105A. Instead of storing this data in the memory (e.g., RAM) corresponding to CPU 110, CPU 110 may instruct the MPI shell 120 to process this data using an integrated computing engine 125 (also referred to as a computing circuit, which may include one or more user-defined hardware accelerators). Once processed, CPU 110 can instruct SmartNIC 115 to transmit the processed data to another node 105 using network 150. Thus, the data (both received and processed) never needs to be stored in the memory of CPU 110. Therefore, the data writing / reading process bypasses the CPU and its corresponding memory complex.

[0014] In one embodiment, the computing engine 125 is separate from the SmartNIC 115. In this case, the MPI shell 120 can still directly provide MPI messages to the computing engine 125 for processing, receive the processed data from the computing engine 125, and transfer the processed data to different nodes 105 within the system 100, bypassing the memory complex of the CPU 110. The CPU 110 can control this process using the MPI shell 120, but the MPI messages do not need to flow through the CPU 110 to reach the separate computing engine 125.

[0015] The MPI shell 120 and the compute engine 125 are hardware (e.g., circuitry) within the SmartNIC 115. In one embodiment, the MPI shell 120 and the compute engine 125 are implemented in the programmable logic of the SmartNIC's FPGA. In another embodiment, the MPI shell 120 and the compute engine 125 are implemented in an ASIC or system-on-a-chip (SoC). In that case, the circuitry forming the MPI shell 120 and the compute engine 125 is hardened. In either case, the MPI shell 120 may be implemented in an integrated circuit within the SmartNIC 115, while the compute engine 125 may be implemented in the same integrated circuit on the SmartNIC 115 or a different integrated circuit, or may be implemented separately from the SmartNIC 115.

[0016] Figure 2 is a block diagram of nodes in a parallel computing system, according to one example. In one embodiment, Figure 2 shows the components within node 105 in Figure 1. In this example, node 105 includes software executed by CPU 110, including an MPI application 205, an MPI library 210, and drivers 250. These drivers 250 include a network stack 215, a kernel driver 217, and an MPI shell runtime 220. The MPI application 205 may include any application such as a scientific computing application, a genetic computing application, or a machine learning / artificial intelligence application. The MPI library 210 enables the MPI application 205 to utilize a distributed computing environment (e.g., computing system 100 in Figure 1). The MPI library 210 can enable point-to-point communication, as well as collective and synchronization operations between nodes in the distributed computing environment.

[0017] Driver 250 enables the MPI application 205 and library 210 to communicate with SmartNIC 115. The network stack 215 and kernel driver 217 do not need to be modified or updated to implement the embodiments herein. However, the MPI shell runtime 220 is a new driver 250 that enables the CPU 110 and software running on the CPU 110 (e.g., MPI application 205) to control and communicate with the MPI shell 120 in SmartNIC 115. In one embodiment, the MPI shell runtime 220 is a software library used for device memory management and communication between the CPU 110 and the MPI shell 120 (e.g., controlling the compute engine 125 as described later). For memory management of device memory (i.e., local memory used by the compute engine 125), the MPI shell runtime 220 allocates message buffers physically located in device memory for hardware processes and deallocates the buffers when the hardware processes have completed their lifecycle. This can be implemented with various memory management algorithms such as fixed-size block allocation, buddy memory allocation, and slab allocation. Although the MPI shell runtime 220 is shown as separate from the MPI shell 120, it can be considered part of the MPI shell, with runtime 220 being the software portion of the shell and the hardware portion of the shell located on the SmartNIC 115.

[0018] The SmartNIC 115 includes a SmartNIC Direct Memory Access (DMA) subsystem 225 that interfaces with software running on the CPU 110, and a SmartNIC Media Access Control (MAC) subsystem 230. In the following description, the term "host" generally refers to the CPU 110 in the same node as the SmartNIC 115, and the software running on the CPU 110.

[0019] Focusing on a typical MPI implementation, the MPI standard generally uses two-sided communication involving a sender (e.g., MPI_Send) and a receiver (e.g., MPI_Recv). At the sender, a CPU in the first node prepares a message with headers and data for transmission over a communication channel using a transport protocol such as TCP / IP, RoCE (Remote Direct Memory Access over Converged Ethernet), or iWARP, and transmits the message to the receiver over the communication channel. At the receiver, a CPU in the second node extracts the message headers and data from the communication channel, stores them in a temporary buffer, performs an operation called tag matching to check whether the incoming message matches a receive request posted by the receiver, and copies the message to the destination receive buffer.

[0020] If a node has an MPI shell 120 integrated into a SmartNIC 115, MPI processes can be divided into hardware processes and software processes. A hardware process is one whose hardware compute engine 125 is located in or attached to the SmartNIC, while a software process is a conventional MPI process executed only on the CPU 110. Each process has a unique number, its rank, as its identifier. Any two-sided communication between software and hardware processes falls into one of the following four examples:

[0021] Example A: If the sender and receiver are software processes, the system leverages the conventional MPI communication flow described above without any modification. That is, the communication passes through the MPI shell 120 without being affected by shell 120.

[0022] Example B: When the sender is a software process but the receiver is a hardware process, the system utilizes the conventional MPI send operation described above without any modification. However, on the receiver side, the MPI shell 120 sniffs / filters packets of messages related to the hardware process directly from the SmartNIC MAC subsystem 230 and stores the data in a destination receive buffer located in device memory (i.e., local accelerator or compute engine memory). Tag matching and address translation are offloaded to the MPI shell 120 without CPU involvement (described in more detail below). Once the message has been fully received and stored in device memory, the MPI shell 120 notifies the receiver host (e.g., an MPI application 205 running on CPU 110). When all messages are ready, the host issues a compute command to a specific hardware compute engine 125, with the message address as an argument. The compute engine 125 then reads the message from its device memory, starts the computation, writes the result back to the device memory if applicable, and, upon completion, notifies the host (e.g., the MPI application 205).

[0023] Example C: When the sender is a hardware process but the receiver is a software process, the receive operation at the receiver is the same as the conventional MPI receive operation described above. However, at the sender, the host uses the SmartNIC DMA subsystem 225 to notify the transmission (TX) logic in the SmartNIC 115 of the message address if the message is stored in device memory. The TX logic then reads the message via the data controller in the MPI shell (described in detail in Figures 3 and 9) and sends the data to the remote receiver. If the message is stored in host-side memory, the send operation is the same as the conventional MPI operation.

[0024] Example D: If both the sender and receiver are hardware processes, the receiver will follow the behavior described in Example B. The sender will follow the behavior described in Example C.
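
The four examples above amount to a decision on the pair (sender kind, receiver kind). A minimal sketch of that classification follows; the names ProcessKind and classify are hypothetical and serve only as an illustration, not as part of the described hardware or runtime.

```python
from enum import Enum


class ProcessKind(Enum):
    SOFTWARE = "software"  # conventional MPI process running only on the CPU
    HARDWARE = "hardware"  # process whose compute engine is in or attached to the SmartNIC


def classify(sender: ProcessKind, receiver: ProcessKind) -> str:
    """Return the example label (A to D) for a sender/receiver pair."""
    if sender is ProcessKind.SOFTWARE and receiver is ProcessKind.SOFTWARE:
        return "A"  # conventional MPI flow; traffic passes through the shell unaffected
    if sender is ProcessKind.SOFTWARE and receiver is ProcessKind.HARDWARE:
        return "B"  # shell sniffs packets and stores the data in device memory
    if sender is ProcessKind.HARDWARE and receiver is ProcessKind.SOFTWARE:
        return "C"  # TX logic reads the message from device memory via the data controller
    return "D"  # receiver behaves as in Example B, sender as in Example C
```

For instance, classify(ProcessKind.SOFTWARE, ProcessKind.HARDWARE) selects Example B, the case in which tag matching and address translation are offloaded to the shell on the receiving node.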

[0025] Figure 3 is a block diagram of an MPI shell 120, as an example. In particular, Figure 3 shows the portion of the MPI shell 120 present in the SmartNIC. Although not shown, the MPI shell 120 may include an MPI shell runtime (e.g., a software driver) that runs on the host (e.g., the MPI shell runtime 220 in Figure 2).

[0026] The hardware of the MPI shell 120 includes a data controller 305, a data mover 320, an address converter 325, a tag matcher 330, a packet classifier 335, and a compute engine 125. Each of these hardware elements (e.g., circuits) is described in more detail in the following figures. However, as a brief introduction, the packet classifier 335 filters (or identifies) incoming packets related to MPI messages and generates metadata for those packets. This metadata is then used by downstream hardware elements within the MPI shell 120. Although not shown, the packet classifier 335 can receive packets from the SmartNIC MAC subsystem 230 in Figure 2, which in turn receives packets from the network.

[0027] The tag matcher 330 matches incoming messages from the source process with receive requests posted by the destination process. The address converter 325 calculates the destination address in memory (e.g., local memory in the SmartNIC) for incoming MPI message packets and tracks when the message is ready. The data mover 320 converts packets in the AXIS (Advanced eXtensible Interface Streaming) protocol format into data in the AXI protocol format and issues an interrupt or pull signal to the local host (e.g., local CPU and MPI application) when the MPI message has been fully received. The data controller 305 performs arbitration so that the various hardware elements within the MPI shell 120 can share and access memory in the SmartNIC. The compute engine 125 can perform arbitrary accelerator functions on the data in the MPI message. As described above, the compute engine 125 can be implemented using programmable or hardened logic.

[0028] Figure 4 shows an example of a packet classifier 335 within the MPI shell 120. The packet classifier 335 includes a parser 405 and a matching table 410. In general, the packet classifier 335 filters packets related to MPI communication and generates metadata for those packets. Furthermore, the table configuration within the MPI shell runtime 220 in Figure 2 allows for writing and deleting entries in the matching table 410 within the packet classifier 335.

[0029] Parser 405 extracts information from the incoming packet. This information may include the message header, packet sequence number, payload length, and flow ID. In one embodiment, the message header includes <rank_src, rank_dst, tag, context_id, opcode, message length (msg_len)>, where the opcode is used to identify MPI operations such as send, receive, put, and get operations. The rank_src and rank_dst signals are the unique identifiers of the source process and the destination process, respectively. The flow ID is used to classify packets, and an example flow ID can be formed from <source IP address, destination IP address, protocol, source port, destination port>.

[0030] The matching table 410 receives the flow ID (fid) derived from the parser 405 as input and searches for the MPI communication information of the flow corresponding to the fid. The MPI communication information is defined by rank_src, rank_dst, tag, and context_id. The entries in the matching table 410 are updated or written by the host (e.g., local CPU) when the host and its remote peer have completed their MPI handshake process. The update / write operation can be implemented using the AXI-Lite interface. The matching table 410 can be implemented using hashing, binary / ternary / semi-ternary content-addressable memory (BCAM / TCAM / STCAM), etc.

[0031] The packet classifier 335 outputs metadata_pc, which includes <rank_src, rank_dst, tag, context_id, opcode, msg_len, pkt_seq, payload_len, drop>, where the drop signal is the miss signal from the lookup request to the matching table 410. That is, drop goes high when the matching table 410 cannot find the flow corresponding to the fid received from the parser 405. When the drop signal is high, the corresponding packet is dropped by the SmartNIC.
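
The behavior of the packet classifier can be modeled in software as a flow-ID lookup that either attaches MPI communication information to a packet or raises the drop signal. The following is a minimal sketch under stated assumptions: the matching table is modeled as a Python dictionary rather than a hash or CAM structure, and the class and field names (FlowId, MpiFlowInfo, ParsedPacket, Metadata, PacketClassifier) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, NamedTuple, Optional


class FlowId(NamedTuple):
    """Flow ID built from the 5-tuple given as an example in the text."""
    src_ip: str
    dst_ip: str
    protocol: str
    src_port: int
    dst_port: int


class MpiFlowInfo(NamedTuple):
    """MPI communication information stored per flow in the matching table."""
    rank_src: int
    rank_dst: int
    tag: int
    context_id: int


@dataclass
class ParsedPacket:
    """Fields extracted by the parser (hypothetical, simplified container)."""
    flow_id: FlowId
    pkt_seq: int
    payload_len: int
    opcode: Optional[str] = None   # present only when the packet carries a header
    msg_len: Optional[int] = None  # present only when the packet carries a header


@dataclass
class Metadata:
    """Software stand-in for metadata_pc."""
    flow: Optional[MpiFlowInfo]
    opcode: Optional[str]
    msg_len: Optional[int]
    pkt_seq: int
    payload_len: int
    drop: bool


class PacketClassifier:
    def __init__(self) -> None:
        # Assumption: a dict stands in for the hash / CAM based matching table.
        self.matching_table: Dict[FlowId, MpiFlowInfo] = {}

    def write_entry(self, fid: FlowId, info: MpiFlowInfo) -> None:
        """Host-side write once the MPI handshake with the remote peer is done."""
        self.matching_table[fid] = info

    def delete_entry(self, fid: FlowId) -> None:
        self.matching_table.pop(fid, None)

    def classify(self, pkt: ParsedPacket) -> Metadata:
        info = self.matching_table.get(pkt.flow_id)
        # drop is the miss signal of the lookup: high when no flow matches.
        return Metadata(info, pkt.opcode, pkt.msg_len, pkt.pkt_seq,
                        pkt.payload_len, drop=info is None)
```

A packet whose flow ID has not been registered by the host yields drop set to True, which corresponds to the SmartNIC dropping the packet.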

[0032] Figure 5 shows an example of a tag matcher 330 in the MPI shell 120. The tag matcher 330 matches send operations from a source process with receive requests from a destination process. As shown in Figure 5, the tag matcher 330 includes two lookup tables, namely a post receive (post_recv) matching table 505 and an unexpected message (unexpected_msg) matching table 510. The post_recv matching table 505 stores receive requests from the local host for which no matching entry can be found in the unexpected_msg matching table 510, and the unexpected_msg matching table 510 stores incoming messages from senders for which no matching entry can be found in the post_recv matching table 505. Tables 505 and 510 can be implemented using hashing, trie-based methods, TCAM, and other range search techniques.

[0033] The inputs to the tag matcher 330 are receive requests (recv_req) from the host, metadata_pc, and packets. If a packet is the first segment of a message from the sender, metadata_pc contains the message header, which has <rank_src, rank_dst, context_id, tag>. The outputs of the tag matcher 330 include the packets, a memory allocation request (alloc_req_pr), an address update request (addr_ud_um), and multiple address lookup requests (addr_lp_pr, addr_lp_non_hdr, and addr_lp_um).

[0034] The addr_lp_pr signal indicates that the packet of the target message from the sender arrived after the host posted its corresponding receive request (recv_req). Furthermore, the addr_lp_pr signal indicates that an entry exists in the post_recv matching table for those packets.

[0035] The addr_lp_um signal indicates that a receive request posted by the host arrived after the tag matcher 330 recorded the corresponding message in the unexpected_msg matching table 510.

[0036] The addr_lp_non_hdr signal is used to request the memory location of the subsequent packets of a message from a sender, i.e., packets that do not contain message header information in their payload. This signal includes <rank_src, rank_dst, tag, payload length (payload_len), packet sequence number (pkt_seq)>.

[0037] The alloc_req_pr signal indicates that a packet of the target message from the sender has arrived before the host posts a receive request, and that memory space needs to be allocated to store the unexpected message. This signal includes <rank_src, rank_dst, tag, message length (mlen), packet sequence number (pkt_seq)>.

[0038] The addr_ud_um signal is a receive request containing <rank_src, rank_dst, tag, host-assigned address (addr), mlen>. This signal corresponds to a receive request posted by the host for which no match is found in either the unexpected_msg matching table 510 or the post_recv matching table 505. This signal informs the address converter in the MPI shell (e.g., address converter 325 in Figure 3) of the memory address / space allocated by the host for this receive request.

[0039] Upon receiving a packet from the packet classifier, the tag matcher 330 performs a lookup in the post_recv matching table 505 using metadata_pc if the packet is the first packet of an MPI message. The key (K_pr) of an entry in the post_recv table 505 includes <rank_src, rank_dst, context_id, tag>. If the lookup results in a hit, the corresponding entry is removed from the post_recv table 505 and the tag matcher 330 issues an address lookup request (addr_lp_pr) to the address converter in the MPI shell to obtain the memory location of this packet. Otherwise, the tag matcher 330 updates the unexpected_msg matching table 510 and, because it could not find a match in the post_recv table 505, issues a memory allocation request (alloc_req_pr) for this message to the address converter. If the received packet does not contain a message header (i.e., it is not the first packet of an MPI message), the tag matcher 330 issues an address lookup request (addr_lp_non_hdr) to the address converter to obtain the memory location of this packet.

[0040] An MPI send operation from the sender is paired with an MPI receive operation from the receiver. When the receiver calls the MPI receive operation (MPI_Recv), the host notifies the tag matcher 330 of a receive request (recv_req). The receive request can include <rank_src, rank_dst, context_id, tag, base_addr, mlen>, where base_addr is the base address of the device memory allocated by the host for the message. The tag matcher 330 then extracts a key (K_um) containing <rank_src, rank_dst, context_id, tag> from the receive request and looks up the unexpected_msg matching table 510 to check whether an unexpected message has been received. If the lookup hits, the entry corresponding to the unexpected message is removed from the unexpected_msg table 510, and the tag matcher 330 issues an address lookup request (addr_lp_um) to the address converter. Otherwise, the tag matcher 330 sends an address update request (addr_ud_um) to the address converter to update the base address associated with the message. Since this is an unmatched receive request, the tag matcher 330 writes a new entry to the post_recv matching table 505 to record the receive request.
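
The two-table matching logic in paragraphs [0039] and [0040] can be sketched as follows. This is a simplified software model, not the hardware design: both tables are plain dictionaries that hold at most one entry per key, wildcard matching is omitted, and the class and method names are hypothetical. Each method returns the name of the request that the tag matcher would issue to the address converter.

```python
from typing import Dict, Tuple

# Key used by both tables: (rank_src, rank_dst, context_id, tag)
Key = Tuple[int, int, int, int]


class TagMatcher:
    def __init__(self) -> None:
        # Receive requests posted by the host that found no unexpected message.
        self.post_recv: Dict[Key, Tuple[int, int]] = {}  # key -> (base_addr, mlen)
        # Messages that arrived before a matching receive request was posted.
        self.unexpected_msg: Dict[Key, int] = {}  # key -> mlen

    def on_first_packet(self, key: Key, mlen: int) -> str:
        """First packet of a message (carries the message header)."""
        if key in self.post_recv:
            # Hit: the receive was posted first, so remove the entry
            # and look up the address the host assigned.
            del self.post_recv[key]
            return "addr_lp_pr"
        # Miss: record the unexpected message and request memory for it.
        self.unexpected_msg[key] = mlen
        return "alloc_req_pr"

    def on_non_header_packet(self) -> str:
        """Later packets carry no header; resolve them by sequence range."""
        return "addr_lp_non_hdr"

    def on_recv_req(self, key: Key, base_addr: int, mlen: int) -> str:
        """Receive request (recv_req) posted by the host."""
        if key in self.unexpected_msg:
            # Hit: the message arrived first, so remove the entry.
            del self.unexpected_msg[key]
            return "addr_lp_um"
        # Miss: record the unmatched receive request and publish its address.
        self.post_recv[key] = (base_addr, mlen)
        return "addr_ud_um"
```

The sketch shows why the order of arrival matters: a receive posted before the message ends in addr_lp_pr, while a message that arrives first triggers alloc_req_pr and is later resolved by addr_lp_um.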

[0041] Figure 6 shows an example of an address converter 325 in an MPI shell. The address converter 325 is used to calculate the destination address in device memory of an incoming message packet and to track the readiness of the message. The address converter 325 includes an address table 605, a sequence range (seq_range) table 610, a status manager 615, and a memory manager 620. Both tables 605 and 610 are used to record the memory address assigned to the packet of the target message.

[0042] In one embodiment, each entry in the address table 605 includes a key (K_at) and a value (V_at), where K_at has <rank_src, rank_dst, tag> and V_at includes the base address assigned to the message (addr_at), mlen, an index (idx) used to query the status of message delivery, and the packet sequence number (pkt_base_seq) of the first packet of the MPI message, which includes the message header.

[0043] The seq_range table 610 has a key-value structure similar to table 605. The difference is that in the seq_range table 610, the key (K_tsr) contains a field that describes the packet sequence range (pkt_seq_range). The pkt_seq_range is a tuple containing (pkt_base_seq of the first packet of the message, pkt_base_seq + mlen).

[0044] The address table 605 receives, as input for a lookup request from the tag matcher 330 in Figure 5, either addr_lp_pr (for matched incoming packets of a message whose corresponding receive request has been posted) or addr_lp_um (for a matched posted receive request whose corresponding unexpected message has arrived). If the addr_lp_pr lookup hits, the address table 605 generates a key-value pair (K_tsr, V_tsr) with pkt_seq_range and writes it to the seq_range table 610. Meanwhile, the address converter 325 updates the idx-th register in the status manager 615 with its pkt_seq_range and the received bytes (payload_len), and calculates the new base memory address addr'_at = (addr_at + pkt_seq - pkt_base_seq), where addr_at and pkt_base_seq are from V_at and pkt_seq is from addr_lp_pr. The calculated memory address (addr'_at) is then sent to the data mover (shown in detail in Figure 7) to store the corresponding received packet.

[0045] In contrast, if the lookup for addr_lp_um hits, the address table 605 does not update the seq_range table 610 because the request is from a posted receive and the unexpected message has been received. In this scenario, the address table 605 simply notifies the idx-th register in the status manager 615 that the tag matcher has received a receive request for this message from the host. The address table 605 may support wildcard searches such as MPI_ANY_SOURCE and MPI_ANY_TAG, which may be implemented using TCAM or STCAM.

[0046] The seq_range table 610 receives addr_lp_non_hdr (<rank_src, rank_dst, tag, payload_len, pkt_seq>) from the tag matcher as input for lookup, for incoming packets of a message that do not have a message header. In addition to the rank_src, rank_dst, and tag search, the seq_range table 610 also performs a range lookup operation for the addr_lp_non_hdr request and checks whether its pkt_seq falls within the pkt_seq_range of any entry. If the lookup hits, the address table 605 calculates the new base memory address addr'_tsr = (addr_tsr + pkt_seq - pkt_base_seq), where addr_tsr and pkt_base_seq are from V_tsr and pkt_seq is from addr_lp_non_hdr. The calculated memory address addr'_tsr is then sent to the data mover to store the corresponding received packet. The address converter 325 also updates the idx-th register of the status manager 615 with the number of bytes received (payload_len). The seq_range table 610 has wildcard and range search requirements and can be implemented using TCAM.
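
The address calculation shared by the address table and the seq_range table can be illustrated with a small software model. The sketch below makes several assumptions that the text does not state: packet sequence numbers count bytes (so the sequence offset can be added to the base address directly), the sequence range is treated as half-open, the range search is a linear scan instead of a TCAM lookup, and all class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Key = Tuple[int, int, int]  # (rank_src, rank_dst, tag)


@dataclass
class Value:
    addr: int          # base address assigned to the message
    mlen: int          # message length
    idx: int           # index of the status register for this message
    pkt_base_seq: int  # sequence number of the first packet of the message


class AddressTables:
    def __init__(self) -> None:
        self.address_table: Dict[Key, Value] = {}
        # Each entry: (key, (range_start, range_end), value)
        self.seq_range_table: List[Tuple[Key, Tuple[int, int], Value]] = []

    def lookup_first_packet(self, key: Key, pkt_seq: int) -> Optional[int]:
        """Model of an addr_lp_pr lookup; returns the destination address."""
        v = self.address_table.get(key)
        if v is None:
            return None
        # On a hit, record the sequence range so later packets can be resolved.
        seq_range = (v.pkt_base_seq, v.pkt_base_seq + v.mlen)
        self.seq_range_table.append((key, seq_range, v))
        return v.addr + pkt_seq - v.pkt_base_seq

    def lookup_non_header(self, key: Key, pkt_seq: int) -> Optional[int]:
        """Model of an addr_lp_non_hdr lookup: key match plus range match."""
        for k, (start, end), v in self.seq_range_table:
            # Assumption: the range is half-open, [start, end).
            if k == key and start <= pkt_seq < end:
                return v.addr + pkt_seq - v.pkt_base_seq
        return None
```

With a base address of 0x1000 and pkt_base_seq of 5000, a later packet with pkt_seq 5100 is placed at 0x1000 + 100 under these assumptions, which is the same arithmetic as the addr' expressions in the text.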

[0047] The status manager 615 tracks the transmission status of each message. In one embodiment, the status manager 615 has a set of registers and a register allocator. For each message, a register can record information such as <rank_src, rank_dst, tag>, addr (the address assigned by either the host or the memory manager 620), mlen, the received bytes (recv_bytes), the packet sequence range (pkt_seq_range), and rr_recved, where rr_recved is a ready signal indicating that the tag matcher has received a recv_req from the host for this message and the host is waiting for the message.

[0048] The register allocator manages a pool of idle registers. For each alloc_req^pr or addr_ud^um request, it allocates a new register from the idle pool and outputs a pointer (idx) that other components use to access the register. When the idx-th register has recv_bytes equal to mlen and rr_recved asserted high, the corresponding message has been fully received, a matching receive request has been found, and the host is ready to read. The status manager 615 then generates a ready signal (msg_ready) containing <rank_src, rank_dst, tag, addr> for the data mover and issues a "delete" signal to remove the corresponding entries from the address table 605 and the seq_range table 610.
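The register bookkeeping and the readiness condition (recv_bytes equal to mlen and rr_recved set) can be modeled in a few lines of Python. This is a behavioral sketch under assumptions: the class, its method names, and the choice to return the msg_ready tuple from the update methods are hypothetical, and deletion of the table entries is left to the caller.

```python
from typing import Dict, List, Optional, Tuple

MsgReady = Tuple[int, int, int, int]  # <rank_src, rank_dst, tag, addr>


class StatusManager:
    """Behavioral model of a register set plus an idle-register allocator."""

    def __init__(self, num_registers: int) -> None:
        self.idle: List[int] = list(range(num_registers))
        self.regs: Dict[int, dict] = {}

    def allocate(self, rank_src: int, rank_dst: int, tag: int,
                 addr: int, mlen: int) -> int:
        """Take a register from the idle pool and return its pointer (idx)."""
        if not self.idle:
            raise RuntimeError("no idle register available")
        idx = self.idle.pop(0)
        self.regs[idx] = {
            "rank_src": rank_src, "rank_dst": rank_dst, "tag": tag,
            "addr": addr, "mlen": mlen, "recv_bytes": 0, "rr_recved": False,
        }
        return idx

    def add_bytes(self, idx: int, payload_len: int) -> Optional[MsgReady]:
        """Account for a stored packet; return msg_ready if now complete."""
        self.regs[idx]["recv_bytes"] += payload_len
        return self._check(idx)

    def set_rr_recved(self, idx: int) -> Optional[MsgReady]:
        """Record that the host has posted the matching receive request."""
        self.regs[idx]["rr_recved"] = True
        return self._check(idx)

    def _check(self, idx: int) -> Optional[MsgReady]:
        reg = self.regs[idx]
        if reg["recv_bytes"] == reg["mlen"] and reg["rr_recved"]:
            del self.regs[idx]     # release the register ...
            self.idle.append(idx)  # ... back to the idle pool
            return (reg["rank_src"], reg["rank_dst"], reg["tag"], reg["addr"])
        return None
```

The ready tuple is produced by whichever event happens last, the final packet or the posted receive, which mirrors the two-condition check described above.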

[0049] In one embodiment, the memory manager 620 allocates memory space for incoming unexpected messages and generates update requests for the address table 605. The memory manager 620 tracks the allocated memory blocks and the free memory space between them. It can be implemented with various memory management algorithms such as fixed-size block allocation, buddy memory allocation, and slab allocation. The memory manager 620 takes an alloc_req^pr signal (<rank_src, rank_dst, tag, mlen, pkt_seq>) as input and generates a physical memory address (addr^mm) allocated according to the message length (mlen) from alloc_req^pr. The allocated addr^mm is then sent to the data mover to store the corresponding received packet. The memory address is also recorded in the idx-th register in the status manager 615 via (idx, addr^mm). In addition, the memory manager 620 generates an update request containing the key-value pair (K^at = <rank_src, rank_dst, tag>, V^at = <addr^mm, mlen, idx, pkt_seq>) and writes it to the address table 605.
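Of the algorithms named above, fixed-size block allocation is the simplest to sketch. The Python model below is only an illustration: the class name, the constructor parameters, and the policy of rejecting messages longer than one block are assumptions, not details from the patent.

```python
from typing import List, Optional


class FixedBlockAllocator:
    """Fixed-size block allocator over a contiguous device-memory region."""

    def __init__(self, base: int, block_size: int, num_blocks: int) -> None:
        self.base = base
        self.block_size = block_size
        self.free: List[int] = list(range(num_blocks))  # indices of free blocks

    def alloc(self, mlen: int) -> Optional[int]:
        """Return a physical address for a message of mlen bytes, or None."""
        if mlen > self.block_size or not self.free:
            return None  # message too large for one block, or pool exhausted
        block = self.free.pop(0)
        return self.base + block * self.block_size

    def release(self, addr: int) -> None:
        """Return the block that starts at addr to the free pool."""
        self.free.append((addr - self.base) // self.block_size)
```

Fixed-size blocks keep allocation and release constant-time at the cost of internal fragmentation; buddy and slab allocation trade extra bookkeeping for better memory utilization.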

[0050] When the address converter 325 receives the addr_ud^um signal from the tag matcher, it writes or updates an entry in the address table 605. addr_ud^um indicates that the host has posted a new receive request and that no unexpected message in the unexpected_msg table within the tag matcher matches the posted receive request. As explained for the tag matcher, the addr_ud^um signal contains <rank_src, rank_dst, tag, addr, mlen>. The base address (addr) in the addr_ud^um signal is assigned by the host to store the dedicated message from rank_src to rank_dst. Next, an update request containing the key-value pair (K^at = <rank_src, rank_dst, tag>, V^at = <addr, mlen, idx, pkt_seq = none>) is generated using addr_ud^um and the idx from the status manager 615, and is written to the address table 605.

[0051] Figure 7 shows an example of a data mover 320 in an MPI shell. The data mover 320 includes an AXIS-to-AXI bridge 705 and a message-ready (msg_rdy) FIFO 710. The AXIS-to-AXI bridge 705 converts packet data in AXI-streaming protocol form (e.g., AXIS_data) to data in AXI protocol form (e.g., AXI_data). The converted data is then written to device memory via the memory controller. The corresponding base address (address) of AXIS_data is obtained from the address converter shown in Figure 6 and indicates its destination memory location in local memory within the SmartNIC.

[0052] The msg_rdy FIFO 710 stores the ready status of messages. A ready status may contain the source and destination process identifiers (rank_src and rank_dst), the tag, and the message's address in device memory, and it indicates that the message has been fully written to device memory and is ready to be read. The empty signal of the msg_rdy FIFO 710 can be connected via a memory-mapped register to either the PCIe/host interrupt system or a polling (pull) system. If connected to the interrupt system, the data mover 320 triggers an interrupt when the msg_rdy FIFO 710 is not empty, and the host processes the interrupt accordingly. If connected to the polling system, the data mover 320 writes a ready signal to a dedicated memory-mapped register when the msg_rdy FIFO 710 holds an element. The host can periodically or continuously check the value of the dedicated memory-mapped register and process events accordingly.
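The host-side polling variant can be sketched as follows. The model replaces the memory-mapped register with a Python property, and the class and function names are hypothetical; a real host would read the register through its device driver.

```python
from collections import deque
from typing import Callable, Deque, Tuple

ReadyStatus = Tuple[int, int, int, int]  # <rank_src, rank_dst, tag, addr>


class MsgReadyFifo:
    """Software model of a message-ready FIFO with an 'empty' flag."""

    def __init__(self) -> None:
        self._q: Deque[ReadyStatus] = deque()

    @property
    def empty(self) -> bool:
        # Stands in for the empty signal exposed via a memory-mapped register.
        return not self._q

    def push(self, rank_src: int, rank_dst: int, tag: int, addr: int) -> None:
        """Data-mover side: record that a message is fully written."""
        self._q.append((rank_src, rank_dst, tag, addr))

    def pop(self) -> ReadyStatus:
        """Host side: take the oldest ready status."""
        return self._q.popleft()


def poll_once(fifo: MsgReadyFifo, handler: Callable[[ReadyStatus], None]) -> int:
    """One polling pass: drain the FIFO and return how many entries were handled."""
    handled = 0
    while not fifo.empty:
        handler(fifo.pop())
        handled += 1
    return handled
```

Calling poll_once in a loop corresponds to the host checking the register periodically or continuously; an interrupt-driven design would invoke the same handler from the interrupt service path instead.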

[0053] Figure 8 shows an example of a compute engine 800 in an MPI shell. In this example, the compute engine 800 houses multiple kernels (kernels 0 to n) that can form one or more hardware accelerators. Each kernel includes a control FIFO 805 and a status FIFO 810. The control FIFO 805 receives control messages from the host. These control messages may include <unique ID of the workload, number of address arguments (N), address of argument 0, address of argument 1, ..., address of argument N>. To start a kernel with a workload, the host can issue a control message to the control FIFO 805 via the AXI-Lite interface. If the control FIFO 805 has elements inside, the kernel can receive a control message from the FIFO 805 and begin execution. Using the base address provided by the control message, the kernel can read data stored in device memory using the AXI interface. A kernel can support multiple AXI interfaces to increase its memory access bandwidth. The kernel may also have memory-mapped registers that are accessible to the host via the AXI-Lite interface.

[0054] When the kernel finishes execution, it writes a completion signal to its status FIFO 810. The empty signal of the status FIFO 810 can be connected via a memory-mapped register to either the PCIe/host interrupt system or a polling (pull) system. In designs with an interrupt system, the kernel triggers an interrupt when the status FIFO 810 is not empty, and the host processes the interrupt accordingly. In designs with a polling system, the kernel writes a completion signal to a dedicated memory-mapped register when the status FIFO 810 holds an element. The host can periodically or continuously check the value of the dedicated memory-mapped register and process the event accordingly when it detects a "completed" status.
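The control-FIFO and status-FIFO handshake of a kernel can be modeled in Python. This is a behavioral sketch only: the tuple layout of the control message, the "completed" string, and the Kernel and host_start names are assumptions, and the count field here simply equals the number of addresses that follow.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

# <workload id, number of address arguments, argument addresses>
ControlMessage = Tuple[int, int, List[int]]


class Kernel:
    """Behavioral model of one kernel with a control FIFO and a status FIFO."""

    def __init__(self, compute: Callable[[List[int]], None]) -> None:
        self.control_fifo: Deque[ControlMessage] = deque()
        self.status_fifo: Deque[Tuple[int, str]] = deque()
        self._compute = compute  # stands in for the accelerator logic

    def step(self) -> bool:
        """Run one workload if a control message is pending."""
        if not self.control_fifo:
            return False
        workload_id, n_args, addrs = self.control_fifo.popleft()
        if n_args != len(addrs):
            raise ValueError("argument count does not match address list")
        self._compute(addrs)  # the kernel reads its data at these addresses
        self.status_fifo.append((workload_id, "completed"))
        return True


def host_start(kernel: Kernel, workload_id: int, addrs: List[int]) -> None:
    """Host side: issue a control message to start a workload."""
    kernel.control_fifo.append((workload_id, len(addrs), list(addrs)))
```

The host would then learn of completion either by polling status_fifo or by an interrupt raised when it becomes non-empty, matching the two designs described above.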

[0055] If the compute engine 800 is implemented using programmable logic, the kernel may be designed using either high-level synthesis (HLS) or register-transfer-level (RTL) coding. However, in another embodiment, the compute engine 800 may be implemented in hardened circuitry such as an ASIC or SoC.

[0056] Figure 9 shows an example of a data controller 305 within an MPI shell. The data controller 305 provides memory access channels for the various connected modules and components. The data controller 305 includes an AXI interconnect 315 and one or more memory controllers 310 (multiple memory controllers are shown in this example). The memory controllers 310 access device memory within the SmartNIC. Modules and components, including the data mover, the compute engine, the connected host, and the transmit logic of the SmartNIC or of a transport-layer offload engine, can share the memory controllers for memory access via the AXI interconnect 315, which uses the AXI protocol. The AXI interconnect 315 acts as an interface between the MPI shell and the host (e.g., CPU).

[0057] Communication between the host and the hardware accelerator includes an interrupt or polling (pull) operation on the host when a message is ready (described with the data mover 320 in Figure 7), a control message from the host to start the accelerator, and an interrupt or polling operation on the host when the accelerator finishes its execution (described with the compute engine 125 in Figure 8).

[0058] Furthermore, control register access is used to configure or read memory-mapped registers within the MPI shell, for example to set scalar arguments within the accelerator, to read error information, or to collect statistics such as the number of messages received, the number of dropped messages, the number of available accelerators, and the types of accelerators supported.

[0059] Furthermore, collective operations such as MPI_bcast, MPI_gather, MPI_scatter, and MPI_reduce are all based on the operations in MPI_send and MPI_recv. Systems with an MPI shell can also support these collective operations without any modifications. In addition, reduce-related operations such as MPI_reduce and MPI_allreduce include compute operations such as MPI_max, MPI_min, MPI_sum, MPI_and, and MPI_or. These predefined compute operations can be implemented in accelerators within the MPI shell.
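To illustrate how a reduce-style collective can be layered on point-to-point send and receive, the sketch below simulates a linear reduction to a root rank inside one Python process. It does not use an MPI library: Mailboxes, send, recv, and linear_reduce are hypothetical names, and real MPI implementations typically use tree or ring algorithms in place of this linear pattern.

```python
import operator
from collections import defaultdict, deque
from typing import Callable, DefaultDict, Deque, List, Tuple


class Mailboxes:
    """In-process stand-in for point-to-point messaging keyed by <src, dst, tag>."""

    def __init__(self) -> None:
        self._boxes: DefaultDict[Tuple[int, int, int], Deque[int]] = defaultdict(deque)

    def send(self, src: int, dst: int, tag: int, value: int) -> None:
        self._boxes[(src, dst, tag)].append(value)

    def recv(self, src: int, dst: int, tag: int) -> int:
        return self._boxes[(src, dst, tag)].popleft()


def linear_reduce(values: List[int], op: Callable[[int, int], int],
                  root: int = 0, tag: int = 0) -> int:
    """Reduce one value per rank to the root using only send and recv."""
    net = Mailboxes()
    # Every non-root rank sends its contribution to the root.
    for rank, value in enumerate(values):
        if rank != root:
            net.send(rank, root, tag, value)
    # The root receives each contribution and folds it with the operation.
    result = values[root]
    for rank in range(len(values)):
        if rank != root:
            result = op(result, net.recv(rank, root, tag))
    return result


# Example operations corresponding to sum and max style reductions.
SUM = operator.add
MAX = max
```

In a system with an MPI shell, the fold step (op) is the part that could run in an accelerator, while the send and receive steps follow the point-to-point path described earlier.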

[0060] Figures 10A to 10C illustrate examples of integrating an MPI shell into different SmartNIC implementations. Nodes in a network-centric data center typically include hosts with multi-core CPUs and devices that function as SmartNICs connecting to the network. These devices can be either ASIC (application-specific integrated circuit) SmartNICs or programmable SmartNICs. An MPI shell, acting as a sniffer, can be integrated with various SmartNIC systems. Figures 10A to 10C show three examples of MPI shell integration in programmable SmartNICs.

[0061] The first example in Figure 10A shows a system architecture with an MPI shell integrated into a basic SmartNIC. The communication channel used in this example is the TCP/IP protocol. The system leverages its host for TCP/IP control such as TCP connectivity, retransmission, congestion control, TCP transmission, and TCP ACKs on the SmartNIC. More specifically, the host in this system is responsible for the MPI library, the networking stack such as TCP/IP, the kernel driver that controls its SmartNIC connectivity, and the MPI shell runtime. The MPI library 210 includes various functions such as MPI process management, point-to-point messaging control, collective operations, and synchronization. The MPI shell acts as a sniffer without interrupting the existing network flow and processes only packets from target messages destined for the compute engine 125.

[0062] Packets received from the network (M-RX) can be redirected to the receive path (D-RX) within the SmartNIC MAC subsystem 230 before reaching the packet classifier 335. For messages sent to hardware processes (i.e., the compute engine 125), the MPI shell relies on the host to acknowledge all received TCP packets.

[0063] Regarding the transmission operation, if message data exists in device memory, the host (1) constructs a message with a header, the address of the message data, and dummy data, and (2) sends the message via normal TCP transmission operation. Parser 1005 detects this type of message. Parser 1005 then triggers the segmentation offload engine 1010 to read data from device memory to send the actual message packet.

[0064] Figure 10B shows a system architecture with an MPI shell integrated into a SmartNIC that has a TCP Offloading Engine (TOE). This integration is similar to the integration in Figure 10A. This system maintains two sets of TCP management: one using a traditional CPU-based TCP/IP stack for software processes, and the other leveraging the TOE for hardware processes.

[0065] Packets received from the network (M-RX) are redirected to the host via D-RX or to the TOE receiving (TOE RX) engine 1025 according to the results generated by the packet classifier 335. For transmission operations, the TOE transmission (TOE TX) engine 1015 can read message data from device memory and send it to the remote peer via the arbiter 1020.

[0066] Figure 10C shows a system architecture with an MPI shell integrated into a SmartNIC, which has a RoCE RX engine 1040, a RoCE TX engine 1030, and an arbiter 1035. The connections are very similar to those in Figures 10A and 10B and are therefore not described in detail.

[0067] In the preceding description, reference is made to the embodiments presented in this disclosure. However, the scope of this disclosure is not limited to any specific described embodiment. Rather, any combination of the described features and elements, whether or not related to different embodiments, is contemplated to implement and practice the intended embodiments. Furthermore, while the embodiments disclosed herein may achieve advantages over other possible solutions or the prior art, whether or not a particular advantage is achieved by a given embodiment does not limit the scope of this disclosure. Accordingly, the aforementioned aspects, features, embodiments, and advantages are merely illustrative and should not be considered elements or limitations of the appended claims unless expressly stated in the claims.

[0068] As will be understood by those skilled in the art, the embodiments disclosed herein may be embodied as systems, methods, or computer program products. Accordingly, embodiments may take the form of entirely hardware embodiments, entirely software embodiments (including firmware, resident software, microcode, etc.), or embodiments that combine software and hardware embodiments, all of which may be commonly referred to herein as “circuits,” “modules,” or “systems.” Furthermore, embodiments may take the form of computer program products embodied in one or more computer-readable media in which computer-readable program code is embodied.

[0069] Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include electrical connections with one or more wires, portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. In the context of this specification, a computer-readable storage medium is any tangible medium that contains or can store programs for use by, or in connection with, an instruction execution system, apparatus, or device.

[0070] A computer-readable signal medium may include, for example, a propagating data signal in which computer-readable program code is embodied, either in baseband or as part of a carrier wave. Such a propagating signal may take any of various forms, including but not limited to electromagnetic, optical, or any suitable combination thereof. A computer-readable signal medium may be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transfer a program for use by or in connection with an instruction execution system, apparatus, or device.

[0071] Program code embodied on a computer-readable medium can be transmitted using any suitable medium, including but not limited to wireless, wireline, fiber optic cable, RF, or any suitable combination thereof.

[0072] Computer program code for performing the operations of the embodiments of this disclosure may be written in any combination of one or more programming languages, including, for example, object-oriented programming languages such as Java®, Smalltalk, and C++, and conventional procedural programming languages such as the C programming language or similar programming languages. The program code may run entirely on the user's computer, partly on the user's computer as a standalone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, via the Internet using an Internet service provider).

[0073] Aspects of the present disclosure are described below with reference to the flowcharts and/or block diagrams of the methods, apparatus (systems), and computer program products according to the embodiments presented herein. It will be understood that each block in the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a dedicated computer, or another programmable data processing device such that instructions executed via the processor of the computer or other programmable data processing device create means for implementing the functions/acts specified in the blocks of the flowcharts and/or block diagrams.

[0074] These computer program instructions can also be stored on a computer-readable storage medium that can direct a computer, a programmable data processing device, and/or other devices to function in a particular way, such that the instructions stored on the computer-readable storage medium produce an article of manufacture containing instructions that implement the functions/acts specified in the blocks of a flowchart and/or block diagram.

[0075] Computer program instructions can also be loaded onto a computer, other programmable data processing device, or other device to cause a series of operational steps to be performed on the computer, other programmable device, or other device, thereby producing a computer-implemented process. Thus, instructions executed on a computer or other programmable device provide a process for implementing the functions/acts specified in the blocks of a flowchart and/or block diagram.

[0076] The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of instructions containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions described in a block may occur in a different order than shown in the figure. For example, two consecutively shown blocks may actually be executed substantially simultaneously, or the blocks may be executed in reverse order depending on the functions involved. It should also be noted that each block in the block diagram and/or flowchart illustrations, and combinations of blocks in the block diagram and/or flowchart illustrations, may be implemented by a dedicated hardware-based system that performs a specified function or action, or that combines dedicated hardware with computer instructions.

[0077] While the foregoing is directed to specific examples, other and further examples may be devised without departing from the basic scope thereof, and the scope is determined by the claims that follow.

Claims

1. A network interface card (NIC) comprising a Message Passing Interface (MPI) shell, the MPI shell comprising: a packet classifier configured to sniff packets received from a network to identify MPI messages and to generate metadata for the packets corresponding to the MPI messages; and a circuit configured to transfer the data contained in the MPI message to a computing circuit for processing without first copying the data to a memory corresponding to a central processing unit (CPU), the CPU being located on the same computing node as the NIC; wherein the MPI shell further comprises: a tag matcher configured to receive a packet corresponding to the MPI message as input and to generate an address lookup request, wherein the metadata includes information used by the tag matcher to generate the address lookup request; and an address converter configured to receive the address lookup request from the tag matcher and to generate an address assigned by either the CPU or a memory manager located within the address converter.

2. The NIC according to claim 1, wherein the MPI shell is configured to perform tag matching and address translation on the data contained in the MPI message without the involvement of the CPU, the MPI shell further comprising: a data mover configured to receive the address from the address converter and to store the data in the MPI message at the address in a target memory shared with the computing circuit; and a data controller comprising a memory controller coupled to local memory and an interconnect, the data controller being coupled to an output of the data mover, wherein the interconnect functions as an interface between the MPI shell and the CPU.

3. A NIC comprising: a hardware computing circuit; and an MPI shell, the MPI shell comprising: a packet classifier configured to sniff packets received from a network to identify MPI messages and to generate metadata for the packets corresponding to the MPI messages; a circuit configured to transfer the data contained in the MPI message to the computing circuit for processing; and a circuit configured to receive, from a CPU external to the NIC, an instruction that instructs the computing circuit to process the data contained in the MPI message; wherein the MPI shell further comprises: a tag matcher configured to receive a packet corresponding to the MPI message as input and to generate an address lookup request, wherein the metadata includes information used by the tag matcher to generate the address lookup request; and an address converter configured to receive the address lookup request from the tag matcher and to generate an address assigned by either the CPU or a memory manager located within the address converter.

4. The NIC according to claim 1, wherein sniffing the packets received from the network to identify the MPI messages comprises identifying messages corresponding to a distributed computing system in which tasks are transmitted between nodes of the distributed computing system using messages.

5. The NIC according to claim 1 or 3, wherein the computing circuit is located within the NIC.

6. The NIC according to claim 4, wherein the computing circuit and the MPI shell are arranged on the same integrated circuit within the NIC.

7. The NIC according to claim 4, wherein the MPI shell is configured to perform tag matching and address translation on the data contained in the message without the involvement of the CPU.

8. The NIC according to claim 4, wherein the MPI shell is configured to receive instructions from the CPU to instruct the computing circuit to process the data contained in the message.

9. The NIC according to claim 1 or 3, wherein the computing circuit and the MPI shell are arranged on the same integrated circuit within the NIC.

10. The NIC according to claim 1 or 3, wherein the MPI shell is configured to perform tag matching and address translation on the data contained in the MPI message without the involvement of the CPU.

11. The NIC according to claim 1 or 3, wherein the MPI shell further comprises a data mover configured to receive the address from the address converter and to store the data contained in the MPI message at the address in a target memory shared with the computing circuit.

12. The NIC according to claim 3, wherein the MPI shell is configured to perform tag matching and address translation on the data contained in the MPI message without the involvement of the CPU, the MPI shell further comprising: a data mover configured to receive the address from the address converter and to store the data in the MPI message at the address in a target memory shared with the computing circuit; and a data controller comprising a memory controller coupled to local memory and an interconnect, the data controller being coupled to an output of the data mover, wherein the interconnect functions as an interface between the MPI shell and the CPU.

13. The NIC according to claim 1, wherein the MPI shell is configured to receive instructions from the CPU to instruct the computing circuit to process the data contained in the MPI message.