Spatial distribution in 3D data processing units
The 3D SmartNIC architecture addresses scalability and security issues in SmartNICs by distributing functions across stacked layers, reducing latency and buffering, and encrypting sensitive data, thereby improving efficiency and security.
Patent Information
- Authority / Receiving Office: JP
- Patent Type: Patents
- Current Assignee / Owner: XILINX INC
- Filing Date: 2022-02-17
- Publication Date: 2026-06-25
AI Technical Summary
Current SmartNICs spatially decompose computing, networking, and storage functions in two dimensions, limiting scalability and efficiency due to iterative buffering of data and metadata, and exposing sensitive information to potential attacks.
The 3D SmartNIC architecture spatially distributes compute, memory, and network accelerator functions across multiple stacked layers, uses a sequencer to coordinate traffic flow between those functions, and encrypts sensitive information, reducing latency and buffering iterations.
The architecture enhances scalability and security by minimizing latency and resource inefficiencies while protecting sensitive data from attacks, because only encrypted data is distributed across separate chips.
Description
Technical Field
[0001] Examples of the present disclosure generally relate to a 3D network interface card (NIC) that includes a plurality of stacked layers that communicate with each other.
Background Art
[0002] Cloud infrastructure is growing at an accelerating pace to meet the ever-increasing demand for services hosted in the cloud. There is a growing need to offload compute, network, and memory functions to accelerators to free up server CPUs and allow them to focus on running customer applications. These accelerators are part of the cloud's hyperconverged infrastructure (HCI) and provide cloud vendors with a simpler way to manage various compute-centric, network-centric, and memory-centric workloads for a single or multiple customers. Many cloud operators use SmartNICs to help handle these workloads. Generally, a SmartNIC is a NIC that contains a data processing unit capable of handling network traffic, accelerating and offloading other functions that would otherwise be performed by the host CPU if a standard, i.e., "simple," NIC were used. SmartNICs excel at consolidating multiple offload acceleration functions into a single component, are adaptable enough to accelerate new functions or support new protocols, and provide cloud vendors with a single way to manage virtualization and security in the case of multiple cloud tenants (e.g., customers) using HCI simultaneously. The term Data Processing Unit (DPU) is also used as a substitute for SmartNIC to describe a collection of processing, acceleration, and offload functions for virtualization, security, networking, computing, and storage, or subsets thereof. 3D DPUs can have various form factors, such as peripheral cards and OCP accelerator modules, or they can be directly implemented on the motherboard along with other components / accelerators / memory.
[0003] SmartNICs adapt to rapidly changing workloads by offloading and accelerating new features and protocols that emerge over their lifecycle. SmartNICs (e.g., PCIe cards) are typically plugged into servers or storage nodes in the cloud, connecting to top-of-rack (TOR) network switches, and then to the rest of the cloud. In hyperscale deployments of these components in millions of units, power consumption also becomes a critical metric for SmartNICs. The combination of adaptive intelligence and low power consumption, along with programmable logic and hardened acceleration, makes SmartNIC devices particularly well-suited to these deployments.
[0004] The highly centralized nature of SmartNICs means that they can perform computing, networking, and storage functions in a single component. However, current SmartNICs spatially decompose these functions in two dimensions, either across multiple chiplets within a package or across a large monolithic die. In other words, data processing units that perform workloads that would otherwise need to be executed by the CPU in a server are located in a 2D plane, either with chiplets mounted on the same board (e.g., a printed circuit board) or with the processing units formed on the same chip. This severely limits the scalability of these SmartNICs to meet future bandwidth demands.
[0005] In addition, the nature of SmartNIC processing requires the movement of not only network flows but also a large amount of metadata associated with those flows. This metadata may include action verbs, i.e., sets of commands, for the current stage of the processing or acceleration pipeline in the SmartNIC, or it may contain action verbs or serve as a reference for the next stage of the acceleration pipeline to interpret or execute. In a multi-tenant environment, where the same service is provided by the SmartNIC to multiple tenants within a host, or where multiple network, computing, or storage functions are provided to the same tenant, the metadata may also contain tenant identification information, the tenants' service level agreements (SLAs), and / or information about the type of service or acceleration function desired by the tenant. As the number of offload accelerator functions increases, so does the amount or type of metadata. As a result of these characteristics, metadata often represents a significant overhead relative to the amount of data being processed or moved. Furthermore, SmartNIC processing also requires temporary buffering of data, and in some cases temporary buffering of some or all of the metadata, while a particular tenant's traffic is being processed or while the next function or processing step for the data is being determined. In other words, because functions are spatially distributed in current technologies, data and metadata are buffered iteratively as tenant traffic moves through the various stages of the pipeline. As link bandwidth increases, the amount of iterative buffering also increases, leading to inefficient use of the resources that move data between the spatially distributed functions.
Summary of the Invention
Means for Solving the Problem
[0006] One embodiment described herein is a NIC comprising a plurality of layers arranged in a stack and communicatively coupled to one another, a plurality of accelerator functions within the plurality of layers, and a sequencer disposed in one of the plurality of layers, wherein the sequencer is configured to coordinate the traffic flow received in the NIC between different accelerator functions among the plurality of accelerator functions to form a pipeline.
[0007] Another embodiment described herein is a 3D data processing unit comprising a plurality of layers arranged in a stack and communicatively coupled to one another, a plurality of accelerator functions within the plurality of layers, and a sequencer disposed in one of the plurality of layers, wherein the sequencer is configured to coordinate the traffic flow received in the 3D data processing unit between different accelerator functions among the plurality of accelerator functions to form a pipeline.
[0008] Another embodiment described herein is a system comprising multiple NICs, each comprising multiple layers arranged in a stack and communicatively coupled to one another, and multiple accelerator functions within the multiple layers. The system also comprises multiple accelerator cards and a switch that communicatively couples the multiple NICs to the multiple accelerator cards, wherein the multiple NICs, multiple accelerator cards, and switch are housed in the same box.
[0009] So that the above features can be understood in detail, a more specific description of what is briefly summarized above is given by reference to exemplary implementations, some of which are shown in the attached drawings. However, it should be noted that the attached drawings only show typical exemplary implementations and should therefore not be considered limiting in scope.
Brief Explanation of the Drawings
[0010] [Figure 1] An example of a computing system equipped with a 3D SmartNIC is shown.
[Figure 2] This shows an example of multiple layers within a 3D SmartNIC.
[Figure 3] An example of a 3D SmartNIC with a fabric layer is shown.
[Figure 4] This example shows a 3D SmartNIC with a cryptographic engine in the intermediate layer.
[Figure 5] An example of a sequencer is shown.
[Figure 6] This is a block diagram of an I / O expansion box including a SmartNIC, as an example.
Modes for Carrying Out the Invention
[0011] For ease of understanding, the same reference numeral is used to indicate identical elements common to multiple drawings, where possible. It is intended that elements in one example may be usefully incorporated into others.
[0012] Various features are described below with reference to the drawings. Note that the drawings may or may not be drawn to scale, and elements of similar structure or function are represented by the same reference numerals throughout the drawings. Note that the drawings are intended solely to facilitate the description of the features. They are not intended as a comprehensive description of the specification or as a limitation on the claims. In addition, illustrated examples do not necessarily have all the embodiments or advantages shown. Embodiments or advantages described in relation to a particular embodiment are not necessarily limited to that embodiment and may be implemented in any other embodiment even if not illustrated or explicitly described in that way.
[0013] Embodiments herein describe a 3D SmartNIC that spatially distributes compute accelerator functions, memory accelerator functions, or network accelerator functions in three dimensions using multiple layers. That is, unlike current SmartNICs that perform acceleration functions in a 2D plane (for example, using chiplets arranged on a common board or data processing units integrated on the same monolithic chip), a 3D SmartNIC can distribute these functions across multiple stacked layers, each layer being able to communicate directly or indirectly with the other layers. For example, a host may transmit a network flow containing data (e.g., packets) to be processed in a pipeline formed from multiple accelerator functions within the 3D SmartNIC. The network flow may first be processed by function A in the first layer, then by functions B and C in the second layer, and then by function D in the third layer. Since the latency between these pipelined functions can affect the overall throughput of a 3D SmartNIC, using multiple layers can improve the physical and logical coupling between different stages in the pipeline (i.e., accelerator functions) compared to a SmartNIC where all accelerator functions are executed by hardware on the same plane. In other words, by using multiple layers, the physical and logical distance between functions A-D can be smaller in a 3D SmartNIC than in a 2D SmartNIC. Furthermore, a 3D SmartNIC can reduce the latency and iterations associated with data movement between these functions compared to a 2D SmartNIC.
[0014] The tightly coupled, Active-on-Active (AoA) layers of the 3D SmartNIC allow data and metadata to be processed and moved over shorter physical distances and paths, with orders of magnitude greater bandwidth than is possible with conventional technologies. Furthermore, in one embodiment, transient data buffering is performed in a universal, centralized packet buffer, which reduces the amount of repetitive buffering and makes efficient use of spatially distributed data movement resources for tenant traffic moving through various stages of network, storage, or compute pipelined acceleration. In another embodiment, the packet buffers are spatially distributed in either a 2D or 3D plane based on the order in which accelerators are used, the physical location of the network, storage, or compute pipelined acceleration, and the optimal latency between the packet buffer and the source and destination accelerators or external interfaces. The choice of which spatially distributed buffer to use may be determined a priori and encoded in the metadata, or it may be determined dynamically by the processing steps in the acceleration pipeline.
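The choice between spatially distributed packet buffers can be illustrated with a minimal Python sketch. It assumes a hypothetical latency table (`LATENCY_NS`), hypothetical buffer and endpoint names, and a hypothetical metadata key `buffer` for the a priori case. None of these names or numbers come from the disclosure, and summing the source and destination latencies is only one possible reading of "optimal latency".

```python
# Hypothetical latencies (ns) between each distributed packet buffer and each
# endpoint. Buffer names, endpoint names, and numbers are illustrative only.
LATENCY_NS = {
    "buffer_layer_a": {"host_if": 4, "crypto": 2, "memory": 9},
    "buffer_layer_b": {"host_if": 6, "crypto": 5, "memory": 3},
}


def pick_buffer(metadata, source, destination, latency_ns=LATENCY_NS):
    """Return the packet buffer to use for a source -> destination hop."""
    # A priori case: the route was decided earlier and encoded in the metadata.
    preset = metadata.get("buffer")
    if preset is not None:
        return preset
    # Dynamic case: choose the buffer with the lowest combined latency.
    return min(
        latency_ns,
        key=lambda name: latency_ns[name][source] + latency_ns[name][destination],
    )


print(pick_buffer({}, "host_if", "crypto"))   # buffer_layer_a (4 + 2 < 6 + 5)
print(pick_buffer({}, "crypto", "memory"))    # buffer_layer_b (5 + 3 < 2 + 9)
print(pick_buffer({"buffer": "buffer_layer_b"}, "host_if", "crypto"))  # buffer_layer_b
```

In this sketch the metadata entry always wins over the dynamic choice; a real design could equally let a pipeline stage override it.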
[0015] In addition, some SmartNIC security requirements demand that the exposed link interface (an attack surface and potential source of sensitive information leakage) carrying tenant data be encrypted and protected from side-channel attacks. The advantages of 3D SmartNIC over conventional technology include spatially distributing only encrypted tenant information across separate chips or chiplets via exposed links. Since connections on the z-axis are not exposed, any distribution of decrypted tenant information can only be performed on the z-axis (i.e., between layers). Another advantage of 3D SmartNIC is that encryption can be performed in the intermediate layer on the z-axis, preventing malicious actors from obtaining sensitive information using non-invasive probing methods such as laser probes on exposed interfaces.
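The link rule in this paragraph (only encrypted tenant data on exposed links, decrypted data only on the z-axis) can be written as a small check. The following Python sketch is an illustration under stated assumptions: the function name `transfer_allowed` and the link labels `"exposed"` and `"z_axis"` are hypothetical and do not appear in the disclosure.

```python
def transfer_allowed(link_kind, payload_encrypted):
    """Return True if tenant data may cross the given link.

    link_kind: "exposed" for links between separate chips or chiplets,
               "z_axis" for unexposed connections between stacked layers.
    """
    if link_kind == "z_axis":
        return True  # not exposed, so decrypted or encrypted data may cross
    if link_kind == "exposed":
        return payload_encrypted  # only encrypted tenant data may cross
    raise ValueError(f"unknown link kind: {link_kind!r}")


print(transfer_allowed("z_axis", payload_encrypted=False))   # True
print(transfer_allowed("exposed", payload_encrypted=False))  # False
print(transfer_allowed("exposed", payload_encrypted=True))   # True
```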
[0016] Figure 1 shows an example of a computing system 100 equipped with a 3D SmartNIC 110. As shown, the computing system 100 includes a host 105 that relies on the SmartNIC 110 to exchange data with a network 130. For example, the network 130 may be a local network within a data center that connects the host 105 (e.g., a server) to other computing systems within the data center (e.g., other servers or network storage devices). Although the 3D SmartNIC 110 is shown outside the host 105, in one embodiment the SmartNIC is located inside the host 105. For example, the SmartNIC 110 may be a PCIe card that is plugged into a PCIe slot inside the host 105.
[0017] The 3D SmartNIC 110 includes multiple layers 115 (or decks) that form a 3D structure. That is, unlike conventional SmartNICs which have computing resources arranged in a 2D plane, such as multiple chiplets arranged on a common substrate (e.g., a PCB) or a single monolithic chip, the computing resources within the SmartNIC 110 are distributed across multiple layers 115. In one embodiment, the layers 115 are separate integrated circuits (ICs) or chips that form a stack. For example, the ICs may be joined to each other using solder connections so that computing resources in different layers 115 can communicate with each other. In another embodiment, the layers 115 may include separate substrates such as PCBs containing ICs or chiplets, and these ICs or chiplets are connected to ICs or chiplets on substrates of other layers, for example, using solder bumps or wire bonds. Alternatively, the stacked layers 115 may be directly bonded to each other using through-silicon vias for three-dimensional connectivity, or the connections may pass through other types of substrates (e.g., PCBs) and provide three-dimensional connectivity between the layers 115 using solder bumps or wire bonds.
[0018] The layers 115 include at least one sequencer 120. In one embodiment, there is only one sequencer within the SmartNIC 110 (i.e., only one of the layers 115 has a sequencer 120), but in other embodiments, it may be advantageous to have multiple sequencers 120 within the same layer 115 or in different layers 115. Generally, the sequencer 120 coordinates the traffic flow between different accelerator functions 125 within the SmartNIC 110. The sequencer may also coordinate the use of universal / centralized packet buffers or the order of use of spatially distributed packet buffers. In one embodiment, each layer 115 includes at least one function 125 that processes data in the traffic flow received from either the host 105 or the network 130. Furthermore, each layer 115 may include multiple functions 125.
[0019] In one embodiment, each accelerator function 125 is a hardware element that performs a compute, networking, or storage function on data (or metadata) in the network flow. These hardware elements may be separate ICs within a layer 115, or one IC may have hardware elements for performing multiple accelerator functions 125. Accelerator functions 125 may include hardware elements for accelerating interfaces to the host 105 and network 130, cryptographic accelerators, compression accelerators, fabric accelerators, memory controllers, memory elements (e.g., random access memory (RAM)), etc. These hardware elements may be implemented using programmable logic blocks or hardened logic blocks. For example, the memory controller, RAM, interface (input / output (I / O)) accelerator, compression accelerator, and cryptographic accelerator may be implemented using hardened logic, while the fabric accelerator is implemented using programmable logic (e.g., configurable logic blocks). However, in other embodiments, some accelerators (e.g., cryptographic accelerators or compression accelerators) may be implemented with programmable logic instead of hardened logic.
[0020] Figure 2 shows, by way of example, multiple layers within the 3D SmartNIC 110. In Figure 2, the 3D SmartNIC 110 can have any number of layers 115, but for simplicity, only two layers, layer 115A and layer 115B, are shown. For example, layer 115A and layer 115B can be the only two layers within the SmartNIC 110, or there can be one or more layers between these two layers.
[0021] As shown, both layers 115 include hardware elements that form accelerator functions 125A - E. In this example, layer 115A includes accelerator functions 125A - D, and layer 115B includes accelerator function 125E. Further, layer 115A includes a sequencer 120 that is communicatively coupled to each of the accelerator functions 125A - D within layer 115A. Although not shown, sequencer 120 is also coupled to accelerator function 125E within layer 115B and can be coupled to host interface 210 and network interface 215.
[0022] As described above, sequencer 120 coordinates the way network traffic flows between accelerator functions 125 to form different stages in the data acceleration pipeline. In one embodiment, sequencer 120 establishes a pipeline for each network flow, and the accelerator functions 125 form the stages of the pipeline. For example, in a first network flow (which can be associated with a first customer or tenant), data can be sent first to function 125A, then to function 125D, and finally to function 125E. However, in a second network flow for a different customer or tenant, that data can be sent first to function 125E and then to function 125B. Details for using sequencer 120 to establish different pipelines for different network flows are described below in connection with Figure 5.
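The per-flow pipelines in this example can be modelled with a short Python sketch. It is a toy software model, not the hardware sequencer: the class name `Sequencer`, its methods, and the flow identifiers `tenant-1` and `tenant-2` are hypothetical. The layer placement follows the Figure 2 description (functions 125A - D in layer 115A, function 125E in layer 115B).

```python
from dataclasses import dataclass, field

# Layer placement taken from the Figure 2 description.
FUNCTION_LAYER = {
    "125A": "115A", "125B": "115A", "125C": "115A", "125D": "115A",
    "125E": "115B",
}


@dataclass
class Sequencer:
    """Toy model of sequencer 120: one ordered stage list per network flow."""
    pipelines: dict = field(default_factory=dict)

    def establish(self, flow_id, stages):
        unknown = [s for s in stages if s not in FUNCTION_LAYER]
        if unknown:
            raise ValueError(f"unknown accelerator functions: {unknown}")
        self.pipelines[flow_id] = list(stages)

    def next_stage(self, flow_id, current=None):
        """Return the stage after `current`, or None when the pipeline ends."""
        stages = self.pipelines[flow_id]
        if current is None:
            return stages[0]
        idx = stages.index(current)
        return stages[idx + 1] if idx + 1 < len(stages) else None

    def layer_crossings(self, flow_id):
        """Count the hops in a pipeline that move between layers."""
        stages = self.pipelines[flow_id]
        return sum(
            FUNCTION_LAYER[a] != FUNCTION_LAYER[b]
            for a, b in zip(stages, stages[1:])
        )


seq = Sequencer()
seq.establish("tenant-1", ["125A", "125D", "125E"])  # first network flow
seq.establish("tenant-2", ["125E", "125B"])          # second network flow

print(seq.next_stage("tenant-1", "125A"))  # 125D
print(seq.next_stage("tenant-2", "125B"))  # None (pipeline finished)
print(seq.layer_crossings("tenant-1"))     # 1 (125D in 115A -> 125E in 115B)
```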
[0023] Layer 115A also includes a packet buffer 205 that functions as a centralized, universal packet holding area for data being transferred between the functions 125, host interface 210, and network interface 215. Continuing with the above example, after a packet of the first network flow is processed by function 125A, the next function in the pipeline, i.e., function 125D, may not be ready for the packet. Function 125A can store the packet in packet buffer 205 until function 125D is ready for it. Thus, although not shown, each of functions 125A - D can be connected to packet buffer 205. Packet buffer 205 can also be used when transferring packets between layers 115. For example, SmartNIC 110 can use packet buffer 205 to temporarily hold packets before they are stored in RAM as part of function 125E. Packet buffer 205 is a universal buffer because it can be used by the various functions 125 within SmartNIC 110, which can perform different network acceleration tasks, compute acceleration tasks, and memory acceleration tasks. Thus, in one embodiment, each accelerator function (as well as host interface 210 and network interface 215) is connected to packet buffer 205 and can store packets in, and retrieve packets from, buffer 205.
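The hand-off through a shared packet buffer can be sketched in Python as follows. The class `PacketBuffer`, its `capacity` parameter, and the idea of keying waiting packets by their next stage are assumptions made for illustration. The disclosure only states that functions can store packets in the buffer until the next function is ready.

```python
from collections import defaultdict, deque


class PacketBuffer:
    """Toy model of a shared packet buffer: packets wait here, keyed by the
    stage that will consume them next."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._queues = defaultdict(deque)
        self._count = 0

    def store(self, next_stage, packet):
        """Hold a packet for `next_stage`; return False if the buffer is full."""
        if self._count >= self.capacity:
            return False  # the caller has to wait or apply back-pressure
        self._queues[next_stage].append(packet)
        self._count += 1
        return True

    def retrieve(self, stage):
        """Return the oldest packet waiting for `stage`, or None if there is none."""
        queue = self._queues.get(stage)
        if not queue:
            return None
        self._count -= 1
        return queue.popleft()


buf = PacketBuffer(capacity=2)
# Function 125A has finished, but function 125D is not ready yet.
buf.store("125D", b"pkt-1")
buf.store("125D", b"pkt-2")
print(buf.store("125E", b"pkt-3"))  # False: the buffer is full
print(buf.retrieve("125D"))         # b'pkt-1' (first in, first out)
print(buf.retrieve("125E"))         # None: nothing is waiting for 125E
```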
[0024] The arrangement of the accelerator functions 125, sequencer 120, packet buffer 205, host interface 210, and network interface 215 in Figure 2 is merely one example of a 3D SmartNIC. For example, in other embodiments, host interface 210 and network interface 215 can be disposed on layer 115A. Further, layer 115B can have two or more accelerator functions (i.e., functions in addition to function 125E), or layer 115A can include more or fewer accelerator functions than those shown.
[0025] Figure 3 shows an example of a 3D SmartNIC 300 with a fabric layer. As shown in the figure, the SmartNIC 300 includes three layers 315A to C, with various accelerator functions distributed across the layers 315. In this example, layer 315A includes a processor 305, a host interface accelerator 310, a cryptographic accelerator 317, a compression accelerator 320, and a network interface accelerator 325, along with the sequencer 120 and packet buffer 205 as described above. The processor 305, host interface accelerator 310, cryptographic accelerator 317, compression accelerator 320, and network interface accelerator 325 are examples of the accelerator functions 125 described in Figures 1 and 2.
[0026] The processor 305 may be an ARM or x86 processor capable of performing computational tasks for data in the network flow. The host interface accelerator 310 and the network interface accelerator 325 accelerate the functions performed by the host interface 210 and the network interface 215, respectively. The cryptographic accelerator 317 can decrypt and encrypt data as it enters and leaves the SmartNIC 300. For example, some functions may require decrypted data. In that case, the sequencer 120 can first route the data (received by the SmartNIC in an encrypted state) to the cryptographic accelerator 317 for decryption, then to the function for processing, and then back to the cryptographic accelerator 317 for re-encryption before the data is sent from the SmartNIC 300.
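The decrypt, process, re-encrypt ordering described here can be sketched as a route-building helper. In the Python sketch below, the function `build_route`, the stage labels such as `crypto_317:decrypt`, and the policy of re-encrypting as soon as a stage no longer needs plaintext are all assumptions for illustration. No actual cryptography is performed.

```python
def build_route(functions, needs_plaintext):
    """Return the ordered stage list for a flow that arrives encrypted.

    functions:       accelerator functions the flow must visit, in order.
    needs_plaintext: the subset of those functions that require decrypted data.
    """
    route = []
    decrypted = False
    for fn in functions:
        if fn in needs_plaintext and not decrypted:
            route.append("crypto_317:decrypt")
            decrypted = True
        elif fn not in needs_plaintext and decrypted:
            # Assumed policy: re-encrypt as soon as plaintext is not needed.
            route.append("crypto_317:encrypt")
            decrypted = False
        route.append(fn)
    if decrypted:
        route.append("crypto_317:encrypt")  # never leave the NIC decrypted
    return route


print(build_route(["processor_305"], {"processor_305"}))
# ['crypto_317:decrypt', 'processor_305', 'crypto_317:encrypt']
print(build_route(["network_if_325"], set()))
# ['network_if_325']
```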
[0027] The compression accelerator 320 can perform data compression and decompression. For example, a host may send data that should be stored in a network storage device. Instead of host 105 compressing the network flow, host 105 can instruct the compression accelerator 320 in SmartNIC 300 to compress the data in the network flow before transferring the compressed data to the network storage device using the network 130. When SmartNIC 300 receives the compressed data from the network storage device, the compression accelerator 320 can decompress the data before transferring it to host 105. Furthermore, SmartNIC 300 can use the compression accelerator 320 to compress data stored internally.
[0028] In one embodiment, the 3D SmartNIC 300 may include multiple cryptographic accelerators and compression accelerators. For example, layer 315A may include both an AES-XTS cryptographic accelerator and an AES-GCM cryptographic accelerator. The SmartNIC 300 may also include different compression accelerators that perform different compression algorithms.
[0029] Layer 315B includes a fabric accelerator 330, which in one embodiment is implemented using programmable logic. The fabric accelerator 330 can provide connectivity between functions in layer 315A and functions in layer 315C. For example, the fabric accelerator 330 may include a first fabric accelerator for storing data in memory 340 (e.g., RAM) in layer 315C. The accelerator 330 may also include a second fabric accelerator used by the sequencer 120 to communicate with functions in other layers, and a third fabric accelerator used by the cryptographic accelerator 317 or compression accelerator 320 when communicating data between layers. Layers 315A and 315C may also include programmable logic 345 that makes it possible to customize accelerator functions or the communication or ordering between accelerator functions. In one embodiment, programmable logic 345A within layer 315A resides between the host interface accelerator 310 and the processor 305, customizing specific host interface data to provide processing hints to the processor 305 and improve the processor 305's cache efficiency. In another embodiment, programmable logic 345A between the cryptographic accelerator 317 and the network interface accelerator 325 customizes the cryptographic key or cryptographic algorithm used for traffic received by or destined for the network interface accelerator 325. In these examples, the programmable logic 345A functions as a shim to provide customized processing or communication between at least two accelerator functions within layer 315A. Furthermore, layer 315C may also include programmable logic 345B that functions as a shim to allow communication between its hardened components.
[0030] In one embodiment, layer 315B also includes a packet buffer block, similar to packet buffer 205 in layer 315A, or a network key management block. Furthermore, hardened accelerator blocks, such as those in layer 315A (e.g., accelerators 310, 317, 320, or 325), may also be included in layer 315B.
[0031] In one embodiment, layer 315A also includes a fabric accelerator (e.g., one or more fabric accelerator blocks) that provides connectivity between functions within layer 315A. That is, each layer may have its own fabric accelerator to provide communication between functions within that layer, while the fabric accelerator 330 in layer 315B provides connectivity between layers 315A to C.
[0032] Layer 315C includes a host interface 210, a network interface 215, a memory controller 335, and memory 340. Memory 340 can be used to store data for longer than the packet buffer 205. For example, when data moves between different accelerator functions (e.g., different stages of a pipeline), the data can be stored in the packet buffer 205, but when the data has to wait longer, the SmartNIC 300 can store the data in memory 340. Memory 340 can also be used to store accelerator-related metadata, such as the cryptographic key or cryptographic state of the cryptographic accelerator 317.
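The split between short-term holding in the packet buffer 205 and longer-term storage in memory 340 can be expressed as a simple placement rule. In the Python sketch below, the function name `choose_storage`, the threshold `threshold_us`, its default value, and the spill to memory when the buffer has no free slots are assumptions for illustration. The disclosure does not give a threshold or a spill policy.

```python
def choose_storage(expected_wait_us, buffer_free_slots, threshold_us=50.0):
    """Pick where to hold data between pipeline stages.

    expected_wait_us:  estimated time until the next stage consumes the data.
    buffer_free_slots: free entries in the packet buffer.
    threshold_us:      hypothetical cut-off between a short and a long wait.
    """
    if expected_wait_us <= threshold_us and buffer_free_slots > 0:
        return "packet_buffer_205"
    # Long waits (or, by assumption, a full packet buffer) go to memory.
    return "memory_340"


print(choose_storage(expected_wait_us=5.0, buffer_free_slots=8))    # packet_buffer_205
print(choose_storage(expected_wait_us=500.0, buffer_free_slots=8))  # memory_340
print(choose_storage(expected_wait_us=5.0, buffer_free_slots=0))    # memory_340
```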
[0033] As shown, layers 315 can communicate with each other. In one embodiment, layers 315A and 315C communicate using layer 315B, which functions as a fabric layer (i.e., interconnect). In this example, layer 315B functions as an indirect connection between layers 315A and 315C. However, in another embodiment, layers 315A and 315C can communicate directly with each other without passing through the logic in layer 315B. For example, layer 315B may include through-vias (e.g., through-silicon vias) that directly connect bump pads in layer 315C to bump pads in layer 315A. In this way, functions in layer 315A can communicate directly with functions in layer 315C without relying on the fabric accelerator 330 in layer 315B. For example, some functions in layer 315A may communicate directly with layer 315C using these through-vias, while other functions in layer 315A may use the fabric accelerator 330 when communicating with layer 315C. If the SmartNIC 300 is extended to include multiple intermediate layers, these layers can also have through-vias connected to each other as needed to provide a direct connection between the top layer 315A and the bottom layer 315C.
[0034] As shown in Figures 1 to 3, spatially distributing functions across multiple layers allows for tighter coupling between these functions (and between the packet buffer 205 and the sequencer 120) than if all of these hardware elements were arranged on the same 2D plane. For example, if all of these functions were arranged on the same monolithic chip, transferring data between two functions at opposite ends of the chip may incur more latency than transferring data between two functions on different layers. Therefore, using a 3D structure makes it possible to implement more functions in the SmartNIC 300, which makes the SmartNIC 300 more flexible and scalable without the added latency that spreading those functions across a 2D plane would cause.
[0035] Figure 4 shows an example of a 3D SmartNIC 400 with a cryptographic engine 405 in the intermediate layer 415B. That is, the SmartNIC 400 includes three layers 415A-C, with the cryptographic engine 405 located in the intermediate layer 415B, sandwiched between the upper layer 415A and the lower layer 415C. This provides additional physical protection to the cryptographic engine 405. For example, it protects the cryptographic engine 405 from attempts at physical intrusion to access its keys. To access the keys used by the cryptographic engine 405, an attacker would need to disassemble the SmartNIC 400 in a way that allows the SmartNIC 400 to continue operating. By contrast, a 2D SmartNIC does not provide the same protection because its cryptographic engine must be located in a 2D plane that is easily accessible.
[0036] In one embodiment, the cryptographic engine 405 may be located in its own layer 415 within the SmartNIC 400. However, in another embodiment, additional functionality may be located in the same layer 415B as the engine 405. For example, layer 415B may also include the fabric accelerator 330 shown in Figure 3.
[0037] Figure 5 shows a sequencer 120 that, in this example, can be used with various accelerator functions shown in Figure 3, such as the processor 305, the host interface accelerator 310, the cryptographic accelerator 317, the compression accelerator 320, and the network interface accelerator 325. In this embodiment, the sequencer 120 has sub-sequencer modules for communicating with these functions. Specifically, the sequencer 120 includes an I / O sequencer 505 corresponding to the host interface accelerator 310 and the network interface accelerator 325, a processor sequencer 510 corresponding to the processor 305, a cryptographic sequencer 515 corresponding to the cryptographic accelerator 317, and a compression sequencer 520 corresponding to the compression accelerator 320.
[0038] Communication between spatially distributed accelerator functions (e.g., processor 305, host interface accelerator 310, cryptographic accelerator 317, compression accelerator 320, and network interface accelerator 325) and the sequencer 120 can be performed in at least two ways. Firstly, metadata interpreted by either the sequencer 120 or a function includes a turnlist describing the distributed functions that a particular traffic flow must traverse in sequence when tenant data undergoes pipelined acceleration. In other words, metadata corresponding to a traffic flow can define the order in which data should be processed by functions. This turnlist establishes the functions used to process packets in a network flow and the stages of the pipeline that determine the order in which the selected functions process the packets.
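A turnlist carried in per-flow metadata can be modelled with a short Python sketch. The class `FlowMetadata`, its fields, the helper `run_pipeline`, and the handler functions are hypothetical names introduced for illustration. The disclosure only states that the metadata lists, in order, the functions a flow must traverse.

```python
from dataclasses import dataclass


@dataclass
class FlowMetadata:
    tenant_id: str
    turnlist: tuple    # ordered accelerator functions the flow must traverse
    position: int = 0  # index of the next turn to take

    def next_turn(self):
        """Return the next function in the turnlist, or None when done."""
        if self.position >= len(self.turnlist):
            return None
        stage = self.turnlist[self.position]
        self.position += 1
        return stage


def run_pipeline(metadata, handlers, packet):
    """Pass the packet through each stage named in the turnlist, in order."""
    while (stage := metadata.next_turn()) is not None:
        packet = handlers[stage](packet)
    return packet


# Stand-in handlers that only record which stage touched the packet.
handlers = {
    "crypto_317": lambda pkt: pkt + ["decrypted"],
    "compression_320": lambda pkt: pkt + ["compressed"],
    "network_if_325": lambda pkt: pkt + ["sent"],
}

meta = FlowMetadata(
    tenant_id="tenant-1",
    turnlist=("crypto_317", "compression_320", "network_if_325"),
)
result = run_pipeline(meta, handlers, ["raw"])
print(result)  # ['raw', 'decrypted', 'compressed', 'sent']
```

Because the whole order is fixed when the metadata is written, each stage only needs to read the next entry; it does not have to decide where the packet goes.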
[0039] Secondly, the metadata includes a linked list of pipelined acceleration functions to be used for processing the packet. A null pointer in the linked list either indicates an exit point (e.g., a host or network exit), or indicates that the last pipeline stage before the null pointer is expected to set up the next linked-list function (or functions) based on its processing of the packet. In this way, the next stage or function of the pipeline can be dynamically selected while the packet is being processed.
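The linked-list form, including dynamic selection of the next stage at a null pointer, can be sketched in Python as follows. The class `StageNode`, the `choose_next` hook, the `walk` helper, and the `compressible` packet field are hypothetical and only illustrate one way a stage could set up its successor. Here `None` plays the role of the null pointer.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StageNode:
    name: str
    next: Optional["StageNode"] = None
    # Optional hook: lets the last stage before a null pointer pick a
    # successor based on the packet it has just processed.
    choose_next: Optional[Callable[[dict], Optional["StageNode"]]] = None


def walk(head, packet):
    """Follow the list for one packet and return the stage names visited."""
    visited = []
    node = head
    while node is not None:
        visited.append(node.name)
        successor = node.next
        if successor is None and node.choose_next is not None:
            successor = node.choose_next(packet)  # dynamic selection
        node = successor  # None here means an exit point was reached
    return visited


compress = StageNode("compression_320")


def pick_after_crypto(packet):
    # Hypothetical rule: only compressible payloads get a compression stage.
    return compress if packet.get("compressible") else None


head = StageNode(
    "host_if_310",
    next=StageNode("crypto_317", choose_next=pick_after_crypto),
)

print(walk(head, {"compressible": True}))
# ['host_if_310', 'crypto_317', 'compression_320']
print(walk(head, {"compressible": False}))
# ['host_if_310', 'crypto_317']
```

The sketch does not modify the list while walking it, so the same head can serve packets that end up taking different routes.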
[0040] Both of these techniques achieve low latency in traffic flows as they traverse their different functions, and low residency of traffic flows within packet buffer 205, thereby increasing the efficiency of packet buffer 205 for network flows of other tenants.
[0041] Figure 6 is a block diagram of an I / O expansion box 600, including an example SmartNIC 110 and a memory accelerator card, a machine learning accelerator card, or another accelerator card 610. In Figure 6, the host 105 communicates with multiple SmartNICs 110, which may be on separate boards or on the same board. The SmartNICs 110 are then communicatively coupled to the memory accelerator card, the machine learning accelerator card, or another accelerator card 610. The expansion box 600 includes a switch 605 that allows communication between the host 105 and the SmartNICs 110, and between the SmartNICs and the memory accelerator card, the machine learning accelerator card, or another accelerator card. In one embodiment, the switch facilitates cache-coherent and non-cache-coherent communication between the host 105, the SmartNICs 110, and the memory accelerator card, the machine learning accelerator card, or another accelerator card 610. Therefore, the switch 605 can support data transfer between the host 105, the SmartNIC 110, and the memory accelerator card, machine learning accelerator card, or other accelerator card 610 in a cache-coherent manner in which the memory space of the host 105 is shared by the SmartNIC 110 and the memory accelerator card, machine learning accelerator card, or other accelerator card 610, or by using non-coherent data transfer (e.g., direct memory access (DMA) read / write).
[0042] For example, host 105 uses a coherent domain to transfer data that should be sent to all SmartNICs 110 (assuming the data is not too large), but uses a non-coherent domain to transfer large amounts of data, or data destined for only one of the SmartNICs 110.
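The selection rule in the example of paragraph [0042] can be illustrated with a short sketch. It rests on assumptions that are not part of the disclosure: the function `choose_domain`, the `Domain` enumeration, and the 4096-byte threshold for data that is "too large" are hypothetical, and the disclosure does not state how data destined for some but not all SmartNICs is handled, so the sketch arbitrarily sends it over the non-coherent domain.

```python
from enum import Enum
from typing import Iterable


class Domain(Enum):
    COHERENT = "coherent"
    NON_COHERENT = "non-coherent"


# Hypothetical cut-off for "too large"; the disclosure gives no specific size.
COHERENT_MAX_BYTES = 4096


def choose_domain(
    size_bytes: int,
    targets: Iterable[int],
    all_nics: Iterable[int],
    max_coherent_bytes: int = COHERENT_MAX_BYTES,
) -> Domain:
    """Pick the transfer domain the host might use for one piece of data."""
    sent_to_all = set(targets) == set(all_nics)
    if sent_to_all and size_bytes <= max_coherent_bytes:
        # Small data needed by every SmartNIC: use the shared coherent memory space.
        return Domain.COHERENT
    # Large data, or data for only some SmartNICs: use non-coherent transfer (e.g., DMA).
    return Domain.NON_COHERENT


nics = [0, 1, 2]
small_broadcast = choose_domain(512, nics, nics)
large_broadcast = choose_domain(1 << 20, nics, nics)
single_target = choose_domain(512, [1], nics)
```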
[0043] In the preceding, reference is made to embodiments presented in this disclosure. However, the scope of this disclosure is not limited to any specific described embodiments. Rather, any combination of the described features and elements, whether or not they relate to different embodiments, is contemplated to implement and practice the contemplated embodiments. Furthermore, although the embodiments disclosed herein may achieve advantages over other possible solutions or over the prior art, whether or not a particular advantage is achieved by a given embodiment does not limit the scope of this disclosure. Accordingly, the aforementioned aspects, features, embodiments, and advantages are merely illustrative and should not be considered elements or limitations of the appended claims unless expressly stated in the claims.
[0044] As will be understood by those skilled in the art, embodiments disclosed herein may be embodied as a system, a method, or a computer program product. Accordingly, embodiments may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, microcode, etc.), or an embodiment combining software and hardware aspects, all of which may be generally referred to herein as a “circuit,” “module,” or “system.” Furthermore, embodiments may take the form of a computer program product embodied in one or more computer-readable media having computer-readable program code embodied thereon.
[0045] Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof. In the context of this specification, a computer-readable storage medium is any tangible medium that can contain or store a program for use by, or in connection with, an instruction execution system, apparatus, or device.
[0046] A computer-readable signal medium may include, for example, a propagating data signal in which computer-readable program code is embodied, either in baseband or as part of a carrier wave. Such a propagating signal may take any of various forms, including but not limited to electromagnetic, optical, or any suitable combination thereof. A computer-readable signal medium may be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transfer a program for use by or in connection with an instruction execution system, apparatus, or device.
[0047] Program code embodied on a computer-readable medium can be transmitted using any suitable medium, including but not limited to wireless, wireline, fiber optic cable, RF, or any suitable combination thereof.
[0048] Computer program code for performing the operations of the embodiments of this disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java®, Smalltalk, and C++, and conventional procedural programming languages such as the C programming language or similar programming languages. The program code may run entirely on the user's computer, partly on the user's computer, as a standalone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, via the Internet using an Internet service provider).
[0049] Aspects of the present disclosure are described below with reference to the flowcharts and/or block diagrams of the methods, apparatus (systems), and computer program products according to the embodiments presented herein. It will be understood that each block in the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, or another programmable data processing device to produce a machine, such that the instructions, when executed via the processor of the computer or other programmable data processing device, create means for performing the functions/actions specified in the blocks of the flowcharts and/or block diagrams.
[0050] These computer program instructions can also be stored on a computer-readable storage medium that can direct a computer, a programmable data processing device, and/or other devices to function in a particular manner, such that the instructions stored on the computer-readable storage medium produce an article of manufacture including instructions that implement the functions/actions specified in the blocks of the flowcharts and/or block diagrams.
[0051] Computer program instructions can also be loaded into a computer, other programmable data processing device, or other device to cause a series of operational steps to be performed on the computer, other programmable device, or other device, thereby producing a computer-implemented process. Thus, the instructions executed on the computer or other programmable device provide a process for implementing the functions/actions specified in the blocks of the flowcharts and/or block diagrams.
[0052] The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of instructions containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions described in a block may occur in a different order than shown in the figure. For example, two consecutively shown blocks may actually be executed substantially simultaneously, or the blocks may be executed in reverse order depending on the functions involved. It should also be noted that each block in the block diagram and/or flowchart illustrations, and combinations of blocks in the block diagram and/or flowchart illustrations, may be implemented by a dedicated hardware-based system that performs the specified functions or actions, or that carries out combinations of dedicated hardware and computer instructions.
[0053] While the foregoing is directed to specific examples, other and further examples may be devised without departing from the basic scope thereof, and that scope is determined by the claims that follow.
Claims
1. A network interface card (NIC) comprising: a plurality of layers arranged in a stack and communicatively coupled to each other; a plurality of accelerator functions in the plurality of layers; and a sequencer disposed in one of the plurality of layers, wherein the sequencer is configured to coordinate the traffic flow received by the NIC between different accelerator functions in different layers of the plurality of layers to form a pipeline.
2. The NIC according to claim 1, wherein each of the plurality of layers comprises at least one integrated circuit.
3. The NIC according to claim 1, further comprising a packet buffer connected to the plurality of accelerator functions, wherein each different accelerator function is configured to use the packet buffer to temporarily store packets between stages of the pipeline, and each different accelerator function further comprises a packet buffer that forms the stages in the pipeline.
4. The NIC according to claim 1, wherein the plurality of layers comprises at least an uppermost layer, an intermediate layer, and a lowermost layer, and the intermediate layer comprises at least one of: a fabric accelerator implemented using programmable logic, or a cryptographic engine for encrypting or decrypting data in the traffic flow.
5. The NIC according to claim 1, wherein the plurality of layers comprises a first layer including at least two accelerator functions, and the first layer further comprises programmable logic for providing customized processing or communication between the at least two accelerator functions.
6. The NIC according to claim 5, wherein the at least two accelerator functions are formed using fixed logic.
7. A 3D data processing unit (DPU) comprising: a plurality of layers arranged in a stack and communicatively coupled to each other; a plurality of accelerator functions in the plurality of layers; and a sequencer disposed in one of the plurality of layers, wherein the sequencer is configured to coordinate the traffic flow received in the 3D DPU between different accelerator functions in different layers of the plurality of layers to form a pipeline.
8. The 3D DPU according to claim 7, wherein each of the plurality of layers comprises at least one integrated circuit.
9. The 3D DPU according to claim 7, further comprising a packet buffer connected to the plurality of accelerator functions, wherein each different accelerator function is configured to use the packet buffer to temporarily store packets between stages of the pipeline, and each different accelerator function further comprises a packet buffer that forms the stages in the pipeline.
10. The 3D DPU according to claim 7, wherein the plurality of layers comprises at least an uppermost layer, an intermediate layer, and a lowermost layer, and the intermediate layer comprises at least one fabric accelerator implemented using programmable logic.
11. The 3D DPU according to claim 7, wherein the plurality of layers comprises at least an uppermost layer, an intermediate layer, and a lowermost layer, and the intermediate layer comprises an encryption engine for encrypting or decrypting data in the traffic flow.
12. The 3D DPU according to claim 7, wherein the plurality of layers comprises a first layer including at least two accelerator functions, and the first layer further comprises programmable logic for providing customized processing or communication between the at least two accelerator functions.
13. A system comprising: a plurality of NICs, each comprising a plurality of layers arranged in a stack and communicatively coupled to each other, and a plurality of accelerator functions in the plurality of layers; a plurality of accelerator cards; and a switch that enables communication between the plurality of NICs and the plurality of accelerator cards, wherein the plurality of NICs, the plurality of accelerator cards, and the switch are arranged in the same box, and wherein the switch is configured to facilitate both cache-coherent and non-cache-coherent communication between a host, the plurality of NICs, and the plurality of accelerator cards.
14. The system according to claim 13, wherein the switch is configured to enable the host to transfer data to a first NIC among the plurality of NICs using a coherent domain and to transfer data to a second NIC among the plurality of NICs using a non-coherent domain, and wherein the cache-coherent communication allows the plurality of NICs and the plurality of accelerator cards to share the host's memory space.