Aggregating small remote memory access requests

By introducing an asynchronous buffering mechanism in the source and destination NICs, small remote memory operation requests are aggregated into larger messages and sorted and buffered, solving the problem of high network packet overhead in supercomputers and achieving efficient resource utilization and cost reduction.

CN117951051B · Active · Publication Date: 2026-06-26 · HEWLETT PACKARD ENTERPRISE DEV LP

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: HEWLETT PACKARD ENTERPRISE DEV LP
Filing Date: 2023-10-09
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Current supercomputers suffer from high network packet overhead when performing a large number of small remote memory operations. Commodity-based systems are inefficient, while proprietary systems are costly and consume significant memory and CPU resources by aggregating requests through software.

Method used

By introducing an asynchronous buffering mechanism in both the source and destination NICs, multiple small remote memory operation requests are aggregated into a larger message, which is then sorted and buffered on the source and destination sides respectively, reducing network transmission overhead.

Benefits of technology

It effectively reduces the overhead of transmitting small remote memory operation requests over high-bandwidth networks, improves resource utilization efficiency, and reduces system costs.

✦ Generated by Eureka AI based on patent content.


Abstract

The present disclosure relates to aggregating small remote memory access requests. A network interface card (NIC) receives a stream of commands, respective commands including memory operation requests, each request associated with a destination NIC. The NIC asynchronously buffers the requests into queues based on the destination NIC, each queue specific to a corresponding destination NIC. When a first queue of requests reaches a threshold, the NIC aggregates the first queue of requests into a first packet and sends the first packet to the destination NIC. The NIC receives a plurality of packets, a second packet including memory operation requests, each request associated with a same destination NIC and a destination core. The NIC asynchronously buffers the requests of the second packet into queues based on the destination core, each queue specific to a corresponding destination core. When a second queue of requests reaches a threshold, the NIC aggregates the second queue of requests into a third packet and sends the third packet to the destination core.

Description

Technical Field

[0001] Current supercomputer operations may require a large number of small remote memory operations, each potentially carrying some network packet overhead. As link bandwidth increases, maintaining line rates with such small packet sizes may become increasingly difficult and expensive. Systems based on commodity technologies may be ineffective, while systems based on proprietary technologies may be prohibitively expensive.

Description of the Drawings

[0002] Figure 1 shows a diagram illustrating an architecture for facilitating aggregated remote memory operation requests according to aspects of this application.

[0003] Figure 2 shows an exemplary format of a NIC command, including a header and multiple memory operation requests, according to aspects of this application.

[0004] Figure 3 shows a diagram of a source NIC according to an aspect of this application, including a first sorting or asynchronous buffering of memory operation requests based on the destination NIC.

[0005] Figure 4 shows a diagram of a destination NIC according to an aspect of this application, including a second sorting or asynchronous buffering of memory operation requests based on the destination core.

[0006] Figure 5A presents a flowchart illustrating a method for facilitating aggregated remote memory operation requests according to aspects of this application, including a transmission operation performed by a single NIC.

[0007] Figure 5B presents a flowchart illustrating a method for facilitating aggregated remote memory operation requests according to aspects of this application, including a receiving operation performed by the single NIC of Figure 5A.

[0008] Figure 6A presents a flowchart illustrating a method for facilitating aggregated remote memory operation requests according to aspects of this application, including a transmission operation performed by the source NIC.

[0009] Figure 6B presents a flowchart illustrating a method for facilitating aggregated remote memory operation requests according to aspects of this application, including a receiving operation performed by the destination NIC.

[0010] Figure 7 shows a device for facilitating aggregated remote memory operation requests according to aspects of this application.

[0011] In the accompanying drawings, similar reference numerals refer to the same elements.

Detailed Description

[0012] The following description is presented to enable any person skilled in the art to make and use these aspects and examples, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed aspects will be apparent to those skilled in the art, and the general principles defined herein can be applied to other aspects and applications without departing from the spirit and scope of this disclosure. Therefore, the aspects described herein are not limited to those shown, but are to be accorded the widest scope consistent with the principles and features disclosed herein.

[0013] Current supercomputer execution may require a large number of small remote memory operations, each potentially carrying some network packet overhead. As link bandwidth increases, maintaining line rates for such small packets can become increasingly difficult and expensive. Commodity-based systems may be ineffective, while proprietary systems may be prohibitively expensive. Furthermore, current solutions for efficient execution of small remote memory operations often involve aggregating requests via software. However, aggregating requests via software can consume significant amounts of memory and central processing unit (CPU) time, potentially leading to inefficient use of those resources.

[0014] The aspects described in this application address these challenges by providing a system that aggregates numerous small remote memory operation requests into larger messages (e.g., NIC commands), where the NIC (or NIC application-specific integrated circuit (ASIC)) can buffer and process the requests asynchronously for efficient batch service. Requests can be streamed from the host to the source NIC, rather than being sent individually from the host to the source NIC. The source NIC can perform a first ordering or a first asynchronous buffering (i.e., “buffered asynchronous” or “BA”) by placing each request into a source-side queue corresponding to the destination NIC indicated in the request. When a given source-side queue is full, the source NIC can send the data in the full queue as packets to the destination NIC. The destination NIC can receive these packets and perform a second ordering or a second asynchronous buffering by placing each request into a destination-side queue corresponding to the destination core indicated in the request. When a given destination-side queue is full, the destination NIC can send (i.e., stream) the data in the full queue as packets to the destination core.

[0015] Therefore, the described aspects can separate the process of sorting requests from the payload data carried by the requests, which can free software to provide the necessary resources based on the requirements of a given application. An exemplary high-level architecture with a source NIC and a destination NIC is described below in relation to Figure 1. Detailed diagrams of the source (transmitting) NIC and the destination (receiving) NIC are described below in relation to Figure 3 and Figure 4, respectively.

[0016] The terms “asynchronous buffering,” “buffered asynchronous,” and “BA” are used interchangeably in this disclosure and refer to the operations described herein, in which requests are sorted or buffered into queues based on destination information: based on the destination NIC in a first sorting, as described for request sorting unit 124 of Figure 1 and sorting engines 330 of Figure 3; or based on the destination core in a second sorting, as described for request sorting unit 152 of Figure 1 and sorting engines 440 of Figure 4. These requests are subsequently aggregated: when sorted by destination NIC, by data sending unit 134 of Figure 1 and by request aggregation unit 350 and packet sending unit 352 of Figure 3; and when sorted by destination core, by data streaming unit 142 of Figure 1 and by request aggregation unit 424 and packet sending unit 422 of Figure 4.

[0017] The terms “endpoint” and “core” are used interchangeably in this disclosure and refer to one of a plurality of endpoints or cores of a host associated with a given NIC.

[0018] The terms "memory operation request," "remote memory access request," and "remote memory operation" are used interchangeably in this disclosure and refer to a request to access or perform an operation on the host's memory. In this disclosure, these types of requests are typically small in size. For example, 10-15 of these requests might fit into a single NIC command of 256 bytes. An exemplary NIC command with multiple memory operation requests is described below in relation to Figure 2.
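The "10-15 requests per 256-byte command" figure can be checked with a short calculation. The sketch below assumes all requests in a command carry the same payload size (an assumption made only for illustration) and uses the 16-byte BA command header and 4-byte request header given in the description of Figure 2.

```python
BA_HEADER_BYTES = 16      # BA command header size given in the text
REQ_HEADER_BYTES = 4      # per-request header size given in the text
COMMAND_BYTES = 256       # exemplary NIC command size from the text


def requests_per_command(payload_bytes: int) -> int:
    """Number of equal-sized requests that fit in one NIC command.

    Assumes uniform payload sizes, which the text does not require.
    """
    room = COMMAND_BYTES - BA_HEADER_BYTES
    return room // (REQ_HEADER_BYTES + payload_bytes)


for payload in (12, 20):
    print(payload, requests_per_command(payload))
# 12-byte payloads: 240 // 16 = 15 requests
# 20-byte payloads: 240 // 24 = 10 requests
```

Under these assumptions, payloads of roughly 12 to 20 bytes give the 10-15 requests per command mentioned above.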

[0019] Exemplary high-level architecture

[0020] Figure 1 shows a diagram 100 illustrating an architecture for facilitating aggregated remote memory operation requests according to aspects of this application. Diagram 100 may include: a host having cores 110 (including, for example, cores 112, 114, and 116); a NIC 120; a host having cores 160 (including, for example, cores 162, 164, and 166); and a NIC 140. In an exemplary data path, NIC 120 (e.g., a source NIC) may send data to NIC 140 (e.g., a destination NIC) via network 102 (e.g., a high-bandwidth network). NIC 120 may include: a data receiving unit 122; a request sorting unit 124; a first plurality of queues 126, 128, 130, and 132; and a data sending unit 134. NIC 140 may include: a data receiving unit 154; a request sorting unit 152; a second plurality of queues 144, 146, 148, and 150; and a data streaming unit 142.

[0021] Queues 126-132 in the first plurality of queues in the source NIC 120 can each be specific to a corresponding destination NIC. In some aspects, the number of queues in the first plurality of queues can be 4096; for example, NIC 120 can communicate with up to 4096 other NICs. Queues 144-150 in the second plurality of queues in the destination NIC 140 can each be specific to a destination core or endpoint. In some aspects, the number of queues in the second plurality of queues can be 256; for example, NIC 140 can be associated with a host having 256 cores or endpoints. The number of queues in the first plurality of queues of the source NIC 120 and the second plurality of queues of the destination NIC 140 can be greater than or less than these exemplary values and can be based on various factors such as specific application or customer needs, future changes in processor architecture or design, and bandwidth variations.

[0022] Furthermore, based on the current system design, an exemplary size for each queue in the first and second pluralities of queues can be 256 bytes (regardless of the actual number of queues, for example the number of per-core queues in destination NIC 140). As with the number of queues, the size of each queue (i.e., the queue depth) can be greater than or less than this exemplary value, based on the various factors described above.
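As a rough sizing check, the exemplary queue counts and queue depth above imply the following buffer capacities. This is a minimal sketch that assumes every queue is statically allocated at its full depth, which the text does not state; the function name is hypothetical.

```python
# Exemplary values from the text; real designs may use other values.
SOURCE_QUEUES = 4096      # one queue per destination NIC
DEST_QUEUES = 256         # one queue per destination core or endpoint
QUEUE_DEPTH_BYTES = 256   # exemplary size of each queue


def buffer_bytes(num_queues: int, depth_bytes: int) -> int:
    """Total buffer space if every queue is allocated at full depth (assumption)."""
    return num_queues * depth_bytes


source_total = buffer_bytes(SOURCE_QUEUES, QUEUE_DEPTH_BYTES)
dest_total = buffer_bytes(DEST_QUEUES, QUEUE_DEPTH_BYTES)
print(source_total // 1024, "KiB on the source side")       # 1024 KiB (1 MiB)
print(dest_total // 1024, "KiB on the destination side")    # 64 KiB
```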

[0023] During operation, NIC 120 can receive data from one of cores 110 (via, for example, communications 168, 169, and 170) through data receiving unit 122. Instead of the core sending, and the NIC receiving, the data as numerous individual small messages or requests, the core can stream the data to data receiving unit 122 as NIC commands. These streamed NIC commands can indicate that the payload is to be buffered and subsequently processed asynchronously (using the buffered asynchronous or BA method described herein). A respective NIC command can include multiple small remote memory operation requests, each with a header and payload. Each request can indicate a destination NIC and a destination endpoint or core. An exemplary NIC command is described below in relation to Figure 2. Data receiving unit 122 can transmit received NIC commands (via communication 172) to request sorting unit 124. Request sorting unit 124 can process the multiple requests in a NIC command by asynchronously buffering the requests into the first plurality of queues (e.g., 126-132) based on the destination NIC associated with or indicated in each request (via communication 174). When the total size of the requests stored in a given queue reaches a predetermined threshold (e.g., 256 bytes), those requests can be aggregated into a first packet (via communication 176), and data sending unit 134 can send the first packet to the indicated destination NIC (via communication 180 through network 102).

[0024] NIC 140 can receive multiple packets via data receiving unit 154, including a first packet containing requests previously aggregated and stored in a given queue of NIC 120. Recall that each request may indicate a destination NIC and a destination endpoint or core. Continuing with the example of the first packet received by data receiving unit 154, each request in this first packet may indicate the same destination NIC and the destination endpoint or core associated with NIC 140. Data receiving unit 154 may transmit the received packets to request sorting unit 152 (via communication 184). Request sorting unit 152 may process the multiple requests in the first packet by asynchronously buffering the requests into a second plurality of queues (e.g., 144-150) based on the destination endpoint or core associated with or indicated in each request (via communication 186). When the total size of requests stored in a given queue reaches a predetermined threshold (e.g., 256 bytes), these requests can be aggregated into packets (via communication 188), and data streaming unit 142 can send the packets to the indicated destination core (via, for example, communications 190, 191, and 192). Data streaming unit 142 can thus stream packets destined for each specific core, where each packet is the size of the queue (e.g., 256 bytes) and contains a number of smaller remote memory operation requests.

[0025] Therefore, diagram 100 illustrates how the described aspects can reduce the overhead associated with transmitting large numbers of memory operation requests over a high-bandwidth network by using a first sorting (based on the destination NIC) on the source side and a second sorting (based on the destination core) on the destination side to aggregate requests into queues.
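The two-stage sorting described above can be modelled in a few lines of software. The following is an illustrative sketch only, not the NIC's actual logic: the class and field names are hypothetical, and the policy of emitting a queue early rather than letting a packet exceed the threshold is an assumption, since the text only says a queue is sent when it reaches the threshold.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

THRESHOLD_BYTES = 256   # exemplary queue threshold from the text
REQ_HEADER_BYTES = 4    # per-request header size from the text


@dataclass(frozen=True)
class Request:
    dest_nic: int
    dest_core: int
    payload: bytes

    @property
    def size(self) -> int:
        return REQ_HEADER_BYTES + len(self.payload)


class ThresholdSorter:
    """Buffers requests into per-key queues and emits a queue as one packet
    once its stored bytes reach the threshold (hypothetical model)."""

    def __init__(self, key: Callable[[Request], int],
                 threshold: int = THRESHOLD_BYTES):
        self.key = key
        self.threshold = threshold
        self.queues: Dict[int, List[Request]] = defaultdict(list)

    def _fill(self, k: int) -> int:
        return sum(r.size for r in self.queues[k])

    def push(self, req: Request) -> List[Tuple[int, List[Request]]]:
        """Buffer one request; return any packets emitted as (key, requests)."""
        k = self.key(req)
        out = []
        # Assumption: emit early rather than let a packet exceed the threshold.
        if self.queues[k] and self._fill(k) + req.size > self.threshold:
            out.append((k, self.queues.pop(k)))
        self.queues[k].append(req)
        if self._fill(k) >= self.threshold:
            out.append((k, self.queues.pop(k)))
        return out


# Stage 1 (source NIC): sort by destination NIC.
source = ThresholdSorter(key=lambda r: r.dest_nic)
# Stage 2 (destination NIC): sort by destination core.
dest = ThresholdSorter(key=lambda r: r.dest_core)

network_packets = 0
to_cores = []
for i in range(64):
    # 64 requests of 16 bytes each, all for NIC 5, spread over 4 cores.
    req = Request(dest_nic=5, dest_core=i % 4, payload=bytes(12))
    for nic, packet in source.push(req):      # one packet crosses the network
        network_packets += 1
        for r in packet:
            to_cores.extend(dest.push(r))     # packets streamed to cores

print(network_packets, len(to_cores))  # 4 4
```

In this toy run, 64 individual requests cross the modelled network as 4 aggregated packets, and each of the 4 cores then receives a single 256-byte packet.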

[0026] Exemplary format of a buffered asynchronous NIC command

[0027] Figure 2 shows an exemplary format of a NIC command 200 according to aspects of this application, including a header and multiple memory operation requests. The NIC command 200 may be depicted in four-byte segments (e.g., bytes 0-3 (210), 4-7 (212), 8-11 (214), and 12-15 (216)). The NIC command 200 may include a BA command header 218 (indicated by a cross-hatched fill pattern) as its first 16 bytes, which may indicate the length of its payload and that the payload includes memory operation requests to be buffered and aggregated asynchronously. For example, the length of the subsequent payload of command 200 may be 108 bytes.

[0028] The memory operation requests in the payload of NIC command 200 can each include a 4-byte header (indicated by a right-slanting fill pattern) and a corresponding payload. The 4-byte header can indicate at least the following: the destination NIC for the request; the destination core for the request; and the size or length of the payload. For example, request 0 header 220 can indicate the destination NIC, the destination core, and a payload length of 8 bytes, and can be followed by request 0 payload (bytes 0-7) 222. Subsequent requests may have a similar format: request 1 header 224 may indicate its destination NIC, its destination core, and a payload length of 12 bytes, followed by request 1 payload (bytes 0-11) 226; request 2 header 228 may indicate its destination NIC, its destination core, and a payload length of 56 bytes, followed by request 2 payload (bytes 0-15) 230, request 2 payload (bytes 16-31) 232, request 2 payload (bytes 32-47) 234, and request 2 payload (bytes 48-55) 236; and request 3 header 238 may indicate its destination NIC, its destination core, and a payload length of 16 bytes, followed by request 3 payload (bytes 0-3) 240 and request 3 payload (bytes 4-15) 242.

[0029] Although NIC command 200 as depicted includes only a 16-byte BA command header 218 and a subsequent 108-byte payload (for requests 0, 1, 2, and 3), totaling 124 bytes, NIC command 200 can include data up to any predetermined size, such as 256 bytes. Memory operation requests cannot be split across NIC command boundaries.
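The byte counts in this example (108-byte payload, 124 bytes in total) can be reproduced by packing and parsing a command in software. The sketch below is illustrative only: the text gives the header sizes and the fields each header indicates, but not the bit layout, so the field positions, the 2-byte length field in the BA header, and all function names are hypothetical.

```python
import struct

BA_HEADER_BYTES = 16      # BA command header size from the text
REQ_HEADER_BYTES = 4      # per-request header size from the text
MAX_COMMAND_BYTES = 256   # exemplary maximum command size from the text


def pack_request(dest_nic: int, dest_core: int, payload: bytes) -> bytes:
    # Hypothetical layout of the 4-byte header:
    # bits [31:20] destination NIC, [19:12] destination core,
    # bits [11:4] payload length, [3:0] reserved.
    assert dest_nic < 4096 and dest_core < 256 and len(payload) < 256
    word = (dest_nic << 20) | (dest_core << 12) | (len(payload) << 4)
    return struct.pack("<I", word) + payload


def pack_command(requests) -> bytes:
    body = b"".join(pack_request(*r) for r in requests)
    # Hypothetical BA header: 2-byte payload length, then 14 reserved bytes.
    command = struct.pack("<H14x", len(body)) + body
    if len(command) > MAX_COMMAND_BYTES:
        # Requests may not be split across commands, so the caller must
        # start a new command instead.
        raise ValueError("requests do not fit in one NIC command")
    return command


def parse_command(command: bytes):
    (body_len,) = struct.unpack_from("<H", command, 0)
    offset, end = BA_HEADER_BYTES, BA_HEADER_BYTES + body_len
    requests = []
    while offset < end:
        (word,) = struct.unpack_from("<I", command, offset)
        dest_nic = word >> 20
        dest_core = (word >> 12) & 0xFF
        length = (word >> 4) & 0xFF
        offset += REQ_HEADER_BYTES
        requests.append((dest_nic, dest_core, command[offset:offset + length]))
        offset += length
    return requests


# Payload lengths of requests 0-3 in the Figure 2 example: 8, 12, 56, 16 bytes.
# The destination NIC and core values are made up for illustration.
example = [(1, 0, bytes(8)), (2, 3, bytes(12)), (1, 7, bytes(56)), (9, 2, bytes(16))]
command = pack_command(example)
print(len(command) - BA_HEADER_BYTES, len(command))  # 108 124
```

Each request contributes its 4-byte header plus its payload (12 + 16 + 60 + 20 = 108 bytes), which together with the 16-byte BA header gives the 124 bytes stated above.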

[0030] Detailed description of the source / transmitting NIC

[0031] Figure 3 shows a diagram 300 illustrating a source NIC 320 according to an aspect of this application, including a first sorting or asynchronous buffering of memory operation requests based on the destination NIC. NIC 320 can receive data from cores 310 (e.g., from one of cores 312, 314, 316, and 318 via communication 360). NIC 320 may include: a data receiving unit 322; an engine selection unit 324; multiple sorting engines 330 (e.g., eight engines 331, 332, 333, 334, 335, 336, 337, and 338); multiple per-destination NIC queues 340 (e.g., queues 341, 342, 343, 344, 345, 346, and 347); a request aggregation unit 350; and a packet sending unit 352.

[0032] As described above in relation to Figure 1, a core can stream data as NIC commands to the data receiving unit 322. These streamed NIC commands can indicate that the payload is to be buffered and subsequently processed asynchronously (using the buffered asynchronous or BA method described herein). Each NIC command can also include multiple small remote memory operation requests, each with a header and payload, as described above in relation to Figure 2. Each request (in its header) can indicate the destination NIC and the destination endpoint or core.

[0033] Data receiving unit 322 can transmit the received NIC command (via communication 362) to engine selection unit 324. Engine selection unit 324 can select a first engine from a plurality of engines (i.e., sorting engines 330) based on a load balancing strategy and transmit the given NIC command (via communication 364) to the selected engine. Each sorting engine can process a certain amount of data per clock cycle, for example, 16 bytes per clock cycle. As a result, given a plurality of (e.g., 8) sorting engines, sorting engines 330 can together buffer and process approximately 16 * 8 = 128 bytes per clock cycle. A single NIC command is processed entirely by the selected engine. That is, the NIC command is not further fragmented into smaller pieces for processing. Each sorting engine can process the requests in a given NIC command (e.g., at a rate of 16 bytes per clock cycle) and place each request into the appropriate per-destination NIC queue (of queues 340).
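The throughput figures above can be illustrated with a small dispatch model. The text does not specify the load balancing strategy, so the "earliest idle engine" policy below is an assumption, and the class and function names are hypothetical. The engine count and per-engine rate are the exemplary values from the text.

```python
import math

NUM_ENGINES = 8        # exemplary number of sorting engines
BYTES_PER_CYCLE = 16   # exemplary per-engine processing rate


def cycles_for_command(command_bytes: int) -> int:
    """Cycles one engine needs for a whole command (never split across engines)."""
    return math.ceil(command_bytes / BYTES_PER_CYCLE)


class EngineSelector:
    """Hypothetical load balancer: pick the engine that becomes idle first."""

    def __init__(self, num_engines: int = NUM_ENGINES):
        self.busy_until = [0] * num_engines  # cycle at which each engine is idle

    def dispatch(self, now: int, command_bytes: int):
        engine = min(range(len(self.busy_until)),
                     key=lambda e: self.busy_until[e])
        start = max(now, self.busy_until[engine])
        self.busy_until[engine] = start + cycles_for_command(command_bytes)
        return engine, self.busy_until[engine]


selector = EngineSelector()
# Nine 256-byte commands arrive at cycle 0.
results = [selector.dispatch(0, 256) for _ in range(9)]
print(NUM_ENGINES * BYTES_PER_CYCLE)   # 128 bytes per cycle in aggregate
print(cycles_for_command(256))         # 16 cycles per 256-byte command
print(results[-1])                     # (0, 32): ninth command waits for engine 0
```

With these numbers, eight commands are processed in parallel in 16 cycles, and a ninth command queues behind the first engine to become idle.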

[0034] For example, engine selection unit 324 can determine to send a given NIC command to sorting engine 334 (via communication 366). Sorting engine 334 can process the (up to) 256 bytes of the given NIC command by buffering each memory operation request into the correct per-destination NIC queue (e.g., into queues 341-347 via communications 368 and 370).

[0035] When the total size of the requests stored in a given queue of queues 340 reaches a predetermined threshold (e.g., 256 bytes), those requests can be aggregated into a first packet by request aggregation unit 350 (via communication 378). Request aggregation unit 350 can send the first packet to packet sending unit 352 (via communication 380), and packet sending unit 352 can send the first packet to the indicated destination NIC (via communication 382 over a network (not shown)).

[0036] Detailed description of the destination / receiving NIC

[0037] Figure 4 shows a diagram 400 illustrating a destination NIC 420 according to an aspect of this application, including a second sorting or asynchronous buffering of memory operation requests based on the destination core. NIC 420 may include: a data receiving unit 452; an engine selection unit 450; multiple sorting engines 440 (e.g., eight engines 441, 442, 443, 444, 445, 446, 447, and 448); multiple per-destination core queues 430 (e.g., queues 431, 432, 433, 434, 435, 436, and 437); a request aggregation unit 424; and a packet sending unit 422. NIC 420 may stream data packets to cores / endpoints 410 (e.g., to one of cores 412, 414, 416, and 418 via communication 476).

[0038] NIC 420 can receive data from the source NIC (via communication 460 over a network (not shown)). The data may include packets comprising multiple small remote memory operation requests, each with a header and payload, as described above in relation to Figure 2. Each request (in its header) can indicate a destination NIC and a destination endpoint or core. The requests in packets received by destination NIC 420 can indicate the same destination NIC (i.e., NIC 420) and a destination core (i.e., one of cores 410). Data receiving unit 452 can transmit the received packets to engine selection unit 450 (via communication 462). Engine selection unit 450 can select a second engine from a second plurality of engines (i.e., sorting engines 440) based on a load balancing strategy and transmit a given packet to the selected engine (via communication 464). Similar to sorting engines 330 in the source NIC 320 of Figure 3, each of sorting engines 440 can process a certain amount of data per clock cycle, for example, 16 bytes per clock cycle. As a result, given multiple (e.g., 8) sorting engines, sorting engines 440 can together buffer and process approximately 16 * 8 = 128 bytes per clock cycle. A single packet is processed entirely by the selected engine. That is, the packet is not further fragmented into smaller pieces for processing. Each sorting engine can process the requests in a given packet (e.g., at a rate of 16 bytes per clock cycle) and place each request into the appropriate per-destination core queue (of queues 430).

[0039] For example, engine selection unit 450 can determine to send a given packet to sorting engine 445 (via communication 466). Sorting engine 445 can process the (up to) 256 bytes of the given packet by buffering each memory operation request into the correct per-destination core queue (e.g., into queues 431-437 via communications 468 and 470).

[0040] When the total size of the requests stored in a given queue of queues 430 reaches a predetermined threshold (e.g., 256 bytes), those requests can be aggregated into a second packet by request aggregation unit 424 (via communication 472). Request aggregation unit 424 can send the second packet to packet sending unit 422 (via communication 474), and packet sending unit 422 can send the second packet to the indicated destination core (via communication 476). Packet sending unit 422 can be a data streaming unit, that is, a unit that streams multiple packets destined for the various cores associated with destination NIC 420.

[0041] Methods for facilitating the aggregation of remote memory operation requests

[0042] The described aspects may include a single NIC that performs both source (send) and destination (receive) operations, as depicted in Figures 5A and 5B. Figure 5A presents a flowchart 500 illustrating a method for facilitating aggregated remote memory operation requests according to aspects of this application, including a transmission operation performed by a single NIC. During operation, the system receives a command stream via a local network interface card (NIC), wherein respective commands include a first plurality of memory operation requests, each associated with a remote destination NIC and a remote destination core (operation 502). The commands are received by the local NIC as a stream rather than as individual memory operation requests. The local NIC can obtain data as a contiguous array of memory operation requests, including payloads and corresponding destination information (destination NIC and destination core), via, for example, a Peripheral Component Interconnect Express (PCIe) connection, Compute Express Link (CXL), or another host interface or on-chip network. The system asynchronously buffers the requests into a first plurality of queues based on the destination NIC associated with each request, wherein each queue is specific to a corresponding remote destination NIC (operation 504). If the total size of the requests stored in a first queue of the first plurality of queues does not reach a predetermined threshold (decision 506), the system can continue at operation 502 or 504.

[0043] If the total size of the requests stored in the first queue does reach the predetermined threshold (decision 506), the system aggregates the requests stored in the first queue into a first packet (operation 508) and sends the first packet to the remote destination NIC via a high-bandwidth network, thereby reducing the overhead associated with transmitting a large number of memory operation requests over the high-bandwidth network (operation 510). The operation continues at Label A of Figure 5B.

[0044] Figure 5B presents a flowchart 520 illustrating a method for facilitating aggregated remote memory operation requests according to aspects of this application, including a receiving operation performed by the single NIC of Figure 5A. The system receives multiple packets through the local NIC, where a second packet includes a second plurality of memory operation requests, each request destined for the local NIC and associated with a local destination core (operation 522). The system asynchronously buffers the requests of the second packet into a second plurality of queues based on the destination core associated with each request, where each queue is specific to a corresponding local destination core (operation 524). If the total size of the requests stored in a second queue of the second plurality of queues does not reach a predetermined threshold (decision 526), the system can continue at operation 522 or 524.

[0045] If the total size of the requests stored in the second queue of the second plurality of queues does reach the predetermined threshold (decision 526), the system aggregates the requests stored in the second queue into a third packet (operation 528) and sends the third packet to the local destination core, thereby further reducing the overhead associated with transmitting a large number of memory operation requests over a high-bandwidth network (operation 530). The system can determine, via the local NIC, that the total size of the aggregated requests stored in one or more queues of the second plurality of queues has reached the predetermined threshold, and can further stream those aggregated requests (not shown) to the corresponding local destination core specific to the respective queue. The operation returns.

[0046] The described aspects may also include a system comprising two NICs: a first NIC (e.g., a source NIC or a transmitting NIC); and a second NIC (e.g., a destination NIC or a receiving NIC), as described below in relation to Figures 6A and 6B. Figure 6A presents a flowchart 600 illustrating a method for facilitating aggregated remote memory operation requests according to aspects of this application, including a transmission operation performed by a source NIC. During operation, the system receives a command stream via the source network interface card (NIC), wherein respective commands include a first plurality of memory operation requests, each request being associated with a destination NIC and a destination core (operation 602). The system asynchronously buffers the requests into a first plurality of queues based on the destination NIC associated with each request, wherein each queue is specific to a corresponding destination NIC (operation 604). If the total size of the requests stored in a first queue of the first plurality of queues does not reach a predetermined threshold (decision 606), the system may continue at operation 602 or 604.

[0047] If the total size of the requests stored in the first queue does reach the predetermined threshold (decision 606), the system aggregates the requests stored in the first queue into a first packet (operation 608) and sends the first packet to the destination NIC via the high-bandwidth network, thereby reducing the overhead associated with transmitting a large number of memory operation requests over the high-bandwidth network (operation 610). The operation continues at Label B of Figure 6B.

[0048] Figure 6B presents a flowchart 620 illustrating a method for facilitating aggregated remote memory operation requests according to aspects of this application, including a receiving operation performed by a destination NIC. The system receives multiple packets via the destination NIC, including a first packet comprising requests previously aggregated and stored in a queue of the first plurality of queues, where each request is associated with the same destination NIC and a destination core (operation 622). The system asynchronously buffers the requests of the first packet into a second plurality of queues via the destination NIC based on the destination core associated with each request, where each queue is specific to a corresponding destination core (operation 624). If the total size of the requests stored in a second queue of the second plurality of queues does not reach a predetermined threshold (decision 626), the system may continue at operation 622 or 624.

[0049] If the total size of the requests stored in the second queue of the second plurality of queues does reach the predetermined threshold (decision 626), the system aggregates the requests stored in the second queue into a second packet via the destination NIC (operation 628) and sends the second packet to the destination core, thereby reducing the overhead associated with transmitting a large number of memory operation requests over a high-bandwidth network (operation 630). The operation returns.

[0050] Devices for facilitating aggregated remote memory operation requests

[0051] Figure 7 shows a device 700 for facilitating aggregated remote memory operation requests according to aspects of this application. Device 700 may represent a network interface card (NIC), such as the single NIC described in relation to flowcharts 500 and 520 of Figures 5A and 5B, and may include a transmitting unit 710 and a receiving unit 720. The transmitting unit 710 may include: a first command unit 712 (which can perform operations similar to those described above for data receiving units 122 and 322 of Figures 1 and 3, respectively); a first sorting unit 714 (which can perform operations similar to those described above for request sorting unit 124 of Figure 1 and sorting engines 330 of Figure 3); a first queue management unit 716 (which can manage and buffer data in queues 126-132 of Figure 1 and per-destination NIC queues 340 of Figure 3); and a first aggregation communication unit 718 (which can perform operations similar to those described above for data sending unit 134 of Figure 1 and for request aggregation unit 350 and packet sending unit 352 of Figure 3).

[0052] The receiving unit 720 may include: a second command unit 722 (which can perform operations similar to those described above for the data receiving units 154 and 452 of Figure 1 and Figure 4, respectively); a second sorting unit 724 (which can perform operations similar to those described above for the request sorting unit 152 of Figure 1 and the sorting engine 440 of Figure 4); a second queue management unit 726 (which can manage and buffer data in queues 144-150 of Figure 1 and in each destination core queue 430 of Figure 4); and a second aggregation communication unit 728 (which can perform operations similar to those described above for the data flow unit 142 of Figure 1 and for the request aggregation unit 424 and the packet sending unit 422 of Figure 4).

[0053] Generally, the disclosed aspects provide advantageous systems, methods, and apparatuses. In one aspect, the system receives a command stream via a local network interface card (NIC), wherein respective commands include a first plurality of memory operation requests, each request being associated with a remote destination NIC and a remote destination core. The system asynchronously buffers the requests into a first plurality of queues based on the remote destination NIC associated with each request, wherein each queue is specific to a corresponding remote destination NIC. In response to determining that the total size of the requests stored in a first queue of the first plurality of queues reaches a predetermined threshold, the system aggregates the requests stored in the first queue into a first packet and sends the first packet to the remote destination NIC via a high-bandwidth network. The system receives a plurality of packets via the local NIC, wherein a second packet among the received packets includes a second plurality of memory operation requests, each request being destined for the local NIC and associated with a local destination core. The system asynchronously buffers the requests of the second packet into a second plurality of queues based on the local destination core associated with each request, wherein each queue is specific to a corresponding local destination core. In response to determining that the total size of the requests stored in a second queue of the second plurality of queues reaches the predetermined threshold, the system aggregates the requests stored in the second queue into a third packet and sends the third packet to the local destination core.
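
The send side of this aspect can be sketched in the same spirit. The following is a simplified, synchronous model under stated assumptions, not the disclosed design: `SourceNic`, its `submit` method, and the `send` callback standing in for the high-bandwidth network are hypothetical names, and the 12-byte threshold is chosen only to keep the example short.

```python
from collections import defaultdict


class SourceNic:
    """Hypothetical model: one queue per remote destination NIC, flushed at a threshold."""

    def __init__(self, threshold_bytes, send):
        self.threshold = threshold_bytes
        self.send = send                   # callable(dest_nic, packet); stands in for the network
        self.queues = defaultdict(list)    # destination NIC -> list of (dest_core, payload)
        self.sizes = defaultdict(int)      # destination NIC -> buffered byte count

    def submit(self, dest_nic, dest_core, payload):
        # Buffer the request in the queue specific to its destination NIC,
        # keeping the destination core so the far side can sort again.
        self.queues[dest_nic].append((dest_core, payload))
        self.sizes[dest_nic] += len(payload)
        if self.sizes[dest_nic] >= self.threshold:
            # Aggregate the whole queue into one packet and send it.
            self.send(dest_nic, list(self.queues[dest_nic]))
            self.queues[dest_nic].clear()
            self.sizes[dest_nic] = 0


sent = []
nic = SourceNic(threshold_bytes=12, send=lambda dest, pkt: sent.append((dest, pkt)))
stream = [(7, 0, b"1234"), (9, 1, b"abcd"), (7, 2, b"5678"), (7, 0, b"90ab")]
for dest_nic, dest_core, payload in stream:
    nic.submit(dest_nic, dest_core, payload)
print(sent)
```

Three 4-byte requests for NIC 7 leave as a single packet instead of three, which is the overhead reduction the aspect describes; the lone request for NIC 9 remains queued until more traffic for that NIC arrives.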

[0054] In a variation on this aspect, a first plurality of engines of the local NIC asynchronously buffers the requests into the first plurality of queues, and the system selects, based on a load balancing strategy, a first engine of the first plurality of engines to asynchronously buffer the requests from each command.

[0055] In another variant, a second plurality of engines of the local NIC asynchronously buffers the requests into the second plurality of queues, and the system selects, based on a load balancing strategy, a second engine of the second plurality of engines to asynchronously buffer the requests from each packet.
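
The text does not specify the load balancing strategy. As one possible illustration only, the sketch below assumes a least-loaded policy; the `Engine` class, its `pending` counter, and `select_engine` are hypothetical names, and a round-robin or hash-based policy would fit the description equally well.

```python
class Engine:
    """Hypothetical sorting engine that tracks how many requests it has been given."""

    def __init__(self, engine_id):
        self.engine_id = engine_id
        self.pending = 0   # assumed load measure: requests assigned so far


def select_engine(engines):
    """Pick the engine with the fewest pending requests; ties go to the lowest id."""
    return min(engines, key=lambda e: (e.pending, e.engine_id))


engines = [Engine(i) for i in range(4)]
assignments = []
for requests_in_command in [3, 1, 2, 1, 1]:
    engine = select_engine(engines)
    engine.pending += requests_in_command   # this engine buffers every request of the command
    assignments.append(engine.engine_id)
print(assignments)
```

Each command (or, on the receive side, each packet) is handed to one engine as a unit, so the requests it carries are buffered by the same engine.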

[0056] In another variant, the commands are received by the local NIC as a stream of commands rather than as individual memory operation requests.

[0057] In another variant, the local NIC receives the command stream by retrieving, over a Peripheral Component Interconnect Express (PCIe) connection, data from a contiguous array of memory operation requests, each of which includes a payload and corresponding destination information.
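
To make the idea of a contiguous array of requests concrete, the sketch below packs and parses such an array. The record layout is an assumption chosen for illustration (a 16-bit destination NIC, an 8-bit destination core, and an 8-bit payload length, followed by the payload); the text does not define the actual format, and `pack_requests` and `iter_requests` are hypothetical helpers.

```python
import struct

# Assumed per-request header: dest NIC (u16), dest core (u8), payload length (u8), little-endian.
HEADER = struct.Struct("<HBB")


def pack_requests(requests):
    """Lay out (dest_nic, dest_core, payload) requests back to back in one buffer."""
    out = bytearray()
    for dest_nic, dest_core, payload in requests:
        out += HEADER.pack(dest_nic, dest_core, len(payload))
        out += payload
    return bytes(out)


def iter_requests(buffer):
    """Walk the contiguous buffer and yield each request with its destination information."""
    offset = 0
    while offset < len(buffer):
        dest_nic, dest_core, length = HEADER.unpack_from(buffer, offset)
        offset += HEADER.size
        payload = buffer[offset:offset + length]
        offset += length
        yield dest_nic, dest_core, payload


command = pack_requests([(7, 0, b"\x01\x02"), (9, 3, b"\xff")])
parsed = list(iter_requests(command))
print(len(command), parsed)
```

Because the requests sit back to back, the NIC can fetch the whole array in one transfer and then split it locally, rather than receiving each small request separately.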

[0058] In another variant, the system determines that the total size of the aggregated requests stored in one or more queues of the second plurality of queues reaches the predetermined threshold. The system then streams, via the local NIC, the aggregated requests stored in those queues to the destination core specific to each respective queue.

[0059] In another variant, a respective remote destination core corresponds to one of multiple destination endpoints associated with the remote destination NIC.

[0060] In another variant, a respective memory operation request is associated with a payload smaller than a predetermined size.

[0061] In another variant, each command received via the local NIC can be up to 256 bytes in size.

[0062] In another variant, the first plurality of queues comprises 4096 queues, and the second plurality of queues comprises 256 queues.
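
As a rough, illustrative estimate of what these queue counts imply for buffer memory, the arithmetic below multiplies the number of queues by a flush threshold. The queue counts come from the text; the 4096-byte threshold is an assumption (the text only says "predetermined threshold"), and the estimate ignores per-request metadata and the small overshoot caused by the request that triggers a flush.

```python
SOURCE_QUEUES = 4096      # first plurality of queues (from the text)
DEST_QUEUES = 256         # second plurality of queues (from the text)
THRESHOLD_BYTES = 4096    # assumed flush threshold; not specified in the text


def approx_buffer_bytes(num_queues, threshold_bytes):
    """Approximate worst case: every queue is filled up to the threshold at once."""
    return num_queues * threshold_bytes


source_bytes = approx_buffer_bytes(SOURCE_QUEUES, THRESHOLD_BYTES)
dest_bytes = approx_buffer_bytes(DEST_QUEUES, THRESHOLD_BYTES)
print(source_bytes // 2**20, "MiB on the send side")
print(dest_bytes // 2**20, "MiB on the receive side")
```

Under this assumed threshold the send side needs on the order of 16 MiB and the receive side on the order of 1 MiB of queue storage.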

[0063] In another variant, a respective command indicates in its header that the memory operation requests are to be buffered and aggregated asynchronously.

[0064] In another aspect, an apparatus or NIC includes: a first command module for receiving a command stream, wherein respective commands include a first plurality of memory operation requests, each request being associated with a remote destination NIC and a remote destination core; a first sorting module for asynchronously buffering the requests into a first plurality of queues based on the remote destination NIC associated with each request, wherein each queue is specific to a corresponding remote destination NIC; a first aggregation communication module for, in response to determining that the total size of the requests stored in a first queue of the first plurality of queues has reached a predetermined threshold, aggregating the requests stored in the first queue into a first packet and sending the first packet to the remote destination NIC via a high-bandwidth network; a second command module for receiving a plurality of packets, wherein a second packet among the received packets includes a second plurality of memory operation requests, each request being destined for a local NIC and associated with a local destination core; a second sorting module for asynchronously buffering the requests of the second packet into a second plurality of queues based on the local destination core associated with each request, wherein each queue is specific to a corresponding local destination core; and a second aggregation communication module for, in response to determining that the total size of the requests stored in a second queue of the second plurality of queues has reached the predetermined threshold, aggregating the requests stored in the second queue into a third packet and sending the third packet to the local destination core.

[0065] In another aspect, a system includes a local NIC (e.g., a source NIC) and a remote NIC (e.g., a destination NIC). The local NIC includes: a first command module for receiving a command stream, wherein respective commands include a first plurality of memory operation requests, each request being associated with a remote destination NIC and a remote destination core; a first sorting module for asynchronously buffering the requests into a first plurality of queues based on the remote destination NIC associated with each request, wherein each queue is specific to a corresponding remote destination NIC; and a first aggregation communication module for, in response to determining that the total size of the requests stored in a first queue of the first plurality of queues has reached a predetermined threshold, aggregating the requests stored in the first queue into a first packet and sending the first packet to the remote destination NIC via a high-bandwidth network. The remote NIC includes: a second command module for receiving the first packet, which comprises the requests previously aggregated and stored in the first queue, wherein each request is destined for the remote NIC and associated with a remote destination core; a second sorting module for asynchronously buffering the requests of the first packet into a second plurality of queues based on the remote destination core associated with each request, wherein each queue is specific to a corresponding remote destination core; and a second aggregation communication module for, in response to determining that the total size of the requests stored in a second queue of the second plurality of queues has reached the predetermined threshold, aggregating the requests stored in the second queue into a third packet and sending the third packet to the remote destination core.

[0066] The foregoing descriptions of the various aspects have been presented for illustrative and descriptive purposes only. They are not intended to be exhaustive or to limit the aspects described herein to the disclosed forms. Therefore, many modifications and variations will be apparent to those skilled in the art. Furthermore, the foregoing disclosure is not intended to limit the aspects described herein. The scope of the aspects described herein is defined by the appended claims.

Claims

1. A computer-implemented method, comprising: receiving a command stream via a local network interface card (NIC), wherein respective commands include a first plurality of memory operation requests, each request being associated with a remote destination NIC and a remote destination core; asynchronously buffering the requests into a first plurality of queues based on the remote destination NIC associated with each request, wherein each queue of the first plurality of queues is specific to a corresponding remote destination NIC, and wherein the number of queues in the first plurality of queues is based on the number of remote destination NICs with which the local NIC can communicate; in response to determining that the total size of the requests stored in a first queue of the first plurality of queues, the first queue being specific to a first corresponding remote destination NIC, reaches a predetermined threshold, aggregating the requests stored in the first queue into a first packet and sending the first packet to the remote destination NIC via a high-bandwidth network; receiving a plurality of packets via the local NIC, wherein a second packet among the received packets includes a second plurality of memory operation requests, each request being destined for the local NIC and associated with a local destination core; asynchronously buffering the requests of the second packet into a second plurality of queues based on the local destination core associated with each request, wherein each queue of the second plurality of queues is specific to a corresponding local destination core, and wherein the number of queues in the second plurality of queues is based on the number of local destination cores; and in response to determining that the total size of the requests stored in a second queue of the second plurality of queues, the second queue being specific to a first corresponding local destination core, reaches the predetermined threshold, aggregating the requests stored in the second queue into a third packet and sending the third packet to the local destination core.

2. The method according to claim 1, wherein a first plurality of engines of the local NIC asynchronously buffers the requests into the first plurality of queues, and wherein the method further comprises selecting, based on a load balancing strategy, a first engine of the first plurality of engines to asynchronously buffer the requests from each command.

3. The method according to claim 1, wherein a second plurality of engines of the local NIC asynchronously buffers the requests into the second plurality of queues, and wherein the method further comprises selecting, based on a load balancing strategy, a second engine of the second plurality of engines to asynchronously buffer the requests from each packet.

4. The method according to claim 1, wherein the commands are received by the local NIC as a stream of commands rather than as individual memory operation requests.

5. The method according to claim 1, further comprising: receiving the command stream, by the local NIC, by retrieving data from a contiguous array of memory operation requests over a Peripheral Component Interconnect Express (PCIe) connection, the memory operation requests including a payload and corresponding destination information.

6. The method according to claim 1, further comprising: determining that the total size of the aggregated requests stored in one or more queues of the second plurality of queues reaches the predetermined threshold; and streaming, by the local NIC, the aggregated requests stored in the one or more queues of the second plurality of queues to the destination core specific to each respective queue.

7. The method according to claim 1, wherein a respective remote destination core corresponds to one of multiple destination endpoints associated with the remote destination NIC.

8. The method according to claim 1, wherein a respective memory operation request is associated with a payload smaller than a predetermined size.

9. The method according to claim 1, wherein each command received via the local NIC can be up to 256 bytes in size.

10. The method according to claim 1, wherein the first plurality of queues includes 4096 queues, and the second plurality of queues includes 256 queues.

11. The method according to claim 1, wherein a respective command indicates in its header that the memory operation requests are to be buffered and aggregated asynchronously.

12. A first network interface card (NIC), comprising an integrated circuit configured to: receive a command stream, wherein respective commands include a first plurality of memory operation requests, each request being associated with a remote destination NIC and a remote destination core; asynchronously buffer the requests into a first plurality of queues based on the remote destination NIC associated with each request, wherein each queue of the first plurality of queues is specific to a corresponding remote destination NIC, and wherein the number of queues in the first plurality of queues is based on the number of remote destination NICs with which the first NIC can communicate; in response to determining that the total size of the requests stored in a first queue of the first plurality of queues, the first queue being specific to a first corresponding remote destination NIC, reaches a predetermined threshold, aggregate the requests stored in the first queue into a first packet and send the first packet to the remote destination NIC via a high-bandwidth network; receive a plurality of packets, wherein a second packet among the received packets includes a second plurality of memory operation requests, each request being destined for the first NIC and associated with a local destination core; asynchronously buffer, by a second sorting module, the requests of the second packet into a second plurality of queues based on the local destination core associated with each request, wherein each queue of the second plurality of queues is specific to a corresponding local destination core, and wherein the number of queues in the second plurality of queues is based on the number of local destination cores; and in response to determining that the total size of the requests stored in a second queue of the second plurality of queues, the second queue being specific to a first corresponding local destination core, reaches the predetermined threshold, aggregate the requests stored in the second queue into a third packet and send the third packet to the local destination core.

13. The first NIC according to claim 12, wherein the integrated circuit is further configured to: manage and store, in the first plurality of queues, the requests buffered based on the remote destination NIC associated with each request; and manage and store, in the second plurality of queues, the requests of the second packet that are buffered by the second sorting module based on the local destination core associated with each request.

14. The first NIC according to claim 12, wherein a first plurality of engines asynchronously buffers the requests into the first plurality of queues, and wherein the integrated circuit is further configured to select, based on a load balancing strategy, a first engine of the first plurality of engines to asynchronously buffer the requests from each command.

15. The first NIC according to claim 12, wherein a second plurality of engines asynchronously buffers the requests into the second plurality of queues, and wherein the integrated circuit is further configured to select, based on a load balancing strategy, a second engine of the second plurality of engines to asynchronously buffer the requests from each packet.

16. The first NIC according to claim 12, wherein the integrated circuit is further configured to: receive the command stream by retrieving data from a contiguous array of memory operation requests over a Peripheral Component Interconnect Express (PCIe) connection, the memory operation requests including a payload and corresponding destination information.

17. The first NIC according to claim 12, wherein a respective queue of the second plurality of queues corresponds to one of a plurality of local destination cores or destination endpoints associated with the first NIC.

18. The first NIC according to claim 12, wherein a respective memory operation request is associated with a payload smaller than a predetermined size; wherein each received command and each received packet can be up to 256 bytes in size; and wherein the first plurality of queues includes 4096 queues, and the second plurality of queues includes 256 queues.

19. A system, comprising: a local network interface card (NIC) including a first integrated circuit configured to: receive a command stream, wherein respective commands include a first plurality of memory operation requests, each request being associated with a remote destination NIC and a remote destination core; asynchronously buffer the requests into a first plurality of queues based on the remote destination NIC associated with each request, wherein each queue of the first plurality of queues is specific to a corresponding remote destination NIC, and wherein the number of queues in the first plurality of queues is based on the number of remote destination NICs with which the local NIC can communicate; and in response to determining that the total size of the requests stored in a first queue of the first plurality of queues, the first queue being specific to a first corresponding remote destination NIC, reaches a predetermined threshold, aggregate the requests stored in the first queue into a first packet and send the first packet to the remote destination NIC via a high-bandwidth network; and a remote NIC including a second integrated circuit configured to: receive the first packet, the first packet including the requests previously aggregated and stored in the first queue, wherein each request is destined for the remote NIC and associated with a remote destination core; asynchronously buffer the requests of the first packet into a second plurality of queues based on the remote destination core associated with each request, wherein each queue of the second plurality of queues is specific to a corresponding remote destination core, and wherein the number of queues in the second plurality of queues is based on the number of local destination cores; and in response to determining that the total size of the requests stored in a second queue of the second plurality of queues, the second queue being specific to a first corresponding local destination core, reaches the predetermined threshold, aggregate the requests stored in the second queue into a second packet and send the second packet to the remote destination core.

20. The system according to claim 19, wherein the first integrated circuit of the local NIC is further configured to manage the buffered requests and store them in the first plurality of queues based on the remote destination NIC associated with each request; wherein the second integrated circuit of the remote NIC is further configured to manage the buffered requests and store them in the second plurality of queues based on the local destination core associated with each request; wherein the local NIC further includes a first plurality of engines that asynchronously buffers the requests into the first plurality of queues; wherein the first integrated circuit of the local NIC is further configured to select, based on a load balancing strategy, a first engine of the first plurality of engines to asynchronously buffer the requests from each command; wherein the remote NIC further includes a second plurality of engines that asynchronously buffers the requests into the second plurality of queues; and wherein the second integrated circuit of the remote NIC is further configured to select, based on a load balancing strategy, a second engine of the second plurality of engines to asynchronously buffer the requests from each packet.