Early credit value return for credit value based flow control
By employing a credit-based flow control mechanism and a chiplet protocol interface (CPI) network in the chiplet system, the problems of buffer overload and low communication efficiency are solved, achieving efficient data transmission and improved system performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- MICRON TECHNOLOGY INC
- Filing Date
- 2021-06-22
- Publication Date
- 2026-06-12
AI Technical Summary
Existing credit-based flow control systems suffer from low communication efficiency and buffer overload in chip systems, especially when dealing with bursts of traffic. This causes the transmitting device to wait for the buffer to be read, increasing processing cycles and power consumption.
A credit-based flow control mechanism is adopted, in which the receiving device returns the credit value before the data packet is read, and the sending device sends the data packet according to the available credit value, ensuring that the buffer does not overflow, and achieving efficient communication through the chiplet protocol interface (CPI) network.
It improves the communication efficiency of the chip system, reduces the time and power consumption of the transmitting device in the waiting state, shortens the data transmission delay, and improves system performance.
Figure CN116250218B_ABST
Abstract
Description
[0001] Priority application
[0002] This application claims priority to U.S. Application No. 17/007,516, filed August 31, 2020, the entire contents of which are incorporated herein by reference.
[0003] Statement regarding government support
[0004] This invention was developed with the support of the U.S. government under DARPA Agreement HR00111830003. The U.S. government holds certain rights to this invention.
Technical Field
[0005] Embodiments of this disclosure generally relate to managing packet-based network communications using credit-value-based flow control, wherein flow control is managed by credit values and “credit value returns,” and more specifically to early credit value returns in systems using credit-value-based flow control, which in selected embodiments may be implemented in packet-based communications in chiplet systems.
Background Technology
[0006] In a credit-based flow control system, the sending device reduces its available credit value before sending data to the receiving device. The receiving device buffers the data. In a conventional credit-based system, after the data is removed from the buffer and processed, the receiving device sends a response message to the sending device. In response to receiving the response message, the sending device increases its available credit value.
[0007] Chiplets are an emerging technology for integrating various processing functionalities. Typically, a chiplet system consists of discrete modules (each referred to as a "chiplet") integrated on an interposer layer and, in many instances, interconnected via one or more established networks as needed to provide the required functionality to the system. The interposer layer and the contained chiplets may be packaged together to facilitate interconnection with other components of a larger system. Each chiplet may contain one or more individual integrated circuits or "chips" (ICs), which may be combined with discrete circuit components and coupled together to a corresponding substrate for attachment to the interposer layer. Most or all of the chiplets in the system will be individually configured for communication via the one or more established networks.
[0008] Chiplets, as individual modules of a system, differ from a system implemented on a single chip (e.g., a System-on-a-Chip (SoC)) that contains different device blocks (e.g., intellectual property (IP) blocks) on one substrate (e.g., a single die), and from multiple discrete packaged devices integrated on a printed circuit board (PCB). Generally, chiplets offer better performance (e.g., lower power consumption, reduced latency, etc.) than discrete packaged devices, and offer greater manufacturing benefits than a single-die chip. These manufacturing benefits may include higher yields or reduced development costs and time.
[0009] A chiplet system may comprise, for example, one or more application (or processor) chiplets and one or more support chiplets. Here, the distinction between application and support chiplets is merely a reference to possible design scenarios for chiplet systems. Thus, for example, a synthetic vision chiplet system may comprise (by way of example only) an application chiplet for generating the synthetic vision output, and support chiplets such as a memory controller chiplet, a sensor interface chiplet, or a communication chiplet. In a typical use case, the synthetic vision designer may design the application chiplet and obtain the support chiplets from other sources. Design expenditure (e.g., in terms of time or complexity) is therefore reduced, because the designer avoids designing and manufacturing the functionality embodied in the support chiplets. Chiplets also support the tight integration of IP blocks that might otherwise be difficult to integrate, such as IP blocks manufactured using different processing technologies or with different feature sizes (or utilizing different contact technologies or pitches). Multiple ICs or IC assemblies with different physical, electrical, or communication characteristics can therefore be assembled in a modular manner to provide an assembly that implements the desired functionality. Chiplet systems also facilitate adaptation to the needs of the different larger systems into which the chiplet system will be incorporated. In one example, an IC or other assembly can be optimized for the power, speed, or heat generation of a specific function, as might be the case with a sensor, and can be integrated with other devices more easily than if integration were attempted on a single die. Furthermore, because the overall die size is reduced, the yield for chiplets tends to be higher than that of more complex single-die devices.
Attached Figure Description
[0010] This disclosure will be more fully understood from the detailed description given below and the accompanying drawings of various embodiments thereof. However, the drawings should not be construed as limiting this disclosure to the specific embodiments, but are for illustration and understanding only.
[0011] Figures 1A and 1B show an example of a chiplet system according to one embodiment.
[0012] Figure 2 shows components of an example of a memory controller chiplet according to one embodiment.
[0013] Figure 3 shows an example of routing between chiplets in a chiplet layout using a chiplet protocol interface (CPI) network according to one embodiment.
[0014] Figure 4 is a block diagram of a packet buffer using credit-based flow control, and of a packet command buffer suitable for use in early credit value return in a system using credit-based flow control, according to some embodiments of the present disclosure.
[0015] Figure 5 is a block diagram of data packets suitable for use in a system using credit-based flow control, according to some embodiments of this disclosure.
[0016] Figure 6 is a block diagram of data packets suitable for use in early credit value return in a system using credit value-based flow control, according to some embodiments of this disclosure.
[0017] Figure 7 is a flowchart illustrating the operation of a method for performing early credit value return in a system using credit value-based flow control, executed by two circuits according to some embodiments of the present disclosure.
[0018] Figure 8 is a flowchart illustrating the operation of a method for performing early credit value return in a system using credit value-based flow control, executed by two circuits according to some embodiments of the present disclosure.
[0019] Figure 9 is a flowchart illustrating the operation of a method for performing early credit value return in a system using credit value-based flow control, executed by a circuit according to some embodiments of the present disclosure.
[0020] Figure 10 is a block diagram of an example computer system in which embodiments of this disclosure may operate.
Detailed Implementation
[0021] Figure 1, described below, provides an example of a chiplet system and the components operating therein. Within the context of this chiplet system, flow control can be managed via “credit values” and “credit value returns.” In this system, when communication between two devices is initiated, one or both devices allocate buffer space to store data received from the other device, and the amount of buffer space allocated (expressed as a credit value) is communicated to the other device, as discussed in more detail below. In the implementation of the credit value/credit value return system described herein, a device receiving a packet containing a command may issue a credit value return to the source “early,” that is, before the receiving device has read the received packet. In one instance, a credit value return may be issued when the first command is stored in the buffer. Credit value-based flow control can be used to prevent buffer overload in networks with bursty traffic. Credit value-based flow control can be used between chiplets within a chip, between chips within a device, between devices in a local area network or a wide area network, or in any suitable combination thereof.
[0022] An example system utilizing the described credit value return mechanism is a modular memory system, in which discrete modules (each module referred to as a "chiplet") containing one or more semiconductor devices or other components providing corresponding functionality of the memory system are assembled, and the discrete modules are interconnected via one or more communication interfaces to provide memory system functionality. Chiplet-based systems, such as the modular memory system described, offer numerous advantages, allowing beneficial modifications to a portion of the system without requiring a complete redesign of the memory system or its components. Examples of such beneficial modifications may include, for example, adding additional storage capacity; adding memory based on different storage technologies (and potentially requiring dedicated interfaces or control functionality); adapting to modifications to the memory controller or memory communication bus; and so on.
[0023] In some examples of this chiplet-based system, communication between individual chiplets can be managed locally.
[0024] When communication between two devices is initiated, one or both devices allocate buffer space to store data received from the other device. The allocating device informs the other device of the amount of buffer space allocated by sending a data packet indicating the number of available credits. In some exemplary embodiments, the transmission of the data packet indicating the number of credits is skipped, and a predetermined number of credits is allocated automatically.
[0025] Before sending a packet, the sending device confirms that sufficient credits are available. If sufficient credits are available, the number of available credits is reduced by the cost of the packet (e.g., one credit) and the packet is sent. In some example embodiments, the cost of the packet is based on the packet size.
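The sender-side bookkeeping described in paragraphs [0024] and [0025] can be sketched as follows. This is only an illustration under stated assumptions, not the disclosed implementation: the class name `CreditSender`, its method names, and the default cost of one credit per packet are hypothetical.

```python
class CreditSender:
    """Hypothetical sender-side credit counter (all names are illustrative)."""

    def __init__(self, initial_credits):
        # Credits advertised by the receiver, or a predetermined default.
        self.credits = initial_credits
        self.sent = []

    def try_send(self, packet, cost=1):
        """Send only if enough credits remain; return True when sent."""
        if cost > self.credits:
            return False  # hold the packet until credits are returned
        self.credits -= cost  # spend the packet's cost before transmitting
        self.sent.append(packet)
        return True

    def on_credit_return(self, returned):
        # A credit return from the receiver restores sending capacity.
        self.credits += returned


sender = CreditSender(initial_credits=2)
results = [sender.try_send(p) for p in ("a", "b", "c")]  # third send is held
sender.on_credit_return(1)
retry = sender.try_send("c")  # succeeds once a credit has come back
```

The third packet is held rather than dropped, which is the waiting state that the early credit value return is intended to shorten.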
[0026] The receiving device stores at least a portion of the data packet in a buffer. For example, the data packet may contain a 32-bit field with one to four command bytes, as indicated by a four-bit mask field. Active command bytes are copied to the buffer, and inactive command bytes are discarded. Therefore, if each entry in the buffer stores four bytes, a 32-bit field will always consume no more than one entry, but may consume as little as a quarter of an entry.
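As a rough sketch of the masking step in the paragraph above, the function below keeps only the active command bytes of a 32-bit field. The function name and the bit-to-byte mapping (mask bit i selects byte i, counting from the least significant byte) are assumptions made for the example; the source does not specify the ordering.

```python
def active_command_bytes(field, mask):
    """Return the command bytes of a 32-bit field selected by a 4-bit mask.

    Assumption: mask bit i marks byte i as active, where byte 0 is the
    least significant byte of the field.
    """
    if not 0 <= field < 1 << 32:
        raise ValueError("field must fit in 32 bits")
    if not 0 <= mask < 1 << 4:
        raise ValueError("mask must fit in 4 bits")
    return bytes(
        (field >> (8 * i)) & 0xFF
        for i in range(4)
        if mask & (1 << i)  # inactive bytes are discarded
    )


# Bytes 0 and 2 are active, so two of the four bytes are kept.
kept = active_command_bytes(0xAABBCCDD, 0b0101)
```

Here `kept` holds two bytes, so the packet would occupy half of a four-byte buffer entry.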
[0027] To ensure that the buffer does not overflow, the sending device spends one credit value for each buffer entry that the transmitted data packet may consume. Therefore, even if a transmitted data packet contains only one byte of data, the credit value for an entire entry is consumed. However, the receiving device may be able to store the received data entirely in an already-allocated entry. For example, suppose the first data packet contains three bytes of data and the second data packet contains one byte of data. In this instance, the data from both packets could be stored in a single entry, but the sending device would reduce its available credit value by two.
[0028] In existing systems, the receiving device returns a credit value only after reading data from the buffer. As discussed herein, when received data is added to the buffer without consuming a new entry, a response packet is sent to the sending device before the data is read from the buffer. The response packet returns the credit value to the sending device. This allows the sending device to continue sending data without waiting for the buffer to be read, thus enabling more efficient use of the buffer in communication between the two devices.
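The three-byte and one-byte example from paragraphs [0027] and [0028] can be traced with a small receiver model. This is a sketch under assumptions, not the disclosed circuit: the class `EarlyReturnReceiver`, its methods, and the choice to pack command bytes contiguously across four-byte entries are hypothetical.

```python
ENTRY_BYTES = 4  # each buffer entry stores four bytes, as in the example above


class EarlyReturnReceiver:
    """Hypothetical receiver that packs command bytes into 4-byte entries."""

    def __init__(self, entries):
        self.capacity = entries * ENTRY_BYTES
        self.data = bytearray()

    def entries_used(self):
        # Number of entries touched by the buffered bytes (rounded up).
        return (len(self.data) + ENTRY_BYTES - 1) // ENTRY_BYTES

    def receive(self, command_bytes):
        """Buffer the bytes; return the credits to send back early (0 or 1)."""
        if not 1 <= len(command_bytes) <= ENTRY_BYTES:
            raise ValueError("expected one to four command bytes")
        if len(self.data) + len(command_bytes) > self.capacity:
            raise OverflowError("sender exceeded its credits")
        before = self.entries_used()
        self.data += command_bytes
        # No new entry consumed: the credit spent on this packet can be
        # returned before the buffer is read.
        return 1 if self.entries_used() == before else 0

    def read_all(self):
        """Drain the buffer; return (data, credits freed by the read)."""
        freed = self.entries_used()
        data = bytes(self.data)
        self.data.clear()
        return data, freed


rx = EarlyReturnReceiver(entries=2)
first = rx.receive(b"\x01\x02\x03")  # opens a new entry: no early return
second = rx.receive(b"\x04")         # fits in the same entry: early return
drained, freed = rx.read_all()
```

In this model the sender spends two credits, gets one back early and one when the buffer is read, so the credits outstanding always equal the entries in use.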
[0029] One benefit of the selected embodiments of this disclosure is that the transmitting device spends less time in the state where packet transmission is suspended. Therefore, the processing cycles consumed in the waiting state and in rechecking the number of available credit values are reduced. Furthermore, power consumption in the waiting state is reduced. The performance of the system including the communication device is also improved due to the reduced latency in data transmission. This communication efficiency can provide specific benefits when applied to communications to and from memory devices and/or memory controllers, as the waiting time of one or more processors or processor cores, for example, while waiting for memory read results, can also affect various aspects of system performance. Other benefits will be apparent to those skilled in the art who have obtained the benefits of this disclosure.
[0030] Figures 1A and 1B show an example of a chiplet system 110 according to one embodiment. Figure 1A is an illustration of the chiplet system 110 mounted on a peripheral board 105, which can be connected to a broader computer system, for example, via peripheral component interconnect express (PCIe). The chiplet system 110 includes a package substrate 115, an interposer 120, and four chiplets: an application chiplet 125, a host interface chiplet 135, a memory controller chiplet 140, and a memory device chiplet 150. Other systems may include many additional chiplets to provide additional functionality, as will be apparent from the following discussion. The package of the chiplet system 110 is shown with a cap or cover 165, but other packaging techniques and structures for chiplet systems may be used. Figure 1B is a block diagram that labels the components of the chiplet system for clarity.
[0031] Application chiplet 125 is shown as including a network on chip (NOC) 130 to support a chiplet network 155 for inter-chiplet communication. In an example embodiment, NOC 130 may be included on application chiplet 125. In one example, NOC 130 may be defined in response to the selected support chiplets (e.g., chiplets 135, 140, and 150), thereby enabling the designer to select an appropriate number of chiplet network connections or switches for NOC 130. In one example, NOC 130 may reside on a separate chiplet or even within interposer 120. In the examples discussed herein, NOC 130 implements a chiplet protocol interface (CPI) network.
[0032] CPI is a packet-based network that supports virtual channels to enable flexible, high-speed interaction between chiplets. CPI bridges chiplet-internal networks to chiplet network 155. For example, the Advanced eXtensible Interface (AXI) is a widely used specification for designing intra-chip communication. However, the AXI specification covers a large number of physical design options, such as the number of physical channels, signal timing, and power. Within a single chip, these options are typically selected to meet design goals such as power consumption and speed. To achieve flexibility in a chiplet system, however, an adapter such as CPI is used to interface between the various AXI design options that may be implemented in the various chiplets. By mapping physical channels to virtual channels and encapsulating time-based signaling in a packetized protocol, CPI bridges chiplet-internal networks across chiplet network 155.
[0033] CPI can use a variety of different physical layers to transmit packets. A physical layer may consist of simple conductive connections, or it may include drivers to increase voltage or otherwise facilitate signal transmission over longer distances. An example of such a physical layer is the Advanced Interface Bus (AIB), which in various instances may be implemented in interposer 120. The AIB transmits and receives data using source-synchronous data transfer with a forwarded clock. Packets are transmitted across the AIB at Single Data Rate (SDR) or Double Data Rate (DDR) relative to the transmitted clock. The AIB supports various channel widths. When operating in SDR mode, the AIB channel width is a multiple of 20 bits (20, 40, 60, …), and for DDR mode, the AIB channel width is a multiple of 40 bits (40, 80, 120, …). The AIB channel width includes both transmitted and received signals. Channels can be configured with a symmetrical number of transmit (TX) and receive (RX) inputs/outputs (I/O), or with an asymmetrical number of transmitters and receivers (e.g., all transmitters or all receivers). A channel can act as either the AIB master or slave depending on which chiplet provides the master clock. The AIB I/O units support three timing modes: asynchronous (i.e., non-clocked), SDR, and DDR. In various instances, the non-clocked mode is used for the clock and some control signals. SDR mode can use a dedicated SDR-only I/O unit or a dual-purpose SDR/DDR I/O unit.
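The channel-width rule stated above (multiples of 20 bits in SDR mode, multiples of 40 bits in DDR mode) can be expressed as a small check. The function and table names are hypothetical and the sketch covers only the width rule, not any other AIB constraint.

```python
# Width granularity in bits for each clocked mode, per the paragraph above.
AIB_WIDTH_STEP = {"SDR": 20, "DDR": 40}


def is_valid_aib_width(mode, width_bits):
    """Return True if the width is a positive multiple of the mode's step."""
    step = AIB_WIDTH_STEP[mode]  # raises KeyError for an unknown mode
    return width_bits > 0 and width_bits % step == 0


sdr_60 = is_valid_aib_width("SDR", 60)    # 60 is a multiple of 20
ddr_60 = is_valid_aib_width("DDR", 60)    # 60 is not a multiple of 40
ddr_120 = is_valid_aib_width("DDR", 120)  # 120 is a multiple of 40
```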
[0034] In one example, the CPI packet protocol (e.g., point-to-point or routable) can use symmetrical receive and transmit I/O units within an AIB channel. The CPI streaming protocol allows for more flexible utilization of AIB I/O units. In one example, the AIB channel for streaming mode can be configured with I/O units as all TX, all RX, or half TX and half RX. The CPI packet protocol can use the AIB channel in either SDR or DDR operating modes. In one example, the AIB channel is configured in increments of 80 I/O units (i.e., 40 TX and 40 RX) for SDR mode and in increments of 40 I/O units for DDR mode. The CPI streaming protocol can use the AIB channel in either SDR or DDR operating modes. Here, in one example, the AIB channel is configured in increments of 40 I/O units for both SDR and DDR modes. In one example, a unique interface identifier is assigned to each AIB channel. The identifier is used during CPI reset and initialization to determine the paired AIB channels across neighboring chiplets. In one example, the interface identifier is a 20-bit value comprising a seven-bit chiplet identifier, a seven-bit column identifier, and a six-bit link identifier. The AIB physical layer uses an AIB out-of-band shift register to transmit the interface identifier. Bits 32-51 of the shift register are used to transmit the 20-bit interface identifier in both directions across the AIB interface.
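The 20-bit interface identifier described above can be illustrated with simple bit packing. The field widths (7, 7, and 6 bits) and the shift-register position (bits 32-51) come from the text; the field order within the identifier, the choice of bit 32 as the identifier's least significant bit, and all function names are assumptions made for this sketch.

```python
CHIPLET_BITS, COLUMN_BITS, LINK_BITS = 7, 7, 6  # widths from the text
ID_BITS = CHIPLET_BITS + COLUMN_BITS + LINK_BITS  # 20 bits in total
ID_SHIFT = 32  # identifier occupies shift-register bits 32-51


def pack_interface_id(chiplet, column, link):
    """Pack the three fields; chiplet in the high bits is an assumption."""
    for value, bits in ((chiplet, CHIPLET_BITS),
                        (column, COLUMN_BITS),
                        (link, LINK_BITS)):
        if not 0 <= value < 1 << bits:
            raise ValueError("field out of range")
    return (chiplet << (COLUMN_BITS + LINK_BITS)) | (column << LINK_BITS) | link


def unpack_interface_id(ident):
    """Split a 20-bit identifier back into (chiplet, column, link)."""
    link = ident & ((1 << LINK_BITS) - 1)
    column = (ident >> LINK_BITS) & ((1 << COLUMN_BITS) - 1)
    chiplet = (ident >> (COLUMN_BITS + LINK_BITS)) & ((1 << CHIPLET_BITS) - 1)
    return chiplet, column, link


def place_in_shift_register(ident):
    """Position the identifier at bits 32-51 of an otherwise empty register."""
    return ident << ID_SHIFT


def extract_from_shift_register(register):
    """Read bits 32-51 of the register as the identifier."""
    return (register >> ID_SHIFT) & ((1 << ID_BITS) - 1)


ident = pack_interface_id(chiplet=5, column=3, link=9)
register = place_in_shift_register(ident) | 0xFFFFFFFF  # lower bits in use
```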
[0035] AIB defines a set of stacked AIB channels as an AIB channel column. An AIB channel column has a certain number of AIB channels, plus auxiliary channels. The auxiliary channels contain signals used for AIB initialization. All AIB channels within a column (except for the auxiliary channels) have the same configuration (e.g., all TX, all RX, or half TX and half RX, and have the same number of data I / O signals). In one example, AIB channels are numbered in consecutive ascending order, starting with the AIB channel adjacent to the AUX channel. The AIB channel adjacent to the AUX is designated as AIB channel 0.
[0036] AIB channels are typically configured as half TX data and half RX data, all TX data, or all RX data, plus associated clock and miscellaneous control. In some example implementations, the number of TX data signals relative to the number of RX data signals is determined at design time and cannot be configured as part of system initialization.
[0037] The CPI packet protocol (point-to-point and routable) uses symmetrical receive and transmit I/O units within the AIB channel. The CPI streaming protocol allows for more flexible use of the AIB I/O units. In some example implementations, the I/O units for streaming mode can be configured as all TX, all RX, or half TX and half RX.
[0038] Typically, the CPI interface on an individual chiplet may include serialization-deserialization (SERDES) hardware. SERDES interconnects are well-suited for scenarios requiring high-speed signaling and low signal counts. However, SERDES can introduce additional power consumption and longer latency for multiplexing and demultiplexing, error detection or correction (e.g., using block-level cyclic redundancy check (CRC)), link-level retries, or forward error correction. However, when low latency or power consumption is a primary concern for ultra-short-range chiplet-to-chiplet interconnects, parallel interfaces that allow data transfer with minimal latency can be utilized. CPIs contain elements designed to minimize both latency and power consumption in these ultra-short-range chiplet interconnects.
[0039] For flow control, CPI employs a credit-based technique. A receiver, such as application chiplet 125, provides a sender, such as memory controller chiplet 140, with credit values that represent available buffers. In one example, the CPI receiver contains a buffer for each virtual channel for a given unit of transmission time. Therefore, if the CPI receiver supports five messages in time and a single virtual channel, the receiver has five buffers arranged into five entries (e.g., one entry per unit time). If four virtual channels are supported, the receiver has twenty buffers arranged into five entries. Each buffer holds the payload of one CPI packet.
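The buffer arithmetic in the paragraph above (one buffer per virtual channel per unit of transmission time) can be checked with a short sketch. The function names and the use of `None` for an empty buffer are illustrative assumptions.

```python
def allocate_receive_buffers(time_units, virtual_channels):
    """Return one entry per time unit, each with one buffer per channel."""
    return [[None] * virtual_channels for _ in range(time_units)]


def buffer_count(entries):
    """Total number of packet buffers across all entries."""
    return sum(len(entry) for entry in entries)


single_vc = allocate_receive_buffers(time_units=5, virtual_channels=1)
four_vc = allocate_receive_buffers(time_units=5, virtual_channels=4)
```

Both layouts have five entries; only the number of buffers per entry changes with the number of virtual channels.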
[0040] When a sender transmits to a receiver, it decrements its available credit value based on the transmission. Once all of the credit value for the receiver has been consumed, the sender stops sending packets to the receiver. This ensures that the receiver always has an available buffer to store a transmission.
[0041] When the receiver processes a received packet and frees its buffer, it communicates the available buffer space back to the sender. The sender can then use this credit value return to transmit additional information.
[0042] Also shown is a chiplet mesh network 160 that uses direct chiplet-to-chiplet technology without requiring NOC 130. The chiplet mesh network 160 can be implemented in CPI or another chiplet-to-chiplet protocol. The chiplet mesh network 160 typically enables a chiplet pipeline, in which one chiplet acts as the interface to the pipeline while the other chiplets in the pipeline interface only with one another.
[0043] Additionally, dedicated device interfaces, such as one or more industry-standard memory interfaces 145 (e.g., synchronous memory interfaces, such as DDR5, DDR6), can also be used to interconnect chiplets. Connections from a chiplet system or from individual chiplets to external devices (e.g., a larger system) can be made via a desired interface (e.g., a PCIe interface). In one example, such an external interface can be implemented via a host interface chiplet 135, which, in the depicted example, provides a PCIe interface external to the chiplet system 110. Such dedicated interfaces are typically used when industry practice or standards have converged on them. The illustrated example of a DDR interface 145 connecting the memory controller chiplet 140 to a dynamic random access memory (DRAM) memory device 150 exemplifies this industry practice.
[0044] Among the various possible support chiplets, the memory controller chiplet 140 is likely to be present in the chiplet system 110, due to the nearly ubiquitous use of memory devices for computer processing and the current level of maturity in memory device technology. Using a memory device chiplet 150 and a memory controller chiplet 140 produced by others therefore gives chiplet system designers access to robust products from established manufacturers. Typically, the memory controller chiplet 140 provides a memory device-specific interface for reading, writing, or erasing data. The memory controller chiplet 140 can often provide additional features as well, such as error detection, error correction, maintenance operations, or atomic operation execution. For some types of memory, maintenance operations are often specific to the memory device 150, such as garbage collection in NAND flash or storage-class memory, or temperature regulation (e.g., cross-temperature management) in NAND flash memory. In one instance, maintenance operations may involve logical-to-physical (L2P) mapping or management to provide a level of indirection between the physical and logical representations of data. In other types of memory, such as DRAM, some memory operations, such as refresh, may be controlled at some times by the host processor or memory controller, and at other times by the DRAM memory device or by logic associated with one or more DRAM devices, such as an interface chip (in one example, a buffer).
[0045] Atomic operations are data manipulations that can be performed, for example, by the memory controller chiplet 140. In other chiplet systems, atomic operations can be performed by other chiplets. For example, an atomic operation can be specified as an "increment" in a command from the application chiplet 125, the command containing a memory address and possibly an increment value. Upon receiving the command, the memory controller chiplet 140 retrieves a number from the specified memory address, increments the number by the amount specified in the command, and stores the result. Upon successful completion, the memory controller chiplet 140 provides the application chiplet 125 with an indication that the command was successful. Atomic operations avoid transmitting data across the chiplet network 160, thereby reducing the latency of executing these commands.
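The "increment" example above can be sketched as a read-modify-write that stays inside the memory controller, so that only the command and a completion indication cross the chiplet network. The class name, the dictionary standing in for the memory device, the default value of zero for an unwritten address, and the response format are all assumptions for illustration.

```python
class AtomicMemoryControllerSketch:
    """Hypothetical model of a controller executing an atomic increment."""

    def __init__(self):
        # address -> integer value; stands in for the attached memory device
        self.memory = {}

    def atomic_increment(self, address, amount=1):
        """Retrieve, increment, and store locally; return only a status."""
        value = self.memory.get(address, 0)  # assumed default for new addresses
        self.memory[address] = value + amount
        # The operand data never travels back across the network.
        return {"status": "ok"}


controller = AtomicMemoryControllerSketch()
controller.memory[0x40] = 10
reply = controller.atomic_increment(0x40, amount=5)
```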
[0046] Atomic operations can be classified as built-in atomics or programmable (e.g., custom) atomics. Built-in atomics are a finite set of operations implemented immutably in hardware. Programmable atomics are small programs that can execute on a programmable atomic unit (PAU) (e.g., a custom atomic unit (CAU)) of the memory controller chiplet 140. Figure 2 illustrates an example of a memory controller chiplet that includes a PAU.
[0047] The memory device chiplet 150 may be or contain any combination of volatile memory devices or non-volatile memory devices. Examples of volatile memory devices include (but are not limited to) random access memory (RAM), such as DRAM, synchronous DRAM (SDRAM), graphics double data rate type 6 SDRAM (GDDR6 SDRAM), etc. Examples of non-volatile memory devices include (but are not limited to) NAND flash memory, storage-class memory (e.g., phase-change memory or memristor-based technology), ferroelectric RAM (FeRAM), etc. The example shown includes the memory device 150 as a chiplet; however, the memory device 150 may reside elsewhere, such as in a different package on the peripheral board 105. For many applications, multiple memory device chiplets may be provided. In one example, these memory device chiplets may each implement one or more memory technologies. In one example, a memory chiplet may contain multiple stacked memory dies of different technologies, such as one or more SRAM devices stacked with or otherwise communicating with one or more DRAM devices. The memory controller 140 can also be used to coordinate operation among multiple memory chiplets in the chiplet system 110; for example, utilizing one or more memory chiplets in one or more levels of cache storage, and using one or more additional memory chiplets as main memory. The chiplet system 110 may also include multiple memory controllers 140, which can be used to provide memory control functionality for individual processors, sensors, networks, etc. Chiplet architectures such as the chiplet system 110 offer advantages in allowing adaptation to different memory storage technologies and different memory interfaces through updated chiplet configurations, without redesigning the rest of the system architecture.
[0048] Figure 2 shows components of an example of a memory controller chiplet 205 according to one embodiment. The memory controller chiplet 205 includes a cache 210, a cache controller 215, an off-die memory controller 220 (e.g., for communicating with off-die memory 275), a network communication interface 225 (e.g., for interfacing with chiplet network 285 and communicating with other chiplets), and a set of atomic and merge operations 250. Members of this set may include, for example, a write merge unit 255, a hazard unit 260, a built-in atomic unit 265, or a PAU 270. The components are illustrated logically and not necessarily as they would be implemented. For example, the built-in atomic unit 265 may comprise different devices along the path to off-die memory. For example, a built-in atomic unit may reside in an interface device/buffer on the memory chiplet, as discussed above. In contrast, the programmable atomic operations 270 may be implemented in a separate processor on the memory controller chiplet 205 (but in various instances, may be implemented elsewhere, such as on the memory chiplet itself).
[0049] The off-die memory controller 220 is directly coupled to the off-die memory 275 (e.g., via a bus or other communication connection) to provide write and read operations to and from one or more off-die memories, such as off-die memory 275 and off-die memory 280. In the depicted example, the off-die memory controller 220 is also coupled to provide output to the atomic and merge operations unit 250 and to receive input from the cache controller 215 (e.g., a memory-side cache controller).
[0050] In the example configuration, the cache controller 215 is directly coupled to the cache 210, and can be coupled to the network communication interface 225 for input (e.g., incoming read or write requests) and coupled to provide output to the off-die memory controller 220.
[0051] Network communication interface 225 includes packet decoder 230, network input queue 235, packet encoder 240, and network output queue 245 to support a packet-based chiplet network 285, such as CPI. Chiplet network 285 can provide packet routing between and among processors, memory controllers, hybrid threading processors, configurable processing circuitry, or communication interfaces. In this packet-based communication system, each packet typically contains destination and source addressing, as well as any data payload or instructions. In one instance, depending on the configuration, chiplet network 285 may be implemented as a collection of crossbar switches with a folded Clos configuration, or as a mesh network providing additional connectivity.
[0052] In various instances, the chiplet network 285 may be part of an asynchronous switching fabric. Here, data packets can be routed along any of various paths, such that, depending on the route, any selected data packet may arrive at the addressed destination at any of multiple different times. Alternatively, the chiplet network 285 may be implemented at least partially as a synchronous communication network, such as a synchronous mesh communication network. Both configurations of communication network are contemplated for use in examples according to this disclosure.
[0053] The memory controller chip 205 can receive packets having, for example, a source address, a read request, and a physical address. In response, the off-die memory controller 220 or the cache controller 215 reads data from the specified physical address (which may be in off-die memory 275 or cache 210) and assembles a response packet for the source address containing the requested data. Similarly, the memory controller chip 205 can receive packets having a source address, a write request, and a physical address. In response, the memory controller chip 205 writes data to the specified physical address (which may be in cache 210 or off-die memory 275 or 280) and assembles a response packet for the source address acknowledging that the data has been stored in memory.
[0054] Therefore, where possible, the memory controller chiplet 205 may receive read and write requests via chiplet network 285 and process the requests using cache controller 215, which interfaces with cache 210. If cache controller 215 cannot process the request, off-chip memory controller 220 processes the request by communicating with off-chip memory 275 or 280, atomic and merge operations 250, or both. As described above, one or more cache levels may also be implemented in off-chip memory 275 or 280; and in some such instances, they may be directly accessed by cache controller 215. Data read by off-chip memory controller 220 may be cached in cache 210 by cache controller 215 for later use.
[0055] Atomic and merge operation unit 250 is coupled (as input) to receive the output of off-die memory controller 220 and provides its output to cache 210, network communication interface 225, or directly to chiplet network 285. Memory danger clearing (reset) unit 260, write merging unit 255, and built-in (e.g., predetermined) atomic operation unit 265 can each be implemented as a state machine with other combinational logic circuitry (e.g., adders, shifters, comparators, AND gates, OR gates, XOR gates, or any suitable combination thereof) or other logic circuitry. These components may also include one or more registers or buffers to store operands or other data. PAU 270 can be implemented as one or more processor cores or control circuitry, and various state machines with other combinational logic circuitry or other logic circuitry, and may also include one or more registers, buffers, or memories to store addresses, executable instructions, operands, and other data, or may be implemented as a processor.
[0056] Write merging unit 255 receives read data and request data, and merges the request data and read data to create a single unit having the read data and the source address to be used in the response or return data packet. Write merging unit 255 provides the merged data to the write port of cache 210 (or equivalently, to cache controller 215 for writing to cache 210). Optionally, write merging unit 255 provides the merged data to network communication interface 225 to encode and prepare response or return data packets for transmission on chiplet network 285.
[0057] When requested data is used for a built-in atomic operation, the built-in atomic operation unit 265 receives the request and reads the data from the write merging unit 255 or directly from the off-die memory controller 220. The atomic operation is performed, and using the write merging unit 255, the resulting data is written to the cache 210 or provided to the network communication interface 225 to encode and prepare response or return packets for transmission on the chiplet network 285.
[0058] Built-in atomic operation unit 265 handles predefined atomic operations, such as fetch and increment or compare and swap. In one instance, these operations perform simple read-modify-write operations on a single memory location of 32 bytes or less. An atomic memory operation begins with a request packet transmitted via chiplet network 285. The request packet has a physical address, atomic operator type, operand size, and optionally up to 32 bytes of data. The atomic operation performs a read-modify-write operation on a cache line of cache 210, thereby filling the cache as needed. The atomic operator response can be a simple complete response or a response with up to 32 bytes of data. Example atomic memory operators include fetch and AND, fetch and OR, fetch and XOR, fetch and add, fetch and subtract, fetch and increment, fetch and decrement, fetch and minimum, fetch and maximum, fetch and swap, and compare and swap. In various example embodiments, 32-bit and 64-bit operations and operations on 16 or 32 bytes of data are supported. The methods disclosed herein are also compatible with hardware that supports larger or smaller operations and more or less data.
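The read-modify-write behavior of the listed operators can be sketched in software as follows. This is an illustrative model only, not the circuitry of unit 265: the `memory` dictionary, the function names, the unsigned wrap-around arithmetic, and the choice to return the previously stored value are all assumptions made for the example.

```python
# Illustrative model of built-in atomic operators as read-modify-write steps.
# `memory` is a hypothetical dict mapping a physical address to an integer.

def fetch_and_op(memory, addr, op, operand=0, width_bits=64):
    """Apply `op` at `addr` and return the value that was read (assumed)."""
    mask = (1 << width_bits) - 1
    old = memory.get(addr, 0)
    results = {
        "and": old & operand,
        "or": old | operand,
        "xor": old ^ operand,
        "add": old + operand,
        "sub": old - operand,
        "inc": old + 1,
        "dec": old - 1,
        "min": min(old, operand),
        "max": max(old, operand),
        "swap": operand,
    }
    # Truncate to the operand width (unsigned wrap-around is an assumption).
    memory[addr] = results[op] & mask
    return old


def compare_and_swap(memory, addr, expected, new, width_bits=64):
    """Store `new` only if the current value equals `expected`."""
    mask = (1 << width_bits) - 1
    old = memory.get(addr, 0)
    if old == expected:
        memory[addr] = new & mask
    return old
```

In hardware the read, modify, and write steps are performed as one indivisible operation on the cache line; the sketch shows only the data transformation.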
[0059] Built-in atomic operations may also involve requests for "standard" atomic operations on the requested data, such as relatively simple single-cycle integer atomics (e.g., fetch and increment or compare and swap), whose throughput is the same as that of conventional memory read or write operations that do not involve an atomic operation. For these operations, cache controller 215 can typically reserve a cache line in cache 210 by setting a danger bit (in hardware), so that the cache line cannot be read by another process while it is in transition. Data is obtained from off-die memory 275 or cache 210 and provided to built-in atomic operation unit 265 to perform the requested atomic operation. After the atomic operation, in addition to providing the resulting data to packet encoder 240 to encode outgoing packets for transmission on chiplet network 285, built-in atomic operation unit 265 also provides the resulting data to write merging unit 255, which writes the resulting data back to cache 210. After the resulting data is written to cache 210, memory danger clearing unit 260 clears any corresponding danger bit that was set.
[0060] The PAU 270 implements high-performance (high throughput and low latency) programmable atomic operations (also known as "custom atomic operations"), comparable to the performance of built-in atomic operations. Instead of performing multiple memory accesses, in response to an atomic operation request specifying a programmable atomic operation and a memory address, the circuitry in the memory controller chiplet 205 passes the atomic operation request to the PAU 270 and sets a danger bit stored in a memory danger register corresponding to the memory address of the memory line used in the atomic operation. This ensures that no other operation (read, write, or atomic operation) is performed on the memory line, and the danger bit is then cleared after the atomic operation is completed. The additional direct data path provided to the PAU 270 for performing programmable atomic operations allows for additional write operations without being limited by the bandwidth of the communication network and without increasing any congestion on the communication network.
[0061] The PAU 270 includes a multi-threaded processor, such as a RISC-V ISA-based multi-threaded processor, with one or more processor cores and an extended instruction set for performing programmable atomic operations. When equipped with the extended instruction set for performing programmable atomic operations, the PAU 270 can be embodied as one or more hybrid-threaded processors. In some example implementations, the PAU 270 provides barrel-style instantaneous thread switching to maintain a high instruction-per-clock rate.
[0062] Programmable atomic operations can be executed by PAU 270 in response to requests for programmable atomic operations on the requested data. Users can prepare programming code to provide these programmable atomic operations. For example, programmable atomic operations can be relatively simple multi-cycle operations, such as floating-point addition, or relatively complex multi-instruction operations, such as Bloom filter insert. Programmable atomic operations can be the same as or different from predetermined atomic operations, as long as they are defined by the user and not the system vendor. For these operations, cache controller 215 can reserve a cache line in cache 210 by setting a danger bit (in hardware), so that the cache line cannot be read by another process while it is in transition. Data is obtained from cache 210 or off-die memory 275 or 280 and provided to PAU 270 to execute the requested programmable atomic operation. After the atomic operation, PAU 270 provides the resulting data to network communication interface 225 to directly encode outgoing data packets containing the resulting data for transmission on chiplet network 285. Furthermore, PAU 270 provides the resulting data to cache controller 215, which in turn writes the resulting data to cache 210. After the resulting data is written to cache 210, cache controller 215 clears any corresponding danger bit that was set.
[0063] In selected examples, the approach taken for programmable atomic operations is to provide multiple general-purpose custom atomic request types that can be sent from an originating source, such as a processor or other system component, to the memory controller chiplet 205 via chiplet network 285. Cache controller 215 or off-die memory controller 220 recognizes the request as a custom atomic operation and forwards it to PAU 270. In a representative embodiment, PAU 270: (1) is a programmable processing element capable of efficiently performing user-defined atomic operations; (2) performs load and store operations on memory, arithmetic and logical operations, and control flow decisions; and (3) utilizes a RISC-V ISA with a set of new, specialized instructions to facilitate interaction with these controllers 215, 220, thereby performing user-defined operations atomically. In desirable examples, the RISC-V ISA contains a complete instruction set supporting high-level language operators and data types. The PAU 270 can utilize a RISC-V ISA, but will typically support a more limited instruction set and a limited register file size to reduce the die size of the unit when included within the memory controller chiplet 205.
[0064] As mentioned above, before writing read data to cache 210, the memory danger clearing unit 260 clears the set danger bits of the reserved cache line. Therefore, when the write merging unit 255 receives a request and read data, the memory danger clearing unit 260 can send a reset or clear signal to cache 210 to reset the set memory danger bits of the reserved cache line. Furthermore, resetting this danger bit also releases pending read or write requests involving the specified (or reserved) cache line, thereby providing the pending read or write requests to the inbound request multiplexer for selection and processing.
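The reserve, hold, and release behavior of the danger bits can be modeled roughly as below. This is a software sketch under stated assumptions, not the hardware design: the class name `DangerBitTracker`, its method names, and the per-line queue of held requests are hypothetical.

```python
from collections import defaultdict, deque


class DangerBitTracker:
    """Hypothetical model: one danger bit per cache line plus held requests."""

    def __init__(self):
        self.reserved = set()               # lines whose danger bit is set
        self.pending = defaultdict(deque)   # requests held per reserved line

    def reserve(self, line):
        """Set the danger bit for a cache line."""
        self.reserved.add(line)

    def submit(self, line, request):
        """Return True if the request may proceed now; otherwise hold it."""
        if line in self.reserved:
            self.pending[line].append(request)
            return False
        return True

    def clear(self, line):
        """Clear the danger bit and return held requests in arrival order."""
        self.reserved.discard(line)
        return list(self.pending.pop(line, ()))
```

The list returned by `clear` stands in for the pending requests that are handed to the inbound request multiplexer once the line is released.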
[0065] Figure 3 illustrates an example of routing between chiplets in a chiplet layout 300 using a chiplet protocol interface (CPI) network according to one embodiment. Chiplet layout 300 includes chiplets 310A, 310B, 310C, 310D, 310E, 310F, 310G, and 310H. Chiplets 310A-310H are interconnected via a network including nodes 330A, 330B, 330C, 330D, 330E, 330F, 330G, and 330H. Each of chiplets 310A-310H includes a hardware transceiver, labeled 320A-320H.
[0066] CPI packets can be passed between chiplets 310 using the Advanced Interface Bus (AIB). The AIB provides physical layer functionality. The physical layer transmits and receives data using source-synchronous data transfers with a forwarded clock. Packets are transferred across the AIB at single data rate (SDR) or dual data rate (DDR) with respect to the transmitted clock. The AIB supports various channel widths. When operating in SDR mode, the AIB channel width is a multiple of 20 bits (20, 40, 60…), and in DDR mode, the AIB channel width is a multiple of 40 bits (40, 80, 120…). The AIB channel width includes both transmit and receive signals. Channels can be configured with a symmetrical number of transmit (TX) and receive (RX) inputs/outputs (I/O), or with an asymmetrical number of transmitters and receivers (e.g., all transmitters or all receivers). A channel can act as an AIB master or slave depending on which chiplet provides the master clock.
[0067] The AIB adapter provides interfaces to the AIB link layer and to the AIB physical layer (PHY). The AIB adapter provides data staging registers, a power-on reset sequencer, and control signal shift registers.
[0068] The AIB physical layer consists of AIB I/O units. AIB I/O units (implemented in some embodiments by hardware transceiver 320) can be input-only, output-only, or bidirectional. An AIB channel consists of a set of AIB I/O units, the number of which depends on the AIB channel configuration. A receive signal on a chiplet is connected to a transmit signal on a paired chiplet. In some embodiments, each column includes an auxiliary (AUX) channel and data channels numbered 0 to N.
[0069] Data packets are routed between chiplets 310 via network nodes 330. A node 330 can determine the next node 330 to which a received data packet is forwarded based on one or more data fields of the data packet. For example, a source or destination address, a source or destination port, a virtual channel, or any suitable combination thereof can be hashed to select the next network node or an available network path. Path selection in this manner can be used to balance network traffic.
[0070] Thus, Figure 3 illustrates a data path from chiplet 310A to chiplet 310D. Data packets are sent from hardware transceiver 320A to network node 330A; forwarded by network node 330A to network node 330C; forwarded by network node 330C to network node 330D; and delivered by network node 330D to hardware transceiver 320D of chiplet 310D.
[0071] Figure 3 also illustrates a second data path, from chiplet 310A to chiplet 310G. Data packets are sent from hardware transceiver 320A to network node 330A; forwarded by network node 330A to network node 330B; forwarded by network node 330B to network node 330D; forwarded by network node 330D to network node 330C; forwarded by network node 330C to network node 330E; forwarded by network node 330E to network node 330F; forwarded by network node 330F to network node 330H; forwarded by network node 330H to network node 330G; and delivered by network node 330G to hardware transceiver 320G of chiplet 310G. As is visually apparent in Figure 3, multiple paths through the network are available for transmitting data between any pair of chiplets.
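One way to picture hash-based path selection is the short sketch below. It is only an illustration: the choice of CRC-32 as the hash, the function name `select_next_hop`, and the string encoding of the fields are assumptions, since the description does not specify a hash function.

```python
import zlib


def select_next_hop(candidates, src, dst, virtual_channel):
    """Pick one candidate next node by hashing selected packet fields.

    `candidates` is a non-empty list of node labels (hypothetical).
    CRC-32 is used here only as a convenient deterministic hash.
    """
    key = f"{src}:{dst}:{virtual_channel}".encode()
    return candidates[zlib.crc32(key) % len(candidates)]
```

Because the hash is deterministic, packets with identical field values always take the same next hop, while packets with different field values tend to spread across the available nodes.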
[0072] The AIB I/O unit supports three clocking modes: asynchronous (i.e., non-clocked), SDR, and DDR. The non-clocked mode is used for clocks and some control signals. SDR mode can use a dedicated SDR-only I/O unit or a dual-purpose SDR/DDR I/O unit.
[0073] The CPI packet protocol (point-to-point and routable) can use the AIB channel in either SDR or DDR operating mode. In some example implementations, the AIB channel is configured in increments of 80 I/O units (i.e., 40 TX and 40 RX) for SDR mode and in increments of 40 I/O units for DDR mode.
[0074] The CPI streaming protocol can use AIB channels in either SDR or DDR operating mode. In some example implementations, the AIB channels are configured in increments of 40 I/O units for both modes (SDR and DDR).
[0075] A unique interface identifier is assigned to each AIB channel. This identifier is used during CPI reset and initialization to determine the paired AIB channel across neighboring chiplets. In some implementations, the interface identifier is a 20-bit value comprising a seven-bit chiplet identifier, a seven-bit column identifier, and a six-bit link identifier. The AIB physical layer uses an AIB out-of-band shift register to transmit the interface identifier. Bits 32-51 of the shift register are used to transmit the 20-bit interface identifier in both directions across the AIB interface.
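The identifier layout can be sketched as simple bit packing. The description gives the field widths (seven, seven, and six bits) and the shift register bit range (32-51), but not the order of the fields within the 20 bits; the ordering below (chiplet identifier in the most significant bits) and all function names are assumptions for illustration.

```python
CHIPLET_BITS, COLUMN_BITS, LINK_BITS = 7, 7, 6
ID_BITS = CHIPLET_BITS + COLUMN_BITS + LINK_BITS   # 20 bits in total
ID_SHIFT = 32                                      # identifier occupies bits 32-51


def pack_interface_id(chiplet, column, link):
    """Pack the three fields into 20 bits (field order is an assumption)."""
    if not (0 <= chiplet < 2**CHIPLET_BITS
            and 0 <= column < 2**COLUMN_BITS
            and 0 <= link < 2**LINK_BITS):
        raise ValueError("field out of range")
    return (chiplet << (COLUMN_BITS + LINK_BITS)) | (column << LINK_BITS) | link


def unpack_interface_id(value):
    """Inverse of pack_interface_id."""
    link = value & (2**LINK_BITS - 1)
    column = (value >> LINK_BITS) & (2**COLUMN_BITS - 1)
    chiplet = (value >> (COLUMN_BITS + LINK_BITS)) & (2**CHIPLET_BITS - 1)
    return chiplet, column, link


def insert_into_shift_register(register, interface_id):
    """Place the 20-bit identifier in bits 32-51 of a shift register value."""
    mask = (2**ID_BITS - 1) << ID_SHIFT
    return (register & ~mask) | (interface_id << ID_SHIFT)


def extract_from_shift_register(register):
    """Read the 20-bit identifier back out of bits 32-51."""
    return (register >> ID_SHIFT) & (2**ID_BITS - 1)
```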
[0076] In some implementations, AIB channels are numbered in ascending order, starting with the AIB channel adjacent to the AUX channel. The AIB channel adjacent to the AUX is designated as AIB channel 0.
[0077] Figure 3 shows, by way of example, eight chiplets 310 connected via a network comprising eight nodes 330. More or fewer chiplets 310 and more or fewer nodes 330 can be included in the chiplet network, thus allowing the creation of networks of chiplets of any size.
[0078] Figure 4 is a block diagram of a packet buffer 400 and a packet command buffer 450, according to some embodiments of the present disclosure, suitable for use in early credit value return in a system using credit-based flow control.
[0079] Packet buffer 400 includes entries 405, 410, 415, 420, 425, and 430. Packet command buffer 450 includes entries 455, 460, 465, 470, 475, and 480. Packet buffer 400 and packet command buffer 450 are controlled by a buffer control unit. The buffer control unit maintains packet buffer 400 and packet command buffer 450, passes data from packet buffer 400 and packet command buffer 450 to packet decoder 230, causes credit value return packets to be added to network output queue 245 via packet encoder 240, or performs any suitable combination of these operations. The buffer control unit may be implemented as hardware within network communication interface 225.
[0080] Each of entries 405-430 contains four time slots, each of which can hold one data segment of a memory access command (e.g., a memory read command, a memory write command, a built-in atomic command, or a custom atomic command). Figure 4 illustrates an example of packets stored in packet buffer 400 and packet command buffer 450. Three time slots of entry 405 hold the data received in the first packet T0. The remaining time slot of entry 405 and all four time slots of entry 410 hold the data received in the second packet T2. Three time slots of entry 415 hold the data received in the third packet T3; the remaining time slot of entry 415 holds the data received in the fourth packet T4. Entries 420-430 are empty and can be used to store data received in future packets.
[0081] Before sending data packet T0, the transmitting device decrements its available credit value by one, because all three memory access command packet data segments can be fitted into a single four-slot entry. Once all slots in entry 405 have been processed, the receiving device marks entry 405 as available and returns the credit value to the transmitting device.
[0082] Before sending the memory access command packet T2, the transmitting device decrements its available credit value by two because the five T2 data segments do not fit into a single entry, but do fit into two four-slot entries. In systems that do not use early credit value return, the T2 data segments are stored in two entries (e.g., entries 410 and 415), and two credit values are returned when both entries are processed. Alternatively, the T2 data segments are stored as shown in Figure 4, and the number of credit values to return is calculated based on the total number of entries used and on which of those entries have been processed. Thus, although only one additional entry (entry 410) is used to store the T2 data segments, two credit values will be returned for the five T2 data segments.
[0083] In a system using early credit value return, the time slot of entry 405 that does not store T0 data is used to store the first segment of T2 data. The remaining four data segments are then stored in additional entry 410. As a result of packing data from two different packets into entry 405, the T2 data consumes only one additional entry. The receiving system determines that the sending system has spent two credit values based on the number of data segments in the T2 packet and the number of time slots in each entry. Because two credit values were spent but only one additional entry was used to store the T2 data, the receiving system can return one credit value before the T2 data is processed.
[0084] Similarly, the sending system spends a credit value to send T4 data, but because the T4 data does not use additional entries, the receiving system can return the credit value before the T4 data is processed.
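The arithmetic behind the T0, T2, T3, and T4 cases can be sketched as follows. This is a minimal model assuming that data segments are packed contiguously into four-slot entries; the function names and the `replay` helper are hypothetical and only illustrate the calculation the receiving system is described as making.

```python
SLOTS_PER_ENTRY = 4  # each buffer entry holds four time slots (per Figure 4)


def credits_spent(num_segments, slots=SLOTS_PER_ENTRY):
    """Credits the sender spends: segments rounded up to whole entries."""
    return -(-num_segments // slots)


def new_entries_used(start_offset, num_segments, slots=SLOTS_PER_ENTRY):
    """Entries newly consumed when storing begins at `start_offset`.

    A non-zero offset means the first touched entry was already in use
    (and already paid for), so it does not count as new.
    """
    touched = -(-(start_offset + num_segments) // slots)
    return touched - (1 if start_offset > 0 else 0)


def early_credit_return(start_offset, num_segments, slots=SLOTS_PER_ENTRY):
    """Credits that can be returned before the data is processed."""
    return (credits_spent(num_segments, slots)
            - new_entries_used(start_offset, num_segments, slots))


def replay(packet_lengths, slots=SLOTS_PER_ENTRY):
    """Early credits returned for a sequence of packets packed back to back."""
    offset, returned = 0, []
    for n in packet_lengths:
        returned.append(early_credit_return(offset, n, slots))
        offset = (offset + n) % slots
    return returned
```

For the packets of Figure 4, `replay([3, 5, 3, 1])` models T0, T2, T3, and T4 in order: T0 and T3 each start a fresh entry and yield no early return, while T2 and T4 each yield one.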
[0085] Packet command buffer 450 stores one entry for each packet represented in packet buffer 400. Entry 455 stores the T0 header, a pointer to the time slot containing the first portion of the T0 data, and the length of the T0 data. The pointer indicates both entry 405 and the offset (0) within entry 405. In the example of Figure 4, each of entries 405-430 in packet buffer 400 has four time slots. Correspondingly, the offset value in packet command buffer 450 can be stored using two bits. In some example embodiments, the entry and offset are packed into a single byte (e.g., using a 6-bit entry identifier and a 2-bit offset value), a word, or a double word.
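The single-byte pointer encoding mentioned above can be sketched as below. The placement of the 6-bit entry identifier in the upper bits and the 2-bit offset in the lower bits is an assumption; the description specifies only the field widths. The function names are hypothetical.

```python
def pack_pointer(entry_id, offset):
    """Pack a 6-bit entry identifier and a 2-bit slot offset into one byte.

    Bit placement (entry in the upper six bits) is an assumption.
    """
    if not 0 <= entry_id < 64:
        raise ValueError("entry identifier must fit in 6 bits")
    if not 0 <= offset < 4:
        raise ValueError("offset must fit in 2 bits")
    return (entry_id << 2) | offset


def unpack_pointer(byte):
    """Return (entry_id, offset) from a packed pointer byte."""
    return byte >> 2, byte & 0b11
```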
[0086] Entry 460 stores the T2 header, the corresponding entry and offset in packet buffer 400, and the amount of T2 data in packet buffer 400. Entries 465 and 470 store the data for T3 and T4 packets, respectively. Entries 475 and 480 can be used to store data for additional incoming packets.
[0087] Figure 5 is a block diagram of a data packet 500 suitable for use in a system using credit-based flow control, according to some embodiments of the present disclosure. The data packet 500 is divided into flow control units (flits), each consisting of 36 bits. The first flit of the data packet 500 includes a CP field 505, a path field 510, a status field 515, a destination identifier (DID) field 520, a sequence continuation (SC) field 525, a length field 530, and a command field 535. Each remaining flit includes a credit value return (CR) / write enable mask (WEM) field (e.g., CR/WEM fields 540 and 550) and a data field (e.g., data fields 545 and 555).
[0088] The CP field 505 is a two-bit field that indicates whether the CR/WEM fields of later flits in the packet contain CR data, contain WEM data, or should be ignored, and whether the path field 510 should be used to control packet ordering. In some example embodiments, a value of 0 or 1 in the CP field 505 indicates that CR/WEM fields 540 and 550 contain credit value return data, and a value of 2 or 3 indicates that they contain WEM data. Independently, a value of 0 indicates that the path field 510 should be ignored; a value of 1 or 3 indicates that the path field 510 should be used to determine the path of packet 500; and a value of 2 indicates that single-path ordering is to be used.
[0089] The path field 510 is an eight-bit field. When the CP field 505 indicates that the path field 510 is used to determine the path of data packet 500, all data packets with the same value in the path field 510 are guaranteed to take the same path across the network. Therefore, the order of the data packets will remain unchanged between the sender and receiver. If the CP field 505 indicates that single-path ordering is to be used, the path for each packet is determined as if the path field 510 were set to 0. Accordingly, all packets take the same path and the order will remain unchanged, regardless of the actual value of the path field 510 for each data packet. If the CP field 505 indicates that the path field 510 is to be ignored, the data packets are routed without considering the value of the path field 510, and the data packets can be received by the receiver in a different order than the order in which they were sent by the sender. However, this can avoid congestion in the network and allow for higher throughput in the device.
[0090] The response status is stored in status field 515 (a four-bit field). In some example implementations, a status of zero indicates that the request was successfully processed, and a non-zero status indicates one of various error codes.
[0091] The DID field 520 stores a twelve-bit DID. The DID uniquely identifies the destination (e.g., a destination chiplet) within the network. Sequential delivery is guaranteed for all data packets having the SC field 525 set. The length field 530 is a five-bit field indicating the number of flits comprising data packet 500. The interpretation of the length field 530 can be non-linear. For example, values 0-22 can be interpreted as 0-22 flits in data packet 500, and values 23-27 can be interpreted as 33-37 flits in data packet 500 (i.e., 10 more than the indicated value). Other values of the length field 530 can be vendor-defined rather than protocol-defined.
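The four CP values described above can be summarized in a small decoder. This is an illustrative sketch; the function name and the string labels are hypothetical, while the value-to-meaning mapping follows the example embodiment in the paragraph above.

```python
def decode_cp(cp):
    """Decode the two-bit CP field into (CR/WEM content, path handling)."""
    if cp not in (0, 1, 2, 3):
        raise ValueError("CP is a two-bit field")
    content = "CR" if cp in (0, 1) else "WEM"
    path_handling = {
        0: "ignore_path",   # path field ignored; packets may be reordered
        1: "use_path",      # path field selects the route
        2: "single_path",   # routed as if the path field were 0
        3: "use_path",
    }[cp]
    return content, path_handling
```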
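The non-linear length interpretation given in the example above can be written as a short decoder. The function name is hypothetical, and returning `None` for the vendor-defined values 28-31 is a choice made only for this sketch.

```python
def decode_length(value):
    """Map the five-bit length field to a count of flow control units.

    Returns None for values 28-31, which are vendor-defined in the example.
    """
    if not 0 <= value <= 31:
        raise ValueError("length is a five-bit field")
    if value <= 22:
        return value
    if value <= 27:
        return value + 10   # 23-27 encode 33-37
    return None
```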
[0092] The command of data packet 500 is stored in command field 535 (a seven-bit field). The command can be a write command, a read command, a predefined atomic operation command, a custom atomic operation command, a read response, an acknowledgment response, or a vendor-specific command. Furthermore, the command can indicate the virtual channel of data packet 500. For example, different commands can be used for different virtual channels, or one, two, three, or four bits of the seven-bit command field 535 can be used to indicate the virtual channel, with the remaining bits used to indicate the command.
[0093] The memory access command can further identify the number of bytes to be written or accessed, the memory space to be accessed (e.g., off-die memory 275 or instruction memory for custom atomic operations), or any suitable combination thereof. In some example embodiments, the command may indicate that additional bits of a later flit identify the command. For example, a multi-byte command can be sent by using a vendor-specific command in the seven-bit command field 535 and using a portion or all of the 32-bit data field 545 to contain the larger command.
[0094] If WEM is enabled, CR/WEM fields 540 and 550 are four-bit masks, each bit indicating whether the corresponding byte of the 32 data bits in the flit is to be written. Therefore, the size of a single flit is always 36 bits, but it can contain 0-32 data bits to be written. If CR is enabled, two bits of each of CR/WEM fields 540 and 550 identify whether the credit value return is for virtual channel 0, 1, 2, or 3, and the other two bits indicate whether the number of credit values to be returned is 0, 1, 2, or 3.
[0095] Figure 6 is a block diagram of a data packet 600 suitable for use in a system employing credit value-based flow control, according to some embodiments of this disclosure. Data packet 600 includes a single 36-bit flit. The flit contains four credit value return fields 605, 610, 615, and 620, a length field 630, and reserved fields 625 and 635. The length field 630 is set to 0 to indicate that no additional flits constitute data packet 600. Reserved fields 625 and 635 are unused and should be set to 0.
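The two interpretations of the four-bit CR/WEM field can be sketched as below. The bit assignments are assumptions made for illustration: mask bit i is taken to enable byte i counted from the least significant byte, and in CR mode the upper two bits are taken as the virtual channel and the lower two as the credit count. The description states only how many bits each purpose uses.

```python
def decode_wem(mask, data):
    """Return (byte_index, byte_value) pairs enabled by a 4-bit write mask.

    Assumes mask bit i enables byte i of the 32-bit data, least significant
    byte first.
    """
    return [(i, (data >> (8 * i)) & 0xFF)
            for i in range(4) if (mask >> i) & 1]


def decode_cr(field):
    """Return (virtual_channel, credits) from a 4-bit credit return field.

    Assumes the upper two bits select the virtual channel and the lower
    two bits carry the number of credits.
    """
    return (field >> 2) & 0b11, field & 0b11
```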
[0096] Each of the credit value return fields 605-620 is a five-bit field. If the credit value is being returned to a low virtual channel, the first bit of the credit value return fields 605-620 is set to 0. If the credit value is being returned to a high virtual channel, the first bit of the credit value return fields 605-620 is set to 1. The remaining four bits of each credit value return field 605-620 indicate the number of credit values being returned (0-15). Therefore, the CR0 credit value return field 620 returns 0-15 credit values to virtual channel 0 or 4; the CR1 credit value return field 615 returns 0-15 credit values to virtual channel 1 or 5; the CR2 credit value return field 610 returns 0-15 credit values to virtual channel 2 or 6; and the CR3 credit value return field 605 returns 0-15 credit values to virtual channel 3 or 7.
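A five-bit credit value return field of packet 600 can be encoded and decoded as sketched below. The description calls the channel-select bit the "first bit"; treating it as the most significant bit of the field is an assumption, as are the function names and the `index` parameter (0 for CR0 through 3 for CR3).

```python
def encode_cr_field(index, virtual_channel, credits):
    """Build the five-bit field CR<index> for the given virtual channel.

    CR<index> can address virtual channel `index` (low) or `index + 4` (high).
    The select bit is placed in the most significant position (assumption).
    """
    if virtual_channel not in (index, index + 4):
        raise ValueError("virtual channel not addressable by this field")
    if not 0 <= credits <= 15:
        raise ValueError("credit count must fit in four bits")
    high = 1 if virtual_channel >= 4 else 0
    return (high << 4) | credits


def decode_cr_field(index, field):
    """Return (virtual_channel, credits) for the five-bit field CR<index>."""
    high = (field >> 4) & 1
    return index + 4 * high, field & 0xF
```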
[0097] Therefore, early credit value return for credit value-based flow control can be implemented by using the CR/WEM fields 540 and 550 of packet 500 when sending other packets (such as acknowledgment packets) or by using packet 600.
[0098] Figure 7 is a flowchart illustrating operations of a method 700, performed by two circuits, for early credit value return in a system using credit value-based flow control, according to some embodiments of the present disclosure. Method 700 includes operations 710, 720, 730, 740, and 750. By way of example and not limitation, method 700 is described as being performed by the apparatus of Figures 1-3, using the buffers of Figure 4 and the data packets 500 and 600 of Figures 5-6.
[0099] In operation 710, the first circuit sends a first packet to the second circuit indicating the number of credit values available to the second circuit. In some example embodiments, operation 710 is performed during the initialization of the first circuit. The first packet may be data packet 600, returning credit values to the second circuit on one or more virtual channels.
[0100] In operation 720, the second circuit sends a second packet to the first circuit, the second packet comprising a micro-piece corresponding to a portion of the credit value. For example, the first circuit may send packet 500, which contains a header micro-piece and a data micro-piece. In an example embodiment where the micro-piece corresponds to a time slot in packet buffer 400, the packet to be written corresponds only to a portion of the credit value because a packet with only two micro-pieces is less than four time slots in the entries of packet buffer 400.
[0101] In operation 730, based on data corresponding to a partial credit value, the second circuit decrements the available credit value by a full credit value. For example, the second circuit may track the available credit values of multiple virtual channels and decrement the available credit value of the virtual channel used to transmit the second packet on it based on the number of chips indicated by the value in the length field 530.
[0102] As another example, a data packet 500 comprising multiple micro-pieces is sent (in operation 720), and multiple credit values are spent (in operation 730), the multiple credit values corresponding to the multiple micro-pieces.
[0103] In operation 740, the first circuit receives data and stores the data in a buffer. For example, the T4 data in packet buffer 400 can be stored in entry 415, such as... Figure 4 As shown.
[0104] Based on the alignment of the received micro-pieces in the buffer, causing the micro-pieces not to use new entries in the buffer, the first circuit sends a third packet to the second circuit, thereby returning the credit value to the second circuit (operation 750). For example, because entry 415 has an available time slot and the T4 data only consumes one time slot, the T4 data is stored in packet buffer 400 without consuming additional entries in packet buffer 400. Accordingly, the first circuit sends a packet that returns the credit value to the second circuit. For example, packet 500 may be sent with: a command field 535 indicating the virtual channel of packet T4 and acknowledging the command of packet T4; a value in CP field 505 indicating that the CR / WEM field 540 contains the credit value return data; and a value of 1 in CR / WEM field 540.
[0105] If entry 415 is full, the T4 data will already be stored in a new entry, and an acknowledgment of the T4 command will be sent without an earlier credit value return. Similarly, T3 data received before the T4 data is stored in entry 415 without causing the first circuitry to transmit a packet including a credit value return. Although the T3 data consumes less than a full entry, storing the T3 data utilizes the then-empty entry 415. Accordingly, the transmitting device appropriately consumes one credit value. Therefore, in some exemplary embodiments, a first packet including a microchip will appropriately consume one credit value, while a second packet including more commands (e.g., three microchips) will achieve an earlier credit value return.
[0106] As another example of the application of method 700, the T2 data shown in Figure 4 may have been received, where the first flit of the T2 data is stored in entry 405 and the additional flits are stored in a previously unused entry of packet buffer 400 (entry 410). Therefore, the T2 data uses one new entry in packet buffer 400 instead of two, and a credit value is immediately returned to the sender. In this example, the count of the flits of the T2 data is five, which exceeds the number of time slots in each entry of packet buffer 400. Therefore, a credit value can be returned both when the number of flits in the packet is less than the number of time slots in an entry (e.g., the T4 data in entry 415) and when it is greater than the number of time slots in an entry (e.g., the T2 data in entries 405 and 410). However, regardless of buffer alignment, a credit value will not be returned when the number of flits is an exact integer multiple of the number of time slots in a buffer entry, because the packet then always consumes a whole number of entries' worth of time slots.
[0107] In this example, the T2 data spans two entries. In other example implementations, the data of a packet may span more than two entries.
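The alignment cases above can be summarized with a short calculation. It is a sketch under assumptions that the text does not state: S is the number of time slots per entry, u is the number of time slots already used in the partially filled entry (0 ≤ u < S, with u = 0 meaning that storage starts in an empty entry), n is the number of flits in the packet, and the sender is assumed to spend ⌈n/S⌉ credit values per packet. The numeric rows use S = 4 and u = 3 purely for illustration.

```latex
\begin{aligned}
E_{\text{new}} &= \left\lceil \frac{u+n}{S} \right\rceil - \left\lceil \frac{u}{S} \right\rceil, \qquad
C_{\text{spent}} = \left\lceil \frac{n}{S} \right\rceil, \qquad
r = C_{\text{spent}} - E_{\text{new}} \in \{0, 1\} \\
n = kS &\;\Rightarrow\; E_{\text{new}} = \left\lceil \frac{u}{S} + k \right\rceil - \left\lceil \frac{u}{S} \right\rceil = k = C_{\text{spent}} \;\Rightarrow\; r = 0 \\
S = 4,\ u = 3,\ n = 1 &\;\Rightarrow\; E_{\text{new}} = 1 - 1 = 0,\quad C_{\text{spent}} = 1,\quad r = 1 \\
S = 4,\ u = 3,\ n = 5 &\;\Rightarrow\; E_{\text{new}} = 2 - 1 = 1,\quad C_{\text{spent}} = 2,\quad r = 1
\end{aligned}
```

Here r is the number of credit values returned early. The second row is the integer-multiple case, where no credit value is returned for any alignment; the last two rows correspond to a one-flit packet such as the T4 data and a five-flit packet such as the T2 data.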
[0108] Figure 8 is a flowchart illustrating operations of an example method 800, performed by two circuits according to some embodiments of the present disclosure, for performing an early credit value return in a system using credit value-based flow control. Method 800 is an optional extension of method 700 and includes operations 810 and 820. By way of example and not limitation, method 800 is described as being performed by the devices of Figures 1-3, using the buffer of Figure 4 and the packets 500 and 600 of Figures 5-6.
[0109] After method 700, described above with respect to Figure 7, has been performed, the second circuit increments the available credit value based on receipt of the third packet (operation 810). In operation 820, based on the incremented available credit value, the second circuit sends a fourth packet to the first circuit. For example, after the available credit value is decremented in operation 730, there may not be enough credit value available to send the fourth packet, but after the returned credit value is received, there may be sufficient credit value available. Therefore, the fourth packet is sent as soon as the credit value return packet is received, without waiting for the first circuit to process the data stored in packet buffer 400.
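The sender-side bookkeeping of operations 730, 810, and 820 can be sketched as follows. The class `CreditSender`, its method names, the four-slot entry size, and the per-packet cost of one credit value per started entry are assumptions made for illustration and are not taken from the disclosure.

```python
import math


class CreditSender:
    """Hypothetical sender that tracks the credit values it may still spend."""

    def __init__(self, credits: int, slots_per_entry: int = 4):
        self.credits = credits
        self.slots_per_entry = slots_per_entry  # assumed entry size

    def try_send(self, flit_count: int) -> bool:
        """Spend credit values for a packet; refuse if too few are available."""
        cost = math.ceil(flit_count / self.slots_per_entry)  # assumed charge
        if cost > self.credits:
            return False
        self.credits -= cost  # corresponds to decrementing in operation 730
        return True

    def on_credit_return(self, returned: int) -> None:
        """Apply a credit value return received from the receiver (operation 810)."""
        self.credits += returned


sender = CreditSender(credits=2)
first = sender.try_send(5)    # five flits cost two credit values
blocked = sender.try_send(1)  # no credit values left, so the packet must wait
sender.on_credit_return(1)    # early return arrives before the buffer is read
fourth = sender.try_send(1)   # operation 820: the waiting packet can now go
print(first, blocked, fourth)  # prints "True False True"
```

The point of the example is the ordering: the blocked packet is released by the early return alone, without any read of the receive buffer having taken place.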
[0110] Figure 9 is a flowchart illustrating operations of a method 900, performed by circuitry according to some embodiments of the present disclosure, for performing an early credit value return in a system using credit value-based flow control. Method 900 includes operations 910 and 920. By way of example and not limitation, method 900 is described as being performed by the devices of Figures 1-3, using the buffer of Figure 4 and the packets 500 and 600 of Figures 5-6.
[0111] In operation 910, the chiplet receives a packet containing a first command from a source (e.g., another chiplet or the body of PCIe card 100). For example, the T4 data may be received and stored in entry 415 of packet buffer 400, as shown in Figure 4.
[0112] Based on an entry in packet buffer 400 that already contains a second command having an available time slot, so that the alignment of the received data in packet buffer 400 causes the data not to use a new entry in packet buffer 400, the chiplet stores the first command in the available time slot and transmits to the source a packet including a credit value return indicating the buffer space available to the source (operation 920). For example, because entry 415 has one available time slot and the T4 data consumes only one time slot, the T4 data is stored in packet buffer 400 without consuming an additional entry in packet buffer 400. Accordingly, the chiplet sends a packet returning the credit value to the source. For example, packet 500 may be sent with: a command field 535 indicating the virtual channel of packet T4 and acknowledging the command of packet T4; a value in CP field 505 indicating that CR / WEM field 540 contains credit value return data; and a value of 1 in CR / WEM field 540.
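The credit value return packet of operation 920 can be illustrated with a small data model. This is only a sketch: the text names the fields (command field 535, CP field 505, CR / WEM field 540) but this passage gives neither their widths nor their encodings, so the type names, the `"ACK"` string, the separate `virtual_channel` attribute, and the `CpMode` values below are hypothetical placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class CpMode(Enum):
    """Hypothetical stand-in for the meaning carried by CP field 505."""
    CREDIT_RETURN = "credit_return"  # CR / WEM field holds credit value return data
    OTHER = "other"                  # any other use of the field (not modelled)


@dataclass(frozen=True)
class Packet500:
    command: str          # stands in for command field 535 (acknowledgment)
    virtual_channel: int  # virtual channel of the acknowledged packet
    cp: CpMode            # stands in for CP field 505
    cr_wem: int           # stands in for CR / WEM field 540


def build_credit_return(virtual_channel: int, credits: int) -> Packet500:
    """Build an acknowledgment that also returns `credits` credit values."""
    if credits < 1:
        raise ValueError("a credit value return carries at least one credit value")
    return Packet500(
        command="ACK",  # placeholder; the real command encoding is not given
        virtual_channel=virtual_channel,
        cp=CpMode.CREDIT_RETURN,
        cr_wem=credits,
    )


pkt = build_credit_return(virtual_channel=0, credits=1)
print(pkt.cp.name, pkt.cr_wem)  # prints "CREDIT_RETURN 1"
```

The single-credit case shown matches the T4 example, in which the CR / WEM field carries a value of 1.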
[0113] Figure 10 shows a block diagram of an example machine 1000 with which, in which, or by which any one or more of the techniques (e.g., methods) discussed herein may be implemented. As described herein, examples may include, or may operate by, logic or a number of components or mechanisms in machine 1000. A circuit system (e.g., a processing circuit system) is a collection of circuits (e.g., simple circuits, gates, logic, etc.) implemented in tangible entities of machine 1000 that include hardware. The membership of a circuit system may be flexible over time. A circuit system includes members that may, alone or in combination, perform specified operations when operating. In one example, the hardware of the circuit system may be immutably designed to carry out a specific operation (e.g., hardwired). In one example, the hardware of the circuit system may include variably connected physical components (e.g., execution units, transistors, simple circuits, etc.) and a machine-readable medium that is physically modified (e.g., magnetically, electrically, or by movable placement of invariant-mass particles) to encode instructions for the specific operation. In connecting the physical components, the underlying electrical properties of a hardware constituent are changed, for example, from an insulator to a conductor, or vice versa. The instructions enable embedded hardware (e.g., an execution unit or a loading mechanism) to create members of the circuit system in hardware via the variable connections, so as to carry out portions of the specific operation when in operation. Thus, in one example, the machine-readable medium elements are part of the circuit system or are communicatively coupled to the other components of the circuit system when the device is operating. In one example, any of the physical components may be used in more than one member of more than one circuit system. For example, during operation, an execution unit may be used in a first circuit of a first circuit system at one point in time and reused at a different time by a second circuit in the first circuit system or by a third circuit in a second circuit system. Additional examples of these components with respect to machine 1000 follow.
[0114] In alternative embodiments, machine 1000 may operate as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, machine 1000 may operate in the capacity of a server machine, a client machine, or both in a server-client network environment. In one example, machine 1000 may act as a peer machine in a peer-to-peer (P2P) (or other distributed) network environment. Machine 1000 may be a personal computer (PC), tablet PC, set-top box (STB), personal digital assistant (PDA), mobile phone, network appliance, network router, switch, or bridge, or any machine capable of executing instructions (sequential or otherwise) that specify actions to be taken by that machine. Furthermore, while only a single machine is illustrated, the term "machine" should also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methods discussed herein (e.g., cloud computing, Software as a Service (SaaS), or other computer cluster configurations).
[0115] Machine (e.g., computer system) 1000 may include a hardware processor 1002 (e.g., a central processing unit (CPU), graphics processing unit (GPU), hardware processor core, or any combination thereof), main memory 1004, static memory 1006 (e.g., memory or storage for firmware, microcode, a basic input/output system (BIOS), a unified extensible firmware interface (UEFI), etc.), and mass storage device 1008 (e.g., hard disk drive, tape drive, flash memory, or other block device), some or all of which may communicate with each other via an interconnection link (e.g., bus) 1030. Machine 1000 may further include a display unit 1010, an alphanumeric input device 1012 (e.g., keyboard), and a user interface (UI) navigation device 1014 (e.g., mouse). In one example, the display unit 1010, input device 1012, and UI navigation device 1014 may be a touchscreen display. Machine 1000 may additionally include a storage device (e.g., a drive unit) 1008, a signal generation device 1018 (e.g., a speaker), a network interface device 1020, and one or more sensors 1016, such as a Global Positioning System (GPS) sensor, a compass, an accelerometer, or other sensors. Machine 1000 may include an output controller 1028, such as a serial (e.g., Universal Serial Bus (USB)), parallel, or other wired or wireless (e.g., infrared (IR), near field communication (NFC), etc.) connection, to communicate with or control one or more peripheral devices (e.g., a printer, a card reader, etc.).
[0116] The registers of processor 1002, main memory 1004, static memory 1006, or mass storage device 1008 may be or contain machine-readable medium 1022, on which one or more data structures or sets of instructions 1024 (e.g., software) embodying or utilized by any one or more of the techniques or functions described herein may be stored. Instructions 1024 may also reside wholly or at least partially in any of the registers of processor 1002, main memory 1004, static memory 1006, or mass storage device 1008 during execution by machine 1000. In one example, one or any combination of hardware processor 1002, main memory 1004, static memory 1006, or mass storage device 1008 may constitute machine-readable medium 1022. Although machine-readable medium 1022 is shown as a single medium, the term "machine-readable medium" may include a single medium or multiple media (e.g., a centralized or distributed database, or associated cache and server) configured to store the one or more instructions 1024.
[0117] The term "machine-readable medium" can include any medium capable of storing, encoding, or carrying instructions for execution by machine 1000 that cause machine 1000 to perform any one or more of the techniques disclosed herein, or any medium capable of storing, encoding, or carrying data structures used by or associated with such instructions. Non-limiting examples of machine-readable media can include solid-state memory, optical media, magnetic media, and signals (e.g., radio frequency signals, other photon-based signals, sound signals, etc.). In one example, a non-transitory machine-readable medium includes a machine-readable medium having a plurality of particles with invariant (e.g., rest) mass, and which is therefore a composition of matter. Accordingly, a non-transitory machine-readable medium is a machine-readable medium that does not include transitory propagating signals. Specific examples of non-transitory machine-readable media can include: non-volatile memory, such as semiconductor memory devices (e.g., electrically programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM)) and flash memory devices; magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
[0118] In one instance, information stored or otherwise provided on machine-readable medium 1022 may represent instructions 1024, such as instructions 1024 itself or a format from which instructions 1024 can be derived. This format from which instructions 1024 can be derived may include source code, encoded instructions (e.g., in compressed or encrypted form), packaged instructions (e.g., split into multiple packages), etc. The information representing instructions 1024 in machine-readable medium 1022 may be processed by a processing circuitry system into instructions to perform any of the operations discussed herein. For example, deriving instructions 1024 from information (e.g., processed by a processing circuitry system) may include: compiling (e.g., from source code, object code, etc.), interpreting, loading, organizing (e.g., dynamically or statically linking), encoding, decoding, encrypting, decrypting, encapsulating, decapsulating, or otherwise manipulating information into instructions 1024.
[0119] In one example, the derivation of instructions 1024 may involve assembling, compiling, or interpreting information (e.g., by a processing circuit system) to create instructions 1024 from some intermediate or preprocessed format provided by machine-readable medium 1022. When the information is provided in multiple parts, the parts may be combined, unpacked, and modified to create instructions 1024. For example, the information may be contained in multiple compressed source code packages (or object code, or binary executable code, etc.) on one or more remote servers. The source code packages may be encrypted when in transit over a network and, if necessary, decrypted, decompressed, assembled (e.g., linked), and compiled or interpreted at the local machine (e.g., into a library, a standalone executable, etc.) and executed by the local machine.
[0120] Instructions 1024 may further be transmitted or received over communication network 1026 using a transmission medium via network interface device 1020, utilizing any of several transfer protocols (e.g., Frame Relay, Internet Protocol (IP), Transmission Control Protocol (TCP), User Datagram Protocol (UDP), Hypertext Transfer Protocol (HTTP), etc.). Example communication networks may include local area networks (LANs), wide area networks (WANs), packet data networks (e.g., the Internet), mobile telephone networks (e.g., cellular networks), plain old telephone service (POTS) networks, and wireless data networks (e.g., the Institute of Electrical and Electronics Engineers (IEEE) 802.11 family of standards, the IEEE 802.16 family of standards, and the IEEE 802.15.4 family of standards), peer-to-peer (P2P) networks, etc. In one example, network interface device 1020 may include one or more physical jacks (e.g., Ethernet, coaxial, or telephone jacks) or one or more antennas to connect to communication network 1026. In one example, network interface device 1020 may include multiple antennas to communicate wirelessly using at least one of single-input multiple-output (SIMO), multiple-input multiple-output (MIMO), or multiple-input single-output (MISO) techniques. The term "transmission medium" should be taken to include any intangible medium capable of storing, encoding, or carrying instructions for execution by machine 1000, and includes digital or analog communication signals or other intangible media to facilitate communication of such software. A transmission medium is a machine-readable medium.
[0121] In the foregoing description, some exemplary embodiments of this disclosure have been described. It will be apparent that various modifications can be made to this disclosure without departing from the broader scope and spirit of this disclosure as set forth in the appended claims. Therefore, the description and drawings should be viewed in an illustrative rather than restrictive sense. The following is a non-exhaustive list of examples of embodiments of this disclosure.
[0122] Example 1 is a system comprising: a hardware transceiver configured to perform operations including: receiving a packet from a source including a first command; and a buffer control unit configured to control a buffer and perform operations including: based on an entry in the buffer that already contains a second command having an available time slot, storing the first command in the available time slot; and causing the hardware transceiver to transmit a packet to the source including a credit value return indicating buffer space available to the source.
[0123] In Example 2, the subject matter of Example 1 includes, wherein: the packet includes a plurality of commands including the first command and additional commands; and the operations of the buffer control unit further include: storing the additional commands in an unused entry of the buffer.
[0124] In Example 3, the subject matter of Example 2 includes, wherein: the count of the plurality of commands exceeds the number of time slots in each entry of the buffer.
[0125] In Example 4, the subject matter of Examples 1-3 includes, wherein: the first command is a memory access command.
[0126] In Example 5, the subject matter of Examples 1-4 includes, wherein: the first command is a memory write command.
[0127] In Example 6, the subject matter of Examples 1-5 includes, wherein: the hardware transceiver and the buffer control unit are part of a first chiplet; and the source is a second chiplet.
[0128] In Example 7, the subject matter of Examples 1-6 includes, wherein: the operations of the hardware transceiver further include: before receiving the packet, receiving a previous packet including one or more previous commands including the second command, the count of the previous commands being less than the width of the entry of the buffer, the entry of the buffer being empty when the previous packet is received; and the operations of the buffer control unit further include: storing the previous commands in the entry of the buffer without causing the hardware transceiver to transmit a packet including a credit value return.
[0129] Example 8 is a method comprising: receiving a packet from a source including a first memory command; based on an entry in a buffer that already contains a second memory command having an available time slot, storing the first memory command in the available time slot; and transmitting a packet to the source including a credit value return indicating buffer space available to the source.
[0130] In Example 9, the subject matter of Example 8 includes, wherein: the packet includes a plurality of commands including the first memory command and additional commands; and the method further includes: storing the additional commands in an unused entry of the buffer.
[0131] In Example 10, the subject matter of Example 9 includes, wherein: the count of the plurality of commands exceeds the number of time slots in each entry of the buffer.
[0132] In Example 11, the subject matter of Examples 8-10 includes, wherein: the first memory command is a memory access command.
[0133] In Example 12, the subject matter of Examples 8-11 includes, wherein: the first memory command is a memory write command.
[0134] In Example 13, the subject matter of Examples 8-12 includes, wherein: the receiving is performed by a first chiplet; and the source is a second chiplet.
[0135] In Example 14, the subject matter of Examples 8-13 includes: before receiving the packet, receiving a previous packet including one or more previous commands including the second memory command, the count of the previous commands being less than the width of the entry of the buffer, the entry of the buffer being empty when the previous packet is received; and storing the previous commands in the entry of the buffer without transmitting a packet including a credit value return.
[0136] Example 15 is a non-transitory machine-readable medium that stores instructions that, when executed by a system, cause the system to perform operations including: receiving a packet from a source including a first command; based on an entry in a buffer that already contains a second command having an available time slot, storing the first command in the available time slot; and transmitting a packet to the source including a credit value return indicating buffer space available to the source.
[0137] In Example 16, the subject matter of Example 15 includes, wherein: the packet includes a plurality of commands including the first command and additional commands; and the operations further include: storing the additional commands in an unused entry of the buffer.
[0138] In Example 17, the subject matter of Example 16 includes, wherein: the count of the plurality of commands exceeds the number of time slots in each entry of the buffer.
[0139] In Example 18, the subject matter of Examples 15-17 includes, wherein: the first command is a memory access command.
[0140] In Example 19, the subject matter of Examples 15-18 includes, wherein: the first command is a memory write command.
[0141] In Example 20, the subject matter of Examples 15-19 includes, wherein: the receiving is performed by a first chiplet; and the source is a second chiplet.
[0142] Example 21 is at least one machine-readable medium containing instructions that, when executed by a processing circuitry system, cause the processing circuitry system to perform an operation to implement any one of Examples 1-20.
[0143] Example 22 is a device that includes components for implementing any one of Examples 1-20.
[0144] Example 23 is a system used to implement any of Examples 1-20.
[0145] Example 24 is a method for implementing any of Examples 1-20.
Claims
1. A system for flow control using a credit value-based approach, comprising: a buffer comprising multiple entries, each of which includes multiple time slots; a hardware transceiver configured to perform operations including: receiving, from a source, a packet that includes multiple commands, the multiple commands including a first command and additional commands, the count of the multiple commands exceeding the number of time slots in each entry of the buffer; and a buffer control unit configured to control the buffer and perform operations including: based on a first entry in the buffer that already contains a second command having an available time slot: storing the first command in the available time slot of the first entry; storing the additional commands in a second entry in the buffer that contains no used time slots; and causing the hardware transceiver to transmit, to the source, a packet that includes a credit value return indicating buffer space available to the source.
2. The system according to claim 1, wherein: the first command is a memory access command.
3. The system according to claim 1, wherein: the first command is a memory write command.
4. The system according to claim 1, wherein: the hardware transceiver and the buffer control unit are part of a first chiplet; and the source is a second chiplet.
5. The system according to claim 1, wherein: the operations of the hardware transceiver further include: before receiving the packet, receiving a previous packet including one or more previous commands including the second command, the count of the previous commands being less than the number of time slots of the first entry of the buffer, the first entry of the buffer being empty when the previous packet is received; and the operations of the buffer control unit further include: storing the previous commands in the first entry of the buffer without causing the hardware transceiver to transmit a packet including a credit value return.
6. The system of claim 1, wherein the count of the plurality of commands is indicated by a mask field in the packet.
7. The system of claim 1, wherein the credit value return indicating the buffer space available to the source indicates the number of entries in the buffer available to the source.
8. A method for flow control using credit values, comprising: receiving, from a source, a packet that includes multiple commands, the multiple commands including a first memory command and additional commands, the count of the multiple commands exceeding the number of time slots in each entry of a buffer; and based on a first entry in the buffer that already contains a second memory command having an available time slot: storing the first memory command in the available time slot of the first entry; storing the additional commands in a second entry in the buffer that contains no used time slots; and transmitting, to the source, a packet that includes a credit value return indicating buffer space available to the source.
9. The method according to claim 8, wherein: the first memory command is a memory access command.
10. The method according to claim 8, wherein: the first memory command is a memory write command.
11. The method according to claim 8, wherein: the receiving is performed by a first chiplet; and the source is a second chiplet.
12. The method of claim 8, further comprising: before receiving the packet, receiving a previous packet comprising one or more previous commands including the second memory command, the count of the previous commands being less than the number of time slots of the first entry of the buffer, the first entry of the buffer being empty when the previous packet is received; and storing the previous commands in the first entry of the buffer without transmitting a packet that includes a credit value return.
13. The method of claim 8, wherein the count of the plurality of commands is indicated by a mask field in the packet.
14. The method of claim 8, wherein the credit value return indicating the buffer space available to the source indicates the number of entries in the buffer available to the source.
15. A non-transitory machine-readable medium storing instructions which, when executed by a system, cause the system to perform operations including: receiving, from a source, a packet comprising a plurality of commands, the plurality of commands comprising a first memory command and additional commands, the count of the plurality of commands exceeding the number of time slots in each entry of a buffer; and based on a first entry in the buffer that already contains a second memory command having an available time slot: storing the first memory command in the available time slot of the first entry; storing the additional commands in a second entry in the buffer that contains no used time slots; and transmitting, to the source, a packet that includes a credit value return indicating buffer space available to the source.
16. The non-transitory machine-readable medium according to claim 15, wherein: the first memory command is a memory access command.
17. The non-transitory machine-readable medium according to claim 15, wherein: the first memory command is a memory write command.
18. The non-transitory machine-readable medium according to claim 15, wherein: the receiving is performed by a first chiplet; and the source is a second chiplet.
19. The non-transitory machine-readable medium of claim 15, wherein the count of the plurality of commands is indicated by a mask field in the packet.
20. The non-transitory machine-readable medium of claim 15, wherein the credit value return indicating the buffer space available to the source indicates the number of entries in the buffer available to the source.