Data transfer between memory and distributed computing array
By introducing remote buffers and controllers into integrated circuits to coordinate data transmission, the data skew problem between memory and computing arrays in multi-chip integrated circuits is solved, achieving efficient data synchronization and bandwidth utilization.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XILINX INC
- Filing Date
- 2020-12-04
- Publication Date
- 2026-06-23
AI Technical Summary
In multi-chip integrated circuits, data transmission between memory and computing arrays is affected by skew, resulting in unpredictable data transmission and inefficient and degraded use of available memory bandwidth.
By introducing multiple remote buffers and controllers into the integrated circuit, data transmission is coordinated to ensure that data is transmitted synchronously from memory to the computing array, eliminating skew and maximizing bandwidth utilization.
It enables synchronous data transfer between memory and computing array, improving the utilization of computing array and the efficiency of memory bandwidth, and reducing the complexity and overhead of multi-chip integrated circuits.
Figure CN114746853B_ABST
Description
Technical Field
[0001] This disclosure relates to integrated circuits (ICs) and, more specifically, to data transfer between memory and compute arrays distributed across multiple dies of an IC.
Background Technology
[0002] A neural network processor (NNP) is an integrated circuit (IC) having one or more compute arrays capable of implementing neural networks. The compute arrays receive data, such as the weights of the neural network being implemented, from memory. The weights are input from memory in parallel through multiple memory channels. Data transfer from memory to the compute array is typically affected by skew, so data arrives at different parts of the compute array at different times. The skew is due at least in part to the independence of the memory channels during parallel operation and, in the case of a multi-die IC with a compute array distributed across multiple dies, to the data wavefront from each memory channel being orthogonal to the compute array. Whether viewed individually or cumulatively, these problems make data transfer from memory to the compute array unpredictable and lead to inefficient and/or degraded use of the available memory bandwidth.
Summary of the Invention
[0003] An example implementation includes an integrated circuit (IC). The IC includes multiple dies. The IC includes multiple memory channel interfaces configured to communicate with memory, wherein the multiple memory channel interfaces are disposed within a first die among the multiple dies. The IC may include a compute array distributed across the multiple dies and multiple remote buffers distributed across the multiple dies. The multiple remote buffers are coupled to the multiple memory channels and the compute array. The IC also includes a controller configured to determine that each of the multiple remote buffers already contains data, and in response, broadcast a read enable signal to each of the multiple remote buffers to initiate data transfer from the multiple remote buffers to the compute array on the multiple dies.
[0004] Another example implementation includes a controller. The controller is housed within an IC having multiple dies. The controller includes a request controller configured to translate a first request to access memory into a second request compatible with an on-chip communication bus, wherein the request controller provides the second request to multiple request buffer-bus master blocks configured to receive data from multiple channels of memory. The controller also includes a remote buffer read address generation unit coupled to the request controller and configured to monitor the fill level in each of the multiple remote buffers distributed across the multiple dies. Each of the multiple remote buffers is configured to provide data obtained from a corresponding one of the multiple request buffer-bus master blocks to a compute array distributed across the multiple dies. In response to determining that each of the multiple remote buffers is storing data based on the fill level, the remote buffer read address generation unit is configured to initiate data transfer from each of the multiple remote buffers to the compute array across the multiple dies.
[0005] Another example implementation includes a method. The method includes monitoring the fill levels of multiple remote buffers distributed across multiple dies, wherein each of the multiple remote buffers is configured to provide data to a compute array also distributed across multiple dies, determining that each of the multiple remote buffers stores data based on the fill level, and in response to the determination, initiating data transfer from each of the multiple remote buffers to the compute array across the multiple dies.
[0006] This overview is provided only to introduce certain concepts and not to identify any key or essential features of the claimed subject matter. Other features of the inventive arrangements will be apparent from the accompanying drawings and the following detailed description.
Attached Figure Description
[0007] The arrangement of the invention is illustrated by way of example in the accompanying drawings. However, the drawings should not be construed as limiting the arrangement of the invention to the specific embodiments shown. Various aspects and advantages will become apparent from the following detailed description and with reference to the drawings.
[0008] Figure 1 illustrates an example plan view of a circuit architecture implemented within an integrated circuit (IC);
[0009] Figure 2 illustrates an example implementation of the circuit architecture of Figure 1;
[0010] Figure 3 illustrates another example implementation of the circuit architecture of Figure 1;
[0011] Figure 4 illustrates an example of a balanced tree structure for implementing the circuit architecture of Figure 1;
[0012] Figure 5 illustrates an example implementation of a request buffer-bus master (RBBM) circuit block as described in this disclosure;
[0013] Figure 6 illustrates an example implementation of the master controller;
[0014] Figure 7 illustrates an example implementation of the request controller;
[0015] Figure 8 illustrates an example method for transferring data between a high-bandwidth memory and a distributed compute array;
[0016] Figure 9 illustrates an example architecture of an IC.
Detailed Implementation
[0017] Although this disclosure concludes with claims defining novel features, it is believed that a better understanding of the various features described herein will come from considering the description in conjunction with the accompanying drawings. The processes, machines, manufactures, and any variations thereof described herein are provided for purposes of illustration. Specific structural and functional details described herein should not be construed as limiting, but merely as a basis for the claims and as a representative basis for teaching those skilled in the art to employ the described features in virtually any appropriately detailed structure. Furthermore, the terminology and phrases used in this disclosure are not intended to be limiting, but rather to provide an understandable description of the described features.
[0018] This disclosure relates to integrated circuits (ICs), and more specifically, to data transfer between memory and computing arrays distributed across multiple dies of an IC. A neural network processor (NNP) is an integrated circuit having one or more computing arrays capable of implementing neural networks. In the case of a multi-die IC, the IC can implement a single, larger computing array distributed across two or more dies of the multi-die IC. Implementing a single, larger computing array in a distributed manner across multiple dies offers several advantages compared to multiple smaller, independent computing arrays on different dies, including, but not limited to, improved latency, improved weight storage capacity, and improved computational efficiency.
[0019] Data, such as the weights of a neural network, is input from a high-bandwidth memory (HBM) to the compute array. For descriptive purposes, the memory accessed through the memory channels is referred to throughout this disclosure as "high-bandwidth memory" or "HBM" to better distinguish it from other types of memory in the circuit architecture, such as buffers and/or queues. However, it should be understood that the HBM can be implemented using any of a variety of different technologies that support multiple independent and parallel memory channels communicatively linked to the example circuit architecture described herein by suitable memory controllers. Examples of an HBM include any of a variety of RAM types, including double data rate RAM or other suitable memories.
[0020] Although the compute array is distributed across multiple dies of the IC, it is viewed by HBM as a single compute array to which weights are input in parallel via available memory channels. For example, the compute array can be implemented as an array where each die implements one or more rows of the compute array. Each memory channel can provide data to one or more rows of the compute array.
[0021] When a single compute array is distributed across multiple dies, data transfer from the HBM to the compute array typically encounters timing issues. For example, each memory channel usually has its own independent control pins, asynchronous clock, and refresh sleep mode. These characteristics can cause data skew in the memory channels. As a result, different rows of the compute array often receive data at different times. Since the different rows of the compute array are located on different dies of the IC, their distances from the HBM vary, further exacerbating the data skew. For example, the data wavefront (e.g., data propagation) from the memory channels may be orthogonal to the compute array in the IC. These issues lead to overall unpredictability in data transfer from the HBM to the compute array.
[0022] Based on the inventive arrangement described in this disclosure, an example circuit architecture is provided that enables memory channel scheduling of read requests to HBM, thereby simultaneously improving and / or maximizing HBM bandwidth utilization. The example circuit architecture also eliminates data transfer skew between HBM and the compute array. Therefore, data can be provided synchronously from HBM to different rows of the compute array on the dies of a multi-die IC while reducing skew. This allows the compute array to remain busy while making fuller use of HBM read bandwidth.
[0023] The example circuit architecture also reduces the overhead and complexity of distributing computing arrays across multiple dies of a multi-die IC. The example circuit architecture described herein is applicable to multi-die ICs with varying numbers of dies. As the number of dies in a multi-die IC changes from one model to the next, and / or the area of each die changes, the example circuit architecture described herein can adapt to these changes to improve data transfer between the various dies of the multi-die IC on which the computing array is distributed.
[0024] Throughout this disclosure, the Advanced Microcontroller Bus Architecture (AMBA) Advanced eXtensible Interface (AXI) (hereinafter "AXI") protocol and communication bus are used for purposes of description. AXI defines an embedded microcontroller bus interface for establishing on-chip connections between compatible circuit blocks and/or systems. AXI is provided as an illustrative example of a bus interface and is not intended to limit the examples described in this disclosure. It should be understood that other similar and/or equivalent protocols, communication buses, bus interfaces, and/or interconnects may be used instead of AXI, and that the various example circuit blocks and/or signals provided in this disclosure will vary with the specific protocol, communication bus, bus interface, and/or interconnect used.
[0025] Other aspects of the arrangement of the invention will now be described in more detail with reference to the accompanying drawings. For the sake of simplicity and clarity, the elements shown in the drawings are not necessarily drawn to scale. For example, the dimensions of some elements may be exaggerated relative to others for clarity. Furthermore, where deemed appropriate, reference numerals are repeated between the various drawings to indicate corresponding, similar, or identical features.
[0026] Figure 1 illustrates an example plan view of a circuit architecture implemented in IC 100. IC 100 is a multi-die IC and includes a compute array. The compute array is distributed across dies 102, 104, and 106 of IC 100. For illustrative purposes, IC 100 is shown as having three dies. In other examples, IC 100 may have fewer or more dies than shown.
[0027] In this example, the compute array is subdivided into 256 compute array rows. Compute array rows 0-95 are implemented in die 102. Compute array rows 96-191 are implemented in die 104. Compute array rows 192-255 are implemented in die 106. The compute array may include a digital signal processing (DSP) cascade chain that connects dies 102, 104, and 106 together.
[0028] Data, such as weights, is obtained from an HBM (not shown) communicatively linked to IC 100 via multiple memory channels. In one aspect, the HBM is implemented in a separate IC (e.g., outside the package of IC 100) on the same circuit board as IC 100. In another aspect, the HBM is implemented as another die within IC 100 (e.g., within the same package as IC 100). The HBM may be positioned along the bottom side of IC 100, for example, adjacent to the bottom of die 106 from left to right. In some HBMs, the memory channels are referred to as pseudo channels (PCs). For descriptive purposes, the term "memory channel" is used to refer to the memory channels of the HBM and/or the pseudo channels of the HBM.
[0029] In the example of Figure 1, die 106 includes 16 memory controllers 0-15. Each memory controller is capable of servicing (e.g., reading and/or writing) two memory channels. In Figure 1, the numbers in parentheses on memory controllers 0-15 indicate the specific memory channels served by each memory controller. For example, memory controller 0 serves memory channels 0 and 1, memory controller 1 serves memory channels 2 and 3, and so on.
[0030] Each memory controller is connected to one or more request buffers and one or more bus master circuits (e.g., AXI masters). In the example implementation shown in Figure 1, each memory channel is coupled to a request buffer-bus master block via a memory controller. In Figure 1, each request buffer-bus master combination (where the bus master may be, e.g., an AXI master) is abbreviated as "RBBM circuit block" and is illustrated as "RBBM". Since each memory controller can serve two memory channels, there are two RBBM circuit blocks directly above each memory controller, both coupled to that memory controller. Each RBBM circuit block is labeled with the specific memory channel it serves. With one RBBM circuit block per memory channel, the example of Figure 1 therefore includes RBBM circuit blocks 0-31. In this example, the memory controllers and the RBBM circuit blocks are all located on a single die of IC 100, i.e., on the same die.
[0031] Each of dies 102, 104, and 106 includes multiple remote buffers. The remote buffers are distributed across dies 102, 104, and 106. In the example of Figure 1, each RBBM circuit block is connected to multiple remote buffers. In one example, each RBBM circuit block is connected to four different remote buffers. For illustrative purposes, RBBM circuit block 0 is connected to remote buffers 0-3 and provides data to remote buffers 0-3. RBBM circuit block 1 is connected to remote buffers 4-7 and provides data to remote buffers 4-7. Each of the remaining RBBM circuit blocks can likewise be connected to its own group of four consecutively numbered remote buffers, with the remote buffer numbering starting in die 102 and continuing through dies 104 and 106.
[0032] Each of dies 102, 104, and 106 also includes multiple caches. Typically, the number of caches (e.g., 32) corresponds to the number of memory channels. Each cache is capable of serving data to multiple compute array rows. In the example of Figure 1, each cache can provide data to 8 compute array rows. Die 102 includes caches 0-11, where cache 0 provides data to compute array rows 0-7, cache 1 provides data to compute array rows 8-15, cache 2 provides data to compute array rows 16-23, and so on. Die 104 includes caches 12-23, where cache 12 provides data to compute array rows 96-103, cache 13 provides data to compute array rows 104-111, cache 14 provides data to compute array rows 112-119, and so on. Die 106 includes caches 24-31, where cache 24 provides data to compute array rows 192-199, cache 25 provides data to compute array rows 200-207, cache 26 provides data to compute array rows 208-215, and so on.
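The numbering scheme described for Figure 1 can be summarized as a few index mappings. The following Python sketch is illustrative only: the function and constant names are hypothetical, and it encodes just the relationships stated above (two memory channels per memory controller, four remote buffers per RBBM circuit block, eight compute array rows per cache, and the row ranges of each die). It does not model which die holds a given remote buffer, since the text does not list those ranges.

```python
# Illustrative index mappings for the Figure 1 example (names are hypothetical).
BUFFERS_PER_RBBM = 4   # each RBBM circuit block feeds four remote buffers
ROWS_PER_CACHE = 8     # each cache feeds eight compute array rows

# Compute array row ranges per die, as listed in paragraph [0027].
DIE_ROWS = {102: range(0, 96), 104: range(96, 192), 106: range(192, 256)}


def controller_channels(mc):
    """Memory controller mc serves memory channels 2*mc and 2*mc + 1."""
    return (2 * mc, 2 * mc + 1)


def rbbm_remote_buffers(rbbm):
    """RBBM circuit block k feeds remote buffers 4k through 4k + 3."""
    start = rbbm * BUFFERS_PER_RBBM
    return list(range(start, start + BUFFERS_PER_RBBM))


def cache_rows(cache):
    """Cache k feeds compute array rows 8k through 8k + 7."""
    start = cache * ROWS_PER_CACHE
    return range(start, start + ROWS_PER_CACHE)


def die_of_row(row):
    """Return the die that implements the given compute array row."""
    for die, rows in DIE_ROWS.items():
        if row in rows:
            return die
    raise ValueError(f"row {row} is outside the 256-row compute array")


def die_of_cache(cache):
    """A cache sits on the die that implements the rows it feeds."""
    return die_of_row(cache_rows(cache)[0])
```

For instance, `die_of_cache(12)` resolves to die 104 because cache 12 feeds rows 96-103, which matches the cache placement listed in the text.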
[0033] In this example, data such as weights can be loaded from the HBM via the 32 memory channels implemented in die 106. Ultimately, the weights are input as multiplication operands into the compute array rows. The weights enter IC 100 over the parallel memory channels through memory controllers 0-15 in die 106. Within die 106, an RBBM circuit block is placed next to each memory channel to handle flow control between that memory channel and the associated compute array rows. The RBBM circuit blocks are controlled by the master controller 108 to execute HBM read and write requests (e.g., "accesses").
[0034] In the example of Figure 1, the memory channels are located far from the remote buffers. The master controller 108 is also located away from the memory channel, which is closer to the right side of die 106. Furthermore, the data wavefront enters the compute array rows in a direction (e.g., horizontal) that is orthogonal to the direction in which the data wavefront enters IC 100 via the memory controllers (e.g., vertical).
[0035] Data read from the HBM is written from the RBBM circuit block into various remote buffers 0-127 in dies 102, 104, and 106. The read side (e.g., the side connected to the cache) of each remote buffer is controlled by the master controller 108. The master controller 108 controls the read side of each remote buffer 0-127 to perform skew-eliminating reads in dies 102, 104, and 106 and inputs the data into the corresponding caches 0-31 for use in the respective compute array rows 0-255.
[0036] The master controller 108 coordinates data transfers to and from remote buffers, thereby eliminating data skew. By coordinating reads from remote buffers, the master controller 108 ensures that data, such as weights, is provided synchronously to each row of the compute array. Furthermore, the master controller 108 can improve and / or maximize HBM bandwidth utilization. This allows the compute array to remain busy while making fuller use of HBM read bandwidth.
[0037] As noted, IC 100 may include fewer or more dies than shown in Figure 1. In this respect, the example circuit architecture of Figure 1 reduces the overhead and complexity of distributing data to a compute array spread across multiple dies of a multi-die IC, regardless of whether the IC comprises fewer or more than three dies. The example architecture described herein is applicable to multi-die ICs with different numbers of dies than those shown. Furthermore, the dimensions of dies 102, 104, and 106 as shown in Figure 1 are chosen only to better illustrate the components in each die. The dies may have the same or different dimensions.
[0038] Figure 2 illustrates an example implementation of the circuit architecture of Figure 1. In the example of Figure 2, the die boundaries have been removed. IC 100 can be mounted on a circuit board that is communicatively coupled to a host computer via a communication bus. For example, IC 100 can be coupled to the host computer via a Peripheral Component Interconnect Express (PCIe) connection or other suitable connection. IC 100, e.g., die 106, may include PCIe Direct Memory Access (DMA) circuitry 202 to facilitate the PCIe connection. PCIe DMA circuitry 202 is connected to Block Random Access Memory (BRAM) controller 204 via connection 206. In an example implementation, one or more of the request buffers of the RBBM circuit blocks and/or the remote buffers are implemented using BRAM.
[0039] The master controller 108 is connected to the BRAM controller 204. The BRAM controller 204 can operate as an AXI endpoint slave device that integrates with the AXI interconnect and system master devices to communicate with local storage (e.g., BRAM). In the example of Figure 2, the BRAM controller 204 can function as a bridge between PCIe and the master controller 108. In one aspect, the master controller 108 is a centralized controller driven by a command queue via the host-PCIe connection (e.g., receiving commands via the PCIe DMA circuit 202 and the BRAM controller 204).
[0040] The master controller 108 is capable of performing several different operations. For example, the master controller 108 is capable of issuing narrow write requests to the HBM to initialize the HBM. In that case, the master controller 108 is capable of accessing all 32 memory channels of the HBM using global addresses (e.g., using only global addresses) through AXI master 31 (of RBBM circuit block 31) and memory controller 15.
[0041] The master controller 108 is also capable of issuing narrow read requests to the HBM. In that case, the master controller 108 can access all 32 memory channels of the HBM via AXI master 31 and memory controller 15 using global addresses (e.g., using only global addresses).
[0042] The master controller 108 is also capable of issuing wide read requests to the HBM. The master controller 108 can execute read requests (e.g., sequential and random) on all 32 memory channels in parallel via bus master circuits 0-31 (e.g., of RBBM circuit blocks 0-31) and memory controllers 0-15 using local memory channel addresses.
[0043] In the example of Figure 2, the master controller 108 is capable of monitoring and/or tracking various signals. Furthermore, the master controller 108 is capable of generating multiple different signals in response to detecting certain conditions in the monitored signals. For example, the master controller 108 is capable of generating signal 208. Signal 208 is a remote buffer read enable signal. The master controller 108 is capable of generating signal 208 and broadcasting signal 208 (e.g., the same signal) to each of the remote buffers 0-127. In this way, the master controller 108 can read-enable every remote buffer synchronously, which eliminates skew in the data read from the remote buffers and provided to the compute array rows.
[0044] Signal 210 is a remote buffer write enable signal. Each bus master circuit 0-31 in RBBM circuit blocks 0-31 is capable of generating signal 210 for the corresponding remote buffers. The master controller 108 is capable of receiving each remote buffer write enable signal generated by the bus master circuits 0-31 for each remote buffer. In one aspect, the master controller 108 is capable of monitoring the fill level of each remote buffer by tracking the write enable signal 210 for each remote buffer and the read enable signal 208 provided to each remote buffer.
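The fill-level tracking and broadcast read enable described above can be illustrated with a small behavioral model. This Python sketch is not the circuit of the disclosure: the class and method names are hypothetical, each write enable is assumed to add one entry and each broadcast read enable to remove one entry from every buffer, and buffer depth and clocking are ignored.

```python
class FillLevelMonitor:
    """Behavioral sketch of fill-level tracking for skew-free reads.

    Assumptions (not from the source): one entry per write enable, one entry
    consumed per buffer per broadcast read enable, unlimited buffer depth.
    """

    def __init__(self, num_buffers=128):
        self.fill = [0] * num_buffers

    def on_write_enable(self, buf):
        """Record a write enable (signal 210 in the text) for one remote buffer."""
        self.fill[buf] += 1

    def all_have_data(self):
        """True only when every remote buffer holds at least one entry."""
        return all(level > 0 for level in self.fill)

    def step(self):
        """Broadcast one read enable (signal 208 in the text) if possible.

        Returns True when the read enable was broadcast, in which case every
        buffer gives up one entry in the same step, so no row gets ahead.
        """
        if not self.all_have_data():
            return False
        self.fill = [level - 1 for level in self.fill]
        return True


if __name__ == "__main__":
    monitor = FillLevelMonitor(num_buffers=4)
    for buf in (0, 1, 2):          # buffer 3 is late, e.g. its channel was refreshing
        monitor.on_write_enable(buf)
    print(monitor.step())          # no broadcast yet: buffer 3 is still empty
    monitor.on_write_enable(3)
    print(monitor.step())          # all buffers hold data, so the read is broadcast
```

The point of the design is that skew between channels is absorbed in the buffers: the early buffers simply wait until the slowest one has data, and only then does every buffer release an entry together.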
[0045] Signal 212 is the same as signal 208. However, signal 212 is generated based on a different clock signal than signal 208 (e.g., axi_clk instead of sys_clk). This allows the master controller 108 to provide the remote buffer read enable signal both to each of remote buffers 0-127 and to the AXI master within each of RBBM circuit blocks 0-31. Thus, the remote buffer fill level can also be tracked by a remote buffer pointer manager implemented locally in each RBBM circuit block. The remote buffer pointer manager is described in more detail in connection with Figure 5.
[0046] Signal 214 represents the AXI-AW/W/B signals, which the master controller 108 can provide to RBBM circuit block 31 to initiate the narrow write described above. In this disclosure, AXI-AW refers to the AXI write address signals; AXI-W refers to the AXI write data signals; AXI-B refers to the AXI write response signals; AXI-AR refers to the AXI read address signals; and AXI-R refers to the AXI read data signals. The master controller 108 also receives signal 216 from each of RBBM circuit blocks 0-31. Signal 216 may be an AR REQ ready signal (where "AR" stands for "address read"). The master controller 108 can also broadcast signal 218, such as an AR REQ, to each of RBBM circuit blocks 0-31 to initiate an HBM read.
[0047] The example circuit architecture of Figure 2 includes multiple clock domains. dsp_clk clocks the compute array rows and the output ports of caches 0-31. In one example, dsp_clk is set to 710 MHz. The 8x16-bit connections between each of caches 0-31 and the 8 compute array rows fed by that cache achieve a data rate of 0.355 TB/s (8x16x32 bits * 710 MHz).
[0048] sys_clk clocks the input ports (e.g., the right side) of caches 0-31 connected to the remote buffers and the output ports (e.g., the left side) of remote buffers 0-127 connected to caches 0-31. sys_clk also clocks a portion of the master controller 108, for example, the portion that broadcasts signal 208 to each remote buffer. In one example, sys_clk is set to 355 MHz. The 8x32-bit connections shown between remote buffers 0-127 and caches 0-31 achieve a data rate of 0.355 TB/s (8x32x32 bits * 355 MHz). For example, sys_clk can be set to half or approximately half the frequency of dsp_clk.
[0049] In the example of Figure 2, caches 0-31 not only cache data but also cross clock domains. More specifically, each of caches 0-31 is capable of receiving data at the sys_clk rate and outputting data to the compute array rows at the dsp_clk rate (e.g., twice the input clock rate). In one or more example implementations, circuitry such as the remote buffers and RBBM circuit blocks can be implemented in programmable logic that runs at a slower clock speed than the hardwired circuit blocks that can be used to implement the compute array rows. Caches 0-31 bridge this difference in clock speed.
[0050] axi_clk clocks the input ports (e.g., the right side) of the remote buffers and the output ports (e.g., the left side) of the RBBM circuit blocks. axi_clk also clocks a portion of the master controller 108, for example, the portion that monitors received signals 210 and 216 and outputs signals 212, 214, and 218. In one example, axi_clk is set to 450 MHz. The 4x64-bit connections between RBBM circuit blocks 0-31 and remote buffers 0-127 achieve a data rate of 0.45 TB/s (4x64x32 bits * 450 MHz).
[0051] Each RBBM circuit block is coupled to the corresponding memory controller via a 256-bit connection, achieving a data rate of 0.45 TB/s (32 x 256 bits * 450 MHz). Memory controllers 0-15 can also be clocked at 450 MHz. Each memory controller supports two 64-bit memory channel connections (e.g., one per memory channel), providing a data rate of 0.45 TB/s (2048 bits/T * 1.8 GT/s).
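The data rates quoted for each stage of the data path can be cross-checked with a few lines of arithmetic. The Python sketch below is only a sanity check with hypothetical names. It relies on one assumption that is not stated in the text: the quoted figures are reproduced when bytes per second are first expressed in GB/s (10^9 bytes) and then divided by 1024 to obtain TB/s.

```python
def rate_tb_per_s(bits_per_cycle, freq_hz):
    """Data rate in TB/s for a link moving bits_per_cycle bits on every cycle."""
    bytes_per_s = bits_per_cycle / 8 * freq_hz
    gb_per_s = bytes_per_s / 1e9
    return gb_per_s / 1024  # assumption: 1 TB counted as 1024 GB


# (total bits moved per cycle, clock or transfer rate in Hz) for each stage
LINKS = {
    "cache -> compute array rows (dsp_clk)": (8 * 16 * 32, 710e6),
    "remote buffer -> cache (sys_clk)": (8 * 32 * 32, 355e6),
    "RBBM -> remote buffer (axi_clk)": (4 * 64 * 32, 450e6),
    "memory controller -> RBBM": (32 * 256, 450e6),
    "HBM channels (2048 bits/T at 1.8 GT/s)": (2048, 1.8e9),
}

if __name__ == "__main__":
    for name, (bits, freq) in LINKS.items():
        print(f"{name}: {rate_tb_per_s(bits, freq):.3f} TB/s")
```

Under that unit convention the two cache-side links come out at 0.355 TB/s and the three memory-side links at 0.45 TB/s, so the slower sys_clk and dsp_clk stages are matched to each other while the memory side has headroom above them.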
[0052] For purposes of discussion, the term "memory channel interface" is used in this disclosure to refer to a particular RBBM circuit block together with the corresponding portion (e.g., a single channel) of the memory controller to which that RBBM circuit block is connected. For example, RBBM circuit block 0 and the portion of memory controller 0 connected to RBBM circuit block 0 (e.g., data buffer 302-0 and request queue 304-0 of Figure 3) are considered one memory channel interface, while RBBM circuit block 1 and the portion of memory controller 0 connected to RBBM circuit block 1 (e.g., data buffer 302-1 and request queue 304-1) are considered another memory channel interface.
[0053] The master controller 108 is capable of generating read and write requests based on HBM read and write commands received via PCIe DMA 202 and BRAM controller 204. The master controller 108 can operate in an "accelerate and wait" mode. For example, the master controller 108 can send read requests to the request buffers of RBBM circuit blocks 0-31 until the request buffers and the data path, including the remote buffers, are full. In response to each read command, the master controller 108 can also initiate a data read operation (e.g., a data transfer) from each remote buffer to the corresponding cache. Doing so frees up some request buffer space and triggers the master controller 108 to generate new read requests, based on the space available in the request buffers, to obtain more data from the HBM.
[0054] In one example implementation, the HBM comprises 16 banks within each PC and 32 columns within each row. By interleaving banks, up to 16 x 32 x 256 bits (128 Kb) can be read from each PC. The BRAM is 4 x 36 Kb in size, with one 36 Kb BRAM used to buffer two compute array rows. Thus, the example circuit architecture of Figure 2 is capable of reading up to 16 interleaved pages from a PC in burst lengths of 512 bits, serving eight compute array rows at a time.
[0055] Figure 3 illustrates another example implementation of the circuit architecture of Figure 1. In the example of Figure 3, the die boundaries have been removed. Furthermore, each of memory controllers 0-15 (abbreviated as "MC" in Figure 3) is coupled to two memory channels. Figure 3 presents a more detailed view of memory controllers 0-15 and RBBM circuit blocks 0-31.
[0056] In the example of Figure 3, each memory controller 0-15 serves two memory channels. Therefore, each memory controller 0-15 includes a data buffer 302 for each served memory channel and a request queue 304 for each served memory channel. For example, memory controller 0 includes a data buffer 302-0 and a request queue 304-0 for serving memory channel 0, and a data buffer 302-1 and a request queue 304-1 for serving memory channel 1. Similarly, memory controller 15 includes a data buffer 302-30 and a request queue 304-30 for serving memory channel 30, and a data buffer 302-31 and a request queue 304-31 for serving memory channel 31.
[0057] Using the AXI protocol as an illustrative example, data buffer 302 can be implemented as an AXI-R (read) data buffer. Each data buffer 302 can include 64 × 16 (1024) entries, where each entry is 256 bits. Request queue 304 can be implemented as an AXI-AR (address read) request queue. Each request queue 304 can include 64 entries. Each data buffer 302 receives data from the corresponding memory channel. Each request queue 304 is capable of providing the commands, addresses, and/or control signals received from the corresponding AXI master to the corresponding memory channel.
[0058] Each RBBM circuit block 0-31 includes a bus master circuit and a request buffer. For example, RBBM circuit block 0 includes a bus master circuit 0 and a request buffer 0. RBBM circuit block 1 includes a bus master circuit 1 and a request buffer 1. RBBM circuit block 30 includes a bus master circuit 30 and a request buffer 30. RBBM circuit block 31 includes a bus master circuit 31 and a request buffer 31. Therefore, each bus master circuit has a data connection to the corresponding data buffer 302 and a control connection to the corresponding request queue 304 (e.g., for address, control signals, and / or commands).
[0059] Because of HBM refresh, clock domain crossing, and the interleaving of the two memory channels within a single memory controller, skew is introduced between data read from HBM via different memory channels. When memory channel skew exists, data read from HBM via different memory channels is not aligned. This is true even if the read requests for all 32 memory channels are issued by the 32 AXI masters in the same cycle. Data skew is at least partly due to HBM refresh.
[0060] Consider an example where HBM has a global refresh cycle of 260 ns every 3900 ns. In this case, refresh commands limit the HBM throughput to 0.42 TB/s ((3900-260) / 3900 * 0.45 = 0.42). This also means that there is a refresh window of 117 axi_clk cycles (260 * 0.45 = 117) every 1755 axi_clk cycles (3900 * 0.45 = 1755), during which HBM read or write requests cannot be sent to HBM through the memory channels. Since these refresh windows are not aligned across all 32 memory channels, the maximum skew between any two memory channels is 117 axi_clk cycles when there are no overlapping refresh cycles between them. If the memory controller is able to generate a new request every two axi_clk cycles, the skew between any two memory channels can be as high as 59 HBM read requests (117 / 2 = 58.5, rounded up to 59). This worst case occurs in a window in which one of the two memory channels has issued 59 read requests while the other memory channel is blocked by refresh.
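The arithmetic in the preceding paragraph can be reproduced with a short sketch. The constant and function names below are illustrative and do not appear in the disclosure. The factor 0.45 is read here both as the axi_clk rate in cycles per nanosecond and as an assumed peak throughput in TB/s, which is how the formulas in the text use it.

```python
import math

# Figures taken from the example in the text; names are illustrative only.
REFRESH_PERIOD_NS = 3900   # one global refresh every 3900 ns
REFRESH_WINDOW_NS = 260    # each refresh blocks a channel for 260 ns
AXI_CLK_PER_NS = 0.45      # axi_clk cycles per ns, implied by 260 * 0.45 = 117
PEAK_TB_PER_S = 0.45       # assumption: peak throughput before refresh, in TB/s
CYCLES_PER_REQUEST = 2     # one new request every two axi_clk cycles


def refresh_budget():
    """Return (throughput, window_cycles, period_cycles, max_skew_requests)."""
    available = (REFRESH_PERIOD_NS - REFRESH_WINDOW_NS) / REFRESH_PERIOD_NS
    throughput = available * PEAK_TB_PER_S
    window_cycles = round(REFRESH_WINDOW_NS * AXI_CLK_PER_NS)
    period_cycles = round(REFRESH_PERIOD_NS * AXI_CLK_PER_NS)
    # 117 / 2 = 58.5 requests, rounded up to a whole request.
    max_skew_requests = math.ceil(window_cycles / CYCLES_PER_REQUEST)
    return throughput, window_cycles, period_cycles, max_skew_requests


if __name__ == "__main__":
    print(refresh_budget())
```

The rounded-up skew of 59 requests is the figure that the 64-entry request queues are sized against.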
[0061] In the example of Figure 3, each request queue 304 can be used to queue up to 64 HBM read requests initiated by the master controller 108 via the corresponding bus master circuit during a refresh command cycle. In this case, the 59 HBM read requests accumulated during the refresh command cycle can be absorbed by the 64-entry request queues 304. The master controller 108 can monitor the FIFO ready or full status of each request queue 304 by monitoring signals 216 from the corresponding RBBM circuit blocks (e.g., AR REQ ready signals 0-31 from all memory channels of the HBM used for wide read requests). In response to determining that space is available in each data buffer 302, for example, based on the status of each request queue 304, the master controller 108 can generate new HBM read requests (e.g., for each memory channel) and broadcast such requests to each of the memory controllers 0-15 (e.g., via the bus master circuits). That is, the master controller 108 sends HBM read requests to the request buffers. Each bus master circuit serves requests from its local request buffer. If any request queue 304 is full, the master controller 108 does not generate a new HBM read request.
[0062] Regarding the fill level of the request queues 304 in each of the 32 memory channels, two scenarios are possible for HBM wide read requests. The first scenario corresponds to a state where the circuitry is ready to accept new HBM read requests. In this scenario, there is available buffer space in each (e.g., all) request queue 304 to receive new HBM read requests from the master controller 108. The second scenario corresponds to a state where the circuitry is not ready to receive new HBM read requests. In this scenario, one or more request queues 304 are full, and at least one other request queue 304 is neither full nor empty. The situation where some request queues 304 are full while others are empty cannot occur because the maximum skew between any two memory channels (e.g., 59) is less than the buffer size of the request queue 304 (e.g., 64). In this second scenario, since there are still pending HBM read requests in all request queues 304, HBM throughput is not affected by the master controller 108 not serving new requests.
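The two scenarios can be illustrated with a minimal sketch. It assumes that every request queue receives the same broadcast requests, so fill levels differ only by the drain skew between channels. The function names are hypothetical; the depth of 64 and the skew of 59 come from the text.

```python
QUEUE_DEPTH = 64   # entries per request queue 304 (from the text)
MAX_SKEW = 59      # worst-case skew, in requests, between two channels (from the text)


def ready_for_wide_read(fill_levels, depth=QUEUE_DEPTH):
    """First scenario: every request queue has room for one more request."""
    return all(level < depth for level in fill_levels)


def min_fill_when_one_queue_is_full(depth=QUEUE_DEPTH, max_skew=MAX_SKEW):
    """Lowest possible fill level of any queue while another queue is full.

    Under the broadcast assumption, two queues differ by at most max_skew
    entries, so the emptiest queue holds at least depth - max_skew requests.
    """
    return depth - max_skew


if __name__ == "__main__":
    print(ready_for_wide_read([10] * 32))        # all queues have space
    print(ready_for_wide_read([64] + [20] * 31)) # one queue is full
    print(min_fill_when_one_queue_is_full())
```

Because the minimum fill level stays above zero whenever some queue is full, every channel still has pending requests to serve, which is why throughput is unaffected in the second scenario.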
[0063] Referring to Figures 2 and 3, there are 32 data streams in total. Each data stream is 256 bits wide and extends from the memory channel to the corresponding remote buffers. As shown in the example of Figure 1, some remote buffers are located in die 102 or 104, while others are located in die 106, which is closer to the master controller 108. In some example arrangements, since the data path for each memory channel is 256 bits wide, hardware resources can be minimized by keeping these data paths relatively short.
[0064] Figure 4 illustrates an example of a balanced tree structure implemented in the circuit architecture of Figure 1. The balanced tree structure is used to broadcast HBM wide read requests from the master controller 108 to the memory controllers.
[0065] As shown in the example of Figure 4, the master controller 108 broadcasts signal 218 (the AR REQ broadcast signal) to each RBBM circuit block from left to right. For illustrative purposes, only RBBM circuit blocks 31, 16, and 0 are shown. The arrival time of signal 218 at each RBBM circuit block is aligned to the same axi_clk cycle. Furthermore, the master controller 108 is capable of broadcasting signal 208 (e.g., remote buffer read enable) to each remote buffer. For illustrative purposes, only remote buffers 0-3, 64-67, and 124-127 are shown.
[0066] In the example of Figure 4, each RBBM circuit block includes a remote buffer pointer manager 406 (shown as 406-31, 406-16, and 406-0). The remote buffer pointer manager 406 may be included as part of a request buffer or implemented separately from the request buffer within each respective RBBM circuit block. Each remote buffer pointer manager 406 is capable of receiving signal 208 for tracking the fill level of the corresponding remote buffer. Furthermore, each remote buffer pointer manager 406 is capable of outputting a signal 210 (e.g., remote buffer write enable signals 210-31, 210-16, and 210-0) to the corresponding remote buffer.
[0067] For example, flip-flop (FF) 402 in die 104 receives a 256-bit wide data signal from RBBM circuit block 31 in die 106. FF 402 passes the data to FF 404 in die 102. FF 404 passes the data to remote buffers 0-3. Remote buffer pointer manager 406-31 is able to output control signal 210-31 to FF 408 in die 104. FF 408 outputs control signal 210-31 to FF 410 in die 102. FF 410 outputs control signal 210-31 to remote buffers 0-3. Master controller 108 generates and broadcasts signal 208, such as a remote buffer read enable signal, to remote buffers 124-127 in die 106. Signal 208 continues to FF 412 in die 104. FF 412 outputs signal 208 to remote buffers 64-67. FF 412 also outputs signal 208 to FF 414 in die 102. FF 414 provides signal 208 to remote buffers 0-3.
[0068] FF 416 in die 104 receives a 256-bit wide data signal from RBBM circuit block 16 in die 106. FF 416 passes the data to remote buffers 64-67. The remote buffer pointer manager 406-16 can output control signal 210-16 to FF 418 in die 104. FF 418 outputs control signal 210-16 to remote buffers 64-67.
[0069] RBBM circuit block 0 outputs a 256-bit wide data signal directly to remote buffers 124-127 in die 106. Remote buffer pointer manager 406-0 outputs control signal 210-0 directly to remote buffers 124-127 in die 106.
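The three paths described above can be tabulated in a small sketch to check that each write enable signal 210 passes through as many flip-flop stages as the data it accompanies. The one-cycle-per-stage latency model and all names below are assumptions made for illustration; only the FF reference numerals come from the text.

```python
# Flip-flop (FF) stages named in the text for three of the 32 paths,
# keyed by RBBM circuit block number.
DATA_PATH = {31: ["FF 402", "FF 404"], 16: ["FF 416"], 0: []}
WRITE_ENABLE_PATH = {31: ["FF 408", "FF 410"], 16: ["FF 418"], 0: []}

# Stages that signal 208 passes through, keyed by destination remote buffers.
READ_ENABLE_PATH = {
    "0-3": ["FF 412", "FF 414"],
    "64-67": ["FF 412"],
    "124-127": [],
}


def latency_cycles(stages):
    # Assumption: each FF stage adds exactly one clock cycle.
    return len(stages)


def write_side_is_matched():
    """True if each signal 210 sees the same stage count as its data path."""
    return all(
        latency_cycles(DATA_PATH[block]) == latency_cycles(WRITE_ENABLE_PATH[block])
        for block in DATA_PATH
    )


def read_enable_spread():
    """Difference in stage count between the longest and shortest 208 path."""
    latencies = [latency_cycles(path) for path in READ_ENABLE_PATH.values()]
    return max(latencies) - min(latencies)


if __name__ == "__main__":
    print(write_side_is_matched(), read_enable_spread())
```

Under this simplified model the write enable and data latencies match on every path, while the read enable reaches the three buffer groups through different stage counts.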
[0070] Typically, the skew addressed by the inventive arrangements described in this disclosure has several distinct components. For example, the skew includes components caused by HBM refresh and components caused by data propagation delays, such as signals crossing clock domains, data delays along different paths within the same die, data delays along different paths across different dies, and so on. HBM refresh is the largest contributor to skew. This component of the skew is primarily handled, e.g., de-skewed, by the read data buffers within the memory controllers. Even when taken together, the other skew components contribute less to the overall skew than the HBM refresh skew and can be handled by the remote buffers. Therefore, there is no need to balance the HBM read data delays from the memory controllers to the remote buffers across the 32 memory channels. A minimal number of pipeline stages can be used to send data from the memory controllers to the remote buffers on dies 102, 104, and 106 because the remote buffers have sufficient storage capacity to tolerate the skew on the remote buffer write side introduced by the described delay imbalance.
[0071] The example of Figure 4 also illustrates how the remote buffers can generate backpressure on the memory controllers for the individual memory channels. Backpressure is generated, at least in part, by locally generating each control signal 210 within the corresponding remote buffer pointer manager 406 in each RBBM circuit block close to the memory channel. Furthermore, the master controller 108 broadcasts a common signal 208 using a balanced tree structure.
[0072] Read and write pointers for the remote buffers are generated at each corresponding remote buffer on dies 102, 104, and 106. The latency of signal 210 should match the latency of the write data to the remote buffer. Because of skew among the write sides of the remote buffers that receive data from the various memory channels, and skew between the latency of signals 210 and the write data on the one hand and the latency of the remote buffer read enable on the other, the remote buffer backpressure may include some buffer margin to absorb those skews. Signals 210 and 216 (not shown in Figure 4) are propagated back to the master controller 108. The delays of these signals can be balanced to ensure that the master controller 108 captures the signals in the same cycle in order to generate new HBM read requests and signal 208.
[0073] Figure 5 illustrates an example implementation of an RBBM circuit block as described in this disclosure. In the example of Figure 5, the bus master circuit 502 includes two read channels. The AXI master 502 includes an AXI-AR master 504 capable of handling read address data and an AXI-R master 506 capable of handling read data. The other three AXI write channels, AXI-AW (write address), AXI-W (write data), and AXI-B (write response), can be controlled internally by the master controller 108. For illustrative purposes, the signal label “AXI_Rxxxx” is intended to refer to AXI signals with the prefix “AXI_R” in the relevant communication specification, while the signal label “AXI_ARxxxx” is intended to refer to AXI signals with the prefix “AXI_AR” in the relevant communication specification.
[0074] Request buffer 508 is the request buffer previously described as part of an RBBM circuit block. It is used to absorb the round-trip delay of the HBM read request queue backpressure between the AR REQ ready signal (e.g., signal 216) and the AR REQ signal (e.g., signal 218). For example, the depth of request buffer 508 should be greater than the backpressure round-trip delay (e.g., a round-trip delay of less than 32 cycles).
[0075] The AR REQ (address read request) signal 218 can specify a read start address, which is either a 23-bit local address for a wide read request from the master controller 108 on memory controllers 0-15 or a 28-bit global address for a narrow read request from the master controller 108, which is made only on memory controller 15. The AR REQ signal 218 can also include a 6-bit read transaction identifier that distinguishes narrow and wide read requests and specifies other stream-related information. The AR REQ signal 218 can also specify a 4-bit read burst length, supporting bursts of up to 16 beats (data transfers), depending on the specific AXI protocol used. In one example embodiment, three configurable logic block memories (CLBMs), each a 32x14 dual-port RAM, can be used to implement a 32x42 FIFO for the request buffer 508. In this case, the request buffer 508 is able to serve the AXI-AR master 504 with a new read request every axi_clk cycle.
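As a rough check that the listed fields fit the 42-bit FIFO word, the sketch below packs an address, a transaction identifier, and a raw burst-length field into one word and splits it into three 14-bit slices, one per 32x14 RAM. The field order, the slice assignment, and all function names are assumptions made for illustration; the text gives only the field widths and the FIFO geometry.

```python
ADDR_BITS = 28   # covers both the 28-bit global and the 23-bit local address
ID_BITS = 6      # read transaction identifier
LEN_BITS = 4     # raw burst-length field
WORD_BITS = 42   # FIFO word width (32x42 FIFO)
SLICE_BITS = 14  # width of each 32x14 RAM


def pack_ar_req(addr, txn_id, len_field):
    """Pack fields as [addr | txn_id | len_field]; this layout is hypothetical."""
    for value, bits, name in ((addr, ADDR_BITS, "addr"),
                              (txn_id, ID_BITS, "txn_id"),
                              (len_field, LEN_BITS, "len_field")):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
    return (addr << (ID_BITS + LEN_BITS)) | (txn_id << LEN_BITS) | len_field


def split_slices(word):
    """Split a 42-bit word into three 14-bit slices, least significant first."""
    mask = (1 << SLICE_BITS) - 1
    return [(word >> (i * SLICE_BITS)) & mask
            for i in range(WORD_BITS // SLICE_BITS)]


def join_slices(slices):
    """Inverse of split_slices."""
    return sum(s << (i * SLICE_BITS) for i, s in enumerate(slices))


if __name__ == "__main__":
    word = pack_ar_req(addr=0x7FFFFF, txn_id=5, len_field=15)
    print(hex(word), split_slices(word))
```

The three fields occupy 38 bits, which leaves 4 bits of the 42-bit word for other information under this assumed layout.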
[0076] The remote buffer pointer manager 406 generates signal 210 (remote buffer write enable) for read data received from HBM. The remote buffer pointer manager 406 can also locally generate remote buffer backpressure for the AXI-R (read data) channel corresponding to the AXI-R master 506. Because of skew between different memory channels, each remote buffer pointer manager 406 can maintain the fill level of the corresponding remote buffer by tracking the remote buffer write enable and the remote buffer read enable. Once HBM read data is valid, as indicated by assertion of the BRAM_WVALID signal, each remote buffer pointer manager 406 can increment the remote buffer fill level. Each remote buffer pointer manager 406 can also decrement the remote buffer fill level upon receiving signal 208 (remote buffer read enable) broadcast from the master controller 108. The remote buffer pointer manager 406 can also generate a remote buffer backpressure signal (e.g., BRAM_WREADY) for each memory channel based on a predefined remote buffer fill level. The remote buffer fill threshold used to generate the backpressure signal should take into account the skew between the remote buffer write-side delays across memory channels and the skew between the remote buffer read and write sides.
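The fill-level bookkeeping described above can be sketched as a small cycle-based model. The class name, the `capacity` and `margin` parameters, and the rule that the threshold equals capacity minus margin are assumptions for illustration; the text states only that the level rises on a valid write, falls on the common read enable, and drives backpressure at a predefined level that leaves room for skew.

```python
class RemoteBufferPointerManager:
    """Tracks one remote buffer's fill level and derives write backpressure."""

    def __init__(self, capacity, margin):
        # Hypothetical sizing: stop accepting writes 'margin' entries early
        # so that in-flight data and enable skew can still be absorbed.
        self.threshold = capacity - margin
        self.fill = 0

    @property
    def wready(self):
        """Backpressure output (BRAM_WREADY in the text): True means accept."""
        return self.fill < self.threshold

    def step(self, wvalid, read_enable):
        """Advance one cycle and return True if a write was accepted."""
        accepted = wvalid and self.wready
        if accepted:
            self.fill += 1          # write enable (signal 210) asserted
        if read_enable and self.fill > 0:
            self.fill -= 1          # common read enable (signal 208) received
        return accepted


if __name__ == "__main__":
    mgr = RemoteBufferPointerManager(capacity=8, margin=2)
    for _ in range(7):
        mgr.step(wvalid=True, read_enable=False)
    print(mgr.fill, mgr.wready)
```

Keeping the counter next to the memory channel, as the text describes, lets the backpressure decision be made locally instead of waiting for status to return from a remote die.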
[0077] As shown in the figure, the AXI-R master 506 can output data that can be provided to the corresponding remote buffer.
[0078] Figure 6 illustrates an example embodiment of the master controller 108. As shown in Figure 6, the master controller 108 includes a remote buffer read address generation unit (remote buffer read AGU) 602 coupled to the request controller 604. In the example of Figure 6, the master controller 108 converts both narrow read requests and wide read requests from the BRAM controller 204 into AXI-AR requests for request buffers 0-31 (for narrow read requests, only request buffer 31). The master controller 108 further converts narrow write requests into AXI-W and AXI-AW requests only for bus master circuit 31.
[0079] In one aspect, concurrently with each AXI-AR request, the expected read burst length on AXI-R is sent from the request controller 604 to the remote buffer read AGU 602. The remote buffer read AGU 602 is capable of queuing the received expected read burst lengths, for example, in a queue or memory. The remote buffer read AGU 602 is capable of maintaining the fill level of each remote buffer by tracking all signals 210 (e.g., all remote buffer write enables 0-31) and signal 208 (e.g., the common remote buffer read enable). As long as the fill level of all remote buffers exceeds the expected burst length stored in the queue, the remote buffer read AGU 602 is capable of triggering (e.g., continuing to trigger) a remote buffer read operation (e.g., asserting signal 208). Signal 208 is used by each request buffer to create remote buffer backpressure for each memory channel and by each remote buffer to de-skew. Because the expected burst length is predetermined and queued in the remote buffer read AGU 602, the AXI reordering function, which allows the output order to differ from the input order, is disabled by forcing the AXI ID to the same value.
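The read-trigger condition can be sketched as a cycle-based model that keeps one fill counter per remote buffer and a queue of expected burst lengths. The class and method names are hypothetical, and the comparison is written as "at least the burst length", which is one possible reading of "exceeds" in the text.

```python
from collections import deque


class RemoteBufferReadAGU:
    """Asserts the common read enable once every buffer holds a full burst."""

    def __init__(self, num_buffers=32):
        self.fill = [0] * num_buffers
        self.expected_bursts = deque()   # burst lengths, in request order
        self.remaining = 0               # beats left in the burst being read

    def queue_burst(self, length):
        """Record the expected burst length sent alongside an AXI-AR request."""
        self.expected_bursts.append(length)

    def step(self, write_enables):
        """Advance one cycle; return True if signal 208 is asserted."""
        for i, enabled in enumerate(write_enables):
            if enabled:
                self.fill[i] += 1        # per-channel write enable (signal 210)
        if (self.remaining == 0 and self.expected_bursts
                and min(self.fill) >= self.expected_bursts[0]):
            self.remaining = self.expected_bursts.popleft()
        if self.remaining > 0:
            self.fill = [level - 1 for level in self.fill]
            self.remaining -= 1
            return True
        return False


if __name__ == "__main__":
    agu = RemoteBufferReadAGU(num_buffers=2)
    agu.queue_burst(2)
    for enables in ([True, False], [True, True], [False, True], [False, False]):
        print(agu.step(enables), agu.fill)
```

Gating on the minimum fill level means the slowest channel sets the pace, so all remote buffers are read in the same cycle and the output is de-skewed.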
[0080] Figure 7 shows an example embodiment of the request controller 604 of Figure 6. In the example of Figure 7, the request controller 604 includes a transaction buffer 702 and a scheduler 704. The transaction buffer 702 uses sys_clk as the write clock and axi_clk as the read clock to decouple the two asynchronous clock domains. The request controller 604 also includes multiple controllers, shown as an AR (address read) controller 706, an AXI-AW (address write) controller 708, an AXI-W (write) controller 710, and an AXI-B detector 712 capable of detecting valid responses on the AXI-B, or AXI write response, channel.
[0081] In the example of Figure 7, the AR controller 706 is able to check the request buffer backpressure represented by signals 216 (e.g., AR REQ ready signals 0-31). The AR controller 706 can check AR REQ ready signals 0-31 simultaneously, for example, together. In one aspect, the AR controller 706 generates an AXI-AR REQ (HBM read request) only if there is available space in each request buffer.
[0082] For AXI read-related channels, such as AXI-AR and AXI-R, a request buffer is required between the AXI master and the request controller 604. For AXI write-related channels, such as AXI-AW, AXI-W, and AXI-B, the request controller 604 (e.g., controllers 708, 710, and 712) communicates directly with the AXI master 31 corresponding to the memory controller 15. When the AXI ready indication and the AXI valid indication (AXI-XX-AWREADY / AWVALID or AXI-XX-WREADY / WVALID) are asserted in the same cycle, the request controller 604 outputs an AXI request and simultaneously reads a new request from the buffer.
[0083] Scheduler 704 is capable of routing (e.g., scheduling) different AXI transactions from transaction buffer 702 to the appropriate controllers among controllers 706, 708, 710, and / or 712 based on transaction type. Scheduler 704 also schedules AXI access sequences between AXI writes and AXI reads; and arranges AXI access sequences between consecutive AXI write operations. For example, scheduler 704 does not issue a new AXI-AR request or a new AXI-AW transaction until it receives a response from a previous AXI write transaction.
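The ordering rule in the preceding paragraph can be sketched as a toy model in which the scheduler holds back every new transaction while a write response is outstanding. The class, its method names, and the two-value transaction type are hypothetical simplifications; the actual scheduler routes to four controllers and is not described at this level of detail.

```python
from collections import deque


class WriteOrderedScheduler:
    """Issues transactions in order, stalling while a write awaits its response."""

    def __init__(self):
        self.pending = deque()
        self.write_outstanding = False

    def submit(self, kind):
        if kind not in ("read", "write"):
            raise ValueError("kind must be 'read' or 'write'")
        self.pending.append(kind)

    def issue(self):
        """Return the next transaction type, or None if nothing may be issued."""
        if self.write_outstanding or not self.pending:
            return None
        kind = self.pending.popleft()
        if kind == "write":
            self.write_outstanding = True   # wait for the AXI-B response
        return kind

    def write_response(self):
        """Called when a valid response is detected on the AXI-B channel."""
        self.write_outstanding = False


if __name__ == "__main__":
    sched = WriteOrderedScheduler()
    for kind in ("write", "read", "read"):
        sched.submit(kind)
    print(sched.issue(), sched.issue())
    sched.write_response()
    print(sched.issue(), sched.issue())
```

In this model consecutive reads issue back to back, while anything queued behind a write waits for that write to complete.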
[0084] Referring to Figures 6 and 7, the remote buffer read AGU 602 receives the remote buffer write enable corresponding to each memory channel and the common remote buffer read enable. By counting data transfer units using performance counters, the system throughput can be measured at either the write side or the read side of the remote buffers.
[0085] Figure 8 illustrates an example method 800 of transferring data between an HBM and a distributed compute array. Method 800 can be performed using the circuit architecture described in this disclosure in connection with Figures 1 to 7 (referred to as the "system" with reference to Figure 8).
[0086] In block 802, the system can monitor the fill levels of multiple remote buffers distributed across multiple dies. Each of the multiple remote buffers can be configured to provide data to a compute array that is also distributed across the multiple dies. In block 804, the system can determine, based on the fill levels, that each of the multiple remote buffers is storing data. In block 806, in response to this determination, the system can initiate data transfer from each of the multiple remote buffers to the compute array across the multiple dies. The data transfer can be synchronous (e.g., de-skewed). For example, the data transfers occurring on each die are synchronous. Minimal pipelining is used to facilitate synchronization from one die to another. Data output from the remote buffers to the compute array rows is further de-skewed.
[0087] In one aspect, the system initiates data transfer from each remote buffer by broadcasting a read enable signal to each of the multiple remote buffers. The read enable signal is a common read enable signal broadcast to each remote buffer.
[0088] In another aspect, the system is able to monitor the fill levels by tracking multiple write enables that correspond to the multiple remote buffers on a one-to-one basis, as well as tracking the read enable that is common to each of the multiple remote buffers.
[0089] In a particular implementation, the system is able to receive data from the HBM within multiple RBBM circuit blocks located on a first die among multiple dies via multiple corresponding memory channels. The multiple RBBM circuit blocks provide data to various remote buffers among multiple remote buffers.
[0090] The system is also able to convert a first request to access memory into a second request that conforms to the on-chip communication bus, and provide the second request to the communication bus master circuit corresponding to each of the plurality of RBBM circuit blocks.
[0091] The system is capable of providing data from each of a plurality of remote buffers to a plurality of cache circuit blocks distributed across a plurality of dies, wherein each cache circuit block is connected to at least one of the plurality of remote buffers and a compute array. Each cache circuit block can be configured to receive data from a selected remote buffer at a first clock rate and output data to the compute array at a second clock rate exceeding the first clock rate.
[0092] Figure 9 depicts an example architecture of a programmable device 900. The programmable device 900 is an example of a programmable IC and an adaptive system. In one aspect, the programmable device 900 is also an example of a system-on-a-chip (SoC). The programmable device 900 can be implemented using multiple interconnected dies, where the various programmable circuit resources shown in Figure 9 are implemented on different interconnected dies. In one example, the programmable device 900 can be used to implement the example circuit architectures described herein in connection with Figures 1-8.
[0093] In the example, the programmable device 900 includes a data processing engine (DPE) array 902, programmable logic (PL) 904, a processor system (PS) 906, a network-on-chip (NoC) 908, a platform management controller (PMC) 910, and one or more hardwired circuit blocks 912. It also includes a configuration frame interface (CFI) 914.
[0094] The DPE array 902 is implemented as multiple interconnected, programmable data processing engines (DPEs) 916. The DPEs 916 can be arranged in an array and are hardwired. Each DPE 916 may include one or more cores 918 and a memory module (abbreviated as "MM" in Figure 9) 920. In one aspect, each core 918 is capable of executing program code (not shown) stored in a core-specific program memory contained within each corresponding core. Each core 918 has direct access to the memory module 920 within the same DPE 916 and to the memory module 920 of any other DPE 916 that is adjacent to the core 918 in the top, bottom, left, and right directions. For example, core 918-5 can directly read memory modules 920-5, 920-8, 920-6, and 920-2. Core 918-5 treats memory modules 920-5, 920-8, 920-6, and 920-2 as a unified memory region (e.g., as part of its local memory). This facilitates data sharing between different DPEs 916 in the DPE array 902. In other examples, core 918-5 may be directly connected to memory modules 920 in other DPEs.
[0095] The DPEs 916 are interconnected via programmable interconnect circuitry. The programmable interconnect circuitry can include one or more distinct and independent networks. For example, the programmable interconnect circuitry can include a stream network formed by stream connections (shaded arrows) and a memory-mapped network formed by memory-mapped connections (cross-hatched arrows).
[0096] Loading configuration data into the control registers of a DPE 916 via the memory-mapped connections allows each DPE 916 and the components therein to be controlled independently. DPEs 916 can be enabled or disabled on a per-DPE basis. For example, each core 918 can be configured to access the memory modules 920 as described, or only a subset thereof, to achieve isolation of a core 918 or of multiple cores 918 operating as a cluster. Each stream connection can be configured to establish logical connections between only selected DPEs 916 to achieve isolation of a DPE 916 or of multiple DPEs 916 operating as a cluster. Because each core 918 can load program code specific to that core 918, each DPE 916 can implement one or more different kernels within it.
[0097] In other aspects, the programmable interconnect circuitry within the DPE array 902 may include other independent networks, such as a debug network that is independent of (e.g., distinct and separate from) the stream connections and the memory-mapped connections, and/or an event broadcast network. In some aspects, the debug network is formed by the memory-mapped connections and/or is part of the memory-mapped network.
[0098] Cores 918 can be directly connected to adjacent cores 918 via core-to-core cascade connections. In one aspect, the core-to-core cascade connections are unidirectional, direct connections between cores 918, as shown in the figure. In another aspect, the core-to-core cascade connections are bidirectional, direct connections between cores 918. Activation of the core-to-core cascade interfaces can also be controlled by loading configuration data into the control registers of each DPE 916.
[0099] In the example implementation, the DPEs 916 do not include cache memories. By omitting cache memories, the DPE array 902 can achieve predictable, e.g., deterministic, performance. Furthermore, significant processing overhead is avoided because there is no need to maintain coherency between cache memories located in different DPEs 916. In another example, the cores 918 do not have input interrupts. Therefore, the cores 918 can operate uninterrupted. Omitting input interrupts to the cores 918 also allows the DPE array 902 to achieve predictable, e.g., deterministic, performance.
[0100] The SoC interface block 922 operates as an interface connecting the DPEs 916 to other resources of the programmable device 900. In the example of Figure 9, the SoC interface block 922 includes a plurality of interconnected tiles 924 arranged in a row. In a particular embodiment, different architectures may be used to implement the tiles 924 within the SoC interface block 922, with each different tile architecture supporting communication with different resources of the programmable device 900. The tiles 924 are connected so that data can be propagated bidirectionally from one tile to another. Each tile 924 is capable of operating as an interface for the column of DPEs 916 directly above it.
[0101] Tiles 924 are connected to adjacent tiles, to the DPEs 916 directly above, and to the circuitry below via the illustrated stream connections and memory-mapped connections. Tiles 924 may also include a debug network connected to the debug network implemented in the DPE array 902. Each tile 924 is capable of receiving data from another source, such as PS 906, PL 904, and/or another hardwired circuit block 912. For example, tile 924-1 is capable of providing those portions of the data, whether application or configuration data, that are addressed to DPEs 916 in the column above to such DPEs 916, while sending data addressed to DPEs 916 in other columns on to other tiles 924, such as 924-2 or 924-3, so that such tiles 924 can route the data addressed to DPEs 916 in their respective columns accordingly.
[0102] In one aspect, the SoC interface block 922 includes two different types of tiles 924. The first type of tile 924 has an architecture configured to serve solely as an interface between DPEs 916 and PL 904. The second type of tile 924 has an architecture configured to serve as an interface between DPEs 916 and NoC 908, and between DPEs 916 and PL 904. The SoC interface block 922 may include a combination of the first and second types of tiles, or only the second type of tiles.
[0103] In one aspect, the DPE array 902 can be used to implement the compute array described herein. In this regard, the DPE array 902 can be distributed across multiple different dies.
[0104] The PL 904 is circuitry that can be programmed to perform specified functions. As an example, the PL 904 can be implemented as field-programmable gate array (FPGA) type circuitry. The PL 904 can comprise an array of programmable circuit blocks. As defined herein, the term "programmable logic" refers to circuitry used to build reconfigurable digital circuits. Programmable logic consists of numerous programmable circuit blocks (sometimes called "tiles") that provide basic functionality. Unlike hardwired circuitry, the topology of the PL 904 is highly configurable. Each programmable circuit block of the PL 904 typically includes programmable elements 926 (e.g., functional elements) and programmable interconnects 942. The programmable interconnects 942 provide the highly configurable topology of the PL 904. The programmable interconnects 942 can be configured on a per-wire basis to provide connectivity among the programmable elements 926 of the programmable circuit blocks of the PL 904, and are configurable on a per-bit basis (e.g., where each wire conveys a single bit of information), unlike the connectivity among the DPEs 916.
[0105] Examples of programmable circuit blocks in the PL 904 include configurable logic blocks with lookup tables and registers. Unlike the hardwired circuits (sometimes called hard blocks) described below, these programmable circuit blocks have undefined functionality at the time of manufacture. The PL 904 may include other types of programmable circuit blocks that also provide basic and defined functionality, but with more limited programmability. Examples of these circuit blocks may include digital signal processing blocks (DSPs), phase-locked loops (PLLs), and block random access memory (BRAMs). As with other programmable circuit blocks in the PL 904, these types of programmable circuit blocks are numerous and intermingled with other programmable circuit blocks in the PL 904. These circuit blocks may also have an architecture that typically includes programmable interconnects 942 and programmable elements 926, and thus, are part of the highly configurable topology of the PL 904.
[0106] Before use, the PL 904, such as programmable interconnects and programmable elements, must be programmed or "configured" by loading data, known as a configuration bitstream, into its internal configuration memory cells. Once the configuration bitstream is loaded into the configuration memory cells, it defines how the PL 904 is configured, such as its topology and mode of operation (e.g., the specific functions it performs). In this disclosure, "configuration bitstream" is not equivalent to program code that can be executed by a processor or computer.
[0107] In one aspect, the PL 904 can be used to implement one or more of the components shown in Figures 1-7. For example, the various buffers, queues, and/or controllers can be implemented using the PL 904. In this regard, the PL 904 can be distributed across multiple dies.
[0108] The PS 906 is implemented as a hard-wired circuit manufactured as part of a programmable device 900. The PS 906 can be implemented as or include any of a variety of different processor types, each capable of executing program code. For example, the PS 906 can be implemented as a single processor, such as a single core capable of executing program code. In another example, the PS 906 can be implemented as a multi-core processor. In yet another example, the PS 906 can include one or more cores, modules, coprocessors, I / O interfaces, and / or other resources. The PS 906 can be implemented using any of a variety of different architectures. Example architectures that can be used to implement the PS 906 include, but are not limited to, ARM processor architecture, x86 processor architecture, graphics processing unit (GPU) architecture, mobile processor architecture, DSP architecture, combinations of the foregoing architectures, or other suitable architectures capable of executing computer-readable instructions or program code.
[0109] NoC 908 is a programmable interconnect network for sharing data between endpoint circuits in the programmable device 900. The endpoint circuits can be arranged in the DPE array 902, PL 904, PS 906, and/or selected hardwired circuit blocks 912. NoC 908 may include high-speed data paths with dedicated switching. In one example, NoC 908 includes one or more horizontal paths, one or more vertical paths, or both horizontal and vertical paths. The arrangement and number of regions shown in Figure 9 are merely an example. NoC 908 is an example of a general infrastructure within the programmable device 900 that can be used to connect selected components and/or subsystems.
[0110] Within the NoC 908, the nets to be routed through the NoC 908 are unknown until a user circuit design is created for implementation within the programmable device 900. The NoC 908 can be programmed by loading configuration data into internal configuration registers that define how elements within the NoC 908, such as switches and interfaces, are configured and operated to pass data from switch to switch and between NoC interfaces to connect the endpoint circuits. The NoC 908 is manufactured as part of the programmable device 900 (e.g., hardwired) and, while not physically modifiable, can be programmed to establish connectivity between different master and slave circuits of a user circuit design. The NoC 908 does not implement any data paths or routes when powered on. However, once configured by the PMC 910, the NoC 908 implements data paths or routes between endpoint circuits.
[0111] The PMC 910 manages the programmable device 900. As a subsystem within the programmable device 900, the PMC 910 manages all other programmable circuit resources of the programmable device 900. The PMC 910 maintains a secure and reliable environment, boots the programmable device 900, and manages the programmable device 900 during normal operation. For example, the PMC 910 provides unified and programmable control over the different programmable circuit resources of the programmable device 900 (e.g., DPE array 902, PL 904, PS 906, and NoC 908) for power-up, boot/configuration, security, power management, safety monitoring, debugging, and/or error handling. The PMC 910 operates as a dedicated platform manager that decouples the PS 906 and PL 904. Therefore, the PS 906 and PL 904 can be managed, configured, powered on, and/or powered off independently of each other.
[0112] In one aspect, the PMC 910 can operate as the root of trust for the entire programmable device 900. As an example, the PMC 910 is responsible for authenticating and/or validating device images containing configuration data for any of the programmable resources of the programmable device 900 that may be loaded into the programmable device 900. The PMC 910 is also capable of protecting the programmable device 900 against tampering during operation. By operating as the root of trust for the programmable device 900, the PMC 910 can monitor the operation of the PL 904, the PS 906, and/or any other programmable circuit resources that may be included in the programmable device 900. The root-of-trust capabilities performed by the PMC 910 are distinct and separate from the operations of the PS 906 and PL 904 and/or any operations performed by the PS 906 and/or PL 904.
[0113] In one aspect, the PMC 910 operates on a dedicated power supply. Therefore, the PMC 910 is powered by a power supply that is separate and independent from the power supplies of the PS 906 and PL 904. This power independence allows the PMC 910, PS 906, and PL 904 to be protected from one another in terms of electrical noise and glitches. Furthermore, one or both of the PS 906 and PL 904 can be powered down (e.g., suspended or placed in a sleep mode) while the PMC 910 continues to operate. This capability allows any portion of the programmable device 900 that has been powered down, such as the PL 904, PS 906, NoC 908, etc., to wake and be restored to an operational state more quickly, without requiring the entire programmable device 900 to undergo a complete power-up and boot process.
[0114] The PMC 910 can be implemented as a processor with dedicated resources. The PMC 910 may include multiple redundant processors. The processors of the PMC 910 are capable of executing firmware. The use of firmware supports the configurability and segmentation of global features of the programmable device 900, such as reset, clocking, and protection, to provide flexibility in creating separate processing domains (which may differ from subsystem-specific "power domains"). A processing domain may involve a mixture or combination of one or more different programmable circuit resources of the programmable device 900 (e.g., a processing domain may include different combinations of circuitry from the DPE array 902, PS 906, PL 904, NoC 908, and/or other hardwired circuit blocks 912).
[0115] The hardwired circuit blocks 912 include dedicated circuit blocks manufactured as part of the programmable device 900. Although hardwired, the hardwired circuit blocks 912 can be configured to implement one or more different operating modes by loading configuration data into control registers. Examples of hardwired circuit blocks 912 include input/output (I/O) blocks, transceivers for sending and receiving signals to and from circuits and/or systems external to the programmable device 900, memory controllers, etc. Examples of different I/O blocks include single-ended and pseudo-differential I/Os. Examples of transceivers include high-speed differentially clocked transceivers. Other examples of hardwired circuit blocks 912 include, but are not limited to, cryptographic engines, digital-to-analog converters (DACs), analog-to-digital converters (ADCs), etc. In general, the hardwired circuit blocks 912 are application-specific circuit blocks.
[0116] In one aspect, the hardwired circuit blocks 912 can be used to implement one or more of the components shown in Figures 1-7. For example, various memory controllers and/or other controllers can be implemented as hardwired circuit blocks 912. In this regard, one or more hardwired circuit blocks 912 can be distributed across multiple dies.
[0117] The CFI 914 is an interface through which configuration data, such as a configuration bitstream, can be provided to the PL 904 to implement different user-specified circuits and/or circuitry therein. The CFI 914 is coupled to and accessible by the PMC 910 to provide configuration data to the PL 904. In some cases, the PMC 910 can configure the PS 906 first, so that once the PS 906 is configured by the PMC 910, the PS 906 can provide configuration data to the PL 904 via the CFI 914. In one aspect, the CFI 914 has built-in cyclic redundancy check (CRC) circuitry (e.g., 32-bit CRC circuitry). Therefore, any data loaded into the CFI 914 and/or read back via the CFI 914 can be checked for integrity by examining the value of the code appended to the data.
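The integrity check described above, in which a code appended to the data is examined, can be sketched in software as follows. The text does not specify the CRC polynomial or the framing used by the CFI 914, so this example assumes the common CRC-32 provided by Python's `zlib.crc32` and a 4-byte big-endian code appended to the payload. The function names `append_crc32` and `check_crc32` are hypothetical.

```python
import zlib


def append_crc32(data: bytes) -> bytes:
    """Return the data followed by its 32-bit CRC (assumed big-endian)."""
    return data + zlib.crc32(data).to_bytes(4, "big")


def check_crc32(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the appended code."""
    if len(frame) < 4:
        return False
    payload, code = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == code


frame = append_crc32(b"configuration bitstream")

# Flip one bit of the payload to model corruption during loading or readback.
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]

intact_ok = check_crc32(frame)
corrupted_ok = check_crc32(corrupted)
```

A single flipped bit changes the recomputed CRC, so the corrupted frame fails the check while the intact frame passes.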
[0118] The various programmable circuit resources shown in Figure 9 can be programmed initially as part of the boot process of the programmable device 900. During runtime, the programmable circuit resources can be reconfigured. In one aspect, the PMC 910 is capable of initially configuring the DPE array 902, PL 904, PS 906, and NoC 908. At any point during runtime, the PMC 910 can reconfigure all or part of the programmable device 900. In some cases, once the PS 906 is initially configured by the PMC 910, the PS 906 can configure and/or reconfigure the PL 904 and/or the NoC 908.
[0119] The example programmable device described in connection with Figure 9 is for illustrative purposes only. In other example implementations, the example circuit architectures described herein may be implemented in custom multi-die ICs (e.g., application-specific ICs having multiple dies) and/or in programmable ICs such as field-programmable gate arrays (FPGAs) having multiple dies. Furthermore, the particular technique used to communicatively link the dies within the IC package, such as a common silicon interposer with wires coupling the dies, a multi-chip module, three or more stacked dies, etc., is not intended to limit the inventive arrangements described herein.
[0120] For illustrative purposes, specific nomenclature is set forth to provide a thorough understanding of the various inventive concepts disclosed herein. However, the terminology used herein is for the purpose of describing particular aspects of the inventive arrangements only and is not intended to be limiting.
[0121] As defined herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise.
[0122] As defined herein, the term “approximately” means nearly correct or exact, close in value or amount but not precise. For example, the term “approximately” can mean that the recited characteristic, parameter, or value is within a predetermined amount of the exact characteristic, parameter, or value.
[0123] As defined herein, unless otherwise expressly stated, the terms “at least one,” “one or more,” and “and/or” are open-ended expressions that are both conjunctive and disjunctive in operation. For example, each of the expressions “at least one of A, B, and C,” “at least one of A, B, or C,” “one or more of A, B, and C,” “one or more of A, B, or C,” and “A, B, and/or C” means A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B, and C together.
[0124] As defined herein, the term "automatic" means without human intervention. As defined herein, the term "user" means a person.
[0125] As defined herein, the term “if” means “when,” “upon,” or “in response to,” depending on the context. Therefore, the phrase “if it is determined” or “if [a stated condition or event] is detected” can be interpreted to mean “upon determining,” “in response to determining,” “upon detecting [the stated condition or event],” or “in response to detecting [the stated condition or event],” depending on the context.
[0126] As defined herein, the term "in response to" and similar language as described above, such as "if," "when," or "upon," means responding or reacting readily to an action or event. The response or reaction is performed automatically. Therefore, if a second action is performed "in response to" a first action, a causal relationship exists between the occurrence of the first action and the occurrence of the second action. The term "in response to" indicates this causal relationship.
[0127] As defined herein, the term "processor" refers to at least one hardware circuit. The hardware circuit can be configured to execute instructions contained in program code. The hardware circuit can be an integrated circuit or embedded within an integrated circuit.
[0128] As defined herein, the term “substantially” means that the recited characteristic, parameter, or value need not be achieved exactly, but that deviations or variations (including, for example, tolerances, measurement error, measurement accuracy limitations, and other factors known to those skilled in the art) may occur in amounts that do not preclude the effect the characteristic was intended to provide.
[0129] The terms first, second, etc., may be used herein to describe various elements. These elements should not be limited by these terms, as the terms are used only to distinguish one element from another, unless otherwise specified or clearly indicated by the context.
[0130] The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various aspects of the inventive arrangements. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of instructions that includes one or more executable instructions for implementing the specified operations.
[0131] In some alternative implementations, the operations noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending on the functionality involved. In some examples, blocks may generally be executed in increasing numeric order, while in other examples, one or more blocks may be executed in a varying order, with the results being stored and used in subsequent blocks or in other blocks that do not immediately follow. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, may be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or that carry out combinations of special-purpose hardware and computer instructions.
[0132] The corresponding structures, materials, acts, and equivalents of all means or step plus function elements that may be found in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed.
[0133] An IC may include multiple dies. The IC may include multiple memory channel interfaces configured to communicate with a memory, wherein the multiple memory channel interfaces are disposed within a first die of the multiple dies. The IC may include a computing array distributed across the multiple dies and multiple remote buffers distributed across the multiple dies. The multiple remote buffers may be coupled to the multiple memory channels and to the computing array. The IC may also include a controller configured to determine that each of the multiple remote buffers stores data and, in response, to broadcast a read enable signal to each of the multiple remote buffers, thereby initiating data transfers from the multiple remote buffers, across the multiple dies, to the computing array.
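The controller behavior summarized above can be illustrated with a small behavioral model. This is a sketch under stated assumptions rather than the circuit itself: each remote buffer is modeled as a FIFO, the broadcast read enable is modeled as reading one word from every buffer in the same step, and the names `RemoteBuffer`, `Controller`, and `step` are hypothetical.

```python
from collections import deque


class RemoteBuffer:
    """FIFO standing in for one remote buffer."""

    def __init__(self):
        self.fifo = deque()

    def write(self, word):
        self.fifo.append(word)

    def has_data(self):
        return len(self.fifo) > 0

    def read(self):
        return self.fifo.popleft()


class Controller:
    """Reads all buffers together, and only when every buffer holds data."""

    def __init__(self, buffers):
        self.buffers = buffers

    def step(self):
        """Model one broadcast read enable.

        Returns one word per buffer, or None if any buffer is still empty.
        """
        if all(b.has_data() for b in self.buffers):
            return [b.read() for b in self.buffers]
        return None


buffers = [RemoteBuffer() for _ in range(3)]
controller = Controller(buffers)

buffers[0].write("w0")
buffers[1].write("w1")
stalled = controller.step()    # buffer 2 is still empty, so nothing is read

buffers[2].write("w2")
wavefront = controller.step()  # every buffer holds data: read them together
```

Because no buffer is read until all of them hold data, the words leave the buffers as a single aligned group even though they were written at different times.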
[0134] The data transfers can be synchronized to eliminate skew among the individual data transfers.
[0135] The foregoing and other embodiments may each optionally include one or more of the following features, individually or in combination. One or more embodiments may include a combination of all of the following features.
[0136] In one aspect, the IC may include a plurality of request buffer-bus master circuit blocks disposed in the first die, wherein each request buffer-bus master circuit block is connected to one of the plurality of memory channel interfaces and to at least one of the plurality of remote buffers.
[0137] In another aspect, the IC may include multiple cache circuit blocks distributed across the multiple dies, wherein each cache circuit block is connected to at least one of the multiple remote buffers and to the computing array.
[0138] In another aspect, each cache circuit block can be configured to receive data from a selected remote buffer at a first clock rate and to output the data to the computing array at a second clock rate exceeding the first clock rate.
[0139] In another aspect, the computing array comprises multiple rows, wherein each of the multiple dies comprises two or more of the rows.
[0140] In another aspect, each memory channel interface can provide data from the memory to two or more rows of the computing array.
[0141] In another aspect, the memory is a high-bandwidth memory. In another aspect, the memory is a double data rate random access memory.
[0142] In another aspect, the computing array implements a neural network processor, and the data specifies the weights applied by the neural network processor.
[0143] In another aspect, each memory channel interface provides data from the memory to two or more rows of the computing array.
[0144] In one aspect, a controller is disposed within an IC having multiple dies. The controller includes a request controller configured to translate a first request for memory access into a second request conforming to an on-chip communication bus, wherein the request controller provides the second request to multiple request buffer-bus master blocks configured to receive data from multiple channels of the memory. The controller also includes a remote buffer read address generation unit coupled to the request controller and configured to monitor the fill level of each of multiple remote buffers distributed across the multiple dies. Each of the multiple remote buffers is configured to provide data obtained from respective ones of the multiple request buffer-bus master blocks to a computing array that is also distributed across the multiple dies. In response to determining, based on the fill level, that each of the multiple remote buffers is storing data, the remote buffer read address generation unit is configured to initiate a data transfer from each of the multiple remote buffers, across the multiple dies, to the computing array.
[0145] Data transmission can be synchronized to eliminate skew caused by data transmitted from different sources.
[0146] The foregoing and other embodiments may each optionally include one or more of the following features, individually or in combination. One or more embodiments may include a combination of all of the following features.
[0147] In one aspect, the request controller can receive the first request at a first clock rate and provide the second request at a second clock rate.
[0148] In another aspect, the remote buffer read address generation unit can monitor the fill level of each of the multiple remote buffers by tracking multiple write enables corresponding to the multiple remote buffers and tracking the read enable common to each of the multiple remote buffers.
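Fill-level monitoring from enable signals alone can be sketched as a set of counters: each write enable increments the count for its buffer, and the common read enable decrements every count. The example below is a minimal software illustration, assuming one word is written or read per asserted enable per cycle; the class `FillLevelTracker` and its method names are hypothetical.

```python
class FillLevelTracker:
    """Tracks per-buffer fill levels from write enables and a common read enable."""

    def __init__(self, num_buffers):
        self.levels = [0] * num_buffers

    def all_have_data(self):
        """True when every tracked buffer holds at least one word."""
        return all(level > 0 for level in self.levels)

    def on_cycle(self, write_enables, read_enable):
        """Update the fill levels for one clock cycle.

        write_enables: one bool per remote buffer.
        read_enable: a single bool common to all remote buffers.
        """
        if read_enable and not self.all_have_data():
            raise ValueError("read enable asserted while a buffer is empty")
        for i, write_enable in enumerate(write_enables):
            self.levels[i] += int(write_enable) - int(read_enable)


tracker = FillLevelTracker(3)

tracker.on_cycle([True, False, True], read_enable=False)
ready_early = tracker.all_have_data()   # buffer 1 has not been written yet

tracker.on_cycle([False, True, False], read_enable=False)
ready = tracker.all_have_data()         # every buffer now holds one word

# A common read drains one word from every buffer while buffer 0 is refilled.
tracker.on_cycle([True, False, False], read_enable=True)
```

Tracking enables in this way lets the monitoring logic know each fill level without querying the buffers themselves, which in the text are distributed across multiple dies.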
[0149] A method may include monitoring the fill level of multiple remote buffers distributed across multiple dies, wherein each of the multiple remote buffers is configured to provide data to a computing array that is also distributed across the multiple dies. The method may further include determining, based on the fill level, that each of the multiple remote buffers is storing data and, in response to the determination, initiating a data transfer from each of the multiple remote buffers, across the multiple dies, to the computing array.
[0150] Data transmission can be synchronized to eliminate skew caused by data transmitted from different sources.
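The effect of synchronization on skew can be shown with a short numerical illustration. The arrival cycles below are invented for the example and do not come from the disclosure; the helper names `transfer_start_cycles` and `skew` are hypothetical. Without synchronization, each buffer would forward data as soon as it arrives; with synchronization, every transfer starts once the last buffer holds data.

```python
def transfer_start_cycles(arrival_cycles, synchronized):
    """Return the cycle at which each remote buffer starts its transfer."""
    if synchronized:
        # All transfers wait for the slowest channel, then start together.
        start = max(arrival_cycles)
        return [start] * len(arrival_cycles)
    # Unsynchronized: each buffer forwards data as soon as it arrives.
    return list(arrival_cycles)


def skew(start_cycles):
    """Skew is the spread between the earliest and latest transfer start."""
    return max(start_cycles) - min(start_cycles)


# Hypothetical cycle at which each remote buffer first holds data.
arrivals = [3, 7, 5, 12]

unsynchronized_skew = skew(transfer_start_cycles(arrivals, synchronized=False))
synchronized_skew = skew(transfer_start_cycles(arrivals, synchronized=True))
```

In this illustration the synchronized transfers start together at the latest arrival, trading a short wait on the faster channels for zero skew at the computing array.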
[0151] The foregoing and other embodiments may each optionally include, individually or in combination, one or more of the following features. One or more embodiments may include a combination of all of the following features.
[0152] In one aspect, initiating data transmission from each remote buffer includes broadcasting a read enable signal to each of the multiple remote buffers.
[0153] In another aspect, monitoring the fill level can include tracking multiple write enables corresponding to the multiple remote buffers and tracking the read enable common to each of the multiple remote buffers.
[0154] In another aspect, the method may include receiving data from the memory, via multiple corresponding memory channels, within multiple request buffer-bus master circuit blocks disposed in a first die of the multiple dies, wherein the multiple request buffer-bus master circuit blocks provide data to respective ones of the multiple remote buffers.
[0155] In another aspect, the method may include converting a first request to access the memory into a second request conforming to the on-chip communication bus, and providing the second request to the communication bus master circuits of the plurality of request buffer-bus master circuit blocks.
[0156] In another aspect, the method may include providing data from each of a plurality of remote buffers to a plurality of cache circuit blocks distributed on a plurality of dies, wherein each cache circuit block is connected to at least one remote buffer of the plurality of remote buffers and a computing array.
[0157] In another aspect, each cache circuit block can be configured to receive data from a selected remote buffer at a first clock rate and to output the data to the computing array at a second clock rate exceeding the first clock rate.
[0158] The description of the inventive arrangements provided herein is for illustrative purposes and is not intended to be exhaustive or limited to the forms and examples disclosed. The terminology used herein was chosen to explain the principles of the inventive arrangements, the practical application or technical improvement over technologies found in the marketplace, and/or to enable others skilled in the art to understand the inventive arrangements disclosed herein. Modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described inventive arrangements. Therefore, reference should be made to the following claims, rather than to the foregoing disclosure, as indicating the scope of such features and embodiments.
Claims
1. An integrated circuit comprising multiple dies, characterized in that the integrated circuit includes: multiple memory channel interfaces configured to communicate with a memory, wherein the multiple memory channel interfaces are disposed within a first die of the multiple dies; a computing array distributed across the multiple dies; multiple remote buffers distributed across the multiple dies and coupled to the multiple memory channels and the computing array; and a controller configured to determine that each of the multiple remote buffers stores data and, in response, to broadcast a read enable signal to each of the multiple remote buffers to initiate data transfers from the multiple remote buffers, across the multiple dies, to the computing array.
2. The integrated circuit according to claim 1, characterized in that the integrated circuit further includes: multiple request buffer-bus master blocks disposed in the first die, wherein each request buffer-bus master block is connected to one of the multiple memory channel interfaces and to at least one of the multiple remote buffers.
3. The integrated circuit according to claim 1, characterized in that the integrated circuit further includes: multiple cache circuit blocks distributed across the multiple dies, wherein each cache circuit block is connected to at least one of the multiple remote buffers and to the computing array.
4. The integrated circuit according to claim 3, characterized in that each cache circuit block is configured to receive data from a selected remote buffer at a first clock rate and to output the data to the computing array at a second clock rate exceeding the first clock rate.
5. The integrated circuit according to claim 1, characterized in that the computing array comprises multiple rows, wherein each of the multiple dies comprises two or more of the multiple rows.
6. The integrated circuit according to claim 5, characterized in that each memory channel interface provides data from the memory to two or more rows of the computing array.
7. The integrated circuit according to claim 1, characterized in that the memory is a high-bandwidth memory.
8. The integrated circuit according to claim 1, characterized in that the memory is a double data rate random access memory.
9. The integrated circuit according to claim 1, characterized in that the computing array implements a neural network processor, and the data specifies the weights applied by the neural network processor.
10. The integrated circuit according to claim 1, characterized in that each memory channel interface provides data from the memory to two or more rows of the computing array.
11. A controller disposed within an integrated circuit having multiple dies, characterized in that the controller includes: a request controller configured to convert a first request to access a memory into a second request compatible with an on-chip communication bus, wherein the request controller provides the second request to a plurality of request buffer-bus master blocks configured to receive data from a plurality of channels of the memory; and a remote buffer read address generation unit coupled to the request controller and configured to monitor the fill level of each of a plurality of remote buffers distributed across the plurality of dies, wherein each of the plurality of remote buffers is configured to provide data obtained from a corresponding one of the plurality of request buffer-bus master blocks to a computing array distributed across the plurality of dies; wherein, in response to determining, based on the fill level, that each of the plurality of remote buffers is storing data, the remote buffer read address generation unit is configured to initiate data transfer from each of the plurality of remote buffers, across the plurality of dies, to the computing array.
12. The controller according to claim 11, characterized in that the request controller receives the first request at a first clock frequency and provides the second request at a second clock frequency.
13. The controller according to claim 11, characterized in that the remote buffer read address generation unit monitors the fill level of each of the plurality of remote buffers by tracking multiple write enables corresponding to the plurality of remote buffers and tracking the read enable common to each of the plurality of remote buffers.
14. The controller according to claim 11, characterized in that the plurality of request buffer-bus master blocks include a plurality of corresponding request buffers, and wherein the request controller is configured to initiate a read request to the memory in response to determining that there is available space in each of the plurality of request buffers.
15. The controller according to claim 11, characterized in that the request controller includes: a transaction buffer configured to decouple a first clock domain from a second clock domain; a scheduler coupled to the transaction buffer; and multiple controllers coupled to the scheduler, wherein a first subset of the multiple controllers is configured to monitor the multiple remote buffers, and a second subset of the multiple controllers is configured to monitor the multiple request buffer-bus master blocks; wherein the scheduler is configured to route transactions to different controllers among the multiple controllers based on the transaction type.