Inline-configured processor
A distributed management system with CIM circuits and a NoC parallelizes configuration across heterogeneous IC subsystems, addressing inefficiencies in centralized managers by reducing configuration time and simplifying firmware complexity.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- XILINX INC
- Filing Date
- 2024-05-08
- Publication Date
- 2026-07-01
AI Technical Summary
Conventional centralized configuration managers for programmable IC devices with heterogeneous subsystems become bottlenecks during configuration and initialization due to the size and complexity of the programming image, leading to inefficiencies and increased complexity in memory and firmware usage.
A distributed management system uses multiple configuration interface manager (CIM) circuits and a packet-switched network-on-chip (NoC). Configuration packets are distributed over the NoC to the CIM circuits, which configure different regions of the IC in parallel, allowing heterogeneous subsystems to be configured concurrently.
The distributed configuration system significantly reduces configuration and initialization time, simplifies firmware customization, and enhances scalability without adding complexity to the user interface.
Description
Technical Field
[0001] Examples of the present disclosure generally relate to an in-line configuration interface processor.
Background Art
[0002] Conventionally, programmable integrated circuit (IC) devices (e.g., field programmable gate arrays (FPGAs)) have been configured directly via a processor-based central configuration manager. This may be acceptable for relatively small, monolithic IC devices. Newer programmable IC devices may include multiple heterogeneous subsystems (e.g., a system-on-chip (SoC), a network-on-chip (NoC), memory controllers, artificial intelligence engines, a hardened network interface controller (HNIC), a coherent peripheral component interconnect express (PCIe) module (CPM), a video display unit (VDU), and/or other heterogeneous subsystems), which typically require their own programming interfaces and information. These subsystems can also interface directly with the significantly larger FPGA fabric in newer programmable devices, particularly with the advent of stacked IC dies. Configuration and partial reconfiguration of such IC devices may require combinations of various configuration partitions delivered via their respective interfaces. In such complex heterogeneous IC devices, the conventional centralized configuration manager becomes a bottleneck during configuration and initialization. The size and heterogeneous nature of the programming image for such devices make configuration via a centralized processing manager inefficient.
Summary of the Invention
[0003] Techniques for in-line configuration interface processing are described. One example is an integrated circuit (IC) device that includes a functional circuit, a packet-switched network-on-chip (NoC), and a distributed management circuit. The distributed management circuit includes multiple configuration interface manager (CIM) circuits, each of which receives its respective programming partition as configuration packets via the NoC and, in parallel with the other CIM circuits, provides configuration parameters to its respective region of the functional circuit based on those configuration packets.
[0004] Another example described herein is an IC device comprising a distributed management circuit, a packet-switched network-on-chip (NoC), a first IC die including a first functional circuit, a second IC die including a second functional circuit, and a chip-to-chip (C2C) communication channel configured to interface between the NoC and the second IC die. The distributed management circuit includes a plurality of configuration interface manager (CIM) circuits configured to receive their respective programming partitions as configuration packets via the NoC and provide configuration parameters to each area of the first functional circuit in parallel with each other based on their respective configuration packets. The first CIM circuit of the CIM circuits also receives a programming partition for the second IC die as an additional configuration packet via the NoC and provides configuration parameters to the second IC die via the NoC and C2C interface circuit based on the additional configuration packets.
[0005] Another example described herein is an IC device that includes a functional circuit and a distributed management circuit. The distributed management circuit includes multiple configuration interface manager (CIM) circuits that receive their respective programming partitions as configuration packets via a packet-switched network-on-chip (NoC), extract commands from their respective configuration packets, and perform operations related to their respective regions of the functional circuit, in parallel with each other, based on the codes contained in the command fields.
Brief Description of the Drawings
[0006] The features briefly summarized above can be understood in more detail by reference to example implementations, some of which are illustrated in the attached drawings. Note, however, that the attached drawings illustrate only typical example implementations and should therefore not be considered limiting in scope.
[Figure 1] Illustrates configuring an integrated circuit (IC) using a distributed system, according to one embodiment.
[Figure 2A] Illustrates configuring multiple integrated circuits using a distributed configuration system, according to one embodiment.
[Figure 2B] Illustrates configuring multiple integrated circuits using a distributed configuration system, according to one embodiment.
[Figure 3] A flowchart for configuring a device using a distributed system, according to one embodiment.
[Figure 4] Illustrates configuring a device using a distributed system, according to one embodiment.
[Figure 5] Shows a portion of a device image, according to one embodiment.
[Figure 6] Shows packets within a device image, according to one embodiment.
[Figure 7] A block diagram of an IC device including a functional circuit, a central management circuit, and a distributed management circuit, according to one embodiment.
[Figure 8] A block diagram of a distributed management circuit, according to one embodiment.
[Figure 9A] A block diagram of a DMA engine for a distributed management circuit, including a command engine and a data engine, according to one embodiment.
[Figure 9B] Shows a data buffer management table (DBMT) of the packet processor of a distributed management circuit, and the interconnections between the packet processor, memory controller, and the distributed management circuit, according to one embodiment.
[Figure 10] Shows the fields of a command executed by a packet processor in a distributed management circuit, according to one embodiment.
[Figure 11] Shows the subfields of the opcode field in Figure 10, according to one embodiment.
[Figure 12] Shows a memory word write (MWW) command, according to one embodiment, which enables a packet processor to write a value to a bit-aligned address in a memory map.
[Figure 13] Shows a synchronous memory word write (SMWW) command, according to one embodiment, which allows a packet processor to write a value to a bit-aligned address in a memory map and delay the issuance of further instructions until the SMWW command is complete.
[Figure 14] Shows a conditional true memory word write (TMWW) command, according to one embodiment, which allows a packet processor to write a value to a bit-aligned address in a memory map if a specified condition is true.
[Figure 15] Shows a conditional false memory word write (FMWW) command, according to one embodiment, which allows a packet processor to write a value to a bit-aligned address in a memory map if a specified condition is false.
[Figure 16] Shows a conditional true synchronous memory word write (TSMWW) command, according to one embodiment, which allows a packet processor to write a value to a bit-aligned address in a memory map and delay the issuance of further instructions until the TSMWW command is complete, provided that a specified condition is true.
[Figure 17] Shows a conditional false synchronous memory word write (FSMWW) command, according to one embodiment, which allows a packet processor to write a value to a bit-aligned address in a memory map and delay the issuance of further instructions until the FSMWW command is complete, provided that a specified condition is false.
[Figure 18] Shows a memory double word write (MDW) command, according to one embodiment, which enables a packet processor to write a double word value to a bit-aligned address in a memory map.
[Figure 19] Shows a synchronous memory double word write (SMDW) command, according to one embodiment, which allows a packet processor to write a double word value to a bit-aligned address in a memory map and delay the issuance of further instructions until the SMDW command is complete.
[Figure 20] Shows a conditional true memory double word write (TMDW) command, according to one embodiment, which allows a packet processor to write a double word value to a bit-aligned address in a memory map if a specified condition is true.
[Figure 21] Shows a conditional false memory double word write (FMDW) command, according to one embodiment, which allows a packet processor to write a double word value to a bit-aligned address in a memory map if a specified condition is false.
[Figure 22] Shows a memory quadword write (MQW) command, according to one embodiment, which enables a packet processor to write a selectable number of quadwords to bit-aligned addresses in a memory map.
[Figure 23] Shows a compare (C) command, according to one embodiment, which allows a packet processor to compare the masked value of the least significant word in the packet processor's local data register (LDR) with a specified value and set a condition register based on that comparison.
[Figure 24] Shows a masked LDR word & write (MLWW) command, which allows the packet processor to force different bits in the least significant word of the LDR to specified values and write the resulting word to a specified address in memory.
[Figure 25] A block diagram of a multilayer IC device, according to one embodiment.
[Figure 26] A block diagram of a programmable logic or configurable circuit, including an array of blocks or tiles of a configurable circuit or programmable circuit, according to one embodiment.
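As a rough illustration of the write, conditional-write, and compare commands listed for Figures 12 to 24, the following Python model is a minimal sketch and not the disclosed encoding. The class name `PacketProcessorModel`, the 32-bit word width, and the use of equality as the comparison are all assumptions made for illustration. The synchronous variants are omitted because stalling instruction issue has no meaning in this sequential model.

```python
WORD_MASK = 0xFFFFFFFF  # assumption: 32-bit words; the source does not state the width


class PacketProcessorModel:
    """Hypothetical behavioral model of a few packet-processor commands."""

    def __init__(self):
        self.memory = {}        # address -> word (stands in for the memory map)
        self.ldr = 0            # local data register (LDR)
        self.condition = False  # condition register

    def compare(self, mask, value):
        """C: compare the masked least significant LDR word with a value.

        Equality is an assumed comparison; the source only says 'compare'.
        """
        self.condition = ((self.ldr & WORD_MASK) & mask) == value

    def mww(self, addr, value):
        """MWW: unconditional word write."""
        self.memory[addr] = value & WORD_MASK

    def tmww(self, addr, value):
        """TMWW: word write only if the condition register is true."""
        if self.condition:
            self.mww(addr, value)

    def fmww(self, addr, value):
        """FMWW: word write only if the condition register is false."""
        if not self.condition:
            self.mww(addr, value)

    def mlww(self, addr, mask, value):
        """MLWW: force the masked LDR bits to `value`, then write the word."""
        word = ((self.ldr & WORD_MASK) & ~mask) | (value & mask)
        self.mww(addr, word)
```

In this model a compare followed by a TMWW and an FMWW writes exactly one of the two targets, which is the kind of branch-free conditional configuration the command set appears designed to support.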
[0007] For ease of understanding, where possible, the same reference numbers are used to denote the same elements common to the drawings. It is contemplated that elements of one embodiment may be beneficially incorporated into other embodiments.
Detailed Description
[0008] Various features are described below with reference to the drawings. Note that the drawings may or may not be drawn to scale, and that elements of similar structure or function are represented by like reference numerals throughout the drawings. Note that the drawings are intended only to facilitate the description of the features. They are not intended as an exhaustive description of the features or as a limitation on the scope of the claims. Additionally, the illustrated examples need not have all of the aspects or advantages shown. Aspects or advantages described in connection with a particular embodiment are not necessarily limited to that embodiment and may be implemented in any other embodiment even if not so illustrated or explicitly described as such.
[0009] Modern adaptive system-on-chip IC devices may include programmable logic, fixed/hardened circuitry, NoCs, complex heterogeneous subsystems, input/output circuitry, and other circuitry distributed across an IC die, multiple stacked IC dies, and/or chiplets. The differing natures of these components call for respective configuration interfaces, configuration image formats, and orderings. Distributing configuration parameters across such an IC device using a conventional centralized management system is inefficient: it increases the device configuration/initialization time, adds complexity to the memory and firmware used for device configuration and initialization, and adds complexity to the programming image for the device (e.g., separate partitions may be required for subsystems with different configuration interfaces).
[0010] Embodiments of this specification describe a centralized management system and distributed in-line configuration interface managers (CIMs). The centralized management system distributes configuration packets to the CIMs at line rate. The CIMs configure their respective regions of an IC device, in parallel with each other, based on their respective configuration packets. The centralized management system can enforce overall security of the IC and can include an integrated application programming interface (API) that interfaces with a user.
[0011] The architecture disclosed herein provides a scalable solution for configuring and initializing an IC device. It can provide orders-of-magnitude improvements in configuration and initialization time without adding complexity to the user interface, and it can reduce the complexity of firmware customization, optimization, and verification.
[0012] Figure 1 illustrates configuring a configurable integrated circuit (IC) device 100 using a distributed system, according to one embodiment. In the example of Figure 1, the configurable IC device 100 includes a single IC 110. In one embodiment, the IC 110 includes a heterogeneous computing system including different types of subsystems (e.g., NoC, data processing engines, memory controllers, programmable logic, etc.) that are configured using configuration information within a device image 105. For example, the IC 110 can be a system-on-chip (SoC) or an application-specific integrated circuit (ASIC).
[0013] In another embodiment, IC 110 includes a homogeneous computing system. While the distributed configuration systems described herein can provide the greatest improvement for devices having heterogeneous computing systems (because such devices have a mixture of configuration partitions transmitted through separate interfaces), the embodiments herein can also improve the process of configuring homogeneous computing systems, particularly as those systems become larger. For example, IC 110 may be a large field-programmable gate array (FPGA) containing programmable logic configured by device image 105.
[0014] Notably, the configurable device is not limited to having programmable logic. That is, the embodiments herein can be applied to configurable devices with or without programmable logic. The distributed configuration system described herein can be used in any configurable device that relies on a received device image 105 to configure at least one subsystem within the device before the device begins to perform user functions.
[0015] IC 110 includes a stream engine 115 (e.g., circuitry) that receives a device image 105 for configuring IC device 100. The stream engine 115 is one example of a central configuration manager circuit. In other embodiments, the streaming function may be implemented using back-to-back memory-mapped transfers at the physical interface level; in that case, the stream engine 115 may be a memory-mapped engine that receives the device image via memory-mapped data writes.
[0016] As shown, the stream engine 115 receives a device image 105 consisting of packetized configuration data and then forwards each configuration packet 125 to different areas within the IC 110. The stream engine 115 can function as a user interface with an API for communicating with an external host computing system (not shown). The stream engine 115, as will be described in more detail below, generally distributes the configuration information contained in the device image 105 to various areas of the IC 110 in the form of config packets 125.
[0017] To distribute config packets 125, IC 110 includes a hardware network 120. In one embodiment, the network 120 is a NoC, but is not limited thereto. For example, IC 110 may have a dedicated configuration trace used to distribute config packets 125 to different areas within IC 110. The type of hardware network used may affect how stream data is transferred at the physical level from the central configuration manager (e.g., stream engine 115) to the distributed CIM circuit 130.
[0018] In Figure 1, IC110 is subdivided into different regions (e.g., region A and region B). Although two regions are shown, IC110 can be subdivided into any number of regions. One advantage of the distributed configuration system is that it can be easily scaled with the size of the configurable IC device 100. That is, as the size of IC110 increases, additional regions can be added.
[0019] Each region within IC110 includes a dedicated CIM circuit 130 for distributing configuration information to subsystems within that region. That is, the stream engine 115 can receive the device image 105 and distribute packetized configuration information such that data used to configure subsystems in region A is sent to CIM circuit 130A, and data used to configure subsystems in region B is sent to CIM circuit 130B.
[0020] Although not shown here, the CIM circuit 130 may have separate interfaces or ports to the subsystems in its region. For example, CIM circuit 130A may parse the received config packet 125A and transmit configuration information to different circuits within the region. In this case, region A includes a first circuit 135A and a second circuit 135B. These circuits may be different (i.e., heterogeneous) circuits. For example, the first circuit 135A may be a memory controller, and the second circuit 135B may be a hardened data processing engine. These circuits may use different types of interfaces to communicate with CIM circuit 130A and may use different types of configuration data. Rather than a central configuration manager (e.g., stream engine 115) having to parse the configuration information and distribute it to all subsystems within the IC, in this example the stream engine 115 transfers the configuration information to each region and leaves it to the CIM circuit 130 to distribute that information to the circuits within the region using the appropriate interfaces. However, in another embodiment, the first and second circuits 135A and 135B may be of the same type (for example, both may be memory controllers, or both may be programmable logic blocks). Thus, the embodiments herein can be used whether a region has heterogeneous or homogeneous circuits.
[0021] Furthermore, since the stream engine 115 distributes configuration information to different regions each having a dedicated CIM circuit 130, the CIM circuits 130 within each region can operate in parallel. That is, CIM circuit 130A distributes configuration information to the first and second circuits 135A and 135B, while CIM circuit 130B can distribute configuration information to the third and fourth circuits 135C and 135D. In this way, the regions within IC 110 can be configured in parallel by dedicated CIM circuits 130.
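The division of labor described for Figure 1 can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the packet dictionaries, the function names `stream_engine_route`, `cim_configure`, and `configure_device`, and the use of a thread pool to stand in for hardware parallelism are all hypothetical and do not come from the disclosure.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical device image: each config packet is tagged with its target region.
DEVICE_IMAGE = [
    {"region": "A", "target": "circuit_135A", "params": {"mode": 1}},
    {"region": "B", "target": "circuit_135C", "params": {"mode": 2}},
    {"region": "A", "target": "circuit_135B", "params": {"mode": 3}},
    {"region": "B", "target": "circuit_135D", "params": {"mode": 4}},
]


def stream_engine_route(image):
    """Group config packets by destination region without interpreting them."""
    per_region = {}
    for packet in image:
        per_region.setdefault(packet["region"], []).append(packet)
    return per_region


def cim_configure(region, packets):
    """Stand-in for one CIM circuit: apply each packet to a circuit in its region."""
    configured = {}
    for packet in packets:
        configured[packet["target"]] = packet["params"]
    return region, configured


def configure_device(image):
    """Route packets per region, then let one worker per region apply them."""
    per_region = stream_engine_route(image)
    # One worker per region models the CIM circuits operating in parallel.
    with ThreadPoolExecutor(max_workers=len(per_region)) as pool:
        futures = [pool.submit(cim_configure, r, p) for r, p in per_region.items()]
        return dict(f.result() for f in futures)
```

The point of the sketch is that the routing step never looks inside `params`; only the per-region worker does, which mirrors how the stream engine hands off parsing to the CIM circuits.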
[0022] Figures 2A and 2B illustrate the configuration of multiple integrated circuits within a configurable device 200 using a distributed system according to one embodiment. Unlike the configurable IC device 100 in Figure 1, the configurable device 200 in Figures 2A and 2B includes multiple ICs, namely IC110, IC205, and IC210. These ICs may be located in the same package. Although three ICs are shown, the configurable device 200 can contain any number of ICs.
[0023] In the configurable device 200A of Figure 2A, the ICs are arranged in a 3D stack. For example, IC 110 may be a base die, and ICs 205 and 210 are stacked on top of the base die. For example, the base die may include peripherals and communication interfaces for communicating with an external host, while ICs 205 and 210 include different types of circuitry 220 (e.g., programmable logic or arrays of data processing engines). The ICs can use through-vias to transmit data to each other.
[0024] IC 110 in Figure 2A may be the same as IC 110 shown in Figure 1, containing multiple regions that each include a dedicated CIM circuit 130. Rather than being allocated a 2D region within a single IC as in Figure 1, each CIM circuit in Figure 2A is allocated a 3D region spanning three ICs. Specifically, CIM circuit 130A is allocated region A, which can include a circuit within IC 110 (not shown), a circuit 220A within IC 205, and a circuit 220C within IC 210. CIM circuit 130B is allocated region B, which can include a circuit within IC 110 (not shown), a circuit 220B within IC 205, and a circuit 220D within IC 210.
[0025] The circuits 220 in IC205 and 210 may be the same or different. For example, circuits 220A and 220B in IC205 may be the same (e.g., programmable logic), and circuits 220C and 220D in IC210 may be the same (e.g., a data processing engine). Furthermore, circuits 220A to D in both IC205 and 210 may be the same, for example, all data processing engines.
[0026] Figure 2A shows a stack of ICs, but in another embodiment, the ICs may be arranged on an interposer (i.e., side by side), and the interposer provides a communication channel for transmitting data between the ICs. For example, IC 110 may be an anchor die, and ICs 205 and 210 are chiplets. In this example, ICs 205 and 210 may be located on different sides of IC 110. The anchor die may contain common blocks such as a processor subsystem (PS) and a memory subsystem (DDR controller). The chiplets may contain dedicated logic such as a data processing engine, a high-speed transceiver, or high-bandwidth memory. In that case, the region is not a 3D region, but nevertheless, each CIM circuit 130 can be allocated a region containing portions from each of the three ICs in Figure 2A.
[0027] In summary, Figure 2A shows that the CIM circuit 130 in one IC is used to configure the circuit 220 in different ICs. Therefore, ICs 205 and 210 do not have their own CIM circuits.
[0028] Similar to Figure 2A, Figure 2B shows a configurable device 200B having multiple ICs, but unlike Figure 2A, each IC has at least one CIM circuit 130. Furthermore, unlike Figure 2A where the region extends across multiple ICs, in Figure 2B the region can be confined within a single IC.
[0029] Notably, the three ICs in Figure 2B can be arranged as a 3D stack as shown in Figure 2A, or side by side on an interposer.
[0030] Network 120 within IC 110 can be used to forward config packets to the other ICs 205 and 210. That is, in addition to identifying config packets for regions on IC 110, stream engine 115 also distributes config packets for regions within ICs 205 and 210. Since IC 205 contains two regions (regions C and D) with dedicated CIM circuits 130C and 130D, stream engine 115 sends config packet 125C to CIM circuit 130C to configure a circuit (not shown) in region C, and sends a different config packet 125D to CIM circuit 130D to configure a circuit (not shown) in region D.
[0031] However, IC210 is not divided into multiple regions (although it may be). In this case, the stream engine 115 sends a config packet 125E to the CIM circuit 130E to configure the circuit within IC210. For example, IC210 may be smaller than IC205, or may have fewer configurable circuits than IC205, and therefore IC210 is not divided into multiple regions.
[0032] Therefore, Figure 2B shows a configurable device 200B including multiple ICs, where a central configuration manager (e.g., stream engine 115) on one of the ICs can distribute config packets 125 to CIM circuits 130 on different ICs. Each of these ICs may have one or more CIM circuits 130, depending on how many regions it contains.
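The difference between the Figure 2A and Figure 2B arrangements can be summarized as a small data structure. The following Python sketch is illustrative only. The `Region` type and its field names are hypothetical, the region name "E" is an assumed label for the undivided IC 210, and regions A and B on IC 110 in the Figure 2B case are assumed to carry over from Figure 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    cim_ic: str        # IC that hosts the region's CIM circuit
    member_ics: tuple  # ICs containing circuits configured by that CIM circuit


# Figure 2A style: CIM circuits live on IC 110 only; regions span all three ICs.
SHARED_CIMS = [
    Region("A", "IC110", ("IC110", "IC205", "IC210")),
    Region("B", "IC110", ("IC110", "IC205", "IC210")),
]

# Figure 2B style: every IC hosts its own CIM circuit(s); regions stay on one IC.
PER_IC_CIMS = [
    Region("A", "IC110", ("IC110",)),
    Region("B", "IC110", ("IC110",)),
    Region("C", "IC205", ("IC205",)),
    Region("D", "IC205", ("IC205",)),
    Region("E", "IC210", ("IC210",)),  # "E" is an assumed name for the whole of IC 210
]


def cims_per_ic(regions):
    """Count how many CIM circuits each IC hosts."""
    counts = {}
    for region in regions:
        counts[region.cim_ic] = counts.get(region.cim_ic, 0) + 1
    return counts


def spans_multiple_ics(region):
    """True when a region covers circuits on more than one IC."""
    return len(region.member_ics) > 1
```

Either table gives the stream engine the same information: which CIM circuit receives a given region's config packets, regardless of where the configured circuits physically sit.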
[0033] Figure 3 is a flowchart of method 300 for configuring a device using a distributed system, according to one embodiment. In block 305, a stream engine (e.g., a central configuration manager) receives a device image for configuring a configurable device. The device image can be received as streaming data or memory-mapped data.
[0034] The configurable device may contain only one IC including multiple CIM circuits as shown in Figure 1, or it may contain multiple ICs as shown in Figures 2A and 2B. In either case, in one embodiment, there is only one stream engine (i.e., only one central configuration manager) within the configurable device.
[0035] In block 310, the stream circuit (i.e., the stream engine) configures a network within the configurable device. In one embodiment, the network is located on the same IC as the stream circuit. The stream circuit may first need to configure the network before it can deliver configuration information to the CIM circuits within the configurable device. For example, if the stream circuit communicates with the CIM circuits using a NoC, the device image may include data for configuring the NoC so that the NoC can communicate with the CIM circuits.
[0036] In one embodiment, the stream circuit includes its own CIM circuit for configuring the network. That is, the stream circuit can identify configuration information in a received device image intended to configure the network, and transfer this information to its CIM circuit, which then configures the network. The network can be configured to transmit data not only to CIM circuits on the same IC, but also to CIM circuits on other ICs (if the configurable device has multiple ICs, each having its own CIM circuit).
[0037] In block 315, the stream circuit analyzes the device image to identify configuration information (e.g., configuration packets) for the CIM circuits within the configurable device. In one embodiment, the device image includes embedded headers indicating which data is intended for which region. That is, the software tool on the host that generates the device image and transmits it to the configurable device can be aware of the regions within the configurable device. When generating the device image, the software tool can therefore organize the configuration information for the circuits within a particular region as packetized data. Thus, when analyzing the device image, the stream circuit can readily identify the parts of the device image that are destined for different regions (e.g., different CIM circuits). This is illustrated in more detail in Figure 5.
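To make the role of embedded headers in block 315 concrete, here is a minimal Python sketch of a host-side packer and a stream-circuit-side splitter. The header layout (a 16-bit region identifier followed by a 32-bit payload length, big-endian) is purely an assumption for illustration, as are the names `HEADER`, `build_image`, and `split_image`; the actual image format is the subject of Figures 5 and 6.

```python
import struct

# Hypothetical header: region id (unsigned 16-bit), payload length (unsigned 32-bit).
HEADER = struct.Struct(">HI")


def build_image(sections):
    """Host-side tool: prefix each region's payload with a header."""
    return b"".join(
        HEADER.pack(region_id, len(payload)) + payload
        for region_id, payload in sections
    )


def split_image(image):
    """Stream-circuit side: walk the headers and yield (region id, payload)."""
    offset = 0
    while offset < len(image):
        if len(image) - offset < HEADER.size:
            raise ValueError("truncated header")
        region_id, length = HEADER.unpack_from(image, offset)
        offset += HEADER.size
        payload = image[offset:offset + length]
        if len(payload) != length:
            raise ValueError("truncated payload")
        offset += length
        yield region_id, payload
```

Because the splitter reads only the headers, it can forward each payload to its region without understanding the configuration data inside it.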
[0038] In one embodiment, the packetization of configuration information within a device image can be performed by a stream circuit based on a dynamic scheduling algorithm for reconfigurable configuration contexts.
[0039] In block 320, the stream circuit sends config packets to the CIM circuits. That is, after identifying the data in the device image intended for the destination region, the stream circuit can forward the corresponding config packets to the dedicated CIM circuits in those regions. Thus, each region receives only the configuration information used to configure the circuits within that region.
[0040] In one embodiment, the configurable device includes at least two CIM circuits. These CIM circuits may be on the same IC or multiple ICs. Furthermore, the region may include the entire IC, a 2D region including only a small portion of the IC, or a 3D region spanning multiple ICs. Figure 2B shows an example where the region may include the entire IC (e.g., IC210), Figure 1 shows a 2D region covering a sub-part of the IC (e.g., IC110), and Figure 2A shows a 3D region extending across multiple ICs.
[0041] In one embodiment, communication between a stream circuit and multiple CIM circuits is encrypted so that each of the multiple CIM circuits decrypts the portion (e.g., configuration packet) received from the central configuration manager circuit. Furthermore, in one embodiment, each of the multiple CIM circuits is configured to perform an integrity check on the portion (packet) received from the stream circuit.
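The per-CIM integrity check can be sketched as follows. The source does not name an algorithm, so HMAC-SHA256 from the Python standard library is an assumption, as are the function names `seal_packet` and `open_packet` and the choice to append the tag to the payload. Decryption is omitted from this sketch.

```python
import hashlib
import hmac

TAG_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def seal_packet(key, payload):
    """Sender side: append an HMAC-SHA256 tag to the payload."""
    return payload + hmac.new(key, payload, hashlib.sha256).digest()


def open_packet(key, packet):
    """CIM side: verify the tag and return the payload, or raise on mismatch."""
    if len(packet) < TAG_LEN:
        raise ValueError("packet shorter than tag")
    payload, tag = packet[:-TAG_LEN], packet[-TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking the mismatch position through timing.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity check failed")
    return payload
```

Running the check inside each CIM circuit, rather than once centrally, means a corrupted packet is rejected in the region it targets without stalling the other regions.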
[0042] In block 325, the CIM circuit forwards configuration information to the circuits within the region allocated to it. That is, the CIM circuit analyzes received packets, which may contain configuration information for multiple subsystems within the region, and identifies which configuration information should be sent to which subsystem. If those subsystems are heterogeneous, the CIM circuit may use different interfaces or ports for different subsystems within the region.
[0043] Advantageously, in method 300, the stream circuit is primarily responsible for streaming configuration information to the various CIM circuits, as specified by the device image. The actual processing and transfer of configuration data to the specific circuits being configured is delegated to the CIM circuits.
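A CIM circuit that serves heterogeneous subsystems through different interfaces might be modeled as below. This is a sketch under assumptions: the two port classes, their method names, and the `kind` field in each packet entry are hypothetical and stand in for whatever interfaces a real region exposes.

```python
class MemoryControllerPort:
    """Hypothetical register-style configuration interface."""

    def __init__(self):
        self.registers = {}

    def write_register(self, addr, value):
        self.registers[addr] = value


class EnginePort:
    """Hypothetical stream-style configuration interface."""

    def __init__(self):
        self.words = []

    def push_words(self, words):
        self.words.extend(words)


def cim_dispatch(packet, ports):
    """Route each entry of a region's packet to the matching subsystem port."""
    for entry in packet:
        port = ports[entry["subsystem"]]
        if entry["kind"] == "reg":
            port.write_register(entry["addr"], entry["value"])
        elif entry["kind"] == "stream":
            port.push_words(entry["words"])
        else:
            raise ValueError(f"unknown entry kind: {entry['kind']}")
```

Keeping the interface-specific logic inside `cim_dispatch` is what allows the upstream sender to treat every region identically.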
[0044] In one embodiment, the CIM circuit operates in two modes. In the first mode, a direct memory access (DMA) circuit within the stream circuit distributes the configuration information for a region as a continuous stream to the CIM circuit responsible for that region. Once the configuration packets for a region are buffered in the CIM circuit, the CIM circuit can process the packets while the stream circuit transmits configuration packets to other CIM circuits in the configurable device.
[0045] In the second mode (e.g., DRAM mode), the stream circuit first copies the configuration packets for all regions, in contiguous partitions, to DRAM and then instructs the CIM circuits to pull their packets from the corresponding locations in DRAM concurrently. A contiguous partition is a partition in which all data is intended to be processed by a single CIM circuit. Local memory within each CIM circuit stores the packets fetched from DRAM so that they can be hashed and authenticated before use.
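The two modes differ mainly in who initiates the transfer, which a small Python sketch can show. The function names and the use of plain lists and dictionaries for CIM buffers and DRAM are assumptions for illustration; hashing and authentication of fetched packets are left out.

```python
def streaming_mode(partitions, cims):
    """First mode: the stream circuit pushes each region's packets to its CIM."""
    for region, packets in partitions.items():
        cims[region].extend(packets)  # models buffering inside the CIM circuit


def dram_mode(partitions, cims):
    """Second mode: stage every partition in DRAM, then each CIM pulls its own."""
    # The stream circuit copies all partitions to DRAM up front.
    dram = {region: list(packets) for region, packets in partitions.items()}
    # Each CIM circuit then fetches its own partition into local memory.
    for region, local_memory in cims.items():
        local_memory.extend(dram.get(region, []))
    return dram
```

Both functions leave the CIM buffers in the same final state; the practical difference is that the second mode frees the stream circuit after a single bulk copy.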
[0046] Figure 4 illustrates the configuration of a configurable device 400 using a distributed system according to one embodiment. As shown, the configurable device 400 receives a device image 105 in the stream engine 115. In addition to distributing the configuration information in the device image 105 to different regions, as described above, the stream engine 115 (e.g., a central configuration manager) can perform other functions. Firstly, the stream engine 115 can create a level of abstraction that remains consistent across devices. That is, the stream engine 115 can maintain a consistent protocol for all functions performed by the stream engine 115, regardless of the size of the device 400 and the combination of features in the device 400. Secondly, the stream engine 115 can function as a root-of-trust for the device 400. In one embodiment, the stream engine 115 authenticates the device image 105 before it is delivered to the CIM circuit. Thirdly, the stream engine 115 may include debug interface logic and a debug packet controller for identifying errors that may occur during the configuration process.
[0047] In one embodiment, the stream engine 115 is implemented within a processor that may be a general-purpose processor. However, in other embodiments, the stream engine 115 may be dedicated circuitry for performing the functions described herein.
[0048] Device 400 includes N regions corresponding to N CIM circuits 405. Here, region 0 is assumed to be located on the same IC as the stream engine 115. This region includes CIM circuit 405A, PS 410, NoC 415, and peripheral device 420.
[0049] The PS 410 may be a general-purpose processor containing any number of cores. The PS 410 may also include one or more processing subsystems that are likewise configured by the corresponding CIM, i.e., CIM circuit 405A.
[0050] Although not shown, NoC 415 may extend throughout the device 400 to enable various components within the device 400 to communicate with each other. For example, in one physical implementation, the stream engine 115 may be located in the upper right portion of the IC within the configurable device 400, while the CIM circuits 405B and 405C are located in the upper left and lower left portions of the IC (or on another IC). Using NoC 415, the stream engine 115 can still communicate with the CIM circuits 405B and 405C in those areas. In some embodiments, however, the stream engine 115 may be required to first configure NoC 415 before it can transmit configuration information to the CIM circuits 405B and 405C, as described above in block 310 of method 300.
[0051] The peripheral device 420 may include I / O circuits for communicating with an external computing system or device. For example, the peripheral device 420 may include a DMA engine for retrieving data from memory in the host computing system.
[0052] Although shown as separate components, in one embodiment, the CIM circuit 405A is part of the stream engine 115. Customizing the firmware within the stream engine 115 (e.g., the central configuration manager) to configure each subsystem adds complexity, hinders optimization, and results in larger code size, inefficient execution, and verification difficulties. Since the processing of the regions is instead performed by the CIM, and the stream circuit simply streams packets to the CIM, a common part of the firmware can be used to push the configuration image to all regions on the device. These regions can contain different IP and functionality. Furthermore, by including the CIM circuit within the stream circuit, the same programming model can be employed for regions that communicate directly with or are integrated with the stream circuit on the same IC. Examples of configurations performed by the local CIM circuit 405A within the stream engine 115 are the configurations of PS 410, NoC 415, and peripheral device 420.
[0053] In this embodiment, regions 1 and n may contain similar circuit elements, but this is not a requirement. That is, both regions include a programmable logic (PL) block 425, a hard IP 430, an interface to a chiplet 440 (when using the configuration shown in Figure 2A), and a memory controller 445. Alternatively, region 1 may contain only programmable logic, and region n may contain only a DPE segment.
[0054] The CIM circuits 405B and 405C may include separate interfaces or ports to different circuit elements within region 1 and region n. Regions 1 and n may be in the same IC as region 0, or they may be in separate ICs. For example, region 0 may be located in a first IC and regions 1 through n may be located in a second IC, or region 0 may be located in a first IC, region 1 in a second IC, and region n in a third IC.
[0055] PL blocks 425 in regions 1 and n can contain any amount of programmable logic. Using the configuration information in device image 105, CIM circuits 405B and 405C can configure PL blocks 425 to perform user-defined functions during operation.
[0056] The hard IP 430 can include any of various hardened circuits that can be configured using device image 105.
[0057] The data processing engine (DPE) segment 435 may include multiple DPEs that can be arranged in a grid, cluster, or checkerboard pattern within the device 400. Furthermore, each DPE segment 435 can be of any size and may have any number of rows and columns formed by the DPEs. In one embodiment, the DPEs within the DPE segment 435 are identical; that is, each DPE (also called a tile or block) may have the same hardware components or circuitry. Furthermore, the embodiments herein are not limited to DPEs. Alternatively, the device 400 may include an array of any type of processing element; for example, the DPEs may be a digital signal processing engine, a cryptographic engine, a forward error correction (FEC) engine, or other dedicated hardware for performing one or more dedicated tasks.
[0058] The chiplet 440 can be part of the anchor / chiplet configuration described above in Figure 2A. For example, the CIM circuit 405B may be tasked with transferring configuration information to the chiplet 440A, and the CIM circuit 405C may be tasked with transferring configuration information to the chiplet 440B.
[0059] Involving the stream engine 115 (e.g., a central configuration manager) in low-level data movement at the device level for configuration is inefficient in terms of performance and power. Therefore, as described above, the stream engine 115 streams configuration information via a network (e.g., NoC 415) to the CIM circuits 405 distributed across the device. Because hardware streams the configuration information directly to the CIM circuits 405, the stream engine 115 does not become a bottleneck. Also, configuration packets (forming a continuous stream as shown in Figure 4) are transferred from the stream circuit to the CIM circuits 405 at maximum burst capability, which avoids overloading the NoC 415 with many small, independent memory transfers.
[0060] Figure 5 shows a portion of device image 105 according to one embodiment. Figure 5 shows a high-level organization that can be used in device image 105 for a configurable device. Image 105 includes a boot header and multiple programming partitions, each partition allocated to a specific area within the configurable device. The boot header provides information used to authenticate access to the device and to process the rest of image 105, including authentication and decryption.
[0061] Partition 505 in device image 105 is a main partition that may always exist and includes, for example, platform loader and manager (PLM) firmware running on a processor that also includes stream circuits or a central configuration manager. In one embodiment, the main partition 505 is loaded by read-only memory (ROM) in the processor, and loading of other partitions is performed by the PLM firmware together with the CIM circuit.
[0062] In this example, each subsequent partition 510 includes a secure partition header, which is processed by the stream circuit to establish a key and other configuration information used by the CIM circuit to process the partition. The remainder of partition 510 is divided into multiple packets, which the stream circuit routes to specific CIM circuits (e.g., CIM a, CIM b, CIM c, etc.) for processing. The packet header for each packet in partition 510 identifies the target CIM circuit, so the stream circuit knows the destination for each packet. In this way, the stream circuit can packetize the data as described in block 315 of method 300 and forward the packets to specific CIM circuits.
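By way of illustration only, the routing step described above may be modeled in software as follows. This is a minimal sketch and not the disclosed circuit: the names Packet and route_packets are hypothetical, and the target CIM is represented as a plain integer rather than a field of a real packet header. The sketch shows only that the stream circuit looks at the target identifier and forwards the payload without interpreting it.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    cim_id: int     # target CIM, as identified by the packet header
    payload: bytes  # opaque to the stream circuit in this model


def route_packets(packets):
    """Group packets per target CIM, keeping stream order within each CIM."""
    queues = defaultdict(list)
    for pkt in packets:
        # Only the target identifier is inspected; the payload is forwarded as-is.
        queues[pkt.cim_id].append(pkt.payload)
    return dict(queues)


# Hypothetical partition with packets interleaved for three CIM circuits.
partition = [Packet(0, b"a0"), Packet(1, b"b0"), Packet(0, b"a1"), Packet(2, b"c0")]
routed = route_packets(partition)
```

In this model the per-CIM lists stand in for the separate streams that each CIM circuit would receive over the network.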
[0063] Furthermore, packet data within each packet in partition 510 is processed by the CIM circuit rather than the stream circuit. Therefore, the processing of configuration information within data packets (and the transfer of that configuration information to the specific circuit being configured) is delegated to the CIM circuit once the packet is received by those circuits.
[0064] Figure 6 shows a CIM packet 600 in a device image according to one embodiment. Specifically, Figure 6 shows an example of the packet format within partition 510 in Figure 5. Packet 600 is divided into a header 605 and packet data 610 (payload). The first quadword in the CIM packet 600 specifies the target CIM (using the CIM ID), packet length, header length, and packet attributes.
[0065] In one embodiment, the lengths of the CIM packet 600 and header 605 are always multiples of a quadword. Furthermore, the least significant bit of the packet attributes may indicate whether the packet is the last packet in a partition that needs to be transferred, for example, using direct memory access (DMA).
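For illustration, parsing of the first quadword may be sketched as follows. The field layout is an assumption made only for this example: the description names the fields (CIM ID, packet length, header length, packet attributes) but does not give their widths or order. Here a quadword is assumed to be 16 bytes, each field is assumed to occupy one little-endian 32-bit word, and both lengths are assumed to be counted in quadwords, which makes them multiples of a quadword by construction. The function name parse_first_quadword is hypothetical.

```python
import struct

QUADWORD_BYTES = 16  # assumption: one quadword = four 32-bit words


def parse_first_quadword(raw: bytes) -> dict:
    """Decode the assumed layout: cim_id, packet_len, header_len, attributes."""
    if len(raw) < QUADWORD_BYTES:
        raise ValueError("need at least one quadword")
    cim_id, packet_qw, header_qw, attrs = struct.unpack_from("<4I", raw, 0)
    if header_qw > packet_qw:
        raise ValueError("header cannot be longer than the packet")
    return {
        "cim_id": cim_id,
        "packet_bytes": packet_qw * QUADWORD_BYTES,
        "header_bytes": header_qw * QUADWORD_BYTES,
        # Least significant attribute bit marks the last packet of a partition.
        "is_last": bool(attrs & 1),
    }


# Hypothetical packet: CIM 2, 8 quadwords total, 4-quadword header, last packet.
example = struct.pack("<4I", 2, 8, 4, 0x1)
fields = parse_first_quadword(example)
```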
[0066] The packet header 605 also includes an SHA hash (or any other suitable cryptographic element) for the next packet. Padding within the header 605 can be used to ensure that the packet length meets the requirements of the SHA-3 architecture. The last packet in one of the partitions 510 in Figure 5 does not need to include an SHA hash and padding because there are no subsequent packets in that partition 510.
[0067] In one embodiment, the entire CIM packet 600, including the header 605 and the payload, i.e., the packet data 610, is hashed. In one embodiment, each CIM circuit includes internal storage sufficient to buffer at least two packets. Buffering the CIM packet 600 within the CIM circuit also allows the CIM packet 600 to be verified to ensure data integrity, as well as decrypted to ensure data privacy.
[0068] Figure 7 is a block diagram of an integrated circuit (IC) device 700 according to one embodiment, which includes functional circuits 706-1 to 706-n (collectively, functional circuits 706), a central management circuit 702, and distributed management circuits 703-1 to 703-n, each containing a CIM circuit 704-1 to 704-n (collectively, CIM circuit 704). The CIM circuit 704 may represent exemplary embodiments of the CIM circuits 130A and 130B in Figure 1.
[0069] In the example in Figure 7, the functional circuit 706-1 includes a fixed functional circuit 730 (e.g., a non-programmable or hardened circuit, and / or an application-specific integrated circuit (ASIC)), a register 736 that holds configuration parameters of the fixed functional circuit 730, and an interface circuit, shown here as a local control interconnect (LCI) circuit 738, that interfaces between the CIM circuit 704-1 and the register 736 via link 739. The register 736 may, for example, control a multiplexer of the fixed functional circuit 730. Another register 736 may be used to store a status indicator (e.g., a status indicator for a memory controller).
[0070] The functional circuit 706-1 further includes one or more computing engines 734 (e.g., an array of artificial intelligence engines, i.e., AIEs) and a programmable circuit, shown here as programmable logic (PL) 732. The computing engine 734 may include programmable registers and / or memory for various functions. The PL 732 includes a configuration random access memory (CRAM) 740 that holds configuration parameters for the configurable circuit or fabric of the PL 732. The functional circuit 706-1 further includes an interface circuit 742 that interfaces the CIM circuit 704-1 with the PL 732 and the computing engine 734 via one or more links 743. The interface circuit 742 may include a configuration frame interface (CFrame) circuit 744 that interfaces the CIM circuit 704-1 with the CRAM 740 via a CFrame programming bus.
[0071] The LCI circuit 738 and / or interface circuit 742 may include a configurable master / slave interface circuit, for example one conforming to the on-chip communication bus protocol developed by Arm Ltd. of Cambridge, UK, and commercially available as the Advanced eXtensible Interface (AXI). The LCI circuit 738 may include registers and / or static random access memory (SRAM) for holding configuration parameters for the LCI circuit 738.
[0072] The functional circuit 706-1 is not limited to the example shown in Figure 7.
[0073] The CIM circuit 704 distributes configuration parameters to the respective functional circuits 706. These configuration parameters may relate to clocking, memory controllers, input / output (I / O) circuits, transceivers, chiplets, and / or other features / functions. In the example in Figure 7, the CIM circuit 704-1 provides configuration parameters for the LCI circuit 738 and register 736 through the root bridge 746, NoC peripheral interconnect (NPI) switch 748, and link 739. The CIM circuit 704-1 also provides configuration parameters for the PL 732, compute engine 734, and interface circuit 742 via link 743. In one embodiment, the CIM circuit 704-1 also provides configuration parameters to an off-chip device 711 (e.g., a chiplet). The CIM circuit 704-1 can, for example, push an image via a chip-to-chip (C2C) interface to the off-chip device 711, which may include an engine that performs self-configuration based on the image.
[0074] The CIM circuit 704 may perform additional management functions (e.g., configuration, control, and / or debugging functions) and / or data processing functions (e.g., integrity, authentication, and / or error detection) associated with each function circuit 706. The CIM circuit 704 may perform one or more functions inline or in a pipelined manner. The CIM circuit 704 may execute commands, such as memory access commands. The CIM circuit 704 may be useful for distributing management and / or data processing functions (i.e., functions that would otherwise be performed by the central management circuit 702 and / or the host device) throughout the IC device 700. The CIM circuit 704 may return data (e.g., readback data) to the central management circuit 702 via their respective links 721-1 to 721-n. Exemplary embodiments of the CIM circuit 704 will be provided later.
[0075] In the example shown in Figure 7, the central management circuit 702 includes a streaming engine 714 that distributes configuration information 708 to the CIM circuit 704 via a first communication channel. In one embodiment, the configuration information 708 includes configuration packets, and the first communication channel includes a packet-switched network-on-chip (NoC) 716 and the respective communication links 717-1 to 717-n. However, the first communication channel is not limited to the NoC. The streaming engine 714 and PDI 712 may represent examples of the stream engine 115 and device image 105 in Figure 1.
[0076] In one embodiment, the PDI 712 includes a boot header and a number of programming partitions, as already described with reference to Figures 5 and 6. The first partition of the PDI 712 may be the always-present main partition and includes platform loader and manager (PLM) firmware running on the management engine 718 of the central management circuit 702. The central management circuit 702 may load keys contained within the secure header of the partition.
[0077] If the PDI 712 contains multiple programming partitions, the programming partitions may be in the form of packets targeting each CIM circuit 704 (for example, the packets may include a packet header that identifies each target CIM circuit 704). In this example, the streaming engine 714 may distribute packets to each CIM circuit 704 via NoC 716. The least significant bit of the packet attributes may indicate to the streaming engine 714 that the packet is the last packet in the partition to be forwarded by the streaming engine 714.
[0078] The streaming engine 714 may include a direct memory access (DMA) engine 722 that distributes packets to the CIM circuit 704 at maximum burst capability, to avoid overloading the NoC 716 with a large number of small, independent memory transfers. Using the streaming engine 714 and associated hardware (e.g., NoC 716), rather than the management engine 718, to stream the configuration information 708 directly to the CIM circuit 704 may help prevent the management engine 718 from becoming a bottleneck. The CIM circuit 704 extracts configuration instructions and associated configuration parameters from each partition and distributes the configuration parameters to the respective areas of the functional circuit 706 based on the instructions.
[0079] Before distributing the programming partitions to the CIM circuit 704 via NoC 716, the central management circuit 702 may configure the CIM circuit 704 using initialization parameters 709 during the initialization or power-up phase via a second communication channel. In the example in Figure 7, the second communication channel is a tree-type interconnect including a global control interconnect (GCI) circuit 720 rooted at the central management circuit 702, local control interconnects rooted at each distributed management circuit 703, and their respective links 719-1 to 719-n. The second communication channel may be based on a network-on-chip (NoC) peripheral interconnect (NPI) standard or protocol. However, the second communication channel is not limited to the NPI standard. After the central management circuit 702 configures the CIM circuit 704-1, the CIM circuit 704-1 can receive configuration packets from the streaming engine 714 via NoC 716.
[0080] In the example in Figure 7, the distributed management circuit 703-1 further includes an NPI switch 747 and an endpoint circuit 749, enabling the central management circuit 702 to access and configure the CIM circuit 704-1 via a second communication channel (e.g., via the GCI circuit 720). The distributed management circuit 703-1 further includes an NPI root bridge 746 and an NPI switch 748, enabling the CIM circuit 704-1 to access the LCI circuit 738. The initialization parameters 709 may include parameters for configuring the NPI switches 747 and 748, the endpoint circuit 749, and / or the root bridge 746 during the initialization or startup phase.
[0081] In one embodiment, the central management circuit 702 may have direct access to register 736 and / or other features of the functional circuit 706-1 via the GCI circuit 720, link 719-1, NPI switch 747, NPI bus 750, NPI switch 748, link 739, and LCI circuit 738.
[0082] The initialization parameters 709 may further include parameters for configuring the GCI circuit 720 and NPI switches 747 and 748 to allow the management engine 718 to directly access the LCI circuit 738 (for example, to directly read register 736). In this example, the GCI circuit 720 and NPI switches 747 and 748 bypass the CIM circuit 704-1 to provide a transition from a high-level LCI to a lower-level LCI.
[0083] In the example in Figure 7, switches 747 and 748 are shown as NoC peripheral interconnect (NPI) switch circuits, and the root bridge 746 is shown as an NPI root. In this example, the root bridge 746 can convert AXI-formatted transfers received from the CIM circuit 704-1 to the NPI protocol. Switches 747 and 748 and the root bridge 746 are not limited to NPI circuits.
[0084] The initialization parameters 709 may further include parameters for configuring the registers of NoC 716. Alternatively or additionally, the central management circuit 702 may provide initialization parameters to NoC 716 as described below.
[0085] The central management circuit 702 may further include a central CIM circuit 724 for offloading work from the management engine 718 and / or the host device. In one embodiment, the central CIM circuit 724 configures the first communication channel (e.g., NoC 716) during the initialization or power-up phase based on configuration information 708. NoC 716 may include configurable switches and a number of non-contiguous registers, so that programming those registers may require a large number of separate write operations. Configuring NoC 716 using the central CIM circuit 724 may be useful in freeing up resources of the management engine 718 or the host device for other purposes. The central CIM circuit 724 may also perform self-configuration based on configuration information 708. The central CIM circuit 724 may include features of CIM circuit 704-1, but may differ from CIM circuit 704-1 in one or more respects. Examples of these differences will be provided later.
[0086] As described above, the central management circuit 702 can push configuration information 708 (e.g., a packetized partition of PDI 712) to the CIM circuit 704 via NoC 716. Alternatively or additionally, the central management circuit 702 may store the configuration information 708 in external memory, here shown as external DRAM 710, and provide memory location information to the CIM circuit 704, enabling the CIM circuit 704 to retrieve or pull the configuration information 708 from the DRAM 710. As an example, during the initialization or startup phase, the CIM circuit 704 may receive the configuration information 708 directly from the central management circuit 702 via NoC 716 to configure each functional circuit 706. Subsequently, the CIM circuit 704 may retrieve additional configuration information 708 from the DRAM 710 via NoC 716 to reconfigure or partially reconfigure each functional circuit 706. For partial reconfiguration of a region, it may be more efficient to have the CIM circuit 704-1 retrieve configuration parameters from external memory.
[0087] The external DRAM 710 may contain one or more libraries of reconfiguration or partial reconfiguration instructions and associated configuration parameters for various tasks. The libraries may, for example, contain instructions and parameters for configuring the PL 732 region as an accelerator circuit. When the functional circuit 706-1 is assigned a task (for example, by a host device / data center), the CIM circuit 704-1 may retrieve the appropriate library of reconfiguration instructions and parameters from the external DRAM 710.
[0088] In one embodiment, the CIM circuit 704-1 reconfigures or partially reconfigures the functional circuit 706-1 by writing to register 736 via the LCI circuit 738 (to reconfigure or partially reconfigure the fixed functional circuit 730), to CRAM 740 via the CFrame circuit 744, and / or to registers and / or memory of the compute engine 734 via interface circuit 742. Alternatively or additionally, the central management circuit 702 directly provides the reconfiguration or partial reconfiguration parameters for the LCI circuit 738 and / or register 736 to the LCI circuit 738 via the GCI circuit 720 and switches 747 and 748.
[0089] Figure 8 is a block diagram of a distributed management circuit 703-1 according to one embodiment. The remaining distributed management circuits 703 may be the same as the distributed management circuit 703-1.
[0090] In the example shown in Figure 8, the CIM circuit 704-1 includes a CIM interconnect 802 that interfaces between resources / circuits within the CIM circuit 704-1 and also interfaces with circuits outside the CIM circuit 704-1.
[0091] In Figure 8, the CIM interconnect 802 includes a master port and a slave port, which are here referred to as "M" and "S," respectively. The master port and slave port may represent an AXI master port and an AXI slave port. The CIM interconnect 802 is not limited to master ports and slave ports, nor is it limited to an AXI interface.
[0092] The CIM circuit 704-1 further includes a packet processor 804 that parses commands from packets received from NoC 716 and / or external DRAM 710 and executes the commands on the target interface.
[0093] The CIM circuit 704-1 further includes a random access memory (RAM) 806. The RAM 806 may include a packet buffer 840 that holds incoming packets to be processed by the packet processor 804, and a data buffer 842 that holds data associated with commands executed on the packet processor 804 (e.g., stream data that is expected to be read or written by commands executed on the packet processor 804).
[0094] In one embodiment, the packet buffer 840 includes two slots, each capable of holding a packet. This allows one packet to be pushed to the CIM circuit 704-1 while the CIM circuit 704-1 is processing another packet. The entire packet, including its header, may be stored in each slot. The remainder of the RAM 806 may be used for the data buffer 842 to hold intermediate data that is being read back or processed. In one embodiment, the packet processor 804 can execute a command that uses a particular data buffer 842 as either the source or destination.
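As an illustration of why two slots are sufficient to overlap transfer and processing, the packet buffer may be modeled as a small bounded queue. This is a behavioral sketch only: the class name TwoSlotBuffer and its methods are hypothetical, and the model ignores hashing, slot addressing, and timing.

```python
from collections import deque


class TwoSlotBuffer:
    """Bounded FIFO: one packet can arrive while another is being processed."""

    def __init__(self, slots: int = 2):
        self._slots = slots
        self._queue = deque()

    def try_push(self, packet: bytes) -> bool:
        # A full buffer refuses the packet, so the sender has to wait.
        if len(self._queue) >= self._slots:
            return False
        self._queue.append(packet)
        return True

    def pop(self):
        # The consumer takes the oldest packet, which frees a slot.
        return self._queue.popleft() if self._queue else None


demo = TwoSlotBuffer()
first_two = [demo.try_push(b"p0"), demo.try_push(b"p1")]  # both slots fill
third = demo.try_push(b"p2")                              # refused until a slot frees
in_process = demo.pop()                                   # p0 handed to the consumer
retry = demo.try_push(b"p2")                              # now accepted
```

With a single slot the sender would idle for the whole processing time of each packet; the second slot is what lets the next transfer proceed in the meantime.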
[0095] The CIM circuit 704-1 further includes a memory controller 844. The memory controller 844 includes a first slave port 846 that is accessible from the CIM interconnect 802 and a second slave port 848 that is accessible from the packet processor 804 for fetching commands.
[0096] The CIM circuit 704-1 further includes an inline decryption circuit, shown here as the AES-GCM circuit 810 (i.e., Advanced Encryption Standard in Galois / Counter Mode), which decrypts the configuration packets before the packet processor 804 processes them. In one embodiment, the packet processor 804 fetches the configuration packets from the packet buffer 840 and parses the configuration packets for commands to be executed by the packet processor 804. If the configuration packets are encrypted, the packet processor 804 routes the configuration packets through the AES-GCM circuit 810. The packet processor 804 may control the AES-GCM circuit 810, which may be useful / efficient for encryption key rolling. The packet processor 804 may roll the encryption key of the AES-GCM circuit 810 together with the AES-GCM circuit 810.
[0097] The CIM circuit 704-1 further includes an integrity check circuit 812 that reads the configuration registers in the function circuit 706-1 and performs an error correction code (ECC) check.
[0098] The CIM circuit 704-1 further includes a GCR interface circuit 814 that functions as a node or interface to the Global Communication Ring (GCR) interconnect. In one embodiment, the GCR interface circuit 814 captures data transmitted by the central management circuit 702 (e.g., eFuse information) and communicates error / interrupt packets on the GCR to the central management circuit 702. In one embodiment, the packet processor 804 can use the GCR interface circuit 814 to communicate with the central management circuit 702 and / or other GCR nodes.
[0099] The features and link 743 shown in block 862 may be omitted from the central CIM circuit 724 (Figure 7).
[0100] The CIM circuit 704-1 further includes a DMA engine 816 that streams commands and data to and from the CIM circuit 704-1. The DMA engine 816 will be described later with reference to Figures 9A and 9B.
[0101] The CIM circuit 704-1 further includes an authentication circuit that authenticates the configuration packets received from the central management circuit 702 and the external DRAM 710 before the packet processor 804 processes the configuration packets. The authentication circuit may implement a secure hash algorithm (SHA) published by the National Institute of Standards and Technology (NIST). In the example in Figure 8, the authentication circuit is shown as the SHA-3 circuit 808. The central management circuit 702 and / or the distributed management circuit 703-1 may be programmed to push packets to the SHA-3 circuit 808 when packets are pushed to or pulled by the distributed management circuit 703-1.
[0102] In one embodiment, the central management circuit 702 provides the distributed management circuit 703-1 with an expected hash value for the first packet during the initialization phase, and the header of the configuration packet includes the SHA hash value of each subsequent packet (e.g., within the three quadwords of the header). The packet header may also include padding to provide a suitable packet length for the SHA-3 circuit 808. The DMA engine 816 can automatically load the SHA hash values contained in the header into the SHA-3 circuit 808 for authentication of subsequent packets.
[0103] When the first packet is read into the packet buffer 840, the SHA-3 circuit 808 calculates a hash value based on the first packet and provides an SHA digest, which is then compared with the hash value provided by the central management circuit 702. If the SHA digest matches the hash value provided by the central management circuit 702, the packet processor 804 may process the packet. The DMA engine 816 may store the hash value contained in the header of the first packet for use with subsequent packets.
[0104] When a subsequent packet is read into the packet buffer 840, the SHA-3 circuit 808 calculates a hash value based on the packet and provides an SHA digest, which is then compared with the hash value fetched and stored from the preceding packet. If the SHA digest matches the stored hash value, the packet processor 804 may process the packet. If the SHA digest does not match the hash value of the packet, the DMA engine 816 or the packet processor 804 may send an error message / interrupt to the central management circuit 702. The central management circuit 702 may then delay streaming the packet to the distributed management circuit 703-1.
[0105] In one embodiment, the packet buffer 840 is marked as full when a packet is read into the packet buffer 840. If the SHA digest matches the hash value of the packet, the packet buffer 840 is marked as available. The DMA engine 816 may suspend processing packets until the packet buffer 840 is marked as available.
[0106] As described above, comparing the hash of the first packet with the hash value provided by the central management circuit 702, and then comparing the hash of each subsequent packet with the hash value parsed from the preceding packet, essentially authenticates / verifies the SHA hash of each subsequent packet.
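The chained check described in the preceding paragraphs may be illustrated by the following sketch. It is a software model under stated assumptions, not the disclosed circuit: SHA3-384 is chosen arbitrarily because the description names SHA-3 without a variant; each modeled packet is simply the digest of its successor followed by a payload; and the last packet carries an all-zero placeholder instead of omitting the field. The names build_chain and verify_chain are hypothetical.

```python
import hashlib

DIGEST = hashlib.sha3_384          # assumption: the SHA-3 variant is not stated
HASH_LEN = DIGEST().digest_size    # 48 bytes for SHA3-384


def build_chain(payloads):
    """Build packets back to front so each embeds the digest of its successor."""
    packets = []
    next_hash = b"\x00" * HASH_LEN  # placeholder: the last packet has no successor
    for payload in reversed(payloads):
        pkt = next_hash + payload
        packets.append(pkt)
        next_hash = DIGEST(pkt).digest()  # hash covers the whole packet
    packets.reverse()
    return next_hash, packets  # next_hash is now the digest of the first packet


def verify_chain(first_hash, packets):
    """Accept packets in order until a digest does not match the expected value."""
    expected = first_hash  # supplied out of band for the first packet
    accepted = []
    for pkt in packets:
        if DIGEST(pkt).digest() != expected:
            return accepted, False  # mismatch: stop and report an error
        accepted.append(pkt[HASH_LEN:])
        expected = pkt[:HASH_LEN]   # stored for the next packet
    return accepted, True


first_hash, chain = build_chain([b"p0", b"p1", b"p2"])
ok_payloads, ok = verify_chain(first_hash, chain)
```

Because every digest covers the whole packet, including the embedded digest of the next one, changing any later packet invalidates the check at that packet while the packets before it remain accepted.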
[0107] The packet processor 804 may include one or more local registers, which may include, but are not limited to, local data registers (LDRs), control registers, and / or condition registers (CRs). In one embodiment, the packet processor 804 includes a 16-bit control register (e.g., 16 one-bit registers that may be represented as Control_Reg[15:0]) and a 16-bit CR (e.g., 16 one-bit CRs that may be represented as Condition_Reg[15:0]). Local registers may be useful for providing low-latency control. The packet processor 804 may access (retrieve values from and / or write to) local registers during the execution of one or more of various types of commands. The packet processor 804 may selectively execute predicated commands based, for example, on the value of a condition or CR bit. Additional examples are further provided below.
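Predicated execution of this kind may be illustrated with a small software model. The model is an assumption-laden sketch: commands are represented as dictionaries with hypothetical keys (addr, data, and an optional predicate naming a condition bit), and executing a command merely records a write. Only the 16 one-bit condition registers come from the description.

```python
class PacketProcessorModel:
    """Toy model: a command runs only if its predicate condition bit is set."""

    def __init__(self):
        self.condition_reg = 0  # models Condition_Reg[15:0]
        self.writes = []        # record of (address, data) writes performed

    def set_cr(self, bit: int, value: bool) -> None:
        if not 0 <= bit < 16:
            raise ValueError("condition bit out of range")
        if value:
            self.condition_reg |= 1 << bit
        else:
            self.condition_reg &= ~(1 << bit) & 0xFFFF

    def execute(self, cmd: dict) -> bool:
        pred = cmd.get("predicate")  # None means the command is unconditional
        if pred is not None and not (self.condition_reg >> pred) & 1:
            return False             # predicate bit clear: command is skipped
        self.writes.append((cmd["addr"], cmd["data"]))
        return True


pp = PacketProcessorModel()
ran_plain = pp.execute({"addr": 0x10, "data": 1})
skipped = pp.execute({"addr": 0x20, "data": 2, "predicate": 3})
pp.set_cr(3, True)
ran_predicated = pp.execute({"addr": 0x20, "data": 2, "predicate": 3})
```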
[0108] In Figure 8, the packet processor 804 includes a command fetch port 850, a data execution port 852, an AES master port 854, an AES slave port 856, and a DMA read FIFO (first-in, first-out) buffer port 858, which are described below.
[0109] The packet processor 804 uses the command fetch port 850 to interface with the memory controller 844 and to read packets that have been verified by the SHA-3 circuit 808. In one embodiment, the command fetch port 850 includes a dedicated AXI interface (e.g., a 128-bit AXI interface) which reads from the start address to the end of the packet (e.g., 128 bits). The packet processor 804 may determine the packet length from the beginning of the packet header and determine when to stop fetching commands based on the packet length.
[0110] The packet processor 804 uses the data execution port 852 (e.g., a 128-bit AXI master interface) to execute various types of read and write transactions (e.g., AXI transactions) via the CIM interconnect 802. The transaction type, including the transaction length and width, is defined by a command embedded within the packet. Data for read operations can be transferred to a specific register in the command engine 902 or to a specific offset in the data buffer 842. The base address of the data buffer 842 can be determined by the buffer translation table of the packet processor 804.
[0111] The packet processor 804 uses the AES master port 854 (for example, a 128-bit write-only master interface) to direct packets read from the data buffer 842 to the AES-GCM circuit 810.
[0112] The AES-GCM circuit 810 pushes write transactions to the input FIFO buffer of the packet processor 804 via the AES slave port 856 (e.g., a 128-bit slave interface). The packet processor 804 parses the commands contained in the inbound stream and can create back pressure at appropriate times (i.e., the AES slave port 856 cannot receive additional commands until there is space in the packet processor 804's FIFO buffer).
[0113] The packet processor 804 pushes readback data from its read pipeline to the DMA engine 816 using the DMA read FIFO buffer port 858 (e.g., a 128-bit path), as will be described later with reference to Figures 9A and 9B. The packet processor 804 may read data from multiple locations (e.g., to collect trace data) and push the readback data to the DMA engine 816, which may transfer or stream the readback data to memory (e.g., RAM 806 or external DRAM 710). The DMA engine 816 may be useful in freeing up the packet processor 804 to perform other functions.
[0114] Figure 9A is a block diagram of a DMA engine 816, including a command engine 902 and a data engine 904, according to one embodiment.
[0115] The command engine 902 pulls a configuration packet 910 from the DRAM 710 (for example, for reconfiguration / partial reconfiguration). The command engine 902 may read the configuration packet 910 and push it to the CIM interconnect 802 for delivery to the packet buffer 840. The command engine 902 may extract commands from the configuration packet 910 for execution by the packet processor 804.
[0116] The data engine 904 pushes the readback data 912 (from the functional circuit 706-1) to a storage device such as the external DRAM 710 or a fabric buffer of the PL 732. Readback is explained further below. The data engine 904 may be programmed / configured to perform other tasks such as forwarding. The data engine 904 may operate under the control of the packet processor 804.
[0117] The command engine 902 and the data engine 904 can operate in parallel with each other. For example, the command engine 902 may read or pull configuration packets 910 from the external DRAM 710 and copy the command packets to the packet buffer 840 in RAM 806. Meanwhile, the data engine 904 pushes readback data 912 received from the CIM interconnect 802, or packets received from the packet processor 804, to NoC 716 via link 824.
[0118] The DMA engine 816 may operate in one or more modes. Examples are provided below for a direct configuration mode, a direct fabric readback mode, and a support mode.
[0119] In direct configuration mode, the command engine 902 is programmed to stream packets from a contiguous area of the external DRAM 710 to the packet buffer 840. In one embodiment, the command engine 902 examines the least significant bit of the attribute word in the first quadword of the current packet to determine if the current packet is the last packet to be forwarded. If the current packet is the last packet to be forwarded, the command engine 902 stops forwarding packets after the current packet has been read.
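The stop condition of the direct configuration mode can be illustrated with a minimal sketch. It models each packet as a list of 128-bit quadwords and assumes, purely for illustration, that the attribute word occupies the low 32 bits of the first quadword; the text does not state its position. The names `is_last_packet` and `stream_packets` are hypothetical.

```python
ATTR_WORD_MASK = 0xFFFF_FFFF  # assumption: attribute word is the low 32 bits of quadword 0
LAST_PACKET_BIT = 0x1         # least significant bit of the attribute word (per the text)


def is_last_packet(packet):
    """Return True if the packet's attribute word marks it as the final packet."""
    attribute_word = packet[0] & ATTR_WORD_MASK
    return bool(attribute_word & LAST_PACKET_BIT)


def stream_packets(dram_packets, packet_buffer):
    """Copy packets in order, stopping after the one flagged as last."""
    for packet in dram_packets:
        packet_buffer.append(packet)
        if is_last_packet(packet):
            break
    return packet_buffer
```

In this sketch the flagged packet is still forwarded, and forwarding stops only after it has been read, which matches the behavior described above.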
[0120] In direct fabric readback mode, the packet processor 804 initiates a readback of data within the functional circuit 706-1 (e.g., within the PL 732), and the data engine 904 streams the resulting readback data 912 to memory (e.g., data buffer 842 or external DRAM 710). In one embodiment, the packet processor 804 performs the readback operation by pushing a write command to the data engine 904, and the data engine 904 pulls data from the functional circuit 706-1. The packet processor 804 or the data engine 904 may push a write command to the CFrame circuit 744 to write the contents of a register or memory location in the PL 732 or CRAM 740 onto link 743. The data engine 904 may issue a read command to the keyhole or fixed aperture of the CFrame circuit 744, and the resulting readback data may be directed to NoC 716 via the DMA switch 828.
[0121] After the packet processor 804 has finished writing the readback command to the CFrame circuit 744, the packet processor 804 may write to the control register of the data engine 904 to instruct the data engine 904 to finish reading any outstanding data from the CFrame circuit 744. The packet processor 804 can directly read the remaining data in the FIFO buffer of the CFrame circuit 744 and push that remaining data to the read FIFO buffer 906 of the data engine 904, as described below with respect to the support mode.
[0122] The packet processor 804 may perform data readback for one or more of various purposes, such as conditional commands, data processing, integrity checks, and / or state capture (e.g., for emulation purposes).
[0123] In the case of a conditional command, the packet processor 804 may read back the contents of a register in the functional circuit 706-1 (for example, a register in PL732) to determine whether to execute the command.
[0124] For data processing, the packet processor 804 can instruct the DMA engine 816 to place data in the first data buffer of the data buffer 842. The packet processor 804 can then read (i.e., read back) the data from the first data buffer, process the data, write the processed data to the second data buffer of the data buffer 842, and instruct the DMA engine 816 to empty the second buffer.
[0125] For integrity checks, the packet processor 804 may read back configuration parameters from the registers or memory (e.g., CRAM 740) of the functional circuit 706 via the configuration circuit (e.g., via links 739 and / or 743) and compare the readback data with the configuration parameters previously provided to the registers or memory.
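The integrity check amounts to comparing readback data with the values previously written. A minimal sketch follows, assuming the expected parameters and the readback data are both available as address-to-value mappings; that representation and the name `integrity_mismatches` are hypothetical.

```python
def integrity_mismatches(expected, readback):
    """Return the addresses whose read-back value differs from what was written."""
    return sorted(addr for addr, value in expected.items()
                  if readback.get(addr) != value)
```

An empty result would indicate that every checked location still holds the configuration value originally provided.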
[0126] For emulation purposes, the packet processor 804 may save the operating state of the functional circuit 706-1 or a part thereof, and then configure the functional circuit 706-1 or a part thereof in the saved state (for example, for debugging purposes). In one embodiment, the packet processor 804 or other circuitry stops the clock of the functional circuit 706-1, and the packet processor 804 reads the contents of the configuration registers / memory of the functional circuit 706-1 via the configuration circuitry (e.g., links 739 and / or 743). The contents represent the saved state of the functional circuit 706-1, or a part thereof. The packet processor 804 may then configure the functional circuit 706-1 using the saved state via the configuration infrastructure. Alternatively or additionally, the functional circuit 706-1 may include test / debug infrastructure for reading registers (e.g., chipscope) and / or flip-flops (e.g., scantest). In this embodiment, the packet processor 804 can read back the states of the registers and / or flip-flops via the test / debug infrastructure. Subsequently, the packet processor 804 can configure the functional circuit 706-1 using the saved state via the test / debug infrastructure.
[0127] In support mode, the data engine 904 supports the packet processor 804 when performing DMA read operations. When the packet processor 804 performs a read DMA operation, it pushes the resulting data to the data engine 904's read FIFO buffer 906 via link 824 (e.g., the packet processor 804's read pipeline). The data engine 904 can stream or write data from the read FIFO buffer 906 to a contiguous region of the external DRAM 710 via link 908, the DMA switch 826, and NoC 716. In one embodiment, the data engine 904 is programmed with a start address or base address in the region of the external DRAM 710, and increments the address with each write operation until the data engine 904 is programmed with a new base address.
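The auto-incrementing write behavior of the support mode can be sketched as a small model. This is an illustration only: the class name, the dictionary standing in for the external DRAM, and the 16-byte address step per quadword write are assumptions, not details given in the text.

```python
class DataEngineModel:
    """Toy model of the data engine's support-mode write path (hypothetical)."""

    QUADWORD_BYTES = 16  # assumption: each write stores one 128-bit quadword

    def __init__(self, base_address):
        self.dram = {}        # address -> quadword, stands in for external DRAM
        self.read_fifo = []   # stands in for read FIFO buffer 906
        self.next_address = base_address

    def program_base(self, base_address):
        """Reprogram the base address; later writes continue from here."""
        self.next_address = base_address

    def push(self, quadword):
        """Packet processor side: push readback data into the read FIFO."""
        self.read_fifo.append(quadword)

    def drain(self):
        """Write every queued quadword to DRAM, incrementing the address each time."""
        while self.read_fifo:
            self.dram[self.next_address] = self.read_fifo.pop(0)
            self.next_address += self.QUADWORD_BYTES
```

The address keeps advancing across calls to `drain` until `program_base` supplies a new base address, mirroring the contiguous-region behavior described above.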
[0128] Furthermore, with respect to a slot in the packet buffer 840, the data engine 904 may mark the last transaction associated with the packet to notify the packet processor 804 that the packet has been completed. Also, the busy flag for the associated slot in the packet buffer 840 may be set to identify the slot as full. If other slots in the packet buffer 840 are still being used by the packet processor 804 (i.e., the busy flag is set), the data engine 904 may stop pushing packets to the packet buffer 840. The busy flag may be routed throughout the IC device 700 (for example, to the DMA engines of other distributed management circuits 703 via the central management circuit 702).
[0129] In one embodiment, the packet processor 804 and the DMA data engine 904 are configured to read data and push it into a data buffer 842, which may be configured in RAM 806 using commands. The size and base address of the data buffer 842, as well as its configuration parameters (e.g., circular buffer, fixed FIFO, or LIFO), may be programmed into the data buffer management table (DBMT) of the packet processor 804, as described below with reference to Figure 9B.
[0130] Figure 9B shows the DBMT 920 of the packet processor 804 and the interconnections among the packet processor 804, the memory controller 844, and the CIM interconnect 802, according to one embodiment. In the example of Figure 9B, the DBMT 920 supports up to 16 data buffers 842. In the example of Figure 9B, each entry in the DBMT 920 includes a base address field 908, an end address field 910, a write pointer field 912, a read pointer field 914, and a buffer mode field 916, which are described below.
[0131] Commands that use a data buffer 842 as a source or destination may include a field (e.g., a 4-bit field) specifying which data buffer 842 to use, an example of which is provided further below. In one embodiment, multiple operations of the packet processor 804 may move data to and from the same data buffer 842 in the order in which the operations are performed. The DBMT 920 tracks the level of data in the data buffer 842 and maintains the read and write pointers for the operations.
[0132] The base address field 908 contains the lower address of the data buffer 842.
[0133] The end address field 910 contains the upper address of the data buffer 842.
[0134] The write pointer field 912 contains the address of the next entry that can be written to the data buffer 842. When a particular data buffer 842 is programmed into the DBMT 920, the write pointer field 912 is set equal to the value of the base address field 908.
[0135] The read pointer field 914 contains the address of the last entry read from the data buffer 842. When a particular data buffer 842 is programmed into the DBMT 920, the read pointer field 914 is set equal to the value in the end address field 910 if the FIFO option is selected, and equal to the value in the base address field 908 if the LIFO option is selected.
[0136] The buffer mode field 916 includes the usage mode of the data buffer 842 (e.g., fixed FIFO, circular buffer, or LIFO).
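The table entry and its pointer initialization can be sketched as a small data structure. This is a minimal illustration, not the actual register layout: the class and function names are hypothetical, and applying the FIFO read-pointer rule to the circular buffer mode is an assumption, since the text only names the FIFO and LIFO options.

```python
from dataclasses import dataclass
from enum import Enum


class BufferMode(Enum):
    FIXED_FIFO = "fixed_fifo"
    CIRCULAR = "circular"
    LIFO = "lifo"


@dataclass
class DbmtEntry:
    base_address: int
    end_address: int
    write_pointer: int
    read_pointer: int
    mode: BufferMode


def program_entry(base_address, end_address, mode):
    """Build an entry with pointers initialized as the text describes."""
    if mode is BufferMode.LIFO:
        read_pointer = base_address
    else:
        # The text gives the end address for "the FIFO option"; using the
        # same rule for the circular mode is an assumption.
        read_pointer = end_address
    # The write pointer always starts at the base address.
    return DbmtEntry(base_address, end_address, base_address, read_pointer, mode)


MAX_BUFFERS = 16  # the Figure 9B example supports up to 16 data buffers
dbmt = [None] * MAX_BUFFERS
dbmt[3] = program_entry(0x100, 0x1FF, BufferMode.FIXED_FIFO)
```

A 4-bit data buffer index is enough to select any of the 16 slots in such a table.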
[0137] The packet processor 804 may execute one or more of various command types. Exemplary command types or categories include, but are not limited to, write commands, register read commands, register mask and write commands, compare commands, data buffer commands, and read-through DMA commands.
[0138] A write command enables the packet processor 804 to perform single and / or burst write operations (e.g., up to 256 x 128 bits). The write command may specify the data to be written. The packet processor 804 may instruct one or more slave interface circuits of the CIM interconnect 802 to issue a write command. The write command may be predicated on the condition of a specified CR bit.
[0139] A register read command allows the packet processor 804 to read word values, double word values, and / or quad word values from an address on the CIM interconnect 802 into the packet processor 804's LDR. The packet processor 804 can manipulate the values in the LDR and write the manipulated values to the slave interface circuit of the CIM interconnect 802 and / or the CIM register 860. The register read command can be predicated on the condition of a specified CR bit.
[0140] The register mask and write command allows the packet processor 804 to write word values, double word values, and / or quad word values from the LDR to the slave interface circuit of the CIM interconnect 802. In the case of register word operations, any bit in the least significant word of the LDR can be forced to 1 or 0 and written to the destination. The register mask and write command can be predicated on the condition of a specified condition register bit.
[0141] The comparison command allows the packet processor 804 to compare the least significant word of the LDR with a comparison value. The comparison command may cause the packet processor 804 to mask bits with a specified mask (e.g., a 32-bit mask) and compare the masked bits with the comparison value (e.g., a 32-bit value). If the masked bits match the comparison value, the packet processor 804 may set the specified CR bit.
[0142] A data buffer command may include read commands and / or write commands. A data buffer command enables the packet processor 804 to move data to or from a designated data buffer 842 (e.g., to the LDR or external DRAM 710). A data buffer command may move word data, double word data, or quad word data. A data buffer command may support burst reads from a designated data buffer 842 to a location outside the CIM circuit 704-1, such as by pushing the read data to the read FIFO buffer 906 of the data engine 904 for transfer to an external location (e.g., external DRAM 710). A data buffer command may be predicated on the condition of a specified CR bit.
[0143] Read-through DMA commands allow read operations of various sizes to be sent to / through the CIM interconnect 802. Read-through DMA commands can be used to perform a read operation from a specified data buffer 842. The read data can be pushed to the read FIFO buffer 906 of the data engine 904 for transfer to memory (e.g., data buffer 842 or external DRAM 710). Read-through DMA commands can be predicated on the condition of a specified CR bit.
[0144] Commands executed by packet processor 804 may have one or more of the following characteristics:
[0145] Commands can be started and stopped on quadword boundaries.
[0146] Commands can be 1 to 257 quadwords long.
[0147] Word writing and double word writing can be specified with a single quad word.
[0148] Quad-word reading can be specified with a single quad-word.
[0149] A quadword write can be specified by a command with two or more quadword lengths. The details of the command, including the command length and address, may be defined in the first quadword, and the data to be written may be specified in subsequent quadwords.
[0150] The lower part of an address (e.g., the lower 32 bits) can be specified in the first quadword. The upper part of an address (e.g., the upper 32 bits) can be specified in a register (e.g., the CIM Upper_Address register) and used throughout the context of the associated command.
[0151] The readback data may be pushed into the read FIFO buffer 906, or it may be held in the LDR.
[0152] The data for the write operation may be supplied from the LDR or specified by the relevant command.
[0153] Masking / checking may be performed on local registers of the packet processor 804. For example, masking / checking may be performed on the LDR, and another local register (e.g., bits of the CR of the packet processor 804) may be set based on the LDR.
[0154] Conditional / predicated execution can be performed based on the state of the condition register bits.
[0155] Exemplary instruction fields and formats will be described later.
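The split addressing listed among the characteristics above can be sketched as follows, assuming a 64-bit address formed from a 32-bit upper part held in a register and a 32-bit lower part carried in the first quadword. The function name is hypothetical.

```python
def full_address(upper_address_register, lower_address):
    """Combine a 32-bit upper part (from a register) with a 32-bit lower part."""
    if not 0 <= lower_address < (1 << 32):
        raise ValueError("lower address must fit in 32 bits")
    if not 0 <= upper_address_register < (1 << 32):
        raise ValueError("upper address must fit in 32 bits")
    return (upper_address_register << 32) | lower_address
```

Because the upper part lives in a register, consecutive commands that target the same 4 GiB window only need to carry the lower 32 bits.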
[0156] Figure 10 shows fields 1000 of a command executed by the packet processor 804 according to one embodiment. The fields 1000 include an opcode field 1002, a length field 1004, a synchronization field 1006, a write data source field 1008, a condition register field 1010, a data buffer index field 1012, a word 2 data field 1014, a word 1 data field 1016, a read or write destination field 1018, and an address field 1020.
[0157] Figure 11 shows a subfield of the opcode field 1002 according to one embodiment. In the example in Figure 11, the opcode field 1002 is shown as an 8-bit field including an operation type or class field 1102, an execution criteria field 1104, and a data width field 1106.
[0158] The following table shows examples of operation class codes.
[0159] [Table 1]
[0160] The execution criteria field 1104 specifies whether the command is predicated and the predication parameters. Examples of execution criteria codes are shown in the table below.
[0161] [Table 2]
[0162] The data width field 1106 specifies the width of the operation. Example data width codes are shown in the table below.
[0163] [Table 3]
[0164] Returning to Figure 10, the length field 1004 specifies the length of the write or read quadword burst. In the case of a write quadword burst, the length field 1004 can also indicate the number of quadwords following the first quadword in the write command minus one.
[0165] The synchronization field 1006 indicates whether the associated command is synchronous. A synchronous command stops the packet processor 804 from issuing further commands until the command is complete. A synchronous command can return a status in a CR bit to indicate successful completion. A value of 0 may indicate that the command is not synchronous. A value of 1 may indicate that the command is synchronous.
[0166] A synchronous command is a type of command that stops issuing further commands until the synchronous command is complete. Typically, a CIM can issue asynchronous commands back-to-back on its AXI interface. Back-to-back asynchronous commands are processed in a pipeline manner. When a CIM issues a synchronous command on its AXI interface, it will not issue any further commands until it receives an indication on its AXI interface that the synchronous command is complete.
[0167] The write data source field 1008 specifies whether the data for the write operation is included in the associated command or supplied from a local register of the packet processor 804. Example source codes are shown in the table below.
[0168] [Table 4]
[0169] The condition register (CR) field 1010 specifies the CR bit used in executing the associated command. In the example in Figure 10, the CR field 1010 contains 4 bits to specify one of the 16 CR bits.
[0170] The data buffer index field 1012 specifies the index of the data buffer 842 used to look up information in the DBMT 920 (Figure 9B). The packet processor 804 can use this information to write data from the LDR to the data buffer 842, or read data stored in the data buffer 842 and copy that data to the LDR, or read that data and pass it to the read FIFO buffer 906.
[0171] With respect to the Word 1 data field 1016 and the Word 2 data field 1014, in the case of a single word (e.g., a 32-bit word) write operation, the Word 1 data field 1016 contains the data to be written (e.g., 32 bits), and the Word 2 data field 1014 is not used. In the case of a double word write operation, the Word 1 data field 1016 contains the lower part or word of the data to be written, and the Word 2 data field 1014 contains the upper part or word of the data to be written (e.g., 32 bits).
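As a small illustration of the double word case, the 64-bit write value can be assembled from the two 32-bit fields as sketched below. The function name is hypothetical.

```python
def double_word_value(word1, word2):
    """Assemble a 64-bit value: Word 1 is the lower half, Word 2 the upper half."""
    return ((word2 & 0xFFFF_FFFF) << 32) | (word1 & 0xFFFF_FFFF)
```

For a single word write, only `word1` would be used and `word2` would be ignored.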
[0172] In mask storage operation (for example, data is supplied from bits [31:0] of the LDR), the word 1 data field 1016 contains the mask (i.e., specifying the bits of supplied data to be masked), and the word 2 data field 1014 contains the values of the bits specified by the mask in the word 1 data field 1016. In other words, any of the bits [31:0] of the LDR that are not masked by the values in the word 1 data field 1016 are set to the values specified by the respective bits in the word 2 data field 1014. For example, if bit 0 of the LDR is not masked as specified by the value of bit 0 in the word 1 data field 1016, then bit 0 of bits [31:0] of the LDR is set to the value of bit 0 in the word 2 data field 1014.
[0173] With respect to the read or write destination field (destination field) 1018, in the case of a read command, the destination field 1018 specifies whether the data to be read or masked by the read operation should be pushed into the read FIFO buffer 906 or stored in the local register of the packet processor 804. In the case of a single beat read from memory, the data may be pushed into the local register of the packet processor 804 by default. Exemplary source / destination codes for read commands are provided in the table below.
[0174] [Table 5]
[0175] For write commands, the destination field 1018 specifies whether the write data (word / double word / quad word) is written to memory (e.g., external DRAM 710), data buffer 842, or a local register of packet processor 804. Exemplary destination codes for write commands are provided in the table below.
[0176] [Table 6]
[0177] Commands to the packet processor 804 can be constructed by selecting appropriate encodings for the fields shown in Figures 10 and 11. An example of a poll command is provided below for reading a 32-bit value from a memory-mapped register and reissuing or repeating the poll command if a particular bit does not match a specified value.
[0178] [Table 7]
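The polling behavior can be approximated in software as a read, mask, and compare loop. The sketch below is an illustration only and does not reproduce the encoding of Table 7: `read_register` is a hypothetical callable standing in for a register read command, and the `max_attempts` limit is an assumption added so the example always terminates.

```python
def poll_register(read_register, address, mask, expected, max_attempts=1000):
    """Re-read a 32-bit register until (value & mask) == expected.

    Returns the number of reads performed. read_register and max_attempts
    are illustrative assumptions, not part of the described command set.
    """
    for attempt in range(1, max_attempts + 1):
        value = read_register(address) & 0xFFFF_FFFF
        if (value & mask) == expected:
            return attempt
    raise TimeoutError(f"register {address:#x} never matched")
```

In the command-based form, the same effect could be obtained by following the read with a compare that sets a CR bit and predicating the repeat on that bit.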
[0179] An example command for packet processor 804 is shown below.
[0180] Figure 12 shows an exemplary memory word write (MWW) command 1200 that enables the packet processor 804 to write a value to a bit-aligned address in the memory map of the CIM circuit 704-1 (for example, writing a 32-bit value to a 32-bit aligned address).
[0181] Figure 13 shows an exemplary synchronous memory word write (SMWW) command 1300, which allows the packet processor 804 to write a value to a bit-aligned address in the memory map (for example, a 32-bit value to a 32-bit aligned address) and delay the issuance of further instructions until the SMWW command 1300 is complete. If the SMWW command 1300 returns an error, a CR indicated by bits 103:100 of the condition register field 1010 is asserted.
[0182] Figure 14 shows an exemplary conditional true memory word write (TMWW) command 1400 that allows the packet processor 804 to write a value to a bit-aligned address in the memory map (for example, writing a 32-bit value to a 32-bit aligned address in the memory map) when the condition flag indicated by bits 103:100 of the condition register field 1010 is true.
[0183] Figure 15 shows an exemplary conditional false memory word write (FMWW) command 1500 that allows the packet processor 804 to write a value to a bit-aligned address in the memory map (for example, writing a 32-bit value to a 32-bit aligned address in the memory map) when the condition flag indicated by bits 103:100 of the condition register field 1010 is false.
[0184] Figure 16 shows an exemplary conditional true synchronous memory word write (TSMWW) command 1600, which, if the condition flag indicated by bits 103:100 of the condition register field 1010 is true, allows the packet processor 804 to write a value to a bit-aligned address in the memory map (for example, a 32-bit value to a 32-bit aligned address in the memory map) and delays the issuance of further instructions until the TSMWW command 1600 is complete.
[0185] Figure 17 shows an exemplary conditional false synchronous memory word write (FSMWW) command 1700, which, if the condition flag indicated by bits 103:100 of the condition register field 1010 is false, allows the packet processor 804 to write a value to a bit-aligned address in the memory map (for example, a 32-bit value to a 32-bit aligned address in the CIM memory map) and delays the issuance of further instructions until the FSMWW command 1700 is complete.
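The unconditional, conditional true, and conditional false variants differ only in how the selected condition flag gates execution. A minimal sketch of that gating follows, assuming a 16-bit condition register indexed by the 4-bit value in bits 103:100. The enum and its member names are hypothetical and do not correspond to the execution criteria codes of Table 2.

```python
from enum import Enum


class Criteria(Enum):
    ALWAYS = "always"      # unpredicated, e.g. MWW or SMWW
    IF_TRUE = "if_true"    # e.g. TMWW or TSMWW
    IF_FALSE = "if_false"  # e.g. FMWW or FSMWW


def should_execute(criteria, condition_register, cr_index):
    """Decide whether a command runs, given a 16-bit condition register."""
    if not 0 <= cr_index < 16:
        raise ValueError("cr_index selects one of 16 CR bits")
    flag = bool((condition_register >> cr_index) & 1)
    if criteria is Criteria.IF_TRUE:
        return flag
    if criteria is Criteria.IF_FALSE:
        return not flag
    return True
```

Whether a command is also synchronous is independent of this check; it only affects when the next command may be issued.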
[0186] Figure 18 shows an exemplary memory double word write (MDW) command 1800 that enables the packet processor 804 to write a double word value to a bit-aligned address in the memory map (for example, writing a 64-bit value to a 32-bit aligned address in the CIM memory map).
[0187] Figure 19 shows an exemplary synchronous memory double word write (SMDW) command 1900, which allows the packet processor 804 to write a double word value to a bit-aligned address in the memory map (for example, a 64-bit value to a 32-bit aligned address in the CIM memory map) and delay the issuance of further instructions until the SMDW command 1900 is complete. If the SMDW command 1900 returns an error, a CR indicated by bits 103:100 of the condition register field 1010 is asserted.
[0188] Figure 20 shows an exemplary conditional true memory double word write (TMDW) command 2000, which allows the packet processor 804 to write a double word value to a bit-aligned address in the memory map (for example, writing a 64-bit value to a 32-bit aligned address in the CIM memory map) when the condition flag indicated by bits 103:100 of the condition register field 1010 is true.
[0189] Figure 21 shows an exemplary conditional false memory double word write (FMDW) command 2100 that allows the packet processor 804 to write a double word value to a bit-aligned address in the memory map (for example, writing a 64-bit value to a 32-bit aligned address in the CIM memory map) when the condition flag indicated by bits 103:100 of the condition register field 1010 is false.
[0190] Figure 22 shows an exemplary memory quadword write (MQW) command 2200 that allows the packet processor 804 to write a selectable number of quadwords to bit-aligned addresses in the memory map (for example, 1 to 256 quadwords to 128-bit aligned addresses in the memory map). The number of quadwords is one more than the value specified by bits 119:112 of the length field 1004 (Figure 10).
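The bit positions mentioned for the length field and the condition register field can be decoded from the first quadword as sketched below. The sketch treats the quadword as a 128-bit integer with bit 0 as the least significant bit, which is an assumption about bit numbering; the helper names are hypothetical.

```python
def extract_bits(quadword, high, low):
    """Return bits [high:low] (inclusive) of a 128-bit command quadword."""
    width = high - low + 1
    return (quadword >> low) & ((1 << width) - 1)


def mqw_quadword_count(first_quadword):
    """Number of data quadwords in an MQW command: length field plus one."""
    return extract_bits(first_quadword, 119, 112) + 1


def cr_index(first_quadword):
    """Condition register index carried in bits 103:100."""
    return extract_bits(first_quadword, 103, 100)
```

An 8-bit length field of 0 to 255 therefore maps to 1 to 256 data quadwords, which together with the first quadword gives a maximum command length of 257 quadwords.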
[0191] Figure 23 shows an exemplary compare (C) command 2300 that allows the packet processor 804 to compare the masked value of the lowest word of the LDR with a specified value and set a condition register bit based on the comparison. Exemplary pseudocode is provided below. If (LDR[31:0] & Mask) is equal to Comp_Value, then CR[Condition_Reg] = 1; otherwise, CR[Condition_Reg] = 0.
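The compare pseudocode can be written as a short runnable sketch. Here the condition register is modeled as an integer bit vector and the updated value is returned rather than stored; that modeling choice and the function name are assumptions for illustration.

```python
def compare_command(ldr, mask, comp_value, cr, condition_reg):
    """Return the condition register after a compare on the low word of the LDR."""
    masked = (ldr & 0xFFFF_FFFF) & mask
    if masked == comp_value:
        return cr | (1 << condition_reg)   # set CR[condition_reg]
    return cr & ~(1 << condition_reg)      # clear CR[condition_reg]
```

Only the least significant 32 bits of the LDR take part in the comparison; any upper bits are ignored.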
[0192] Figure 24 shows an exemplary Masked LDR Word & Write (MLWW) command 2400, which allows the packet processor 804 to force different bits in the least significant word of the LDR to a specified value and write the resulting word to a specified address in memory. Exemplary pseudocode is provided below. Write [(LDR[31:0] & Mask) | (Value[31:0] & !Mask)] to the memory location at Address.
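The masked write pseudocode can be expressed as a runnable sketch that computes the word to be written. The function name is hypothetical, and the memory write itself is left out.

```python
def masked_ldr_word(ldr, mask, value):
    """Keep LDR bits where mask is 1; take bits from value where mask is 0."""
    low_word = ldr & 0xFFFF_FFFF
    inverted_mask = ~mask & 0xFFFF_FFFF  # bitwise NOT limited to 32 bits
    return (low_word & mask) | (value & inverted_mask)
```

For example, with a mask of 0xFFFF0000 the upper half of the LDR word is preserved and the lower half is replaced by the corresponding bits of the supplied value.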
[0193] Figure 25 is a block diagram of a multilayer IC device 2500 according to one embodiment. The IC device 2500 may represent an exemplary embodiment of the IC device 100.
[0194] The IC device 2500 includes multiple stacked dies 2502-1 to 2502-j interconnected via an inter-chip interface. The dies are interconnected through a base layer, analogous to the ground floor of a multi-story building.
[0195] The base layer, or die 2502-1, may include management infrastructure circuits (e.g., communication / interface circuits, central management circuits, and / or distributed management circuits). The upper layers, or dies 2502-2 to 2502-j, may include functional circuits (e.g., functional circuit 706 in Figure 7). One or more upper layers may include a PL fabric (e.g., PL732 in Figure 7). The topmost die 2502-j may include one or more computing engines (e.g., artificial intelligence engines, i.e., AIEs) which may be arranged as an array of computing engines, without limitation.
[0196] In the example in Figure 25, the base die 2502-1 includes distributed management circuits 2504-1 to 2504-4 uniformly distributed or arranged in rows between the VNoC column 2506-1 and the DHBI column 2508-1. The base layer 2502-1 further includes distributed management circuits 2504-5 to 2504-8 uniformly distributed or arranged in a single column between the VNoC circuit 2506-2 and the DHBI circuit 2508-2. Other embodiments may include a different number of CIMs (e.g., two columns of eight CIMs). The distributed management circuits 2504-1 to 2504-8 may represent an exemplary embodiment of the distributed management circuit 703 in Figure 7, and may each include a respective CIM circuit 704. The distributed management circuits 2504-1 to 2504-8 may each be responsible for, or otherwise associated with, a respective three-dimensional region of the dies 2502-2 to 2502-j.
[0197] The base layer 2502-1 further includes a central management circuit 2516 in a central region 2514 that streams the configuration partitions to the distributed management circuits 2504-1 to 2504-8 via NoC 2510 (e.g., NoC 716 in Figure 7).
[0198] The VNoC circuit 2506 may represent a vertical connection or an intra-die connection of NoC2510.
[0199] DHBI column 2508 may represent a general-purpose interconnect circuit that connects to a chiplet or memory (e.g., high-bandwidth memory, i.e., HBM, and / or high-capacity memory, i.e., HVM). DHBI column 2508 includes multiple interfaces for connecting to multiple chiplets.
[0200] The base layer 2502-1 further includes inter-die or inter-layer interface circuits, shown here as OHBI circuits 2512-1 to 2512-6, which provide inter-layer connections for the IC device 2500. OHBI circuits 2512 may interface between adjacent stacks of the IC device 2500. OHBI circuits 2512 may be located under PL circuits of one or more higher layers, or under dies 2502-2 to 2502-j. OHBI circuits 2512 may represent or include local control interconnects, i.e., LCI circuits. The distributed management circuit 2504-1 may be responsible for the circuits of the base die 2502-1 and any chiplets or memories (e.g., off-chip device 711 in Figure 7) connected via the DHBI column 2508-1.
[0201] The base layer 2502-1 further includes multiple instances of input / output (I / O) circuits and memory controllers, which are shown here as X5IO+MC2518-1 to 2518-5 (collectively, X5IO+MC2518). The I / O circuits may provide high-speed input / output services to their respective memory controllers and / or for other purposes, such as interface with the PL fabric of IC device 2500. Multiple instances of I / O circuits and / or memory controllers may be useful for parallel operation (e.g., to access multiple memory devices in parallel) and / or to enable multiple sources of IC device 2500 to access the same resource in series. Multiple instances of X5IO+MC2518 may be used in association with each other. For example, if an instance of X5IO+MC2518 represents a 32-bit memory controller, two instances of X5IO+MC2518 may be used in association with each other to provide a 64-bit memory controller.
[0202] One or more of the dies 2502 may contain memory (i.e., on-die memory). Alternatively, or additionally, the IC device 2500 may be configured to access external memory (e.g., external DRAM 710 in Figure 7), which may include on-board memory (i.e., the IC device 2500 and memory may be mounted on the same circuit board or integrated within the same IC package). Integrating the IC device 2500 and external memory within the IC package may reduce memory access latency. The IC device 2500 can access external memory, such as HBM, via the DHBI column 2508.
[0203] One or more of the programmable / configurable logic (PL) examples described above may include one or more of various types of configurable circuit blocks, as described below with reference to Figure 26. Figure 26 is a block diagram of a configurable circuit 2600, which includes an array of configurable or programmable circuit blocks or tiles, according to one embodiment. The example in Figure 26 may represent other IC devices that utilize configurable interconnection structures for selectively combining circuit / logic elements, such as field-programmable gate arrays (FPGAs) and / or complex programmable logic devices (CPLDs).
[0204] In the example in Figure 26, the tile includes a multi-gigabit transceiver ("MGT") 2601, a configurable logic block ("CLB") 2602, a block random access memory ("BRAM") 2603, an input / output block ("IOB") 2604, configuration and clocking logic ("Config / Clock") 2605, a digital signal processing (DSP) block 2606, a dedicated input / output block ("I / O") 2607 (e.g., a configuration port and a clock port), and other programmable logic 2608, which may, without limitation, include a digital clock manager, an analog-to-digital converter, and / or system monitoring logic. The tile further includes a dedicated processor 2610.
[0205] One or more tiles may include a programmable interconnect element (INT) 2611 having connections to input and output terminals 2620 of programmable logic elements within the same tile, and / or to one or more other tiles. A programmable INT 2611 may include connections to interconnect segments 2622 of other programmable INTs 2611 within the same tile and / or other tiles. A programmable INT 2611 may also include connections to interconnect segments 2624 of general-purpose routing resources between logic blocks (not shown). A general-purpose routing resource may include routing channels between logic blocks (not shown) containing tracks of interconnect segments (e.g., interconnect segment 2624) and switch blocks (not shown) for connecting interconnect segments. An interconnect segment of a general-purpose routing resource (e.g., interconnect segment 2624) may span one or more logic blocks. A programmable INT 2611, in combination with a general-purpose routing resource, may represent a programmable interconnect structure.
[0206] The CLB 2602 may include a configurable logic element (CLE) 2612 that can be programmed to implement user logic circuits. The CLB 2602 may also include a programmable INT 2611.
[0207] The BRAM 2603 may include a BRAM logic element (BRL) 2613 and one or more programmable INTs 2611. The number of interconnect elements included in a tile may vary depending on the height of the tile. The BRAM 2603 may have a height equivalent to, for example, five CLBs 2602. Other numbers (e.g., four) may also be used.
[0208] The DSP block 2606 may include a DSP logic element (DSPL) 2614 in addition to one or more programmable INTs 2611. The IOB 2604 may include, for example, two instances of an input/output logic element (IOL) 2615 in addition to one or more instances of the programmable INT 2611. The I/O pads connected to the IOL 2615 are not necessarily confined to the area of the IOL 2615.
[0209] In the example in Figure 26, the Config/Clock 2605 may be used for configuration, clock, and/or other control logic. The vertical column 2609 may be used to distribute the clock and/or configuration signals.
[0210] Logic blocks (e.g., programmable or fixed-function blocks) can disrupt the column structure of the configurable circuit 2600. For example, the processor 2610 spans several columns of CLBs 2602 and BRAMs 2603. The processor 2610 may include one or more of various components, ranging, without limitation, from a single microprocessor to a complete programmable processing system of microprocessors, memory controllers, and/or peripherals.
[0211] In Figure 26, the configurable circuit 2600 further includes an analog circuit 2650 which may include, without limitation, one or more analog switches 267, a multiplexer, and / or a demultiplexer. The analog switch 267 may be useful for reducing leakage current.
[0212] Figure 26 is provided for illustrative purposes only. The configurable circuit 2600 is not limited to the number of logic blocks in a row, the relative width of the rows, the number and order of the rows, the types of logic blocks included in the rows, the relative sizes of the logic blocks, the illustrated interconnections / logic implementations, or other exemplary features of Figure 26.
[0213] In the preceding description, reference is made to the embodiments presented in this disclosure. However, the scope of this disclosure is not limited to the specific embodiments described. Rather, any combination of the described features and elements, whether or not they relate to different embodiments, is contemplated to implement and practice the contemplated embodiments. Furthermore, while the embodiments disclosed herein may achieve advantages over other possible solutions or prior art, whether or not a particular advantage is achieved by a given embodiment does not limit the scope of this disclosure. Accordingly, the aforementioned aspects, features, embodiments, and advantages are merely illustrative and shall not be considered elements or limitations of the appended claims unless expressly recited in the claims.
[0214] As will be understood by those skilled in the art, the embodiments disclosed herein may be embodied as systems, methods, or computer program products. Accordingly, embodiments may take the form of entirely hardware embodiments, entirely software embodiments (including firmware, resident software, microcode, etc.), or embodiments that combine software and hardware embodiments, which may all be collectively referred to herein as “circuits,” “modules,” or “systems.” Furthermore, embodiments may take the form of computer program products embodied in one or more computer-readable media in which computer-readable program code is embodied.
[0215] Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination thereof. More specific examples (a non-exhaustive list) of computer-readable storage media include an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof. In the context of this specification, a computer-readable storage medium is any tangible medium that can contain or store a program for use by, or in connection with, an instruction execution system, apparatus, or device.
[0216] A computer-readable signal medium may include, for example, a propagating data signal in which computer-readable program code is embodied, either in the baseband or as part of a carrier wave. Such a propagating signal may take any of various forms, including but not limited to electromagnetic, optical, or any suitable combination thereof. A computer-readable signal medium may be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transfer a program for use by or in connection with an instruction execution system, apparatus, or device.
[0217] Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wireline, fiber optic cable, RF, or any suitable combination thereof.
[0218] Computer program code for performing the operations of the embodiments of this disclosure may be written in any combination of one or more programming languages, including, for example, object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the C programming language or similar programming languages. The program code may run entirely on the user's computer, partially on the user's computer as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, via the Internet using an Internet service provider).
[0219] Aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments presented herein. It will be understood that each block in the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing device to produce a machine, such that the instructions, when executed via the processor of the computer or other programmable data processing device, create means for implementing the functions/actions specified in the blocks of the flowcharts and/or block diagrams.
[0220] These computer program instructions may also be stored in a computer-readable storage medium and may direct a computer, another programmable data processing device, and/or other devices to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture containing instructions that implement the functions/actions specified in the blocks of the flowcharts and/or block diagrams.
[0221] The computer program instructions can also be loaded onto a computer, another programmable data processing device, or another device to cause a series of operational steps to be performed on the computer, other programmable device, or other device, thereby producing a computer-implemented process. Thus, the instructions executed on the computer or other programmable device provide a process for implementing the functions/actions specified in the blocks of the flowcharts and/or block diagrams.
[0222] The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of instructions containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions described in a block may occur in a different order than shown in the figure. For example, two consecutively shown blocks may actually be executed substantially simultaneously, or blocks may be executed in reverse order depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowchart illustrations, and combinations of blocks in the block diagrams and / or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs a specified function or action or a combination of dedicated hardware and computer instructions.
[0223] The technology disclosed above may be represented in the following non-limiting embodiments.
[0224] Example 1. An integrated circuit (IC) device comprising a functional circuit, a first communication channel, and a distributed management circuit, the distributed management circuit comprising a plurality of configuration interface manager (CIM) circuits configured to receive their respective programming partitions as configuration packets via the first communication channel and to provide configuration parameters to each area of the functional circuit in parallel with each other based on their respective configuration packets.
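The parallel behavior recited in Example 1 can be illustrated with a small software model. The following is a minimal sketch only, assuming each configuration packet reduces to a single (address, value) register write and using one worker thread to stand in for each CIM circuit; the names `run_cim` and `configure_in_parallel` are hypothetical and do not appear in this disclosure.

```python
from concurrent.futures import ThreadPoolExecutor


def run_cim(region, packets):
    """Model one CIM circuit applying its programming partition to one region.

    Assumption: each packet is modeled as an (address, value) register write.
    """
    registers = {}
    for address, value in packets:
        registers[address] = value
    return region, registers


def configure_in_parallel(partitions):
    """Hand each region's partition to its own worker and collect the results.

    partitions: dict mapping a region name to its list of (address, value) packets.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        futures = [
            pool.submit(run_cim, region, packets)
            for region, packets in partitions.items()
        ]
        return dict(future.result() for future in futures)
```

In this model no worker waits on another, which mirrors the idea that a region's configuration time depends on its own partition rather than on the size of the whole programming image.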
[0225] Example 2. The IC device according to Example 1, further comprising a central management circuit configured to stream configuration packets to the random access memory (RAM) packet buffers of each CIM circuit via the first communication channel.
[0226] Example 3. The IC device according to Example 2, wherein the central management circuit comprises a direct memory access (DMA) engine configured to stream configuration packets to each CIM circuit via the first communication channel.
[0227] Example 4. The IC device according to Example 2, wherein the central management circuit is further configured to configure the first communication channel and the CIM circuits via a second communication channel during the initialization phase of the IC device.
[0228] Example 5. The IC device according to Example 4, wherein the second communication channel comprises a global communication ring (GCR) interconnect circuit, the central management circuit is further configured to provide electronic fuse (eFuse) information to the CIM circuits via the GCR interconnect circuit, and the CIM circuits comprise respective GCR nodes configured to capture eFuse information from the GCR interconnect circuit and communicate with the central management circuit and one or more other GCR nodes of the IC device.
[0229] Example 6. The IC device according to Example 2, wherein each CIM circuit comprises a direct memory access (DMA) command engine configured to read configuration packets from external memory via the first communication channel and store the configuration packets in the RAM packet buffer of the respective CIM circuit.
[0230] Example 7. The IC device according to Example 1, wherein the first CIM circuit among the CIM circuits comprises a random access memory (RAM) having a packet buffer for storing configuration packets, and a packet processor configured to retrieve configuration packets from the packet buffer, extract commands from the configuration packets, and execute commands.
[0231] Example 8. The IC device according to Example 7, wherein the first CIM circuit further comprises a direct memory access (DMA) data engine configured to access a data buffer in the RAM in response to a command executed by the packet processor, and a DMA command engine configured to read a configuration packet from external memory and store the configuration packet in the packet buffer.
[0232] Example 9. The IC device according to Example 8, wherein the DMA data engine and the DMA command engine are configured to perform their respective operations in parallel with each other.
[0233] Example 10. The IC device according to Example 8, wherein the packet processor is further configured to initiate a readback operation to read state information from a portion of a first area of the functional circuit, and the DMA data engine is further configured to receive readback data from the packet processor and write the readback data to one or more of RAM and external memory.
[0234] Example 11. The IC device according to Example 10, wherein the packet processor is further configured to reconstruct a portion of a first region of the functional circuit using readback data.
[0235] Example 12. The IC device according to Example 10, wherein the readback data includes the contents of a configuration register in a first region of the functional circuit, and the first CIM circuit further includes an error detection circuit configured to check for errors in the readback data.
[0236] Example 13. The IC device according to Example 8, further comprising a central management circuit configured to provide the first CIM circuit with a first hash value of a first configuration packet in a stream of configuration packets, wherein the first CIM circuit further comprises an authentication circuit, the DMA data engine is further configured to provide the authentication circuit with a hash value contained in the header of a subsequent configuration packet in the stream of configuration packets, and the authentication circuit is configured to authenticate the first configuration packet in the stream of configuration packets based on the first hash value, and to authenticate subsequent configuration packets based on the hash value contained in the header of each preceding configuration packet.
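The chained authentication of this example, in which the first hash value is supplied separately and each packet header carries the hash value used to authenticate the next packet, can be sketched as follows. This is an illustrative model under stated assumptions: SHA-256 as the hash function, a fixed 32-byte hash field at the start of each packet, a hash computed over the whole packet, and an all-zero field in the last packet. None of these details is specified in this disclosure, and the function names are hypothetical.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 digest size; the hash algorithm is an assumption


def build_stream(payloads):
    """Return (first_hash, packets), where each packet = next-packet hash + payload."""
    packets = []
    next_hash = bytes(HASH_LEN)  # the last packet carries an all-zero placeholder
    for payload in reversed(payloads):
        packet = next_hash + payload
        packets.append(packet)
        next_hash = hashlib.sha256(packet).digest()
    packets.reverse()
    return next_hash, packets  # next_hash is now the hash of the first packet


def authenticate_stream(first_hash, packets):
    """Yield payloads in order; raise ValueError at the first packet that fails."""
    expected = first_hash
    for index, packet in enumerate(packets):
        actual = hashlib.sha256(packet).digest()
        if not hmac.compare_digest(actual, expected):
            raise ValueError(f"packet {index} failed authentication")
        expected = packet[:HASH_LEN]  # header holds the hash of the next packet
        yield packet[HASH_LEN:]
```

Because only the first hash value comes from outside the stream, a single trusted value covers the entire chain of packets in this model.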
[0237] Example 14. The IC device according to Example 7, wherein the first CIM circuit further comprises a decoding circuit, and the packet processor is further configured to retrieve a configuration packet from the packet buffer, transfer the configuration packet to the decoding circuit, and extract a command from each configuration packet after the configuration packet is decoded.
[0238] Example 15. The IC device according to Example 8, wherein the first CIM circuit further comprises a memory controller for controlling access to the RAM, and an interconnection circuit configured to interface between the first CIM circuit and the first communication channel and between the circuits of the first CIM circuit, the interconnection circuit comprising: a master interface circuit and a slave interface circuit configured to interface with the first communication channel via respective n-bit buses for receiving configuration packets from the first communication channel and outputting data to the first communication channel, where n is a positive integer; and additional master interface circuits and slave interface circuits configured to interface with the packet processor, the memory controller, the DMA data engine, the DMA command engine, and the respective area of the functional circuit via respective additional n-bit buses.
[0239] Example 16. An integrated circuit (IC) device comprising: a first IC die having a distributed management circuit, a first communication channel, and a first functional circuit; a second IC die having a second functional circuit; and a second communication channel having a chip-to-chip (C2C) communication channel configured to interface between the first communication channel and the second IC die, wherein the distributed management circuit comprises a plurality of configuration interface manager (CIM) circuits configured to receive their respective programming partitions as configuration packets via the first communication channel and to provide configuration parameters to each area of the first functional circuit in parallel with each other based on their respective configuration packets, and the first CIM circuit among the CIM circuits is further configured to receive a programming partition for the second IC die as an additional configuration packet via the first communication channel and to provide configuration parameters to the second IC die via the first communication channel and the C2C communication channel based on the additional configuration packets.
[0240] Example 17. The IC device according to Example 16, further comprising a central management circuit, wherein the first CIM circuit comprises a random access memory (RAM) having a packet buffer and a data buffer for storing configuration packets, a RAM controller for controlling access to the RAM, a packet processor configured to retrieve configuration packets from the packet buffer, extract commands from the configuration packets, and execute commands, a direct memory access (DMA) data engine configured to write configuration packets streamed from the central management circuit to the packet buffer and access the data buffer in response to commands executed by the packet processor, and a DMA command engine configured to read configuration packets from external memory and store the configuration packets in the packet buffer.
[0241] Example 18. An integrated circuit (IC) device comprising a functional circuit and a distributed management circuit comprising a plurality of configuration interface manager (CIM) circuits configured to receive their respective programming partitions as configuration packets via a communication channel, extract commands from their respective configuration packets, and execute operations related to each area of the functional circuit in parallel with each other based on the codes contained in the command fields.
[0242] Example 19. The IC device described in Example 18, wherein the operation includes a write operation, a mask and write operation, a read operation, a read and mask operation, and a compare operation.
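The five operations listed in this example can be modeled against a simple register space. The following is a rough sketch that assumes 32-bit registers and conventional mask semantics, in which only the bits set in the mask are written or examined; the class and method names are hypothetical.

```python
class RegisterSpace:
    """Toy model of a region's configuration registers (32-bit, default 0)."""

    WORD_MASK = 0xFFFFFFFF

    def __init__(self):
        self.regs = {}

    def write(self, addr, data):
        """Write operation: replace the whole register."""
        self.regs[addr] = data & self.WORD_MASK

    def mask_write(self, addr, mask, data):
        """Mask and write operation: update only the bits selected by mask."""
        old = self.regs.get(addr, 0)
        self.regs[addr] = ((old & ~mask) | (data & mask)) & self.WORD_MASK

    def read(self, addr):
        """Read operation: return the whole register."""
        return self.regs.get(addr, 0)

    def read_mask(self, addr, mask):
        """Read and mask operation: return only the bits selected by mask."""
        return self.read(addr) & mask

    def compare(self, addr, mask, expected):
        """Compare operation: check the masked register against an expected value."""
        return self.read_mask(addr, mask) == (expected & mask)
```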
[0243] Example 20. The IC device according to Example 18, wherein the command comprises an execution criteria code, the execution criteria code including a code that specifies to perform a specified operation unconditionally, to selectively perform a specified operation based on the state of a packet processor condition register, and to selectively repeat a specified read and mask operation based on the results of the read and mask operation.
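The three execution criteria in this example can be pictured as a small dispatcher. The sketch below is an assumption-laden model: the enumeration values, the choice of a nonzero result as the condition that ends repetition, and the `max_repeats` bound are all hypothetical and are not taken from this disclosure.

```python
from enum import Enum


class Criteria(Enum):
    UNCONDITIONAL = 0          # always perform the operation
    IF_CONDITION_SET = 1       # perform only if the condition register is set
    REPEAT_UNTIL_NONZERO = 2   # repeat a read-and-mask until it yields nonzero


def run_command(criteria, operation, condition_register=False, max_repeats=1000):
    """Run `operation` (a zero-argument callable) according to its criteria code."""
    if criteria is Criteria.UNCONDITIONAL:
        return operation()
    if criteria is Criteria.IF_CONDITION_SET:
        return operation() if condition_register else None
    if criteria is Criteria.REPEAT_UNTIL_NONZERO:
        for _ in range(max_repeats):
            result = operation()
            if result:
                return result
        raise TimeoutError("read-and-mask never returned a nonzero value")
    raise ValueError(f"unknown execution criteria: {criteria!r}")
```

The repeat case corresponds to polling a status bit, for example waiting for a calibration-done flag, without involving a central processor in the loop.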
[0244] Example 21. The IC device according to Example 18, wherein a first CIM circuit of the CIM circuits is further configured to selectively suspend the processing of subsequent commands until the currently executing command is completed, based on the state of the synchronization bits contained in the currently executing command.
[0245] Example 22. The IC device according to Example 18, wherein the commands include a command specifying a write operation, a first CIM circuit of the CIM circuits is further configured to perform the write operation based on a write data source code contained in the command, and the write data source code specifies one of the following: the write data is in the command, the write data is in a local data register (LDR) of a packet processor, and the write data is in a condition register and control register of the packet processor.
[0246] Example 23. The IC device according to Example 18, wherein the commands include a command to perform a read operation, and the command includes a data source code that specifies one of the following: copying data from a memory read operation to a packet processor register of a first CIM circuit of the CIM circuits, copying data from a memory read operation to a DMA data engine of the first CIM circuit, copying data from a data buffer read operation to the packet processor register, and copying data from a data buffer read operation to the DMA engine.
[0247] Example 24. The IC device described in Example 18, wherein the command includes a command for performing a read operation, and the command includes data source code specifying one of the following: write to memory, write to a data buffer specified in the command's data buffer index field, write to the packet processor's local data register, and write to the packet processor's condition register and control register.
[0248] Example 25. The IC device according to Example 18, wherein a first CIM circuit of the CIM circuits comprises a packet processor including a condition register, the packet processor being configured to parse a condition code from a command and populate the condition register with the condition code.
[0249] Example 26. The IC device according to Example 18, wherein the command includes a command for performing a read operation, and a first CIM circuit of the CIM circuits comprises a packet processor and a direct memory access (DMA) engine, the packet processor comprises a data buffer management table (DBMT) and a local data register (LDR), and the packet processor is configured to parse a data buffer index from the command, retrieve information from the DBMT based on the data buffer index, read data from the data buffer based on the information, and copy the data to the LDR or transfer the data to the DMA data engine.
[0250] Example 27. The IC device according to Example 18, wherein the command includes a command to perform a write operation, and a first CIM circuit of the CIM circuits comprises a packet processor and a direct memory access (DMA) engine, the packet processor comprises a data buffer management table (DBMT) and a local data register (LDR), and the packet processor is configured to parse a data buffer index from the command, retrieve information from the DBMT based on the data buffer index, and write data from the LDR to the data buffer based on the information.
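The data buffer lookup described in Examples 26 and 27 can be sketched as an indexed table that maps a data buffer index to a location in the CIM RAM. The layout below, with a base offset and a size per entry, is an assumption made for illustration; this disclosure does not define the contents of a DBMT entry, and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BufferEntry:
    base: int  # assumed: byte offset of the data buffer inside the CIM RAM
    size: int  # assumed: length of the data buffer in bytes


class PacketProcessor:
    def __init__(self, ram_size, dbmt):
        self.ram = bytearray(ram_size)
        self.dbmt = dbmt  # list of BufferEntry, indexed by data buffer index
        self.ldr = b""    # local data register, modeled as a byte string

    def _lookup(self, index, length):
        """Retrieve the DBMT entry and reject accesses larger than the buffer."""
        entry = self.dbmt[index]
        if length > entry.size:
            raise ValueError("access exceeds data buffer size")
        return entry

    def read_buffer_to_ldr(self, index, length):
        """Read path: DBMT lookup, then copy buffer contents into the LDR."""
        entry = self._lookup(index, length)
        self.ldr = bytes(self.ram[entry.base:entry.base + length])

    def write_ldr_to_buffer(self, index):
        """Write path: DBMT lookup, then copy the LDR into the data buffer."""
        entry = self._lookup(index, len(self.ldr))
        self.ram[entry.base:entry.base + len(self.ldr)] = self.ldr
```

Keeping the buffer location in a table means a command only needs to carry a short index rather than a full address, which is one plausible reason for the indirection.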
[0251] Example 28. The IC device according to Example 1, wherein the first communication channel includes a packet-switched network on-chip (NoC).
[0252] Example 29. The IC device according to Example 16, wherein the first communication channel comprises a packet-switched network on-chip (NoC).
[0253] Example 30. The IC device according to Example 18, wherein the communication channel includes a packet-switched network on-chip (NoC).
[0254] While the foregoing is directed to specific examples, other and further examples may be devised without departing from the basic scope thereof, and that scope is determined by the claims that follow.
Claims
1. An integrated circuit (IC) device comprising: functional circuitry; a first communication channel; and a distributed management circuit comprising a plurality of configuration interface manager (CIM) circuits configured to receive respective programming partitions as configuration packets via the first communication channel and to provide configuration parameters to respective areas of the functional circuitry in parallel with each other based on the respective configuration packets.
2. The IC device according to claim 1, further comprising a central management circuit configured to stream the configuration packets to the random access memory (RAM) packet buffers of each of the CIM circuits via the first communication channel.
3. The IC device according to claim 2, wherein the central management circuit comprises a direct memory access (DMA) engine configured to stream the configuration packets to each of the CIM circuits via the first communication channel.
4. The IC device according to claim 2, wherein the central management circuit is further configured to configure the first communication channel and the CIM circuit via a second communication channel during the initialization phase of the IC device.
5. The IC device according to claim 2, wherein the CIM circuit comprises a direct memory access (DMA) command engine configured to read the configuration packets from an external memory via the first communication channel and store the configuration packets in the RAM packet buffer of the respective CIM circuit.
6. The IC device according to claim 1, wherein a first CIM circuit among the CIM circuits comprises: a random access memory (RAM) having a packet buffer for storing the configuration packets; and a packet processor configured to retrieve the configuration packets from the packet buffer, extract commands from the configuration packets, and execute the commands.
7. The IC device according to claim 6, wherein the first CIM circuit further comprises: a direct memory access (DMA) data engine configured to access a data buffer of the RAM in response to a command executed by the packet processor; and a DMA command engine configured to read the configuration packets from external memory and store the configuration packets in the packet buffer.
8. The IC device according to claim 6, wherein the first CIM circuit further comprises a decoding circuit, and the packet processor is further configured to retrieve the configuration packets from the packet buffer, transfer the configuration packets to the decoding circuit, and, following the decoding of each configuration packet, extract commands from the configuration packets.
9. An integrated circuit (IC) device comprising: a first IC die comprising a distributed management circuit, a first communication channel, and a first functional circuit; a second IC die comprising a second functional circuit; and a second communication channel comprising a chip-to-chip (C2C) communication channel configured to interface between the first communication channel and the second IC die, wherein the distributed management circuit comprises a plurality of configuration interface manager (CIM) circuits configured to receive respective programming partitions as configuration packets via the first communication channel and to provide configuration parameters to respective areas of the first functional circuit in parallel with each other based on the respective configuration packets, and a first CIM circuit among the CIM circuits is further configured to receive a programming partition for the second IC die as additional configuration packets via the first communication channel, and to provide configuration parameters to the second IC die via the first communication channel and the C2C communication channel based on the additional configuration packets.
10. The IC device according to claim 9, further comprising a central management circuit, wherein the first CIM circuit comprises: a random access memory (RAM) comprising a packet buffer for storing the configuration packets and a data buffer; a RAM controller that controls access to the RAM; a packet processor configured to retrieve the configuration packets from the packet buffer, extract commands from the configuration packets, and execute the commands; a direct memory access (DMA) data engine configured to write the configuration packets streamed from the central management circuit to the packet buffer and to access the data buffer in response to commands executed by the packet processor; and a DMA command engine configured to read the configuration packets from external memory and store the configuration packets in the packet buffer.
11. An integrated circuit (IC) device comprising: functional circuitry; and a distributed management circuit comprising a plurality of configuration interface manager (CIM) circuits configured to receive respective programming partitions as configuration packets via the first communication channel, extract commands from the respective configuration packets, and execute operations related to respective areas of the functional circuitry in parallel with each other based on codes contained in fields of the commands.
12. The IC device according to claim 11, wherein the command includes an execution criteria code, the execution criteria code including a code that specifies: performing a specified operation unconditionally; selectively performing a specified operation based on the state of a condition register of a packet processor; and selectively repeating a specified read and mask operation based on the results of the read and mask operation.
13. The IC device according to claim 11, wherein the first CIM circuit among the CIM circuits is further configured to selectively suspend the processing of subsequent commands until the currently executing command is completed, based on the state of the synchronization bits contained in the currently executing command.
14. The IC device according to claim 11, wherein the command includes a command specifying a write operation, a first CIM circuit among the CIM circuits is further configured to execute the write operation based on a write data source code included in the command, and the write data source code specifies one of the following: the write data is contained within the command; the write data is located in a local data register (LDR) of a packet processor; and the write data is located in a condition register and control register of the packet processor.
15. The IC device according to claim 11, wherein the command includes a command for performing a read operation, and the command comprises a data source code specifying one of the following: copying data from a memory read operation to a register of a packet processor of a first CIM circuit among the CIM circuits; copying data from the memory read operation to a DMA data engine of the first CIM circuit; copying data from a data buffer read operation to the register of the packet processor; and copying data from the data buffer read operation to the DMA engine.
16. The IC device according to claim 11, wherein a first CIM circuit among the CIM circuits comprises a packet processor including a condition register, and the packet processor is configured to parse a condition code from the command and populate the condition register with the condition code.
17. The IC device according to claim 11, wherein the command includes a command to perform a read operation, a first CIM circuit among the CIM circuits comprises a packet processor and a direct memory access (DMA) engine, the packet processor comprises a data buffer management table (DBMT) and a local data register (LDR), and the packet processor is configured to parse a data buffer index from the command, retrieve information from the DBMT based on the data buffer index, read data from the data buffer based on the information, and copy the data to the LDR or transfer the data to the DMA data engine.
18. The IC device according to claim 11, wherein the command includes a command to perform a write operation, a first CIM circuit among the CIM circuits comprises a packet processor and a direct memory access (DMA) engine, the packet processor comprises a data buffer management table (DBMT) and a local data register (LDR), and the packet processor is configured to parse a data buffer index from the command, retrieve information from the DBMT based on the data buffer index, and write data from the LDR to the data buffer based on the information.
19. The IC device according to claim 1, 9, or 11, wherein the first communication channel comprises a packet-switched network on-chip (NoC).