Multilevel cache security

By introducing a coherent memory system and the MESI protocol into the memory system of an integrated circuit, the invention addresses cache coherence and scalability problems in multi-core systems, aiming for stronger data security and better memory system performance.

CN113892090B · Active · Publication Date: 2026-06-19 · TEXAS INSTRUMENTS INC

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: TEXAS INSTRUMENTS INC
Filing Date: 2020-05-26
Publication Date: 2026-06-19

Application Information

IPC: G06F12/0897

AI Tagging

Application Domain

Memory systems

Technology Topics

Program instruction; Parallel computing

AI Technical Summary

Technical Problem

As the CPUs in integrated circuits become more powerful and multiple CPUs share a memory system, the memory architecture faces scalability and data security challenges, particularly in sharing cached information and managing coherence, which increases latency and processing demands.

Method used

A coherent memory system is adopted. It stores a security code for the data in the L1 and L2 caches and performs cache coherence operations in response to that security code. The MESI coherence protocol and a shadow tag mechanism keep data synchronized between the cache levels.

Benefits of technology

It improves data security and memory system processing efficiency, reduces access latency, strengthens cache coherence management in multi-core systems, and improves system scalability and performance.

✦ Generated by Eureka AI based on patent content.

Abstract

In the described example, the coherent memory system includes a central processing unit (CPU) and level 1 and level 2 caches. The CPU is configured to execute program instructions (1000) to manipulate data in at least a first or second security context. Each of the first and second caches stores (e.g., 1050) a security code indicating the at least first or second security context through which data of a corresponding cache line is received. The level 1 and level 2 caches maintain coherence by comparing (1020) the security code of the corresponding cache line and performing a cache coherence operation (1030) in response.
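The comparison step in the abstract can be pictured with a minimal Python sketch. The abstract does not say which coherence operation follows a mismatch, so the policy used here (invalidate the line) is an assumption, as are all names (`CacheLine`, `access`, `SECURE`, `NON_SECURE`).

```python
from dataclasses import dataclass

SECURE, NON_SECURE = 1, 0  # hypothetical encodings of two security contexts


@dataclass
class CacheLine:
    address: int
    data: bytes
    security_code: int  # context through which the line's data was received
    valid: bool = True


def access(line: CacheLine, requester_code: int) -> str:
    """Compare security codes and choose a coherence operation (assumed policy)."""
    if not line.valid:
        return "miss"
    if line.security_code == requester_code:
        return "hit"
    # Assumption: a mismatch invalidates the line so that it is refetched
    # under the requester's own context. The source only states that "a cache
    # coherence operation" is performed in response to the comparison.
    line.valid = False
    return "invalidate"


line = CacheLine(address=0x1000, data=b"\x00" * 128, security_code=SECURE)
print(access(line, SECURE))      # same context
print(access(line, NON_SECURE))  # different context
print(access(line, NON_SECURE))  # the line is no longer valid
```

Storing the code next to each line lets the check run at the cache level that holds the line, without consulting a separate protection table on every access.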

Description

Background Technology

[0001] The processing device may be formed as part of an integrated circuit, such as a system-on-a-chip (SoC). In some instances, the SoC includes at least one central processing unit (CPU), wherein each CPU of the SoC is coupled to an integrated (e.g., shared) memory system. The memory system may include, for example, multi-level cache memory (e.g., static RAM—SRAM—formed on the SoC's integrated circuit) and at least one main memory (e.g., dynamic RAM—DRAM and / or DDR—memory that may be located outside the SoC's integrated circuit).

[0002] As increasingly powerful CPUs are added to (or coupled to) processing devices, increasingly complex memory architectures present scalability challenges. These challenges persist and may even grow when multiple CPUs share a common address space within a memory system. A portion of the common address space of shared memory may include levels of a coherent cache (e.g., where each level may include different memories for storing data with unique addresses).

[0003] In one example, a CPU in a cached memory system might consume an entire cache line every four cycles, which places additional processing demands on a cache designed to coherently share cached information among the various CPUs. Latency can increase further when the cache is configured to protect certain areas of cache memory from being read or modified by at least one CPU that would otherwise be allowed to access a cache line. Improving data security in such systems may require increased processing power and/or a more efficient processing architecture.

Summary of the Invention

[0004] In the described example, the coherent memory system includes a central processing unit (CPU) and level 1 and level 2 caches. The CPU is configured to execute program instructions to manipulate data in at least a first or second security context. Each of the first and second caches stores a security code indicating the at least first or second security context through which the data of a corresponding cache line is received. The level 1 and level 2 caches maintain coherence by comparing the security code of the corresponding cache line and performing a cache coherence operation in response.

Description of the Drawings

[0005] Figure 1 is a high-level system diagram showing an example dual-core scalar/vector processor formed as a system-on-a-chip.

[0006] Figure 2 is a high-level diagram showing the levels of an example hierarchical memory system.

[0007] Figure 3 shows an example single-core or optionally dual-core scalar and/or vector processor system 300 with a coherent and hierarchical memory architecture.

[0008] Figure 4 shows an example unified memory controller of the second level of a coherent and hierarchical memory architecture.

[0009] Figure 5A shows an example Level 2 memory bank interface of the Figure 4 system with 4 virtual banks per physical bank.

[0010] Figure 5B shows an example Level 2 memory bank interface of the Figure 4 system with 2 virtual banks per physical bank.

[0011] Figure 6A shows the physical structure of an example L1D (Level 1 data) controller.

[0012] Figure 6B shows the physical structure of an example Level 2 (L2) controller.

[0013] Figure 7A shows example Level 1 data (L1D) cache tag values before an example cache operation.

[0014] Figure 7B shows example Level 1 data (L1D) cache tag values after an example cache operation.

[0015] Figure 8A shows an example L2 shadow structure before an example cache operation.

[0016] Figure 8B shows the example L2 shadow structure after an L1D allocation of a line, in which a modified line moves from the main cache to the victim cache and from the victim cache to L2.

[0017] Figure 9A is a flowchart of an example coherent read operation in a multi-level cache system.

[0018] Figure 9B is a flowchart of an example snoop read operation in a multi-level cache system.

[0019] Figure 9C is a flowchart of an example CMO (cache maintenance operation) read operation in a multi-level cache system.

[0020] Figure 10 is a flowchart of an example DMA write operation in a multi-level cache system.

[0021] Figure 11 is a flowchart of an example read allocate operation in a multi-level cache system.

[0022] Figure 12 is a flowchart of an example victim write operation in a multi-level cache system.

Detailed Description

[0023] In the accompanying drawings, like reference numerals refer to like elements, and the various features are not necessarily drawn to scale.

[0024] The processing device may be formed as part of an integrated circuit, such as a system-on-a-chip (SoC). As described below, the processing device may include example security features for protecting data in a memory system, such as a multi-level cache system.

[0025] Figure 1 is a high-level system diagram showing an example dual-core scalar/vector processor formed as a system-on-a-chip. SoC 100 is an example dual-core scalar and/or vector processor that includes a central processing unit (CPU) 110 core. The CPU 110 core includes a Level 1 instruction cache (L1I) 111, a Level 1 data cache (L1D) 112, and a streaming engine (SE) 113, such as a dual streaming engine (2xSE). SoC 100 may further include an optional CPU 120 core, which includes a Level 1 instruction cache (L1I) 121, a Level 1 data cache (L1D) 122, and a streaming engine 123. In various examples, the CPU 110 core and/or the CPU 120 core may include register files, arithmetic logic units, multipliers, and program flow control units (not specifically shown) that can be arranged for scalar and/or vector processing. SoC 100 includes a Level 2 unified (e.g., combined instruction/data) cache 131 arranged to selectively cache both instructions and data.

[0026] In one example, the CPU 110, L1 instruction cache (L1I) 111, L1 data cache (L1D) 112, streaming engine 113, and L2 unified cache (L2) 131 are formed on a single integrated circuit. In another example, the scalar central processing unit (CPU) 120 core, L1 instruction cache (L1I) 121, L1 data cache (L1D) 122, streaming engine 123, and L2 unified cache (L2) 131 are formed on a single integrated circuit containing the CPU 110 core.

[0027] In this example, SoC 100 is formed on a single integrated circuit, which also includes auxiliary circuitry such as Dynamic Power Control (DPC) power-on/power-off circuitry 141, emulation/trace circuitry 142, Design for Test (DFT) programmable built-in self-test (PBIST) and Serial Message System (SMS) circuitry 143, and timing circuitry 144. A memory controller (e.g., a multi-core shared memory controller level 3, "MSMC3") 151 is coupled to SoC 100 and can be integrated with SoC 100 on the same integrated circuit. MSMC3 may include memory access functionality such as direct memory access (DMA), allowing MSMC3 to function as a DMA controller (or to work in cooperation with one).

[0028] CPU 110 operates under program control to perform data processing operations on data stored in a memory system (e.g., a memory system containing memory shared by multiple cores). The program controlling CPU 110 contains multiple instructions fetched by CPU 110 before being decoded and executed.

[0029] SoC 100 includes several cache memories. In this example, Level 1 Instruction Cache (L1I) 111 stores instructions used by CPU 110. CPU 110 accesses (including attempts to access) any of the multiple instructions from Level 1 Instruction Cache 111. Level 1 Data Cache (L1D) 112 stores data used by CPU 110. CPU 110 accesses (including attempts to access) any addressed data from Level 1 Data Cache 112 (e.g., any data pointed to by any of the multiple instructions). The Level 1 cache (e.g., L1I 111, L1D 112, and 2xSE 113) for each CPU core (e.g., 110 and 120) is supported by Level 2 Unified Cache (L2) 131.

[0030] In the event of a cache miss for any memory request in the corresponding L1 cache, the requested information (e.g., instruction code, non-stream data, and / or stream data) is retrieved from the L2 unified cache 131. If the requested information is already stored in the L2 unified cache 131, the requested information is supplied to the requesting L1 cache to relay the requested information to the CPU 110. Relaying the requested information to both the requesting cache and the CPU 110 can reduce access latency to the CPU 110.
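The miss path described above can be sketched in a few lines of Python. This is only an illustration: the caches are modeled as unbounded dictionaries, the function name `read` is hypothetical, and the fallback to `memory` on an L2 miss stands in for the memory controller behavior described separately.

```python
def read(address, l1, l2, memory):
    """Return (value, source) for a read, filling the caches on the way back."""
    if address in l1:
        return l1[address], "L1"
    if address in l2:
        source = "L2"
    else:
        # L2 miss: fetch from the next level (stand-in for memory controller 151).
        l2[address] = memory[address]
        source = "memory"
    value = l2[address]
    l1[address] = value    # supply the requesting L1 cache ...
    return value, source   # ... and relay the value to the CPU


l1, l2 = {}, {0x40: "instr"}
memory = {0x40: "instr", 0x80: "data"}
print(read(0x40, l1, l2, memory))  # served by L2, then cached in L1
print(read(0x40, l1, l2, memory))  # now an L1 hit
print(read(0x80, l1, l2, memory))  # misses both levels
```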

[0031] Streaming engines 113 and 123 may be similar in structure and operation. In SoC 100, streaming engine 113 transfers data from L2 unified cache 131 to CPU 110. Streaming engine 123 transfers data from L2 unified cache 131 to CPU 120. In this example, each streaming engine 113 and 123 controls (and otherwise manages) up to two data streams.

[0032] Each streaming engine 113 and 123 is arranged to transmit data of a defined type (e.g., a defined structure and / or protocol), wherein the data is transmitted as a stream. The stream contains a sequence of selected elements of a defined type. A program that operates on (e.g., consumes) the stream is instantiated (e.g., a processor is configured as a dedicated machine) to sequentially read the contained data and process each element of the data in turn.

[0033] In this example, the stream data has a well-defined beginning and end in time (e.g., indications in the stream can be used to determine the corresponding start and/or end points). The elements of the stream typically have a fixed size and type throughout the stream. The stream data may contain a fixed sequence of elements, so a program cannot seek randomly within the stream. In this example, the stream data is read-only while active, so a program cannot write to the stream while reading from it.

[0034] When a stream is opened by an example streaming engine, the streaming engine: calculates the address; fetches the defined data type from the L2 unified cache; performs data type manipulation; and delivers the processed data directly to the requesting programmed execution unit within the CPU. Data type manipulation may include operations such as zero extension, sign extension, and data element sorting/swapping (e.g., matrix transposition).
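The four steps and the named data type manipulations can be sketched in software as follows. The real streaming engine is hardware; the function names, the dictionary standing in for the L2 cache, and the `stride` parameter are illustrative assumptions.

```python
def zero_extend(value, from_bits):
    """Keep the low from_bits bits; the upper bits become zero."""
    return value & ((1 << from_bits) - 1)


def sign_extend(value, from_bits):
    """Interpret the low from_bits bits as a two's-complement number."""
    value &= (1 << from_bits) - 1
    sign_bit = 1 << (from_bits - 1)
    return (value ^ sign_bit) - sign_bit


def transpose(rows):
    """Swap rows and columns of a rectangular matrix of elements."""
    return [list(column) for column in zip(*rows)]


def open_stream(memory, base, count, stride, formatter):
    """Yield formatted elements in order, one per step of the stream."""
    for i in range(count):
        address = base + i * stride   # 1. calculate the address
        raw = memory[address]         # 2. fetch the element (from L2 in the text)
        yield formatter(raw)          # 3. manipulate the type, 4. deliver it


memory = {0: 0x7F, 1: 0x80, 2: 0xFF}
print(list(open_stream(memory, 0, 3, 1, lambda v: sign_extend(v, 8))))
# -> [127, -128, -1]
print(transpose([[1, 2, 3], [4, 5, 6]]))
# -> [[1, 4], [2, 5], [3, 6]]
```

The generator mirrors the sequential, read-only nature of a stream: the consumer takes elements in order and has no way to seek.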

[0035] In various examples, streaming engines are arranged to perform real-time digital filtering operations on defined data types (e.g., well-behaved data). Such engines reduce memory access time (e.g., time the requesting processor would otherwise incur), freeing the requesting processor to perform other processing functions.

[0036] In various examples, streaming engines improve the operating efficiency of the L1 cache. For example, a streaming engine can minimize the number of cache-miss stalls because the stream buffer can bypass the L1D cache (e.g., 112). Furthermore, a streaming engine can reduce the number of scalar operations that would otherwise be required to maintain the control loop and manage the corresponding address pointers. A streaming engine may include a hardware memory address generator, which reduces the software execution otherwise needed to generate addresses and manage the control loop logic (e.g., freeing the CPU to perform other tasks).

[0037] The Level 2 unified cache 131 is further coupled to higher-level memory system components via the memory controller 151. The memory controller 151 handles cache misses that occur in the L2 unified cache 131 by accessing external memory (not shown in Figure 1). The memory controller 151 is arranged to control memory-centric functions, such as cacheability determination, error detection and correction, and address translation.

[0038] The example SoC 100 system includes multiple CPUs 110 and 120. In a system that includes multiple CPUs, the memory controller 151 can be arranged to control data transfers between the CPUs and to maintain cache coherence among processors that can mutually access external memory.

[0039] Figure 2 is a high-level diagram showing the levels of an example hierarchical memory system. Memory system 200 is an example hierarchical memory system that includes a CPU 210 and controllers (e.g., 222, 232, and 241) for maintaining memory coherence across three corresponding levels of cache and memory. A Level 1 cache (e.g., L1 data cache) includes L1 SRAM (static RAM) 221, a Level 1 controller 222, an L1 cache tag 223, and a victim cache tag 224. For example, the Level 1 cache includes memory accessible by the CPU 210 and arranged to temporarily store data on behalf of the CPU 210. A Level 2 cache (e.g., L2 unified cache) includes L2 SRAM 231, a Level 2 controller 232, an L2 cache tag 233, a shadow L1 main cache tag 234, and a shadow L1 victim cache tag 235. For example, the Level 2 cache includes memory accessible by the CPU 210 and arranged to temporarily store data on behalf of the CPU 210. The memory system 200 is coherent throughout, and the memory regions at each level of the cache may include local memory (e.g., cache lines) addressable by the CPU. Table 1 shows the different memory regions present in the memory system 200 and whether each memory region can be configured as coherent.

[0040] Table 1

[0041]

[0042] CPU 210 is bidirectionally coupled to Level 1 controller 222, which in turn is bidirectionally coupled to Level 2 controller 232, which in turn is bidirectionally coupled to Level 3 controller 241, thus connecting at least three levels of the cache memory to CPU 210. Data transfers into and out of L1 SRAM 221 cache memory are controlled by Level 1 controller 222. Data transfers into and out of L2 SRAM 231 cache memory are controlled by Level 2 controller 232.

[0043] Level 1 controller 222 is coupled to (and in some examples includes) L1 cache tag 223 and victim cache tag 224. L1 cache tag 223 is the non-data portion of a corresponding L1 cache line, whose data is stored in SRAM 221 cache memory. L1 victim cache tag 224 (e.g., stored in tag RAM) is the non-data portion of a cache line, where each cache line contains a corresponding line of data stored in SRAM 221 cache memory. In an example, cache lines evicted from the L1 cache are copied to the victim cache, such that, for example, L1 cache tag 223 is copied to (or otherwise mapped to) L1 victim cache tag 224. The victim cache may, for example, store data originally evicted from the L1 level, so that a memory request from CPU 210 that hits a line stored in the victim cache can be answered without accessing the L2 cache (e.g., reducing access time in such cases).
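A minimal Python sketch of the victim (sacrificial) cache behavior described above. The class name, the capacities, and the oldest-first replacement policy are assumptions made for illustration; the text does not specify a replacement policy.

```python
from collections import OrderedDict


class L1WithVictim:
    """Toy L1 main cache backed by a small victim cache (assumed FIFO policy)."""

    def __init__(self, main_lines, victim_lines):
        self.main = OrderedDict()
        self.victim = OrderedDict()
        self.main_lines = main_lines
        self.victim_lines = victim_lines
        self.l2_accesses = 0

    def _allocate(self, address, data):
        if len(self.main) >= self.main_lines:
            # Evict the oldest main-cache line and copy it to the victim cache.
            old_address, old_data = self.main.popitem(last=False)
            self.victim[old_address] = old_data
            if len(self.victim) > self.victim_lines:
                self.victim.popitem(last=False)
        self.main[address] = data

    def read(self, address, l2):
        if address in self.main:
            return self.main[address], "main"
        if address in self.victim:
            # A victim-cache hit is answered without accessing L2.
            return self.victim[address], "victim"
        self.l2_accesses += 1
        data = l2[address]
        self._allocate(address, data)
        return data, "L2"


l2 = {0x000: "a", 0x080: "b"}
cache = L1WithVictim(main_lines=1, victim_lines=1)
print(cache.read(0x000, l2))  # first touch goes to L2
print(cache.read(0x080, l2))  # evicts line 0x000 into the victim cache
print(cache.read(0x000, l2))  # served from the victim cache
print(cache.l2_accesses)      # only two L2 accesses for three reads
```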

[0044] The Level 2 controller 232 is coupled to (e.g., includes) two sets of cache tags. The first set includes L2 cache tags 233, which are the non-data portions of corresponding L2 cache lines, where each cache line contains a corresponding line of data stored in the SRAM 231 cache memory. The second set includes a shadow L1 main cache tag 234 and a shadow L1 victim cache tag 235. The shadow L1 main cache tag 234 generally corresponds to L1 cache tag 223 (e.g., points to or contains the same information as L1 cache tag 223). The shadow L1 victim cache tag 235 generally corresponds to L1 victim cache tag 224 (e.g., points to or contains the same information as L1 victim cache tag 224). The shadow L1 main cache tag 234 contains at least the valid and dirty states of the corresponding cache line in L1 cache tag 223, while the shadow L1 victim cache tag 235 contains at least the valid and dirty states of the corresponding cache line in L1 victim cache tag 224.

[0045] Level 2 controller 232 generates snoop transactions to maintain (e.g., including updating and enforcing) read and write coherence between the Level 2 cache and the Level 1 cache. For example, Level 2 controller 232 sends a snoop transaction to the Level 1 controller to determine the state of an L1D cache line and updates the shadow tag (e.g., 234 or 235) associated with the queried L1D cache line. The shadow tags (e.g., 234 or 235) may be used only for snoop transactions that maintain coherence between the L2 SRAM and the Level 1 data cache. In some examples, updates for all cache lines in higher-level caches may be ignored, improving the efficiency of the L1-to-L2 cache interface.

[0046] In response to the snoop response data returned by Level 1 controller 222, Level 2 controller 232 updates the shadow tag (e.g., 234 or 235) corresponding to the snooped L1 cache line. Events that trigger updates include, for example, allocation of L1D cache lines and dirty and invalidate modifications to data stored in L1 SRAM 221.
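The snoop-and-update exchange between the two controllers can be sketched as follows. Only the valid and dirty bits named in the text are modeled; the class and method names are hypothetical, and a single dictionary stands in for both the main and victim shadow tags.

```python
from dataclasses import dataclass


@dataclass
class TagState:
    valid: bool = False
    dirty: bool = False


class L1Controller:
    def __init__(self):
        self.tags = {}  # line address -> TagState

    def snoop(self, address):
        """Report the current state of a line (invalid if it is not cached)."""
        state = self.tags.get(address, TagState())
        return TagState(state.valid, state.dirty)  # return a copy


class L2Controller:
    def __init__(self, l1):
        self.l1 = l1
        self.shadow_tags = {}  # L2's local copy of the L1 tag state

    def snoop_l1(self, address):
        """Send a snoop to L1 and refresh the shadow tag from the response."""
        response = self.l1.snoop(address)
        self.shadow_tags[address] = response
        return response


l1 = L1Controller()
l2 = L2Controller(l1)
l1.tags[0x80] = TagState(valid=True, dirty=True)  # CPU wrote the line in L1
l2.snoop_l1(0x80)
print(l2.shadow_tags[0x80])
```

Keeping the shadow copy at L2 means later coherence decisions about that line can be made locally, without another round trip to L1.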

[0047] Hardware cache coherence is a technique that allows data and program caches in different groups, known as "shareable domains" (e.g., shared across different CPUs, or even within a single CPU), and different requesters (including those that may not contain caches) to access (e.g., read) the latest data value at a given address in memory. Ideally, this "coherent" data value needs to be accurately reflected to every observer in the shareable domain. Observers can be, for example, caches or requesters that issue commands to read a given memory location.

[0048] Using memory attributes, some memory locations can be marked "shareable" and others "non-shareable." To maintain complete coherence in an ideal system, only shareable memory regions (e.g., a region can be one or more contiguous locations) need to be kept coherent among the caches/requesters (observers) that belong to the same shareability domain. Coherence is not required for non-shareable memory locations. Methods and apparatuses arranged to efficiently achieve coherence for shareable memory regions are described below. For example, a shareable memory region is coherent when every data location within the region holds the latest value of the data assigned to that location.

[0049] The following describes the techniques, control logic, and state information of a functionally correct coherent system. Each observer can issue read (and optionally write) requests to locations marked as shareable. Furthermore, a cache can receive snoop requests that, depending on the type of snoop operation, require its cached state to be read, returned, or even updated.

[0050] In a multi-level cache hierarchy, intermediate levels (e.g., L2) can both send and receive snoop operations (e.g., to maintain coherence between different cache levels). In contrast, the first level of a cache hierarchy (e.g., Level 1 controller 222) receives snoop operations but does not dispatch them, and the last level of a cache hierarchy (e.g., Level 3 controller 241) can dispatch snoop operations but does not receive them. In general, snoop operations are dispatched from higher cache levels to lower cache levels within the hierarchy (where lower denotes a cache structure closer to the CPU processing element, and higher denotes a cache structure farther from it).

[0051] Level 2 controller 232 includes hardware, control logic, and state information for accurately querying, determining, and processing the current state of coherent (shareable) cache lines in a Level 1 cache (e.g., L1D 112), where the lower-level cache is arranged as a heterogeneous cache system. In this example, Level 1 controller 222 manages a heterogeneous cache system that includes a main cache (e.g., set associative) and a victim cache (e.g., fully associative).

[0052] The coherence of the memory system 200 can be enforced by recording the state of each cache line in the cache using the MESI (Modified-Exclusive-Shared-Invalid) coherence scheme (including its derivatives). The standard MESI cache coherence protocol includes four states: Modified, Exclusive, Shared, and Invalid (or its derivatives) for each cache line.

[0053] A modified state indicates that the value in the corresponding cache line has been modified relative to main memory, and the value in the cache line is exclusively stored in the current cache. A modified state also indicates that the value in the line is explicitly absent or invalid in any other cache within the same shareability domain.

[0054] An exclusive state indicates that the value in the corresponding cache line has not been modified relative to main memory, but the value in the cache line is exclusively stored in the current cache. This indicates that the value in the line is explicitly absent or invalid in any other cache within the same shareability domain.

[0055] A shared state indicates that the value in the corresponding cache line has not been modified relative to main memory. A value in a cache line can exist in multiple caches within the same shareability domain.

[0056] An invalid state indicates that any value in the corresponding cache line is treated as if it does not exist in the cache (e.g., as a result of being invalidated or evicted).
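The four state definitions above imply a simple invariant for any one line across the caches of a shareability domain: a Modified or Exclusive copy must be the only valid copy. A small Python sketch of that check follows; the enum and function names are illustrative, not taken from the patent.

```python
from enum import Enum


class Mesi(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def needs_writeback(state):
    """Only a Modified line differs from main memory."""
    return state is Mesi.MODIFIED


def domain_is_coherent(states):
    """Check the states of one line across all caches of a shareability domain."""
    present = [s for s in states if s is not Mesi.INVALID]
    owners = [s for s in present if s in (Mesi.MODIFIED, Mesi.EXCLUSIVE)]
    # Modified and Exclusive both mean "no other cache holds a valid copy".
    return not owners or len(present) == 1


print(domain_is_coherent([Mesi.SHARED, Mesi.SHARED, Mesi.INVALID]))  # True
print(domain_is_coherent([Mesi.MODIFIED, Mesi.INVALID]))             # True
print(domain_is_coherent([Mesi.MODIFIED, Mesi.SHARED]))              # False
```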

[0057] A shareability domain can be defined as a set of caches that must remain coherent with each other. Not all MESI states are necessarily required to implement a coherent system with a multi-level cache hierarchy. For example, the shared state can be eliminated (e.g., at the cost of performance), resulting in an MEI coherent system. In an MEI coherent system, exactly one cache in the entire system can hold a copy of each MEI cache line at a given time, regardless of whether the cache line has been modified (or may be modified in the future).
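The single-holder rule of an MEI system, and its performance cost, can be seen in a short sketch. The bookkeeping structure (`holders`), the function name, and the write-back-on-transfer behavior are assumptions for illustration.

```python
def mei_read(holders, line, requester):
    """Give `requester` the only copy of `line`; return True if a write-back was needed.

    `holders` maps a line address to (cache_id, state), with state "M" or "E".
    A line that is absent from `holders` is Invalid everywhere.
    """
    owner = holders.get(line)
    if owner is not None and owner[0] == requester:
        return False                       # requester already holds the only copy
    # Assumption: a Modified copy is written back before ownership moves.
    writeback = owner is not None and owner[1] == "M"
    holders[line] = (requester, "E")       # the previous holder's copy becomes Invalid
    return writeback


holders = {}
mei_read(holders, 0x80, "cache_A")
holders[0x80] = ("cache_A", "M")            # cache_A modifies its copy
print(mei_read(holders, 0x80, "cache_B"))   # True: the dirty line moves to cache_B
print(holders[0x80])                        # ('cache_B', 'E')
```

Without a shared state, two caches that only read the same line still take it from each other on every access, which is the performance cost the text mentions.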

[0058] In a coherent cache system, the unit of coherence is a single cache line, so that the full length of data in a cache line (e.g., the range of addresses used to access data within the line, whether 32, 64, or 128 bytes) is treated as an atomic unit of coherence. In the example system 300 (described below with reference to Figure 3), the cache structure shared between L1D and L2 uses 128-byte coherence units. In general, the L1 and L2 cache structures and tracking mechanisms operate atomically on the selected coherence unit.
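As a worked example of the 128-byte coherence unit, the sketch below maps byte addresses to the line they belong to. The helper names are illustrative, and it assumes lines are aligned to their size.

```python
LINE_BYTES = 128  # coherence unit shared by L1D and L2 in the example system 300


def line_base(address):
    """Address of the first byte of the coherence unit containing `address`."""
    return address & ~(LINE_BYTES - 1)


def same_coherence_unit(a, b):
    """True if both byte addresses fall in the same 128-byte line."""
    return line_base(a) == line_base(b)


print(hex(line_base(0x1234)))                # 0x1200
print(same_coherence_unit(0x1200, 0x127F))   # True: first and last byte of a line
print(same_coherence_unit(0x127F, 0x1280))   # False: adjacent lines
```

Because coherence state is tracked per line, a write to any byte in 0x1200..0x127F affects the state of the whole 128-byte unit.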

[0059] To maintain cache coherence, various coherence transactions can be initiated. These coherence transactions include transaction types such as read, write, snoop, and victim. Each transaction type can have multiple forms/variants, which are defined by the bus signaling protocol (e.g., the VBUSM.C protocol specification).

[0060] A read coherence transaction returns the "current" (e.g., most recent) value at a given address, regardless of whether that value is stored at the endpoint (e.g., in external memory) or in a cache within the coherent system.

[0061] A write coherence transaction updates the current value at a given address and invalidates the copies stored in caches within the coherent system.

[0062] Cache maintenance operations (CMOs) include operations that initiate an action to be taken on a single address in the coherent caches (L1D and L2).

[0063] A snoop coherence transaction ("snoop") reads, invalidates, or both reads and invalidates a copy of data stored in a cache. A snoop is initiated by a higher-level controller of the hierarchy toward the cache at the next lower level. The lower-level cache controller can propagate the snoop further to even lower levels of the hierarchy as needed to maintain coherence.

[0064] A victim coherence transaction sends a victim cache line ("victim") from a lower-level cache in the hierarchy to the next higher level. The victim is used to transfer modified data to the next level of the hierarchy. In some cases, the victim may be propagated further to higher levels of the cache hierarchy. In an example where the L1D sends a victim to the L2 for an address in DDR or L3 SRAM and the line is not present in the L2 cache, the L2 controller forwards the victim to the next level of the cache hierarchy.
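The forwarding rule for a victim (sacrificial) line arriving at L2 can be sketched as below. The text specifies only the forward-on-miss case; updating the L2 copy and marking it dirty when the line is present is an assumption, as are the function and field names.

```python
def l2_receive_victim(l2_lines, address, data, next_level):
    """Handle a victim line sent from L1D to L2."""
    if address in l2_lines:
        # Assumption: L2 absorbs the modified data and marks its copy dirty.
        l2_lines[address] = {"data": data, "dirty": True}
        return "absorbed"
    # From the text: the line is not in L2, so forward it to the next level.
    next_level.append((address, data))
    return "forwarded"


l2_lines = {0x100: {"data": "old", "dirty": False}}
to_l3 = []
print(l2_receive_victim(l2_lines, 0x100, "new", to_l3))  # absorbed
print(l2_receive_victim(l2_lines, 0x200, "xyz", to_l3))  # forwarded
print(to_l3)                                             # [(512, 'xyz')]
```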

[0065] Table 2 describes example coherent commands that can be initiated between the L2 and the various controllers that interact with the L2 cache.

[0066] Table 2

[0067]
Master                  Master-initiated              L2-initiated
PMC                     Read                          None
MMU                     Read                          None
Streaming engine (SE)   Read, CMO                     None
DMC                     Read, Write, Victim           Snoop
MSMC (L3 controller)    Snoop, DMA read, DMA write    Read, Write, Victim

[0068] The Level 2 controller 232 maintains local information (e.g., in the Level 2 shadow register), which is updated to reflect every change in monitored state information occurring in the hardware FIFOs, RAMs, and logic within the Level 1 cache. This allows the current (e.g., most recent) state of all coherent cache lines present in both the main cache and the victim cache within the L1 controller to be determined locally at the Level 2 cache. Pipelined hardware on a dedicated bus between the Level 1 and Level 2 caches increases the speed at which the Level 2 shadow register is updated and reduces the need for a bidirectional data access bus for reading and writing data between the Level 1 and Level 2 caches. Accurately updating the shadow information maintains the correct data values and functionality of the coherent hardware cache system.

[0069] Figure 3 shows an example single-core or optionally dual-core scalar and/or vector processor system 300 with a coherent and hierarchical memory architecture. System 300 is an example coherent shared memory system, such as system 200 or SoC 100. System 300 includes at least one CPU core. For example, a first core may include a first CPU 310, a DMC 361, a 32KB L1D cache 312, a PMC 362, a 32KB L1I cache 311, and a dual stream buffer 313. An optional second core may include components similar to those of the first core. CPU 310 (and the second core 320, if present) is coupled via corresponding interfaces to a UMC 363, which is arranged to control the L2 cache tags and memory.

[0070] In general, system 300 includes various cache controllers, such as the program memory controller (PMC) 362 (e.g., for controlling data transfers to and from the Level 1 program cache 311) and the data memory controller (DMC) 361 (e.g., for controlling data transfers into and out of the Level 1 data cache 312). As shown in Figure 1, the L2 cache can be shared between the two processing cores. System 300 also includes a unified memory controller (UMC) 363 (e.g., for controlling data transfers between the L2 and L3 caches). The L2 cache includes the UMC 363 (described below, for example, with reference to Figure 4), which is coupled to an MMU (memory management unit) 391 and an MSMC 351. The DMC 361, PMC 362, SE 313, MSMC 351, and MMU 391 are requesters, all of which can access memory stored in the L2 cache.

[0071] In this example, System 300 is a pipelined cache and memory controller system for fixed-point and / or floating-point DSPs (Digital Signal Processors). System 300 includes at least one CPU core (each CPU core containing a corresponding private L1 cache, controller, and stream buffer) and a shared L2 cache controller. System 300 can provide bandwidth of up to 2048 bits of data per cycle, an eight-fold improvement over previous generation systems. The L1 cache can maintain a transfer of 512 bits of data to the CPU per cycle, and the L2 cache can transfer 1024 bits of data to the dual stream buffer per cycle. The L1 and L2 controllers are capable of queuing multiple transactions to the next higher level of memory and can reorder out-of-order data returns. The L1P 311 controller supports branch exit prediction from the CPU and can queue multiple prefetch misses to the L2 cache included in the UMC 363.

[0072] System 300 includes full soft error correction code (ECC) protection for its data and tag RAMs (e.g., as described below with reference to Figure 4). The ECC scheme used not only corrects errors in data stored in memory but also provides error correction for data transferred through processor pipelines and interface registers. System 300 supports full memory coherence, in which, for example, internal caches and memories (e.g., those included in the L1 and L2 caches) are kept coherent with each other and with external caches and memories (e.g., MSMC 351 for the L3 cache, and external memory at L4 and the final memory level). UMC 363 maintains coherence among multiple L1Ds and with each of the successively higher levels of cache and memory. UMC 363 can maintain coherence with the dual streaming engine by snooping L1D cache lines in response to streaming engine reads (e.g., via a pipeline separate from the streaming data path).

[0073] System 300 supports consistency across virtual memory schemes and includes address translation, a μTLB (micro translation lookaside buffer), L2 page table walks, and L1P cache invalidation. The UMC 363 can support one or two stream buffers, each with two streams. Stream buffer data is consistent with the L1D cache, and each stream buffer has a pipelined high-bandwidth interface to the L2 cache.

[0074] System 300 contains example interfaces between various components at different levels within System 300. In addition to the CPU-to-DMC (CPU-DMC) and CPU-to-PMC (CPU-PMC) interfaces, inter-level interfaces and data paths may be constructed according to pipelined multi-transaction standards (e.g., VBUSM or MBA).

[0075] The example interfaces include CPU-DMC, CPU-PMC, DMC-UMC, PMC-UMC, SE-UMC, UMC-MSMC, MMU-UMC, and PMC-MMU interfaces. CPU-DMC includes 512-bit vector reads, 512-bit vector writes, and 64-bit scalar writes. CPU-PMC includes 512-bit reads. DMC-UMC includes 512-bit read and 512-bit write interfaces for performing cache transactions, snooping transactions, L1D SRAM DMA, and external MMR access (e.g., each interface can handle two data phase transactions). The PMC-UMC interface includes 512-bit reads (supporting one or two data phase reads). The SE-UMC interface includes 512-bit reads (supporting one or two data phase reads). The UMC-MSMC interface includes 512-bit reads and 512-bit writes (with overlapping snooping and DMA transactions). The MMU-UMC interface includes page table lookups from L2. The PMC-MMU interface carries μTLB misses to the MMU.

[0076] L1P 311 contains a 32KB L1P cache that is 4-way set-associative with a 64-byte cache line size, where each line is virtually indexed and virtually tagged (48-bit virtual address). L1P 311 includes automatic prefetching on L1P misses (where a prefetch miss from L2 may include two data phases of data return). L1P 311 is coupled to and controlled by PMC 362 (e.g., contained by PMC 362).
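The L1P parameters above imply a specific number of sets and a specific split of the address into offset and index bits. The following is a minimal sketch of that arithmetic, assuming a conventional power-of-two layout in which the remaining upper address bits form the tag; the function name and the tag-width calculation are illustrative assumptions and are not stated in the description.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int) -> dict:
    """Derive line count, set count, and address-bit split for a set-associative cache.

    Assumes all three parameters are powers of two (an assumption of this sketch).
    """
    lines = size_bytes // line_bytes          # total cache lines
    sets = lines // ways                      # lines are grouped into sets of `ways`
    offset_bits = line_bytes.bit_length() - 1 # log2(line size): byte offset within a line
    index_bits = sets.bit_length() - 1        # log2(sets): selects the set
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
    }


# L1P figures from the text: 32KB, 4-way set-associative, 64-byte lines.
l1p = cache_geometry(32 * 1024, ways=4, line_bytes=64)

# Hypothetical tag width if the tag is simply the rest of the 48-bit virtual address.
l1p_tag_bits = 48 - l1p["index_bits"] - l1p["offset_bits"]

print(l1p, l1p_tag_bits)
```

With these inputs the cache holds 512 lines arranged as 128 sets, using 6 offset bits and 7 index bits, which would leave 35 bits of a 48-bit address for the tag under the stated assumption.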

[0077] PMC 362 supports prefetching and branch prediction capable of queuing a variable number (e.g., up to 8) of fetch grouping requests to UMC (e.g., to enable deeper prefetching in the program pipeline).

[0078] The PMC 362 includes error protection in the form of parity protection for both the data and tag RAM (e.g., 1-bit error detection for both tag and data RAM). The data RAM parity protection is provided with one parity bit per 32 bits. In the tag RAM, a parity error can force automatic invalidation and prefetch operations.
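The one-parity-bit-per-32-bits scheme can be illustrated with a short sketch. The choice of even parity and the helper names below are assumptions made for illustration; the description does not state the parity polarity or how parity is stored.

```python
def parity32(word: int) -> int:
    """Even-parity bit for one 32-bit word: 1 when the word has an odd number of set bits."""
    return bin(word & 0xFFFFFFFF).count("1") & 1


def protect(words):
    """Pair each 32-bit word with its parity bit, as it would be stored alongside the data."""
    return [(w & 0xFFFFFFFF, parity32(w)) for w in words]


def failing_words(stored):
    """Return the indices of words whose recomputed parity no longer matches the stored bit."""
    return [i for i, (w, p) in enumerate(stored) if parity32(w) != p]


line = protect([0xDEADBEEF, 0x00000001, 0xFFFFFFFF])
clean = failing_words(line)                      # no errors yet

# Inject a single-bit flip into word 1 without touching its stored parity bit.
line[1] = (line[1][0] ^ (1 << 7), line[1][1])
bad = failing_words(line)

print(clean, bad)
```

A single parity bit detects any odd number of flipped bits in its 32-bit word but cannot locate or correct them, which is consistent with the recovery-by-invalidation behavior described for the tag RAM.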

[0079] The PMC 362 supports global cache coherency operations. The PMC 362 supports single-cycle cache invalidation in three modes (e.g., full cache line, MMU page table base 0, and MMU page table base 1).

[0080] The PMC 362 supports virtual memory by performing virtual-to-physical address translation on misses and incorporates a μTLB for address translation and code protection.

[0081] The PMC 362 provides emulation and debugging capabilities by returning access codes on reads to indicate the cache level from which data is read, and by returning bus error codes to indicate the pass / fail status of all emulation reads and writes. The PMC 362 provides extended control register access, which includes the L1P ECR registers accessible from the CPU via a non-pipelined interface. These extended control registers are not memory-mapped but can be accessed via the MOVC CPU instruction.

[0082] The L1D cache 312 is a direct-mapped cache that is mirrored in parallel with a 16-entry fully associative victim cache. The L1D cache 312 contains 32KB of memory configurable down to 8KB of cache. The L1D cache 312 includes dual data paths (e.g., for 64-bit scalars or 1KB vector operands). The L1D cache 312 has a 128-byte cache line size. The L1D cache 312 includes read-allocate cache support for both write-back and write-through modes. The L1D cache 312 is physically indexed and physically tagged (44-bit physical addresses), supports speculative loads and hit-under-miss, provides posted write miss support, and provides write merging for all outstanding write transactions within the L1D. The L1D cache 312 supports FENCE operations on outstanding transactions. The L1D supports automatic flushing and flushing on idle.

[0083] L1D cache 312 contains L1D SRAM for supporting access from the CPU and DMA. The amount of available SRAM is determined by the sum of the sizes of the L1D memory and the L1D cache.

[0084] The DMC 361 includes lookup table and histogram capabilities to support 16 parallel table lookups and histograms. The DMC 361 can initialize lookup tables and dynamically configure L1D SRAM into multiple sectors / paths in response to the selected degree of parallelism.

[0085] The DMC 361 includes 64-bit and 512-bit CPU load / store bandwidth and 1024-bit L1D memory bandwidth. The DMC 361 supports 16 interfaces for 64-bit wide memory banks, with up to 8 outstanding load misses to L2. Physical and virtual memory banks are described below with respect to Figure 5A and Figure 5B.

[0086] The DMC 361 includes error detection and correction (ECC) at a 32-bit granularity. This includes full ECC for both data and tag RAM, with 1-bit error correction and 2-bit error detection for both. The DMC 361 provides ECC syndromes for writes and victims sent to L2. The DMC 361 receives ECC syndromes with read data from L2 and performs detection and correction before presenting the verified data to the CPU. The DMC 361 provides full ECC for victim cache lines. The DMC 361 provides read-modify-write support to prevent parity corruption during partial line writes. The ECC L2-L1D interface delays correction to provide ECC protection for the read-response data pipeline.
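The cost of 1-bit correction with 2-bit detection at a given granularity can be estimated with the standard Hamming bound. The sketch below assumes a conventional extended Hamming (SECDED) code; the description does not state which code the DMC 361 actually uses, so the check-bit counts are illustrative only.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a single-error-correct, double-error-detect extended Hamming code.

    A Hamming code needs the smallest r with 2**r >= data_bits + r + 1 to correct
    one error; one additional overall parity bit adds double-error detection.
    """
    if data_bits < 1:
        raise ValueError("data_bits must be positive")
    r = 1
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1  # +1 for the overall parity bit


bits_per_32 = secded_check_bits(32)
print(bits_per_32)
```

Under this assumption, each 32-bit unit would carry 7 check bits. The same function gives 10 check bits for a 256-bit unit, which is the granularity described for the L2 cache.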

[0087] The DMC 361 provides emulation and debug support by returning access codes (e.g., DAS) on reads to indicate the cache level from which data is read. Bus error codes can be returned to indicate the pass / fail status of emulation reads and writes. The contents of the cache tag RAM can be accessed via the ECR (Extended Control Register).

[0088] The DMC 361 provides atomic swap and compare-and-swap operations to cacheable memory space and provides atomic increment operations to cacheable memory space.
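The semantics of the three atomic operations named above can be sketched in software. This is only a behavioral model using a lock to stand in for hardware atomicity; the class and method names are hypothetical, and the return conventions (returning the old value from swap and compare-and-swap) are a common choice rather than something the description specifies.

```python
import threading


class AtomicCell:
    """Software model of one memory location that supports atomic operations."""

    def __init__(self, value: int = 0):
        self._value = value
        self._lock = threading.Lock()  # stands in for the hardware's atomicity guarantee

    def swap(self, new: int) -> int:
        """Store `new` and return the previous value as one indivisible step."""
        with self._lock:
            old, self._value = self._value, new
            return old

    def compare_and_swap(self, expected: int, new: int) -> int:
        """Store `new` only if the current value equals `expected`; return the old value."""
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new
            return old

    def increment(self, delta: int = 1) -> int:
        """Add `delta` and return the updated value."""
        with self._lock:
            self._value += delta
            return self._value


cell = AtomicCell(5)
old_after_swap = cell.swap(9)                  # returns 5, cell now holds 9
failed_cas = cell.compare_and_swap(5, 1)       # returns 9, no change because 9 != 5
passed_cas = cell.compare_and_swap(9, 1)       # returns 9, cell now holds 1
after_inc = cell.increment()                   # cell now holds 2
print(old_after_swap, failed_cas, passed_cas, after_inc)
```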

[0089] The DMC 361 provides coherence, including full MESI (Modified-Exclusive-Shared-Invalid) support for both the main and victim caches. The DMC 361 provides support for global cache coherence operations, including snooping and cache maintenance support from L2, snooping for L2 SRAM, MSMC SRAM, and external (DDR) addresses, and full tag-RAM comparisons for snooping and cache maintenance operations.
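For readers unfamiliar with MESI, the following sketch shows the textbook state transitions for a single cache line. It is a generic illustration of the protocol, not the DMC 361 implementation; the event names and the `shared_elsewhere` parameter are assumptions introduced for the example.

```python
from enum import Enum


class Mesi(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def next_state(state: Mesi, event: str, shared_elsewhere: bool = False):
    """Return (new_state, writeback_needed) for one cache line under textbook MESI."""
    if event == "local_read":
        if state is Mesi.INVALID:
            # A read miss allocates the line; it is Shared if another cache holds it.
            return (Mesi.SHARED if shared_elsewhere else Mesi.EXCLUSIVE, False)
        return (state, False)  # read hit: no state change
    if event == "local_write":
        # Any local write leaves the line Modified (other copies are assumed invalidated).
        return (Mesi.MODIFIED, False)
    if event == "snoop_read":
        if state is Mesi.MODIFIED:
            return (Mesi.SHARED, True)  # dirty data must be supplied or written back
        if state is Mesi.EXCLUSIVE:
            return (Mesi.SHARED, False)
        return (state, False)
    if event == "snoop_invalidate":
        # Dirty data must be written back before the line is dropped.
        return (Mesi.INVALID, state is Mesi.MODIFIED)
    raise ValueError(f"unknown event: {event}")


state, _ = next_state(Mesi.INVALID, "local_read")   # -> EXCLUSIVE
state, _ = next_state(state, "local_write")         # -> MODIFIED
state, writeback = next_state(state, "snoop_read")  # -> SHARED, write-back needed
print(state, writeback)
```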

[0090] In this example, the DMC 361 provides 48-bit wide virtual memory addressing for physical addressing of memory with 44-bit physical addresses.

[0091] The DMC 361 supports extended control register access. The L1D ECR registers can be accessed from the CPU via a non-pipeline interface. These registers are not memory-mapped, but rather mapped to the MOVC CPU instruction.

[0092] DMC supports L2 address aliases (including the VCOP address alias pattern). These aliases can be extended to multiple individual buffers, such as VCOP-IBUFAH, IBUFAL, IBUFBH, and IBUFBL buffers. L2 address aliases include out-of-range and ownership checks for all buffers to maintain privacy.

[0093] The UMC 363 controls the flow of data into and out of the L2 cache 331. The L2 cache 331 is 8-way set-associative and supports cache sizes from 64KB to 1MB. The L2 cache 331 policy includes random least recently used (LRU) and / or random replacement. The L2 cache 331 has a 128-byte cache line size. The L2 cache 331 has a write allocation policy and supports write-back and write-through modes. The L2 cache 331 performs cache invalidation for cache mode changes; this is configurable and can be disabled. The L2 cache 331 is physically indexed and physically tagged (44-bit physical addresses), containing four tag RAMs per group, allowing for four independent partition pipelines. The L2 cache 331 supports two 64-byte streams from the streaming engine, L1D, and L1P caches, and supports configuration and MDMA access on the unified interface to the MSMC 351. The L2 cache 331 caches the MMU page table.

[0094] An example L2 SRAM component of the L2 cache 331 comprises four 512-bit physical memory banks, each with four virtual memory banks. Each memory bank (e.g., physical and / or virtual memory bank) has independent access control. The L2 SRAM includes a security firewall for L2 SRAM accesses. The L2 SRAM supports DMA access over a merged MSMC interface.

[0095] The UMC 363 provides prefetch hardware and on-demand prefetching to external (DDR), MSMC SRAM, and L2 SRAM.

[0096] The L2 cache provides error detection and correction (e.g., ECC) at a 256-bit granularity. Full ECC support is provided for both the tag and data RAM, with both having 1-bit error correction and 2-bit error detection. The ECC (see, for example, ECC GEN RMW 471 described below) includes ECC syndromes for writes and victims sent to the MSMC 351, and includes read-modify-write operations for DMA / DRU writes to keep parity valid and updated. The ECC is configured to correct and / or generate multiple parity bits for data sent to the L1P 311 and SE 313 via the data path / pipeline. This includes automatic scrubbing to prevent the accumulation of 1-bit errors and to refresh parity. The ECC is cleared and parity is reset upon system reset.

[0097] The UMC 363 provides emulation and debugging support by returning access codes on reads to indicate the cache level from which data was read. Bus error codes are returned to indicate the pass / fail status of emulation reads and writes.

[0098] The UMC 363 supports full consistency between L1D 312, two streams of SE 313, L2 SRAM 331, MSMC 351 SRAM, and external memory (DDR). This includes L1D-to-shared L2 consistency, which can be maintained in response to snooping on L2 SRAM, MSMC SRAM, and external (DDR) addresses. This consistency is maintained via MESI schemes and policies. The UMC 363 includes user consistency commands from SE 313 and includes support for global consistency operations.

[0099] The UMC 363 supports extended control register access. The L1D ECR register can be accessed from the CPU via a non-pipeline interface. The contents of the ECR register can be accessed in response to the MOVC CPU instruction.

[0100] UMC 363 supports L2 address aliases (including the VCOP address alias pattern). These aliases can be extended to multiple individual buffers, such as VCOP-IBUFAH, IBUFAL, IBUFBH, and IBUFBL buffers. L2 address aliases include out-of-range and ownership checks for all buffers to maintain privacy.

[0101] The MSMC 351 allows processor module 110 to dynamically share both internal and external memory for program and data within a consistent memory hierarchy. The MSMC 351 includes internal RAM that provides programmer flexibility by allowing portions of the internal RAM to be configured as shared Level 3 RAM (SL3). The shared Level 3 RAM can be cached in a local L2 cache. The MSMC can be coupled to on-chip shared memory.

[0102] The MFENCE (Memory Fence) instruction stalls the instruction execution pipeline of CPU 310 until all processor-triggered memory transactions are completed. These memory transactions may include: cache line fills; writes from L1D to L2 or from processor modules to MSMC 351 and / or other system endpoints; victim write-backs; block or global coherence operations; cache mode changes; and outstanding XMC prefetch requests. The MFENCE instruction can be used as a simple mechanism to stall programs until dispatched memory requests reach their endpoints. It can also provide ordering guarantees for writes that arrive at a single endpoint via multiple paths (e.g., where multiprocessor algorithms depend on the ordering of data written to a particular address) and during manual coherence operations.
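The stall behavior of a memory fence can be modeled with a toy cycle counter. The sketch below assumes that outstanding transactions progress concurrently, one cycle per tick, and that the fence simply waits until none remain; the class, transaction names, and latencies are hypothetical and do not describe the actual pipeline.

```python
class MemoryPipeline:
    """Toy model of outstanding memory transactions that a fence must drain."""

    def __init__(self):
        self.outstanding = {}  # transaction name -> cycles remaining

    def issue(self, name: str, latency: int) -> None:
        """Record a transaction that completes `latency` cycles from now."""
        if latency < 1:
            raise ValueError("latency must be at least one cycle")
        self.outstanding[name] = latency

    def tick(self) -> None:
        """Advance one cycle; transactions with one cycle left complete and are removed."""
        self.outstanding = {n: c - 1 for n, c in self.outstanding.items() if c > 1}

    def mfence(self) -> int:
        """Stall until every outstanding transaction has completed; return stall cycles."""
        stalled = 0
        while self.outstanding:
            self.tick()
            stalled += 1
        return stalled


pipe = MemoryPipeline()
pipe.issue("line_fill", 3)
pipe.issue("victim_writeback", 5)
stall_cycles = pipe.mfence()
print(stall_cycles)
```

In this model the stall lasts as long as the slowest outstanding transaction (5 cycles here), and a fence issued with nothing outstanding costs nothing.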

[0103] The system memory management unit (MMU) 391 invalidates the μTLB in response to a processor context switch, for example, to maintain privacy.

[0104] Figure 4 presents an example level-two unified memory controller, illustrating a coherent and hierarchical memory architecture. System 400 is an example coherent shared memory system, such as System 300. System 400 includes at least one CPU. For example, a first core (core 0) may include CPU 410, L1D 421, SE 422, L1P 423, MSMC 461, dynamic power-down controller 463, and L2 memory 480. An optional second core (core 1) may include components similar to the first core. The first core (and the second core 412, if present) is coupled to a UMC 430 via appropriate interfaces, the UMC being arranged to control L2 cache tagging and memory.

[0105] The UMC 430 may include an L2 cache controller, a state memory 440 (which includes L2 cache tag RAM 441, L2 MESI 442, L1D shadow tag RAM 443, L1D MESI 444, and tag RAM ECC 445), a memory coherence (external, internal, global, user) 450 controller, an MSMC interface 451, an emulation 452 controller, a power-down controller 453, an extended control register (ECR) 454, a firewall 470, an ECC generator read-modify-write (ECC GEN RMW) 471, an L2 SRAM / cache arbitration and interface 472, and an ECC check 473.

[0106] Generally, with reference to Figure 3 and Figure 4, System 400 may include six requester ports (e.g., interfaces) coupled to UMC 430: a PMC 362, a DMC 361, two SE ports (contained in a streaming engine SE 313), an internal ECR 454 interface from the CPU (e.g., CPU 410), and an MSMC 461. The DMC 361 interface has separate 512-bit read and write paths. This interface can also be used for snooping from the L1D cache. Each read transaction can consist of one or two data phases. The PMC 362 interface consists of a 512-bit read-only path (L1P fetch only). Each read transaction can consist of one or two data phases. The two SE interfaces (of SE 313) are 512-bit read-only. Each read transaction can consist of one or two data phases. The read transactions also serve as part of the user block consistency function. The MSMC 461 interface consists of separate 512-bit read and write paths. The separate 512-bit read / write path interface is also used for snooping commands, read / write access to L2 SRAM, and read / write access to L1D SRAM. Each read transaction can consist of one or two data phases. The internal ECR 454 interface from each CPU of System 400 is a 64-bit non-pipelined interface and is used for configuration access to the ECR 454 registers of the UMC 430.

[0107] The UMC to DMC interface may include: a 512-bit DMC read path; a 512-bit DMC write path; DMC to UMC signals (e.g., read / write / victim address, address and security of cache lines evicted to the victim buffer, address and security of cache lines evicted from the victim buffer, two tag update interfaces for indicating clean lines evicted from the victim buffer, byte enable, read / write indicator, MMU page table attribute / privilege / security level indicator, snoop response, and L1D cache mode signals such as size, size change, global consistency, and global consistency type); and UMC to DMC signals (e.g., snoop signaling, responses to reads and writes, and other such handshake signals).

[0108] The UMC to PMC interface may include: a 512-bit PMC read path; a PMC to UMC fetch address; and other such handshake signals.

[0109] The UMC to SE interface may include: a 512-bit SE read path; an SE to UMC fetch address; an SE to UMC user block consistency indicator; and other such handshake signals.

[0110] The MSMC-UMC interface can be coupled to carry various types of transactions, such as: Master DMA (MDMA, which may include cache allocations, victims, long-distance writes, and non-cacheable reads, wherein such MDMA transactions may originate from the UMC); External Configuration Group (ECFG, which may include read / write access to memory-mapped registers that may be physically located outside the CPU core, wherein such read / write access may originate from the UMC); DMA transactions (which may originate from the MSMC and can transfer data, for example, between different CPU cores, between a CPU core and external DDR, or between a CPU core and non-DDR memory on the SOC, wherein such transactions may be created by the DMA controller and may point to L2 SRAM or L1D SRAM); snoop transactions (which may originate from the MSMC and may be generated in response to a transaction from another core, allowing the other core to snoop data from the first CPU core); and cache warming (e.g., allowing the MSMC to initiate transactions that the UMC can use to allocate lines from the L3 cache or external memory into the UMC cache).

[0111] The UMC-to-MSMC interface may include: a 512-bit MSMC read path; a 512-bit MSMC write path; MSMC-to-UMC signals (e.g., address, byte enable, read / write indicator, MMU page table attribute / privilege / security level indicator, snooping transaction, DMA transaction, and cache warm transaction); and UMC-to-MSMC signals (e.g., snoop response, address, byte enable, read / write indicator, and MMU page table attribute / privilege / security level indicator) and other such handshake signals.

[0112] System 400 may include an Extended Control Register (ECR) mapped to the MOVC CPU instruction. The UMC ECR path allows 64-bit read / write access to the UMC's control registers. For configuration reads, the UMC is configured to sample the contents of the registers and retain those contents during the access. The UMC ECR interface includes: a 64-bit ECR read path; a 64-bit ECR write path; an address; a privilege / security level indicator; an index that can be used for cache tag viewing; and other such handshake signals.

[0113] The UMC to MMU interface may include: a 64-bit read path; an address; and other such handshake signals.

[0114] The UMC to L2 interface may include: a virtual memory bank; physical memory banks of L2 memory, each containing 512-bit addressable data units; a 512-bit read data path; a 512-bit write data path; an address; a byte enable; a memory enable indicator; a read / write indicator; a virtual memory bank select; and other such handshake signals.

[0115] The UMC 430 may include a Level 2 memory 480 (e.g., SRAM). The L2 memory 480 may contain any suitable number of memory banks; four memory banks 481, 482, 483, and 484 are shown, each coupled via a corresponding set of 512-bit read / write data paths and ECC data paths. The four memory banks may be organized with four virtual memory banks each, or with two virtual memory banks each, as described below with respect to Figure 5A and Figure 5B, respectively.

[0116] Figure 5A shows an example Level 2 memory bank interface of the system of Figure 4, in which each physical memory bank has four virtual memory banks. For example, interface 500A includes physical memory banks 510 (e.g., memory bank 0), 520 (e.g., memory bank 1), 530 (e.g., memory bank 2), and 540 (e.g., memory bank 4). Each of physical memory banks 510, 520, 530, and 540 includes four virtual memory banks (virtual memory bank 0, virtual memory bank 1, virtual memory bank 2, and virtual memory bank 3). Each virtual memory bank of each physical memory bank includes a corresponding multiplexer / demultiplexer such that each corresponding virtual memory bank of a given (e.g., addressed) physical memory bank can be written to or read from in a virtual memory bank access. Each virtual memory bank in a given physical memory bank can be accessed successively (e.g., with overlapping or separate virtual memory bank accesses).

[0117] Figure 5B shows an example Level 2 memory bank interface of the system of Figure 4, in which each physical memory bank has two virtual memory banks. For example, interface 500B includes physical memory banks 510 (e.g., memory bank 0), 520 (e.g., memory bank 1), 530 (e.g., memory bank 2), and 540 (e.g., memory bank 4). Each of physical memory banks 510, 520, 530, and 540 includes two virtual memory banks (virtual memory bank 0 and virtual memory bank 1). Each virtual memory bank of each physical memory bank includes a corresponding multiplexer / demultiplexer such that each corresponding virtual memory bank of a given (e.g., addressed) physical memory bank can be written to or read from in a virtual memory bank access. Each virtual memory bank in a given physical memory bank can be accessed successively (e.g., with overlapping or separate virtual memory bank accesses).

[0118] Referring again to Figure 4, the UMC 430 includes four 512-bit wide memory ports, which may be referred to as UMC Memory Access Port (UMAP) ports. Each L2 SRAM interface (e.g., the interface from a requester to the L2 cache) can support one new access per UMC cycle when the memory banks arranged in the SRAM can respond within each UMC cycle. Access to the memory banks can be pipelined over multiple UMC cycles, allowing the use of higher-latency memory. Each of the virtual memory banks can have a different latency because each interface confirms the availability of each virtual port, rather than the availability of the entire physical memory bank.

[0119] The UMC L2 SRAM protocol accommodates memory directly connected to the UMC 430. The UMC 430 presents the address and read / write indication at the UMAP boundary and waits for a period of time (e.g., a delay) during which the L2 SRAM is "expected" to respond. The UMC 430 can independently control four banks. Accesses to these virtual banks are issued sequentially. If an attached memory has a pipeline latency greater than one cycle, then consecutive requests to the same virtual bank result in a "bank conflict." The second request is deferred until the first request completes. Consecutive requests to different virtual banks can be made without delay (e.g., when the latency of a later access to memory is no greater than twice the pipeline latency of one cycle).

[0120] The UMC 430 can read returned data after a programmed access delay (e.g., in the absence of a memory error). Two different types of delay are supported: pipeline delay and access delay. Pipeline delay is the number of cycles the UMC must wait before the same virtual memory bank can be accessed again. Access delay is the number of cycles required for the memory to present data to the UMC after a read command has been presented. In an example system, the UMC 430 supports delays of 1 to 6 cycles for both pipeline and access delays.

[0121] The latency variation between different types of SRAM can be compensated for by inserting wait states into memory accesses, where the number of wait states is selected in response to the latency of the memory being accessed. 1-cycle and 2-cycle access latencies can be referred to as "0-wait state" and "1-wait state," respectively.
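The interaction of pipeline delay, access delay, and bank conflicts described above can be sketched as a small timing model. The model assumes in-order issue, with at most one new request per cycle and a deferred request blocking the requests behind it; these scheduling details, and the function names, are assumptions of the sketch rather than statements from the description.

```python
def schedule(requests, pipeline_delay: int, access_delay: int):
    """Return (bank, issue_cycle, data_cycle) for each request, issued in order.

    `requests` is a list of virtual bank numbers, one request per entry.
    """
    if not (1 <= pipeline_delay <= 6 and 1 <= access_delay <= 6):
        raise ValueError("delays are modeled in the 1..6 cycle range given in the text")
    next_free = {}   # virtual bank -> first cycle at which it can accept a new access
    cycle = 0        # earliest cycle at which the next request may issue
    timeline = []
    for bank in requests:
        issue = max(cycle, next_free.get(bank, 0))  # a bank conflict defers the request
        next_free[bank] = issue + pipeline_delay
        timeline.append((bank, issue, issue + access_delay))
        cycle = issue + 1                            # in-order, one request per cycle
    return timeline


def wait_states(access_delay: int) -> int:
    """A 1-cycle access is a 0-wait-state access, a 2-cycle access is 1-wait-state."""
    return access_delay - 1


same_bank = schedule([0, 0, 1], pipeline_delay=2, access_delay=3)
alternating = schedule([0, 1, 0], pipeline_delay=2, access_delay=3)
print(same_bank)
print(alternating)
```

With a pipeline delay of two cycles, back-to-back requests to virtual bank 0 incur a one-cycle bank conflict, while alternating between banks 0 and 1 issues a request every cycle.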

[0122] Security is a term generally applied to protecting data in memory. Enforcing security involves: assigning permissions to specific controllers; specifying memory address ranges with certain allowed actions for specific permissions; determining whether each fetch and read / write transaction for each memory address carries sufficient privileges to access the specific address; and preventing access to the specific address by each transaction with insufficient privileges.

[0123] Permission information includes permission possibilities interpreted along various axes. For example, permission possibilities can be interpreted along the axes of privilege, hypervisor, and security level. Along the privilege axis, the possibilities are user or supervisor. Along the hypervisor axis (if applicable), the possibilities are root or guest. Along the security level axis, the possibilities are secure or non-secure. Permission possibilities are enforced across the three levels of the cache.

[0124] Various examples include at least two security states, each with a corresponding associated memory attribute for controlling physical and / or logical security components. The secure / non-secure state is an attribute of a transaction presented by the CPU to the cache controller (or otherwise associated with it). When the CPU is in a secure state (e.g., as indicated by the csecure attribute on each transaction generated by the CPU), the cache controller of each cache level allows the CPU to access both secure and non-secure memory locations. When the CPU is in a non-secure state (e.g., as indicated by the csecure attribute on each transaction generated by the CPU), the cache controller of each cache level allows the CPU to access non-secure memory locations but prevents the CPU from accessing secure memory locations. The csecure attribute may be a "security code" (e.g., where the security code contains at least one bit of a security state field and / or a digital word state indicating the security level of the process executing on the CPU). The security code may be the "secure bit" described below with respect to Figures 6A to 12.
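The access rule in this paragraph reduces to a single Boolean condition. The sketch below is a minimal illustration; the function and parameter names are hypothetical, with `csecure` modeling the transaction attribute and `location_secure` modeling whether the target memory location is secure.

```python
def access_allowed(csecure: bool, location_secure: bool) -> bool:
    """A secure transaction may access any location; a non-secure one only non-secure locations."""
    return csecure or not location_secure


# Full truth table: (csecure, location_secure) -> allowed
truth_table = {
    (cs, loc): access_allowed(cs, loc)
    for cs in (False, True)
    for loc in (False, True)
}
print(truth_table)
```

The only blocked combination is a non-secure transaction targeting a secure location.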

[0125] In this example, the L2 firewall provides security for requester-generated transactions that access L2 SRAM and for L2-generated memory transactions that access higher-level memory. The L2 firewall works in conjunction with the L3 firewall so that access permissions control transactions occurring between the L2 and L3 caches. The security firewall components reside at two interfaces: the UMC-MSMC interface (e.g., which protects CPU-initiated transactions moving to or toward external memory) and the UMC-L2 SRAM interface (e.g., to protect access to or toward L2 SRAM space).

[0126] Typically, firewalls can be configured in one of two modes: a whitelist mode (e.g., listing specified addresses to indicate which controllers / privileges are allowed access to a predetermined address range) and a blacklist mode (e.g., listing specified addresses to indicate which controllers / privileges are blocked from accessing a predetermined address range). In this example, the predetermined address range can be pre-determined before the firewall blocks or allows access to addresses within that predetermined address range.
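The two firewall modes can be contrasted with a short sketch. The rule structure, the string results, and the address range below are hypothetical and chosen only for illustration; the requester names DMC and PMC are taken from the description, and addresses not covered by any rule are simply reported as unlisted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    start: int             # first address covered by the rule
    end: int               # one past the last address covered
    requesters: frozenset  # controllers named by the rule


def firewall_decision(mode: str, rules, requester: str, addr: int) -> str:
    """Return 'allow', 'block', or 'unlisted' for one access.

    In whitelist mode a rule names who may access its range; in blacklist mode
    a rule names who may not.
    """
    if mode not in ("whitelist", "blacklist"):
        raise ValueError(f"unknown mode: {mode}")
    for rule in rules:
        if rule.start <= addr < rule.end:
            listed = requester in rule.requesters
            if mode == "whitelist":
                return "allow" if listed else "block"
            return "block" if listed else "allow"
    return "unlisted"  # no rule covers this address


rules = [Rule(0x0000, 0x1000, frozenset({"DMC"}))]
print(firewall_decision("whitelist", rules, "DMC", 0x0800))
print(firewall_decision("whitelist", rules, "PMC", 0x0800))
print(firewall_decision("blacklist", rules, "DMC", 0x0800))
print(firewall_decision("whitelist", rules, "DMC", 0x2000))
```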

[0127] To protect selected levels of a cache controlled by a firewall, permission information (e.g., protection policies for granting access to specific address blocks) can be stored in the selected levels of the cache, allowing selected areas of memory to be specifically protected by lists that grant or deny access to the corresponding areas. For blacklisted areas, the firewall is configured to block access to any cacheable memory location (e.g., any memory location with content that can be stored in the cache). In one example, programming the firewall to block access to cacheable memory locations by processes that are not explicitly whitelisted helps prevent read-only memory from being cached and then subsequently updated locally in the cache due to a cache hit by the process.

[0128] There may be address regions that are listed in neither the firewall's whitelist nor its blacklist and are therefore unprotected. Such regions (e.g., "graylisted regions") may arise when not every possible memory location is assigned a selected protection policy. Due to the limited nature of firewall configuration resources (e.g., limited memory or address processing requirements), not associating a selected protection policy with every possible memory location may be a compromise design choice.

[0129] In certain caching operations affecting data stored in graylisted areas (e.g., areas that do not intersect the union of the blacklisted and whitelisted areas listed in a firewall), the protection of firewall-protected caches can be enhanced (e.g., beyond the firewall's protection and without the additional complexity of circuitry and layout space that would otherwise be required). In an example, the security level of the process that generated the data stored in a particular cache line is stored in a tag memory (containing address tags, MESI status, and the status bits described herein) associated with that particular cache line. The stored security level can then protect data stored in the graylisted area without, for example, increasing the complexity of the firewall (e.g., to narrow the scope of the graylisted area).

[0130] For an access request from a requester that is permitted (e.g., not blocked) to access a selected cache line in a selected level of cache, the selected cache line may be selectively snooped (e.g., read from the L1 cache but kept in the L1 cache), snoop-invalidated (e.g., read from the L1 cache and removed from the L1 cache), or invalidated (e.g., removed from the cache). The action is selected in response to the security context of the access request and in response to the stored security code associated with the selected cache line, wherein the stored security code indicates the security context of the process at the time the process generated the information stored in the selected cache line. For example, selectively invalidating or evicting a selected cache line may be in response to a comparison between the security context of the access request and the security context indicated by the security code, and can be determined in response to a difference between the two.

[0131] As described below, selectively invalidating or evicting a selected cache line in response to the security context of an access request and in response to the stored security code associated with the selected cache line can reduce firewall complexity (e.g., to achieve a similar performance level), reduce the time required to flush the L1D cache (e.g., when a flush is executed to prevent malware from accessing cache contents), and improve the overall performance of the CPU / memory system containing the selected cache line.

[0132] Evicting a smaller subset of cache lines reduces the number of CPU stalls that would otherwise occur during the cache eviction process (e.g., when the security context of a memory request does not match the security context of a cache line addressed by the memory request). Not evicting data that has the same security context reduces or eliminates the latency encountered in distributing evicted cache information to memory endpoints (e.g., external memory) and the additional latency encountered when reloading evicted lines.
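The benefit of comparing security contexts can be sketched by selecting only the lines whose stored security code differs from the context of the request. The tag-entry fields, addresses, and selection rule below are a simplified illustration under that assumption and do not reproduce the actual tag layout or controller logic.

```python
from dataclasses import dataclass


@dataclass
class TagEntry:
    address: int    # line address (hypothetical values below)
    csecure: bool   # stored security code: context of the process that wrote the line
    dirty: bool     # True if the line is modified relative to the next level


def select_for_eviction(tags, request_secure: bool):
    """Return only the lines whose stored security context differs from the request's."""
    return [t for t in tags if t.csecure != request_secure]


tags = [
    TagEntry(0x1000, csecure=True, dirty=True),
    TagEntry(0x1080, csecure=False, dirty=False),
    TagEntry(0x1100, csecure=True, dirty=False),
    TagEntry(0x1180, csecure=False, dirty=True),
]

# A non-secure request: only lines written in a secure context are selected.
selected = select_for_eviction(tags, request_secure=False)
writebacks = [t.address for t in selected if t.dirty]
print(len(selected), len(tags), writebacks)
```

In this example two of four lines are evicted and only one write-back is needed, whereas flushing the whole cache would evict all four lines and write back two.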

[0133] In a write-back cache, values stored in memory locations within a cache line can be modified (dirty, e.g., modified by the CPU) relative to main memory. The modified cache line can be evicted when the memory allocated for it is determined to be needed for other memory. As cache lines containing modified values are evicted from the cache, progressively sending each evicted cache line (which contains dirty memory) to the next higher level reduces the time required to flush the L1D cache. This improves the overall performance of the memory system containing the L1D cache by reducing the number of CPU stalls that occur during cache eviction. (The evicted cache line can also be stored in a victim cache at the same level in the cache hierarchy.) In response to progressively sending the dirty cache line to higher levels of the cache, the corresponding portion of main memory is eventually updated with the modified information stored in the evicted cache line. When the corresponding portion of main memory is updated with the dirty cache line, all memory contains the modified data, making the memory system consistent again and allowing the modified data to be considered no longer dirty.

[0134] The UMC 430 (described above with respect to Figure 4) is coupled to control Level 2 memory 480 in response to firewall 470. Firewall 470 includes a dedicated whitelist firewall that can be programmed to allow / disallow access to selected L2 SRAM address areas. Each of the selected L2 SRAM address areas can be assigned a corresponding cache policy. The assigned cache policy can be, for example, a policy for a selected permission level for each type of access (e.g., memory read access or write access). Table 3 shows example cache policy assignments.

[0135] Table 3

[0136]

[0137]

[0138] As described with reference to Figure 2, for example, an instance of an L1D heterogeneous cache implementation can cache the addresses of (a number of) L2 SRAMs for each cache line in L1 (data) cache 223 and (L1D) sacrifice cache 223. Management of the L1D main and sacrifice caches and the L2 shadow copy is performed in response to a dedicated protocol / interface coupled between the L1D and L2 controllers, which allows allocation and relocation information to be passed from the L1D controller to the L2 controller. The L2 controller can respond to transactions and information from L1 and can also create and enforce snooping transactions to maintain I / O (DMA) consistency for non-cache requesters within the same shareability domain. Snooping transactions can cause the L2 controller to initiate changes to the shadow cache of the L2 cache and the main / sacrificial cache of the L1D cache.

[0139] Level 1 (e.g., L1D) controller 222 may include a program that can be selected by a programmer to initiate cache maintenance operations (CMOs) to manage cache occupancy in the L1D and L2 controllers at the granularity of individual cache lines.

[0140] In the example described herein with reference to Figure 4, CMO transactions can be issued from the streaming engine to the L2 controller (e.g., UMC 430) via directional transactions on the VBUSM.C protocol interface. The VBUSM.C protocol interface is configured to couple the SE 422 and UMC 430 together. Table 4 shows the example VBUSM.C protocol interface.

[0141] Table 4

[0142]

[0143] The VBUSM.C protocol includes an example csband signal. The csband signal is an encapsulated bus (e.g., 97 bits wide) that concatenates several sub-signals, as shown in Table 4. The csband signal is asserted to maintain consistency during certain changes in cache state (e.g., cache activities such as allocating cache lines and updating shadow information in the L2 controller).

[0144] At certain times, a software-initiated CMO may require evicting / invalidating a block of addresses (or a single address) for a specific security level (e.g., secure only vs. non-secure only). This document describes "security codes" (e.g., "security bits") that can be used to control the L2 cache to maintain fine-grained control by evicting / invalidating a smaller (e.g., minimum) subset of the L1D cache lines requested by the CMO. The need to evict / invalidate cache lines from L1D can arise in response to a change in the CPU's privilege mode level (e.g., from secure to non-secure or from non-secure to secure). Table 5 shows an example tag line of the L1D cache containing the security bit (csecure in bit 49) for each cache line in the L1D cache.

[0145] Table 5

[0146]
Tag name | Bits 63-52 | Bit 51 | Bit 50 | Bit 49 | Bits 48-13 | Bits 12-0
L1PCTAG | Reserved | Valid | Table base | CSECURE | Tag | Reserved

[0147] Table 6 shows the field descriptions of the example tag line of the L1D cache, which contains the security bit (csecure) for each cache line in the L1D cache.

[0148] Table 6

[0149]
Bits | Field | Description
12-0 | Reserved | Reads return 0
48-13 | Tag | Tag of the cache line
49 | CSECURE | Security bit of the cache line
50 | Table base | Privilege bit of the cache line
51 | Valid | Line exists in the cache
63-52 | Reserved | Reads return 0
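The bit layout in Tables 5 and 6 can be illustrated with a minimal Python sketch. The bit positions (tag in bits 48-13, security bit in bit 49, privilege bit in bit 50, valid bit in bit 51) are taken from the tables; the function names `pack_tag` and `unpack_tag` and the dictionary keys are hypothetical and are not part of the described hardware.

```python
# Bit positions follow Tables 5 and 6; all names here are illustrative only.
TAG_SHIFT = 13          # tag occupies bits 48-13
TAG_BITS = 36           # 48 - 13 + 1
CSECURE_BIT = 49        # security bit of the cache line
PRIV_BIT = 50           # bit 50, described in Table 6 as a privilege bit
VALID_BIT = 51          # line exists in the cache


def pack_tag(tag, csecure, priv, valid):
    """Build a 64-bit tag word; reserved bits (12-0, 63-52) stay 0."""
    if not 0 <= tag < (1 << TAG_BITS):
        raise ValueError("tag does not fit in bits 48-13")
    return ((tag << TAG_SHIFT)
            | (int(bool(csecure)) << CSECURE_BIT)
            | (int(bool(priv)) << PRIV_BIT)
            | (int(bool(valid)) << VALID_BIT))


def unpack_tag(word):
    """Split a tag word back into its fields."""
    return {
        "tag": (word >> TAG_SHIFT) & ((1 << TAG_BITS) - 1),
        "csecure": (word >> CSECURE_BIT) & 1,
        "priv": (word >> PRIV_BIT) & 1,
        "valid": (word >> VALID_BIT) & 1,
    }
```

Because the security bit is stored in the tag word itself, two tag words with the same tag field but different `csecure` values identify different cache lines.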

[0150] In response to determining the status of the corresponding security code for each cache line, a selected portion of the cache to be evicted or invalidated (e.g., a subset of L1D cache lines) is determined. Selecting a subset of the cache to be evicted (e.g., rather than evicting all lines of the cache) reduces the time required to flush the L1D cache, which improves the overall performance of a memory system containing the L1D cache by reducing the number of CPU pauses that occur during cache eviction. Table 6 shows the tag line of the L1D cache, which contains the security code bit used to determine the security status of the corresponding line.

[0151] The `calloc` signal is asserted to initiate a read command from the L1D to read an L2 cache line. An assertion of `calloc` (e.g., `calloc == 1`) indicates that a given cache line (`caddress` + `csecure`) is being allocated by the L1D main cache. When `calloc` is asserted (e.g., `calloc == 1`), csband information is used to update the L1D shadow information in the L2 controller. When `calloc` is not asserted (e.g., `calloc == 0`), the validity bits (`cmain_valid` and `cvictim_valid`) of the addressed cache line are set to 0, such that (e.g.) the L1D cache line is not changed when the `calloc` signal is not asserted.

[0152] Typically, two requesters cannot read the same cache line at the same time (e.g., when transferring from the main cache to the sacrificial cache and when transferring out of the sacrificial cache) (e.g., where the cache line is uniquely identified by the address and status of the security code). To help avoid this conflict, the values of the cvictim_address and cvictim_secure (security bits of the L1D sacrificial cache line) signals can be prevented from precisely matching the corresponding values of the cmain_address and cmain_secure signals during the period when the calloc signal is asserted (calloc == 1) and the valid bit of the addressed cache line is set (e.g., when cmain_valid == 1 and cvictim_valid == 1).

[0153] Snooping and DMA transactions initiated by the L3 controller operate similarly to CMO transactions issued by the streaming engine. For example, such snooping and DMA transactions contain a security code that indicates the security level of the process that initiated the request.

[0154] Consistent read transactions issued from the MMU or streaming engine operate similarly to CMO transactions issued by the streaming engine. For example, a consistent read transaction includes a security code indicating the security level of the consistent read request.

[0155] In various instances, an L2 controller (e.g., L2 cache controller 431) is configured to receive from a requester an access request indicating a selected cache line. The L2 controller is configured to compare the security code of the received access request with a stored security code associated with the security context of a previous access request that writes current information into the selected cache line. In response to the comparison, the selected cache line can be selectively invalidated or evicted, such that a subset (e.g., a set smaller than the entire set) of the selected cache lines is invalidated or evicted in response to a change in the requester's security level (e.g., as indicated by the security code).

[0156] The L2 controller is coupled to a Level 2 data cache, which is stored in an L2 SRAM physical structure. The L2 SRAM is a monolithic endpoint RAM arranged to store no cache line, one cache line, or two cache lines for an address indicated by an access request from a requester. In various instances, the number of cache lines for a single cacheable address that can be stored in the L2 SRAM is equal to the number of security levels that can be indicated by the security code of the received access request. In an instance, the security code is a bit (e.g., a "security bit") that allows data for a given cacheable address to be stored in a first cache line associated with a first possible value of the security code (e.g., when the security bit is 0), and allows data for the same cacheable address to be stored in a second cache line associated with a second possible value of the security code (e.g., when the security bit is 1).
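As a rough illustration of how one cacheable address can map to zero, one, or two cache lines, the following Python sketch keys each stored line on the pair (address, security bit). The class `SecureKeyedStore` and its methods are hypothetical names used only for this sketch; capacity, replacement, and MESI state are deliberately left out.

```python
class SecureKeyedStore:
    """Toy model: at most one line per (address, security bit) pair."""

    def __init__(self):
        self._lines = {}

    def write(self, address, secure, data):
        # The security bit is part of the key, so secure and non-secure
        # data for the same address never overwrite each other.
        self._lines[(address, secure)] = data

    def read(self, address, secure):
        # Returns None when no line exists for this address at this level.
        return self._lines.get((address, secure))

    def lines_for(self, address):
        # With a one-bit security code this is 0, 1, or 2.
        return sum(1 for (addr, _secure) in self._lines if addr == address)
```

With a one-bit security code, `lines_for` can never exceed two, which matches the statement that the number of lines per address equals the number of security levels the code can express.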

[0157] Consistency is maintained by including a field (e.g., a bit field) containing a security code (e.g., a security bit) in each of the L1D tag, L2 tag, and L2 shadow tag. When an access request causes information to be written to a cache line of any of the L1D tag, L2 tag, and L2 shadow tag, the security code (e.g., the security bit contained in the access request) of the access request is further propagated to other caches that contain (or will contain) the information of the cache line indicated by the access request.

[0158] Access requests contain security codes that indicate the security level of the security context of the requester initiating the access request. As described below, security codes (e.g., security bits) may be included in L1D tags, CMO or snoop transactions, MMU or SE read transactions, and DMA read / write transactions. L2 snoop transactions to L1D contain the security code of the CMO / snoop / read / DMA transaction request that initiated them.

[0159] When the L2 controller processes a transaction that needs to look up in a shadow copy of the L1D primary or sacrificial cache tag, the L2 controller evaluates the security code of the cache line addressed by the transaction being processed to determine a "hit" or "miss" (e.g., by accessing the L1D cache line). For example, a hit for an incoming transaction is determined by: 1) detecting a match between the stored security code of the addressed cache line in the shadow tag and the security code of the incoming transaction; and 2) detecting a match between the address of the cache line in the shadow tag and the address of the cache line of the incoming transaction. In the above example, a miss for an incoming transaction is determined by: 1) not detecting a match between the stored security code of the addressed cache line in the shadow tag and the security code of the incoming transaction; or 2) not detecting a match between the address of the cache line in the shadow tag and the address of the cache line of the incoming transaction.
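The two-part hit test described above can be sketched in Python as follows. The `ShadowTag` tuple and the `shadow_lookup` function are hypothetical names for this sketch; validity and MESI checks that a real controller would also apply are omitted so that only the address and security code comparison is shown.

```python
from typing import Iterable, NamedTuple, Optional


class ShadowTag(NamedTuple):
    address: int   # cache line address held in the shadow tag
    secure: int    # stored security code (here a single bit)


def shadow_lookup(shadow_tags: Iterable[ShadowTag],
                  address: int, secure: int) -> Optional[ShadowTag]:
    """Return the matching shadow entry on a hit, or None on a miss.

    A hit requires BOTH an address match and a security code match;
    failing either comparison is a miss.
    """
    for entry in shadow_tags:
        if entry.address == address and entry.secure == secure:
            return entry
    return None
```

For example, a transaction for an address that is present only with the opposite security bit is treated as a miss, so no snoop to the L1D is generated for it.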

[0160] To help ensure that the L1D accurately performs its own hit / miss detection on subsequent snooping transactions processed by the L1D, a security code associated with the most recent cache line hit from the L2 controller can be transmitted to the L1D controller. The security code associated with the most recent cache line hit from the L2 controller can be transmitted to the L1D controller via a snooping transaction initiated by the L2 controller (via VBUSM.C bus interface protocol signaling) in response to the most recent cache line hit (e.g., including a hit / miss detection in response to the security code status).

[0161] In contrast, some comparable solutions lack a security code in the cache tag that indicates the security level of the requester context under which the cache line was tagged. Failing to retain that security level can lead to serious security control failures (e.g., because, without a distinction between secure and non-secure requester contexts in the cache tag, an access request could potentially be processed at a security level different from the security level of the requester context under which the cache line was tagged).

[0162] For example, the distinction between secure and non-secure contexts in the cache tag enables fine-grained eviction / invalidation of cache lines stored in a first context without affecting the cache performance of cache lines stored in a context different from the first context. In instances where non-secure cache lines are invalidated via a CMO operation, secure lines can remain in the cache, resulting in improved cache performance for cache lines stored in the secure software context. For example, this improvement can occur where cache lines stored in the non-secure software context and cache lines stored in the secure software context share the same tag address in the same cache.
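A minimal Python sketch of this selective invalidation is given below. It assumes, purely for illustration, that the cache is modeled as a dictionary keyed on (address, security bit); the function name `invalidate_by_security` is hypothetical.

```python
def invalidate_by_security(cache, secure):
    """Invalidate only the lines whose security bit equals `secure`.

    `cache` is a dict keyed on (address, security bit). Lines stored in
    the other security context are left untouched. Returns the keys that
    were invalidated.
    """
    victims = [key for key in cache if key[1] == secure]
    for key in victims:
        del cache[key]
    return victims
```

When a secure line and a non-secure line share the same address, invalidating the non-secure context removes only the non-secure line, so the secure line continues to hit.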

[0163] The efficiency of the L2 controller in accurately performing consistent snooping operations on the L1D can be improved by performing those operations on only the subset of L1D cache lines whose cached address and security level both match those of the access request. The selection of which consistent snooping operations to initiate on the L1D can be determined in response to evaluating the security level of the software context indicated by the transaction's security code (e.g., the state of the security bit), where the state of the security bit is held in a cache tag stored in the L1D (primary or sacrificial) cache and also in a shadow copy of the L1D cache tag stored / maintained in the L2 cache.

[0164] Figure 6A shows the physical structure of an example L1D (Level 1 data) controller. For example, a Level 1 data controller 600A includes a main cache tag 601 and a sacrificial cache tag 602. The main cache tag 601 is configured to track (e.g., for a given main cache line) an address tag, the MESI state, and the security level (e.g., indicated by a security code) of the process that last modified the data in the given cache line. The sacrificial cache tag 602 is configured to track (e.g., for a given sacrificial cache line) an address tag, the MESI state, and the security level (e.g., indicated by a security code) of the process that last modified the data in the given cache line.

[0165] L1D main cache 601 is a direct-mapped cache that serves read and write hits and snoops. L1D main cache 601 maintains a current MESI state that can be modified a) in response to read, write, and snooping accesses and b) in response to security codes (e.g., security bits). L1D main cache 601 is a read-allocate cache. Write accesses from the CPU that miss the cache are sent (e.g., forwarded) to L2 without allocating a cache line in L1D main cache 601. Because the L1D cache is direct-mapped, when a new allocation occurs, the current line in the set is moved (e.g., evicted) to the sacrificial cache 602, regardless of whether the current line in the set is clean or dirty.

[0166] L1D Sacrifice Cache 602 is a fully associative structure that stores lines removed from main cache 601 due to replacement (e.g., in response to a write from the CPU). L1D Sacrifice Cache 602 stores both clean and dirty lines. L1D Sacrifice Cache 602 serves read and write hits and snoops (e.g., received from the CPU) while maintaining the correct MESI state in response to a hit to L1D Sacrifice Cache 602 (e.g., when a cache line contains an address and security code that match the address and security code of a read, write, or snooping access). When a line in the modified state (e.g., dirty) is removed (e.g., evicted) from the sacrifice cache, it is sent as a sacrifice to the L2 main cache (as described below with reference to Figure 6B).

[0167] Figure 6B shows the physical structure of an example Level 2 (L2) controller. For example, a Level 2 controller 600B includes a main cache tag 610 and a sacrificial cache tag 620. The main cache tag 610 is configured to track (e.g., for a given main cache line) an address tag, the MESI state, and the security level (e.g., indicated by a security code) of the process that last modified the data in the given cache line. The sacrificial cache tag 620 is configured to track (e.g., for a given sacrificial cache line) an address tag, the MESI state, and the security level (e.g., indicated by a security code) of the process that last modified the data in the given cache line. The sacrificial cache 620 contains floating entries containing cache tag information for entries addressed by the same "path".

[0168] The L2 cache is a unified cache arranged to serve requests from multiple requesters of various types. Requester types can include, for example, L1D data memory controllers (L1D DMC), L1P program memory controllers (PMC), streaming engines (SE), MMUs (Memory Management Units), and L3 MSMCs (Multi-core Shared Memory Controllers).

[0169] The L2 cache is non-inclusive of L1D and L1P, so L2 is not required to contain all cache lines stored in the L1D and L1P caches. In this scheme, some lines can be cached in both levels of the hierarchy. The L2 cache is also non-exclusive, meaning that nothing explicitly prevents a cache line from being cached in both the L1 and L2 caches at the same time. In example operations involving the allocation and random replacement of cache lines, a cache line may exist in one of the L1D and L2 caches, in both, or in neither. Similarly, a cache line can be stored in both the L1P and L2 caches at the same time.

[0170] Figure 7A shows example Level 1 data (L1D) cache tag values prior to an example cache operation. For example, a Level 1 data controller 700A includes a main cache tag 710A, a sacrifice cache tag 720A, and a temporary sacrifice hold buffer 730A. The main cache tag 710A is configured to track (e.g., for a corresponding main cache line) the address tag, the MESI state, and the security level "S" associated with the security context of the process initiating the cache line. The column for "S" in the main cache tag 710A (and in other cache tags having a security code memory for storing security level S) is an example of a Level 1 cache security code list. The sacrifice cache tag 720A is configured to track the address tag, the MESI state, and the security level of the process containing cache tag information of the evicted entry through its corresponding entry (e.g., such that the sacrifice cache can be loaded with the sacrifice cache line without waiting for the evicted cache line to be sent to a higher cache level).

[0171] The example state of the L1D data structures in the Level 1 data controller 700A is shown as it was before the L1D controller allocates line C. In this example, cache line A is stored in the selected line of the main cache tag 710A in the modified state (MESI "M"), together with the security code S (e.g., a security bit of 1 or 0) of the process that initiated cache line A. At the same time, the selected path of the sacrifice cache tag 720A holds cache line B in the modified state, together with the security code S of the process that initiated cache line B. The L1D temporary sacrifice hold buffer is empty.

[0172] As described below, receiving an L1D cache line allocation access command causes a modified cache line of the main cache tag 710A to be transferred to the sacrificial cache tag 720A, and causes the cache line evicted from the sacrificial cache tag 720A (e.g., evicted to make room for the modified cache line transferred from the main cache tag 710A) to be transferred to the L1D temporary sacrificial hold buffer (e.g., to be eventually sent to the L2 cache).

[0173] Figure 7B shows example Level 1 data (L1D) cache tag values following an example cache operation. For example, a Level 1 data controller 700B includes a main cache tag 710B, a sacrificial cache tag 720B, and a temporary sacrificial hold buffer 730B. In this example, the main cache tag 710B, the sacrificial cache tag 720B, and the temporary sacrificial hold buffer 730B show the values of the corresponding L1D data structures after the L1D controller allocates line C.

[0174] In an example cache operation, the L1D cache allocates a new line (e.g., cache line C) at address C in the main cache tag 710B. This initiates a transfer of cache line A (e.g., from main cache tag 710A) to the corresponding path of the sacrifice cache tag 720B. In response to the transfer of cache line A to the corresponding path of the sacrifice cache tag 720B, cache line B is transferred from the corresponding path of the sacrifice cache to the L1D temporary sacrifice hold buffer 730B. Cache line B is stored in the L1D temporary sacrifice hold buffer 730B, awaiting subsequent transfer of line B to the L2 cache.
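The chain of transfers in this example (line C allocated, line A moved to the sacrifice cache, line B moved to the hold buffer) can be sketched in Python as follows. The model tracks a single direct-mapped set and a single sacrifice cache path; the names `Line`, `L1DModel`, and `allocate` are hypothetical, and the choice to forward only modified lines to the hold buffer is an assumption based on the earlier statement that modified lines removed from the sacrifice cache are sent to L2.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Line:
    name: str     # stands in for the address tag
    mesi: str     # "M", "E", "S", or "I"
    secure: int   # security bit of the process that initiated the line


@dataclass
class L1DModel:
    main: Optional[Line] = None      # one direct-mapped main cache set
    victim: Optional[Line] = None    # one selected sacrifice cache path
    hold_buffer: List[Line] = field(default_factory=list)

    def allocate(self, new_line: Line) -> None:
        """Allocate new_line in the main cache, cascading displaced lines."""
        if self.main is not None:
            # The displaced sacrifice line goes to the hold buffer
            # (assumption: only if modified; clean lines are dropped).
            if self.victim is not None and self.victim.mesi == "M":
                self.hold_buffer.append(self.victim)
            # The current main line moves to the sacrifice cache,
            # whether it is clean or dirty.
            self.victim = self.main
        self.main = new_line
```

Starting from the Figure 7A state (A in the main cache, B in the sacrifice cache, empty hold buffer), one call to `allocate` for line C yields the Figure 7B state: C in the main cache, A in the sacrifice cache, and B waiting in the hold buffer.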

[0175] Figure 8A shows an example L2 shadow structure prior to an example cache operation. For example, the Level 2 data controller 800A includes an L2 shadow primary cache tag 810A and an L2 shadow tag sacrifice cache 820A. The L2 data controller 800A maintains a shadow copy of the address tag, MESI status information, and security information for each cache line in the L1D primary cache (e.g., in primary cache tag 710A, and subsequently modified in 710B). The L2 shadow primary cache tag 810A, which shadows the primary cache tag 710A, allows the L2 controller to correctly track each line cached in the L1D primary cache, enabling the L2 controller to correctly (and quickly, e.g., without polling all L1D primary cache entries) determine when to send a snooping transaction that either a) reads or b) invalidates a single cache line in L1D.

[0176] The L2 data controller 800A also maintains a shadow copy of the address tag and MESI status information for each cache line stored in the L1D sacrificial cache (e.g., in sacrificial cache tag 820). The L2 shadow tag of the sacrificial cache entry in sacrificial cache 820A (e.g., in sacrificial cache tag 720A, and subsequently modified in 720B) allows the L2 controller to accurately track the cached primary cache line in the L1D, enabling the L2 controller to accurately determine when to send a snooping transaction to the L1D controller.

[0177] Maintaining L1D cache tags (e.g., L1 main cache tag 710A and L1 sacrifice cache tag 720A) as L2 shadow tags reduces inter-level cache access latency that would otherwise be longer (e.g., without shadow tags). If the shadow tags are not maintained in L2, the L2 controller will be forced to snoop on L1D for every request that might be held in the L1D main or sacrifice cache, which will significantly degrade interface performance due to the large increase in the resulting snooping bandwidth.

[0178] The example state of the L2 shadow copies of the L1D data structures in the Level 2 data controller 800A is shown as it was before an example cache operation (e.g., the L1D controller's allocation of line C). In this example, the selected line A (previously copied from the main cache tag 710A) is stored in the L2 shadow main cache tag 810A in the modified state (MESI "M"), together with the security code (e.g., a security bit of 1 or 0) of the process that initiated cache line A. At the same time, the selected path of the L2 shadow tag sacrifice cache 820A (as previously copied from the sacrifice cache tag 720B) holds cache line B in the modified state, together with the security code of the process that initiated cache line B. The floating entries of the L2 shadow tag sacrifice cache 820A are empty, reflecting the empty state of the L1D temporary sacrifice hold buffer (as shadowed in the L2 shadow tag sacrifice cache 820A). The column for "S" in the main cache tag 810A, and the columns for "S" in the other L2 cache tags that contain security code memory for storing the security level "S", are each examples of a corresponding L2 cache security code list.

[0179] As described below, when an L1D cache line allocation access command is received, the modified cache line of the main cache tag 710A is transferred to the sacrificial cache tag 720A in response, and the cache line evicted from the sacrificial cache tag 720A is transferred to the L1D temporary sacrificial hold buffer.

[0180] Figure 8B shows an example L2 shadow structure following the L1D allocation of a line, where the modified line is moved from the primary cache to the sacrificial cache, and then from the sacrificial cache to L2. For example, the Level 2 data controller 800B includes a primary cache tag 810B and an L2 shadow tag for the sacrificial cache 820B. In this example, the primary cache tag 810B and the L2 shadow tag for the sacrificial cache 820B represent the values of the corresponding L1D data structures transmitted to the Level 2 data controller 800A after the L1D controller allocates line C.

[0181] In an example cache operation, the L1D cache allocates a new line (e.g., cache line C) at address C in the main cache tag 710B, which initiates a transfer of cache line A (e.g., from main cache tag 710A) to the corresponding path in the sacrifice cache tag 720B. In response to the allocation of the new line (e.g., cache line C) at address C in the main cache tag 710B, the line at address C in the main cache tag 710B (e.g., cache line C) is allocated (and / or copied) to the corresponding line (e.g., cache line C) at address C in the main cache tag 810B.

[0182] In response to the transfer of cache line A to the corresponding path of the sacrificial cache tag 720B, cache line B is transferred from the corresponding path of the sacrificial cache to the L1D temporary sacrificial hold buffer 730B. Cache line B is stored in the L1D temporary sacrificial hold buffer 730B, awaiting subsequent transfer of line B to the L2 cache (e.g., when the sacrificial write operation is granted access to the memory endpoint (e.g., external memory) that it updates).

[0183] Figure 9A is a flowchart of an example process for a consistent read operation in a multi-level cache system. A consistent read operation is an example of a cache consistency operation. Process 900A is an example process initiated by an MMU read operation, SE read operation, or DMA read operation that is issued as a consistent read operation. Process 900A is initiated at operation 910A.

[0184] At operation 910A, a consistent read operation is generated by the MMU, SE, or DMA controller and sent to the L2 controller (e.g., UMC 430).

[0185] At operation 920A, the L2 controller (e.g., UMC 430) receives a consistent read operation generated by the MMU, SE, or DMA controller. The L2 controller is configured to determine whether the received consistent read operation results in both an L2 shadow tag hit and a security hit (e.g., a security code match). An L2 shadow tag hit occurs in response to a match between the consistent read address of the received consistent read operation and an address tagged in an L2 shadow tag of either the L2 shadow main cache or the L2 shadow sacrifice cache. A security hit occurs in response to a match between the security code of the received consistent read operation and the security code stored in the cache line hit by the received consistent read operation. In response to the determination that the received consistent read operation results in both an L2 shadow tag hit and a security hit, process 900A continues at operation 930A. In response to the determination that the received consistent read operation does not result in both an L2 shadow tag hit and a security hit, process 900A continues at operation 922A.

[0186] At operation 922A, the L2 controller, in response to (e.g., to implement) a received consistent read operation, locally generates a consistent read command and sends the locally generated consistent read command to a memory endpoint (e.g., so that the memory endpoint can return the requested consistent read data to the requester that generated and sent the consistent read operation received by the L2 controller). The endpoint may be an L2 cache, external memory, or any other endpoint.

[0187] At operation 930A, the L2 controller locally generates a snoop read request in response to the determination that the received consistent read operation causes both an L2 shadow tag hit and a security hit. The L2 controller sends the snoop read request to a lower-level cache (e.g., L1D) so that, for example, the L2 cache can be kept consistent with the lower-level cache.

[0188] At operation 940A, the L2 controller determines whether the snooping response (e.g., generated and sent by a lower-level cache controller in response to a snooping read request sent by the L2 controller) indicates that the snooped cache line contains valid data. In response to the determination that the snooping response contains valid data, process 900A continues at operation 950A. In response to the determination that the snooping response does not contain valid data, process 900A continues at operation 922A.

[0189] At operation 950A, the L2 controller returns (e.g., forwards) the data contained in the snoop response to the read master (e.g., the requester that generated and sent the consistent read operation received by the L2 controller).
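The decision flow of operations 920A through 950A can be summarized in a short Python sketch. The function `coherent_read` and its parameters are hypothetical: `shadow_tags` is modeled as a set of (address, security bit) pairs, `snoop_l1d` is a callable that returns the snooped data or `None` when the snooped line holds no valid data, and `read_endpoint` is a callable standing in for the locally generated read to the memory endpoint.

```python
def coherent_read(address, secure, shadow_tags, snoop_l1d, read_endpoint):
    """Return (source, data) for a consistent read handled by the L2 controller."""
    # Operation 920A: require both a shadow tag hit and a security hit.
    if (address, secure) in shadow_tags:
        # Operation 930A: send a snoop read request to the lower-level cache.
        data = snoop_l1d(address, secure)
        # Operation 940A: does the snoop response contain valid data?
        if data is not None:
            # Operation 950A: forward the snoop data to the read master.
            return ("snoop", data)
    # Operation 922A: read from the memory endpoint instead.
    return ("endpoint", read_endpoint(address, secure))
```

Note that a request whose address matches a shadow tag but whose security code does not takes the operation 922A path, so it never observes the L1D line that belongs to the other security context.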

[0190] Figure 9B is a flowchart of an example process for a snoop read operation in a multi-level cache system. A snoop read operation is an example of a cache consistency operation. Process 900B is an example process initiated as an MMU read operation, SE read operation, or DMA read operation, such as a snoop operation. Process 900B is initiated at operation 910B.

[0191] At operation 910B, a snooping operation is generated by the L3 cache and / or the next higher-level cache.

[0192] At operation 920B, the L2 controller (e.g., UMC 430) receives a snooping operation generated by the L3 cache and / or the next higher-level cache. The L2 controller is configured to determine whether the received snooping operation results in both an L2 shadow tag hit (e.g., address match) and a security hit (e.g., security code match). An L2 shadow tag hit occurs in response to a match between the snooping read address of the received snooping read operation and an address tagged in an L2 shadow tag of either the L2 shadow main cache or the L2 shadow sacrifice cache. A security hit occurs in response to a match between the security code of the received snooping read operation and a security code stored in the cache line hit by the received snooping read operation. In response to the determination that the received snooping read operation results in both an L2 shadow tag hit and a security hit, process 900B continues at operation 930B. In response to the determination that the received snooping read operation does not result in both an L2 shadow tag hit and a security hit, process 900B continues at operation 922B.

[0193] At operation 922B, the L2 controller locally generates a read command to read data from a memory endpoint (e.g., the nearest valid cache entry or external memory) in response to a received snoop read operation from the L3 cache (or the next higher-level cache). For example, the nearest valid cache entry may be the L2 cache when a hit / miss check indicates that the snooped cache line exists in the L2 cache. If the line does not exist in the L2 cache, the read command may be forwarded to the next lower-level cache or forwarded to another endpoint.

[0194] At operation 930B, the L2 controller locally generates a snoop read request in response to the determination that the received snoop read operation causes both an L2 shadow tag hit and a security hit. The L2 controller sends the snoop read request to a lower-level cache (e.g., L1D) so that, for example, the L2 cache can be kept consistent with the lower-level cache.

[0195] At operation 940B, the L2 controller determines whether the snooping response (e.g., generated and sent by a lower-level cache controller in response to a snooping read request sent by the L2 controller) indicates that the snooped cache line contains valid data. In response to the determination that the snooping response contains valid data, process 900B continues at operation 950B. In response to the determination that the snooping response does not contain valid data, process 900B continues at operation 922B.

[0196] At operation 950B, the L2 controller returns (e.g., forwards) the data contained in the snoop response to the read master (e.g., the requester that generated and sent the snoop read operation received by the L2 controller).

[0197] Figure 9C is a flowchart of an example process for a CMO (cache maintenance operation) read operation in a multi-level cache system. The CMO read operation is an example of a cache consistency operation. Process 900C is an example process initiated as a CPU-generated operation, such as a CMO operation. Process 900C is initiated at operation 910C.

[0198] At operation 910C, a CMO operation is generated by the CPU and sent to the L2 controller (e.g., UMC 430) via the SE. The generation of the CMO is described in U.S. Patent No. 10,599,433, the entire contents of which are incorporated herein by reference for all purposes. In an example, the CMO operation inherits the security level of a process running on the CPU (e.g., where the security code is determined in response to the inherited security level). The process running on the CPU generates the CMO to include a target address and a security code set to indicate the security level of the process that generated the CMO. The CMO operation can be used to evict or remove infrequently used lines from the cache, wherein the lines selected for eviction or removal are those matching the security code of the process that generated the CMO.

[0199] At operation 920C, the L2 controller (e.g., UMC 430) receives the CMO operation generated by the CPU. The L2 controller is configured to determine whether the received CMO operation causes both an L2 shadow tag hit and a security hit (e.g., a security code match). An L2 shadow tag hit occurs in response to a match between the CMO address of the received CMO operation and an address stored in an L2 shadow tag of either the L2 shadow main cache or the L2 shadow sacrifice cache. A security hit occurs in response to a match between the security code of the received CMO operation and a security code stored in the cache line hit by the received CMO operation. In response to the determination that the received CMO operation causes both an L2 shadow tag hit and a security hit, process 900C continues at operation 930C. In response to the determination that the received CMO operation does not cause both an L2 shadow tag hit and a security hit, process 900C continues at operation 922C.
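The two-part hit test of operation 920C can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the `ShadowTagEntry` record, the list-based tag stores, and the `classify_request` function are hypothetical names, and the security code is assumed to be a single bit.

```python
from dataclasses import dataclass


@dataclass
class ShadowTagEntry:
    """Hypothetical shadow tag record kept by the L2 controller."""
    address: int        # line address tracked on behalf of the L1D cache
    secure: int         # security code stored with the line (assumed 1 bit)
    valid: bool = True


def classify_request(address, secure, main_tags, sacrifice_tags):
    """Return (shadow_tag_hit, security_hit) for an incoming request.

    A shadow tag hit needs an address match in either shadow tag store.
    A security hit additionally needs a matching security code.
    """
    matches = [
        entry
        for entry in (*main_tags, *sacrifice_tags)
        if entry.valid and entry.address == address
    ]
    shadow_tag_hit = bool(matches)
    security_hit = any(entry.secure == secure for entry in matches)
    return shadow_tag_hit, security_hit
```

A request is forwarded as a snoop to the lower-level cache only when both values are true; a tag hit with a mismatched security code is handled like a miss.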

[0200] At operation 922C, the L2 controller locally reads the sacrifice cache line in response to the received CMO read operation. The data from the sacrifice cache line is packaged as snoop data to be forwarded as a snoop request to the next-level cache or endpoint (e.g., in operation 950C), and a locally generated read command is sent to the memory endpoint.

[0201] At operation 930C, the L2 controller locally generates a snoop read request in response to the determination that the received CMO read operation causes both an L2 shadow tag hit and a security hit. The L2 controller sends the snoop read request to a lower-level cache (e.g., L1D) so that, for example, the L2 cache can be kept in sync with the lower-level cache.

[0202] At operation 940C, the L2 controller determines whether the snooping response (e.g., generated and sent by a lower-level cache controller in response to a snooping read request sent by the L2 controller) indicates that the snooped cache line contains valid data. In response to the determination that the snooping response contains valid data, process 900C continues at operation 950C. In response to the determination that the snooping response does not contain valid data, process 900C continues at operation 922C.

[0203] At Operation 950C, the L2 controller returns (e.g., forwards) the data contained in the snoop response (e.g., from Operation 922C or Operation 940C) to the read master (e.g., the requester that generates and sends the CMO operation received by the L2 controller).
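The branch structure of process 900C (operations 920C through 950C) can be summarized in a short sketch. The function name `handle_cmo_read` and the two callables are hypothetical stand-ins for hardware paths: `snoop_l1` is assumed to return the line data when the lower-level cache holds valid data and `None` otherwise, and `read_local` is assumed to perform the local read of operation 922C.

```python
def handle_cmo_read(shadow_tag_hit, security_hit, snoop_l1, read_local):
    """Return (source, data) following the branches of process 900C."""
    if shadow_tag_hit and security_hit:      # operation 920C: both hits
        data = snoop_l1()                    # operation 930C: snoop the L1D
        if data is not None:                 # operation 940C: valid data?
            return ("snoop", data)           # operation 950C: forward snoop data
    # Operation 922C: miss, security mismatch, or no valid data in the L1D.
    return ("local", read_local())           # result is then returned (950C)
```

Note that the lower-level cache is never snooped when the security codes do not match, so a requester in one security context cannot pull a line that was cached under the other context.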

[0204] Figure 10 is a flowchart of an example DMA write operation in a multi-level cache system. Process 1000 is an example process initiated by the DMA controller, such as a coherent DMA write operation. Process 1000 is initiated at operation 1010.

[0205] At operation 1010, the DMA controller generates a DMA write operation and sends it to the L2 controller (e.g., UMC 430). In this example, the DMA write operation is sent to the L2 controller via MSMC 461.

[0206] At operation 1020, the L2 controller (e.g., UMC 430) receives the generated DMA write operation. The L2 controller is configured to determine whether the received DMA write operation causes both an L2 shadow tag hit and a security hit (e.g., a security code match). An L2 shadow tag hit occurs in response to a match between the DMA write address of the received DMA write operation and an address stored in an L2 shadow tag of either the L2 shadow main cache or the L2 shadow sacrifice cache. A security hit occurs in response to a match between the security code of the received DMA write operation and the security code stored in the cache line hit by the received DMA write operation. In response to the determination that the received DMA write operation causes both an L2 shadow tag hit and a security hit, process 1000 continues at operation 1030. In response to the determination that the received DMA write operation does not cause both an L2 shadow tag hit and a security hit, process 1000 continues at operation 1022.

[0207] At operation 1022, the L2 controller, in response to (e.g., to implement) a received DMA write operation, locally generates a write command and sends the locally generated DMA write command to a memory endpoint. The endpoint may be an L2 SRAM memory, an L3 cache, external memory, or any other endpoint.

[0208] At operation 1030, the L2 controller locally generates a snoop read request in response to the determination that the received DMA write operation has caused an L2 shadow tag hit. The L2 controller sends the snoop read request to a lower-level cache (e.g., L1D), causing the snooped cache line in the lower-level cache (e.g., L1D) to be invalidated.

[0209] At operation 1040, the L2 controller determines whether the snooping response (e.g., generated and sent by a lower-level cache controller in response to a snooping read request sent by the L2 controller) indicates that the snooped cache line contains dirty (e.g., modified) data. In response to the determination that the snooping response contains dirty data, process 1000 continues at operation 1050. In response to the determination that the snooping response does not contain dirty data, process 1000 continues at operation 1022.

[0210] At operation 1050, the L2 controller merges the DMA write data onto the data contained in the snoop response and writes the merged response to the endpoint.
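The DMA write flow of process 1000 can be sketched as below. The sketch rests on several assumptions that are not in the text: the endpoint is modeled as a dictionary from line address to line bytes, the DMA write is a partial-line write described by an `offset` within the line, and `snoop_l1` returns the dirty line data or `None` when the snooped line is not dirty. All function and parameter names are hypothetical.

```python
def merge_onto(line, dma_data, offset):
    """Overlay dma_data onto a copy of line, starting at byte offset."""
    if offset < 0 or offset + len(dma_data) > len(line):
        raise ValueError("DMA write does not fit inside the cache line")
    merged = bytearray(line)
    merged[offset:offset + len(dma_data)] = dma_data
    return bytes(merged)


def handle_dma_write(endpoint, line_addr, dma_data, offset,
                     shadow_tag_hit, security_hit, snoop_l1):
    """Apply a DMA write following the branches of process 1000."""
    if shadow_tag_hit and security_hit:            # operation 1020
        dirty = snoop_l1()                         # operation 1030 (L1D copy invalidated)
        if dirty is not None:                      # operation 1040: dirty data returned
            # Operation 1050: merge onto the snooped data, then write out.
            endpoint[line_addr] = merge_onto(dirty, dma_data, offset)
            return "merged"
    # Operation 1022: write the DMA data directly to the endpoint.
    endpoint[line_addr] = merge_onto(endpoint[line_addr], dma_data, offset)
    return "written"
```

Merging onto the snooped line preserves bytes that the CPU modified in the L1D but that the DMA write does not cover; writing the DMA data alone would lose them.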

[0211] Figure 11 is a flowchart of an example process for a read allocation operation in a multi-level cache system. Process 1100 is an example process that can be initiated in response to a read allocation operation received from a lower-level controller (e.g., L1D). Process 1100 is initiated at operation 1110.

[0212] At operation 1110, a read allocation operation request is sent by a lower-level data memory controller (e.g., DMC 361) to an L2 controller (e.g., UMC 430). In this example, the request can be signaled by setting the calloc signal high.

[0213] At operation 1120, the address (caddress) and the security code (csecure) of the received read allocation request are written (and tagged) to the L2 shadow main cache (e.g., so that the L2 shadow main cache shadows the L1D main cache). The caddress and csecure values uniquely identify the cache line targeted by the received read allocation request.

[0214] At operation 1130, the L2 controller determines whether the valid bit (cmain_valid) of the indicated cache line in the L2 shadow main cache is set. In response to the determination that the valid bit is set, process 1100 continues at operation 1140. In response to the determination that the valid bit is not set, process 1100 continues at operation 1150.

[0215] At operation 1140, the L2 controller writes cmain_address, cmain_secure, and cmain_MESI to the shadow sacrifice cache (e.g., L2 shadow sacrifice cache tag 620).

[0216] At operation 1150, the L2 controller determines whether the valid bit (cvictim_valid) of the indicated cache line in the L2 shadow sacrifice cache is set. In response to the determination that the valid bit is set, process 1100 continues at operation 1160. In response to the determination that the valid bit is not set, process 1100 continues at operation 1190 (e.g., where process 1100 terminates).

[0217] At operation 1160, the L2 controller evaluates the MESI field (cvictim_mesi) of the indicated cache line in the sacrifice cache to determine whether the indicated cache line is invalid, shared, exclusive, or modified. In response to the determination that the indicated cache line is invalid, process 1100 continues at operation 1190. In response to the determination that the indicated cache line is shared or exclusive, process 1100 continues at operation 1170. In response to the determination that the indicated cache line is modified, process 1100 continues at operation 1180.

[0218] At operation 1170 (for a determined shared or exclusive state), the L2 controller removes from the shadow sacrifice cache the cache line entry whose caddress and csecure values match those of the received read allocation request. After removing the matching cache line entry, process 1100 continues at operation 1190 (e.g., where process 1100 can be terminated).

[0219] At operation 1180 (for a determined modified state), the L2 controller retains in the shadow sacrifice cache the cache line entry whose stored caddress and csecure values match those of the received read allocation request. The matching cache line entry is retained in the shadow sacrifice cache at least until a subsequent sacrifice write transaction is received from a lower-level (e.g., L1D) cache. Process 1100 continues at operation 1190 (e.g., where process 1100 can be terminated).

[0220] At operation 1190, process 1100 is considered "complete", and the L2 controller can continue to process subsequent cache requests.
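The shadow tag bookkeeping of process 1100 can be sketched as follows. This is only an illustration under assumptions: the `ReadAllocate` record bundles the signals named in the text as if they all arrived with the request, the shadow main cache is modeled as a set of (address, security code) keys, the shadow sacrifice cache as a dictionary from such keys to a MESI letter, and flow is assumed to continue from operation 1140 to operation 1150. A modified state is taken to lead to operation 1180, as the description of operation 1180 indicates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadAllocate:
    """Hypothetical bundle of the request signals named in the text."""
    caddress: int
    csecure: int
    cmain_valid: bool = False
    cmain_address: int = 0
    cmain_secure: int = 0
    cmain_mesi: str = "I"
    cvictim_valid: bool = False
    cvictim_mesi: str = "I"


def handle_read_allocate(req, shadow_main, shadow_sacrifice):
    """Update the L2 shadow tags for one read allocation request."""
    key = (req.caddress, req.csecure)
    shadow_main.add(key)                                  # operation 1120
    if req.cmain_valid:                                   # operation 1130
        displaced = (req.cmain_address, req.cmain_secure)
        shadow_sacrifice[displaced] = req.cmain_mesi      # operation 1140
        # Assumption (not stated in the text): the displaced line no longer
        # lives in the main cache, so its shadow main entry is dropped.
        shadow_main.discard(displaced)
    if not req.cvictim_valid:                             # operation 1150
        return "done"                                     # operation 1190
    if req.cvictim_mesi in ("S", "E"):                    # operation 1160
        shadow_sacrifice.pop(key, None)                   # operation 1170
        return "removed"
    if req.cvictim_mesi == "M":
        return "retained"                                 # operation 1180
    return "done"                                         # invalid: operation 1190
```

A modified entry is kept because its data has not yet reached the L2; it stays tracked until the lower-level cache writes it back.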

[0221] Figure 12 is a flowchart of an example process for a sacrificial write operation in a multi-level cache system. Process 1200 is an example process that can be initiated in response to a sacrificial write operation received from a lower-level controller (e.g., L1D). Process 1200 is initiated at operation 1210.

[0222] At operation 1210, the sacrificial write operation request is sent by the lower-level data memory controller (e.g., DMC 361) to the L2 controller (e.g., UMC 430).

[0223] At operation 1220, the L2 controller determines whether the stored caddress and csecure values of a cache line entry in the shadow sacrifice cache match the caddress and csecure values of the received sacrificial write operation. In response to a match determination (yes), process 1200 continues at operation 1230. In response to a non-match determination (no), process 1200 continues at operation 1240.

[0224] At operation 1230, the L2 controller updates the shadow sacrifice cache as needed to invalidate the cache line indicated by the received sacrificial write operation, to maintain coherence and/or security. For example, when the L1 controller sends a sacrifice to the L2, the L1 controller is removing the modified line from its cache (e.g., the L1 main or sacrifice cache). As the modified line is removed from the L1 cache, the L1 controller updates the L1 tag to indicate that the modified line has been removed as an entry from the L1 TAG RAM. Because the shadow TAG RAM within the L2 controller (used to shadow both the L1 main and sacrifice caches) tracks the L1 TAG RAM, the entry is also removed from the L2 shadow TAG RAM (main and sacrifice) to mirror the deletion from the L1 TAG RAM. The L2 controller experiences reduced latency for future transactions (e.g., MMU reads) because the L2 controller can decide whether or not to generate a snoop on this line based on the shadow tag stored locally in the L2 controller.

[0225] At operation 1240, the endpoint memory is updated with sacrifice data (e.g., sacrifice data from a sacrifice cache line that matches the caddress and csecure values of the received sacrifice write operation).
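The handling of a sacrificial write in process 1200 can be sketched as below. The sketch assumes that both outcomes of operation 1220 end with the endpoint update of operation 1240, models the shadow sacrifice cache as a dictionary keyed by (address, security code), and models the endpoint as a dictionary keyed by address. The function name and parameters are hypothetical.

```python
def handle_sacrifice_write(caddress, csecure, data, shadow_sacrifice, endpoint):
    """Process one sacrificial write; return True if a shadow entry matched."""
    key = (caddress, csecure)
    matched = key in shadow_sacrifice        # operation 1220: address and code match?
    if matched:
        # Operation 1230: the L1 no longer holds the line, so the shadow
        # entry is invalidated to mirror the L1 tag RAM.
        del shadow_sacrifice[key]
    endpoint[caddress] = data                # operation 1240: update endpoint memory
    return matched
```

After the entry is removed, a later request for the same address misses in the shadow tags, so the L2 controller can skip the snoop to the L1 entirely. That is the latency benefit described for operation 1230.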

[0226] Modifications are possible in the described embodiments, and other embodiments are also possible within the scope of the claims.

Claims

1. A system comprising: a central processing unit (CPU) arranged to execute program instructions to manipulate data in at least a first or a second security context, wherein the first and second security contexts indicate different security levels; a level 1 cache coupled to the CPU to temporarily store data in cache lines for manipulation by the CPU, wherein the level 1 cache includes a first security code memory for storing a list of level 1 cache security codes, wherein each security code indicates one of the at least first or second security contexts through which data of the corresponding cache line is received, and wherein the level 1 cache includes a level 1 cache controller; and a level 2 cache coupled to the level 1 cache to temporarily store data in cache lines for manipulation by the CPU, wherein the level 2 cache includes a second security code memory for storing a list of level 2 cache security codes, wherein each security code indicates one of the at least first or second security contexts through which data of the corresponding cache line is received, and wherein the level 2 cache includes a level 2 cache controller; wherein the level 1 cache controller is configured to send an access request to the level 2 cache controller, the access request including the address of a selected cache line of data and a security code indicating one of the at least first or second security contexts through which the data of the selected cache line is received; and wherein the level 2 cache controller is configured to compare the address and security code of the access request with the security code of the cache line of data indicated by the address of the access request stored in the level 2 cache, and to perform a cache coherence operation in response to the comparison.

2. The system of claim 1, wherein the level 2 cache contains a shadow copy of the list of security codes of the level 1 cache.

3. The system of claim 2, wherein the level 1 cache comprises a level 1 local memory addressable by the CPU.

4. The system of claim 3, wherein the level 2 cache comprises a level 2 local memory addressable by the CPU.

5. The system of claim 4, further comprising a requester coupled to the level 2 cache and configured to send a coherent read transaction to the level 2 cache controller, wherein the coherent read transaction includes an address of a cache line of data addressable by the CPU and a security code indicating one of the at least first or second security contexts through which data of the cache line addressed by the coherent read transaction is received, wherein the level 2 cache controller compares the address and the security code of the coherent read transaction with a security code stored in the level 2 cache for the cache line of data indicated by the address of the coherent read transaction, and in response to the comparison being affirmative, the level 2 cache controller generates a snoop read transaction and sends the snoop read transaction to the level 1 cache.

6. The system of claim 5, wherein the requester is one of a memory management unit (MMU), a streaming engine (SE), and a direct memory access (DMA) controller.

7. The system of claim 4, further comprising a level 3 cache coupled to the level 2 cache and arranged to send a snoop transaction to the level 2 cache controller, wherein the snoop transaction includes an address of a cache line of data addressable by the CPU and a security code indicating one of the at least first or second security contexts through which data of the cache line addressed by the snoop transaction is received, wherein the level 2 cache controller compares the address and the security code of the snoop transaction with a security code stored in the level 2 cache for the cache line of data indicated by the address of the snoop transaction, and in response to the comparison being affirmative, the level 2 cache controller generates a snoop read transaction and sends the snoop read transaction to the level 1 cache.

8. The system of claim 4, wherein the CPU is configured to send a cache maintenance operation (CMO) transaction to the level 2 cache controller, wherein the CMO transaction includes an address of a cache line of data addressable by the CPU and a security code indicating one of the at least first or second security contexts through which data of the cache line addressed by the CMO transaction is received, wherein the level 2 cache controller compares the address and the security code of the CMO transaction with a security code stored in the level 2 cache for the cache line of data indicated by the address of the CMO transaction, and in response to the comparison being affirmative, the level 2 cache controller generates a snoop read transaction and sends the snoop read transaction to the level 1 cache.

9. The system of claim 4, further comprising a data memory controller (DMC) coupled to the level 2 cache and arranged to send a sacrificial write transaction to the level 2 cache controller, wherein the sacrificial write transaction includes sacrificial data, an address of a cache line of data addressable by the CPU, and a security code indicating one of the at least first or second security contexts through which data of the cache line addressed by the sacrificial write transaction is received, wherein the level 2 cache controller compares the address and the security code of the sacrificial write transaction with a security code stored in the level 2 cache for the cache line of data indicated by the address of the sacrificial write transaction, and in response to the comparison being affirmative, the level 2 cache controller updates the shadow sacrificial cache with the sacrificial data.

10. The system of claim 4, further comprising a DMA controller coupled to the level 2 cache and configured to send a coherent DMA write transaction to the level 2 cache controller, wherein the coherent DMA write transaction includes an address of a cache line of data addressable by the CPU and a security code indicating one of the at least first or second security contexts through which data of the cache line addressed by the coherent DMA write transaction is received, wherein the level 2 cache controller compares the address and the security code of the coherent DMA write transaction with a security code stored in the level 2 cache for the cache line of data indicated by the address of the coherent DMA write transaction, and in response to the comparison being affirmative, the level 2 cache controller generates a snoop read transaction and sends the snoop read transaction to the level 1 cache.

11. The system of claim 1, wherein the security code is a bit used to indicate one of the first and second security contexts.

12. A method comprising: executing, by a CPU, program instructions to manipulate data in at least a first or a second security context, wherein the first and second security contexts indicate different security levels; temporarily storing data in cache lines of a level 1 cache for manipulation by the CPU, wherein the level 1 cache includes a first security code memory and a level 1 cache controller; storing security codes in a list of level 1 cache security codes, wherein each security code indicates one of the at least first or second security contexts through which the level 1 cache receives data for a corresponding cache line; temporarily storing data in cache lines of a level 2 cache for manipulation by the CPU, wherein the level 2 cache includes a second security code memory and a level 2 cache controller; storing security codes in a list of level 2 cache security codes, wherein each security code indicates one of the at least first or second security contexts through which the level 2 cache receives data for a corresponding cache line; sending, by the level 1 cache controller, an access request to the level 2 cache controller, the access request including the address of a selected cache line of data and a security code indicating one of the at least first or second security contexts through which the data of the selected cache line is received; comparing, by the level 2 cache controller, the address and security code of the access request with the security code of the cache line of data indicated by the address of the access request stored in the level 2 cache; and performing a cache coherence operation in response to the comparison.

13. The method of claim 12, further comprising snooping the level 1 cache to maintain, in the level 2 cache, a shadow copy of the list of level 1 cache security codes.

14. The method of claim 13, wherein the level 1 cache comprises level 1 local memory addressable by the CPU.

15. The method of claim 14, wherein the level 2 cache comprises a level 2 local memory addressable by the CPU.

16. The method of claim 15, further comprising: sending a coherent read transaction to the level 2 cache, wherein the coherent read transaction includes the address of a cache line of data addressable by the CPU and a security code indicating one of the at least first or second security contexts through which data of the cache line addressed by the coherent read transaction is received; comparing the address and security code of the coherent read transaction with a security code stored in the level 2 cache for the cache line of data indicated by the address of the coherent read transaction; and in response to the comparison being affirmative, sending a snoop read transaction from the level 2 cache to the level 1 cache.

17. A system comprising: a cache including: a local memory including a set of cache lines to store a set of data; and a security code store for storing a list of security codes, wherein each of the security codes indicates a corresponding security context for a subset of the set of data stored in a corresponding cache line of the set of cache lines; wherein a first security code of the security codes is associated with a first subset of the set of data stored in a first cache line of the set of cache lines, and the first security code indicates the security context of the process that generates the information stored in the first cache line; and wherein the system selectively allows reading the first cache line from the cache without deleting it from the cache, reading the first cache line from the cache and deleting it from the cache, or deleting the first cache line from the cache, in response to a comparison of the security context of an access request to the first cache line with the security context indicated by the first security code.

18. The system of claim 17, further comprising a CPU configured to control access to the first cache line by a requester.

19. The system of claim 18, wherein the cache is a level 1 cache.

20. The system of claim 18, wherein the cache is a level 2 cache.

Citation Information

Patent Citations

US10599433B2: Cache management operations using streaming engine
US20120191916A1: Optimizing tag forwarding in a two level cache system from level one to lever two controllers for cache coherence protocol for direct memory access transfers
US20160342530A1: Method and apparatus for cache tag compression
