Effective set sampling and set dueling in large-scale distributed system-level caches
Dynamic set sampling and set dueling in large-scale distributed caches improve cache performance by delegating shared cache instances to determine the best algorithm for each thread, addressing inefficiencies in conventional cache management systems.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- MICROSOFT TECHNOLOGY LICENSING LLC
- Filing Date
- 2024-06-11
- Publication Date
- 2026-06-19
AI Technical Summary
Conventional cache replacement algorithms, such as LRU and dynamic replacement algorithms, struggle to perform effectively in large-scale distributed system-level caches accessed by multiple threads, leading to degraded performance and inefficiencies in cache management.
Implement dynamic set sampling (DSS) and set dueling to identify the best cache algorithm for each thread by delegating a shared cache instance to determine the winner of cache algorithms, using a dueling counter to measure policy performance and minimize hardware requirements.
Enhances cache performance by identifying the optimal cache algorithm for each thread, reducing communication overhead and maintaining efficient cache management even in high-core-count systems.
Abstract
Description
Background Art
[0001] Background
[0001] A multi-core computing system can support many applications that can be executed as threads by cores associated with one or more processors associated with the computing system. The cores can access local caches and shared caches. The shared cache can be subject to various cache-related policies including a cache replacement policy (also called a cache replacement algorithm).
[0002]
[0002] Some of these cache replacement algorithms, such as the least recently used (LRU) algorithm, work well for applications that have a working set that fits within a single cache, but may not work well in systems with large-scale distributed system-level caches accessed by multiple threads. Furthermore, other cache-related algorithms such as insertion algorithms and allocation algorithms may have degraded performance in systems with large-scale distributed system-level caches accessed by multiple threads. Therefore, there is a need for systems and methods for effective set sampling and set dueling in large-scale distributed system-level caches.
Summary of the Invention
[0003] Summary
[0003] In one example, the present disclosure relates to a method for selecting a cache algorithm in a system having a plurality of cores and a plurality of shared cache instances accessible to any of the plurality of cores, wherein the system is configurable to execute threads. The method may include one of the plurality of shared cache instances receiving a request associated with a thread, the request including policy information specifying at least one of two cache algorithms for implementation by the shared cache instance for any request associated with the thread.
[0004]
[0004] The method may further include implementing at least one of two cache algorithms specified by policy information received as part of a request associated with a thread, unless the shared cache instance is identified as a delegated shared cache instance for determining the winner of at least two cache algorithms for use in any request associated with a thread.
[0005]
[0005] In another example, the disclosure relates to a system having a plurality of cores and a plurality of shared cache instances accessible to any of the plurality of cores, wherein the system is configurable to execute threads. The system may include one of the plurality of shared cache instances to receive requests associated with a thread, the request including policy information to specify at least one of two cache algorithms for implementation by the shared cache instance for any request associated with the thread.
[0006]
[0006] The system may further include a shared cache instance circuit configuration associated with a shared cache instance and configured to process policy information received as part of a request associated with a thread. The shared cache instance circuit configuration may further be configured to instruct a shared cache instance to implement at least one of two cache algorithms, unless that shared cache instance is identified by the shared cache instance circuit configuration as a delegated shared cache instance for determining which of the at least two cache algorithms will be used for any request associated with a thread.
[0007]
[0007] In yet another example, the disclosure relates to a method for selecting a cache algorithm in a system having a plurality of cores and a plurality of shared cache instances accessible to any of the plurality of cores, wherein the system can be configured to execute threads. The method may include designating a shared cache instance as a first delegated shared cache instance for determining the winner of at least two cache algorithms for access requests associated with a thread. The method may further include delegating to another shared cache instance as a second delegated shared cache instance for determining the winner of at least two cache algorithms for access requests associated with a thread.
[0008]
[0008] The method may further include communicating policy information to each of the multiple cores that specifies the winner of at least two cache algorithms. The method may further include, when one of the multiple shared cache instances receives a request for cache access associated with a thread, implementing one of the at least two cache algorithms specified by the policy information received as part of the request for cache access, unless the shared cache instance that received the request is identified as a first delegated shared cache instance or a second delegated shared cache instance.
[0009]
[0009] This summary is provided to introduce in a simplified form some of the concepts that will be further described below in more detail. This summary is not intended to identify any major or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
[0010] Brief explanation of the drawing
[0010] This disclosure is provided as an example and is not limited by the accompanying drawings. In the accompanying drawings, similar reference numerals indicate similar elements. Elements in the drawings are shown for brevity and clarity and are not necessarily drawn to scale.
[Brief explanation of the drawing]
[0011] [Figure 1] This is a block diagram of an exemplary system in which effective set sampling and set dueling are implemented.
[0012] [Figure 2] This is a flowchart illustrating how to select a replacement policy for use with a thread as part of an effective set sampling and set dueling implementation.
[0013] [Figure 3] This figure shows an exemplary shared cache instance (SCI) circuit configuration for implementing effective set sampling and set dueling.
[0014] [Figure 4] This is a flowchart for updating the Dynamic Algorithm Bit (DAB) and training the set dueling counter as part of an effective set sampling and set dueling implementation.
[0015] [Figure 5A] This figure shows one example of a leader set and follower set layout from a thread perspective for practicing effective set sampling and set dueling.
[0015] [Figure 5B] This figure shows another example of a leader set and follower set layout from a thread perspective for practicing effective set sampling and set dueling.
[0015] [Figure 5C] This figure shows yet another example of a leader set and follower set layout from a thread perspective for practicing effective set sampling and set dueling.
[0016] [Figure 6] This figure shows another exemplary shared cache instance (SCI) circuit configuration for implementing effective set sampling and set dueling.
[0017] [Figure 7] This is a flowchart illustrating an exemplary method for selecting a cache algorithm based on effective set sampling and set dueling.
[0018] [Figure 8] This is a flowchart illustrating another exemplary method for selecting a cache algorithm based on effective set sampling and set dueling.
[Modes for carrying out the invention]
[0012] Detailed explanation
[0019] The examples described in this disclosure relate to systems and methods for effective set sampling and set dueling in large-scale distributed system-level caches. Some examples relate to systems with multiple cores within a multithreaded computing system. A multithreaded computing system may be a standalone computing system or part of a public cloud, private cloud, or hybrid cloud (e.g., a server). A public cloud includes a global network of servers that perform a variety of functions, including data storage and management, application execution, and delivery of content or services such as streaming video, email, office productivity software, or social media. Servers and other components may be located in data centers around the world. Public clouds serve the public over the internet, but enterprises may use private clouds or hybrid clouds. Both private clouds and hybrid clouds also include networks of servers housed in data centers. Applications may run using the compute and memory resources of a standalone computing system or a computing system within a data center. As used herein, the term “application” encompasses, but is not limited to, any executable code (in the form of hardware, firmware, software, or any combination thereof) that implements a function, virtual machine, client application, service, microservice, container, or unikernel for serverless computing. Alternatively, an application may run on hardware associated with edge computing devices, on-premises servers, or other types of systems, including communication systems such as base stations (e.g., 5G or 6G base stations).
[0013]
[0020] Computing systems include several types of memory, including caches. Caches help mitigate the long latency associated with accessing main memory (e.g., Double Data Rate (DDR) Dynamic Random Access Memory (DRAM)) by providing data with low latency. Processors may have access to a cache hierarchy including L1, L2, and L3 caches, with the L1 cache being closest to the processing core and the L3 cache being the furthest. Data access may first be performed against the cache; if the data is found within the cache, it is considered a hit. However, if the data is not found within the cache, it is considered a miss, and the data must be loaded from main memory (e.g., DRAM). In systems with multiple cores and sharing system-level caches, managing caches, including implementing various cache policies, is a complex issue.
[0014]
[0021] The examples described herein relate to systems and methods for dynamic set sampling (DSS) and set dueling in a distributed system-level cache (SLC) shared by many cores. Some conventional methods distribute sets across multiple shared cache instances (SCIs), but this approach suffers from reduced accuracy because each SCI does not have sufficient resolution to capture the performance of each thread, and the information from each SCI is not combined per thread.
[0015]
[0022] In large systems hosting multiple users, shared cache resources are contested by numerous applications. These applications may have different access patterns to the system-level cache (SLC), requiring the SLC to manage the cache lines owned by each application in different ways. One example of cache management is cache replacement algorithms. Different applications may prefer different cache replacement algorithms. Some caches may implement dynamic replacement algorithms such as Dynamic Rereference Interval Prediction (DRRIP) and Dynamic Insertion Policy (DIP), in which the cache modifies policies to improve the cache hit rate for the application currently running on the core.
[0016]
[0023] Dynamic replacement algorithms can achieve this through dynamic set sampling (DSS) and set dueling. DSS leverages the insight that the behavior of a small portion of the cache is statistically sufficient to approximate the behavior of the entire cache. For example, in a cache with 2048 sets, 32 to 64 sets (called leader sets) may suffice. Using DSS, the cache always applies a replacement algorithm "ReplA" to one group of leader sets (e.g., 32 leader sets) and "ReplB" to another group of leader sets (e.g., another 32 leader sets) to approximate how the cache would behave if the entire cache were running ReplA or ReplB. Set dueling pits ReplA and ReplB against each other to identify the winner that provides the best hit rate (or lowest miss rate). A dueling counter can be used to implement set dueling. Misses in the ReplA leader sets decrement the dueling counter, and misses in the ReplB leader sets increment the dueling counter. In this way, the dueling counter indicates which policy generates more misses, and the opposite replacement policy is selected to maximize hits. The selected replacement policy is then implemented in the rest of the cache (called the follower sets).
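The dueling counter described above can be sketched in software as follows. This is only an illustrative model, not the circuit of the disclosure: the class name `DuelingCounter`, the 10-bit default width, the midpoint starting value, and the reading of the most significant bit (MSB set means the ReplB leader sets have missed more, so ReplA is selected) are all assumptions made for the example.

```python
class DuelingCounter:
    """Saturating counter tracking which leader-set group misses more.

    Hypothetical model: width, initial value, and MSB interpretation are
    assumptions chosen for illustration.
    """

    def __init__(self, bits: int = 10):
        self.bits = bits
        self.max_value = (1 << bits) - 1
        # Start at the midpoint so neither policy begins with a large lead.
        self.value = 1 << (bits - 1)

    def record_miss(self, leader_policy: str) -> None:
        """Train the counter on a miss in a leader set of the given policy."""
        if leader_policy == "ReplA":
            # A miss in a ReplA leader set decrements (saturating at 0).
            self.value = max(0, self.value - 1)
        elif leader_policy == "ReplB":
            # A miss in a ReplB leader set increments (saturating at max).
            self.value = min(self.max_value, self.value + 1)
        else:
            raise ValueError(f"unknown leader policy: {leader_policy}")

    def msb(self) -> int:
        return (self.value >> (self.bits - 1)) & 1

    def winner(self) -> str:
        """Policy to apply in follower sets: the one that missed less."""
        return "ReplA" if self.msb() else "ReplB"


counter = DuelingCounter(bits=4)
for _ in range(3):
    counter.record_miss("ReplA")   # ReplA leader sets keep missing
print(counter.value, counter.winner())
```

Saturation keeps a long burst of misses under one policy from overflowing the counter, which is one common reason to bound it in this way.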
[0017]
[0024] In the case of a multi-core system, different applications may favor different replacement algorithms. Thread-aware (TA) dynamic replacement algorithms such as TA-DRRIP and TA-DIP can identify the optimal replacement algorithm for a specific thread by having a leader set for each thread for each policy. The leader set with replacement algorithm A (ReplA) for thread 0 statically implements ReplA for lines sourced from thread 0 and, for all other threads, uses the winning policy determined by those threads' own leader sets. Therefore, for an effective thread-aware replacement algorithm, each cache instance requires a total of 64 leader sets for a 1-thread system, 128 leader sets for a 2-thread system, 256 leader sets for a 4-thread system, and 2048 leader sets (that is, every set of a 2048-set cache) for a 32-thread system. To support thread counts higher than 32, it is necessary either to reduce the number of leader sets and sacrifice accuracy, or to group threads into clusters, which may incur performance penalties.
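The scaling problem in the preceding paragraph is simple arithmetic, reproduced here as a short sketch. The function names are hypothetical; the figure of 64 leader sets per thread (32 per policy for two policies) and the 2048-set cache are taken from the examples above.

```python
def leader_sets_needed(threads: int, leader_sets_per_thread: int = 64) -> int:
    """Leader sets a single cache instance must reserve for thread-aware dueling."""
    return threads * leader_sets_per_thread


def leader_fraction(threads: int, total_sets: int = 2048,
                    leader_sets_per_thread: int = 64) -> float:
    """Fraction of the cache instance consumed by leader sets."""
    return leader_sets_needed(threads, leader_sets_per_thread) / total_sets


for threads in (1, 2, 4, 32):
    print(threads, leader_sets_needed(threads), leader_fraction(threads))
```

At 32 threads the fraction reaches 1.0, so no follower sets remain in which the winning policy could be applied.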
[0018]
[0025] Furthermore, conventional multi-core systems cannot implement conventional TA-DRRIP / TA-DIP algorithms for high-core count systems while maintaining effective sampling efficiency. One possible solution is to pre-select a shared system-level cache replacement algorithm using information from the core's private cache. This approach may work in certain situations but has several drawbacks. First, this approach discards information that exists only in the shared system-level cache. This approach also reduces the efficiency of cache filtering in the locality seen only by the shared system-level cache. Finally, this approach ignores the impact of multiple threads competing and coexisting in the shared cache space.
[0019]
[0026] The examples described herein address the problem of high-core-count sampling. These systems and methods minimize the required dueling hardware per cache instance and add only minimal hardware to the cache controller. In addition, these systems and methods add only one or a few additional bits to the payload of messages sent within the fabric. As described herein, in some examples, the problem that a particular shared cache instance (SCI) has too few sets to identify the best algorithm for all threads is overcome by delegating the SCI to identify the winner of the set dueling for only the nearest physical core / thread. Since cache accesses for large-scale distributed caches are spread across the instances, the number of leader sets per SCI is increased to maintain sufficient dynamic set sampling. In one example, the number of leader sets per SCI is increased from 32 - 64 per 2048 sets (per thread) to about 256 - 512 per 2048 sets. The proposed systems and methods enable the identification of the best possible shared system-level cache algorithm for a particular thread, even in the presence of other competing threads.
[0020]
[0027] Figure 1 is a block diagram of an exemplary system 100 in which effective set sampling and set dueling are implemented. System 100 includes multiple processing cores (e.g., core 0 102, core 1 104, core 2 106, and core N 108, etc.) coupled to components of a distributed shared system-level cache (SSLC) 160 via an interconnect 110. Although exemplary system 100 shows only the shared system-level cache (SSLC) 160, each core may have access to other local and / or shared caches (e.g., level 1 and level 2 caches (not shown)). In one example, the shared system-level cache (SSLC) 160 may be considered a level 3 cache in the cache hierarchy, assuming that it also has level 1 and level 2 caches. System 100 further includes a shared cache controller 170, which can be used to configure the cache as needed at system startup or reset. The interconnect 110 includes multiple switches (e.g., switches 112, 114, 116, 118, 120, 122, 124, 126, 128, and 130) and enables the exchange of both commands (e.g., read access commands or write access commands) and data between cores and shared cache instances included as part of SSLC 160. For example, switch 112 allows core 0 102 to access interconnect 110 and access the shared system-level cache. SSLC 160 includes multiple shared cache instances (SCIs) and delegated shared cache instances (DSCIs). Each shared cache instance includes an SCI controller (not shown) responsible for managing the functionality of a particular SCI.
[0021]
[0028] Continuing to refer to Figure 1, system 100 shows one SCI per thread. In this example, SCI0 142 is the delegated shared cache instance for thread 0 (in this case, the digit 0 acts as the thread identifier). SCI1 144 is the delegated shared cache instance for thread 1, SCI2 146 is the delegated shared cache instance for thread 2, and SCIN 148 is the delegated shared cache instance for thread N. Thus, in this example, there is a one-to-one mapping between threads and delegated shared cache instances (DSCIs). That is, there is one SCI delegated to have leader sets for a given thread. However, system 100 is not limited to this particular configuration. It is also possible to use two or more delegated shared cache instances (DSCIs) per thread. Thus, a shared cache instance is "delegated" only with respect to a given thread; for every thread for which it is not the delegated shared cache instance, it acts simply as an ordinary shared cache instance.
[0022]
[0029] A delegated shared cache instance (DSCI) for a specific thread is used to determine the replacement algorithm that should be implemented across the entire shared cache for all accesses from that thread. Therefore, in this example, SCI0 142 determines the replacement algorithm to be implemented across SSLC 160 for all accesses from thread 0. SCI1 144 determines the replacement algorithm to be implemented across SSLC 160 for all accesses from thread 1. SCI2 146 determines the replacement algorithm to be implemented across SSLC 160 for all accesses from thread 2. SCIN 148 determines the replacement algorithm to be implemented across SSLC 160 for all accesses from thread N. Information identifying the winning policy for a thread (called Dynamic Algorithm Bits (DABs)) is communicated to the physical core running that thread in a response / return message to the core. Upon receiving this response message, the core stores the DABs and includes them in all future command messages that it sends to any shared cache instance. If the shared cache instance receiving the message containing the DABs is the DSCI for that particular thread, the DABs are ignored. Otherwise, the SCI reads the DABs and executes the algorithm indicated in the message. This delegated dynamic cache configuration relieves each shared cache instance (SCI) of the responsibility of identifying the best-performing algorithm for each thread. Instead, the responsibility of identifying the best-performing algorithm for a thread (e.g., thread 0) is delegated to a specific SCI (e.g., SCI0 142 for thread 0 in this example). However, the delegated dynamic cache configuration is not constrained to having one DSCI per core. Other core / thread mappings to DSCIs can also be used instead.
As an example, core 0 102 may have set dueling hardware in two different delegated shared cache instances (e.g., SCI0 142 and SCI1 144). Similarly, core 1 104 may have set dueling hardware in two other delegated shared cache instances (e.g., SCI1 144 and SCI2 146). In this way, a single thread may rely on two delegated shared cache instances (DSCIs) to identify the best cache replacement algorithm for the thread. Other mappings between cores / threads and DSCIs may also be used. As an example, threads may share a delegated shared cache instance. Thus, both thread 0 and thread 1 may share one SCI (e.g., SCI0 142).
[0023]
[0030] Communication of the best algorithm is governed by the movement of DABs across the entire system. To minimize communication overhead (e.g., a new policy must be communicated to a core and then sent to an SCI), in one example the DSCI for a particular thread / core is selected starting with the nearest SCI. However, it should be understood that in this example some requests already in flight from the core to other SCIs carry older DABs, so the corresponding lines will be installed in a suboptimal state. Figure 1 shows a specific number of cores and caches arranged in a particular way, but system 100 may include other cores and caches arranged in different ways. Furthermore, while Figure 1 describes system 100 in the context of a cache replacement algorithm, system 100 may also be configured for use with other cache algorithms, including insertion and allocation algorithms.
[0024]
[0031] Figure 2 shows, as an example, a flowchart 200 for selecting a replacement policy for use with a thread as part of an effective set sampling and set dueling implementation. As part of step 202, a core (e.g., core 0 102 in Figure 1) sends an outbound request to an SCI (e.g., SCI0 142 in Figure 1, or another SCI associated with SSLC 160 in Figure 1). In step 204, the SCI circuit configuration associated with the SCI that received the outbound request from the core (an example of such a circuit configuration is described later with reference to Figure 3) determines whether this SCI is a delegated shared cache instance (DSCI) with respect to the core that sent the outbound request. If the answer is "no", in step 206, the cache replacement policy defined by the DAB included as part of the outbound request from the core is implemented for any cache access requests related to the thread (e.g., thread 0) that caused the core (e.g., core 0 102 in Figure 1) to send the outbound request.
[0025]
[0032] In one example, the DAB is included as part of the metadata portion of an outbound request. On the other hand, if the answer to the query in step 204 is "yes", the DAB is ignored in step 208. Outbound requests, inbound requests, and other messaging can be implemented using caching protocols such as the Coherent Hub Interface (CHI) protocol provided by ARM. Other messaging protocols and related functions may also be used.
[0026]
[0033] Next, in step 210, the SCI circuit configuration for the SCI that received the outbound request determines whether the target set is a leader set. If the answer is "no", in step 212, the replacement algorithm selected by the dueling counter is implemented for that SCI for the outbound request sent by the core. On the other hand, if the answer is "yes", in step 214, the static policy defined by the leader set is implemented for that SCI. Finally, in step 216, the SCI circuit configuration sends the response to the core along with the internal DAB policy state. Figure 2 shows several steps executed in a specific order as part of flowchart 200, but additional or fewer steps can be executed in a different order to achieve similar results.
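The decisions in steps 204 through 214 can be summarized in a small software sketch. It is a hedged illustration rather than the disclosed circuit: the `Request` record, the `select_policy` function, the use of policy names instead of raw DAB bits, and the dictionary of leader sets are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Request:
    thread_id: int   # thread that issued the cache access
    dab_policy: str  # policy named by the DAB carried in the request
    set_index: int   # cache set targeted by the request


def select_policy(req: Request, delegated_thread: int,
                  leader_sets: Dict[int, str], dueling_winner: str) -> str:
    """Return the policy this SCI applies to the request.

    delegated_thread: thread for which this SCI acts as the DSCI.
    leader_sets: set index -> policy statically assigned to that leader set.
    dueling_winner: policy currently selected by the dueling counter.
    """
    # Step 204: is this SCI the DSCI for the requesting thread?
    if req.thread_id != delegated_thread:
        # Step 206: not the DSCI, so follow the DAB in the request.
        return req.dab_policy
    # Step 208: this SCI is the DSCI, so the incoming DAB is ignored.
    # Step 210: is the target set a leader set?
    if req.set_index in leader_sets:
        # Step 214: leader sets always run their static policy.
        return leader_sets[req.set_index]
    # Step 212: follower sets use the dueling counter's current winner.
    return dueling_winner


leaders = {0: "ReplA", 1: "ReplB"}
print(select_policy(Request(3, "ReplB", 40), 0, leaders, "ReplA"))  # non-DSCI path
print(select_policy(Request(0, "ReplB", 1), 0, leaders, "ReplA"))   # leader-set path
print(select_policy(Request(0, "ReplB", 40), 0, leaders, "ReplA"))  # follower path
```

Step 216 (returning the internal DAB state in the response) is left out of the sketch because it does not affect which policy is applied to the current request.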
[0027]
[0034] Figure 3 shows a shared cache instance 300 with an exemplary shared cache instance (SCI) circuit configuration 320 for implementing effective set sampling and set dueling. In one example, the SCI circuit configuration 320 may be used to implement the steps described above with respect to the flowchart 200 in Figure 2. In this example, the shared cache instance (SCI) 300 also includes a cache module 310, which is used to store cache lines, cache addresses, and other attributes / metadata associated with the shared cache instance 300. Several bits from an inbound request (e.g., from a thread) regarding cache access are communicated to the SCI circuit configuration 320. In this example, the inbound request message is in packet form and contains several bit fields, which are labeled as REQ.ADDR / ATTR, REQ.DAB, and REQ.THREADINFO in Figure 3. The REQ.ADDR / ATTR bits contain both the cache address and any attributes associated with the cache. In one example, the cache attributes may include information such as the size of each set in the shared cache instance, the associativity, the number of sets in the shared cache instance, the number of set bits, the number of tag bits, and other relevant information as needed. In this example, as previously mentioned with respect to Figures 1 and 2, the REQ.DAB bits include the dynamic algorithm bit. In this example, the REQ.THREADINFO bits include at least the thread number (e.g., thread 0, thread 1, or thread N) of the thread that initiated the cache access request.
[0028]
[0035] Continuing to refer to Figure 3, the SCI circuit configuration 320 includes the SCI set dueling policy logic 330, the SCI configuration 340, the comparison logic 350, and the multiplexer 360. The SCI set dueling policy logic 330 includes logic configured to implement set dueling, which pits two different replacement policies (e.g., ReplA and ReplB) against each other to identify a winner that provides the best hit rate (or lowest miss rate). The cache module 310 provides cache hit / miss information to the SCI set dueling policy logic 330. As previously mentioned, set dueling can be implemented using dueling counters. The SCI set dueling policy logic 330 includes logic that processes the output of the dueling counter (e.g., the most significant bit (MSB) associated with the counter) and can dynamically identify the winning policy based on the state of the counter. A miss in a ReplA leader set decrements the dueling counter, and a miss in a ReplB leader set increments the dueling counter. In this way, the dueling counter indicates which policy generates more misses, and the opposite replacement policy is selected to maximize hits. The selected replacement policy is provided as one of the inputs (input 1) to the multiplexer 360.
[0029]
[0036] Continuing to refer to Figure 3, another input to the multiplexer 360 (input 0) is used to receive information carried as part of the REQ.DAB bits. The SCI configuration 340 is configured as a register and is used to store one of the thread numbers (e.g., thread 0) as the thread identifier (assuming SCI 300 is a delegated shared cache instance with respect to thread 0). The comparison logic 350 is used to compare the value stored in SCI configuration 340 with the information in the REQ.THREADINFO bits. The output of the comparison logic 350 (labeled DSCI in Figure 3) controls which input signal the multiplexer 360 provides as its output. As previously mentioned with respect to step 204 in Figure 2, if SCI 300 is a delegated shared cache instance (DSCI) with respect to core 0 (thread 0 is mapped to core 0), then the bits included as part of REQ.DAB are ignored. This is because, in such a case, the output of the comparison logic 350 (DSCI) is such that the signal received via input 0 is ignored. Instead, signals received from the SCI set dueling policy logic 330 are used to implement a cache replacement policy for the SCI 300. As described herein, advantageously, the proposed system and method require only minimal additional hardware in the form of in-core storage (for DABs) and reduce the amount of logic in shared cache instances.
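As a rough behavioral model of the comparison logic 350 and multiplexer 360, the selection can be written as two small functions. The function names and the single-bit encoding of the policy are assumptions for illustration; Figure 3 itself defines only the signals, not any software interface.

```python
def mux(select: int, input0: int, input1: int) -> int:
    """Two-input multiplexer: select == 1 passes input1, otherwise input0."""
    return input1 if select else input0


def sci_policy_bit(sci_config_thread: int, req_threadinfo: int,
                   req_dab: int, dueling_policy_bit: int) -> int:
    """Policy bit applied by the SCI for one inbound request.

    sci_config_thread models the thread identifier held in SCI configuration 340.
    """
    # Comparison logic 350: assert DSCI when the request comes from the
    # thread for which this SCI is the delegated instance.
    dsci = int(sci_config_thread == req_threadinfo)
    # Multiplexer 360: input 0 carries REQ.DAB, input 1 carries the output
    # of the set dueling policy logic 330.
    return mux(dsci, req_dab, dueling_policy_bit)


# SCI delegated for thread 0: a request from thread 0 ignores its DAB,
# while a request from thread 3 follows its DAB.
print(sci_policy_bit(0, 0, req_dab=1, dueling_policy_bit=0))
print(sci_policy_bit(0, 3, req_dab=1, dueling_policy_bit=0))
```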
[0030]
[0037] Regarding the exemplary SCI 300 in Figure 3, the core needs to be able to store a single bit per thread to maintain and transmit the DAB. The SCI requires one additional bit per thread in the system (e.g., 63 additional bits in a 64-thread system) and the supporting logic shown in Figure 3. Command and response messages need to carry the additional DAB bits as part of the payload (e.g., packets). Advantageously, the DSCI here has a smaller cumulative leader set, so the amount of dueling hardware can be reduced. In systems where the number of physical cores and SCIs is unbalanced, the SCIs may be delegated to handle fewer / more threads for dueling. With further analysis of the system, the number of leader sets may be reduced if the number of cores exceeds the number of SCIs (as more traffic from threads is concentrated per SCI).
[0031]
[0038] Figure 3 shows SCI 300 as including several components arranged in a certain manner, but SCI 300 may also include additional components or fewer components arranged in different ways. As an example, some of the functionality associated with the SCI circuit configuration 320 can be implemented using other logic, including a finite state machine. As another example, Figure 6 (described later) provides an alternative implementation of an SCI. Furthermore, while Figure 3 describes SCI 300 in the context of a cache replacement algorithm, SCI 300 can also be configured for use with other cache algorithms, including insertion and allocation algorithms. Additionally, if the number of shared cache instances exceeds the number of threads, or if higher precision is required from a larger number of delegated shared cache instances, tiebreaker logic can be used to select a winning algorithm. If a thread is paired with two delegated shared cache instances, the most recent DAB response may be used, or the most frequent winning algorithm may be identified using a history shift register. If a thread is paired with an odd number of delegated shared cache instances, majority voting may be used. If a thread is paired with an even number of delegated shared cache instances, majority voting may be used, and in the event of a tie, the most recent DAB response may be used to determine the outcome.
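The majority-vote tiebreaker with a most-recent fallback might look like the following sketch. The function `resolve_dab` and its input format (one single-bit DAB per paired DSCI, ordered from oldest to most recent response) are assumptions for illustration; the history shift register alternative mentioned above is not modeled.

```python
from collections import Counter
from typing import Sequence


def resolve_dab(responses: Sequence[int]) -> int:
    """Pick the winning DAB from the latest response of each paired DSCI.

    responses: one DAB value per DSCI, ordered oldest to most recent.
    """
    if not responses:
        raise ValueError("at least one DAB response is required")
    ranked = Counter(responses).most_common()
    # A clear majority (always the case for an odd count of 1-bit DABs).
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]
    # Tie (possible with an even count): fall back to the most recent response.
    return responses[-1]


print(resolve_dab([0, 1, 1]))      # odd count: majority wins
print(resolve_dab([1, 1, 0, 0]))   # even count, tie: most recent response wins
```

With exactly two DSCIs, a disagreement is always a tie, so this sketch reduces to the most-recent-response rule described for that case.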
[0032]
[0039] Figure 4 shows a flowchart 400 for updating the Dynamic Algorithm Bit (DAB) and training the set dueling counter as part of an effective set sampling and set dueling implementation. Step 402 shows a thread (e.g., thread 0) accessing a delegated shared cache instance (DSCI) for that thread, i.e., SCI0. Step 404 shows logic associated with the shared cache instance (e.g., SCI0) determining whether the set being accessed is a leader set. If the answer is "no", no algorithm training is performed, as shown by block 406, because the dueling counter is not affected at all. However, if the answer is "yes", step 408 shows logic associated with the SCI determining whether the cache access to the DSCI is a miss. If it is not a miss, no algorithm training is performed again, as shown by block 410. However, if it is a cache miss, step 412 shows the dueling counter being incremented or decremented based on the leader set.
[0033]
[0040] Next, in step 414, the logic associated with the SCI determines whether the most significant bit (MSB) of the dueling counter has changed. If the answer is "no," the DAB is not updated, and as part of step 416, the old DAB value is returned to the thread via a response message. However, if the answer is "yes," the DAB is updated with the new state, and as part of step 418, the new DAB is returned to the thread via a response message. Figure 4 shows several steps executed in a specific order as part of flowchart 400, but additional or fewer steps could be executed in a different order to achieve similar results.
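Flowchart 400 can be approximated in software as a single training function. This is a minimal sketch under stated assumptions: the function name `train_on_access`, the saturating behavior of the counter, and the choice to let the DAB simply mirror the counter MSB are illustrative and are not specified by the disclosure.

```python
from typing import Optional, Tuple


def train_on_access(counter: int, bits: int, leader_policy: Optional[str],
                    miss: bool, dab: int) -> Tuple[int, int]:
    """Return (updated counter, DAB to send back in the response).

    leader_policy: "ReplA" or "ReplB" for a leader set, None for a follower set.
    """
    # Steps 404/406 and 408/410: follower-set accesses and hits do not
    # train the counter, so the old DAB is returned unchanged.
    if leader_policy is None or not miss:
        return counter, dab

    old_msb = (counter >> (bits - 1)) & 1
    # Step 412: a miss in a leader set moves the counter.
    if leader_policy == "ReplA":
        counter = max(0, counter - 1)
    elif leader_policy == "ReplB":
        counter = min((1 << bits) - 1, counter + 1)
    else:
        raise ValueError(f"unknown leader policy: {leader_policy}")
    new_msb = (counter >> (bits - 1)) & 1

    # Steps 414-418: update the DAB only when the MSB flips.
    if new_msb != old_msb:
        dab = new_msb  # assumed encoding: the DAB mirrors the counter MSB
    return counter, dab


# A 4-bit counter at the midpoint (8): one ReplA miss flips the MSB and the DAB.
print(train_on_access(8, 4, "ReplA", True, dab=1))
```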
[0034]
[0041] Figures 5A-C show three different examples of leader set and follower set layouts from a thread perspective for practicing effective set sampling and set dueling. Assuming a system with 32 single-threaded cores and 32 SCIs, 32 threads need to be evaluated per SCI. Thus, in a conventional system, SCI0 determines the competitive winner for threads 0, 1, 2, ..., 31; SCI1 determines the competitive winner for threads 0, 1, 2, ..., 31; and SCI31 determines the competitive winner for threads 0, 1, 2, ..., 31. The problem with this approach is that the number of sets per SCI is limited. Furthermore, it is difficult to scale to a higher number of cores. Figure 5A shows an example of a set layout from the perspective of two different threads (thread 0 and thread 1). The layout described in Figure 5A delegates SCI0 to determine only the competitive winner for thread 0, and SCI1 to determine only the competitive winner for thread 1. Although not shown in Figure 5A, other delegated SCIs may also be used to measure competitive winners in a similar manner to that described above for Figures 1-4.
[0035]
[0042] Layout 510 corresponds to the layout of the leader set and follower set for a delegated shared cache instance (e.g., SCI0) from the perspective of thread 0. Layout 520 corresponds to the layout of the leader set and follower set for a delegated shared cache instance (e.g., SCI1) from the perspective of thread 1. The legend in Figure 5A shows that thread 0 has leader sets for two different replacement policies (policy A and policy B). Similarly, as shown in the legend in Figure 5A, thread 1 has leader sets for two different replacement policies (policy A and policy B). Policy A and policy B could each be a component policy of a dynamic replacement algorithm such as Dynamic Re-Reference Interval Prediction (DRRIP). Thus, policy A could be a static RRIP (SRRIP) policy that uses a fixed re-reference interval prediction value across the entire shared cache instance. Policy B could be a bimodal RRIP (BRRIP) policy that inserts some cache blocks with a distant re-reference interval prediction and other cache blocks with a long re-reference interval prediction. The choice between these two insertions can be made probabilistically, with one being selected less frequently than the other.
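To make the difference between the two example policies concrete, the following Python sketch returns the re-reference prediction value (RRPV) assigned to a newly inserted cache block. It is a simplified illustration under stated assumptions: a 2-bit RRPV, "distant" encoded as the largest RRPV and "long" as one less, "long" being the less frequent BRRIP choice (as BRRIP is commonly described), and a probability of 1/32. The function name `insertion_rrpv` and the `rand` hook are hypothetical and are not taken from this disclosure.

```python
import random


def insertion_rrpv(policy, rrpv_bits=2, long_prob=1 / 32, rand=random.random):
    """Return the RRPV given to a newly inserted block under a policy."""
    distant = (1 << rrpv_bits) - 1  # largest value: distant re-reference
    long_interval = distant - 1     # one less: long re-reference
    if policy == "SRRIP":
        return long_interval        # fixed insertion value
    if policy == "BRRIP":
        # Bimodal: mostly distant, infrequently long (assumed 1/32).
        return long_interval if rand() < long_prob else distant
    raise ValueError(f"unknown policy: {policy}")
```

Passing a deterministic `rand` callable makes the bimodal branch easy to exercise in tests.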
[0036]
[0043] As shown in layout 510, whenever there is access to set 0, set 2, set 4, set 6, set 8, set 10, set 12, or set 14 from the perspective of thread 0, the cache replacement algorithm per policy A is used. Whenever there is access to set 1, set 3, set 5, set 7, set 9, set 11, set 13, or set 15 from the perspective of thread 0, the cache replacement algorithm per policy B is used. Any access to sets 16-32 (follower sets in this example) results in the implementation of the cache replacement policy determined by the dueling counter. As shown in layout 520, whenever there is access to set 0, set 2, set 4, set 6, set 8, set 10, set 12, or set 14 from the perspective of thread 1, the cache replacement algorithm per policy A is used. Whenever there is access to set 1, set 3, set 5, set 7, set 9, set 11, set 13, or set 15 from the perspective of thread 1, the cache replacement algorithm per policy B is used. Any access to sets 16-32 (follower sets in this example) results in the implementation of a cache replacement policy determined by the dueling counter. The winning cache replacement policy is sent to the shared cache instance as part of the DAB included in the command message. As mentioned earlier, when core0 / thread0 accesses SCI0, the winning policy bit (DAB) is collected and stored. Similarly, when core0 / thread0 accesses a different SCI, the DAB containing the winning policy information is sent to that SCI to implement the policy.
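The policy choice at a delegated SCI in layouts 510 and 520 can be sketched as below. The function name `policy_for_access` and the representation of policies as the strings "A" and "B" are assumptions for illustration; the default of 16 leader sets follows the set numbers listed above.

```python
def policy_for_access(set_index, counter_winner, num_leader_sets=16):
    """Policy applied to an access at the delegated SCI (layout 510/520).

    Even-numbered leader sets always use policy A and odd-numbered
    leader sets always use policy B; every other set is a follower
    set and uses the winner reported by the dueling counter.
    """
    if set_index < 0:
        raise ValueError("set index must be non-negative")
    if set_index < num_leader_sets:
        return "A" if set_index % 2 == 0 else "B"
    return counter_winner
```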
[0037]
[0044] Figure 5B shows an example of a set layout from the perspective of four different threads (thread 0, thread 1, thread 2, and thread 3). This example concerns multiple threads per delegated shared cache instance (DSCI). Each thread has its own dueling counter. Layout 540 corresponds to the layout (not shown) of the leader set and follower set for a delegated shared cache instance (e.g., SCI0) from the perspective of threads 0 and 1. Layout 550 corresponds to the layout (not shown) of the leader set and follower set for a delegated shared cache instance (e.g., SCI1) from the perspective of threads 2 and 3. The legend in Figure 5B indicates that each of thread 0, thread 1, thread 2, and thread 3 has leader sets for two different replacement policies (policy A and policy B).
[0038]
[0045] As mentioned above, policy A and policy B could each be a component policy of a dynamic replacement algorithm such as Dynamic Re-Reference Interval Prediction (DRRIP). Therefore, policy A could be a static RRIP (SRRIP) policy that uses a fixed re-reference interval prediction value across the entire shared cache instance. Policy B could be a bimodal RRIP (BRRIP) policy that inserts some cache blocks with a distant re-reference interval prediction and other cache blocks with a long re-reference interval prediction. The choice between these two insertions can be made probabilistically, with one being selected less frequently than the other.
[0039]
[0046] As shown in layout 540, whenever there is access to set 0, set 4, set 8, set 12, set 16, set 20, set 24, or set 28 of SCI0 from the perspective of thread 0, the cache replacement algorithm per policy A is used. Whenever there is access to set 1, set 5, set 9, set 13, set 17, set 21, set 25, or set 29 of SCI0 from the perspective of thread 0, the cache replacement algorithm per policy B is used. As shown in layout 540, whenever there is access to set 2, set 6, set 10, set 14, set 18, set 22, set 26, or set 30 of SCI0 from the perspective of thread 1, the cache replacement algorithm per policy A is used. Whenever there is access to set 3, set 7, set 11, set 15, set 19, set 23, set 27, or set 31 of SCI0 from the perspective of thread 1, the cache replacement algorithm per policy B is used. Any access to a set beyond these (a follower set in this example) results in the implementation of the cache replacement policy determined by the respective dueling counter. Since there is a dueling counter for each thread, two dueling counters are used in this example (one for thread 0 and one for thread 1).
[0040]
[0047] As shown in layout 550, whenever there is access to set 0, set 4, set 8, set 12, set 16, set 20, set 24, or set 28 of SCI1 from the perspective of thread 2, the cache replacement algorithm per policy A is used. Whenever there is access to set 1, set 5, set 9, set 13, set 17, set 21, set 25, or set 29 of SCI1 from the perspective of thread 2, the cache replacement algorithm per policy B is used. As shown in layout 550, whenever there is access to set 2, set 6, set 10, set 14, set 18, set 22, set 26, or set 30 of SCI1 from the perspective of thread 3, the cache replacement algorithm per policy A is used. Whenever there is access to set 3, set 7, set 11, set 15, set 19, set 23, set 27, or set 31 of SCI1 from the perspective of thread 3, the cache replacement algorithm per policy B is used. Any access to sets beyond set 31 (follower sets in this example) results in the implementation of the cache replacement policy determined by the respective dueling counter. Since there is a dueling counter for each thread, two dueling counters are used in this example (one for thread 2 and one for thread 3). The winning cache replacement policy is sent to the shared cache instance as part of the DAB included in the command message. As mentioned earlier, when core0 / thread0 accesses SCI0, the winning policy bit (DAB) is collected and stored. Similarly, when core0 / thread0 accesses a different SCI, a DAB containing the winning policy information is sent to that SCI to implement the policy.
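The leader sets of layouts 540 and 550 follow a stride-of-four pattern, which the sketch below encodes. It is only an illustration of the mapping as listed in the text: the function name `leader_role_5b`, the tuple return value, and the use of `None` for follower sets are assumptions.

```python
# Threads served by each delegated SCI in the Figure 5B example.
THREADS_PER_SCI = {0: (0, 1), 1: (2, 3)}


def leader_role_5b(sci, set_index):
    """Return (thread, policy) for a leader set, or None for a follower set."""
    if set_index < 0:
        raise ValueError("set index must be non-negative")
    if set_index > 31:
        return None  # sets beyond set 31 are follower sets
    slot = set_index % 4  # 0: first thread/A, 1: first/B, 2: second/A, 3: second/B
    thread = THREADS_PER_SCI[sci][slot // 2]
    policy = "AB"[slot % 2]
    return thread, policy
```

Each returned thread number also identifies which of the per-thread dueling counters a leader-set miss would train.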
[0041]
[0048] Figure 5C shows an example of a set layout from the perspective of four different threads (thread 0, thread 1, thread 2, and thread 3). This example has multiple threads per delegated shared cache instance (DSCI), and these threads are further distributed across two different DSCIs. Each thread has its own dueling counter. Layout 560 corresponds to the layout (not shown) of the leader set and follower set for a delegated shared cache instance (e.g., SCI0) from the perspective of threads 0, 1, 2, and 3. Layout 570 corresponds to the layout (not shown) of the leader set and follower set for a delegated shared cache instance (e.g., SCI1) from the perspective of threads 0, 1, 2, and 3. The legend in Figure 5C indicates that each of thread 0, thread 1, thread 2, and thread 3 has leader sets for two different replacement policies (policy A and policy B).
[0042]
[0049] As mentioned above, policy A and policy B could each be a component policy of a dynamic replacement algorithm such as Dynamic Re-Reference Interval Prediction (DRRIP). Therefore, policy A could be a static RRIP (SRRIP) policy that uses a fixed re-reference interval prediction value across the entire shared cache instance. Policy B could be a bimodal RRIP (BRRIP) policy that inserts some cache blocks with a distant re-reference interval prediction and other cache blocks with a long re-reference interval prediction. The choice between these two insertions can be made probabilistically, with one being selected less frequently than the other.
[0043]
[0050] As shown in layouts 560 and 570, whenever there is access to set 0, set 4, set 8, or set 12 of SCI0 or SCI1 from the perspective of thread 0, the cache replacement algorithm per policy A is used. Whenever there is access to set 1, set 5, set 9, or set 13 of SCI0 or SCI1 from the perspective of thread 0, the cache replacement algorithm per policy B is used. As shown in layouts 560 and 570, whenever there is access to set 2, set 6, set 10, or set 14 of SCI0 or SCI1 from the perspective of thread 1, the cache replacement algorithm per policy A is used. Whenever there is access to set 3, set 7, set 11, or set 15 of SCI0 or SCI1 from the perspective of thread 1, the cache replacement algorithm per policy B is used. Any access to sets beyond set 31 (follower sets in this example, not shown) results in an implementation of the cache replacement policy determined by the respective dueling counter. Since each thread has its own dueling counter, this example uses four dueling counters: one for thread 0, one for thread 1, one for thread 2, and one for thread 3.
[0044]
[0051] As shown in layouts 560 and 570, whenever there is access to set 16, set 20, set 24, or set 28 of SCI0 or SCI1 from the perspective of thread 2, the cache replacement algorithm per policy A is used. Whenever there is access to set 17, set 21, set 25, or set 29 of SCI0 or SCI1 from the perspective of thread 2, the cache replacement algorithm per policy B is used. As shown in layouts 560 and 570, whenever there is access to set 18, set 22, set 26, or set 30 of SCI0 or SCI1 from the perspective of thread 3, the cache replacement algorithm per policy A is used. Whenever there is access to set 19, set 23, set 27, or set 31 of SCI0 or SCI1 from the perspective of thread 3, the cache replacement algorithm per policy B is used. Any access to sets beyond set 31 (follower sets in this example, not shown) results in an implementation of the cache replacement policy determined by the respective dueling counter. Because each thread has a dueling counter, this example uses four dueling counters: one for thread 0, one for thread 1, one for thread 2, and one for thread 3. The winning cache replacement policy is sent to the shared cache instance as part of the DAB included in the command message. As mentioned earlier, when core0 / thread0 accesses SCI0 or SCI1, the winning policy bit (DAB) is collected and stored. Similarly, when core0 / thread0 accesses a different SCI, the DAB containing the winning policy information is sent to that SCI to implement the policy.
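In layouts 560 and 570 the same mapping applies to both SCI0 and SCI1: sets 0 through 15 hold the leader sets for threads 0 and 1, and sets 16 through 31 hold those for threads 2 and 3. A minimal sketch of that mapping follows; the function name `leader_role_5c` and the `None` return for follower sets are assumptions for illustration.

```python
def leader_role_5c(set_index):
    """Return (thread, policy) for a leader set in layout 560/570.

    The mapping is identical on SCI0 and SCI1. Sets beyond set 31
    are follower sets and yield None.
    """
    if set_index < 0:
        raise ValueError("set index must be non-negative")
    if set_index > 31:
        return None
    slot = set_index % 4                 # position within a group of four sets
    thread = (set_index // 16) * 2 + slot // 2
    policy = "AB"[slot % 2]
    return thread, policy
```

Compared with the Figure 5B example, each thread has half as many leader sets per SCI but is sampled on two delegated instances.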
[0045]
[0052] Figure 6 shows another exemplary shared cache instance (SCI) 600 for implementing effective set sampling and set dueling. In one example, the SCI circuit configuration 620 included in SCI 600 may be used not only to implement the steps described above with respect to flowchart 200 in Figure 2, but also to provide additional functionality for the system. For example, the dynamic algorithm bits (DAB) in this example are not just one bit of information, but multiple bits (e.g., N bits) of information. Therefore, policy information for implementing a particular cache algorithm may include additional information beyond the one-bit information described above with respect to Figure 3. The multiple dynamic algorithm bits (DAB) may include policy information obtained from other shared cache instances and may be used to augment or override the policy information for that SCI. In this example, similar to SCI 300 in Figure 3, the shared cache instance (SCI) 600 includes a cache module 610, which is used to store cache lines, cache addresses, and other attributes / metadata associated with the shared cache instance 600. Specific bits from inbound requests (e.g., from threads) related to cache access are communicated to the SCI circuit configuration 620.
[0046]
[0053] In this example, the inbound request message is also in packet form and contains several bits. These bits are labeled REQ.ADDR / ATTR, REQ.DAB[N:0], and REQ.THREADINFO in Figure 6. In this example, the REQ.ADDR / ATTR bits contain both the cache address and any attributes associated with the cache. In one example, the cache attributes may include information such as the size of each set in the shared cache instance, the nature of the associativity, the number of sets in the shared cache instance, the number of set bits, the number of tag bits, and other relevant information, as needed. In this example, the REQ.DAB[N:0] bits contain the dynamic algorithm bits. As mentioned above, the dynamic algorithm bits (DAB) may contain policy information obtained from other shared cache instances and may be used to augment or override the policy information for that SCI. In this example, the REQ.THREADINFO bits contain at least the thread number (e.g., thread 0, thread 1, or thread K) of the thread that initiated the cache access request.
[0047]
[0054] Continuing to refer to Figure 6, the SCI circuit configuration 620 includes the SCI's set dueling policy logic 630, SCI configuration 640, comparison logic 650, and override logic 660. The SCI's dueling policy logic 630 includes logic configured to implement set dueling, which pits two different cache algorithms (e.g., replacement policies ReplA and ReplB) against each other to identify a winner that provides the best hit rate (or equivalently, the lowest miss rate). The cache module 610 provides cache hit / miss information to the SCI's dueling policy logic 630. As previously mentioned, set dueling can be implemented using dueling counters. The SCI's dueling policy logic 630 includes logic that processes the output of the dueling counter (e.g., the most significant bit (MSB) associated with the counter) and can dynamically identify the winning policy based on at least the state of the counter. Misses in ReplA's leader set decrement the dueling counter, and misses in ReplB's leader set increment the dueling counter. In this way, the dueling counter indicates which policy generates more misses, and the opposite replacement policy is therefore selected to maximize hits. The selected replacement policy is provided as one of the inputs to override logic 660.
[0048]
[0055] Continuing to refer to Figure 6, the override logic 660 is also provided with policy information that is carried as part of the REQ.DAB[N:0] bits. The SCI configuration 640 is configured as a register and is used to store one of the thread numbers (e.g., thread 0), assuming that SCI 600 is a delegated shared cache instance for thread 0. The comparison logic 650 is used to compare the value stored in SCI configuration 640 with the information in the REQ.THREADINFO bits. The output of the comparison logic 650 (labeled DSCI) is provided as one of the control inputs to the override logic 660. As previously mentioned with respect to step 204 in Figure 2, if SCI 600 is a delegated shared cache instance (DSCI), as indicated by the DSCI control signal shown in Figure 6, for core 0 (thread 0 is mapped to core 0), then the bits included as part of REQ.DAB[N:0] may be ignored. Instead, the M-bit information received from the SCI's dueling policy logic 630 is used to implement a cache replacement policy for the SCI 600. Figure 6 shows the SCI 600 as containing a certain number of components arranged in a certain manner, but the SCI 600 may also contain additional or fewer components arranged differently. As an example, other logic, including a finite state machine, can be used to implement some of the functionality associated with the SCI circuit configuration 620. Furthermore, while Figure 6 describes the SCI 600 in the context of a cache replacement algorithm, the SCI 600 may also be configured for use with other cache algorithms, including insertion and allocation algorithms.
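The interaction between comparison logic 650 and override logic 660 amounts to a two-input selection, sketched below. The function and parameter names are hypothetical, and policy information is modeled as a plain integer; the actual field widths are not assumed here.

```python
def override_logic(req_dab, req_thread, sci_config_thread, local_winner):
    """Select the policy information an SCI applies to a request.

    req_dab:           policy bits carried in REQ.DAB[N:0]
    req_thread:        thread number carried in REQ.THREADINFO
    sci_config_thread: thread number held in SCI configuration 640
    local_winner:      output of the SCI's dueling policy logic 630
    """
    # Comparison logic 650: is this SCI the delegated SCI for the thread?
    is_dsci = req_thread == sci_config_thread
    # Override logic 660: a delegated SCI ignores the incoming DAB.
    return local_winner if is_dsci else req_dab
```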
[0049]
[0056] Figure 7 shows a flowchart 700 of an exemplary method for selecting a cache algorithm based on effective set sampling and set dueling. In one example, this method relates to selecting a cache algorithm in a system having multiple cores and multiple shared cache instances accessible to any of the multiple cores, wherein the system can be configured to run multiple threads. In one example, the steps related to this method may be performed by various components of the aforementioned system (e.g., system 100 in Figure 1, SCI300 in Figure 3, and / or SCI600 in Figure 6). Step 710 includes one of the multiple shared cache instances receiving a request associated with a thread, which includes policy information specifying at least one of two cache algorithms for implementation by the shared cache instance for any request associated with the thread. As mentioned above, requests for cache access may be received by any shared cache instance, including the delegated shared cache instance mentioned above. As an example, Figure 2 shows that in step 202, a core sends an outbound request to a shared cache instance. The policy information included as part of the request may include the DAB bit or the DAB[N:0] bit. Further details regarding the function and processing of policy information within the context of the system described here were previously mentioned with reference to Figures 1-6.
[0050]
[0057] Step 720 includes implementing at least one of the two cache algorithms specified by the policy information received as part of the request associated with the thread, unless the shared cache instance is identified, from among the shared cache instances, as a delegated shared cache instance for determining the winner of at least two cache algorithms for use with any request associated with the thread. For example, as described with respect to Figure 2, step 204 determines whether the SCI circuit configuration associated with the SCI that received the outbound request from the core (e.g., SCI circuit configuration 320 in Figure 3 or SCI circuit configuration 620 in Figure 6) is a delegated shared cache instance (DSCI) for the core that sent the outbound request. If the answer is "no", then the cache replacement policy defined by the DAB included as part of the outbound request from the core is implemented for any cache access requests for the thread (e.g., thread 0) that caused the core (e.g., core 0 102 in Figure 1) to send the outbound request, as shown in the example in step 206 of Figure 2. Policy information included as part of a request may include the DAB bit or the DAB[N:0] bits. For example, the DAB is included as part of the metadata portion of an outbound request. On the other hand, if the answer to the query in step 204 of Figure 2 is "yes", the DAB is ignored in step 208 of Figure 2. Further details regarding the function and processing of policy information in the context of the system described here are mentioned above with respect to Figures 1-6.
[0051]
[0058] Figure 8 shows a flowchart 800 of another exemplary method for selecting a cache algorithm based on effective set sampling and set dueling. In one example, this method relates to selecting a cache algorithm in a system having multiple cores and multiple shared cache instances accessible to any of the multiple cores, wherein the system can be configured to run multiple threads. In one example, the steps related to this method may be performed by various components of the aforementioned system (e.g., system 100 in Figure 1, SCI 300 in Figure 3, and / or SCI 600 in Figure 6). Step 810 includes designating a shared cache instance as a first delegated shared cache instance to determine the winner of at least two cache algorithms for any access request associated with a thread. In one example, the winner is determined using a first set dueling counter associated with the first delegated shared cache instance. As previously mentioned, if a request associated with a thread accesses a leader set on the delegated shared cache instance and the request results in a cache miss, the set dueling counter is incremented or decremented. As an example, steps 404, 408, and 412 described above with respect to Figure 4 provide additional details for determining the winner by using a delegated shared cache instance. The set dueling counter can be incremented or decremented based on cache miss or cache hit detection. For example, when detecting a cache miss, the set dueling counter can be incremented for misses to the leader set related to policy A, and decremented for misses to the leader set related to policy B. If the set dueling counter has a value less than half of its maximum value, policy B is experiencing more misses than policy A. Therefore, in this case, policy A is selected. As another example, when detecting a cache hit, the set dueling counter can be incremented for hits to the leader set related to policy A, and decremented for hits to the leader set related to policy B. If the set dueling counter has a value less than half of its maximum value, policy B has generated more hits (fewer misses) than policy A. Therefore, in this case, policy B is selected.
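The two counter interpretations described above (training on misses versus training on hits) can be contrasted in a short sketch. The function name `select_policy`, the reading of "half" as half of the counter range, and the choice made when the counter is at or above that point are assumptions; the text only states the below-half case.

```python
def select_policy(counter_value, counter_bits, trained_on):
    """Interpret a set dueling counter that increments for policy A
    leader-set events and decrements for policy B leader-set events.

    trained_on is "miss" or "hit", naming the event that moves the
    counter.
    """
    half = 1 << (counter_bits - 1)  # assumed meaning of "half"
    below_half = counter_value < half
    if trained_on == "miss":
        # Below half: policy B missed more often, so policy A wins.
        return "A" if below_half else "B"
    if trained_on == "hit":
        # Below half: policy B hit more often, so policy B wins.
        return "B" if below_half else "A"
    raise ValueError("trained_on must be 'miss' or 'hit'")
```

The same counter value therefore selects opposite policies depending on which event drives it, which is why the training convention has to be fixed per implementation.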
[0052]
[0059] Step 820 involves designating another shared cache instance as a second delegated shared cache instance to determine the winner of at least two cache algorithms for any access requests associated with the thread. In one example, the winner is determined using a second set dueling counter associated with the second delegated shared cache instance. As previously mentioned, if a request associated with the thread accesses a leader set on the delegated shared cache instance and that request results in a cache miss, the set dueling counter is incremented or decremented. As an example, steps 404, 408, and 412 described above with respect to Figure 4 provide additional details for determining the winner by using a delegated shared cache instance.
[0053]
[0060] Step 830 involves communicating policy information to each of the multiple cores specifying the winner of at least two cache algorithms. For example, as previously mentioned with respect to Figure 1, the interconnect 110 in Figure 1 includes multiple switches (e.g., switches 112, 114, 116, 118, 120, 122, 124, 126, 128, and 130) that enable the exchange of both commands (e.g., read access commands or write access commands) and data between the cores and shared cache instances included as part of the SSLC 160 in Figure 1. As previously mentioned, the SSLC 160 in Figure 1 includes several shared cache instances, including a Delegated Shared Cache Instance (DSCI).
[0054]
[0061] Step 840 includes, when one of several shared cache instances receives a request for cache access associated with a thread, implementing one of at least two cache algorithms specified by the policy information received as part of the request for cache access, unless the shared cache instance that received the request is identified as the first delegated shared cache instance or the second delegated shared cache instance. For example, as described with respect to Figure 2, step 204 determines whether the SCI circuit configuration associated with any SCI that received an outbound request from the core (e.g., SCI circuit configuration 320 in Figure 3 or SCI circuit configuration 620 in Figure 6) is a delegated shared cache instance (DSCI) for the core that sent the outbound request. If the answer is "no", then the cache replacement policy defined by the DAB included as part of the outbound request from the core is implemented for any cache access requests relating to the thread (e.g., thread 0) that caused the core (e.g., core 0 102 in Figure 1) to send the outbound request, as shown in the example in step 206 of Figure 2. Policy information included as part of a request may include the DAB bit or the DAB[N:0] bits. For example, the DAB is included as part of the metadata portion of an outbound request. On the other hand, if the answer to the query in step 204 of Figure 2 is "yes", the DAB is ignored in step 208 of Figure 2. Further details regarding the function and processing of policy information in the context of the system described here are mentioned above with respect to Figures 1-6.
[0055]
[0062] In conclusion, this disclosure relates to a method for selecting a cache algorithm in a system having multiple cores and multiple shared cache instances accessible to any of the multiple cores, wherein the system is configurable to execute threads. This method may include one of the multiple shared cache instances receiving a request associated with a thread, the request including policy information specifying at least one of two cache algorithms for implementation by the shared cache instance for any request associated with the thread.
[0056]
[0063] This method may further include implementing at least one of two cache algorithms specified by policy information received as part of a request associated with a thread, unless the shared cache instance is identified as a delegated shared cache instance for determining the winner of at least two cache algorithms for use in any request associated with a thread.
[0057]
[0064] This method may further include ignoring policy information if the shared cache instance is identified as a delegated shared cache instance. This method may further include implementing the winner of at least two cache algorithms for the delegated shared cache instance if the request associated with the thread is not accessing a leader set.
[0058]
[0065] This method may further include implementing the policy specified by the leader set regarding the delegated shared cache instance if the request associated with the thread is accessing the leader set. The winner is determined using the set dueling counter, and this method may further include incrementing or decrementing the set dueling counter based on either a cache hit or cache miss if the request associated with the thread is accessing the leader set.
[0059]
[0066] This method may further include updating policy information received as part of a request associated with a thread when a set dueling counter reaches a predetermined state. This method may further include returning the updated policy information to each of the multiple cores as part of a response message from the shared cache instance to ensure that any future request from the thread includes the updated policy information specifying at least one of two cache algorithms for implementation by the shared cache instance.
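The core-side half of this exchange (store the latest policy information, attach it to every outgoing request) can be sketched as follows. The class name `CoreDabStore`, the dictionary-based message format, and the default DAB of 0 for a thread that has not yet received a response are all assumptions made for illustration.

```python
class CoreDabStore:
    """Per-thread store of the most recent policy information (DAB)."""

    def __init__(self, default_dab=0):
        self._default = default_dab  # assumed value before any response
        self._dab = {}

    def build_request(self, thread, addr):
        """Attach the thread's current DAB to an outgoing cache request."""
        return {
            "thread": thread,
            "addr": addr,
            "dab": self._dab.get(thread, self._default),
        }

    def on_response(self, thread, response):
        """Record updated policy information carried in a response message."""
        if "dab" in response:
            self._dab[thread] = response["dab"]
```

Because the policy information travels with each request, a non-delegated shared cache instance needs no per-thread dueling state of its own.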
[0060]
[0067] In another example, the disclosure relates to a system having multiple cores and multiple shared cache instances accessible to any of the multiple cores, wherein the system is configurable to execute threads. The system may include one of the multiple shared cache instances to receive requests associated with a thread, and this request includes policy information to specify at least one of two cache algorithms for implementation by the shared cache instance for any request associated with the thread.
[0061]
[0068] The system may further include a shared cache instance circuit configuration associated with a shared cache instance and configured to process policy information received as part of a request associated with a thread. The shared cache instance circuit configuration may further be configured to instruct a shared cache instance to implement at least one of two cache algorithms, unless that shared cache instance is identified as a delegated shared cache instance for determining the winner of at least two cache algorithms for use in any request associated with a thread.
[0062]
[0069] The system may be further configured to ignore policy information if a shared cache instance is identified as a delegated shared cache instance. The system may be further configured to implement the winner of at least two cache algorithms for a delegated shared cache instance if the request associated with a thread is not accessing a leader set.
[0063]
[0070] The system may be further configured to implement the policy specified by the leader set for the delegated shared cache instance when a request associated with a thread is accessing the leader set. The winner is determined using a set dueling counter, and the system may be further configured to increment or decrement the set dueling counter based on either a cache hit or cache miss when a request associated with a thread is accessing the leader set.
[0064]
[0071] The system may be further configured to update policy information received as part of a request associated with a thread when a set dueling counter reaches a predetermined state. The system may be further configured to return the updated policy information to each of the multiple cores as part of a response message from the shared cache instance to ensure that any future request associated with a thread includes the updated policy information specifying at least one of two cache algorithms for implementation by the shared cache instance.
[0065]
[0072] In yet another example, the disclosure relates to a method for selecting a cache algorithm in a system having multiple cores and multiple shared cache instances accessible to any of the multiple cores, wherein the system is configurable to execute threads. The method may include designating a shared cache instance as a first delegated shared cache instance for determining the winner of at least two cache algorithms for access requests associated with a thread. The method may further include designating another shared cache instance as a second delegated shared cache instance for determining the winner of at least two cache algorithms for access requests associated with a thread.
[0066]
[0073] The method may further include communicating, to each of the multiple cores, policy information that specifies the winner of the at least two cache algorithms. When one of the multiple shared cache instances receives a request for cache access associated with a thread, the method may further include implementing the one of the at least two cache algorithms specified by the policy information received as part of the request for cache access, unless the shared cache instance that received the request is identified as the first delegated shared cache instance or the second delegated shared cache instance.
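From the point of view of a core, the communication of policy information described above might look like the following sketch. It is only an illustration under assumptions: the class `Core`, its methods, the dictionary-based request format, and the labels `"ALG_A"` and `"ALG_B"` are hypothetical and do not appear in the disclosure.

```python
from typing import Dict


class Core:
    """Core-side view: remembers the last communicated winner per thread."""

    def __init__(self, default_policy: str = "ALG_A") -> None:
        # Assumed default used before any winner has been communicated.
        self.default_policy = default_policy
        self.thread_policy: Dict[int, str] = {}

    def build_request(self, thread_id: int, address: int) -> dict:
        # Every request carries the policy information for its thread.
        policy = self.thread_policy.get(thread_id, self.default_policy)
        return {"thread_id": thread_id, "address": address, "policy": policy}

    def apply_policy_update(self, thread_id: int, winner: str) -> None:
        # Called when policy information specifying the winner is
        # communicated to this core.
        self.thread_policy[thread_id] = winner
```

Keeping the policy per thread, as assumed here, lets different threads on the same core carry different winners in their requests.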
[0067]
[0074] The winner is determined using a first set dueling counter associated with the first delegated shared cache instance, or using a second set dueling counter associated with the second delegated shared cache instance. Where the winner is determined using the first set dueling counter, the method may further include incrementing or decrementing the first set dueling counter based on either a cache hit determination or a cache miss determination if the request associated with the thread is accessing a leader set of the first delegated shared cache instance.
[0068]
[0075] The method may further include implementing the policy specified by the leader set for the first delegated shared cache instance if the request associated with the thread is accessing that leader set. Where the winner is determined using the second set dueling counter associated with the second delegated shared cache instance, the method may further include incrementing or decrementing the second set dueling counter based on either a cache hit or a cache miss if the request associated with the thread is accessing a leader set of the second delegated shared cache instance. The method may further include implementing the policy specified by the leader set for the second delegated shared cache instance if the request associated with the thread is accessing that leader set.
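The arrangement with two delegated shared cache instances, each holding its own set dueling counter, can be sketched as below. This is a minimal sketch under assumptions: the class `DelegatedInstance`, the signed counter, the miss-only update rule, the leader-set indices, and the labels `"ALG_A"` and `"ALG_B"` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class DelegatedInstance:
    """A delegated shared cache instance with its own set dueling counter."""

    leader_sets: Dict[int, str]  # set index -> policy fixed for that leader set
    counter: int = 0             # signed counter; > 0 favors ALG_B (assumed)

    def winner(self) -> str:
        return "ALG_B" if self.counter > 0 else "ALG_A"

    def access(self, set_index: int, hit: bool) -> str:
        """Handle one access and return the policy applied to it."""
        policy: Optional[str] = self.leader_sets.get(set_index)
        if policy is None:
            # Not a leader set: apply this instance's current winner.
            return self.winner()
        if not hit:
            # Assumed convention: a miss counts against the leader's policy.
            self.counter += 1 if policy == "ALG_A" else -1
        # Leader set: apply the policy specified by the leader set.
        return policy


# Hypothetical leader-set assignments for the two delegated instances.
first = DelegatedInstance(leader_sets={0: "ALG_A", 1: "ALG_B"})
second = DelegatedInstance(leader_sets={4: "ALG_A", 5: "ALG_B"})
```

In this sketch a leader-set access at one delegated instance changes only that instance's counter, so the two instances can reach their winners independently.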
[0069]
[0076] It should be understood that the methods, modules, and components described herein are merely illustrative. Alternatively, or additionally, the functions described herein may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that may be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), and complex programmable logic devices (CPLDs). In an abstract, but still definite, sense, any arrangement of components to achieve the same function is effectively “associated” such that the desired function is achieved. Therefore, any two components combined herein to achieve a particular function can be seen as “associated with” each other such that the desired function is achieved, irrespective of architectures or intermediate components. Similarly, any two components so associated can also be viewed as being “operably connected” or “coupled” to each other to achieve the desired function. The mere fact that a component, which may be a device, a structure, a system, or any other implementation of a function, is described herein as being coupled to another component does not mean that the components are necessarily separate components. For example, component A, which is described as being coupled to another component B, may be a subcomponent of component B, or component B may be a subcomponent of component A, or components A and B may be a combined subcomponent of another component C.
[0070]
[0077] The functionality associated with some of the examples described herein can also include instructions stored on non-transitory media. As used herein, the term “non-transitory media” refers to any media that store data and/or instructions that cause a machine to operate in a specific manner. Exemplary non-transitory media include non-volatile media and/or volatile media. Non-volatile media include, for example, hard disks, solid-state drives, magnetic disks or tapes, optical disks or tapes, flash memory, EPROM, NVRAM, PRAM, or other such media, or networked versions of such media. Volatile media include, for example, dynamic memory such as DRAM, SRAM, cache, or other such media. Non-transitory media are distinct from transmission media, but can be used in conjunction with transmission media. Transmission media are used to transfer data and/or instructions to and from a machine. Exemplary transmission media include coaxial cables, fiber optic cables, copper wires, and wireless media such as radio waves.
[0071]
[0078] Furthermore, those skilled in the art will understand that the boundaries between the functions of the operations described above are merely illustrative. The functions of multiple operations may be combined into a single operation, and/or the functions of a single operation may be distributed across additional operations. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be modified in various other embodiments.
[0072]
[0079] While this disclosure provides specific examples, various modifications and changes can be made without departing from the scope of this disclosure as set forth in the following claims. Therefore, this specification and the drawings should be interpreted as illustrative rather than restrictive, and all such modifications are intended to be within the scope of this disclosure. Any benefit, advantage, or solution to a problem described herein in relation to a specific example is not intended to be construed as a critical, required, or essential feature or element of any or all claims.
[0073]
[0080] Furthermore, the terms “a” or “an,” as used herein, are defined as one or more than one. Also, the use of introductory phrases such as “at least one” and “one or more” in a claim should not be construed to imply that the introduction of another claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an.” The same holds true for the use of definite articles.
[0074]
[0081] Unless otherwise specified, terms such as "first" and "second" are used to arbitrarily distinguish between the elements referred to by such terms. Therefore, these terms are not necessarily intended to indicate a temporal or other priority of such elements.
Claims
1. A method for selecting a cache algorithm in a system (100) having a plurality of cores (102, 104, 106, 108) and a plurality of shared cache instances (142, 144, 146, 148) accessible to any of the plurality of cores (102, 104, 106, 108), wherein the system (100) is configurable to execute threads, the method comprising: a shared cache instance (142, 300), from among the plurality of shared cache instances (142, 144, 146, 148), receiving a request associated with a thread, wherein the request includes policy information specifying at least one of two cache algorithms for implementation by the shared cache instance (142, 300) for any request associated with the thread (step 710); and the shared cache instance (142, 300) implementing the at least one of the two cache algorithms specified by the policy information received as part of the request associated with the thread, unless the shared cache instance (142, 300) is identified as a delegated shared cache instance for determining, from among the plurality of shared cache instances (142, 144, 146, 148), the winner of the two cache algorithms to be used for any request associated with the thread (step 720).
2. The method according to claim 1, further comprising ignoring the policy information when the shared cache instance is identified as the delegated shared cache instance.
3. The method according to claim 2, further comprising implementing the winner of the two cache algorithms for the delegated shared cache instance if the request associated with the thread is not accessing a leader set.
4. The method according to claim 2, further comprising implementing the policy specified by a leader set for the delegated shared cache instance if the request associated with the thread is accessing the leader set.
5. The method according to claim 1, wherein the winner is determined using a set dueling counter, and the method further comprises incrementing or decrementing the set dueling counter based on either a cache hit determination or a cache miss determination if the request associated with the thread is accessing a leader set.
6. The method according to claim 5, further comprising updating the policy information received as part of the request associated with the thread when the set dueling counter reaches a predetermined state.
7. The method according to claim 6, further comprising returning the updated policy information to each of the plurality of cores as part of a response message from the shared cache instance to ensure that any future request from the thread includes the updated policy information for specifying at least one of the two cache algorithms for implementation by the shared cache instance.
8. A system (100) having a plurality of cores (102, 104, 106, 108) and a plurality of shared cache instances (142, 144, 146, 148) accessible to any of the plurality of cores (102, 104, 106, 108), wherein the system (100) is configurable to execute threads, the system (100) comprising: a shared cache instance (142, 300), from among the plurality of shared cache instances (142, 144, 146, 148), for receiving a request associated with a thread, wherein the request includes policy information specifying at least one of two cache algorithms for implementation by the shared cache instance (142, 300) for any request associated with the thread; and a shared cache instance circuit configuration (320), associated with the shared cache instance (142, 300), configured to (1) process the policy information received as part of the request associated with the thread, and (2) instruct the shared cache instance (142, 300) to implement the at least one of the two cache algorithms, unless the shared cache instance (142, 300) is identified by the shared cache instance circuit configuration (320) as a delegated shared cache instance for determining, from among the plurality of shared cache instances (142, 144, 146, 148), the winner of the two cache algorithms to be used for any request associated with the thread.
9. The system according to claim 8, further configured to ignore the policy information when the shared cache instance is identified as the delegated shared cache instance.
10. The system according to claim 9, further configured to implement the winner of the two cache algorithms for the delegated shared cache instance if the request associated with the thread is not accessing a leader set.
11. The system according to claim 9, further configured to implement the policy specified by a leader set for the delegated shared cache instance if the request associated with the thread is accessing the leader set.
12. The system according to claim 8, wherein the winner is determined using a set dueling counter, and the system is further configured to increment or decrement the set dueling counter based on either a cache hit determination or a cache miss determination if the request associated with the thread is accessing a leader set.
13. The system according to claim 12, further configured to update the policy information received as part of the request associated with the thread when the set dueling counter reaches a predetermined state.
14. The system according to claim 13, further configured to return the updated policy information to each of the plurality of cores as part of a response message from the shared cache instance, in order to ensure that any future request associated with the thread includes updated policy information for specifying at least one of the two cache algorithms for implementation by the shared cache instance.