Methods for performing atomic memory operations on contended cache lines

In multiprocessor systems, contention detection, optionally combined with artificial intelligence or machine learning techniques, is used to decide when AMO instructions should be executed remotely. This mitigates the performance degradation caused by cache line contention and improves the efficiency of the processing system.

CN115956237B · Active · Publication Date: 2026-06-30 · SIFIVE INC

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SIFIVE INC
Filing Date
2021-08-03
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

In multiprocessor systems, atomic memory operations (AMO) suffer from performance degradation due to cache line contention, and existing technologies struggle to effectively address the ping-pong effect of cache lines across multiple processors.

Method used

A contention detection mechanism, optionally combined with artificial intelligence/machine learning techniques, determines whether a cache line is contended. When it is, the AMO instruction is executed remotely at a lower level of the memory hierarchy instead of locally, which avoids the latency of bouncing the cache line between the local cache and lower-level memory structures.

Benefits of technology

It improves the performance of the processing system, reduces latency and ping-pong effect caused by cache line contention, and enhances the efficiency of AMO operations.

✦ Generated by Eureka AI based on patent content.


Abstract

Methods and systems for atomic memory operations with contended cache lines are described. A processing system includes: at least two cores, each core having a local cache; and a lower-level cache in communication with each local cache. A local cache is configured to: request a cache line in order to execute an atomic memory operation (AMO) instruction; receive the cache line via the lower-level cache; receive a probe downgrade caused by another local cache requesting the cache line before the AMO instruction is executed; and, in response to the probe downgrade, send the AMO instruction to the lower-level cache for remote execution.

Description

Technical Field

[0001] This disclosure relates to caches, and more specifically to a method for performing atomic memory operations with contended cache lines.

Background Technology

[0002] A cache is a hardware and/or software component that stores data (data cache) or instructions (instruction cache) so that future requests for that data or those instructions can be served faster. A cache hierarchy typically includes one or more dedicated caches connected to one or more shared caches, which in turn are connected to a backing store or memory.

[0003] In shared-memory multiprocessor systems, caches typically operate under the constraints of cache coherence protocols and coherence mechanisms. This ensures that changes to the values of shared data in a data cache are propagated throughout the shared-memory multiprocessor system in a timely manner. Two common cache coherence protocols are, for example, the Modified-Exclusive-Shared-Invalid (MESI) protocol and the Modified-Shared-Invalid (MSI) protocol. In implementations, the exclusive coherence protocol state can be referred to as the unique coherence protocol state. Typically, in the modified coherence protocol state, the cache line exists only in the current cache and is dirty. That is, the data in the cache line differs from the data in the backing store or memory. In this case, the data cache needs to write the data back to the backing store at some point in the future, before any further reads of the (no longer valid) backing store are allowed. Upon performing the write-back, the cache line changes to the shared coherence protocol state. In the exclusive coherence protocol state, the cache line exists only in the current data cache and is clean. That is, the data in the cache line matches the data in the backing store. The cache line can change to the shared coherence protocol state at any time in response to a read request. Alternatively, the cache line can change to the modified coherence protocol state when it is written. In the shared coherence protocol state, the cache line may also be stored in other caches of the system and is clean. That is, the data in the cache line matches the data in the backing store. The cache line can be discarded (changed to the invalid coherence protocol state) at any time. In the invalid coherence protocol state, the cache line is invalid (unused). In a write-back data cache, a store (or multiple stores) can be issued to cache line(s) or cache block(s) in a "clean" (invalid, shared, or exclusive) coherence protocol state, which is typically defined as having read-only permissions. Writes can be performed freely only once the cache line is established in, or upgraded to, the modified coherence protocol state. A cache line in the exclusive coherence protocol state must also be upgraded to the modified coherence protocol state for the write to be globally visible. Coherence protocol upgrades can be accomplished using coherence mechanisms such as snooping, where each data cache monitors address lines for accesses to memory locations that it has cached, or a directory, where a backing controller remembers which cache(s) have which coherence permissions on which cache block(s).
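The state changes described in paragraph [0003] can be summarized as a small transition table. The following Python sketch is illustrative only: the event names (`write_back`, `remote_read`, `local_write`, `upgrade`, `discard`) are hypothetical labels chosen here, and only the transitions that the paragraph explicitly mentions are encoded.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"   # also called "unique" in the text
    SHARED = "S"
    INVALID = "I"


# (current state, event) -> next state. Event names are hypothetical labels;
# only transitions stated in the paragraph above are listed.
TRANSITIONS = {
    (State.MODIFIED, "write_back"): State.SHARED,
    (State.EXCLUSIVE, "remote_read"): State.SHARED,
    (State.EXCLUSIVE, "local_write"): State.MODIFIED,
    (State.SHARED, "upgrade"): State.MODIFIED,
    (State.SHARED, "discard"): State.INVALID,
}


def next_state(state, event):
    """Return the next state, or raise if the transition is not described above."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition for {state.name} on {event!r}") from None


def is_dirty(state):
    """Only a modified line differs from the backing store."""
    return state is State.MODIFIED


def may_write(state):
    """Writes are only free once the line is in the modified state."""
    return state is State.MODIFIED
```

For example, `next_state(State.SHARED, "upgrade")` yields `State.MODIFIED`, which is the upgrade a local cache has to obtain before it can perform a write.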

[0004] Atomic memory operations (AMOs) are uninterruptible read-modify-write memory operations. In other words, an AMO is, for example, a load-add-store memory operation that must complete as a single step. When an AMO is performed locally by a cache, there can be a delay between the moment the cache requests an upgrade of the cache line coherence protocol state to the modified state (at which point other caches are snooped and invalidated so that the requesting cache obtains the cache line uniquely (shared -> modified)) and the moment the cache is able to perform the AMO. During this delay, another cache can request the same cache line, which results in a snoop probe to the original requesting cache. This can cause a single cache line to ping-pong between two or more caches, degrading the performance of AMOs on contended cache lines.

Brief Description of the Drawings

[0005] This disclosure is best understood from the following detailed description when read in conjunction with the accompanying drawings. It should be emphasized that, in accordance with common practice, the various features in the drawings are not drawn to scale. Rather, for clarity, the dimensions of the various features have been arbitrarily enlarged or reduced.

[0006] Figure 1 is a block diagram of an example processing system for implementing atomic memory operations with contended cache lines, according to embodiments of the present disclosure.

[0007] Figure 2 is a flowchart of an example technique or method for implementing atomic memory operations with contended cache lines, according to embodiments of this disclosure.

[0008] Figure 3 is a flowchart of an example technique or method for implementing atomic memory operations with contended cache lines, according to embodiments of this disclosure.

[0009] Figure 4 is a flowchart of an example technique or method for implementing atomic memory operations with contended cache lines, according to embodiments of this disclosure.

[0010] Figure 5 is a flowchart of an example technique or method for implementing atomic memory operations with contended cache lines, according to embodiments of this disclosure.

[0011] Figure 6 is a flowchart of an example technique or method for implementing atomic memory operations with contended cache lines, according to embodiments of this disclosure.

[0012] Figure 7 is a flowchart of an example technique or method for implementing atomic memory operations with contended cache lines, according to embodiments of this disclosure.

Detailed Description

[0013] This document discloses systems and methods for implementing atomic memory operations with contended cache lines. A multiprocessor system can include multiple processors and shared memory. Each processor may have, or have access to, a local cache or L1 cache that is connected to one or more shared or remote caches, which in turn are connected to a backing store or memory (collectively, a “memory hierarchy”).

[0014] A processor needs to execute an atomic memory operation (AMO) instruction. The processor's local cache requests the cache line from a lower level of the memory hierarchy in order to execute the AMO instruction. In one implementation, if the local cache receives a snoop probe while the AMO instruction is awaiting execution, the local cache does not execute the AMO instruction locally. Instead, it changes the AMO instruction so that it executes remotely at another level of the memory hierarchy that is closer to a common root shared between the local cache and the cache that caused the snoop probe. In another implementation, if the other level of the memory hierarchy identifies or determines that the requested cache line is a contended cache line, the other level sends a contended cache line message to the local cache. The local cache responds to the contended cache line message by sending the AMO instruction to be executed remotely at the other level of the memory hierarchy. In implementations, whether a cache line is a contended cache line can be determined based on various factors, including but not limited to: a least recently used (LRU) algorithm, inputs from other involved caches, the presence of inclusive cache bits, matching transactions currently in flight or buffered from another cache, whether the L2 has the cache line at all, whether the L2 has the cache line in a shared or unique coherence state, matching probes from a lower-level LN cache, matching evictions from the L2 cache, a prediction table of recently accessed cache lines that are likely to be contended, and a Bloom filter for cache lines that are unlikely to be contended. In implementations, the determination can use contention detection mechanisms, techniques based on artificial intelligence and machine learning, or a combination of the above.
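One of the factors listed in paragraph [0014] is a prediction table of recently accessed cache lines that are likely to be contended. The disclosure does not specify how such a table is organized, so the following Python sketch is only one possible reading: a small, bounded table of saturating counters. The class name `ContentionPredictor` and the parameters `capacity`, `threshold`, and `max_count` are assumptions made for illustration.

```python
from collections import OrderedDict


class ContentionPredictor:
    """Hypothetical bounded table of saturating conflict counters per cache line."""

    def __init__(self, capacity=4, threshold=2, max_count=3):
        self.capacity = capacity      # maximum number of tracked lines
        self.threshold = threshold    # count at which a line is predicted contended
        self.max_count = max_count    # saturation limit of each counter
        self._table = OrderedDict()   # line address -> counter, oldest update first

    def record_conflict(self, line_addr):
        """Note that a request for this line collided with another cache's request."""
        count = self._table.pop(line_addr, 0)
        self._table[line_addr] = min(count + 1, self.max_count)
        if len(self._table) > self.capacity:
            # Drop the entry that was updated least recently.
            self._table.popitem(last=False)

    def is_contended(self, line_addr):
        """Predict contention once enough conflicts have been recorded."""
        return self._table.get(line_addr, 0) >= self.threshold
```

A lower-level cache could consult `is_contended` when a request arrives and combine the answer with the other listed factors; how the factors are weighted is left open by the text.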

[0015] Performing atomic memory operations on contended cache lines in this way improves processing system performance by mitigating the ping-pong of the requested cache line between the local cache and lower-level memory structures. The techniques can be applied to weak memory ordering (WMO) models such as those in RISC-V and ARM processors, total store ordering (TSO) models such as those in x86 processors, and others.
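The ping-pong effect can be made concrete with a toy model. The Python sketch below is not taken from the disclosure: it assumes a single cache line, treats every AMO as a fetch-and-add, and simply counts how often ownership of the line would have to move between cores when each AMO is executed locally. The helper names `amo_add` and `count_line_transfers` are hypothetical.

```python
def amo_add(memory, address, operand):
    """Read-modify-write as one step: return the old value and store old + operand."""
    old = memory[address]
    memory[address] = old + operand
    return old


def count_line_transfers(amo_order):
    """Count ownership moves of one cache line when every AMO runs locally.

    amo_order lists the core ids in the order in which their AMOs obtain the line.
    The first fill is not counted; every later change of owner is one transfer.
    """
    owner = None
    transfers = 0
    for core in amo_order:
        if owner is not None and core != owner:
            transfers += 1
        owner = core
    return transfers


# Two cores alternate on the same line: local execution moves the line each time.
amo_order = [0, 1, 0, 1, 0, 1]
local_transfers = count_line_transfers(amo_order)

# With remote execution the shared level applies every AMO itself, so in this
# model the line never moves between the local caches; the result is the same.
memory = {0x40: 0}
for _ in amo_order:
    amo_add(memory, 0x40, 1)
```

In this model the alternating pattern needs five ownership transfers for six local AMOs, while remote execution at the shared level needs none and still produces the same final value.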

[0016] These and other aspects of this disclosure are disclosed in the following detailed description, the appended claims and the accompanying drawings.

[0017] As used herein, the term “processor” refers to one or more processors, such as one or more special-purpose processors, one or more digital signal processors (DSPs), one or more microprocessors, one or more controllers, one or more microcontrollers, one or more application processors, one or more central processing units (CPUs), one or more graphics processing units (GPUs), one or more application-specific integrated circuits (ASICs), one or more application-specific standard products, one or more field-programmable gate arrays, any other type or combination of integrated circuits, one or more state machines, or any combination thereof.

[0018] The term "circuit" refers to an arrangement of electronic components (e.g., transistors, resistors, capacitors, and / or inductors) configured to perform one or more functions. For example, a circuit may include one or more transistors interconnected to form logic gates that collectively perform a logical function. For example, a processor can be a circuit.

[0019] As used herein, the terms “determine” and “identify,” or any variation thereof, include selecting, confirming, calculating, searching, receiving, determining, establishing, obtaining, or otherwise identifying or determining in any manner whatsoever using one or more of the devices and methods shown and described herein.

[0020] As used herein, the terms “example,” “embodiment,” “implementation,” “aspect,” “feature,” or “element” indicate that they are used as examples, instances, or illustrations. Unless expressly indicated, any example, embodiment, implementation, aspect, feature, or element is independent of each other example, embodiment, implementation, aspect, feature, or element and may be used in combination with any other example, embodiment, implementation, aspect, feature, or element.

[0021] As used herein, the term “or” is intended to mean an inclusive “or”, not an exclusive “or.” That is, unless otherwise stated or clearly apparent from the context, “X comprises A or B” is intended to mean any of the natural inclusive permutations. That is, if X comprises A; X comprises B; or X comprises both A and B, then “X comprises A or B” is satisfied in any of the foregoing cases. Furthermore, the articles “a” and “an” as used in this application and the appended claims should generally be interpreted as meaning “one or more”, unless otherwise stated or clearly apparent from the context to be directed to a singular form.

[0022] Furthermore, for simplicity of explanation, although the figures and descriptions herein may include sequences or series of steps or stages, the elements of the methods disclosed herein may occur in various orders or concurrently. Additionally, the elements of the methods disclosed herein may occur together with other elements not explicitly presented and described herein. Moreover, not all elements of the methods described herein may be required to implement a method according to this disclosure. Although aspects, features, and elements are described herein in specific combinations, each aspect, feature, or element may be used independently or in various combinations with or without other aspects, features, and elements.

[0023] It should be understood that the description of the accompanying drawings and embodiments has been simplified to illustrate elements relevant to clear understanding, and many other elements found in typical processors have been omitted for clarity. Those skilled in the art will recognize that other elements and / or steps are desirable and / or necessary in carrying out this disclosure. However, because such elements and steps do not contribute to a better understanding of this disclosure, discussion of such elements and steps is not provided herein.

[0024] Figure 1 is a block diagram of an example of a processing system 1000 for implementing atomic memory operations with contended cache lines, according to embodiments of the present disclosure. The processing system 1000 can implement a pipelined architecture. The processing system 1000 can be configured to decode and execute instructions of an instruction set architecture (ISA) (e.g., the RISC-V instruction set). Instructions can be executed speculatively and out of order in the processing system 1000. The processing system 1000 can be a computing device, a microprocessor, a microcontroller, or an IP core. The processing system 1000 can be implemented as an integrated circuit.

[0025] The processing system 1000 includes at least one processor core 1100. One or more central processing units (CPUs) can be used to implement the processor core 1100. Each processor core 1100 can be connected to or communicate with one or more memory modules 1200 via an interconnect network 1300, an L3 cache 1350, and a memory controller 1400 (collectively, "connected to"). The one or more memory modules 1200 can be referred to as external memory, main memory, backup memory, coherent memory, or backup structure (collectively, "backup structure").

[0026] Each processor core 1100 can include an L1 instruction cache 1500, which is associated with an L1 translation lookaside buffer (TLB) 1510 for virtual-to-physical address translation. An instruction queue 1520 buffers instructions fetched from the L1 instruction cache 1500 based on branch prediction 1530 and other fetch pipeline processing. Dequeued instructions are renamed in a renaming unit 1530 to avoid false data dependencies and are then dispatched by a dispatch/retire unit 1540 to the appropriate back-end execution units, including, for example, a floating-point execution unit 1600, an integer execution unit 1700, and a load/store execution unit 1800. The floating-point execution unit 1600 can be allocated a physical register file, namely the FP register file 1610, and the integer execution unit 1700 can be allocated a physical register file, namely the INT register file 1710. The FP register file 1610 and the INT register file 1710 are also connected to the load/store execution unit 1800, which can access the L1 data cache 1900 via the L1 data TLB 1910. The L1 data TLB 1910 is connected to the L2 TLB 1920, which in turn is connected to the L1 instruction TLB 1510. The L1 data cache 1900 can be connected to the L2 cache 1930, which can be connected to the L1 instruction cache 1500. In one embodiment, the L2 cache 1930 can be connected to the L3 cache 1350 via the interconnect network 1300. In another embodiment, the L3 cache 1350 can be a shared cache.

[0027] The processing system 1000 and each element or component in the processing system 1000 are illustrative and can include additional, fewer, or different devices, entities, elements, components, etc., which can be constructed similarly or differently without departing from the scope of the description and claims herein. Furthermore, the illustrative devices, entities, elements, and components can perform other functions without departing from the scope of the description and claims herein.

[0028] Figure 2 is a flowchart 2000 of an example technique or method for implementing atomic memory operations with contended cache lines, according to embodiments of this disclosure. Flowchart 2000 can be implemented, for example, in the processing system 1000 of Figure 1 and in similar devices and systems. Flowchart 2000 describes communications or interactions among core 1 2100, which includes L1 cache 2110; core 2 2200, which includes L1 cache 2210; L2 cache 2300; LN cache 2400; and backup structure 2500. In an implementation, L1 cache 2110 is a local cache to core 1 2100, L1 cache 2210 is a local cache to core 2 2200, and L2 cache 2300 is a remote cache, shared cache, or a combination thereof to core 1 2100 and core 2 2200. In an implementation, backup structure 2500 may include a controller. In an implementation, each of L1 cache 2110, L1 cache 2210, L2 cache 2300, and LN cache 2400 may include a defined number of cache lines.

[0029] In the starting state or sequence of flowchart 2000, an AMO instruction needs to be executed by core 1 2100. L1 local cache 2110 can request the cache line associated with the AMO instruction in a modified or equivalent state (collectively, "modified") (2600). In one implementation, L1 local cache 2110 can request the cache line in the modified state in the event of a cache miss. In another implementation, L1 local cache 2110 can request an upgrade of the cache line coherence protocol state to the modified state in the event of a cache hit. L2 cache 2300 can grant this request (2610). In one implementation, L2 cache 2300 can provide the cache line in the modified state. In another implementation, L2 cache 2300 can grant the upgrade of the cache line coherence protocol state to the modified state. In yet another implementation, L2 cache 2300 can obtain the cache line from lower levels, such as LN cache 2400 or backup structure 2500, as needed.

[0030] Before the AMO instruction is executed, core 2 2200 can request the same cache line (2620). As a result, a snoop probe can be sent to L1 local cache 2110, which invalidates the cache line at L1 local cache 2110 (2630). L1 local cache 2110 can send an acknowledgment of the snoop probe to L2 cache 2300 (2640) and can send the AMO instruction to L2 cache 2300 for remote execution (2650). In this case, L2 cache 2300 is the level of the cache hierarchy closest to a common root shared between the original requesting cache (L1 cache 2110) and the other requesting cache (L1 cache 2210) that caused the snoop probe. In implementations, LN cache 2400, backup structure 2500, and so on can remotely execute the AMO instruction where appropriate and applicable. For example, L2 cache 2300 can push the AMO instruction further down based on various factors, including but not limited to: LRU, latency, whether the L2 has the cache line at all, whether the L2 has the cache line in a shared or unique coherence state, matching probes from a lower-level LN cache, matching evictions from the L2 cache, etc.
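The message sequence of flowchart 2000 (steps 2600 to 2650) can be mimicked with a very small software model. The sketch below is an illustration under strong simplifying assumptions: values are kept in a single dictionary at the shared level, only ownership is tracked per L1, and every AMO is a fetch-and-add. The class and method names (`L2Cache`, `L1Cache`, `request_modified`, `snoop_probe`, `remote_amo_add`, and so on) are hypothetical and do not appear in the disclosure.

```python
class L2Cache:
    """Shared level; in this simplified model it is also where all values live."""

    def __init__(self, values):
        self.values = dict(values)
        self.owner = {}                      # line address -> L1 holding it modified

    def request_modified(self, l1, addr):    # steps 2600 and 2620
        holder = self.owner.get(addr)
        if holder is not None and holder is not l1:
            holder.snoop_probe(addr)         # step 2630; the return value models 2640
        self.owner[addr] = l1
        l1.owned.add(addr)                   # step 2610: request granted

    def remote_amo_add(self, addr, operand): # target of step 2650
        old = self.values[addr]
        self.values[addr] = old + operand
        return old


class L1Cache:
    def __init__(self, name, l2):
        self.name, self.l2 = name, l2
        self.owned = set()                   # lines held in the modified state
        self.pending = None                  # (addr, operand) of the waiting AMO
        self.probed = False

    def start_amo_add(self, addr, operand):
        self.pending, self.probed = (addr, operand), False
        self.l2.request_modified(self, addr)

    def snoop_probe(self, addr):
        self.owned.discard(addr)             # line is invalidated locally
        if self.pending and self.pending[0] == addr:
            self.probed = True               # remember: the AMO lost its line
        return "ack"

    def complete_amo(self):
        addr, operand = self.pending
        self.pending = None
        if self.probed:                      # give up local execution (step 2650)
            return ("remote", self.l2.remote_amo_add(addr, operand))
        old = self.l2.values[addr]           # line still owned: execute locally
        self.l2.values[addr] = old + operand
        return ("local", old)


l2 = L2Cache({0x40: 10})
core1, core2 = L1Cache("core1", l2), L1Cache("core2", l2)
core1.start_amo_add(0x40, 1)                 # core 1 obtains the line for its AMO
l2.request_modified(core2, 0x40)             # core 2 asks for it before the AMO runs
where, old_value = core1.complete_amo()
```

Here `where` ends up as `"remote"` because core 1 was probed before its AMO could execute; without the intervening request from core 2 the same call would take the local path.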

[0031] Figure 3 is a flowchart 3000 of an example technique or method for implementing atomic memory operations with contended cache lines, according to embodiments of this disclosure. Flowchart 3000 can be implemented, for example, in the processing system 1000 of Figure 1 and in similar devices and systems. Flowchart 3000 describes communications or interactions among core 1 3100, which includes L1 cache 3110; core 2 3200, which includes L1 cache 3210; L2 cache 3300; LN cache 3400; and backup structure 3500. In this implementation, L1 cache 3110 is a local cache to core 1 3100, L1 cache 3210 is a local cache to core 2 3200, and L2 cache 3300 is a remote cache, shared cache, or a combination thereof to core 1 3100 and core 2 3200. In this implementation, backup structure 3500 may include a controller. In this implementation, each of L1 cache 3110, L1 cache 3210, L2 cache 3300, and LN cache 3400 may include a defined number of cache lines.

[0032] In the starting state or sequence of flowchart 3000, an AMO instruction needs to be executed by core 1 3100. L1 local cache 3110 can request the cache line associated with the AMO instruction in a modified or equivalent state (collectively, "modified") (3600). In one implementation, L1 local cache 3110 can request the cache line in the modified state in the event of a cache miss. In another implementation, L1 local cache 3110 can request an upgrade of the cache line coherence protocol state to the modified state in the event of a cache hit.

[0033] In one implementation, the L2 cache 3300 can respond that the requested cache line is a contended cache line and is unavailable (3610). That is, the contended cache line message indicates that the request is rejected. The L2 cache 3300 can determine that a cache line is contended based on various factors, including but not limited to LRU, latency, cache presence bits, matching transactions in flight or buffered from another cache, whether the L2 cache has the cache line at all, whether the L2 cache has the cache line in a shared or unique coherence state, matching probes from a lower-level LN cache, matching evictions from the L2 cache, etc. In one implementation, as part of determining whether the cache line is contended, the L2 cache 3300 can request another cache, such as the L1 cache 3210, to relinquish the requested cache line (3620). In implementations, the L1 cache 3210 does not relinquish the requested cache line.

[0034] In response to receiving the contended cache line message from L2 cache 3300, L1 local cache 3110 can send the AMO instruction to L2 cache 3300 for remote execution (3630). In this implementation, L2 cache 3300 is the level of the cache hierarchy closest to a common root shared between the original requesting cache (L1 cache 3110) and the cache that retains the cache line (L1 cache 3210). In this implementation, LN cache 3400, backup structure 3500, and so on can remotely execute the AMO instruction where appropriate and applicable. For example, L2 cache 3300 can push the AMO instruction further down based on various factors, including but not limited to LRU, latency, whether the L2 has the cache line at all, whether the L2 has the cache line in a shared or unique coherence state, matching probes from lower-level LN caches, matching evictions from L2 caches, etc.
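Flowchart 3000 differs from flowchart 2000 in that the shared level rejects the request up front instead of granting the line and probing it away later. The following Python sketch illustrates that request/reject/forward exchange under simplifying assumptions: the contention decision is passed in as a callback (`is_contended`, a hypothetical stand-in for the factors listed in paragraph [0033]), and all values are kept at the shared level. None of the names are taken from the disclosure.

```python
import operator

CONTENDED = "contended"   # models the contended cache line message (3610)
GRANTED = "granted"


class SharedCache:
    def __init__(self, values, is_contended):
        self.values = dict(values)
        self.is_contended = is_contended     # hypothetical policy callback

    def request_line(self, addr):            # request 3600, reply 3610
        return CONTENDED if self.is_contended(addr) else GRANTED

    def remote_amo(self, addr, op, operand): # remote execution after 3630
        old = self.values[addr]
        self.values[addr] = op(old, operand)
        return old


def local_cache_amo(shared, addr, op, operand):
    """Run one AMO from the local cache's point of view.

    Returns (where, old_value), where `where` is "remote" or "local".
    """
    if shared.request_line(addr) == CONTENDED:
        # Request rejected: forward the AMO instead of retrying for the line.
        return "remote", shared.remote_amo(addr, op, operand)
    # Request granted: apply the operation locally (data movement not modeled).
    old = shared.values[addr]
    shared.values[addr] = op(old, operand)
    return "local", old


hot_lines = {0x80}                           # assumed set of contended lines
shared = SharedCache({0x80: 3, 0xC0: 7}, is_contended=lambda a: a in hot_lines)
result_hot = local_cache_amo(shared, 0x80, operator.add, 4)
result_cold = local_cache_amo(shared, 0xC0, max, 2)
```

Passing the operation as a function (`operator.add`, `max`) is a design choice of this sketch so that the same path can stand in for different AMO kinds such as add or max.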

[0035] Figure 4 is a flowchart of an example technique or method 4000 for implementing atomic memory operations with contended cache lines, according to embodiments of this disclosure. The technique includes: requesting 4100 a cache line for an AMO instruction from a lower-level memory structure; receiving 4200 the cache line from the lower-level memory structure; receiving 4300 a probe downgrade caused by another cache requesting the same cache line before the AMO instruction is executed; acknowledging 4400 the probe downgrade; and sending 4500 the AMO instruction to the lower-level memory structure for remote execution. Technique 4000 can be implemented, for example, in the processing system 1000 of Figure 1 and in similar devices and systems.

[0036] Technique 4000 includes requesting 4100 a cache line for an AMO instruction from a lower-level memory structure. A processor or processing system needs to perform an AMO operation. The local cache can request the cache line associated with, or required for, executing the AMO instruction. In one embodiment, the local cache can request the cache line in the event of a cache miss. In another embodiment, the local cache can request a cache coherence state upgrade for the cache line in the event of a cache hit.

[0037] Technique 4000 includes receiving 4200 the cache line from the lower-level memory structure. The lower-level memory structure can grant the request either by providing the cache line to the local cache or by upgrading the cache coherence state of the cache line at the local cache.

[0038] Technique 4000 includes receiving 4300 a probe downgrade caused by another cache requesting the same cache line before the AMO instruction is executed. The other cache can request the same cache line, which causes a snoop probe at the local cache and results in the cache line being downgraded or invalidated at the local cache.

[0039] Technique 4000 includes acknowledging 4400 the probe downgrade. In implementations, the local cache can attempt to negotiate with the lower-level memory structure.

[0040] Technique 4000 includes sending 4500 the AMO instruction to the lower-level memory structure for remote execution. The local cache can relinquish local execution of the AMO operation and send the AMO instruction to the lower-level memory structure for remote execution. The AMO instruction can be executed by the lower-level memory structure or by another memory structure that is closer to a common root shared between the local cache and the other requesting cache that caused the snoop probe.
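The notion of a level that is a common root shared between two caches can be illustrated as a lowest-common-ancestor lookup in a tree of caches. The Python sketch below is an assumption-laden illustration: the hierarchy is given as a child-to-parent dictionary, and the topology used in the example (two L2 clusters under one L3) is hypothetical rather than taken from the figures.

```python
def path_to_root(parent, node):
    """Return [node, parent of node, ..., root] for a child -> parent mapping."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def closest_common_root(parent, cache_a, cache_b):
    """Return the first level, walking up from cache_b, that also lies above cache_a."""
    above_a = set(path_to_root(parent, cache_a))
    for node in path_to_root(parent, cache_b):
        if node in above_a:
            return node
    raise ValueError("the two caches do not share a root")


# Hypothetical hierarchy: two L1 caches under L2_0, one under L2_1, one shared L3.
parent = {
    "L1_0": "L2_0",
    "L1_1": "L2_0",
    "L1_2": "L2_1",
    "L2_0": "L3",
    "L2_1": "L3",
    "L3": "memory",
}

same_cluster = closest_common_root(parent, "L1_0", "L1_1")
other_cluster = closest_common_root(parent, "L1_0", "L1_2")
```

With this topology an AMO contended between `L1_0` and `L1_1` would be sent to `L2_0`, while contention between `L1_0` and `L1_2` would point at `L3` as the candidate for remote execution.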

[0041] Figure 5 is a flowchart of an example technique or method 5000 for implementing atomic memory operations with contended cache lines, according to embodiments of this disclosure. The technique includes: requesting 5100 a cache line for an AMO instruction from a lower-level memory structure; receiving 5200 a contended cache line message from the lower-level memory structure; and sending 5300 the AMO instruction to the lower-level memory structure for remote execution. Technique 5000 can be implemented, for example, in the processing system 1000 of Figure 1 and in similar devices and systems.

[0042] Technique 5000 includes requesting 5100 a cache line for an AMO instruction from a lower-level memory structure. A processor or processing system needs to perform an AMO operation. The local cache can request the cache line associated with, or required for, executing the AMO instruction. In one embodiment, the local cache can request the cache line in the event of a cache miss. In another embodiment, the local cache can request a cache coherence state upgrade for the cache line in the event of a cache hit.

[0043] Technique 5000 includes receiving 5200 a contended cache line message from the lower-level memory structure. The lower-level memory structure can determine whether the requested cache line is a contended cache line based on various factors, including but not limited to LRU, latency, and matching transactions currently in flight or buffered from the same cache or another cache. In implementations, these factors can include whether another cache or memory structure that holds the requested cache line will relinquish it. Based on this determination, the lower-level memory structure can send a contended cache line message to the local cache, indicating that the request is rejected.

[0044] Technique 5000 includes sending 5300 the AMO instruction to the lower-level memory structure for remote execution. In response to the contended cache line message, the local cache can relinquish local execution of the AMO operation and send the AMO instruction to the lower-level memory structure for remote execution. The AMO instruction can be executed by the lower-level memory structure or by other memory structures. In one implementation, remote execution can be performed by a memory structure that is closer to a common root shared between the local cache and the cache that holds the requested cache line.

[0045] Figure 6 is a flowchart of an example technique or method 6000 for implementing atomic memory operations with contended cache lines, according to embodiments of this disclosure. The technique includes: requesting 6100 a cache line for an AMO instruction from a lower-level memory structure; checking 6200 with another memory structure regarding the requested cache line; receiving 6300 a contended cache line message from the lower-level memory structure; and sending 6400 the AMO instruction to the lower-level memory structure for remote execution. Technique 6000 can be implemented, for example, in the processing system 1000 of Figure 1 and in similar devices and systems.

[0046] Technique 6000 includes requesting 6100 a cache line for an AMO instruction from a lower-level memory structure. A processor or processing system needs to perform an AMO operation. The local cache can request the cache line associated with, or required for, executing the AMO instruction. In one embodiment, the local cache can request the cache line in the event of a cache miss. In another embodiment, the local cache can request a cache coherence state upgrade for the cache line in the event of a cache hit.

[0047] Technique 6000 includes checking 6200 with another memory structure regarding the requested cache line. The lower-level memory structure can check whether another cache or memory structure that holds the requested cache line will relinquish it.

[0048] Technique 6000 includes receiving 6300 a contended cache line message from the lower-level memory structure. The lower-level memory structure can determine whether the requested cache line is contended based on various factors, including but not limited to LRU, latency, matching transactions currently in flight or buffered from the same cache or another cache, and the response from the other memory structure. Based on this determination, the lower-level memory structure can send a contended cache line message to the local cache, indicating that the request is rejected.

[0049] Technique 6000 includes sending 6400 the AMO instruction to the lower-level memory structure for remote execution. In response to the contended cache line message, the local cache can relinquish local execution of the AMO operation and send the AMO instruction to the lower-level memory structure for remote execution. The AMO instruction can be executed by the lower-level memory structure or by other memory structures. In one embodiment, remote execution can be performed by a memory structure that is closer to a common root shared between the local cache and the cache that holds the requested cache line.

[0050] Figure 7 is a flowchart of an example technique or method 7000 for implementing atomic memory operations with contended cache lines, according to embodiments of the present disclosure. The technique includes: requesting 7100 a cache line for an AMO instruction from a lower-level memory structure; determining 7200 the availability of the requested cache line; receiving 7300 the cache line from the lower-level memory structure when available; receiving 7400 a probe degradation due to another cache requesting the same cache line before the AMO instruction is executed; confirming 7500 the probe degradation; sending 7600 the AMO instruction to the lower-level memory structure for remote execution; receiving 7700 a contention cache line message from the lower-level memory structure when unavailable; and sending 7600 the AMO instruction to the lower-level memory structure for remote execution. Technique 7000 can be implemented, for example, in the processing system 1000 of Figure 1 and in similar devices and systems.

[0051] Technique 7000 includes requesting 7100 a cache line for an AMO instruction from a lower-level memory structure. A processor or processing system needs to perform an AMO operation. The local cache can request the cache line associated with or required for executing the AMO instruction. In one embodiment, the local cache can request the cache line in the event of a cache miss. In another embodiment, the local cache can request a cache coherence state upgrade for the cache line in the event of a cache hit.

[0052] Technique 7000 includes determining 7200 the availability of the requested cache line. The lower-level memory structure can determine whether the requested cache line is a contended cache line based on various factors, including but not limited to LRU, latency, matching transactions running or buffered from the same cache or another cache, and responses from another memory structure that holds the requested cache line.
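As a rough illustration of how several of these factors might be combined, the Python sketch below evaluates a hypothetical `LineStatus` record. The field names, the `latency_limit` threshold, and the rule for combining the factors are all assumptions; the disclosure lists the factors but does not say how they are weighted.

```python
from dataclasses import dataclass


@dataclass
class LineStatus:
    """Hypothetical snapshot of what the lower level knows about a line."""
    in_flight_from_other_caches: int  # matching transactions running or buffered
    holder_will_relinquish: bool      # response from the structure holding the line
    recently_used_by_holder: bool     # LRU-style recency hint
    expected_latency_cycles: int      # estimated cost of obtaining the line


def is_contended(status, latency_limit=100):
    # Assumed policy: any matching in-flight transaction marks the line contended.
    if status.in_flight_from_other_caches > 0:
        return True
    # Assumed policy: a holder that refuses to give up the line marks it contended.
    if not status.holder_will_relinquish:
        return True
    # Assumed policy: a recently used line that is also slow to obtain is contended.
    return (status.recently_used_by_holder
            and status.expected_latency_cycles > latency_limit)
```

A real design would likely tune or replace this policy; the point of the sketch is only that the decision is a function of locally observable state at the lower-level memory structure.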

[0053] Technique 7000 includes receiving 7300 the cache line from the lower-level memory structure when available. By default, the lower-level memory structure can grant the request, either by providing the cache line to the local cache or by upgrading the cache coherence state of the cache line at the local cache.

[0054] Technique 7000 includes receiving 7400 a probe degradation caused by another cache requesting the same cache line before the AMO instruction is executed. The other cache can request the same cache line, causing a probe to arrive at the local cache and resulting in either degradation or invalidation of the cache line at the local cache.
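The effect of such a probe on the local copy of the line can be sketched in Python with a three-state coherence model. The `State` enum, the function names, and the assumption that local AMO execution requires the unique state are illustrative choices, not details taken from the disclosure.

```python
from enum import Enum


class State(Enum):
    """Hypothetical coherence states of a line in the local cache."""
    INVALID = 0
    SHARED = 1
    UNIQUE = 2


def apply_probe(state, requester_needs_unique):
    # A line that is not held cannot be degraded any further.
    if state is State.INVALID:
        return State.INVALID
    # The other cache wants exclusive access: the local copy is invalidated.
    if requester_needs_unique:
        return State.INVALID
    # The other cache only wants to read: the local copy is degraded to shared.
    return State.SHARED


def can_execute_amo_locally(state):
    # Assumption: an AMO needs the line in the unique state to run locally.
    return state is State.UNIQUE
```

Under these assumptions, a probe that arrives between receiving the line and executing the AMO leaves the local cache unable to proceed locally, which is the situation steps 7500 and 7600 handle.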

[0055] Technique 7000 includes confirming 7500 the probe degradation. In an implementation, the local cache can attempt to negotiate with the lower-level memory structure.

[0056] Technique 7000 includes sending 7600 the AMO instruction to the lower-level memory structure for remote execution. The local cache can forgo local execution of the AMO operation and send the AMO instruction to the lower-level memory structure for remote execution. The AMO instruction can be executed by the lower-level memory structure or by another memory structure that is closer to both the local cache and the other requesting cache that caused the probe, the two caches sharing that structure as a common root.
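One way to picture the "common root" is as the nearest shared ancestor of the two caches in the memory hierarchy. The Python sketch below assumes the hierarchy is a tree described by a hypothetical `parent` mapping; the node names in the usage note are invented for illustration.

```python
def path_to_root(node, parent):
    """Return the chain of structures from `node` up to the top of the hierarchy."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def nearest_common_root(cache_a, cache_b, parent):
    """Return the closest structure that both caches sit beneath, or None."""
    ancestors_of_b = set(path_to_root(cache_b, parent))
    for node in path_to_root(cache_a, parent):
        if node in ancestors_of_b:
            return node
    return None
```

For a hierarchy such as `{"L1_0": "L2_0", "L1_1": "L2_0", "L1_2": "L2_1", "L2_0": "L3", "L2_1": "L3"}`, two local caches under the same L2 resolve to that L2, while caches under different L2s resolve to the L3.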

[0057] Technique 7000 includes receiving 7700 a contention cache line message from the lower-level memory structure when the cache line is unavailable. The lower-level memory structure can send a message indicating that the requested cache line is a contended cache line and reject the request from the local cache.

[0058] Technique 7000 includes sending 7600 the AMO instruction to the lower-level memory structure for remote execution.
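The two paths through technique 7000 that end in remote execution, together with the uncontended local path, can be summarized in a short Python sketch. The function name `run_amo_flow`, its two boolean inputs, and the `Outcome` enum are assumptions introduced for illustration; the step numbers in the trace follow the flowchart description above.

```python
from enum import Enum


class Outcome(Enum):
    LOCAL = "local"
    REMOTE = "remote"


def run_amo_flow(line_available, probed_before_execution):
    """Return where the AMO executes and the sequence of steps taken."""
    trace = ["7100 request cache line", "7200 determine availability"]
    if not line_available:
        # Unavailable: the lower level rejects the request.
        trace.append("7700 receive contention cache line message")
        trace.append("7600 send AMO for remote execution")
        return Outcome.REMOTE, trace
    trace.append("7300 receive cache line")
    if probed_before_execution:
        # Another cache asked for the line before the AMO could run.
        trace.append("7400 receive probe degradation")
        trace.append("7500 confirm probe degradation")
        trace.append("7600 send AMO for remote execution")
        return Outcome.REMOTE, trace
    # Assumed default path: no contention, so the AMO runs in the local cache.
    trace.append("execute AMO locally")
    return Outcome.LOCAL, trace
```

Both contended paths converge on step 7600, which matches the flowchart listing that step twice.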

[0059] Generally, a processing system includes: at least two cores, each core having a local cache; and a lower-level cache communicating with each local cache. A local cache is configured to: request a cache line to execute an atomic memory operation (AMO) instruction; receive the cache line via the lower-level cache; receive a probe degradation caused by another local cache requesting the cache line before the execution of the AMO instruction; and send the AMO instruction to the lower-level cache for remote execution in response to the probe degradation. In one embodiment, the request is for the cache line in the event of a cache miss at the local cache. In another embodiment, the request is for a cache coherence state upgrade in the event of a cache hit at the local cache. In yet another embodiment, the lower-level cache is configured to determine the availability of the cache line based on various factors. In an implementation, the various factors include at least a Least Recently Used (LRU) algorithm, latency, input from other caches or memory structures associated with the cache line, inclusive cache presence bits, matching transactions running or buffered from another cache, whether the lower-level cache fully has the cache line, whether the lower-level cache has the cache line in a shared or unique coherence state, matching probes from the lower-level cache, matching evictions from the lower-level cache, a prediction table of recently accessed potentially contested cache lines, and a Boolean filter for cache lines that are unlikely to be contested. In an implementation, the lower-level cache is also configured to check other caches or memory structures associated with the cache line regarding their willingness to relinquish the cache line. In an implementation, the lower-level cache is also configured to send a contention cache line message to the local cache based on the various factors. In an implementation, the local cache is also configured to send the AMO instruction to the lower-level cache for remote execution in response to the contention cache line message.
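The prediction table of recently accessed, potentially contested cache lines could take many forms. The Python sketch below shows one possibility, a small table with least-recently-recorded replacement; the class name `ContentionPredictionTable`, its capacity, and its replacement policy are assumptions and are not described in the disclosure.

```python
from collections import OrderedDict


class ContentionPredictionTable:
    """Hypothetical bounded table of line addresses recently seen under contention."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self._entries = OrderedDict()  # line address -> True, oldest entry first

    def record_contention(self, addr):
        # Insert or refresh the entry, then evict the oldest one if over capacity.
        self._entries[addr] = True
        self._entries.move_to_end(addr)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)

    def predicts_contention(self, addr):
        # A line is predicted to be contested if it is still in the table.
        return addr in self._entries
```

A lower-level cache modeled this way could consult `predicts_contention` before granting a request and call `record_contention` whenever a probe degradation or a denied request is observed.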

[0060] Generally, a processing system includes: a core with a local cache; and a shared cache that communicates with the core's local cache and with at least one other cache of at least one other core. The local cache is configured to: request a cache line to execute an atomic memory operation (AMO) instruction; receive a message from the shared cache that the cache line is unavailable; and, in response to the message, send the AMO instruction to the shared cache for remote execution. In one embodiment, the request is for the cache line in the event of a cache miss at the local cache. In another embodiment, the request is for a cache coherence state upgrade in the event of a cache hit at the local cache. In yet another embodiment, the shared cache is configured to determine cache line availability based on various factors. In an implementation, these factors include at least a Least Recently Used (LRU) algorithm, latency, input from the at least one other cache of the at least one other core, inclusive cache presence bits, matching transactions running or buffered from another cache, whether the shared cache fully has the cache line, whether the shared cache has the cache line in a shared or unique coherence state, matching probes from lower-level caches, and matching evictions from the shared cache. In an implementation, the shared cache is also configured to check the at least one other cache of the at least one other core regarding its willingness to relinquish the cache line.

[0061] Generally, a method for executing atomic memory operation (AMO) instructions includes: requesting, by a local cache, a cache line for the AMO instruction from a lower-level memory structure; determining, by the lower-level memory structure, the availability of the requested cache line; receiving, by the local cache, the cache line from the lower-level memory structure when available; receiving a probe degradation due to another cache requesting the cache line prior to the execution of the AMO instruction; sending, by the local cache, the AMO instruction to the lower-level memory structure for remote execution in response to the probe degradation; receiving, by the local cache, a contention cache line message from the lower-level memory structure when unavailable; and sending, by the local cache, the AMO instruction to the lower-level memory structure for remote execution in response to the contention cache line message. In one embodiment, the request is for the cache line in the event of a cache miss at the local cache. In another embodiment, the request is for a cache coherence state upgrade in the event of a cache hit at the local cache. In an implementation, availability is based on various factors, including at least a Least Recently Used (LRU) algorithm, latency, input from other caches or memory structures associated with the cache line, inclusive cache presence bits, matching transactions running or buffered from another cache, whether the lower-level memory structure fully has the cache line, whether the lower-level memory structure has the cache line in a shared or unique coherence state, matching probes from the lower-level memory structure, and matching evictions from the lower-level memory structure. In an implementation, the method also includes checking other caches or memory structures associated with the cache line regarding their willingness to relinquish it. In an implementation, the method also includes confirming, by the local cache, the probe degradation.

[0062] Although some embodiments herein refer to methods, those skilled in the art will understand that they can also be embodied as systems or computer program products. Therefore, aspects of the invention can take the form of entirely hardware embodiments, entirely software embodiments (including firmware, resident software, microcode, etc.), or embodiments combining software and hardware aspects, which may generally be referred to herein as “processor,” “device,” or “system.” Furthermore, aspects of the invention can take the form of computer program products embodied in one or more computer-readable media having computer-readable program code embodied thereon. Any combination of one or more computer-readable media can be utilized. A computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer-readable storage medium can be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

[0063] A computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including but not limited to electromagnetic, optical, or any suitable combination thereof. A computer-readable signal medium may be any computer-readable medium that is not a computer-readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

[0064] Program code implemented on a computer-readable medium can be transmitted using any suitable medium, including but not limited to CD, DVD, wireless, wired, fiber optic cable, RF, or any suitable combination thereof.

[0065] Computer program code used to perform the operations of various aspects of this invention can be written in any combination of one or more programming languages, including object-oriented programming languages (such as Java, Smalltalk, C++, etc.) and conventional procedural programming languages (such as the "C" programming language or similar programming languages). The program code can be executed entirely on the user's computer, partly on the user's computer as a standalone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer can be connected to the user's computer via any type of network (including a local area network (LAN) or a wide area network (WAN)), or the connection can be made to an external computer (e.g., via the Internet through an Internet service provider).

[0066] Aspects are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions.

[0067] These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for carrying out the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable medium that can direct a computer, other programmable data processing apparatus, or other device to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instructions that carry out the functions/acts specified in the flowchart and/or block diagram block or blocks.

[0068] Computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other equipment to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other equipment, thereby producing a computer-implemented process, such that the instructions, which execute on the computer or other programmable apparatus, provide processes for carrying out the functions/acts specified in the flowchart and/or block diagram block or blocks.

[0069] The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code, including one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures.

[0070] While this disclosure has been described in conjunction with certain embodiments, it should be understood that this disclosure is not limited to the disclosed embodiments. Rather, this disclosure is intended to cover various modifications, combinations, and equivalent arrangements included within the scope of the appended claims, which should be given the broadest interpretation to cover all such modifications and equivalent structures permitted by law.

Claims

1. A processing system, comprising: at least two cores, each core having a local cache; and a lower-level cache that communicates with each local cache, wherein a local cache is configured to: request a cache line to execute an atomic memory operation (AMO) instruction; receive the cache line via the lower-level cache, wherein the lower-level cache is configured to determine the availability of the cache line based on input from other caches or memory structures associated with the cache line; receive a probe degradation caused by another local cache requesting the cache line before the execution of the AMO instruction; and, in response to the probe degradation, send the AMO instruction to the lower-level cache for remote execution.

2. The processing system according to claim 1, wherein the lower-level cache is configured to also determine the availability of the cache line based on inclusive cache presence bits.

3. The processing system according to claim 2, wherein the lower-level cache is configured to also determine the availability of the cache line based on matching transactions that are running or buffered from another cache.

4. The processing system according to claim 3, wherein the lower-level cache is configured to also determine the availability of the cache line based on whether the lower-level cache fully has the cache line.

5. The processing system according to claim 4, wherein the lower-level cache is configured to also determine the availability of the cache line based on whether the lower-level cache has the cache line in a shared or unique coherence state.

6. The processing system according to claim 1, wherein the lower-level cache is also configured to check other caches or memory structures associated with the cache line regarding their willingness to relinquish the cache line.

7. The processing system according to claim 1, wherein the lower-level cache is also configured to send a contention cache line message to a local cache based on various factors, including at least a Least Recently Used (LRU) algorithm, latency, input from other caches or memory structures associated with the cache line, inclusive cache presence bits, matching transactions running or buffered from another cache, whether the lower-level cache fully has the cache line, whether the lower-level cache has the cache line in a shared or unique coherence state, matching probes from the later-level cache, matching evictions from the lower-level cache, a prediction table of recently accessed potentially contested cache lines, and a Boolean filter for cache lines that are unlikely to be contested.

8. The processing system according to claim 7, wherein the local cache is also configured to send the AMO instruction to the lower-level cache for remote execution in response to the contention cache line message.

9. The processing system according to claim 5, wherein the lower-level cache is configured to also determine the availability of the cache line based on matching probes from the later-level cache.

10. The processing system according to claim 9, wherein the lower-level cache is configured to also determine the availability of the cache line based on matching evictions from the lower-level cache.

11. The processing system according to claim 10, wherein the lower-level cache is configured to also determine the availability of the cache line based on a prediction table of recently accessed potentially contested cache lines.

12. The processing system according to claim 11, wherein the lower-level cache is configured to also determine the availability of the cache line based on a Boolean filter for cache lines that are unlikely to be contested.

13. The processing system according to claim 12, wherein the lower-level cache is configured to also determine the availability of the cache line based on a Least Recently Used (LRU) algorithm.

14. A method for executing atomic memory operation (AMO) instructions, the method comprising: requesting, by a local cache, a cache line for an AMO instruction from a lower-level memory structure; determining, by the lower-level memory structure, the availability of the requested cache line based on a Least Recently Used (LRU) algorithm and latency; receiving, by the local cache, the cache line from the lower-level memory structure when available; receiving a probe degradation caused by another cache requesting the cache line prior to the execution of the AMO instruction; sending, by the local cache, the AMO instruction to the lower-level memory structure for remote execution in response to the probe degradation; receiving, by the local cache, a contention cache line message from the lower-level memory structure when unavailable; and sending, by the local cache, the AMO instruction to the lower-level memory structure for remote execution in response to the contention cache line message.

15. The method according to claim 14, further comprising: determining, by the lower-level memory structure, the availability of the requested cache line based on input from other caches or memory structures associated with the cache line.

16. The method according to claim 15, further comprising: determining, by the lower-level memory structure, the availability of the requested cache line based on inclusive cache presence bits.

17. The method according to claim 16, further comprising: determining, by the lower-level memory structure, the availability of the requested cache line based on matching transactions that are running or buffered from another cache.

18. The method according to claim 17, further comprising: determining, by the lower-level memory structure, the availability of the requested cache line based on whether the lower-level memory structure fully has the cache line.

19. The method according to claim 18, further comprising: determining, by the lower-level memory structure, the availability of the requested cache line based on whether the lower-level memory structure fully has the cache line.

20. The method according to claim 19, further comprising: determining, by the lower-level memory structure, the availability of the requested cache line based on whether the lower-level memory structure fully has the cache line.

21. The method according to claim 20, further comprising: determining, by the lower-level memory structure, the availability of the requested cache line based on whether the lower-level memory structure has the cache line in a shared or unique coherence state.

22. The method according to claim 21, further comprising: determining, by the lower-level memory structure, the availability of the requested cache line based on matching probes from the later-level memory structure.

23. The method according to claim 22, further comprising: determining, by the lower-level memory structure, the availability of the requested cache line based on matching evictions from the lower-level memory structure.