Cache coherence tracking using adaptive coherence tracking
The cache coherence system optimizes cache coherence by adaptively adjusting sector size and bit allocation in the SFT, addressing resource consumption and performance issues in multicore architectures through dynamic sector sizing and reduced oversnooping.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- MICROSOFT TECHNOLOGY LICENSING LLC
- Filing Date
- 2024-06-11
- Publication Date
- 2026-06-19
AI Technical Summary
Modern multicore architectures face significant challenges in maintaining cache coherence as the number of agents increases, leading to increased resource consumption and performance degradation due to the need for extensive snoop operations and inadequate snoop filter area provisioning.
A cache coherence system uses a snoop filter (SFT) that adaptively adjusts sector size and core-valid (CV) vector bits based on cache line sharing rates, optimizing area usage and reducing oversnooping through dynamic or static coarse-grained tracking.
The system effectively manages cache coherence with reduced resource consumption and minimal performance impact by dynamically adjusting sector size and bit allocation in the SFT, ensuring efficient tracking of cache lines without excessive snooping.
Description
Background Art
[0001]
[0001] A processor-based device may include a plurality of processing elements (PEs) (e.g., processor cores, as a non-limiting example), each providing one or more local caches for storing frequently accessed data. Since the plurality of PEs of a processor-based device may share memory resources such as system memory, multiple copies of shared data read from a given memory address may exist simultaneously in the system memory and in the local caches of the PEs. Therefore, to ensure that all PEs have a consistent view of the shared data, a processor-based device supports a cache coherence protocol that causes local changes to shared data within one PE to be propagated to the other PEs.
Summary of the Invention
Means for Solving the Problems
[0002]
[0002] The described technique provides a method that includes: identifying a cache line sector associated with a snoop filter (SFT) having a plurality of SFT entries; identifying the number of cache lines within the identified cache line sector that are cached by one or more agents; and, based on that number of cache lines, identifying the number of bits within a bit vector (BV) of one or more of the plurality of SFT entries that are necessary to track the one or more agents.
[0003]
[0003] This “Summary” section is provided to briefly introduce selected concepts, which are further explained in the “Detailed Description” section. This “Summary” is not intended to identify any important or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
[0004]
[0004] Other implementations are also described and recited in this specification.
Brief Description of the Drawings
[0005]
[0005] [Figure 1] An exemplary implementation of a system that provides cache coherence using a snoop filter.
[0006] [Figure 2] Exemplary operations of the cache coherence system for updating the SFT entry fields in response to an allocation request.
[0007] [Figure 3] Exemplary operations of the cache coherence system for updating the SFT entry fields in response to an invalidation request.
[0008] [Figure 4] Exemplary operations of the cache coherence system for taking ownership of a shared cache line.
[0009] [Figure 5] An exemplary system that may be useful in implementing the cache coherence system disclosed herein.
Modes for Carrying Out the Invention
[0006]
[0010] The implementations disclosed herein relate to a multiprocessor system that uses hardware-enforced cache coherency, in which, when an agent such as a CPU or GPU wants to access a memory location, the hardware automatically determines whether another agent currently holds a copy of that memory location. If the access is a read and the memory location is held in the cache of another agent, system memory may be stale, in which case the access must be satisfied by retrieving the data from the other agent's cache. If the access is a write, any copy held in another cache must typically be written back to system memory first. The memory block over which hardware-enforced cache coherency is maintained is called a coherence grain (cograin), and the system may match the size of the cograin to the size of the cache line.
[0007]
[0011] To ensure program correctness, multicore systems-on-chip (SoCs) must be designed with macroarchitectural features that precisely order and manage accesses to the same physical memory address space from different cores. Because private caches exist in today's architectures, coherency protocols are responsible for tracking which cores hold specific addresses in their private caches and for managing the visibility of those addresses to other cores in the system. The simplest mechanism for enforcing coherency is to send snoop messages to all other agents in the system that "may" hold that line in their cache. As the number of agents has increased over time, most modern architectures employ tracking structures that record which lines reside in which private caches; these are called snoop filters. A snoop filter provides a filtered view of the system, reducing the number of snoops required.
[0008]
[0012] Cache coherence is a fundamental characteristic of multicore architectures: simultaneously executing threads must be presented with a consistent, coherent view of the memory address space. Modern multicore architectures dedicate a significant amount of hardware and design resources to ensuring that the chip has coherent domains, where coherence is maintained through dedicated pipelines and flows working in the background. From a performance standpoint, these operations should have little impact on normal execution. As designs scale to accommodate ever-increasing core counts, the cost of maintaining coherence also increases significantly.
[0009]
[0013] One architectural feature that helps solve this scaling problem is the snoop filter (SFT). This dedicated hardware structure provides an up-to-date view of the cache lines held in the private caches of the agent CPUs, so that coherency operations are performed only when needed. This reduces the pressure on system-on-chip (SoC) resources, in particular the on-chip bandwidth and cache pipeline bandwidth that would otherwise be used to send and process snoop requests.
[0010]
[0014] To enable this, the SFT may be required to precisely track the contents of the upstream caches. This, in turn, requires provisioning sufficient area for the SFT, because any cache line that the snoop filter cannot track must also be invalidated from the private cache, which can be detrimental to performance. Ideally, the snoop filter should provide coverage of at least 1× the size of the private cache. In terms of area, since the SFT holds only tags, it needs one entry for each active cache line in the private cache. Thus, the number of SFT entries for 1× coverage is N = L2 size (bytes) / L2 line size (bytes), and the area of the SFT = area of each SFT entry × N.
[0011]
[0015] Therefore, with a 256 KB L2 cache and an L2 cache line size of 64 bytes, a single agent CPU alone requires 4K SFT entries. This SFT area requirement is further exacerbated by the set-associative nature of private caches and the large number of agents on the SoC. One way to address this area constraint is to track cache lines at a coarse granularity within the SFT. If the SFT tracks n cache lines per entry, the area requirement for the same amount of coverage falls to 1/n. In this scheme, the workload may need to access all n cache lines within the same sector in temporal proximity to each other. In some implementations, the n cache lines grouped within the same entry are also spatially adjacent. However, in such implementations, if a particular workload does not exhibit this locality, the sectoring scheme suffers significant performance degradation due to oversnooping. To mitigate oversnooping, the SFT implementation may include metadata for each coarse-grained entry.
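The sizing arithmetic above can be checked with a short sketch. The function name `sft_entries` and its parameters are illustrative assumptions, not terms from the specification; the sketch only restates N = L2 size / L2 line size and the 1/n reduction for n cache lines per entry.

```python
def sft_entries(cache_bytes: int, line_bytes: int, lines_per_entry: int = 1) -> int:
    """Entries needed for 1x coverage of one private cache.

    lines_per_entry is the number of cache lines (n) grouped into one
    coarse-grained SFT entry; n = 1 is fine-grained tracking.
    """
    bytes_per_entry = line_bytes * lines_per_entry
    if cache_bytes % bytes_per_entry:
        raise ValueError("cache size must be a multiple of the tracked sector size")
    return cache_bytes // bytes_per_entry


# 256 KB L2 with 64-byte lines, tracked one line per entry.
fine_grained = sft_entries(256 * 1024, 64)

# Same cache tracked four lines per entry (n = 4) needs a quarter of the entries.
coarse_grained = sft_entries(256 * 1024, 64, lines_per_entry=4)

print(fine_grained, coarse_grained)
```

The total SFT area would then scale as the per-entry area multiplied by the entry count, and again by the number of agents whose caches must be covered.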
[0012]
[0016] The cache coherence system disclosed herein provides a mechanism for tracking cache line ownership in a coarse-grained snoop filter that allows the core-valid (CV) vector bits to be used flexibly, trading them off against sector size. Specifically, when the amount of cache line sharing is below a threshold, fewer bits are required to track individual agents, and area savings are maximized using coarse-grained aggregation. On the other hand, when the amount of cache line sharing exceeds the threshold, the cache coherence system limits the sector size of the snoop filter, making more bits available for tracking individual agents. In one implementation disclosed herein, this solution is applied to a static coarse-grained snoop filter. In an alternative implementation disclosed herein, this solution is applied to a dynamic snoop filter in which different sector sizes can coexist at run time.
[0013]
[0017] Figure 1 shows one implementation of a system 100 that provides cache coherence using a snoop filter in accordance with the technology disclosed herein. Specifically, the cache coherence system 100 may be implemented on a multicore architecture that includes multiple central processing unit (CPU) cores 102 and 104, a graphics processing unit (GPU) 106, one or more input/output (I/O) agents 108, a point of serialization (PoS) 110, and memory 114. This example shows two CPU cores and one GPU, but it should be understood that any number of CPU cores and GPUs can be used without departing from the scope of this disclosure. Examples of I/O agents 108 include, but are not limited to, industry standard architecture (ISA) devices, peripheral component interconnect (PCI) devices, PCI-X devices, PCI Express devices, universal serial bus (USB) devices, Advanced Technology Attachment (ATA) devices, Small Computer System Interface (SCSI) devices, and InfiniBand devices.
[0014]
[0018] The processing units 102, 104, and 106 and the I/O agent 108 may be referred to collectively as agents 102-108, each identified by an agent ID (AID). These agents 102-108 may have multiple levels of internal caches, such as L1, L2, and L3 caches. Because agents 102-108 store coherent, shared memory blocks (cograins) in their internal caches, the snoop filter (SFT) 150 keeps a record of these cograins and their locations. Any of agents 102-108 may issue coherent or non-coherent requests, and the PoS 110 provides memory coherence by using the snoop filter 150 to ensure that memory access requests are reliably serialized.
[0015]
[0019] For example, PoS 110 receives a coherent request 120 from CPU core 102. In response to the coherent request 120, PoS 110 issues a snoop command 122 to CPU core 104, GPU 106, and I/O agent 108. CPU core 104, GPU 106, and I/O agent 108 may return the requested coherent information to PoS 110. When sending snoop 122, PoS 110 consults SFT 150.
[0016]
[0020] An exemplary implementation of SFT 150 is shown as SFT 150a. SFT 150a includes a data structure for tracking the location of cograins currently cached by agents 102-108. SFT 150a may be an n-way filter, as shown by the n arrays 154. The snoop filter 150a may include an array of entries 152, the contents of which are described further below. In the implementation of SFT 150a disclosed herein, the logical entries 152 may be configured to store a tag field 160, a coherency state 162, a sector size 164, a core-valid (CV) vector 166, and other metadata 168, such as error correction codes (ECC).
[0017]
[0021] The tag field 160 stores the tag portion of the physical address (PA) that identifies a cograin. For example, if the cograin size is 64 bytes and the SFT is a 16-way set-associative SFT, bits 15:6 of the PA may be used to select an SFT set, and bits 47:16 of the PA may be stored as a tag in the tag field 160 of the SFT entry 152. When SFT 150a needs to perform a lookup to determine whether the PA of a cograin is present in SFT 150a, it uses PA[15:6] to select a set (a 10-bit index addresses up to 1,024 sets). For the selected set, SFT 150a then compares PA[47:16] with the tag values stored in the tag field 160 of the 16 SFT entries 152 in that set. If the tag field 160 of any of the 16 SFT entries in the selected set matches, that way (e.g., way 5) is currently tracking the cograin being looked up.
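A minimal sketch of this lookup follows, assuming the bit positions given above (PA[5:0] as the byte offset within a 64-byte cograin, PA[15:6] as the set index, PA[47:16] as the tag). The names `split_pa` and `lookup`, and the list-of-lists representation of the SFT, are hypothetical and chosen only for illustration.

```python
OFFSET_BITS = 6    # 64-byte cograin: PA[5:0] is the byte offset
INDEX_BITS = 10    # PA[15:6] selects the set
TAG_BITS = 32      # PA[47:16] is stored as the tag
WAYS = 16          # 16-way set-associative SFT


def split_pa(pa: int) -> tuple[int, int]:
    """Split a physical address into (set index, tag)."""
    index = (pa >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = (pa >> (OFFSET_BITS + INDEX_BITS)) & ((1 << TAG_BITS) - 1)
    return index, tag


def lookup(sets: list[list], pa: int):
    """Return the way tracking the cograin at pa, or None on an SFT miss."""
    index, tag = split_pa(pa)
    for way, stored_tag in enumerate(sets[index]):
        if stored_tag == tag:
            return way
    return None


# One tag slot per way in every set; None marks an empty way.
sets = [[None] * WAYS for _ in range(1 << INDEX_BITS)]

pa = 0x1234_5678_9AC0
index, tag = split_pa(pa)
sets[index][5] = tag           # pretend way 5 is tracking this cograin

print(lookup(sets, pa))        # hit in way 5
print(lookup(sets, pa + 0x3F)) # same cograin, different byte offset
print(lookup(sets, pa + 0x10000))  # same set, different tag
```

A real SFT entry would hold the coherency state, sector size, and CV vector alongside the tag; only the tag comparison is modeled here.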
[0018]
[0022] The coherence state 162 tracks whether the cache line tracked by the SFT entry is owned exclusively by one agent or shared by multiple agents. The sector size 164 indicates the number of cache lines tracked by the SFT entry. The CV vector 166 tracks the agent ID (AID) of each agent that holds that cache line. Using the CV vector 166 reduces oversnooping because coherence enforcement only needs to snoop the agents tracked by the CV vector 166, instead of snooping every agent that might possibly hold the cache line. Other metadata 168 may include fields that identify reuse information and other information that may be used to improve SFT performance.
[0019]
[0023] In an implementation of the cache coherence system 100 disclosed in this specification, the bits allocated to the CV vector 166 are used both to track the actual cache lines within the sector that are stored in an agent's private cache and to track the cores or agents that hold them. In the illustrated implementation, when the amount of cache line sharing is small, fewer bits of the CV vector 166 are needed to track individual cores or agents, and area savings can be achieved using coarse-grained aggregation. On the other hand, as sharing of the cache lines tracked by the SFT entry increases, the cache coherence system 100 limits the sector size of the SFT 150a and devotes more bits within the CV vector 166 to tracking individual cores or agents.
[0020]
[0024] The following details the trade-off between the two uses of the SFT entry bits, both for a static coarse-grained SFT structure, used when the cache line sharing rate is low, and for a dynamic structure in which different sector sizes 164 can coexist at run time. Specifically, the implementations disclosed below explain the flexible assignment of the bits of the CV vector 166 to (a) track the actual cache lines within the sector that are stored in the private cache (the CV vector 166 acting as a sector-valid vector), and (b) track the core or agent holding the cache line (the CV vector 166 acting as a core-valid vector).
[0021]
[0025] In the implementation disclosed in this specification, the CV vector bits are used adaptively based on the sector size of the SFT. If the sector size is small, for example, 128 bytes, cache line sharing within the sector is generally limited to a subset of the cores. For example, in a 64-core system, 6 bits are required to uniquely identify a core. In this case, when the entire sector is held by only one core, the CV vector 166 can be 6 bits instead of the standard 64 bits. Since the CV vector 166 is generally provisioned with 64 bits, the cache coherence system 100 calculates the number of cores that can be tracked within the CV vector 166 and uses the coarse-grained snoop filter tracking method to achieve area savings.
[0022]
[0026] Consider a baseline SFT implementation whose CV vector 166 has T bits. The following formulas estimate the core-tracking capability of this method. The value of T is determined by the number of cores/agents tracked in the coherence domain, and N represents the number of agents that this method can track.
[0027] Core ID field = log2(num_cores) * N
[0028] Bit sector field = (sector_size / cacheline_size) * N
[0029] N = Floor(T / (log2(num_cores) + (sector_size / cacheline_size)))
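The estimate can be evaluated directly. The sketch below is an illustration only: the function name `trackable_agents` is an assumption, and the printed values are computed from the formula for T = 64 bits, 64 cores, and 64-byte cache lines rather than copied from the specification's table.

```python
def trackable_agents(t_bits: int, num_cores: int,
                     sector_size: int, cacheline_size: int = 64) -> int:
    """N = floor(T / (log2(num_cores) + sector_size / cacheline_size))."""
    # Bits needed to encode one coreID; equals log2(num_cores) for powers of two.
    core_id_bits = (num_cores - 1).bit_length()
    # One bit-vector bit per cache line in the sector.
    bv_bits = sector_size // cacheline_size
    return t_bits // (core_id_bits + bv_bits)


for sector_size in (128, 256, 512, 1024, 2048):
    n = trackable_agents(t_bits=64, num_cores=64, sector_size=sector_size)
    print(f"sector {sector_size:>4} B -> N = {n}")
```

Each doubling of the sector size doubles the per-agent bit-vector width, so the number of agents that fit in a fixed T-bit budget falls accordingly.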
[0023]
[0030] The following table shows the number of cores that can be tracked when the size T of the CV vector 166 is 64 bits and the cache line size is 64 bytes.
[0024]
Table 1
[0025]
[0031] In one implementation of the cache coherence system disclosed herein, each SFT entry has N CoreID fields, e.g., CoreID0, CoreID1, ..., CoreID N-1, so that N cores can be tracked using this method. Each CoreID field holds an encoded CoreID and has an associated BV field, e.g., BV0, BV1, ..., BV N-1. Each BV field holds information about which cograins within a given sector are cached by the associated CoreID. Therefore, the larger the sector size in the SFT architecture, the larger the associated BV field.
[0026]
[0032] In implementations using dynamic sector sizing for the SFT, the aforementioned estimate of the number of agents N that can be tracked using the CV vector 166 can be modified to account for the S bits required to represent the current sector size indicated by sector size 164. Specifically, in such dynamic sector sizing implementations, N can be estimated by the following equation:
[0033] N = Floor((T - S) / (log2(num_cores) + (sector_size / cacheline_size)))
[0027]
[0034] The table below shows the number of trackable cores when the CV vector 166 size T is 64 bits and the cache line size is 64 bytes. In this implementation, the sector size can be dynamically selected based on the number of agents N that can be tracked using the CV vector 166. For example, if the current sector size of the SFT architecture is 1024 bytes and the number of trackable agents N must be greater than 2, the cache coherence system may dynamically reduce the sector size to 512 bytes. Alternatively, the cache coherence system may dynamically increase the sector size if it identifies a low cache line sharing rate, as indicated by the number of agents that need to be tracked by the SFT compared to N. For example, if the number of agents that need to be tracked by the SFT is 3 and the current value of N is 6, the cache coherence system may dynamically increase the sector size from 256 bytes to 512 bytes.
[0028] [Table 2]
[0029]
[0035] Implementations using dynamic sector sizing for SFTs trade off tracking capability against sector size without increasing the number of bits required in the SFT entry. The main capability of this implementation is that the sector size can be decreased or increased based on the degree of cache line sharing so that the CV vector 166 remains efficiently utilized.
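One possible selection policy is sketched below, under stated assumptions: the set of supported sector sizes, the value S = 2 bits (enough to encode four sizes), and the "largest sector that still tracks enough agents" rule are all illustrative choices, not details given in the specification. The helper names are hypothetical.

```python
SECTOR_SIZES = (128, 256, 512, 1024)  # assumed supported sizes, in bytes


def trackable_agents(t_bits: int, s_bits: int, num_cores: int,
                     sector_size: int, cacheline_size: int = 64) -> int:
    """N = floor((T - S) / (log2(num_cores) + sector_size / cacheline_size))."""
    core_id_bits = (num_cores - 1).bit_length()
    bv_bits = sector_size // cacheline_size
    return (t_bits - s_bits) // (core_id_bits + bv_bits)


def choose_sector_size(agents_needed: int, t_bits: int = 64,
                       s_bits: int = 2, num_cores: int = 64) -> int:
    """Pick the largest sector size whose N still covers agents_needed."""
    for size in sorted(SECTOR_SIZES, reverse=True):
        if trackable_agents(t_bits, s_bits, num_cores, size) >= agents_needed:
            return size
    raise ValueError("sharing too wide for any supported sector size")


# Three sharers do not fit at 1024 B (N = 2) but do fit at 512 B (N = 4).
print(choose_sector_size(agents_needed=3))
# A line held by a single agent can use the coarsest sector.
print(choose_sector_size(agents_needed=1))
```

Under these assumptions the policy reproduces both directions described above: a sector shrinks when more agents must be tracked than N allows, and grows when the number of sharers is comfortably below N.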
[0030]
[0036] Figure 2 discloses operations 200 of the cache coherence system disclosed herein for updating the SFT entry fields in response to an allocation request. Operation 202 receives an incoming request to allocate an SFT entry for a cache line at address X on behalf of a new caching agent having coreID A (or agent ID (AID) A). Operation 204 determines whether an existing SFT entry is tracking the cache line at address X. If not, operation 208 allocates a new SFT entry, sets the coreID of the newly allocated entry to coreID A, and sets the bits of the bit vector (BV) in the SFT entry as indicated by the encode(x) function of the cache coherence system. The encode(x) function is the encoding function for the BV used by the coarse-grained SFT: it derives the BV bits to be set or reset from the address of the cache line and its relative position within the sector. Here, encode(x) may return the bits to be set or reset in the BV for the first coreID field, CoreID(0), depending on where cache line X is located within the sector.
[0031]
[0037] If operation 204 indicates an SFT hit, that is, an existing SFT entry is tracking the cache line at address X, operation 206 determines whether coreID A already exists in a coreID field of the hit SFT entry. This can happen when the same core, core A, accesses a different cache line within the same sector tracked by the hit SFT entry. If coreID A exists, operation 210 finds the index of cache line X in the coreID field of the hit SFT entry. Operation 212 then calls the encode(x) function to set the appropriate bit in the BV.
[0032]
[0038] When operation 206 determines that core A is accessing the sector containing cache line X for the first time, operation 214 selects the next available coreID field in the SFT entry. Operation 216 then writes coreID A into that field and calls the encode(x) function to set the appropriate bits in the BV associated with it. Operations 200 thus record both which coreIDs have accessed the sector and which cache lines within the sector are cached by which coreIDs. This is a technical advantage over other coarse-grained SFT implementations, which only provide information about the coreIDs that have accessed the sector tracked by the SFT entry but not about which cache lines within the sector are cached by which coreIDs. In addition, the techniques disclosed herein can provide such information without increasing the required SFT area.
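The allocation flow of Figure 2 can be sketched as follows. This is a software model under stated assumptions, not the hardware described: `encode` is assumed to return a one-hot mask for the cache line's position in its sector, the class and field names are hypothetical, and the behavior when no coreID field is free is not described in the specification, so the sketch simply raises an error.

```python
from dataclasses import dataclass, field

CACHELINE_SIZE = 64  # bytes


def encode(addr: int, sector_size: int) -> int:
    """Assumed encoding: one-hot BV mask for the line's position in its sector."""
    return 1 << ((addr % sector_size) // CACHELINE_SIZE)


@dataclass
class SFTEntry:
    # Maps each tracked coreID to its BV (one bit per cache line in the sector).
    core_fields: dict = field(default_factory=dict)


class SnoopFilter:
    def __init__(self, sector_size: int = 256, max_core_fields: int = 6):
        self.sector_size = sector_size
        self.max_core_fields = max_core_fields
        self.entries = {}  # sector base address -> SFTEntry

    def allocate(self, addr: int, core_id: int) -> SFTEntry:
        base = addr - addr % self.sector_size
        entry = self.entries.get(base)               # operation 204: lookup
        if entry is None:
            entry = self.entries[base] = SFTEntry()  # operation 208: new entry
        if core_id not in entry.core_fields:         # operation 206
            if len(entry.core_fields) >= self.max_core_fields:
                # Not covered by the source text; flagged rather than guessed.
                raise RuntimeError("no free coreID field in this SFT entry")
            entry.core_fields[core_id] = 0           # operations 214/216
        # Operations 208/212/216: set the BV bit for this line and core.
        entry.core_fields[core_id] |= encode(addr, self.sector_size)
        return entry


sft = SnoopFilter(sector_size=256, max_core_fields=2)
sft.allocate(0x1000, core_id=1)   # miss: new entry, line 0 of the sector
sft.allocate(0x1040, core_id=1)   # hit, same core: line 1 added to its BV
sft.allocate(0x1040, core_id=2)   # hit, new core: next free coreID field
print(sft.entries[0x1000].core_fields)
```

After these three requests the entry records that core 1 caches lines 0 and 1 of the sector while core 2 caches only line 1, which is the per-core, per-line information the paragraph above describes.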
[0033]
[0039] Figure 3 discloses operations 300 of the cache coherence system disclosed herein for updating the SFT entry fields in response to an invalidation request. Specifically, operations 300 are used when a cache line that a given coreID holds in its cache is released. Operation 302 may receive an incoming request from an agent having coreID A to invalidate the SFT tracking for a cache line at address X. Operation 304 finds coreID A in the coreID fields of the SFT entry and identifies the associated BV bits that need to be reset by calling the encode(x) function.
[0034]
[0040] Subsequently, operation 306 determines whether the entire BV is clear. If it is, operation 308 clears the coreID field from the SFT entry so that the coreID field can be used by other cores / agents. Furthermore, operation 310 determines whether all coreID fields of the hit SFT entry are clear. If they are clear, operation 312 removes the SFT entry from the SFT.
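A minimal sketch of the invalidation flow of Figure 3 follows. It assumes a one-hot `encode` for the line's position in a 256-byte sector and models the SFT as a plain dictionary from sector base address to a {coreID: BV} mapping; these representations and the choice to ignore requests for untracked lines are assumptions made for illustration.

```python
CACHELINE_SIZE = 64   # bytes
SECTOR_SIZE = 256     # bytes; assumed fixed for this sketch


def encode(addr: int) -> int:
    """Assumed encoding: one-hot BV mask for the line's position in its sector."""
    return 1 << ((addr % SECTOR_SIZE) // CACHELINE_SIZE)


def invalidate(sft: dict, addr: int, core_id: int) -> None:
    """Stop tracking the cache line at addr for core_id."""
    base = addr - addr % SECTOR_SIZE
    entry = sft.get(base)
    if entry is None or core_id not in entry:
        return  # nothing tracked; ignoring the request is an assumption
    entry[core_id] &= ~encode(addr)   # operation 304: reset the BV bit
    if entry[core_id] == 0:           # operation 306: is the whole BV clear?
        del entry[core_id]            # operation 308: free the coreID field
    if not entry:                     # operation 310: all coreID fields clear?
        del sft[base]                 # operation 312: remove the SFT entry


sft = {0x1000: {1: 0b0011, 2: 0b0010}}
invalidate(sft, 0x1040, core_id=2)    # core 2 held only line 1: field freed
invalidate(sft, 0x1000, core_id=1)    # core 1 still holds line 1
print(sft)
invalidate(sft, 0x1040, core_id=1)    # last line released: entry removed
print(sft)
```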
[0035]
[0041] Figure 4 illustrates operations 400 of the cache coherence system disclosed herein for taking ownership of a shared cache line. Specifically, operations 400 illustrate an exemplary snoop generation scenario triggered when an agent issues an ownership request to acquire exclusive ownership of a cache line. Operation 402 receives an incoming ownership request from an agent having coreID A to read the cache line at address X.
[0036]
[0042] Operation 404 identifies that the hit SFT entry does not contain coreID A in its coreID fields. Operation 406 then iterates over the existing coreID fields in the hit SFT entry to determine whether address X has been cached by any of the cache-holding agents identified by those coreID fields. Specifically, for each coreID field, operation 408 determines whether the BV bit returned by the encode(x) function for the cache line at X is set for that coreID. If it is set, operation 408 sends a single snoop or back-invalidate request to that particular core at that iteration. No snoop or back-invalidate requests are required for any core whose bit is not set. Back-invalidation ensures that ownership of the cache line at X is transferred from the old coreID identified during the iteration to the new coreID A. Operation 408 then determines whether the BV is clear, and if not, continues the iteration.
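The snoop generation described for Figure 4 can be sketched as below. The sketch assumes an SFT hit, a one-hot `encode`, and a dictionary model of the entry; clearing the snooped core's bit and recording the requester as the new holder afterwards are assumptions about bookkeeping that the text above does not spell out. All names are hypothetical.

```python
CACHELINE_SIZE = 64   # bytes
SECTOR_SIZE = 256     # bytes; assumed fixed for this sketch


def encode(addr: int) -> int:
    """Assumed encoding: one-hot BV mask for the line's position in its sector."""
    return 1 << ((addr % SECTOR_SIZE) // CACHELINE_SIZE)


def take_ownership(sft: dict, addr: int, core_id: int) -> list:
    """Return the coreIDs that must be snooped / back-invalidated for addr."""
    base = addr - addr % SECTOR_SIZE
    entry = sft[base]                  # assumes an SFT hit for this sector
    mask = encode(addr)
    snooped = []
    for holder in list(entry):         # operation 406: iterate coreID fields
        if holder == core_id:
            continue
        if entry[holder] & mask:       # operation 408: is this line's bit set?
            snooped.append(holder)     # one targeted snoop / back-invalidate
            entry[holder] &= ~mask     # assumed: the holder loses the line
            if entry[holder] == 0:
                del entry[holder]      # assumed: free an emptied coreID field
    # Assumed bookkeeping: record the requester as the line's new holder.
    entry[core_id] = entry.get(core_id, 0) | mask
    return snooped


# Cores 1 and 2 hold line 1 of the sector; core 3 holds only line 2.
sft = {0x1000: {1: 0b0011, 2: 0b0010, 3: 0b0100}}
print(take_ownership(sft, 0x1040, core_id=9))
print(sft)
```

Core 3 has touched the sector but not the requested line, so it receives no snoop; a coarse-grained filter that tracked only which cores accessed the sector would have had to snoop it as well.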
[0037]
[0043] The technology disclosed herein provides functionality similar to that of other SFT solutions while allocating significantly less space for tracking the agents that hold cache lines. Furthermore, the technology disclosed herein achieves this without the oversnooping that may occur in other solutions. Specifically, one of the advantages of the adaptive coherency tracking disclosed herein is that it allows flexible use of the core-valid (CV) vector, traded off against sector size. The storage capacity or area saved by providing adaptive coherency tracking in the manner disclosed herein can be used to provide additional functions in the SFT.
[0038]
[0044] Figure 5 shows an exemplary system 500 that may be useful in implementing the cache coherence system disclosed herein. The exemplary hardware and operating environment in Figure 5 for implementing the described techniques includes a general-purpose computing device in the form of a computer 20, a mobile phone, a personal digital assistant (PDA), a tablet, a smartwatch, a gaming remote, or another type of computing device. In the implementation of Figure 5, for example, the computer 20 includes a processing unit 21, system memory 22, and a system bus 23 that operationally connects various system components, including the system memory 22, to the processing unit 21. There may be one or more processing units 21, such that the processor of the computer 20 comprises a single central processing unit (CPU) or a plurality of processing units, commonly referred to as a parallel processing environment. The computer 20 may be a conventional computer, a distributed computer, or any other type of computer, and the implementation is not limited thereto.
[0039]
[0045] The system bus 23 may be any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a switched fabric, a point-to-point connection, and a local bus using any of a variety of bus architectures. The system memory 22 may also be referred to simply as memory and includes read-only memory (ROM) 24 and random access memory (RAM) 25. The basic input/output system (BIOS) 26, containing the basic routines that help transfer information between elements within the computer 20, for example at startup, is stored in the ROM 24. The computer 20 further includes a hard disk drive 27 for reading from and writing to a hard disk (not shown), a magnetic disk drive 28 for reading from and writing to a removable magnetic disk 29, and an optical disk drive 30 for reading from and writing to a removable optical disk 31, such as a CD-ROM, DVD, or other optical medium.
[0040]
[0046] Computer 20 may be used to implement the cache coherence system disclosed herein. In one implementation, a frequency unwrapping module, which includes instructions for unwrapping frequencies based at least partially on a sampled reflected modulated signal, may be stored in the memory of computer 20, for example, read-only memory (ROM) 24 and random-access memory (RAM) 25.
[0041]
[0047] Furthermore, instructions stored in the memory of computer 20 may be used to generate a transformation matrix using one or more operations disclosed in Figures 2-4. Similarly, instructions stored in the memory of computer 20 may also be used to implement one or more operations in Figures 2-4. The memory of computer 20 may also store one or more instructions for implementing the cache coherence system disclosed herein.
[0042]
[0048] The hard disk drive 27, the magnetic disk drive 28, and the optical disk drive 30 are connected to the system bus 23 by hard disk drive interface 32, magnetic disk drive interface 33, and optical disk drive interface 34, respectively. The drives and their associated tangible computer-readable media provide non-volatile storage for computer-readable instructions, data structures, program modules, and other data for the computer 20. Those skilled in the art will understand that any type of tangible computer-readable media may be used in this exemplary operating environment.
[0043]
[0049] Multiple program modules may be stored on a hard disk, magnetic disk 29, optical disk 31, ROM 24, or RAM 25, and these may include an operating system 35, one or more application programs 36, other program modules 37, and program data 38. A user may generate reminders on the personal computer 20 through input devices such as a keyboard 40 and a pointing device 42. Other input devices (not shown) may include a microphone (e.g., for voice input), a camera (e.g., for a natural user interface (NUI)), a joystick, a gamepad, a satellite receiving antenna, a scanner, or the like. These and other input devices are often connected to the processing unit 21 via a serial port interface 46 connected to the system bus 23, but may also be connected via other interfaces, such as a parallel port, a game port, or a universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the monitor, the computer typically includes other peripheral output devices (not shown), such as speakers and a printer.
[0044]
[0050] Computer 20 may operate within a networking environment that uses logical connections to one or more remote computers, for example, remote computer 49. These logical connections are realized by communication devices connected to or part of computer 20, and the implementation is not limited to a specific type of communication device. Remote computer 49 could be another computer, server, router, network PC, client, peer device, or other common network node, and typically includes many or all of the elements described above with respect to computer 20. The logical connections shown in Figure 5 include local area networks (LANs) 51 and wide area networks (WANs) 52. Such networking environments are common in office networks, corporate computer networks, intranets, and the internet, all of which are types of networks.
[0045]
[0051] When used in a LAN networking environment, the computer 20 is connected to the local area network 51 via a network interface or adapter 53, which is a type of communication device. When used in a WAN networking environment, the computer 20 typically includes a modem 54, a network adapter, a type of communication device, or any other type of communication device for establishing communication over a wide area network 52. The modem 54 may be internal or external, but is connected to the system bus 23 via a serial port interface 46. In a network environment, the program engine shown for the personal computer 20 or a part thereof may be stored in a remote memory storage device. The network connection in the figure is an example, and other means of communication devices may be used to establish communication links between computers.
[0046]
[0052] In one exemplary implementation, software or firmware instructions for the cache coherence system 510 may be stored in system memory 22 and/or storage device 29 or 31 and processed by processing unit 21. The operations and data of the cache coherence system may be stored in a persistent data store in system memory 22 and/or storage device 29 or 31.
[0047]
[0053] Unlike tangible computer-readable storage media, intangible computer-readable communication signals can embody computer-readable instructions, data structures, program modules, or other data contained within modulated data signals, such as carrier waves or other signal-carrying mechanisms. The term “modulated data signal” means a signal whose one or more properties are set or modified in a manner that encodes information within the signal. For example, but not limited to, intangible communication signals include wired media such as wired networks or direct wired connections, as well as wireless media such as acoustic, RF, and infrared, and other wireless media.
[0048]
[0054] Some embodiments of the cache coherence system may include a product. The product may include a tangible storage medium for storing logic. Examples of storage media include one or more types of computer-readable storage media capable of storing electronic data, including volatile or non-volatile memory, removable or non-removable memory, erasable or non-erasable memory, and writable or rewritable memory. Examples of logic include various software elements, such as software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (APIs), instruction sets, computing code, computer code, code segments, computer code segments, words, values, symbols, or any combination thereof. In one embodiment, for example, the product may store executable computer program instructions that, when executed by a computer, cause the computer to perform methods and/or operations according to the described embodiments. Executable computer program instructions can include any appropriate type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, and others. Executable computer program instructions can be implemented according to a predefined computer language, manner, or syntax for instructing a computer to perform a specific function. Instructions can also be implemented using any appropriate high-level, low-level, object-oriented, visual, compiled, and/or interpreted programming language.
[0049]
[0055] The cache coherence systems disclosed herein may include various tangible computer-readable storage media and intangible computer-readable communication signals. Tangible computer-readable storage can be embodied by any available media accessible to the cache coherence systems disclosed herein, and includes both volatile and non-volatile storage media and removable and non-removable storage media. Tangible computer-readable storage media exclude intangible and transient communication signals and include media implemented by any method or technique for storing information such as computer-readable instructions, data structures, program modules, or other data. Tangible computer-readable storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory, or other memory technologies, CD-ROM, digital versatile disk (DVD), or other optical disk storage, magnetic cassette, magnetic tape, magnetic disk storage, or other magnetic storage devices, or any other tangible media that can be used to store desired information and are accessible to the cache coherence systems disclosed herein.
[0050]
[0056] The implementations disclosed herein provide a method that includes identifying a cache line sector associated with a snoop filter (SFT) having multiple SFT entries, identifying the number of cache lines stored in a cache by one or more agents of the identified cache line sector, and determining, based on that number of cache lines, the number of bits in the bit vector (BV) of one or more SFT entries among the multiple SFT entries, wherein the number of bits is the number necessary to track the one or more agents.
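The three steps of this method can be illustrated with a minimal Python sketch. It is not the implementation described in this publication: the 64-byte line size, the function names, and in particular the rule of one BV bit per cache line held by each agent are assumptions made only for illustration.

```python
from collections import defaultdict

CACHE_LINE_BYTES = 64  # assumed cache line size, not taken from the source


def sector_of(address: int, sector_bytes: int) -> int:
    """Step 1: index of the cache line sector that contains `address`."""
    return address // sector_bytes


def cached_lines_per_agent(cached, sector: int, sector_bytes: int) -> dict:
    """Step 2: count, per agent, the distinct cache lines it holds in `sector`.

    `cached` is an iterable of (agent_id, address) pairs (hypothetical input).
    """
    lines = defaultdict(set)
    for agent, address in cached:
        if sector_of(address, sector_bytes) == sector:
            lines[agent].add(address // CACHE_LINE_BYTES)
    return {agent: len(held) for agent, held in lines.items()}


def bv_bits_needed(counts: dict) -> int:
    """Step 3 (assumed rule): one BV bit per cache line held by each agent."""
    return sum(counts.values())


# Hypothetical snapshot: agents 0 and 1 share a 256-byte sector, agent 2 does not.
cached = [(0, 0x1000), (0, 0x1040), (1, 0x1040), (2, 0x2000)]
sector = sector_of(0x1000, 256)
counts = cached_lines_per_agent(cached, sector, 256)
bits = bv_bits_needed(counts)
```

In this snapshot only agents 0 and 1 hold lines in the identified sector, so only their lines contribute to the bit count under the assumed rule.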
[0051]
[0057] An alternative implementation provides a system comprising memory, one or more processor units, and a cache coherence system stored in the memory and executable by the one or more processor units, wherein the cache coherence system encodes computer-executable instructions on the memory for executing a computer process on the one or more processor units. The computer process includes identifying a cache line sector associated with a snoop filter (SFT) having multiple SFT entries, identifying the number of cache lines stored in a cache by one or more agents of the identified cache line sector, and determining, based on that number of cache lines, the number of bits in the bit vector (BV) of one or more SFT entries among the multiple SFT entries, wherein the number of bits is the number necessary to track the one or more agents.
[0052]
[0058] An alternative implementation includes one or more physically manufactured computer-readable storage media that encode computer-executable instructions for executing a computer process on a computer system. The computer process includes identifying a cache line sector associated with a snoop filter (SFT) having multiple SFT entries, identifying the number of cache lines stored in a cache by one or more agents of the identified cache line sector, and determining, based on that number of cache lines, the number of bits in the bit vector (BV) of one or more SFT entries among the multiple SFT entries, wherein the number of bits is the number necessary to track the one or more agents.
[0053]
[0059] The implementations described herein are implemented as logical steps within one or more computer systems. The logical operations may be implemented (1) as a sequence of processor-implemented steps executed within one or more computer systems, and (2) as interconnected machine or circuit modules within one or more computer systems. The choice of implementation depends on the performance requirements of the computer system used. Accordingly, the logical operations making up the implementations described herein may be referred to variously as operations, steps, objects, or modules. Furthermore, it should be understood that the logical operations may be performed in any order unless explicitly claimed otherwise or unless a particular order is inherently required by the language of the claims. The above specification, examples, and data, together with the accompanying appendices, provide a complete description of the structure and use of the exemplary implementations.
Claims
1. A method comprising: identifying a cache line sector associated with a snoop filter (SFT) (150a) (150) having multiple SFT entries (152); identifying the number of cache lines stored in a cache by one or more agents of the identified cache line sector; and determining, based on the number of cache lines stored in the cache by the one or more agents of the identified cache line sector, the number of bits in a bit vector (BV) of one or more SFT entries (152) among the multiple SFT entries, wherein the number of bits is the number necessary to track the one or more agents.
2. The method according to claim 1, further comprising determining the total number N of agents that can be tracked by the SFT based on the total number of cores tracked by the SFT, the size of the cache lines, and the size of the cache line sector associated with the SFT.
3. The method according to claim 2, further comprising: comparing the total number N of agents that can be tracked by the SFT with the required number of agents to be tracked by the SFT; and dynamically reducing the size of the cache line sector associated with the SFT in response to determining that the total number N of agents that can be tracked by the SFT is less than or equal to the required number of agents to be tracked by the SFT.
4. The method according to claim 3, further comprising dynamically increasing the size of the cache line sector associated with the SFT in response to the determination that the total number of agents N that can be tracked by the SFT is greater than the required number of agents to be tracked by the SFT.
5. The method according to claim 1, further comprising: receiving an ownership request, having coreID A, to read a cache line at address X; identifying that a hit SFT entry in the SFT does not contain the coreID A in its coreID field; in response to identifying that the hit SFT entry does not contain the coreID A in its coreID field, determining whether the address X is stored in a cache by any of the existing caching agents identified by the coreID field of the hit SFT entry; identifying one of the existing caching agents of the hit SFT entry that has stored the cache line at address X in its cache; and sending a back invalidation request to the identified caching agent.
6. The method according to claim 1, further comprising: receiving an input request to allocate a cache line at address X with coreID A; identifying that the cache line at address X is tracked by a coreID field of the hit SFT entry; and in response to identifying that the cache line at address X is tracked by the coreID field of the hit SFT entry, selecting the next coreID field of the hit SFT entry.
7. The method according to claim 6, further comprising updating the next coreID field and the BV associated with the next coreID field.
8. The method according to claim 1, further comprising: receiving an input request to invalidate the cache line at address X with coreID A; identifying that the cache line at address X is tracked by a coreID field of the hit SFT entry; and in response to identifying that the cache line at address X is tracked by the coreID field of the hit SFT entry, clearing the coreID field.
9. The method according to claim 8, further comprising marking the cleared coreID field as available for use in tracking a new cache line address.
10. One or more physically manufactured computer-readable storage media encoding computer-executable instructions for executing a computer process on a computer system (500), the computer process comprising: identifying a cache line sector associated with a snoop filter (SFT) (150a) (150) having multiple SFT entries (152); identifying the number of cache lines stored in a cache by one or more agents of the identified cache line sector; and determining, based on the number of cache lines stored in the cache by the one or more agents of the identified cache line sector, the number of bits in a bit vector (BV) of one or more SFT entries (152) among the multiple SFT entries, wherein the number of bits is the number necessary to track the one or more agents.
11. The one or more physically manufactured computer-readable storage media according to claim 10, wherein the computer process further includes determining the total number N of agents that can be tracked by the SFT based on the total number of cores tracked by the SFT, the size of the cache lines, and the size of the cache line sector associated with the SFT.
12. The one or more physically manufactured computer-readable storage media according to claim 11, wherein the computer process further includes: comparing the total number N of agents that can be tracked by the SFT with the required number of agents to be tracked by the SFT; and dynamically reducing the size of the cache line sector associated with the SFT in response to determining that the total number N of agents that can be tracked by the SFT is less than or equal to the required number of agents to be tracked by the SFT.
13. The one or more physically manufactured computer-readable storage media according to claim 11, wherein the computer process further includes dynamically increasing the size of the cache line sector associated with the SFT in response to determining that the total number N of agents that can be tracked by the SFT is greater than the required number of agents to be tracked by the SFT.
14. The one or more physically manufactured computer-readable storage media according to claim 10, wherein the computer process further includes: receiving an ownership request, having coreID A, to read a cache line at address X; identifying that a hit SFT entry in the SFT does not contain the coreID A in its coreID field; in response to identifying that the hit SFT entry does not contain the coreID A in its coreID field, determining whether the address X is stored in a cache by any of the existing caching agents identified by the coreID field of the hit SFT entry; identifying one of the existing caching agents of the hit SFT entry that has stored the cache line at address X in its cache; and sending a back invalidation request to the identified caching agent.
15. The one or more physically manufactured computer-readable storage media according to claim 10, wherein the computer process further includes: receiving an input request to allocate a cache line at address X with coreID A; identifying that the cache line at address X is tracked by a coreID field of the hit SFT entry; and in response to identifying that the cache line at address X is tracked by the coreID field of the hit SFT entry, selecting the next coreID field of the hit SFT entry.
16. The one or more physically manufactured computer-readable storage media according to claim 10, wherein the computer process further includes: receiving an input request to invalidate the cache line at address X with coreID A; identifying that the cache line at address X is tracked by a coreID field of the hit SFT entry; and in response to identifying that the cache line at address X is tracked by the coreID field of the hit SFT entry, clearing the coreID field.
17. A system (500) comprising: memory (114); one or more processor units; and a cache coherence system (500) (510) (100) stored in the memory (114) and executable by the one or more processor units, wherein the cache coherence system (500) (510) (100) encodes computer-executable instructions on the memory (114) for executing a computer process on the one or more processor units, the computer process comprising: identifying a cache line sector associated with a snoop filter (SFT) (150a) (150) having multiple SFT entries (152); identifying the number of cache lines stored in a cache by one or more agents of the identified cache line sector; and determining, based on the number of cache lines stored in the cache by the one or more agents of the identified cache line sector, the number of bits in a bit vector (BV) of one or more SFT entries (152) among the multiple SFT entries, wherein the number of bits is the number necessary to track the one or more agents.
18. The system according to claim 17, wherein the computer process further includes determining the total number N of agents that can be tracked by the SFT based on the total number of cores tracked by the SFT, the size of the cache lines, and the size of the cache line sector associated with the SFT.
19. The system according to claim 18, wherein the computer process further includes: comparing the total number N of agents that can be tracked by the SFT with the required number of agents to be tracked by the SFT; and dynamically reducing the size of the cache line sector associated with the SFT in response to determining that the total number N of agents that can be tracked by the SFT is less than or equal to the required number of agents to be tracked by the SFT.
20. The system according to claim 18, wherein the computer process further includes dynamically increasing the size of the cache line sector associated with the SFT in response to determining that the total number N of agents that can be tracked by the SFT is greater than the required number of agents to be tracked by the SFT.
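The coreID field handling and sector resizing recited in the claims can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the claimed implementation: the `Slot` and `SftEntry` names, the BV layout of one bit per line of the sector, freeing a field once its BV is empty, and the halving and doubling steps in `adjust_sector_lines` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Slot:
    core_id: Optional[int] = None  # None marks the coreID field as available
    bv: int = 0                    # assumed layout: bit i set => line i is cached


@dataclass
class SftEntry:
    num_slots: int
    slots: List[Slot] = field(default_factory=list)

    def __post_init__(self):
        if not self.slots:
            self.slots = [Slot() for _ in range(self.num_slots)]

    def _slot_for(self, core_id: int) -> Optional[Slot]:
        return next((s for s in self.slots if s.core_id == core_id), None)

    def allocate(self, core_id: int, line: int) -> bool:
        """Track `line` for `core_id`, taking the next free coreID field if needed."""
        slot = self._slot_for(core_id)
        if slot is None:
            slot = next((s for s in self.slots if s.core_id is None), None)
            if slot is None:
                return False  # entry full; replacement policy is out of scope
            slot.core_id = core_id
        slot.bv |= 1 << line  # update the BV associated with the selected field
        return True

    def invalidate(self, core_id: int, line: int) -> None:
        """Stop tracking `line` for `core_id`; free the field when nothing remains."""
        slot = self._slot_for(core_id)
        if slot is None:
            return
        slot.bv &= ~(1 << line)
        if slot.bv == 0:
            slot.core_id = None  # cleared field is available for a new address

    def ownership_request(self, core_id: int, line: int) -> List[int]:
        """Return the coreIDs of other agents holding `line`.

        These are the agents that would receive a back invalidation request.
        """
        return [s.core_id for s in self.slots
                if s.core_id is not None and s.core_id != core_id
                and s.bv & (1 << line)]


def adjust_sector_lines(n_trackable: int, n_required: int, sector_lines: int,
                        min_lines: int = 1, max_lines: int = 64) -> int:
    """Shrink the sector when N <= required, otherwise grow it (assumed steps)."""
    if n_trackable <= n_required:
        return max(min_lines, sector_lines // 2)
    return min(max_lines, sector_lines * 2)


entry = SftEntry(num_slots=2)
entry.allocate(0, 3)                      # core 0 caches line 3 of the sector
entry.allocate(1, 3)                      # core 1 takes the next coreID field
targets = entry.ownership_request(2, 3)   # core 2 is not tracked in this entry
```

In this usage, core 2 requests ownership of a line that cores 0 and 1 hold, so both appear in `targets`. A hardware snoop filter would implement these lookups in parallel logic rather than a sequential scan.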