System architecture and method for data prefetching

By optimizing the data prefetching process through a system architecture that uses cache partitioning and tiered cache allocation, the problems of cache pollution and bandwidth waste in complex memory access modes are solved, thereby improving CPU performance.

CN122195879A · Pending · Publication Date: 2026-06-12 · 58TH RES INST OF CETC

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
58TH RES INST OF CETC
Filing Date
2026-03-16
Publication Date
2026-06-12


Abstract

The application discloses a data prefetching system architecture and method, and belongs to the field of computer processors. The data prefetching system architecture and method implement cache partitioning for storing cache miss data and prefetch data, management of data residency time in the cache, hierarchical cache allocation of prefetch data, and instruction-prediction-based data prefetching, thereby reducing cache pollution, reducing CPU data access delay, improving prefetch hit rate, reducing prefetch metadata cache size, and further improving the performance of an existing data prefetcher. The data prefetching system architecture and method can be widely used in the field of processors. The application is novel, has a wide application range, and has great application significance.

Description

Technical Field

[0001] This invention belongs to the field of computer processor technology, and in particular relates to a system architecture and method for data prefetching.

Background Technology

[0002] When a computer executes a program, it frequently needs to access main memory to retrieve instructions and data. However, main memory access is slow relative to the processor, which limits computer performance. To alleviate this problem, computers use a hardware-managed memory called a cache. The cache mitigates the speed mismatch between the CPU and main memory, improving the speed at which the CPU accesses data. To further reduce the frequency of CPU reads from main memory, optimize main memory access patterns, and reduce the performance loss caused by memory access latency, researchers proposed the data prefetching mechanism. Data prefetching refers to loading data that is likely to be accessed from main memory into the cache or other fast storage in advance while the processor (such as a CPU) is running. Prefetching reduces cache misses and the latency they cause, hiding main memory access latency, improving program execution efficiency, and enhancing overall system performance. Currently, data prefetchers are widely used in various processors.

[0003] The location where prefetched data is stored is a crucial consideration, significantly impacting prefetching performance. Currently, there are two main approaches: one is to store the prefetched data in a dedicated buffer, which avoids cache pollution. However, the vast majority of approaches store the data in a high-speed cache close to the CPU, offering better performance due to its proximity to the processor. In this case, the prefetch engine acts as an auxiliary cache controller.

[0004] Poorly implemented data prefetching can lead to decreased system performance. This is because cache size is limited, especially the L1 cache, which has a relatively small capacity. If the data prefetching process makes an error and prefetches useless data, it may evict valid data that might be needed later, causing cache pollution. This, in turn, reduces system performance and increases power consumption. In some cases, such as when a program's algorithm is optimized to occupy a large portion of the cache space, poorly implemented data prefetching can cause cache lines that the program still needs to use to be evicted.

[0005] During data prefetching, if multiple expected addresses are prefetched, meaning multiple cache lines of data are requested consecutively, this is called prefetch depth. The choice of prefetch depth value is directly related to the target application. If the prediction of the target application's address sequence is accurate, increasing the prefetch depth can improve the coverage of cache misses in the target application, thus more effectively hiding storage latency. However, if the prediction accuracy is low, the prefetch requests themselves may have adverse effects, such as causing storage pollution, bandwidth consumption, and resource conflicts. Increasing the prefetch depth further amplifies these harmful effects, thereby reducing the execution efficiency of the target application.

[0006] Currently, the two main replacement algorithms for cached data are LRU and LFU. The LRU (Least Recently Used) algorithm replaces cached items based on the principle of least recently used, which can lead to "cold cache pollution." "Cold cache pollution" refers to pollution caused by infrequently used (e.g., only used once) objects remaining in the cache. The LFU (Least Frequently Used) algorithm replaces cached items based on the principle of least frequently used, which can lead to "hot cache pollution." "Hot cache pollution" refers to pollution caused by previously popular (or frequently accessed) objects that remain in the cache for a period of time, even if they are no longer popular. For some memory access patterns in real-world workloads, the current combination of the LRU algorithm and prefetching technology may not be ideal, potentially causing cache pollution. This is mainly because incorrect predictions may lead to the writing of objects that will not be reused later into the cache, resulting in these objects remaining in the cache for an extended period.
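As a rough illustration of the two pollution effects described above, the following sketch simulates a tiny fully associative cache under LRU and LFU. The capacity, the access traces, and the function names `lru_hits` and `lfu_hits` are illustrative assumptions and are not taken from the application.

```python
from collections import OrderedDict

def lru_hits(capacity, accesses):
    """Count hits for a fully associative cache with LRU replacement."""
    cache = OrderedDict()  # oldest entry first
    hits = 0
    for addr in accesses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used entry
            cache[addr] = True
    return hits

def lfu_hits(capacity, accesses):
    """Count hits for a fully associative cache with LFU replacement.

    Assumptions: counts are dropped on eviction, and ties evict the
    entry that was inserted earliest.
    """
    freq = {}
    hits = 0
    for addr in accesses:
        if addr in freq:
            hits += 1
            freq[addr] += 1
        else:
            if len(freq) >= capacity:
                victim = min(freq, key=freq.get)  # least frequently used entry
                del freq[victim]
            freq[addr] = 1
    return hits

# Cold pollution under LRU: the one-off X and Y push out the reused A and B.
cold_trace = ["A", "B", "A", "B", "X", "Y", "A", "B"]
# Hot pollution under LFU: A was popular early and then blocks B and C.
hot_trace = ["A", "A", "A", "B", "C", "B", "C"]

print(lru_hits(2, cold_trace), lfu_hits(2, hot_trace), lru_hits(2, hot_trace))
```

With a capacity of two lines, LRU gets only 2 hits on the cold trace because X and Y evict A and B just before they are reused. On the hot trace LFU gets 2 hits where LRU gets 4, because the stale but high-count A keeps occupying a line.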

[0007] Irregular prefetchers, especially temporal-stream prefetchers, typically need to record a large amount of metadata. However, on-chip cache space is relatively small, so the metadata generally has to be stored off-chip, and accessing off-chip metadata incurs significant latency. Furthermore, much of the metadata is stale and useless. In 2019, H. Wu proposed a prefetcher called Triage. The research found that only a small portion of the metadata is truly important; a small amount of valid metadata matters far more than a large amount of invalid metadata. Triage manages metadata at fine granularity, classifying it by program counter (PC). To ensure fast access to metadata, all prefetch metadata is cached in the LLC (last-level cache). To use LLC space effectively, the storage space occupied by metadata in the LLC can be dynamically adjusted based on the prefetch instruction hit rate.

[0008] Current computer systems have achieved a prefetch hit rate of 90% or even higher. However, some data is still not prefetched or is prefetched incorrectly. Accessing this data in main memory causes significant latency and becomes a bottleneck for CPU performance. These data accesses are mostly random, irregular, or sparse, or are accesses to the start address of a data stream. Such access patterns are detrimental to caches and prefetchers because prefetchers struggle to predict the data these programs will access, leading to failure to prefetch or failure to prefetch in a timely manner. Modern data structures are becoming increasingly complex, and processor memory access patterns are becoming more complex as well, placing higher demands on hardware prefetcher design. An irregular prefetcher that is too simple is cost-effective but offers little performance improvement for complex memory access patterns. If the data access pattern is scattered rather than local, data prefetching consumes bandwidth, but the prefetched data may never be used, or it may be evicted from the cache because it is accessed infrequently after prefetching, requiring repeated prefetching. All of these factors contribute to performance degradation.

[0009] Currently, most processors employ prefetching algorithms with relatively well-defined characteristics, such as stream prefetching and offset prefetching. Prefetching in different cache levels is relatively independent, and prefetching decisions are relatively conservative. While this ensures a certain level of prefetch accuracy, the prefetch coverage is not very high, making it difficult to adapt to more complex application scenarios and offering limited improvement to system performance. Furthermore, increasing the prefetch width and depth can improve prefetch coverage and prefetch hit rate, but the increased prefetch data volume undoubtedly leads to increased access bus bandwidth consumption and cache pollution. This can even evict potentially useful data from the cache in the future, severely impacting system performance.

[0010] Therefore, how to reduce cache pollution, reduce memory access bandwidth waste, and improve data prefetching efficiency while ensuring prefetch accuracy and coverage is a problem that needs to be solved now.

Summary of the Invention

[0011] The purpose of this invention is to address the problems mentioned in the background art by proposing a system architecture and method for data prefetching. This method enables cache partitioning for buffering of missed data and prefetched data, management of data cache residency time, hierarchical cache allocation of prefetched data, and instruction prediction-based data prefetching. This reduces cache pollution, lowers CPU access latency, improves prefetch hit rate, reduces the size of prefetched metadata cache, and further enhances the performance of existing data prefetchers.

[0012] This application provides a data prefetching system architecture, the system architecture including:

[0013] CPU control execution unit, L1D cache, L2 cache, L3 cache, data prefetcher, prefetch metadata cache, main memory, instruction prefetch and instruction fetch module, instruction address cache;

[0014] The L1D cache is divided into L1D miss partitions and L1D prefetch partitions.

[0015] The L2 cache is divided into L2 miss partitions and L2 prefetch partitions;

[0016] The L3 cache is divided into L3 miss partitions and L3 prefetch partitions;

[0017] The data prefetcher includes the L1D data prefetcher, the L2 data prefetcher, the L3 data prefetcher, and the prefetch coordinator;

[0018] The CPU control execution unit is used to perform instruction decoding, renaming, dispatching, integer / floating-point arithmetic, and memory access instruction execution.

[0019] The L1D cache is the first-level data cache; the L1D miss partition is the partition in the L1D cache that stores cache miss data; the L1D prefetch partition is the partition in the L1D cache that stores prefetch data.

[0020] The L2 cache is the second-level cache, and the L3 cache is the third-level cache. The L2 and L3 caches cache both data and instructions, and each can either be divided into a miss partition and a prefetch partition or be left unpartitioned.

[0021] The L2 miss partition is the partition in the L2 cache that stores cache miss data; the L2 prefetch partition is the partition in the L2 cache that stores prefetch data.

[0022] The L3 miss partition is the partition in the L3 cache that stores cache miss data; the L3 prefetch partition is the partition in the L3 cache that stores prefetch data.

[0023] The L1D data prefetcher is a data prefetcher used for L1D caching, and the prefetched data is placed into the L1D prefetch partition; the L2 data prefetcher is a data prefetcher used for L2 caching, and the prefetched data is placed into the L2 prefetch partition; the L3 data prefetcher is a data prefetcher used for L3 caching, and the prefetched data is placed into the L3 prefetch partition.

[0024] The prefetch coordinator coordinates the L1D data prefetcher, L2 data prefetcher, and L3 data prefetcher to write data that will be needed in the future from main memory into the L1D prefetch partition, L2 prefetch partition, and L3 prefetch partition.

[0025] The prefetch metadata cache is used to store data prefetch metadata. When the amount of data prefetch metadata is large, the prefetch metadata is placed in main memory and written to the prefetch metadata cache when necessary to speed up data prefetching. The prefetch metadata cache can be implemented using a dedicated module or placed in the L3 cache.

[0026] The main memory is a component that stores the data and instructions used by the computer. Data in main memory is loaded into the L3 cache, L2 cache, L1D cache, and CPU control execution unit, while instructions in main memory are loaded into the CPU control execution unit and instruction address cache.

[0027] The instruction prefetching and instruction fetching module implements instruction branch prediction, instruction prefetching, and instruction fetching, and writes the addresses of recently executed and prefetched instructions of the processor into the instruction address cache;

[0028] The instruction address cache is a cache of the processor's data prefetch instruction address. It outputs the data prefetch instruction address to the data prefetcher to cache the required data prefetch metadata in advance.

[0029] Furthermore, the cache data storage partitioning is implemented, and the partitions include: a cache miss partition and a prefetch cache partition; wherein, the cache miss partition stores cache miss data, and the prefetch cache partition stores prefetch data.

[0030] Furthermore, the sizes of the cache miss partition and the prefetch cache partition can be dynamically adjusted. The cache miss partition consists of cache miss lines, and the prefetch cache partition consists of data prefetch lines. The cache miss lines and data prefetch lines are distinguished by the prefetch flag in the cache line.

[0031] Furthermore, the cache line includes cache line data, prefetch flag, access flag, valid bit, tag bit, dirty bit, LRU counter bit, and prefetch number bit;

[0032] The valid bit indicates whether the information stored in the current cache line is valid; the tag bit identifies which main memory block the data in the cache line is a copy of, and is derived from the main memory address; the dirty bit indicates whether the data in the cache line has been modified; and the LRU counter bit is used by the LRU replacement algorithm to record the usage of the data block in the cache line.

[0033] The prefetch flag, access flag, and prefetch number are added in this application. The prefetch flag has a value of 1 bit. If the corresponding bit value is 0, it indicates that the data in this cache line is a missed data. If the corresponding bit value is 1, it indicates that the data in this cache line is prefetched data.

[0034] The access flag is set to 1 bit. If the corresponding bit value is 0, it means that the data in this cache line has not been accessed by the CPU control execution unit so far. If the corresponding bit value is 1, it means that the data in this cache line has been accessed by the CPU control execution unit so far.

[0035] The value of the prefetch number bit indicates the prefetch number corresponding to the data in this cache line if it is prefetched data.
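The cache line fields listed in [0031] to [0035] can be pictured with a small data structure. This is only a sketch: the Python types, the default values, and the `partition` helper are illustrative assumptions, and the application does not specify field widths other than the 1-bit prefetch and access flags.

```python
from dataclasses import dataclass

@dataclass
class CacheLine:
    """One cache line with the fields named in [0031]."""
    tag: int = 0               # identifies the main memory block held in this line
    data: bytes = b""          # cache line data
    valid: bool = False        # whether the line holds valid information
    dirty: bool = False        # whether the data has been modified
    lru_counter: int = 0       # used by the LRU replacement algorithm
    prefetch_flag: int = 0     # 1 bit: 0 = cache miss data, 1 = prefetched data
    access_flag: int = 0       # 1 bit: 0 = not yet accessed by the CPU, 1 = accessed
    prefetch_number: int = 0   # meaningful only when prefetch_flag == 1

    def partition(self) -> str:
        """Return the logical partition this line belongs to (hypothetical helper)."""
        if not self.valid:
            return "empty"
        return "prefetch" if self.prefetch_flag == 1 else "miss"

line = CacheLine(tag=0x1A, valid=True, prefetch_flag=1)
print(line.partition())
```

Because membership in a partition is decided by the prefetch flag alone, the partition sizes can change at run time without moving any data.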

[0036] Furthermore, the data prefetch instruction address generation includes: a branch predictor, an instruction prefetch engine, an instruction prefetcher, an instruction fetch module, an instruction address cache, an instruction fetch target queue, a prefetch instruction queue, a prefetch buffer, and an instruction cache; wherein, the instruction fetch module reads instructions from the instruction cache, then performs instruction decoding, and simultaneously stores the addresses of recently executed and prefetched instructions into the instruction address cache.

[0037] Furthermore, the data prefetcher, through the instruction address cache, writes the metadata corresponding to the addresses of recently executed instructions and the addresses of prefetched instructions from main memory into the prefetch metadata cache. During data prefetching, the data prefetcher prioritizes reading the data corresponding to the prefetch instruction addresses in the instruction address cache.

[0038] This application also provides a working method for a data prefetching system architecture. The method is implemented based on the system architecture provided in this application and includes: a method for writing cache miss lines and a method for writing data prefetch cache lines.

[0039] The steps for writing a cache miss line are as follows:

[0040] Step 11: Determine whether the cache lines in the set are full;

[0041] Step 12: If not full, select any free cache line in the set, write the new cache line data, and set the prefetch flag of the new cache line to 0;

[0042] Step 13: If the cache lines in the set are full, determine whether N_miss is less than C_miss, where N_miss is the number of cache lines in the set with the prefetch flag set to 0, and C_miss is the configured number of cache miss lines per set. C_miss takes the value C_miss_L1D, C_miss_L2, or C_miss_L3 depending on the cache being written. C_miss_L1D is the number of cache miss lines in each L1D cache set; C_miss_L2 is the number of cache miss lines in each L2 cache set; and C_miss_L3 is the number of cache miss lines in each L3 cache set.

[0043] Step 14: If N_miss is less than C_miss, select a data prefetch cache line within the set, i.e. a cache line with a prefetch flag value of 1, evict the old prefetch data of that line according to the replacement strategy (such as the LRU replacement strategy), and then write the new cache line data. The prefetch flag of the new cache line is set to 0.

[0044] Step 15: If N_miss is greater than or equal to C_miss, check whether N_miss is greater than 1;

[0045] Step 16: If N_miss is greater than 1, select a cache miss line in the set, i.e. a cache line with a prefetch flag value of 0, evict the old cache line data according to the replacement strategy (such as the LRU replacement strategy), and then write the new cache line data. The prefetch flag of the new cache line is set to 0.

[0046] Step 17: If N_miss is equal to 1, evict the old data of the single existing cache miss line, write the new cache line data, and set the prefetch flag of the new cache line to 0.

[0047] The data prefetching cache line writing process consists of the following steps:

[0048] Step 21: Determine whether the cache lines in the set are full;

[0049] Step 22: If not full, select any free cache line in the set, write the new cache line data, and set the prefetch flag of the new cache line to 1;

[0050] Step 23: If the cache lines in the set are full, determine whether N_prefetch is less than C_prefetch, where N_prefetch is the number of cache lines in the set with the prefetch flag set to 1, and C_prefetch is the configured number of data prefetch cache lines per set. C_prefetch takes the value C_prefetch_L1D, C_prefetch_L2, or C_prefetch_L3 depending on the cache being written. C_prefetch_L1D is the number of data prefetch cache lines in each L1D cache set; C_prefetch_L2 is the number of data prefetch cache lines in each L2 cache set; and C_prefetch_L3 is the number of data prefetch cache lines in each L3 cache set.

[0051] Step 24: If N_prefetch is less than C_prefetch, select a cache miss line in the set, that is, a cache line with a prefetch flag value of 0, evict the old miss data of that line according to the replacement strategy (such as the LRU replacement strategy), and then write the new cache line data. The prefetch flag of the new cache line is set to 1.

[0052] Step 25: If N_prefetch is greater than or equal to C_prefetch, select a data prefetch cache line in the set, that is, a cache line with a prefetch flag value of 1, evict the old prefetch data of that line according to the replacement strategy (such as the LRU replacement strategy), and then write the new cache line data. The prefetch flag of the new cache line is set to 1.
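The two write procedures (Steps 11 to 17 and Steps 21 to 25) are symmetric, so they can be sketched with one function operating on a single cache set (group). This is a minimal sketch under stated assumptions: the names `Line`, `write_line`, and `quota` are hypothetical, the victim inside a partition is taken to be the line with the largest LRU counter (consistent with [0059], where an initial value of 0 gives the longest residence), counter aging is omitted, and Steps 15 to 17 are read as a test on N_miss.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Line:
    tag: int
    prefetch_flag: int  # 0 = cache miss line, 1 = data prefetch line
    lru: int = 0        # assumption: larger value = evicted sooner

def _oldest(cache_set, flag):
    """Index of the line with the given prefetch flag and the largest LRU counter."""
    ways = [i for i, line in enumerate(cache_set) if line.prefetch_flag == flag]
    return max(ways, key=lambda i: cache_set[i].lru)

def write_line(cache_set: List[Optional[Line]], tag: int, flag: int, quota: int) -> int:
    """Write a miss line (flag=0, quota=C_miss) or a prefetch line
    (flag=1, quota=C_prefetch) and return the way index that was written."""
    if not 1 <= quota <= len(cache_set):
        raise ValueError("quota must be between 1 and the number of ways")
    # Steps 11-12 / 21-22: use a free way if the set is not full.
    for i, line in enumerate(cache_set):
        if line is None:
            cache_set[i] = Line(tag, flag)
            return i
    # Steps 13 / 23: count the lines already in the target partition.
    n_same = sum(1 for line in cache_set if line.prefetch_flag == flag)
    if n_same < quota:
        # Steps 14 / 24: partition is under quota, take a way from the other partition.
        way = _oldest(cache_set, 1 - flag)
    else:
        # Steps 15-17 / 25: partition is at or over quota, replace within it.
        # With exactly one candidate (Step 17) that single line is the victim.
        way = _oldest(cache_set, flag)
    cache_set[way] = Line(tag, flag)
    return way

# 4-way set with C_miss = 1 and C_prefetch = 3.
s = [Line(1, 0, 2), Line(2, 1, 5), Line(3, 1, 1), Line(4, 1, 0)]
print(write_line(s, 9, flag=0, quota=1), write_line(s, 10, flag=1, quota=3))
```

In the usage lines, the new miss line replaces the only existing miss line (way 0), and the new prefetch line replaces the oldest prefetch line (way 1), so the set keeps its 1:3 split.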

[0053] Furthermore, prefetch metadata caching can be implemented using a dedicated module or placed in a cache, including but not limited to L3 cache.

[0054] Furthermore, this includes: by setting different confidence levels and different forward-looking attributes for the prefetched data, the data prefetcher prefetches data from main memory into different levels of cache;

[0055] Specifically, high-confidence, timely prefetched data is prefetched into the prefetch partition of the L1D cache, low-confidence, timely prefetched data is prefetched into the prefetch partition of the L2 cache, and highly forward-looking prefetched data is prefetched into the prefetch partition of the L3 cache.
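The allocation rule above can be sketched as a small decision function. The numeric confidence scale, the lookahead measure, and both thresholds are hypothetical; the application only distinguishes high or low confidence and timely or highly forward-looking prefetches.

```python
def target_cache_level(confidence: float, lookahead: int,
                       conf_threshold: float = 0.75,
                       far_lookahead: int = 8) -> str:
    """Choose the prefetch partition for one prefetch candidate.

    confidence: assumed predictor confidence in [0, 1].
    lookahead:  assumed distance (in predicted accesses) until the data is needed.
    """
    if lookahead >= far_lookahead:
        return "L3"    # highly forward-looking data goes to the L3 prefetch partition
    if confidence >= conf_threshold:
        return "L1D"   # high-confidence, timely data
    return "L2"        # low-confidence, timely data

print(target_cache_level(0.9, 1), target_cache_level(0.3, 1), target_cache_level(0.9, 16))
```

Placing uncertain or far-ahead prefetches in the larger, lower levels keeps them out of the small L1D cache, where a wrong prefetch is most costly.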

[0056] Furthermore, when the cache performs data prefetching, if the corresponding data is located in the miss partition of the current cache, the corresponding data does not need to be prefetched. However, the corresponding data needs to be moved from the miss partition of the current cache to the prefetch partition, and the prefetch flag value of the corresponding data cache line is set from 0 to 1, that is, the cache line changes from the cache miss state to the prefetch state.
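Since partitions are distinguished only by the prefetch flag ([0030]), this move can be sketched as a flag flip with no data transfer. The dictionary layout and the return strings are hypothetical, and the handling of a line that is already in the prefetch partition is an assumption that the text does not cover.

```python
def handle_prefetch_request(cache_set, tag):
    """cache_set: list of dicts with 'valid', 'tag', 'prefetch_flag' keys."""
    for line in cache_set:
        if line["valid"] and line["tag"] == tag:
            if line["prefetch_flag"] == 0:
                line["prefetch_flag"] = 1   # miss state -> prefetch state
                return "promoted"
            return "already_prefetched"     # assumption: nothing to do
    return "issue_prefetch"                 # data is absent, fetch from a lower level

cache_set = [
    {"valid": True, "tag": 5, "prefetch_flag": 0},
    {"valid": True, "tag": 6, "prefetch_flag": 1},
]
print(handle_prefetch_request(cache_set, 5), handle_prefetch_request(cache_set, 7))
```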

[0057] Furthermore, this includes managing the residency time of cache miss lines and data prefetch lines written to the cache from different data sources. Based on principles such as LRU and LFU replacement strategies, recently used and frequently used data are given longer residency times in the cache.

[0058] Furthermore, when writing cache line data, the residence time of the corresponding cache line data in the cache is controlled by setting an initial value to the LRU counter bit of the cache line;

[0059] Specifically, when writing prefetched data to the L1D cache: if the prefetched data comes from the L1D miss partition, the data was accessed by the CPU execution unit some time ago, so the LRU counter is set to 0, giving the longest residence time. If the prefetched data comes from L2 or L3 and the access flag is 1, indicating that the data was accessed by the CPU execution unit some time ago, the LRU counter is set to a constant value K3, giving a longer residence time. If the prefetched data comes from L2 or L3 and the access flag is 0, indicating that the data was not accessed by the CPU execution unit some time ago, the LRU counter is set to a constant value K4, reducing the residence time. If the prefetched data comes from main memory and the corresponding data was not accessed by the CPU execution unit some time ago, the LRU counter is set to a constant value K5, reducing the residence time. The values of K3, K4, and K5 are determined based on an evaluation of cache replacement results, and their range is 0 ≤ K3 ≤ K4 ≤ K5 < K_MAX_LRU;

[0060] Specifically, when writing cache miss data to the L1D cache: if the data comes from L2 or L3 and the access flag is 1, indicating that the data was accessed by the CPU recently, the LRU counter is set to 0, giving the maximum residency time. If the data comes from L2 or L3 and the access flag is 0, indicating that the data was not accessed by the CPU recently, the LRU counter is set to a constant K1, reducing the residency time. If the data comes from main memory and was not accessed by the CPU recently, the LRU counter is set to a constant K2, reducing the residency time. The values of K1 and K2 are determined based on an evaluation of cache replacement results, and their range is 0 ≤ K1 ≤ K2 < K_MAX_LRU.

[0061] It should be noted that the method of controlling the residence time of the corresponding cache line data in the cache by setting an initial value to the LRU counter bit of the cache line when writing cache line data in this application is also applicable to L2 cache and L3 cache, and is not limited to L1D cache.
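The initial LRU counter selection in [0059] and [0060] amounts to a lookup on the write type, the data source, and the access flag. The sketch below assumes a smaller counter means longer residence, as those paragraphs imply. The concrete values of K1 to K5 and K_MAX_LRU, the string labels, and the function name are hypothetical; the application only fixes the ordering constraints.

```python
K_MAX_LRU = 8         # hypothetical counter limit
K1, K2 = 2, 4         # hypothetical constants for miss writes
K3, K4, K5 = 1, 3, 5  # hypothetical constants for prefetch writes

# Ordering constraints stated in [0059] and [0060].
assert 0 <= K1 <= K2 < K_MAX_LRU
assert 0 <= K3 <= K4 <= K5 < K_MAX_LRU

def initial_lru_counter(write_kind: str, source: str, access_flag: int = 0) -> int:
    """Initial LRU counter for a line written into the L1D cache."""
    if write_kind == "prefetch":
        if source == "L1D_miss":
            return 0                              # longest residence
        if source in ("L2", "L3"):
            return K3 if access_flag == 1 else K4
        if source == "memory":
            return K5
    elif write_kind == "miss":
        if source in ("L2", "L3"):
            return 0 if access_flag == 1 else K1
        if source == "memory":
            return K2
    raise ValueError(f"unsupported combination: {write_kind}, {source}")

print(initial_lru_counter("prefetch", "L2", access_flag=1),
      initial_lru_counter("miss", "memory"))
```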

[0062] The present invention has the following beneficial effects:

[0063] (1) The data prefetching system architecture and method disclosed in this invention divide the cache into cache miss partitions and prefetch partitions, and classify and manage data from different sources in the cache. Furthermore, by assigning different initial values to the LRU counter when writing cache line data from different sources and in different states to the cache, different cache residency durations are set. This improves the cache hit rate and reduces cache pollution.

[0064] (2) The system architecture and method for data prefetching disclosed in this invention store the instruction addresses output by the instruction prefetching and fetching modules in the instruction address cache. The data prefetcher, through the instruction address cache, writes the metadata corresponding to the addresses of recently executed instructions and the addresses of prefetched instructions from main memory into the prefetch metadata cache. When performing data prefetching, the data prefetcher prioritizes reading the data corresponding to the data prefetch instruction addresses in the instruction address cache, thereby speeding up data prefetching and further improving the cache hit rate;

[0065] (3) The system architecture and method for data prefetching disclosed in this invention prefetch data from main memory into different levels of cache by setting different confidence and forward-looking attributes of the prefetched data. High-confidence, timely prefetched data is prefetched into the prefetch partition of the L1D cache, low-confidence, timely prefetched data is prefetched into the prefetch partition of the L2 cache, and high-forward-looking prefetched data is prefetched into the prefetch partition of the L3 cache. In this way, while improving the prefetch coverage and prefetch hit rate, cache pollution can be further reduced, and memory access bus bandwidth consumption can be reduced.

[0066] This invention is novel, has a wide range of applications, and is of great significance.

[0067] To more clearly illustrate the functional characteristics and structural parameters of the present invention, further explanation is provided below in conjunction with the accompanying drawings and specific embodiments.

Attached Figure Description

[0068] Figure 1 is a data prefetching system architecture diagram provided in an embodiment of the present invention.

[0069] Figure 2-1 is a partitioning implementation diagram of the cache data storage provided in an embodiment of the present invention.

[0070] Figure 2-2 is a schematic diagram of the cache line composition of the high-speed cache provided in an embodiment of the present invention.

[0071] Figure 2-3 is a flowchart of the data cache miss write process provided in an embodiment of the present invention.

[0072] Figure 2-4 is a flowchart of the data prefetching cache line writing process provided in an embodiment of the present invention.

[0073] Figure 3-1 is a data prefetch instruction address generation structure diagram provided in an embodiment of the present invention.

[0074] Figure 3-2 is a schematic diagram of an example of the format of the data prefetch metadata table in main memory provided in an embodiment of the present invention.

[0075] Figure 4 is a schematic diagram of the prefetch data relationship provided in an embodiment of the present invention.

[0076] Figure 5 is a data flow diagram of L1D cache miss or prefetch provided in an embodiment of the present invention.

[0077] Figure 6-1 is a schematic diagram of the composition of the prefetch metadata cache line provided in an embodiment of the present invention.

[0078] Figure 6-2 is a schematic diagram illustrating an example of the format of the prefetch number and metadata row number index table provided in an embodiment of the present invention.

[0079] Figure 6-3 is a schematic diagram illustrating the implementation of the prefetch number cache pool and the metadata row number cache pool provided in an embodiment of the present invention.

Detailed Implementation

[0080] The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0081] Example 1

[0082] This invention provides a system architecture for data prefetching. As shown in Figure 1, it includes: a CPU control execution unit 101, an L1D cache 102, an L2 cache 103, an L3 cache 104, a data prefetcher 105, a prefetch metadata cache 106, main memory 107, an instruction prefetch and fetch module 108, an instruction address cache 109, etc.; among which,

[0083] L1D cache 102 is divided into L1D miss partition 1021 and L1D prefetch partition 1022;

[0084] L2 cache 103 is divided into L2 miss partition 1031 and L2 prefetch partition 1032;

[0085] L3 cache 104 is divided into L3 miss partition 1041 and L3 prefetch partition 1042;

[0086] The data prefetcher 105 consists of an L1D data prefetcher 1051, an L2 data prefetcher 1052, an L3 data prefetcher 1053, and a prefetch coordinator 1054.

[0087] The CPU control execution unit 101 implements instruction decoding, renaming, dispatching, integer / floating-point arithmetic, and memory access instruction execution. It includes a control module, a fixed-point arithmetic module, a floating-point arithmetic module, and a memory access module. The memory access module includes a memory access pipeline and a memory access queue. The memory access pipeline includes a load pipeline, an address store pipeline, and a data store pipeline. The memory access queue includes a load queue and a store queue, which are responsible for maintaining the order information of memory access instructions.

[0088] L1D cache 102 is the first-level data cache.

[0089] L1D miss partition 1021 is the partition in the L1D cache that stores cache miss data.

[0090] L1D prefetch partition 1022 is the partition in the L1D cache that stores prefetched data 110.

[0091] L2 cache 103 is the second-level cache, and L3 cache 104 is the third-level cache. L2 and L3 caches cache both data and instructions, and can be divided into miss partitions, prefetch partitions, or not divided at all.

[0092] L2 miss partition 1031 is a partition in the L2 cache that stores cache miss data, but it is not limited to storing data; it also stores instructions.

[0093] L2 prefetch partition 1032 is a partition in the L2 cache that stores prefetched data 111, but it is not limited to storing data, it also stores instructions.

[0094] L3 miss partition 1041 is a partition in the L3 cache that stores cache miss data, but it is not limited to storing data; it also stores instructions.

[0095] L3 prefetch partition 1042 is a partition in the L3 cache that stores prefetched data 112, but it is not limited to storing data, it also stores instructions.

[0096] L1D data prefetcher 1051 is a data prefetcher used for L1D caching, and the prefetched data is placed into L1D prefetch partition 1022.

[0097] The L2 data prefetcher 1052 is a data prefetcher used for L2 caching, and the prefetched data is placed into the L2 prefetch partition 1032.

[0098] The L3 data prefetcher 1053 is a data prefetcher used for L3 caching, and the prefetched data is placed into the L3 prefetch partition 1042.

[0099] The prefetch coordinator 1054 coordinates the L1D data prefetcher 1051, the L2 data prefetcher 1052, and the L3 data prefetcher 1053 to write data from main memory 107 into the L1D prefetch partition 1022, the L2 prefetch partition 1032, and the L3 prefetch partition 1042.

[0100] The prefetch metadata cache 106 is used to store the cached data prefetch metadata 113. When the amount of data prefetch metadata 113 is large, the prefetch metadata is generally placed in main memory, and written to the prefetch metadata cache when necessary to speed up data prefetching. The prefetch metadata cache 106 can be implemented using a dedicated module or placed in the L3 cache.

[0101] Main memory 107 is a component that stores the data and instructions used by the computer. Data is accessed through L3 cache 104, L2 cache 103, L1D cache 102, and CPU control execution unit 101, while instructions are accessed through CPU control execution unit 101 and instruction address cache 109.

[0102] The instruction prefetch and fetch module 108 implements instruction branch prediction, instruction prefetching, and instruction fetching, and writes the addresses of recently executed and prefetched instructions of the processor into the instruction address cache 109.

[0103] Instruction address cache 109 caches the instruction addresses used by the processor for data prefetching. It outputs the data prefetch instruction address 114 to the data prefetcher 105 so that the required data prefetch metadata 113 can be cached in advance.

[0104] Example 2

[0105] Figure 2-1 shows an implementation of the cache data storage partitions, comprising L1D cache 201, L2 cache 202, and L3 cache 203. L1D cache 201 is divided into L1D miss partition 2011 and L1D prefetch partition 2012, L2 cache 202 is divided into L2 miss partition 2021 and L2 prefetch partition 2022, and L3 cache 203 is divided into L3 miss partition 2031 and L3 prefetch partition 2023.

[0106] Assume that the L1D cache is 4-way set-associative, the L2 cache is 8-way set-associative, and the L3 cache is 16-way set-associative; that is, each L1D cache set contains 4 cache lines, each L2 cache set contains 8 cache lines, and each L3 cache set contains 16 cache lines. Further assume that, when a cache set is full and cache writes and evictions have reached a steady state, the numbers of ways occupied by cache miss lines and by data prefetch lines are as set in Table 1, which lists the set associativity and partition size settings. Under these conditions, the number of cache miss lines per L1D cache set (C_miss_L1D) is 1 and the number of data prefetch cache lines (C_prefetch_L1D) is 3; the number of cache miss lines per L2 cache set (C_miss_L2) is 1 and the number of data prefetch cache lines (C_prefetch_L2) is 7; and the number of cache miss lines per L3 cache set (C_miss_L3) is 1 and the number of data prefetch cache lines (C_prefetch_L3) is 15.

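[0107] Table 1. Cache Set Associativity and Partition Size Settings

[0108]
Cache   Ways per set   Cache miss lines per set   Data prefetch lines per set
L1D     4              C_miss_L1D = 1             C_prefetch_L1D = 3
L2      8              C_miss_L2 = 1              C_prefetch_L2 = 7
L3      16             C_miss_L3 = 1              C_prefetch_L3 = 15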

[0109] Figure 2-2 illustrates the composition of a cache line. Each cache line includes cache line data 211, a prefetch flag 212, an access flag 213, a valid bit 214, a tag bit 215, a dirty bit 216, LRU counter bits 217, and prefetch number bits 218. The valid bit 214 indicates whether the information stored in the cache line is valid; the tag bit 215 identifies which main memory block the data in the cache line is a copy of, and is derived from the main memory address; the dirty bit 216 indicates whether the data in the cache line has been modified; and the LRU counter bits 217 are used by the LRU replacement algorithm to record how recently the cache line has been used. The prefetch flag 212, access flag 213, and prefetch number bits 218 are added by this application. The prefetch flag is 1 bit: a value of 0 indicates that the cache line holds cache miss data, and a value of 1 indicates that it holds prefetched data. The access flag is 1 bit: a value of 0 indicates that the cache line data has not yet been accessed by the CPU control execution unit, and a value of 1 indicates that it has. The prefetch number bits 218 are 6 bits wide; when the cache line holds prefetched data, their value is the prefetch number assigned to that data.
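The cache line fields described above can be sketched as a small record. This is an illustrative model only, not the hardware layout of the application: the class name `CacheLine`, the field names, and the validation are assumptions, while the 1-bit flags and the 6-bit prefetch number width come from the description.

```python
from dataclasses import dataclass

PREFETCH_NUMBER_BITS = 6  # width stated in the description


@dataclass
class CacheLine:
    """Hypothetical model of one cache line; field names are assumptions."""
    data: bytes = b""         # cache line data 211
    prefetch_flag: int = 0    # 212: 0 = cache miss data, 1 = prefetched data
    access_flag: int = 0      # 213: 1 once the CPU control execution unit has used the line
    valid: bool = False       # 214
    tag: int = 0              # 215: identifies the main memory block
    dirty: bool = False       # 216
    lru_counter: int = 0      # 217
    prefetch_number: int = 0  # 218: meaningful only when prefetch_flag == 1

    def __post_init__(self):
        # Reject values that would not fit the stated field widths.
        if self.prefetch_flag not in (0, 1) or self.access_flag not in (0, 1):
            raise ValueError("flags are 1-bit fields")
        if not 0 <= self.prefetch_number < (1 << PREFETCH_NUMBER_BITS):
            raise ValueError("prefetch number must fit in 6 bits")
```

With 6 bits, at most 64 distinct prefetch numbers can be in use at one time in this sketch.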

[0110] The cache line write procedures for the L1D, L2, and L3 caches are shown in Figure 2-3 and Figure 2-4. Figure 2-3 is the flowchart for writing a cache miss line, and Figure 2-4 is the flowchart for writing a data prefetch line.

[0111] As shown in Figure 2-3, the steps for writing a cache miss line are as follows:

[0112] S201, determine whether the cache lines in the set are full;

[0113] S202, if not full, select any free cache line in the set, write the new cache line data, and set the prefetch flag of the new cache line to 0;

[0114] S203, if the cache lines in the set are full, determine whether N_miss is less than C_miss. Here, N_miss is the number of cache lines in the set whose prefetch flag 212 is 0. The C_miss values are shown in Table 1; C_miss takes the value C_miss_L1D, C_miss_L2, or C_miss_L3 depending on which cache is being written;

[0115] S204, if N_miss is less than C_miss, select a data prefetch cache line in the set (a cache line whose prefetch flag 212 is 1) according to the replacement strategy (such as LRU), evict its old prefetched data, write the new cache line data, and set the prefetch flag of the new cache line to 0;

[0116] S205, if N_miss is greater than or equal to C_miss, check whether N_miss is greater than 1;

[0117] S206, if N_miss is greater than 1, select a cache miss line in the set (a cache line whose prefetch flag 212 is 0) according to the replacement strategy (such as LRU), evict its old cache miss data, write the new cache line data, and set the prefetch flag of the new cache line to 0;

[0118] S207, if N_miss is equal to 1, evict the old data of the single existing cache miss line, write the new cache line data, and set the prefetch flag of the new cache line to 0.
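The cache miss write steps S201 to S207 can be sketched as follows. This is a minimal model under stated assumptions rather than the implementation of the application: a set is a list of dictionaries, the function name `write_miss_line` and the keys are hypothetical, the victim is the candidate with the highest LRU counter, counter aging is omitted, and C_miss is assumed to be at least 1.

```python
def write_miss_line(cache_set, ways, c_miss, tag):
    """Write a cache miss line into one set, following S201 to S207.

    cache_set: list of dicts with keys 'tag', 'prefetch_flag', 'lru'.
    c_miss: miss-line quota for this cache level (assumed >= 1).
    Returns the evicted tag, or None when a free way was used.
    """
    new_line = {"tag": tag, "prefetch_flag": 0, "lru": 0}
    if len(cache_set) < ways:                        # S201, S202: free way available
        cache_set.append(new_line)
        return None
    n_miss = sum(line["prefetch_flag"] == 0 for line in cache_set)
    # S203/S204: below the quota, take a way from the prefetch lines.
    # S205 to S207: otherwise replace among the miss lines (a single
    # miss line is trivially the victim).
    victim_flag = 1 if n_miss < c_miss else 0
    candidates = [i for i, line in enumerate(cache_set)
                  if line["prefetch_flag"] == victim_flag]
    victim = max(candidates, key=lambda i: cache_set[i]["lru"])  # highest counter = least recent
    evicted = cache_set[victim]["tag"]
    cache_set[victim] = new_line
    return evicted
```

In a 4-way set holding only prefetch lines with C_miss = 1, the first miss write displaces a prefetch line, and each later miss write replaces the existing miss line, so miss data never occupies more than its quota once the set is full.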

[0119] As shown in Figure 2-4, the steps for writing a data prefetch cache line are as follows:

[0120] S211, determine whether the cache lines in the set are full;

[0121] S212, if not full, select any free cache line in the set, write the new cache line data, and set the prefetch flag of the new cache line to 1;

[0122] S213, if the cache lines in the set are full, determine whether N_prefetch is less than C_prefetch. Here, N_prefetch is the number of cache lines in the set whose prefetch flag 212 is 1. The C_prefetch values are shown in Table 1; C_prefetch takes the value C_prefetch_L1D, C_prefetch_L2, or C_prefetch_L3 depending on which cache is being written;

[0123] S214, if N_prefetch is less than C_prefetch, select a cache miss line in the set (a cache line whose prefetch flag 212 is 0) according to the replacement strategy (such as LRU), evict its old cache miss data, write the new cache line data, and set the prefetch flag of the new cache line to 1;

[0124] S215, if N_prefetch is greater than or equal to C_prefetch, select a data prefetch cache line in the set (a cache line whose prefetch flag 212 is 1) according to the replacement strategy (such as LRU), evict its old prefetched data, write the new cache line data, and set the prefetch flag of the new cache line to 1.
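The data prefetch write steps S211 to S215 can be sketched in the same style. Again this is only an illustration under assumptions: the function name `write_prefetch_line` and the dictionary keys are hypothetical, the victim is the candidate with the highest LRU counter, and counter aging is left out.

```python
def write_prefetch_line(cache_set, ways, c_prefetch, tag):
    """Write a data prefetch line into one set, following S211 to S215.

    cache_set: list of dicts with keys 'tag', 'prefetch_flag', 'lru'.
    Returns the evicted tag, or None when a free way was used.
    """
    new_line = {"tag": tag, "prefetch_flag": 1, "lru": 0}
    if len(cache_set) < ways:                        # S211, S212: free way available
        cache_set.append(new_line)
        return None
    n_prefetch = sum(line["prefetch_flag"] == 1 for line in cache_set)
    # S214: below the quota, evict a miss line; S215: otherwise evict a prefetch line.
    victim_flag = 0 if n_prefetch < c_prefetch else 1
    candidates = [i for i, line in enumerate(cache_set)
                  if line["prefetch_flag"] == victim_flag]
    victim = max(candidates, key=lambda i: cache_set[i]["lru"])
    evicted = cache_set[victim]["tag"]
    cache_set[victim] = new_line
    return evicted
```

Using the L1D settings (4 ways, C_prefetch_L1D = 3), four prefetch writes into a set that starts full of miss lines leave one miss line and three prefetch lines, which is the steady-state split the quotas are meant to enforce.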

[0125] Example 3

[0126] Figure 3-1 is a structure diagram for generating data prefetch instruction addresses. It includes: branch predictor 301, instruction prefetch engine 302, instruction prefetcher 303, instruction fetch module 304, instruction address cache 305, fetch target queue 306 (FTQ), prefetch instruction queue 307 (PIQ), prefetch buffer 308, and instruction cache 309. The instruction prefetch and fetch module here is the instruction prefetch and fetch module 108 in Figure 1.

[0127] Branch predictor 301 generates branch instruction prediction block 310 and sends it to fetch target queue 306. Fetch target queue 306 is a FIFO queue that buffers the branch instruction prediction blocks 310 from branch predictor 301 and acts as the intermediary between branch predictor 301 and instruction fetch module 304. Branch instruction prediction block 310 is used to query instruction cache 309 (the I-Cache). On an instruction cache miss, instruction prefetch engine 302 generates the required prefetch instruction 311 and sends it to prefetch instruction queue 307. Instruction prefetcher 303 then issues instruction prefetch requests to the higher-level caches and main memory, and the prefetched instructions are written to prefetch buffer 308. Branch instruction prediction block 310 in fetch target queue 306 also queries prefetch buffer 308; on a hit, the instruction is read from prefetch buffer 308 and written to instruction cache 309. Instruction fetch module 304 reads instructions from instruction cache 309, decodes them, and stores the addresses of recently executed and prefetched instructions in instruction address cache 305.

[0128] Subsequently, as shown in Figure 1, the data prefetcher 105 uses the instruction address cache 109 to write the metadata corresponding to the addresses of recently executed instructions and of prefetched instructions from main memory into the prefetch metadata cache 106. During data prefetching, the data prefetcher gives priority to the metadata in the prefetch metadata cache 106 that corresponds to the recently executed and prefetched instruction addresses held in the instruction address cache 109, and uses it to prefetch the data at subsequent addresses. The prefetch metadata cache 106 may be placed in the L3 cache or the LLC.

[0129] Figure 3-2 is a schematic diagram of the format of the data prefetch metadata table in main memory. Each entry consists of Tag1, Tag2, and Pattern. Tag1 and Tag2 are combined into Tag, which serves as the table index, and Pattern represents the data prefetch mode. Tag1 is the instruction address, and Tag2 is the data address offset within the page. Using the data prefetch instruction addresses in instruction address cache 109, the metadata corresponding to the addresses of recently executed instructions and of prefetched instructions is transferred from prefetch metadata table 321 in main memory 107 to prefetch metadata cache 106, which speeds up the operation of the data prefetcher.
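The table format and the preloading step can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the 4096-byte page size, the helper names `make_tag` and `preload_metadata`, the example addresses, and the encoding of Pattern as a list of strides are not specified in the description.

```python
PAGE_SIZE = 4096  # assumed page size; not given in the description


def make_tag(instruction_address, data_address):
    """Tag = (Tag1, Tag2): instruction address and in-page data address offset."""
    return (instruction_address, data_address % PAGE_SIZE)


# Hypothetical stand-in for prefetch metadata table 321 in main memory.
# Pattern is encoded as a list of address strides, which is an assumed encoding.
metadata_table = {
    make_tag(0x400A10, 0x7F001040): [64, 64, 64],
    make_tag(0x400B20, 0x7F002100): [128, -64],
}


def preload_metadata(instruction_addresses, table):
    """Copy the entries whose Tag1 appears in the instruction address cache.

    The returned dict plays the role of the prefetch metadata cache.
    """
    wanted = set(instruction_addresses)
    return {tag: pattern for tag, pattern in table.items() if tag[0] in wanted}
```

For example, preloading with only instruction address 0x400A10 copies the single entry indexed by (0x400A10, 0x040), so the prefetcher does not need to consult main memory for that instruction later.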

[0130] Example 4

[0131] Figure 4 illustrates the relationships among prefetched data, including: current data 410 (whose associated instruction address is PC1); a regular data sequence 401 on the same PC as current data 410; a cascaded data sequence 411 of data sequence 401; a regular data sequence 402 of another prefetch mode on the same PC as current data 410; an irregular data sequence 403 on the same PC as current data 410; a regular data sequence 404 on instruction address PC2; an irregular data sequence 405 on instruction address PC3; a data sequence 406 on instruction address PC4; a data sequence 407 on instruction address PC5; data 408 on instruction address PC6; and data 409 on instruction address PC7. The figure shows that, for a single data item 410, prefetching may involve multiple prefetch modes, such as data sequences 401, 402, 403, 404, and 405 and data 408. Some of these are on the same PC as current data 410, namely data sequences 401, 402, and 403; they reflect different prefetch modes for the current data and include both regular and irregular data sequences. Others are on different PCs from current data 410, such as data sequences 404 and 405 and data 408, where data sequence 404 is regular and data sequence 405 is irregular. While prefetching data sequence 401, the data prefetcher may cascade into data sequences of other instruction addresses after different data nodes, for example prefetching data sequence 406 after data node D13 or data sequence 407 after data node D14. Alternatively, after finishing data sequence 401, the data prefetcher may cascade into data sequence 411 using the same prefetch mode.

[0132] Within the same data node, there exist different prefetch modes and data prefetches for different instruction addresses, i.e., multiple prefetch branches. Different prefetch branches have different confidence levels. Typically, prefetching prioritizes the prefetch branch with the highest confidence, or it can prefetch several prefetch branches with relatively high confidence to improve prefetch coverage. Cascading prefetching can also improve prefetch foresight. However, as the prefetch width and prefetch depth increase, the amount of prefetched data increases, leading to increased memory bus bandwidth consumption and cache pollution, which in turn degrades the performance of the memory access system.

[0133] As shown in Figure 1, by assigning different confidence levels and forward-looking attributes to prefetched data, the data prefetcher prefetches data from main memory into different cache levels. For example, high-confidence, timely prefetched data 110 is prefetched into prefetch partition 1022 of the L1D cache; low-confidence, timely prefetched data 111 is prefetched into prefetch partition 1032 of the L2 cache; and highly forward-looking prefetched data 112 is prefetched into prefetch partition 1042 of the L3 cache. In Figure 4, if data sequence 401 has a high confidence level, it is prefetched into the prefetch partition of the L1D cache. Data sequences 402, 403, 404, and 405 and data 408 are prefetched into the prefetch partition of the L2 cache only where their confidence exceeds the prefetch confidence threshold. Data sequences 406, 407, and 411, which are highly forward-looking, are likewise prefetched into the prefetch partition of the L3 cache only where their confidence exceeds the prefetch confidence threshold. This tiered placement of prefetched data across the L1D, L2, and L3 caches reduces cache pollution in the L1D cache and improves prefetch coverage, cache hit rate, and memory access system performance.
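The tiered placement rule can be sketched as a small decision function. The function name `target_cache`, the boolean `far_ahead` attribute, and both threshold values are assumptions chosen for illustration; the description only states that confidence and forward-looking attributes decide the level and that candidates must exceed a prefetch confidence threshold.

```python
def target_cache(confidence, far_ahead, high_conf=0.8, min_conf=0.3):
    """Pick the prefetch partition for one candidate, or None to drop it.

    confidence: estimated probability that the prefetched data will be used.
    far_ahead: True for highly forward-looking (for example cascaded) prefetches.
    high_conf, min_conf: illustrative thresholds, not values from the text.
    """
    if confidence < min_conf:
        return None          # below the prefetch confidence threshold
    if far_ahead:
        return "L3"          # highly forward-looking data goes to the L3 prefetch partition
    return "L1D" if confidence >= high_conf else "L2"
```

Under these assumed thresholds, a timely candidate with confidence 0.9 is placed in L1D, one with confidence 0.5 in L2, and a cascaded candidate with confidence 0.5 in L3, while anything below 0.3 is not prefetched at all.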

[0134] Example 5

[0135] Cache line writes in the L1D cache fall into two cases: cache miss writes and data prefetch writes. Cache miss data is written to the miss partition of the L1D cache, and prefetched data is written to its prefetch partition. In addition, the residence time of cache miss data and prefetched data can be managed according to the data source. Following the principles behind replacement strategies such as LRU and LFU, recently used and frequently used data is given a longer residence time in the L1D cache, while data expected to be rarely used is given a shorter one. This makes full use of the data in the L1D cache and reduces L1D cache pollution.

[0136] Figure 5 is a data flow diagram for L1D cache miss writes and prefetch writes, including: L1D miss partition 501, L1D prefetch partition 502, cache miss data 503 from L2 / L3, cache miss data 504 from main memory, prefetch data 505 from L2 / L3, prefetch data 506 from main memory, and prefetch data 507 from the L1D miss partition. When an L1D cache miss occurs, the system searches for the data in the L2 cache, the L3 cache, and main memory in turn. If the data is found in the L2 or L3 cache (that is, it is cache miss data 503 from L2 / L3), it is written to the L1D miss partition 501 regardless of the original state of its prefetch flag 212, and its prefetch flag 212 is set to 0 (the cache miss state). When data is to be prefetched into the L1D cache, the system first checks the L1D cache to determine whether a prefetch is needed. If the data is already in the L1D cache, no prefetch is required. If the data is in the L1D miss partition 501 (that is, it is prefetch data 507 from the L1D miss partition), no prefetch is required, but the data must be moved from the L1D miss partition 501 to the L1D prefetch partition 502. This is done by changing the prefetch flag 212 of the cache line from 0 to 1, so that the cache line changes from the cache miss state to the prefetch state. If the data is not in the L1D cache, the system searches for it in the L2 cache, the L3 cache, and main memory in turn. If the data is found in the L2 or L3 cache (that is, it is prefetch data 505 from L2 / L3), it is written to the L1D prefetch partition 502 regardless of the original state of its prefetch flag 212, and its prefetch flag 212 is set to 1 (the prefetch state).
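The prefetch side of this data flow can be sketched as follows. The model is deliberately simplified and its names are assumptions: the L1D cache is a dictionary from address to prefetch flag, `lower_levels` is the set of addresses present in L2 or L3, main memory is assumed to hold every address, and set capacity and eviction are ignored.

```python
def prefetch_into_l1d(l1d, lower_levels, address):
    """Handle one L1D prefetch request and return a label for the path taken.

    l1d: dict mapping address -> prefetch flag (0 = miss partition, 1 = prefetch partition).
    lower_levels: set of addresses currently held in the L2 or L3 cache.
    """
    if address in l1d:
        if l1d[address] == 0:
            # Prefetch data 507: flip the flag so the line moves from the
            # miss partition 501 to the prefetch partition 502 without a refill.
            l1d[address] = 1
            return "promoted"
        return "already-prefetched"
    l1d[address] = 1                      # written to the prefetch partition 502
    if address in lower_levels:
        return "filled-from-L2/L3"        # prefetch data 505
    return "filled-from-main-memory"      # prefetch data 506
```

The point of the "promoted" path is that moving a line between partitions costs only a flag update, since the partitions are distinguished by the prefetch flag rather than by physical location.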

[0137] Table 2 shows the residence time settings for cache line writes in the L1D cache. Assuming the L1D cache uses an LRU replacement strategy, cache line residence time is managed through the LRU counter bits 217. When a cache line in an L1D cache set is accessed, its LRU counter 217 is set to 0. Cache lines in the set whose LRU counter value is lower than the accessed line's previous value have their counters incremented by 1, and cache lines whose LRU counter value is higher keep their values unchanged. When a cache line in the set must be replaced, the cache line with the highest LRU counter value is evicted.
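The counter update rule can be sketched as below. The function names `touch` and `victim` and the list-of-integers representation of one set's counters are assumptions for illustration.

```python
def touch(counters, accessed):
    """Update the per-way LRU counters of one set after way `accessed` is used."""
    old = counters[accessed]
    for way, value in enumerate(counters):
        if value < old:
            counters[way] = value + 1  # more recently used lines age by one
    counters[accessed] = 0             # the accessed line becomes the most recent


def victim(counters):
    """Return the way with the highest counter, that is, the least recently used."""
    return max(range(len(counters)), key=counters.__getitem__)
```

Starting from counters [0, 1, 2, 3], accessing way 2 gives [1, 2, 0, 3], and way 3 remains the eviction victim. Writing a new line with a nonzero initial counter therefore places it closer to eviction, which is how the residence time settings act on this mechanism.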

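[0138] Table 2. Data Residence Time Settings for L1D Cache Line Writes

[0139]
Write type      Data source                        Initial LRU counter value
Cache miss      L2 / L3 (503), access flag = 1     0
Cache miss      L2 / L3 (503), access flag = 0     K1
Cache miss      Main memory (504)                  K2
Data prefetch   L1D miss partition (507)           0
Data prefetch   L2 / L3 (505), access flag = 1     K3
Data prefetch   L2 / L3 (505), access flag = 0     K4
Data prefetch   Main memory (506)                  K5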

[0140] When prefetch data 507 from the L1D miss partition is written, the data was recently accessed by the CPU control execution unit, so the LRU counter 217 is set to 0, giving the maximum residence time. When prefetch data 505 from L2 / L3 is written and its access flag is 1, the data was previously accessed by the CPU control execution unit, so the LRU counter 217 is set to K3, giving a longer residence time. When prefetch data 505 from L2 / L3 is written and its access flag is 0, the data has not been accessed by the CPU control execution unit, so the LRU counter 217 is set to K4, reducing the residence time. When prefetch data 506 from main memory is written, the data has not been accessed by the CPU control execution unit, so the LRU counter 217 is set to K5, reducing the residence time further. The values of K3, K4, and K5 are determined by evaluating the results of cache replacement operation and generally satisfy 0 ≤ K3 ≤ K4 ≤ K5 < K_MAX_LRU.

[0141] When cache miss data 503 from L2 / L3 is written and its access flag is 1, the data was recently accessed by the CPU, so the LRU counter 217 is set to 0, giving the maximum residence time. When cache miss data 503 from L2 / L3 is written and its access flag is 0, the data was not recently accessed by the CPU (for example, it is data adjacent to the requested main memory location), so the LRU counter 217 is set to K1, reducing the residence time. When cache miss data 504 from main memory is written, the data was not recently accessed by the CPU, so the LRU counter 217 is set to K2, reducing the residence time. The values of K1 and K2 are determined by evaluating the results of cache replacement operation and generally satisfy 0 ≤ K1 ≤ K2 < K_MAX_LRU.

[0142] In L2 or L3 cache, cache line writes are divided into two cases: cache miss cache line writes and data prefetch cache line writes. The cache line writes and cache line write data residence time settings for L2 or L3 caches are as described above for L1D cache line write operations.

[0143] Example 6

[0144] Figure 6-1 is a schematic diagram of the composition of a prefetch metadata cache line. A prefetch metadata cache line is a cache line in the prefetch metadata cache that holds prefetch metadata. Each prefetch metadata cache line includes metadata 601, prefetch weight 602, metadata line number 603, and so on. Prefetch weight 602 and metadata line number 603 are added by this application. Prefetch weight 602 represents the weight of the prefetched data corresponding to this prefetch metadata (for example, expressed as an access frequency), and metadata line number 603 is the number of this metadata line.

[0145] Figure 6-2 is a schematic diagram of an example format of the index table 610 that maps prefetch numbers to metadata line numbers. Prefetch number 611 is the prefetch number held in the prefetch number bits 218 of a prefetch data cache line, as shown in Figure 2-2. Metadata line number 603 is the metadata line number in a prefetch metadata cache line, as shown in Figure 6-1.

[0146] Figure 6-3 illustrates the implementation of the prefetch number cache pool and the metadata line number cache pool. The prefetch number cache pool 621 can be implemented as a stack or a FIFO, and the metadata line number cache pool 622 can likewise be implemented as a stack or a FIFO. If the prefetch number cache pool is implemented as a stack, a number is taken from the top of the prefetch number cache pool 621 on allocation, and the reclaimed number is pushed back onto the top of the pool on release. If the metadata line number cache pool is implemented as a FIFO, a number is taken from the head of the metadata line number cache pool 622 on allocation, and the reclaimed number is appended to the tail of the pool on release.
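The two pool variants can be sketched as follows. The class names `StackPool` and `FifoPool` are hypothetical, and the behaviour on exhaustion (an `IndexError` from the underlying container) is an assumption, since the description does not say what happens when a pool runs empty.

```python
from collections import deque


class StackPool:
    """Number pool backed by a stack: allocate from the top, release to the top."""

    def __init__(self, size):
        self._free = list(range(size))

    def allocate(self):
        return self._free.pop()        # raises IndexError when the pool is empty

    def release(self, number):
        self._free.append(number)


class FifoPool:
    """Number pool backed by a FIFO: allocate from the head, release to the tail."""

    def __init__(self, size):
        self._free = deque(range(size))

    def allocate(self):
        return self._free.popleft()    # raises IndexError when the pool is empty

    def release(self, number):
        self._free.append(number)
```

A stack reuses the most recently released number immediately, whereas a FIFO cycles through all free numbers before reusing one, which delays reuse of a number that stale cache lines might still carry. A pool of 64 entries matches the 6-bit prefetch number field.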

[0147] As shown in Figure 1, the data prefetcher 105 uses the instruction address cache 109 to write the metadata corresponding to the addresses of recently executed instructions and of prefetched instructions from main memory into the prefetch metadata cache 106. When a metadata line is written into the prefetch metadata cache, a metadata line number is allocated from the metadata line number cache pool 622 and stored in the metadata line number 603 of that metadata line.

[0148] During data prefetching, each data prefetch is indexed by a prefetch number 611 allocated from the prefetch number cache pool 621, and the data prefetch mode it uses is indexed by the metadata line number 603 allocated from the metadata line number cache pool 622. The correspondence between the two is added to the number index table 610.

[0149] When prefetched data is written to a cache (L1D, L2, or L3), the allocated prefetch number 611 is stored in the prefetch number bits 218 of the prefetch data cache line. When a prefetch data cache line (a cache line whose prefetch flag is 1) is evicted from the cache and its access flag is 0, the prefetched data was never accessed by the CPU control execution unit before eviction. In that case, the corresponding metadata line number 603 is looked up through the number index table 610 using the prefetch number 611 held in the prefetch number bits 218, and the prefetch weight 602 of that metadata line is decreased. If the access flag is 1, the prefetched data was accessed by the CPU control execution unit before eviction; the corresponding metadata line number 603 is looked up in the same way, and the prefetch weight 602 of that metadata line is increased. Adjusting the prefetch weight of the metadata line in this way improves the accuracy of data prefetching.
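The eviction feedback can be sketched as a single handler. The function name `on_prefetch_line_evicted`, the dictionary representations, the step size of 1, and the choice to ignore prefetch numbers that no longer map to a cached metadata line are all assumptions; the description only states that the weight is decreased or increased.

```python
def on_prefetch_line_evicted(line, index_table, metadata_cache, step=1):
    """Adjust prefetch weight 602 when a cache line is evicted.

    line: dict with 'prefetch_flag', 'access_flag', 'prefetch_number'.
    index_table: prefetch number -> metadata line number (number index table 610).
    metadata_cache: metadata line number -> dict with a 'weight' entry.
    """
    if line["prefetch_flag"] != 1:
        return                                   # only prefetched lines give feedback
    meta_no = index_table.get(line["prefetch_number"])
    if meta_no is None or meta_no not in metadata_cache:
        return                                   # assumed: stale numbers are ignored
    # Used before eviction: reward the pattern; never used: penalise it.
    delta = step if line["access_flag"] == 1 else -step
    metadata_cache[meta_no]["weight"] += delta
```

For instance, with index table {5: 2} and metadata line 2 at weight 10, evicting an unused prefetched line numbered 5 lowers the weight to 9, and evicting a used one raises it back to 10.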

[0150] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A system architecture for data prefetching, characterized in that the system architecture includes: a CPU control execution unit, an L1D cache, an L2 cache, an L3 cache, a data prefetcher, a prefetch metadata cache, a main memory, an instruction prefetch and instruction fetch module, and an instruction address cache; the L1D cache is divided into an L1D miss partition and an L1D prefetch partition; the L2 cache is divided into an L2 miss partition and an L2 prefetch partition; the L3 cache is divided into an L3 miss partition and an L3 prefetch partition; the data prefetcher includes an L1D data prefetcher, an L2 data prefetcher, an L3 data prefetcher, and a prefetch coordinator; the CPU control execution unit is used to perform instruction decoding, renaming, dispatching, integer / floating-point arithmetic, and memory access instruction execution; the L1D cache is the first-level data cache; the L1D miss partition is the partition in the L1D cache that stores cache miss data; the L1D prefetch partition is the partition in the L1D cache that stores prefetched data; the L2 cache is the second-level cache, and the L3 cache is the third-level cache; the L2 and L3 caches cache both data and instructions, and can be divided into a miss partition and a prefetch partition or left undivided; the L2 miss partition is the partition in the L2 cache that stores cache miss data; the L2 prefetch partition is the partition in the L2 cache that stores prefetched data; the L3 miss partition is the partition in the L3 cache that stores cache miss data; the L3 prefetch partition is the partition in the L3 cache that stores prefetched data; the L1D data prefetcher is a data prefetcher used for the L1D cache, and its prefetched data is placed into the L1D prefetch partition; the L2 data prefetcher is a data prefetcher used for the L2 cache, and its prefetched data is placed into the L2 prefetch partition; the L3 data prefetcher is a data prefetcher used for the L3 cache, and its prefetched data is placed into the L3 prefetch partition; the prefetch coordinator coordinates the L1D data prefetcher, the L2 data prefetcher, and the L3 data prefetcher to write data from the main memory into the L1D prefetch partition, the L2 prefetch partition, and the L3 prefetch partition; the prefetch metadata cache is used to store data prefetch metadata; when the amount of data prefetch metadata is large, the prefetch metadata is placed in the main memory and written to the prefetch metadata cache when necessary to speed up data prefetching; the prefetch metadata cache can be implemented using a dedicated module or placed in the L3 cache; the main memory is a component that stores the data and instructions used by the computer; data in the main memory is accessed through the L3 cache, the L2 cache, the L1D cache, and the CPU control execution unit, while instructions in the main memory are accessed through the CPU control execution unit and the instruction address cache; the instruction prefetch and instruction fetch module implements instruction branch prediction, instruction prefetching, and instruction fetching, and writes the addresses of recently executed and prefetched instructions of the processor into the instruction address cache; the instruction address cache is a cache of the processor's data prefetch instruction addresses, and it outputs the data prefetch instruction address to the data prefetcher so that the required data prefetch metadata is cached in advance.

2. The data prefetching system architecture according to claim 1, characterized in that the cache data storage partitioning implementation includes: a cache miss partition and a prefetch cache partition; the cache miss partition stores cache miss data, and the prefetch cache partition stores prefetched data.

3. The data prefetching system architecture according to claim 2, characterized in that the sizes of the cache miss partition and the prefetch cache partition can be dynamically adjusted; the cache miss partition consists of cache miss lines, and the prefetch cache partition consists of data prefetch lines; the cache miss lines and data prefetch lines are distinguished by the prefetch flag in the cache line.

4. The data prefetching system architecture according to claim 2, characterized in that a cache line includes cache line data, a prefetch flag, an access flag, a valid bit, a tag bit, a dirty bit, LRU counter bits, and prefetch number bits; the valid bit indicates whether the information stored in the cache line is valid; the tag bit identifies which main memory block the data in the cache line is a copy of, and is derived from the main memory address; the dirty bit indicates whether the data in the cache line has been modified; and the LRU counter bits are used by the LRU replacement algorithm to record the usage of the cache line; the prefetch flag is 1 bit, where a value of 0 means that the cache line holds cache miss data and a value of 1 means that the cache line holds prefetched data; the access flag is 1 bit, where a value of 0 means that the data in the cache line has not yet been accessed by the CPU control execution unit and a value of 1 means that it has been accessed by the CPU control execution unit; when the cache line holds prefetched data, the value of the prefetch number bits is the prefetch number corresponding to that data.

5. The system architecture for data prefetching according to claim 1, characterized in that the data prefetch instruction address generation includes: a branch predictor, an instruction prefetch engine, an instruction prefetcher, an instruction fetch module, an instruction address cache, a fetch target queue, a prefetch instruction queue, a prefetch buffer, and an instruction cache; the instruction fetch module reads instructions from the instruction cache, decodes them, and at the same time stores the addresses of recently executed and prefetched instructions in the instruction address cache.

6. The system architecture for data prefetching according to claim 1, characterized in that the data prefetcher, through the instruction address cache, writes the metadata corresponding to the addresses of recently executed instructions and of prefetched instructions from the main memory into the prefetch metadata cache; during data prefetching, the data prefetcher gives priority to reading the data corresponding to the prefetched instruction addresses held in the instruction address cache.

7. The data prefetching system architecture according to claim 1, characterized in that the prefetch metadata cache can be implemented using a dedicated module or placed in a cache, including the L3 cache.

8. A method for operating a data prefetching system architecture, the method being implemented based on the system architecture of any one of claims 1 to 7, characterized in that the method includes: by setting different confidence levels and different forward-looking attributes for prefetched data, the data prefetcher prefetches data from the main memory into different cache levels; specifically, high-confidence, timely prefetched data is prefetched into the prefetch partition of the L1D cache, low-confidence, timely prefetched data is prefetched into the prefetch partition of the L2 cache, and highly forward-looking prefetched data is prefetched into the prefetch partition of the L3 cache.

9. The method according to claim 8, characterized in that the method includes a procedure for writing cache miss cache lines and a procedure for writing data prefetch cache lines.
The procedure for writing a cache miss cache line consists of the following steps:
Step 11: Determine whether the cache lines in the group are full;
Step 12: If not full, select any free cache line in the group, write the new cache line data, and set the prefetch flag of the new cache line to 0;
Step 13: If the cache lines in the group are full, determine whether N_miss is less than C_miss, where N_miss is the number of cache lines whose prefetch flag is 0, and C_miss is the number of cache miss cache lines allotted to each group. C_miss takes the value C_miss_L1D, C_miss_L2, or C_miss_L3 depending on the cache being written: C_miss_L1D is the number of cache miss cache lines in each group of the L1D cache, C_miss_L2 is the number in each group of the L2 cache, and C_miss_L3 is the number in each group of the L3 cache;
Step 14: If N_miss is less than C_miss, select a data prefetch cache line in the group, i.e. a cache line whose prefetch flag is 1, evict its old prefetched data according to the replacement strategy, and then write the new cache line data. The prefetch flag of the new cache line is set to 0;
Step 15: If N_miss is greater than or equal to C_miss, further check whether it is greater than 1;
Step 16: If it is greater than 1, select a cache miss cache line in the group, i.e. a cache line whose prefetch flag is 0, evict its old data according to the replacement strategy, and then write the new cache line data. The prefetch flag of the new cache line is set to 0;
Step 17: If it is equal to 1, evict the old data of the existing cache miss cache line, write the new cache line data, and set the prefetch flag of the new cache line to 0.
The procedure for writing a data prefetch cache line consists of the following steps:
Step 21: Determine whether the cache lines in the group are full;
Step 22: If not full, select any free cache line in the group, write the new cache line data, and set the prefetch flag of the new cache line to 1;
Step 23: If the cache lines in the group are full, determine whether N_prefetch is less than C_prefetch, where N_prefetch is the number of cache lines whose prefetch flag is 1, and C_prefetch is the number of data prefetch cache lines. C_prefetch takes the value C_prefetch_L1D, C_prefetch_L2, or C_prefetch_L3 depending on the cache being written: C_prefetch_L1D is the number of data prefetch cache lines in the L1D cache, C_prefetch_L2 is the number in the L2 cache, and C_prefetch_L3 is the number in the L3 cache;
Step 24: If N_prefetch is less than C_prefetch, select a cache miss cache line in the group, i.e. a cache line whose prefetch flag is 0, evict its old miss data according to the replacement strategy (such as the LRU replacement strategy), and then write the new cache line data. The prefetch flag of the new cache line is set to 1;
Step 25: If N_prefetch is greater than or equal to C_prefetch, select a data prefetch cache line in the group, i.e. a cache line whose prefetch flag is 1, evict its old prefetched data according to the replacement strategy, and then write the new cache line data. The prefetch flag of the new cache line is set to 1.
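The two write procedures of claim 9 (steps 11 to 17 and 21 to 25) can be sketched for a single group (cache set) as follows. This is a simplified model, not the claimed hardware. The assumptions are: the replacement strategy is LRU, the line with the largest counter is evicted, a write resets the written line's counter to 0 and ages the others, and both per-group quotas lie between 1 and the number of ways. The names `Line`, `pick_victim`, and `write_line` are hypothetical. Steps 16 and 17 collapse into one branch here, because choosing an LRU victim among exactly one cache miss line returns that line.

```python
class Line:
    def __init__(self):
        self.valid = False
        self.prefetch_flag = 0  # 0 = cache miss line, 1 = data prefetch line
        self.lru = 0            # assumed: larger value = closer to eviction
        self.tag = None


def pick_victim(lines, flag):
    """Assumed LRU policy: among valid lines with this flag, take the oldest."""
    candidates = [i for i, l in enumerate(lines)
                  if l.valid and l.prefetch_flag == flag]
    return max(candidates, key=lambda i: lines[i].lru)


def write_line(lines, tag, is_prefetch, c_miss, c_prefetch):
    """Write one line into the group and return the index that was used."""
    free = [i for i, l in enumerate(lines) if not l.valid]
    if free:                                   # steps 12 / 22: group not full
        idx = free[0]
    elif not is_prefetch:
        n_miss = sum(l.prefetch_flag == 0 for l in lines)
        # step 14: miss partition under quota, take a prefetch line;
        # steps 15-17: otherwise replace within the miss lines.
        idx = pick_victim(lines, 1 if n_miss < c_miss else 0)
    else:
        n_prefetch = sum(l.prefetch_flag == 1 for l in lines)
        # step 24: prefetch partition under quota, take a miss line;
        # step 25: otherwise replace within the prefetch lines.
        idx = pick_victim(lines, 0 if n_prefetch < c_prefetch else 1)
    for i, l in enumerate(lines):              # assumed aging of other lines
        if l.valid and i != idx:
            l.lru += 1
    victim = lines[idx]
    victim.valid, victim.tag, victim.lru = True, tag, 0
    victim.prefetch_flag = 1 if is_prefetch else 0
    return idx


group = [Line() for _ in range(4)]             # 4-way group, quotas 2 + 2
for t in range(4):
    write_line(group, t, False, 2, 2)          # fill with cache miss lines
for t in (10, 11, 12):
    write_line(group, t, True, 2, 2)           # prefetches claim their quota
print([l.prefetch_flag for l in group])
```

In this run the first two prefetch writes take over cache miss lines until the prefetch quota is reached, and the third one replaces an older prefetch line, so neither kind of data can push the other below its quota once the group is full.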

10. The method according to claim 9, characterized in that, when the cache performs data prefetching, if the corresponding data is already located in the miss partition of the current cache, the data does not need to be prefetched again. Instead, the data is moved from the miss partition of the current cache to the prefetch partition, and the prefetch flag of the corresponding cache line is changed from 0 to 1, that is, the cache line changes from the cache miss state to the prefetch state.
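The promotion described in claim 10 can be illustrated with a lookup that runs before a prefetch is issued. This is a sketch only: the dictionary-based line model and the name `prefetch_needs_fetch` are hypothetical. The claim covers only data found in the miss partition, so skipping the fetch when the data already sits in the prefetch partition is an added assumption.

```python
def prefetch_needs_fetch(lines, tag):
    """Return True if the prefetch must still fetch the data.

    If the tag is found in a cache miss line (prefetch_flag == 0), the line is
    promoted to the prefetch partition by setting its flag to 1.
    """
    for line in lines:
        if line["valid"] and line["tag"] == tag:
            if line["prefetch_flag"] == 0:
                line["prefetch_flag"] = 1   # miss state -> prefetch state
            return False                    # data already cached, no fetch
    return True


group = [
    {"valid": True, "tag": 0x40, "prefetch_flag": 0},
    {"valid": True, "tag": 0x80, "prefetch_flag": 1},
]
print(prefetch_needs_fetch(group, 0x40), group[0]["prefetch_flag"])
print(prefetch_needs_fetch(group, 0xC0))
```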

11. The method according to claim 9, characterized in that the method includes: managing the cache residence time of cache miss cache lines and data prefetch cache lines that are written to the cache from different data sources.

12. The method according to claim 11, characterized in that, when cache line data is written, the residence time of that data in the cache is controlled by setting an initial value in the LRU counter bits of the cache line.
Specifically, when prefetched data is written to the L1D cache: if the prefetched data comes from the L1D miss partition, the LRU counter is set to 0, giving the longest residence time, since the data was recently accessed by the CPU execution unit. If the prefetched data comes from L2 or L3 and the access flag is 1, indicating that the data was recently accessed by the CPU execution unit, the LRU counter is set to a constant K3, giving a longer residence time. If the prefetched data comes from L2 or L3 and the access flag is 0, indicating that the data was not recently accessed by the CPU execution unit, the LRU counter is set to a constant K4, reducing the residence time. If the prefetched data comes from main memory and was not recently accessed by the CPU execution unit, the LRU counter is set to a constant K5, reducing the residence time. The values of K3, K4, and K5 are determined based on the cache replacement results, and their range is 0 ≤ K3 ≤ K4 ≤ K5 < K_MAX_LRU.
Specifically, when cache miss data is written to the L1D cache: if the miss data comes from L2 or L3 and the access flag is 1, indicating that the data was recently accessed by the CPU, the LRU counter is set to 0, giving the maximum residence time. If the miss data comes from L2 or L3 and the access flag is 0, indicating that the data was not recently accessed by the CPU, the LRU counter is set to a constant K1, reducing the residence time. If the miss data comes from main memory and was not recently accessed by the CPU, the LRU counter is set to a constant K2, reducing the residence time. The values of K1 and K2 are determined based on the cache replacement results, and their range is 0 ≤ K1 ≤ K2 < K_MAX_LRU.
This method of controlling the residence time of cache line data by setting an initial value in the LRU counter bits when the cache line is written is also applicable to the L2 cache and the L3 cache, and is not limited to the L1D cache.
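The initial LRU counter selection in claim 12 amounts to a lookup on the data source and the access flag. The sketch below is illustrative only: the numeric values of K1 to K5 and of the upper bound (written here as `K_MAX_LRU`) are made-up placeholders that merely satisfy the stated ordering, and the source labels and the function name `initial_lru` are hypothetical. As in the claim, a smaller initial value means a longer residence time.

```python
# Placeholder constants (assumptions); the claim only constrains their order.
K_MAX_LRU = 7
K1, K2 = 2, 4
K3, K4, K5 = 1, 3, 5
assert 0 <= K1 <= K2 < K_MAX_LRU
assert 0 <= K3 <= K4 <= K5 < K_MAX_LRU


def initial_lru(is_prefetch: bool, source: str, access_flag: int = 0) -> int:
    """Initial LRU counter for a line written to L1D.

    source is one of "L1D_MISS", "L2", "L3", "MEM" (hypothetical labels).
    """
    if is_prefetch:
        if source == "L1D_MISS":
            return 0                          # recently used: longest residence
        if source in ("L2", "L3"):
            return K3 if access_flag == 1 else K4
        if source == "MEM":
            return K5                         # never accessed: shortest residence
    else:
        if source in ("L2", "L3"):
            return 0 if access_flag == 1 else K1
        if source == "MEM":
            return K2
    raise ValueError(f"unsupported combination: {is_prefetch}, {source}")


print(initial_lru(True, "L2", access_flag=1), initial_lru(False, "MEM"))
```

The same table-driven choice could be reused for the L2 and L3 caches with their own constants, since the claim states that the method is not limited to L1D.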