On-chip flash prefetch accelerator

By designing an on-chip FLASH prefetch accelerator and optimizing the data access path using a cache array and prefetch control unit, the speed gap between FLASH access speed and processor computing requirements is bridged, achieving efficient FLASH access and improving system performance.

CN122240039A · Pending · Publication Date: 2026-06-19 · NORTHWESTERN POLYTECHNICAL UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NORTHWESTERN POLYTECHNICAL UNIV
Filing Date
2026-05-21
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, there is a significant gap between FLASH access speed and processor computing requirements, causing the processor to wait frequently. Furthermore, existing cache designs cannot effectively alleviate latency, and they adapt poorly to data demands in continuous-access scenarios.

Method used

An on-chip FLASH prefetch accelerator is designed, including a cache array, an address matching unit, and a prefetch control unit. By using cache hit judgment and parallel prefetch operation, the data access path is optimized, and the data prefetch is completed by utilizing the inherent read latency of FLASH, thereby reducing read access waiting latency.

Benefits of technology

It significantly improves the data access efficiency and system performance of the microcontroller, breaks through the performance bottleneck caused by the "memory wall", and achieves efficient FLASH access.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122240039A_ABST
Patent Text Reader

Abstract

This invention provides an on-chip FLASH prefetch accelerator, relating to the field of processor memory access optimization technology, including a cache array, an address matching unit, and a prefetch control unit. This invention utilizes a cache array composed of multiple set-associative caches to efficiently cache frequently accessed data in the on-chip FLASH and supports parallel read / write operations. The address matching unit determines whether a processor read request hits the cache. If the read request misses the cache, data needs to be read from the on-chip FLASH. The prefetch control unit loads the target data and its adjacent data into the cache array within the inherent read latency of the FLASH, completing parallel prefetching. Therefore, when a read request hits the cache array, the target data is directly retrieved from the cache array, significantly reducing the waiting latency of on-chip FLASH read access and significantly improving the overall data access efficiency and system performance of the microcontroller.

Description

Technical Field

[0001] This invention relates to the field of processor memory access optimization technology, and more specifically, to an on-chip FLASH prefetch accelerator.

Background Technology

[0002] As a core component in microcontrollers for storing programs and critical data, the access speed of on-chip FLASH directly impacts system performance. Due to the read operation latency caused by the physical structure of FLASH, a significant speed gap exists between it and the high-speed computing requirements of the processor—the "memory wall" problem. This issue is particularly pronounced in applications with stringent real-time requirements, causing the processor to frequently enter a waiting state when accessing FLASH, severely restricting the overall system efficiency.

[0003] However, existing FLASH access optimization technologies have significant shortcomings. In traditional cacheless designs, each read operation requires direct access to the FLASH array, so waiting latency accumulates. Some solutions that use simple caching suffer from low cache hit rates because of small capacity or low associativity, and therefore fail to alleviate latency effectively. At the same time, prefetching strategies tailored to the characteristics of FLASH are lacking, which makes it difficult to meet data requirements in continuous access scenarios.

[0004] Therefore, there is an urgent need for an accelerator that can achieve efficient FLASH access.

Summary of the Invention

[0005] The purpose of this invention is to provide an on-chip FLASH prefetch accelerator that addresses the aforementioned problems. To achieve this purpose, the technical solution adopted by this invention is as follows. This application provides an on-chip FLASH prefetch accelerator including a cache array, an address matching unit, and a prefetch control unit. The cache array includes multiple first caches, each first cache includes multiple second caches, and each second cache includes a data area and a tag area. The data area stores data read from the on-chip flash memory array, and the tag area stores a first tag. After the processor sends a read request, the address matching unit receives the first target address corresponding to the read request and determines, based on the first target address and the first tags of the second caches, whether a third cache exists. The third cache is a second cache that stores the data corresponding to the first target address. If the third cache exists, the prefetch control unit sends the data in the data area of the third cache to the processor. If the third cache does not exist, the flash memory controller reads data from the on-chip flash memory array based on the first target address and sends the read data to the processor; at the same time, the prefetch control unit calculates a first address based on the first target address and updates the cache array with the data corresponding to the first target address and to the first address. The first address differs from the first target address by a first byte count, which equals the size of a physical page of the on-chip flash memory array.

[0006] As a preferred embodiment of the present invention, the cache array includes two first caches, and each first cache includes two second caches. The size of each second cache is equal to the size of the minimum page of the on-chip flash memory array.

[0007] As a preferred embodiment of the present invention, the address matching unit includes an address resolution unit and an address matching logic unit. The address resolution unit receives the first target address and splits it into an offset, a set index, and a second tag. The address matching logic unit includes a set index comparator and a tag verification unit. The set index comparator selects a fourth cache according to the set index; the fourth cache is the first cache corresponding to the set index. The tag verification unit determines whether the third cache exists based on the second tag and a third tag, where the third tag is the first tag of a second cache in the fourth cache. If the third cache exists, the prefetch control unit sends the data in the data area of the third cache to the processor based on the offset.

[0008] As a preferred embodiment of the present invention, the prefetch control unit includes a configuration register, a replacement control unit, and a prefetch control logic unit. The configuration register includes a prefetch enable bit, which has an enabled state and a disabled state. When the processor sends a read instruction, the configuration register sets the prefetch enable bit to the enabled state. The replacement control unit updates the replacement bit in the tag area of each second cache. The replacement bit records a first count, which is the number of read instructions the processor has issued within a first time period. The first time period runs from the moment the second cache was last determined to be the third cache up to the current time. When the prefetch enable bit in the configuration register is enabled and the third cache does not exist, the prefetch control logic unit calculates the first address and obtains a first write cache and a second write cache. The data corresponding to the first target address and the data corresponding to the first address are written to the first write cache and the second write cache, respectively. The first write cache and the second write cache are both second caches.

[0009] As a preferred embodiment of the present invention, obtaining the first write cache and the second write cache includes the following. When a free second cache exists, the free second cache is used as the first write cache. When no free second cache exists, the second cache with the largest replacement bit is used as the first write cache or the second write cache.

[0010] As a preferred embodiment of the present invention, when the processor sends an erase instruction for a first block address, the address resolution unit receives the first block address and calculates the first address range and the fourth tag corresponding to the first block address. The configuration register sets the prefetch enable bit to the disabled state. Based on the second tag and the fourth tag, the valid bit in the tag area of each fifth cache is cleared; a fifth cache is a second cache within the first address range. After the valid bits of all fifth caches have been cleared, the replacement control unit sets the replacement bits of all second caches to 0, and the configuration register sets the prefetch enable bit back to the enabled state.

[0011] As a preferred embodiment of the present invention, when the processor sends a write instruction, the address resolution unit receives the second target address and the write data corresponding to the write instruction, and splits the second target address into a second offset, a second set index, and a fifth tag. A sixth cache is selected based on the second set index; the sixth cache is the first cache corresponding to the second set index. If the write data is new data, the flash memory controller erases the target page and then writes the write data to the second target address. The target page is the page of the flash memory array that contains the second target address. While the flash memory controller erases the target page, the tag verification unit determines whether a seventh cache exists based on the fifth tag and a sixth tag. The sixth tag is the first tag of a second cache in the sixth cache, and the seventh cache is the second cache in the sixth cache that stores the data corresponding to the second target address. The flash memory controller and the replacement control unit update the cache array according to whether the seventh cache exists.

[0012] As a preferred embodiment of the present invention, the flash memory controller and the replacement control unit update the cache array according to whether the seventh cache exists, as follows. If the seventh cache exists, the flash memory controller updates the data in the data area of the seventh cache with the write data, and the replacement control unit sets the replacement bit of the seventh cache to 0. If the seventh cache does not exist and a free second cache exists, the flash memory controller writes the write data into the data area of the free second cache and updates the first tag, the valid bit, and the replacement bit of that free second cache. If the seventh cache does not exist and no free second cache exists, the flash memory controller writes the write data into the data area of an eighth cache and updates the first tag, the valid bit, and the replacement bit of the eighth cache. The eighth cache is the second cache with the largest replacement bit at the time the processor issues the write instruction.

[0013] The beneficial effects of this invention are as follows: The cache array of the FLASH prefetch accelerator proposed in this invention can efficiently cache frequently accessed data in on-chip FLASH and supports parallel read and write operations. Simultaneously, the address matching unit and prefetch control unit ensure that when a read request hits the cache, the data in the cache is directly sent to the processor. When a read request misses the cache, while reading data from the on-chip flash array, the prefetch control unit performs a prefetch operation, loading the data at the target address of the read request and its adjacent addresses into the cache array, facilitating cache hits for subsequent read requests. By utilizing the inherent read latency of FLASH to load the target data and its adjacent data, when the processor subsequently accesses the cache and hits it, it can directly read data from this prefetch accelerator, thereby significantly reducing the waiting latency of on-chip FLASH read access, effectively overcoming the performance bottleneck caused by the "memory wall," and significantly improving the overall data access efficiency and system performance of the microcontroller.

[0014] Other features and advantages of the invention will be set forth in the description which follows, and will be apparent in part from the description, or may be learned by practicing embodiments of the invention.

Attached Figure Description

[0015] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of the present invention and should not be regarded as a limitation on the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0016] Figure 1 is a schematic diagram of the on-chip FLASH prefetch accelerator described in an embodiment of the present invention;
Figure 2 is a schematic diagram of the internal structure of the on-chip FLASH prefetch accelerator described in an embodiment of the present invention;
Figure 3 is a schematic diagram of the read operation of the on-chip FLASH prefetch accelerator described in an embodiment of the present invention;
Figure 4 is a schematic diagram of the write operation of the on-chip FLASH prefetch accelerator described in an embodiment of the present invention;
Figure 5 is a schematic diagram of the erase operation of the on-chip FLASH prefetch accelerator described in an embodiment of the present invention.

Detailed Implementation

[0017] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The components of the embodiments of the present invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.

[0018] It should be noted that similar reference numerals and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. Furthermore, in the description of this invention, terms such as "first," "second," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.

[0019] Example 1: This embodiment provides an on-chip FLASH prefetch accelerator.

[0020] As shown in Figure 1, an on-chip FLASH prefetch accelerator includes a cache array, an address matching unit, and a prefetch control unit. The cache array includes multiple first caches, each first cache includes multiple second caches, and each second cache includes a data area and a tag area. The data area stores data read from the on-chip flash memory array, and the tag area stores a first tag. After the processor sends a read request, the address matching unit receives the first target address corresponding to the read request and determines, based on the first target address and the first tags of the second caches, whether a third cache exists. The third cache is a second cache that stores the data corresponding to the first target address. If the third cache exists, the prefetch control unit sends the data in the data area of the third cache to the processor. If the third cache does not exist, the flash memory controller reads data from the on-chip flash memory array based on the first target address and sends the read data to the processor; at the same time, the prefetch control unit calculates a first address based on the first target address and updates the cache array with the data corresponding to the first target address and to the first address. The first address differs from the first target address by a first byte count, which equals the size of a physical page of the on-chip flash memory array.

[0021] It can be understood that, as shown in Figures 1 and 2, after the processor initiates an on-chip FLASH read request, the on-chip FLASH controller forwards the read request to the prefetch accelerator. The prefetch accelerator performs cache hit judgment and parallel prefetch control to optimize the data access path. The on-chip FLASH array responds to the prefetch accelerator's read request and provides the target data and the prefetch data to the prefetch accelerator. As a result, when a cache hit occurs, the target data of the read request is sent to the processor directly from the prefetch accelerator, which significantly reduces read access waiting latency, improves system performance, and reduces the frequency of direct accesses to the FLASH array.

[0022] The prefetch accelerator in this embodiment adopts a multi-way set-associative structure and is integrated inside the FLASH controller. It is specifically designed to accelerate on-chip FLASH read operations and provide fast access to FLASH data.

[0023] The cache array is divided into multiple cache ways, each containing multiple cache entries. Each entry integrates a data area and a tag area. The data area stores data read from the on-chip FLASH, while the tag area stores the tag, valid bit, and replacement bit. The cache array uses a multi-way set-associative design, which reduces the probability of cache conflicts and balances cache space utilization better than traditional single-way cache solutions, thus improving the coverage of high-frequency data. The cache array supports nanosecond-level hit detection and zero-clock-cycle response. When the processor accesses cached data, it retrieves the data directly from the cache without waiting for the read latency of the FLASH array, which mitigates the performance bottleneck caused by the "memory wall."

[0024] When a FLASH read operation misses, the FLASH array is triggered to read data, and simultaneously, the prefetch control unit initiates a prefetch operation, loading the target data and adjacent data into the cache array. Once the FLASH array returns the data, the output of the data required for the read request and the update of the cache array's cache data can be achieved simultaneously, automatically loading 32 bytes of data adjacent to the target data address into the cache. The minimum read unit of the on-chip FLASH array is 256 bits. Even if the host only requests 64-bit double-word data, the FLASH array will still output aligned 256-bit data. This embodiment utilizes this characteristic to simultaneously satisfy the host's access requirements for the target data and write adjacent data obtained in the same read into the cache array, implementing a prefetch mechanism based on spatial locality.
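The aligned-read behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the function names `line_base` and `read_doubleword` and the toy 64-byte FLASH image are hypothetical, and the requested address is assumed to be 8-byte aligned so that the double word never crosses a 32-byte boundary.

```python
LINE_BYTES = 32  # 256-bit minimum read unit stated in the text


def line_base(addr):
    """Round a byte address down to the start of its aligned 32-byte line."""
    return addr & ~(LINE_BYTES - 1)


def read_doubleword(flash, addr):
    """Return (requested 8 bytes, whole 32-byte line delivered by the array).

    Hypothetical helper: models the array always returning an aligned line,
    from which the 64-bit double word is then picked out.
    """
    base = line_base(addr)
    line = flash[base:base + LINE_BYTES]
    offset = addr - base
    return line[offset:offset + 8], line


flash = bytes(range(64))                    # toy FLASH image (assumption)
word, line = read_doubleword(flash, 0x28)   # request 8 bytes at address 40
```

In this model the 24 bytes of `line` that were not requested are the "adjacent data obtained in the same read" that the embodiment writes into the cache array.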

[0025] Then, the address matching unit determines in real time whether subsequent read requests match cached data, i.e., whether there is a hit. If a hit occurs, the cached data is returned directly to eliminate access latency. This embodiment implements a parallel prefetching mechanism through the prefetch control unit. The prefetching operation is executed synchronously with the FLASH read operation, utilizing the inherent read latency of FLASH to complete the loading of adjacent data, thus avoiding the additional access latency added in the traditional serial prefetching scheme.

[0026] As a preferred embodiment, the cache array includes two first caches, and each first cache includes two second caches. The size of each second cache is equal to the size of the smallest page of the on-chip flash memory array.

[0027] It can be understood that, as shown in Figure 3, the cache array in this embodiment adopts a set-associative cache architecture with two ways, each way including two entries. Each entry integrates a data area and a tag area. The data area is 32 bytes and stores data read from the on-chip FLASH. The tag area stores the tag, valid bit, and replacement bit. Each cache entry stores a fixed 32 bytes of data, which is strictly aligned with the minimum page size of the FLASH physical storage and matches the page access characteristics of FLASH.
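As a rough illustration of this geometry, the following sketch builds the two-way, two-entries-per-way array as a plain data structure and totals its data capacity. The class name `Entry`, the helper `make_cache`, and the field names are hypothetical; only the sizes (2 ways, 2 entries per way, 32 bytes per entry) come from the text.

```python
from dataclasses import dataclass, field

WAYS = 2             # ways in the array, per the text
ENTRIES_PER_WAY = 2  # entries in each way, per the text
ENTRY_BYTES = 32     # data area size of one entry, per the text


@dataclass
class Entry:
    """One cache entry: tag area (tag, valid, lru) plus a 32-byte data area."""
    tag: int = 0
    valid: bool = False
    lru: int = 0  # replacement bit
    data: bytearray = field(default_factory=lambda: bytearray(ENTRY_BYTES))


def make_cache():
    """Build an empty cache array indexed as cache[way][entry]."""
    return [[Entry() for _ in range(ENTRIES_PER_WAY)] for _ in range(WAYS)]


cache = make_cache()
total_data_bytes = sum(len(e.data) for way in cache for e in way)
```

With these sizes the data areas add up to 2 × 2 × 32 = 128 bytes.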

[0028] As a preferred embodiment, the address matching unit includes an address resolution unit and an address matching logic unit. The address resolution unit receives the first target address and splits it into an offset, a set index, and a second tag. The address matching logic unit includes a set index comparator and a tag verification unit. The set index comparator selects a fourth cache according to the set index; the fourth cache is the first cache corresponding to the set index. The tag verification unit determines whether the third cache exists based on the second tag and a third tag, where the third tag is the first tag of a second cache in the fourth cache. If the third cache exists, the prefetch control unit sends the data in the data area of the third cache to the processor based on the offset.

[0029] It can be understood that, as shown in Figure 2, the address resolution unit receives the processor's 32-bit read address and splits it into an offset, a set index, and a tag, the last of which corresponds to the tag stored in the tag area of a cache entry. The offset field is 5 bits and locates the byte position within the entry, i.e., the position in the data area. The set index field is 1 bit and selects way 0 or way 1 for way switching. The tag field is 26 bits and is compared bit by bit with the tag in the cache entry's tag area. If the tags are completely identical, the access is considered a cache hit; otherwise, it is considered a cache miss. The address matching logic unit includes a set index comparator and a tag verification unit. The set index comparator selects the target cache way based on the set index. The tag verification unit compares the tags in the tag areas of the two entries within the target way with the address tag, and combines the result with the valid bit to determine whether the cache has a hit. Even if the tags are completely identical, if the valid bit of the corresponding cache line is not set to 1, the data in that cache line is invalid and the access cannot be considered a hit. Therefore, the hit condition is: completely identical tags and a valid bit of 1.
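The field split and the hit condition can be written out as a short sketch. The text gives only the field widths (5, 1, and 26 bits); placing the offset in the lowest bits, the set index above it, and the tag in the remaining upper bits is an assumption, as are the function names `split_address` and `is_hit` and the example address.

```python
OFFSET_BITS = 5   # byte position inside a 32-byte entry
INDEX_BITS = 1    # selects way 0 or way 1
TAG_BITS = 26     # compared against the stored tag


def split_address(addr):
    """Split a 32-bit address into (tag, set index, offset).

    Assumed layout: [ tag:26 | index:1 | offset:5 ], offset in the low bits.
    """
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = (addr >> (OFFSET_BITS + INDEX_BITS)) & ((1 << TAG_BITS) - 1)
    return tag, index, offset


def is_hit(entry_tag, entry_valid, addr_tag):
    """Hit condition from the text: identical tags and a valid bit of 1."""
    return bool(entry_valid) and entry_tag == addr_tag


tag, index, offset = split_address(0x0800_0064)  # example address (assumption)
```

For this example address the sketch yields tag `0x200001`, set index `1`, and offset `4`; an entry holding the same tag counts as a hit only while its valid bit is set.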

[0030] As a preferred embodiment, the prefetch control unit includes a configuration register, a replacement control unit, and a prefetch control logic unit. The configuration register includes a prefetch enable bit, which has an enabled state and a disabled state. When the processor sends a read instruction, the configuration register sets the prefetch enable bit to the enabled state. The replacement control unit updates the replacement bit in the tag area of each second cache. The replacement bit records a first count, which is the number of read instructions the processor has issued within a first time period. The first time period runs from the moment the second cache was last determined to be the third cache up to the current time. When the prefetch enable bit in the configuration register is enabled and the third cache does not exist, the prefetch control logic unit calculates the first address and obtains a first write cache and a second write cache. The data corresponding to the first target address and the data corresponding to the first address are written to the first write cache and the second write cache, respectively. The first write cache and the second write cache are both second caches.

[0031] As a preferred embodiment, obtaining the first write cache and the second write cache includes the following. When a free second cache exists, the free second cache is used as the first write cache. When no free second cache exists, the second cache with the largest replacement bit is used as the first write cache or the second write cache.

[0032] It can be understood that, as shown in Figure 2, this embodiment designs a replacement control unit based on the LRU replacement strategy, namely the LRU control unit. The LRU control unit maintains the replacement bit of each cache entry, namely the LRU bit. The LRU bit is a 2-bit counter. When a hit occurs, the replacement bit of the corresponding entry is reset to 0, indicating that the entry was recently used. When a miss occurs and there is no free entry, the entry with the largest LRU bit is selected for replacement, because the largest LRU bit indicates the entry that has gone unused for the longest time.
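A minimal sketch of this counter-based replacement policy follows. Several details are assumptions not stated in the text: the counters are aged by one on every read and saturate at 3 (the largest 2-bit value), ties are broken toward the lowest entry index, and the entries are treated as one flat list. The names `touch` and `pick_victim` are hypothetical.

```python
LRU_MAX = 3  # largest value a 2-bit counter can hold


def touch(lru, hit_idx):
    """Age every entry by one read (saturating), then reset the hit entry to 0."""
    aged = [min(v + 1, LRU_MAX) for v in lru]
    aged[hit_idx] = 0
    return aged


def pick_victim(lru, valid):
    """Prefer a free (invalid) entry; otherwise pick the largest counter."""
    for i, is_valid in enumerate(valid):
        if not is_valid:
            return i
    return max(range(len(lru)), key=lambda i: lru[i])


lru = [0, 0, 0, 0]
for idx in (0, 1, 2, 0, 1):       # sequence of hits on entries 0, 1, 2, 0, 1
    lru = touch(lru, idx)
victim = pick_victim(lru, [True] * 4)
```

After this access sequence the counters read `[1, 0, 2, 3]`, so entry 3, which was never touched, is selected for replacement.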

[0033] The prefetch control logic unit reads the prefetch enable bit (PEN) of the configuration register. When PEN is enabled, the unit automatically calculates the adjacent data address, which is the target address plus 32 bytes, an offset that matches the on-chip FLASH read granularity. It then triggers the prefetch operation in parallel and manages the writing of prefetched data to the cache. The configuration register stores control parameters such as the prefetch enable bit (PEN).
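The address calculation is simple enough to state directly. In this sketch the function name `prefetch_address` is hypothetical, and wrapping the result to 32 bits is an assumption; the text specifies only "target address plus 32 bytes" gated by PEN.

```python
PREFETCH_STRIDE = 32  # bytes, equal to the FLASH read granularity in the text


def prefetch_address(target_addr, pen):
    """Return the adjacent address to prefetch, or None when PEN is cleared."""
    if not pen:
        return None
    # Wrap to a 32-bit address space (assumption, not stated in the text).
    return (target_addr + PREFETCH_STRIDE) & 0xFFFF_FFFF


next_addr = prefetch_address(0x0800_0040, pen=True)
skipped = prefetch_address(0x0800_0040, pen=False)
```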

[0034] The configuration register can dynamically disable the prefetch function through the prefetch enable bit, thereby adapting to different access scenarios and avoiding resource waste caused by invalid prefetching in random access scenarios. This embodiment implements dynamic configurability of the FLASH prefetch strategy, supporting the enabling or disabling of the prefetch function and adjustment of read wait parameters through the configuration register, and can intelligently switch the working mode according to the access mode. Compared with traditional no-cache or fixed-cache solutions, it can fully leverage the advantages of prefetching in continuous access scenarios, preloading adjacent data to meet subsequent access needs; in random access scenarios, it avoids invalid resource occupation, significantly improving the system's energy efficiency ratio. This adaptive capability ensures that FLASH access is always in an optimal acceleration state, providing more flexible performance guarantees for the microcontroller.

[0035] This embodiment employs an LRU (Least Recently Used) replacement strategy. An LRU bit in the tag area of each cache entry records the access order of the entries. When the cache is full, the entry that has not been accessed for the longest time, i.e., the entry with the largest LRU bit, is replaced first. This ensures that frequently accessed data remains in the cache, maintaining a high cache hit rate in continuous access scenarios and reducing the cumulative effect of latency. The combined design of the parallel prefetching mechanism and the LRU replacement strategy overcomes the limitations of traditional caching technology in continuous access optimization.

[0036] As shown in Figure 3, the complete logical steps of a read operation using the prefetch accelerator of this embodiment are as follows. Read requests from the processor fall into two cases: cache hit and cache miss. On a cache hit, after address resolution, the hit is determined by way selection and tag matching. The byte within the entry is located via the offset, and the data is returned directly from the data area of the cache array to the processor with a latency of less than one clock cycle. On a cache miss, an on-chip FLASH read operation is triggered, and target data reading and prefetch data processing proceed in parallel. In the target data reading branch, the target data returned by the on-chip FLASH is sent directly to the processor via the bus and synchronously written to a free entry in the cache array. If no free entry exists, the LRU control unit replaces the entry that has gone unused the longest and updates the tag, valid bit, and LRU bit of that entry. In the prefetch data processing branch, the prefetch control logic unit reads the PEN bit of the configuration register. When PEN is enabled, it calculates the prefetch address and checks for free cache entries. If a free entry exists, the prefetch data is written and the tag, valid bit, and LRU bit are updated. If no free entry exists, the LRU control unit first replaces an old entry and the prefetch data is then written, completing the update of the prefetch accelerator.
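The read flow above can be approximated by a small behavioral model that counts misses with and without prefetch. This is a simplified sketch, not the patented circuit: the class `PrefetchCacheModel`, its methods, and `count_misses` are hypothetical; the four entries are searched as one flat list rather than being selected by set index; the parallel hardware branches run sequentially here; and prefetch is skipped when the next line would fall outside the toy FLASH image.

```python
LINE = 32     # entry size in bytes, as stated in the text
LRU_MAX = 3   # ceiling of a 2-bit counter (saturation is an assumption)


class PrefetchCacheModel:
    def __init__(self, flash, n_entries=4, pen=True):
        self.flash, self.pen, self.misses = flash, pen, 0
        self.entries = [{"valid": False, "base": 0, "lru": 0, "data": b""}
                        for _ in range(n_entries)]

    def _find(self, base):
        """Return the valid entry holding the line at `base`, or None."""
        return next((e for e in self.entries
                     if e["valid"] and e["base"] == base), None)

    def _fill(self, base):
        """Load one line, using a free entry or else the largest LRU counter."""
        if self._find(base) is not None:
            return
        free = [e for e in self.entries if not e["valid"]]
        victim = free[0] if free else max(self.entries, key=lambda e: e["lru"])
        victim.update(valid=True, base=base, lru=0,
                      data=self.flash[base:base + LINE])

    def read_byte(self, addr):
        base = addr & ~(LINE - 1)
        for e in self.entries:                    # age all counters per read
            e["lru"] = min(e["lru"] + 1, LRU_MAX)
        entry = self._find(base)
        if entry is not None:                     # hit: serve from data area
            entry["lru"] = 0
            return entry["data"][addr - base]
        self.misses += 1                          # miss: go to the FLASH model
        self._fill(base)                          # target-data branch
        if self.pen and base + LINE < len(self.flash):
            self._fill(base + LINE)               # prefetch branch
        return self.flash[addr]


def count_misses(pen):
    flash = bytes(range(128))                     # toy FLASH image (assumption)
    model = PrefetchCacheModel(flash, pen=pen)
    values = [model.read_byte(a) for a in range(0, 128, 8)]
    assert values == list(range(0, 128, 8))       # data is always correct
    return model.misses


misses_with_prefetch = count_misses(True)
misses_without_prefetch = count_misses(False)
```

For this sequential sweep of 16 reads across four lines, the model records 2 misses with PEN enabled and 4 with it disabled, which illustrates the continuous-access benefit the paragraph describes.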

[0037] The prefetch accelerator achieves single-cycle hit detection through a multi-way set-associative architecture, significantly improving response speed; the LRU strategy effectively reduces the cache miss rate; parallel prefetching greatly improves performance in continuous access scenarios; and the 128-byte cache, combined with simplified control logic, offers a more advantageous area-to-power ratio compared to similar designs.

[0038] As a preferred embodiment, when the processor sends an erase instruction for a first block address, the address resolution unit receives the first block address and calculates the first address range and the fourth tag corresponding to the first block address. The configuration register sets the prefetch enable bit to the disabled state. Based on the second tag and the fourth tag, the valid bit in the tag area of each fifth cache is cleared; a fifth cache is a second cache within the first address range. After the valid bits of all fifth caches have been cleared, the replacement control unit sets the replacement bits of all second caches to 0, and the configuration register sets the prefetch enable bit back to the enabled state.

[0039] It can be understood that, as shown in Figure 4, the erase operation is performed at the block level and supports 16KB, 32KB, 64KB, and 256KB block types. After the processor sends the erase command and specifies the block address, the address resolution unit calculates the address range and tag space corresponding to the erase block. The FLASH controller activates the high-voltage circuit to erase the FLASH, restoring the storage units to their initial state. During this process, the prefetch control logic unit temporarily disables the prefetch function through the configuration register to avoid prefetching invalid data. The prefetch accelerator synchronously traverses all of its entries, compares the entry tags with the tag space of the erased FLASH block, and clears the valid bits of entries within the erase range to 0 without needing to check the hit status, thus invalidating cache entries in batch. After the erase is completed, the LRU control unit resets the LRU bits of all entries to their initial state, and the prefetch function is automatically enabled again.
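The batch invalidation step can be sketched as follows. The function name `erase_block`, the dictionary-based entry and configuration layout, and the example addresses are hypothetical. For simplicity the sketch compares each entry's line base address against the erased range instead of comparing tags against a tag space, and it models only the cache bookkeeping, not the FLASH erase itself.

```python
def erase_block(entries, config, block_start, block_size):
    """Invalidate cached lines inside an erased block and reset all LRU bits."""
    config["PEN"] = False                      # pause prefetch during the erase
    block_end = block_start + block_size
    for e in entries:                          # traverse every entry, no hit check
        if block_start <= e["base"] < block_end:
            e["valid"] = False
    for e in entries:
        e["lru"] = 0                           # LRU bits back to initial state
    config["PEN"] = True                       # prefetch enabled again


config = {"PEN": True}
entries = [
    {"valid": True, "base": 0x0000, "lru": 1},
    {"valid": True, "base": 0x3FE0, "lru": 2},
    {"valid": True, "base": 0x4000, "lru": 3},
    {"valid": True, "base": 0x8000, "lru": 0},
]
erase_block(entries, config, block_start=0x0000, block_size=16 * 1024)
```

Erasing the 16KB block at address 0 invalidates the first two entries, whose lines lie below `0x4000`, and leaves the other two valid.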

[0040] When the processor sends a write instruction, the address resolution unit receives the second target address and the write data corresponding to the write instruction, and splits the second target address into a second offset, a second group index, and a fifth tag; the sixth cache is selected based on the second group index, the sixth cache being the first cache corresponding to the second group index. If the write data is new data, the flash memory controller erases the target page and then writes the write data to the second target address. The target page is the page of the flash memory array in which the second target address is located. While the flash memory controller erases the target page, the tag verification unit determines whether a seventh cache exists based on the fifth tag and the sixth tag. The sixth tag is the first tag of a second cache in the sixth cache, and the seventh cache is the second cache in the sixth cache that stores the data corresponding to the second target address. The flash memory controller and the replacement control unit update the cache array according to whether the seventh cache exists.

[0041] In a preferred embodiment, the flash memory controller and the replacement control unit updating the cache array according to whether the seventh cache exists includes: if the seventh cache exists, the flash memory controller updates the data in the data area of the seventh cache with the write data, and the replacement control unit sets the replacement bit of the seventh cache to 0; if the seventh cache does not exist and a free second cache exists, the flash memory controller updates the data in the data area of the free second cache with the write data, and updates the first tag, the valid bit, and the replacement bit of the free second cache; if the seventh cache does not exist and there is no free second cache, the flash memory controller updates the data in the data area of the eighth cache with the write data, and updates the first tag, the valid bit, and the replacement bit of the eighth cache. The eighth cache is the second cache with the largest replacement bit at the time the processor issues the write instruction.

[0042] As shown in Figure 5, write operations of the prefetch accelerator can be executed at two-word, page, or four-page granularity. After the processor sends a write command, the FLASH controller receives the target address and the data to be written. The address resolution unit splits the address into offset, group index, and tag, and selects the corresponding cache group through the group index. When new data is written, the FLASH controller first erases the page containing the target address and then programs the data into the FLASH array at the specified granularity. During this process, if a valid entry for the target page exists in the cache, the tag verification unit triggers an update of the data area of that entry, and the LRU control unit marks it as recently used. If the cache misses, then after FLASH programming completes the new data is loaded into a free entry in the cache, or, if none is free, the LRU control unit replaces the least recently used entry; the tag, valid bit, and LRU bit are updated at the same time so that subsequent reads hit the latest data in the cache.
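The three cache-update cases of the write path (hit, free entry, LRU replacement) can be sketched in Python as follows. This is a simplified model under stated assumptions: the 32-byte line size and the 2 x 2 geometry are illustrative, a write is modeled as replacing a whole line rather than a two-word or page-granular span, the search for a free entry is limited to the selected group, and the names `split_address`, `Way`, and `update_on_write` are hypothetical.

```python
LINE_SIZE = 32  # ASSUMPTION: bytes per cache entry
NUM_SETS = 2    # cache groups (first caches)
NUM_WAYS = 2    # entries per group (second caches)


def split_address(addr):
    """Split a byte address into (offset, group index, tag)."""
    offset = addr % LINE_SIZE
    index = (addr // LINE_SIZE) % NUM_SETS
    tag = addr // (LINE_SIZE * NUM_SETS)
    return offset, index, tag


class Way:
    def __init__(self):
        self.valid = False
        self.tag = 0
        self.lru = 0  # replacement counter; largest value = least recently used
        self.data = bytes(LINE_SIZE)


def update_on_write(sets, addr, line_data):
    """Update the cache after a FLASH write; return which case applied."""
    _, index, tag = split_address(addr)
    ways = sets[index]
    # Case 1: a valid entry already holds this line, so refresh it in place.
    for w in ways:
        if w.valid and w.tag == tag:
            w.data = line_data
            w.lru = 0
            return "hit"
    # Case 2: use a free entry if one exists in the selected group.
    target = next((w for w in ways if not w.valid), None)
    outcome = "filled_free"
    # Case 3: otherwise replace the entry with the largest replacement counter.
    if target is None:
        target = max(ways, key=lambda w: w.lru)
        outcome = "replaced_lru"
    target.valid, target.tag, target.lru, target.data = True, tag, 0, line_data
    return outcome
```

In all three cases the entry ends with its replacement counter at 0, so a read issued right after the write finds the latest data marked as most recently used.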

[0043] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

[0044] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention.

Claims

1. An on-chip FLASH prefetch accelerator, characterized in that it includes a cache array, an address matching unit, and a prefetch control unit; the cache array includes multiple first caches, each first cache includes multiple second caches, and each second cache includes a data area and a tag area. The data area is used to store data read from the on-chip flash memory array, and the tag area is used to store a first tag. The address matching unit is used to receive, after the processor sends a read request, the first target address corresponding to the read request, and to determine whether a third cache exists based on the first target address and the first tags of the multiple second caches. The third cache is a second cache that stores the data corresponding to the first target address. If the third cache exists, the prefetch control unit sends the data in the data area of the third cache to the processor; if the third cache does not exist, the flash memory controller reads data from the on-chip flash memory array based on the first target address and sends the read data to the processor. At the same time, the prefetch control unit calculates a first address based on the first target address and updates the cache array based on the data corresponding to the first target address and the data corresponding to the first address. The first address is an address that differs from the first target address by a first number of bytes, the first number of bytes being the size of a physical page of the on-chip flash memory array.

2. The on-chip FLASH prefetch accelerator according to claim 1, characterized in that the cache array includes two first caches, each of which includes two second caches. The size of each second cache is equal to the size of the smallest page of the on-chip flash memory array.

3. The on-chip FLASH prefetch accelerator according to claim 1, characterized in that the address matching unit includes an address resolution unit and an address matching logic unit; the address resolution unit is used to receive the first target address and split the first target address into an offset, a group index, and a second tag; the address matching logic unit includes a group index comparator and a tag verification unit. The group index comparator is used to select a fourth cache according to the group index. The fourth cache is the first cache corresponding to the group index. The tag verification unit is used to determine whether the third cache exists based on the second tag and the third tag, wherein the third tag is the first tag of a second cache in the fourth cache; if the third cache exists, the prefetch control unit sends the data in the data area of the third cache to the processor based on the offset.

4. The on-chip FLASH prefetch accelerator according to claim 3, characterized in that the prefetch control unit includes a configuration register, a replacement control unit, and a prefetch control logic unit; the configuration register includes a prefetch enable bit, which has an enabled state and a disabled state. When the processor sends a read instruction, the configuration register sets the prefetch enable bit to the enabled state. The replacement control unit is used to update the replacement bit of the tag area of the second cache. The replacement bit is used to record a first count. The first count is the number of times the processor issues the read instruction within a first time period. The first time period runs from the moment the second cache was last determined to be the third cache to the current time. The prefetch control logic unit is used to calculate the first address and obtain a first write cache and a second write cache when the prefetch enable bit in the configuration register is in the enabled state and the third cache does not exist. The data corresponding to the first target address and the data corresponding to the first address are written to the first write cache and the second write cache, respectively. The first write cache and the second write cache are both second caches.

5. The on-chip FLASH prefetch accelerator according to claim 4, characterized in that obtaining the first write cache and the second write cache includes: when a free second cache exists, the free second cache is used as the first write cache; when there is no free second cache, the second cache with the largest replacement bit is used as the first write cache or the second write cache.

6. The on-chip FLASH prefetch accelerator according to claim 4, characterized in that when the processor sends an erase instruction for the first block address, the address resolution unit receives the first block address and calculates the first address range and the fourth tag corresponding to the first block address; the configuration register sets the prefetch enable bit to the disabled state; based on the second tag and the fourth tag of the tag area of the fifth cache, the valid bit of the tag area of the fifth cache is set, the fifth cache being a second cache within the first address range; after the valid bit has been set for all of the fifth caches, the replacement control unit sets the replacement bit of all of the second caches to 0, and the configuration register sets the prefetch enable bit to the enabled state.

7. The on-chip FLASH prefetch accelerator according to claim 6, characterized in that when the processor sends a write instruction, the address resolution unit receives the second target address and the write data corresponding to the write instruction, and splits the second target address into a second offset, a second group index, and a fifth tag; the sixth cache is selected based on the second group index, the sixth cache being the first cache corresponding to the second group index. If the write data is new data, the flash memory controller erases the target page and then writes the write data to the second target address. The target page is the page of the flash memory array in which the second target address is located. While the flash memory controller erases the target page, the tag verification unit determines whether a seventh cache exists based on the fifth tag and the sixth tag. The sixth tag is the first tag of a second cache in the sixth cache, and the seventh cache is the second cache in the sixth cache that stores the data corresponding to the second target address. The flash memory controller and the replacement control unit update the cache array according to whether the seventh cache exists.

8. The on-chip FLASH prefetch accelerator according to claim 7, characterized in that the flash memory controller and the replacement control unit updating the cache array according to whether the seventh cache exists includes: if the seventh cache exists, the flash memory controller updates the data in the data area of the seventh cache with the write data, and the replacement control unit sets the replacement bit of the seventh cache to 0; if the seventh cache does not exist and a free second cache exists, the flash memory controller updates the data in the data area of the free second cache with the write data, and updates the first tag, the valid bit, and the replacement bit of the free second cache; if the seventh cache does not exist and there is no free second cache, the flash memory controller updates the data in the data area of the eighth cache with the write data, and updates the first tag, the valid bit, and the replacement bit of the eighth cache. The eighth cache is the second cache with the largest replacement bit at the time the processor issues the write instruction.