Cache, data processing device and chip

A multi-layer cache design addresses the CPU cache capacity limitation: it expands cache capacity while preserving processing efficiency, thereby improving CPU performance and optimizing power consumption.

CN119473932B · Active · Publication Date: 2026-06-19 · SANECHIPS TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SANECHIPS TECH CO LTD
Filing Date
2023-07-28
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In CPU microarchitecture, cache capacity is limited by frequency increases and power consumption optimization, becoming one of the main bottlenecks restricting CPU performance improvement.

Method used

The cache design adopts a multi-layer structure, with each layer including a first cache unit and a second cache unit. Cache access instructions of different layers enter different pipelines to access the cache units in the corresponding layers, thereby expanding the cache capacity and allowing independent access to their respective cache units in case of errors.

Benefits of technology

By expanding cache capacity, data hit rate can be improved, overall CPU performance can be enhanced, power consumption can be optimized, and processing efficiency can be guaranteed.



Abstract

This disclosure provides a cache including at least two layers. Each layer includes a first cache unit and a second cache unit. First cache access instructions from different layers enter different pipelines to access the first cache unit within the corresponding layer, and second cache access instructions from different layers enter different pipelines to access the second cache unit within the corresponding layer. The cache in this embodiment includes multiple layers, which can expand cache capacity, improve data hit rate, and enhance overall CPU performance. Moreover, cache access instructions from each layer independently access their respective cache units, ensuring cache processing efficiency. This disclosure also provides a data processing apparatus and chip.

Description

Technical Field

[0001] This disclosure relates to the field of computer processing technology, specifically to a cache, data processing device, and chip.

Background Technology

[0002] In a CPU (Central Processing Unit) microarchitecture, instruction execution and cache access pass through a four-stage pipeline, with each stage taking one clock cycle. With a limited number of execution pipeline stages, the CPU's cache capacity is constrained by frequency increases and power consumption optimization, and becomes one of the main bottlenecks limiting CPU performance improvement.

Summary of the Invention

[0003] This disclosure provides a cache, a data processing device, and a chip.

[0004] In a first aspect, embodiments of this disclosure provide a cache, including at least two layer structures, each layer structure including a first cache unit and a second cache unit. First cache access instructions of different layer structures enter different pipelines to access the first cache unit within the corresponding layer structure, and second cache access instructions of different layer structures enter different pipelines to access the second cache unit within the corresponding layer structure.

[0005] In another aspect, embodiments of this disclosure also provide a data processing apparatus, including the cache as described above.

[0006] In another aspect, embodiments of this disclosure also provide a chip including the cache as described above.

[0007] The cache provided in this embodiment includes at least two layers, each layer including a first cache unit and a second cache unit. First cache access instructions of different layers enter different pipelines to access the first cache unit within the corresponding layer, and second cache access instructions of different layers enter different pipelines to access the second cache unit within the corresponding layer. The cache in this embodiment includes multiple layers, which can expand the cache capacity, improve the data hit rate, and enhance the overall CPU performance. Moreover, the cache access instructions of each layer independently access their respective cache units, ensuring the processing efficiency of the cache.

Brief Description of the Drawings

[0008] Figure 1 is a schematic diagram of cache access instructions entering the pipeline in the related art;

[0009] Figure 2 is a first schematic diagram of the cache structure provided in the embodiments of this disclosure;

[0010] Figure 3 is a second schematic diagram of the cache structure provided in the embodiments of this disclosure;

[0011] Figure 4 is a schematic diagram of a cache with a two-layer structure provided in a specific example of this disclosure;

[0012] Figure 5 is a schematic diagram of a cache with a three-layer structure provided in a specific example of this disclosure;

[0013] Figure 6 is a schematic diagram of the structure of the data processing device and chip provided in the embodiments of this disclosure.

Detailed Description

[0014] Exemplary embodiments will be described more fully below with reference to the accompanying drawings; however, these exemplary embodiments may be embodied in different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will enable those skilled in the art to fully understand the scope of this disclosure.

[0015] As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.

[0016] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit this disclosure. As used herein, the singular forms “a” and “the” are also intended to include the plural forms unless the context clearly indicates otherwise. It will also be understood that when the terms “comprising” and/or “made of” are used in this specification, they specify the presence of the stated features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

[0017] The embodiments described herein may be described with reference to plan views and/or cross-sectional views by way of idealized schematic diagrams of this disclosure. Accordingly, the example illustrations may be modified according to manufacturing techniques and/or tolerances. The embodiments are therefore not limited to those shown in the drawings, but include modifications of configurations formed based on manufacturing processes. The regions illustrated in the drawings are thus schematic in nature, and the shapes of the regions shown illustrate specific shapes of regions of an element, but are not intended to be limiting.

[0018] Unless otherwise specified, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art. It will also be understood that terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art and this disclosure, and will not be interpreted as having an idealized or overly formal meaning, unless expressly so defined herein.

[0019] Figure 1 is a schematic diagram of cache access instructions entering the pipeline in the related art. As shown in Figure 1, pipelines D1-D4 form a four-stage pipeline for CPU instruction execution and cache access, with each stage lasting one clock cycle. The cache includes Tag Ram (physical address memory), Data Ram (data memory), Data Merge & Format Conversion (data merging and format conversion unit), and Interface. Tag Ram is a physical module used to store physical addresses and cache status information; Data Ram is a physical module used to store data; Data Merge & Format Conversion is used for data merging and format conversion; and Interface is used to transfer the merged and converted data to other devices or modules. The cache unit consists of two parts: Tag Ram and Data Ram. When a Load instruction or Store instruction is executed, it enters the cache pipeline as shown in Figure 1, passing through the following stages:

[0020] In D1: Tag Ram is read according to the cache access instruction Tag Read.

[0021] In D2: Data Ram is read according to the cache access instruction Data Read.

[0022] In D3: Based on the physical address information output by the Tag Ram, select the data output by the Data Ram, and perform operations such as data merging and format conversion on the data output by the Data Ram.

[0023] In D4: The merged and format-converted data is sent to the interface unit.
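The four stages above can be pictured as a stage-occupancy chart. The following is a minimal sketch, not part of the disclosure: the function name `pipeline_chart` and the assumption that one access instruction issues per clock cycle are introduced here purely for illustration.

```python
# Stage order described for the related-art pipeline:
# D1 = Tag Ram read, D2 = Data Ram read,
# D3 = select / merge / format conversion, D4 = interface output.
STAGES = ["D1", "D2", "D3", "D4"]


def pipeline_chart(num_instructions):
    """Return {cycle: {stage: instruction index}} for back-to-back accesses.

    Assumption (hypothetical): one instruction issues per clock cycle and
    every stage takes exactly one cycle.
    """
    chart = {}
    for instr in range(num_instructions):
        for offset, stage in enumerate(STAGES):
            cycle = instr + offset + 1  # cycles are numbered from 1
            chart.setdefault(cycle, {})[stage] = instr
    return chart


if __name__ == "__main__":
    for cycle, occupancy in sorted(pipeline_chart(3).items()):
        print(cycle, occupancy)
```

Under these assumptions each instruction leaves the pipeline four cycles after it enters, and at most one instruction occupies any given stage in a cycle.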

[0024] High-performance CPUs typically target a frequency of 3GHz. For the four-stage pipeline (D1-D4), this exposes the following bottlenecks: in D1, the Tag Ram has many input signals, which limits frequency increases; in D2, the Tag Ram output load is large, which limits frequency increases; in D3, the Data Ram output load is large and the data must be merged and format-converted (many logic levels), which limits frequency increases; in D4, the data is finally output to the interface with a large bit width and a long placement and routing distance, which limits frequency increases.

[0025] Therefore, with a limited execution pipeline, CPU cache capacity is constrained by frequency increases and power consumption optimization, becoming one of the main bottlenecks restricting CPU performance improvement.

[0026] To address the aforementioned issues, this disclosure improves the cache structure and pipeline, thereby expanding the cache capacity. The cache includes at least two layers, each layer comprising a first cache unit and a second cache unit. First cache access instructions from different layers enter different pipelines to access the first cache unit within the corresponding layer, and second cache access instructions from different layers enter different pipelines to access the second cache unit within the corresponding layer.

[0027] Figure 2 is a first schematic diagram of the cache structure according to an embodiment of the present disclosure. As shown in Figure 2, the number of layer structures is n, where n≥2. That is, the cache can include the first layer structure R1, the second layer structure R2, ..., the nth layer structure Rn, and each layer structure contains the same components and structures. The first layer structure R1 includes a first cache unit M11 and a second cache unit M12; the second layer structure R2 includes a first cache unit M21 and a second cache unit M22; and the nth layer structure Rn includes a first cache unit Mn1 and a second cache unit Mn2.

[0028] Simply enlarging the cache prevents timing from converging (the frequency drops) and therefore does not improve processing performance. Because the cache access instructions of each layer structure independently access their respective cache units, the frequency drop caused by simply enlarging the cache is effectively avoided. Furthermore, the number of expansion layers (n-1) can be customized according to actual application needs. The total cache capacity after expansion is n times that of a single-layer structure, which improves the flexibility and versatility of this architecture in product applications.
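The capacity relationship in this paragraph is simple arithmetic. The sketch below assumes every layer structure has the same capacity (the text states that each layer contains the same components); the function name and the 64 KB figure in the usage note are hypothetical.

```python
def expanded_capacity_kb(single_layer_kb, n_layers):
    """Return (expansion_layers, total_kb) for a cache with n_layers layers.

    Assumes all layer structures have identical capacity.
    """
    if n_layers < 2:
        raise ValueError("the disclosure describes at least two layer structures")
    expansion_layers = n_layers - 1          # layers added beyond the first
    total_kb = single_layer_kb * n_layers    # n times the single-layer capacity
    return expansion_layers, total_kb
```

For example, with a hypothetical 64 KB single-layer cache, `expanded_capacity_kb(64, 3)` gives two expansion layers and 192 KB in total.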

[0029] The cache provided in this embodiment includes at least two layers, each layer including a first cache unit and a second cache unit. First cache access instructions of different layers enter different pipelines to access the first cache unit within the corresponding layer, and second cache access instructions of different layers enter different pipelines to access the second cache unit within the corresponding layer. The cache in this embodiment includes multiple layers, which can expand the cache capacity, improve the data hit rate, and enhance the overall CPU performance. Moreover, the cache access instructions of each layer independently access their respective cache units, which can ensure the processing efficiency of the cache and optimize power consumption.

[0030] In some embodiments, the pipelines into which the first cache access instructions of each layer enter are sequentially ascending, and the pipelines into which the second cache access instructions of each layer enter are sequentially ascending. That is, the layers can be arranged in a staggered manner according to the pipeline order.

[0031] To further improve frequency, adjacent layers can be staggered by one pipeline. That is, the first cache access instruction C21 of the second layer enters pipeline D2, while the first cache access instruction C11 of the first layer enters pipeline D1, and D2 is the next pipeline after D1; the second cache access instruction C22 of the second layer enters pipeline D3, while the second cache access instruction C12 of the first layer enters pipeline D2, and D3 is the next pipeline after D2. Therefore, in some embodiments, the pipeline entered by the first cache access instruction of the nth layer is the next pipeline after the pipeline entered by the first cache access instruction of the (n-1)th layer; the pipeline entered by the second cache access instruction of the nth layer is the next pipeline after the pipeline entered by the second cache access instruction of the (n-1)th layer.
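The one-pipeline stagger can be written as a small mapping from layer number to pipeline number. This is an illustrative sketch only; `entry_stages` and `is_one_stage_stagger` are hypothetical names, and the first layer is assumed to enter D1 and D2 as in the example above.

```python
def entry_stages(layer):
    """Pipelines entered by a layer (1-based) under a one-pipeline stagger.

    Returns (first_access, second_access) as pipeline numbers,
    so layer 1 -> (1, 2), meaning D1 and D2.
    """
    if layer < 1:
        raise ValueError("layer numbers start at 1")
    return layer, layer + 1


def is_one_stage_stagger(n_layers):
    """Check that each layer enters exactly one pipeline after the previous one."""
    stages = [entry_stages(k) for k in range(1, n_layers + 1)]
    return all(
        (cur[0] - prev[0], cur[1] - prev[1]) == (1, 1)
        for prev, cur in zip(stages, stages[1:])
    )
```

With this mapping, the second layer's first access lands in D2 and its second access in D3, matching C21 and C22 in the paragraph above.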

[0032] In some embodiments, each layer structure further includes a data selection unit, a data processing unit, and an interface unit. The data selection unit of each layer structure is used to select each second data according to the first data to obtain target data. The first data is data obtained by accessing the first cache unit within the corresponding layer structure according to the first cache access instruction of the corresponding layer structure, and each second data is data obtained by accessing the second cache unit within the corresponding layer structure according to the second cache access instruction of the corresponding layer structure. That is, for each layer structure, the second cache unit in that layer structure simultaneously outputs the physical addresses of four channels, and compares them with the physical addresses output by the first cache unit in that layer structure, selecting the second cache unit channel that matches the comparison result to output data. Furthermore, the data output by the second cache unit is merged with data from other sources, and then operations such as storage format to register format conversion, endianness conversion, and sign conversion are performed. The data processing unit is used to merge the target data with the third data, perform format conversion on the merged data to obtain fourth data, and send the fourth data to the interface unit.
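The selection and processing steps can be sketched in software, with several assumptions that the disclosure does not spell out. The four channels are modelled here as the ways of a conventional set-associative cache, `requested_tag` stands for the address being looked up, the third data is merged per byte under a mask `use_third`, and storage is taken to be little-endian. All names are hypothetical.

```python
def select_target(first_data, second_data, requested_tag):
    """Data selection unit: pick the way whose tag matches.

    first_data:  tags read from the first cache unit, one per way
    second_data: byte strings read from the second cache unit, one per way
    Returns the selected bytes (target data), or None on a miss.
    """
    for tag, data in zip(first_data, second_data):
        if tag == requested_tag:
            return data
    return None


def merge_and_convert(target, third_data, use_third, signed=False):
    """Data processing unit: merge with third data, then convert format.

    use_third[i] is True where byte i comes from third_data instead of
    target. The result models a register value; little-endian storage
    order is an assumption of this sketch.
    """
    merged = bytes(
        new if take_new else old
        for old, new, take_new in zip(target, third_data, use_third)
    )
    return int.from_bytes(merged, byteorder="little", signed=signed)
```

The `signed` flag stands in for the sign conversion mentioned above, and the byte-order argument for the endianness conversion; a hardware implementation would perform these as combinational logic rather than function calls.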

[0033] As shown in Figure 3, the first layer structure R1 may further include a data selection unit S1, a data processing unit P1, and an interface unit I1; the second layer structure R2 may further include a data selection unit S2, a data processing unit P2, and an interface unit I2; and the nth layer structure Rn may further include a data selection unit Sn, a data processing unit Pn, and an interface unit In. Therefore, processing the cache access instructions of a one-layer structure occupies four pipelines, processing those of a two-layer structure occupies five pipelines, and processing those of an n-layer structure occupies (n+3) pipelines.
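The (n+3) count follows from the one-pipeline stagger. A short derivation, assuming the kth layer structure (1 ≤ k ≤ n) occupies four consecutive pipelines starting at D_k:

```latex
\mathrm{stages}(k) = \{D_k,\; D_{k+1},\; D_{k+2},\; D_{k+3}\}, \qquad
\bigcup_{k=1}^{n} \mathrm{stages}(k) = \{D_1, \dots, D_{n+3}\}, \qquad
\bigl|\{D_1, \dots, D_{n+3}\}\bigr| = n + 3 .
```

Setting n = 1 gives four pipelines and n = 2 gives five, in agreement with the counts stated above.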

[0034] In some embodiments, the first cache unit is a physical address memory, and the second cache unit is a data memory.

[0035] To clearly illustrate the technical solutions of the embodiments of this disclosure, caches with two and three layer structures are described below as examples, with reference to Figure 4 and Figure 5.

[0036] Figure 4 is a schematic diagram of a cache with a two-layer structure provided in a specific example of this disclosure. As shown in Figure 4, the cache includes two layer structures, namely the first layer structure R1 and the second layer structure R2. The first layer structure R1 includes physical address memory 1 (i.e., the first cache unit M11), data memory 1 (i.e., the second cache unit M12), data selection unit 1, data merging and format conversion unit 1, and interface unit 1. The second layer structure R2 includes physical address memory 2 (i.e., the first cache unit M21), data memory 2 (i.e., the second cache unit M22), data selection unit 2, data merging and format conversion unit 2, and interface unit 2. The overall cache capacity is doubled.

[0037] The first layer structure R1 and the second layer structure R2 are accessed independently by staggered clock cycles. That is, the physical address read instruction Tag Read 11 of the first layer structure accesses physical address memory 1 in D1, and the data read instruction Data Read 12 of the first layer structure accesses data memory 1 in D2; the physical address read instruction Tag Read 21 of the second layer structure accesses physical address memory 2 in D2, and the data read instruction Data Read 22 of the second layer structure accesses data memory 2 in D3.

[0038] The first layer structure R1 and the second layer structure R2 are processed independently. That is, in D3 the data of data memory 1 is selected and output by data selection unit 1, and data merging and format conversion unit 1 merges and format-converts it; in D4 the merged and format-converted data is output through interface unit 1. In D4 the data of data memory 2 is selected and output by data selection unit 2, and data merging and format conversion unit 2 merges and format-converts it; in D5 the merged and format-converted data is output through interface unit 2.

[0039] With one expansion layer structure added, the cache capacity is twice that of the existing cache.

[0040] Figure 5 is a schematic diagram of a cache with a three-layer structure provided in a specific example of this disclosure. As shown in Figure 5, the cache includes three layer structures: the first layer structure R1, the second layer structure R2, and the third layer structure R3. The first layer structure R1 includes physical address memory 1 (i.e., the first cache unit M11), data memory 1 (i.e., the second cache unit M12), data selection unit 1, data merging and format conversion unit 1, and interface unit 1. The second layer structure R2 includes physical address memory 2 (i.e., the first cache unit M21), data memory 2 (i.e., the second cache unit M22), data selection unit 2, data merging and format conversion unit 2, and interface unit 2. The third layer structure R3 includes physical address memory 3 (i.e., the first cache unit M31), data memory 3 (i.e., the second cache unit M32), data selection unit 3, data merging and format conversion unit 3, and interface unit 3. The overall cache capacity is tripled.

[0041] The first layer structure R1, the second layer structure R2, and the third layer structure R3 are accessed independently with staggered access times. That is, the physical address read instruction Tag Read 11 of the first layer structure accesses physical address memory 1 in D1, and the data read instruction Data Read 12 of the first layer structure accesses data memory 1 in D2; the physical address read instruction Tag Read 21 of the second layer structure accesses physical address memory 2 in D2, and the data read instruction Data Read 22 of the second layer structure accesses data memory 2 in D3; the physical address read instruction Tag Read 31 of the third layer structure accesses physical address memory 3 in D3, and the data read instruction Data Read 32 of the third layer structure accesses data memory 3 in D4.

[0042] The first layer structure R1, the second layer structure R2, and the third layer structure R3 are processed independently. That is, in D3 the data of data memory 1 is selected and output by data selection unit 1, and data merging and format conversion unit 1 merges and format-converts it; in D4 the merged and format-converted data is output through interface unit 1. In D4 the data of data memory 2 is selected and output by data selection unit 2, and data merging and format conversion unit 2 merges and format-converts it; in D5 the merged and format-converted data is output through interface unit 2. In D5 the data of data memory 3 is selected and output by data selection unit 3, and data merging and format conversion unit 3 merges and format-converts it; in D6 the merged and format-converted data is output through interface unit 3.
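The three-layer example can be reproduced as a per-pipeline timeline. The sketch below is illustrative: the operation labels and the function name `layer_timeline` are introduced here and are not terms from the disclosure.

```python
# The four operations each layer structure performs, in order.
OPERATIONS = [
    "tag read",              # physical address memory access
    "data read",             # data memory access
    "select/merge/convert",  # data selection, merging, format conversion
    "interface output",      # send result through the interface unit
]


def layer_timeline(n_layers):
    """Map each pipeline name to the (layer, operation) pairs active in it.

    Layer k starts in pipeline Dk, one pipeline after layer k-1.
    """
    timeline = {}
    for layer in range(1, n_layers + 1):
        for offset, op in enumerate(OPERATIONS):
            stage = f"D{layer + offset}"
            timeline.setdefault(stage, []).append((layer, op))
    return timeline


if __name__ == "__main__":
    for stage, active in layer_timeline(3).items():
        print(stage, active)
```

For three layers the timeline spans D1 through D6, and in D3 all three layers are active at once, each in a different unit: layer 1 is selecting and converting, layer 2 is reading its data memory, and layer 3 is reading its physical address memory.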

[0043] With two expansion layer structures added, the capacity of this cache is three times that of the existing cache.

[0044] This disclosure also provides a data processing apparatus, which includes the cache as described above.

[0045] In some embodiments, the data processing device includes, but is not limited to, a central processing unit (CPU), a graphics processing unit (GPU), an embedded neural network processor (NPU), and a data processing unit (DPU).

[0046] In some embodiments, the cache of the data processing device is a level 1 cache, or the cache of the data processing device is a level 1 cache and a level 2 cache.

[0047] When the data processing device is a CPU, as shown in Figure 6, the CPU may include a processing core, an L1 cache, and an L2 cache. L3 is a level 3 cache located in the SoC (System on Chip). The L1, L2, and L3 caches are cascaded level by level. It should be noted that the CPU may also include a processing core and an L1 cache but no L2 cache.

[0048] The relationship between the capacity of each cache level and memory access latency is shown in Table 1:

[0049] Table 1

[0050]
Cache level    Capacity      Memory access latency
L1             <=128KB       ~1ns
L2             <=4MB         ~5ns
L3             Several MB    ~10ns

[0051] As shown in Table 1, in a SoC, the CPU's L1 cache has the shortest memory access latency and the best performance. Therefore, increasing the capacity of the CPU's L1 cache has a significant effect on improving SoC performance. Since data is first searched in the L1 cache, then in the L2 cache if not found, and finally in the L3 cache or main memory if still not found, using the aforementioned extended layer structure for the L1 cache can greatly improve the performance of the data processing device and the chip.
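The effect of a higher L1 hit rate can be illustrated with an expected-latency calculation over the lookup order described above. Only the three cache latencies come from Table 1; the hit rates and the main-memory latency used in the usage note are hypothetical values chosen for illustration, and each level's latency is treated as the total latency of an access satisfied at that level.

```python
# Approximate latencies from Table 1, in nanoseconds.
LATENCY_NS = {"L1": 1.0, "L2": 5.0, "L3": 10.0}


def average_latency_ns(hit_rates, dram_ns):
    """Expected latency of a sequential L1 -> L2 -> L3 -> memory lookup.

    hit_rates: local hit rate per level, e.g. {"L1": 0.9, "L2": 0.8, "L3": 0.5}
    dram_ns:   main-memory latency (hypothetical; not given in Table 1)
    """
    expected = 0.0
    reach = 1.0  # probability that the lookup reaches the current level
    for level in ("L1", "L2", "L3"):
        expected += reach * hit_rates[level] * LATENCY_NS[level]
        reach *= 1.0 - hit_rates[level]
    return expected + reach * dram_ns
```

With hypothetical hit rates of 0.5 at every level and a 100 ns main memory, the expected latency is 15.5 ns; raising only the L1 hit rate to 0.75 lowers it to 8.25 ns, which is the kind of gain the paragraph attributes to a larger L1 cache.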

[0052] This disclosure also provides a chip that includes the data processing device as described above.

[0053] In some embodiments, the chip may further include a level 3 cache, wherein in the case of multiple data processing devices, each data processing device is connected to the level 3 cache. It should be noted that the level 3 cache may also be the cache described above.

[0054] As shown in Figure 6, the chip is a System-on-a-Chip (SoC), and the data processing device is a CPU. Each CPU includes a Level 1 cache (L1) and a Level 2 cache (L2). The chip also includes a Level 3 cache (L3). The Level 1 cache (L1) of each CPU is cascaded with the Level 2 cache (L2) of the same CPU, and the Level 2 cache (L2) of each CPU is in turn cascaded with the Level 3 cache (L3). The SoC is connected to external DRAM (Dynamic Random Access Memory) through the Level 3 cache (L3).

[0055] The cache of this disclosure can be applied to high-performance CPUs in computing-intensive fields such as cloud computing, scientific computing, and AI. By increasing the cache capacity, the performance of the CPU, i.e., the chip, is improved. It should be noted that this cache is not limited to high-performance CPU applications; any CPU using cache can benefit from it. The multi-layered cache structure of this disclosure does not affect the high-frequency implementation of the CPU, has strong scalability and applicability, and significantly improves CPU performance by increasing the cache capacity.

[0056] It will be understood by those skilled in the art that all or some of the steps in the methods disclosed above, and the functional modules / units in the apparatus, can be implemented as software, firmware, hardware, and suitable combinations thereof. In hardware implementations, the division between functional modules / units mentioned in the above description does not necessarily correspond to the division of physical components; for example, a physical component may have multiple functions, or a function or step may be performed collaboratively by several physical components. Some or all physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on a computer-readable medium, which may include computer storage media (or non-transitory media) and communication media (or transient media). As is known to those skilled in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cartridges, magnetic tape, disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and can be accessed by a computer. 
Furthermore, it is well known to those skilled in the art that communication media typically contain computer-readable instructions, data structures, program modules, or other data in modulated data signals such as carrier waves or other transmission mechanisms, and may include any information delivery medium.

[0057] Example embodiments have been disclosed herein, and while specific terminology has been used, it is for illustrative purposes only and should be construed as such, and is not intended to be limiting. In some instances, it will be apparent to those skilled in the art that features, characteristics, and / or elements described in conjunction with particular embodiments may be used alone, or in combination with features, characteristics, and / or elements described in conjunction with other embodiments, unless otherwise expressly indicated. Therefore, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of the invention as set forth in the appended claims.

Claims

1. A cache, characterized in that it includes at least two layer structures, each layer structure including a first cache unit and a second cache unit; first cache access instructions of different layer structures enter different pipelines to access the first cache unit in the corresponding layer structure, and second cache access instructions of different layer structures enter different pipelines to access the second cache unit in the corresponding layer structure.

2. The cache as described in claim 1, characterized in that the pipelines into which the first cache access instructions of each layer structure enter increase sequentially, and the pipelines into which the second cache access instructions of each layer structure enter increase sequentially.

3. The cache as described in claim 1 or 2, characterized in that the pipeline that the first cache access instruction of the nth layer structure enters is the next pipeline after the pipeline that the first cache access instruction of the (n-1)th layer structure enters, and the pipeline that the second cache access instruction of the nth layer structure enters is the next pipeline after the pipeline that the second cache access instruction of the (n-1)th layer structure enters; where n is the number of the layer structures.

4. The cache as described in claim 1, characterized in that each of the layer structures further includes a data selection unit, a data processing unit, and an interface unit. The data selection unit of each layer structure is used to select among the second data according to the first data to obtain target data. The first data is data obtained by accessing the first cache unit in the layer structure according to the first cache access instruction of the corresponding layer structure, and each of the second data is data obtained by accessing the second cache unit in the layer structure according to the second cache access instruction of the corresponding layer structure. The data processing unit is used to merge the target data with third data, perform format conversion on the merged data to obtain fourth data, and send the fourth data to the interface unit.

5. The cache as described in any one of claims 1-2 and 4, characterized in that the first cache unit is a physical address memory, and the second cache unit is a data memory.

6. A data processing apparatus, characterized in that it includes the cache as described in any one of claims 1-5.

7. The data processing apparatus as described in claim 6, characterized in that the cache is a level 1 cache, or the cache is a level 1 cache and a level 2 cache.

8. The data processing apparatus as described in claim 6, characterized in that the data processing apparatus is one of the following: a central processing unit, a graphics processing unit, an embedded neural network processor, or a data processing unit.

9. A chip, characterized in that it includes the data processing apparatus as described in any one of claims 6-8.

10. The chip as described in claim 9, characterized in that it further includes a level 3 cache, and there are multiple data processing apparatuses, each of which is connected to the level 3 cache.

Citation Information

Patent Citations

  • Method for achieving very high bandwidth between the levels of a cache hierarchy in 3-dimensional structures, and a 3-dimensional structure resulting therefrom

    CN101473436A

  • Convolution operation processing unit and system based on multi-level cache cyclic utilization

    CN113222129A