Chip, data access method, and electronic device
By introducing a last-level cache in a multi-core processor and selectively switching the data access mode based on the status of the cross-die interface, the problem of reduced data read speed caused by congestion at the cross-die interface between dies is solved, allowing the processor to reach its full performance.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- HUAWEI TECH CO LTD
- Filing Date
- 2025-12-02
- Publication Date
- 2026-06-18
AI Technical Summary
In multi-chip multi-core processors, congestion at the cross-chip interface between dies reduces data read speed and affects processor performance. The existing cache inter-access UMA mode cannot effectively utilize memory bandwidth when the cross-chip interface is congested, resulting in performance degradation.
By introducing a last-level cache in the chip, the data access mode can be selectively switched according to the status of the cross-die interface, so that data is read from a cache or from memory and congestion at the cross-die interface is avoided. The cache inter-access UMA mode or the memory UMA mode is adopted, and the data access path is further optimized by using a detour path.
This improves data read speed, ensures maximum processor performance, avoids data read rate reduction caused by cross-chip interface congestion, and enhances the overall performance of the chip.
Abstract
Description
Chip, data access method, and electronic device
[0001] This application claims priority to the Chinese patent application filed on December 9, 2024, with application number 202411816339.9 and entitled "Chip, Data Access Method and Electronic Device", the entire contents of which are incorporated herein by reference.
Technical Field
[0002] This application relates to the field of chip technology, specifically to a chip, a data access method, and an electronic device.
Background
[0003] With the continuous development of computer technology, Unified Memory Access (UMA) architecture has become an important way to improve system performance and efficiency. UMA is a computer system architecture in which various processors can share the same physical memory. This design eliminates data transfer bottlenecks between different computing units, improving the overall system efficiency and performance.
[0004] In multi-chip multi-core processors, one implementation of UMA is the cache inter-access UMA mode, in which the processor core on each die can read data directly only from the cache on that die. If the cache on that die does not hold the required data, that cache first tries to read the data from the caches on other dies. If the caches on other dies do not hold the required data either, the data is then read from memory.
[0005] The data retrieval rate from the cache is faster than the data retrieval rate from memory. However, when the cross-chip interface between dies is congested, even if there is valid data in memory, the cache inter-access UMA mode will still wait for the cross-chip interface congestion to be relieved before retrieving data from the cache of other dies. The time required for congestion relief may be long, resulting in slower data reading and the inability to maximize processor performance.
Summary of the Invention
[0006] This application provides a chip, a data access method, and an electronic device to improve problems such as reduced data read speed and impact on processor performance caused by congestion at cross-die interfaces between dies.
[0007] To achieve the above objectives, the technical solutions adopted in the embodiments of this application are as follows:
[0008] In a first aspect, embodiments of this application provide a chip, which includes memory and multiple dies. The multiple dies share the memory. The multiple dies include a first die and a second die. The first die includes a first processor core and a first cache, and the second die includes a second cache. The first cache is used to receive an access request sent by the first processor core for reading target data. If the access request misses the first cache but hits the second cache, the first cache is also used to selectively forward the access request to the second cache or the memory based on the status of the cross-die interface between the multiple dies.
[0009] In the chip provided in this application embodiment, when data needs to be read, data can be selectively read across chips or read from memory according to the status of the cross-chip interface. That is, it can switch between cache inter-access UMA mode and memory UMA mode. For example, when the cross-chip interface is congested, data is read from memory, and when the cross-chip interface is not congested, data is retrieved from the cache across chips. Therefore, it is not limited to cache inter-access UMA mode or memory UMA mode, which can improve the problems of reduced data reading speed and decreased chip performance caused by cross-chip interface congestion.
[0010] In one possible implementation of the first aspect, the first cache is specifically used to forward the access request, based on the status of the cross-die interface between the multiple dies, to whichever of the second cache and the memory requires less time to read the target data. For example, if reading the target data from the second cache takes less time, the access request is forwarded across dies to the second cache, and the target data is read from the second cache. If the cross-die interface is congested, the access request is forwarded to memory, and the target data is read from memory. This allows a fast response to the processor core's access requests and helps the chip reach its full performance.
[0011] In one possible implementation of the first aspect, the first cache is used to: forward access requests to the second cache to read target data from the second cache when the cross-chip interface between the first die and the second die is in a non-congested state. When the cross-chip interface is not congested, the rate of reading data from the cache is faster than reading data from memory, which can improve the performance of the chip.
[0012] In one possible implementation of the first aspect, the first cache is further used to: forward access requests to memory to read target data from memory when the cross-chip interface between the first die and the second die is in a congested state, thereby avoiding the data read rate decrease caused by cross-chip interface congestion and the resulting chip performance degradation.
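As an illustrative sketch only, the switching behavior of paragraphs [0011] and [0012] can be expressed as a small decision function. The names `Target` and `select_target` are hypothetical and do not appear in the application. The case in which both caches miss is handled as in paragraph [0004], by reading from memory.

```python
from enum import Enum


class Target(Enum):
    SECOND_CACHE = "second_cache"  # corresponds to cache inter-access UMA mode
    MEMORY = "memory"              # corresponds to memory UMA mode


def select_target(hit_first: bool, hit_second: bool, interface_congested: bool):
    """Return where the first cache forwards a read request.

    Returns None when the first cache itself holds the target data.
    """
    if hit_first:
        return None
    if hit_second and not interface_congested:
        # Non-congested interface: read across dies from the second cache.
        return Target.SECOND_CACHE
    # Congested interface, or no cache holds the data: read from memory.
    return Target.MEMORY
```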
[0013] In one possible implementation of the first aspect, the first cache is further used to: when the cross-die interface between the first die and the second die is in a congested state, and at least one detour path through which the first cache accesses the second cache via at least one third die is in a non-congested state, forward the access request to the second cache through the detour path. That is, when the cross-die interface between the first die and the second die is congested, the data can be retrieved from the second cache of the second die by detouring through at least one third die, so that the chip does not suffer stalls or performance degradation caused by congestion at the cross-die interface between the first die and the second die.
[0014] In one possible implementation of the first aspect, the chip further includes at least one third die, and the first cache is further configured to: forward access requests to memory when the cross-die interface between the first die and the second die is in a congested state, and all detour paths for the first cache to access the second cache through at least one third die are in a congested state.
[0015] In one possible implementation of the first aspect, the first cache is further configured to: when the cross-die interface between the first die and the second die is in a congested state and at least one detour path through which the first cache accesses the second cache via at least one third die is in a non-congested state, forward the access request to the second cache through the shortest of the non-congested detour paths.
[0016] In one possible implementation of the first aspect, if an access request hits the first cache, the first processor core reads the target data from the first cache.
[0017] In one possible implementation of the first aspect, the first cache includes: a monitoring module for acquiring the congestion status of the cross-die interfaces between the multiple dies; and a control module for selectively forwarding access requests based on the congestion status of the cross-die interfaces between the dies.
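The split between a monitoring module and a control module can be pictured with the following sketch. It is a software model for illustration only; the class names, method names, and the rule that an unreported interface counts as non-congested are assumptions, not details taken from the application.

```python
class MonitoringModule:
    """Keeps the latest congestion status of each cross-die interface."""

    def __init__(self):
        self._status = {}

    def update(self, die_a, die_b, congested):
        # An interface is identified by the unordered pair of dies it links.
        self._status[frozenset((die_a, die_b))] = congested

    def is_congested(self, die_a, die_b):
        # Assumption: an interface with no report is treated as non-congested.
        return self._status.get(frozenset((die_a, die_b)), False)


class ControlModule:
    """Chooses the forwarding target from the monitored status."""

    def __init__(self, monitor, local_die):
        self.monitor = monitor
        self.local_die = local_die

    def forward_target(self, owner_die):
        if self.monitor.is_congested(self.local_die, owner_die):
            return "memory"
        return f"cache of {owner_die}"
```

In this model the control module never measures anything itself; it only queries the monitoring module at the moment a request has to be forwarded.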
[0018] In a second aspect, a data access method is provided, applied to a chip. The chip includes memory and multiple dies, the multiple dies share the memory, and adjacent dies communicate through a cross-die interface. The multiple dies include at least a first die and a second die that are adjacent to each other; the first die includes a first processor core and a first cache, and the second die includes a second cache. The method includes: the first cache receives an access request, sent by the first processor core, for reading target data; and if the access request misses the first cache but hits the second cache, the first cache selectively forwards the access request to the second cache or the memory according to the state of the cross-die interface between the multiple dies.
[0019] In one possible implementation of the second aspect, the first cache selectively forwarding the access request to the second cache or the memory based on the state of the cross-die interface between the multiple dies includes: the first cache forwards the access request, based on the state of the cross-die interface between the multiple dies, to whichever of the second cache and the memory requires less time to read the target data.
[0020] In one possible implementation of the second aspect, before the first cache selectively forwards the access request to the second cache or the memory based on the state of the cross-die interface between the multiple dies, the method further includes: obtaining the state of the cross-die interface between the dies.
[0021] In one possible implementation of the second aspect, the first cache selectively forwarding the access request to the second cache or the memory based on the state of the cross-die interface between the multiple dies includes: when the cross-die interface between the first die and the second die is in a non-congested state, the first cache forwards the access request to the second cache.
[0022] In one possible implementation of the second aspect, the first cache selectively forwarding the access request to the second cache or the memory based on the state of the cross-die interface between the multiple dies includes: when the cross-die interface between the first die and the second die is in a congested state, the first cache forwards the access request to the memory.
[0023] In one possible implementation of the second aspect, the multiple dies further include at least one third die. The first cache selectively forwarding the access request to the second cache or the memory based on the state of the cross-die interface between the multiple dies includes: if the cross-die interface between the first die and the second die is congested, and at least one detour path through which the first cache accesses the second cache via the at least one third die is not congested, the first cache forwards the access request to the second cache through the detour path; if the cross-die interface between the first die and the second die is congested, and all detour paths through which the first cache accesses the second cache via the at least one third die are congested, the first cache forwards the access request to the memory.
[0024] In one possible implementation of the second aspect, when the cross-die interface between the first die and the second die is in a congested state and at least one detour path through which the first cache accesses the second cache via at least one third die is in a non-congested state, the first cache forwarding the access request to the second cache via the detour path includes: the first cache forwards the access request to the second cache via the non-congested detour path that requires the least time.
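A minimal sketch of the selection in paragraph [0024] follows, assuming each detour path carries a congestion flag and a time estimate. The `DetourPath` record, its field names, and the time estimate itself are hypothetical; the application does not specify how the time requirement is obtained.

```python
from typing import NamedTuple, Optional, Sequence


class DetourPath(NamedTuple):
    via: tuple          # third dies traversed, in order
    congested: bool     # congestion state of the path
    est_time_ns: float  # hypothetical estimate of the read time over this path


def pick_detour(paths: Sequence[DetourPath]) -> Optional[DetourPath]:
    """Return the non-congested detour path with the least estimated time.

    Returns None when every detour path is congested, in which case the
    request would be forwarded to memory instead.
    """
    usable = [p for p in paths if not p.congested]
    return min(usable, key=lambda p: p.est_time_ns) if usable else None
```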
[0025] In a third aspect, an electronic device is provided, including a circuit board and the chip provided in the first aspect or any implementation thereof, the chip being mounted on the circuit board.
[0026] In a fourth aspect, a computer-readable storage medium is provided that stores computer-readable instructions, wherein when the computer-readable instructions are executed by a chip, the chip performs the method provided by the second aspect or any implementation thereof.
Brief Description of the Drawings
[0027] Figure 1 is a schematic diagram of a chip provided in an embodiment of this application;
[0028] Figure 2 is a schematic diagram of another chip provided in an embodiment of this application;
[0029] Figure 3 is a schematic diagram of another chip provided in an embodiment of this application;
[0030] Figure 4 is a flowchart of a data access method provided in an embodiment of this application;
[0031] Figure 5 is a schematic diagram of another chip provided in an embodiment of this application;
[0032] Figure 6 is a schematic diagram of another chip provided in an embodiment of this application;
[0033] Figure 7 is a schematic diagram of another chip provided in an embodiment of this application;
[0034] Figure 8 is a schematic diagram of the working principle of the chip provided in the embodiment of this application;
[0035] Figure 9 is a flowchart of another data access method provided in an embodiment of this application;
[0036] Figure 10 is a flowchart of another data access method provided in an embodiment of this application;
[0037] Figure 11 is a flowchart of another data access method provided in an embodiment of this application.
Detailed Description
[0038] The embodiments of this application will now be described with reference to the accompanying drawings. The terms "first" and "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish different objects, not to describe a specific order. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or devices. It should be noted that when an element is referred to as being "coupled" or "connected" to one or more other elements, it can be a direct connection of one or more elements to the other elements, or an indirect connection.
[0039] It should be understood that in this application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" describes the relationship between related objects and indicates that three relationships can exist. For example, "A and/or B" can represent three cases: only A exists, only B exists, and both A and B exist, where A and B can be singular or plural. The character "/" generally indicates that the related objects before and after it are in an "or" relationship. "At least one (item) of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one (item) of a, b, or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.
[0040] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art will understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
[0041] The terms “component,” “module,” “system,” etc., used in this specification refer to computer-related entities: hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component can be, but is not limited to, a process running on a processor, a processor, an object, an executable file, an execution thread, a program, and/or a computer. By way of illustration, both an application running on a processor and the processor itself can be components. One or more components may reside in a process and/or an execution thread, and a component may be located on a single computer and/or distributed among two or more computers. Furthermore, these components can be executed from various computer-readable media on which various data structures are stored. Components can communicate, for example, via local and/or remote processes based on a signal having one or more data packets (e.g., data from two components interacting with another component in a local system, in a distributed system, and/or across a network such as the Internet that interacts with other systems by way of signals).
[0042] The integrated circuit industry has been developing rapidly in accordance with "Moore's Law," which states that the number of transistors that can be placed on an integrated circuit roughly doubles every 18 to 24 months. However, as "Moore's Law" has progressed, the number of devices and transistors has increased, and their size has become smaller and smaller. From this perspective, there is a physical limit to the size of transistors. That is, after transistors are shrunk to a certain size, they cannot be shrunk any further. The number of transistors that can be placed on a single die is limited, which means that the computing power of a single die is limited.
[0043] A bare die refers to a chip before it is packaged. It is a small piece cut from a silicon wafer using a laser. Each bare die can be an independent functional chip. For example, bare dies will later be packaged as a unit to become a common chip. In order to meet the current demand for chip computing power, the industry has proposed a technical solution to package multiple bare dies into a single chip, thereby providing greater computing power.
[0044] Taking processors as an example, various processors are the core of electronic devices, such as central processing units (CPUs), graphics processing units (GPUs), tensor processing units (TPUs), and neural network processing units (NPUs), which are collectively referred to as processors (XPUs). With the increasing demand for processor computing power, current chips are trending towards exceeding the size of a single die, and with the increase in package size, they are gradually expanding from two dies to three, four, or more dies. How to achieve chip usability with multiple dies has become one of the bottlenecks restricting chip performance.
[0045] For a chip to meet ease of use, a prerequisite is the implementation of the UMA programming model. This means that the programmer treats the chip's memory as a single piece, without further intra-chip segmentation during memory allocation. This memory can be Dynamic Random Access Memory (DRAM). Referring to Figure 1, which shows a schematic diagram of a chip, the chip includes memory 11, a last-level cache (LLC) 12, and multiple processor cores, such as processor core 13-1 and processor core 13-2. These processor cores and the cache are all housed on a single die. This chip is also called a single-die chip or a single-die processor system. In a single-die processor system, UMA is easier to implement. The last-level cache 12 is used as a memory-side cache, and all processor cores can access the data in this last-level cache 12. If the target data is not found in the last-level cache 12, it is read from memory 11. This chip may also include other components not shown in Figure 1.
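The single-die read path of Figure 1 can be sketched as follows, modeling the last-level cache 12 and memory 11 as dictionaries. The function name `read_single_die` is hypothetical, and filling the cache on a miss is an assumption added for illustration; the paragraph above only states that a miss is served from memory.

```python
def read_single_die(address, llc, memory):
    """Read through a memory-side last-level cache shared by all cores."""
    if address in llc:
        return llc[address], "llc"
    value = memory[address]
    # Assumption: the missed line is installed in the cache for later reads.
    llc[address] = value
    return value, "memory"
```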
[0046] For chips with dual or multiple dies, UMA can be implemented in various ways, such as memory UMA mode and cache inter-access UMA mode.
[0047] Referring to Figure 2, which illustrates another type of chip, Figure 2a shows a schematic diagram of a dual-die chip comprising die 20-1, die 20-2, and memory 30, with die 20-1 and die 20-2 sharing memory 30. Figure 2b shows a schematic diagram of a multi-die chip comprising die 20-1, die 20-2, die 20-3, die 20-4, and memory 30, with all dies sharing memory 30. In chips employing the memory UMA mode, different dies do not access each other's caches. The processor core on each die can only access the cache on that die, and each cache can only retrieve data from memory, not from the caches of other dies. For example, the processor core in die 20-1 can only retrieve data from the cache in die 20-1, and the cache in die 20-1 can retrieve data from memory 30, but cannot retrieve data from the caches of other dies.
[0048] Referring to Figure 3, which illustrates another type of chip, Figure 3a shows a schematic diagram of a dual-die chip comprising die 20-1, die 20-2, and memory 30, with die 20-1 and die 20-2 sharing memory 30. Figure 3b shows a schematic diagram of a multi-die chip comprising die 20-1, die 20-2, die 20-3, die 20-4, and memory 30, with all dies sharing memory 30. In chips employing the cache inter-access UMA mode, different dies can access each other's caches through cache coherence. The processor core on each die can directly access only the data in the cache of its own die. On a miss, the cache in each die preferentially retrieves data from the caches of other dies. If the caches of other dies do not contain the required data, the data is retrieved from memory. For example, taking the data access needs of the processor core in die 20-1, the processor core in die 20-1 can access the data in the cache of die 20-1. The cache of die 20-1 can read data from the caches of die 20-2, die 20-3, and die 20-4. If the required data is not found in the caches of die 20-2, die 20-3, and die 20-4, the cache of die 20-1 reads the data from memory 30.
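The difference between the two modes of Figures 2 and 3 lies only in what happens after a miss in the local cache. The sketch below contrasts them, modeling caches and memory as dictionaries; both function names are hypothetical, and the order in which remote caches are probed is an assumption.

```python
def read_memory_uma(address, local_cache, memory):
    """Memory UMA mode: a local miss always goes to memory."""
    if address in local_cache:
        return local_cache[address], "local cache"
    return memory[address], "memory"


def read_cache_inter_access_uma(address, local_cache, remote_caches, memory):
    """Cache inter-access UMA mode: probe other dies' caches before memory."""
    if address in local_cache:
        return local_cache[address], "local cache"
    # Assumption: remote caches are probed in dictionary order.
    for die, cache in remote_caches.items():
        if address in cache:
            return cache[address], f"cache of die {die}"
    return memory[address], "memory"
```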
[0049] Cache access is much faster than memory access, and the cache inter-access UMA mode is widely used. For example, Figure 4 shows a flowchart of a data access method for a chip using this mode. The chip includes multiple partitions, which can be considered equivalent to dies. Each partition includes a processor core. If the processor core in the first partition misses the L1 cache but hits the L2 cache, it determines whether the L2 cache is located in this partition. If the L2 cache is in this partition, the data is read from the L2 cache. If the L2 cache is not in this partition, i.e., it is in another partition, the data is read from the L2 cache of the other partition into the L2 cache of this partition, and then the processor core reads the data from the L2 cache of this partition.
[0050] However, in chips using the cache inter-access UMA mode, when the cross-die interface between dies is congested, this mode still waits for the congestion to be relieved, even if there is spare memory bandwidth and valid data in memory. In this embodiment, congestion of the cross-die interface means that the amount of data waiting to be sent over the cross-die interface approaches or exceeds the physical limit of the interface. That is to say, the time required to relieve the congestion depends on the amount of data waiting to be transmitted over the interface. If that amount is large, the congestion may last a long time, which reduces the speed of cross-die data reads, causes the chip to stall, and degrades the chip's performance.
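The notion of congestion in this paragraph can be illustrated with a simple numeric model. Everything specific here is an assumption made for the sketch: the function names, the 0.9 threshold used for "approaching" the limit, and the constant drain bandwidth.

```python
def is_congested(pending_bytes, limit_bytes, threshold=0.9):
    """True when pending data approaches or exceeds the interface limit.

    threshold is a hypothetical fraction defining "approaching".
    """
    return pending_bytes >= threshold * limit_bytes


def drain_time_s(pending_bytes, bandwidth_bytes_per_s):
    """Time to clear the backlog, assuming a constant drain bandwidth."""
    return pending_bytes / bandwidth_bytes_per_s
```

Under this model the wait grows linearly with the backlog, which is why a request that waits for a heavily loaded interface can take longer than a read served from memory.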
[0051] To address the aforementioned issues, this application provides a chip, such as a processor chip, including a central processing unit (CPU) or a graphics processing unit (GPU). Referring to Figure 5, which illustrates a schematic diagram of a chip provided in this application, the chip includes a memory 30 and multiple dies. These dies share the memory 30, and each die includes one or more processor cores and a cache. For example, the multiple dies include a first die 20-1 and a second die 20-2. The first die 20-1 includes a first processor core 211 and a first cache 212, and the second die 20-2 includes a second processor core 221 and a second cache 222. Here, the first processor core 211 and the second processor core 221 can be individual processor cores or clusters of multiple processor cores; this application does not limit this.
[0052] Taking the access request of the first processor core 211 in the first die 20-1 as an example, the first processor core 211 issues an access request for reading target data. The first cache 212 is used to receive the access request for reading target data issued by the first processor core. If the access request does not hit the first cache 212 but hits the second cache 222, the first cache 212 is used to selectively forward the access request to the second cache 222 or the memory 30 according to the status of the cross-die interface between multiple dies.
[0053] In this embodiment, the first cache 212 and the second cache 222 are the last-level caches of the corresponding dies. Taking the first die 20-1 as an example, the first die 20-1 also includes a first-level cache, a second-level cache, etc. Typically, an access request from the first processor core 211 first looks up the first-level cache. If the target data is not found in the first-level cache, the request looks up the second-level cache. If the target data is not found in the second-level cache either, the request looks up the next cache level, and so on until the last-level cache is reached. Therefore, when the first cache 212 receives an access request from the first processor core 211, all cache levels before the last-level cache have missed, that is, none of them holds the target data required by the access request. If the access request hits the first cache 212, the first processor core 211 can read the target data from the first cache 212. If the access request misses the first cache 212, the target data has to be read from other storage, such as memory or the caches of other dies.
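The level-by-level lookup inside one die can be sketched as a walk over an ordered list of caches. The function name `find_level` and the dictionary model of each cache are hypothetical.

```python
def find_level(address, levels):
    """Return the name of the first cache level that holds the address.

    levels is an ordered list of (name, cache) pairs, from the first-level
    cache down to the last-level cache. None means that the last-level cache
    also missed, so the request must be forwarded outside the die's own caches.
    """
    for name, cache in levels:
        if address in cache:
            return name
    return None
```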
[0054] In a chip employing the cache inter-access UMA mode, if the access request misses the first cache 212 but hits the second cache 222, the first cache 212 reads the required target data from the second cache 222 of the second die 20-2 based on cache coherence. However, if the cross-die interface between the first die 20-1 and the second die 20-2 is congested, the first cache 212 does not read the required target data from memory 30, but waits for the congestion of the cross-die interface between the first die 20-1 and the second die 20-2 to be relieved before reading the required target data from the second cache 222. In the solution provided in this embodiment, when the access request hits the second cache 222 but misses the first cache 212, the first cache 212 can selectively forward the access request to the second cache 222 or the memory 30 according to the status of the cross-die interface. That is, it can switch between the cache inter-access UMA mode and the memory UMA mode. Selecting the appropriate UMA mode avoids problems such as the reduced cross-die read speed caused by interface congestion in the cache inter-access UMA mode, and helps the chip reach its full performance.
[0055] In one possible implementation, the first cache 212 forwards the access request, depending on the state of the cross-die interface, to whichever of the second cache 222 and memory 30 requires less time to read the target data. For example, when the cross-die interface is in a congested state, reading from memory 30 takes less time, so the access request is forwarded to memory 30; when the cross-die interface is not in a congested state, reading from the second cache 222 takes less time, so the access request is forwarded to the second cache 222.
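One way to picture the time comparison in this paragraph is sketched below. The application does not state how read times are estimated; the queueing model, the function names, and all numeric values are assumptions for illustration.

```python
def estimate_cache_read_ns(base_ns, queued_bytes, link_bytes_per_ns):
    """Hypothetical estimate of a cross-die cache read time.

    Assumption: the request waits for traffic already queued on the
    cross-die interface before its own transfer starts.
    """
    return base_ns + queued_bytes / link_bytes_per_ns


def pick_fastest(estimates):
    """Return the target with the smallest estimated read time."""
    return min(estimates, key=estimates.get)


# Idle interface: the second cache is the faster source.
idle = {"second_cache": estimate_cache_read_ns(40, 0, 64), "memory": 100.0}
# Backlogged interface: memory becomes the faster source.
busy = {"second_cache": estimate_cache_read_ns(40, 6400, 64), "memory": 100.0}
```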
[0056] Taking the chip including the first die 20-1 and the second die 20-2 as an example, if the access request issued by the first processor core 211 misses the first cache 212 but hits the second cache 222, and if the cross-chip interface between the first die 20-1 and the second die 20-2 is in a congested state, it may take a long time to read the target data from the second cache 222. In this case, the first cache 212 can choose to forward the access request to the memory 30 to reduce cross-chip access and quickly read the required target data.
[0057] When the cross-chip interface between the first die 20-1 and the second die 20-2 is in a non-congested state, and the time required to read data from the second cache 222 is shorter than the time required to read data from memory 30, the first cache 212 may choose to forward the access request to the second cache 222.
[0058] Based on this, when the first processor core 211 of the first die 20-1 needs to read data from the second cache 222, in the event of cross-die interface congestion, the data is retrieved directly from memory 30 instead of through the cross-die interface. This avoids the decrease in data read rate caused by cross-die interface congestion and prevents impact on chip processing performance. When the cross-die interface is not congested, the first cache 212 reads data from the second cache 222 through the cross-die interface between the first die 20-1 and the second die 20-2. Due to the differences in structure and working principle between the cache and memory 30, the speed of reading data from the cache is much greater than the speed of reading data from memory 30. Therefore, when the cross-die interface is not congested, reading data from the second cache 222 can improve the data read speed of the first processor core 211 and improve chip processing performance.
[0059] The above example uses a chip with two dies. The following, with reference to the accompanying drawings, describes the case where the chip includes more dies, such as at least one third die. The first cache can access the second cache of the second die through the cross-die interface between the first and second dies, and also through a detour path that passes through at least one third die. Therefore, if the cross-die interface between the first and second dies is congested, and the detour path through which the first cache accesses the second cache via at least one third die is not congested, the first cache sends the access request to the second cache through this detour path to read the target data from the second cache.
[0060] For example, as shown in Figure 6, the chip includes a first die 20-1, a second die 20-2, and at least one third die, here a third die 40-1 and a third die 40-2. Adjacent dies communicate with each other through a cross-die interface. The first die 20-1, the second die 20-2, the third die 40-1, and the third die 40-2 can be connected in a daisy-chain manner. For example, the first die 20-1 and the second die 20-2 are connected through a cross-die interface, the second die 20-2 and the third die 40-1 are connected through a cross-die interface, the third die 40-1 is connected to the third die 40-2 through a cross-die interface, and the third die 40-2 is connected to the first die 20-1 through a cross-die interface.
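Because the last connection in this paragraph links the third die 40-2 back to the first die 20-1, the four dies form a closed chain, so there are two routes between any pair of dies. The sketch below enumerates them; the list `RING` and the function `ring_paths` are hypothetical names, and the die labels are taken from Figure 6.

```python
# Order of the dies around the closed chain, following the connections above.
RING = ["20-1", "20-2", "40-1", "40-2"]


def ring_paths(ring, src, dst):
    """Return the two routes from src to dst, one in each direction."""
    n = len(ring)
    i, j = ring.index(src), ring.index(dst)
    forward = [ring[(i + k) % n] for k in range((j - i) % n + 1)]
    backward = [ring[(i - k) % n] for k in range((i - j) % n + 1)]
    return forward, backward


direct, detour = ring_paths(RING, "20-1", "20-2")
# direct is ["20-1", "20-2"]
# detour is ["20-1", "40-2", "40-1", "20-2"]
```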
[0061] The first die 20-1 includes a first processor core and a first cache, and the second die 20-2 includes a second processor core and a second cache. The first processor core sends an access request to the first cache, and the first cache receives the access request from the first processor core. If the access request misses the first cache but hits the second cache, the first cache is also used to selectively forward the access request to the second cache or memory 30 based on the status of the cross-die interface between the multiple dies.
[0062] When the cross-die interface between the first die 20-1 and the second die 20-2 is in a non-congested state, and the time required to read data from the second cache is shorter than the time required to read data from memory, the first cache chooses to forward the access request to the second cache.
[0063] When the cross-die interface between the first die 20-1 and the second die 20-2 is congested, reading data directly from the second die 20-2 may take the first die 20-1 a long time. However, because reading data from a cache is faster than reading data from memory, the first cache can still read the data from the second cache by way of at least one third die in this case.
[0064] For example, in this embodiment of the application, the first die 20-1 and the third die 40-2 are connected through a cross-die interface, the third die 40-2 and the third die 40-1 are connected through a cross-die interface, and the third die 40-1 and the second die 20-2 are connected through a cross-die interface. That is to say, the first die 20-1 can directly read data from the second cache through the cross-die interface with the second die 20-2, and the first die 20-1 can also read data from the second cache by going around through the third die 40-2 and the third die 40-1.
[0065] In this embodiment, there are two access paths between the first cache of the first die 20-1 and the second cache of the second die 20-2. One is the access path that reads data from the second cache through the cross-die interface between the first die 20-1 and the second die 20-2; the other is the access path that reads data from the second cache through the third die 40-2 and the third die 40-1, referred to as the detour path. When the cross-die interface between the first die 20-1 and the second die 20-2 is in a congested state, the first cache is also used to determine whether all detour paths for the first die 20-1 to access the second cache through at least one third die are congested. If all detour paths are congested, the first cache forwards the access request to memory to read the target data from memory. If there is a non-congested detour path, the first cache can send the access request to the second cache through the detour path to read the target data from the second cache.
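The two access paths described above can be illustrated with a minimal sketch. It assumes the four-die ring of Figure 6 is modeled as an adjacency table; the `RING` table and the `simple_paths` helper are hypothetical names introduced only for this illustration and are not part of the application.

```python
from typing import Dict, List

# Hypothetical adjacency table for the four-die ring of Figure 6.
# Labels reuse the reference numerals from the text.
RING: Dict[str, List[str]] = {
    "20-1": ["20-2", "40-2"],
    "20-2": ["20-1", "40-1"],
    "40-1": ["20-2", "40-2"],
    "40-2": ["40-1", "20-1"],
}


def simple_paths(graph, src, dst, _seen=None):
    """Return every loop-free die sequence from src to dst (depth-first)."""
    seen = (_seen or []) + [src]
    if src == dst:
        return [seen]
    paths = []
    for nxt in graph[src]:
        if nxt not in seen:
            paths.extend(simple_paths(graph, nxt, dst, seen))
    return paths


paths = simple_paths(RING, "20-1", "20-2")
direct_path = [p for p in paths if len(p) == 2]   # one cross-die interface
detour_paths = [p for p in paths if len(p) > 2]   # routed through third dies
```

In this model the only detour path runs 20-1, 40-2, 40-1, 20-2, which matches the detour path described for Figure 6.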
[0066] In some other possible implementations, the chip further includes a greater number of third dies, resulting in more detour paths between the first cache and the second cache. In this case, if the cross-die interface between the first die 20-1 and the second die 20-2 is congested, while at least one detour path through which the first cache accesses the second cache via at least one third die is not congested, the first cache can select the fastest detour path to access the second cache, thereby improving access speed and ensuring chip performance. For example, the chip shown in Figure 7 includes multiple third dies. The first cache on the first die 20-1 can access the second cache on the second die 20-2 via detour path 1 and detour path 2, which traverse different third dies and cross-die interfaces. For example, detour path 1 traverses third die 40-1, third die 40-2, and the cross-die interface between them; detour path 2 traverses third die 40-1, third die 40-2, third die 40-3, third die 40-4, and the cross-die interfaces between them. If the second cache can be reached through both detour path 1 and detour path 2, the first cache can forward the access request to the second cache through whichever detour path takes less time.
[0067] In this embodiment, the time required for the first cache to access the second cache via a detour path is related to the number of cross-chip interfaces traversed by the detour path, as well as the traffic of those cross-chip interfaces. For example, if the amount of data to be transmitted on some cross-chip interfaces is large but not congested, it will also result in a longer time required to access the second cache via those interfaces. Therefore, the selection needs to consider both the number of cross-chip interfaces and their traffic. The first cache can record the time required to read data from the second cache through each detour path. When receiving an access request from the first processor core, it can select the detour path with the shorter time required to forward the access request based on the data retrieval time required by each detour path.
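The recording and selection described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions: the `DetourLatencyTable` class, its method names, and the nanosecond values are hypothetical, and keeping only the most recent measurement per path is one possible policy rather than the one the application prescribes.

```python
from typing import Dict, Optional


class DetourLatencyTable:
    """Keeps the most recent observed read time for each detour path."""

    def __init__(self) -> None:
        self._last_ns: Dict[str, float] = {}

    def record(self, path_id: str, elapsed_ns: float) -> None:
        # Overwrite with the latest measurement (assumed policy).
        self._last_ns[path_id] = elapsed_ns

    def fastest(self) -> Optional[str]:
        # Path with the smallest recorded time, or None if nothing is recorded.
        if not self._last_ns:
            return None
        return min(self._last_ns, key=self._last_ns.get)


table = DetourLatencyTable()
table.record("detour_path_1", 180.0)  # fewer interfaces, heavy traffic (made-up value)
table.record("detour_path_2", 150.0)  # more interfaces, light traffic (made-up value)
chosen = table.fastest()
```

With these made-up values the longer detour path 2 is chosen, which illustrates why the selection considers traffic as well as the number of cross-chip interfaces.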
[0068] Furthermore, if the time required for the first cache to retrieve data from the second cache via a non-congested detour path exceeds the time required to retrieve data from memory, the chip can switch from the cache inter-access UMA mode to the memory UMA mode. In this mode, data is retrieved from memory instead of from the second cache via the detour path. For example, the first cache can record the time required to read data from the second cache via each path. When receiving an access request from the first processor core, it can compare the recorded time for each path with the time required to read data from memory, and read the target data from whichever source requires the shorter time.
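The comparison between the recorded path times and the memory read time can be sketched as below. The `choose_source` function and its arguments are hypothetical, and sending a tie to memory is an assumption, since the text only states what happens when the detour time exceeds the memory time.

```python
from typing import Dict, Tuple


def choose_source(cache_path_ns: Dict[str, float], memory_ns: float) -> Tuple[str, str]:
    """Return (mode, target) for the option with the shortest recorded time.

    cache_path_ns maps each non-congested path to the second cache onto its
    recorded read time; memory_ns is the recorded read time from memory.
    """
    if cache_path_ns:
        best_path = min(cache_path_ns, key=cache_path_ns.get)
        if cache_path_ns[best_path] < memory_ns:
            return ("cache inter-access UMA", best_path)
    # No usable path, or memory is at least as fast (tie handling is assumed).
    return ("memory UMA", "memory")


slow_detour = choose_source({"detour_path_1": 260.0}, memory_ns=200.0)
fast_direct = choose_source({"direct": 90.0, "detour_path_1": 260.0}, memory_ns=200.0)
```

Here `slow_detour` selects memory because the only available detour is slower than memory, while `fast_direct` stays in the cache inter-access UMA mode.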
[0069] For example, the first cache provided in this application embodiment includes a monitoring module and a control module. The monitoring module is used to obtain the status of the cross-chip interfaces between dies, which here means the global cross-chip interface status. The monitoring module can obtain this status through the cross-chip interface itself: for example, a communication packet transmitted over the cross-chip interfaces carries the status of each cross-chip interface along its route, expressed as congestion information or traffic information, and the monitoring module derives the interface status from that information. Alternatively, the monitoring module can obtain the status through a connection between itself and the cross-chip interface: for example, the potential on the connection is pulled high when the cross-chip interface is congested and pulled low when it is not, and the monitoring module reads the status from that potential.
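The two ways of obtaining the interface status can be modeled in software as a rough sketch. The `MonitoringModule` class, its method names, and the interface labels are hypothetical; treating an interface with no recorded status as non-congested is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MonitoringModule:
    """Tracks congestion per cross-chip interface (True means congested)."""
    congested: Dict[str, bool] = field(default_factory=dict)

    def on_packet(self, status_along_route: Dict[str, bool]) -> None:
        # Option 1: a packet carries the status of every interface it
        # crossed; merge that into the global view.
        self.congested.update(status_along_route)

    def on_signal(self, interface_id: str, level_high: bool) -> None:
        # Option 2: a dedicated connection is pulled high when the interface
        # is congested and pulled low when it is not.
        self.congested[interface_id] = level_high

    def is_congested(self, interface_id: str) -> bool:
        # Interfaces with no recorded status count as non-congested (assumed).
        return self.congested.get(interface_id, False)


monitor = MonitoringModule()
monitor.on_packet({"20-1<->40-2": False, "40-2<->40-1": True})
monitor.on_signal("20-1<->20-2", True)
```

A control module could then query `is_congested` for each interface on a candidate path before forwarding an access request.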
[0070] The control module is used to selectively forward access requests from the first processor core based on the congestion status of the cross-chip interface between the dies. Specifically, the control module can select the path with the shortest time required to read the target data to access the second cache based on the cross-chip interface status between the dies.
[0071] The chip provided in this application embodiment can switch between cache inter-access UMA mode and memory UMA mode. Referring to Figure 8, the cache inter-access UMA mode is used when the cross-chip interface is not congested, and the chip switches to the memory UMA mode when the cross-chip interface is congested. Selecting the appropriate UMA mode avoids the reduction in cross-chip data read speed that congestion of the cross-chip interface would cause in the cache inter-access UMA mode, ensuring that the chip can maximize its performance.
[0072] This application embodiment also provides a data access method, which is applied to the chip provided in this application embodiment. The chip includes memory and multiple dies that share the memory. The multiple dies include a first die and a second die; the first die includes a first processor core and a first cache, and the second die includes a second cache. Referring to Figure 9, the data access method includes:
[0073] S410: The first processor core initiates an access request to read the target data.
[0074] S420: The first cache receives the access request sent by the first processor core and determines whether the access request hits the first cache.
[0075] If the data that the first processor core wants to access is cached in the first cache, this is called a "hit"; otherwise, it is called a "miss". In this embodiment, the first cache and the second cache are the last-level caches of their respective dies. The fact that the first cache receives the access request from the first processor core means that the first-level cache, the second-level cache, and any other caches before the last-level cache in the first die have all missed, that is, none of them holds the data that the first processor core wants to access.
[0076] S430: If the access request hits the first cache, the first processor core reads the target data from the first cache.
[0077] If the access request hits the first cache, that is, if the first cache stores the data that the first processor core wants to access, then the first processor core reads the target data from the first cache, and the access request ends.
[0078] S440: If the access request does not hit the first cache, the first cache determines whether the access request hits the second cache.
[0079] If the access request does not hit the first cache, that is, the first cache does not store the data that the first processor core wants to access, the first cache determines whether the access request hits the second cache, that is, whether the data that the first processor core wants to access exists in the second cache. In this embodiment of the application, the second cache is the last-level cache in the second die, the second die is a die other than the first die, and the first die and the second die can be connected through a cross-die interface.
[0080] S450: If the access request does not hit the second cache, the first cache forwards the access request to memory and reads the target data from memory.
[0081] If the access request does not hit the second cache, the first cache forwards the access request to the chip's memory and reads the target data from memory. In some possible implementations, the first processor core may read the target data from memory and cache it in the first cache; alternatively, the target data may be written from memory to the first cache, and then the first processor core reads the target data from the first cache.
[0082] S460: If the access request hits the second cache, the first cache obtains the status of the cross-die interface between the multiple dies.
[0083] For example, the first cache includes a monitoring module and a control module, wherein the monitoring module is used to obtain the status of the cross-die interface between dies; the cross-die interface between dies here includes the global cross-die interface status, such as the status of the cross-die interface between the first die and the second die. If the chip also includes more dies, the first cache can also obtain the status of the cross-die interface between each die, and the status here includes congestion status and non-congestion status.
[0084] S470: The first cache selectively forwards the access request to the second cache or memory based on the status of the cross-die interface between the multiple dies.
[0085] In the solution provided in this application embodiment, if the access request does not hit the first cache but hits the second cache, the first cache can selectively forward the access request to the second cache or memory according to the status of the cross-chip interface. That is, the chip can switch between cache inter-access UMA mode and memory UMA mode. Selecting the appropriate UMA mode avoids the reduction in cross-chip data read speed that congestion of the cross-chip interface would cause in the cache inter-access UMA mode, thus ensuring that the chip can maximize its performance.
[0086] In one possible implementation, the first cache forwards the access request, based on the state of the cross-chip interface between the first and second dies, to whichever of the second cache or memory requires less time to read the target data.
[0087] For example, referring to Figure 10, S470 includes:
[0088] S471: When the cross-chip interface between the first and second dies is in a non-congested state, the first cache forwards the access request to the second cache.
[0089] S472: When the cross-chip interface between the first and second dies is in a congested state, the first cache forwards the access request to memory.
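Steps S420 to S472 for the two-die case can be summarized in a short sketch. It is illustrative only: the `read_target` function is hypothetical, Python dictionaries stand in for the caches and the memory, and it assumes that memory holds a valid copy of any data cached in the second cache.

```python
def read_target(address, first_cache, second_cache, memory, interface_congested):
    """Return (source, data) following S420 to S472 for a two-die chip."""
    # S420/S430: a hit in the first cache ends the request.
    if address in first_cache:
        return ("first cache", first_cache[address])
    # S440/S450: a miss in both caches goes to memory.
    if address not in second_cache:
        return ("memory", memory[address])
    # S460/S470: the second cache hits, so check the cross-chip interface.
    if not interface_congested:
        return ("second cache", second_cache[address])  # S471
    return ("memory", memory[address])  # S472


# Made-up contents; memory is assumed to hold a copy of every cached line.
memory = {0x10: "a", 0x20: "b", 0x30: "c"}
first_cache = {0x10: "a"}
second_cache = {0x20: "b"}

hit_local = read_target(0x10, first_cache, second_cache, memory, False)
hit_remote = read_target(0x20, first_cache, second_cache, memory, False)
hit_remote_busy = read_target(0x20, first_cache, second_cache, memory, True)
miss_both = read_target(0x30, first_cache, second_cache, memory, False)
```

The same address 0x20 is served from the second cache when the interface is free and from memory when it is congested, which is the mode switch the method describes.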
[0090] The above example uses a chip with two dies to illustrate the data access method provided in the embodiments of this application. In some other implementations, the chip also includes at least one third die. The first cache can access the second cache of the second die through the cross-die interface between the first die and the second die, and can also access the second cache of the second die through a detour path routed through at least one third die.
[0091] Therefore, if the cross-chip interface between the first and second dies is in a congested state, and the detour path through which the first die accesses the second cache via at least one third die is in a non-congested state, then the first cache sends the access request to the second cache via the detour path through the at least one third die.
[0092] For example, referring to Figures 6 and 11, in this case, S470 may include:
[0093] S471: When the cross-chip interface between the first and second dies is in a non-congested state, the first cache forwards the access request to the second cache.
[0094] S473: If the cross-chip interface between the first die and the second die is in a congested state, and all detour paths for the first die to access the second cache through at least one third die are in a congested state, the first cache forwards the access request to memory.
[0095] S474: If the cross-chip interface between the first die and the second die is in a congested state, and at least one detour path through which the first die accesses the second cache via at least one third die is in a non-congested state, the first cache forwards the access request to the second cache via the detour path.
[0096] If the cross-chip interface between the first and second dies is congested, the time required for the first die to read data directly from the second die may be long. However, since reading data from a cache is faster than reading data from memory, the first cache can also detour through at least one third die to read data from the second cache. For example, if the cross-chip interface between the first and second dies is congested, the first cache is also used to determine whether all detour paths for the first die to access the second cache through at least one third die are congested. If all detour paths are congested, the first cache forwards the access request to memory to read the target data from memory; if there is a non-congested detour path, the first cache sends the access request to the second cache through the detour path to read the target data from the second cache.
[0097] Furthermore, if the cross-chip interface between the first die and the second die is in a congested state, and at least one detour path through which the first cache accesses the second cache via at least one of the third dies is in a non-congested state, the first cache forwards the access request to the second cache through the one of the at least one detour paths that requires the shortest time.
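The branching among S471, S473, and S474, together with the shortest-time selection of paragraph [0097], can be sketched as below. The `forward_s470` function, the path labels, and the timing values are hypothetical and serve only to illustrate the decision order.

```python
from typing import Dict, Tuple


def forward_s470(direct_congested: bool,
                 detour_congested: Dict[str, bool],
                 detour_time_ns: Dict[str, float]) -> Tuple[str, str]:
    """Return (step, destination) for S471, S473 or S474."""
    if not direct_congested:
        return ("S471", "second cache via direct interface")
    open_paths = [p for p, busy in detour_congested.items() if not busy]
    if not open_paths:
        return ("S473", "memory")
    # Among non-congested detour paths, take the one with the shortest
    # recorded time.
    best = min(open_paths, key=lambda p: detour_time_ns[p])
    return ("S474", "second cache via " + best)


times = {"detour_path_1": 180.0, "detour_path_2": 150.0}  # made-up values
case_direct = forward_s470(False, {"detour_path_1": False}, times)
case_blocked = forward_s470(True, {"detour_path_1": True, "detour_path_2": True}, times)
case_detour = forward_s470(True, {"detour_path_1": False, "detour_path_2": False}, times)
```

A further comparison against the memory read time, as described in paragraph [0068], could be layered on top of the S474 branch.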
[0098] This application also provides an electronic device, which includes a circuit board and a chip as provided in this application embodiment, the chip being mounted on the circuit board. For example, the chip may be a processor such as a central processing unit (CPU), graphics processing unit (GPU), tensor processing unit (TPU), or neural network processing unit (NPU) of the electronic device. The electronic device may also include other components, which are not limited in this application embodiment.
[0099] This application also provides a computer-readable storage medium storing a computer program or instructions, which, when executed by the chip of an electronic device, performs the steps described in the method embodiments above.
[0100] This application also provides a computer program product, which includes a computer program or instructions stored in a readable storage medium. When the chip of an electronic device executes the computer program or instructions, the steps in the above-described method embodiments are performed.
[0101] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.
[0102] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.
[0103] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, or may be electrical, mechanical, or other forms of connection.
[0104] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of the embodiments of the present invention, depending on actual needs.
[0105] Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0106] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0107] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present invention, and these modifications or substitutions should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A chip, characterized in that the chip includes memory and multiple dies, wherein the multiple dies share the memory and adjacent dies communicate with each other via a cross-die interface; the multiple dies include at least a first die and a second die that are adjacent, wherein the first die includes a first processor core and a first cache, and the second die includes a second cache; the first cache is used to: receive an access request sent by the first processor core for reading target data; and, if the access request hits the second cache but misses the first cache, selectively forward the access request to the second cache or the memory based on the status of the cross-die interface between the multiple dies.
2. The chip according to claim 1, characterized in that the first cache is specifically used to: based on the status of the cross-die interface between the multiple dies, forward the access request to whichever of the second cache or the memory requires the shorter time to read the target data.
3. The chip according to claim 1 or 2, characterized in that the first cache is used to: forward the access request to the second cache to read the target data from the second cache when the cross-die interface between the first die and the second die is in a non-congested state.
4. The chip according to claim 1 or 2, characterized in that the first cache is further configured to: forward the access request to the memory to read the target data from the memory when the cross-die interface between the first die and the second die is in a congested state.
5. The chip according to claim 3, characterized in that the multiple dies also include at least one third die, and the first cache is further configured to: when the cross-die interface between the first die and the second die is in a congested state and at least one detour path through which the first cache accesses the second cache via at least one of the third dies is in a non-congested state, forward the access request to the second cache through the detour path.
6. The chip according to claim 5, characterized in that the first cache is further configured to: forward the access request to the memory when the cross-die interface between the first die and the second die is in a congested state, and all detour paths through which the first cache accesses the second cache via at least one of the third dies are in a congested state.
7. The chip according to claim 5, characterized in that the first cache is further configured to: when the cross-die interface between the first die and the second die is in a congested state, and at least one of the detour paths through which the first cache accesses the second cache via at least one of the third dies is in a non-congested state, forward the access request to the second cache via the one of the at least one detour paths that requires the shortest time.
8. The chip according to any one of claims 1 to 7, characterized in that, if the access request hits the first cache, the first processor core reads the target data from the first cache.
9. The chip according to any one of claims 1 to 8, characterized in that the first cache includes: a monitoring module, used to obtain the congestion status of the cross-die interface between the multiple dies; and a control module, used to selectively forward the access request based on the congestion status of the cross-die interface between the multiple dies.
10. A data access method, characterized in that the method is applied to a chip, the chip including memory and multiple dies, the multiple dies sharing the memory, and adjacent dies communicating via a cross-die interface, the multiple dies including at least a first die and a second die that are adjacent, the first die including a first processor core and a first cache, the second die including a second cache, and the method including: the first cache receives an access request sent by the first processor core for reading target data; and, if the access request hits the second cache but misses the first cache, the first cache selectively forwards the access request to the second cache or the memory based on the status of the cross-die interface between the multiple dies.
11. The data access method according to claim 10, characterized in that the first cache selectively forwarding the access request to the second cache or the memory based on the status of the cross-die interface between the multiple dies includes: the first cache forwards the access request, based on the status of the cross-die interface between the multiple dies, to whichever of the second cache or the memory requires the shorter time to read the target data.
12. The data access method according to claim 10, characterized in that, before the first cache selectively forwards the access request to the second cache or the memory based on the status of the cross-die interface between the multiple dies, the method further includes: obtaining the status of the cross-die interface between the multiple dies.
13. The data access method according to any one of claims 10 to 12, characterized in that the first cache selectively forwarding the access request to the second cache or the memory based on the status of the cross-die interface between the multiple dies includes: when the cross-die interface between the first die and the second die is in a non-congested state, the first cache forwards the access request to the second cache.
14. The data access method according to any one of claims 10 to 12, characterized in that the first cache selectively forwarding the access request to the second cache or the memory based on the status of the cross-die interface between the multiple dies further includes: if the cross-die interface between the first die and the second die is in a congested state, the first cache forwards the access request to the memory.
15. The data access method according to any one of claims 10 to 12, characterized in that the multiple dies also include at least one third die, and the first cache selectively forwarding the access request to the second cache or the memory based on the status of the cross-die interface between the multiple dies further includes: if the cross-die interface between the first die and the second die is in a congested state, and at least one detour path through which the first cache accesses the second cache via at least one of the third dies is in a non-congested state, the first cache forwards the access request to the second cache via the detour path; and, if the cross-die interface between the first die and the second die is in a congested state, and all detour paths for the first cache to access the second cache through at least one of the third dies are in a congested state, the first cache forwards the access request to the memory.
16. The data access method according to claim 15, characterized in that, when the cross-die interface between the first die and the second die is in a congested state and at least one detour path through which the first cache accesses the second cache via at least one of the third dies is in a non-congested state, the first cache forwarding the access request to the second cache via the detour path includes: the first cache forwards the access request to the second cache through the one of the at least one detour paths that requires the shortest time.
17. An electronic device, characterized in that, It includes a circuit board and a chip as described in any one of claims 1 to 9, wherein the chip is mounted on the circuit board.
18. A computer-readable storage medium, characterized in that, It stores computer-readable instructions, which, when executed by the chip, perform the method as described in any one of claims 10 to 16.