Chip data reading acceleration method and system

By partitioning and preloading the multi-die heterogeneous packaging architecture, and designing an in-memory computing architecture, combined with bandwidth scheduling and high-density 3D stacking technology, the problem of cache coherency protocol interaction overhead during cross-die data reading is solved, improving the chip's data reading efficiency and storage bandwidth utilization, and achieving stable acceleration in all scenarios.

CN122240528A · Pending · Publication Date: 2026-06-19 · LVYIN TECH (HANGZHOU) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: LVYIN TECH (HANGZHOU) CO LTD
Filing Date: 2026-03-24
Publication Date: 2026-06-19

IPC: G06F12/0815; G06F12/0817; G06F12/0864; G06F15/78

AI Technical Summary

⚠Technical Problem

In multi-die heterogeneous packaging architectures, the overhead of cache coherency protocol interaction during cross-die data reading increases sharply with the increase of read rate, leading to the memory wall problem, which severely restricts the release of chip performance, especially in AI chips and high-performance computing chips where the growth rate of computing power far exceeds the growth rate of memory read bandwidth.

⚗Method used

By dividing the multi-die heterogeneous packaging architecture into domains, a cache coherence protocol is used to maintain data synchronization within the domain, while a lightweight protocol is used between domains. A prediction engine is deployed for data preloading. Computing units are embedded in the storage array to build an in-memory computing architecture, reducing cross-die data transmission. A bandwidth scheduling mechanism and high-density 3D stacking technology are introduced to dynamically adjust the coherence protocol parameters and storage bandwidth allocation strategy, and to monitor the chip status in real time to optimize data transmission.

🎯Benefits of technology

Significantly reduce cross-die consistency protocol overhead, decrease the number of cross-die data interactions and access latency, improve storage bandwidth utilization, achieve stable acceleration across all scenarios, and ensure a dynamic balance between performance and stability.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses a chip data read acceleration method and system. The chip data read acceleration method includes the following steps: domain partitioning of a multi-die heterogeneous packaging architecture; deployment of a prediction engine in the inter-die interconnect channel; embedding computing units into a storage array to construct a storage-computing integrated architecture, alleviating the bandwidth mismatch between computing power and storage; introducing a bandwidth scheduling mechanism for the computing units to precisely match the working rhythm of computing power and storage; real-time monitoring of chip operating status, dynamically adjusting consistency protocol parameters, storage-computing unit activation ratio, and storage bandwidth allocation strategy, and performing lightweight compression on cross-domain transmitted data to reduce bandwidth consumption; and through a combination of strategies including hierarchical consistency optimization, deep storage-computing integration, and dynamic collaborative scheduling, simultaneously solving the bus consistency bottleneck and computing power-storage bandwidth mismatch problem in multi-die packaging, thereby improving chip data read efficiency.

Description

Technical Field

[0001] This invention relates to the field of chip data read acceleration technology, specifically to a chip data read acceleration method and system.

Background Technology

[0002] Chip data read acceleration technology mainly revolves around three major directions: improving signal transmission rate, optimizing data temporary storage logic, and widening transmission channels. Common methods include increasing the internal clock frequency of the chip to shorten the data access cycle, upgrading high-speed interface protocols to improve external data interaction bandwidth, expanding multi-channel parallel read architecture to increase data throughput per unit time, and optimizing cache hierarchy structure and replacement algorithm to improve the local hit rate of hot data and reduce frequent access to back-end storage.

[0003] For example, patent CN112711383A discloses a method for accelerating non-volatile memory reads in power chips. This method uses a row-length adaptive cache to accelerate instruction reading from Flash and a step-prefetch to accelerate data reading from Flash. The row-length adaptive cache method for accelerating instruction reading from Flash includes: responding to a fetch request initiated by the processor; filling cache lines and reconstructing the cache line length based on cache hit / miss detection; and then initiating a read instruction request to Flash. The step-prefetch method for accelerating data reading from Flash includes: responding to a fetch request initiated by the processor; and then initiating a data read request to Flash based on buffer register hit / miss detection and the validity of the step-prefetch enable bit.

[0004] For example, patent CN116400868A discloses a method for accelerating the startup of a storage chip, a main control device, and a solid-state drive. The accelerated startup method includes: upon receiving a startup command, reading the lifespan information of the storage chip; when the lifespan information indicates that the number of erase/write cycles of the storage chip is less than the erase/write cycle threshold, selecting several data entries to be read from the storage chip; performing read retry on the selected data entries using different retry parameters, and counting the number of data entries correctly read under each retry parameter; using the retry parameter with the highest number of correctly read data entries as the target retry parameter; and starting the default read operation with the target retry parameter. Because the default read operation is not performed with the default parameters when the number of erase/write cycles is low, the startup of the storage chip is accelerated and startup time is reduced.

[0005] For example, patent CN109710187B discloses a method, apparatus, computer device, and storage medium for accelerating read commands of an NVMe SSD controller chip. The method includes: caching read data command information in an acceleration module and notifying the NFC; the NFC reading the corresponding data from flash memory according to the command information and sending it to the acceleration module; and the acceleration module sending the processed valid data to the host. That invention avoids the limitation of DDR read/write speed on overall performance and reduces CPU usage frequency, thereby improving the read performance of the SSD controller chip.

However, when chip data is read in existing multi-die heterogeneous packaging architectures such as 3DIC and Chiplet, the cache coherence protocol interaction overhead during cross-die data reads increases sharply with the read rate, which offsets the acceleration effect. In AI chips and high-performance computing chips, computing power grows far faster than storage read bandwidth, forming a memory wall that severely restricts chip performance.

[0006] To address the aforementioned issues, there is an urgent need for innovative designs based on existing methods for accelerating chip data reading.

Summary of the Invention

[0007] The purpose of this invention is to provide a chip data read acceleration method that solves the problems described in the background art: in multi-die heterogeneous packaging architectures such as 3DIC and Chiplet, the cache coherence protocol interaction overhead during cross-die data reads increases sharply with the read rate and offsets the acceleration effect; and in AI chips and high-performance computing chips, computing power grows far faster than storage read bandwidth, forming a memory wall that severely restricts chip performance.

[0008] To achieve the above objective, the present invention provides the following technical solution: a chip data read acceleration method comprising the following steps: The multi-die heterogeneous packaging architecture is partitioned into domains, with computing, storage, and other functional modules divided into independent local consistency domains. Within a domain, a cache consistency protocol maintains data synchronization; between domains, a lightweight protocol triggers consistency synchronization only when cross-domain write operations are performed.

[0009] A predictive engine is deployed on the inter-die interconnection channel. By analyzing historical access patterns, it identifies high-frequency cross-die read requests and proactively preloads the target data into the local cache of the requesting die when the bus is idle, so that subsequent cross-die read requests can be directly obtained from the local cache.

[0010] By embedding computing units into the storage array, an in-memory computing architecture is built, allowing data to be processed directly within the storage unit. At the same time, consistency-sensitive computing tasks are offloaded to the storage die for execution, and only the final result is output to the computing die, reducing the amount of data transferred across dies and alleviating the bandwidth mismatch between computing power and storage. A bandwidth scheduling mechanism is introduced for the computing unit to dynamically adjust the storage read mode according to the memory access characteristics of the current task, and to dynamically manage idle computing resources to release bus bandwidth. At the same time, an on-chip storage pool is built through high-density 3D stacking technology to ensure that the working rhythm of computing power and storage is precisely matched.

[0011] The chip operating status is monitored in real time, and the consistency protocol parameters, in-memory computing unit activation ratio, and storage bandwidth allocation strategy are dynamically adjusted. Lightweight compression is applied to cross-domain data transmission to reduce bandwidth consumption, and a layered verification mechanism is deployed to ensure data reliability without letting verification delays drag down read efficiency, ensuring stable acceleration in all scenarios.

[0012] Preferably, the cache consistency protocol introduces a write-back delayed buffering mechanism. When the core modifies cached data, it does not immediately trigger a global notification but temporarily stores it in a buffer unit. It then performs batch synchronization when the core is idle or the buffer is full, reducing the protocol interaction overhead caused by high-frequency write operations. The lightweight protocol adopts a local state marking and delayed synchronization mechanism. When a cross-domain read operation is initiated, the data is marked as having been accessed externally in the local cache directory of the target die, without triggering a global invalidation broadcast. When the target data is modified and a cross-domain write operation is initiated, a synchronization notification is sent to the die that has accessed the data, ensuring data consistency while minimizing the protocol overhead of cross-die read operations.
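
As a non-limiting illustration, the write-back delayed buffering behavior described above can be modeled in software. The following is a minimal sketch, not the patented hardware: the class name `WriteBackBuffer`, the `capacity` value, and the message counter are hypothetical names introduced only for this example. It shows repeated writes being held in a buffer and synchronized as one batch when the buffer fills or the core becomes idle.

```python
from dataclasses import dataclass, field


@dataclass
class WriteBackBuffer:
    """Toy model: holds modified cache lines and synchronizes them in batches."""
    capacity: int = 4                             # hypothetical buffer size
    pending: dict = field(default_factory=dict)   # address -> latest value
    sync_messages: int = 0                        # batched protocol messages sent

    def write(self, address: int, value: int) -> None:
        # Buffer the modification instead of notifying other cores immediately.
        self.pending[address] = value
        if len(self.pending) >= self.capacity:
            self.flush()                          # trigger 1: buffer is full

    def on_core_idle(self) -> None:
        self.flush()                              # trigger 2: core is idle

    def flush(self) -> None:
        if self.pending:
            self.sync_messages += 1               # one message covers every entry
            self.pending.clear()


buf = WriteBackBuffer(capacity=4)
for i in range(10):
    buf.write(address=i % 3, value=i)             # ten writes to three hot lines
buf.on_core_idle()
print(buf.sync_messages)                          # 1 batched message for 10 writes
```

In this toy run the three hot addresses never fill the four-entry buffer, so all ten writes are covered by the single flush issued when the core goes idle.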

[0013] Preferably, the prediction engine identifies high-frequency reading needs through multi-dimensional access feature analysis. The access features include data address continuity, access frequency, access interval, and the type of request initiation die. In addition to bus idleness, the preloading triggering conditions also include the target data access frequency reaching a preset threshold, the interval between two adjacent accesses being less than a set value, and the use of a priority avoidance strategy during the preloading process to avoid occupying bus resources for high-priority data transmission and avoid bandwidth conflicts between preloading and normal reading.
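
By way of example only, the preloading trigger conditions listed above can be expressed as a single predicate. This sketch assumes hypothetical names (`AccessStats`, `should_preload`) and hypothetical threshold values (`min_count`, `max_interval_ns`); the source does not specify concrete thresholds.

```python
from dataclasses import dataclass


@dataclass
class AccessStats:
    access_count: int         # accesses to the target data in the observation window
    last_interval_ns: float   # gap between the two most recent accesses


def should_preload(stats: AccessStats, bus_idle: bool,
                   high_priority_pending: bool,
                   min_count: int = 8,
                   max_interval_ns: float = 500.0) -> bool:
    """Return True only when every trigger condition holds at once."""
    if not bus_idle:
        return False
    if high_priority_pending:
        # Priority avoidance: never compete with high-priority transfers.
        return False
    return (stats.access_count >= min_count
            and stats.last_interval_ns < max_interval_ns)


hot = AccessStats(access_count=10, last_interval_ns=100.0)
print(should_preload(hot, bus_idle=True, high_priority_pending=False))
```

Checking bus idleness and priority first means a preload is never issued merely because the data is hot; the frequency and interval thresholds only matter once the bus is free.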

[0014] Preferably, the computing units embedded in the in-memory computing architecture include lightweight tensor kernels, vector operation units, and arithmetic logic units, which are suitable for tasks that are sensitive to consistency and have moderate computational density, such as AI inference and data preprocessing. When a task is to be offloaded, the task feature screening module determines whether it is suitable for offloading. The screening dimensions include the task's cross-die data dependency, computational complexity, and data reuse rate. Only tasks with high cross-die data dependency and low computational complexity are offloaded to the storage die.
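
The screening rule in the preceding paragraph can be illustrated with a short sketch. The names `TaskFeatures` and `should_offload`, the 0-to-1 normalization, and both cut-off values are assumptions made for this example. The text names data reuse rate as a screening dimension but does not state how it enters the decision, so the field is carried here without being used in the rule.

```python
from dataclasses import dataclass


@dataclass
class TaskFeatures:
    cross_die_dependency: float   # share of operands on another die, 0..1
    compute_complexity: float     # normalized computational complexity, 0..1
    data_reuse_rate: float        # recorded only; its use is not specified


def should_offload(task: TaskFeatures,
                   min_dependency: float = 0.6,
                   max_complexity: float = 0.4) -> bool:
    """Offload only tasks with high cross-die dependency and low complexity."""
    return (task.cross_die_dependency >= min_dependency
            and task.compute_complexity <= max_complexity)


preprocess = TaskFeatures(cross_die_dependency=0.9, compute_complexity=0.2,
                          data_reuse_rate=0.1)
training = TaskFeatures(cross_die_dependency=0.9, compute_complexity=0.95,
                        data_reuse_rate=0.8)
print(should_offload(preprocess), should_offload(training))
```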

[0015] Preferably, the storage die has a built-in task scheduling submodule and a data temporary storage unit, as follows: the task scheduling submodule is used to allocate computing resources within the storage die, order the offloaded tasks by priority, and execute first the tasks that are most closely related to the current storage read operation; the data temporary storage unit is used to cache intermediate computing results, avoid the transmission of intermediate results across dies, reduce the amount of cross-die data interaction, and reduce the frequency of consistency protocol triggering.

[0016] Preferably, the bandwidth scheduling mechanism includes a memory access characteristic identification submodule, a storage configuration adjustment submodule, and a computing resource management submodule, specifically as follows: the memory access characteristic identification submodule analyzes the ratio of continuous and random access, the ratio of read and write operations, and the data block size of tasks in real time; the storage configuration adjustment submodule dynamically adjusts the prefetch granularity, cache replacement strategy, and storage channel parallelism based on the memory access characteristic identification; the computing resource management submodule releases bus bandwidth by shutting down idle computing cores and reducing the clock frequency of idle cores, thereby achieving dynamic adaptation between computing power and storage bandwidth.
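
As an illustrative sketch of how the memory access characteristic identification submodule might feed the storage configuration adjustment submodule, the following code measures the share of sequential accesses in an address trace and maps it to a prefetch granularity. The function names, the 64-byte block size, and the granularity steps are assumptions for this example, not values taken from the source.

```python
def sequential_ratio(addresses: list, block_size: int = 64) -> float:
    """Fraction of accesses that target the block right after the previous one."""
    if len(addresses) < 2:
        return 0.0
    hits = sum(1 for prev, cur in zip(addresses, addresses[1:])
               if cur - prev == block_size)
    return hits / (len(addresses) - 1)


def choose_prefetch_blocks(ratio: float) -> int:
    """Map the sequential share to a prefetch granularity (hypothetical steps)."""
    if ratio >= 0.75:
        return 8      # mostly streaming: prefetch aggressively
    if ratio >= 0.25:
        return 2      # mixed pattern: prefetch conservatively
    return 0          # mostly random: prefetching would waste bandwidth


streaming = [0, 64, 128, 192]
scattered = [0, 4096, 64, 9000]
print(choose_prefetch_blocks(sequential_ratio(streaming)))
print(choose_prefetch_blocks(sequential_ratio(scattered)))
```

The same measured ratio could also drive the cache replacement strategy and channel parallelism; only prefetch granularity is shown to keep the sketch short.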

[0017] Preferably, the high-density 3D stacked on-chip memory pool adopts a three-layer stacked structure of storage layer, interconnect layer, and control layer. The storage layer consists of multiple layers of HBM and 3DSRAM. The interconnect layer realizes high-speed interconnection between each storage layer and the computing die through through-silicon vias. The control layer deploys a distributed storage controller. The distributed storage controller adopts partition management and load balancing strategies, monitors the bandwidth usage of each storage partition in real time, and dynamically allocates read requests to idle partitions to avoid bandwidth congestion in a single partition and maximize the utilization of the total storage bandwidth.
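
The load balancing behavior of the distributed storage controller can be sketched as follows. This is a simplified model under a strong assumption that the source does not state: every read can be served by any partition (for example because the data is replicated or interleaved). The class name `PartitionBalancer` and the use of outstanding-read counts as the load metric are hypothetical.

```python
class PartitionBalancer:
    """Toy model: route each read to the partition with the fewest outstanding reads."""

    def __init__(self, partitions: int) -> None:
        self.outstanding = {p: 0 for p in range(partitions)}

    def dispatch(self) -> int:
        # Pick the least-loaded partition; ties go to the lowest index.
        target = min(self.outstanding, key=lambda p: (self.outstanding[p], p))
        self.outstanding[target] += 1
        return target

    def complete(self, partition: int) -> None:
        # A finished read frees capacity on its partition.
        self.outstanding[partition] -= 1


balancer = PartitionBalancer(partitions=3)
first_four = [balancer.dispatch() for _ in range(4)]
balancer.complete(1)
next_target = balancer.dispatch()
print(first_four, next_target)
```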

[0018] Preferably, the dynamic adjustment of consistency protocol parameters, in-memory computing unit activation ratio, and storage bandwidth allocation strategy is implemented based on the threshold range of chip operating status parameters. The operating status parameters include chip junction temperature, bus bandwidth utilization, computing unit load rate, and data error rate. When the junction temperature exceeds the preset threshold, the in-memory computing unit activation ratio and consistency protocol synchronization frequency are reduced to prioritize stability. When the bus bandwidth utilization is lower than the set value, the preloading intensity and storage channel parallelism are increased to fully release bandwidth potential and achieve a balance between performance and stability across all scenarios.
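
The threshold-based adjustment in the preceding paragraph can be illustrated with a small rule function. All names (`Knobs`, `adjust`), thresholds (95 °C, 50% utilization), and step sizes below are assumptions chosen for the example; the source gives no numeric values. The sketch handles only the two cases the text spells out and gives the thermal case precedence, which is also an assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Knobs:
    imc_activation_ratio: float   # share of in-memory computing units enabled
    sync_frequency_hz: float      # consistency protocol synchronization frequency
    preload_intensity: float      # 0..1, how aggressively to preload
    channel_parallelism: int      # active storage channels


def adjust(knobs: Knobs, junction_temp_c: float, bus_utilization: float,
           max_temp_c: float = 95.0, low_util: float = 0.5,
           max_channels: int = 8) -> Knobs:
    if junction_temp_c > max_temp_c:
        # Too hot: favor stability by halving activation and sync frequency.
        return replace(knobs,
                       imc_activation_ratio=knobs.imc_activation_ratio * 0.5,
                       sync_frequency_hz=knobs.sync_frequency_hz * 0.5)
    if bus_utilization < low_util:
        # Spare bandwidth: preload more and open one more channel.
        return replace(knobs,
                       preload_intensity=min(1.0, knobs.preload_intensity + 0.25),
                       channel_parallelism=min(max_channels,
                                               knobs.channel_parallelism + 1))
    return knobs


base = Knobs(imc_activation_ratio=1.0, sync_frequency_hz=1000.0,
             preload_intensity=0.5, channel_parallelism=4)
hot = adjust(base, junction_temp_c=100.0, bus_utilization=0.9)
idle = adjust(base, junction_temp_c=60.0, bus_utilization=0.2)
print(hot)
print(idle)
```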

[0019] Preferably, the layered verification mechanism adopts differentiated verification strategies for different data types, as follows: the calculation results within the in-memory computing unit are verified using local ECC to ensure calculation accuracy; compressed data transmitted across dies is verified using CRC to quickly detect transmission errors; the raw data in the on-chip storage pool is verified using distributed parity checking to balance verification efficiency and fault tolerance; the verification operation is executed in parallel by a hardware verification engine, synchronized with the data reading and calculation process, to ensure that verification latency does not affect the overall read acceleration effect.
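
Two of the three verification layers named above can be demonstrated in software with standard techniques. The sketch below uses CRC-32 from Python's `zlib` module for cross-die frames and a plain XOR parity block for the storage pool; the specific CRC polynomial, the parity layout, and all function names are assumptions, and the local ECC layer is omitted because the source does not specify a code. A hardware engine would run these checks in parallel with the data path rather than sequentially as here.

```python
import zlib
from functools import reduce


def attach_crc(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 tag to a cross-die frame."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_crc(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the tag."""
    payload, tag = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == tag


def xor_parity(blocks: list) -> bytes:
    """Byte-wise XOR of equal-length blocks, stored as the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving: list, parity: bytes) -> bytes:
    """Recover one lost block from the surviving blocks plus the parity block."""
    return xor_parity(surviving + [parity])


frame = attach_crc(b"compressed cross-die data")
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]     # flip a single bit
blocks = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_parity(blocks)
recovered = rebuild_missing([blocks[0], blocks[2]], parity)
print(check_crc(frame), check_crc(corrupted), recovered == blocks[1])
```

The split mirrors the trade-off in the text: the CRC only detects transmission errors, which is cheap and fast, while the parity block adds the ability to reconstruct one lost block.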

[0020] A chip data read acceleration system is provided for use in the aforementioned chip data read acceleration method. The chip data read acceleration system includes a consistency domain management layer, a cross-die data optimization layer, a memory-computing layer, a bandwidth scheduling layer, and an adaptive control layer, as detailed below: The consistency domain management layer includes a consistency domain partitioning module and a consistency protocol processing unit.

[0021] The consistency domain partitioning module is used to divide the multi-die heterogeneous packaging architecture into functional domains, classifying the computing modules and storage modules into independent local consistency domains. The consistency protocol processing unit executes the cache consistency protocol or the lightweight cross-domain protocol, depending on the scenario.

[0022] The cross-die data optimization layer includes a cross-die access prediction engine and a data preloading unit.

[0023] The cross-die access prediction engine analyzes historical access patterns through multi-dimensional access features, identifies high-frequency cross-die read requests, triggers preloading logic, and determines preloading priority. When the bus is idle and the preloading threshold condition is met, the data preloading unit actively copies the target data from the source die to the local cache of the die that initiated the request.

[0024] The in-memory computing layer includes an in-memory computing macro unit, a task filtering and scheduling module, and a data temporary storage unit.

[0025] The in-memory computing macro unit integrates a lightweight tensor kernel, vector operation unit, and arithmetic logic unit, embedded inside the storage array. It allows AI inference, data preprocessing, and other operations to be completed on the data directly within the storage unit. The task filtering and scheduling module selects and sorts tasks suitable for offloading based on cross-die data dependency, computational complexity, and data reuse rate, prioritizing tasks that are highly relevant to the current storage read operation. The data temporary storage unit is used to cache intermediate results generated by operations on the storage die.

[0026] The bandwidth scheduling layer includes a bandwidth scheduling mechanism unit and a 3D stacked on-chip memory pool.

[0027] The bandwidth scheduling mechanism unit integrates three sub-modules: memory access feature identification, storage configuration adjustment, and computing resource management. It identifies task memory access features and dynamically adjusts the storage prefetch granularity, cache replacement strategy, and channel parallelism. It releases bus bandwidth by shutting down idle computing cores and reducing frequency, thereby achieving computing power and storage matching. The 3D stacked on-chip storage pool adopts a three-layer structure of storage layer, interconnect layer, and control layer. The storage layer consists of multiple layers of HBM and 3DSRAM, the interconnect layer achieves high-speed interconnection through through-silicon vias, and the control layer deploys a distributed storage controller.

[0028] The adaptive control layer includes a chip status monitoring unit, a dynamic parameter adjustment unit, a data compression unit, and a hierarchical verification unit.

[0029] The chip status monitoring unit collects chip operating status parameters in real time. The dynamic parameter adjustment unit dynamically adjusts the consistency protocol synchronization frequency, the activation ratio of the in-memory computing units, and the storage bandwidth allocation strategy based on the status parameter threshold range. The data compression unit applies a lightweight compression algorithm to the data transmitted across dies to reduce transmission bandwidth occupation and, in conjunction with the bandwidth scheduling mechanism, further alleviates bandwidth pressure. The hierarchical verification unit performs differentiated verification for different data types.

[0030] Compared with the prior art, the beneficial effects of the present invention are as follows: The consistency protocol overhead of the multi-die heterogeneous packaging architecture is significantly reduced. With the intra-domain write-back delayed buffering mechanism, the synchronization of high-frequency write operations is executed in batches, reducing the frequency of protocol interactions. At the same time, the lightweight inter-domain mechanism of local state marking and delayed synchronization synchronizes in a targeted manner only during cross-domain write operations, completely avoiding the global broadcast overhead of cross-domain read operations, reducing the cross-die consistency protocol interaction overhead, and improving the efficiency of data reading under a multi-die architecture.

[0031] To reduce the number of cross-die data interactions and access latency, the prediction engine accurately identifies high-frequency read demands through multi-dimensional access feature analysis. Combined with a priority avoidance strategy, preloading is only performed when the bus is idle and threshold conditions are met, effectively improving the local cache hit rate. This allows subsequent cross-die read requests to obtain data directly from the local cache without initiating cross-domain access, reducing cross-die data access latency and avoiding bandwidth conflicts between preloading and normal business operations.

[0032] To fundamentally alleviate the mismatch between computing power and storage bandwidth, an integrated storage-computing architecture allows data to be processed directly within the storage unit. Combined with task feature filtering and intermediate result temporary storage mechanisms, only tasks with high cross-die data dependency and low computational complexity are offloaded to the storage die. Intermediate results do not need to be transferred across dies; only the final result is output. This fundamentally reduces the amount of cross-die data transfer, improves storage bandwidth utilization, and alleviates the bandwidth mismatch between computing power and storage.

[0033] To improve the dynamic adaptability of storage bandwidth utilization and computing power, a three-module collaborative bandwidth scheduling mechanism is used to analyze task memory access characteristics in real time and dynamically adjust prefetch granularity, caching strategy, and channel parallelism. At the same time, idle computing cores are shut down or reduced in frequency to release bandwidth. Combined with a three-layer stacked and distributed load-balanced on-chip storage pool, the total storage bandwidth is increased. Furthermore, partition management avoids congestion in a single partition, achieving precise matching between computing power and storage work rhythm, and improving the effective bandwidth utilization of storage.

[0034] To achieve a dynamic balance between performance and stability across all scenarios, a dynamic adjustment strategy based on the threshold range of chip operating status parameters is adopted. When the junction temperature is too high, the activation ratio of the in-memory computing units and the consistency synchronization frequency are reduced to prioritize stability. When the bus bandwidth utilization is insufficient, the preloading intensity and the parallelism of the storage channels are increased to fully unleash the performance potential, so that the chip can maintain more than 80% of its peak read performance in different scenarios such as consumer, industrial, and high-performance computing.

[0035] To ensure data reliability without sacrificing read speed, the layered verification mechanism adopts differentiated strategies for different data types, and all verification operations are executed in parallel by the hardware engine, completed synchronously with the data reading and processing process. The verification latency does not exceed 5% of the data transmission latency, reducing the data error rate and preventing the verification process from dragging down the read speed, thus achieving a balance between efficiency and reliability.

Attached Figure Description

[0036] Figure 1 is a flowchart of the present invention.

[0037] Figure 2 is a system module diagram of the present invention.

Detailed Implementation

[0038] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0039] This application provides a method for accelerating chip data reads. The core of this method lies in domain partitioning of a multi-die heterogeneous packaging architecture. Computation, storage, and other functional modules are divided into independent local consistency domains. Within each domain, a cache consistency protocol maintains data synchronization, while between domains, a lightweight protocol is used, triggering consistency synchronization only during cross-domain write operations. A prediction engine is deployed on the inter-die interconnect channel to identify high-frequency cross-die read requests by analyzing historical access patterns. When the bus is idle, the target data is proactively preloaded into the local cache of the requesting die, allowing subsequent cross-die read requests to retrieve data directly from the local cache. Computing units are embedded into the storage array to construct a storage-computing integrated architecture, allowing data to be processed directly within the storage unit; consistency-sensitive computational tasks are offloaded to the storage die for execution, with only the final result output to the computing die, reducing cross-die data transfer and alleviating the bandwidth mismatch between computing power and storage. A bandwidth scheduling mechanism is introduced for the computing units, dynamically adjusting the storage read mode based on the memory access characteristics of the current task and dynamically managing idle computing resources to release bus bandwidth. At the same time, an on-chip storage pool is built through high-density 3D stacking technology, so that the working rhythm of computing power and storage is precisely matched. The chip's operating status is monitored in real time, and consistency protocol parameters, the activation ratio of storage-computing units, and storage bandwidth allocation strategies are dynamically adjusted. Lightweight compression is performed on cross-domain data transmission to reduce bandwidth consumption, and a layered verification mechanism is deployed to ensure data reliability while preventing verification delays from dragging down read efficiency, ensuring stable acceleration in all scenarios.

[0040] Example 1: To better understand the above technical solution, it is described in detail below with reference to the accompanying drawings and specific embodiments. As shown in Figure 1, in this embodiment of the present application, a chip data read acceleration method includes the following steps: S1. The multi-die heterogeneous packaging architecture is partitioned into domains, with computing, storage, and other functional modules divided into independent local consistency domains. Within a domain, a cache consistency protocol maintains data synchronization; between domains, a lightweight protocol triggers consistency synchronization only when cross-domain write operations are performed.

[0041] In this embodiment, the cache consistency protocol introduces a write-back delayed buffering mechanism. When the core modifies cached data, it does not immediately trigger a global notification but temporarily stores it in a buffer unit. It then performs batch synchronization when the core is idle or the buffer is full, reducing the protocol interaction overhead caused by high-frequency write operations. The lightweight protocol adopts a local state marking and delayed synchronization mechanism. When a cross-domain read operation is initiated, the data is marked as having been accessed externally in the local cache directory of the target die, without triggering a global invalidation broadcast. When the target data is modified and a cross-domain write operation is initiated, a synchronization notification is sent to the die that has accessed the data, ensuring data consistency while minimizing the protocol overhead of cross-die read operations.

[0042] It should be noted that the lightweight inter-domain protocol is built upon local state marking and directed delayed synchronization, breaking away from the broadcast-style interaction mode of traditional global consistency protocols. Its core lies in fine-grained state definition and synchronization scope control. First, an external access flag and an access node record table are added to the cache directory of each die. When a compute die initiates a cross-domain read request, the target storage die only sets the external access flag in the cache entry of the corresponding data block and writes the ID of the requesting compute die to the record table. No global broadcast is triggered at any point; only local state visibility is maintained. Second, the synchronization triggering logic is write-driven and uses directed notification: only when the target data in the storage die is modified and a cross-domain write operation is initiated is a synchronization notification sent, based on the access node record table, to the compute dies that have accessed the data, rather than being broadcast to all dies. The notification contains only the data modification identifier and a digest of the new data, without transmitting the complete data. The compute dies that receive the notification only need to mark the corresponding data in their local cache as invalid and reread it when it is accessed again, preventing unnecessary synchronization from consuming bus bandwidth. In addition, the protocol gives cross-domain reads higher priority than synchronization notifications. When bus bandwidth is tight, cross-domain read requests are transmitted first, and synchronization notifications are temporarily stored in a queue and processed sequentially when the bus is idle, balancing consistency and read efficiency.
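
The directed notification scheme described above can be modeled with a small directory structure. This is a behavioral sketch only: the class name `StorageDieDirectory` and the string die IDs are hypothetical, the external access flag is represented implicitly by a block having an entry in the record table, and clearing a block's record after notification (because notified dies invalidate their copies and re-register on the next read) is an assumption of this example.

```python
class StorageDieDirectory:
    """Toy model of the external access flag and access node record table."""

    def __init__(self) -> None:
        self.data = {}      # block ID -> current value
        self.readers = {}   # block ID -> set of compute die IDs that read it

    def cross_domain_read(self, block: str, die_id: str):
        # Record the reader locally; no broadcast is sent on a read.
        self.readers.setdefault(block, set()).add(die_id)
        return self.data.get(block)

    def cross_domain_write(self, block: str, value) -> list:
        # Update the data, then notify only the dies recorded for this block.
        self.data[block] = value
        return sorted(self.readers.pop(block, set()))


directory = StorageDieDirectory()
directory.data["b0"] = 1
directory.cross_domain_read("b0", "die2")
directory.cross_domain_read("b0", "die1")
directory.cross_domain_read("b1", "die3")       # different block, unaffected
notified = directory.cross_domain_write("b0", 2)
notified_again = directory.cross_domain_write("b0", 3)
print(notified, notified_again)
```

Only `die1` and `die2` are notified by the first write, and the second write notifies nobody, which is the saving over a broadcast to every die.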

[0043] In practical implementation, a layered conflict resolution strategy addresses consistency conflicts between cross-domain access and intra-domain operations, avoiding read blocking or data errors caused by such conflicts. When a core accesses locally cached data within its die domain while that data is undergoing cross-domain synchronization (i.e., the synchronization notification sent by the storage die has not yet been processed), a local priority mechanism is triggered. The core first computes on the existing cached data and marks the result as pending verification. After synchronization completes, the old and new data are compared. If they are identical, the result is used directly; if they differ, only the affected part of the computation is re-executed, which reduces the blocking of intra-domain computation by cross-domain synchronization. When multiple compute dies access the same storage data across domains and one of them initiates a cross-domain write operation, a write locking and sequential synchronization strategy is adopted. The storage die first locks the target data block to prevent write requests from other compute dies, completes the current write operation and its synchronization notification, releases the lock after synchronization is complete, and then processes requests from the other compute dies. At the same time, read requests are cached and reused: compute dies that have cached the data can provide copies directly to other compute dies without repeated reads from the storage die, further reducing the pressure of cross-domain interaction.

[0044] S2. Deploy a prediction engine on the inter-die interconnection channel. By analyzing historical access patterns, identify high-frequency cross-die read requests. When the bus is idle, proactively preload the target data into the local cache of the requesting die, so that subsequent cross-die read requests can be directly obtained from the local cache.

[0045] In this embodiment, the prediction engine identifies high-frequency read requirements through multi-dimensional access feature analysis. The access features include data address continuity, access frequency, access interval, and the type of the requesting die. In addition to the bus being idle, the preloading trigger conditions include the access frequency of the target data reaching a preset threshold and the interval between two adjacent accesses being less than a set value. During preloading, a priority avoidance strategy is used so that preloading does not occupy bus resources needed for high-priority data transmission, avoiding bandwidth conflicts between preloading and normal reads.

[0046] It should be noted that the core of the prediction engine is to achieve accurate identification of high-frequency cross-die reading needs through multi-dimensional feature fusion modeling. The quantification and analysis logic of each feature is as follows: A sliding window algorithm (window size can be configured to 16-64 addresses) is used to count the percentage of consecutive access addresses. When the percentage of consecutive addresses exceeds 70%, it is determined to have consecutive access characteristics and is given priority to be marked as a preloading candidate.
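A minimal sketch of the sliding-window continuity check follows. The window range (16 to 64 addresses) and the 70% threshold come from the paragraph above; the class name `ContinuityWindow`, the 64-byte block size, and the rule that two accesses count as consecutive when their addresses differ by at most one block are assumptions for illustration.

```python
from collections import deque


class ContinuityWindow:
    """Tracks the share of consecutive addresses in a sliding window."""

    def __init__(self, window_size: int = 16, block_size: int = 64,
                 threshold: float = 0.70) -> None:
        if not 16 <= window_size <= 64:
            raise ValueError("window size is configurable from 16 to 64")
        self.addresses = deque(maxlen=window_size)  # oldest entries drop out
        self.block_size = block_size
        self.threshold = threshold

    def observe(self, address: int) -> None:
        self.addresses.append(address)

    def consecutive_ratio(self) -> float:
        addrs = list(self.addresses)
        if len(addrs) < 2:
            return 0.0
        # Assumption: a pair is consecutive if it is at most one block apart.
        hits = sum(1 for prev, cur in zip(addrs, addrs[1:])
                   if abs(cur - prev) <= self.block_size)
        return hits / (len(addrs) - 1)

    def is_preload_candidate(self) -> bool:
        # "Exceeds 70%" is read as a strict comparison.
        return self.consecutive_ratio() > self.threshold
```

Because the deque has a fixed maximum length, old accesses fall out automatically and the ratio always reflects the most recent window.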

[0047] The access frequency of target data blocks is statistically analyzed based on time windows. When the frequency is greater than or equal to a preset threshold and the time interval between two adjacent accesses is less than or equal to a set value, it is determined to be high-frequency hot data. At the same time, an access decay factor is introduced to dynamically reduce the weight of data that was frequently accessed in the early stages but has not been accessed recently, so as to avoid preloading failure.
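The combination of a frequency threshold, an interval limit, and an access decay factor might look like the sketch below. The text gives no numeric values, so the threshold, interval, decay factor, and decay period here are hypothetical, as is the exponential form of the decay.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlockStats:
    weight: float  # decayed access count
    last_access: int  # timestamp of the latest access
    last_interval: Optional[int] = None  # gap between the last two accesses


class HotDataTracker:
    def __init__(self, freq_threshold: float = 3.0, max_interval: int = 50,
                 decay: float = 0.5, decay_period: int = 100) -> None:
        # All four defaults are illustrative assumptions.
        self.freq_threshold = freq_threshold
        self.max_interval = max_interval
        self.decay = decay
        self.decay_period = decay_period
        self.stats = {}

    def _decayed(self, s: BlockStats, now: int) -> float:
        # Weight halves (by default) for every full period without access.
        periods = (now - s.last_access) // self.decay_period
        return s.weight * (self.decay ** periods)

    def record(self, block: int, now: int) -> None:
        s = self.stats.get(block)
        if s is None:
            self.stats[block] = BlockStats(weight=1.0, last_access=now)
            return
        s.weight = self._decayed(s, now) + 1.0
        s.last_interval = now - s.last_access
        s.last_access = now

    def is_hot(self, block: int, now: int) -> bool:
        s = self.stats.get(block)
        if s is None or s.last_interval is None:
            return False
        return (self._decayed(s, now) >= self.freq_threshold
                and s.last_interval <= self.max_interval)
```

A block that was accessed often but then left alone loses weight over time, so it drops out of the hot set instead of triggering a wasted preload.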

[0048] The initiating die is categorized by label, such as compute die (AI inference type), compute die (general type), and storage die. A decision tree model is trained on historical data. For example, weight data accesses from an AI inference type compute die show stable high-frequency characteristics and can be added directly to the preloading candidate set without waiting for the frequency threshold to be reached, which further improves prediction efficiency.

[0049] S3. Embed the computing unit into the storage array to build an integrated storage and computing architecture, allowing data to be processed directly within the storage unit. At the same time, consistency-sensitive computing tasks are offloaded to the storage die for execution, and only the final result is output to the computing die, reducing the amount of data transferred across dies and alleviating the bandwidth mismatch between computing power and storage.

[0050] In this embodiment, the computing units embedded in the in-memory computing architecture include lightweight tensor kernels, vector operation units, and arithmetic logic units, which are suitable for tasks that are sensitive to consistency and have moderate computational density, such as AI inference and data preprocessing. Before a task is offloaded, the task feature screening module determines whether it is suitable for offloading. The screening dimensions include the task's cross-die data dependency, computational complexity, and data reuse rate. Only tasks with high cross-die data dependency and low computational complexity are offloaded to the storage die.
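The screening rule can be expressed as a small predicate. The text states only the qualitative rule (high cross-die dependency and low computational complexity), so the normalized 0 to 1 scales and both cut-off values below are assumptions; the data reuse rate is carried as a screening dimension but, as in the text, is not part of the stated rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskFeatures:
    cross_die_dependency: float  # 0..1, share of operands on the storage die
    complexity: float  # 0..1, normalized computational complexity
    reuse_rate: float  # 0..1, recorded but unused by the stated rule


def should_offload(task: TaskFeatures, min_dependency: float = 0.6,
                   max_complexity: float = 0.4) -> bool:
    """Offload only high-dependency, low-complexity tasks.

    Both thresholds are hypothetical; the source gives no numbers.
    """
    return (task.cross_die_dependency >= min_dependency
            and task.complexity <= max_complexity)
```

For instance, a preprocessing task with dependency 0.9 and complexity 0.2 passes, while a dense training step with complexity 0.9 stays on the compute die regardless of its dependency.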

[0051] In this embodiment, the storage die has a built-in task scheduling submodule and a data temporary storage unit, as follows: The task scheduling submodule allocates computing resources within the storage die, prioritizes offloaded tasks, and executes first those tasks that are highly related to the current storage read operation. The data temporary storage unit caches intermediate computing results, which avoids transmitting intermediate results across dies, reduces the amount of cross-die data interaction, and reduces how often the consistency protocol is triggered.

[0052] It should be noted that the task scheduling submodule undertakes three functions: resource allocation, priority sorting, and load balancing. Through fine-grained scheduling, it ensures efficient utilization of storage die computing resources and avoids computing congestion. The specific design is as follows: Tasks are ranked by a three-dimensional priority rule based on core relevance, task type, and load status. Priorities are divided into three levels from high to low. Level 1 consists of tasks strongly correlated with the current storage read operation, such as AI inference operations on the same data block that execute synchronously with storage reads; these tasks can preempt computing resources from lower-priority tasks. Level 2 consists of core tasks with extremely high cross-die data dependencies, such as feature extraction tasks for AI inference. Level 3 consists of ordinary auxiliary tasks, such as data format conversion. The scheduling submodule updates task priorities in real time to ensure that critical tasks are executed first.
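The three-level ordering can be sketched with a priority queue. The class name `OffloadQueue` and the first-in, first-out tie-break within a level are assumptions; preemption of running tasks and real-time priority updates are not modeled.

```python
import heapq
from itertools import count

LEVEL_READ_CORRELATED = 1  # strongly tied to the current storage read
LEVEL_CORE = 2  # very high cross-die data dependency
LEVEL_AUXILIARY = 3  # ordinary auxiliary work


class OffloadQueue:
    """Orders offloaded tasks by level, then by arrival order."""

    def __init__(self) -> None:
        self._heap = []
        self._seq = count()  # tie-breaker: earlier submissions run first

    def submit(self, name: str, level: int) -> None:
        if level not in (LEVEL_READ_CORRELATED, LEVEL_CORE, LEVEL_AUXILIARY):
            raise ValueError("level must be 1, 2, or 3")
        heapq.heappush(self._heap, (level, next(self._seq), name))

    def next_task(self):
        """Return the highest-priority task name, or None when empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

Submitting a format conversion, a feature extraction, and a read-correlated inference in that order still yields the inference first, then the feature extraction, then the conversion.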

[0053] Dynamic allocation and load balancing of computing resources: The computing resources within the storage die are divided by type into tensor kernel resource pools, vector operation resource pools, and arithmetic logic unit resource pools. The scheduling submodule allocates resources from the corresponding pool according to the computing requirements of each offloaded task. Resource occupancy monitoring and dynamic adjustment prevent congestion in any single pool. When the occupancy rate of a pool is ≥80%, offloading of tasks of that type is suspended and the tasks already allocated are processed first; subsequent tasks of that type are held in the task queue and executed when resources become available. If the overall load rate of the storage die is ≥90%, a load saturation signal is fed back to the computing die and task offloading is temporarily stopped, so that the basic storage functions of the storage die are not affected.
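The two admission thresholds (80% per pool, 90% for the whole die) can be sketched as follows. The thresholds come from the paragraph above; the class and enum names, the unit-based capacity model, and the extra rule that a task is queued when it would not fit in the remaining pool capacity are assumptions.

```python
from enum import Enum

POOL_LIMIT_PCT = 80  # per-pool occupancy that suspends offloading
DIE_LIMIT_PCT = 90  # overall load that triggers the saturation signal


class Decision(Enum):
    RUNNING = "running"
    QUEUED = "queued"
    SATURATED = "saturated"  # stands in for the load saturation signal


def _at_or_above(used: int, capacity: int, percent: int) -> bool:
    # Integer arithmetic avoids floating-point edge cases at the threshold.
    return used * 100 >= percent * capacity


class StorageDieResources:
    def __init__(self, capacity: dict) -> None:
        self.capacity = dict(capacity)  # pool name -> total units
        self.used = {pool: 0 for pool in capacity}
        self.queues = {pool: [] for pool in capacity}

    def admit(self, task: str, pool: str, units: int) -> Decision:
        total_used = sum(self.used.values())
        total_cap = sum(self.capacity.values())
        if _at_or_above(total_used, total_cap, DIE_LIMIT_PCT):
            return Decision.SATURATED
        pool_busy = _at_or_above(self.used[pool], self.capacity[pool],
                                 POOL_LIMIT_PCT)
        # Assumption: a task that does not fit is also queued.
        if pool_busy or self.used[pool] + units > self.capacity[pool]:
            self.queues[pool].append((task, units))
            return Decision.QUEUED
        self.used[pool] += units
        return Decision.RUNNING
```

The saturation check runs first, so once the die as a whole reaches 90% no pool accepts new work, even one that is itself empty.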

[0054] Coordinated scheduling with storage read operations is achieved by binding offloading tasks to corresponding storage data blocks through a task-data association mapping table. When the storage array performs a read operation on the data block, the corresponding computing unit is simultaneously activated to execute the offloading task, realizing a pipelined operation of storage read and computing execution and reducing the waiting time of computing units. If storage read and computing tasks exist simultaneously for the same data block, a time-sharing multiplexing strategy is adopted to prioritize storage read operations, while computing tasks are executed during read intervals to avoid the two competing for storage channels.

[0055] S4. Introduce a bandwidth scheduling mechanism for the computing unit, dynamically adjust the storage read mode according to the memory access characteristics of the current task, and dynamically manage idle computing resources to release bus bandwidth. At the same time, construct an on-chip storage pool through high-density 3D stacking technology to accurately match the working rhythm of computing power and storage.

[0056] In this embodiment, the bandwidth scheduling mechanism includes a memory access characteristic identification submodule, a storage configuration adjustment submodule, and a computing resource management submodule, as follows: The memory access characteristic identification submodule analyzes the continuous and random access ratio, read-write ratio, and data block size of the task in real time; the storage configuration adjustment submodule dynamically adjusts the prefetch granularity, cache replacement strategy, and storage channel parallelism based on the memory access characteristic identification; the computing resource management submodule releases bus bandwidth by shutting down idle computing cores and reducing the clock frequency of idle cores, thereby achieving dynamic adaptation between computing power and storage bandwidth.

[0057] In this embodiment, the high-density 3D stacked on-chip memory pool adopts a three-layer stacked structure of storage layer, interconnect layer, and control layer. The storage layer consists of multiple layers of HBM and 3DSRAM. The interconnect layer realizes high-speed interconnection between each storage layer and the computing die through through-silicon vias. The control layer deploys a distributed storage controller. The distributed storage controller adopts partition management and load balancing strategies to monitor the bandwidth usage of each storage partition in real time and dynamically allocate read requests to idle partitions to avoid bandwidth congestion in a single partition and maximize the utilization of the total storage bandwidth.
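The partition load balancing performed by the distributed storage controller can be sketched as a least-loaded dispatch. This is an illustration only: it assumes that any partition can serve a given read (for example, because data is interleaved or replicated), which the text does not state, and the class name and byte-count load metric are hypothetical.

```python
class DistributedStorageController:
    """Routes each read to the partition with the least in-flight traffic."""

    def __init__(self, partitions) -> None:
        self.inflight = {p: 0 for p in partitions}  # partition -> bytes

    def dispatch(self, request_bytes: int) -> str:
        # Ties are broken by partition name so the choice is deterministic.
        target = min(self.inflight, key=lambda p: (self.inflight[p], p))
        self.inflight[target] += request_bytes
        return target

    def complete(self, partition: str, request_bytes: int) -> None:
        # Called when a read finishes and its bandwidth is released.
        self.inflight[partition] = max(
            0, self.inflight[partition] - request_bytes)
```

With three empty partitions, the first three reads land on different partitions, and later reads follow whichever partition currently carries the least traffic.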

[0058] It should be noted that all storage access requests initiated by the computing unit are captured with a sampling period of 10ns, including the access address, read/write identifier, data block length, and access response time. The sampled data is buffered in a hardware FIFO, and abnormal accesses, such as falsely triggered requests or accesses from faulty addresses, are eliminated so that they do not interfere with the feature analysis results. An address continuity detection algorithm is used: adjacent access addresses with a difference ≤ the data block size are considered continuous. The percentage of continuous accesses per unit time is computed and quantified into a range of 0-100%. A continuous access rate ≥ 70% is considered high continuity, a continuous access rate ≤ 10% is considered low continuity, and the 30% level marks highly random access. The ratio of read requests to write requests per unit time is computed and categorized into three types: read-intensive, read-write balanced, and write-intensive. Accessed data blocks are categorized by length into small, medium, and large sizes, and a correlation model between data block size and memory access requirements is established based on task type; AI inference, for example, mostly involves medium-sized data block accesses. A memory access characteristic report is generated every 500ns and pushed synchronously to the storage configuration adjustment submodule and the computing resource management submodule. If a characteristic fluctuates by more than 20% (for example, the continuous access ratio drops from 80% to 50%), an emergency update is triggered immediately to keep the scheduling strategy timely.
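The continuity measurement and the emergency-update check over one batch of sampled addresses can be sketched as below. The continuity rule and the 70% and 10% levels come from the paragraph; the function names are hypothetical, and the 20% fluctuation is interpreted as 20 percentage points, which is an assumption consistent with the 80% to 50% example.

```python
def continuous_ratio(addresses, block_size: int) -> float:
    """Percentage (0-100) of adjacent pairs at most one block apart."""
    pairs = list(zip(addresses, addresses[1:]))
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if abs(b - a) <= block_size)
    return 100.0 * hits / len(pairs)


def classify_continuity(ratio_percent: float) -> str:
    if ratio_percent >= 70:
        return "high"
    if ratio_percent <= 10:
        return "low"
    return "intermediate"


def needs_emergency_update(previous_percent: float, current_percent: float,
                           limit_points: float = 20) -> bool:
    # Assumption: "fluctuation exceeds 20%" means more than 20 points.
    return abs(current_percent - previous_percent) > limit_points
```

A drop from 80 to 50 is a change of 30 points and triggers the update, while a drop from 80 to 70 waits for the next regular 500ns report.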

[0059] S5. Monitor the chip's operating status in real time, dynamically adjust the consistency protocol parameters, the in-memory computing unit activation ratio, and the storage bandwidth allocation strategy, perform lightweight compression on cross-domain data transmission to reduce bandwidth consumption, and deploy a layered verification mechanism to ensure data reliability while preventing verification delays from reducing read efficiency, thus ensuring stable acceleration across all scenarios.

[0060] In this embodiment, the dynamic adjustment of consistency protocol parameters, memory cell activation ratio, and memory bandwidth allocation strategy is implemented based on the threshold range of chip operating status parameters. The operating status parameters include chip junction temperature, bus bandwidth utilization, computing unit load rate, and data error rate. When the junction temperature exceeds the preset threshold, the memory cell activation ratio and consistency protocol synchronization frequency are reduced to prioritize stability. When the bus bandwidth utilization is lower than the set value, the preloading intensity and memory channel parallelism are increased to fully release bandwidth potential and achieve a balance between performance and stability across all scenarios.
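The two adjustment rules in this paragraph can be sketched as a pure function over a settings record. Every number here is an assumption: the temperature limit, step size, channel cap, and the percentage scale are illustrative, and the 30% low-utilization threshold is borrowed from the AI inference example given later in the description.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Settings:
    pim_activation_pct: int  # in-memory computing unit activation ratio
    sync_frequency_pct: int  # consistency synchronization frequency
    preload_intensity_pct: int
    channel_parallelism: int


def adjust(settings: Settings, junction_temp_c: float, bus_util_pct: int,
           temp_limit_c: float = 95.0, low_util_pct: int = 30,
           step_pct: int = 10, max_channels: int = 8) -> Settings:
    """Apply the over-temperature and low-utilization rules."""
    s = settings
    if junction_temp_c > temp_limit_c:
        # Too hot: back off compute activation and synchronization.
        s = replace(
            s,
            pim_activation_pct=max(0, s.pim_activation_pct - step_pct),
            sync_frequency_pct=max(0, s.sync_frequency_pct - step_pct),
        )
    if bus_util_pct < low_util_pct:
        # Bus underused: preload more and open another storage channel.
        s = replace(
            s,
            preload_intensity_pct=min(100, s.preload_intensity_pct + step_pct),
            channel_parallelism=min(max_channels, s.channel_parallelism + 1),
        )
    return s
```

The function returns a new record rather than mutating its input, so a monitoring loop can compare the old and new settings before applying them.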

[0061] It should be noted that the chip junction temperature is collected in real time by a thermistor array integrated within the die, with a sampling accuracy of ±1℃ and an update period of 10ms, and is divided into three levels: safe range, warning range, and emergency range. Bus bandwidth utilization is the ratio of the actual amount of data transmitted on the bus per unit time to the maximum bandwidth, sampled with a period of 100ns and divided into three categories: low load, balanced load, and high load. The computing unit load rate is quantified from 0 to 100% by monitoring the number of core operation instructions and the frequency of memory access requests, with a load rate ≤30% indicating low load and a load rate ≥70% indicating high load. The data error rate is the ratio of the number of errors reported by the various verification mechanisms to the total amount of data, divided into three levels: normal, abnormal, and fault. Supplementary monitoring parameters include the cache hit rate and the cross-die consistency conflict rate, which further improve the accuracy of the adjustment strategies.

[0062] The threshold is not a fixed value, but is dynamically calibrated based on the chip's historical operating data and scenario characteristics. A scenario labeling mechanism is introduced, and an initial threshold is preset according to the task type. For example, in the AI inference scenario, the low load threshold of bus bandwidth utilization can be relaxed to 30% to improve the preloading effect. A sliding window self-learning algorithm is adopted to update the threshold range once an hour. For example, when the junction temperature does not trigger the warning in the high load scenario, the junction temperature warning threshold is appropriately increased to avoid overly conservative adjustment of the strategy.

[0063] In this embodiment, the hierarchical verification mechanism adopts differentiated verification strategies for different data types, as follows: the calculation results within the in-memory computing unit are verified using local ECC to ensure the accuracy of the calculation; compressed data transmitted across dies is verified using CRC to quickly detect transmission errors; the raw data in the on-chip storage pool is verified using distributed parity checking to balance verification efficiency and fault tolerance; the verification operation is executed in parallel by the hardware verification engine, synchronously with the data reading and calculation process, to ensure that the verification delay does not affect the overall reading acceleration effect.

[0064] It should be noted that the three verification methods work as follows. For calculation results within the in-memory computing unit, a local ECC with single-bit error correction and double-bit error detection is employed. Each result block corresponds to an 8-bit check code. The check engine is deeply integrated with the computation unit and performs the check in parallel immediately after computation, with an error correction latency of ≤2ns. The error correction process does not interrupt subsequent computations, ensuring computational continuity. For data transmitted across dies, a CRC-32 algorithm is used, with hardware check engines deployed at the sending and receiving ends of the cross-die interconnect channel. The sending end generates a CRC code after data compression and appends it to the end of the frame. The receiving end completes the check before decompression, with a check latency of ≤1ns. If an error is detected, retransmission is triggered immediately, and retransmission has higher priority than normal data transmission, ensuring the integrity of cross-die data. For the on-chip storage pool, the pool is divided into multiple check groups, each corresponding to a parity check block stored in an independent check partition. The check engine monitors data changes in each partition in real time and updates the parity check block asynchronously, with a check latency of ≤5ns. When an error occurs in a partition, only the data in that partition needs to be repaired through the parity check block, without reconstructing the entire storage pool; the repair time is ≤100ns, avoiding global data unavailability.
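The cross-die framing order (compress, append CRC-32, check before decompressing) can be sketched in software as below. This is an illustration under stated assumptions: `zlib.compress` stands in for the lightweight compression step because LZ4 is not in the Python standard library, the 4-byte big-endian trailer layout is hypothetical, and returning `None` stands in for the retransmission request.

```python
import struct
import zlib


def build_frame(payload: bytes) -> bytes:
    """Sender side: compress first, then append CRC-32 at the frame end."""
    compressed = zlib.compress(payload)  # stand-in for the LZ4 step
    crc = zlib.crc32(compressed)
    return compressed + struct.pack(">I", crc)  # assumed trailer layout


def parse_frame(frame: bytes):
    """Receiver side: verify the CRC before any decompression.

    Returns the payload, or None when the caller should retransmit.
    """
    if len(frame) < 4:
        raise ValueError("frame too short to contain a CRC-32 trailer")
    compressed = frame[:-4]
    (crc,) = struct.unpack(">I", frame[-4:])
    if zlib.crc32(compressed) != crc:
        return None  # corrupted in transit
    return zlib.decompress(compressed)
```

Checking before decompression means a corrupted frame is rejected without spending decompression time on it, which matches the ordering described above.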

[0065] Example 2: This application provides a chip data read acceleration system. As shown in Figure 2, the chip data read acceleration system includes a consistency domain management layer, a cross-die data optimization layer, an in-memory computing layer, a bandwidth scheduling layer, and an adaptive control layer, as detailed below: The consistency domain management layer includes a consistency domain partitioning module and a consistency protocol processing unit.

[0066] The consistency domain partitioning module divides the multi-die heterogeneous packaging architecture into functional domains, assigning the computing modules and storage modules to independent local consistency domains. The consistency protocol processing unit executes the cache consistency protocol within a domain and the lightweight cross-domain protocol between domains, according to the scenario.

[0067] The cross-die data optimization layer includes a cross-die access prediction engine and a data preloading unit.

[0068] The cross-die access prediction engine analyzes historical access patterns through multi-dimensional access features, identifies high-frequency cross-die read requests, triggers preloading logic, and determines preloading priority. When the bus is idle and the preloading threshold condition is met, the data preloading unit actively copies the target data from the source die to the local cache of the die that initiated the request.

[0069] The in-memory computing layer includes an in-memory computing macro unit, a task filtering and scheduling module, and a data temporary storage unit.

[0070] The in-memory computing macro unit integrates a lightweight tensor kernel, a vector operation unit, and an arithmetic logic unit, embedded inside the storage array, so that AI inference, data preprocessing, and other operations can be completed directly within the storage unit. The task filtering and scheduling module selects and sorts tasks suitable for offloading based on cross-die data dependency, computational complexity, and data reuse rate, prioritizing tasks that are highly relevant to the current storage read operation. The data temporary storage unit caches intermediate results generated by operations within the die.

[0071] The bandwidth scheduling layer includes a bandwidth scheduling mechanism unit and a 3D stacked on-chip memory pool.

[0072] The bandwidth scheduling mechanism unit integrates three sub-modules: memory access feature identification, storage configuration adjustment, and computing resource management. It identifies task memory access features and dynamically adjusts the storage prefetch granularity, cache replacement strategy, and channel parallelism. It releases bus bandwidth by shutting down idle computing cores and reducing frequency, thereby achieving computing power and storage matching. The 3D stacked on-chip storage pool adopts a three-layer structure of storage layer, interconnect layer, and control layer. The storage layer consists of multiple layers of HBM and 3DSRAM, the interconnect layer achieves high-speed interconnection through through-silicon vias, and the control layer deploys a distributed storage controller.

[0073] The adaptive control layer includes a chip status monitoring unit, a dynamic parameter adjustment unit, a data compression unit, and a hierarchical verification unit.

[0074] The chip status monitoring unit collects chip operating status parameters in real time. The dynamic parameter adjustment unit dynamically adjusts the consistency protocol synchronization frequency, the activation ratio of the storage and computing unit, and the storage bandwidth allocation strategy based on the status parameter threshold range. The data compression unit performs a lightweight compression algorithm on the data transmitted across dies to reduce the transmission bandwidth occupation. In conjunction with the bandwidth scheduling mechanism, it further alleviates bandwidth pressure. The hierarchical verification unit performs differentiated verification for different data types.

[0075] The chip uses a chiplet package. The compute die is equipped with an AI inference acceleration core, and the storage die integrates a storage-compute unit and a 3D stacked storage pool. Multiple dies achieve high-speed communication through the TSV interconnect channel, with a total interconnect bandwidth of 512Gbps and a latency of ≤2ns. The modules at each level work together to form a closed-loop acceleration link.

[0076] The two compute dies are each divided into independent compute consistency domains, and the storage die is a separate storage consistency domain. Only compute dies are allowed to initiate cross-domain access to storage dies, simplifying the scope of management. Within a domain, the MESI protocol and write-back delay buffer mechanism are used. High-frequency write operations are temporarily buffered and then synchronized in batches. Between domains, a lightweight protocol is used. Cross-domain reads only mark the status without broadcasting, and cross-domain writes are synchronized in a targeted manner, reducing protocol overhead.

[0077] Based on features such as access frequency, address continuity, and initiation die type, high-frequency cross-die data such as AI model weights are identified, triggering preloading and marking priorities. When the bus idle bandwidth is ≥60% and the high-frequency access conditions are met, preloading is initiated. A priority avoidance strategy is adopted to preload high-frequency data into the local cache of the computation die, reducing cross-die access.

[0078] The storage die embeds lightweight tensor kernels and vector operation units, adapting to AI low-precision inference and data preprocessing, and supporting direct data operation within the storage unit; preprocessing tasks with high cross-die data dependency and low computational complexity are selected and offloaded to the storage die, and resources are allocated according to task correlation; intermediate and final results are cached in layers to avoid the transmission of intermediate results across dies and reduce the frequency of consistency protocol triggering.

[0079] The bandwidth scheduling layer analyzes memory access characteristics in real time, dynamically adjusts prefetch granularity and storage channel parallelism, and shuts down idle computing cores to release bus bandwidth, adapting to the read-intensive requirements of AI workloads. The storage pool adopts a structure of 2 layers of 3DSRAM and 16 layers of HBM3, with the 3DSRAM storing high-frequency data and a distributed controller balancing load across partitions to avoid bandwidth congestion.

[0080] The adaptive control layer collects core parameters such as junction temperature, bus utilization, load rate, and error rate in real time as the basis for dynamic adjustment. It adjusts the consistency synchronization frequency and the in-memory computing unit activation ratio based on the parameter threshold ranges to balance performance and stability. It uses the LZ4 algorithm to compress cross-die data according to the scenario, combined with differentiated verification that executes in parallel so that acceleration efficiency is not reduced.

[0081] Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A chip data read acceleration method, characterized in that the method includes the following steps:
The multi-die heterogeneous packaging architecture is divided into domains, and the computing, storage, and other functional modules are assigned to independent local consistency domains. Within a domain, a cache consistency protocol is used to maintain data synchronization; between domains, a lightweight protocol is used, which triggers consistency synchronization only when cross-domain write operations are performed.
A prediction engine is deployed on the inter-die interconnection channel to identify high-frequency cross-die read requests by analyzing historical access patterns. When the bus is idle, the target data is proactively preloaded into the local cache of the die that initiated the request, so that subsequent cross-die read requests can be served directly from the local cache.
Computing units are embedded into the storage array to build an in-memory computing architecture, allowing data to be processed directly within the storage unit. At the same time, consistency-sensitive computing tasks are offloaded to the storage die for execution, and only the final result is output to the computing die, reducing the amount of data transferred across dies and alleviating the bandwidth mismatch between computing power and storage.
A bandwidth scheduling mechanism is introduced for the computing unit to dynamically adjust the storage read mode according to the memory access characteristics of the current task, and to dynamically manage idle computing resources to release bus bandwidth. At the same time, an on-chip storage pool is built through high-density 3D stacking technology to precisely match the working rhythm of computing power and storage.
The chip operating status is monitored in real time; the consistency protocol parameters, the in-memory computing unit activation ratio, and the storage bandwidth allocation strategy are dynamically adjusted; lightweight compression is applied to cross-domain data transmission to reduce bandwidth consumption; and a layered verification mechanism is deployed to ensure data reliability while preventing verification delays from reducing read efficiency, ensuring stable acceleration in all scenarios.

2. The chip data readout acceleration method according to claim 1, characterized in that: The cache consistency protocol introduces a write-back delayed buffer mechanism. When the core modifies cache data, it does not immediately trigger a global notification, but temporarily stores it in the buffer unit. It then synchronizes the data in batches when the core is idle or the buffer is full, reducing the protocol interaction overhead caused by high-frequency write operations. The lightweight protocol employs local state marking and delayed synchronization mechanisms. When a cross-domain read operation is initiated, the data is marked as having been accessed externally in the local cache directory of the target die, without triggering a global invalidation broadcast. When the target data is modified and a cross-domain write operation is initiated, a synchronization notification is sent to the die that has accessed the data, ensuring data consistency while minimizing the protocol overhead of cross-die read operations.

3. The chip data reading acceleration method according to claim 1, characterized in that: The prediction engine identifies high-frequency reading needs through multi-dimensional access feature analysis. The access features include data address continuity, access frequency, access interval, and the type of request initiation die. In addition to bus idleness, the preloading trigger conditions also include the target data access frequency reaching a preset threshold, the interval between two adjacent accesses being less than a set value, and the use of a priority avoidance strategy during the preloading process to avoid occupying bus resources for high-priority data transmission and avoid bandwidth conflicts between preloading and normal reading.

4. The chip data reading acceleration method according to claim 1, characterized in that: The computing units embedded in the in-memory computing architecture include lightweight tensor kernels, vector operation units, and arithmetic logic units, which are suitable for tasks that are sensitive to consistency and have moderate computational density, such as AI inference and data preprocessing. Before a task is offloaded, the task feature filtering module determines whether it is suitable for offloading. The filtering dimensions include the task's cross-die data dependency, computational complexity, and data reuse rate. Only tasks with high cross-die data dependency and low computational complexity are offloaded to the storage die.

5. The chip data read acceleration method according to claim 1, characterized in that: The storage die has a built-in task scheduling submodule and a data temporary storage unit, as detailed below: The task scheduling submodule is used to allocate computing resources within the storage die, prioritize offloaded tasks, and execute tasks that are highly relevant to the current storage read operation first. The data temporary storage unit is used to cache intermediate results of the operation, avoid the transmission of intermediate results across dies, reduce the amount of cross-die data interaction, and reduce the frequency of triggering the consistency protocol.

6. The chip data reading acceleration method according to claim 1, characterized in that: The bandwidth scheduling mechanism includes a memory access feature identification submodule, a storage configuration adjustment submodule, and a computing resource management submodule, as detailed below: The memory access feature identification submodule analyzes the continuous and random access ratio, read-write ratio, and data block size of the task in real time. The storage configuration adjustment submodule dynamically adjusts the prefetch granularity, cache replacement strategy, and storage channel parallelism based on memory access characteristics. The computing power resource management submodule releases bus bandwidth by shutting down idle computing cores and reducing the clock frequency of idle cores, thereby achieving dynamic adaptation of computing power and storage bandwidth.

7. The chip data reading acceleration method according to claim 1, characterized in that: The high-density 3D stacked on-chip memory pool adopts a three-layer stacked structure consisting of a memory layer, an interconnect layer, and a control layer. The memory layer is composed of multiple layers of HBM and 3D SRAM. The interconnect layer provides high-speed interconnection between each memory layer and the computing die by means of through-silicon vias. The control layer deploys a distributed storage controller. The distributed storage controller employs partition management and load balancing strategies: it monitors the bandwidth usage of each storage partition in real time and dynamically allocates read requests to idle partitions, avoiding bandwidth congestion in any single partition and maximizing utilization of the total storage bandwidth.
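
The load balancing strategy in claim 7 can be illustrated with a least-utilized dispatcher. The class name, the partition names, and the bandwidth units are assumptions. The sketch also assumes any partition can serve any request (for example because data is interleaved or replicated), which the claim does not state.

```python
class PartitionBalancer:
    """Hypothetical least-utilized dispatcher over storage partitions."""

    def __init__(self, capacities: dict[str, float]):
        self.capacity = dict(capacities)                 # peak bandwidth per partition
        self.in_use = {name: 0.0 for name in capacities}  # bandwidth currently claimed

    def utilization(self, name: str) -> float:
        return self.in_use[name] / self.capacity[name]

    def dispatch(self, demand: float) -> str:
        """Send a read to the least-utilized partition; ties go to the first name."""
        target = min(self.capacity, key=lambda n: (self.utilization(n), n))
        self.in_use[target] += demand
        return target

    def complete(self, name: str, demand: float) -> None:
        """Release bandwidth when a read finishes."""
        self.in_use[name] = max(0.0, self.in_use[name] - demand)
```

Dispatching four equal requests across partitions `hbm0`, `hbm1`, and `sram0` spreads the first three over all partitions before any partition receives a second request.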

8. The chip data reading acceleration method according to claim 1, characterized in that: The dynamic adjustment of consistency protocol parameters, memory cell activation ratio, and memory bandwidth allocation strategy is implemented based on threshold ranges of the chip's operating status parameters. The operating status parameters include chip junction temperature, bus bandwidth utilization, computing unit load rate, and data error rate. When the junction temperature exceeds a preset threshold, the activation ratio of the in-memory computing units and the synchronization frequency of the consistency protocol are reduced to prioritize stability. When the bus bandwidth utilization is lower than a set value, the preloading intensity and storage channel parallelism are increased to make full use of the available bandwidth, balancing performance and stability across all scenarios.
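
The two threshold rules that claim 8 spells out can be sketched as a pure function over a set of tuning knobs. The limit values, step sizes, and field names are assumptions made for illustration. The claim also lists computing unit load rate and data error rate as inputs but gives no rule for them, so they are left out.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Knobs:
    cim_activation_ratio: float  # share of in-memory computing units enabled
    coherence_sync_hz: float     # consistency protocol synchronization frequency
    preload_intensity: float     # 0..1
    channel_parallelism: int


def adjust(knobs: Knobs, junction_temp_c: float, bus_util: float,
           temp_limit_c: float = 95.0, bus_low: float = 0.4,
           max_channels: int = 8) -> Knobs:
    """Apply the claim's two rules; all numeric defaults are assumed values."""
    if junction_temp_c > temp_limit_c:
        # Too hot: back off compute activation and coherence traffic.
        return replace(knobs,
                       cim_activation_ratio=knobs.cim_activation_ratio * 0.5,
                       coherence_sync_hz=knobs.coherence_sync_hz * 0.5)
    if bus_util < bus_low:
        # Bus has headroom: preload more aggressively and widen channel use.
        return replace(knobs,
                       preload_intensity=min(1.0, knobs.preload_intensity + 0.25),
                       channel_parallelism=min(max_channels,
                                               knobs.channel_parallelism + 1))
    return knobs
```

Checking temperature first reflects the claim's wording that stability is prioritized when the junction temperature exceeds its threshold.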

9. The chip data reading acceleration method according to claim 1, characterized in that: The layered verification mechanism employs differentiated verification strategies for different data types, as detailed below: Calculation results within the in-memory computing unit are verified using local ECC to ensure calculation accuracy. Compressed data transmitted across dies is verified using a CRC check to quickly detect transmission errors. Raw data in the on-chip storage pool is verified using distributed parity checking, which balances checking efficiency and fault tolerance. The verification operations are executed in parallel by a hardware verification engine, synchronized with the data reading and processing flow, so that verification delay does not affect the overall read acceleration.
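
Two of the three verification layers in claim 9 can be shown in software. The sketch uses `zlib.compress` as a stand-in for the unspecified lightweight compression, CRC-32 for the cross-die check, and byte-wise XOR as one simple form of parity across equal-length blocks. These concrete choices are assumptions, since the claim names only the categories, and the local ECC layer is omitted.

```python
import zlib
from functools import reduce
from typing import Optional


def frame(payload: bytes) -> bytes:
    """Compress a payload and append a 4-byte CRC-32 of the compressed body."""
    body = zlib.compress(payload)
    return body + zlib.crc32(body).to_bytes(4, "big")


def unframe(packet: bytes) -> bytes:
    """Check the CRC before decompressing; raise ValueError on mismatch."""
    body, tag = packet[:-4], packet[-4:]
    if zlib.crc32(body).to_bytes(4, "big") != tag:
        raise ValueError("CRC mismatch on cross-die transfer")
    return zlib.decompress(body)


def xor_parity(blocks: list[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(blocks: list[Optional[bytes]], parity: bytes) -> bytes:
    """Recover the single missing block (marked None) from the parity block."""
    survivors = [b for b in blocks if b is not None]
    return xor_parity(survivors + [parity])
```

XOR parity recovers exactly one lost block per group. A hardware verification engine would run these checks alongside the data path rather than after it.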

10. A chip data reading acceleration system for executing the chip data reading acceleration method according to any one of claims 1 to 9, characterized in that: The chip data reading acceleration system includes a consistency domain management layer, a cross-die data optimization layer, an in-memory computing layer, a bandwidth scheduling layer, and an adaptive control layer, as detailed below:
The consistency domain management layer includes a consistency domain partitioning module and a consistency protocol processing unit. The consistency domain partitioning module partitions the multi-die heterogeneous packaging architecture into functional domains, assigning the computing modules and storage modules to independent local consistency domains. The consistency protocol processing unit executes either the cache consistency protocol or the lightweight cross-domain protocol, depending on the scenario.
The cross-die data optimization layer includes a cross-die access prediction engine and a data preloading unit. The cross-die access prediction engine analyzes historical access patterns using multi-dimensional access features, identifies high-frequency cross-die read requests, triggers the preloading logic, and determines preloading priority. When the bus is idle and the preloading threshold condition is met, the data preloading unit proactively copies the target data from the source die to the local cache of the die that initiated the request.
The in-memory computing layer includes an in-memory computing macro unit, a task filtering and scheduling module, and a data temporary storage unit. The in-memory computing macro unit integrates lightweight tensor cores, vector operation units, and arithmetic logic units embedded inside the storage array, so that AI inference, data preprocessing, and other operations are completed directly within the storage unit. The task filtering and scheduling module filters and sorts tasks suitable for offloading based on cross-die data dependency, computational complexity, and data reuse rate, prioritizing tasks that are most relevant to the current storage read operation. The data temporary storage unit caches intermediate results generated by operations within the die.
The bandwidth scheduling layer includes a bandwidth scheduling mechanism unit and a 3D stacked on-chip storage pool. The bandwidth scheduling mechanism unit integrates three submodules: memory access feature identification, storage configuration adjustment, and computing resource management. It identifies the memory access features of each task and dynamically adjusts storage prefetch granularity, cache replacement strategy, and channel parallelism. It releases bus bandwidth by shutting down idle computing cores and reducing clock frequency, thereby matching computing power to storage bandwidth. The 3D stacked on-chip storage pool adopts a three-layer structure consisting of a storage layer, an interconnect layer, and a control layer. The storage layer consists of multiple layers of HBM and 3D SRAM, the interconnect layer provides high-speed interconnection by means of through-silicon vias, and the control layer deploys a distributed storage controller.
The adaptive control layer includes a chip status monitoring unit, a dynamic parameter adjustment unit, a data compression unit, and a hierarchical verification unit. The chip status monitoring unit collects chip operating status parameters in real time. The dynamic parameter adjustment unit dynamically adjusts the consistency protocol synchronization frequency, the activation ratio of the in-memory computing units, and the storage bandwidth allocation strategy based on the threshold ranges of the status parameters. The data compression unit applies a lightweight compression algorithm to data transmitted across dies to reduce transmission bandwidth occupation, which, in conjunction with the bandwidth scheduling mechanism, further relieves bandwidth pressure. The hierarchical verification unit performs differentiated verification for different data types.
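
The cross-die data optimization layer of claim 10 can be sketched as a frequency counter that proposes preloads only while the bus is idle. The class name, the use of a plain access count as the only feature, and both threshold values are assumptions. The claim mentions multi-dimensional access features without listing them.

```python
from collections import Counter


class PreloadPlanner:
    """Hypothetical prediction engine: counts cross-die reads per block."""

    def __init__(self, hot_threshold: int = 3, idle_bus_util: float = 0.3):
        self.hot_threshold = hot_threshold  # assumed: reads needed to count as hot
        self.idle_bus_util = idle_bus_util  # assumed: bus counts as idle below this
        self.hits: Counter = Counter()

    def record_cross_die_read(self, source_die: str, requester_die: str,
                              block: int) -> None:
        self.hits[(source_die, requester_die, block)] += 1

    def plan(self, bus_util: float) -> list[tuple[str, str, int]]:
        """Return (source, requester, block) copies to issue, hottest first."""
        if bus_util >= self.idle_bus_util:
            return []  # bus is busy: do not compete with demand traffic
        hot = [(key, n) for key, n in self.hits.items() if n >= self.hot_threshold]
        hot.sort(key=lambda item: (-item[1], item[0]))
        return [key for key, _ in hot]
```

Each returned tuple stands for one copy from the source die into the requesting die's local cache. Eviction and invalidation of preloaded copies are outside this sketch.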

Citation Information

Patent Citations

Methods, devices, computer equipment, and storage media for accelerating read commands of NVMe SSD controller chips (CN109710187B)
Nonvolatile memory read acceleration method for power chip (CN112711383A)
Accelerated starting method for storage chip, main control device and solid state disk (CN116400868A)
