Data read-write method and system based on hybrid bonding large model inference, controller and storage medium
By constructing a hybrid bonding connection of DRAM, FLASH and PIM chips, and utilizing PIM chips to optimize data reading and writing of large models in the pre-filling and decoding stages, the problem of slow inference speed of large models is solved, and high-efficiency data reading and writing and low-latency computing efficiency are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENZHEN MAITEXIN TECH CO LTD
- Filing Date
- 2026-03-31
- Publication Date
- 2026-06-19
Figure CN121958147B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and in particular to a data read/write method, system, controller, and computer-readable storage medium for large model inference based on hybrid bonding.
Background Technology
[0002] Large Language Models (LLMs) are widely used in many fields such as natural language processing, image recognition, and voice interaction. However, with the continuous increase in model size and the growing complexity of application scenarios, the efficiency of LLM inference has become a key factor restricting its development.
[0003] Current storage system solutions mainly include HBM (High Bandwidth Memory), DRAM (Dynamic Random Access Memory), SSD (Solid State Drive), and FLASH (a non-volatile storage technology). These solutions have significant drawbacks: HBM has high bandwidth but small capacity and is extremely expensive; DRAM, SSD, and FLASH suffer a sharp increase in latency during data swapping, which slows down inference, especially in long-context scenarios where bandwidth contention intensifies and I/O (Input/Output) becomes a critical bottleneck.
[0004] Therefore, existing technologies still need to be improved and developed.
Summary of the Invention
[0005] The main objective of this invention is to provide a data reading and writing method, system, controller, and computer-readable storage medium based on hybrid bonding for large model inference. This invention aims to solve the problem in the prior art where the model inference speed is slow and the data reading and writing efficiency is low when using HBM, DRAM, or SSD storage systems alone for large model acceleration.
[0006] To achieve the above objectives, the present invention provides a data read/write method for large model inference based on hybrid bonding, which includes the following steps:
[0007] Construct a DRAM chip, a FLASH chip, and a PIM chip, and construct hybrid bonding connections between the PIM chip and the DRAM chip and the FLASH chip, respectively;
[0008] When the large model performs inference, the weights of the large model are stored in the FLASH chip through the data bus, and the prompts of the large model are stored in the DRAM chip through the data bus.
[0009] During the pre-filling stage, the PIM chip is used to read the weights and the prompts respectively, and the computed first key-value cache is written into the DRAM chip;
[0010] During the decoding stage, the PIM chip is used to read the weights and the first key-value cache respectively, and the second key-value cache of the token generated by the large model is written into the DRAM chip.
[0011] Optionally, in the data read/write method for large model inference based on hybrid bonding, the DRAM chip includes multiple memory blocks, each of which is independent of the others;
[0012] The FLASH chip includes multiple FLASH blocks, and each FLASH block is independent of the others;
[0013] The PIM chip includes multiple processing engines, multiple DRAM controllers, and multiple FLASH controllers.
[0014] Optionally, in the data read/write method for large model inference based on hybrid bonding, the hybrid bonding connection between the PIM chip and the DRAM chip and the FLASH chip includes:
[0015] The PIM chip is hybrid-bonded between the DRAM chip and the FLASH chip;
[0016] The PIM chip is hybrid-bonded to both sides of the DRAM chip or the FLASH chip.
[0017] Optionally, in the data read/write method for large model inference based on hybrid bonding, constructing the hybrid bonding connections between the PIM chip and the DRAM chip and the FLASH chip respectively specifically includes:
[0018] A DRAM cluster is constructed using one or more of the memory blocks, and a FLASH cluster is constructed using one or more of the FLASH blocks;
[0019] Each memory block is connected to each DRAM controller;
[0020] Each of the FLASH blocks is connected to each of the FLASH controllers;
[0021] Each of the processing engines is connected to each of the DRAM clusters and each of the FLASH clusters, respectively.
[0022] Optionally, in the data read/write method for large model inference based on hybrid bonding, storing the weights of the large model in the FLASH chip via the data bus and storing the prompts of the large model in the DRAM chip via the data bus when the large model performs inference specifically includes:
[0023] When the large model performs inference, the weights of the large model are divided into multiple parts and stored in multiple FLASH clusters through a data bus;
[0024] The prompts of the large model are divided into multiple parts and stored in multiple DRAM clusters via a data bus.
[0025] Optionally, in the data read/write method for large model inference based on hybrid bonding, using the PIM chip in the pre-filling stage to read the weights and the prompts respectively and writing the computed first key-value cache into the DRAM chip specifically includes:
[0026] During the pre-filling phase, each of the processing engines reads the weight from the corresponding FLASH cluster and reads the prompt words for each part from the corresponding DRAM cluster;
[0027] Multiple first key-value caches are generated based on the weight of each part and the prompt words of each part;
[0028] Each of the processing engines writes all of the first key-value caches to the corresponding DRAM cluster.
[0029] Optionally, in the data read/write method for large model inference based on hybrid bonding, using the PIM chip in the decoding stage to read the weights and the first key-value cache respectively and writing the second key-value cache of the token generated by the large model into the DRAM chip specifically includes:
[0030] During the decoding phase, each of the processing engines reads the weight from the corresponding FLASH cluster and reads the first key-value cache from the corresponding DRAM cluster;
[0031] When the large model generates the output of the forward computation, the processing engine writes the second key-value cache of that output into each of the DRAM clusters.
[0032] Furthermore, to achieve the above objectives, the present invention also provides a data read/write system for large model inference based on hybrid bonding, which includes:
[0033] A connection module is used to construct DRAM chips, FLASH chips and PIM chips, and to construct hybrid bonding connections between the PIM chips and the DRAM chips and FLASH chips respectively;
[0034] The data storage module is used to store the weights of the large model into the FLASH chip via the data bus and the prompts of the large model into the DRAM chip via the data bus when the large model is performing inference.
[0035] The first read/write module is used to read the weights and the prompts respectively using the PIM chip during the pre-filling stage, and to write the computed first key-value cache into the DRAM chip.
[0036] The second read/write module is used to read the weights and the first key-value cache respectively using the PIM chip during the decoding stage, and to write the second key-value cache of the token generated by the large model into the DRAM chip.
[0037] Furthermore, to achieve the above objectives, the present invention also provides a controller, wherein the controller includes: a memory, a processor, and a data read/write program for large model inference based on hybrid bonding that is stored in the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the data read/write method for large model inference based on hybrid bonding as described above.
[0038] Furthermore, to achieve the above objectives, the present invention also provides a computer-readable storage medium, wherein the computer-readable storage medium stores a data read/write program for large model inference based on hybrid bonding, and the program, when executed by a processor, implements the steps of the data read/write method for large model inference based on hybrid bonding as described above.
[0039] In this invention, a DRAM chip, a FLASH chip, and a PIM chip are constructed, and the PIM chip is connected to the DRAM chip and the FLASH chip via hybrid bonding. During large model inference, the weights of the large model are stored in the FLASH chip via a data bus, and the prompts of the large model are stored in the DRAM chip via the same data bus. In the pre-filling stage, the PIM chip reads the weights and the prompts, and the computed first key-value cache is written to the DRAM chip. In the decoding stage, the PIM chip reads the weights and the first key-value cache, and the second key-value cache of the token generated by the large model is written to the DRAM chip. Because the processing engines in the PIM chip only need to read data from the physically closest FLASH and DRAM chips, latency is reduced. Model weight data and key-value cache data also no longer need to be routed over a bus, which reduces data movement and power consumption.
Attached Figure Description
[0040] Figure 1 is a flowchart of a preferred embodiment of the data read/write method for large model inference based on hybrid bonding of the present invention;
[0041] Figure 2 is a schematic diagram of the first hybrid bonding configuration of a preferred embodiment of the data read/write method for large model inference based on hybrid bonding of the present invention;
[0042] Figure 3 is a schematic diagram of the second hybrid bonding configuration of a preferred embodiment of the data read/write method for large model inference based on hybrid bonding of the present invention;
[0043] Figure 4 is a schematic diagram of the third hybrid bonding configuration of a preferred embodiment of the data read/write method for large model inference based on hybrid bonding of the present invention;
[0044] Figure 5 is a schematic diagram showing the connection between the DRAM chip and the PIM chip in a preferred embodiment of the data read/write method for large model inference based on hybrid bonding of the present invention;
[0045] Figure 6 is a schematic diagram showing the connection between the FLASH chip and the PIM chip in a preferred embodiment of the data read/write method for large model inference based on hybrid bonding of the present invention;
[0046] Figure 7 is a schematic diagram showing the connection between the processing engines and the DRAM and FLASH in a preferred embodiment of the data read/write method for large model inference based on hybrid bonding of the present invention;
[0047] Figure 8 is a structural diagram of a preferred embodiment of the data read/write system for large model inference based on hybrid bonding of the present invention;
[0048] Figure 9 is a structural diagram of a preferred embodiment of the controller of the present invention.
Detailed Implementation
[0049] To make the objectives, technical solutions, and advantages of this invention clearer and more explicit, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
[0050] Large Language Models (LLMs) are widely used in numerous fields such as natural language processing, image recognition, and voice interaction. With the continuous increase in model size and the growing complexity of application scenarios, the efficiency of LLM inference has become a key factor restricting its development. Therefore, accelerating large model inference is crucial for improving system response speed, reducing latency, and expanding application scenarios. A significant advantage of Processing-in-Memory (PIM) computation is the reduction of the overhead of repeatedly moving data between storage locations, making it a key option for LLM inference deployment.
[0051] To address the aforementioned issues, this invention presents a near-memory computing acceleration card for large model inference based on TSV (Through-Silicon Via) hybrid bonding, which integrates the PIM near-memory computing processor, DRAM memory, and FLASH memory in a 3D stacked manner. A large-capacity FLASH memory is used to store model weights, while multi-layer DRAM memory is used to store the KV cache (key-value cache), thus solving the storage capacity problem. The FLASH memory and the PIM near-memory computing processor use TSV hybrid bonding to achieve high-bandwidth reading of model weights, while the multi-layer DRAM memory and the PIM near-memory computing processor use TSV hybrid bonding to achieve high-bandwidth reading and writing of the KV cache.
[0052] In the preferred embodiment of the data read/write method for large model inference based on hybrid bonding described in this invention, as shown in Figure 1, the method includes the following steps:
[0053] Step S10: Construct a DRAM chip, a FLASH chip, and a PIM chip, and construct a hybrid bonding connection between the PIM chip and the DRAM chip and the FLASH chip, respectively.
[0054] The DRAM chip includes multiple memory blocks, each of which is independent of the others.
[0055] The FLASH chip includes multiple FLASH blocks, and each FLASH block is independent of the others;
[0056] The PIM chip includes multiple processing engines, multiple DRAM controllers, and multiple FLASH controllers.
[0057] From a storage layer perspective, the multi-block design of DRAM and FLASH clusters breaks down a single storage medium into multiple independent storage units. Combined with a fully connected controller configuration, this allows each storage block to be independently scheduled and accessed. Taking DRAM clusters as an example, when multiple processing engines simultaneously initiate data requests, the DRAM controllers can process read and write operations on different memory blocks in parallel, avoiding the "serial queuing" latency of traditional single-controller architectures. Similarly, FLASH clusters, through full connectivity of multiple blocks and multiple controllers, can effectively balance the differences in read and write performance between flash memory media. Parallel operations mask the erase and write latency of a single FLASH block, significantly improving overall IOPS (Input/Output Operations Per Second) and throughput.
[0058] Furthermore, at the compute-storage collaboration level, the fully connected design of the processing engine and all storage clusters allows each processing engine to directly access any storage unit without going through intermediate forwarding nodes. This "flattened" data access path eliminates the communication bottleneck between the storage controller and compute nodes in traditional architectures. For example, in high-performance computing scenarios, multiple processing engines can simultaneously read computational data from the DRAM cluster and write intermediate results to the FLASH cluster. The entire process does not require waiting for data relay, resulting in an order-of-magnitude improvement in the execution efficiency of computing tasks.
[0059] As shown in Figure 2, Figure 3, and Figure 4, the present invention integrates a PIM near-memory computing processor, a DRAM memory, and a FLASH memory in a 3D stacked manner, wherein the PIM chip is hybrid-bonded between the DRAM chip and the FLASH chip, or, alternatively, the PIM chip is hybrid-bonded to both sides of the DRAM chip or the FLASH chip.
[0060] Specifically, as shown in Figure 2(a) and Figure 2(b), the PIM near-memory computing processor is hybrid-bonded to the DRAM memory and the FLASH memory respectively, and is physically stacked between the DRAM memory and the FLASH memory. As shown in Figure 3(a) and Figure 3(b), the PIM near-memory computing processor is hybrid-bonded to the DRAM memory and the FLASH memory respectively and is stacked on top of the DRAM memory and the FLASH memory, and the positions of the DRAM memory and the FLASH memory can be interchanged. As shown in Figure 4(a) and Figure 4(b), the PIM near-memory computing processor is hybrid-bonded to the DRAM memory and the FLASH memory respectively and is stacked at the bottom of the DRAM memory and the FLASH memory, and the positions of the DRAM memory and the FLASH memory can be interchanged.
[0061] In traditional architectures, the computing units communicate with memory via a bus, resulting in severe bandwidth limitations and latency issues. However, the PIM die (referring to the PIM chip disclosed in this invention) directly interconnects with the DRAM die (referring to the DRAM chip disclosed in this invention) and the FLASH die (referring to the FLASH chip disclosed in this invention) at a micrometer-level pitch through hybrid bonding, which can effectively solve this problem. For example, in the HBM-PIM (High Bandwidth Memory - Processing in Memory) architecture, the PIM unit is embedded in the DRAM stack, which can directly perform simple calculations at the memory layer, avoiding a large amount of data transfer.
[0062] Furthermore, the energy consumed by long-distance data transmission between chips is much higher than that of local computing. Through hybrid bonding between the DRAM chip, the FLASH chip, and the PIM chip, this invention shortens the signal transmission paths to short-distance vertical interconnects. This significantly reduces I/O drive voltage and power consumption, reduces idle cycles caused by data movement, and improves overall system energy efficiency.
[0063] Specifically, a DRAM cluster is constructed using one or more of the memory blocks, and a FLASH cluster is constructed using one or more of the FLASH blocks;
[0064] Each memory block is connected to each DRAM controller;
[0065] Each of the FLASH blocks is connected to each of the FLASH controllers;
[0066] Each of the processing engines is connected to each of the DRAM clusters and each of the FLASH clusters, respectively.
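The connection rules in [0063] to [0066] can be pictured as a small data structure. The following is only a minimal sketch under stated assumptions: the names `Topology`, `group`, and `build_topology` are hypothetical, blocks are assumed to be grouped into clusters contiguously, and the block, controller, and PE counts are arbitrary illustrative values. It follows the full connectivity stated in [0064] to [0066]; paragraph [0069] describes an embodiment in which each PE is connected to exactly one cluster of each kind.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Topology:
    dram_clusters: tuple      # each cluster is a tuple of memory-block ids
    flash_clusters: tuple     # each cluster is a tuple of FLASH-block ids
    dram_links: frozenset     # (memory block, DRAM controller) pairs
    flash_links: frozenset    # (FLASH block, FLASH controller) pairs
    pe_links: frozenset       # (PE, kind, cluster index) triples


def group(n_blocks, per_cluster):
    """Group block ids 0..n_blocks-1 into contiguous clusters (an assumption)."""
    return tuple(tuple(range(i, min(i + per_cluster, n_blocks)))
                 for i in range(0, n_blocks, per_cluster))


def build_topology(n_pe, n_dram_blocks, n_flash_blocks,
                   n_dram_ctrl, n_flash_ctrl, per_cluster=2):
    dram_clusters = group(n_dram_blocks, per_cluster)
    flash_clusters = group(n_flash_blocks, per_cluster)
    # [0064]/[0065]: every block is connected to every controller of its kind.
    dram_links = frozenset(product(range(n_dram_blocks), range(n_dram_ctrl)))
    flash_links = frozenset(product(range(n_flash_blocks), range(n_flash_ctrl)))
    # [0066]: every PE is connected to every DRAM cluster and every FLASH cluster.
    pe_links = frozenset(
        [(pe, "dram", c) for pe in range(n_pe) for c in range(len(dram_clusters))]
        + [(pe, "flash", c) for pe in range(n_pe) for c in range(len(flash_clusters))]
    )
    return Topology(dram_clusters, flash_clusters, dram_links, flash_links, pe_links)


# Illustrative sizes only: 2 PEs, 4 blocks and 2 controllers of each kind.
topo = build_topology(n_pe=2, n_dram_blocks=4, n_flash_blocks=4,
                      n_dram_ctrl=2, n_flash_ctrl=2)
```

With these illustrative sizes, the four memory blocks form two DRAM clusters, and each block has one link to each of the two DRAM controllers.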
[0067] As shown in Figure 5, the storage of a DRAM die is divided into multiple memory blocks of the same capacity, which are independent of each other. The DRAM controller is integrated into the PIM die, and the PIM die accesses the memory blocks on the DRAM die through the DRAM controller. Each memory block is connected to a corresponding independent DRAM controller on the PIM die.
[0068] Furthermore, as shown in Figure 6, the storage of a FLASH die is divided into multiple FLASH blocks of equal capacity, and these FLASH blocks are independent of each other. The FLASH controller is integrated into the PIM die, and the PIM die accesses the FLASH blocks on the FLASH die through the FLASH controller. Each FLASH block is connected to a corresponding independent FLASH controller on the PIM die.
[0069] Furthermore, as shown in Figure 7, the computing unit of the PIM die is the PE (Processing Engine), and multiple PEs exist on the PIM die. The bandwidth required for PE computation typically does not match the bandwidth provided by a single FLASH controller. Therefore, one or more FLASH controllers are needed to form a FLASH cluster (i.e., the FLASH cluster disclosed in this invention), whose bandwidth matches the bandwidth required for PE computation. Similarly, one or more DRAM controllers form a DRAM cluster (i.e., the DRAM cluster disclosed in this invention). Each PE is directly connected to only one DRAM cluster and one FLASH cluster. Through this distributed direct-connection architecture, PEs do not need to go through a bus to read FLASH or DRAM data, reducing latency and power consumption. Moreover, because the bandwidth required for PE computation matches the bandwidth of the FLASH cluster or DRAM cluster, the architecture avoids both PEs sitting idle while waiting for data and low bandwidth utilization.
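The bandwidth matching described in [0069] comes down to simple arithmetic: a cluster needs enough controllers for their combined bandwidth to cover what one PE consumes. The sketch below is illustrative only. The function names and the bandwidth figures (64 GB/s per PE, 16 GB/s per controller) are hypothetical and do not come from this disclosure, and the sketch assumes controller bandwidths simply add up.

```python
import math


def controllers_per_cluster(pe_bandwidth_gb_s, controller_bandwidth_gb_s):
    """Smallest number of controllers whose combined bandwidth covers one PE."""
    if pe_bandwidth_gb_s <= 0 or controller_bandwidth_gb_s <= 0:
        raise ValueError("bandwidths must be positive")
    return math.ceil(pe_bandwidth_gb_s / controller_bandwidth_gb_s)


def utilization(pe_bandwidth_gb_s, controller_bandwidth_gb_s):
    """Fraction of the cluster's bandwidth that the PE actually consumes."""
    n = controllers_per_cluster(pe_bandwidth_gb_s, controller_bandwidth_gb_s)
    return pe_bandwidth_gb_s / (n * controller_bandwidth_gb_s)


# Hypothetical figures: a PE that needs 64 GB/s, fed by 16 GB/s controllers.
n_flash = controllers_per_cluster(64, 16)
u = utilization(64, 16)
```

A utilization close to 1 corresponds to the matched case described above. A value well below 1 means the cluster is over-provisioned for the PE, while a cluster with fewer controllers than computed would leave the PE waiting for data.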
[0070] First, there is media redundancy. Both DRAM and FLASH clusters consist of multiple independent storage blocks. When a storage block experiences a hardware failure, the system can automatically switch data access requests to other normal storage blocks via the controller, achieving a "seamless failover." For example, if a memory block on the DRAM chip is damaged, the DRAM controller can quickly remove that memory block from the available resource pool and simultaneously restore the data on that block to other normal memory blocks using redundant copies. The entire process is completely transparent to the processing engine and will not cause any interruption to the computing task.
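The failover behaviour in [0070] can be sketched as a toy block pool. This is only an illustration under assumptions not stated in this disclosure: the class `BlockPool` is hypothetical, every item is assumed to be written to exactly two blocks, and the least-loaded healthy block is chosen as the restore target.

```python
class BlockPool:
    """Toy model of a controller managing independent storage blocks."""

    def __init__(self, n_blocks):
        self.healthy = set(range(n_blocks))
        self.data = {b: {} for b in range(n_blocks)}   # block -> {key: value}
        self.copies = {}                               # key -> set of blocks

    def _least_loaded(self, candidates):
        # Deterministic order: fewest stored items first, then lowest block id.
        return sorted(candidates, key=lambda b: (len(self.data[b]), b))

    def write(self, key, value):
        if len(self.healthy) < 2:
            raise RuntimeError("need two healthy blocks for a redundant write")
        primary, replica = self._least_loaded(self.healthy)[:2]
        self.data[primary][key] = value
        self.data[replica][key] = value
        self.copies[key] = {primary, replica}

    def read(self, key):
        # Any healthy copy serves the read; the caller never sees the failure.
        block = min(self.copies[key] & self.healthy)
        return self.data[block][key]

    def fail(self, block):
        """Drop a failed block from the pool and rebuild its copies elsewhere."""
        self.healthy.discard(block)
        lost_keys = list(self.data.pop(block, {}))
        for key in lost_keys:
            survivors = self.copies[key] - {block}
            self.copies[key] = survivors
            if not survivors:
                continue  # no redundant copy left; the item cannot be restored
            value = self.data[min(survivors)][key]     # read the redundant copy
            spare = self._least_loaded(self.healthy - survivors)
            if spare:
                self.data[spare[0]][key] = value
                self.copies[key] = survivors | {spare[0]}


pool = BlockPool(4)
pool.write("kv0", "a")
pool.write("kv1", "b")
pool.fail(0)              # block 0 held one copy of "kv0"
value = pool.read("kv0")  # still readable from the surviving copy
```

After `fail(0)`, the item that lived on block 0 is again stored on two healthy blocks, which mirrors the restore-from-redundant-copy step described above.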
[0071] Secondly, there is controller redundancy. The design, with memory blocks fully connected to all DRAM controllers and FLASH blocks fully connected to all FLASH controllers, ensures that each storage unit has multiple data access paths. When one controller fails, other controllers can immediately take over the storage resources it manages, preventing the entire storage system from collapsing due to a "single point of failure." This controller-level redundancy design, combined with media redundancy in the storage blocks, forms a "double insurance," increasing the system's mean time between failures (MTBF) to several times that of traditional architectures.
[0072] Furthermore, this architecture facilitates data verification and recovery. The processing engine can access multiple storage blocks in parallel to perform multi-copy verification of data, promptly identifying and correcting errors during data transmission or storage. In data recovery scenarios, multiple processing engines can work collaboratively to read redundant data from different storage blocks and complete data reconstruction in parallel, significantly shortening fault recovery time.
[0073] Furthermore, since each storage block and controller is fully connected, the system can dynamically adjust resource usage strategies based on real-time load conditions. For example, when the system is under low load, some storage blocks and controllers can be put into hibernation mode to reduce energy consumption; when the load increases, hibernating resources are quickly woken up to ensure performance requirements are met. This dynamic resource scheduling avoids the resource waste problem of "overkill" in traditional architectures.
[0074] Meanwhile, the fully connected design of the processing engine and storage clusters allows the system to intelligently schedule caching based on data frequency. Frequently accessed "hot data" can be stored in DRAM clusters for low-latency access, while less frequently accessed "cold data" can be migrated to FLASH clusters to reduce storage costs. Through this tiered data storage strategy, the system significantly reduces overall storage energy consumption and costs while meeting performance requirements.
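The tiered placement in [0074] can be illustrated with a frequency count. This is a minimal sketch, assuming that "hot" simply means the most frequently accessed items in a recorded access log and that the DRAM tier has a fixed number of slots; the function name `place_by_frequency` and the sample log are hypothetical.

```python
from collections import Counter


def place_by_frequency(access_log, dram_slots):
    """Return (hot, cold): the dram_slots most-accessed keys go to DRAM."""
    counts = Counter(access_log)
    # Sort by descending access count, then by key for a deterministic result.
    ranked = sorted(counts, key=lambda k: (-counts[k], k))
    return set(ranked[:dram_slots]), set(ranked[dram_slots:])


# Hypothetical access log: "a" is read three times, "b" twice, "c" and "d" once.
log = ["a", "b", "a", "c", "a", "b", "d"]
hot, cold = place_by_frequency(log, dram_slots=2)
```

Here `hot` holds the keys that would stay in the DRAM clusters and `cold` the keys that could be migrated to the FLASH clusters.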
[0075] Step S20: When the large model is performing inference, the weights of the large model are stored in the FLASH chip through the data bus, and the prompts of the large model are stored in the DRAM chip through the data bus.
[0076] Specifically, when the large model performs inference, the weights of the large model are divided into multiple parts and stored in multiple FLASH clusters via a data bus;
[0077] The prompts of the large model are divided into multiple parts and stored in multiple DRAM clusters via a data bus.
[0078] In the embodiments disclosed in this invention, the numbers of PEs, DRAM clusters, and FLASH clusters are all the same, defined as N, and they correspond one-to-one. The weights of the LLM are divided into N parts, written by the controller through a universal serial bus, and stored in the N FLASH clusters respectively (FLASH is non-volatile storage, so the weights only need to be rewritten when they are updated or the model is switched).
[0079] Before LLM inference begins, the prompts are divided into N parts, written by the controller through the universal serial bus, and stored in the N DRAM clusters respectively. The LLM can then enter the pre-filling stage and the decoding stage.
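The division into N parts described in [0078] and [0079] can be sketched as follows. The disclosure does not specify how the weights or prompts are divided, so the contiguous, near-equal split below is an assumption, and the function name `split_into_parts` and the sample data are hypothetical stand-ins.

```python
def split_into_parts(items, n):
    """Split a sequence into n contiguous parts whose sizes differ by at most 1."""
    if n <= 0:
        raise ValueError("n must be positive")
    base, extra = divmod(len(items), n)
    parts, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        parts.append(items[start:start + size])
        start += size
    return parts


N = 4  # number of PEs = DRAM clusters = FLASH clusters, per [0078]
weights = list(range(10))                            # stand-in for weight data
prompt = ["The", "quick", "brown", "fox", "jumps"]   # stand-in prompt tokens

flash_clusters = split_into_parts(weights, N)   # part i goes to FLASH cluster i
dram_clusters = split_into_parts(prompt, N)     # part i goes to DRAM cluster i
```

Because part i of the weights and part i of the prompts land in the clusters attached to PE i, the one-to-one correspondence described above is preserved.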
[0080] The DRAM die, FLASH die, and PIM die are stacked using TSV hybrid bonding: the large-capacity FLASH stores the model weights of the LLM, and the multi-layer DRAM stores the KV cache. TSV hybrid bonding thereby provides high bandwidth and low latency.
[0081] Step S30: In the pre-filling stage, the PIM chip is used to read the weights and the prompts respectively, and the computed first key-value cache is written into the DRAM chip.
[0082] Specifically, in the pre-filling stage, each of the processing engines reads the weight from the corresponding FLASH cluster and reads the prompt words for each part from the corresponding DRAM cluster;
[0083] Multiple first key-value caches are generated based on the weight of each part and the prompt words of each part;
[0084] Each of the processing engines writes all of the first key-value caches to the corresponding DRAM cluster.
[0085] In the pre-filling stage, the weights of the LLM are used for forward computation to generate the initial KV cache: each PE reads the model weights only from its directly connected FLASH cluster and reads the prompts only from its directly connected DRAM cluster, and then writes the computed KV cache to its directly connected DRAM cluster.
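The per-PE data flow of the pre-filling stage can be sketched as below. This is a toy illustration, not the implementation of this invention: the dictionaries standing in for FLASH and DRAM clusters, the key and value projection matrices `w_k` and `w_v`, and the tiny integer vectors are all hypothetical, and threads merely stand in for PEs running in parallel. How partial results from different PEs are combined is not described here and is not modelled.

```python
from concurrent.futures import ThreadPoolExecutor


def matvec(matrix, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def prefill_on_pe(flash_cluster, dram_cluster):
    """One PE: read its weight part and prompt part, write its KV cache back."""
    w_k, w_v = flash_cluster["w_k"], flash_cluster["w_v"]   # read from FLASH
    prompt_part = dram_cluster["prompt"]                     # read from DRAM
    kv = [(matvec(w_k, x), matvec(w_v, x)) for x in prompt_part]
    dram_cluster["kv_cache"] = kv                            # write to DRAM
    return len(kv)


# N = 2 PEs, each with one FLASH cluster and one DRAM cluster (toy values).
flash = [
    {"w_k": [[1, 0], [0, 1]], "w_v": [[2, 0], [0, 2]]},
    {"w_k": [[0, 1], [1, 0]], "w_v": [[1, 1], [1, -1]]},
]
dram = [
    {"prompt": [[1, 2], [3, 4]]},   # two embedded prompt tokens for PE 0
    {"prompt": [[5, 6]]},           # one embedded prompt token for PE 1
]

# Each PE only touches its own pair of clusters, so the PEs can run in parallel.
with ThreadPoolExecutor(max_workers=len(flash)) as pool:
    written = list(pool.map(prefill_on_pe, flash, dram))
```

Each call touches only the cluster pair passed to it, which is the property that lets the PEs proceed without routing data over a shared bus.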
[0086] The high degree of parallelism between PEs improves computational efficiency and bandwidth utilization, fully leveraging the 2 TB/s bandwidth provided by FLASH and the 1 TB/s bandwidth provided by DRAM. Furthermore, PEs only need to read data from the physically closest FLASH and DRAM, reducing latency. Model weight data and prompts do not need to go through bus routing, reducing data movement and lowering power consumption.
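To give a sense of scale for the bandwidth figures above, the following back-of-the-envelope sketch computes how long it takes to stream a given amount of data at 2 TB/s and 1 TB/s. The data sizes (70 GB of weights, 10 GB of KV cache) are hypothetical and are not taken from this disclosure; the sketch also assumes decimal units (1 TB = 10^12 bytes) and that the stated bandwidth is fully sustained.

```python
def transfer_time_s(n_bytes, bandwidth_bytes_per_s):
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return n_bytes / bandwidth_bytes_per_s


GB = 10**9
TB = 10**12

# Hypothetical workload: 70 GB of weights read from FLASH at 2 TB/s,
# and 10 GB of KV cache read from DRAM at 1 TB/s.
weights_time = transfer_time_s(70 * GB, 2 * TB)   # 0.035 s
kv_time = transfer_time_s(10 * GB, 1 * TB)        # 0.010 s
```

Under these assumptions, one full pass over the weights takes about 35 ms, which bounds the per-pass time from below whenever every weight must be read once.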
[0087] Furthermore, by utilizing each processing engine to access the FLASH cluster for weight reading and the DRAM cluster for prompt word reading, this separate read strategy optimizes the data access path. By directly connecting the processing engine to the corresponding storage cluster, parallel data loading can be achieved, reducing data transfer latency and improving overall processing efficiency. Simultaneously, this design leverages the advantages of different storage media: the non-volatility and large-capacity storage capacity of FLASH, and the high-speed random access capability of DRAM.
[0088] Generating a key-value cache within the processing engine significantly reduces redundant computations and data transfers. Offloading computational tasks closer to memory avoids moving large amounts of intermediate results from DRAM to the CPU or dedicated processing units, thus reducing bandwidth pressure and improving computational efficiency. Simultaneously, the caching mechanism accelerates subsequent query and matching processes, especially in scenarios requiring frequent access to the same or similar data. Writing the generated key-value cache back to the DRAM cluster enables rapid data reuse and sharing. Since key-value caches typically need to be shared across multiple processing units, storing them in DRAM ensures efficient access by all processing engines, avoiding the latency of multiple reads from FLASH. Furthermore, this write-back operation optimizes memory layout, allowing subsequent computations to more efficiently utilize cached data, further improving system performance. The synergistic effect of these processes enables the PIM architecture to achieve efficient, low-latency data processing and computation within a hybrid-bonded storage structure.
[0089] Step S40: In the decoding stage, the PIM chip is used to read the weight and the first key-value cache respectively, and the second key-value cache of the token generated by the large model is written into the DRAM chip.
[0090] Specifically, during the decoding stage, each of the processing engines reads the weight from the corresponding FLASH cluster and reads the first key-value cache from the corresponding DRAM cluster;
[0091] When the large model generates the output of the forward computation, the processing engine writes the second key-value cache of that output into each of the DRAM clusters.
[0092] In the decoding phase, the model weights are used for forward computation to generate a token (i.e., the output result disclosed in this invention), and the key-value cache of this token is stored. Each PE reads model weights only from its directly connected FLASH cluster and reads the key-value cache only from its directly connected DRAM cluster. Each PE stores the key-value cache of the new token into its directly connected DRAM cluster. High parallelism among PEs improves computational efficiency and bandwidth utilization, fully utilizing the 2 TB/s bandwidth provided by FLASH and the 1 TB/s bandwidth provided by DRAM. Furthermore, PEs only need to read data from the physically closest FLASH and DRAM, reducing latency. Model weight data and key-value cache data do not need to go through bus routing, reducing data movement and lowering power consumption.
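The decoding data flow can be illustrated with a single-head attention step that reuses a key-value cache. This is a generic textbook-style sketch rather than the computation claimed in this invention: the projection matrices `w_q`, `w_k`, and `w_v`, the dictionaries standing in for one FLASH cluster and one DRAM cluster, and the toy vectors are all hypothetical.

```python
import math


def matvec(matrix, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def decode_step(flash_cluster, dram_cluster, x):
    """One PE, one token: extend the KV cache and attend over it."""
    q = matvec(flash_cluster["w_q"], x)      # weights are read from FLASH
    k = matvec(flash_cluster["w_k"], x)
    v = matvec(flash_cluster["w_v"], x)
    cache = dram_cluster["kv_cache"]         # existing KV cache read from DRAM
    cache.append((k, v))                     # new token's KV written to DRAM
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, key)) for key, _ in cache]
    peak = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    # Attention output: softmax-weighted sum of the cached value vectors.
    return [sum(e / total * val[i] for e, (_, val) in zip(exps, cache))
            for i in range(len(v))]


identity = [[1.0, 0.0], [0.0, 1.0]]
flash_cluster = {"w_q": identity, "w_k": identity, "w_v": identity}
dram_cluster = {"kv_cache": [([1.0, 0.0], [1.0, 0.0])]}   # from pre-filling

out = decode_step(flash_cluster, dram_cluster, [0.0, 0.0])
```

In this toy call the query is the zero vector, so attention is uniform over the two cached entries and the output is the average of their value vectors. The cache grows by one entry per generated token, which is the write into DRAM described above.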
[0093] Writing the generated second key-value cache back to the DRAM cluster enables rapid data reuse and sharing. Since key-value caches typically need to be shared across multiple processing units, storing them in DRAM ensures efficient access by all processing engines, avoiding the latency of multiple reads from FLASH. Furthermore, this write-back operation optimizes memory layout, allowing subsequent computations to utilize the cached data more efficiently, further improving system performance.
[0094] The hybrid bonding disclosed in this invention can integrate dies with different process nodes and functional characteristics (such as logic PIM dies + high-speed DRAM dies or high-density FLASH dies). DRAM dies are suitable for cache-type PIMs for real-time data stream processing; FLASH dies are suitable for non-volatile PIMs, enabling "storage as computing" and are suitable for low-power wake-up computing in edge devices. Compared with traditional packaging technologies, this invention can achieve shorter electrical paths, reduce crosstalk and noise, and has higher mechanical stability and thermal cycling durability, as well as more uniform stress distribution, thus improving long-term operational reliability.
[0095] This invention utilizes the processing engine in the PIM chip, which only needs to read data from the physically closest FLASH and DRAM, reducing latency. It also eliminates the need to read model weight data and key-value cache data through bus routing, reducing data movement and power consumption.
[0096] Furthermore, as shown in Figure 8, based on the above-described data read / write method for large model inference based on hybrid bonding, the present invention also provides a data read / write system for large model inference based on hybrid bonding, which includes:
[0097] The connection module 51 is used to construct a DRAM chip, a FLASH chip and a PIM chip, and to construct hybrid bonding connections between the PIM chip and the DRAM chip and the FLASH chip respectively;
[0098] The data storage module 52 is used to store the weights of the large model into the FLASH chip via the data bus and the prompts of the large model into the DRAM chip via the data bus when the large model is performing inference;
[0099] The first read / write module 53 is used to read the weights and the prompts respectively using the PIM chip during the pre-filling stage, and to write the calculated first key-value cache into the DRAM chip;
[0100] The second read / write module 54 is used to read the weights and the first key-value cache respectively using the PIM chip during the decoding stage, and to write the second key-value cache of the token generated by the large model into the DRAM chip.
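The division of work among modules 51 to 54 can be pictured in software roughly as follows. This is only a sketch: the class name, method names, and the toy computations passed in as callables are hypothetical, and plain dictionaries stand in for the FLASH and DRAM chips.

```python
class HybridBondedInferenceSystem:
    """Hypothetical software view of modules 51-54 (all names are illustrative)."""

    def __init__(self):
        self.flash = {}      # stands in for the FLASH chip
        self.dram = {}       # stands in for the DRAM chip
        self.bonded = False  # whether the hybrid bonding connections exist

    def connect(self):
        # Connection module 51: establish the PIM-DRAM and PIM-FLASH links.
        self.bonded = True

    def store(self, weights, prompt):
        # Data storage module 52: weights go to FLASH, prompts go to DRAM.
        self.flash["weights"] = weights
        self.dram["prompt"] = prompt

    def prefill(self, compute_kv):
        # First read/write module 53: read weights and prompt, write first KV cache.
        self.dram["kv_cache"] = compute_kv(self.flash["weights"], self.dram["prompt"])

    def decode(self, step):
        # Second read/write module 54: read weights and KV cache, append the new entry.
        token, new_kv = step(self.flash["weights"], self.dram["kv_cache"])
        self.dram["kv_cache"] = self.dram["kv_cache"] + [new_kv]
        return token


system = HybridBondedInferenceSystem()
system.connect()
system.store(weights=2, prompt=[1, 2, 3])
# Toy stand-ins for the model's computations (assumptions, not the real math).
system.prefill(lambda w, p: [w * x for x in p])
token = system.decode(lambda w, kv: (sum(kv), w * sum(kv)))
```

The callables keep the sketch independent of any particular model; only the order of reads and writes follows the module descriptions.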
[0101] Furthermore, as shown in Figure 9, based on the above-mentioned data read / write method and system for large model inference based on hybrid bonding, the present invention also provides a controller, which includes a processor 10, a memory 20 and a display 30. Figure 9 shows only some of the controller's components; it should be understood that not all of the components shown are required, and more or fewer components may be implemented instead.
[0102] In some embodiments, the memory 20 may be an internal storage unit of the controller, such as the controller's hard drive or memory. In other embodiments, the memory 20 may be an external storage device of the controller, such as a plug-in hard drive, Smart Media Card (SMC), Secure Digital (SD) card, or Flash Card equipped on the controller. Further, the memory 20 may include both internal and external storage units of the controller. The memory 20 is used to store application software and various types of data installed on the controller, such as the program code installed on the controller. The memory 20 can also be used to temporarily store data that has been output or will be output. In one embodiment, the memory 20 stores a data read / write program 40 based on hybrid bonding large model inference, which can be executed by the processor 10 to implement the data read / write method based on hybrid bonding large model inference in this application.
[0103] In some embodiments, the processor 10 may be a central processing unit (CPU), a microprocessor, or other data processing chip, used to run program code stored in the memory 20 or process data, such as executing the data read / write method based on hybrid bonding large model inference.
[0104] In some embodiments, the display 30 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, or an OLED (Organic Light-Emitting Diode) touchscreen. The display 30 is used to display information from the controller and to display a visual user interface. Components of the controller communicate with each other via a system bus.
[0105] In one embodiment, when the processor 10 executes the data read / write program 40 based on hybrid bonding large model inference stored in the memory 20, the following steps are performed:
[0106] Construct a DRAM chip, a FLASH chip, and a PIM chip, and construct hybrid bonding connections between the PIM chip and the DRAM chip and the FLASH chip, respectively;
[0107] When the large model performs inference, the weights of the large model are stored in the FLASH chip through the data bus, and the prompts of the large model are stored in the DRAM chip through the data bus.
[0108] During the pre-filling stage, the PIM chip is used to read the weights and the prompts respectively, and the calculated first key-value cache is written into the DRAM chip;
[0109] During the decoding stage, the PIM chip is used to read the weights and the first key-value cache respectively, and the second key-value cache of the token generated by the large model is written into the DRAM chip.
[0110] The DRAM chip includes multiple memory blocks, each of which is independent of the others.
[0111] The FLASH chip includes multiple FLASH blocks, and each FLASH block is independent of the others;
[0112] The PIM chip includes multiple processing engines, multiple DRAM controllers, and multiple FLASH controllers.
[0113] The hybrid bonding connection method between the PIM chip and the DRAM chip and the FLASH chip includes:
[0114] The PIM chip is hybrid-bonded between the DRAM chip and the FLASH chip;
[0115] The PIM chip is hybrid-bonded to both sides of the DRAM chip or the FLASH chip.
[0116] Specifically, the construction of the PIM chip and its hybrid bonding connection with the DRAM chip and FLASH chip includes:
[0117] A DRAM cluster is constructed using one or more of the memory blocks, and a FLASH cluster is constructed using one or more of the FLASH blocks;
[0118] Each memory block is connected to each DRAM controller;
[0119] Each of the FLASH blocks is connected to each of the FLASH controllers;
[0120] Each of the processing engines is connected to each of the DRAM clusters and each of the FLASH clusters, respectively.
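The grouping of blocks into clusters and the attachment of clusters to processing engines can be sketched as a small data model. The sketch assumes that clusters are formed from consecutive blocks of equal count and that processing engine i is paired with DRAM cluster i and FLASH cluster i. The text does not fix either choice, and all class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    kind: str        # "DRAM" or "FLASH"
    block_ids: list  # indices of the memory or FLASH blocks grouped into this cluster


@dataclass
class ProcessingEngine:
    pe_id: int
    dram: Cluster    # DRAM cluster connected to this processing engine
    flash: Cluster   # FLASH cluster connected to this processing engine


def build_clusters(kind, num_blocks, blocks_per_cluster):
    """Group consecutive block indices into clusters of equal size."""
    if num_blocks % blocks_per_cluster != 0:
        raise ValueError("num_blocks must be a multiple of blocks_per_cluster")
    return [
        Cluster(kind, list(range(start, start + blocks_per_cluster)))
        for start in range(0, num_blocks, blocks_per_cluster)
    ]


def build_engines(dram_clusters, flash_clusters):
    """Pair engine i with DRAM cluster i and FLASH cluster i (assumed 1:1 mapping)."""
    if len(dram_clusters) != len(flash_clusters):
        raise ValueError("expected one DRAM and one FLASH cluster per engine")
    return [
        ProcessingEngine(i, dram, flash)
        for i, (dram, flash) in enumerate(zip(dram_clusters, flash_clusters))
    ]


# Hypothetical sizes: 8 memory blocks in pairs, 4 FLASH blocks one per cluster.
engines = build_engines(
    build_clusters("DRAM", num_blocks=8, blocks_per_cluster=2),
    build_clusters("FLASH", num_blocks=4, blocks_per_cluster=1),
)
```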
[0121] Specifically, storing the weights of the large model in the FLASH chip via the data bus and the prompts of the large model in the DRAM chip via the data bus when the large model performs inference includes:
[0122] When the large model performs inference, the weights of the large model are divided into multiple parts and stored in multiple FLASH clusters through a data bus;
[0123] The prompts of the large model are divided into multiple parts and stored in multiple DRAM clusters via a data bus.
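One simple way to divide the weights and the prompts into parts is a contiguous, near-equal split with one part per cluster, as sketched below. The text does not specify the partitioning rule, so the rule itself, the helper name `split_into_parts`, and the sample data are assumptions made for illustration.

```python
def split_into_parts(items, num_parts):
    """Split a sequence into num_parts contiguous parts whose sizes differ by at most one."""
    if num_parts <= 0:
        raise ValueError("num_parts must be positive")
    base, extra = divmod(len(items), num_parts)
    parts, start = [], 0
    for i in range(num_parts):
        size = base + (1 if i < extra else 0)  # the first `extra` parts get one more item
        parts.append(items[start:start + size])
        start += size
    return parts


# Hypothetical data: six weight tensors (by name) and a five-token prompt.
weights = [f"layer{i}.w" for i in range(6)]
prompt_tokens = ["The", "quick", "brown", "fox", "jumps"]

flash_parts = split_into_parts(weights, 3)       # one part per FLASH cluster
dram_parts = split_into_parts(prompt_tokens, 3)  # one part per DRAM cluster
```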
[0124] Specifically, in the pre-filling stage, using the PIM chip to read the weights and the prompts respectively and writing the calculated first key-value cache into the DRAM chip includes:
[0125] During the pre-filling phase, each of the processing engines reads the weights of its part from the corresponding FLASH cluster and reads the prompts of its part from the corresponding DRAM cluster;
[0126] Multiple first key-value caches are generated based on the weights of each part and the prompts of each part;
[0127] Each of the processing engines writes its first key-value cache to the corresponding DRAM cluster.
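The data flow of the pre-filling steps above can be sketched as follows. The sketch models only which cluster each processing engine reads from and writes to. The scalar multiplications are a toy stand-in for the real key and value projections, and the dictionary layout and function name are hypothetical.

```python
def prefill(flash_clusters, dram_clusters):
    """Run a toy pre-filling pass with strictly local reads and writes.

    Engine i reads only flash_clusters[i] and dram_clusters[i], then writes
    its first key-value cache back into dram_clusters[i].
    """
    for flash, dram in zip(flash_clusters, dram_clusters):
        w_k, w_v = flash["weights"]  # local read from the FLASH cluster
        # Toy projection (assumption): one (key, value) pair per prompt element.
        dram["kv_cache"] = [(x * w_k, x * w_v) for x in dram["prompt"]]


# Hypothetical contents for two engines.
flash_clusters = [{"weights": (2.0, 3.0)}, {"weights": (0.5, 4.0)}]
dram_clusters = [{"prompt": [1.0, 2.0]}, {"prompt": [4.0]}]
prefill(flash_clusters, dram_clusters)
```

No engine touches another engine's cluster in this loop, which mirrors the claim that no data needs to cross a shared bus during pre-filling.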
[0128] Specifically, in the decoding stage, the PIM chip is used to read the weights and the first key-value cache respectively, and the second key-value cache of the token generated by the large model is written into the DRAM chip, which includes:
[0129] During the decoding phase, each of the processing engines reads the weight from the corresponding FLASH cluster and reads the first key-value cache from the corresponding DRAM cluster;
[0130] When the large model generates the output of the forward computation, the processing engines write the second key-value cache of that output into each of the DRAM clusters.
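A single decoding step following the description above might be sketched like this. The arithmetic is a toy stand-in for the forward computation, the sum over engines is an assumed way of combining partial results, and the dictionary layout and function name are hypothetical.

```python
def decode_step(flash_clusters, dram_clusters):
    """Run one toy decoding step and return the generated output value."""
    # Step 1: each engine reads its local weights and local first key-value cache.
    partials = []
    for flash, dram in zip(flash_clusters, dram_clusters):
        w_k, w_v = flash["weights"]
        partials.append(sum(k * w_k + v * w_v for k, v in dram["kv_cache"]))
    # Step 2: toy reduction standing in for the output of the forward computation.
    output = sum(partials)
    # Step 3: each engine appends the second key-value cache of that output locally.
    for flash, dram in zip(flash_clusters, dram_clusters):
        w_k, w_v = flash["weights"]
        dram["kv_cache"].append((output * w_k, output * w_v))
    return output


# Hypothetical state left behind by pre-filling for two engines.
flash_clusters = [{"weights": (1.0, 2.0)}, {"weights": (3.0, 1.0)}]
dram_clusters = [{"kv_cache": [(1.0, 1.0)]}, {"kv_cache": [(2.0, 0.0)]}]
token = decode_step(flash_clusters, dram_clusters)
```

Calling `decode_step` repeatedly grows each local cache by one entry per generated token, so later steps read both the first and the second key-value caches from the same DRAM cluster.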
[0131] The present invention also provides a computer-readable storage medium, wherein the computer-readable storage medium stores a data read / write program based on hybrid bonding large model inference, and the data read / write program based on hybrid bonding large model inference implements the steps of the data read / write method based on hybrid bonding large model inference as described above when executed by a processor.
[0132] In summary, this invention provides a data read / write method and related equipment for large model inference based on hybrid bonding. The method includes: constructing a DRAM chip, a FLASH chip, and a PIM chip, and constructing hybrid bonding connections between the PIM chip and the DRAM chip and the FLASH chip respectively; when the large model performs inference, storing the weights of the large model in the FLASH chip through a data bus and the prompts of the large model in the DRAM chip through the data bus; in the pre-filling stage, reading the weights and the prompts respectively using the PIM chip, and writing the calculated first key-value cache into the DRAM chip; in the decoding stage, reading the weights and the first key-value cache respectively using the PIM chip, and writing the second key-value cache of the token generated by the large model into the DRAM chip. Because the processing engines in the PIM chip only need to read data from the physically closest FLASH and DRAM, latency is reduced; because model weight data and key-value cache data do not need to be read through bus routing, data movement and power consumption are reduced.
[0133] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or controller that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or controller. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or controller that includes that element.
[0134] Of course, those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware (such as a processor, controller, etc.). The program can be stored in a computer-readable storage medium, and when executed, it can include the processes described in the above method embodiments. The computer-readable storage medium can be a memory, magnetic disk, optical disk, etc.
[0135] It should be understood that the application of the present invention is not limited to the examples above. Those skilled in the art can make improvements or modifications based on the above description, and all such improvements and modifications should fall within the protection scope of the appended claims.
Claims
1. A data read / write method for large model inference based on hybrid bonding, characterized in that, The data read / write method for large model inference based on hybrid bonding includes: Constructing a DRAM chip, a FLASH chip, and a PIM chip, wherein the DRAM chip includes multiple memory blocks and the PIM chip includes multiple processing engines; The construction of the hybrid bonding connections between the PIM chip and the DRAM chip and the FLASH chip specifically includes: A DRAM cluster is constructed using one or more of the memory blocks, and a FLASH cluster is constructed using one or more of the FLASH blocks; Each of the memory blocks is connected to each of the DRAM controllers; Each of the FLASH blocks is connected to each of the FLASH controllers; Each of the processing engines is connected to each of the DRAM clusters and each of the FLASH clusters, respectively; When the large model performs inference, the weights of the large model are stored in the FLASH chip via the data bus, and the prompts of the large model are stored in the DRAM chip via the data bus, specifically including: When the large model performs inference, the weights of the large model are divided into multiple parts and stored in multiple FLASH clusters through a data bus; The prompts of the large model are divided into multiple parts and stored in multiple DRAM clusters via a data bus; In the pre-filling stage, the PIM chip is used to read the weights and the prompts respectively, and the calculated first key-value cache is written into the DRAM chip, specifically including: During the pre-filling phase, each of the processing engines reads the weights of its part from the corresponding FLASH cluster and reads the prompts of its part from the corresponding DRAM cluster; Multiple first key-value caches are generated based on the weights of each part and the prompts of each part; Each of the processing engines writes its first key-value cache into the corresponding DRAM cluster; During the decoding stage, the PIM chip is used to read the weights and the first key-value cache respectively, and the second key-value cache of the token generated by the large model is written into the DRAM chip, specifically including: During the decoding phase, each of the processing engines reads the weights from the corresponding FLASH cluster and reads the first key-value cache from the corresponding DRAM cluster; When the large model generates the output of the forward computation, the processing engines write the second key-value cache of that output into each of the DRAM clusters.
2. The data read / write method for large model inference based on hybrid bonding according to claim 1, characterized in that, Each of the aforementioned memory blocks is independent of the others; The FLASH chip includes multiple FLASH blocks, and each FLASH block is independent of the others; The PIM chip includes multiple DRAM controllers and multiple FLASH controllers.
3. The data read / write method for large model inference based on hybrid bonding according to claim 2, characterized in that, The hybrid bonding connection method between the PIM chip and the DRAM chip and the FLASH chip includes: The PIM chip is hybrid-bonded between the DRAM chip and the FLASH chip; The PIM chip is hybrid-bonded to both sides of the DRAM chip or the FLASH chip.
4. A data read / write system for large model inference based on hybrid bonding, characterized in that, The data read / write system for large model inference based on hybrid bonding is used to implement the data read / write method for large model inference based on hybrid bonding as described in any one of claims 1-3, wherein the data read / write system for large model inference based on hybrid bonding includes: A connection module, used to construct a DRAM chip, a FLASH chip and a PIM chip, and to construct hybrid bonding connections between the PIM chip and the DRAM chip and the FLASH chip respectively; A data storage module, used to store the weights of the large model into the FLASH chip via the data bus and the prompts of the large model into the DRAM chip via the data bus when the large model is performing inference; A first read / write module, used to read the weights and the prompts respectively using the PIM chip during the pre-filling stage, and to write the calculated first key-value cache into the DRAM chip; A second read / write module, used to read the weights and the first key-value cache respectively using the PIM chip during the decoding stage, and to write the second key-value cache of the token generated by the large model into the DRAM chip.
5. A controller, characterized in that, The controller includes: a memory, a processor, and a data read / write program based on hybrid bonding large model inference stored in the memory and executable on the processor. When the data read / write program based on hybrid bonding large model inference is executed by the processor, it implements the steps of the data read / write method based on hybrid bonding large model inference as described in any one of claims 1-3.
6. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a data read / write program based on hybrid bonding large model inference, which, when executed by a processor, implements the steps of the data read / write method based on hybrid bonding large model inference as described in any one of claims 1-3.