Cache management method, electronic device, storage medium and computer program product
A storage node composed of video memory and persistent memory stores the KV cache of a large model; the data blocks of the cache are mapped to multiple buckets, and an index table is maintained for each bucket. This addresses the high hardware cost, limited capacity, and storage performance bottleneck of existing KV cache management technologies, enabling more efficient KV cache access and improving large model inference efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA MOBILE (SUZHOU) SOFTWARE TECH CO LTD
- Filing Date
- 2026-04-27
- Publication Date
- 2026-06-26
AI Technical Summary
In existing technologies, KV cache management solutions based on HBM and disks suffer from high hardware costs, limited capacity, prominent storage performance bottlenecks, low data migration efficiency, and insufficient scalability, resulting in high KV cache access latency for large models and reduced inference efficiency.
The storage node, composed of video memory and persistent memory, maps the key-value cache to multiple buckets and maintains an index table. The key-value cache is located based on the index table, which shields the differences in storage media, realizes unified access logic, and dynamically adjusts the position of data blocks in the storage node to optimize data flow.
This reduces the access latency of large models to the KV cache, improves the access efficiency of the KV cache, thereby improving the inference efficiency of large models, reducing hardware costs, and increasing storage capacity.
Figure CN122285296A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of data processing, and more particularly to a cache management method, electronic device, storage medium, and computer program product.
Background Technology
[0002] In related technologies, the key-value (KV) cache is stored in high-bandwidth memory (HBM), and storage capacity is expanded with hard disks, so that the stored KV cache provides data support for the inference tasks of large models. However, in practical applications, the access latency of large models to the stored KV cache is high, which reduces the inference efficiency of large models.
Summary of the Invention
[0003] To address the related technical issues, embodiments of this application provide a cache management method, an electronic device, a storage medium, and a computer program product.
[0004] The technical solution of the embodiments of this application is implemented as follows. This application provides a cache management method, the method comprising:
Loading the key-value (KV) cache generated during the inference process of a first large model into a first storage node, where the first storage node consists of a first video memory and persistent memory;
Mapping multiple data blocks in the KV cache to multiple buckets, and maintaining an index table for each of the multiple buckets, where the one or more data blocks in a bucket correspond to the same first business attribute, each first index in an index table has a mapping relationship with the first information corresponding to a data block in the corresponding bucket, and the first information describes the storage location of the corresponding data block in the first storage node;
Based on a first inference requirement of the first large model, locating a first index table from the multiple index tables corresponding to the multiple buckets, and determining, in the first index table, the first index corresponding to a first KV cache, so as to access the first KV cache in the first storage node based on the first information corresponding to the determined first index, where the first KV cache represents the KV cache that the first large model needs to access in order to fulfil the first inference requirement, and the first inference requirement describes the business attribute corresponding to the first KV cache.
[0005] The method in the above scheme further includes: if the inference task volume of the first large model exceeds a set first task volume threshold, adding a new first bucket to the multiple buckets; and/or, if the inference task volume of the first large model is less than a set second task volume threshold, merging second buckets among the multiple buckets. The first task volume threshold is greater than the second task volume threshold.
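The two-threshold rule for adding and merging buckets can be illustrated with a minimal Python sketch. The class name `ScalingPolicy`, its field names, and the returned action strings are hypothetical and are not taken from this application; the sketch only shows that keeping the first threshold above the second leaves a band in which the bucket set is unchanged.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical helper: decides whether to add or merge buckets."""
    grow_threshold: int    # plays the role of the first task volume threshold
    shrink_threshold: int  # plays the role of the second task volume threshold

    def __post_init__(self) -> None:
        # The first threshold must be greater than the second one.
        if self.grow_threshold <= self.shrink_threshold:
            raise ValueError("grow_threshold must exceed shrink_threshold")

    def decide(self, task_volume: int) -> str:
        """Return 'add_bucket', 'merge_buckets', or 'keep'."""
        if task_volume > self.grow_threshold:
            return "add_bucket"
        if task_volume < self.shrink_threshold:
            return "merge_buckets"
        return "keep"  # between the thresholds nothing changes
```

Because the thresholds do not coincide, a task volume that hovers around a single value does not cause buckets to be added and merged repeatedly.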
[0006] The method in the above scheme further includes: migrating a second KV cache in the persistent memory to the first video memory, where the access popularity of the second KV cache is greater than a set first popularity threshold; and/or, migrating a third KV cache in the first video memory to the persistent memory, where the access popularity of the third KV cache is less than a set second popularity threshold. The first popularity threshold is greater than the second popularity threshold, and the access popularity is determined based on one or more of the following: the access frequency and the criticality of the corresponding KV cache.
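As a rough illustration of this migration rule, the Python sketch below combines access frequency and criticality into one score and compares it with two thresholds. The weighted sum, the default weights, the tier names, and the function names are all assumptions made for illustration; the application does not specify how the two factors are combined.

```python
def access_popularity(frequency: float, criticality: float,
                      w_freq: float = 0.5, w_crit: float = 0.5) -> float:
    """Assumed scoring: a weighted sum of access frequency and criticality."""
    return w_freq * frequency + w_crit * criticality


def plan_migration(tier: str, popularity: float,
                   promote_above: float, demote_below: float) -> str:
    """Decide where a KV cache should go, given its current tier.

    tier is 'pmem' or 'video_memory' (hypothetical labels).
    promote_above / demote_below stand in for the first / second thresholds.
    """
    if promote_above <= demote_below:
        raise ValueError("promote_above must exceed demote_below")
    if tier == "pmem" and popularity > promote_above:
        return "promote_to_video_memory"
    if tier == "video_memory" and popularity < demote_below:
        return "demote_to_pmem"
    return "stay"
```

For example, `plan_migration("pmem", access_popularity(10, 2), 5.0, 1.0)` yields a promotion, because the score of 6.0 is above the upper threshold.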
[0007] The method in the above scheme further includes: when initializing the first large model, loading first KV data into the first video memory, where the first KV data represents hot data predicted for the KV cache accessed by the first large model; and/or, before the first large model performs inference in response to a model input, selecting a fourth KV cache from the persistent memory based on the historical inference data of the first large model, and migrating the fourth KV cache from the persistent memory to the first video memory, where the fourth KV cache represents the KV cache accessed by the first large model in response to a first model input, and the frequency with which the first model input occurred during historical inference is greater than a set first input frequency threshold.
[0008] The method in the above scheme further includes: while the first large model generates a first token (word element) for a first dialogue, performing the following processing asynchronously: predicting the KV cache that the first large model needs to access when generating a second token, to obtain a fifth KV cache; and, if the fifth KV cache is located in the persistent memory, migrating the fifth KV cache from the persistent memory to the first video memory. The second token is the token that follows the first token.
[0009] In the above scheme, predicting the KV cache that the first large model needs to access when generating the second token (word element), to obtain the fifth KV cache, includes: calling a first prediction model to process multiple third tokens to obtain the fifth KV cache, where the first prediction model is used to predict the KV cache that the first large model needs to access, and the multiple third tokens are a set first number of tokens that the first large model generated for the first dialogue before the first token.
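The asynchronous prefetch described in the two preceding paragraphs might look roughly like the following Python sketch. Everything here is a stand-in: `predict_next_block` is a toy replacement for the first prediction model, `tier_of` is a hypothetical dictionary recording which medium holds each block, and the `asyncio.sleep(0)` calls are placeholders for the real copy and the real token generation.

```python
import asyncio


def predict_next_block(recent_tokens: list) -> str:
    """Toy stand-in for the first prediction model (assumption)."""
    return "block_for_" + "_".join(recent_tokens[-2:])


async def prefetch(recent_tokens: list, tier_of: dict) -> str:
    """Predict the next needed block and promote it if it sits in PMEM."""
    block_id = predict_next_block(recent_tokens)
    if tier_of.get(block_id) == "pmem":
        await asyncio.sleep(0)  # placeholder for the PMEM -> video memory copy
        tier_of[block_id] = "video_memory"
    return block_id


async def generate_token(recent_tokens: list, tier_of: dict) -> str:
    """Generate the current token while the prefetch runs concurrently."""
    task = asyncio.create_task(prefetch(recent_tokens, tier_of))
    await asyncio.sleep(0)  # placeholder for the model producing the token
    token = "tok%d" % len(recent_tokens)
    await task  # awaited here only so the example finishes deterministically
    return token
```

Running `asyncio.run(generate_token(["a", "b"], {"block_for_a_b": "pmem"}))` leaves the predicted block marked as resident in video memory before the next token is requested.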
[0010] In the above scheme, migrating the corresponding KV cache includes: migrating a second number of data blocks of the KV cache, where the second number is determined based on the remaining storage space of the first video memory and/or the persistent memory.
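One simple way to derive the second number from the remaining space is sketched below in Python. The function name and the assumption of fixed-size blocks are illustrative only; the application does not state how the number is computed.

```python
def blocks_to_migrate(requested_blocks: int, block_size_bytes: int,
                      free_bytes_in_target: int) -> int:
    """Cap a migration at the number of whole blocks the target can hold."""
    if block_size_bytes <= 0:
        raise ValueError("block_size_bytes must be positive")
    fits = max(free_bytes_in_target, 0) // block_size_bytes
    return min(requested_blocks, fits)
```

With 4096-byte blocks and 10000 free bytes in the target medium, a request to move 8 blocks is reduced to 2.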
[0011] This application also provides an electronic device, including: a processor and a memory for storing a computer program capable of running on the processor. Wherein, when the processor is used to run the computer program, it executes the steps of any of the aforementioned methods.
[0012] This application also provides a storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of any of the aforementioned methods.
[0013] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of any of the aforementioned methods.
[0014] In this embodiment, the KV cache generated by the first large model during inference is loaded into the first storage node, which consists of first video memory and persistent memory. Then, multiple data blocks in the KV cache are mapped to multiple buckets, and an index table is maintained for each bucket. Here, one or more data blocks in a bucket correspond to the same first business attribute, and each first index in the index table has a mapping relationship with the first information corresponding to a data block in the corresponding bucket. The first information is used to describe the storage location of the corresponding data block in the first storage node. Subsequently, based on the first inference requirement of the first large model, the first index table is located from the multiple index tables corresponding to the multiple buckets, and the first index corresponding to the first KV cache is determined in the first index table. The first KV cache is then accessed in the first storage node based on the first information corresponding to the determined first index. The first KV cache represents the KV cache that the first large model needs to access in order to realize the first inference requirement, and the first inference requirement is used to describe the business attribute corresponding to the first KV cache. In the above scheme, a first storage node is composed of a first video memory and persistent memory. Compared with disks in related technologies, persistent memory has higher read and write efficiency, thereby increasing storage capacity while reducing the access latency of the first large model to the KV cache in the first storage node. Furthermore, in the above scheme, the KV cache is mapped to multiple buckets. Each bucket's index table has a mapping relationship between the first index and the physical storage location of the KV cache. 
Thus, the first index masks the underlying differences between video memory and persistent memory, allowing the first large model to use a unified access logic for the KV cache in different storage media based on the first index, improving the access efficiency of the KV cache. Further, in the above scheme, one or more data blocks in a single bucket correspond to the same specific business attribute. Therefore, during the inference process of the first large model, based on the business attribute described by the first inference requirement, the corresponding bucket can be quickly located, allowing index lookup only in that bucket, narrowing the index lookup range and improving the access efficiency of the KV cache. Therefore, compared with related technologies, the solution in this application reduces the access latency of large models to the stored KV cache and improves the inference efficiency of large models.
Attached Figure Description
[0015] Figure 1 is a schematic diagram of the implementation process of a cache management method provided in an embodiment of this application;
Figure 2 is a schematic diagram of the architecture of a cache management system provided in an application embodiment of this application;
Figure 3 is a schematic diagram of storage layer processing provided in an application embodiment of this application;
Figure 4 is a schematic diagram of another storage layer processing method provided in an application embodiment of this application;
Figure 5 is a schematic diagram of data flow provided in an application embodiment of this application;
Figure 6 is a schematic diagram of the structure of a cache management device provided in an embodiment of this application;
Figure 7 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0016] With the rapid development of artificial intelligence technology, large-scale models are being applied in various scenarios, such as autonomous driving, financial risk control, and machine translation. Data, as the foundation for running these large models, is experiencing explosive growth in scale and complexity. Key-value (KV) caches, as core data assisting in large-scale model inference, play a crucial role in rapidly responding to read and write requests. In practical applications, as the number of model parameters in large models exceeds hundreds of billions, the storage capacity required for KV caches has risen to the terabyte (TB) level.
[0017] In related technologies, HBM is used to store the key-value cache generated by large models, and the storage capacity is expanded by hard disk, so as to provide data support for the inference task of large models based on the stored key-value cache.
[0018] In practical applications, the following two schemes are mainly used to manage KV cache.
[0019] Option 1: HBM storage solution.
[0020] This approach stores the key-value (KV) cache directly in the HBM. HBM boasts high bandwidth and a short communication link with the graphics processing unit (GPU). When the data size corresponding to the KV cache is small, such as in the early stages of inference for large models, HBM can quickly respond to read and write requests for the KV cache.
[0021] Option 2: Disk-based KV cache offloading solution.
[0022] This solution offloads a portion of the KV cache to a disk storage medium, such as a hard disk drive (HDD) or a solid-state drive (SSD), thereby expanding the storage capacity available to the KV cache. In practical applications, the portion of the KV cache containing hot data can be retained in HBM, while the portion containing cold data can be migrated to disk through the operating system page cache or a custom offload module.
[0023] However, Option 1 has the following drawbacks:
HBM is expensive: as the scale of large models increases, the data size of the KV cache increases dramatically, requiring more HBM for storage, which leads to a significant increase in hardware costs and limits the adoption of large model applications.
[0024] HBM has limited capacity: HBM struggles to meet the ever-increasing KV cache data storage requirements of large-scale models during inference. When the amount of data corresponding to the KV cache required by a large model exceeds the capacity of HBM, it severely impacts inference performance, such as inference efficiency, and may even prevent inference from taking place.
[0025] Option 2 has the following disadvantages:
Prominent storage performance bottleneck: disk access latency is much higher than that of HBM. For example, the random read/write latency of an HDD is typically 5 milliseconds (ms) to 10 ms. Although the read/write latency of an SSD is lower, at 100 microseconds (µs) to 200 µs, it still cannot meet the demand of millions of KV cache accesses per second for large models, resulting in a significant increase in inference latency.
[0026] Data migration is inefficient: disk-based offloading relies on the operating system's page cache, and transferring the KV cache requires multiple memory copies. For example, the CPU overhead of migrating each megabyte (MB) of data reaches thousands of cycles. Frequent migrations during offloading therefore consume a large amount of CPU resources.
[0027] Reliability and consistency issues: HDDs are prone to mechanical failure, and SSDs have a limited write lifespan. Multi-level storage also requires cache flushes to ensure consistency. Strong consistency introduces additional access latency, for example 5 ms to 10 ms, while weak consistency may lead to data inconsistency.
[0028] Insufficient scalability: Expanding disk arrays requires downtime to rebuild a redundant array of independent disks (RAID), which cannot adapt to the dynamic growth of KV cache capacity requirements of large models, and hardware upgrades are prone to service interruptions, introducing extremely long access latency.
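The disk latency figures quoted for Option 2 can be put in perspective with a short calculation, sketched here in Python. The function names are hypothetical, and the model is deliberately simplified: it assumes each access takes exactly the quoted latency and applies Little's law (concurrency = rate × latency), whereas real devices have queueing and parallelism effects that this ignores.

```python
def serial_rate_per_second(latency_us: int) -> int:
    """Accesses per second when only one request is outstanding at a time."""
    return 1_000_000 // latency_us


def outstanding_requests_needed(target_rate_per_second: int,
                                latency_us: int) -> int:
    """Little's law: requests that must be in flight to sustain the rate."""
    in_flight_scaled = target_rate_per_second * latency_us
    return (in_flight_scaled + 999_999) // 1_000_000  # round up


ssd_serial = serial_rate_per_second(100)    # 100 us per access
hdd_serial = serial_rate_per_second(5_000)  # 5 ms per access
ssd_in_flight = outstanding_requests_needed(1_000_000, 100)
```

Under these assumptions, a 100 µs device serves 10,000 serial accesses per second and a 5 ms device serves 200, so sustaining one million accesses per second at 100 µs would require about 100 requests in flight at all times.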
[0029] Therefore, based on the KV cache management schemes in related technologies, large models have high access latency to the stored KV cache, which reduces the inference efficiency of large models.
[0030] Based on this, in this embodiment, the KV cache generated by the first large model during inference is loaded into the first storage node, wherein the first storage node consists of first video memory and persistent memory; then, multiple data blocks in the KV cache are mapped to multiple buckets, and an index table is maintained for each of the multiple buckets; here, one or more data blocks in the buckets correspond to the same first business attribute, and each first index in the index table has a mapping relationship with the first information corresponding to a data block in the corresponding bucket, the first information being used to describe the storage location of the corresponding data block in the first storage node; then, based on the first inference requirement of the first large model, the first index table is located from the multiple index tables corresponding to the multiple buckets, and the first index corresponding to the first KV cache is determined in the first index table, so as to access the first KV cache in the first storage node based on the first information corresponding to the determined first index; the first KV cache represents the KV cache that the first large model needs to access in order to realize the first inference requirement, and the first inference requirement is used to describe the business attribute corresponding to the first KV cache. In the above scheme, a first storage node is composed of a first video memory and persistent memory. Compared with disks in related technologies, persistent memory has higher read and write efficiency, thereby increasing storage capacity while reducing the access latency of the first large model to the KV cache in the first storage node. Furthermore, in the above scheme, the KV cache is mapped to multiple buckets. Each bucket's index table has a mapping relationship between the first index and the physical storage location of the KV cache. 
Thus, the first index masks the underlying differences between video memory and persistent memory, allowing the first large model to use a unified access logic for the KV cache in different storage media based on the first index, improving the access efficiency of the KV cache. Further, in the above scheme, one or more data blocks in a single bucket correspond to the same specific business attribute. Therefore, during the inference process of the first large model, based on the business attribute described by the first inference requirement, the corresponding bucket can be quickly located, allowing index lookup only in that bucket, narrowing the index lookup range and improving the access efficiency of the KV cache. Therefore, compared with related technologies, the solution in this application reduces the access latency of large models to the stored KV cache and improves the inference efficiency of large models.
[0031] The present application will now be described in further detail with reference to the accompanying drawings and embodiments.
[0032] This application provides a cache management method. As shown in Figure 1, the method includes:
Step 101: Load the KV cache generated during the inference process of the first large model into the first storage node.
[0033] The first storage node consists of a first video memory and persistent memory.
[0034] In practical applications, large models can generate key and value vectors during inference. These vectors together form key-value (KV) data. Large models can then directly reuse this pre-generated KV data in subsequent inference processes, thus accelerating the inference process. The KV data generated during inference can be considered pre-cached data relative to subsequent inference processes; therefore, KV data can also be viewed as a KV cache. This KV cache can also be called KVCache or KVCache data.
[0035] In practical applications, when the data represented by the KV cache is divided at the block level, multiple data blocks can be obtained. Based on this, a KV cache can be considered as containing multiple data blocks. Some data blocks in a KV cache can still be considered as constituting the KV cache itself, and a KV cache composed of some data blocks can be considered as a part of a KV cache composed of all data blocks. Specific attributes of a KV cache, such as business attributes or access frequency, can also be considered as attributes of the data blocks corresponding to that KV cache.
[0036] Here, the KV cache generated by the first large model is loaded into the first storage node. In practical applications, this can also be regarded as storing the KV cache into the first storage node, or as writing the KV cache into the first storage node.
[0037] Here, the first storage node consists of the first video memory and persistent memory, which is equivalent to storing the KV cache in the first video memory and persistent memory of the first storage node.
[0038] In practical applications, the storage location of each data block in the KV cache within the first storage node can be adjusted in real time based on the operation of the first large model and/or the storage status of the first storage node, thereby enabling data flow for the KV cache within the first storage node. For example, a specific portion of the KV cache can be migrated from the first video memory to persistent memory, and/or a specific portion of the KV cache can be migrated from persistent memory to the first video memory.
[0039] In practical applications, the first video memory can have the characteristics of high bandwidth and low access latency, which can adapt to the high-frequency access of large models to the KV cache. For example, the first video memory may include HBM.
[0040] In practical applications, persistent memory can also be called PMEM (Persistent Memory), and can have characteristics such as large capacity, persistence, low cost, and high performance.
[0041] Persistent memory can have a single device capacity of up to terabytes, and can be used to store key-value caches that need to be retained for a long time or are sensitive to capacity.
[0042] For example, persistent memory can be used to store key-value caches corresponding to the historical context of long conversations. For instance, it can be used to store key-value caches accumulated based on questions asked by users a few days ago and / or multiple rounds of conversations, in order to support conversation continuation and context backtracking functions for large models.
[0043] For example, persistent memory can be used to store key-value caches corresponding to intermediate checkpoints in multi-branch inference. For instance, it can be used to store the intermediate state when code is generated up to a certain line, so that it can be reused directly after inference is interrupted without recalculation.
[0044] Data in persistent memory can be retained even when power is off, and it supports byte addressing and low-latency persistence. By combining appropriate instructions and architecture design, persistent memory can ensure data persistence, thereby achieving long-term persistence of the key-value cache.
[0045] Persistent memory (PM) offers performance between Dynamic Random Access Memory (DRAM) and SSDs, boasting read and write speeds close to DRAM and one to two orders of magnitude faster than SSDs. PM supports byte addressing and direct CPU instruction manipulation. With a well-designed architecture, write amplification and space amplification can be reduced, thereby minimizing resource overhead. This allows for high-performance read and write operations while helping to control the Total Cost of Ownership (TCO), achieving a balance between low cost and high performance.
[0046] In practical applications, parallel inference scenarios exist. For example, the first large model can execute multiple inference tasks in parallel. In parallel inference scenarios, multiple computing cores corresponding to the first large model can simultaneously read and write to the KV cache, which may lead to problems such as inconsistent read and write data. This problem can be solved by a global compare and swap (CAS) mechanism.
[0047] In the global CAS mechanism, the data version corresponding to the KV cache can be verified before each read/write operation to ensure data consistency. In parallel write scenarios, if a write conflict occurs, the party that initiated the write request can be made to retry after a random delay, which keeps write operations orderly and effective and improves the concurrent processing capability of the inference process.
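The version check and randomly delayed retry can be sketched in Python as follows. This is an illustrative single-process model, not the mechanism of this application: `VersionedBlock` and `update_with_retry` are hypothetical names, and a `threading.Lock` is used only to make the compare-and-swap step atomic in the sketch.

```python
import random
import threading
import time


class VersionedBlock:
    """A value guarded by a version counter (hypothetical illustration)."""

    def __init__(self, value):
        self._lock = threading.Lock()  # makes each CAS atomic in this sketch
        self.value = value
        self.version = 0

    def read(self):
        with self._lock:
            return self.value, self.version

    def compare_and_swap(self, expected_version, new_value) -> bool:
        """Write only if nobody else has written since expected_version."""
        with self._lock:
            if self.version != expected_version:
                return False  # write conflict detected
            self.value = new_value
            self.version += 1
            return True


def update_with_retry(block, transform, max_retries=1000, max_delay_s=0.001):
    """Apply transform to the block, retrying after a random delay on conflict."""
    for _ in range(max_retries):
        value, version = block.read()
        if block.compare_and_swap(version, transform(value)):
            return True
        time.sleep(random.uniform(0, max_delay_s))  # random back-off
    return False
```

The random delay spreads out competing writers so that they are less likely to collide again on the next attempt.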
[0048] Step 102: Map multiple data blocks in the KV cache to multiple buckets, and maintain an index table for each of the multiple buckets.
[0049] In this context, one or more data blocks in a bucket correspond to the same first business attribute; each first index in the index table has a mapping relationship with the first information corresponding to a data block in the corresponding bucket; the first information is used to describe the storage location of the corresponding data block in the first storage node.
[0050] In practical applications, mapping multiple data blocks in a key-value cache to multiple buckets can be understood as logically dividing the multiple data blocks in the key-value cache into multiple buckets. Each bucket can be understood as a collection of one or more corresponding data blocks.
[0051] When mapping multiple data blocks in the KV cache to multiple buckets, the data blocks can be hash-mapped to the buckets based on the first business attribute. Data blocks corresponding to different first business attributes can be mapped to different buckets.
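A minimal Python sketch of this mapping is shown below. It relies on Python's built-in dictionary, which is itself a hash table, keyed on the business attribute so that every bucket holds blocks with one and the same attribute. The field names `sid`, `layer`, and `block_id` are assumptions for illustration.

```python
from collections import defaultdict


def map_blocks_to_buckets(blocks: list) -> dict:
    """Group data blocks into buckets keyed by (session id, layer number)."""
    buckets = defaultdict(list)
    for block in blocks:
        attribute = (block["sid"], block["layer"])  # first business attribute
        buckets[attribute].append(block["block_id"])  # dict hashes the key
    return dict(buckets)


example_blocks = [
    {"block_id": "Z1", "sid": "SID1", "layer": 1},
    {"block_id": "Z2", "sid": "SID1", "layer": 1},
    {"block_id": "Z3", "sid": "SID2", "layer": 1},
]
example_buckets = map_blocks_to_buckets(example_blocks)
```

Here the two blocks of session `SID1` at layer 1 share a bucket, while the block of `SID2` lands in a bucket of its own.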
[0052] In practical applications, the first business attribute may include: Session Identifier (SID) and / or Model Layer Number.
[0053] If the first large model generates a data block in the KV cache during inference for a dialogue corresponding to a specific session identifier, then the session identifier corresponding to that data block can be understood as that specific session identifier. The session identifier can correspond to the user who triggered the session and can also be called the user's SID. The session identifier can also be described as a dialogue identifier.
[0054] If the first large model generates a data block in the KV cache during inference using the model layer corresponding to a specific model layer number, then the model layer number corresponding to that data block can be understood as that specific model layer number.
[0055] Here, an index table is maintained for each bucket. In practical applications, the index table corresponding to a bucket can also be called a bucket index table. Each index table can include one or more first indexes, and each first index can correspond to a data block within the bucket. The first index can be used to indicate the first business attribute of the corresponding data block. Since data blocks within the same bucket correspond to the same first business attribute, the first indexes corresponding to these data blocks can indicate the same first business attribute.
[0056] For example, when multiple data blocks in the KV cache are divided into multiple buckets based on session identifiers, the index table corresponding to one of the buckets may include the following first indexes: SID1_X1, SID1_X2, ..., SID1_Xn, where SID1 is the session identifier corresponding to the bucket.
[0057] For example, when multiple data blocks in the KV cache are divided into multiple buckets based on the model layer number, the index table corresponding to one of the buckets may include the following first indexes: Level1_Y1, Level1_Y2, ..., Level1_Yn, where Level1 is the model layer number corresponding to the bucket.
[0058] For example, when multiple data blocks in the KV cache are divided into multiple buckets based on the session identifier and the model layer number, the index table corresponding to one of the buckets may include the following first indexes: SID1_Level1_Z1, SID1_Level1_Z2, ..., SID1_Level1_Zn, where SID1 is the session identifier corresponding to the bucket and Level1 is the model layer number corresponding to the bucket.
[0059] In practical applications, the index table can also contain the first information corresponding to each first index, that is, the first information of the data block associated with that index.
[0060] Here, the first information describes the storage location of the corresponding data block in the first storage node. In practical applications, the storage location can indicate the storage medium where the data block resides, or in other words, the media type of the storage medium. The storage location can also indicate the block identifier (ID) and/or address offset of the data block.
[0061] For example, the storage location described by the first information can be "Video memory area - block identifier: 001" or "PMEM - address offset: 0x1234".
[0062] By identifying the storage location, the physical storage address of the corresponding data block within the first storage node, as well as the media type of the storage medium, can be determined. Based on this, the read/write protocol matching the media type can be invoked, so that an access request is sent precisely to the corresponding storage medium when the data block is accessed. Access to the data block can include read and/or write operations on the data block, and the corresponding access requests can be understood as read/write requests.
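Dispatching an access request by media type could be sketched as follows in Python. The two reader functions, the `READERS` table, and the in-memory `store` dictionary that stands in for the physical media are all hypothetical; real video memory and PMEM accesses would go through device-specific interfaces.

```python
def read_video_memory(store: dict, block_id: str) -> bytes:
    """Stand-in for the video memory read protocol (addressed by block ID)."""
    return store["video_memory"][block_id]


def read_pmem(store: dict, offset: int) -> bytes:
    """Stand-in for the PMEM read protocol (addressed by offset)."""
    return store["pmem"][offset]


# Maps the media type named in the storage location to its read routine.
READERS = {"video_memory": read_video_memory, "pmem": read_pmem}


def read_block(store: dict, location: tuple) -> bytes:
    """location is (media type, block ID or address offset)."""
    medium, address = location
    if medium not in READERS:
        raise ValueError("unknown media type: %r" % (medium,))
    return READERS[medium](store, address)


store = {"video_memory": {"001": b"hot"}, "pmem": {0x1234: b"cold"}}
```

The caller passes only a location such as `("pmem", 0x1234)` and never needs to know which routine serves it.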
[0063] In practical applications, the mapping relationship indicated by the index table can be dynamically adjusted. For example, if the storage location of a data block changes, the content of the first information in the index table can be updated so that the first index corresponding to the data block maps to the updated storage location. Thus, even if the storage location of the data block changes, for example through migration from one storage medium to another, the first large model can still correctly access the data block based on the index table without needing to be aware of the specific storage location change.
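A bucket index table whose entries can be updated on migration might be modelled as in the Python sketch below. `Location` plays the role of the first information and `BucketIndexTable` the role of one bucket's index table; both names, the field layout, and the method names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    """Stand-in for the first information: where a data block is stored."""
    medium: str                     # e.g. "video_memory" or "pmem"
    block_id: Optional[str] = None  # e.g. "001"
    offset: Optional[int] = None    # e.g. 0x1234


class BucketIndexTable:
    """Index table of one bucket: first index -> Location."""

    def __init__(self, attribute: str) -> None:
        self.attribute = attribute  # business attribute shared by the bucket
        self._entries = {}

    def put(self, index: str, location: Location) -> None:
        self._entries[index] = location

    def locate(self, index: str) -> Location:
        return self._entries[index]

    def relocate(self, index: str, new_location: Location) -> None:
        """Update the mapping after a block migrates; the index is unchanged."""
        if index not in self._entries:
            raise KeyError(index)
        self._entries[index] = new_location


table = BucketIndexTable("SID1_Level1")
table.put("SID1_Level1_Z1", Location("video_memory", block_id="001"))
table.relocate("SID1_Level1_Z1", Location("pmem", offset=0x1234))
```

After the relocation, a lookup with the same first index `SID1_Level1_Z1` returns the PMEM location, so the caller's access path does not change.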
[0064] Step 103: Based on the first inference requirement of the first large model, locate the first index table from the multiple index tables corresponding to multiple buckets, and determine the first index corresponding to the first KV cache in the first index table, so as to access the first KV cache in the first storage node based on the first information corresponding to the determined first index.
[0065] Here, the first KV cache represents the KV cache that the first large model needs to access in order to fulfil the first inference requirement, and the first inference requirement is used to describe the business attribute corresponding to the first KV cache.
[0066] In practical applications, the first KV cache can be understood as a portion of the KV cache stored in the first storage node. The first KV cache can consist of one or more data blocks, or in other words, it can consist of a portion of the data blocks stored in the first storage node.
[0067] The first inference requirement can describe the first business attribute corresponding to the first key-value cache. For example, the requirement content of the first inference requirement may include: performing inference based on the key-value cache corresponding to session identifier 1, and / or performing inference based on the key-value cache corresponding to model layer number 1.
[0068] Here, based on the first inference requirement, the first index table is located from the multiple index tables corresponding to the multiple buckets. In practical applications, the first business attribute corresponding to the first index table can match the first business attribute corresponding to the first KV cache. In other words, the KV cache indexed by the first index table shares the first business attribute of the first KV cache. Thus, the first index corresponding to the first KV cache only needs to be queried in the first index table, with no global traversal of all index tables, which improves lookup efficiency. The first information corresponding to the first KV cache can then be determined through the mapping relationship between the first index and the first information, yielding the storage location of the first KV cache, enabling access to it, and improving the access efficiency of the KV cache.
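The lookup described above can be sketched in Python as follows. This is a minimal illustration, not the applicant's implementation: the names IndexEntry and BucketedIndex, the attribute string "session:1", the medium labels, and the addresses are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class IndexEntry:
    """Plays the role of the first information: where a data block lives."""
    medium: str   # hypothetical labels: "vram" or "pmem"
    address: int  # hypothetical address or offset inside that medium


class BucketedIndex:
    """Keeps one index table per bucket, keyed by a business attribute."""

    def __init__(self) -> None:
        self.tables: Dict[str, Dict[str, IndexEntry]] = {}

    def put(self, attribute: str, first_index: str, entry: IndexEntry) -> None:
        self.tables.setdefault(attribute, {})[first_index] = entry

    def locate(self, attribute: str, first_index: str) -> Optional[IndexEntry]:
        # Only the table of the matching bucket is consulted;
        # the other index tables are never scanned.
        table = self.tables.get(attribute)
        if table is None:
            return None
        return table.get(first_index)


index = BucketedIndex()
index.put("session:1", "block-0", IndexEntry("vram", 0x1000))
index.put("session:2", "block-0", IndexEntry("pmem", 0x8000))
hit = index.locate("session:1", "block-0")
```

The same first index ("block-0") can appear in two buckets without conflict, because the business attribute selects the table before the index is resolved.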
[0069] When the first business attribute includes a session identifier, different session identifiers correspond to different buckets, that is, to different index tables. In this way, index queries can be performed directly in the corresponding bucket index table, avoiding global traversal and improving concurrency performance.
[0070] When the first business attribute includes the model layer number, this is equivalent to bucketing by model layer number, with the KV cache of each model layer managed independently. During the inference phase of the first large model, only the bucket index table corresponding to the required KV cache needs to be queried, which reduces the scale of index queries. When data is exchanged between layers, the mapping relationship between the model layer numbers corresponding to each bucket can be established quickly, thus adapting to the multi-layer computation of large models.
[0071] In this embodiment, a first storage node is composed of a first video memory and persistent memory. Compared to disks in related technologies, persistent memory offers higher read / write efficiency, thereby reducing the access latency of the first large model to the KV cache in the first storage node. Furthermore, in this embodiment, the KV cache is mapped to multiple buckets. Each bucket's index table has a mapping relationship between the first index and the physical storage location of the KV cache. Thus, the first index masks the underlying differences between video memory and persistent memory, allowing the first large model to use a unified access logic for KV caches in different storage media based on the first index, improving the access efficiency of the KV cache. Further, in this embodiment, one or more data blocks in a single bucket correspond to the same specific business attribute. Therefore, during the inference process of the first large model, the corresponding bucket can be quickly located based on the business attribute described by the first inference requirement, allowing index lookup only to be performed within that bucket, narrowing the index lookup range and improving the access efficiency of the KV cache. Therefore, compared with related technologies, the solution in this application reduces the access latency of large models to the stored KV cache and improves the inference efficiency of large models.
[0072] The cache management method in the embodiments of this application will be further explained below.
[0073] In one embodiment, the cache management method provided in this application further includes: if the inference task volume of the first large model exceeds the set first task volume threshold, adding a new first bucket to the multiple buckets; and / or, if the inference task volume of the first large model is less than the set second task volume threshold, merging second buckets among the multiple buckets. The first task volume threshold is greater than the second task volume threshold.
[0074] In practical applications, buckets can be dynamically added and / or merged based on fluctuations in the inference task volume of the first large model.
[0075] When the inference task volume of the first large model exceeds the set first task volume threshold, the period can be regarded as a peak business period. Adding new buckets avoids having too many first indexes in the index table of a single bucket, which in turn avoids low index query efficiency when accessing the KV cache and achieves elastic adaptation of computing resources.
[0076] When the inference task volume of the first large model is less than the set second task volume threshold, the period can be regarded as an off-peak period. Merging buckets reduces the number of buckets, thereby saving index storage and improving resource utilization. The access frequency of the data blocks in a second bucket, that is, of the KV cache in that bucket, can be low. Accordingly, a second bucket can be called an idle bucket.
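The add-and-merge decision can be sketched as below. The function name rebalance_buckets, the bucket naming scheme, and the choice to fold all idle buckets into the first idle bucket are assumptions made for illustration; the application only states the two threshold conditions.

```python
from typing import Dict, List


def rebalance_buckets(
    buckets: Dict[str, List[str]],
    task_volume: int,
    first_threshold: int,
    second_threshold: int,
    idle_buckets: List[str],
) -> Dict[str, List[str]]:
    """Return a new bucket map after one add or merge decision."""
    if first_threshold <= second_threshold:
        raise ValueError("first threshold must exceed second threshold")
    result = {name: list(blocks) for name, blocks in buckets.items()}
    if task_volume > first_threshold:
        # Peak period: add an empty bucket under an unused name.
        n = len(result)
        while f"bucket-{n}" in result:
            n += 1
        result[f"bucket-{n}"] = []
    elif task_volume < second_threshold and len(idle_buckets) >= 2:
        # Off-peak period: fold the idle buckets into the first one.
        target, *rest = idle_buckets
        for name in rest:
            result[target].extend(result.pop(name))
    return result


buckets = {"bucket-0": ["a"], "bucket-1": ["b"], "bucket-2": ["c"]}
peak = rebalance_buckets(buckets, 120, 100, 20, [])
quiet = rebalance_buckets(buckets, 5, 100, 20, ["bucket-1", "bucket-2"])
```

Between the two thresholds the bucket map is returned unchanged, which gives the hysteresis implied by requiring the first threshold to exceed the second.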
[0077] In one embodiment, the cache management method provided in this application further includes: migrating the second KV cache in persistent memory to the first video memory, where the access popularity of the second KV cache is greater than the set first popularity threshold; and / or migrating the third KV cache in the first video memory to persistent memory, where the access popularity of the third KV cache is less than the set second popularity threshold. The first popularity threshold is greater than the second popularity threshold. The access popularity is determined based on one or more of the following factors of the corresponding KV cache: access frequency and keyness.
[0078] In practical applications, during the inference process of the first large model, a background thread can asynchronously perform the following processing based on the access popularity of the KV cache in the first storage node: migrating part of the KV cache from persistent memory to the first video memory, and / or migrating part of the KV cache from the first video memory to persistent memory. That is, the KV cache can be migrated during the inference process of the first large model without blocking that inference.
[0079] Here, the access popularity corresponding to the second KV cache is greater than the set first popularity threshold, and the access popularity corresponding to the third KV cache is less than the set second popularity threshold.
[0080] In practical applications, the second KV cache can be regarded as hot data. Migrating hot data from persistent memory to the first video memory improves the access efficiency of hot data and ensures a fast response to it, thereby reducing the overall access latency of the KV cache. The migration process corresponding to the second KV cache can also be called warm-up processing.
[0081] In practical applications, warm-up processing can be triggered when a second KV cache is detected.
[0082] In practical applications, the third KV cache can be regarded as cold data, which can also be called low-popularity data. Migrating cold data from the first video memory to persistent memory frees storage space in the first video memory with minimal impact on the overall access latency of the KV cache. This balances storage space against access efficiency and achieves efficient data flow. The migration process corresponding to the third KV cache can also be called cooling-down processing.
[0083] In practical applications, cooling-down processing can be triggered when the occupancy rate of the first video memory exceeds a set first space threshold. Alternatively, it can be triggered when a third KV cache is detected.
[0084] The third KV cache can be determined using a hybrid of several strategies. For example, the comparison between access popularity and the second popularity threshold can be combined with one or more of the following strategies: First-In, First-Out (FIFO), Least Recently Used (LRU), and Least Frequently Used (LFU). For instance, a KV cache whose access popularity is less than the set second popularity threshold and which also satisfies the LRU strategy can be identified as the third KV cache, and cooling-down processing can be triggered when this third KV cache is detected.
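One way to combine a threshold test with LRU ordering is sketched below. The function select_cold_blocks, the per-block tuple of (score, last access time), and the limit parameter are hypothetical; the score stands for whichever per-block metric is compared against the second threshold.

```python
from typing import Dict, List, Tuple


def select_cold_blocks(
    stats: Dict[str, Tuple[float, float]],
    second_threshold: float,
    limit: int,
) -> List[str]:
    """Pick up to `limit` cooling candidates.

    stats maps block id -> (score, last_access_time). A block qualifies
    only if its score is below the threshold; qualifying blocks are then
    ordered least recently used first.
    """
    candidates = [
        (last_access, block_id)
        for block_id, (score, last_access) in stats.items()
        if score < second_threshold
    ]
    candidates.sort()  # oldest access time first, i.e. LRU order
    return [block_id for _, block_id in candidates[:limit]]


stats = {
    "a": (0.90, 50.0),  # above the threshold, never a candidate
    "b": (0.10, 10.0),
    "c": (0.20, 5.0),   # oldest access among the candidates
    "d": (0.05, 30.0),
}
cold = select_cold_blocks(stats, second_threshold=0.3, limit=2)
```

Swapping the sort key for an insertion time or an access count would give the FIFO or LFU variants of the same filter.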
[0085] Here, access popularity is determined based on one or more of the following in the corresponding KV cache: access frequency and keyness.
[0086] In practical applications, access frequency describes how frequently the first large model accesses the KV cache. For example, it can be determined from how often the first large model accessed the data block during historical inference, and it can be calculated using an access counter.
[0087] Keyness can be used to describe how critical a corresponding key-value cache is to the inference process. For example, keyness can be determined based on the relationship between a corresponding key-value cache and the core model input. For instance, if a particular key-value cache represents a key-value cache generated by the first large model in response to the core model input, then the keyness of that key-value cache can be high. The core model input can be understood as the model input that plays a crucial role in the output quality of the large model. If the model input is understood as a prompt, the core model input can also be called a core prompt.
[0088] In practical applications, access popularity can be used to mark the hot / cold status of the corresponding key-value cache, also known as hotspot ranking. Access popularity can be represented by a specific numerical value, also called a popularity value. The first information corresponding to the first index in the index table can also be used to describe the access popularity of the corresponding data block.
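A popularity value of this kind could be computed as in the sketch below. The weighted sum, the normalization by max_count, the default weight of 0.7, and the "neutral" label for values between the two thresholds are all assumptions; the application states only that popularity depends on access frequency and / or keyness.

```python
def access_popularity(
    access_count: int,
    keyness: float,
    max_count: int = 100,
    weight: float = 0.7,
) -> float:
    """Blend normalized access frequency with keyness into a value in [0, 1].

    keyness is expected in [0, 1]; access_count is clipped at max_count.
    """
    frequency = min(access_count, max_count) / max_count
    return weight * frequency + (1.0 - weight) * keyness


def classify(popularity: float, first_threshold: float, second_threshold: float) -> str:
    """Map a popularity value to a hot / cold label."""
    if popularity > first_threshold:
        return "hot"    # candidate for migration to video memory
    if popularity < second_threshold:
        return "cold"   # candidate for migration to persistent memory
    return "neutral"    # hypothetical label: stays where it is


core_prompt_block = access_popularity(access_count=90, keyness=1.0)
rarely_used_block = access_popularity(access_count=5, keyness=0.0)
labels = [
    classify(core_prompt_block, first_threshold=0.8, second_threshold=0.2),
    classify(rarely_used_block, first_threshold=0.8, second_threshold=0.2),
]
```

The resulting value could be stored alongside the storage location in the first information, so that a lookup returns both where a block is and how hot it is.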
[0089] In one embodiment, the cache management method provided in this application further includes: when initializing the first large model, loading the first KV data into the first video memory, where the first KV data represents predicted hot data among the KV cache accessed by the first large model; and / or, before the first large model performs inference in response to the model input, selecting a fourth KV cache from the persistent memory based on the historical inference data of the first large model, and migrating the fourth KV cache from the persistent memory to the first video memory, where the fourth KV cache represents the KV cache accessed by the first large model in response to the first model input, and the input frequency of the first model input in the historical inference process is greater than the set first input frequency threshold.
[0090] In practical applications, the initialization phase of the first large model can also be considered as its startup phase. After initialization, the first large model can perform inference in response to model input. For example, it can perform inference in response to prompts input by the user in a dialogue, thereby outputting one or more lexical units. Model input can be represented in the form of prompts.
[0091] In practical applications, the first key-value data can also be regarded as a key-value cache. The first key-value data can be stored in persistent memory or represented as data generated through prediction.
[0092] Here, during the initialization of the first large model, the first key-value data is loaded into the first video memory. This is equivalent to preloading the key-value cache that the first large model may access into the first video memory before the first large model completes its initialization. In this way, the waiting time for the first large model to generate the first token during inference, such as the input / output (IO) waiting time, is reduced, thus accelerating the startup and inference efficiency of the first large model. This process can be described as initialization preloading.
[0093] In practical applications, the fourth KV cache can be regarded as a KV cache that the first large model accesses frequently when it performs inference in response to model input. The first model input can be regarded as a model input that was used frequently in the historical inference process.
[0094] Here, before the first large model performs inference in response to model input, the fourth KV cache is migrated from persistent memory to the first video memory. This is equivalent to prefetching the KV cache that the first large model may access during inference from persistent memory to the first video memory based on the historical inference data of the first large model before inference, thereby accelerating the inference efficiency of the first large model. This process can be described as dynamic pre-loading.
[0095] In one embodiment, when the first large model generates the first lexical unit for the first dialogue, the following processing is performed asynchronously: the KV cache that the first large model needs to access when generating the second lexical unit is predicted to obtain the fifth KV cache; if the fifth KV cache is located in persistent memory, it is migrated from persistent memory to the first video memory. The second lexical unit represents the lexical unit following the first lexical unit.
[0096] In practical applications, while the first large model generates the first lexical unit for the first dialogue, a background thread can asynchronously predict the KV cache that the first large model requires to generate the next batch of lexical units, and migrate the predicted fifth KV cache to the first video memory. In other words, the prediction and migration of the fifth KV cache need not block the generation of the first lexical unit. For example, the fifth KV cache can be migrated to the first video memory using Direct Memory Access (DMA). Thus, the computation time for generating the first lexical unit can overlap with the transfer time for migrating the fifth KV cache, improving data flow efficiency. Furthermore, the first large model can access the fifth KV cache directly in the first video memory while generating the second lexical unit, improving the access efficiency of the KV cache and thus inference efficiency.
[0097] In one embodiment, predicting the KV cache that the first large model needs to access when generating the second lexical unit, to obtain the fifth KV cache, includes: calling the first prediction model to process multiple third lexical units to obtain the fifth KV cache. The first prediction model is used to predict the KV cache that the first large model needs to access. The multiple third lexical units are lexical units generated by the first large model for the first dialogue, namely a set first number of lexical units preceding the first lexical unit.
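The overlap of computation and transfer can be sketched with a background worker. Everything here is a stand-in: the dictionaries vram and pmem model the two media, predict_and_prefetch replaces both the prediction model and the DMA transfer with trivial placeholders, and generate_lexical_unit replaces the forward pass of the large model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

vram: Dict[str, bytes] = {}
pmem: Dict[str, bytes] = {"kv-7": b"k7", "kv-8": b"k8"}


def predict_and_prefetch(generated: List[str]) -> int:
    """Predict the blocks needed next and move them from pmem to vram."""
    predicted = ["kv-7", "kv-8"]  # placeholder for a prediction model
    moved = 0
    for block_id in predicted:
        if block_id in pmem:
            vram[block_id] = pmem.pop(block_id)  # placeholder for a DMA copy
            moved += 1
    return moved


def generate_lexical_unit(step: int) -> str:
    """Placeholder for the computation that produces one lexical unit."""
    return f"unit{step}"


with ThreadPoolExecutor(max_workers=1) as pool:
    # Prediction and migration run in the background ...
    future = pool.submit(predict_and_prefetch, ["unit0"])
    # ... while the first lexical unit is generated on the main thread.
    first_unit = generate_lexical_unit(1)
    # Join before the second lexical unit needs the prefetched blocks.
    moved = future.result()

second_unit_inputs = [vram[block_id] for block_id in ("kv-7", "kv-8")]
```

The only synchronization point is future.result(), placed after the first lexical unit has been produced, so generation is never blocked by the transfer.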
[0098] In practical applications, the first prediction model can include a lightweight prediction model, such as a Long Short-Term Memory (LSTM) model.
[0099] The multiple third lexical units can be regarded as lexical units generated by the first large model for the first dialogue; specifically, they are the first number of lexical units that the first large model generated before generating the first lexical unit.
[0100] Here, the first prediction model is called to process the multiple third lexical units to obtain the fifth KV cache. This is equivalent to predicting the KV cache required to generate the next batch of lexical units from the preceding first number of lexical units, while the first large model is generating the first lexical unit for the first dialogue, thereby improving the efficiency of data flow and inference.
[0101] In one embodiment, when migrating the corresponding KV cache, the cache management method provided in this application includes: Migrate a second number of data blocks from the KV cache; the second number is determined based on the remaining storage space in the first video memory and / or persistent memory.
[0102] In practical applications, during the migration of KV cache, migration can be carried out at the block level, that is, at the data block level.
[0103] Here, a second number of data blocks are migrated from the KV cache. In practical applications, this second number can be equal to or less than the number of data blocks in the KV cache to be migrated. In other words, all or part of the data blocks in the KV cache can be migrated based on the remaining storage space in the first video memory and / or persistent memory. Migrating only a portion of the data blocks reduces the amount of data migration, thus achieving a balance between storage space and data flow efficiency.
[0104] For example, when migrating the fifth KV cache from persistent memory to the first video memory, the second quantity can be determined based on the remaining storage space of the first video memory. For instance, if the remaining storage space of the first video memory is 20%, 10 data blocks in the fifth KV cache can be migrated.
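The application does not give the rule that turns remaining space into a block count, so the sketch below uses one plausible rule: spend at most a fixed share of the destination's free space. The function name blocks_to_migrate, the reserve_ratio parameter, and the sizes in the example are assumptions chosen so that the result matches the 10-block figure above.

```python
def blocks_to_migrate(
    pending_blocks: int,
    free_bytes: int,
    block_bytes: int,
    reserve_ratio: float = 0.5,
) -> int:
    """Cap the migrated block count by a share of the destination's free space."""
    if block_bytes <= 0:
        raise ValueError("block_bytes must be positive")
    budget = int(free_bytes * reserve_ratio)  # bytes this migration may use
    return max(0, min(pending_blocks, budget // block_bytes))


MIB = 2 ** 20
# Hypothetical figures: 40 MiB free (20% of a 200 MiB region), 2 MiB blocks,
# and a predicted KV cache made of 16 blocks.
count = blocks_to_migrate(pending_blocks=16, free_bytes=40 * MIB, block_bytes=2 * MIB)
```

With these figures, half of the free space (20 MiB) is budgeted, which admits 10 of the 16 blocks; a KV cache smaller than the budget is migrated in full.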
[0105] The present application will be further described in detail below with reference to application examples.
[0106] The application embodiments of this application provide a KVCache management system. As shown in Figure 2, the system can be composed of an inference engine layer and a KVCache storage layer. That is, the system adopts a two-layer collaborative architecture of "inference engine layer + KVCache storage layer".
[0107] In practical applications, the inference engine layer can serve as a central hub for computation-storage interaction, connecting the model inference logic of a large model with KVCache storage management. Specifically, the inference engine layer can have the following functions: Dynamic coordination of computing resources: The inference engine layer can connect the computing needs of large models during inference with KVCache storage. By coordinating the relevant core processing, it drives the KVCache storage layer to respond on demand, so that computing processes and cache resources operate in coordination.
[0108] KVCache intelligent scheduling: The inference engine layer can identify, in real time, the access patterns of large models to the KVCache during inference, and decide on data loading, data unloading, and dynamic migration strategies. Through popularity analysis and identification of latency-sensitive tasks, it balances storage performance against resource consumption to keep inference running smoothly. Data loading can include, for example, migrating the KVCache from PMEM to GPU memory, and data unloading can include, for example, migrating the KVCache from GPU memory to PMEM.
[0109] In practical applications, the KVCache storage layer can optimize pain points such as high data access latency, limited storage capacity, and difficulty in controlling concurrency conflicts in large model inference through index design and data hierarchical settings. The KVCache storage layer can be equivalent to the first storage node in the embodiments of this application.
[0110] In practical applications, when the KVCache storage layer is divided based on the logical structure of the data, the KVCache storage layer can include an index part and a data part.
[0111] For the index portion, a three-in-one indexing system can be constructed, consisting of a bucketed index table, a global CAS mechanism, and storage medium mapping. That is, the KVCache management scheme for the index portion relies on these three components. Together they address three questions: where to find the data, how to read and write it safely, and how to adapt to cross-media environments, thereby achieving efficient mapping from the KVCache's logical index to physical storage. The logical index can be equivalent to the first index in the embodiments of this application.
[0112] For the data portion, a two-tier storage architecture of "video memory data area + PMEM storage" can be used. The video memory data area can be used to store hot KVCache data, while PMEM can be used to store KVCache data persistently. The high throughput of video memory, together with the large capacity and persistence of PMEM, accommodates different access requirements, such as low-latency or high-capacity access.
[0113] In practical applications, when the KVCache storage layer is divided based on the physical medium of the data, the KVCache storage layer can include the HBM cache layer and the PMEM persistence layer.
[0114] The HBM cache layer can be used to store currently active KVCache data, such as KV tensors corresponding to recently generated tokens, thereby ensuring low-latency access. The HBM cache layer can be equivalent to the first video memory in the embodiments of this application.
[0115] The PMEM persistence layer can be used as an extended storage pool to store inactive or historical KVCache data, thus addressing the HBM capacity bottleneck. The PMEM persistence layer can be considered equivalent to the persistent memory in the embodiments of this application.
[0116] In practical applications, the KVCache management system can be used for data flow. Data flow can include KVCache loading and / or KVCache unloading.
[0117] For KVCache loading, when a large model requires KVCache for inference, the inference engine layer can trigger the KVCache storage layer to locate the corresponding KVCache data through the index. The large model preferentially loads hot KVCache data from GPU memory. If a cache miss occurs, the data is loaded from PMEM, and its popularity determines whether it is promoted to GPU memory.
[0118] For KVCache offloading, when the storage capacity of the GPU memory is insufficient or the access frequency decreases, the inference engine layer can drive the storage layer to offload cold KVCache data to PMEM. After the inference task of the large model is completed, the key data of the corresponding inference process can be persisted to PMEM for retention. Key data can include, for example, long dialogue context.
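The load path can be sketched as a two-tier read. The function load_block, the dictionaries standing in for GPU memory and PMEM, and the popularity table with its hot_threshold are hypothetical names introduced for illustration.

```python
from typing import Dict, Optional, Tuple


def load_block(
    block_id: str,
    vram: Dict[str, bytes],
    pmem: Dict[str, bytes],
    popularity: Dict[str, float],
    hot_threshold: float,
) -> Tuple[Optional[bytes], str]:
    """Return (data, source); promote PMEM hits that are hot enough."""
    if block_id in vram:
        return vram[block_id], "vram"  # fast path: hit in GPU memory
    data = pmem.get(block_id)
    if data is None:
        return None, "miss"            # not cached in either tier
    if popularity.get(block_id, 0.0) > hot_threshold:
        vram[block_id] = pmem.pop(block_id)  # promote hot data to GPU memory
    return data, "pmem"


vram = {"kv-1": b"one"}
pmem = {"kv-2": b"two", "kv-3": b"three"}
popularity = {"kv-2": 0.9, "kv-3": 0.1}

r1 = load_block("kv-1", vram, pmem, popularity, hot_threshold=0.5)
r2 = load_block("kv-2", vram, pmem, popularity, hot_threshold=0.5)  # promoted
r3 = load_block("kv-3", vram, pmem, popularity, hot_threshold=0.5)  # stays in pmem
r2_again = load_block("kv-2", vram, pmem, popularity, hot_threshold=0.5)
```

After the first PMEM read of a hot block, later reads of the same block are served from GPU memory, while a cold block is read from PMEM without consuming GPU memory.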
[0119] In practical applications, the system architecture of the KVCache management system based on the application embodiments of this application can be deeply adapted to large model inference scenarios, realizing efficient flow and hierarchical storage of KVCache data.
[0120] The design details of the KVCache management system are explained below.
[0121] In practical applications, referring to Figure 3, for the index portion of the KVCache storage layer, KVCache management can be based on bucketed index tables, a global CAS mechanism, and storage medium mapping.
[0122] For bucketed index tables, KVCache data can be hashed to different buckets based on specific business attributes of the KVCache. Specific business attributes may include one or more of the following: session identifier, model layer number, and inference task type.
[0123] For the buckets obtained by this mapping, an index table can be maintained for each bucket, also known as the bucket index table. The index table can record multiple indexes and their corresponding storage information, thereby realizing the "data-bucket-index" association, that is, the association between the data, buckets, and indexes of the KVCache. The storage information can include the storage address of the corresponding KVCache data, and can also include information such as the popularity flag of that data. The storage information can be equivalent to the first information in the embodiments of this application, and the storage address can be equivalent to the storage location described by the first information.
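Hash-based bucketing with per-bucket index tables can be sketched as follows. The application does not name a hash function or a key layout; SHA-256 over a "session|layer|task" string, the bucket count of 8, and the field names in the storage information are assumptions.

```python
import hashlib
from typing import Dict, List

NUM_BUCKETS = 8  # hypothetical bucket count


def bucket_for(session_id: str, layer: int, task_type: str,
               num_buckets: int = NUM_BUCKETS) -> int:
    """Map business attributes to a bucket number with a stable hash."""
    key = f"{session_id}|{layer}|{task_type}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


# One index table per bucket: logical index -> storage information.
index_tables: List[Dict[str, dict]] = [{} for _ in range(NUM_BUCKETS)]


def register(session_id: str, layer: int, task_type: str,
             logical_index: str, address: int, popularity_flag: str) -> int:
    """Record a data block in the index table of its bucket."""
    bucket = bucket_for(session_id, layer, task_type)
    index_tables[bucket][logical_index] = {
        "address": address,                  # storage address of the data
        "popularity_flag": popularity_flag,  # e.g. "hot" or "cold"
    }
    return bucket


b = register("session-1", 3, "chat", "block-0", 0x4000, "hot")
entry = index_tables[b]["block-0"]
```

A cryptographic hash is used here only because it is stable across processes, unlike Python's built-in hash for strings; any deterministic hash would serve the same purpose.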
[0124] In practical applications, buckets can be dynamically added and / or merged based on the fluctuations in the inference workload of large models.
[0125] For example, new buckets can be added during peak business periods, while idle buckets can be merged during off-peak business periods.
[0126] In practical applications, large models often involve multi-threaded or multi-GPU parallel processing in scenarios such as batch inference or model parallel processing. In such scenarios, multiple computing cores read and write KVCache at the same time, which can easily lead to problems such as inconsistent data read and write. The global CAS mechanism can be used to solve this problem.
[0127] In the global CAS mechanism, the data version of the KVCache can be verified before each read / write operation to ensure data consistency. In the event of a read / write conflict, retries can be used to ensure the validity of the operation, and conflict management can be used to resolve contention. This ensures the ordered nature of KVCache read / write operations under multi-task concurrency, adapting to scenarios such as multi-user inference and multi-level inference, and improving the concurrent processing capabilities of the inference process.
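Version checking with retry can be sketched as below, assuming that CAS here denotes compare-and-swap. The class VersionedSlot and the function append_with_retry are hypothetical; a lock emulates the atomicity that a hardware compare-and-swap instruction would provide.

```python
import threading
from typing import Tuple


class VersionedSlot:
    """A value guarded by a version number that every write must match."""

    def __init__(self, value: bytes) -> None:
        self._lock = threading.Lock()  # emulates an atomic CAS instruction
        self._value = value
        self._version = 0

    def read(self) -> Tuple[bytes, int]:
        with self._lock:
            return self._value, self._version

    def compare_and_swap(self, expected_version: int, new_value: bytes) -> bool:
        with self._lock:
            if self._version != expected_version:
                return False  # someone else wrote first: conflict
            self._value = new_value
            self._version += 1
            return True


def append_with_retry(slot: VersionedSlot, suffix: bytes, max_retries: int = 100) -> int:
    """Append to the slot, retrying on version conflicts; return attempts used."""
    for attempt in range(1, max_retries + 1):
        value, version = slot.read()
        if slot.compare_and_swap(version, value + suffix):
            return attempt
    raise RuntimeError("too much contention")


slot = VersionedSlot(b"")
threads = [
    threading.Thread(target=append_with_retry, args=(slot, b"x"))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
final_value, final_version = slot.read()
```

Each failed attempt implies that another writer succeeded in between, so with eight writers no update is lost and the version ends up equal to the number of successful writes.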
[0128] For storage medium mapping, the mapping relationship between the indexes in the index table and the storage addresses of the corresponding KVCache data can be used to determine, from the logical index, the storage address and media type of the KVCache data to be accessed. The read / write protocol matching the media type is then invoked so that the read / write request reaches the target storage medium and the expected KVCache data is retrieved. In this way, the index table masks the underlying differences between the video memory and PMEM storage media, allowing large models to use a unified access logic for KVCache in different storage media based on the logical index, thereby improving the efficiency of KVCache access.
[0129] In practical applications, the KVCache storage layer can dynamically adjust the mapping relationship between indexes and storage addresses according to the actual storage situation. For example, after KV cache data undergoes cross-media migration, the mapping relationship between the KV cache data and the corresponding storage address can be adjusted in a timely manner. In this way, seamless connection is achieved when data is migrated across media, reducing the latency of cross-media access and inter-media migration, thereby improving cross-media scheduling efficiency and improving the access efficiency and read / write performance of KV cache during large model inference.
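The dispatch by media type and the remapping after migration can be sketched together. The stores, the READERS table, and the migrate function are hypothetical stand-ins; real media would be reached through their own read / write protocols rather than dictionary lookups.

```python
from typing import Callable, Dict, Tuple

vram_store: Dict[int, bytes] = {}
pmem_store: Dict[int, bytes] = {0x20: b"kv-block"}
STORES = {"vram": vram_store, "pmem": pmem_store}

# One reader per media type, standing in for a medium-specific protocol.
READERS: Dict[str, Callable[[int], bytes]] = {
    "vram": lambda address: vram_store[address],
    "pmem": lambda address: pmem_store[address],
}

# Logical index -> (media type, storage address).
index_table: Dict[str, Tuple[str, int]] = {"block-0": ("pmem", 0x20)}


def read(logical_index: str) -> bytes:
    """Resolve the logical index, then dispatch on the media type."""
    medium, address = index_table[logical_index]
    return READERS[medium](address)


def migrate(logical_index: str, dst_medium: str, dst_address: int) -> None:
    """Move the data, then repoint the index entry at the new location."""
    src_medium, src_address = index_table[logical_index]
    STORES[dst_medium][dst_address] = STORES[src_medium].pop(src_address)
    index_table[logical_index] = (dst_medium, dst_address)


before = read("block-0")
migrate("block-0", "vram", 0x04)
after = read("block-0")
```

The caller uses the same logical index before and after the migration; only the entry in the index table changes.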
[0130] In practical applications, referring to Figure 4, for the data portion of the KVCache storage layer, the area used for data storage can be divided into a video memory data area and a PMEM storage area. The video memory data area can use HBM and can also be called the HBM storage layer. The PMEM storage area can also be called the PMEM storage layer.
[0131] The HBM storage layer can accommodate high-frequency access to the KVCache by large models. Leveraging HBM's high bandwidth and low latency, it better guarantees access to the KVCache corresponding to hot data such as real-time dialogue context, achieving low-latency, high-throughput read and write operations and supporting efficient inference.
[0132] The HBM storage layer can also be used as a key execution layer for dynamic management strategies. With its high performance, it can accurately implement strategies, enabling hot data KVCache to reside stably in HBM and cold data KVCache to be moved in a timely manner, thus improving the overall management and usage efficiency of KVCache.
[0133] The PMEM storage layer can offer large capacity, persistence, low cost, and high performance, thus effectively supporting complex inference scenarios such as long-context dialogues and multi-branch inference.
[0134] In practical applications, the KVCache management system can be used for data flow. Referring to Figure 5, data flow can be handled based on a preloading mechanism, a warm-up mechanism, and a runtime flow mechanism. Together, these mechanisms form the data flow mechanism.
[0135] Regarding the preloading mechanism, during the initialization phase of a large model, predicted hot data can be automatically loaded into HBM, thereby shortening the IO wait time for generating the first token and accelerating the initialization of the large model. The initialization of a large model can also be regarded as the startup of the large model.
[0136] Regarding the warm-up mechanism, before a large model performs inference, the KVCache that the large model frequently accesses when performing inference in response to model input can be selected based on historical inference data, and this KVCache can be pre-loaded from PMEM to HBM, thereby pre-filling HBM with this KVCache as hot data.
[0137] Regarding the runtime flow mechanism, asynchronous prefetching can be performed during the large model execution phase, and combined with a hot and cold bidirectional migration mechanism to achieve efficient data flow.
[0138] For asynchronous prefetching, when the large model generates the current lexical unit, the KVCache required for the next batch of lexical units to be generated by the large model can be asynchronously predicted, and the predicted KVCache can be dynamically prefetched from PMEM to HBM asynchronously via DMA, thus allowing computation and transmission time to overlap. The prefetch amount can be dynamically adjusted based on the remaining space of HBM. For example, when the remaining space of HBM is 20%, 10 data blocks can be prefetched from the predicted KVCache, thereby balancing space and efficiency.
[0139] For bidirectional migration between hot and cold caches, background threads can cool down and / or heat up the KVCache.
[0140] For cooling-down processing, a background thread can asynchronously migrate cold KVCache data in HBM to PMEM, thereby freeing up HBM storage space. Before the KVCache data is migrated, it can be compressed to further reduce bandwidth usage and save storage space.
[0141] For warm-up processing, the background thread can asynchronously migrate hot KVCache data in PMEM to HBM, thereby updating its access priority and ensuring a fast response to hot data.
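Compression before cooling-down migration can be sketched with a general-purpose codec. The application does not name a compression algorithm; zlib, the function names demote and restore, and the all-zero sample block are assumptions made for illustration.

```python
import zlib
from typing import Dict, Tuple


def demote(block_id: str, hbm: Dict[str, bytes], pmem: Dict[str, bytes],
           level: int = 1) -> Tuple[int, int]:
    """Compress a block, move it from hbm to pmem, return (raw, stored) sizes."""
    raw = hbm.pop(block_id)
    pmem[block_id] = zlib.compress(raw, level)  # low level favors speed
    return len(raw), len(pmem[block_id])


def restore(block_id: str, pmem: Dict[str, bytes]) -> bytes:
    """Decompress a block read back from pmem."""
    return zlib.decompress(pmem[block_id])


hbm = {"kv-cold": bytes(4096)}  # highly compressible sample data
pmem: Dict[str, bytes] = {}
raw_size, stored_size = demote("kv-cold", hbm, pmem)
roundtrip = restore("kv-cold", pmem)
```

The sample block compresses well because it is all zeros; real KV tensors hold floating-point values and may compress far less, so the achievable saving depends on the data and the codec.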
[0142] The timing of data migration and the determination of KVCache data during the data flow process can be decided based on dynamic management strategies, thereby achieving efficient data flow.
[0143] In practical applications, the inference engine layer can use access counters to count the access frequency of KVCache data in real time, and / or determine the criticality of KVCache data in the inference process in real time. Then, the inference engine layer can determine the access popularity of KVCache data based on access frequency and / or criticality, and thus classify and label the hotspot status of KVCache data based on access popularity.
[0144] In practical applications, data migration can be based on one or more of the following strategies: Usage threshold trigger strategy: When the HBM occupancy rate exceeds the set threshold, start the background cooling thread to avoid blocking the main inference process.
[0145] Access frequency triggering strategy: When the access frequency of KVCache data reaches the preset hot data access frequency threshold, the data can be prioritized for warm-up processing and regarded as hot data. When the access frequency of KVCache data falls to the preset cold data access frequency threshold, the data can be prioritized for cooling-down processing and regarded as cold data.
[0146] Access popularity triggering strategy: When the access popularity of KVCache data reaches the preset hot data access popularity threshold, the data can be prioritized for warm-up processing and regarded as hot data. When the access popularity of KVCache data falls to the preset cold data access popularity threshold, the data can be prioritized for cooling-down processing and regarded as cold data. The hot data access popularity threshold can be equivalent to the first popularity threshold in this application embodiment, and the cold data access popularity threshold can be equivalent to the second popularity threshold in this application embodiment.
[0147] Hybrid eviction strategy: Combining multiple strategies such as FIFO, LRU, and LFU, the KVCache data that is prioritized for cooling down is determined.
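The usage threshold trigger and the hybrid eviction strategy listed above can be sketched together as follows. The scoring formula that blends FIFO age, LRU idle time, and LFU access count, its weights, and the 80% occupancy threshold are assumptions chosen for illustration; the text does not specify how the strategies are combined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    block_id: str
    inserted_at: int    # logical time of insertion into HBM (FIFO signal)
    last_access: int    # logical time of the most recent access (LRU signal)
    access_count: int   # number of accesses so far (LFU signal)


def should_start_cooling(used_bytes: int, capacity_bytes: int,
                         threshold: float = 0.8) -> bool:
    """Usage threshold trigger: cool down only when HBM occupancy exceeds the threshold."""
    return used_bytes / capacity_bytes > threshold


def cooling_order(entries: List[Entry], now: int, w_fifo: float = 0.2,
                  w_lru: float = 0.5, w_lfu: float = 0.3) -> List[Entry]:
    """Return entries sorted so that the best cooling candidates come first."""
    def score(entry: Entry) -> float:
        age = now - entry.inserted_at
        idle = now - entry.last_access
        # Old, idle, rarely used blocks get the highest score.
        return w_fifo * age + w_lru * idle - w_lfu * entry.access_count
    return sorted(entries, key=score, reverse=True)
```

A background cooling thread would call `should_start_cooling` periodically and, when it returns true, migrate blocks in the order given by `cooling_order`.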
[0148] In practical applications, migration can be performed at the data block level. That is, each time the KVCache is migrated, one or more data blocks in the KVCache are migrated, rather than the entire KVCache sequence corresponding to the model input, thereby reducing the amount of data migration. The size of the migrated data blocks can be determined based on a preset data block size, and the number of migrated data blocks can be determined based on the remaining storage space of HBM and / or PMEM.
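Block-level migration can be sketched as below. The rule that caps the number of migrated blocks by the free space of the destination divided by the preset block size is one plausible reading of the paragraph above, not a rule stated in it; the function names and the dictionary-based storage tiers are hypothetical.

```python
from typing import Dict, List


def blocks_to_migrate(candidate_count: int, dest_free_bytes: int,
                      block_size_bytes: int) -> int:
    """Number of blocks that fit into the destination's remaining space."""
    if block_size_bytes <= 0:
        raise ValueError("block_size_bytes must be positive")
    return min(candidate_count, dest_free_bytes // block_size_bytes)


def migrate_blocks(candidates: List[str], src: Dict[str, bytes],
                   dst: Dict[str, bytes], dest_free_bytes: int,
                   block_size_bytes: int) -> List[str]:
    """Move only as many candidate blocks as fit, not the whole KVCache sequence."""
    count = blocks_to_migrate(len(candidates), dest_free_bytes, block_size_bytes)
    moved = []
    for block_id in candidates[:count]:
        dst[block_id] = src.pop(block_id)
        moved.append(block_id)
    return moved
```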
[0149] In this application embodiment, a high-efficiency data storage layer is designed. Through data partitioning and index construction techniques, the storage and management of KVCache in PMEM and HBM are optimized, improving the efficiency of data query and update operations, thereby enhancing the access efficiency of KVCache. Furthermore, this application embodiment constructs a data flow mechanism between HBM and PMEM. Through data prefetching, preheating, and intelligent scheduling, efficient flow of KVCache data between HBM and PMEM is ensured, better meeting the real-time requirements of large model inference and improving the inference efficiency of large models. Moreover, this application embodiment, through access popularity and related strategies, more rationally determines the timing of data migration and specific data selection, avoiding unnecessary data migration, reducing system overhead, and further improving the inference efficiency of large models.
[0150] In practical applications, the solution presented in this application can effectively reduce the hardware cost of large-scale model inference and improve inference efficiency. This solution can be applied to AI service providers or cloud computing vendors. For AI service providers, it enables the provision of more efficient large-scale model inference services at a lower cost; for cloud computing vendors, it can optimize cloud service resource allocation and improve resource utilization.
[0151] Based on the embodiments described above, this application also provides a cache management device, as shown in Figure 6. The device includes:
Loading unit 61, used to load the KV cache generated by the first large model during the inference process to the first storage node; the first storage node consists of first video memory and persistent memory;
Mapping unit 62, used to map multiple data blocks in the KV cache to multiple buckets, and maintain an index table for each of the multiple buckets; one or more data blocks in each bucket correspond to the same first business attribute; each first index in the index table has a mapping relationship with the first information corresponding to a data block in the corresponding bucket; the first information is used to describe the storage location of the corresponding data block in the first storage node;
Positioning unit 63, used to locate the first index table from the multiple index tables corresponding to the multiple buckets based on the first inference requirement of the first large model, and determine the first index corresponding to the first KV cache in the first index table, so as to access the first KV cache in the first storage node based on the first information corresponding to the determined first index; the first KV cache represents the KV cache that the first large model needs to access in order to realize the first inference requirement, and the first inference requirement is used to describe the business attributes corresponding to the first KV cache.
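The relationship between buckets, index tables, and storage locations can be sketched with a small data structure. This is a hedged illustration: the class and field names (`BucketStore`, `Location`, `medium`, `offset`) are hypothetical, and a dictionary per business attribute is only one possible form of index table.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Location:
    """First information: where a data block lives in the storage node."""
    medium: str   # "HBM" (video memory) or "PMEM" (persistent memory)
    offset: int   # assumed address or offset inside that medium


class BucketStore:
    """One bucket per business attribute, each with its own index table."""

    def __init__(self) -> None:
        self.index_tables: Dict[str, Dict[str, Location]] = {}

    def map_block(self, attribute: str, index: str, location: Location) -> None:
        """Mapping step: place a block in the bucket for its attribute."""
        self.index_tables.setdefault(attribute, {})[index] = location

    def locate(self, attribute: str, index: str) -> Optional[Location]:
        """Positioning step: pick the index table by attribute, then the index."""
        return self.index_tables.get(attribute, {}).get(index)
```

Because the caller only sees a `Location`, the same lookup path works whether the block currently resides in video memory or in persistent memory.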
[0152] In one embodiment, the mapping unit 62 is further configured to: If the inference task volume of the first large model exceeds a set first task volume threshold, add a new first bucket to the multiple buckets; and / or, If the inference task volume of the first large model is less than a set second task volume threshold, merge second buckets among the multiple buckets. Wherein, the first task volume threshold is greater than the second task volume threshold.
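The threshold logic for adding and merging buckets can be sketched as follows. The text does not say which buckets count as the "second buckets" to be merged; merging the two smallest buckets is an assumption made here purely for illustration, as are the function and parameter names.

```python
from typing import List


def rebalance_buckets(buckets: List[List[str]], task_volume: int,
                      first_threshold: int, second_threshold: int) -> List[List[str]]:
    """Return a new bucket list, expanded or merged according to the task volume."""
    if first_threshold <= second_threshold:
        raise ValueError("first threshold must be greater than second threshold")
    result = [list(bucket) for bucket in buckets]  # do not mutate the input
    if task_volume > first_threshold:
        result.append([])  # add a new, empty first bucket
    elif task_volume < second_threshold and len(result) >= 2:
        result.sort(key=len)
        # Assumed policy: merge the two smallest buckets into one.
        result = [result[0] + result[1]] + result[2:]
    return result
```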
[0153] In one embodiment, the cache management device further includes a migration unit, the migration unit being used to: Migrate the second KV cache in the persistent memory to the first video memory; the access popularity corresponding to the second KV cache is greater than a set first popularity threshold; and / or, Migrate the third KV cache in the first video memory to the persistent memory; the access popularity corresponding to the third KV cache is less than a set second popularity threshold. Wherein, the first popularity threshold is greater than the second popularity threshold, and the access popularity is determined based on one or more of the following: the access frequency and the criticality of the corresponding KV cache.
[0154] In one embodiment, the migration unit is further configured to: When initializing the first large model, the first KV data is loaded into the first video memory; the first KV data represents the hot data predicted by the KV cache accessed for the first large model; and / or, Before the first large model performs inference in response to model input, a fourth KV cache is selected from the persistent memory based on the historical inference data of the first large model, and the fourth KV cache is migrated from the persistent memory to the first video memory; the fourth KV cache represents the KV cache accessed by the first large model in response to the first model input, and the input frequency of the first model input in the historical inference process is greater than a set first input frequency threshold.
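Preheating based on historical inference data can be sketched as below. The mapping `input_to_blocks` from a model input to the KV cache blocks it used, and the dictionaries standing in for persistent memory and video memory, are hypothetical structures introduced for this sketch; only the rule "input frequency greater than the first input frequency threshold" comes from the text.

```python
from collections import Counter
from typing import Dict, List


def frequent_inputs(history: List[str], first_input_frequency_threshold: int) -> List[str]:
    """Model inputs whose count in the history exceeds the threshold."""
    counts = Counter(history)
    return [text for text, n in counts.items() if n > first_input_frequency_threshold]


def preheat(history: List[str], threshold: int,
            input_to_blocks: Dict[str, List[str]],
            pmem: Dict[str, bytes], hbm: Dict[str, bytes]) -> List[str]:
    """Move the blocks used by frequent inputs from PMEM to HBM before inference."""
    moved = []
    for text in frequent_inputs(history, threshold):
        for block_id in input_to_blocks.get(text, []):
            if block_id in pmem:
                hbm[block_id] = pmem.pop(block_id)
                moved.append(block_id)
    return moved
```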
[0155] In one embodiment, the migration unit is further configured to: When the first large model generates the first word element for the first dialogue, perform the following processing asynchronously: predict the KV cache that the first large model needs to access when generating the second word element, to obtain the fifth KV cache; if the fifth KV cache is located in the persistent memory, migrate the fifth KV cache from the persistent memory to the first video memory. The second word element represents the word element following the first word element.
[0156] In one embodiment, the migration unit predicting the KV cache that the first large model needs to access when generating the second word element, to obtain the fifth KV cache, includes: Calling the first prediction model to process multiple third word elements to obtain the fifth KV cache; the first prediction model is used to predict the KV cache that the first large model needs to access; the multiple third word elements represent multiple word elements generated by the first large model for the first dialogue, and the multiple third word elements represent a first set number of word elements before the first word element.
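The selection of the third word elements that are passed to the prediction model can be sketched as follows. The prediction model itself is not specified in the text, so it is represented here by an arbitrary callable supplied by the caller; the function names and the sliding-window interpretation of "a first set number of word elements before the first word element" are assumptions.

```python
from typing import Callable, List, Sequence


def prediction_window(tokens: Sequence[str], current_index: int,
                      window: int) -> List[str]:
    """The `window` word elements immediately before the current one."""
    start = max(0, current_index - window)
    return list(tokens[start:current_index])


def predict_blocks(tokens: Sequence[str], current_index: int, window: int,
                   model: Callable[[List[str]], List[str]]) -> List[str]:
    """Call the (hypothetical) prediction model on the window of word elements."""
    return model(prediction_window(tokens, current_index, window))
```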
[0157] In one embodiment, the migration unit, when migrating the corresponding KV cache, is used to: Migrate a second number of data blocks from the KV cache; the second number is determined based on the remaining storage space of the first video memory and / or persistent memory.
[0158] In practical applications, the loading unit 61, the mapping unit 62, and the positioning unit 63 can be implemented by the processor in the cache management device.
[0159] It should be noted that the cache management device provided in the above embodiments is only illustrated by the division of the above program modules when performing cache management. In actual applications, the above processing can be assigned to different program modules as needed, that is, the internal structure of the device can be divided into different program modules to complete all or part of the processing described above. In addition, the cache management device and cache management method embodiments provided in the above embodiments belong to the same concept, and the specific implementation process can be found in the method embodiments, which will not be repeated here.
[0160] Based on the hardware implementation of the above program modules, and in order to implement the method of the embodiments of this application, this application also provides an electronic device, as shown in Figure 7. The electronic device includes:
Communication interface 1, which enables information exchange with other devices;
Processor 2, which is connected to communication interface 1 to enable information interaction with other devices and which, when running a computer program, executes the methods provided by one or more technical solutions in the above embodiments. The computer program is stored in memory 3.
[0161] Specifically, the processor 2 is used to load the KV cache generated by the first large model during the inference process to the first storage node; the first storage node consists of a first video memory and persistent memory; Multiple data blocks in the KV cache are mapped to multiple buckets, and an index table is maintained for each of the multiple buckets; one or more data blocks in each bucket correspond to the same first business attribute; each first index in the index table has a mapping relationship with first information corresponding to a data block in the corresponding bucket; the first information is used to describe the storage location of the corresponding data block in the first storage node; and, Based on the first inference requirement of the first large model, the first index table is located from the multiple index tables corresponding to the multiple buckets, and the first index corresponding to the first KV cache is determined in the first index table, so as to access the first KV cache in the first storage node based on the first information corresponding to the determined first index; the first KV cache represents the KV cache that the first large model needs to access in order to realize the first inference requirement, and the first inference requirement is used to describe the business attributes corresponding to the first KV cache.
[0162] In one embodiment, the processor 2 is further configured to: If the inference task volume of the first large model exceeds a set first task volume threshold, add a new first bucket to the multiple buckets; and / or, If the inference task volume of the first large model is less than a set second task volume threshold, merge second buckets among the multiple buckets. Wherein, the first task volume threshold is greater than the second task volume threshold.
[0163] In one embodiment, the processor 2 is further configured to: Migrate the second KV cache in the persistent memory to the first video memory; the access popularity corresponding to the second KV cache is greater than a set first popularity threshold; and / or, Migrate the third KV cache in the first video memory to the persistent memory; the access popularity corresponding to the third KV cache is less than a set second popularity threshold. Wherein, the first popularity threshold is greater than the second popularity threshold, and the access popularity is determined based on one or more of the following: the access frequency and the criticality of the corresponding KV cache.
[0164] In one embodiment, the processor 2 is further configured to: When initializing the first large model, the first KV data is loaded into the first video memory; the first KV data represents the hot data predicted by the KV cache accessed for the first large model; and / or, Before the first large model performs inference in response to model input, a fourth KV cache is selected from the persistent memory based on the historical inference data of the first large model, and the fourth KV cache is migrated from the persistent memory to the first video memory; the fourth KV cache represents the KV cache accessed by the first large model in response to the first model input, and the input frequency of the first model input in the historical inference process is greater than a set first input frequency threshold.
[0165] In one embodiment, the processor 2 is further configured to: When the first large model generates the first word element for the first dialogue, perform the following processing asynchronously: predict the KV cache that the first large model needs to access when generating the second word element, to obtain the fifth KV cache; if the fifth KV cache is located in the persistent memory, migrate the fifth KV cache from the persistent memory to the first video memory. The second word element represents the word element following the first word element.
[0166] In one embodiment, the processor 2 predicting the KV cache that the first large model needs to access when generating the second word element, to obtain the fifth KV cache, includes: Calling the first prediction model to process multiple third word elements to obtain the fifth KV cache; the first prediction model is used to predict the KV cache that the first large model needs to access; the multiple third word elements represent multiple word elements generated by the first large model for the first dialogue, and the multiple third word elements represent a first set number of word elements before the first word element.
[0167] In one embodiment, when migrating the corresponding KV cache, the processor 2 is used to: Migrate a second number of data blocks from the KV cache; the second number is determined based on the remaining storage space of the first video memory and / or persistent memory.
[0168] It should be noted that the specific processing procedure of communication interface 1 can be understood by referring to the above method.
[0169] Of course, in practical applications, the various components in the electronic device are coupled together through bus system 4. It can be understood that bus system 4 is used to achieve communication and connection between these components. In addition to the data bus, bus system 4 also includes a power bus, a control bus, and a status signal bus. However, for clarity, all of these buses are labeled as bus system 4 in Figure 7.
[0170] The memory 3 in this embodiment is used to store various types of data to support operation in the electronic device. Examples of such data include any computer program used to operate on the electronic device.
[0171] The methods disclosed in the embodiments of this application can be applied to the processor 2, or implemented by the processor 2. The processor 2 may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method can be completed by the integrated logic circuit of the hardware in the processor 2 or by instructions in the form of software. The processor 2 mentioned above may be a general-purpose processor, a DSP, or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The processor 2 can implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor, etc. The steps of the methods disclosed in the embodiments of this application can be directly reflected as being executed by a hardware decoding processor, or being executed by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium, which is located in the memory 3. The processor 2 reads the information in the memory 3 and combines its hardware to complete the steps of the aforementioned method.
[0172] In an exemplary embodiment, the electronic device may be implemented by one or more ASICs, DSPs, PLDs, CPLDs, FPGAs, general-purpose processors, controllers, MCUs, microprocessors, or other electronic components to perform the aforementioned method.
[0173] It is understood that the memory 3 in the embodiments of this application can be volatile memory or non-volatile memory, or both. Non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), ferromagnetic random access memory (FRAM), flash memory, magnetic surface memory, optical disc, or compact disc read-only memory (CD-ROM); magnetic surface memory can be disk storage or magnetic tape storage. Volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memories described in the embodiments of this application are intended to include, but are not limited to, these and any other suitable types of memories.
[0174] In an exemplary embodiment, this application also provides a storage medium, namely a computer storage medium, specifically a computer-readable storage medium, such as a memory 3 storing a computer program, which can be executed by the processor 2 of an electronic device to complete the steps described in the aforementioned cache management method.
[0175] Computer-readable storage media can be FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface memory, optical disc, or CD-ROM, etc.
[0176] In an exemplary embodiment, this application also provides a computer program product, including a computer program that can be executed by a processor 2 of an electronic device to complete the steps described in the aforementioned cache management method.
[0177] It should be noted that terms such as "first" and "second" are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence.
[0178] In this document, the term "and / or" merely describes an association between related objects and indicates that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. Furthermore, the term "one or more" in this document refers to any one of multiple items or any combination of at least two of them. For example, including one or more of A, B, and C can represent including any one or more elements selected from the set consisting of A, B, and C. Additionally, the term "one or more" in this document is an exemplary expression and can be replaced with any equivalent expression, such as "at least one" or "at least one of".
[0179] Furthermore, the technical solutions described in the embodiments of this application can be combined arbitrarily without conflict.
[0180] The above description is merely a preferred embodiment of this application and is not intended to limit the scope of protection of this application.
Claims
1. A cache management method, characterized in that, The method includes: The key-value (KV) cache generated by the first large model during the inference process is loaded into the first storage node; the first storage node consists of the first video memory and persistent memory; Multiple data blocks in the KV cache are mapped to multiple buckets, and an index table is maintained for each of the multiple buckets; one or more data blocks in each bucket correspond to the same first business attribute; each first index in the index table has a mapping relationship with the first information corresponding to a data block in the corresponding bucket; the first information is used to describe the storage location of the corresponding data block in the first storage node; Based on the first inference requirement of the first large model, the first index table is located from the multiple index tables corresponding to the multiple buckets, and the first index corresponding to the first KV cache is determined in the first index table, so as to access the first KV cache in the first storage node based on the first information corresponding to the determined first index; the first KV cache represents the KV cache that the first large model needs to access in order to realize the first inference requirement, and the first inference requirement is used to describe the business attributes corresponding to the first KV cache.
2. The method according to claim 1, characterized in that, The method further includes: If the inference task volume of the first large model exceeds a set first task volume threshold, a new first bucket is added to the multiple buckets; and / or, If the inference task volume of the first large model is less than a set second task volume threshold, second buckets among the multiple buckets are merged. Wherein, the first task volume threshold is greater than the second task volume threshold.
3. The method according to claim 1, characterized in that, The method further includes: The second KV cache in the persistent memory is migrated to the first video memory; the access popularity corresponding to the second KV cache is greater than a set first popularity threshold; and / or, The third KV cache in the first video memory is migrated to the persistent memory; the access popularity corresponding to the third KV cache is less than a set second popularity threshold. Wherein, the first popularity threshold is greater than the second popularity threshold, and the access popularity is determined based on one or more of the following: the access frequency and the criticality of the corresponding KV cache.
4. The method according to claim 1, characterized in that, The method further includes: When initializing the first large model, the first KV data is loaded into the first video memory; the first KV data represents the hot data predicted by the KV cache accessed for the first large model; and / or, Before the first large model performs inference in response to model input, a fourth KV cache is selected from the persistent memory based on the historical inference data of the first large model, and the fourth KV cache is migrated from the persistent memory to the first video memory; the fourth KV cache represents the KV cache accessed by the first large model in response to the first model input, and the input frequency of the first model input in the historical inference process is greater than a set first input frequency threshold.
5. The method according to claim 1, characterized in that, The method further includes: When the first large model generates the first word element for the first dialogue, the following processing is performed asynchronously: The KV cache that the first large model needs to access when generating the second word element is predicted to obtain the fifth KV cache. If the fifth KV cache is located in the persistent memory, the fifth KV cache is migrated from the persistent memory to the first video memory. The second word element represents the word element following the first word element.
6. The method according to claim 5, characterized in that, The process of predicting the KV cache that the first large model needs to access when generating the second word element to obtain the fifth KV cache includes: The first prediction model is called to process multiple third word elements to obtain the fifth KV cache; the first prediction model is used to predict the KV cache that the first large model needs to access; the multiple third word elements represent multiple word elements generated by the first large model for the first dialogue, and the multiple third word elements represent a first set number of word elements before the first word element.
7. The method according to any one of claims 3 to 6, characterized in that, When migrating the corresponding KV cache, the method includes: Migrate a second number of data blocks from the KV cache; the second number is determined based on the remaining storage space of the first video memory and / or persistent memory.
8. An electronic device, characterized in that, it includes: A processor and a memory for storing a computer program capable of running on the processor; wherein, when the processor runs the computer program, it performs the steps of the method according to any one of claims 1 to 7.
9. A storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 7.
10. A computer program product, comprising a computer program, characterized in that, When the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 7.