Block device layer differentiated admission control method and system

By identifying key-value cache access patterns and allocating independent token bucket resources at the block device layer, the problem of high read latency of key-value cache in large language model inference is solved, differentiated resource scheduling is achieved, and service quality is improved.

CN122242776A · Pending · Publication Date: 2026-06-19 · CHINA UNICOM INTERNET OF THINGS CO LTD +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: CHINA UNICOM INTERNET OF THINGS CO LTD
Filing Date: 2026-05-21
Publication Date: 2026-06-19

Application Information

IPC: G06N5/04; G06F12/02

AI Tagging

Application Domain

Memory addressing/allocation/relocation; Inference methods

Technology Topics

Computer network; Cache access

AI Technical Summary

Technical Problem

The high latency of key-value cache reads during large language model inference is mainly due to the inability to distinguish between input / output requests with different semantics at the block device layer, which results in resource contention and low bandwidth utilization.

Method used

At the block device layer, by acquiring input and output request characteristic data, the key-value cache access pattern is identified, and independent token bucket resources are allocated for different types of traffic to achieve differentiated admission control and ensure that key-value cache reads obtain sufficient resources.

Benefits of technology

It reduces key-value cache read latency, improves the stability and efficiency of inference services, and avoids resource contention between different semantic traffic.

✦ Generated by Eureka AI based on patent content.

Abstract

This application provides a block device layer differentiated admission control method and system. The method includes: acquiring input / output request feature data of the block device layer, wherein the input / output request feature data is obtained by feature extraction of input / output requests; identifying key-value cache access patterns based on the input / output request feature data; allocating corresponding token bucket resources to input / output requests according to the identified key-value cache access patterns; and performing differentiated admission control. This application achieves differentiated block device layer admission control between key-value cache traffic and regular data flow in large model inference without modifying the upper-layer inference engine, reducing the latency of key-value cache reads.

Description

Technical Field

[0001] This application belongs to the storage field of large model inference, and in particular relates to a block device layer differentiated admission control method and system.

Background Technology

[0002] In large language model inference, the key-value (KV) cache is the fundamental data structure supporting autoregressive decoding. During the pre-filling stage, the key-value tensors of the input prompt are written to storage layer by layer. During the decoding stage, the historical key-value cache is read in token order for attention computation. As the number of model layers and the context length increase, the amount of key-value cache data for a single request grows significantly. Because GPU memory capacity is limited, the key-value cache typically needs to be offloaded to host memory or a high-speed solid-state drive. Key-value cache reads during the decoding stage are fine-grained, highly concurrent, and random; their latency directly affects the time to first token and the inter-token latency. Existing technologies primarily optimize at the inference engine layer or the GPU driver layer, including paged-attention memory management, priority scheduling of page swapping at the driver layer, and data prefetching strategies at the application layer.

[0003] In the above schemes, input / output requests with different semantics are processed uniformly once they arrive at the block device layer. Latency-sensitive key-value cache reads therefore compete within the same resource pool as bandwidth-sensitive training data loading and checkpoint writing, which increases the tail latency of key-value cache reads and degrades inference service quality. Furthermore, existing block device layer quality-of-service mechanisms, which employ either a single token bucket or multiple token buckets partitioned by process or tenant, cannot distinguish between traffic with different semantics within the same process and cannot provide targeted resource guarantees for key-value cache reads.

Summary of the Invention

[0004] This application aims to provide a block device layer differentiated admission control method and system, so as to achieve differentiated block device layer admission control between key-value cache traffic and regular data flow in large model inference without modifying the upper-layer inference engine, thereby reducing the latency of key-value cache reading.

[0005] This application discloses a block device layer differentiated admission control method, including: obtaining input / output request feature data of the block device layer, wherein the input / output request feature data is obtained by extracting features from input / output requests; identifying key-value cache access patterns based on the input / output request feature data; and allocating corresponding token bucket resources to input / output requests based on the identified key-value cache access patterns, and performing differentiated admission control.

[0006] Optionally, identifying key-value cache access patterns based on the input / output request feature data includes: calculating the random read ratio and the input / output size distribution based on the input / output request feature data; extracting the offset step size history of adjacent input / output requests based on the input / output request feature data; and identifying the key-value cache access pattern based on the random read ratio, the input / output size distribution, and the offset step size history.

[0007] Optionally, identifying the key-value cache access pattern based on the random read ratio, the input / output size distribution, and the offset step size history includes: when the random read ratio exceeds a first threshold, the proportion of small-granularity input / output exceeds a second threshold, and the offset step size history contains periodic jumps, determining that the traffic is a key-value cache decoding read; when the sequential write ratio exceeds a third threshold, the proportion of medium-granularity input / output exceeds a fourth threshold, and the write traffic rises sharply and then falls back within a preset window, determining that the traffic is a key-value cache pre-fill write; and when the random read ratio exceeds a fifth threshold, the proportion of very-small-granularity input / output exceeds a sixth threshold, and the read addresses are concentrated within a preset range, determining that the traffic is a key-value cache index lookup.
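The three rules in this paragraph can be read as a small decision function. The following is only a sketch under stated assumptions: the `WindowStats` fields, the label strings, and the threshold values `T1` to `T6` are hypothetical names and placeholders (the description later gives 80% and 70% as example values for the first and second thresholds only), and the rules are evaluated in the order they are listed.

```python
from dataclasses import dataclass

@dataclass
class WindowStats:
    """Per-window statistics (hypothetical field names)."""
    random_read_ratio: float       # random reads / all reads
    sequential_write_ratio: float  # sequential writes / all writes
    small_io_ratio: float          # share of small-granularity requests
    medium_io_ratio: float         # share of medium-granularity requests
    tiny_io_ratio: float           # share of very-small-granularity requests
    periodic_jumps: bool           # periodic jumps in the offset step history
    write_burst_then_drop: bool    # write traffic rose, then fell back in the window
    reads_concentrated: bool       # read addresses lie within a preset range

# Placeholder thresholds (assumptions, not values fixed by the text).
T1, T2, T3, T4, T5, T6 = 0.80, 0.70, 0.80, 0.70, 0.80, 0.70

def classify(s: WindowStats) -> str:
    """Apply the three rules in order; anything else is regular traffic."""
    if s.random_read_ratio > T1 and s.small_io_ratio > T2 and s.periodic_jumps:
        return "kv_decode_read"
    if s.sequential_write_ratio > T3 and s.medium_io_ratio > T4 and s.write_burst_then_drop:
        return "kv_prefill_write"
    if s.random_read_ratio > T5 and s.tiny_io_ratio > T6 and s.reads_concentrated:
        return "kv_index_lookup"
    return "regular"
```

Traffic that matches none of the rules falls through to the regular class, which corresponds to the third token bucket in the paragraphs that follow.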

[0008] Optionally, allocating corresponding token bucket resources to input / output requests based on the identified key-value cache access pattern and performing differentiated admission control includes: allocating a first token bucket resource for key-value cache decoding reads; allocating a second token bucket resource for key-value cache pre-fill writes; allocating a third token bucket resource for regular data input / output; and performing admission control on input / output requests based on the allocated token bucket resources.

[0009] Optionally, performing admission control on input / output requests based on the allocated token bucket resources includes: calculating the available input / output quota and the available bandwidth quota of each token bucket, wherein the token buckets include the first token bucket, the second token bucket, and the third token bucket; deducting the available input / output quota and the available bandwidth quota of the corresponding token bucket according to the type of the input / output request; allowing the input / output request to pass when the available input / output quota and the available bandwidth quota of the corresponding token bucket are sufficient; and making the input / output request wait and retry when they are insufficient.
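A minimal sketch of this dual-quota check may help. It assumes one request costs one unit of input / output quota plus its size in bytes of bandwidth quota; the class name, the dictionary keys, and every quota value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

MB = 1024 * 1024

@dataclass
class TokenBucket:
    io_quota: float  # available request-count quota
    bw_quota: float  # available bandwidth quota, in bytes

    def try_admit(self, size_bytes: int) -> bool:
        """Deduct both quotas and admit if both suffice; otherwise change nothing."""
        if self.io_quota >= 1 and self.bw_quota >= size_bytes:
            self.io_quota -= 1
            self.bw_quota -= size_bytes
            return True
        return False  # the caller waits and retries after a later refill

def make_buckets() -> dict:
    # Hypothetical quotas for the first, second, and third token buckets.
    return {
        "kv_decode_read": TokenBucket(io_quota=10000, bw_quota=256 * MB),
        "kv_prefill_write": TokenBucket(io_quota=2000, bw_quota=512 * MB),
        "regular": TokenBucket(io_quota=500, bw_quota=1024 * MB),
    }

def admit(buckets: dict, traffic_type: str, size_bytes: int) -> bool:
    """Route the request to the bucket of its traffic type."""
    return buckets[traffic_type].try_admit(size_bytes)
```

Because each bucket keeps its own counters, draining the regular bucket with large sequential requests leaves the quota available to decoding reads unchanged.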

[0010] Optionally, when a key-value cache index lookup is determined, allocating corresponding token bucket resources to the input / output request based on the identified key-value cache access pattern and performing differentiated admission control includes: marking the input / output request as a metadata request, allocating a fast-track resource to the metadata request, and performing admission control on the metadata request based on the fast-track resource.

[0011] Optionally, performing admission control on the metadata request based on the fast-track resource includes: counting the number of metadata requests passing through the fast track within a preset window; when the number of metadata requests does not exceed the fast-track rate limit, allowing the metadata requests to pass and waiving or reducing the token bucket quota deduction; and when the number of metadata requests exceeds the fast-track rate limit, assigning the excess metadata requests to the corresponding token bucket for admission control.
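The fast-track behavior described here amounts to a per-window counter with a fallback path. The sketch below is illustrative only: the class name, the `route` return labels, and the explicit `reset_window` call are assumptions, and the waived or reduced quota deduction is represented simply by the `"fast_track"` label.

```python
class FastTrack:
    """Per-window rate limiter for metadata (index lookup) requests."""

    def __init__(self, rate_limit: int):
        self.rate_limit = rate_limit  # max fast-track requests per window
        self.count = 0                # metadata requests seen in this window

    def reset_window(self) -> None:
        """Called when a new preset window begins."""
        self.count = 0

    def route(self) -> str:
        """Decide how one metadata request is admitted."""
        self.count += 1
        if self.count <= self.rate_limit:
            return "fast_track"    # passes; token deduction waived or reduced
        return "token_bucket"      # excess goes through normal admission control
```

For example, with `rate_limit=2` the first two metadata requests in a window take the fast track and the third is handed to the token bucket, until `reset_window()` starts the next window.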

[0012] Optionally, when the decrease in key-value cache pre-fill write traffic exceeds a preset threshold, a token bucket pre-filling operation is triggered to adjust the available quota and time slice granularity of the first token bucket, and admission control for subsequent key-value cache decoding reads is performed based on the adjusted first token bucket resources.

[0013] Optionally, calculating the available input / output quota and the available bandwidth quota of each token bucket includes: maintaining an independent quota counter and synchronization mechanism for each token bucket; updating the available input / output quota and the available bandwidth quota of each token bucket according to a preset token fill rate and burst capacity; and, when input / output requests belonging to different token buckets arrive concurrently, processing the input / output requests corresponding to a key-value cache access pattern first.
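As a rough illustration of the refill and ordering rules in this paragraph, the sketch below assumes a linear refill capped at the burst capacity and a stable sort that moves key-value cache traffic ahead of regular traffic. The function names, the `type` key, and the label set are hypothetical, and the per-bucket synchronization mechanism is omitted.

```python
# Hypothetical labels for traffic classified as key-value cache access.
KV_TYPES = {"kv_decode_read", "kv_prefill_write", "kv_index_lookup"}

def refill(available: float, fill_rate: float, burst: float, elapsed_s: float) -> float:
    """Add fill_rate * elapsed_s tokens, never exceeding the burst capacity."""
    return min(burst, available + fill_rate * elapsed_s)

def order_pending(requests: list) -> list:
    """Stable sort: key-value cache requests first, arrival order otherwise kept."""
    return sorted(requests, key=lambda r: r["type"] not in KV_TYPES)
```

The same `refill` function would be applied separately to the input / output quota and to the bandwidth quota of each bucket, each with its own rate and burst capacity.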

[0014] This application also discloses a block device layer differentiated admission control system, including a pattern recognition module, a token bucket scheduling module, and an admission control module. The pattern recognition module is used to obtain input / output request feature data of the block device layer and to identify key-value cache access patterns based on the input / output request feature data, wherein the input / output request feature data is obtained by extracting features from input / output requests. The token bucket scheduling module is used to allocate corresponding token bucket resources to input / output requests based on the identified key-value cache access pattern. The admission control module is used to perform differentiated admission control based on the allocated token bucket resources.

[0015] As can be seen from the above technical solution, this application obtains input / output request feature data at the block device layer and identifies key-value cache access patterns based on the input / output size distribution, the random read ratio, and the offset step size history. This allows traffic with different semantics to be distinguished without modifying the upper-layer inference engine. After the key-value cache access pattern is identified, independent token bucket resources are allocated to the different types of traffic. Key-value cache decoding reads use token bucket parameters adapted to small-granularity random reads, while regular data uses token bucket parameters adapted to large-granularity sequential reads and writes. The quotas of different token buckets are calculated and consumed independently, which prevents large-granularity sequential input / output from exhausting the resource quota of small-granularity random input / output. When performing admission control on input / output requests, the system decides whether to admit a request based on the available quota of the corresponding token bucket, ensuring that key-value cache reads obtain sufficient input / output resources. This arrangement covers the complete process from traffic identification to resource scheduling at the block device layer and prevents key-value cache reads from being squeezed out by regular traffic, thereby reducing key-value cache read latency and improving the stability of the inference service.

Attached Figure Description

[0016] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0017] Figure 1 is a flowchart of the block device layer differentiated admission control method provided in an embodiment of this application;
Figure 2 is a flowchart of identifying key-value cache access patterns based on the input / output request feature data, provided in an embodiment of this application;
Figure 3 is a flowchart of performing admission control on input / output requests based on the allocated token bucket resources, provided in an embodiment of this application;
Figure 4 is a flowchart of performing admission control on metadata requests based on the fast-track resource, provided in an embodiment of this application;
Figure 5 is a structural block diagram of the block device layer differentiated admission control system provided in an embodiment of this application.

Detailed Implementation

[0018] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not limiting, in order to provide a thorough understanding of the embodiments of this application. However, those skilled in the art will understand that this application can also be implemented in other embodiments without such specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods are omitted so as not to obscure the description of this application with unnecessary detail.

[0019] This application relates to the field of computer storage technology, specifically to the quality of service (QoS) of input / output at the block device layer in large language model inference scenarios. The inference process of a large language model is divided into a pre-filling stage and a decoding stage. The pre-filling stage processes the prompt text input by the user, generates an initial key-value cache, and writes it to the storage device. The decoding stage generates output text token by token, and each time a token is generated, the historical key-value cache needs to be read for attention calculation.

[0020] With the development of large language model technology, the scale of model parameters has continued to expand, and the context length has also increased from the initial few thousand tokens to hundreds of thousands or even millions of tokens. The amount of key-value cache data generated by a single inference request can reach tens of gigabytes or even hundreds of gigabytes. Although the video memory capacity of graphics processors is also constantly increasing, it is still limited relative to the rapidly growing demand for key-value cache. Therefore, in actual deployments, the method of swapping out or unloading video memory is usually adopted, storing infrequently used key-value cache data in host memory or high-speed solid-state drives, and then swapping it back into video memory when needed.

[0021] It's important to note that key-value cache reads during the decoding phase exhibit unique input / output characteristics. First, there's fine-grained access. Key-value caches are typically managed in page units, with each page generally being 4KB, 16KB, or 64KB. Each read operation usually accesses only one or a few consecutive pages, resulting in generally small input / output requests. Second, there's high-concurrency access. The attention mechanism of large language models involves multiple attention heads, each requiring independent reading of its corresponding key-value cache data. Therefore, a large number of input / output requests are generated simultaneously during the decoding phase. Finally, there's random access. Attention computation requires reading the key-value cache sequentially by layer and attention head. The addresses of key-value cache data at different layers are not contiguous on the storage device, exhibiting a clear inter-layer skipping access pattern. This random access pattern differs significantly from traditional sequential read / write patterns.

[0022] For example, when a large language model decodes, generating the Nth token requires reading the key-value cache data of all the previous N-1 tokens. For a model with 80 layers and 32 attention heads, generating a token requires performing 80×32 key-value cache read operations, with each read amounting to tens of KB. The addresses of these read operations are distributed across different regions of the storage device, and the address offset between two adjacent reads can reach hundreds of MB, forming a typical random access pattern. Meanwhile, training data loading and checkpoint writing are typically large-granularity sequential reads and writes, with each input / output request reaching several MB or even tens of MB in size, and the addresses are contiguous. When these two different types of traffic access the block device simultaneously, traditional block-layer QoS mechanisms cannot distinguish between them and will schedule them as homogeneous traffic. This causes large-granularity sequential input / output to quickly exhaust the token bucket's input / output quota, while small-granularity key-value cache reads need to wait, thus increasing decoding latency.

[0023] Existing optimizations for key-value caching primarily focus on the application and driver layers. At the application layer, the inference engine manages the key-value cache in blocks using a paged attention algorithm, achieving efficient utilization of video memory. When video memory is insufficient, some key-value cache blocks are swapped out to the storage device. However, this optimization only addresses video memory management. Once swap-out and swap-in input / output requests reach the block device layer, they are still queued together with other input / output requests, and their priority cannot be guaranteed. At the driver layer, some solutions monitor video memory pressure and select data pages for swapping out and swapping in based on priority, but this also fails to address resource contention at the block layer. Application-layer data prefetching strategies attempt to reduce decoding stalls by preloading key-value cache data that will be needed later, but once prefetch requests reach the block layer they may still be queued behind large-granularity input / output requests, which significantly diminishes the benefit of prefetching.

[0024] Furthermore, the standard read-ahead mechanism in the operating system kernel is designed for sequential access patterns. When a sequential read is detected, subsequent data is read into the page cache to improve read speed. However, key-value cache decoding reads are in random access mode, and the read-ahead mechanism cannot hit the data needed in the future, instead reading a large amount of useless data and wasting storage bandwidth. Static resource control group rate limiting schemes guarantee service quality by setting bandwidth and input / output limits per second for inference processes, but this scheme is process-level granularity and cannot distinguish between key-value cache reads and model weight loading within the same inference process, leading to policy misalignment. Storage array cache tiering technology migrates hot data to high-speed storage media, but the tiering granularity is data block level, which cannot identify the semantic features of key-value caches, nor can it prioritize reads during the decoding phase.

[0025] Based on the above problems, this application provides a block device layer differentiated admission control method applied to KV Cache in large model inference. The main idea of this method is to deploy relevant functional modules in the operating system kernel block device layer, identify key-value cache access patterns by analyzing the behavioral characteristics of input and output requests, and then allocate independent token bucket resources for different types of traffic to achieve differentiated admission control. With this setup, there is no need to modify the source code of the upper-layer inference engine, the intrusion into the existing system is minimal, and it is compatible with various mainstream large language model inference frameworks.

[0026] Please refer to Figure 1, which is a flowchart of the block device layer differentiated admission control method provided in this application. As shown in Figure 1, the method includes the following steps: S100, obtain input / output request feature data of the block device layer, wherein the input / output request feature data is obtained by extracting features from input / output requests.

[0027] The block device layer is the core layer in the operating system that manages block devices, and all input / output requests to block devices must be processed through it. This application adds hook functions at the input / output request entry point of the block device layer to capture all input / output requests passing through the block device layer and extract the feature data of each input / output request, i.e., input / output request feature data.

[0028] The input / output request characteristic data includes information such as the type, size, start address, end address, timestamp, process identifier, and tenant identifier of the input / output request. For each input / output request, it records whether it is a read request or a write request, the size of the requested data, the range of storage addresses accessed, and the time the request arrived.

[0029] It should be noted that this application maintains an independent input / output statistical context for each tenant to record that tenant's input / output request history. The statistical context includes key-value cache read / write counts, an input / output size distribution histogram, the offset step size history of adjacent input / output requests, and the step size variance of accesses along the attention head dimension, among other statistics. When an input / output request arrives, the corresponding statistical context is first located by tenant identifier, and the request's feature data is then recorded in it. The input / output size distribution histogram divides request sizes into multiple intervals and counts the number of requests in each interval. For example, the intervals can be less than 4KB, 4KB-16KB, 16KB-64KB, 64KB-256KB, 256KB-1MB, and greater than 1MB. The offset step size history of adjacent input / output requests records, for each of the most recent requests, the difference between its starting address and the starting address of the previous request, and is used to analyze the regularity of access addresses.

[0030] For example, when a tenant's I / O request arrives at the block device layer, the hook function captures the request, extracts its type as a read request, its size as 16KB, and its starting address as 0x10000000. Then, it looks up the corresponding I / O statistics context for that tenant, increments the read request count by 1, increments the count within the 16KB-64KB interval of the I / O size distribution histogram by 1, calculates the difference in starting address between the current request and the previous request (0x08000000), and adds this difference to the offset step history. These statistics are periodically updated and cleaned up, retaining the characteristics of recent I / O requests to reflect current I / O behavior patterns. The size of the statistics window can be configured according to actual needs, for example, set to 100ms, 200ms, or 500ms. A smaller statistics window provides a more sensitive response to traffic changes, but the statistical overhead will also increase accordingly.
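The bookkeeping in this example can be sketched as follows. This is a user-space illustration under assumptions, not kernel code: the class and attribute names are hypothetical, a request whose size equals an interval boundary is placed in the upper interval (so 16KB lands in 16KB-64KB, as in the example), and window cleanup is left out.

```python
import bisect

KB = 1024
MB = 1024 * KB
# Interval boundaries: <4KB, 4KB-16KB, 16KB-64KB, 64KB-256KB, 256KB-1MB, >1MB
EDGES = [4 * KB, 16 * KB, 64 * KB, 256 * KB, 1 * MB]

class IoStatsContext:
    """Per-tenant statistics context (hypothetical layout)."""

    def __init__(self):
        self.reads = 0
        self.writes = 0
        self.histogram = [0] * (len(EDGES) + 1)
        self.offset_steps = []   # start-address differences of adjacent requests
        self.last_start = None   # start address of the previous request

    def update(self, is_read: bool, size: int, start: int) -> None:
        if is_read:
            self.reads += 1
        else:
            self.writes += 1
        # bisect_right puts a size equal to a boundary into the upper interval.
        self.histogram[bisect.bisect_right(EDGES, size)] += 1
        if self.last_start is not None:
            self.offset_steps.append(start - self.last_start)
        self.last_start = start
```

Feeding a 16KB read at `0x08000000` followed by a 16KB read at `0x10000000` records one offset step of `0x08000000`, the difference used in the example above.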

[0031] Specifically, the input / output statistics context is implemented using a per-CPU variable approach. Each CPU maintains independent statistics, and then the statistics from each CPU are periodically aggregated into the global statistics context. This implementation avoids lock contention caused by multiple CPUs accessing the same statistics variable simultaneously, improving statistical efficiency and reducing the impact on input / output hot paths. The statistics for each CPU include the number of read requests, write requests, request counts for each size interval, and historical offset steps within the most recent statistics window. When a statistics window ends, the kernel thread aggregates the statistics from all CPUs, updates the global input / output statistics context, resets the statistics for each CPU, and begins the statistics for the next statistics window.
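The per-CPU pattern can be modeled in user space as one counter per CPU that is merged when a window ends. The sketch below only illustrates the aggregation and reset logic; in the kernel this would rely on per-CPU variables rather than Python objects, and the class name, method names, and string keys are hypothetical.

```python
from collections import Counter

class PerCpuStats:
    """One counter per CPU, merged into a global view at window end."""

    def __init__(self, num_cpus: int):
        self.per_cpu = [Counter() for _ in range(num_cpus)]
        self.global_stats = Counter()

    def record(self, cpu: int, key: str) -> None:
        # Hot path: touches only the counter owned by this CPU, so no shared lock.
        self.per_cpu[cpu][key] += 1

    def end_window(self) -> Counter:
        """Aggregate all CPUs, publish the result, and reset per-CPU counters."""
        total = Counter()
        for counter in self.per_cpu:
            total.update(counter)
            counter.clear()
        self.global_stats = total
        return total
```

Here `end_window` plays the role of the aggregating kernel thread; it replaces the global view with the totals of the window that just ended, which is one possible reading of "updates the global input / output statistics context".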

[0032] S200, identify the key-value cache access mode based on the input / output request feature data.

[0033] This application constructs a set of rules for identifying key-value cache access patterns based on three dimensions: input / output size distribution, random read ratio, and offset step history. Key-value cache access patterns include three types: key-value cache decoding read, key-value cache pre-fill write, and key-value cache index lookup.

[0034] In some embodiments, as shown in Figure 2, identifying key-value cache access patterns based on the input / output request feature data includes: S210, calculate the random read ratio and the input / output size distribution based on the input / output request feature data. In practice, the first step is to calculate the random read ratio, which is the proportion of random read requests among all read requests within a statistical window. A read request is classified as random when its starting address is not contiguous with the ending address of the previous read request and the address offset exceeds a preset threshold. This preset threshold can be configured according to actual needs, for example, set to 64KB. If multiple consecutive read requests have contiguous addresses, they are considered sequential read requests. The random read ratio reflects the access pattern of input / output requests: it is typically high for key-value cache decoding reads and low for training data loading.

[0035] For example, within a statistics window, if a tenant has 10,000 total read requests, of which 9,500 are random read requests, the random read ratio is 95%. If another tenant has 1,000 total read requests, of which 100 are random read requests, the random read ratio is 10%. By comparing the random read ratio with a preset first threshold, it can be preliminarily determined whether the current traffic is key-value cache decoding reads. The first threshold can be set to 80%. When the random read ratio exceeds 80%, the current traffic is considered to have high randomness and may be key-value cache decoding reads.
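The random read ratio defined above could be computed along these lines. The sketch assumes reads are given as `(start, size)` pairs in arrival order, that the first read in a window has no predecessor and is counted as non-random, and that the offset is measured from the previous read's end address; the function name is hypothetical.

```python
def random_read_ratio(reads, threshold=64 * 1024):
    """reads: list of (start_address, size_bytes) in arrival order."""
    if not reads:
        return 0.0
    random_count = 0
    prev_end = None
    for start, size in reads:
        # Random if not contiguous with the previous read and the gap exceeds the threshold.
        if prev_end is not None and abs(start - prev_end) > threshold:
            random_count += 1
        prev_end = start + size
    return random_count / len(reads)
```

The result is then compared with the first threshold (80% in the example above) as one of the conditions for a key-value cache decoding read.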

[0036] Next, the input / output size distribution is determined. This distribution reflects the granularity of input / output requests. Key-value cache decoding read requests are typically between 4KB and 64KB, which is small-granularity input / output. Key-value cache pre-fill write requests are typically between tens of KB and several MB, which is medium-granularity input / output. Training data loading and checkpoint writing requests are typically several MB or more, which is large-granularity input / output. Key-value cache index lookup requests are typically around 4KB, which is very-small-granularity input / output. By analyzing the proportion of each interval in the input / output size distribution histogram, the granularity of the current input / output traffic can be determined.

[0037] Specifically, the proportion of small-granularity I / O requests refers to the percentage of I / O requests with sizes between 4KB and 64KB out of the total number of requests. The proportion of medium-granularity I / O requests refers to the percentage of I / O requests with sizes between 64KB and 4MB out of the total number of requests. The proportion of large-granularity I / O requests refers to the percentage of I / O requests with sizes larger than 4MB out of the total number of requests. The proportion of very small-granularity I / O requests refers to the percentage of I / O requests with sizes smaller than 4KB out of the total number of requests. These proportions can be calculated by counting the intervals in the I / O size distribution histogram.

[0038] For example, within a statistical window, if a tenant issues 10,000 input / output requests in total, of which 8,500 fall within the 4KB-64KB range, then the proportion of small-granularity input / output is 85%. If another tenant issues 1,000 input / output requests in total, of which 900 are larger than 4MB, then the proportion of large-granularity input / output is 90%. By comparing the proportion of small-granularity input / output with a preset second threshold, it can be further determined whether the current traffic is key-value cache decoding reads. The second threshold can be set to 70%. When the proportion of small-granularity input / output exceeds 70%, the current traffic is considered to be dominated by small-granularity requests, which matches the characteristics of key-value cache decoding reads.
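The granularity proportions used in this example can be derived directly from request sizes. The sketch below uses the four classes defined earlier (under 4KB, 4KB to 64KB, 64KB to 4MB, over 4MB); how sizes exactly on a boundary are assigned is an assumption, as are the function and key names.

```python
KB = 1024
MB = 1024 * KB

def granularity_ratios(sizes):
    """Return the share of tiny, small, medium, and large requests."""
    counts = {"tiny": 0, "small": 0, "medium": 0, "large": 0}
    for s in sizes:
        if s < 4 * KB:
            counts["tiny"] += 1      # very small granularity
        elif s < 64 * KB:
            counts["small"] += 1     # 4KB up to (not including) 64KB
        elif s <= 4 * MB:
            counts["medium"] += 1    # 64KB through 4MB
        else:
            counts["large"] += 1     # larger than 4MB
    total = len(sizes)
    return {k: (v / total if total else 0.0) for k, v in counts.items()}
```

With 8,500 requests of 16KB and 1,500 requests of 8MB, the small-granularity share is 85%, which exceeds the 70% example threshold.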

[0039] S220, extract the offset step size history of adjacent input / output requests based on the input / output request feature data; This step analyzes the regularity of the offset step size by extracting the offset step size history of adjacent input and output requests.

[0040] Decoding and reading of key-value caches are performed in the order of Transformer layers. The storage offset of each key-value cache layer is determined by the hidden layer size, sequence length, data type size, and number of key-value pairs of the model.

[0041] For a specific model, the size of each key-value cache layer is fixed. Therefore, during decoding and reading, the address offset between two adjacent reads exhibits periodic jumps whose magnitude matches the size of a single key-value cache layer. For example, for a model with a hidden layer size of 4096, a sequence length of 16K, and data type FP16 (2 bytes per element), with both keys and values stored, the size of each key-value cache layer is 4096 × 16 × 1024 × 2 × 2 = 256MB. Therefore, during decoding and reading, there will be a pattern of a large jump of approximately 256MB every few small-step accesses.

[0042] This application detects periodic patterns consistent with the storage span of model layers by maintaining the offset differences of the most recent reads. A tolerance-matching approach is adopted, allowing the offset step size to deviate within a certain range so that different model configurations and storage layouts are accommodated. For example, if multiple jumps between 250MB and 260MB are detected in the offset step size history, an inter-layer skip access pattern can be determined. Furthermore, the step size variance of accesses along the attention head dimension can be calculated. Key-value caches are typically accessed with a fixed step size equal to the attention head size, so the step size variance is small. If the step size variance is below a preset threshold, this further supports identifying the traffic as key-value cache decoding reads.

[0043] Specifically, the offset step history is implemented using a circular buffer. The buffer size can be set to 1024, 2048, or 4096 to record the offset step size of the most recent input / output requests. When a new offset step size is generated, it is written to the current position in the circular buffer, and then the current position pointer is moved forward one position. If the current position pointer exceeds the end of the buffer, it returns to the beginning of the buffer. During periodic detection, all offset steps in the circular buffer are traversed, and the occurrence frequency of each step size interval is counted to find the step size interval with the highest occurrence frequency. If the occurrence frequency of this step size interval exceeds a preset threshold, it is considered that there is a periodic inter-layer skip access pattern.

[0044] For example, the circular buffer has a size of 2048 and records the offset step size of the most recent 2048 input / output requests. After traversing these offset steps, it was found that 300 offset steps fell between 250MB and 260MB, with a frequency of 14.6%, exceeding the preset threshold of 10%. Therefore, it can be determined that the current traffic exhibits an inter-layer skip access pattern, consistent with the characteristics of key-value cache decoding and reading.
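The circular buffer and the frequency check can be sketched as follows. This is a simplified illustration under stated assumptions: the class and method names are hypothetical, and instead of searching for the most frequent step size interval it tests a single candidate interval (250MB to 260MB here), in the spirit of the tolerance matching described earlier.

```python
KB = 1024
MB = 1024 * KB


class OffsetStepHistory:
    """Fixed-size circular buffer of offset steps (bytes) between adjacent requests."""

    def __init__(self, capacity: int = 2048):
        self.capacity = capacity
        self.steps = [0] * capacity
        self.pos = 0    # next write position
        self.count = 0  # number of valid entries, at most capacity

    def record(self, step_bytes: int) -> None:
        """Write at the current position, then advance and wrap around."""
        self.steps[self.pos] = step_bytes
        self.pos = (self.pos + 1) % self.capacity
        self.count = min(self.count + 1, self.capacity)

    def frequency_in_range(self, low: int, high: int) -> float:
        """Fraction of recorded steps whose magnitude lies in [low, high]."""
        if self.count == 0:
            return 0.0
        valid = self.steps[:self.count]
        hits = sum(1 for s in valid if low <= abs(s) <= high)
        return hits / self.count


# Reproduce the example: 300 of the last 2048 steps are roughly 256MB jumps.
history = OffsetStepHistory(capacity=2048)
for i in range(2048):
    history.record(256 * MB if i < 300 else 16 * KB)

jump_frequency = history.frequency_in_range(250 * MB, 260 * MB)
has_layer_jumps = jump_frequency > 0.10  # 10% threshold from the text
```

Here `jump_frequency` equals 300 / 2048, about 14.6%, which is above the 10% threshold.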

[0045] S230: Identify the key-value cache access pattern based on the random read ratio, the input / output size distribution, and the offset step size history.

[0046] Steps S210 to S230 are executed sequentially: first, basic statistical features are calculated; then, more complex behavioral features are extracted; and finally, all features are combined for pattern recognition.

[0047] Optionally, identifying the key-value cache access pattern based on the random read ratio, the input / output size distribution, and the offset step size history includes: A1. When the random read ratio exceeds the first threshold, the small-granularity input / output ratio exceeds the second threshold, and there are periodic jumps in the offset step size history, the traffic is determined to be a key-value cache decoding read. Specifically, the identification rule for key-value cache decoding reads requires three conditions to be met simultaneously: the proportion of random reads exceeds the first threshold, the proportion of small-granularity input / output exceeds the second threshold, and there are periodic jumps in the offset step size history. The first threshold can be set to 80%, the second threshold to 70%, and the frequency threshold for periodic jumps to 10%. Together, these three conditions distinguish key-value cache decoding reads from other types of random read traffic, such as database index reads. Although database index reads are also small-granularity random reads, they do not exhibit a periodic inter-layer jump access pattern and therefore are not misidentified as key-value cache decoding reads.

[0048] For example, within a statistical window, a tenant's random read rate is 95%, exceeding the first threshold of 80%; its small-granularity input / output rate is 85%, exceeding the second threshold of 70%; and the frequency of the 250MB to 260MB range in the offset step history is 14.6%, exceeding the 10% periodic jump threshold. Therefore, it can be determined that the tenant's current input / output traffic is mainly key-value cache decoding and reading.
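Rule A1 reduces to a conjunction of three threshold checks, sketched below. The `WindowStats` type and the function name are hypothetical; the default thresholds are the example values from the text (80%, 70%, 10%).

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Per-tenant statistics for one window (hypothetical container)."""
    random_read_ratio: float
    small_io_ratio: float
    layer_jump_frequency: float


def is_kv_decode_read(stats: WindowStats,
                      first_threshold: float = 0.80,
                      second_threshold: float = 0.70,
                      jump_threshold: float = 0.10) -> bool:
    """All three conditions of rule A1 must hold at the same time."""
    return (stats.random_read_ratio > first_threshold
            and stats.small_io_ratio > second_threshold
            and stats.layer_jump_frequency > jump_threshold)


# Tenant from the example: 95% random reads, 85% small requests, 14.6% jumps.
tenant = WindowStats(0.95, 0.85, 0.146)
# A database index workload: similar ratios, but no inter-layer jumps.
db_index = WindowStats(0.95, 0.85, 0.0)
```

The third condition is what separates the two workloads: `is_kv_decode_read(tenant)` holds, while `is_kv_decode_read(db_index)` does not.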

[0049] A2, when the sequential write ratio exceeds the third threshold, the proportion of medium-granularity input and output exceeds the fourth threshold, and the write traffic suddenly increases and then falls back within the preset window, it is determined to be a key-value cache pre-fill write. The identification rules for key-value cache prefilling writes require three conditions to be met simultaneously: the proportion of sequential writes exceeds the third threshold, the proportion of medium-granularity input / output exceeds the fourth threshold, and the write traffic surges and then quickly falls back within a preset window. The third threshold can be set to 90%, the fourth threshold can be set to 60%, the criterion for a surge in write traffic is that the write traffic in the current window is more than 5 times that of the previous window, and the criterion for a drop in write traffic is that the write traffic in the current window is less than 20% of the peak window write traffic.

[0050] Specifically, the pre-filling phase involves writing the input prompt's key-value tensor to storage layer by layer. Therefore, write requests account for a high percentage, and these write operations are sequential, typically approaching 100%. Pre-filling write requests are medium-granular, usually ranging from tens of KB to several MB. The pre-filling phase is short, causing a brief surge in write traffic. Once pre-filling is complete, write traffic drops rapidly, entering the decoding phase. By monitoring the trend of write traffic changes, the start and end of the pre-filling phase can be identified.

[0051] For example, within three consecutive statistical windows, a tenant's write traffic was 10MB / s, 60MB / s, and 10MB / s, respectively. The write traffic in the second window was six times that of the first window, exceeding the 5-fold surge threshold. The write traffic in the third window was 16.7% of that of the second window, below the 20% fallback threshold. Simultaneously, the sequential write ratio in the second window was 98%, exceeding the 90% third threshold, and the medium-granularity input / output ratio was 75%, exceeding the 60% fourth threshold. Therefore, it can be determined that the tenant was in the pre-filling phase in the second window, and the pre-filling phase ended in the third window, entering the decoding phase.
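The surge-and-fallback test of rule A2 can be sketched over three consecutive windows as follows. The function name and argument layout are hypothetical; the defaults are the example values from the text (90%, 60%, a 5-fold surge, and a fallback below 20% of the peak).

```python
def is_prefill_write(prev_write: float,
                     peak_write: float,
                     next_write: float,
                     sequential_write_ratio: float,
                     medium_io_ratio: float,
                     third_threshold: float = 0.90,
                     fourth_threshold: float = 0.60,
                     surge_factor: float = 5,
                     fallback_ratio: float = 0.20) -> bool:
    """Check rule A2 for the middle of three consecutive windows.

    Write traffic is given in any consistent unit (MB/s in the example).
    The two ratios describe the middle (peak) window.
    """
    surged = peak_write > surge_factor * prev_write
    fell_back = next_write < fallback_ratio * peak_write
    return (surged
            and fell_back
            and sequential_write_ratio > third_threshold
            and medium_io_ratio > fourth_threshold)


# Example from the text: 10MB/s, 60MB/s, 10MB/s with 98% sequential writes
# and 75% medium-granularity requests in the second window.
in_prefill = is_prefill_write(10, 60, 10, 0.98, 0.75)
```

For the example values the check holds: 60 exceeds 5 × 10, and 10 is below 20% of 60.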

[0052] A3. When the random read ratio exceeds the fifth threshold, the proportion of very small granular input / output exceeds the sixth threshold, and the read addresses are concentrated within a preset range, it is determined to be a key-value cache index lookup.

[0053] The key-value cache index lookup identification rules require three conditions to be met simultaneously: the proportion of random reads exceeds the fifth threshold, the proportion of extremely fine-grained input / output exceeds the sixth threshold, and the read addresses are concentrated within a preset range. The fifth threshold can be set to 95%, the sixth threshold can be set to 90%, and the criterion for determining the concentration of read addresses is that the addresses of all read requests fall within a contiguous storage area of 1MB.

[0054] Specifically, key-value cache block management requires maintaining a block mapping table to record the storage location of key-value cache blocks. When accessing a key-value cache block, the corresponding storage address must first be found by reading the block mapping table. Therefore, the index lookup operation is a very fine-grained random read, with a request size typically around 4KB, and the read addresses are concentrated within the small storage area where the block mapping table is located. By detecting these characteristics, key-value cache index lookup requests can be identified.

[0055] For example, within a statistical window, a tenant's random read rate is 99%, exceeding the fifth threshold of 95%, and its very fine-grained input / output rate is 95%, exceeding the sixth threshold of 90%. All read requests fall within a contiguous storage region of 1MB, between 0x20000000 and 0x20100000. Therefore, it can be determined that this tenant's current input / output traffic is primarily key-value cache index lookups.
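Rule A3 can be sketched in the same style. The function name is hypothetical, and measuring address concentration as the span between the lowest and highest request start addresses is an assumption; the text only states that all read addresses fall within a contiguous 1MB region.

```python
MB = 1024 * 1024


def is_kv_index_lookup(random_read_ratio: float,
                       very_small_io_ratio: float,
                       read_addresses,
                       fifth_threshold: float = 0.95,
                       sixth_threshold: float = 0.90,
                       max_span_bytes: int = 1 * MB) -> bool:
    """All three conditions of rule A3 must hold at the same time."""
    if not read_addresses:
        return False
    span = max(read_addresses) - min(read_addresses)
    return (random_read_ratio > fifth_threshold
            and very_small_io_ratio > sixth_threshold
            and span <= max_span_bytes)


# Example from the text: reads confined to 0x20000000-0x20100000 (1MB).
addresses = [0x20000000 + i * 4096 for i in range(256)]
is_lookup = is_kv_index_lookup(0.99, 0.95, addresses)
```

A single read far outside the region, for example at 0x30000000, makes the span exceed 1MB and the rule no longer matches.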

[0056] The identification rules for the three modes mentioned above are independent of each other, and multiple traffic modes may coexist within a single statistical window. For example, after the pre-filling stage, there may be a small amount of pre-filling write traffic and a large amount of decoding read traffic simultaneously. This application performs pattern recognition separately for each input / output request, determining its traffic type based on its own characteristics and the current statistical context.

[0057] It should be noted that the pattern recognition process in this application is executed periodically in a background kernel thread, which does not block the processing flow of input / output requests and has minimal impact on the input / output hot path. The recognition results are cached and used directly in subsequent input / output request processing to avoid redundant calculations. In addition, this application introduces a latency feedback verification mechanism to correct pattern recognition errors. When the latency of input / output requests identified as key-value cache decoding reads consistently exceeds a preset threshold, a misjudgment is likely, such as the loading of large model weights being misjudged as key-value cache decoding reads. In this case, the traffic is forcibly treated as regular data in the next statistical window, the feature sampling density is increased, and pattern recognition is performed again.

[0058] Specifically, the latency feedback verification mechanism is implemented by statistically analyzing the average input / output latency for each traffic type. For traffic identified as key-value cache decoding read, the completion time of each input / output request is recorded, and the average latency is calculated. If the average latency of three consecutive statistical windows exceeds a preset latency threshold, a false positive is considered. The latency threshold can be configured according to the performance of the storage device; for example, for high-speed solid-state drives, the latency threshold can be set to 1ms, and for ordinary solid-state drives, the latency threshold can be set to 5ms. When a false positive is determined, the tenant's traffic type is forcibly set to regular data, the size of the statistical window is halved, the feature sampling density is increased, and pattern recognition is re-performed in the next statistical window. If the re-identification still determines it to be key-value cache decoding read, the normal statistical window size is restored; otherwise, the regular data type is maintained.
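The latency feedback check can be sketched as a small state machine. The class and method names are hypothetical, and one detail is an assumption: when re-identification does not confirm a decoding read, this sketch leaves the window at its halved size, because the text does not say whether it is restored in that case.

```python
class LatencyFeedback:
    """Tracks average decode-read latency per window and flags misjudgments."""

    def __init__(self, latency_threshold_ms: float, normal_window_ms: int,
                 windows_required: int = 3):
        self.latency_threshold_ms = latency_threshold_ms
        self.normal_window_ms = normal_window_ms
        self.windows_required = windows_required
        self.window_ms = normal_window_ms
        self.over_count = 0          # consecutive windows above the threshold
        self.forced_regular = False  # True while traffic is forced to regular data

    def on_window(self, avg_latency_ms: float) -> bool:
        """Feed one window's average latency; return True on a new misjudgment."""
        if avg_latency_ms > self.latency_threshold_ms:
            self.over_count += 1
        else:
            self.over_count = 0
        if self.over_count >= self.windows_required and not self.forced_regular:
            self.forced_regular = True
            self.window_ms = self.normal_window_ms // 2  # halve the window
            self.over_count = 0
            return True
        return False

    def on_reidentified(self, is_decode_read: bool) -> None:
        """Apply the result of re-identification in the halved window."""
        if is_decode_read:
            self.forced_regular = False
            self.window_ms = self.normal_window_ms
        # Otherwise the regular data type is kept.


# High-speed SSD example: 1ms threshold, with an assumed 100ms normal window.
feedback = LatencyFeedback(latency_threshold_ms=1.0, normal_window_ms=100)
results = [feedback.on_window(latency) for latency in (1.5, 1.6, 1.4)]
```

After the third consecutive slow window, `results` is `[False, False, True]` and the statistical window has been halved to 50ms.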

[0059] S300: Allocate corresponding token bucket resources to input / output requests based on the identified key-value cache access pattern, and perform differentiated admission control.

[0060] In this embodiment, the traditional single token bucket is expanded into three independent token buckets: the first token bucket for key-value cache decoding and reading, the second token bucket for key-value cache pre-filling and writing, and the third token bucket for regular data. Each token bucket has independent parameters for token filling rate, burst capacity, time slice granularity, and smoothing suppression strength, which can be configured according to actual business needs.

[0061] Optionally, allocating corresponding token bucket resources to input / output requests based on the identified key-value cache access pattern and performing differentiated admission control includes: B1. Allocate the first token bucket resource for key-value cache decoding reads. Specifically, the first token bucket handles key-value cache decoding read requests and is configured with high I / O operations per second, large burst capacity, small time-slice granularity, and weak smoothing suppression. Key-value cache decoding reads are small-granularity random reads that need a large number of input / output operations to complete attention calculations, so a high I / O operations per second quota is required. The large burst capacity absorbs the burst of input / output requests at the beginning of the decoding phase, the small time-slice granularity improves scheduling fairness and prevents individual requests from waiting too long, and the weak smoothing suppression reduces restrictions on burst traffic, ensuring decoding continuity.

[0062] For example, the parameters of the first token bucket can be configured as follows: 10,000 I / O operations per second, 100MB of bandwidth per second, a burst capacity of 20,000 I / O operations and 200MB of bandwidth, a time slice granularity of 0.5ms, and a smoothing suppression strength of 0.1. These parameters can be adjusted according to the performance of the storage device and business needs. For example, for higher-performance storage devices, the I / O operations per second and bandwidth settings can be increased.

[0063] B2. Allocate the second token bucket resource for key-value cache pre-filling writes. The second token bucket handles key-value cache pre-filling write requests and is configured with medium-high I / O operations per second and medium burst capacity. Pre-filling writes are medium-granularity sequential writes that need a certain amount of bandwidth so that the pre-filling phase completes quickly and the decoding phase can begin. However, the pre-filling phase is less sensitive to latency than the decoding phase, so its parameter configuration lies between those of the first and third token buckets.

[0064] For example, the parameters of the second token bucket can be configured as 5000 inputs / outputs per second, 200MB bandwidth per second, 10000 inputs / outputs and 400MB bandwidth burst capacity, 1ms time slice granularity, and 0.3 smoothing suppression strength.

[0065] B3. Allocate the third token bucket resource for regular data input / output. The third token bucket handles regular data input / output requests, including training data loading, checkpoint writing, and model weight loading. It is configured with high bandwidth, low I / O operations per second, large time-slice granularity, and strong smoothing suppression. Regular data is typically read and written sequentially with large granularity, requiring high bandwidth but placing lower demands on I / O operations per second and latency. Therefore, the third token bucket is given a high bandwidth quota and a low I / O operations per second quota. The large time-slice granularity improves the efficiency of input / output merging and makes full use of the storage device's bandwidth, while the strong smoothing suppression limits burst traffic so that it does not affect key-value cache traffic.

[0066] For example, the parameters of the third token bucket can be configured as 1000 inputs / outputs per second, 500MB bandwidth per second, a burst capacity of 2000 inputs / outputs and 1000MB bandwidth, a time slice granularity of 5ms, and a smoothing suppression strength of 0.8.
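The three example configurations can be collected in one place for comparison. The `BucketConfig` type, its field names, and the traffic-type keys are hypothetical; the numbers are the example values given above for the three token buckets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BucketConfig:
    """Parameters of one token bucket (illustrative field names)."""
    iops: int                 # I/O operations replenished per second
    bandwidth_mb_per_s: int   # bandwidth replenished per second, in MB
    burst_ios: int            # burst capacity, I/O operations
    burst_mb: int             # burst capacity, MB
    time_slice_us: int        # time slice granularity, microseconds
    smoothing: float          # smoothing suppression strength


DECODE_READ = BucketConfig(10_000, 100, 20_000, 200, 500, 0.1)
PREFILL_WRITE = BucketConfig(5_000, 200, 10_000, 400, 1_000, 0.3)
REGULAR_DATA = BucketConfig(1_000, 500, 2_000, 1_000, 5_000, 0.8)

# Traffic type (as produced by pattern recognition) to token bucket.
BUCKET_FOR_TRAFFIC = {
    "kv_decode_read": DECODE_READ,
    "kv_prefill_write": PREFILL_WRITE,
    "regular": REGULAR_DATA,
}
```

Read side by side, the values show the intended trade-off: the I/O operation rate falls and the bandwidth, time slice, and smoothing strength rise from the first bucket to the third.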

[0067] B4 performs admission control for input / output requests based on the allocated token bucket resources.

[0068] In this embodiment, the three token buckets are completely independent, and their respective quota calculations and consumption do not affect each other, ensuring that different types of traffic will not crowd out each other's resources.

[0069] Optionally, as shown in Figure 3, performing admission control for input / output requests based on the allocated token bucket resources includes: S410, calculating the available input / output quota and available bandwidth quota for each token bucket, where the token buckets include the first token bucket, the second token bucket, and the third token bucket; S420, deducting the available input / output quota and available bandwidth quota of the corresponding token bucket according to the type of the input / output request; S430, when the available input / output quota and available bandwidth quota of the corresponding token bucket are sufficient, allowing the input / output request to pass, and when they are insufficient, making the input / output request wait and retry.

[0070] In this embodiment, each token bucket maintains an independent available input / output quota counter and available bandwidth quota counter, along with a corresponding synchronization mechanism. The token bucket quota is replenished periodically according to a preset filling rate. When an input / output request arrives, the quota of the corresponding token bucket is deducted based on the type of the request. The two quotas are deducted differently: the input / output quota is deducted by request count, with one unit deducted for each request, while the bandwidth quota is deducted by the size of the requested data, with the deducted amount equal to that size.

[0071] Specifically, the token bucket replenishment process is triggered periodically by a kernel timer, with the trigger interval equal to the time slice granularity. Each time the timer fires, the replenishment amount is calculated from the token filling rate and added to the available quota counter, but the available quota cannot exceed the burst capacity. For example, if the first token bucket is configured with 10,000 input / output operations per second and a time slice granularity of 0.5 ms, then the replenishment amount per trigger is 10,000 × 0.0005 = 5 operations. If the current available input / output quota is 19,998 operations, adding 5 would bring it to 20,003 operations, exceeding the burst capacity of 20,000 operations, so the available input / output quota is set to 20,000 operations.
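The replenishment arithmetic can be sketched as follows. The function names are hypothetical. Expressing the time slice in microseconds and using integer division is an implementation choice made here to avoid floating-point rounding; it silently drops fractional tokens, which the text does not address.

```python
def refill_amount(rate_per_second: int, slice_us: int) -> int:
    """Tokens added per timer trigger for a given filling rate and time slice."""
    return rate_per_second * slice_us // 1_000_000


def replenish(available: int, rate_per_second: int, slice_us: int,
              burst_capacity: int) -> int:
    """Add one time slice of tokens, capped at the burst capacity."""
    return min(available + refill_amount(rate_per_second, slice_us),
               burst_capacity)


# First token bucket: 10,000 operations per second, 0.5ms (500us) time slice.
per_trigger = refill_amount(10_000, 500)
capped = replenish(19_998, 10_000, 500, burst_capacity=20_000)
```

`per_trigger` is 5, and `capped` is 20,000 because 19,998 + 5 would exceed the burst capacity.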

[0072] When an input / output request arrives, the corresponding token bucket is first located based on its type. Then, the available input / output quota and available bandwidth quota of that token bucket are checked. If the available input / output quota is greater than or equal to 1, and the available bandwidth quota is greater than or equal to the requested data size, the request is allowed to pass, and the available input / output quota is decremented by 1, while the available bandwidth quota is reduced by the requested data size. If the available input / output quota or available bandwidth quota is insufficient, the request is placed in a waiting queue, awaiting the next quota replenishment before processing.

[0073] For example, when a 16KB key-value cache decoding read request arrives, the available I / O quota and available bandwidth quota of the first token bucket are first checked. Assume the current available I / O quota is 100 operations and the available bandwidth quota is 10MB. 16KB equals 0.015625MB; the available I / O quota is at least 1 and the available bandwidth quota is greater than 0.015625MB, so the request is allowed to pass. The available I / O quota is then decremented by 1 to 99 operations, and the available bandwidth quota is reduced by 0.015625MB to 9.984375MB. If another 16KB key-value cache decoding read request arrives when the available I / O quota of the first token bucket is 0, that request enters the waiting queue. When the kernel timer triggers quota replenishment, the available I / O quota becomes 5 operations. At this point, the first 5 requests are taken from the waiting queue, and each in turn has its quota deducted and is released.
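The check, deduct, and wait logic can be sketched for a single token bucket as follows. The class and method names are hypothetical, quotas are tracked in bytes rather than MB, and queuing a new request behind already waiting requests is an assumption made here to keep arrival order.

```python
from collections import deque

KB = 1024
MB = 1024 * KB


class TokenBucket:
    """One token bucket with an I/O quota, a bandwidth quota, and a wait queue."""

    def __init__(self, io_quota: int, bw_quota_bytes: int):
        self.io_quota = io_quota
        self.bw_quota = bw_quota_bytes
        self.waiting = deque()  # sizes (bytes) of requests waiting for quota

    def try_admit(self, size_bytes: int) -> bool:
        """Deduct both quotas and admit if both are sufficient."""
        if self.io_quota >= 1 and self.bw_quota >= size_bytes:
            self.io_quota -= 1
            self.bw_quota -= size_bytes
            return True
        return False

    def submit(self, size_bytes: int) -> bool:
        """Admit immediately if possible, otherwise enqueue the request."""
        if not self.waiting and self.try_admit(size_bytes):
            return True
        self.waiting.append(size_bytes)
        return False

    def drain(self) -> int:
        """After replenishment, release waiting requests in arrival order."""
        released = 0
        while self.waiting and self.try_admit(self.waiting[0]):
            self.waiting.popleft()
            released += 1
        return released


# Example from the text: 100 operations and 10MB available, one 16KB request.
bucket = TokenBucket(io_quota=100, bw_quota_bytes=10 * MB)
admitted = bucket.submit(16 * KB)
remaining_mb = bucket.bw_quota / MB
```

After the request passes, 99 operations and 9.984375MB remain, matching the worked example.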

[0074] It's important to note that the quotas of the three token buckets are completely independent and do not affect each other. Large-granularity input / output of regular data only consumes the quota of the third token bucket, not the first token bucket's input / output quota. Therefore, key-value cache decoding and reading will not be congested by regular traffic. When input / output requests from different token buckets arrive concurrently, requests corresponding to the key-value cache access mode are processed first; that is, requests from the first and second token buckets have higher priority than requests from the third token bucket. If requests are waiting in the first and second token buckets simultaneously, they are processed according to a first-come, first-served principle.

[0075] Specifically, the input / output request waiting queues are divided into three independent queues, corresponding to the first token bucket, the second token bucket, and the third token bucket, respectively. Once the quota is replenished, the first token bucket's waiting queue is processed first, allowing all requests eligible for quota to proceed. Then, the second token bucket's waiting queue is processed, and finally, the third token bucket's waiting queue is processed. This priority processing mechanism ensures that key-value cache traffic receives resources preferentially, avoiding blockage by regular traffic.
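The priority order across the three waiting queues can be sketched as follows. The function name and the dictionary layout are hypothetical; each bucket still consumes only its own quota, and the priority only fixes the order in which the queues are examined after replenishment.

```python
from collections import deque

KB = 1024


def drain_in_priority_order(buckets):
    """Release waiting requests bucket by bucket, highest priority first.

    `buckets` is a list ordered first, second, third. Each entry is a dict
    with keys 'name', 'io' (operations), 'bw' (bytes), and 'queue' (a deque
    of request sizes in bytes).
    """
    dispatched = []
    for bucket in buckets:
        queue = bucket["queue"]
        while queue and bucket["io"] >= 1 and bucket["bw"] >= queue[0]:
            size = queue.popleft()
            bucket["io"] -= 1
            bucket["bw"] -= size
            dispatched.append((bucket["name"], size))
    return dispatched


first = {"name": "decode_read", "io": 2, "bw": 1024 * KB,
         "queue": deque([16 * KB, 16 * KB, 16 * KB])}
second = {"name": "prefill_write", "io": 1, "bw": 1024 * KB,
          "queue": deque([512 * KB])}
third = {"name": "regular", "io": 1, "bw": 8192 * KB,
         "queue": deque([4096 * KB])}

order = [name for name, _ in drain_in_priority_order([first, second, third])]
```

With these quotas, two decoding reads are released first, then the pre-filling write, then the regular request; the third decoding read stays queued because the first bucket has run out of I/O quota.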

[0076] In some embodiments, when the traffic is determined to be a key-value cache index lookup, allocating corresponding token bucket resources to the input / output request according to the identified key-value cache access pattern and performing differentiated admission control includes: marking the input / output request as a metadata request, allocating fast channel resources to the metadata request, and performing admission control of the metadata request according to the fast channel resources.

[0077] Key-value cache index lookup involves reading from the block mapping table, and its latency directly impacts the swapping speed of key-value cache blocks, thus affecting decoding latency. Therefore, index lookup requests need to be given higher priority to ensure they are processed quickly.

[0078] It should be noted that the fast channel for metadata requests is a separate processing path independent of the three token buckets. Metadata requests pass through admission control first, and their token bucket deductions are either waived or significantly reduced. To prevent misjudgments from causing excessive bypassing of the token buckets, this application sets a rate cap for the fast channel. It counts the number of metadata requests passing through the fast channel within a preset window. When the number does not exceed the rate cap, metadata requests are allowed to pass through the fast channel. When the number exceeds the rate cap, the excess metadata requests are allocated to the first token bucket for processing.

[0079] Optionally, as shown in Figure 4, performing admission control for metadata requests based on the fast channel resources includes: S510, counting the number of metadata requests that pass through the fast channel within the preset window; S520, when the number of metadata requests does not exceed the fast channel rate cap, allowing the metadata requests to pass and waiving or reducing the token bucket quota deduction, and when the number of metadata requests exceeds the fast channel rate cap, allocating the excess metadata requests to the corresponding token bucket for admission control.

[0080] In this embodiment, the rate cap for the fast channel is implemented using a low-overhead counting mechanism. The kernel's per-CPU counter is used to count the number of metadata requests passing through the fast channel on each CPU, and then the global count is periodically aggregated. This mechanism has extremely low overhead and almost no impact on input / output hot paths. The rate cap can be configured according to actual needs, for example, it can be set to 1000 requests per second.

[0081] For example, the rate cap for the fast channel is set to 1000 requests per second, and the statistical window size is 100ms. Within one statistical window, if the number of metadata requests through the fast channel is 80, which is below the window cap of 100, the metadata requests are allowed to pass without deducting any token bucket quota. If, within another statistical window, the number of metadata requests through the fast channel is 120, exceeding the window cap of 100, then the first 100 requests are processed through the fast channel, and the remaining 20 requests are allocated to the first token bucket and processed according to the rules for key-value cache decoding and reading.
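The per-window rate cap can be sketched as follows. The class name and the route labels are hypothetical, and a plain counter stands in for the per-CPU counters described above.

```python
class FastChannel:
    """Per-window cap on metadata requests that bypass the token buckets."""

    def __init__(self, rate_per_second: int = 1000, window_ms: int = 100):
        # 1000 requests per second over a 100ms window gives a cap of 100.
        self.window_cap = rate_per_second * window_ms // 1000
        self.count = 0

    def start_window(self) -> None:
        """Reset the counter at the start of each statistical window."""
        self.count = 0

    def route(self) -> str:
        """Return 'fast' while under the cap, otherwise fall back to bucket 1."""
        if self.count < self.window_cap:
            self.count += 1
            return "fast"
        return "first_bucket"


channel = FastChannel(rate_per_second=1000, window_ms=100)

channel.start_window()
light = [channel.route() for _ in range(80)]

channel.start_window()
heavy = [channel.route() for _ in range(120)]
```

In the 80-request window every request takes the fast channel; in the 120-request window the first 100 do and the remaining 20 are routed to the first token bucket.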

[0082] It's important to note that the token bucket limit deduction for metadata requests can be configured as either a complete exemption or a partial deduction. A complete exemption means no token bucket limit is deducted, while a partial deduction means a certain percentage of the normal limit is deducted, such as 10% or 20%. This can be configured according to actual business needs. For example, for scenarios with extremely high latency requirements, a complete exemption can be configured, while for scenarios requiring strict control over resource usage, a partial deduction can be configured.

[0083] Optionally, when the decrease in traffic for key-value cache prefilling writes exceeds a preset threshold, a token bucket prefilling operation is triggered to adjust the available amount and time slice granularity of the first token bucket, and admission control for subsequent key-value cache decoding and reading is performed based on the adjusted first token bucket resources.

[0084] In this embodiment, the decoding phase begins immediately after the pre-filling phase and generates a sudden burst of key-value cache decoding read requests. If the available quota in the first token bucket is insufficient, these requests are blocked, which increases the time to first token. Therefore, the first token bucket needs to be pre-charged before the pre-filling phase ends so that the first batch of requests in the decoding phase can be processed quickly.

[0085] Specifically, the method for detecting the decrease in pre-fill write traffic is to compare the write traffic of the current statistics window with that of the previous statistics window and calculate the decrease. If the decrease exceeds a preset threshold, the pre-filling phase is considered to be about to end. The preset threshold can be set to 70%, meaning that the write traffic of the current window has decreased by more than 70% compared to the previous window.

[0086] When the token bucket pre-charge operation is triggered, the available input / output quota and available bandwidth quota of the first token bucket are overfilled to exceed the standard capacity in one go. The overfill ratio can be configured according to actual conditions, for example, it can be filled to 150% of the standard capacity. At the same time, the time slice granularity of the first token bucket is temporarily shortened, for example, from 0.5ms to 0.25ms, to improve the scheduling response speed. The effective time of the pre-charge operation is a preset window time. After the window time ends, the parameters of the first token bucket are restored to the normal configuration.

[0087] For example, the standard burst capacity of the first token bucket is 20,000 I / O operations and 200MB bandwidth, with a normal time slice granularity of 0.5ms. When a pre-filled write traffic decrease of 80% is detected, exceeding a preset threshold of 70%, a token bucket pre-filling operation is triggered. The available I / O quota is filled to 30,000 operations, the available bandwidth quota is filled to 300MB, and the time slice granularity is shortened to 0.25ms. The pre-filling effective window is set to 500ms. Within 500ms, the first token bucket uses excess quota and shortened time slices for scheduling. After 500ms, the quota and time slice granularity return to the normal configuration.

[0088] It should be noted that the token bucket pre-charge operation is executed only once per pre-filling phase and is not triggered again until the next pre-filling phase ends. If a decrease in pre-filling write traffic is detected again within the pre-charge validity window, the pre-charge operation is not triggered again. This avoids the resource waste caused by excessive pre-charging.
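The pre-charge trigger, the overfill, and the restore step can be sketched together as follows. The class and method names are hypothetical, percentages are kept as integers to avoid rounding, and halving the time slice generalizes the 0.5ms to 0.25ms example; clamping leftover quota back to the burst capacity on expiry is also an assumption.

```python
class DecodeReadBucket:
    """First token bucket with a one-shot pre-charge (example values)."""

    def __init__(self, burst_ios: int = 20_000, burst_mb: int = 200,
                 slice_us: int = 500):
        self.burst_ios = burst_ios
        self.burst_mb = burst_mb
        self.normal_slice_us = slice_us
        self.slice_us = slice_us
        self.io_quota = 0
        self.bw_quota_mb = 0
        self.precharge_until_ms = None  # None: no pre-charge is active

    def on_write_window(self, prev_write: int, cur_write: int, now_ms: int,
                        drop_percent: int = 70, overfill_percent: int = 150,
                        effective_ms: int = 500) -> bool:
        """Pre-charge when pre-filling write traffic drops by more than the threshold."""
        if self.precharge_until_ms is not None:
            return False  # still inside the validity window: do not re-trigger
        if prev_write <= 0:
            return False
        # decrease > drop_percent  <=>  (prev - cur) * 100 > drop_percent * prev
        if (prev_write - cur_write) * 100 <= drop_percent * prev_write:
            return False
        self.io_quota = self.burst_ios * overfill_percent // 100
        self.bw_quota_mb = self.burst_mb * overfill_percent // 100
        self.slice_us = self.normal_slice_us // 2
        self.precharge_until_ms = now_ms + effective_ms
        return True

    def tick(self, now_ms: int) -> None:
        """Restore the normal configuration once the validity window has passed."""
        if (self.precharge_until_ms is not None
                and now_ms >= self.precharge_until_ms):
            self.slice_us = self.normal_slice_us
            self.io_quota = min(self.io_quota, self.burst_ios)
            self.bw_quota_mb = min(self.bw_quota_mb, self.burst_mb)
            self.precharge_until_ms = None


bucket = DecodeReadBucket()
triggered = bucket.on_write_window(prev_write=100, cur_write=20, now_ms=0)
```

An 80% drop triggers the pre-charge: the quotas become 30,000 operations and 300MB, and the time slice becomes 250 microseconds until `tick` is called at or after 500ms.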

[0089] In some embodiments, the available input / output quota and available bandwidth quota of each token bucket are calculated respectively, including: maintaining an independent quota counter and synchronization mechanism for each token bucket, updating the available input / output quota and available bandwidth quota of each token bucket according to the preset token filling rate and burst capacity, and prioritizing the processing of input / output requests corresponding to the key-value cache access mode when input / output requests from different token buckets arrive concurrently.

[0090] Specifically, the quota counter for each token bucket is implemented using atomic variables to ensure thread safety during concurrent access by multiple CPUs. Synchronization mechanisms employ spin locks or mutexes; when multiple CPUs simultaneously modify the quota counter of the same token bucket, they must acquire the lock first, and then release it after modification. To reduce lock contention overhead, this application adopts a per-CPU token bucket approach. Each CPU maintains an independent copy of the token bucket, and the quotas of each CPU's token buckets are periodically aggregated into the global token bucket. This approach reduces lock contention and improves concurrency performance.

[0091] In this embodiment, the token filling rate and burst capacity can be dynamically adjusted through the user-space configuration interface without recompiling the kernel or restarting the system. Users can adjust the parameters of each token bucket in real time according to actual changes in business load to achieve optimal service quality.

[0092] When input / output requests from different token buckets arrive concurrently, they are processed according to the priority order of the first, second, and third token buckets. First, all waiting requests from the first token bucket are processed, then all waiting requests from the second token bucket are processed, and finally all waiting requests from the third token bucket are processed. This priority processing mechanism ensures that key-value cache traffic can obtain resources first, avoiding being blocked by regular traffic.

[0093] Based on the same inventive concept, this application also provides a block device layer differentiated admission control system. As shown in Figure 5, it includes a pattern recognition module 10, a token bucket scheduling module 20, and an admission control module 30.

[0094] The pattern recognition module 10 is used to acquire input / output request feature data of the block device layer and to identify key-value cache access patterns based on that feature data. The token bucket scheduling module 20 is used to allocate corresponding token bucket resources to input / output requests based on the identified key-value cache access patterns. The admission control module 30 is used to perform differentiated admission control based on the allocated token bucket resources.

[0095] It should be noted that all modules of the system run in the operating system kernel mode and are deployed in the input / output processing path of the block device layer. The pattern recognition module is responsible for capturing input / output requests, extracting feature data, and performing pattern recognition. The token bucket scheduling module is responsible for maintaining three independent token buckets and managing the filling and consumption of tokens. The admission control module is responsible for deciding whether to allow input / output requests based on the available token amount in the token buckets.

[0096] For example, the system's workflow is as follows: When an input / output request arrives at the block device layer, it is first captured by the pattern recognition module. The pattern recognition module extracts the feature data of the request, updates the input / output statistical context, and then identifies the traffic type of the request based on the statistical context. The traffic type information is then passed to the token bucket scheduling module, which allocates the corresponding token bucket resources to the request based on the traffic type. Finally, the admission control module checks the available quota of the corresponding token bucket. If the quota is sufficient, the request is allowed; otherwise, the request is placed in a waiting queue.
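
The final admission step of this workflow can be sketched as follows. The sketch assumes the traffic type has already been identified by the pattern recognition module; the type labels, the `Bucket` structure, and the one-token-per-request cost are hypothetical simplifications.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Bucket:
    tokens: int                                    # available quota, in requests
    waiting: deque = field(default_factory=deque)  # requests parked for lack of quota


# Traffic type -> bucket name; the type labels are hypothetical placeholders.
BUCKET_FOR_TYPE = {"kv_decode_read": "first", "kv_prefill_write": "second"}


def admit(request_id: str, traffic_type: str, buckets: dict) -> bool:
    """Allow the request if its bucket has quota, otherwise queue it."""
    # Anything not recognized as key-value cache traffic uses the third bucket.
    bucket = buckets[BUCKET_FOR_TYPE.get(traffic_type, "third")]
    if bucket.tokens >= 1:
        bucket.tokens -= 1
        return True
    bucket.waiting.append(request_id)
    return False
```

Because each traffic type draws from its own bucket, exhausting the regular data quota in this sketch leaves the quota for key-value cache reads untouched.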

[0097] In this application, the technical solution can be applied to various scenarios. In a multi-tenant inference service deployment scenario, multiple large model inference services run on the same storage array, and the model and context length of each tenant may differ. This application runs a pattern recognition engine and a token bucket scheduler independently for each tenant, and automatically calibrates recognition parameters, such as the inter-layer jump step size threshold, based on the input / output characteristics of each tenant. Tenants with small models and short contexts have a smaller inter-layer step size, while tenants with large models and long contexts have a larger one. The pattern recognition engine automatically learns and adjusts the step size threshold based on historical data to improve recognition accuracy. The token buckets of each tenant are isolated from one another, so the decoding reads of different tenants do not affect each other, ensuring quality of service in a multi-tenant environment.

[0098] In scenarios where inference and training tasks are co-located, inference services and training tasks run simultaneously on the same storage device. The inference service's key-value cache reads utilize a first token bucket, ensuring high I / O throughput and low latency. The training task's data loading and checkpoint writes utilize a third token bucket, guaranteeing high bandwidth. This dual-bucket isolation mechanism ensures that training traffic does not crowd out inference traffic resources, maintaining stable decoding latency for the inference service even under full load. Simultaneously, the training task can fully utilize the storage device's bandwidth without experiencing a significant drop in training speed due to inference traffic.

[0099] In key-value cache page swapping scenarios when video memory is insufficient, the inference engine swaps out some key-value cache blocks to storage when the graphics processor's video memory is low, and then swaps them back into video memory when needed. Swap-out write operations are key-value cache pre-fill writes, processed using the second token bucket; a medium-high I / O rate configuration can prevent swap-out operations from blocking. Swap-in read operations are key-value cache decoding reads, processed using the first token bucket; a high I / O rate configuration can ensure the speed of swap-in operations and guarantee the continuity of decoding. Block mapping table update operations are index lookups, processed using the metadata fast channel, reducing metadata access latency and improving page swapping efficiency.
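
The routing described for the page swapping scenario amounts to a small lookup table, sketched below. The operation and resource names are hypothetical labels chosen for this example.

```python
# Resource names are hypothetical labels for the resources named in the text.
SWAP_ROUTES = {
    "swap_out_write": "second_token_bucket",      # treated as pre-fill writes
    "swap_in_read": "first_token_bucket",         # treated as decoding reads
    "block_map_update": "metadata_fast_channel",  # treated as index lookups
}


def route_swap_operation(operation: str) -> str:
    """Return the admission resource for a page-swap operation.

    Any other operation is treated as regular data and uses the third bucket.
    """
    return SWAP_ROUTES.get(operation, "third_token_bucket")
```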

[0100] Existing technologies address the input / output latency of the key-value cache in large model inference by optimizing at the application-layer inference engine or the graphics processor driver layer, including paged attention memory management, page swapping priority scheduling at the driver layer, and prefetching strategies at the application layer. These solutions do not address the input / output scheduling logic at the operating system block device layer. Input / output requests with different semantics are still processed uniformly at the block layer, so latency-sensitive key-value cache reads compete with bandwidth-sensitive training data loading and checkpoint writes within the same resource pool. This application pushes optimization down to the block device layer, using input / output behavior characteristics observable at the block layer to infer upper-layer inference semantics and thereby implement differentiated resource scheduling. Existing technologies do not establish a correspondence between block-layer input / output characteristics and large model inference stages. This application analyzes the input / output behavior patterns at different stages of large model inference and extracts three quantifiable features: input / output size distribution, random read ratio, and offset step size regularity. It then constructs identification rules for key-value cache decoding reads, pre-fill writes, and index lookups, achieving traffic classification without modifying the upper-layer inference engine. Existing block-layer quality of service mechanisms employ either a single token bucket or a multi-token-bucket architecture partitioned by process or tenant, and fail to distinguish between traffic with different semantics within the same process. This application expands the original single token bucket into independent key-value cache read, key-value cache write, and regular data buckets, each configured with parameters matching the resource requirements of its traffic type, to achieve resource isolation for traffic with different semantics. Furthermore, this application exploits the phase transition characteristics of inference, proactively adjusting the resource quota of the key-value cache read bucket by monitoring changes in pre-fill write traffic. It also sets up an independent fast admission path for the very fine-grained random reads characteristic of index lookups. These technical features work together to close the loop from semantic recognition to resource scheduling at the block layer, resolving the block-layer bandwidth contention problem that upper-layer optimizations cannot overcome. Together they form an integrated technical solution that reduces the latency of key-value cache reads during large model inference without altering the upper-layer software architecture, thus improving the stability of the inference service.
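
The identification rules built on the three features can be sketched as a small rule-based classifier that follows the conditions recited in claim 3. The field names, the threshold values, and the order in which the rules are tested are all assumptions made for illustration; this section does not specify threshold values.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Per-window block-layer statistics; field names are hypothetical."""
    random_read_ratio: float = 0.0
    sequential_write_ratio: float = 0.0
    small_io_ratio: float = 0.0          # share of small-granularity requests
    medium_io_ratio: float = 0.0         # share of medium-granularity requests
    tiny_io_ratio: float = 0.0           # share of very fine-grained requests
    periodic_offset_jumps: bool = False  # offset step history shows periodic jumps
    write_burst_then_drop: bool = False  # write traffic spiked, then fell back
    reads_in_narrow_range: bool = False  # read addresses cluster in a preset range


# Placeholder threshold values, invented for illustration only.
THRESHOLDS = {"t1": 0.8, "t2": 0.7, "t3": 0.8, "t4": 0.7, "t5": 0.8, "t6": 0.7}


def classify(stats: WindowStats, t: dict = THRESHOLDS) -> str:
    # Index lookups are tested first because they also look like random reads
    # (this ordering is an assumption of the sketch).
    if (stats.random_read_ratio > t["t5"] and stats.tiny_io_ratio > t["t6"]
            and stats.reads_in_narrow_range):
        return "kv_index_lookup"
    if (stats.random_read_ratio > t["t1"] and stats.small_io_ratio > t["t2"]
            and stats.periodic_offset_jumps):
        return "kv_decode_read"
    if (stats.sequential_write_ratio > t["t3"] and stats.medium_io_ratio > t["t4"]
            and stats.write_burst_then_drop):
        return "kv_prefill_write"
    return "regular"
```

The classifier reads only statistics that are visible at the block layer, which is what allows classification without any change to the inference engine.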

[0101] In this embodiment, the technical solution is implemented in the Linux kernel, adding relevant functionality by extending the Quality of Service (QoS) module of the block device layer. The Linux kernel's block layer already provides basic input / output statistics and token bucket rate limiting mechanisms. This application extends this existing framework without requiring a complete refactoring of the block layer system, thus reducing development difficulty and kernel compatibility risks. The functional modules of this application can be compiled into kernel modules and dynamically loaded into the kernel without recompiling the entire kernel, facilitating deployment and maintenance.

[0102] It should be noted that the technical solution of this application is not limited to the Linux kernel, but can also be applied to the block device layer of other operating systems, as long as the operating system supports block-level hook functions and quality of service extension mechanisms. Furthermore, the technical solution of this application can also be used in conjunction with application-level optimization techniques, such as paged attention algorithms. The inference engine optimizes the memory layout and swapping strategy of the key-value cache at the application layer, while this application guarantees the quality of service for swap-out and swap-in operations at the block layer. The synergy of these two approaches can further improve inference performance.

Claims

1. A block device layer differentiated admission control method, characterized in that it comprises: obtaining input / output request feature data from the block device layer, wherein the input / output request feature data is obtained by extracting features from input / output requests; identifying a key-value cache access pattern based on the input / output request feature data; and, based on the identified key-value cache access pattern, allocating corresponding token bucket resources to the input / output requests to perform differentiated admission control.

2. The method according to claim 1, characterized in that identifying the key-value cache access pattern based on the input / output request feature data includes: calculating a random read ratio and an input / output size distribution based on the input / output request feature data; extracting an offset step size history of adjacent input / output requests based on the input / output request feature data; and identifying the key-value cache access pattern based on the random read ratio, the input / output size distribution, and the offset step size history.

3. The method according to claim 2, characterized in that identifying the key-value cache access pattern based on the random read ratio, the input / output size distribution, and the offset step size history includes: when the random read ratio exceeds a first threshold, the proportion of small-granularity input / output exceeds a second threshold, and there are periodic jumps in the offset step size history, determining the pattern to be a key-value cache decoding read; when the sequential write ratio exceeds a third threshold, the proportion of medium-granularity input / output exceeds a fourth threshold, and the write traffic suddenly increases and then falls back within a preset window, determining the pattern to be a key-value cache pre-fill write; and when the random read ratio exceeds a fifth threshold, the proportion of very fine-grained input / output exceeds a sixth threshold, and the read addresses are concentrated within a preset range, determining the pattern to be a key-value cache index lookup.

4. The method according to claim 1, characterized in that allocating corresponding token bucket resources to the input / output requests based on the identified key-value cache access pattern to perform differentiated admission control includes: allocating a first token bucket resource for key-value cache decoding reads; allocating a second token bucket resource for key-value cache pre-fill writes; allocating a third token bucket resource for regular data input / output; and performing admission control for input / output requests based on the allocated token bucket resources.

5. The method according to claim 4, characterized in that performing admission control for input / output requests based on the allocated token bucket resources includes: calculating the available input / output quota and available bandwidth quota for each token bucket, wherein the token buckets include a first token bucket, a second token bucket, and a third token bucket; deducting the available input / output quota and available bandwidth quota of the corresponding token bucket based on the type of the input / output request; allowing the input / output request to pass when the available input / output quota and available bandwidth quota of the corresponding token bucket are sufficient; and making the input / output request wait and retry when the available input / output quota and available bandwidth quota of the corresponding token bucket are insufficient.

6. The method according to claim 3, characterized in that, when a key-value cache index lookup is identified, allocating corresponding token bucket resources to the input / output requests based on the identified key-value cache access pattern to perform differentiated admission control includes: marking the input / output request as a metadata request, allocating a fast-track resource to the metadata request, and performing admission control of the metadata request based on the fast-track resource.

7. The method according to claim 6, characterized in that performing admission control of the metadata request based on the fast-track resource includes: counting the number of metadata requests passing through the fast channel within a preset window; when the number of metadata requests does not exceed the fast channel rate limit, allowing the metadata requests to pass and waiving or reducing the token bucket quota deduction; and when the number of metadata requests exceeds the fast channel rate limit, allocating the excess metadata requests to the corresponding token bucket for admission control.

8. The method according to claim 3, characterized in that, when the drop in key-value cache pre-fill write traffic exceeds a preset threshold, a token bucket pre-filling operation is triggered to adjust the available quota and time slice granularity of the first token bucket, and admission control for subsequent key-value cache decoding reads is then performed based on the adjusted first token bucket resources.

9. The method according to claim 5, characterized in that calculating the available input / output quota and available bandwidth quota for each token bucket includes: maintaining an independent quota counter and synchronization mechanism for each token bucket; updating the available input / output quota and available bandwidth quota of each token bucket according to a preset token filling rate and burst capacity; and, when input / output requests from different token buckets arrive concurrently, processing the input / output requests corresponding to the key-value cache access pattern first.

10. A block device layer differentiated admission control system, characterized in that it comprises a pattern recognition module, a token bucket scheduling module, and an admission control module, wherein: the pattern recognition module is used to obtain input / output request feature data of the block device layer and to identify key-value cache access patterns based on the input / output request feature data, the input / output request feature data being obtained by extracting features from input / output requests; the token bucket scheduling module is used to allocate corresponding token bucket resources to input / output requests based on the identified key-value cache access pattern; and the admission control module is used to perform differentiated admission control based on the allocated token bucket resources.