A prefix caching method and apparatus suitable for linear attention model architecture

By capturing snapshots of the linear attention layer state, indexing them by block prefix hash values, and writing them back in parallel, the method removes repeated computation in the linear attention layer, achieving efficient prefix caching, lower computational cost, and higher inference efficiency.

CN122111685B · Active · Publication Date: 2026-06-30 · SHANGHAI SUIYUAN TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANGHAI SUIYUAN TECH CO LTD
Filing Date
2026-04-27
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing prefix caching techniques cannot be applied effectively to linear attention layers. The convolutional and recurrent states of those layers must be recomputed from the initial zero state, which greatly reduces the speedup that prefix caching provides.

Method used

Snapshots of the linear attention layer are captured, associated with block prefix hash values, and stored in a snapshot mapping table. When a new request sequence arrives, a target runtime slot is allocated and the state scattering kernel engine writes the snapshot in parallel into the global state tensor, avoiding redundant computation.

Benefits of technology

Prefix caching is implemented for the linear attention layer, which significantly reduces computation and first-token latency and improves inference throughput.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a prefix caching method and apparatus suitable for linear attention model architectures, introducing a hash-indexed snapshot mechanism for linear attention states. During the capture phase, the system extracts the convolutional and recurrent states of linear attention at block boundaries as snapshots and stores them in a mapping table keyed by the block prefix hash value. During the recovery phase, runtime slots are allocated for new requests that hit the prefix, and a state scattering kernel engine writes the snapshots in parallel, by index, into the global state tensor, thereby skipping the linear attention computation for prefixes that have already been hit. The snapshot pool uses a least recently used (LRU) strategy for capacity management and automatic eviction, and a binding validity check is performed before the prefill computation. If a snapshot has been asynchronously evicted, references are released and state is cleared so that the request falls back to full prefill mode, which ensures correctness. The method avoids redundant computation, reduces first-token latency, and improves inference throughput, while balancing system stability and resource utilization.

Description

Technical Field

[0001] This invention relates to the fields of artificial intelligence and natural language processing, and in particular to a prefix caching method and apparatus suitable for linear attention model architectures.

Background Technology

[0002] Traditional Transformer architectures employ a full attention mechanism. During inference, they typically store computed key-value pairs in a key-value cache (KV Cache). The computational complexity of full attention is O(n^2), where n is the sequence length. The KV Cache eliminates the need to recompute the key-value pairs of historical tokens during autoregressive decoding, significantly improving decoding efficiency. Existing prefix caching technologies typically divide the token sequence into fixed-size blocks and construct a prefix hash tree based on chained hash values, reusing physical blocks of the KV Cache through longest common prefix matching.

[0003] To address the computational complexity bottleneck of full attention in scenarios with extremely long contexts, linear attention mechanisms (such as GatedDeltaNet or Mamba structures) have been proposed, reducing computational complexity to O(n). However, existing prefix caching techniques only manage and reuse KV cache blocks for full attention layers. When the model includes a linear attention layer, even if the prefix is fully hit, the convolutional and recurrent states of the linear attention layer still need to be recomputed from the initial zero state, significantly reducing the speedup that prefix caching provides. Furthermore, unlike the KV cache, which can be stored and referenced independently per token block, the recurrent state of linear attention is a compressed accumulation of the entire prefix sequence and cannot simply be divided into blocks. Therefore, traditional KV cache prefix caching methods based on physical block reference counting and sharing cannot be applied directly to linear attention. Consequently, a prefix caching method designed specifically for linear attention layers is currently lacking.

Summary of the Invention

[0004] This invention provides a prefix caching method and apparatus suitable for linear attention model architectures, for caching and reusing linear attention states.

[0005] According to a first aspect of the present invention, a prefix caching method suitable for linear attention model architectures is provided, the method comprising:

[0006] During the historical phase of model inference, linear attention snapshots of each request sequence in the linear attention layer are obtained, wherein the linear attention snapshots include convolutional state copies and recurrent state copies;

[0007] Obtain the block prefix hash value of each of the request sequences, and use the block prefix hash value as an index to associate and map it with the corresponding linear attention snapshot, and store it in the snapshot mapping table;

[0008] In response to a new request sequence, when a target linear attention snapshot matching the target prefix hash value of the new request sequence is retrieved in the snapshot mapping table, a target runtime slot is allocated to the new request sequence;

[0009] The state scattering kernel engine is invoked to write the target linear attention snapshot in parallel into the global state tensor according to the target runtime slot, thereby skipping the calculation process corresponding to the target prefix in the new request sequence.

[0010] Optionally, obtaining a linear attention snapshot of each request sequence in the linear attention layer includes:

[0011] Determine the runtime slot index corresponding to each of the request sequences in the global state tensor;

[0012] Based on the runtime slot index, the state slices belonging to each request sequence are extracted in parallel using the tensor index selection operator in different linear attention layers of the global state tensor.

[0013] The state slices corresponding to each layer are packaged and encapsulated to construct the linear attention snapshot.

[0014] Optionally, after associating the block prefix hash value as an index with the corresponding linear attention snapshot, the method further includes:

[0015] Store the block prefix hash value into the Least Recently Used (LRU) queue and record the timestamp of the storage time;

[0016] When the total capacity of the LRU queue exceeds a preset capacity threshold, the least recently used prefix hash value and its corresponding linear attention snapshot are removed according to the timestamp order in the LRU queue.

[0017] Optionally, before allocating a target runtime slot to the new request sequence, the method further includes:

[0018] Obtain the original model network structure parameters recorded in the target linear attention snapshot;

[0019] The consistency of the original model network structure parameters with the network structure parameters of the current inference environment is verified.

[0020] If the verification passes, proceed with the step of allocating the target runtime slot.

[0021] If the verification fails, the target linear attention snapshot is deemed invalid.

[0022] Optionally, the invocation of the state scattering kernel engine, which writes the target linear attention snapshot in parallel into the global state tensor according to the target runtime slot, includes:

[0023] The target linear attention snapshot is used as the source state data, and a slot mapping index array is constructed based on the target runtime slot;

[0024] The source state data, the slot mapping index array, and the memory address of the global state tensor are injected into the state scattering kernel engine.

[0025] The state scattering kernel engine constructs a two-dimensional computational grid based on the amount of tasks to be written, wherein the first dimension of the two-dimensional computational grid corresponds to the number of request sequences to be written, and the second dimension corresponds to the data width of the linear attention snapshot.

[0026] Using the two-dimensional computing grid, the source state data is mapped and written in parallel to the specified physical row of the global state tensor according to the non-continuous physical offset provided by the slot mapping index array.

[0027] Optionally, the step of parallel mapping and writing the source state data to a specified physical row of the global state tensor based on the non-contiguous physical offset provided by the slot mapping index array includes:

[0028] Obtain the source tensor row stride of the source state data and the target tensor row stride of the global state tensor.

[0029] The state scattering kernel engine performs automatic memory addressing based on the source tensor row stride and the target tensor row stride to support in-place updates of data under non-contiguous memory layouts.
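The stride-based addressing in [0028] and [0029] can be illustrated with flat buffers. The following is only a minimal sketch under stated assumptions, not the patented kernel: the function name `scatter_rows_strided`, the use of Python lists as flat memory, and the parameter names are all hypothetical. It shows how separate source and target row strides (step sizes) let padded or otherwise non-contiguous rows be updated in place.

```python
from typing import List, Sequence


def scatter_rows_strided(
    src: Sequence[float],
    src_row_stride: int,
    dst: List[float],
    dst_row_stride: int,
    slot_mapping: Sequence[int],
    width: int,
) -> None:
    """Copy `width` elements of each source row into its target row, in place.

    Row i of the source starts at offset i * src_row_stride in `src`.
    Row s of the target starts at offset s * dst_row_stride in `dst`.
    Elements beyond `width` in a target row (padding) are left untouched.
    """
    if width > src_row_stride or width > dst_row_stride:
        raise ValueError("width must not exceed either row stride")
    for i, slot in enumerate(slot_mapping):
        src_base = i * src_row_stride      # start of source row i
        dst_base = slot * dst_row_stride   # start of target row `slot`
        dst[dst_base:dst_base + width] = src[src_base:src_base + width]
```

Because the two strides are passed independently, the source snapshot and the global state buffer do not need to share a memory layout.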

[0030] Optionally, after allocating a target runtime slot to the new request sequence, a degradation processing step is further included, specifically including:

[0031] During the verification phase before the prefill computation is performed, if it is detected that the target linear attention snapshot has been asynchronously evicted, then regardless of whether the associated key-value cache block is hit, the following degradation operation is performed:

[0032] The memory reference relationship between the new request sequence and the hit key-value cache block is severed, and the reference count of the key-value cache block is updated;

[0033] Clear the historical cache state record of the new request sequence, which includes a block physical mapping table, the number of prefix hits, and a hash pointer;

[0034] Independent physical blocks are reallocated to the new request sequence, and inference computation is performed by switching to full prefill mode.

[0035] According to a second aspect of the present invention, a prefix caching device suitable for linear attention model architectures is provided, the device comprising:

[0036] The snapshot capture module is used to obtain linear attention snapshots of each request sequence in the linear attention layer during the historical stage of model inference, wherein the linear attention snapshots include convolutional state copies and recurrent state copies;

[0037] The hash mapping module is used to obtain the block prefix hash value of each request sequence, associate the block prefix hash value with the corresponding linear attention snapshot as an index, and store it in the snapshot mapping table;

[0038] The slot allocation module is used to allocate a target runtime slot to a new request sequence in response to a received new request sequence when a target linear attention snapshot that matches the target prefix hash value of the new request sequence is retrieved in the snapshot mapping table.

[0039] The snapshot recovery module is used to call the state scattering kernel engine to write the target linear attention snapshot in parallel into the global state tensor according to the target runtime slot, so as to skip the calculation process corresponding to the target prefix in the new request sequence.

[0040] According to a third aspect of the present invention, an electronic device is provided, the electronic device comprising:

[0041] One or more processors;

[0042] Storage device for storing one or more programs;

[0043] When the one or more programs are executed by the one or more processors, the one or more programs cause the one or more processors to perform the method as described in any embodiment of the present invention.

[0044] According to a fourth aspect of the present invention, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the method as described in any embodiment of the present invention.

[0045] The technical solution of the present invention has the following beneficial effects: in the capture phase, the linear attention state is fixed as an immutable snapshot copy and stored in the snapshot mapping table in association with the block prefix hash value; in the recovery phase, an independent, private runtime slot that has been cleared is always allocated, and the snapshot is written in parallel into the global state tensor according to the runtime slot by calling the state scattering kernel engine. This realizes prefix caching for the linear attention layer, avoids repeated computation in the linear attention layer, significantly reduces first-token latency, and improves inference throughput.

[0046] It should be understood that the description in this section is not intended to identify key or important features of the embodiments of the present invention, nor is it intended to limit the scope of the invention. Other features of the invention will become readily apparent from the following description.

Description of the Drawings

[0047] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0048] Figure 1 is a flowchart of a prefix caching method suitable for a linear attention model architecture according to Embodiment 1 of the present invention;

[0049] Figure 2 is a schematic diagram of the state structure of the linear attention layer according to Embodiment 1 of the present invention;

[0050] Figure 3 is a schematic diagram of the capture and recovery of a linear attention snapshot according to Embodiment 1 of the present invention;

[0051] Figure 4 is a schematic diagram of the internal structure and slot management of Mamba according to Embodiment 1 of the present invention;

[0052] Figure 5 is a flowchart of another prefix caching method suitable for linear attention model architectures provided by Embodiment 2 of the present invention;

[0053] Figure 6 is a schematic diagram illustrating the working principle of the state scattering kernel engine provided in Embodiment 2 of the present invention;

[0054] Figure 7 is a schematic diagram of dynamic routing and non-contiguous physical mapping of the state scattering kernel engine provided in Embodiment 2 of the present invention;

[0055] Figure 8 is a schematic diagram of a prefix caching device suitable for a linear attention model architecture according to Embodiment 3 of the present invention;

[0056] Figure 9 is a structural block diagram of an electronic device provided in Embodiment 4 of the present invention.

Detailed Implementation

[0057] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are merely some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0058] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, apparatus, product, or terminal device that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or terminal devices.

[0059] Figure 1 is a flowchart illustrating a prefix caching method suitable for a linear attention model architecture, as provided in Embodiment 1 of the present invention. This embodiment is applicable to scenarios involving caching and reusing linear attention states. The method can be executed by a prefix caching device suitable for a linear attention model architecture. This device can be implemented in hardware and/or software and can be integrated into an electronic device with data processing capabilities.

[0060] As shown in Figure 1, the method includes:

[0061] S101, during the historical phase of model inference, obtain linear attention snapshots of each request sequence in the linear attention layer.

[0062] Optionally, obtaining a linear attention snapshot of each request sequence in the linear attention layer includes: determining the runtime slot index corresponding to each request sequence in the global state tensor; based on the runtime slot index, extracting state slices belonging to each request sequence in parallel using the tensor index selection operator in different linear attention layers of the global state tensor; and packaging and encapsulating the state slices corresponding to each layer to construct a linear attention snapshot.

[0063] Figure 2 is a schematic diagram of the state structure of the linear attention layer. During the historical stage of model inference, the system obtains a linear attention snapshot of each request sequence for the structure shown in Figure 2. Since different request sequences occupy different runtime slots during inference, the runtime slot index of each request sequence in the global state tensor must be determined first. The intermediate state generated by each request sequence during inference consists of a convolutional state and a recurrent state. The convolutional state has dimensions [d_conv, conv_kernel_size - 1], where d_conv is the number of convolution kernels, which determines the number of feature maps produced by the convolution, and conv_kernel_size is the size of the convolution kernel, which defines the range the kernel covers each time it slides over the input. The convolutional state corresponds to the short-range dependency memory of the linear attention layer. The recurrent state has dimensions [num_heads, head_k_dim, head_v_dim], where num_heads is the number of attention heads, which determines how many attention patterns are computed in parallel; head_k_dim is the dimension of the key vector in each head; and head_v_dim is the dimension of the value vector in each head, which sets the feature length of each head's output. The recurrent state corresponds to the compressed long-range dependencies of the linear attention layer.

[0064] In this embodiment, based on the determined runtime slot index, the state slices belonging to each request sequence are extracted in parallel using the tensor index selection operator in the different linear attention layers of the global state tensor. These state slices include the convolutional and recurrent states described above. The state slices corresponding to each layer are then packaged as copies to construct a linear attention snapshot. The linear attention snapshot therefore contains copies of the convolutional and recurrent states.
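The capture step described above can be sketched as follows. This is a simplified illustration, not the implementation of this embodiment: nested Python lists stand in for the global state tensors, plain indexing stands in for the tensor index selection operator, and the names `LayerState` and `capture_snapshot` are hypothetical.

```python
import copy
from dataclasses import dataclass
from typing import Any, List


@dataclass(frozen=True)
class LayerState:
    """State copies of one linear attention layer for one request sequence."""
    conv: Any       # copy of the convolutional state slice
    recurrent: Any  # copy of the recurrent state slice


def capture_snapshot(global_conv: List[Any], global_rec: List[Any], slot: int) -> List[LayerState]:
    """Build a snapshot for the request occupying runtime slot `slot`.

    global_conv[layer][slot] and global_rec[layer][slot] hold the live state.
    Deep copies are taken so later updates to the live state cannot change
    the stored snapshot.
    """
    return [
        LayerState(conv=copy.deepcopy(conv[slot]), recurrent=copy.deepcopy(rec[slot]))
        for conv, rec in zip(global_conv, global_rec)
    ]
```

Copying instead of referencing matters here because the runtime slot keeps being updated as decoding continues, while the snapshot must stay fixed at the capture point.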

[0065] Figure 3 illustrates the capture and recovery of a linear attention snapshot. During inference, the system divides the request sequence into blocks of a fixed size (e.g., 16 tokens) and calculates a prefix hash value for each block. The prefix hash value of each block incorporates the hash of the preceding block, forming a chain dependency; that is, the prefix hash value of the i-th block depends not only on the tokens within that block but also on the hash value of the previous block. As shown in Figure 3, the system uses the boundary of each block as a capture point to capture the current state slices of all linear attention layers, and stores the linear attention snapshots generated from those state slices in the cache. When the sequence completes inference (for example, by reaching a stop token or the maximum length), the system captures the linear attention snapshot corresponding to the complete sequence so that subsequent requests with the same prefix can reuse it. This embodiment is illustrative only and does not limit the specific method of obtaining linear attention snapshots.
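The chained block prefix hash can be sketched as below. The patent text does not name a hash function or a serialization format, so SHA-256 and the comma-joined token encoding are assumptions made purely for illustration, as is the function name `block_prefix_hashes`.

```python
import hashlib
from typing import List, Sequence

BLOCK_SIZE = 16  # tokens per block, following the example block size in the text


def block_prefix_hashes(token_ids: Sequence[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one chained hash per full block of `token_ids`.

    The hash of block i covers the hash of block i-1 plus the tokens of
    block i, so two sequences share a hash at block i only if their
    prefixes agree up to that block boundary (barring hash collisions).
    A trailing partial block gets no hash.
    """
    hashes: List[str] = []
    parent = b""  # the first block has no predecessor
    for i in range(len(token_ids) // block_size):
        block = token_ids[i * block_size:(i + 1) * block_size]
        h = hashlib.sha256()
        h.update(parent)
        h.update(b"|")
        h.update(",".join(str(t) for t in block).encode("utf-8"))
        parent = h.digest()
        hashes.append(h.hexdigest())
    return hashes
```

Two requests that share their first 32 tokens produce the same first two hashes, whereas changing a single token in the first block changes every later hash in the chain.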

[0066] S102, obtain the block prefix hash value of each request sequence, and use the block prefix hash value as an index to associate and map with the corresponding linear attention snapshot, and store it in the snapshot mapping table.

[0067] Figure 4 illustrates the internal structure and slot management of Mamba. The Mamba internal structure includes a runtime state area (top region), a prefix snapshot storage area (middle region), and a data flow and operation area (bottom region). The runtime state area maintains the dynamic state of the current inference process on the GPU, including the GPU global convolutional state tensor, the GPU global recurrent state tensor, and the slot management mechanism. Specifically, the GPU global convolutional state tensor holds the convolutional state cache of the Gated DeltaNet (GDN) layers, and the GPU global recurrent state tensor holds the recurrent state cache of the linear attention mechanism. The slot mapping table maps input sequence IDs to specific physical storage slots, allowing the system to locate the state of a sequence quickly. The free slot stack records the currently available free slots. When a new sequence enters, the system pops a slot from this stack and allocates it to that sequence; when the sequence ends, the slot is returned to the stack. This embodiment does not limit the number of free slots in the free slot stack.
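The slot mapping table and free slot stack can be sketched as a small manager class. This is an illustrative sketch only: the class name `SlotManager`, its method names, and the choice to raise exceptions on misuse are assumptions, not details from the patent.

```python
from typing import Dict, Hashable, List


class SlotManager:
    """Maps sequence IDs to runtime slots and recycles freed slots."""

    def __init__(self, num_slots: int) -> None:
        # Stack of free slot indices; pop() hands out slot 0 first.
        self.free_slots: List[int] = list(range(num_slots - 1, -1, -1))
        # Slot mapping table: sequence ID -> physical slot index.
        self.slot_of: Dict[Hashable, int] = {}

    def allocate(self, seq_id: Hashable) -> int:
        """Pop a free slot and bind it to `seq_id`."""
        if seq_id in self.slot_of:
            raise ValueError(f"sequence {seq_id!r} already has a slot")
        if not self.free_slots:
            raise RuntimeError("no free runtime slot available")
        slot = self.free_slots.pop()
        self.slot_of[seq_id] = slot
        return slot

    def release(self, seq_id: Hashable) -> None:
        """Unbind `seq_id` and push its slot back onto the free stack."""
        slot = self.slot_of.pop(seq_id)
        self.free_slots.append(slot)
```

A stack is a natural fit because a just-released slot is reused first, which keeps the set of active rows in the global state tensor compact.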

[0068] The prefix snapshot storage area is located in GPU memory and stores copies of the state of prefixes that have already been computed, in order to accelerate subsequent inference. In this embodiment, the system obtains the block prefix hash value of each request sequence, uses the block prefix hash value as an index to associate it with the corresponding linear attention snapshot, and stores the pair in the snapshot mapping table. The prefix state hash map in Figure 4 illustrates how the two are associated; its function is to index the stored linear attention snapshots by hash value. Each linear attention snapshot contains copies of the convolutional and recurrent states of all GDN layers, so the complete convolutional and recurrent states are preserved in the snapshot and can be reused directly when the same prefix is encountered.

[0069] Furthermore, in this embodiment, the block prefix hash value is stored in a Least Recently Used (LRU) queue, and the timestamp of the storage time is recorded. This queue is used to manage the lifecycle of snapshots, and the prefix hash value corresponding to the most recently used snapshot is at the top of the queue. When the total capacity of the LRU queue exceeds a preset capacity threshold, the least recently used prefix hash value and its corresponding linear attention snapshot are removed according to the timestamp order in the LRU queue to free up storage space.
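A minimal sketch of such an LRU-managed snapshot pool follows. It makes two simplifying assumptions that differ from the text: recency is tracked by insertion order in an `OrderedDict` instead of explicit timestamps, and capacity is counted in entries, not bytes. The class name `SnapshotPool` and its methods are hypothetical.

```python
from collections import OrderedDict
from typing import Any, Optional


class SnapshotPool:
    """Snapshot mapping table with least-recently-used eviction."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # prefix hash -> snapshot, ordered from least to most recently used
        self._entries: "OrderedDict[str, Any]" = OrderedDict()

    def put(self, prefix_hash: str, snapshot: Any) -> None:
        """Store a snapshot and evict the oldest entries if over capacity."""
        self._entries[prefix_hash] = snapshot
        self._entries.move_to_end(prefix_hash)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop least recently used

    def get(self, prefix_hash: str) -> Optional[Any]:
        """Return the snapshot for `prefix_hash`, marking it as recently used."""
        snapshot = self._entries.get(prefix_hash)
        if snapshot is not None:
            self._entries.move_to_end(prefix_hash)
        return snapshot

    def __contains__(self, prefix_hash: str) -> bool:
        return prefix_hash in self._entries
```

In a real system the capacity threshold would more plausibly be expressed in bytes of GPU memory, since snapshots for different models differ in size.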

[0070] The data flow and operation area shows how data moves between the modules and how it is manipulated. In the capture process, data in the runtime state area is selected through an index select operation and transferred to the prefix snapshot storage area, so that the currently computed state is saved as a new snapshot for later reuse. In the recovery process, data in the prefix snapshot storage area is written into the global state tensor of the runtime state area through a row scatter operation, so that when the same input prefix is encountered again the state is restored directly from the snapshot storage area, avoiding redundant computation and significantly improving inference speed. Figure 4 thus shows how Mamba manages inference state on the GPU through slot management and prefix snapshots: hash maps and LRU queues provide fast lookup and cache management, and the capture and restore operations reuse computed state, which significantly improves the efficiency of long-sequence inference.

[0071] S103, in response to a new request sequence, when a target linear attention snapshot matching the target prefix hash value of the new request sequence is retrieved in the snapshot mapping table, a target runtime slot is allocated to the new request sequence.

[0072] Optionally, before allocating a target runtime slot to a new request sequence, the method further includes: obtaining the original model network structure parameters recorded in the target linear attention snapshot; performing a consistency check between the original model network structure parameters and the network structure parameters of the current inference environment; if the check passes, then performing the step of allocating the target runtime slot; if the check fails, then determining that the target linear attention snapshot is invalid.

[0073] Specifically, when the system receives a new request sequence, it calculates the target prefix hash value for the new request sequence and queries the snapshot mapping table based on the target prefix hash value. If no matching target linear attention snapshot is found in the snapshot mapping table, it means that the same prefix did not appear during the historical request inference process; if a matching target linear attention snapshot is found in the snapshot mapping table, it means that the same prefix appeared during the historical request inference process. Therefore, the target linear attention snapshot can be directly reused for the new request sequence without recalculation. If a successful match is determined, the system allocates a target runtime slot to the new request sequence.

[0074] It should be noted that in this implementation, before a target runtime slot is allocated to a new request sequence, the original model at the time of snapshot generation must be confirmed to be fully consistent with the currently running model. Otherwise, even if a match is found, the matched target linear attention snapshot cannot be applied correctly to the currently running model. The system reads, from the cached target linear attention snapshot, the original model network structure parameters recorded when the snapshot was generated, i.e., the model's architecture configuration, such as the hidden layer dimension, the number of attention heads, and the activation function type. This implementation is illustrative only and does not limit the specific content of the structure parameters. The original model network structure parameters are then checked for consistency with the network structure parameters of the current inference environment, i.e., the two sets of parameters are compared for equality. Only when they are consistent is the target linear attention snapshot deemed valid and a target runtime slot allocated to the new request sequence. If they are inconsistent (for example, after a model version upgrade or a configuration file change), the target linear attention snapshot is deemed invalid, prefix reuse is not possible, and the system discards the invalid snapshot. For example, if the target linear attention snapshot was generated with a 7-billion-parameter (7B) model but the currently running model is a 13-billion-parameter (13B) model, reusing the snapshot directly would cause dimension mismatches or calculation errors. In this case, the new request sequence must be computed from scratch instead of reusing the old cache. Inference is slower as a result, but the correctness of the inference results is guaranteed.
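The consistency check can be sketched as a field-by-field comparison of recorded and current structure parameters. The three fields below are the examples named in the text; the class name `ModelStructure`, the function name `snapshot_is_valid`, and the numeric values in the usage note are hypothetical and chosen only to show two configurations of different sizes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStructure:
    """Network structure parameters recorded alongside a snapshot."""
    hidden_size: int
    num_heads: int
    activation: str


def snapshot_is_valid(recorded: ModelStructure, current: ModelStructure) -> bool:
    """A snapshot may be reused only if every recorded parameter matches.

    Frozen dataclasses compare field by field, so any difference (for
    example after a model upgrade or a config change) makes this False.
    """
    return recorded == current
```

For instance, with `ModelStructure(4096, 32, "silu")` recorded in the snapshot and `ModelStructure(5120, 40, "silu")` as the running model, the check fails and the request would be prefilled from scratch.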

[0075] Optionally, after a target runtime slot is allocated for a new request sequence, a degradation processing mechanism is also included. The specific steps include: during the verification phase before the prefill computation is performed, if it is detected that the target linear attention snapshot has been asynchronously evicted, then regardless of whether the associated key-value cache block is hit, the following degradation operations are performed: severing the memory reference relationship between the new request sequence and the hit key-value cache block, and updating the reference count of the key-value cache block; clearing the historical cache state record of the new request sequence, which includes the block physical mapping table, the number of prefix hit tokens, and hash pointers; reallocating independent physical blocks for the new request sequence, and switching to full prefill mode to perform inference computation.

[0076] Specifically, the state matrix tensors of linear attention are typically far denser than discrete key-value blocks, which makes them a high-risk target for the LRU eviction mechanism under memory pressure. To overcome inference failures caused by asynchronous snapshot eviction, this invention designs a "late binding" final state verification and seamless degradation strategy. At the last moment before the prefill kernel is submitted to the GPU for execution, the system performs a final check on the validity of the target linear attention snapshot corresponding to the prefix hash. If the target linear attention snapshot is found to have been asynchronously evicted (even though the key-value cache prefix block is still alive), the system triggers a silent, seamless degradation: it does not report an error or crash, but automatically switches to full prefill mode and computes from scratch, which ensures the correctness of the inference result.

[0077] Asynchronous eviction refers to a background cleanup process deleting old cache entries because GPU memory is insufficient. In other words, the snapshot still exists when the reference to it is obtained, but it may have been deleted by the time it is actually used. Verification is therefore needed again before the prefill computation. When the degradation operation is performed after a snapshot is confirmed deleted, any reference that the new request sequence holds to a key-value cache block must be removed. The reference count of a key-value cache block records how many requests are using it. The count is decremented when the reference is removed; if it reaches zero, the key-value cache block can be reclaimed and released. In addition, the system clears all cache-related historical state records, such as the block physical mapping table that records the location of cache blocks, the prefix hit count that records the number of cached tokens, and the hash pointers that point to cache locations. This implementation is illustrative only and does not limit the specific content of the historical cache state records. After the clearing operation is complete, the system reallocates independent physical blocks for the new request sequence and starts the computation from scratch.
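The late-binding check and fallback can be sketched as follows. This is an illustration under several assumptions: the `Request` fields, the function name `verify_or_degrade`, and the dictionary-based reference counts are hypothetical, and the final step of reallocating independent physical blocks is left out because it depends on the block allocator.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Request:
    prefix_hash: Optional[str]                           # hash pointer into the snapshot table
    kv_blocks: List[int] = field(default_factory=list)   # block physical mapping table
    num_prefix_hit_tokens: int = 0                       # tokens covered by the prefix hit
    mode: str = "prefix_reuse"


def verify_or_degrade(req: Request, snapshot_table: Dict[str, Any], kv_refcount: Dict[int, int]) -> bool:
    """Final check right before prefill.

    Returns True if the snapshot is still present. Otherwise releases the
    request's KV block references, clears its cache state records, switches
    it to full prefill mode, and returns False.
    """
    if req.prefix_hash is not None and req.prefix_hash in snapshot_table:
        return True
    # Snapshot was evicted: drop KV block references even if the blocks are alive.
    for block in req.kv_blocks:
        kv_refcount[block] -= 1
        if kv_refcount[block] == 0:
            del kv_refcount[block]  # no users left, block can be reclaimed
    # Clear the historical cache state records of this request.
    req.kv_blocks = []
    req.num_prefix_hit_tokens = 0
    req.prefix_hash = None
    req.mode = "full_prefill"
    return False
```

Releasing the KV blocks even though they are still valid keeps the full attention and linear attention layers consistent: both then start from the same (empty) prefix.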

[0078] This implementation achieves graceful degradation after cache invalidation by cleaning up the environment, resetting the state, and switching modes, ensuring that requests do not fail due to cache invalidation. Although inference speed is slightly reduced (due to failure to reuse the cache), the correctness of the final result is guaranteed. This is a robust system design that ensures continuous service availability in high-concurrency, resource-constrained environments.

[0079] S104, invoke the state scattering kernel engine to write the target linear attention snapshot in parallel into the global state tensor based on the target runtime slot, thereby skipping the calculation process corresponding to the target prefix in the new request sequence.

[0080] Optionally, the state scattering kernel engine is invoked to write the target linear attention snapshot in parallel into the global state tensor based on the target runtime slot. This includes: using the target linear attention snapshot as source state data and constructing a slot mapping index array based on the target runtime slot; injecting the source state data, the slot mapping index array, and the memory address of the global state tensor into the state scattering kernel engine; the state scattering kernel engine constructs a two-dimensional computational grid based on the amount of tasks to be written, where the first dimension of the two-dimensional computational grid corresponds to the number of request sequences to be written, and the second dimension corresponds to the data width of the linear attention snapshot; using the two-dimensional computational grid, the source state data is mapped and written in parallel to the specified physical row of the global state tensor based on the non-contiguous physical offset provided by the slot mapping index array.

[0081] Specifically, the system uses a dedicated state scattering kernel engine to directly write previously saved computation results (linear attention snapshots) into the global state tensor used for the current computation. This allows the model to continue inference directly from the breakpoint without recalculating the already processed prefix content, thus accelerating inference. In this implementation, during parallel snapshot writing, data is distributed from a centralized snapshot source to multiple discontinuous physical locations within the global state tensor, thereby improving the efficiency of data writing.

[0082] The technical solution of this invention freezes the linear attention state into an immutable snapshot copy during the capture phase and stores it in a snapshot mapping table in association with the block prefix hash value. During the recovery phase, an independent private runtime slot that has been cleared is forcibly allocated. The state scattering kernel engine is called to write the snapshot into the global state tensor in parallel according to the runtime slot. This achieves prefix caching for the linear attention layer, effectively avoids repeated calculations of the linear attention layer, significantly reduces first-word latency, and improves inference throughput.

[0083] Figure 5 is a flowchart of another prefix caching method suitable for linear attention model architectures provided by an embodiment of the present invention. Based on the above embodiments, this embodiment explains in detail the process of calling the state scattering kernel engine to write the target linear attention snapshot into the global state tensor in parallel according to the target runtime slot. As shown in Figure 5, the method includes:

[0084] S201, during the historical phase of model inference, obtain linear attention snapshots of each request sequence in the linear attention layer.

[0085] Optionally, obtaining a linear attention snapshot of each request sequence in the linear attention layer includes: determining the runtime slot index corresponding to each request sequence in the global state tensor; based on the runtime slot index, extracting state slices belonging to each request sequence in parallel using the tensor index selection operator in different linear attention layers of the global state tensor; and packaging and encapsulating the state slices corresponding to each layer to construct a linear attention snapshot.
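
As a non-limiting illustration, the capture step may be sketched in Python as follows. The sketch assumes the global state tensor is held as nested lists indexed as `[layer][slot]`, and `index_select` is a hypothetical stand-in for the tensor index selection operator; a real snapshot would hold both the convolutional state and the recurrent state of each layer.

```python
import copy


def index_select(layer_state, slot_indices):
    """Pick the rows of one layer's state that belong to the given slots."""
    return [layer_state[slot] for slot in slot_indices]


def capture_snapshots(global_state, slot_indices):
    """Build one snapshot per request: a copy of its state slice in every layer."""
    per_layer = [index_select(layer, slot_indices) for layer in global_state]
    snapshots = []
    for i in range(len(slot_indices)):
        # Package the slices of all layers; deepcopy detaches them from the
        # live state so that later updates cannot alter the snapshot.
        snapshots.append(copy.deepcopy([rows[i] for rows in per_layer]))
    return snapshots


# Two layers, four runtime slots, state width 2 (toy values).
global_state = [
    [[0, 0], [1, 1], [2, 2], [3, 3]],
    [[10, 10], [11, 11], [12, 12], [13, 13]],
]
snaps = capture_snapshots(global_state, [3, 1])
global_state[0][3][0] = 99  # a later decoding step mutates the live state
```

The deep copy is what makes the snapshot behave as a frozen copy: the mutation on the last line does not reach `snaps`.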

[0086] S202, obtain the block prefix hash value of each request sequence, and use the block prefix hash value as an index to associate and map with the corresponding linear attention snapshot, and store it in the snapshot mapping table.
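
As a non-limiting illustration, the association between block prefix hash values and snapshots may be sketched in Python as follows. The block size, the use of SHA-256, the chaining of each block hash with its parent hash, and the names `block_prefix_hashes`, `snapshot_table`, and `lookup_longest_prefix` are assumptions made for this sketch; the embodiment does not specify a hash function.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; an illustrative value


def block_prefix_hashes(token_ids, block_size=BLOCK_SIZE):
    """Return one hash per full block; each hash covers the whole prefix so far."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent.hex())
    return hashes


snapshot_table = {}  # block prefix hash -> linear attention snapshot


def lookup_longest_prefix(token_ids):
    """Return (matched_token_count, snapshot) for the longest cached block prefix."""
    hashes = block_prefix_hashes(token_ids)
    for n in range(len(hashes), 0, -1):
        if hashes[n - 1] in snapshot_table:
            return n * BLOCK_SIZE, snapshot_table[hashes[n - 1]]
    return 0, None


# Historical request: store the state captured at the 8-token block boundary.
history = [1, 2, 3, 4, 5, 6, 7, 8]
snapshot_table[block_prefix_hashes(history)[-1]] = "state after 8 tokens"

# New request sharing the first two blocks with the historical request.
matched, snap = lookup_longest_prefix([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13])
```

Because each hash depends on all preceding blocks, two requests obtain the same hash at a block boundary only if their token prefixes up to that boundary are identical.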

[0087] S203, in response to a new request sequence, when a target linear attention snapshot matching the target prefix hash value of the new request sequence is retrieved in the snapshot mapping table, a target runtime slot is allocated to the new request sequence.

[0088] Optionally, before allocating a target runtime slot to a new request sequence, the method further includes: obtaining the original model network structure parameters recorded in the target linear attention snapshot; performing a consistency check between the original model network structure parameters and the network structure parameters of the current inference environment; if the check passes, then performing the step of allocating the target runtime slot; if the check fails, then determining that the target linear attention snapshot is invalid.
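
As a non-limiting illustration, the consistency check may be sketched in Python as follows. The embodiment does not enumerate the network structure parameters, so the fields of `StructureParams` (layer count, head count, head dimension, convolution kernel size, state data type) are hypothetical examples.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class StructureParams:
    # Hypothetical structure parameters recorded alongside a snapshot.
    num_linear_layers: int
    num_heads: int
    head_dim: int
    conv_kernel_size: int
    state_dtype: str


def mismatched_fields(recorded, current):
    """List the parameters whose recorded value differs from the current one."""
    return [f.name for f in fields(recorded)
            if getattr(recorded, f.name) != getattr(current, f.name)]


def snapshot_is_valid(recorded, current):
    """The snapshot may be reused only if every parameter matches."""
    return not mismatched_fields(recorded, current)


recorded = StructureParams(24, 16, 128, 4, "bfloat16")
current_ok = StructureParams(24, 16, 128, 4, "bfloat16")
current_bad = StructureParams(24, 16, 64, 4, "bfloat16")
```

With these values, `snapshot_is_valid(recorded, current_ok)` passes and slot allocation would proceed, while `current_bad` differs in `head_dim` and the snapshot would be treated as invalid.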

[0089] Optionally, after allocating a target runtime slot for a new request sequence, a degradation processing mechanism is also included. The specific steps include: during the verification phase before performing pre-filling computation, if it is detected that the target linear attention snapshot has been asynchronously evicted, then regardless of whether the associated key-value cache block is hit, the following degradation operations are performed: severing the memory reference relationship between the new request sequence and the hit key-value cache block, and updating the reference count of the key-value cache block; clearing the historical cache state record of the new request sequence, the historical cache state record including the block physical mapping table, the number of prefix hit terms, and hash pointers; reallocating an independent physical block for the new request sequence, and switching to full pre-filling mode to perform inference computation.

[0090] S204, use the target linear attention snapshot as source state data, and construct a slot mapping index array based on the target runtime slot.

[0091] Figure 6 illustrates the working principle of the state scattering kernel engine, describing the tensor processing from input through execution to output. It comprises three regions: the input region, the execution region, and the output region. In this embodiment, the target linear attention snapshot serves as the source state data, i.e., a contiguous memory region (contiguous rows). The slot mapping index array is constructed from the target runtime slots, i.e., an externally supplied array of discrete slots that provides the dynamic offset mapping. For example, in this embodiment the array is [slot_3, slot_10, slot_7], meaning that prefix reuse is performed simultaneously for three new request sequences.

[0092] S205, inject the source state data, the slot mapping index array, and the memory address of the global state tensor into the state scattering kernel engine.

[0093] S206, the state scattering kernel engine constructs a two-dimensional computational grid based on the amount of tasks to be written.

[0094] Within the execution region, the state scattering kernel engine constructs a two-dimensional computational grid based on the amount of tasks to be written. The first dimension of the grid corresponds to the number of request sequences to be written, and the second dimension corresponds to the data width of the linear attention snapshot. In GPU programming, `blockIdx` is the index of a thread block within the grid, and `threadIdx` is the index of a thread within its thread block. Here, `blockIdx.y` maps each row of the input logical sequence (one row per slot) to thread blocks on the GPU, so that each thread block processes data from a single row. `blockIdx.x` and `threadIdx.x` together parallelize the elements within a row along the feature dimension: the threads of a thread block cooperate to process the different features of that row at the same time. Therefore, in this implementation, the row offset in the target global state tensor is calculated directly from `slots[blockIdx.y]`. The value in the slot array gives the starting position at which that row's data should be written in the global state tensor, so that non-contiguous rows of the output tensor are located and written quickly through the slot index.
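
As a non-limiting illustration, the index arithmetic of the two-dimensional grid may be emulated in Python as follows. The nested loops stand in for the thread blocks and threads that a GPU would run in parallel; the function names, the block width of 4, and the bounds guard for a row width that is not a multiple of the block width are assumptions of this sketch.

```python
import math


def scatter_kernel(block_idx_y, block_idx_x, thread_idx_x, block_dim_x,
                   src, slots, dst, width):
    """Body executed by one emulated thread: copy a single element."""
    col = block_idx_x * block_dim_x + thread_idx_x
    if col >= width:  # guard threads that fall past the end of the row
        return
    dst_row = slots[block_idx_y]  # row offset comes straight from the slot array
    dst[dst_row][col] = src[block_idx_y][col]


def launch(src, slots, dst, block_dim_x=4):
    """Emulate the 2D grid sequentially (a GPU would run these in parallel)."""
    width = len(src[0])
    grid_y = len(slots)                      # one grid row per request sequence
    grid_x = math.ceil(width / block_dim_x)  # blocks along the feature dimension
    for by in range(grid_y):
        for bx in range(grid_x):
            for tx in range(block_dim_x):
                scatter_kernel(by, bx, tx, block_dim_x, src, slots, dst, width)


# Three source rows of width 6, scattered to slots 3, 10, and 7 of 12 rows.
src = [[float(10 * r + c) for c in range(6)] for r in range(3)]
dst = [[0.0] * 6 for _ in range(12)]
launch(src, [3, 10, 7], dst)
```

Each emulated thread writes exactly one element, and no two threads write the same location as long as the slot indices are distinct, which is why the writes can proceed in parallel without synchronization.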

[0095] S207, use the two-dimensional computational grid to write the source state data in parallel to the specified physical rows of the global state tensor based on the non-contiguous physical offsets provided by the slot mapping index array.

[0096] Optionally, the source tensor row step size of the source state data and the target tensor row step size of the global state tensor are obtained; the state scattering kernel engine performs automatic memory addressing based on the source tensor row step size and the target tensor row step size to support in-situ updates of data under non-contiguous memory layout.
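
As a non-limiting illustration, addressing by row step size (row stride) may be sketched in Python as follows, with both tensors represented as flat lists. The function name `scatter_strided` and the padded destination layout are assumptions of this sketch.

```python
def scatter_strided(src, src_row_stride, dst, dst_row_stride, slots, width):
    """Scatter rows between flat buffers whose rows may be padded."""
    for i, slot in enumerate(slots):
        src_base = i * src_row_stride      # start of source row i
        dst_base = slot * dst_row_stride   # start of destination row `slot`
        # Only `width` elements are copied; padding in the destination row
        # is left untouched, so the update happens in place.
        dst[dst_base:dst_base + width] = src[src_base:src_base + width]


width = 3
src_stride, dst_stride = 3, 4  # destination rows carry one padding element
src = [1, 2, 3, 4, 5, 6]       # two densely packed source rows
dst = [0] * (5 * dst_stride)   # five destination rows
scatter_strided(src, src_stride, dst, dst_stride, [4, 1], width)
```

Because the base address of each row is computed from its own stride, the source and the destination do not need to share a memory layout.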

[0097] Figure 7 illustrates the dynamic routing and non-contiguous physical mapping performed by the state scattering kernel engine. In the execution region, the state scattering kernel engine performs automatic memory addressing to support in-situ updates of data under non-contiguous memory layouts. The row id identifies each row of results; this architecture can handle large-scale data volumes. In the output region, data is organized in batches, which supports batch inference computation.

[0098] This implementation achieves efficient state writing through the following mechanisms: First, it uses a slot array to dynamically map logical indices to physical memory, supporting the processing of non-contiguous, discrete data blocks, i.e., dynamic memory mapping; Second, it implements two levels of parallelism—in thread-block parallelism, different thread blocks process different sequence rows (blockIdx.y), and in thread-level parallelism, threads within the same thread block process different features within the same row (threadIdx.x); Third, it directly calculates row offsets using the slot array, avoiding complex addressing operations and achieving efficient indexing; Fourth, data is processed and transmitted directly within the GPU, reducing data movement between the GPU and host memory, which is key to high-performance computing.

[0099] The technical solution of this invention, in the capture phase, freezes the linear attention state into an immutable snapshot copy and stores it in the snapshot mapping table in association with the block prefix hash value; in the recovery phase, it forcibly allocates an independent private runtime slot that has been cleared, and writes the snapshot into the global state tensor in parallel according to the runtime slot by calling the state scattering kernel engine, thereby realizing the prefix caching for the linear attention layer, effectively avoiding the repeated calculation of the linear attention layer, significantly reducing the first word latency and improving the inference throughput.

[0100] Figure 8 is a schematic diagram of a prefix caching device suitable for a linear attention model architecture, provided as an embodiment of the present invention. As shown in Figure 8, the device includes: a snapshot capture module 310, a hash mapping module 320, a slot allocation module 330, and a snapshot recovery module 340.

[0101] Wherein:

[0102] The snapshot capture module 310 is used to acquire linear attention snapshots of each request sequence in the linear attention layer during the historical stage of model inference, wherein the linear attention snapshots include convolutional state copies and recurrent state copies;

[0103] The hash mapping module 320 is used to obtain the block prefix hash value of each request sequence, associate the block prefix hash value with the corresponding linear attention snapshot as an index, and store it in the snapshot mapping table;

[0104] The slot allocation module 330 is used to allocate a target runtime slot to a new request sequence in response to a received new request sequence when a target linear attention snapshot that matches the target prefix hash value of the new request sequence is retrieved in the snapshot mapping table.

[0105] The snapshot recovery module 340 is used to call the state scattering kernel engine to write the target linear attention snapshot in parallel into the global state tensor according to the target runtime slot, so as to skip the calculation process corresponding to the target prefix in the new request sequence.

[0106] Optionally, the snapshot capture module 310 is also used to determine the runtime slot index corresponding to each request sequence in the global state tensor; based on the runtime slot index, in different linear attention layers of the global state tensor, the state slices belonging to each request sequence are extracted in parallel using the tensor index selection operator; and the state slices corresponding to each layer are packaged and encapsulated to construct a linear attention snapshot.

[0107] Optionally, the device also includes an LRU queue management module, which stores the block prefix hash value into the Least Recently Used (LRU) queue and records the timestamp of the storage time; when the total capacity of the LRU queue exceeds a preset capacity threshold, the least recently used prefix hash value and its corresponding linear attention snapshot are removed according to the timestamp order in the LRU queue.
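
As a non-limiting illustration, least-recently-used management of the snapshot pool may be sketched in Python as follows. The sketch substitutes the insertion order of `collections.OrderedDict` for explicit timestamps and measures capacity as a number of entries; both are simplifications, and the class name `SnapshotPool` is hypothetical.

```python
from collections import OrderedDict


class SnapshotPool:
    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # prefix hash -> snapshot, oldest first

    def put(self, prefix_hash, snapshot):
        """Store a snapshot and return the hashes evicted to make room."""
        self._entries[prefix_hash] = snapshot
        self._entries.move_to_end(prefix_hash)  # mark as most recently used
        evicted = []
        while len(self._entries) > self.capacity:
            old_hash, _ = self._entries.popitem(last=False)  # drop the oldest
            evicted.append(old_hash)
        return evicted

    def get(self, prefix_hash):
        """Return the snapshot, or None if it is absent or was evicted."""
        snapshot = self._entries.get(prefix_hash)
        if snapshot is not None:
            self._entries.move_to_end(prefix_hash)  # refresh recency on a hit
        return snapshot


pool = SnapshotPool(capacity=2)
pool.put("hash-a", "snap-a")
pool.put("hash-b", "snap-b")
pool.get("hash-a")                       # hash-a becomes most recently used
evicted = pool.put("hash-c", "snap-c")   # pool is over capacity: hash-b goes
```

A `None` result from `get` corresponds to the case in which a snapshot has been evicted between lookup and use, which is the condition the late-binding validity check guards against.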

[0108] Optionally, the device further includes a consistency verification module, used to obtain the original model network structure parameters recorded in the target linear attention snapshot; perform consistency verification between the original model network structure parameters and the network structure parameters of the current inference environment; if the verification passes, the step of allocating the target runtime slot is executed; if the verification fails, the target linear attention snapshot is determined to be invalid.

[0109] Optionally, the snapshot recovery module 340 is further configured to use the target linear attention snapshot as source state data and construct a slot mapping index array based on the target runtime slots; inject the source state data, the slot mapping index array, and the memory address of the global state tensor into the state scattering kernel engine; the state scattering kernel engine constructs a two-dimensional computational grid based on the amount of tasks to be written, wherein the first dimension of the two-dimensional computational grid corresponds to the number of request sequences to be written, and the second dimension corresponds to the data width of the linear attention snapshot; using the two-dimensional computational grid, the source state data is mapped and written in parallel to the specified physical rows of the global state tensor based on the non-continuous physical offset provided by the slot mapping index array.

[0110] Optionally, the snapshot recovery module 340 is also used to obtain the source tensor row step size of the source state data and the target tensor row step size of the global state tensor; the state scattering kernel engine performs automatic memory addressing based on the source tensor row step size and the target tensor row step size to support in-situ updates of data under non-contiguous memory layout.

[0111] Optionally, the device also includes a degradation processing module, which, during the verification phase before performing pre-filled computation, if it is detected that the target linear attention snapshot has been asynchronously evicted, performs the following degradation operations regardless of whether the associated key-value cache block is hit: severing the memory reference relationship between the new request sequence and the hit key-value cache block, and updating the reference count of the key-value cache block; clearing the historical cache state record of the new request sequence, which includes the block physical mapping table, the number of prefix hit terms, and hash pointers; reallocating independent physical blocks for the new request sequence, and switching to full pre-filled mode to perform inference computation.

[0112] The prefix caching device for linear attention model architecture provided in this embodiment of the invention can execute the method of any of the above embodiments, and has the functional modules and corresponding beneficial effects corresponding to the above methods.

[0113] Figure 9 shows a schematic diagram of an electronic device 10 that can be used to implement embodiments of the present invention. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants (PDAs), servers, blade servers, mainframe computers, and other suitable computing devices. The electronic device can also represent various forms of mobile terminal devices. The components shown herein, their connections, and their respective functions are illustrative only and are not intended to limit the implementation of the present invention.

[0114] As shown in Figure 9, the electronic device 10 includes at least one processor 11 and a memory, such as a read-only memory (ROM) 12 and a random access memory (RAM) 13, communicatively connected to the at least one processor 11. The memory stores computer programs executable by the at least one processor 11. The processor 11 can perform various operations and processes based on the computer program stored in the ROM 12 or loaded from storage unit 18 into the RAM 13. The RAM 13 can also store various programs and data required for the operation of the electronic device 10. The processor 11, ROM 12, and RAM 13 are interconnected via a bus 14, and an input/output (I/O) interface 15 is also connected to the bus 14.

[0115] Multiple components in electronic device 10 are connected to I / O interface 15, including: input unit 16, such as keyboard, mouse, etc.; output unit 17, such as various types of displays, speakers, etc.; storage unit 18, such as disk, optical disk, etc.; and communication unit 19, such as network card, modem, wireless transceiver, etc. Communication unit 19 allows electronic device 10 to exchange information / data with other electronic devices through computer networks such as the Internet and / or various telecommunications networks.

[0116] Processor 11 can be a general-purpose and / or special-purpose processing component with processing and computing capabilities. Examples of processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), a dedicated computing chip for artificial intelligence (AI), a processor for running machine learning models, a digital signal processor (DSP), and other suitable processors, controllers, or microcontrollers. Processor 11 is used to perform the methods described in the foregoing embodiments, such as a prefix caching method suitable for a linear attention model architecture.

[0117] Specifically, the operations performed by processor 11 include:

[0118] During the historical phase of model inference, obtain linear attention snapshots of each request sequence in the linear attention layer, where the linear attention snapshots include convolutional state copies and recurrent state copies;

[0119] Obtain the block prefix hash value of each request sequence, and use the block prefix hash value as an index to associate and map it with the corresponding linear attention snapshot, and store it in the snapshot mapping table;

[0120] In response to a new request sequence, when a target linear attention snapshot that matches the target prefix hash value of the new request sequence is retrieved from the snapshot mapping table, a target runtime slot is allocated to the new request sequence;

[0121] The state scattering kernel engine is invoked to write target linear attention snapshots in parallel into the global state tensor based on the target runtime slot, thereby skipping the calculation process corresponding to the target prefix in the new request sequence.

[0122] The computer program used to implement the methods described in this invention can be written in one or more programming languages. The computer program can be provided to the processor of a general-purpose computer, special-purpose computer, or other computing device, causing the processor to perform the functions or operations described in the flowcharts and / or block diagrams when executing the program. The computer program can be executed entirely on a local device, partially on a local device, partially on a remote device, or entirely on a remote device or server.

[0123] In the context of this invention, a computer-readable storage medium can be a tangible medium for containing or storing a computer program for use by, or in connection with, an instruction execution apparatus, device, or electronic device. Computer-readable storage media can include, but are not limited to, electronic media, magnetic media, optical media, electromagnetic media, infrared media, or semiconductor media, or any combination thereof. Specific examples include, but are not limited to, electrical connections (such as wires), portable storage disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, compact optical disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, etc.

[0124] It should be understood that the above process can be adjusted, for example by reordering, adding, or deleting steps. The steps described in this invention can be executed in parallel, sequentially, or in other orders; as long as the desired effect of the technical solution of this invention is achieved, all such variations fall within the protection scope of this invention.

[0125] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications, equivalent substitutions, or improvements can be made to the above embodiments without departing from the spirit and principles of the present invention, and all such modifications, substitutions, or improvements should be included within the scope of protection of the present invention.

Claims

1. A prefix caching method suitable for linear attention model architectures, characterized in that the method includes: during the historical phase of model inference, obtaining linear attention snapshots of each request sequence in the linear attention layer, wherein the linear attention snapshots include convolutional state copies and recurrent state copies; obtaining the block prefix hash value of each of the request sequences, associating and mapping the block prefix hash value, as an index, with the corresponding linear attention snapshot, and storing it in the snapshot mapping table; in response to a new request sequence, when a target linear attention snapshot matching the target prefix hash value of the new request sequence is retrieved in the snapshot mapping table, allocating a target runtime slot to the new request sequence; and invoking the state scattering kernel engine to write the target linear attention snapshot in parallel into the global state tensor according to the target runtime slot, thereby skipping the calculation process corresponding to the target prefix in the new request sequence.

2. The method according to claim 1, characterized in that the step of obtaining linear attention snapshots of each request sequence in the linear attention layer includes: determining the runtime slot index corresponding to each of the request sequences in the global state tensor; based on the runtime slot index, extracting in parallel, using the tensor index selection operator, the state slices belonging to each request sequence in different linear attention layers of the global state tensor; and packaging and encapsulating the state slices corresponding to each layer to construct the linear attention snapshot.

3. The method according to claim 1, characterized in that, after associating and mapping the block prefix hash value as an index with the corresponding linear attention snapshot, the method further includes: storing the block prefix hash value into the Least Recently Used (LRU) queue and recording the timestamp of the storage time; and when the total capacity of the LRU queue exceeds a preset capacity threshold, removing the least recently used prefix hash value and its corresponding linear attention snapshot according to the timestamp order in the LRU queue.

4. The method according to claim 1, characterized in that, before allocating a target runtime slot to the new request sequence, the method further includes: obtaining the original model network structure parameters recorded in the target linear attention snapshot; performing a consistency check between the original model network structure parameters and the network structure parameters of the current inference environment; if the check passes, performing the step of allocating the target runtime slot; and if the check fails, determining that the target linear attention snapshot is invalid.

5. The method according to claim 1, characterized in that invoking the state scattering kernel engine to write the target linear attention snapshot in parallel into the global state tensor according to the target runtime slot includes: using the target linear attention snapshot as the source state data, and constructing a slot mapping index array based on the target runtime slot; injecting the source state data, the slot mapping index array, and the memory address of the global state tensor into the state scattering kernel engine; constructing, by the state scattering kernel engine, a two-dimensional computational grid based on the amount of tasks to be written, wherein the first dimension of the two-dimensional computational grid corresponds to the number of request sequences to be written, and the second dimension corresponds to the data width of the linear attention snapshot; and using the two-dimensional computational grid, mapping and writing the source state data in parallel to the specified physical rows of the global state tensor according to the non-contiguous physical offsets provided by the slot mapping index array.

6. The method according to claim 5, characterized in that the step of writing the source state data in parallel into the specified physical rows of the global state tensor based on the non-contiguous physical offsets provided by the slot mapping index array includes: obtaining the source tensor row step size of the source state data and the target tensor row step size of the global state tensor; and performing, by the state scattering kernel engine, automatic memory addressing based on the source tensor row step size and the target tensor row step size to support in-situ updates of data under non-contiguous memory layouts.

7. The method according to claim 1, characterized in that, after allocating a target runtime slot to the new request sequence, the method further includes a degradation processing step, specifically: during the verification phase before performing the pre-fill computation, if it is detected that the target linear attention snapshot has been asynchronously evicted, then regardless of whether the associated key-value cache block is hit, performing the following degradation operations: severing the memory reference relationship between the new request sequence and the hit key-value cache block, and updating the reference count of the key-value cache block; clearing the historical cache state record of the new request sequence, the historical cache state record including a block physical mapping table, the number of prefix hits, and a hash pointer; and reallocating independent physical blocks for the new request sequence, and switching to full pre-fill mode to perform inference computation.

8. A prefix caching device suitable for linear attention model architectures, characterized in that the device includes: a snapshot capture module, used to obtain linear attention snapshots of each request sequence in the linear attention layer during the historical stage of model inference, wherein the linear attention snapshots include convolutional state copies and recurrent state copies; a hash mapping module, used to obtain the block prefix hash value of each request sequence, associate the block prefix hash value, as an index, with the corresponding linear attention snapshot, and store it in the snapshot mapping table; a slot allocation module, used to allocate a target runtime slot to a new request sequence, in response to the received new request sequence, when a target linear attention snapshot that matches the target prefix hash value of the new request sequence is retrieved in the snapshot mapping table; and a snapshot recovery module, used to call the state scattering kernel engine to write the target linear attention snapshot in parallel into the global state tensor according to the target runtime slot, so as to skip the calculation process corresponding to the target prefix in the new request sequence.

9. An electronic device, characterized in that the electronic device includes: one or more processors; and a storage device for storing one or more programs; wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-7.

10. A storage medium for computer-executable instructions, on which a computer program is stored, characterized in that, when the program is executed by a processor, it implements the method according to any one of claims 1-7.