Method and device for optimizing prefix cache, electronic equipment and storage medium

By precisely matching user input commands and filling in zero-value elements, the prefix cache is optimized, solving the problems of redundant calculation and resource waste in existing technologies, and improving the accuracy of caching and inference speed.

CN122309869A · Pending · Publication Date: 2026-06-30 · SHANGHAI HODE INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANGHAI HODE INFORMATION TECH CO LTD
Filing Date
2026-04-01
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing prefix caching mechanisms cannot meet computational needs in scenarios requiring the acquisition of the complete hidden state, leading to redundant calculations and resource waste, as well as issues of miscaching and misuse.

Method used

By matching and validating the instruction portion of user input, the granularity of caching is refined from the overall prefix to the instruction-level prefix. Zero-value elements are used to fill the hidden state of predefined instructions, ensuring the accuracy and integrity of cache reuse.

Benefits of technology

It avoids erroneous caching and reuse, saves computing resources, improves inference speed, and maintains the correctness and compatibility of inference results in scenarios such as full sequence pooling.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122309869A_ABST

Abstract

This disclosure provides a method, apparatus, electronic device, storage medium, and computer program product for optimizing prefix caching. The method includes receiving user input comprising a predefined instruction and input text; determining whether the predefined instruction matches a stored pre-computed instruction; and in response to determining that the predefined instruction matches a stored pre-computed instruction: reusing a key-value cache corresponding to the pre-computed instruction; not calculating a first hidden state for the predefined instruction, and using a zero-value element as the first hidden state for the predefined instruction; calculating a second hidden state for the input text; and determining a complete hidden state for the predefined instruction and the input text based on the first and second hidden states.

Description

Technical Field

[0001] This disclosure relates to the field of computer technology, and specifically to a method and apparatus for optimizing prefix caching, electronic devices, computer-readable storage media, and computer program products.

Background Technology

[0002] With the continuous development of internet technology, especially artificial intelligence, large language models (LLMs) have been widely applied in fields such as image processing and video generation. In LLM inference services, prefix caching is widely used as an important inference acceleration technique to improve system throughput and reduce end-to-end inference latency. Typical implementations (such as inference frameworks like vLLM) can cache the key-value pairs corresponding to the prefix portion of the user input sequence, allowing multiple requests that share the same prefix to reuse already computed results, thereby effectively reducing redundant computation overhead.

[0003] The methods described in this section are not necessarily methods that had been previously conceived or adopted. Unless otherwise specified, no method described in this section should be assumed to be prior art simply because it is included in this section. Similarly, unless otherwise specified, the issues mentioned in this section should not be considered to be accepted in any prior art.

Summary of the Invention

[0004] This disclosure provides a method and apparatus, electronic device, computer-readable storage medium, and computer program product for optimizing prefix cache.

[0005] According to a first aspect of this disclosure, a method for optimizing a prefix cache is provided, comprising: receiving user input including a predefined instruction and input text; determining whether the predefined instruction matches a stored pre-computed instruction; and in response to determining that the predefined instruction matches a stored pre-computed instruction: reusing a key-value cache corresponding to the pre-computed instruction; not calculating a first hidden state for the predefined instruction, and using a zero-value element as the first hidden state for the predefined instruction; calculating a second hidden state for the input text; and determining a complete hidden state for the predefined instruction and the input text based on the first hidden state and the second hidden state.

[0006] According to a second aspect of this disclosure, an apparatus for optimizing a prefix cache is provided, the apparatus comprising: a user input receiving unit configured to receive user input including a predefined instruction and input text; a determining unit configured to determine whether the predefined instruction matches a stored pre-computed instruction; and a calculating unit configured to perform the following operations in response to determining that the predefined instruction matches the stored pre-computed instruction: reusing a key-value cache corresponding to the pre-computed instruction; not calculating a first hidden state for the predefined instruction, and using a zero-value element as the first hidden state for the predefined instruction; calculating a second hidden state for the input text; and determining a complete hidden state for the predefined instruction and the input text based on the first hidden state and the second hidden state.

[0007] According to a third aspect of this disclosure, an electronic device is provided, comprising: at least one processor; and at least one memory communicatively connected to the at least one processor, wherein the at least one memory stores a computer program that, when executed by the at least one processor, implements the above-described method for optimizing prefix caching.

[0008] According to a fourth aspect of this disclosure, a non-transitory computer-readable storage medium is provided storing a computer program, wherein the computer program implements the above-described optimized prefix caching method when executed by a processor.

[0009] According to a fifth aspect of this disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements the above-described method for optimizing prefix caching.

[0010] According to one or more embodiments of this disclosure, by performing matching and verification only on the instruction portion of the user input, the granularity of the cache can be refined from the overall prefix to the instruction-level prefix, avoiding the problems of incorrect caching and reuse caused by length matching in the prior art, and ensuring the accuracy of cache reuse. In addition, by using zero-value elements to fill the hidden state of predefined instructions, not only can the mechanism of reusing key-value cache be maintained in scenarios such as full sequence pooling, but also the complete hidden state required in such scenarios can be obtained for subsequent calculations. This avoids a large amount of repeated calculation on the reusable key-value cache portion, significantly saves computing power and improves inference speed, and achieves compatibility between prefix caching and scenarios such as full sequence pooling while ensuring the correctness of the inference results.

[0011] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Brief Description of the Drawings

[0012] The accompanying drawings exemplify embodiments and form part of the specification, serving together with the textual description to explain exemplary implementations of the embodiments. The illustrated embodiments are for illustrative purposes only and do not limit the scope of the claims. Throughout the drawings, the same reference numerals refer to similar but not necessarily identical elements.

[0013] Figure 1 shows a flowchart of a method for optimizing a prefix cache according to an embodiment of the present disclosure;
Figure 2 shows a flowchart of receiving user input including a predefined instruction and input text according to an embodiment of the present disclosure;
Figure 3 shows a flowchart of determining whether a predefined instruction matches a stored pre-computed instruction according to an embodiment of the present disclosure;
Figure 4 shows a flowchart of using zero-value elements as a hidden state for a predefined instruction according to an embodiment of the present disclosure;
Figure 5 shows a flowchart of a method for optimizing a prefix cache according to some other embodiments of the present disclosure;
Figure 6 shows a structural block diagram of an apparatus for optimizing a prefix cache according to an embodiment of the present disclosure;
Figure 7 shows a structural block diagram of an apparatus for optimizing a prefix cache according to other embodiments of the present disclosure;
Figure 8 shows a structural block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.

Detailed Description

[0014] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.

[0015] In this disclosure, unless otherwise stated, the use of terms such as "first," "second," etc., to describe various elements is not intended to limit the positional, temporal, or importance relationships of these elements; such terms are merely used to distinguish one element from another. In some examples, the first element and the second element may refer to the same instance of that element, while in other cases, based on the context, they may refer to different instances.

[0016] The terminology used in the description of the various examples described in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context explicitly indicates otherwise, an element may be one or more unless the number of elements is specifically limited. Furthermore, the term "and / or" as used in this disclosure covers any one of the listed items and all possible combinations thereof.

[0017] With the continuous development of internet technology, especially artificial intelligence, large language models (LLMs) have been widely applied in fields such as image processing and video generation. In LLM inference services, prefix caching is widely used as an important inference acceleration technique to improve system throughput and reduce end-to-end inference latency. Typical implementations (such as inference frameworks like vLLM) can cache the key-value pairs corresponding to the prefix portion of the user input sequence, allowing multiple requests that share the same prefix to reuse already computed results, thereby effectively reducing redundant computation overhead.

[0018] The inventors noted that existing prefix caching mechanisms typically only cache the Key and Value tensors used for attention computation, without caching the hidden state of the intermediate layers. While this mechanism may strike a trade-off between computational efficiency and memory usage, it leads to incomplete hidden states and inability to perform computations in scenarios requiring the complete hidden state (e.g., full sequence pooling).

[0019] The inventors also noted that existing systems that enable prefix caching and employ a partial prefill strategy can only obtain the hidden state of a local sequence, which cannot meet the requirement of full sequence pooling for the complete hidden state. To ensure the correctness of the computation, such systems usually prohibit the use of the prefix caching mechanism outright in scenarios that require the complete hidden state (e.g., full sequence pooling). This forces the system to perform a complete computation of the entire sequence, including the predefined instruction and the input text, for each user input, resulting in a large amount of redundant computation and wasted computational resources, which prevents the system's throughput from being effectively improved.

[0020] In view of this, embodiments of this disclosure provide a method and apparatus, electronic device, computer-readable storage medium, and computer program product for optimizing prefix caching. By performing matching and verification only on the instruction portion of user input, the granularity of caching can be refined from the overall prefix to the instruction-level prefix, avoiding the problems of incorrect caching and incorrect reuse caused by length matching in the prior art, and ensuring the accuracy of cache reuse. In addition, by filling the hidden state of predefined instructions with zero-value elements, not only can the mechanism of reusing key-value cache be maintained in scenarios such as full sequence pooling, but also the complete hidden state required in such scenarios can be obtained for subsequent calculations. This avoids a large amount of repeated calculation on the reusable key-value cache portion, significantly saves computing power and improves inference speed, and achieves compatibility between prefix caching and scenarios such as full sequence pooling while ensuring the correctness of inference results.

[0021] The embodiments of this disclosure will now be described in detail with reference to the accompanying drawings.

[0022] Figure 1 shows a flowchart of a method 100 for optimizing a prefix cache according to an embodiment of the present disclosure. As shown in Figure 1, method 100 may include: step S110, receiving user input including a predefined instruction and input text; step S120, determining whether the predefined instruction matches a stored pre-computed instruction; and step S130, in response to determining that the predefined instruction matches a stored pre-computed instruction, performing the following operations: step S131, reusing the key-value cache corresponding to the pre-computed instruction; step S132, not calculating a first hidden state for the predefined instruction, and using a zero-value element as the first hidden state for the predefined instruction; step S133, calculating a second hidden state for the input text; and step S134, determining a complete hidden state for the predefined instruction and the input text based on the first hidden state and the second hidden state.

[0023] By matching and validating only the instruction portion of the user input, the granularity of the cache can be refined from the overall prefix to the instruction-level prefix. This avoids the erroneous caching and reuse problems caused by length-based matching in existing technologies, ensuring the accuracy of cache reuse. Furthermore, by filling the hidden state of predefined instructions with zero-value elements, not only can the mechanism for reusing key-value caches be maintained in scenarios such as full sequence pooling, but the complete hidden state required for such scenarios can also be obtained for subsequent computations. This avoids a large amount of redundant computation on reusable key-value cache portions, significantly saves computing power, and improves inference speed. Simultaneously, it achieves compatibility between prefix caching and scenarios such as full sequence pooling while ensuring the correctness of inference results.
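As an illustration of steps S132 to S134 only, the following minimal sketch assembles a full-length hidden state on a cache hit. It assumes the hidden state is a list of per-token vectors and that the complete hidden state is formed by concatenation along the sequence axis; the names `complete_hidden_state` and `toy_compute_hidden` are hypothetical and do not come from the disclosure, and the toy function merely stands in for a model forward pass.

```python
from typing import Callable, List

Vector = List[float]


def complete_hidden_state(
    instruction_len: int,
    text_tokens: List[int],
    hidden_size: int,
    compute_hidden: Callable[[List[int]], List[Vector]],
) -> List[Vector]:
    """Assemble a full-length hidden state on a cache hit (S132 to S134)."""
    # S132: no forward pass for the instruction tokens; zero vectors act as
    # placeholders so the result still covers every position of the prompt.
    first = [[0.0] * hidden_size for _ in range(instruction_len)]
    # S133: hidden states are computed only for the input-text tokens.
    second = compute_hidden(text_tokens)
    # S134: assumed here to be a concatenation along the sequence axis.
    return first + second


def toy_compute_hidden(tokens: List[int]) -> List[Vector]:
    # Hypothetical stand-in for the model: one 4-dimensional vector per token.
    return [[float(t)] * 4 for t in tokens]


full = complete_hidden_state(3, [7, 8], 4, toy_compute_hidden)
print(len(full), full[0], full[3])
```

The result has one row per prompt position (instruction plus input text), so downstream code that expects a full-length tensor receives one without the instruction being recomputed.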

[0024] According to some embodiments of this disclosure, the pre-computed instructions in step S120 can be instructions pre-stored in the system, for example instructions for text classification, image classification, content extraction, sentiment analysis, or bullet-comment (danmaku) moderation, which users can select to perform the related operations.

[0025] According to other embodiments of this disclosure, in order to store these pre-computed instructions more conveniently so that the corresponding pre-computed instructions can be quickly matched to user input, method 100 may further include: reading the instruction text content of the pre-computed instructions; converting the instruction text content into a first token sequence; dividing the first token sequence into a first plurality of blocks based on a fixed block size; calculating the corresponding first hash value of each block in the first plurality of blocks to form a first hash value sequence; and storing the pre-computed instructions and the first hash value sequence.

[0026] Specifically, the instruction text content of the pre-computed instructions can first be read from a specified file path and then converted into a token sequence using a tokenizer, such as the tokenizer of the model. For example, the instruction text content can be "You are a sentiment analysis expert, please judge the sentiment of the following comments:". This text can be segmented into the 13 tokens "you", "are", "sentiment", "analysis", "expert", ",", "please", "judge", "below", "comment", "of", "sentiment", and ":", and a token sequence (numerical sequence) such as [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] can be generated, with one ID per segment. Next, the generated token sequence is divided into multiple blocks based on a preset fixed block size. It will be understood that the fixed block size can be any suitable value; for example, it can be adjusted according to the model's memory, the instruction length, and so on (e.g., a small fixed block size for short instructions and a large fixed block size for long instructions), and the scope of protection claimed in this disclosure is not limited in this respect. With a fixed block size of, for example, 5, the token sequence above is divided into block 1 (tokens 1 to 5, "you are sentiment analysis expert"), block 2 (tokens 6 to 10, ", please judge below comment"), and block 3 (tokens 11 to 13, "of sentiment :"). Next, a corresponding hash value can be calculated for each block; for example, the hash value for block 1 is A1, for block 2 it is B1, and for block 3 it is C1. Finally, the calculated hash values are concatenated to form a hash value sequence, such as [A1, B1, C1], which is stored in the system along with the pre-computed instructions and their mapping relationship. It will be understood that when the instruction content changes, the old cache stored in the system automatically becomes invalid without explicit cleanup, because the hash value corresponding to each affected block changes accordingly.

[0027] By pre-storing hash sequences for pre-computed instructions, it's easier to quickly match them when a user enters a real request, thus rapidly determining whether the key-value cache corresponding to the pre-computed instruction can be reused. Simultaneously, dividing the obtained token sequence into blocks of uniform granularity using a fixed block size breaks down the complete token sequence into blocks. Combined with hash calculation, the core of matching pre-computed instructions with predefined instructions contained in the user input is "block-level hash comparison," rather than coarse length matching. This effectively distinguishes instructions with "same length but different content," avoiding false caching and reuse caused by partial similarity in instruction text or identical token lengths. This helps ensure that only completely identical predefined instructions can be successfully matched and reused in the key-value cache.

[0028] According to other embodiments of this disclosure, after dividing the token sequence into multiple blocks based on the fixed block size, hash values can be calculated only for blocks whose length equals the fixed block size, while no hash value is calculated for incomplete blocks whose length is less than the fixed block size. Continuing with the example above, with a fixed block size of 5 the 13-token sequence is divided into block 1 (tokens 1 to 5), block 2 (tokens 6 to 10), and block 3 (tokens 11 to 13). Blocks 1 and 2 are complete blocks, whereas the length of block 3 (3 tokens) is less than the fixed block size of 5. Therefore, only the hash values corresponding to blocks 1 and 2 are calculated, and the hash value corresponding to block 3 is not calculated. Thus, it is not necessary to perform a global hash value calculation on the entire token sequence, which reduces the computational cost of hash value calculation.

[0029] As an example, preprocessing pre-computation instructions to obtain the corresponding hash value sequence can be achieved as follows:
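The following is an illustrative sketch only, not a listing from the filing. It assumes the instruction has already been tokenized into integer IDs, uses SHA-256 over a comma-joined rendering of each block (the serialization is an assumption), and hashes each block independently. The names `split_into_blocks`, `block_hash`, `preprocess_instruction`, and `store` are hypothetical.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 5  # fixed block size taken from the running example


def split_into_blocks(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[List[int]]:
    """Divide a token sequence into consecutive blocks of at most block_size."""
    return [tokens[i:i + block_size] for i in range(0, len(tokens), block_size)]


def block_hash(block: List[int]) -> str:
    """Hash one block of token IDs (serialization format is an assumption)."""
    data = ",".join(map(str, block)).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def preprocess_instruction(
    tokens: List[int],
    block_size: int = BLOCK_SIZE,
    full_blocks_only: bool = True,
) -> List[str]:
    """Return the hash value sequence for a pre-computed instruction."""
    blocks = split_into_blocks(tokens, block_size)
    if full_blocks_only:
        # Variant from [0028]: skip the trailing incomplete block.
        blocks = [b for b in blocks if len(b) == block_size]
    return [block_hash(b) for b in blocks]


# The 13-token example: two complete blocks and one 3-token remainder.
instruction_tokens = list(range(1, 14))
store: Dict[str, List[str]] = {
    "sentiment_analysis": preprocess_instruction(instruction_tokens),
}
print(len(store["sentiment_analysis"]))
```

Because the stored key is derived from the token content, editing the instruction text changes the affected hashes, so stale entries simply stop matching.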

[0030] According to some embodiments of this disclosure, the corresponding hash value of each of the first plurality of blocks may be based on the corresponding hash value of the adjacent preceding block.

[0031] Specifically, for the first block of the first token sequence as described above, a preset hash algorithm (e.g., SHA-256) is used with the token ID sequence of that block as input to calculate the hash value of the first block (e.g., denoted as hash1). Then, for the second block of the first token sequence as described above, the hash value of the first block is concatenated with the token ID sequence of the second block as input (i.e., hash1 + the token ID sequence of the second block) to calculate the hash value of the second block. This process is repeated for all subsequent blocks; that is, the input for calculating the hash value of the third block is hash2 + the token ID sequence of the third block.

[0032] By establishing hash value dependencies between adjacent blocks (a process known as chained verification), the accuracy of instruction matching can be significantly improved and false matches caused by hash collisions can be avoided. This is because calculating hash values independently for each block can result in different blocks having the same hash value, leading to mismatches. In chained verification, the hash value of each subsequent block depends on the hash value of the previous block, which ties the entire hash value sequence together. Since a change in a single block causes the hash values of all subsequent blocks to change as well, the probability of hash value collisions can be greatly reduced, ensuring that only completely identical predefined instructions can pass verification, thus avoiding incorrect caching and reuse.
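A minimal sketch of the chained scheme, assuming SHA-256 and a simple string serialization in which the previous digest is prepended to the current block's token IDs. The separator and the function name `chained_block_hashes` are assumptions made for illustration; only complete blocks are hashed.

```python
import hashlib
from typing import List


def chained_block_hashes(tokens: List[int], block_size: int) -> List[str]:
    """Hash each complete block together with the previous block's hash."""
    hashes: List[str] = []
    prev = ""  # the first block has no predecessor
    for start in range(0, len(tokens) - block_size + 1, block_size):
        block = tokens[start:start + block_size]
        payload = prev + "|" + ",".join(map(str, block))
        prev = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(prev)
    return hashes


base = chained_block_hashes(list(range(1, 16)), 5)
# Changing a token in the first block changes every hash after it as well.
head_changed = chained_block_hashes([99] + list(range(2, 16)), 5)
# Changing a token in the last block leaves the earlier hashes untouched.
tail_changed = chained_block_hashes(list(range(1, 15)) + [99], 5)
# Identical block content at different positions yields different hashes.
repeated = chained_block_hashes([1, 2, 3, 4, 5, 1, 2, 3, 4, 5], 5)
```

In this sketch a block's hash identifies the whole prefix up to and including that block, not just the block's own content.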

[0033] Figure 2 shows a flowchart of step S110, receiving user input including a predefined instruction and input text, according to an embodiment of this disclosure. As shown in Figure 2, step S110 may include: step S111, concatenating the predefined instruction and the input text; step S112, performing a tokenization operation on the concatenated sequence to obtain a second token sequence; step S113, dividing the second token sequence into a second plurality of blocks based on the fixed block size; and step S114, calculating the corresponding second hash value for each block in the second plurality of blocks to form a second hash value sequence.

[0034] According to some embodiments of this disclosure, a predefined instruction can refer to a task that the user expects the system or model to perform, while the input text can refer to the input that the user gives to the system or model for the expected task. For example, the predefined instruction can be, as described above, "You are a sentiment analysis expert, please judge the sentiment of the following comments:", while the input text can be text input to the system or model for this function, such as "This product is so good", "It's not worth the price at all", etc.

[0035] Continuing the example above, in step S111, the predefined instruction "You are a sentiment analysis expert, please judge the sentiment of the following comments:" and the input text "This product is so useful" can be concatenated to form the complete prompt input sequence "You are a sentiment analysis expert, please judge the sentiment of the following comments: This product is so useful". Next, in step S112, the concatenated complete prompt sequence is tokenized, converting it into a token sequence recognizable by the system or model. Then, in step S113, the second token sequence obtained after the tokenization operation is divided into blocks of the same fixed size as the first token sequence. Finally, in step S114, a corresponding hash value is generated for each block, which is then compared with the hash value corresponding to the pre-computed instruction.
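Steps S111 to S114 can be sketched as follows. The whitespace tokenizer `toy_tokenize` is a hypothetical stand-in for a real model tokenizer, the block size of 4 is chosen only so that the 16-word instruction fills whole blocks, and blocks are hashed independently with SHA-256 for brevity; none of these choices come from the disclosure.

```python
import hashlib
from typing import Dict, List

_vocab: Dict[str, int] = {}


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical whitespace tokenizer; IDs are assigned on first appearance."""
    return [_vocab.setdefault(word, len(_vocab) + 1) for word in text.split()]


def request_hash_sequence(instruction: str, input_text: str, block_size: int) -> List[str]:
    prompt = instruction + " " + input_text            # S111: concatenate
    tokens = toy_tokenize(prompt)                      # S112: tokenize
    blocks = [tokens[i:i + block_size]                 # S113: fixed-size blocks
              for i in range(0, len(tokens), block_size)]
    complete = [b for b in blocks if len(b) == block_size]
    # S114: one hash per complete block
    return [hashlib.sha256(repr(b).encode("utf-8")).hexdigest() for b in complete]


INSTRUCTION = ("You are a sentiment analysis expert , please judge "
               "the sentiment of the following comments :")  # 16 toy tokens
seq_a = request_hash_sequence(INSTRUCTION, "This product is so useful", 4)
seq_b = request_hash_sequence(INSTRUCTION, "It is not worth the price at all", 4)
```

Both requests share the four leading hashes that cover the instruction, while the hashes that cover the input text differ.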

[0036] By concatenating predefined instructions and input text, it becomes easier to adapt to the inference logic of large-scale language models. This is because large-scale language models can more accurately understand task requirements only after receiving a complete prompt sequence. By forming a format of "predefined instructions + variable input text," the task of the system or model can be clearly defined, avoiding situations where the model cannot determine the processing requirements due to the user only inputting text without the instruction part.

[0037] In addition, "block" level comparison can achieve "early termination", that is, if the hash value of a second block does not match the corresponding first block, it can be directly determined that the predefined instruction does not match the stored pre-computed instruction, without having to perform comparison operations on subsequent blocks. This helps to improve the speed of instruction matching and adapt to the fast matching needs in scenarios with multiple pre-computed instructions.

[0038] It will be understood that the second hash value calculated in step S114 can also be calculated in the same way as the first hash value, that is, the corresponding hash value of each block is based on the corresponding hash value of the adjacent previous block, which will not be elaborated here.

[0039] Figure 3 shows a flowchart of step S120, determining whether a predefined instruction matches a stored pre-computed instruction, according to an embodiment of this disclosure. As shown in Figure 3, step S120 may include: step S121, extracting a second hash value subsequence corresponding to the predefined instruction from the second hash value sequence, wherein the second hash value subsequence covers a third plurality of blocks; step S122, comparing the second hash value subsequence with the first hash value sequence block by block; step S123, in response to determining that, for every block in the third plurality of blocks, the second hash value in the second hash value subsequence corresponding to that block is the same as the corresponding first hash value in the first hash value sequence, determining that the predefined instruction matches the stored pre-computed instruction; and step S124, in response to determining that, for any block in the third plurality of blocks, the second hash value in the second hash value subsequence corresponding to that block is different from the corresponding first hash value in the first hash value sequence, determining that the predefined instruction does not match the stored pre-computed instruction.

[0040] Based on some examples, matching verification is performed by comparing hash values block by block. Specifically, assuming there are N hash values in the second hash value sequence that correspond to the instruction portion, if all N hash values are the same as each corresponding hash value in the first hash value sequence, then the predefined instruction is determined to match the stored pre-computed instruction; conversely, if any one of the N hash values is different from the corresponding hash value in the first hash value sequence, then the predefined instruction is determined not to match the stored pre-computed instruction.

[0041] By comparing "block" level hash values (especially combined with the correlation characteristics of chained hash values), it is ensured that only completely consistent predefined instructions can be matched successfully.

[0042] According to other examples, the length of the predefined instruction can also be checked before comparing the hash values block by block. Specifically, since the second hash value sequence corresponds to the token sequence concatenated with the predefined instruction and the input text, if the length of the second hash value sequence is greater than or equal to the number of the first plurality of blocks (i.e., the length of the first hash value sequence), the accuracy of the match between the predefined instruction and the stored pre-computed instruction is likely to be higher; conversely, if the length of the second hash value sequence is less than the number of the first plurality of blocks (i.e., the length of the first hash value sequence), it is more likely that the user input is incomplete or that the predefined instruction is different from the stored pre-computed instruction. Therefore, by performing a length check on the predefined instruction, matching deviations caused by incomplete predefined instructions can be further eliminated.
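The block-by-block comparison, its early termination, and the optional length pre-check can be sketched together as below. The sketch assumes the instruction occupies the leading blocks of the request, so the second hash value subsequence is taken as the first `len(stored)` entries, and it treats a request with fewer hashes than the stored sequence as a mismatch, which is one possible reading of the length check. The function name `instruction_matches` is hypothetical.

```python
from typing import Sequence


def instruction_matches(stored: Sequence[str], request: Sequence[str]) -> bool:
    """Return True only if every stored block hash equals the request's."""
    # Length pre-check: a request with fewer block hashes than the stored
    # instruction cannot contain the complete instruction.
    if len(request) < len(stored):
        return False
    # S121: take the leading hashes that cover the instruction portion.
    candidate = request[:len(stored)]
    # S122 to S124: compare block by block, stopping at the first difference.
    for request_hash, stored_hash in zip(candidate, stored):
        if request_hash != stored_hash:
            return False  # early termination: later blocks are not inspected
    return True


print(instruction_matches(["A1", "B1"], ["A1", "B1", "X9"]))
print(instruction_matches(["A1", "B1"], ["A1", "Z7", "X9"]))
```

With chained hashes, comparing only the last instruction block would in principle be enough, but the explicit loop mirrors the steps as described.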

[0043] According to some embodiments of this disclosure, method 100 may further include: in response to determining that a predefined instruction does not match a stored pre-computed instruction, returning an empty cache and calculating the complete hidden state; and in response to determining that a predefined instruction matches a stored pre-computed instruction, returning a key-value cache corresponding to the pre-computed instruction for reuse.

[0044] Specifically, for scenarios where predefined instructions are incomplete or partially modified, resulting in mismatches, an empty cache is returned and a full sequence computation is triggered to obtain the complete hidden state for both the predefined instructions and the input text. This avoids system crashes caused by missing or incorrect cache entries. For scenarios where the predefined instructions and pre-computed instructions are completely identical and successfully matched, a key-value cache corresponding to the pre-computed instructions is returned, ensuring reliable reuse and significantly reducing redundant computations. Therefore, both system fault tolerance and operational stability can be considered, adapting to various practical application scenarios.
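A minimal sketch of the hit and miss paths, assuming a small in-memory store keyed by instruction hash sequences. `InstructionKVStore` and `plan_request` are hypothetical names, the key-value cache is represented by an opaque object, and `None` plays the role of the empty cache.

```python
from typing import Any, List, Optional, Sequence, Tuple


class InstructionKVStore:
    """Maps instruction hash sequences to their pre-computed key-value caches."""

    def __init__(self) -> None:
        self._entries: List[Tuple[List[str], Any]] = []

    def register(self, instruction_hashes: Sequence[str], kv_cache: Any) -> None:
        self._entries.append((list(instruction_hashes), kv_cache))

    def lookup(self, request_hashes: Sequence[str]) -> Optional[Any]:
        """Return the matching key-value cache, or None (the empty cache)."""
        for stored, kv_cache in self._entries:
            if stored and list(request_hashes[:len(stored)]) == stored:
                return kv_cache
        return None


def plan_request(store: InstructionKVStore, request_hashes: Sequence[str]) -> str:
    kv_cache = store.lookup(request_hashes)
    if kv_cache is None:
        # Miss: fall back to computing the whole sequence from scratch.
        return "full_compute"
    # Hit: reuse the cached keys and values for the instruction.
    return "reuse_kv"


kv_store = InstructionKVStore()
kv_store.register(["A1", "B1"], kv_cache={"layers": "..."})
print(plan_request(kv_store, ["A1", "B1", "X9"]))
print(plan_request(kv_store, ["A1", "Q4", "X9"]))
```

A miss is a normal outcome rather than an error here, so a modified or truncated instruction costs only a full computation.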

[0045] According to some embodiments of this disclosure, method 100 may further include: determining a first length of a key-value cache corresponding to a pre-computed instruction returned in response to determining that a predefined instruction matches a stored pre-computed instruction; and truncating the first length of the returned key-value cache to a second length in response to determining that the first length exceeds a preset threshold, the second length being the length of a second token subsequence in the second token sequence corresponding to the predefined instruction.

[0046] Because the underlying cache manager (the module responsible for storing and querying key-value caches) does not have the ability to identify the specific length of the token sequence of the predefined instruction during the return of the cache, after determining that the predefined instruction matches a certain pre-computed instruction, it may directly return the complete key-value cache corresponding to the pre-computed instruction, and the length of the complete key-value cache may exceed the actual length range of the predefined instruction.

[0047] For example, if the token sequence length of pre-computed instruction A is 40 and the fixed block size is 10, then the number of the first plurality of blocks is 4, and the first hash value sequence is [hash1, hash2, hash3, hash4]. If the token sequence length of predefined instruction B is 35 and the fixed block size is 10, then 3 complete blocks and one incomplete block are obtained, and the second hash value subsequence corresponding to predefined instruction B in the second hash value sequence is [hash1, hash2, hash3] (only complete blocks participate in the hash value calculation). Since each hash value in the second hash value subsequence is the same as the corresponding hash value in the first hash value sequence, it is determined that predefined instruction B matches pre-computed instruction A. In this case, the underlying cache manager executes the logic of returning the complete cache, that is, it returns the pre-stored complete key-value cache for a token sequence of length 40. The length of the returned key-value cache then exceeds the range of the predefined instruction, which may result in the input text portion of the user input being cached incorrectly.

[0048] By truncating the length of the returned key-value cache to the same length as the token sequence of the predefined instruction B (e.g., 35), it is possible to effectively prevent the input text portion of the user input from being incorrectly cached, thereby improving computational accuracy.

[0049] According to some embodiments of this disclosure, the preset threshold can be the minimum of the following: the length of the second hash value subsequence corresponding to the predefined instruction; the length of the second token subsequence corresponding to the predefined instruction; or the length of the first token sequence corresponding to the pre-computed instruction.

[0050] By setting a preset threshold to the minimum of the three lengths mentioned above, it can be further ensured that the returned key-value cache length does not exceed the predefined instruction range, thereby further preventing the input text portion of the user input from being cached incorrectly.
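
The truncation and threshold logic can be illustrated with the numbers from the example above (a stored cache of 40 tokens and a predefined instruction of 35 tokens). The following Python sketch is a simplification under stated assumptions: the key-value cache is modelled as a plain list with one entry per token, the three lengths are compared as given without unit conversion, and the name `truncate_kv_cache` is hypothetical.

```python
def truncate_kv_cache(kv_cache, instr_hash_len, instr_token_len, stored_token_len):
    """Truncate a returned key-value cache to the predefined instruction's token length.

    kv_cache is modelled as a list with one (key, value) entry per token,
    which is an assumption made only for this sketch.
    """
    # Preset threshold: the minimum of the three lengths named in the text.
    threshold = min(instr_hash_len, instr_token_len, stored_token_len)
    first_length = len(kv_cache)
    if first_length > threshold:
        # Second length: token length of the predefined instruction.
        return kv_cache[:instr_token_len]
    return kv_cache


# Pre-computed instruction A: 40 tokens, so the stored cache has 40 entries.
cache_a = [("k%d" % i, "v%d" % i) for i in range(40)]

# Predefined instruction B: 35 tokens, 3 complete blocks of size 10.
truncated = truncate_kv_cache(
    cache_a, instr_hash_len=3, instr_token_len=35, stored_token_len=40
)
```

With these inputs the returned cache is cut from 40 entries to 35, so no entry beyond the predefined instruction is reused for the input text portion.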

[0051] As an example, the matching and comparison of predefined instructions with stored pre-computed instructions, and the execution based on the matching results to return an empty cache or the corresponding key-value cache, can be implemented as follows:
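
One possible form of this logic, offered only as an illustrative sketch and not as the listing of this disclosure, is given below in Python. The function name `lookup_instruction_cache`, the representation of the cache as a list, and the decision to treat an empty or over-long subsequence as a miss are all assumptions.

```python
def lookup_instruction_cache(second_hashes, instr_block_count, first_hashes, stored_kv_cache):
    """Return the stored key-value cache on a match, or an empty cache on a mismatch.

    second_hashes: block hashes of the concatenated instruction + input text.
    instr_block_count: number of complete blocks covered by the predefined instruction.
    first_hashes: block hashes stored for the pre-computed instruction.
    """
    # Extract the subsequence that corresponds to the predefined instruction.
    sub = second_hashes[:instr_block_count]

    # Assumption: nothing to compare, or more blocks than were stored, is a miss.
    if not sub or len(sub) > len(first_hashes):
        return []

    # Block-by-block comparison; any differing block is a miss.
    for got, expected in zip(sub, first_hashes):
        if got != expected:
            return []

    return stored_kv_cache


first_hashes = ["hash1", "hash2", "hash3", "hash4"]
stored_kv_cache = ["kv%d" % i for i in range(40)]

# Matching instruction (3 complete blocks) followed by one block of input text.
hit = lookup_instruction_cache(
    ["hash1", "hash2", "hash3", "text1"], 3, first_hashes, stored_kv_cache
)

# Instruction modified in its second block: empty cache, full computation follows.
miss = lookup_instruction_cache(
    ["hash1", "hashX", "hash3", "text1"], 3, first_hashes, stored_kv_cache
)
```

On a hit the caller would still apply the length truncation described earlier before reusing the cache; on a miss the empty list signals that the complete hidden state has to be computed.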

[0052] According to some embodiments of this disclosure, the cache for pre-computed instructions shares the underlying storage with the ordinary prefix cache, but logical isolation can be achieved through precise matching of block hash values; when the instruction content changes, the old cache will automatically become invalid without explicit cleanup because the block hash value changes accordingly.
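
The implicit invalidation can be illustrated with a short Python sketch. It assumes that each block hash also depends on the hash of the preceding block (as recited in claim 3) and uses SHA-256 from the standard library purely as a stand-in hash function; the helper name `chained_block_hashes` is hypothetical.

```python
import hashlib


def chained_block_hashes(tokens, block_size):
    """Hash each complete block together with the previous block's hash."""
    hashes = []
    prev = ""
    for i in range(len(tokens) // block_size):
        block = tokens[i * block_size:(i + 1) * block_size]
        payload = prev + "|" + ",".join(str(t) for t in block)
        prev = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        hashes.append(prev)
    return hashes


old_hashes = chained_block_hashes([1, 2, 3, 4, 5, 6, 7, 8], block_size=4)

# One token of the first block is edited; the second block's tokens are unchanged.
new_hashes = chained_block_hashes([1, 2, 3, 9, 5, 6, 7, 8], block_size=4)

# None of the new hashes equals an old hash, so the old entries are never hit again.
stale_hits = [h for h in new_hashes if h in old_hashes]
```

Because the second block's hash is chained to the first, it changes even though its own tokens did not, which is why no explicit cleanup of the old entries is needed for correctness.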

[0053] Figure 4 shows a flowchart of step S132, using zero-value elements as the first hidden state for the predefined instruction, according to an embodiment of this disclosure. As shown in Figure 4, step S132 may include: step S132-1, obtaining the third length of the second token sequence and the fourth length of the second token subsequence of the second token sequence involved in calculating the second hidden state; step S132-2, determining the fifth length of the zero-value element to be used based on the third length and the fourth length; and step S132-3, using the zero-value element of the fifth length as the first hidden state for the predefined instruction.

[0054] According to some examples, the third length is the length of the complete prompt sequence obtained by concatenating the predefined instruction and the input text, and the fourth length is the length of the token sequence on which computation is actually performed (e.g., the length of the token sequence corresponding to the input text portion). The fifth length of the zero-value elements to be used can then be determined as the difference between the third length and the fourth length. Finally, zero-value padding is performed using zero-value elements of the fifth length to obtain the first hidden state for the predefined instruction.

[0055] Typically, the system only performs forward computation on the tokens that are not matched, i.e., it adopts a partial pre-filling strategy. The resulting hidden state (i.e., the second hidden state) covers only the newly added tokens. However, in scenarios such as full sequence pooling, the hidden state of the complete input sequence, including the predefined instruction and the input text, is needed for feature extraction. Therefore, by using zero-value elements to fill the hidden state for the matched predefined instruction, the length of the resulting hidden state is kept consistent with the length of the complete hidden state, while a large amount of redundant computation on the reusable key-value cache portion is avoided. This significantly saves computing power and improves inference speed, thereby achieving compatibility between prefix caching and scenarios such as full sequence pooling.

[0056] As an example, filling the hidden state with zero-value elements can be achieved as follows:
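
A minimal Python sketch of steps S132-1 to S132-3 is given below for illustration; it is not the listing of this disclosure. Hidden states are modelled as lists of float vectors, and the function name `zero_fill_first_hidden_state` and the `hidden_size` parameter are assumptions.

```python
def zero_fill_first_hidden_state(third_length, fourth_length, hidden_size):
    """Build the zero-valued first hidden state for the predefined instruction.

    third_length: token count of the complete prompt (instruction + input text).
    fourth_length: token count actually passed through the forward computation.
    """
    # Fifth length: number of positions skipped because of key-value cache reuse.
    fifth_length = third_length - fourth_length
    if fifth_length < 0:
        raise ValueError("computed tokens cannot exceed the complete sequence")
    # One zero vector per skipped position.
    return [[0.0] * hidden_size for _ in range(fifth_length)]


# Complete prompt of 50 tokens, of which 15 (the input text) were computed.
first_hidden = zero_fill_first_hidden_state(third_length=50, fourth_length=15, hidden_size=4)
```

Here the fifth length is 35, so the first hidden state consists of 35 zero vectors, one for each token of the predefined instruction.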

[0057] According to some embodiments of this disclosure, step S134, determining the complete hidden state for the predefined instruction and input text based on the first hidden state and the second hidden state, may include: concatenating the first hidden state with the second hidden state to obtain the complete hidden state.

[0058] For example, the first hidden state is a sequence of zero-value elements for a predefined instruction, such as [0, 0, ..., 0, 0], and the second hidden state is a sequence of states obtained by performing calculations on the input text (i.e., the newly added token), such as [A, B, ..., M, N, ...]. Then, the sequence [0, 0, ..., 0, 0] is concatenated with [A, B, ..., M, N, ...] to obtain the sequence [0, 0, ..., 0, 0, A, B, ..., M, N, ...] as the complete hidden state for scenarios such as full sequence pooling.
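
The concatenation can be sketched in Python as follows, again modelling hidden states as lists of float vectors. The function name `concat_hidden_states` and the numeric values are illustrative assumptions.

```python
def concat_hidden_states(first_hidden, second_hidden):
    """Concatenate zero-filled instruction states with computed input-text states."""
    if first_hidden and second_hidden and len(first_hidden[0]) != len(second_hidden[0]):
        raise ValueError("hidden sizes of the two parts differ")
    # Instruction positions come first, followed by the input text positions.
    return first_hidden + second_hidden


# Three instruction positions filled with zeros (hidden size 2).
first_hidden = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

# Two input-text positions obtained from the forward computation (made-up values).
second_hidden = [[0.5, -1.0], [2.0, 0.25]]

complete_hidden = concat_hidden_states(first_hidden, second_hidden)
```

The result has one row per token of the complete prompt, so position indices computed on the full token sequence remain valid when indexing into it.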

[0059] According to some embodiments of this disclosure, method 100 may further include extracting feature vectors from the complete hidden state for classification.

[0060] For classification tasks requiring full sequence pooling, a classification model architecture can be built based on a causal language model. The original language model output head can be removed and replaced with a multi-classification head component to support multi-label classification tasks. At the pooling level, this classification model can integrate a distributed pooler component, which can automatically select the appropriate pooling strategy based on the task type, supporting both encoding and embedding tasks. It will be understood that the system can default to using full sequence pooling as the pooling strategy to obtain the complete hidden states for subsequent feature extraction and execution of corresponding tasks.

[0061] According to some embodiments of this disclosure, extracting a feature vector from a complete hidden state for classification may include: obtaining the positions of a start marker and an end marker in a second token sequence for the input text, wherein the start marker and the end marker are located within the input text interval; extracting a third hidden state corresponding to the start marker and a fourth hidden state corresponding to the end marker from the complete hidden state; and calculating the difference between the third hidden state and the fourth hidden state as a feature vector.

[0062] Specifically, the system extracts the hidden state corresponding to the last token from the complete hidden state as the pooling representation, and also extracts the span representation between the start and end markers. The pooling representation and the span representation are then concatenated and input into the classification head. In the classification output stage, the classification head computes on the input features and outputs the logits corresponding to each classification task. These logits are normalized using softmax to obtain a probability distribution, and finally the classification result is returned.
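
The classification stage described above can be sketched as follows. This is a toy illustration only: the classification head is assumed to be a single linear layer, and the weights, bias, and function names are made up for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(complete_hidden, span_repr, weights, bias):
    """Last-token pooling, concatenation with the span representation, linear head."""
    pooled = complete_hidden[-1]          # hidden state of the last token
    features = pooled + span_repr         # list concatenation of the two representations
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


complete_hidden = [[0.0, 0.0], [1.0, 2.0]]   # zero-filled position, then a computed one
span_repr = [1.0, 0.0]                       # hypothetical span representation
weights = [[1.0, 0.0, 0.0, 0.0],             # hypothetical 2-class linear head
           [0.0, 1.0, 0.0, 0.0]]
bias = [0.0, 0.0]

probs = classify(complete_hidden, span_repr, weights, bias)
predicted_class = probs.index(max(probs))
```

Because the last token lies in the input text portion, the pooling representation is taken from a computed hidden state rather than from the zero-value padding.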

[0063] As described above, the complete hidden state includes the zero-value element padding corresponding to the predefined instructions and the effective hidden state obtained by performing calculations on the input text. By obtaining the positions of the start and end markers for the input text, the effective range of the input text can be accurately located, eliminating the interference of the zero-value element padding (predefined instructions) and extracting feature vectors that reflect the core semantics of the input text. Compared with traditional methods such as global pooling, this extraction method can focus on the core information of the input text, avoid interference from irrelevant information, and significantly improve the accuracy of classification results, thereby solving the problems of fuzzy feature extraction and classification bias in existing technologies.

[0064] As an example, extracting classification features using a span representation method, in which the difference between the hidden states at the start and end marker positions is calculated to characterize the semantic information of the target span, can be achieved as follows:
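
A minimal Python sketch of this span representation is given below for illustration; it is not the listing of this disclosure. The sign convention (start-marker state minus end-marker state), the function name `span_feature`, and the numeric values are assumptions.

```python
def span_feature(complete_hidden, start_pos, end_pos, instr_len):
    """Difference of the hidden states at the start and end marker positions.

    Positions index into the complete token sequence; both markers must lie
    inside the input text interval, i.e. after the zero-filled instruction part.
    """
    if not (instr_len <= start_pos <= end_pos < len(complete_hidden)):
        raise ValueError("markers must lie within the input text interval")
    third_hidden = complete_hidden[start_pos]    # state at the start marker
    fourth_hidden = complete_hidden[end_pos]     # state at the end marker
    # Assumed convention: third hidden state minus fourth hidden state.
    return [a - b for a, b in zip(third_hidden, fourth_hidden)]


complete_hidden = [
    [0.0, 0.0],   # predefined instruction (zero-filled)
    [0.0, 0.0],   # predefined instruction (zero-filled)
    [1.0, 2.0],   # start marker
    [3.0, 5.0],   # input text token
    [4.0, 4.0],   # end marker
]

feature = span_feature(complete_hidden, start_pos=2, end_pos=4, instr_len=2)
```

The range check rejects marker positions that fall inside the zero-filled region, since a difference taken there would carry no information about the input text.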

[0065] Figure 5 shows a flowchart of a method 500 for optimizing a prefix cache according to other embodiments of the present disclosure. As shown in Figure 5, user input is first received, which may include a predefined instruction and input text. The predefined instruction and the input text are then concatenated to obtain a complete prompt sequence. Next, the prompt sequence is tokenized to obtain a corresponding token sequence. The token sequence is then divided into multiple blocks based on a fixed block size, and a hash value is calculated for each block to obtain a hash value sequence. Next, the hash value sequence is compared block by block with the hash value sequence corresponding to the pre-computed instruction. In response to determining that the predefined instruction matches a stored pre-computed instruction, the key-value cache corresponding to the pre-computed instruction is loaded and partial pre-filling is performed. In response to determining that the predefined instruction does not match a stored pre-computed instruction, an empty cache is returned and the complete hidden state for the predefined instruction and the input text is calculated. If the predefined instruction matches a stored pre-computed instruction, zero-value elements are used to fill in the hidden state that was not calculated due to reuse of the key-value cache corresponding to the pre-computed instruction, to obtain the complete hidden state. Finally, feature extraction and classification head calculation are performed based on the complete hidden state to obtain the final calculation result.

[0066] It will be understood that the implementation scheme for these steps can refer to the implementation scheme described above for the method 100 for optimizing the prefix cache, so it will not be repeated here.
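
For orientation, the hit and miss branches of method 500 can be sketched end to end in Python. Everything here is a toy stand-in under stated assumptions: `toy_forward` replaces the real model, Python's built-in `hash` on tuples of integers replaces the real block hash, tokenization is skipped (inputs are already token IDs), and the instruction length is a multiple of the block size.

```python
def block_hashes(tokens, block_size):
    """Chained hashes over complete blocks (built-in hash used as a stand-in)."""
    hashes, prev = [], 0
    for i in range(len(tokens) // block_size):
        block = tuple(tokens[i * block_size:(i + 1) * block_size])
        prev = hash((prev, block))
        hashes.append(prev)
    return hashes


def toy_forward(tokens, hidden_size=2):
    """Stand-in for the model: one vector per token, filled with the token value."""
    return [[float(t)] * hidden_size for t in tokens]


def run_request(instr_tokens, text_tokens, stored_hashes, block_size=2, hidden_size=2):
    tokens = instr_tokens + text_tokens                  # concatenated token sequence
    request_hashes = block_hashes(tokens, block_size)
    sub = request_hashes[:len(instr_tokens) // block_size]
    hit = bool(sub) and sub == stored_hashes[:len(sub)]  # block-by-block comparison
    if hit:
        # Partial pre-filling: only the input text is computed.
        second = toy_forward(text_tokens, hidden_size)
        first = [[0.0] * hidden_size for _ in range(len(tokens) - len(second))]
        return hit, first + second
    # Miss: empty cache, so the whole sequence is computed.
    return hit, toy_forward(tokens, hidden_size)


stored_hashes = block_hashes([1, 2, 3, 4], block_size=2)
hit, hidden_hit = run_request([1, 2, 3, 4], [5, 6], stored_hashes)
missed, hidden_miss = run_request([1, 2, 9, 4], [5, 6], stored_hashes)
```

In both branches the returned hidden state has one row per token of the complete prompt, which is the property that full sequence pooling relies on.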

[0067] Figure 6 shows a structural block diagram of an apparatus 600 for optimizing a prefix cache according to an embodiment of the present disclosure. As shown in Figure 6, the apparatus 600 may include: a user input receiving unit 610 configured to receive user input including a predefined instruction and input text; a determining unit 620 configured to determine whether the predefined instruction matches a stored pre-computed instruction; and a calculating unit 630 configured to perform the following operations in response to determining that the predefined instruction matches the stored pre-computed instruction: reuse a key-value cache corresponding to the pre-computed instruction; not calculate a first hidden state for the predefined instruction, and use a zero-value element as the first hidden state for the predefined instruction; calculate a second hidden state for the input text; and determine the complete hidden state for the predefined instruction and the input text based on the first hidden state and the second hidden state.

[0068] By matching and validating only the instruction portion of the user input, the granularity of the cache can be refined from the overall prefix to the instruction-level prefix. This avoids the erroneous caching and reuse problems caused by length-based matching in existing technologies, ensuring the accuracy of cache reuse. Furthermore, by filling the hidden state of predefined instructions with zero-value elements, not only can the mechanism for reusing key-value caches be maintained in scenarios such as full sequence pooling, but the complete hidden state required for such scenarios can also be obtained for subsequent computations. This avoids a large amount of redundant computation on reusable key-value cache portions, significantly saves computing power, and improves inference speed. Simultaneously, it achieves compatibility between prefix caching and scenarios such as full sequence pooling while ensuring the correctness of inference results.

[0069] It should be understood that each unit of the apparatus 600 shown in Figure 6 may correspond to steps S110-S130 in the method 100 described with reference to Figure 1. Therefore, the operations, features, and advantages described above for method 100 also apply to apparatus 600 and its constituent units. For the sake of brevity, some operations, features, and advantages will not be repeated here.

[0070] Figure 7 shows a structural block diagram of an apparatus 700 for optimizing a prefix cache according to other embodiments of the present disclosure. As shown in Figure 7, the apparatus 700 may include an instruction preprocessing module 710, a cache coordinator module 720, a model execution module 730, and a pooling processing module 740.

[0071] The instruction preprocessing module 710 can be configured to read the instruction text content of the pre-computed instruction; convert the instruction text content into a token sequence; divide the token sequence into multiple blocks based on a fixed block size; and calculate the corresponding hash value of each block in the multiple blocks to form a hash value sequence.

[0072] The cache coordinator module 720 can be configured to extract the hash value sequence of the requested block, compare the hash value sequence with the hash value sequence corresponding to the pre-computed instruction block by block, and then, in response to determining that the predefined instruction matches the stored pre-computed instruction, load the key-value cache corresponding to the pre-computed instruction; and in response to determining that the predefined instruction does not match the stored pre-computed instruction, return an empty cache.

[0073] The model execution module 730 can be configured to perform partial pre-filling and obtain the corresponding local hidden state in response to determining that a predefined instruction matches a stored pre-computed instruction (i.e., a hit); and to compute the full hidden state for the predefined instruction and the input text in response to determining that a predefined instruction does not match a stored pre-computed instruction (i.e., a miss).

[0074] The pooling processing module 740 can be configured to, in a full sequence pooling operation: in response to determining that partial pre-filling was performed in the model execution module 730, use zero-value elements to fill in the hidden states that were not computed due to reuse of the key-value cache corresponding to the pre-computed instruction, to obtain the complete hidden state; and in response to determining that partial pre-filling was not performed in the model execution module 730, directly obtain the complete hidden state for the predefined instruction and the input text. Feature extraction and classification head computation are then performed based on the complete hidden state to obtain the final computation result.

[0075] It should also be understood that various technologies can be described herein in the general context of software and hardware components or program modules. The various units / modules described above with respect to Figure 6 and Figure 7 may be implemented in hardware or in hardware in combination with software and / or firmware. For example, these units may be implemented as computer program code / instructions configured to execute in one or more processors and stored in a computer-readable storage medium. Alternatively, these units may be implemented as hardware logic / circuits. For example, in some embodiments, one or more of the following components may be implemented together in a System on Chip (SoC): user input receiving unit 610, determining unit 620, calculating unit 630, instruction preprocessing module 710, cache coordinator module 720, model execution module 730, and pooling processing module 740. The SoC may include an integrated circuit chip (which includes a processor (e.g., a Central Processing Unit (CPU), microcontroller, microprocessor, digital signal processor (DSP), etc.), memory, one or more communication interfaces, and / or other circuitry) and may optionally execute received program code and / or include embedded firmware to perform functions.

[0076] According to another aspect of this disclosure, an electronic device is also provided, comprising: at least one processor; and at least one memory communicatively connected to the at least one processor; wherein the at least one memory stores a computer program that, when executed by the at least one processor, implements the above-described method for optimizing prefix caching.

[0077] According to another aspect of this disclosure, a non-transitory computer-readable storage medium storing a computer program is also provided, wherein the computer program, when executed by a processor, implements the above-described method for optimizing prefix caching.

[0078] According to another aspect of this disclosure, a computer program product is also provided, comprising a computer program, wherein the computer program, when executed by a processor, implements the above-described method for optimizing prefix caching.

[0079] Referring to Figure 8, a structural block diagram of an exemplary electronic device 800 that can be used to implement embodiments of the present disclosure will now be described; it is an example of a hardware device that can be applied to various aspects of the present disclosure. The electronic device can be different types of computer devices, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device can also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementation of the present disclosure described and / or claimed herein.

[0080] As shown in Figure 8, the electronic device 800 may include at least one processor 810, a working memory 820, an input unit 840, a display unit 850, a speaker 860, a storage unit 870, a communication unit 880, and other output units 890, which are capable of communicating with each other via a system bus 830.

[0081] Processor 810 may be a single processing unit or multiple processing units, and all processing units may include single or multiple computing units or multiple cores. Processor 810 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuits, and / or any device that manipulates signals based on operating instructions. Processor 810 may be configured to acquire and execute computer-readable instructions stored in working memory 820, storage unit 870, or other computer-readable media, such as program code of operating system 820a, program code of application program 820b, etc.

[0082] Working memory 820 and storage unit 870 are examples of computer-readable storage media for storing instructions that are executed by processor 810 to perform the various functions described above. Working memory 820 may include both volatile and non-volatile memory (e.g., RAM, ROM, etc.). Furthermore, storage unit 870 may include hard disk drives, solid-state drives, removable media including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CDs, DVDs), storage arrays, network-attached storage, storage area networks, etc. Working memory 820 and storage unit 870 may be collectively referred to herein as memory or computer-readable storage media, and may be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer program code that can be executed by processor 810 as a specific machine configured to perform the operations and functions described in the examples herein.

[0083] The input unit 840 can be any type of device capable of inputting information to the electronic device 800. The input unit 840 can receive input numerical or character information and generate key signal inputs related to user settings and / or function control of the electronic device. It can include, but is not limited to, a mouse, keyboard, touchscreen, trackpad, trackball, joystick, microphone, and / or remote control. The output unit can be any type of device capable of presenting information and can include, but is not limited to, a display unit 850, a speaker 860, and other output units 890. Other output units 890 can include, but are not limited to, video / audio output terminals, vibrators, and / or printers. The communication unit 880 allows the electronic device 800 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks. It can include, but is not limited to, a modem, network card, infrared communication device, wireless communication transceiver, and / or chipset, such as Bluetooth™ devices, 802.11 devices, Wi-Fi devices, WiMAX devices, cellular communication devices, and / or the like.

[0084] The application program 820b in working memory 820 can be loaded to execute the various methods and processes described above, for example, steps S110-S130 in Figure 1. For example, in some embodiments, methods 100 and 500 described above may be implemented as computer software programs tangibly contained in a machine-readable medium, such as storage unit 870. In some embodiments, part or all of the computer program may be loaded and / or installed on electronic device 800 via storage unit 870 and / or communication unit 880. When the computer program is loaded and executed by processor 810, one or more steps of methods 100 and 500 described above may be performed. Alternatively, in other embodiments, processor 810 may be configured to perform methods 100 and 500 by any other suitable means (e.g., by means of firmware).

[0085] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0086] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0087] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0088] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0089] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.

[0090] Computer systems can include clients and servers. Clients and servers are generally located far apart and typically interact through communication networks. Client-server relationships are created by computer programs running on the respective computers and having a client-server relationship with each other.

[0091] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure can be performed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in this disclosure can be achieved, and this is not limited herein.

[0092] While embodiments or examples of this disclosure have been described with reference to the accompanying drawings, it should be understood that the methods, systems, and devices described above are merely exemplary embodiments or examples, and the scope of the invention is not limited by these embodiments or examples, but only by the granted claims and their equivalents. Various elements in the embodiments or examples may be omitted or replaced by their equivalents. Furthermore, the steps may be performed in a different order than that described in this disclosure. Further, various elements in the embodiments or examples may be combined in various ways. Importantly, as the technology evolves, many elements described herein can be replaced by equivalents that appear after this disclosure.

Claims

1. A method for optimizing prefix caching, comprising: receiving user input including a predefined instruction and input text; determining whether the predefined instruction matches a stored pre-computed instruction; and in response to determining that the predefined instruction matches the stored pre-computed instruction: reusing a key-value cache corresponding to the pre-computed instruction; not calculating a first hidden state for the predefined instruction, and using a zero-value element as the first hidden state for the predefined instruction; calculating a second hidden state for the input text; and determining a complete hidden state for the predefined instruction and the input text based on the first hidden state and the second hidden state.

2. The method according to claim 1, further comprising: reading instruction text content of the pre-computed instruction; converting the instruction text content into a first token sequence; dividing the first token sequence into a first plurality of blocks based on a fixed block size; calculating a corresponding first hash value for each of the first plurality of blocks to form a first hash value sequence; and storing the pre-computed instruction and the first hash value sequence.

3. The method of claim 2, wherein the corresponding first hash value of each of the first plurality of blocks is based on the corresponding first hash value of the adjacent preceding block.

4. The method according to claim 2 or 3, wherein receiving user input including the predefined instruction and the input text comprises: concatenating the predefined instruction and the input text; performing a tokenization operation on the concatenated sequence to obtain a second token sequence; dividing the second token sequence into a second plurality of blocks based on the fixed block size; and calculating a corresponding second hash value for each of the second plurality of blocks to form a second hash value sequence.

5. The method of claim 4, wherein determining whether the predefined instruction matches the stored pre-computed instruction comprises: extracting, from the second hash value sequence, a second hash value subsequence corresponding to the predefined instruction, wherein the second hash value subsequence corresponds to a third plurality of blocks; comparing the second hash value subsequence with the first hash value sequence block by block; in response to determining that, for each of the third plurality of blocks, the second hash value corresponding to that block in the second hash value subsequence is the same as the corresponding first hash value in the first hash value sequence, determining that the predefined instruction matches the stored pre-computed instruction; and in response to determining that, for any block among the third plurality of blocks, the second hash value corresponding to that block in the second hash value subsequence is different from the corresponding first hash value in the first hash value sequence, determining that the predefined instruction does not match the stored pre-computed instruction.

6. The method of claim 5, further comprising: in response to determining that the predefined instruction does not match the stored pre-computed instruction, returning an empty cache and computing the complete hidden state; and in response to determining that the predefined instruction matches the stored pre-computed instruction, returning the key-value cache corresponding to the pre-computed instruction for reuse.

7. The method of claim 6, further comprising: determining a first length of the key-value cache that corresponds to the pre-computed instruction and is returned in response to determining that the predefined instruction matches the stored pre-computed instruction; and in response to determining that the first length exceeds a preset threshold, truncating the returned key-value cache from the first length to a second length, wherein the second length is the length of a second token subsequence, within the second token sequence, that corresponds to the predefined instruction.

8. The method according to claim 7, wherein the preset threshold is the minimum of the following: the length of the second hash value subsequence corresponding to the predefined instruction; the length of the second token subsequence corresponding to the predefined instruction; and the length of the first token sequence corresponding to the pre-computed instruction.
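
The length check of claims 7 and 8 might look like the sketch below. The function name and parameter names are hypothetical. The claims do not state the units of the three lengths, so the sketch treats them as plain integers supplied by the caller. It also caps the result at the original cache length so that the operation never lengthens the cache, which is an added guard rather than something the claims require.

```python
def truncated_cache_length(cache_len: int,
                           instr_hash_len: int,
                           instr_token_len: int,
                           stored_token_len: int) -> int:
    """Return the usable key-value cache length after the threshold check."""
    # Claim 8: the threshold is the minimum of the three lengths.
    threshold = min(instr_hash_len, instr_token_len, stored_token_len)
    if cache_len > threshold:
        # Claim 7: truncate to the instruction's token length.
        # Assumption: never report more entries than the cache holds.
        return min(cache_len, instr_token_len)
    return cache_len  # within the threshold, reuse the cache as returned
```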

9. The method of claim 4, wherein using a zero-value element as the first hidden state for the predefined instruction comprises: obtaining a third length of the second token sequence and a fourth length of a second token subsequence of the second token sequence that is involved in calculating the second hidden state; determining, based on the third length and the fourth length, a fifth length of the zero-value element to be used; and using the zero-value element of the fifth length as the first hidden state for the predefined instruction.

10. The method of claim 9, wherein determining the complete hidden state based on the first hidden state and the second hidden state comprises: concatenating the first hidden state and the second hidden state to obtain the complete hidden state.
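
The zero filling and concatenation of claims 9 and 10 can be illustrated with plain Python lists. The function name `complete_hidden_state` is hypothetical, and the sketch assumes that the fifth length is the third length minus the fourth length and that the zero rows come first because the instruction precedes the input text; the claims only say that the fifth length is determined from the other two.

```python
from typing import List


def complete_hidden_state(computed: List[List[float]],
                          total_len: int) -> List[List[float]]:
    """Prepend zero rows for the cached instruction tokens so the result
    has one hidden-state row per token of the full sequence."""
    computed_len = len(computed)  # "fourth length": tokens actually computed
    if computed_len == 0 or computed_len > total_len:
        raise ValueError("computed rows must be non-empty and fit in total_len")
    hidden_dim = len(computed[0])
    # Assumption: "fifth length" = "third length" - "fourth length".
    pad_len = total_len - computed_len
    zeros = [[0.0] * hidden_dim for _ in range(pad_len)]  # first hidden state
    return zeros + computed  # concatenation gives the complete hidden state
```

Keeping the output at the full sequence length means downstream code that indexes by absolute token position does not need to know whether the prefix was served from the cache.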

11. The method according to claim 4, further comprising: extracting a feature vector from the complete hidden state for classification.

12. The method of claim 11, wherein extracting the feature vector from the complete hidden state for classification comprises: obtaining the positions, in the second token sequence, of a start marker and an end marker for the input text, wherein the start marker and the end marker are located within the input text interval; extracting, from the complete hidden state, a third hidden state corresponding to the start marker and a fourth hidden state corresponding to the end marker; and calculating the difference between the third hidden state and the fourth hidden state as the feature vector.
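
A sketch of the feature extraction in claim 12 follows. The function name `marker_feature` and the parameter `instruction_len` are hypothetical, and the subtraction order (start marker minus end marker) is an assumption, since the claim only speaks of "the difference". The position check reflects the claim's condition that both markers lie inside the input text interval, which suggests that the zero-filled instruction rows are not read at this step.

```python
from typing import List


def marker_feature(hidden: List[List[float]],
                   start_pos: int,
                   end_pos: int,
                   instruction_len: int) -> List[float]:
    """Return the element-wise difference between the hidden states at the
    start and end marker positions of the input text."""
    for pos in (start_pos, end_pos):
        # Both markers must fall in the input text, after the instruction.
        if pos < instruction_len or pos >= len(hidden):
            raise ValueError("marker position outside the input text interval")
    start_state = hidden[start_pos]  # "third hidden state"
    end_state = hidden[end_pos]      # "fourth hidden state"
    # Assumption: difference is taken as start minus end.
    return [s - e for s, e in zip(start_state, end_state)]
```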

13. A system for optimizing prefix caching, comprising: a user input receiving unit configured to receive user input including a predefined instruction and input text; a determining unit configured to determine whether the predefined instruction matches a stored pre-computed instruction; and a computing unit configured to perform the following operations in response to determining that the predefined instruction matches the stored pre-computed instruction: reusing a key-value cache corresponding to the pre-computed instruction; not calculating a first hidden state for the predefined instruction, and using a zero-value element as the first hidden state for the predefined instruction; calculating a second hidden state for the input text; and determining a complete hidden state for the predefined instruction and the input text based on the first hidden state and the second hidden state.

14. An electronic device comprising: at least one processor; and at least one memory communicatively connected to the at least one processor, wherein the at least one memory stores a computer program that, when executed by the at least one processor, implements the method as described in any one of claims 1-12.

15. A non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method as described in any one of claims 1-12.

16. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method as described in any one of claims 1-12.